Linux-HyperV List


* Re: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Jason Gunthorpe @ 2023-06-12 18:16 UTC
  To: Jakub Kicinski
  Cc: Leon Romanovsky, Wei Hu, netdev@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-rdma@vger.kernel.org, Long Li,
	Ajay Sharma, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, davem@davemloft.net, edumazet@google.com,
	pabeni@redhat.com, vkuznets@redhat.com,
	ssengar@linux.microsoft.com, shradhagupta@linux.microsoft.com
In-Reply-To: <20230612102221.2ca726fd@kernel.org>

On Mon, Jun 12, 2023 at 10:22:21AM -0700, Jakub Kicinski wrote:
> On Mon, 12 Jun 2023 09:13:49 +0300 Leon Romanovsky wrote:
> > > Thanks for your comment. I am new to the process. I have a few
> > > questions regarding this and hope you can help. First of all,
> > > the patch is mostly for IB. Is it possible for the patch to just go
> > > through the RDMA branch, since most of the changes are in RDMA?
> > 
> > Yes, it can, we (RDMA) will handle it.
> 
> Probably, although it's better to teach them some process sooner
> rather than later?

I've been of the opinion the shared branch process is difficult - we
took a long time to fine tune the process. If you don't fully
understand how to do this with git you can make a real mess of it.

So I would say MS is welcome to use it if they can do it right, but I
wouldn't push them to do so, or expect that they must in order to be
successful. Really only Mellanox and Intel seem to have enough churn
to justify it right now.

If they don't use shared branches then they must be responsible to
avoid conflicts, even if that means they have to delay sending patches
for a cycle.

Jason


* Re: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Jakub Kicinski @ 2023-06-12 17:22 UTC
  To: Leon Romanovsky
  Cc: Wei Hu, netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-rdma@vger.kernel.org, Long Li, Ajay Sharma, jgg@ziepe.ca,
	KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	davem@davemloft.net, edumazet@google.com, pabeni@redhat.com,
	vkuznets@redhat.com, ssengar@linux.microsoft.com,
	shradhagupta@linux.microsoft.com
In-Reply-To: <20230612061349.GM12152@unreal>

On Mon, 12 Jun 2023 09:13:49 +0300 Leon Romanovsky wrote:
> > Thanks for your comment. I am new to the process. I have a few
> > questions regarding this and hope you can help. First of all,
> > the patch is mostly for IB. Is it possible for the patch to just go
> > through the RDMA branch, since most of the changes are in RDMA?
> 
> Yes, it can, we (RDMA) will handle it.

Probably, although it's better to teach them some process sooner
rather than later?


* Re: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Jakub Kicinski @ 2023-06-12 17:21 UTC
  To: Wei Hu
  Cc: netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-rdma@vger.kernel.org, Long Li, Ajay Sharma, jgg@ziepe.ca,
	leon@kernel.org, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, davem@davemloft.net, edumazet@google.com,
	pabeni@redhat.com, vkuznets@redhat.com,
	ssengar@linux.microsoft.com, shradhagupta@linux.microsoft.com
In-Reply-To: <SI2P153MB0441DAC4E756A1991A03520FBB54A@SI2P153MB0441.APCP153.PROD.OUTLOOK.COM>

On Mon, 12 Jun 2023 04:44:44 +0000 Wei Hu wrote:
> If the patch also needs to go through the NETDEV branch, does it mean the two subsystems will each pull their own part? A few follow-up questions about generating a PR, since I have never done this before.
> 
> 1. Which repo should I clone and create the branch from?

The main tree of Linus Torvalds. Check which tags are present in both
netdev and rdma trees and use the newest common tag between the trees
as a base.

> 2. From the example you provided, I see these people have their own branches on kernel.org, for example something like:
> git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2023-06-06. 
> I am not a Linux maintainer. I just have a repo on GitHub. How do I
> create or fork one on kernel.org? Do I need an account to do so? Or can
> I use my own repo on GitHub?

GitHub is fine.

> 3. How do I create the PR in this case? Should I follow this link:
> https://docs.kernel.org/maintainer/pull-requests.html?

Sort of. But still post the patches, so you'd want to use these
commands somewhere along the way:

git format-patch [...] -o $path --cover-letter
git request-pull [...] >> $path/0000-cover-letter.patch


* RE: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Wei Hu @ 2023-06-12 14:44 UTC
  To: Leon Romanovsky
  Cc: netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-rdma@vger.kernel.org, Long Li, Ajay Sharma, jgg@ziepe.ca,
	KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, vkuznets@redhat.com,
	ssengar@linux.microsoft.com, shradhagupta@linux.microsoft.com
In-Reply-To: <20230611181857.GK12152@unreal>

> -----Original Message-----
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Monday, June 12, 2023 2:19 AM
> 
> > +
> > +void mana_ib_cq_handler(void *ctx, struct gdma_queue *gdma_cq) {
> > +	struct mana_ib_cq *cq = ctx;
> > +	struct ib_device *ibdev = cq->ibcq.device;
> > +
> > +	ibdev_dbg(ibdev, "Enter %s %d\n", __func__, __LINE__);
> 
> This patch has too many debug prints, most if not all should go.
> 
Thanks. I will remove the debug prints in the normal path. 

Wei


* Re: [PATCH RFC net-next v4 7/8] vsock: Add lockless sendmsg() support
From: Simon Horman @ 2023-06-12  9:53 UTC
  To: Bobby Eshleman
  Cc: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers, Dan Carpenter, Krasnov Arseniy, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv, bpf
In-Reply-To: <20230413-b4-vsock-dgram-v4-7-0cebbb2ae899@bytedance.com>

On Sat, Jun 10, 2023 at 12:58:34AM +0000, Bobby Eshleman wrote:
> Because the dgram sendmsg() path for AF_VSOCK acquires the socket lock
> it does not scale when many senders share a socket.
> 
> Prior to this patch the socket lock is used to protect both reads and
> writes to the local_addr, remote_addr, transport, and buffer size
> variables of a vsock socket. What follows are the new protection schemes
> for these fields that ensure a race-free and usually lock-free
> multi-sender sendmsg() path for vsock dgrams.
> 
> - local_addr
> local_addr changes as a result of binding a socket. The write path
> for local_addr is bind() and various vsock_auto_bind() call sites.
> After a socket has been bound via vsock_auto_bind() or bind(), subsequent
> calls to bind()/vsock_auto_bind() do not write to local_addr again. bind()
> rejects the user request and vsock_auto_bind() early exits.
> Therefore, the local addr can not change while a parallel thread is
> in sendmsg() and lock-free reads of local addr in sendmsg() are safe.
> Change: only acquire lock for auto-binding as-needed in sendmsg().
> 
> - buffer size variables
> Not used by dgram, so they do not need protection. No change.
> 
> - remote_addr and transport
> Because a remote_addr update may result in a changed transport, but we
> would like to be able to read these two fields lock-free but coherently
> in the vsock send path, this patch packages these two fields into a new
> struct vsock_remote_info that is referenced by an RCU-protected pointer.
> 
> Writes are synchronized as usual by the socket lock. Reads only take
> place in RCU read-side critical sections. When remote_addr or transport
> is updated, a new remote info is allocated. Old readers still see the
> old coherent remote_addr/transport pair, and new readers will refer to
> the new coherent pair. The coherency between remote_addr and transport
> previously provided by the socket lock alone is now also preserved by
> RCU, except with the highly-scalable lock-free read-side.
> 
> Helpers are introduced for accessing and updating the new pointer.
> 
> The new structure contains an rcu_head so that kfree_rcu() can be
> used. This removes the need for writers to call synchronize_rcu()
> before freeing old structures, which is simply more efficient and
> reduces code churn where remote_addr/transport are already being
> updated inside RCU read-side sections.
> 
> Only virtio has been tested, but updates were necessary to the VMCI and
> hyperv code. Unfortunately the author does not have access to
> VMCI/hyperv systems so those changes are untested.
> 
> Perf Tests (results from patch v2)
> vCPUS: 16
> Threads: 16
> Payload: 4KB
> Test Runs: 5
> Type: SOCK_DGRAM
> 
> Before: 245.2 MB/s
> After: 509.2 MB/s (+107%)
> 
> Notably, on the same test system, vsock dgram even outperforms
> multi-threaded UDP over virtio-net with vhost and MQ support enabled.
> 
> Throughput metrics for single-threaded SOCK_DGRAM and
> single/multi-threaded SOCK_STREAM showed no statistically signficant

Hi Bobby,

a minor nit from checkpatch --codespell: signficant -> significant

> throughput changes (lowest p-value reaching 0.27), with the mean
> difference ranging from -5% to +1%.
> 
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>

...
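
The remote_addr/transport scheme described in the quoted commit message can be pictured with a small userspace sketch. This is not the patch's code and it is not RCU: it only shows the "publish a new immutable snapshot through one pointer" half of the idea using C11 atomics. The names struct remote_info, remote_read() and remote_update() are hypothetical, transport_id stands in for the transport pointer, and the sketch assumes writers are serialized by the caller (the role the socket lock plays in the patch).

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the patch's struct vsock_remote_info. */
struct remote_info {
	unsigned int cid;
	unsigned int port;
	int transport_id;	/* stands in for the transport pointer */
};

/* Single published pointer; readers never see a half-updated pair. */
static _Atomic(struct remote_info *) remote_ptr;

/* Reader side: one acquire load returns a snapshot that is never
 * modified after publication, so cid/port/transport_id stay coherent.
 */
static const struct remote_info *remote_read(void)
{
	return atomic_load_explicit(&remote_ptr, memory_order_acquire);
}

/* Writer side (assumed to be serialized by the caller): allocate a new
 * snapshot, publish it, and hand the old one back through oldp.
 * The caller may free the old snapshot only once no reader can still
 * be using it; in the kernel that deferral is what kfree_rcu() provides.
 * Returns 0 on success, -1 if allocation fails.
 */
static int remote_update(unsigned int cid, unsigned int port,
			 int transport_id, struct remote_info **oldp)
{
	struct remote_info *new = malloc(sizeof(*new));

	if (!new)
		return -1;

	new->cid = cid;
	new->port = port;
	new->transport_id = transport_id;

	*oldp = atomic_exchange_explicit(&remote_ptr, new,
					 memory_order_acq_rel);
	return 0;
}
```

A reader that obtained a snapshot before remote_update() keeps seeing the old cid/port/transport_id together, while later readers see the new set. The sketch deliberately leaves reclamation to the caller, which is the part real RCU solves.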


* Re: [PATCH RFC net-next v4 4/8] vsock: make vsock bind reusable
From: Simon Horman @ 2023-06-12  9:49 UTC
  To: Bobby Eshleman
  Cc: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers, Dan Carpenter, Krasnov Arseniy, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv, bpf
In-Reply-To: <20230413-b4-vsock-dgram-v4-4-0cebbb2ae899@bytedance.com>

On Sat, Jun 10, 2023 at 12:58:31AM +0000, Bobby Eshleman wrote:
> This commit makes the bind table management functions in vsock usable
> for different bind tables. For use by datagrams in a future patch.
> 
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> ---
>  net/vmw_vsock/af_vsock.c | 33 ++++++++++++++++++++++++++-------
>  1 file changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index ef86765f3765..7a3ca4270446 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -230,11 +230,12 @@ static void __vsock_remove_connected(struct vsock_sock *vsk)
>  	sock_put(&vsk->sk);
>  }
>  
> -static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
> +struct sock *vsock_find_bound_socket_common(struct sockaddr_vm *addr,
> +					    struct list_head *bind_table)

Hi Bobby,

This function seems to only be used in this file.
Should it be static?
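
As a rough illustration of the point, here is a userspace sketch of a lookup helper that takes the table as a parameter while keeping internal linkage. All names (struct bound_sock, find_bound_common(), the two tables) are hypothetical and the list handling is simplified; the real code uses struct list_head and hashed bind tables.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a bound vsock socket. */
struct bound_sock {
	unsigned int cid;
	unsigned int port;
	struct bound_sock *next;
};

/* Two independent bind tables, private to this file. */
static struct bound_sock *stream_table;
static struct bound_sock *dgram_table;

/* Common helpers are static: they are only used in this translation
 * unit, so they get internal linkage and no prototype in a header.
 */
static struct bound_sock *find_bound_common(unsigned int cid,
					    unsigned int port,
					    struct bound_sock *table)
{
	struct bound_sock *s;

	for (s = table; s; s = s->next)
		if (s->cid == cid && s->port == port)
			return s;
	return NULL;
}

static void insert_bound_common(struct bound_sock *s,
				struct bound_sock **table)
{
	s->next = *table;
	*table = s;
}

/* Thin per-table wrappers are the only externally visible entry points. */
struct bound_sock *find_bound_stream(unsigned int cid, unsigned int port)
{
	return find_bound_common(cid, port, stream_table);
}

struct bound_sock *find_bound_dgram(unsigned int cid, unsigned int port)
{
	return find_bound_common(cid, port, dgram_table);
}

void bind_stream(struct bound_sock *s)
{
	insert_bound_common(s, &stream_table);
}

void bind_dgram(struct bound_sock *s)
{
	insert_bound_common(s, &dgram_table);
}
```

With this layout a socket bound through bind_dgram() is found by find_bound_dgram() but not by find_bound_stream(), and the shared helper needs to be made non-static only if another file starts calling it.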


* Re: [PATCH net-next,V2] net: mana: Add support for vlan tagging
From: patchwork-bot+netdevbpf @ 2023-06-12  8:40 UTC
  To: Haiyang Zhang
  Cc: linux-hyperv, netdev, decui, kys, paulros, olaf, vkuznets, davem,
	wei.liu, edumazet, kuba, pabeni, leon, longli, ssengar,
	linux-rdma, daniel, john.fastabend, bpf, ast, sharmaajay, hawk,
	tglx, shradhagupta, linux-kernel
In-Reply-To: <1686314837-14042-1-git-send-email-haiyangz@microsoft.com>

Hello:

This patch was applied to netdev/net-next.git (main)
by David S. Miller <davem@davemloft.net>:

On Fri,  9 Jun 2023 05:47:17 -0700 you wrote:
> To support vlan, use MANA_LONG_PKT_FMT if vlan tag is present in TX
> skb. Then extract the vlan tag from the skb struct, and save it to
> tx_oob for the NIC to transmit. For vlan tags on the payload, they
> are accepted by the NIC too.
> 
> For RX, extract the vlan tag from CQE and put it into skb.
> 
> [...]

Here is the summary with links:
  - [net-next,V2] net: mana: Add support for vlan tagging
    https://git.kernel.org/netdev/net-next/c/b803d1fded40

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Leon Romanovsky @ 2023-06-12  6:13 UTC
  To: Wei Hu
  Cc: Jakub Kicinski, netdev@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-rdma@vger.kernel.org, Long Li,
	Ajay Sharma, jgg@ziepe.ca, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Dexuan Cui, davem@davemloft.net,
	edumazet@google.com, pabeni@redhat.com, vkuznets@redhat.com,
	ssengar@linux.microsoft.com, shradhagupta@linux.microsoft.com
In-Reply-To: <SI2P153MB0441DAC4E756A1991A03520FBB54A@SI2P153MB0441.APCP153.PROD.OUTLOOK.COM>

On Mon, Jun 12, 2023 at 04:44:44AM +0000, Wei Hu wrote:
> Hi Jakub,
> 
> > -----Original Message-----
> > From: Jakub Kicinski <kuba@kernel.org>
> > Sent: Thursday, June 8, 2023 12:39 PM
> > To: Wei Hu <weh@microsoft.com>
> > Cc: netdev@vger.kernel.org; linux-hyperv@vger.kernel.org; linux-
> > rdma@vger.kernel.org; Long Li <longli@microsoft.com>; Ajay Sharma
> > <sharmaajay@microsoft.com>; jgg@ziepe.ca; leon@kernel.org; KY
> > Srinivasan <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>;
> > wei.liu@kernel.org; Dexuan Cui <decui@microsoft.com>;
> > davem@davemloft.net; edumazet@google.com; pabeni@redhat.com;
> > vkuznets@redhat.com; ssengar@linux.microsoft.com;
> > shradhagupta@linux.microsoft.com
> > Subject: Re: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to
> > mana ib driver.
> > 
> > On Tue,  6 Jun 2023 15:17:47 +0000 Wei Hu wrote:
> > >  drivers/infiniband/hw/mana/cq.c               |  32 ++++-
> > >  drivers/infiniband/hw/mana/main.c             |  87 ++++++++++++
> > >  drivers/infiniband/hw/mana/mana_ib.h          |   4 +
> > >  drivers/infiniband/hw/mana/qp.c               |  90 +++++++++++-
> > >  .../net/ethernet/microsoft/mana/gdma_main.c   | 131 ++++++++++--------
> > >  drivers/net/ethernet/microsoft/mana/mana_en.c |   1 +
> > >  include/net/mana/gdma.h                       |   9 +-
> > 
> > > IB and netdev are different subsystems; can you put it on a branch and send a
> > > PR as the cover letter so that both subsystems can pull?
> > 
> > Examples:
> > > https://lore.kernel.org/all/20230607210410.88209-1-saeed@kernel.org/
> > > https://lore.kernel.org/all/20230602171302.745492-1-anthony.l.nguyen@intel.com/
> 
> Thanks for your comment. I am new to the process. I have a few questions regarding this and hope you can help. First of all, the patch is mostly for IB. Is it possible for the patch to just go through the RDMA branch, since most of the changes are in RDMA?

Yes, it can, we (RDMA) will handle it.

Thanks


* RE: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Wei Hu @ 2023-06-12  4:44 UTC
  To: Jakub Kicinski
  Cc: netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-rdma@vger.kernel.org, Long Li, Ajay Sharma, jgg@ziepe.ca,
	leon@kernel.org, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, davem@davemloft.net, edumazet@google.com,
	pabeni@redhat.com, vkuznets@redhat.com,
	ssengar@linux.microsoft.com, shradhagupta@linux.microsoft.com
In-Reply-To: <20230607213903.470f71ae@kernel.org>

Hi Jakub,

> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Thursday, June 8, 2023 12:39 PM
> To: Wei Hu <weh@microsoft.com>
> Cc: netdev@vger.kernel.org; linux-hyperv@vger.kernel.org; linux-
> rdma@vger.kernel.org; Long Li <longli@microsoft.com>; Ajay Sharma
> <sharmaajay@microsoft.com>; jgg@ziepe.ca; leon@kernel.org; KY
> Srinivasan <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>;
> wei.liu@kernel.org; Dexuan Cui <decui@microsoft.com>;
> davem@davemloft.net; edumazet@google.com; pabeni@redhat.com;
> vkuznets@redhat.com; ssengar@linux.microsoft.com;
> shradhagupta@linux.microsoft.com
> Subject: Re: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to
> mana ib driver.
> 
> On Tue,  6 Jun 2023 15:17:47 +0000 Wei Hu wrote:
> >  drivers/infiniband/hw/mana/cq.c               |  32 ++++-
> >  drivers/infiniband/hw/mana/main.c             |  87 ++++++++++++
> >  drivers/infiniband/hw/mana/mana_ib.h          |   4 +
> >  drivers/infiniband/hw/mana/qp.c               |  90 +++++++++++-
> >  .../net/ethernet/microsoft/mana/gdma_main.c   | 131 ++++++++++--------
> >  drivers/net/ethernet/microsoft/mana/mana_en.c |   1 +
> >  include/net/mana/gdma.h                       |   9 +-
> 
> IB and netdev are different subsystems; can you put it on a branch and send a
> PR as the cover letter so that both subsystems can pull?
> 
> Examples:
> https://lore.kernel.org/all/20230607210410.88209-1-saeed@kernel.org/
> https://lore.kernel.org/all/20230602171302.745492-1-anthony.l.nguyen@intel.com/

Thanks for your comment. I am new to the process. I have a few questions regarding this and hope you can help. First of all, the patch is mostly for IB. Is it possible for the patch to just go through the RDMA branch, since most of the changes are in RDMA?

If the patch also needs to go through the NETDEV branch, does it mean the two subsystems will each pull their own part? A few follow-up questions about generating a PR, since I have never done this before.

1. Which repo should I clone and create the branch from?

2. From the example you provided, I see these people have their own branches on kernel.org, for example something like:
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2023-06-06. 
I am not a Linux maintainer. I just have a repo on GitHub. How do I create or fork one on kernel.org? Do I need an account to do so? Or can I use my own repo on GitHub?

3. How do I create the PR in this case? Should I follow this link: https://docs.kernel.org/maintainer/pull-requests.html?

Thanks,
Wei


* Re: [PATCH RFC net-next v4 8/8] tests: add vsock dgram tests
From: Arseniy Krasnov @ 2023-06-11 20:54 UTC
  To: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv, bpf, Jiang Wang
In-Reply-To: <20230413-b4-vsock-dgram-v4-8-0cebbb2ae899@bytedance.com>

Hello Bobby!

Sorry, maybe I am becoming a little bit annoying :), but I tried to run vsock_test with
this v4 version, and again got the same crash:

# cat client.sh 
./vsock_test  --mode=client --control-host=192.168.1.1 --control-port=12345 --peer-cid=2
# ./client.sh 
Control socket connected to 192.168.1.1:12345.
0 - SOCK_STREAM connection reset...[   20.065237] BUG: kernel NULL pointer dereference, addre0
[   20.065895] #PF: supervisor read access in kernel mode
[   20.065895] #PF: error_code(0x0000) - not-present page
[   20.065895] PGD 0 P4D 0 
[   20.065895] Oops: 0000 [#1] PREEMPT SMP PTI
[   20.065895] CPU: 0 PID: 111 Comm: vsock_test Not tainted 6.4.0-rc3-gefcccba07069 #385
[   20.065895] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd44
[   20.065895] RIP: 0010:static_key_count+0x0/0x20
[   20.065895] Code: 04 4c 8b 46 08 49 29 c0 4c 01 c8 4c 89 47 08 89 0e 89 56 04 48 89 46 08 f
[   20.065895] RSP: 0018:ffffbbb000223dc0 EFLAGS: 00010202
[   20.065895] RAX: ffffffff85709880 RBX: ffffffffc0079140 RCX: 0000000000000000
[   20.065895] RDX: ffff9f73c2175700 RSI: 0000000000000000 RDI: 0000000000000000
[   20.065895] RBP: ffff9f73c2385900 R08: ffffbbb000223d30 R09: ffff9f73ff896000
[   20.065895] R10: 0000000000001000 R11: 0000000000000000 R12: ffffbbb000223e80
[   20.065895] R13: 0000000000000000 R14: 0000000000000002 R15: ffff9f73c1cfaa80
[   20.065895] FS:  00007f1ad82f55c0(0000) GS:ffff9f73fe400000(0000) knlGS:0000000000000000
[   20.065895] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   20.065895] CR2: 0000000000000000 CR3: 000000003f954000 CR4: 00000000000006f0
[   20.065895] Call Trace:
[   20.065895]  <TASK>
[   20.065895]  once_deferred+0xd/0x30
[   20.065895]  vsock_assign_transport+0x9a/0x1b0 [vsock]
[   20.065895]  vsock_connect+0xb4/0x3a0 [vsock]
[   20.065895]  ? var_wake_function+0x60/0x60
[   20.065895]  __sys_connect+0x9e/0xd0
[   20.065895]  ? _raw_spin_unlock_irq+0xe/0x30
[   20.065895]  ? do_setitimer+0x128/0x1f0
[   20.065895]  ? alarm_setitimer+0x4c/0x90
[   20.065895]  ? fpregs_assert_state_consistent+0x1d/0x50
[   20.065895]  ? exit_to_user_mode_prepare+0x36/0x130
[   20.065895]  __x64_sys_connect+0x11/0x20
[   20.065895]  do_syscall_64+0x3b/0xc0
[   20.065895]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
[   20.065895] RIP: 0033:0x7f1ad822dd13
[   20.065895] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 64 8
[   20.065895] RSP: 002b:00007ffc513e3c98 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
[   20.065895] RAX: ffffffffffffffda RBX: 000055aed298e020 RCX: 00007f1ad822dd13
[   20.065895] RDX: 0000000000000010 RSI: 00007ffc513e3cb0 RDI: 0000000000000004
[   20.065895] RBP: 0000000000000004 R08: 000055aed32b2018 R09: 0000000000000000
[   20.065895] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[   20.065895] R13: 000055aed298acb1 R14: 00007ffc513e3cb0 R15: 00007ffc513e3d40
[   20.065895]  </TASK>
[   20.065895] Modules linked in: vsock_loopback vhost_vsock vmw_vsock_virtio_transport vmw_vb
[   20.065895] CR2: 0000000000000000
[   20.154060] ---[ end trace 0000000000000000 ]---
[   20.155519] RIP: 0010:static_key_count+0x0/0x20
[   20.156932] Code: 04 4c 8b 46 08 49 29 c0 4c 01 c8 4c 89 47 08 89 0e 89 56 04 48 89 46 08 f
[   20.161367] RSP: 0018:ffffbbb000223dc0 EFLAGS: 00010202
[   20.162613] RAX: ffffffff85709880 RBX: ffffffffc0079140 RCX: 0000000000000000
[   20.164262] RDX: ffff9f73c2175700 RSI: 0000000000000000 RDI: 0000000000000000
[   20.165934] RBP: ffff9f73c2385900 R08: ffffbbb000223d30 R09: ffff9f73ff896000
[   20.167684] R10: 0000000000001000 R11: 0000000000000000 R12: ffffbbb000223e80
[   20.169427] R13: 0000000000000000 R14: 0000000000000002 R15: ffff9f73c1cfaa80
[   20.171109] FS:  00007f1ad82f55c0(0000) GS:ffff9f73fe400000(0000) knlGS:0000000000000000
[   20.173000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   20.174381] CR2: 0000000000000000 CR3: 000000003f954000 CR4: 00000000000006f0

So, what HEAD do you use? Maybe you have some specific config (I use x86-64 defconfig + vsock/vhost
related things)?

Thanks, Arseniy


On 10.06.2023 03:58, Bobby Eshleman wrote:
> From: Jiang Wang <jiang.wang@bytedance.com>
> 
> This patch adds tests for vsock datagram.
> 
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> ---
>  tools/testing/vsock/util.c       | 141 ++++++++++++-
>  tools/testing/vsock/util.h       |   6 +
>  tools/testing/vsock/vsock_test.c | 432 +++++++++++++++++++++++++++++++++++++++
>  3 files changed, 578 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
> index 01b636d3039a..811e70d7cf1e 100644
> --- a/tools/testing/vsock/util.c
> +++ b/tools/testing/vsock/util.c
> @@ -99,7 +99,8 @@ static int vsock_connect(unsigned int cid, unsigned int port, int type)
>  	int ret;
>  	int fd;
>  
> -	control_expectln("LISTENING");
> +	if (type != SOCK_DGRAM)
> +		control_expectln("LISTENING");
>  
>  	fd = socket(AF_VSOCK, type, 0);
>  
> @@ -130,6 +131,11 @@ int vsock_seqpacket_connect(unsigned int cid, unsigned int port)
>  	return vsock_connect(cid, port, SOCK_SEQPACKET);
>  }
>  
> +int vsock_dgram_connect(unsigned int cid, unsigned int port)
> +{
> +	return vsock_connect(cid, port, SOCK_DGRAM);
> +}
> +
>  /* Listen on <cid, port> and return the first incoming connection.  The remote
>   * address is stored to clientaddrp.  clientaddrp may be NULL.
>   */
> @@ -211,6 +217,34 @@ int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
>  	return vsock_accept(cid, port, clientaddrp, SOCK_SEQPACKET);
>  }
>  
> +int vsock_dgram_bind(unsigned int cid, unsigned int port)
> +{
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = port,
> +			.svm_cid = cid,
> +		},
> +	};
> +	int fd;
> +
> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> +	if (fd < 0) {
> +		perror("socket");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> +		perror("bind");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	return fd;
> +}
> +
>  /* Transmit one byte and check the return value.
>   *
>   * expected_ret:
> @@ -260,6 +294,57 @@ void send_byte(int fd, int expected_ret, int flags)
>  	}
>  }
>  
> +/* Transmit one byte and check the return value.
> + *
> + * expected_ret:
> + *  <0 Negative errno (for testing errors)
> + *   0 End-of-file
> + *   1 Success
> + */
> +void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
> +		 int flags)
> +{
> +	const uint8_t byte = 'A';
> +	ssize_t nwritten;
> +
> +	timeout_begin(TIMEOUT);
> +	do {
> +		nwritten = sendto(fd, &byte, sizeof(byte), flags, dest_addr,
> +				  len);
> +		timeout_check("write");
> +	} while (nwritten < 0 && errno == EINTR);
> +	timeout_end();
> +
> +	if (expected_ret < 0) {
> +		if (nwritten != -1) {
> +			fprintf(stderr, "bogus sendto(2) return value %zd\n",
> +				nwritten);
> +			exit(EXIT_FAILURE);
> +		}
> +		if (errno != -expected_ret) {
> +			perror("write");
> +			exit(EXIT_FAILURE);
> +		}
> +		return;
> +	}
> +
> +	if (nwritten < 0) {
> +		perror("write");
> +		exit(EXIT_FAILURE);
> +	}
> +	if (nwritten == 0) {
> +		if (expected_ret == 0)
> +			return;
> +
> +		fprintf(stderr, "unexpected EOF while sending byte\n");
> +		exit(EXIT_FAILURE);
> +	}
> +	if (nwritten != sizeof(byte)) {
> +		fprintf(stderr, "bogus sendto(2) return value %zd\n", nwritten);
> +		exit(EXIT_FAILURE);
> +	}
> +}
> +
>  /* Receive one byte and check the return value.
>   *
>   * expected_ret:
> @@ -313,6 +398,60 @@ void recv_byte(int fd, int expected_ret, int flags)
>  	}
>  }
>  
> +/* Receive one byte and check the return value.
> + *
> + * expected_ret:
> + *  <0 Negative errno (for testing errors)
> + *   0 End-of-file
> + *   1 Success
> + */
> +void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
> +		   int expected_ret, int flags)
> +{
> +	uint8_t byte;
> +	ssize_t nread;
> +
> +	timeout_begin(TIMEOUT);
> +	do {
> +		nread = recvfrom(fd, &byte, sizeof(byte), flags, src_addr, addrlen);
> +		timeout_check("read");
> +	} while (nread < 0 && errno == EINTR);
> +	timeout_end();
> +
> +	if (expected_ret < 0) {
> +		if (nread != -1) {
> +			fprintf(stderr, "bogus recvfrom(2) return value %zd\n",
> +				nread);
> +			exit(EXIT_FAILURE);
> +		}
> +		if (errno != -expected_ret) {
> +			perror("read");
> +			exit(EXIT_FAILURE);
> +		}
> +		return;
> +	}
> +
> +	if (nread < 0) {
> +		perror("read");
> +		exit(EXIT_FAILURE);
> +	}
> +	if (nread == 0) {
> +		if (expected_ret == 0)
> +			return;
> +
> +		fprintf(stderr, "unexpected EOF while receiving byte\n");
> +		exit(EXIT_FAILURE);
> +	}
> +	if (nread != sizeof(byte)) {
> +		fprintf(stderr, "bogus recvfrom(2) return value %zd\n", nread);
> +		exit(EXIT_FAILURE);
> +	}
> +	if (byte != 'A') {
> +		fprintf(stderr, "unexpected byte read %c\n", byte);
> +		exit(EXIT_FAILURE);
> +	}
> +}
> +
>  /* Run test cases.  The program terminates if a failure occurs. */
>  void run_tests(const struct test_case *test_cases,
>  	       const struct test_opts *opts)
> diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
> index fb99208a95ea..a69e128d120c 100644
> --- a/tools/testing/vsock/util.h
> +++ b/tools/testing/vsock/util.h
> @@ -37,13 +37,19 @@ void init_signals(void);
>  unsigned int parse_cid(const char *str);
>  int vsock_stream_connect(unsigned int cid, unsigned int port);
>  int vsock_seqpacket_connect(unsigned int cid, unsigned int port);
> +int vsock_dgram_connect(unsigned int cid, unsigned int port);
>  int vsock_stream_accept(unsigned int cid, unsigned int port,
>  			struct sockaddr_vm *clientaddrp);
>  int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
>  			   struct sockaddr_vm *clientaddrp);
> +int vsock_dgram_bind(unsigned int cid, unsigned int port);
>  void vsock_wait_remote_close(int fd);
>  void send_byte(int fd, int expected_ret, int flags);
> +void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
> +		 int flags);
>  void recv_byte(int fd, int expected_ret, int flags);
> +void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
> +		   int expected_ret, int flags);
>  void run_tests(const struct test_case *test_cases,
>  	       const struct test_opts *opts);
>  void list_tests(const struct test_case *test_cases);
> diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
> index ac1bd3ac1533..ded82d39ee5d 100644
> --- a/tools/testing/vsock/vsock_test.c
> +++ b/tools/testing/vsock/vsock_test.c
> @@ -1053,6 +1053,413 @@ static void test_stream_virtio_skb_merge_server(const struct test_opts *opts)
>  	close(fd);
>  }
>  
> +static void test_dgram_sendto_client(const struct test_opts *opts)
> +{
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = opts->peer_cid,
> +		},
> +	};
> +	int fd;
> +
> +	/* Wait for the server to be ready */
> +	control_expectln("BIND");
> +
> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> +	if (fd < 0) {
> +		perror("socket");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	sendto_byte(fd, &addr.sa, sizeof(addr.svm), 1, 0);
> +
> +	/* Notify the server that the client has finished */
> +	control_writeln("DONE");
> +
> +	close(fd);
> +}
> +
> +static void test_dgram_sendto_server(const struct test_opts *opts)
> +{
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = VMADDR_CID_ANY,
> +		},
> +	};
> +	int len = sizeof(addr.sa);
> +	int fd;
> +
> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> +	if (fd < 0) {
> +		perror("socket");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> +		perror("bind");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* Notify the client that the server is ready */
> +	control_writeln("BIND");
> +
> +	recvfrom_byte(fd, &addr.sa, &len, 1, 0);
> +
> +	/* Wait for the client to finish */
> +	control_expectln("DONE");
> +
> +	close(fd);
> +}
> +
> +static void test_dgram_connect_client(const struct test_opts *opts)
> +{
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = opts->peer_cid,
> +		},
> +	};
> +	int ret;
> +	int fd;
> +
> +	/* Wait for the server to be ready */
> +	control_expectln("BIND");
> +
> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> +	if (fd < 0) {
> +		perror("socket");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	ret = connect(fd, &addr.sa, sizeof(addr.svm));
> +	if (ret < 0) {
> +		perror("connect");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	send_byte(fd, 1, 0);
> +
> +	/* Notify the server that the client has finished */
> +	control_writeln("DONE");
> +
> +	close(fd);
> +}
> +
> +static void test_dgram_connect_server(const struct test_opts *opts)
> +{
> +	test_dgram_sendto_server(opts);
> +}
> +
> +static void test_dgram_multiconn_sendto_client(const struct test_opts *opts)
> +{
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = opts->peer_cid,
> +		},
> +	};
> +	int fds[MULTICONN_NFDS];
> +	int i;
> +
> +	/* Wait for the server to be ready */
> +	control_expectln("BIND");
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++) {
> +		fds[i] = socket(AF_VSOCK, SOCK_DGRAM, 0);
> +		if (fds[i] < 0) {
> +			perror("socket");
> +			exit(EXIT_FAILURE);
> +		}
> +	}
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++)
> +		sendto_byte(fds[i], &addr.sa, sizeof(addr.svm), 1, 0);
> +
> +	/* Notify the server that the client has finished */
> +	control_writeln("DONE");
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++)
> +		close(fds[i]);
> +}
> +
> +static void test_dgram_multiconn_sendto_server(const struct test_opts *opts)
> +{
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = VMADDR_CID_ANY,
> +		},
> +	};
> +	int len = sizeof(addr.sa);
> +	int fd;
> +	int i;
> +
> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> +	if (fd < 0) {
> +		perror("socket");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> +		perror("bind");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* Notify the client that the server is ready */
> +	control_writeln("BIND");
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++)
> +		recvfrom_byte(fd, &addr.sa, &len, 1, 0);
> +
> +	/* Wait for the client to finish */
> +	control_expectln("DONE");
> +
> +	close(fd);
> +}
> +
> +static void test_dgram_multiconn_send_client(const struct test_opts *opts)
> +{
> +	int fds[MULTICONN_NFDS];
> +	int i;
> +
> +	/* Wait for the server to be ready */
> +	control_expectln("BIND");
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++) {
> +		fds[i] = vsock_dgram_connect(opts->peer_cid, 1234);
> +		if (fds[i] < 0) {
> +			perror("socket");
> +			exit(EXIT_FAILURE);
> +		}
> +	}
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++)
> +		send_byte(fds[i], 1, 0);
> +
> +	/* Notify the server that the client has finished */
> +	control_writeln("DONE");
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++)
> +		close(fds[i]);
> +}
> +
> +static void test_dgram_multiconn_send_server(const struct test_opts *opts)
> +{
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = VMADDR_CID_ANY,
> +		},
> +	};
> +	int fd;
> +	int i;
> +
> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> +	if (fd < 0) {
> +		perror("socket");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> +		perror("bind");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* Notify the client that the server is ready */
> +	control_writeln("BIND");
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++)
> +		recv_byte(fd, 1, 0);
> +
> +	/* Wait for the client to finish */
> +	control_expectln("DONE");
> +
> +	close(fd);
> +}
> +
> +static void test_dgram_msg_bounds_client(const struct test_opts *opts)
> +{
> +	unsigned long recv_buf_size;
> +	int page_size;
> +	int msg_cnt;
> +	int fd;
> +
> +	fd = vsock_dgram_connect(opts->peer_cid, 1234);
> +	if (fd < 0) {
> +		perror("connect");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* Let the server know the client is ready */
> +	control_writeln("CLNTREADY");
> +
> +	msg_cnt = control_readulong();
> +	recv_buf_size = control_readulong();
> +
> +	/* Wait, until receiver sets buffer size. */
> +	control_expectln("SRVREADY");
> +
> +	page_size = getpagesize();
> +
> +	for (int i = 0; i < msg_cnt; i++) {
> +		unsigned long curr_hash;
> +		ssize_t send_size;
> +		size_t buf_size;
> +		void *buf;
> +
> +		/* Use "small" buffers and "big" buffers. */
> +		if (i & 1)
> +			buf_size = page_size +
> +					(rand() % (MAX_MSG_SIZE - page_size));
> +		else
> +			buf_size = 1 + (rand() % page_size);
> +
> +		buf_size = min(buf_size, recv_buf_size);
> +
> +		buf = malloc(buf_size);
> +
> +		if (!buf) {
> +			perror("malloc");
> +			exit(EXIT_FAILURE);
> +		}
> +
> +		memset(buf, rand() & 0xff, buf_size);
> +		/* Set at least one MSG_EOR + some random. */
> +
> +		send_size = send(fd, buf, buf_size, 0);
> +
> +		if (send_size < 0) {
> +			perror("send");
> +			exit(EXIT_FAILURE);
> +		}
> +
> +		if (send_size != buf_size) {
> +			fprintf(stderr, "Invalid send size\n");
> +			exit(EXIT_FAILURE);
> +		}
> +
> +		/* In theory the implementation isn't required to transmit
> +		 * these packets in order, so we use this SYNC control message
> +		 * so that server and client coordinate sending and receiving
> +		 * one packet at a time. The client sends a packet and waits
> +		 * until it has been received before sending another.
> +		 */
> +		control_writeln("PKTSENT");
> +		control_expectln("PKTRECV");
> +
> +		/* Send the server a hash of the packet */
> +		curr_hash = hash_djb2(buf, buf_size);
> +		control_writeulong(curr_hash);
> +		free(buf);
> +	}
> +
> +	control_writeln("SENDDONE");
> +	close(fd);
> +}
> +
> +static void test_dgram_msg_bounds_server(const struct test_opts *opts)
> +{
> +	const unsigned long msg_cnt = 16;
> +	unsigned long sock_buf_size;
> +	struct msghdr msg = {0};
> +	struct iovec iov = {0};
> +	char buf[MAX_MSG_SIZE];
> +	socklen_t len;
> +	int fd;
> +	int i;
> +
> +	fd = vsock_dgram_bind(VMADDR_CID_ANY, 1234);
> +
> +	if (fd < 0) {
> +		perror("bind");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* Set receive buffer to maximum */
> +	sock_buf_size = -1;
> +	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
> +		       &sock_buf_size, sizeof(sock_buf_size))) {
> +		perror("setsockopt(SO_RCVBUF)");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* Retrieve the receive buffer size */
> +	len = sizeof(sock_buf_size);
> +	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF,
> +		       &sock_buf_size, &len)) {
> +		perror("getsockopt(SO_RCVBUF)");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* Client ready to receive parameters */
> +	control_expectln("CLNTREADY");
> +
> +	control_writeulong(msg_cnt);
> +	control_writeulong(sock_buf_size);
> +
> +	/* Ready to receive data. */
> +	control_writeln("SRVREADY");
> +
> +	iov.iov_base = buf;
> +	iov.iov_len = sizeof(buf);
> +	msg.msg_iov = &iov;
> +	msg.msg_iovlen = 1;
> +
> +	for (i = 0; i < msg_cnt; i++) {
> +		unsigned long remote_hash;
> +		unsigned long curr_hash;
> +		ssize_t recv_size;
> +
> +		control_expectln("PKTSENT");
> +		recv_size = recvmsg(fd, &msg, 0);
> +		control_writeln("PKTRECV");
> +
> +		if (!recv_size)
> +			break;
> +
> +		if (recv_size < 0) {
> +			perror("recvmsg");
> +			exit(EXIT_FAILURE);
> +		}
> +
> +		curr_hash = hash_djb2(msg.msg_iov[0].iov_base, recv_size);
> +		remote_hash = control_readulong();
> +
> +		if (curr_hash != remote_hash) {
> +			fprintf(stderr, "Message bounds broken\n");
> +			exit(EXIT_FAILURE);
> +		}
> +	}
> +
> +	close(fd);
> +}
> +
>  static struct test_case test_cases[] = {
>  	{
>  		.name = "SOCK_STREAM connection reset",
> @@ -1128,6 +1535,31 @@ static struct test_case test_cases[] = {
>  		.run_client = test_stream_virtio_skb_merge_client,
>  		.run_server = test_stream_virtio_skb_merge_server,
>  	},
> +	{
> +		.name = "SOCK_DGRAM client sendto",
> +		.run_client = test_dgram_sendto_client,
> +		.run_server = test_dgram_sendto_server,
> +	},
> +	{
> +		.name = "SOCK_DGRAM client connect",
> +		.run_client = test_dgram_connect_client,
> +		.run_server = test_dgram_connect_server,
> +	},
> +	{
> +		.name = "SOCK_DGRAM multiple connections using sendto",
> +		.run_client = test_dgram_multiconn_sendto_client,
> +		.run_server = test_dgram_multiconn_sendto_server,
> +	},
> +	{
> +		.name = "SOCK_DGRAM multiple connections using send",
> +		.run_client = test_dgram_multiconn_send_client,
> +		.run_server = test_dgram_multiconn_send_server,
> +	},
> +	{
> +		.name = "SOCK_DGRAM msg bounds",
> +		.run_client = test_dgram_msg_bounds_client,
> +		.run_server = test_dgram_msg_bounds_server,
> +	},
>  	{},
>  };
>  
> 
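
The msg bounds test above compares hash_djb2() of the sent buffer with
hash_djb2() of the received one, but the helper itself is not part of the
quoted hunk. For reference, a minimal sketch of the classic djb2 hash is
below. The name djb2_sketch is made up for this example, and this is an
assumption about what the helper computes, not a copy of the test suite's
implementation.

```c
#include <stddef.h>

/* Classic djb2: start at 5381, then hash = hash * 33 + byte for each byte.
 * Hypothetical stand-in for the hash_djb2() helper used by the test.
 */
unsigned long djb2_sketch(const void *data, size_t len)
{
	const unsigned char *p = data;
	unsigned long hash = 5381;
	size_t i;

	for (i = 0; i < len; i++)
		hash = ((hash << 5) + hash) + p[i];

	return hash;
}
```

Because the hash depends on every byte and on the length, a datagram that
was split, merged or truncated in transit would almost certainly produce a
different value on the receiving side, which is what the "Message bounds
broken" check relies on.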

^ permalink raw reply

* Re: [PATCH RFC net-next v4 6/8] virtio/vsock: support dgrams
From: Arseniy Krasnov @ 2023-06-11 20:49 UTC (permalink / raw)
  To: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv, bpf
In-Reply-To: <20230413-b4-vsock-dgram-v4-6-0cebbb2ae899@bytedance.com>

Hello Bobby!

On 10.06.2023 03:58, Bobby Eshleman wrote:
> This commit adds support for datagrams over virtio/vsock.
> 
> Message boundaries are preserved on a per-skb and per-vq entry basis.

I'm a little confused about the following case: suppose vhost sends a 4097-byte
datagram to the guest. The guest uses 4096-byte RX buffers in its virtio queue,
and each buffer has an empty skb attached to it. Vhost places the first 4096
bytes in the first buffer of the guest's RX queue and the last 1 byte in the
second buffer. Now, IIUC, the guest has two skbs in its RX queue, and a user in
the guest wants to read data: does it read 4097 bytes, even though the guest
has two skbs of 4096 bytes and 1 byte? In seqpacket there is a special marker in
the header which shows where a message ends; how does it work here?
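
To make the numbers concrete, here is a minimal sketch of the arithmetic
behind this scenario. It assumes a payload is simply split across fixed-size
RX buffers, which is the premise of the question and not something the patch
confirms. The struct and function names are made up for this example.

```c
#include <stddef.h>

/* Result of splitting one payload across fixed-size RX buffers. */
struct rx_split {
	size_t nbufs;    /* number of buffers (and skbs) consumed */
	size_t last_len; /* bytes that land in the final buffer */
};

/* Hypothetical helper: how many buf_size buffers does payload_len need? */
struct rx_split split_across_rx_buffers(size_t payload_len, size_t buf_size)
{
	struct rx_split s = { 0, 0 };

	if (payload_len == 0 || buf_size == 0)
		return s;

	/* Round up: a partial last buffer still counts as one buffer. */
	s.nbufs = (payload_len + buf_size - 1) / buf_size;
	s.last_len = payload_len - (s.nbufs - 1) * buf_size;

	return s;
}
```

With payload_len = 4097 and buf_size = 4096 this gives two buffers, the
second one holding a single byte, so without some end-of-message marker the
receiver would see two separate skbs.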

Thanks, Arseniy

> Messages are copied in whole from the user to an SKB, which in turn is
> added to the scatterlist for the virtqueue in whole for the device.
> Messages do not straddle skbs and they do not straddle packets.
> Messages may be truncated by the receiving user if their buffer is
> shorter than the message.
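
For comparison, the truncation rule described here matches what userspace
already sees on other datagram sockets. The sketch below uses an AF_UNIX
SOCK_DGRAM socketpair purely as an analogy, since it runs without a vsock
transport. It is an assumption that vsock datagrams would look the same from
userspace, and the names trunc_result and dgram_truncation_demo are made up
for this example.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

struct trunc_result {
	ssize_t copied; /* bytes copied into the short buffer */
	int truncated;  /* nonzero if the kernel reported MSG_TRUNC */
	ssize_t next;   /* size of the following datagram */
};

/* Send an 8-byte and a 2-byte datagram, read the first into 4 bytes. */
int dgram_truncation_demo(struct trunc_result *res)
{
	char small[4], big[16];
	struct iovec iov = { .iov_base = small, .iov_len = sizeof(small) };
	struct msghdr msg;
	int ret = -1;
	int sv[2];

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;

	if (send(sv[0], "12345678", 8, 0) != 8 ||
	    send(sv[0], "xy", 2, 0) != 2)
		goto out;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;

	/* The tail of the first datagram is discarded, not kept for later. */
	res->copied = recvmsg(sv[1], &msg, 0);
	if (res->copied < 0)
		goto out;
	res->truncated = !!(msg.msg_flags & MSG_TRUNC);

	/* The next read returns the second datagram, not the lost tail. */
	res->next = recv(sv[1], big, sizeof(big), 0);
	if (res->next < 0)
		goto out;

	ret = 0;
out:
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

The first read copies only 4 bytes and sets MSG_TRUNC in msg_flags, and the
second read returns the 2-byte datagram, so message boundaries hold even
when the receiver's buffer is too short.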
> 
> Other properties of vsock datagrams:
> - Datagrams self-throttle at the per-socket sk_sndbuf threshold.
> - The same virtqueue is used as is used for streams and seqpacket flows
> - Credits are not used for datagrams
> - Packets are dropped silently by the device, which means the virtqueue
>   will still get kicked even during high packet loss, so long as the
>   socket does not exceed sk_sndbuf.
> 
> Future work might include finding a way to reduce the virtqueue kick
> rate for datagram flows with high packet loss.
> 
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> ---
>  drivers/vhost/vsock.c                   |  27 ++++-
>  include/linux/virtio_vsock.h            |   5 +-
>  include/net/af_vsock.h                  |   1 +
>  include/uapi/linux/virtio_vsock.h       |   1 +
>  net/vmw_vsock/af_vsock.c                |  58 +++++++--
>  net/vmw_vsock/virtio_transport.c        |  23 +++-
>  net/vmw_vsock/virtio_transport_common.c | 207 ++++++++++++++++++++++++--------
>  net/vmw_vsock/vsock_loopback.c          |   8 +-
>  8 files changed, 264 insertions(+), 66 deletions(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index 8f0082da5e70..159c1a22c1a8 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -32,7 +32,8 @@
>  enum {
>  	VHOST_VSOCK_FEATURES = VHOST_FEATURES |
>  			       (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
> -			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
> +			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
> +			       (1ULL << VIRTIO_VSOCK_F_DGRAM)
>  };
>  
>  enum {
> @@ -56,6 +57,7 @@ struct vhost_vsock {
>  	atomic_t queued_replies;
>  
>  	u32 guest_cid;
> +	bool dgram_allow;
>  	bool seqpacket_allow;
>  };
>  
> @@ -394,6 +396,7 @@ static bool vhost_vsock_more_replies(struct vhost_vsock *vsock)
>  	return val < vq->num;
>  }
>  
> +static bool vhost_transport_dgram_allow(u32 cid, u32 port);
>  static bool vhost_transport_seqpacket_allow(u32 remote_cid);
>  
>  static struct virtio_transport vhost_transport = {
> @@ -410,10 +413,11 @@ static struct virtio_transport vhost_transport = {
>  		.cancel_pkt               = vhost_transport_cancel_pkt,
>  
>  		.dgram_enqueue            = virtio_transport_dgram_enqueue,
> -		.dgram_allow              = virtio_transport_dgram_allow,
> +		.dgram_allow              = vhost_transport_dgram_allow,
>  		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
>  		.dgram_get_port		  = virtio_transport_dgram_get_port,
>  		.dgram_get_length	  = virtio_transport_dgram_get_length,
> +		.dgram_payload_offset	  = 0,
>  
>  		.stream_enqueue           = virtio_transport_stream_enqueue,
>  		.stream_dequeue           = virtio_transport_stream_dequeue,
> @@ -446,6 +450,22 @@ static struct virtio_transport vhost_transport = {
>  	.send_pkt = vhost_transport_send_pkt,
>  };
>  
> +static bool vhost_transport_dgram_allow(u32 cid, u32 port)
> +{
> +	struct vhost_vsock *vsock;
> +	bool dgram_allow = false;
> +
> +	rcu_read_lock();
> +	vsock = vhost_vsock_get(cid);
> +
> +	if (vsock)
> +		dgram_allow = vsock->dgram_allow;
> +
> +	rcu_read_unlock();
> +
> +	return dgram_allow;
> +}
> +
>  static bool vhost_transport_seqpacket_allow(u32 remote_cid)
>  {
>  	struct vhost_vsock *vsock;
> @@ -802,6 +822,9 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
>  	if (features & (1ULL << VIRTIO_VSOCK_F_SEQPACKET))
>  		vsock->seqpacket_allow = true;
>  
> +	if (features & (1ULL << VIRTIO_VSOCK_F_DGRAM))
> +		vsock->dgram_allow = true;
> +
>  	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
>  		vq = &vsock->vqs[i];
>  		mutex_lock(&vq->mutex);
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index 73afa09f4585..237ca87a2ecd 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -216,7 +216,6 @@ void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val);
>  u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
>  bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
>  bool virtio_transport_stream_allow(u32 cid, u32 port);
> -bool virtio_transport_dgram_allow(u32 cid, u32 port);
>  int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid);
>  int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port);
>  int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len);
> @@ -247,4 +246,8 @@ void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit);
>  void virtio_transport_deliver_tap_pkt(struct sk_buff *skb);
>  int virtio_transport_purge_skbs(void *vsk, struct sk_buff_head *list);
>  int virtio_transport_read_skb(struct vsock_sock *vsk, skb_read_actor_t read_actor);
> +void virtio_transport_init_dgram_bind_tables(void);
> +int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid);
> +int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port);
> +int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len);
>  #endif /* _LINUX_VIRTIO_VSOCK_H */
> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> index 7bedb9ee7e3e..c115e655b4f5 100644
> --- a/include/net/af_vsock.h
> +++ b/include/net/af_vsock.h
> @@ -225,6 +225,7 @@ void vsock_for_each_connected_socket(struct vsock_transport *transport,
>  				     void (*fn)(struct sock *sk));
>  int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk);
>  bool vsock_find_cid(unsigned int cid);
> +struct sock *vsock_find_bound_dgram_socket(struct sockaddr_vm *addr);
>  
>  /**** TAP ****/
>  
> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> index 9c25f267bbc0..27b4b2b8bf13 100644
> --- a/include/uapi/linux/virtio_vsock.h
> +++ b/include/uapi/linux/virtio_vsock.h
> @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
>  enum virtio_vsock_type {
>  	VIRTIO_VSOCK_TYPE_STREAM = 1,
>  	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
> +	VIRTIO_VSOCK_TYPE_DGRAM = 3,
>  };
>  
>  enum virtio_vsock_op {
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index 7a3ca4270446..b0b18e7f4299 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -114,6 +114,7 @@
>  static int __vsock_bind(struct sock *sk, struct sockaddr_vm *addr);
>  static void vsock_sk_destruct(struct sock *sk);
>  static int vsock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
> +static bool sock_type_connectible(u16 type);
>  
>  /* Protocol family. */
>  struct proto vsock_proto = {
> @@ -180,6 +181,8 @@ struct list_head vsock_connected_table[VSOCK_HASH_SIZE];
>  EXPORT_SYMBOL_GPL(vsock_connected_table);
>  DEFINE_SPINLOCK(vsock_table_lock);
>  EXPORT_SYMBOL_GPL(vsock_table_lock);
> +static struct list_head vsock_dgram_bind_table[VSOCK_HASH_SIZE];
> +static DEFINE_SPINLOCK(vsock_dgram_table_lock);
>  
>  /* Autobind this socket to the local address if necessary. */
>  static int vsock_auto_bind(struct vsock_sock *vsk)
> @@ -202,6 +205,9 @@ static void vsock_init_tables(void)
>  
>  	for (i = 0; i < ARRAY_SIZE(vsock_connected_table); i++)
>  		INIT_LIST_HEAD(&vsock_connected_table[i]);
> +
> +	for (i = 0; i < ARRAY_SIZE(vsock_dgram_bind_table); i++)
> +		INIT_LIST_HEAD(&vsock_dgram_bind_table[i]);
>  }
>  
>  static void __vsock_insert_bound(struct list_head *list,
> @@ -230,8 +236,8 @@ static void __vsock_remove_connected(struct vsock_sock *vsk)
>  	sock_put(&vsk->sk);
>  }
>  
> -struct sock *vsock_find_bound_socket_common(struct sockaddr_vm *addr,
> -					    struct list_head *bind_table)
> +static struct sock *vsock_find_bound_socket_common(struct sockaddr_vm *addr,
> +						   struct list_head *bind_table)
>  {
>  	struct vsock_sock *vsk;
>  
> @@ -248,6 +254,23 @@ struct sock *vsock_find_bound_socket_common(struct sockaddr_vm *addr,
>  	return NULL;
>  }
>  
> +struct sock *
> +vsock_find_bound_dgram_socket(struct sockaddr_vm *addr)
> +{
> +	struct sock *sk;
> +
> +	spin_lock_bh(&vsock_dgram_table_lock);
> +	sk = vsock_find_bound_socket_common(addr,
> +					    &vsock_dgram_bind_table[VSOCK_HASH(addr)]);
> +	if (sk)
> +		sock_hold(sk);
> +
> +	spin_unlock_bh(&vsock_dgram_table_lock);
> +
> +	return sk;
> +}
> +EXPORT_SYMBOL_GPL(vsock_find_bound_dgram_socket);
> +
>  static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
>  {
>  	return vsock_find_bound_socket_common(addr, vsock_bound_sockets(addr));
> @@ -287,6 +310,14 @@ void vsock_insert_connected(struct vsock_sock *vsk)
>  }
>  EXPORT_SYMBOL_GPL(vsock_insert_connected);
>  
> +static void vsock_remove_dgram_bound(struct vsock_sock *vsk)
> +{
> +	spin_lock_bh(&vsock_dgram_table_lock);
> +	if (__vsock_in_bound_table(vsk))
> +		__vsock_remove_bound(vsk);
> +	spin_unlock_bh(&vsock_dgram_table_lock);
> +}
> +
>  void vsock_remove_bound(struct vsock_sock *vsk)
>  {
>  	spin_lock_bh(&vsock_table_lock);
> @@ -338,7 +369,10 @@ EXPORT_SYMBOL_GPL(vsock_find_connected_socket);
>  
>  void vsock_remove_sock(struct vsock_sock *vsk)
>  {
> -	vsock_remove_bound(vsk);
> +	if (sock_type_connectible(sk_vsock(vsk)->sk_type))
> +		vsock_remove_bound(vsk);
> +	else
> +		vsock_remove_dgram_bound(vsk);
>  	vsock_remove_connected(vsk);
>  }
>  EXPORT_SYMBOL_GPL(vsock_remove_sock);
> @@ -720,11 +754,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
>  	return vsock_bind_common(vsk, addr, vsock_bind_table, VSOCK_HASH_SIZE + 1);
>  }
>  
> -static int __vsock_bind_dgram(struct vsock_sock *vsk,
> -			      struct sockaddr_vm *addr)
> +static int vsock_bind_dgram(struct vsock_sock *vsk,
> +			    struct sockaddr_vm *addr)
>  {
> -	if (!vsk->transport || !vsk->transport->dgram_bind)
> -		return -EINVAL;
> +	if (!vsk->transport || !vsk->transport->dgram_bind) {
> +		int retval;
> +
> +		spin_lock_bh(&vsock_dgram_table_lock);
> +		retval = vsock_bind_common(vsk, addr, vsock_dgram_bind_table,
> +					   VSOCK_HASH_SIZE);
> +		spin_unlock_bh(&vsock_dgram_table_lock);
> +
> +		return retval;
> +	}
>  
>  	return vsk->transport->dgram_bind(vsk, addr);
>  }
> @@ -755,7 +797,7 @@ static int __vsock_bind(struct sock *sk, struct sockaddr_vm *addr)
>  		break;
>  
>  	case SOCK_DGRAM:
> -		retval = __vsock_bind_dgram(vsk, addr);
> +		retval = vsock_bind_dgram(vsk, addr);
>  		break;
>  
>  	default:
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index 1b7843a7779a..7160a3104218 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -63,6 +63,7 @@ struct virtio_vsock {
>  
>  	u32 guest_cid;
>  	bool seqpacket_allow;
> +	bool dgram_allow;
>  };
>  
>  static u32 virtio_transport_get_local_cid(void)
> @@ -413,6 +414,7 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>  }
>  
> +static bool virtio_transport_dgram_allow(u32 cid, u32 port);
>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>  
>  static struct virtio_transport virtio_transport = {
> @@ -465,6 +467,21 @@ static struct virtio_transport virtio_transport = {
>  	.send_pkt = virtio_transport_send_pkt,
>  };
>  
> +static bool virtio_transport_dgram_allow(u32 cid, u32 port)
> +{
> +	struct virtio_vsock *vsock;
> +	bool dgram_allow;
> +
> +	dgram_allow = false;
> +	rcu_read_lock();
> +	vsock = rcu_dereference(the_virtio_vsock);
> +	if (vsock)
> +		dgram_allow = vsock->dgram_allow;
> +	rcu_read_unlock();
> +
> +	return dgram_allow;
> +}
> +
>  static bool virtio_transport_seqpacket_allow(u32 remote_cid)
>  {
>  	struct virtio_vsock *vsock;
> @@ -658,6 +675,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_SEQPACKET))
>  		vsock->seqpacket_allow = true;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_DGRAM))
> +		vsock->dgram_allow = true;
> +
>  	vdev->priv = vsock;
>  
>  	ret = virtio_vsock_vqs_init(vsock);
> @@ -750,7 +770,8 @@ static struct virtio_device_id id_table[] = {
>  };
>  
>  static unsigned int features[] = {
> -	VIRTIO_VSOCK_F_SEQPACKET
> +	VIRTIO_VSOCK_F_SEQPACKET,
> +	VIRTIO_VSOCK_F_DGRAM
>  };
>  
>  static struct virtio_driver virtio_vsock_driver = {
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index d5a3c8efe84b..bc9d459723f5 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -37,6 +37,35 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>  	return container_of(t, struct virtio_transport, transport);
>  }
>  
> +/* Requires info->msg and info->vsk */
> +static struct sk_buff *
> +virtio_transport_sock_alloc_send_skb(struct virtio_vsock_pkt_info *info, unsigned int size,
> +				     gfp_t mask, int *err)
> +{
> +	struct sk_buff *skb;
> +	struct sock *sk;
> +	int noblock;
> +
> +	if (size < VIRTIO_VSOCK_SKB_HEADROOM) {
> +		*err = -EINVAL;
> +		return NULL;
> +	}
> +
> +	if (info->msg)
> +		noblock = info->msg->msg_flags & MSG_DONTWAIT;
> +	else
> +		noblock = 1;
> +
> +	sk = sk_vsock(info->vsk);
> +	sk->sk_allocation = mask;
> +	skb = sock_alloc_send_skb(sk, size, noblock, err);
> +	if (!skb)
> +		return NULL;
> +
> +	skb_reserve(skb, VIRTIO_VSOCK_SKB_HEADROOM);
> +	return skb;
> +}
> +
>  /* Returns a new packet on success, otherwise returns NULL.
>   *
>   * If NULL is returned, errp is set to a negative errno.
> @@ -47,7 +76,8 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>  			   u32 src_cid,
>  			   u32 src_port,
>  			   u32 dst_cid,
> -			   u32 dst_port)
> +			   u32 dst_port,
> +			   int *errp)
>  {
>  	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>  	struct virtio_vsock_hdr *hdr;
> @@ -55,9 +85,21 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>  	void *payload;
>  	int err;
>  
> -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> -	if (!skb)
> +	/* dgrams do not use credits, self-throttle according to sk_sndbuf
> +	 * using sock_alloc_send_skb. This helps avoid triggering the OOM.
> +	 */
> +	if (info->vsk && info->type == VIRTIO_VSOCK_TYPE_DGRAM) {
> +		skb = virtio_transport_sock_alloc_send_skb(info, skb_len, GFP_KERNEL, &err);
> +	} else {
> +		skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> +		if (!skb)
> +			err = -ENOMEM;
> +	}
> +
> +	if (!skb) {
> +		*errp = err;
>  		return NULL;
> +	}
>  
>  	hdr = virtio_vsock_hdr(skb);
>  	hdr->type	= cpu_to_le16(info->type);
> @@ -96,12 +138,14 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>  
>  	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>  		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
> +		err = -EFAULT;
>  		goto out;
>  	}
>  
>  	return skb;
>  
>  out:
> +	*errp = err;
>  	kfree_skb(skb);
>  	return NULL;
>  }
> @@ -183,7 +227,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
>  
>  static u16 virtio_transport_get_type(struct sock *sk)
>  {
> -	if (sk->sk_type == SOCK_STREAM)
> +	if (sk->sk_type == SOCK_DGRAM)
> +		return VIRTIO_VSOCK_TYPE_DGRAM;
> +	else if (sk->sk_type == SOCK_STREAM)
>  		return VIRTIO_VSOCK_TYPE_STREAM;
>  	else
>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
> @@ -239,11 +285,10 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  
>  		skb = virtio_transport_alloc_skb(info, skb_len,
>  						 src_cid, src_port,
> -						 dst_cid, dst_port);
> -		if (!skb) {
> -			ret = -ENOMEM;
> +						 dst_cid, dst_port,
> +						 &ret);
> +		if (!skb)
>  			break;
> -		}
>  
>  		virtio_transport_inc_tx_pkt(vvs, skb);
>  
> @@ -583,14 +628,30 @@ virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_enqueue);
>  
> -int
> -virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
> -			       struct msghdr *msg,
> -			       size_t len, int flags)
> +int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> +{
> +	*cid = le64_to_cpu(virtio_vsock_hdr(skb)->src_cid);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_cid);
> +
> +int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
> +{
> +	*port = le32_to_cpu(virtio_vsock_hdr(skb)->src_port);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_port);
> +
> +int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
>  {
> -	return -EOPNOTSUPP;
> +	/* The device layer must have already moved the data ptr beyond the
> +	 * header for skb->len to be correct.
> +	 */
> +	WARN_ON(skb->data == skb->head);
> +	*len = skb->len;
> +	return 0;
>  }
> -EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_length);
>  
>  s64 virtio_transport_stream_has_data(struct vsock_sock *vsk)
>  {
> @@ -790,30 +851,6 @@ bool virtio_transport_stream_allow(u32 cid, u32 port)
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
>  
> -int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> -{
> -	return -EOPNOTSUPP;
> -}
> -EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_cid);
> -
> -int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
> -{
> -	return -EOPNOTSUPP;
> -}
> -EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_port);
> -
> -int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
> -{
> -	return -EOPNOTSUPP;
> -}
> -EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_length);
> -
> -bool virtio_transport_dgram_allow(u32 cid, u32 port)
> -{
> -	return false;
> -}
> -EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
> -
>  int virtio_transport_connect(struct vsock_sock *vsk)
>  {
>  	struct virtio_vsock_pkt_info info = {
> @@ -846,7 +883,34 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
>  			       struct msghdr *msg,
>  			       size_t dgram_len)
>  {
> -	return -EOPNOTSUPP;
> +	const struct virtio_transport *t_ops;
> +	struct virtio_vsock_pkt_info info = {
> +		.op = VIRTIO_VSOCK_OP_RW,
> +		.msg = msg,
> +		.vsk = vsk,
> +		.type = VIRTIO_VSOCK_TYPE_DGRAM,
> +	};
> +	u32 src_cid, src_port;
> +	struct sk_buff *skb;
> +	int err;
> +
> +	if (dgram_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> +		return -EMSGSIZE;
> +
> +	t_ops = virtio_transport_get_ops(vsk);
> +	src_cid = t_ops->transport.get_local_cid();
> +	src_port = vsk->local_addr.svm_port;
> +
> +	skb = virtio_transport_alloc_skb(&info, dgram_len,
> +					 src_cid, src_port,
> +					 remote_addr->svm_cid,
> +					 remote_addr->svm_port,
> +					 &err);
> +
> +	if (!skb)
> +		return err;
> +
> +	return t_ops->send_pkt(skb);
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
>  
> @@ -903,6 +967,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>  		.reply = true,
>  	};
>  	struct sk_buff *reply;
> +	int err;
>  
>  	/* Send RST only if the original pkt is not a RST pkt */
>  	if (le16_to_cpu(hdr->op) == VIRTIO_VSOCK_OP_RST)
> @@ -915,9 +980,10 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>  					   le64_to_cpu(hdr->dst_cid),
>  					   le32_to_cpu(hdr->dst_port),
>  					   le64_to_cpu(hdr->src_cid),
> -					   le32_to_cpu(hdr->src_port));
> +					   le32_to_cpu(hdr->src_port),
> +					   &err);
>  	if (!reply)
> -		return -ENOMEM;
> +		return err;
>  
>  	return t->send_pkt(reply);
>  }
> @@ -1137,6 +1203,21 @@ virtio_transport_recv_enqueue(struct vsock_sock *vsk,
>  		kfree_skb(skb);
>  }
>  
> +/* This function takes ownership of the skb.
> + *
> + * It either places the skb on the sk_receive_queue or frees it.
> + */
> +static void
> +virtio_transport_recv_dgram(struct sock *sk, struct sk_buff *skb)
> +{
> +	if (sock_queue_rcv_skb(sk, skb)) {
> +		kfree_skb(skb);
> +		return;
> +	}
> +
> +	sk->sk_data_ready(sk);
> +}
> +
>  static int
>  virtio_transport_recv_connected(struct sock *sk,
>  				struct sk_buff *skb)
> @@ -1300,7 +1381,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
>  static bool virtio_transport_valid_type(u16 type)
>  {
>  	return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
> -	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
> +	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
> +	       (type == VIRTIO_VSOCK_TYPE_DGRAM);
>  }
>  
>  /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
> @@ -1314,40 +1396,52 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>  	struct vsock_sock *vsk;
>  	struct sock *sk;
>  	bool space_available;
> +	u16 type;
>  
>  	vsock_addr_init(&src, le64_to_cpu(hdr->src_cid),
>  			le32_to_cpu(hdr->src_port));
>  	vsock_addr_init(&dst, le64_to_cpu(hdr->dst_cid),
>  			le32_to_cpu(hdr->dst_port));
>  
> +	type = le16_to_cpu(hdr->type);
> +
>  	trace_virtio_transport_recv_pkt(src.svm_cid, src.svm_port,
>  					dst.svm_cid, dst.svm_port,
>  					le32_to_cpu(hdr->len),
> -					le16_to_cpu(hdr->type),
> +					type,
>  					le16_to_cpu(hdr->op),
>  					le32_to_cpu(hdr->flags),
>  					le32_to_cpu(hdr->buf_alloc),
>  					le32_to_cpu(hdr->fwd_cnt));
>  
> -	if (!virtio_transport_valid_type(le16_to_cpu(hdr->type))) {
> +	if (!virtio_transport_valid_type(type)) {
>  		(void)virtio_transport_reset_no_sock(t, skb);
>  		goto free_pkt;
>  	}
>  
> -	/* The socket must be in connected or bound table
> -	 * otherwise send reset back
> +	/* For stream/seqpacket, the socket must be in connected or bound table
> +	 * otherwise send reset back.
> +	 *
> +	 * For datagrams, no reset is sent back.
>  	 */
>  	sk = vsock_find_connected_socket(&src, &dst);
>  	if (!sk) {
> -		sk = vsock_find_bound_socket(&dst);
> -		if (!sk) {
> -			(void)virtio_transport_reset_no_sock(t, skb);
> -			goto free_pkt;
> +		if (type == VIRTIO_VSOCK_TYPE_DGRAM) {
> +			sk = vsock_find_bound_dgram_socket(&dst);
> +			if (!sk)
> +				goto free_pkt;
> +		} else {
> +			sk = vsock_find_bound_socket(&dst);
> +			if (!sk) {
> +				(void)virtio_transport_reset_no_sock(t, skb);
> +				goto free_pkt;
> +			}
>  		}
>  	}
>  
> -	if (virtio_transport_get_type(sk) != le16_to_cpu(hdr->type)) {
> -		(void)virtio_transport_reset_no_sock(t, skb);
> +	if (virtio_transport_get_type(sk) != type) {
> +		if (type != VIRTIO_VSOCK_TYPE_DGRAM)
> +			(void)virtio_transport_reset_no_sock(t, skb);
>  		sock_put(sk);
>  		goto free_pkt;
>  	}
> @@ -1363,12 +1457,18 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>  
>  	/* Check if sk has been closed before lock_sock */
>  	if (sock_flag(sk, SOCK_DONE)) {
> -		(void)virtio_transport_reset_no_sock(t, skb);
> +		if (type != VIRTIO_VSOCK_TYPE_DGRAM)
> +			(void)virtio_transport_reset_no_sock(t, skb);
>  		release_sock(sk);
>  		sock_put(sk);
>  		goto free_pkt;
>  	}
>  
> +	if (sk->sk_type == SOCK_DGRAM) {
> +		virtio_transport_recv_dgram(sk, skb);
> +		goto out;
> +	}
> +
>  	space_available = virtio_transport_space_update(sk, skb);
>  
>  	/* Update CID in case it has changed after a transport reset event */
> @@ -1400,6 +1500,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>  		break;
>  	}
>  
> +out:
>  	release_sock(sk);
>  
>  	/* Release refcnt obtained when we fetched this socket out of the
> diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
> index e9de45a26fbd..68312aa8c972 100644
> --- a/net/vmw_vsock/vsock_loopback.c
> +++ b/net/vmw_vsock/vsock_loopback.c
> @@ -46,6 +46,7 @@ static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk)
>  	return 0;
>  }
>  
> +static bool vsock_loopback_dgram_allow(u32 cid, u32 port);
>  static bool vsock_loopback_seqpacket_allow(u32 remote_cid);
>  
>  static struct virtio_transport loopback_transport = {
> @@ -62,7 +63,7 @@ static struct virtio_transport loopback_transport = {
>  		.cancel_pkt               = vsock_loopback_cancel_pkt,
>  
>  		.dgram_enqueue            = virtio_transport_dgram_enqueue,
> -		.dgram_allow              = virtio_transport_dgram_allow,
> +		.dgram_allow              = vsock_loopback_dgram_allow,
>  		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
>  		.dgram_get_port		  = virtio_transport_dgram_get_port,
>  		.dgram_get_length	  = virtio_transport_dgram_get_length,
> @@ -98,6 +99,11 @@ static struct virtio_transport loopback_transport = {
>  	.send_pkt = vsock_loopback_send_pkt,
>  };
>  
> +static bool vsock_loopback_dgram_allow(u32 cid, u32 port)
> +{
> +	return true;
> +}
> +
>  static bool vsock_loopback_seqpacket_allow(u32 remote_cid)
>  {
>  	return true;
> 

^ permalink raw reply

* Re: [PATCH RFC net-next v4 1/8] vsock/dgram: generalize recvmsg and drop transport->dgram_dequeue
From: Arseniy Krasnov @ 2023-06-11 20:43 UTC (permalink / raw)
  To: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv, bpf
In-Reply-To: <20230413-b4-vsock-dgram-v4-1-0cebbb2ae899@bytedance.com>

Hello Bobby! Thanks for this patchset! Small comment below:

On 10.06.2023 03:58, Bobby Eshleman wrote:
> This commit drops the transport->dgram_dequeue callback and makes
> vsock_dgram_recvmsg() generic. It also adds additional transport
> callbacks for use by the generic vsock_dgram_recvmsg(), such as for
> parsing skbs for CID/port which vary in format per transport.
> 
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> ---
>  drivers/vhost/vsock.c                   |  4 +-
>  include/linux/virtio_vsock.h            |  3 ++
>  include/net/af_vsock.h                  | 13 ++++++-
>  net/vmw_vsock/af_vsock.c                | 51 ++++++++++++++++++++++++-
>  net/vmw_vsock/hyperv_transport.c        | 17 +++++++--
>  net/vmw_vsock/virtio_transport.c        |  4 +-
>  net/vmw_vsock/virtio_transport_common.c | 18 +++++++++
>  net/vmw_vsock/vmci_transport.c          | 68 +++++++++++++--------------------
>  net/vmw_vsock/vsock_loopback.c          |  4 +-
>  9 files changed, 132 insertions(+), 50 deletions(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index 6578db78f0ae..c8201c070b4b 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -410,9 +410,11 @@ static struct virtio_transport vhost_transport = {
>  		.cancel_pkt               = vhost_transport_cancel_pkt,
>  
>  		.dgram_enqueue            = virtio_transport_dgram_enqueue,
> -		.dgram_dequeue            = virtio_transport_dgram_dequeue,
>  		.dgram_bind               = virtio_transport_dgram_bind,
>  		.dgram_allow              = virtio_transport_dgram_allow,
> +		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
> +		.dgram_get_port		  = virtio_transport_dgram_get_port,
> +		.dgram_get_length	  = virtio_transport_dgram_get_length,
>  
>  		.stream_enqueue           = virtio_transport_stream_enqueue,
>  		.stream_dequeue           = virtio_transport_stream_dequeue,
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index c58453699ee9..23521a318cf0 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -219,6 +219,9 @@ bool virtio_transport_stream_allow(u32 cid, u32 port);
>  int virtio_transport_dgram_bind(struct vsock_sock *vsk,
>  				struct sockaddr_vm *addr);
>  bool virtio_transport_dgram_allow(u32 cid, u32 port);
> +int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid);
> +int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port);
> +int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len);
>  
>  int virtio_transport_connect(struct vsock_sock *vsk);
>  
> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> index 0e7504a42925..7bedb9ee7e3e 100644
> --- a/include/net/af_vsock.h
> +++ b/include/net/af_vsock.h
> @@ -120,11 +120,20 @@ struct vsock_transport {
>  
>  	/* DGRAM. */
>  	int (*dgram_bind)(struct vsock_sock *, struct sockaddr_vm *);
> -	int (*dgram_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
> -			     size_t len, int flags);
>  	int (*dgram_enqueue)(struct vsock_sock *, struct sockaddr_vm *,
>  			     struct msghdr *, size_t len);
>  	bool (*dgram_allow)(u32 cid, u32 port);
> +	int (*dgram_get_cid)(struct sk_buff *skb, unsigned int *cid);
> +	int (*dgram_get_port)(struct sk_buff *skb, unsigned int *port);
> +	int (*dgram_get_length)(struct sk_buff *skb, size_t *length);
> +
> +	/* The number of bytes into the buffer at which the payload starts, as
> +	 * first seen by the receiving socket layer. For example, if the
> +	 * transport presets the skb pointers using skb_pull(sizeof(header))
> +	 * then this would be zero, otherwise it would be the size of the
> +	 * header.
> +	 */
> +	const size_t dgram_payload_offset;
>  
>  	/* STREAM. */
>  	/* TODO: stream_bind() */
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index efb8a0937a13..ffb4dd8b6ea7 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -1271,11 +1271,15 @@ static int vsock_dgram_connect(struct socket *sock,
>  int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
>  			size_t len, int flags)
>  {
> +	const struct vsock_transport *transport;
>  #ifdef CONFIG_BPF_SYSCALL
>  	const struct proto *prot;
>  #endif
>  	struct vsock_sock *vsk;
> +	struct sk_buff *skb;
> +	size_t payload_len;
>  	struct sock *sk;
> +	int err;
>  
>  	sk = sock->sk;
>  	vsk = vsock_sk(sk);
> @@ -1286,7 +1290,52 @@ int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
>  		return prot->recvmsg(sk, msg, len, flags, NULL);
>  #endif
>  
> -	return vsk->transport->dgram_dequeue(vsk, msg, len, flags);
> +	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
> +		return -EOPNOTSUPP;
> +
> +	transport = vsk->transport;
> +
> +	/* Retrieve the head sk_buff from the socket's receive queue. */
> +	err = 0;
> +	skb = skb_recv_datagram(sk_vsock(vsk), flags, &err);
> +	if (!skb)
> +		return err;
> +
> +	err = transport->dgram_get_length(skb, &payload_len);
> +	if (err)
> +		goto out;
> +
> +	if (payload_len > len) {
> +		payload_len = len;
> +		msg->msg_flags |= MSG_TRUNC;
> +	}
> +
> +	/* Place the datagram payload in the user's iovec. */
> +	err = skb_copy_datagram_msg(skb, transport->dgram_payload_offset, msg, payload_len);
> +	if (err)
> +		goto out;
> +
> +	if (msg->msg_name) {
> +		/* Provide the address of the sender. */
> +		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> +		unsigned int cid, port;
> +
> +		err = transport->dgram_get_cid(skb, &cid);
> +		if (err)
> +			goto out;
> +
> +		err = transport->dgram_get_port(skb, &port);
> +		if (err)
> +			goto out;

Maybe we can merge 'dgram_get_cid' and 'dgram_get_port' into a single callback? As far as I can see, this is
the only place where both are used (correct me if I'm wrong), and logically both operate on one address:
CID and port. E.g. something like dgram_get_cid_n_port().
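
To make the suggestion concrete, here is a rough userspace sketch of what a merged callback could look like for VMCI. Everything with a "_sketch" suffix is a hypothetical stand-in written for this illustration (the struct layouts only mimic the fields the patch reads), and the callback name is made up; it is not code from the patchset.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types, reduced to the fields
 * that the quoted patch actually reads. Not the real definitions.
 */
struct sk_buff_sketch {
	void *data;
};

struct vmci_handle_sketch {
	uint32_t context;	/* used as the CID in the patch */
	uint32_t resource;	/* used as the port in the patch */
};

struct vmci_datagram_sketch {
	struct vmci_handle_sketch dst;
	struct vmci_handle_sketch src;
	uint64_t payload_size;
};

/* Hypothetical merged callback: one call returns both halves of the
 * sender address instead of two separate get_cid/get_port calls.
 */
static int vmci_dgram_get_cid_n_port_sketch(struct sk_buff_sketch *skb,
					    unsigned int *cid,
					    unsigned int *port)
{
	struct vmci_datagram_sketch *dg = skb->data;

	if (!dg)
		return -EINVAL;

	*cid = dg->src.context;
	*port = dg->src.resource;
	return 0;
}
```

With something like this, the generic recvmsg path would make a single call and a single error check before vsock_addr_init(), and each transport would parse its header once.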

Moreover, I'm not sure this is a good tradeoff: we remove the transport-specific callback for dgram receive,
which already gets a 'msghdr' carrying both the data buffer and the buffer for 'sockaddr_vm', and in its place add
several new callbacks to the transports, like dgram_get_cid() and dgram_get_port(). I agree that each
transport-specific callback would then repeat the same copying logic, calling 'skb_copy_datagram_msg()' and filling
the address with 'vsock_addr_init()', but in that case we don't need to change the transports much. For example,
HyperV stays unchanged, as it does not support SOCK_DGRAM. For VMCI you just need to add the 'vsock_addr_init()' logic
to its dgram dequeue callback.
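
For comparison, a minimal userspace mock of the alternative: one transport-specific dequeue that validates the header, copies the payload (truncating if the caller's buffer is short) and fills the sender address itself. The header layout, the struct names and the function are all hypothetical and only mirror the steps of the removed vmci_transport_dgram_dequeue(); the 'truncated' flag stands in for MSG_TRUNC.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire header and address; not kernel definitions. */
struct dgram_hdr_sketch {
	uint32_t src_cid;
	uint32_t src_port;
	uint32_t payload_size;
};

struct addr_sketch {
	uint32_t cid;
	uint32_t port;
};

/* Single dequeue callback: returns the number of payload bytes copied
 * into 'buf', or a negative errno. '*truncated' is set to 1 when the
 * datagram was larger than 'len'.
 */
static long dgram_dequeue_sketch(const uint8_t *pkt, size_t pkt_len,
				 void *buf, size_t len,
				 struct addr_sketch *from, int *truncated)
{
	struct dgram_hdr_sketch hdr;
	size_t payload_len;

	if (pkt_len < sizeof(hdr))
		return -EINVAL;
	memcpy(&hdr, pkt, sizeof(hdr));

	/* The packet must match the payload size claimed in the header. */
	payload_len = hdr.payload_size;
	if (payload_len != pkt_len - sizeof(hdr))
		return -EINVAL;

	*truncated = 0;
	if (payload_len > len) {
		payload_len = len;
		*truncated = 1;
	}

	/* Copy the payload, then provide the address of the sender. */
	memcpy(buf, pkt + sizeof(hdr), payload_len);
	if (from) {
		from->cid = hdr.src_cid;
		from->port = hdr.src_port;
	}

	return (long)payload_len;
}
```

The cost is that every transport repeats the truncate-and-copy steps; the benefit is that the transport interface gains no new fields.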

What do you think?

Thanks, Arseniy

> +
> +		vsock_addr_init(vm_addr, cid, port);
> +		msg->msg_namelen = sizeof(*vm_addr);
> +	}
> +	err = payload_len;
> +
> +out:
> +	skb_free_datagram(&vsk->sk, skb);
> +	return err;
>  }
>  EXPORT_SYMBOL_GPL(vsock_dgram_recvmsg);
>  
> diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> index 7cb1a9d2cdb4..ff6e87e25fa0 100644
> --- a/net/vmw_vsock/hyperv_transport.c
> +++ b/net/vmw_vsock/hyperv_transport.c
> @@ -556,8 +556,17 @@ static int hvs_dgram_bind(struct vsock_sock *vsk, struct sockaddr_vm *addr)
>  	return -EOPNOTSUPP;
>  }
>  
> -static int hvs_dgram_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
> -			     size_t len, int flags)
> +static int hvs_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static int hvs_dgram_get_port(struct sk_buff *skb, unsigned int *port)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static int hvs_dgram_get_length(struct sk_buff *skb, size_t *len)
>  {
>  	return -EOPNOTSUPP;
>  }
> @@ -833,7 +842,9 @@ static struct vsock_transport hvs_transport = {
>  	.shutdown                 = hvs_shutdown,
>  
>  	.dgram_bind               = hvs_dgram_bind,
> -	.dgram_dequeue            = hvs_dgram_dequeue,
> +	.dgram_get_cid		  = hvs_dgram_get_cid,
> +	.dgram_get_port		  = hvs_dgram_get_port,
> +	.dgram_get_length	  = hvs_dgram_get_length,
>  	.dgram_enqueue            = hvs_dgram_enqueue,
>  	.dgram_allow              = hvs_dgram_allow,
>  
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index e95df847176b..5763cdf13804 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -429,9 +429,11 @@ static struct virtio_transport virtio_transport = {
>  		.cancel_pkt               = virtio_transport_cancel_pkt,
>  
>  		.dgram_bind               = virtio_transport_dgram_bind,
> -		.dgram_dequeue            = virtio_transport_dgram_dequeue,
>  		.dgram_enqueue            = virtio_transport_dgram_enqueue,
>  		.dgram_allow              = virtio_transport_dgram_allow,
> +		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
> +		.dgram_get_port		  = virtio_transport_dgram_get_port,
> +		.dgram_get_length	  = virtio_transport_dgram_get_length,
>  
>  		.stream_dequeue           = virtio_transport_stream_dequeue,
>  		.stream_enqueue           = virtio_transport_stream_enqueue,
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index b769fc258931..e6903c719964 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -797,6 +797,24 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
>  
> +int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> +{
> +	return -EOPNOTSUPP;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_cid);
> +
> +int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
> +{
> +	return -EOPNOTSUPP;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_port);
> +
> +int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
> +{
> +	return -EOPNOTSUPP;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_length);
> +
>  bool virtio_transport_dgram_allow(u32 cid, u32 port)
>  {
>  	return false;
> diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> index b370070194fa..bbc63826bf48 100644
> --- a/net/vmw_vsock/vmci_transport.c
> +++ b/net/vmw_vsock/vmci_transport.c
> @@ -1731,57 +1731,40 @@ static int vmci_transport_dgram_enqueue(
>  	return err - sizeof(*dg);
>  }
>  
> -static int vmci_transport_dgram_dequeue(struct vsock_sock *vsk,
> -					struct msghdr *msg, size_t len,
> -					int flags)
> +static int vmci_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
>  {
> -	int err;
>  	struct vmci_datagram *dg;
> -	size_t payload_len;
> -	struct sk_buff *skb;
>  
> -	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
> -		return -EOPNOTSUPP;
> +	dg = (struct vmci_datagram *)skb->data;
> +	if (!dg)
> +		return -EINVAL;
>  
> -	/* Retrieve the head sk_buff from the socket's receive queue. */
> -	err = 0;
> -	skb = skb_recv_datagram(&vsk->sk, flags, &err);
> -	if (!skb)
> -		return err;
> +	*cid = dg->src.context;
> +	return 0;
> +}
> +
> +static int vmci_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
> +{
> +	struct vmci_datagram *dg;
>  
>  	dg = (struct vmci_datagram *)skb->data;
>  	if (!dg)
> -		/* err is 0, meaning we read zero bytes. */
> -		goto out;
> -
> -	payload_len = dg->payload_size;
> -	/* Ensure the sk_buff matches the payload size claimed in the packet. */
> -	if (payload_len != skb->len - sizeof(*dg)) {
> -		err = -EINVAL;
> -		goto out;
> -	}
> +		return -EINVAL;
>  
> -	if (payload_len > len) {
> -		payload_len = len;
> -		msg->msg_flags |= MSG_TRUNC;
> -	}
> +	*port = dg->src.resource;
> +	return 0;
> +}
>  
> -	/* Place the datagram payload in the user's iovec. */
> -	err = skb_copy_datagram_msg(skb, sizeof(*dg), msg, payload_len);
> -	if (err)
> -		goto out;
> +static int vmci_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
> +{
> +	struct vmci_datagram *dg;
>  
> -	if (msg->msg_name) {
> -		/* Provide the address of the sender. */
> -		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> -		vsock_addr_init(vm_addr, dg->src.context, dg->src.resource);
> -		msg->msg_namelen = sizeof(*vm_addr);
> -	}
> -	err = payload_len;
> +	dg = (struct vmci_datagram *)skb->data;
> +	if (!dg)
> +		return -EINVAL;
>  
> -out:
> -	skb_free_datagram(&vsk->sk, skb);
> -	return err;
> +	*len = dg->payload_size;
> +	return 0;
>  }
>  
>  static bool vmci_transport_dgram_allow(u32 cid, u32 port)
> @@ -2040,9 +2023,12 @@ static struct vsock_transport vmci_transport = {
>  	.release = vmci_transport_release,
>  	.connect = vmci_transport_connect,
>  	.dgram_bind = vmci_transport_dgram_bind,
> -	.dgram_dequeue = vmci_transport_dgram_dequeue,
>  	.dgram_enqueue = vmci_transport_dgram_enqueue,
>  	.dgram_allow = vmci_transport_dgram_allow,
> +	.dgram_get_cid = vmci_transport_dgram_get_cid,
> +	.dgram_get_port = vmci_transport_dgram_get_port,
> +	.dgram_get_length = vmci_transport_dgram_get_length,
> +	.dgram_payload_offset = sizeof(struct vmci_datagram),
>  	.stream_dequeue = vmci_transport_stream_dequeue,
>  	.stream_enqueue = vmci_transport_stream_enqueue,
>  	.stream_has_data = vmci_transport_stream_has_data,
> diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
> index 5c6360df1f31..2f3cabc79ee5 100644
> --- a/net/vmw_vsock/vsock_loopback.c
> +++ b/net/vmw_vsock/vsock_loopback.c
> @@ -62,9 +62,11 @@ static struct virtio_transport loopback_transport = {
>  		.cancel_pkt               = vsock_loopback_cancel_pkt,
>  
>  		.dgram_bind               = virtio_transport_dgram_bind,
> -		.dgram_dequeue            = virtio_transport_dgram_dequeue,
>  		.dgram_enqueue            = virtio_transport_dgram_enqueue,
>  		.dgram_allow              = virtio_transport_dgram_allow,
> +		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
> +		.dgram_get_port		  = virtio_transport_dgram_get_port,
> +		.dgram_get_length	  = virtio_transport_dgram_get_length,
>  
>  		.stream_dequeue           = virtio_transport_stream_dequeue,
>  		.stream_enqueue           = virtio_transport_stream_enqueue,
> 

^ permalink raw reply

* Re: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Leon Romanovsky @ 2023-06-11 18:18 UTC (permalink / raw)
  To: Wei Hu
  Cc: netdev, linux-hyperv, linux-rdma, longli, sharmaajay, jgg, kys,
	haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni, vkuznets,
	ssengar, shradhagupta
In-Reply-To: <20230606151747.1649305-1-weh@microsoft.com>

On Tue, Jun 06, 2023 at 03:17:47PM +0000, Wei Hu wrote:
> Add EQ interrupt support for mana ib driver. Allocate EQs per ucontext
> to receive interrupt. Attach EQ when CQ is created. Call CQ interrupt
> handler when completion interrupt happens. EQs are destroyed when
> ucontext is deallocated.
> 
> The change calls some public APIs in mana ethernet driver to
> allocate EQs and other resources. The EQ process routine is also shared
> by mana ethernet and mana ib drivers.
> 
> Co-developed-by: Ajay Sharma <sharmaajay@microsoft.com>
> Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
> Signed-off-by: Wei Hu <weh@microsoft.com>
> ---
> 
> v2: Use ibdev_dbg to print error messages and return -ENOMEM
>     when kzalloc fails.

<...>

> +	if (atomic_read(&ibcq->usecnt) == 0) {

What exactly are you checking here? And in all places where you access ibcq->usecnt?

> +		mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
> +		ibdev_dbg(ibdev, "freeing gdma cq %p\n", gc->cq_table[cq->id]);
> +		kfree(gc->cq_table[cq->id]);
> +		gc->cq_table[cq->id] = NULL;
> +		ib_umem_release(cq->umem);
> +	}
>  
>  	return 0;
>  }
> +
> +void mana_ib_cq_handler(void *ctx, struct gdma_queue *gdma_cq)
> +{
> +	struct mana_ib_cq *cq = ctx;
> +	struct ib_device *ibdev = cq->ibcq.device;
> +
> +	ibdev_dbg(ibdev, "Enter %s %d\n", __func__, __LINE__);

This patch has too many debug prints; most, if not all, should go.

Thanks

^ permalink raw reply

* [PATCH RFC net-next v4 8/8] tests: add vsock dgram tests
From: Bobby Eshleman @ 2023-06-10  0:58 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, Krasnov Arseniy, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, Bobby Eshleman,
	Jiang Wang
In-Reply-To: <20230413-b4-vsock-dgram-v4-0-0cebbb2ae899@bytedance.com>

From: Jiang Wang <jiang.wang@bytedance.com>

This patch adds tests for vsock datagram.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
---
 tools/testing/vsock/util.c       | 141 ++++++++++++-
 tools/testing/vsock/util.h       |   6 +
 tools/testing/vsock/vsock_test.c | 432 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 578 insertions(+), 1 deletion(-)

diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
index 01b636d3039a..811e70d7cf1e 100644
--- a/tools/testing/vsock/util.c
+++ b/tools/testing/vsock/util.c
@@ -99,7 +99,8 @@ static int vsock_connect(unsigned int cid, unsigned int port, int type)
 	int ret;
 	int fd;
 
-	control_expectln("LISTENING");
+	if (type != SOCK_DGRAM)
+		control_expectln("LISTENING");
 
 	fd = socket(AF_VSOCK, type, 0);
 
@@ -130,6 +131,11 @@ int vsock_seqpacket_connect(unsigned int cid, unsigned int port)
 	return vsock_connect(cid, port, SOCK_SEQPACKET);
 }
 
+int vsock_dgram_connect(unsigned int cid, unsigned int port)
+{
+	return vsock_connect(cid, port, SOCK_DGRAM);
+}
+
 /* Listen on <cid, port> and return the first incoming connection.  The remote
  * address is stored to clientaddrp.  clientaddrp may be NULL.
  */
@@ -211,6 +217,34 @@ int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
 	return vsock_accept(cid, port, clientaddrp, SOCK_SEQPACKET);
 }
 
+int vsock_dgram_bind(unsigned int cid, unsigned int port)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = port,
+			.svm_cid = cid,
+		},
+	};
+	int fd;
+
+	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
+	if (fd < 0) {
+		perror("socket");
+		exit(EXIT_FAILURE);
+	}
+
+	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+		perror("bind");
+		exit(EXIT_FAILURE);
+	}
+
+	return fd;
+}
+
 /* Transmit one byte and check the return value.
  *
  * expected_ret:
@@ -260,6 +294,57 @@ void send_byte(int fd, int expected_ret, int flags)
 	}
 }
 
+/* Transmit one byte and check the return value.
+ *
+ * expected_ret:
+ *  <0 Negative errno (for testing errors)
+ *   0 End-of-file
+ *   1 Success
+ */
+void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
+		 int flags)
+{
+	const uint8_t byte = 'A';
+	ssize_t nwritten;
+
+	timeout_begin(TIMEOUT);
+	do {
+		nwritten = sendto(fd, &byte, sizeof(byte), flags, dest_addr,
+				  len);
+		timeout_check("write");
+	} while (nwritten < 0 && errno == EINTR);
+	timeout_end();
+
+	if (expected_ret < 0) {
+		if (nwritten != -1) {
+			fprintf(stderr, "bogus sendto(2) return value %zd\n",
+				nwritten);
+			exit(EXIT_FAILURE);
+		}
+		if (errno != -expected_ret) {
+			perror("write");
+			exit(EXIT_FAILURE);
+		}
+		return;
+	}
+
+	if (nwritten < 0) {
+		perror("write");
+		exit(EXIT_FAILURE);
+	}
+	if (nwritten == 0) {
+		if (expected_ret == 0)
+			return;
+
+		fprintf(stderr, "unexpected EOF while sending byte\n");
+		exit(EXIT_FAILURE);
+	}
+	if (nwritten != sizeof(byte)) {
+		fprintf(stderr, "bogus sendto(2) return value %zd\n", nwritten);
+		exit(EXIT_FAILURE);
+	}
+}
+
 /* Receive one byte and check the return value.
  *
  * expected_ret:
@@ -313,6 +398,60 @@ void recv_byte(int fd, int expected_ret, int flags)
 	}
 }
 
+/* Receive one byte and check the return value.
+ *
+ * expected_ret:
+ *  <0 Negative errno (for testing errors)
+ *   0 End-of-file
+ *   1 Success
+ */
+void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
+		   int expected_ret, int flags)
+{
+	uint8_t byte;
+	ssize_t nread;
+
+	timeout_begin(TIMEOUT);
+	do {
+		nread = recvfrom(fd, &byte, sizeof(byte), flags, src_addr, addrlen);
+		timeout_check("read");
+	} while (nread < 0 && errno == EINTR);
+	timeout_end();
+
+	if (expected_ret < 0) {
+		if (nread != -1) {
+			fprintf(stderr, "bogus recvfrom(2) return value %zd\n",
+				nread);
+			exit(EXIT_FAILURE);
+		}
+		if (errno != -expected_ret) {
+			perror("read");
+			exit(EXIT_FAILURE);
+		}
+		return;
+	}
+
+	if (nread < 0) {
+		perror("read");
+		exit(EXIT_FAILURE);
+	}
+	if (nread == 0) {
+		if (expected_ret == 0)
+			return;
+
+		fprintf(stderr, "unexpected EOF while receiving byte\n");
+		exit(EXIT_FAILURE);
+	}
+	if (nread != sizeof(byte)) {
+		fprintf(stderr, "bogus recvfrom(2) return value %zd\n", nread);
+		exit(EXIT_FAILURE);
+	}
+	if (byte != 'A') {
+		fprintf(stderr, "unexpected byte read %c\n", byte);
+		exit(EXIT_FAILURE);
+	}
+}
+
 /* Run test cases.  The program terminates if a failure occurs. */
 void run_tests(const struct test_case *test_cases,
 	       const struct test_opts *opts)
diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
index fb99208a95ea..a69e128d120c 100644
--- a/tools/testing/vsock/util.h
+++ b/tools/testing/vsock/util.h
@@ -37,13 +37,19 @@ void init_signals(void);
 unsigned int parse_cid(const char *str);
 int vsock_stream_connect(unsigned int cid, unsigned int port);
 int vsock_seqpacket_connect(unsigned int cid, unsigned int port);
+int vsock_dgram_connect(unsigned int cid, unsigned int port);
 int vsock_stream_accept(unsigned int cid, unsigned int port,
 			struct sockaddr_vm *clientaddrp);
 int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
 			   struct sockaddr_vm *clientaddrp);
+int vsock_dgram_bind(unsigned int cid, unsigned int port);
 void vsock_wait_remote_close(int fd);
 void send_byte(int fd, int expected_ret, int flags);
+void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
+		 int flags);
 void recv_byte(int fd, int expected_ret, int flags);
+void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
+		   int expected_ret, int flags);
 void run_tests(const struct test_case *test_cases,
 	       const struct test_opts *opts);
 void list_tests(const struct test_case *test_cases);
diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index ac1bd3ac1533..ded82d39ee5d 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -1053,6 +1053,413 @@ static void test_stream_virtio_skb_merge_server(const struct test_opts *opts)
 	close(fd);
 }
 
+static void test_dgram_sendto_client(const struct test_opts *opts)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = opts->peer_cid,
+		},
+	};
+	int fd;
+
+	/* Wait for the server to be ready */
+	control_expectln("BIND");
+
+	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
+	if (fd < 0) {
+		perror("socket");
+		exit(EXIT_FAILURE);
+	}
+
+	sendto_byte(fd, &addr.sa, sizeof(addr.svm), 1, 0);
+
+	/* Notify the server that the client has finished */
+	control_writeln("DONE");
+
+	close(fd);
+}
+
+static void test_dgram_sendto_server(const struct test_opts *opts)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = VMADDR_CID_ANY,
+		},
+	};
+	int len = sizeof(addr.sa);
+	int fd;
+
+	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
+	if (fd < 0) {
+		perror("socket");
+		exit(EXIT_FAILURE);
+	}
+
+	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+		perror("bind");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Notify the client that the server is ready */
+	control_writeln("BIND");
+
+	recvfrom_byte(fd, &addr.sa, &len, 1, 0);
+
+	/* Wait for the client to finish */
+	control_expectln("DONE");
+
+	close(fd);
+}
+
+static void test_dgram_connect_client(const struct test_opts *opts)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = opts->peer_cid,
+		},
+	};
+	int ret;
+	int fd;
+
+	/* Wait for the server to be ready */
+	control_expectln("BIND");
+
+	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
+	if (fd < 0) {
+		perror("socket");
+		exit(EXIT_FAILURE);
+	}
+
+	ret = connect(fd, &addr.sa, sizeof(addr.svm));
+	if (ret < 0) {
+		perror("connect");
+		exit(EXIT_FAILURE);
+	}
+
+	send_byte(fd, 1, 0);
+
+	/* Notify the server that the client has finished */
+	control_writeln("DONE");
+
+	close(fd);
+}
+
+static void test_dgram_connect_server(const struct test_opts *opts)
+{
+	test_dgram_sendto_server(opts);
+}
+
+static void test_dgram_multiconn_sendto_client(const struct test_opts *opts)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = opts->peer_cid,
+		},
+	};
+	int fds[MULTICONN_NFDS];
+	int i;
+
+	/* Wait for the server to be ready */
+	control_expectln("BIND");
+
+	for (i = 0; i < MULTICONN_NFDS; i++) {
+		fds[i] = socket(AF_VSOCK, SOCK_DGRAM, 0);
+		if (fds[i] < 0) {
+			perror("socket");
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	for (i = 0; i < MULTICONN_NFDS; i++)
+		sendto_byte(fds[i], &addr.sa, sizeof(addr.svm), 1, 0);
+
+	/* Notify the server that the client has finished */
+	control_writeln("DONE");
+
+	for (i = 0; i < MULTICONN_NFDS; i++)
+		close(fds[i]);
+}
+
+static void test_dgram_multiconn_sendto_server(const struct test_opts *opts)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = VMADDR_CID_ANY,
+		},
+	};
+	int len = sizeof(addr.sa);
+	int fd;
+	int i;
+
+	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
+	if (fd < 0) {
+		perror("socket");
+		exit(EXIT_FAILURE);
+	}
+
+	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+		perror("bind");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Notify the client that the server is ready */
+	control_writeln("BIND");
+
+	for (i = 0; i < MULTICONN_NFDS; i++)
+		recvfrom_byte(fd, &addr.sa, &len, 1, 0);
+
+	/* Wait for the client to finish */
+	control_expectln("DONE");
+
+	close(fd);
+}
+
+static void test_dgram_multiconn_send_client(const struct test_opts *opts)
+{
+	int fds[MULTICONN_NFDS];
+	int i;
+
+	/* Wait for the server to be ready */
+	control_expectln("BIND");
+
+	for (i = 0; i < MULTICONN_NFDS; i++) {
+		fds[i] = vsock_dgram_connect(opts->peer_cid, 1234);
+		if (fds[i] < 0) {
+			perror("socket");
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	for (i = 0; i < MULTICONN_NFDS; i++)
+		send_byte(fds[i], 1, 0);
+
+	/* Notify the server that the client has finished */
+	control_writeln("DONE");
+
+	for (i = 0; i < MULTICONN_NFDS; i++)
+		close(fds[i]);
+}
+
+static void test_dgram_multiconn_send_server(const struct test_opts *opts)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = VMADDR_CID_ANY,
+		},
+	};
+	int fd;
+	int i;
+
+	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
+	if (fd < 0) {
+		perror("socket");
+		exit(EXIT_FAILURE);
+	}
+
+	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+		perror("bind");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Notify the client that the server is ready */
+	control_writeln("BIND");
+
+	for (i = 0; i < MULTICONN_NFDS; i++)
+		recv_byte(fd, 1, 0);
+
+	/* Wait for the client to finish */
+	control_expectln("DONE");
+
+	close(fd);
+}
+
+static void test_dgram_msg_bounds_client(const struct test_opts *opts)
+{
+	unsigned long recv_buf_size;
+	int page_size;
+	int msg_cnt;
+	int fd;
+
+	fd = vsock_dgram_connect(opts->peer_cid, 1234);
+	if (fd < 0) {
+		perror("connect");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Let the server know the client is ready */
+	control_writeln("CLNTREADY");
+
+	msg_cnt = control_readulong();
+	recv_buf_size = control_readulong();
+
+	/* Wait, until receiver sets buffer size. */
+	control_expectln("SRVREADY");
+
+	page_size = getpagesize();
+
+	for (int i = 0; i < msg_cnt; i++) {
+		unsigned long curr_hash;
+		ssize_t send_size;
+		size_t buf_size;
+		void *buf;
+
+		/* Use "small" buffers and "big" buffers. */
+		if (i & 1)
+			buf_size = page_size +
+					(rand() % (MAX_MSG_SIZE - page_size));
+		else
+			buf_size = 1 + (rand() % page_size);
+
+		buf_size = min(buf_size, recv_buf_size);
+
+		buf = malloc(buf_size);
+
+		if (!buf) {
+			perror("malloc");
+			exit(EXIT_FAILURE);
+		}
+
+		memset(buf, rand() & 0xff, buf_size);
+		/* Set at least one MSG_EOR + some random. */
+
+		send_size = send(fd, buf, buf_size, 0);
+
+		if (send_size < 0) {
+			perror("send");
+			exit(EXIT_FAILURE);
+		}
+
+		if (send_size != buf_size) {
+			fprintf(stderr, "Invalid send size\n");
+			exit(EXIT_FAILURE);
+		}
+
+		/* In theory the implementation isn't required to transmit
+		 * these packets in order, so we use this SYNC control message
+		 * so that server and client coordinate sending and receiving
+		 * one packet at a time. The client sends a packet and waits
+		 * until it has been received before sending another.
+		 */
+		control_writeln("PKTSENT");
+		control_expectln("PKTRECV");
+
+		/* Send the server a hash of the packet */
+		curr_hash = hash_djb2(buf, buf_size);
+		control_writeulong(curr_hash);
+		free(buf);
+	}
+
+	control_writeln("SENDDONE");
+	close(fd);
+}
+
+static void test_dgram_msg_bounds_server(const struct test_opts *opts)
+{
+	const unsigned long msg_cnt = 16;
+	unsigned long sock_buf_size;
+	struct msghdr msg = {0};
+	struct iovec iov = {0};
+	char buf[MAX_MSG_SIZE];
+	socklen_t len;
+	int fd;
+	int i;
+
+	fd = vsock_dgram_bind(VMADDR_CID_ANY, 1234);
+
+	if (fd < 0) {
+		perror("bind");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Set receive buffer to maximum */
+	sock_buf_size = -1;
+	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
+		       &sock_buf_size, sizeof(sock_buf_size))) {
+		perror("setsockopt(SO_RCVBUF)");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Retrieve the receive buffer size */
+	len = sizeof(sock_buf_size);
+	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF,
+		       &sock_buf_size, &len)) {
+		perror("getsockopt(SO_RCVBUF)");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Client ready to receive parameters */
+	control_expectln("CLNTREADY");
+
+	control_writeulong(msg_cnt);
+	control_writeulong(sock_buf_size);
+
+	/* Ready to receive data. */
+	control_writeln("SRVREADY");
+
+	iov.iov_base = buf;
+	iov.iov_len = sizeof(buf);
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+
+	for (i = 0; i < msg_cnt; i++) {
+		unsigned long remote_hash;
+		unsigned long curr_hash;
+		ssize_t recv_size;
+
+		control_expectln("PKTSENT");
+		recv_size = recvmsg(fd, &msg, 0);
+		control_writeln("PKTRECV");
+
+		if (!recv_size)
+			break;
+
+		if (recv_size < 0) {
+			perror("recvmsg");
+			exit(EXIT_FAILURE);
+		}
+
+		curr_hash = hash_djb2(msg.msg_iov[0].iov_base, recv_size);
+		remote_hash = control_readulong();
+
+		if (curr_hash != remote_hash) {
+			fprintf(stderr, "Message bounds broken\n");
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	close(fd);
+}
+
 static struct test_case test_cases[] = {
 	{
 		.name = "SOCK_STREAM connection reset",
@@ -1128,6 +1535,31 @@ static struct test_case test_cases[] = {
 		.run_client = test_stream_virtio_skb_merge_client,
 		.run_server = test_stream_virtio_skb_merge_server,
 	},
+	{
+		.name = "SOCK_DGRAM client sendto",
+		.run_client = test_dgram_sendto_client,
+		.run_server = test_dgram_sendto_server,
+	},
+	{
+		.name = "SOCK_DGRAM client connect",
+		.run_client = test_dgram_connect_client,
+		.run_server = test_dgram_connect_server,
+	},
+	{
+		.name = "SOCK_DGRAM multiple connections using sendto",
+		.run_client = test_dgram_multiconn_sendto_client,
+		.run_server = test_dgram_multiconn_sendto_server,
+	},
+	{
+		.name = "SOCK_DGRAM multiple connections using send",
+		.run_client = test_dgram_multiconn_send_client,
+		.run_server = test_dgram_multiconn_send_server,
+	},
+	{
+		.name = "SOCK_DGRAM msg bounds",
+		.run_client = test_dgram_msg_bounds_client,
+		.run_server = test_dgram_msg_bounds_server,
+	},
 	{},
 };
 

-- 
2.30.2



* [PATCH RFC net-next v4 7/8] vsock: Add lockless sendmsg() support
From: Bobby Eshleman @ 2023-06-10  0:58 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, Krasnov Arseniy, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, Bobby Eshleman
In-Reply-To: <20230413-b4-vsock-dgram-v4-0-0cebbb2ae899@bytedance.com>

Because the dgram sendmsg() path for AF_VSOCK acquires the socket lock,
it does not scale when many senders share a socket.

Prior to this patch the socket lock is used to protect both reads and
writes to the local_addr, remote_addr, transport, and buffer size
variables of a vsock socket. What follows are the new protection schemes
for these fields that ensure a race-free and usually lock-free
multi-sender sendmsg() path for vsock dgrams.

- local_addr
local_addr changes as a result of binding a socket. The write path
for local_addr is bind() and various vsock_auto_bind() call sites.
After a socket has been bound via vsock_auto_bind() or bind(), subsequent
calls to bind()/vsock_auto_bind() do not write to local_addr again. bind()
rejects the user request and vsock_auto_bind() early exits.
Therefore, local_addr cannot change while a parallel thread is in
sendmsg(), and lock-free reads of local_addr in sendmsg() are safe.
Change: only acquire the lock for auto-binding, as needed, in sendmsg().

- buffer size variables
Not used by dgram, so they do not need protection. No change.

- remote_addr and transport
A remote_addr update may result in a changed transport, yet the vsock
send path needs to read both fields lock-free and coherently. This patch
therefore packages the two fields into a new struct vsock_remote_info
that is referenced by an RCU-protected pointer.

Writes are synchronized as usual by the socket lock. Reads only take
place in RCU read-side critical sections. When remote_addr or transport
is updated, a new remote info is allocated. Old readers still see the
old coherent remote_addr/transport pair, and new readers will refer to
the new coherent pair. The coherency between remote_addr and transport
previously provided by the socket lock alone is now also preserved by
RCU, but with a highly scalable lock-free read side.
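
The local_addr scheme from the first item above (an unlocked check,
followed by a re-check under the lock before binding) can be sketched
in user space roughly as follows. This is an illustration under stated
assumptions, not the kernel code from this patch: struct demo_sock,
demo_auto_bind() and demo_send_prepare() are hypothetical names, a
pthread mutex stands in for the socket lock, and a C11 atomic stands in
for the data_race()-annotated read.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a vsock socket; port 0 means "not bound". */
struct demo_sock {
	pthread_mutex_t lock;		/* plays the role of the socket lock */
	atomic_uint local_port;		/* written once, under the lock */
	unsigned int bind_calls;	/* counts real binds, for illustration */
};

/* Caller must hold s->lock. Exits early if the socket is already bound,
 * so local_port is never rewritten after the first successful bind.
 */
static int demo_auto_bind(struct demo_sock *s)
{
	if (atomic_load(&s->local_port) != 0)
		return 0;

	s->bind_calls++;
	atomic_store(&s->local_port, 1024);	/* arbitrary demo port */
	return 0;
}

/* Send-path prologue: take the lock only when the socket looks unbound. */
static int demo_send_prepare(struct demo_sock *s)
{
	int err = 0;

	assert(s);
	if (atomic_load(&s->local_port) == 0) {
		pthread_mutex_lock(&s->lock);
		err = demo_auto_bind(s);
		pthread_mutex_unlock(&s->lock);
	}
	return err;
}
```

Once the first call has bound the socket, later calls to
demo_send_prepare() skip the mutex entirely, which is the property the
dgram send path relies on.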

Helpers are introduced for accessing and updating the new pointer.
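
The idea of publishing remote_addr and transport as one snapshot can be
sketched in user space as below. It is only an analogy under loud
assumptions: C11 atomics replace rcu_replace_pointer() and
rcu_dereference(), all demo_* names are hypothetical, and the sketch
deliberately leaves out read-side critical sections and grace periods,
so an old snapshot may only be freed once no reader can still hold it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical analogue of struct vsock_remote_info: the address and the
 * transport travel together so a reader never sees a mismatched pair.
 */
struct demo_remote_info {
	unsigned int cid;
	unsigned int port;
	int transport_id;	/* stands in for the transport pointer */
};

struct demo_sock {
	_Atomic(struct demo_remote_info *) remote_info;
};

/* Writer: callers must serialize among themselves (the socket lock in the
 * patch). The previous snapshot is handed back through *old and must not
 * be freed while a reader may still use it (kfree_rcu() in the patch).
 */
static int demo_set_remote_info(struct demo_sock *s, unsigned int cid,
				unsigned int port, int transport_id,
				struct demo_remote_info **old)
{
	struct demo_remote_info *new_info;

	assert(s && old);
	new_info = malloc(sizeof(*new_info));
	if (!new_info)
		return -1;

	new_info->cid = cid;
	new_info->port = port;
	new_info->transport_id = transport_id;

	/* Publish the fully initialized snapshot with one pointer swap. */
	*old = atomic_exchange(&s->remote_info, new_info);
	return 0;
}

/* Reader: one pointer load, then copy all fields from that one snapshot. */
static int demo_get_remote_info(struct demo_sock *s,
				struct demo_remote_info *out)
{
	struct demo_remote_info *cur = atomic_load(&s->remote_info);

	if (!cur)
		return -1;

	*out = *cur;
	return 0;
}
```

Because a reader copies every field from the single pointer it loaded,
it sees either the old pair or the new pair, never a mix of the two.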

The new structure contains an rcu_head so that kfree_rcu() can be
used. This removes the need for writers to call synchronize_rcu() before
freeing old structures, which is more efficient and reduces code churn
where remote_addr/transport are already being updated inside RCU
read-side sections.

Only virtio has been tested, but updates were necessary to the VMCI and
hyperv code. Unfortunately the author does not have access to
VMCI/hyperv systems so those changes are untested.

Perf Tests (results from patch v2)
vCPUS: 16
Threads: 16
Payload: 4KB
Test Runs: 5
Type: SOCK_DGRAM

Before: 245.2 MB/s
After: 509.2 MB/s (+107%)

Notably, on the same test system, vsock dgram even outperforms
multi-threaded UDP over virtio-net with vhost and MQ support enabled.

Throughput metrics for single-threaded SOCK_DGRAM and
single/multi-threaded SOCK_STREAM showed no statistically significant
throughput changes (lowest p-value reaching 0.27), with mean
differences ranging from -5% to +1%.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 drivers/vhost/vsock.c                   |  12 +-
 include/linux/virtio_vsock.h            |   3 +-
 include/net/af_vsock.h                  |  38 ++-
 net/vmw_vsock/af_vsock.c                | 399 ++++++++++++++++++++++++++------
 net/vmw_vsock/diag.c                    |  10 +-
 net/vmw_vsock/hyperv_transport.c        |  27 ++-
 net/vmw_vsock/virtio_transport_common.c |  34 ++-
 net/vmw_vsock/vmci_transport.c          |  84 +++++--
 net/vmw_vsock/vsock_bpf.c               |  10 +-
 9 files changed, 492 insertions(+), 125 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 159c1a22c1a8..b027a780d333 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -297,13 +297,17 @@ static int
 vhost_transport_cancel_pkt(struct vsock_sock *vsk)
 {
 	struct vhost_vsock *vsock;
+	unsigned int cid;
 	int cnt = 0;
 	int ret = -ENODEV;
 
 	rcu_read_lock();
+	ret = vsock_remote_addr_cid(vsk, &cid);
+	if (ret < 0)
+		goto out;
 
 	/* Find the vhost_vsock according to guest context id  */
-	vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
+	vsock = vhost_vsock_get(cid);
 	if (!vsock)
 		goto out;
 
@@ -706,6 +710,10 @@ static void vhost_vsock_flush(struct vhost_vsock *vsock)
 static void vhost_vsock_reset_orphans(struct sock *sk)
 {
 	struct vsock_sock *vsk = vsock_sk(sk);
+	unsigned int cid;
+
+	if (vsock_remote_addr_cid(vsk, &cid) < 0)
+		return;
 
 	/* vmci_transport.c doesn't take sk_lock here either.  At least we're
 	 * under vsock_table_lock so the sock cannot disappear while we're
@@ -713,7 +721,7 @@ static void vhost_vsock_reset_orphans(struct sock *sk)
 	 */
 
 	/* If the peer is still valid, no need to reset connection */
-	if (vhost_vsock_get(vsk->remote_addr.svm_cid))
+	if (vhost_vsock_get(cid))
 		return;
 
 	/* If the close timeout is pending, let it expire.  This avoids races
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 237ca87a2ecd..97656e83606f 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -231,7 +231,8 @@ virtio_transport_stream_enqueue(struct vsock_sock *vsk,
 				struct msghdr *msg,
 				size_t len);
 int
-virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
+virtio_transport_dgram_enqueue(const struct vsock_transport *transport,
+			       struct vsock_sock *vsk,
 			       struct sockaddr_vm *remote_addr,
 			       struct msghdr *msg,
 			       size_t len);
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index c115e655b4f5..928b09fbc64b 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -25,12 +25,17 @@ extern spinlock_t vsock_table_lock;
 #define vsock_sk(__sk)    ((struct vsock_sock *)__sk)
 #define sk_vsock(__vsk)   (&(__vsk)->sk)
 
+struct vsock_remote_info {
+	struct sockaddr_vm addr;
+	struct rcu_head rcu;
+	const struct vsock_transport *transport;
+};
+
 struct vsock_sock {
 	/* sk must be the first member. */
 	struct sock sk;
-	const struct vsock_transport *transport;
 	struct sockaddr_vm local_addr;
-	struct sockaddr_vm remote_addr;
+	struct vsock_remote_info __rcu *remote_info;
 	/* Links for the global tables of bound and connected sockets. */
 	struct list_head bound_table;
 	struct list_head connected_table;
@@ -120,8 +125,8 @@ struct vsock_transport {
 
 	/* DGRAM. */
 	int (*dgram_bind)(struct vsock_sock *, struct sockaddr_vm *);
-	int (*dgram_enqueue)(struct vsock_sock *, struct sockaddr_vm *,
-			     struct msghdr *, size_t len);
+	int (*dgram_enqueue)(const struct vsock_transport *, struct vsock_sock *,
+			     struct sockaddr_vm *, struct msghdr *, size_t len);
 	bool (*dgram_allow)(u32 cid, u32 port);
 	int (*dgram_get_cid)(struct sk_buff *skb, unsigned int *cid);
 	int (*dgram_get_port)(struct sk_buff *skb, unsigned int *port);
@@ -196,6 +201,16 @@ void vsock_core_unregister(const struct vsock_transport *t);
 /* The transport may downcast this to access transport-specific functions */
 const struct vsock_transport *vsock_core_get_transport(struct vsock_sock *vsk);
 
+static inline struct vsock_remote_info *
+vsock_core_get_remote_info(struct vsock_sock *vsk)
+{
+	/* vsk->remote_info may be accessed if the rcu read lock is held OR the
+	 * socket lock is held
+	 */
+	return rcu_dereference_check(vsk->remote_info,
+				     lockdep_sock_is_held(sk_vsock(vsk)));
+}
+
 /**** UTILS ****/
 
 /* vsock_table_lock must be held */
@@ -214,7 +229,7 @@ void vsock_release_pending(struct sock *pending);
 void vsock_add_pending(struct sock *listener, struct sock *pending);
 void vsock_remove_pending(struct sock *listener, struct sock *pending);
 void vsock_enqueue_accept(struct sock *listener, struct sock *connected);
-void vsock_insert_connected(struct vsock_sock *vsk);
+int vsock_insert_connected(struct vsock_sock *vsk);
 void vsock_remove_bound(struct vsock_sock *vsk);
 void vsock_remove_connected(struct vsock_sock *vsk);
 struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr);
@@ -223,7 +238,8 @@ struct sock *vsock_find_connected_socket(struct sockaddr_vm *src,
 void vsock_remove_sock(struct vsock_sock *vsk);
 void vsock_for_each_connected_socket(struct vsock_transport *transport,
 				     void (*fn)(struct sock *sk));
-int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk);
+int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk,
+			   struct sockaddr_vm *remote_addr);
 bool vsock_find_cid(unsigned int cid);
 struct sock *vsock_find_bound_dgram_socket(struct sockaddr_vm *addr);
 
@@ -253,4 +269,14 @@ static inline void __init vsock_bpf_build_proto(void)
 {}
 #endif
 
+/* RCU-protected remote addr helpers */
+int vsock_remote_addr_cid(struct vsock_sock *vsk, unsigned int *cid);
+int vsock_remote_addr_port(struct vsock_sock *vsk, unsigned int *port);
+int vsock_remote_addr_cid_port(struct vsock_sock *vsk, unsigned int *cid,
+			       unsigned int *port);
+int vsock_remote_addr_copy(struct vsock_sock *vsk, struct sockaddr_vm *dest);
+bool vsock_remote_addr_bound(struct vsock_sock *vsk);
+bool vsock_remote_addr_equals(struct vsock_sock *vsk, struct sockaddr_vm *other);
+int vsock_remote_addr_update_cid_port(struct vsock_sock *vsk, u32 cid, u32 port);
+
 #endif /* __AF_VSOCK_H__ */
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index b0b18e7f4299..9e620d67889b 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -114,7 +114,12 @@
 static int __vsock_bind(struct sock *sk, struct sockaddr_vm *addr);
 static void vsock_sk_destruct(struct sock *sk);
 static int vsock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
+static bool vsock_use_local_transport(unsigned int remote_cid);
 static bool sock_type_connectible(u16 type);
+static const struct vsock_transport *
+vsock_connectible_lookup_transport(unsigned int cid, __u8 flags);
+static const struct vsock_transport *
+vsock_dgram_lookup_transport(unsigned int cid, __u8 flags);
 
 /* Protocol family. */
 struct proto vsock_proto = {
@@ -146,6 +151,123 @@ static const struct vsock_transport *transport_local;
 static DEFINE_MUTEX(vsock_register_mutex);
 
 /**** UTILS ****/
+bool vsock_remote_addr_bound(struct vsock_sock *vsk)
+{
+	struct vsock_remote_info *remote_info;
+	bool ret;
+
+	rcu_read_lock();
+	remote_info = vsock_core_get_remote_info(vsk);
+	if (!remote_info) {
+		rcu_read_unlock();
+		return false;
+	}
+
+	ret = vsock_addr_bound(&remote_info->addr);
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vsock_remote_addr_bound);
+
+int vsock_remote_addr_copy(struct vsock_sock *vsk, struct sockaddr_vm *dest)
+{
+	struct vsock_remote_info *remote_info;
+
+	rcu_read_lock();
+	remote_info = vsock_core_get_remote_info(vsk);
+	if (!remote_info) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+	memcpy(dest, &remote_info->addr, sizeof(*dest));
+	rcu_read_unlock();
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vsock_remote_addr_copy);
+
+int vsock_remote_addr_cid(struct vsock_sock *vsk, unsigned int *cid)
+{
+	return vsock_remote_addr_cid_port(vsk, cid, NULL);
+}
+EXPORT_SYMBOL_GPL(vsock_remote_addr_cid);
+
+int vsock_remote_addr_port(struct vsock_sock *vsk, unsigned int *port)
+{
+	return vsock_remote_addr_cid_port(vsk, NULL, port);
+}
+EXPORT_SYMBOL_GPL(vsock_remote_addr_port);
+
+int vsock_remote_addr_cid_port(struct vsock_sock *vsk, unsigned int *cid,
+			       unsigned int *port)
+{
+	struct vsock_remote_info *remote_info;
+
+	rcu_read_lock();
+	remote_info = vsock_core_get_remote_info(vsk);
+	if (!remote_info) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+
+	if (cid)
+		*cid = remote_info->addr.svm_cid;
+	if (port)
+		*port = remote_info->addr.svm_port;
+
+	rcu_read_unlock();
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vsock_remote_addr_cid_port);
+
+/* The socket lock must be held by the caller */
+static int vsock_set_remote_info(struct vsock_sock *vsk,
+				 const struct vsock_transport *transport,
+				 struct sockaddr_vm *addr)
+{
+	struct vsock_remote_info *old, *new;
+
+	if (addr || transport) {
+		new = kzalloc(sizeof(*new), GFP_KERNEL);
+		if (!new)
+			return -ENOMEM;
+
+		if (addr)
+			memcpy(&new->addr, addr, sizeof(new->addr));
+
+		if (transport)
+			new->transport = transport;
+	} else {
+		new = NULL;
+	}
+
+	old = rcu_replace_pointer(vsk->remote_info, new,
+				  lockdep_sock_is_held(sk_vsock(vsk)));
+	kfree_rcu(old, rcu);
+
+	return 0;
+}
+
+bool vsock_remote_addr_equals(struct vsock_sock *vsk,
+			      struct sockaddr_vm *other)
+{
+	struct vsock_remote_info *remote_info;
+	bool equals;
+
+	rcu_read_lock();
+	remote_info = vsock_core_get_remote_info(vsk);
+	if (!remote_info) {
+		rcu_read_unlock();
+		return false;
+	}
+
+	equals = vsock_addr_equals_addr(&remote_info->addr, other);
+	rcu_read_unlock();
+
+	return equals;
+}
+EXPORT_SYMBOL_GPL(vsock_remote_addr_equals);
 
 /* Each bound VSocket is stored in the bind hash table and each connected
  * VSocket is stored in the connected hash table.
@@ -283,10 +405,17 @@ static struct sock *__vsock_find_connected_socket(struct sockaddr_vm *src,
 
 	list_for_each_entry(vsk, vsock_connected_sockets(src, dst),
 			    connected_table) {
-		if (vsock_addr_equals_addr(src, &vsk->remote_addr) &&
+		struct vsock_remote_info *remote_info;
+
+		rcu_read_lock();
+		remote_info = vsock_core_get_remote_info(vsk);
+		if (remote_info &&
+		    vsock_addr_equals_addr(src, &remote_info->addr) &&
 		    dst->svm_port == vsk->local_addr.svm_port) {
+			rcu_read_unlock();
 			return sk_vsock(vsk);
 		}
+		rcu_read_unlock();
 	}
 
 	return NULL;
@@ -299,14 +428,25 @@ static void vsock_insert_unbound(struct vsock_sock *vsk)
 	spin_unlock_bh(&vsock_table_lock);
 }
 
-void vsock_insert_connected(struct vsock_sock *vsk)
+int vsock_insert_connected(struct vsock_sock *vsk)
 {
-	struct list_head *list = vsock_connected_sockets(
-		&vsk->remote_addr, &vsk->local_addr);
+	struct vsock_remote_info *remote_info;
+	struct list_head *list;
+
+	rcu_read_lock();
+	remote_info = vsock_core_get_remote_info(vsk);
+	if (!remote_info) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+	list = vsock_connected_sockets(&remote_info->addr, &vsk->local_addr);
+	rcu_read_unlock();
 
 	spin_lock_bh(&vsock_table_lock);
 	__vsock_insert_connected(list, vsk);
 	spin_unlock_bh(&vsock_table_lock);
+
+	return 0;
 }
 EXPORT_SYMBOL_GPL(vsock_insert_connected);
 
@@ -388,7 +528,7 @@ void vsock_for_each_connected_socket(struct vsock_transport *transport,
 		struct vsock_sock *vsk;
 		list_for_each_entry(vsk, &vsock_connected_table[i],
 				    connected_table) {
-			if (vsk->transport != transport)
+			if (vsock_core_get_transport(vsk) != transport)
 				continue;
 
 			fn(sk_vsock(vsk));
@@ -454,12 +594,19 @@ static bool vsock_use_local_transport(unsigned int remote_cid)
 
 static void vsock_deassign_transport(struct vsock_sock *vsk)
 {
-	if (!vsk->transport)
+	struct vsock_remote_info *remote_info;
+
+	remote_info = rcu_replace_pointer(vsk->remote_info, NULL,
+					  lockdep_sock_is_held(sk_vsock(vsk)));
+	if (!remote_info)
 		return;
 
-	vsk->transport->destruct(vsk);
-	module_put(vsk->transport->module);
-	vsk->transport = NULL;
+	if (remote_info->transport) {
+		remote_info->transport->destruct(vsk);
+		module_put(remote_info->transport->module);
+	}
+
+	kfree_rcu(remote_info, rcu);
 }
 
 static const struct vsock_transport *
@@ -490,26 +637,29 @@ vsock_dgram_lookup_transport(unsigned int cid, __u8 flags)
 	return transport_dgram;
 }
 
-/* Assign a transport to a socket and call the .init transport callback.
+/* Assign a transport and remote addr to a socket and call the .init transport
+ * callback.
  *
- * Note: for connection oriented socket this must be called when vsk->remote_addr
- * is set (e.g. during the connect() or when a connection request on a listener
- * socket is received).
- * The vsk->remote_addr is used to decide which transport to use:
+ * The remote_addr is used to decide which transport to use. Both the addr
+ * and transport are updated simultaneously via RCU-protected pointer:
  *  - remote CID == VMADDR_CID_LOCAL or g2h->local_cid or VMADDR_CID_HOST if
  *    g2h is not loaded, will use local transport;
  *  - remote CID <= VMADDR_CID_HOST or h2g is not loaded or remote flags field
  *    includes VMADDR_FLAG_TO_HOST flag value, will use guest->host transport;
  *  - remote CID > VMADDR_CID_HOST will use host->guest transport;
  */
-int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
+int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk,
+			   struct sockaddr_vm *remote_addr)
 {
 	const struct vsock_transport *new_transport;
+	struct vsock_remote_info *old_info;
 	struct sock *sk = sk_vsock(vsk);
-	unsigned int remote_cid = vsk->remote_addr.svm_cid;
+	unsigned int remote_cid;
 	__u8 remote_flags;
 	int ret;
 
+	remote_cid = remote_addr->svm_cid;
+
 	/* If the packet is coming with the source and destination CIDs higher
 	 * than VMADDR_CID_HOST, then a vsock channel where all the packets are
 	 * forwarded to the host should be established. Then the host will
@@ -519,10 +669,10 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
 	 * the connect path the flag can be set by the user space application.
 	 */
 	if (psk && vsk->local_addr.svm_cid > VMADDR_CID_HOST &&
-	    vsk->remote_addr.svm_cid > VMADDR_CID_HOST)
-		vsk->remote_addr.svm_flags |= VMADDR_FLAG_TO_HOST;
+	    remote_cid > VMADDR_CID_HOST)
+		remote_addr->svm_flags |= VMADDR_FLAG_TO_HOST;
 
-	remote_flags = vsk->remote_addr.svm_flags;
+	remote_flags = remote_addr->svm_flags;
 
 	switch (sk->sk_type) {
 	case SOCK_DGRAM:
@@ -538,8 +688,9 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
 		return -ESOCKTNOSUPPORT;
 	}
 
-	if (vsk->transport) {
-		if (vsk->transport == new_transport)
+	old_info = vsock_core_get_remote_info(vsk);
+	if (old_info && old_info->transport) {
+		if (old_info->transport == new_transport)
 			return 0;
 
 		/* transport->release() must be called with sock lock acquired.
@@ -548,7 +699,7 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
 		 * function is called on a new socket which is not assigned to
 		 * any transport.
 		 */
-		vsk->transport->release(vsk);
+		old_info->transport->release(vsk);
 		vsock_deassign_transport(vsk);
 	}
 
@@ -566,13 +717,18 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
 		}
 	}
 
-	ret = new_transport->init(vsk, psk);
+	ret = vsock_set_remote_info(vsk, new_transport, remote_addr);
 	if (ret) {
 		module_put(new_transport->module);
 		return ret;
 	}
 
-	vsk->transport = new_transport;
+	ret = new_transport->init(vsk, psk);
+	if (ret) {
+		vsock_set_remote_info(vsk, NULL, NULL);
+		module_put(new_transport->module);
+		return ret;
+	}
 
 	return 0;
 }
@@ -629,12 +785,14 @@ static bool vsock_is_pending(struct sock *sk)
 
 static int vsock_send_shutdown(struct sock *sk, int mode)
 {
+	const struct vsock_transport *transport;
 	struct vsock_sock *vsk = vsock_sk(sk);
 
-	if (!vsk->transport)
+	transport = vsock_core_get_transport(vsk);
+	if (!transport)
 		return -ENODEV;
 
-	return vsk->transport->shutdown(vsk, mode);
+	return transport->shutdown(vsk, mode);
 }
 
 static void vsock_pending_work(struct work_struct *work)
@@ -757,7 +915,10 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
 static int vsock_bind_dgram(struct vsock_sock *vsk,
 			    struct sockaddr_vm *addr)
 {
-	if (!vsk->transport || !vsk->transport->dgram_bind) {
+	const struct vsock_transport *transport;
+
+	transport = vsock_core_get_transport(vsk);
+	if (!transport || !transport->dgram_bind) {
 		int retval;
 
 		spin_lock_bh(&vsock_dgram_table_lock);
@@ -768,7 +929,7 @@ static int vsock_bind_dgram(struct vsock_sock *vsk,
 		return retval;
 	}
 
-	return vsk->transport->dgram_bind(vsk, addr);
+	return transport->dgram_bind(vsk, addr);
 }
 
 static int __vsock_bind(struct sock *sk, struct sockaddr_vm *addr)
@@ -817,6 +978,7 @@ static struct sock *__vsock_create(struct net *net,
 				   unsigned short type,
 				   int kern)
 {
+	struct vsock_remote_info *remote_info;
 	struct sock *sk;
 	struct vsock_sock *psk;
 	struct vsock_sock *vsk;
@@ -836,7 +998,14 @@ static struct sock *__vsock_create(struct net *net,
 
 	vsk = vsock_sk(sk);
 	vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
-	vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
+
+	remote_info = kzalloc(sizeof(*remote_info), GFP_KERNEL);
+	if (!remote_info) {
+		sk_free(sk);
+		return NULL;
+	}
+	vsock_addr_init(&remote_info->addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
+	rcu_assign_pointer(vsk->remote_info, remote_info);
 
 	sk->sk_destruct = vsock_sk_destruct;
 	sk->sk_backlog_rcv = vsock_queue_rcv_skb;
@@ -883,6 +1052,7 @@ static bool sock_type_connectible(u16 type)
 static void __vsock_release(struct sock *sk, int level)
 {
 	if (sk) {
+		const struct vsock_transport *transport;
 		struct sock *pending;
 		struct vsock_sock *vsk;
 
@@ -896,8 +1066,9 @@ static void __vsock_release(struct sock *sk, int level)
 		 */
 		lock_sock_nested(sk, level);
 
-		if (vsk->transport)
-			vsk->transport->release(vsk);
+		transport = vsock_core_get_transport(vsk);
+		if (transport)
+			transport->release(vsk);
 		else if (sock_type_connectible(sk->sk_type))
 			vsock_remove_sock(vsk);
 
@@ -927,8 +1098,6 @@ static void vsock_sk_destruct(struct sock *sk)
 	 * possibly register the address family with the kernel.
 	 */
 	vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
-	vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
-
 	put_cred(vsk->owner);
 }
 
@@ -952,16 +1121,22 @@ EXPORT_SYMBOL_GPL(vsock_create_connected);
 
 s64 vsock_stream_has_data(struct vsock_sock *vsk)
 {
-	return vsk->transport->stream_has_data(vsk);
+	const struct vsock_transport *transport;
+
+	transport = vsock_core_get_transport(vsk);
+
+	return transport->stream_has_data(vsk);
 }
 EXPORT_SYMBOL_GPL(vsock_stream_has_data);
 
 s64 vsock_connectible_has_data(struct vsock_sock *vsk)
 {
+	const struct vsock_transport *transport;
 	struct sock *sk = sk_vsock(vsk);
 
+	transport = vsock_core_get_transport(vsk);
 	if (sk->sk_type == SOCK_SEQPACKET)
-		return vsk->transport->seqpacket_has_data(vsk);
+		return transport->seqpacket_has_data(vsk);
 	else
 		return vsock_stream_has_data(vsk);
 }
@@ -969,7 +1144,10 @@ EXPORT_SYMBOL_GPL(vsock_connectible_has_data);
 
 s64 vsock_stream_has_space(struct vsock_sock *vsk)
 {
-	return vsk->transport->stream_has_space(vsk);
+	const struct vsock_transport *transport;
+
+	transport = vsock_core_get_transport(vsk);
+	return transport->stream_has_space(vsk);
 }
 EXPORT_SYMBOL_GPL(vsock_stream_has_space);
 
@@ -1018,6 +1196,7 @@ static int vsock_getname(struct socket *sock,
 	struct sock *sk;
 	struct vsock_sock *vsk;
 	struct sockaddr_vm *vm_addr;
+	struct vsock_remote_info *rcu_ptr;
 
 	sk = sock->sk;
 	vsk = vsock_sk(sk);
@@ -1030,7 +1209,14 @@ static int vsock_getname(struct socket *sock,
 			err = -ENOTCONN;
 			goto out;
 		}
-		vm_addr = &vsk->remote_addr;
+
+		rcu_ptr = vsock_core_get_remote_info(vsk);
+		if (!rcu_ptr) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		vm_addr = &rcu_ptr->addr;
 	} else {
 		vm_addr = &vsk->local_addr;
 	}
@@ -1154,7 +1340,7 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
 
 		lock_sock(sk);
 
-		transport = vsk->transport;
+		transport = vsock_core_get_transport(vsk);
 
 		/* Listening sockets that have connections in their accept
 		 * queue can be read.
@@ -1225,9 +1411,11 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
 
 static int vsock_read_skb(struct sock *sk, skb_read_actor_t read_actor)
 {
+	const struct vsock_transport *transport;
 	struct vsock_sock *vsk = vsock_sk(sk);
 
-	return vsk->transport->read_skb(vsk, read_actor);
+	transport = vsock_core_get_transport(vsk);
+	return transport->read_skb(vsk, read_actor);
 }
 
 static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
@@ -1236,7 +1424,7 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 	int err;
 	struct sock *sk;
 	struct vsock_sock *vsk;
-	struct sockaddr_vm *remote_addr;
+	struct sockaddr_vm stack_addr, *remote_addr;
 	const struct vsock_transport *transport;
 
 	if (msg->msg_flags & MSG_OOB)
@@ -1247,7 +1435,23 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 	sk = sock->sk;
 	vsk = vsock_sk(sk);
 
-	lock_sock(sk);
+	/* If auto-binding is required, acquire the slock to avoid potential
+	 * race conditions. Otherwise, do not acquire the lock.
+	 *
+	 * We know that the first check of local_addr is racy (indicated by
+	 * data_race()). By acquiring the lock and then subsequently checking
+	 * again if local_addr is bound (inside vsock_auto_bind()), we can
+	 * ensure there are no real data races.
+	 *
+	 * This technique is borrowed from inet_send_prepare().
+	 */
+	if (data_race(!vsock_addr_bound(&vsk->local_addr))) {
+		lock_sock(sk);
+		err = vsock_auto_bind(vsk);
+		release_sock(sk);
+		if (err)
+			return err;
+	}
 
 	/* If the provided message contains an address, use that.  Otherwise
 	 * fall back on the socket's remote handle (if it has been connected).
@@ -1257,6 +1461,7 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 			    &remote_addr) == 0) {
 		transport = vsock_dgram_lookup_transport(remote_addr->svm_cid,
 							 remote_addr->svm_flags);
+
 		if (!transport) {
 			err = -EINVAL;
 			goto out;
@@ -1287,18 +1492,39 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 			goto out;
 		}
 
-		err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
+		err = transport->dgram_enqueue(transport, vsk, remote_addr, msg, len);
 		module_put(transport->module);
 	} else if (sock->state == SS_CONNECTED) {
-		remote_addr = &vsk->remote_addr;
-		transport = vsk->transport;
+		struct vsock_remote_info *remote_info;
+		const struct vsock_transport *transport;
 
-		err = vsock_auto_bind(vsk);
-		if (err)
+		rcu_read_lock();
+		remote_info = vsock_core_get_remote_info(vsk);
+		if (!remote_info) {
+			err = -EINVAL;
+			rcu_read_unlock();
 			goto out;
+		}
 
-		if (remote_addr->svm_cid == VMADDR_CID_ANY)
+		transport = remote_info->transport;
+		memcpy(&stack_addr, &remote_info->addr, sizeof(stack_addr));
+		rcu_read_unlock();
+
+		remote_addr = &stack_addr;
+
+		if (remote_addr->svm_cid == VMADDR_CID_ANY) {
 			remote_addr->svm_cid = transport->get_local_cid();
+			lock_sock(sk_vsock(vsk));
+			/* Even though the CID has changed, we do not have to
+			 * look up the transport again because the local CID
+			 * will never resolve to a different transport.
+			 */
+			err = vsock_set_remote_info(vsk, transport, remote_addr);
+			release_sock(sk_vsock(vsk));
+
+			if (err)
+				goto out;
+		}
 
 		/* XXX Should connect() or this function ensure remote_addr is
 		 * bound?
@@ -1314,14 +1540,13 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 			goto out;
 		}
 
-		err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
+		err = transport->dgram_enqueue(transport, vsk, &stack_addr, msg, len);
 	} else {
 		err = -EINVAL;
 		goto out;
 	}
 
 out:
-	release_sock(sk);
 	return err;
 }
 
@@ -1332,18 +1557,22 @@ static int vsock_dgram_connect(struct socket *sock,
 	struct sock *sk;
 	struct vsock_sock *vsk;
 	struct sockaddr_vm *remote_addr;
+	const struct vsock_transport *transport;
 
 	sk = sock->sk;
 	vsk = vsock_sk(sk);
 
 	err = vsock_addr_cast(addr, addr_len, &remote_addr);
 	if (err == -EAFNOSUPPORT && remote_addr->svm_family == AF_UNSPEC) {
+		struct sockaddr_vm addr_any;
+
 		lock_sock(sk);
-		vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY,
-				VMADDR_PORT_ANY);
+		vsock_addr_init(&addr_any, VMADDR_CID_ANY, VMADDR_PORT_ANY);
+		err = vsock_set_remote_info(vsk, vsock_core_get_transport(vsk),
+					    &addr_any);
 		sock->state = SS_UNCONNECTED;
 		release_sock(sk);
-		return 0;
+		return err;
 	} else if (err != 0)
 		return -EINVAL;
 
@@ -1353,14 +1582,13 @@ static int vsock_dgram_connect(struct socket *sock,
 	if (err)
 		goto out;
 
-	memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
-
-	err = vsock_assign_transport(vsk, NULL);
+	err = vsock_assign_transport(vsk, NULL, remote_addr);
 	if (err)
 		goto out;
 
-	if (!vsk->transport->dgram_allow(remote_addr->svm_cid,
-					 remote_addr->svm_port)) {
+	transport = vsock_core_get_transport(vsk);
+	if (!transport->dgram_allow(remote_addr->svm_cid,
+				    remote_addr->svm_port)) {
 		err = -EINVAL;
 		goto out;
 	}
@@ -1407,7 +1635,9 @@ int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
 		return -EOPNOTSUPP;
 
-	transport = vsk->transport;
+	rcu_read_lock();
+	transport = vsock_core_get_transport(vsk);
+	rcu_read_unlock();
 
 	/* Retrieve the head sk_buff from the socket's receive queue. */
 	err = 0;
@@ -1475,7 +1705,7 @@ static const struct proto_ops vsock_dgram_ops = {
 
 static int vsock_transport_cancel_pkt(struct vsock_sock *vsk)
 {
-	const struct vsock_transport *transport = vsk->transport;
+	const struct vsock_transport *transport = vsock_core_get_transport(vsk);
 
 	if (!transport || !transport->cancel_pkt)
 		return -EOPNOTSUPP;
@@ -1512,6 +1742,7 @@ static int vsock_connect(struct socket *sock, struct sockaddr *addr,
 	struct sock *sk;
 	struct vsock_sock *vsk;
 	const struct vsock_transport *transport;
+	struct vsock_remote_info *remote_info;
 	struct sockaddr_vm *remote_addr;
 	long timeout;
 	DEFINE_WAIT(wait);
@@ -1549,14 +1780,20 @@ static int vsock_connect(struct socket *sock, struct sockaddr *addr,
 		}
 
 		/* Set the remote address that we are connecting to. */
-		memcpy(&vsk->remote_addr, remote_addr,
-		       sizeof(vsk->remote_addr));
-
-		err = vsock_assign_transport(vsk, NULL);
+		err = vsock_assign_transport(vsk, NULL, remote_addr);
 		if (err)
 			goto out;
 
-		transport = vsk->transport;
+		rcu_read_lock();
+		remote_info = vsock_core_get_remote_info(vsk);
+		if (!remote_info) {
+			err = -EINVAL;
+			rcu_read_unlock();
+			goto out;
+		}
+
+		transport = remote_info->transport;
+		rcu_read_unlock();
 
 		/* The hypervisor and well-known contexts do not have socket
 		 * endpoints.
@@ -1820,7 +2057,7 @@ static int vsock_connectible_setsockopt(struct socket *sock,
 
 	lock_sock(sk);
 
-	transport = vsk->transport;
+	transport = vsock_core_get_transport(vsk);
 
 	switch (optname) {
 	case SO_VM_SOCKETS_BUFFER_SIZE:
@@ -1958,7 +2195,7 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
 
 	lock_sock(sk);
 
-	transport = vsk->transport;
+	transport = vsock_core_get_transport(vsk);
 
 	/* Callers should not provide a destination with connection oriented
 	 * sockets.
@@ -1981,7 +2218,7 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
 		goto out;
 	}
 
-	if (!vsock_addr_bound(&vsk->remote_addr)) {
+	if (!vsock_remote_addr_bound(vsk)) {
 		err = -EDESTADDRREQ;
 		goto out;
 	}
@@ -2102,7 +2339,7 @@ static int vsock_connectible_wait_data(struct sock *sk,
 
 	vsk = vsock_sk(sk);
 	err = 0;
-	transport = vsk->transport;
+	transport = vsock_core_get_transport(vsk);
 
 	while (1) {
 		prepare_to_wait(sk_sleep(sk), wait, TASK_INTERRUPTIBLE);
@@ -2170,7 +2407,7 @@ static int __vsock_stream_recvmsg(struct sock *sk, struct msghdr *msg,
 	DEFINE_WAIT(wait);
 
 	vsk = vsock_sk(sk);
-	transport = vsk->transport;
+	transport = vsock_core_get_transport(vsk);
 
 	/* We must not copy less than target bytes into the user's buffer
 	 * before returning successfully, so we wait for the consume queue to
@@ -2246,7 +2483,7 @@ static int __vsock_seqpacket_recvmsg(struct sock *sk, struct msghdr *msg,
 	DEFINE_WAIT(wait);
 
 	vsk = vsock_sk(sk);
-	transport = vsk->transport;
+	transport = vsock_core_get_transport(vsk);
 
 	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
 
@@ -2303,7 +2540,7 @@ vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 
 	lock_sock(sk);
 
-	transport = vsk->transport;
+	transport = vsock_core_get_transport(vsk);
 
 	if (!transport || sk->sk_state != TCP_ESTABLISHED) {
 		/* Recvmsg is supposed to return 0 if a peer performs an
@@ -2370,7 +2607,7 @@ static int vsock_set_rcvlowat(struct sock *sk, int val)
 	if (val > vsk->buffer_size)
 		return -EINVAL;
 
-	transport = vsk->transport;
+	transport = vsock_core_get_transport(vsk);
 
 	if (transport && transport->set_rcvlowat)
 		return transport->set_rcvlowat(vsk, val);
@@ -2460,7 +2697,10 @@ static int vsock_create(struct net *net, struct socket *sock,
 	vsk = vsock_sk(sk);
 
 	if (sock->type == SOCK_DGRAM) {
-		ret = vsock_assign_transport(vsk, NULL);
+		struct sockaddr_vm remote_addr;
+
+		vsock_addr_init(&remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
+		ret = vsock_assign_transport(vsk, NULL, &remote_addr);
 		if (ret < 0) {
 			sock_put(sk);
 			return ret;
@@ -2582,7 +2822,18 @@ static void __exit vsock_exit(void)
 
 const struct vsock_transport *vsock_core_get_transport(struct vsock_sock *vsk)
 {
-	return vsk->transport;
+	const struct vsock_transport *transport;
+	struct vsock_remote_info *remote_info;
+
+	rcu_read_lock();
+	remote_info = vsock_core_get_remote_info(vsk);
+	if (!remote_info) {
+		rcu_read_unlock();
+		return NULL;
+	}
+	transport = remote_info->transport;
+	rcu_read_unlock();
+	return transport;
 }
 EXPORT_SYMBOL_GPL(vsock_core_get_transport);
 
diff --git a/net/vmw_vsock/diag.c b/net/vmw_vsock/diag.c
index a2823b1c5e28..f843bae86b32 100644
--- a/net/vmw_vsock/diag.c
+++ b/net/vmw_vsock/diag.c
@@ -15,8 +15,14 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
 			u32 portid, u32 seq, u32 flags)
 {
 	struct vsock_sock *vsk = vsock_sk(sk);
+	struct sockaddr_vm remote_addr;
 	struct vsock_diag_msg *rep;
 	struct nlmsghdr *nlh;
+	int err;
+
+	err = vsock_remote_addr_copy(vsk, &remote_addr);
+	if (err < 0)
+		return err;
 
 	nlh = nlmsg_put(skb, portid, seq, SOCK_DIAG_BY_FAMILY, sizeof(*rep),
 			flags);
@@ -36,8 +42,8 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
 	rep->vdiag_shutdown = sk->sk_shutdown;
 	rep->vdiag_src_cid = vsk->local_addr.svm_cid;
 	rep->vdiag_src_port = vsk->local_addr.svm_port;
-	rep->vdiag_dst_cid = vsk->remote_addr.svm_cid;
-	rep->vdiag_dst_port = vsk->remote_addr.svm_port;
+	rep->vdiag_dst_cid = remote_addr.svm_cid;
+	rep->vdiag_dst_port = remote_addr.svm_port;
 	rep->vdiag_ino = sock_i_ino(sk);
 
 	sock_diag_save_cookie(sk, rep->vdiag_cookie);
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index c00bc5da769a..84e8c64b3365 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -323,6 +323,8 @@ static void hvs_open_connection(struct vmbus_channel *chan)
 		goto out;
 
 	if (conn_from_host) {
+		struct sockaddr_vm remote_addr;
+
 		if (sk->sk_ack_backlog >= sk->sk_max_ack_backlog)
 			goto out;
 
@@ -336,10 +338,9 @@ static void hvs_open_connection(struct vmbus_channel *chan)
 		hvs_addr_init(&vnew->local_addr, if_type);
 
 		/* Remote peer is always the host */
-		vsock_addr_init(&vnew->remote_addr,
-				VMADDR_CID_HOST, VMADDR_PORT_ANY);
-		vnew->remote_addr.svm_port = get_port_by_srv_id(if_instance);
-		ret = vsock_assign_transport(vnew, vsock_sk(sk));
+		vsock_addr_init(&remote_addr, VMADDR_CID_HOST, get_port_by_srv_id(if_instance));
+
+		ret = vsock_assign_transport(vnew, vsock_sk(sk), &remote_addr);
 		/* Transport assigned (looking at remote_addr) must be the
 		 * same where we received the request.
 		 */
@@ -459,13 +460,18 @@ static int hvs_connect(struct vsock_sock *vsk)
 {
 	union hvs_service_id vm, host;
 	struct hvsock *h = vsk->trans;
+	int err;
 
 	vm.srv_id = srv_id_template;
 	vm.svm_port = vsk->local_addr.svm_port;
 	h->vm_srv_id = vm.srv_id;
 
 	host.srv_id = srv_id_template;
-	host.svm_port = vsk->remote_addr.svm_port;
+
+	err = vsock_remote_addr_port(vsk, &host.svm_port);
+	if (err < 0)
+		return err;
+
 	h->host_srv_id = host.srv_id;
 
 	return vmbus_send_tl_connect_request(&h->vm_srv_id, &h->host_srv_id);
@@ -566,7 +572,8 @@ static int hvs_dgram_get_length(struct sk_buff *skb, size_t *len)
 	return -EOPNOTSUPP;
 }
 
-static int hvs_dgram_enqueue(struct vsock_sock *vsk,
+static int hvs_dgram_enqueue(const struct vsock_transport *transport,
+			     struct vsock_sock *vsk,
 			     struct sockaddr_vm *remote, struct msghdr *msg,
 			     size_t dgram_len)
 {
@@ -866,7 +873,13 @@ static struct vsock_transport hvs_transport = {
 
 static bool hvs_check_transport(struct vsock_sock *vsk)
 {
-	return vsk->transport == &hvs_transport;
+	bool ret;
+
+	rcu_read_lock();
+	ret = vsock_core_get_transport(vsk) == &hvs_transport;
+	rcu_read_unlock();
+
+	return ret;
 }
 
 static int hvs_probe(struct hv_device *hdev,
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index bc9d459723f5..9d090f208648 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -259,8 +259,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 	src_cid = t_ops->transport.get_local_cid();
 	src_port = vsk->local_addr.svm_port;
 	if (!info->remote_cid) {
-		dst_cid	= vsk->remote_addr.svm_cid;
-		dst_port = vsk->remote_addr.svm_port;
+		ret = vsock_remote_addr_cid_port(vsk, &dst_cid, &dst_port);
+		if (ret < 0)
+			return ret;
 	} else {
 		dst_cid = info->remote_cid;
 		dst_port = info->remote_port;
@@ -878,12 +879,14 @@ int virtio_transport_shutdown(struct vsock_sock *vsk, int mode)
 EXPORT_SYMBOL_GPL(virtio_transport_shutdown);
 
 int
-virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
+virtio_transport_dgram_enqueue(const struct vsock_transport *transport,
+			       struct vsock_sock *vsk,
 			       struct sockaddr_vm *remote_addr,
 			       struct msghdr *msg,
 			       size_t dgram_len)
 {
-	const struct virtio_transport *t_ops;
+	const struct virtio_transport *t_ops =
+		(const struct virtio_transport *)transport;
 	struct virtio_vsock_pkt_info info = {
 		.op = VIRTIO_VSOCK_OP_RW,
 		.msg = msg,
@@ -897,7 +900,6 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
 	if (dgram_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
 		return -EMSGSIZE;
 
-	t_ops = virtio_transport_get_ops(vsk);
 	src_cid = t_ops->transport.get_local_cid();
 	src_port = vsk->local_addr.svm_port;
 
@@ -1121,7 +1123,11 @@ virtio_transport_recv_connecting(struct sock *sk,
 	case VIRTIO_VSOCK_OP_RESPONSE:
 		sk->sk_state = TCP_ESTABLISHED;
 		sk->sk_socket->state = SS_CONNECTED;
-		vsock_insert_connected(vsk);
+		err = vsock_insert_connected(vsk);
+		if (err) {
+			skerr = ECONNRESET;
+			goto destroy;
+		}
 		sk->sk_state_change(sk);
 		break;
 	case VIRTIO_VSOCK_OP_INVALID:
@@ -1323,6 +1329,7 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
 	struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
 	struct vsock_sock *vsk = vsock_sk(sk);
 	struct vsock_sock *vchild;
+	struct sockaddr_vm child_remote;
 	struct sock *child;
 	int ret;
 
@@ -1351,14 +1358,13 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
 	vchild = vsock_sk(child);
 	vsock_addr_init(&vchild->local_addr, le64_to_cpu(hdr->dst_cid),
 			le32_to_cpu(hdr->dst_port));
-	vsock_addr_init(&vchild->remote_addr, le64_to_cpu(hdr->src_cid),
+	vsock_addr_init(&child_remote, le64_to_cpu(hdr->src_cid),
 			le32_to_cpu(hdr->src_port));
-
-	ret = vsock_assign_transport(vchild, vsk);
+	ret = vsock_assign_transport(vchild, vsk, &child_remote);
 	/* Transport assigned (looking at remote_addr) must be the same
 	 * where we received the request.
 	 */
-	if (ret || vchild->transport != &t->transport) {
+	if (ret || vsock_core_get_transport(vchild) != &t->transport) {
 		release_sock(child);
 		virtio_transport_reset_no_sock(t, skb);
 		sock_put(child);
@@ -1368,7 +1374,13 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
 	if (virtio_transport_space_update(child, skb))
 		child->sk_write_space(child);
 
-	vsock_insert_connected(vchild);
+	ret = vsock_insert_connected(vchild);
+	if (ret) {
+		release_sock(child);
+		virtio_transport_reset_no_sock(t, skb);
+		sock_put(child);
+		return ret;
+	}
 	vsock_enqueue_accept(sk, child);
 	virtio_transport_send_response(vchild, skb);
 
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index bbc63826bf48..943539857ccb 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -283,18 +283,25 @@ vmci_transport_send_control_pkt(struct sock *sk,
 				u16 proto,
 				struct vmci_handle handle)
 {
+	struct sockaddr_vm addr_stack;
+	struct sockaddr_vm *remote_addr = &addr_stack;
 	struct vsock_sock *vsk;
+	int err;
 
 	vsk = vsock_sk(sk);
 
 	if (!vsock_addr_bound(&vsk->local_addr))
 		return -EINVAL;
 
-	if (!vsock_addr_bound(&vsk->remote_addr))
+	if (!vsock_remote_addr_bound(vsk))
 		return -EINVAL;
 
+	err = vsock_remote_addr_copy(vsk, remote_addr);
+	if (err < 0)
+		return err;
+
 	return vmci_transport_alloc_send_control_pkt(&vsk->local_addr,
-						     &vsk->remote_addr,
+						     remote_addr,
 						     type, size, mode,
 						     wait, proto, handle);
 }
@@ -317,6 +324,7 @@ static int vmci_transport_send_reset(struct sock *sk,
 	struct sockaddr_vm *dst_ptr;
 	struct sockaddr_vm dst;
 	struct vsock_sock *vsk;
+	int err;
 
 	if (pkt->type == VMCI_TRANSPORT_PACKET_TYPE_RST)
 		return 0;
@@ -326,13 +334,16 @@ static int vmci_transport_send_reset(struct sock *sk,
 	if (!vsock_addr_bound(&vsk->local_addr))
 		return -EINVAL;
 
-	if (vsock_addr_bound(&vsk->remote_addr)) {
-		dst_ptr = &vsk->remote_addr;
+	if (vsock_remote_addr_bound(vsk)) {
+		err = vsock_remote_addr_copy(vsk, &dst);
+		if (err < 0)
+			return err;
 	} else {
 		vsock_addr_init(&dst, pkt->dg.src.context,
 				pkt->src_port);
-		dst_ptr = &dst;
 	}
+	dst_ptr = &dst;
+
 	return vmci_transport_alloc_send_control_pkt(&vsk->local_addr, dst_ptr,
 					     VMCI_TRANSPORT_PACKET_TYPE_RST,
 					     0, 0, NULL, VSOCK_PROTO_INVALID,
@@ -490,7 +501,7 @@ static struct sock *vmci_transport_get_pending(
 
 	list_for_each_entry(vpending, &vlistener->pending_links,
 			    pending_links) {
-		if (vsock_addr_equals_addr(&src, &vpending->remote_addr) &&
+		if (vsock_remote_addr_equals(vpending, &src) &&
 		    pkt->dst_port == vpending->local_addr.svm_port) {
 			pending = sk_vsock(vpending);
 			sock_hold(pending);
@@ -940,6 +951,7 @@ static void vmci_transport_recv_pkt_work(struct work_struct *work)
 static int vmci_transport_recv_listen(struct sock *sk,
 				      struct vmci_transport_packet *pkt)
 {
+	struct sockaddr_vm remote_addr;
 	struct sock *pending;
 	struct vsock_sock *vpending;
 	int err;
@@ -1015,10 +1027,10 @@ static int vmci_transport_recv_listen(struct sock *sk,
 
 	vsock_addr_init(&vpending->local_addr, pkt->dg.dst.context,
 			pkt->dst_port);
-	vsock_addr_init(&vpending->remote_addr, pkt->dg.src.context,
-			pkt->src_port);
 
-	err = vsock_assign_transport(vpending, vsock_sk(sk));
+	vsock_addr_init(&remote_addr, pkt->dg.src.context, pkt->src_port);
+
+	err = vsock_assign_transport(vpending, vsock_sk(sk), &remote_addr);
 	/* Transport assigned (looking at remote_addr) must be the same
 	 * where we received the request.
 	 */
@@ -1133,6 +1145,7 @@ vmci_transport_recv_connecting_server(struct sock *listener,
 {
 	struct vsock_sock *vpending;
 	struct vmci_handle handle;
+	unsigned int vpending_remote_cid;
 	struct vmci_qp *qpair;
 	bool is_local;
 	u32 flags;
@@ -1189,8 +1202,13 @@ vmci_transport_recv_connecting_server(struct sock *listener,
 	/* vpending->local_addr always has a context id so we do not need to
 	 * worry about VMADDR_CID_ANY in this case.
 	 */
-	is_local =
-	    vpending->remote_addr.svm_cid == vpending->local_addr.svm_cid;
+	err = vsock_remote_addr_cid(vpending, &vpending_remote_cid);
+	if (err < 0) {
+		skerr = EPROTO;
+		goto destroy;
+	}
+
+	is_local = vpending_remote_cid == vpending->local_addr.svm_cid;
 	flags = VMCI_QPFLAG_ATTACH_ONLY;
 	flags |= is_local ? VMCI_QPFLAG_LOCAL : 0;
 
@@ -1203,7 +1221,7 @@ vmci_transport_recv_connecting_server(struct sock *listener,
 					flags,
 					vmci_transport_is_trusted(
 						vpending,
-						vpending->remote_addr.svm_cid));
+						vpending_remote_cid));
 	if (err < 0) {
 		vmci_transport_send_reset(pending, pkt);
 		skerr = -err;
@@ -1277,6 +1295,8 @@ static int
 vmci_transport_recv_connecting_client(struct sock *sk,
 				      struct vmci_transport_packet *pkt)
 {
+	struct vsock_remote_info *remote_info;
+	struct sockaddr_vm *remote_addr;
 	struct vsock_sock *vsk;
 	int err;
 	int skerr;
@@ -1306,9 +1326,20 @@ vmci_transport_recv_connecting_client(struct sock *sk,
 		break;
 	case VMCI_TRANSPORT_PACKET_TYPE_NEGOTIATE:
 	case VMCI_TRANSPORT_PACKET_TYPE_NEGOTIATE2:
+		rcu_read_lock();
+		remote_info = vsock_core_get_remote_info(vsk);
+		if (!remote_info) {
+			skerr = EPROTO;
+			err = -EINVAL;
+			rcu_read_unlock();
+			goto destroy;
+		}
+
+		remote_addr = &remote_info->addr;
+
 		if (pkt->u.size == 0
-		    || pkt->dg.src.context != vsk->remote_addr.svm_cid
-		    || pkt->src_port != vsk->remote_addr.svm_port
+		    || pkt->dg.src.context != remote_addr->svm_cid
+		    || pkt->src_port != remote_addr->svm_port
 		    || !vmci_handle_is_invalid(vmci_trans(vsk)->qp_handle)
 		    || vmci_trans(vsk)->qpair
 		    || vmci_trans(vsk)->produce_size != 0
@@ -1316,9 +1347,10 @@ vmci_transport_recv_connecting_client(struct sock *sk,
 		    || vmci_trans(vsk)->detach_sub_id != VMCI_INVALID_ID) {
 			skerr = EPROTO;
 			err = -EINVAL;
-
+			rcu_read_unlock();
 			goto destroy;
 		}
+		rcu_read_unlock();
 
 		err = vmci_transport_recv_connecting_client_negotiate(sk, pkt);
 		if (err) {
@@ -1379,6 +1411,7 @@ static int vmci_transport_recv_connecting_client_negotiate(
 	int err;
 	struct vsock_sock *vsk;
 	struct vmci_handle handle;
+	unsigned int remote_cid;
 	struct vmci_qp *qpair;
 	u32 detach_sub_id;
 	bool is_local;
@@ -1449,19 +1482,23 @@ static int vmci_transport_recv_connecting_client_negotiate(
 
 	/* Make VMCI select the handle for us. */
 	handle = VMCI_INVALID_HANDLE;
-	is_local = vsk->remote_addr.svm_cid == vsk->local_addr.svm_cid;
+
+	err = vsock_remote_addr_cid(vsk, &remote_cid);
+	if (err < 0)
+		goto destroy;
+
+	is_local = remote_cid == vsk->local_addr.svm_cid;
 	flags = is_local ? VMCI_QPFLAG_LOCAL : 0;
 
 	err = vmci_transport_queue_pair_alloc(&qpair,
 					      &handle,
 					      pkt->u.size,
 					      pkt->u.size,
-					      vsk->remote_addr.svm_cid,
+					      remote_cid,
 					      flags,
 					      vmci_transport_is_trusted(
 						  vsk,
-						  vsk->
-						  remote_addr.svm_cid));
+						  remote_cid));
 	if (err < 0)
 		goto destroy;
 
@@ -1692,6 +1729,7 @@ static int vmci_transport_dgram_bind(struct vsock_sock *vsk,
 }
 
 static int vmci_transport_dgram_enqueue(
+	const struct vsock_transport *transport,
 	struct vsock_sock *vsk,
 	struct sockaddr_vm *remote_addr,
 	struct msghdr *msg,
@@ -2052,7 +2090,13 @@ static struct vsock_transport vmci_transport = {
 
 static bool vmci_check_transport(struct vsock_sock *vsk)
 {
-	return vsk->transport == &vmci_transport;
+	bool retval;
+
+	rcu_read_lock();
+	retval = vsock_core_get_transport(vsk) == &vmci_transport;
+	rcu_read_unlock();
+
+	return retval;
 }
 
 static void vmci_vsock_transport_cb(bool is_host)
diff --git a/net/vmw_vsock/vsock_bpf.c b/net/vmw_vsock/vsock_bpf.c
index a3c97546ab84..4d811c9cdf6e 100644
--- a/net/vmw_vsock/vsock_bpf.c
+++ b/net/vmw_vsock/vsock_bpf.c
@@ -148,6 +148,7 @@ static void vsock_bpf_check_needs_rebuild(struct proto *ops)
 
 int vsock_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
 {
+	const struct vsock_transport *transport;
 	struct vsock_sock *vsk;
 
 	if (restore) {
@@ -157,10 +158,15 @@ int vsock_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore
 	}
 
 	vsk = vsock_sk(sk);
-	if (!vsk->transport)
+
+	rcu_read_lock();
+	transport = vsock_core_get_transport(vsk);
+	rcu_read_unlock();
+
+	if (!transport)
 		return -ENODEV;
 
-	if (!vsk->transport->read_skb)
+	if (!transport->read_skb)
 		return -EOPNOTSUPP;
 
 	vsock_bpf_check_needs_rebuild(psock->sk_proto);

-- 
2.30.2


* [PATCH RFC net-next v4 4/8] vsock: make vsock bind reusable
From: Bobby Eshleman @ 2023-06-10  0:58 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, Krasnov Arseniy, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, Bobby Eshleman
In-Reply-To: <20230413-b4-vsock-dgram-v4-0-0cebbb2ae899@bytedance.com>

This commit makes the vsock bind table management functions usable with
different bind tables, so that a future patch can reuse them for datagrams.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 net/vmw_vsock/af_vsock.c | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)
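
Note for readers: the refactoring described above can be pictured with a
small user-space model. This is only a sketch, not kernel code. The names
struct bind_table, struct bound_sock, find_bound_common(), bind_common()
and TABLE_SIZE are hypothetical and chosen for illustration. It shows just
the idea that the lookup and bind helpers take the table as a parameter, so
separate tables (for example one for connectible sockets and one for
datagrams) can hold the same port independently.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TABLE_SIZE 8	/* illustrative only; not the kernel's VSOCK_HASH_SIZE */

/* Hypothetical stand-in for a bound socket; only the port is modelled. */
struct bound_sock {
	unsigned int port;
	struct bound_sock *next;
};

/* One bucket per hash value; each bucket is a singly linked list. */
struct bind_table {
	struct bound_sock *bucket[TABLE_SIZE];
};

/* Look up a port in the table passed by the caller. */
static struct bound_sock *find_bound_common(const struct bind_table *table,
					    unsigned int port)
{
	struct bound_sock *s;

	assert(table != NULL);
	for (s = table->bucket[port % TABLE_SIZE]; s; s = s->next)
		if (s->port == port)
			return s;
	return NULL;
}

/* Bind sk to port in the given table, refusing duplicates in that table. */
static int bind_common(struct bind_table *table, struct bound_sock *sk,
		       unsigned int port)
{
	if (find_bound_common(table, port))
		return -EADDRINUSE;

	sk->port = port;
	sk->next = table->bucket[port % TABLE_SIZE];
	table->bucket[port % TABLE_SIZE] = sk;
	return 0;
}
```

In this model, binding port 1234 in two different tables succeeds in both,
while a second bind of the same port within one table returns -EADDRINUSE.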

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index ef86765f3765..7a3ca4270446 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -230,11 +230,12 @@ static void __vsock_remove_connected(struct vsock_sock *vsk)
 	sock_put(&vsk->sk);
 }
 
-static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
+struct sock *vsock_find_bound_socket_common(struct sockaddr_vm *addr,
+					    struct list_head *bind_table)
 {
 	struct vsock_sock *vsk;
 
-	list_for_each_entry(vsk, vsock_bound_sockets(addr), bound_table) {
+	list_for_each_entry(vsk, bind_table, bound_table) {
 		if (vsock_addr_equals_addr(addr, &vsk->local_addr))
 			return sk_vsock(vsk);
 
@@ -247,6 +248,11 @@ static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
 	return NULL;
 }
 
+static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
+{
+	return vsock_find_bound_socket_common(addr, vsock_bound_sockets(addr));
+}
+
 static struct sock *__vsock_find_connected_socket(struct sockaddr_vm *src,
 						  struct sockaddr_vm *dst)
 {
@@ -646,12 +652,17 @@ static void vsock_pending_work(struct work_struct *work)
 
 /**** SOCKET OPERATIONS ****/
 
-static int __vsock_bind_connectible(struct vsock_sock *vsk,
-				    struct sockaddr_vm *addr)
+static int vsock_bind_common(struct vsock_sock *vsk,
+			     struct sockaddr_vm *addr,
+			     struct list_head *bind_table,
+			     size_t table_size)
 {
 	static u32 port;
 	struct sockaddr_vm new_addr;
 
+	if (table_size < VSOCK_HASH_SIZE)
+		return -EINVAL;
+
 	if (!port)
 		port = get_random_u32_above(LAST_RESERVED_PORT);
 
@@ -667,7 +678,8 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
 
 			new_addr.svm_port = port++;
 
-			if (!__vsock_find_bound_socket(&new_addr)) {
+			if (!vsock_find_bound_socket_common(&new_addr,
+							    &bind_table[VSOCK_HASH(addr)])) {
 				found = true;
 				break;
 			}
@@ -684,7 +696,8 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
 			return -EACCES;
 		}
 
-		if (__vsock_find_bound_socket(&new_addr))
+		if (vsock_find_bound_socket_common(&new_addr,
+						   &bind_table[VSOCK_HASH(addr)]))
 			return -EADDRINUSE;
 	}
 
@@ -696,11 +709,17 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
 	 * by AF_UNIX.
 	 */
 	__vsock_remove_bound(vsk);
-	__vsock_insert_bound(vsock_bound_sockets(&vsk->local_addr), vsk);
+	__vsock_insert_bound(&bind_table[VSOCK_HASH(&vsk->local_addr)], vsk);
 
 	return 0;
 }
 
+static int __vsock_bind_connectible(struct vsock_sock *vsk,
+				    struct sockaddr_vm *addr)
+{
+	return vsock_bind_common(vsk, addr, vsock_bind_table, VSOCK_HASH_SIZE + 1);
+}
+
 static int __vsock_bind_dgram(struct vsock_sock *vsk,
 			      struct sockaddr_vm *addr)
 {

-- 
2.30.2


* [PATCH RFC net-next v4 6/8] virtio/vsock: support dgrams
From: Bobby Eshleman @ 2023-06-10  0:58 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, Krasnov Arseniy, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, Bobby Eshleman
In-Reply-To: <20230413-b4-vsock-dgram-v4-0-0cebbb2ae899@bytedance.com>

This commit adds support for datagrams over virtio/vsock.

Message boundaries are preserved on a per-skb and per-vq entry basis.
Each message is copied in whole from user space into a single skb, which
in turn is added in whole to the virtqueue's scatterlist for the device.
Messages do not straddle skbs or packets. A message is truncated on
receive if the user's buffer is shorter than the message.

Other properties of vsock datagrams:
- Datagrams self-throttle at the per-socket sk_sndbuf threshold.
- Datagrams use the same virtqueue as stream and seqpacket flows.
- Credits are not used for datagrams.
- Packets are dropped silently by the device, which means the virtqueue
  will still get kicked even during high packet loss, so long as the
  socket does not exceed sk_sndbuf.

Future work might include finding a way to reduce the virtqueue kick
rate for datagram flows with high packet loss.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 drivers/vhost/vsock.c                   |  27 ++++-
 include/linux/virtio_vsock.h            |   5 +-
 include/net/af_vsock.h                  |   1 +
 include/uapi/linux/virtio_vsock.h       |   1 +
 net/vmw_vsock/af_vsock.c                |  58 +++++++--
 net/vmw_vsock/virtio_transport.c        |  23 +++-
 net/vmw_vsock/virtio_transport_common.c | 207 ++++++++++++++++++++++++--------
 net/vmw_vsock/vsock_loopback.c          |   8 +-
 8 files changed, 264 insertions(+), 66 deletions(-)
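
Note for readers: the truncation rule in the commit message is the usual
datagram behaviour on Linux. AF_VSOCK datagrams over virtio need this
series, so the sketch below uses an AF_UNIX SOCK_DGRAM socketpair as a
stand-in. That substitution is an assumption made purely so the example
runs on an unpatched system, and dgram_roundtrip() is a hypothetical
helper. It sends one message and receives it into a caller-sized buffer,
reporting whether the kernel flagged the receive with MSG_TRUNC.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Send msg as a single datagram and receive it into buf (bufsz bytes).
 * Returns the number of bytes copied into buf, or -1 on error.
 * *truncated is set to 1 if the message did not fit, 0 otherwise.
 */
static ssize_t dgram_roundtrip(const char *msg, char *buf, size_t bufsz,
			       int *truncated)
{
	struct iovec iov = { .iov_base = buf, .iov_len = bufsz };
	struct msghdr hdr = { .msg_iov = &iov, .msg_iovlen = 1 };
	ssize_t n = -1;
	int fds[2];

	assert(msg && buf && truncated);

	/* AF_UNIX stands in for AF_VSOCK here (assumption, see note above). */
	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) < 0)
		return -1;

	if (send(fds[0], msg, strlen(msg), 0) == (ssize_t)strlen(msg)) {
		n = recvmsg(fds[1], &hdr, 0);
		if (n >= 0)
			*truncated = !!(hdr.msg_flags & MSG_TRUNC);
	}

	close(fds[0]);
	close(fds[1]);
	return n;
}
```

The excess bytes of a truncated datagram are discarded rather than left for
a later read, which is why a receiver that cares about whole messages
should size its buffer for the largest message it expects.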

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 8f0082da5e70..159c1a22c1a8 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -32,7 +32,8 @@
 enum {
 	VHOST_VSOCK_FEATURES = VHOST_FEATURES |
 			       (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
-			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
+			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
+			       (1ULL << VIRTIO_VSOCK_F_DGRAM)
 };
 
 enum {
@@ -56,6 +57,7 @@ struct vhost_vsock {
 	atomic_t queued_replies;
 
 	u32 guest_cid;
+	bool dgram_allow;
 	bool seqpacket_allow;
 };
 
@@ -394,6 +396,7 @@ static bool vhost_vsock_more_replies(struct vhost_vsock *vsock)
 	return val < vq->num;
 }
 
+static bool vhost_transport_dgram_allow(u32 cid, u32 port);
 static bool vhost_transport_seqpacket_allow(u32 remote_cid);
 
 static struct virtio_transport vhost_transport = {
@@ -410,10 +413,11 @@ static struct virtio_transport vhost_transport = {
 		.cancel_pkt               = vhost_transport_cancel_pkt,
 
 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
-		.dgram_allow              = virtio_transport_dgram_allow,
+		.dgram_allow              = vhost_transport_dgram_allow,
 		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
 		.dgram_get_port		  = virtio_transport_dgram_get_port,
 		.dgram_get_length	  = virtio_transport_dgram_get_length,
+		.dgram_payload_offset	  = 0,
 
 		.stream_enqueue           = virtio_transport_stream_enqueue,
 		.stream_dequeue           = virtio_transport_stream_dequeue,
@@ -446,6 +450,22 @@ static struct virtio_transport vhost_transport = {
 	.send_pkt = vhost_transport_send_pkt,
 };
 
+static bool vhost_transport_dgram_allow(u32 cid, u32 port)
+{
+	struct vhost_vsock *vsock;
+	bool dgram_allow = false;
+
+	rcu_read_lock();
+	vsock = vhost_vsock_get(cid);
+
+	if (vsock)
+		dgram_allow = vsock->dgram_allow;
+
+	rcu_read_unlock();
+
+	return dgram_allow;
+}
+
 static bool vhost_transport_seqpacket_allow(u32 remote_cid)
 {
 	struct vhost_vsock *vsock;
@@ -802,6 +822,9 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
 	if (features & (1ULL << VIRTIO_VSOCK_F_SEQPACKET))
 		vsock->seqpacket_allow = true;
 
+	if (features & (1ULL << VIRTIO_VSOCK_F_DGRAM))
+		vsock->dgram_allow = true;
+
 	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
 		vq = &vsock->vqs[i];
 		mutex_lock(&vq->mutex);
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 73afa09f4585..237ca87a2ecd 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -216,7 +216,6 @@ void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val);
 u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
 bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
 bool virtio_transport_stream_allow(u32 cid, u32 port);
-bool virtio_transport_dgram_allow(u32 cid, u32 port);
 int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid);
 int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port);
 int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len);
@@ -247,4 +246,8 @@ void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit);
 void virtio_transport_deliver_tap_pkt(struct sk_buff *skb);
 int virtio_transport_purge_skbs(void *vsk, struct sk_buff_head *list);
 int virtio_transport_read_skb(struct vsock_sock *vsk, skb_read_actor_t read_actor);
+void virtio_transport_init_dgram_bind_tables(void);
+int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid);
+int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port);
+int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len);
 #endif /* _LINUX_VIRTIO_VSOCK_H */
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 7bedb9ee7e3e..c115e655b4f5 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -225,6 +225,7 @@ void vsock_for_each_connected_socket(struct vsock_transport *transport,
 				     void (*fn)(struct sock *sk));
 int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk);
 bool vsock_find_cid(unsigned int cid);
+struct sock *vsock_find_bound_dgram_socket(struct sockaddr_vm *addr);
 
 /**** TAP ****/
 
diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
index 9c25f267bbc0..27b4b2b8bf13 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
 enum virtio_vsock_type {
 	VIRTIO_VSOCK_TYPE_STREAM = 1,
 	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
+	VIRTIO_VSOCK_TYPE_DGRAM = 3,
 };
 
 enum virtio_vsock_op {
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 7a3ca4270446..b0b18e7f4299 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -114,6 +114,7 @@
 static int __vsock_bind(struct sock *sk, struct sockaddr_vm *addr);
 static void vsock_sk_destruct(struct sock *sk);
 static int vsock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
+static bool sock_type_connectible(u16 type);
 
 /* Protocol family. */
 struct proto vsock_proto = {
@@ -180,6 +181,8 @@ struct list_head vsock_connected_table[VSOCK_HASH_SIZE];
 EXPORT_SYMBOL_GPL(vsock_connected_table);
 DEFINE_SPINLOCK(vsock_table_lock);
 EXPORT_SYMBOL_GPL(vsock_table_lock);
+static struct list_head vsock_dgram_bind_table[VSOCK_HASH_SIZE];
+static DEFINE_SPINLOCK(vsock_dgram_table_lock);
 
 /* Autobind this socket to the local address if necessary. */
 static int vsock_auto_bind(struct vsock_sock *vsk)
@@ -202,6 +205,9 @@ static void vsock_init_tables(void)
 
 	for (i = 0; i < ARRAY_SIZE(vsock_connected_table); i++)
 		INIT_LIST_HEAD(&vsock_connected_table[i]);
+
+	for (i = 0; i < ARRAY_SIZE(vsock_dgram_bind_table); i++)
+		INIT_LIST_HEAD(&vsock_dgram_bind_table[i]);
 }
 
 static void __vsock_insert_bound(struct list_head *list,
@@ -230,8 +236,8 @@ static void __vsock_remove_connected(struct vsock_sock *vsk)
 	sock_put(&vsk->sk);
 }
 
-struct sock *vsock_find_bound_socket_common(struct sockaddr_vm *addr,
-					    struct list_head *bind_table)
+static struct sock *vsock_find_bound_socket_common(struct sockaddr_vm *addr,
+						   struct list_head *bind_table)
 {
 	struct vsock_sock *vsk;
 
@@ -248,6 +254,23 @@ struct sock *vsock_find_bound_socket_common(struct sockaddr_vm *addr,
 	return NULL;
 }
 
+struct sock *
+vsock_find_bound_dgram_socket(struct sockaddr_vm *addr)
+{
+	struct sock *sk;
+
+	spin_lock_bh(&vsock_dgram_table_lock);
+	sk = vsock_find_bound_socket_common(addr,
+					    &vsock_dgram_bind_table[VSOCK_HASH(addr)]);
+	if (sk)
+		sock_hold(sk);
+
+	spin_unlock_bh(&vsock_dgram_table_lock);
+
+	return sk;
+}
+EXPORT_SYMBOL_GPL(vsock_find_bound_dgram_socket);
+
 static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
 {
 	return vsock_find_bound_socket_common(addr, vsock_bound_sockets(addr));
@@ -287,6 +310,14 @@ void vsock_insert_connected(struct vsock_sock *vsk)
 }
 EXPORT_SYMBOL_GPL(vsock_insert_connected);
 
+static void vsock_remove_dgram_bound(struct vsock_sock *vsk)
+{
+	spin_lock_bh(&vsock_dgram_table_lock);
+	if (__vsock_in_bound_table(vsk))
+		__vsock_remove_bound(vsk);
+	spin_unlock_bh(&vsock_dgram_table_lock);
+}
+
 void vsock_remove_bound(struct vsock_sock *vsk)
 {
 	spin_lock_bh(&vsock_table_lock);
@@ -338,7 +369,10 @@ EXPORT_SYMBOL_GPL(vsock_find_connected_socket);
 
 void vsock_remove_sock(struct vsock_sock *vsk)
 {
-	vsock_remove_bound(vsk);
+	if (sock_type_connectible(sk_vsock(vsk)->sk_type))
+		vsock_remove_bound(vsk);
+	else
+		vsock_remove_dgram_bound(vsk);
 	vsock_remove_connected(vsk);
 }
 EXPORT_SYMBOL_GPL(vsock_remove_sock);
@@ -720,11 +754,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
 	return vsock_bind_common(vsk, addr, vsock_bind_table, VSOCK_HASH_SIZE + 1);
 }
 
-static int __vsock_bind_dgram(struct vsock_sock *vsk,
-			      struct sockaddr_vm *addr)
+static int vsock_bind_dgram(struct vsock_sock *vsk,
+			    struct sockaddr_vm *addr)
 {
-	if (!vsk->transport || !vsk->transport->dgram_bind)
-		return -EINVAL;
+	if (!vsk->transport || !vsk->transport->dgram_bind) {
+		int retval;
+
+		spin_lock_bh(&vsock_dgram_table_lock);
+		retval = vsock_bind_common(vsk, addr, vsock_dgram_bind_table,
+					   VSOCK_HASH_SIZE);
+		spin_unlock_bh(&vsock_dgram_table_lock);
+
+		return retval;
+	}
 
 	return vsk->transport->dgram_bind(vsk, addr);
 }
@@ -755,7 +797,7 @@ static int __vsock_bind(struct sock *sk, struct sockaddr_vm *addr)
 		break;
 
 	case SOCK_DGRAM:
-		retval = __vsock_bind_dgram(vsk, addr);
+		retval = vsock_bind_dgram(vsk, addr);
 		break;
 
 	default:
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 1b7843a7779a..7160a3104218 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -63,6 +63,7 @@ struct virtio_vsock {
 
 	u32 guest_cid;
 	bool seqpacket_allow;
+	bool dgram_allow;
 };
 
 static u32 virtio_transport_get_local_cid(void)
@@ -413,6 +414,7 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
 	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
 }
 
+static bool virtio_transport_dgram_allow(u32 cid, u32 port);
 static bool virtio_transport_seqpacket_allow(u32 remote_cid);
 
 static struct virtio_transport virtio_transport = {
@@ -465,6 +467,21 @@ static struct virtio_transport virtio_transport = {
 	.send_pkt = virtio_transport_send_pkt,
 };
 
+static bool virtio_transport_dgram_allow(u32 cid, u32 port)
+{
+	struct virtio_vsock *vsock;
+	bool dgram_allow;
+
+	dgram_allow = false;
+	rcu_read_lock();
+	vsock = rcu_dereference(the_virtio_vsock);
+	if (vsock)
+		dgram_allow = vsock->dgram_allow;
+	rcu_read_unlock();
+
+	return dgram_allow;
+}
+
 static bool virtio_transport_seqpacket_allow(u32 remote_cid)
 {
 	struct virtio_vsock *vsock;
@@ -658,6 +675,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_SEQPACKET))
 		vsock->seqpacket_allow = true;
 
+	if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_DGRAM))
+		vsock->dgram_allow = true;
+
 	vdev->priv = vsock;
 
 	ret = virtio_vsock_vqs_init(vsock);
@@ -750,7 +770,8 @@ static struct virtio_device_id id_table[] = {
 };
 
 static unsigned int features[] = {
-	VIRTIO_VSOCK_F_SEQPACKET
+	VIRTIO_VSOCK_F_SEQPACKET,
+	VIRTIO_VSOCK_F_DGRAM
 };
 
 static struct virtio_driver virtio_vsock_driver = {
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index d5a3c8efe84b..bc9d459723f5 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -37,6 +37,35 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
 	return container_of(t, struct virtio_transport, transport);
 }
 
+/* Requires info->msg and info->vsk */
+static struct sk_buff *
+virtio_transport_sock_alloc_send_skb(struct virtio_vsock_pkt_info *info, unsigned int size,
+				     gfp_t mask, int *err)
+{
+	struct sk_buff *skb;
+	struct sock *sk;
+	int noblock;
+
+	if (size < VIRTIO_VSOCK_SKB_HEADROOM) {
+		*err = -EINVAL;
+		return NULL;
+	}
+
+	if (info->msg)
+		noblock = info->msg->msg_flags & MSG_DONTWAIT;
+	else
+		noblock = 1;
+
+	sk = sk_vsock(info->vsk);
+	sk->sk_allocation = mask;
+	skb = sock_alloc_send_skb(sk, size, noblock, err);
+	if (!skb)
+		return NULL;
+
+	skb_reserve(skb, VIRTIO_VSOCK_SKB_HEADROOM);
+	return skb;
+}
+
 /* Returns a new packet on success, otherwise returns NULL.
  *
  * If NULL is returned, errp is set to a negative errno.
@@ -47,7 +76,8 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
 			   u32 src_cid,
 			   u32 src_port,
 			   u32 dst_cid,
-			   u32 dst_port)
+			   u32 dst_port,
+			   int *errp)
 {
 	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
 	struct virtio_vsock_hdr *hdr;
@@ -55,9 +85,21 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
 	void *payload;
 	int err;
 
-	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
-	if (!skb)
+	/* dgrams do not use credits, self-throttle according to sk_sndbuf
+	 * using sock_alloc_send_skb. This helps avoid triggering the OOM.
+	 */
+	if (info->vsk && info->type == VIRTIO_VSOCK_TYPE_DGRAM) {
+		skb = virtio_transport_sock_alloc_send_skb(info, skb_len, GFP_KERNEL, &err);
+	} else {
+		skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
+		if (!skb)
+			err = -ENOMEM;
+	}
+
+	if (!skb) {
+		*errp = err;
 		return NULL;
+	}
 
 	hdr = virtio_vsock_hdr(skb);
 	hdr->type	= cpu_to_le16(info->type);
@@ -96,12 +138,14 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
 
 	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
 		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
+		err = -EFAULT;
 		goto out;
 	}
 
 	return skb;
 
 out:
+	*errp = err;
 	kfree_skb(skb);
 	return NULL;
 }
@@ -183,7 +227,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
 
 static u16 virtio_transport_get_type(struct sock *sk)
 {
-	if (sk->sk_type == SOCK_STREAM)
+	if (sk->sk_type == SOCK_DGRAM)
+		return VIRTIO_VSOCK_TYPE_DGRAM;
+	else if (sk->sk_type == SOCK_STREAM)
 		return VIRTIO_VSOCK_TYPE_STREAM;
 	else
 		return VIRTIO_VSOCK_TYPE_SEQPACKET;
@@ -239,11 +285,10 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 
 		skb = virtio_transport_alloc_skb(info, skb_len,
 						 src_cid, src_port,
-						 dst_cid, dst_port);
-		if (!skb) {
-			ret = -ENOMEM;
+						 dst_cid, dst_port,
+						 &ret);
+		if (!skb)
 			break;
-		}
 
 		virtio_transport_inc_tx_pkt(vvs, skb);
 
@@ -583,14 +628,30 @@ virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
 }
 EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_enqueue);
 
-int
-virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
-			       struct msghdr *msg,
-			       size_t len, int flags)
+int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
+{
+	*cid = le64_to_cpu(virtio_vsock_hdr(skb)->src_cid);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_cid);
+
+int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
+{
+	*port = le32_to_cpu(virtio_vsock_hdr(skb)->src_port);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_port);
+
+int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
 {
-	return -EOPNOTSUPP;
+	/* The device layer must have already moved the data ptr beyond the
+	 * header for skb->len to be correct.
+	 */
+	WARN_ON(skb->data == skb->head);
+	*len = skb->len;
+	return 0;
 }
-EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_length);
 
 s64 virtio_transport_stream_has_data(struct vsock_sock *vsk)
 {
@@ -790,30 +851,6 @@ bool virtio_transport_stream_allow(u32 cid, u32 port)
 }
 EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
 
-int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
-{
-	return -EOPNOTSUPP;
-}
-EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_cid);
-
-int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
-{
-	return -EOPNOTSUPP;
-}
-EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_port);
-
-int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
-{
-	return -EOPNOTSUPP;
-}
-EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_length);
-
-bool virtio_transport_dgram_allow(u32 cid, u32 port)
-{
-	return false;
-}
-EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
-
 int virtio_transport_connect(struct vsock_sock *vsk)
 {
 	struct virtio_vsock_pkt_info info = {
@@ -846,7 +883,34 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
 			       struct msghdr *msg,
 			       size_t dgram_len)
 {
-	return -EOPNOTSUPP;
+	const struct virtio_transport *t_ops;
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_RW,
+		.msg = msg,
+		.vsk = vsk,
+		.type = VIRTIO_VSOCK_TYPE_DGRAM,
+	};
+	u32 src_cid, src_port;
+	struct sk_buff *skb;
+	int err;
+
+	if (dgram_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
+		return -EMSGSIZE;
+
+	t_ops = virtio_transport_get_ops(vsk);
+	src_cid = t_ops->transport.get_local_cid();
+	src_port = vsk->local_addr.svm_port;
+
+	skb = virtio_transport_alloc_skb(&info, dgram_len,
+					 src_cid, src_port,
+					 remote_addr->svm_cid,
+					 remote_addr->svm_port,
+					 &err);
+
+	if (!skb)
+		return err;
+
+	return t_ops->send_pkt(skb);
 }
 EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
 
@@ -903,6 +967,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
 		.reply = true,
 	};
 	struct sk_buff *reply;
+	int err;
 
 	/* Send RST only if the original pkt is not a RST pkt */
 	if (le16_to_cpu(hdr->op) == VIRTIO_VSOCK_OP_RST)
@@ -915,9 +980,10 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
 					   le64_to_cpu(hdr->dst_cid),
 					   le32_to_cpu(hdr->dst_port),
 					   le64_to_cpu(hdr->src_cid),
-					   le32_to_cpu(hdr->src_port));
+					   le32_to_cpu(hdr->src_port),
+					   &err);
 	if (!reply)
-		return -ENOMEM;
+		return err;
 
 	return t->send_pkt(reply);
 }
@@ -1137,6 +1203,21 @@ virtio_transport_recv_enqueue(struct vsock_sock *vsk,
 		kfree_skb(skb);
 }
 
+/* This function takes ownership of the skb.
+ *
+ * It either places the skb on the sk_receive_queue or frees it.
+ */
+static void
+virtio_transport_recv_dgram(struct sock *sk, struct sk_buff *skb)
+{
+	if (sock_queue_rcv_skb(sk, skb)) {
+		kfree_skb(skb);
+		return;
+	}
+
+	sk->sk_data_ready(sk);
+}
+
 static int
 virtio_transport_recv_connected(struct sock *sk,
 				struct sk_buff *skb)
@@ -1300,7 +1381,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
 static bool virtio_transport_valid_type(u16 type)
 {
 	return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
-	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
+	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
+	       (type == VIRTIO_VSOCK_TYPE_DGRAM);
 }
 
 /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
@@ -1314,40 +1396,52 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 	struct vsock_sock *vsk;
 	struct sock *sk;
 	bool space_available;
+	u16 type;
 
 	vsock_addr_init(&src, le64_to_cpu(hdr->src_cid),
 			le32_to_cpu(hdr->src_port));
 	vsock_addr_init(&dst, le64_to_cpu(hdr->dst_cid),
 			le32_to_cpu(hdr->dst_port));
 
+	type = le16_to_cpu(hdr->type);
+
 	trace_virtio_transport_recv_pkt(src.svm_cid, src.svm_port,
 					dst.svm_cid, dst.svm_port,
 					le32_to_cpu(hdr->len),
-					le16_to_cpu(hdr->type),
+					type,
 					le16_to_cpu(hdr->op),
 					le32_to_cpu(hdr->flags),
 					le32_to_cpu(hdr->buf_alloc),
 					le32_to_cpu(hdr->fwd_cnt));
 
-	if (!virtio_transport_valid_type(le16_to_cpu(hdr->type))) {
+	if (!virtio_transport_valid_type(type)) {
 		(void)virtio_transport_reset_no_sock(t, skb);
 		goto free_pkt;
 	}
 
-	/* The socket must be in connected or bound table
-	 * otherwise send reset back
+	/* For stream/seqpacket, the socket must be in connected or bound table
+	 * otherwise send reset back.
+	 *
+	 * For datagrams, no reset is sent back.
 	 */
 	sk = vsock_find_connected_socket(&src, &dst);
 	if (!sk) {
-		sk = vsock_find_bound_socket(&dst);
-		if (!sk) {
-			(void)virtio_transport_reset_no_sock(t, skb);
-			goto free_pkt;
+		if (type == VIRTIO_VSOCK_TYPE_DGRAM) {
+			sk = vsock_find_bound_dgram_socket(&dst);
+			if (!sk)
+				goto free_pkt;
+		} else {
+			sk = vsock_find_bound_socket(&dst);
+			if (!sk) {
+				(void)virtio_transport_reset_no_sock(t, skb);
+				goto free_pkt;
+			}
 		}
 	}
 
-	if (virtio_transport_get_type(sk) != le16_to_cpu(hdr->type)) {
-		(void)virtio_transport_reset_no_sock(t, skb);
+	if (virtio_transport_get_type(sk) != type) {
+		if (type != VIRTIO_VSOCK_TYPE_DGRAM)
+			(void)virtio_transport_reset_no_sock(t, skb);
 		sock_put(sk);
 		goto free_pkt;
 	}
@@ -1363,12 +1457,18 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 
 	/* Check if sk has been closed before lock_sock */
 	if (sock_flag(sk, SOCK_DONE)) {
-		(void)virtio_transport_reset_no_sock(t, skb);
+		if (type != VIRTIO_VSOCK_TYPE_DGRAM)
+			(void)virtio_transport_reset_no_sock(t, skb);
 		release_sock(sk);
 		sock_put(sk);
 		goto free_pkt;
 	}
 
+	if (sk->sk_type == SOCK_DGRAM) {
+		virtio_transport_recv_dgram(sk, skb);
+		goto out;
+	}
+
 	space_available = virtio_transport_space_update(sk, skb);
 
 	/* Update CID in case it has changed after a transport reset event */
@@ -1400,6 +1500,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 		break;
 	}
 
+out:
 	release_sock(sk);
 
 	/* Release refcnt obtained when we fetched this socket out of the
diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index e9de45a26fbd..68312aa8c972 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -46,6 +46,7 @@ static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk)
 	return 0;
 }
 
+static bool vsock_loopback_dgram_allow(u32 cid, u32 port);
 static bool vsock_loopback_seqpacket_allow(u32 remote_cid);
 
 static struct virtio_transport loopback_transport = {
@@ -62,7 +63,7 @@ static struct virtio_transport loopback_transport = {
 		.cancel_pkt               = vsock_loopback_cancel_pkt,
 
 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
-		.dgram_allow              = virtio_transport_dgram_allow,
+		.dgram_allow              = vsock_loopback_dgram_allow,
 		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
 		.dgram_get_port		  = virtio_transport_dgram_get_port,
 		.dgram_get_length	  = virtio_transport_dgram_get_length,
@@ -98,6 +99,11 @@ static struct virtio_transport loopback_transport = {
 	.send_pkt = vsock_loopback_send_pkt,
 };
 
+static bool vsock_loopback_dgram_allow(u32 cid, u32 port)
+{
+	return true;
+}
+
 static bool vsock_loopback_seqpacket_allow(u32 remote_cid)
 {
 	return true;

-- 
2.30.2



* [PATCH RFC net-next v4 3/8] vsock: support multi-transport datagrams
From: Bobby Eshleman @ 2023-06-10  0:58 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, Krasnov Arseniy, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, Bobby Eshleman
In-Reply-To: <20230413-b4-vsock-dgram-v4-0-0cebbb2ae899@bytedance.com>

This patch adds support for multi-transport datagrams.

This includes:
- Per-packet lookup of transports when using sendto(sockaddr_vm)
- Selecting H2G or G2H transport using VMADDR_FLAG_TO_HOST and CID in
  sockaddr_vm

To preserve backwards compatibility with VMCI, some important changes
were made. The "transport_dgram" / VSOCK_TRANSPORT_F_DGRAM transport is
now used for dgrams iff no g2h or h2g transport that can transmit the
packet has been registered yet. If there is a g2h/h2g transport for that
remote address, then that transport will be used and not
"transport_dgram". This essentially makes "transport_dgram" a fallback
transport for when h2g/g2h has not yet gone online, which appears to be
the exact use case for VMCI.

This design makes sense, because there is no reason that the
transport_{g2h,h2g} cannot also service datagrams, which makes the role
of transport_dgram difficult to understand outside of the VMCI context.

The logic around "transport_dgram" had to be retained to prevent
breaking VMCI:

1) VMCI datagrams appear to function outside of the h2g/g2h
   paradigm. When the vmci transport comes online, it registers itself
   with the DGRAM feature, but not H2G/G2H. Only later, when the
   transport has more information about its environment, does it
   register H2G or G2H. If a datagram socket becomes active after DGRAM
   registration but before G2H/H2G registration, the "transport_dgram"
   transport needs to be used.

2) VMCI seems to require that a special message be sent by the transport
   when a datagram socket calls bind(). Under the h2g/g2h model, the
   transport is selected using the remote_addr, which is set by
   connect(). At bind time there is often no remote_addr because
   connect() has not been called yet, so the transport is NULL. With a
   NULL transport, there doesn't seem to be any good way for a datagram
   socket to tell the VMCI transport that bind() has just been called on
   it.

Only transports with a special datagram fallback use-case such as VMCI
need to register VSOCK_TRANSPORT_F_DGRAM.
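
The lookup order described above can be modelled in plain user-space C.
This is only a sketch: model_transport, model_registry and the model_*
functions are hypothetical stand-ins for the static transport pointers
in af_vsock.c, and the local-transport check is simplified to a
comparison against VMADDR_CID_LOCAL.

```c
#include <stdbool.h>
#include <stddef.h>

/* CID and flag values as defined in <linux/vm_sockets.h>. */
#define VMADDR_CID_LOCAL	1U
#define VMADDR_CID_HOST		2U
#define VMADDR_FLAG_TO_HOST	0x01

/* Hypothetical stand-in for struct vsock_transport. */
struct model_transport {
	const char *name;
};

/* Hypothetical stand-in for the registered transport pointers. */
struct model_registry {
	const struct model_transport *local;
	const struct model_transport *g2h;
	const struct model_transport *h2g;
	const struct model_transport *dgram;	/* VMCI-style fallback */
};

/* Simplified assumption: only VMADDR_CID_LOCAL selects loopback. */
static bool model_use_local(const struct model_registry *r, unsigned int cid)
{
	return r->local && cid == VMADDR_CID_LOCAL;
}

/* Same decision order as vsock_connectible_lookup_transport(). */
static const struct model_transport *
model_connectible_lookup(const struct model_registry *r,
			 unsigned int cid, unsigned char flags)
{
	if (model_use_local(r, cid))
		return r->local;
	if (cid <= VMADDR_CID_HOST || !r->h2g || (flags & VMADDR_FLAG_TO_HOST))
		return r->g2h;
	return r->h2g;
}

/* Prefer local/g2h/h2g; fall back to the dgram transport otherwise. */
static const struct model_transport *
model_dgram_lookup(const struct model_registry *r,
		   unsigned int cid, unsigned char flags)
{
	const struct model_transport *t;

	t = model_connectible_lookup(r, cid, flags);
	return t ? t : r->dgram;
}
```

With only the dgram transport registered, every lookup resolves to the
fallback. Once g2h/h2g are registered, they take precedence for the
addresses they can reach.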

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 drivers/vhost/vsock.c                   |  1 -
 include/linux/virtio_vsock.h            |  2 -
 net/vmw_vsock/af_vsock.c                | 78 +++++++++++++++++++++++++--------
 net/vmw_vsock/hyperv_transport.c        |  6 ---
 net/vmw_vsock/virtio_transport.c        |  1 -
 net/vmw_vsock/virtio_transport_common.c |  7 ---
 net/vmw_vsock/vsock_loopback.c          |  1 -
 7 files changed, 60 insertions(+), 36 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index c8201c070b4b..8f0082da5e70 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -410,7 +410,6 @@ static struct virtio_transport vhost_transport = {
 		.cancel_pkt               = vhost_transport_cancel_pkt,
 
 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
-		.dgram_bind               = virtio_transport_dgram_bind,
 		.dgram_allow              = virtio_transport_dgram_allow,
 		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
 		.dgram_get_port		  = virtio_transport_dgram_get_port,
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 23521a318cf0..73afa09f4585 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -216,8 +216,6 @@ void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val);
 u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
 bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
 bool virtio_transport_stream_allow(u32 cid, u32 port);
-int virtio_transport_dgram_bind(struct vsock_sock *vsk,
-				struct sockaddr_vm *addr);
 bool virtio_transport_dgram_allow(u32 cid, u32 port);
 int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid);
 int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port);
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 74358f0b47fa..ef86765f3765 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -438,6 +438,18 @@ vsock_connectible_lookup_transport(unsigned int cid, __u8 flags)
 	return transport;
 }
 
+static const struct vsock_transport *
+vsock_dgram_lookup_transport(unsigned int cid, __u8 flags)
+{
+	const struct vsock_transport *transport;
+
+	transport = vsock_connectible_lookup_transport(cid, flags);
+	if (transport)
+		return transport;
+
+	return transport_dgram;
+}
+
 /* Assign a transport to a socket and call the .init transport callback.
  *
  * Note: for connection oriented socket this must be called when vsk->remote_addr
@@ -474,7 +486,8 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
 
 	switch (sk->sk_type) {
 	case SOCK_DGRAM:
-		new_transport = transport_dgram;
+		new_transport = vsock_dgram_lookup_transport(remote_cid,
+							     remote_flags);
 		break;
 	case SOCK_STREAM:
 	case SOCK_SEQPACKET:
@@ -691,6 +704,9 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
 static int __vsock_bind_dgram(struct vsock_sock *vsk,
 			      struct sockaddr_vm *addr)
 {
+	if (!vsk->transport || !vsk->transport->dgram_bind)
+		return -EINVAL;
+
 	return vsk->transport->dgram_bind(vsk, addr);
 }
 
@@ -1172,19 +1188,24 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 
 	lock_sock(sk);
 
-	transport = vsk->transport;
-
-	err = vsock_auto_bind(vsk);
-	if (err)
-		goto out;
-
-
 	/* If the provided message contains an address, use that.  Otherwise
 	 * fall back on the socket's remote handle (if it has been connected).
 	 */
 	if (msg->msg_name &&
 	    vsock_addr_cast(msg->msg_name, msg->msg_namelen,
 			    &remote_addr) == 0) {
+		transport = vsock_dgram_lookup_transport(remote_addr->svm_cid,
+							 remote_addr->svm_flags);
+		if (!transport) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		if (!try_module_get(transport->module)) {
+			err = -ENODEV;
+			goto out;
+		}
+
 		/* Ensure this address is of the right type and is a valid
 		 * destination.
 		 */
@@ -1193,11 +1214,27 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 			remote_addr->svm_cid = transport->get_local_cid();
 
 		if (!vsock_addr_bound(remote_addr)) {
+			module_put(transport->module);
+			err = -EINVAL;
+			goto out;
+		}
+
+		if (!transport->dgram_allow(remote_addr->svm_cid,
+					    remote_addr->svm_port)) {
+			module_put(transport->module);
 			err = -EINVAL;
 			goto out;
 		}
+
+		err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
+		module_put(transport->module);
 	} else if (sock->state == SS_CONNECTED) {
 		remote_addr = &vsk->remote_addr;
+		transport = vsk->transport;
+
+		err = vsock_auto_bind(vsk);
+		if (err)
+			goto out;
 
 		if (remote_addr->svm_cid == VMADDR_CID_ANY)
 			remote_addr->svm_cid = transport->get_local_cid();
@@ -1205,23 +1242,23 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 		/* XXX Should connect() or this function ensure remote_addr is
 		 * bound?
 		 */
-		if (!vsock_addr_bound(&vsk->remote_addr)) {
+		if (!vsock_addr_bound(remote_addr)) {
 			err = -EINVAL;
 			goto out;
 		}
-	} else {
-		err = -EINVAL;
-		goto out;
-	}
 
-	if (!transport->dgram_allow(remote_addr->svm_cid,
-				    remote_addr->svm_port)) {
+		if (!transport->dgram_allow(remote_addr->svm_cid,
+					    remote_addr->svm_port)) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
+	} else {
 		err = -EINVAL;
 		goto out;
 	}
 
-	err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
-
 out:
 	release_sock(sk);
 	return err;
@@ -1255,13 +1292,18 @@ static int vsock_dgram_connect(struct socket *sock,
 	if (err)
 		goto out;
 
+	memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
+
+	err = vsock_assign_transport(vsk, NULL);
+	if (err)
+		goto out;
+
 	if (!vsk->transport->dgram_allow(remote_addr->svm_cid,
 					 remote_addr->svm_port)) {
 		err = -EINVAL;
 		goto out;
 	}
 
-	memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
 	sock->state = SS_CONNECTED;
 
 	/* sock map disallows redirection of non-TCP sockets with sk_state !=
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index ff6e87e25fa0..c00bc5da769a 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -551,11 +551,6 @@ static void hvs_destruct(struct vsock_sock *vsk)
 	kfree(hvs);
 }
 
-static int hvs_dgram_bind(struct vsock_sock *vsk, struct sockaddr_vm *addr)
-{
-	return -EOPNOTSUPP;
-}
-
 static int hvs_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
 {
 	return -EOPNOTSUPP;
@@ -841,7 +836,6 @@ static struct vsock_transport hvs_transport = {
 	.connect                  = hvs_connect,
 	.shutdown                 = hvs_shutdown,
 
-	.dgram_bind               = hvs_dgram_bind,
 	.dgram_get_cid		  = hvs_dgram_get_cid,
 	.dgram_get_port		  = hvs_dgram_get_port,
 	.dgram_get_length	  = hvs_dgram_get_length,
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 5763cdf13804..1b7843a7779a 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -428,7 +428,6 @@ static struct virtio_transport virtio_transport = {
 		.shutdown                 = virtio_transport_shutdown,
 		.cancel_pkt               = virtio_transport_cancel_pkt,
 
-		.dgram_bind               = virtio_transport_dgram_bind,
 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
 		.dgram_allow              = virtio_transport_dgram_allow,
 		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index e6903c719964..d5a3c8efe84b 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -790,13 +790,6 @@ bool virtio_transport_stream_allow(u32 cid, u32 port)
 }
 EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
 
-int virtio_transport_dgram_bind(struct vsock_sock *vsk,
-				struct sockaddr_vm *addr)
-{
-	return -EOPNOTSUPP;
-}
-EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
-
 int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
 {
 	return -EOPNOTSUPP;
diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index 2f3cabc79ee5..e9de45a26fbd 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -61,7 +61,6 @@ static struct virtio_transport loopback_transport = {
 		.shutdown                 = virtio_transport_shutdown,
 		.cancel_pkt               = vsock_loopback_cancel_pkt,
 
-		.dgram_bind               = virtio_transport_dgram_bind,
 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
 		.dgram_allow              = virtio_transport_dgram_allow,
 		.dgram_get_cid		  = virtio_transport_dgram_get_cid,

-- 
2.30.2



* [PATCH RFC net-next v4 5/8] virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
From: Bobby Eshleman @ 2023-06-10  0:58 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, Krasnov Arseniy, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, Bobby Eshleman,
	Jiang Wang
In-Reply-To: <20230413-b4-vsock-dgram-v4-0-0cebbb2ae899@bytedance.com>

This commit adds a feature bit for virtio vsock to support datagrams.
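
Note that the value in the define is a bit number, not a mask. A small
user-space sketch of the test, assuming a plain 64-bit feature word
(model_has_feature is a hypothetical helper that only mimics what
virtio_has_feature() checks for a real device):

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_VSOCK_F_SEQPACKET	1	/* bit number, existing uapi */
#define VIRTIO_VSOCK_F_DGRAM		3	/* bit number from this patch */

/* Hypothetical helper: test a single feature bit by its bit number. */
static bool model_has_feature(uint64_t features, unsigned int fbit)
{
	return (features >> fbit) & 1ULL;
}
```

For example, a feature word of 0xA (bits 1 and 3) advertises both
SEQPACKET and DGRAM, whereas 0x3 (bits 0 and 1) does not advertise
DGRAM.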

Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 include/uapi/linux/virtio_vsock.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
index 64738838bee5..9c25f267bbc0 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -40,6 +40,7 @@
 
 /* The feature bitmap for virtio vsock */
 #define VIRTIO_VSOCK_F_SEQPACKET	1	/* SOCK_SEQPACKET supported */
+#define VIRTIO_VSOCK_F_DGRAM		3	/* SOCK_DGRAM supported */
 
 struct virtio_vsock_config {
 	__le64 guest_cid;

-- 
2.30.2



* [PATCH RFC net-next v4 2/8] vsock: refactor transport lookup code
From: Bobby Eshleman @ 2023-06-10  0:58 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, Krasnov Arseniy, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, Bobby Eshleman
In-Reply-To: <20230413-b4-vsock-dgram-v4-0-0cebbb2ae899@bytedance.com>

Introduce a new reusable function, vsock_connectible_lookup_transport(),
that performs the transport lookup logic.

No functional change intended.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 net/vmw_vsock/af_vsock.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index ffb4dd8b6ea7..74358f0b47fa 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -422,6 +422,22 @@ static void vsock_deassign_transport(struct vsock_sock *vsk)
 	vsk->transport = NULL;
 }
 
+static const struct vsock_transport *
+vsock_connectible_lookup_transport(unsigned int cid, __u8 flags)
+{
+	const struct vsock_transport *transport;
+
+	if (vsock_use_local_transport(cid))
+		transport = transport_local;
+	else if (cid <= VMADDR_CID_HOST || !transport_h2g ||
+		 (flags & VMADDR_FLAG_TO_HOST))
+		transport = transport_g2h;
+	else
+		transport = transport_h2g;
+
+	return transport;
+}
+
 /* Assign a transport to a socket and call the .init transport callback.
  *
  * Note: for connection oriented socket this must be called when vsk->remote_addr
@@ -462,13 +478,8 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
 		break;
 	case SOCK_STREAM:
 	case SOCK_SEQPACKET:
-		if (vsock_use_local_transport(remote_cid))
-			new_transport = transport_local;
-		else if (remote_cid <= VMADDR_CID_HOST || !transport_h2g ||
-			 (remote_flags & VMADDR_FLAG_TO_HOST))
-			new_transport = transport_g2h;
-		else
-			new_transport = transport_h2g;
+		new_transport = vsock_connectible_lookup_transport(remote_cid,
+								   remote_flags);
 		break;
 	default:
 		return -ESOCKTNOSUPPORT;

-- 
2.30.2



* [PATCH RFC net-next v4 0/8] virtio/vsock: support datagrams
From: Bobby Eshleman @ 2023-06-10  0:58 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, Krasnov Arseniy, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, Bobby Eshleman,
	Jiang Wang

Hey all!

This series introduces support for datagrams to virtio/vsock.

It is a spin-off (and smaller version) of this series from the summer:
  https://lore.kernel.org/all/cover.1660362668.git.bobby.eshleman@bytedance.com/

Please note that this is an RFC and should not be merged until
associated changes are made to the virtio specification, which will
follow after discussion from this series.

As an aside, v4 of the series has only been lightly tested with a
run of tools/testing/vsock/vsock_test. Some code likely needs cleaning
up, but I'm hoping to get some of the design choices agreed upon before
spending too much time making it pretty.

This series first supports datagrams in a basic form for virtio, and
then optimizes the sendpath for all datagram transports.

The result is a very fast datagram communication protocol that
outperforms even UDP on multi-queue virtio-net w/ vhost on a variety
of multi-threaded workload samples.

For those that are curious, some summary data comparing UDP and VSOCK
DGRAM (N=5):

	vCPUS: 16
	virtio-net queues: 16
	payload size: 4KB
	Setup: bare metal + vm (non-nested)

	UDP: 287.59 MB/s
	VSOCK DGRAM: 509.2 MB/s

Some notes about the implementation...

This datagram implementation forces datagrams to self-throttle according
to the threshold set by sk_sndbuf. It behaves similarly to the credits
used by streams in its effect on throughput and memory consumption, but
it is not influenced by the receiving socket as credits are.
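
To illustrate the self-throttling idea, here is a minimal userspace
sketch. It is not the code from this series: struct dgram_sender,
dgram_try_queue() and dgram_tx_done() are hypothetical names, and the
sndbuf field merely stands in for sk_sndbuf. The point is only that the
sender's own budget decides whether a datagram can be queued, with no
input from the receiver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a datagram sender; not a kernel structure. */
struct dgram_sender {
	size_t sndbuf;		/* send budget in bytes, standing in for sk_sndbuf */
	size_t in_flight;	/* bytes queued but not yet consumed by the transport */
};

/* Try to queue len bytes. Returns false when the budget would be
 * exceeded, in which case the caller has to wait or give up.
 */
bool dgram_try_queue(struct dgram_sender *s, size_t len)
{
	/* Invariant of this model: in_flight never exceeds sndbuf. */
	assert(s->in_flight <= s->sndbuf);

	if (len > s->sndbuf - s->in_flight)
		return false;

	s->in_flight += len;
	return true;
}

/* Called once the transport has consumed len bytes. */
void dgram_tx_done(struct dgram_sender *s, size_t len)
{
	s->in_flight -= len < s->in_flight ? len : s->in_flight;
}
```

With a budget of 8192 bytes, two 4KB datagrams can be queued
back-to-back; a third is refused until dgram_tx_done() releases space.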

The device drops packets silently.

As discussed previously, this series introduces datagrams and defers
fairness to future work. See discussion in v2 for more context around
datagrams, fairness, and this implementation.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
Changes in v4:
- style changes
  - vsock: use sk_vsock(vsk) in vsock_dgram_recvmsg instead of
    &sk->vsk
  - vsock: fix xmas tree declaration
  - vsock: fix spacing issues
  - virtio/vsock: virtio_transport_recv_dgram returns void because err
    unused
- sparse analysis warnings/errors
  - virtio/vsock: fix uninitialized skerr on destroy
  - virtio/vsock: fix uninitialized err var on goto out
  - vsock: fix declarations that need static
  - vsock: fix __rcu annotation order
- bugs
  - vsock: fix null ptr in remote_info code
  - vsock/dgram: make transport_dgram a fallback instead of first
    priority
  - vsock: remove redundant rcu read lock acquire in getname()
- tests
  - add more tests (message bounds and more)
  - add vsock_dgram_bind() helper
  - add vsock_dgram_connect() helper

Changes in v3:
- Support multi-transport dgram, changing logic in connect/bind
  to support VMCI case
- Support per-pkt transport lookup for sendto() case
- Fix dgram_allow() implementation
- Fix dgram feature bit number (now it is 3)
- Fix binding so dgram and connectible (cid,port) spaces are
  non-overlapping
- RCU protect transport ptr so connect() calls never leave
  a lockless read of the transport and remote_addr are always
  in sync
- Link to v2: https://lore.kernel.org/r/20230413-b4-vsock-dgram-v2-0-079cc7cee62e@bytedance.com

---
Bobby Eshleman (7):
      vsock/dgram: generalize recvmsg and drop transport->dgram_dequeue
      vsock: refactor transport lookup code
      vsock: support multi-transport datagrams
      vsock: make vsock bind reusable
      virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
      virtio/vsock: support dgrams
      vsock: Add lockless sendmsg() support

Jiang Wang (1):
      tests: add vsock dgram tests

 drivers/vhost/vsock.c                   |  44 ++-
 include/linux/virtio_vsock.h            |  13 +-
 include/net/af_vsock.h                  |  52 ++-
 include/uapi/linux/virtio_vsock.h       |   2 +
 net/vmw_vsock/af_vsock.c                | 616 ++++++++++++++++++++++++++------
 net/vmw_vsock/diag.c                    |  10 +-
 net/vmw_vsock/hyperv_transport.c        |  42 ++-
 net/vmw_vsock/virtio_transport.c        |  28 +-
 net/vmw_vsock/virtio_transport_common.c | 226 +++++++++---
 net/vmw_vsock/vmci_transport.c          | 152 ++++----
 net/vmw_vsock/vsock_bpf.c               |  10 +-
 net/vmw_vsock/vsock_loopback.c          |  13 +-
 tools/testing/vsock/util.c              | 141 +++++++-
 tools/testing/vsock/util.h              |   6 +
 tools/testing/vsock/vsock_test.c        | 432 ++++++++++++++++++++++
 15 files changed, 1533 insertions(+), 254 deletions(-)
---
base-commit: 28cfea989d6f55c3d10608eba2a2bae609c5bf3e
change-id: 20230413-b4-vsock-dgram-3b6eba6a64e5

Best regards,
-- 
Bobby Eshleman <bobby.eshleman@bytedance.com>


^ permalink raw reply

* [PATCH RFC net-next v4 1/8] vsock/dgram: generalize recvmsg and drop transport->dgram_dequeue
From: Bobby Eshleman @ 2023-06-10  0:58 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, Krasnov Arseniy, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, Bobby Eshleman
In-Reply-To: <20230413-b4-vsock-dgram-v4-0-0cebbb2ae899@bytedance.com>

This commit drops the transport->dgram_dequeue callback and makes
vsock_dgram_recvmsg() generic. It also adds transport callbacks for use
by the generic vsock_dgram_recvmsg(), such as callbacks that parse the
CID/port out of an skb, whose layout varies per transport.
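
The split between a generic receive path and per-transport hooks can be
sketched in userspace roughly as follows. This is only a model under
stated assumptions: struct pkt stands in for an sk_buff, the 4-byte demo
header is invented, and every demo_ or generic_ name is hypothetical.
The bounds check against the received length is an addition of this
sketch, not something taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an sk_buff: raw bytes handed up by a transport. */
struct pkt {
	const uint8_t *data;
	size_t len;
};

/* Per-transport hooks, loosely mirroring dgram_get_length and
 * dgram_payload_offset.
 */
struct dgram_ops {
	int (*get_length)(const struct pkt *p, size_t *len);
	size_t payload_offset;
};

/* Invented demo transport: 4-byte header, first byte is the payload length. */
#define DEMO_HDR_LEN 4

int demo_get_length(const struct pkt *p, size_t *len)
{
	if (p->len < DEMO_HDR_LEN)
		return -1;
	*len = p->data[0];
	return 0;
}

const struct dgram_ops demo_ops = {
	.get_length = demo_get_length,
	.payload_offset = DEMO_HDR_LEN,
};

/* Generic receive: returns the number of bytes copied, or -1 on error.
 * *truncated is set when the caller's buffer is smaller than the payload.
 */
long generic_dgram_recv(const struct dgram_ops *ops, const struct pkt *p,
			void *buf, size_t buflen, bool *truncated)
{
	size_t payload_len;

	assert(ops && p && buf && truncated);

	if (ops->get_length(p, &payload_len))
		return -1;

	/* Sketch-only check: the claimed payload must fit in the packet. */
	if (p->len < ops->payload_offset ||
	    payload_len > p->len - ops->payload_offset)
		return -1;

	*truncated = payload_len > buflen;
	if (*truncated)
		payload_len = buflen;

	memcpy(buf, p->data + ops->payload_offset, payload_len);
	return (long)payload_len;
}
```

The generic function never looks at the header itself; it only asks the
transport for the length and skips payload_offset bytes, which is the
property that lets one recvmsg serve several transports.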

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 drivers/vhost/vsock.c                   |  4 +-
 include/linux/virtio_vsock.h            |  3 ++
 include/net/af_vsock.h                  | 13 ++++++-
 net/vmw_vsock/af_vsock.c                | 51 ++++++++++++++++++++++++-
 net/vmw_vsock/hyperv_transport.c        | 17 +++++++--
 net/vmw_vsock/virtio_transport.c        |  4 +-
 net/vmw_vsock/virtio_transport_common.c | 18 +++++++++
 net/vmw_vsock/vmci_transport.c          | 68 +++++++++++++--------------------
 net/vmw_vsock/vsock_loopback.c          |  4 +-
 9 files changed, 132 insertions(+), 50 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 6578db78f0ae..c8201c070b4b 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -410,9 +410,11 @@ static struct virtio_transport vhost_transport = {
 		.cancel_pkt               = vhost_transport_cancel_pkt,
 
 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
-		.dgram_dequeue            = virtio_transport_dgram_dequeue,
 		.dgram_bind               = virtio_transport_dgram_bind,
 		.dgram_allow              = virtio_transport_dgram_allow,
+		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
+		.dgram_get_port		  = virtio_transport_dgram_get_port,
+		.dgram_get_length	  = virtio_transport_dgram_get_length,
 
 		.stream_enqueue           = virtio_transport_stream_enqueue,
 		.stream_dequeue           = virtio_transport_stream_dequeue,
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c58453699ee9..23521a318cf0 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -219,6 +219,9 @@ bool virtio_transport_stream_allow(u32 cid, u32 port);
 int virtio_transport_dgram_bind(struct vsock_sock *vsk,
 				struct sockaddr_vm *addr);
 bool virtio_transport_dgram_allow(u32 cid, u32 port);
+int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid);
+int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port);
+int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len);
 
 int virtio_transport_connect(struct vsock_sock *vsk);
 
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 0e7504a42925..7bedb9ee7e3e 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -120,11 +120,20 @@ struct vsock_transport {
 
 	/* DGRAM. */
 	int (*dgram_bind)(struct vsock_sock *, struct sockaddr_vm *);
-	int (*dgram_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
-			     size_t len, int flags);
 	int (*dgram_enqueue)(struct vsock_sock *, struct sockaddr_vm *,
 			     struct msghdr *, size_t len);
 	bool (*dgram_allow)(u32 cid, u32 port);
+	int (*dgram_get_cid)(struct sk_buff *skb, unsigned int *cid);
+	int (*dgram_get_port)(struct sk_buff *skb, unsigned int *port);
+	int (*dgram_get_length)(struct sk_buff *skb, size_t *length);
+
+	/* The number of bytes into the buffer at which the payload starts, as
+	 * first seen by the receiving socket layer. For example, if the
+	 * transport presets the skb pointers using skb_pull(sizeof(header))
+	 * then this would be zero, otherwise it would be the size of the
+	 * header.
+	 */
+	const size_t dgram_payload_offset;
 
 	/* STREAM. */
 	/* TODO: stream_bind() */
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index efb8a0937a13..ffb4dd8b6ea7 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1271,11 +1271,15 @@ static int vsock_dgram_connect(struct socket *sock,
 int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 			size_t len, int flags)
 {
+	const struct vsock_transport *transport;
 #ifdef CONFIG_BPF_SYSCALL
 	const struct proto *prot;
 #endif
 	struct vsock_sock *vsk;
+	struct sk_buff *skb;
+	size_t payload_len;
 	struct sock *sk;
+	int err;
 
 	sk = sock->sk;
 	vsk = vsock_sk(sk);
@@ -1286,7 +1290,52 @@ int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 		return prot->recvmsg(sk, msg, len, flags, NULL);
 #endif
 
-	return vsk->transport->dgram_dequeue(vsk, msg, len, flags);
+	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
+		return -EOPNOTSUPP;
+
+	transport = vsk->transport;
+
+	/* Retrieve the head sk_buff from the socket's receive queue. */
+	err = 0;
+	skb = skb_recv_datagram(sk_vsock(vsk), flags, &err);
+	if (!skb)
+		return err;
+
+	err = transport->dgram_get_length(skb, &payload_len);
+	if (err)
+		goto out;
+
+	if (payload_len > len) {
+		payload_len = len;
+		msg->msg_flags |= MSG_TRUNC;
+	}
+
+	/* Place the datagram payload in the user's iovec. */
+	err = skb_copy_datagram_msg(skb, transport->dgram_payload_offset, msg, payload_len);
+	if (err)
+		goto out;
+
+	if (msg->msg_name) {
+		/* Provide the address of the sender. */
+		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
+		unsigned int cid, port;
+
+		err = transport->dgram_get_cid(skb, &cid);
+		if (err)
+			goto out;
+
+		err = transport->dgram_get_port(skb, &port);
+		if (err)
+			goto out;
+
+		vsock_addr_init(vm_addr, cid, port);
+		msg->msg_namelen = sizeof(*vm_addr);
+	}
+	err = payload_len;
+
+out:
+	skb_free_datagram(&vsk->sk, skb);
+	return err;
 }
 EXPORT_SYMBOL_GPL(vsock_dgram_recvmsg);
 
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 7cb1a9d2cdb4..ff6e87e25fa0 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -556,8 +556,17 @@ static int hvs_dgram_bind(struct vsock_sock *vsk, struct sockaddr_vm *addr)
 	return -EOPNOTSUPP;
 }
 
-static int hvs_dgram_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
-			     size_t len, int flags)
+static int hvs_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
+{
+	return -EOPNOTSUPP;
+}
+
+static int hvs_dgram_get_port(struct sk_buff *skb, unsigned int *port)
+{
+	return -EOPNOTSUPP;
+}
+
+static int hvs_dgram_get_length(struct sk_buff *skb, size_t *len)
 {
 	return -EOPNOTSUPP;
 }
@@ -833,7 +842,9 @@ static struct vsock_transport hvs_transport = {
 	.shutdown                 = hvs_shutdown,
 
 	.dgram_bind               = hvs_dgram_bind,
-	.dgram_dequeue            = hvs_dgram_dequeue,
+	.dgram_get_cid		  = hvs_dgram_get_cid,
+	.dgram_get_port		  = hvs_dgram_get_port,
+	.dgram_get_length	  = hvs_dgram_get_length,
 	.dgram_enqueue            = hvs_dgram_enqueue,
 	.dgram_allow              = hvs_dgram_allow,
 
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index e95df847176b..5763cdf13804 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -429,9 +429,11 @@ static struct virtio_transport virtio_transport = {
 		.cancel_pkt               = virtio_transport_cancel_pkt,
 
 		.dgram_bind               = virtio_transport_dgram_bind,
-		.dgram_dequeue            = virtio_transport_dgram_dequeue,
 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
 		.dgram_allow              = virtio_transport_dgram_allow,
+		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
+		.dgram_get_port		  = virtio_transport_dgram_get_port,
+		.dgram_get_length	  = virtio_transport_dgram_get_length,
 
 		.stream_dequeue           = virtio_transport_stream_dequeue,
 		.stream_enqueue           = virtio_transport_stream_enqueue,
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index b769fc258931..e6903c719964 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -797,6 +797,24 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
 }
 EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
 
+int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
+{
+	return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_cid);
+
+int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
+{
+	return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_port);
+
+int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
+{
+	return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_length);
+
 bool virtio_transport_dgram_allow(u32 cid, u32 port)
 {
 	return false;
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index b370070194fa..bbc63826bf48 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -1731,57 +1731,40 @@ static int vmci_transport_dgram_enqueue(
 	return err - sizeof(*dg);
 }
 
-static int vmci_transport_dgram_dequeue(struct vsock_sock *vsk,
-					struct msghdr *msg, size_t len,
-					int flags)
+static int vmci_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
 {
-	int err;
 	struct vmci_datagram *dg;
-	size_t payload_len;
-	struct sk_buff *skb;
 
-	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
-		return -EOPNOTSUPP;
+	dg = (struct vmci_datagram *)skb->data;
+	if (!dg)
+		return -EINVAL;
 
-	/* Retrieve the head sk_buff from the socket's receive queue. */
-	err = 0;
-	skb = skb_recv_datagram(&vsk->sk, flags, &err);
-	if (!skb)
-		return err;
+	*cid = dg->src.context;
+	return 0;
+}
+
+static int vmci_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
+{
+	struct vmci_datagram *dg;
 
 	dg = (struct vmci_datagram *)skb->data;
 	if (!dg)
-		/* err is 0, meaning we read zero bytes. */
-		goto out;
-
-	payload_len = dg->payload_size;
-	/* Ensure the sk_buff matches the payload size claimed in the packet. */
-	if (payload_len != skb->len - sizeof(*dg)) {
-		err = -EINVAL;
-		goto out;
-	}
+		return -EINVAL;
 
-	if (payload_len > len) {
-		payload_len = len;
-		msg->msg_flags |= MSG_TRUNC;
-	}
+	*port = dg->src.resource;
+	return 0;
+}
 
-	/* Place the datagram payload in the user's iovec. */
-	err = skb_copy_datagram_msg(skb, sizeof(*dg), msg, payload_len);
-	if (err)
-		goto out;
+static int vmci_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
+{
+	struct vmci_datagram *dg;
 
-	if (msg->msg_name) {
-		/* Provide the address of the sender. */
-		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
-		vsock_addr_init(vm_addr, dg->src.context, dg->src.resource);
-		msg->msg_namelen = sizeof(*vm_addr);
-	}
-	err = payload_len;
+	dg = (struct vmci_datagram *)skb->data;
+	if (!dg)
+		return -EINVAL;
 
-out:
-	skb_free_datagram(&vsk->sk, skb);
-	return err;
+	*len = dg->payload_size;
+	return 0;
 }
 
 static bool vmci_transport_dgram_allow(u32 cid, u32 port)
@@ -2040,9 +2023,12 @@ static struct vsock_transport vmci_transport = {
 	.release = vmci_transport_release,
 	.connect = vmci_transport_connect,
 	.dgram_bind = vmci_transport_dgram_bind,
-	.dgram_dequeue = vmci_transport_dgram_dequeue,
 	.dgram_enqueue = vmci_transport_dgram_enqueue,
 	.dgram_allow = vmci_transport_dgram_allow,
+	.dgram_get_cid = vmci_transport_dgram_get_cid,
+	.dgram_get_port = vmci_transport_dgram_get_port,
+	.dgram_get_length = vmci_transport_dgram_get_length,
+	.dgram_payload_offset = sizeof(struct vmci_datagram),
 	.stream_dequeue = vmci_transport_stream_dequeue,
 	.stream_enqueue = vmci_transport_stream_enqueue,
 	.stream_has_data = vmci_transport_stream_has_data,
diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index 5c6360df1f31..2f3cabc79ee5 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -62,9 +62,11 @@ static struct virtio_transport loopback_transport = {
 		.cancel_pkt               = vsock_loopback_cancel_pkt,
 
 		.dgram_bind               = virtio_transport_dgram_bind,
-		.dgram_dequeue            = virtio_transport_dgram_dequeue,
 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
 		.dgram_allow              = virtio_transport_dgram_allow,
+		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
+		.dgram_get_port		  = virtio_transport_dgram_get_port,
+		.dgram_get_length	  = virtio_transport_dgram_get_length,
 
 		.stream_dequeue           = virtio_transport_stream_dequeue,
 		.stream_enqueue           = virtio_transport_stream_enqueue,

-- 
2.30.2


^ permalink raw reply related

* [PATCH 1/1] scsi: storvsc: Always set no_report_opcodes
From: Michael Kelley @ 2023-06-09 20:38 UTC (permalink / raw)
  To: kys, martin.petersen, longli, wei.liu, decui, jejb, linux-hyperv,
	linux-kernel, linux-scsi
  Cc: mikelley

Hyper-V synthetic SCSI devices do not support the MAINTENANCE_IN SCSI
command, so scsi_report_opcode() always fails, resulting in messages
like this:

hv_storvsc <guid>: tag#205 cmd 0xa3 status: scsi 0x2 srb 0x86 hv 0xc0000001

The recently added support for command duration limits calls
scsi_report_opcode() four times as each device comes online, which
significantly increases the number of messages logged in a system with
many disks.

Fix the problem by always marking Hyper-V synthetic SCSI devices as
not supporting scsi_report_opcode(). With this setting, the
MAINTENANCE_IN SCSI command is not issued and no messages are logged.
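
The effect of the flag can be modeled in userspace as below. This is a
simplified sketch, not the SCSI core: struct demo_sdev and both demo_
functions are hypothetical, and the only behavior assumed is the one
described above, namely that the query returns early when
no_report_opcodes is set, so the failing command is never sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified device model. */
struct demo_sdev {
	bool no_report_opcodes;
	unsigned int commands_issued;
};

/* Stand-in for sending MAINTENANCE_IN to a device that rejects it. */
int demo_issue_maintenance_in(struct demo_sdev *sdev)
{
	sdev->commands_issued++;
	return -1;	/* the device fails the command, which would be logged */
}

/* Stand-in for the opcode query: bail out before issuing anything
 * when the device is marked as not supporting it.
 */
int demo_report_opcode(struct demo_sdev *sdev)
{
	assert(sdev != NULL);

	if (sdev->no_report_opcodes)
		return -1;

	return demo_issue_maintenance_in(sdev);
}
```

Calling demo_report_opcode() four times on a device without the flag
issues four failing commands; with the flag set, none are issued.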

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
---
 drivers/scsi/storvsc_drv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index e6bc622..659196a 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1567,6 +1567,8 @@ static int storvsc_device_configure(struct scsi_device *sdevice)
 {
 	blk_queue_rq_timeout(sdevice->request_queue, (storvsc_timeout * HZ));
 
+	/* storvsc devices don't support MAINTENANCE_IN SCSI cmd */
+	sdevice->no_report_opcodes = 1;
 	sdevice->no_write_same = 1;
 
 	/*
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH net-next] tcp: Make pingpong threshold tunable
From: Kuniyuki Iwashima @ 2023-06-09 19:53 UTC (permalink / raw)
  To: haiyangz
  Cc: ncardwell, atenart, bagasdotme, corbet, davem, dsahern, edumazet,
	kuba, kuniyu, kys, linux-doc, linux-hyperv, linux-kernel,
	liushixin2, maheshb, netdev, olaf, pabeni, simon.horman, soheil,
	stephen, tim.gardner, vkuznets, weiwan, ycheng, ykaliuta
In-Reply-To: <CADVnQykbSQTrNtpFm8YVgGY929mmzY2zSQ2-KxGmNthYyR9GLg@mail.gmail.com>

From: Neal Cardwell <ncardwell@google.com>
Date: Fri, 9 Jun 2023 15:16:00 -0400
> On Fri, Jun 9, 2023 at 12:26 PM Haiyang Zhang <haiyangz@microsoft.com> wrote:
> 
> Regarding the patch title:
> > [PATCH net-next] tcp: Make pingpong threshold tunable
> 
> There are many ways to make something tunable these days, including
> BPF, setsockopt(), and sysctl. :-) This patch only uses sysctl. Please
> consider a more clear/specific title, like:
> 
>    [PATCH net-next] tcp: set pingpong threshold via sysctl
> 
> > TCP pingpong threshold is 1 by default. But some applications, like SQL DB
> > may prefer a higher pingpong threshold to activate delayed acks in quick
> > ack mode for better performance.
> >
> > The pingpong threshold and related code were changed to 3 in the year
> > 2019, and reverted to 1 in the year 2022.
> 
> Please include the specific commit, like:
> 
> The pingpong threshold and related code were changed to 3 in the year
>  2019 in:
>    commit 4a41f453bedf ("tcp: change pingpong threshold to 3")
> and reverted to 1 in the year 2022 in:
>   commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"")
> 
> Then please make sure to use scripts/checkpatch.pl on your resulting
> patch to check the formatting of the commit references, among other
> things.
> 
> > There is no single value that
> > fits all applications.
> >
> > Add net.core.tcp_pingpong_thresh sysctl tunable,
> 
> For consistency, TCP sysctls should be in net.ipv4 rather than
> net.core. Yes, that is awkward, given IPv6 support. But consistency is
> very important here. :-)
> 
> > so it can be tuned for
> > optimal performance based on the application needs.
> >
> > Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> > ---
> >  Documentation/admin-guide/sysctl/net.rst |  8 ++++++++
> >  include/net/inet_connection_sock.h       | 14 +++++++++++---
> >  net/core/sysctl_net_core.c               |  9 +++++++++
> >  net/ipv4/tcp.c                           |  2 ++
> >  net/ipv4/tcp_output.c                    | 17 +++++++++++++++--
> >  5 files changed, 45 insertions(+), 5 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > index 4877563241f3..16f54be9461f 100644
> > --- a/Documentation/admin-guide/sysctl/net.rst
> > +++ b/Documentation/admin-guide/sysctl/net.rst
> > @@ -413,6 +413,14 @@ historical importance.
> >
> >  Default: 0
> >
> > +tcp_pingpong_thresh
> > +-------------------
> > +
> > +TCP pingpong threshold is 1 by default, but some application may need a higher
> > +threshold for optimal performance.
> > +
> > +Default: 1, min: 1, max: 3
> 
> If we want to make this tunable, it seems sad to make the max 3. I'd
> suggest making the max 255, since we have 8 bits of space anyway in
> the inet_csk(sk)->icsk_ack.pingpong field.
> 
> > +
> >  2. /proc/sys/net/unix - Parameters for Unix domain sockets
> >  ----------------------------------------------------------
> >
> > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > index c2b15f7e5516..e84e33ddae49 100644
> > --- a/include/net/inet_connection_sock.h
> > +++ b/include/net/inet_connection_sock.h
> > @@ -324,11 +324,11 @@ void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
> >
> >  struct dst_entry *inet_csk_update_pmtu(struct sock *sk, u32 mtu);
> >
> > -#define TCP_PINGPONG_THRESH    1
> > +extern int tcp_pingpong_thresh;
> 
> To match most TCP sysctls, this should be per-namespace, rather than global.

Also, please change int to u8.


> 
> Please follow a recent example by Eric, perhaps:
>  65466904b015f6eeb9225b51aeb29b01a1d4b59c
>   tcp: adjust TSO packet sizes based on min_rtt
> 
> 
> >
> >  static inline void inet_csk_enter_pingpong_mode(struct sock *sk)
> >  {
> > -       inet_csk(sk)->icsk_ack.pingpong = TCP_PINGPONG_THRESH;
> > +       inet_csk(sk)->icsk_ack.pingpong = tcp_pingpong_thresh;
> >  }
> 
>   inet_csk(sk)->icsk_ack.pingpong =  sock_net(sk)->sysctl_tcp_pingpong_thresh;

Let's use READ_ONCE(sock_net(sk)->sysctl_tcp_pingpong_thresh).
Same for other sysctl reads.
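
For illustration only, a userspace model of the per-namespace u8 read
might look like the sketch below. DEMO_READ_ONCE() is a rough
approximation of the kernel macro (a single volatile load, using the
GNU C __typeof__ extension), and struct demo_net, struct demo_sock and
the demo_ helpers are hypothetical stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace approximation of READ_ONCE(): force one volatile load. */
#define DEMO_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical per-namespace settings, standing in for struct net. */
struct demo_net {
	uint8_t sysctl_tcp_pingpong_thresh;
};

/* Hypothetical socket model: its namespace plus the pingpong counter. */
struct demo_sock {
	struct demo_net *net;
	uint8_t pingpong;
};

void demo_enter_pingpong_mode(struct demo_sock *sk)
{
	assert(sk != NULL && sk->net != NULL);
	sk->pingpong = DEMO_READ_ONCE(sk->net->sysctl_tcp_pingpong_thresh);
}

bool demo_in_pingpong_mode(const struct demo_sock *sk)
{
	assert(sk != NULL && sk->net != NULL);
	return sk->pingpong >=
	       DEMO_READ_ONCE(sk->net->sysctl_tcp_pingpong_thresh);
}
```

Two sockets with the same counter value can then be in different modes
if their namespaces carry different thresholds, which is the point of
making the sysctl per-namespace.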


> 
> >  static inline void inet_csk_exit_pingpong_mode(struct sock *sk)
> > @@ -338,7 +338,15 @@ static inline void inet_csk_exit_pingpong_mode(struct sock *sk)
> >
> >  static inline bool inet_csk_in_pingpong_mode(struct sock *sk)
> >  {
> > -       return inet_csk(sk)->icsk_ack.pingpong >= TCP_PINGPONG_THRESH;
> > +       return inet_csk(sk)->icsk_ack.pingpong >= tcp_pingpong_thresh;
> > +}
> 
> Again, sock_net(sk)->sysctl_tcp_pingpong_thresh rather than tcp_pingpong_thresh.
> 
> > +static inline void inet_csk_inc_pingpong_cnt(struct sock *sk)
> > +{
> > +       struct inet_connection_sock *icsk = inet_csk(sk);
> > +
> > +       if (icsk->icsk_ack.pingpong < U8_MAX)
> > +               icsk->icsk_ack.pingpong++;
> >  }
> >
> >  static inline bool inet_csk_has_ulp(struct sock *sk)
> > diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> > index 782273bb93c2..b5253567f2bd 100644
> > --- a/net/core/sysctl_net_core.c
> > +++ b/net/core/sysctl_net_core.c
> > @@ -653,6 +653,15 @@ static struct ctl_table net_core_table[] = {
> 
> Again, in net.ipv4, not net.core.
> 
> >                 .proc_handler   = proc_dointvec_minmax,
> >                 .extra1         = SYSCTL_ZERO,
> >         },
> > +       {
> > +               .procname       = "tcp_pingpong_thresh",
> > +               .data           = &tcp_pingpong_thresh,
> > +               .maxlen         = sizeof(int),
> > +               .mode           = 0644,
> > +               .proc_handler   = proc_dointvec_minmax,
> > +               .extra1         = SYSCTL_ONE,
> > +               .extra2         = SYSCTL_THREE,
> 
> Please make the max U8_MAX to allow more flexibility (since we have 8
> bits of space anyway in the inet_csk(sk)->icsk_ack.pingpong field).

Please use proc_dou8vec_minmax(), then you can drop .extra2.

		.maxlen		= sizeof(u8),
		.mode		= 0644,
		.proc_handler	= proc_dou8vec_minmax,
		.extra1         = SYSCTL_ONE,

Thanks,
Kuniyuki

> 
> > +       },
> >         { }
> >  };
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 53b7751b68e1..dcd143193d41 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -308,6 +308,8 @@ EXPORT_SYMBOL(tcp_have_smc);
> >  struct percpu_counter tcp_sockets_allocated ____cacheline_aligned_in_smp;
> >  EXPORT_SYMBOL(tcp_sockets_allocated);
> >
> > +int tcp_pingpong_thresh __read_mostly = 1;
> > +
> 
> Again, per-network-namespace. You will need to initialize the
> per-netns value in tcp_sk_init(). Again, see Eric's
> 65466904b015f6eeb9225b51aeb29b01a1d4b59c commit for an example.
> 
> >   * TCP splice context
> >   */
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index cfe128b81a01..576d21621778 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -167,12 +167,25 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
> >         if (tcp_packets_in_flight(tp) == 0)
> >                 tcp_ca_event(sk, CA_EVENT_TX_START);
> >
> > +       /* If tcp_pingpong_thresh > 1, and
> > +        * this is the first data packet sent in response to the
> > +        * previous received data,
> > +        * and it is a reply for ato after last received packet,
> > +        * increase pingpong count.
> > +        */
> > +       if (tcp_pingpong_thresh > 1 &&
> > +           before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
> > +           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> > +               inet_csk_inc_pingpong_cnt(sk);
> > +
> 
> Introducing this new code re-introduces a bug fixed in 4d8f24eeedc5.
> As that commit description noted:
> 
>     This to-be-reverted commit was meant to apply a stricter rule for the
>     stack to enter pingpong mode. However, the condition used to check for
>     interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
>     jiffy based and might be too coarse, which delays the stack entering
>     pingpong mode.
>     We revert this patch so that we no longer use the above condition to
>     determine interactive session,
> 
> >         tp->lsndtime = now;
> >
> > -       /* If it is a reply for ato after last received
> > +       /* If tcp_pingpong_thresh == 1, and
> 
> Please remove the "If tcp_pingpong_thresh == 1, and" part, since this
> is the correct code path no matter the value of the threshold.
> 
> > +        * it is a reply for ato after last received
> >          * packet, enter pingpong mode.
> >          */
> > -       if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> > +       if (tcp_pingpong_thresh == 1 &&
> 
> Please remove the "if (tcp_pingpong_thresh == 1 &&" part, since this
> is the correct code path no matter the value of the threshold.
> 
> > +           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> >                 inet_csk_enter_pingpong_mode(sk);
> 
> Please make this call inet_csk_inc_pingpong_cnt(), since this is the
> correct code path no matter the value of the threshold.

^ permalink raw reply

* Re: [PATCH net-next] tcp: Make pingpong threshold tunable
From: Neal Cardwell @ 2023-06-09 19:16 UTC (permalink / raw)
  To: Haiyang Zhang
  Cc: linux-hyperv, netdev, kys, olaf, vkuznets, davem, weiwan,
	tim.gardner, corbet, edumazet, kuba, pabeni, dsahern, atenart,
	bagasdotme, ykaliuta, kuniyu, stephen, simon.horman, maheshb,
	liushixin2, linux-doc, linux-kernel, Yuchung Cheng,
	Soheil Hassas Yeganeh
In-Reply-To: <1686327959-13478-1-git-send-email-haiyangz@microsoft.com>

On Fri, Jun 9, 2023 at 12:26 PM Haiyang Zhang <haiyangz@microsoft.com> wrote:

Regarding the patch title:
> [PATCH net-next] tcp: Make pingpong threshold tunable

There are many ways to make something tunable these days, including
BPF, setsockopt(), and sysctl. :-) This patch only uses sysctl. Please
consider a more clear/specific title, like:

   [PATCH net-next] tcp: set pingpong threshold via sysctl

> TCP pingpong threshold is 1 by default. But some applications, like SQL DB
> may prefer a higher pingpong threshold to activate delayed acks in quick
> ack mode for better performance.
>
> The pingpong threshold and related code were changed to 3 in the year
> 2019, and reverted to 1 in the year 2022.

Please include the specific commit, like:

The pingpong threshold and related code were changed to 3 in the year
 2019 in:
   commit 4a41f453bedf ("tcp: change pingpong threshold to 3")
and reverted to 1 in the year 2022 in:
  commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"")

Then please make sure to use scripts/checkpatch.pl on your resulting
patch to check the formatting of the commit references, among other
things.

> There is no single value that
> fits all applications.
>
> Add net.core.tcp_pingpong_thresh sysctl tunable,

For consistency, TCP sysctls should be in net.ipv4 rather than
net.core. Yes, that is awkward, given IPv6 support. But consistency is
very important here. :-)

> so it can be tuned for
> optimal performance based on the application needs.
>
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  Documentation/admin-guide/sysctl/net.rst |  8 ++++++++
>  include/net/inet_connection_sock.h       | 14 +++++++++++---
>  net/core/sysctl_net_core.c               |  9 +++++++++
>  net/ipv4/tcp.c                           |  2 ++
>  net/ipv4/tcp_output.c                    | 17 +++++++++++++++--
>  5 files changed, 45 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> index 4877563241f3..16f54be9461f 100644
> --- a/Documentation/admin-guide/sysctl/net.rst
> +++ b/Documentation/admin-guide/sysctl/net.rst
> @@ -413,6 +413,14 @@ historical importance.
>
>  Default: 0
>
> +tcp_pingpong_thresh
> +-------------------
> +
> +TCP pingpong threshold is 1 by default, but some application may need a higher
> +threshold for optimal performance.
> +
> +Default: 1, min: 1, max: 3

If we want to make this tunable, it seems sad to make the max 3. I'd
suggest making the max 255, since we have 8 bits of space anyway in
the inet_csk(sk)->icsk_ack.pingpong field.
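
To illustrate the range I have in mind, here is a rough userspace
sketch, not kernel code. The setter name and the macros are
hypothetical; the only point is that any value from 1 through U8_MAX
fits in the 8-bit field, and anything outside that range is rejected
the way a min/max-checking sysctl handler would reject it:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical bounds: min from the patch, max from the 8-bit field. */
#define PINGPONG_THRESH_MIN 1
#define PINGPONG_THRESH_MAX UINT8_MAX

/*
 * Hypothetical setter modelling a bounded sysctl write.
 * Returns 0 on success, -EINVAL if val is out of range,
 * in which case *thresh is left unchanged.
 */
static int set_pingpong_thresh(uint8_t *thresh, long val)
{
	assert(thresh);
	if (val < PINGPONG_THRESH_MIN || val > PINGPONG_THRESH_MAX)
		return -EINVAL;
	*thresh = (uint8_t)val;
	return 0;
}
```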

> +
>  2. /proc/sys/net/unix - Parameters for Unix domain sockets
>  ----------------------------------------------------------
>
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index c2b15f7e5516..e84e33ddae49 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -324,11 +324,11 @@ void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
>
>  struct dst_entry *inet_csk_update_pmtu(struct sock *sk, u32 mtu);
>
> -#define TCP_PINGPONG_THRESH    1
> +extern int tcp_pingpong_thresh;

To match most TCP sysctls, this should be per-namespace, rather than global.

Please follow a recent example by Eric, perhaps:
  commit 65466904b015 ("tcp: adjust TSO packet sizes based on min_rtt")

>
>  static inline void inet_csk_enter_pingpong_mode(struct sock *sk)
>  {
> -       inet_csk(sk)->icsk_ack.pingpong = TCP_PINGPONG_THRESH;
> +       inet_csk(sk)->icsk_ack.pingpong = tcp_pingpong_thresh;
>  }

  inet_csk(sk)->icsk_ack.pingpong = sock_net(sk)->ipv4.sysctl_tcp_pingpong_thresh;

>  static inline void inet_csk_exit_pingpong_mode(struct sock *sk)
> @@ -338,7 +338,15 @@ static inline void inet_csk_exit_pingpong_mode(struct sock *sk)
>
>  static inline bool inet_csk_in_pingpong_mode(struct sock *sk)
>  {
> -       return inet_csk(sk)->icsk_ack.pingpong >= TCP_PINGPONG_THRESH;
> +       return inet_csk(sk)->icsk_ack.pingpong >= tcp_pingpong_thresh;
> +}

Again, sock_net(sk)->ipv4.sysctl_tcp_pingpong_thresh rather than
tcp_pingpong_thresh.

> +static inline void inet_csk_inc_pingpong_cnt(struct sock *sk)
> +{
> +       struct inet_connection_sock *icsk = inet_csk(sk);
> +
> +       if (icsk->icsk_ack.pingpong < U8_MAX)
> +               icsk->icsk_ack.pingpong++;
>  }
>
>  static inline bool inet_csk_has_ulp(struct sock *sk)
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index 782273bb93c2..b5253567f2bd 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -653,6 +653,15 @@ static struct ctl_table net_core_table[] = {

Again, in net.ipv4, not net.core.

>                 .proc_handler   = proc_dointvec_minmax,
>                 .extra1         = SYSCTL_ZERO,
>         },
> +       {
> +               .procname       = "tcp_pingpong_thresh",
> +               .data           = &tcp_pingpong_thresh,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = SYSCTL_ONE,
> +               .extra2         = SYSCTL_THREE,

Please make the max U8_MAX to allow more flexibility (since we have 8
bits of space anyway in the inet_csk(sk)->icsk_ack.pingpong field).

> +       },
>         { }
>  };
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 53b7751b68e1..dcd143193d41 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -308,6 +308,8 @@ EXPORT_SYMBOL(tcp_have_smc);
>  struct percpu_counter tcp_sockets_allocated ____cacheline_aligned_in_smp;
>  EXPORT_SYMBOL(tcp_sockets_allocated);
>
> +int tcp_pingpong_thresh __read_mostly = 1;
> +

Again, this should be per-network-namespace rather than global. You
will need to initialize the per-netns value in tcp_sk_init(). Again,
see Eric's commit 65466904b015 ("tcp: adjust TSO packet sizes based on
min_rtt") for an example.
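
As a rough userspace sketch of the per-netns shape (not kernel code):
the structs below are heavily simplified mocks, the "_sketch" function
names are hypothetical, and I am assuming the new field would sit next
to the other per-netns TCP sysctls under the netns ipv4 struct. Each
namespace gets its own default at init time, and each socket reads the
value from the namespace it belongs to:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified mocks; the real kernel structs have many more fields. */
struct netns_ipv4 { uint8_t sysctl_tcp_pingpong_thresh; };
struct net { struct netns_ipv4 ipv4; };
struct sock { struct net *sk_net; };

/* Models the per-netns default that tcp_sk_init() would set up. */
static void tcp_sk_init_sketch(struct net *net)
{
	assert(net);
	net->ipv4.sysctl_tcp_pingpong_thresh = 1;
}

/* Mock of sock_net(): the namespace this socket lives in. */
static struct net *sock_net_sketch(const struct sock *sk)
{
	return sk->sk_net;
}

/* The threshold is looked up per socket, via its namespace. */
static uint8_t pingpong_thresh(const struct sock *sk)
{
	return sock_net_sketch(sk)->ipv4.sysctl_tcp_pingpong_thresh;
}
```

With this layout, tuning one namespace does not change the behavior of
sockets in any other namespace.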

>   * TCP splice context
>   */
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index cfe128b81a01..576d21621778 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -167,12 +167,25 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
>         if (tcp_packets_in_flight(tp) == 0)
>                 tcp_ca_event(sk, CA_EVENT_TX_START);
>
> +       /* If tcp_pingpong_thresh > 1, and
> +        * this is the first data packet sent in response to the
> +        * previous received data,
> +        * and it is a reply for ato after last received packet,
> +        * increase pingpong count.
> +        */
> +       if (tcp_pingpong_thresh > 1 &&
> +           before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
> +           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> +               inet_csk_inc_pingpong_cnt(sk);
> +

This new code reintroduces the bug that was fixed in 4d8f24eeedc5.
As that commit's description noted:

    This to-be-reverted commit was meant to apply a stricter rule for the
    stack to enter pingpong mode. However, the condition used to check for
    interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
    jiffy based and might be too coarse, which delays the stack entering
    pingpong mode.
    We revert this patch so that we no longer use the above condition to
    determine interactive session,
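
To make the coarseness concrete, here is a small userspace sketch.
before() below uses the same wraparound-safe comparison as the kernel
helper of that name; the microsecond-to-jiffy conversion is a
hypothetical helper added only for illustration. When the previous
send and the following receive land in the same jiffy, the two
timestamps are equal and before() returns false, even though the send
really did come first:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is earlier than b" on 32-bit timestamps. */
static inline bool before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Hypothetical helper: truncate a microsecond clock to jiffies. */
static inline uint32_t usecs_to_jiffies_sketch(uint64_t usecs, uint32_t hz)
{
	assert(hz > 0);
	return (uint32_t)(usecs * hz / 1000000ULL);
}
```

For example, with HZ=1000 a send at 10100 usec and a receive at 10600
usec both map to jiffy 10, so the check cannot see the ordering.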

>         tp->lsndtime = now;
>
> -       /* If it is a reply for ato after last received
> +       /* If tcp_pingpong_thresh == 1, and

Please remove the "If tcp_pingpong_thresh == 1, and" part of the
comment, since this is the correct code path no matter the value of
the threshold.

> +        * it is a reply for ato after last received
>          * packet, enter pingpong mode.
>          */
> -       if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> +       if (tcp_pingpong_thresh == 1 &&

Please remove the "tcp_pingpong_thresh == 1 &&" condition, since this
is the correct code path no matter the value of the threshold.

> +           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
>                 inet_csk_enter_pingpong_mode(sk);

Please make this call inet_csk_inc_pingpong_cnt() instead of
inet_csk_enter_pingpong_mode(), so that this one code path is correct
no matter the value of the threshold.
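
Taken together, the logic I am suggesting looks roughly like the
following userspace model. This is a sketch, not the kernel code: the
struct and function names are hypothetical stand-ins for the icsk_ack
fields and helpers, and the threshold is passed in explicitly instead
of being read from the netns:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the few icsk_ack fields involved. */
struct ack_state {
	uint8_t  pingpong;	/* count of prompt replies, saturating */
	uint32_t lrcvtime;	/* time the last data packet was received */
	uint32_t ato;		/* delayed-ACK timeout */
};

/* Saturating increment, like inet_csk_inc_pingpong_cnt() in the patch. */
static void inc_pingpong_cnt(struct ack_state *a)
{
	if (a->pingpong < UINT8_MAX)
		a->pingpong++;
}

/* Model of the tail of tcp_event_data_sent(): one check, any threshold. */
static void event_data_sent(struct ack_state *a, uint32_t now)
{
	assert(a);
	/* A reply sent within ato of the last receive counts as prompt. */
	if ((uint32_t)(now - a->lrcvtime) < a->ato)
		inc_pingpong_cnt(a);
}

/* Pingpong mode is entered once the count reaches the threshold. */
static bool in_pingpong_mode(const struct ack_state *a, uint8_t thresh)
{
	return a->pingpong >= thresh;
}
```

With a threshold of 1 this behaves like the current code, since the
first prompt reply already brings the count to the threshold; with a
larger threshold the connection simply needs more prompt replies.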

thanks,
neal
