Linux-HyperV List
 help / color / mirror / Atom feed
* Re: [PATCH v2] printk: Add a short description string to kmsg_dump()
From: Petr Mladek @ 2024-07-03  8:22 UTC (permalink / raw)
  To: Jocelyn Falempe
  Cc: Kees Cook, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Naveen N. Rao, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Daniel Vetter, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Miquel Raynal,
	Richard Weinberger, Vignesh Raghavendra, Tony Luck,
	Guilherme G. Piccoli, Steven Rostedt, John Ogness,
	Sergey Senozhatsky, Andrew Morton, Jani Nikula,
	Greg Kroah-Hartman, Kefeng Wang, Thomas Gleixner, Uros Bizjak,
	linuxppc-dev, linux-kernel, dri-devel, linux-hyperv, linux-mtd,
	linux-hardening
In-Reply-To: <10ea2ea1-e692-443e-8b48-ce9884e8b942@redhat.com>

On Wed 2024-07-03 09:57:26, Jocelyn Falempe wrote:
> 
> 
> On 02/07/2024 22:29, Kees Cook wrote:
> > On Tue, Jul 02, 2024 at 02:26:04PM +0200, Jocelyn Falempe wrote:
> > > kmsg_dump doesn't forward the panic reason string to the kmsg_dumper
> > > callback.
> > > This patch adds a new struct kmsg_dump_detail, that will hold the
> > > reason and description, and pass it to the dump() callback.
> > 
> > Thanks! I like this much better. :)
> > 
> > > 
> > > To avoid updating all kmsg_dump() call, it adds a kmsg_dump_desc()
> > > function and a macro for backward compatibility.
> > > 
> > > I've written this for drm_panic, but it can be useful for other
> > > kmsg_dumper.
> > > It allows to see the panic reason, like "sysrq triggered crash"
> > > or "VFS: Unable to mount root fs on xxxx" on the drm panic screen.
> > > 
> > > v2:
> > >   * Use a struct kmsg_dump_detail to hold the reason and description
> > >     pointer, for more flexibility if we want to add other parameters.
> > >     (Kees Cook)
> > >   * Fix powerpc/nvram_64 build, as I didn't update the forward
> > >     declaration of oops_to_nvram()
> > 
> > The versioning history commonly goes after the "---".
> 
> ok, I was not aware of this.
> > 
> > > [...]
> > > diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
> > > index 906521c2329c..65f5a47727bc 100644
> > > --- a/include/linux/kmsg_dump.h
> > > +++ b/include/linux/kmsg_dump.h
> > > @@ -39,6 +39,17 @@ struct kmsg_dump_iter {
> > >   	u64	next_seq;
> > >   };
> > > +/**
> > > + *struct kmsg_dump_detail - kernel crash detail
> > 
> > Is kern-doc happy with this? I think there is supposed to be a space
> > between the "*" and the first word:
> > 
> >   /**
> >    * struct kmsg...
> > 
> > 
> Good catch, yes there is a space missing.
> 
> I just checked with "make htmldocs", and in fact include/linux/kmsg_dump.h
> is not indexed for kernel documentation.
> And you can't find the definition of struct kmsg_dumper in the online doc.
> https://www.kernel.org/doc/html/latest/search.html?q=kmsg_dumper
> 
> > Otherwise looks good to me!
> > 
> 
> Thanks.
> 
> As this patch touches different subsystems, do you know on which tree it
> should land ?

Andrew usually takes patches against kernel/panic.c.

Or you could take it via the DRM tree, especially if you already have the code
using the string.

Also I could take it via the printk tree. The only complication is
that I am going to be away the following two weeks and would come
back in the middle of the merge window. I do not expect much problems
with this change but...

Best Regards,
Petr

^ permalink raw reply

* [PATCH v2] drivers: hv: vmbus: Add missing check for dma_set_mask in vmbus_device_register()
From: Haoxiang Li @ 2024-07-03  8:42 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, mikelley, parri.andrea
  Cc: linux-hyperv, linux-kernel, Haoxiang Li

child_device_obj->device cannot perform DMA properly if dma_set_mask()
returns non-zero. Add check for dma_set_mask() and return the error if it
fails. To do the right cleanup, call dma_set_mask() after device_register()
so that we can use existent error path of vmbus_device_register().

Fixes: 3a5469582c24 ("Drivers: hv: vmbus: Fix initialization of device object in vmbus_device_register()")
Signed-off-by: Haoxiang Li <make24@iscas.ac.cn>
---
Changes in v2:
Thanks for your comments. They help a lot. It is not sufficient to do the
cleanup just with kfree(). The memory allocated by dev_set_name()
should also be freed, which is done by put_device() (finally in
kobject_cleanup() [1]). I think performing a complete cleanup within
vmbus_device_register() would be verbose and detrimental to code
maintenance. I suggest calling dma_set_mask() after device_register(),
so that proper error handling can be achieved using the existent error path
of vmbus_device_register(), i.e., err_dev_unregister [2]. Moreover,
I found a similar usage in dmapool_checks() [3], which calls
dma_set_mask_and_coherent() after device_register(). I believe the
modification is workable.

Reference link:
[1]https://github.com/torvalds/linux/blob/master/lib/kobject.c#L695
[2]https://github.com/torvalds/linux/blob/master/drivers/hv/vmbus_drv.c#L1933
[3]https://github.com/torvalds/linux/blob/master/mm/dmapool_test.c#L105
---
 drivers/hv/vmbus_drv.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 12a707ab73f8..f3999d8afd77 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1897,7 +1897,6 @@ int vmbus_device_register(struct hv_device *child_device_obj)
 
 	child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
 	child_device_obj->device.dma_mask = &child_device_obj->dma_mask;
-	dma_set_mask(&child_device_obj->device, DMA_BIT_MASK(64));
 
 	/*
 	 * Register with the LDM. This will kick off the driver/device
@@ -1910,6 +1909,12 @@ int vmbus_device_register(struct hv_device *child_device_obj)
 		return ret;
 	}
 
+	ret = dma_set_mask(&child_device_obj->device, DMA_BIT_MASK(64));
+	if (ret) {
+		pr_err("64-bit DMA enable failed!\n");
+		goto err_dev_unregister;
+	}
+
 	child_device_obj->channels_kset = kset_create_and_add("channels",
 							      NULL, kobj);
 	if (!child_device_obj->channels_kset) {
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH 00/18] Introducing Core Building Blocks for Hyper-V VSM Emulation
From: Nicolas Saenz Julienne @ 2024-07-03  9:55 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, seanjc, linux-kernel, kvm, vkuznets, linux-doc,
	linux-hyperv, linux-arch, nsaenz, linux-trace-kernel, graf, dwmw2,
	pdurrant, mlevitsk, jgowans, corbet, decui, tglx, mingo, bp,
	dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

Hi Sean,

On Sun Jun 9, 2024 at 3:49 PM UTC, Nicolas Saenz Julienne wrote:
> This series introduces core KVM functionality necessary to emulate Hyper-V's
> Virtual Secure Mode in a Virtual Machine Monitor (VMM).

Just wanted to make sure the series is in your radar.

Thanks,
Nicolas

^ permalink raw reply

* Re: [PATCH 00/18] Introducing Core Building Blocks for Hyper-V VSM Emulation
From: Vitaly Kuznetsov @ 2024-07-03 12:48 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, seanjc
  Cc: pbonzini, seanjc, linux-kernel, kvm, linux-doc, linux-hyperv,
	linux-arch, nsaenz, linux-trace-kernel, graf, dwmw2, pdurrant,
	mlevitsk, jgowans, corbet, decui, tglx, mingo, bp, dave.hansen,
	x86, amoorthy
In-Reply-To: <D2FTASL4CXLN.32GYJ8QZH4OCR@amazon.com>

Nicolas Saenz Julienne <nsaenz@amazon.com> writes:

> Hi Sean,
>
> On Sun Jun 9, 2024 at 3:49 PM UTC, Nicolas Saenz Julienne wrote:
>> This series introduces core KVM functionality necessary to emulate Hyper-V's
>> Virtual Secure Mode in a Virtual Machine Monitor (VMM).
>
> Just wanted to make sure the series is in your radar.
>

Not Sean here but I was planning to take a look at least at Hyper-V
parts of it next week.

-- 
Vitaly


^ permalink raw reply

* Re: [PATCH 00/18] Introducing Core Building Blocks for Hyper-V VSM Emulation
From: Nicolas Saenz Julienne @ 2024-07-03 13:18 UTC (permalink / raw)
  To: Vitaly Kuznetsov, seanjc
  Cc: pbonzini, linux-kernel, kvm, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, pdurrant, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <87ikxm63px.fsf@redhat.com>

Hi Vitaly,

On Wed Jul 3, 2024 at 12:48 PM UTC, Vitaly Kuznetsov wrote:
> Nicolas Saenz Julienne <nsaenz@amazon.com> writes:
>
> > Hi Sean,
> >
> > On Sun Jun 9, 2024 at 3:49 PM UTC, Nicolas Saenz Julienne wrote:
> >> This series introduces core KVM functionality necessary to emulate Hyper-V's
> >> Virtual Secure Mode in a Virtual Machine Monitor (VMM).
> >
> > Just wanted to make sure the series is in your radar.
> >
>
> Not Sean here but I was planning to take a look at least at Hyper-V
> parts of it next week.

Thanks for the update.

Nicolas

^ permalink raw reply

* Re: [PATCH v2] drivers: hv: vmbus: Add missing check for dma_set_mask in vmbus_device_register()
From: Markus Elfring @ 2024-07-03 15:05 UTC (permalink / raw)
  To: Haoxiang Li, linux-hyperv, kernel-janitors, Andrea Parri,
	Dexuan Cui, Haiyang Zhang, K. Y. Srinivasan, Michael Kelley,
	Wei Liu
  Cc: LKML
In-Reply-To: <20240703084221.12057-1-make24@iscas.ac.cn>

> child_device_obj->device cannot perform DMA properly if dma_set_mask()
> returns non-zero. …

Can the repetition of another wording suggestion influence the software evolution?
  Direct memory access can not be properly performed any more
  after a dma_set_mask() call failed.


See also:
https://elixir.bootlin.com/linux/v6.10-rc6/source/kernel/dma/mapping.c#L804> Signed-off-by: Haoxiang Li <make24@iscas.ac.cn>

Under which circumstances will applications of the Developer's Certificate of Origin
be reconsidered any more (after three different names were presented so far)?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6#n398


Would you like to append parentheses to another function name in the summary phrase?

Regards,
Markus

^ permalink raw reply

* Re: [PATCH v2] printk: Add a short description string to kmsg_dump()
From: Kees Cook @ 2024-07-03 16:27 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Jocelyn Falempe, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Naveen N. Rao, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Daniel Vetter, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Miquel Raynal,
	Richard Weinberger, Vignesh Raghavendra, Tony Luck,
	Guilherme G. Piccoli, Steven Rostedt, John Ogness,
	Sergey Senozhatsky, Andrew Morton, Jani Nikula,
	Greg Kroah-Hartman, Kefeng Wang, Thomas Gleixner, Uros Bizjak,
	linuxppc-dev, linux-kernel, dri-devel, linux-hyperv, linux-mtd,
	linux-hardening
In-Reply-To: <ZoUKM9-RiOrv0_Vf@pathway.suse.cz>

On Wed, Jul 03, 2024 at 10:22:11AM +0200, Petr Mladek wrote:
> On Wed 2024-07-03 09:57:26, Jocelyn Falempe wrote:
> > 
> > 
> > On 02/07/2024 22:29, Kees Cook wrote:
> > > On Tue, Jul 02, 2024 at 02:26:04PM +0200, Jocelyn Falempe wrote:
> > > > kmsg_dump doesn't forward the panic reason string to the kmsg_dumper
> > > > callback.
> > > > This patch adds a new struct kmsg_dump_detail, that will hold the
> > > > reason and description, and pass it to the dump() callback.
> > > 
> > > Thanks! I like this much better. :)
> > > 
> > > > 
> > > > To avoid updating all kmsg_dump() call, it adds a kmsg_dump_desc()
> > > > function and a macro for backward compatibility.
> > > > 
> > > > I've written this for drm_panic, but it can be useful for other
> > > > kmsg_dumper.
> > > > It allows to see the panic reason, like "sysrq triggered crash"
> > > > or "VFS: Unable to mount root fs on xxxx" on the drm panic screen.
> > > > 
> > > > v2:
> > > >   * Use a struct kmsg_dump_detail to hold the reason and description
> > > >     pointer, for more flexibility if we want to add other parameters.
> > > >     (Kees Cook)
> > > >   * Fix powerpc/nvram_64 build, as I didn't update the forward
> > > >     declaration of oops_to_nvram()
> > > 
> > > The versioning history commonly goes after the "---".
> > 
> > ok, I was not aware of this.
> > > 
> > > > [...]
> > > > diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
> > > > index 906521c2329c..65f5a47727bc 100644
> > > > --- a/include/linux/kmsg_dump.h
> > > > +++ b/include/linux/kmsg_dump.h
> > > > @@ -39,6 +39,17 @@ struct kmsg_dump_iter {
> > > >   	u64	next_seq;
> > > >   };
> > > > +/**
> > > > + *struct kmsg_dump_detail - kernel crash detail
> > > 
> > > Is kern-doc happy with this? I think there is supposed to be a space
> > > between the "*" and the first word:
> > > 
> > >   /**
> > >    * struct kmsg...
> > > 
> > > 
> > Good catch, yes there is a space missing.
> > 
> > I just checked with "make htmldocs", and in fact include/linux/kmsg_dump.h
> > is not indexed for kernel documentation.
> > And you can't find the definition of struct kmsg_dumper in the online doc.
> > https://www.kernel.org/doc/html/latest/search.html?q=kmsg_dumper
> > 
> > > Otherwise looks good to me!
> > > 
> > 
> > Thanks.
> > 
> > As this patch touches different subsystems, do you know on which tree it
> > should land ?
> 
> Andrew usually takes patches against kernel/panic.c.
> 
> Or you could take it via the DRM tree, especially if you already have the code
> using the string.
> 
> Also I could take it via the printk tree. The only complication is
> that I am going to be away the following two weeks and would come
> back in the middle of the merge window. I do not expect much problems
> with this change but...

If DRM doesn't want to carry it, I can put it in through the pstore
tree. Let me know! :)

-- 
Kees Cook

^ permalink raw reply

* Re: [PATCH v2] printk: Add a short description string to kmsg_dump()
From: Jocelyn Falempe @ 2024-07-03 16:40 UTC (permalink / raw)
  To: Kees Cook, Petr Mladek
  Cc: Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Naveen N. Rao, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Daniel Vetter, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Miquel Raynal,
	Richard Weinberger, Vignesh Raghavendra, Tony Luck,
	Guilherme G. Piccoli, Steven Rostedt, John Ogness,
	Sergey Senozhatsky, Andrew Morton, Jani Nikula,
	Greg Kroah-Hartman, Kefeng Wang, Thomas Gleixner, Uros Bizjak,
	linuxppc-dev, linux-kernel, dri-devel, linux-hyperv, linux-mtd,
	linux-hardening
In-Reply-To: <202407030926.D5DA9B901D@keescook>



On 03/07/2024 18:27, Kees Cook wrote:
> On Wed, Jul 03, 2024 at 10:22:11AM +0200, Petr Mladek wrote:
>> On Wed 2024-07-03 09:57:26, Jocelyn Falempe wrote:
>>>
>>>
>>> On 02/07/2024 22:29, Kees Cook wrote:
>>>> On Tue, Jul 02, 2024 at 02:26:04PM +0200, Jocelyn Falempe wrote:
>>>>> kmsg_dump doesn't forward the panic reason string to the kmsg_dumper
>>>>> callback.
>>>>> This patch adds a new struct kmsg_dump_detail, that will hold the
>>>>> reason and description, and pass it to the dump() callback.
>>>>
>>>> Thanks! I like this much better. :)
>>>>
>>>>>
>>>>> To avoid updating all kmsg_dump() call, it adds a kmsg_dump_desc()
>>>>> function and a macro for backward compatibility.
>>>>>
>>>>> I've written this for drm_panic, but it can be useful for other
>>>>> kmsg_dumper.
>>>>> It allows to see the panic reason, like "sysrq triggered crash"
>>>>> or "VFS: Unable to mount root fs on xxxx" on the drm panic screen.
>>>>>
>>>>> v2:
>>>>>    * Use a struct kmsg_dump_detail to hold the reason and description
>>>>>      pointer, for more flexibility if we want to add other parameters.
>>>>>      (Kees Cook)
>>>>>    * Fix powerpc/nvram_64 build, as I didn't update the forward
>>>>>      declaration of oops_to_nvram()
>>>>
>>>> The versioning history commonly goes after the "---".
>>>
>>> ok, I was not aware of this.
>>>>
>>>>> [...]
>>>>> diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
>>>>> index 906521c2329c..65f5a47727bc 100644
>>>>> --- a/include/linux/kmsg_dump.h
>>>>> +++ b/include/linux/kmsg_dump.h
>>>>> @@ -39,6 +39,17 @@ struct kmsg_dump_iter {
>>>>>    	u64	next_seq;
>>>>>    };
>>>>> +/**
>>>>> + *struct kmsg_dump_detail - kernel crash detail
>>>>
>>>> Is kern-doc happy with this? I think there is supposed to be a space
>>>> between the "*" and the first word:
>>>>
>>>>    /**
>>>>     * struct kmsg...
>>>>
>>>>
>>> Good catch, yes there is a space missing.
>>>
>>> I just checked with "make htmldocs", and in fact include/linux/kmsg_dump.h
>>> is not indexed for kernel documentation.
>>> And you can't find the definition of struct kmsg_dumper in the online doc.
>>> https://www.kernel.org/doc/html/latest/search.html?q=kmsg_dumper
>>>
>>>> Otherwise looks good to me!
>>>>
>>>
>>> Thanks.
>>>
>>> As this patch touches different subsystems, do you know on which tree it
>>> should land ?
>>
>> Andrew usually takes patches against kernel/panic.c.
>>
>> Or you could take it via the DRM tree, especially if you already have the code
>> using the string.

If it's not taken in Andrew's tree next week, I will see if I can push 
it to the drm-misc tree. I think there is a very low chance of conflicts.

>>
>> Also I could take it via the printk tree. The only complication is
>> that I am going to be away the following two weeks and would come
>> back in the middle of the merge window. I do not expect much problems
>> with this change but...
> 
> If DRM doesn't want to carry it, I can put it in through the pstore
> tree. Let me know! :)
> 

Thanks for the proposition, I will see how it goes, it would be nice to 
have it in time for the v6.11 merge window.

Best regards,

-- 

Jocelyn


^ permalink raw reply

* RE: [PATCH v2] tools: hv: lsvmbus: change shebang to use python3
From: Michael Kelley @ 2024-07-03 16:44 UTC (permalink / raw)
  To: Anthony Nandaa, linux-hyperv@vger.kernel.org, decui@microsoft.com,
	wei.liu@kernel.org
  Cc: kys@microsoft.com
In-Reply-To: <20240702102250.13935-1-profnandaa@gmail.com>

From: Anthony Nandaa <profnandaa@gmail.com> Sent: Tuesday, July 2, 2024 3:23 AM
> 
> In many modern Linux distros, running `lsvmbus` returns the error:
> ```
> /usr/bin/env: 'python': No such file or directory
> ```
> because 'python' doesn't point anywhere.
> 
> Now that python2 has reached EOL as of January 1, 2020 and is no longer
> maintained[1], these distros have python3 instead.
> 
> Also, the script isn't executable by default because the permissions are
> set to mode 644.
> 
> Fix this by updating the shebang in the `lsvmbus` to use python3 instead
> of python. Also fix the permissions to be 755 so that is executable by
> default, which matches other similar scripts in `tools/hv`.
> 
> The script is also tested and verified that is compatible with
> python3.
> 
> [1] https://www.python.org/doc/sunset-python-2/
> 
> Signed-off-by: Anthony Nandaa <profnandaa@gmail.com>
> Reviewed-by: Wei Liu <wei.liu@kernel.org>
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> ---
> v2:
> * change the commit message body to conform to guidelines.
> ---
>  tools/hv/lsvmbus | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>  mode change 100644 => 100755 tools/hv/lsvmbus
> 
> diff --git a/tools/hv/lsvmbus b/tools/hv/lsvmbus
> old mode 100644
> new mode 100755
> index 55e7374bade0..23dcd8e705be
> --- a/tools/hv/lsvmbus
> +++ b/tools/hv/lsvmbus
> @@ -1,4 +1,4 @@
> -#!/usr/bin/env python
> +#!/usr/bin/env python3
>  # SPDX-License-Identifier: GPL-2.0
> 
>  import os
> --

Reviewed-by: Michael Kelley <mhklinux@outlook.com>

^ permalink raw reply

* Re: [PATCH v3] PCI: hv: Return zero, not garbage, when reading PCI_INTERRUPT_PIN
From: Wei Liu @ 2024-07-03 21:09 UTC (permalink / raw)
  To: Krzysztof Wilczyński
  Cc: Wei Liu, Linux on Hyper-V List, stable, Michael Kelley,
	K. Y. Srinivasan, Haiyang Zhang, Dexuan Cui, Lorenzo Pieralisi,
	Rob Herring, Bjorn Helgaas, Jake Oshins,
	open list:PCI NATIVE HOST BRIDGE AND ENDPOINT DRIVERS, open list
In-Reply-To: <20240703081247.GA4117643@rocinante>

On Wed, Jul 03, 2024 at 05:12:47PM +0900, Krzysztof Wilczyński wrote:
> Hello,
> 
> > > The intent of the code snippet is to always return 0 for both
> > > PCI_INTERRUPT_LINE and PCI_INTERRUPT_PIN.
> > > 
> > > The check misses PCI_INTERRUPT_PIN. This patch fixes that.
> > > 
> > > This is discovered by this call in VFIO:
> > > 
> > >     pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
> > > 
> > > The old code does not set *val to 0 because it misses the check for
> > > PCI_INTERRUPT_PIN. Garbage is returned in that case.
> [...]
> > 
> > Bjorn & other PCI maintainers, do you want to pick this up via your
> > tree?
> > 
> > I can pick this up via the hyperv tree if you prefer.
> 
> We will pick this up.  No worries.

Thank you very much!

> 
> 	Krzysztof
> 

^ permalink raw reply

* Re: [PATCH v2] drivers: hv: vmbus: Add missing check for dma_set_mask in vmbus_device_register()
From: Wei Liu @ 2024-07-03 21:16 UTC (permalink / raw)
  To: Haoxiang Li
  Cc: kys, haiyangz, wei.liu, decui, mikelley, parri.andrea,
	linux-hyperv, linux-kernel
In-Reply-To: <20240703084221.12057-1-make24@iscas.ac.cn>

On Wed, Jul 03, 2024 at 04:42:21PM +0800, Haoxiang Li wrote:
> child_device_obj->device cannot perform DMA properly if dma_set_mask()
> returns non-zero. Add check for dma_set_mask() and return the error if it
> fails. To do the right cleanup, call dma_set_mask() after device_register()
> so that we can use existent error path of vmbus_device_register().
> 
> Fixes: 3a5469582c24 ("Drivers: hv: vmbus: Fix initialization of device object in vmbus_device_register()")
> Signed-off-by: Haoxiang Li <make24@iscas.ac.cn>

The email address doesn't seem to match your name.  Why?

You should use your own email address to sign off the patches you write.

Thanks,
Wei.

> ---
> Changes in v2:
> Thanks for your comments. They help a lot. It is not sufficient to do the
> cleanup just with kfree(). The memory allocated by dev_set_name()
> should also be freed, which is done by put_device() (finally in
> kobject_cleanup() [1]). I think performing a complete cleanup within
> vmbus_device_register() would be verbose and detrimental to code
> maintenance. I suggest calling dma_set_mask() after device_register(),
> so that proper error handling can be achieved using the existent error path
> of vmbus_device_register(), i.e., err_dev_unregister [2]. Moreover,
> I found a similar usage in dmapool_checks() [3], which calls
> dma_set_mask_and_coherent() after device_register(). I believe the
> modification is workable.
> 
> Reference link:
> [1]https://github.com/torvalds/linux/blob/master/lib/kobject.c#L695
> [2]https://github.com/torvalds/linux/blob/master/drivers/hv/vmbus_drv.c#L1933
> [3]https://github.com/torvalds/linux/blob/master/mm/dmapool_test.c#L105
> ---
>  drivers/hv/vmbus_drv.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 12a707ab73f8..f3999d8afd77 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1897,7 +1897,6 @@ int vmbus_device_register(struct hv_device *child_device_obj)
>  
>  	child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
>  	child_device_obj->device.dma_mask = &child_device_obj->dma_mask;
> -	dma_set_mask(&child_device_obj->device, DMA_BIT_MASK(64));
>  
>  	/*
>  	 * Register with the LDM. This will kick off the driver/device
> @@ -1910,6 +1909,12 @@ int vmbus_device_register(struct hv_device *child_device_obj)
>  		return ret;
>  	}
>  
> +	ret = dma_set_mask(&child_device_obj->device, DMA_BIT_MASK(64));
> +	if (ret) {
> +		pr_err("64-bit DMA enable failed!\n");
> +		goto err_dev_unregister;
> +	}
> +
>  	child_device_obj->channels_kset = kset_create_and_add("channels",
>  							      NULL, kobj);
>  	if (!child_device_obj->channels_kset) {
> -- 
> 2.25.1
> 
> 

^ permalink raw reply

* Re: [PATCH v3] PCI: hv: Return zero, not garbage, when reading PCI_INTERRUPT_PIN
From: Krzysztof Wilczyński @ 2024-07-06  3:20 UTC (permalink / raw)
  To: Wei Liu
  Cc: Linux on Hyper-V List, stable, Michael Kelley, K. Y. Srinivasan,
	Haiyang Zhang, Dexuan Cui, Lorenzo Pieralisi, Rob Herring,
	Bjorn Helgaas, Jake Oshins,
	open list:PCI NATIVE HOST BRIDGE AND ENDPOINT DRIVERS, open list
In-Reply-To: <20240701202606.129606-1-wei.liu@kernel.org>

Hello,

> The intent of the code snippet is to always return 0 for both
> PCI_INTERRUPT_LINE and PCI_INTERRUPT_PIN.
> 
> The check misses PCI_INTERRUPT_PIN. This patch fixes that.
> 
> This is discovered by this call in VFIO:
> 
>     pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
> 
> The old code does not set *val to 0 because it misses the check for
> PCI_INTERRUPT_PIN. Garbage is returned in that case.

Applied to controller/hyperv, thank you!

[1/1] PCI: hv: Return zero, not garbage, when reading PCI_INTERRUPT_PIN
      https://git.kernel.org/pci/pci/c/fea93a3e5d5e

	Krzysztof

^ permalink raw reply

* Re: [PATCH 01/18] KVM: x86: hyper-v: Introduce XMM output support
From: Vitaly Kuznetsov @ 2024-07-08 14:59 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, linux-kernel, kvm
  Cc: pbonzini, seanjc, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-2-nsaenz@amazon.com>

Nicolas Saenz Julienne <nsaenz@amazon.com> writes:

> Prepare infrastructure to be able to return data through the XMM
> registers when Hyper-V hypercalls are issues in fast mode. The XMM
> registers are exposed to user-space through KVM_EXIT_HYPERV_HCALL and
> restored on successful hypercall completion.
>
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
>
> ---
>
> There was some discussion in the RFC about whether growing 'struct
> kvm_hyperv_exit' is ABI breakage. IMO it isn't:
> - There is padding in 'struct kvm_run' that ensures that a bigger
>   'struct kvm_hyperv_exit' doesn't alter the offsets within that struct.
> - Adding a new field at the bottom of the 'hcall' field within the
>   'struct kvm_hyperv_exit' should be fine as well, as it doesn't alter
>   the offsets within that struct either.
> - Ultimately, previous updates to 'struct kvm_hyperv_exit's hint that
>   its size isn't part of the uABI. It already grew when syndbg was
>   introduced.

Yes but SYNDBG exit comes with KVM_EXIT_HYPERV_SYNDBG. While I don't see
any immediate issues with the current approach, we may want to introduce
something like KVM_EXIT_HYPERV_HCALL_XMM: the userspace must be prepared
to handle this new information anyway and it is better to make
unprepared userspace fail with 'unknown exit' then to mishandle a
hypercall by ignoring XMM portion of the data.

>
>  Documentation/virt/kvm/api.rst     | 19 ++++++++++
>  arch/x86/include/asm/hyperv-tlfs.h |  2 +-
>  arch/x86/kvm/hyperv.c              | 56 +++++++++++++++++++++++++++++-
>  include/uapi/linux/kvm.h           |  6 ++++
>  4 files changed, 81 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index a71d91978d9ef..17893b330b76f 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -8893,3 +8893,22 @@ Ordering of KVM_GET_*/KVM_SET_* ioctls
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  
>  TBD
> +
> +10. Hyper-V CPUIDs
> +==================
> +
> +This section only applies to x86.

We can probably use 

:Architectures: x86

which we already use.

> +
> +New Hyper-V feature support is no longer being tracked through KVM
> +capabilities.  Userspace can check if a particular version of KVM supports a
> +feature using KMV_GET_SUPPORTED_HV_CPUID.  This section documents how Hyper-V
> +CPUIDs map to KVM functionality.
> +
> +10.1 HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE
> +------------------------------------------
> +
> +:Location: CPUID.40000003H:EDX[bit 15]
> +
> +This CPUID indicates that KVM supports retuning data to the guest in response
> +to a hypercall using the XMM registers. It also extends ``struct
> +kvm_hyperv_exit`` to allow passing the XMM data from userspace.

It's always good to document things, thanks! I'm, however, wondering
what should we document as part of KVM API. In the file, we already
have:
- "4.118 KVM_GET_SUPPORTED_HV_CPUID"
- "struct kvm_hyperv_exit" description in "5. The kvm_run structure"

The later should definitely get extended to cover XMM and I guess the
former can accomodate the 'no longer being tracked' comment. With that,
maybe there's no need for a new section? 

> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> index 3787d26810c1c..6a18c9f77d5fe 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -49,7 +49,7 @@
>  /* Support for physical CPU dynamic partitioning events is available*/
>  #define HV_X64_CPU_DYNAMIC_PARTITIONING_AVAILABLE	BIT(3)
>  /*
> - * Support for passing hypercall input parameter block via XMM
> + * Support for passing hypercall input and output parameter block via XMM
>   * registers is available
>   */
>  #define HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE		BIT(4)

This change of the comment is weird (or I may have forgotten something
important), could you please elaborate?. Currently, we have:

/*
 * Support for passing hypercall input parameter block via XMM
 * registers is available
 */
#define HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE         BIT(4)
...
/*
 * Support for returning hypercall output block via XMM
 * registers is available
 */
#define HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE                BIT(15)

which seems to be correct. TLFS also defines

Bit 4: XmmRegistersForFastHypercallAvailable

in CPUID 0x40000009.EDX (Nested Hypervisor Feature Identification) which
probably covers both but we don't set this leaf in KVM currently ...

> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 8a47f8541eab7..42f44546fe79c 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1865,6 +1865,7 @@ struct kvm_hv_hcall {
>  	u16 rep_idx;
>  	bool fast;
>  	bool rep;
> +	bool xmm_dirty;
>  	sse128_t xmm[HV_HYPERCALL_MAX_XMM_REGISTERS];
>  
>  	/*
> @@ -2396,9 +2397,49 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result)
>  	return ret;
>  }
>  
> +static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm)
> +{
> +	int reg;
> +
> +	kvm_fpu_get();
> +	for (reg = 0; reg < HV_HYPERCALL_MAX_XMM_REGISTERS; reg++) {
> +		const sse128_t data = sse128(xmm[reg].low, xmm[reg].high);
> +		_kvm_write_sse_reg(reg, &data);
> +	}
> +	kvm_fpu_put();
> +}
> +
> +static bool kvm_hv_is_xmm_output_hcall(u16 code)
> +{
> +	return false;
> +}
> +
> +static bool kvm_hv_xmm_output_allowed(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
> +
> +	return !hv_vcpu->enforce_cpuid ||
> +	       hv_vcpu->cpuid_cache.features_edx &
> +		       HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;
> +}
> +
>  static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
>  {
> -	return kvm_hv_hypercall_complete(vcpu, vcpu->run->hyperv.u.hcall.result);
> +	bool fast = !!(vcpu->run->hyperv.u.hcall.input & HV_HYPERCALL_FAST_BIT);
> +	u16 code = vcpu->run->hyperv.u.hcall.input & 0xffff;
> +	u64 result = vcpu->run->hyperv.u.hcall.result;
> +
> +	if (hv_result_success(result) && fast &&
> +	    kvm_hv_is_xmm_output_hcall(code)) {

Assuming hypercalls with XMM output are always 'fast', should we include
'fast' check in kvm_hv_is_xmm_output_hcall()?

> +		if (unlikely(!kvm_hv_xmm_output_allowed(vcpu))) {
> +			kvm_queue_exception(vcpu, UD_VECTOR);
> +			return 1;
> +		}
> +
> +		kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm);
> +	}
> +
> +	return kvm_hv_hypercall_complete(vcpu, result);
>  }
>  
>  static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
> @@ -2553,6 +2594,7 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>  	hc.rep_cnt = (hc.param >> HV_HYPERCALL_REP_COMP_OFFSET) & 0xfff;
>  	hc.rep_idx = (hc.param >> HV_HYPERCALL_REP_START_OFFSET) & 0xfff;
>  	hc.rep = !!(hc.rep_cnt || hc.rep_idx);
> +	hc.xmm_dirty = false;
>  
>  	trace_kvm_hv_hypercall(hc.code, hc.fast, hc.var_cnt, hc.rep_cnt,
>  			       hc.rep_idx, hc.ingpa, hc.outgpa);
> @@ -2673,6 +2715,15 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>  		break;
>  	}
>  
> +	if (hv_result_success(ret) && hc.xmm_dirty) {
> +		if (unlikely(!kvm_hv_xmm_output_allowed(vcpu))) {
> +			kvm_queue_exception(vcpu, UD_VECTOR);
> +			return 1;
> +		}
> +
> +		kvm_hv_write_xmm((struct kvm_hyperv_xmm_reg *)hc.xmm);
> +	}
> +
>  hypercall_complete:
>  	return kvm_hv_hypercall_complete(vcpu, ret);
>  
> @@ -2682,6 +2733,8 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>  	vcpu->run->hyperv.u.hcall.input = hc.param;
>  	vcpu->run->hyperv.u.hcall.params[0] = hc.ingpa;
>  	vcpu->run->hyperv.u.hcall.params[1] = hc.outgpa;
> +	if (hc.fast)
> +		memcpy(vcpu->run->hyperv.u.hcall.xmm, hc.xmm, sizeof(hc.xmm));
>  	vcpu->arch.complete_userspace_io = kvm_hv_hypercall_complete_userspace;
>  	return 0;
>  }
> @@ -2830,6 +2883,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
>  			ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
>  
>  			ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE;
> +			ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;
>  			ent->edx |= HV_FEATURE_FREQUENCY_MSRS_AVAILABLE;
>  			ent->edx |= HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
>  
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index d03842abae578..fbdee8d754595 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -90,6 +90,11 @@ struct kvm_pit_config {
>  
>  #define KVM_PIT_SPEAKER_DUMMY     1
>  
> +struct kvm_hyperv_xmm_reg {
> +	__u64 low;
> +	__u64 high;
> +};
> +
>  struct kvm_hyperv_exit {
>  #define KVM_EXIT_HYPERV_SYNIC          1
>  #define KVM_EXIT_HYPERV_HCALL          2
> @@ -108,6 +113,7 @@ struct kvm_hyperv_exit {
>  			__u64 input;
>  			__u64 result;
>  			__u64 params[2];
> +			struct kvm_hyperv_xmm_reg xmm[6];

In theory, we have HV_HYPERCALL_MAX_XMM_REGISTERS in TLFS (which you
already use in the code). While I'm not sure it makes sense to make KVM
ABI dependent on TLFS changes (probably not), we may want to leave a
short comment explaining where '6' comes from.

>  		} hcall;
>  		struct {
>  			__u32 msr;

-- 
Vitaly


^ permalink raw reply

* [RFC PATCH net-next v6 00/14] virtio/vsock: support datagrams
From: Amery Hung @ 2024-07-10 21:25 UTC (permalink / raw)
  To: stefanha, sgarzare, mst, jasowang, xuanzhuo, davem, edumazet,
	kuba, pabeni, kys, haiyangz, wei.liu, decui, bryantan, vdasa,
	pv-drivers
  Cc: dan.carpenter, simon.horman, oxffffaa, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, bobby.eshleman,
	jiang.wang, amery.hung, ameryhung, xiyou.wangcong

Hey all!

This series introduces support for datagrams to virtio/vsock.

It is a spin-off (and smaller version) of this series from the summer:
  https://lore.kernel.org/all/cover.1660362668.git.bobby.eshleman@bytedance.com/

Please note that this is an RFC and should not be merged until
associated changes are made to the virtio specification, which will
follow after discussion from this series.

Another aside, the v4 of the series has only been mildly tested with a
run of tools/testing/vsock/vsock_test. Some code likely needs cleaning
up, but I'm hoping to get some of the design choices agreed upon before
spending too much time making it pretty.

This series first supports datagrams in a basic form for virtio, and
then optimizes the sendpath for all datagram transports.

The result is a very fast datagram communication protocol that
outperforms even UDP on multi-queue virtio-net w/ vhost on a variety
of multi-threaded workload samples.

For those that are curious, some summary data comparing UDP and VSOCK
DGRAM (N=5):

	vCPUS: 16
	virtio-net queues: 16
	payload size: 4KB
	Setup: bare metal + vm (non-nested)

	UDP: 287.59 MB/s
	VSOCK DGRAM: 509.2 MB/s

Some notes about the implementation...

This datagram implementation forces datagrams to self-throttle according
to the threshold set by sk_sndbuf. It behaves similar to the credits
used by streams in its effect on throughput and memory consumption, but
it is not influenced by the receiving socket as credits are.

The device drops packets silently.

As discussed previously, this series introduces datagrams and defers
fairness to future work. See discussion in v2 for more context around
datagrams, fairness, and this implementation.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
Changes in v6:
- allow empty transport in datagram vsock
- add empty transport checks in various paths
- transport layer now saves source cid and port to control buffer of skb
  to remove the dependency of transport in recvmsg()
- fix virtio dgram_enqueue() by looking up the transport to be used when
  using sendto(2)
- fix skb memory leaks in two places
- add dgram auto-bind test
- Link to v5: https://lore.kernel.org/r/20230413-b4-vsock-dgram-v5-0-581bd37fdb26@bytedance.com

Changes in v5:
- teach vhost to drop dgram when a datagram exceeds the receive buffer
  - now uses MSG_ERRQUEUE and depends on Arseniy's zerocopy patch:
	"vsock: read from socket's error queue"
- replace multiple ->dgram_* callbacks with single ->dgram_addr_init()
  callback
- refactor virtio dgram skb allocator to reduce conflicts w/ zerocopy series
- add _fallback/_FALLBACK suffix to dgram transport variables/macros
- add WARN_ONCE() for table_size / VSOCK_HASH issue
- add static to vsock_find_bound_socket_common
- dedupe code in vsock_dgram_sendmsg() using module_got var
- drop concurrent sendmsg() for dgram and defer to future series
- Add more tests
  - test EHOSTUNREACH in errqueue
  - test stream + dgram address collision
- improve clarity of dgram msg bounds test code
- Link to v4: https://lore.kernel.org/r/20230413-b4-vsock-dgram-v4-0-0cebbb2ae899@bytedance.com

Changes in v4:
- style changes
  - vsock: use sk_vsock(vsk) in vsock_dgram_recvmsg instead of
    &sk->vsk
  - vsock: fix xmas tree declaration
  - vsock: fix spacing issues
  - virtio/vsock: virtio_transport_recv_dgram returns void because err
    unused
- sparse analysis warnings/errors
  - virtio/vsock: fix unitialized skerr on destroy
  - virtio/vsock: fix uninitialized err var on goto out
  - vsock: fix declarations that need static
  - vsock: fix __rcu annotation order
- bugs
  - vsock: fix null ptr in remote_info code
  - vsock/dgram: make transport_dgram a fallback instead of first
    priority
  - vsock: remove redundant rcu read lock acquire in getname()
- tests
  - add more tests (message bounds and more)
  - add vsock_dgram_bind() helper
  - add vsock_dgram_connect() helper

Changes in v3:
- Support multi-transport dgram, changing logic in connect/bind
  to support VMCI case
- Support per-pkt transport lookup for sendto() case
- Fix dgram_allow() implementation
- Fix dgram feature bit number (now it is 3)
- Fix binding so dgram and connectible (cid,port) spaces are
  non-overlapping
- RCU protect transport ptr so connect() calls never leave
  a lockless read of the transport and remote_addr are always
  in sync
- Link to v2: https://lore.kernel.org/r/20230413-b4-vsock-dgram-v2-0-079cc7cee62e@bytedance.com


Bobby Eshleman (14):
  af_vsock: generalize vsock_dgram_recvmsg() to all transports
  af_vsock: refactor transport lookup code
  af_vsock: support multi-transport datagrams
  af_vsock: generalize bind table functions
  af_vsock: use a separate dgram bind table
  virtio/vsock: add VIRTIO_VSOCK_TYPE_DGRAM
  virtio/vsock: add common datagram send path
  af_vsock: add vsock_find_bound_dgram_socket()
  virtio/vsock: add common datagram recv path
  virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
  vhost/vsock: implement datagram support
  vsock/loopback: implement datagram support
  virtio/vsock: implement datagram support
  test/vsock: add vsock dgram tests

 drivers/vhost/vsock.c                   |   62 +-
 include/linux/virtio_vsock.h            |    9 +-
 include/net/af_vsock.h                  |   24 +-
 include/uapi/linux/virtio_vsock.h       |    2 +
 net/vmw_vsock/af_vsock.c                |  343 ++++++--
 net/vmw_vsock/hyperv_transport.c        |   13 -
 net/vmw_vsock/virtio_transport.c        |   24 +-
 net/vmw_vsock/virtio_transport_common.c |  188 ++++-
 net/vmw_vsock/vmci_transport.c          |   61 +-
 net/vmw_vsock/vsock_loopback.c          |    9 +-
 tools/testing/vsock/util.c              |  177 +++-
 tools/testing/vsock/util.h              |   10 +
 tools/testing/vsock/vsock_test.c        | 1032 ++++++++++++++++++++---
 13 files changed, 1638 insertions(+), 316 deletions(-)

-- 
2.20.1


^ permalink raw reply

* [RFC PATCH net-next v6 01/14] af_vsock: generalize vsock_dgram_recvmsg() to all transports
From: Amery Hung @ 2024-07-10 21:25 UTC (permalink / raw)
  To: stefanha, sgarzare, mst, jasowang, xuanzhuo, davem, edumazet,
	kuba, pabeni, kys, haiyangz, wei.liu, decui, bryantan, vdasa,
	pv-drivers
  Cc: dan.carpenter, simon.horman, oxffffaa, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, bobby.eshleman,
	jiang.wang, amery.hung, ameryhung, xiyou.wangcong
In-Reply-To: <20240710212555.1617795-1-amery.hung@bytedance.com>

From: Bobby Eshleman <bobby.eshleman@bytedance.com>

This commit drops the transport->dgram_dequeue callback and makes
vsock_dgram_recvmsg() generic to all transports.

To make this possible, two transport-level changes are introduced:
- transport in the receiving path now stores the cid and port into
  the control buffer of an skb when populating an skb. The information
  later is used to initialize sockaddr_vm structure in recvmsg()
  without referencing vsk->transport.
- transport implementations set the skb->data pointer to the beginning
  of the payload prior to adding the skb to the socket's receive queue.
  That is, they must use skb_pull() before enqueuing. This is an
  agreement between the transport and the socket layer that skb->data
  always points to the beginning of the payload (and not, for example,
  the packet header).

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 drivers/vhost/vsock.c                   |  1 -
 include/linux/virtio_vsock.h            |  5 ---
 include/net/af_vsock.h                  | 11 ++++-
 net/vmw_vsock/af_vsock.c                | 42 +++++++++++++++++-
 net/vmw_vsock/hyperv_transport.c        |  7 ---
 net/vmw_vsock/virtio_transport.c        |  1 -
 net/vmw_vsock/virtio_transport_common.c |  9 ----
 net/vmw_vsock/vmci_transport.c          | 59 +++----------------------
 net/vmw_vsock/vsock_loopback.c          |  1 -
 9 files changed, 55 insertions(+), 81 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index ec20ecff85c7..97fffa914e66 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -419,7 +419,6 @@ static struct virtio_transport vhost_transport = {
 		.cancel_pkt               = vhost_transport_cancel_pkt,
 
 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
-		.dgram_dequeue            = virtio_transport_dgram_dequeue,
 		.dgram_bind               = virtio_transport_dgram_bind,
 		.dgram_allow              = virtio_transport_dgram_allow,
 
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c82089dee0c8..8b56b8a19ddd 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -177,11 +177,6 @@ virtio_transport_stream_dequeue(struct vsock_sock *vsk,
 				size_t len,
 				int type);
 int
-virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
-			       struct msghdr *msg,
-			       size_t len, int flags);
-
-int
 virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
 				   struct msghdr *msg,
 				   size_t len);
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 535701efc1e5..7aa1f5f2b1a5 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -120,8 +120,6 @@ struct vsock_transport {
 
 	/* DGRAM. */
 	int (*dgram_bind)(struct vsock_sock *, struct sockaddr_vm *);
-	int (*dgram_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
-			     size_t len, int flags);
 	int (*dgram_enqueue)(struct vsock_sock *, struct sockaddr_vm *,
 			     struct msghdr *, size_t len);
 	bool (*dgram_allow)(u32 cid, u32 port);
@@ -219,6 +217,15 @@ void vsock_for_each_connected_socket(struct vsock_transport *transport,
 int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk);
 bool vsock_find_cid(unsigned int cid);
 
+struct vsock_skb_cb {
+	unsigned int src_cid;
+	unsigned int src_port;
+};
+
+static inline struct vsock_skb_cb *vsock_skb_cb(struct sk_buff *skb) {
+	return (struct vsock_skb_cb *)skb->cb;
+};
+
 /**** TAP ****/
 
 struct vsock_tap {
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 4b040285aa78..5e7d4d99ea2c 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1273,11 +1273,15 @@ static int vsock_dgram_connect(struct socket *sock,
 int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 			size_t len, int flags)
 {
+	struct vsock_skb_cb *vsock_cb;
 #ifdef CONFIG_BPF_SYSCALL
 	const struct proto *prot;
 #endif
 	struct vsock_sock *vsk;
+	struct sk_buff *skb;
+	size_t payload_len;
 	struct sock *sk;
+	int err;
 
 	sk = sock->sk;
 	vsk = vsock_sk(sk);
@@ -1288,7 +1292,43 @@ int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 		return prot->recvmsg(sk, msg, len, flags, NULL);
 #endif
 
-	return vsk->transport->dgram_dequeue(vsk, msg, len, flags);
+	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
+		return -EOPNOTSUPP;
+
+	if (unlikely(flags & MSG_ERRQUEUE))
+		return sock_recv_errqueue(sk, msg, len, SOL_VSOCK, 0);
+
+	/* Retrieve the head sk_buff from the socket's receive queue. */
+	err = 0;
+	skb = skb_recv_datagram(sk_vsock(vsk), flags, &err);
+	if (!skb)
+		return err;
+
+	payload_len = skb->len;
+
+	if (payload_len > len) {
+		payload_len = len;
+		msg->msg_flags |= MSG_TRUNC;
+	}
+
+	/* Place the datagram payload in the user's iovec. */
+	err = skb_copy_datagram_msg(skb, 0, msg, payload_len);
+	if (err)
+		goto out;
+
+	if (msg->msg_name) {
+		/* Provide the address of the sender. */
+		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
+
+		vsock_cb = vsock_skb_cb(skb);
+		vsock_addr_init(vm_addr, vsock_cb->src_cid, vsock_cb->src_port);
+		msg->msg_namelen = sizeof(*vm_addr);
+	}
+	err = payload_len;
+
+out:
+	skb_free_datagram(&vsk->sk, skb);
+	return err;
 }
 EXPORT_SYMBOL_GPL(vsock_dgram_recvmsg);
 
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index e2157e387217..326dd41ee2d5 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -556,12 +556,6 @@ static int hvs_dgram_bind(struct vsock_sock *vsk, struct sockaddr_vm *addr)
 	return -EOPNOTSUPP;
 }
 
-static int hvs_dgram_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
-			     size_t len, int flags)
-{
-	return -EOPNOTSUPP;
-}
-
 static int hvs_dgram_enqueue(struct vsock_sock *vsk,
 			     struct sockaddr_vm *remote, struct msghdr *msg,
 			     size_t dgram_len)
@@ -833,7 +827,6 @@ static struct vsock_transport hvs_transport = {
 	.shutdown                 = hvs_shutdown,
 
 	.dgram_bind               = hvs_dgram_bind,
-	.dgram_dequeue            = hvs_dgram_dequeue,
 	.dgram_enqueue            = hvs_dgram_enqueue,
 	.dgram_allow              = hvs_dgram_allow,
 
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 43d405298857..a8c97e95622a 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -508,7 +508,6 @@ static struct virtio_transport virtio_transport = {
 		.cancel_pkt               = virtio_transport_cancel_pkt,
 
 		.dgram_bind               = virtio_transport_dgram_bind,
-		.dgram_dequeue            = virtio_transport_dgram_dequeue,
 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
 		.dgram_allow              = virtio_transport_dgram_allow,
 
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 16ff976a86e3..4bf73d20c12a 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -810,15 +810,6 @@ virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
 }
 EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_enqueue);
 
-int
-virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
-			       struct msghdr *msg,
-			       size_t len, int flags)
-{
-	return -EOPNOTSUPP;
-}
-EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
-
 s64 virtio_transport_stream_has_data(struct vsock_sock *vsk)
 {
 	struct virtio_vsock_sock *vvs = vsk->trans;
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index b370070194fa..b39df3ed8c8d 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -610,6 +610,7 @@ vmci_transport_datagram_create_hnd(u32 resource_id,
 
 static int vmci_transport_recv_dgram_cb(void *data, struct vmci_datagram *dg)
 {
+	struct vsock_skb_cb *vsock_cb;
 	struct sock *sk;
 	size_t size;
 	struct sk_buff *skb;
@@ -637,10 +638,14 @@ static int vmci_transport_recv_dgram_cb(void *data, struct vmci_datagram *dg)
 	if (!skb)
 		return VMCI_ERROR_NO_MEM;
 
+	vsock_cb = vsock_skb_cb(skb);
+	vsock_cb->src_cid = dg->src.context;
+	vsock_cb->src_port = dg->src.resource;
 	/* sk_receive_skb() will do a sock_put(), so hold here. */
 	sock_hold(sk);
 	skb_put(skb, size);
 	memcpy(skb->data, dg, size);
+	skb_pull(skb, VMCI_DG_HEADERSIZE);
 	sk_receive_skb(sk, skb, 0);
 
 	return VMCI_SUCCESS;
@@ -1731,59 +1736,6 @@ static int vmci_transport_dgram_enqueue(
 	return err - sizeof(*dg);
 }
 
-static int vmci_transport_dgram_dequeue(struct vsock_sock *vsk,
-					struct msghdr *msg, size_t len,
-					int flags)
-{
-	int err;
-	struct vmci_datagram *dg;
-	size_t payload_len;
-	struct sk_buff *skb;
-
-	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
-		return -EOPNOTSUPP;
-
-	/* Retrieve the head sk_buff from the socket's receive queue. */
-	err = 0;
-	skb = skb_recv_datagram(&vsk->sk, flags, &err);
-	if (!skb)
-		return err;
-
-	dg = (struct vmci_datagram *)skb->data;
-	if (!dg)
-		/* err is 0, meaning we read zero bytes. */
-		goto out;
-
-	payload_len = dg->payload_size;
-	/* Ensure the sk_buff matches the payload size claimed in the packet. */
-	if (payload_len != skb->len - sizeof(*dg)) {
-		err = -EINVAL;
-		goto out;
-	}
-
-	if (payload_len > len) {
-		payload_len = len;
-		msg->msg_flags |= MSG_TRUNC;
-	}
-
-	/* Place the datagram payload in the user's iovec. */
-	err = skb_copy_datagram_msg(skb, sizeof(*dg), msg, payload_len);
-	if (err)
-		goto out;
-
-	if (msg->msg_name) {
-		/* Provide the address of the sender. */
-		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
-		vsock_addr_init(vm_addr, dg->src.context, dg->src.resource);
-		msg->msg_namelen = sizeof(*vm_addr);
-	}
-	err = payload_len;
-
-out:
-	skb_free_datagram(&vsk->sk, skb);
-	return err;
-}
-
 static bool vmci_transport_dgram_allow(u32 cid, u32 port)
 {
 	if (cid == VMADDR_CID_HYPERVISOR) {
@@ -2040,7 +1992,6 @@ static struct vsock_transport vmci_transport = {
 	.release = vmci_transport_release,
 	.connect = vmci_transport_connect,
 	.dgram_bind = vmci_transport_dgram_bind,
-	.dgram_dequeue = vmci_transport_dgram_dequeue,
 	.dgram_enqueue = vmci_transport_dgram_enqueue,
 	.dgram_allow = vmci_transport_dgram_allow,
 	.stream_dequeue = vmci_transport_stream_dequeue,
diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index 6dea6119f5b2..11488887a5cc 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -66,7 +66,6 @@ static struct virtio_transport loopback_transport = {
 		.cancel_pkt               = vsock_loopback_cancel_pkt,
 
 		.dgram_bind               = virtio_transport_dgram_bind,
-		.dgram_dequeue            = virtio_transport_dgram_dequeue,
 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
 		.dgram_allow              = virtio_transport_dgram_allow,
 
-- 
2.20.1


^ permalink raw reply related

* [RFC PATCH net-next v6 02/14] af_vsock: refactor transport lookup code
From: Amery Hung @ 2024-07-10 21:25 UTC (permalink / raw)
  To: stefanha, sgarzare, mst, jasowang, xuanzhuo, davem, edumazet,
	kuba, pabeni, kys, haiyangz, wei.liu, decui, bryantan, vdasa,
	pv-drivers
  Cc: dan.carpenter, simon.horman, oxffffaa, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, bobby.eshleman,
	jiang.wang, amery.hung, ameryhung, xiyou.wangcong
In-Reply-To: <20240710212555.1617795-1-amery.hung@bytedance.com>

From: Bobby Eshleman <bobby.eshleman@bytedance.com>

Introduce new reusable function vsock_connectible_lookup_transport()
that performs the transport lookup logic.

No functional change intended.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 net/vmw_vsock/af_vsock.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 5e7d4d99ea2c..98d10cd30483 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -424,6 +424,22 @@ static void vsock_deassign_transport(struct vsock_sock *vsk)
 	vsk->transport = NULL;
 }
 
+static const struct vsock_transport *
+vsock_connectible_lookup_transport(unsigned int cid, __u8 flags)
+{
+	const struct vsock_transport *transport;
+
+	if (vsock_use_local_transport(cid))
+		transport = transport_local;
+	else if (cid <= VMADDR_CID_HOST || !transport_h2g ||
+		 (flags & VMADDR_FLAG_TO_HOST))
+		transport = transport_g2h;
+	else
+		transport = transport_h2g;
+
+	return transport;
+}
+
 /* Assign a transport to a socket and call the .init transport callback.
  *
  * Note: for connection oriented socket this must be called when vsk->remote_addr
@@ -464,13 +480,8 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
 		break;
 	case SOCK_STREAM:
 	case SOCK_SEQPACKET:
-		if (vsock_use_local_transport(remote_cid))
-			new_transport = transport_local;
-		else if (remote_cid <= VMADDR_CID_HOST || !transport_h2g ||
-			 (remote_flags & VMADDR_FLAG_TO_HOST))
-			new_transport = transport_g2h;
-		else
-			new_transport = transport_h2g;
+		new_transport = vsock_connectible_lookup_transport(remote_cid,
+								   remote_flags);
 		break;
 	default:
 		return -ESOCKTNOSUPPORT;
-- 
2.20.1


^ permalink raw reply related

* [RFC PATCH net-next v6 03/14] af_vsock: support multi-transport datagrams
From: Amery Hung @ 2024-07-10 21:25 UTC (permalink / raw)
  To: stefanha, sgarzare, mst, jasowang, xuanzhuo, davem, edumazet,
	kuba, pabeni, kys, haiyangz, wei.liu, decui, bryantan, vdasa,
	pv-drivers
  Cc: dan.carpenter, simon.horman, oxffffaa, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, bobby.eshleman,
	jiang.wang, amery.hung, ameryhung, xiyou.wangcong
In-Reply-To: <20240710212555.1617795-1-amery.hung@bytedance.com>

From: Bobby Eshleman <bobby.eshleman@bytedance.com>

This patch adds support for multi-transport datagrams.

This includes:
- Allow transport to be undecided (i.e., empty) for non-VMCI datagram
  use cases during socket creation.
- connect() now assigns the transport for (similar to connectible
  sockets)
- Per-packet lookup of transports when using sendto(sockaddr_vm)
- Selecting H2G or G2H transport using VMADDR_FLAG_TO_HOST and CID in
  sockaddr_vm
- Rename VSOCK_TRANSPORT_F_DGRAM to VSOCK_TRANSPORT_F_DGRAM_FALLBACK

* Dynamic transport lookup *

virtio datagram will follow h2g/g2h paradigm. Since it is impossible
to know which transport to use during socket creation, the transport is
allowed to remain empty. The transport will be assigned only when
connect() is called. Otherwise, in the sendmsg() path, if sendto() is
used, the cid is used to lookup the transport that will be used. In the
recvmsg() path, since the receiving method is generalized and shared by
different transport, there is now no need to resolve the transport.
Finally, a couple of checks for empty transport are added in other paths
to prevent null-pointer dereference.

* Compatibiliity with VMCI *

To preserve backwards compatibility with VMCI, some important changes
are made. The "transport_dgram" / VSOCK_TRANSPORT_F_DGRAM is changed to
be used for dgrams only if there is not yet a g2h or h2g transport that
has been registered that can transmit the packet. If there is a g2h/h2g
transport for that remote address, then that transport will be used and
not "transport_dgram". This essentially makes "transport_dgram" a
fallback transport for when h2g/g2h has not yet gone online, and so it
is renamed "transport_dgram_fallback". VMCI implements this transport.

The logic around "transport_dgram" needs to be retained to prevent
breaking VMCI:

1) VMCI datagrams existed prior to h2g/g2h and so operate under a
   different paradigm. When the vmci transport comes online, it registers
   itself with the DGRAM feature, but not H2G/G2H. Only later when the
   transport has more information about its environment does it register
   H2G or G2H.  In the case that a datagram socket is created after
   VSOCK_TRANSPORT_F_DGRAM registration but before G2H/H2G registration,
   the "transport_dgram" transport is the only registered transport and so
   needs to be used.

2) VMCI seems to require a special message be sent by the transport when a
   datagram socket calls bind(). Under the h2g/g2h model, the transport
   is selected using the remote_addr which is set by connect(). At
   bind time there is no remote_addr because often no connect() has been
   called yet: the transport is null. Therefore, with a null transport
   there doesn't seem to be any good way for a datagram socket to tell the
   VMCI transport that it has just had bind() called upon it.

With the new fallback logic, after H2G/G2H comes online the socket layer
will access the VMCI transport via transport_{h2g,g2h}. Prior to H2G/G2H
coming online, the socket layer will access the VMCI transport via
"transport_dgram_fallback".

Only transports with a special datagram fallback use-case such as VMCI
need to register VSOCK_TRANSPORT_F_DGRAM_FALLBACK.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 drivers/vhost/vsock.c                   |   1 -
 include/linux/virtio_vsock.h            |   2 -
 include/net/af_vsock.h                  |  10 +-
 net/vmw_vsock/af_vsock.c                | 127 +++++++++++++++++++-----
 net/vmw_vsock/hyperv_transport.c        |   6 --
 net/vmw_vsock/virtio_transport.c        |   1 -
 net/vmw_vsock/virtio_transport_common.c |   7 --
 net/vmw_vsock/vmci_transport.c          |   2 +-
 net/vmw_vsock/vsock_loopback.c          |   1 -
 9 files changed, 107 insertions(+), 50 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 97fffa914e66..fa1aefb78016 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -419,7 +419,6 @@ static struct virtio_transport vhost_transport = {
 		.cancel_pkt               = vhost_transport_cancel_pkt,
 
 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
-		.dgram_bind               = virtio_transport_dgram_bind,
 		.dgram_allow              = virtio_transport_dgram_allow,
 
 		.stream_enqueue           = virtio_transport_stream_enqueue,
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 8b56b8a19ddd..f749a066af46 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -221,8 +221,6 @@ void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val);
 u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
 bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
 bool virtio_transport_stream_allow(u32 cid, u32 port);
-int virtio_transport_dgram_bind(struct vsock_sock *vsk,
-				struct sockaddr_vm *addr);
 bool virtio_transport_dgram_allow(u32 cid, u32 port);
 
 int virtio_transport_connect(struct vsock_sock *vsk);
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 7aa1f5f2b1a5..44db8f2c507d 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -96,13 +96,13 @@ struct vsock_transport_send_notify_data {
 
 /* Transport features flags */
 /* Transport provides host->guest communication */
-#define VSOCK_TRANSPORT_F_H2G		0x00000001
+#define VSOCK_TRANSPORT_F_H2G			0x00000001
 /* Transport provides guest->host communication */
-#define VSOCK_TRANSPORT_F_G2H		0x00000002
-/* Transport provides DGRAM communication */
-#define VSOCK_TRANSPORT_F_DGRAM		0x00000004
+#define VSOCK_TRANSPORT_F_G2H			0x00000002
+/* Transport provides fallback for DGRAM communication */
+#define VSOCK_TRANSPORT_F_DGRAM_FALLBACK	0x00000004
 /* Transport provides local (loopback) communication */
-#define VSOCK_TRANSPORT_F_LOCAL		0x00000008
+#define VSOCK_TRANSPORT_F_LOCAL			0x00000008
 
 struct vsock_transport {
 	struct module *module;
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 98d10cd30483..acc15e11700c 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -140,8 +140,8 @@ struct proto vsock_proto = {
 static const struct vsock_transport *transport_h2g;
 /* Transport used for guest->host communication */
 static const struct vsock_transport *transport_g2h;
-/* Transport used for DGRAM communication */
-static const struct vsock_transport *transport_dgram;
+/* Transport used as a fallback for DGRAM communication */
+static const struct vsock_transport *transport_dgram_fallback;
 /* Transport used for local communication */
 static const struct vsock_transport *transport_local;
 static DEFINE_MUTEX(vsock_register_mutex);
@@ -440,19 +440,20 @@ vsock_connectible_lookup_transport(unsigned int cid, __u8 flags)
 	return transport;
 }
 
-/* Assign a transport to a socket and call the .init transport callback.
- *
- * Note: for connection oriented socket this must be called when vsk->remote_addr
- * is set (e.g. during the connect() or when a connection request on a listener
- * socket is received).
- * The vsk->remote_addr is used to decide which transport to use:
- *  - remote CID == VMADDR_CID_LOCAL or g2h->local_cid or VMADDR_CID_HOST if
- *    g2h is not loaded, will use local transport;
- *  - remote CID <= VMADDR_CID_HOST or h2g is not loaded or remote flags field
- *    includes VMADDR_FLAG_TO_HOST flag value, will use guest->host transport;
- *  - remote CID > VMADDR_CID_HOST will use host->guest transport;
- */
-int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
+static const struct vsock_transport *
+vsock_dgram_lookup_transport(unsigned int cid, __u8 flags)
+{
+	const struct vsock_transport *transport;
+
+	transport = vsock_connectible_lookup_transport(cid, flags);
+	if (transport)
+		return transport;
+
+	return transport_dgram_fallback;
+}
+
+static int __vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk,
+				    bool create_sock)
 {
 	const struct vsock_transport *new_transport;
 	struct sock *sk = sk_vsock(vsk);
@@ -476,7 +477,21 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
 
 	switch (sk->sk_type) {
 	case SOCK_DGRAM:
-		new_transport = transport_dgram;
+		/* During vsock_create(), the transport cannot be decided yet if
+		 * using virtio. While for VMCI, it is transport_dgram_fallback.
+		 * Therefore, we try to initialize it to transport_dgram_fallback
+		 * so that we don't break VMCI. If VMCI is not present, it is okay
+		 * to leave the transport empty since vsk->transport != NULL checks
+		 * will be performed in send and receive paths.
+		 *
+		 * During vsock_dgram_connect(), since remote_cid is available,
+		 * the right transport is assigned after lookup.
+		 */
+		if (create_sock)
+			new_transport = transport_dgram_fallback;
+		else
+			new_transport = vsock_dgram_lookup_transport(remote_cid,
+								     remote_flags);
 		break;
 	case SOCK_STREAM:
 	case SOCK_SEQPACKET:
@@ -501,6 +516,10 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
 		vsock_deassign_transport(vsk);
 	}
 
+	/* Only allow empty transport during vsock_create() for datagram */
+	if (!new_transport && sk->sk_type == SOCK_DGRAM && create_sock)
+		return 0;
+
 	/* We increase the module refcnt to prevent the transport unloading
 	 * while there are open sockets assigned to it.
 	 */
@@ -525,6 +544,23 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
 
 	return 0;
 }
+
+/* Assign a transport to a socket and call the .init transport callback.
+ *
+ * Note: for connection oriented socket this must be called when vsk->remote_addr
+ * is set (e.g. during the connect() or when a connection request on a listener
+ * socket is received).
+ * The vsk->remote_addr is used to decide which transport to use:
+ *  - remote CID == VMADDR_CID_LOCAL or g2h->local_cid or VMADDR_CID_HOST if
+ *    g2h is not loaded, will use local transport;
+ *  - remote CID <= VMADDR_CID_HOST or h2g is not loaded or remote flags field
+ *    includes VMADDR_FLAG_TO_HOST flag value, will use guest->host transport;
+ *  - remote CID > VMADDR_CID_HOST will use host->guest transport;
+ */
+int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
+{
+	return __vsock_assign_transport(vsk, psk, false);
+}
 EXPORT_SYMBOL_GPL(vsock_assign_transport);
 
 bool vsock_find_cid(unsigned int cid)
@@ -693,6 +729,9 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
 static int __vsock_bind_dgram(struct vsock_sock *vsk,
 			      struct sockaddr_vm *addr)
 {
+	if (!vsk->transport || !vsk->transport->dgram_bind)
+		return -EINVAL;
+
 	return vsk->transport->dgram_bind(vsk, addr);
 }
 
@@ -825,6 +864,9 @@ static void __vsock_release(struct sock *sk, int level)
 			vsk->transport->release(vsk);
 		else if (sock_type_connectible(sk->sk_type))
 			vsock_remove_sock(vsk);
+		else if (sk->sk_type == SOCK_DGRAM &&
+			 (!vsk->transport || !vsk->transport->dgram_bind))
+			vsock_remove_sock(vsk);
 
 		sock_orphan(sk);
 		sk->sk_shutdown = SHUTDOWN_MASK;
@@ -1152,6 +1194,9 @@ static int vsock_read_skb(struct sock *sk, skb_read_actor_t read_actor)
 {
 	struct vsock_sock *vsk = vsock_sk(sk);
 
+	if (!vsk->transport)
+		return -EINVAL;
+
 	return vsk->transport->read_skb(vsk, read_actor);
 }
 
@@ -1163,6 +1208,7 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 	struct vsock_sock *vsk;
 	struct sockaddr_vm *remote_addr;
 	const struct vsock_transport *transport;
+	bool module_got = false;
 
 	if (msg->msg_flags & MSG_OOB)
 		return -EOPNOTSUPP;
@@ -1174,19 +1220,40 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 
 	lock_sock(sk);
 
-	transport = vsk->transport;
-
 	err = vsock_auto_bind(vsk);
 	if (err)
 		goto out;
 
-
 	/* If the provided message contains an address, use that.  Otherwise
 	 * fall back on the socket's remote handle (if it has been connected).
 	 */
 	if (msg->msg_name &&
 	    vsock_addr_cast(msg->msg_name, msg->msg_namelen,
 			    &remote_addr) == 0) {
+		transport = vsock_dgram_lookup_transport(remote_addr->svm_cid,
+							 remote_addr->svm_flags);
+		/* transport_dgram_fallback needs to be initialized to be called */
+		if (transport == transport_dgram_fallback && transport != vsk->transport) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		if (!transport) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		if (!try_module_get(transport->module)) {
+			err = -ENODEV;
+			goto out;
+		}
+
+		/* When looking up a transport dynamically and acquiring a
+		 * reference on the module, we need to remember to release the
+		 * reference later.
+		 */
+		module_got = true;
+
 		/* Ensure this address is of the right type and is a valid
 		 * destination.
 		 */
@@ -1201,6 +1268,7 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 	} else if (sock->state == SS_CONNECTED) {
 		remote_addr = &vsk->remote_addr;
 
+		transport = vsk->transport;
 		if (remote_addr->svm_cid == VMADDR_CID_ANY)
 			remote_addr->svm_cid = transport->get_local_cid();
 
@@ -1225,6 +1293,8 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 	err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
 
 out:
+	if (module_got)
+		module_put(transport->module);
 	release_sock(sk);
 	return err;
 }
@@ -1257,13 +1327,18 @@ static int vsock_dgram_connect(struct socket *sock,
 	if (err)
 		goto out;
 
+	memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
+
+	err = vsock_assign_transport(vsk, NULL);
+	if (err)
+		goto out;
+
 	if (!vsk->transport->dgram_allow(remote_addr->svm_cid,
 					 remote_addr->svm_port)) {
 		err = -EINVAL;
 		goto out;
 	}
 
-	memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
 	sock->state = SS_CONNECTED;
 
 	/* sock map disallows redirection of non-TCP sockets with sk_state !=
@@ -2406,7 +2481,7 @@ static int vsock_create(struct net *net, struct socket *sock,
 	vsk = vsock_sk(sk);
 
 	if (sock->type == SOCK_DGRAM) {
-		ret = vsock_assign_transport(vsk, NULL);
+		ret = __vsock_assign_transport(vsk, NULL, true);
 		if (ret < 0) {
 			sock_put(sk);
 			return ret;
@@ -2548,7 +2623,7 @@ int vsock_core_register(const struct vsock_transport *t, int features)
 
 	t_h2g = transport_h2g;
 	t_g2h = transport_g2h;
-	t_dgram = transport_dgram;
+	t_dgram = transport_dgram_fallback;
 	t_local = transport_local;
 
 	if (features & VSOCK_TRANSPORT_F_H2G) {
@@ -2567,7 +2642,7 @@ int vsock_core_register(const struct vsock_transport *t, int features)
 		t_g2h = t;
 	}
 
-	if (features & VSOCK_TRANSPORT_F_DGRAM) {
+	if (features & VSOCK_TRANSPORT_F_DGRAM_FALLBACK) {
 		if (t_dgram) {
 			err = -EBUSY;
 			goto err_busy;
@@ -2585,7 +2660,7 @@ int vsock_core_register(const struct vsock_transport *t, int features)
 
 	transport_h2g = t_h2g;
 	transport_g2h = t_g2h;
-	transport_dgram = t_dgram;
+	transport_dgram_fallback = t_dgram;
 	transport_local = t_local;
 
 err_busy:
@@ -2604,8 +2679,8 @@ void vsock_core_unregister(const struct vsock_transport *t)
 	if (transport_g2h == t)
 		transport_g2h = NULL;
 
-	if (transport_dgram == t)
-		transport_dgram = NULL;
+	if (transport_dgram_fallback == t)
+		transport_dgram_fallback = NULL;
 
 	if (transport_local == t)
 		transport_local = NULL;
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 326dd41ee2d5..64ad87a3879c 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -551,11 +551,6 @@ static void hvs_destruct(struct vsock_sock *vsk)
 	kfree(hvs);
 }
 
-static int hvs_dgram_bind(struct vsock_sock *vsk, struct sockaddr_vm *addr)
-{
-	return -EOPNOTSUPP;
-}
-
 static int hvs_dgram_enqueue(struct vsock_sock *vsk,
 			     struct sockaddr_vm *remote, struct msghdr *msg,
 			     size_t dgram_len)
@@ -826,7 +821,6 @@ static struct vsock_transport hvs_transport = {
 	.connect                  = hvs_connect,
 	.shutdown                 = hvs_shutdown,
 
-	.dgram_bind               = hvs_dgram_bind,
 	.dgram_enqueue            = hvs_dgram_enqueue,
 	.dgram_allow              = hvs_dgram_allow,
 
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index a8c97e95622a..4891b845fcde 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -507,7 +507,6 @@ static struct virtio_transport virtio_transport = {
 		.shutdown                 = virtio_transport_shutdown,
 		.cancel_pkt               = virtio_transport_cancel_pkt,
 
-		.dgram_bind               = virtio_transport_dgram_bind,
 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
 		.dgram_allow              = virtio_transport_dgram_allow,
 
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 4bf73d20c12a..a1c76836d798 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1008,13 +1008,6 @@ bool virtio_transport_stream_allow(u32 cid, u32 port)
 }
 EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
 
-int virtio_transport_dgram_bind(struct vsock_sock *vsk,
-				struct sockaddr_vm *addr)
-{
-	return -EOPNOTSUPP;
-}
-EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
-
 bool virtio_transport_dgram_allow(u32 cid, u32 port)
 {
 	return false;
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index b39df3ed8c8d..49aba9c48415 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -2061,7 +2061,7 @@ static int __init vmci_transport_init(void)
 	/* Register only with dgram feature, other features (H2G, G2H) will be
 	 * registered when the first host or guest becomes active.
 	 */
-	err = vsock_core_register(&vmci_transport, VSOCK_TRANSPORT_F_DGRAM);
+	err = vsock_core_register(&vmci_transport, VSOCK_TRANSPORT_F_DGRAM_FALLBACK);
 	if (err < 0)
 		goto err_unsubscribe;
 
diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index 11488887a5cc..4dd4886f29d1 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -65,7 +65,6 @@ static struct virtio_transport loopback_transport = {
 		.shutdown                 = virtio_transport_shutdown,
 		.cancel_pkt               = vsock_loopback_cancel_pkt,
 
-		.dgram_bind               = virtio_transport_dgram_bind,
 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
 		.dgram_allow              = virtio_transport_dgram_allow,
 
-- 
2.20.1


^ permalink raw reply related

* [RFC PATCH net-next v6 04/14] af_vsock: generalize bind table functions
From: Amery Hung @ 2024-07-10 21:25 UTC (permalink / raw)
  To: stefanha, sgarzare, mst, jasowang, xuanzhuo, davem, edumazet,
	kuba, pabeni, kys, haiyangz, wei.liu, decui, bryantan, vdasa,
	pv-drivers
  Cc: dan.carpenter, simon.horman, oxffffaa, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, bobby.eshleman,
	jiang.wang, amery.hung, ameryhung, xiyou.wangcong
In-Reply-To: <20240710212555.1617795-1-amery.hung@bytedance.com>

From: Bobby Eshleman <bobby.eshleman@bytedance.com>

This commit makes the bind table management functions in vsock usable
for different bind tables. Future work will introduce a new table for
datagrams to avoid address collisions, and these functions will be used
there.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 net/vmw_vsock/af_vsock.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index acc15e11700c..d571be9cdbf0 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -232,11 +232,12 @@ static void __vsock_remove_connected(struct vsock_sock *vsk)
 	sock_put(&vsk->sk);
 }
 
-static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
+static struct sock *vsock_find_bound_socket_common(struct sockaddr_vm *addr,
+						   struct list_head *bind_table)
 {
 	struct vsock_sock *vsk;
 
-	list_for_each_entry(vsk, vsock_bound_sockets(addr), bound_table) {
+	list_for_each_entry(vsk, bind_table, bound_table) {
 		if (vsock_addr_equals_addr(addr, &vsk->local_addr))
 			return sk_vsock(vsk);
 
@@ -249,6 +250,11 @@ static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
 	return NULL;
 }
 
+static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
+{
+	return vsock_find_bound_socket_common(addr, vsock_bound_sockets(addr));
+}
+
 static struct sock *__vsock_find_connected_socket(struct sockaddr_vm *src,
 						  struct sockaddr_vm *dst)
 {
@@ -671,12 +677,18 @@ static void vsock_pending_work(struct work_struct *work)
 
 /**** SOCKET OPERATIONS ****/
 
-static int __vsock_bind_connectible(struct vsock_sock *vsk,
-				    struct sockaddr_vm *addr)
+static int vsock_bind_common(struct vsock_sock *vsk,
+			     struct sockaddr_vm *addr,
+			     struct list_head *bind_table,
+			     size_t table_size)
 {
 	static u32 port;
 	struct sockaddr_vm new_addr;
 
+	if (WARN_ONCE(table_size < VSOCK_HASH_SIZE,
+		      "table size too small, may cause overflow"))
+		return -EINVAL;
+
 	if (!port)
 		port = get_random_u32_above(LAST_RESERVED_PORT);
 
@@ -692,7 +704,8 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
 
 			new_addr.svm_port = port++;
 
-			if (!__vsock_find_bound_socket(&new_addr)) {
+			if (!vsock_find_bound_socket_common(&new_addr,
+							    &bind_table[VSOCK_HASH(addr)])) {
 				found = true;
 				break;
 			}
@@ -709,7 +722,8 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
 			return -EACCES;
 		}
 
-		if (__vsock_find_bound_socket(&new_addr))
+		if (vsock_find_bound_socket_common(&new_addr,
+						   &bind_table[VSOCK_HASH(addr)]))
 			return -EADDRINUSE;
 	}
 
@@ -721,11 +735,17 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
 	 * by AF_UNIX.
 	 */
 	__vsock_remove_bound(vsk);
-	__vsock_insert_bound(vsock_bound_sockets(&vsk->local_addr), vsk);
+	__vsock_insert_bound(&bind_table[VSOCK_HASH(&vsk->local_addr)], vsk);
 
 	return 0;
 }
 
+static int __vsock_bind_connectible(struct vsock_sock *vsk,
+				    struct sockaddr_vm *addr)
+{
+	return vsock_bind_common(vsk, addr, vsock_bind_table, VSOCK_HASH_SIZE + 1);
+}
+
 static int __vsock_bind_dgram(struct vsock_sock *vsk,
 			      struct sockaddr_vm *addr)
 {
-- 
2.20.1


^ permalink raw reply related

* [RFC PATCH net-next v6 05/14] af_vsock: use a separate dgram bind table
From: Amery Hung @ 2024-07-10 21:25 UTC (permalink / raw)
  To: stefanha, sgarzare, mst, jasowang, xuanzhuo, davem, edumazet,
	kuba, pabeni, kys, haiyangz, wei.liu, decui, bryantan, vdasa,
	pv-drivers
  Cc: dan.carpenter, simon.horman, oxffffaa, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, bobby.eshleman,
	jiang.wang, amery.hung, ameryhung, xiyou.wangcong
In-Reply-To: <20240710212555.1617795-1-amery.hung@bytedance.com>

From: Bobby Eshleman <bobby.eshleman@bytedance.com>

This commit adds support for bound dgram sockets to be tracked in a
separate bind table from connectible sockets in order to avoid address
collisions. With this commit, users can simultaneously bind a dgram
socket and connectible socket to the same CID and port.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 net/vmw_vsock/af_vsock.c | 103 +++++++++++++++++++++++++++++----------
 1 file changed, 76 insertions(+), 27 deletions(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index d571be9cdbf0..ab08cd81720e 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -10,18 +10,23 @@
  * - There are two kinds of sockets: those created by user action (such as
  * calling socket(2)) and those created by incoming connection request packets.
  *
- * - There are two "global" tables, one for bound sockets (sockets that have
- * specified an address that they are responsible for) and one for connected
- * sockets (sockets that have established a connection with another socket).
- * These tables are "global" in that all sockets on the system are placed
- * within them. - Note, though, that the bound table contains an extra entry
- * for a list of unbound sockets and SOCK_DGRAM sockets will always remain in
- * that list. The bound table is used solely for lookup of sockets when packets
- * are received and that's not necessary for SOCK_DGRAM sockets since we create
- * a datagram handle for each and need not perform a lookup.  Keeping SOCK_DGRAM
- * sockets out of the bound hash buckets will reduce the chance of collisions
- * when looking for SOCK_STREAM sockets and prevents us from having to check the
- * socket type in the hash table lookups.
+ * - There are three "global" tables, one for bound connectible (stream /
+ * seqpacket) sockets, one for bound datagram sockets, and one for connected
+ * sockets. Bound sockets are sockets that have specified an address that
+ * they are responsible for. Connected sockets are sockets that have
+ * established a connection with another socket. These tables are "global" in
+ * that all sockets on the system are placed within them. - Note, though,
+ * that the bound tables contain an extra entry for a list of unbound
+ * sockets. The bound tables are used solely for lookup of sockets when packets
+ * are received.
+ *
+ * - There are separate bind tables for connectible and datagram sockets to avoid
+ * address collisions between stream/seqpacket sockets and datagram sockets.
+ *
+ * - Transports may elect to NOT use the global datagram bind table by
+ * implementing the ->dgram_bind() callback. If that callback is implemented,
+ * the global bind table is not used and the responsibility of bound datagram
+ * socket tracking is deferred to the transport.
  *
  * - Sockets created by user action will either be "client" sockets that
  * initiate a connection or "server" sockets that listen for connections; we do
@@ -116,6 +121,7 @@
 static int __vsock_bind(struct sock *sk, struct sockaddr_vm *addr);
 static void vsock_sk_destruct(struct sock *sk);
 static int vsock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
+static bool sock_type_connectible(u16 type);
 
 /* Protocol family. */
 struct proto vsock_proto = {
@@ -152,21 +158,25 @@ static DEFINE_MUTEX(vsock_register_mutex);
  * VSocket is stored in the connected hash table.
  *
  * Unbound sockets are all put on the same list attached to the end of the hash
- * table (vsock_unbound_sockets).  Bound sockets are added to the hash table in
- * the bucket that their local address hashes to (vsock_bound_sockets(addr)
- * represents the list that addr hashes to).
+ * tables (vsock_unbound_sockets/vsock_unbound_dgram_sockets).  Bound sockets
+ * are added to the hash table in the bucket that their local address hashes to
+ * (vsock_bound_sockets(addr) and vsock_bound_dgram_sockets(addr) represents
+ * the list that addr hashes to).
  *
- * Specifically, we initialize the vsock_bind_table array to a size of
- * VSOCK_HASH_SIZE + 1 so that vsock_bind_table[0] through
- * vsock_bind_table[VSOCK_HASH_SIZE - 1] are for bound sockets and
- * vsock_bind_table[VSOCK_HASH_SIZE] is for unbound sockets.  The hash function
- * mods with VSOCK_HASH_SIZE to ensure this.
+ * Specifically, taking connectible sockets as an example we initialize the
+ * vsock_bind_table array to a size of VSOCK_HASH_SIZE + 1 so that
+ * vsock_bind_table[0] through vsock_bind_table[VSOCK_HASH_SIZE - 1] are for
+ * bound sockets and vsock_bind_table[VSOCK_HASH_SIZE] is for unbound sockets.
+ * The hash function mods with VSOCK_HASH_SIZE to ensure this.
+ * Datagrams and vsock_dgram_bind_table operate in the same way.
  */
 #define MAX_PORT_RETRIES        24
 
 #define VSOCK_HASH(addr)        ((addr)->svm_port % VSOCK_HASH_SIZE)
 #define vsock_bound_sockets(addr) (&vsock_bind_table[VSOCK_HASH(addr)])
+#define vsock_bound_dgram_sockets(addr) (&vsock_dgram_bind_table[VSOCK_HASH(addr)])
 #define vsock_unbound_sockets     (&vsock_bind_table[VSOCK_HASH_SIZE])
+#define vsock_unbound_dgram_sockets     (&vsock_dgram_bind_table[VSOCK_HASH_SIZE])
 
 /* XXX This can probably be implemented in a better way. */
 #define VSOCK_CONN_HASH(src, dst)				\
@@ -182,6 +192,8 @@ struct list_head vsock_connected_table[VSOCK_HASH_SIZE];
 EXPORT_SYMBOL_GPL(vsock_connected_table);
 DEFINE_SPINLOCK(vsock_table_lock);
 EXPORT_SYMBOL_GPL(vsock_table_lock);
+static struct list_head vsock_dgram_bind_table[VSOCK_HASH_SIZE + 1];
+static DEFINE_SPINLOCK(vsock_dgram_table_lock);
 
 /* Autobind this socket to the local address if necessary. */
 static int vsock_auto_bind(struct vsock_sock *vsk)
@@ -204,6 +216,9 @@ static void vsock_init_tables(void)
 
 	for (i = 0; i < ARRAY_SIZE(vsock_connected_table); i++)
 		INIT_LIST_HEAD(&vsock_connected_table[i]);
+
+	for (i = 0; i < ARRAY_SIZE(vsock_dgram_bind_table); i++)
+		INIT_LIST_HEAD(&vsock_dgram_bind_table[i]);
 }
 
 static void __vsock_insert_bound(struct list_head *list,
@@ -271,13 +286,28 @@ static struct sock *__vsock_find_connected_socket(struct sockaddr_vm *src,
 	return NULL;
 }
 
-static void vsock_insert_unbound(struct vsock_sock *vsk)
+static void __vsock_insert_dgram_unbound(struct vsock_sock *vsk)
+{
+	spin_lock_bh(&vsock_dgram_table_lock);
+	__vsock_insert_bound(vsock_unbound_dgram_sockets, vsk);
+	spin_unlock_bh(&vsock_dgram_table_lock);
+}
+
+static void __vsock_insert_connectible_unbound(struct vsock_sock *vsk)
 {
 	spin_lock_bh(&vsock_table_lock);
 	__vsock_insert_bound(vsock_unbound_sockets, vsk);
 	spin_unlock_bh(&vsock_table_lock);
 }
 
+static void vsock_insert_unbound(struct vsock_sock *vsk)
+{
+	if (sock_type_connectible(sk_vsock(vsk)->sk_type))
+		__vsock_insert_connectible_unbound(vsk);
+	else
+		__vsock_insert_dgram_unbound(vsk);
+}
+
 void vsock_insert_connected(struct vsock_sock *vsk)
 {
 	struct list_head *list = vsock_connected_sockets(
@@ -289,6 +319,14 @@ void vsock_insert_connected(struct vsock_sock *vsk)
 }
 EXPORT_SYMBOL_GPL(vsock_insert_connected);
 
+static void vsock_remove_dgram_bound(struct vsock_sock *vsk)
+{
+	spin_lock_bh(&vsock_dgram_table_lock);
+	if (__vsock_in_bound_table(vsk))
+		__vsock_remove_bound(vsk);
+	spin_unlock_bh(&vsock_dgram_table_lock);
+}
+
 void vsock_remove_bound(struct vsock_sock *vsk)
 {
 	spin_lock_bh(&vsock_table_lock);
@@ -340,7 +378,10 @@ EXPORT_SYMBOL_GPL(vsock_find_connected_socket);
 
 void vsock_remove_sock(struct vsock_sock *vsk)
 {
-	vsock_remove_bound(vsk);
+	if (sock_type_connectible(sk_vsock(vsk)->sk_type))
+		vsock_remove_bound(vsk);
+	else
+		vsock_remove_dgram_bound(vsk);
 	vsock_remove_connected(vsk);
 }
 EXPORT_SYMBOL_GPL(vsock_remove_sock);
@@ -746,11 +787,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
 	return vsock_bind_common(vsk, addr, vsock_bind_table, VSOCK_HASH_SIZE + 1);
 }
 
-static int __vsock_bind_dgram(struct vsock_sock *vsk,
-			      struct sockaddr_vm *addr)
+static int vsock_bind_dgram(struct vsock_sock *vsk,
+			    struct sockaddr_vm *addr)
 {
-	if (!vsk->transport || !vsk->transport->dgram_bind)
-		return -EINVAL;
+	if (!vsk->transport || !vsk->transport->dgram_bind) {
+		int retval;
+
+		spin_lock_bh(&vsock_dgram_table_lock);
+		retval = vsock_bind_common(vsk, addr, vsock_dgram_bind_table,
+					   VSOCK_HASH_SIZE);
+		spin_unlock_bh(&vsock_dgram_table_lock);
+
+		return retval;
+	}
 
 	return vsk->transport->dgram_bind(vsk, addr);
 }
@@ -781,7 +830,7 @@ static int __vsock_bind(struct sock *sk, struct sockaddr_vm *addr)
 		break;
 
 	case SOCK_DGRAM:
-		retval = __vsock_bind_dgram(vsk, addr);
+		retval = vsock_bind_dgram(vsk, addr);
 		break;
 
 	default:
-- 
2.20.1


^ permalink raw reply related

* [RFC PATCH net-next v6 06/14] virtio/vsock: add VIRTIO_VSOCK_TYPE_DGRAM
From: Amery Hung @ 2024-07-10 21:25 UTC (permalink / raw)
  To: stefanha, sgarzare, mst, jasowang, xuanzhuo, davem, edumazet,
	kuba, pabeni, kys, haiyangz, wei.liu, decui, bryantan, vdasa,
	pv-drivers
  Cc: dan.carpenter, simon.horman, oxffffaa, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, bobby.eshleman,
	jiang.wang, amery.hung, ameryhung, xiyou.wangcong
In-Reply-To: <20240710212555.1617795-1-amery.hung@bytedance.com>

From: Bobby Eshleman <bobby.eshleman@bytedance.com>

This commit adds the datagram packet type for inclusion in virtio vsock
packet headers. It is included here as a standalone commit because
multiple future but distinct commits depend on it.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 include/uapi/linux/virtio_vsock.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
index 64738838bee5..331be28b1d30 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -69,6 +69,7 @@ struct virtio_vsock_hdr {
 enum virtio_vsock_type {
 	VIRTIO_VSOCK_TYPE_STREAM = 1,
 	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
+	VIRTIO_VSOCK_TYPE_DGRAM = 3,
 };
 
 enum virtio_vsock_op {
-- 
2.20.1


^ permalink raw reply related

* [RFC PATCH net-next v6 07/14] virtio/vsock: add common datagram send path
From: Amery Hung @ 2024-07-10 21:25 UTC (permalink / raw)
  To: stefanha, sgarzare, mst, jasowang, xuanzhuo, davem, edumazet,
	kuba, pabeni, kys, haiyangz, wei.liu, decui, bryantan, vdasa,
	pv-drivers
  Cc: dan.carpenter, simon.horman, oxffffaa, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, bobby.eshleman,
	jiang.wang, amery.hung, ameryhung, xiyou.wangcong
In-Reply-To: <20240710212555.1617795-1-amery.hung@bytedance.com>

From: Bobby Eshleman <bobby.eshleman@bytedance.com>

This commit implements the common function
virtio_transport_dgram_enqueue for enqueueing datagrams. It does not add
usage in either vhost or virtio yet.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 include/linux/virtio_vsock.h            |  1 +
 include/net/af_vsock.h                  |  2 +
 net/vmw_vsock/af_vsock.c                |  2 +-
 net/vmw_vsock/virtio_transport_common.c | 87 ++++++++++++++++++++++++-
 4 files changed, 90 insertions(+), 2 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index f749a066af46..4408749febd2 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -152,6 +152,7 @@ struct virtio_vsock_pkt_info {
 	u16 op;
 	u32 flags;
 	bool reply;
+	u8 remote_flags;
 };
 
 struct virtio_transport {
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 44db8f2c507d..6e97d344ac75 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -216,6 +216,8 @@ void vsock_for_each_connected_socket(struct vsock_transport *transport,
 				     void (*fn)(struct sock *sk));
 int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk);
 bool vsock_find_cid(unsigned int cid);
+const struct vsock_transport *vsock_dgram_lookup_transport(unsigned int cid,
+							   __u8 flags);
 
 struct vsock_skb_cb {
 	unsigned int src_cid;
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index ab08cd81720e..f83b655fdbe9 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -487,7 +487,7 @@ vsock_connectible_lookup_transport(unsigned int cid, __u8 flags)
 	return transport;
 }
 
-static const struct vsock_transport *
+const struct vsock_transport *
 vsock_dgram_lookup_transport(unsigned int cid, __u8 flags)
 {
 	const struct vsock_transport *transport;
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index a1c76836d798..46cd1807f8e3 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1040,13 +1040,98 @@ int virtio_transport_shutdown(struct vsock_sock *vsk, int mode)
 }
 EXPORT_SYMBOL_GPL(virtio_transport_shutdown);
 
+static int virtio_transport_dgram_send_pkt_info(struct vsock_sock *vsk,
+						struct virtio_vsock_pkt_info *info)
+{
+	u32 src_cid, src_port, dst_cid, dst_port;
+	const struct vsock_transport *transport;
+	const struct virtio_transport *t_ops;
+	struct sock *sk = sk_vsock(vsk);
+	struct virtio_vsock_hdr *hdr;
+	struct sk_buff *skb;
+	void *payload;
+	int noblock = 0;
+	int err;
+
+	info->type = virtio_transport_get_type(sk_vsock(vsk));
+
+	if (info->pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
+		return -EMSGSIZE;
+
+	transport = vsock_dgram_lookup_transport(info->remote_cid, info->remote_flags);
+	t_ops = container_of(transport, struct virtio_transport, transport);
+	if (unlikely(!t_ops))
+		return -EFAULT;
+
+	if (info->msg)
+		noblock = info->msg->msg_flags & MSG_DONTWAIT;
+
+	/* Use sock_alloc_send_skb to throttle by sk_sndbuf. This helps avoid
+	 * triggering the OOM.
+	 */
+	skb = sock_alloc_send_skb(sk, info->pkt_len + VIRTIO_VSOCK_SKB_HEADROOM,
+				  noblock, &err);
+	if (!skb)
+		return err;
+
+	skb_reserve(skb, VIRTIO_VSOCK_SKB_HEADROOM);
+
+	src_cid = t_ops->transport.get_local_cid();
+	src_port = vsk->local_addr.svm_port;
+	dst_cid = info->remote_cid;
+	dst_port = info->remote_port;
+
+	hdr = virtio_vsock_hdr(skb);
+	hdr->type	= cpu_to_le16(info->type);
+	hdr->op		= cpu_to_le16(info->op);
+	hdr->src_cid	= cpu_to_le64(src_cid);
+	hdr->dst_cid	= cpu_to_le64(dst_cid);
+	hdr->src_port	= cpu_to_le32(src_port);
+	hdr->dst_port	= cpu_to_le32(dst_port);
+	hdr->flags	= cpu_to_le32(info->flags);
+	hdr->len	= cpu_to_le32(info->pkt_len);
+
+	if (info->msg && info->pkt_len > 0) {
+		payload = skb_put(skb, info->pkt_len);
+		err = memcpy_from_msg(payload, info->msg, info->pkt_len);
+		if (err)
+			goto out;
+	}
+
+	trace_virtio_transport_alloc_pkt(src_cid, src_port,
+					 dst_cid, dst_port,
+					 info->pkt_len,
+					 info->type,
+					 info->op,
+					 info->flags,
+					 false);
+
+	return t_ops->send_pkt(skb);
+out:
+	kfree_skb(skb);
+	return err;
+}
+
 int
 virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
 			       struct sockaddr_vm *remote_addr,
 			       struct msghdr *msg,
 			       size_t dgram_len)
 {
-	return -EOPNOTSUPP;
+	/* Here we are only using the info struct to retain style uniformity
+	 * and to ease future refactoring and merging.
+	 */
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_RW,
+		.remote_cid = remote_addr->svm_cid,
+		.remote_port = remote_addr->svm_port,
+		.remote_flags = remote_addr->svm_flags,
+		.msg = msg,
+		.vsk = vsk,
+		.pkt_len = dgram_len,
+	};
+
+	return virtio_transport_dgram_send_pkt_info(vsk, &info);
 }
 EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
 
-- 
2.20.1


^ permalink raw reply related

* [RFC PATCH net-next v6 08/14] af_vsock: add vsock_find_bound_dgram_socket()
From: Amery Hung @ 2024-07-10 21:25 UTC (permalink / raw)
  To: stefanha, sgarzare, mst, jasowang, xuanzhuo, davem, edumazet,
	kuba, pabeni, kys, haiyangz, wei.liu, decui, bryantan, vdasa,
	pv-drivers
  Cc: dan.carpenter, simon.horman, oxffffaa, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, bobby.eshleman,
	jiang.wang, amery.hung, ameryhung, xiyou.wangcong
In-Reply-To: <20240710212555.1617795-1-amery.hung@bytedance.com>

From: Bobby Eshleman <bobby.eshleman@bytedance.com>

This commit adds vsock_find_bound_dgram_socket() which allows transports
to find bound dgram sockets in the global dgram bind table. It is
intended to be used for "routing" incoming packets to the correct
sockets if the transport uses the global bind table.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 include/net/af_vsock.h   |  1 +
 net/vmw_vsock/af_vsock.c | 16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 6e97d344ac75..9d0882b82bfa 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -218,6 +218,7 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk);
 bool vsock_find_cid(unsigned int cid);
 const struct vsock_transport *vsock_dgram_lookup_transport(unsigned int cid,
 							   __u8 flags);
+struct sock *vsock_find_bound_dgram_socket(struct sockaddr_vm *addr);
 
 struct vsock_skb_cb {
 	unsigned int src_cid;
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index f83b655fdbe9..f0e5db0eb43a 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -265,6 +265,22 @@ static struct sock *vsock_find_bound_socket_common(struct sockaddr_vm *addr,
 	return NULL;
 }
 
+struct sock *
+vsock_find_bound_dgram_socket(struct sockaddr_vm *addr)
+{
+	struct sock *sk;
+
+	spin_lock_bh(&vsock_dgram_table_lock);
+	sk = vsock_find_bound_socket_common(addr, vsock_bound_dgram_sockets(addr));
+	if (sk)
+		sock_hold(sk);
+
+	spin_unlock_bh(&vsock_dgram_table_lock);
+
+	return sk;
+}
+EXPORT_SYMBOL_GPL(vsock_find_bound_dgram_socket);
+
 static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
 {
 	return vsock_find_bound_socket_common(addr, vsock_bound_sockets(addr));
-- 
2.20.1


^ permalink raw reply related

* [RFC PATCH net-next v6 09/14] virtio/vsock: add common datagram recv path
From: Amery Hung @ 2024-07-10 21:25 UTC (permalink / raw)
  To: stefanha, sgarzare, mst, jasowang, xuanzhuo, davem, edumazet,
	kuba, pabeni, kys, haiyangz, wei.liu, decui, bryantan, vdasa,
	pv-drivers
  Cc: dan.carpenter, simon.horman, oxffffaa, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, bobby.eshleman,
	jiang.wang, amery.hung, ameryhung, xiyou.wangcong
In-Reply-To: <20240710212555.1617795-1-amery.hung@bytedance.com>

From: Bobby Eshleman <bobby.eshleman@bytedance.com>

This commit adds the common datagram receive functionality for virtio
transports. It does not add the vhost/virtio users of that
functionality.

This functionality includes:
- changes to the virtio_transport_recv_pkt() path for finding the
  bound socket receiver for incoming packets
- virtio_transport_recv_pkt() saves the source cid and port to the
  control buffer for recvmsg() to initialize sockaddr_vm structure
  when using datagram

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 net/vmw_vsock/virtio_transport_common.c | 79 +++++++++++++++++++++----
 1 file changed, 66 insertions(+), 13 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 46cd1807f8e3..a571b575fde9 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -235,7 +235,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
 
 static u16 virtio_transport_get_type(struct sock *sk)
 {
-	if (sk->sk_type == SOCK_STREAM)
+	if (sk->sk_type == SOCK_DGRAM)
+		return VIRTIO_VSOCK_TYPE_DGRAM;
+	else if (sk->sk_type == SOCK_STREAM)
 		return VIRTIO_VSOCK_TYPE_STREAM;
 	else
 		return VIRTIO_VSOCK_TYPE_SEQPACKET;
@@ -1422,6 +1424,33 @@ virtio_transport_recv_enqueue(struct vsock_sock *vsk,
 		kfree_skb(skb);
 }
 
+static void
+virtio_transport_dgram_kfree_skb(struct sk_buff *skb, int err)
+{
+	if (err == -ENOMEM)
+		kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_RCVBUFF);
+	else if (err == -ENOBUFS)
+		kfree_skb_reason(skb, SKB_DROP_REASON_PROTO_MEM);
+	else
+		kfree_skb(skb);
+}
+
+/* This function takes ownership of the skb.
+ *
+ * It either places the skb on the sk_receive_queue or frees it.
+ */
+static void
+virtio_transport_recv_dgram(struct sock *sk, struct sk_buff *skb)
+{
+	int err;
+
+	err = sock_queue_rcv_skb(sk, skb);
+	if (err) {
+		virtio_transport_dgram_kfree_skb(skb, err);
+		return;
+	}
+}
+
 static int
 virtio_transport_recv_connected(struct sock *sk,
 				struct sk_buff *skb)
@@ -1591,7 +1620,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
 static bool virtio_transport_valid_type(u16 type)
 {
 	return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
-	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
+	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
+	       (type == VIRTIO_VSOCK_TYPE_DGRAM);
 }
 
 /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
@@ -1601,44 +1631,57 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 			       struct sk_buff *skb)
 {
 	struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
+	struct vsock_skb_cb *vsock_cb;
 	struct sockaddr_vm src, dst;
 	struct vsock_sock *vsk;
 	struct sock *sk;
 	bool space_available;
+	u16 type;
 
 	vsock_addr_init(&src, le64_to_cpu(hdr->src_cid),
 			le32_to_cpu(hdr->src_port));
 	vsock_addr_init(&dst, le64_to_cpu(hdr->dst_cid),
 			le32_to_cpu(hdr->dst_port));
 
+	type = le16_to_cpu(hdr->type);
+
 	trace_virtio_transport_recv_pkt(src.svm_cid, src.svm_port,
 					dst.svm_cid, dst.svm_port,
 					le32_to_cpu(hdr->len),
-					le16_to_cpu(hdr->type),
+					type,
 					le16_to_cpu(hdr->op),
 					le32_to_cpu(hdr->flags),
 					le32_to_cpu(hdr->buf_alloc),
 					le32_to_cpu(hdr->fwd_cnt));
 
-	if (!virtio_transport_valid_type(le16_to_cpu(hdr->type))) {
+	if (!virtio_transport_valid_type(type)) {
 		(void)virtio_transport_reset_no_sock(t, skb);
 		goto free_pkt;
 	}
 
-	/* The socket must be in connected or bound table
-	 * otherwise send reset back
+	/* For stream/seqpacket, the socket must be in connected or bound table
+	 * otherwise send reset back.
+	 *
+	 * For datagrams, no reset is sent back.
 	 */
 	sk = vsock_find_connected_socket(&src, &dst);
 	if (!sk) {
-		sk = vsock_find_bound_socket(&dst);
-		if (!sk) {
-			(void)virtio_transport_reset_no_sock(t, skb);
-			goto free_pkt;
+		if (type == VIRTIO_VSOCK_TYPE_DGRAM) {
+			sk = vsock_find_bound_dgram_socket(&dst);
+			if (!sk)
+				goto free_pkt;
+		} else {
+			sk = vsock_find_bound_socket(&dst);
+			if (!sk) {
+				(void)virtio_transport_reset_no_sock(t, skb);
+				goto free_pkt;
+			}
 		}
 	}
 
-	if (virtio_transport_get_type(sk) != le16_to_cpu(hdr->type)) {
-		(void)virtio_transport_reset_no_sock(t, skb);
+	if (virtio_transport_get_type(sk) != type) {
+		if (type != VIRTIO_VSOCK_TYPE_DGRAM)
+			(void)virtio_transport_reset_no_sock(t, skb);
 		sock_put(sk);
 		goto free_pkt;
 	}
@@ -1654,12 +1697,21 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 
 	/* Check if sk has been closed before lock_sock */
 	if (sock_flag(sk, SOCK_DONE)) {
-		(void)virtio_transport_reset_no_sock(t, skb);
+		if (type != VIRTIO_VSOCK_TYPE_DGRAM)
+			(void)virtio_transport_reset_no_sock(t, skb);
 		release_sock(sk);
 		sock_put(sk);
 		goto free_pkt;
 	}
 
+	if (sk->sk_type == SOCK_DGRAM) {
+		vsock_cb = vsock_skb_cb(skb);
+		vsock_cb->src_cid = src.svm_cid;
+		vsock_cb->src_port = src.svm_port;
+		virtio_transport_recv_dgram(sk, skb);
+		goto out;
+	}
+
 	space_available = virtio_transport_space_update(sk, skb);
 
 	/* Update CID in case it has changed after a transport reset event */
@@ -1691,6 +1743,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 		break;
 	}
 
+out:
 	release_sock(sk);
 
 	/* Release refcnt obtained when we fetched this socket out of the
-- 
2.20.1


^ permalink raw reply related

* [RFC PATCH net-next v6 10/14] virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
From: Amery Hung @ 2024-07-10 21:25 UTC (permalink / raw)
  To: stefanha, sgarzare, mst, jasowang, xuanzhuo, davem, edumazet,
	kuba, pabeni, kys, haiyangz, wei.liu, decui, bryantan, vdasa,
	pv-drivers
  Cc: dan.carpenter, simon.horman, oxffffaa, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, bobby.eshleman,
	jiang.wang, amery.hung, ameryhung, xiyou.wangcong
In-Reply-To: <20240710212555.1617795-1-amery.hung@bytedance.com>

From: Bobby Eshleman <bobby.eshleman@bytedance.com>

This commit adds a feature bit for virtio vsock to support datagrams.

Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 include/uapi/linux/virtio_vsock.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
index 331be28b1d30..27b4b2b8bf13 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -40,6 +40,7 @@
 
 /* The feature bitmap for virtio vsock */
 #define VIRTIO_VSOCK_F_SEQPACKET	1	/* SOCK_SEQPACKET supported */
+#define VIRTIO_VSOCK_F_DGRAM		3	/* SOCK_DGRAM supported */
 
 struct virtio_vsock_config {
 	__le64 guest_cid;
-- 
2.20.1


^ permalink raw reply related

* [RFC PATCH net-next v6 11/14] vhost/vsock: implement datagram support
From: Amery Hung @ 2024-07-10 21:25 UTC (permalink / raw)
  To: stefanha, sgarzare, mst, jasowang, xuanzhuo, davem, edumazet,
	kuba, pabeni, kys, haiyangz, wei.liu, decui, bryantan, vdasa,
	pv-drivers
  Cc: dan.carpenter, simon.horman, oxffffaa, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv, bpf, bobby.eshleman,
	jiang.wang, amery.hung, ameryhung, xiyou.wangcong
In-Reply-To: <20240710212555.1617795-1-amery.hung@bytedance.com>

From: Bobby Eshleman <bobby.eshleman@bytedance.com>

This commit implements datagram support for vhost/vsock by teaching
vhost to use the common virtio transport datagram functions.

If the virtio RX buffer is too small, then the transmission is
abandoned, the packet dropped, and EHOSTUNREACH is added to the socket's
error queue.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 drivers/vhost/vsock.c    | 60 ++++++++++++++++++++++++++++++++++++++--
 net/vmw_vsock/af_vsock.c |  2 +-
 2 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index fa1aefb78016..13c3cbff21da 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -8,6 +8,7 @@
  */
 #include <linux/miscdevice.h>
 #include <linux/atomic.h>
+#include <linux/errqueue.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <linux/vmalloc.h>
@@ -32,7 +33,8 @@
 enum {
 	VHOST_VSOCK_FEATURES = VHOST_FEATURES |
 			       (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
-			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
+			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
+			       (1ULL << VIRTIO_VSOCK_F_DGRAM)
 };
 
 enum {
@@ -56,6 +58,7 @@ struct vhost_vsock {
 	atomic_t queued_replies;
 
 	u32 guest_cid;
+	bool dgram_allow;
 	bool seqpacket_allow;
 };
 
@@ -86,6 +89,32 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
 	return NULL;
 }
 
+/* Claims ownership of the skb, do not free the skb after calling! */
+static void
+vhost_transport_error(struct sk_buff *skb, int err)
+{
+	struct sock_exterr_skb *serr;
+	struct sock *sk = skb->sk;
+	struct sk_buff *clone;
+
+	serr = SKB_EXT_ERR(skb);
+	memset(serr, 0, sizeof(*serr));
+	serr->ee.ee_errno = err;
+	serr->ee.ee_origin = SO_EE_ORIGIN_NONE;
+
+	clone = skb_clone(skb, GFP_KERNEL);
+	if (!clone)
+		goto out;
+
+	if (sock_queue_err_skb(sk, clone))
+		kfree_skb(clone);
+
+	sk->sk_err = err;
+	sk_error_report(sk);
+out:
+	kfree_skb(skb);
+}
+
 static void
 vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 			    struct vhost_virtqueue *vq)
@@ -162,9 +191,15 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 		hdr = virtio_vsock_hdr(skb);
 
 		/* If the packet is greater than the space available in the
-		 * buffer, we split it using multiple buffers.
+		 * buffer, we split it using multiple buffers for connectible
+		 * sockets and drop the packet for datagram sockets.
 		 */
 		if (payload_len > iov_len - sizeof(*hdr)) {
+			if (le16_to_cpu(hdr->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
+				vhost_transport_error(skb, EHOSTUNREACH);
+				continue;
+			}
+
 			payload_len = iov_len - sizeof(*hdr);
 
 			/* As we are copying pieces of large packet's buffer to
@@ -403,6 +438,22 @@ static bool vhost_transport_msgzerocopy_allow(void)
 	return true;
 }
 
+static bool vhost_transport_dgram_allow(u32 cid, u32 port)
+{
+	struct vhost_vsock *vsock;
+	bool dgram_allow = false;
+
+	rcu_read_lock();
+	vsock = vhost_vsock_get(cid);
+
+	if (vsock)
+		dgram_allow = vsock->dgram_allow;
+
+	rcu_read_unlock();
+
+	return dgram_allow;
+}
+
 static bool vhost_transport_seqpacket_allow(u32 remote_cid);
 
 static struct virtio_transport vhost_transport = {
@@ -419,7 +470,7 @@ static struct virtio_transport vhost_transport = {
 		.cancel_pkt               = vhost_transport_cancel_pkt,
 
 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
-		.dgram_allow              = virtio_transport_dgram_allow,
+		.dgram_allow              = vhost_transport_dgram_allow,
 
 		.stream_enqueue           = virtio_transport_stream_enqueue,
 		.stream_dequeue           = virtio_transport_stream_dequeue,
@@ -811,6 +862,9 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
 	if (features & (1ULL << VIRTIO_VSOCK_F_SEQPACKET))
 		vsock->seqpacket_allow = true;
 
+	if (features & (1ULL << VIRTIO_VSOCK_F_DGRAM))
+		vsock->dgram_allow = true;
+
 	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
 		vq = &vsock->vqs[i];
 		mutex_lock(&vq->mutex);
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index f0e5db0eb43a..344db0f3a602 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1463,7 +1463,7 @@ int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 		return prot->recvmsg(sk, msg, len, flags, NULL);
 #endif
 
-	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
+	if (unlikely(flags & MSG_OOB))
 		return -EOPNOTSUPP;
 
 	if (unlikely(flags & MSG_ERRQUEUE))
-- 
2.20.1


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox