* RE: [PATCH 1/2] x86/hyperv: Allow guests to enable InvariantTSC
From: Vitaly Kuznetsov @ 2019-10-08 15:12 UTC (permalink / raw)
To: Michael Kelley, Andrea Parri, linux-kernel@vger.kernel.org,
linux-hyperv@vger.kernel.org, x86@kernel.org
Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
Andrea Parri
In-Reply-To: <DM5PR21MB01371F96CD845743D9777DC5D79E0@DM5PR21MB0137.namprd21.prod.outlook.com>
Michael Kelley <mikelley@microsoft.com> writes:
> From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Friday, October 4, 2019 9:57 AM
>>
>> Andrea Parri <parri.andrea@gmail.com> writes:
>>
>> > If the hardware supports TSC scaling, Hyper-V will set bit 15 of the
>> > HV_PARTITION_PRIVILEGE_MASK in guest VMs with a compatible Hyper-V
>> > configuration version. Bit 15 corresponds to the
>> > AccessTscInvariantControls privilege. If this privilege bit is set,
>> > guests can access the HvSyntheticInvariantTscControl MSR: guests can
>> > set bit 0 of this synthetic MSR to enable the InvariantTSC feature.
>> > After setting the synthetic MSR, CPUID will enumerate support for
>> > InvariantTSC.
>>
>> I tried getting more information from TLFS but as of 5.0C this feature
>> is not described there. I'm really interested in why this additional
>> interface is needed, e.g. why can't Hyper-V just set InvariantTSC
>> unconditionally when TSC scaling is supported?
>>
>
> Yes, this is very new functionality that is not yet available in a released
> version of Hyper-V. And as you know, the Hyper-V TLFS has gotten
> woefully out-of-date. :-(
>
> Your question is the same question I asked. The reason given by
> Hyper-V is to take the more cautious approach of not "automatically"
> giving VMs an InvariantTSC due to updating the underlying Hyper-V
> version. Instead, guest VMs must have been explicitly coded to take
> advantage of the new InvariantTSC feature. It's not clear to me how
> much of this caution is driven by Windows guests vs. Linux or FreeBSD
> guests, but it is what it is.
>
> Having to explicitly enable the InvariantTSC does give the Linux code
> the opportunity to be a bit cleaner by doing things like not marking
> the TSC as unstable when the InvariantTSC feature is present, and to
> mark the TSC as reliable so we don't try to do TSC synchronization
> (which Hyper-V does not want guests to try to do).
Thank you for these additional details Michael,
we'll probably have to add support for this bit to KVM and I'd like to
know the background. From Linux perspective, no matter what's the
interface we'd like to get InvariantTSC.
Feel free to add
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
--
Vitaly
^ permalink raw reply
* Re: [PATCH RFC] kconfig: add hvconfig for Linux on Hyper-V
From: Masahiro Yamada @ 2019-10-08 16:03 UTC (permalink / raw)
To: Wei Liu
Cc: Linux on Hyper-V List, Linux Kernel List, Linux Kconfig List,
Wei Liu
In-Reply-To: <20191008131508.21189-1-liuw@liuw.name>
On Tue, Oct 8, 2019 at 10:15 PM Wei Liu <wei.liu@kernel.org> wrote:
>
> From: Wei Liu <liuwe@microsoft.com>
>
> Add an config file snippet which enalbes additional options useful for
> running the kernel in a Hyper-V guest.
>
> The expected use case is a user provides an existing config file then
> executes `make hvconfig`. It will merge those options with the
> provided config file.
>
> Based on similar concept for Xen and KVM.
>
> Signed-off-by: Wei Liu <liuwe@microsoft.com>
> ---
> RFC: I only tested this on x86. Although the config options included in
> hv_guest.config don't seem to be arch-specific, we should probably
> move the ones not yet implemented on Arm to an x86 specific config
> file.
> ---
> Documentation/admin-guide/README.rst | 3 +++
> kernel/configs/hv_guest.config | 33 ++++++++++++++++++++++++++++
> scripts/kconfig/Makefile | 5 +++++
> 3 files changed, 41 insertions(+)
> create mode 100644 kernel/configs/hv_guest.config
>
> diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst
> index cc6151fc0845..d5f4389a7a2f 100644
> --- a/Documentation/admin-guide/README.rst
> +++ b/Documentation/admin-guide/README.rst
> @@ -224,6 +224,9 @@ Configuring the kernel
> "make xenconfig" Enable additional options for xen dom0 guest kernel
> support.
>
> + "make hvconfig" Enable additional options for Hyper-V guest kernel
> + support.
> +
> "make tinyconfig" Configure the tiniest possible kernel.
>
> You can find more information on using the Linux kernel config tools
> diff --git a/kernel/configs/hv_guest.config b/kernel/configs/hv_guest.config
> new file mode 100644
> index 000000000000..0e71e34a2d4d
> --- /dev/null
> +++ b/kernel/configs/hv_guest.config
> @@ -0,0 +1,33 @@
> +CONFIG_NET=y
> +CONFIG_NET_CORE=y
> +CONFIG_NETDEVICES=y
> +CONFIG_BLOCK=y
> +CONFIG_BLK_DEV=y
> +CONFIG_NETWORK_FILESYSTEMS=y
> +CONFIG_INET=y
> +CONFIG_TTY=y
> +CONFIG_SERIAL_8250=y
> +CONFIG_SERIAL_8250_CONSOLE=y
> +CONFIG_IP_PNP=y
> +CONFIG_IP_PNP_DHCP=y
> +CONFIG_BINFMT_ELF=y
> +CONFIG_PCI=y
> +CONFIG_PCI_MSI=y
> +CONFIG_DEBUG_KERNEL=y
> +CONFIG_VIRTUALIZATION=y
> +CONFIG_HYPERVISOR_GUEST=y
> +CONFIG_PARAVIRT=y
> +CONFIG_HYPERV=y
> +CONFIG_HYPERV_VSOCKETS=y
> +CONFIG_PCI_HYPERV=y
> +CONFIG_PCI_HYPERV_INTERFACE=y
> +CONFIG_HYPERV_STORAGE=y
> +CONFIG_HYPERV_NET=y
> +CONFIG_HYPERV_KEYBOARD=y
> +CONFIG_FB_HYPERV=y
> +CONFIG_HID_HYPERV_MOUSE=y
> +CONFIG_HYPERV=y
> +CONFIG_HYPERV_TIMER=y
> +CONFIG_HYPERV_UTILS=y
> +CONFIG_HYPERV_BALLOON=y
> +CONFIG_HYPERV_IOMMU=y
> diff --git a/scripts/kconfig/Makefile b/scripts/kconfig/Makefile
> index ef2f2336c469..2ee46301b22e 100644
> --- a/scripts/kconfig/Makefile
> +++ b/scripts/kconfig/Makefile
> @@ -104,6 +104,10 @@ PHONY += xenconfig
> xenconfig: xen.config
> @:
>
> +PHONY += hvconfig
> +hvconfig: hv_guest.config
> + @:
> +
Does this need to be hooked up to kconfig Makefile?
In my understanding, this code provides
"make hvconfig" as a shorthand for "make hv_guest.config"
Please do not do this.
See "xenconfig" as a bad example.
"make xenconfig" is a shorthand of "make xen.config".
This exists to save just one character typing.
If I allow this, people would push more and more random pointless shorthands,
which are essentially unrelated to kconfig.
kvmconfig and xenconfig are just historical mistakes.
Please drop the changes to scripts/kconfig/Makefile.
Also, please do not use misleading "kconfig:" for the subject prefix.
You can use the subject prefix "hyper-v:" or something.
Thanks.
> PHONY += tinyconfig
> tinyconfig:
> $(Q)$(MAKE) -f $(srctree)/Makefile allnoconfig tiny.config
> @@ -138,6 +142,7 @@ help:
> @echo ' default value without prompting'
> @echo ' kvmconfig - Enable additional options for kvm guest kernel support'
> @echo ' xenconfig - Enable additional options for xen dom0 and guest kernel support'
> + @echo ' hvconfig - Enable additional options for Hyper-V guest kernel support'
> @echo ' tinyconfig - Configure the tiniest possible kernel'
> @echo ' testconfig - Run Kconfig unit tests (requires python3 and pytest)'
>
> --
> 2.20.1
>
--
Best Regards
Masahiro Yamada
^ permalink raw reply
* Re: [PATCH RFC] kconfig: add hvconfig for Linux on Hyper-V
From: Wei Liu @ 2019-10-08 16:10 UTC (permalink / raw)
To: Masahiro Yamada
Cc: Wei Liu, Linux on Hyper-V List, Linux Kernel List,
Linux Kconfig List, Wei Liu
In-Reply-To: <CAK7LNAQpszkwtp2mAfoPajkRi0SHPspivWn9sUsxO0oua2X6NQ@mail.gmail.com>
On Wed, Oct 09, 2019 at 01:03:19AM +0900, Masahiro Yamada wrote:
[...]
> > diff --git a/scripts/kconfig/Makefile b/scripts/kconfig/Makefile
> > index ef2f2336c469..2ee46301b22e 100644
> > --- a/scripts/kconfig/Makefile
> > +++ b/scripts/kconfig/Makefile
> > @@ -104,6 +104,10 @@ PHONY += xenconfig
> > xenconfig: xen.config
> > @:
> >
> > +PHONY += hvconfig
> > +hvconfig: hv_guest.config
> > + @:
> > +
>
>
> Does this need to be hooked up to kconfig Makefile?
>
It is not strictly necessary. Just thought it would be nice to follow
existing examples.
> In my understanding, this code provides
> "make hvconfig" as a shorthand for "make hv_guest.config"
>
Yes that's correct.
> Please do not do this.
>
>
> See "xenconfig" as a bad example.
>
> "make xenconfig" is a shorthand of "make xen.config".
> This exists to save just one character typing.
>
>
> If I allow this, people would push more and more random pointless shorthands,
> which are essentially unrelated to kconfig.
>
>
> kvmconfig and xenconfig are just historical mistakes.
>
>
> Please drop the changes to scripts/kconfig/Makefile.
OK.
>
> Also, please do not use misleading "kconfig:" for the subject prefix.
> You can use the subject prefix "hyper-v:" or something.
>
OK.
Wei.
^ permalink raw reply
* Re: [PATCH v2] PCI: PM: Move to D0 before calling pci_legacy_resume_early()
From: Rafael J. Wysocki @ 2019-10-08 17:32 UTC (permalink / raw)
To: Dexuan Cui, Bjorn Helgaas
Cc: lorenzo.pieralisi@arm.com, linux-pci@vger.kernel.org,
Michael Kelley, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org,
driverdev-devel@linuxdriverproject.org, Sasha Levin,
Haiyang Zhang, KY Srinivasan, olaf@aepfle.de, apw@canonical.com,
jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com,
Stephen Hemminger, jackm@mellanox.com
In-Reply-To: <PU1P153MB016996765F9BB827256D05DEBF9B0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>
On 10/7/2019 8:57 PM, Dexuan Cui wrote:
>> -----Original Message-----
>> From: Bjorn Helgaas <helgaas@kernel.org>
>> Sent: Monday, October 7, 2019 6:24 AM
>> To: Dexuan Cui <decui@microsoft.com>
>> Cc: lorenzo.pieralisi@arm.com; linux-pci@vger.kernel.org; Michael Kelley
>> <mikelley@microsoft.com>; linux-hyperv@vger.kernel.org;
>> linux-kernel@vger.kernel.org; driverdev-devel@linuxdriverproject.org; Sasha
>> Levin <Alexander.Levin@microsoft.com>; Haiyang Zhang
>> <haiyangz@microsoft.com>; KY Srinivasan <kys@microsoft.com>;
>> olaf@aepfle.de; apw@canonical.com; jasowang@redhat.com; vkuznets
>> <vkuznets@redhat.com>; marcelo.cerri@canonical.com; Stephen Hemminger
>> <sthemmin@microsoft.com>; jackm@mellanox.com
>> Subject: Re: [PATCH v2] PCI: PM: Move to D0 before calling
>> pci_legacy_resume_early()
>>
>> On Wed, Aug 14, 2019 at 01:06:55AM +0000, Dexuan Cui wrote:
>>> In pci_legacy_suspend_late(), the device state is moved to PCI_UNKNOWN.
>>>
>>> In pci_pm_thaw_noirq(), the state is supposed to be moved back to PCI_D0,
>>> but the current code misses the pci_legacy_resume_early() path, so the
>>> state remains in PCI_UNKNOWN in that path. As a result, in the resume
>>> phase of hibernation, this causes an error for the Mellanox VF driver,
>>> which fails to enable MSI-X because pci_msi_supported() is false due
>>> to dev->current_state != PCI_D0:
>>>
>>> mlx4_core a6d1:00:02.0: Detected virtual function - running in slave mode
>>> mlx4_core a6d1:00:02.0: Sending reset
>>> mlx4_core a6d1:00:02.0: Sending vhcr0
>>> mlx4_core a6d1:00:02.0: HCA minimum page size:512
>>> mlx4_core a6d1:00:02.0: Timestamping is not supported in slave mode
>>> mlx4_core a6d1:00:02.0: INTx is not supported in multi-function mode,
>> aborting
>>> PM: dpm_run_callback(): pci_pm_thaw+0x0/0xd7 returns -95
>>> PM: Device a6d1:00:02.0 failed to thaw: error -95
>>>
>>> To be more accurate, the "resume" phase means the "thaw" callbacks which
>>> run before the system enters hibernation: when the user runs the command
>>> "echo disk > /sys/power/state" for hibernation, first the kernel "freezes"
>>> all the devices and creates a hibernation image, then the kernel "thaws"
>>> the devices including the disk/NIC, writes the memory to the disk, and
>>> powers down. This patch fixes the error message for the Mellanox VF driver
>>> in this phase.
>>>
>>> When the system starts again, a fresh kernel starts to run, and when the
>>> kernel detects that a hibernation image was saved, the kernel "quiesces"
>>> the devices, and then "restores" the devices from the saved image. In this
>>> path:
>>> device_resume_noirq() -> ... ->
>>> pci_pm_restore_noirq() ->
>>> pci_pm_default_resume_early() ->
>>> pci_power_up() moves the device states back to PCI_D0. This path is
>>> not broken and doesn't need my patch.
>>>
>>> Signed-off-by: Dexuan Cui <decui@microsoft.com>
>> This looks like a bugfix for 5839ee7389e8 ("PCI / PM: Force devices to
>> D0 in pci_pm_thaw_noirq()") so maybe it should be marked for stable as
>> 5839ee7389e8 was?
>>
>> Rafael, could you confirm?
No, it is not a bug fix for that commit. The underlying issue would be
there without that commit too.
>>> ---
>>>
>>> changes in v2:
>>> Updated the changelog with more details.
>>>
>>> drivers/pci/pci-driver.c | 7 ++++---
>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
>>> index 36dbe960306b..27dfc68db9e7 100644
>>> --- a/drivers/pci/pci-driver.c
>>> +++ b/drivers/pci/pci-driver.c
>>> @@ -1074,15 +1074,16 @@ static int pci_pm_thaw_noirq(struct device
>> *dev)
>>> return error;
>>> }
>>>
>>> - if (pci_has_legacy_pm_support(pci_dev))
>>> - return pci_legacy_resume_early(dev);
>>> -
>>> /*
>>> * pci_restore_state() requires the device to be in D0 (because of MSI
>>> * restoration among other things), so force it into D0 in case the
>>> * driver's "freeze" callbacks put it into a low-power state directly.
>>> */
>>> pci_set_power_state(pci_dev, PCI_D0);
>>> +
>>> + if (pci_has_legacy_pm_support(pci_dev))
>>> + return pci_legacy_resume_early(dev);
>>> +
>>> pci_restore_state(pci_dev);
>>>
>>> if (drv && drv->pm && drv->pm->thaw_noirq)
>>> --
>>> 2.19.1
>>>
The patch looks reasonable to me, but the comment above the
pci_set_power_state() call needs to be updated too IMO.
^ permalink raw reply
* Re: [PATCH 0/2] Drivers: hv: vmbus: Miscellaneous improvements
From: Vitaly Kuznetsov @ 2019-10-08 19:47 UTC (permalink / raw)
To: Andrea Parri, Dexuan Cui
Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
KY Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
Michael Kelley
In-Reply-To: <20191008150847.GA15276@andrea>
Andrea Parri <parri.andrea@gmail.com> writes:
> On Mon, Oct 07, 2019 at 05:41:10PM +0000, Dexuan Cui wrote:
>> > From: linux-hyperv-owner@vger.kernel.org
>> > <linux-hyperv-owner@vger.kernel.org> On Behalf Of Andrea Parri
>> > Sent: Monday, October 7, 2019 9:31 AM
>> >
>> > Hi all,
>> >
>> > The patchset:
>> >
>> > - simplifies/refactors the VMBus negotiation code by introducing
>> > the table of VMBus protocol versions (patch 1/2),
>> >
>> > - enables VMBus protocol versions 5.1 and 5.2 (patch 2/2).
>> >
>> > Thanks,
>> > Andrea
>> >
>> > Andrea Parri (2):
>> > Drivers: hv: vmbus: Introduce table of VMBus protocol versions
>> > Drivers: hv: vmbus: Enable VMBus protocol versions 5.1 and 5.2
>>
>> Should we add a module parameter to allow the user to specify a lower
>> protocol version, when the VM runs on the latest host?
>>
>> This can be useful for testing and debugging purpose: the variable
>> "vmbus_proto_version" is referenced by the vmbus driver itself and
>> some VSC drivers: if we always use the latest available proto version,
>> some code paths can not be tested on the latest hosts.
>
> The idea is appealing to me (altough I made no attempt to implement/test
> it yet). What do others think about this? Maybe can be considered as a
> follow-up patch/work to this series?
IMO such debug option makes sense, it shouldn be a simple patch so you
may want to squeeze it in this series as it will definitely have code
dependencies.
--
Vitaly
^ permalink raw reply
* Re: [PATCH v2] PCI: PM: Move to D0 before calling pci_legacy_resume_early()
From: Bjorn Helgaas @ 2019-10-08 19:56 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Dexuan Cui, lorenzo.pieralisi@arm.com, linux-pci@vger.kernel.org,
Michael Kelley, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org,
driverdev-devel@linuxdriverproject.org, Sasha Levin,
Haiyang Zhang, KY Srinivasan, olaf@aepfle.de, apw@canonical.com,
jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com,
Stephen Hemminger, jackm@mellanox.com
In-Reply-To: <a2d8ad9f-b59d-57e4-f014-645e7b796cc4@intel.com>
On Tue, Oct 08, 2019 at 07:32:27PM +0200, Rafael J. Wysocki wrote:
> On 10/7/2019 8:57 PM, Dexuan Cui wrote:
> > > -----Original Message-----
> > > From: Bjorn Helgaas <helgaas@kernel.org>
> > > Sent: Monday, October 7, 2019 6:24 AM
> > > To: Dexuan Cui <decui@microsoft.com>
> > > Cc: lorenzo.pieralisi@arm.com; linux-pci@vger.kernel.org; Michael Kelley
> > > <mikelley@microsoft.com>; linux-hyperv@vger.kernel.org;
> > > linux-kernel@vger.kernel.org; driverdev-devel@linuxdriverproject.org; Sasha
> > > Levin <Alexander.Levin@microsoft.com>; Haiyang Zhang
> > > <haiyangz@microsoft.com>; KY Srinivasan <kys@microsoft.com>;
> > > olaf@aepfle.de; apw@canonical.com; jasowang@redhat.com; vkuznets
> > > <vkuznets@redhat.com>; marcelo.cerri@canonical.com; Stephen Hemminger
> > > <sthemmin@microsoft.com>; jackm@mellanox.com
> > > Subject: Re: [PATCH v2] PCI: PM: Move to D0 before calling
> > > pci_legacy_resume_early()
> > >
> > > On Wed, Aug 14, 2019 at 01:06:55AM +0000, Dexuan Cui wrote:
> > > > In pci_legacy_suspend_late(), the device state is moved to PCI_UNKNOWN.
> > > >
> > > > In pci_pm_thaw_noirq(), the state is supposed to be moved back to PCI_D0,
> > > > but the current code misses the pci_legacy_resume_early() path, so the
> > > > state remains in PCI_UNKNOWN in that path. As a result, in the resume
> > > > phase of hibernation, this causes an error for the Mellanox VF driver,
> > > > which fails to enable MSI-X because pci_msi_supported() is false due
> > > > to dev->current_state != PCI_D0:
> > > >
> > > > mlx4_core a6d1:00:02.0: Detected virtual function - running in slave mode
> > > > mlx4_core a6d1:00:02.0: Sending reset
> > > > mlx4_core a6d1:00:02.0: Sending vhcr0
> > > > mlx4_core a6d1:00:02.0: HCA minimum page size:512
> > > > mlx4_core a6d1:00:02.0: Timestamping is not supported in slave mode
> > > > mlx4_core a6d1:00:02.0: INTx is not supported in multi-function mode,
> > > aborting
> > > > PM: dpm_run_callback(): pci_pm_thaw+0x0/0xd7 returns -95
> > > > PM: Device a6d1:00:02.0 failed to thaw: error -95
> > > >
> > > > To be more accurate, the "resume" phase means the "thaw" callbacks which
> > > > run before the system enters hibernation: when the user runs the command
> > > > "echo disk > /sys/power/state" for hibernation, first the kernel "freezes"
> > > > all the devices and creates a hibernation image, then the kernel "thaws"
> > > > the devices including the disk/NIC, writes the memory to the disk, and
> > > > powers down. This patch fixes the error message for the Mellanox VF driver
> > > > in this phase.
Wordsmithing nit: what the patch does is not "fix the error message";
what it does is fix the *problem*, i.e., the fact that we can't
operate the device because we can't enable MSI-X. The message is only
a symptom.
IIUC the relevant part of the system hibernation sequence is:
pci_pm_freeze_noirq
pci_pm_thaw_noirq
pci_pm_thaw
And the execution flow is:
pci_pm_freeze_noirq
if (pci_has_legacy_pm_support(pci_dev)) # true for mlx4
pci_legacy_suspend_late(dev, PMSG_FREEZE)
pci_pm_set_unknown_state
dev->current_state = PCI_UNKNOWN # <---
pci_pm_thaw_noirq
if (pci_has_legacy_pm_support(pci_dev)) # true
pci_legacy_resume_early(dev) # noop; mlx4 doesn't implement
pci_pm_thaw # returns -95 EOPNOTSUPP
if (pci_has_legacy_pm_support(pci_dev)) # true
pci_legacy_resume
drv->resume
mlx4_resume # mlx4_driver.resume (legacy)
mlx4_load_one
mlx4_enable_msi_x
pci_enable_msix_range
__pci_enable_msix_range
__pci_enable_msix
if (!pci_msi_supported())
if (dev->current_state != PCI_D0) # <---
return 0
return -EINVAL
err = -EOPNOTSUPP
"INTx is not supported ..."
(These are just my notes; you don't need to put them all into the
commit message. I'm just sharing them in case I'm not understanding
correctly.)
> > > > When the system starts again, a fresh kernel starts to run, and when the
> > > > kernel detects that a hibernation image was saved, the kernel "quiesces"
> > > > the devices, and then "restores" the devices from the saved image. In this
> > > > path:
> > > > device_resume_noirq() -> ... ->
> > > > pci_pm_restore_noirq() ->
> > > > pci_pm_default_resume_early() ->
> > > > pci_power_up() moves the device states back to PCI_D0. This path is
> > > > not broken and doesn't need my patch.
> > > >
The cc list suggests that this might be a fix for a user-reported
problem. Is there a launchpad or similar link you could include here?
Should this be marked for stable?
> > > > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > > This looks like a bugfix for 5839ee7389e8 ("PCI / PM: Force devices to
> > > D0 in pci_pm_thaw_noirq()") so maybe it should be marked for stable as
> > > 5839ee7389e8 was?
> > >
> > > Rafael, could you confirm?
>
> No, it is not a bug fix for that commit. The underlying issue would be
> there without that commit too.
Oh, right, I dunno what I was thinking, sorry.
> > > > --- a/drivers/pci/pci-driver.c
> > > > +++ b/drivers/pci/pci-driver.c
> > > > @@ -1074,15 +1074,16 @@ static int pci_pm_thaw_noirq(struct device
> > > *dev)
> > > > return error;
> > > > }
> > > >
> > > > - if (pci_has_legacy_pm_support(pci_dev))
> > > > - return pci_legacy_resume_early(dev);
> > > > -
> > > > /*
> > > > * pci_restore_state() requires the device to be in D0 (because of MSI
> > > > * restoration among other things), so force it into D0 in case the
> > > > * driver's "freeze" callbacks put it into a low-power state directly.
> > > > */
> > > > pci_set_power_state(pci_dev, PCI_D0);
> > > > +
> > > > + if (pci_has_legacy_pm_support(pci_dev))
> > > > + return pci_legacy_resume_early(dev);
> > > > +
> > > > pci_restore_state(pci_dev);
> > > >
> > > > if (drv && drv->pm && drv->pm->thaw_noirq)
> > > > --
> > > > 2.19.1
> > > >
> The patch looks reasonable to me, but the comment above the
> pci_set_power_state() call needs to be updated too IMO.
Hmm.
1) pci_restore_state() mainly writes config space, which doesn't
require the device to be in D0. The only thing I see that would
require D0 is the MSI-X MMIO space, so to be more specific, the
comment could say "restoring the MSI-X *MMIO* state requires the
device to be in D0".
But I think you meant some other comment change. Did you mean
something along the lines of "a legacy drv->resume_early() callback
and pci_restore_state() both require the device to be in D0"?
If something else, maybe you could propose some text?
2) I assume pci_pm_thaw_noirq() should leave the device in a
functionally equivalent state, whether it uses legacy PM or not. Do
we want something like the patch below instead? If we *do* want to
skip pci_restore_state() for legacy PM, maybe we should add a comment.
3) Documentation/power/pci.rst says:
... devices have to be brought back to the fully functional
state ...
pci_pm_thaw_noirq() ... doesn't put the device into the full power
state and doesn't attempt to restore its standard configuration
registers.
That doesn't seem consistent, and it looks like pci_pm_thaw_noirq()
actually *does* put the device in full power (D0) state and restore
config registers.
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index a8124e47bf6e..30c721fd6bcf 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -1068,7 +1068,7 @@ static int pci_pm_thaw_noirq(struct device *dev)
{
struct pci_dev *pci_dev = to_pci_dev(dev);
struct device_driver *drv = dev->driver;
- int error = 0;
+ int error;
if (pcibios_pm_ops.thaw_noirq) {
error = pcibios_pm_ops.thaw_noirq(dev);
@@ -1076,9 +1076,6 @@ static int pci_pm_thaw_noirq(struct device *dev)
return error;
}
- if (pci_has_legacy_pm_support(pci_dev))
- return pci_legacy_resume_early(dev);
-
/*
* pci_restore_state() requires the device to be in D0 (because of MSI
* restoration among other things), so force it into D0 in case the
@@ -1087,10 +1084,13 @@ static int pci_pm_thaw_noirq(struct device *dev)
pci_set_power_state(pci_dev, PCI_D0);
pci_restore_state(pci_dev);
+ if (pci_has_legacy_pm_support(pci_dev))
+ return pci_legacy_resume_early(dev);
+
if (drv && drv->pm && drv->pm->thaw_noirq)
- error = drv->pm->thaw_noirq(dev);
+ return drv->pm->thaw_noirq(dev);
- return error;
+ return 0;
}
static int pci_pm_thaw(struct device *dev)
^ permalink raw reply related
* RE: [PATCH v2] x86/hyperv: make vapic support x2apic mode
From: Michael Kelley @ 2019-10-08 20:54 UTC (permalink / raw)
To: Roman Kagan
Cc: vkuznets, kvm@vger.kernel.org, Tianyu Lan, Joerg Roedel,
KY Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
x86@kernel.org, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <DM5PR21MB0137FCE28A16166207E08C7CD79E0@DM5PR21MB0137.namprd21.prod.outlook.com>
From: Michael Kelley <mikelley@microsoft.com> Sent: Friday, October 4, 2019 3:33 PM
>
> From: Roman Kagan <rkagan@virtuozzo.com> Sent: Friday, October 4, 2019 2:19 AM
> >
> > On Fri, Oct 04, 2019 at 03:01:51AM +0000, Michael Kelley wrote:
> > > From: Roman Kagan <rkagan@virtuozzo.com> Sent: Thursday, October 3, 2019 5:53 AM
> > > > >
> > > > > AFAIU you're trying to mirror native_x2apic_icr_write() here but this is
> > > > > different from what hv_apic_icr_write() does
> > > > > (SET_APIC_DEST_FIELD(id)).
> > > >
> > > > Right. In xapic mode the ICR2 aka the high 4 bytes of ICR is programmed
> > > > with the destination id in the highest byte; in x2apic mode the whole
> > > > ICR2 is set to the 32bit destination id.
> > > >
> > > > > Is it actually correct? (I think you've tested this and it is but)
> > > >
> > > > As I wrote in the commit log, I haven't tested it in the sense that I
> > > > ran a Linux guest in a Hyper-V VM exposing x2apic to the guest, because
> > > > I didn't manage to configure it to do so. OTOH I did run a Windows
> > > > guest in QEMU/KVM with hv_apic and x2apic enabled and saw it write
> > > > destination ids unshifted to the ICR2 part of ICR, so I assume it's
> > > > correct.
> > > >
> > > > > Michael, could you please shed some light here?
> > > >
> > > > Would be appreciated, indeed.
> > > >
> > >
> > > The newest version of Hyper-V provides an x2apic in a guest VM when the
> > > number of vCPUs in the VM is > 240. This version of Hyper-V is beginning
> > > to be deployed in Azure to enable the M416v2 VM size, but the functionality
> > > is not yet available for the on-premises version of Hyper-V. However, I can
> > > test this configuration internally with the above patch -- give me a few days.
> > >
> > > An additional complication is that when running on Intel processors that offer
> > > vAPIC functionality, the Hyper-V "hints" value does *not* recommend using the
> > > MSR-based APIC accesses. In this case, memory-mapped access to the x2apic
> > > registers is faster than the synthetic MSRs.
> >
> > I guess you mean "using regular x2apic MSRs compared to the synthetic
> > MSRs".
>
> Yes, of course you are correct.
>
> > Indeed they do essentially the same thing, and there's no reason
> > for one set of MSRs to be significantly faster than the other. However,
> > hv_apic_eoi_write makes use of "apic assists" aka lazy EOI which is
> > certainly a win, and I'm not sure if it works without hv_apic.
> >
>
> I've checked with the Hyper-V people and the presence of vAPIC makes
> a difference. If vAPIC is present in the hardware:
> 1) Hyper-V does not set the HV_X64_APIC_ACCESS_RECOMMENDED flag
> 2) The architectural MSRs should be used instead of the Hyper-V
> synthetic MSRs, as they are significantly faster. The architectural
> MSRs do not cause a VMEXIT because they are handled entirely by
> the vAPIC microcode in the CPU. The synthetic MSRs do cause a VMEXIT.
> 3) The lazy EOI functionality should not be used
>
> If vAPIC is not present in the hardware:
> 1) Hyper-V will set HV_X64_APIC_ACCESS_RECOMMENDED
> 2) Either set of MSRs has about the same performance, but we
> should use the synthetic MSRs.
> 3) The lazy EOI functionality has some value and should be used
>
> The same will apply to the AMD AVIC in some Hyper-V updates that
> are coming soon.
>
> So I think your code makes sense given the above information. By
> Monday I'll try to test it on a Hyper-V guest VM with x2APIC.
>
I've smoke tested your code with a Hyper-V guest VM with x2APIC
and 1024 vCPUs and HV_X64_APIC_ACCESS_RECOMMENDED
enabled. The new x2apic functions you have added appear to work.
No issues were seen.
However, based on further discussion with the Hyper-V team, the
architectural MSRs and the synthetic MSRs really are interchangeable
with an x2apic. There's no perf difference like there is with the
memory-mapped registers in the classic APIC. So your new x2apic
functions aren't really needed -- all that's needed is to skip plugging
in the hv_apic functions when an x2apic is present. The native x2apic
functions will work just fine. Note that even with x2apic,
hv_apic_eoi_write() should still be used to take advantage of the
lazy EOI functionality. It's OK to use the synthetic EOI MSR with
x2apic for this case, so again no additional code is needed.
I quickly changed the code to do the above so that the architectural
MSRs are used, and those changes successfully smoke test on the
same 1024 vCPU VM with no problems. I tested with vAPIC enabled
and with vAPIC disabled, and all looks good.
Michael
^ permalink raw reply
* RE: [PATCH 1/2] Drivers: hv: vmbus: Introduce table of VMBus protocol versions
From: Dexuan Cui @ 2019-10-08 22:41 UTC (permalink / raw)
To: vkuznets, Andrea Parri
Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
KY Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
Michael Kelley
In-Reply-To: <87zhibz91y.fsf@vitty.brq.redhat.com>
> From: Vitaly Kuznetsov <vkuznets@redhat.com>
> Sent: Tuesday, October 8, 2019 6:00 AM
> ...
> > Looking at the uses of VERSION_INVAL, I find one remaining occurrence
> > of this macro in vmbus_bus_resume(), which does:
> >
> > if (vmbus_proto_version == VERSION_INVAL ||
> > vmbus_proto_version == 0) {
> > ...
> > }
> >
> > TBH I'm looking at vmbus_bus_resume() and vmbus_bus_suspend() for the
> > first time and I'm not sure about how to change such check yet... any
> > suggestions?
>
> Hm, I don't think vmbus_proto_version can ever become == VERSION_INVAL
> if we rewrite the code the way you suggest, right? So you'll reduce this
> to 'if (!vmbus_proto_version)' meaning we haven't negotiated any version
> (yet).
Yeah, Vitaly is correct. The check may be a little paranoid as I believe
"vmbus_proto_version" must be a negotiated value in vmbus_bus_resume()
and vmbus_bus_suspend(). I added the check just in case.
> > Mmh, I see that vmbus_bus_resume() and vmbus_bus_suspend() can access
> > vmbus_connection.conn_state: can such accesses race with a concurrent
> > vmbus_connect()?
>
> Let's summon Dexuan who's the author! :-)
There should not be an issue:
vmbus_connect() is called in the early subsys_initcall(hv_acpi_init).
vmbus_bus_suspend() is called late in the PM code after the kernel boots up, e.g.
in the hibernation function hibernation_snapshot() -> dpm_suspend().
vmbus_bus_resume() is also called later in late_initcall_sync(software_resume).
In the hibernatin process, vmbus_bus_suspend()/resume() can also be called a
few times, and vmbus_bus_resume() calls vmbus_negotiate_version(). As I
checked, there is no issue, either.
Thanks,
Dexuan
^ permalink raw reply
* RE: [PATCH v2] PCI: PM: Move to D0 before calling pci_legacy_resume_early()
From: Dexuan Cui @ 2019-10-09 0:16 UTC (permalink / raw)
To: Bjorn Helgaas, Rafael J. Wysocki
Cc: lorenzo.pieralisi@arm.com, linux-pci@vger.kernel.org,
Michael Kelley, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org,
driverdev-devel@linuxdriverproject.org, Sasha Levin,
Haiyang Zhang, KY Srinivasan, olaf@aepfle.de, apw@canonical.com,
jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com,
Stephen Hemminger, jackm@mellanox.com
In-Reply-To: <20191008195624.GA198287@google.com>
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Tuesday, October 8, 2019 12:56 PM
> ...
> Wordsmithing nit: what the patch does is not "fix the error message";
> what it does is fix the *problem*, i.e., the fact that we can't
> operate the device because we can't enable MSI-X. The message is only
> a symptom.
I totally agree. :-)
> IIUC the relevant part of the system hibernation sequence is:
>
> pci_pm_freeze_noirq
> pci_pm_thaw_noirq
> pci_pm_thaw
>
> And the execution flow is:
>
> pci_pm_freeze_noirq
> if (pci_has_legacy_pm_support(pci_dev)) # true for mlx4
> pci_legacy_suspend_late(dev, PMSG_FREEZE)
> pci_pm_set_unknown_state
> dev->current_state = PCI_UNKNOWN # <---
> pci_pm_thaw_noirq
> if (pci_has_legacy_pm_support(pci_dev)) # true
> pci_legacy_resume_early(dev) # noop; mlx4 doesn't
> implement
> pci_pm_thaw # returns -95
> EOPNOTSUPP
> if (pci_has_legacy_pm_support(pci_dev)) # true
> pci_legacy_resume
> drv->resume
> mlx4_resume # mlx4_driver.resume (legacy)
> mlx4_load_one
> mlx4_enable_msi_x
> pci_enable_msix_range
> __pci_enable_msix_range
> __pci_enable_msix
> if (!pci_msi_supported())
> if (dev->current_state != PCI_D0) # <---
> return 0
> return -EINVAL
> err = -EOPNOTSUPP
> "INTx is not supported ..."
>
> (These are just my notes; you don't need to put them all into the
> commit message. I'm just sharing them in case I'm not understanding
> correctly.)
Yes, these notes are accurate.
> > > > > When the system starts again, a fresh kernel starts to run, and when the
> > > > > kernel detects that a hibernation image was saved, the kernel
> "quiesces"
> > > > > the devices, and then "restores" the devices from the saved image. In
> this
> > > > > path:
> > > > > device_resume_noirq() -> ... ->
> > > > > pci_pm_restore_noirq() ->
> > > > > pci_pm_default_resume_early() ->
> > > > > pci_power_up() moves the device states back to PCI_D0. This
> path is
> > > > > not broken and doesn't need my patch.
> > > > >
>
> The cc list suggests that this might be a fix for a user-reported
> problem. Is there a launchpad or similar link you could include here?
I guess I'm the first one to notice the issue and there is not any bug link AFAIK.
The hibernation process usually saves the states into a local disk (before the
system is powered off), and the Mellanox NIC is not needed during the process,
so it's not a real issue that the NIC can not work between pci_pm_thaw() and
power_down(). This may explain why nobody else noticed the issue. I happened
to see the error message, and hence investigated the issue.
> Should this be marked for stable?
I think we should do it.
> > > > > --- a/drivers/pci/pci-driver.c
> > > > > +++ b/drivers/pci/pci-driver.c
> > > > > @@ -1074,15 +1074,16 @@ static int pci_pm_thaw_noirq(struct device
> > > > *dev)
> > > > > return error;
> > > > > }
> > > > >
> > > > > - if (pci_has_legacy_pm_support(pci_dev))
> > > > > - return pci_legacy_resume_early(dev);
> > > > > -
> > > > > /*
> > > > > * pci_restore_state() requires the device to be in D0 (because
> of MSI
> > > > > * restoration among other things), so force it into D0 in case
> the
> > > > > * driver's "freeze" callbacks put it into a low-power state
> directly.
> > > > > */
> > > > > pci_set_power_state(pci_dev, PCI_D0);
> > > > > +
> > > > > + if (pci_has_legacy_pm_support(pci_dev))
> > > > > + return pci_legacy_resume_early(dev);
> > > > > +
> > > > > pci_restore_state(pci_dev);
> > > > >
> > > > > if (drv && drv->pm && drv->pm->thaw_noirq)
> > > > > --
> > > > > 2.19.1
> > > > >
> > The patch looks reasonable to me, but the comment above the
> > pci_set_power_state() call needs to be updated too IMO.
>
> Hmm.
>
> 1) pci_restore_state() mainly writes config space, which doesn't
> require the device to be in D0. The only thing I see that would
> require D0 is the MSI-X MMIO space, so to be more specific, the
> comment could say "restoring the MSI-X *MMIO* state requires the
> device to be in D0".
>
> But I think you meant some other comment change. Did you mean
> something along the lines of "a legacy drv->resume_early() callback
> and pci_restore_state() both require the device to be in D0"?
>
> If something else, maybe you could propose some text?
>
> 2) I assume pci_pm_thaw_noirq() should leave the device in a
> functionally equivalent state, whether it uses legacy PM or not. Do
> we want something like the patch below instead? If we *do* want to
> skip pci_restore_state() for legacy PM, maybe we should add a comment.
>
> 3) Documentation/power/pci.rst says:
>
> ... devices have to be brought back to the fully functional
> state ...
>
> pci_pm_thaw_noirq() ... doesn't put the device into the full power
> state and doesn't attempt to restore its standard configuration
> registers.
>
> That doesn't seem consistent, and it looks like pci_pm_thaw_noirq()
> actually *does* put the device in full power (D0) state and restore
> config registers.
I would leave these questions to Rafael.
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index a8124e47bf6e..30c721fd6bcf 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -1068,7 +1068,7 @@ static int pci_pm_thaw_noirq(struct device *dev)
> {
> struct pci_dev *pci_dev = to_pci_dev(dev);
> struct device_driver *drv = dev->driver;
> - int error = 0;
> + int error;
>
> if (pcibios_pm_ops.thaw_noirq) {
> error = pcibios_pm_ops.thaw_noirq(dev);
> @@ -1076,9 +1076,6 @@ static int pci_pm_thaw_noirq(struct device *dev)
> return error;
> }
>
> - if (pci_has_legacy_pm_support(pci_dev))
> - return pci_legacy_resume_early(dev);
> -
> /*
> * pci_restore_state() requires the device to be in D0 (because of MSI
> * restoration among other things), so force it into D0 in case the
> @@ -1087,10 +1084,13 @@ static int pci_pm_thaw_noirq(struct device *dev)
> pci_set_power_state(pci_dev, PCI_D0);
> pci_restore_state(pci_dev);
>
> + if (pci_has_legacy_pm_support(pci_dev))
> + return pci_legacy_resume_early(dev);
> +
> if (drv && drv->pm && drv->pm->thaw_noirq)
> - error = drv->pm->thaw_noirq(dev);
> + return drv->pm->thaw_noirq(dev);
>
> - return error;
> + return 0;
> }
>
> static int pci_pm_thaw(struct device *dev)
The only real difference from my patch is that you moved
+ if (pci_has_legacy_pm_support(pci_dev))
+ return pci_legacy_resume_early(dev);
to after the line "pci_restore_state(pci_dev);"
This change is good to me, and shoud also resolve the error message I saw.
Thanks,
-- Dexuan
^ permalink raw reply
* Re: [RFC PATCH 11/13] vsock: add 'transport_hg' to handle g2h\h2g transports
From: Stefano Garzarella @ 2019-10-09 9:44 UTC (permalink / raw)
To: netdev, Stefan Hajnoczi
Cc: linux-hyperv, K. Y. Srinivasan, Sasha Levin, linux-kernel, kvm,
David S. Miller, virtualization, Stephen Hemminger, Jason Wang,
Michael S. Tsirkin, Haiyang Zhang, Dexuan Cui, Jorgen Hansen
In-Reply-To: <20190927112703.17745-12-sgarzare@redhat.com>
On Fri, Sep 27, 2019 at 01:27:01PM +0200, Stefano Garzarella wrote:
> VMCI transport provides both g2h and h2g behaviors in a single
> transport.
> We are able to set (or not) the g2h behavior, detecting if we
> are in a VMware guest (or not), but the h2g feature is always set.
> This prevents to load other h2g transports while we are in a
> VMware guest.
>
> This patch adds a new 'transport_hg' to handle this case, reducing
> the priority of transports that provide both g2h and h2g
> behaviors. A transport that has g2h and h2g features, can be
> bypassed by a transport that has only the h2g feature.
>
Since I'm enabling the VSOCK_TRANSPORT_F_G2H in the vmci_transport only
when we run in a VMware guest, this patch doesn't work well if a KVM (or
HyperV) guest application create an AF_VSOCK socket and no transports are
loaded, because in this case the vmci_transport is loaded
(MODULE_ALIAS_NETPROTO(PF_VSOCK)) and it is registered as transport_h2g.
At this point, if we want to run a nested VM using vhost_transport, we
can't load it.
So, I can leave VSOCK_TRANSPORT_F_G2H always set in the vmci_transport
and this should fix this issue.
Or maybe I need to change how the registering works, e.g. handling a list
of transport registered, setting priority or using the last registered
transport.
Any suggestion?
Thanks,
Stefano
^ permalink raw reply
* Re: [PATCH 1/2] Drivers: hv: vmbus: Introduce table of VMBus protocol versions
From: Andrea Parri @ 2019-10-09 9:54 UTC (permalink / raw)
To: Dexuan Cui
Cc: vkuznets, linux-kernel@vger.kernel.org,
linux-hyperv@vger.kernel.org, KY Srinivasan, Haiyang Zhang,
Stephen Hemminger, Sasha Levin, Michael Kelley
In-Reply-To: <PU1P153MB0169B3B15DB8D220F8E1A728BF9A0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>
On Tue, Oct 08, 2019 at 10:41:42PM +0000, Dexuan Cui wrote:
> > From: Vitaly Kuznetsov <vkuznets@redhat.com>
> > Sent: Tuesday, October 8, 2019 6:00 AM
> > ...
> > > Looking at the uses of VERSION_INVAL, I find one remaining occurrence
> > > of this macro in vmbus_bus_resume(), which does:
> > >
> > > if (vmbus_proto_version == VERSION_INVAL ||
> > > vmbus_proto_version == 0) {
> > > ...
> > > }
> > >
> > > TBH I'm looking at vmbus_bus_resume() and vmbus_bus_suspend() for the
> > > first time and I'm not sure about how to change such check yet... any
> > > suggestions?
> >
> > Hm, I don't think vmbus_proto_version can ever become == VERSION_INVAL
> > if we rewrite the code the way you suggest, right? So you'll reduce this
> > to 'if (!vmbus_proto_version)' meaning we haven't negotiated any version
> > (yet).
>
> Yeah, Vitaly is correct. The check may be a little paranoid as I believe
> "vmbus_proto_version" must be a negotiated value in vmbus_bus_resume()
> and vmbus_bus_suspend(). I added the check just in case.
>
> > > Mmh, I see that vmbus_bus_resume() and vmbus_bus_suspend() can access
> > > vmbus_connection.conn_state: can such accesses race with a concurrent
> > > vmbus_connect()?
> >
> > Let's summon Dexuan who's the author! :-)
>
> There should not be an issue:
>
> vmbus_connect() is called in the early subsys_initcall(hv_acpi_init).
>
> vmbus_bus_suspend() is called late in the PM code after the kernel boots up, e.g.
> in the hibernation function hibernation_snapshot() -> dpm_suspend().
>
> vmbus_bus_resume() is also called later in late_initcall_sync(software_resume).
>
> In the hibernatin process, vmbus_bus_suspend()/resume() can also be called a
> few times, and vmbus_bus_resume() calls vmbus_negotiate_version(). As I
> checked, there is no issue, either.
Thank you both for these remarks.
So, I'll proceed with the removal of VERSION_INVAL in v2 of this series.
Thanks,
Andrea
^ permalink raw reply
* Re: [RFC PATCH 01/13] vsock/vmci: remove unused VSOCK_DEFAULT_CONNECT_TIMEOUT
From: Stefan Hajnoczi @ 2019-10-09 11:41 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, Sasha Levin, linux-hyperv, Stephen Hemminger, kvm,
Michael S. Tsirkin, Dexuan Cui, Haiyang Zhang, linux-kernel,
virtualization, Stefan Hajnoczi, David S. Miller, Jorgen Hansen
In-Reply-To: <20190927112703.17745-2-sgarzare@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 549 bytes --]
On Fri, Sep 27, 2019 at 01:26:51PM +0200, Stefano Garzarella wrote:
> The VSOCK_DEFAULT_CONNECT_TIMEOUT definition was introduced with
> commit d021c344051af ("VSOCK: Introduce VM Sockets"), but it is
> never used in the net/vmw_vsock/vmci_transport.c.
>
> VSOCK_DEFAULT_CONNECT_TIMEOUT is used and defined in
> net/vmw_vsock/af_vsock.c
>
> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> ---
> net/vmw_vsock/vmci_transport.c | 5 -----
> 1 file changed, 5 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 02/13] vsock: remove vm_sockets_get_local_cid()
From: Stefan Hajnoczi @ 2019-10-09 11:42 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, Sasha Levin, linux-hyperv, Stephen Hemminger, kvm,
Michael S. Tsirkin, Dexuan Cui, Haiyang Zhang, linux-kernel,
virtualization, Stefan Hajnoczi, David S. Miller, Jorgen Hansen
In-Reply-To: <20190927112703.17745-3-sgarzare@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 614 bytes --]
On Fri, Sep 27, 2019 at 01:26:52PM +0200, Stefano Garzarella wrote:
> vm_sockets_get_local_cid() is only used in virtio_transport_common.c.
> We can replace it calling the virtio_transport_get_ops() and
> using the get_local_cid() callback registered by the transport.
>
> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> ---
> include/linux/vm_sockets.h | 2 --
> net/vmw_vsock/af_vsock.c | 10 ----------
> net/vmw_vsock/virtio_transport_common.c | 2 +-
> 3 files changed, 1 insertion(+), 13 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 03/13] vsock: remove include/linux/vm_sockets.h file
From: Stefan Hajnoczi @ 2019-10-09 11:43 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, Sasha Levin, linux-hyperv, Stephen Hemminger, kvm,
Michael S. Tsirkin, Dexuan Cui, Haiyang Zhang, linux-kernel,
virtualization, Stefan Hajnoczi, David S. Miller, Jorgen Hansen
In-Reply-To: <20190927112703.17745-4-sgarzare@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 619 bytes --]
On Fri, Sep 27, 2019 at 01:26:53PM +0200, Stefano Garzarella wrote:
> This header file now only includes the "uapi/linux/vm_sockets.h".
> We can include directly it when needed.
>
> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> ---
> include/linux/vm_sockets.h | 13 -------------
> include/net/af_vsock.h | 2 +-
> include/net/vsock_addr.h | 2 +-
> net/vmw_vsock/vmci_transport_notify.h | 1 -
> 4 files changed, 2 insertions(+), 16 deletions(-)
> delete mode 100644 include/linux/vm_sockets.h
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 04/13] vsock: add 'transport' member in the struct vsock_sock
From: Stefan Hajnoczi @ 2019-10-09 11:46 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, Sasha Levin, linux-hyperv, Stephen Hemminger, kvm,
Michael S. Tsirkin, Dexuan Cui, Haiyang Zhang, linux-kernel,
virtualization, Stefan Hajnoczi, David S. Miller, Jorgen Hansen
In-Reply-To: <20190927112703.17745-5-sgarzare@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 741 bytes --]
On Fri, Sep 27, 2019 at 01:26:54PM +0200, Stefano Garzarella wrote:
> As a preparation to support multiple transports, this patch adds
> the 'transport' member at the 'struct vsock_sock'.
> This new field is initialized during the creation in the
> __vsock_create() function.
>
> This patch also renames the global 'transport' pointer to
> 'transport_single', since for now we're only supporting a single
> transport registered at run-time.
>
> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> ---
> include/net/af_vsock.h | 1 +
> net/vmw_vsock/af_vsock.c | 56 +++++++++++++++++++++++++++-------------
> 2 files changed, 39 insertions(+), 18 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 05/13] vsock/virtio: add transport parameter to the virtio_transport_reset_no_sock()
From: Stefan Hajnoczi @ 2019-10-09 11:52 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, Sasha Levin, linux-hyperv, Stephen Hemminger, kvm,
Michael S. Tsirkin, Dexuan Cui, Haiyang Zhang, linux-kernel,
virtualization, Stefan Hajnoczi, David S. Miller, Jorgen Hansen
In-Reply-To: <20190927112703.17745-6-sgarzare@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 1141 bytes --]
On Fri, Sep 27, 2019 at 01:26:55PM +0200, Stefano Garzarella wrote:
> We are going to add 'struct vsock_sock *' parameter to
> virtio_transport_get_ops().
>
> In some cases, like in the virtio_transport_reset_no_sock(),
> we don't have any socket assigned to the packet received,
> so we can't use the virtio_transport_get_ops().
>
> In order to allow virtio_transport_reset_no_sock() to use the
> '.send_pkt' callback from the 'vhost_transport' or 'virtio_transport',
> we add the 'struct virtio_transport *' to it and to its caller:
> virtio_transport_recv_pkt().
>
> We moved the 'vhost_transport' and 'virtio_transport' definition,
> to pass their address to the virtio_transport_recv_pkt().
>
> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> ---
> drivers/vhost/vsock.c | 94 +++++++-------
> include/linux/virtio_vsock.h | 3 +-
> net/vmw_vsock/virtio_transport.c | 160 ++++++++++++------------
> net/vmw_vsock/virtio_transport_common.c | 12 +-
> 4 files changed, 135 insertions(+), 134 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 06/13] vsock: add 'struct vsock_sock *' param to vsock_core_get_transport()
From: Stefan Hajnoczi @ 2019-10-09 11:54 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, Sasha Levin, linux-hyperv, Stephen Hemminger, kvm,
Michael S. Tsirkin, Dexuan Cui, Haiyang Zhang, linux-kernel,
virtualization, Stefan Hajnoczi, David S. Miller, Jorgen Hansen
In-Reply-To: <20190927112703.17745-7-sgarzare@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 554 bytes --]
On Fri, Sep 27, 2019 at 01:26:56PM +0200, Stefano Garzarella wrote:
> -const struct vsock_transport *vsock_core_get_transport(void)
> +const struct vsock_transport *vsock_core_get_transport(struct vsock_sock *vsk)
> {
> /* vsock_register_mutex not taken since only the transport uses this
> * function and only while registered.
> */
> - return transport_single;
This comment is about protecting transport_single. It no longer applies
when using vsk->transport. Please drop it.
Otherwise:
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 07/13] vsock: handle buffer_size sockopts in the core
From: Stefan Hajnoczi @ 2019-10-09 12:30 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, Sasha Levin, linux-hyperv, Stephen Hemminger, kvm,
Michael S. Tsirkin, Dexuan Cui, Haiyang Zhang, linux-kernel,
virtualization, Stefan Hajnoczi, David S. Miller, Jorgen Hansen
In-Reply-To: <20190927112703.17745-8-sgarzare@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 2332 bytes --]
On Fri, Sep 27, 2019 at 01:26:57PM +0200, Stefano Garzarella wrote:
> @@ -140,18 +145,11 @@ struct vsock_transport {
> struct vsock_transport_send_notify_data *);
> int (*notify_send_post_enqueue)(struct vsock_sock *, ssize_t,
> struct vsock_transport_send_notify_data *);
> + int (*notify_buffer_size)(struct vsock_sock *, u64 *);
Is ->notify_buffer_size() called under lock_sock(sk)? If yes, please
document it.
> +static void vsock_update_buffer_size(struct vsock_sock *vsk,
> + const struct vsock_transport *transport,
> + u64 val)
> +{
> + if (val > vsk->buffer_max_size)
> + val = vsk->buffer_max_size;
> +
> + if (val < vsk->buffer_min_size)
> + val = vsk->buffer_min_size;
> +
> + if (val != vsk->buffer_size &&
> + transport && transport->notify_buffer_size)
> + transport->notify_buffer_size(vsk, &val);
Why does this function return an int if we don't check the return value?
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index fc046c071178..bac9e7430a2e 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -403,17 +403,13 @@ int virtio_transport_do_socket_init(struct vsock_sock *vsk,
> if (psk) {
> struct virtio_vsock_sock *ptrans = psk->trans;
>
> - vvs->buf_size = ptrans->buf_size;
> - vvs->buf_size_min = ptrans->buf_size_min;
> - vvs->buf_size_max = ptrans->buf_size_max;
> vvs->peer_buf_alloc = ptrans->peer_buf_alloc;
> - } else {
> - vvs->buf_size = VIRTIO_VSOCK_DEFAULT_BUF_SIZE;
> - vvs->buf_size_min = VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE;
> - vvs->buf_size_max = VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE;
> }
>
> - vvs->buf_alloc = vvs->buf_size;
> + if (vsk->buffer_size > VIRTIO_VSOCK_MAX_BUF_SIZE)
> + vsk->buffer_size = VIRTIO_VSOCK_MAX_BUF_SIZE;
Hmm...this could be outside the [min, max] range. I'm not sure how much
it matters.
Another issue is that this patch drops the VIRTIO_VSOCK_MAX_BUF_SIZE
limit that used to be enforced by virtio_transport_set_buffer_size().
Now the limit is only applied at socket init time. If the buffer size
is changed later then VIRTIO_VSOCK_MAX_BUF_SIZE can be exceeded. If
that doesn't matter, why even bother with VIRTIO_VSOCK_MAX_BUF_SIZE
here?
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 08/13] vsock: move vsock_insert_unbound() in the vsock_create()
From: Stefan Hajnoczi @ 2019-10-09 12:34 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, Sasha Levin, linux-hyperv, Stephen Hemminger, kvm,
Michael S. Tsirkin, Dexuan Cui, Haiyang Zhang, linux-kernel,
virtualization, Stefan Hajnoczi, David S. Miller, Jorgen Hansen
In-Reply-To: <20190927112703.17745-9-sgarzare@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 960 bytes --]
On Fri, Sep 27, 2019 at 01:26:58PM +0200, Stefano Garzarella wrote:
> vsock_insert_unbound() was called only when 'sock' parameter of
> __vsock_create() was not null. This only happened when
> __vsock_create() was called by vsock_create().
>
> In order to simplify the multi-transports support, this patch
> moves vsock_insert_unbound() at the end of vsock_create().
>
> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> ---
> net/vmw_vsock/af_vsock.c | 13 +++++++++----
> 1 file changed, 9 insertions(+), 4 deletions(-)
Maybe transports shouldn't call __vsock_create() directly. They always
pass NULL as the parent socket, so we could have a more specific
function that transports call without a parent sock argument. This
would eliminate any concern over moving vsock_insert_unbound() out of
this function. In any case, I've checked the code and this patch is
correct.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 09/13] hv_sock: set VMADDR_CID_HOST in the hvs_remote_addr_init()
From: Stefan Hajnoczi @ 2019-10-09 12:35 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, Sasha Levin, linux-hyperv, Stephen Hemminger, kvm,
Michael S. Tsirkin, Dexuan Cui, Haiyang Zhang, linux-kernel,
virtualization, Stefan Hajnoczi, David S. Miller, Jorgen Hansen
In-Reply-To: <20190927112703.17745-10-sgarzare@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 394 bytes --]
On Fri, Sep 27, 2019 at 01:26:59PM +0200, Stefano Garzarella wrote:
> Remote peer is always the host, so we set VMADDR_CID_HOST as
> remote CID instead of VMADDR_CID_ANY.
>
> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> ---
> net/vmw_vsock/hyperv_transport.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 10/13] vsock: add multi-transports support
From: Stefan Hajnoczi @ 2019-10-09 13:11 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, Sasha Levin, linux-hyperv, Stephen Hemminger, kvm,
Michael S. Tsirkin, Dexuan Cui, Haiyang Zhang, linux-kernel,
virtualization, Stefan Hajnoczi, David S. Miller, Jorgen Hansen
In-Reply-To: <20190927112703.17745-11-sgarzare@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 2463 bytes --]
On Fri, Sep 27, 2019 at 01:27:00PM +0200, Stefano Garzarella wrote:
> RFC:
> - I'd like to move MODULE_ALIAS_NETPROTO(PF_VSOCK) to af_vsock.c.
> @Jorgen could this break the VMware products?
What will cause the vmw_vsock_vmci_transport.ko module to be loaded
after you remove MODULE_ALIAS_NETPROTO(PF_VSOCK)? Perhaps
drivers/misc/vmw_vmci/vmci_guest.c:vmci_guest_probe_device() could do
something when the guest driver loads. There would need to be something
equivalent for the host side too.
This will solve another issue too. Today the VMCI transport can be
loaded if an application creates an AF_VSOCK socket during early boot
before the virtio transport has been probed. This happens because the
VMCI transport uses MODULE_ALIAS_NETPROTO(PF_VSOCK) *and* it does not
probe whether this system is actually a VMware guest.
If we instead load the core af_vsock.ko module and transports are only
loaded based on hardware feature probing (e.g. the presence of VMware
guest mode, a virtio PCI adapter, etc) then transports will be
well-behaved.
> - DGRAM sockets are handled as before, I don't know if make sense work
> on it now, or when another transport will support DGRAM. The big
> issues here is that we cannot link 1-1 a socket to transport as
> for stream sockets since DGRAM is not connection-oriented.
Let's ignore DGRAM for now since only VMCI supports it and we therefore
do not require multi-transport support.
> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> index 86f8f463e01a..2a081d19e20d 100644
> --- a/include/net/af_vsock.h
> +++ b/include/net/af_vsock.h
> @@ -94,7 +94,13 @@ struct vsock_transport_send_notify_data {
> u64 data2; /* Transport-defined. */
> };
>
> +#define VSOCK_TRANSPORT_F_H2G 0x00000001
> +#define VSOCK_TRANSPORT_F_G2H 0x00000002
> +#define VSOCK_TRANSPORT_F_DGRAM 0x00000004
Documentation comments, please.
> +void vsock_core_unregister(const struct vsock_transport *t)
> +{
> + mutex_lock(&vsock_register_mutex);
> +
> + /* RFC-TODO: maybe we should check if there are open sockets
> + * assigned to that transport and avoid the unregistration
> + */
If unregister() is only called from module_exit() functions then holding
a reference to the transport module would be enough to prevent this
case. The transport could only be removed once all sockets have been
destroyed (and dropped their transport module reference).
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 11/13] vsock: add 'transport_hg' to handle g2h\h2g transports
From: Stefan Hajnoczi @ 2019-10-09 13:16 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, Sasha Levin, linux-hyperv, Stephen Hemminger, kvm,
Michael S. Tsirkin, Dexuan Cui, Haiyang Zhang, linux-kernel,
virtualization, Stefan Hajnoczi, David S. Miller, Jorgen Hansen
In-Reply-To: <20190927112703.17745-12-sgarzare@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 648 bytes --]
On Fri, Sep 27, 2019 at 01:27:01PM +0200, Stefano Garzarella wrote:
> VMCI transport provides both g2h and h2g behaviors in a single
> transport.
> We are able to set (or not) the g2h behavior, detecting if we
> are in a VMware guest (or not), but the h2g feature is always set.
> This prevents to load other h2g transports while we are in a
> VMware guest.
In the vhost_vsock.ko case we only register the h2g transport when
userspace has loaded the module (by opening /dev/vhost-vsock).
VMCI has something kind of similar: /dev/vmci and the
vmci_host_active_users counter. Maybe we can use this instead of
introducing the transport_hg concept?
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 12/13] vsock: prevent transport modules unloading
From: Stefan Hajnoczi @ 2019-10-09 13:28 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, Sasha Levin, linux-hyperv, Stephen Hemminger, kvm,
Michael S. Tsirkin, Dexuan Cui, Haiyang Zhang, linux-kernel,
virtualization, Stefan Hajnoczi, David S. Miller, Jorgen Hansen
In-Reply-To: <20190927112703.17745-13-sgarzare@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 582 bytes --]
On Fri, Sep 27, 2019 at 01:27:02PM +0200, Stefano Garzarella wrote:
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index c5f46b8242ce..750b62711b01 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -416,13 +416,28 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
> return -ESOCKTNOSUPPORT;
> }
>
> - if (!vsk->transport)
> + /* We increase the module refcnt to prevent the tranport unloading
s/tranport/transport/
Otherwise:
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 13/13] vsock: fix bind() behaviour taking care of CID
From: Stefan Hajnoczi @ 2019-10-09 13:29 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, Sasha Levin, linux-hyperv, Stephen Hemminger, kvm,
Michael S. Tsirkin, Dexuan Cui, Haiyang Zhang, linux-kernel,
virtualization, Stefan Hajnoczi, David S. Miller, Jorgen Hansen
In-Reply-To: <20190927112703.17745-14-sgarzare@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 630 bytes --]
On Fri, Sep 27, 2019 at 01:27:03PM +0200, Stefano Garzarella wrote:
> When we are looking for a socket bound to a specific address,
> we also have to take into account the CID.
>
> This patch is useful with multi-transports support because it
> allows the binding of the same port with different CID, and
> it prevents a connection to a wrong socket bound to the same
> port, but with different CID.
>
> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> ---
> net/vmw_vsock/af_vsock.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 00/13] vsock: add multi-transports support
From: Stefan Hajnoczi @ 2019-10-09 13:29 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, Sasha Levin, linux-hyperv, Stephen Hemminger, kvm,
Michael S. Tsirkin, Dexuan Cui, Haiyang Zhang, linux-kernel,
virtualization, Stefan Hajnoczi, David S. Miller, Jorgen Hansen
In-Reply-To: <20190927112703.17745-1-sgarzare@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 299 bytes --]
On Fri, Sep 27, 2019 at 01:26:50PM +0200, Stefano Garzarella wrote:
> Hi all,
> this series adds the multi-transports support to vsock, following
> this proposal:
> https://www.spinics.net/lists/netdev/msg575792.html
Nice series! I have left a few comments but overall it looks promising.
Stefan
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox