Linux-HyperV List
 help / color / mirror / Atom feed
* RE: [PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write
From: Haiyang Zhang @ 2019-03-26 14:42 UTC (permalink / raw)
  To: mgamal@redhat.com, Stephen Hemminger, Michael Kelley,
	linux-hyperv@vger.kernel.org, kimbrownkd
  Cc: Sasha Levin, Dexuan Cui, Long Li, KY Srinivasan, vkuznets
In-Reply-To: <1553609158.5021.15.camel@redhat.com>



> -----Original Message-----
> From: linux-hyperv-owner@vger.kernel.org <linux-hyperv-
> owner@vger.kernel.org> On Behalf Of Mohammed Gamal
> Sent: Tuesday, March 26, 2019 10:06 AM
> To: Haiyang Zhang <haiyangz@microsoft.com>; Stephen Hemminger
> <sthemmin@microsoft.com>; Michael Kelley <mikelley@microsoft.com>;
> linux-hyperv@vger.kernel.org; kimbrownkd <kimbrownkd@gmail.com>
> Cc: Sasha Levin <Alexander.Levin@microsoft.com>; Dexuan Cui
> <decui@microsoft.com>; Long Li <longli@microsoft.com>; KY Srinivasan
> <kys@microsoft.com>; vkuznets <vkuznets@redhat.com>
> Subject: Re: [PATCH] hyper-v: Check for ring buffer in
> hv_get_bytes_to_read/write
> 
> On Mon, 2019-03-25 at 20:13 +0000, Haiyang Zhang wrote:
> > Hi Mohammed,
> >
> > I found by reading the code and testing – in netvsc_detach(), after
> > netif_tx_disable(), the queues may be waken up again when ring buffer
> > usage drops below the “low wartermark”. This is expected in normal
> > conditions as part of our flow control mechanism.
> >
> > But when we stopped all tx queues in netvsc_detach(), and start
> > removing the netvsc device, this may cause send path panic on NULL
> > pointer on a closed channel.
> >
> > I have attached a patch for this issue, could you test it on your
> > side?
> >
> > Thanks,
> > Haiyang
> 
> Hi Haiyang,
> 
> I've tested the patch and it seems to fix the problem.
> 
> Thanks,
> Mohammed
> 

Thank you for the testing!
 - Haiyang

^ permalink raw reply

* Re: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
From: Lorenzo Pieralisi @ 2019-03-26 17:08 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: Michael Kelley, bhelgaas@google.com, linux-pci@vger.kernel.org,
	KY Srinivasan, Stephen Hemminger, Sasha Levin,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <PU1P153MB0169E9AA13EB51DF5CFC1CE4BF420@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

On Thu, Mar 21, 2019 at 12:12:03AM +0000, Dexuan Cui wrote:
> > From: Michael Kelley <mikelley@microsoft.com>
> > Sent: Wednesday, March 20, 2019 2:38 PM
> > 
> > From: Dexuan Cui <decui@microsoft.com>
> > >
> > > After a device is just created in new_pcichild_device(), hpdev->refs is set
> > > to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> > >
> > > When we hot remove the device from the host, in Linux VM we first call
> > > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
> > > then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
> > > (let's ignore the paired get/put_pcichild() in other places). But in
> > > hv_eject_device_work(), currently we only call put_pcichild() twice,
> > > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> > > adds one put_pcichild() to fix the memory leak.
> > >
> > > BTW, the device can also be removed when we run "rmmod pci-hyperv". On
> > this
> > > path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
> > > hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> > > pci_devices_present_work().
> > 
> > Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
> > There is the reference in the hbus->children list, and there is the reference that
> > is returned to the caller.  
> So IMO the "normal" reference count should be 2. :-) IMO only when a hv_pci_dev
> device is about to be destroyed, its reference count can drop to less than 2, 
> i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed from
> hbus->children), and then drop to zero (meaning kfree(hpdev) is called).
> 
> > But what is strange is that pci_devices_present_work()
> > overwrites the reference returned in local variable hpdev without doing a
> > put_pcichild().
> I suppose you mean:
> 
>         /* First, mark all existing children as reported missing. */
>         spin_lock_irqsave(&hbus->device_list_lock, flags);
>         list_for_each_entry(hpdev, &hbus->children, list_entry) {
>                 hpdev->reported_missing = true;
>         }
>         spin_unlock_irqrestore(&hbus->device_list_lock, flags)
> 
> This is not strange to me, because, in pci_devices_present_work(), at first we
> don't know which devices are about to disappear, so we pre-mark all devices to
> be potentially missing like that; if a device is still on the bus, we'll mark its
> hpdev->reported_missing to false later; only after we know exactly which
> devices are missing, we should call put_pcichild() against them. All these
> seem natural to me.
> 
> > It seems like the "normal" reference count should be 1 when the
> > child device is not being manipulated, not 2.
> What does "not being manipulated" mean?
> 
> > The fix would be to add a call to
> > put_pcichild() when the return value from new_pcichild_device() is
> > overwritten.
> In pci_devices_present_work(), we NEVER "overwrite" the "hpdev" returned
> from new_pcichild_device(): the "reported_missing" field of the new hpdev
> is implicitly initialized to false in new_pcichild_device().
> 
> > Then remove the call to put_pcichild() in pci_device_present_work() when
> > missing
> > children are moved to the local list. The children have been moved from one
> > list
> > to another, so there's no need to decrement the reference count.  Then when
> > everything in the local list is deleted, the reference is correctly decremented,
> > presumably freeing the memory.
> > 
> > With this approach, the code in hv_eject_device_work() is correct.  There's
> > one call to put_pcichild() to reflect removing the child device from the hbus->
> > children list, and one call to put_pcichild() to pair with the get_pcichild() in
> > hv_pci_eject_device().
> Please refer to my replies above. IMO we should fix
> hv_eject_device_work() rather than pci_devices_present_work().

Have we reached a conclusion on this ? I would like to merge this series
given that it is fixing bugs and it has hung in the balance for quite
a while but it looks like Michael is not too happy about these patches
and I need a maintainer ACK to merge them.

Thanks,
Lorenzo

> Thanks
> -- Dexuan
>  
> > Your patch works, but to me it leaves the ref count in an unnatural state
> > most of the time.
> > 
> > Michael
> 

^ permalink raw reply

* RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
From: Michael Kelley @ 2019-03-26 17:47 UTC (permalink / raw)
  To: Lorenzo Pieralisi, Dexuan Cui
  Cc: bhelgaas@google.com, linux-pci@vger.kernel.org, KY Srinivasan,
	Stephen Hemminger, Sasha Levin, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <20190326170842.GA10666@e107981-ln.cambridge.arm.com>

From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>  Sent: Tuesday, March 26, 2019 10:09 AM
> On Thu, Mar 21, 2019 at 12:12:03AM +0000, Dexuan Cui wrote:
> > > From: Michael Kelley <mikelley@microsoft.com>
> > > Sent: Wednesday, March 20, 2019 2:38 PM
> > >
> > > From: Dexuan Cui <decui@microsoft.com>
> > > >
> > > > After a device is just created in new_pcichild_device(), hpdev->refs is set
> > > > to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> > > >
> > > > When we hot remove the device from the host, in Linux VM we first call
> > > > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
> > > > then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
> > > > (let's ignore the paired get/put_pcichild() in other places). But in
> > > > hv_eject_device_work(), currently we only call put_pcichild() twice,
> > > > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> > > > adds one put_pcichild() to fix the memory leak.
> > > >
> > > > BTW, the device can also be removed when we run "rmmod pci-hyperv". On
> > > this
> > > > path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
> > > > hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> > > > pci_devices_present_work().
> > >
> > > Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
> > > There is the reference in the hbus->children list, and there is the reference that
> > > is returned to the caller.
> > So IMO the "normal" reference count should be 2. :-) IMO only when a hv_pci_dev
> > device is about to be destroyed, its reference count can drop to less than 2,
> > i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed from
> > hbus->children), and then drop to zero (meaning kfree(hpdev) is called).
> >
> > > But what is strange is that pci_devices_present_work()
> > > overwrites the reference returned in local variable hpdev without doing a
> > > put_pcichild().
> > I suppose you mean:
> >
> >         /* First, mark all existing children as reported missing. */
> >         spin_lock_irqsave(&hbus->device_list_lock, flags);
> >         list_for_each_entry(hpdev, &hbus->children, list_entry) {
> >                 hpdev->reported_missing = true;
> >         }
> >         spin_unlock_irqrestore(&hbus->device_list_lock, flags)
> >
> > This is not strange to me, because, in pci_devices_present_work(), at first we
> > don't know which devices are about to disappear, so we pre-mark all devices to
> > be potentially missing like that; if a device is still on the bus, we'll mark its
> > hpdev->reported_missing to false later; only after we know exactly which
> > devices are missing, we should call put_pcichild() against them. All these
> > seem natural to me.
> >
> > > It seems like the "normal" reference count should be 1 when the
> > > child device is not being manipulated, not 2.
> > What does "not being manipulated" mean?
> >
> > > The fix would be to add a call to
> > > put_pcichild() when the return value from new_pcichild_device() is
> > > overwritten.
> > In pci_devices_present_work(), we NEVER "overwrite" the "hpdev" returned
> > from new_pcichild_device(): the "reported_missing" field of the new hpdev
> > is implicitly initialized to false in new_pcichild_device().
> >
> > > Then remove the call to put_pcichild() in pci_device_present_work() when
> > > missing
> > > children are moved to the local list. The children have been moved from one
> > > list
> > > to another, so there's no need to decrement the reference count.  Then when
> > > everything in the local list is deleted, the reference is correctly decremented,
> > > presumably freeing the memory.
> > >
> > > With this approach, the code in hv_eject_device_work() is correct.  There's
> > > one call to put_pcichild() to reflect removing the child device from the hbus->
> > > children list, and one call to put_pcichild() to pair with the get_pcichild() in
> > > hv_pci_eject_device().
> > Please refer to my replies above. IMO we should fix
> > hv_eject_device_work() rather than pci_devices_present_work().
> 
> Have we reached a conclusion on this ? I would like to merge this series
> given that it is fixing bugs and it has hung in the balance for quite
> a while but it looks like Michael is not too happy about these patches
> and I need a maintainer ACK to merge them.
> 
> Thanks,
> Lorenzo

Dexuan and I have discussed the topic extensively offline.  The patch works
in its current form, and I'll agree to it.

Reviewed-by:  Michael Kelley <mikelley@microsoft.com>

^ permalink raw reply

* RE: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary
From: Michael Kelley @ 2019-03-26 17:50 UTC (permalink / raw)
  To: Dexuan Cui, lorenzo.pieralisi@arm.com, bhelgaas@google.com,
	linux-pci@vger.kernel.org, KY Srinivasan, Stephen Hemminger,
	Sasha Levin
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <PU1P153MB01691FE3199135CE5C3BFFEDBF420@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

From: Dexuan Cui <decui@microsoft.com>  Sent: Wednesday, March 20, 2019 5:36 PM
> 
> > From: Michael Kelley <mikelley@microsoft.com>
> > > ...
> > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct
> > work_struct *work)
> > >  		hpdev = list_first_entry(&removed, struct hv_pci_dev,
> > >  					 list_entry);
> > >  		list_del(&hpdev->list_entry);
> > > +
> > > +		if (hpdev->pci_slot)
> > > +			pci_destroy_slot(hpdev->pci_slot);
> >
> > The code is inconsistent in whether hpdev->pci_slot is set to NULL after calling
> > pci_destory_slot().
> Here, in pci_devices_present_work(), it's unnecessary to set it to NULL,
> Because:
> 1) the "hpdev" is removed from hbus->children and it can not be seen
> elsewhere;
> 2) the "hpdev" struct is freed in the below put_pcichild():
> 
>      while (!list_empty(&removed)) {
>                 hpdev = list_first_entry(&removed, struct hv_pci_dev,
>                                          list_entry);
>                 list_del(&hpdev->list_entry);
> 
>                 if (hpdev->pci_slot)
>                         pci_destroy_slot(hpdev->pci_slot);
> 
>                 put_pcichild(hpdev);
>         }
> 
> > Patch 2 in this series does set it to NULL, but this code does not.
> In Patch2, i.e. in the code path hv_pci_remove() -> hv_pci_remove_slots(),
> we must set hpdev->pci_slot to NULL, otherwise, later, due to
> hv_pci_remove() -> hv_pci_bus_exit() ->
> hv_pci_devices_present() with the zero "relations", we'll double-free the
> "hpdev" struct in pci_devices_present_work() -- see the above.
> 
> > And the code in hv_eject_device_work() does not set it to NULL.
> It's unnecessary to set hpdev->pci_slot to NULL in hv_eject_device_work(),
> Because in hv_eject_device_work():
> 1) the "hpdev" is removed from hbus->children and it can not be seen
> elsewhere;
> 2) the "hpdev" struct is freed at the end of hv_eject_device_work() with my
> first patch: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work().
> 
> > It looks like all the places that test the value of hpdev->pci_slot or call
> > pci_destroy_slot() are serialized, so it looks like it really doesn't matter.  But
> > when
> > the code is inconsistent about setting to NULL, it always makes me wonder if
> > there
> > is a reason.
> >
> > Michael
> 

Reviewed-by:  Michael Kelley <mikelley@microsoft.com>

^ permalink raw reply

* RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
From: Dexuan Cui @ 2019-03-26 18:01 UTC (permalink / raw)
  To: Michael Kelley, Lorenzo Pieralisi
  Cc: bhelgaas@google.com, linux-pci@vger.kernel.org, KY Srinivasan,
	Stephen Hemminger, Sasha Levin, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <DM5PR2101MB0918F2C06BBFF7684BF179FFD75F0@DM5PR2101MB0918.namprd21.prod.outlook.com>

> From: Michael Kelley <mikelley@microsoft.com>
> Sent: Tuesday, March 26, 2019 10:47 AM
> To: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>; Dexuan Cui
> <decui@microsoft.com>
> Cc: bhelgaas@google.com; linux-pci@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Stephen Hemminger <sthemmin@microsoft.com>;
> Sasha Levin <Alexander.Levin@microsoft.com>; linux-hyperv@vger.kernel.org;
> linux-kernel@vger.kernel.org; driverdev-devel@linuxdriverproject.org; Haiyang
> Zhang <haiyangz@microsoft.com>; olaf@aepfle.de; apw@canonical.com;
> jasowang@redhat.com; vkuznets <vkuznets@redhat.com>;
> marcelo.cerri@canonical.com; jackm@mellanox.com; stable@vger.kernel.org
> Subject: RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
> 
> From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>  Sent: Tuesday, March 26,
> 2019 10:09 AM
> > On Thu, Mar 21, 2019 at 12:12:03AM +0000, Dexuan Cui wrote:
> > > > From: Michael Kelley <mikelley@microsoft.com>
> > > > Sent: Wednesday, March 20, 2019 2:38 PM
> > > >
> > > > From: Dexuan Cui <decui@microsoft.com>
> > > > >
> > > > > After a device is just created in new_pcichild_device(), hpdev->refs is
> set
> > > > > to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> > > > >
> > > > > When we hot remove the device from the host, in Linux VM we first call
> > > > > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild()
> and
> > > > > then schedules a work of hv_eject_device_work(), so hpdev->refs
> becomes 3
> > > > > (let's ignore the paired get/put_pcichild() in other places). But in
> > > > > hv_eject_device_work(), currently we only call put_pcichild() twice,
> > > > > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> > > > > adds one put_pcichild() to fix the memory leak.
> > > > >
> > > > > BTW, the device can also be removed when we run "rmmod pci-hyperv".
> On
> > > > this
> > > > > path (hv_pci_remove() -> hv_pci_bus_exit() ->
> hv_pci_devices_present()),
> > > > > hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> > > > > pci_devices_present_work().
> > > >
> > > > Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
> > > > There is the reference in the hbus->children list, and there is the
> reference that
> > > > is returned to the caller.
> > > So IMO the "normal" reference count should be 2. :-) IMO only when a
> hv_pci_dev
> > > device is about to be destroyed, its reference count can drop to less than 2,
> > > i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed
> from
> > > hbus->children), and then drop to zero (meaning kfree(hpdev) is called).
> > >
> > > > But what is strange is that pci_devices_present_work()
> > > > overwrites the reference returned in local variable hpdev without doing a
> > > > put_pcichild().
> > > I suppose you mean:
> > >
> > >         /* First, mark all existing children as reported missing. */
> > >         spin_lock_irqsave(&hbus->device_list_lock, flags);
> > >         list_for_each_entry(hpdev, &hbus->children, list_entry) {
> > >                 hpdev->reported_missing = true;
> > >         }
> > >         spin_unlock_irqrestore(&hbus->device_list_lock, flags)
> > >
> > > This is not strange to me, because, in pci_devices_present_work(), at first
> we
> > > don't know which devices are about to disappear, so we pre-mark all
> devices to
> > > be potentially missing like that; if a device is still on the bus, we'll mark its
> > > hpdev->reported_missing to false later; only after we know exactly which
> > > devices are missing, we should call put_pcichild() against them. All these
> > > seem natural to me.
> > >
> > > > It seems like the "normal" reference count should be 1 when the
> > > > child device is not being manipulated, not 2.
> > > What does "not being manipulated" mean?
> > >
> > > > The fix would be to add a call to
> > > > put_pcichild() when the return value from new_pcichild_device() is
> > > > overwritten.
> > > In pci_devices_present_work(), we NEVER "overwrite" the "hpdev"
> returned
> > > from new_pcichild_device(): the "reported_missing" field of the new hpdev
> > > is implicitly initialized to false in new_pcichild_device().
> > >
> > > > Then remove the call to put_pcichild() in pci_device_present_work()
> when
> > > > missing
> > > > children are moved to the local list. The children have been moved from
> one
> > > > list
> > > > to another, so there's no need to decrement the reference count.  Then
> when
> > > > everything in the local list is deleted, the reference is correctly
> decremented,
> > > > presumably freeing the memory.
> > > >
> > > > With this approach, the code in hv_eject_device_work() is correct.
> There's
> > > > one call to put_pcichild() to reflect removing the child device from the
> hbus->
> > > > children list, and one call to put_pcichild() to pair with the get_pcichild() in
> > > > hv_pci_eject_device().
> > > Please refer to my replies above. IMO we should fix
> > > hv_eject_device_work() rather than pci_devices_present_work().
> >
> > Have we reached a conclusion on this ? I would like to merge this series
> > given that it is fixing bugs and it has hung in the balance for quite
> > a while but it looks like Michael is not too happy about these patches
> > and I need a maintainer ACK to merge them.
> >
> > Thanks,
> > Lorenzo
> 
> Dexuan and I have discussed the topic extensively offline.  The patch works
> in its current form, and I'll agree to it.
> 
> Reviewed-by:  Michael Kelley <mikelley@microsoft.com>

Thanks, Michael!

Hi Lorenzo,
All the 3 patches have got Michael's Reviewed-by.

Previously, Stephen Hemminger, one of the Hyper-V driver maintainers, 
provided his Reviewed-by in the " [PATCH 0/3]" mail:
https://lkml.org/lkml/2019/3/5/521

Thanks,
--Dexuan

^ permalink raw reply

* Re: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
From: Lorenzo Pieralisi @ 2019-03-26 18:12 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: Michael Kelley, bhelgaas@google.com, linux-pci@vger.kernel.org,
	KY Srinivasan, Stephen Hemminger, Sasha Levin,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <PU1P153MB0169CA2F484C02A196F9B78DBF5F0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

On Tue, Mar 26, 2019 at 06:01:32PM +0000, Dexuan Cui wrote:

[...]

> > > Have we reached a conclusion on this ? I would like to merge this series
> > > given that it is fixing bugs and it has hung in the balance for quite
> > > a while but it looks like Michael is not too happy about these patches
> > > and I need a maintainer ACK to merge them.
> > >
> > > Thanks,
> > > Lorenzo
> > 
> > Dexuan and I have discussed the topic extensively offline.  The patch works
> > in its current form, and I'll agree to it.
> > 
> > Reviewed-by:  Michael Kelley <mikelley@microsoft.com>
> 
> Thanks, Michael!
> 
> Hi Lorenzo,
> All the 3 patches have got Michael's Reviewed-by.

Good, I asked because I do not want to merge patches with review
questions open, thanks for the clarifications.

> Previously, Stephen Hemminger, one of the Hyper-V driver maintainers, 
> provided his Reviewed-by in the " [PATCH 0/3]" mail:
> https://lkml.org/lkml/2019/3/5/521

I will reformat/rewrite the logs and notify you when queued.

Thanks,
Lorenzo

^ permalink raw reply

* Re: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary
From: Lorenzo Pieralisi @ 2019-03-26 19:54 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: bhelgaas@google.com, linux-pci@vger.kernel.org, KY Srinivasan,
	Stephen Hemminger, Michael Kelley, Sasha Levin,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <20190304213357.16652-4-decui@microsoft.com>

On Mon, Mar 04, 2019 at 09:34:49PM +0000, Dexuan Cui wrote:
> When we hot-remove a device, usually the host sends us a PCI_EJECT message,
> and a PCI_BUS_RELATIONS message with bus_rel->device_count == 0. But when
> we do the quick hot-add/hot-remove test, the host may not send us the
> PCI_EJECT message, if the guest has not fully finished the initialization
> by sending the PCI_RESOURCES_ASSIGNED* message to the host, so it's
> potentially unsafe to only depend on the pci_destroy_slot() in
> hv_eject_device_work(), though create_root_hv_pci_bus() ->
> hv_pci_assign_slots() is not called in this case. Note: in this case, the
> host still sends the guest a PCI_BUS_RELATIONS message with
> bus_rel->device_count == 0.
> 
> And, in the quick hot-add/hot-remove test, we can have such a race: before
> pci_devices_present_work() -> new_pcichild_device() adds the new device
> into hbus->children, we may have already received the PCI_EJECT message,
> and hence the taklet handler hv_pci_onchannelcallback() may fail to find
> the "hpdev" by get_pcichild_wslot(hbus, dev_message->wslot.slot), so
> hv_pci_eject_device() is NOT called; later create_root_hv_pci_bus() ->
> hv_pci_assign_slots() creates the slot, and the PCI_BUS_RELATIONS message
> with bus_rel->device_count == 0 removes the device from hbus->children, and
> we end up being unable to remove the slot in hv_pci_remove() ->
> hv_pci_remove_slots().
> 
> The patch removes the slot in pci_devices_present_work() when the device
> is removed. This can address the above race. Note 1:
> pci_devices_present_work() and hv_eject_device_work() run in the
> singled-threaded hbus->wq, so there is not a double-remove issue for the
> slot. Note 2: we can't offload hv_pci_eject_device() from
> hv_pci_onchannelcallback() to the workqueue, because we need
> hv_pci_onchannelcallback() synchronously call hv_pci_eject_device() to
> poll the channel's ringbuffer to work around the
> "hangs in hv_compose_msi_msg()" issue: see
> commit de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")

This commit log is unreadable, sorry. Indentation, punctuation and
formatting are just a mess, try to read it, you will notice by
yourself.

I basically reformatted it completely and pushed the series to
pci/controller-fixes but that's the last time I do it since I am not an
editor, next time I won't merge it.

More importantly, these patches are marked for stable, given the series
of fixes that triggered this series please ensure it was tested
thoroughly because it is honestly complicate to understand and I do not
want to backport further fixes to stable kernels on top of this.

Please have a look and report back.

Thanks,
Lorenzo

> Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Cc: stable@vger.kernel.org
> ---
>  drivers/pci/controller/pci-hyperv.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index b489412e3502..82acd6155adf 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct work_struct *work)
>  		hpdev = list_first_entry(&removed, struct hv_pci_dev,
>  					 list_entry);
>  		list_del(&hpdev->list_entry);
> +
> +		if (hpdev->pci_slot)
> +			pci_destroy_slot(hpdev->pci_slot);
> +
>  		put_pcichild(hpdev);
>  	}
>  
> -- 
> 2.19.1
> 

^ permalink raw reply

* RE: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary
From: Dexuan Cui @ 2019-03-27  0:22 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: bhelgaas@google.com, linux-pci@vger.kernel.org, KY Srinivasan,
	Stephen Hemminger, Michael Kelley, Sasha Levin,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <20190326195429.GA15410@e107981-ln.cambridge.arm.com>

> From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Sent: Tuesday, March 26, 2019 12:55 PM
> On Mon, Mar 04, 2019 at 09:34:49PM +0000, Dexuan Cui wrote:
> > When we hot-remove a device, usually the host sends us a PCI_EJECT
> message,
> > and a PCI_BUS_RELATIONS message with bus_rel->device_count == 0. But
> when
> > we do the quick hot-add/hot-remove test, the host may not send us the
> > PCI_EJECT message, if the guest has not fully finished the initialization
> > by sending the PCI_RESOURCES_ASSIGNED* message to the host, so it's
> > potentially unsafe to only depend on the pci_destroy_slot() in
> > hv_eject_device_work(), though create_root_hv_pci_bus() ->
> > hv_pci_assign_slots() is not called in this case. Note: in this case, the
> > host still sends the guest a PCI_BUS_RELATIONS message with
> > bus_rel->device_count == 0.
> >
> > And, in the quick hot-add/hot-remove test, we can have such a race: before
> > pci_devices_present_work() -> new_pcichild_device() adds the new device
> > into hbus->children, we may have already received the PCI_EJECT message,
> > and hence the taklet handler hv_pci_onchannelcallback() may fail to find
> > the "hpdev" by get_pcichild_wslot(hbus, dev_message->wslot.slot), so
> > hv_pci_eject_device() is NOT called; later create_root_hv_pci_bus() ->
> > hv_pci_assign_slots() creates the slot, and the PCI_BUS_RELATIONS message
> > with bus_rel->device_count == 0 removes the device from hbus->children,
> and
> > we end up being unable to remove the slot in hv_pci_remove() ->
> > hv_pci_remove_slots().
> >
> > The patch removes the slot in pci_devices_present_work() when the device
> > is removed. This can address the above race. Note 1:
> > pci_devices_present_work() and hv_eject_device_work() run in the
> > singled-threaded hbus->wq, so there is not a double-remove issue for the
> > slot. Note 2: we can't offload hv_pci_eject_device() from
> > hv_pci_onchannelcallback() to the workqueue, because we need
> > hv_pci_onchannelcallback() synchronously call hv_pci_eject_device() to
> > poll the channel's ringbuffer to work around the
> > "hangs in hv_compose_msi_msg()" issue: see
> > commit de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in
> hv_compose_msi_msg()")
> 
> This commit log is unreadable, sorry. Indentation, punctuation and
> formatting are just a mess, try to read it, you will notice by
> yourself.
> 
> I basically reformatted it completely and pushed the series to
> pci/controller-fixes but that's the last time I do it since I am not an
> editor, next time I won't merge it.

Hi Lorenzo,
Thank you for helping improve my changelogs! I did learn a lot after
carefully comparing the improved version with my original version. :-)

I'll try my best to write a good changelog for my future patches.

> More importantly, these patches are marked for stable, given the series
> of fixes that triggered this series please ensure it was tested
> thoroughly because it is honestly complicate to understand and I do not
> want to backport further fixes to stable kernels on top of this.

I did the hot-add/hot-remove test in a loop for several thousand times,
and the patchset worked as expected and didn't show any issue.

> Please have a look and report back.
> 
> Thanks,
> Lorenzo

Thanks again!

Thanks,
-- Dexuan

^ permalink raw reply

* Re: [PATCH] MAINTAINERS: Fix Hyperv vIOMMU driver file name
From: Mukesh Ojha @ 2019-03-27 13:51 UTC (permalink / raw)
  To: lantianyu1986, davem, mchehab+samsung, gregkh, nicolas.ferre,
	tglx, mingo, konrad.wilk, jpoimboe, peterz, jkosina, riel, peterz,
	Tianyu.Lan, luto, michael.h.kelley, kys, sashal, joe
  Cc: linux-kernel, linux-hyperv
In-Reply-To: <1553581701-25250-1-git-send-email-Tianyu.Lan@microsoft.com>


On 3/26/2019 11:58 AM, lantianyu1986@gmail.com wrote:
> From: Lan Tianyu <Tianyu.Lan@microsoft.com>
>
> The Hyperv vIOMMU file name should be "hyperv-iommu.c" rather

s/vIOMMU/IOMMU

> than "hyperv_iommu.c". This patch is to fix it.
>
> Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>


Make the above change otherwise looks good.
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>

-Mukesh

> ---
>   MAINTAINERS | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index e17ebf7..403247d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7175,7 +7175,7 @@ F:	drivers/net/hyperv/
>   F:	drivers/scsi/storvsc_drv.c
>   F:	drivers/uio/uio_hv_generic.c
>   F:	drivers/video/fbdev/hyperv_fb.c
> -F:	drivers/iommu/hyperv_iommu.c
> +F:	drivers/iommu/hyperv-iommu.c
>   F:	net/vmw_vsock/hyperv_transport.c
>   F:	include/linux/hyperv.h
>   F:	include/uapi/linux/hyperv.h

^ permalink raw reply

* Re: [PATCH] MAINTAINERS: Fix Hyperv vIOMMU driver file name
From: Mukesh Ojha @ 2019-03-27 14:09 UTC (permalink / raw)
  To: lantianyu1986, davem, mchehab+samsung, gregkh, nicolas.ferre,
	tglx, mingo, konrad.wilk, jpoimboe, peterz, jkosina, riel, peterz,
	Tianyu.Lan, luto, michael.h.kelley, kys, sashal, joe
  Cc: linux-kernel, linux-hyperv
In-Reply-To: <1553581701-25250-1-git-send-email-Tianyu.Lan@microsoft.com>


On 3/26/2019 11:58 AM, lantianyu1986@gmail.com wrote:
> From: Lan Tianyu <Tianyu.Lan@microsoft.com>
>
> The Hyperv vIOMMU file name should be "hyperv-iommu.c" rather
s/vIOMMU /IOMMU
> than "hyperv_iommu.c". This patch is to fix it.
>
> Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>


Othewise looks fine.
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>

-Mukesh

> ---
>   MAINTAINERS | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index e17ebf7..403247d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7175,7 +7175,7 @@ F:	drivers/net/hyperv/
>   F:	drivers/scsi/storvsc_drv.c
>   F:	drivers/uio/uio_hv_generic.c
>   F:	drivers/video/fbdev/hyperv_fb.c
> -F:	drivers/iommu/hyperv_iommu.c
> +F:	drivers/iommu/hyperv-iommu.c
>   F:	net/vmw_vsock/hyperv_transport.c
>   F:	include/linux/hyperv.h
>   F:	include/uapi/linux/hyperv.h

^ permalink raw reply

* [PATCH AUTOSEL 4.19 019/192] x86/hyperv: Fix kernel panic when kexec on HyperV
From: Sasha Levin @ 2019-03-27 18:07 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Kairui Song, Thomas Gleixner, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, Borislav Petkov, H. Peter Anvin,
	Vitaly Kuznetsov, Dave Young, devel, linux-hyperv
In-Reply-To: <20190327181025.13507-1-sashal@kernel.org>

From: Kairui Song <kasong@redhat.com>

[ Upstream commit 179fb36abb097976997f50733d5b122a29158cba ]

After commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments"),
kexec fails with a kernel panic:

kexec_core: Starting new kernel
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v3.0 03/02/2018
RIP: 0010:0xffffc9000001d000

Call Trace:
 ? __send_ipi_mask+0x1c6/0x2d0
 ? hv_send_ipi_mask_allbutself+0x6d/0xb0
 ? mp_save_irq+0x70/0x70
 ? __ioapic_read_entry+0x32/0x50
 ? ioapic_read_entry+0x39/0x50
 ? clear_IO_APIC_pin+0xb8/0x110
 ? native_stop_other_cpus+0x6e/0x170
 ? native_machine_shutdown+0x22/0x40
 ? kernel_kexec+0x136/0x156

That happens if hypercall based IPIs are used because the hypercall page is
reset very early upon kexec reboot, but kexec sends IPIs to stop CPUs,
which invokes the hypercall and dereferences the unusable page.

To fix his, reset hv_hypercall_pg to NULL before the page is reset to avoid
any misuse, IPI sending will fall back to the non hypercall based
method. This only happens on kexec / kdump so just setting the pointer to
NULL is good enough.

Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: devel@linuxdriverproject.org
Link: https://lkml.kernel.org/r/20190306111827.14131-1-kasong@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/hyperv/hv_init.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 20c876c7c5bf..87abd5145cc9 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -387,6 +387,13 @@ void hyperv_cleanup(void)
 	/* Reset our OS id */
 	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
 
+	/*
+	 * Reset hypercall page reference before reset the page,
+	 * let hypercall operations fail safely rather than
+	 * panic the kernel for using invalid hypercall page
+	 */
+	hv_hypercall_pg = NULL;
+
 	/* Reset the hypercall page */
 	hypercall_msr.as_uint64 = 0;
 	wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
-- 
2.19.1


^ permalink raw reply related

* [PATCH AUTOSEL 5.0 028/262] x86/hyperv: Fix kernel panic when kexec on HyperV
From: Sasha Levin @ 2019-03-27 17:58 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Kairui Song, Thomas Gleixner, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, Borislav Petkov, H. Peter Anvin,
	Vitaly Kuznetsov, Dave Young, devel, linux-hyperv
In-Reply-To: <20190327180158.10245-1-sashal@kernel.org>

From: Kairui Song <kasong@redhat.com>

[ Upstream commit 179fb36abb097976997f50733d5b122a29158cba ]

After commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments"),
kexec fails with a kernel panic:

kexec_core: Starting new kernel
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v3.0 03/02/2018
RIP: 0010:0xffffc9000001d000

Call Trace:
 ? __send_ipi_mask+0x1c6/0x2d0
 ? hv_send_ipi_mask_allbutself+0x6d/0xb0
 ? mp_save_irq+0x70/0x70
 ? __ioapic_read_entry+0x32/0x50
 ? ioapic_read_entry+0x39/0x50
 ? clear_IO_APIC_pin+0xb8/0x110
 ? native_stop_other_cpus+0x6e/0x170
 ? native_machine_shutdown+0x22/0x40
 ? kernel_kexec+0x136/0x156

That happens if hypercall based IPIs are used because the hypercall page is
reset very early upon kexec reboot, but kexec sends IPIs to stop CPUs,
which invokes the hypercall and dereferences the unusable page.

To fix his, reset hv_hypercall_pg to NULL before the page is reset to avoid
any misuse, IPI sending will fall back to the non hypercall based
method. This only happens on kexec / kdump so just setting the pointer to
NULL is good enough.

Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: devel@linuxdriverproject.org
Link: https://lkml.kernel.org/r/20190306111827.14131-1-kasong@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/hyperv/hv_init.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 7abb09e2eeb8..d3f42b6bbdac 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -406,6 +406,13 @@ void hyperv_cleanup(void)
 	/* Reset our OS id */
 	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
 
+	/*
+	 * Reset hypercall page reference before reset the page,
+	 * let hypercall operations fail safely rather than
+	 * panic the kernel for using invalid hypercall page
+	 */
+	hv_hypercall_pg = NULL;
+
 	/* Reset the hypercall page */
 	hypercall_msr.as_uint64 = 0;
 	wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
-- 
2.19.1


^ permalink raw reply related

* Re: [PATCH v3 3/3] Drivers: hv: vmbus: Fix race condition with new ring_buffer_info mutex
From: Kimberly Brown @ 2019-03-28  4:30 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui, KY Srinivasan, Haiyang Zhang,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <DM5PR2101MB091851BC7C55093D388514D9D7420@DM5PR2101MB0918.namprd21.prod.outlook.com>

On Thu, Mar 21, 2019 at 04:04:20PM +0000, Michael Kelley wrote:
> From: Kimberly Brown <kimbrownkd@gmail.com>  Sent: Wednesday, March 20, 2019 8:48 PM
> > > > > Adding more locks will solve the problem but it seems like overkill.
> > > > > Why not either use a reference count or an RCU style access for the
> > > > > ring buffer?
> > > >
> > > > I agree that a reference count or RCU could also solve this problem.
> > > > Using mutex locks seemed like the most straightforward solution, but
> > > > I'll certainly switch to a different approach if it's better!
> > > >
> > > > Are you concerned about the extra memory required for the mutex locks,
> > > > read performance, or something else?
> > >
> > > Locks in control path are ok, but my concern is performance of the
> > > data path which puts packets in/out of rings. To keep reasonable performance,
> > > no additional locking should be added in those paths.
> > >
> > > So if data path is using RCU, can/should the control operations also
> > > use it?
> > 


Hi Stephen,

Do you have any additional questions or suggestions for this race
condition and the mutex locks? I think that your initial questions were
addressed in the responses below. If there's anything else, please let
me know!

Thanks,
Kim


> > The data path doesn't use RCU to protect the ring buffers.
> 
> My $.02:  The mutex is obtained only in the sysfs path and the "delete
> ringbuffers" path, neither of which is performance or concurrency sensitive. 
> There's no change to any path that reads or writes data from/to the ring
> buffers.  It seems like the mutex is the most straightforward solution to
> preventing sysfs from accessing the ring buffer info while the memory is
> being freed as part of "delete ringbuffers".
> 
> Michael

^ permalink raw reply

* [PATCH hyperv-fixes] hv_netvsc: Fix unwanted wakeup after tx_disable
From: Haiyang Zhang @ 2019-03-28 17:48 UTC (permalink / raw)
  To: sashal, linux-hyperv
  Cc: haiyangz, kys, sthemmin, olaf, vkuznets, davem, netdev,
	linux-kernel

From: Haiyang Zhang <haiyangz@microsoft.com>

After queue stopped, the wakeup mechanism may wake it up again
when ring buffer usage is lower than a threshold. This may cause
send path panic on NULL pointer when we stopped all tx queues in
netvsc_detach and start removing the netvsc device.

This patch fix it by adding a tx_disable flag to prevent unwanted
queue wakeup.

Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
Reported-by: Mohammed Gamal <mgamal@redhat.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/hyperv/hyperv_net.h |  1 +
 drivers/net/hyperv/netvsc.c     |  6 ++++--
 drivers/net/hyperv/netvsc_drv.c | 32 ++++++++++++++++++++++++++------
 3 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index e859ae2..49f41b6 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -987,6 +987,7 @@ struct netvsc_device {
 
 	wait_queue_head_t wait_drain;
 	bool destroy;
+	bool tx_disable; /* if true, do not wake up queue again */
 
 	/* Receive buffer allocated by us but manages by NetVSP */
 	void *recv_buf;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 813d195..e0dce37 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -110,6 +110,7 @@ static struct netvsc_device *alloc_net_device(void)
 
 	init_waitqueue_head(&net_device->wait_drain);
 	net_device->destroy = false;
+	net_device->tx_disable = false;
 
 	net_device->max_pkt = RNDIS_MAX_PKT_DEFAULT;
 	net_device->pkt_align = RNDIS_PKT_ALIGN_DEFAULT;
@@ -719,7 +720,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
 	} else {
 		struct netdev_queue *txq = netdev_get_tx_queue(ndev, q_idx);
 
-		if (netif_tx_queue_stopped(txq) &&
+		if (netif_tx_queue_stopped(txq) && !net_device->tx_disable &&
 		    (hv_get_avail_to_write_percent(&channel->outbound) >
 		     RING_AVAIL_PERCENT_HIWATER || queue_sends < 1)) {
 			netif_tx_wake_queue(txq);
@@ -874,7 +875,8 @@ static inline int netvsc_send_pkt(
 	} else if (ret == -EAGAIN) {
 		netif_tx_stop_queue(txq);
 		ndev_ctx->eth_stats.stop_queue++;
-		if (atomic_read(&nvchan->queue_sends) < 1) {
+		if (atomic_read(&nvchan->queue_sends) < 1 &&
+		    !net_device->tx_disable) {
 			netif_tx_wake_queue(txq);
 			ndev_ctx->eth_stats.wake_queue++;
 			ret = -ENOSPC;
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 1a08679..0824155 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -109,6 +109,15 @@ static void netvsc_set_rx_mode(struct net_device *net)
 	rcu_read_unlock();
 }
 
+static inline void netvsc_tx_enable(struct netvsc_device *nvscdev,
+				    struct net_device *ndev)
+{
+	nvscdev->tx_disable = false;
+	mb(); /* ensure queue wake up mechanism is on */
+
+	netif_tx_wake_all_queues(ndev);
+}
+
 static int netvsc_open(struct net_device *net)
 {
 	struct net_device_context *ndev_ctx = netdev_priv(net);
@@ -129,7 +138,7 @@ static int netvsc_open(struct net_device *net)
 	rdev = nvdev->extension;
 	if (!rdev->link_state) {
 		netif_carrier_on(net);
-		netif_tx_wake_all_queues(net);
+		netvsc_tx_enable(nvdev, net);
 	}
 
 	if (vf_netdev) {
@@ -184,6 +193,17 @@ static int netvsc_wait_until_empty(struct netvsc_device *nvdev)
 	}
 }
 
+static inline void netvsc_tx_disable(struct netvsc_device *nvscdev,
+				     struct net_device *ndev)
+{
+	if (nvscdev) {
+		nvscdev->tx_disable = true;
+		mb(); /* ensure txq will not wake up after stop */
+	}
+
+	netif_tx_disable(ndev);
+}
+
 static int netvsc_close(struct net_device *net)
 {
 	struct net_device_context *net_device_ctx = netdev_priv(net);
@@ -192,7 +212,7 @@ static int netvsc_close(struct net_device *net)
 	struct netvsc_device *nvdev = rtnl_dereference(net_device_ctx->nvdev);
 	int ret;
 
-	netif_tx_disable(net);
+	netvsc_tx_disable(nvdev, net);
 
 	/* No need to close rndis filter if it is removed already */
 	if (!nvdev)
@@ -918,7 +938,7 @@ static int netvsc_detach(struct net_device *ndev,
 
 	/* If device was up (receiving) then shutdown */
 	if (netif_running(ndev)) {
-		netif_tx_disable(ndev);
+		netvsc_tx_disable(nvdev, ndev);
 
 		ret = rndis_filter_close(nvdev);
 		if (ret) {
@@ -1906,7 +1926,7 @@ static void netvsc_link_change(struct work_struct *w)
 		if (rdev->link_state) {
 			rdev->link_state = false;
 			netif_carrier_on(net);
-			netif_tx_wake_all_queues(net);
+			netvsc_tx_enable(net_device, net);
 		} else {
 			notify = true;
 		}
@@ -1916,7 +1936,7 @@ static void netvsc_link_change(struct work_struct *w)
 		if (!rdev->link_state) {
 			rdev->link_state = true;
 			netif_carrier_off(net);
-			netif_tx_stop_all_queues(net);
+			netvsc_tx_disable(net_device, net);
 		}
 		kfree(event);
 		break;
@@ -1925,7 +1945,7 @@ static void netvsc_link_change(struct work_struct *w)
 		if (!rdev->link_state) {
 			rdev->link_state = true;
 			netif_carrier_off(net);
-			netif_tx_stop_all_queues(net);
+			netvsc_tx_disable(net_device, net);
 			event->event = RNDIS_STATUS_MEDIA_CONNECT;
 			spin_lock_irqsave(&ndev_ctx->lock, flags);
 			list_add(&event->list, &ndev_ctx->reconfig_events);
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH hyperv-fixes] hv_netvsc: Fix unwanted wakeup after tx_disable
From: Stephen Hemminger @ 2019-03-28 18:38 UTC (permalink / raw)
  To: Haiyang Zhang
  Cc: sashal, linux-hyperv, haiyangz, kys, sthemmin, olaf, vkuznets,
	davem, netdev, linux-kernel
In-Reply-To: <20190328174845.4799-1-haiyangz@linuxonhyperv.com>

On Thu, 28 Mar 2019 17:48:45 +0000
Haiyang Zhang <haiyangz@linuxonhyperv.com> wrote:

> +static inline void netvsc_tx_enable(struct netvsc_device *nvscdev,
> +				    struct net_device *ndev)
> +{
> +	nvscdev->tx_disable = false;
> +	mb(); /* ensure queue wake up mechanism is on */
> +
> +	netif_tx_wake_all_queues(ndev);
> +}

You don't need a full mb(). virt_wmb() should be sufficient.

Could I suggest an alternative approach.
You don't need to introduce a local tx_disable flag, the only place where a wakeup
could cause problems is after a send_completion was processed during detach state.

Instead, just avoid wakeup in that place.

--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -720,6 +720,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
                struct netdev_queue *txq = netdev_get_tx_queue(ndev, q_idx);
 
                if (netif_tx_queue_stopped(txq) &&
+                   netif_device_present(ndev) &&
                    (hv_get_avail_to_write_percent(&channel->outbound) >
                     RING_AVAIL_PERCENT_HIWATER || queue_sends < 1)) {
                        netif_tx_wake_queue(txq);

^ permalink raw reply

* Re: [PATCH v3 3/3] Drivers: hv: vmbus: Fix race condition with new ring_buffer_info mutex
From: Stephen Hemminger @ 2019-03-28 18:42 UTC (permalink / raw)
  To: Kimberly Brown
  Cc: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui, KY Srinivasan, Haiyang Zhang,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20190328043057.GA2258@ubu-Virtual-Machine>

On Thu, 28 Mar 2019 00:30:57 -0400
Kimberly Brown <kimbrownkd@gmail.com> wrote:

> On Thu, Mar 21, 2019 at 04:04:20PM +0000, Michael Kelley wrote:
> > From: Kimberly Brown <kimbrownkd@gmail.com>  Sent: Wednesday, March 20, 2019 8:48 PM  
> > > > > > Adding more locks will solve the problem but it seems like overkill.
> > > > > > Why not either use a reference count or an RCU style access for the
> > > > > > ring buffer?  
> > > > >
> > > > > I agree that a reference count or RCU could also solve this problem.
> > > > > Using mutex locks seemed like the most straightforward solution, but
> > > > > I'll certainly switch to a different approach if it's better!
> > > > >
> > > > > Are you concerned about the extra memory required for the mutex locks,
> > > > > read performance, or something else?  
> > > >
> > > > Locks in control path are ok, but my concern is performance of the
> > > > data path which puts packets in/out of rings. To keep reasonable performance,
> > > > no additional locking should be added in those paths.
> > > >
> > > > So if data path is using RCU, can/should the control operations also
> > > > use it?  
> > >   
> 
> 
> Hi Stephen,
> 
> Do you have any additional questions or suggestions for this race
> condition and the mutex locks? I think that your initial questions were
> addressed in the responses below. If there's anything else, please let
> me know!
> 
> Thanks,
> Kim
> 
> 
> > > The data path doesn't use RCU to protect the ring buffers.  
> > 
> > My $.02:  The mutex is obtained only in the sysfs path and the "delete
> > ringbuffers" path, neither of which is performance or concurrency sensitive. 
> > There's no change to any path that reads or writes data from/to the ring
> > buffers.  It seems like the mutex is the most straightforward solution to
> > preventing sysfs from accessing the ring buffer info while the memory is
> > being freed as part of "delete ringbuffers".
> > 
> > Michael  


I have no problems with the patch you did.
My discussion was more around the general issues with ringbuffers being detached
from the device. Not sure if it was even a good design choice but that is
something that is hard to fix now.


^ permalink raw reply

* RE: [PATCH hyperv-fixes] hv_netvsc: Fix unwanted wakeup after tx_disable
From: Haiyang Zhang @ 2019-03-28 19:00 UTC (permalink / raw)
  To: Stephen Hemminger, Haiyang Zhang
  Cc: sashal@kernel.org, linux-hyperv@vger.kernel.org, KY Srinivasan,
	Stephen Hemminger, olaf@aepfle.de, vkuznets, davem@davemloft.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20190328113823.3c9b3599@shemminger-XPS-13-9360>



> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Thursday, March 28, 2019 2:38 PM
> To: Haiyang Zhang <haiyangz@linuxonhyperv.com>
> Cc: sashal@kernel.org; linux-hyperv@vger.kernel.org; Haiyang Zhang
> <haiyangz@microsoft.com>; KY Srinivasan <kys@microsoft.com>; Stephen
> Hemminger <sthemmin@microsoft.com>; olaf@aepfle.de; vkuznets
> <vkuznets@redhat.com>; davem@davemloft.net; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH hyperv-fixes] hv_netvsc: Fix unwanted wakeup after
> tx_disable
> 
> On Thu, 28 Mar 2019 17:48:45 +0000
> Haiyang Zhang <haiyangz@linuxonhyperv.com> wrote:
> 
> > +static inline void netvsc_tx_enable(struct netvsc_device *nvscdev,
> > +				    struct net_device *ndev)
> > +{
> > +	nvscdev->tx_disable = false;
> > +	mb(); /* ensure queue wake up mechanism is on */
> > +
> > +	netif_tx_wake_all_queues(ndev);
> > +}
> 
> You don't need a full mb(). virt_wmb() should be sufficient.

I will make this change. 

> Could I suggest an alternative approach.
> You don't need to introduce a local tx_disable flag, the only place where a
> wakeup could cause problems is after a send_completion was processed
> during detach state.
> 
> Instead, just avoid wakeup in that place.

In netvsc_detach(), after netif_tx_disable(), we call  netvsc_wait_until_empty(nvdev);
TX patch should not be waken up again while waiting for in/out ring to becomes empty.

In my tests before this patch, there are wakeup happens before netif_device_detach(), 
so netif_device_present(ndev) is still true at that time.

In other places, like netvsc_close(), link_change(), we also don't want wakeup after tx_disable.

Thanks.
- Haiyang

> 
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -720,6 +720,7 @@ static void netvsc_send_tx_complete(struct
> net_device *ndev,
>                 struct netdev_queue *txq = netdev_get_tx_queue(ndev, q_idx);
> 
>                 if (netif_tx_queue_stopped(txq) &&
> +                   netif_device_present(ndev) &&
>                     (hv_get_avail_to_write_percent(&channel->outbound) >
>                      RING_AVAIL_PERCENT_HIWATER || queue_sends < 1)) {
>                         netif_tx_wake_queue(txq);

^ permalink raw reply

* Re: [PATCH hyperv-fixes] hv_netvsc: Fix unwanted wakeup after tx_disable
From: David Miller @ 2019-03-28 19:21 UTC (permalink / raw)
  To: haiyangz
  Cc: sashal, linux-hyperv, haiyangz, kys, sthemmin, olaf, vkuznets,
	netdev, linux-kernel
In-Reply-To: <20190328174845.4799-1-haiyangz@linuxonhyperv.com>

From: Haiyang Zhang <haiyangz@linuxonhyperv.com>
Date: Thu, 28 Mar 2019 17:48:45 +0000

> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index 1a08679..0824155 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -109,6 +109,15 @@ static void netvsc_set_rx_mode(struct net_device *net)
>  	rcu_read_unlock();
>  }
>  
> +static inline void netvsc_tx_enable(struct netvsc_device *nvscdev,
> +				    struct net_device *ndev)

Do not use inline in foo.c files, let the compiler decide.

> @@ -184,6 +193,17 @@ static int netvsc_wait_until_empty(struct netvsc_device *nvdev)
>  	}
>  }
>  
> +static inline void netvsc_tx_disable(struct netvsc_device *nvscdev,
> +				     struct net_device *ndev)
> +{

Likewise.

^ permalink raw reply

* Re: [PATCH hyperv-fixes,v2] hv_netvsc: Fix unwanted wakeup after tx_disable
From: David Miller @ 2019-03-28 19:22 UTC (permalink / raw)
  To: haiyangz
  Cc: sashal, linux-hyperv, haiyangz, kys, sthemmin, olaf, vkuznets,
	netdev, linux-kernel
In-Reply-To: <20190328191637.6698-1-haiyangz@linuxonhyperv.com>

From: Haiyang Zhang <haiyangz@linuxonhyperv.com>
Date: Thu, 28 Mar 2019 19:16:37 +0000

> From: Haiyang Zhang <haiyangz@microsoft.com>
> 
> After queue stopped, the wakeup mechanism may wake it up again
> when ring buffer usage is lower than a threshold. This may cause
> send path panic on NULL pointer when we stopped all tx queues in
> netvsc_detach and start removing the netvsc device.
> 
> This patch fix it by adding a tx_disable flag to prevent unwanted
> queue wakeup.
> 
> Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> Reported-by: Mohammed Gamal <mgamal@redhat.com>
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>

Still need to fix the inline issues I mentioned in v1.

^ permalink raw reply

* Re: [PATCH hyperv-fixes,v2] hv_netvsc: Fix unwanted wakeup after tx_disable
From: David Miller @ 2019-03-28 19:23 UTC (permalink / raw)
  To: haiyangz
  Cc: sashal, linux-hyperv, haiyangz, kys, sthemmin, olaf, vkuznets,
	netdev, linux-kernel
In-Reply-To: <20190328.122214.1269115693598269959.davem@davemloft.net>


Also, your email bounces:

<haiyangz@linuxonhyperv.com>: host mail.linuxonhyperv.com[107.180.71.197] said:
    550 No Such User Here (in reply to RCPT TO command)

^ permalink raw reply

* RE: [PATCH hyperv-fixes,v2] hv_netvsc: Fix unwanted wakeup after tx_disable
From: Haiyang Zhang @ 2019-03-28 19:27 UTC (permalink / raw)
  To: David Miller
  Cc: sashal@kernel.org, linux-hyperv@vger.kernel.org, KY Srinivasan,
	Stephen Hemminger, olaf@aepfle.de, vkuznets,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20190328.122214.1269115693598269959.davem@davemloft.net>



> -----Original Message-----
> From: linux-hyperv-owner@vger.kernel.org <linux-hyperv-
> owner@vger.kernel.org> On Behalf Of David Miller
> Sent: Thursday, March 28, 2019 3:22 PM
> To: haiyangz@linuxonhyperv.com
> Cc: sashal@kernel.org; linux-hyperv@vger.kernel.org; Haiyang Zhang
> <haiyangz@microsoft.com>; KY Srinivasan <kys@microsoft.com>; Stephen
> Hemminger <sthemmin@microsoft.com>; olaf@aepfle.de; vkuznets
> <vkuznets@redhat.com>; netdev@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH hyperv-fixes,v2] hv_netvsc: Fix unwanted wakeup after
> tx_disable
> 
> From: Haiyang Zhang <haiyangz@linuxonhyperv.com>
> Date: Thu, 28 Mar 2019 19:16:37 +0000
> 
> > From: Haiyang Zhang <haiyangz@microsoft.com>
> >
> > After queue stopped, the wakeup mechanism may wake it up again when
> > ring buffer usage is lower than a threshold. This may cause send path
> > panic on NULL pointer when we stopped all tx queues in netvsc_detach
> > and start removing the netvsc device.
> >
> > This patch fix it by adding a tx_disable flag to prevent unwanted
> > queue wakeup.
> >
> > Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> > Reported-by: Mohammed Gamal <mgamal@redhat.com>
> > Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> 
> Still need to fix the inline issues I mentioned in v1.

I will fix the inline issues. 

The haiyangz@linuxonhyperv.com cannot receive emails. 
My regular email is in the reply-to: field. 
Also, linuxonhyperv.com is being replaced by a new system soon.

Thanks.
- Haiyang

^ permalink raw reply

* [PATCH hyperv-fixes,v2] hv_netvsc: Fix unwanted wakeup after tx_disable
From: Haiyang Zhang @ 2019-03-28 19:16 UTC (permalink / raw)
  To: sashal, linux-hyperv
  Cc: haiyangz, kys, sthemmin, olaf, vkuznets, davem, netdev,
	linux-kernel

From: Haiyang Zhang <haiyangz@microsoft.com>

After queue stopped, the wakeup mechanism may wake it up again
when ring buffer usage is lower than a threshold. This may cause
send path panic on NULL pointer when we stopped all tx queues in
netvsc_detach and start removing the netvsc device.

This patch fix it by adding a tx_disable flag to prevent unwanted
queue wakeup.

Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
Reported-by: Mohammed Gamal <mgamal@redhat.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/hyperv/hyperv_net.h |  1 +
 drivers/net/hyperv/netvsc.c     |  6 ++++--
 drivers/net/hyperv/netvsc_drv.c | 32 ++++++++++++++++++++++++++------
 3 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index e859ae2..49f41b6 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -987,6 +987,7 @@ struct netvsc_device {
 
 	wait_queue_head_t wait_drain;
 	bool destroy;
+	bool tx_disable; /* if true, do not wake up queue again */
 
 	/* Receive buffer allocated by us but manages by NetVSP */
 	void *recv_buf;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 813d195..e0dce37 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -110,6 +110,7 @@ static struct netvsc_device *alloc_net_device(void)
 
 	init_waitqueue_head(&net_device->wait_drain);
 	net_device->destroy = false;
+	net_device->tx_disable = false;
 
 	net_device->max_pkt = RNDIS_MAX_PKT_DEFAULT;
 	net_device->pkt_align = RNDIS_PKT_ALIGN_DEFAULT;
@@ -719,7 +720,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
 	} else {
 		struct netdev_queue *txq = netdev_get_tx_queue(ndev, q_idx);
 
-		if (netif_tx_queue_stopped(txq) &&
+		if (netif_tx_queue_stopped(txq) && !net_device->tx_disable &&
 		    (hv_get_avail_to_write_percent(&channel->outbound) >
 		     RING_AVAIL_PERCENT_HIWATER || queue_sends < 1)) {
 			netif_tx_wake_queue(txq);
@@ -874,7 +875,8 @@ static inline int netvsc_send_pkt(
 	} else if (ret == -EAGAIN) {
 		netif_tx_stop_queue(txq);
 		ndev_ctx->eth_stats.stop_queue++;
-		if (atomic_read(&nvchan->queue_sends) < 1) {
+		if (atomic_read(&nvchan->queue_sends) < 1 &&
+		    !net_device->tx_disable) {
 			netif_tx_wake_queue(txq);
 			ndev_ctx->eth_stats.wake_queue++;
 			ret = -ENOSPC;
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 1a08679..ffb7922 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -109,6 +109,15 @@ static void netvsc_set_rx_mode(struct net_device *net)
 	rcu_read_unlock();
 }
 
+static inline void netvsc_tx_enable(struct netvsc_device *nvscdev,
+				    struct net_device *ndev)
+{
+	nvscdev->tx_disable = false;
+	virt_wmb(); /* ensure queue wake up mechanism is on */
+
+	netif_tx_wake_all_queues(ndev);
+}
+
 static int netvsc_open(struct net_device *net)
 {
 	struct net_device_context *ndev_ctx = netdev_priv(net);
@@ -129,7 +138,7 @@ static int netvsc_open(struct net_device *net)
 	rdev = nvdev->extension;
 	if (!rdev->link_state) {
 		netif_carrier_on(net);
-		netif_tx_wake_all_queues(net);
+		netvsc_tx_enable(nvdev, net);
 	}
 
 	if (vf_netdev) {
@@ -184,6 +193,17 @@ static int netvsc_wait_until_empty(struct netvsc_device *nvdev)
 	}
 }
 
+static inline void netvsc_tx_disable(struct netvsc_device *nvscdev,
+				     struct net_device *ndev)
+{
+	if (nvscdev) {
+		nvscdev->tx_disable = true;
+		virt_wmb(); /* ensure txq will not wake up after stop */
+	}
+
+	netif_tx_disable(ndev);
+}
+
 static int netvsc_close(struct net_device *net)
 {
 	struct net_device_context *net_device_ctx = netdev_priv(net);
@@ -192,7 +212,7 @@ static int netvsc_close(struct net_device *net)
 	struct netvsc_device *nvdev = rtnl_dereference(net_device_ctx->nvdev);
 	int ret;
 
-	netif_tx_disable(net);
+	netvsc_tx_disable(nvdev, net);
 
 	/* No need to close rndis filter if it is removed already */
 	if (!nvdev)
@@ -918,7 +938,7 @@ static int netvsc_detach(struct net_device *ndev,
 
 	/* If device was up (receiving) then shutdown */
 	if (netif_running(ndev)) {
-		netif_tx_disable(ndev);
+		netvsc_tx_disable(nvdev, ndev);
 
 		ret = rndis_filter_close(nvdev);
 		if (ret) {
@@ -1906,7 +1926,7 @@ static void netvsc_link_change(struct work_struct *w)
 		if (rdev->link_state) {
 			rdev->link_state = false;
 			netif_carrier_on(net);
-			netif_tx_wake_all_queues(net);
+			netvsc_tx_enable(net_device, net);
 		} else {
 			notify = true;
 		}
@@ -1916,7 +1936,7 @@ static void netvsc_link_change(struct work_struct *w)
 		if (!rdev->link_state) {
 			rdev->link_state = true;
 			netif_carrier_off(net);
-			netif_tx_stop_all_queues(net);
+			netvsc_tx_disable(net_device, net);
 		}
 		kfree(event);
 		break;
@@ -1925,7 +1945,7 @@ static void netvsc_link_change(struct work_struct *w)
 		if (!rdev->link_state) {
 			rdev->link_state = true;
 			netif_carrier_off(net);
-			netif_tx_stop_all_queues(net);
+			netvsc_tx_disable(net_device, net);
 			event->event = RNDIS_STATUS_MEDIA_CONNECT;
 			spin_lock_irqsave(&ndev_ctx->lock, flags);
 			list_add(&event->list, &ndev_ctx->reconfig_events);
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH hyperv-fixes,v3] hv_netvsc: Fix unwanted wakeup after tx_disable
From: Haiyang Zhang @ 2019-03-28 19:40 UTC (permalink / raw)
  To: sashal, linux-hyperv
  Cc: haiyangz, kys, sthemmin, olaf, vkuznets, davem, netdev,
	linux-kernel

From: Haiyang Zhang <haiyangz@microsoft.com>

After queue stopped, the wakeup mechanism may wake it up again
when ring buffer usage is lower than a threshold. This may cause
send path panic on NULL pointer when we stopped all tx queues in
netvsc_detach and start removing the netvsc device.

This patch fix it by adding a tx_disable flag to prevent unwanted
queue wakeup.

Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
Reported-by: Mohammed Gamal <mgamal@redhat.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/hyperv/hyperv_net.h |  1 +
 drivers/net/hyperv/netvsc.c     |  6 ++++--
 drivers/net/hyperv/netvsc_drv.c | 32 ++++++++++++++++++++++++++------
 3 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index e859ae2..49f41b6 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -987,6 +987,7 @@ struct netvsc_device {
 
 	wait_queue_head_t wait_drain;
 	bool destroy;
+	bool tx_disable; /* if true, do not wake up queue again */
 
 	/* Receive buffer allocated by us but manages by NetVSP */
 	void *recv_buf;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 813d195..e0dce37 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -110,6 +110,7 @@ static struct netvsc_device *alloc_net_device(void)
 
 	init_waitqueue_head(&net_device->wait_drain);
 	net_device->destroy = false;
+	net_device->tx_disable = false;
 
 	net_device->max_pkt = RNDIS_MAX_PKT_DEFAULT;
 	net_device->pkt_align = RNDIS_PKT_ALIGN_DEFAULT;
@@ -719,7 +720,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
 	} else {
 		struct netdev_queue *txq = netdev_get_tx_queue(ndev, q_idx);
 
-		if (netif_tx_queue_stopped(txq) &&
+		if (netif_tx_queue_stopped(txq) && !net_device->tx_disable &&
 		    (hv_get_avail_to_write_percent(&channel->outbound) >
 		     RING_AVAIL_PERCENT_HIWATER || queue_sends < 1)) {
 			netif_tx_wake_queue(txq);
@@ -874,7 +875,8 @@ static inline int netvsc_send_pkt(
 	} else if (ret == -EAGAIN) {
 		netif_tx_stop_queue(txq);
 		ndev_ctx->eth_stats.stop_queue++;
-		if (atomic_read(&nvchan->queue_sends) < 1) {
+		if (atomic_read(&nvchan->queue_sends) < 1 &&
+		    !net_device->tx_disable) {
 			netif_tx_wake_queue(txq);
 			ndev_ctx->eth_stats.wake_queue++;
 			ret = -ENOSPC;
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 1a08679..06393b2 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -109,6 +109,15 @@ static void netvsc_set_rx_mode(struct net_device *net)
 	rcu_read_unlock();
 }
 
+static void netvsc_tx_enable(struct netvsc_device *nvscdev,
+			     struct net_device *ndev)
+{
+	nvscdev->tx_disable = false;
+	virt_wmb(); /* ensure queue wake up mechanism is on */
+
+	netif_tx_wake_all_queues(ndev);
+}
+
 static int netvsc_open(struct net_device *net)
 {
 	struct net_device_context *ndev_ctx = netdev_priv(net);
@@ -129,7 +138,7 @@ static int netvsc_open(struct net_device *net)
 	rdev = nvdev->extension;
 	if (!rdev->link_state) {
 		netif_carrier_on(net);
-		netif_tx_wake_all_queues(net);
+		netvsc_tx_enable(nvdev, net);
 	}
 
 	if (vf_netdev) {
@@ -184,6 +193,17 @@ static int netvsc_wait_until_empty(struct netvsc_device *nvdev)
 	}
 }
 
+static void netvsc_tx_disable(struct netvsc_device *nvscdev,
+			      struct net_device *ndev)
+{
+	if (nvscdev) {
+		nvscdev->tx_disable = true;
+		virt_wmb(); /* ensure txq will not wake up after stop */
+	}
+
+	netif_tx_disable(ndev);
+}
+
 static int netvsc_close(struct net_device *net)
 {
 	struct net_device_context *net_device_ctx = netdev_priv(net);
@@ -192,7 +212,7 @@ static int netvsc_close(struct net_device *net)
 	struct netvsc_device *nvdev = rtnl_dereference(net_device_ctx->nvdev);
 	int ret;
 
-	netif_tx_disable(net);
+	netvsc_tx_disable(nvdev, net);
 
 	/* No need to close rndis filter if it is removed already */
 	if (!nvdev)
@@ -918,7 +938,7 @@ static int netvsc_detach(struct net_device *ndev,
 
 	/* If device was up (receiving) then shutdown */
 	if (netif_running(ndev)) {
-		netif_tx_disable(ndev);
+		netvsc_tx_disable(nvdev, ndev);
 
 		ret = rndis_filter_close(nvdev);
 		if (ret) {
@@ -1906,7 +1926,7 @@ static void netvsc_link_change(struct work_struct *w)
 		if (rdev->link_state) {
 			rdev->link_state = false;
 			netif_carrier_on(net);
-			netif_tx_wake_all_queues(net);
+			netvsc_tx_enable(net_device, net);
 		} else {
 			notify = true;
 		}
@@ -1916,7 +1936,7 @@ static void netvsc_link_change(struct work_struct *w)
 		if (!rdev->link_state) {
 			rdev->link_state = true;
 			netif_carrier_off(net);
-			netif_tx_stop_all_queues(net);
+			netvsc_tx_disable(net_device, net);
 		}
 		kfree(event);
 		break;
@@ -1925,7 +1945,7 @@ static void netvsc_link_change(struct work_struct *w)
 		if (!rdev->link_state) {
 			rdev->link_state = true;
 			netif_carrier_off(net);
-			netif_tx_stop_all_queues(net);
+			netvsc_tx_disable(net_device, net);
 			event->event = RNDIS_STATUS_MEDIA_CONNECT;
 			spin_lock_irqsave(&ndev_ctx->lock, flags);
 			list_add(&event->list, &ndev_ctx->reconfig_events);
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH hyperv-fixes] hv_netvsc: Fix unwanted wakeup after tx_disable
From: Stephen Hemminger @ 2019-03-28 19:42 UTC (permalink / raw)
  To: Haiyang Zhang
  Cc: Haiyang Zhang, sashal@kernel.org, linux-hyperv@vger.kernel.org,
	KY Srinivasan, Stephen Hemminger, olaf@aepfle.de, vkuznets,
	davem@davemloft.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <DM6PR21MB1337250A47CEE73D1EFE0AB7CA590@DM6PR21MB1337.namprd21.prod.outlook.com>

On Thu, 28 Mar 2019 19:00:18 +0000
Haiyang Zhang <haiyangz@microsoft.com> wrote:

> > -----Original Message-----
> > From: Stephen Hemminger <stephen@networkplumber.org>
> > Sent: Thursday, March 28, 2019 2:38 PM
> > To: Haiyang Zhang <haiyangz@linuxonhyperv.com>
> > Cc: sashal@kernel.org; linux-hyperv@vger.kernel.org; Haiyang Zhang
> > <haiyangz@microsoft.com>; KY Srinivasan <kys@microsoft.com>; Stephen
> > Hemminger <sthemmin@microsoft.com>; olaf@aepfle.de; vkuznets
> > <vkuznets@redhat.com>; davem@davemloft.net; netdev@vger.kernel.org;
> > linux-kernel@vger.kernel.org
> > Subject: Re: [PATCH hyperv-fixes] hv_netvsc: Fix unwanted wakeup after
> > tx_disable
> > 
> > On Thu, 28 Mar 2019 17:48:45 +0000
> > Haiyang Zhang <haiyangz@linuxonhyperv.com> wrote:
> >   
> > > +static inline void netvsc_tx_enable(struct netvsc_device *nvscdev,
> > > +				    struct net_device *ndev)
> > > +{
> > > +	nvscdev->tx_disable = false;
> > > +	mb(); /* ensure queue wake up mechanism is on */
> > > +
> > > +	netif_tx_wake_all_queues(ndev);
> > > +}  
> > 
> > You don't need a full mb(). virt_wmb() should be sufficient.  
> 
> I will make this change. 
> 
> > Could I suggest an alternative approach.
> > You don't need to introduce a local tx_disable flag, the only place where a
> > wakeup could cause problems is after a send_completion was processed
> > during detach state.
> > 
> > Instead, just avoid wakeup in that place.  
> 
> In netvsc_detach(), after netif_tx_disable(), we call  netvsc_wait_until_empty(nvdev);
> TX patch should not be waken up again while waiting for in/out ring to becomes empty.
> 
> In my tests before this patch, there are wakeup happens before netif_device_detach(), 
> so netif_device_present(ndev) is still true at that time.
> 
> In other places, like netvsc_close(), link_change(), we also don't want wakeup after tx_disable.
> 
> Thanks.
> - Haiyang
> 
> > 
> > --- a/drivers/net/hyperv/netvsc.c
> > +++ b/drivers/net/hyperv/netvsc.c
> > @@ -720,6 +720,7 @@ static void netvsc_send_tx_complete(struct
> > net_device *ndev,
> >                 struct netdev_queue *txq = netdev_get_tx_queue(ndev, q_idx);
> > 
> >                 if (netif_tx_queue_stopped(txq) &&
> > +                   netif_device_present(ndev) &&
> >                     (hv_get_avail_to_write_percent(&channel->outbound) >
> >                      RING_AVAIL_PERCENT_HIWATER || queue_sends < 1)) {
> >                         netif_tx_wake_queue(txq);  


Then what about doing netif_detach earlier in netvsc_detach.

The state management is already (too) complex in netvsc and adding another
boolean flag just makes it harder to understand.

^ permalink raw reply

* RE: [PATCH hyperv-fixes] hv_netvsc: Fix unwanted wakeup after tx_disable
From: Haiyang Zhang @ 2019-03-28 19:59 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: sashal@kernel.org, linux-hyperv@vger.kernel.org, KY Srinivasan,
	Stephen Hemminger, olaf@aepfle.de, vkuznets, davem@davemloft.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20190328124205.42f5f337@shemminger-XPS-13-9360>



> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Thursday, March 28, 2019 3:42 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@linuxonhyperv.com>; sashal@kernel.org;
> linux-hyperv@vger.kernel.org; KY Srinivasan <kys@microsoft.com>; Stephen
> Hemminger <sthemmin@microsoft.com>; olaf@aepfle.de; vkuznets
> <vkuznets@redhat.com>; davem@davemloft.net; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH hyperv-fixes] hv_netvsc: Fix unwanted wakeup after
> tx_disable
> 
> On Thu, 28 Mar 2019 19:00:18 +0000
> Haiyang Zhang <haiyangz@microsoft.com> wrote:
> 
> > > -----Original Message-----
> > > From: Stephen Hemminger <stephen@networkplumber.org>
> > > Sent: Thursday, March 28, 2019 2:38 PM
> > > To: Haiyang Zhang <haiyangz@linuxonhyperv.com>
> > > Cc: sashal@kernel.org; linux-hyperv@vger.kernel.org; Haiyang Zhang
> > > <haiyangz@microsoft.com>; KY Srinivasan <kys@microsoft.com>;
> Stephen
> > > Hemminger <sthemmin@microsoft.com>; olaf@aepfle.de; vkuznets
> > > <vkuznets@redhat.com>; davem@davemloft.net;
> netdev@vger.kernel.org;
> > > linux-kernel@vger.kernel.org
> > > Subject: Re: [PATCH hyperv-fixes] hv_netvsc: Fix unwanted wakeup
> > > after tx_disable
> > >
> > > On Thu, 28 Mar 2019 17:48:45 +0000
> > > Haiyang Zhang <haiyangz@linuxonhyperv.com> wrote:
> > >
> > > > +static inline void netvsc_tx_enable(struct netvsc_device *nvscdev,
> > > > +				    struct net_device *ndev)
> > > > +{
> > > > +	nvscdev->tx_disable = false;
> > > > +	mb(); /* ensure queue wake up mechanism is on */
> > > > +
> > > > +	netif_tx_wake_all_queues(ndev);
> > > > +}
> > >
> > > You don't need a full mb(). virt_wmb() should be sufficient.
> >
> > I will make this change.
> >
> > > Could I suggest an alternative approach.
> > > You don't need to introduce a local tx_disable flag, the only place
> > > where a wakeup could cause problems is after a send_completion was
> > > processed during detach state.
> > >
> > > Instead, just avoid wakeup in that place.
> >
> > In netvsc_detach(), after netif_tx_disable(), we call
> > netvsc_wait_until_empty(nvdev); TX patch should not be waken up again
> while waiting for in/out ring to becomes empty.
> >
> > In my tests before this patch, there are wakeup happens before
> > netif_device_detach(), so netif_device_present(ndev) is still true at that
> time.
> >
> > In other places, like netvsc_close(), link_change(), we also don't want
> wakeup after tx_disable.
> >
> > Thanks.
> > - Haiyang
> >
> > >
> > > --- a/drivers/net/hyperv/netvsc.c
> > > +++ b/drivers/net/hyperv/netvsc.c
> > > @@ -720,6 +720,7 @@ static void netvsc_send_tx_complete(struct
> > > net_device *ndev,
> > >                 struct netdev_queue *txq = netdev_get_tx_queue(ndev,
> > > q_idx);
> > >
> > >                 if (netif_tx_queue_stopped(txq) &&
> > > +                   netif_device_present(ndev) &&
> > >                     (hv_get_avail_to_write_percent(&channel->outbound) >
> > >                      RING_AVAIL_PERCENT_HIWATER || queue_sends < 1)) {
> > >                         netif_tx_wake_queue(txq);
> 
> 
> Then what about doing netif_detach earlier in netvsc_detach.
> 
> The state management is already (too) complex in netvsc and adding another
> boolean flag just makes it harder to understand.

If we move netif_device_detach() before the netvsc_wait_until_empty(), the remaining
Packets in the receive buffer will be passed to a detached device.

Also, in case of netvsc_close() and link_change(), we don't call netif_device_detach().

Thanks,
- Haiyang

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox