Linux-HyperV List
 help / color / mirror / Atom feed
* Re: [PATCH v3 3/3] Drivers: hv: vmbus: Fix race condition with new ring_buffer_info mutex
From: Kimberly Brown @ 2019-03-21  3:47 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui, KY Srinivasan, Haiyang Zhang, linux-hyperv,
	linux-kernel
In-Reply-To: <20190320130619.07e49c97@shemminger-XPS-13-9360>

On Wed, Mar 20, 2019 at 01:06:19PM -0700, Stephen Hemminger wrote:
> On Sat, 16 Mar 2019 21:49:28 -0400
> Kimberly Brown <kimbrownkd@gmail.com> wrote:
> 
> > On Thu, Mar 14, 2019 at 03:45:33PM -0700, Stephen Hemminger wrote:
> > > On Thu, 14 Mar 2019 13:05:15 -0700
> > > "Kimberly Brown" <kimbrownkd@gmail.com> wrote:
> > >   
> > > > Fix a race condition that can result in a ring buffer pointer being set
> > > > to null while a "_show" function is reading the ring buffer's data. This
> > > > problem was discussed here:
> > > > https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org
> > > > %2Flkml%2F2018%2F10%2F18%2F779&amp;data=02%7C01%7Csthemmin%40microsoft.com
> > > > %7C1d7557d667b741bdbb6008d6a8b8620f%7C72f988bf86f141af91ab2d7cd011db47%7C1
> > > > %7C0%7C636881907217609564&amp;sdata=1bUbLaxsODANM7lCBR8lxyYajNpufuwUW%2FOl
> > > > vtGu2hU%3D&amp;reserved=0
> > > > 
> > > > To fix the race condition, add a new mutex lock to the
> > > > "hv_ring_buffer_info" struct. Add a new function,
> > > > "hv_ringbuffer_pre_init()", where a channel's inbound and outbound
> > > > ring_buffer_info mutex locks are initialized.
> > > >
> > > >  ... snip ...    
> > > 
> > > Adding more locks will solve the problem but it seems like overkill.
> > > Why not either use a reference count or an RCU style access for the
> > > ring buffer?  
> > 
> > I agree that a reference count or RCU could also solve this problem.
> > Using mutex locks seemed like the most straightforward solution, but
> > I'll certainly switch to a different approach if it's better!
> > 
> > Are you concerned about the extra memory required for the mutex locks,
> > read performance, or something else?
> 
> Locks in control path are ok, but my concern is performance of the
> data path which puts packets in/out of rings. To keep reasonable performance,
> no additional locking should be added in those paths.
> 
> So if data path is using RCU, can/should the control operations also
> use it?

The data path doesn't use RCU to protect the ring buffers.

^ permalink raw reply

* Re: [PATCH v6] Drivers: hv: vmbus: Expose monitor data only when monitor pages are used
From: Sasha Levin @ 2019-03-21  3:57 UTC (permalink / raw)
  To: Kimberly Brown
  Cc: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui, Greg KH, K. Y. Srinivasan, Haiyang Zhang,
	linux-hyperv, linux-kernel
In-Reply-To: <20190319040401.GA3050@ubu-Virtual-Machine>

On Tue, Mar 19, 2019 at 12:04:01AM -0400, Kimberly Brown wrote:
>There are two methods for signaling the host: the monitor page mechanism
>and hypercalls. The monitor page mechanism is used by performance
>critical channels (storage, networking, etc.) because it provides
>improved throughput. However, latency is increased. Monitor pages are
>allocated to these channels.
>
>Monitor pages are not allocated to channels that do not use the monitor
>page mechanism. Therefore, these channels do not have a valid monitor id
>or valid monitor page data. In these cases, some of the "_show"
>functions return incorrect data. They return an invalid monitor id and
>data that is beyond the bounds of the hv_monitor_page array fields.
>
>The "channel->offermsg.monitor_allocated" value can be used to determine
>whether monitor pages have been allocated to a channel.
>
>Add "is_visible()" callback functions for the device-level and
>channel-level attribute groups. These functions will hide the monitor
>sysfs files when the monitor mechanism is not used.
>
>Remove ".default_attributes" from "vmbus_chan_attrs" and create a
>channel-level attribute group. These changes allow the new
>"is_visible()" callback function to be applied to the channel-level
>attributes.
>
>Call "sysfs_create_group()" in "vmbus_add_channel_kobj()" to create the
>channel's sysfs files. Add a new function,
>“vmbus_remove_channel_attr_group()”, and call it in "free_channel()" to
>remove the channel's sysfs files when the channel is closed.
>
>Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>

I've queued it up for hyperv-fixes, thanks Kimberly.

Should this go in stable as well?

--
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH v2 0/2] Drivers: hv: Move Hyper-V clock/timer code to separate clocksource driver
From: Vitaly Kuznetsov @ 2019-03-21 14:24 UTC (permalink / raw)
  To: Michael Kelley
  Cc: catalin.marinas@arm.com, mark.rutland@arm.com,
	will.deacon@arm.com, marc.zyngier@arm.com,
	linux-arm-kernel@lists.infradead.org, gregkh@linuxfoundation.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com,
	marcelo.cerri@canonical.com, Sunil Muthuswamy, KY Srinivasan
In-Reply-To: <1552849512-10063-1-git-send-email-mikelley@microsoft.com>

Michael Kelley <mikelley@microsoft.com> writes:

> This patch series moves Hyper-V clock/timer code to a separate Hyper-V
> clocksource driver. Previously, Hyper-V clock/timer code and data
> structures were mixed in with other Hyper-V code in the ISA independent
> drivers/hv code as well as in arch dependent code. The new Hyper-V
> clocksource driver is ISA independent, with a just few dependencies on
> arch specific functions. The patch series does not change any behavior
> or functionality -- it only reorganizes the existing code and fixes up
> the linkages. A few places outside of Hyper-V code are fixed up to use
> the new #include file structure.
>
> This restructuring is in response to Marc Zyngier's review comments
> on supporting Hyper-V running on ARM64, and is a good idea in general.
> It increases the amount of code shared between the x86 and ARM64
> architectures, and reduces the size of the new code for supporting
> Hyper-V on ARM64. A new version of the Hyper-V on ARM64 patches will
> follow once this clocksource restructuring is accepted.
>
> The code is currently diff'ed against Linux 5.0.  I'll rebase
> to linux-next once 5.1-rc1 is available.
>
> Changes in v2:
> * Revised commit short descriptions so the distinction between
> the two patches is clearer [GregKH]
> * Renamed new clocksource driver files and functions to use
> existing "timer" and "stimer" names instead of introducing
> "syntimer". [Vitaly Kuznetsov]
> * Introduced CONFIG_HYPER_TIMER to fix build problem when
> CONFIG_HYPERV=m [Vitaly Kuznetsov]

Actually, it is "CONFIG_HYPERV_TIMER" but the typo appears only in this
blurb :-)

> * Added "Suggested-by: Marc Zyngier"
>
> Michael Kelley (2):
>   Drivers: hv: Create Hyper-V clocksource driver from existing
>     clockevents code
>   Drivers: hv: Move Hyper-V clocksource code to new clocksource driver

Thanks for fixing v1 issues!

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>

>
>  MAINTAINERS                          |   2 +
>  arch/x86/entry/vdso/vclock_gettime.c |   1 +
>  arch/x86/entry/vdso/vma.c            |   2 +-
>  arch/x86/hyperv/hv_init.c            |  91 +---------
>  arch/x86/include/asm/hyperv-tlfs.h   |   6 +
>  arch/x86/include/asm/mshyperv.h      |  80 ++-------
>  arch/x86/kernel/cpu/mshyperv.c       |   2 +
>  arch/x86/kvm/x86.c                   |   1 +
>  drivers/clocksource/Makefile         |   1 +
>  drivers/clocksource/hyperv_timer.c   | 328 +++++++++++++++++++++++++++++++++++
>  drivers/hv/Kconfig                   |   3 +
>  drivers/hv/hv.c                      | 154 ----------------
>  drivers/hv/hyperv_vmbus.h            |   3 -
>  drivers/hv/vmbus_drv.c               |  39 +++--
>  include/clocksource/hyperv_timer.h   | 104 +++++++++++
>  15 files changed, 485 insertions(+), 332 deletions(-)
>  create mode 100644 drivers/clocksource/hyperv_timer.c
>  create mode 100644 include/clocksource/hyperv_timer.h

-- 
Vitaly

^ permalink raw reply

* RE: [PATCH v6] Drivers: hv: vmbus: Expose monitor data only when monitor pages are used
From: Michael Kelley @ 2019-03-21 15:55 UTC (permalink / raw)
  To: Sasha Levin, kimbrownkd
  Cc: Long Li, Sasha Levin, Stephen Hemminger, Dexuan Cui, Greg KH,
	KY Srinivasan, Haiyang Zhang, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190321035705.GE25262@sasha-vm>

From: Sasha Levin <sashal@kernel.org> Sent: Wednesday, March 20, 2019 8:57 PM
> 
> I've queued it up for hyperv-fixes, thanks Kimberly.
> 
> Should this go in stable as well?
> 

My take:  these changes lean more toward being a "clean up" rather
than fixing a problem that has significant impact.  The data in the sysfs
entries is bogus, but isn't actually hurting anything. The current code does
index off the end of an array in some cases, but the accesses are reads
and there's plenty of padding so that actually making an invalid memory
reference won't happen.   In the interest of reducing churn in the
stable branches, I would lean toward *not* backporting this to stable.

Michael

^ permalink raw reply

* RE: [PATCH v3 3/3] Drivers: hv: vmbus: Fix race condition with new ring_buffer_info mutex
From: Michael Kelley @ 2019-03-21 16:04 UTC (permalink / raw)
  To: kimbrownkd, Stephen Hemminger
  Cc: Long Li, Sasha Levin, Stephen Hemminger, Dexuan Cui,
	KY Srinivasan, Haiyang Zhang, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190321034752.GA6828@ubu-Virtual-Machine>

From: Kimberly Brown <kimbrownkd@gmail.com>  Sent: Wednesday, March 20, 2019 8:48 PM
> > > > Adding more locks will solve the problem but it seems like overkill.
> > > > Why not either use a reference count or an RCU style access for the
> > > > ring buffer?
> > >
> > > I agree that a reference count or RCU could also solve this problem.
> > > Using mutex locks seemed like the most straightforward solution, but
> > > I'll certainly switch to a different approach if it's better!
> > >
> > > Are you concerned about the extra memory required for the mutex locks,
> > > read performance, or something else?
> >
> > Locks in control path are ok, but my concern is performance of the
> > data path which puts packets in/out of rings. To keep reasonable performance,
> > no additional locking should be added in those paths.
> >
> > So if data path is using RCU, can/should the control operations also
> > use it?
> 
> The data path doesn't use RCU to protect the ring buffers.

My $.02:  The mutex is obtained only in the sysfs path and the "delete
ringbuffers" path, neither of which is performance or concurrency sensitive. 
There's no change to any path that reads or writes data from/to the ring
buffers.  It seems like the mutex is the most straightforward solution to
preventing sysfs from accessing the ring buffer info while the memory is
being freed as part of "delete ringbuffers".

Michael

^ permalink raw reply

* Bad file pattern in MAINTAINERS section 'Hyper-V CORE AND DRIVERS'
From: Joe Perches @ 2019-03-25 21:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: K . Y . Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
	linux-hyperv, Michael Kelley, Lan Tianyu, Joerg Roedel
In-Reply-To: <7cd8d12f59bcacd18a78f599b46dac555f7f16c0.camel@perches.com>

A file pattern line in this section of the MAINTAINERS file in linux-next
does not have a match in the linux source files.

This could occur because a matching filename was never added, was deleted
or renamed in some other commit.

The commits that added and if found renamed or removed the file pattern
are shown below.

Please fix this defect appropriately.

1: ---------------------------------------------------------------------------

linux-next MAINTAINERS section:

	7164	Hyper-V CORE AND DRIVERS
	7165	M:	"K. Y. Srinivasan" <kys@microsoft.com>
	7166	M:	Haiyang Zhang <haiyangz@microsoft.com>
	7167	M:	Stephen Hemminger <sthemmin@microsoft.com>
	7168	M:	Sasha Levin <sashal@kernel.org>
	7169	T:	git git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git
	7170	L:	linux-hyperv@vger.kernel.org
	7171	S:	Supported
	7172	F:	Documentation/networking/device_drivers/microsoft/netvsc.txt
	7173	F:	arch/x86/include/asm/mshyperv.h
	7174	F:	arch/x86/include/asm/trace/hyperv.h
	7175	F:	arch/x86/include/asm/hyperv-tlfs.h
	7176	F:	arch/x86/kernel/cpu/mshyperv.c
	7177	F:	arch/x86/hyperv
	7178	F:	drivers/hid/hid-hyperv.c
	7179	F:	drivers/hv/
	7180	F:	drivers/input/serio/hyperv-keyboard.c
	7181	F:	drivers/pci/controller/pci-hyperv.c
	7182	F:	drivers/net/hyperv/
	7183	F:	drivers/scsi/storvsc_drv.c
	7184	F:	drivers/uio/uio_hv_generic.c
	7185	F:	drivers/video/fbdev/hyperv_fb.c
-->	7186	F:	drivers/iommu/hyperv_iommu.c
	7187	F:	net/vmw_vsock/hyperv_transport.c
	7188	F:	include/linux/hyperv.h
	7189	F:	include/uapi/linux/hyperv.h
	7190	F:	tools/hv/
	7191	F:	Documentation/ABI/stable/sysfs-bus-vmbus

2: ---------------------------------------------------------------------------

The most recent commit that added or modified file pattern 'drivers/iommu/hyperv_iommu.c':

commit 32d5860a9e3c98b5043716fff05a7b20b15918f9
Author: Lan Tianyu <Tianyu.Lan@microsoft.com>
Date:   Wed Feb 27 22:54:05 2019 +0800

    MAINTAINERS: Add Hyper-V IOMMU driver into Hyper-V CORE AND DRIVERS scope
    
    This patch is to add Hyper-V IOMMU driver file into Hyper-V CORE and
    DRIVERS scope.
    
    Reviewed-by: Michael Kelley <mikelley@microsoft.com>
    Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
    Signed-off-by: Joerg Roedel <jroedel@suse.de>

 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

3: ---------------------------------------------------------------------------

No commit with file pattern 'drivers/iommu/hyperv_iommu.c' was found

^ permalink raw reply

* [PATCH] MAINTAINERS: Fix Hyperv vIOMMU driver file name
From: lantianyu1986 @ 2019-03-26  6:28 UTC (permalink / raw)
  To: davem, mchehab+samsung, gregkh, nicolas.ferre, tglx, mingo,
	konrad.wilk, jpoimboe, peterz, mojha, jkosina, riel, peterz,
	Tianyu.Lan, luto, michael.h.kelley, kys, sashal, joe
  Cc: linux-kernel, linux-hyperv

From: Lan Tianyu <Tianyu.Lan@microsoft.com>

The Hyperv vIOMMU file name should be "hyperv-iommu.c" rather
than "hyperv_iommu.c". This patch is to fix it.

Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index e17ebf7..403247d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7175,7 +7175,7 @@ F:	drivers/net/hyperv/
 F:	drivers/scsi/storvsc_drv.c
 F:	drivers/uio/uio_hv_generic.c
 F:	drivers/video/fbdev/hyperv_fb.c
-F:	drivers/iommu/hyperv_iommu.c
+F:	drivers/iommu/hyperv-iommu.c
 F:	net/vmw_vsock/hyperv_transport.c
 F:	include/linux/hyperv.h
 F:	include/uapi/linux/hyperv.h
-- 
2.7.4


^ permalink raw reply related

* RE: Bad file pattern in MAINTAINERS section 'Hyper-V CORE AND DRIVERS'
From: Tianyu Lan @ 2019-03-26  7:02 UTC (permalink / raw)
  To: Joe Perches, linux-kernel@vger.kernel.org
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
	linux-hyperv@vger.kernel.org, Michael Kelley, Joerg Roedel
In-Reply-To: <20190325212516.26489-1-joe@perches.com>

Hi Joe:
	Thanks for report. I just sent out a fix patch.

-----Original Message-----
From: Joe Perches <joe@perches.com> 
Sent: Tuesday, March 26, 2019 5:25 AM
To: linux-kernel@vger.kernel.org
Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>; Stephen Hemminger <sthemmin@microsoft.com>; Sasha Levin <sashal@kernel.org>; linux-hyperv@vger.kernel.org; Michael Kelley <mikelley@microsoft.com>; Tianyu Lan <Tianyu.Lan@microsoft.com>; Joerg Roedel <jroedel@suse.de>
Subject: Bad file pattern in MAINTAINERS section 'Hyper-V CORE AND DRIVERS'

A file pattern line in this section of the MAINTAINERS file in linux-next does not have a match in the linux source files.

This could occur because a matching filename was never added, was deleted or renamed in some other commit.

The commits that added and if found renamed or removed the file pattern are shown below.

Please fix this defect appropriately.

1: ---------------------------------------------------------------------------

linux-next MAINTAINERS section:

	7164	Hyper-V CORE AND DRIVERS
	7165	M:	"K. Y. Srinivasan" <kys@microsoft.com>
	7166	M:	Haiyang Zhang <haiyangz@microsoft.com>
	7167	M:	Stephen Hemminger <sthemmin@microsoft.com>
	7168	M:	Sasha Levin <sashal@kernel.org>
	7169	T:	git git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git
	7170	L:	linux-hyperv@vger.kernel.org
	7171	S:	Supported
	7172	F:	Documentation/networking/device_drivers/microsoft/netvsc.txt
	7173	F:	arch/x86/include/asm/mshyperv.h
	7174	F:	arch/x86/include/asm/trace/hyperv.h
	7175	F:	arch/x86/include/asm/hyperv-tlfs.h
	7176	F:	arch/x86/kernel/cpu/mshyperv.c
	7177	F:	arch/x86/hyperv
	7178	F:	drivers/hid/hid-hyperv.c
	7179	F:	drivers/hv/
	7180	F:	drivers/input/serio/hyperv-keyboard.c
	7181	F:	drivers/pci/controller/pci-hyperv.c
	7182	F:	drivers/net/hyperv/
	7183	F:	drivers/scsi/storvsc_drv.c
	7184	F:	drivers/uio/uio_hv_generic.c
	7185	F:	drivers/video/fbdev/hyperv_fb.c
-->	7186	F:	drivers/iommu/hyperv_iommu.c
	7187	F:	net/vmw_vsock/hyperv_transport.c
	7188	F:	include/linux/hyperv.h
	7189	F:	include/uapi/linux/hyperv.h
	7190	F:	tools/hv/
	7191	F:	Documentation/ABI/stable/sysfs-bus-vmbus

2: ---------------------------------------------------------------------------

The most recent commit that added or modified file pattern 'drivers/iommu/hyperv_iommu.c':

commit 32d5860a9e3c98b5043716fff05a7b20b15918f9
Author: Lan Tianyu <Tianyu.Lan@microsoft.com>
Date:   Wed Feb 27 22:54:05 2019 +0800

    MAINTAINERS: Add Hyper-V IOMMU driver into Hyper-V CORE AND DRIVERS scope
    
    This patch is to add Hyper-V IOMMU driver file into Hyper-V CORE and
    DRIVERS scope.
    
    Reviewed-by: Michael Kelley <mikelley@microsoft.com>
    Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
    Signed-off-by: Joerg Roedel <jroedel@suse.de>

 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

3: ---------------------------------------------------------------------------

No commit with file pattern 'drivers/iommu/hyperv_iommu.c' was found

^ permalink raw reply

* Re: [PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write
From: Mohammed Gamal @ 2019-03-26 14:05 UTC (permalink / raw)
  To: Haiyang Zhang, Stephen Hemminger, Michael Kelley,
	linux-hyperv@vger.kernel.org, kimbrownkd
  Cc: Sasha Levin, Dexuan Cui, Long Li, KY Srinivasan, vkuznets
In-Reply-To: <DM5PR2101MB0725E0BD19C4D4EBA1F2B9FCCA5E0@DM5PR2101MB0725.namprd21.prod.outlook.com>

On Mon, 2019-03-25 at 20:13 +0000, Haiyang Zhang wrote:
> Hi Mohammed,
>  
> I found by reading the code and testing – in netvsc_detach(), after
> netif_tx_disable(), the queues may be waken up again when ring buffer
> usage drops below the “low wartermark”. This is expected in normal
> conditions as part of our flow control mechanism.
>  
> But when we stopped all tx queues in netvsc_detach(), and start
> removing the netvsc device, this may cause send path panic on NULL
> pointer on a closed channel.
>  
> I have attached a patch for this issue, could you test it on your
> side?
>  
> Thanks,
> Haiyang

Hi Haiyang,

I've tested the patch and it seems to fix the problem.

Thanks,
Mohammed

>  
> From: Stephen Hemminger <sthemmin@microsoft.com> 
> Sent: Thursday, March 14, 2019 1:09 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>; Michael Kelley <mikelley@
> microsoft.com>; linux-hyperv@vger.kernel.org; kimbrownkd <kimbrownkd@
> gmail.com>; mgamal@redhat.com
> Cc: Sasha Levin <Alexander.Levin@microsoft.com>; Dexuan Cui <decui@mi
> crosoft.com>; Long Li <longli@microsoft.com>; KY Srinivasan <kys@micr
> osoft.com>; vkuznets <vkuznets@redhat.com>
> Subject: Re: [PATCH] hyper-v: Check for ring buffer in
> hv_get_bytes_to_read/write
>  
> There were lots of interations of netvsc driver to ensure that state
> was handled properly on queue changes.
> Can you reproduce this with current 5.0 kernel? If it only happens
> with driver backport then maybe something is different there (across
> netvsc and vmbus).
> From: Mohammed Gamal <mgamal@redhat.com>
> Sent: Thursday, March 14, 2019 5:42 AM
> To: Stephen Hemminger; Haiyang Zhang; Michael Kelley; linux-hyperv@vg
> er.kernel.org; kimbrownkd
> Cc: Sasha Levin; Dexuan Cui; Long Li; KY Srinivasan; vkuznets
> Subject: Re: [PATCH] hyper-v: Check for ring buffer in
> hv_get_bytes_to_read/write
>  
> On Wed, 2019-03-13 at 21:12 +0000, Stephen Hemminger wrote:
> > What test are you running?
> 
> I am running iperf3 with the following arguments:
> iperf3 -u -c ${iperf3 server address} -b 0 -P8 -t 3600
> 
> while changing the interface parameters in parallel with the
> following
> script:
> 
> cat ./test.sh
> #!/bin/bash
> device="eth1"
> i=0
> while [ "$i" -lt 1000 ]
> do
>     ethtool -L $device combined 1
>     ethtool -L $device combined 2
>     let "i++"
>     echo $i
> done
> 
> > 
> > -----Original Message-----
> > From: Mohammed Gamal <mgamal@redhat.com> 
> > Sent: Wednesday, March 13, 2019 3:25 AM
> > To: Haiyang Zhang <haiyangz@microsoft.com>; Michael Kelley
> <mikelley@
> > microsoft.com>; linux-hyperv@vger.kernel.org; kimbrownkd
> <kimbrownkd@
> > gmail.com>
> > Cc: Sasha Levin <Alexander.Levin@microsoft.com>; Dexuan Cui <decui@
> mi
> > crosoft.com>; Stephen Hemminger <sthemmin@microsoft.com>; Long Li
> <lo
> > ngli@microsoft.com>; KY Srinivasan <kys@microsoft.com>; vkuznets
> <vku
> > znets@redhat.com>; linux-kernel@vger.kernel.org
> > Subject: Re: [PATCH] hyper-v: Check for ring buffer in
> > hv_get_bytes_to_read/write
> > 
> > On Tue, 2019-03-12 at 18:02 +0000, Haiyang Zhang wrote:
> > >  
> > >  
> > > > -----Original Message-----
> > > > From: Mohammed Gamal <mgamal@redhat.com>
> > > > Sent: Thursday, March 7, 2019 1:32 PM
> > > > To: Michael Kelley <mikelley@microsoft.com>; linux-hyperv@vger.
> ke
> > > > rn
> > > 
> > > el.org;
> > > > kimbrownkd <kimbrownkd@gmail.com>
> > > > Cc: Sasha Levin <Alexander.Levin@microsoft.com>; Dexuan Cui
> > > > <decui@microsoft.com>; Stephen Hemminger <sthemmin@microsoft.co
> m>
> > > > ;
> > > > Long Li <longli@microsoft.com>; KY Srinivasan <kys@microsoft.co
> m>
> > > > ;
> > > 
> > > Haiyang
> > > > Zhang <haiyangz@microsoft.com>; vkuznets <vkuznets@redhat.com>;
> > > 
> > > linux-
> > > > kernel@vger.kernel.org
> > > > Subject: Re: [PATCH] hyper-v: Check for ring buffer in
> > > > hv_get_bytes_to_read/write
> > > > 
> > > > On Thu, 2019-03-07 at 17:33 +0000, Michael Kelley wrote:
> > > > > From: Mohammed Gamal <mgamal@redhat.com> Sent: Thursday,
> March
> > > > > 7,
> > > > > 2019 8:36 AM
> > > > > > 
> > > > > > This patch adds a check for the presence of the ring buffer
> > > > > > in
> > > > > > hv_get_bytes_to_read/write() to avoid possible NULL pointer
> > > > > > dereferences.
> > > > > > If the ring buffer is not yet allocated, return 0 bytes to
> be
> > > > > > read/written.
> > > > > > 
> > > > > > The root cause is that code that accesses the ring buffer
> > > 
> > > including
> > > > > > hv_get_bytes_to_read/write() could be vulnerable to the
> race
> > > > > > condition discussed in
> > > > > > https://nam06.safelinks.protection.outlook.com/?url=https%3
> A%
> > > > > > 2F
> > > 
> > > %2Flk
> > > > > > 
> > > > 
> > > >
> ml.org%2Flkml%2F2018%2F10%2F18%2F779&amp;data=02%7C01%7Chaiyangz
> > > > %40m
> > > > > > 
> > > > 
> > > >
> icrosoft.com%7C73af013c14034bb0b1ad08d6a32b419c%7C72f988bf86f141a
> > > > f9
> > > > 1
> > > > > > 
> > > > 
> > > > ab2d7cd011db47%7C1%7C0%7C636875803518430021&amp;sdata=b51Xc5GUN
> > > > nHX0K
> > > > > > 08LrH3ShTyFcRZ4mYHUATd%2BDpvYDw%3D&amp;reserved=0>;
> > > > > > 
> > > > > > This race is being addressed by the patch series by
> Kimberly
> > > 
> > > Brown
> > > > > > in
> > > > > > https://nam06.safelinks.protection.outlook.com/?url=https%3
> A%
> > > > > > 2F
> > > 
> > > %2Flk
> > > > > > 
> > > > 
> > > >
> ml.org%2Flkml%2F2019%2F2%2F21%2F1236&amp;data=02%7C01%7Chaiyangz
> > > > %40m
> > > > > > 
> > > > 
> > > >
> icrosoft.com%7C73af013c14034bb0b1ad08d6a32b419c%7C72f988bf86f141a
> > > > f9
> > > > 1
> > > > > > 
> > > > 
> > > >
> ab2d7cd011db47%7C1%7C0%7C636875803518430021&amp;sdata=js1ff15Gbk7
> > > > 0MD
> > > > > > A2hkMZExxvAAbDuKDhfBvc5ZrckzM%3D&amp;reserved=0 which is
> not
> > > > 
> > > > final
> > > > > > yet
> > > > > > 
> > > > > > Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
> > > > > 
> > > > > Could you elaborate on the code paths where
> > > > > hv_get_bytes_to_read/write() could be called when the ring
> > > > > buffer
> > > > > isn't yet allocated?  My sense is that Kim Brown's patch will
> > > 
> > > address
> > > > > all of the code paths that involved sysfs access from outside
> > > > > the
> > > > > driver.  And within a driver, the ring buffer should never be
> > > 
> > > accessed
> > > > > unless it is already allocated.  Is there another code path
> > > > > we're
> > > 
> > > not
> > > > > aware of?  I'm wondering if these changes are really needed
> > > > > once
> > > 
> > > Kim
> > > > > Brown's patch is finished.
> > > > > 
> > > > > Michael
> > > > 
> > > > I've seen one instance of the race in the netvsc driver when
> > > 
> > > running traffic
> > > > through it with iperf3 while continuously changing the channel
> > > 
> > > settings.
> > > > 
> > > > The following code path deallocates the ring buffer:
> > > > netvsc_set_channels() -> netvsc_detach() ->
> > > > rndis_filter_device_remove() -> netvsc_device_remove() ->
> > > 
> > > vmbus_close()
> > > > -> vmbus_free_ring() -> hv_ringbuffer_cleanup().
> > > > 
> > > > netvsc_send_pkt() -> hv_get_bytes_to_write() might get called
> > > 
> > > concurrently
> > > > after vmbus_close() and before vmbus_open() returns and sets up
> > > > the
> > > 
> > > new ring
> > > > buffer.
> > > > 
> > > > The race is fairly hard to reproduce on recent upstream
> kernels,
> > > 
> > > but I still
> > > > managed to reproduce it.
> > > 
> > >  
> > > Looking at the code from netvsc_detach() –
> > >          netif_tx_disable(ndev) is called before
> > > rndis_filter_device_remove(hdev, nvdev).
> > > So there should be no call to netvsc_send_pkt() after detaching.
> > > What’s the crash stack trace?
> > >  
> > > static int netvsc_detach(struct net_device *ndev,
> > >                          struct netvsc_device *nvdev)
> > > {
> > >         struct net_device_context *ndev_ctx = netdev_priv(ndev);
> > >         struct hv_device *hdev = ndev_ctx->device_ctx;
> > >         int ret;
> > >  
> > >         /* Don't try continuing to try and setup sub channels */
> > >         if (cancel_work_sync(&nvdev->subchan_work))
> > >                 nvdev->num_chn = 1;
> > >  
> > >         /* If device was up (receiving) then shutdown */
> > >         if (netif_running(ndev)) {
> > >                 netif_tx_disable(ndev);
> > >  
> > >                 ret = rndis_filter_close(nvdev);
> > >                 if (ret) {
> > >                         netdev_err(ndev,
> > >                                    "unable to close device (ret
> > > %d).\n", ret);
> > >                         return ret;
> > >                 }
> > >  
> > >                 ret = netvsc_wait_until_empty(nvdev);
> > >                 if (ret) {
> > >                         netdev_err(ndev,
> > >                                    "Ring buffer not empty after
> > > closing rndis\n");
> > >                         return ret;
> > >                 }
> > >         }
> > >  
> > >         netif_device_detach(ndev);
> > >  
> > >         rndis_filter_device_remove(hdev, nvdev);
> > >  
> > >         return 0;
> > > }
> > >  
> > > Thanks,
> > > Haiyang
> > 
> > Here is one stack trace on a 4.18 kernel, the most recent kernel I
> > managed to reproduce this bug on. 
> > I haven't managed to reproduce on 5.0.0 yet, but I guess some
> recent
> > changes to the netvsc driver could be masking the problem, as I
> tried
> > backporting those changes to older RHEL 7 kernels and still managed
> > to
> > reproduce the problem there. I could however be wrong, and any
> > pointers
> > are still appreciated:
> > 
> > [  545.308507] BUG: unable to handle kernel NULL pointer
> dereference
> > at
> > 0000000000000004
> > [  545.308656] PGD 0 P4D 0 
> > [  545.308763] Oops: 0000 [#1] SMP PTI
> > [  545.308855] CPU: 2 PID: 1800 Comm: iperf3 Kdump: loaded Not
> > tainted
> > 4.18.0-64.el8.test.x86_64 #1
> > [  545.308990] Hardware name: Microsoft Corporation Virtual
> > Machine/Virtual Machine, BIOS Hyper-V UEFI Release v1.0 11/26/2012
> > [  545.309143] RIP: 0010:netvsc_send+0x2c9/0xce0 [hv_netvsc]
> > [  545.309298] Code: 4c 8b b1 c0 00 00 00 4f 8d 2c 64 49 c1 e5 07
> 4d
> > 03
> > ae c0 03 00 00 48 8b 84 03 30 01 00 00 4c 89 6c 24 18 48 8b 90 20
> 01
> > 00
> > 00 <8b> 72 04 8b 0a 8b 90 38 01 00 00 89 f7 01 f2 29 cf 29 ca 39 ce
> >  0f
> > [  545.309321] RSP: 0018:ffffb8a305d5b6c0 EFLAGS: 00010282
> > [  545.309321] RAX: ffff926928bd7000 RBX: ffff92687dbe0000 RCX:
> > ffff92687d5bec00
> > [  545.309321] RDX: 0000000000000000 RSI: ffff92691b61c654 RDI:
> > 0000000000000000
> > [  545.309321] RBP: ffff926915dcde28 R08: ffff926915dcde00 R09:
> > 0000000000000000
> > [  545.309321] R10: 00000000000db61c R11: 0000000000000f7e R12:
> > 0000000000000001
> > [  545.309321] R13: ffff926931808180 R14: ffff926931801000 R15:
> > 0000000000000000
> > [  545.309321] FS:  00007feca6a4b740(0000)
> GS:ffff926940080000(0000)
> > knlGS:0000000000000000
> > [  545.309321] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  545.309321] CR2: 0000000000000004 CR3: 00000000dfccc004 CR4:
> > 00000000003606e0
> > [  545.309321] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [  545.309321] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > [  545.309321] Call Trace:
> > [  545.309321]  netvsc_start_xmit+0x3c9/0x800 [hv_netvsc]
> > [  545.309321]  ? __switch_to_asm+0x34/0x70
> > [  545.309321]  ? __switch_to_asm+0x34/0x70
> > [  545.309321]  ? ___slab_alloc+0x269/0x4e0
> > [  545.309321]  ? __alloc_skb+0x82/0x1c0
> > [  545.309321]  ? nft_do_chain+0x3d7/0x3f0 [nf_tables]
> > [  545.309321]  ? nft_do_chain+0x3d7/0x3f0 [nf_tables]
> > [  545.309321]  ? nft_do_chain+0x3d7/0x3f0 [nf_tables]
> > [  545.309321]  ? _cond_resched+0x15/0x30
> > [  545.309321]  ? netif_skb_features+0x118/0x280
> > [  545.309321]  dev_hard_start_xmit+0xa5/0x210
> > [  545.309321]  sch_direct_xmit+0x14f/0x340
> > [  545.309321]  __dev_queue_xmit+0x799/0x8f0
> > [  545.309321]  ip_finish_output2+0x2e0/0x430
> > [  545.309321]  ? ip_finish_output+0x139/0x270
> > [  545.309321]  ip_output+0x6c/0xe0
> > [  545.309321]  ? ip_append_data.part.50+0xc0/0xc0
> > [  545.309321]  ip_send_skb+0x15/0x40
> > [  545.309321]  udp_send_skb.isra.43+0x153/0x340
> > [  545.309321]  udp_sendmsg+0xac2/0xd30
> > [  545.309321]  ? set_fd_set.part.7+0x40/0x40
> > [  545.309321]  ? set_fd_set.part.7+0x40/0x40
> > [  545.309321]  ? __check_object_size+0xa3/0x181
> > [  545.309321]  ? sock_has_perm+0x78/0xa0
> > [  545.309321]  ? core_sys_select+0x242/0x2f0
> > [  545.309321]  ? sock_sendmsg+0x36/0x40
> > [  545.309321]  ? udp_push_pending_frames+0x60/0x60
> > [  545.309321]  sock_sendmsg+0x36/0x40
> > [  545.309321]  sock_write_iter+0x8f/0xf0
> > [  545.309321]  __vfs_write+0x156/0x1a0
> > [  545.309321]  vfs_write+0xa5/0x1a0
> > [  545.309321]  ksys_write+0x4f/0xb0
> > [  545.309321]  do_syscall_64+0x5b/0x1b0
> > [  545.309321]  entry_SYSCALL_64_after_hwframe+0x65/0xca
> > [  545.309321] RIP: 0033:0x7feca5fb5348
> > [  545.309321] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00
> 00
> > 00
> > 00 f3 0f 1e fa 48 8d 05 d5 63 2d 00 8b 00 85 c0 75 17 b8 01 00 00
> 00
> > 0f
> > 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4
> > 55
> > [  545.309321] RSP: 002b:00007ffc3ff1f108 EFLAGS: 00000246
> ORIG_RAX:
> > 0000000000000001
> > [  545.309321] RAX: ffffffffffffffda RBX: 00000000000005a8 RCX:
> > 00007feca5fb5348
> > [  545.309321] RDX: 00000000000005a8 RSI: 00007feca6a59000 RDI:
> > 0000000000000009
> > [  545.309321] RBP: 00007feca6a59000 R08: 0000000000000002 R09:
> > 00cd09a3238b4e43
> > [  545.309321] R10: 0002961ecea49016 R11: 0000000000000246 R12:
> > 0000000000000009
> > [  545.309321] R13: 00000000000005a8 R14: 00007ffc3ff1f180 R15:
> > 0000563c1e05b260
> > [  545.309321] Modules linked in: nft_chain_nat_ipv6
> > nf_conntrack_ipv6
> > nf_defrag_ipv6 nf_nat_ipv6 nft_chain_route_ipv6 nft_chain_nat_ipv4
> > nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
> > nft_chain_route_ipv4 nf_conntrack ip_set nf_tables nfnetlink vfat
> fat
> > sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
> > intel_rapl_perf sg hv_utils hv_balloon pcspkr joydev xfs libcrc32c
> > sd_mod sr_mod cdrom serio_raw hv_storvsc hv_netvsc
> scsi_transport_fc
> > hyperv_fb hyperv_keyboard hid_hyperv crc32c_intel hv_vmbus
> dm_mirror
> > dm_region_hash dm_log dm_mod [last unloaded: nft_compat]
> > [  545.309321] CR2: 0000000000000004
> > 
> > From the stack trace netvsc_send+0x2c9 points to this line:
> > 
> > static inline u32 hv_get_bytes_to_write(const struct  hv_ring_
> > bu
> > ffer_info *rbi)
> > {
> >         u32 read_loc, write_loc, dsize, write;
> > 
> >         dsize = rbi->ring_datasize;
> >         read_loc = READ_ONCE(rbi->ring_buffer->read_index);  <-----
> --
> > --
> >         write_loc = rbi->ring_buffer->write_index;
> > 
> >         write = write_loc >= read_loc ? dsize - (write_loc -
> > read_loc) 
> >                 read_loc - write_loc;
> >         return write;
> > }
> > 
> > which gets called from netvsc_send_pkt().
> 

^ permalink raw reply

* RE: [PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write
From: Haiyang Zhang @ 2019-03-26 14:42 UTC (permalink / raw)
  To: mgamal@redhat.com, Stephen Hemminger, Michael Kelley,
	linux-hyperv@vger.kernel.org, kimbrownkd
  Cc: Sasha Levin, Dexuan Cui, Long Li, KY Srinivasan, vkuznets
In-Reply-To: <1553609158.5021.15.camel@redhat.com>



> -----Original Message-----
> From: linux-hyperv-owner@vger.kernel.org <linux-hyperv-
> owner@vger.kernel.org> On Behalf Of Mohammed Gamal
> Sent: Tuesday, March 26, 2019 10:06 AM
> To: Haiyang Zhang <haiyangz@microsoft.com>; Stephen Hemminger
> <sthemmin@microsoft.com>; Michael Kelley <mikelley@microsoft.com>;
> linux-hyperv@vger.kernel.org; kimbrownkd <kimbrownkd@gmail.com>
> Cc: Sasha Levin <Alexander.Levin@microsoft.com>; Dexuan Cui
> <decui@microsoft.com>; Long Li <longli@microsoft.com>; KY Srinivasan
> <kys@microsoft.com>; vkuznets <vkuznets@redhat.com>
> Subject: Re: [PATCH] hyper-v: Check for ring buffer in
> hv_get_bytes_to_read/write
> 
> On Mon, 2019-03-25 at 20:13 +0000, Haiyang Zhang wrote:
> > Hi Mohammed,
> >
> > I found by reading the code and testing – in netvsc_detach(), after
> > netif_tx_disable(), the queues may be waken up again when ring buffer
> > usage drops below the “low wartermark”. This is expected in normal
> > conditions as part of our flow control mechanism.
> >
> > But when we stopped all tx queues in netvsc_detach(), and start
> > removing the netvsc device, this may cause send path panic on NULL
> > pointer on a closed channel.
> >
> > I have attached a patch for this issue, could you test it on your
> > side?
> >
> > Thanks,
> > Haiyang
> 
> Hi Haiyang,
> 
> I've tested the patch and it seems to fix the problem.
> 
> Thanks,
> Mohammed
> 

Thank you for the testing!
 - Haiyang

^ permalink raw reply

* Re: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
From: Lorenzo Pieralisi @ 2019-03-26 17:08 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: Michael Kelley, bhelgaas@google.com, linux-pci@vger.kernel.org,
	KY Srinivasan, Stephen Hemminger, Sasha Levin,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <PU1P153MB0169E9AA13EB51DF5CFC1CE4BF420@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

On Thu, Mar 21, 2019 at 12:12:03AM +0000, Dexuan Cui wrote:
> > From: Michael Kelley <mikelley@microsoft.com>
> > Sent: Wednesday, March 20, 2019 2:38 PM
> > 
> > From: Dexuan Cui <decui@microsoft.com>
> > >
> > > After a device is just created in new_pcichild_device(), hpdev->refs is set
> > > to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> > >
> > > When we hot remove the device from the host, in Linux VM we first call
> > > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
> > > then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
> > > (let's ignore the paired get/put_pcichild() in other places). But in
> > > hv_eject_device_work(), currently we only call put_pcichild() twice,
> > > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> > > adds one put_pcichild() to fix the memory leak.
> > >
> > > BTW, the device can also be removed when we run "rmmod pci-hyperv". On
> > this
> > > path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
> > > hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> > > pci_devices_present_work().
> > 
> > Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
> > There is the reference in the hbus->children list, and there is the reference that
> > is returned to the caller.  
> So IMO the "normal" reference count should be 2. :-) IMO only when a hv_pci_dev
> device is about to be destroyed, its reference count can drop to less than 2, 
> i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed from
> hbus->children), and then drop to zero (meaning kfree(hpdev) is called).
> 
> > But what is strange is that pci_devices_present_work()
> > overwrites the reference returned in local variable hpdev without doing a
> > put_pcichild().
> I suppose you mean:
> 
>         /* First, mark all existing children as reported missing. */
>         spin_lock_irqsave(&hbus->device_list_lock, flags);
>         list_for_each_entry(hpdev, &hbus->children, list_entry) {
>                 hpdev->reported_missing = true;
>         }
>         spin_unlock_irqrestore(&hbus->device_list_lock, flags)
> 
> This is not strange to me, because, in pci_devices_present_work(), at first we
> don't know which devices are about to disappear, so we pre-mark all devices to
> be potentially missing like that; if a device is still on the bus, we'll mark its
> hpdev->reported_missing to false later; only after we know exactly which
> devices are missing, we should call put_pcichild() against them. All these
> seem natural to me.
> 
> > It seems like the "normal" reference count should be 1 when the
> > child device is not being manipulated, not 2.
> What does "not being manipulated" mean?
> 
> > The fix would be to add a call to
> > put_pcichild() when the return value from new_pcichild_device() is
> > overwritten.
> In pci_devices_present_work(), we NEVER "overwrite" the "hpdev" returned
> from new_pcichild_device(): the "reported_missing" field of the new hpdev
> is implicitly initialized to false in new_pcichild_device().
> 
> > Then remove the call to put_pcichild() in pci_device_present_work() when
> > missing
> > children are moved to the local list. The children have been moved from one
> > list
> > to another, so there's no need to decrement the reference count.  Then when
> > everything in the local list is deleted, the reference is correctly decremented,
> > presumably freeing the memory.
> > 
> > With this approach, the code in hv_eject_device_work() is correct.  There's
> > one call to put_pcichild() to reflect removing the child device from the hbus->
> > children list, and one call to put_pcichild() to pair with the get_pcichild() in
> > hv_pci_eject_device().
> Please refer to my replies above. IMO we should fix
> hv_eject_device_work() rather than pci_devices_present_work().

Have we reached a conclusion on this ? I would like to merge this series
given that it is fixing bugs and it has hung in the balance for quite
a while but it looks like Michael is not too happy about these patches
and I need a maintainer ACK to merge them.

Thanks,
Lorenzo

> Thanks
> -- Dexuan
>  
> > Your patch works, but to me it leaves the ref count in an unnatural state
> > most of the time.
> > 
> > Michael
> 

^ permalink raw reply

* RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
From: Michael Kelley @ 2019-03-26 17:47 UTC (permalink / raw)
  To: Lorenzo Pieralisi, Dexuan Cui
  Cc: bhelgaas@google.com, linux-pci@vger.kernel.org, KY Srinivasan,
	Stephen Hemminger, Sasha Levin, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <20190326170842.GA10666@e107981-ln.cambridge.arm.com>

From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>  Sent: Tuesday, March 26, 2019 10:09 AM
> On Thu, Mar 21, 2019 at 12:12:03AM +0000, Dexuan Cui wrote:
> > > From: Michael Kelley <mikelley@microsoft.com>
> > > Sent: Wednesday, March 20, 2019 2:38 PM
> > >
> > > From: Dexuan Cui <decui@microsoft.com>
> > > >
> > > > After a device is just created in new_pcichild_device(), hpdev->refs is set
> > > > to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> > > >
> > > > When we hot remove the device from the host, in Linux VM we first call
> > > > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
> > > > then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
> > > > (let's ignore the paired get/put_pcichild() in other places). But in
> > > > hv_eject_device_work(), currently we only call put_pcichild() twice,
> > > > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> > > > adds one put_pcichild() to fix the memory leak.
> > > >
> > > > BTW, the device can also be removed when we run "rmmod pci-hyperv". On
> > > this
> > > > path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
> > > > hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> > > > pci_devices_present_work().
> > >
> > > Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
> > > There is the reference in the hbus->children list, and there is the reference that
> > > is returned to the caller.
> > So IMO the "normal" reference count should be 2. :-) IMO only when a hv_pci_dev
> > device is about to be destroyed, its reference count can drop to less than 2,
> > i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed from
> > hbus->children), and then drop to zero (meaning kfree(hpdev) is called).
> >
> > > But what is strange is that pci_devices_present_work()
> > > overwrites the reference returned in local variable hpdev without doing a
> > > put_pcichild().
> > I suppose you mean:
> >
> >         /* First, mark all existing children as reported missing. */
> >         spin_lock_irqsave(&hbus->device_list_lock, flags);
> >         list_for_each_entry(hpdev, &hbus->children, list_entry) {
> >                 hpdev->reported_missing = true;
> >         }
> >         spin_unlock_irqrestore(&hbus->device_list_lock, flags)
> >
> > This is not strange to me, because, in pci_devices_present_work(), at first we
> > don't know which devices are about to disappear, so we pre-mark all devices to
> > be potentially missing like that; if a device is still on the bus, we'll mark its
> > hpdev->reported_missing to false later; only after we know exactly which
> > devices are missing, we should call put_pcichild() against them. All these
> > seem natural to me.
> >
> > > It seems like the "normal" reference count should be 1 when the
> > > child device is not being manipulated, not 2.
> > What does "not being manipulated" mean?
> >
> > > The fix would be to add a call to
> > > put_pcichild() when the return value from new_pcichild_device() is
> > > overwritten.
> > In pci_devices_present_work(), we NEVER "overwrite" the "hpdev" returned
> > from new_pcichild_device(): the "reported_missing" field of the new hpdev
> > is implicitly initialized to false in new_pcichild_device().
> >
> > > Then remove the call to put_pcichild() in pci_device_present_work() when
> > > missing
> > > children are moved to the local list. The children have been moved from one
> > > list
> > > to another, so there's no need to decrement the reference count.  Then when
> > > everything in the local list is deleted, the reference is correctly decremented,
> > > presumably freeing the memory.
> > >
> > > With this approach, the code in hv_eject_device_work() is correct.  There's
> > > one call to put_pcichild() to reflect removing the child device from the hbus->
> > > children list, and one call to put_pcichild() to pair with the get_pcichild() in
> > > hv_pci_eject_device().
> > Please refer to my replies above. IMO we should fix
> > hv_eject_device_work() rather than pci_devices_present_work().
> 
> Have we reached a conclusion on this ? I would like to merge this series
> given that it is fixing bugs and it has hung in the balance for quite
> a while but it looks like Michael is not too happy about these patches
> and I need a maintainer ACK to merge them.
> 
> Thanks,
> Lorenzo

Dexuan and I have discussed the topic extensively offline.  The patch works
in its current form, and I'll agree to it.

Reviewed-by:  Michael Kelley <mikelley@microsoft.com>

^ permalink raw reply

* RE: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary
From: Michael Kelley @ 2019-03-26 17:50 UTC (permalink / raw)
  To: Dexuan Cui, lorenzo.pieralisi@arm.com, bhelgaas@google.com,
	linux-pci@vger.kernel.org, KY Srinivasan, Stephen Hemminger,
	Sasha Levin
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <PU1P153MB01691FE3199135CE5C3BFFEDBF420@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

From: Dexuan Cui <decui@microsoft.com>  Sent: Wednesday, March 20, 2019 5:36 PM
> 
> > From: Michael Kelley <mikelley@microsoft.com>
> > > ...
> > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct
> > work_struct *work)
> > >  		hpdev = list_first_entry(&removed, struct hv_pci_dev,
> > >  					 list_entry);
> > >  		list_del(&hpdev->list_entry);
> > > +
> > > +		if (hpdev->pci_slot)
> > > +			pci_destroy_slot(hpdev->pci_slot);
> >
> > The code is inconsistent in whether hpdev->pci_slot is set to NULL after calling
> > pci_destory_slot().
> Here, in pci_devices_present_work(), it's unnecessary to set it to NULL,
> Because:
> 1) the "hpdev" is removed from hbus->children and it can not be seen
> elsewhere;
> 2) the "hpdev" struct is freed in the below put_pcichild():
> 
>      while (!list_empty(&removed)) {
>                 hpdev = list_first_entry(&removed, struct hv_pci_dev,
>                                          list_entry);
>                 list_del(&hpdev->list_entry);
> 
>                 if (hpdev->pci_slot)
>                         pci_destroy_slot(hpdev->pci_slot);
> 
>                 put_pcichild(hpdev);
>         }
> 
> > Patch 2 in this series does set it to NULL, but this code does not.
> In Patch2, i.e. in the code path hv_pci_remove() -> hv_pci_remove_slots(),
> we must set hpdev->pci_slot to NULL, otherwise, later, due to
> hv_pci_remove() -> hv_pci_bus_exit() ->
> hv_pci_devices_present() with the zero "relations", we'll double-free the
> "hpdev" struct in pci_devices_present_work() -- see the above.
> 
> > And the code in hv_eject_device_work() does not set it to NULL.
> It's unnecessary to set hpdev->pci_slot to NULL in hv_eject_device_work(),
> Because in hv_eject_device_work():
> 1) the "hpdev" is removed from hbus->children and it can not be seen
> elsewhere;
> 2) the "hpdev" struct is freed at the end of hv_eject_device_work() with my
> first patch: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work().
> 
> > It looks like all the places that test the value of hpdev->pci_slot or call
> > pci_destroy_slot() are serialized, so it looks like it really doesn't matter.  But
> > when
> > the code is inconsistent about setting to NULL, it always makes me wonder if
> > there
> > is a reason.
> >
> > Michael
> 

Reviewed-by:  Michael Kelley <mikelley@microsoft.com>

^ permalink raw reply

* RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
From: Dexuan Cui @ 2019-03-26 18:01 UTC (permalink / raw)
  To: Michael Kelley, Lorenzo Pieralisi
  Cc: bhelgaas@google.com, linux-pci@vger.kernel.org, KY Srinivasan,
	Stephen Hemminger, Sasha Levin, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <DM5PR2101MB0918F2C06BBFF7684BF179FFD75F0@DM5PR2101MB0918.namprd21.prod.outlook.com>

> From: Michael Kelley <mikelley@microsoft.com>
> Sent: Tuesday, March 26, 2019 10:47 AM
> To: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>; Dexuan Cui
> <decui@microsoft.com>
> Cc: bhelgaas@google.com; linux-pci@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Stephen Hemminger <sthemmin@microsoft.com>;
> Sasha Levin <Alexander.Levin@microsoft.com>; linux-hyperv@vger.kernel.org;
> linux-kernel@vger.kernel.org; driverdev-devel@linuxdriverproject.org; Haiyang
> Zhang <haiyangz@microsoft.com>; olaf@aepfle.de; apw@canonical.com;
> jasowang@redhat.com; vkuznets <vkuznets@redhat.com>;
> marcelo.cerri@canonical.com; jackm@mellanox.com; stable@vger.kernel.org
> Subject: RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
> 
> From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>  Sent: Tuesday, March 26,
> 2019 10:09 AM
> > On Thu, Mar 21, 2019 at 12:12:03AM +0000, Dexuan Cui wrote:
> > > > From: Michael Kelley <mikelley@microsoft.com>
> > > > Sent: Wednesday, March 20, 2019 2:38 PM
> > > >
> > > > From: Dexuan Cui <decui@microsoft.com>
> > > > >
> > > > > After a device is just created in new_pcichild_device(), hpdev->refs is
> set
> > > > > to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> > > > >
> > > > > When we hot remove the device from the host, in Linux VM we first call
> > > > > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild()
> and
> > > > > then schedules a work of hv_eject_device_work(), so hpdev->refs
> becomes 3
> > > > > (let's ignore the paired get/put_pcichild() in other places). But in
> > > > > hv_eject_device_work(), currently we only call put_pcichild() twice,
> > > > > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> > > > > adds one put_pcichild() to fix the memory leak.
> > > > >
> > > > > BTW, the device can also be removed when we run "rmmod pci-hyperv".
> On
> > > > this
> > > > > path (hv_pci_remove() -> hv_pci_bus_exit() ->
> hv_pci_devices_present()),
> > > > > hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> > > > > pci_devices_present_work().
> > > >
> > > > Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
> > > > There is the reference in the hbus->children list, and there is the
> reference that
> > > > is returned to the caller.
> > > So IMO the "normal" reference count should be 2. :-) IMO only when a
> hv_pci_dev
> > > device is about to be destroyed, its reference count can drop to less than 2,
> > > i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed
> from
> > > hbus->children), and then drop to zero (meaning kfree(hpdev) is called).
> > >
> > > > But what is strange is that pci_devices_present_work()
> > > > overwrites the reference returned in local variable hpdev without doing a
> > > > put_pcichild().
> > > I suppose you mean:
> > >
> > >         /* First, mark all existing children as reported missing. */
> > >         spin_lock_irqsave(&hbus->device_list_lock, flags);
> > >         list_for_each_entry(hpdev, &hbus->children, list_entry) {
> > >                 hpdev->reported_missing = true;
> > >         }
> > >         spin_unlock_irqrestore(&hbus->device_list_lock, flags)
> > >
> > > This is not strange to me, because, in pci_devices_present_work(), at first
> we
> > > don't know which devices are about to disappear, so we pre-mark all
> devices to
> > > be potentially missing like that; if a device is still on the bus, we'll mark its
> > > hpdev->reported_missing to false later; only after we know exactly which
> > > devices are missing, we should call put_pcichild() against them. All these
> > > seem natural to me.
> > >
> > > > It seems like the "normal" reference count should be 1 when the
> > > > child device is not being manipulated, not 2.
> > > What does "not being manipulated" mean?
> > >
> > > > The fix would be to add a call to
> > > > put_pcichild() when the return value from new_pcichild_device() is
> > > > overwritten.
> > > In pci_devices_present_work(), we NEVER "overwrite" the "hpdev"
> returned
> > > from new_pcichild_device(): the "reported_missing" field of the new hpdev
> > > is implicitly initialized to false in new_pcichild_device().
> > >
> > > > Then remove the call to put_pcichild() in pci_device_present_work()
> when
> > > > missing
> > > > children are moved to the local list. The children have been moved from
> one
> > > > list
> > > > to another, so there's no need to decrement the reference count.  Then
> when
> > > > everything in the local list is deleted, the reference is correctly
> decremented,
> > > > presumably freeing the memory.
> > > >
> > > > With this approach, the code in hv_eject_device_work() is correct.
> There's
> > > > one call to put_pcichild() to reflect removing the child device from the
> hbus->
> > > > children list, and one call to put_pcichild() to pair with the get_pcichild() in
> > > > hv_pci_eject_device().
> > > Please refer to my replies above. IMO we should fix
> > > hv_eject_device_work() rather than pci_devices_present_work().
> >
> > Have we reached a conclusion on this ? I would like to merge this series
> > given that it is fixing bugs and it has hung in the balance for quite
> > a while but it looks like Michael is not too happy about these patches
> > and I need a maintainer ACK to merge them.
> >
> > Thanks,
> > Lorenzo
> 
> Dexuan and I have discussed the topic extensively offline.  The patch works
> in its current form, and I'll agree to it.
> 
> Reviewed-by:  Michael Kelley <mikelley@microsoft.com>

Thanks, Michael!

Hi Lorenzo,
All the 3 patches have got Michael's Reviewed-by.

Previously, Stephen Hemminger, one of the Hyper-V driver maintainers, 
provided his Reviewed-by in the " [PATCH 0/3]" mail:
https://lkml.org/lkml/2019/3/5/521

Thanks,
--Dexuan

^ permalink raw reply

* Re: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
From: Lorenzo Pieralisi @ 2019-03-26 18:12 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: Michael Kelley, bhelgaas@google.com, linux-pci@vger.kernel.org,
	KY Srinivasan, Stephen Hemminger, Sasha Levin,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <PU1P153MB0169CA2F484C02A196F9B78DBF5F0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

On Tue, Mar 26, 2019 at 06:01:32PM +0000, Dexuan Cui wrote:

[...]

> > > Have we reached a conclusion on this ? I would like to merge this series
> > > given that it is fixing bugs and it has hung in the balance for quite
> > > a while but it looks like Michael is not too happy about these patches
> > > and I need a maintainer ACK to merge them.
> > >
> > > Thanks,
> > > Lorenzo
> > 
> > Dexuan and I have discussed the topic extensively offline.  The patch works
> > in its current form, and I'll agree to it.
> > 
> > Reviewed-by:  Michael Kelley <mikelley@microsoft.com>
> 
> Thanks, Michael!
> 
> Hi Lorenzo,
> All the 3 patches have got Michael's Reviewed-by.

Good, I asked because I do not want to merge patches with review
questions open, thanks for the clarifications.

> Previously, Stephen Hemminger, one of the Hyper-V driver maintainers, 
> provided his Reviewed-by in the " [PATCH 0/3]" mail:
> https://lkml.org/lkml/2019/3/5/521

I will reformat/rewrite the logs and notify you when queued.

Thanks,
Lorenzo

^ permalink raw reply

* Re: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary
From: Lorenzo Pieralisi @ 2019-03-26 19:54 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: bhelgaas@google.com, linux-pci@vger.kernel.org, KY Srinivasan,
	Stephen Hemminger, Michael Kelley, Sasha Levin,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <20190304213357.16652-4-decui@microsoft.com>

On Mon, Mar 04, 2019 at 09:34:49PM +0000, Dexuan Cui wrote:
> When we hot-remove a device, usually the host sends us a PCI_EJECT message,
> and a PCI_BUS_RELATIONS message with bus_rel->device_count == 0. But when
> we do the quick hot-add/hot-remove test, the host may not send us the
> PCI_EJECT message, if the guest has not fully finished the initialization
> by sending the PCI_RESOURCES_ASSIGNED* message to the host, so it's
> potentially unsafe to only depend on the pci_destroy_slot() in
> hv_eject_device_work(), though create_root_hv_pci_bus() ->
> hv_pci_assign_slots() is not called in this case. Note: in this case, the
> host still sends the guest a PCI_BUS_RELATIONS message with
> bus_rel->device_count == 0.
> 
> And, in the quick hot-add/hot-remove test, we can have such a race: before
> pci_devices_present_work() -> new_pcichild_device() adds the new device
> into hbus->children, we may have already received the PCI_EJECT message,
> and hence the taklet handler hv_pci_onchannelcallback() may fail to find
> the "hpdev" by get_pcichild_wslot(hbus, dev_message->wslot.slot), so
> hv_pci_eject_device() is NOT called; later create_root_hv_pci_bus() ->
> hv_pci_assign_slots() creates the slot, and the PCI_BUS_RELATIONS message
> with bus_rel->device_count == 0 removes the device from hbus->children, and
> we end up being unable to remove the slot in hv_pci_remove() ->
> hv_pci_remove_slots().
> 
> The patch removes the slot in pci_devices_present_work() when the device
> is removed. This can address the above race. Note 1:
> pci_devices_present_work() and hv_eject_device_work() run in the
> singled-threaded hbus->wq, so there is not a double-remove issue for the
> slot. Note 2: we can't offload hv_pci_eject_device() from
> hv_pci_onchannelcallback() to the workqueue, because we need
> hv_pci_onchannelcallback() synchronously call hv_pci_eject_device() to
> poll the channel's ringbuffer to work around the
> "hangs in hv_compose_msi_msg()" issue: see
> commit de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")

This commit log is unreadable, sorry. Indentation, punctuation and
formatting are just a mess, try to read it, you will notice by
yourself.

I basically reformatted it completely and pushed the series to
pci/controller-fixes but that's the last time I do it since I am not an
editor, next time I won't merge it.

More importantly, these patches are marked for stable, given the series
of fixes that triggered this series please ensure it was tested
thoroughly because it is honestly complicate to understand and I do not
want to backport further fixes to stable kernels on top of this.

Please have a look and report back.

Thanks,
Lorenzo

> Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Cc: stable@vger.kernel.org
> ---
>  drivers/pci/controller/pci-hyperv.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index b489412e3502..82acd6155adf 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct work_struct *work)
>  		hpdev = list_first_entry(&removed, struct hv_pci_dev,
>  					 list_entry);
>  		list_del(&hpdev->list_entry);
> +
> +		if (hpdev->pci_slot)
> +			pci_destroy_slot(hpdev->pci_slot);
> +
>  		put_pcichild(hpdev);
>  	}
>  
> -- 
> 2.19.1
> 

^ permalink raw reply

* RE: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary
From: Dexuan Cui @ 2019-03-27  0:22 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: bhelgaas@google.com, linux-pci@vger.kernel.org, KY Srinivasan,
	Stephen Hemminger, Michael Kelley, Sasha Levin,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <20190326195429.GA15410@e107981-ln.cambridge.arm.com>

> From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Sent: Tuesday, March 26, 2019 12:55 PM
> On Mon, Mar 04, 2019 at 09:34:49PM +0000, Dexuan Cui wrote:
> > When we hot-remove a device, usually the host sends us a PCI_EJECT
> message,
> > and a PCI_BUS_RELATIONS message with bus_rel->device_count == 0. But
> when
> > we do the quick hot-add/hot-remove test, the host may not send us the
> > PCI_EJECT message, if the guest has not fully finished the initialization
> > by sending the PCI_RESOURCES_ASSIGNED* message to the host, so it's
> > potentially unsafe to only depend on the pci_destroy_slot() in
> > hv_eject_device_work(), though create_root_hv_pci_bus() ->
> > hv_pci_assign_slots() is not called in this case. Note: in this case, the
> > host still sends the guest a PCI_BUS_RELATIONS message with
> > bus_rel->device_count == 0.
> >
> > And, in the quick hot-add/hot-remove test, we can have such a race: before
> > pci_devices_present_work() -> new_pcichild_device() adds the new device
> > into hbus->children, we may have already received the PCI_EJECT message,
> > and hence the taklet handler hv_pci_onchannelcallback() may fail to find
> > the "hpdev" by get_pcichild_wslot(hbus, dev_message->wslot.slot), so
> > hv_pci_eject_device() is NOT called; later create_root_hv_pci_bus() ->
> > hv_pci_assign_slots() creates the slot, and the PCI_BUS_RELATIONS message
> > with bus_rel->device_count == 0 removes the device from hbus->children,
> and
> > we end up being unable to remove the slot in hv_pci_remove() ->
> > hv_pci_remove_slots().
> >
> > The patch removes the slot in pci_devices_present_work() when the device
> > is removed. This can address the above race. Note 1:
> > pci_devices_present_work() and hv_eject_device_work() run in the
> > singled-threaded hbus->wq, so there is not a double-remove issue for the
> > slot. Note 2: we can't offload hv_pci_eject_device() from
> > hv_pci_onchannelcallback() to the workqueue, because we need
> > hv_pci_onchannelcallback() synchronously call hv_pci_eject_device() to
> > poll the channel's ringbuffer to work around the
> > "hangs in hv_compose_msi_msg()" issue: see
> > commit de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in
> hv_compose_msi_msg()")
> 
> This commit log is unreadable, sorry. Indentation, punctuation and
> formatting are just a mess, try to read it, you will notice by
> yourself.
> 
> I basically reformatted it completely and pushed the series to
> pci/controller-fixes but that's the last time I do it since I am not an
> editor, next time I won't merge it.

Hi Lorenzo,
Thank you for helping improve my changelogs! I did learn a lot after
carefully comparing the improved version with my original version. :-)

I'll try my best to write a good changelog for my future patches.

> More importantly, these patches are marked for stable, given the series
> of fixes that triggered this series please ensure it was tested
> thoroughly because it is honestly complicate to understand and I do not
> want to backport further fixes to stable kernels on top of this.

I did the hot-add/hot-remove test in a loop for several thousand times,
and the patchset worked as expected and didn't show any issue.

> Please have a look and report back.
> 
> Thanks,
> Lorenzo

Thanks again!

Thanks,
-- Dexuan

^ permalink raw reply

* Re: [PATCH] MAINTAINERS: Fix Hyperv vIOMMU driver file name
From: Mukesh Ojha @ 2019-03-27 13:51 UTC (permalink / raw)
  To: lantianyu1986, davem, mchehab+samsung, gregkh, nicolas.ferre,
	tglx, mingo, konrad.wilk, jpoimboe, peterz, jkosina, riel, peterz,
	Tianyu.Lan, luto, michael.h.kelley, kys, sashal, joe
  Cc: linux-kernel, linux-hyperv
In-Reply-To: <1553581701-25250-1-git-send-email-Tianyu.Lan@microsoft.com>


On 3/26/2019 11:58 AM, lantianyu1986@gmail.com wrote:
> From: Lan Tianyu <Tianyu.Lan@microsoft.com>
>
> The Hyperv vIOMMU file name should be "hyperv-iommu.c" rather

s/vIOMMU/IOMMU

> than "hyperv_iommu.c". This patch is to fix it.
>
> Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>


Make the above change otherwise looks good.
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>

-Mukesh

> ---
>   MAINTAINERS | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index e17ebf7..403247d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7175,7 +7175,7 @@ F:	drivers/net/hyperv/
>   F:	drivers/scsi/storvsc_drv.c
>   F:	drivers/uio/uio_hv_generic.c
>   F:	drivers/video/fbdev/hyperv_fb.c
> -F:	drivers/iommu/hyperv_iommu.c
> +F:	drivers/iommu/hyperv-iommu.c
>   F:	net/vmw_vsock/hyperv_transport.c
>   F:	include/linux/hyperv.h
>   F:	include/uapi/linux/hyperv.h

^ permalink raw reply

* Re: [PATCH] MAINTAINERS: Fix Hyperv vIOMMU driver file name
From: Mukesh Ojha @ 2019-03-27 14:09 UTC (permalink / raw)
  To: lantianyu1986, davem, mchehab+samsung, gregkh, nicolas.ferre,
	tglx, mingo, konrad.wilk, jpoimboe, peterz, jkosina, riel, peterz,
	Tianyu.Lan, luto, michael.h.kelley, kys, sashal, joe
  Cc: linux-kernel, linux-hyperv
In-Reply-To: <1553581701-25250-1-git-send-email-Tianyu.Lan@microsoft.com>


On 3/26/2019 11:58 AM, lantianyu1986@gmail.com wrote:
> From: Lan Tianyu <Tianyu.Lan@microsoft.com>
>
> The Hyperv vIOMMU file name should be "hyperv-iommu.c" rather
s/vIOMMU /IOMMU
> than "hyperv_iommu.c". This patch is to fix it.
>
> Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>


Othewise looks fine.
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>

-Mukesh

> ---
>   MAINTAINERS | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index e17ebf7..403247d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7175,7 +7175,7 @@ F:	drivers/net/hyperv/
>   F:	drivers/scsi/storvsc_drv.c
>   F:	drivers/uio/uio_hv_generic.c
>   F:	drivers/video/fbdev/hyperv_fb.c
> -F:	drivers/iommu/hyperv_iommu.c
> +F:	drivers/iommu/hyperv-iommu.c
>   F:	net/vmw_vsock/hyperv_transport.c
>   F:	include/linux/hyperv.h
>   F:	include/uapi/linux/hyperv.h

^ permalink raw reply

* [PATCH AUTOSEL 4.19 019/192] x86/hyperv: Fix kernel panic when kexec on HyperV
From: Sasha Levin @ 2019-03-27 18:07 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Kairui Song, Thomas Gleixner, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, Borislav Petkov, H. Peter Anvin,
	Vitaly Kuznetsov, Dave Young, devel, linux-hyperv
In-Reply-To: <20190327181025.13507-1-sashal@kernel.org>

From: Kairui Song <kasong@redhat.com>

[ Upstream commit 179fb36abb097976997f50733d5b122a29158cba ]

After commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments"),
kexec fails with a kernel panic:

kexec_core: Starting new kernel
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v3.0 03/02/2018
RIP: 0010:0xffffc9000001d000

Call Trace:
 ? __send_ipi_mask+0x1c6/0x2d0
 ? hv_send_ipi_mask_allbutself+0x6d/0xb0
 ? mp_save_irq+0x70/0x70
 ? __ioapic_read_entry+0x32/0x50
 ? ioapic_read_entry+0x39/0x50
 ? clear_IO_APIC_pin+0xb8/0x110
 ? native_stop_other_cpus+0x6e/0x170
 ? native_machine_shutdown+0x22/0x40
 ? kernel_kexec+0x136/0x156

That happens if hypercall based IPIs are used because the hypercall page is
reset very early upon kexec reboot, but kexec sends IPIs to stop CPUs,
which invokes the hypercall and dereferences the unusable page.

To fix his, reset hv_hypercall_pg to NULL before the page is reset to avoid
any misuse, IPI sending will fall back to the non hypercall based
method. This only happens on kexec / kdump so just setting the pointer to
NULL is good enough.

Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: devel@linuxdriverproject.org
Link: https://lkml.kernel.org/r/20190306111827.14131-1-kasong@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/hyperv/hv_init.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 20c876c7c5bf..87abd5145cc9 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -387,6 +387,13 @@ void hyperv_cleanup(void)
 	/* Reset our OS id */
 	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
 
+	/*
+	 * Reset hypercall page reference before reset the page,
+	 * let hypercall operations fail safely rather than
+	 * panic the kernel for using invalid hypercall page
+	 */
+	hv_hypercall_pg = NULL;
+
 	/* Reset the hypercall page */
 	hypercall_msr.as_uint64 = 0;
 	wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
-- 
2.19.1


^ permalink raw reply related

* [PATCH AUTOSEL 5.0 028/262] x86/hyperv: Fix kernel panic when kexec on HyperV
From: Sasha Levin @ 2019-03-27 17:58 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Kairui Song, Thomas Gleixner, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, Borislav Petkov, H. Peter Anvin,
	Vitaly Kuznetsov, Dave Young, devel, linux-hyperv
In-Reply-To: <20190327180158.10245-1-sashal@kernel.org>

From: Kairui Song <kasong@redhat.com>

[ Upstream commit 179fb36abb097976997f50733d5b122a29158cba ]

After commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments"),
kexec fails with a kernel panic:

kexec_core: Starting new kernel
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v3.0 03/02/2018
RIP: 0010:0xffffc9000001d000

Call Trace:
 ? __send_ipi_mask+0x1c6/0x2d0
 ? hv_send_ipi_mask_allbutself+0x6d/0xb0
 ? mp_save_irq+0x70/0x70
 ? __ioapic_read_entry+0x32/0x50
 ? ioapic_read_entry+0x39/0x50
 ? clear_IO_APIC_pin+0xb8/0x110
 ? native_stop_other_cpus+0x6e/0x170
 ? native_machine_shutdown+0x22/0x40
 ? kernel_kexec+0x136/0x156

That happens if hypercall based IPIs are used because the hypercall page is
reset very early upon kexec reboot, but kexec sends IPIs to stop CPUs,
which invokes the hypercall and dereferences the unusable page.

To fix his, reset hv_hypercall_pg to NULL before the page is reset to avoid
any misuse, IPI sending will fall back to the non hypercall based
method. This only happens on kexec / kdump so just setting the pointer to
NULL is good enough.

Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: devel@linuxdriverproject.org
Link: https://lkml.kernel.org/r/20190306111827.14131-1-kasong@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/hyperv/hv_init.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 7abb09e2eeb8..d3f42b6bbdac 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -406,6 +406,13 @@ void hyperv_cleanup(void)
 	/* Reset our OS id */
 	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
 
+	/*
+	 * Reset hypercall page reference before reset the page,
+	 * let hypercall operations fail safely rather than
+	 * panic the kernel for using invalid hypercall page
+	 */
+	hv_hypercall_pg = NULL;
+
 	/* Reset the hypercall page */
 	hypercall_msr.as_uint64 = 0;
 	wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
-- 
2.19.1


^ permalink raw reply related

* Re: [PATCH v3 3/3] Drivers: hv: vmbus: Fix race condition with new ring_buffer_info mutex
From: Kimberly Brown @ 2019-03-28  4:30 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui, KY Srinivasan, Haiyang Zhang,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <DM5PR2101MB091851BC7C55093D388514D9D7420@DM5PR2101MB0918.namprd21.prod.outlook.com>

On Thu, Mar 21, 2019 at 04:04:20PM +0000, Michael Kelley wrote:
> From: Kimberly Brown <kimbrownkd@gmail.com>  Sent: Wednesday, March 20, 2019 8:48 PM
> > > > > Adding more locks will solve the problem but it seems like overkill.
> > > > > Why not either use a reference count or an RCU style access for the
> > > > > ring buffer?
> > > >
> > > > I agree that a reference count or RCU could also solve this problem.
> > > > Using mutex locks seemed like the most straightforward solution, but
> > > > I'll certainly switch to a different approach if it's better!
> > > >
> > > > Are you concerned about the extra memory required for the mutex locks,
> > > > read performance, or something else?
> > >
> > > Locks in control path are ok, but my concern is performance of the
> > > data path which puts packets in/out of rings. To keep reasonable performance,
> > > no additional locking should be added in those paths.
> > >
> > > So if data path is using RCU, can/should the control operations also
> > > use it?
> > 


Hi Stephen,

Do you have any additional questions or suggestions for this race
condition and the mutex locks? I think that your initial questions were
addressed in the responses below. If there's anything else, please let
me know!

Thanks,
Kim


> > The data path doesn't use RCU to protect the ring buffers.
> 
> My $.02:  The mutex is obtained only in the sysfs path and the "delete
> ringbuffers" path, neither of which is performance or concurrency sensitive. 
> There's no change to any path that reads or writes data from/to the ring
> buffers.  It seems like the mutex is the most straightforward solution to
> preventing sysfs from accessing the ring buffer info while the memory is
> being freed as part of "delete ringbuffers".
> 
> Michael

^ permalink raw reply

* [PATCH hyperv-fixes] hv_netvsc: Fix unwanted wakeup after tx_disable
From: Haiyang Zhang @ 2019-03-28 17:48 UTC (permalink / raw)
  To: sashal, linux-hyperv
  Cc: haiyangz, kys, sthemmin, olaf, vkuznets, davem, netdev,
	linux-kernel

From: Haiyang Zhang <haiyangz@microsoft.com>

After queue stopped, the wakeup mechanism may wake it up again
when ring buffer usage is lower than a threshold. This may cause
send path panic on NULL pointer when we stopped all tx queues in
netvsc_detach and start removing the netvsc device.

This patch fix it by adding a tx_disable flag to prevent unwanted
queue wakeup.

Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
Reported-by: Mohammed Gamal <mgamal@redhat.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/hyperv/hyperv_net.h |  1 +
 drivers/net/hyperv/netvsc.c     |  6 ++++--
 drivers/net/hyperv/netvsc_drv.c | 32 ++++++++++++++++++++++++++------
 3 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index e859ae2..49f41b6 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -987,6 +987,7 @@ struct netvsc_device {
 
 	wait_queue_head_t wait_drain;
 	bool destroy;
+	bool tx_disable; /* if true, do not wake up queue again */
 
 	/* Receive buffer allocated by us but manages by NetVSP */
 	void *recv_buf;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 813d195..e0dce37 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -110,6 +110,7 @@ static struct netvsc_device *alloc_net_device(void)
 
 	init_waitqueue_head(&net_device->wait_drain);
 	net_device->destroy = false;
+	net_device->tx_disable = false;
 
 	net_device->max_pkt = RNDIS_MAX_PKT_DEFAULT;
 	net_device->pkt_align = RNDIS_PKT_ALIGN_DEFAULT;
@@ -719,7 +720,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
 	} else {
 		struct netdev_queue *txq = netdev_get_tx_queue(ndev, q_idx);
 
-		if (netif_tx_queue_stopped(txq) &&
+		if (netif_tx_queue_stopped(txq) && !net_device->tx_disable &&
 		    (hv_get_avail_to_write_percent(&channel->outbound) >
 		     RING_AVAIL_PERCENT_HIWATER || queue_sends < 1)) {
 			netif_tx_wake_queue(txq);
@@ -874,7 +875,8 @@ static inline int netvsc_send_pkt(
 	} else if (ret == -EAGAIN) {
 		netif_tx_stop_queue(txq);
 		ndev_ctx->eth_stats.stop_queue++;
-		if (atomic_read(&nvchan->queue_sends) < 1) {
+		if (atomic_read(&nvchan->queue_sends) < 1 &&
+		    !net_device->tx_disable) {
 			netif_tx_wake_queue(txq);
 			ndev_ctx->eth_stats.wake_queue++;
 			ret = -ENOSPC;
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 1a08679..0824155 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -109,6 +109,15 @@ static void netvsc_set_rx_mode(struct net_device *net)
 	rcu_read_unlock();
 }
 
+static inline void netvsc_tx_enable(struct netvsc_device *nvscdev,
+				    struct net_device *ndev)
+{
+	nvscdev->tx_disable = false;
+	mb(); /* ensure queue wake up mechanism is on */
+
+	netif_tx_wake_all_queues(ndev);
+}
+
 static int netvsc_open(struct net_device *net)
 {
 	struct net_device_context *ndev_ctx = netdev_priv(net);
@@ -129,7 +138,7 @@ static int netvsc_open(struct net_device *net)
 	rdev = nvdev->extension;
 	if (!rdev->link_state) {
 		netif_carrier_on(net);
-		netif_tx_wake_all_queues(net);
+		netvsc_tx_enable(nvdev, net);
 	}
 
 	if (vf_netdev) {
@@ -184,6 +193,17 @@ static int netvsc_wait_until_empty(struct netvsc_device *nvdev)
 	}
 }
 
+static inline void netvsc_tx_disable(struct netvsc_device *nvscdev,
+				     struct net_device *ndev)
+{
+	if (nvscdev) {
+		nvscdev->tx_disable = true;
+		mb(); /* ensure txq will not wake up after stop */
+	}
+
+	netif_tx_disable(ndev);
+}
+
 static int netvsc_close(struct net_device *net)
 {
 	struct net_device_context *net_device_ctx = netdev_priv(net);
@@ -192,7 +212,7 @@ static int netvsc_close(struct net_device *net)
 	struct netvsc_device *nvdev = rtnl_dereference(net_device_ctx->nvdev);
 	int ret;
 
-	netif_tx_disable(net);
+	netvsc_tx_disable(nvdev, net);
 
 	/* No need to close rndis filter if it is removed already */
 	if (!nvdev)
@@ -918,7 +938,7 @@ static int netvsc_detach(struct net_device *ndev,
 
 	/* If device was up (receiving) then shutdown */
 	if (netif_running(ndev)) {
-		netif_tx_disable(ndev);
+		netvsc_tx_disable(nvdev, ndev);
 
 		ret = rndis_filter_close(nvdev);
 		if (ret) {
@@ -1906,7 +1926,7 @@ static void netvsc_link_change(struct work_struct *w)
 		if (rdev->link_state) {
 			rdev->link_state = false;
 			netif_carrier_on(net);
-			netif_tx_wake_all_queues(net);
+			netvsc_tx_enable(net_device, net);
 		} else {
 			notify = true;
 		}
@@ -1916,7 +1936,7 @@ static void netvsc_link_change(struct work_struct *w)
 		if (!rdev->link_state) {
 			rdev->link_state = true;
 			netif_carrier_off(net);
-			netif_tx_stop_all_queues(net);
+			netvsc_tx_disable(net_device, net);
 		}
 		kfree(event);
 		break;
@@ -1925,7 +1945,7 @@ static void netvsc_link_change(struct work_struct *w)
 		if (!rdev->link_state) {
 			rdev->link_state = true;
 			netif_carrier_off(net);
-			netif_tx_stop_all_queues(net);
+			netvsc_tx_disable(net_device, net);
 			event->event = RNDIS_STATUS_MEDIA_CONNECT;
 			spin_lock_irqsave(&ndev_ctx->lock, flags);
 			list_add(&event->list, &ndev_ctx->reconfig_events);
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH hyperv-fixes] hv_netvsc: Fix unwanted wakeup after tx_disable
From: Stephen Hemminger @ 2019-03-28 18:38 UTC (permalink / raw)
  To: Haiyang Zhang
  Cc: sashal, linux-hyperv, haiyangz, kys, sthemmin, olaf, vkuznets,
	davem, netdev, linux-kernel
In-Reply-To: <20190328174845.4799-1-haiyangz@linuxonhyperv.com>

On Thu, 28 Mar 2019 17:48:45 +0000
Haiyang Zhang <haiyangz@linuxonhyperv.com> wrote:

> +static inline void netvsc_tx_enable(struct netvsc_device *nvscdev,
> +				    struct net_device *ndev)
> +{
> +	nvscdev->tx_disable = false;
> +	mb(); /* ensure queue wake up mechanism is on */
> +
> +	netif_tx_wake_all_queues(ndev);
> +}

You don't need a full mb(). virt_wmb() should be sufficient.

Could I suggest an alternative approach.
You don't need to introduce a local tx_disable flag, the only place where a wakeup
could cause problems is after a send_completion was processed during detach state.

Instead, just avoid wakeup in that place.

--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -720,6 +720,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
                struct netdev_queue *txq = netdev_get_tx_queue(ndev, q_idx);
 
                if (netif_tx_queue_stopped(txq) &&
+                   netif_device_present(ndev) &&
                    (hv_get_avail_to_write_percent(&channel->outbound) >
                     RING_AVAIL_PERCENT_HIWATER || queue_sends < 1)) {
                        netif_tx_wake_queue(txq);

^ permalink raw reply

* Re: [PATCH v3 3/3] Drivers: hv: vmbus: Fix race condition with new ring_buffer_info mutex
From: Stephen Hemminger @ 2019-03-28 18:42 UTC (permalink / raw)
  To: Kimberly Brown
  Cc: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui, KY Srinivasan, Haiyang Zhang,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20190328043057.GA2258@ubu-Virtual-Machine>

On Thu, 28 Mar 2019 00:30:57 -0400
Kimberly Brown <kimbrownkd@gmail.com> wrote:

> On Thu, Mar 21, 2019 at 04:04:20PM +0000, Michael Kelley wrote:
> > From: Kimberly Brown <kimbrownkd@gmail.com>  Sent: Wednesday, March 20, 2019 8:48 PM  
> > > > > > Adding more locks will solve the problem but it seems like overkill.
> > > > > > Why not either use a reference count or an RCU style access for the
> > > > > > ring buffer?  
> > > > >
> > > > > I agree that a reference count or RCU could also solve this problem.
> > > > > Using mutex locks seemed like the most straightforward solution, but
> > > > > I'll certainly switch to a different approach if it's better!
> > > > >
> > > > > Are you concerned about the extra memory required for the mutex locks,
> > > > > read performance, or something else?  
> > > >
> > > > Locks in control path are ok, but my concern is performance of the
> > > > data path which puts packets in/out of rings. To keep reasonable performance,
> > > > no additional locking should be added in those paths.
> > > >
> > > > So if data path is using RCU, can/should the control operations also
> > > > use it?  
> > >   
> 
> 
> Hi Stephen,
> 
> Do you have any additional questions or suggestions for this race
> condition and the mutex locks? I think that your initial questions were
> addressed in the responses below. If there's anything else, please let
> me know!
> 
> Thanks,
> Kim
> 
> 
> > > The data path doesn't use RCU to protect the ring buffers.  
> > 
> > My $.02:  The mutex is obtained only in the sysfs path and the "delete
> > ringbuffers" path, neither of which is performance or concurrency sensitive. 
> > There's no change to any path that reads or writes data from/to the ring
> > buffers.  It seems like the mutex is the most straightforward solution to
> > preventing sysfs from accessing the ring buffer info while the memory is
> > being freed as part of "delete ringbuffers".
> > 
> > Michael  


I have no problems with the patch you did.
My discussion was more around the general issues with ringbuffers being detached
from the device. Not sure if it was even a good design choice but that is
something that is hard to fix now.


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox