Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [PATCH v3 3/3] Drivers: hv: vmbus: Fix race condition with new ring_buffer_info mutex
From: Kimberly Brown @ 2019-03-21  3:47 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui, KY Srinivasan, Haiyang Zhang, linux-hyperv,
	linux-kernel
In-Reply-To: <20190320130619.07e49c97@shemminger-XPS-13-9360>

On Wed, Mar 20, 2019 at 01:06:19PM -0700, Stephen Hemminger wrote:
> On Sat, 16 Mar 2019 21:49:28 -0400
> Kimberly Brown <kimbrownkd@gmail.com> wrote:
> 
> > On Thu, Mar 14, 2019 at 03:45:33PM -0700, Stephen Hemminger wrote:
> > > On Thu, 14 Mar 2019 13:05:15 -0700
> > > "Kimberly Brown" <kimbrownkd@gmail.com> wrote:
> > >   
> > > > Fix a race condition that can result in a ring buffer pointer being set
> > > > to null while a "_show" function is reading the ring buffer's data. This
> > > > problem was discussed here:
> > > > https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org
> > > > %2Flkml%2F2018%2F10%2F18%2F779&amp;data=02%7C01%7Csthemmin%40microsoft.com
> > > > %7C1d7557d667b741bdbb6008d6a8b8620f%7C72f988bf86f141af91ab2d7cd011db47%7C1
> > > > %7C0%7C636881907217609564&amp;sdata=1bUbLaxsODANM7lCBR8lxyYajNpufuwUW%2FOl
> > > > vtGu2hU%3D&amp;reserved=0
> > > > 
> > > > To fix the race condition, add a new mutex lock to the
> > > > "hv_ring_buffer_info" struct. Add a new function,
> > > > "hv_ringbuffer_pre_init()", where a channel's inbound and outbound
> > > > ring_buffer_info mutex locks are initialized.
> > > >
> > > >  ... snip ...    
> > > 
> > > Adding more locks will solve the problem but it seems like overkill.
> > > Why not either use a reference count or an RCU style access for the
> > > ring buffer?  
> > 
> > I agree that a reference count or RCU could also solve this problem.
> > Using mutex locks seemed like the most straightforward solution, but
> > I'll certainly switch to a different approach if it's better!
> > 
> > Are you concerned about the extra memory required for the mutex locks,
> > read performance, or something else?
> 
> Locks in control path are ok, but my concern is performance of the
> data path which puts packets in/out of rings. To keep reasonable performance,
> no additional locking should be added in those paths.
> 
> So if data path is using RCU, can/should the control operations also
> use it?

The data path doesn't use RCU to protect the ring buffers.

^ permalink raw reply

* RE: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary
From: Dexuan Cui @ 2019-03-21  0:42 UTC (permalink / raw)
  To: Dexuan Cui, Michael Kelley, lorenzo.pieralisi@arm.com,
	bhelgaas@google.com, linux-pci@vger.kernel.org, KY Srinivasan,
	Stephen Hemminger, Sasha Levin
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <PU1P153MB01691FE3199135CE5C3BFFEDBF420@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

> From: linux-hyperv-owner@vger.kernel.org
> <linux-hyperv-owner@vger.kernel.org> On Behalf Of Dexuan Cui
> > ...
> > Patch 2 in this series does set it to NULL, but this code does not.
> In Patch2, i.e. in the code path hv_pci_remove() -> hv_pci_remove_slots(),
> we must set hpdev->pci_slot to NULL, otherwise, later, due to
> hv_pci_remove() -> hv_pci_bus_exit() ->
> hv_pci_devices_present() with the zero "relations", we'll double-free the
> "hpdev" struct in pci_devices_present_work() -- see the above.
A minor correction: I meant we'll call pci_destroy_slot(hpdev->pci_slot)
twice, not "double-free hpdev".

Thanks,
-- Dexuan

^ permalink raw reply

* RE: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary
From: Dexuan Cui @ 2019-03-21  0:35 UTC (permalink / raw)
  To: Michael Kelley, lorenzo.pieralisi@arm.com, bhelgaas@google.com,
	linux-pci@vger.kernel.org, KY Srinivasan, Stephen Hemminger,
	Sasha Levin
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <DM5PR2101MB091892992BB0431538A09A30D7410@DM5PR2101MB0918.namprd21.prod.outlook.com>

> From: Michael Kelley <mikelley@microsoft.com>
> Sent: Wednesday, March 20, 2019 2:44 PM
> To: Dexuan Cui <decui@microsoft.com>; lorenzo.pieralisi@arm.com;
> bhelgaas@google.com; linux-pci@vger.kernel.org; KY Srinivasan
> > ...
> > diff --git a/drivers/pci/controller/pci-hyperv.c
> > @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct
> work_struct *work)
> >  		hpdev = list_first_entry(&removed, struct hv_pci_dev,
> >  					 list_entry);
> >  		list_del(&hpdev->list_entry);
> > +
> > +		if (hpdev->pci_slot)
> > +			pci_destroy_slot(hpdev->pci_slot);
> 
> The code is inconsistent in whether hpdev->pci_slot is set to NULL after calling
> pci_destory_slot(). 
Here, in pci_devices_present_work(), it's unnecessary to set it to NULL,
Because:
1) the "hpdev" is removed from hbus->children and it can not be seen
elsewhere;
2) the "hpdev" struct is freed in the below put_pcichild():

     while (!list_empty(&removed)) {
                hpdev = list_first_entry(&removed, struct hv_pci_dev,
                                         list_entry);
                list_del(&hpdev->list_entry);

                if (hpdev->pci_slot)
                        pci_destroy_slot(hpdev->pci_slot);

                put_pcichild(hpdev);
        }

> Patch 2 in this series does set it to NULL, but this code does not.
In Patch2, i.e. in the code path hv_pci_remove() -> hv_pci_remove_slots(),
we must set hpdev->pci_slot to NULL, otherwise, later, due to
hv_pci_remove() -> hv_pci_bus_exit() ->
hv_pci_devices_present() with the zero "relations", we'll double-free the
"hpdev" struct in pci_devices_present_work() -- see the above.

> And the code in hv_eject_device_work() does not set it to NULL.
It's unnecessary to set hpdev->pci_slot to NULL in hv_eject_device_work(),
Because in hv_eject_device_work(): 
1) the "hpdev" is removed from hbus->children and it can not be seen
elsewhere;
2) the "hpdev" struct is freed at the end of hv_eject_device_work() with my
first patch: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work().
 
> It looks like all the places that test the value of hpdev->pci_slot or call
> pci_destroy_slot() are serialized, so it looks like it really doesn't matter.  But
> when
> the code is inconsistent about setting to NULL, it always makes me wonder if
> there
> is a reason.
> 
> Michael

Thanks,
-- Dexuan

^ permalink raw reply

* RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
From: Dexuan Cui @ 2019-03-21  0:12 UTC (permalink / raw)
  To: Michael Kelley, lorenzo.pieralisi@arm.com, bhelgaas@google.com,
	linux-pci@vger.kernel.org, KY Srinivasan, Stephen Hemminger,
	Sasha Levin
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <DM5PR2101MB0918F6A089C461EEA4457998D7410@DM5PR2101MB0918.namprd21.prod.outlook.com>

> From: Michael Kelley <mikelley@microsoft.com>
> Sent: Wednesday, March 20, 2019 2:38 PM
> 
> From: Dexuan Cui <decui@microsoft.com>
> >
> > After a device is just created in new_pcichild_device(), hpdev->refs is set
> > to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> >
> > When we hot remove the device from the host, in Linux VM we first call
> > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
> > then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
> > (let's ignore the paired get/put_pcichild() in other places). But in
> > hv_eject_device_work(), currently we only call put_pcichild() twice,
> > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> > adds one put_pcichild() to fix the memory leak.
> >
> > BTW, the device can also be removed when we run "rmmod pci-hyperv". On
> this
> > path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
> > hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> > pci_devices_present_work().
> 
> Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
> There is the reference in the hbus->children list, and there is the reference that
> is returned to the caller.  
So IMO the "normal" reference count should be 2. :-) IMO only when a hv_pci_dev
device is about to be destroyed, its reference count can drop to less than 2, 
i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed from
hbus->children), and then drop to zero (meaning kfree(hpdev) is called).

> But what is strange is that pci_devices_present_work()
> overwrites the reference returned in local variable hpdev without doing a
> put_pcichild().
I suppose you mean:

        /* First, mark all existing children as reported missing. */
        spin_lock_irqsave(&hbus->device_list_lock, flags);
        list_for_each_entry(hpdev, &hbus->children, list_entry) {
                hpdev->reported_missing = true;
        }
        spin_unlock_irqrestore(&hbus->device_list_lock, flags)

This is not strange to me, because, in pci_devices_present_work(), at first we
don't know which devices are about to disappear, so we pre-mark all devices to
be potentially missing like that; if a device is still on the bus, we'll mark its
hpdev->reported_missing to false later; only after we know exactly which
devices are missing, we should call put_pcichild() against them. All these
seem natural to me.

> It seems like the "normal" reference count should be 1 when the
> child device is not being manipulated, not 2.
What does "not being manipulated" mean?

> The fix would be to add a call to
> put_pcichild() when the return value from new_pcichild_device() is
> overwritten.
In pci_devices_present_work(), we NEVER "overwrite" the "hpdev" returned
from new_pcichild_device(): the "reported_missing" field of the new hpdev
is implicitly initialized to false in new_pcichild_device().

> Then remove the call to put_pcichild() in pci_device_present_work() when
> missing
> children are moved to the local list. The children have been moved from one
> list
> to another, so there's no need to decrement the reference count.  Then when
> everything in the local list is deleted, the reference is correctly decremented,
> presumably freeing the memory.
> 
> With this approach, the code in hv_eject_device_work() is correct.  There's
> one call to put_pcichild() to reflect removing the child device from the hbus->
> children list, and one call to put_pcichild() to pair with the get_pcichild() in
> hv_pci_eject_device().
Please refer to my replies above. IMO we should fix
hv_eject_device_work() rather than pci_devices_present_work().

Thanks
-- Dexuan
 
> Your patch works, but to me it leaves the ref count in an unnatural state
> most of the time.
> 
> Michael


^ permalink raw reply

* RE: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary
From: Michael Kelley @ 2019-03-20 21:44 UTC (permalink / raw)
  To: Dexuan Cui, lorenzo.pieralisi@arm.com, bhelgaas@google.com,
	linux-pci@vger.kernel.org, KY Srinivasan, Stephen Hemminger,
	Sasha Levin
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <20190304213357.16652-4-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com>  Sent: Monday, March 4, 2019 1:35 PM
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index b489412e3502..82acd6155adf 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct work_struct *work)
>  		hpdev = list_first_entry(&removed, struct hv_pci_dev,
>  					 list_entry);
>  		list_del(&hpdev->list_entry);
> +
> +		if (hpdev->pci_slot)
> +			pci_destroy_slot(hpdev->pci_slot);

The code is inconsistent in whether hpdev->pci_slot is set to NULL after calling
pci_destory_slot().  Patch 2 in this series does set it to NULL, but this code does not.
And the code in hv_eject_device_work() does not set it to NULL.

It looks like all the places that test the value of hpdev->pci_slot or call
pci_destroy_slot() are serialized, so it looks like it really doesn't matter.  But when
the code is inconsistent about setting to NULL, it always makes me wonder if there
is a reason.

Michael


> +
>  		put_pcichild(hpdev);
>  	}
> 
> --
> 2.19.1


^ permalink raw reply

* RE: [PATCH 2/3] PCI: hv: Add hv_pci_remove_slots() when we unload the driver
From: Michael Kelley @ 2019-03-20 21:38 UTC (permalink / raw)
  To: Dexuan Cui, lorenzo.pieralisi@arm.com, bhelgaas@google.com,
	linux-pci@vger.kernel.org, KY Srinivasan, Stephen Hemminger,
	Sasha Levin
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <20190304213357.16652-3-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Monday, March 4, 2019 1:35 PM
> 
> When we unload pci-hyperv, the host doesn't send us a PCI_EJECT message.
> In this case we also need to make sure the sysfs pci slot directory
> is removed, otherwise "cat /sys/bus/pci/slots/2/address" will trigger
> "BUG: unable to handle kernel paging request" (I noticed the issue when
> systemd-dev crashed for me when I unloaded the driver). And, if we
> unload/reload the driver several times, we'll have multiple pci slot
> directories in /sys/bus/pci/slots/ like this:
> 
> root@localhost:~# ls -rtl  /sys/bus/pci/slots/
> total 0
> drwxr-xr-x 2 root root 0 Feb  7 10:49 2
> drwxr-xr-x 2 root root 0 Feb  7 10:49 2-1
> drwxr-xr-x 2 root root 0 Feb  7 10:51 2-2
> 
> The patch adds the missing code.
> 
> Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Acked-by: Stephen Hemminger <sthemmin@microsoft.com>
> Cc: stable@vger.kernel.org

Reviewed-by: Michael Kelley <mikelley@microsoft.com>

^ permalink raw reply

* RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
From: Michael Kelley @ 2019-03-20 21:37 UTC (permalink / raw)
  To: Dexuan Cui, lorenzo.pieralisi@arm.com, bhelgaas@google.com,
	linux-pci@vger.kernel.org, KY Srinivasan, Stephen Hemminger,
	Sasha Levin
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Haiyang Zhang,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com, jackm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <20190304213357.16652-2-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com>
>
> After a device is just created in new_pcichild_device(), hpdev->refs is set
> to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> 
> When we hot remove the device from the host, in Linux VM we first call
> hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
> then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
> (let's ignore the paired get/put_pcichild() in other places). But in
> hv_eject_device_work(), currently we only call put_pcichild() twice,
> meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> adds one put_pcichild() to fix the memory leak.
> 
> BTW, the device can also be removed when we run "rmmod pci-hyperv". On this
> path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
> hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> pci_devices_present_work().
> 
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Cc: <stable@vger.kernel.org>

Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
There is the reference in the hbus->children list, and there is the reference that
is returned to the caller.  But what is strange is that pci_devices_present_work()
overwrites the reference returned in local variable hpdev without doing a
put_pcichild().  It seems like the "normal" reference count should be 1 when the
child device is not being manipulated, not 2.  The fix would be to add a call to
put_pcichild() when the return value from new_pcichild_device() is overwritten.
Then remove the call to put_pcichild() in pci_device_present_work() when missing
children are moved to the local list. The children have been moved from one list
to another, so there's no need to decrement the reference count.  Then when
everything in the local list is deleted, the reference is correctly decremented,
presumably freeing the memory.

With this approach, the code in hv_eject_device_work() is correct.  There's
one call to put_pcichild() to reflect removing the child device from the hbus->
children list, and one call to put_pcichild() to pair with the get_pcichild() in
hv_pci_eject_device().

Your patch works, but to me it leaves the ref count in an unnatural state
most of the time.

Michael

^ permalink raw reply

* RE: [PATCH v6] Drivers: hv: vmbus: Expose monitor data only when monitor pages are used
From: Michael Kelley @ 2019-03-20 20:18 UTC (permalink / raw)
  To: kimbrownkd, Long Li, Sasha Levin, Stephen Hemminger, Dexuan Cui,
	Greg KH
  Cc: KY Srinivasan, Haiyang Zhang, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190319040401.GA3050@ubu-Virtual-Machine>

From: Kimberly Brown <kimbrownkd@gmail.com> Sent: Monday, March 18, 2019 9:04 PM
> 
> There are two methods for signaling the host: the monitor page mechanism
> and hypercalls. The monitor page mechanism is used by performance
> critical channels (storage, networking, etc.) because it provides
> improved throughput. However, latency is increased. Monitor pages are
> allocated to these channels.
> 
> Monitor pages are not allocated to channels that do not use the monitor
> page mechanism. Therefore, these channels do not have a valid monitor id
> or valid monitor page data. In these cases, some of the "_show"
> functions return incorrect data. They return an invalid monitor id and
> data that is beyond the bounds of the hv_monitor_page array fields.
> 
> The "channel->offermsg.monitor_allocated" value can be used to determine
> whether monitor pages have been allocated to a channel.
> 
> Add "is_visible()" callback functions for the device-level and
> channel-level attribute groups. These functions will hide the monitor
> sysfs files when the monitor mechanism is not used.
> 
> Remove ".default_attributes" from "vmbus_chan_attrs" and create a
> channel-level attribute group. These changes allow the new
> "is_visible()" callback function to be applied to the channel-level
> attributes.
> 
> Call "sysfs_create_group()" in "vmbus_add_channel_kobj()" to create the
> channel's sysfs files. Add a new function,
> “vmbus_remove_channel_attr_group()”, and call it in "free_channel()" to
> remove the channel's sysfs files when the channel is closed.
> 
> Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>

Reviewed-by: Michael Kelley <mikelley@microsoft.com>

^ permalink raw reply

* Re: [PATCH v3 3/3] Drivers: hv: vmbus: Fix race condition with new ring_buffer_info mutex
From: Stephen Hemminger @ 2019-03-20 20:06 UTC (permalink / raw)
  To: Kimberly Brown
  Cc: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui, KY Srinivasan, Haiyang Zhang, linux-hyperv,
	linux-kernel
In-Reply-To: <20190317014927.GA60356@ubu-Virtual-Machine>

On Sat, 16 Mar 2019 21:49:28 -0400
Kimberly Brown <kimbrownkd@gmail.com> wrote:

> On Thu, Mar 14, 2019 at 03:45:33PM -0700, Stephen Hemminger wrote:
> > On Thu, 14 Mar 2019 13:05:15 -0700
> > "Kimberly Brown" <kimbrownkd@gmail.com> wrote:
> >   
> > > Fix a race condition that can result in a ring buffer pointer being set
> > > to null while a "_show" function is reading the ring buffer's data. This
> > > problem was discussed here:
> > > https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org
> > > %2Flkml%2F2018%2F10%2F18%2F779&amp;data=02%7C01%7Csthemmin%40microsoft.com
> > > %7C1d7557d667b741bdbb6008d6a8b8620f%7C72f988bf86f141af91ab2d7cd011db47%7C1
> > > %7C0%7C636881907217609564&amp;sdata=1bUbLaxsODANM7lCBR8lxyYajNpufuwUW%2FOl
> > > vtGu2hU%3D&amp;reserved=0
> > > 
> > > To fix the race condition, add a new mutex lock to the
> > > "hv_ring_buffer_info" struct. Add a new function,
> > > "hv_ringbuffer_pre_init()", where a channel's inbound and outbound
> > > ring_buffer_info mutex locks are initialized.
> > >
> > >  ... snip ...    
> > 
> > Adding more locks will solve the problem but it seems like overkill.
> > Why not either use a reference count or an RCU style access for the
> > ring buffer?  
> 
> I agree that a reference count or RCU could also solve this problem.
> Using mutex locks seemed like the most straightforward solution, but
> I'll certainly switch to a different approach if it's better!
> 
> Are you concerned about the extra memory required for the mutex locks,
> read performance, or something else?

Locks in control path are ok, but my concern is performance of the
data path which puts packets in/out of rings. To keep reasonable performance,
no additional locking should be added in those paths.

So if data path is using RCU, can/should the control operations also
use it?

^ permalink raw reply

* Re: [PATCH] hyperv: a potential NULL pointer dereference
From: Mukesh Ojha @ 2019-03-20 15:50 UTC (permalink / raw)
  To: Kangjie Lu
  Cc: pakki001, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, x86, linux-hyperv, linux-kernel
In-Reply-To: <20190314054651.1315-1-kjlu@umn.edu>


On 3/14/2019 11:16 AM, Kangjie Lu wrote:
> In case alloc_page, the fix returns -ENOMEM to avoid the potential
> NULL pointer dereference.
>
> Signed-off-by: Kangjie Lu <kjlu@umn.edu>
> ---
>   arch/x86/hyperv/hv_init.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 7abb09e2eeb8..dfdb4ce1ae9c 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -102,9 +102,13 @@ static int hv_cpu_init(unsigned int cpu)
>   	u64 msr_vp_index;
>   	struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
>   	void **input_arg;
> +	struct page *pg;
>   
>   	input_arg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
> -	*input_arg = page_address(alloc_page(GFP_KERNEL));
> +	pg = alloc_page(GFP_KERNEL);
> +	if (unlikely(!pg))
> +		return -ENOMEM;
> +	*input_arg = page_address(pg);
>   
>   	hv_get_vp_index(msr_vp_index);



Looks good to me.
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>


Thanks.
Mukesh


>   

^ permalink raw reply

* RE: [PATCH] hyperv: a potential NULL pointer dereference
From: KY Srinivasan @ 2019-03-20 14:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kangjie Lu, pakki001@umn.edu, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	x86@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <alpine.DEB.2.21.1903201119240.1906@nanos.tec.linutronix.de>



> -----Original Message-----
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Wednesday, March 20, 2019 3:21 AM
> To: KY Srinivasan <kys@microsoft.com>
> Cc: Kangjie Lu <kjlu@umn.edu>; pakki001@umn.edu; Haiyang Zhang
> <haiyangz@microsoft.com>; Stephen Hemminger
> <sthemmin@microsoft.com>; Sasha Levin <sashal@kernel.org>; Ingo Molnar
> <mingo@redhat.com>; Borislav Petkov <bp@alien8.de>; H. Peter Anvin
> <hpa@zytor.com>; x86@kernel.org; linux-hyperv@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: RE: [PATCH] hyperv: a potential NULL pointer dereference
> 
> On Thu, 14 Mar 2019, KY Srinivasan wrote:
> > > -----Original Message-----
> > > From: Kangjie Lu <kjlu@umn.edu>
> > > Sent: Wednesday, March 13, 2019 10:47 PM
> > > To: kjlu@umn.edu
> > > Cc: pakki001@umn.edu; KY Srinivasan <kys@microsoft.com>; Haiyang
> Zhang
> > > <haiyangz@microsoft.com>; Stephen Hemminger
> > > <sthemmin@microsoft.com>; Sasha Levin <sashal@kernel.org>; Thomas
> > > Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>;
> Borislav
> > > Petkov <bp@alien8.de>; H. Peter Anvin <hpa@zytor.com>;
> x86@kernel.org;
> > > linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org
> > > Subject: [PATCH] hyperv: a potential NULL pointer dereference
> > >
> > > In case alloc_page, the fix returns -ENOMEM to avoid the potential
> > > NULL pointer dereference.
> > >
> > Thanks.
> >
> > > Signed-off-by: Kangjie Lu <kjlu@umn.edu>
> > Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
> 
> Did you mean: Reviewed-by or Acked-by?
Sorry, I meant Acked-by.

K. Y
> 
> You cannot sign off on a patch from
> someone else which you are not picking up and transporting it further.
> 
> Thanks,
> 
> 	tglx

^ permalink raw reply

* RE: [PATCH] hyperv: a potential NULL pointer dereference
From: Thomas Gleixner @ 2019-03-20 10:20 UTC (permalink / raw)
  To: KY Srinivasan
  Cc: Kangjie Lu, pakki001@umn.edu, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	x86@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <CY4PR21MB0773FB535B3F4177CA9662C0A04B0@CY4PR21MB0773.namprd21.prod.outlook.com>

On Thu, 14 Mar 2019, KY Srinivasan wrote:
> > -----Original Message-----
> > From: Kangjie Lu <kjlu@umn.edu>
> > Sent: Wednesday, March 13, 2019 10:47 PM
> > To: kjlu@umn.edu
> > Cc: pakki001@umn.edu; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> > <haiyangz@microsoft.com>; Stephen Hemminger
> > <sthemmin@microsoft.com>; Sasha Levin <sashal@kernel.org>; Thomas
> > Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; Borislav
> > Petkov <bp@alien8.de>; H. Peter Anvin <hpa@zytor.com>; x86@kernel.org;
> > linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org
> > Subject: [PATCH] hyperv: a potential NULL pointer dereference
> > 
> > In case alloc_page, the fix returns -ENOMEM to avoid the potential
> > NULL pointer dereference.
> > 
> Thanks.
> 
> > Signed-off-by: Kangjie Lu <kjlu@umn.edu>
> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>

Did you mean: Reviewed-by or Acked-by?

You cannot sign off on a patch from
someone else which you are not picking up and transporting it further.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH v6] Drivers: hv: vmbus: Expose monitor data only when monitor pages are used
From: Greg KH @ 2019-03-19  9:26 UTC (permalink / raw)
  To: Kimberly Brown
  Cc: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui, K. Y. Srinivasan, Haiyang Zhang, linux-hyperv,
	linux-kernel
In-Reply-To: <20190319040401.GA3050@ubu-Virtual-Machine>

On Tue, Mar 19, 2019 at 12:04:01AM -0400, Kimberly Brown wrote:
> There are two methods for signaling the host: the monitor page mechanism
> and hypercalls. The monitor page mechanism is used by performance
> critical channels (storage, networking, etc.) because it provides
> improved throughput. However, latency is increased. Monitor pages are
> allocated to these channels.
> 
> Monitor pages are not allocated to channels that do not use the monitor
> page mechanism. Therefore, these channels do not have a valid monitor id
> or valid monitor page data. In these cases, some of the "_show"
> functions return incorrect data. They return an invalid monitor id and
> data that is beyond the bounds of the hv_monitor_page array fields.
> 
> The "channel->offermsg.monitor_allocated" value can be used to determine
> whether monitor pages have been allocated to a channel.
> 
> Add "is_visible()" callback functions for the device-level and
> channel-level attribute groups. These functions will hide the monitor
> sysfs files when the monitor mechanism is not used.
> 
> Remove ".default_attributes" from "vmbus_chan_attrs" and create a
> channel-level attribute group. These changes allow the new
> "is_visible()" callback function to be applied to the channel-level
> attributes.
> 
> Call "sysfs_create_group()" in "vmbus_add_channel_kobj()" to create the
> channel's sysfs files. Add a new function,
> “vmbus_remove_channel_attr_group()”, and call it in "free_channel()" to
> remove the channel's sysfs files when the channel is closed.
> 
> Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>

Nice work, thanks for all of the changes you have made to this over
time.

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


^ permalink raw reply

* [PATCH v6] Drivers: hv: vmbus: Expose monitor data only when monitor pages are used
From: Kimberly Brown @ 2019-03-19  4:04 UTC (permalink / raw)
  To: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui, Greg KH
  Cc: K. Y. Srinivasan, Haiyang Zhang, linux-hyperv, linux-kernel
In-Reply-To: <20190308224611.GA3047@ubu-Virtual-Machine>

There are two methods for signaling the host: the monitor page mechanism
and hypercalls. The monitor page mechanism is used by performance
critical channels (storage, networking, etc.) because it provides
improved throughput. However, latency is increased. Monitor pages are
allocated to these channels.

Monitor pages are not allocated to channels that do not use the monitor
page mechanism. Therefore, these channels do not have a valid monitor id
or valid monitor page data. In these cases, some of the "_show"
functions return incorrect data. They return an invalid monitor id and
data that is beyond the bounds of the hv_monitor_page array fields.

The "channel->offermsg.monitor_allocated" value can be used to determine
whether monitor pages have been allocated to a channel.

Add "is_visible()" callback functions for the device-level and
channel-level attribute groups. These functions will hide the monitor
sysfs files when the monitor mechanism is not used.

Remove ".default_attributes" from "vmbus_chan_attrs" and create a
channel-level attribute group. These changes allow the new
"is_visible()" callback function to be applied to the channel-level
attributes.

Call "sysfs_create_group()" in "vmbus_add_channel_kobj()" to create the
channel's sysfs files. Add a new function,
“vmbus_remove_channel_attr_group()”, and call it in "free_channel()" to
remove the channel's sysfs files when the channel is closed.

Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>
---

Changes in v6:
 - Removed the “sysfs_group_ready” flag proposed in v5. The flag was
   used to verify that the attribute group was created, which is
   unnecessary; sysfs_remove_group() can be called even when
   sysfs_create_group() fails.
 - Replaced pr_err() with dev_err() as suggested by G. Kroah-Hartman.
 - Shortened the error path comment in vmbus_add_channel_kobj().
 - Updated the commit message.

Changes in v5:
 - Added the “bool sysfs_group_ready” field to vmbus_channel.
 - Added the “vmbus_remove_channel_attr_group()” function which calls
   "sysfs_remove_group()".
 - Added a comment to "vmbus_add_channel_kobj()" to describe how the
   empty directory is removed if "sysfs_create_group()" returns an
   error.
 - Updated the commit message.

NOTE: “.default_attrs” must be removed from vmbus_chan_ktype in order to
use the is_visible() function because "default_attrs" is an array of
attributes, not an attribute_group.

Changes in v4:
 - Added “is_visible()” callback functions for the device-level and
   channel-level attribute groups.
 - Removed the separate monitor attribute groups proposed in v3. They’re
   no longer needed because the “is_visible()” callbacks are used to
   determine the attribute visibility.
 - Removed "default_attributes" from "vmbus_chan_attrs" and created a
   channel-level attribute group.
 - Removed the "kobject_put(kobj)" call proposed in v3. The calling
   functions take care of calling "kobject_put(channel->kobj)" if an
   error is returned by "vmbus_add_channel_kobj()".
 - Updated the commit message and subject for clarity and to reflect the
   new changes in v4.

Changes in v3:
 - The monitor "_show" functions no longer return an error when a
   channel does not use the monitor page mechanism. Instead, the monitor
   page sysfs files are created only when a channel uses the monitor
   page mechanism. This change was suggested by G. Kroah-Hartman.

Note: this patch was originally patch 2/2 in a patchset. Patch 1/2 has
already been added to char-misc-testing, so I'm not resending it.

Changes in v2:
 - Changed the return value for cases where monitor_allocated is not set
   to "-EINVAL".
 - Updated the commit message to provide more details about the monitor
   page mechanism.
 - Updated the sysfs documentation to describe the new return value.


Documentation/ABI/stable/sysfs-bus-vmbus | 12 +++-
 drivers/hv/channel_mgmt.c                |  1 +
 drivers/hv/hyperv_vmbus.h                |  2 +
 drivers/hv/vmbus_drv.c                   | 77 +++++++++++++++++++++++-
 4 files changed, 87 insertions(+), 5 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-bus-vmbus b/Documentation/ABI/stable/sysfs-bus-vmbus
index 826689dcc2e6..8e8d167eca31 100644
--- a/Documentation/ABI/stable/sysfs-bus-vmbus
+++ b/Documentation/ABI/stable/sysfs-bus-vmbus
@@ -81,7 +81,9 @@ What:		/sys/bus/vmbus/devices/<UUID>/channels/<N>/latency
 Date:		September. 2017
 KernelVersion:	4.14
 Contact:	Stephen Hemminger <sthemmin@microsoft.com>
-Description:	Channel signaling latency
+Description:	Channel signaling latency. This file is available only for
+		performance critical channels (storage, network, etc.) that use
+		the monitor page mechanism.
 Users:		Debugging tools
 
 What:		/sys/bus/vmbus/devices/<UUID>/channels/<N>/out_mask
@@ -95,7 +97,9 @@ What:		/sys/bus/vmbus/devices/<UUID>/channels/<N>/pending
 Date:		September. 2017
 KernelVersion:	4.14
 Contact:	Stephen Hemminger <sthemmin@microsoft.com>
-Description:	Channel interrupt pending state
+Description:	Channel interrupt pending state. This file is available only for
+		performance critical channels (storage, network, etc.) that use
+		the monitor page mechanism.
 Users:		Debugging tools
 
 What:		/sys/bus/vmbus/devices/<UUID>/channels/<N>/read_avail
@@ -137,7 +141,9 @@ What:		/sys/bus/vmbus/devices/<UUID>/channels/<N>/monitor_id
 Date:		January. 2018
 KernelVersion:	4.16
 Contact:	Stephen Hemminger <sthemmin@microsoft.com>
-Description:	Monitor bit associated with channel
+Description:	Monitor bit associated with channel. This file is available only
+		for performance critical channels (storage, network, etc.) that
+		use the monitor page mechanism.
 Users:		Debugging tools and userspace drivers
 
 What:		/sys/bus/vmbus/devices/<UUID>/channels/<N>/ring
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index f174b38f390f..3fc0b247a807 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -347,6 +347,7 @@ static struct vmbus_channel *alloc_channel(void)
 static void free_channel(struct vmbus_channel *channel)
 {
 	tasklet_kill(&channel->callback_event);
+	vmbus_remove_channel_attr_group(channel);
 
 	kobject_put(&channel->kobj);
 }
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 50fabf4dcb75..e5467b821f41 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -322,6 +322,8 @@ void vmbus_device_unregister(struct hv_device *device_obj);
 int vmbus_add_channel_kobj(struct hv_device *device_obj,
 			   struct vmbus_channel *channel);
 
+void vmbus_remove_channel_attr_group(struct vmbus_channel *channel);
+
 struct vmbus_channel *relid2channel(u32 relid);
 
 void vmbus_free_channels(void);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 41f782af73ea..aa25f3bcbdea 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -630,7 +630,36 @@ static struct attribute *vmbus_dev_attrs[] = {
 	&dev_attr_driver_override.attr,
 	NULL,
 };
-ATTRIBUTE_GROUPS(vmbus_dev);
+
+/*
+ * Device-level attribute_group callback function. Returns the permission for
+ * each attribute, and returns 0 if an attribute is not visible.
+ */
+static umode_t vmbus_dev_attr_is_visible(struct kobject *kobj,
+					 struct attribute *attr, int idx)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	const struct hv_device *hv_dev = device_to_hv_device(dev);
+
+	/* Hide the monitor attributes if the monitor mechanism is not used. */
+	if (!hv_dev->channel->offermsg.monitor_allocated &&
+	    (attr == &dev_attr_monitor_id.attr ||
+	     attr == &dev_attr_server_monitor_pending.attr ||
+	     attr == &dev_attr_client_monitor_pending.attr ||
+	     attr == &dev_attr_server_monitor_latency.attr ||
+	     attr == &dev_attr_client_monitor_latency.attr ||
+	     attr == &dev_attr_server_monitor_conn_id.attr ||
+	     attr == &dev_attr_client_monitor_conn_id.attr))
+		return 0;
+
+	return attr->mode;
+}
+
+static const struct attribute_group vmbus_dev_group = {
+	.attrs = vmbus_dev_attrs,
+	.is_visible = vmbus_dev_attr_is_visible
+};
+__ATTRIBUTE_GROUPS(vmbus_dev);
 
 /*
  * vmbus_uevent - add uevent for our device
@@ -1583,10 +1612,34 @@ static struct attribute *vmbus_chan_attrs[] = {
 	NULL
 };
 
+/*
+ * Channel-level attribute_group callback function. Returns the permission for
+ * each attribute, and returns 0 if an attribute is not visible.
+ */
+static umode_t vmbus_chan_attr_is_visible(struct kobject *kobj,
+					  struct attribute *attr, int idx)
+{
+	const struct vmbus_channel *channel =
+		container_of(kobj, struct vmbus_channel, kobj);
+
+	/* Hide the monitor attributes if the monitor mechanism is not used. */
+	if (!channel->offermsg.monitor_allocated &&
+	    (attr == &chan_attr_pending.attr ||
+	     attr == &chan_attr_latency.attr ||
+	     attr == &chan_attr_monitor_id.attr))
+		return 0;
+
+	return attr->mode;
+}
+
+static struct attribute_group vmbus_chan_group = {
+	.attrs = vmbus_chan_attrs,
+	.is_visible = vmbus_chan_attr_is_visible
+};
+
 static struct kobj_type vmbus_chan_ktype = {
 	.sysfs_ops = &vmbus_chan_sysfs_ops,
 	.release = vmbus_chan_release,
-	.default_attrs = vmbus_chan_attrs,
 };
 
 /*
@@ -1594,6 +1647,7 @@ static struct kobj_type vmbus_chan_ktype = {
  */
 int vmbus_add_channel_kobj(struct hv_device *dev, struct vmbus_channel *channel)
 {
+	const struct device *device = &dev->device;
 	struct kobject *kobj = &channel->kobj;
 	u32 relid = channel->offermsg.child_relid;
 	int ret;
@@ -1604,11 +1658,30 @@ int vmbus_add_channel_kobj(struct hv_device *dev, struct vmbus_channel *channel)
 	if (ret)
 		return ret;
 
+	ret = sysfs_create_group(kobj, &vmbus_chan_group);
+
+	if (ret) {
+		/*
+		 * The calling functions' error handling paths will cleanup the
+		 * empty channel directory.
+		 */
+		dev_err(device, "Unable to set up channel sysfs files\n");
+		return ret;
+	}
+
 	kobject_uevent(kobj, KOBJ_ADD);
 
 	return 0;
 }
 
+/*
+ * vmbus_remove_channel_attr_group - remove the channel's attribute group
+ */
+void vmbus_remove_channel_attr_group(struct vmbus_channel *channel)
+{
+	sysfs_remove_group(&channel->kobj, &vmbus_chan_group);
+}
+
 /*
  * vmbus_device_create - Creates and registers a new child device
  * on the vmbus.
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 2/2] Drivers: hv: Move Hyper-V clocksource code to new clocksource driver
From: Michael Kelley @ 2019-03-17 19:06 UTC (permalink / raw)
  To: will.deacon@arm.com, marc.zyngier@arm.com,
	linux-arm-kernel@lists.infradead.org, gregkh@linuxfoundation.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	olaf@aepfle.de, apw@canonical.com, vkuznets, jasowang@redhat.com,
	marcelo.cerri@canonical.com, Sunil Muthuswamy, KY Srinivasan
  Cc: Michael Kelley, catalin.marinas@arm.com, mark.rutland@arm.com
In-Reply-To: <1552849512-10063-1-git-send-email-mikelley@microsoft.com>

Code for the Hyper-V specific clocksources is currently mixed
in with other Hyper-V code. Move the code to the Hyper-V specific
driver in the "clocksource" directory, while separating out
ISA dependencies so that the clocksource driver remains ISA
independent. Update the Hyper-V initialization code to call
initialization and cleanup routines since the Hyper-V synthetic
timers are not independently enumerated in ACPI. Update Hyper-V
clocksource users KVM and VDSO to get definitions from a new
include file.

No behavior is changed and no new functionality is added.

Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
---
 arch/x86/entry/vdso/vclock_gettime.c |   1 +
 arch/x86/entry/vdso/vma.c            |   2 +-
 arch/x86/hyperv/hv_init.c            |  91 ++------------------------
 arch/x86/include/asm/mshyperv.h      |  80 ++++-------------------
 arch/x86/kvm/x86.c                   |   1 +
 drivers/clocksource/hyperv_timer.c   | 122 +++++++++++++++++++++++++++++++++++
 include/clocksource/hyperv_timer.h   |  78 ++++++++++++++++++++++
 7 files changed, 219 insertions(+), 156 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 007b3fe9..59bb5d1 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -21,6 +21,7 @@
 #include <linux/math64.h>
 #include <linux/time.h>
 #include <linux/kernel.h>
+#include <clocksource/hyperv_timer.h>
 
 #define gtod (&VVAR(vsyscall_gtod_data))
 
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index babc4e7..99b38e9 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -22,7 +22,7 @@
 #include <asm/page.h>
 #include <asm/desc.h>
 #include <asm/cpufeature.h>
-#include <asm/mshyperv.h>
+#include <clocksource/hyperv_timer.h>
 
 #if defined(CONFIG_X86_64)
 unsigned int __read_mostly vdso64_enabled = 1;
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 7abb09e..8e209c3 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -27,64 +27,13 @@
 #include <linux/version.h>
 #include <linux/vmalloc.h>
 #include <linux/mm.h>
-#include <linux/clockchips.h>
 #include <linux/hyperv.h>
 #include <linux/slab.h>
 #include <linux/cpuhotplug.h>
-
-#ifdef CONFIG_HYPERV_TSCPAGE
-
-static struct ms_hyperv_tsc_page *tsc_pg;
-
-struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
-{
-	return tsc_pg;
-}
-EXPORT_SYMBOL_GPL(hv_get_tsc_page);
-
-static u64 read_hv_clock_tsc(struct clocksource *arg)
-{
-	u64 current_tick = hv_read_tsc_page(tsc_pg);
-
-	if (current_tick == U64_MAX)
-		rdmsrl(HV_X64_MSR_TIME_REF_COUNT, current_tick);
-
-	return current_tick;
-}
-
-static struct clocksource hyperv_cs_tsc = {
-		.name		= "hyperv_clocksource_tsc_page",
-		.rating		= 400,
-		.read		= read_hv_clock_tsc,
-		.mask		= CLOCKSOURCE_MASK(64),
-		.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
-};
-#endif
-
-static u64 read_hv_clock_msr(struct clocksource *arg)
-{
-	u64 current_tick;
-	/*
-	 * Read the partition counter to get the current tick count. This count
-	 * is set to 0 when the partition is created and is incremented in
-	 * 100 nanosecond units.
-	 */
-	rdmsrl(HV_X64_MSR_TIME_REF_COUNT, current_tick);
-	return current_tick;
-}
-
-static struct clocksource hyperv_cs_msr = {
-	.name		= "hyperv_clocksource_msr",
-	.rating		= 400,
-	.read		= read_hv_clock_msr,
-	.mask		= CLOCKSOURCE_MASK(64),
-	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
-};
+#include <clocksource/hyperv_timer.h>
 
 void *hv_hypercall_pg;
 EXPORT_SYMBOL_GPL(hv_hypercall_pg);
-struct clocksource *hyperv_cs;
-EXPORT_SYMBOL_GPL(hyperv_cs);
 
 u32 *hv_vp_index;
 EXPORT_SYMBOL_GPL(hv_vp_index);
@@ -349,41 +298,11 @@ void __init hyperv_init(void)
 	x86_init.pci.arch_init = hv_pci_init;
 
 	/*
-	 * Register Hyper-V specific clocksource.
+	 * Register Hyper-V specific clocksource. Pass 'false' as the
+	 * arguemnt, indicating to not register the clocksource as the
+	 * sched clock.
 	 */
-#ifdef CONFIG_HYPERV_TSCPAGE
-	if (ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE) {
-		union hv_x64_msr_hypercall_contents tsc_msr;
-
-		tsc_pg = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL);
-		if (!tsc_pg)
-			goto register_msr_cs;
-
-		hyperv_cs = &hyperv_cs_tsc;
-
-		rdmsrl(HV_X64_MSR_REFERENCE_TSC, tsc_msr.as_uint64);
-
-		tsc_msr.enable = 1;
-		tsc_msr.guest_physical_address = vmalloc_to_pfn(tsc_pg);
-
-		wrmsrl(HV_X64_MSR_REFERENCE_TSC, tsc_msr.as_uint64);
-
-		hyperv_cs_tsc.archdata.vclock_mode = VCLOCK_HVCLOCK;
-
-		clocksource_register_hz(&hyperv_cs_tsc, NSEC_PER_SEC/100);
-		return;
-	}
-register_msr_cs:
-#endif
-	/*
-	 * For 32 bit guests just use the MSR based mechanism for reading
-	 * the partition counter.
-	 */
-
-	hyperv_cs = &hyperv_cs_msr;
-	if (ms_hyperv.features & HV_MSR_TIME_REF_COUNT_AVAILABLE)
-		clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
-
+	hv_init_clocksource(false);
 	return;
 
 remove_cpuhp_state:
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index cc60e61..9326689 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -105,6 +105,17 @@ static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)
 #define hv_get_crash_ctl(val) \
 	rdmsrl(HV_X64_MSR_CRASH_CTL, val)
 
+#define hv_get_time_ref_count(val) \
+	rdmsrl(HV_X64_MSR_TIME_REF_COUNT, val)
+
+#define hv_get_reference_tsc(val) \
+	rdmsrl(HV_X64_MSR_REFERENCE_TSC, val)
+#define hv_set_reference_tsc(val) \
+	wrmsrl(HV_X64_MSR_REFERENCE_TSC, val)
+#define hv_set_clocksource_vdso(val) \
+	((val).archdata.vclock_mode = VCLOCK_HVCLOCK)
+#define hv_get_raw_timer() rdtsc_ordered()
+
 void hyperv_callback_vector(void);
 void hyperv_reenlightenment_vector(void);
 #ifdef CONFIG_TRACING
@@ -387,73 +398,4 @@ static inline int hyperv_flush_guest_mapping_range(u64 as,
 }
 #endif /* CONFIG_HYPERV */
 
-#ifdef CONFIG_HYPERV_TSCPAGE
-struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
-static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
-				       u64 *cur_tsc)
-{
-	u64 scale, offset;
-	u32 sequence;
-
-	/*
-	 * The protocol for reading Hyper-V TSC page is specified in Hypervisor
-	 * Top-Level Functional Specification ver. 3.0 and above. To get the
-	 * reference time we must do the following:
-	 * - READ ReferenceTscSequence
-	 *   A special '0' value indicates the time source is unreliable and we
-	 *   need to use something else. The currently published specification
-	 *   versions (up to 4.0b) contain a mistake and wrongly claim '-1'
-	 *   instead of '0' as the special value, see commit c35b82ef0294.
-	 * - ReferenceTime =
-	 *        ((RDTSC() * ReferenceTscScale) >> 64) + ReferenceTscOffset
-	 * - READ ReferenceTscSequence again. In case its value has changed
-	 *   since our first reading we need to discard ReferenceTime and repeat
-	 *   the whole sequence as the hypervisor was updating the page in
-	 *   between.
-	 */
-	do {
-		sequence = READ_ONCE(tsc_pg->tsc_sequence);
-		if (!sequence)
-			return U64_MAX;
-		/*
-		 * Make sure we read sequence before we read other values from
-		 * TSC page.
-		 */
-		smp_rmb();
-
-		scale = READ_ONCE(tsc_pg->tsc_scale);
-		offset = READ_ONCE(tsc_pg->tsc_offset);
-		*cur_tsc = rdtsc_ordered();
-
-		/*
-		 * Make sure we read sequence after we read all other values
-		 * from TSC page.
-		 */
-		smp_rmb();
-
-	} while (READ_ONCE(tsc_pg->tsc_sequence) != sequence);
-
-	return mul_u64_u64_shr(*cur_tsc, scale, 64) + offset;
-}
-
-static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
-{
-	u64 cur_tsc;
-
-	return hv_read_tsc_page_tsc(tsc_pg, &cur_tsc);
-}
-
-#else
-static inline struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
-{
-	return NULL;
-}
-
-static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
-				       u64 *cur_tsc)
-{
-	BUG();
-	return U64_MAX;
-}
-#endif
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 941f932..9e2b2c6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -70,6 +70,7 @@
 #include <asm/mshyperv.h>
 #include <asm/hypervisor.h>
 #include <asm/intel_pt.h>
+#include <clocksource/hyperv_timer.h>
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index e856529..95ed828 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -14,6 +14,8 @@
 #include <linux/percpu.h>
 #include <linux/cpumask.h>
 #include <linux/clockchips.h>
+#include <linux/clocksource.h>
+#include <linux/sched_clock.h>
 #include <linux/cpuhotplug.h>
 #include <linux/mm.h>
 #include <clocksource/hyperv_timer.h>
@@ -204,3 +206,123 @@ void hv_stimer_global_cleanup(void)
 	hv_stimer_free();
 }
 EXPORT_SYMBOL_GPL(hv_stimer_global_cleanup);
+
+/*
+ * Code and definitions for the Hyper-V clocksources.  Two
+ * clocksources are defined: one that reads the Hyper-V defined MSR, and
+ * the other that uses the TSC reference page feature as defined in the
+ * TLFS.  The MSR version is for compatibility with old versions of
+ * Hyper-V and 32-bit x86.  The TSC reference page version is preferred.
+ */
+
+struct clocksource *hyperv_cs;
+EXPORT_SYMBOL_GPL(hyperv_cs);
+
+#ifdef CONFIG_HYPERV_TSCPAGE
+
+static struct ms_hyperv_tsc_page *tsc_pg;
+
+struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
+{
+	return tsc_pg;
+}
+EXPORT_SYMBOL_GPL(hv_get_tsc_page);
+
+static u64 read_hv_sched_clock_tsc(void)
+{
+	u64 current_tick = hv_read_tsc_page(tsc_pg);
+
+	if (current_tick == U64_MAX)
+		hv_get_time_ref_count(current_tick);
+
+	return current_tick;
+}
+
+static u64 read_hv_clock_tsc(struct clocksource *arg)
+{
+	return read_hv_sched_clock_tsc();
+}
+
+static struct clocksource hyperv_cs_tsc = {
+		.name		= "hyperv_clocksource_tsc_page",
+		.rating		= 400,
+		.read		= read_hv_clock_tsc,
+		.mask		= CLOCKSOURCE_MASK(64),
+		.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
+};
+#endif
+
+static u64 read_hv_sched_clock_msr(void)
+{
+	u64 current_tick;
+	/*
+	 * Read the partition counter to get the current tick count. This count
+	 * is set to 0 when the partition is created and is incremented in
+	 * 100 nanosecond units.
+	 */
+	hv_get_time_ref_count(current_tick);
+	return current_tick;
+}
+
+static u64 read_hv_clock_msr(struct clocksource *arg)
+{
+	return read_hv_sched_clock_msr();
+}
+
+static struct clocksource hyperv_cs_msr = {
+	.name		= "hyperv_clocksource_msr",
+	.rating		= 400,
+	.read		= read_hv_clock_msr,
+	.mask		= CLOCKSOURCE_MASK(64),
+	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
+};
+
+void __init hv_init_clocksource(bool reg_sched_clock)
+{
+#ifdef CONFIG_HYPERV_TSCPAGE
+	if (ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE) {
+
+		u64		tsc_msr;
+		phys_addr_t	phys_addr;
+
+		tsc_pg = vmalloc(PAGE_SIZE);
+		if (!tsc_pg)
+			goto register_msr_cs;
+
+		hyperv_cs = &hyperv_cs_tsc;
+		phys_addr = page_to_phys(vmalloc_to_page(tsc_pg));
+
+		/* The Hyper-V TLFS specifies to preserve the value of
+		 * reserved bits in registers.  So read the existing
+		 * value, preserve the low order 12 bits, and add in
+		 * the guest physical address (which already has at
+		 * least the low 12 bits set to zero since it is page
+		 * aligned). Also set the "enable" bit, which is bit 0.
+		 */
+		hv_get_reference_tsc(tsc_msr);
+		tsc_msr &= GENMASK_ULL(11, 0);
+		tsc_msr = tsc_msr | 0x1 | (u64)phys_addr;
+		hv_set_reference_tsc(tsc_msr);
+
+		hv_set_clocksource_vdso(hyperv_cs_tsc);
+		clocksource_register_hz(&hyperv_cs_tsc, NSEC_PER_SEC/100);
+		if (reg_sched_clock)
+			sched_clock_register(read_hv_sched_clock_tsc,
+						64, HV_CLOCK_HZ);
+		return;
+	}
+register_msr_cs:
+#endif
+	/*
+	 * For 32 bit guests just use the MSR based mechanism for reading
+	 * the partition counter.
+	 */
+	hyperv_cs = &hyperv_cs_msr;
+	if (ms_hyperv.features & HV_MSR_TIME_REF_COUNT_AVAILABLE) {
+		clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
+		if (reg_sched_clock)
+			sched_clock_register(read_hv_sched_clock_msr,
+						64, HV_CLOCK_HZ);
+	}
+}
+EXPORT_SYMBOL_GPL(hv_init_clocksource);
diff --git a/include/clocksource/hyperv_timer.h b/include/clocksource/hyperv_timer.h
index 1a5698c..3c11e2b 100644
--- a/include/clocksource/hyperv_timer.h
+++ b/include/clocksource/hyperv_timer.h
@@ -13,6 +13,10 @@
 #ifndef __CLKSOURCE_HYPERV_TIMER_H
 #define __CLKSOURCE_HYPERV_TIMER_H
 
+#include <linux/clocksource.h>
+#include <linux/math64.h>
+#include <asm/mshyperv.h>
+
 #define HV_MAX_MAX_DELTA_TICKS 0xffffffff
 #define HV_MIN_DELTA_TICKS 1
 
@@ -23,4 +27,78 @@
 extern void hv_stimer_global_cleanup(void);
 extern void hv_stimer0_isr(void);
 
+#if IS_ENABLED(CONFIG_HYPERV)
+extern struct clocksource *hyperv_cs;
+extern void hv_init_clocksource(bool reg_sched_clock);
+#endif /* CONFIG_HYPERV */
+
+#ifdef CONFIG_HYPERV_TSCPAGE
+extern struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
+static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
+				       u64 *cur_tsc)
+{
+	u64 scale, offset;
+	u32 sequence;
+
+	/*
+	 * The protocol for reading Hyper-V TSC page is specified in Hypervisor
+	 * Top-Level Functional Specification ver. 3.0 and above. To get the
+	 * reference time we must do the following:
+	 * - READ ReferenceTscSequence
+	 *   A special '0' value indicates the time source is unreliable and we
+	 *   need to use something else. The currently published specification
+	 *   versions (up to 4.0b) contain a mistake and wrongly claim '-1'
+	 *   instead of '0' as the special value, see commit c35b82ef0294.
+	 * - ReferenceTime =
+	 *        ((RDTSC() * ReferenceTscScale) >> 64) + ReferenceTscOffset
+	 * - READ ReferenceTscSequence again. In case its value has changed
+	 *   since our first reading we need to discard ReferenceTime and repeat
+	 *   the whole sequence as the hypervisor was updating the page in
+	 *   between.
+	 */
+	do {
+		sequence = READ_ONCE(tsc_pg->tsc_sequence);
+		if (!sequence)
+			return U64_MAX;
+		/*
+		 * Make sure we read sequence before we read other values from
+		 * TSC page.
+		 */
+		smp_rmb();
+
+		scale = READ_ONCE(tsc_pg->tsc_scale);
+		offset = READ_ONCE(tsc_pg->tsc_offset);
+		*cur_tsc = hv_get_raw_timer();
+
+		/*
+		 * Make sure we read sequence after we read all other values
+		 * from TSC page.
+		 */
+		smp_rmb();
+
+	} while (READ_ONCE(tsc_pg->tsc_sequence) != sequence);
+
+	return mul_u64_u64_shr(*cur_tsc, scale, 64) + offset;
+}
+
+static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
+{
+	u64 cur_tsc;
+
+	return hv_read_tsc_page_tsc(tsc_pg, &cur_tsc);
+}
+
+#else /* CONFIG_HYPERV_TSC_PAGE */
+static inline struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
+{
+	return NULL;
+}
+
+static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
+				       u64 *cur_tsc)
+{
+	return U64_MAX;
+}
+#endif /* CONFIG_HYPERV_TSCPAGE */
+
 #endif
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH v2 1/2] Drivers: hv: Create Hyper-V clocksource driver from existing clockevents code
From: Michael Kelley @ 2019-03-17 19:06 UTC (permalink / raw)
  To: will.deacon@arm.com, marc.zyngier@arm.com,
	linux-arm-kernel@lists.infradead.org, gregkh@linuxfoundation.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	olaf@aepfle.de, apw@canonical.com, vkuznets, jasowang@redhat.com,
	marcelo.cerri@canonical.com, Sunil Muthuswamy, KY Srinivasan
  Cc: Michael Kelley, catalin.marinas@arm.com, mark.rutland@arm.com
In-Reply-To: <1552849512-10063-1-git-send-email-mikelley@microsoft.com>

Clockevents code for Hyper-V synthetic timers is currently mixed
in with other Hyper-V code. Move the code to a Hyper-V specific
driver in the "clocksource" directory. Update the VMbus driver
to call initialization and cleanup routines since the Hyper-V
synthetic timers are not independently enumerated in ACPI.

No behavior is changed and no new functionality is added.

Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
---
 MAINTAINERS                        |   2 +
 arch/x86/include/asm/hyperv-tlfs.h |   6 ++
 arch/x86/kernel/cpu/mshyperv.c     |   2 +
 drivers/clocksource/Makefile       |   1 +
 drivers/clocksource/hyperv_timer.c | 206 +++++++++++++++++++++++++++++++++++++
 drivers/hv/Kconfig                 |   3 +
 drivers/hv/hv.c                    | 154 ---------------------------
 drivers/hv/hyperv_vmbus.h          |   3 -
 drivers/hv/vmbus_drv.c             |  39 +++----
 include/clocksource/hyperv_timer.h |  26 +++++
 10 files changed, 266 insertions(+), 176 deletions(-)
 create mode 100644 drivers/clocksource/hyperv_timer.c
 create mode 100644 include/clocksource/hyperv_timer.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 21ab064..63142da 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7159,6 +7159,7 @@ F:	arch/x86/include/asm/trace/hyperv.h
 F:	arch/x86/include/asm/hyperv-tlfs.h
 F:	arch/x86/kernel/cpu/mshyperv.c
 F:	arch/x86/hyperv
+F:	drivers/clocksource/hyperv_timer.c
 F:	drivers/hid/hid-hyperv.c
 F:	drivers/hv/
 F:	drivers/input/serio/hyperv-keyboard.c
@@ -7168,6 +7169,7 @@ F:	drivers/scsi/storvsc_drv.c
 F:	drivers/uio/uio_hv_generic.c
 F:	drivers/video/fbdev/hyperv_fb.c
 F:	net/vmw_vsock/hyperv_transport.c
+F:	include/clocksource/hyperv_timer.h
 F:	include/linux/hyperv.h
 F:	include/uapi/linux/hyperv.h
 F:	tools/hv/
diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 2bdbbbc..ee62f57 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -401,6 +401,12 @@ enum HV_GENERIC_SET_FORMAT {
 #define HV_STATUS_INVALID_CONNECTION_ID		18
 #define HV_STATUS_INSUFFICIENT_BUFFERS		19
 
+/*
+ * The Hyper-V TimeRefCount register and the TSC
+ * page provide a guest VM clock with 100ns tick rate
+ */
+#define HV_CLOCK_HZ (NSEC_PER_SEC/100)
+
 typedef struct _HV_REFERENCE_TSC_PAGE {
 	__u32 tsc_sequence;
 	__u32 res1;
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index e81a2db..f53a35a 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -21,6 +21,7 @@
 #include <linux/irq.h>
 #include <linux/kexec.h>
 #include <linux/i8253.h>
+#include <linux/random.h>
 #include <asm/processor.h>
 #include <asm/hypervisor.h>
 #include <asm/hyperv-tlfs.h>
@@ -84,6 +85,7 @@ __visible void __irq_entry hv_stimer0_vector_handler(struct pt_regs *regs)
 	inc_irq_stat(hyperv_stimer0_count);
 	if (hv_stimer0_handler)
 		hv_stimer0_handler();
+	add_interrupt_randomness(HYPERV_STIMER0_VECTOR, 0);
 	ack_APIC_irq();
 
 	exiting_irq();
diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
index be6e0fb..cd284a4 100644
--- a/drivers/clocksource/Makefile
+++ b/drivers/clocksource/Makefile
@@ -83,3 +83,4 @@ obj-$(CONFIG_ATCPIT100_TIMER)		+= timer-atcpit100.o
 obj-$(CONFIG_RISCV_TIMER)		+= timer-riscv.o
 obj-$(CONFIG_CSKY_MP_TIMER)		+= timer-mp-csky.o
 obj-$(CONFIG_GX6605S_TIMER)		+= timer-gx6605s.o
+obj-$(CONFIG_HYPERV_TIMER)		+= hyperv_timer.o
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
new file mode 100644
index 0000000..e856529
--- /dev/null
+++ b/drivers/clocksource/hyperv_timer.c
@@ -0,0 +1,206 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Clocksource driver for the synthetic counter and timers
+ * provided by the Hyper-V hypervisor to guest VMs, as described
+ * in the Hyper-V Top Level Functional Spec (TLFS). This driver
+ * is instruction set architecture independent.
+ *
+ * Copyright (C) 2019, Microsoft, Inc.
+ *
+ * Author:  Michael Kelley <mikelley@microsoft.com>
+ */
+
+#include <linux/percpu.h>
+#include <linux/cpumask.h>
+#include <linux/clockchips.h>
+#include <linux/cpuhotplug.h>
+#include <linux/mm.h>
+#include <clocksource/hyperv_timer.h>
+#include <asm/hyperv-tlfs.h>
+#include <asm/mshyperv.h>
+
+static struct clock_event_device __percpu *hv_clock_event;
+
+/*
+ * If false, we're using the old mechanism for stimer0 interrupts
+ * where it sends a VMbus message when it expires. The old
+ * mechanism is used when running on older versions of Hyper-V
+ * that don't support Direct Mode. While Hyper-V provides
+ * four stimer's per CPU, Linux uses only stimer0.
+ */
+static bool direct_mode_enabled;
+
+static int stimer0_irq;
+static int stimer0_vector;
+static int stimer0_message_sint;
+static int stimer0_cpuhp_online;
+
+/*
+ * ISR for when stimer0 is operating in Direct Mode.  Direct Mode
+ * does not use VMbus or any VMbus messages, so process here and not
+ * in the VMbus driver code.
+ */
+void hv_stimer0_isr(void)
+{
+	struct clock_event_device *ce;
+
+	ce = this_cpu_ptr(hv_clock_event);
+	ce->event_handler(ce);
+}
+EXPORT_SYMBOL_GPL(hv_stimer0_isr);
+
+static int hv_ce_set_next_event(unsigned long delta,
+				struct clock_event_device *evt)
+{
+	u64 current_tick;
+
+	WARN_ON(!clockevent_state_oneshot(evt));
+
+	current_tick = hyperv_cs->read(NULL);
+	current_tick += delta;
+	hv_init_timer(0, current_tick);
+	return 0;
+}
+
+static int hv_ce_shutdown(struct clock_event_device *evt)
+{
+	hv_init_timer(0, 0);
+	hv_init_timer_config(0, 0);
+	if (direct_mode_enabled)
+		hv_disable_stimer0_percpu_irq(stimer0_irq);
+
+	return 0;
+}
+
+static int hv_ce_set_oneshot(struct clock_event_device *evt)
+{
+	union hv_stimer_config timer_cfg;
+
+	timer_cfg.as_uint64 = 0;
+	timer_cfg.enable = 1;
+	timer_cfg.auto_enable = 1;
+	if (direct_mode_enabled) {
+		/*
+		 * When it expires, the timer will directly interrupt
+		 * on the specified hardware vector/IRQ.
+		 */
+		timer_cfg.direct_mode = 1;
+		timer_cfg.apic_vector = stimer0_vector;
+		hv_enable_stimer0_percpu_irq(stimer0_irq);
+	} else {
+		/*
+		 * When it expires, the timer will generate a VMbus message,
+		 * to be handled by the normal VMbus interrupt handler.
+		 */
+		timer_cfg.direct_mode = 0;
+		timer_cfg.sintx = stimer0_message_sint;
+	}
+	hv_init_timer_config(0, timer_cfg.as_uint64);
+	return 0;
+}
+
+/*
+ * hv_stimer_init - Per-cpu initialization of the clockevent
+ */
+static int hv_stimer_init(unsigned int cpu)
+{
+	struct clock_event_device *ce;
+
+	if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE) {
+		ce = per_cpu_ptr(hv_clock_event, cpu);
+		ce->name = "Hyper-V clockevent";
+		ce->features = CLOCK_EVT_FEAT_ONESHOT;
+		ce->cpumask = cpumask_of(cpu);
+		ce->rating = 1000;
+		ce->set_state_shutdown = hv_ce_shutdown;
+		ce->set_state_oneshot = hv_ce_set_oneshot;
+		ce->set_next_event = hv_ce_set_next_event;
+
+		clockevents_config_and_register(ce,
+						HV_CLOCK_HZ,
+						HV_MIN_DELTA_TICKS,
+						HV_MAX_MAX_DELTA_TICKS);
+	}
+	return 0;
+}
+
+/*
+ * hv_stimer_cleanup - Per-cpu cleanup of the clockevent
+ */
+int hv_stimer_cleanup(unsigned int cpu)
+{
+	struct clock_event_device *ce;
+
+	/* Turn off clockevent device */
+	if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE) {
+		ce = per_cpu_ptr(hv_clock_event, cpu);
+		clockevents_unbind_device(ce, cpu);
+		hv_ce_shutdown(ce);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(hv_stimer_cleanup);
+
+/* hv_stimer_alloc - Global initialization of the clockevent and stimer0 */
+int hv_stimer_alloc(int sint)
+{
+	int	ret;
+
+	hv_clock_event = alloc_percpu(struct clock_event_device);
+	if (!hv_clock_event)
+		return -ENOMEM;
+
+	direct_mode_enabled = ms_hyperv.misc_features &
+			HV_STIMER_DIRECT_MODE_AVAILABLE;
+	if (direct_mode_enabled &&
+	    hv_setup_stimer0_irq(&stimer0_irq, &stimer0_vector,
+				hv_stimer0_isr))
+		goto err_irq;
+
+	stimer0_message_sint = sint;
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "hyperv/stimer0:online",
+				hv_stimer_init, hv_stimer_cleanup);
+	if (ret < 0)
+		goto err_cpuhp;
+	stimer0_cpuhp_online = ret;
+	return 0;
+
+err_cpuhp:
+	if (direct_mode_enabled)
+		hv_remove_stimer0_irq(stimer0_irq);
+err_irq:
+	free_percpu(hv_clock_event);
+	return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(hv_stimer_alloc);
+
+/* hv_stimer_free - Free global resources allocated by hv_stimer_alloc() */
+void hv_stimer_free(void)
+{
+	cpuhp_remove_state(stimer0_cpuhp_online);
+	if (direct_mode_enabled)
+		hv_remove_stimer0_irq(stimer0_irq);
+	free_percpu(hv_clock_event);
+}
+EXPORT_SYMBOL_GPL(hv_stimer_free);
+
+/*
+ * Do a global cleanup of clockevents for the cases of kexec and
+ * vmbus exit
+ */
+void hv_stimer_global_cleanup(void)
+{
+	int	cpu;
+	struct clock_event_device *ce;
+
+	if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE)
+		for_each_present_cpu(cpu) {
+			ce = per_cpu_ptr(hv_clock_event, cpu);
+			clockevents_unbind_device(ce, cpu);
+		}
+	hv_stimer_free();
+}
+EXPORT_SYMBOL_GPL(hv_stimer_global_cleanup);
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 1c1a251..c423e57 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -10,6 +10,9 @@ config HYPERV
 	  Select this option to run Linux as a Hyper-V client operating
 	  system.
 
+config HYPERV_TIMER
+	def_bool HYPERV
+
 config HYPERV_TSCPAGE
        def_bool HYPERV && X86_64
 
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 632d256..e3ee010 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -36,21 +36,6 @@
 struct hv_context hv_context;
 
 /*
- * If false, we're using the old mechanism for stimer0 interrupts
- * where it sends a VMbus message when it expires. The old
- * mechanism is used when running on older versions of Hyper-V
- * that don't support Direct Mode. While Hyper-V provides
- * four stimer's per CPU, Linux uses only stimer0.
- */
-static bool direct_mode_enabled;
-static int stimer0_irq;
-static int stimer0_vector;
-
-#define HV_TIMER_FREQUENCY (10 * 1000 * 1000) /* 100ns period */
-#define HV_MAX_MAX_DELTA_TICKS 0xffffffff
-#define HV_MIN_DELTA_TICKS 1
-
-/*
  * hv_init - Main initialization routine.
  *
  * This routine must be called before any other routines in here are called
@@ -60,9 +45,6 @@ int hv_init(void)
 	hv_context.cpu_context = alloc_percpu(struct hv_per_cpu_context);
 	if (!hv_context.cpu_context)
 		return -ENOMEM;
-
-	direct_mode_enabled = ms_hyperv.misc_features &
-			HV_STIMER_DIRECT_MODE_AVAILABLE;
 	return 0;
 }
 
@@ -101,89 +83,6 @@ int hv_post_message(union hv_connection_id connection_id,
 	return status & 0xFFFF;
 }
 
-/*
- * ISR for when stimer0 is operating in Direct Mode.  Direct Mode
- * does not use VMbus or any VMbus messages, so process here and not
- * in the VMbus driver code.
- */
-
-static void hv_stimer0_isr(void)
-{
-	struct hv_per_cpu_context *hv_cpu;
-
-	hv_cpu = this_cpu_ptr(hv_context.cpu_context);
-	hv_cpu->clk_evt->event_handler(hv_cpu->clk_evt);
-	add_interrupt_randomness(stimer0_vector, 0);
-}
-
-static int hv_ce_set_next_event(unsigned long delta,
-				struct clock_event_device *evt)
-{
-	u64 current_tick;
-
-	WARN_ON(!clockevent_state_oneshot(evt));
-
-	current_tick = hyperv_cs->read(NULL);
-	current_tick += delta;
-	hv_init_timer(0, current_tick);
-	return 0;
-}
-
-static int hv_ce_shutdown(struct clock_event_device *evt)
-{
-	hv_init_timer(0, 0);
-	hv_init_timer_config(0, 0);
-	if (direct_mode_enabled)
-		hv_disable_stimer0_percpu_irq(stimer0_irq);
-
-	return 0;
-}
-
-static int hv_ce_set_oneshot(struct clock_event_device *evt)
-{
-	union hv_stimer_config timer_cfg;
-
-	timer_cfg.as_uint64 = 0;
-	timer_cfg.enable = 1;
-	timer_cfg.auto_enable = 1;
-	if (direct_mode_enabled) {
-		/*
-		 * When it expires, the timer will directly interrupt
-		 * on the specified hardware vector/IRQ.
-		 */
-		timer_cfg.direct_mode = 1;
-		timer_cfg.apic_vector = stimer0_vector;
-		hv_enable_stimer0_percpu_irq(stimer0_irq);
-	} else {
-		/*
-		 * When it expires, the timer will generate a VMbus message,
-		 * to be handled by the normal VMbus interrupt handler.
-		 */
-		timer_cfg.direct_mode = 0;
-		timer_cfg.sintx = VMBUS_MESSAGE_SINT;
-	}
-	hv_init_timer_config(0, timer_cfg.as_uint64);
-	return 0;
-}
-
-static void hv_init_clockevent_device(struct clock_event_device *dev, int cpu)
-{
-	dev->name = "Hyper-V clockevent";
-	dev->features = CLOCK_EVT_FEAT_ONESHOT;
-	dev->cpumask = cpumask_of(cpu);
-	dev->rating = 1000;
-	/*
-	 * Avoid settint dev->owner = THIS_MODULE deliberately as doing so will
-	 * result in clockevents_config_and_register() taking additional
-	 * references to the hv_vmbus module making it impossible to unload.
-	 */
-
-	dev->set_state_shutdown = hv_ce_shutdown;
-	dev->set_state_oneshot = hv_ce_set_oneshot;
-	dev->set_next_event = hv_ce_set_next_event;
-}
-
-
 int hv_synic_alloc(void)
 {
 	int cpu;
@@ -212,14 +111,6 @@ int hv_synic_alloc(void)
 		tasklet_init(&hv_cpu->msg_dpc,
 			     vmbus_on_msg_dpc, (unsigned long) hv_cpu);
 
-		hv_cpu->clk_evt = kzalloc(sizeof(struct clock_event_device),
-					  GFP_KERNEL);
-		if (hv_cpu->clk_evt == NULL) {
-			pr_err("Unable to allocate clock event device\n");
-			goto err;
-		}
-		hv_init_clockevent_device(hv_cpu->clk_evt, cpu);
-
 		hv_cpu->synic_message_page =
 			(void *)get_zeroed_page(GFP_ATOMIC);
 		if (hv_cpu->synic_message_page == NULL) {
@@ -242,11 +133,6 @@ int hv_synic_alloc(void)
 		INIT_LIST_HEAD(&hv_cpu->chan_list);
 	}
 
-	if (direct_mode_enabled &&
-	    hv_setup_stimer0_irq(&stimer0_irq, &stimer0_vector,
-				hv_stimer0_isr))
-		goto err;
-
 	return 0;
 err:
 	/*
@@ -265,7 +151,6 @@ void hv_synic_free(void)
 		struct hv_per_cpu_context *hv_cpu
 			= per_cpu_ptr(hv_context.cpu_context, cpu);
 
-		kfree(hv_cpu->clk_evt);
 		free_page((unsigned long)hv_cpu->synic_event_page);
 		free_page((unsigned long)hv_cpu->synic_message_page);
 		free_page((unsigned long)hv_cpu->post_msg_page);
@@ -324,39 +209,10 @@ int hv_synic_init(unsigned int cpu)
 
 	hv_set_synic_state(sctrl.as_uint64);
 
-	/*
-	 * Register the per-cpu clockevent source.
-	 */
-	if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE)
-		clockevents_config_and_register(hv_cpu->clk_evt,
-						HV_TIMER_FREQUENCY,
-						HV_MIN_DELTA_TICKS,
-						HV_MAX_MAX_DELTA_TICKS);
 	return 0;
 }
 
 /*
- * hv_synic_clockevents_cleanup - Cleanup clockevent devices
- */
-void hv_synic_clockevents_cleanup(void)
-{
-	int cpu;
-
-	if (!(ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE))
-		return;
-
-	if (direct_mode_enabled)
-		hv_remove_stimer0_irq(stimer0_irq);
-
-	for_each_present_cpu(cpu) {
-		struct hv_per_cpu_context *hv_cpu
-			= per_cpu_ptr(hv_context.cpu_context, cpu);
-
-		clockevents_unbind_device(hv_cpu->clk_evt, cpu);
-	}
-}
-
-/*
  * hv_synic_cleanup - Cleanup routine for hv_synic_init().
  */
 int hv_synic_cleanup(unsigned int cpu)
@@ -401,16 +257,6 @@ int hv_synic_cleanup(unsigned int cpu)
 	if (channel_found && vmbus_connection.conn_state == CONNECTED)
 		return -EBUSY;
 
-	/* Turn off clockevent device */
-	if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE) {
-		struct hv_per_cpu_context *hv_cpu
-			= this_cpu_ptr(hv_context.cpu_context);
-
-		clockevents_unbind_device(hv_cpu->clk_evt, cpu);
-		hv_ce_shutdown(hv_cpu->clk_evt);
-		put_cpu_ptr(hv_cpu);
-	}
-
 	hv_get_synint_state(VMBUS_MESSAGE_SINT, shared_sint.as_uint64);
 
 	shared_sint.masked = 1;
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index cb86b133..ffd4ad8 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -151,7 +151,6 @@ struct hv_per_cpu_context {
 	 * per-cpu list of the channels based on their CPU affinity.
 	 */
 	struct list_head chan_list;
-	struct clock_event_device *clk_evt;
 };
 
 struct hv_context {
@@ -189,8 +188,6 @@ extern int hv_post_message(union hv_connection_id connection_id,
 
 extern int hv_synic_cleanup(unsigned int cpu);
 
-extern void hv_synic_clockevents_cleanup(void);
-
 /* Interface */
 
 
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 000b53e..e71b04b 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -43,6 +43,7 @@
 #include <linux/kdebug.h>
 #include <linux/efi.h>
 #include <linux/random.h>
+#include <clocksource/hyperv_timer.h>
 #include "hyperv_vmbus.h"
 
 struct vmbus_dynid {
@@ -939,17 +940,6 @@ static void vmbus_onmessage_work(struct work_struct *work)
 	kfree(ctx);
 }
 
-static void hv_process_timer_expiration(struct hv_message *msg,
-					struct hv_per_cpu_context *hv_cpu)
-{
-	struct clock_event_device *dev = hv_cpu->clk_evt;
-
-	if (dev->event_handler)
-		dev->event_handler(dev);
-
-	vmbus_signal_eom(msg, HVMSG_TIMER_EXPIRED);
-}
-
 void vmbus_on_msg_dpc(unsigned long data)
 {
 	struct hv_per_cpu_context *hv_cpu = (void *)data;
@@ -1143,9 +1133,10 @@ static void vmbus_isr(void)
 
 	/* Check if there are actual msgs to be processed */
 	if (msg->header.message_type != HVMSG_NONE) {
-		if (msg->header.message_type == HVMSG_TIMER_EXPIRED)
-			hv_process_timer_expiration(msg, hv_cpu);
-		else
+		if (msg->header.message_type == HVMSG_TIMER_EXPIRED) {
+			hv_stimer0_isr();
+			vmbus_signal_eom(msg, HVMSG_TIMER_EXPIRED);
+		} else
 			tasklet_schedule(&hv_cpu->msg_dpc);
 	}
 
@@ -1248,8 +1239,8 @@ static int vmbus_bus_init(void)
 	if (ret)
 		goto err_alloc;
 	/*
-	 * Initialize the per-cpu interrupt state and
-	 * connect to the host.
+	 * Initialize the per-cpu interrupt state and stimer state.
+	 * Then connect to the host.
 	 */
 	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "hyperv/vmbus:online",
 				hv_synic_init, hv_synic_cleanup);
@@ -1257,6 +1248,10 @@ static int vmbus_bus_init(void)
 		goto err_alloc;
 	hyperv_cpuhp_online = ret;
 
+	ret = hv_stimer_alloc(VMBUS_MESSAGE_SINT);
+	if (ret < 0)
+		goto err_stimer_alloc;
+
 	ret = vmbus_connect();
 	if (ret)
 		goto err_connect;
@@ -1301,6 +1296,8 @@ static int vmbus_bus_init(void)
 	return 0;
 
 err_connect:
+	hv_stimer_free();
+err_stimer_alloc:
 	cpuhp_remove_state(hyperv_cpuhp_online);
 err_alloc:
 	hv_synic_free();
@@ -1971,7 +1968,7 @@ static int vmbus_acpi_add(struct acpi_device *device)
 
 static void hv_kexec_handler(void)
 {
-	hv_synic_clockevents_cleanup();
+	hv_stimer_global_cleanup();
 	vmbus_initiate_unload(false);
 	vmbus_connection.conn_state = DISCONNECTED;
 	/* Make sure conn_state is set as hv_synic_cleanup checks for it */
@@ -1982,6 +1979,8 @@ static void hv_kexec_handler(void)
 
 static void hv_crash_handler(struct pt_regs *regs)
 {
+	int cpu;
+
 	vmbus_initiate_unload(true);
 	/*
 	 * In crash handler we can't schedule synic cleanup for all CPUs,
@@ -1989,7 +1988,9 @@ static void hv_crash_handler(struct pt_regs *regs)
 	 * for kdump.
 	 */
 	vmbus_connection.conn_state = DISCONNECTED;
-	hv_synic_cleanup(smp_processor_id());
+	cpu = smp_processor_id();
+	hv_stimer_cleanup(cpu);
+	hv_synic_cleanup(cpu);
 	hyperv_cleanup();
 };
 
@@ -2038,7 +2039,7 @@ static void __exit vmbus_exit(void)
 	hv_remove_kexec_handler();
 	hv_remove_crash_handler();
 	vmbus_connection.conn_state = DISCONNECTED;
-	hv_synic_clockevents_cleanup();
+	hv_stimer_global_cleanup();
 	vmbus_disconnect();
 	hv_remove_vmbus_irq();
 	for_each_online_cpu(cpu) {
diff --git a/include/clocksource/hyperv_timer.h b/include/clocksource/hyperv_timer.h
new file mode 100644
index 0000000..1a5698c
--- /dev/null
+++ b/include/clocksource/hyperv_timer.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Definitions for the clocksource provided by the Hyper-V
+ * hypervisor to guest VMs, as described in the Hyper-V Top
+ * Level Functional Spec (TLFS).
+ *
+ * Copyright (C) 2019, Microsoft, Inc.
+ *
+ * Author:  Michael Kelley <mikelley@microsoft.com>
+ */
+
+#ifndef __CLKSOURCE_HYPERV_TIMER_H
+#define __CLKSOURCE_HYPERV_TIMER_H
+
+#define HV_MAX_MAX_DELTA_TICKS 0xffffffff
+#define HV_MIN_DELTA_TICKS 1
+
+/* Routines called by the VMbus driver */
+extern int hv_stimer_alloc(int sint);
+extern void hv_stimer_free(void);
+extern int hv_stimer_cleanup(unsigned int cpu);
+extern void hv_stimer_global_cleanup(void);
+extern void hv_stimer0_isr(void);
+
+#endif
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH v2 0/2] Drivers: hv: Move Hyper-V clock/timer code to separate clocksource driver
From: Michael Kelley @ 2019-03-17 19:06 UTC (permalink / raw)
  To: will.deacon@arm.com, marc.zyngier@arm.com,
	linux-arm-kernel@lists.infradead.org, gregkh@linuxfoundation.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	olaf@aepfle.de, apw@canonical.com, vkuznets, jasowang@redhat.com,
	marcelo.cerri@canonical.com, Sunil Muthuswamy, KY Srinivasan
  Cc: Michael Kelley, catalin.marinas@arm.com, mark.rutland@arm.com

This patch series moves Hyper-V clock/timer code to a separate Hyper-V
clocksource driver. Previously, Hyper-V clock/timer code and data
structures were mixed in with other Hyper-V code in the ISA independent
drivers/hv code as well as in arch dependent code. The new Hyper-V
clocksource driver is ISA independent, with a just few dependencies on
arch specific functions. The patch series does not change any behavior
or functionality -- it only reorganizes the existing code and fixes up
the linkages. A few places outside of Hyper-V code are fixed up to use
the new #include file structure.

This restructuring is in response to Marc Zyngier's review comments
on supporting Hyper-V running on ARM64, and is a good idea in general.
It increases the amount of code shared between the x86 and ARM64
architectures, and reduces the size of the new code for supporting
Hyper-V on ARM64. A new version of the Hyper-V on ARM64 patches will
follow once this clocksource restructuring is accepted.

The code is currently diff'ed against Linux 5.0.  I'll rebase
to linux-next once 5.1-rc1 is available.

Changes in v2:
* Revised commit short descriptions so the distinction between
the two patches is clearer [GregKH]
* Renamed new clocksource driver files and functions to use
existing "timer" and "stimer" names instead of introducing
"syntimer". [Vitaly Kuznetsov]
* Introduced CONFIG_HYPER_TIMER to fix build problem when
CONFIG_HYPERV=m [Vitaly Kuznetsov]
* Added "Suggested-by: Marc Zyngier"

Michael Kelley (2):
  Drivers: hv: Create Hyper-V clocksource driver from existing
    clockevents code
  Drivers: hv: Move Hyper-V clocksource code to new clocksource driver

 MAINTAINERS                          |   2 +
 arch/x86/entry/vdso/vclock_gettime.c |   1 +
 arch/x86/entry/vdso/vma.c            |   2 +-
 arch/x86/hyperv/hv_init.c            |  91 +---------
 arch/x86/include/asm/hyperv-tlfs.h   |   6 +
 arch/x86/include/asm/mshyperv.h      |  80 ++-------
 arch/x86/kernel/cpu/mshyperv.c       |   2 +
 arch/x86/kvm/x86.c                   |   1 +
 drivers/clocksource/Makefile         |   1 +
 drivers/clocksource/hyperv_timer.c   | 328 +++++++++++++++++++++++++++++++++++
 drivers/hv/Kconfig                   |   3 +
 drivers/hv/hv.c                      | 154 ----------------
 drivers/hv/hyperv_vmbus.h            |   3 -
 drivers/hv/vmbus_drv.c               |  39 +++--
 include/clocksource/hyperv_timer.h   | 104 +++++++++++
 15 files changed, 485 insertions(+), 332 deletions(-)
 create mode 100644 drivers/clocksource/hyperv_timer.c
 create mode 100644 include/clocksource/hyperv_timer.h

-- 
1.8.3.1

^ permalink raw reply

* Re: [PATCH v3 3/3] Drivers: hv: vmbus: Fix race condition with new ring_buffer_info mutex
From: Kimberly Brown @ 2019-03-17  1:49 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui, KY Srinivasan, Haiyang Zhang, linux-hyperv,
	linux-kernel
In-Reply-To: <20190314154533.17c8a362@shemminger-XPS-13-9360>

On Thu, Mar 14, 2019 at 03:45:33PM -0700, Stephen Hemminger wrote:
> On Thu, 14 Mar 2019 13:05:15 -0700
> "Kimberly Brown" <kimbrownkd@gmail.com> wrote:
> 
> > Fix a race condition that can result in a ring buffer pointer being set
> > to null while a "_show" function is reading the ring buffer's data. This
> > problem was discussed here:
> > https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org
> > %2Flkml%2F2018%2F10%2F18%2F779&amp;data=02%7C01%7Csthemmin%40microsoft.com
> > %7C1d7557d667b741bdbb6008d6a8b8620f%7C72f988bf86f141af91ab2d7cd011db47%7C1
> > %7C0%7C636881907217609564&amp;sdata=1bUbLaxsODANM7lCBR8lxyYajNpufuwUW%2FOl
> > vtGu2hU%3D&amp;reserved=0
> > 
> > To fix the race condition, add a new mutex lock to the
> > "hv_ring_buffer_info" struct. Add a new function,
> > "hv_ringbuffer_pre_init()", where a channel's inbound and outbound
> > ring_buffer_info mutex locks are initialized.
> >
> >  ... snip ...  
> 
> Adding more locks will solve the problem but it seems like overkill.
> Why not either use a reference count or an RCU style access for the
> ring buffer?

I agree that a reference count or RCU could also solve this problem.
Using mutex locks seemed like the most straightforward solution, but
I'll certainly switch to a different approach if it's better!

Are you concerned about the extra memory required for the mutex locks,
read performance, or something else?

Thanks,
Kim

^ permalink raw reply

* Re: [PATCH v3 3/3] Drivers: hv: vmbus: Fix race condition with new ring_buffer_info mutex
From: Stephen Hemminger @ 2019-03-14 22:45 UTC (permalink / raw)
  To: Kimberly Brown
  Cc: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui, KY Srinivasan, Haiyang Zhang, linux-hyperv,
	linux-kernel
In-Reply-To: <262046fa9e89d5f483ecd5972d86f4f9608dbcc3.1552592620.git.kimbrownkd@gmail.com>

On Thu, 14 Mar 2019 13:05:15 -0700
"Kimberly Brown" <kimbrownkd@gmail.com> wrote:

> Fix a race condition that can result in a ring buffer pointer being set
> to null while a "_show" function is reading the ring buffer's data. This
> problem was discussed here:
> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org
> %2Flkml%2F2018%2F10%2F18%2F779&amp;data=02%7C01%7Csthemmin%40microsoft.com
> %7C1d7557d667b741bdbb6008d6a8b8620f%7C72f988bf86f141af91ab2d7cd011db47%7C1
> %7C0%7C636881907217609564&amp;sdata=1bUbLaxsODANM7lCBR8lxyYajNpufuwUW%2FOl
> vtGu2hU%3D&amp;reserved=0
> 
> To fix the race condition, add a new mutex lock to the
> "hv_ring_buffer_info" struct. Add a new function,
> "hv_ringbuffer_pre_init()", where a channel's inbound and outbound
> ring_buffer_info mutex locks are initialized.
> 
> Acquire/release the locks in the "hv_ringbuffer_cleanup()" function,
> which is where the ring buffer pointers are set to null.
> 
> Acquire/release the locks in the four channel-level "_show" functions
> that access ring buffer data. Remove the "const" qualifier from the
> "vmbus_channel" parameter and the "rbi" variable of the channel-level
> "_show" functions so that the locks can be acquired/released in these
> functions.
> 
> Acquire/release the locks in hv_ringbuffer_get_debuginfo(). Remove the
> "const" qualifier from the "hv_ring_buffer_info" parameter so that the
> locks can be acquired/released in this function.
> 
> Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>
> ---
>  drivers/hv/channel_mgmt.c |  2 +
>  drivers/hv/hyperv_vmbus.h |  1 +
>  drivers/hv/ring_buffer.c  | 19 ++++++++-
>  drivers/hv/vmbus_drv.c    | 82 +++++++++++++++++++++++++--------------
>  include/linux/hyperv.h    |  7 +++-
>  5 files changed, 79 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> index 9d7f9c1c60c7..14543059cc3e 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -336,6 +336,8 @@ static struct vmbus_channel *alloc_channel(void)
>  	tasklet_init(&channel->callback_event,
>  		     vmbus_on_event, (unsigned long)channel);
>  
> +	hv_ringbuffer_pre_init(channel);
> +
>  	return channel;
>  }
>  
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index a94aab94e0b5..e5467b821f41 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -193,6 +193,7 @@ extern void hv_synic_clockevents_cleanup(void);
>  
>  /* Interface */
>  
> +void hv_ringbuffer_pre_init(struct vmbus_channel *channel);
>  
>  int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
>  		       struct page *pages, u32 pagecnt);
> diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
> index 0386ff48c5ea..121a01c43298 100644
> --- a/drivers/hv/ring_buffer.c
> +++ b/drivers/hv/ring_buffer.c
> @@ -166,14 +166,18 @@ hv_get_ringbuffer_availbytes(const struct
> hv_ring_buffer_info *rbi,
>  }
>  
>  /* Get various debug metrics for the specified ring buffer. */
> -int hv_ringbuffer_get_debuginfo(const struct hv_ring_buffer_info
> *ring_info,
> +int hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info,
>  				struct hv_ring_buffer_debug_info
> *debug_info)
>  {
>  	u32 bytes_avail_towrite;
>  	u32 bytes_avail_toread;
>  
> -	if (!ring_info->ring_buffer)
> +	mutex_lock(&ring_info->ring_buffer_mutex);
> +
> +	if (!ring_info->ring_buffer) {
> +		mutex_unlock(&ring_info->ring_buffer_mutex);
>  		return -EINVAL;
> +	}
>  
>  	hv_get_ringbuffer_availbytes(ring_info,
>  				     &bytes_avail_toread,
> @@ -184,10 +188,19 @@ int hv_ringbuffer_get_debuginfo(const struct
> hv_ring_buffer_info *ring_info,
>  	debug_info->current_write_index =
> ring_info->ring_buffer->write_index;
>  	debug_info->current_interrupt_mask
>  		= ring_info->ring_buffer->interrupt_mask;
> +	mutex_unlock(&ring_info->ring_buffer_mutex);
> +
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(hv_ringbuffer_get_debuginfo);
>  
> +/* Initialize a channel's ring buffer info mutex locks */
> +void hv_ringbuffer_pre_init(struct vmbus_channel *channel)
> +{
> +	mutex_init(&channel->inbound.ring_buffer_mutex);
> +	mutex_init(&channel->outbound.ring_buffer_mutex);
> +}
> +
>  /* Initialize the ring buffer. */
>  int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
>  		       struct page *pages, u32 page_cnt)
> @@ -240,8 +253,10 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info
> *ring_info,
>  /* Cleanup the ring buffer. */
>  void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info)
>  {
> +	mutex_lock(&ring_info->ring_buffer_mutex);
>  	vunmap(ring_info->ring_buffer);
>  	ring_info->ring_buffer = NULL;
> +	mutex_unlock(&ring_info->ring_buffer_mutex);
>  }
>  
>  /* Write to the ring buffer. */
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 7f15c41d952e..84f3a510b4c9 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1410,7 +1410,7 @@ static void vmbus_chan_release(struct kobject *kobj)
>  
>  struct vmbus_chan_attribute {
>  	struct attribute attr;
> -	ssize_t (*show)(const struct vmbus_channel *chan, char *buf);
> +	ssize_t (*show)(struct vmbus_channel *chan, char *buf);
>  	ssize_t (*store)(struct vmbus_channel *chan,
>  			 const char *buf, size_t count);
>  };
> @@ -1429,7 +1429,7 @@ static ssize_t vmbus_chan_attr_show(struct kobject
> *kobj,
>  {
>  	const struct vmbus_chan_attribute *attribute
>  		= container_of(attr, struct vmbus_chan_attribute, attr);
> -	const struct vmbus_channel *chan
> +	struct vmbus_channel *chan
>  		= container_of(kobj, struct vmbus_channel, kobj);
>  
>  	if (!attribute->show)
> @@ -1442,57 +1442,81 @@ static const struct sysfs_ops vmbus_chan_sysfs_ops
> = {
>  	.show = vmbus_chan_attr_show,
>  };
>  
> -static ssize_t out_mask_show(const struct vmbus_channel *channel, char
> *buf)
> +static ssize_t out_mask_show(struct vmbus_channel *channel, char *buf)
>  {
> -	const struct hv_ring_buffer_info *rbi = &channel->outbound;
> +	struct hv_ring_buffer_info *rbi = &channel->outbound;
> +	ssize_t ret;
>  
> -	if (!rbi->ring_buffer)
> +	mutex_lock(&rbi->ring_buffer_mutex);
> +	if (!rbi->ring_buffer) {
> +		mutex_unlock(&rbi->ring_buffer_mutex);
>  		return -EINVAL;
> +	}
>  
> -	return sprintf(buf, "%u\n", rbi->ring_buffer->interrupt_mask);
> +	ret = sprintf(buf, "%u\n", rbi->ring_buffer->interrupt_mask);
> +	mutex_unlock(&rbi->ring_buffer_mutex);
> +	return ret;
>  }
>  static VMBUS_CHAN_ATTR_RO(out_mask);
>  
> -static ssize_t in_mask_show(const struct vmbus_channel *channel, char
> *buf)
> +static ssize_t in_mask_show(struct vmbus_channel *channel, char *buf)
>  {
> -	const struct hv_ring_buffer_info *rbi = &channel->inbound;
> +	struct hv_ring_buffer_info *rbi = &channel->inbound;
> +	ssize_t ret;
>  
> -	if (!rbi->ring_buffer)
> +	mutex_lock(&rbi->ring_buffer_mutex);
> +	if (!rbi->ring_buffer) {
> +		mutex_unlock(&rbi->ring_buffer_mutex);
>  		return -EINVAL;
> +	}
>  
> -	return sprintf(buf, "%u\n", rbi->ring_buffer->interrupt_mask);
> +	ret = sprintf(buf, "%u\n", rbi->ring_buffer->interrupt_mask);
> +	mutex_unlock(&rbi->ring_buffer_mutex);
> +	return ret;
>  }
>  static VMBUS_CHAN_ATTR_RO(in_mask);
>  
> -static ssize_t read_avail_show(const struct vmbus_channel *channel, char
> *buf)
> +static ssize_t read_avail_show(struct vmbus_channel *channel, char *buf)
>  {
> -	const struct hv_ring_buffer_info *rbi = &channel->inbound;
> +	struct hv_ring_buffer_info *rbi = &channel->inbound;
> +	ssize_t ret;
>  
> -	if (!rbi->ring_buffer)
> +	mutex_lock(&rbi->ring_buffer_mutex);
> +	if (!rbi->ring_buffer) {
> +		mutex_unlock(&rbi->ring_buffer_mutex);
>  		return -EINVAL;
> +	}
>  
> -	return sprintf(buf, "%u\n", hv_get_bytes_to_read(rbi));
> +	ret = sprintf(buf, "%u\n", hv_get_bytes_to_read(rbi));
> +	mutex_unlock(&rbi->ring_buffer_mutex);
> +	return ret;
>  }
>  static VMBUS_CHAN_ATTR_RO(read_avail);
>  
> -static ssize_t write_avail_show(const struct vmbus_channel *channel, char
> *buf)
> +static ssize_t write_avail_show(struct vmbus_channel *channel, char *buf)
>  {
> -	const struct hv_ring_buffer_info *rbi = &channel->outbound;
> +	struct hv_ring_buffer_info *rbi = &channel->outbound;
> +	ssize_t ret;
>  
> -	if (!rbi->ring_buffer)
> +	mutex_lock(&rbi->ring_buffer_mutex);
> +	if (!rbi->ring_buffer) {
> +		mutex_unlock(&rbi->ring_buffer_mutex);
>  		return -EINVAL;
> +	}
>  
> -	return sprintf(buf, "%u\n", hv_get_bytes_to_write(rbi));
> +	ret = sprintf(buf, "%u\n", hv_get_bytes_to_write(rbi));
> +	mutex_unlock(&rbi->ring_buffer_mutex);
> +	return ret;
>  }
>  static VMBUS_CHAN_ATTR_RO(write_avail);
>  
> -static ssize_t show_target_cpu(const struct vmbus_channel *channel, char
> *buf)
> +static ssize_t show_target_cpu(struct vmbus_channel *channel, char *buf)
>  {
>  	return sprintf(buf, "%u\n", channel->target_cpu);
>  }
>  static VMBUS_CHAN_ATTR(cpu, S_IRUGO, show_target_cpu, NULL);
>  
> -static ssize_t channel_pending_show(const struct vmbus_channel *channel,
> +static ssize_t channel_pending_show(struct vmbus_channel *channel,
>  				    char *buf)
>  {
>  	return sprintf(buf, "%d\n",
> @@ -1501,7 +1525,7 @@ static ssize_t channel_pending_show(const struct
> vmbus_channel *channel,
>  }
>  static VMBUS_CHAN_ATTR(pending, S_IRUGO, channel_pending_show, NULL);
>  
> -static ssize_t channel_latency_show(const struct vmbus_channel *channel,
> +static ssize_t channel_latency_show(struct vmbus_channel *channel,
>  				    char *buf)
>  {
>  	return sprintf(buf, "%d\n",
> @@ -1510,19 +1534,19 @@ static ssize_t channel_latency_show(const struct
> vmbus_channel *channel,
>  }
>  static VMBUS_CHAN_ATTR(latency, S_IRUGO, channel_latency_show, NULL);
>  
> -static ssize_t channel_interrupts_show(const struct vmbus_channel
> *channel, char *buf)
> +static ssize_t channel_interrupts_show(struct vmbus_channel *channel,
> char *buf)
>  {
>  	return sprintf(buf, "%llu\n", channel->interrupts);
>  }
>  static VMBUS_CHAN_ATTR(interrupts, S_IRUGO, channel_interrupts_show,
> NULL);
>  
> -static ssize_t channel_events_show(const struct vmbus_channel *channel,
> char *buf)
> +static ssize_t channel_events_show(struct vmbus_channel *channel, char
> *buf)
>  {
>  	return sprintf(buf, "%llu\n", channel->sig_events);
>  }
>  static VMBUS_CHAN_ATTR(events, S_IRUGO, channel_events_show, NULL);
>  
> -static ssize_t channel_intr_in_full_show(const struct vmbus_channel
> *channel,
> +static ssize_t channel_intr_in_full_show(struct vmbus_channel *channel,
>  					 char *buf)
>  {
>  	return sprintf(buf, "%llu\n",
> @@ -1530,7 +1554,7 @@ static ssize_t channel_intr_in_full_show(const
> struct vmbus_channel *channel,
>  }
>  static VMBUS_CHAN_ATTR(intr_in_full, 0444, channel_intr_in_full_show,
> NULL);
>  
> -static ssize_t channel_intr_out_empty_show(const struct vmbus_channel
> *channel,
> +static ssize_t channel_intr_out_empty_show(struct vmbus_channel *channel,
>  					   char *buf)
>  {
>  	return sprintf(buf, "%llu\n",
> @@ -1538,7 +1562,7 @@ static ssize_t channel_intr_out_empty_show(const
> struct vmbus_channel *channel,
>  }
>  static VMBUS_CHAN_ATTR(intr_out_empty, 0444, channel_intr_out_empty_show,
> NULL);
>  
> -static ssize_t channel_out_full_first_show(const struct vmbus_channel
> *channel,
> +static ssize_t channel_out_full_first_show(struct vmbus_channel *channel,
>  					   char *buf)
>  {
>  	return sprintf(buf, "%llu\n",
> @@ -1546,7 +1570,7 @@ static ssize_t channel_out_full_first_show(const
> struct vmbus_channel *channel,
>  }
>  static VMBUS_CHAN_ATTR(out_full_first, 0444, channel_out_full_first_show,
> NULL);
>  
> -static ssize_t channel_out_full_total_show(const struct vmbus_channel
> *channel,
> +static ssize_t channel_out_full_total_show(struct vmbus_channel *channel,
>  					   char *buf)
>  {
>  	return sprintf(buf, "%llu\n",
> @@ -1554,14 +1578,14 @@ static ssize_t channel_out_full_total_show(const
> struct vmbus_channel *channel,
>  }
>  static VMBUS_CHAN_ATTR(out_full_total, 0444, channel_out_full_total_show,
> NULL);
>  
> -static ssize_t subchannel_monitor_id_show(const struct vmbus_channel
> *channel,
> +static ssize_t subchannel_monitor_id_show(struct vmbus_channel *channel,
>  					  char *buf)
>  {
>  	return sprintf(buf, "%u\n", channel->offermsg.monitorid);
>  }
>  static VMBUS_CHAN_ATTR(monitor_id, S_IRUGO, subchannel_monitor_id_show,
> NULL);
>  
> -static ssize_t subchannel_id_show(const struct vmbus_channel *channel,
> +static ssize_t subchannel_id_show(struct vmbus_channel *channel,
>  				  char *buf)
>  {
>  	return sprintf(buf, "%u\n",
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index 64698ec8f2ac..8b9a93c99c9b 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -141,6 +141,11 @@ struct hv_ring_buffer_info {
>  
>  	u32 ring_datasize;		/* < ring_size */
>  	u32 priv_read_index;
> +	/*
> +	 * The ring buffer mutex lock. This lock prevents the ring buffer
> from
> +	 * being freed while the ring buffer is being accessed.
> +	 */
> +	struct mutex ring_buffer_mutex;
>  };
>  
>  
> @@ -1206,7 +1211,7 @@ struct hv_ring_buffer_debug_info {
>  };
>  
>  
> -int hv_ringbuffer_get_debuginfo(const struct hv_ring_buffer_info
> *ring_info,
> +int hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info,
>  				struct hv_ring_buffer_debug_info
> *debug_info);
>  
>  /* Vmbus interface */

Adding more locks will solve the problem but it seems like overkill.
Why not either use a reference count or an RCU style access for the
ring buffer?

^ permalink raw reply

* [PATCH v3 3/3] Drivers: hv: vmbus: Fix race condition with new ring_buffer_info mutex
From: Kimberly Brown @ 2019-03-14 20:05 UTC (permalink / raw)
  To: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui
  Cc: K. Y. Srinivasan, Haiyang Zhang, linux-hyperv, linux-kernel
In-Reply-To: <cover.1552592620.git.kimbrownkd@gmail.com>

Fix a race condition that can result in a ring buffer pointer being set
to null while a "_show" function is reading the ring buffer's data. This
problem was discussed here: https://lkml.org/lkml/2018/10/18/779

To fix the race condition, add a new mutex lock to the
"hv_ring_buffer_info" struct. Add a new function,
"hv_ringbuffer_pre_init()", where a channel's inbound and outbound
ring_buffer_info mutex locks are initialized.

Acquire/release the locks in the "hv_ringbuffer_cleanup()" function,
which is where the ring buffer pointers are set to null.

Acquire/release the locks in the four channel-level "_show" functions
that access ring buffer data. Remove the "const" qualifier from the
"vmbus_channel" parameter and the "rbi" variable of the channel-level
"_show" functions so that the locks can be acquired/released in these
functions.

Acquire/release the locks in hv_ringbuffer_get_debuginfo(). Remove the
"const" qualifier from the "hv_ring_buffer_info" parameter so that the
locks can be acquired/released in this function.

Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>
---
 drivers/hv/channel_mgmt.c |  2 +
 drivers/hv/hyperv_vmbus.h |  1 +
 drivers/hv/ring_buffer.c  | 19 ++++++++-
 drivers/hv/vmbus_drv.c    | 82 +++++++++++++++++++++++++--------------
 include/linux/hyperv.h    |  7 +++-
 5 files changed, 79 insertions(+), 32 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 9d7f9c1c60c7..14543059cc3e 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -336,6 +336,8 @@ static struct vmbus_channel *alloc_channel(void)
 	tasklet_init(&channel->callback_event,
 		     vmbus_on_event, (unsigned long)channel);
 
+	hv_ringbuffer_pre_init(channel);
+
 	return channel;
 }
 
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index a94aab94e0b5..e5467b821f41 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -193,6 +193,7 @@ extern void hv_synic_clockevents_cleanup(void);
 
 /* Interface */
 
+void hv_ringbuffer_pre_init(struct vmbus_channel *channel);
 
 int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
 		       struct page *pages, u32 pagecnt);
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 0386ff48c5ea..121a01c43298 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -166,14 +166,18 @@ hv_get_ringbuffer_availbytes(const struct hv_ring_buffer_info *rbi,
 }
 
 /* Get various debug metrics for the specified ring buffer. */
-int hv_ringbuffer_get_debuginfo(const struct hv_ring_buffer_info *ring_info,
+int hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info,
 				struct hv_ring_buffer_debug_info *debug_info)
 {
 	u32 bytes_avail_towrite;
 	u32 bytes_avail_toread;
 
-	if (!ring_info->ring_buffer)
+	mutex_lock(&ring_info->ring_buffer_mutex);
+
+	if (!ring_info->ring_buffer) {
+		mutex_unlock(&ring_info->ring_buffer_mutex);
 		return -EINVAL;
+	}
 
 	hv_get_ringbuffer_availbytes(ring_info,
 				     &bytes_avail_toread,
@@ -184,10 +188,19 @@ int hv_ringbuffer_get_debuginfo(const struct hv_ring_buffer_info *ring_info,
 	debug_info->current_write_index = ring_info->ring_buffer->write_index;
 	debug_info->current_interrupt_mask
 		= ring_info->ring_buffer->interrupt_mask;
+	mutex_unlock(&ring_info->ring_buffer_mutex);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(hv_ringbuffer_get_debuginfo);
 
+/* Initialize a channel's ring buffer info mutex locks */
+void hv_ringbuffer_pre_init(struct vmbus_channel *channel)
+{
+	mutex_init(&channel->inbound.ring_buffer_mutex);
+	mutex_init(&channel->outbound.ring_buffer_mutex);
+}
+
 /* Initialize the ring buffer. */
 int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
 		       struct page *pages, u32 page_cnt)
@@ -240,8 +253,10 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
 /* Cleanup the ring buffer. */
 void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info)
 {
+	mutex_lock(&ring_info->ring_buffer_mutex);
 	vunmap(ring_info->ring_buffer);
 	ring_info->ring_buffer = NULL;
+	mutex_unlock(&ring_info->ring_buffer_mutex);
 }
 
 /* Write to the ring buffer. */
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 7f15c41d952e..84f3a510b4c9 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1410,7 +1410,7 @@ static void vmbus_chan_release(struct kobject *kobj)
 
 struct vmbus_chan_attribute {
 	struct attribute attr;
-	ssize_t (*show)(const struct vmbus_channel *chan, char *buf);
+	ssize_t (*show)(struct vmbus_channel *chan, char *buf);
 	ssize_t (*store)(struct vmbus_channel *chan,
 			 const char *buf, size_t count);
 };
@@ -1429,7 +1429,7 @@ static ssize_t vmbus_chan_attr_show(struct kobject *kobj,
 {
 	const struct vmbus_chan_attribute *attribute
 		= container_of(attr, struct vmbus_chan_attribute, attr);
-	const struct vmbus_channel *chan
+	struct vmbus_channel *chan
 		= container_of(kobj, struct vmbus_channel, kobj);
 
 	if (!attribute->show)
@@ -1442,57 +1442,81 @@ static const struct sysfs_ops vmbus_chan_sysfs_ops = {
 	.show = vmbus_chan_attr_show,
 };
 
-static ssize_t out_mask_show(const struct vmbus_channel *channel, char *buf)
+static ssize_t out_mask_show(struct vmbus_channel *channel, char *buf)
 {
-	const struct hv_ring_buffer_info *rbi = &channel->outbound;
+	struct hv_ring_buffer_info *rbi = &channel->outbound;
+	ssize_t ret;
 
-	if (!rbi->ring_buffer)
+	mutex_lock(&rbi->ring_buffer_mutex);
+	if (!rbi->ring_buffer) {
+		mutex_unlock(&rbi->ring_buffer_mutex);
 		return -EINVAL;
+	}
 
-	return sprintf(buf, "%u\n", rbi->ring_buffer->interrupt_mask);
+	ret = sprintf(buf, "%u\n", rbi->ring_buffer->interrupt_mask);
+	mutex_unlock(&rbi->ring_buffer_mutex);
+	return ret;
 }
 static VMBUS_CHAN_ATTR_RO(out_mask);
 
-static ssize_t in_mask_show(const struct vmbus_channel *channel, char *buf)
+static ssize_t in_mask_show(struct vmbus_channel *channel, char *buf)
 {
-	const struct hv_ring_buffer_info *rbi = &channel->inbound;
+	struct hv_ring_buffer_info *rbi = &channel->inbound;
+	ssize_t ret;
 
-	if (!rbi->ring_buffer)
+	mutex_lock(&rbi->ring_buffer_mutex);
+	if (!rbi->ring_buffer) {
+		mutex_unlock(&rbi->ring_buffer_mutex);
 		return -EINVAL;
+	}
 
-	return sprintf(buf, "%u\n", rbi->ring_buffer->interrupt_mask);
+	ret = sprintf(buf, "%u\n", rbi->ring_buffer->interrupt_mask);
+	mutex_unlock(&rbi->ring_buffer_mutex);
+	return ret;
 }
 static VMBUS_CHAN_ATTR_RO(in_mask);
 
-static ssize_t read_avail_show(const struct vmbus_channel *channel, char *buf)
+static ssize_t read_avail_show(struct vmbus_channel *channel, char *buf)
 {
-	const struct hv_ring_buffer_info *rbi = &channel->inbound;
+	struct hv_ring_buffer_info *rbi = &channel->inbound;
+	ssize_t ret;
 
-	if (!rbi->ring_buffer)
+	mutex_lock(&rbi->ring_buffer_mutex);
+	if (!rbi->ring_buffer) {
+		mutex_unlock(&rbi->ring_buffer_mutex);
 		return -EINVAL;
+	}
 
-	return sprintf(buf, "%u\n", hv_get_bytes_to_read(rbi));
+	ret = sprintf(buf, "%u\n", hv_get_bytes_to_read(rbi));
+	mutex_unlock(&rbi->ring_buffer_mutex);
+	return ret;
 }
 static VMBUS_CHAN_ATTR_RO(read_avail);
 
-static ssize_t write_avail_show(const struct vmbus_channel *channel, char *buf)
+static ssize_t write_avail_show(struct vmbus_channel *channel, char *buf)
 {
-	const struct hv_ring_buffer_info *rbi = &channel->outbound;
+	struct hv_ring_buffer_info *rbi = &channel->outbound;
+	ssize_t ret;
 
-	if (!rbi->ring_buffer)
+	mutex_lock(&rbi->ring_buffer_mutex);
+	if (!rbi->ring_buffer) {
+		mutex_unlock(&rbi->ring_buffer_mutex);
 		return -EINVAL;
+	}
 
-	return sprintf(buf, "%u\n", hv_get_bytes_to_write(rbi));
+	ret = sprintf(buf, "%u\n", hv_get_bytes_to_write(rbi));
+	mutex_unlock(&rbi->ring_buffer_mutex);
+	return ret;
 }
 static VMBUS_CHAN_ATTR_RO(write_avail);
 
-static ssize_t show_target_cpu(const struct vmbus_channel *channel, char *buf)
+static ssize_t show_target_cpu(struct vmbus_channel *channel, char *buf)
 {
 	return sprintf(buf, "%u\n", channel->target_cpu);
 }
 static VMBUS_CHAN_ATTR(cpu, S_IRUGO, show_target_cpu, NULL);
 
-static ssize_t channel_pending_show(const struct vmbus_channel *channel,
+static ssize_t channel_pending_show(struct vmbus_channel *channel,
 				    char *buf)
 {
 	return sprintf(buf, "%d\n",
@@ -1501,7 +1525,7 @@ static ssize_t channel_pending_show(const struct vmbus_channel *channel,
 }
 static VMBUS_CHAN_ATTR(pending, S_IRUGO, channel_pending_show, NULL);
 
-static ssize_t channel_latency_show(const struct vmbus_channel *channel,
+static ssize_t channel_latency_show(struct vmbus_channel *channel,
 				    char *buf)
 {
 	return sprintf(buf, "%d\n",
@@ -1510,19 +1534,19 @@ static ssize_t channel_latency_show(const struct vmbus_channel *channel,
 }
 static VMBUS_CHAN_ATTR(latency, S_IRUGO, channel_latency_show, NULL);
 
-static ssize_t channel_interrupts_show(const struct vmbus_channel *channel, char *buf)
+static ssize_t channel_interrupts_show(struct vmbus_channel *channel, char *buf)
 {
 	return sprintf(buf, "%llu\n", channel->interrupts);
 }
 static VMBUS_CHAN_ATTR(interrupts, S_IRUGO, channel_interrupts_show, NULL);
 
-static ssize_t channel_events_show(const struct vmbus_channel *channel, char *buf)
+static ssize_t channel_events_show(struct vmbus_channel *channel, char *buf)
 {
 	return sprintf(buf, "%llu\n", channel->sig_events);
 }
 static VMBUS_CHAN_ATTR(events, S_IRUGO, channel_events_show, NULL);
 
-static ssize_t channel_intr_in_full_show(const struct vmbus_channel *channel,
+static ssize_t channel_intr_in_full_show(struct vmbus_channel *channel,
 					 char *buf)
 {
 	return sprintf(buf, "%llu\n",
@@ -1530,7 +1554,7 @@ static ssize_t channel_intr_in_full_show(const struct vmbus_channel *channel,
 }
 static VMBUS_CHAN_ATTR(intr_in_full, 0444, channel_intr_in_full_show, NULL);
 
-static ssize_t channel_intr_out_empty_show(const struct vmbus_channel *channel,
+static ssize_t channel_intr_out_empty_show(struct vmbus_channel *channel,
 					   char *buf)
 {
 	return sprintf(buf, "%llu\n",
@@ -1538,7 +1562,7 @@ static ssize_t channel_intr_out_empty_show(const struct vmbus_channel *channel,
 }
 static VMBUS_CHAN_ATTR(intr_out_empty, 0444, channel_intr_out_empty_show, NULL);
 
-static ssize_t channel_out_full_first_show(const struct vmbus_channel *channel,
+static ssize_t channel_out_full_first_show(struct vmbus_channel *channel,
 					   char *buf)
 {
 	return sprintf(buf, "%llu\n",
@@ -1546,7 +1570,7 @@ static ssize_t channel_out_full_first_show(const struct vmbus_channel *channel,
 }
 static VMBUS_CHAN_ATTR(out_full_first, 0444, channel_out_full_first_show, NULL);
 
-static ssize_t channel_out_full_total_show(const struct vmbus_channel *channel,
+static ssize_t channel_out_full_total_show(struct vmbus_channel *channel,
 					   char *buf)
 {
 	return sprintf(buf, "%llu\n",
@@ -1554,14 +1578,14 @@ static ssize_t channel_out_full_total_show(const struct vmbus_channel *channel,
 }
 static VMBUS_CHAN_ATTR(out_full_total, 0444, channel_out_full_total_show, NULL);
 
-static ssize_t subchannel_monitor_id_show(const struct vmbus_channel *channel,
+static ssize_t subchannel_monitor_id_show(struct vmbus_channel *channel,
 					  char *buf)
 {
 	return sprintf(buf, "%u\n", channel->offermsg.monitorid);
 }
 static VMBUS_CHAN_ATTR(monitor_id, S_IRUGO, subchannel_monitor_id_show, NULL);
 
-static ssize_t subchannel_id_show(const struct vmbus_channel *channel,
+static ssize_t subchannel_id_show(struct vmbus_channel *channel,
 				  char *buf)
 {
 	return sprintf(buf, "%u\n",
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 64698ec8f2ac..8b9a93c99c9b 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -141,6 +141,11 @@ struct hv_ring_buffer_info {
 
 	u32 ring_datasize;		/* < ring_size */
 	u32 priv_read_index;
+	/*
+	 * The ring buffer mutex lock. This lock prevents the ring buffer from
+	 * being freed while the ring buffer is being accessed.
+	 */
+	struct mutex ring_buffer_mutex;
 };
 
 
@@ -1206,7 +1211,7 @@ struct hv_ring_buffer_debug_info {
 };
 
 
-int hv_ringbuffer_get_debuginfo(const struct hv_ring_buffer_info *ring_info,
+int hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info,
 				struct hv_ring_buffer_debug_info *debug_info);
 
 /* Vmbus interface */
-- 
2.17.1


^ permalink raw reply related

* [PATCH v3 2/3] Drivers: hv: vmbus: Set ring_info field to 0 and remove memset
From: Kimberly Brown @ 2019-03-14 20:05 UTC (permalink / raw)
  To: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui
  Cc: K. Y. Srinivasan, Haiyang Zhang, linux-hyperv, linux-kernel
In-Reply-To: <cover.1552592620.git.kimbrownkd@gmail.com>

Set "ring_info->priv_read_index" to 0. Now, all of ring_info's fields
are explicitly set in this function. The memset() call is no longer
necessary, so remove it.

Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>
---
 drivers/hv/ring_buffer.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 9e8b31ccc142..0386ff48c5ea 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -197,8 +197,6 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
 
 	BUILD_BUG_ON((sizeof(struct hv_ring_buffer) != PAGE_SIZE));
 
-	memset(ring_info, 0, sizeof(struct hv_ring_buffer_info));
-
 	/*
 	 * First page holds struct hv_ring_buffer, do wraparound mapping for
 	 * the rest.
@@ -232,6 +230,7 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
 		reciprocal_value(ring_info->ring_size / 10);
 	ring_info->ring_datasize = ring_info->ring_size -
 		sizeof(struct hv_ring_buffer);
+	ring_info->priv_read_index = 0;
 
 	spin_lock_init(&ring_info->ring_lock);
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH v3 1/3] Drivers: hv: vmbus: Refactor chan->state if statement
From: Kimberly Brown @ 2019-03-14 20:05 UTC (permalink / raw)
  To: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui
  Cc: K. Y. Srinivasan, Haiyang Zhang, linux-hyperv, linux-kernel
In-Reply-To: <cover.1552592620.git.kimbrownkd@gmail.com>

The chan->state "if statement" was introduced in commit 6712cc9c2211
("vmbus: don't return values for uninitalized channels"). That commit
states that the purpose of the chan->state "if statement" is to prevent
returning garbage or causing a kernel OOPS when the channel ring buffer
is not initialized. The changes in this patch provide the same
protection.

Refactor the chan->state “if statement” in vmbus_chan_attr_show():
 - Instead of checking the channel state in the "if statement", check
   whether the channel ring buffer pointer is NULL. Checking the
   ring buffer pointer makes this code consistent with
   hv_ringbuffer_get_debuginfo().

 - Move the "if statement" to the four "_show" functions that access a
   channel ring buffer. Only four of the channel-level "_show" functions
   access a ring buffer. The ring buffer pointer does not need to be
   checked before calling the other "_show" functions, and moving the
   ring buffer pointer "if statement" to the "_show" functions that
   access a ring buffer makes the purpose of the "if statement" clear.

Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
---
 drivers/hv/vmbus_drv.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index a7bb709870a8..7f15c41d952e 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1435,9 +1435,6 @@ static ssize_t vmbus_chan_attr_show(struct kobject *kobj,
 	if (!attribute->show)
 		return -EIO;
 
-	if (chan->state != CHANNEL_OPENED_STATE)
-		return -EINVAL;
-
 	return attribute->show(chan, buf);
 }
 
@@ -1449,6 +1446,9 @@ static ssize_t out_mask_show(const struct vmbus_channel *channel, char *buf)
 {
 	const struct hv_ring_buffer_info *rbi = &channel->outbound;
 
+	if (!rbi->ring_buffer)
+		return -EINVAL;
+
 	return sprintf(buf, "%u\n", rbi->ring_buffer->interrupt_mask);
 }
 static VMBUS_CHAN_ATTR_RO(out_mask);
@@ -1457,6 +1457,9 @@ static ssize_t in_mask_show(const struct vmbus_channel *channel, char *buf)
 {
 	const struct hv_ring_buffer_info *rbi = &channel->inbound;
 
+	if (!rbi->ring_buffer)
+		return -EINVAL;
+
 	return sprintf(buf, "%u\n", rbi->ring_buffer->interrupt_mask);
 }
 static VMBUS_CHAN_ATTR_RO(in_mask);
@@ -1465,6 +1468,9 @@ static ssize_t read_avail_show(const struct vmbus_channel *channel, char *buf)
 {
 	const struct hv_ring_buffer_info *rbi = &channel->inbound;
 
+	if (!rbi->ring_buffer)
+		return -EINVAL;
+
 	return sprintf(buf, "%u\n", hv_get_bytes_to_read(rbi));
 }
 static VMBUS_CHAN_ATTR_RO(read_avail);
@@ -1473,6 +1479,9 @@ static ssize_t write_avail_show(const struct vmbus_channel *channel, char *buf)
 {
 	const struct hv_ring_buffer_info *rbi = &channel->outbound;
 
+	if (!rbi->ring_buffer)
+		return -EINVAL;
+
 	return sprintf(buf, "%u\n", hv_get_bytes_to_write(rbi));
 }
 static VMBUS_CHAN_ATTR_RO(write_avail);
-- 
2.17.1


^ permalink raw reply related

* [PATCH v3 0/3] Drivers: hv: vmbus: Fix a race condition in "_show" functions
From: Kimberly Brown @ 2019-03-14 20:04 UTC (permalink / raw)
  To: Michael Kelley, Long Li, Sasha Levin, Stephen Hemminger,
	Dexuan Cui
  Cc: K. Y. Srinivasan, Haiyang Zhang, linux-hyperv, linux-kernel
In-Reply-To: <cover.1550806305.git.kimbrownkd@gmail.com>

This patchset fixes a race condition in the "_show" functions that
access the channel ring buffers.

Changes in v3:
Patch 1: Drivers: hv: vmbus: Refactor chan->state if statement
 - Added the “reviewed-by” line from v2.

Patch 2: Drivers: hv: vmbus: Set ring_info field to 0 and remove memset
 - This patch is new. This change allows the new mutex locks in patch 3
   to be initialized when the channel is initialized.

Patch 3: Drivers: hv: vmbus: Fix race condition with new
         ring_buffer_info mutex
 - Added two ring buffer info mutex locks instead of the single channel
   mutex lock that was proposed in v2.
 - Changed the mutex acquire/release calls as needed for the new ring
   buffer info mutex locks.
 - Updated the commit message.


Changes in v2:
 - In v1, I proposed using “vmbus_connection.channel_mutex” in the
   “_show” functions to prevent the race condition. However, using this
   mutex could result in a deadlock, so a new approach is proposed in
   this patchset.
 - Patch 1 is new and consists of refactoring an if statement.
 - Patch 2 introduces a new mutex lock in the “vmbus_channel” struct,
   which is used to eliminate the race condition.

Kimberly Brown (3):
  Drivers: hv: vmbus: Refactor chan->state if statement
  Drivers: hv: vmbus: Set ring_info field to 0 and remove memset
  Drivers: hv: vmbus: Fix race condition with new ring_buffer_info mutex

 drivers/hv/channel_mgmt.c |  2 +
 drivers/hv/hyperv_vmbus.h |  1 +
 drivers/hv/ring_buffer.c  | 22 ++++++++--
 drivers/hv/vmbus_drv.c    | 89 +++++++++++++++++++++++++++------------
 include/linux/hyperv.h    |  7 ++-
 5 files changed, 88 insertions(+), 33 deletions(-)

-- 
2.17.1


^ permalink raw reply

* RE: [PATCH] hyperv: a potential NULL pointer dereference
From: KY Srinivasan @ 2019-03-14 14:56 UTC (permalink / raw)
  To: Kangjie Lu
  Cc: pakki001@umn.edu, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	x86@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190314054651.1315-1-kjlu@umn.edu>



> -----Original Message-----
> From: Kangjie Lu <kjlu@umn.edu>
> Sent: Wednesday, March 13, 2019 10:47 PM
> To: kjlu@umn.edu
> Cc: pakki001@umn.edu; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; Stephen Hemminger
> <sthemmin@microsoft.com>; Sasha Levin <sashal@kernel.org>; Thomas
> Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; Borislav
> Petkov <bp@alien8.de>; H. Peter Anvin <hpa@zytor.com>; x86@kernel.org;
> linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH] hyperv: a potential NULL pointer dereference
> 
> In case alloc_page, the fix returns -ENOMEM to avoid the potential
> NULL pointer dereference.
> 
Thanks.

> Signed-off-by: Kangjie Lu <kjlu@umn.edu>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>

> ---
>  arch/x86/hyperv/hv_init.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 7abb09e2eeb8..dfdb4ce1ae9c 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -102,9 +102,13 @@ static int hv_cpu_init(unsigned int cpu)
>  	u64 msr_vp_index;
>  	struct hv_vp_assist_page **hvp =
> &hv_vp_assist_page[smp_processor_id()];
>  	void **input_arg;
> +	struct page *pg;
> 
>  	input_arg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
> -	*input_arg = page_address(alloc_page(GFP_KERNEL));
> +	pg = alloc_page(GFP_KERNEL);
> +	if (unlikely(!pg))
> +		return -ENOMEM;
> +	*input_arg = page_address(pg);
> 
>  	hv_get_vp_index(msr_vp_index);
> 
> --
> 2.17.1


^ permalink raw reply

* Re: [PATCH] hyper-v: Check for ring buffer in hv_get_bytes_to_read/write
From: Mohammed Gamal @ 2019-03-14 12:42 UTC (permalink / raw)
  To: Stephen Hemminger, Haiyang Zhang, Michael Kelley,
	linux-hyperv@vger.kernel.org, kimbrownkd
  Cc: Sasha Levin, Dexuan Cui, Long Li, KY Srinivasan, vkuznets
In-Reply-To: <SN6PR2101MB0912DF97AC683F8DAD5B59FDCC4A0@SN6PR2101MB0912.namprd21.prod.outlook.com>

On Wed, 2019-03-13 at 21:12 +0000, Stephen Hemminger wrote:
> What test are you running?

I am running iperf3 with the following arguments:
iperf3 -u -c ${iperf3 server address} -b 0 -P8 -t 3600

while changing the interface parameters in parallel with the following
script:

cat ./test.sh
#!/bin/bash
device="eth1"
i=0
while [ "$i" -lt 1000 ]
do
    ethtool -L $device combined 1
    ethtool -L $device combined 2
    let "i++"
    echo $i
done

> 
> -----Original Message-----
> From: Mohammed Gamal <mgamal@redhat.com> 
> Sent: Wednesday, March 13, 2019 3:25 AM
> To: Haiyang Zhang <haiyangz@microsoft.com>; Michael Kelley <mikelley@
> microsoft.com>; linux-hyperv@vger.kernel.org; kimbrownkd <kimbrownkd@
> gmail.com>
> Cc: Sasha Levin <Alexander.Levin@microsoft.com>; Dexuan Cui <decui@mi
> crosoft.com>; Stephen Hemminger <sthemmin@microsoft.com>; Long Li <lo
> ngli@microsoft.com>; KY Srinivasan <kys@microsoft.com>; vkuznets <vku
> znets@redhat.com>; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] hyper-v: Check for ring buffer in
> hv_get_bytes_to_read/write
> 
> On Tue, 2019-03-12 at 18:02 +0000, Haiyang Zhang wrote:
> >  
> >  
> > > -----Original Message-----
> > > From: Mohammed Gamal <mgamal@redhat.com>
> > > Sent: Thursday, March 7, 2019 1:32 PM
> > > To: Michael Kelley <mikelley@microsoft.com>; linux-hyperv@vger.ke
> > > rn
> > 
> > el.org;
> > > kimbrownkd <kimbrownkd@gmail.com>
> > > Cc: Sasha Levin <Alexander.Levin@microsoft.com>; Dexuan Cui
> > > <decui@microsoft.com>; Stephen Hemminger <sthemmin@microsoft.com>
> > > ;
> > > Long Li <longli@microsoft.com>; KY Srinivasan <kys@microsoft.com>
> > > ;
> > 
> > Haiyang
> > > Zhang <haiyangz@microsoft.com>; vkuznets <vkuznets@redhat.com>;
> > 
> > linux-
> > > kernel@vger.kernel.org
> > > Subject: Re: [PATCH] hyper-v: Check for ring buffer in
> > > hv_get_bytes_to_read/write
> > > 
> > > On Thu, 2019-03-07 at 17:33 +0000, Michael Kelley wrote:
> > > > From: Mohammed Gamal <mgamal@redhat.com> Sent: Thursday, March
> > > > 7,
> > > > 2019 8:36 AM
> > > > > 
> > > > > This patch adds a check for the presence of the ring buffer
> > > > > in
> > > > > hv_get_bytes_to_read/write() to avoid possible NULL pointer
> > > > > dereferences.
> > > > > If the ring buffer is not yet allocated, return 0 bytes to be
> > > > > read/written.
> > > > > 
> > > > > The root cause is that code that accesses the ring buffer
> > 
> > including
> > > > > hv_get_bytes_to_read/write() could be vulnerable to the race
> > > > > condition discussed in
> > > > > https://nam06.safelinks.protection.outlook.com/?url=https%3A%
> > > > > 2F
> > 
> > %2Flk
> > > > > 
> > > 
> > > ml.org%2Flkml%2F2018%2F10%2F18%2F779&amp;data=02%7C01%7Chaiyangz
> > > %40m
> > > > > 
> > > 
> > > icrosoft.com%7C73af013c14034bb0b1ad08d6a32b419c%7C72f988bf86f141a
> > > f9
> > > 1
> > > > > 
> > > 
> > > ab2d7cd011db47%7C1%7C0%7C636875803518430021&amp;sdata=b51Xc5GUN
> > > nHX0K
> > > > > 08LrH3ShTyFcRZ4mYHUATd%2BDpvYDw%3D&amp;reserved=0>;
> > > > > 
> > > > > This race is being addressed by the patch series by Kimberly
> > 
> > Brown
> > > > > in
> > > > > https://nam06.safelinks.protection.outlook.com/?url=https%3A%
> > > > > 2F
> > 
> > %2Flk
> > > > > 
> > > 
> > > ml.org%2Flkml%2F2019%2F2%2F21%2F1236&amp;data=02%7C01%7Chaiyangz
> > > %40m
> > > > > 
> > > 
> > > icrosoft.com%7C73af013c14034bb0b1ad08d6a32b419c%7C72f988bf86f141a
> > > f9
> > > 1
> > > > > 
> > > 
> > > ab2d7cd011db47%7C1%7C0%7C636875803518430021&amp;sdata=js1ff15Gbk7
> > > 0MD
> > > > > A2hkMZExxvAAbDuKDhfBvc5ZrckzM%3D&amp;reserved=0 which is not
> > > 
> > > final
> > > > > yet
> > > > > 
> > > > > Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
> > > > 
> > > > Could you elaborate on the code paths where
> > > > hv_get_bytes_to_read/write() could be called when the ring
> > > > buffer
> > > > isn't yet allocated?  My sense is that Kim Brown's patch will
> > 
> > address
> > > > all of the code paths that involved sysfs access from outside
> > > > the
> > > > driver.  And within a driver, the ring buffer should never be
> > 
> > accessed
> > > > unless it is already allocated.  Is there another code path
> > > > we're
> > 
> > not
> > > > aware of?  I'm wondering if these changes are really needed
> > > > once
> > 
> > Kim
> > > > Brown's patch is finished.
> > > > 
> > > > Michael
> > > 
> > > I've seen one instance of the race in the netvsc driver when
> > 
> > running traffic
> > > through it with iperf3 while continuously changing the channel
> > 
> > settings.
> > > 
> > > The following code path deallocates the ring buffer:
> > > netvsc_set_channels() -> netvsc_detach() ->
> > > rndis_filter_device_remove() -> netvsc_device_remove() ->
> > 
> > vmbus_close()
> > > -> vmbus_free_ring() -> hv_ringbuffer_cleanup().
> > > 
> > > netvsc_send_pkt() -> hv_get_bytes_to_write() might get called
> > 
> > concurrently
> > > after vmbus_close() and before vmbus_open() returns and sets up
> > > the
> > 
> > new ring
> > > buffer.
> > > 
> > > The race is fairly hard to reproduce on recent upstream kernels,
> > 
> > but I still
> > > managed to reproduce it.
> > 
> >  
> > Looking at the code from netvsc_detach() –
> >          netif_tx_disable(ndev) is called before
> > rndis_filter_device_remove(hdev, nvdev).
> > So there should be no call to netvsc_send_pkt() after detaching.
> > What’s the crash stack trace?
> >  
> > static int netvsc_detach(struct net_device *ndev,
> >                          struct netvsc_device *nvdev)
> > {
> >         struct net_device_context *ndev_ctx = netdev_priv(ndev);
> >         struct hv_device *hdev = ndev_ctx->device_ctx;
> >         int ret;
> >  
> >         /* Don't try continuing to try and setup sub channels */
> >         if (cancel_work_sync(&nvdev->subchan_work))
> >                 nvdev->num_chn = 1;
> >  
> >         /* If device was up (receiving) then shutdown */
> >         if (netif_running(ndev)) {
> >                 netif_tx_disable(ndev);
> >  
> >                 ret = rndis_filter_close(nvdev);
> >                 if (ret) {
> >                         netdev_err(ndev,
> >                                    "unable to close device (ret
> > %d).\n", ret);
> >                         return ret;
> >                 }
> >  
> >                 ret = netvsc_wait_until_empty(nvdev);
> >                 if (ret) {
> >                         netdev_err(ndev,
> >                                    "Ring buffer not empty after
> > closing rndis\n");
> >                         return ret;
> >                 }
> >         }
> >  
> >         netif_device_detach(ndev);
> >  
> >         rndis_filter_device_remove(hdev, nvdev);
> >  
> >         return 0;
> > }
> >  
> > Thanks,
> > Haiyang
> 
> Here is one stack trace on a 4.18 kernel, the most recent kernel I
> managed to reproduce this bug on. 
> I haven't managed to reproduce on 5.0.0 yet, but I guess some recent
> changes to the netvsc driver could be masking the problem, as I tried
> backporting those changes to older RHEL 7 kernels and still managed
> to
> reproduce the problem there. I could however be wrong, and any
> pointers
> are still appreciated:
> 
> [  545.308507] BUG: unable to handle kernel NULL pointer dereference
> at
> 0000000000000004
> [  545.308656] PGD 0 P4D 0 
> [  545.308763] Oops: 0000 [#1] SMP PTI
> [  545.308855] CPU: 2 PID: 1800 Comm: iperf3 Kdump: loaded Not
> tainted
> 4.18.0-64.el8.test.x86_64 #1
> [  545.308990] Hardware name: Microsoft Corporation Virtual
> Machine/Virtual Machine, BIOS Hyper-V UEFI Release v1.0 11/26/2012
> [  545.309143] RIP: 0010:netvsc_send+0x2c9/0xce0 [hv_netvsc]
> [  545.309298] Code: 4c 8b b1 c0 00 00 00 4f 8d 2c 64 49 c1 e5 07 4d
> 03
> ae c0 03 00 00 48 8b 84 03 30 01 00 00 4c 89 6c 24 18 48 8b 90 20 01
> 00
> 00 <8b> 72 04 8b 0a 8b 90 38 01 00 00 89 f7 01 f2 29 cf 29 ca 39 ce
>  0f
> [  545.309321] RSP: 0018:ffffb8a305d5b6c0 EFLAGS: 00010282
> [  545.309321] RAX: ffff926928bd7000 RBX: ffff92687dbe0000 RCX:
> ffff92687d5bec00
> [  545.309321] RDX: 0000000000000000 RSI: ffff92691b61c654 RDI:
> 0000000000000000
> [  545.309321] RBP: ffff926915dcde28 R08: ffff926915dcde00 R09:
> 0000000000000000
> [  545.309321] R10: 00000000000db61c R11: 0000000000000f7e R12:
> 0000000000000001
> [  545.309321] R13: ffff926931808180 R14: ffff926931801000 R15:
> 0000000000000000
> [  545.309321] FS:  00007feca6a4b740(0000) GS:ffff926940080000(0000)
> knlGS:0000000000000000
> [  545.309321] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  545.309321] CR2: 0000000000000004 CR3: 00000000dfccc004 CR4:
> 00000000003606e0
> [  545.309321] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [  545.309321] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [  545.309321] Call Trace:
> [  545.309321]  netvsc_start_xmit+0x3c9/0x800 [hv_netvsc]
> [  545.309321]  ? __switch_to_asm+0x34/0x70
> [  545.309321]  ? __switch_to_asm+0x34/0x70
> [  545.309321]  ? ___slab_alloc+0x269/0x4e0
> [  545.309321]  ? __alloc_skb+0x82/0x1c0
> [  545.309321]  ? nft_do_chain+0x3d7/0x3f0 [nf_tables]
> [  545.309321]  ? nft_do_chain+0x3d7/0x3f0 [nf_tables]
> [  545.309321]  ? nft_do_chain+0x3d7/0x3f0 [nf_tables]
> [  545.309321]  ? _cond_resched+0x15/0x30
> [  545.309321]  ? netif_skb_features+0x118/0x280
> [  545.309321]  dev_hard_start_xmit+0xa5/0x210
> [  545.309321]  sch_direct_xmit+0x14f/0x340
> [  545.309321]  __dev_queue_xmit+0x799/0x8f0
> [  545.309321]  ip_finish_output2+0x2e0/0x430
> [  545.309321]  ? ip_finish_output+0x139/0x270
> [  545.309321]  ip_output+0x6c/0xe0
> [  545.309321]  ? ip_append_data.part.50+0xc0/0xc0
> [  545.309321]  ip_send_skb+0x15/0x40
> [  545.309321]  udp_send_skb.isra.43+0x153/0x340
> [  545.309321]  udp_sendmsg+0xac2/0xd30
> [  545.309321]  ? set_fd_set.part.7+0x40/0x40
> [  545.309321]  ? set_fd_set.part.7+0x40/0x40
> [  545.309321]  ? __check_object_size+0xa3/0x181
> [  545.309321]  ? sock_has_perm+0x78/0xa0
> [  545.309321]  ? core_sys_select+0x242/0x2f0
> [  545.309321]  ? sock_sendmsg+0x36/0x40
> [  545.309321]  ? udp_push_pending_frames+0x60/0x60
> [  545.309321]  sock_sendmsg+0x36/0x40
> [  545.309321]  sock_write_iter+0x8f/0xf0
> [  545.309321]  __vfs_write+0x156/0x1a0
> [  545.309321]  vfs_write+0xa5/0x1a0
> [  545.309321]  ksys_write+0x4f/0xb0
> [  545.309321]  do_syscall_64+0x5b/0x1b0
> [  545.309321]  entry_SYSCALL_64_after_hwframe+0x65/0xca
> [  545.309321] RIP: 0033:0x7feca5fb5348
> [  545.309321] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00
> 00
> 00 f3 0f 1e fa 48 8d 05 d5 63 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00
> 0f
> 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4
> 55
> [  545.309321] RSP: 002b:00007ffc3ff1f108 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000001
> [  545.309321] RAX: ffffffffffffffda RBX: 00000000000005a8 RCX:
> 00007feca5fb5348
> [  545.309321] RDX: 00000000000005a8 RSI: 00007feca6a59000 RDI:
> 0000000000000009
> [  545.309321] RBP: 00007feca6a59000 R08: 0000000000000002 R09:
> 00cd09a3238b4e43
> [  545.309321] R10: 0002961ecea49016 R11: 0000000000000246 R12:
> 0000000000000009
> [  545.309321] R13: 00000000000005a8 R14: 00007ffc3ff1f180 R15:
> 0000563c1e05b260
> [  545.309321] Modules linked in: nft_chain_nat_ipv6
> nf_conntrack_ipv6
> nf_defrag_ipv6 nf_nat_ipv6 nft_chain_route_ipv6 nft_chain_nat_ipv4
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
> nft_chain_route_ipv4 nf_conntrack ip_set nf_tables nfnetlink vfat fat
> sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
> intel_rapl_perf sg hv_utils hv_balloon pcspkr joydev xfs libcrc32c
> sd_mod sr_mod cdrom serio_raw hv_storvsc hv_netvsc scsi_transport_fc
> hyperv_fb hyperv_keyboard hid_hyperv crc32c_intel hv_vmbus dm_mirror
> dm_region_hash dm_log dm_mod [last unloaded: nft_compat]
> [  545.309321] CR2: 0000000000000004
> 
> From the stack trace netvsc_send+0x2c9 points to this line:
> 
> static inline u32 hv_get_bytes_to_write(const struct 	hv_ring_
> bu
> ffer_info *rbi)
> {
>         u32 read_loc, write_loc, dsize, write;
> 
>         dsize = rbi->ring_datasize;
>         read_loc = READ_ONCE(rbi->ring_buffer->read_index);  <-------
> --
>         write_loc = rbi->ring_buffer->write_index;
> 
>         write = write_loc >= read_loc ? dsize - (write_loc -
> read_loc) 
>                 read_loc - write_loc;
>         return write;
> }
> 
> which gets called from netvsc_send_pkt().

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox