Linux virtualization list
 help / color / mirror / Atom feed
* [PATCH net-next v8 1/4] virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
From: Sridhar Samudrala @ 2018-04-25 23:59 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown
In-Reply-To: <1524700768-38627-1-git-send-email-sridhar.samudrala@intel.com>

This feature bit can be used by hypervisor to indicate virtio_net device to
act as a standby for another device with the same MAC address.

VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/virtio_net.c        | 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3b5991734118..51a085b1a242 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2999,7 +2999,7 @@ static struct virtio_device_id id_table[] = {
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
 	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-	VIRTIO_NET_F_SPEED_DUPLEX
+	VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..a3715a3224c1 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_STANDBY	  62	/* Act as standby for another device
+					 * with the same MAC.
+					 */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v8 0/4] Enable virtio_net to act as a standby for a passthru device
From: Sridhar Samudrala @ 2018-04-25 23:59 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown

This is another update based on feedback from MST and Stephen on the
last patchset. Hopefully this series can be integrated and any further
enhancements can be made on top of this patchset.

v8:
- Made the failover managment routines more robust by updating the feature 
  bits/other fields in the failover netdev when slave netdevs are 
  registered/unregistered. (mst)
- added support for handling vlans.
- Limited the changes in netvsc to only use the notifier/event/lookups
  from the failover module. The slave register/unregister/link-change 
  handlers are only updated to use the getbymac routine to get the 
  upper netdev. There is no change in their functionality. (stephen)
- renamed structs/function/file names to use net_failover prefix. (mst)

The main motivation for this patch is to enable cloud service providers
to provide an accelerated datapath to virtio-net enabled VMs in a 
transparent manner with no/minimal guest userspace changes. This also
enables hypervisor controlled live migration to be supported with VMs that
have direct attached SR-IOV VF devices.

Patch 1 introduces a new feature bit VIRTIO_NET_F_STANDBY that can be
used by hypervisor to indicate that virtio_net interface should act as
a standby for another device with the same MAC address.

Patch 2 introduces a failover module that provides a generic interface for 
paravirtual drivers to listen for netdev register/unregister/link change
events from pci ethernet devices with the same MAC and takeover their
datapath. The notifier and event handling code is based on the existing
netvsc implementation. It provides 2 sets of interfaces to paravirtual 
drivers to support 2-netdev(netvsc) and 3-netdev(virtio_net) models.

Patch 3 extends virtio_net to use alternate datapath when available and
registered. When STANDBY feature is enabled, virtio_net driver creates
an additional 'failover' netdev that acts as a master device and controls
2 slave devices.  The original virtio_net netdev is registered as
'standby' netdev and a passthru/vf device with the same MAC gets
registered as 'primary' netdev. Both 'standby' and 'primary' netdevs are
associated with the same 'pci' device.  The user accesses the network
interface via 'failover' netdev. The 'failover' netdev chooses 'primary'
netdev as default for transmits when it is available with link up and
running.

Patch 4 refactors netvsc to use the registration/notification framework
supported by failover module.

As this patch series is initially focusing on usecases where hypervisor 
fully controls the VM networking and the guest is not expected to directly 
configure any hardware settings, it doesn't expose all the ndo/ethtool ops
that are supported by virtio_net at this time. To support additional usecases,
it should be possible to enable additional ops later by caching the state
in virtio netdev and replaying when the 'primary' netdev gets registered. 
 
The hypervisor needs to enable only one datapath at any time so that packets
don't get looped back to the VM over the other datapath. When a VF is
plugged, the virtio datapath link state can be marked as down.
At the time of live migration, the hypervisor needs to unplug the VF device
from the guest on the source host and reset the MAC filter of the VF to
initiate failover of datapath to virtio before starting the migration. After
the migration is completed, the destination hypervisor sets the MAC filter
on the VF and plugs it back to the guest to switch over to VF datapath.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

v7
- Rename 'bypass/active/backup' terminology with 'failover/primary/standy'
  (jiri, mst)
- re-arranged dev_open() and dev_set_mtu() calls in the register routines
  so that they don't get called for 2-netdev model. (stephen)
- fixed select_queue() routine to do queue selection based on VF if it is
  registered as primary. (stephen)
-  minor bugfixes

v6 RFC:
  Simplified virtio_net changes by moving all the ndo_ops of the 
  bypass_netdev and create/destroy of bypass_netdev to 'bypass' module.
  avoided 2 phase registration(driver + instances).
  introduced IFF_BYPASS/IFF_BYPASS_SLAVE dev->priv_flags 
  replaced mutex with a spinlock

v5 RFC:
  Based on Jiri's comments, moved the common functionality to a 'bypass'
  module so that the same notifier and event handlers to handle child
  register/unregister/link change events can be shared between virtio_net
  and netvsc.
  Improved error handling based on Siwei's comments.
v4:
- Based on the review comments on the v3 version of the RFC patch and
  Jakub's suggestion for the naming issue with 3 netdev solution,
  proposed 3 netdev in-driver bonding solution for virtio-net.
v3 RFC:
- Introduced 3 netdev model and pointed out a couple of issues with
  that model and proposed 2 netdev model to avoid these issues.
- Removed broadcast/multicast optimization and only use virtio as
  backup path when VF is unplugged.
v2 RFC:
- Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
- made a small change to the virtio-net xmit path to only use VF datapath
  for unicasts. Broadcasts/multicasts use virtio datapath. This avoids
  east-west broadcasts to go over the PCI link.
- added suppport for the feature bit in qemu

Sridhar Samudrala (4):
  virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
  net: Introduce generic failover module
  virtio_net: Extend virtio to use VF datapath when available
  netvsc: refactor notifier/event handling code to use the failover
    framework

 drivers/net/Kconfig             |   1 +
 drivers/net/hyperv/Kconfig      |   1 +
 drivers/net/hyperv/hyperv_net.h |   2 +
 drivers/net/hyperv/netvsc_drv.c | 134 ++----
 drivers/net/virtio_net.c        |  37 +-
 include/linux/netdevice.h       |  16 +
 include/net/net_failover.h      |  94 +++++
 include/uapi/linux/virtio_net.h |   3 +
 net/Kconfig                     |  18 +
 net/core/Makefile               |   1 +
 net/core/net_failover.c         | 892 ++++++++++++++++++++++++++++++++++++++++
 11 files changed, 1086 insertions(+), 113 deletions(-)
 create mode 100644 include/net/net_failover.h
 create mode 100644 net/core/net_failover.c

-- 
2.14.3

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-25 23:00 UTC (permalink / raw)
  To: James Bottomley
  Cc: eric.dumazet, mst, netdev, Randy Dunlap, linux-kernel,
	Matthew Wilcox, Michal Hocko, linux-mm, dm-devel, Vlastimil Babka,
	David Rientjes, Andrew Morton, virtualization, David Miller,
	edumazet
In-Reply-To: <1524694663.4100.21.camel@HansenPartnership.com>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1285 bytes --]



On Wed, 25 Apr 2018, James Bottomley wrote:

> > > Do we really need the new config option?  This could just be
> > > manually  tunable via fault injection IIUC.
> > 
> > We do, because we want to enable it in RHEL and Fedora debugging
> > kernels, so that it will be tested by the users.
> > 
> > The users won't use some extra magic kernel options or debugfs files.
> 
> If it can be enabled via a tunable, then the distro can turn it on
> without the user having to do anything.  If you want to present the
> user with a different boot option, you can (just have the tunable set
> on the command line), but being tunable driven means that you don't
> have to choose that option, you could automatically enable it under a
> range of circumstances.  I think most sane distributions would want
> that flexibility.
> 
> Kconfig proliferation, conversely, is a bit of a nightmare from both
> the user and the tester's point of view, so we're trying to avoid it
> unless absolutely necessary.
> 
> James

BTW. even developers who compile their own kernel should have this enabled 
by a CONFIG option - because if the developer sees the option when 
browsing through menuconfig, he may enable it. If he doesn't see the 
option, he won't even know that such an option exists.

Mikulas

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Siwei Liu @ 2018-04-25 22:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Sridhar Samudrala, virtualization, Netdev, David Miller
In-Reply-To: <20180426011221-mutt-send-email-mst@kernel.org>

On Wed, Apr 25, 2018 at 3:22 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Apr 25, 2018 at 02:38:57PM -0700, Siwei Liu wrote:
>> On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
>> >> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> >> >> On Mon, 23 Apr 2018 20:24:56 +0300
>> >> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >> >>
>> >> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> >> >> > > > >
>> >> >> > > > >I will NAK patches to change to common code for netvsc especially the
>> >> >> > > > >three device model.  MS worked hard with distro vendors to support transparent
>> >> >> > > > >mode, ans we really can't have a new model; or do backport.
>> >> >> > > > >
>> >> >> > > > >Plus, DPDK is now dependent on existing model.
>> >> >> > > >
>> >> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
>> >> >> > >
>> >> >> > > The network device model is a userspace API, and DPDK is a userspace application.
>> >> >> >
>> >> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
>> >> >> > AFAIK it's normally banging device registers directly.
>> >> >> >
>> >> >> > > You can't go breaking userspace even if you don't like the application.
>> >> >> >
>> >> >> > Could you please explain how is the proposed patchset breaking
>> >> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
>> >> >> > API at all.
>> >> >> >
>> >> >>
>> >> >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
>> >> >> to look for Linux netvsc device and the paired VF device and setup the
>> >> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
>> >> >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
>> >> >> VF device.
>> >> >>
>> >> >> So it depends on existing 2 device model. You can't go to a 3 device model
>> >> >> or start hiding devices from userspace.
>> >> >
>> >> > Okay so how does the existing patch break that? IIUC does not go to
>> >> > a 3 device model since netvsc calls failover_register directly.
>> >> >
>> >> >> Also, I am working on associating netvsc and VF device based on serial number
>> >> >> rather than MAC address. The serial number is how Windows works now, and it makes
>> >> >> sense for Linux and Windows to use the same mechanism if possible.
>> >> >
>> >> > Maybe we should support same for virtio ...
>> >> > Which serial do you mean? From vpd?
>> >> >
>> >> > I guess you will want to keep supporting MAC for old hypervisors?
>> >> >
>> >> > It all seems like a reasonable thing to support in the generic core.
>> >>
>> >> That's the reason why I chose explicit identifier rather than rely on
>> >> MAC address to bind/pair a device. MAC address can change. Even if it
>> >> can't, malicious guest user can fake MAC address to skip binding.
>> >>
>> >> -Siwei
>> >
>> > Address should be sampled at device creation to prevent this
>> > kind of hack. Not that it buys the malicious user much:
>> > if you can poke at MAC addresses you probably already can
>> > break networking.
>>
>> I don't understand why poking at MAC address may potentially break
>> networking.
>
> Set a MAC address to match another device on the same LAN,
> packets will stop reaching that MAC.

What I meant was guest users may create a virtual link, say veth that
has exactly the same MAC address as that for the VF, which can easily
get around of the binding procedure. There's no explicit flag to
identify a VF or pass-through device AFAIK. And sometimes this happens
maybe due to user misconfiguring the link. This process should be
hardened to avoid from any potential configuration errors.

>
>> Unlike VF, passthrough PCI endpoint device has its freedom
>> to change the MAC address. Even on a VF setup it's not neccessarily
>> always safe to assume the VF's MAC address cannot or shouldn't be
>> changed. That depends on the specific need whether the host admin
>> wants to restrict guest from changing the MAC address, although in
>> most cases it's true.
>>
>> I understand we can use the perm_addr to distinguish. But as said,
>> this will pose limitation of flexible configuration where one can
>> assign VFs with identical MAC address at all while each VF belongs to
>> different PF and/or different subnet for e.g. load balancing.
>> And
>> furthermore, the QEMU device model never uses MAC address to be
>> interpreted as an identifier, which requires to be unique per VM
>> instance. Why we're introducing this inconsistency?
>>
>> -Siwei
>
> Because it addresses most of the issues and is simple.  That's already
> much better than what we have now which is nothing unless guest
> configures things manually.

Did you see my QEMU patch for using BDF as the grouping identifier?
And there can be others like what you suggested, but the point is that
it's requried to support explicit grouping mechanism from day one,
before the backup property cast into stones. This is orthogonal to
device model being proposed, be it 1-netdev or not. Delaying it would
just mean support and compatibility burden, appearing more like a
design flaw rather than a feature to add later on.

>
> I think ideally the infrastructure should suppport flexible matching of
> NICs - netvsc is already reported to be moving to some kind of serial
> address.
>
As Stephen said, Hyper-V supports the serial UUID thing from day-one.
It's just the Linux netvsc guest driver itself does not leverage that
ID from the very beginging.

Regards,
-Siwei

>
>> >
>> >
>> >
>> >
>> >>
>> >> >
>> >> > --
>> >> > MST

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-25 22:56 UTC (permalink / raw)
  To: David Rientjes
  Cc: eric.dumazet, mst, netdev, Randy Dunlap, linux-kernel,
	Matthew Wilcox, Michal Hocko, James Bottomley, linux-mm, dm-devel,
	Vlastimil Babka, Andrew Morton, virtualization, David Miller,
	edumazet
In-Reply-To: <alpine.DEB.2.21.1804251546240.58229@chino.kir.corp.google.com>



On Wed, 25 Apr 2018, David Rientjes wrote:

> On Wed, 25 Apr 2018, Mikulas Patocka wrote:
> 
> > You need to enable it on boot. Enabling it when the kernel starts to 
> > execute userspace code is already too late (because you would miss 
> > kvmalloc calls in the kernel boot path).
> 
> Is your motivation that since kvmalloc() never falls back to vmalloc() on 
> boot because fragmentation is not be an issue at boot that we should catch 
> bugs where it would matter if it had fallen back?  If we are worrying 
> about falling back to vmalloc before even initscripts have run I think we 
> have bigger problems.

The same driver can be compiled directly into the kernel or be loaded as a 
module. If the user (or the person preparing distro kernel) compiles the 
driver directly into the kernel, kvmalloc should be tested on that driver, 
because a different user or distribution can compile that driver as a 
module.

Mikulas

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-25 22:42 UTC (permalink / raw)
  To: James Bottomley
  Cc: eric.dumazet, mst, netdev, Randy Dunlap, linux-kernel,
	Matthew Wilcox, Michal Hocko, linux-mm, dm-devel, Vlastimil Babka,
	David Rientjes, Andrew Morton, virtualization, David Miller,
	edumazet
In-Reply-To: <1524694663.4100.21.camel@HansenPartnership.com>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2372 bytes --]



On Wed, 25 Apr 2018, James Bottomley wrote:

> On Wed, 2018-04-25 at 17:22 -0400, Mikulas Patocka wrote:
> > 
> > On Wed, 25 Apr 2018, David Rientjes wrote:
> > > 
> > > Do we really need the new config option?  This could just be
> > > manually  tunable via fault injection IIUC.
> > 
> > We do, because we want to enable it in RHEL and Fedora debugging
> > kernels, so that it will be tested by the users.
> > 
> > The users won't use some extra magic kernel options or debugfs files.
> 
> If it can be enabled via a tunable, then the distro can turn it on
> without the user having to do anything.

You need to enable it on boot. Enabling it when the kernel starts to 
execute userspace code is already too late (because you would miss 
kvmalloc calls in the kernel boot path).

These are files in the kernel-debug rpm package. Where would you put the 
extra kernel parameter to enable this feature? None of these files contain 
kernel parameters.

kernel-debug              /boot/.vmlinuz-3.10.0-693.21.1.el7.x86_64.debug.hmac
kernel-debug              /boot/System.map-3.10.0-693.21.1.el7.x86_64.debug
kernel-debug              /boot/config-3.10.0-693.21.1.el7.x86_64.debug
kernel-debug              /boot/initramfs-3.10.0-693.21.1.el7.x86_64.debug.img
kernel-debug              /boot/symvers-3.10.0-693.21.1.el7.x86_64.debug.gz
kernel-debug              /boot/vmlinuz-3.10.0-693.21.1.el7.x86_64.debug
kernel-debug              /etc/ld.so.conf.d/kernel-3.10.0-693.21.1.el7.x86_64.debug.conf
kernel-debug              /lib/modules/3.10.0-693.21.1.el7.x86_64.debug

> If you want to present the user with a different boot option, you can 
> (just have the tunable set on the command line), but being tunable 
> driven means that you don't have to choose that option, you could 
> automatically enable it under a range of circumstances.  I think most 
> sane distributions would want that flexibility.
> 
> Kconfig proliferation, conversely, is a bit of a nightmare from both
> the user and the tester's point of view, so we're trying to avoid it
> unless absolutely necessary.
> 
> James

I already offered that we don't need to introduce a new kernel option and 
we can bind this feature to any other kernel option, that is enabled in 
the debug kernel, for example CONFIG_DEBUG_SG. Michal said no and he said 
that he wants a new kernel option instead.

Mikulas

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Michael S. Tsirkin @ 2018-04-25 22:22 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Sridhar Samudrala, virtualization, Netdev, David Miller
In-Reply-To: <CADGSJ22WMxXEtR9QpRa9_ZqgmTbgWHZz2j4hsUnRJrM9zTsW4w@mail.gmail.com>

On Wed, Apr 25, 2018 at 02:38:57PM -0700, Siwei Liu wrote:
> On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
> >> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
> >> >> On Mon, 23 Apr 2018 20:24:56 +0300
> >> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> >>
> >> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
> >> >> > > > >
> >> >> > > > >I will NAK patches to change to common code for netvsc especially the
> >> >> > > > >three device model.  MS worked hard with distro vendors to support transparent
> >> >> > > > >mode, ans we really can't have a new model; or do backport.
> >> >> > > > >
> >> >> > > > >Plus, DPDK is now dependent on existing model.
> >> >> > > >
> >> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
> >> >> > >
> >> >> > > The network device model is a userspace API, and DPDK is a userspace application.
> >> >> >
> >> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> >> >> > AFAIK it's normally banging device registers directly.
> >> >> >
> >> >> > > You can't go breaking userspace even if you don't like the application.
> >> >> >
> >> >> > Could you please explain how is the proposed patchset breaking
> >> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> >> >> > API at all.
> >> >> >
> >> >>
> >> >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
> >> >> to look for Linux netvsc device and the paired VF device and setup the
> >> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
> >> >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
> >> >> VF device.
> >> >>
> >> >> So it depends on existing 2 device model. You can't go to a 3 device model
> >> >> or start hiding devices from userspace.
> >> >
> >> > Okay so how does the existing patch break that? IIUC does not go to
> >> > a 3 device model since netvsc calls failover_register directly.
> >> >
> >> >> Also, I am working on associating netvsc and VF device based on serial number
> >> >> rather than MAC address. The serial number is how Windows works now, and it makes
> >> >> sense for Linux and Windows to use the same mechanism if possible.
> >> >
> >> > Maybe we should support same for virtio ...
> >> > Which serial do you mean? From vpd?
> >> >
> >> > I guess you will want to keep supporting MAC for old hypervisors?
> >> >
> >> > It all seems like a reasonable thing to support in the generic core.
> >>
> >> That's the reason why I chose explicit identifier rather than rely on
> >> MAC address to bind/pair a device. MAC address can change. Even if it
> >> can't, malicious guest user can fake MAC address to skip binding.
> >>
> >> -Siwei
> >
> > Address should be sampled at device creation to prevent this
> > kind of hack. Not that it buys the malicious user much:
> > if you can poke at MAC addresses you probably already can
> > break networking.
> 
> I don't understand why poking at MAC address may potentially break
> networking.

Set a MAC address to match another device on the same LAN,
packets will stop reaching that MAC.

> Unlike VF, passthrough PCI endpoint device has its freedom
> to change the MAC address. Even on a VF setup it's not neccessarily
> always safe to assume the VF's MAC address cannot or shouldn't be
> changed. That depends on the specific need whether the host admin
> wants to restrict guest from changing the MAC address, although in
> most cases it's true.
> 
> I understand we can use the perm_addr to distinguish. But as said,
> this will pose limitation of flexible configuration where one can
> assign VFs with identical MAC address at all while each VF belongs to
> different PF and/or different subnet for e.g. load balancing.
> And
> furthermore, the QEMU device model never uses MAC address to be
> interpreted as an identifier, which requires to be unique per VM
> instance. Why we're introducing this inconsistency?
> 
> -Siwei

Because it addresses most of the issues and is simple.  That's already
much better than what we have now which is nothing unless guest
configures things manually.

I think ideally the infrastructure should suppport flexible matching of
NICs - netvsc is already reported to be moving to some kind of serial
address.


> >
> >
> >
> >
> >>
> >> >
> >> > --
> >> > MST

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Siwei Liu @ 2018-04-25 21:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Sridhar Samudrala, virtualization, Netdev, David Miller
In-Reply-To: <20180423230037-mutt-send-email-mst@kernel.org>

On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
>> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> >> On Mon, 23 Apr 2018 20:24:56 +0300
>> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >>
>> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> >> > > > >
>> >> > > > >I will NAK patches to change to common code for netvsc especially the
>> >> > > > >three device model.  MS worked hard with distro vendors to support transparent
>> >> > > > >mode, ans we really can't have a new model; or do backport.
>> >> > > > >
>> >> > > > >Plus, DPDK is now dependent on existing model.
>> >> > > >
>> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
>> >> > >
>> >> > > The network device model is a userspace API, and DPDK is a userspace application.
>> >> >
>> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
>> >> > AFAIK it's normally banging device registers directly.
>> >> >
>> >> > > You can't go breaking userspace even if you don't like the application.
>> >> >
>> >> > Could you please explain how is the proposed patchset breaking
>> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
>> >> > API at all.
>> >> >
>> >>
>> >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
>> >> to look for Linux netvsc device and the paired VF device and setup the
>> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
>> >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
>> >> VF device.
>> >>
>> >> So it depends on existing 2 device model. You can't go to a 3 device model
>> >> or start hiding devices from userspace.
>> >
>> > Okay so how does the existing patch break that? IIUC does not go to
>> > a 3 device model since netvsc calls failover_register directly.
>> >
>> >> Also, I am working on associating netvsc and VF device based on serial number
>> >> rather than MAC address. The serial number is how Windows works now, and it makes
>> >> sense for Linux and Windows to use the same mechanism if possible.
>> >
>> > Maybe we should support same for virtio ...
>> > Which serial do you mean? From vpd?
>> >
>> > I guess you will want to keep supporting MAC for old hypervisors?
>> >
>> > It all seems like a reasonable thing to support in the generic core.
>>
>> That's the reason why I chose explicit identifier rather than rely on
>> MAC address to bind/pair a device. MAC address can change. Even if it
>> can't, malicious guest user can fake MAC address to skip binding.
>>
>> -Siwei
>
> Address should be sampled at device creation to prevent this
> kind of hack. Not that it buys the malicious user much:
> if you can poke at MAC addresses you probably already can
> break networking.

I don't understand why poking at MAC address may potentially break
networking. Unlike VF, passthrough PCI endpoint device has its freedom
to change the MAC address. Even on a VF setup it's not neccessarily
always safe to assume the VF's MAC address cannot or shouldn't be
changed. That depends on the specific need whether the host admin
wants to restrict guest from changing the MAC address, although in
most cases it's true.

I understand we can use the perm_addr to distinguish. But as said,
this will pose limitation of flexible configuration where one can
assign VFs with identical MAC address at all while each VF belongs to
different PF and/or different subnet for e.g. load balancing. And
furthermore, the QEMU device model never uses MAC address to be
interpreted as an identifier, which requires to be unique per VM
instance. Why we're introducing this inconsistency?

-Siwei

>
>
>
>
>>
>> >
>> > --
>> > MST

^ permalink raw reply

* Re: [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-25 21:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: dm-devel, eric.dumazet, mst, netdev, Randy Dunlap, linux-kernel,
	Matthew Wilcox, Michal Hocko, linux-mm, edumazet, Andrew Morton,
	virtualization, David Miller, Vlastimil Babka
In-Reply-To: <alpine.DEB.2.21.1804251417470.166306@chino.kir.corp.google.com>



On Wed, 25 Apr 2018, David Rientjes wrote:

> On Wed, 25 Apr 2018, Mikulas Patocka wrote:
> 
> > From: Mikulas Patocka <mpatocka@redhat.com>
> > Subject: [PATCH] fault-injection: introduce kvmalloc fallback options
> > 
> > This patch introduces a fault-injection option "kvmalloc_fallback". This
> > option makes kvmalloc randomly fall back to vmalloc.
> > 
> > Unfortunately, some kernel code has bugs - it uses kvmalloc and then
> > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > code. This options helps to test for these bugs.
> > 
> > The patch introduces a config option FAIL_KVMALLOC_FALLBACK_PROBABILITY.
> > It can be enabled in distribution debug kernels, so that kvmalloc abuse
> > can be tested by the users. The default can be overridden with
> > "kvmalloc_fallback" parameter or in /sys/kernel/debug/kvmalloc_fallback/.
> > 
> 
> Do we really need the new config option?  This could just be manually 
> tunable via fault injection IIUC.

We do, because we want to enable it in RHEL and Fedora debugging kernels, 
so that it will be tested by the users.

The users won't use some extra magic kernel options or debugfs files.

Mikulas


> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > 
> > ---
> >  Documentation/fault-injection/fault-injection.txt |    7 +++++
> >  include/linux/fault-inject.h                      |    9 +++---
> >  kernel/futex.c                                    |    2 -
> >  lib/Kconfig.debug                                 |   15 +++++++++++
> >  mm/failslab.c                                     |    2 -
> >  mm/page_alloc.c                                   |    2 -
> >  mm/util.c                                         |   30 ++++++++++++++++++++++
> >  7 files changed, 60 insertions(+), 7 deletions(-)
> > 
> > Index: linux-2.6/Documentation/fault-injection/fault-injection.txt
> > ===================================================================
> > --- linux-2.6.orig/Documentation/fault-injection/fault-injection.txt	2018-04-16 21:08:34.000000000 +0200
> > +++ linux-2.6/Documentation/fault-injection/fault-injection.txt	2018-04-25 21:36:36.000000000 +0200
> > @@ -15,6 +15,12 @@ o fail_page_alloc
> >  
> >    injects page allocation failures. (alloc_pages(), get_free_pages(), ...)
> >  
> > +o kvmalloc_fallback
> > +
> > +  makes the function kvmalloc randomly fall back to vmalloc. This could be used
> > +  to detects bugs such as using DMA-API on the result of kvmalloc or freeing
> > +  the result of kvmalloc with free.
> > +
> >  o fail_futex
> >  
> >    injects futex deadlock and uaddr fault errors.
> > @@ -167,6 +173,7 @@ use the boot option:
> >  
> >  	failslab=
> >  	fail_page_alloc=
> > +	kvmalloc_fallback=
> >  	fail_make_request=
> >  	fail_futex=
> >  	mmc_core.fail_request=<interval>,<probability>,<space>,<times>
> > Index: linux-2.6/include/linux/fault-inject.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/fault-inject.h	2018-04-16 21:08:36.000000000 +0200
> > +++ linux-2.6/include/linux/fault-inject.h	2018-04-25 21:38:22.000000000 +0200
> > @@ -31,17 +31,18 @@ struct fault_attr {
> >  	struct dentry *dname;
> >  };
> >  
> > -#define FAULT_ATTR_INITIALIZER {					\
> > +#define FAULT_ATTR_INITIALIZER(p) {					\
> > +		.probability = (p),					\
> >  		.interval = 1,						\
> > -		.times = ATOMIC_INIT(1),				\
> > +		.times = ATOMIC_INIT((p) ? -1 : 1),			\
> > +		.verbose = (p) ? 0 : 2,					\
> >  		.require_end = ULONG_MAX,				\
> >  		.stacktrace_depth = 32,					\
> >  		.ratelimit_state = RATELIMIT_STATE_INIT_DISABLED,	\
> > -		.verbose = 2,						\
> >  		.dname = NULL,						\
> >  	}
> >  
> > -#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER
> > +#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER(0)
> >  int setup_fault_attr(struct fault_attr *attr, char *str);
> >  bool should_fail(struct fault_attr *attr, ssize_t size);
> >  
> > Index: linux-2.6/lib/Kconfig.debug
> > ===================================================================
> > --- linux-2.6.orig/lib/Kconfig.debug	2018-04-25 15:56:16.000000000 +0200
> > +++ linux-2.6/lib/Kconfig.debug	2018-04-25 21:39:45.000000000 +0200
> > @@ -1527,6 +1527,21 @@ config FAIL_PAGE_ALLOC
> >  	help
> >  	  Provide fault-injection capability for alloc_pages().
> >  
> > +config FAIL_KVMALLOC_FALLBACK_PROBABILITY
> > +	int "Default kvmalloc fallback probability"
> > +	depends on FAULT_INJECTION
> > +	range 0 100
> > +	default "0"
> > +	help
> > +	  This option will make kvmalloc randomly fall back to vmalloc.
> > +	  Normally, kvmalloc falls back to vmalloc only rarely, if memory
> > +	  is fragmented.
> > +
> > +	  This option helps to detect hard-to-reproduce driver bugs, for
> > +	  example using DMA API on the result of kvmalloc.
> > +
> > +	  The default may be overridden with the kvmalloc_fallback parameter.
> > +
> >  config FAIL_MAKE_REQUEST
> >  	bool "Fault-injection capability for disk IO"
> >  	depends on FAULT_INJECTION && BLOCK
> > Index: linux-2.6/mm/util.c
> > ===================================================================
> > --- linux-2.6.orig/mm/util.c	2018-04-25 15:48:39.000000000 +0200
> > +++ linux-2.6/mm/util.c	2018-04-25 21:43:31.000000000 +0200
> > @@ -14,6 +14,7 @@
> >  #include <linux/hugetlb.h>
> >  #include <linux/vmalloc.h>
> >  #include <linux/userfaultfd_k.h>
> > +#include <linux/fault-inject.h>
> >  
> >  #include <asm/sections.h>
> >  #include <linux/uaccess.h>
> > @@ -377,6 +378,29 @@ unsigned long vm_mmap(struct file *file,
> >  }
> >  EXPORT_SYMBOL(vm_mmap);
> >  
> > +#ifdef CONFIG_FAULT_INJECTION
> > +
> > +static struct fault_attr kvmalloc_fallback =
> > +	FAULT_ATTR_INITIALIZER(CONFIG_FAIL_KVMALLOC_FALLBACK_PROBABILITY);
> > +
> > +static int __init setup_kvmalloc_fallback(char *str)
> > +{
> > +	return setup_fault_attr(&kvmalloc_fallback, str);
> > +}
> > +
> > +__setup("kvmalloc_fallback=", setup_kvmalloc_fallback);
> > +
> > +#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
> > +static int __init kvmalloc_fallback_debugfs_init(void)
> > +{
> > +	fault_create_debugfs_attr("kvmalloc_fallback", NULL, &kvmalloc_fallback);
> > +	return 0;
> > +}
> > +late_initcall(kvmalloc_fallback_debugfs_init);
> > +#endif
> > +
> > +#endif
> > +
> >  /**
> >   * kvmalloc_node - attempt to allocate physically contiguous memory, but upon
> >   * failure, fall back to non-contiguous (vmalloc) allocation.
> > @@ -404,6 +428,11 @@ void *kvmalloc_node(size_t size, gfp_t f
> >  	 */
> >  	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> >  
> > +#ifdef CONFIG_FAULT_INJECTION
> > +	if (should_fail(&kvmalloc_fallback, size))
> > +		goto do_vmalloc;
> > +#endif
> > +
> >  	/*
> >  	 * We want to attempt a large physically contiguous block first because
> >  	 * it is less likely to fragment multiple larger blocks and therefore
> > @@ -427,6 +456,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> >  	if (ret || size <= PAGE_SIZE)
> >  		return ret;
> >  
> > +do_vmalloc: __maybe_unused
> >  	return __vmalloc_node_flags_caller(size, node, flags,
> >  			__builtin_return_address(0));
> >  }
> > Index: linux-2.6/kernel/futex.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/futex.c	2018-02-14 20:24:42.000000000 +0100
> > +++ linux-2.6/kernel/futex.c	2018-04-25 21:11:33.000000000 +0200
> > @@ -288,7 +288,7 @@ static struct {
> >  
> >  	bool ignore_private;
> >  } fail_futex = {
> > -	.attr = FAULT_ATTR_INITIALIZER,
> > +	.attr = FAULT_ATTR_INITIALIZER(0),
> >  	.ignore_private = false,
> >  };
> >  
> > Index: linux-2.6/mm/failslab.c
> > ===================================================================
> > --- linux-2.6.orig/mm/failslab.c	2018-04-16 21:08:36.000000000 +0200
> > +++ linux-2.6/mm/failslab.c	2018-04-25 21:11:40.000000000 +0200
> > @@ -9,7 +9,7 @@ static struct {
> >  	bool ignore_gfp_reclaim;
> >  	bool cache_filter;
> >  } failslab = {
> > -	.attr = FAULT_ATTR_INITIALIZER,
> > +	.attr = FAULT_ATTR_INITIALIZER(0),
> >  	.ignore_gfp_reclaim = true,
> >  	.cache_filter = false,
> >  };
> > Index: linux-2.6/mm/page_alloc.c
> > ===================================================================
> > --- linux-2.6.orig/mm/page_alloc.c	2018-04-16 21:08:36.000000000 +0200
> > +++ linux-2.6/mm/page_alloc.c	2018-04-25 21:11:47.000000000 +0200
> > @@ -3055,7 +3055,7 @@ static struct {
> >  	bool ignore_gfp_reclaim;
> >  	u32 min_order;
> >  } fail_page_alloc = {
> > -	.attr = FAULT_ATTR_INITIALIZER,
> > +	.attr = FAULT_ATTR_INITIALIZER(0),
> >  	.ignore_gfp_reclaim = true,
> >  	.ignore_gfp_highmem = true,
> >  	.min_order = 1,
> > 
> > 
> 

^ permalink raw reply

* Re: [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Randy Dunlap @ 2018-04-25 21:11 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	Michal Hocko, linux-mm, edumazet, Andrew Morton, virtualization,
	David Miller, Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804251656300.9428@file01.intranet.prod.int.rdu2.redhat.com>

On 04/25/2018 01:57 PM, Mikulas Patocka wrote:
> 
> 
> On Wed, 25 Apr 2018, Randy Dunlap wrote:
> 
>> On 04/25/2018 01:02 PM, Mikulas Patocka wrote:
>>>
>>>
>>> From: Mikulas Patocka <mpatocka@redhat.com>
>>> Subject: [PATCH v4] fault-injection: introduce kvmalloc fallback options
>>>
>>> This patch introduces a fault-injection option "kvmalloc_fallback". This
>>> option makes kvmalloc randomly fall back to vmalloc.
>>>
>>> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
>>
>>   Unfortunately,
> 
> OK - here I fixed the typos:
> 
> 
> From: Mikulas Patocka <mpatocka@redhat.com>
> Subject: [PATCH] fault-injection: introduce kvmalloc fallback options
> 
> This patch introduces a fault-injection option "kvmalloc_fallback". This
> option makes kvmalloc randomly fall back to vmalloc.
> 
> Unfortunately, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code. This options helps to test for these bugs.
> 
> The patch introduces a config option FAIL_KVMALLOC_FALLBACK_PROBABILITY.
> It can be enabled in distribution debug kernels, so that kvmalloc abuse
> can be tested by the users. The default can be overridden with
> "kvmalloc_fallback" parameter or in /sys/kernel/debug/kvmalloc_fallback/.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> 
> ---
>  Documentation/fault-injection/fault-injection.txt |    7 +++++
>  include/linux/fault-inject.h                      |    9 +++---
>  kernel/futex.c                                    |    2 -
>  lib/Kconfig.debug                                 |   15 +++++++++++
>  mm/failslab.c                                     |    2 -
>  mm/page_alloc.c                                   |    2 -
>  mm/util.c                                         |   30 ++++++++++++++++++++++
>  7 files changed, 60 insertions(+), 7 deletions(-)

Acked-by: Randy Dunlap <rdunlap@infradead.org> # Documentation and Kconfig only

thanks.
-- 
~Randy

^ permalink raw reply

* [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-25 20:57 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	Michal Hocko, linux-mm, edumazet, Andrew Morton, virtualization,
	David Miller, Vlastimil Babka
In-Reply-To: <1114eda5-9b1f-4db8-2090-556b4a37c532@infradead.org>



On Wed, 25 Apr 2018, Randy Dunlap wrote:

> On 04/25/2018 01:02 PM, Mikulas Patocka wrote:
> > 
> > 
> > From: Mikulas Patocka <mpatocka@redhat.com>
> > Subject: [PATCH v4] fault-injection: introduce kvmalloc fallback options
> > 
> > This patch introduces a fault-injection option "kvmalloc_fallback". This
> > option makes kvmalloc randomly fall back to vmalloc.
> > 
> > Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> 
>   Unfortunately,

OK - here I fixed the typos:


From: Mikulas Patocka <mpatocka@redhat.com>
Subject: [PATCH] fault-injection: introduce kvmalloc fallback options

This patch introduces a fault-injection option "kvmalloc_fallback". This
option makes kvmalloc randomly fall back to vmalloc.

Unfortunately, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code. This options helps to test for these bugs.

The patch introduces a config option FAIL_KVMALLOC_FALLBACK_PROBABILITY.
It can be enabled in distribution debug kernels, so that kvmalloc abuse
can be tested by the users. The default can be overridden with
"kvmalloc_fallback" parameter or in /sys/kernel/debug/kvmalloc_fallback/.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 Documentation/fault-injection/fault-injection.txt |    7 +++++
 include/linux/fault-inject.h                      |    9 +++---
 kernel/futex.c                                    |    2 -
 lib/Kconfig.debug                                 |   15 +++++++++++
 mm/failslab.c                                     |    2 -
 mm/page_alloc.c                                   |    2 -
 mm/util.c                                         |   30 ++++++++++++++++++++++
 7 files changed, 60 insertions(+), 7 deletions(-)

Index: linux-2.6/Documentation/fault-injection/fault-injection.txt
===================================================================
--- linux-2.6.orig/Documentation/fault-injection/fault-injection.txt	2018-04-16 21:08:34.000000000 +0200
+++ linux-2.6/Documentation/fault-injection/fault-injection.txt	2018-04-25 21:36:36.000000000 +0200
@@ -15,6 +15,12 @@ o fail_page_alloc
 
   injects page allocation failures. (alloc_pages(), get_free_pages(), ...)
 
+o kvmalloc_fallback
+
+  makes the function kvmalloc randomly fall back to vmalloc. This could be used
+  to detects bugs such as using DMA-API on the result of kvmalloc or freeing
+  the result of kvmalloc with free.
+
 o fail_futex
 
   injects futex deadlock and uaddr fault errors.
@@ -167,6 +173,7 @@ use the boot option:
 
 	failslab=
 	fail_page_alloc=
+	kvmalloc_fallback=
 	fail_make_request=
 	fail_futex=
 	mmc_core.fail_request=<interval>,<probability>,<space>,<times>
Index: linux-2.6/include/linux/fault-inject.h
===================================================================
--- linux-2.6.orig/include/linux/fault-inject.h	2018-04-16 21:08:36.000000000 +0200
+++ linux-2.6/include/linux/fault-inject.h	2018-04-25 21:38:22.000000000 +0200
@@ -31,17 +31,18 @@ struct fault_attr {
 	struct dentry *dname;
 };
 
-#define FAULT_ATTR_INITIALIZER {					\
+#define FAULT_ATTR_INITIALIZER(p) {					\
+		.probability = (p),					\
 		.interval = 1,						\
-		.times = ATOMIC_INIT(1),				\
+		.times = ATOMIC_INIT((p) ? -1 : 1),			\
+		.verbose = (p) ? 0 : 2,					\
 		.require_end = ULONG_MAX,				\
 		.stacktrace_depth = 32,					\
 		.ratelimit_state = RATELIMIT_STATE_INIT_DISABLED,	\
-		.verbose = 2,						\
 		.dname = NULL,						\
 	}
 
-#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER
+#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER(0)
 int setup_fault_attr(struct fault_attr *attr, char *str);
 bool should_fail(struct fault_attr *attr, ssize_t size);
 
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug	2018-04-25 15:56:16.000000000 +0200
+++ linux-2.6/lib/Kconfig.debug	2018-04-25 21:39:45.000000000 +0200
@@ -1527,6 +1527,21 @@ config FAIL_PAGE_ALLOC
 	help
 	  Provide fault-injection capability for alloc_pages().
 
+config FAIL_KVMALLOC_FALLBACK_PROBABILITY
+	int "Default kvmalloc fallback probability"
+	depends on FAULT_INJECTION
+	range 0 100
+	default "0"
+	help
+	  This option will make kvmalloc randomly fall back to vmalloc.
+	  Normally, kvmalloc falls back to vmalloc only rarely, if memory
+	  is fragmented.
+
+	  This option helps to detect hard-to-reproduce driver bugs, for
+	  example using DMA API on the result of kvmalloc.
+
+	  The default may be overridden with the kvmalloc_fallback parameter.
+
 config FAIL_MAKE_REQUEST
 	bool "Fault-injection capability for disk IO"
 	depends on FAULT_INJECTION && BLOCK
Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c	2018-04-25 15:48:39.000000000 +0200
+++ linux-2.6/mm/util.c	2018-04-25 21:43:31.000000000 +0200
@@ -14,6 +14,7 @@
 #include <linux/hugetlb.h>
 #include <linux/vmalloc.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/fault-inject.h>
 
 #include <asm/sections.h>
 #include <linux/uaccess.h>
@@ -377,6 +378,29 @@ unsigned long vm_mmap(struct file *file,
 }
 EXPORT_SYMBOL(vm_mmap);
 
+#ifdef CONFIG_FAULT_INJECTION
+
+static struct fault_attr kvmalloc_fallback =
+	FAULT_ATTR_INITIALIZER(CONFIG_FAIL_KVMALLOC_FALLBACK_PROBABILITY);
+
+static int __init setup_kvmalloc_fallback(char *str)
+{
+	return setup_fault_attr(&kvmalloc_fallback, str);
+}
+
+__setup("kvmalloc_fallback=", setup_kvmalloc_fallback);
+
+#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
+static int __init kvmalloc_fallback_debugfs_init(void)
+{
+	fault_create_debugfs_attr("kvmalloc_fallback", NULL, &kvmalloc_fallback);
+	return 0;
+}
+late_initcall(kvmalloc_fallback_debugfs_init);
+#endif
+
+#endif
+
 /**
  * kvmalloc_node - attempt to allocate physically contiguous memory, but upon
  * failure, fall back to non-contiguous (vmalloc) allocation.
@@ -404,6 +428,11 @@ void *kvmalloc_node(size_t size, gfp_t f
 	 */
 	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
 
+#ifdef CONFIG_FAULT_INJECTION
+	if (should_fail(&kvmalloc_fallback, size))
+		goto do_vmalloc;
+#endif
+
 	/*
 	 * We want to attempt a large physically contiguous block first because
 	 * it is less likely to fragment multiple larger blocks and therefore
@@ -427,6 +456,7 @@ void *kvmalloc_node(size_t size, gfp_t f
 	if (ret || size <= PAGE_SIZE)
 		return ret;
 
+do_vmalloc: __maybe_unused
 	return __vmalloc_node_flags_caller(size, node, flags,
 			__builtin_return_address(0));
 }
Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c	2018-02-14 20:24:42.000000000 +0100
+++ linux-2.6/kernel/futex.c	2018-04-25 21:11:33.000000000 +0200
@@ -288,7 +288,7 @@ static struct {
 
 	bool ignore_private;
 } fail_futex = {
-	.attr = FAULT_ATTR_INITIALIZER,
+	.attr = FAULT_ATTR_INITIALIZER(0),
 	.ignore_private = false,
 };
 
Index: linux-2.6/mm/failslab.c
===================================================================
--- linux-2.6.orig/mm/failslab.c	2018-04-16 21:08:36.000000000 +0200
+++ linux-2.6/mm/failslab.c	2018-04-25 21:11:40.000000000 +0200
@@ -9,7 +9,7 @@ static struct {
 	bool ignore_gfp_reclaim;
 	bool cache_filter;
 } failslab = {
-	.attr = FAULT_ATTR_INITIALIZER,
+	.attr = FAULT_ATTR_INITIALIZER(0),
 	.ignore_gfp_reclaim = true,
 	.cache_filter = false,
 };
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2018-04-16 21:08:36.000000000 +0200
+++ linux-2.6/mm/page_alloc.c	2018-04-25 21:11:47.000000000 +0200
@@ -3055,7 +3055,7 @@ static struct {
 	bool ignore_gfp_reclaim;
 	u32 min_order;
 } fail_page_alloc = {
-	.attr = FAULT_ATTR_INITIALIZER,
+	.attr = FAULT_ATTR_INITIALIZER(0),
 	.ignore_gfp_reclaim = true,
 	.ignore_gfp_highmem = true,
 	.min_order = 1,

^ permalink raw reply

* Re: [PATCH v4] fault-injection: introduce kvmalloc fallback options
From: Randy Dunlap @ 2018-04-25 20:20 UTC (permalink / raw)
  To: Mikulas Patocka, Michal Hocko
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804251556060.30569@file01.intranet.prod.int.rdu2.redhat.com>

On 04/25/2018 01:02 PM, Mikulas Patocka wrote:
> 
> 
> From: Mikulas Patocka <mpatocka@redhat.com>
> Subject: [PATCH v4] fault-injection: introduce kvmalloc fallback options
> 
> This patch introduces a fault-injection option "kvmalloc_fallback". This
> option makes kvmalloc randomly fall back to vmalloc.
> 
> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then

  Unfortunately,

> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code. This options helps to test for these bugs.
> 
> The patch introduces a config option FAIL_KVMALLOC_FALLBACK_PROBABILITY.
> It can be enabled in distribution debug kernels, so that kvmalloc abuse
> can be tested by the users. The default can be overriden with

                                                 overridden

> "kvmalloc_fallback" parameter or in /sys/kernel/debug/kvmalloc_fallback/.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> 
> ---
>  Documentation/fault-injection/fault-injection.txt |    7 +++++
>  include/linux/fault-inject.h                      |    9 +++---
>  kernel/futex.c                                    |    2 -
>  lib/Kconfig.debug                                 |   15 +++++++++++
>  mm/failslab.c                                     |    2 -
>  mm/page_alloc.c                                   |    2 -
>  mm/util.c                                         |   30 ++++++++++++++++++++++
>  7 files changed, 60 insertions(+), 7 deletions(-)
> 
> Index: linux-2.6/Documentation/fault-injection/fault-injection.txt
> ===================================================================
> --- linux-2.6.orig/Documentation/fault-injection/fault-injection.txt	2018-04-16 21:08:34.000000000 +0200
> +++ linux-2.6/Documentation/fault-injection/fault-injection.txt	2018-04-25 21:36:36.000000000 +0200
> @@ -15,6 +15,12 @@ o fail_page_alloc
>  
>    injects page allocation failures. (alloc_pages(), get_free_pages(), ...)
>  
> +o kvmalloc_faillback

     kvmalloc_fallback

> +
> +  makes the function kvmalloc randonly fall back to vmalloc. This could be used

                                 randomly

> +  to detects bugs such as using DMA-API on the result of kvmalloc or freeing
> +  the result of kvmalloc with free.
> +
>  o fail_futex
>  
>    injects futex deadlock and uaddr fault errors.
> @@ -167,6 +173,7 @@ use the boot option:
>  
>  	failslab=
>  	fail_page_alloc=
> +	kvmalloc_faillback=

	kvmalloc_fallback=

>  	fail_make_request=
>  	fail_futex=
>  	mmc_core.fail_request=<interval>,<probability>,<space>,<times>

> Index: linux-2.6/lib/Kconfig.debug
> ===================================================================
> --- linux-2.6.orig/lib/Kconfig.debug	2018-04-25 15:56:16.000000000 +0200
> +++ linux-2.6/lib/Kconfig.debug	2018-04-25 21:39:45.000000000 +0200
> @@ -1527,6 +1527,21 @@ config FAIL_PAGE_ALLOC
>  	help
>  	  Provide fault-injection capability for alloc_pages().
>  
> +config FAIL_KVMALLOC_FALLBACK_PROBABILITY
> +	int "Default kvmalloc fallback probability"
> +	depends on FAULT_INJECTION
> +	range 0 100
> +	default "0"
> +	help
> +	  This option will make kvmalloc randomly fall back to vmalloc.
> +	  Normally, kvmalloc falls back to vmalloc only rarely, if memory
> +	  is fragmented.
> +
> +	  This option helps to detect hard-to-reproduce driver bugs, for
> +	  example using DMA API on the result of kvmalloc.
> +
> +	  The default may be overriden with the kvmalloc_faillback parameter.

	                     overridden         kvmalloc_fallback

> +
>  config FAIL_MAKE_REQUEST
>  	bool "Fault-injection capability for disk IO"
>  	depends on FAULT_INJECTION && BLOCK

-- 
~Randy

^ permalink raw reply

* [PATCH v4] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-25 20:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <20180424173836.GR17484@dhcp22.suse.cz>



On Tue, 24 Apr 2018, Michal Hocko wrote:

> > > Wouldn't it be equally trivial to simply enable the fault injection? You
> > > would get additional failure paths testing as a bonus.
> > 
> > The RHEL and Fedora debugging kernels are compiled with fault injection. 
> > But the fault-injection framework will do nothing unless it is enabled by 
> > a kernel parameter or debugfs write.
> > 
> > Most users don't know about the fault injection kernel parameters or 
> > debugfs files and won't enabled it. We need a CONFIG_ option to enable it 
> > by default in the debugging kernels (and we could add a kernel parameter 
> > to override the default, fine-tune the fallback probability etc.)
> 
> If it is a real issue to install the debugging kernel with the required
> kernel parameter then I a config option for the default on makes sense
> to me.

Yes - the debug kernels use the same default kernel parameters as 
non-debug kernels and it is expected that all debug features are enabled 
by default.

Here I'm sending the patch using the fault-injection framework and the new 
option CONFIG_FAIL_KVMALLOC_FALLBACK_PROBABILITY.

Mikulas



From: Mikulas Patocka <mpatocka@redhat.com>
Subject: [PATCH v4] fault-injection: introduce kvmalloc fallback options

This patch introduces a fault-injection option "kvmalloc_fallback". This
option makes kvmalloc randomly fall back to vmalloc.

Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code. This options helps to test for these bugs.

The patch introduces a config option FAIL_KVMALLOC_FALLBACK_PROBABILITY.
It can be enabled in distribution debug kernels, so that kvmalloc abuse
can be tested by the users. The default can be overriden with
"kvmalloc_fallback" parameter or in /sys/kernel/debug/kvmalloc_fallback/.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 Documentation/fault-injection/fault-injection.txt |    7 +++++
 include/linux/fault-inject.h                      |    9 +++---
 kernel/futex.c                                    |    2 -
 lib/Kconfig.debug                                 |   15 +++++++++++
 mm/failslab.c                                     |    2 -
 mm/page_alloc.c                                   |    2 -
 mm/util.c                                         |   30 ++++++++++++++++++++++
 7 files changed, 60 insertions(+), 7 deletions(-)

Index: linux-2.6/Documentation/fault-injection/fault-injection.txt
===================================================================
--- linux-2.6.orig/Documentation/fault-injection/fault-injection.txt	2018-04-16 21:08:34.000000000 +0200
+++ linux-2.6/Documentation/fault-injection/fault-injection.txt	2018-04-25 21:36:36.000000000 +0200
@@ -15,6 +15,12 @@ o fail_page_alloc
 
   injects page allocation failures. (alloc_pages(), get_free_pages(), ...)
 
+o kvmalloc_faillback
+
+  makes the function kvmalloc randonly fall back to vmalloc. This could be used
+  to detects bugs such as using DMA-API on the result of kvmalloc or freeing
+  the result of kvmalloc with free.
+
 o fail_futex
 
   injects futex deadlock and uaddr fault errors.
@@ -167,6 +173,7 @@ use the boot option:
 
 	failslab=
 	fail_page_alloc=
+	kvmalloc_faillback=
 	fail_make_request=
 	fail_futex=
 	mmc_core.fail_request=<interval>,<probability>,<space>,<times>
Index: linux-2.6/include/linux/fault-inject.h
===================================================================
--- linux-2.6.orig/include/linux/fault-inject.h	2018-04-16 21:08:36.000000000 +0200
+++ linux-2.6/include/linux/fault-inject.h	2018-04-25 21:38:22.000000000 +0200
@@ -31,17 +31,18 @@ struct fault_attr {
 	struct dentry *dname;
 };
 
-#define FAULT_ATTR_INITIALIZER {					\
+#define FAULT_ATTR_INITIALIZER(p) {					\
+		.probability = (p),					\
 		.interval = 1,						\
-		.times = ATOMIC_INIT(1),				\
+		.times = ATOMIC_INIT((p) ? -1 : 1),			\
+		.verbose = (p) ? 0 : 2,					\
 		.require_end = ULONG_MAX,				\
 		.stacktrace_depth = 32,					\
 		.ratelimit_state = RATELIMIT_STATE_INIT_DISABLED,	\
-		.verbose = 2,						\
 		.dname = NULL,						\
 	}
 
-#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER
+#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER(0)
 int setup_fault_attr(struct fault_attr *attr, char *str);
 bool should_fail(struct fault_attr *attr, ssize_t size);
 
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug	2018-04-25 15:56:16.000000000 +0200
+++ linux-2.6/lib/Kconfig.debug	2018-04-25 21:39:45.000000000 +0200
@@ -1527,6 +1527,21 @@ config FAIL_PAGE_ALLOC
 	help
 	  Provide fault-injection capability for alloc_pages().
 
+config FAIL_KVMALLOC_FALLBACK_PROBABILITY
+	int "Default kvmalloc fallback probability"
+	depends on FAULT_INJECTION
+	range 0 100
+	default "0"
+	help
+	  This option will make kvmalloc randomly fall back to vmalloc.
+	  Normally, kvmalloc falls back to vmalloc only rarely, if memory
+	  is fragmented.
+
+	  This option helps to detect hard-to-reproduce driver bugs, for
+	  example using DMA API on the result of kvmalloc.
+
+	  The default may be overriden with the kvmalloc_faillback parameter.
+
 config FAIL_MAKE_REQUEST
 	bool "Fault-injection capability for disk IO"
 	depends on FAULT_INJECTION && BLOCK
Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c	2018-04-25 15:48:39.000000000 +0200
+++ linux-2.6/mm/util.c	2018-04-25 21:43:31.000000000 +0200
@@ -14,6 +14,7 @@
 #include <linux/hugetlb.h>
 #include <linux/vmalloc.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/fault-inject.h>
 
 #include <asm/sections.h>
 #include <linux/uaccess.h>
@@ -377,6 +378,29 @@ unsigned long vm_mmap(struct file *file,
 }
 EXPORT_SYMBOL(vm_mmap);
 
+#ifdef CONFIG_FAULT_INJECTION
+
+static struct fault_attr kvmalloc_fallback =
+	FAULT_ATTR_INITIALIZER(CONFIG_FAIL_KVMALLOC_FALLBACK_PROBABILITY);
+
+static int __init setup_kvmalloc_fallback(char *str)
+{
+	return setup_fault_attr(&kvmalloc_fallback, str);
+}
+
+__setup("kvmalloc_fallback=", setup_kvmalloc_fallback);
+
+#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
+static int __init kvmalloc_fallback_debugfs_init(void)
+{
+	fault_create_debugfs_attr("kvmalloc_fallback", NULL, &kvmalloc_fallback);
+	return 0;
+}
+late_initcall(kvmalloc_fallback_debugfs_init);
+#endif
+
+#endif
+
 /**
  * kvmalloc_node - attempt to allocate physically contiguous memory, but upon
  * failure, fall back to non-contiguous (vmalloc) allocation.
@@ -404,6 +428,11 @@ void *kvmalloc_node(size_t size, gfp_t f
 	 */
 	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
 
+#ifdef CONFIG_FAULT_INJECTION
+	if (should_fail(&kvmalloc_fallback, size))
+		goto do_vmalloc;
+#endif
+
 	/*
 	 * We want to attempt a large physically contiguous block first because
 	 * it is less likely to fragment multiple larger blocks and therefore
@@ -427,6 +456,7 @@ void *kvmalloc_node(size_t size, gfp_t f
 	if (ret || size <= PAGE_SIZE)
 		return ret;
 
+do_vmalloc: __maybe_unused
 	return __vmalloc_node_flags_caller(size, node, flags,
 			__builtin_return_address(0));
 }
Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c	2018-02-14 20:24:42.000000000 +0100
+++ linux-2.6/kernel/futex.c	2018-04-25 21:11:33.000000000 +0200
@@ -288,7 +288,7 @@ static struct {
 
 	bool ignore_private;
 } fail_futex = {
-	.attr = FAULT_ATTR_INITIALIZER,
+	.attr = FAULT_ATTR_INITIALIZER(0),
 	.ignore_private = false,
 };
 
Index: linux-2.6/mm/failslab.c
===================================================================
--- linux-2.6.orig/mm/failslab.c	2018-04-16 21:08:36.000000000 +0200
+++ linux-2.6/mm/failslab.c	2018-04-25 21:11:40.000000000 +0200
@@ -9,7 +9,7 @@ static struct {
 	bool ignore_gfp_reclaim;
 	bool cache_filter;
 } failslab = {
-	.attr = FAULT_ATTR_INITIALIZER,
+	.attr = FAULT_ATTR_INITIALIZER(0),
 	.ignore_gfp_reclaim = true,
 	.cache_filter = false,
 };
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2018-04-16 21:08:36.000000000 +0200
+++ linux-2.6/mm/page_alloc.c	2018-04-25 21:11:47.000000000 +0200
@@ -3055,7 +3055,7 @@ static struct {
 	bool ignore_gfp_reclaim;
 	u32 min_order;
 } fail_page_alloc = {
-	.attr = FAULT_ATTR_INITIALIZER,
+	.attr = FAULT_ATTR_INITIALIZER(0),
 	.ignore_gfp_reclaim = true,
 	.ignore_gfp_highmem = true,
 	.min_order = 1,

^ permalink raw reply

* [PATCH] fault-injection: reorder config entries
From: Mikulas Patocka @ 2018-04-25 20:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <20180424173836.GR17484@dhcp22.suse.cz>

This patch reorders Kconfig entries, so that menuconfig displays proper 
indentation.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 lib/Kconfig.debug |   36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug	2018-04-16 21:08:36.000000000 +0200
+++ linux-2.6/lib/Kconfig.debug	2018-04-25 15:56:16.000000000 +0200
@@ -1503,6 +1503,10 @@ config NETDEV_NOTIFIER_ERROR_INJECT
 
 	  If unsure, say N.
 
+config FUNCTION_ERROR_INJECTION
+	def_bool y
+	depends on HAVE_FUNCTION_ERROR_INJECTION && KPROBES
+
 config FAULT_INJECTION
 	bool "Fault-injection framework"
 	depends on DEBUG_KERNEL
@@ -1510,10 +1514,6 @@ config FAULT_INJECTION
 	  Provide fault-injection framework.
 	  For more details, see Documentation/fault-injection/.
 
-config FUNCTION_ERROR_INJECTION
-	def_bool y
-	depends on HAVE_FUNCTION_ERROR_INJECTION && KPROBES
-
 config FAILSLAB
 	bool "Fault-injection capability for kmalloc"
 	depends on FAULT_INJECTION
@@ -1544,16 +1544,6 @@ config FAIL_IO_TIMEOUT
 	  Only works with drivers that use the generic timeout handling,
 	  for others it wont do anything.
 
-config FAIL_MMC_REQUEST
-	bool "Fault-injection capability for MMC IO"
-	depends on FAULT_INJECTION_DEBUG_FS && MMC
-	help
-	  Provide fault-injection capability for MMC IO.
-	  This will make the mmc core return data errors. This is
-	  useful to test the error handling in the mmc block device
-	  and to test how the mmc host driver handles retries from
-	  the block device.
-
 config FAIL_FUTEX
 	bool "Fault-injection capability for futexes"
 	select DEBUG_FS
@@ -1561,6 +1551,12 @@ config FAIL_FUTEX
 	help
 	  Provide fault-injection capability for futexes.
 
+config FAULT_INJECTION_DEBUG_FS
+	bool "Debugfs entries for fault-injection capabilities"
+	depends on FAULT_INJECTION && SYSFS && DEBUG_FS
+	help
+	  Enable configuration of fault-injection capabilities via debugfs.
+
 config FAIL_FUNCTION
 	bool "Fault-injection capability for functions"
 	depends on FAULT_INJECTION_DEBUG_FS && FUNCTION_ERROR_INJECTION
@@ -1571,11 +1567,15 @@ config FAIL_FUNCTION
 	  an error value and have to handle it. This is useful to test the
 	  error handling in various subsystems.
 
-config FAULT_INJECTION_DEBUG_FS
-	bool "Debugfs entries for fault-injection capabilities"
-	depends on FAULT_INJECTION && SYSFS && DEBUG_FS
+config FAIL_MMC_REQUEST
+	bool "Fault-injection capability for MMC IO"
+	depends on FAULT_INJECTION_DEBUG_FS && MMC
 	help
-	  Enable configuration of fault-injection capabilities via debugfs.
+	  Provide fault-injection capability for MMC IO.
+	  This will make the mmc core return data errors. This is
+	  useful to test the error handling in the mmc block device
+	  and to test how the mmc host driver handles retries from
+	  the block device.
 
 config FAULT_INJECTION_STACKTRACE_FILTER
 	bool "stacktrace filter for fault-injection capabilities"

^ permalink raw reply

* Re: [PATCH] drm/virtio: fix mode_valid's return type
From: Daniel Vetter @ 2018-04-25  7:40 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: David Airlie, virtualization, linux-kernel, dri-devel
In-Reply-To: <20180424131524.2510-1-luc.vanoostenryck@gmail.com>

On Tue, Apr 24, 2018 at 03:15:24PM +0200, Luc Van Oostenryck wrote:
> The method struct drm_connector_helper_funcs::mode_valid is defined
> as returning an 'enum drm_mode_status' but the driver implementation
> for this method uses an 'int' for it.
> 
> Fix this by using 'enum drm_mode_status' in the driver too.
> 
> Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>

Ok, applied all the others that have been stuck in moderation or
something, except for any patch touching nouveau/radeon/amdgpu. Those need
ot go in through other trees.

All merged into drm-misc-next.

Thanks, Daniel
> ---
>  drivers/gpu/drm/virtio/virtgpu_display.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/virtio/virtgpu_display.c b/drivers/gpu/drm/virtio/virtgpu_display.c
> index 8cc8c34d6..a5edd8660 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_display.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_display.c
> @@ -208,7 +208,7 @@ static int virtio_gpu_conn_get_modes(struct drm_connector *connector)
>  	return count;
>  }
>  
> -static int virtio_gpu_conn_mode_valid(struct drm_connector *connector,
> +static enum drm_mode_status virtio_gpu_conn_mode_valid(struct drm_connector *connector,
>  				      struct drm_display_mode *mode)
>  {
>  	struct virtio_gpu_output *output =
> -- 
> 2.17.0
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply

* Re: [PATCH 1/6] virtio_console: don't tie bufs to a vq
From: Greg Kroah-Hartman @ 2018-04-25  5:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Arnd Bergmann, Amit Shah, linux-kernel, stable, virtualization
In-Reply-To: <20180424215318-mutt-send-email-mst@kernel.org>

On Tue, Apr 24, 2018 at 09:56:33PM +0300, Michael S. Tsirkin wrote:
> On Sat, Apr 21, 2018 at 09:30:05AM +0200, Greg Kroah-Hartman wrote:
> > On Fri, Apr 20, 2018 at 09:18:01PM +0300, Michael S. Tsirkin wrote:
> > > an allocated buffer doesn't need to be tied to a vq -
> > > only vq->vdev is ever used. Pass the function the
> > > just what it needs - the vdev.
> > > 
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > >  drivers/char/virtio_console.c | 14 +++++++-------
> > >  1 file changed, 7 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> > > index 468f061..3e56f32 100644
> > > --- a/drivers/char/virtio_console.c
> > > +++ b/drivers/char/virtio_console.c
> > > @@ -422,7 +422,7 @@ static void reclaim_dma_bufs(void)
> > >  	}
> > >  }
> > >  
> > > -static struct port_buffer *alloc_buf(struct virtqueue *vq, size_t buf_size,
> > > +static struct port_buffer *alloc_buf(struct virtio_device *vdev, size_t buf_size,
> > >  				     int pages)
> > >  {
> > >  	struct port_buffer *buf;
> > > @@ -445,16 +445,16 @@ static struct port_buffer *alloc_buf(struct virtqueue *vq, size_t buf_size,
> > >  		return buf;
> > >  	}
> > >  
> > > -	if (is_rproc_serial(vq->vdev)) {
> > > +	if (is_rproc_serial(vdev)) {
> > >  		/*
> > >  		 * Allocate DMA memory from ancestor. When a virtio
> > >  		 * device is created by remoteproc, the DMA memory is
> > >  		 * associated with the grandparent device:
> > >  		 * vdev => rproc => platform-dev.
> > >  		 */
> > > -		if (!vq->vdev->dev.parent || !vq->vdev->dev.parent->parent)
> > > +		if (!vdev->dev.parent || !vdev->dev.parent->parent)
> > >  			goto free_buf;
> > > -		buf->dev = vq->vdev->dev.parent->parent;
> > > +		buf->dev = vdev->dev.parent->parent;
> > >  
> > >  		/* Increase device refcnt to avoid freeing it */
> > >  		get_device(buf->dev);
> > > @@ -838,7 +838,7 @@ static ssize_t port_fops_write(struct file *filp, const char __user *ubuf,
> > >  
> > >  	count = min((size_t)(32 * 1024), count);
> > >  
> > > -	buf = alloc_buf(port->out_vq, count, 0);
> > > +	buf = alloc_buf(port->portdev->vdev, count, 0);
> > >  	if (!buf)
> > >  		return -ENOMEM;
> > >  
> > > @@ -957,7 +957,7 @@ static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe,
> > >  	if (ret < 0)
> > >  		goto error_out;
> > >  
> > > -	buf = alloc_buf(port->out_vq, 0, pipe->nrbufs);
> > > +	buf = alloc_buf(port->portdev->vdev, 0, pipe->nrbufs);
> > >  	if (!buf) {
> > >  		ret = -ENOMEM;
> > >  		goto error_out;
> > > @@ -1374,7 +1374,7 @@ static unsigned int fill_queue(struct virtqueue *vq, spinlock_t *lock)
> > >  
> > >  	nr_added_bufs = 0;
> > >  	do {
> > > -		buf = alloc_buf(vq, PAGE_SIZE, 0);
> > > +		buf = alloc_buf(vq->vdev, PAGE_SIZE, 0);
> > >  		if (!buf)
> > >  			break;
> > >  
> > > -- 
> > > MST
> > 
> > <formletter>
> > 
> > This is not the correct way to submit patches for inclusion in the
> > stable kernel tree.  Please read:
> >     https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > for how to do this properly.
> > 
> > </formletter>
> 
> 
> Thanks!
> I have some questions about this one:
> 
> Cc: <stable@vger.kernel.org> # 3.3.x: a1f84a3: sched: Check for idle
> Cc: <stable@vger.kernel.org> # 3.3.x: 1b9508f: sched: Rate-limit newidle
> Cc: <stable@vger.kernel.org> # 3.3.x: fd21073: sched: Fix affinity logic
> Cc: <stable@vger.kernel.org> # 3.3.x
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> 
> 1. what does the kernel version mean? can I omit it?

Did you read the document?, it explains that the version can be used to
say "this kernel version and newer"

> 2. so when I rebase to add the tag, this changes commit IDs for
>    following tags in the same tree, breaking their tags
>    in the process. Pretty annoying. Any idea how to do it better?

You only put tags there if you want me to pick up pre-requisite patches
that are already in Linus's tree.  If you have a patch series that all
needs to go into stable, just add the "cc: stable@" to the tags on all
of them and I'll pick them up in the correct order then.

hope this helps,

greg k-h

^ permalink raw reply

* [RFC v3 5/5] virtio_ring: enable packed ring
From: Tiwei Bie @ 2018-04-25  5:15 UTC (permalink / raw)
  To: mst, jasowang, virtualization, linux-kernel, netdev; +Cc: wexu
In-Reply-To: <20180425051550.24342-1-tiwei.bie@intel.com>

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/virtio/virtio_ring.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index b1039c2985b9..9a3d13e1e2ba 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1873,6 +1873,8 @@ void vring_transport_features(struct virtio_device *vdev)
 			break;
 		case VIRTIO_F_IOMMU_PLATFORM:
 			break;
+		case VIRTIO_F_RING_PACKED:
+			break;
 		default:
 			/* We don't understand this bit. */
 			__virtio_clear_bit(vdev, i);
-- 
2.11.0

^ permalink raw reply related

* [RFC v3 4/5] virtio_ring: add event idx support in packed ring
From: Tiwei Bie @ 2018-04-25  5:15 UTC (permalink / raw)
  To: mst, jasowang, virtualization, linux-kernel, netdev; +Cc: wexu
In-Reply-To: <20180425051550.24342-1-tiwei.bie@intel.com>

This commit introduces the event idx support in packed
ring. This feature is temporarily disabled, because the
implementation in this patch may not work as expected,
and some further discussions on the implementation are
needed, e.g. do we have to check the wrap counter when
checking whether a kick is needed?

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/virtio/virtio_ring.c | 53 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 49 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 0181e93897be..b1039c2985b9 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -986,7 +986,7 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
 static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	u16 flags;
+	u16 new, old, off_wrap, flags;
 	bool needs_kick;
 	u32 snapshot;
 
@@ -995,7 +995,12 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
 	 * suppressions. */
 	virtio_mb(vq->weak_barriers);
 
+	old = vq->next_avail_idx - vq->num_added;
+	new = vq->next_avail_idx;
+	vq->num_added = 0;
+
 	snapshot = *(u32 *)vq->vring_packed.device;
+	off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 0xffff);
 	flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 0x3;
 
 #ifdef DEBUG
@@ -1006,7 +1011,10 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
 	vq->last_add_time_valid = false;
 #endif
 
-	needs_kick = (flags != VRING_EVENT_F_DISABLE);
+	if (flags == VRING_EVENT_F_DESC)
+		needs_kick = vring_need_event(off_wrap & ~(1<<15), new, old);
+	else
+		needs_kick = (flags != VRING_EVENT_F_DISABLE);
 	END_USE(vq);
 	return needs_kick;
 }
@@ -1116,6 +1124,15 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
 	if (vq->last_used_idx >= vq->vring_packed.num)
 		vq->last_used_idx -= vq->vring_packed.num;
 
+	/* If we expect an interrupt for the next entry, tell host
+	 * by writing event index and flush out the write before
+	 * the read in the next get_buf call. */
+	if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
+		virtio_store_mb(vq->weak_barriers,
+				&vq->vring_packed.driver->off_wrap,
+				cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
+						(vq->wrap_counter << 15)));
+
 #ifdef DEBUG
 	vq->last_add_time_valid = false;
 #endif
@@ -1143,10 +1160,17 @@ static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
 
 	/* We optimistically turn back on interrupts, then check if there was
 	 * more to do. */
+	/* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+	 * either clear the flags bit or point the event index at the next
+	 * entry. Always update the event index to keep code simple. */
+
+	vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
+			vq->last_used_idx | (vq->wrap_counter << 15));
 
 	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
 		virtio_wmb(vq->weak_barriers);
-		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
+		vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
+						     VRING_EVENT_F_ENABLE;
 		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
 							vq->event_flags_shadow);
 	}
@@ -1172,15 +1196,34 @@ static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
 static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 bufs, used_idx, wrap_counter;
 
 	START_USE(vq);
 
 	/* We optimistically turn back on interrupts, then check if there was
 	 * more to do. */
+	/* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+	 * either clear the flags bit or point the event index at the next
+	 * entry. Always update the event index to keep code simple. */
+
+	/* TODO: tune this threshold */
+	bufs = (u16)(vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
+
+	used_idx = vq->last_used_idx + bufs;
+	wrap_counter = vq->wrap_counter;
+
+	if (used_idx >= vq->vring_packed.num) {
+		used_idx -= vq->vring_packed.num;
+		wrap_counter ^= 1;
+	}
+
+	vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
+			used_idx | (wrap_counter << 15));
 
 	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
 		virtio_wmb(vq->weak_barriers);
-		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
+		vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
+						     VRING_EVENT_F_ENABLE;
 		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
 							vq->event_flags_shadow);
 	}
@@ -1822,8 +1865,10 @@ void vring_transport_features(struct virtio_device *vdev)
 		switch (i) {
 		case VIRTIO_RING_F_INDIRECT_DESC:
 			break;
+#if 0
 		case VIRTIO_RING_F_EVENT_IDX:
 			break;
+#endif
 		case VIRTIO_F_VERSION_1:
 			break;
 		case VIRTIO_F_IOMMU_PLATFORM:
-- 
2.11.0

^ permalink raw reply related

* [RFC v3 3/5] virtio_ring: add packed ring support
From: Tiwei Bie @ 2018-04-25  5:15 UTC (permalink / raw)
  To: mst, jasowang, virtualization, linux-kernel, netdev; +Cc: wexu
In-Reply-To: <20180425051550.24342-1-tiwei.bie@intel.com>

This commit introduces the basic support (without EVENT_IDX)
for packed ring.

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/virtio/virtio_ring.c | 444 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 434 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index e164822ca66e..0181e93897be 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -58,7 +58,8 @@
 
 struct vring_desc_state {
 	void *data;			/* Data for callback. */
-	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
+	void *indir_desc;		/* Indirect descriptor, if any. */
+	int num;			/* Descriptor list length. */
 };
 
 struct vring_virtqueue {
@@ -142,6 +143,16 @@ struct vring_virtqueue {
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+static inline bool virtqueue_use_indirect(struct virtqueue *_vq,
+					  unsigned int total_sg)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	/* If the host supports indirect descriptor tables, and we have multiple
+	 * buffers, then go indirect. FIXME: tune this threshold */
+	return (vq->indirect && total_sg > 1 && vq->vq.num_free);
+}
+
 /*
  * Modern virtio devices have feature bits to specify whether they need a
  * quirk and bypass the IOMMU. If not there, just use the DMA API.
@@ -327,9 +338,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 
 	head = vq->free_head;
 
-	/* If the host supports indirect descriptor tables, and we have multiple
-	 * buffers, then go indirect. FIXME: tune this threshold */
-	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
+	if (virtqueue_use_indirect(_vq, total_sg))
 		desc = alloc_indirect_split(_vq, total_sg, gfp);
 	else {
 		desc = NULL;
@@ -741,6 +750,49 @@ static inline unsigned vring_size_packed(unsigned int num, unsigned long align)
 		& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
 }
 
+static void vring_unmap_one_packed(const struct vring_virtqueue *vq,
+				   struct vring_packed_desc *desc)
+{
+	u16 flags;
+
+	if (!vring_use_dma_api(vq->vq.vdev))
+		return;
+
+	flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+	if (flags & VRING_DESC_F_INDIRECT) {
+		dma_unmap_single(vring_dma_dev(vq),
+				 virtio64_to_cpu(vq->vq.vdev, desc->addr),
+				 virtio32_to_cpu(vq->vq.vdev, desc->len),
+				 (flags & VRING_DESC_F_WRITE) ?
+				 DMA_FROM_DEVICE : DMA_TO_DEVICE);
+	} else {
+		dma_unmap_page(vring_dma_dev(vq),
+			       virtio64_to_cpu(vq->vq.vdev, desc->addr),
+			       virtio32_to_cpu(vq->vq.vdev, desc->len),
+			       (flags & VRING_DESC_F_WRITE) ?
+			       DMA_FROM_DEVICE : DMA_TO_DEVICE);
+	}
+}
+
+static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
+						       unsigned int total_sg,
+						       gfp_t gfp)
+{
+	struct vring_packed_desc *desc;
+
+	/*
+	 * We require lowmem mappings for the descriptors because
+	 * otherwise virt_to_phys will give us bogus addresses in the
+	 * virtqueue.
+	 */
+	gfp &= ~__GFP_HIGHMEM;
+
+	desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
+
+	return desc;
+}
+
 static inline int virtqueue_add_packed(struct virtqueue *_vq,
 				       struct scatterlist *sgs[],
 				       unsigned int total_sg,
@@ -750,47 +802,419 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
 				       void *ctx,
 				       gfp_t gfp)
 {
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	struct vring_packed_desc *desc;
+	struct scatterlist *sg;
+	unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
+	__virtio16 uninitialized_var(head_flags), flags;
+	int head, wrap_counter;
+	bool indirect;
+
+	START_USE(vq);
+
+	BUG_ON(data == NULL);
+	BUG_ON(ctx && vq->indirect);
+
+	if (unlikely(vq->broken)) {
+		END_USE(vq);
+		return -EIO;
+	}
+
+#ifdef DEBUG
+	{
+		ktime_t now = ktime_get();
+
+		/* No kick or get, with .1 second between?  Warn. */
+		if (vq->last_add_time_valid)
+			WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
+					    > 100);
+		vq->last_add_time = now;
+		vq->last_add_time_valid = true;
+	}
+#endif
+
+	BUG_ON(total_sg == 0);
+
+	head = vq->next_avail_idx;
+	wrap_counter = vq->wrap_counter;
+
+	if (virtqueue_use_indirect(_vq, total_sg))
+		desc = alloc_indirect_packed(_vq, total_sg, gfp);
+	else {
+		desc = NULL;
+		WARN_ON_ONCE(total_sg > vq->vring_packed.num && !vq->indirect);
+	}
+
+	if (desc) {
+		/* Use a single buffer which doesn't continue */
+		indirect = true;
+		/* Set up rest to use this indirect table. */
+		i = 0;
+		descs_used = 1;
+	} else {
+		indirect = false;
+		desc = vq->vring_packed.desc;
+		i = head;
+		descs_used = total_sg;
+	}
+
+	if (vq->vq.num_free < descs_used) {
+		pr_debug("Can't add buf len %i - avail = %i\n",
+			 descs_used, vq->vq.num_free);
+		/* FIXME: for historical reasons, we force a notify here if
+		 * there are outgoing parts to the buffer.  Presumably the
+		 * host should service the ring ASAP. */
+		if (out_sgs)
+			vq->notify(&vq->vq);
+		if (indirect)
+			kfree(desc);
+		END_USE(vq);
+		return -ENOSPC;
+	}
+
+	for (n = 0; n < out_sgs + in_sgs; n++) {
+		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
+			dma_addr_t addr = vring_map_one_sg(vq, sg, n < out_sgs ?
+					       DMA_TO_DEVICE : DMA_FROM_DEVICE);
+			if (vring_mapping_error(vq, addr))
+				goto unmap_release;
+
+			flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT |
+					(n < out_sgs ? 0 : VRING_DESC_F_WRITE) |
+					VRING_DESC_F_AVAIL(vq->wrap_counter) |
+					VRING_DESC_F_USED(!vq->wrap_counter));
+			if (!indirect && i == head)
+				head_flags = flags;
+			else
+				desc[i].flags = flags;
+
+			desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
+			desc[i].len = cpu_to_virtio32(_vq->vdev, sg->length);
+			i++;
+			if (!indirect && i >= vq->vring_packed.num) {
+				i = 0;
+				vq->wrap_counter ^= 1;
+			}
+		}
+	}
+
+	prev = (i > 0 ? i : vq->vring_packed.num) - 1;
+	desc[prev].id = cpu_to_virtio32(_vq->vdev, head);
+
+	/* Last one doesn't continue. */
+	if (total_sg == 1)
+		head_flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
+	else
+		desc[prev].flags &= cpu_to_virtio16(_vq->vdev,
+						~VRING_DESC_F_NEXT);
+
+	if (indirect) {
+		/* Now that the indirect table is filled in, map it. */
+		dma_addr_t addr = vring_map_single(
+			vq, desc, total_sg * sizeof(struct vring_packed_desc),
+			DMA_TO_DEVICE);
+		if (vring_mapping_error(vq, addr))
+			goto unmap_release;
+
+		head_flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_INDIRECT |
+					     VRING_DESC_F_AVAIL(wrap_counter) |
+					     VRING_DESC_F_USED(!wrap_counter));
+		vq->vring_packed.desc[head].addr = cpu_to_virtio64(_vq->vdev,
+								   addr);
+		vq->vring_packed.desc[head].len = cpu_to_virtio32(_vq->vdev,
+				total_sg * sizeof(struct vring_packed_desc));
+		vq->vring_packed.desc[head].id = cpu_to_virtio32(_vq->vdev,
+								 head);
+	}
+
+	/* We're using some buffers from the free list. */
+	vq->vq.num_free -= descs_used;
+
+	/* Update free pointer */
+	if (indirect) {
+		n = head + 1;
+		if (n >= vq->vring_packed.num) {
+			n = 0;
+			vq->wrap_counter ^= 1;
+		}
+		vq->next_avail_idx = n;
+	} else
+		vq->next_avail_idx = i;
+
+	/* Store token and indirect buffer state. */
+	vq->desc_state[head].num = descs_used;
+	vq->desc_state[head].data = data;
+	if (indirect)
+		vq->desc_state[head].indir_desc = desc;
+	else
+		vq->desc_state[head].indir_desc = ctx;
+
+	/* A driver MUST NOT make the first descriptor in the list
+	 * available before all subsequent descriptors comprising
+	 * the list are made available. */
+	virtio_wmb(vq->weak_barriers);
+	vq->vring_packed.desc[head].flags = head_flags;
+	vq->num_added += descs_used;
+
+	pr_debug("Added buffer head %i to %p\n", head, vq);
+	END_USE(vq);
+
+	return 0;
+
+unmap_release:
+	err_idx = i;
+	i = head;
+
+	for (n = 0; n < total_sg; n++) {
+		if (i == err_idx)
+			break;
+		vring_unmap_one_packed(vq, &desc[i]);
+		i++;
+		if (!indirect && i >= vq->vring_packed.num)
+			i = 0;
+	}
+
+	vq->wrap_counter = wrap_counter;
+
+	if (indirect)
+		kfree(desc);
+
+	END_USE(vq);
 	return -EIO;
 }
 
 static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
 {
-	return false;
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 flags;
+	bool needs_kick;
+	u32 snapshot;
+
+	START_USE(vq);
+	/* We need to expose the new flags value before checking notification
+	 * suppressions. */
+	virtio_mb(vq->weak_barriers);
+
+	snapshot = *(u32 *)vq->vring_packed.device;
+	flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 0x3;
+
+#ifdef DEBUG
+	if (vq->last_add_time_valid) {
+		WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
+					      vq->last_add_time)) > 100);
+	}
+	vq->last_add_time_valid = false;
+#endif
+
+	needs_kick = (flags != VRING_EVENT_F_DISABLE);
+	END_USE(vq);
+	return needs_kick;
+}
+
+static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
+			      void **ctx)
+{
+	struct vring_packed_desc *desc;
+	unsigned int i, j;
+
+	/* Clear data ptr. */
+	vq->desc_state[head].data = NULL;
+
+	i = head;
+
+	for (j = 0; j < vq->desc_state[head].num; j++) {
+		desc = &vq->vring_packed.desc[i];
+		vring_unmap_one_packed(vq, desc);
+		i++;
+		if (i >= vq->vring_packed.num)
+			i = 0;
+	}
+
+	vq->vq.num_free += vq->desc_state[head].num;
+
+	if (vq->indirect) {
+		u32 len;
+
+		/* Free the indirect table, if any, now that it's unmapped. */
+		desc = vq->desc_state[head].indir_desc;
+		if (!desc)
+			return;
+
+		len = virtio32_to_cpu(vq->vq.vdev,
+				      vq->vring_packed.desc[head].len);
+
+		for (j = 0; j < len / sizeof(struct vring_packed_desc); j++)
+			vring_unmap_one_packed(vq, &desc[j]);
+
+		kfree(desc);
+		vq->desc_state[head].indir_desc = NULL;
+	} else if (ctx) {
+		*ctx = vq->desc_state[head].indir_desc;
+	}
 }
 
 static inline bool more_used_packed(const struct vring_virtqueue *vq)
 {
-	return false;
+	u16 last_used, flags;
+	bool avail, used;
+
+	if (vq->vq.num_free == vq->vring_packed.num)
+		return false;
+
+	last_used = vq->last_used_idx;
+	flags = virtio16_to_cpu(vq->vq.vdev,
+				vq->vring_packed.desc[last_used].flags);
+	avail = flags & VRING_DESC_F_AVAIL(1);
+	used = flags & VRING_DESC_F_USED(1);
+
+	return avail == used;
 }
 
 static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
 					  unsigned int *len,
 					  void **ctx)
 {
-	return NULL;
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	void *ret;
+	unsigned int i;
+	u16 last_used;
+
+	START_USE(vq);
+
+	if (unlikely(vq->broken)) {
+		END_USE(vq);
+		return NULL;
+	}
+
+	if (!more_used_packed(vq)) {
+		pr_debug("No more buffers in queue\n");
+		END_USE(vq);
+		return NULL;
+	}
+
+	/* Only get used elements after they have been exposed by host. */
+	virtio_rmb(vq->weak_barriers);
+
+	last_used = vq->last_used_idx;
+	i = virtio32_to_cpu(_vq->vdev, vq->vring_packed.desc[last_used].id);
+	*len = virtio32_to_cpu(_vq->vdev, vq->vring_packed.desc[last_used].len);
+
+	if (unlikely(i >= vq->vring_packed.num)) {
+		BAD_RING(vq, "id %u out of range\n", i);
+		return NULL;
+	}
+	if (unlikely(!vq->desc_state[i].data)) {
+		BAD_RING(vq, "id %u is not a head!\n", i);
+		return NULL;
+	}
+
+	/* detach_buf_packed clears data, so grab it now. */
+	ret = vq->desc_state[i].data;
+	detach_buf_packed(vq, i, ctx);
+
+	vq->last_used_idx += vq->desc_state[i].num;
+	if (vq->last_used_idx >= vq->vring_packed.num)
+		vq->last_used_idx -= vq->vring_packed.num;
+
+#ifdef DEBUG
+	vq->last_add_time_valid = false;
+#endif
+
+	END_USE(vq);
+	return ret;
 }
 
 static void virtqueue_disable_cb_packed(struct virtqueue *_vq)
 {
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	if (vq->event_flags_shadow != VRING_EVENT_F_DISABLE) {
+		vq->event_flags_shadow = VRING_EVENT_F_DISABLE;
+		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
+							vq->event_flags_shadow);
+	}
 }
 
 static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
 {
-	return 0;
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	START_USE(vq);
+
+	/* We optimistically turn back on interrupts, then check if there was
+	 * more to do. */
+
+	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
+		virtio_wmb(vq->weak_barriers);
+		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
+		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
+							vq->event_flags_shadow);
+	}
+
+	END_USE(vq);
+	return vq->last_used_idx;
 }
 
 static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
 {
-	return false;
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	bool avail, used;
+	u16 flags;
+
+	virtio_mb(vq->weak_barriers);
+	flags = virtio16_to_cpu(vq->vq.vdev,
+			vq->vring_packed.desc[last_used_idx].flags);
+	avail = flags & VRING_DESC_F_AVAIL(1);
+	used = flags & VRING_DESC_F_USED(1);
+	return avail == used;
 }
 
 static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
 {
-	return false;
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	START_USE(vq);
+
+	/* We optimistically turn back on interrupts, then check if there was
+	 * more to do. */
+
+	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
+		virtio_wmb(vq->weak_barriers);
+		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
+		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
+							vq->event_flags_shadow);
+	}
+
+	if (more_used_packed(vq)) {
+		END_USE(vq);
+		return false;
+	}
+
+	END_USE(vq);
+	return true;
 }
 
 static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
 {
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	unsigned int i;
+	void *buf;
+
+	START_USE(vq);
+
+	for (i = 0; i < vq->vring_packed.num; i++) {
+		if (!vq->desc_state[i].data)
+			continue;
+		/* detach_buf clears data, so grab it now. */
+		buf = vq->desc_state[i].data;
+		detach_buf_packed(vq, i, NULL);
+		END_USE(vq);
+		return buf;
+	}
+	/* That should have freed everything. */
+	BUG_ON(vq->vq.num_free != vq->vring_packed.num);
+
+	END_USE(vq);
 	return NULL;
 }
 
-- 
2.11.0

^ permalink raw reply related

* [RFC v3 2/5] virtio_ring: support creating packed ring
From: Tiwei Bie @ 2018-04-25  5:15 UTC (permalink / raw)
  To: mst, jasowang, virtualization, linux-kernel, netdev; +Cc: wexu
In-Reply-To: <20180425051550.24342-1-tiwei.bie@intel.com>

This commit introduces the support for creating packed ring.
All split ring specific functions are added _split suffix.
Some necessary stubs for packed ring are also added.

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/virtio/virtio_ring.c | 764 ++++++++++++++++++++++++++++---------------
 include/linux/virtio_ring.h  |   8 +-
 2 files changed, 513 insertions(+), 259 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..e164822ca66e 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -64,8 +64,8 @@ struct vring_desc_state {
 struct vring_virtqueue {
 	struct virtqueue vq;
 
-	/* Actual memory layout for this queue */
-	struct vring vring;
+	/* Is this a packed ring? */
+	bool packed;
 
 	/* Can we use weak barriers? */
 	bool weak_barriers;
@@ -79,19 +79,45 @@ struct vring_virtqueue {
 	/* Host publishes avail event idx */
 	bool event;
 
-	/* Head of free buffer list. */
-	unsigned int free_head;
 	/* Number we've added since last sync. */
 	unsigned int num_added;
 
 	/* Last used index we've seen. */
 	u16 last_used_idx;
 
-	/* Last written value to avail->flags */
-	u16 avail_flags_shadow;
+	union {
+		/* Available for split ring */
+		struct {
+			/* Actual memory layout for this queue. */
+			struct vring vring;
 
-	/* Last written value to avail->idx in guest byte order */
-	u16 avail_idx_shadow;
+			/* Head of free buffer list. */
+			unsigned int free_head;
+
+			/* Last written value to avail->flags */
+			u16 avail_flags_shadow;
+
+			/* Last written value to avail->idx in
+			 * guest byte order. */
+			u16 avail_idx_shadow;
+		};
+
+		/* Available for packed ring */
+		struct {
+			/* Actual memory layout for this queue. */
+			struct vring_packed vring_packed;
+
+			/* Driver ring wrap counter. */
+			u8 wrap_counter;
+
+			/* Index of the next avail descriptor. */
+			unsigned int next_avail_idx;
+
+			/* Last written value to driver->flags in
+			 * guest byte order. */
+			u16 event_flags_shadow;
+		};
+	};
 
 	/* How to notify other side. FIXME: commonalize hcalls! */
 	bool (*notify)(struct virtqueue *vq);
@@ -201,8 +227,17 @@ static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
 			      cpu_addr, size, direction);
 }
 
-static void vring_unmap_one(const struct vring_virtqueue *vq,
-			    struct vring_desc *desc)
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+			       dma_addr_t addr)
+{
+	if (!vring_use_dma_api(vq->vq.vdev))
+		return 0;
+
+	return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
+static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+				  struct vring_desc *desc)
 {
 	u16 flags;
 
@@ -226,17 +261,9 @@ static void vring_unmap_one(const struct vring_virtqueue *vq,
 	}
 }
 
-static int vring_mapping_error(const struct vring_virtqueue *vq,
-			       dma_addr_t addr)
-{
-	if (!vring_use_dma_api(vq->vq.vdev))
-		return 0;
-
-	return dma_mapping_error(vring_dma_dev(vq), addr);
-}
-
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
-					 unsigned int total_sg, gfp_t gfp)
+static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
+					       unsigned int total_sg,
+					       gfp_t gfp)
 {
 	struct vring_desc *desc;
 	unsigned int i;
@@ -257,14 +284,14 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 	return desc;
 }
 
-static inline int virtqueue_add(struct virtqueue *_vq,
-				struct scatterlist *sgs[],
-				unsigned int total_sg,
-				unsigned int out_sgs,
-				unsigned int in_sgs,
-				void *data,
-				void *ctx,
-				gfp_t gfp)
+static inline int virtqueue_add_split(struct virtqueue *_vq,
+				      struct scatterlist *sgs[],
+				      unsigned int total_sg,
+				      unsigned int out_sgs,
+				      unsigned int in_sgs,
+				      void *data,
+				      void *ctx,
+				      gfp_t gfp)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	struct scatterlist *sg;
@@ -303,7 +330,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
-		desc = alloc_indirect(_vq, total_sg, gfp);
+		desc = alloc_indirect_split(_vq, total_sg, gfp);
 	else {
 		desc = NULL;
 		WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
@@ -424,7 +451,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	for (n = 0; n < total_sg; n++) {
 		if (i == err_idx)
 			break;
-		vring_unmap_one(vq, &desc[i]);
+		vring_unmap_one_split(vq, &desc[i]);
 		i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next);
 	}
 
@@ -435,6 +462,355 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	return -EIO;
 }
 
+static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 new, old;
+	bool needs_kick;
+
+	START_USE(vq);
+	/* We need to expose available array entries before checking avail
+	 * event. */
+	virtio_mb(vq->weak_barriers);
+
+	old = vq->avail_idx_shadow - vq->num_added;
+	new = vq->avail_idx_shadow;
+	vq->num_added = 0;
+
+#ifdef DEBUG
+	if (vq->last_add_time_valid) {
+		WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
+					      vq->last_add_time)) > 100);
+	}
+	vq->last_add_time_valid = false;
+#endif
+
+	if (vq->event) {
+		needs_kick = vring_need_event(virtio16_to_cpu(_vq->vdev, vring_avail_event(&vq->vring)),
+					      new, old);
+	} else {
+		needs_kick = !(vq->vring.used->flags & cpu_to_virtio16(_vq->vdev, VRING_USED_F_NO_NOTIFY));
+	}
+	END_USE(vq);
+	return needs_kick;
+}
+
+static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
+			     void **ctx)
+{
+	unsigned int i, j;
+	__virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
+
+	/* Clear data ptr. */
+	vq->desc_state[head].data = NULL;
+
+	/* Put back on free list: unmap first-level descriptors and find end */
+	i = head;
+
+	while (vq->vring.desc[i].flags & nextflag) {
+		vring_unmap_one_split(vq, &vq->vring.desc[i]);
+		i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
+		vq->vq.num_free++;
+	}
+
+	vring_unmap_one_split(vq, &vq->vring.desc[i]);
+	vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
+	vq->free_head = head;
+
+	/* Plus final descriptor */
+	vq->vq.num_free++;
+
+	if (vq->indirect) {
+		struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
+		u32 len;
+
+		/* Free the indirect table, if any, now that it's unmapped. */
+		if (!indir_desc)
+			return;
+
+		len = virtio32_to_cpu(vq->vq.vdev, vq->vring.desc[head].len);
+
+		BUG_ON(!(vq->vring.desc[head].flags &
+			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
+		BUG_ON(len == 0 || len % sizeof(struct vring_desc));
+
+		for (j = 0; j < len / sizeof(struct vring_desc); j++)
+			vring_unmap_one_split(vq, &indir_desc[j]);
+
+		kfree(indir_desc);
+		vq->desc_state[head].indir_desc = NULL;
+	} else if (ctx) {
+		*ctx = vq->desc_state[head].indir_desc;
+	}
+}
+
+static inline bool more_used_split(const struct vring_virtqueue *vq)
+{
+	return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
+}
+
+static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
+					 unsigned int *len,
+					 void **ctx)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	void *ret;
+	unsigned int i;
+	u16 last_used;
+
+	START_USE(vq);
+
+	if (unlikely(vq->broken)) {
+		END_USE(vq);
+		return NULL;
+	}
+
+	if (!more_used_split(vq)) {
+		pr_debug("No more buffers in queue\n");
+		END_USE(vq);
+		return NULL;
+	}
+
+	/* Only get used array entries after they have been exposed by host. */
+	virtio_rmb(vq->weak_barriers);
+
+	last_used = (vq->last_used_idx & (vq->vring.num - 1));
+	i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
+	*len = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].len);
+
+	if (unlikely(i >= vq->vring.num)) {
+		BAD_RING(vq, "id %u out of range\n", i);
+		return NULL;
+	}
+	if (unlikely(!vq->desc_state[i].data)) {
+		BAD_RING(vq, "id %u is not a head!\n", i);
+		return NULL;
+	}
+
+	/* detach_buf_split clears data, so grab it now. */
+	ret = vq->desc_state[i].data;
+	detach_buf_split(vq, i, ctx);
+	vq->last_used_idx++;
+	/* If we expect an interrupt for the next entry, tell host
+	 * by writing event index and flush out the write before
+	 * the read in the next get_buf call. */
+	if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
+		virtio_store_mb(vq->weak_barriers,
+				&vring_used_event(&vq->vring),
+				cpu_to_virtio16(_vq->vdev, vq->last_used_idx));
+
+#ifdef DEBUG
+	vq->last_add_time_valid = false;
+#endif
+
+	END_USE(vq);
+	return ret;
+}
+
+static void virtqueue_disable_cb_split(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
+		vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
+		if (!vq->event)
+			vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
+	}
+}
+
+static unsigned virtqueue_enable_cb_prepare_split(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 last_used_idx;
+
+	START_USE(vq);
+
+	/* We optimistically turn back on interrupts, then check if there was
+	 * more to do. */
+	/* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
+	 * either clear the flags bit or point the event index at the next
+	 * entry. Always do both to keep code simple. */
+	if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
+		vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
+		if (!vq->event)
+			vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
+	}
+	vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, last_used_idx = vq->last_used_idx);
+	END_USE(vq);
+	return last_used_idx;
+}
+
+static bool virtqueue_poll_split(struct virtqueue *_vq, unsigned last_used_idx)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	virtio_mb(vq->weak_barriers);
+	return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
+}
+
+static bool virtqueue_enable_cb_delayed_split(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 bufs;
+
+	START_USE(vq);
+
+	/* We optimistically turn back on interrupts, then check if there was
+	 * more to do. */
+	/* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+	 * either clear the flags bit or point the event index at the next
+	 * entry. Always update the event index to keep code simple. */
+	if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
+		vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
+		if (!vq->event)
+			vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
+	}
+	/* TODO: tune this threshold */
+	bufs = (u16)(vq->avail_idx_shadow - vq->last_used_idx) * 3 / 4;
+
+	virtio_store_mb(vq->weak_barriers,
+			&vring_used_event(&vq->vring),
+			cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));
+
+	if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) - vq->last_used_idx) > bufs)) {
+		END_USE(vq);
+		return false;
+	}
+
+	END_USE(vq);
+	return true;
+}
+
+static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	unsigned int i;
+	void *buf;
+
+	START_USE(vq);
+
+	for (i = 0; i < vq->vring.num; i++) {
+		if (!vq->desc_state[i].data)
+			continue;
+		/* detach_buf clears data, so grab it now. */
+		buf = vq->desc_state[i].data;
+		detach_buf_split(vq, i, NULL);
+		vq->avail_idx_shadow--;
+		vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
+		END_USE(vq);
+		return buf;
+	}
+	/* That should have freed everything. */
+	BUG_ON(vq->vq.num_free != vq->vring.num);
+
+	END_USE(vq);
+	return NULL;
+}
+
+/*
+ * The layout for the packed ring is a continuous chunk of memory
+ * which looks like this.
+ *
+ * struct vring_packed {
+ *	// The actual descriptors (16 bytes each)
+ *	struct vring_packed_desc desc[num];
+ *
+ *	// Padding to the next align boundary.
+ *	char pad[];
+ *
+ *	// Driver Event Suppression
+ *	struct vring_packed_desc_event driver;
+ *
+ *	// Device Event Suppression
+ *	struct vring_packed_desc_event device;
+ * };
+ */
+static inline void vring_init_packed(struct vring_packed *vr, unsigned int num,
+				     void *p, unsigned long align)
+{
+	vr->num = num;
+	vr->desc = p;
+	vr->driver = (void *)(((uintptr_t)p + sizeof(struct vring_packed_desc)
+		* num + align - 1) & ~(align - 1));
+	vr->device = vr->driver + 1;
+}
+
+static inline unsigned vring_size_packed(unsigned int num, unsigned long align)
+{
+	return ((sizeof(struct vring_packed_desc) * num + align - 1)
+		& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
+}
+
+static inline int virtqueue_add_packed(struct virtqueue *_vq,
+				       struct scatterlist *sgs[],
+				       unsigned int total_sg,
+				       unsigned int out_sgs,
+				       unsigned int in_sgs,
+				       void *data,
+				       void *ctx,
+				       gfp_t gfp)
+{
+	return -EIO;
+}
+
+static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
+{
+	return false;
+}
+
+static inline bool more_used_packed(const struct vring_virtqueue *vq)
+{
+	return false;
+}
+
+static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
+					  unsigned int *len,
+					  void **ctx)
+{
+	return NULL;
+}
+
+static void virtqueue_disable_cb_packed(struct virtqueue *_vq)
+{
+}
+
+static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
+{
+	return 0;
+}
+
+static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
+{
+	return false;
+}
+
+static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
+{
+	return false;
+}
+
+static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
+{
+	return NULL;
+}
+
+static inline int virtqueue_add(struct virtqueue *_vq,
+				struct scatterlist *sgs[],
+				unsigned int total_sg,
+				unsigned int out_sgs,
+				unsigned int in_sgs,
+				void *data,
+				void *ctx,
+				gfp_t gfp)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	return vq->packed ? virtqueue_add_packed(_vq, sgs, total_sg, out_sgs,
+						 in_sgs, data, ctx, gfp) :
+			    virtqueue_add_split(_vq, sgs, total_sg, out_sgs,
+						in_sgs, data, ctx, gfp);
+}
+
 /**
  * virtqueue_add_sgs - expose buffers to other end
  * @vq: the struct virtqueue we're talking about.
@@ -551,34 +927,9 @@ EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
 bool virtqueue_kick_prepare(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	u16 new, old;
-	bool needs_kick;
 
-	START_USE(vq);
-	/* We need to expose available array entries before checking avail
-	 * event. */
-	virtio_mb(vq->weak_barriers);
-
-	old = vq->avail_idx_shadow - vq->num_added;
-	new = vq->avail_idx_shadow;
-	vq->num_added = 0;
-
-#ifdef DEBUG
-	if (vq->last_add_time_valid) {
-		WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
-					      vq->last_add_time)) > 100);
-	}
-	vq->last_add_time_valid = false;
-#endif
-
-	if (vq->event) {
-		needs_kick = vring_need_event(virtio16_to_cpu(_vq->vdev, vring_avail_event(&vq->vring)),
-					      new, old);
-	} else {
-		needs_kick = !(vq->vring.used->flags & cpu_to_virtio16(_vq->vdev, VRING_USED_F_NO_NOTIFY));
-	}
-	END_USE(vq);
-	return needs_kick;
+	return vq->packed ? virtqueue_kick_prepare_packed(_vq) :
+			    virtqueue_kick_prepare_split(_vq);
 }
 EXPORT_SYMBOL_GPL(virtqueue_kick_prepare);
 
@@ -626,58 +977,9 @@ bool virtqueue_kick(struct virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_kick);
 
-static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
-		       void **ctx)
-{
-	unsigned int i, j;
-	__virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
-
-	/* Clear data ptr. */
-	vq->desc_state[head].data = NULL;
-
-	/* Put back on free list: unmap first-level descriptors and find end */
-	i = head;
-
-	while (vq->vring.desc[i].flags & nextflag) {
-		vring_unmap_one(vq, &vq->vring.desc[i]);
-		i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
-		vq->vq.num_free++;
-	}
-
-	vring_unmap_one(vq, &vq->vring.desc[i]);
-	vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
-	vq->free_head = head;
-
-	/* Plus final descriptor */
-	vq->vq.num_free++;
-
-	if (vq->indirect) {
-		struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
-		u32 len;
-
-		/* Free the indirect table, if any, now that it's unmapped. */
-		if (!indir_desc)
-			return;
-
-		len = virtio32_to_cpu(vq->vq.vdev, vq->vring.desc[head].len);
-
-		BUG_ON(!(vq->vring.desc[head].flags &
-			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
-		BUG_ON(len == 0 || len % sizeof(struct vring_desc));
-
-		for (j = 0; j < len / sizeof(struct vring_desc); j++)
-			vring_unmap_one(vq, &indir_desc[j]);
-
-		kfree(indir_desc);
-		vq->desc_state[head].indir_desc = NULL;
-	} else if (ctx) {
-		*ctx = vq->desc_state[head].indir_desc;
-	}
-}
-
 static inline bool more_used(const struct vring_virtqueue *vq)
 {
-	return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
+	return vq->packed ? more_used_packed(vq) : more_used_split(vq);
 }
 
 /**
@@ -700,57 +1002,9 @@ void *virtqueue_get_buf_ctx(struct virtqueue *_vq, unsigned int *len,
 			    void **ctx)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	void *ret;
-	unsigned int i;
-	u16 last_used;
 
-	START_USE(vq);
-
-	if (unlikely(vq->broken)) {
-		END_USE(vq);
-		return NULL;
-	}
-
-	if (!more_used(vq)) {
-		pr_debug("No more buffers in queue\n");
-		END_USE(vq);
-		return NULL;
-	}
-
-	/* Only get used array entries after they have been exposed by host. */
-	virtio_rmb(vq->weak_barriers);
-
-	last_used = (vq->last_used_idx & (vq->vring.num - 1));
-	i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
-	*len = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].len);
-
-	if (unlikely(i >= vq->vring.num)) {
-		BAD_RING(vq, "id %u out of range\n", i);
-		return NULL;
-	}
-	if (unlikely(!vq->desc_state[i].data)) {
-		BAD_RING(vq, "id %u is not a head!\n", i);
-		return NULL;
-	}
-
-	/* detach_buf clears data, so grab it now. */
-	ret = vq->desc_state[i].data;
-	detach_buf(vq, i, ctx);
-	vq->last_used_idx++;
-	/* If we expect an interrupt for the next entry, tell host
-	 * by writing event index and flush out the write before
-	 * the read in the next get_buf call. */
-	if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
-		virtio_store_mb(vq->weak_barriers,
-				&vring_used_event(&vq->vring),
-				cpu_to_virtio16(_vq->vdev, vq->last_used_idx));
-
-#ifdef DEBUG
-	vq->last_add_time_valid = false;
-#endif
-
-	END_USE(vq);
-	return ret;
+	return vq->packed ? virtqueue_get_buf_ctx_packed(_vq, len, ctx) :
+			    virtqueue_get_buf_ctx_split(_vq, len, ctx);
 }
 EXPORT_SYMBOL_GPL(virtqueue_get_buf_ctx);
 
@@ -772,12 +1026,10 @@ void virtqueue_disable_cb(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 
-	if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
-		vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
-		if (!vq->event)
-			vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
-	}
-
+	if (vq->packed)
+		virtqueue_disable_cb_packed(_vq);
+	else
+		virtqueue_disable_cb_split(_vq);
 }
 EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
 
@@ -796,23 +1048,9 @@ EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
 unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	u16 last_used_idx;
 
-	START_USE(vq);
-
-	/* We optimistically turn back on interrupts, then check if there was
-	 * more to do. */
-	/* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
-	 * either clear the flags bit or point the event index at the next
-	 * entry. Always do both to keep code simple. */
-	if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
-		vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
-		if (!vq->event)
-			vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
-	}
-	vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, last_used_idx = vq->last_used_idx);
-	END_USE(vq);
-	return last_used_idx;
+	return vq->packed ? virtqueue_enable_cb_prepare_packed(_vq) :
+			    virtqueue_enable_cb_prepare_split(_vq);
 }
 EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
 
@@ -829,8 +1067,8 @@ bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 
-	virtio_mb(vq->weak_barriers);
-	return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
+	return vq->packed ? virtqueue_poll_packed(_vq, last_used_idx) :
+			    virtqueue_poll_split(_vq, last_used_idx);
 }
 EXPORT_SYMBOL_GPL(virtqueue_poll);
 
@@ -868,34 +1106,9 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
 bool virtqueue_enable_cb_delayed(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	u16 bufs;
 
-	START_USE(vq);
-
-	/* We optimistically turn back on interrupts, then check if there was
-	 * more to do. */
-	/* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
-	 * either clear the flags bit or point the event index at the next
-	 * entry. Always update the event index to keep code simple. */
-	if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
-		vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
-		if (!vq->event)
-			vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
-	}
-	/* TODO: tune this threshold */
-	bufs = (u16)(vq->avail_idx_shadow - vq->last_used_idx) * 3 / 4;
-
-	virtio_store_mb(vq->weak_barriers,
-			&vring_used_event(&vq->vring),
-			cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));
-
-	if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) - vq->last_used_idx) > bufs)) {
-		END_USE(vq);
-		return false;
-	}
-
-	END_USE(vq);
-	return true;
+	return vq->packed ? virtqueue_enable_cb_delayed_packed(_vq) :
+			    virtqueue_enable_cb_delayed_split(_vq);
 }
 EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
 
@@ -910,27 +1123,9 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
 void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	unsigned int i;
-	void *buf;
 
-	START_USE(vq);
-
-	for (i = 0; i < vq->vring.num; i++) {
-		if (!vq->desc_state[i].data)
-			continue;
-		/* detach_buf clears data, so grab it now. */
-		buf = vq->desc_state[i].data;
-		detach_buf(vq, i, NULL);
-		vq->avail_idx_shadow--;
-		vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
-		END_USE(vq);
-		return buf;
-	}
-	/* That should have freed everything. */
-	BUG_ON(vq->vq.num_free != vq->vring.num);
-
-	END_USE(vq);
-	return NULL;
+	return vq->packed ? virtqueue_detach_unused_buf_packed(_vq) :
+			    virtqueue_detach_unused_buf_split(_vq);
 }
 EXPORT_SYMBOL_GPL(virtqueue_detach_unused_buf);
 
@@ -955,7 +1150,8 @@ irqreturn_t vring_interrupt(int irq, void *_vq)
 EXPORT_SYMBOL_GPL(vring_interrupt);
 
 struct virtqueue *__vring_new_virtqueue(unsigned int index,
-					struct vring vring,
+					union vring_union vring,
+					bool packed,
 					struct virtio_device *vdev,
 					bool weak_barriers,
 					bool context,
@@ -963,19 +1159,20 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
 					void (*callback)(struct virtqueue *),
 					const char *name)
 {
-	unsigned int i;
+	unsigned int num, i;
 	struct vring_virtqueue *vq;
 
-	vq = kmalloc(sizeof(*vq) + vring.num * sizeof(struct vring_desc_state),
+	num = packed ? vring.vring_packed.num : vring.vring_split.num;
+
+	vq = kmalloc(sizeof(*vq) + num * sizeof(struct vring_desc_state),
 		     GFP_KERNEL);
 	if (!vq)
 		return NULL;
 
-	vq->vring = vring;
 	vq->vq.callback = callback;
 	vq->vq.vdev = vdev;
 	vq->vq.name = name;
-	vq->vq.num_free = vring.num;
+	vq->vq.num_free = num;
 	vq->vq.index = index;
 	vq->we_own_ring = false;
 	vq->queue_dma_addr = 0;
@@ -984,9 +1181,8 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
 	vq->weak_barriers = weak_barriers;
 	vq->broken = false;
 	vq->last_used_idx = 0;
-	vq->avail_flags_shadow = 0;
-	vq->avail_idx_shadow = 0;
 	vq->num_added = 0;
+	vq->packed = packed;
 	list_add_tail(&vq->vq.list, &vdev->vqs);
 #ifdef DEBUG
 	vq->in_use = false;
@@ -997,18 +1193,37 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
 		!context;
 	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
 
+	if (vq->packed) {
+		vq->vring_packed = vring.vring_packed;
+		vq->next_avail_idx = 0;
+		vq->wrap_counter = 1;
+		vq->event_flags_shadow = 0;
+	} else {
+		vq->vring = vring.vring_split;
+		vq->avail_flags_shadow = 0;
+		vq->avail_idx_shadow = 0;
+
+		/* Put everything in free lists. */
+		vq->free_head = 0;
+		for (i = 0; i < num-1; i++)
+			vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
+	}
+
 	/* No callback?  Tell other side not to bother us. */
 	if (!callback) {
-		vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
-		if (!vq->event)
-			vq->vring.avail->flags = cpu_to_virtio16(vdev, vq->avail_flags_shadow);
+		if (packed) {
+			vq->event_flags_shadow = VRING_EVENT_F_DISABLE;
+			vq->vring_packed.driver->flags = cpu_to_virtio16(vdev,
+						vq->event_flags_shadow);
+		} else {
+			vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
+			if (!vq->event)
+				vq->vring.avail->flags = cpu_to_virtio16(vdev,
+						vq->avail_flags_shadow);
+		}
 	}
 
-	/* Put everything in free lists. */
-	vq->free_head = 0;
-	for (i = 0; i < vring.num-1; i++)
-		vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
-	memset(vq->desc_state, 0, vring.num * sizeof(struct vring_desc_state));
+	memset(vq->desc_state, 0, num * sizeof(struct vring_desc_state));
 
 	return &vq->vq;
 }
@@ -1056,6 +1271,12 @@ static void vring_free_queue(struct virtio_device *vdev, size_t size,
 	}
 }
 
+static inline int
+__vring_size(unsigned int num, unsigned long align, bool packed)
+{
+	return packed ? vring_size_packed(num, align) : vring_size(num, align);
+}
+
 struct virtqueue *vring_create_virtqueue(
 	unsigned int index,
 	unsigned int num,
@@ -1072,7 +1293,8 @@ struct virtqueue *vring_create_virtqueue(
 	void *queue = NULL;
 	dma_addr_t dma_addr;
 	size_t queue_size_in_bytes;
-	struct vring vring;
+	union vring_union vring;
+	bool packed;
 
 	/* We assume num is a power of 2. */
 	if (num & (num - 1)) {
@@ -1080,9 +1302,13 @@ struct virtqueue *vring_create_virtqueue(
 		return NULL;
 	}
 
+	packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
+
 	/* TODO: allocate each queue chunk individually */
-	for (; num && vring_size(num, vring_align) > PAGE_SIZE; num /= 2) {
-		queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
+	for (; num && __vring_size(num, vring_align, packed) > PAGE_SIZE;
+			num /= 2) {
+		queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
+							     packed),
 					  &dma_addr,
 					  GFP_KERNEL|__GFP_NOWARN|__GFP_ZERO);
 		if (queue)
@@ -1094,17 +1320,21 @@ struct virtqueue *vring_create_virtqueue(
 
 	if (!queue) {
 		/* Try to get a single page. You are my only hope! */
-		queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
+		queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
+							     packed),
 					  &dma_addr, GFP_KERNEL|__GFP_ZERO);
 	}
 	if (!queue)
 		return NULL;
 
-	queue_size_in_bytes = vring_size(num, vring_align);
-	vring_init(&vring, num, queue, vring_align);
+	queue_size_in_bytes = __vring_size(num, vring_align, packed);
+	if (packed)
+		vring_init_packed(&vring.vring_packed, num, queue, vring_align);
+	else
+		vring_init(&vring.vring_split, num, queue, vring_align);
 
-	vq = __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
-				   notify, callback, name);
+	vq = __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
+				   context, notify, callback, name);
 	if (!vq) {
 		vring_free_queue(vdev, queue_size_in_bytes, queue,
 				 dma_addr);
@@ -1130,10 +1360,17 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 				      void (*callback)(struct virtqueue *vq),
 				      const char *name)
 {
-	struct vring vring;
-	vring_init(&vring, num, pages, vring_align);
-	return __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
-				     notify, callback, name);
+	union vring_union vring;
+	bool packed;
+
+	packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
+	if (packed)
+		vring_init_packed(&vring.vring_packed, num, pages, vring_align);
+	else
+		vring_init(&vring.vring_split, num, pages, vring_align);
+
+	return __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
+				     context, notify, callback, name);
 }
 EXPORT_SYMBOL_GPL(vring_new_virtqueue);
 
@@ -1143,7 +1380,9 @@ void vring_del_virtqueue(struct virtqueue *_vq)
 
 	if (vq->we_own_ring) {
 		vring_free_queue(vq->vq.vdev, vq->queue_size_in_bytes,
-				 vq->vring.desc, vq->queue_dma_addr);
+				 vq->packed ? (void *)vq->vring_packed.desc :
+					      (void *)vq->vring.desc,
+				 vq->queue_dma_addr);
 	}
 	list_del(&_vq->list);
 	kfree(vq);
@@ -1185,7 +1424,7 @@ unsigned int virtqueue_get_vring_size(struct virtqueue *_vq)
 
 	struct vring_virtqueue *vq = to_vvq(_vq);
 
-	return vq->vring.num;
+	return vq->packed ? vq->vring_packed.num : vq->vring.num;
 }
 EXPORT_SYMBOL_GPL(virtqueue_get_vring_size);
 
@@ -1228,6 +1467,10 @@ dma_addr_t virtqueue_get_avail_addr(struct virtqueue *_vq)
 
 	BUG_ON(!vq->we_own_ring);
 
+	if (vq->packed)
+		return vq->queue_dma_addr + ((char *)vq->vring_packed.driver -
+				(char *)vq->vring_packed.desc);
+
 	return vq->queue_dma_addr +
 		((char *)vq->vring.avail - (char *)vq->vring.desc);
 }
@@ -1239,11 +1482,16 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *_vq)
 
 	BUG_ON(!vq->we_own_ring);
 
+	if (vq->packed)
+		return vq->queue_dma_addr + ((char *)vq->vring_packed.device -
+				(char *)vq->vring_packed.desc);
+
 	return vq->queue_dma_addr +
 		((char *)vq->vring.used - (char *)vq->vring.desc);
 }
 EXPORT_SYMBOL_GPL(virtqueue_get_used_addr);
 
+/* Only available for split ring */
 const struct vring *virtqueue_get_vring(struct virtqueue *vq)
 {
 	return &to_vvq(vq)->vring;
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index bbf32524ab27..a0075894ad16 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -60,6 +60,11 @@ static inline void virtio_store_mb(bool weak_barriers,
 struct virtio_device;
 struct virtqueue;
 
+union vring_union {
+	struct vring vring_split;
+	struct vring_packed vring_packed;
+};
+
 /*
  * Creates a virtqueue and allocates the descriptor ring.  If
  * may_reduce_num is set, then this may allocate a smaller ring than
@@ -79,7 +84,8 @@ struct virtqueue *vring_create_virtqueue(unsigned int index,
 
 /* Creates a virtqueue with a custom layout. */
 struct virtqueue *__vring_new_virtqueue(unsigned int index,
-					struct vring vring,
+					union vring_union vring,
+					bool packed,
 					struct virtio_device *vdev,
 					bool weak_barriers,
 					bool ctx,
-- 
2.11.0

^ permalink raw reply related

* [RFC v3 1/5] virtio: add packed ring definitions
From: Tiwei Bie @ 2018-04-25  5:15 UTC (permalink / raw)
  To: mst, jasowang, virtualization, linux-kernel, netdev; +Cc: wexu
In-Reply-To: <20180425051550.24342-1-tiwei.bie@intel.com>

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 include/uapi/linux/virtio_config.h | 12 +++++++++++-
 include/uapi/linux/virtio_ring.h   | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 308e2096291f..a6e392325e3a 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -49,7 +49,7 @@
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START	28
-#define VIRTIO_TRANSPORT_F_END		34
+#define VIRTIO_TRANSPORT_F_END		36
 
 #ifndef VIRTIO_CONFIG_NO_LEGACY
 /* Do we get callbacks when the ring is completely used, even if we've
@@ -71,4 +71,14 @@
  * this is for compatibility with legacy systems.
  */
 #define VIRTIO_F_IOMMU_PLATFORM		33
+
+/* This feature indicates support for the packed virtqueue layout. */
+#define VIRTIO_F_RING_PACKED		34
+
+/*
+ * This feature indicates that all buffers are used by the device
+ * in the same order in which they have been made available.
+ */
+#define VIRTIO_F_IN_ORDER		35
+
 #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5faa989b..3932cb80c347 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -44,6 +44,9 @@
 /* This means the buffer contains a list of buffer descriptors. */
 #define VRING_DESC_F_INDIRECT	4
 
+#define VRING_DESC_F_AVAIL(b)	((b) << 7)
+#define VRING_DESC_F_USED(b)	((b) << 15)
+
 /* The Host uses this in used->flags to advise the Guest: don't kick me when
  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
  * will still kick if it's out of buffers. */
@@ -53,6 +56,10 @@
  * optimization.  */
 #define VRING_AVAIL_F_NO_INTERRUPT	1
 
+#define VRING_EVENT_F_ENABLE	0x0
+#define VRING_EVENT_F_DISABLE	0x1
+#define VRING_EVENT_F_DESC	0x2
+
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC	28
 
@@ -171,4 +178,33 @@ static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
 	return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
 }
 
+struct vring_packed_desc_event {
+	/* __virtio16 off  : 15; // Descriptor Event Offset
+	 * __virtio16 wrap : 1;  // Descriptor Event Wrap Counter */
+	__virtio16 off_wrap;
+	/* __virtio16 flags : 2; // Descriptor Event Flags */
+	__virtio16 flags;
+};
+
+struct vring_packed_desc {
+	/* Buffer Address. */
+	__virtio64 addr;
+	/* Buffer Length. */
+	__virtio32 len;
+	/* Buffer ID. */
+	__virtio16 id;
+	/* The flags depending on descriptor type. */
+	__virtio16 flags;
+};
+
+struct vring_packed {
+	unsigned int num;
+
+	struct vring_packed_desc *desc;
+
+	struct vring_packed_desc_event *driver;
+
+	struct vring_packed_desc_event *device;
+};
+
 #endif /* _UAPI_LINUX_VIRTIO_RING_H */
-- 
2.11.0

^ permalink raw reply related

* [RFC v3 0/5] virtio: support packed ring
From: Tiwei Bie @ 2018-04-25  5:15 UTC (permalink / raw)
  To: mst, jasowang, virtualization, linux-kernel, netdev; +Cc: wexu

Hello everyone,

This RFC implements packed ring support in virtio driver.

Some simple functional tests have been done with Jason's
packed ring implementation in vhost:

https://lkml.org/lkml/2018/4/23/12

Both of ping and netperf worked as expected (with EVENT_IDX
disabled). But there are below known issues:

1. Reloading the guest driver will break the Tx/Rx;
2. Zeroing the flags when detaching a used desc will
   break the guest -> host path.

Some simple functional tests have also been done with
Wei's packed ring implementation in QEMU:

http://lists.nongnu.org/archive/html/qemu-devel/2018-04/msg00342.html

Both of ping and netperf worked as expected (with EVENT_IDX
disabled). Reloading the guest driver also worked as expected.

TODO:
- Refinements (for code and commit log) and bug fixes;
- Discuss/fix/test EVENT_IDX support;
- Test devices other than net;

RFC v2 -> RFC v3:
- Split into small patches (Jason);
- Add helper virtqueue_use_indirect() (Jason);
- Just set id for the last descriptor of a list (Jason);
- Calculate the prev in virtqueue_add_packed() (Jason);
- Fix/improve desc suppression code (Jason/MST);
- Refine the code layout for XXX_split/packed and wrappers (MST);
- Fix the comments and API in uapi (MST);
- Remove the BUG_ON() for indirect (Jason);
- Some other refinements and bug fixes;

RFC v1 -> RFC v2:
- Add indirect descriptor support - compile test only;
- Add event suppression supprt - compile test only;
- Move vring_packed_init() out of uapi (Jason, MST);
- Merge two loops into one in virtqueue_add_packed() (Jason);
- Split vring_unmap_one() for packed ring and split ring (Jason);
- Avoid using '%' operator (Jason);
- Rename free_head -> next_avail_idx (Jason);
- Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
- Some other refinements and bug fixes;

Thanks!

Tiwei Bie (5):
  virtio: add packed ring definitions
  virtio_ring: support creating packed ring
  virtio_ring: add packed ring support
  virtio_ring: add event idx support in packed ring
  virtio_ring: enable packed ring

 drivers/virtio/virtio_ring.c       | 1271 ++++++++++++++++++++++++++++--------
 include/linux/virtio_ring.h        |    8 +-
 include/uapi/linux/virtio_config.h |   12 +-
 include/uapi/linux/virtio_ring.h   |   36 +
 4 files changed, 1049 insertions(+), 278 deletions(-)

-- 
2.11.0

^ permalink raw reply

* Re: [PATCH 1/6] virtio_console: don't tie bufs to a vq
From: Michael S. Tsirkin @ 2018-04-24 18:56 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Arnd Bergmann, Amit Shah, linux-kernel, stable, virtualization
In-Reply-To: <20180421073005.GA3744@kroah.com>

On Sat, Apr 21, 2018 at 09:30:05AM +0200, Greg Kroah-Hartman wrote:
> On Fri, Apr 20, 2018 at 09:18:01PM +0300, Michael S. Tsirkin wrote:
> > an allocated buffer doesn't need to be tied to a vq -
> > only vq->vdev is ever used. Pass the function the
> > just what it needs - the vdev.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  drivers/char/virtio_console.c | 14 +++++++-------
> >  1 file changed, 7 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> > index 468f061..3e56f32 100644
> > --- a/drivers/char/virtio_console.c
> > +++ b/drivers/char/virtio_console.c
> > @@ -422,7 +422,7 @@ static void reclaim_dma_bufs(void)
> >  	}
> >  }
> >  
> > -static struct port_buffer *alloc_buf(struct virtqueue *vq, size_t buf_size,
> > +static struct port_buffer *alloc_buf(struct virtio_device *vdev, size_t buf_size,
> >  				     int pages)
> >  {
> >  	struct port_buffer *buf;
> > @@ -445,16 +445,16 @@ static struct port_buffer *alloc_buf(struct virtqueue *vq, size_t buf_size,
> >  		return buf;
> >  	}
> >  
> > -	if (is_rproc_serial(vq->vdev)) {
> > +	if (is_rproc_serial(vdev)) {
> >  		/*
> >  		 * Allocate DMA memory from ancestor. When a virtio
> >  		 * device is created by remoteproc, the DMA memory is
> >  		 * associated with the grandparent device:
> >  		 * vdev => rproc => platform-dev.
> >  		 */
> > -		if (!vq->vdev->dev.parent || !vq->vdev->dev.parent->parent)
> > +		if (!vdev->dev.parent || !vdev->dev.parent->parent)
> >  			goto free_buf;
> > -		buf->dev = vq->vdev->dev.parent->parent;
> > +		buf->dev = vdev->dev.parent->parent;
> >  
> >  		/* Increase device refcnt to avoid freeing it */
> >  		get_device(buf->dev);
> > @@ -838,7 +838,7 @@ static ssize_t port_fops_write(struct file *filp, const char __user *ubuf,
> >  
> >  	count = min((size_t)(32 * 1024), count);
> >  
> > -	buf = alloc_buf(port->out_vq, count, 0);
> > +	buf = alloc_buf(port->portdev->vdev, count, 0);
> >  	if (!buf)
> >  		return -ENOMEM;
> >  
> > @@ -957,7 +957,7 @@ static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe,
> >  	if (ret < 0)
> >  		goto error_out;
> >  
> > -	buf = alloc_buf(port->out_vq, 0, pipe->nrbufs);
> > +	buf = alloc_buf(port->portdev->vdev, 0, pipe->nrbufs);
> >  	if (!buf) {
> >  		ret = -ENOMEM;
> >  		goto error_out;
> > @@ -1374,7 +1374,7 @@ static unsigned int fill_queue(struct virtqueue *vq, spinlock_t *lock)
> >  
> >  	nr_added_bufs = 0;
> >  	do {
> > -		buf = alloc_buf(vq, PAGE_SIZE, 0);
> > +		buf = alloc_buf(vq->vdev, PAGE_SIZE, 0);
> >  		if (!buf)
> >  			break;
> >  
> > -- 
> > MST
> 
> <formletter>
> 
> This is not the correct way to submit patches for inclusion in the
> stable kernel tree.  Please read:
>     https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> for how to do this properly.
> 
> </formletter>


Thanks!
I have some questions about this one:

Cc: <stable@vger.kernel.org> # 3.3.x: a1f84a3: sched: Check for idle
Cc: <stable@vger.kernel.org> # 3.3.x: 1b9508f: sched: Rate-limit newidle
Cc: <stable@vger.kernel.org> # 3.3.x: fd21073: sched: Fix affinity logic
Cc: <stable@vger.kernel.org> # 3.3.x
Signed-off-by: Ingo Molnar <mingo@elte.hu>

1. what does the kernel version mean? can I omit it?
2. so when I rebase to add the tag, this changes commit IDs for
   following tags in the same tree, breaking their tags
   in the process. Pretty annoying. Any idea how to do it better?

Thanks!

-- 
MST

^ permalink raw reply

* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24 18:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Lameter, dm-devel, eric.dumazet, mst, netdev,
	linux-kernel, Michal Hocko, Pekka Enberg, linux-mm, edumazet,
	David Rientjes, Joonsoo Kim, Andrew Morton, virtualization,
	David Miller, Vlastimil Babka
In-Reply-To: <20180424171651.GC30577@bombadil.infradead.org>



On Tue, 24 Apr 2018, Matthew Wilcox wrote:

> On Tue, Apr 24, 2018 at 08:29:14AM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Mon, 23 Apr 2018, Matthew Wilcox wrote:
> > 
> > > On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> > > > Some bugs (such as buffer overflows) are better detected
> > > > with kmalloc code, so we must test the kmalloc path too.
> > > 
> > > Well now, this brings up another item for the collective TODO list --
> > > implement redzone checks for vmalloc.  Unless this is something already
> > > taken care of by kasan or similar.
> > 
> > The kmalloc overflow testing is also not ideal - it rounds the size up to 
> > the next slab size and detects buffer overflows only at this boundary.
> > 
> > Some times ago, I made a "kmalloc guard" patch that places a magic number 
> > immediatelly after the requested size - so that it can detect overflows at 
> > byte boundary 
> > ( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html )
> > 
> > That patch found a bug in crypto code:
> > ( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html )
> 
> Is it still worth doing this, now we have kasan?

The kmalloc guard has much lower overhead than kasan.

(BTW. when I tried kasan, it oopsed with persistent memory)

Mikulas

^ permalink raw reply

* Re: [PATCH 0/6] virtio-console: spec compliance fixes
From: Michael S. Tsirkin @ 2018-04-24 18:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Arnd Bergmann, Amit Shah, Greg Kroah-Hartman, stable,
	virtualization
In-Reply-To: <1524248223-393618-1-git-send-email-mst@redhat.com>

On Fri, Apr 20, 2018 at 09:17:59PM +0300, Michael S. Tsirkin wrote:
> Turns out virtio console tries to take a buffer out of an active vq.
> Works by sheer luck, and is explicitly forbidden by spec.  And while
> going over it I saw that error handling is also broken -
> failure is easy to trigger if I force allocations to fail.
> 
> Lightly tested.

Amit - any feedback before I push these patches?

> Michael S. Tsirkin (6):
>   virtio_console: don't tie bufs to a vq
>   virtio: add ability to iterate over vqs
>   virtio_console: free buffers after reset
>   virtio_console: drop custom control queue cleanup
>   virtio_console: move removal code
>   virtio_console: reset on out of memory
> 
>  drivers/char/virtio_console.c | 155 ++++++++++++++++++++----------------------
>  include/linux/virtio.h        |   3 +
>  2 files changed, 75 insertions(+), 83 deletions(-)
> 
> -- 
> MST
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox