Linux virtualization list
 help / color / mirror / Atom feed
* [PATCH] net: vhost: improve performance when enable busyloop
From: Tonghao Zhang @ 2018-06-20 13:28 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, Tonghao Zhang, virtualization

This patch improves the guest receive performance from
host. On the handle_tx side, we poll the sock receive
queue at the same time. handle_rx do that in the same way.

we set the poll-us=100 us and use the iperf3 to test
its throughput. The iperf3 command is shown as below.

iperf3 -s -D
iperf3 -c 192.168.1.100 -i 1 -P 10 -t 10 -M 1400 --bandwidth 100000M

* With the patch:    21.1 Gbits/sec
* Without the patch: 12.7 Gbits/sec

Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
---
 drivers/vhost/net.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e7cf7d2..9364ede 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -429,22 +429,43 @@ static int vhost_net_enable_vq(struct vhost_net *n,
 	return vhost_poll_start(poll, sock->file);
 }
 
+static int sk_has_rx_data(struct sock *sk);
+
 static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 				    struct vhost_virtqueue *vq,
 				    struct iovec iov[], unsigned int iov_size,
 				    unsigned int *out_num, unsigned int *in_num)
 {
 	unsigned long uninitialized_var(endtime);
+	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_RX];
+	struct vhost_virtqueue *rvq = &nvq->vq;
+	struct socket *sock = rvq->private_data;
+
 	int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 				  out_num, in_num, NULL, NULL);
 
 	if (r == vq->num && vq->busyloop_timeout) {
+		mutex_lock_nested(&rvq->mutex, 1);
+
+		vhost_disable_notify(&net->dev, rvq);
+
 		preempt_disable();
 		endtime = busy_clock() + vq->busyloop_timeout;
 		while (vhost_can_busy_poll(vq->dev, endtime) &&
+		       !(sock && sk_has_rx_data(sock->sk)) &&
 		       vhost_vq_avail_empty(vq->dev, vq))
 			cpu_relax();
 		preempt_enable();
+
+		if (sock && sk_has_rx_data(sock->sk))
+			vhost_poll_queue(&rvq->poll);
+		else if (unlikely(vhost_enable_notify(&net->dev, rvq))) {
+			vhost_disable_notify(&net->dev, rvq);
+			vhost_poll_queue(&rvq->poll);
+		}
+
+		mutex_unlock(&rvq->mutex);
+
 		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 				      out_num, in_num, NULL, NULL);
 	}
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH v5 0/9] x86: macrofying inline asm for better compilation
From: Peter Zijlstra @ 2018-06-20 10:11 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Juergen Gross, Kate Stewart, Kees Cook, Josh Poimboeuf,
	Philippe Ombredanne, Greg Kroah-Hartman, Christopher Li, x86,
	linux-kernel, Linus Torvalds, virtualization, Masahiro Yamada,
	linux-sparse, Ingo Molnar, Jan Beulich, H. Peter Anvin,
	Alok Kataria, Sam Ravnborg, Thomas Gleixner
In-Reply-To: <20180619194854.69486-1-namit@vmware.com>

On Tue, Jun 19, 2018 at 12:48:45PM -0700, Nadav Amit wrote:
> Nadav Amit (9):
>   Makefile: Prepare for using macros for inline asm
>   x86: objtool: use asm macro for better compiler decisions
>   x86: refcount: prevent gcc distortions
>   x86: alternatives: macrofy locks for better inlining
>   x86: bug: prevent gcc distortions
>   x86: prevent inline distortion by paravirt ops
>   x86: extable: use macros instead of inline assembly
>   x86: cpufeature: use macros instead of inline assembly
>   x86: jump-labels: use macros instead of inline assembly
> 
>  Makefile                               |  9 ++-
>  arch/x86/Makefile                      | 11 ++-
>  arch/x86/include/asm/alternative-asm.h | 20 ++++--
>  arch/x86/include/asm/alternative.h     | 11 +--
>  arch/x86/include/asm/asm.h             | 61 +++++++---------
>  arch/x86/include/asm/bug.h             | 98 +++++++++++++++-----------
>  arch/x86/include/asm/cpufeature.h      | 82 ++++++++++++---------
>  arch/x86/include/asm/jump_label.h      | 65 ++++++++++-------
>  arch/x86/include/asm/paravirt_types.h  | 56 +++++++--------
>  arch/x86/include/asm/refcount.h        | 74 +++++++++++--------
>  arch/x86/kernel/macros.S               | 16 +++++
>  include/asm-generic/bug.h              |  8 +--
>  include/linux/compiler.h               | 56 +++++++++++----
>  scripts/Kbuild.include                 |  4 +-
>  scripts/mod/Makefile                   |  2 +
>  15 files changed, 340 insertions(+), 233 deletions(-)
>  create mode 100644 arch/x86/kernel/macros.S

Aside from the one niggle:

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply

* Re: [PATCH v5 6/9] x86: prevent inline distortion by paravirt ops
From: Peter Zijlstra @ 2018-06-20 10:10 UTC (permalink / raw)
  To: Nadav Amit
  Cc: x86, linux-kernel, virtualization, Ingo Molnar, H. Peter Anvin,
	Alok Kataria, Thomas Gleixner
In-Reply-To: <20180619194854.69486-7-namit@vmware.com>

On Tue, Jun 19, 2018 at 12:48:51PM -0700, Nadav Amit wrote:
> +#define paravirt_alt							\
> +	"PARAVIRT_CALL type=\"%c[paravirt_typenum]\""			\
> +	" clobber=\"%c[paravirt_clobber]\""				\
> +	" pv_opptr=\"%c[paravirt_opptr]\""

That wants to be:

+     " pv_opptr=\"%c[paravirt_opptr]\";"

And other than that I would suggest: 's/paravirt_alt/paravirt_call/g'.

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-20  9:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180619233001-mutt-send-email-mst@kernel.org>

On Tue, 19 Jun 2018 23:32:06 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:
> > Sorry about dragging mainframes into this, but this will only work for
> > homogenous device coupling, not for heterogenous. Consider my vfio-pci
> > + virtio-net-ccw example again: The guest cannot find out that the two
> > belong together by checking some group ID, it has to either use the MAC
> > or some needs-to-be-architectured property.
> > 
> > Alternatively, we could propose that mechanism as pci-only, which means
> > we can rely on mechanisms that won't necessarily work on non-pci
> > transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> > through a network card anytime in the near future, due to the nature of
> > network cards currently in use on s390.)  
> 
> That's what it boils down to, yes.  If there's need to have this for
> non-pci devices, then we should put it in config space.
> Cornelia, what do you think?
> 

I think the only really useful config on s390 is the vfio-pci network
card coupled with a virtio-net-ccw device: Using an s390 network card
via vfio-ccw is out due to the nature of the s390 network cards, and
virtio-ccw is the default transport (virtio-pci is not supported on any
enterprise distro AFAIK).

For this, having a uuid in the config space could work (vfio-pci
devices have a config space by virtue of being pci devices, and
virtio-net-ccw devices have a config space by virtue of being virtio
devices -- ccw devices usually don't have that concept).

^ permalink raw reply

* RE: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wang, Wei W @ 2018-06-20  9:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <20180619173256-mutt-send-email-mst@kernel.org>

On Tuesday, June 19, 2018 10:43 PM, Michael S. Tsirk wrote:
> On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote:
> > On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> > > On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote:
> > > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > > > On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > > > > > Not necessarily, I think. We have min(4m_page_blocks / 512,
> > > > > > 1024) above,
> > > > > so the maximum memory that can be reported is 2TB. For larger
> guests, e.g.
> > > > > 4TB, the optimization can still offer 2TB free memory (better
> > > > > than no optimization).
> > > > >
> > > > > Maybe it's better, maybe it isn't. It certainly muddies the waters even
> more.
> > > > > I'd rather we had a better plan. From that POV I like what
> > > > > Matthew Wilcox suggested for this which is to steal the necessary # of
> entries off the list.
> > > > Actually what Matthew suggested doesn't make a difference here.
> > > > That method always steal the first free page blocks, and sure can
> > > > be changed to take more. But all these can be achieved via kmalloc
> > > I'd do get_user_pages really. You don't want pages split, etc.
> 
> Oops sorry. I meant get_free_pages .

Yes, we can use __get_free_pages, and the max allocation is MAX_ORDER - 1, which can report up to 2TB free memory. 

"getting two pages isn't harder", do you mean passing two arrays (two allocations by get_free_pages(,MAX_ORDER -1)) to the mm API?

Please see if the following logic aligns to what you think:

        uint32_t i, max_hints, hints_per_page, hints_per_array, total_arrays;
        unsigned long *arrays;
 
     /*
         * Each array size is MAX_ORDER_NR_PAGES. If one array is not enough to
         * store all the hints, we need to allocate multiple arrays.
         * max_hints: the max number of 4MB free page blocks
         * hints_per_page: the number of hints each page can store
         * hints_per_array: the number of hints an array can store
         * total_arrays: the number of arrays we need
         */
        max_hints = totalram_pages / MAX_ORDER_NR_PAGES;
        hints_per_page = PAGE_SIZE / sizeof(__le64);
        hints_per_array = hints_per_page * MAX_ORDER_NR_PAGES;
        total_arrays = max_hints /  hints_per_array +
                       !!(max_hints % hints_per_array);
        arrays = kmalloc(total_arrays * sizeof(unsigned long), GFP_KERNEL);
        for (i = 0; i < total_arrays; i++) {
                arrays[i] = __get_free_pages(__GFP_ATOMIC | __GFP_NOMEMALLOC, MAX_ORDER - 1);

              if (!arrays[i])
                      goto out;
        }


- the mm API needs to be changed to support storing hints to multiple separated arrays offered by the caller.

Best,
Wei

^ permalink raw reply

* Re: [PATCH v5 6/9] x86: prevent inline distortion by paravirt ops
From: kbuild test robot @ 2018-06-20  5:05 UTC (permalink / raw)
  Cc: x86, linux-kernel, virtualization, Ingo Molnar, Nadav Amit,
	kbuild-all, H. Peter Anvin, Alok Kataria, Thomas Gleixner
In-Reply-To: <20180619194854.69486-7-namit@vmware.com>

[-- Attachment #1: Type: text/plain, Size: 4337 bytes --]

Hi Nadav,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18-rc1 next-20180619]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Nadav-Amit/x86-macrofying-inline-asm-for-better-compilation/20180620-112043
config: i386-randconfig-x016-201824 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   arch/x86/include/asm/paravirt.h: Assembler messages:
>> arch/x86/include/asm/paravirt.h:253: Error: can't mix positional and keyword arguments
--
   arch/x86/include/asm/paravirt.h: Assembler messages:
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
>> arch/x86/include/asm/paravirt.h:253: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments

vim +253 arch/x86/include/asm/paravirt.h

014b15be include/asm-x86/paravirt.h  Glauber de Oliveira Costa 2008-01-30  249  
014b15be include/asm-x86/paravirt.h  Glauber de Oliveira Costa 2008-01-30  250  static inline void write_gdt_entry(struct desc_struct *dt, int entry,
014b15be include/asm-x86/paravirt.h  Glauber de Oliveira Costa 2008-01-30  251  				   void *desc, int type)
f8822f42 include/asm-i386/paravirt.h Jeremy Fitzhardinge       2007-05-02  252  {
014b15be include/asm-x86/paravirt.h  Glauber de Oliveira Costa 2008-01-30 @253  	PVOP_VCALL4(pv_cpu_ops.write_gdt_entry, dt, entry, desc, type);
f8822f42 include/asm-i386/paravirt.h Jeremy Fitzhardinge       2007-05-02  254  }
014b15be include/asm-x86/paravirt.h  Glauber de Oliveira Costa 2008-01-30  255  

:::::: The code at line 253 was first introduced by commit
:::::: 014b15be30c04622d130946ab7c0a9101b523a8a x86: change write_gdt_entry signature.

:::::: TO: Glauber de Oliveira Costa <gcosta@redhat.com>
:::::: CC: Ingo Molnar <mingo@elte.hu>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 33213 bytes --]

[-- Attachment #3: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-19 20:32 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180619125453.2d2dfb2d.cohuck@redhat.com>

On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:
> Sorry about dragging mainframes into this, but this will only work for
> homogenous device coupling, not for heterogenous. Consider my vfio-pci
> + virtio-net-ccw example again: The guest cannot find out that the two
> belong together by checking some group ID, it has to either use the MAC
> or some needs-to-be-architectured property.
> 
> Alternatively, we could propose that mechanism as pci-only, which means
> we can rely on mechanisms that won't necessarily work on non-pci
> transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> through a network card anytime in the near future, due to the nature of
> network cards currently in use on s390.)

That's what it boils down to, yes.  If there's need to have this for
non-pci devices, then we should put it in config space.
Cornelia, what do you think?

-- 
MST

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-19 20:09 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, konrad.wilk, qemu-devel,
	virtualization, Venu Busireddy, Netdev, boris.ostrovsky,
	aaron.f.brown, Joao Martins
In-Reply-To: <20180619125453.2d2dfb2d.cohuck@redhat.com>

On Tue, Jun 19, 2018 at 3:54 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> On Fri, 15 Jun 2018 10:06:07 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>> > On Thu, 14 Jun 2018 18:57:11 -0700
>> > Siwei Liu <loseweigh@gmail.com> wrote:
>> >
>> >> Thank you for sharing your thoughts, Cornelia. With questions below, I
>> >> think you raised really good points, some of which I don't have answer
>> >> yet and would also like to explore here.
>> >>
>> >> First off, I don't want to push the discussion to the extreme at this
>> >> point, or sell anything about having QEMU manage everything
>> >> automatically. Don't get me wrong, it's not there yet. Let's don't
>> >> assume we are tied to a specific or concerte solution. I think the key
>> >> for our discussion might be to define or refine the boundary between
>> >> VM and guest,  e.g. what each layer is expected to control and manage
>> >> exactly.
>> >>
>> >> In my view, there might be possibly 3 different options to represent
>> >> the failover device conceipt to QEMU and libvirt (or any upper layer
>> >> software):
>> >>
>> >> a. Seperate device: in this model, virtio and passthough remains
>> >> separate devices just as today. QEMU exposes the standby feature bit
>> >> for virtio, and publish status/event around the negotiation process of
>> >> this feature bit for libvirt to react upon. Since Libvirt has the
>> >> pairing relationship itself, maybe through MAC address or something
>> >> else, it can control the presence of primary by hot plugging or
>> >> unplugging the passthrough device, although it has to work tightly
>> >> with virtio's feature negotation process. Not just for migration but
>> >> also various corner scenarios (driver/feature ok, device reset,
>> >> reboot, legacy guest etc) along virtio's feature negotiation.
>> >
>> > Yes, that one has obvious tie-ins to virtio's modus operandi.
>> >
>> >>
>> >> b. Coupled device: in this model, virtio and passthough devices are
>> >> weakly coupled using some group ID, i.e. QEMU match the passthough
>> >> device for a standby virtio instance by comparing the group ID value
>> >> present behind each device's bridge. Libvirt provides QEMU the group
>> >> ID for both type of devices, and only deals with hot plug for
>> >> migration, by checking some migration status exposed (e.g. the feature
>> >> negotiation status on the virtio device) by QEMU. QEMU manages the
>> >> visibility of the primary in guest along virtio's feature negotiation
>> >> process.
>> >
>> > I'm a bit confused here. What, exactly, ties the two devices together?
>>
>> The group UUID. Since QEMU VFIO dvice does not have insight of MAC
>> address (which it doesn't have to), the association between VFIO
>> passthrough and standby must be specificed for QEMU to understand the
>> relationship with this model. Note, standby feature is no longer
>> required to be exposed under this model.
>
> Isn't that a bit limiting, though?
>
> With this model, you can probably tie a vfio-pci device and a
> virtio-net-pci device together. But this will fail if you have
> different transports: Consider tying together a vfio-pci device and a
> virtio-net-ccw device on s390, for example. The standby feature bit is
> on the virtio-net level and should not have any dependency on the
> transport used.

Probably we'd limit the support for grouping to virtio-net-pci device
and vfio-pci device only. For virtio-net-pci, as you might see with
Venu's patch, we store the group UUID on the config space of
virtio-pci, which is only applicable to PCI transport.

If virtio-net-ccw needs to support the same, I think similar grouping
interface should be defined on the VirtIO CCW transport. I think the
current implementation of the Linux failover driver assumes that it's
SR-IOV VF with same MAC address which the virtio-net-pci needs to pair
with, and that the PV path is on same PF without needing to update
network of the port-MAC association change. If we need to extend the
grouping mechanism to virtio-net-ccw, it has to pass such failover
mode to virtio driver specifically through some other option I guess.

>
>>
>> > If libvirt already has the knowledge that it should manage the two as a
>> > couple, why do we need the group id (or something else for other
>> > architectures)? (Maybe I'm simply missing something because I'm not
>> > that familiar with pci.)
>>
>> The idea is to have QEMU control the visibility and enumeration order
>> of the passthrough VFIO for the failover scenario. Hotplug can be one
>> way to achieve it, and perhaps there's other way around also. The
>> group ID is not just for QEMU to couple devices, it's also helpful to
>> guest too as grouping using MAC address is just not safe.
>
> Sorry about dragging mainframes into this, but this will only work for
> homogenous device coupling, not for heterogenous. Consider my vfio-pci
> + virtio-net-ccw example again: The guest cannot find out that the two
> belong together by checking some group ID, it has to either use the MAC
> or some needs-to-be-architectured property.
>
> Alternatively, we could propose that mechanism as pci-only, which means
> we can rely on mechanisms that won't necessarily work on non-pci
> transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> through a network card anytime in the near future, due to the nature of
> network cards currently in use on s390.)

Yes, let's do this just for PCI transport (homogenous) for now.

>
>>
>> >
>> >>
>> >> c. Fully combined device: in this model, virtio and passthough devices
>> >> are viewed as a single VM interface altogther. QEMU not just controls
>> >> the visibility of the primary in guest, but can also manage the
>> >> exposure of the passthrough for migratability. It can be like that
>> >> libvirt supplies the group ID to QEMU. Or libvirt does not even have
>> >> to provide group ID for grouping the two devices, if just one single
>> >> combined device is exposed by QEMU. In either case, QEMU manages all
>> >> aspect of such internal construct, including virtio feature
>> >> negotiation, presence of the primary, and live migration.
>> >
>> > Same question as above.
>> >
>> >>
>> >> It looks like to me that, in your opinion, you seem to prefer go with
>> >> (a). While I'm actually okay with either (b) or (c). Do I understand
>> >> your point correctly?
>> >
>> > I'm not yet preferring anything, as I'm still trying to understand how
>> > this works :) I hope we can arrive at a model that covers the use case
>> > and that is also flexible enough to be extended to other platforms.
>> >
>> >>
>> >> The reason that I feel that (a) might not be ideal, just as Michael
>> >> alluded to (quoting below), is that as management stack, it really
>> >> doesn't need to care about the detailed process of feature negotiation
>> >> (if we view the guest presence of the primary as part of feature
>> >> negotiation at an extended level not just virtio). All it needs to be
>> >> done is to hand in the required devices to QEMU and that's all. Why do
>> >> we need to addd various hooks, events for whichever happens internally
>> >> within the guest?
>> >>
>> >> ''
>> >> Primary device is added with a special "primary-failover" flag.
>> >> A virtual machine is then initialized with just a standby virtio
>> >> device. Primary is not yet added.
>> >>
>> >> Later QEMU detects that guest driver device set DRIVER_OK.
>> >> It then exposes the primary device to the guest, and triggers
>> >> a device addition event (hot-plug event) for it.
>> >>
>> >> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
>> >> to remove the primary driver.  In particular, if QEMU detects guest
>> >> re-initialization (e.g. by detecting guest reset) it immediately removes
>> >> the primary device.
>> >> ''
>> >>
>> >> and,
>> >>
>> >> ''
>> >> management just wants to give the primary to guest and later take it back,
>> >> it really does not care about the details of the process,
>> >> so I don't see what does pushing it up the stack buy you.
>> >>
>> >> So I don't think it *needs* to be done in libvirt. It probably can if you
>> >> add a bunch of hooks so it knows whenever vm reboots, driver binds and
>> >> unbinds from device, and can check that backup flag was set.
>> >> If you are pushing for a setup like that please get a buy-in
>> >> from libvirt maintainers or better write a patch.
>> >> ''
>> >
>> > This actually seems to mean the opposite to me: We need to know what
>> > the guest is doing and when, as it directly drives what we need to do
>> > with the devices. If we switch to a visibility vs a hotplug model (see
>> > the other mail), we might be able to handle that part within qemu.
>>
>> In the model of (b), I think it essentially turns hotplug to one of
>> mechanisms for QEMU to control the visibility. The libvirt can still
>> manage the hotplug of individual devices during live migration or in
>> normal situation to hot add/remove devices. Though the visibility of
>> the VFIO is under the controll of QEMU, and it's possible that the hot
>> add/remove request does not involve actual hot plug activity in guest
>> at all.
>
> That depends on how you model visibility, I guess. You'll probably want
> to stop traffic flowing through one or the other of the cards; would
> link down or similar be enough for the virtio device?

I'm not sure if it is a good idea. The guest user will see two devices
with same MAC but one of them is down. Do you expect user to use it or
not? And since the guest is going to be migrated, we need to unplug a
broken VF from guest before migrating, why do we bother plugging in
this useless VF at the first place?


Thanks,
-Siwei
>
>>
>> In the model of (c), the hotplug semantics of the combined device
>> would mean differently - it would end up with devices plugged in or
>> out altogther. To make this work, we either have to build a brand new
>> bond-like QEMU device consist of virtio and VFIO internally, or need
>> to have some abstraction in place for libvirt to manipulate the
>> combined device (and prohibit libvirt from operating on individual
>> internal device directly). Note with this model the group ID doesn't
>> even need to get exposed to libvirt, just imagine libvirt to supply
>> all options required to configure two regular virtio-net and VFIO
>> devices for a single device object, and QEMU will deal with the
>> device's visibility and enumeration, such when to hot plug VFIO device
>> in to or out from the guest.
>>
>> It might be complicated to implement (c) though.
>
> I think (c) would be even more complicated for heterogenous setups.

^ permalink raw reply

* Re: Design Decision for KVM based anti rootkit
From: Ahmed Soliman @ 2018-06-19 18:12 UTC (permalink / raw)
  To: David Vrabel
  Cc: riel, Kees Cook, kvm, Ard Biesheuvel, Kernel Hardening,
	qemu-devel, virtualization, Hossam Hassan, Ahmed Lotfy
In-Reply-To: <7337582f-0007-d006-0809-cf41fd93b31e@nutanix.com>

On 19 June 2018 at 19:37, David Vrabel <david.vrabel@nutanix.com> wrote:
> It's not clear how this increases security. What threats is this
> protecting again?
It won't completely protect prevent rootkits, because still rootkits
can edit dynamic kernel data structures, but it will limit what
rootkits damage to only dynamic data.
This way system calls can't be changed, or Interrupt tables.
> As an attacker, modifying the sensitive pages (kernel text?) will
> require either: a) altering the existing mappings for these (to make
> them read-write or user-writable for example); or b) creating aliased
> mappings with suitable permissions.
>
> If the attacker can modify page tables in this way then it can also
> bypass the suggested hypervisor's read-only protection by changing the
> mappings to point to a unprotected page.

I think I was missing this part out, but I meant to say completely
prevent any modification to pages including the guest physical address
to guest virtual address mapping for those protected pages, Another
tricky (something random just popped up in my mind right now, better
to say it than to forget it) solution is making new memory mappings
inherit the same protection as old one, I assume that Hyper visor can
do either things. Also that was the kind of performance hit I was
talking about. I am not sure if that might break things or I can say
it will for sure heavily limit some functionalities. like maybe
hibernating guest. But that will be the kind of trades off I am
expecting at least at the begining.

^ permalink raw reply

* Re: [PATCH v4 6/9] x86: prevent inline distortion by paravirt ops
From: Juergen Gross @ 2018-06-19 15:09 UTC (permalink / raw)
  To: Nadav Amit, linux-kernel, x86
  Cc: Alok Kataria, Ingo Molnar, virtualization, Thomas Gleixner,
	H. Peter Anvin
In-Reply-To: <20180612115050.185112-7-namit@vmware.com>

On 12/06/18 13:50, Nadav Amit wrote:
> GCC considers the number of statements in inlined assembly blocks,
> according to new-lines and semicolons, as an indication to the cost of
> the block in time and space. This data is distorted by the kernel code,
> which puts information in alternative sections. As a result, the
> compiler may perform incorrect inlining and branch optimizations.
> 
> The solution is to set an assembly macro and call it from the inlined
> assembly block. As a result GCC considers the inline assembly block as
> a single instruction.
> 
> The effect of the patch is a more aggressive inlining, which also
> causes a size increase of kernel.
> 
>    text	   data	    bss	    dec	    hex	filename
> 18147336 10226688 2957312 31331336 1de1408 ./vmlinux before
> 18162555 10226288 2957312 31346155 1de4deb ./vmlinux after (+14819)
> 
> Static text symbols:
> Before:	40053
> After:	39942	(-111)
> 
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Alok Kataria <akataria@vmware.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: virtualization@lists.linux-foundation.org
> 
> Signed-off-by: Nadav Amit <namit@vmware.com>

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen

^ permalink raw reply

* Re: [PATCH v5 0/3] extern inline native_save_fl for paravirt
From: Juergen Gross @ 2018-06-19 15:03 UTC (permalink / raw)
  To: Nick Desaulniers, akpm, hpa, mingo, tglx
  Cc: kstewart, andrea.parri, linux-efi, brijesh.singh, jan.kiszka,
	will.deacon, jarkko.sakkinen, linux-kernel, yamada.masahiro,
	manojgupta, akataria, tweek, mawilcox, x86, ghackmann, mka, geert,
	rientjes, aryabinin, thomas.lendacky, arnd, linux-kbuild,
	keescook, rostedt, acme, caoj.fnst, jpoimboe, sedat.dilek,
	boris.ostrovsky, virtualization, michal.lkml, tstellar, gregkh,
	ard.biesheuvel, astrachan, mjg59, pombre
In-Reply-To: <20180613210518.113983-1-ndesaulniers@google.com>

On 13/06/18 23:05, Nick Desaulniers wrote:
> paravirt depends on a custom calling convention (callee saved), but
> expects this from a static inline function that it then forces to be
> outlined. This is problematic because different compilers or flags can
> then add a stack guard that violates the calling conventions.
> 
> Uses extern inline with the out-of-line definition in assembly to
> prevent compilers from adding stack guards to the outlined version.
> 
> Other parts of the codebase overwrite KBUILD_CFLAGS, which is *extremely
> problematic* for extern inline, as the sematics are completely the
> opposite depending on what C standard is used.
> http://blahg.josefsipek.net/?p=529
> 
> Changes since v4:
>   Take Arnd's and hpa's suggestions properly feature detect gnu_inline
>   attribute to support gcc 4.1.
> 
> Changes since v3:
>   Take Joe's suggestion to hoist __inline__ and __inline out of
>   conditional block.
> 
> Changes since v2:
>   Take hpa's _ASM_ARG patch into the set in order to simplify cross
>   32b/64b x86 assembly and rebase my final patch to use it.  Apply
>   Sedat's typo fix to commit message and sign off on it. Take Joe's
>   suggestion to simplify __inline__ and __inline. Add Arnd to list of
>   folks who made helpful suggestions.
> 
> Changes since v1:
>   Prefer gnu_inline function attribute instead of explicitly setting C
>   standard compiler flag in problematic Makefiles. We should instead
>   carefully evaluate if those Makefiles should be overwriting
>   KBUILD_CFLAGS at all. Dropped the previous first two patches and added
>   a new first patch.
> 
> H. Peter Anvin (1):
>   x86/asm: add _ASM_ARG* constants for argument registers to <asm/asm.h>
> 
> Nick Desaulniers (2):
>   compiler-gcc.h: add gnu_inline to all inline declarations
>   x86: paravirt: make native_save_fl extern inline
> 
>  arch/x86/include/asm/asm.h      | 59 +++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/irqflags.h |  2 +-
>  arch/x86/kernel/Makefile        |  1 +
>  arch/x86/kernel/irqflags.S      | 26 +++++++++++++++
>  include/linux/compiler-gcc.h    | 29 ++++++++++++----
>  5 files changed, 109 insertions(+), 8 deletions(-)
>  create mode 100644 arch/x86/kernel/irqflags.S
> 

For the series:

Acked-by: Juergen Gross <jgross@suse.com>


Juergen

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-06-19 14:43 UTC (permalink / raw)
  To: Wei Wang
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <5B28F371.9020308@intel.com>

On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote:
> On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> > On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote:
> > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > > On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > > > > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> > > > so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> > > > 4TB, the optimization can still offer 2TB free memory (better than no
> > > > optimization).
> > > > 
> > > > Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
> > > > I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> > > > suggested for this which is to steal the necessary # of entries off the list.
> > > Actually what Matthew suggested doesn't make a difference here. That method always steal the first free page blocks, and sure can be changed to take more. But all these can be achieved via kmalloc
> > I'd do get_user_pages really. You don't want pages split, etc.

Oops sorry. I meant get_free_pages .

> 
> > > by the caller which is more prudent and makes the code more straightforward. I think we don't need to take that risk unless the MM folks strongly endorse that approach.
> > > 
> > > The max size of the kmalloc-ed memory is 4MB, which gives us the limitation that the max free memory to report is 2TB. Back to the motivation of this work, the cloud guys want to use this optimization to accelerate their guest live migration. 2TB guests are not common in today's clouds. When huge guests become common in the future, we can easily tweak this API to fill hints into scattered buffer (e.g. several 4MB arrays passed to this API) instead of one as in this version.
> > > 
> > > This limitation doesn't cause any issue from functionality perspective. For the extreme case like a 100TB guest live migration which is theoretically possible today, this optimization helps skip 2TB of its free memory. This result is that it may reduce only 2% live migration time, but still better than not skipping the 2TB (if not using the feature).
> > Not clearly better, no, since you are slowing the guest.
> 
> Not really. Live migration slows down the guest itself. It seems that the
> guest spends a little extra time reporting free pages, but in return the
> live migration time gets reduced a lot, which makes the guest endure less
> from live migration. (there is no drop of the workload performance when
> using the optimization in the tests)

My point was you can't say what is better without measuring.
Without special limitations you have hint overhead vs migration
overhead. I think we need to  build to scale to huge guests.
We might discover scalability problems down the road,
but no sense in building in limitations straight away.

> 
> 
> > 
> > 
> > > So, for the first release of this feature, I think it is better to have the simpler and more straightforward solution as we have now, and clearly document why it can report up to 2TB free memory.
> > No one has the time to read documentation about how an internal flag
> > within a device works. Come on, getting two pages isn't much harder
> > than a single one.
> 
> > > > If that doesn't fly, we can allocate out of the loop and just retry with more
> > > > pages.
> > > > 
> > > > > On the other hand, large guests being large mostly because the guests need
> > > > to use large memory. In that case, they usually won't have that much free
> > > > memory to report.
> > > > 
> > > > And following this logic small guests don't have a lot of memory to report at
> > > > all.
> > > > Could you remind me why are we considering this optimization then?
> > > If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to use e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB free memory to report, but reporting 0.5TB with this optimization is better than no optimization. (and the current 2TB limitation isn't a limitation for the 3TB guest in this case)
> > I'd rather not spend time writing up random limitations.
> 
> This is not a random limitation. It would be more clear to see the code.

Users don't see code though, that's the point.

Exporting internal limitations from code to users isn't great.


> Also I'm not sure how get_user_pages could be used in our case, and what you
> meant by "getting two pages". I'll post out a new version, and we can
> discuss on the code.

Sorry, I meant get_free_pages.

> 
> Best,
> Wei

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wei Wang @ 2018-06-19 12:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <20180619055449-mutt-send-email-mst@kernel.org>

On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote:
>> On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
>>> On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
>>>> Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
>>> so the maximum memory that can be reported is 2TB. For larger guests, e.g.
>>> 4TB, the optimization can still offer 2TB free memory (better than no
>>> optimization).
>>>
>>> Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
>>> I'd rather we had a better plan. From that POV I like what Matthew Wilcox
>>> suggested for this which is to steal the necessary # of entries off the list.
>> Actually what Matthew suggested doesn't make a difference here. That method always steal the first free page blocks, and sure can be changed to take more. But all these can be achieved via kmalloc
> I'd do get_user_pages really. You don't want pages split, etc.

>> by the caller which is more prudent and makes the code more straightforward. I think we don't need to take that risk unless the MM folks strongly endorse that approach.
>>
>> The max size of the kmalloc-ed memory is 4MB, which gives us the limitation that the max free memory to report is 2TB. Back to the motivation of this work, the cloud guys want to use this optimization to accelerate their guest live migration. 2TB guests are not common in today's clouds. When huge guests become common in the future, we can easily tweak this API to fill hints into scattered buffer (e.g. several 4MB arrays passed to this API) instead of one as in this version.
>>
>> This limitation doesn't cause any issue from functionality perspective. For the extreme case like a 100TB guest live migration which is theoretically possible today, this optimization helps skip 2TB of its free memory. This result is that it may reduce only 2% live migration time, but still better than not skipping the 2TB (if not using the feature).
> Not clearly better, no, since you are slowing the guest.

Not really. Live migration slows down the guest itself. It seems that 
the guest spends a little extra time reporting free pages, but in return 
the live migration time gets reduced a lot, which makes the guest endure 
less from live migration. (there is no drop of the workload performance 
when using the optimization in the tests)



>
>
>> So, for the first release of this feature, I think it is better to have the simpler and more straightforward solution as we have now, and clearly document why it can report up to 2TB free memory.
> No one has the time to read documentation about how an internal flag
> within a device works. Come on, getting two pages isn't much harder
> than a single one.

>>   
>>> If that doesn't fly, we can allocate out of the loop and just retry with more
>>> pages.
>>>
>>>> On the other hand, large guests being large mostly because the guests need
>>> to use large memory. In that case, they usually won't have that much free
>>> memory to report.
>>>
>>> And following this logic small guests don't have a lot of memory to report at
>>> all.
>>> Could you remind me why are we considering this optimization then?
>> If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to use e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB free memory to report, but reporting 0.5TB with this optimization is better than no optimization. (and the current 2TB limitation isn't a limitation for the 3TB guest in this case)
> I'd rather not spend time writing up random limitations.

This is not a random limitation. It would be more clear to see the code. 
Also I'm not sure how get_user_pages could be used in our case, and what 
you meant by "getting two pages". I'll post out a new version, and we 
can discuss on the code.


Best,
Wei

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-19 10:54 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, konrad.wilk, qemu-devel,
	virtualization, Venu Busireddy, Netdev, boris.ostrovsky,
	aaron.f.brown, Joao Martins
In-Reply-To: <CADGSJ208NRp=bjhSkzF2K5Ft94TwdRs73WfSp4h-_vXvEKwgiQ@mail.gmail.com>

On Fri, 15 Jun 2018 10:06:07 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> > On Thu, 14 Jun 2018 18:57:11 -0700
> > Siwei Liu <loseweigh@gmail.com> wrote:
> >  
> >> Thank you for sharing your thoughts, Cornelia. With questions below, I
> >> think you raised really good points, some of which I don't have answer
> >> yet and would also like to explore here.
> >>
> >> First off, I don't want to push the discussion to the extreme at this
> >> point, or sell anything about having QEMU manage everything
> >> automatically. Don't get me wrong, it's not there yet. Let's don't
> >> assume we are tied to a specific or concerte solution. I think the key
> >> for our discussion might be to define or refine the boundary between
> >> VM and guest,  e.g. what each layer is expected to control and manage
> >> exactly.
> >>
> >> In my view, there might be possibly 3 different options to represent
> >> the failover device conceipt to QEMU and libvirt (or any upper layer
> >> software):
> >>
> >> a. Seperate device: in this model, virtio and passthough remains
> >> separate devices just as today. QEMU exposes the standby feature bit
> >> for virtio, and publish status/event around the negotiation process of
> >> this feature bit for libvirt to react upon. Since Libvirt has the
> >> pairing relationship itself, maybe through MAC address or something
> >> else, it can control the presence of primary by hot plugging or
> >> unplugging the passthrough device, although it has to work tightly
> >> with virtio's feature negotation process. Not just for migration but
> >> also various corner scenarios (driver/feature ok, device reset,
> >> reboot, legacy guest etc) along virtio's feature negotiation.  
> >
> > Yes, that one has obvious tie-ins to virtio's modus operandi.
> >  
> >>
> >> b. Coupled device: in this model, virtio and passthough devices are
> >> weakly coupled using some group ID, i.e. QEMU match the passthough
> >> device for a standby virtio instance by comparing the group ID value
> >> present behind each device's bridge. Libvirt provides QEMU the group
> >> ID for both type of devices, and only deals with hot plug for
> >> migration, by checking some migration status exposed (e.g. the feature
> >> negotiation status on the virtio device) by QEMU. QEMU manages the
> >> visibility of the primary in guest along virtio's feature negotiation
> >> process.  
> >
> > I'm a bit confused here. What, exactly, ties the two devices together?  
> 
> The group UUID. Since QEMU VFIO dvice does not have insight of MAC
> address (which it doesn't have to), the association between VFIO
> passthrough and standby must be specificed for QEMU to understand the
> relationship with this model. Note, standby feature is no longer
> required to be exposed under this model.

Isn't that a bit limiting, though?

With this model, you can probably tie a vfio-pci device and a
virtio-net-pci device together. But this will fail if you have
different transports: Consider tying together a vfio-pci device and a
virtio-net-ccw device on s390, for example. The standby feature bit is
on the virtio-net level and should not have any dependency on the
transport used.

> 
> > If libvirt already has the knowledge that it should manage the two as a
> > couple, why do we need the group id (or something else for other
> > architectures)? (Maybe I'm simply missing something because I'm not
> > that familiar with pci.)  
> 
> The idea is to have QEMU control the visibility and enumeration order
> of the passthrough VFIO for the failover scenario. Hotplug can be one
> way to achieve it, and perhaps there's other way around also. The
> group ID is not just for QEMU to couple devices, it's also helpful to
> guest too as grouping using MAC address is just not safe.

Sorry about dragging mainframes into this, but this will only work for
homogenous device coupling, not for heterogenous. Consider my vfio-pci
+ virtio-net-ccw example again: The guest cannot find out that the two
belong together by checking some group ID, it has to either use the MAC
or some needs-to-be-architectured property.

Alternatively, we could propose that mechanism as pci-only, which means
we can rely on mechanisms that won't necessarily work on non-pci
transports. (FWIW, I don't see a use case for using vfio-ccw to pass
through a network card anytime in the near future, due to the nature of
network cards currently in use on s390.)

> 
> >  
> >>
> >> c. Fully combined device: in this model, virtio and passthough devices
> >> are viewed as a single VM interface altogther. QEMU not just controls
> >> the visibility of the primary in guest, but can also manage the
> >> exposure of the passthrough for migratability. It can be like that
> >> libvirt supplies the group ID to QEMU. Or libvirt does not even have
> >> to provide group ID for grouping the two devices, if just one single
> >> combined device is exposed by QEMU. In either case, QEMU manages all
> >> aspect of such internal construct, including virtio feature
> >> negotiation, presence of the primary, and live migration.  
> >
> > Same question as above.
> >  
> >>
> >> It looks like to me that, in your opinion, you seem to prefer go with
> >> (a). While I'm actually okay with either (b) or (c). Do I understand
> >> your point correctly?  
> >
> > I'm not yet preferring anything, as I'm still trying to understand how
> > this works :) I hope we can arrive at a model that covers the use case
> > and that is also flexible enough to be extended to other platforms.
> >  
> >>
> >> The reason that I feel that (a) might not be ideal, just as Michael
> >> alluded to (quoting below), is that as management stack, it really
> >> doesn't need to care about the detailed process of feature negotiation
> >> (if we view the guest presence of the primary as part of feature
> >> negotiation at an extended level not just virtio). All it needs to be
> >> done is to hand in the required devices to QEMU and that's all. Why do
> >> we need to addd various hooks, events for whichever happens internally
> >> within the guest?
> >>
> >> ''
> >> Primary device is added with a special "primary-failover" flag.
> >> A virtual machine is then initialized with just a standby virtio
> >> device. Primary is not yet added.
> >>
> >> Later QEMU detects that guest driver device set DRIVER_OK.
> >> It then exposes the primary device to the guest, and triggers
> >> a device addition event (hot-plug event) for it.
> >>
> >> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
> >> to remove the primary driver.  In particular, if QEMU detects guest
> >> re-initialization (e.g. by detecting guest reset) it immediately removes
> >> the primary device.
> >> ''
> >>
> >> and,
> >>
> >> ''
> >> management just wants to give the primary to guest and later take it back,
> >> it really does not care about the details of the process,
> >> so I don't see what does pushing it up the stack buy you.
> >>
> >> So I don't think it *needs* to be done in libvirt. It probably can if you
> >> add a bunch of hooks so it knows whenever vm reboots, driver binds and
> >> unbinds from device, and can check that backup flag was set.
> >> If you are pushing for a setup like that please get a buy-in
> >> from libvirt maintainers or better write a patch.
> >> ''  
> >
> > This actually seems to mean the opposite to me: We need to know what
> > the guest is doing and when, as it directly drives what we need to do
> > with the devices. If we switch to a visibility vs a hotplug model (see
> > the other mail), we might be able to handle that part within qemu.  
> 
> In the model of (b), I think it essentially turns hotplug to one of
> mechanisms for QEMU to control the visibility. The libvirt can still
> manage the hotplug of individual devices during live migration or in
> normal situation to hot add/remove devices. Though the visibility of
> the VFIO is under the controll of QEMU, and it's possible that the hot
> add/remove request does not involve actual hot plug activity in guest
> at all.

That depends on how you model visibility, I guess. You'll probably want
to stop traffic flowing through one or the other of the cards; would
link down or similar be enough for the virtio device?

> 
> In the model of (c), the hotplug semantics of the combined device
> would mean differently - it would end up with devices plugged in or
> out altogther. To make this work, we either have to build a brand new
> bond-like QEMU device consist of virtio and VFIO internally, or need
> to have some abstraction in place for libvirt to manipulate the
> combined device (and prohibit libvirt from operating on individual
> internal device directly). Note with this model the group ID doesn't
> even need to get exposed to libvirt, just imagine libvirt to supply
> all options required to configure two regular virtio-net and VFIO
> devices for a single device object, and QEMU will deal with the
> device's visibility and enumeration, such when to hot plug VFIO device
> in to or out from the guest.
> 
> It might be complicated to implement (c) though.

I think (c) would be even more complicated for heterogenous setups.

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-06-19  3:05 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <286AC319A985734F985F78AFA26841F7396AA10C@shsmsx102.ccr.corp.intel.com>

On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote:
> On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> > so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> > 4TB, the optimization can still offer 2TB free memory (better than no
> > optimization).
> > 
> > Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
> > I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> > suggested for this which is to steal the necessary # of entries off the list.
> 
> Actually what Matthew suggested doesn't make a difference here. That method always steal the first free page blocks, and sure can be changed to take more. But all these can be achieved via kmalloc

I'd do get_user_pages really. You don't want pages split, etc.

> by the caller which is more prudent and makes the code more straightforward. I think we don't need to take that risk unless the MM folks strongly endorse that approach.
> 
> The max size of the kmalloc-ed memory is 4MB, which gives us the limitation that the max free memory to report is 2TB. Back to the motivation of this work, the cloud guys want to use this optimization to accelerate their guest live migration. 2TB guests are not common in today's clouds. When huge guests become common in the future, we can easily tweak this API to fill hints into scattered buffer (e.g. several 4MB arrays passed to this API) instead of one as in this version.
> 
> This limitation doesn't cause any issue from functionality perspective. For the extreme case like a 100TB guest live migration which is theoretically possible today, this optimization helps skip 2TB of its free memory. This result is that it may reduce only 2% live migration time, but still better than not skipping the 2TB (if not using the feature).

Not clearly better, no, since you are slowing the guest.


> So, for the first release of this feature, I think it is better to have the simpler and more straightforward solution as we have now, and clearly document why it can report up to 2TB free memory.

No one has the time to read documentation about how an internal flag
within a device works. Come on, getting two pages isn't much harder
than a single one.

> 
>  
> > If that doesn't fly, we can allocate out of the loop and just retry with more
> > pages.
> > 
> > > On the other hand, large guests being large mostly because the guests need
> > to use large memory. In that case, they usually won't have that much free
> > memory to report.
> > 
> > And following this logic small guests don't have a lot of memory to report at
> > all.
> > Could you remind me why are we considering this optimization then?
> 
> If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to use e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB free memory to report, but reporting 0.5TB with this optimization is better than no optimization. (and the current 2TB limitation isn't a limitation for the 3TB guest in this case)

I'd rather not spend time writing up random limitations.


> Best,
> Wei

^ permalink raw reply

* Re: [PATCH 0/3] Use sbitmap instead of percpu_ida
From: Martin K. Petersen @ 2018-06-19  2:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Juergen Gross, Jens Axboe, linux-scsi, Martin K. Petersen, netdev,
	linux-usb, linux-kernel, virtualization, target-devel, kvm,
	qla2xxx-upstream, linux1394-devel, Kent Overstreet
In-Reply-To: <20180615023731.GA5706@bombadil.infradead.org>


Matthew,

>> Since most of the changes are in scsi or target, should I take this
>> series through my tree?
>
> I'd welcome that.  Nick seems to be inactive as target maintainer;
> his tree on kernel.org hasn't seen any updates in five months.

Applied to 4.19/scsi-queue, thanks!

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* RE: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wang, Wei W @ 2018-06-19  1:06 UTC (permalink / raw)
  To: 'Michael S. Tsirkin'
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <20180618051637-mutt-send-email-mst@kernel.org>

On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> 4TB, the optimization can still offer 2TB free memory (better than no
> optimization).
> 
> Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
> I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> suggested for this which is to steal the necessary # of entries off the list.

Actually what Matthew suggested doesn't make a difference here. That method always steal the first free page blocks, and sure can be changed to take more. But all these can be achieved via kmalloc by the caller which is more prudent and makes the code more straightforward. I think we don't need to take that risk unless the MM folks strongly endorse that approach.

The max size of the kmalloc-ed memory is 4MB, which gives us the limitation that the max free memory to report is 2TB. Back to the motivation of this work, the cloud guys want to use this optimization to accelerate their guest live migration. 2TB guests are not common in today's clouds. When huge guests become common in the future, we can easily tweak this API to fill hints into scattered buffer (e.g. several 4MB arrays passed to this API) instead of one as in this version.

This limitation doesn't cause any issue from functionality perspective. For the extreme case like a 100TB guest live migration which is theoretically possible today, this optimization helps skip 2TB of its free memory. This result is that it may reduce only 2% live migration time, but still better than not skipping the 2TB (if not using the feature).

So, for the first release of this feature, I think it is better to have the simpler and more straightforward solution as we have now, and clearly document why it can report up to 2TB free memory.


 
> If that doesn't fly, we can allocate out of the loop and just retry with more
> pages.
> 
> > On the other hand, large guests being large mostly because the guests need
> to use large memory. In that case, they usually won't have that much free
> memory to report.
> 
> And following this logic small guests don't have a lot of memory to report at
> all.
> Could you remind me why are we considering this optimization then?

If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to use e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB free memory to report, but reporting 0.5TB with this optimization is better than no optimization. (and the current 2TB limitation isn't a limitation for the 3TB guest in this case)

Best,
Wei

^ permalink raw reply

* Re: Design Decision for KVM based anti rootkit
From: David Hildenbrand @ 2018-06-18 19:01 UTC (permalink / raw)
  To: Ahmed Soliman
  Cc: riel, Kees Cook, kvm, Ard Biesheuvel, Kernel Hardening,
	qemu-devel, virtualization, Hossam Hassan, Ahmed Lotfy
In-Reply-To: <CAAGnT3avoMOJP9zEdG2WmmxUe7m4S+5mDWezjJerzO4fq11cjQ@mail.gmail.com>

On 18.06.2018 18:35, Ahmed Soliman wrote:
> Shortly after I sent the first email, we found that there is another
> way to achieve this kind of communication, via KVM Hypercalls, I think
> they are underutilised in kvm, but they exist.
> 
> We also found that they are architecture dependent, but the advantage
> is that one doesn't need to create QEMU<-> kvm interface
> 
> So from our point of view it is either have things easily compatible
> with many architectures out of the box (virtio) VS compatiability with
> almost every front end including QEMU and any other one without
> modification (Hypercalls)?

My gut feeling (I might of course be wrong) is that hypercalls will not
be accepted easily in KVM (I assume only if it is really highly
specialized for e.g. x86 and/or required very early during boot and/or
has very specific performance requirements - e.g. pvspinlocks or kvmclock).

> 
> If it is the case, We might stick to hypercalls at the beginning,
> because it can be easily tested with out modifying QEMU, but then
> later we can move to virtio if there turned out to be clearer
> advantage, especially performance wise.

hypercalls might be good for prototyping, but I assume that the
challenging part will rather be a clean KVM mmu interface. And once you
have that, a kernel interface might not be too hard (I remember some
work being done by malwarebytes).

> 
> Does that sounds like good idea?
> I wanted to make sure because maybe maybe hypercalls aren't that much
> used in kvm for a reason, so I wanted to verify that.

I assume the same.

Another thing to note is performance, having to go via QEMU might be
performance wise not that good. (usually chunking is the answer to
reduce the overhead). But if it is really a "protect once, forget until
reboot" thing, that should not be relevant.

-- 

Thanks,

David / dhildenb

^ permalink raw reply

* Re: Design Decision for KVM based anti rootkit
From: Ahmed Soliman @ 2018-06-18 16:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: riel, Kees Cook, kvm, Ard Biesheuvel, Kernel Hardening,
	qemu-devel, virtualization, Hossam Hassan, Ahmed Lotfy
In-Reply-To: <7c7ddb96-e865-53a2-3aa9-b79403c646a9@redhat.com>

Shortly after I sent the first email, we found that there is another
way to achieve this kind of communication, via KVM Hypercalls, I think
they are underutilised in kvm, but they exist.

We also found that they are architecture dependent, but the advantage
is that one doesn't need to create QEMU<-> kvm interface

So from our point of view it is either have things easily compatible
with many architectures out of the box (virtio) VS compatiability with
almost every front end including QEMU and any other one without
modification (Hypercalls)?

If it is the case, We might stick to hypercalls at the beginning,
because it can be easily tested with out modifying QEMU, but then
later we can move to virtio if there turned out to be clearer
advantage, especially performance wise.

Does that sounds like good idea?
I wanted to make sure because maybe maybe hypercalls aren't that much
used in kvm for a reason, so I wanted to verify that.

On 18 June 2018 at 16:34, David Hildenbrand <david@redhat.com> wrote:
> On 16.06.2018 13:49, Ahmed Soliman wrote:
>> Following up on these threads:
>> - https://marc.info/?l=kvm&m=151929803301378&w=2
>> - http://www.openwall.com/lists/kernel-hardening/2018/02/22/18
>>
>> I lost the original emails so I couldn't reply to them, and also sorry
>> for being late, it was the end of semester exams.
>>
>> I was adviced on #qemu and #kernelnewbies IRCs to ask here as it will
>> help having better insights.
>>
>> To wrap things up, the basic design will be a method for communication
>> between host and guest is guest can request certain pages to be read
>> only, and then host will force them to be read-only by guest until
>> next guest reboot, then it will impossible for guest OS to have them
>> as RW again. The choice of which pages to be set as read only is the
>> guest's. So this way mixed pages can still be mixed with R/W content
>> even if holds kernel code.
>>
>> I was planning to use KVM as my hypervisor, until I found out that KVM
>> can't do that on its own so one will need a custom virtio driver to do
>> this kind of guest-host communication/coordination, I am still
>> sticking to KVM, and have no plans to do this for Xen at least for
>> now, this means that in order to get it to work there must be a QEMU
>> support our specific driver we are planning to write in order for
>> things to work properly.
>>
>> The question is is this the right approach? or is there a simpler way
>> to achieve this goal?
>>
>
> Especially if you want to support multiple architectures in the long
> term, virtio is the way to go.
>
> Design an architecture independent and extensible (+configurable)
> interface and be happy :) This might of course require some thought.
>
> (and don't worry, implementing a virtio driver is a lot simpler than you
> might think)
>
> But be aware that the virtio "hypervisor" side will be handled in QEMU,
> so you'll need a proper QEMU->KVM interface to get things running.
>
> --
>
> Thanks,
>
> David / dhildenb

^ permalink raw reply

* Re: Design Decision for KVM based anti rootkit
From: David Hildenbrand @ 2018-06-18 14:34 UTC (permalink / raw)
  To: Ahmed Soliman, kvm, Kernel Hardening, riel, Kees Cook,
	Ard Biesheuvel, Hossam Hassan, Ahmed Lotfy, virtualization,
	qemu-devel
In-Reply-To: <CAAGnT3bjYPu9bordn_Dh8z+MW6p5DDLoSsZC9xg8QxQriVus9A@mail.gmail.com>

On 16.06.2018 13:49, Ahmed Soliman wrote:
> Following up on these threads:
> - https://marc.info/?l=kvm&m=151929803301378&w=2
> - http://www.openwall.com/lists/kernel-hardening/2018/02/22/18
> 
> I lost the original emails so I couldn't reply to them, and also sorry
> for being late, it was the end of semester exams.
> 
> I was adviced on #qemu and #kernelnewbies IRCs to ask here as it will
> help having better insights.
> 
> To wrap things up, the basic design will be a method for communication
> between host and guest is guest can request certain pages to be read
> only, and then host will force them to be read-only by guest until
> next guest reboot, then it will impossible for guest OS to have them
> as RW again. The choice of which pages to be set as read only is the
> guest's. So this way mixed pages can still be mixed with R/W content
> even if holds kernel code.
> 
> I was planning to use KVM as my hypervisor, until I found out that KVM
> can't do that on its own so one will need a custom virtio driver to do
> this kind of guest-host communication/coordination, I am still
> sticking to KVM, and have no plans to do this for Xen at least for
> now, this means that in order to get it to work there must be a QEMU
> support our specific driver we are planning to write in order for
> things to work properly.
> 
> The question is is this the right approach? or is there a simpler way
> to achieve this goal?
> 

Especially if you want to support multiple architectures in the long
term, virtio is the way to go.

Design an architecture independent and extensible (+configurable)
interface and be happy :) This might of course require some thought.

(and don't worry, implementing a virtio driver is a lot simpler than you
might think)

But be aware that the virtio "hypervisor" side will be handled in QEMU,
so you'll need a proper QEMU->KVM interface to get things running.

-- 

Thanks,

David / dhildenb

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-18 13:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, qemu-devel, virtualization, Siwei Liu, Netdev,
	aaron.f.brown
In-Reply-To: <20180615152926-mutt-send-email-mst@kernel.org>

On Fri, 15 Jun 2018 15:31:43 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Fri, Jun 15, 2018 at 11:32:42AM +0200, Cornelia Huck wrote:
> > On Fri, 15 Jun 2018 05:34:24 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Thu, Jun 14, 2018 at 12:02:31PM +0200, Cornelia Huck wrote:  
> >   
> > > > > > I am not all that familiar with how Qemu manages network devices. If we can
> > > > > > do all the
> > > > > > required management of the primary/standby devices within Qemu, that is
> > > > > > definitely a better
> > > > > > approach without upper layer involvement.      
> > > > > 
> > > > > Right. I would imagine in the extreme case the upper layer doesn't
> > > > > have to be involved at all if QEMU manages all hot plug/unplug logic.
> > > > > The management tool can supply passthrough device and virtio with the
> > > > > same group UUID, QEMU auto-manages the presence of the primary, and
> > > > > hot plug the device as needed before or after the migration.    
> > > > 
> > > > I do not really see how you can manage that kind of stuff in QEMU only.    
> > > 
> > > So right now failover is limited to pci passthrough devices only.
> > > The idea is to realize the vfio device but not expose it
> > > to guest. Have a separate command to expose it to guest.
> > > Hotunplug would also hide it from guest but not unrealize it.  
> > 
> > So, this would not be real hot(un)plug, but 'hide it from the guest',
> > right? The concept of "we have it realized in QEMU, but the guest can't
> > discover and use it" should be translatable to non-pci as well (at
> > least for ccw).
> >   
> > > 
> > > This will help ensure that e.g. on migration failure we can
> > > re-expose the device without risk of running out of resources.  
> > 
> > Makes sense.
> > 
> > Should that 'hidden' state be visible/settable from outside as well
> > (e.g. via a property)? I guess yes, so that management software has a
> > chance to see whether a device is visible.  
> 
> Might be handy for debug, but note that since QEMU manages this
> state it's transient: can change at any time, so it's kind
> of hard for management to rely on it.

Might be another reason to have this controlled by management software;
being able to find out easily why a device is not visible to the guest
seems to be a useful thing.

Anyway, let's defer this discussion until it is clear how we actually
want to handle the whole setup.

> 
> > Settable may be useful if we
> > find another use case for hiding realized devices.  

^ permalink raw reply

* Re: v4.14.21+: ATOMIC_SLEEP splat bisected to 9428088c90b6 ("drm/qxl: reapply cursor after resetting primary")
From: Mike Galbraith @ 2018-06-18  3:13 UTC (permalink / raw)
  To: Dave Airlie, Greg KH
  Cc: Ray Strode, LKML, stable, open list:VIRTIO CORE, NET...,
	Dave Airlie
In-Reply-To: <CAPM=9tzkKWM2wjeQvWG2qKM-xgFwGD3bjWfgmDUDTQO+tXF-ig@mail.gmail.com>

On Mon, 2018-06-18 at 12:33 +1000, Dave Airlie wrote:
> On 17 June 2018 at 21:02, Greg KH <gregkh@linuxfoundation.org> wrote:
> > On Sun, Jun 17, 2018 at 12:38:06PM +0200, Mike Galbraith wrote:
> >> On Sun, 2018-06-17 at 11:36 +0200, Greg KH wrote:
> >> >
> >> > And if you revert that patch, does everything work again?
> >>
> >> Yes.
> >
> > Great, Dave, care to revert this in 4.18 so I can queue up that revert
> > in older kernels as well?
> >
> > thanks,
> >
> > greg k-h
> 
> Hi MIke,
> 
> Does
> https://patchwork.freedesktop.org/patch/229980/
> 
> fix it?

Yup.

	-Mike

^ permalink raw reply

* Re: v4.14.21+: ATOMIC_SLEEP splat bisected to 9428088c90b6 ("drm/qxl: reapply cursor after resetting primary")
From: Dave Airlie @ 2018-06-18  2:33 UTC (permalink / raw)
  To: Greg KH
  Cc: Ray Strode, Mike Galbraith, LKML, stable,
	open list:VIRTIO CORE, NET..., Dave Airlie
In-Reply-To: <20180617110200.GA19216@kroah.com>

On 17 June 2018 at 21:02, Greg KH <gregkh@linuxfoundation.org> wrote:
> On Sun, Jun 17, 2018 at 12:38:06PM +0200, Mike Galbraith wrote:
>> On Sun, 2018-06-17 at 11:36 +0200, Greg KH wrote:
>> >
>> > And if you revert that patch, does everything work again?
>>
>> Yes.
>
> Great, Dave, care to revert this in 4.18 so I can queue up that revert
> in older kernels as well?
>
> thanks,
>
> greg k-h

Hi MIke,

Does
https://patchwork.freedesktop.org/patch/229980/

fix it?

I can queue up a revert asap, but since a revert just brings us back
another broken behaviour, I'd rather we just hurry the fix up.

Dave.

^ permalink raw reply

* Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-06-18  2:28 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <286AC319A985734F985F78AFA26841F7396A5CB0@shsmsx102.ccr.corp.intel.com>

On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> On Friday, June 15, 2018 10:29 PM, Michael S. Tsirkin wrote:
> > On Fri, Jun 15, 2018 at 02:11:23PM +0000, Wang, Wei W wrote:
> > > On Friday, June 15, 2018 7:42 PM, Michael S. Tsirkin wrote:
> > > > On Fri, Jun 15, 2018 at 12:43:11PM +0800, Wei Wang wrote:
> > > > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature
> > > > > indicates the support of reporting hints of guest free pages to host via
> > virtio-balloon.
> > > > >
> > > > > Host requests the guest to report free page hints by sending a
> > > > > command to the guest via setting the
> > > > VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT
> > > > > bit of the host_cmd config register.
> > > > >
> > > > > As the first step here, virtio-balloon only reports free page
> > > > > hints from the max order (10) free page list to host. This has
> > > > > generated similar good results as reporting all free page hints during
> > our tests.
> > > > >
> > > > > TODO:
> > > > > - support reporting free page hints from smaller order free page lists
> > > > >   when there is a need/request from users.
> > > > >
> > > > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > > > Cc: Michael S. Tsirkin <mst@redhat.com>
> > > > > Cc: Michal Hocko <mhocko@kernel.org>
> > > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > > ---
> > > > >  drivers/virtio/virtio_balloon.c     | 187
> > +++++++++++++++++++++++++++++--
> > > > -----
> > > > >  include/uapi/linux/virtio_balloon.h |  13 +++
> > > > >  2 files changed, 163 insertions(+), 37 deletions(-)
> > > > >
> > > > > diff --git a/drivers/virtio/virtio_balloon.c
> > > > > b/drivers/virtio/virtio_balloon.c index 6b237e3..582a03b 100644
> > > > > --- a/drivers/virtio/virtio_balloon.c
> > > > > +++ b/drivers/virtio/virtio_balloon.c
> > > > > @@ -43,6 +43,9 @@
> > > > >  #define OOM_VBALLOON_DEFAULT_PAGES 256  #define
> > > > > VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > > > >
> > > > > +/* The size of memory in bytes allocated for reporting free page
> > > > > +hints */ #define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
> > > > > +
> > > > >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > > > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > > >
> > > > Doesn't this limit memory size of the guest we can report?
> > > > Apparently to several gigabytes ...
> > > > OTOH huge guests with lots of free memory is exactly where we would
> > > > gain the most ...
> > >
> > > Yes, the 16-page array can report up to 32GB (each page can hold 512
> > addresses of 4MB free page blocks, i.e. 2GB free memory per page) free
> > memory to host. It is not flexible.
> > >
> > > How about allocating the buffer according to the guest memory size
> > > (proportional)? That is,
> > >
> > > /* Calculates the maximum number of 4MB (equals to 1024 pages) free
> > > pages blocks that the system can have */ 4m_page_blocks =
> > > totalram_pages / 1024;
> > >
> > > /* Allocating one page can hold 512 free page blocks, so calculates
> > > the number of pages that can hold those 4MB blocks. And this
> > > allocation should not exceed 1024 pages */ pages_to_allocate =
> > > min(4m_page_blocks / 512, 1024);
> > >
> > > For a 2TB guests, which has 2^19 page blocks (4MB each), we will allocate
> > 1024 pages as the buffer.
> > >
> > > When the guest has large memory, it should be easier to succeed in
> > allocation of large buffer. If that allocation fails, that implies that nothing
> > would be got from the 4MB free page list.
> > >
> > > I think the proportional allocation is simpler compared to other
> > > approaches like
> > > - scattered buffer, which will complicate the get_from_free_page_list
> > > implementation;
> > > - one buffer to call get_from_free_page_list multiple times, which needs
> > get_from_free_page_list to maintain states.. also too complicated.
> > >
> > > Best,
> > > Wei
> > >
> > 
> > That's more reasonable, but question remains what to do if that value
> > exceeds MAX_ORDER. I'd say maybe tell host we can't report it.
> 
> Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above, so the maximum memory that can be reported is 2TB. For larger guests, e.g. 4TB, the optimization can still offer 2TB free memory (better than no optimization).

Maybe it's better, maybe it isn't. It certainly muddies the waters even
more.  I'd rather we had a better plan. From that POV I like what
Matthew Wilcox suggested for this which is to steal the necessary #
of entries off the list.

If that doesn't fly, we can allocate out of the loop and just retry with more
pages.

> On the other hand, large guests being large mostly because the guests need to use large memory. In that case, they usually won't have that much free memory to report.

And following this logic small guests don't have a lot of memory to report at all.
Could you remind me why are we considering this optimization then?

> > 
> > Also allocating it with GFP_KERNEL is out. You only want to take it off the free
> > list. So I guess __GFP_NOMEMALLOC and __GFP_ATOMIC.
> 
> Sounds good, thanks.
> 
> > Also you can't allocate this on device start. First totalram_pages can change.
> > Second that's too much memory to tie up forever.
> 
> Yes, makes sense.
> 
> Best,
> Wei

^ permalink raw reply

* Re: [PATCH v33 1/4] mm: add a function to get free page blocks
From: Michael S. Tsirkin @ 2018-06-18  2:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
	liliang.opensource, linux-kernel, virtualization, linux-mm,
	pbonzini, akpm, mhocko, torvalds
In-Reply-To: <20180616045005.GA14936@bombadil.infradead.org>

On Fri, Jun 15, 2018 at 09:50:05PM -0700, Matthew Wilcox wrote:
> I wonder if (to address Michael's concern), you shouldn't instead use
> the first free chunk of pages to return the addresses of all the pages.
> ie something like this:
> 
> 	__le64 *ret = NULL;
> 	unsigned int max = (PAGE_SIZE << order) / sizeof(__le64);
> 
> 	for_each_populated_zone(zone) {
> 		spin_lock_irq(&zone->lock);
> 		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> 			list = &zone->free_area[order].free_list[mt];
> 			list_for_each_entry_safe(page, list, lru, ...) {
> 				if (index == size)
> 					break;
> 				addr = page_to_pfn(page) << PAGE_SHIFT;
> 				if (!ret) {
> 					list_del(...);
> 					ret = addr;
> 				}
> 				ret[index++] = cpu_to_le64(addr);
> 			}
> 		}
> 		spin_unlock_irq(&zone->lock);
> 	}
> 
> 	return ret;
> }
> 
> You'll need to return the page to the freelist afterwards, but free_pages()
> should take care of that.

Yes Wei already came up with the idea to stick this data into a
MAX_ORDER allocation. Are you sure just taking an entry off
the list like that has no bad side effects?
I have a vague memory someone complained that everyone
most go through get free pages/kmalloc, but I can't
find that anymore.


-- 
MST

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox