Linux virtualization list
 help / color / mirror / Atom feed
* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-20 19:48 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180620180619.6b4ee52d.cohuck@redhat.com>

On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
> In any case, I'm not sure anymore why we'd want the extra uuid.

It's mostly so we can have e.g. multiple devices with same MAC
(which some people seem to want in order to then use
then with different containers).

But it is also handy for when you assign a PF, since then you
can't set the MAC.

-- 
MST

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-20 16:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180620170904-mutt-send-email-mst@kernel.org>

On Wed, 20 Jun 2018 17:11:59 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 20, 2018 at 11:53:59AM +0200, Cornelia Huck wrote:
> > On Tue, 19 Jun 2018 23:32:06 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:  
> > > > Sorry about dragging mainframes into this, but this will only work for
> > > > homogenous device coupling, not for heterogenous. Consider my vfio-pci
> > > > + virtio-net-ccw example again: The guest cannot find out that the two
> > > > belong together by checking some group ID, it has to either use the MAC
> > > > or some needs-to-be-architectured property.
> > > > 
> > > > Alternatively, we could propose that mechanism as pci-only, which means
> > > > we can rely on mechanisms that won't necessarily work on non-pci
> > > > transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> > > > through a network card anytime in the near future, due to the nature of
> > > > network cards currently in use on s390.)    
> > > 
> > > That's what it boils down to, yes.  If there's need to have this for
> > > non-pci devices, then we should put it in config space.
> > > Cornelia, what do you think?
> > >   
> > 
> > I think the only really useful config on s390 is the vfio-pci network
> > card coupled with a virtio-net-ccw device: Using an s390 network card
> > via vfio-ccw is out due to the nature of the s390 network cards, and
> > virtio-ccw is the default transport (virtio-pci is not supported on any
> > enterprise distro AFAIK).
> > 
> > For this, having a uuid in the config space could work (vfio-pci
> > devices have a config space by virtue of being pci devices, and
> > virtio-net-ccw devices have a config space by virtue of being virtio
> > devices -- ccw devices usually don't have that concept).  
> 
> OK so this calls for adding such a field generally (it's
> device agnostic right now).
> 
> How would you suggest doing that?

I hope that I'm not thoroughly confused at this point in time, so I'll
summarize my current understanding (also keep in mind that I haven't
looked at Venu's patches yet):

- The Linux guest initiates coupling from the virtio-net driver.
  Matching the other device is done via the MAC, and only pci devices
  are allowed for the failover device. (There does not seem to be any
  restriction on the transport of the virtio-net device.)
- The Linux guest virtio-net driver does not allow changing the MAC if
  standby has been negotiated (implying that the hypervisor needs to
  configure the correct MAC).
- In QEMU, we need to know which two devices (vfio-pci and virtio-net)
  go together, so that the virtio-net device gets the correct MAC. We
  also need the pairing so that we can make the vfio-pci device
  available once the guest has negotiated the standby feature.

We can tack the two devices together in QEMU by introducing new,
optional properties pointing from the virtio-net device to the vfio-pci
device (only offer standby if this is set) and the other way around
(don't make the device visible at the start if this is set). Problems:

- The admin needs to figure out the MAC by themselves and set it
  correctly. If this is incorrect, the vfio-pci device cannot be found
  in the guest. (Not sure how much of a problem this is in practice --
  and QEMU cannot figure out the MAC without poking at the vfio-pci
  device, and we probably want to avoid that.)
- This two-way pointing makes for interesting handing of the command
  line and when both devices are plugged later.

In any case, I'm not sure anymore why we'd want the extra uuid. Is
there any way QEMU (or libvirt) can figure it out without actually
looking at the vfio-pci device?

^ permalink raw reply

* CFP SENSORNETS 2019 - 8th Int.l Conf. on Sensor Networks (Prague/Czech Republic)
From: sensornets @ 2018-06-20 14:54 UTC (permalink / raw)
  To: virtualization

SUBMISSION DEADLINE 

8th International Conference on Sensor Networks

Submission Deadline: October 1, 2018

http://www.sensornets.org/

February 26 - 27, 2019
Prague, Czech Republic.

 SENSORNETS is organized in 5 major tracks:

 - Sensor Networks Software, Architectures and Applications
 - Wireless Sensor Networks
 - Energy and Environment
 - Intelligent Data Analysis and Processing
 - Security and Privacy in Sensor Networks


In Cooperation with: EUROGRAPHICS and AFIG. <br/>
Proceedings will be submitted for indexation by: DBLP, DBLP, Thomson Reuters, EI, SCOPUS, Semantic Scholar and Semantic Scholar. <br/>
 
A short list of presented papers will be selected so that revised and extended versions of these papers will be published by Springer.
 
All papers presented at the congress venue will also be available at the SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/).
  
Should you have any question please don’t hesitate contacting me.
 

Kind regards,
SENSORNETS Secretariat

Address: Av. D. Manuel I, 27A, 2º esq.
2910-595 Setubal, Portugal
Tel: +351 265 100 033
Fax: +351 265 520 186
Web: http://www.sensornets.org/
e-mail: sensornets.secretariat@insticc.org

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-20 14:34 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, konrad.wilk, qemu-devel,
	virtualization, Venu Busireddy, Netdev, boris.ostrovsky,
	aaron.f.brown, Joao Martins
In-Reply-To: <CADGSJ20wrF418-WVev6OjaeShZWE488uxRwgaZmsjOHZr8jyOw@mail.gmail.com>

On Tue, 19 Jun 2018 13:09:14 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Tue, Jun 19, 2018 at 3:54 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> > On Fri, 15 Jun 2018 10:06:07 -0700
> > Siwei Liu <loseweigh@gmail.com> wrote:
> >  
> >> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck <cohuck@redhat.com> wrote:  
> >> > On Thu, 14 Jun 2018 18:57:11 -0700
> >> > Siwei Liu <loseweigh@gmail.com> wrote:

> >> > I'm a bit confused here. What, exactly, ties the two devices together?  
> >>
> >> The group UUID. Since QEMU VFIO dvice does not have insight of MAC
> >> address (which it doesn't have to), the association between VFIO
> >> passthrough and standby must be specificed for QEMU to understand the
> >> relationship with this model. Note, standby feature is no longer
> >> required to be exposed under this model.  
> >
> > Isn't that a bit limiting, though?
> >
> > With this model, you can probably tie a vfio-pci device and a
> > virtio-net-pci device together. But this will fail if you have
> > different transports: Consider tying together a vfio-pci device and a
> > virtio-net-ccw device on s390, for example. The standby feature bit is
> > on the virtio-net level and should not have any dependency on the
> > transport used.  
> 
> Probably we'd limit the support for grouping to virtio-net-pci device
> and vfio-pci device only. For virtio-net-pci, as you might see with
> Venu's patch, we store the group UUID on the config space of
> virtio-pci, which is only applicable to PCI transport.
> 
> If virtio-net-ccw needs to support the same, I think similar grouping
> interface should be defined on the VirtIO CCW transport. I think the
> current implementation of the Linux failover driver assumes that it's
> SR-IOV VF with same MAC address which the virtio-net-pci needs to pair
> with, and that the PV path is on same PF without needing to update
> network of the port-MAC association change. If we need to extend the
> grouping mechanism to virtio-net-ccw, it has to pass such failover
> mode to virtio driver specifically through some other option I guess.

Hm, I've just spent some time reading the Linux failover code and I did
not really find much pci-related magic in there (other than checking
for a pci device in net_failover_slave_pre_register). We also seem to
look for a matching device by MAC only. What magic am I missing?

Is the look-for-uuid handling supposed to happen in the host only?

> >> > If libvirt already has the knowledge that it should manage the two as a
> >> > couple, why do we need the group id (or something else for other
> >> > architectures)? (Maybe I'm simply missing something because I'm not
> >> > that familiar with pci.)  
> >>
> >> The idea is to have QEMU control the visibility and enumeration order
> >> of the passthrough VFIO for the failover scenario. Hotplug can be one
> >> way to achieve it, and perhaps there's other way around also. The
> >> group ID is not just for QEMU to couple devices, it's also helpful to
> >> guest too as grouping using MAC address is just not safe.  
> >
> > Sorry about dragging mainframes into this, but this will only work for
> > homogenous device coupling, not for heterogenous. Consider my vfio-pci
> > + virtio-net-ccw example again: The guest cannot find out that the two
> > belong together by checking some group ID, it has to either use the MAC
> > or some needs-to-be-architectured property.
> >
> > Alternatively, we could propose that mechanism as pci-only, which means
> > we can rely on mechanisms that won't necessarily work on non-pci
> > transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> > through a network card anytime in the near future, due to the nature of
> > network cards currently in use on s390.)  
> 
> Yes, let's do this just for PCI transport (homogenous) for now.

But why? Using pci for passthrough to make things easier (and because
there's not really a use case), sure. But I really don't want to
restrict this to virtio-pci only.

> >> In the model of (b), I think it essentially turns hotplug to one of
> >> mechanisms for QEMU to control the visibility. The libvirt can still
> >> manage the hotplug of individual devices during live migration or in
> >> normal situation to hot add/remove devices. Though the visibility of
> >> the VFIO is under the controll of QEMU, and it's possible that the hot
> >> add/remove request does not involve actual hot plug activity in guest
> >> at all.  
> >
> > That depends on how you model visibility, I guess. You'll probably want
> > to stop traffic flowing through one or the other of the cards; would
> > link down or similar be enough for the virtio device?  
> 
> I'm not sure if it is a good idea. The guest user will see two devices
> with same MAC but one of them is down. Do you expect user to use it or
> not? And since the guest is going to be migrated, we need to unplug a
> broken VF from guest before migrating, why do we bother plugging in
> this useless VF at the first place?

I was thinking about using hotunplugging only over migration and doing
the link up only after feature negotiation has finished, but that is
probably too complicated. Let's stick to hotplug for simplicity's sake.

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-06-20 14:14 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <286AC319A985734F985F78AFA26841F7396AE2EC@shsmsx102.ccr.corp.intel.com>

On Wed, Jun 20, 2018 at 09:11:39AM +0000, Wang, Wei W wrote:
> On Tuesday, June 19, 2018 10:43 PM, Michael S. Tsirk wrote:
> > On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote:
> > > On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> > > > On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote:
> > > > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > > > > On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > > > > > > Not necessarily, I think. We have min(4m_page_blocks / 512,
> > > > > > > 1024) above,
> > > > > > so the maximum memory that can be reported is 2TB. For larger
> > guests, e.g.
> > > > > > 4TB, the optimization can still offer 2TB free memory (better
> > > > > > than no optimization).
> > > > > >
> > > > > > Maybe it's better, maybe it isn't. It certainly muddies the waters even
> > more.
> > > > > > I'd rather we had a better plan. From that POV I like what
> > > > > > Matthew Wilcox suggested for this which is to steal the necessary # of
> > entries off the list.
> > > > > Actually what Matthew suggested doesn't make a difference here.
> > > > > That method always steal the first free page blocks, and sure can
> > > > > be changed to take more. But all these can be achieved via kmalloc
> > > > I'd do get_user_pages really. You don't want pages split, etc.
> > 
> > Oops sorry. I meant get_free_pages .
> 
> Yes, we can use __get_free_pages, and the max allocation is MAX_ORDER - 1, which can report up to 2TB free memory. 
> 
> "getting two pages isn't harder", do you mean passing two arrays (two allocations by get_free_pages(,MAX_ORDER -1)) to the mm API?

Yes, or generally a list of pages with as many as needed.


> Please see if the following logic aligns to what you think:
> 
>         uint32_t i, max_hints, hints_per_page, hints_per_array, total_arrays;
>         unsigned long *arrays;
>  
>      /*
>          * Each array size is MAX_ORDER_NR_PAGES. If one array is not enough to
>          * store all the hints, we need to allocate multiple arrays.
>          * max_hints: the max number of 4MB free page blocks
>          * hints_per_page: the number of hints each page can store
>          * hints_per_array: the number of hints an array can store
>          * total_arrays: the number of arrays we need
>          */
>         max_hints = totalram_pages / MAX_ORDER_NR_PAGES;
>         hints_per_page = PAGE_SIZE / sizeof(__le64);
>         hints_per_array = hints_per_page * MAX_ORDER_NR_PAGES;
>         total_arrays = max_hints /  hints_per_array +
>                        !!(max_hints % hints_per_array);
>         arrays = kmalloc(total_arrays * sizeof(unsigned long), GFP_KERNEL);
>         for (i = 0; i < total_arrays; i++) {
>                 arrays[i] = __get_free_pages(__GFP_ATOMIC | __GFP_NOMEMALLOC, MAX_ORDER - 1);
> 
>               if (!arrays[i])
>                       goto out;
>         }
> 
> 
> - the mm API needs to be changed to support storing hints to multiple separated arrays offered by the caller.
> 
> Best,
> Wei

Yes. And add an API to just count entries so we know how many arrays to allocate.

-- 
MST

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-20 14:11 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180620115359.1a3bf6fb.cohuck@redhat.com>

On Wed, Jun 20, 2018 at 11:53:59AM +0200, Cornelia Huck wrote:
> On Tue, 19 Jun 2018 23:32:06 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:
> > > Sorry about dragging mainframes into this, but this will only work for
> > > homogenous device coupling, not for heterogenous. Consider my vfio-pci
> > > + virtio-net-ccw example again: The guest cannot find out that the two
> > > belong together by checking some group ID, it has to either use the MAC
> > > or some needs-to-be-architectured property.
> > > 
> > > Alternatively, we could propose that mechanism as pci-only, which means
> > > we can rely on mechanisms that won't necessarily work on non-pci
> > > transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> > > through a network card anytime in the near future, due to the nature of
> > > network cards currently in use on s390.)  
> > 
> > That's what it boils down to, yes.  If there's need to have this for
> > non-pci devices, then we should put it in config space.
> > Cornelia, what do you think?
> > 
> 
> I think the only really useful config on s390 is the vfio-pci network
> card coupled with a virtio-net-ccw device: Using an s390 network card
> via vfio-ccw is out due to the nature of the s390 network cards, and
> virtio-ccw is the default transport (virtio-pci is not supported on any
> enterprise distro AFAIK).
> 
> For this, having a uuid in the config space could work (vfio-pci
> devices have a config space by virtue of being pci devices, and
> virtio-net-ccw devices have a config space by virtue of being virtio
> devices -- ccw devices usually don't have that concept).

OK so this calls for adding such a field generally (it's
device agnostic right now).

How would you suggest doing that?


-- 
MST

^ permalink raw reply

* [PATCH] net: vhost: improve performance when enable busyloop
From: Tonghao Zhang @ 2018-06-20 13:28 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, Tonghao Zhang, virtualization

This patch improves the guest receive performance from
host. On the handle_tx side, we poll the sock receive
queue at the same time. handle_rx do that in the same way.

we set the poll-us=100 us and use the iperf3 to test
its throughput. The iperf3 command is shown as below.

iperf3 -s -D
iperf3 -c 192.168.1.100 -i 1 -P 10 -t 10 -M 1400 --bandwidth 100000M

* With the patch:    21.1 Gbits/sec
* Without the patch: 12.7 Gbits/sec

Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
---
 drivers/vhost/net.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e7cf7d2..9364ede 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -429,22 +429,43 @@ static int vhost_net_enable_vq(struct vhost_net *n,
 	return vhost_poll_start(poll, sock->file);
 }
 
+static int sk_has_rx_data(struct sock *sk);
+
 static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 				    struct vhost_virtqueue *vq,
 				    struct iovec iov[], unsigned int iov_size,
 				    unsigned int *out_num, unsigned int *in_num)
 {
 	unsigned long uninitialized_var(endtime);
+	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_RX];
+	struct vhost_virtqueue *rvq = &nvq->vq;
+	struct socket *sock = rvq->private_data;
+
 	int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 				  out_num, in_num, NULL, NULL);
 
 	if (r == vq->num && vq->busyloop_timeout) {
+		mutex_lock_nested(&rvq->mutex, 1);
+
+		vhost_disable_notify(&net->dev, rvq);
+
 		preempt_disable();
 		endtime = busy_clock() + vq->busyloop_timeout;
 		while (vhost_can_busy_poll(vq->dev, endtime) &&
+		       !(sock && sk_has_rx_data(sock->sk)) &&
 		       vhost_vq_avail_empty(vq->dev, vq))
 			cpu_relax();
 		preempt_enable();
+
+		if (sock && sk_has_rx_data(sock->sk))
+			vhost_poll_queue(&rvq->poll);
+		else if (unlikely(vhost_enable_notify(&net->dev, rvq))) {
+			vhost_disable_notify(&net->dev, rvq);
+			vhost_poll_queue(&rvq->poll);
+		}
+
+		mutex_unlock(&rvq->mutex);
+
 		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 				      out_num, in_num, NULL, NULL);
 	}
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH v5 0/9] x86: macrofying inline asm for better compilation
From: Peter Zijlstra @ 2018-06-20 10:11 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Juergen Gross, Kate Stewart, Kees Cook, Josh Poimboeuf,
	Philippe Ombredanne, Greg Kroah-Hartman, Christopher Li, x86,
	linux-kernel, Linus Torvalds, virtualization, Masahiro Yamada,
	linux-sparse, Ingo Molnar, Jan Beulich, H. Peter Anvin,
	Alok Kataria, Sam Ravnborg, Thomas Gleixner
In-Reply-To: <20180619194854.69486-1-namit@vmware.com>

On Tue, Jun 19, 2018 at 12:48:45PM -0700, Nadav Amit wrote:
> Nadav Amit (9):
>   Makefile: Prepare for using macros for inline asm
>   x86: objtool: use asm macro for better compiler decisions
>   x86: refcount: prevent gcc distortions
>   x86: alternatives: macrofy locks for better inlining
>   x86: bug: prevent gcc distortions
>   x86: prevent inline distortion by paravirt ops
>   x86: extable: use macros instead of inline assembly
>   x86: cpufeature: use macros instead of inline assembly
>   x86: jump-labels: use macros instead of inline assembly
> 
>  Makefile                               |  9 ++-
>  arch/x86/Makefile                      | 11 ++-
>  arch/x86/include/asm/alternative-asm.h | 20 ++++--
>  arch/x86/include/asm/alternative.h     | 11 +--
>  arch/x86/include/asm/asm.h             | 61 +++++++---------
>  arch/x86/include/asm/bug.h             | 98 +++++++++++++++-----------
>  arch/x86/include/asm/cpufeature.h      | 82 ++++++++++++---------
>  arch/x86/include/asm/jump_label.h      | 65 ++++++++++-------
>  arch/x86/include/asm/paravirt_types.h  | 56 +++++++--------
>  arch/x86/include/asm/refcount.h        | 74 +++++++++++--------
>  arch/x86/kernel/macros.S               | 16 +++++
>  include/asm-generic/bug.h              |  8 +--
>  include/linux/compiler.h               | 56 +++++++++++----
>  scripts/Kbuild.include                 |  4 +-
>  scripts/mod/Makefile                   |  2 +
>  15 files changed, 340 insertions(+), 233 deletions(-)
>  create mode 100644 arch/x86/kernel/macros.S

Aside from the one niggle:

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply

* Re: [PATCH v5 6/9] x86: prevent inline distortion by paravirt ops
From: Peter Zijlstra @ 2018-06-20 10:10 UTC (permalink / raw)
  To: Nadav Amit
  Cc: x86, linux-kernel, virtualization, Ingo Molnar, H. Peter Anvin,
	Alok Kataria, Thomas Gleixner
In-Reply-To: <20180619194854.69486-7-namit@vmware.com>

On Tue, Jun 19, 2018 at 12:48:51PM -0700, Nadav Amit wrote:
> +#define paravirt_alt							\
> +	"PARAVIRT_CALL type=\"%c[paravirt_typenum]\""			\
> +	" clobber=\"%c[paravirt_clobber]\""				\
> +	" pv_opptr=\"%c[paravirt_opptr]\""

That wants to be:

+     " pv_opptr=\"%c[paravirt_opptr]\";"

And other than that I would suggest: 's/paravirt_alt/paravirt_call/g'.

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-20  9:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180619233001-mutt-send-email-mst@kernel.org>

On Tue, 19 Jun 2018 23:32:06 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:
> > Sorry about dragging mainframes into this, but this will only work for
> > homogenous device coupling, not for heterogenous. Consider my vfio-pci
> > + virtio-net-ccw example again: The guest cannot find out that the two
> > belong together by checking some group ID, it has to either use the MAC
> > or some needs-to-be-architectured property.
> > 
> > Alternatively, we could propose that mechanism as pci-only, which means
> > we can rely on mechanisms that won't necessarily work on non-pci
> > transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> > through a network card anytime in the near future, due to the nature of
> > network cards currently in use on s390.)  
> 
> That's what it boils down to, yes.  If there's need to have this for
> non-pci devices, then we should put it in config space.
> Cornelia, what do you think?
> 

I think the only really useful config on s390 is the vfio-pci network
card coupled with a virtio-net-ccw device: Using an s390 network card
via vfio-ccw is out due to the nature of the s390 network cards, and
virtio-ccw is the default transport (virtio-pci is not supported on any
enterprise distro AFAIK).

For this, having a uuid in the config space could work (vfio-pci
devices have a config space by virtue of being pci devices, and
virtio-net-ccw devices have a config space by virtue of being virtio
devices -- ccw devices usually don't have that concept).

^ permalink raw reply

* RE: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wang, Wei W @ 2018-06-20  9:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <20180619173256-mutt-send-email-mst@kernel.org>

On Tuesday, June 19, 2018 10:43 PM, Michael S. Tsirk wrote:
> On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote:
> > On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> > > On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote:
> > > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > > > On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > > > > > Not necessarily, I think. We have min(4m_page_blocks / 512,
> > > > > > 1024) above,
> > > > > so the maximum memory that can be reported is 2TB. For larger
> guests, e.g.
> > > > > 4TB, the optimization can still offer 2TB free memory (better
> > > > > than no optimization).
> > > > >
> > > > > Maybe it's better, maybe it isn't. It certainly muddies the waters even
> more.
> > > > > I'd rather we had a better plan. From that POV I like what
> > > > > Matthew Wilcox suggested for this which is to steal the necessary # of
> entries off the list.
> > > > Actually what Matthew suggested doesn't make a difference here.
> > > > That method always steal the first free page blocks, and sure can
> > > > be changed to take more. But all these can be achieved via kmalloc
> > > I'd do get_user_pages really. You don't want pages split, etc.
> 
> Oops sorry. I meant get_free_pages .

Yes, we can use __get_free_pages, and the max allocation is MAX_ORDER - 1, which can report up to 2TB free memory. 

"getting two pages isn't harder", do you mean passing two arrays (two allocations by get_free_pages(,MAX_ORDER -1)) to the mm API?

Please see if the following logic aligns to what you think:

        uint32_t i, max_hints, hints_per_page, hints_per_array, total_arrays;
        unsigned long *arrays;
 
     /*
         * Each array size is MAX_ORDER_NR_PAGES. If one array is not enough to
         * store all the hints, we need to allocate multiple arrays.
         * max_hints: the max number of 4MB free page blocks
         * hints_per_page: the number of hints each page can store
         * hints_per_array: the number of hints an array can store
         * total_arrays: the number of arrays we need
         */
        max_hints = totalram_pages / MAX_ORDER_NR_PAGES;
        hints_per_page = PAGE_SIZE / sizeof(__le64);
        hints_per_array = hints_per_page * MAX_ORDER_NR_PAGES;
        total_arrays = max_hints /  hints_per_array +
                       !!(max_hints % hints_per_array);
        arrays = kmalloc(total_arrays * sizeof(unsigned long), GFP_KERNEL);
        for (i = 0; i < total_arrays; i++) {
                arrays[i] = __get_free_pages(__GFP_ATOMIC | __GFP_NOMEMALLOC, MAX_ORDER - 1);

              if (!arrays[i])
                      goto out;
        }


- the mm API needs to be changed to support storing hints to multiple separated arrays offered by the caller.

Best,
Wei

^ permalink raw reply

* Re: [PATCH v5 6/9] x86: prevent inline distortion by paravirt ops
From: kbuild test robot @ 2018-06-20  5:05 UTC (permalink / raw)
  Cc: x86, linux-kernel, virtualization, Ingo Molnar, Nadav Amit,
	kbuild-all, H. Peter Anvin, Alok Kataria, Thomas Gleixner
In-Reply-To: <20180619194854.69486-7-namit@vmware.com>

[-- Attachment #1: Type: text/plain, Size: 4337 bytes --]

Hi Nadav,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18-rc1 next-20180619]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Nadav-Amit/x86-macrofying-inline-asm-for-better-compilation/20180620-112043
config: i386-randconfig-x016-201824 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   arch/x86/include/asm/paravirt.h: Assembler messages:
>> arch/x86/include/asm/paravirt.h:253: Error: can't mix positional and keyword arguments
--
   arch/x86/include/asm/paravirt.h: Assembler messages:
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
>> arch/x86/include/asm/paravirt.h:253: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments

vim +253 arch/x86/include/asm/paravirt.h

014b15be include/asm-x86/paravirt.h  Glauber de Oliveira Costa 2008-01-30  249  
014b15be include/asm-x86/paravirt.h  Glauber de Oliveira Costa 2008-01-30  250  static inline void write_gdt_entry(struct desc_struct *dt, int entry,
014b15be include/asm-x86/paravirt.h  Glauber de Oliveira Costa 2008-01-30  251  				   void *desc, int type)
f8822f42 include/asm-i386/paravirt.h Jeremy Fitzhardinge       2007-05-02  252  {
014b15be include/asm-x86/paravirt.h  Glauber de Oliveira Costa 2008-01-30 @253  	PVOP_VCALL4(pv_cpu_ops.write_gdt_entry, dt, entry, desc, type);
f8822f42 include/asm-i386/paravirt.h Jeremy Fitzhardinge       2007-05-02  254  }
014b15be include/asm-x86/paravirt.h  Glauber de Oliveira Costa 2008-01-30  255  

:::::: The code at line 253 was first introduced by commit
:::::: 014b15be30c04622d130946ab7c0a9101b523a8a x86: change write_gdt_entry signature.

:::::: TO: Glauber de Oliveira Costa <gcosta@redhat.com>
:::::: CC: Ingo Molnar <mingo@elte.hu>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 33213 bytes --]

[-- Attachment #3: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-19 20:32 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180619125453.2d2dfb2d.cohuck@redhat.com>

On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:
> Sorry about dragging mainframes into this, but this will only work for
> homogenous device coupling, not for heterogenous. Consider my vfio-pci
> + virtio-net-ccw example again: The guest cannot find out that the two
> belong together by checking some group ID, it has to either use the MAC
> or some needs-to-be-architectured property.
> 
> Alternatively, we could propose that mechanism as pci-only, which means
> we can rely on mechanisms that won't necessarily work on non-pci
> transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> through a network card anytime in the near future, due to the nature of
> network cards currently in use on s390.)

That's what it boils down to, yes.  If there's need to have this for
non-pci devices, then we should put it in config space.
Cornelia, what do you think?

-- 
MST

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-19 20:09 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, konrad.wilk, qemu-devel,
	virtualization, Venu Busireddy, Netdev, boris.ostrovsky,
	aaron.f.brown, Joao Martins
In-Reply-To: <20180619125453.2d2dfb2d.cohuck@redhat.com>

On Tue, Jun 19, 2018 at 3:54 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> On Fri, 15 Jun 2018 10:06:07 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>> > On Thu, 14 Jun 2018 18:57:11 -0700
>> > Siwei Liu <loseweigh@gmail.com> wrote:
>> >
>> >> Thank you for sharing your thoughts, Cornelia. With questions below, I
>> >> think you raised really good points, some of which I don't have answer
>> >> yet and would also like to explore here.
>> >>
>> >> First off, I don't want to push the discussion to the extreme at this
>> >> point, or sell anything about having QEMU manage everything
>> >> automatically. Don't get me wrong, it's not there yet. Let's don't
>> >> assume we are tied to a specific or concerte solution. I think the key
>> >> for our discussion might be to define or refine the boundary between
>> >> VM and guest,  e.g. what each layer is expected to control and manage
>> >> exactly.
>> >>
>> >> In my view, there might be possibly 3 different options to represent
>> >> the failover device conceipt to QEMU and libvirt (or any upper layer
>> >> software):
>> >>
>> >> a. Seperate device: in this model, virtio and passthough remains
>> >> separate devices just as today. QEMU exposes the standby feature bit
>> >> for virtio, and publish status/event around the negotiation process of
>> >> this feature bit for libvirt to react upon. Since Libvirt has the
>> >> pairing relationship itself, maybe through MAC address or something
>> >> else, it can control the presence of primary by hot plugging or
>> >> unplugging the passthrough device, although it has to work tightly
>> >> with virtio's feature negotation process. Not just for migration but
>> >> also various corner scenarios (driver/feature ok, device reset,
>> >> reboot, legacy guest etc) along virtio's feature negotiation.
>> >
>> > Yes, that one has obvious tie-ins to virtio's modus operandi.
>> >
>> >>
>> >> b. Coupled device: in this model, virtio and passthough devices are
>> >> weakly coupled using some group ID, i.e. QEMU match the passthough
>> >> device for a standby virtio instance by comparing the group ID value
>> >> present behind each device's bridge. Libvirt provides QEMU the group
>> >> ID for both type of devices, and only deals with hot plug for
>> >> migration, by checking some migration status exposed (e.g. the feature
>> >> negotiation status on the virtio device) by QEMU. QEMU manages the
>> >> visibility of the primary in guest along virtio's feature negotiation
>> >> process.
>> >
>> > I'm a bit confused here. What, exactly, ties the two devices together?
>>
>> The group UUID. Since QEMU VFIO dvice does not have insight of MAC
>> address (which it doesn't have to), the association between VFIO
>> passthrough and standby must be specificed for QEMU to understand the
>> relationship with this model. Note, standby feature is no longer
>> required to be exposed under this model.
>
> Isn't that a bit limiting, though?
>
> With this model, you can probably tie a vfio-pci device and a
> virtio-net-pci device together. But this will fail if you have
> different transports: Consider tying together a vfio-pci device and a
> virtio-net-ccw device on s390, for example. The standby feature bit is
> on the virtio-net level and should not have any dependency on the
> transport used.

Probably we'd limit the support for grouping to virtio-net-pci device
and vfio-pci device only. For virtio-net-pci, as you might see with
Venu's patch, we store the group UUID on the config space of
virtio-pci, which is only applicable to PCI transport.

If virtio-net-ccw needs to support the same, I think similar grouping
interface should be defined on the VirtIO CCW transport. I think the
current implementation of the Linux failover driver assumes that it's
SR-IOV VF with same MAC address which the virtio-net-pci needs to pair
with, and that the PV path is on same PF without needing to update
network of the port-MAC association change. If we need to extend the
grouping mechanism to virtio-net-ccw, it has to pass such failover
mode to virtio driver specifically through some other option I guess.

>
>>
>> > If libvirt already has the knowledge that it should manage the two as a
>> > couple, why do we need the group id (or something else for other
>> > architectures)? (Maybe I'm simply missing something because I'm not
>> > that familiar with pci.)
>>
>> The idea is to have QEMU control the visibility and enumeration order
>> of the passthrough VFIO for the failover scenario. Hotplug can be one
>> way to achieve it, and perhaps there's other way around also. The
>> group ID is not just for QEMU to couple devices, it's also helpful to
>> guest too as grouping using MAC address is just not safe.
>
> Sorry about dragging mainframes into this, but this will only work for
> homogenous device coupling, not for heterogenous. Consider my vfio-pci
> + virtio-net-ccw example again: The guest cannot find out that the two
> belong together by checking some group ID, it has to either use the MAC
> or some needs-to-be-architectured property.
>
> Alternatively, we could propose that mechanism as pci-only, which means
> we can rely on mechanisms that won't necessarily work on non-pci
> transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> through a network card anytime in the near future, due to the nature of
> network cards currently in use on s390.)

Yes, let's do this just for PCI transport (homogenous) for now.

>
>>
>> >
>> >>
>> >> c. Fully combined device: in this model, virtio and passthough devices
>> >> are viewed as a single VM interface altogther. QEMU not just controls
>> >> the visibility of the primary in guest, but can also manage the
>> >> exposure of the passthrough for migratability. It can be like that
>> >> libvirt supplies the group ID to QEMU. Or libvirt does not even have
>> >> to provide group ID for grouping the two devices, if just one single
>> >> combined device is exposed by QEMU. In either case, QEMU manages all
>> >> aspect of such internal construct, including virtio feature
>> >> negotiation, presence of the primary, and live migration.
>> >
>> > Same question as above.
>> >
>> >>
>> >> It looks like to me that, in your opinion, you seem to prefer go with
>> >> (a). While I'm actually okay with either (b) or (c). Do I understand
>> >> your point correctly?
>> >
>> > I'm not yet preferring anything, as I'm still trying to understand how
>> > this works :) I hope we can arrive at a model that covers the use case
>> > and that is also flexible enough to be extended to other platforms.
>> >
>> >>
>> >> The reason that I feel that (a) might not be ideal, just as Michael
>> >> alluded to (quoting below), is that as management stack, it really
>> >> doesn't need to care about the detailed process of feature negotiation
>> >> (if we view the guest presence of the primary as part of feature
>> >> negotiation at an extended level not just virtio). All it needs to be
>> >> done is to hand in the required devices to QEMU and that's all. Why do
>> >> we need to addd various hooks, events for whichever happens internally
>> >> within the guest?
>> >>
>> >> ''
>> >> Primary device is added with a special "primary-failover" flag.
>> >> A virtual machine is then initialized with just a standby virtio
>> >> device. Primary is not yet added.
>> >>
>> >> Later QEMU detects that guest driver device set DRIVER_OK.
>> >> It then exposes the primary device to the guest, and triggers
>> >> a device addition event (hot-plug event) for it.
>> >>
>> >> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
>> >> to remove the primary driver.  In particular, if QEMU detects guest
>> >> re-initialization (e.g. by detecting guest reset) it immediately removes
>> >> the primary device.
>> >> ''
>> >>
>> >> and,
>> >>
>> >> ''
>> >> management just wants to give the primary to guest and later take it back,
>> >> it really does not care about the details of the process,
>> >> so I don't see what does pushing it up the stack buy you.
>> >>
>> >> So I don't think it *needs* to be done in libvirt. It probably can if you
>> >> add a bunch of hooks so it knows whenever vm reboots, driver binds and
>> >> unbinds from device, and can check that backup flag was set.
>> >> If you are pushing for a setup like that please get a buy-in
>> >> from libvirt maintainers or better write a patch.
>> >> ''
>> >
>> > This actually seems to mean the opposite to me: We need to know what
>> > the guest is doing and when, as it directly drives what we need to do
>> > with the devices. If we switch to a visibility vs a hotplug model (see
>> > the other mail), we might be able to handle that part within qemu.
>>
>> In the model of (b), I think it essentially turns hotplug to one of
>> mechanisms for QEMU to control the visibility. The libvirt can still
>> manage the hotplug of individual devices during live migration or in
>> normal situation to hot add/remove devices. Though the visibility of
>> the VFIO is under the controll of QEMU, and it's possible that the hot
>> add/remove request does not involve actual hot plug activity in guest
>> at all.
>
> That depends on how you model visibility, I guess. You'll probably want
> to stop traffic flowing through one or the other of the cards; would
> link down or similar be enough for the virtio device?

I'm not sure if it is a good idea. The guest user will see two devices
with same MAC but one of them is down. Do you expect user to use it or
not? And since the guest is going to be migrated, we need to unplug a
broken VF from guest before migrating, why do we bother plugging in
this useless VF at the first place?


Thanks,
-Siwei
>
>>
>> In the model of (c), the hotplug semantics of the combined device
>> would mean differently - it would end up with devices plugged in or
>> out altogther. To make this work, we either have to build a brand new
>> bond-like QEMU device consist of virtio and VFIO internally, or need
>> to have some abstraction in place for libvirt to manipulate the
>> combined device (and prohibit libvirt from operating on individual
>> internal device directly). Note with this model the group ID doesn't
>> even need to get exposed to libvirt, just imagine libvirt to supply
>> all options required to configure two regular virtio-net and VFIO
>> devices for a single device object, and QEMU will deal with the
>> device's visibility and enumeration, such when to hot plug VFIO device
>> in to or out from the guest.
>>
>> It might be complicated to implement (c) though.
>
> I think (c) would be even more complicated for heterogenous setups.

^ permalink raw reply

* Re: Design Decision for KVM based anti rootkit
From: Ahmed Soliman @ 2018-06-19 18:12 UTC (permalink / raw)
  To: David Vrabel
  Cc: riel, Kees Cook, kvm, Ard Biesheuvel, Kernel Hardening,
	qemu-devel, virtualization, Hossam Hassan, Ahmed Lotfy
In-Reply-To: <7337582f-0007-d006-0809-cf41fd93b31e@nutanix.com>

On 19 June 2018 at 19:37, David Vrabel <david.vrabel@nutanix.com> wrote:
> It's not clear how this increases security. What threats is this
> protecting again?
It won't completely protect prevent rootkits, because still rootkits
can edit dynamic kernel data structures, but it will limit what
rootkits damage to only dynamic data.
This way system calls can't be changed, or Interrupt tables.
> As an attacker, modifying the sensitive pages (kernel text?) will
> require either: a) altering the existing mappings for these (to make
> them read-write or user-writable for example); or b) creating aliased
> mappings with suitable permissions.
>
> If the attacker can modify page tables in this way then it can also
> bypass the suggested hypervisor's read-only protection by changing the
> mappings to point to a unprotected page.

I think I was missing this part out, but I meant to say completely
prevent any modification to pages including the guest physical address
to guest virtual address mapping for those protected pages, Another
tricky (something random just popped up in my mind right now, better
to say it than to forget it) solution is making new memory mappings
inherit the same protection as old one, I assume that Hyper visor can
do either things. Also that was the kind of performance hit I was
talking about. I am not sure if that might break things or I can say
it will for sure heavily limit some functionalities. like maybe
hibernating guest. But that will be the kind of trades off I am
expecting at least at the begining.

^ permalink raw reply

* Re: [PATCH v4 6/9] x86: prevent inline distortion by paravirt ops
From: Juergen Gross @ 2018-06-19 15:09 UTC (permalink / raw)
  To: Nadav Amit, linux-kernel, x86
  Cc: Alok Kataria, Ingo Molnar, virtualization, Thomas Gleixner,
	H. Peter Anvin
In-Reply-To: <20180612115050.185112-7-namit@vmware.com>

On 12/06/18 13:50, Nadav Amit wrote:
> GCC considers the number of statements in inlined assembly blocks,
> according to new-lines and semicolons, as an indication to the cost of
> the block in time and space. This data is distorted by the kernel code,
> which puts information in alternative sections. As a result, the
> compiler may perform incorrect inlining and branch optimizations.
> 
> The solution is to set an assembly macro and call it from the inlined
> assembly block. As a result GCC considers the inline assembly block as
> a single instruction.
> 
> The effect of the patch is a more aggressive inlining, which also
> causes a size increase of kernel.
> 
>    text	   data	    bss	    dec	    hex	filename
> 18147336 10226688 2957312 31331336 1de1408 ./vmlinux before
> 18162555 10226288 2957312 31346155 1de4deb ./vmlinux after (+14819)
> 
> Static text symbols:
> Before:	40053
> After:	39942	(-111)
> 
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Alok Kataria <akataria@vmware.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: virtualization@lists.linux-foundation.org
> 
> Signed-off-by: Nadav Amit <namit@vmware.com>

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen

^ permalink raw reply

* Re: [PATCH v5 0/3] extern inline native_save_fl for paravirt
From: Juergen Gross @ 2018-06-19 15:03 UTC (permalink / raw)
  To: Nick Desaulniers, akpm, hpa, mingo, tglx
  Cc: kstewart, andrea.parri, linux-efi, brijesh.singh, jan.kiszka,
	will.deacon, jarkko.sakkinen, linux-kernel, yamada.masahiro,
	manojgupta, akataria, tweek, mawilcox, x86, ghackmann, mka, geert,
	rientjes, aryabinin, thomas.lendacky, arnd, linux-kbuild,
	keescook, rostedt, acme, caoj.fnst, jpoimboe, sedat.dilek,
	boris.ostrovsky, virtualization, michal.lkml, tstellar, gregkh,
	ard.biesheuvel, astrachan, mjg59, pombre
In-Reply-To: <20180613210518.113983-1-ndesaulniers@google.com>

On 13/06/18 23:05, Nick Desaulniers wrote:
> paravirt depends on a custom calling convention (callee saved), but
> expects this from a static inline function that it then forces to be
> outlined. This is problematic because different compilers or flags can
> then add a stack guard that violates the calling conventions.
> 
> Uses extern inline with the out-of-line definition in assembly to
> prevent compilers from adding stack guards to the outlined version.
> 
> Other parts of the codebase overwrite KBUILD_CFLAGS, which is *extremely
> problematic* for extern inline, as the sematics are completely the
> opposite depending on what C standard is used.
> http://blahg.josefsipek.net/?p=529
> 
> Changes since v4:
>   Take Arnd's and hpa's suggestions properly feature detect gnu_inline
>   attribute to support gcc 4.1.
> 
> Changes since v3:
>   Take Joe's suggestion to hoist __inline__ and __inline out of
>   conditional block.
> 
> Changes since v2:
>   Take hpa's _ASM_ARG patch into the set in order to simplify cross
>   32b/64b x86 assembly and rebase my final patch to use it.  Apply
>   Sedat's typo fix to commit message and sign off on it. Take Joe's
>   suggestion to simplify __inline__ and __inline. Add Arnd to list of
>   folks who made helpful suggestions.
> 
> Changes since v1:
>   Prefer gnu_inline function attribute instead of explicitly setting C
>   standard compiler flag in problematic Makefiles. We should instead
>   carefully evaluate if those Makefiles should be overwriting
>   KBUILD_CFLAGS at all. Dropped the previous first two patches and added
>   a new first patch.
> 
> H. Peter Anvin (1):
>   x86/asm: add _ASM_ARG* constants for argument registers to <asm/asm.h>
> 
> Nick Desaulniers (2):
>   compiler-gcc.h: add gnu_inline to all inline declarations
>   x86: paravirt: make native_save_fl extern inline
> 
>  arch/x86/include/asm/asm.h      | 59 +++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/irqflags.h |  2 +-
>  arch/x86/kernel/Makefile        |  1 +
>  arch/x86/kernel/irqflags.S      | 26 +++++++++++++++
>  include/linux/compiler-gcc.h    | 29 ++++++++++++----
>  5 files changed, 109 insertions(+), 8 deletions(-)
>  create mode 100644 arch/x86/kernel/irqflags.S
> 

For the series:

Acked-by: Juergen Gross <jgross@suse.com>


Juergen

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-06-19 14:43 UTC (permalink / raw)
  To: Wei Wang
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <5B28F371.9020308@intel.com>

On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote:
> On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> > On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote:
> > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > > On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > > > > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> > > > so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> > > > 4TB, the optimization can still offer 2TB free memory (better than no
> > > > optimization).
> > > > 
> > > > Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
> > > > I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> > > > suggested for this which is to steal the necessary # of entries off the list.
> > > Actually what Matthew suggested doesn't make a difference here. That method always steal the first free page blocks, and sure can be changed to take more. But all these can be achieved via kmalloc
> > I'd do get_user_pages really. You don't want pages split, etc.

Oops sorry. I meant get_free_pages .

> 
> > > by the caller which is more prudent and makes the code more straightforward. I think we don't need to take that risk unless the MM folks strongly endorse that approach.
> > > 
> > > The max size of the kmalloc-ed memory is 4MB, which gives us the limitation that the max free memory to report is 2TB. Back to the motivation of this work, the cloud guys want to use this optimization to accelerate their guest live migration. 2TB guests are not common in today's clouds. When huge guests become common in the future, we can easily tweak this API to fill hints into scattered buffer (e.g. several 4MB arrays passed to this API) instead of one as in this version.
> > > 
> > > This limitation doesn't cause any issue from functionality perspective. For the extreme case like a 100TB guest live migration which is theoretically possible today, this optimization helps skip 2TB of its free memory. This result is that it may reduce only 2% live migration time, but still better than not skipping the 2TB (if not using the feature).
> > Not clearly better, no, since you are slowing the guest.
> 
> Not really. Live migration slows down the guest itself. It seems that the
> guest spends a little extra time reporting free pages, but in return the
> live migration time gets reduced a lot, which makes the guest endure less
> from live migration. (there is no drop of the workload performance when
> using the optimization in the tests)

My point was you can't say what is better without measuring.
Without special limitations you have hint overhead vs migration
overhead. I think we need to  build to scale to huge guests.
We might discover scalability problems down the road,
but no sense in building in limitations straight away.

> 
> 
> > 
> > 
> > > So, for the first release of this feature, I think it is better to have the simpler and more straightforward solution as we have now, and clearly document why it can report up to 2TB free memory.
> > No one has the time to read documentation about how an internal flag
> > within a device works. Come on, getting two pages isn't much harder
> > than a single one.
> 
> > > > If that doesn't fly, we can allocate out of the loop and just retry with more
> > > > pages.
> > > > 
> > > > > On the other hand, large guests being large mostly because the guests need
> > > > to use large memory. In that case, they usually won't have that much free
> > > > memory to report.
> > > > 
> > > > And following this logic small guests don't have a lot of memory to report at
> > > > all.
> > > > Could you remind me why are we considering this optimization then?
> > > If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to use e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB free memory to report, but reporting 0.5TB with this optimization is better than no optimization. (and the current 2TB limitation isn't a limitation for the 3TB guest in this case)
> > I'd rather not spend time writing up random limitations.
> 
> This is not a random limitation. It would be more clear to see the code.

Users don't see code though, that's the point.

Exporting internal limitations from code to users isn't great.


> Also I'm not sure how get_user_pages could be used in our case, and what you
> meant by "getting two pages". I'll post out a new version, and we can
> discuss on the code.

Sorry, I meant get_free_pages.

> 
> Best,
> Wei

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wei Wang @ 2018-06-19 12:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <20180619055449-mutt-send-email-mst@kernel.org>

On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote:
>> On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
>>> On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
>>>> Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
>>> so the maximum memory that can be reported is 2TB. For larger guests, e.g.
>>> 4TB, the optimization can still offer 2TB free memory (better than no
>>> optimization).
>>>
>>> Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
>>> I'd rather we had a better plan. From that POV I like what Matthew Wilcox
>>> suggested for this which is to steal the necessary # of entries off the list.
>> Actually what Matthew suggested doesn't make a difference here. That method always steal the first free page blocks, and sure can be changed to take more. But all these can be achieved via kmalloc
> I'd do get_user_pages really. You don't want pages split, etc.

>> by the caller which is more prudent and makes the code more straightforward. I think we don't need to take that risk unless the MM folks strongly endorse that approach.
>>
>> The max size of the kmalloc-ed memory is 4MB, which gives us the limitation that the max free memory to report is 2TB. Back to the motivation of this work, the cloud guys want to use this optimization to accelerate their guest live migration. 2TB guests are not common in today's clouds. When huge guests become common in the future, we can easily tweak this API to fill hints into scattered buffer (e.g. several 4MB arrays passed to this API) instead of one as in this version.
>>
>> This limitation doesn't cause any issue from functionality perspective. For the extreme case like a 100TB guest live migration which is theoretically possible today, this optimization helps skip 2TB of its free memory. This result is that it may reduce only 2% live migration time, but still better than not skipping the 2TB (if not using the feature).
> Not clearly better, no, since you are slowing the guest.

Not really. Live migration slows down the guest itself. It seems that 
the guest spends a little extra time reporting free pages, but in return 
the live migration time gets reduced a lot, which makes the guest endure 
less from live migration. (there is no drop of the workload performance 
when using the optimization in the tests)



>
>
>> So, for the first release of this feature, I think it is better to have the simpler and more straightforward solution as we have now, and clearly document why it can report up to 2TB free memory.
> No one has the time to read documentation about how an internal flag
> within a device works. Come on, getting two pages isn't much harder
> than a single one.

>>   
>>> If that doesn't fly, we can allocate out of the loop and just retry with more
>>> pages.
>>>
>>>> On the other hand, large guests being large mostly because the guests need
>>> to use large memory. In that case, they usually won't have that much free
>>> memory to report.
>>>
>>> And following this logic small guests don't have a lot of memory to report at
>>> all.
>>> Could you remind me why are we considering this optimization then?
>> If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to use e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB free memory to report, but reporting 0.5TB with this optimization is better than no optimization. (and the current 2TB limitation isn't a limitation for the 3TB guest in this case)
> I'd rather not spend time writing up random limitations.

This is not a random limitation. It would be more clear to see the code. 
Also I'm not sure how get_user_pages could be used in our case, and what 
you meant by "getting two pages". I'll post out a new version, and we 
can discuss on the code.


Best,
Wei

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-19 10:54 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, konrad.wilk, qemu-devel,
	virtualization, Venu Busireddy, Netdev, boris.ostrovsky,
	aaron.f.brown, Joao Martins
In-Reply-To: <CADGSJ208NRp=bjhSkzF2K5Ft94TwdRs73WfSp4h-_vXvEKwgiQ@mail.gmail.com>

On Fri, 15 Jun 2018 10:06:07 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> > On Thu, 14 Jun 2018 18:57:11 -0700
> > Siwei Liu <loseweigh@gmail.com> wrote:
> >  
> >> Thank you for sharing your thoughts, Cornelia. With questions below, I
> >> think you raised really good points, some of which I don't have answer
> >> yet and would also like to explore here.
> >>
> >> First off, I don't want to push the discussion to the extreme at this
> >> point, or sell anything about having QEMU manage everything
> >> automatically. Don't get me wrong, it's not there yet. Let's don't
> >> assume we are tied to a specific or concerte solution. I think the key
> >> for our discussion might be to define or refine the boundary between
> >> VM and guest,  e.g. what each layer is expected to control and manage
> >> exactly.
> >>
> >> In my view, there might be possibly 3 different options to represent
> >> the failover device conceipt to QEMU and libvirt (or any upper layer
> >> software):
> >>
> >> a. Seperate device: in this model, virtio and passthough remains
> >> separate devices just as today. QEMU exposes the standby feature bit
> >> for virtio, and publish status/event around the negotiation process of
> >> this feature bit for libvirt to react upon. Since Libvirt has the
> >> pairing relationship itself, maybe through MAC address or something
> >> else, it can control the presence of primary by hot plugging or
> >> unplugging the passthrough device, although it has to work tightly
> >> with virtio's feature negotation process. Not just for migration but
> >> also various corner scenarios (driver/feature ok, device reset,
> >> reboot, legacy guest etc) along virtio's feature negotiation.  
> >
> > Yes, that one has obvious tie-ins to virtio's modus operandi.
> >  
> >>
> >> b. Coupled device: in this model, virtio and passthough devices are
> >> weakly coupled using some group ID, i.e. QEMU match the passthough
> >> device for a standby virtio instance by comparing the group ID value
> >> present behind each device's bridge. Libvirt provides QEMU the group
> >> ID for both type of devices, and only deals with hot plug for
> >> migration, by checking some migration status exposed (e.g. the feature
> >> negotiation status on the virtio device) by QEMU. QEMU manages the
> >> visibility of the primary in guest along virtio's feature negotiation
> >> process.  
> >
> > I'm a bit confused here. What, exactly, ties the two devices together?  
> 
> The group UUID. Since QEMU VFIO dvice does not have insight of MAC
> address (which it doesn't have to), the association between VFIO
> passthrough and standby must be specificed for QEMU to understand the
> relationship with this model. Note, standby feature is no longer
> required to be exposed under this model.

Isn't that a bit limiting, though?

With this model, you can probably tie a vfio-pci device and a
virtio-net-pci device together. But this will fail if you have
different transports: Consider tying together a vfio-pci device and a
virtio-net-ccw device on s390, for example. The standby feature bit is
on the virtio-net level and should not have any dependency on the
transport used.

> 
> > If libvirt already has the knowledge that it should manage the two as a
> > couple, why do we need the group id (or something else for other
> > architectures)? (Maybe I'm simply missing something because I'm not
> > that familiar with pci.)  
> 
> The idea is to have QEMU control the visibility and enumeration order
> of the passthrough VFIO for the failover scenario. Hotplug can be one
> way to achieve it, and perhaps there's other way around also. The
> group ID is not just for QEMU to couple devices, it's also helpful to
> guest too as grouping using MAC address is just not safe.

Sorry about dragging mainframes into this, but this will only work for
homogenous device coupling, not for heterogenous. Consider my vfio-pci
+ virtio-net-ccw example again: The guest cannot find out that the two
belong together by checking some group ID, it has to either use the MAC
or some needs-to-be-architectured property.

Alternatively, we could propose that mechanism as pci-only, which means
we can rely on mechanisms that won't necessarily work on non-pci
transports. (FWIW, I don't see a use case for using vfio-ccw to pass
through a network card anytime in the near future, due to the nature of
network cards currently in use on s390.)

> 
> >  
> >>
> >> c. Fully combined device: in this model, virtio and passthough devices
> >> are viewed as a single VM interface altogther. QEMU not just controls
> >> the visibility of the primary in guest, but can also manage the
> >> exposure of the passthrough for migratability. It can be like that
> >> libvirt supplies the group ID to QEMU. Or libvirt does not even have
> >> to provide group ID for grouping the two devices, if just one single
> >> combined device is exposed by QEMU. In either case, QEMU manages all
> >> aspect of such internal construct, including virtio feature
> >> negotiation, presence of the primary, and live migration.  
> >
> > Same question as above.
> >  
> >>
> >> It looks like to me that, in your opinion, you seem to prefer go with
> >> (a). While I'm actually okay with either (b) or (c). Do I understand
> >> your point correctly?  
> >
> > I'm not yet preferring anything, as I'm still trying to understand how
> > this works :) I hope we can arrive at a model that covers the use case
> > and that is also flexible enough to be extended to other platforms.
> >  
> >>
> >> The reason that I feel that (a) might not be ideal, just as Michael
> >> alluded to (quoting below), is that as management stack, it really
> >> doesn't need to care about the detailed process of feature negotiation
> >> (if we view the guest presence of the primary as part of feature
> >> negotiation at an extended level not just virtio). All it needs to be
> >> done is to hand in the required devices to QEMU and that's all. Why do
> >> we need to addd various hooks, events for whichever happens internally
> >> within the guest?
> >>
> >> ''
> >> Primary device is added with a special "primary-failover" flag.
> >> A virtual machine is then initialized with just a standby virtio
> >> device. Primary is not yet added.
> >>
> >> Later QEMU detects that guest driver device set DRIVER_OK.
> >> It then exposes the primary device to the guest, and triggers
> >> a device addition event (hot-plug event) for it.
> >>
> >> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
> >> to remove the primary driver.  In particular, if QEMU detects guest
> >> re-initialization (e.g. by detecting guest reset) it immediately removes
> >> the primary device.
> >> ''
> >>
> >> and,
> >>
> >> ''
> >> management just wants to give the primary to guest and later take it back,
> >> it really does not care about the details of the process,
> >> so I don't see what does pushing it up the stack buy you.
> >>
> >> So I don't think it *needs* to be done in libvirt. It probably can if you
> >> add a bunch of hooks so it knows whenever vm reboots, driver binds and
> >> unbinds from device, and can check that backup flag was set.
> >> If you are pushing for a setup like that please get a buy-in
> >> from libvirt maintainers or better write a patch.
> >> ''  
> >
> > This actually seems to mean the opposite to me: We need to know what
> > the guest is doing and when, as it directly drives what we need to do
> > with the devices. If we switch to a visibility vs a hotplug model (see
> > the other mail), we might be able to handle that part within qemu.  
> 
> In the model of (b), I think it essentially turns hotplug to one of
> mechanisms for QEMU to control the visibility. The libvirt can still
> manage the hotplug of individual devices during live migration or in
> normal situation to hot add/remove devices. Though the visibility of
> the VFIO is under the controll of QEMU, and it's possible that the hot
> add/remove request does not involve actual hot plug activity in guest
> at all.

That depends on how you model visibility, I guess. You'll probably want
to stop traffic flowing through one or the other of the cards; would
link down or similar be enough for the virtio device?

> 
> In the model of (c), the hotplug semantics of the combined device
> would mean differently - it would end up with devices plugged in or
> out altogther. To make this work, we either have to build a brand new
> bond-like QEMU device consist of virtio and VFIO internally, or need
> to have some abstraction in place for libvirt to manipulate the
> combined device (and prohibit libvirt from operating on individual
> internal device directly). Note with this model the group ID doesn't
> even need to get exposed to libvirt, just imagine libvirt to supply
> all options required to configure two regular virtio-net and VFIO
> devices for a single device object, and QEMU will deal with the
> device's visibility and enumeration, such when to hot plug VFIO device
> in to or out from the guest.
> 
> It might be complicated to implement (c) though.

I think (c) would be even more complicated for heterogenous setups.

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-06-19  3:05 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <286AC319A985734F985F78AFA26841F7396AA10C@shsmsx102.ccr.corp.intel.com>

On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote:
> On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> > so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> > 4TB, the optimization can still offer 2TB free memory (better than no
> > optimization).
> > 
> > Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
> > I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> > suggested for this which is to steal the necessary # of entries off the list.
> 
> Actually what Matthew suggested doesn't make a difference here. That method always steal the first free page blocks, and sure can be changed to take more. But all these can be achieved via kmalloc

I'd do get_user_pages really. You don't want pages split, etc.

> by the caller which is more prudent and makes the code more straightforward. I think we don't need to take that risk unless the MM folks strongly endorse that approach.
> 
> The max size of the kmalloc-ed memory is 4MB, which gives us the limitation that the max free memory to report is 2TB. Back to the motivation of this work, the cloud guys want to use this optimization to accelerate their guest live migration. 2TB guests are not common in today's clouds. When huge guests become common in the future, we can easily tweak this API to fill hints into scattered buffer (e.g. several 4MB arrays passed to this API) instead of one as in this version.
> 
> This limitation doesn't cause any issue from functionality perspective. For the extreme case like a 100TB guest live migration which is theoretically possible today, this optimization helps skip 2TB of its free memory. This result is that it may reduce only 2% live migration time, but still better than not skipping the 2TB (if not using the feature).

Not clearly better, no, since you are slowing the guest.


> So, for the first release of this feature, I think it is better to have the simpler and more straightforward solution as we have now, and clearly document why it can report up to 2TB free memory.

No one has the time to read documentation about how an internal flag
within a device works. Come on, getting two pages isn't much harder
than a single one.

> 
>  
> > If that doesn't fly, we can allocate out of the loop and just retry with more
> > pages.
> > 
> > > On the other hand, large guests being large mostly because the guests need
> > to use large memory. In that case, they usually won't have that much free
> > memory to report.
> > 
> > And following this logic small guests don't have a lot of memory to report at
> > all.
> > Could you remind me why are we considering this optimization then?
> 
> If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to use e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB free memory to report, but reporting 0.5TB with this optimization is better than no optimization. (and the current 2TB limitation isn't a limitation for the 3TB guest in this case)

I'd rather not spend time writing up random limitations.


> Best,
> Wei

^ permalink raw reply

* Re: [PATCH 0/3] Use sbitmap instead of percpu_ida
From: Martin K. Petersen @ 2018-06-19  2:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Juergen Gross, Jens Axboe, linux-scsi, Martin K. Petersen, netdev,
	linux-usb, linux-kernel, virtualization, target-devel, kvm,
	qla2xxx-upstream, linux1394-devel, Kent Overstreet
In-Reply-To: <20180615023731.GA5706@bombadil.infradead.org>


Matthew,

>> Since most of the changes are in scsi or target, should I take this
>> series through my tree?
>
> I'd welcome that.  Nick seems to be inactive as target maintainer;
> his tree on kernel.org hasn't seen any updates in five months.

Applied to 4.19/scsi-queue, thanks!

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* RE: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wang, Wei W @ 2018-06-19  1:06 UTC (permalink / raw)
  To: 'Michael S. Tsirkin'
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <20180618051637-mutt-send-email-mst@kernel.org>

On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> 4TB, the optimization can still offer 2TB free memory (better than no
> optimization).
> 
> Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
> I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> suggested for this which is to steal the necessary # of entries off the list.

Actually what Matthew suggested doesn't make a difference here. That method always steal the first free page blocks, and sure can be changed to take more. But all these can be achieved via kmalloc by the caller which is more prudent and makes the code more straightforward. I think we don't need to take that risk unless the MM folks strongly endorse that approach.

The max size of the kmalloc-ed memory is 4MB, which gives us the limitation that the max free memory to report is 2TB. Back to the motivation of this work, the cloud guys want to use this optimization to accelerate their guest live migration. 2TB guests are not common in today's clouds. When huge guests become common in the future, we can easily tweak this API to fill hints into scattered buffer (e.g. several 4MB arrays passed to this API) instead of one as in this version.

This limitation doesn't cause any issue from functionality perspective. For the extreme case like a 100TB guest live migration which is theoretically possible today, this optimization helps skip 2TB of its free memory. This result is that it may reduce only 2% live migration time, but still better than not skipping the 2TB (if not using the feature).

So, for the first release of this feature, I think it is better to have the simpler and more straightforward solution as we have now, and clearly document why it can report up to 2TB free memory.


 
> If that doesn't fly, we can allocate out of the loop and just retry with more
> pages.
> 
> > On the other hand, large guests being large mostly because the guests need
> to use large memory. In that case, they usually won't have that much free
> memory to report.
> 
> And following this logic small guests don't have a lot of memory to report at
> all.
> Could you remind me why are we considering this optimization then?

If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to use e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB free memory to report, but reporting 0.5TB with this optimization is better than no optimization. (and the current 2TB limitation isn't a limitation for the 3TB guest in this case)

Best,
Wei

^ permalink raw reply

* Re: Design Decision for KVM based anti rootkit
From: David Hildenbrand @ 2018-06-18 19:01 UTC (permalink / raw)
  To: Ahmed Soliman
  Cc: riel, Kees Cook, kvm, Ard Biesheuvel, Kernel Hardening,
	qemu-devel, virtualization, Hossam Hassan, Ahmed Lotfy
In-Reply-To: <CAAGnT3avoMOJP9zEdG2WmmxUe7m4S+5mDWezjJerzO4fq11cjQ@mail.gmail.com>

On 18.06.2018 18:35, Ahmed Soliman wrote:
> Shortly after I sent the first email, we found that there is another
> way to achieve this kind of communication, via KVM Hypercalls, I think
> they are underutilised in kvm, but they exist.
> 
> We also found that they are architecture dependent, but the advantage
> is that one doesn't need to create QEMU<-> kvm interface
> 
> So from our point of view it is either have things easily compatible
> with many architectures out of the box (virtio) VS compatiability with
> almost every front end including QEMU and any other one without
> modification (Hypercalls)?

My gut feeling (I might of course be wrong) is that hypercalls will not
be accepted easily in KVM (I assume only if it is really highly
specialized for e.g. x86 and/or required very early during boot and/or
has very specific performance requirements - e.g. pvspinlocks or kvmclock).

> 
> If it is the case, We might stick to hypercalls at the beginning,
> because it can be easily tested with out modifying QEMU, but then
> later we can move to virtio if there turned out to be clearer
> advantage, especially performance wise.

hypercalls might be good for prototyping, but I assume that the
challenging part will rather be a clean KVM mmu interface. And once you
have that, a kernel interface might not be too hard (I remember some
work being done by malwarebytes).

> 
> Does that sounds like good idea?
> I wanted to make sure because maybe maybe hypercalls aren't that much
> used in kvm for a reason, so I wanted to verify that.

I assume the same.

Another thing to note is performance, having to go via QEMU might be
performance wise not that good. (usually chunking is the answer to
reduce the overhead). But if it is really a "protect once, forget until
reboot" thing, that should not be relevant.

-- 

Thanks,

David / dhildenb

^ permalink raw reply

* Re: Design Decision for KVM based anti rootkit
From: Ahmed Soliman @ 2018-06-18 16:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: riel, Kees Cook, kvm, Ard Biesheuvel, Kernel Hardening,
	qemu-devel, virtualization, Hossam Hassan, Ahmed Lotfy
In-Reply-To: <7c7ddb96-e865-53a2-3aa9-b79403c646a9@redhat.com>

Shortly after I sent the first email, we found that there is another
way to achieve this kind of communication, via KVM Hypercalls, I think
they are underutilised in kvm, but they exist.

We also found that they are architecture dependent, but the advantage
is that one doesn't need to create QEMU<-> kvm interface

So from our point of view it is either have things easily compatible
with many architectures out of the box (virtio) VS compatiability with
almost every front end including QEMU and any other one without
modification (Hypercalls)?

If it is the case, We might stick to hypercalls at the beginning,
because it can be easily tested with out modifying QEMU, but then
later we can move to virtio if there turned out to be clearer
advantage, especially performance wise.

Does that sounds like good idea?
I wanted to make sure because maybe maybe hypercalls aren't that much
used in kvm for a reason, so I wanted to verify that.

On 18 June 2018 at 16:34, David Hildenbrand <david@redhat.com> wrote:
> On 16.06.2018 13:49, Ahmed Soliman wrote:
>> Following up on these threads:
>> - https://marc.info/?l=kvm&m=151929803301378&w=2
>> - http://www.openwall.com/lists/kernel-hardening/2018/02/22/18
>>
>> I lost the original emails so I couldn't reply to them, and also sorry
>> for being late, it was the end of semester exams.
>>
>> I was adviced on #qemu and #kernelnewbies IRCs to ask here as it will
>> help having better insights.
>>
>> To wrap things up, the basic design will be a method for communication
>> between host and guest is guest can request certain pages to be read
>> only, and then host will force them to be read-only by guest until
>> next guest reboot, then it will impossible for guest OS to have them
>> as RW again. The choice of which pages to be set as read only is the
>> guest's. So this way mixed pages can still be mixed with R/W content
>> even if holds kernel code.
>>
>> I was planning to use KVM as my hypervisor, until I found out that KVM
>> can't do that on its own so one will need a custom virtio driver to do
>> this kind of guest-host communication/coordination, I am still
>> sticking to KVM, and have no plans to do this for Xen at least for
>> now, this means that in order to get it to work there must be a QEMU
>> support our specific driver we are planning to write in order for
>> things to work properly.
>>
>> The question is is this the right approach? or is there a simpler way
>> to achieve this goal?
>>
>
> Especially if you want to support multiple architectures in the long
> term, virtio is the way to go.
>
> Design an architecture independent and extensible (+configurable)
> interface and be happy :) This might of course require some thought.
>
> (and don't worry, implementing a virtio driver is a lot simpler than you
> might think)
>
> But be aware that the virtio "hypervisor" side will be handled in QEMU,
> so you'll need a proper QEMU->KVM interface to get things running.
>
> --
>
> Thanks,
>
> David / dhildenb

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox