Linux virtualization list
 help / color / mirror / Atom feed
* Re: Design Decision for KVM based anti rootkit
From: Ahmed Soliman @ 2018-06-18 16:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: riel, Kees Cook, kvm, Ard Biesheuvel, Kernel Hardening,
	qemu-devel, virtualization, Hossam Hassan, Ahmed Lotfy
In-Reply-To: <7c7ddb96-e865-53a2-3aa9-b79403c646a9@redhat.com>

Shortly after I sent the first email, we found that there is another
way to achieve this kind of communication, via KVM Hypercalls, I think
they are underutilised in kvm, but they exist.

We also found that they are architecture dependent, but the advantage
is that one doesn't need to create QEMU<-> kvm interface

So from our point of view it is either have things easily compatible
with many architectures out of the box (virtio) VS compatiability with
almost every front end including QEMU and any other one without
modification (Hypercalls)?

If it is the case, We might stick to hypercalls at the beginning,
because it can be easily tested with out modifying QEMU, but then
later we can move to virtio if there turned out to be clearer
advantage, especially performance wise.

Does that sounds like good idea?
I wanted to make sure because maybe maybe hypercalls aren't that much
used in kvm for a reason, so I wanted to verify that.

On 18 June 2018 at 16:34, David Hildenbrand <david@redhat.com> wrote:
> On 16.06.2018 13:49, Ahmed Soliman wrote:
>> Following up on these threads:
>> - https://marc.info/?l=kvm&m=151929803301378&w=2
>> - http://www.openwall.com/lists/kernel-hardening/2018/02/22/18
>>
>> I lost the original emails so I couldn't reply to them, and also sorry
>> for being late, it was the end of semester exams.
>>
>> I was adviced on #qemu and #kernelnewbies IRCs to ask here as it will
>> help having better insights.
>>
>> To wrap things up, the basic design will be a method for communication
>> between host and guest is guest can request certain pages to be read
>> only, and then host will force them to be read-only by guest until
>> next guest reboot, then it will impossible for guest OS to have them
>> as RW again. The choice of which pages to be set as read only is the
>> guest's. So this way mixed pages can still be mixed with R/W content
>> even if holds kernel code.
>>
>> I was planning to use KVM as my hypervisor, until I found out that KVM
>> can't do that on its own so one will need a custom virtio driver to do
>> this kind of guest-host communication/coordination, I am still
>> sticking to KVM, and have no plans to do this for Xen at least for
>> now, this means that in order to get it to work there must be a QEMU
>> support our specific driver we are planning to write in order for
>> things to work properly.
>>
>> The question is is this the right approach? or is there a simpler way
>> to achieve this goal?
>>
>
> Especially if you want to support multiple architectures in the long
> term, virtio is the way to go.
>
> Design an architecture independent and extensible (+configurable)
> interface and be happy :) This might of course require some thought.
>
> (and don't worry, implementing a virtio driver is a lot simpler than you
> might think)
>
> But be aware that the virtio "hypervisor" side will be handled in QEMU,
> so you'll need a proper QEMU->KVM interface to get things running.
>
> --
>
> Thanks,
>
> David / dhildenb

^ permalink raw reply

* Re: Design Decision for KVM based anti rootkit
From: David Hildenbrand @ 2018-06-18 14:34 UTC (permalink / raw)
  To: Ahmed Soliman, kvm, Kernel Hardening, riel, Kees Cook,
	Ard Biesheuvel, Hossam Hassan, Ahmed Lotfy, virtualization,
	qemu-devel
In-Reply-To: <CAAGnT3bjYPu9bordn_Dh8z+MW6p5DDLoSsZC9xg8QxQriVus9A@mail.gmail.com>

On 16.06.2018 13:49, Ahmed Soliman wrote:
> Following up on these threads:
> - https://marc.info/?l=kvm&m=151929803301378&w=2
> - http://www.openwall.com/lists/kernel-hardening/2018/02/22/18
> 
> I lost the original emails so I couldn't reply to them, and also sorry
> for being late, it was the end of semester exams.
> 
> I was adviced on #qemu and #kernelnewbies IRCs to ask here as it will
> help having better insights.
> 
> To wrap things up, the basic design will be a method for communication
> between host and guest is guest can request certain pages to be read
> only, and then host will force them to be read-only by guest until
> next guest reboot, then it will impossible for guest OS to have them
> as RW again. The choice of which pages to be set as read only is the
> guest's. So this way mixed pages can still be mixed with R/W content
> even if holds kernel code.
> 
> I was planning to use KVM as my hypervisor, until I found out that KVM
> can't do that on its own so one will need a custom virtio driver to do
> this kind of guest-host communication/coordination, I am still
> sticking to KVM, and have no plans to do this for Xen at least for
> now, this means that in order to get it to work there must be a QEMU
> support our specific driver we are planning to write in order for
> things to work properly.
> 
> The question is is this the right approach? or is there a simpler way
> to achieve this goal?
> 

Especially if you want to support multiple architectures in the long
term, virtio is the way to go.

Design an architecture independent and extensible (+configurable)
interface and be happy :) This might of course require some thought.

(and don't worry, implementing a virtio driver is a lot simpler than you
might think)

But be aware that the virtio "hypervisor" side will be handled in QEMU,
so you'll need a proper QEMU->KVM interface to get things running.

-- 

Thanks,

David / dhildenb

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-18 13:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, qemu-devel, virtualization, Siwei Liu, Netdev,
	aaron.f.brown
In-Reply-To: <20180615152926-mutt-send-email-mst@kernel.org>

On Fri, 15 Jun 2018 15:31:43 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Fri, Jun 15, 2018 at 11:32:42AM +0200, Cornelia Huck wrote:
> > On Fri, 15 Jun 2018 05:34:24 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Thu, Jun 14, 2018 at 12:02:31PM +0200, Cornelia Huck wrote:  
> >   
> > > > > > I am not all that familiar with how Qemu manages network devices. If we can
> > > > > > do all the
> > > > > > required management of the primary/standby devices within Qemu, that is
> > > > > > definitely a better
> > > > > > approach without upper layer involvement.      
> > > > > 
> > > > > Right. I would imagine in the extreme case the upper layer doesn't
> > > > > have to be involved at all if QEMU manages all hot plug/unplug logic.
> > > > > The management tool can supply passthrough device and virtio with the
> > > > > same group UUID, QEMU auto-manages the presence of the primary, and
> > > > > hot plug the device as needed before or after the migration.    
> > > > 
> > > > I do not really see how you can manage that kind of stuff in QEMU only.    
> > > 
> > > So right now failover is limited to pci passthrough devices only.
> > > The idea is to realize the vfio device but not expose it
> > > to guest. Have a separate command to expose it to guest.
> > > Hotunplug would also hide it from guest but not unrealize it.  
> > 
> > So, this would not be real hot(un)plug, but 'hide it from the guest',
> > right? The concept of "we have it realized in QEMU, but the guest can't
> > discover and use it" should be translatable to non-pci as well (at
> > least for ccw).
> >   
> > > 
> > > This will help ensure that e.g. on migration failure we can
> > > re-expose the device without risk of running out of resources.  
> > 
> > Makes sense.
> > 
> > Should that 'hidden' state be visible/settable from outside as well
> > (e.g. via a property)? I guess yes, so that management software has a
> > chance to see whether a device is visible.  
> 
> Might be handy for debug, but note that since QEMU manages this
> state it's transient: can change at any time, so it's kind
> of hard for management to rely on it.

Might be another reason to have this controlled by management software;
being able to find out easily why a device is not visible to the guest
seems to be a useful thing.

Anyway, let's defer this discussion until it is clear how we actually
want to handle the whole setup.

> 
> > Settable may be useful if we
> > find another use case for hiding realized devices.  

^ permalink raw reply

* Re: v4.14.21+: ATOMIC_SLEEP splat bisected to 9428088c90b6 ("drm/qxl: reapply cursor after resetting primary")
From: Mike Galbraith @ 2018-06-18  3:13 UTC (permalink / raw)
  To: Dave Airlie, Greg KH
  Cc: Ray Strode, LKML, stable, open list:VIRTIO CORE, NET...,
	Dave Airlie
In-Reply-To: <CAPM=9tzkKWM2wjeQvWG2qKM-xgFwGD3bjWfgmDUDTQO+tXF-ig@mail.gmail.com>

On Mon, 2018-06-18 at 12:33 +1000, Dave Airlie wrote:
> On 17 June 2018 at 21:02, Greg KH <gregkh@linuxfoundation.org> wrote:
> > On Sun, Jun 17, 2018 at 12:38:06PM +0200, Mike Galbraith wrote:
> >> On Sun, 2018-06-17 at 11:36 +0200, Greg KH wrote:
> >> >
> >> > And if you revert that patch, does everything work again?
> >>
> >> Yes.
> >
> > Great, Dave, care to revert this in 4.18 so I can queue up that revert
> > in older kernels as well?
> >
> > thanks,
> >
> > greg k-h
> 
> Hi MIke,
> 
> Does
> https://patchwork.freedesktop.org/patch/229980/
> 
> fix it?

Yup.

	-Mike

^ permalink raw reply

* Re: v4.14.21+: ATOMIC_SLEEP splat bisected to 9428088c90b6 ("drm/qxl: reapply cursor after resetting primary")
From: Dave Airlie @ 2018-06-18  2:33 UTC (permalink / raw)
  To: Greg KH
  Cc: Ray Strode, Mike Galbraith, LKML, stable,
	open list:VIRTIO CORE, NET..., Dave Airlie
In-Reply-To: <20180617110200.GA19216@kroah.com>

On 17 June 2018 at 21:02, Greg KH <gregkh@linuxfoundation.org> wrote:
> On Sun, Jun 17, 2018 at 12:38:06PM +0200, Mike Galbraith wrote:
>> On Sun, 2018-06-17 at 11:36 +0200, Greg KH wrote:
>> >
>> > And if you revert that patch, does everything work again?
>>
>> Yes.
>
> Great, Dave, care to revert this in 4.18 so I can queue up that revert
> in older kernels as well?
>
> thanks,
>
> greg k-h

Hi MIke,

Does
https://patchwork.freedesktop.org/patch/229980/

fix it?

I can queue up a revert asap, but since a revert just brings us back
another broken behaviour, I'd rather we just hurry the fix up.

Dave.

^ permalink raw reply

* Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-06-18  2:28 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <286AC319A985734F985F78AFA26841F7396A5CB0@shsmsx102.ccr.corp.intel.com>

On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> On Friday, June 15, 2018 10:29 PM, Michael S. Tsirkin wrote:
> > On Fri, Jun 15, 2018 at 02:11:23PM +0000, Wang, Wei W wrote:
> > > On Friday, June 15, 2018 7:42 PM, Michael S. Tsirkin wrote:
> > > > On Fri, Jun 15, 2018 at 12:43:11PM +0800, Wei Wang wrote:
> > > > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature
> > > > > indicates the support of reporting hints of guest free pages to host via
> > virtio-balloon.
> > > > >
> > > > > Host requests the guest to report free page hints by sending a
> > > > > command to the guest via setting the
> > > > VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT
> > > > > bit of the host_cmd config register.
> > > > >
> > > > > As the first step here, virtio-balloon only reports free page
> > > > > hints from the max order (10) free page list to host. This has
> > > > > generated similar good results as reporting all free page hints during
> > our tests.
> > > > >
> > > > > TODO:
> > > > > - support reporting free page hints from smaller order free page lists
> > > > >   when there is a need/request from users.
> > > > >
> > > > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > > > Cc: Michael S. Tsirkin <mst@redhat.com>
> > > > > Cc: Michal Hocko <mhocko@kernel.org>
> > > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > > ---
> > > > >  drivers/virtio/virtio_balloon.c     | 187
> > +++++++++++++++++++++++++++++--
> > > > -----
> > > > >  include/uapi/linux/virtio_balloon.h |  13 +++
> > > > >  2 files changed, 163 insertions(+), 37 deletions(-)
> > > > >
> > > > > diff --git a/drivers/virtio/virtio_balloon.c
> > > > > b/drivers/virtio/virtio_balloon.c index 6b237e3..582a03b 100644
> > > > > --- a/drivers/virtio/virtio_balloon.c
> > > > > +++ b/drivers/virtio/virtio_balloon.c
> > > > > @@ -43,6 +43,9 @@
> > > > >  #define OOM_VBALLOON_DEFAULT_PAGES 256  #define
> > > > > VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > > > >
> > > > > +/* The size of memory in bytes allocated for reporting free page
> > > > > +hints */ #define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
> > > > > +
> > > > >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > > > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > > >
> > > > Doesn't this limit memory size of the guest we can report?
> > > > Apparently to several gigabytes ...
> > > > OTOH huge guests with lots of free memory is exactly where we would
> > > > gain the most ...
> > >
> > > Yes, the 16-page array can report up to 32GB (each page can hold 512
> > addresses of 4MB free page blocks, i.e. 2GB free memory per page) free
> > memory to host. It is not flexible.
> > >
> > > How about allocating the buffer according to the guest memory size
> > > (proportional)? That is,
> > >
> > > /* Calculates the maximum number of 4MB (equals to 1024 pages) free
> > > pages blocks that the system can have */ 4m_page_blocks =
> > > totalram_pages / 1024;
> > >
> > > /* Allocating one page can hold 512 free page blocks, so calculates
> > > the number of pages that can hold those 4MB blocks. And this
> > > allocation should not exceed 1024 pages */ pages_to_allocate =
> > > min(4m_page_blocks / 512, 1024);
> > >
> > > For a 2TB guests, which has 2^19 page blocks (4MB each), we will allocate
> > 1024 pages as the buffer.
> > >
> > > When the guest has large memory, it should be easier to succeed in
> > allocation of large buffer. If that allocation fails, that implies that nothing
> > would be got from the 4MB free page list.
> > >
> > > I think the proportional allocation is simpler compared to other
> > > approaches like
> > > - scattered buffer, which will complicate the get_from_free_page_list
> > > implementation;
> > > - one buffer to call get_from_free_page_list multiple times, which needs
> > get_from_free_page_list to maintain states.. also too complicated.
> > >
> > > Best,
> > > Wei
> > >
> > 
> > That's more reasonable, but question remains what to do if that value
> > exceeds MAX_ORDER. I'd say maybe tell host we can't report it.
> 
> Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above, so the maximum memory that can be reported is 2TB. For larger guests, e.g. 4TB, the optimization can still offer 2TB free memory (better than no optimization).

Maybe it's better, maybe it isn't. It certainly muddies the waters even
more.  I'd rather we had a better plan. From that POV I like what
Matthew Wilcox suggested for this which is to steal the necessary #
of entries off the list.

If that doesn't fly, we can allocate out of the loop and just retry with more
pages.

> On the other hand, large guests being large mostly because the guests need to use large memory. In that case, they usually won't have that much free memory to report.

And following this logic small guests don't have a lot of memory to report at all.
Could you remind me why are we considering this optimization then?

> > 
> > Also allocating it with GFP_KERNEL is out. You only want to take it off the free
> > list. So I guess __GFP_NOMEMALLOC and __GFP_ATOMIC.
> 
> Sounds good, thanks.
> 
> > Also you can't allocate this on device start. First totalram_pages can change.
> > Second that's too much memory to tie up forever.
> 
> Yes, makes sense.
> 
> Best,
> Wei

^ permalink raw reply

* Re: [PATCH v33 1/4] mm: add a function to get free page blocks
From: Michael S. Tsirkin @ 2018-06-18  2:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
	liliang.opensource, linux-kernel, virtualization, linux-mm,
	pbonzini, akpm, mhocko, torvalds
In-Reply-To: <20180616045005.GA14936@bombadil.infradead.org>

On Fri, Jun 15, 2018 at 09:50:05PM -0700, Matthew Wilcox wrote:
> I wonder if (to address Michael's concern), you shouldn't instead use
> the first free chunk of pages to return the addresses of all the pages.
> ie something like this:
> 
> 	__le64 *ret = NULL;
> 	unsigned int max = (PAGE_SIZE << order) / sizeof(__le64);
> 
> 	for_each_populated_zone(zone) {
> 		spin_lock_irq(&zone->lock);
> 		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> 			list = &zone->free_area[order].free_list[mt];
> 			list_for_each_entry_safe(page, list, lru, ...) {
> 				if (index == size)
> 					break;
> 				addr = page_to_pfn(page) << PAGE_SHIFT;
> 				if (!ret) {
> 					list_del(...);
> 					ret = addr;
> 				}
> 				ret[index++] = cpu_to_le64(addr);
> 			}
> 		}
> 		spin_unlock_irq(&zone->lock);
> 	}
> 
> 	return ret;
> }
> 
> You'll need to return the page to the freelist afterwards, but free_pages()
> should take care of that.

Yes Wei already came up with the idea to stick this data into a
MAX_ORDER allocation. Are you sure just taking an entry off
the list like that has no bad side effects?
I have a vague memory someone complained that everyone
most go through get free pages/kmalloc, but I can't
find that anymore.


-- 
MST

^ permalink raw reply

* Re: v4.14.21+: ATOMIC_SLEEP splat bisected  to 9428088c90b6 ("drm/qxl: reapply cursor after resetting primary")
From: Greg KH @ 2018-06-17 11:02 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ray Strode, LKML, stable, virtualization, Dave Airlie
In-Reply-To: <1529231886.5303.6.camel@gmx.de>

On Sun, Jun 17, 2018 at 12:38:06PM +0200, Mike Galbraith wrote:
> On Sun, 2018-06-17 at 11:36 +0200, Greg KH wrote:
> > 
> > And if you revert that patch, does everything work again?
> 
> Yes.

Great, Dave, care to revert this in 4.18 so I can queue up that revert
in older kernels as well?

thanks,

greg k-h

^ permalink raw reply

* Re: v4.14.21+: ATOMIC_SLEEP splat bisected  to 9428088c90b6 ("drm/qxl: reapply cursor after resetting primary")
From: Mike Galbraith @ 2018-06-17 10:38 UTC (permalink / raw)
  To: Greg KH; +Cc: Ray Strode, LKML, stable, virtualization, Dave Airlie
In-Reply-To: <20180617093607.GA27652@kroah.com>

On Sun, 2018-06-17 at 11:36 +0200, Greg KH wrote:
> 
> And if you revert that patch, does everything work again?

Yes.

	-Mike

^ permalink raw reply

* Re: v4.14.21+: ATOMIC_SLEEP splat bisected  to 9428088c90b6 ("drm/qxl: reapply cursor after resetting primary")
From: Greg KH @ 2018-06-17  9:36 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ray Strode, LKML, stable, virtualization, Dave Airlie
In-Reply-To: <1529163551.16103.16.camel@gmx.de>

On Sat, Jun 16, 2018 at 05:39:11PM +0200, Mike Galbraith wrote:
> Greetings,
> 
> Running a kernel with ATOMIC_SLEEP enabled in one of my VMs, I met the
> splat below.  I tracked it back to 4.14-stable, and bisected it there.

Why does your virtual machine use a drm driver?  Ah, it's a driver for
virtual gpu, nice.

And if you revert that patch, does everything work again?

> # first bad commit: [848dd9bf51544c8e5436960124f3d1f9c51cdb99] drm/qxl: reapply cursor after resetting primary

You should also cc: the other developers on that signed-off-by list,
like I did here...

thanks,

greg k-h

^ permalink raw reply

* RE: [PATCH v33 1/4] mm: add a function to get free page blocks
From: Wang, Wei W @ 2018-06-17  0:07 UTC (permalink / raw)
  To: 'Matthew Wilcox'
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	mst@redhat.com, nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
	pbonzini@redhat.com, akpm@linux-foundation.org, mhocko@kernel.org,
	torvalds@linux-foundation.org
In-Reply-To: <20180616045005.GA14936@bombadil.infradead.org>

On Saturday, June 16, 2018 12:50 PM, Matthew Wilcox wrote:
> On Fri, Jun 15, 2018 at 12:43:10PM +0800, Wei Wang wrote:
> > +/**
> > + * get_from_free_page_list - get free page blocks from a free page
> > +list
> > + * @order: the order of the free page list to check
> > + * @buf: the array to store the physical addresses of the free page
> > +blocks
> > + * @size: the array size
> > + *
> > + * This function offers hints about free pages. There is no guarantee
> > +that
> > + * the obtained free pages are still on the free page list after the
> > +function
> > + * returns. pfn_to_page on the obtained free pages is strongly
> > +discouraged
> > + * and if there is an absolute need for that, make sure to contact MM
> > +people
> > + * to discuss potential problems.
> > + *
> > + * The addresses are currently stored to the array in little endian.
> > +This
> > + * avoids the overhead of converting endianness by the caller who
> > +needs data
> > + * in the little endian format. Big endian support can be added on
> > +demand in
> > + * the future.
> > + *
> > + * Return the number of free page blocks obtained from the free page list.
> > + * The maximum number of free page blocks that can be obtained is
> > +limited to
> > + * the caller's array size.
> > + */
> 
> Please use:
> 
>  * Return: The number of free page blocks obtained from the free page list.
> 
> Also, please include a
> 
>  * Context: Any context.
> 
> or
> 
>  * Context: Process context.
> 
> or whatever other conetext this function can be called from.  Since you're
> taking the lock irqsafe, I assume this can be called from any context, but I
> wonder if it makes sense to have this function callable from interrupt context.
> Maybe this should be callable from process context only.

Thanks, sounds better to make it process context only.

> 
> > +uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t
> > +size) {
> > +	struct zone *zone;
> > +	enum migratetype mt;
> > +	struct page *page;
> > +	struct list_head *list;
> > +	unsigned long addr, flags;
> > +	uint32_t index = 0;
> > +
> > +	for_each_populated_zone(zone) {
> > +		spin_lock_irqsave(&zone->lock, flags);
> > +		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> > +			list = &zone->free_area[order].free_list[mt];
> > +			list_for_each_entry(page, list, lru) {
> > +				addr = page_to_pfn(page) << PAGE_SHIFT;
> > +				if (likely(index < size)) {
> > +					buf[index++] = cpu_to_le64(addr);
> > +				} else {
> > +					spin_unlock_irqrestore(&zone->lock,
> > +							       flags);
> > +					return index;
> > +				}
> > +			}
> > +		}
> > +		spin_unlock_irqrestore(&zone->lock, flags);
> > +	}
> > +
> > +	return index;
> > +}
> 
> I wonder if (to address Michael's concern), you shouldn't instead use the first
> free chunk of pages to return the addresses of all the pages.
> ie something like this:
> 
> 	__le64 *ret = NULL;
> 	unsigned int max = (PAGE_SIZE << order) / sizeof(__le64);
> 
> 	for_each_populated_zone(zone) {
> 		spin_lock_irq(&zone->lock);
> 		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> 			list = &zone->free_area[order].free_list[mt];
> 			list_for_each_entry_safe(page, list, lru, ...) {
> 				if (index == size)
> 					break;
> 				addr = page_to_pfn(page) << PAGE_SHIFT;
> 				if (!ret) {
> 					list_del(...);

Thanks for sharing. But probably we would not take this approach for the reasons below:

1) I'm not sure if getting a block of free pages to use could be that simple (just pluck it from the list as above). I think it is more prudent to let the callers allocate the array via the regular allocation functions. 

2) Callers may need to use this with their own defined protocols, and they want the header and payload (i.e. the obtained hints) to locate in physically continuous memory (there are tricks they can use to make it work with non-physically continuous memory, but that would just complicate all the things) . In this case, it is better to have callers allocate the memory on their own, and pass the payload part memory to this API to get the payload filled.

Best,
Wei

^ permalink raw reply

* Design Decision for KVM based anti rootkit
From: Ahmed Soliman @ 2018-06-16 11:49 UTC (permalink / raw)
  To: kvm, Kernel Hardening, riel, Kees Cook, Ard Biesheuvel,
	Hossam Hassan, Ahmed Lotfy, virtualization, qemu-devel

Following up on these threads:
- https://marc.info/?l=kvm&m=151929803301378&w=2
- http://www.openwall.com/lists/kernel-hardening/2018/02/22/18

I lost the original emails so I couldn't reply to them, and also sorry
for being late, it was the end of semester exams.

I was adviced on #qemu and #kernelnewbies IRCs to ask here as it will
help having better insights.

To wrap things up, the basic design will be a method for communication
between host and guest is guest can request certain pages to be read
only, and then host will force them to be read-only by guest until
next guest reboot, then it will impossible for guest OS to have them
as RW again. The choice of which pages to be set as read only is the
guest's. So this way mixed pages can still be mixed with R/W content
even if holds kernel code.

I was planning to use KVM as my hypervisor, until I found out that KVM
can't do that on its own so one will need a custom virtio driver to do
this kind of guest-host communication/coordination, I am still
sticking to KVM, and have no plans to do this for Xen at least for
now, this means that in order to get it to work there must be a QEMU
support our specific driver we are planning to write in order for
things to work properly.

The question is is this the right approach? or is there a simpler way
to achieve this goal?

^ permalink raw reply

* Re: [PATCH v33 1/4] mm: add a function to get free page blocks
From: Matthew Wilcox @ 2018-06-16  4:50 UTC (permalink / raw)
  To: Wei Wang
  Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, mst, nilal,
	liliang.opensource, linux-kernel, virtualization, linux-mm,
	pbonzini, akpm, mhocko, torvalds
In-Reply-To: <1529037793-35521-2-git-send-email-wei.w.wang@intel.com>

On Fri, Jun 15, 2018 at 12:43:10PM +0800, Wei Wang wrote:
> +/**
> + * get_from_free_page_list - get free page blocks from a free page list
> + * @order: the order of the free page list to check
> + * @buf: the array to store the physical addresses of the free page blocks
> + * @size: the array size
> + *
> + * This function offers hints about free pages. There is no guarantee that
> + * the obtained free pages are still on the free page list after the function
> + * returns. pfn_to_page on the obtained free pages is strongly discouraged
> + * and if there is an absolute need for that, make sure to contact MM people
> + * to discuss potential problems.
> + *
> + * The addresses are currently stored to the array in little endian. This
> + * avoids the overhead of converting endianness by the caller who needs data
> + * in the little endian format. Big endian support can be added on demand in
> + * the future.
> + *
> + * Return the number of free page blocks obtained from the free page list.
> + * The maximum number of free page blocks that can be obtained is limited to
> + * the caller's array size.
> + */

Please use:

 * Return: The number of free page blocks obtained from the free page list.

Also, please include a

 * Context: Any context.

or

 * Context: Process context.

or whatever other conetext this function can be called from.  Since you're
taking the lock irqsafe, I assume this can be called from any context, but
I wonder if it makes sense to have this function callable from interrupt
context.  Maybe this should be callable from process context only.

> +uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size)
> +{
> +	struct zone *zone;
> +	enum migratetype mt;
> +	struct page *page;
> +	struct list_head *list;
> +	unsigned long addr, flags;
> +	uint32_t index = 0;
> +
> +	for_each_populated_zone(zone) {
> +		spin_lock_irqsave(&zone->lock, flags);
> +		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +			list = &zone->free_area[order].free_list[mt];
> +			list_for_each_entry(page, list, lru) {
> +				addr = page_to_pfn(page) << PAGE_SHIFT;
> +				if (likely(index < size)) {
> +					buf[index++] = cpu_to_le64(addr);
> +				} else {
> +					spin_unlock_irqrestore(&zone->lock,
> +							       flags);
> +					return index;
> +				}
> +			}
> +		}
> +		spin_unlock_irqrestore(&zone->lock, flags);
> +	}
> +
> +	return index;
> +}

I wonder if (to address Michael's concern), you shouldn't instead use
the first free chunk of pages to return the addresses of all the pages.
ie something like this:

	__le64 *ret = NULL;
	unsigned int max = (PAGE_SIZE << order) / sizeof(__le64);

	for_each_populated_zone(zone) {
		spin_lock_irq(&zone->lock);
		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
			list = &zone->free_area[order].free_list[mt];
			list_for_each_entry_safe(page, list, lru, ...) {
				if (index == size)
					break;
				addr = page_to_pfn(page) << PAGE_SHIFT;
				if (!ret) {
					list_del(...);
					ret = addr;
				}
				ret[index++] = cpu_to_le64(addr);
			}
		}
		spin_unlock_irq(&zone->lock);
	}

	return ret;
}

You'll need to return the page to the freelist afterwards, but free_pages()
should take care of that.

^ permalink raw reply

* RE: [PATCH v33 1/4] mm: add a function to get free page blocks
From: Wang, Wei W @ 2018-06-16  1:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	Rik van Riel, quan.xu0@gmail.com, KVM list, Michael S. Tsirkin,
	nilal@redhat.com, liliang.opensource@gmail.com,
	Linux Kernel Mailing List, virtualization, linux-mm,
	Paolo Bonzini, Andrew Morton, Michal Hocko
In-Reply-To: <CA+55aFzhuGKinEq5udPsk_uYHShkQxJYqcPO=tLCkT-oxpsgPg@mail.gmail.com>

On Saturday, June 16, 2018 7:09 AM, Linus Torvalds wrote:
> On Fri, Jun 15, 2018 at 2:08 PM Wei Wang <wei.w.wang@intel.com> wrote:
> >
> > This patch adds a function to get free pages blocks from a free page
> > list. The obtained free page blocks are hints about free pages,
> > because there is no guarantee that they are still on the free page
> > list after the function returns.
> 
> Ack. This is the kind of simple interface where I don't need to worry about
> the MM code calling out to random drivers or subsystems.
> 
> I think that "order" should be checked for validity, but from a MM standpoint
> I think this is fine.
> 

Thanks, will add a validity check for "order".

Best,
Wei

^ permalink raw reply

* RE: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wang, Wei W @ 2018-06-16  1:09 UTC (permalink / raw)
  To: 'Michael S. Tsirkin'
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <20180615171635-mutt-send-email-mst@kernel.org>

On Friday, June 15, 2018 10:29 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 15, 2018 at 02:11:23PM +0000, Wang, Wei W wrote:
> > On Friday, June 15, 2018 7:42 PM, Michael S. Tsirkin wrote:
> > > On Fri, Jun 15, 2018 at 12:43:11PM +0800, Wei Wang wrote:
> > > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature
> > > > indicates the support of reporting hints of guest free pages to host via
> virtio-balloon.
> > > >
> > > > Host requests the guest to report free page hints by sending a
> > > > command to the guest via setting the
> > > VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT
> > > > bit of the host_cmd config register.
> > > >
> > > > As the first step here, virtio-balloon only reports free page
> > > > hints from the max order (10) free page list to host. This has
> > > > generated similar good results as reporting all free page hints during
> our tests.
> > > >
> > > > TODO:
> > > > - support reporting free page hints from smaller order free page lists
> > > >   when there is a need/request from users.
> > > >
> > > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > > Cc: Michael S. Tsirkin <mst@redhat.com>
> > > > Cc: Michal Hocko <mhocko@kernel.org>
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > ---
> > > >  drivers/virtio/virtio_balloon.c     | 187
> +++++++++++++++++++++++++++++--
> > > -----
> > > >  include/uapi/linux/virtio_balloon.h |  13 +++
> > > >  2 files changed, 163 insertions(+), 37 deletions(-)
> > > >
> > > > diff --git a/drivers/virtio/virtio_balloon.c
> > > > b/drivers/virtio/virtio_balloon.c index 6b237e3..582a03b 100644
> > > > --- a/drivers/virtio/virtio_balloon.c
> > > > +++ b/drivers/virtio/virtio_balloon.c
> > > > @@ -43,6 +43,9 @@
> > > >  #define OOM_VBALLOON_DEFAULT_PAGES 256  #define
> > > > VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > > >
> > > > +/* The size of memory in bytes allocated for reporting free page
> > > > +hints */ #define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
> > > > +
> > > >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > >
> > > Doesn't this limit memory size of the guest we can report?
> > > Apparently to several gigabytes ...
> > > OTOH huge guests with lots of free memory is exactly where we would
> > > gain the most ...
> >
> > Yes, the 16-page array can report up to 32GB (each page can hold 512
> addresses of 4MB free page blocks, i.e. 2GB free memory per page) free
> memory to host. It is not flexible.
> >
> > How about allocating the buffer according to the guest memory size
> > (proportional)? That is,
> >
> > /* Calculates the maximum number of 4MB (equals to 1024 pages) free
> > pages blocks that the system can have */ 4m_page_blocks =
> > totalram_pages / 1024;
> >
> > /* Allocating one page can hold 512 free page blocks, so calculates
> > the number of pages that can hold those 4MB blocks. And this
> > allocation should not exceed 1024 pages */ pages_to_allocate =
> > min(4m_page_blocks / 512, 1024);
> >
> > For a 2TB guests, which has 2^19 page blocks (4MB each), we will allocate
> 1024 pages as the buffer.
> >
> > When the guest has large memory, it should be easier to succeed in
> allocation of large buffer. If that allocation fails, that implies that nothing
> would be got from the 4MB free page list.
> >
> > I think the proportional allocation is simpler compared to other
> > approaches like
> > - scattered buffer, which will complicate the get_from_free_page_list
> > implementation;
> > - one buffer to call get_from_free_page_list multiple times, which needs
> get_from_free_page_list to maintain states.. also too complicated.
> >
> > Best,
> > Wei
> >
> 
> That's more reasonable, but question remains what to do if that value
> exceeds MAX_ORDER. I'd say maybe tell host we can't report it.

Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above, so the maximum memory that can be reported is 2TB. For larger guests, e.g. 4TB, the optimization can still offer 2TB free memory (better than no optimization).

On the other hand, large guests being large mostly because the guests need to use large memory. In that case, they usually won't have that much free memory to report.

> 
> Also allocating it with GFP_KERNEL is out. You only want to take it off the free
> list. So I guess __GFP_NOMEMALLOC and __GFP_ATOMIC.

Sounds good, thanks.

> Also you can't allocate this on device start. First totalram_pages can change.
> Second that's too much memory to tie up forever.

Yes, makes sense.

Best,
Wei

^ permalink raw reply

* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Benjamin Herrenschmidt @ 2018-06-16  1:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: robh, cohuck, pawel.moll, Tom Lendacky, Michael S. Tsirkin,
	Ram Pai, linux-kernel, virtualization, joe, Rustad, Mark D,
	Anshuman Khandual, linuxppc-dev, elfring, david
In-Reply-To: <20180615091624.GA1064@infradead.org>

On Fri, 2018-06-15 at 02:16 -0700, Christoph Hellwig wrote:
> On Wed, Jun 13, 2018 at 11:11:01PM +1000, Benjamin Herrenschmidt wrote:
> > Actually ... the stuff in lib/dma-direct.c seems to be just it, no ?
> > 
> > There's no cache flushing and there's no architecture hooks that I can
> > see other than the AMD security stuff which is probably fine.
> > 
> > Or am I missing something ?
> 
> You are missing the __phys_to_dma arch hook that allows architectures
> to adjust the dma address.  Various systems have offsets, or even
> multiple banks with different offsets there.  Most of them don't
> use the dma-direct code yet (working on it), but there are a few
> examples in the tree already.

Ok and on those systems, qemu will bypass said offset ? Maybe we could
just create a device DMA "flag" or (or use the attributes) to instruct
dma-direct to not use that for legacy virtio ?

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH v33 1/4] mm: add a function to get free page blocks
From: Linus Torvalds @ 2018-06-15 23:08 UTC (permalink / raw)
  To: wei.w.wang
  Cc: yang.zhang.wz, virtio-dev, Rik van Riel, quan.xu0, KVM list,
	Michael S. Tsirkin, nilal, liliang.opensource,
	Linux Kernel Mailing List, virtualization, linux-mm,
	Paolo Bonzini, Andrew Morton, Michal Hocko
In-Reply-To: <1529037793-35521-2-git-send-email-wei.w.wang@intel.com>

On Fri, Jun 15, 2018 at 2:08 PM Wei Wang <wei.w.wang@intel.com> wrote:
>
> This patch adds a function to get free pages blocks from a free page
> list. The obtained free page blocks are hints about free pages, because
> there is no guarantee that they are still on the free page list after
> the function returns.

Ack. This is the kind of simple interface where I don't need to worry
about the MM code calling out to random drivers or subsystems.

I think that "order" should be checked for validity, but from a MM
standpoint I think this is fine.

                Linus

^ permalink raw reply

* Re: [PATCH v33 0/4] Virtio-balloon: support free page reporting
From: Luiz Capitulino @ 2018-06-15 19:21 UTC (permalink / raw)
  To: Wei Wang
  Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, mst, nilal,
	liliang.opensource, linux-kernel, virtualization, linux-mm,
	pbonzini, akpm, mhocko, torvalds
In-Reply-To: <1529037793-35521-1-git-send-email-wei.w.wang@intel.com>

On Fri, 15 Jun 2018 12:43:09 +0800
Wei Wang <wei.w.wang@intel.com> wrote:

> This patch series is separated from the previous "Virtio-balloon
> Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,  
> implemented by this series enables the virtio-balloon driver to report
> hints of guest free pages to the host. It can be used to accelerate live
> migration of VMs. Here is an introduction of this usage:

So, we have two page hinting solutions being proposed. One is this
series, the other one by Nitesh is intended to improve host memory
utilization by letting the host use unused guest memory[1].

Instead of merging two similar solutions, do we want a more generic
one that solves both problems? Or maybe an unified solution?

[1] https://www.spinics.net/lists/kvm/msg170113.html

> Live migration needs to transfer the VM's memory from the source machine
> to the destination round by round. For the 1st round, all the VM's memory
> is transferred. From the 2nd round, only the pieces of memory that were
> written by the guest (after the 1st round) are transferred. One method
> that is popularly used by the hypervisor to track which part of memory is
> written is to write-protect all the guest memory.
> 
> This feature enables the optimization by skipping the transfer of guest
> free pages during VM live migration. It is not concerned that the memory
> pages are used after they are given to the hypervisor as a hint of the
> free pages, because they will be tracked by the hypervisor and transferred
> in the subsequent round if they are used and written.
> 
> * Tests
> - Test Environment
>     Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
>     Guest: 8G RAM, 4 vCPU
>     Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second
> 
> - Test Results
>     - Idle Guest Live Migration Time (results are averaged over 10 runs):
>         - Optimization v.s. Legacy = 278ms vs 1757ms --> ~84% reduction
>     - Guest with Linux Compilation Workload (make bzImage -j4):
>         - Live Migration Time (average)
>           Optimization v.s. Legacy = 1408ms v.s. 2528ms --> ~44% reduction
>         - Linux Compilation Time
>           Optimization v.s. Legacy = 5min3s v.s. 5min12s
>           --> no obvious difference  
> 
> ChangeLog:
> v32->v33:
>     - mm/get_from_free_page_list: The new implementation to get free page
>       hints based on the suggestions from Linus:
>       https://lkml.org/lkml/2018/6/11/764
>       This avoids the complex call chain, and looks more prudent.
>     - virtio-balloon: 
>       - use a fix-sized buffer to get free page hints;
>       - remove the cmd id related interface. Now host can just send a free
>         page hint command to the guest (via the host_cmd config register)
>         to start the reporting. Currentlty the guest reports only the max
>         order free page hints to host, which has generated similar good
>         results as before. But the interface used by virtio-balloon to
>         report can support reporting more orders in the future when there
>         is a need.
> v31->v32:
>     - virtio-balloon:
>         - rename cmd_id_use to cmd_id_active;
>         - report_free_page_func: detach used buffers after host sends a vq
>           interrupt, instead of busy waiting for used buffers.
> v30->v31:
>     - virtio-balloon:
>         - virtio_balloon_send_free_pages: return -EINTR rather than 1 to
>           indicate an active stop requested by host; and add more
>           comments to explain about access to cmd_id_received without
>           locks;
>         -  add_one_sg: add TODO to comments about possible improvement.
> v29->v30:
>     - mm/walk_free_mem_block: add cond_sched() for each order
> v28->v29:
>     - mm/page_poison: only expose page_poison_enabled(), rather than more
>       changes did in v28, as we are not 100% confident about that for now.
>     - virtio-balloon: use a separate buffer for the stop cmd, instead of
>       having the start and stop cmd use the same buffer. This avoids the
>       corner case that the start cmd is overridden by the stop cmd when
>       the host has a delay in reading the start cmd.
> v27->v28:
>     - mm/page_poison: Move PAGE_POISON to page_poison.c and add a function
>       to expose page poison val to kernel modules.
> v26->v27:
>     - add a new patch to expose page_poisoning_enabled to kernel modules
>     - virtio-balloon: set poison_val to 0xaaaaaaaa, instead of 0xaa
> v25->v26: virtio-balloon changes only
>     - remove kicking free page vq since the host now polls the vq after
>       initiating the reporting
>     - report_free_page_func: detach all the used buffers after sending
>       the stop cmd id. This avoids leaving the detaching burden (i.e.
>       overhead) to the next cmd id. Detaching here isn't considered
>       overhead since the stop cmd id has been sent, and host has already
>       moved formard.
> v24->v25:
>     - mm: change walk_free_mem_block to return 0 (instead of true) on
>           completing the report, and return a non-zero value from the
>           callabck, which stops the reporting.
>     - virtio-balloon:
>         - use enum instead of define for VIRTIO_BALLOON_VQ_INFLATE etc.
>         - avoid __virtio_clear_bit when bailing out;
>         - a new method to avoid reporting the some cmd id to host twice
>         - destroy_workqueue can cancel free page work when the feature is
>           negotiated;
>         - fail probe when the free page vq size is less than 2.
> v23->v24:
>     - change feature name VIRTIO_BALLOON_F_FREE_PAGE_VQ to
>       VIRTIO_BALLOON_F_FREE_PAGE_HINT
>     - kick when vq->num_free < half full, instead of "= half full"
>     - replace BUG_ON with bailing out
>     - check vb->balloon_wq in probe(), if null, bail out
>     - add a new feature bit for page poisoning
>     - solve the corner case that one cmd id being sent to host twice
> v22->v23:
>     - change to kick the device when the vq is half-way full;
>     - open-code batch_free_page_sg into add_one_sg;
>     - change cmd_id from "uint32_t" to "__virtio32";
>     - reserver one entry in the vq for the driver to send cmd_id, instead
>       of busywaiting for an available entry;
>     - add "stop_update" check before queue_work for prudence purpose for
>       now, will have a separate patch to discuss this flag check later;
>     - init_vqs: change to put some variables on stack to have simpler
>       implementation;
>     - add destroy_workqueue(vb->balloon_wq);
> v21->v22:
>     - add_one_sg: some code and comment re-arrangement
>     - send_cmd_id: handle a cornercase
> 
> For previous ChangeLog, please reference
> https://lwn.net/Articles/743660/
> 
> Wei Wang (4):
>   mm: add a function to get free page blocks
>   virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
>   mm/page_poison: expose page_poisoning_enabled to kernel modules
>   virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
> 
>  drivers/virtio/virtio_balloon.c     | 197 +++++++++++++++++++++++++++++-------
>  include/linux/mm.h                  |   1 +
>  include/uapi/linux/virtio_balloon.h |  16 +++
>  mm/page_alloc.c                     |  52 ++++++++++
>  mm/page_poison.c                    |   6 ++
>  5 files changed, 235 insertions(+), 37 deletions(-)
> 

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-15 17:06 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, konrad.wilk, qemu-devel,
	virtualization, Venu Busireddy, Netdev, boris.ostrovsky,
	aaron.f.brown, Joao Martins
In-Reply-To: <20180615134815.6613620e.cohuck@redhat.com>

On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> On Thu, 14 Jun 2018 18:57:11 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> Thank you for sharing your thoughts, Cornelia. With questions below, I
>> think you raised really good points, some of which I don't have answer
>> yet and would also like to explore here.
>>
>> First off, I don't want to push the discussion to the extreme at this
>> point, or sell anything about having QEMU manage everything
>> automatically. Don't get me wrong, it's not there yet. Let's don't
>> assume we are tied to a specific or concerte solution. I think the key
>> for our discussion might be to define or refine the boundary between
>> VM and guest,  e.g. what each layer is expected to control and manage
>> exactly.
>>
>> In my view, there might be possibly 3 different options to represent
>> the failover device conceipt to QEMU and libvirt (or any upper layer
>> software):
>>
>> a. Seperate device: in this model, virtio and passthough remains
>> separate devices just as today. QEMU exposes the standby feature bit
>> for virtio, and publish status/event around the negotiation process of
>> this feature bit for libvirt to react upon. Since Libvirt has the
>> pairing relationship itself, maybe through MAC address or something
>> else, it can control the presence of primary by hot plugging or
>> unplugging the passthrough device, although it has to work tightly
>> with virtio's feature negotation process. Not just for migration but
>> also various corner scenarios (driver/feature ok, device reset,
>> reboot, legacy guest etc) along virtio's feature negotiation.
>
> Yes, that one has obvious tie-ins to virtio's modus operandi.
>
>>
>> b. Coupled device: in this model, virtio and passthough devices are
>> weakly coupled using some group ID, i.e. QEMU match the passthough
>> device for a standby virtio instance by comparing the group ID value
>> present behind each device's bridge. Libvirt provides QEMU the group
>> ID for both type of devices, and only deals with hot plug for
>> migration, by checking some migration status exposed (e.g. the feature
>> negotiation status on the virtio device) by QEMU. QEMU manages the
>> visibility of the primary in guest along virtio's feature negotiation
>> process.
>
> I'm a bit confused here. What, exactly, ties the two devices together?

The group UUID. Since QEMU VFIO dvice does not have insight of MAC
address (which it doesn't have to), the association between VFIO
passthrough and standby must be specificed for QEMU to understand the
relationship with this model. Note, standby feature is no longer
required to be exposed under this model.

> If libvirt already has the knowledge that it should manage the two as a
> couple, why do we need the group id (or something else for other
> architectures)? (Maybe I'm simply missing something because I'm not
> that familiar with pci.)

The idea is to have QEMU control the visibility and enumeration order
of the passthrough VFIO for the failover scenario. Hotplug can be one
way to achieve it, and perhaps there's other way around also. The
group ID is not just for QEMU to couple devices, it's also helpful to
guest too as grouping using MAC address is just not safe.

>
>>
>> c. Fully combined device: in this model, virtio and passthough devices
>> are viewed as a single VM interface altogther. QEMU not just controls
>> the visibility of the primary in guest, but can also manage the
>> exposure of the passthrough for migratability. It can be like that
>> libvirt supplies the group ID to QEMU. Or libvirt does not even have
>> to provide group ID for grouping the two devices, if just one single
>> combined device is exposed by QEMU. In either case, QEMU manages all
>> aspect of such internal construct, including virtio feature
>> negotiation, presence of the primary, and live migration.
>
> Same question as above.
>
>>
>> It looks like to me that, in your opinion, you seem to prefer go with
>> (a). While I'm actually okay with either (b) or (c). Do I understand
>> your point correctly?
>
> I'm not yet preferring anything, as I'm still trying to understand how
> this works :) I hope we can arrive at a model that covers the use case
> and that is also flexible enough to be extended to other platforms.
>
>>
>> The reason that I feel that (a) might not be ideal, just as Michael
>> alluded to (quoting below), is that as management stack, it really
>> doesn't need to care about the detailed process of feature negotiation
>> (if we view the guest presence of the primary as part of feature
>> negotiation at an extended level not just virtio). All it needs to be
>> done is to hand in the required devices to QEMU and that's all. Why do
>> we need to addd various hooks, events for whichever happens internally
>> within the guest?
>>
>> ''
>> Primary device is added with a special "primary-failover" flag.
>> A virtual machine is then initialized with just a standby virtio
>> device. Primary is not yet added.
>>
>> Later QEMU detects that guest driver device set DRIVER_OK.
>> It then exposes the primary device to the guest, and triggers
>> a device addition event (hot-plug event) for it.
>>
>> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
>> to remove the primary driver.  In particular, if QEMU detects guest
>> re-initialization (e.g. by detecting guest reset) it immediately removes
>> the primary device.
>> ''
>>
>> and,
>>
>> ''
>> management just wants to give the primary to guest and later take it back,
>> it really does not care about the details of the process,
>> so I don't see what does pushing it up the stack buy you.
>>
>> So I don't think it *needs* to be done in libvirt. It probably can if you
>> add a bunch of hooks so it knows whenever vm reboots, driver binds and
>> unbinds from device, and can check that backup flag was set.
>> If you are pushing for a setup like that please get a buy-in
>> from libvirt maintainers or better write a patch.
>> ''
>
> This actually seems to mean the opposite to me: We need to know what
> the guest is doing and when, as it directly drives what we need to do
> with the devices. If we switch to a visibility vs a hotplug model (see
> the other mail), we might be able to handle that part within qemu.

In the model of (b), I think it essentially turns hotplug to one of
mechanisms for QEMU to control the visibility. The libvirt can still
manage the hotplug of individual devices during live migration or in
normal situation to hot add/remove devices. Though the visibility of
the VFIO is under the controll of QEMU, and it's possible that the hot
add/remove request does not involve actual hot plug activity in guest
at all.

In the model of (c), the hotplug semantics of the combined device
would mean differently - it would end up with devices plugged in or
out altogther. To make this work, we either have to build a brand new
bond-like QEMU device consist of virtio and VFIO internally, or need
to have some abstraction in place for libvirt to manipulate the
combined device (and prohibit libvirt from operating on individual
internal device directly). Note with this model the group ID doesn't
even need to get exposed to libvirt, just imagine libvirt to supply
all options required to configure two regular virtio-net and VFIO
devices for a single device object, and QEMU will deal with the
device's visibility and enumeration, such when to hot plug VFIO device
in to or out from the guest.

It might be complicated to implement (c) though.

Regards,
-Siwei


> However, I don't see how you get around needing libvirt to actually set
> this up in the first place and to handle migration per se.

^ permalink raw reply

* Re: [PATCH v33 0/4] Virtio-balloon: support free page reporting
From: Michael S. Tsirkin @ 2018-06-15 14:38 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <286AC319A985734F985F78AFA26841F7396A3DC9@shsmsx102.ccr.corp.intel.com>

On Fri, Jun 15, 2018 at 02:28:49PM +0000, Wang, Wei W wrote:
> On Friday, June 15, 2018 7:30 PM, Michael S. Tsirkin wrote:
> > On Fri, Jun 15, 2018 at 12:43:09PM +0800, Wei Wang wrote:
> > >       - remove the cmd id related interface. Now host can just send a free
> > >         page hint command to the guest (via the host_cmd config register)
> > >         to start the reporting.
> > 
> > Here we go again. And what if reporting was already started previously?
> > I don't think it's a good idea to tweak the host/guest interface yet again.
> 
> This interface is much simpler, and I'm not sure if that would be an
> issue here now, because now the guest delivers the whole buffer of
> hints to host once, instead of hint by hint as before. And the guest
> notifies host after the buffer is delivered. In any case, the host
> doorbell handler will be invoked, if host doesn't need the hints at
> that time, it will just give back the buffer. There will be no stale
> hints remained in the ring now.
> 
> Best,
> Wei

I still think all the old arguments for cmd id apply.

-- 
MST

^ permalink raw reply

* Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-06-15 14:29 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <286AC319A985734F985F78AFA26841F7396A3D04@shsmsx102.ccr.corp.intel.com>

On Fri, Jun 15, 2018 at 02:11:23PM +0000, Wang, Wei W wrote:
> On Friday, June 15, 2018 7:42 PM, Michael S. Tsirkin wrote:
> > On Fri, Jun 15, 2018 at 12:43:11PM +0800, Wei Wang wrote:
> > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates
> > > the support of reporting hints of guest free pages to host via virtio-balloon.
> > >
> > > Host requests the guest to report free page hints by sending a command
> > > to the guest via setting the
> > VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT
> > > bit of the host_cmd config register.
> > >
> > > As the first step here, virtio-balloon only reports free page hints
> > > from the max order (10) free page list to host. This has generated
> > > similar good results as reporting all free page hints during our tests.
> > >
> > > TODO:
> > > - support reporting free page hints from smaller order free page lists
> > >   when there is a need/request from users.
> > >
> > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > Cc: Michael S. Tsirkin <mst@redhat.com>
> > > Cc: Michal Hocko <mhocko@kernel.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > ---
> > >  drivers/virtio/virtio_balloon.c     | 187 +++++++++++++++++++++++++++++--
> > -----
> > >  include/uapi/linux/virtio_balloon.h |  13 +++
> > >  2 files changed, 163 insertions(+), 37 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio_balloon.c
> > > b/drivers/virtio/virtio_balloon.c index 6b237e3..582a03b 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -43,6 +43,9 @@
> > >  #define OOM_VBALLOON_DEFAULT_PAGES 256  #define
> > > VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > >
> > > +/* The size of memory in bytes allocated for reporting free page
> > > +hints */ #define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
> > > +
> > >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > 
> > Doesn't this limit memory size of the guest we can report?
> > Apparently to several gigabytes ...
> > OTOH huge guests with lots of free memory is exactly where we would gain
> > the most ...
> 
> Yes, the 16-page array can report up to 32GB (each page can hold 512 addresses of 4MB free page blocks, i.e. 2GB free memory per page) free memory to host. It is not flexible.
> 
> How about allocating the buffer according to the guest memory size (proportional)? That is,
> 
> /* Calculates the maximum number of 4MB (equals to 1024 pages) free pages blocks that the system can have */
> 4m_page_blocks = totalram_pages / 1024;
> 
> /* Allocating one page can hold 512 free page blocks, so calculates the number of pages that can hold those 4MB blocks. And this allocation should not exceed 1024 pages */
> pages_to_allocate = min(4m_page_blocks / 512, 1024);
> 
> For a 2TB guests, which has 2^19 page blocks (4MB each), we will allocate 1024 pages as the buffer.
> 
> When the guest has large memory, it should be easier to succeed in allocation of large buffer. If that allocation fails, that implies that nothing would be got from the 4MB free page list.
> 
> I think the proportional allocation is simpler compared to other approaches like 
> - scattered buffer, which will complicate the get_from_free_page_list implementation;
> - one buffer to call get_from_free_page_list multiple times, which needs get_from_free_page_list to maintain states.. also too complicated.
> 
> Best,
> Wei
> 

That's more reasonable, but question remains what to do if that value
exceeds MAX_ORDER. I'd say maybe tell host we can't report it.

Also allocating it with GFP_KERNEL is out. You only want to take
it off the free list. So I guess __GFP_NOMEMALLOC and __GFP_ATOMIC.

Also you can't allocate this on device start. First totalram_pages can
change. Second that's too much memory to tie up forever.

-- 
MST

^ permalink raw reply

* RE: [PATCH v33 0/4] Virtio-balloon: support free page reporting
From: Wang, Wei W @ 2018-06-15 14:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <20180615142610-mutt-send-email-mst@kernel.org>

On Friday, June 15, 2018 7:30 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 15, 2018 at 12:43:09PM +0800, Wei Wang wrote:
> >       - remove the cmd id related interface. Now host can just send a free
> >         page hint command to the guest (via the host_cmd config register)
> >         to start the reporting.
> 
> Here we go again. And what if reporting was already started previously?
> I don't think it's a good idea to tweak the host/guest interface yet again.

This interface is much simpler, and I'm not sure if that would be an issue here now, because
now the guest delivers the whole buffer of hints to host once, instead of hint by hint as before. And the guest notifies host after the buffer is delivered. In any case, the host doorbell handler will be invoked, if host doesn't need the hints at that time, it will just give back the buffer. There will be no stale hints remained in the ring now.

Best,
Wei

^ permalink raw reply

* RE: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wang, Wei W @ 2018-06-15 14:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	torvalds@linux-foundation.org
In-Reply-To: <20180615144000-mutt-send-email-mst@kernel.org>

On Friday, June 15, 2018 7:42 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 15, 2018 at 12:43:11PM +0800, Wei Wang wrote:
> > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates
> > the support of reporting hints of guest free pages to host via virtio-balloon.
> >
> > Host requests the guest to report free page hints by sending a command
> > to the guest via setting the
> VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT
> > bit of the host_cmd config register.
> >
> > As the first step here, virtio-balloon only reports free page hints
> > from the max order (10) free page list to host. This has generated
> > similar good results as reporting all free page hints during our tests.
> >
> > TODO:
> > - support reporting free page hints from smaller order free page lists
> >   when there is a need/request from users.
> >
> > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > ---
> >  drivers/virtio/virtio_balloon.c     | 187 +++++++++++++++++++++++++++++--
> -----
> >  include/uapi/linux/virtio_balloon.h |  13 +++
> >  2 files changed, 163 insertions(+), 37 deletions(-)
> >
> > diff --git a/drivers/virtio/virtio_balloon.c
> > b/drivers/virtio/virtio_balloon.c index 6b237e3..582a03b 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -43,6 +43,9 @@
> >  #define OOM_VBALLOON_DEFAULT_PAGES 256  #define
> > VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> >
> > +/* The size of memory in bytes allocated for reporting free page
> > +hints */ #define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
> > +
> >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> 
> Doesn't this limit memory size of the guest we can report?
> Apparently to several gigabytes ...
> OTOH huge guests with lots of free memory is exactly where we would gain
> the most ...

Yes, the 16-page array can report up to 32GB (each page can hold 512 addresses of 4MB free page blocks, i.e. 2GB free memory per page) free memory to host. It is not flexible.

How about allocating the buffer according to the guest memory size (proportional)? That is,

/* Calculates the maximum number of 4MB (equals to 1024 pages) free pages blocks that the system can have */
4m_page_blocks = totalram_pages / 1024;

/* Allocating one page can hold 512 free page blocks, so calculates the number of pages that can hold those 4MB blocks. And this allocation should not exceed 1024 pages */
pages_to_allocate = min(4m_page_blocks / 512, 1024);

For a 2TB guests, which has 2^19 page blocks (4MB each), we will allocate 1024 pages as the buffer.

When the guest has large memory, it should be easier to succeed in allocation of large buffer. If that allocation fails, that implies that nothing would be got from the 4MB free page list.

I think the proportional allocation is simpler compared to other approaches like 
- scattered buffer, which will complicate the get_from_free_page_list implementation;
- one buffer to call get_from_free_page_list multiple times, which needs get_from_free_page_list to maintain states.. also too complicated.

Best,
Wei

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-15 12:31 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, qemu-devel, virtualization, Siwei Liu, Netdev,
	aaron.f.brown
In-Reply-To: <20180615113242.799974f6.cohuck@redhat.com>

On Fri, Jun 15, 2018 at 11:32:42AM +0200, Cornelia Huck wrote:
> On Fri, 15 Jun 2018 05:34:24 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Jun 14, 2018 at 12:02:31PM +0200, Cornelia Huck wrote:
> 
> > > > > I am not all that familiar with how Qemu manages network devices. If we can
> > > > > do all the
> > > > > required management of the primary/standby devices within Qemu, that is
> > > > > definitely a better
> > > > > approach without upper layer involvement.    
> > > > 
> > > > Right. I would imagine in the extreme case the upper layer doesn't
> > > > have to be involved at all if QEMU manages all hot plug/unplug logic.
> > > > The management tool can supply passthrough device and virtio with the
> > > > same group UUID, QEMU auto-manages the presence of the primary, and
> > > > hot plug the device as needed before or after the migration.  
> > > 
> > > I do not really see how you can manage that kind of stuff in QEMU only.  
> > 
> > So right now failover is limited to pci passthrough devices only.
> > The idea is to realize the vfio device but not expose it
> > to guest. Have a separate command to expose it to guest.
> > Hotunplug would also hide it from guest but not unrealize it.
> 
> So, this would not be real hot(un)plug, but 'hide it from the guest',
> right? The concept of "we have it realized in QEMU, but the guest can't
> discover and use it" should be translatable to non-pci as well (at
> least for ccw).
> 
> > 
> > This will help ensure that e.g. on migration failure we can
> > re-expose the device without risk of running out of resources.
> 
> Makes sense.
> 
> Should that 'hidden' state be visible/settable from outside as well
> (e.g. via a property)? I guess yes, so that management software has a
> chance to see whether a device is visible.

Might be handy for debug, but note that since QEMU manages this
state it's transient: can change at any time, so it's kind
of hard for management to rely on it.

> Settable may be useful if we
> find another use case for hiding realized devices.

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-15 11:48 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Netdev, aaron.f.brown
In-Reply-To: <CADGSJ23WnTmVKHezm3t0V6M2sWeHaOUoTjdXkmrvbO0EF83hMg@mail.gmail.com>

On Thu, 14 Jun 2018 18:57:11 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> Thank you for sharing your thoughts, Cornelia. With questions below, I
> think you raised really good points, some of which I don't have answer
> yet and would also like to explore here.
> 
> First off, I don't want to push the discussion to the extreme at this
> point, or sell anything about having QEMU manage everything
> automatically. Don't get me wrong, it's not there yet. Let's don't
> assume we are tied to a specific or concerte solution. I think the key
> for our discussion might be to define or refine the boundary between
> VM and guest,  e.g. what each layer is expected to control and manage
> exactly.
> 
> In my view, there might be possibly 3 different options to represent
> the failover device conceipt to QEMU and libvirt (or any upper layer
> software):
> 
> a. Seperate device: in this model, virtio and passthough remains
> separate devices just as today. QEMU exposes the standby feature bit
> for virtio, and publish status/event around the negotiation process of
> this feature bit for libvirt to react upon. Since Libvirt has the
> pairing relationship itself, maybe through MAC address or something
> else, it can control the presence of primary by hot plugging or
> unplugging the passthrough device, although it has to work tightly
> with virtio's feature negotation process. Not just for migration but
> also various corner scenarios (driver/feature ok, device reset,
> reboot, legacy guest etc) along virtio's feature negotiation.

Yes, that one has obvious tie-ins to virtio's modus operandi.

> 
> b. Coupled device: in this model, virtio and passthough devices are
> weakly coupled using some group ID, i.e. QEMU match the passthough
> device for a standby virtio instance by comparing the group ID value
> present behind each device's bridge. Libvirt provides QEMU the group
> ID for both type of devices, and only deals with hot plug for
> migration, by checking some migration status exposed (e.g. the feature
> negotiation status on the virtio device) by QEMU. QEMU manages the
> visibility of the primary in guest along virtio's feature negotiation
> process.

I'm a bit confused here. What, exactly, ties the two devices together?
If libvirt already has the knowledge that it should manage the two as a
couple, why do we need the group id (or something else for other
architectures)? (Maybe I'm simply missing something because I'm not
that familiar with pci.)

> 
> c. Fully combined device: in this model, virtio and passthough devices
> are viewed as a single VM interface altogther. QEMU not just controls
> the visibility of the primary in guest, but can also manage the
> exposure of the passthrough for migratability. It can be like that
> libvirt supplies the group ID to QEMU. Or libvirt does not even have
> to provide group ID for grouping the two devices, if just one single
> combined device is exposed by QEMU. In either case, QEMU manages all
> aspect of such internal construct, including virtio feature
> negotiation, presence of the primary, and live migration.

Same question as above.

> 
> It looks like to me that, in your opinion, you seem to prefer go with
> (a). While I'm actually okay with either (b) or (c). Do I understand
> your point correctly?

I'm not yet preferring anything, as I'm still trying to understand how
this works :) I hope we can arrive at a model that covers the use case
and that is also flexible enough to be extended to other platforms.

> 
> The reason that I feel that (a) might not be ideal, just as Michael
> alluded to (quoting below), is that as management stack, it really
> doesn't need to care about the detailed process of feature negotiation
> (if we view the guest presence of the primary as part of feature
> negotiation at an extended level not just virtio). All it needs to be
> done is to hand in the required devices to QEMU and that's all. Why do
> we need to addd various hooks, events for whichever happens internally
> within the guest?
> 
> ''
> Primary device is added with a special "primary-failover" flag.
> A virtual machine is then initialized with just a standby virtio
> device. Primary is not yet added.
> 
> Later QEMU detects that guest driver device set DRIVER_OK.
> It then exposes the primary device to the guest, and triggers
> a device addition event (hot-plug event) for it.
> 
> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
> to remove the primary driver.  In particular, if QEMU detects guest
> re-initialization (e.g. by detecting guest reset) it immediately removes
> the primary device.
> ''
> 
> and,
> 
> ''
> management just wants to give the primary to guest and later take it back,
> it really does not care about the details of the process,
> so I don't see what does pushing it up the stack buy you.
> 
> So I don't think it *needs* to be done in libvirt. It probably can if you
> add a bunch of hooks so it knows whenever vm reboots, driver binds and
> unbinds from device, and can check that backup flag was set.
> If you are pushing for a setup like that please get a buy-in
> from libvirt maintainers or better write a patch.
> ''

This actually seems to mean the opposite to me: We need to know what
the guest is doing and when, as it directly drives what we need to do
with the devices. If we switch to a visibility vs a hotplug model (see
the other mail), we might be able to handle that part within qemu.
However, I don't see how you get around needing libvirt to actually set
this up in the first place and to handle migration per se.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox