Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* RE: [PATCH v5 1/1] crypto: add virtio-crypto driver
From: Gonglei (Arei) @ 2016-12-08  2:00 UTC (permalink / raw)
  To: Herbert Xu
  Cc: virtio-dev@lists.oasis-open.org, Huangweidong (C),
	Claudio Fontana, mst@redhat.com, qemu-devel@nongnu.org,
	Hanweidong (Randy), Luonengjun, linux-kernel@vger.kernel.org,
	Wanzongshun (Vincent), virtualization@lists.linux-foundation.org,
	Xuquan (Quan Xu), linux-crypto@vger.kernel.org,
	stefanha@redhat.com, Zhoujian (jay, Euler), longpeng,
	davem@davemloft.net, Wubin (H), arei.gonglei
In-Reply-To: <20161207094843.GA19970@gondor.apana.org.au>

>
> Subject: Re: [PATCH v5 1/1] crypto: add virtio-crypto driver
> 
> On Tue, Dec 06, 2016 at 09:01:49AM +0000, Gonglei (Arei) wrote:
> >
> > Would you please review and/or ack the virtio_crypto_algs.c?
> > It is the realization of specified algs based on Linux Crypto Framework.
> 
> I have no issues with it.  If the virtio folks are happy with
> the interface with the host then I'll take this patch.
> 
Cool, thanks!

Regards,
-Gonglei

^ permalink raw reply

* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
From: Andrea Arcangeli @ 2016-12-07 20:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: mhocko@suse.com, kvm@vger.kernel.org, mst@redhat.com, Li, Liang Z,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, kirill.shutemov@linux.intel.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, dgilbert@redhat.com
In-Reply-To: <b58fd9f6-d9dd-dd56-d476-dd342174dac5@intel.com>

On Wed, Dec 07, 2016 at 11:54:34AM -0800, Dave Hansen wrote:
> We're talking about a bunch of different stuff which is all being
> conflated.  There are 3 issues here that I can see.  I'll attempt to
> summarize what I think is going on:
> 
> 1. Current patches do a hypercall for each order in the allocator.
>    This is inefficient, but independent from the underlying data
>    structure in the ABI, unless bitmaps are in play, which they aren't.
> 2. Should we have bitmaps in the ABI, even if they are not in use by the
>    guest implementation today?  Andrea says they have zero benefits
>    over a pfn/len scheme.  Dave doesn't think they have zero benefits
>    but isn't that attached to them.  QEMU's handling gets more
>    complicated when using a bitmap.
> 3. Should the ABI contain records each with a pfn/len pair or a
>    pfn/order pair?
>    3a. 'len' is more flexible, but will always be a power-of-two anyway
> 	for high-order pages (the common case)

Len wouldn't be a power of two practically only if we detect adjacent
pages of smaller order that may merge into larger orders we already
allocated (or the other way around).

[addr=2M, len=2M] allocated at order 9 pass
[addr=4M, len=1M] allocated at order 8 pass -> merge as [addr=2M, len=3M]

Not sure if it would be worth it, but that unless we do this, page-order or
len won't make much difference.

>    3b. if we decide not to have a bitmap, then we basically have plenty
> 	of space for 'len' and should just do it
>    3c. It's easiest for the hypervisor to turn pfn/len into the
>        madvise() calls that it needs.
> 
> Did I miss anything?

I think you summarized fine all my arguments in your summary.

> FWIW, I don't feel that strongly about the bitmap.  Li had one
> originally, but I think the code thus far has demonstrated a huge
> benefit without even having a bitmap.
> 
> I've got no objections to ripping the bitmap out of the ABI.

I think we need to see a statistic showing the number of bits set in
each bitmap in average, after some uptime and lru churn, like running
stresstest app for a while with I/O and then inflate the balloon and
count:

1) how many bits were set vs total number of bits used in bitmaps

2) how many times bitmaps were used vs bitmap_len = 0 case of single
   page

My guess would be like very low percentage for both points.

> Surely we can think of a few ways...
> 
> A bitmap is 64x more dense if the lists are unordered.  It means being
> able to store ~32k*2M=64G worth of 2M pages in one data page vs. ~1G.
> That's 64x fewer cachelines to touch, 64x fewer pages to move to the
> hypervisor and lets us allocate 1/64th the memory.  Given a maximum
> allocation that we're allowed, it lets us do 64x more per-pass.
> 
> Now, are those benefits worth it?  Maybe not, but let's not pretend they
> don't exist. ;)

In the best case there are benefits obviously, the question is how
common the best case is.

The best case if I understand correctly is all high order not
available, but plenty of order 0 pages available at phys address X,
X+8k, X+16k, X+(8k*nr_bits_in_bitmap). How common is that 0 pages
exist but they're not at an address < X or > X+(8k*nr_bits_in_bitmap)?

> Yes, the current code sends one batch of pages up to the hypervisor per
> order.  But, this has nothing to do with the underlying data structure,
> or the choice to have an order vs. len in the ABI.
> 
> What you describe here is obviously more efficient.

And it isn't possible with the current ABI.

So there is a connection with the MAX_ORDER..0 allocation loop and the
ABI change, but I agree any of the ABI proposed would still allow for
it this logic to be used. Bitmap or not bitmap, the loop would still
work.

^ permalink raw reply

* Re: [PATCH v5 1/1] crypto: add virtio-crypto driver
From: Sam Ravnborg @ 2016-12-07 20:20 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: Xuquan (Quan Xu), Huangweidong (C), mst@redhat.com,
	qemu-devel@nongnu.org, Wanzongshun (Vincent),
	virtualization@lists.linux-foundation.org, Zhoujian (jay, Euler),
	arei.gonglei@hotmail.com, virtio-dev@lists.oasis-open.org,
	kbuild test robot, Hanweidong (Randy), longpeng,
	linux-crypto@vger.kernel.org, Luonengjun, stefanha@redhat.com,
	herbert@gondor.apana.org.au, Claudio Fontana, linux-kernel
In-Reply-To: <33183CC9F5247A488A2544077AF19020DA1528DE@DGGEMA505-MBX.china.huawei.com>

On Mon, Dec 05, 2016 at 03:12:52AM +0000, Gonglei (Arei) wrote:
> I don't think the root cause of those warnings are introduced by virtio-crypto driver.
> 
> What's your opinion? Sam and David?

Root cause here is that arch/sparc/include/asm/topology_64.h
references cpu_data without including arch/sparc/include/asm/cpudata.h

I think other architectures pull in the dependency from
either smp.h or they have it topology.h.

The easy fix would be to include cpudata.h in arch/sparc/include/asm/topology_64.h.
And that should also be a correct fix.

Could you include this in your patch-set and build test it?

	Sam

^ permalink raw reply

* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
From: Dave Hansen @ 2016-12-07 19:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: mhocko@suse.com, kvm@vger.kernel.org, mst@redhat.com, Li, Liang Z,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, kirill.shutemov@linux.intel.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, dgilbert@redhat.com
In-Reply-To: <20161207183817.GE28786@redhat.com>

We're talking about a bunch of different stuff which is all being
conflated.  There are 3 issues here that I can see.  I'll attempt to
summarize what I think is going on:

1. Current patches do a hypercall for each order in the allocator.
   This is inefficient, but independent from the underlying data
   structure in the ABI, unless bitmaps are in play, which they aren't.
2. Should we have bitmaps in the ABI, even if they are not in use by the
   guest implementation today?  Andrea says they have zero benefits
   over a pfn/len scheme.  Dave doesn't think they have zero benefits
   but isn't that attached to them.  QEMU's handling gets more
   complicated when using a bitmap.
3. Should the ABI contain records each with a pfn/len pair or a
   pfn/order pair?
   3a. 'len' is more flexible, but will always be a power-of-two anyway
	for high-order pages (the common case)
   3b. if we decide not to have a bitmap, then we basically have plenty
	of space for 'len' and should just do it
   3c. It's easiest for the hypervisor to turn pfn/len into the
       madvise() calls that it needs.

Did I miss anything?

On 12/07/2016 10:38 AM, Andrea Arcangeli wrote:
> On Wed, Dec 07, 2016 at 08:57:01AM -0800, Dave Hansen wrote:
>> It is more space-efficient.  We're fitting the order into 6 bits, which
>> would allows the full 2^64 address space to be represented in one entry,
> 
> Very large order is the same as very large len, 6 bits of order or 8
> bytes of len won't really move the needle here, simpler code is
> preferable.

Agreed.  But without seeing them side-by-side I'm not sure we can say
which is simpler.

> The main benefit of "len" is that it can be more granular, plus it's
> simpler than the bitmap too. Eventually all this stuff has to end up
> into a madvisev (not yet upstream but somebody posted it for jemalloc
> and should get merged eventually).
> 
> So the bitmap shall be demuxed to a addr,len array anyway, the bitmap
> won't ever be sent to the madvise syscall, which makes the
> intermediate representation with the bitmap a complication with
> basically no benefits compared to a (N, [addr1,len1], .., [addrN,
> lenN]) representation.

FWIW, I don't feel that strongly about the bitmap.  Li had one
originally, but I think the code thus far has demonstrated a huge
benefit without even having a bitmap.

I've got no objections to ripping the bitmap out of the ABI.

>> and leaves room for the bitmap size to be encoded as well, if we decide
>> we need a bitmap in the future.
> 
> How would a bitmap ever be useful with very large page-order?

Surely we can think of a few ways...

A bitmap is 64x more dense if the lists are unordered.  It means being
able to store ~32k*2M=64G worth of 2M pages in one data page vs. ~1G.
That's 64x fewer cachelines to touch, 64x fewer pages to move to the
hypervisor and lets us allocate 1/64th the memory.  Given a maximum
allocation that we're allowed, it lets us do 64x more per-pass.

Now, are those benefits worth it?  Maybe not, but let's not pretend they
don't exist. ;)

>> If that was purely a length, we'd be limited to 64*4k pages per entry,
>> which isn't even a full large page.
> 
> I don't follow here.
> 
> What we suggest is to send the data down represented as (N,
> [addr1,len1], ..., [addrN, lenN]) which allows infinite ranges each
> one of maximum length 2^64, so 2^64 multiplied infinite times if you
> wish. Simplifying the code and not having any bitmap at all and no :6
> :6 bits either.
> 
> The high order to low order loop of allocations is the interesting part
> here, not the bitmap, and the fact of doing a single vmexit to send
> the large ranges.

Yes, the current code sends one batch of pages up to the hypervisor per
order.  But, this has nothing to do with the underlying data structure,
or the choice to have an order vs. len in the ABI.

What you describe here is obviously more efficient.

> Considering the loop that allocates starting from MAX_ORDER..1, the
> chance the bitmap is actually getting filled with more than one bit at
> page_shift of PAGE_SHIFT should be very low after some uptime.

Yes, if bitmaps were in use, this is true.  I think a guest populating
bitmaps would probably not use the same algorithm.

> By the very nature of this loop, if we already exacerbates all high
> order buddies, the page-order 0 pages obtained are going to be fairly
> fragmented reducing the usefulness of the bitmap and potentially only
> wasting CPU/memory.

^ permalink raw reply

* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
From: Andrea Arcangeli @ 2016-12-07 18:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: mhocko@suse.com, kvm@vger.kernel.org, mst@redhat.com, Li, Liang Z,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, kirill.shutemov@linux.intel.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, dgilbert@redhat.com
In-Reply-To: <3954fe69-15ac-43eb-e14b-e2bfe976be33@intel.com>

On Wed, Dec 07, 2016 at 10:44:31AM -0800, Dave Hansen wrote:
> On 12/07/2016 10:38 AM, Andrea Arcangeli wrote:
> >> > and leaves room for the bitmap size to be encoded as well, if we decide
> >> > we need a bitmap in the future.
> > How would a bitmap ever be useful with very large page-order?
> 
> Please, guys.  Read the patches.  *Please*.

I did read the code but you didn't answer my question.

Why should a feature exist in the code that will never be useful. Why
do you think we could ever decide we'll need the bitmap in the future
for high order pages?

> The current code doesn't even _use_ a bitmap.

It's not using it right now, my question is exactly when it will ever
use it?

Leaving the bitmap only for order 0 allocations when you already wiped
all high pages orders from the buddy, doesn't seem very good idea
overall as the chance you got order 0 pages with close physical
address doesn't seem very high. It would be high if the loop that eat
into every possible higher order didn't run first, but such loop just
run and already wiped everything.

Also note, we need to call compaction very aggressive before falling
back from order 9 down to order 8. Ideally we should never use the
page_shift = PAGE_SHIFT case at all! Which leaves the bitmap as best
as an optimization for something that is suboptimal case already. If
the bitmap starts to payoff it means the admin did a mistake and
shrunk the guest too much.

^ permalink raw reply

* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
From: Dave Hansen @ 2016-12-07 18:44 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: mhocko@suse.com, kvm@vger.kernel.org, mst@redhat.com, Li, Liang Z,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, kirill.shutemov@linux.intel.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, dgilbert@redhat.com
In-Reply-To: <20161207183817.GE28786@redhat.com>

On 12/07/2016 10:38 AM, Andrea Arcangeli wrote:
>> > and leaves room for the bitmap size to be encoded as well, if we decide
>> > we need a bitmap in the future.
> How would a bitmap ever be useful with very large page-order?

Please, guys.  Read the patches.  *Please*.

The current code doesn't even _use_ a bitmap.

^ permalink raw reply

* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
From: Andrea Arcangeli @ 2016-12-07 18:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: mhocko@suse.com, kvm@vger.kernel.org, mst@redhat.com, Li, Liang Z,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, kirill.shutemov@linux.intel.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, dgilbert@redhat.com
In-Reply-To: <d3ff453c-56fa-19de-317c-1c82456f2831@intel.com>

Hello,

On Wed, Dec 07, 2016 at 08:57:01AM -0800, Dave Hansen wrote:
> It is more space-efficient.  We're fitting the order into 6 bits, which
> would allows the full 2^64 address space to be represented in one entry,

Very large order is the same as very large len, 6 bits of order or 8
bytes of len won't really move the needle here, simpler code is
preferable.

The main benefit of "len" is that it can be more granular, plus it's
simpler than the bitmap too. Eventually all this stuff has to end up
into a madvisev (not yet upstream but somebody posted it for jemalloc
and should get merged eventually).

So the bitmap shall be demuxed to a addr,len array anyway, the bitmap
won't ever be sent to the madvise syscall, which makes the
intermediate representation with the bitmap a complication with
basically no benefits compared to a (N, [addr1,len1], .., [addrN,
lenN]) representation.

If you prefer 1 byte of order (not just 6 bits) instead 8bytes of len
that's possible too, I wouldn't be against that, the conversion before
calling madvise would be pretty efficient too.

> and leaves room for the bitmap size to be encoded as well, if we decide
> we need a bitmap in the future.

How would a bitmap ever be useful with very large page-order?

> If that was purely a length, we'd be limited to 64*4k pages per entry,
> which isn't even a full large page.

I don't follow here.

What we suggest is to send the data down represented as (N,
[addr1,len1], ..., [addrN, lenN]) which allows infinite ranges each
one of maximum length 2^64, so 2^64 multiplied infinite times if you
wish. Simplifying the code and not having any bitmap at all and no :6
:6 bits either.

The high order to low order loop of allocations is the interesting part
here, not the bitmap, and the fact of doing a single vmexit to send
the large ranges.

Once we pull out the largest order regions, we just add them to the
array as [addr,1UL<<order], when the array reaches a maximum N number
of entries or we fail a order 0 allocation, we flush all those entries
down to qemu. Qemu then builds the iov for madvisev and it's pretty
much a 1:1 conversion, not a decoding operation converting the bitmap
in the (N, [addr1,len1], ..., [addrN, lenN]) for madvisev (or a flood
of madvise MADV_DONTNEED with current kernels).

Considering the loop that allocates starting from MAX_ORDER..1, the
chance the bitmap is actually getting filled with more than one bit at
page_shift of PAGE_SHIFT should be very low after some uptime.

By the very nature of this loop, if we already exacerbates all high
order buddies, the page-order 0 pages obtained are going to be fairly
fragmented reducing the usefulness of the bitmap and potentially only
wasting CPU/memory.

^ permalink raw reply

* Re: automatic IRQ affinity for virtio
From: Christoph Hellwig @ 2016-12-07 18:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: axboe, linux-block, Christoph Hellwig, linux-kernel,
	virtualization
In-Reply-To: <20161127053515-mutt-send-email-mst@kernel.org>

On Sun, Nov 27, 2016 at 05:37:04AM +0200, Michael S. Tsirkin wrote:
> On Fri, Nov 25, 2016 at 08:25:38AM +0100, Christoph Hellwig wrote:
> > Btw, what's the best way to get any response to this series?
> > But this and the predecessor seem to have completly fallen on deaf
> > ears.
> 
> I'm sorry, I intend to review soon and if OK merge it.
> 
> The fact that it depends on tip means I can't put
> it in my first pull request (until tip is merged)
> so I deferred review somewhat.

any news before the end of the merge window?

^ permalink raw reply

* Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
From: Dave Hansen @ 2016-12-07 16:57 UTC (permalink / raw)
  To: David Hildenbrand, Li, Liang Z, kvm@vger.kernel.org
  Cc: Andrea Arcangeli, mhocko@suse.com, mst@redhat.com,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, dgilbert@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	kirill.shutemov@linux.intel.com
In-Reply-To: <2171e091-46ee-decd-7348-772555d3a5e3@redhat.com>

Removing silly virtio-dev@ list because it's bouncing mail...

On 12/07/2016 08:21 AM, David Hildenbrand wrote:
>> Li's current patches do that.  Well, maybe not pfn/length, but they do
>> take a pfn and page-order, which fits perfectly with the kernel's
>> concept of high-order pages.
> 
> So we can send length in powers of two. Still, I don't see any benefit
> over a simple pfn/len schema. But I'll have a more detailed look at the
> implementation first, maybe that will enlighten me :)

It is more space-efficient.  We're fitting the order into 6 bits, which
would allows the full 2^64 address space to be represented in one entry,
and leaves room for the bitmap size to be encoded as well, if we decide
we need a bitmap in the future.

If that was purely a length, we'd be limited to 64*4k pages per entry,
which isn't even a full large page.

^ permalink raw reply

* Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
From: David Hildenbrand @ 2016-12-07 16:21 UTC (permalink / raw)
  To: Dave Hansen, Li, Liang Z, kvm@vger.kernel.org
  Cc: Andrea Arcangeli, virtio-dev@lists.oasis-open.org,
	mhocko@suse.com, mst@redhat.com, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	dgilbert@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	kirill.shutemov@linux.intel.com
In-Reply-To: <0b83db29-ebad-2a70-8d61-756d33e33a48@intel.com>

>>>
>>> I did something similar. Filled the balloon with 15GB for a 16GB idle
>>> guest, by
>>> using bitmap, the madvise count was reduced to 605. when using the
>>> PFNs, the madvise count
>>> was 3932160. It means there are quite a lot consecutive bits in the
>>> bitmap.
>>> I didn't test for a guest with heavy memory workload.
>>
>> Would it then even make sense to go one step further and report {pfn,
>> length} combinations?
>>
>> So simply send over an array of {pfn, length}?
>
> Li's current patches do that.  Well, maybe not pfn/length, but they do
> take a pfn and page-order, which fits perfectly with the kernel's
> concept of high-order pages.

So we can send length in powers of two. Still, I don't see any benefit
over a simple pfn/len schema. But I'll have a more detailed look at the
implementation first, maybe that will enlighten me :)

>
>> And it makes sense if you think about:
>>
>> a) hugetlb backing: The host may only be able to free huge pages (we
>> might want to communicate that to the guest later, that's another
>> story). Still we would have to send bitmaps full of 4k frames (512 bits
>> for 2mb frames). Of course, we could add a way to communicate that we
>> are using a different bitmap-granularity.
>
> Yeah, please read the patches.  If they're not clear, then the
> descriptions need work, but this is done already.
>

I missed the page_shift, thanks for the hint.

-- 

David

^ permalink raw reply

* Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
From: Dave Hansen @ 2016-12-07 15:45 UTC (permalink / raw)
  To: David Hildenbrand, Li, Liang Z, kvm@vger.kernel.org
  Cc: Andrea Arcangeli, virtio-dev@lists.oasis-open.org,
	mhocko@suse.com, mst@redhat.com, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	dgilbert@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	kirill.shutemov@linux.intel.com
In-Reply-To: <0b18c636-ee67-cbb4-1ba3-81a06150db76@redhat.com>

On 12/07/2016 07:42 AM, David Hildenbrand wrote:
> Am 07.12.2016 um 14:35 schrieb Li, Liang Z:
>>> Am 30.11.2016 um 09:43 schrieb Liang Li:
>>>> This patch set contains two parts of changes to the virtio-balloon.
>>>>
>>>> One is the change for speeding up the inflating & deflating process,
>>>> the main idea of this optimization is to use bitmap to send the page
>>>> information to host instead of the PFNs, to reduce the overhead of
>>>> virtio data transmission, address translation and madvise(). This can
>>>> help to improve the performance by about 85%.
>>>
>>> Do you have some statistics/some rough feeling how many consecutive
>>> bits are
>>> usually set in the bitmaps? Is it really just purely random or is
>>> there some
>>> granularity that is usually consecutive?
>>>
>>
>> I did something similar. Filled the balloon with 15GB for a 16GB idle
>> guest, by
>> using bitmap, the madvise count was reduced to 605. when using the
>> PFNs, the madvise count
>> was 3932160. It means there are quite a lot consecutive bits in the
>> bitmap.
>> I didn't test for a guest with heavy memory workload.
> 
> Would it then even make sense to go one step further and report {pfn,
> length} combinations?
> 
> So simply send over an array of {pfn, length}?

Li's current patches do that.  Well, maybe not pfn/length, but they do
take a pfn and page-order, which fits perfectly with the kernel's
concept of high-order pages.

> And it makes sense if you think about:
> 
> a) hugetlb backing: The host may only be able to free huge pages (we
> might want to communicate that to the guest later, that's another
> story). Still we would have to send bitmaps full of 4k frames (512 bits
> for 2mb frames). Of course, we could add a way to communicate that we
> are using a different bitmap-granularity.

Yeah, please read the patches.  If they're not clear, then the
descriptions need work, but this is done already.

^ permalink raw reply

* Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
From: David Hildenbrand @ 2016-12-07 15:42 UTC (permalink / raw)
  To: Li, Liang Z, kvm@vger.kernel.org
  Cc: Andrea Arcangeli, virtio-dev@lists.oasis-open.org,
	mhocko@suse.com, mst@redhat.com, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, Hansen, Dave,
	dgilbert@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	kirill.shutemov@linux.intel.com
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E3A130C01@shsmsx102.ccr.corp.intel.com>

Am 07.12.2016 um 14:35 schrieb Li, Liang Z:
>> Am 30.11.2016 um 09:43 schrieb Liang Li:
>>> This patch set contains two parts of changes to the virtio-balloon.
>>>
>>> One is the change for speeding up the inflating & deflating process,
>>> the main idea of this optimization is to use bitmap to send the page
>>> information to host instead of the PFNs, to reduce the overhead of
>>> virtio data transmission, address translation and madvise(). This can
>>> help to improve the performance by about 85%.
>>
>> Do you have some statistics/some rough feeling how many consecutive bits are
>> usually set in the bitmaps? Is it really just purely random or is there some
>> granularity that is usually consecutive?
>>
>
> I did something similar. Filled the balloon with 15GB for a 16GB idle guest, by
> using bitmap, the madvise count was reduced to 605. when using the PFNs, the madvise count
> was 3932160. It means there are quite a lot consecutive bits in the bitmap.
> I didn't test for a guest with heavy memory workload.

Would it then even make sense to go one step further and report {pfn, 
length} combinations?

So simply send over an array of {pfn, length}?

This idea came up when talking to Andrea Arcangeli (put him on cc).

And it makes sense if you think about:

a) hugetlb backing: The host may only be able to free huge pages (we 
might want to communicate that to the guest later, that's another 
story). Still we would have to send bitmaps full of 4k frames (512 bits 
for 2mb frames). Of course, we could add a way to communicate that we 
are using a different bitmap-granularity.

b) if we really inflate huge memory regions (and it sounds like that 
according to your measurements), we can minimize the communication to 
the hypervisor and therefore the madvice calls.

c) we don't want to optimize for inflating guests with almost full 
memory (and therefore little consecutive memory areas) - my opinion :)


Thanks for the explanation!

-- 

David

^ permalink raw reply

* Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
From: Dave Hansen @ 2016-12-07 15:34 UTC (permalink / raw)
  To: Li, Liang Z, David Hildenbrand, kvm@vger.kernel.org
  Cc: virtio-dev@lists.oasis-open.org, mhocko@suse.com, mst@redhat.com,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, dgilbert@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	kirill.shutemov@linux.intel.com
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E3A130C01@shsmsx102.ccr.corp.intel.com>

On 12/07/2016 05:35 AM, Li, Liang Z wrote:
>> Am 30.11.2016 um 09:43 schrieb Liang Li:
>> IOW in real examples, do we have really large consecutive areas or are all
>> pages just completely distributed over our memory?
> 
> The buddy system of Linux kernel memory management shows there should
> be quite a lot of consecutive pages as long as there are a portion of
> free memory in the guest.
...
> If all pages just completely distributed over our memory, it means
> the memory fragmentation is very serious, the kernel has the
> mechanism to avoid this happened.

While it is correct that the kernel has anti-fragmentation mechanisms, I
don't think it invalidates the question as to whether a bitmap would be
too sparse to be effective.

> In the other hand, the inflating should not happen at this time because the guest is almost
> 'out of memory'.

I don't think this is correct.  Most systems try to run with relatively
little free memory all the time, using the bulk of it as page cache.  We
have no reason to expect that ballooning will only occur when there is
lots of actual free memory and that it will not occur when that same
memory is in use as page cache.

In these patches, you're effectively still sending pfns.  You're just
sending one pfn per high-order page which is giving a really nice
speedup.  IMNHO, you're avoiding doing a real bitmap because creating a
bitmap means either have a really big bitmap, or you would have to do
some sorting (or multiple passes) of the free lists before populating a
smaller bitmap.

Like David, I would still like to see some data on whether the choice
between bitmaps and pfn lists is ever clearly in favor of bitmaps.  You
haven't convinced me, at least, that the data isn't even worth collecting.

^ permalink raw reply

* [PATCH v2 4/4] vsock: cancel packets when failing to connect
From: Peng Tao @ 2016-12-07 15:14 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: netdev, Peng Tao, kvm, virtualization
In-Reply-To: <1481123652-80603-1-git-send-email-bergwolf@gmail.com>

Otherwise we'll leave the packets queued until releasing vsock device.
E.g., if guest is slow to start up, resulting ETIMEDOUT on connect, guest
will get the connect requests from failed host sockets.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Peng Tao <bergwolf@gmail.com>
---
 include/linux/virtio_vsock.h            | 7 +++++++
 net/vmw_vsock/af_vsock.c                | 7 +++++++
 net/vmw_vsock/virtio_transport_common.c | 7 -------
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index b92e88d..ff6850a 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -156,4 +156,11 @@ void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vs
 u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 wanted);
 void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit);
 
+static inline const struct virtio_transport *virtio_transport_get_ops(void)
+{
+	const struct vsock_transport *t = vsock_core_get_transport();
+
+	return container_of(t, struct virtio_transport, transport);
+}
+
 #endif /* _LINUX_VIRTIO_VSOCK_H */
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 8a398b3..ebb50d6 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -104,6 +104,7 @@
 #include <linux/unistd.h>
 #include <linux/wait.h>
 #include <linux/workqueue.h>
+#include <linux/virtio_vsock.h>
 #include <net/sock.h>
 #include <net/af_vsock.h>
 
@@ -1105,6 +1106,7 @@ static void vsock_connect_timeout(struct work_struct *work)
 {
 	struct sock *sk;
 	struct vsock_sock *vsk;
+	int cancel = 0;
 
 	vsk = container_of(work, struct vsock_sock, dwork.work);
 	sk = sk_vsock(vsk);
@@ -1115,8 +1117,11 @@ static void vsock_connect_timeout(struct work_struct *work)
 		sk->sk_state = SS_UNCONNECTED;
 		sk->sk_err = ETIMEDOUT;
 		sk->sk_error_report(sk);
+		cancel = 1;
 	}
 	release_sock(sk);
+	if (cancel)
+		virtio_transport_get_ops()->cancel_pkt(vsk);
 
 	sock_put(sk);
 }
@@ -1223,11 +1228,13 @@ static int vsock_stream_connect(struct socket *sock, struct sockaddr *addr,
 			err = sock_intr_errno(timeout);
 			sk->sk_state = SS_UNCONNECTED;
 			sock->state = SS_UNCONNECTED;
+			virtio_transport_get_ops()->cancel_pkt(vsk);
 			goto out_wait;
 		} else if (timeout == 0) {
 			err = -ETIMEDOUT;
 			sk->sk_state = SS_UNCONNECTED;
 			sock->state = SS_UNCONNECTED;
+			virtio_transport_get_ops()->cancel_pkt(vsk);
 			goto out_wait;
 		}
 
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index cc1eeb5..72c5dff 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -25,13 +25,6 @@
 /* How long to wait for graceful shutdown of a connection */
 #define VSOCK_CLOSE_TIMEOUT (8 * HZ)
 
-static const struct virtio_transport *virtio_transport_get_ops(void)
-{
-	const struct vsock_transport *t = vsock_core_get_transport();
-
-	return container_of(t, struct virtio_transport, transport);
-}
-
 struct virtio_vsock_pkt *
 virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
 			   size_t len,
-- 
2.7.4

^ permalink raw reply related

* [PATCH v2 3/4] vsock: add pkt cancel capability
From: Peng Tao @ 2016-12-07 15:14 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: netdev, Peng Tao, kvm, virtualization
In-Reply-To: <1481123652-80603-1-git-send-email-bergwolf@gmail.com>

Signed-off-by: Peng Tao <bergwolf@gmail.com>
---
 net/vmw_vsock/virtio_transport.c | 42 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 936d7ee..a5f3833 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -170,6 +170,47 @@ virtio_transport_send_pkt(struct virtio_vsock_pkt *pkt)
 	return len;
 }
 
+static int
+virtio_transport_cancel_pkt(struct vsock_sock *vsk)
+{
+	struct virtio_vsock *vsock;
+	struct virtio_vsock_pkt *pkt, *n;
+	int cnt = 0;
+	LIST_HEAD(freeme);
+
+	vsock = virtio_vsock_get();
+	if (!vsock) {
+		return -ENODEV;
+	}
+
+	spin_lock_bh(&vsock->send_pkt_list_lock);
+	list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
+		if (pkt->vsk != vsk)
+			continue;
+		list_move(&pkt->list, &freeme);
+	}
+	spin_unlock_bh(&vsock->send_pkt_list_lock);
+
+	list_for_each_entry_safe(pkt, n, &freeme, list) {
+		if (pkt->reply)
+			cnt++;
+		list_del(&pkt->list);
+		virtio_transport_free_pkt(pkt);
+	}
+
+	if (cnt) {
+		struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
+		int new_cnt;
+
+		new_cnt = atomic_sub_return(cnt, &vsock->queued_replies);
+		if (new_cnt + cnt >= virtqueue_get_vring_size(rx_vq) &&
+		    new_cnt < virtqueue_get_vring_size(rx_vq))
+			queue_work(virtio_vsock_workqueue, &vsock->rx_work);
+	}
+
+	return 0;
+}
+
 static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
 {
 	int buf_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
@@ -453,6 +494,7 @@ static struct virtio_transport virtio_transport = {
 	},
 
 	.send_pkt = virtio_transport_send_pkt,
+	.cancel_pkt = virtio_transport_cancel_pkt,
 };
 
 static int virtio_vsock_probe(struct virtio_device *vdev)
-- 
2.7.4

^ permalink raw reply related

* [PATCH v2 2/4] vhost-vsock: add pkt cancel capability
From: Peng Tao @ 2016-12-07 15:14 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: netdev, Peng Tao, kvm, virtualization
In-Reply-To: <1481123652-80603-1-git-send-email-bergwolf@gmail.com>

To allow canceling all packets of a connection.

Signed-off-by: Peng Tao <bergwolf@gmail.com>
---
 drivers/vhost/vsock.c        | 41 +++++++++++++++++++++++++++++++++++++++++
 include/linux/virtio_vsock.h |  3 +++
 2 files changed, 44 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index a504e2e0..d01e4a4 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -218,6 +218,46 @@ vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
 	return len;
 }
 
+static int
+vhost_transport_cancel_pkt(struct vsock_sock *vsk)
+{
+	struct vhost_vsock *vsock;
+	struct virtio_vsock_pkt *pkt, *n;
+	int cnt = 0;
+	LIST_HEAD(freeme);
+
+	/* Find the vhost_vsock according to guest context id  */
+	vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
+	if (!vsock)
+		return -ENODEV;
+
+	spin_lock_bh(&vsock->send_pkt_list_lock);
+	list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
+		if (pkt->vsk != vsk)
+			continue;
+		list_move(&pkt->list, &freeme);
+	}
+	spin_unlock_bh(&vsock->send_pkt_list_lock);
+
+	list_for_each_entry_safe(pkt, n, &freeme, list) {
+		if (pkt->reply)
+			cnt++;
+		list_del(&pkt->list);
+		virtio_transport_free_pkt(pkt);
+	}
+
+	if (cnt) {
+		struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
+		int new_cnt;
+
+		new_cnt = atomic_sub_return(cnt, &vsock->queued_replies);
+		if (new_cnt + cnt >= tx_vq->num && new_cnt < tx_vq->num)
+			vhost_poll_queue(&tx_vq->poll);
+	}
+
+	return 0;
+}
+
 static struct virtio_vsock_pkt *
 vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
 		      unsigned int out, unsigned int in)
@@ -698,6 +738,7 @@ static struct virtio_transport vhost_transport = {
 	},
 
 	.send_pkt = vhost_transport_send_pkt,
+	.cancel_pkt = vhost_transport_cancel_pkt,
 };
 
 static int __init vhost_vsock_init(void)
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 6dd3242..b92e88d 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -72,6 +72,9 @@ struct virtio_transport {
 
 	/* Takes ownership of the packet */
 	int (*send_pkt)(struct virtio_vsock_pkt *pkt);
+
+	/* Cancel packets belonging the same vsock */
+	int (*cancel_pkt)(struct vsock_sock *vsk);
 };
 
 ssize_t
-- 
2.7.4

^ permalink raw reply related

* [PATCH v2 1/4] vsock: track pkt owner vsock
From: Peng Tao @ 2016-12-07 15:14 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: netdev, Peng Tao, kvm, virtualization
In-Reply-To: <1481123652-80603-1-git-send-email-bergwolf@gmail.com>

So that we can cancel a queued pkt later if necessary.

Signed-off-by: Peng Tao <bergwolf@gmail.com>
---
 include/linux/virtio_vsock.h            | 2 ++
 net/vmw_vsock/virtio_transport_common.c | 7 +++++++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 9638bfe..6dd3242 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -48,6 +48,7 @@ struct virtio_vsock_pkt {
 	struct virtio_vsock_hdr	hdr;
 	struct work_struct work;
 	struct list_head list;
+	struct vsock_sock *vsk;
 	void *buf;
 	u32 len;
 	u32 off;
@@ -56,6 +57,7 @@ struct virtio_vsock_pkt {
 
 struct virtio_vsock_pkt_info {
 	u32 remote_cid, remote_port;
+	struct vsock_sock *vsk;
 	struct msghdr *msg;
 	u32 pkt_len;
 	u16 type;
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index a53b3a1..cc1eeb5 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -57,6 +57,7 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
 	pkt->len		= len;
 	pkt->hdr.len		= cpu_to_le32(len);
 	pkt->reply		= info->reply;
+	pkt->vsk		= info->vsk;
 
 	if (info->msg && len > 0) {
 		pkt->buf = kmalloc(len, GFP_KERNEL);
@@ -180,6 +181,7 @@ static int virtio_transport_send_credit_update(struct vsock_sock *vsk,
 	struct virtio_vsock_pkt_info info = {
 		.op = VIRTIO_VSOCK_OP_CREDIT_UPDATE,
 		.type = type,
+		.vsk = vsk,
 	};
 
 	return virtio_transport_send_pkt_info(vsk, &info);
@@ -519,6 +521,7 @@ int virtio_transport_connect(struct vsock_sock *vsk)
 	struct virtio_vsock_pkt_info info = {
 		.op = VIRTIO_VSOCK_OP_REQUEST,
 		.type = VIRTIO_VSOCK_TYPE_STREAM,
+		.vsk = vsk,
 	};
 
 	return virtio_transport_send_pkt_info(vsk, &info);
@@ -534,6 +537,7 @@ int virtio_transport_shutdown(struct vsock_sock *vsk, int mode)
 			  VIRTIO_VSOCK_SHUTDOWN_RCV : 0) |
 			 (mode & SEND_SHUTDOWN ?
 			  VIRTIO_VSOCK_SHUTDOWN_SEND : 0),
+		.vsk = vsk,
 	};
 
 	return virtio_transport_send_pkt_info(vsk, &info);
@@ -560,6 +564,7 @@ virtio_transport_stream_enqueue(struct vsock_sock *vsk,
 		.type = VIRTIO_VSOCK_TYPE_STREAM,
 		.msg = msg,
 		.pkt_len = len,
+		.vsk = vsk,
 	};
 
 	return virtio_transport_send_pkt_info(vsk, &info);
@@ -581,6 +586,7 @@ static int virtio_transport_reset(struct vsock_sock *vsk,
 		.op = VIRTIO_VSOCK_OP_RST,
 		.type = VIRTIO_VSOCK_TYPE_STREAM,
 		.reply = !!pkt,
+		.vsk = vsk,
 	};
 
 	/* Send RST only if the original pkt is not a RST pkt */
@@ -826,6 +832,7 @@ virtio_transport_send_response(struct vsock_sock *vsk,
 		.remote_cid = le32_to_cpu(pkt->hdr.src_cid),
 		.remote_port = le32_to_cpu(pkt->hdr.src_port),
 		.reply = true,
+		.vsk = vsk,
 	};
 
 	return virtio_transport_send_pkt_info(vsk, &info);
-- 
2.7.4

^ permalink raw reply related

* [PATCH v2 0/4] vsock: cancel connect packets when failing to connect
From: Peng Tao @ 2016-12-07 15:14 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: netdev, Peng Tao, kvm, virtualization

Currently, if a connect call fails on a signal or timeout (e.g., guest is still
in the process of starting up), we'll just return to caller and leave the connect
packet queued and they are sent even though the connection is considered a failure,
which can confuse applications with unwanted false connect attempt.

The patchset enables vsock (both host and guest) to cancel queued packets when
a connect attempt is considered to fail.

v2 changelog:
  - fix queued_replies counting and resume tx/rx when necessary

Peng Tao (4):
  vsock: track pkt owner vsock
  vhost-vsock: add pkt cancel capability
  vsock: add pkt cancel capability
  vsock: cancel packets when failing to connect

 drivers/vhost/vsock.c                   | 41 ++++++++++++++++++++++++++++++++
 include/linux/virtio_vsock.h            | 12 ++++++++++
 net/vmw_vsock/af_vsock.c                |  7 ++++++
 net/vmw_vsock/virtio_transport.c        | 42 +++++++++++++++++++++++++++++++++
 net/vmw_vsock/virtio_transport_common.c | 14 +++++------
 5 files changed, 109 insertions(+), 7 deletions(-)

-- 
2.7.4

^ permalink raw reply

* Re: [PATCH 3/4] vsock: add pkt cancel capability
From: Peng Tao @ 2016-12-07 14:46 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: netdev@vger.kernel.org, kvm, Stefan Hajnoczi, virtualization
In-Reply-To: <20161207132206.GC31552@stefanha-x1.localdomain>

On Wed, Dec 7, 2016 at 9:22 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Wed, Dec 07, 2016 at 06:00:20PM +0800, Peng Tao wrote:
>> Signed-off-by: Peng Tao <bergwolf@gmail.com>
>> ---
>>  net/vmw_vsock/virtio_transport.c | 36 ++++++++++++++++++++++++++++++++++++
>>  1 file changed, 36 insertions(+)
>>
>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>> index 936d7ee..f88b6ed 100644
>> --- a/net/vmw_vsock/virtio_transport.c
>> +++ b/net/vmw_vsock/virtio_transport.c
>> @@ -170,6 +170,41 @@ virtio_transport_send_pkt(struct virtio_vsock_pkt *pkt)
>>       return len;
>>  }
>>
>> +static int
>> +virtio_transport_cancel_pkt(struct vsock_sock *vsk)
>> +{
>> +     struct virtio_vsock *vsock;
>> +     struct virtio_vsock_pkt *pkt, *n;
>> +     int cnt = 0;
>> +     LIST_HEAD(freeme);
>> +
>> +     vsock = virtio_vsock_get();
>> +     if (!vsock) {
>> +             return -ENODEV;
>> +     }
>> +
>> +     if (pkt->reply)
>
> pkt is uninitialized.  I guess this if statement should be deleted, you
> already take care of counting reply packets below.
>
ah, sorry! I forgot to delete it after handling cnt below...

>> +             cnt++;
>> +
>> +     spin_lock_bh(&vsock->send_pkt_list_lock);
>> +     list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
>> +             if (pkt->vsk != vsk)
>> +                     continue;
>> +             list_move(&pkt->list, &freeme);
>> +     }
>> +     spin_unlock_bh(&vsock->send_pkt_list_lock);
>> +
>> +     list_for_each_entry_safe(pkt, n, &freeme, list) {
>> +             if (pkt->reply)
>> +                     cnt++;
>> +             list_del(&pkt->list);
>> +             virtio_transport_free_pkt(pkt);
>> +     }
>> +     atomic_sub(cnt, &vsock->queued_replies);
>
> If we stopped rx because there were too many replies in flight then we
> might be able to resume rx now:
>
> /* Do we now have resources to resume rx processing? */
> if (old_val >= virtqueue_get_vring_size(rx_vq) &&
>     new_val < virtqueue_get_vring_size(rx_vq))
>         queue_work(virtio_vsock_workqueue, &vsock->rx_work);
Thanks! I totally missed the resume part... I'll send an updated version later.

Cheers,
Tao

^ permalink raw reply

* Re: [PATCH 10/10] virtio: enable endian checks for sparse builds
From: Stefan Hajnoczi @ 2016-12-07 14:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, Neil Armstrong, David Airlie, linux-remoteproc, dri-devel,
	virtualization, linux-s390, James E.J. Bottomley, Herbert Xu,
	linux-scsi, v9fs-developer, Asias He, Arnd Bergmann, linux-kbuild,
	Jens Axboe, Michal Marek, Matt Mackall, Greg Kroah-Hartman,
	linux-kernel, linux-crypto, netdev, David S. Miller
In-Reply-To: <1481038106-24899-11-git-send-email-mst@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 1143 bytes --]

On Tue, Dec 06, 2016 at 05:41:05PM +0200, Michael S. Tsirkin wrote:
> __CHECK_ENDIAN__ isn't on by default presumably because
> it triggers too many sparse warnings for correct code.
> But virtio is now clean of these warnings, and
> we want to keep it this way - enable this for
> sparse builds.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> 
> It seems that there should be a better way to do it,
> but this works too.
> 
>  drivers/block/Makefile          | 1 +
>  drivers/char/Makefile           | 1 +
>  drivers/char/hw_random/Makefile | 2 ++
>  drivers/gpu/drm/virtio/Makefile | 1 +
>  drivers/net/Makefile            | 3 +++
>  drivers/net/caif/Makefile       | 1 +
>  drivers/rpmsg/Makefile          | 1 +
>  drivers/s390/virtio/Makefile    | 2 ++
>  drivers/scsi/Makefile           | 1 +
>  drivers/vhost/Makefile          | 1 +
>  drivers/virtio/Makefile         | 3 +++
>  net/9p/Makefile                 | 1 +
>  net/packet/Makefile             | 1 +
>  net/vmw_vsock/Makefile          | 2 ++
>  14 files changed, 21 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH 09/10] vsock/virtio: fix src/dst cid format
From: Stefan Hajnoczi @ 2016-12-07 14:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, netdev, linux-kernel, stable, virtualization,
	David S. Miller
In-Reply-To: <1481038106-24899-10-git-send-email-mst@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 449 bytes --]

On Tue, Dec 06, 2016 at 05:41:02PM +0200, Michael S. Tsirkin wrote:
> These fields are 64 bit, using le32_to_cpu and friends
> on these will not do the right thing.
> Fix this up.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  net/vmw_vsock/virtio_transport_common.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH 08/10] vsock/virtio: mark an internal function static
From: Stefan Hajnoczi @ 2016-12-07 14:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, netdev, linux-kernel, virtualization, David S. Miller
In-Reply-To: <1481038106-24899-9-git-send-email-mst@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 357 bytes --]

On Tue, Dec 06, 2016 at 05:41:00PM +0200, Michael S. Tsirkin wrote:
> virtio_transport_alloc_pkt is only used locally, make it static.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  net/vmw_vsock/virtio_transport_common.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH 07/10] vsock/virtio: add a missing __le annotation
From: Stefan Hajnoczi @ 2016-12-07 14:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, netdev, linux-kernel, virtualization, David S. Miller
In-Reply-To: <1481038106-24899-8-git-send-email-mst@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 413 bytes --]

On Tue, Dec 06, 2016 at 05:40:57PM +0200, Michael S. Tsirkin wrote:
> guest cid is read from config space, therefore it's in little endian
> format and is treated as such, annotate it accordingly.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  net/vmw_vsock/virtio_transport.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* RE: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
From: Li, Liang Z @ 2016-12-07 13:35 UTC (permalink / raw)
  To: David Hildenbrand, kvm@vger.kernel.org
  Cc: virtio-dev@lists.oasis-open.org, mhocko@suse.com, mst@redhat.com,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Hansen, Dave, dgilbert@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	kirill.shutemov@linux.intel.com
In-Reply-To: <f67ca79c-ad34-59dd-835f-e7bc9dcaef58@redhat.com>

> Am 30.11.2016 um 09:43 schrieb Liang Li:
> > This patch set contains two parts of changes to the virtio-balloon.
> >
> > One is the change for speeding up the inflating & deflating process,
> > the main idea of this optimization is to use bitmap to send the page
> > information to host instead of the PFNs, to reduce the overhead of
> > virtio data transmission, address translation and madvise(). This can
> > help to improve the performance by about 85%.
> 
> Do you have some statistics/some rough feeling how many consecutive bits are
> usually set in the bitmaps? Is it really just purely random or is there some
> granularity that is usually consecutive?
> 

I did something similar. Filled the balloon with 15GB for a 16GB idle guest, by
using bitmap, the madvise count was reduced to 605. when using the PFNs, the madvise count
was 3932160. It means there are quite a lot consecutive bits in the bitmap.
I didn't test for a guest with heavy memory workload. 

> IOW in real examples, do we have really large consecutive areas or are all
> pages just completely distributed over our memory?
> 

The buddy system of Linux kernel memory management shows there should be quite a lot of
 consecutive pages as long as there are a portion of free memory in the guest.
If all pages just completely distributed over our memory, it means the memory 
fragmentation is very serious, the kernel has the mechanism to avoid this happened.
In the other hand, the inflating should not happen at this time because the guest is almost
'out of memory'.

Liang

> Thanks!
> 
> --
> 
> David

^ permalink raw reply

* Re: [PATCH 4/4] vsock: cancel packets when failing to connect
From: Stefan Hajnoczi @ 2016-12-07 13:25 UTC (permalink / raw)
  To: Peng Tao; +Cc: netdev, kvm, Stefan Hajnoczi, virtualization
In-Reply-To: <1481104821-77294-5-git-send-email-bergwolf@gmail.com>


[-- Attachment #1.1: Type: text/plain, Size: 604 bytes --]

On Wed, Dec 07, 2016 at 06:00:21PM +0800, Peng Tao wrote:
> Otherwise we'll leave the packets queued until releasing vsock device.
> E.g., if guest is slow to start up, resulting ETIMEDOUT on connect, guest
> will get the connect requests from failed host sockets.
> 
> Signed-off-by: Peng Tao <bergwolf@gmail.com>
> ---
>  include/linux/virtio_vsock.h            | 7 +++++++
>  net/vmw_vsock/af_vsock.c                | 7 +++++++
>  net/vmw_vsock/virtio_transport_common.c | 7 -------
>  3 files changed, 14 insertions(+), 7 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox