From: "Roger Pau Monné" <roger.pau@citrix.com>
To: Ian Campbell <Ian.Campbell@citrix.com>
Cc: George Dunlap <george.dunlap@eu.citrix.com>,
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
xen-devel <xen-devel@lists.xen.org>, Matt Wilson <msw@amazon.com>,
Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Subject: Re: [Hackathon minutes] PV block improvements
Date: Thu, 27 Jun 2013 17:20:19 +0200 [thread overview]
Message-ID: <51CC5833.7070605@citrix.com> (raw)
In-Reply-To: <1372342865.8976.7.camel@zakaz.uk.xensource.com>

On 27/06/13 16:21, Ian Campbell wrote:
> On Thu, 2013-06-27 at 14:58 +0100, George Dunlap wrote:
>> On 26/06/13 12:37, Ian Campbell wrote:
>>> On Wed, 2013-06-26 at 10:37 +0100, George Dunlap wrote:
>>>> On Tue, Jun 25, 2013 at 7:04 PM, Stefano Stabellini
>>>> <stefano.stabellini@eu.citrix.com> wrote:
>>>>> On Tue, 25 Jun 2013, Ian Campbell wrote:
>>>>>> On Sat, 2013-06-22 at 09:11 +0200, Roger Pau Monné wrote:
>>>>>>> On 21/06/13 20:07, Matt Wilson wrote:
>>>>>>>> On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> While working on further block improvements I've found an issue with
>>>>>>>>> persistent grants in blkfront.
>>>>>>>>>
>>>>>>>>> With persistent grants, grants are basically allocated once and never
>>>>>>>>> released, so both blkfront and blkback keep using the same memory pages
>>>>>>>>> for all transactions.
>>>>>>>>>
>>>>>>>>> This is not a problem in blkback, because we can dynamically choose how
>>>>>>>>> many grants we want to map. On the other hand, blkfront cannot revoke
>>>>>>>>> access to those grants at any point, because blkfront doesn't know
>>>>>>>>> whether blkback has these grants mapped persistently or not.
>>>>>>>>>
>>>>>>>>> So if, for example, we start expanding the number of segments in indirect
>>>>>>>>> requests, to a value like 512 segments per request, blkfront will
>>>>>>>>> probably try to persistently map 512*32+512 = 16896 grants per device,
>>>>>>>>> which is far more grants than the current default of 32*256 = 8192
>>>>>>>>> (if using grant tables v2). This can cause serious problems for other
>>>>>>>>> interfaces inside the DomU, since blkfront basically starts hoarding all
>>>>>>>>> possible grants, leaving other interfaces completely starved.
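
Just to make the arithmetic above concrete, a tiny standalone sketch; the
constants simply restate the figures quoted above:

/* Illustrative arithmetic only: the constants restate the figures above. */
#include <stdio.h>

int main(void)
{
    unsigned int segs_per_req  = 512;  /* segments per indirect request       */
    unsigned int ring_requests = 32;   /* requests in flight on the ring      */
    unsigned int extra_grants  = 512;  /* the additional grants counted above */

    unsigned int gnttab_frames    = 32;   /* default number of grant frames        */
    unsigned int grants_per_frame = 256;  /* entries per frame with grant tables v2 */

    unsigned int demand   = segs_per_req * ring_requests + extra_grants;
    unsigned int capacity = gnttab_frames * grants_per_frame;

    printf("persistent grants wanted per device: %u\n", demand);   /* 16896 */
    printf("default grant table capacity:        %u\n", capacity); /* 8192  */
    return 0;
}
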
>>>>>>>> Yikes.
>>>>>>>>
>>>>>>>>> I've been thinking about different ways to solve this, but so far I
>>>>>>>>> haven't been able to find a nice solution:
>>>>>>>>>
>>>>>>>>> 1. Limit the number of persistent grants a blkfront instance can use,
>>>>>>>>> let's say that only the first X used grants will be persistently mapped
>>>>>>>>> by both blkfront and blkback, and if more grants are needed the previous
>>>>>>>>> map/unmap scheme will be used.
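
To spell out the decision shape of option 1 (userspace sketch only, all
names invented):

/* Sketch of option 1: only the first X grants a device uses become
 * persistent; anything beyond that falls back to the map/unmap path. */
#include <stdbool.h>
#include <stdio.h>

struct grant_policy {
    unsigned int persistent_limit;  /* the X above                        */
    unsigned int persistent_used;   /* grants already mapped persistently */
};

static bool use_persistent_grant(struct grant_policy *p)
{
    if (p->persistent_used < p->persistent_limit) {
        p->persistent_used++;
        return true;   /* map once and keep it mapped        */
    }
    return false;      /* map/unmap around this request only */
}

int main(void)
{
    struct grant_policy pol = { .persistent_limit = 3, .persistent_used = 0 };

    for (int i = 0; i < 5; i++)
        printf("grant %d: %s\n", i,
               use_persistent_grant(&pol) ? "persistent" : "map/unmap");
    return 0;
}
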
>>>>>>>> I'm not thrilled with this option. It would likely introduce some
>>>>>>>> significant performance variability, wouldn't it?
>>>>>>> Probably, and it will also be hard to distribute the number of available
>>>>>>> grants across the different interfaces in a way that makes sense for
>>>>>>> performance, especially given that once a grant is assigned to an
>>>>>>> interface it cannot be returned to the pool of grants.
>>>>>>>
>>>>>>> So if we had two interfaces with very different usage (one very busy and
>>>>>>> another one almost idle), and equally distribute the grants amongst
>>>>>>> them, one will have a lot of unused grants while the other will suffer
>>>>>>> from starvation.
>>>>>> I do think we need to implement some sort of reclaim scheme, which
>>>>>> probably does mean a specific request (per your #4). We simply can't
>>>>>> have a device which once upon a time had high throughput but is now
>>>>>> mostly idle continue to tie up all those grants.
>>>>>>
>>>>>> If you make grant reuse follow an MRU scheme and reclaim the currently
>>>>>> unused tail fairly infrequently and in large batches, then the perf
>>>>>> overhead should be minimal, I think.
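
Something like the following (userspace-only; the names, the flat array and
the idle threshold are all invented) is how I picture it: steady reuse keeps
grants persistent, while grants that sat idle "for a while" are released in
one batch on an infrequent pass.

/* Rough sketch of the batched reclaim idea above; not blkfront code. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define POOL_SIZE    8   /* tiny pool, just for the example            */
#define IDLE_SECONDS 60  /* "for a while": reclaim if unused this long */

struct pgrant {
    bool   mapped;       /* still persistently mapped?         */
    time_t last_used;    /* last time this grant backed an I/O */
};

static struct pgrant pool[POOL_SIZE];

/* Mark a grant as just reused, keeping it off the next reclaim pass. */
static void pgrant_touch(unsigned int i, time_t now)
{
    pool[i].mapped    = true;
    pool[i].last_used = now;
}

/* Infrequent, batched pass: release every grant that has sat idle too long.
 * Real code would revoke the grant access here instead of flipping a flag. */
static unsigned int pgrant_reclaim(time_t now)
{
    unsigned int reclaimed = 0;

    for (unsigned int i = 0; i < POOL_SIZE; i++) {
        if (pool[i].mapped && now - pool[i].last_used > IDLE_SECONDS) {
            pool[i].mapped = false;
            reclaimed++;
        }
    }
    return reclaimed;
}

int main(void)
{
    time_t now = time(NULL);

    pgrant_touch(0, now - 300);  /* idle for five minutes */
    pgrant_touch(1, now);        /* just reused           */

    printf("reclaimed %u grant(s)\n", pgrant_reclaim(now));
    return 0;
}
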
>>>>>>
>>>>>> I also wouldn't be so quick to discount the idea of using ephemeral
>>>>>> grants to cover bursts either; in fact it might fall out quite
>>>>>> naturally from an MRU scheme. In that scheme bursting up is pretty cheap
>>>>>> since a grant map is relatively inexpensive, and recovering from the
>>>>>> burst shouldn't be too expensive if you batch it. If it turns out to be
>>>>>> not a burst but a sustained level of I/O then the MRU scheme would mean
>>>>>> you wouldn't be reclaiming them.
>>>>>>
>>>>>> I also think there probably needs to be some tunable per-device limit on
>>>>>> the maximum number of persistent grants, perhaps minimum and maximum pool
>>>>>> sizes tied in with an MRU scheme? If nothing else it gives the admin the
>>>>>> ability to prioritise devices.
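
As a sketch of the kind of per-device knobs that could mean (names and
numbers invented, not an existing interface):

/* Sketch only: per-device bounds on the persistent grant pool. */
#include <stdio.h>

struct pgrant_limits {
    unsigned int min_pool;  /* grants the device may always keep mapped */
    unsigned int max_pool;  /* hard cap on persistently mapped grants   */
};

/* Clamp the pool size the MRU scheme would like to keep for a device. */
static unsigned int clamp_pool(unsigned int wanted,
                               const struct pgrant_limits *l)
{
    if (wanted < l->min_pool)
        return l->min_pool;
    if (wanted > l->max_pool)
        return l->max_pool;
    return wanted;
}

int main(void)
{
    struct pgrant_limits fast_disk = { .min_pool = 256, .max_pool = 2048 };
    struct pgrant_limits slow_disk = { .min_pool = 32,  .max_pool = 256  };

    printf("fast disk keeps %u grants\n", clamp_pool(4096, &fast_disk));
    printf("slow disk keeps %u grants\n", clamp_pool(4096, &slow_disk));
    return 0;
}
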
>>>>> If we introduce a reclaim call we have to be careful not to fall back
>>>>> to a map/unmap scheme like we had before.
>>>>>
>>>>> The way I see it, either these additional grants are useful or they
>>>>> are not. In the first case we could just limit the maximum number of
>>>>> persistent grants and be done with it.
>>>>> If they are not useful (they have been allocated for one very large
>>>>> request and not used much after that), could we find a way to identify
>>>>> unusually large requests and avoid using persistent grants for those?
>>>> Isn't it possible that these grants are useful for some periods of
>>>> time, but not for others? You wouldn't say, "Caching the disk data in
>>>> main memory is either useful or not; if it is not useful (if it was
>>>> allocated for one very large request and not used much after that), we
>>>> should find a way to identify unusually large requests and avoid
>>>> caching it." If you're playing a movie, sure; but in most cases, the
>>>> cache was useful for a time, then stopped being useful. Treating the
>>>> persistent grants the same way makes sense to me.
>>> Right, this is what I was trying to suggest with the MRU scheme. If you
>>> are using lots of grants and you keep on reusing them then they remain
>>> persistent and don't get reclaimed. If you are not reusing them for a
>>> while then they get reclaimed. If you make "for a while" big enough then
>>> you should find you aren't unintentionally falling back to a map/unmap
>>> scheme.
>>
>> And I was trying to say that I agreed with you. :-)
>
> Excellent ;-)
I also agree that this is the best solution; I will start looking into
implementing it.
>> BTW, I presume "MRU" stands for "Most Recently Used", and means "Keep
>> the most recently used"; is there a practical difference between that
>> and "LRU" ("Discard the Least Recently Used")?
>
> I started off with LRU and then got myself confused and changed it
> everywhere. Yes, I mean keep the Most Recently Used == discard the Least
> Recently Used.
This will help if the disk is only doing intermittent bursts of data,
but if the disk is under high I/O for a long time we might end up in
the same situation (all grants hoarded by a single disk). We should
make sure that there's always a buffer of unused grants so that other
disks or NIC interfaces can continue to work as expected.
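
To illustrate the kind of guard I have in mind (purely a sketch, all names
and numbers invented): a domain-wide budget from which devices take their
persistent grants, with a fixed headroom that is never handed out, so other
frontends always have grants left for map/unmap.

/* Sketch only: a shared budget with reserved headroom for other frontends. */
#include <stdbool.h>
#include <stdio.h>

struct grant_budget {
    unsigned int total;     /* grants the domain can hand out in total   */
    unsigned int reserved;  /* headroom never used for persistent grants */
    unsigned int claimed;   /* persistent grants already handed out      */
};

/* A device asks for 'n' more persistent grants; refuse if that would eat
 * into the reserved headroom, forcing it back onto the map/unmap path. */
static bool budget_claim(struct grant_budget *b, unsigned int n)
{
    if (b->claimed + n + b->reserved > b->total)
        return false;
    b->claimed += n;
    return true;
}

int main(void)
{
    struct grant_budget b = { .total = 8192, .reserved = 1024, .claimed = 0 };

    printf("first disk asks for 4096:  %s\n", budget_claim(&b, 4096) ? "granted" : "refused");
    printf("second disk asks for 4096: %s\n", budget_claim(&b, 4096) ? "granted" : "refused");
    printf("second disk asks for 3072: %s\n", budget_claim(&b, 3072) ? "granted" : "refused");
    return 0;
}
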
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel