blkback global resources

All of lore.kernel.org
 help / color / mirror / Atom feed

* blkback global resources
@ 2012-03-26 15:56 Jan Beulich
  2012-03-26 16:20 ` Ian Campbell
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Beulich @ 2012-03-26 15:56 UTC (permalink / raw)
  To: xen-devel

All the resources allocated based on xen_blkif_reqs are global in
blkback. While (without having measured anything) I think that this
is bad from a QoS perspective (not the least implied from a warning
issued by Citrix'es multi-page-ring patches:

	if (blkif_reqs < BLK_RING_SIZE(order))
		printk(KERN_WARNING "WARNING: "
		       "I/O request space (%d reqs) < ring order %ld, "
		       "consider increasing %s.reqs to >= %ld.",
		       blkif_reqs, order, KBUILD_MODNAME,
		       roundup_pow_of_two(BLK_RING_SIZE(order)));

indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
this to per-instance allocations, as the amount of memory taken
away from Dom0 for this may be not insignificant when there are
many devices.

Does anyone have an opinion here, in particular regarding the
original authors' decision to make this global vs. the apparently
made observation (by Daniel Stodden, the author of said patch,
who I don't have any current email of to ask directly), but also
in the context of multi-page rings, the purpose of which is to
allow for larger amounts of in-flight I/O?

Thanks, Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: blkback global resources
  2012-03-26 15:56 blkback global resources Jan Beulich
@ 2012-03-26 16:20 ` Ian Campbell
  2012-03-26 16:56   ` Konrad Rzeszutek Wilk
                     ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Ian Campbell @ 2012-03-26 16:20 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, xen-devel

On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> All the resources allocated based on xen_blkif_reqs are global in
> blkback. While (without having measured anything) I think that this
> is bad from a QoS perspective (not the least implied from a warning
> issued by Citrix'es multi-page-ring patches:
> 
> 	if (blkif_reqs < BLK_RING_SIZE(order))
> 		printk(KERN_WARNING "WARNING: "
> 		       "I/O request space (%d reqs) < ring order %ld, "
> 		       "consider increasing %s.reqs to >= %ld.",
> 		       blkif_reqs, order, KBUILD_MODNAME,
> 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> 
> indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> this to per-instance allocations, as the amount of memory taken
> away from Dom0 for this may be not insignificant when there are
> many devices.
> 
> Does anyone have an opinion here, in particular regarding the
> original authors' decision to make this global vs. the apparently
> made observation (by Daniel Stodden, the author of said patch,
> who I don't have any current email of to ask directly), but also
> in the context of multi-page rings, the purpose of which is to
> allow for larger amounts of in-flight I/O?

Not really much to say other than we (well, mostly Wei Liu) have a
similar issue with netback too. That used to have a global pool, then a
pool-per-worker thread. When Wei added thread-per-vif support he solved
this by adding a "page pool" which handles allocations. Possibly this
could grow some sort of fairness etc nobs and be shared with blkback?

Ian.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: blkback global resources
  2012-03-26 16:20 ` Ian Campbell
@ 2012-03-26 16:56   ` Konrad Rzeszutek Wilk
  2012-03-26 20:47   ` Matt Wilson
  2012-03-27  9:41   ` Wei Liu
  2 siblings, 0 replies; 14+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-03-26 16:56 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Wei Liu, Jan Beulich, xen-devel

On Mon, Mar 26, 2012 at 05:20:08PM +0100, Ian Campbell wrote:
> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> > All the resources allocated based on xen_blkif_reqs are global in
> > blkback. While (without having measured anything) I think that this
> > is bad from a QoS perspective (not the least implied from a warning
> > issued by Citrix'es multi-page-ring patches:
> > 
> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> > 		printk(KERN_WARNING "WARNING: "
> > 		       "I/O request space (%d reqs) < ring order %ld, "
> > 		       "consider increasing %s.reqs to >= %ld.",
> > 		       blkif_reqs, order, KBUILD_MODNAME,
> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> > 
> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> > this to per-instance allocations, as the amount of memory taken
> > away from Dom0 for this may be not insignificant when there are
> > many devices.
> > 
> > Does anyone have an opinion here, in particular regarding the
> > original authors' decision to make this global vs. the apparently
> > made observation (by Daniel Stodden, the author of said patch,
> > who I don't have any current email of to ask directly), but also
> > in the context of multi-page rings, the purpose of which is to
> > allow for larger amounts of in-flight I/O?
> 
> Not really much to say other than we (well, mostly Wei Liu) have a
> similar issue with netback too. That used to have a global pool, then a
> pool-per-worker thread. When Wei added thread-per-vif support he solved
> this by adding a "page pool" which handles allocations. Possibly this
> could grow some sort of fairness etc nobs and be shared with blkback?

Yes! That is what I was hoping for - as it would hopefully also have
the shrinker API hooked up to combat excessive memory consumption.

> 
> Ian.
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: blkback global resources
  2012-03-26 16:20 ` Ian Campbell
  2012-03-26 16:56   ` Konrad Rzeszutek Wilk
@ 2012-03-26 20:47   ` Matt Wilson
  2012-03-27  8:45     ` Ian Campbell
  2012-03-27  9:41   ` Wei Liu
  2 siblings, 1 reply; 14+ messages in thread
From: Matt Wilson @ 2012-03-26 20:47 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Wei Liu, Jan Beulich, xen-devel

On Mon, Mar 26, 2012 at 09:20:08AM -0700, Ian Campbell wrote:
> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> > 
> > Does anyone have an opinion here, in particular regarding the
> > original authors' decision to make this global vs. the apparently
> > made observation (by Daniel Stodden, the author of said patch,
> > who I don't have any current email of to ask directly), but also
> > in the context of multi-page rings, the purpose of which is to
> > allow for larger amounts of in-flight I/O?
> 
> Not really much to say other than we (well, mostly Wei Liu) have a
> similar issue with netback too. That used to have a global pool, then a
> pool-per-worker thread. When Wei added thread-per-vif support he solved
> this by adding a "page pool" which handles allocations. Possibly this
> could grow some sort of fairness etc nobs and be shared with blkback?

I'd definitely welcome an approach in blkback similar to the page pool
changes to netback. Steve Noonan's change to use a spinlock per
blkfront device [1] should show increased contention on the global
request list on the blkback side when multiple devices are attached.

Does netback not also need to handle questions of fairness and
resource starvation?

Matt

[1] http://git.kernel.org/?p=linux/kernel/git/axboe/linux-block.git;a=commitdiff;h=3467811e26660eb46bc655234573d22d6876d5f9

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: blkback global resources
       [not found] ` <1332780797.30244.159.camel@espiritosanto>
@ 2012-03-27  7:27   ` Jan Beulich
  2012-03-27  9:21     ` Wei Liu
  2012-04-02  9:56     ` Andrei Lifchits
  0 siblings, 2 replies; 14+ messages in thread
From: Jan Beulich @ 2012-03-27  7:27 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: xen-devel, Andrei Lifchits

>>> On 26.03.12 at 18:53, Daniel Stodden <daniel.stodden@googlemail.com> wrote:
> On Mon, 2012-03-26 at 17:06 +0100, Keir Fraser wrote:
>> Cc'ing Daniel for you on this one, Jan.
>> 
>>  K.
>> 
>> On 26/03/2012 16:56, "Jan Beulich" <JBeulich@suse.com> wrote:
>> 
>> > All the resources allocated based on xen_blkif_reqs are global in
>> > blkback. While (without having measured anything) I think that this
>> > is bad from a QoS perspective (not the least implied from a warning
>> > issued by Citrix'es multi-page-ring patches:
>> > 
>> > if (blkif_reqs < BLK_RING_SIZE(order))
>> > printk(KERN_WARNING "WARNING: "
>> >       "I/O request space (%d reqs) < ring order %ld, "
>> >       "consider increasing %s.reqs to >= %ld.",
>> >       blkif_reqs, order, KBUILD_MODNAME,
>> >       roundup_pow_of_two(BLK_RING_SIZE(order)));
>> > 
>> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
>> > this to per-instance allocations, as the amount of memory taken
>> > away from Dom0 for this may be not insignificant when there are
>> > many devices.
>> > 
>> > Does anyone have an opinion here, in particular regarding the
>> > original authors' decision to make this global vs. the apparently
>> > made observation (by Daniel Stodden, the author of said patch,
>> > who I don't have any current email of to ask directly), but also
>> > in the context of multi-page rings, the purpose of which is to
>> > allow for larger amounts of in-flight I/O?
>> > 
>> > Thanks, Jan
> 
> Re-CC'ing Andrei Lifchits, I think there's been some work going on at
> Citrix regarding that matter.
> 
> Yes, just allocating a pfn pool per backend instance is way too much
> memory balooned out. Otherwise this stuff would have never looked the
> way it does now.

This of course could be accounted for by having an initially non-empty
(large enough) balloon (not sure how easy it is these days to do this
for pv-ops, but it has always been trivial with the legacy code). That
wouldn't help a 32-bit kernel much (where generally the initial balloon
is all in highmem, yet the vacated pages need to be in lowmem), but
for 64-bit kernels it should be fine.

> Regarding the right balance, note that on the other extreme end, if PFN
> space were infinite, there's not much expected performance gain from
> rendering virtual backends fully independent. Beyond controller queue
> depth, these requests are all just going to pile up, waiting.

Is there a way to look through the queue stack to find out how many
distinct ones there are that the backend is running on top of as well
as - for a particular I/O path - the one with the smallest depth? Or can
one assume that the top most one (generally loop's or blktap2's) won't
advertise a queue deeper than what is going to be accepted
downstream (probably not, I'd guess)?

And - what you say would similarly apply to the usefulness of multi-page
rings afaict.

> XenServer has some support for decoupling in blktap.ko [1] which worked
> relatively well: Use frame 'pool' kobjects. A bunch of pages, mapped to
> sysfs object. Name was arbitrary. Size configurable, even at runtime. 
> 
> Sysfs meant stuff was easily set up by shell or python code, or
> manually. To become operational, every backend must be bound to a pool
> (initially, the global 'default' one, for tool compat). Backends can be
> relinked arbitrarily before entering Connected state.
> 
> Then let the userland toolstack set things up according to physical I/O
> topology and properties probed. Basically every physical backend (say, a
> volume group, or a HBA) would start out by allocating and dimensioning a
> dedicated pool (named after the backend), and every backend instance
> fired up gets bound to the pool it belongs to.

Having userland do all that seems like a fallback solution only to me - I
would hope that sufficient information is available directly to the drivers.

Thanks in any case for responding so quickly,
Jan

> There's a lot of additional optimizations one could consider, e.g.
> autogrowing the pool (log(nbackends) or so?) and the like. To improve
> locality, having backends which look ahead in their request queue and
> allocate whole batches is probably a good idea too, etc, etc.
> 
> HTH,
> Daniel
> 
> [1]
> http://xenbits.xen.org/gitweb/?p=people/dstodden/linux.git 
>  mostly in drivers/block/blktap/sysfs.c (show/store_pool) and request.c.
>  Note that these are based on mempools, not the frame pools blkback
>  would take.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: blkback global resources
  2012-03-26 20:47   ` Matt Wilson
@ 2012-03-27  8:45     ` Ian Campbell
  0 siblings, 0 replies; 14+ messages in thread
From: Ian Campbell @ 2012-03-27  8:45 UTC (permalink / raw)
  To: Matt Wilson; +Cc: Wei Liu (Intern), Jan Beulich, xen-devel

On Mon, 2012-03-26 at 21:47 +0100, Matt Wilson wrote:
> On Mon, Mar 26, 2012 at 09:20:08AM -0700, Ian Campbell wrote:
> > On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> > > 
> > > Does anyone have an opinion here, in particular regarding the
> > > original authors' decision to make this global vs. the apparently
> > > made observation (by Daniel Stodden, the author of said patch,
> > > who I don't have any current email of to ask directly), but also
> > > in the context of multi-page rings, the purpose of which is to
> > > allow for larger amounts of in-flight I/O?
> > 
> > Not really much to say other than we (well, mostly Wei Liu) have a
> > similar issue with netback too. That used to have a global pool, then a
> > pool-per-worker thread. When Wei added thread-per-vif support he solved
> > this by adding a "page pool" which handles allocations. Possibly this
> > could grow some sort of fairness etc nobs and be shared with blkback?
> 
> I'd definitely welcome an approach in blkback similar to the page pool
> changes to netback. Steve Noonan's change to use a spinlock per
> blkfront device [1] should show increased contention on the global
> request list on the blkback side when multiple devices are attached.
> 
> Does netback not also need to handle questions of fairness and
> resource starvation?

It does, what I was meaning to say was that the page pool could grow
some fairness nobs for netback use (if it is not already fair) and
separately that it could be shared with blkback (rather than fairness
nobs being related to the sharing with blkback which looking back is how
my original statement could have been read).

Ian.

> 
> Matt
> 
> [1] http://git.kernel.org/?p=linux/kernel/git/axboe/linux-block.git;a=commitdiff;h=3467811e26660eb46bc655234573d22d6876d5f9

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: blkback global resources
  2012-03-27  7:27   ` Jan Beulich
@ 2012-03-27  9:21     ` Wei Liu
  2012-03-27 10:11       ` Jan Beulich
  2012-04-02  9:56     ` Andrei Lifchits
  1 sibling, 1 reply; 14+ messages in thread
From: Wei Liu @ 2012-03-27  9:21 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrei Lifchits, wei.liu2, Daniel Stodden, xen-devel

On Tue, 2012-03-27 at 08:27 +0100, Jan Beulich wrote:
> >>> On 26.03.12 at 18:53, Daniel Stodden <daniel.stodden@googlemail.com> wrote:
> > On Mon, 2012-03-26 at 17:06 +0100, Keir Fraser wrote:
> >> Cc'ing Daniel for you on this one, Jan.
> >> 
> >>  K.
> >> 
> >> On 26/03/2012 16:56, "Jan Beulich" <JBeulich@suse.com> wrote:
> >> 
> >> > All the resources allocated based on xen_blkif_reqs are global in
> >> > blkback. While (without having measured anything) I think that this
> >> > is bad from a QoS perspective (not the least implied from a warning
> >> > issued by Citrix'es multi-page-ring patches:
> >> > 
> >> > if (blkif_reqs < BLK_RING_SIZE(order))
> >> > printk(KERN_WARNING "WARNING: "
> >> >       "I/O request space (%d reqs) < ring order %ld, "
> >> >       "consider increasing %s.reqs to >= %ld.",
> >> >       blkif_reqs, order, KBUILD_MODNAME,
> >> >       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >> > 
> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> >> > this to per-instance allocations, as the amount of memory taken
> >> > away from Dom0 for this may be not insignificant when there are
> >> > many devices.
> >> > 
> >> > Does anyone have an opinion here, in particular regarding the
> >> > original authors' decision to make this global vs. the apparently
> >> > made observation (by Daniel Stodden, the author of said patch,
> >> > who I don't have any current email of to ask directly), but also
> >> > in the context of multi-page rings, the purpose of which is to
> >> > allow for larger amounts of in-flight I/O?
> >> > 
> >> > Thanks, Jan
> > 
> > Re-CC'ing Andrei Lifchits, I think there's been some work going on at
> > Citrix regarding that matter.
> > 
> > Yes, just allocating a pfn pool per backend instance is way too much
> > memory balooned out. Otherwise this stuff would have never looked the
> > way it does now.
> 
> This of course could be accounted for by having an initially non-empty
> (large enough) balloon (not sure how easy it is these days to do this
> for pv-ops, but it has always been trivial with the legacy code). That
> wouldn't help a 32-bit kernel much (where generally the initial balloon
> is all in highmem, yet the vacated pages need to be in lowmem), but
> for 64-bit kernels it should be fine.
> 
> > Regarding the right balance, note that on the other extreme end, if PFN
> > space were infinite, there's not much expected performance gain from
> > rendering virtual backends fully independent. Beyond controller queue
> > depth, these requests are all just going to pile up, waiting.
> 
> Is there a way to look through the queue stack to find out how many
> distinct ones there are that the backend is running on top of as well
> as - for a particular I/O path - the one with the smallest depth? Or can
> one assume that the top most one (generally loop's or blktap2's) won't
> advertise a queue deeper than what is going to be accepted
> downstream (probably not, I'd guess)?
> 
> And - what you say would similarly apply to the usefulness of multi-page
> rings afaict.
> 

The balance is tricky. What I observe so far is that having multi-page
rings doesn't necessarily help improve performance (but it is still nice
to have it in case of future usage). There are other contentions, which
limit throughput of a single VIF.

> > XenServer has some support for decoupling in blktap.ko [1] which worked
> > relatively well: Use frame 'pool' kobjects. A bunch of pages, mapped to
> > sysfs object. Name was arbitrary. Size configurable, even at runtime. 
> > 
> > Sysfs meant stuff was easily set up by shell or python code, or
> > manually. To become operational, every backend must be bound to a pool
> > (initially, the global 'default' one, for tool compat). Backends can be
> > relinked arbitrarily before entering Connected state.
> > 
> > Then let the userland toolstack set things up according to physical I/O
> > topology and properties probed. Basically every physical backend (say, a
> > volume group, or a HBA) would start out by allocating and dimensioning a
> > dedicated pool (named after the backend), and every backend instance
> > fired up gets bound to the pool it belongs to.
> 
> Having userland do all that seems like a fallback solution only to me - I
> would hope that sufficient information is available directly to the drivers.
> 

I tempt to make all information available to drivers, but haven't
reached a conclusion yet. Maybe we should also allow user to experiment
various configuration for their specific needs?


Wei.

> Thanks in any case for responding so quickly,
> Jan
> 
> > There's a lot of additional optimizations one could consider, e.g.
> > autogrowing the pool (log(nbackends) or so?) and the like. To improve
> > locality, having backends which look ahead in their request queue and
> > allocate whole batches is probably a good idea too, etc, etc.
> > 
> > HTH,
> > Daniel
> > 
> > [1]
> > http://xenbits.xen.org/gitweb/?p=people/dstodden/linux.git 
> >  mostly in drivers/block/blktap/sysfs.c (show/store_pool) and request.c.
> >  Note that these are based on mempools, not the frame pools blkback
> >  would take.
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: blkback global resources
  2012-03-26 16:20 ` Ian Campbell
  2012-03-26 16:56   ` Konrad Rzeszutek Wilk
  2012-03-26 20:47   ` Matt Wilson
@ 2012-03-27  9:41   ` Wei Liu
  2012-03-27 10:22     ` Jan Beulich
  2 siblings, 1 reply; 14+ messages in thread
From: Wei Liu @ 2012-03-27  9:41 UTC (permalink / raw)
  To: Ian Campbell; +Cc: wei.liu2, Jan Beulich, xen-devel

On Mon, 2012-03-26 at 17:20 +0100, Ian Campbell wrote:
> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> > All the resources allocated based on xen_blkif_reqs are global in
> > blkback. While (without having measured anything) I think that this
> > is bad from a QoS perspective (not the least implied from a warning
> > issued by Citrix'es multi-page-ring patches:
> > 
> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> > 		printk(KERN_WAfdfdRNING "WARNING: "
> > 		       "I/O request space (%d reqs) < ring order %ld, "
> > 		       "consider increasing %s.reqs to >= %ld.",
> > 		       blkif_reqs, order, KBUILD_MODNAME,
> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> > 
> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> > this to per-instance allocations, as the amount of memory taken
> > away from Dom0 for this may be not insignificant when there are
> > many devices.
> > 

What's your main concern on QoS? Lock contention? Starvation? Or any
other things?

You are right here, we should be careful on how much memory blkback may
consume.

> > Does anyone have an opinion here, in particular regarding the
> > original authors' decision to make this global vs. the apparently
> > made observation (by Daniel Stodden, the author of said patch,
> > who I don't have any current email of to ask directly), but also
> > in the context of multi-page rings, the purpose of which is to
> > allow for larger amounts of in-flight I/O?
> 

I don't know the actual bottleneck for blkback because I never play with
it. But for multi-page ring netback / netfront, global pool (in the
sense of starvation) may not become bottleneck if there are very limited
VIFs running. What I observe is that to achieve 3.8Gb/s throughput
between two VIFs connected throughput internal bridge, it only consumes
~80 pages in the pool.


Wei.

> Not really much to say other than we (well, mostly Wei Liu) have a
> similar issue with netback too. That used to have a global pool, then a
> pool-per-worker thread. When Wei added thread-per-vif support he solved
> this by adding a "page pool" which handles allocations. Possibly this
> could grow some sort of fairness etc nobs and be shared with blkback?
> 
> Ian.
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: blkback global resources
  2012-03-27  9:21     ` Wei Liu
@ 2012-03-27 10:11       ` Jan Beulich
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Beulich @ 2012-03-27 10:11 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, Daniel Stodden, Andrei Lifchits

>>> On 27.03.12 at 11:21, Wei Liu <wei.liu2@citrix.com> wrote:
> On Tue, 2012-03-27 at 08:27 +0100, Jan Beulich wrote:
>> >>> On 26.03.12 at 18:53, Daniel Stodden <daniel.stodden@googlemail.com> wrote:
>> > Then let the userland toolstack set things up according to physical I/O
>> > topology and properties probed. Basically every physical backend (say, a
>> > volume group, or a HBA) would start out by allocating and dimensioning a
>> > dedicated pool (named after the backend), and every backend instance
>> > fired up gets bound to the pool it belongs to.
>> 
>> Having userland do all that seems like a fallback solution only to me - I
>> would hope that sufficient information is available directly to the drivers.
>> 
> 
> I tempt to make all information available to drivers, but haven't
> reached a conclusion yet. Maybe we should also allow user to experiment
> various configuration for their specific needs?

Oh, yes, user overrides are certainly desirable. I was merely hinting
at it being desirable from my pov to have sensible default behavior.

Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: blkback global resources
  2012-03-27  9:41   ` Wei Liu
@ 2012-03-27 10:22     ` Jan Beulich
  2012-03-27 12:34       ` Wei Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Beulich @ 2012-03-27 10:22 UTC (permalink / raw)
  To: Ian Campbell, Wei Liu; +Cc: xen-devel

>>> On 27.03.12 at 11:41, Wei Liu <wei.liu2@citrix.com> wrote:
> On Mon, 2012-03-26 at 17:20 +0100, Ian Campbell wrote:
>> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
>> > All the resources allocated based on xen_blkif_reqs are global in
>> > blkback. While (without having measured anything) I think that this
>> > is bad from a QoS perspective (not the least implied from a warning
>> > issued by Citrix'es multi-page-ring patches:
>> > 
>> > 	if (blkif_reqs < BLK_RING_SIZE(order))
>> > 		printk(KERN_WAfdfdRNING "WARNING: "
>> > 		       "I/O request space (%d reqs) < ring order %ld, "
>> > 		       "consider increasing %s.reqs to >= %ld.",
>> > 		       blkif_reqs, order, KBUILD_MODNAME,
>> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
>> > 
>> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
>> > this to per-instance allocations, as the amount of memory taken
>> > away from Dom0 for this may be not insignificant when there are
>> > many devices.
>> > 
> 
> What's your main concern on QoS? Lock contention? Starvation? Or any
> other things?

However you want to put it. Prior to the multi-page ring patches, we
have 64 pending requests (global) and 32 ring entries. Obviously,
bumping the ring size just to order 1 will already bring the number of
possible in-flight entries per device on par with those in-flight across
all devices. So _if_ someone really determined that a multi-page ring
helps performance, I wonder whether that was with manually
adjusted global pending request values (not said anywhere) or with
just a single frontend (not very close to real world scenarios).

In any case, two guests with heavy I/O clearly have the potential to
hinder each other, even if both get backed by different physical
devices.

Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: blkback global resources
  2012-03-27 10:22     ` Jan Beulich
@ 2012-03-27 12:34       ` Wei Liu
  2012-03-27 12:45         ` Jan Beulich
  0 siblings, 1 reply; 14+ messages in thread
From: Wei Liu @ 2012-03-27 12:34 UTC (permalink / raw)
  To: Jan Beulich; +Cc: wei.liu2, Ian Campbell, xen-devel

On Tue, 2012-03-27 at 11:22 +0100, Jan Beulich wrote:
> >>> On 27.03.12 at 11:41, Wei Liu <wei.liu2@citrix.com> wrote:
> > On Mon, 2012-03-26 at 17:20 +0100, Ian Campbell wrote:
> >> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> >> > All the resources allocated based on xen_blkif_reqs are global in
> >> > blkback. While (without having measured anything) I think that this
> >> > is bad from a QoS perspective (not the least implied from a warning
> >> > issued by Citrix'es multi-page-ring patches:
> >> > 
> >> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> >> > 		printk(KERN_WARNING "WARNING: "
> >> > 		       "I/O request space (%d reqs) < ring order %ld, "
> >> > 		       "consider increasing %s.reqs to >= %ld.",
> >> > 		       blkif_reqs, order, KBUILD_MODNAME,
> >> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >> > 
> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> >> > this to per-instance allocations, as the amount of memory taken
> >> > away from Dom0 for this may be not insignificant when there are
> >> > many devices.
> >> > 
> > 
> > What's your main concern on QoS? Lock contention? Starvation? Or any
> > other things?
> 
> However you want to put it. Prior to the multi-page ring patches, we
> have 64 pending requests (global) and 32 ring entries. Obviously,
> bumping the ring size just to order 1 will already bring the number of
> possible in-flight entries per device on par with those in-flight across
> all devices. So _if_ someone really determined that a multi-page ring
> helps performance, I wonder whether that was with manually
> adjusted global pending request values (not said anywhere) or with
> just a single frontend (not very close to real world scenarios).
> 

Just to be precise, bumping order to 1 makes ring entries more than 64.

> In any case, two guests with heavy I/O clearly have the potential to
> hinder each other, even if both get backed by different physical
> devices.
> 

Right. One solution I can think of is that each blk thread holds a small
number of private entries (threshold to be determined), while blkback
maintains a pool for any allocation goes beyond that threshold. But this
just makes things more and more complex -- let's not overkill ourselves
before we identify the real bottleneck.



Wei.

> Jan
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: blkback global resources
  2012-03-27 12:34       ` Wei Liu
@ 2012-03-27 12:45         ` Jan Beulich
  2012-03-27 13:01           ` Wei Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Beulich @ 2012-03-27 12:45 UTC (permalink / raw)
  To: Wei Liu; +Cc: Ian Campbell, xen-devel

>>> On 27.03.12 at 14:34, Wei Liu <wei.liu2@citrix.com> wrote:
> On Tue, 2012-03-27 at 11:22 +0100, Jan Beulich wrote:
>> >>> On 27.03.12 at 11:41, Wei Liu <wei.liu2@citrix.com> wrote:
>> > On Mon, 2012-03-26 at 17:20 +0100, Ian Campbell wrote:
>> >> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
>> >> > All the resources allocated based on xen_blkif_reqs are global in
>> >> > blkback. While (without having measured anything) I think that this
>> >> > is bad from a QoS perspective (not the least implied from a warning
>> >> > issued by Citrix'es multi-page-ring patches:
>> >> > 
>> >> > 	if (blkif_reqs < BLK_RING_SIZE(order))
>> >> > 		printk(KERN_WARNING "WARNING: "
>> >> > 		       "I/O request space (%d reqs) < ring order %ld, "
>> >> > 		       "consider increasing %s.reqs to >= %ld.",
>> >> > 		       blkif_reqs, order, KBUILD_MODNAME,
>> >> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
>> >> > 
>> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
>> >> > this to per-instance allocations, as the amount of memory taken
>> >> > away from Dom0 for this may be not insignificant when there are
>> >> > many devices.
>> >> > 
>> > 
>> > What's your main concern on QoS? Lock contention? Starvation? Or any
>> > other things?
>> 
>> However you want to put it. Prior to the multi-page ring patches, we
>> have 64 pending requests (global) and 32 ring entries. Obviously,
>> bumping the ring size just to order 1 will already bring the number of
>> possible in-flight entries per device on par with those in-flight across
>> all devices. So _if_ someone really determined that a multi-page ring
>> helps performance, I wonder whether that was with manually
>> adjusted global pending request values (not said anywhere) or with
>> just a single frontend (not very close to real world scenarios).
>> 
> 
> Just to be precise, bumping order to 1 makes ring entries more than 64.

Iirc order 0 -> 32 entries, order 1 -> 64 entries.

Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: blkback global resources
  2012-03-27 12:45         ` Jan Beulich
@ 2012-03-27 13:01           ` Wei Liu
  0 siblings, 0 replies; 14+ messages in thread
From: Wei Liu @ 2012-03-27 13:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: wei.liu2, Ian Campbell, xen-devel

On Tue, 2012-03-27 at 13:45 +0100, Jan Beulich wrote:
> >>> On 27.03.12 at 14:34, Wei Liu <wei.liu2@citrix.com> wrote:
> > On Tue, 2012-03-27 at 11:22 +0100, Jan Beulich wrote:
> >> >>> On 27.03.12 at 11:41, Wei Liu <wei.liu2@citrix.com> wrote:
> >> > On Mon, 2012-03-26 at 17:20 +0100, Ian Campbell wrote:
> >> >> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> >> >> > All the resources allocated based on xen_blkif_reqs are global in
> >> >> > blkback. While (without having measured anything) I think that this
> >> >> > is bad from a QoS perspective (not the least implied from a warning
> >> >> > issued by Citrix'es multi-page-ring patches:
> >> >> > 
> >> >> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> >> >> > 		printk(KERN_WARNING "WARNING: "
> >> >> > 		       "I/O request space (%d reqs) < ring order %ld, "
> >> >> > 		       "consider increasing %s.reqs to >= %ld.",
> >> >> > 		       blkif_reqs, order, KBUILD_MODNAME,
> >> >> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >> >> > 
> >> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> >> >> > this to per-instance allocations, as the amount of memory taken
> >> >> > away from Dom0 for this may be not insignificant when there are
> >> >> > many devices.
> >> >> > 
> >> > 
> >> > What's your main concern on QoS? Lock contention? Starvation? Or any
> >> > other things?
> >> 
> >> However you want to put it. Prior to the multi-page ring patches, we
> >> have 64 pending requests (global) and 32 ring entries. Obviously,
> >> bumping the ring size just to order 1 will already bring the number of
> >> possible in-flight entries per device on par with those in-flight across
> >> all devices. So _if_ someone really determined that a multi-page ring
> >> helps performance, I wonder whether that was with manually
> >> adjusted global pending request values (not said anywhere) or with
> >> just a single frontend (not very close to real world scenarios).
> >> 
> > 
> > Just to be precise, bumping order to 1 makes ring entries more than 64.
> 
> Iirc order 0 -> 32 entries, order 1 -> 64 entries.
> 

I skimmed the code just to make sure. You're right, the size is rounded
down to power of 2. What I was talking about is the value before
rounded-down.


Wei.

> Jan
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: blkback global resources
  2012-03-27  7:27   ` Jan Beulich
  2012-03-27  9:21     ` Wei Liu
@ 2012-04-02  9:56     ` Andrei Lifchits
  1 sibling, 0 replies; 14+ messages in thread
From: Andrei Lifchits @ 2012-04-02  9:56 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Felipe Franciosi, Daniel Stodden, xen-devel

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, March 27, 2012 8:27 AM
> To: Daniel Stodden
> Cc: Andrei Lifchits; xen-devel
> Subject: Re: [Xen-devel] blkback global resources
> 
> >>> On 26.03.12 at 18:53, Daniel Stodden <daniel.stodden@googlemail.com>
> wrote:
> > On Mon, 2012-03-26 at 17:06 +0100, Keir Fraser wrote:
> >> Cc'ing Daniel for you on this one, Jan.
> >>
> >>  K.
> >>
> >> On 26/03/2012 16:56, "Jan Beulich" <JBeulich@suse.com> wrote:
> >>
> >> > All the resources allocated based on xen_blkif_reqs are global in
> >> > blkback. While (without having measured anything) I think that this
> >> > is bad from a QoS perspective (not the least implied from a warning
> >> > issued by Citrix'es multi-page-ring patches:
> >> >
> >> > if (blkif_reqs < BLK_RING_SIZE(order)) printk(KERN_WARNING
> >> > "WARNING: "
> >> >       "I/O request space (%d reqs) < ring order %ld, "
> >> >       "consider increasing %s.reqs to >= %ld.",
> >> >       blkif_reqs, order, KBUILD_MODNAME,
> >> >       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >> >
> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to
> >> > convert this to per-instance allocations, as the amount of memory
> >> > taken away from Dom0 for this may be not insignificant when there
> >> > are many devices.
> >> >
> >> > Does anyone have an opinion here, in particular regarding the
> >> > original authors' decision to make this global vs. the apparently
> >> > made observation (by Daniel Stodden, the author of said patch, who
> >> > I don't have any current email of to ask directly), but also in the
> >> > context of multi-page rings, the purpose of which is to allow for
> >> > larger amounts of in-flight I/O?
> >> >
> >> > Thanks, Jan
> >
> > Re-CC'ing Andrei Lifchits, I think there's been some work going on at
> > Citrix regarding that matter.
> >
> > Yes, just allocating a pfn pool per backend instance is way too much
> > memory balooned out. Otherwise this stuff would have never looked the
> > way it does now.
> 
> This of course could be accounted for by having an initially non-empty (large
> enough) balloon (not sure how easy it is these days to do this for pv-ops, but
> it has always been trivial with the legacy code). That wouldn't help a 32-bit
> kernel much (where generally the initial balloon is all in highmem, yet the
> vacated pages need to be in lowmem), but for 64-bit kernels it should be
> fine.
> 
> > Regarding the right balance, note that on the other extreme end, if
> > PFN space were infinite, there's not much expected performance gain
> > from rendering virtual backends fully independent. Beyond controller
> > queue depth, these requests are all just going to pile up, waiting.
> 
> Is there a way to look through the queue stack to find out how many distinct
> ones there are that the backend is running on top of as well as - for a
> particular I/O path - the one with the smallest depth? Or can one assume
> that the top most one (generally loop's or blktap2's) won't advertise a queue
> deeper than what is going to be accepted downstream (probably not, I'd
> guess)?

Hm, I don't remember seeing anything relating to that off the top of my head in the blkback code, so I don't think so. (I'm not sure the benefit would be that great, anyways).

> And - what you say would similarly apply to the usefulness of multi-page
> rings afaict.
> 
> > XenServer has some support for decoupling in blktap.ko [1] which
> > worked relatively well: Use frame 'pool' kobjects. A bunch of pages,
> > mapped to sysfs object. Name was arbitrary. Size configurable, even at
> runtime.

I have added a similar functionality to blkback (pools configurable through xenstore, with userland tools creating one pool per SR), which is now out in the form of a limited-availability hotfix and will be there in the next XenServer release. Felipe (CC'd) measured the effects on performance and found that it helps.

> > Sysfs meant stuff was easily set up by shell or python code, or
> > manually. To become operational, every backend must be bound to a pool
> > (initially, the global 'default' one, for tool compat). Backends can
> > be relinked arbitrarily before entering Connected state.
> >
> > Then let the userland toolstack set things up according to physical
> > I/O topology and properties probed. Basically every physical backend
> > (say, a volume group, or a HBA) would start out by allocating and
> > dimensioning a dedicated pool (named after the backend), and every
> > backend instance fired up gets bound to the pool it belongs to.
> 
> Having userland do all that seems like a fallback solution only to me - I would
> hope that sufficient information is available directly to the drivers.

You're probably right.

> Thanks in any case for responding so quickly, Jan
> 
> > There's a lot of additional optimizations one could consider, e.g.
> > autogrowing the pool (log(nbackends) or so?) and the like. To improve
> > locality, having backends which look ahead in their request queue and
> > allocate whole batches is probably a good idea too, etc, etc.
> >
> > HTH,
> > Daniel
> >
> > [1]
> > http://xenbits.xen.org/gitweb/?p=people/dstodden/linux.git
> >  mostly in drivers/block/blktap/sysfs.c (show/store_pool) and request.c.
> >  Note that these are based on mempools, not the frame pools blkback
> > would take.
> 

Cheers,
Andrei

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2012-04-02  9:56 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-03-26 15:56 blkback global resources Jan Beulich
2012-03-26 16:20 ` Ian Campbell
2012-03-26 16:56   ` Konrad Rzeszutek Wilk
2012-03-26 20:47   ` Matt Wilson
2012-03-27  8:45     ` Ian Campbell
2012-03-27  9:41   ` Wei Liu
2012-03-27 10:22     ` Jan Beulich
2012-03-27 12:34       ` Wei Liu
2012-03-27 12:45         ` Jan Beulich
2012-03-27 13:01           ` Wei Liu
     [not found] <CB965278.2F50F%keir.xen@gmail.com>
     [not found] ` <1332780797.30244.159.camel@espiritosanto>
2012-03-27  7:27   ` Jan Beulich
2012-03-27  9:21     ` Wei Liu
2012-03-27 10:11       ` Jan Beulich
2012-04-02  9:56     ` Andrei Lifchits

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.