* stripe cache question
@ 2011-02-24 21:06 Piergiorgio Sartor
2011-02-25 3:51 ` NeilBrown
0 siblings, 1 reply; 6+ messages in thread
From: Piergiorgio Sartor @ 2011-02-24 21:06 UTC (permalink / raw)
To: linux-raid
Hi all,
A few posts ago it was mentioned that the unit of the stripe
cache is "pages per device", usually 4K pages.
Questions:
1) Does "device" mean raid (md) device or component
device (HDD)?
2) The max possible value seems to be 32768, which
means, in case of a 4K page per md device, a max of
128MiB of RAM.
Is this by design? Would it be possible to increase it
up to whatever RAM is available?
Thanks,
bye,
--
piergiorgio
* Re: stripe cache question
2011-02-24 21:06 stripe cache question Piergiorgio Sartor
@ 2011-02-25 3:51 ` NeilBrown
2011-02-26 10:21 ` Piergiorgio Sartor
0 siblings, 1 reply; 6+ messages in thread
From: NeilBrown @ 2011-02-25 3:51 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: linux-raid
On Thu, 24 Feb 2011 22:06:43 +0100 Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:
> Hi all,
>
> A few posts ago it was mentioned that the unit of the stripe
> cache is "pages per device", usually 4K pages.
>
> Questions:
>
> 1) Does "device" mean raid (md) device or component
> device (HDD)?
component device.
In drivers/md/raid5.[ch] there is a 'struct stripe_head'.
It holds one page per component device (ignoring spares).
Several of these comprise the 'cache'. The 'size' of the cache is the number
of 'struct stripe_head' and associated pages that are allocated.
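In other words, the memory footprint scales with both the cache size and the number of component devices. A rough back-of-the-envelope sketch (the 5-disk array and the 4K page size are just illustrative assumptions):

```python
PAGE_SIZE = 4096  # bytes; the usual page size on x86

def stripe_cache_bytes(stripe_cache_size, raid_disks):
    """Estimate RAM pinned by the raid5/6 stripe cache.

    Each 'struct stripe_head' holds one page per component
    device, so memory ~= cache entries * devices * page size
    (ignoring the small overhead of the structs themselves).
    """
    return stripe_cache_size * raid_disks * PAGE_SIZE

# e.g. the default cache of 256 stripes on a 5-disk array:
print(stripe_cache_bytes(256, 5) // 1024)  # 5120 KiB
```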
>
> 2) The max possible value seems to be 32768, which
> means, in case of a 4K page per md device, a max of
> 128MiB of RAM.
> Is this by design? Would it be possible to increase it
> up to whatever RAM is available?
32768 is just an arbitrary number. It is there in raid5.c and is easy to
change (for people comfortable with recompiling their kernels).
I wanted an upper limit because setting it too high could easily cause your
machine to run out of memory and become very sluggish - or worse.
Ideally the cache should be automatically sized based on demand and memory
size - with maybe just a tunable to select between "use as much memory as you
need - within reason" versus "use as little memory as you can manage with".
But that requires thought and design and code and .... it just never seemed
like a priority.
NeilBrown
>
> Thanks,
>
> bye,
>
* Re: stripe cache question
2011-02-25 3:51 ` NeilBrown
@ 2011-02-26 10:21 ` Piergiorgio Sartor
2011-02-27 4:43 ` NeilBrown
0 siblings, 1 reply; 6+ messages in thread
From: Piergiorgio Sartor @ 2011-02-26 10:21 UTC (permalink / raw)
To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid
On Fri, Feb 25, 2011 at 02:51:25PM +1100, NeilBrown wrote:
> On Thu, 24 Feb 2011 22:06:43 +0100 Piergiorgio Sartor
> <piergiorgio.sartor@nexgo.de> wrote:
>
> > Hi all,
> >
> > A few posts ago it was mentioned that the unit of the stripe
> > cache is "pages per device", usually 4K pages.
> >
> > Questions:
> >
> > 1) Does "device" mean raid (md) device or component
> > device (HDD)?
>
> component device.
> In drivers/md/raid5.[ch] there is a 'struct stripe_head'.
> It holds one page per component device (ignoring spares).
> Several of these comprise the 'cache'. The 'size' of the cache is the number
> of 'struct stripe_head' and associated pages that are allocated.
>
>
> >
> > 2) The max possible value seems to be 32768, which
> > means, in case of a 4K page per md device, a max of
> > 128MiB of RAM.
> > Is this by design? Would it be possible to increase it
> > up to whatever RAM is available?
>
> 32768 is just an arbitrary number. It is there in raid5.c and is easy to
> change (for people comfortable with recompiling their kernels).
Ah! I found it. Maybe, considering currently
available memory, you should think about increasing
it to, for example, 128K or 512K.
> I wanted an upper limit because setting it too high could easily cause your
> machine to run out of memory and become very sluggish - or worse.
>
> Ideally the cache should be automatically sized based on demand and memory
> size - with maybe just a tunable to select between "use as much memory as you
> need - within reason" versus "use as little memory as you can manage with".
>
> But that requires thought and design and code and .... it just never seemed
> like a priority.
You're a bit contradicting your philosophy of
"let's do the smart things in user space"... :-)
IMHO, if really necessary, it could be enough to
have this "upper limit" available in sysfs.
Then user space can decide what to do with it.
For example, at boot the amount of memory is checked
and the upper limit set.
I see a duplication here; maybe it would be better to just remove
the upper limit and let user space deal with that.
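Such a boot-time policy could be sketched in user space like this. The sysfs path is the existing stripe_cache_size entry; the 10% memory budget, the md0 name and the disk count are made-up example values:

```python
# Boot-time policy sketch: cap the stripe cache at a fraction
# of total RAM instead of relying on a hard kernel limit.
PAGE_SIZE = 4096

def mem_total_bytes(meminfo="/proc/meminfo"):
    # MemTotal is reported in kB in /proc/meminfo
    with open(meminfo) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) * 1024
    raise RuntimeError("MemTotal not found")

def set_stripe_cache(md="md0", raid_disks=5, budget=0.10):
    # Each cache entry pins one page per component device,
    # so convert the byte budget into a number of stripes.
    stripes = int(mem_total_bytes() * budget / (raid_disks * PAGE_SIZE))
    with open(f"/sys/block/{md}/md/stripe_cache_size", "w") as f:
        f.write(str(stripes))
```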
bye,
--
piergiorgio
* Re: stripe cache question
2011-02-26 10:21 ` Piergiorgio Sartor
@ 2011-02-27 4:43 ` NeilBrown
2011-02-27 11:37 ` Piergiorgio Sartor
2011-03-06 20:08 ` Piergiorgio Sartor
0 siblings, 2 replies; 6+ messages in thread
From: NeilBrown @ 2011-02-27 4:43 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: linux-raid
On Sat, 26 Feb 2011 11:21:28 +0100 Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:
> On Fri, Feb 25, 2011 at 02:51:25PM +1100, NeilBrown wrote:
> > On Thu, 24 Feb 2011 22:06:43 +0100 Piergiorgio Sartor
> > <piergiorgio.sartor@nexgo.de> wrote:
> >
> > > Hi all,
> > >
> > > A few posts ago it was mentioned that the unit of the stripe
> > > cache is "pages per device", usually 4K pages.
> > >
> > > Questions:
> > >
> > > 1) Does "device" mean raid (md) device or component
> > > device (HDD)?
> >
> > component device.
> > In drivers/md/raid5.[ch] there is a 'struct stripe_head'.
> > It holds one page per component device (ignoring spares).
> > Several of these comprise the 'cache'. The 'size' of the cache is the number
> > of 'struct stripe_head' and associated pages that are allocated.
> >
> >
> > >
> > > 2) The max possible value seems to be 32768, which
> > > means, in case of a 4K page per md device, a max of
> > > 128MiB of RAM.
> > > Is this by design? Would it be possible to increase it
> > > up to whatever RAM is available?
> >
> > 32768 is just an arbitrary number. It is there in raid5.c and is easy to
> > change (for people comfortable with recompiling their kernels).
>
> Ah! I found it. Maybe, considering currently
> available memory, you should think about increasing
> it to, for example, 128K or 512K.
>
> > I wanted an upper limit because setting it too high could easily cause your
> > machine to run out of memory and become very sluggish - or worse.
> >
> > Ideally the cache should be automatically sized based on demand and memory
> > size - with maybe just a tunable to select between "use as much memory as you
> > need - within reason" versus "use as little memory as you can manage with".
> >
> > But that requires thought and design and code and .... it just never seemed
> > like a priority.
>
> You're a bit contradicting your philosophy of
> "let's do the smart things in user space"... :-)
>
> IMHO, if really necessary, it could be enough to
> have this "upper limit" available in sysfs.
>
> Then user space can decide what to do with it.
>
> For example, at boot the amount of memory is checked
> and the upper limit set.
> I see a duplication here; maybe it would be better to just remove
> the upper limit and let user space deal with that.
Maybe.... I still feel I want some sort of built-in protection...
Maybe if I did all the allocations with "__GFP_WAIT" clear so that it would
only allocate memory that is easily available. It wouldn't be a hard
guarantee against running out, but it might help..
Maybe you could try removing the limit and see what actually happens when
you set a ridiculously large size?
NeilBrown
* Re: stripe cache question
2011-02-27 4:43 ` NeilBrown
@ 2011-02-27 11:37 ` Piergiorgio Sartor
2011-03-06 20:08 ` Piergiorgio Sartor
1 sibling, 0 replies; 6+ messages in thread
From: Piergiorgio Sartor @ 2011-02-27 11:37 UTC (permalink / raw)
To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid
> > > Ideally the cache should be automatically sized based on demand and memory
> > > size - with maybe just a tunable to select between "use as much memory as you
> > > need - within reason" versus "use as little memory as you can manage with".
> > >
> > > But that requires thought and design and code and .... it just never seemed
> > > like a priority.
> >
> > You're a bit contradicting your philosophy of
> > "let's do the smart things in user space"... :-)
> >
> > IMHO, if really necessary, it could be enough to
> > have this "upper limit" available in sysfs.
> >
> > Then user space can decide what to do with it.
> >
> > For example, at boot the amount of memory is checked
> > and the upper limit set.
> > I see a duplication here; maybe it would be better to just remove
> > the upper limit and let user space deal with that.
>
>
> Maybe.... I still feel I want some sort of built-in protection...
As I wrote, I think a second sysfs entry, with the upper
limit, could be enough.
It allows flexibility and some degree of protection.
Two _coordinated_ accesses to sysfs would be required in order
to exceed the limit, which is unlikely to happen by accident.
That is, at boot /sys/block/mdX/md/stripe_cache_limit will
be 32768 and the "cache_size" will be 256.
If someone wants to play with the cache size, they will be able
to raise it up to 32768. Otherwise, the first entry has to be
changed to a higher value (its minimum should be the cache_size).
This is, of course, a duplication, but it enforces a certain
process (two accesses), thus giving some degree of protection.
I guess, but you're the expert, this should be easier than
other solutions.
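A sketch of how user space might use this proposed two-step interface. Note that stripe_cache_limit is only a suggestion made in this thread and does not exist in the kernel; only stripe_cache_size does:

```python
# Hypothetical two-step dance: lift the ceiling first, then
# grow the cache. Two coordinated writes are needed to exceed
# the boot-time limit, which is the point of the proposal.
def raise_cache(md, wanted, sysfs="/sys/block"):
    base = f"{sysfs}/{md}/md"
    with open(f"{base}/stripe_cache_limit") as f:
        limit = int(f.read())
    if wanted > limit:
        # first access: raise the (hypothetical) upper limit
        with open(f"{base}/stripe_cache_limit", "w") as f:
            f.write(str(wanted))
    # second access: actually grow the stripe cache
    with open(f"{base}/stripe_cache_size", "w") as f:
        f.write(str(wanted))
```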
> Maybe if I did all the allocations with "__GFP_WAIT" clear so that it would
> only allocate memory that is easily available. It wouldn't be a hard
> guarantee against running out, but it might help..
Again, I think you're over-designing it.
BTW, I hope that is unswappable memory, right?
> Maybe you could try removing the limit and see what actually happens when
> you set a ridiculously large size?
Yes and no. The home PC has a RAID-10f2, the work PC has
a RAID-5, but I do not want to play with the kernel on it.
I guess using loop devices will not be meaningful.
As soon as I manage to build the RAID-6 NAS I could give it
a try, but this has no "schedule" right now.
bye,
--
piergiorgio
* Re: stripe cache question
2011-02-27 4:43 ` NeilBrown
2011-02-27 11:37 ` Piergiorgio Sartor
@ 2011-03-06 20:08 ` Piergiorgio Sartor
1 sibling, 0 replies; 6+ messages in thread
From: Piergiorgio Sartor @ 2011-03-06 20:08 UTC (permalink / raw)
To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid
> Maybe you could try removing the limit and see what actually happens when
> you set a ridiculously large size?
Actually, I tried (unintentionally) something similar.
The storage I have contains 7 RAID-6 arrays, so I tried to
increase the stripe_cache_size to 32768 on each array.
Of course, while writing to the array...
The first 3 have 10, 10 and 9 HDDs, so with the third
array about 3.6GiB were allocated out of 4GiB the PC has.
At this point the PC was completely unresponsive, but
still working, i.e. it was still writing to the array.
Also ssh did not answer in time.
Nevertheless, it was not dead or locked, just extremely
slow; in fact, once the writing finished, the PC was
again working as before.
I guess there was a lot of swapping going on...
In any case, an upper limit seems to be necessary, but
it should be consistent with all available RAM.
It does not help, it seems, to limit arrays independently;
there should be a "global" limit, so that the _sum_ of
the caches does not exceed this limit.
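For what it's worth, the arithmetic matches the observation above (assuming 4K pages):

```python
# Three arrays at the 32768 maximum, with 10, 10 and 9
# component devices: each cache entry pins one page per device.
PAGE_SIZE = 4096
total = sum(32768 * disks * PAGE_SIZE for disks in (10, 10, 9))
print(total / 2**30)  # 3.625 GiB - matching "about 3.6GiB"
```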
Hope this helps,
bye,
--
piergiorgio
Thread overview: 6+ messages
2011-02-24 21:06 stripe cache question Piergiorgio Sartor
2011-02-25 3:51 ` NeilBrown
2011-02-26 10:21 ` Piergiorgio Sartor
2011-02-27 4:43 ` NeilBrown
2011-02-27 11:37 ` Piergiorgio Sartor
2011-03-06 20:08 ` Piergiorgio Sartor