how to scale root-reserved space going forward...

linux-ext4.vger.kernel.org archive mirror

* how to scale root-reserved space going forward...
@ 2009-03-01 22:22 Eric Sandeen
  2009-03-02  2:47 ` Theodore Tso
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Sandeen @ 2009-03-01 22:22 UTC (permalink / raw)
  To: ext4 development

5% of a 16T filesystem is getting a little crazy from the point of view
of "root-reserved" - 800G!

But I think the original reason for this reserved space was actually as
an allocator cushion; letting root gain access to it was just a
safety-valve for that.

Now that we have a completely different allocator in ext4, and
potentially much larger filesystems, I think we need to revisit how much
is held back, and for what reason.

Any thoughts on a reasonable way to scale this reservation (or, just for
discussion - if it's even needed at all today for ext4?)

-Eric


* Re: how to scale root-reserved space going forward...
  2009-03-01 22:22 how to scale root-reserved space going forward Eric Sandeen
@ 2009-03-02  2:47 ` Theodore Tso
  2009-03-02  7:17   ` Ron Johnson
  2009-03-02  8:56   ` Andreas Dilger
  0 siblings, 2 replies; 5+ messages in thread
From: Theodore Tso @ 2009-03-02  2:47 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: ext4 development

On Sun, Mar 01, 2009 at 04:22:54PM -0600, Eric Sandeen wrote:
> 5% of a 16T filesystem is getting a little crazy from the point of view
> of "root-reserved" - 800G!
> 
> But I think the original reason for this reserved space was actually as
> an allocator cushion; letting root gain access to it was just a
> safety-valve for that.

Yep, that's correct.  Historically, this came from BSD Fast
Filesystem, which used to use a default reserve of 10%.  To quote from
the FreeBSD sources, in ufs/ffs.h:

/*
 * MINFREE gives the minimum acceptable percentage of filesystem
 * blocks which may be free. If the freelist drops below this level
 * only the superuser may continue to allocate blocks. This may
 * be set to 0 if no reserve of free blocks is deemed necessary,
 * however throughput drops by fifty percent if the filesystem
 * is run at between 95% and 100% full; thus the minimum default
 * value of fs_minfree is 5%. However, to get good clustering
 * performance, 10% is a better choice. hence we use 10% as our
 * default value. With 10% free space, fragmentation is not a
 * problem, so we choose to optimize for time.
 */
#define MINFREE         8

The interesting thing is that FreeBSD has decided to push things down to
8%.  A quick survey shows that NetBSD is using a MINFREE of 5%, like
Linux.  (Fortunately, http://fxr.watson.org/ makes it easy to make
these comparisons.)  

And like Linux, it looks like the *BSD's have the same tendency not to
update the comments when they update the code.  :-)

> Now that we have a completely different allocator in ext4, and
> potentially much larger filesystems, I think we need to revisit how much
> is held back, and for what reason.
> 
> Any thoughts on a reasonable way to scale this reservation (or, just for
> discussion - if it's even needed at all today for ext4?)

This is a reasonable question.  What would be great is if we could get
a benchmarking team to fill an ext4 filesystem with files.  The simple
thing would be if we did something fixed --- say, 50 files per
directory, each file 100k, and say 10 subdirectories in each
directory, to some fixed depth, and with a filesystem size of at least
8 gigabytes (which would give us at least 16 flex groups with the
default flex size of 16) --- and then filled each filesystem from
0% to 90% in increments of 10%, and from 90% to 99% in increments of
1%, and then ran some throughput benchmark like bonnie on the mostly
filled filesystem.
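
[Editor's sketch, not part of the original message.] The fixed-shape fill plan above can be written down as a small calculation. This is only an illustration under stated assumptions: "100k" is read as 100 KiB, the root directory counts as depth 0, and the names fill_levels and tree_stats are hypothetical. Nothing here touches a disk.

```python
# Sketch of the fixed-shape fill plan; all names are illustrative.
FILES_PER_DIR = 50
FILE_SIZE = 100 * 1024      # assumption: "100k" read as 100 KiB
SUBDIRS_PER_DIR = 10


def fill_levels():
    """Target fill percentages: 0..90 in steps of 10, then 91..99 in steps of 1."""
    return list(range(0, 91, 10)) + list(range(91, 100))


def tree_stats(depth):
    """Return (directories, files, bytes) for a tree whose root is at depth 0."""
    dirs = sum(SUBDIRS_PER_DIR ** level for level in range(depth + 1))
    files = dirs * FILES_PER_DIR
    return dirs, files, files * FILE_SIZE


if __name__ == "__main__":
    print(fill_levels())
    for depth in range(4):
        print(depth, tree_stats(depth))
```

One such tree of depth 3 holds about 5.3 GiB under these assumptions, so a filler would repeat or truncate trees to hit each target fill level.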

A better filler would probably use random file sizes with an average
size of say 64k, but with outliers from 4k to 128 megs, and a similar
random distribution of number of files per directory, and number of
subdirectories and depth of subdirectories.
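
[Editor's sketch, not part of the original message.] One way to draw file sizes with a roughly 64k average and rare large outliers is a clamped log-normal distribution. The choice of log-normal, the sigma value, and the name sample_file_sizes are all assumptions made for illustration; the message does not name a distribution. Clamping to the 4k..128M range shifts the mean slightly, so the average is only approximately 64k.

```python
import math
import random

MIN_SIZE = 4 * 1024             # 4k lower outlier bound
MAX_SIZE = 128 * 1024 * 1024    # 128 megs upper outlier bound
TARGET_MEAN = 64 * 1024         # desired average size before clamping


def sample_file_sizes(count, sigma=1.5, seed=0):
    """Draw `count` file sizes from a log-normal, clamped to [MIN_SIZE, MAX_SIZE].

    For a log-normal, mean = exp(mu + sigma**2 / 2), so mu is chosen to give
    TARGET_MEAN before clamping. sigma is an arbitrary illustrative value.
    """
    rng = random.Random(seed)
    mu = math.log(TARGET_MEAN) - sigma ** 2 / 2
    sizes = []
    for _ in range(count):
        size = int(rng.lognormvariate(mu, sigma))
        sizes.append(min(MAX_SIZE, max(MIN_SIZE, size)))
    return sizes


if __name__ == "__main__":
    sizes = sample_file_sizes(10000)
    print(min(sizes), max(sizes), sum(sizes) // len(sizes))
```

A fixed seed keeps the generated file set reproducible across benchmark runs, which matters when comparing fill levels.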

I suppose it would be good to do one set of charts with a filesystem
size of 8 gigs, and another at 80 gigs and 800 gigs, and see if the
shape of the filesystem curve changes at scale.  Once we have that, we
would be in a position to make a reasonable set of defaults.

Or we could just guess and come up with some percentage figure that
sounds good.  :-)

						- Ted


* Re: how to scale root-reserved space going forward...
  2009-03-02  2:47 ` Theodore Tso
@ 2009-03-02  7:17   ` Ron Johnson
  2009-03-02  8:56   ` Andreas Dilger
  1 sibling, 0 replies; 5+ messages in thread
From: Ron Johnson @ 2009-03-02  7:17 UTC (permalink / raw)
  To: ext4 development

On 03/01/2009 08:47 PM, Theodore Tso wrote:
[snip]
> 
> Or we could just guess and come up with some percentage figure that
> sounds good.  :-)

5%, but 100GB if the fs is .ge. 2TB?
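
[Editor's sketch, not part of the original message.] Reading the suggestion as "reserve 5%, but a flat 100GB once the filesystem is at least 2TB", the rule fits in a few lines. Decimal units (1 GB = 10**9 bytes) and the name reserved_bytes are assumptions for illustration; this is not how mke2fs or ext4 compute the reservation.

```python
GB = 10 ** 9    # assumption: decimal units
TB = 10 ** 12


def reserved_bytes(fs_bytes, percent=5, cap_bytes=100 * GB, cap_from=2 * TB):
    """Reserve `percent` of the filesystem, or a flat cap at/above `cap_from`."""
    if fs_bytes >= cap_from:
        return cap_bytes
    return fs_bytes * percent // 100


if __name__ == "__main__":
    for size_tb in (1, 2, 4, 16):
        size = size_tb * TB
        print(size_tb, "TB ->", reserved_bytes(size) // GB, "GB reserved,",
              "vs", size * 5 // 100 // GB, "GB at a flat 5%")
```

With decimal units, 5% of 2TB is exactly 100GB, so the rule is continuous at the threshold; at 16TB it reserves 100GB where a flat 5% would reserve 800GB.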

-- 
Ron Johnson, Jr.
Jefferson LA  USA

The feeling of disgust at seeing a human female in a Relationship
with a chimp male is Homininphobia, and you should be ashamed of
yourself.


* Re: how to scale root-reserved space going forward...
  2009-03-02  2:47 ` Theodore Tso
  2009-03-02  7:17   ` Ron Johnson
@ 2009-03-02  8:56   ` Andreas Dilger
  2009-03-02 16:26     ` Eric Sandeen
  1 sibling, 1 reply; 5+ messages in thread
From: Andreas Dilger @ 2009-03-02  8:56 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Eric Sandeen, ext4 development

On Mar 01, 2009  21:47 -0500, Theodore Ts'o wrote:
> This is a reasonable question.  What would be great is if we could get
> a benchmarking team to fill an ext4 filesystem with files.  The simple
> thing would be if we did something fixed --- say, 50 files per
> directory, each file 100k, and say 10 subdirectories in each
> directory, to some fixed depth, and with a filesystem size of at least
> 8 gigabytes (which would give us at least 16 flex groups with the
> default flex size of 16) --- and then filled each filesystem from
> 0% to 90% in increments of 10%, and from 90% to 99% in increments of
> 1%, and then ran some throughput benchmark like bonnie on the mostly
> filled filesystem.

We've done tests like this, and it is important to take the inner vs.
outer cylinders into account.  It can happen that even a "perfectly"
allocated filesystem will appear to show slowdowns in performance as
it gets full, yet this is partially due to physical disk layout issues.

> A better filler would probably use random file sizes with an average
> size of say 64k, but with outliers from 4k to 128 megs, and a similar
> random distribution of number of files per directory, and number of
> subdirectories and depth of subdirectories.

You describe the Reiserfs "Mongo" benchmark.

> I suppose it would be good to do one set of charts with a filesystem
> size of 8 gigs, and another at 80 gigs and 800 gigs, and see if the
> shape of the filesystem curve changes at scale.  Once we have that, we
> would be in a position to make a reasonable set of defaults.
> 
> Or we could just guess and come up with some percentage figure that
> sounds good.  :-)

I suspect that at a certain filesystem size, there isn't much benefit
in having more reserved space.  If we keep 50GB of reserved space then
this is likely to contain a decent amount of 1MB free chunks, which is
what we really care about.
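
[Editor's sketch, not part of the original message.] To make "1MB free chunks" concrete: given a per-block free map, one can count how many 1MB-sized chunks are entirely free. The 4 KiB block size, the restriction to aligned chunks, and the name count_free_chunks are assumptions for illustration; this is not how ext4's allocator tracks free space.

```python
BLOCK_SIZE = 4 * 1024                       # assumption: 4 KiB blocks
CHUNK_SIZE = 1024 * 1024                    # 1MB chunk
BLOCKS_PER_CHUNK = CHUNK_SIZE // BLOCK_SIZE # 256 blocks per chunk

# Upper bound on 1MB chunks inside a 50GB reservation (binary units assumed).
MAX_CHUNKS_IN_50G = 50 * 2 ** 30 // CHUNK_SIZE


def count_free_chunks(free, blocks_per_chunk=BLOCKS_PER_CHUNK):
    """Count aligned chunks in which every block is free.

    `free` is a sequence of booleans, one per block; a trailing partial
    chunk is ignored.
    """
    full_chunks = len(free) // blocks_per_chunk
    return sum(
        all(free[i * blocks_per_chunk:(i + 1) * blocks_per_chunk])
        for i in range(full_chunks)
    )


if __name__ == "__main__":
    # Two fully free chunks followed by one chunk with a single used block.
    free_map = [True] * 512 + [False] + [True] * 255
    print(count_free_chunks(free_map), "of", len(free_map) // BLOCKS_PER_CHUNK)
    print(MAX_CHUNKS_IN_50G)
```

A single used block spoils a whole chunk, which is why the amount of reserved space alone does not guarantee a given number of usable 1MB extents.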

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: how to scale root-reserved space going forward...
  2009-03-02  8:56   ` Andreas Dilger
@ 2009-03-02 16:26     ` Eric Sandeen
  0 siblings, 0 replies; 5+ messages in thread
From: Eric Sandeen @ 2009-03-02 16:26 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Theodore Tso, ext4 development

Andreas Dilger wrote:

> I suspect that at a certain filesystem size, there isn't much benefit
> in having more reserved space.  If we keep 50GB of reserved space then
> this is likely to contain a decent amount of 1MB free chunks, which is
> what we really care about.
> 
> Cheers, Andreas

As long as we don't care much about inode<->block locality, that is
probably ok...

I got one off-list reply from someone who was depending on the 5% for
administration reasons, but even if we change the default I assume we'll
leave the tunable in place for people who really do want 5% for
"root-reserved" over "allocator safety buffer" reasons.

-Eric
