* tuning, many small files, small blocksize
@ 2008-02-16 5:01 Jeff Breidenbach
2008-02-16 9:28 ` Hannes Dorbath
` (3 more replies)
0 siblings, 4 replies; 15+ messages in thread
From: Jeff Breidenbach @ 2008-02-16 5:01 UTC (permalink / raw)
To: xfs
I'm testing xfs for use in storing 100 million+ small files
(roughly 4 to 10KB each) and some directories will contain
tens of thousands of files. There will be a lot of random
reading, and also some random writing, and very little
deletion. The underlying disks use linux software RAID-1
managed by mdadm with 5X redundancy, i.e. 5 drives that
completely mirror each other.
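The 5-way mirror described above might be assembled with mdadm roughly as follows. This is only a sketch under stated assumptions: the device names /dev/sd[b-f] and the array name /dev/md0 are placeholders, not details from the thread, and the commands need root and real (empty) drives.

```shell
# Hypothetical 5-way RAID-1 mirror; all device names are placeholders.
mdadm --create /dev/md0 --level=1 --raid-devices=5 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
# md raid1 can serve reads from any of the five mirrors, which is
# where the read parallelism discussed later in the thread comes from.
cat /proc/mdstat   # watch the array build/resync
```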
I am setting up the xfs partition now, and have only played
with blocksize so far. 512 byte blocks are most space efficient,
1024 byte blocks cost 3.3% additional space, and 4096 byte
blocks cost 22.3% additional space. I do not know of a good
way to benchmark filesystem speed; iozone -s 5 did not provide
meaningful results due to poor timing quantization.
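When iozone's timing quantization is too coarse, one crude alternative is to time random whole-file reads directly. The sketch below is my own illustration, not part of the thread; the file counts and read counts are scaled far down from the real workload.

```python
import os, random, tempfile, time

def make_files(root, n, size_lo=4096, size_hi=10240):
    # Create n small files, comparable to the 4-10KB corpus described above.
    paths = []
    for i in range(n):
        p = os.path.join(root, "f%06d" % i)
        with open(p, "wb") as f:
            f.write(os.urandom(random.randint(size_lo, size_hi)))
        paths.append(p)
    return paths

def time_random_reads(paths, reads):
    # Wall-clock time for `reads` random whole-file reads.
    t0 = time.time()
    total = 0
    for p in random.choices(paths, k=reads):
        with open(p, "rb") as f:
            total += len(f.read())
    return time.time() - t0, total

with tempfile.TemporaryDirectory() as d:
    paths = make_files(d, 200)
    elapsed, nbytes = time_random_reads(paths, 500)
    print("%d bytes in %.3fs" % (nbytes, elapsed))
```

On a freshly written tree much of this will come from page cache; dropping caches between runs (or using a data set larger than RAM) is needed for seek-bound numbers.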
My questions are:
a) Should I just go with the 512 byte blocksize or is that going to be
bad for some performance reason? Going to 1024 is no problem,
but I'd prefer not to waste 20% of the partition capacity by using 4096.
b) Are there any other mkfs.xfs parameters that I should play with.
Thanks for any response; I did do quite some searching for
recommended tuning parameters, but did not find definitive answers.
The general consensus was xfs does pretty good tuning itself, but
almost none of the published benchmarks or recommendations go with
small blocksizes and I want to make sure I'm not about to do something
totally stupid. Like quadruple the number of seeks on the disk.
^ permalink raw reply [flat|nested] 15+ messages in thread

* Re: tuning, many small files, small blocksize
From: Hannes Dorbath @ 2008-02-16 9:28 UTC (permalink / raw)
To: Jeff Breidenbach; +Cc: xfs

Jeff Breidenbach wrote:
> The underlying disks use linux software RAID-1
> managed by mdadm with 5X redundancy, i.e. 5 drives that
> completely mirror each other.

That's maybe a bit paranoid, but on the other hand it should give good
parallelism.

> a) Should I just go with the 512 byte blocksize or is that going to be
> bad for some performance reason? Going to 1024 is no problem,
> but I'd prefer not to waste 20% of the partition capacity by using 4096.

I don't think there is a performance problem with 512 byte block size,
but it limits the internal log size to 32MB. You might want to use a
larger external log.

> b) Are there any other mkfs.xfs parameters that I should play with.

mkfs.xfs -n size=16k -i attr=2 -l lazy-count=1,version=2,size=32m \
  -b size=512 /dev/sda

mount -onoatime,logbufs=8,logbsize=256k /dev/sda /mnt/xfs

Requires kernel 2.6.23 and xfsprogs 2.9.5. As said, you might want to
use an external log device.

--
Best regards,
Hannes Dorbath
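For the external log Hannes mentions, the mkfs/mount lines might change along these lines. This is a hedged sketch: the choice of /dev/sdb1 as the log partition and the 128m log size are assumptions for illustration, not details from the thread.

```shell
# Hypothetical variant with an external log device; /dev/sda (data)
# and /dev/sdb1 (dedicated log partition) are placeholders.
mkfs.xfs -n size=16k -i attr=2 \
    -l logdev=/dev/sdb1,lazy-count=1,version=2,size=128m \
    -b size=512 /dev/sda

# The log device must also be named at mount time.
mount -o noatime,logbufs=8,logbsize=256k,logdev=/dev/sdb1 /dev/sda /mnt/xfs
```

Putting the log on a separate spindle keeps journal writes from competing with the small-file data I/O.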
* Re: tuning, many small files, small blocksize
From: Jeff Breidenbach @ 2008-02-16 10:24 UTC (permalink / raw)
To: Hannes Dorbath; +Cc: xfs

> That's maybe a bit paranoid, but on the other hand it should give good
> parallelism.

Yes, the goal is fast read performance for small files.

> mkfs.xfs -n size=16k -i attr=2 -l lazy-count=1,version=2,size=32m \
>   -b size=512 /dev/sda
>
> mount -onoatime,logbufs=8,logbsize=256k /dev/sda /mnt/xfs

This is highly appreciated, thank you very much.

> Requires kernel 2.6.23 and xfsprogs 2.9.5. As said, you might want to
> use an external log device.

I'm running a vendor-supplied 2.6.22 kernel, and a quick test shows the
unsupported feature is lazy-count. How big a deal is it? Upgrading the
kernel before April is painful but I'll do it if important. Presumably
there's no simple way to migrate a non-lazy xfs filesystem to a lazy one.

PS. I don't know if this affects any parameters, but the biggest
directory will have approximately 1.5 million files. There are a few in
the one to two hundred thousand range, and then very many in the tens
of thousands.
* Re: tuning, many small files, small blocksize
From: Jeff Breidenbach @ 2008-02-16 20:30 UTC (permalink / raw)
To: Hannes Dorbath; +Cc: xfs

> Upgrading the kernel before April is painful but I'll do it if important.

Ah, wasn't that painful after all. Thanks again.
* Re: tuning, many small files, small blocksize
From: Timothy Shimmin @ 2008-02-19 0:48 UTC (permalink / raw)
To: Jeff Breidenbach; +Cc: Hannes Dorbath, xfs

Jeff Breidenbach wrote:
> Upgrading the kernel before April is painful but I'll do it if
> important. Presumably there's no simple way to migrate a non-lazy xfs
> filesystem to a lazy one.

Dave would know that answer. But on IRIX we used xfs_chver...

> Using the +c option one can enable a filesystem to use lazy counters.
> Note that you must run xfs_repair(1M) after setting this option to build
> the internal state that is required to support this functionality.
>
> Using the -c option one can disable lazy counters if it is enabled. Note
> that you must run xfs_repair(1M) after clearing this option to ensure
> that the internal state of the filesystem is consistent.

On Linux we have xfs_admin (a wrapper around xfs_db) and the "version"
command in xfs_db. However, it looks like "version" only does extflg,
v2-logs, attr1, and attr2, but not lazy sb counters.

--Tim
* Re: tuning, many small files, small blocksize
From: pg_xfs2 @ 2008-02-16 12:23 UTC (permalink / raw)
To: Linux XFS

>>> On Fri, 15 Feb 2008 21:01:10 -0800, "Jeff Breidenbach"
>>> <jeff@jab.org> said:

jeff> I'm testing xfs for use in storing 100 million+ small
jeff> files (roughly 4 to 10KB each) and some directories will
jeff> contain tens of thousands of files. There will be a lot of
jeff> random reading, and also some random writing, and very
jeff> little deletion. The underlying disks use linux software
jeff> RAID-1 managed by mdadm with 5X redundancy, i.e. 5 drives
jeff> that completely mirror each other.

Reading this was quite entertaining :-).

jeff> [ ... ] The general consensus was xfs does pretty good
jeff> tuning itself, but almost none of the published benchmarks
jeff> or recommendations go with small blocksizes and I want to
jeff> make sure I'm not about to do something totally stupid. [ ... ]

Makes me wonder why silly people come up with pointless stuff like this :-)

http://WWW.Oracle.com/database/berkeley-db.html
* Re: tuning, many small files, small blocksize
From: David Chinner @ 2008-02-18 22:53 UTC (permalink / raw)
To: Jeff Breidenbach; +Cc: xfs

On Fri, Feb 15, 2008 at 09:01:10PM -0800, Jeff Breidenbach wrote:
> I'm testing xfs for use in storing 100 million+ small files
> (roughly 4 to 10KB each) and some directories will contain
> tens of thousands of files. There will be a lot of random
> reading, and also some random writing, and very little
> deletion.
.....
> a) Should I just go with the 512 byte blocksize or is that going to be
> bad for some performance reason? Going to 1024 is no problem,
> but I'd prefer not to waste 20% of the partition capacity by using 4096.

I'd suggest wasting 20% of disk space and staying with 4k block size.

> b) Are there any other mkfs.xfs parameters that I should play with.

Large directory block size (-n size=XXX), esp. if you are putting
thousands of files per directory....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: tuning, many small files, small blocksize
From: Linda Walsh @ 2008-02-18 23:12 UTC (permalink / raw)
To: Jeff Breidenbach; +Cc: xfs

Jeff Breidenbach wrote:
> I'm testing xfs for use in storing 100 million+ small files
> (roughly 4 to 10KB each) and some directories will contain
> tens of thousands of files. There will be a lot of random
> reading, and also some random writing, and very little
> deletion.
>
> I am setting up the xfs partition now, and have only played
> with blocksize so far. 512 byte blocks are most space efficient,
> 1024 byte blocks cost 3.3% additional space, and 4096 byte
> blocks cost 22.3% additional space.
>
> a) Should I just go with the 512 byte blocksize or is that going to be
> bad for some performance reason? Going to 1024 is no problem,
>
> b) Are there any other mkfs.xfs parameters that I should play with.

If your minimum file size is 4KB and max is 10KB, a blocksize of 2K
might give you a reasonable compaction level.

Might also play with the inode size. I *usually* go with 1K-inode+4k block,
but with a 2k block, I'm slightly torn between 512-byte inodes and 1K
inodes, but I can't think of a _great_ reason to ever go with the default
256-byte inode size, since that size seems like it will always cause
the inode to be shared with another, possibly unrelated file.

Remember, in xfs, if the last bit of left-over data in an inode will fit
into the inode, it can save a block-allocation, though I don't know
how this will affect speed. Space-wise, a 2k block size and 1k-inode
size might be good, but I don't know how that would affect performance.

My concern about 512-byte blocks is that, in general (I don't recall
having used such a small block size on xfs), smaller blocks can lead to
greater fragmentation, though xfs is better than the average 'fs' in
laying out files. While 'xfs_fsr' is good about keeping files linear,
it didn't used to work on directories -- and if you have 10's of
thousands of files/directory, that might trigger some more directory
fragmentation. Dunno. After you write the many small files, will you
be appending to them?

As for benchmarks, there's always the standard 'bonnie' and 'bonnie++'.
I don't know how they compare to iozone though -- I'm not familiar with
that benchmark.

I'm sure you are familiar with mount options noatime,nodiratime -- same
concepts, but dir's are split out. Someone else mentioned using
logbsize=256k in the mount options. My manpages may be dated, but they
claim that valid sizes are 16k and 32k.

Also, it depends on the situation, but sometimes flattening out the
directory structure can speed up lookup time.

Sometime back someone did some benchmarks involving log size, and it
seemed that 32768b (4k) or ~128Meg was optimal, if memory serves me
correctly.

Good luck...
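The block-size space trade-off discussed here is simple arithmetic: each file is rounded up to a whole number of blocks. The sketch below is my own rough model; it ignores inodes, directory blocks, and metadata, so it will not reproduce Jeff's exact 3.3%/22.3% measurements, but it shows how overhead grows with block size for a 4-10KB corpus.

```python
import random

def overhead(file_sizes, block):
    # Allocated bytes vs payload bytes for a given fs block size,
    # ignoring inodes and metadata -- a rough model only.
    used = sum(((s + block - 1) // block) * block for s in file_sizes)
    data = sum(file_sizes)
    return used / data - 1.0

random.seed(0)
# Model the corpus from the thread: files roughly 4-10KB, uniform.
sizes = [random.randint(4096, 10240) for _ in range(100000)]
for b in (512, 1024, 2048, 4096):
    print("block %4d: %5.1f%% overhead" % (b, 100 * overhead(sizes, b)))
```

The uniform size distribution here is an assumption; Jeff's real distribution evidently wastes less at 4k than this model predicts.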
* Re: tuning, many small files, small blocksize
From: David Chinner @ 2008-02-18 23:51 UTC (permalink / raw)
To: Linda Walsh; +Cc: Jeff Breidenbach, xfs

On Mon, Feb 18, 2008 at 03:12:44PM -0800, Linda Walsh wrote:
> Might also play with the inode size. I *usually* go with 1K-inode+4k block,
> but with a 2k block, I'm slightly torn between 512-byte inodes and 1K
> inodes, but I can't think of a _great_ reason to ever go with the default
> 256-byte inode size, since that size seems like it will always cause
> the inode to be shared with another, possibly unrelated file.

That makes no sense. Inodes are *unique* - they are not shared with
any other inode at all. Could you explain why you think that 256
byte inodes are any different to larger inodes in this respect?

> Remember, in xfs, if the last bit of left-over data in an inode will fit
> into the inode, it can save a block-allocation, though I don't know
> how this will affect speed.

No, that's wrong. We never put data in inodes.

> Space-wise, a 2k block size and 1k-inode size might be good, but don't
> know how that would affect performance.

Inode size vs block size is pretty much irrelevant w.r.t performance,
except for the fact inode size can't be larger than the block size.

> I'm sure you are familiar with mount options noatime,nodiratime -- same
> concepts, but dir's are split out.

noatime implies nodiratime.

> Also, it depends on the situation, but sometimes flattening out the
> directory structure can speed up lookup time.

Like using large directory block sizes to make large directory
btrees wider and flatter and therefore use less seeks for any given
random directory lookup? ;)

> Sometime back someone did some benchmarks involving log size and it seemed
> that 32768b (4k) or ~128Meg was optimal, if memory serves me correctly.

128MB is the maximum log size currently.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: tuning, many small files, small blocksize
From: Linda Walsh @ 2008-02-19 1:03 UTC (permalink / raw)
To: David Chinner; +Cc: Jeff Breidenbach, xfs

David Chinner wrote:
> That makes no sense. Inodes are *unique* - they are not shared with
> any other inode at all. Could you explain why you think that 256
> byte inodes are any different to larger inodes in this respect?
---
Sorry to be unclear, but it would seem to me that if the minimum
physical blocksize on disk is 512 bytes, then either a 256 byte inode
will share that block with another inode, or you are wasting 256 bytes
on each inode. The latter interpretation doesn't make logical sense.

If the minimum physical I/O size is larger than 512 bytes, then I
would assume even more, *unique*, inodes could be packed in per block.

>> Remember, in xfs, if the last bit of left-over data in an inode will fit
>> into the inode, it can save a block-allocation, though I don't know
>> how this will affect speed.
>
> No, that's wrong. We never put data in inodes.
---
You mean file data, no? Doesn't directory and link data get packed in?
It always gnawed at me why packing small bits of data into inodes was
disallowed for file data, but not for other types of data. How about
extended attribute data? Is it always allocated in separate data
blocks as well, or can it fit into an inode? Why not include file data
as a type of data that could be packed into an inode? I'm sure there's
a good reason, but it seems other types of file system data can be
packed into inodes -- just not file data... or am I really
misinformed? :-)

>> Space-wise, a 2k block size and 1k-inode size might be good, but don't
>> know how that would affect performance.
>
> Inode size vs block size is pretty much irrelevant w.r.t performance,
> except for the fact inode size can't be larger than the block size.
---
If you have a small directory, can't it be stored in the inode?
Wouldn't that save some bit (or block) of I/O?

>> I'm sure you are familiar with mount options noatime,nodiratime -- same
>> concepts, but dir's are split out.
>
> noatime implies nodiratime.
---
Well dang... thanks! Ever since the nodiratime option came out, I
thought I had to specify it in addition. Now my fstabs can be shorter!

>> Also, it depends on the situation, but sometimes flattening out the
>> directory structure can speed up lookup time.
>
> Like using large directory block sizes to make large directory
> btrees wider and flatter and therefore use less seeks for any given
> random directory lookup? ;)
---
Are you saying that directory entries are stored in sorted order in a
B-Tree? Hmmm... Well, I did say it depended on the situation -- you
are right that time lost to seeks might overshadow time lost to the
number of blocks read in. I'd think it might depend on how the
directories are laid out on disk, but in benchmarks I've noticed
larger slowdowns when using more files/dir than when distributing the
same number of files among more dirs. It could have been something
about my test setup, though -- I did not test with varying directory
block sizes. Either I overlooked the naming option size param or was
limited to version=1 for some reason (don't remember when version=2
was added...)

>> Sometime back someone did some benchmarks involving log size and it seemed
>> that 32768b (4k) or ~128Meg was optimal, if memory serves me correctly.
>
> 128MB is the maximum size currently.
---
Maybe that's why it's optimal? :-)

Thanks for the corrections... I appreciate it!
-l
* Re: tuning, many small files, small blocksize
From: David Chinner @ 2008-02-19 2:49 UTC (permalink / raw)
To: Linda Walsh; +Cc: David Chinner, Jeff Breidenbach, xfs

On Mon, Feb 18, 2008 at 05:03:57PM -0800, Linda Walsh wrote:
> David Chinner wrote:
> > That makes no sense. Inodes are *unique* - they are not shared with
> > any other inode at all. Could you explain why you think that 256
> > byte inodes are any different to larger inodes in this respect?
>
> Sorry to be unclear, but it would seem to me that if the
> minimum physical blocksize on disk is 512 bytes, then either a 256
> byte inode will share that block with another inode, or you are
> wasting 256 bytes on each inode. The latter interpretation doesn't
> make logical sense.
>
> If the minimum physical I/O size is larger than 512 bytes,
> then I would assume even more, *unique*, inodes could be packed
> in per block.

Inode I/O in XFS is done in *8k clusters* regardless of inode size.
We _never_ do single inode I/O, and hence your logic completely breaks
down at that point ;). Inode clustering is a substantial performance
optimisation that speeds up inode writeback a great deal. e.g. I broke
clustering recently and we got bug reports about how slow XFS had
become after an upgrade to the latest -rcX kernel....

FWIW, while inodes are still operated on completely independently,
there are only occasional problems where inode write I/O conflicts.
e.g.:

http://oss.sgi.com/archives/xfs/2008-02/msg00137.html

> > No, that's wrong. We never put data in inodes.
>
> You mean file data, no?

Right. data = file data.

> Doesn't directory and link data get packed in?

That's metadata, and metadata != file data. Metadata gets packed into
the inode.

> It always gnawed at me, as to why inode's packing
> in small bits of data was disallowed for file data, but not
> other types of data.

We disallow file data and allow metadata in the inode because metadata
is journalled and data is not, and mixing the two introduces extremely
nasty consistency corner cases into crash recovery. It can be done,
but it's tricky and complex.

> How about extended attribute data?

That's metadata.

> > Like using large directory block sizes to make large directory
> > btrees wider and flatter and therefore use less seeks for any given
> > random directory lookup? ;)
>
> Are you saying that directory entries are stored in sorted
> order in a B-Tree? Hmmm...

Put simply, yes. In more detail:

http://oss.sgi.com/projects/xfs/publications/papers/xfs_filesystem_structure.pdf

(grrrr - the link is currently broken - bug raised, will work soon)

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: tuning, many small files, small blocksize
From: Jeff Breidenbach @ 2008-02-19 4:58 UTC (permalink / raw)
To: David Chinner; +Cc: Linda Walsh, xfs

Wow, this is quite some discussion. I went with Hannes Dorbath's
original suggestion (appended), and am now several days into copying
data onto the filesystem. It's conceivable to change course at this
point, but awkward. Dave, you suggested biting the bullet and
sacrificing capacity. Are we talking an overwhelming difference in
read performance for random files - e.g. reducing the number of seeks
by 2X?

Finally, in answer to Linda's question, I don't foresee any appends at
all. The vast majority of files will be write once, read many. A small
fraction will be re-written, e.g. new contents, same filename. An
utterly insignificant fraction will be deleted.

> mkfs.xfs -n size=16k -i attr=2 -l lazy-count=1,version=2,size=32m \
>   -b size=512 /dev/sda
>
> mount -onoatime,logbufs=8,logbsize=256k /dev/sda /mnt/xfs
* Re: tuning, many small files, small blocksize
From: Peter Grandi @ 2008-02-19 8:27 UTC (permalink / raw)
To: Linux XFS

>>> On Mon, 18 Feb 2008 20:58:56 -0800, "Jeff Breidenbach"
>>> <jeff@jab.org> said:

jeff> I'm testing xfs for use in storing 100 million+ small
jeff> files (roughly 4 to 10KB each) and some directories will
jeff> contain tens of thousands of files. There will be a lot
jeff> of random reading, and also some random writing, and
jeff> very little deletion. [ ... ]

jeff> [ ... ] am now several days into copying data onto the
jeff> filesystem. [ ... ]

I have found again an exchange about a similarly absurd (but 100 times
smaller) setup, and here are two relevant extracts:

>> I have a little script, the job of which is to create a lot
>> of very small files (~1 million files, typically ~50-100bytes
>> each). [ ... ] It's a bit of a one-off (or twice, maybe)
>> script, and currently due to finish in about 15 hours, hence
>> why I don't want to spend too much effort on rebuilding the
>> box. [ ... ]

> [ ... ] First, I have appended two little Perl scripts (each
> rather small), one creates a Berkeley DB database of K
> records of random length varying between I and J bytes, the
> second does N accesses at random in that database. I have a
> 1.6GHz Athlon XP with 512MB of memory, and a relatively
> standard 80GB disc 7200RPM. The database is being created on
> a 70% full 8GB JFS filesystem which has been somewhat
> recently created:
> ----------------------------------------------------------------
> $ time perl megamake.pl /var/tmp/db 1000000 50 100
> real    6m28.947s
> user    0m35.860s
> sys     0m45.530s
> ----------------------------------------------------------------
> $ ls -sd /var/tmp/db*
> 130604 /var/tmp/db
> ----------------------------------------------------------------
> Now after an interval, but without cold start (for good
> reasons), 100,000 random fetches:
> ----------------------------------------------------------------
> $ time perl megafetch.pl /var/tmp/db 1000000 100000
> average length: 75.00628
> real    3m3.491s
> user    0m2.870s
> sys     0m2.800s
> ----------------------------------------------------------------
> So, we got 130MiB of disc space used in a single file, >2500
> records sustained per second inserted over 6 minutes and a half,
> 500 records per second sustained over 3 minutes. [ ... ]

So it is less than 400 seconds instead of 15 hours and counting. Those
are the numbers; here are some comments as to what explains the vast
difference (2 orders of magnitude):

> * With 1,000,000 files and a fanout of 50, we need 20,000
>   directories above them, 400 above those and 8 above those.
>   So 3 directory opens/reads every time a file has to be
>   accessed, in addition to opening and reading the file.
> * Each file access will involve therefore four inode accesses
>   and four filesystem block accesses, probably rather widely
>   scattered. Depending on the size of the filesystem block and
>   whether the inode is contiguous to the body of the file this
>   can involve anything between 32KiB and 2KiB of logical IO per
>   file access.
> * It is likely that of the logical IOs those relating to the two
>   top levels (those comprising 8 and 400 directories) of the
>   subtree will be avoided by caching between 200KiB and 1.6MiB,
>   but the other two levels, the 20,000 bottom directories and
>   the 1,000,000 leaf files, won't likely be cached.
> [ ... ]

jeff> Finally, in answer to Linda's question, I don't foresee any
jeff> appends at all. The vast majority of files will be write
jeff> once, read many. [ ... ]

That sounds like a good use for an LDAP database, but using Berkeley
DB directly may be best. One could also do a FUSE module or a special
purpose NFS server that presents a Berkeley DB as a filesystem, but
then we would be getting rather close to ReiserFS.
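The megamake/megafetch pair described above can be approximated in a few lines. This sketch uses Python's stdlib dbm module as a stand-in for Berkeley DB (an assumption made for portability; Peter's actual scripts drove BDB from Perl), and the record counts are scaled down from his 1,000,000.

```python
import dbm, os, random, tempfile

def megamake(path, k, i, j):
    # Store k records of random length between i and j bytes,
    # analogous to the megamake.pl run quoted above.
    with dbm.open(path, "n") as db:
        for n in range(k):
            db[str(n)] = os.urandom(random.randint(i, j))

def megafetch(path, k, n):
    # n fetches at random key positions; returns the average record length.
    total = 0
    with dbm.open(path, "r") as db:
        for _ in range(n):
            total += len(db[str(random.randrange(k))])
    return total / n

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "db")
    megamake(path, 10000, 50, 100)
    avg = megafetch(path, 10000, 1000)
    print("average length: %.2f" % avg)
```

The point of the comparison survives the translation: one database file means one or two block accesses per record instead of the three or four directory/inode accesses the filesystem layout forces.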
* Re: tuning, many small files, small blocksize
From: Hannes Dorbath @ 2008-02-19 11:44 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux XFS

> That sounds like a good use for an LDAP database, but using
> Berkeley DB directly may be best. One could also do a FUSE module
> or a special purpose NFS server that presents a Berkeley DB as a
> filesystem, but then we would be getting rather close to ReiserFS.

During testing of HA clusters some time ago, I found BDB to always be
the first thing to break. It seems to have very poor recovery and
seems to cope with neither filesystem snapshots nor power failures.
Nevertheless it's claimed to be ACID-compliant. I don't know -- from
my experience I wouldn't even put my address book on it.

Personally I ended up doing this for OpenLDAP and never looked back:

http://www.samse.fr/GPL/ldap_pg/HOWTO/x12.html

--
Best regards,
Hannes Dorbath
* Re: tuning, many small files, small blocksize
From: Peter Grandi @ 2008-02-19 21:24 UTC (permalink / raw)
To: Linux XFS

>>> On Tue, 19 Feb 2008 12:44:57 +0100, Hannes Dorbath
>>> <light@theendofthetunnel.de> said:

[ ... a collection of millions of small records ... ]

>> That sounds like a good use for an LDAP database, but using
>> Berkeley DB directly may be best. One could also do a FUSE
>> module or a special purpose NFS server that presents a
>> Berkeley DB as a filesystem, but then we would be getting
>> rather close to ReiserFS.

light> During testing of HA clusters some time ago I found BDB
light> to always be the first thing to break. It seems to have
light> very poor recovery and seems to cope with neither
light> filesystem snapshots nor power failures. [ ... ]

Sometimes BDB had problems, but that seems in the past. It also relies
critically on some precise behaviour from the storage layer, filesystem
downwards:

http://WWW.Oracle.com/technology/documentation/berkeley-db/db/ref/transapp/reclimit.html

If all those conditions are not met, then it cannot do recovery.
Fortunately XFS can meet those conditions (I think also the page size
one), if properly configured and if the hardware does not lie.

light> Personally I ended up doing this for OpenLDAP and never
light> looked back:
light> http://www.samse.fr/GPL/ldap_pg/HOWTO/x12.html

Well, PostgreSQL is of course a much nicer, more scalable DBMS than
BDB. But for a relatively small, mostly-ro collection of small records
the latter may be appropriate. XFS works with it fairly well too.