* tuning, many small files, small blocksize
@ 2008-02-16 5:01 Jeff Breidenbach
2008-02-16 9:28 ` Hannes Dorbath
` (3 more replies)
0 siblings, 4 replies; 15+ messages in thread
From: Jeff Breidenbach @ 2008-02-16 5:01 UTC (permalink / raw)
To: xfs
I'm testing xfs for use in storing 100 million+ small files
(roughly 4 to 10KB each) and some directories will contain
tens of thousands of files. There will be a lot of random
reading, and also some random writing, and very little
deletion. The underlying disks use linux software RAID-1
managed by mdadm with 5X redundancy, i.e. 5 drives that
completely mirror each other.
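The 5-way mirror described above might be assembled with mdadm roughly as follows. This is only a sketch under stated assumptions: the device names /dev/sd[b-f] and the array name /dev/md0 are placeholders, not details from the thread, and the commands need root and real (empty) drives.

```shell
# Hypothetical 5-way RAID-1 mirror; all device names are placeholders.
mdadm --create /dev/md0 --level=1 --raid-devices=5 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
# md raid1 can serve reads from any of the five mirrors, which is
# where the read parallelism discussed later in the thread comes from.
cat /proc/mdstat   # watch the array build/resync
```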
I am setting up the xfs partition now, and have only played
with blocksize so far. 512 byte blocks are most space efficient,
1024 byte blocks cost 3.3% additional space, and 4096 byte
blocks cost 22.3% additional space. I do not know of a good
way to benchmark filesystem speed; iozone -s 5 did not provide
meaningful results due to poor timing quantization.
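When iozone's timing quantization is too coarse, one crude alternative is to time random whole-file reads directly. The sketch below is my own illustration, not part of the thread; the file counts and read counts are scaled far down from the real workload.

```python
import os, random, tempfile, time

def make_files(root, n, size_lo=4096, size_hi=10240):
    # Create n small files, comparable to the 4-10KB corpus described above.
    paths = []
    for i in range(n):
        p = os.path.join(root, "f%06d" % i)
        with open(p, "wb") as f:
            f.write(os.urandom(random.randint(size_lo, size_hi)))
        paths.append(p)
    return paths

def time_random_reads(paths, reads):
    # Wall-clock time for `reads` random whole-file reads.
    t0 = time.time()
    total = 0
    for p in random.choices(paths, k=reads):
        with open(p, "rb") as f:
            total += len(f.read())
    return time.time() - t0, total

with tempfile.TemporaryDirectory() as d:
    paths = make_files(d, 200)
    elapsed, nbytes = time_random_reads(paths, 500)
    print("%d bytes in %.3fs" % (nbytes, elapsed))
```

On a freshly written tree much of this will come from page cache; dropping caches between runs (or using a data set larger than RAM) is needed for seek-bound numbers.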
My questions are:
a) Should I just go with the 512 byte blocksize or is that going to be
bad for some performance reason? Going to 1024 is no problem,
but I'd prefer not to waste 20% of the partition capacity by using 4096.
b) Are there any other mkfs.xfs parameters that I should play with.
Thanks for any response; I did do quite some searching for
recommended tuning parameters, but did not find definitive answers.
The general consensus was xfs does pretty good tuning itself, but
almost none of the published benchmarks or recommendations go with
small blocksizes and I want to make sure I'm not about to do something
totally stupid. Like quadruple the number of seeks on the disk.
^ permalink raw reply [flat|nested] 15+ messages in thread

* Re: tuning, many small files, small blocksize
From: Hannes Dorbath @ 2008-02-16 9:28 UTC (permalink / raw)
To: Jeff Breidenbach; +Cc: xfs

Jeff Breidenbach wrote:
> The underlying disks use linux software RAID-1
> managed by mdadm with 5X redundancy, i.e. 5 drives that
> completely mirror each other.

That's maybe a bit paranoid, but on the other hand it should give good
parallelism.

> a) Should I just go with the 512 byte blocksize or is that going to be
> bad for some performance reason? Going to 1024 is no problem,
> but I'd prefer not to waste 20% of the partition capacity by using 4096.

I don't think there is a performance problem with 512 byte block size,
but it limits the internal log size to 32MB. You might want to use a
larger external log.

> b) Are there any other mkfs.xfs parameters that I should play with.

mkfs.xfs -n size=16k -i attr=2 -l lazy-count=1,version=2,size=32m \
  -b size=512 /dev/sda

mount -onoatime,logbufs=8,logbsize=256k /dev/sda /mnt/xfs

Requires kernel 2.6.23 and xfsprogs 2.9.5. As said, you might want to
use an external log device.

--
Best regards,
Hannes Dorbath
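For the external log Hannes mentions, the mkfs/mount lines might change along these lines. This is a hedged sketch: the choice of /dev/sdb1 as the log partition and the 128m log size are assumptions for illustration, not details from the thread.

```shell
# Hypothetical variant with an external log device; /dev/sda (data)
# and /dev/sdb1 (dedicated log partition) are placeholders.
mkfs.xfs -n size=16k -i attr=2 \
    -l logdev=/dev/sdb1,lazy-count=1,version=2,size=128m \
    -b size=512 /dev/sda

# The log device must also be named at mount time.
mount -o noatime,logbufs=8,logbsize=256k,logdev=/dev/sdb1 /dev/sda /mnt/xfs
```

Putting the log on a separate spindle keeps journal writes from competing with the small-file data I/O.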
* Re: tuning, many small files, small blocksize
From: Jeff Breidenbach @ 2008-02-16 10:24 UTC (permalink / raw)
To: Hannes Dorbath; +Cc: xfs

> That's maybe a bit paranoid, but on the other hand it should give good
> parallelism.

Yes, the goal is fast read performance for small files.

> mkfs.xfs -n size=16k -i attr=2 -l lazy-count=1,version=2,size=32m \
>   -b size=512 /dev/sda
>
> mount -onoatime,logbufs=8,logbsize=256k /dev/sda /mnt/xfs

This is highly appreciated, thank you very much.

> Requires kernel 2.6.23 and xfsprogs 2.9.5. As said, you might want to
> use an external log device.

I'm running a vendor-supplied 2.6.22 kernel, and a quick test shows the
unsupported feature is lazy-count. How big a deal is it? Upgrading the
kernel before April is painful but I'll do it if important. Presumably
there's no simple way to migrate a non-lazy xfs filesystem to a lazy one.

PS. I don't know if this affects any parameters, but the biggest
directory will have approximately 1.5 million files. There are a few in
the one to two hundred thousand range, and then very many in the tens
of thousands.
* Re: tuning, many small files, small blocksize
From: Jeff Breidenbach @ 2008-02-16 20:30 UTC (permalink / raw)
To: Hannes Dorbath; +Cc: xfs

> Upgrading the kernel before April is painful but I'll do it if important.

Ah, wasn't that painful after all. Thanks again.
* Re: tuning, many small files, small blocksize
From: Timothy Shimmin @ 2008-02-19 0:48 UTC (permalink / raw)
To: Jeff Breidenbach; +Cc: Hannes Dorbath, xfs

Jeff Breidenbach wrote:
> Upgrading the kernel before April is painful but I'll do it if
> important. Presumably there's no simple way to migrate a non-lazy xfs
> filesystem to a lazy one.

Dave would know that answer. But on IRIX we used xfs_chver...

> Using the +c option one can enable a filesystem to use lazy counters.
> Note that you must run xfs_repair(1M) after setting this option to build
> the internal state that is required to support this functionality.
>
> Using the -c option one can disable lazy counters if it is enabled. Note
> that you must run xfs_repair(1M) after clearing this option to ensure
> that the internal state of the filesystem is consistent.

On Linux we have xfs_admin (a wrapper around xfs_db) and the "version"
command in xfs_db. However, it looks like "version" only does extflg,
v2-logs, attr1, and attr2, but not lazy sb counters.

--Tim
* Re: tuning, many small files, small blocksize
From: pg_xfs2 @ 2008-02-16 12:23 UTC (permalink / raw)
To: Linux XFS

>>> On Fri, 15 Feb 2008 21:01:10 -0800, "Jeff Breidenbach"
>>> <jeff@jab.org> said:

jeff> I'm testing xfs for use in storing 100 million+ small
jeff> files (roughly 4 to 10KB each) and some directories will
jeff> contain tens of thousands of files. There will be a lot of
jeff> random reading, and also some random writing, and very
jeff> little deletion. The underlying disks use linux software
jeff> RAID-1 managed by mdadm with 5X redundancy, i.e. 5 drives
jeff> that completely mirror each other.

Reading this was quite entertaining :-).

jeff> [ ... ] The general consensus was xfs does pretty good
jeff> tuning itself, but almost none of the published benchmarks
jeff> or recommendations go with small blocksizes and I want to
jeff> make sure I'm not about to do something totally stupid. [ ... ]

Makes me wonder why silly people come up with pointless stuff like this :-)

http://WWW.Oracle.com/database/berkeley-db.html
* Re: tuning, many small files, small blocksize
From: David Chinner @ 2008-02-18 22:53 UTC (permalink / raw)
To: Jeff Breidenbach; +Cc: xfs

On Fri, Feb 15, 2008 at 09:01:10PM -0800, Jeff Breidenbach wrote:
> I'm testing xfs for use in storing 100 million+ small files
> (roughly 4 to 10KB each) and some directories will contain
> tens of thousands of files. There will be a lot of random
> reading, and also some random writing, and very little
> deletion.
.....
> a) Should I just go with the 512 byte blocksize or is that going to be
> bad for some performance reason? Going to 1024 is no problem,
> but I'd prefer not to waste 20% of the partition capacity by using 4096.

I'd suggest wasting 20% of disk space and staying with 4k block size.

> b) Are there any other mkfs.xfs parameters that I should play with.

Large directory block size (-n size=XXX), esp. if you are putting
thousands of files per directory....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: tuning, many small files, small blocksize
From: Linda Walsh @ 2008-02-18 23:12 UTC (permalink / raw)
To: Jeff Breidenbach; +Cc: xfs

Jeff Breidenbach wrote:
> I'm testing xfs for use in storing 100 million+ small files
> (roughly 4 to 10KB each) and some directories will contain
> tens of thousands of files. There will be a lot of random
> reading, and also some random writing, and very little
> deletion.
>
> I am setting up the xfs partition now, and have only played
> with blocksize so far. 512 byte blocks are most space efficient,
> 1024 byte blocks cost 3.3% additional space, and 4096 byte
> blocks cost 22.3% additional space.
>
> a) Should I just go with the 512 byte blocksize or is that going to be
> bad for some performance reason? Going to 1024 is no problem,
>
> b) Are there any other mkfs.xfs parameters that I should play with.

If your minimum file size is 4KB and max is 10KB, a blocksize of 2K
might give you a reasonable compaction level.

Might also play with the inode size. I *usually* go with 1K-inode+4k block,
but with a 2k block, I'm slightly torn between 512-byte inodes and 1K
inodes, but I can't think of a _great_ reason to ever go with the default
256-byte inode size, since that size seems like it will always cause
the inode to be shared with another, possibly unrelated file.

Remember, in xfs, if the last bit of left-over data in an inode will fit
into the inode, it can save a block-allocation, though I don't know
how this will affect speed. Space-wise, a 2k block size and 1k-inode
size might be good, but I don't know how that would affect performance.

My concern about 512-byte blocks is that, in general (I don't recall
having used such a small block size on xfs), smaller blocks can lead to
greater fragmentation, though xfs is better than the average 'fs' in
laying out files. While 'xfs_fsr' is good about keeping files linear,
it didn't used to work on directories -- and if you have 10's of
thousands of files/directory, that might trigger some more directory
fragmentation. Dunno. After you write the many small files, will you
be appending to them?

As for benchmarks, there's always the standard 'bonnie' and 'bonnie++'.
I don't know how they compare to iozone though -- I'm not familiar with
that benchmark.

I'm sure you are familiar with mount options noatime,nodiratime -- same
concepts, but dir's are split out. Someone else mentioned using
logbsize=256k in the mount options. My manpages may be dated, but they
claim that valid sizes are 16k and 32k.

Also, it depends on the situation, but sometimes flattening out the
directory structure can speed up lookup time.

Sometime back someone did some benchmarks involving log size, and it
seemed that 32768b (4k) or ~128Meg was optimal, if memory serves me
correctly.

Good luck...
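The block-size space trade-off discussed here is simple arithmetic: each file is rounded up to a whole number of blocks. The sketch below is my own rough model; it ignores inodes, directory blocks, and metadata, so it will not reproduce Jeff's exact 3.3%/22.3% measurements, but it shows how overhead grows with block size for a 4-10KB corpus.

```python
import random

def overhead(file_sizes, block):
    # Allocated bytes vs payload bytes for a given fs block size,
    # ignoring inodes and metadata -- a rough model only.
    used = sum(((s + block - 1) // block) * block for s in file_sizes)
    data = sum(file_sizes)
    return used / data - 1.0

random.seed(0)
# Model the corpus from the thread: files roughly 4-10KB, uniform.
sizes = [random.randint(4096, 10240) for _ in range(100000)]
for b in (512, 1024, 2048, 4096):
    print("block %4d: %5.1f%% overhead" % (b, 100 * overhead(sizes, b)))
```

The uniform size distribution here is an assumption; Jeff's real distribution evidently wastes less at 4k than this model predicts.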
* Re: tuning, many small files, small blocksize
From: David Chinner @ 2008-02-18 23:51 UTC (permalink / raw)
To: Linda Walsh; +Cc: Jeff Breidenbach, xfs

On Mon, Feb 18, 2008 at 03:12:44PM -0800, Linda Walsh wrote:
> Might also play with the inode size. I *usually* go with 1K-inode+4k block,
> but with a 2k block, I'm slightly torn between 512-byte inodes and 1K
> inodes, but I can't think of a _great_ reason to ever go with the default
> 256-byte inode size, since that size seems like it will always cause
> the inode to be shared with another, possibly unrelated file.

That makes no sense. Inodes are *unique* - they are not shared with
any other inode at all. Could you explain why you think that 256
byte inodes are any different to larger inodes in this respect?

> Remember, in xfs, if the last bit of left-over data in an inode will fit
> into the inode, it can save a block-allocation, though I don't know
> how this will affect speed.

No, that's wrong. We never put data in inodes.

> Space-wise, a 2k block size and 1k-inode size might be good, but don't
> know how that would affect performance.

Inode size vs block size is pretty much irrelevant w.r.t performance,
except for the fact inode size can't be larger than the block size.

> I'm sure you are familiar with mount options noatime,nodiratime -- same
> concepts, but dir's are split out.

noatime implies nodiratime.

> Also, it depends on the situation, but sometimes flattening out the
> directory structure can speed up lookup time.

Like using large directory block sizes to make large directory
btrees wider and flatter and therefore use less seeks for any given
random directory lookup? ;)

> Sometime back someone did some benchmarks involving log size and it seemed
> that 32768b (4k) or ~128Meg was optimal, if memory serves me correctly.

128MB is the maximum log size currently.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: tuning, many small files, small blocksize
From: Linda Walsh @ 2008-02-19 1:03 UTC (permalink / raw)
To: David Chinner; +Cc: Jeff Breidenbach, xfs

David Chinner wrote:
> That makes no sense. Inodes are *unique* - they are not shared with
> any other inode at all. Could you explain why you think that 256
> byte inodes are any different to larger inodes in this respect?
---
Sorry to be unclear, but it would seem to me that if the minimum
physical blocksize on disk is 512 bytes, then either a 256 byte inode
will share that block with another inode, or you are wasting 256 bytes
on each inode. The latter interpretation doesn't make logical sense.

If the minimum physical I/O size is larger than 512 bytes, then I
would assume even more, *unique*, inodes could be packed in per block.

>> Remember, in xfs, if the last bit of left-over data in an inode will fit
>> into the inode, it can save a block-allocation, though I don't know
>> how this will affect speed.
>
> No, that's wrong. We never put data in inodes.
---
You mean file data, no? Doesn't directory and link data get packed in?
It always gnawed at me why packing small bits of data into inodes was
disallowed for file data, but not for other types of data. How about
extended attribute data? Is it always allocated in separate data
blocks as well, or can it fit into an inode? Why not include file data
as a type of data that could be packed into an inode? I'm sure there's
a good reason, but it seems other types of file system data can be
packed into inodes -- just not file data... or am I really
misinformed? :-)

>> Space-wise, a 2k block size and 1k-inode size might be good, but don't
>> know how that would affect performance.
>
> Inode size vs block size is pretty much irrelevant w.r.t performance,
> except for the fact inode size can't be larger than the block size.
---
If you have a small directory, can't it be stored in the inode?
Wouldn't that save some bit (or block) of I/O?

>> I'm sure you are familiar with mount options noatime,nodiratime -- same
>> concepts, but dir's are split out.
>
> noatime implies nodiratime.
---
Well dang... thanks! Ever since the nodiratime option came out, I
thought I had to specify it in addition. Now my fstabs can be shorter!

>> Also, it depends on the situation, but sometimes flattening out the
>> directory structure can speed up lookup time.
>
> Like using large directory block sizes to make large directory
> btrees wider and flatter and therefore use less seeks for any given
> random directory lookup? ;)
---
Are you saying that directory entries are stored in sorted order in a
B-Tree? Hmmm... Well, I did say it depended on the situation -- you
are right that time lost to seeks might overshadow time lost to the
number of blocks read in. I'd think it might depend on how the
directories are laid out on disk, but in benchmarks I've noticed
larger slowdowns when using more files/dir than when distributing the
same number of files among more dirs. It could have been something
about my test setup, though -- I did not test with varying directory
block sizes. Either I overlooked the naming option size param or was
limited to version=1 for some reason (don't remember when version=2
was added...)

>> Sometime back someone did some benchmarks involving log size and it seemed
>> that 32768b (4k) or ~128Meg was optimal, if memory serves me correctly.
>
> 128MB is the maximum size currently.
---
Maybe that's why it's optimal? :-)

Thanks for the corrections... I appreciate it!
-l
* Re: tuning, many small files, small blocksize
From: David Chinner @ 2008-02-19 2:49 UTC (permalink / raw)
To: Linda Walsh; +Cc: David Chinner, Jeff Breidenbach, xfs

On Mon, Feb 18, 2008 at 05:03:57PM -0800, Linda Walsh wrote:
> David Chinner wrote:
> > That makes no sense. Inodes are *unique* - they are not shared with
> > any other inode at all. Could you explain why you think that 256
> > byte inodes are any different to larger inodes in this respect?
>
> Sorry to be unclear, but it would seem to me that if the
> minimum physical blocksize on disk is 512 bytes, then either a 256
> byte inode will share that block with another inode, or you are
> wasting 256 bytes on each inode. The latter interpretation doesn't
> make logical sense.
>
> If the minimum physical I/O size is larger than 512 bytes,
> then I would assume even more, *unique*, inodes could be packed
> in per block.

Inode I/O in XFS is done in *8k clusters* regardless of inode size.
We _never_ do single inode I/O, and hence your logic completely breaks
down at that point ;). Inode clustering is a substantial performance
optimisation that speeds up inode writeback a great deal. e.g. I broke
clustering recently and we got bug reports about how slow XFS had
become after an upgrade to the latest -rcX kernel....

FWIW, while inodes are still operated on completely independently,
there are only occasional problems where inode write I/O conflicts.
e.g.:

http://oss.sgi.com/archives/xfs/2008-02/msg00137.html

> > No, that's wrong. We never put data in inodes.
>
> You mean file data, no?

Right. data = file data.

> Doesn't directory and link data get packed in?

That's metadata, and metadata != file data. Metadata gets packed into
the inode.

> It always gnawed at me, as to why inode's packing
> in small bits of data was disallowed for file data, but not
> other types of data.

We disallow file data and allow metadata in the inode because metadata
is journalled and data is not, and mixing the two introduces extremely
nasty consistency corner cases into crash recovery. It can be done,
but it's tricky and complex.

> How about extended attribute data?

That's metadata.

> > Like using large directory block sizes to make large directory
> > btrees wider and flatter and therefore use less seeks for any given
> > random directory lookup? ;)
>
> Are you saying that directory entries are stored in sorted
> order in a B-Tree? Hmmm...

Put simply, yes. In more detail:

http://oss.sgi.com/projects/xfs/publications/papers/xfs_filesystem_structure.pdf

(grrrr - the link is currently broken - bug raised, will work soon)

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: tuning, many small files, small blocksize
From: Jeff Breidenbach @ 2008-02-19 4:58 UTC (permalink / raw)
To: David Chinner; +Cc: Linda Walsh, xfs

Wow, this is quite some discussion. I went with Hannes Dorbath's
original suggestion (appended), and am now several days into copying
data onto the filesystem. It's conceivable to change course at this
point, but awkward. Dave, you suggested biting the bullet and
sacrificing capacity. Are we talking an overwhelming difference in
read performance for random files - e.g. reducing the number of seeks
by 2X?

Finally, in answer to Linda's question, I don't foresee any appends at
all. The vast majority of files will be write once, read many. A small
fraction will be re-written, e.g. new contents, same filename. An
utterly insignificant fraction will be deleted.

> mkfs.xfs -n size=16k -i attr=2 -l lazy-count=1,version=2,size=32m \
>   -b size=512 /dev/sda
>
> mount -onoatime,logbufs=8,logbsize=256k /dev/sda /mnt/xfs
* Re: tuning, many small files, small blocksize
From: Peter Grandi @ 2008-02-19 8:27 UTC (permalink / raw)
To: Linux XFS

>>> On Mon, 18 Feb 2008 20:58:56 -0800, "Jeff Breidenbach"
>>> <jeff@jab.org> said:

jeff> I'm testing xfs for use in storing 100 million+ small
jeff> files (roughly 4 to 10KB each) and some directories will
jeff> contain tens of thousands of files. There will be a lot
jeff> of random reading, and also some random writing, and
jeff> very little deletion. [ ... ]

jeff> [ ... ] am now several days into copying data onto the
jeff> filesystem. [ ... ]

I have found again an exchange about a similarly absurd (but 100 times
smaller) setup, and here are two relevant extracts:

>> I have a little script, the job of which is to create a lot
>> of very small files (~1 million files, typically ~50-100bytes
>> each). [ ... ] It's a bit of a one-off (or twice, maybe)
>> script, and currently due to finish in about 15 hours, hence
>> why I don't want to spend too much effort on rebuilding the
>> box. [ ... ]

> [ ... ] First, I have appended two little Perl scripts (each
> rather small), one creates a Berkeley DB database of K
> records of random length varying between I and J bytes, the
> second does N accesses at random in that database. I have a
> 1.6GHz Athlon XP with 512MB of memory, and a relatively
> standard 80GB disc 7200RPM. The database is being created on
> a 70% full 8GB JFS filesystem which has been somewhat
> recently created:
> ----------------------------------------------------------------
> $ time perl megamake.pl /var/tmp/db 1000000 50 100
> real    6m28.947s
> user    0m35.860s
> sys     0m45.530s
> ----------------------------------------------------------------
> $ ls -sd /var/tmp/db*
> 130604 /var/tmp/db
> ----------------------------------------------------------------
> Now after an interval, but without cold start (for good
> reasons), 100,000 random fetches:
> ----------------------------------------------------------------
> $ time perl megafetch.pl /var/tmp/db 1000000 100000
> average length: 75.00628
> real    3m3.491s
> user    0m2.870s
> sys     0m2.800s
> ----------------------------------------------------------------
> So, we got 130MiB of disc space used in a single file, >2500
> records sustained per second inserted over 6 minutes and a half,
> 500 records per second sustained over 3 minutes. [ ... ]

So it is less than 400 seconds instead of 15 hours and counting. Those
are the numbers; here are some comments as to what explains the vast
difference (2 orders of magnitude):

> * With 1,000,000 files and a fanout of 50, we need 20,000
>   directories above them, 400 above those and 8 above those.
>   So 3 directory opens/reads every time a file has to be
>   accessed, in addition to opening and reading the file.
> * Each file access will involve therefore four inode accesses
>   and four filesystem block accesses, probably rather widely
>   scattered. Depending on the size of the filesystem block and
>   whether the inode is contiguous to the body of the file this
>   can involve anything between 32KiB and 2KiB of logical IO per
>   file access.
> * It is likely that of the logical IOs those relating to the two
>   top levels (those comprising 8 and 400 directories) of the
>   subtree will be avoided by caching between 200KiB and 1.6MiB,
>   but the other two levels, the 20,000 bottom directories and
>   the 1,000,000 leaf files, won't likely be cached.
> [ ... ]

jeff> Finally, in answer to Linda's question, I don't foresee any
jeff> appends at all. The vast majority of files will be write
jeff> once, read many. [ ... ]

That sounds like a good use for an LDAP database, but using Berkeley
DB directly may be best. One could also do a FUSE module or a special
purpose NFS server that presents a Berkeley DB as a filesystem, but
then we would be getting rather close to ReiserFS.
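The megamake/megafetch pair described above can be approximated in a few lines. This sketch uses Python's stdlib dbm module as a stand-in for Berkeley DB (an assumption made for portability; Peter's actual scripts drove BDB from Perl), and the record counts are scaled down from his 1,000,000.

```python
import dbm, os, random, tempfile

def megamake(path, k, i, j):
    # Store k records of random length between i and j bytes,
    # analogous to the megamake.pl run quoted above.
    with dbm.open(path, "n") as db:
        for n in range(k):
            db[str(n)] = os.urandom(random.randint(i, j))

def megafetch(path, k, n):
    # n fetches at random key positions; returns the average record length.
    total = 0
    with dbm.open(path, "r") as db:
        for _ in range(n):
            total += len(db[str(random.randrange(k))])
    return total / n

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "db")
    megamake(path, 10000, 50, 100)
    avg = megafetch(path, 10000, 1000)
    print("average length: %.2f" % avg)
```

The point of the comparison survives the translation: one database file means one or two block accesses per record instead of the three or four directory/inode accesses the filesystem layout forces.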
* Re: tuning, many small files, small blocksize
From: Hannes Dorbath @ 2008-02-19 11:44 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux XFS

> That sounds like a good use for an LDAP database, but using
> Berkeley DB directly may be best. One could also do a FUSE module
> or a special purpose NFS server that presents a Berkeley DB as a
> filesystem, but then we would be getting rather close to ReiserFS.

During testing of HA clusters some time ago, I found BDB to always be
the first thing to break. It seems to have very poor recovery and
seems to cope with neither filesystem snapshots nor power failures.
Nevertheless it's claimed to be ACID-compliant. I don't know -- from
my experience I wouldn't even put my address book on it.

Personally I ended up doing this for OpenLDAP and never looked back:

http://www.samse.fr/GPL/ldap_pg/HOWTO/x12.html

--
Best regards,
Hannes Dorbath
* Re: tuning, many small files, small blocksize
From: Peter Grandi @ 2008-02-19 21:24 UTC (permalink / raw)
To: Linux XFS

>>> On Tue, 19 Feb 2008 12:44:57 +0100, Hannes Dorbath
>>> <light@theendofthetunnel.de> said:

[ ... a collection of millions of small records ... ]

>> That sounds like a good use for an LDAP database, but using
>> Berkeley DB directly may be best. One could also do a FUSE
>> module or a special purpose NFS server that presents a
>> Berkeley DB as a filesystem, but then we would be getting
>> rather close to ReiserFS.

light> During testing of HA clusters some time ago I found BDB
light> to always be the first thing to break. It seems to have
light> very poor recovery and seems to cope with neither
light> filesystem snapshots nor power failures. [ ... ]

Sometimes BDB had problems, but that seems in the past. It also relies
critically on some precise behaviour from the storage layer, filesystem
downwards:

http://WWW.Oracle.com/technology/documentation/berkeley-db/db/ref/transapp/reclimit.html

If all those conditions are not met, then it cannot do recovery.
Fortunately XFS can meet those conditions (I think also the page size
one), if properly configured and if the hardware does not lie.

light> Personally I ended up doing this for OpenLDAP and never
light> looked back:
light> http://www.samse.fr/GPL/ldap_pg/HOWTO/x12.html

Well, PostgreSQL is of course a much nicer, more scalable DBMS than
BDB. But for a relatively small, mostly-ro collection of small records
the latter may be appropriate. XFS works with it fairly well too.