* realtime partition support?

From: Phil Karn @ 2011-01-07 14:36 UTC
To: xfs

What's the status of the realtime partition feature in XFS? I think I
read somewhere that it wasn't actually implemented and/or working in the
Linux XFS implementation, but I'm not sure. If it is in Linux, how well
tested is it?

It occurred to me that the XFS realtime feature might be a quick and
easy way to make a hybrid of an SSD and a rotating drive: just create an
XFS file system on the SSD that specifies the rotating drive as its
realtime partition. This would put all the metadata on the SSD, where it
can be quickly accessed at random. Throughput on large files would be
almost as fast as if everything were on the SSD. Small files wouldn't be
as fast, but still much faster than with no SSD at all.

Phil

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
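[For concreteness, the layout Phil describes would be built with the
realtime-device options in xfsprogs. This is a hedged sketch only — the
device names are placeholders, and the flag letters should be verified
against your version of xfs_io(8):]

```shell
# Hypothetical devices: /dev/sda1 is the SSD (metadata, log, btrees);
# /dev/sdb1 is the rotating drive used as the realtime subvolume.
mkfs.xfs -r rtdev=/dev/sdb1 /dev/sda1

# The realtime device must also be named at mount time:
mount -o rtdev=/dev/sdb1 /dev/sda1 /mnt

# File data only lands on the rt subvolume if the file carries the
# realtime bit; setting the rt-inherit flag on a directory makes new
# files created below it inherit that bit:
xfs_io -c 'chattr +t' /mnt
```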
* Re: realtime partition support?

From: Dave Chinner @ 2011-01-08  2:17 UTC
To: karn; +Cc: xfs

On Fri, Jan 07, 2011 at 06:36:16AM -0800, Phil Karn wrote:
> What's the status of the realtime partition feature in XFS? I think I
> read somewhere that it wasn't actually implemented and/or working in
> the Linux XFS implementation, but I'm not sure. If it is in Linux, how
> well tested is it?

Experimental. It is implemented in Linux and mostly just works, but it
is largely untested and not really recommended for any sort of
production use.

> It occurred to me that the XFS realtime feature might be a quick and
> easy way to make a hybrid of an SSD and a rotating drive. Just create
> an XFS file system on the SSD that specifies the rotating drive as
> its realtime partition. This would put all the metadata on the SSD
> where it can be quickly accessed at random.

That has a couple of drawbacks: realtime device extent allocation is
single threaded, and it's not designed as a general purpose allocator.

> Throughput on large files would be almost as fast as if everything
> were on the SSD.

Not at all. The data is still written to the rotating disk, so the
presence of the SSD won't change throughput rates at all. In fact, the
rt device is not aimed at maximising throughput - it was designed for
deterministic performance for low-latency multiple-stream access
patterns - so it will probably give lower throughput than just using
the rotating drive alone....

> Small files wouldn't be as fast, but still much faster than
> with no SSD at all.

I'd also expect it to be much, much slower than just using the rotating
disk for the standard data device - the SSD will make no difference, as
metadata IO is not the limiting factor. Further, the rt allocator is
simply not designed to handle lots of small files efficiently, so it
will trigger many more seeks for small-file data IO than the standard
allocator (and hence be slower), because the standard allocator packs
small files tightly together...

It's a nice idea, but it doesn't really work out in practice with the
current XFS structure.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: realtime partition support?

From: Phil Karn @ 2011-01-08  3:59 UTC
To: Dave Chinner; +Cc: karn, xfs

On 1/7/11 6:17 PM, Dave Chinner wrote:
> Experimental, implemented in linux and mostly just works, but is
> largely untested and not really recommended for any sort of
> production use.

Thanks.

>> Throughput on large files would be almost as fast as if everything
>> were on the SSD.
>
> Not at all. The data is still written to the rotating disk, so
> the presence of the SSD won't change throughput rates at all.

My point is that the big win of the SSD comes from its lack of
rotational and seek latency. SSDs really shine on random small reads.
On large sequential reads and writes, modern rotating disks can shovel
data almost as quickly as an SSD once the head is in the right place
and the data has started flowing. But the head can't get there until
the drive has walked the directory tree, read the inode for the file in
question, and finally sought to the file's first extent. If all that
metadata resided on the SSD, the conventional drive could get to that
first extent that much more quickly.

Yeah, the SSD is typically still faster than a rotating drive on
sequential reads -- but only by a factor of 2:1, not by dozens or
hundreds of times. On sequential writes, many rotating drives are
actually faster than many SSDs.

> I'd also expect it to be much, much slower than just using the
> rotating disk for the standard data device - the SSD will make no
> difference as metadata IO is not the limiting factor.

No? I'm having a very hard time getting XFS on rotating SATA drives to
come close to Reiser or ext4 when extracting a large tarball (e.g., the
Linux source tree) or when doing rm -rf. I've improved it by playing
with logbsize and logbufs, but Reiser is still much faster, especially
at rm -rf.

The only way I've managed to get XFS close is by essentially disabling
journaling altogether, which I don't want to do. I've tried building
XFS with an external journal and giving it a loopback device connected
to a file in /tmp. Then it's plenty fast. But unsafe. As I understand
it, the problem is all that seeking to the internal journal. I'd like
to try putting the journal on an SSD partition, but I can't figure out
how to do that with an existing XFS file system without rebuilding it.

Turning off the write barrier also speeds things up considerably, but
that also makes me nervous. My system doesn't have a RAID controller
with a nonvolatile cache, but it is plugged into a UPS (actually a
large solar power system with a battery bank), so unexpected loss of
power is unlikely. Can I safely turn off the barrier?

If I correctly understand how drive write caching works, then even a
kernel panic shouldn't keep data that's already been sent to the drive
from being written out to the media. Only a power failure could do
that, or possibly the host resetting the drive. After a kernel panic
the BIOS will eventually reset all the hardware, but that won't happen
for some time after the panic.
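[For reference, an external XFS journal is specified at mkfs time and
again at every mount; there is no supported in-place way to relocate an
existing internal log, so the filesystem has to be recreated (dump,
re-mkfs, restore). A hedged sketch with placeholder devices:]

```shell
# /dev/md0 is the data device; /dev/sdc1 is a small SSD partition
# intended for the log. WARNING: mkfs.xfs destroys existing data --
# back up (e.g. with xfsdump) before recreating the filesystem.
mkfs.xfs -l logdev=/dev/sdc1,size=128m /dev/md0

# The log device must be supplied again on every mount:
mount -o logdev=/dev/sdc1 /dev/md0 /big
```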
* Re: realtime partition support?

From: Stan Hoeppner @ 2011-01-08  6:29 UTC
To: xfs

Phil Karn put forth on 1/7/2011 9:59 PM:

> No? I'm having a very hard time getting XFS on rotating SATA drives to
> come close to Reiser or ext4 when extracting a large tarball (e.g.,
> the Linux source tree) or when doing rm -rf.

This is because you're not using Dave's delayed logging patch, and
you've not been reading this list for many months, as it's been
discussed in detail many times. See:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs-delayed-logging-design.txt;h=96d0df28bed323d5596fc051b0ffb96ed8e3c8df;hb=HEAD

Dave Chinner put forth on 3/14/2010 11:30 PM:

> The following results are from a synthetic test designed to show
> just the impact of delayed logging on the amount of metadata
> written to the log.
>
> load: Sequential create 100k zero-length files in a directory per
> thread, no fsync between create and unlink.
> (./fs_mark -S0 -n 100000 -s 0 -d ....)
>
> measurement: via PCP. XFS specific metrics:
>
>     xfs.log.blocks
>     xfs.log.writes
>     xfs.log.noiclogs
>     xfs.log.force
>     xfs.transactions.*
>     xfs.dir_ops.create
>     xfs.dir_ops.remove
>
> machine:
>
>     2GHz dual-core Opteron, 3GB RAM
>     single 36GB 15krpm SCSI drive w/ CTQ depth=32
>     mkfs.xfs -f -l size=128m /dev/sdb2
>
> Current code:
>
> mount -o "logbsize=262144" /dev/sdb2 /mnt/scratch
>
> threads:  fs_mark      CPU    create log   unlink log
>           throughput          bandwidth    bandwidth
> 1         2900/s       75%    34MB/s       34MB/s
> 2         2850/s       75%    33MB/s       33MB/s
> 4         2800/s       80%    30MB/s       30MB/s
>
> Delayed logging:
>
> mount -o "delaylog,logbsize=262144" /dev/sdb2 /mnt/scratch
>
> threads:  fs_mark      CPU    create log   unlink log
>           throughput          bandwidth    bandwidth
> 1         4300/s       110%   1.5MB/s      <1MB/s
> 2         7900/s       195%   <4MB/s       <1MB/s
> 4         7500/s       200%   <5MB/s       <1.5MB/s
>
> I think it pretty clear that the design goal of "an order of
> magnitude less log IO bandwidth" is being met here. Scalability is
> looking promising, but a 2p machine is not large enough to make any
> definitive statements about that. Hence from these results the
> implementation is at or exceeding design levels.

The above results were with very young code. I'm guessing the current
code in the tree probably has somewhat better performance. Nonetheless,
the above results are impressive, and they put XFS on par with any
other FS WRT metadata-write-heavy workloads. Your "rm -rf" operation
will be _significantly_ faster, likely by a factor of 2x or better,
with the delayed logging option enabled, and will be limited mainly or
only by the speed of your CPU/memory subsystem. Untarring a kernel
should yield a similar, but somewhat smaller, increase, as you'll be
creating ~2300 directories and ~50,000 files (non-empty ones, unlike
the benchmark's zero-length files). With a modern AMD/Intel platform
and a CPU of ~3GHz clock speed, XFS metadata ops with delayed logging
enabled should absolutely scream, especially so with multicore CPUs and
parallel/concurrent metadata-write-heavy processes/threads.

I can't remember any more recent test results from Dave, although I may
simply have missed reading those emails, if they were sent. Even if the
current code isn't any faster than that used for the tests above, the
metadata write performance increase is still phenomenal. Again, nice
work Dave. :) AFAIK, you've eliminated the one 'legit' performance
gripe Linux folks have traditionally leveled at XFS WRT use as a
general purpose server/workstation filesystem. Now they have no excuse
not to use it. :)

I'd love to see a full-up Linux FS performance comparison article after
2.6.39 rolls out and delaylog is the default mount option. I don't have
the necessary hardware to do such a piece, or I gladly would.

-- 
Stan
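[Delayed logging is enabled with the mount options quoted above, e.g.
`mount -o delaylog,logbsize=262144 /dev/sdb2 /mnt/scratch`. For readers
without fs_mark, here is a crude, hedged stand-in for the create/unlink
load in Dave's test — single-threaded, and the file count and temp
directory are arbitrary; point `dir` at the filesystem under test:]

```shell
#!/bin/sh
# Create many zero-length files in one directory, then remove them,
# timing each phase (a toy version of: ./fs_mark -S0 -n 100000 -s 0).
dir=$(mktemp -d)   # in practice, a directory on the XFS filesystem
n=1000             # fs_mark used 100k per thread; kept small here

start=$(date +%s)
i=1
while [ "$i" -le "$n" ]; do
    : > "$dir/file$i"          # create a zero-length file
    i=$((i + 1))
done
create_secs=$(( $(date +%s) - start ))

start=$(date +%s)
rm -rf "$dir"                  # the unlink phase
remove_secs=$(( $(date +%s) - start ))

echo "created $n files in ${create_secs}s, removed in ${remove_secs}s"
```

Comparing the wall times with and without `-o delaylog` (remember to
unmount and remount between runs) gives a rough feel for the effect on
a given disk.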
* Re: realtime partition support?

From: Phil Karn @ 2011-01-08 14:42 UTC
To: Stan Hoeppner; +Cc: xfs

On 1/7/11 10:29 PM, Stan Hoeppner wrote:
> Phil Karn put forth on 1/7/2011 9:59 PM:
>
>> No? I'm having a very hard time getting XFS on rotating SATA drives
>> to come close to Reiser or ext4 when extracting a large tarball
>> (e.g., the Linux source tree) or when doing rm -rf.
>
> This is because you're not using Dave's delayed logging patch, and
> you've not been reading this list for many months, as it's been
> discussed in detail many times. See:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs-delayed-logging-design.txt;h=96d0df28bed323d5596fc051b0ffb96ed8e3c8df;hb=HEAD

Yes, I am new to the list, and while I did pull down a year of the
archives, I certainly haven't read them all. Thanks to the file pointer
and your explanation, I now have a pretty good idea of what's going on.

Phil
* Re: realtime partition support?

From: Dave Chinner @ 2011-01-10  0:35 UTC
To: karn; +Cc: xfs

On Fri, Jan 07, 2011 at 07:59:27PM -0800, Phil Karn wrote:
> On 1/7/11 6:17 PM, Dave Chinner wrote:
> >> Throughput on large files would be almost as fast as if everything
> >> were on the SSD.
> >
> > Not at all. The data is still written to the rotating disk, so
> > the presence of the SSD won't change throughput rates at all.
>
> My point is that the big win of the SSD comes from its lack of
> rotational and seek latency. They really shine on random small reads.
> On large sequential reads and writes, modern rotating disks can
> shovel data almost as quickly as an SSD once the head is in the right
> place and the data has started flowing. But it can't get there until
> it has walked the directory tree, read the inode for the file in
> question and finally sought to the file's first extent.

Which often does not require IO, because the path and inodes are cached
in memory.

> If all that meta information resided
> on SSD, the conventional drive could get to that first extent that
> much more quickly.

Not that much more quickly, because XFS uses readahead to hide a lot of
the directory traversal IO latency when it is not cached....

> > I'd also expect it to be much, much slower than just using the
> > rotating disk for the standard data device - the SSD will make no
> > difference as metadata IO is not the limiting factor.
>
> No? I'm having a very hard time getting XFS on rotating SATA drives
> to come close to Reiser or ext4 when extracting a large tarball
> (e.g., the Linux source tree) or when doing rm -rf. I've improved it
> by playing with logbsize and logbufs but Reiser is still much faster,
> especially at rm -rf. The only way I've managed to get XFS close is
> by essentially disabling journaling altogether, which I don't want to
> do. I've tried building XFS with an external journal and giving it a
> loopback device connected to a file in /tmp. Then it's plenty fast.
> But unsafe.

As has already been suggested, "-o delaylog" is the solution to that
problem.

> Turning off the write barrier also speeds things up considerably, but
> that also makes me nervous. My system doesn't have a RAID controller
> with a nonvolatile cache but it is plugged into a UPS (actually a
> large solar power system with a battery bank) so unexpected loss of
> power is unlikely. Can I safely turn off the barrier?

Should be safe. In 2.6.37 the overhead of barriers is greatly reduced.
IIRC, on most modern hardware they will most likely be unnoticeable, so
disabling them is probably not necessary...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
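[For reference, barriers are controlled per mount. A sketch with a
placeholder device and mountpoint — and, per the discussion above,
disable them only when power loss to the drive cache is genuinely ruled
out:]

```shell
# Barriers are on by default on XFS; nobarrier trades crash safety of
# the drive's volatile write cache for throughput.
mount -o nobarrier /dev/md0 /big

# The setting can also be flipped on a mounted filesystem:
mount -o remount,nobarrier /dev/md0 /big
```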
* Re: realtime partition support?

From: Phil Karn @ 2011-01-10 10:58 UTC
To: Dave Chinner; +Cc: xfs

On 1/9/11 4:35 PM, Dave Chinner wrote:
> Which often does not require IO because the path and inodes are
> cached in memory.

I'm thinking mainly of rapid file creation or deletion, such as

  tar xjf linux-2.6.37.tar.bz2
  rm -rf linux-2.6.37

Most of the paths are likely cached, yes, but a lot of inodes are being
rapidly created or deleted, so there's a lot of log activity.

> Not that much more quickly, because XFS uses readahead to hide a lot
> of the directory traversal IO latency when it is not cached....

> As has already been suggested, "-o delaylog" is the solution to that
> problem.

Thanks for the suggestion. I hadn't even heard of delaylog until the
other day, as it's not in the manual page. I just tried it, and 'tar x'
now completes far more quickly. But the output rate shown by 'vmstat'
is still rather low, and it takes a very long time (minutes) for a
subsequent 'sync' or 'umount' command to finish. And just now my system
has deadlocked: the CPUs are all idle, there's no disk I/O, and
commands referencing that filesystem hang. /, /boot and /home, which
are on a separate SSD, seem OK.

[I just noticed that there doesn't seem to be any checking of options
to 'mount -o remount'; anything is silently accepted:

  $ sudo mount -o rw,remount,relatime,xyzzy,bletch /dev/md0 /big
  $ mount
  [....]
  /dev/md0 on /big type xfs (rw,relatime,xyzzy,bletch)
  $

Options are checked when explicitly mounting a file system that is not
already mounted.]

>> Turning off the write barrier also speeds things up considerably,
>> but that also makes me nervous. My system doesn't have a RAID
>> controller with a nonvolatile cache but it is plugged into a UPS
>> (actually a large solar power system with a battery bank) so
>> unexpected loss of power is unlikely. Can I safely turn off the
>> barrier?
>
> Should be safe. In 2.6.37 the overhead of barriers is greatly
> reduced. IIRC, on most modern hardware they will most likely be
> unnoticeable, so disabling them is probably not necessary...

I'm running stock 2.6.37, and here the effect of barrier/nobarrier on
rapid file creation or deletion is dramatic: well over an order of
magnitude. With barriers on, 'vmstat' shows a bo (block output) rate of
only several hundred kB/sec. With nobarrier, it jumps to 5-9 MB/s.

This is on a RAID-5 array of four 2TB WDC WD20EARS (advanced format)
drives. (The XFS block size is 4K, and I was careful to align the
partitions on 4K boundaries.) According to hdparm, each drive is
running at 3.0 Gb/s; write caching is enabled but multcount is off. The
SATA controllers are Intel ICH10 82801JI with the ahci driver.

On another system with another WDC drive of the same model, connected
to an Intel ICH9R 82801IR controller with the ata_piix driver, multcount
is set to 16. Could this be because the drives on the first machine are
in a RAID array while the second is standalone? Is it safe to change
this setting with hdparm to see what happens?
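[For reference, the drive settings Phil mentions can be inspected with
hdparm; a hedged sketch — the device name is a placeholder, and
changing settings on a live RAID member is at your own risk:]

```shell
# Read-only queries:
hdparm -W /dev/sdX    # write-caching on or off
hdparm -m /dev/sdX    # current multcount (sectors transferred per interrupt)
hdparm -i /dev/sdX    # drive identification, including MaxMultSect

# Experiment, e.g. set multcount to 16 (persists until the drive is reset):
# hdparm -m16 /dev/sdX
```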
Thread overview: 7+ messages

2011-01-07 14:36 realtime partition support? Phil Karn
2011-01-08  2:17 ` Dave Chinner
2011-01-08  3:59   ` Phil Karn
2011-01-08  6:29     ` Stan Hoeppner
2011-01-08 14:42       ` Phil Karn
2011-01-10  0:35     ` Dave Chinner
2011-01-10 10:58       ` Phil Karn