public inbox for linux-xfs@vger.kernel.org
* realtime partition support?
@ 2011-01-07 14:36 Phil Karn
  2011-01-08  2:17 ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Phil Karn @ 2011-01-07 14:36 UTC (permalink / raw)
  To: xfs

What's the status of the realtime partition feature in XFS? I think I
read somewhere that it wasn't actually implemented and/or working in the
Linux XFS implementation, but I'm not sure. If it is in Linux, how well
tested is it?

It occurred to me that the XFS realtime feature might be a quick and
easy way to make a hybrid of an SSD and a rotating drive. Just create an
XFS file system on the SSD that specifies the rotating drive as its
realtime partition. This would put all the metadata on the SSD where it
can be quickly accessed at random.
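
Concretely, the layout I have in mind would look something like this
(device names are placeholders; the rtinherit step is my understanding
of how new files get steered onto the rt device):

```shell
# Sketch only -- /dev/ssd1 and /dev/hdd1 are placeholder devices.
# Metadata and the log live on the SSD (the "data" device); file data
# goes to the rotating drive used as the realtime subvolume.
mkfs.xfs -r rtdev=/dev/hdd1 /dev/ssd1
mount -o rtdev=/dev/hdd1 /dev/ssd1 /mnt

# Files only land on the rt device if flagged; setting the rtinherit
# bit on the top directory should make new files inherit it:
xfs_io -c 'chattr +t' /mnt
```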

Throughput on large files would be almost as fast as if everything were
on the SSD. Small files wouldn't be as fast, but still much faster than
with no SSD at all.

Phil

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: realtime partition support?
  2011-01-07 14:36 realtime partition support? Phil Karn
@ 2011-01-08  2:17 ` Dave Chinner
  2011-01-08  3:59   ` Phil Karn
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2011-01-08  2:17 UTC (permalink / raw)
  To: karn; +Cc: xfs

On Fri, Jan 07, 2011 at 06:36:16AM -0800, Phil Karn wrote:
> What's the status of the realtime partition feature in XFS? I think I
> read somewhere that it wasn't actually implemented and/or working in the
> Linux XFS implementation, but I'm not sure. If it is in Linux, how well
> tested is it?

Experimental, implemented in Linux and mostly just works, but it is
largely untested and not really recommended for any sort of
production use.

> It occurred to me that the XFS realtime feature might be a quick and
> easy way to make a hybrid of an SSD and a rotating drive. Just create an
> XFS file system on the SSD that specifies the rotating drive as its
> realtime partition. This would put all the metadata on the SSD where it
> can be quickly accessed at random.

It has a couple of drawbacks: realtime device extent allocation is
single-threaded, and it's not designed as a general-purpose allocator.

> Throughput on large files would be almost as fast as if everything were
> on the SSD.

Not at all. The data is still written to the rotating disk, so
the presence of the SSD won't change throughput rates at all. In
fact, the rt device is not aimed at maximising throughput - it
was designed for deterministic performance for low-latency multiple
stream access patterns - so it will probably give lower throughput
than just using the rotating drive alone....

> Small files wouldn't be as fast, but still much faster than
> with no SSD at all.

I'd also expect it to be much, much slower than just using the
rotating disk for the standard data device - the SSD will make no
difference as metadata IO is not the limiting factor.  Further, the
rt allocator is simply not designed to handle lots of small files
efficiently so will trigger many more seeks for small file data IO
than the standard allocator (and hence be slower) because the
standard allocator packs small files tightly together...

It's a nice idea, but it doesn't really work out in practice
with the current XFS structure.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: realtime partition support?
  2011-01-08  2:17 ` Dave Chinner
@ 2011-01-08  3:59   ` Phil Karn
  2011-01-08  6:29     ` Stan Hoeppner
  2011-01-10  0:35     ` Dave Chinner
  0 siblings, 2 replies; 7+ messages in thread
From: Phil Karn @ 2011-01-08  3:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: karn, xfs

On 1/7/11 6:17 PM, Dave Chinner wrote:

> Experimental, implemented in Linux and mostly just works, but it is
> largely untested and not really recommended for any sort of
> production use.

Thanks.

>> Throughput on large files would be almost as fast as if everything were
>> on the SSD.
> 
> Not at all. The data is still written to the rotating disk, so
> the presence of the SSD won't change throughput rates at all.

My point is that the big win of the SSD comes from its lack of
rotational and seek latency. They really shine on random small reads. On
large sequential reads and writes, modern rotating disks can shovel data
almost as quickly as an SSD once the head is in the right place and the
data has started flowing. But it can't get there until the filesystem
has walked the directory tree, read the inode for the file in question,
and finally seeked to the file's first extent. If all that metadata resided
on SSD, the conventional drive could get to that first extent that much
more quickly.
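
The argument is just latency arithmetic; here's a rough sketch with
assumed (not measured) figures:

```python
# Illustrative numbers only (assumptions, not measurements): ~8 ms
# average seek plus ~4.2 ms rotational latency per metadata hop on a
# 7200 rpm drive, vs ~0.1 ms for a random read on a typical SATA SSD.
seek_ms, rot_ms, ssd_ms = 8.0, 4.2, 0.1
hops = 4  # e.g. directory block, inode, extent map, then the data itself

hdd_setup = hops * (seek_ms + rot_ms)                  # everything on the HDD
ssd_setup = (hops - 1) * ssd_ms + (seek_ms + rot_ms)   # metadata on the SSD

print(f"all-HDD time to first data byte: {hdd_setup:.1f} ms")
print(f"hybrid  time to first data byte: {ssd_setup:.1f} ms")
```

Even under generous assumptions the hybrid is bounded by the one
unavoidable seek to the data itself, so the win is in setup latency,
not streaming throughput.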

Yeah, the SSD is typically still faster than a rotating drive on
sequential reads -- but only by a factor of 2:1, not dozens or hundreds
of times. On sequential writes, many rotating drives are actually faster
than many SSDs.

> I'd also expect it to be much, much slower than just using the
> rotating disk for the standard data device - the SSD will make no
> difference as metadata IO is not the limiting factor.

No? I'm having a very hard time getting XFS on rotating SATA drives to
come close to Reiser or ext4 when extracting a large tarball (e.g., the
Linux source tree) or when doing rm -rf. I've improved it by playing
with logbsize and logbufs but Reiser is still much faster, especially at
rm -rf. The only way I've managed to get XFS close is by essentially
disabling journaling altogether, which I don't want to do. I've tried
building XFS with an external journal and giving it a loopback device
connected to a file in /tmp. Then it's plenty fast. But unsafe.

As I understand it, the problem is all that seeking to the internal
journal. I'd like to try putting the journal on a SSD partition but I
can't figure out how to do that with an existing XFS file system without
rebuilding it.
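
As far as I can tell, an external log has to be specified at mkfs time,
which is why I can't see a way to add one without rebuilding. The
rebuild itself would be something like this (placeholder devices, and
the 128m log size is just an example):

```shell
# Back up the data first -- this re-creates the filesystem from scratch.
# /dev/ssd2 (log) and /dev/sdb1 (data) are placeholder device names.
mkfs.xfs -f -l logdev=/dev/ssd2,size=128m /dev/sdb1
mount -o logdev=/dev/ssd2 /dev/sdb1 /big
```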

Turning off the write barrier also speeds things up considerably, but
that also makes me nervous. My system doesn't have a RAID controller
with a nonvolatile cache but it is plugged into a UPS (actually a large
solar power system with a battery bank) so unexpected loss of power is
unlikely. Can I safely turn off the barrier?
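
If the UPS does make it acceptable, the knob itself is just a mount
option (sketch; /big is my mount point):

```shell
# Only defensible when the drive write caches are protected against
# power loss -- hence the question above.
mount -o remount,nobarrier /big
```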

If I correctly understand how drive write caching works, then even a
kernel panic shouldn't keep data that's already been sent to the drive
from being written out to the media. Only a power failure could do that,
or possibly the host resetting the drive. After a kernel panic the BIOS
will eventually reset all the hardware, but that won't happen for some
time after a kernel panic.


* Re: realtime partition support?
  2011-01-08  3:59   ` Phil Karn
@ 2011-01-08  6:29     ` Stan Hoeppner
  2011-01-08 14:42       ` Phil Karn
  2011-01-10  0:35     ` Dave Chinner
  1 sibling, 1 reply; 7+ messages in thread
From: Stan Hoeppner @ 2011-01-08  6:29 UTC (permalink / raw)
  To: xfs

Phil Karn put forth on 1/7/2011 9:59 PM:

> No? I'm having a very hard time getting XFS on rotating SATA drives to
> come close to Reiser or ext4 when extracting a large tarball (e.g., the
> Linux source tree) or when doing rm -rf.

This is because you're not using Dave's delayed logging patch, and
you've not been reading this list for many months, as it's been
discussed in detail many times.  See:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs-delayed-logging-design.txt;h=96d0df28bed323d5596fc051b0ffb96ed8e3c8df;hb=HEAD


Dave Chinner put forth on 3/14/2010 11:30 PM:

> The following results are from a synthetic test designed to show
> just the impact of delayed logging on the amount of metadata
> written to the log.
>
> load:	Sequential create 100k zero-length files in a directory per
> 	thread, no fsync between create and unlink.
> 	(./fs_mark -S0 -n 100000 -s 0 -d ....)
>
> measurement: via PCP. XFS specific metrics:
>
> 	xfs.log.blocks
> 	xfs.log.writes
> 	xfs.log.noiclogs
> 	xfs.log.force
> 	xfs.transactions.*
> 	xfs.dir_ops.create
> 	xfs.dir_ops.remove
>
>
> machine:
>
> 2GHz Dual core opteron, 3GB RAM
> single 36GB 15krpm scsi drive w/ CTQ depth=32
> mkfs.xfs -f -l size=128m /dev/sdb2
>
> Current code:
>
> mount -o "logbsize=262144" /dev/sdb2 /mnt/scratch
>
> threads:	 fs_mark	CPU	create log	unlink log
> 		throughput		bandwidth	bandwidth
> 1		  2900/s         75%       34MB/s	 34MB/s
> 2		  2850/s	 75%	   33MB/s	 33MB/s
> 4		  2800/s	 80%	   30MB/s        30MB/s
>
> Delayed logging:
>
> mount -o "delaylog,logbsize=262144" /dev/sdb2 /mnt/scratch
>
> threads:	 fs_mark	CPU	create log	unlink log
> 		throughput		bandwidth	bandwidth
> 1		  4300/s        110%       1.5MB/s	 <1MB/s
> 2		  7900/s	195%	   <4MB/s	 <1MB/s
> 4		  7500/s	200%	   <5MB/s        <1.5MB/s
>
> I think it pretty clear that the design goal of "an order of
> magnitude less log IO bandwidth" is being met here. Scalability is
> looking promising, but a 2p machine is not large enough to make any
> definitive statements about that. Hence from these results the
> implementation is at or exceeding design levels.


The above results were with very young code.  I'm guessing the current
code in the tree probably has a little better performance.  Nonetheless,
the above results are impressive, and put XFS on par with any other FS
WRT metadata write heavy workloads.  Your "rm -rf" operation will be
_significantly_ faster, likely a factor of 2x or better, with this
delayed logging option enabled, and will be limited mainly/only by the
speed of your CPU/memory subsystem.

Untarring a kernel should yield a similar, but somewhat lesser,
performance increase, as you'll be creating ~2300 directories and ~50,000
files (with real contents this time, not zero-length files).

With a modern AMD/Intel platform with a CPU of ~3GHz clock speed, XFS
metadata OPs with delayed logging enabled should absolutely scream,
especially so with multicore CPUs and parallel/concurrent metadata write
heavy processes/threads.

I can't remember any more recent test results from Dave, although I may
simply have missed reading those emails, if they were sent.  Even if the
current code isn't any faster than that used for the tests above, the
metadata write performance increase is still phenomenal.

Again, nice work Dave. :)  AFAIK, you've eliminated the one 'legit'
performance gripe Linux folks have traditionally leveled at XFS WRT
use as a general purpose server/workstation filesystem.  Now they have
no excuses not to use it.  :)  I'd love to see a full-up Linux FS
performance comparison article after 2.6.39 rolls out and delaylog is
the default mount option.  I don't have the necessary hardware etc. to do
such a piece, or I gladly would.

-- 
Stan


* Re: realtime partition support?
  2011-01-08  6:29     ` Stan Hoeppner
@ 2011-01-08 14:42       ` Phil Karn
  0 siblings, 0 replies; 7+ messages in thread
From: Phil Karn @ 2011-01-08 14:42 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: xfs

On 1/7/11 10:29 PM, Stan Hoeppner wrote:
> Phil Karn put forth on 1/7/2011 9:59 PM:
> 
>> No? I'm having a very hard time getting XFS on rotating SATA drives to
>> come close to Reiser or ext4 when extracting a large tarball (e.g., the
>> Linux source tree) or when doing rm -rf.
> 
> This is because you're not using Dave's delayed logging patch, and
> you've not been reading this list for many months, as it's been
> discussed in detail many times.  See:
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs-delayed-logging-design.txt;h=96d0df28bed323d5596fc051b0ffb96ed8e3c8df;hb=HEAD

Yes, I am new to the list, and while I did pull down a year of the
archives I certainly haven't read them all.

Thanks to this file pointer and your explanation, I now have a pretty
good idea of what's going on.

Phil


* Re: realtime partition support?
  2011-01-08  3:59   ` Phil Karn
  2011-01-08  6:29     ` Stan Hoeppner
@ 2011-01-10  0:35     ` Dave Chinner
  2011-01-10 10:58       ` Phil Karn
  1 sibling, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2011-01-10  0:35 UTC (permalink / raw)
  To: karn; +Cc: xfs

On Fri, Jan 07, 2011 at 07:59:27PM -0800, Phil Karn wrote:
> On 1/7/11 6:17 PM, Dave Chinner wrote:
> >> Throughput on large files would be almost as fast as if everything were
> >> on the SSD.
> > 
> > Not at all. The data is still written to the rotating disk, so
> > the presence of the SSD won't change throughput rates at all.
> 
> My point is that the big win of the SSD comes from its lack of
> rotational and seek latency. They really shine on random small reads. On
> large sequential reads and writes, modern rotating disks can shovel data
> almost as quickly as an SSD once the head is in the right place and the
> data has started flowing. But it can't get there until it has walked the
> directory tree, read the inode for the file in question and finally
> seeked to the file's first extent.

Which often does not require IO because the path and inodes are
cached in memory.

> If all that meta information resided
> on SSD, the conventional drive could get to that first extent that much
> more quickly.

Not that much more quickly, because XFS uses readahead to hide a lot
of the directory traversal IO latency when it is not cached....

> > I'd also expect it to be much, much slower than just using the
> > rotating disk for the standard data device - the SSD will make no
> > difference as metadata IO is not the limiting factor.
> 
> No? I'm having a very hard time getting XFS on rotating SATA drives to
> come close to Reiser or ext4 when extracting a large tarball (e.g., the
> Linux source tree) or when doing rm -rf. I've improved it by playing
> with logbsize and logbufs but Reiser is still much faster, especially at
> rm -rf. The only way I've managed to get XFS close is by essentially
> disabling journaling altogether, which I don't want to do. I've tried
> building XFS with an external journal and giving it a loopback device
> connected to a file in /tmp. Then it's plenty fast. But unsafe.

As has already been suggested, "-o delaylog" is the solution to that
problem.

> Turning off the write barrier also speeds things up considerably, but
> that also makes me nervous. My system doesn't have a RAID controller
> with a nonvolatile cache but it is plugged into a UPS (actually a large
> solar power system with a battery bank) so unexpected loss of power is
> unlikely. Can I safely turn off the barrier?

Should be safe.  In 2.6.37 the overhead of barriers is greatly
reduced. IIRC, on most modern hardware they will most likely be
unnoticeable, so disabling them is probably not necessary...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: realtime partition support?
  2011-01-10  0:35     ` Dave Chinner
@ 2011-01-10 10:58       ` Phil Karn
  0 siblings, 0 replies; 7+ messages in thread
From: Phil Karn @ 2011-01-10 10:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 1/9/11 4:35 PM, Dave Chinner wrote:

> Which often does not require IO because the path and inodes are
> cached in memory.

I'm thinking mainly of rapid file creation or deletion, such as

tar xjf linux-2.6.37.tar.bz2
rm -rf linux-2.6.37

Most of the paths are likely cached, yes, but a lot of inodes are being
rapidly created or deleted so there's a lot of log activity.

> Not that much more quickly, because XFS uses readahead to hide a lot
> of the directory traversal IO latency when it is not cached....

> As has already been suggested, "-o delaylog" is the solution to that
> problem.

Thanks for the suggestion. I hadn't even heard of delaylog until the
other day as it's not in the manual page. I just tried it, and 'tar x'
now completes far more quickly. But the output rate shown by 'vmstat' is
still rather low, and it takes a very long time (minutes) for a
subsequent 'sync' or 'umount' command to finish.

And just now my system has deadlocked. The CPUs are all idle, there's no
disk I/O, and commands referencing that filesystem hang. /, /boot and
/home, which are on a separate SSD, seem OK.

[I just noticed that there doesn't seem to be any checking of options to
'mount -o remount'; anything is silently accepted:

$ sudo mount -o rw,remount,relatime,xyzzy,bletch /dev/md0 /big
$ mount
[....]
/dev/md0 on /big type xfs (rw,relatime,xyzzy,bletch)
$

Options only seem to be checked when explicitly mounting a file system
that is not already mounted.]

> 
>> Turning off the write barrier also speeds things up considerably, but
>> that also makes me nervous. My system doesn't have a RAID controller
>> with a nonvolatile cache but it is plugged into a UPS (actually a large
>> solar power system with a battery bank) so unexpected loss of power is
>> unlikely. Can I safely turn off the barrier?
> 
> Should be safe.  In 2.6.37 the overhead of barriers is greatly
> reduced. IIRC, on most modern hardware they will most likely be
> unnoticable, so disabling them is probably not necessary...

I'm running stock 2.6.37 and here the effect of barrier/nobarrier on
rapid file creation or deletion is dramatic, well over an order of
magnitude. With barriers on, "vmstat" shows a bo (block output) rate of
only several hundred kB/sec. With nobarrier, it jumps to 5-9 MB/s. This
is on a RAID-5 array of four 2TB WDC WD20EARS (advanced format) drives.
(The XFS blocksize is 4K, and I was careful to align the partitions on
4K boundaries.) According to hdparm, each drive is running at 3.0 Gb/s,
write caching is enabled but multicount is off. The SATA controllers are
Intel ICH10 82801JI with the ahci driver.

On another system with another WDC drive of the same model connected to
an Intel ICH9R 82801IR controller and the ata_piix driver, multicount is
set to 16. Could this be because the drives on the first machine are in
a RAID array while the second is standalone? Is it safe to change this
setting with hdparm to see what happens?
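
For reference, here's how I'd inspect and tentatively change it
(/dev/sdX is a placeholder; note hdparm spells the setting "multcount"):

```shell
hdparm -m /dev/sdX     # show the current multiple sector count
hdparm -m16 /dev/sdX   # request 16 sectors per interrupt
hdparm -W /dev/sdX     # and show the write-caching state while at it
```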

