public inbox for linux-xfs@vger.kernel.org
* op-journaled fs, journal size and storage speeds
@ 2011-04-30 14:51 Peter Grandi
  2011-05-01  9:27 ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Grandi @ 2011-04-30 14:51 UTC (permalink / raw)
  To: Linux fs XFS, Linux fs JFS

Been thinking about journals and RAID6s and SSDs.

In particular for file system designs like JFS and XFS that do
operation journaling (while ext[34] do block journaling).

The issue is: journal size?

It seems to me that adopting as a guideline a percentage of the
filesystem size is very wrong, and so I have been using a rule of
thumb like one second of expected transfer rate, so "in flight"
updates are never much behind.

But even at a single disk *sequential* transfer rate of say
80MB/s average, a journal that contains operation records could
conceivably hold dozens if not hundreds of thousands of pending
metadata updates, probably targeted at very widely scattered
locations on disk, and replaying a journal fully could take a long
time.

So the idea would be that the relevant transfer rate would be
the *random* one, and since that is around 4MB/s per single
disk, journal sizes would end up pretty small. But many people
allocate very large (at least compared to that) journals.

This seems to me a fairly bad idea, because then the journal
becomes a massive hot spot on the disk and draws the disk arm
like a black hole. I suspect that operations should not stay on
the journal for a long time. However if the journal is too small
processes that do metadata updates start to hang on it.

So some questions for which I have guesses but not good answers:

  * What should journal size be proportional to?
  * What is the downside of a too small journal?
  * What is the downside of a too large journal other than space?

Again I expect answers to be very different for ext[34] but I am
asking about operation-journaling file system designs like JFS and
XFS.

BTW, another consideration is that for filesystems that are
fairly journal-intensive, putting the journal on a low traffic
storage device can have large benefits.

But if they can be pretty small, I wonder whether putting the
journals of several filesystems on the same storage device then
becomes a sensible option as the locality will be quite narrow
(e.g. a single physical cylinder) or it could be worthwhile like
the database people do to journal to battery-backed RAM.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: op-journaled fs, journal size and storage speeds
  2011-04-30 14:51 op-journaled fs, journal size and storage speeds Peter Grandi
@ 2011-05-01  9:27 ` Dave Chinner
  2011-05-01 18:13   ` Peter Grandi
  2011-05-02  4:35   ` Stan Hoeppner
  0 siblings, 2 replies; 6+ messages in thread
From: Dave Chinner @ 2011-05-01  9:27 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs XFS, Linux fs JFS

On Sat, Apr 30, 2011 at 03:51:43PM +0100, Peter Grandi wrote:
> Been thinking about journals and RAID6s and SSDs.
> 
> In particular for file system designs like JFS and XFS that do
> operation journaling (while ext[34] do block journaling).

XFS is not an operation journalling filesystem. Most of the metadata
is dirty-region logged via buffers, just like ext3/4. Perhaps you
need to read some documentation like this:

http://xfs.org/index.php/Improving_Metadata_Performance_By_Reducing_Journal_Overhead#Operation_Based_Logging

> The issue is: journal size?
>
> It seems to me that adopting as a guideline a percentage of the
> filesystem size is very wrong, and so I have been using a rule of
> thumb like one second of expected transfer rate, so "in flight"
> updates are never much behind.

How do you know what "one second" of "in flight" operations is going
to be?

I had to deal with this in XFS when implementing the delayed logging
code. It uses a number of operations or a percentage of log space to
determine when to checkpoint the modifications, and that is
typically load dependent as to when it triggers.

And then you've got the problem of concurrency - one second of a
single threaded workload is much different to one second of the same
workload spread across 20 CPU cores. You need to have limits that
work well in both cases, and structures that scale to that level of
concurrency.

In reality, there's not much point in trying to calculate what one
second's worth of metadata is going to be - more often than not
you'll hit some other limitation in the journal subsystem, run out
of memory or have to put limits in place anyway to avoid latency
problems. The easiest and most reliable method seems to be to size your
journal appropriately in the first place and have your algorithms key
off that....

> But even at a single disk *sequential* transfer rate of say
> 80MB/s average, a journal that contains operation records could
> conceivably hold dozens if not hundreds of thousands of pending
> metadata updates, probably targeted at very widely scattered
> locations on disk, and replaying a journal fully could take a long
> time.

17 minutes is my current record, from crashing a VM during a chmod -R
operation over a 100 million inode filesystem. That was on a ~2GB
log (maximum supported size).

http://xfs.org/index.php/Improving_Metadata_Performance_By_Reducing_Journal_Overhead#Reducing_Recovery_Time

> So the idea would be that the relevant transfer rate would be
> the *random* one, and since that is around 4MB/s per single
> disk, journal sizes would end up pretty small. But many people
> allocate very large (at least compared to that) journals.
> 
> This seems to me a fairly bad idea, because then the journal
> becomes a massive hot spot on the disk and draws the disk arm
> like a black hole. I suspect that operations should not stay on

That's why you can configure an external log....

> the journal for a long time. However if the journal is too small
> processes that do metadata updates start to hang on it.

Well, yes. The journal needs to be large enough to hold all the
transaction reservations for the active transactions. XFS, in the
worst case for a default filesystem config, needs about 100MB of log
space per 300 concurrent transactions. Increasing transaction
concurrency was the main reason we increased the log size...

> So some questions for which I have guesses but not good answers:
> 
>   * What should journal size be proportional to?

Your workload.

>   * What is the downside of a too small journal?

Performance sucks.

>   * What is the downside of a too large journal other than space?

Recovery times too long, lots of outstanding metadata pinned in
memory (hello OOM-killer!), and other resource management related
scalability issues.

> Again I expect answers to be very different for ext[34] but I am
> asking about operation-journaling file system designs like JFS and
> XFS.

> BTW, another consideration is that for filesystems that are
> fairly journal-intensive, putting the journal on a low traffic
> storage device can have large benefits.

Yeah, nobody ever thought of an external log before.... :)

> But if they can be pretty small, I wonder whether putting the
> journals of several filesystems on the same storage device then
> becomes a sensible option as the locality will be quite narrow
> (e.g. a single physical cylinder) or it could be worthwhile like
> the database people do to journal to battery-backed RAM.

Got a supplier for the custom hardware you'd need? Just use a PCIe
SSD....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: op-journaled fs, journal size and storage speeds
  2011-05-01  9:27 ` Dave Chinner
@ 2011-05-01 18:13   ` Peter Grandi
  2011-05-02  1:23     ` Dave Chinner
  2011-05-02 10:40     ` Christoph Hellwig
  2011-05-02  4:35   ` Stan Hoeppner
  1 sibling, 2 replies; 6+ messages in thread
From: Peter Grandi @ 2011-05-01 18:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Linux fs XFS, Linux fs JFS


>> Been thinking about journals and RAID6s and SSDs. In particular
>> for file system designs like JFS and XFS that do operation
>> journaling (while ext[34] do block journaling).

> XFS is not an operation journalling filesystem. Most of the
> metadata is dirty-region logged via buffers, just like ext3/4.

Looking at the sources, XFS does operation journaling, in the
form of physical ("dirty region") operation logging, instead of
logical operation logging like JFS. Both are very different from
block journaling.

In more detail: to me there is a stark contrast between 'jbd.h':

  http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.38.y.git;a=blob;f=include/linux/jbd.h;h=e06965081ba5548f74db935543af84334f58259e;hb=HEAD

where I find only a few journal transaction types (blocks) and
'xfs_trans.h' where I find many journal transaction types (ops):

 http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.38.y.git;a=blob;f=fs/xfs/xfs_trans.h;h=c2042b736b81131a780703d8a5907c848793eebb;hb=HEAD

Given that in the latter I see transaction types like
'XFS_TRANS_RENAME' or 'XFS_TRANS_MKDIR' it is hard to imagine how
one can argue that XFS journals something other than ops, even
if in a buffered way of sorts.

Ironically, comparing with 'jfs_logmgr.h':

  http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.38.y.git;a=blob;f=fs/jfs/jfs_logmgr.h;h=9236bc49ae7ff1aed9cad81a2b22c2c54e433ba0;hb=HEAD

I see lower-level transaction types there (but they are logged as
ops rather than as "dirty regions").

[ ... ]

>> It seems to me that adopting as a guideline a percentage of the
>> filesystem size is very wrong, and so I have been using a rule of
>> thumb like one second of expected transfer rate, so "in flight"
>> updates are never much behind.

> How do you know what "one second" of "in flight" operations is
> going to be?

Well, that's what I discuss later; it is a "rule of thumb" based
on *some* rationale, but I have been questioning it.

[ ... interesting summary of some of the many issues related to
journal sizing ... ]

> Easiest and most reliable method seems to be to size your
> journal appropriatly in the first place and have you
> algorithms key off that....

Sure, but *I* am asking that question :-).

[ ... ]

> 17 minutes is my current record, from crashing a VM during a
> chmod -R operation over a 100 million inode filesystem. That
> was on a ~2GB log (maximum supported size).

Uhhhm I happen to strongly relate to that (on a much smaller
scale :->).

[ ... ]

>> This seems to me a fairly bad idea, because then the journal
>> becomes a massive hot spot on the disk and draws the disk arm
>> like a black hole. I suspect that operations should not stay on

> That's why you can configure an external log....

...and lose barriers :-). But indeed.

>> the journal for a long time. However if the journal is too
>> small processes that do metadata updates start to hang on it.

> Well, yes. The journal needs to be large enough to hold all
> the transaction reservations for the active transactions. XFS,
> in the worst case for a default filesystem config, needs about
> 100MB of log space per 300 concurrent transactions. [ ... ]

So something like 300KB per transaction? That seems a pretty
extreme worst case. How is that possible? A metadata transaction
with a "dirty region" of 300KB sounds enormously expensive. It may
be about extent maps for a very fragmented file I guess. Also not
clear here what  concurrent  means because the log is sequential.
I'll guess that it means "in flight".

[ ... ]

>> * What should journal size be proportional to?

> Your workload.

Sure, as a very top level goal. But that's not an answer, it is
handwaving. As you argue earlier, it could be proportional in some
cases to IO threads; or it could be number of arms, filesystem
size, size of each volume, sequential transfer rate, random
transfer rate, large IO transfer rate, small IO transfer rate, ...

Some tighter guideline might be better than just guessing.

>> * What is the downside of a too small journal?

> Performance sucks.

But why? Without a journal at all, performance is better;
assuming a one-transaction journal, this becomes slower because
of writing everything twice, but that happens for any size of
journal, as it is unavoidable.

When the journal fills up the effect is the same as that of a 1
transaction journal. That's the same for every type of buffer.

So the effect of a journal larger than 1 transaction must be
felt only when the journal is not full, that is there are pauses
in the flow of transactions; and then it does not matter a lot
just how large the journal is.

So the journal should be large enough to accommodate the highest
possible rate of metadata updates for the longest time this
happens until there is a pause in the metadata updates.

This of course depends on workload, but some rule of thumb based
on experience might help.

And here my guess is that shorter journals are better than
longer ones, because also:

>> * What is the downside of a too large journal other than space?

> Recovery times too long, lots of outstanding metadata pinned
> in memory (hello OOM-killer!), and other resource management
> related scalability issues.

I would also have expected more seeks, as reading logged but not
yet finalized metadata has to go back to the journal, but I guess
that's a small effect.

>> BTW, another consideration is that for filesystems that are
>> fairly journal-intensive, putting the journal on a low traffic
>> storage device can have large benefits.

> Yeah, nobody ever thought of an external log before.... :)

I was just stating the obvious here, in order to contrast it with:

>> But if they can be pretty small, I wonder whether putting the
>> journals of several filesystems on the same storage device then
>> becomes a sensible option as the locality will be quite narrow
>> (e.g. a single physical cylinder) or it could be worthwhile like
>> the database people do to journal to battery-backed RAM.

For example as described in this old paper:

  http://www.evenenterprises.com/SSDoracl.pdf

> Got a supplier for the custom hardware you'd need?

There are still a few, for example at different ends of the scale:

  http://www.ramsan.com/solutions/oracle/
  http://www.microdirect.co.uk/home/product/39434/ACARD-RAM-Disk-SSD-ANS-9010B-6X-DDR-II-Slots

> Just use a PCIe SSD....

Yes, that's what many people are doing, but mostly for data,
rather than specifically journals.

As mentioned at the start I have indeed been thinking of SSDs.

But they seem to me fundamentally terrible for journals, because
of the large erase block sizes and the enormous latency of erase
operations (lots of read-erase-write cycles for small commits).
They seem more oriented to large mostly read-only data sets than
very small mostly write ones.

The saving grace is the capacitor-backed RAM in SSDs (used to work
around erase block size issues as you probably know) which to a
significant extent may act as the  battery-backed RAM  I was
mentioning; and similarly as another post says the  battery-backed
RAM  in RAID host adapters would do much the same function.

But neither does so as cleanly as a dedicated unit rather than a cache.

But as another contributor said a fast/small disk RAID1 might be
quite decent in many situations.


* Re: op-journaled fs, journal size and storage speeds
  2011-05-01 18:13   ` Peter Grandi
@ 2011-05-02  1:23     ` Dave Chinner
  2011-05-02 10:40     ` Christoph Hellwig
  1 sibling, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2011-05-02  1:23 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs XFS, Linux fs JFS

On Sun, May 01, 2011 at 07:13:03PM +0100, Peter Grandi wrote:
> 
> >> Been thinking about journals and RAID6s and SSDs. In particular
> >> for file system designs like JFS and XFS that do operation
> >> journaling (while ext[34] do block journaling).
> 
> > XFS is not an operation journalling filesystem. Most of the
> > metadata is dirty-region logged via buffers, just like ext3/4.
> 
> Looking at the sources, XFS does operation journaling, in the
> form of physical ("dirty region") operation logging,

Operation logging contains no physical changes - it just indicates
the change to be made, typically via an intent/done transaction pair.
It says what is going to be done, then what has been done, but not
the details of the changes made.

XFS _always_ logs the details of the changes made, and....

> instead of
> logical operation logging like JFS. Both are very different from
> block journaling.

When you are dirtying entire blocks, then the way the blocks are
logged is really no different to ext3/4's block logging...

> In more detail: to me there is a stark contrast between 'jbd.h':
> 
>   http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.38.y.git;a=blob;f=include/linux/jbd.h;h=e06965081ba5548f74db935543af84334f58259e;hb=HEAD
> 
> where I find only a few journal transaction types (blocks) and
> 'xfs_trans.h' where I find many journal transaction types (ops):
> 
>  http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.38.y.git;a=blob;f=fs/xfs/xfs_trans.h;h=c2042b736b81131a780703d8a5907c848793eebb;hb=HEAD

Yeah, so that number goes into the transaction header on disk mainly
for debugging purposes - you can identify what operation triggered
the transaction in the log just by looking at the log.

However, that is _completely ignored_ for delayed logging - you'll
only ever see "checkpoint" transactions with delayed logging as it
throws away all the transaction specific metadata in memory...

> Given that in the latter I see transaction types like
> 'XFS_TRANS_RENAME' or 'XFS_TRANS_MKDIR' it is hard to imagine how
> one can argue that XFS journals something other than ops, even
> if in a buffered way of sorts.

Why don't you look at the transaction reservations that define what
one of those "transaction ops" contains? E.g. MKDIR uses the inode
create reservation:

/*
 * For create we can modify:
 *    the parent directory inode: inode size
 *    the new inode: inode size
 *    the inode btree entry: block size
 *    the superblock for the nlink flag: sector size
 *    the directory btree: (max depth + v2) * dir block size
 *    the directory inode's bmap btree: (max depth + v2) * block size
 * Or in the first xact we allocate some inodes giving:
 *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
 *    the superblock for the nlink flag: sector size
 *    the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize
 *    the inode btree: max depth * blocksize
 *    the allocation btrees: 2 trees * (max depth - 1) * block size
 */
STATIC uint
xfs_calc_create_reservation(
        struct xfs_mount        *mp)
{
        return XFS_DQUOT_LOGRES(mp) +
                MAX((mp->m_sb.sb_inodesize +
                     mp->m_sb.sb_inodesize +
                     mp->m_sb.sb_sectsize +
                     XFS_FSB_TO_B(mp, 1) +
                     XFS_DIROP_LOG_RES(mp) +
                     128 * (3 + XFS_DIROP_LOG_COUNT(mp))),
                    (3 * mp->m_sb.sb_sectsize +
                     XFS_FSB_TO_B(mp, XFS_IALLOC_BLOCKS(mp)) +
                     XFS_FSB_TO_B(mp, mp->m_in_maxlevels) +
                     XFS_ALLOCFREE_LOG_RES(mp, 1) +
                     128 * (2 + XFS_IALLOC_BLOCKS(mp) + mp->m_in_maxlevels +
                            XFS_ALLOCFREE_LOG_COUNT(mp, 1))));
}

> > How do you know what "one second" of "in flight" operations is
> > going to be?
> 
> Well, that's what I discuss later; it is a "rule of thumb" based
> on *some* rationale, but I have been questioning it.
> 
> [ ... interesting summary of some of the many issues related to
> journal sizing ... ]
> 
> > Easiest and most reliable method seems to be to size your
> > journal appropriatly in the first place and have you
> > algorithms key off that....
> 
> Sure, but *I* am asking that question :-).

And my response is that there is no one correct answer, and that
physical limits are usually the issue...

> >> This seems to me a fairly bad idea, because then the journal
> >> becomes a massive hot spot on the disk and draws the disk arm
> >> like a black hole. I suspect that operations should not stay on
> 
> > That's why you can configure an external log....
> 
> ...and lose barriers :-). But indeed.

As always, if performance and data safety is your concern, spend a
few hundred dollars more and buy a decent HW RAID card with a BBWC....

> >> the journal for a long time. However if the journal is too
> >> small processes that do metadata updates start to hang on it.
> 
> > Well, yes. The journal needs to be large enough to hold all
> > the transaction reservations for the active transactions. XFS,
> > in the worst case for a default filesystem config, needs about
> > 100MB of log space per 300 concurrent transactions. [ ... ]
> 
> So something like 300KB per transaction?

Yup. And the size is dependent on filesystem block size, filesystem
and AG size (max btree depths). So for a 64k block size filesystem,
that 300KB transaction reservation blows out to about 3MB....

> That seems a pretty
> extreme worst case. How is that possible? A metadata transaction
> with a "dirty region" of 300KB sound enormously expensive. It may
> be about extent maps for a very fragmented file I guess.

It's actually very small. Have you ever looked at how much metadata
a directory contains?  Rule of thumb is that a directory consumes
about 100MB of metadata for every million entries for average-length
filenames. Having a create transaction consume 300KB at maximum for
a worst-case modification of a directory with a million, 10M or 100M
entries makes that 300KB look pretty small...


> clear here what  concurrent  means because the log is sequential.
> I'll guess that it means "in flight".
> 
> [ ... ]
> 
> >> * What should journal size be proportional to?
> 
> > Your workload.
> 
> Sure, as a very top level goal. But that's not an answer, it is
> handwaving. As you argue earlier, it could be proportional in some
> cases to IO threads; or it could be number of arms, filesystem
> size, size of each volume, sequential transfer rate, random
> transfer rate, large IO transfer rate, small IO transfer rate, ...

Nice definition of "workload dependent".

> Some tighter guideline might be better than just guessing.
> 
> >> * What is the downside of a too small journal?
> 
> > Performance sucks.
> 
> But why? Without a journal at all, performance is better;
> assuming a one-transaction journal, this becomes slower because
> of writing everything twice, but that happens for any size of
> journal, as it is unavoidable.

Why does having a writeback cache improve performance? Larger
journals enable longer caching of dirty metadata before writeback
must occur.

> When the journal fills up the effect is the same as that of a 1
> transaction journal. That's the same for every type of buffer.

And then you've got the problem of having to wait for those 10
objects to complete IO before you can do another transaction, while
if you have a large log, you can push on it before you run out of
space to try to ensure it never stalls. And when you have
100,000 metadata objects to write back, you can optimise the IO a
whole lot better than when you only have 10 objects.

> So the effect of a journal larger than 1 transaction must be
> felt only when the journal is not full,

Sure, and we've spent years optimising the metadata flushing to
ensure we empty the log as fast as possible under sustained
workloads. You need enough space in the journal to decouple
transactions from the flow of metadata writeback - how much is very
workload dependent.

> that is there are pauses
> in the flow of transactions; and then it does not matter a lot
> just how large the journal is.
>
> So the journal should be large enough to accommodate the highest
> possible rate of metadata updates for the longest time this
> happens until there is a pause in the metadata updates.

We need to be able to sustain hundreds of thousands of transactions
per second, every second, 24x7. There are no "pauses" we can take
advantage of to "catch up" - metadata writeback must take place
simultaneously with new transactions, and the journal must be large
enough to decouple these effectively.

> This of course depends on workload, but some rule of thumb based
> on experience might help.

Sure - we encode that experience in the mkfs and kernel default
behaviour. 

> And here my guess is that shorter journals are better than
> longer ones, because also:
> 
> >> * What is the downside of a too large journal other than space?
> 
> > Recovery times too long, lots of outstanding metadata pinned
> > in memory (hello OOM-killer!), and other resource management
> > related scalability issues.
> 
> I would also have expected more seeks, as reading logged but not
> yet finalized metadata has to go back to the journal, but I guess
> that's a small effect.

Say what? Nobody reads from the journal except during recovery.
Anything that is in the journal is dirty in memory, so any reads
come from the memory objects, not the journal....

> > Got a supplier for the custom hardware you'd need?
> 
> There are still a few, for example at different ends of the scale:
> 
>   http://www.ramsan.com/solutions/oracle/
>   http://www.microdirect.co.uk/home/product/39434/ACARD-RAM-Disk-SSD-ANS-9010B-6X-DDR-II-Slots

Neither of them are what I'd consider "battery backed RAM" - to the
filesystem they are simply fast block devices behind a SATA/SAS/FC
interface.  Effectively no different to a SAS/SATA/FC- or PCIe-based
flash SSD.

> But as another contributor said a fast/small disk RAID1 might be
> quite decent in many situations.

Not fast enough for an XFS log - I can push >500MB/s through the XFS
journal on a device (12 disk (7200rpm) RAID-0) that will do 700MB/s
for sequential data IO.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: op-journaled fs, journal size and storage speeds
  2011-05-01  9:27 ` Dave Chinner
  2011-05-01 18:13   ` Peter Grandi
@ 2011-05-02  4:35   ` Stan Hoeppner
  1 sibling, 0 replies; 6+ messages in thread
From: Stan Hoeppner @ 2011-05-02  4:35 UTC (permalink / raw)
  To: xfs

On 5/1/2011 4:27 AM, Dave Chinner wrote:

> Got a supplier for the custom hardware you'd need? Just use a PCIe
> SSD....

50GB OCZ RevoDrive PCIe x4 SSD MLC NAND
Dual SandForce 1200 controllers, internal RAID 0 design
70,000 write IOPS, 4KB aligned
350MB/s write sustained

$200 USD at Newegg:
http://www.newegg.com/Product/Product.aspx?Item=N82E16820227596

Current best value for a PCIe SSD suitable for dedicated log drive use, 
can fit ~22 maximum size (2GB) XFS logs.  Note the MLC NAND.  If all 
your filesystems will sustain constant high rate metadata writes, an SLC 
based product is more suitable, though price is 10-50x higher for PCIe 
SLC cards.  If you want/need the 10x increase in flash cell life of SLC 
NAND, go with this Intel SLC SATAII SSD for ~2x the $$ of the Revo. 
Note its write IOPS are 'only' 33k, and its size is 32GB, 18GB less.

http://www.newegg.com/Product/Product.aspx?Item=N82E16820167013

-- 
Stan


* Re: op-journaled fs, journal size and storage speeds
  2011-05-01 18:13   ` Peter Grandi
  2011-05-02  1:23     ` Dave Chinner
@ 2011-05-02 10:40     ` Christoph Hellwig
  1 sibling, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2011-05-02 10:40 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs XFS, Linux fs JFS

On Sun, May 01, 2011 at 07:13:03PM +0100, Peter Grandi wrote:
> > That's why you can configure an external log....
> 
> ...and lose barriers :-). But indeed.

Using a writeback cache on the log device is rather pointless as
every write needs write-through semantics using FUA or a post-flush
anyway.  But I actually have a patch to allow for devices with
a writeback cache in external log configurations; it's just a bit
complicated as we basically need to copy the pre-flush state machine
into XFS to deal with the pre-flush being for a different device
than the actual write.

> >> But if they can be pretty small, I wonder whether putting the
> >> journals of several filesystems on the same storage device then
> >> becomes a sensible option as the locality will be quite narrow
> >> (e.g. a single physical cylinder) or it could be worthwhile like
> >> the database people do to journal to battery-backed RAM.
> 
> For example as described in this old paper:

It only makes sense if the log activity bursts for the different
filesystems happen at different times, or none of the filesystems
maxes out the log IOP rate.  

> But they seem to me fundamentally terrible for journals, because
> of the large erase block sizes and the enormous latency of erase
> operations (lots of read-erase-write cycles for small commits).
> They seem more oriented to large mostly read-only data sets than
> very small mostly write ones.

As mentioned earlier in this thread, XFS allows log writes to be
aligned and padded.  Just make sure to get a device with an erase block
size <= 256 kilobytes, which usually means SLC.  But even drives
with a larger erase block size and sane firmware tend to be faster
than plain old disks.  But as Dave mentioned there's nothing that's
going to beat a battery backed cache/memory for log IOP performance.

> The saving grace is the capacitor-backed RAM in SSDs (used to work
> around erase block size issues as you probably know) which to a
> significant extent may act as the  battery-backed RAM  I was
> mentioning; and similarly as another post says the  battery-backed
> RAM  in RAID host adapters would do much the same function.

Just make sure your device actually has it.  Both the Intel X25 SSDs
and many other consumer / prosumer SSDs actually don't have them
and will lose data in case of a power loss.

