From: Stan Hoeppner <stan@hardwarefreak.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Paul Anderson <pha@umich.edu>,
Christoph Hellwig <hch@infradead.org>, xfs-oss <xfs@oss.sgi.com>
Subject: Re: I/O hang, possibly XFS, possibly general
Date: Sat, 04 Jun 2011 20:31:44 -0500
Message-ID: <4DEADC80.8000200@hardwarefreak.com>
In-Reply-To: <20110604231032.GM32466@dastard>
On 6/4/2011 6:10 PM, Dave Chinner wrote:
> On Sat, Jun 04, 2011 at 07:11:50AM -0500, Stan Hoeppner wrote:
>> So, would delayed logging have possibly prevented his hang problem or
>> no? I always read your replies at least twice, and I don't recall you
>> touching on delayed logging in this thread. If you did and I missed it,
>> my apologies.
>
> It might, but delayed logging is not the solution to every problem,
> and NFS servers are notoriously heavy on log forces due to COMMIT
> operations during writes. So it's a good bet that delayed logging
> won't fix the problem entirely.
So the solution in this case will likely require a multi-pronged
approach, including the XFS optimizations and the RAID card and/or RAID
level reconfiguration that have been mentioned.
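For reference, delayed logging on kernels of that era was enabled via the
delaylog mount option (the device and mount point below are placeholders,
not paths from this thread):

```shell
# Enable XFS delayed logging at mount time (opt-in on ~2.6.35-2.6.38
# kernels; it became the default in 2.6.39).
# /dev/sdX1 and /mnt/data are placeholders for the real filesystem.
mount -o delaylog /dev/sdX1 /mnt/data

# or persistently via /etc/fstab:
# /dev/sdX1  /mnt/data  xfs  delaylog  0 0
```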
>> Do you believe MLC based SSDs are simply never appropriate for anything
>> but consumer use, and that only SLC devices should be used for real
>> storage applications? AIUI SLC flash cells do have about a 10:1 greater
>> lifetime than MLC cells. However, there have been a number of
>> articles/posts demonstrating math which shows a current generation
>> SandForce based MLC SSD, under a constant 100MB/s write stream, will run
>> for 20+ years, IIRC, before sufficient live+reserved spare cells burn
>> out to cause hard write errors, thus necessitating drive replacement.
>> Under your 500MB/s load, assuming that's constant, the drives would
>> theoretically last 4+ years. If that 500MB/s load was only for 12 hours
>> each day, the drives would last 8+ years. I wish I had one of those
>> articles bookmarked...
>
> That's the theory, anyway. Let's call it an expected 4 year life
> cycle under this workload (which is highly optimistic, IMO). Now you
> have two drives in RAID1, that means one will fail in 2 years, or if
> you need more drives to sustain that performance the log needs (*)
> you might be looking at 4 or more drives, and that brings the expected
> failure rate down under one drive per year. Multiply that across
> 5-10 servers, and that's a drive failure every month just on the log
> devices.
Very good point. I was looking at single-system probabilities instead
of farm-scale failure rates (shame on me for that newbish oversight).
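The back-of-envelope math above scales linearly with bytes written, so it
can be sketched in a few lines. The reference figures (20+ years at a
constant 100MB/s) are the ones quoted earlier in the thread, not vendor
specs:

```python
# SSD endurance estimate, assuming the thread's reference point of a
# ~20-year life under a constant 100 MB/s write stream. Endurance is
# modeled as inversely proportional to total bytes written.

REFERENCE_RATE_MBS = 100.0    # write rate in the cited articles
REFERENCE_LIFE_YEARS = 20.0   # claimed drive life at that rate

def drive_life_years(write_rate_mbs, duty_cycle=1.0):
    """Expected drive life, scaling inversely with effective write rate."""
    effective_rate = write_rate_mbs * duty_cycle
    return REFERENCE_LIFE_YEARS * REFERENCE_RATE_MBS / effective_rate

def fleet_failures_per_year(drives_per_server, servers, life_years):
    """Expected failures per year across a farm of identical drives."""
    return drives_per_server * servers / life_years

print(drive_life_years(500))        # 500 MB/s sustained -> 4.0 years
print(drive_life_years(500, 0.5))   # same load 12 h/day -> 8.0 years
# 4 log drives per server across 10 servers, 4-year drive life:
print(fleet_failures_per_year(4, 10, drive_life_years(500)))  # -> 10.0
```

Ten expected failures per year across the farm is roughly the "drive
failure every month just on the log devices" Dave describes.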
> That failure rate would make me extremely nervous - losing the log
> is a -major- filesystem corruption event - and make me want to spend
> more money or change the config to reduce the risk of a double
> failure causing the log device to be lost. Especially if there are
> hundreds of terabytes of data at risk.
> Cheers,
>
> Dave.
>
> (*) You have to consider that sustained workloads mean that the
> drives don't get idle time to trigger background garbage collection,
> which is one of the key features that current consumer level drives
> rely on for maintaining performance and even wear levelling. The
> "spare" area in the drives is kept small because it is assumed that
> there won't be long term sustained IO so that the garbage collection
> can clean up before spare area is exhausted.
>
> Enterprise drives have a much larger relative percentage of flash in
> the drive reserved as spare to avoid severe degradation in such
> sustained (common enterprise) workloads. Hence performance on
> consumer MLC drives tails off much more quickly than SLC drives.
Ahh, I didn't realize the SLC drives have much larger reserved areas.
Shame on me again. A hardwarefreak should know such things. :(
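The size of that spare area is easy to estimate from advertised versus raw
capacity. The capacities below are typical illustrative examples, not
figures from this thread:

```python
# Over-provisioning (spare flash) as a percentage of user-visible
# capacity. Raw flash comes in binary GiB; drives are sold in decimal GB,
# so even "zero OP" consumer drives keep a few percent in reserve.

def overprovision_pct(raw_gib, user_gb):
    """Spare area as a percentage of the user-visible capacity."""
    raw_gb = raw_gib * 2**30 / 1e9   # binary GiB -> decimal GB
    return (raw_gb - user_gb) / user_gb * 100

# consumer MLC: 128 GiB of flash sold as 128 GB usable -> ~7% spare
print(round(overprovision_pct(128, 128), 1))
# enterprise: the same 128 GiB sold as 100 GB usable -> ~37% spare
print(round(overprovision_pct(128, 100), 1))
```

The larger reserve is what lets garbage collection and wear levelling keep
up under sustained writes, per Dave's point above.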
> Hence performance on consumer MLC drives may not be sustainable, and
> wear leveling may not be optimal, resulting in flash failure earlier
> than you expect. To maintain performance, you'll need more MLC
> drives to maintain baseline performance. And with more drives, the
> chance of failure goes up...
Are the enterprise SLC drives able to perform garbage collection, etc.,
while under such constant load? If not, is it always better to use
spinning rust drives (SRDs) for the log, either internal on a BBWC array,
or an external mirrored pair?
I previously mentioned I always read your posts twice. You are a deep
well of authoritative information and experience. Keep up the great
work and contribution to the knowledge base of this list.
--
Stan
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs