* Is this expected RAID10 performance?
@ 2013-06-06 23:52 Steve Bergman
2013-06-07 3:25 ` Stan Hoeppner
` (2 more replies)
0 siblings, 3 replies; 32+ messages in thread
From: Steve Bergman @ 2013-06-06 23:52 UTC (permalink / raw)
To: linux-raid
I have a Dell T310 server set up with 4 Seagate ST2000NM0011 2TB
drives connected to the 4 onboard SATA (3Gbit/s) ports of the
motherboard. Each drive is capable of doing sequential writes at
151MB/s and sequential reads at 204MB/s according to bonnie++. I've
done an installation of Scientific Linux 6.4 (RHEL 6.4) and let the
installer set up the RAID10 and logical volumes. What I got was a
RAID10 device with a 512K chunk size, and ext4 extended options of
stride=128 & stripe-width=256, with a filesystem block size of 4k. All
of this seems correct to me.
But when I run bonnie++ on the array (with ext4 mounted
data=writeback,nobarrier) I get a sequential write speed of only
160MB/s, and a sequential read speed of only 267MB/s. I've verified
that the drives' write caches are enabled.
"sar -d" shows all 4 drives in operation, writing 80MB/s during the
sequential write phase, which agrees with the 160MB/s I'm seeing for
the whole array. (I haven't monitored the read test with sar.)
Is this about what I should expect? I would have expected both read
and write speeds to be higher. As it stands, writes are barely any
faster than for a single drive. And reads are only ~30% faster.
Thanks for any info,
Steve Bergman
* Re: Is this expected RAID10 performance?
2013-06-06 23:52 Is this expected RAID10 performance? Steve Bergman
@ 2013-06-07 3:25 ` Stan Hoeppner
2013-06-07 7:51 ` Roger Heflin
2013-06-08 18:23 ` keld
2 siblings, 0 replies; 32+ messages in thread
From: Stan Hoeppner @ 2013-06-07 3:25 UTC (permalink / raw)
To: Steve Bergman; +Cc: linux-raid
On 6/6/2013 6:52 PM, Steve Bergman wrote:
> I have a Dell T310 server set up with 4 Seagate ST2000NM0011 2TB
> drives connected to the 4 onboard SATA (3Gbit/s) ports of the
> motherboard. Each drive is capable of doing sequential writes at
> 151MB/s and sequential reads at 204MB/s according to bonnie++. I've
> done an installation of Scientific Linux 6.4 (RHEL 6.4) and let the
> installer set up the RAID10 and logical volumes. What I got was a
> RAID10 device with a 512K chunk size, and ext4 extended options of
> stride=128 & stripe-width=256, with a filesystem block size of 4k. All
> of this seems correct to me.
If this is vanilla RAID10, not one of md's custom layouts, then a 512K
chunk gives an EXT4 stride of 512K and stripe-width of 1MB. So your
parameters don't match. Fix your EXT4 alignment to match the array.
> But when I run bonnie++ on the array (with ext4 mounted
> data=writeback,nobarrier) I get a sequential write speed of only
> 160MB/s, and a sequential read speed of only 267MB/s. I've verified
> that the drives' write caches are enabled.
So with md/RAID10 each drive is delivering only about half of your
single-disk bonnie++/EXT4 throughput.
> "sar -d" shows all 4 drives in operation, writing 80MB/s during the
> sequential write phase, which agrees with the 160MB/s I'm seeing for
> the whole array. (I haven't monitored the read test with sar.)
>
> Is this about what I should expect? I would have expected both read
> and write speeds to be higher. As it stands, writes are barely any
> faster than for a single drive. And reads are only ~30% faster.
It's not uncommon to see per-drive throughput in an md array that is
lower than that of a standalone drive. A number of things affect this,
including the benchmark itself, the number of concurrent threads
issuing IOs (i.e. you need overlapping IO), and the Linux tunables in
/sys/block/sdX/queue/. One of the most important is the elevator. CFQ,
typically the default, will yield suboptimal throughput with arrays,
whether hardware or software. With md and no HBA BBWC you'll want to
use deadline. If your bonnie test used X threads for the single-drive
test then double that number for your RAID10 test, as you have 2x as
many data spindles.
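For what it's worth, a minimal sketch of what I mean (device names are
examples only; adjust to your system, and note the setting doesn't
survive a reboot):

  # show the current elevator; the bracketed entry is active
  cat /sys/block/sda/queue/scheduler
  # switch all four array members to deadline
  for d in sda sdb sdc sdd; do
    echo deadline > /sys/block/$d/queue/scheduler
  done
  # other per-device tunables live in the same directory, e.g.
  cat /sys/block/sda/queue/nr_requests /sys/block/sda/queue/read_ahead_kb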
What md/RAID10 layout did the installer choose for you?
Does throughput improve when you change the EXT4 alignment?
Have you performed any other throughput testing other than bonnie?
Are you using buffered IO or O_DIRECT?
One last note. Test with parameters you will use in production. Do not
test with barriers disabled. You need them in production to prevent
filesystem corruption. The point of benchmarking isn't to determine the
absolute maximal throughput of the hardware. The purpose is to
determine how much of that you can actually get with your production
workload.
--
Stan
* Re: Is this expected RAID10 performance?
2013-06-06 23:52 Is this expected RAID10 performance? Steve Bergman
2013-06-07 3:25 ` Stan Hoeppner
@ 2013-06-07 7:51 ` Roger Heflin
2013-06-07 8:07 ` Alexander Zvyagin
2013-06-08 18:23 ` keld
2 siblings, 1 reply; 32+ messages in thread
From: Roger Heflin @ 2013-06-07 7:51 UTC (permalink / raw)
To: Steve Bergman; +Cc: Linux RAID
Looking up that disk on Seagate's website, it lists the max sustained
transfer rate for that specific disk model as 140MB/second.
That figure is for the outside of the disk, where the most sectors are;
the inner part of the disk will be slower, and even that is under
perfect conditions. So some of your initial bonnie++ results may be
from caching effects, and appear to be higher than is possible given
the disk model.
It is also possible that the on-board chipset cannot sustain more
disks running at the same time.
Try reading a disk with something like this: "dd if=/dev/<device>
of=/dev/null bs=1M count=4096" and see what speed it reports, as this
usually gets pretty close to the raw disk speed. Then try doing 4 of
those at the same time, one to each of the devices, and see if the
rate holds up with all 4 running.
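Roughly like this (a sketch only; substitute your real device names for
sda..sdd, and watch the per-disk numbers with iostat from another
terminal):

  # one drive at a time first
  dd if=/dev/sda of=/dev/null bs=1M count=4096
  # then all four in parallel
  for d in sda sdb sdc sdd; do
    dd if=/dev/$d of=/dev/null bs=1M count=4096 &
  done
  wait
  # in the other terminal: iostat -mx 2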
On Thu, Jun 6, 2013 at 6:52 PM, Steve Bergman <sbergman27@gmail.com> wrote:
> I have a Dell T310 server set up with 4 Seagate ST2000NM0011 2TB
> drives connected to the 4 onboard SATA (3Gbit/s) ports of the
> motherboard. Each drive is capable of doing sequential writes at
> 151MB/s and sequential reads at 204MB/s according to bonnie++. I've
> done an installation of Scientific Linux 6.4 (RHEL 6.4) and let the
> installer set up the RAID10 and logical volumes. What I got was a
> RAID10 device with a 512K chunk size, and ext4 extended options of
> stride=128 & stripe-width=256, with a filesystem block size of 4k. All
> of this seems correct to me.
>
> But when I run bonnie++ on the array (with ext4 mounted
> data=writeback,nobarrier) I get a sequential write speed of only
> 160MB/s, and a sequential read speed of only 267MB/s. I've verified
> that the drives' write caches are enabled.
>
> "sar -d" shows all 4 drives in operation, writing 80MB/s during the
> sequential write phase, which agrees with the 160MB/s I'm seeing for
> the whole array. (I haven't monitored the read test with sar.)
>
> Is this about what I should expect? I would have expected both read
> and write speeds to be higher. As it stands, writes are barely any
> faster than for a single drive. And reads are only ~30% faster.
>
> Thanks for any info,
> Steve Bergman
* Re: Is this expected RAID10 performance?
2013-06-07 7:51 ` Roger Heflin
@ 2013-06-07 8:07 ` Alexander Zvyagin
2013-06-07 10:44 ` Steve Bergman
0 siblings, 1 reply; 32+ messages in thread
From: Alexander Zvyagin @ 2013-06-07 8:07 UTC (permalink / raw)
To: Roger Heflin; +Cc: Steve Bergman, Linux RAID
> And it is also possible that running more disks at the same time
> cannot be sustained by the on-board chipset.
to check this, start "dd" or "badblocks" or something similar (which
will put the disk under 100% load) on all your drives one-by-one and
monitor throughput with "iostat" (or similar). You may face the
following 'problem':
1. start badblocks on /dev/sda, throughput is 140 MB/s
2. start badblocks on /dev/sdb, throughput is 140 MB/s on /dev/sda and /dev/sdb
3. start badblocks on /dev/sdc, throughput is 140 MB/s on /dev/sda and
70 MB/s on /dev/sdb,/dev/sdc
Alexander
* Re: Is this expected RAID10 performance?
2013-06-07 8:07 ` Alexander Zvyagin
@ 2013-06-07 10:44 ` Steve Bergman
2013-06-07 10:52 ` Roman Mamedov
2013-06-07 12:39 ` Stan Hoeppner
0 siblings, 2 replies; 32+ messages in thread
From: Steve Bergman @ 2013-06-07 10:44 UTC (permalink / raw)
To: Linux RAID
Stan, Roger, Alexander,
Thanks for the helpful posts. After posting, I decided to study up a
bit on what SATA 3Gb/s actually means. It turns out that the 3Gbit/s
bandwidth is aggregate per controller. This is a 4-port SATA
controller, so with 1 drive, the single drive gets all 3Gbit/s. With 4
operating simultaneously, each would get 750Mbit/s. There is supposed
to be about a 20% overhead involved in the SATA internals, so that
number drops to ~600Mbit/s. This is 75MByte/s, which is about what I'm
seeing on writes. For reads, I would expect to see ~300MBytes/s, and
am seeing 260MBytes/s, which is not too far off.
This is not really a problem for me, as the workloads I'm concerned
about are seekier than this, and are not bandwidth limited. (e.g.
Rebuilding indexes of Cobol C/ISAM files, and it's doing well on
that.) Mainly, I just wanted to make sure that this wasn't an
indication that I was doing something wrong, and to see if maybe there
was something to be learned here (which there was). Bonnie++ does
report that the RAID10 is doing ~2x the number of seeks/s as the
single-drive configuration. I'll be comparing the results of "iozone
-ae" between single-drive and RAID10 later today, to get a more
fine-grained view of the relative write performance.
BTW Stan, for ext4, stride and stripe-width are specified in filesystem
blocks rather than in KB. In this case, I'm using the default 4k block
size. So stride should be:
chunksize / blocksize = 512k / 4k = 128
and the stripe-width should be:
stride * number of mirrored sets
In this case, I have 2 mirrored sets. So stripe-width should be 128 * 2 = 256.
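In other words, just as a sketch of the same arithmetic (the 4k block
size is my setup, and /dev/md0 here is a placeholder):

  # stride = chunk / block = 512k / 4k = 128
  # stripe-width = stride * data spindles = 128 * 2 = 256
  mkfs.ext4 -b 4096 -E stride=128,stripe-width=256 /dev/md0
  # IIRC tune2fs accepts the same -E options for an existing filesystem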
-Steve Bergman
On Fri, Jun 7, 2013 at 3:07 AM, Alexander Zvyagin
<zvyagin.alexander@gmail.com> wrote:
>> And it is also possible that running more disks at the same time
>> cannot be sustained by the on-board chipset.
>
> to check this, start "dd" or "badblocks" or something similar (which
> will put the disk under 100% load) on all your drives one-by-one and
> monitor throughput with "iostat" (or similar). You may face the
> following 'problem':
> 1. start badblocks on /dev/sda, throughput is 140 MB/s
> 2. start badblocks on /dev/sdb, throughput is 140 MB/s on /dev/sda and /dev/sdb
> 3. start badblocks on /dev/sdc, throughput is 140 MB/s on /dev/sda and
> 70 MB/s on /dev/sdb,/dev/sdc
>
> Alexander
* Re: Is this expected RAID10 performance?
2013-06-07 10:44 ` Steve Bergman
@ 2013-06-07 10:52 ` Roman Mamedov
2013-06-07 11:25 ` Steve Bergman
2013-06-07 12:39 ` Stan Hoeppner
1 sibling, 1 reply; 32+ messages in thread
From: Roman Mamedov @ 2013-06-07 10:52 UTC (permalink / raw)
To: Steve Bergman; +Cc: Linux RAID
On Fri, 7 Jun 2013 05:44:00 -0500
Steve Bergman <sbergman27@gmail.com> wrote:
> Thanks for the helpful posts. After posting, I decided to study up a
> bit on what SATA 3Gb/s actually means. It turns out that the 3Gbit/s
> bandwidth is aggregate per controller.
This is just plain wrong. I wonder where you find b/s like this (maybe
post the actual link to such misinformation so we could try getting it
corrected, or removed if it's some wiki?). Unless your controller is built
to use an onboard PMP (port multiplier), there is no such thing as a 3 Gbit/sec
controller-wide limitation. But of course it's still limited by whatever bus
it sits on; if it's a narrow PCI-E 1x, that might well come into play.
--
With respect,
Roman
* Re: Is this expected RAID10 performance?
2013-06-07 10:52 ` Roman Mamedov
@ 2013-06-07 11:25 ` Steve Bergman
2013-06-07 13:18 ` Stan Hoeppner
0 siblings, 1 reply; 32+ messages in thread
From: Steve Bergman @ 2013-06-07 11:25 UTC (permalink / raw)
To: Linux RAID
I don't have the source link handy, but it was an industry white
paper. (I doubt you'll get it changed.)
There does seem to be some interface limitation here. Running "dd
if=/dev/sdX of=/dev/null bs=512k" simultaneously for various
combinations of drives gives me:
Port A alone : 155MByte/s
Ports A & B : 105MByte/s per drive
Ports A, B & C: 105MByte/s per drive for A & B; 155MByte/s for C.
Ports A, B, C & D: 105MByte/s per drive
So there's an aggregate limitation of ~1.7Gbit/s per port pair, with
A&B and C&D making up the pairs.
More information on the interfaces:
From lshw:
description: IDE interface
product: 5 Series/3400 Series Chipset 4 port SATA IDE Controller
vendor: Intel Corporation
physical id: 1f.2
bus info: pci@0000:00:1f.2
logical name: scsi0
logical name: scsi1
version: 05
width: 32 bits
clock: 66MHz
capabilities: ide pm bus_master cap_list emulated
configuration: driver=ata_piix latency=0
resources: irq:20 ioport:dca0(size=8) ioport:dc90(size=4)
ioport:dca8(size=8) ioport:dc94(size=4) ioport:dcc0(size=16)
ioport:dcd0(size=16)
From dmesg:
ata1.00: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
From lspci:
IDE interface: Intel Corporation 5 Series/3400 Series Chipset 4 port
SATA IDE Controller (rev 05) (prog-if 8f [Master SecP SecO PriP PriO])
Subsystem: Dell Device 02a4
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 20
Region 0: I/O ports at dca0 [size=8]
Region 1: I/O ports at dc90 [size=4]
Region 2: I/O ports at dca8 [size=8]
Region 3: I/O ports at dc94 [size=4]
Region 4: I/O ports at dcc0 [size=16]
Region 5: I/O ports at dcd0 [size=16]
Capabilities: [70] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [b0] PCI Advanced Features
AFCap: TP+ FLR+
AFCtrl: FLR-
AFStatus: TP-
Kernel driver in use: ata_piix
Kernel modules: ata_generic, pata_acpi, ata_piix
-Steve Bergman
* Re: Is this expected RAID10 performance?
2013-06-07 11:25 ` Steve Bergman
@ 2013-06-07 13:18 ` Stan Hoeppner
2013-06-07 13:54 ` Steve Bergman
0 siblings, 1 reply; 32+ messages in thread
From: Stan Hoeppner @ 2013-06-07 13:18 UTC (permalink / raw)
To: Steve Bergman; +Cc: Linux RAID
On 6/7/2013 6:25 AM, Steve Bergman wrote:
> I don't have the source link handy, but it was an industry white
> paper. (I doubt you'll get it changed.)
>
> There does seem to be some interface limitation here. Running "dd
> if=/dev/sdX of=/dev/null bs=512k" simultaneously for various
> combinations of drives gives me:
It's not a bus interface limitation. The DMI link speed on the 5 Series
PCH is 1.25GB/s each way.
> Port A alone : 155MByte/s
> Ports A & B : 105MByte/s per drive
> Ports A, B & C: 105MBytes/s per drive for A & B. 155MBytess's for C.
> Ports A, B, C & D: 105MBytes/s per drive
>
> So there's an aggregate limitation of ~1.7Gbit/s per port pair, with
> A&B and C&D making up the pairs.
This may be a quirk of the 5 Series Southbridge. But note you're using
the standard ICH driver. Switch to the AHCI driver and you may see some
gains here. Also try the deadline elevator. I mentioned it because I
intended for you to use it. This wasn't an "optional" thing. It will
improve performance over CFQ. This isn't guesswork. Everyone in Linux
storage knows this to be true, just as they all know to use noop with
SSDs and hardware RAID w/[F|B]BWC.
Which kernel version and OS is this again?
--
Stan
* Re: Is this expected RAID10 performance?
2013-06-07 13:18 ` Stan Hoeppner
@ 2013-06-07 13:54 ` Steve Bergman
2013-06-07 21:43 ` Bill Davidsen
2013-06-07 23:33 ` Stan Hoeppner
0 siblings, 2 replies; 32+ messages in thread
From: Steve Bergman @ 2013-06-07 13:54 UTC (permalink / raw)
To: Linux RAID
This is Scientific Linux (RHEL) 6.4. That's nominally kernel 2.6.32,
but that doesn't tell one much. The RHEL kernel is the RHEL kernel,
with features selected from far more recent kernels included, more
being added with every point release. (e.g., I have the dm-thin target
available in LVM2.)
I've used both CFQ and Deadline for testing. It doesn't make a
measurable difference for either the multiple dd's or for the
single-threaded C/ISAM rebuild. (In fact, deadline, while often better
for servers, can have problems with mixed sequential/random access
workloads. At least according to what I've seen over on the PostgreSQL
lists. It's no surprise that deadline doesn't help my single-threaded
workload. Also note that deadline has shown itself to be slightly
superior to noop for SSD's in certain benchmarks.) There's no one size
fits all answer. Until the particular workload is actually tested, it
*is* guesswork. I/O scheduling is too complicated for it to be
otherwise.
The chipset supports AHCI, but unfortunately it's turned off on the
PET310, and the setting is not exposed in the BIOS setup, despite the
fact that Dell advertises AHCI capability. It would do AHCI if I
bought one of the optional SAS controllers.
Since this is an unusual RAID10 situation, and I have plenty of spare
processor available, I'm going to try RAID5 over the weekend. I've
never used it. But I'm guessing that parity might come at a lower
bandwidth cost than mirroring. Should be a fun weekend. :-)
BTW, any recommendations on chunk size?
-Steve
* Re: Is this expected RAID10 performance?
2013-06-07 13:54 ` Steve Bergman
@ 2013-06-07 21:43 ` Bill Davidsen
2013-06-07 23:33 ` Stan Hoeppner
1 sibling, 0 replies; 32+ messages in thread
From: Bill Davidsen @ 2013-06-07 21:43 UTC (permalink / raw)
Cc: Linux RAID
Steve Bergman wrote:
> I've used both CFQ and Deadline for testing. It doesn't make a
> measurable difference for either the multiple dd's or for the
> single-threaded C/ISAM rebuild. (In fact, deadline, while often better
> for servers, can have problems with mixed sequential/random access
> workloads. At least according to what I've seen over on the PostgreSQL
> lists. It's no surprise that deadline doesn't help my single-threaded
> workload. Also note that deadline has shown itself to be slightly
> superior to noop for SSD's in certain benchmarks.) There's no one size
> fits all answer. Until the particular workload is actually tested, it
> *is* guesswork. I/O scheuling is too complicated for it to be
> otherwise.
Can't say one way or the other on SSD, but I can't measure any big benefit of
deadline on RAID-5 or RAID-10. I haven't done proper testing on RAID-6, so I
can't say.
> Since this is an unusual RAID10 situation, and I have plenty of spare
> processor available, I'm going to try RAID5 over the weekend. I've
> never used it. But I'm guessing that parity might come at a lower
> bandwidth cost than mirroring. Should be a fun weekend. :-)
When testing RAID-10, be sure you set it up for 'far' copies, since this should
improve transfer rate, particularly under single-threaded reads.
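Something along these lines (a sketch only; device names are examples,
and mdadm --create will of course destroy whatever is on them):

  mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=4 /dev/sd[abcd]1
  # f2 = 'far' layout with 2 copies; n2 ('near') is the default layout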
--
Bill Davidsen <davidsen@tmr.com>
We are not out of the woods yet, but we know the direction and have
taken the first step. The steps are many, but finite in number, and if
we persevere we will reach our destination. -me, 2010
* Re: Is this expected RAID10 performance?
2013-06-07 13:54 ` Steve Bergman
2013-06-07 21:43 ` Bill Davidsen
@ 2013-06-07 23:33 ` Stan Hoeppner
1 sibling, 0 replies; 32+ messages in thread
From: Stan Hoeppner @ 2013-06-07 23:33 UTC (permalink / raw)
To: Steve Bergman; +Cc: Linux RAID
On 6/7/2013 8:54 AM, Steve Bergman wrote:
> This is Scientific Linux (RHEL) 6.4. That's nominally kernel 2.6.32,
> but that doesn't tell one much. The RHEL kernel is the RHEL kernel,
> with features selected from far more recent kernels included, more
> being added with every point release. (e.g., I have the dm-thin target
> available in LVM2.)
Yeah, Red Hat marches to the beat of their own drummer. Their version
string is meaningless. They have newer kernel features in their 2.6.32
that rely on features only available in later upstream kernels, that
can't be backported to 2.6.32. Thus the core kernel isn't based on 2.6.32.
> I've used both CFQ and Deadline for testing. It doesn't make a
> measurable difference for either the multiple dd's or for the
> single-threaded C/ISAM rebuild. (In fact, deadline, while often better
> for servers, can have problems with mixed sequential/random access
> workloads. At least according to what I've seen over on the PostgreSQL
> lists. It's no surprise that deadline doesn't help my single-threaded
> workload. Also note that deadline has shown itself to be slightly
> superior to noop for SSD's in certain benchmarks.) There's no one size
> fits all answer. Until the particular workload is actually tested, it
> *is* guesswork. I/O scheuling is too complicated for it to be
> otherwise.
Caveat: I'm an XFS user. CFQ simply gives sub par to horrible
performance with XFS, regardless of workload or hardware. The upstream
developers haven't supported XFS on CFQ for many years. Because of
things like the note at the bottom of this FAQ entry:
http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E
Considering one of, if not the, main reasons for using XFS is parallel
IO, this speaks volumes about CFQ's suitability for high performance
workloads.
> The chipset supports AHCI, but unfortunately it's turned off on the
Wow. A huge Intel partner castrating their integrated SATA controllers.
> PET310, and the setting is not exposed in the BIOS setup, despite the
> fact that Dell advertises AHCI capability. It would do AHCI if I
> bought one of the optional SAS controllers.
Interesting. SAS controllers typically don't use the AHCI interface.
In fact I know of none. Dell uses LSISAS ASICs on their SAS cards and
motherboard-down (onboard) controllers on their servers, and these use
the LSI mptsas driver, not AHCI.
> Since this is an unusual RAID10 situation, and I have plenty of spare
> processor available, I'm going to try RAID5 over the weekend. I've
> never used it. But I'm guessing that parity might come at a lower
> bandwidth cost than mirroring. Should be a fun weekend. :-)
If you're looking for increased performance, look elsewhere. RMW
(read-modify-write) latency typically gives you random write throughput
about 1/3rd to 1/5th that of
RAID10 with the same drive count. Sequential read may be slightly
faster than vanilla RAID10. However, as many are fond of mentioning,
using the far layout can get you sequential read close to that of pure
striping, so it'll be faster than RAID5 all around. There never has
been and never will be a performance advantage for RAID5, unless you're
using SSDs where RMW latency is effectively zero.
> BTW, any recommendations on chunk size?
32KB works well for just about any workload. Exceptions would be HPC or
media server workloads where you're writing files 10s of GB to TB in
size, especially in parallel.
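For example (a sketch; the device names are hypothetical and --create
is destructive):

  mdadm --create /dev/md0 --level=10 --raid-devices=4 --chunk=32 /dev/sd[abcd]1
  # --chunk is in KiB; verify with: mdadm --detail /dev/md0 | grep -i chunk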
--
Stan
* Re: Is this expected RAID10 performance?
2013-06-07 10:44 ` Steve Bergman
2013-06-07 10:52 ` Roman Mamedov
@ 2013-06-07 12:39 ` Stan Hoeppner
2013-06-07 12:59 ` Steve Bergman
1 sibling, 1 reply; 32+ messages in thread
From: Stan Hoeppner @ 2013-06-07 12:39 UTC (permalink / raw)
To: Steve Bergman; +Cc: Linux RAID
On 6/7/2013 5:44 AM, Steve Bergman wrote:
> Stan, Roger, Alexander,
>
> Thanks for the helpful posts. After posting, I decided to study up a
> bit on what SATA 3Gb/s actually means. It turns out that the 3Gbit/s
> bandwidth is aggregate per controller.
I don't know what you read but it was unequivocally wrong. SATA
specifies interface bandwidth per cable connection, i.e. per interface.
A 4 port 3G SATA controller has aggregate one way SATA interface b/w of
12Gb/s. If you have a throughput limitation it would be the bus (slot)
connection.
> This is a 4-port SATA
> controller, so with 1 drive, the single drive gets all 3Gbit/s. With 4
> operating simultaneously, each would get 750Mbit/s. There is supposed
> to be about a 20% overhead involved in the SATA internals, so that
> number drops to ~600Mbit/s. This is 75MByte/s, which is about what I'm
> seeing on writes. For reads, I would expect to see ~300MBytes/s, and
> am seeing 260MBytes/s, which is not too far off.
What you're seeing is a limitation of either a PCIe 1.0 x1 bus
connection, 250MB/s, or a 66MHz/32bit or 33MHz/64bit PCI/PCI-X slot,
264MB/s. You didn't mention the bus type. Gotta be one of these three
given your data.
> This is not really a problem for me, as the workloads I'm concerned
> about are seekier than this, and are not bandwidth limited....
Until you have to perform a rebuild or some other b/w intensive
operation. Then having full b/w per drive comes in handy.
> BTW Stan, for ext4 stride and stripe-width are specified in filesystem
> blocks rather than in K. In this case, I'm using the default 4k block
> size....
This is what happens when XFS people try to help folks using inferior
filesystems. ;) Yes, you're absolutely correct. I should have read
mke2fs(8) before responding. You can blame Ted et al for stealing XFS
concepts and then changing the names and value quantities (out of guilt
I guess). FYI, in modern XFS, bytes are used for stripe unit/width
values. The whole point of alignment is matching RAID geometry. RAID
geometry is in bytes, not fs block size multiples. Which is exactly why
XFS moved away from this arcane system many years ago.
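For example (a sketch using your array's numbers, a 512k chunk and 2
data spindles, with /dev/md0 as a placeholder):

  # su = stripe unit in bytes (k/m suffixes accepted), sw = data spindles
  mkfs.xfs -d su=512k,sw=2 /dev/md0
  # on an md device a plain 'mkfs.xfs /dev/md0' will usually pick the
  # geometry up by itself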
If your workload has any parallelism, reformat that sucker with XFS with
the defaults. You'll get better random IOPS performance than with EXT4,
and without alignment. Many folks don't realize that with some
workloads alignment is actually detrimental to performance, especially
with small file workloads.
--
Stan
* Re: Is this expected RAID10 performance?
2013-06-07 12:39 ` Stan Hoeppner
@ 2013-06-07 12:59 ` Steve Bergman
2013-06-07 20:51 ` Stan Hoeppner
0 siblings, 1 reply; 32+ messages in thread
From: Steve Bergman @ 2013-06-07 12:59 UTC (permalink / raw)
To: Linux RAID
66MHz/32bit matches the lshw output I posted. And the machine does
have 1 PCI-X slot. So I imagine they're using the same interface for
the onboard SATA controllers. Whatever it has, each of 2 pairs of SATA
ports seems to be on one of them.
No offense intended, but reliability is more important than
performance in this scenario. And although the machine is on a good
UPS with apcupsd installed, it's not sitting in a data center, but in
an office area. And I've found XFS to have pretty bad behavior on
unclean shutdowns. I'm used to the rock-solid reliability of ext3 in
ordered mode, so even ext4 seems a bit reckless to me. I did compare
XFS when it was configured to RAID1, and it was slightly better. Most
of what this machine will be doing is single-threaded. But XFS is not
an option for testing on an LV right now since the whole VG is sitting
on an RAID10 at the default 512k chunk size, and XFS doesn't support
larger than 256k chunks while maintaining optimal su and sw. I may
grab the machine and bring it back to my office so that I can work
with it "in person" over the weekend. It's currently remote to me.
-Steve
* Re: Is this expected RAID10 performance?
2013-06-07 12:59 ` Steve Bergman
@ 2013-06-07 20:51 ` Stan Hoeppner
0 siblings, 0 replies; 32+ messages in thread
From: Stan Hoeppner @ 2013-06-07 20:51 UTC (permalink / raw)
To: Steve Bergman; +Cc: Linux RAID
Given your stated needs this SATA limitation isn't an issue, but I
thought I'd pass some information along so you might understand your
Intel platform a little better, as well as some issues with various
Linux tools.
On 6/7/2013 7:59 AM, Steve Bergman wrote:
> 66MHz/32bit matches the lshw output I posted.
You cannot trust the bus information provided by lshw or lspci -v. Why
it is not correct I can't say as I've not looked at the code. This is a
known issue. But I can tell you a couple of things here you may want
to know.
1. The PCI bus interface provided by the 5 Series PCH is 33MHz, not
66MHz. See 5.1.1 on page 123 of the 5/3400 Series chipset
datasheet:
http://www.intel.com/content/dam/doc/datasheet/5-chipset-3400-chipset-datasheet.pdf
> And the machine does
> have 1 PCI-X slot. So I imagine they're using the same interface for
> the onboard SATA controllers. Whatever it has, each of 2 pairs of SATA
> ports seems to be on one of them.
2. The SATA controllers do not attach to the PCI or PCIe interfaces.
They attach to the internal bus of the Intel PCH, and communicate
to the CPU directly via DMI at 2.5GB/s duplex. See the diagram
on page 60 of the PDF linked above. Also worth noting is that there
are two SATA controllers each with 3 SATA channels, 6 total. Some
motherboards may not have connectors for all 6 channels.
3. The apparent bottleneck you're seeing is not due to the bandwidth
available on the back side of the SATA controllers. It could be a
limitation within the SATA controllers themselves, or it could be
that you're using legacy IDE mode instead of AHCI, maybe both.
As you said, performance isn't as critical as reliability. So it's not
worth your time to address this any further. I'm simply supplying you
with some limited information to correct a few misconceptions you have
about the PCH capabilities.
> No offense intended,
None taken.
> but reliability is more important than
> performance in this scenario. And although the machine is on a good
> UPS with apcupsd installed, it's not sitting in a data center, but in
> an office area. And I've found XFS to have pretty bad behavior on
> unlcean shutdowns.
No doubt you have. Note the last bug involving unclean shutdowns was
fixed some 3-4 years ago. You may want to take another look at XFS.
> I'm used to the rock-solid reliability of ext3 in
> ordered mode,
Heheh. I'm guessing you don't read the Linux rags or LWN. This has
been covered extensively. The EXT3 rock solid reliability was actually
the result of a hack designed to fix another problem. The inadvertent
side effect was that all data was flushed out to disk every few seconds,
5 IIRC. This made EXT3 very reliable, at the cost of performance. The
bug was fixed and the hack removed in EXT4. Then users and application
developers started complaining about the lack of reliability of EXT4.
EXT3 was "so reliable" that app devs stopped using fsync thinking it was
no longer needed, that EXT3 had magically solved the data on disk issue.
Google "o_ponies" for far more information. Or simple read this:
http://sandeen.net/wordpress/uncategorized/coming-clean-on-o_ponies/
> so even ext4 seems a bit reckless to me. I did compare
That's because you became acclimated to a broken filesystem, where, very
unusually, what was broken actually provided a beneficial side effect.
> XFS when it was configured to RAID1, and it was slightly better. Most
> of what this machine will be doing is single-threaded. But XFS is not
> an option for testing on an LV right now since the whole VG is sitting
> on an RAID10 at the default 512k chunk size, and XFS doesn't support
> larger than 256k chunks while maintaining optimal su and sw. I may
Sure it does. You simply end up with less-than-optimal journal (log)
performance due to hotspots. If your workload is not metadata-heavy
it's not an issue. If your workload involves mostly allocation and
files are stripe-width size or larger, then you reap the benefit of
alignment. If not, you don't. And if allocations are small, or the
workload is not allocation-heavy, you will likely decrease performance
due to FS alignment.
Worth noting, for the umpteenth time, is that the current md 512KB
default chunk size is insanely high, not suitable for most workloads,
and you should never use it. See the archives of this list, the XFS
archives, Google, etc, to understand the relationship between
chunk/strip size, spindle count, workload allocation write patterns, and
IO hot spots.
But as you are stuck with EXT4, this is academic. But, hopefully this
information may have future value to you, and others.
--
Stan
* Re: Is this expected RAID10 performance?
2013-06-06 23:52 Is this expected RAID10 performance? Steve Bergman
2013-06-07 3:25 ` Stan Hoeppner
2013-06-07 7:51 ` Roger Heflin
@ 2013-06-08 18:23 ` keld
2 siblings, 0 replies; 32+ messages in thread
From: keld @ 2013-06-08 18:23 UTC (permalink / raw)
To: Steve Bergman; +Cc: linux-raid
On Thu, Jun 06, 2013 at 06:52:03PM -0500, Steve Bergman wrote:
> I have a Dell T310 server set up with 4 Seagate ST2000NM0011 2TB
> drives connected to the 4 onboard SATA (3Gbit/s) ports of the
> motherboard. Each drive is capable of doing sequential writes at
> 151MB/s and sequential reads at 204MB/s according to bonnie++. I've
> done an installation of Scientific Linux 6.4 (RHEL 6.4) and let the
> installer set up the RAID10 and logical volumes. What I got was a
> RAID10 device with a 512K chunk size, and ext4 extended options of
> stride=128 & stripe-width=256, with a filesystem block size of 4k. All
> of this seems correct to me.
>
> But when I run bonnie++ on the array (with ext4 mounted
> data=writeback,nobarrier) I get a sequential write speed of only
> 160MB/s, and a sequential read speed of only 267MB/s. I've verified
> that the drives' write caches are enabled.
>
> "sar -d" shows all 4 drives in operation, writing 80MB/s during the
> sequential write phase, which agrees with the 160MB/s I'm seeing for
> the whole array. (I haven't monitored the read test with sar.)
>
> Is this about what I should expect? I would have expected both read
> and write speeds to be higher. As it stands, writes are barely any
> faster than for a single drive. And reads are only ~30% faster.
We have a wiki page on performance at https://raid.wiki.kernel.org/index.php/Performance
From the examples mentioned there you should be able to get something like
300 MB/s sequential write and 700 MB/s sequential read. Raid1 and raid10,near could
slow down your sequential reads considerably, while raid10,far and raid5 should
give you read speeds in the 700 MB/s range. Have a look at the bonnie++ results
reported there for a variety of chunk sizes etc.
best regards
keld
* Re: Is this expected RAID10 performance?
@ 2013-06-08 19:56 Steve Bergman
2013-06-09 3:08 ` Stan Hoeppner
2013-06-09 12:09 ` Ric Wheeler
0 siblings, 2 replies; 32+ messages in thread
From: Steve Bergman @ 2013-06-08 19:56 UTC (permalink / raw)
To: Linux RAID
First of all, thank you to the people who took the time to help
illuminate this issue.
To summarize... for unknown reasons, the 4-port SATA controller on the
Dell PET-310 has an aggregate limitation of ~1.75 Gbit/s on the A&B
and C&D port pairs. A single drive alone on a pair can get more than
its half-share of that (~155 MByte/s, i.e. ~1.3 Gbit/s), but when both
ports in a pair are read or written simultaneously, each port gets only
~0.87 Gbit/s. (Which is probably some higher nominal value minus some
overhead.)
The testing of (1) my workload and (2) sequential read/write, under
various RAID levels, filesystems, and chunk sizes got tedious, so I
decided to just automate the whole thing and let it run overnight. My
initial guess was that RAID5 might have some advantages in this
situation for sequential writes in that parity is less bandwidth
intensive for writes than is mirroring, and I almost always have
plenty of spare cpu cycles available. This turned out to be correct
for ext4. (xfs still liked RAID10.) The best numbers for sequential
read/write came from ext4 under 4-drive RAID5 at the default chunk
size of 512k. xfs did its best under RAID10 with chunk sizes of
either 32k or 64k (which came out about the same), but was not able to
match the ext4 write performance, or even come close to the read
performance.
The more important testing was of my actual target workload, which
does a huge number of random writes building up a pair of files which
are each ~2GB. My suspicion was that RAID10 would yield the better
performance here, since this is not a bandwidth-bound workload. This
turned out to be correct for both ext4 and xfs. Here, the best
performance again came from ext4 at the default chunk size of 512k,
where the operation completed (including sync) in 11m24s, with xfs
doing best at a 32k chunk size, and completing in 13m07s.
With that established, I decided to focus on ext4 at 512k. For the
system volumes, delayed allocation is acceptable. However, for the
data partition, leaving delayed allocation turned on would be
irresponsible. (We have point-of-sale data being collected throughout
the day which could not be recovered from backup.) The testing shows
that for this workload, mounting "nodelalloc" entails only a 7%
penalty in performance, which is quite acceptable (and still faster
than XFS).
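Concretely, that amounts to something like this (the device and mount
point are placeholders, not my real paths):

  # /etc/fstab, data volume only; system volumes keep the defaults
  /dev/vg_data/lv_data  /data  ext4  defaults,nodelalloc  0 2
  # I believe it can also be applied on the fly:
  mount -o remount,nodelalloc /data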
So that pretty much nails down my configuration. RAID10 with 512k
chunks. ext4 mounted nodelalloc for the data volume. And ext4 mounted
at the defaults for everything else.
Now, that said... and though I don't really intend to engage in a long
thread over this... the subject of XFS's suitability for this kind of
work has come up, and I'll address the key points, since I do believe
in calling a spade a spade. Even if xfs had come out ahead on
performance, I would not have considered it for my data partition.
It's been said here that the major data loss bugs in xfs have been
fixed. And that's probably true. At least one would hope that after 13
years, the major data loss bugs would have been fixed. But xfs's data
integrity problems are not due to bugs, but due to fundamental design
decisions which cannot be changed at this point. And there is plenty
of recent evidence supporting the fact that xfs still has the same
data integrity problems it has always had. For example, this recent
report involving a very recent enterprise Linux version:
http://toruonu.blogspot.com/2012/12/xfs-vs-ext4.html
Simply Googling "xfs zero" and sorting by date yields pages and pages
of recent report hits.
The fundamental design philosophy issues for xfs are the assumptions that:
1. Metadata is more important than data. (A brain-dead concept, to start with.)
2. Data loss is acceptable as long as the metadata is kept consistent.
3. Performance is only slightly less important than metadata, and far
more important than data.
More specifically, the data integrity design problems for xfs are (primarily):
1. It only journals metadata, and doesn't order data writes to ensure
that the data is always consistent with some valid state (even if it
isn't the latest state).
2. It uses delayed allocation, which is inherently unsafe even if you
order writes ahead of the metadata. And you can't turn it off. (Please
correct me if I'm wrong about that. I'd like to know.)
#1 is a brick wall. There's not much that can be done. Regarding #2, I
think the xfs guys did model something on Ted Ts'o's ext4 patches to
2.6.30 which force fsyncs for certain common idioms. (Though I think I
heard that they did not adopt all of them. Not sure.) I do not
consider even that full patch set to be more than a band-aid.
But trusting important data to a store which employs either of the
above designs is just irresponsible, and in general, responsible
admins should never even consider it.
Regarding xfs performance, Dave Chinner made an interesting
presentation (at LinuxConf AU 2012, IIRC) in which he demonstrated the
metadata scalability work that the xfs team had done, which had made
it into RHEL 6.x. (It's on YT, if you missed it.) His slides did show
dramatic improvements. However, they also consistently showed ext4
blowing away xfs performance on fs_mark for every test, up until
8 threads (which covers an awful lot of common workloads). So xfs
metadata performance isn't there yet, unless your workload involves 8
metadata intensive threads. To its credit, xfs did scale more or less
linearly, whereas ext4 (in whatever configuration he was using; he
didn't say.) started flagging somewhere between 5 & 8 threads.
There's no such thing as a "best filesystem". Horses for courses.
Above 16TB, xfs may (or may not) rule. Below that is (in general) ext4
territory. And we'll see how things work out for the featureful btrfs.
It's too early to guess, and my crystal ball is in the shop.
It's been suggested that I'm not familiar with the issues surrounding
ext3's ordered mode. In fact, I'm more familiar with the history than
anyone I've recently encountered. Back in '98 or '99, we didn't have
any journaling fs in Linux, and I was carefully following each and
every (relatively rare) post that Stephen Tweedie was making to lkml
and the linux-ext2 (IIRC) list. So I know the history. I know
Tweedie's thought process at the time. (Had an email exchange with him
about it once.) And so I recognize that Ts'o (and others?) have
managed an impressive rewriting of the history in a campaign to make
dangerous practices palatable to a modern audience. Ext3's aggressive
data-sync'ing behavior is no accident or side-effect. It was quite
deliberate and intentional. And ordered mode was not all about
security, but primarily about providing a sane level of data
integrity, with the security features being included for free. Tweedie
is a very meticulous and careful designer who understood (and
understands) that:
1. Data is more important than metadata.
2. Metadata is only important because it's required in order to work
with the data.
3. It's OK to provide data-endangering options to the system
administrator. But they should be turned *off* by default.
I get the impression that few people are aware of these aspects of
ext3's history and design. Probably fewer are aware that Tweedie
implemented the data=journal mode *before* he implemented the ordered
and writeback modes.
I can certainly see where ext3 design decisions would be a thorn in
the side of designers of less safe filesystems, as it does result in
programs which quickly show up their design misfeatures.
While getting things closer to right than xfs, ext4 falls short of
getting things really right by turning the dangerous delayed
allocation behavior on by default. It should have been left as a
performance optimization available to admins with workloads which
allowed for it.
Anyway, that's enough for me on this topic. Feel free to discuss among
yourselves. But the back and forth on this could go on for weeks (if
not more) and I don't care to allocate the time (delayed or not ;-)
Again, thank you for the discussion and info on the T310 and general SATA issue.
Sincerely,
Steve Bergman
(signing off)
* Re: Is this expected RAID10 performance?
2013-06-08 19:56 Steve Bergman
@ 2013-06-09 3:08 ` Stan Hoeppner
2013-06-09 12:09 ` Ric Wheeler
1 sibling, 0 replies; 32+ messages in thread
From: Stan Hoeppner @ 2013-06-09 3:08 UTC (permalink / raw)
To: Steve Bergman; +Cc: Linux RAID
On 6/8/2013 2:56 PM, Steve Bergman wrote:
> First of all, thank you to the people who took the time to help
> illuminate this issue.
>
> To summarize... for unknown reasons, the 4 port SATA controller on the
> Dell PET-310 has an aggregate limitation of ~1.75 Gbit/s on the A&B
> and C&D port pairs. Each port can provide more than that to a single
> drive, but when trying to read or write both ports simultaneously,
> each port in the pair gets ~0.87Gbit/s. (Which is probably some
> higher nominal value minus some overhead.)
This is almost certainly a result of forced IDE mode. With this you end
up with a master/slave setup between the drives on each controller, and
all of the other overhead of EIDE.
[SNIP: running down of XFS, showing desire for O_PONIES.]
> Anyway, that's enough for me on this topic. Feel free to discuss among
> yourselves. But the back and forth on this could go on for weeks (if
> not more) and I don't care to allocate the time (delayed or not ;-)
When you drop a bomb like you have here, and run away, it simply tells
everyone that you're not willing to defend your claims and opinions.
Thus all of that typing was a waste of your time as it will be ignored.
Given your misstatements of fact, about both XFS and EXT4, I can see
why you're running away. I won't bother debunking all of it. I will
simply say this.
If you'd learn to properly use fsync or O_DIRECT in your application
you'd have no problem with data/file integrity with XFS, EXT4, or any
filesystem. Either puts the data on the platter right now. You
apparently write all 2GB of your data to buffer cache and then issue a
sync. That is *horrible* practice. This *creates* a window of
opportunity for data loss. And you're complaining about XFS delayed
allocation?
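To illustrate at the shell level (a loose analogy only; in a real
application you would open with O_DIRECT or call fsync() yourself, and
the paths here are made up):

  # bypass the page cache entirely (O_DIRECT):
  dd if=/src/bigfile of=/data/bigfile bs=1M oflag=direct
  # or write buffered, but fsync the output file before dd exits:
  dd if=/src/bigfile of=/data/bigfile bs=1M conv=fsync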
WRT your data security complaints about XFS, note that machines exist
today that move an aggregate 6-10GB/s to/from a single XFS filesystem.
Try that with EXT. Such performance isn't possible if one journals
data as you suggest all filesystems should. If you need high
performance throughput from an application *and* data security you use
parallel O_DIRECT.
--
Stan
* Re: Is this expected RAID10 performance?
2013-06-08 19:56 Steve Bergman
2013-06-09 3:08 ` Stan Hoeppner
@ 2013-06-09 12:09 ` Ric Wheeler
2013-06-09 20:06 ` Steve Bergman
1 sibling, 1 reply; 32+ messages in thread
From: Ric Wheeler @ 2013-06-09 12:09 UTC (permalink / raw)
To: Steve Bergman; +Cc: Linux RAID
On 06/08/2013 03:56 PM, Steve Bergman wrote:
> Simply Googling "xfs zero" and sorting by date yields pages and pages
> of recent report hits.
This is just silly. Try googling for "Santa Claus lives at the North Pole" or
"Do pixies really exist". Both queries will give you rock solid evidence that
you can share with us, down to a specific mail address for Santa :)
For that matter, try googling "ext4 zero length files".
In my experience, based on first-hand knowledge of what enterprise
users and enterprise storage array vendors actually use when
constructing Linux-based storage devices, XFS is by far the more popular choice.
To be clear, you absolutely can lose data with *any* file system if you
misconfigure your storage, ignore the barriers, etc. That definitely includes ext4.
The way ext4 and xfs both do things is a lot closer these days (mainly because
the ext4 developers have continually harvested good ideas from XFS with XFS
occasionally doing the same from ext4).
For any application, I always encourage users to try out a few file systems and
see what is best for them.
It is a lot more interesting to share your actual setup and results. Much less
interesting to echo uninformed, old claims.
Regards,
Ric
* Re: Is this expected RAID10 performance?
2013-06-09 12:09 ` Ric Wheeler
@ 2013-06-09 20:06 ` Steve Bergman
2013-06-09 21:40 ` Ric Wheeler
` (2 more replies)
0 siblings, 3 replies; 32+ messages in thread
From: Steve Bergman @ 2013-06-09 20:06 UTC (permalink / raw)
To: Ric Wheeler; +Cc: Linux RAID
Hello Ric,
I was not intending to reply in this thread, for reasons I gave at the
end of my previous post. However, since it is you who are responding
to me, and I have a great deal of respect for you, I don't want to
ignore this.
Firstly, let me say that I do not care about winning an argument,
here. What I've said, I felt I should say. And it is based upon my
best understanding of the situation, and my own experiences as an
admin. If my statements seemed overly strong, then... well... I've
found "Strong Opinions, Loosely Held" to be a good strategy for
learning things I might not otherwise have discovered.
I'm not a particular advocate or detractor for/against any particular
filesystem. But I do strongly believe in discussing the relative
advantages and disadvantages, and in particular benefits and risks of
filesystems and filesystem features frankly and honestly. The
particular risks of a filesystem or feature should have equal
visibility to prospective users as the benefits do. There's no denying
that XFS has a mystique. It's something I've noticed since the day
the old SGI released the code under the GPLv2. And if you did
Google for "XFS and zeroes" you surely noticed that many of the
reports of trouble came from people who had no business using XFS in
their environment in the first place. And often based upon erroneous and
incomplete information. And mixed in with those, there were folks who
really thought they'd done their homework and still got bitten by one
of the relative risks of "advanced and modern performance features". I
believe that it is especially important for advocates of a filesystem
to be forthright, honest, and frank about the relative risks. As doing
otherwise hurts, in the long run, the reputation of the filesystem
being advocated.
Saying that "you can lose data with any filesystem" is true... but
evasive, and misses the point. One could say that speeding down the
interstate at 100mph on a motorcycle without a helmet isn't any more
dangerous than driving a sedan with a "Five Star" safety rating at the
speed limit, since after all, it's possible for people in the sedan to
die in a crash, and there are even examples of this having happened.
But that doesn't really address the issue in a constructive and honest
way.
But enough of that. I've already said everything that I feel I'm
ethically bound to say on that topic. And I'm interested in your
thoughts on the topic of delayed allocation and languages which either
don't support the concept of fsync, or in which the capability it
little known and/or seldom used. e.g. Python. It does support the
concept of fsync. But that's almost never talked about in Python
circles. (At least to my knowledge.) The function is not a 1st class
player. But fsync() does exist, buried in the "os" module of the
standard library alongside dirname(), basename(), copy(), etc. My
distro of choice is Scientific Linux 6.4. (Essentially RHEL 6.4.) And
a quick find/fgrep doesn't reveal any usage of fsync at all in any of
the ".py" files which ship with the distro. Perhaps the Python VM
invokes it automatically? Strace says no. And this is in an enterprise
distro which clearly states in its Administrator Manual sections on
Ext4 & XFS that you *must* use fsync to avoid losing data. I haven't
checked Ruby or Perl, but I think it's a pretty good guess that I'd
find the same thing.
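(For anyone who wants to repeat the check, it amounts to something like
this; the one-liner is just an illustration:

  strace -f -e trace=fsync,fdatasync \
    python -c "f = open('/tmp/t', 'w'); f.write('x'); f.close()" 2>&1 | grep sync

In my case that prints nothing.)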
However, I'd like to talk (and get your thoughts) about another
language that doesn't support the concept of fsync. One that still
maintains a surprising presence even today, particularly in
government, but rarely gets talked about: COBOL. At a number of my
sites, I have COBOL C/ISAM files to deal with. And at new sites that I
take on, a common issue is that the filesystems have been mounted at
the ext4 defaults (with delayed allocation turned on) and that the
business has experienced data loss after an unexpected power loss, UPS
failure, etc. (In fact, *every* time I've seen this configuration and
event, I've observed data loss.) The customer often tacitly assumes
this is just a flaw in the way Linux works. My first action is to
mount nodelalloc, and this seems to do a great job of preventing
future problems. In a recent event (last week) the Point of Sale line
item file on a server was so badly corrupted that the C/ISAM rebuild
utility could not rebuild it at all. Since this was new (and
important) data which was not recoverable from the nightly backup, it
involved 2 days worth of re-entering the data and then figuring out
how to merge it with the POS data which had occurred during the
intervening time.
Is this level of corruption expected behavior for delayed allocation?
Or have I hit a bug that needs to be reported to the ext4 guys? Should
delayed allocation be the default in an enterprise distribution which
does not, itself, make proper use of fsync? Should the risks of
delayed allocation be made more salient than they are to people who
upgrade from say, RHEL5 to RHEL6? Should options which trade data
integrity guarantees for performance be the defaults in any case? As
an admin, I don't care about benchmark numbers. But I care very much
about the issue of data endangerment "by default".
Sincerely,
Steve Bergman
P.S I very much enjoyed that "Future of Red Hat Enterprise Linux"
event from Red Hat Summit 2012. While I don't necessarily advocate for
any particular filesystem, I do find the general topic exciting. In
fact, the entire suite of presentations was engaging and informative.
* Re: Is this expected RAID10 performance?
2013-06-09 20:06 ` Steve Bergman
@ 2013-06-09 21:40 ` Ric Wheeler
2013-06-09 23:08 ` Steve Bergman
2013-06-10 0:11 ` Joe Landman
2013-06-09 22:05 ` Eric Sandeen
2013-06-10 0:05 ` Joe Landman
2 siblings, 2 replies; 32+ messages in thread
From: Ric Wheeler @ 2013-06-09 21:40 UTC (permalink / raw)
To: Steve Bergman; +Cc: Linux RAID
On 06/09/2013 04:06 PM, Steve Bergman wrote:
> Hello Ric,
>
> I was not intending to reply in this thread, for reasons I gave at the
> end of my previous post. However, since it is you who are responding
> to me, and I have a great deal of respect for you, I don't want to
> ignore this.
>
> Firstly, let me say that I do not care about winning an argument,
> here. What I've said, I felt I should say. And it is based upon my
> best understanding of the situation, and my own experiences as an
> admin. If my statements seemed overly strong, then... well... I've
> found "Strong Opinions, Loosely Held" to be a good strategy for
> learning things I might not otherwise have discovered.
>
> I'm not a particular advocate or detractor for/against any particular
> filesystem. But I do strongly believe in discussing the relative
> advantages and disadvantages, and in particular benefits and risks of
> filesystems and filesystem features frankly and honestly. The
> particular risks of a filesystem or feature should have equal
> visibility to prospective users as the benefits do. There's no denying
> that XFS has a mystique. It's something I've noticed since the day
> the old SGI released the code under the GPLv2. And if you did
> Google for "XFS and zeroes" you surely noticed that many of the
> reports of trouble came from people who had no business using XFS in
> their environment in the first place. And often based upon erroneous and
> incomplete information. And mixed in with those, there were folks who
> really thought they'd done their homework and still got bitten by one
> of the relative risks of "advanced and modern performance features". I
> believe that it is especially important for advocates of a filesystem
> to be forthright, honest, and frank about the relative risks. As doing
> otherwise hurts, in the long run, the reputation of the filesystem
> being advocated.
>
> Saying that "you can lose data with any filesystem" is true... but
> evasive, and misses the point. One could say that speeding down the
> interstate at 100mph on a motorcycle without a helmet isn't any more
> dangerous than driving a sedan with a "Five Star" safety rating at the
> speed limit, since after all, it's possible for people in the sedan to
> die in a crash, and there are even examples of this having happened.
> But that doesn't really address the issue in a constructive and honest
> way.
Hi Steve,
Specifically, ext4 and xfs behave exactly the same with regards to delayed
allocation.
As I stated, pretty much without exception, people who monitor the actual data
loss rates in shipping products have chosen to use XFS over ext4. That is
based on actual, tested, deployed instances, backed up by careful monitoring.
I really don't care if you are a happy ext4 use - you should choose what you are
comfortable and what gets the job done for you.
We (Red Hat we) work hard to make sure that all of the file systems we support
handle power failure correctly and do regular and demanding tests on all of our
file systems on a range of hardware types. We have full faith in both file systems.
Side note, we are also working to get btrfs up to the same standard and think it
is closing in on stability.
>
> But enough of that. I've already said everything that I feel I'm
> ethically bound to say on that topic. And I'm interested in your
> thoughts on the topic of delayed allocation and languages which either
> don't support the concept of fsync, or in which the capability is
> little known and/or seldom used, e.g. Python. It does support the
> concept of fsync. But that's almost never talked about in Python
> circles. (At least to my knowledge.) The function is not a 1st class
> player. But fsync() does exist, buried in the "os" module of the
> standard library alongside dirname(), basename(), copy(), etc. My
> distro of choice is Scientific Linux 6.4. (Essentially RHEL 6.4.) And
> a quick find/fgrep doesn't reveal any usage of fsync at all in any of
> the ".py" files which ship with the distro. Perhaps the Python VM
> invokes it automatically? Strace says no. And this is in an enterprise
> distro which clearly states in its Administrator Manual sections on
> Ext4 & XFS that you *must* use fsync to avoid losing data. I haven't
> checked Ruby or Perl, but I think it's a pretty good guess that I'd
> find the same thing.
I don't know any details about Python, Ruby or Perl internals, but will poke
around. Note that some of the standard libraries they call might also have
buried calls in them to do the write thing with fsync() or fdatasync().
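For reference, the pattern those buried calls are supposed to implement looks
roughly like the sketch below. This is a minimal Python illustration, not code
from any particular library, and the file name is made up:

    import os

    def atomic_save(path, data):
        # Write to a temporary file in the same directory, force it out to
        # stable storage, then rename it over the original and fsync the
        # directory so the rename itself survives a crash.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
            f.flush()               # push stdio buffers down to the kernel
            os.fsync(f.fileno())    # push the page cache out to the disk
        os.rename(tmp, path)        # atomic replace on POSIX filesystems
        dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
        try:
            os.fsync(dirfd)         # make the rename durable as well
        finally:
            os.close(dirfd)

    # atomic_save("/tmp/example.conf", b"key = value\n")

Skip any of those steps and there is a window in which a crash can leave a
zero-length or stale file behind, with or without delayed allocation.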
>
> However, I'd like to talk (and get your thoughts) about another
> language that doesn't support the concept of fsync. One that still
> maintains a surprising presence even today, particularly in
> government, but rarely gets talked about: COBOL. At a number of my
> sites, I have COBOL C/ISAM files to deal with. And at new sites that I
> take on, a common issue is that the filesystems have been mounted at
> the ext4 defaults (with delayed allocation turned on) and that the
> business has experienced data loss after an unexpected power loss, UPS
> failure, etc. (In fact, *every* time I've seen this configuration and
> event, I've observed data loss.) The customer often just tacitly assumes
> this is a flaw in the way Linux works. My first action is to
> mount nodelalloc, and this seems to do a great job of preventing
> future problems. In a recent event (last week) the Point of Sale line
> item file on a server was so badly corrupted that the C/ISAM rebuild
> utility could not rebuild it at all. Since this was new (and
> important) data which was not recoverable from the nightly backup, it
> involved 2 days worth of re-entering the data and then figuring out
> how to merge it with the POS data which had accumulated during the
> intervening time.
>
> Is this level of corruption expected behavior for delayed allocation?
> Or have I hit a bug that needs to be reported to the ext4 guys? Should
> delayed allocation be the default in an enterprise distribution which
> does not, itself, make proper use of fsync? Should the risks of
> delayed allocation be made more salient than they are to people who
> upgrade from say, RHEL5 to RHEL6? Should options which trade data
> integrity guarantees for performance be the defaults in any case? As
> an admin, I don't care about benchmark numbers. But I care very much
> about the issue of data endangerment "by default".
I don't agree that data is at risk by default. The trade off of letting data
accumulate in DRAM is *very* long standing (delayed allocation or not). Every
database and serious application has dealt with this on a variety of operating
systems for more than a decade.
If you have a bit of code that does the wrong thing, you can mount "-o sync" I
suppose and crawl along safely but at painfully slow speeds.
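A less drastic middle ground, when only a handful of files matter, is to ask
for synchronous writes per file rather than per mount. A rough Python sketch
(the path is made up; os.O_SYNC is assumed to be available, as it is on Linux):

    import os

    # Open one critical file with O_SYNC so each write() returns only after
    # the data has been pushed to stable storage; everything else on the
    # volume keeps its normal buffered behaviour.
    fd = os.open("/var/lib/myapp/journal.dat",
                 os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC, 0o600)
    try:
        os.write(fd, b"one durable record\n")
    finally:
        os.close(fd)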
Regards,
Ric
>
> Sincerely,
> Steve Bergman
>
> P.S. I very much enjoyed that "Future of Red Hat Enterprise Linux"
> event from Red Hat Summit 2012. While I don't necessarily advocate for
> any particular filesystem, I do find the general topic exciting. In
> fact, the entire suite of presentations was engaging and informative.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Is this expected RAID10 performance?
2013-06-09 21:40 ` Ric Wheeler
@ 2013-06-09 23:08 ` Steve Bergman
2013-06-10 8:35 ` Stan Hoeppner
2013-06-10 0:11 ` Joe Landman
1 sibling, 1 reply; 32+ messages in thread
From: Steve Bergman @ 2013-06-09 23:08 UTC (permalink / raw)
To: Ric Wheeler; +Cc: Linux RAID
Hello, and thank you for the response.
Firstly, while I am a happy ext4 user, I would prefer not to be
defined by that. I'm not about filesystem A is better than filesystem
B. I'm very much in accord with Ted Ts'o's "horses for courses"
philosophy on the issue of filesystem selection.
However, this reminds me that I meant to ask specifically, since XFS
is on my list of filesystems that I consider for various use-cases,
about the relative integrity guarantees provided. In a situation where
I would be comfortable using ext4 with data=writeback mode and
leaving delayed allocation turned on, I should get the same level of
data integrity with XFS, right? Or is there any difference, either
way? What about comparing ext4 with delayed allocation on or off, or
data=writeback, or other combinations? Is it possible to change the
level of guarantee I get from XFS? I ask because it's entirely
possible that there's something important that I happen not to know
about it.
Understand that the servers I administer are not sitting in data
centers. They are sitting in offices in small to medium sized
businesses with maybe 100 - 150 employees, generally family-owned and
pretty informal, without me there to watch over. (I'm an independent
consultant.) And there are inevitable situations like last week when
there was a(nother) power failure (I live in Oklahoma City) and the
management and employees were scrambling to put everything on
generators. The UPS didn't like the generator, and by the time they
got the server back up it had crashed 5 times. And I don't even find
out about these things until after the fact. And no matter what I do
or advise to prevent these kinds of things, there's always something
else that goes wrong. I may not be one of your "enterprise customers",
but I'm surely not alone. This may all seem a bit "Green Acres" to
you, but it's the reality I live in. And I can't force my customers to
act on my advice if they opt not to. But regardless of everything,
it's still my responsibility as a consultant, not to mention a Linux
advocate, to protect my customers' data, and to ensure that Linux
doesn't get the reputation for losing data. NCR would be more than
happy to move these people from their current Linux product to their
newer Windows-only "upgraded" version. Their marketing department is
certainly trying hard enough.
You may have your enterprise statistics. But I have a set of pretty
compelling personal experiences over the past 4 years which differs
substantially. Anecdotal, yes. But they're *my* anecdotes, which makes
a difference to me.
> Note that some of the standard libraries they call might also have buried calls in them to do the write thing with fsync() or fdatasync().
I just chose an application at random to look at. I opened up
virt-manager under "strace -o virtman.out -f", and created a new VM.
At no time was fdatasync() called. There was a single call to fsync()
on file "/root/.config/gtk-2.0/gtkfilechooser.ini.5ZESYW". I'd have
expected a number of key libvirtd config files to have been fsync'd or
fdatasync'd.
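Tallying those calls out of an strace log takes only a few lines, for anyone
who wants to repeat the check. A rough sketch, reusing the virtman.out file
name from the command above:

    import re
    import sys

    # Count data-integrity syscalls in an "strace -f -o <logfile>" capture.
    log = sys.argv[1] if len(sys.argv) > 1 else "virtman.out"
    counts = {"fsync": 0, "fdatasync": 0, "sync_file_range": 0}
    call = re.compile(r"\b(fsync|fdatasync|sync_file_range)\(")
    with open(log) as f:
        for line in f:
            m = call.search(line)
            if m:
                counts[m.group(1)] += 1
    print(counts)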
> If you have a bit of code that does the wrong thing, you can mount "-o sync" I suppose and crawl along safely but at painful slow speeds.
Sarcasm noted. ;-) I find that in practice, simply leaving the data
volumes in data=ordered mode and turning off DA results in -osync-like
data integrity. I've considered data=journal. But even though I'm not
a RH customer, I like to abide by RH support guidelines (on the
assumption that you guys might be aware of some pitfalls of which I am
blissfully ignorant). And the Administration Manual implies that
data=journal is not an officially supported journaling mode. (Why?) I
think we can agree that "-osync" would be both unnecessary and
overkill for almost any situation.
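For what it's worth, confirming which of these options actually took effect on
a given volume is easy to script against /proc/mounts. A rough sketch (the
mount point is just an example):

    import sys

    # Show the filesystem type and the mount options actually in effect for
    # one mount point, e.g. to confirm a nodelalloc or data= override "took".
    target = sys.argv[1] if len(sys.argv) > 1 else "/home"
    with open("/proc/mounts") as f:
        for line in f:
            dev, mnt, fstype, opts = line.split()[:4]
            if mnt == target:
                print(fstype, opts)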
At any rate, this conversation, and the fact that
/etc/cups/printers.conf turned up zero-length after the emergency last
week, are really piquing my curiosity about just how well OS vendors
who say "just use fsync" are following their own advice. It may be the
RHEL6 is one of those bits of "troublesome code" which doesn't
consistently use fsync properly. I don't mean that to be
overly-provocative. And I should check further. But so far I'm not
seeing a lot of evidence of fsyncs being used consistently on
significant config files. It may or may not be an area where there is
room for improvement in the excellent RHEL6 product. :-)
-Steve
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Is this expected RAID10 performance?
2013-06-09 23:08 ` Steve Bergman
@ 2013-06-10 8:35 ` Stan Hoeppner
0 siblings, 0 replies; 32+ messages in thread
From: Stan Hoeppner @ 2013-06-10 8:35 UTC (permalink / raw)
To: Steve Bergman; +Cc: Ric Wheeler, Linux RAID
On 6/9/2013 6:08 PM, Steve Bergman wrote:
> And there are inevitable situations like last week when
> there was a(nother) power failure (I live in Oklahoma City) and the
> management and employees were scrambling to put everything on
> generators. The UPS didn't like the generator, and by the time they
> got the server back up it had crashed 5 times.
With all due respect Steve, your car has a hole in the exhaust and fumes
are entering the passenger cabin. Instead of fixing the exhaust, you're
trying to roll down the windows to get the fumes out.
The problem is power, not Linux nor the filesystem. So add more battery
packs to increase uptime, and implement apcupsd for clean shutdown on
low battery. Avoiding unclean shutdowns due to power issues is a
problem solved long ago. For this particular customer whose server went
down hard five times (there's no excuse for this BTW), this is a perfect
time to write a proposal for additional UPS hardware. For the server
systems you've described, UPS+batteries for 12 hours of uptime is chump
change compared to lost revenue. For any business in the Midwest, you
should have at least 12 hours of UPS time, or 1 hour UPS (cut at 30
mins) plus constantly tested and verified auto cutover to diesel
generator, which can be filled on the fly for indefinite power generation.
Storms usually arrive between late afternoon and during the overnight
hours. Your biz power will normally go down during this time frame, and
multiple times while the power co is fixing downed lines.
I'm sure you'll come up with a list of reasons why they can't afford
more power backup, or that they shouldn't need it if filesystems just
worked "correctly", etc, etc. They will all be invalid.
The simple fact is computers are powered by electricity. If the current
stops flowing while they are running you will have problems, and not
just filesystems. If they are properly powered down, they cannot be run
again until power is restored.
This is the problem you need to address. Not filesystem "reliability".
At minimum apcupsd should solve a lot of these problems. But of course
you already know of apcupsd. So why the 5 unclean shutdowns?
--
Stan
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Is this expected RAID10 performance?
2013-06-09 21:40 ` Ric Wheeler
2013-06-09 23:08 ` Steve Bergman
@ 2013-06-10 0:11 ` Joe Landman
1 sibling, 0 replies; 32+ messages in thread
From: Joe Landman @ 2013-06-10 0:11 UTC (permalink / raw)
To: Ric Wheeler; +Cc: Steve Bergman, Linux RAID
On 06/09/2013 05:40 PM, Ric Wheeler wrote:
> On 06/09/2013 04:06 PM, Steve Bergman wrote:
[...]
>> a quick find/fgrep doesn't reveal any usage of fsync at all in any of
>> the ".py" files which ship with the distro. Perhaps the Python VM
>> invokes it automatically? Strace says no. And this is in an enterprise
>> distro which clearly states in its Administrator Manual sections on
>> Ext4 & XFS that you *must* use fsync to avoid losing data. I haven't
>> checked Ruby or Perl, but I think it's a pretty good guess that I'd
>> find the same thing.
>
> I don't know any details about Python, Ruby or Perl internals, but will
> poke around. Note that some of the standard libraries they call might
> also have buried calls in them to do the write thing with fsync() or
> fdatasync().
Perl's IO layer is tunable using various "personalities" you can invoke
at module load, file open, on file descriptors, etc. The syncs are in
the standard library. That said, a number of the distro Perls are
badly, horribly broken (or out of date and EOLed for years now ... cough
cough). Python has similar capability, and its IO layer is usually a
direct pass through to the underlying library calls. You can force sync
on IO, though most sane programmers won't do that.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Is this expected RAID10 performance?
2013-06-09 20:06 ` Steve Bergman
2013-06-09 21:40 ` Ric Wheeler
@ 2013-06-09 22:05 ` Eric Sandeen
2013-06-09 23:34 ` Steve Bergman
2013-06-10 0:05 ` Joe Landman
2 siblings, 1 reply; 32+ messages in thread
From: Eric Sandeen @ 2013-06-09 22:05 UTC (permalink / raw)
To: linux-raid
Steve Bergman <sbergman27 <at> gmail.com> writes:
> Is this level of corruption expected behavior for delayed allocation?
> Or have I hit a bug that needs to be reported to the ext4 guys? Should
> delayed allocation be the default in an enterprise distribution which
> does not, itself, make proper use of fsync? Should the risks of
> delayed allocation be made more salient than they are to people who
> upgrade from say, RHEL5 to RHEL6? Should options which trade data
> integrity guarantees for performance be the defaults in any case? As
> an admin, I don't care about benchmark numbers. But I care very much
> about the issue of data endangerment "by default".
There's quite a lot I could follow up on or correct in your previous couple
emails, but here is a quick one:
Delayed allocation is a technique which chooses a physical location for data
when IO is sent to disk, not when the write() syscall is issued.
There is nothing at all inherently dangerous about that mode of operation.
ext4 may conflate this a little, because when delalloc is off, the old jbd
5-second-commit behavior is what starts pushing data out, rather than
periodic writeback.
But whether the filesystem chooses a physical block ahead of time or at the
time of IO has no direct effect on safety or data integrity.
-Eric
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Is this expected RAID10 performance?
2013-06-09 22:05 ` Eric Sandeen
@ 2013-06-09 23:34 ` Steve Bergman
2013-06-10 0:02 ` Eric Sandeen
0 siblings, 1 reply; 32+ messages in thread
From: Steve Bergman @ 2013-06-09 23:34 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Linux RAID
Hi Eric,
Yes, I understand what you are saying about the interaction between
ordered data mode and DA in ext4. It's the combination of the 2
options that makes the difference. Merely having a switch to turn off
DA on XFS would not get me what I need for my data volumes. But thank
you for making that explicit.
I intentionally disable DA on my ext4 data volumes specifically to get
ext3-like behavior which results in a night and day difference in
resiliency during... difficult times... for many of my customers, in
my repeated experiences. I could just use ext3. But why give up
extents, multiblock allocation, CRC protection of the journal, etc?
(BTW, that's my vote *not* to remove the nodelalloc option of ext4 as
I noticed you and Ted discussing last April. ;-)
So on a set of Cobol C/ISAM files which never get fsync'd or
fdatasync'd, (because that concept does not exist in Cobol) would you
expect there to be any difference in the resiliency of ext4 and xfs
with both filesystems at completely default settings? Or would it be
about the same. I'm *very* interested in this topic, as I'd like the
best speed and more filesystem options, but need the resiliency even
more for many of my servers. Do I have an option with XFS to improve
behavior on/after an unclean shutdown? If so, I'd sincerely like to
know.
XFS is an excellent filesystem. Indispensable for certain use-cases.
If you need > 16TB files, there's nothing like it. (And I'm sure there
are other good use-cases.) Similarly, DA is a valuable filesystem
feature. And I'm very glad that both XFS & Ext4 have it available to
me. But as with any filesystem or fs feature, there are always
trade-offs, risks and benefits, etc. And those differences have turned
out to be crucially important to me and to quite a number of my
customers.
-Steve
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Is this expected RAID10 performance?
2013-06-09 23:34 ` Steve Bergman
@ 2013-06-10 0:02 ` Eric Sandeen
2013-06-10 2:37 ` Steve Bergman
2013-06-10 7:19 ` David Brown
0 siblings, 2 replies; 32+ messages in thread
From: Eric Sandeen @ 2013-06-10 0:02 UTC (permalink / raw)
To: Steve Bergman; +Cc: Linux RAID
On 6/9/13 6:34 PM, Steve Bergman wrote:
> Hi Eric,
>
> Yes, I understand what you are saying about the interaction between
> ordered data mode and DA in ext4. It's the combination of the 2
> options that makes the difference. Merely having a switch to turn off
> DA on XFS would not get me what I need for my data volumes. But thank
> you for making that explicit.
>
> I intentionally disable DA on my ext4 data volumes specifically to get
> ext3-like behavior which results in a night and day difference in
> resiliency during... difficult times... for many of my customers, in
> my repeated experiences. I could just use ext3. But why give up
> extents, multiblock allocation, CRC protection of the journal, etc?
> (BTW, that's my vote *not* to remove the nodelalloc option of ext4 as
> I noticed you and Ted discussing last April. ;-)
I don't recommend nodelalloc just because I don't know that it's thoroughly
tested. Anything that's not the default needs explicit and careful
test coverage to be sure that regressions etc. aren't popping up.
(One of ext4's weaknesses, IMHO, is its infinite matrix of options,
with wildly different behaviors. It's more a filesystem multiplexer
than a filesystem itself. ;) Add enough knobs and there's no way you
can get coverage of all combinations.)
> So on a set of Cobol C/ISAM files which never get fsync'd or
> fdatasync'd, (because that concept does not exist in Cobol) would you
> expect there to be any difference in the resiliency of ext4 and xfs
> with both filesystems at completely default settings?
So back to the main point of this thread.
You probably need to define what _you_ mean by resiliency. I have a hunch
that you have different, and IMHO unfounded, expectations.
I'm using a definition of resiliency for this conversation like this:
For properly configured, non-dodgey storage,
1) Is metadata journaled such that the filesystem metadata is consistent
after a crash or power loss, and fsck finds no errors?
and
2) Is data persistent on disk after either a periodic flush, or a data
integrity syscall?
The answer to both had better be yes on ext3, ext4, xfs, or any other
journaling filesystem worth its salt. If the answer is no, it's a broken
design or a bug.
And the answer for ext3, ext4, and xfs, barring the inevitable bugs that
come up from time to time on all filesystems, is yes, 1) and 2) are
satisfied.
Anything else you want in terms of data persistence (data from my careless
applications will be safe no matter what) is just wishful thinking.
> Or would it be
> about the same. I'm *very* interested in this topic, as I'd like the
> best speed and more filesystem options, but need the resiliency even
> more for many of my servers. Do I have an option with XFS to improve
> behavior on/after an unclean shutdown? If so, I'd sincerely like to
> know.
What you seem to want is a vanishingly small window for risk of data
loss for unsynced, buffered IO.
ext3 gave you about 5 seconds thanks to default jbd behavior and
data=ordered behavior. ext4 & xfs are more on the order of
30s.
But this all boils down to:
Did you (or your app) fsync your data? If not, you cannot guarantee
that it'll be there if you crash or lose power. The window for risk
of loss depends on many things, but without data integrity syscalls,
there is a risk of data loss. See also http://lwn.net/Articles/457667/
You said to Ric:
> I find that in practice, simply leaving the data volumes in
> data=ordered mode and turning off DA results in -osync-like
> data integrity.
It quite simply does not. Write a new file, punch power 1-2s after
the write completes, reboot and see what you've got. You're racing
against jbd2 waking up and getting work done, but most of the time,
you'll have data loss.
If you want a smaller window of opportunity for data loss, there
are plenty of tuneables at the fs & vm level to push data towards
disk more often, at the expense of performance.
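For instance -- a sketch only, with purely illustrative values, not a
recommendation -- the VM writeback knobs can be read and tightened like this:

    # Read the current writeback intervals (in centiseconds), then shorten
    # them so dirty pages are pushed toward disk sooner.  Needs root to write.
    knobs = {
        "/proc/sys/vm/dirty_expire_centisecs": "500",      # default 3000 (30s)
        "/proc/sys/vm/dirty_writeback_centisecs": "100",   # default 500 (5s)
    }
    for path, value in knobs.items():
        with open(path) as f:
            print(path, "=", f.read().strip())
        with open(path, "w") as f:
            f.write(value)

along with ext4's commit= mount option for the journal commit interval.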
Without data integrity syscalls, you're always exposed to a greater
or lesser degree.
(It'd probably be better to take this up on the filesystem lists,
since we've gotten awfully off-topic for linux-raid. But I feel
like this is a rehash of the O_PONIES thread from long ago...)
-Eric
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Is this expected RAID10 performance?
2013-06-10 0:02 ` Eric Sandeen
@ 2013-06-10 2:37 ` Steve Bergman
2013-06-10 10:00 ` Stan Hoeppner
2013-06-10 7:19 ` David Brown
1 sibling, 1 reply; 32+ messages in thread
From: Steve Bergman @ 2013-06-10 2:37 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Linux RAID
On Sun, Jun 9, 2013 at 7:02 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
As I've posted previously, despite my best efforts and advice to
customers, I still have to deal with the results of unclean shutdowns.
And that is specifically what I am concerned about. If I've given the
impression that I don't trust xfs or ext4 in normal operation, it was
unintentional. I have the greatest confidence in them. I have
particularly recent experience with unclean shutdowns here in OKC. One
can say that I and the operating system are not responsible for the
unwise (or not so unwise) things that other people might do which
result in the unclean shutdowns. But ultimately, it is my
responsibility to do my level best, despite everything, to see that
data is not lost. It's my charge. And it's what I'll do come hell, high
water, or tornadoes. And I do have a pragmatic solution to the problem
which has worked well for 3 years. But I'm open to other options.
> I don't recommend nodelalloc just because I don't know that it's thoroughly
> tested.
I can help a bit there. At least regarding this particular Cobol
workload, since it's a category that I've been running for about 25
years. The SysV filesystem of AT&T Unix '386 & 3B2, Xenix's
filesystem, SCO Unix 4.x's Acer Fast Filesystem, and ext2 all
performed similarly. Occasionally, file rebuilds were necessary after
a crash. SCO Open Server 5's HTFS did better, IIRC. I have about 12
years of experience with ext3. And I cannot recall a time that I ever
had data inconsistency problems. (Probably a more accurate way to put
it than "data loss".) It's possible that I might have had 1 or 2 minor
issues. 12 years is a long time. I might have forgotten. But it was a
period of remarkable stability. This is why when people say "Oh, but
it can happen under ext3, too!" it doesn't impress me particularly. Of
course it "could". But I have 12 years of experience by which to gauge
the relative likelihood.
Now, with ext4 at its defaults, it was an "every time" thing
regarding serious data problems and unclean shutdowns, until I
realized what was going on. I can tell you that in 3 years of using
nodelalloc on those data volumes, it's been smooth sailing. No
unexpected problems. For reasons you note, I do try to keep things at
the defaults as much as possible. That is generally the safe and best
tested way to go. And it's one reason I don't go all the way and use
data=journal. I remember one report, some years ago, where ext3 was
found to have a specific data loss issue... but only for people
mounting it data=journal.
But regarding nodelalloc not providing perfect protection...
"perfection" is the enemy of "good". I'm a pragmatist. And nodelalloc
works very well, while still providing acceptable performance, with no
deleterious side-effects. At least in my experience, and on this
category of workload, I would feel comfortable recommending it to
others in similar situations, with the caveat that YMMV.
> You probably need to define what _you_ mean by resiliency.
I need for the metadata to be in a consistent state. And for the data
to be in a consistent state. I do not insist upon that state being the
last state written to memory by the application. Only that the
resulting on-disk state reflect a valid state that the in-memory image
had seen at some time, even for applications written in environments
which have no concept of fsync or fdatasync, or where the program
(e.g. virt-manager or cupsd) doesn't do proper fsyncs. I.e., I need ext3
data=ordered behavior. And I'm not at all embarrassed to say that I
need (not want) a pony. And speaking pragmatically, I can vouch for
the fact that my pony has always done a very good job.
> Anything else you want in terms of data persistence (data from my careless
> applications will be safe no matter what) is just wishful thinking.
Unfortunately, I don't have the luxury of blaming the application.
> ext3 gave you about 5 seconds thanks to default jbd behavior and
> data=ordered behavior. ext4 & xfs are more on the order of
> 30s.
There's more to it than that, though, isn't there? Ext3 (and
presumably ext4 without DA) flushes the relevant data immediately before
the metadata write. It's more to do with metadata and data being
written at the same time (and data just *before* metadata) than with the
frequency with which it happens. Am I correct about that?
> But this all boils down to:
> Did you (or your app) fsync your data?
No. Because Cobol doesn't support it. And few, apparently not even Red
Hat, bother to use the little-known os.fsync() call under Python, so
far as I've been able to tell. Still haven't checked on Perl and Ruby.
> (It'd probably be better to take this up on the filesystem lists,
> since we've gotten awfully off-topic for linux-raid.
I agree that this is off-topic. It started as a relevant question
(from me) about odd RAID10 performance I was seeing. Someone decided
to use it as an opportunity to sell me on XFS, and things went south
from there. (Although I have found it to be interesting.) I wasn't
going to post further here. I'd even unsubscribed from the list. But
I couldn't resist when you and Ric posted back. I know that you both
know what you're talking about, and give honest answers, even if your
world of pristine data centers and mine of makeshift "server closets"
may result in differing views. I have a pretty good idea the way
things would go were I to post on linux-fsdevel. I saw how that all
worked out back in 2009. And I'd as soon not go there. I think I got
all the answers I was looking for here, anyway. I know I asked a
couple of questions of you in this post. But we can keep it short and
then cut it short, after.
Thanks for your time and your thoughts.
-Steve Bergman
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Is this expected RAID10 performance?
2013-06-10 2:37 ` Steve Bergman
@ 2013-06-10 10:00 ` Stan Hoeppner
0 siblings, 0 replies; 32+ messages in thread
From: Stan Hoeppner @ 2013-06-10 10:00 UTC (permalink / raw)
To: Steve Bergman; +Cc: Eric Sandeen, Linux RAID
On 6/9/2013 9:37 PM, Steve Bergman wrote:
> I agree that this is off-topic. It started as a relevant question
> (from me) about odd RAID10 performance I was seeing. Someone decided
> to use it as an opportunity to sell me on XFS, and things went south
> from there.
You're referring to me Steve, but your recollection/perception of the
conversation is not accurate. I did not try to sell you on XFS. The
conversation drifted toward XFS, but I was not attempting to sell you on
it. In fact, I said:
"If your workload has any parallelism, reformat that sucker with XFS
with the defaults."
Then later in the thread I said:
"But as you are stuck with EXT4, this is academic. But, hopefully this
information may have future value to you, and others."
I'm not beating you up here Steve. I'm trying to avoid being portrayed
as the hustler on the corner slinging XFS to the kids. ;)
Yes, I made positive comments about XFS, and some less than positive
comments about EXT4. That isn't selling. That's partisan.
--
Stan
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Is this expected RAID10 performance?
2013-06-10 0:02 ` Eric Sandeen
2013-06-10 2:37 ` Steve Bergman
@ 2013-06-10 7:19 ` David Brown
1 sibling, 0 replies; 32+ messages in thread
From: David Brown @ 2013-06-10 7:19 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Steve Bergman, Linux RAID
On 10/06/13 02:02, Eric Sandeen wrote:
> (It'd probably be better to take this up on the filesystem lists,
> since we've gotten awfully off-topic for linux-raid. But I feel
> like this is a rehash of the O_PONIES thread from long ago...)
>
While this is a little off-topic, most people who use RAID also use
filesystems, and are interested in the integrity of their data. I doubt
if I am alone in reading and learning (or at least refreshing) a little
here. The other advantage of this list is that it is filesystem neutral
- we can learn about ext4 and xfs benefits, gotchas, and techniques,
while filesystem-specific lists are naturally biased. So I for one am
happy to read this thread here.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Is this expected RAID10 performance?
2013-06-09 20:06 ` Steve Bergman
2013-06-09 21:40 ` Ric Wheeler
2013-06-09 22:05 ` Eric Sandeen
@ 2013-06-10 0:05 ` Joe Landman
2 siblings, 0 replies; 32+ messages in thread
From: Joe Landman @ 2013-06-10 0:05 UTC (permalink / raw)
To: Steve Bergman; +Cc: Ric Wheeler, Linux RAID
On 06/09/2013 04:06 PM, Steve Bergman wrote:
> Google for "ZFS and zeroes" you surely noticed that many of the
s/ZFS/xfs/
[...]
> Saying that "you can lose data with any filesystem" is true... but
> evasive, and misses the point. One could say that speeding down the
Er ... no.
If you insist upon absolute "guarantees" in *any* file system, then
mount it with a sync option, so writes don't return until they are
committed to disk, turn off all write caching on the drive, and turn off
any other write caching throughout the system. And if you believe that
this *guarantees* your data integrity, I'd suggest staying away from
real estate sales people in Florida.
You have to understand what is *guaranteed* and what is not. Where bugs
can hit (yes, bugs in the stack can tank a file system).
You can get corruption *anywhere* along the pathway from CPU to disk.
Anywhere. Even with ECC memory, checksums, etc. Have a good long
gander at this
http://www.snia.org/sites/default/files2/SDC2011/presentations/monday/WilliamMartin_Data_Integerity.pdf
and other articles on T10 DIF.
Understand that file systems do not give you guarantees.
If you must provide a guaranteed non-data lost system, then you need to
engineer a resilient system below the file system itself. At the file
system level, you need to use options which give you the highest
probability of surviving a data loss event. Understand that you *will*
lose data irrespective of what file system you have on there. It's the
best practices that you may or may not choose to implement that matter
here, in terms of how impactful this data loss will be.
If you don't know how to use XFS safely, that's fine. It's a very good
file system, I've personally used it since IRIX days (when I was at
SGI). Many very large organizations swear by it. Few would run without
it. But if you prefer something else, fine. Just understand you are
going to lose data with the other file system as well. Denying that
this is possible is not a viable strategy to ameliorate the damage from
the loss, and fundamentally, your focus should be on risk amelioration
with respect to your choices, not arguing with the development team over
your choices.
Now, please, back to your regularly scheduled IO RAID system ...
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Is this expected RAID10 performance?
@ 2013-06-09 23:53 Steve Bergman
2013-06-10 9:23 ` Stan Hoeppner
0 siblings, 1 reply; 32+ messages in thread
From: Steve Bergman @ 2013-06-09 23:53 UTC (permalink / raw)
To: Linux RAID
> This is almost certainly a result of forced IDE mode. With this you end
> up with a master/slave setup between the drives on each controller, and
> all of the other overhead of EIDE.
Thank you for that. Normally, I would not pursue the issue further, as
the server/filesystem is performing within 20%, on its most
challenging workload, of what it can do with the workload running in a
large tmpfs on the same machine. (I have lots of memory.) However, I'm
now engaged in the issue sufficiently that I'll be contacting Dell
tomorrow to ask them why we aren't getting what was advertised, and to
see if they have any suggestions.
So, would you expect the situation to change if there was some magic
way to make AHCI active?
I will briefly address the filesystems thing. I'm not running down
XFS. If anything, I'm shaking the bushes to see if it prompts anyone
to tell me something that I don't know about XFS which might change my
assessment of when it might be appropriate for my customers' use. I
wouldn't mind at all being able to expand use of XFS in appropriate
situations, if only to get more experience with it.
Beyond that, I'm not sure it would be constructive for you and me to
continue that conversation. I've already posted my views, and
repeating just gets... well... repetitive. ;-)
-Steve
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Is this expected RAID10 performance?
2013-06-09 23:53 Steve Bergman
@ 2013-06-10 9:23 ` Stan Hoeppner
0 siblings, 0 replies; 32+ messages in thread
From: Stan Hoeppner @ 2013-06-10 9:23 UTC (permalink / raw)
To: Steve Bergman; +Cc: Linux RAID
On 6/9/2013 6:53 PM, Steve Bergman wrote:
>> This is almost certainly a result of forced IDE mode. With this you end
>> up with a master/slave setup between the drives on each controller, and
>> all of the other overhead of EIDE.
>
> Thank you for that. Normally, I would not pursue the issue further, as
> the server/filesystem is performing within 20%, on its most
> challenging workload, of what it can do with the workload running in a
> large tmpfs on the same machine. (I have lots of memory.) However, I'm
> now engaged in the issue sufficiently that I'll be contacting Dell
> tomorrow to ask them why we aren't getting what was advertised, and to
> see if they have any suggestions.
>
> So, would you expect the situation to change if there was some magic
> way to make AHCI active?
I would expect AHCI mode to increase performance to a degree. But quite
frankly I don't know Intel's system ASICs well enough to make further
predictions. I only know such Intel ASICs from a published spec
standpoint, not direct experience or problem reports. I typically don't
use motherboard down SATA controllers for server applications, but maybe
for the occasional mirror on a less than critical machine. I don't
think the AHCI performance would be any worse.
As I stated previously, a ~$200 LSI HBA buys performance, flexibility,
and some peace of mind. For a home PC it doesn't make sense to buy an
HBA at 2x the price of the motherboard. For a business server using
either a SHV desktop board, or low end server board, it very often makes
sense.
> I will briefly address the filesystems thing. I'm not running down
> XFS. If anything, I'm shaking the bushes to see if it prompts anyone
> to tell me something that I don't know about XFS which might change my
> assessment of when it might be appropriate for my customers' use. I
> wouldn't mind at all being able to expand use of XFS in appropriate
> situations, if only to get more experience with it.
I did not say you should use XFS. I was merely rebutting some of the
statements you made about XFS. Ted, and others who've made similar
statements, are correct. You pick the filesystem that best meets your
needs. That's common sense. I don't use XFS for my boot and root
filesystems because it doesn't fit those needs in my case. I certainly
use it for user data.
> Beyond that, I'm not sure it would be constructive for you and me to
> continue that conversation. I've already posted my views, and
> repeating just gets... well... repetitive. ;-)
Of course not. You've already covered all of this and much more in your
replies to Ric, Eric, etc.
Now we get to have the real discussion: power. ;) I know you'll have
different thoughts there, as I made some statements with very broad
general recommendations on runtimes.
--
Stan
^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread
Thread overview: 32+ messages
2013-06-06 23:52 Is this expected RAID10 performance? Steve Bergman
2013-06-07 3:25 ` Stan Hoeppner
2013-06-07 7:51 ` Roger Heflin
2013-06-07 8:07 ` Alexander Zvyagin
2013-06-07 10:44 ` Steve Bergman
2013-06-07 10:52 ` Roman Mamedov
2013-06-07 11:25 ` Steve Bergman
2013-06-07 13:18 ` Stan Hoeppner
2013-06-07 13:54 ` Steve Bergman
2013-06-07 21:43 ` Bill Davidsen
2013-06-07 23:33 ` Stan Hoeppner
2013-06-07 12:39 ` Stan Hoeppner
2013-06-07 12:59 ` Steve Bergman
2013-06-07 20:51 ` Stan Hoeppner
2013-06-08 18:23 ` keld
-- strict thread matches above, loose matches on Subject: below --
2013-06-08 19:56 Steve Bergman
2013-06-09 3:08 ` Stan Hoeppner
2013-06-09 12:09 ` Ric Wheeler
2013-06-09 20:06 ` Steve Bergman
2013-06-09 21:40 ` Ric Wheeler
2013-06-09 23:08 ` Steve Bergman
2013-06-10 8:35 ` Stan Hoeppner
2013-06-10 0:11 ` Joe Landman
2013-06-09 22:05 ` Eric Sandeen
2013-06-09 23:34 ` Steve Bergman
2013-06-10 0:02 ` Eric Sandeen
2013-06-10 2:37 ` Steve Bergman
2013-06-10 10:00 ` Stan Hoeppner
2013-06-10 7:19 ` David Brown
2013-06-10 0:05 ` Joe Landman
2013-06-09 23:53 Steve Bergman
2013-06-10 9:23 ` Stan Hoeppner