linux-raid.vger.kernel.org archive mirror
* Linux MD? Or an H710p?
@ 2013-10-20  0:49 Steve Bergman
  2013-10-20  7:37 ` Stan Hoeppner
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Steve Bergman @ 2013-10-20  0:49 UTC (permalink / raw)
  To: linux-raid

Hello,

I'm configuring a PowerEdge R520 that I'll be installing RHEL 6.4 on 
next month. (Actually, Scientific Linux 6.4) I'll be upgrading to RHEL 
(SL) 7 when it's available, which is looking like it might default to XFS.

This will be a 6 drive RAID10 set up for ~100 Gnome (freenx) desktop 
users and a virtual Windows 2008 Server guest running MS-SQL, so there 
is plenty of opportunity for i/o parallelism. This seems a good fit for XFS.

My preference would be to use Linux MD RAID10. But the Dell configurator 
seems strongly inclined to force me towards hardware RAID.

My choices would be to get a PERC H310 controller that I don't need, 
plus a SAS controller that the drives would actually connect to, and use 
Linux md. Or I can go with a PERC H710p w/1GB NV cache running hardware 
RAID10. (Dell says their RAID cards have to function as RAID 
controllers, and cannot act as simple SAS controllers.)

I also have a choice between 600GB 15k drives and 600GB 10k "HYB CARR" 
drives, which I take to be 2.5" hybrid SSD/Rotational drives in a 3.5" 
mounting adapter.

Any comments on any of this? This is a bit fancier than what I usually 
configure. And I'm not sure what the performance and operational 
differences would be. I know that I'm familiar with Linux's software 
RAID tools. And I know I like the way I can replace a drive and have it 
sync up transparently in the background while the server is operational. 
I don't yet know if I can do that with the H710p card. I also like how I 
just *know* that XFS is configuring stride, etc. properly with MD. With 
the H710p, I don't know what, if anything, the card is telling the OS 
about the underlying RAID configuration. I also just plain like MD.

I like the 1GB NV cache I get if I go hardware RAID, which I don't get 
with the simple SAS controller. (I could turn off barriers.) I also like 
the fact that it seems a more standard Dell configuration. (They won't 
even connect the drives to the SAS controller at the factory.)

Any general guidance would be appreciated. We'll probably be keeping 
this server for 7 years, and it's pretty important to us. So I'm really 
wanting to get this right.

Thanks,
Steve Bergman


* Re: Linux MD? Or an H710p?
  2013-10-20  0:49 Linux MD? Or an H710p? Steve Bergman
@ 2013-10-20  7:37 ` Stan Hoeppner
  2013-10-20  8:50 ` Mikael Abrahamsson
  2013-10-21 14:18 ` John Stoffel
  2 siblings, 0 replies; 17+ messages in thread
From: Stan Hoeppner @ 2013-10-20  7:37 UTC (permalink / raw)
  To: Steve Bergman, linux-raid

On 10/19/2013 7:49 PM, Steve Bergman wrote:
> Hello,
> 
> I'm configuring a PowerEdge R520 that I'll be installing RHEL 6.4 on
> next month. (Actually, Scientific Linux 6.4) I'll be upgrading to RHEL
> (SL) 7 when it's available, which is looking like it might default to XFS.
> 
> This will be a 6 drive RAID10 set up for ~100 Gnome (freenx) desktop
> users and a virtual Windows 2008 Server guest running MS-SQL, so there
> is plenty of opportunity for i/o parallelism. This seems a good fit for
> XFS.
> 
> My preference would be to use Linux MD RAID10. But the Dell configurator
> seems strongly inclined to force me towards hardware RAID.
> 
> My choices would be to get a PERC H310 controller that I don't need,
> plus a SAS controller that the drives would actually connect to, and use
> Linux md. Or I can go with a PERC H710p w/1GB NV cache running hardware
> RAID10. (Dell says their RAID cards have to function as RAID
> controllers, and cannot act as simple SAS controllers.)
> 
> I also have a choice between 600GB 15k drives and 600GB 10k "HYB CARR"
> drives, which I take to be 2.5" hybrid SSD/Rotational drives in a 3.5"
> mounting adapter.
> 
> Any comments on any of this? This is a bit fancier than what I usually
> configure. And I'm not sure what the performance and operational
> differences would be. I know that I'm familiar with Linux's software
> RAID tools. And I know I like the way I can replace a drive and have it
> sync up transparently in the background while the server is operational.
> I don't yet know if I can do that with the H710p card. I also like how I
> just *know* that XFS is configuring stride, etc. properly with MD. With
> the H710p, I don't know what, if anything, the card is telling the OS
> about the underlying RAID configuration. I also just plain like MD.
> 
> I like the 1GB NV cache I get if I go hardware RAID, which I don't get
> with the simple SAS controller. (I could turn off barriers.) I also like
> the fact that it seems a more standard Dell configuration. (They won't
> even connect the drives to the SAS controller at the factory.)
> 
> Any general guidance would be appreciated. We'll probably be keeping
> this server for 7 years, and it's pretty important to us. So I'm really
> wanting to get this right.

Do what everyone else does in this situation:  Buy the box with
everything you want minus the disk controller.  Purchase an LSI 9211-8i
and cables, pop the lid and install it, takes 5 minutes tops.  Runs $300
for the KIT, about $250 if you buy the OEM card and 2x .5M cables
separately.

http://www.lsi.com/products/host-bus-adapters/pages/lsi-sas-9211-8i.aspx

-- 
Stan



* Re: Linux MD? Or an H710p?
  2013-10-20  0:49 Linux MD? Or an H710p? Steve Bergman
  2013-10-20  7:37 ` Stan Hoeppner
@ 2013-10-20  8:50 ` Mikael Abrahamsson
  2013-10-21 14:18 ` John Stoffel
  2 siblings, 0 replies; 17+ messages in thread
From: Mikael Abrahamsson @ 2013-10-20  8:50 UTC (permalink / raw)
  To: Steve Bergman; +Cc: linux-raid

On Sat, 19 Oct 2013, Steve Bergman wrote:

> Any general guidance would be appreciated. We'll probably be keeping 
> this server for 7 years, and it's pretty important to us. So I'm really 
> wanting to get this right.

I prefer to use hardware RAID for the boot device, because it can be a 
mess to make sure grub can boot off either of the two boot drives, and 
to ensure that this remains true over time.
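
For the md route, the usual countermeasure is to install the bootloader 
on both mirror members and re-check it after every drive swap or grub 
update; a minimal sketch, assuming /dev/sda and /dev/sdb are the two 
RAID1 boot members on a grub-legacy distro such as RHEL 6:

  grub-install /dev/sda
  grub-install /dev/sdb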

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: Linux MD? Or an H710p?
  2013-10-20  0:49 Linux MD? Or an H710p? Steve Bergman
  2013-10-20  7:37 ` Stan Hoeppner
  2013-10-20  8:50 ` Mikael Abrahamsson
@ 2013-10-21 14:18 ` John Stoffel
  2013-10-22  0:36   ` Steve Bergman
  2 siblings, 1 reply; 17+ messages in thread
From: John Stoffel @ 2013-10-21 14:18 UTC (permalink / raw)
  To: Steve Bergman; +Cc: linux-raid


Steve> I'm configuring a PowerEdge R520 that I'll be installing RHEL
Steve> 6.4 on next month. (Actually, Scientific Linux 6.4) I'll be
Steve> upgrading to RHEL (SL) 7 when it's available, which is looking
Steve> like it might default to XFS.

Steve> This will be a 6 drive RAID10 set up for ~100 Gnome (freenx)
Steve> desktop users and a virtual Windows 2008 Server guest running
Steve> MS-SQL, so there is plenty of opportunity for i/o
Steve> parallelism. This seems a good fit for XFS.

So are you keeping home directories on here as well?  And how busy
will the MS-SQL server be?  That's probably where most of your IO will
come from I suspect.  Also, make sure you get lots of memory.  The
more your freenx server can cache in memory, the better things will
be.  

I also note that under CentOS 6.4, Firefox 22 has a tendency to grow
without bound, sucking up all the memory and causing the system to bog
down.  I admit I'm reading email via OWA, using ServiceNow, and keeping
lots of tabs open, but basically its memory usage sucks.  And I'm using
freenx as well to access my desktop.  

I do admit I'm using a 3rd party repo, so I'm running:

  firefox-22.0-1.el6.remi.x86_64

Steve> My preference would be to use Linux MD RAID10. But the Dell
Steve> configurator seems strongly inclined to force me towards
Steve> hardware RAID.

Skip the configurator and just buy a controller separately, from a third party.  

Steve> My choices would be to get a PERC H310 controller that I don't need, 
Steve> plus a SAS controller that the drives would actually connect to, and use 
Steve> Linux md. Or I can go with a PERC H710p w/1GB NV cache running hardware 
Steve> RAID10. (Dell says their RAID cards have to function as RAID 
Steve> controllers, and cannot act as simple SAS controllers.)

Steve> I also have a choice between 600GB 15k drives and 600GB 10k "HYB CARR" 
Steve> drives, which I take to be 2.5" hybrid SSD/Rotational drives in a 3.5" 
Steve> mounting adapter.

Is your key metric latency, or throughput?

John


* Re: Linux MD? Or an H710p?
  2013-10-21 14:18 ` John Stoffel
@ 2013-10-22  0:36   ` Steve Bergman
  2013-10-22  7:24     ` David Brown
  2013-10-22 16:43     ` Stan Hoeppner
  0 siblings, 2 replies; 17+ messages in thread
From: Steve Bergman @ 2013-10-22  0:36 UTC (permalink / raw)
  Cc: linux-raid

First of all, thank you Stan, Mikael, and John for your replies.

Stan,

I had made a private bet with myself that Stan Hoeppner would be the 
first to respond to my query. And I was not disappointed. In fact, I was 
hoping for advice from you. We're getting the 7 yr hardware support 
contract from Dell, and I'm a little concerned about "finger-pointing" 
issues with regards to putting in a non-Dell SAS controller. Network 
card? No problem. But drive controller? Forgive me for "white-knuckling" 
on this a bit. But I have gotten an OK to order the server with both the 
H710p and the mystery "SAS 6Gbps HBA External Controller [$148.55]" for 
which no one at Dell seems to be able to tell me the pedigree. So I can 
configure both ways and see which I like. I do find that 1GB NV cache 
with barriers turned off to be attractive.

But hey, this is going to be a very nice opportunity for observing XFS's 
savvy with parallel i/o. And I'm looking forward to it. BTW, it's the 
problematic COBOL Point of Sale app that didn't do fsyncs that is being 
migrated to its Windows-only MS-SQL version in the virtualized instance 
of Windows 2008 Server. At least it will be a virtualized instance on 
this server if I get my way. Essentially, our core business is moving 
from Linux to Windows in this move. C'est la vie. I did my best. NCR won.

Mikael,

That's a good point. I know that at one time RHEL didn't get that right 
in its Grub config. I've been assuming that in 2013 it's a "taken for 
granted" thing, with the caveat that nothing involving the bootloader 
and boot sectors can ever be completely taken for granted.

John,

First, let me get an embarrassing misinterpretation out of the way. "HYB 
CARR" stands for "hybrid carrier" which is a fancy name for a 2.5" -> 
3.5" drive mounting adapter.

Fortunately, this is a workload (varied as it is) with which I am 
extremely familiar. Yes, Firefox uses (abuses?) memory aggressively. But 
if necessary, I can control that with system-wide lockprefs. This 
server, which ended up being a Dell R720, will have an insane 256GB of 
memory in a mirrored configuration, resulting in an effective (and half 
as insane) 128GB visible to the OS. In 7 years' time that should seem 
only about 1/25th as insane. And we'll just have to see about the 50% 
memory bandwidth hit we take for mirroring.

But anyway, I know that 16GB was iffy for the same workload 5 years ago. 
And we've expanded a bit. I think I could reasonably run what we're 
doing now on 24GB. Which means that we'd probably need something between 
that and 32GB, because my brain tends to underestimate these things. We 
currently are running on 48GB, which is so roomy that it makes it hard 
to tell.

-Steve



* Re: Linux MD? Or an H710p?
  2013-10-22  0:36   ` Steve Bergman
@ 2013-10-22  7:24     ` David Brown
  2013-10-22 15:29       ` keld
  2013-10-22 16:56       ` Stan Hoeppner
  2013-10-22 16:43     ` Stan Hoeppner
  1 sibling, 2 replies; 17+ messages in thread
From: David Brown @ 2013-10-22  7:24 UTC (permalink / raw)
  To: Steve Bergman; +Cc: linux-raid

On 22/10/13 02:36, Steve Bergman wrote:

<snip>

> But hey, this is going to be a very nice opportunity for observing XFS's
> savvy with parallel i/o.

You mentioned using a 6-drive RAID10 in your first email, with XFS on
top of that.  Stan is the expert here, but my understanding is that you
should go for three 2-drive RAID1 pairs, and then use an md linear
"raid" for these pairs and put XFS on top of that in order to get the
full benefits of XFS parallelism.
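
A minimal sketch of that layout, with assumed device names (and not a
recommendation for this particular workload):

  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd /dev/sde
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdf /dev/sdg
  mdadm --create /dev/md10 --level=linear --raid-devices=3 \
        /dev/md1 /dev/md2 /dev/md3
  mkfs.xfs /dev/md10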

mvh.,

David



* Re: Linux MD? Or an H710p?
  2013-10-22  7:24     ` David Brown
@ 2013-10-22 15:29       ` keld
  2013-10-22 16:56       ` Stan Hoeppner
  1 sibling, 0 replies; 17+ messages in thread
From: keld @ 2013-10-22 15:29 UTC (permalink / raw)
  To: David Brown; +Cc: Steve Bergman, linux-raid

It would be nice if we could get some benchmarks on this.
I would also be interested in figures from a standard raid10,far configuration.
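
Something like the following might be a rough starting point for such a
comparison (a sketch only; the mount point, job count and sizes are
assumptions and would need tuning to resemble the real workload):

  fio --name=parallel-rw --directory=/mnt/test --ioengine=libaio \
      --direct=1 --rw=randrw --bs=4k --size=1G --numjobs=16 \
      --iodepth=16 --runtime=120 --time_based --group_reporting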

best regards
keld

On Tue, Oct 22, 2013 at 09:24:57AM +0200, David Brown wrote:
> On 22/10/13 02:36, Steve Bergman wrote:
> 
> <snip>
> 
> > But hey, this is going to be a very nice opportunity for observing XFS's
> > savvy with parallel i/o.
> 
> You mentioned using a 6-drive RAID10 in your first email, with XFS on
> top of that.  Stan is the expert here, but my understanding is that you
> should go for three 2-drive RAID1 pairs, and then use an md linear
> "raid" for these pairs and put XFS on top of that in order to get the
> full benefits of XFS parallelism.
> 
> mvh.,
> 
> David
> 


* Re: Linux MD? Or an H710p?
  2013-10-22  0:36   ` Steve Bergman
  2013-10-22  7:24     ` David Brown
@ 2013-10-22 16:43     ` Stan Hoeppner
  1 sibling, 0 replies; 17+ messages in thread
From: Stan Hoeppner @ 2013-10-22 16:43 UTC (permalink / raw)
  To: Steve Bergman; +Cc: linux-raid

On 10/21/2013 7:36 PM, Steve Bergman wrote:
> First of all, thank you Stan, Mikael, and John for your replies.
> 
> Stan,
> 
> I had made a private bet with myself that Stan Hoeppner would be the
> first to respond to my query. And I was not disappointed. In fact, I was
> hoping for advice from you. 

No need to bet.  Just assume.  ;)

> We're getting the 7 yr hardware support
> contract from Dell, 

Insane, 7 years is.  Exist then, Dell may not.  Clouded, Dell's future is.

> and I'm a little concerned about "finger-pointing"
> issues with regards to putting in a non-Dell SAS controller. 

Then use a Dell HBA.  They're all LSI products anyway, and have been
since the mid 90s when Dell re-badged their first AMI MegaRAID cards as
"Power Edge RAID Controller".

The PERC H310 is a no-cache RAID HBA, i.e. a fancy SAS/SATA HBA with
extremely low performance firmware based RAID5.  Its hardware RAID1/10
performance isn't bad, and allows booting from an array device sans the
headaches of booting md based RAID.  It literally is the Dell OEM LSI
9240-8i, identical but for Dell branded firmware and the PCB.  You can
use it in JBOD mode, i.e. as a vanilla SAS/SATA HBA.  See page 43:

ftp://ftp.dell.com/manuals/all-products/esuprt_ser_stor_net/esuprt_dell_adapters/poweredge-rc-h310_User%27s%20Guide_en-us.pdf

You can also use it in mixed mode, configuring two disk drives as a
hardware RAID1 set, and booting from it, and configuring the other
drives as non-virtual disks, i.e. standalone drives for md/RAID use.
This requires 8 drives if you want a 6 disk md/RAID10.  Why, you ask?
Because you cannot intermix hardware RAID and software RAID on any given
drive, obviously.
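
If you did go that route, the md side would be nothing exotic; a sketch
with assumed device names for the six pass-through drives:

  mdadm --create /dev/md0 --level=10 --raid-devices=6 /dev/sd[b-g]
  mkfs.xfs /dev/md0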

Frankly, if you plan to buy only 6 drives for a single RAID10 volume,
there is no reason to use md/RAID at all.  It will provide no advantage
for your stated application, as the firmware RAID executes plenty fast
enough on the LSISAS2008 ASIC of the H310 to handle the
striping/mirroring of 6 disks, with no appreciable decrease in IOPS.
Though for another $190 you can have the H710 with 512MB NVWC.  The
extra 512MB of the 710P won't gain you anything, yet costs an extra $225.

The $190 above the H310 is a great investment for the occasion that your
UPS takes a big dump and downs the server.  With the H310 you will lose
data, corrupting users' Gnome config files, and possibly suffer
filesystem corruption.  The H710 will also give a bump to write IOPS,
i.e. end user responsiveness with your workload.

All things considered, my advice is to buy the H710 at $375 and use
hardware RAID10 on 6 disks.  Make /boot, root, /home, etc on the single
RAID disk device.  I didn't give you my advice in my first reply, as you
seemed set on using md.

> Network
> card? No problem. But drive controller? Forgive me for "white-knuckling"
> on this a bit. But I have gotten an OK to order the server with both the
> H710p and the mystery "SAS 6Gbps HBA External Controller [$148.55]" for

Note "External".  You don't know what an SFF8088 port is.  See:
http://www.backupworks.com/productimages/lsilogic/lsias9205-8e.jpg

You do not plan to connect an MD1200/1220/3200 JBOD chassis.  You don't
need, nor want, this "External" SAS HBA.

> which no one at Dell seems to be able to tell me the pedigree. So I can

It's a Dell OEM card, sourced from LSI.  But for $150 I'd say it's a 4
port card, w/ single SFF8088 connector.  Doesn't matter.  You can't use it.

> configure both ways and see which I like. 

Again, you'll need 8 drives for the md solution.

> I do find that 1GB NV cache
> with barriers turned off to be attractive.

Make sure you use kernel 3.0.0 or later, and edit fstab with inode64
mount option, as well as nobarrier.
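
For instance, a hypothetical fstab line for the data filesystem (device
path assumed) would be:

  /dev/sdb1  /home  xfs  defaults,inode64,nobarrier  0 0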

> But hey, this is going to be a very nice opportunity for observing XFS's
> savvy with parallel i/o. And I'm looking forward to it. 

Well given that you've provided zero detail about the workload in this
thread I can't comment.

> BTW, it's the
> problematic COBOL Point of Sale app 

Oh God... you're *that* guy?  ;)

> that didn't do fsyncs that is being
> migrated to its Windows-only MS-SQL version in the virtualized instance

Ok, so now we have the reason for the Windows VM and MSSQL.

> of Windows 2008 Server. At least it will be a virtualized instance on
> this server if I get my way. 

Did you happen to notice during your virtual machine educational
excursions that fsync is typically treated as a noop by many
hypervisors?  I'd definitely opt for a persistent cache RAID controller.

> Essentially, our core business is moving
> from Linux to Windows in this move. C'est la vie. I did my best. NCR won.

It's really difficult to believe POS vendors are moving away from some
of the most proprietary, and secure (if not just obscure) systems on the
planet, for decades running System/36, AT&T SYS V, SCO, Linux, and now
to... Windows?

-- 
Stan



* Re: Linux MD? Or an H710p?
  2013-10-22  7:24     ` David Brown
  2013-10-22 15:29       ` keld
@ 2013-10-22 16:56       ` Stan Hoeppner
  2013-10-23  7:03         ` David Brown
  1 sibling, 1 reply; 17+ messages in thread
From: Stan Hoeppner @ 2013-10-22 16:56 UTC (permalink / raw)
  To: David Brown, Steve Bergman; +Cc: linux-raid

On 10/22/2013 2:24 AM, David Brown wrote:
> On 22/10/13 02:36, Steve Bergman wrote:
> 
> <snip>
> 
>> But hey, this is going to be a very nice opportunity for observing XFS's
>> savvy with parallel i/o.
> 
> You mentioned using a 6-drive RAID10 in your first email, with XFS on
> top of that.  Stan is the expert here, but my understanding is that you
> should go for three 2-drive RAID1 pairs, and then use an md linear
> "raid" for these pairs and put XFS on top of that in order to get the
> full benefits of XFS parallelism.

XFS on a concatenation, which is what you described above, is a very
workload specific storage architecture.  It is not a general use
architecture, and almost never good for database workloads.  Here most
of the data is stored in a single file or a small set of files, in a
single directory.  With such a DB workload and 3 concatenated mirrors,
only 1/3rd of the spindles would see the vast majority of the IO.

-- 
Stan



* Re: Linux MD? Or an H710p?
  2013-10-22 16:56       ` Stan Hoeppner
@ 2013-10-23  7:03         ` David Brown
  2013-10-24  6:23           ` Stan Hoeppner
  0 siblings, 1 reply; 17+ messages in thread
From: David Brown @ 2013-10-23  7:03 UTC (permalink / raw)
  To: stan; +Cc: Steve Bergman, linux-raid

On 22/10/13 18:56, Stan Hoeppner wrote:
> On 10/22/2013 2:24 AM, David Brown wrote:
>> On 22/10/13 02:36, Steve Bergman wrote:
>>
>> <snip>
>>
>>> But hey, this is going to be a very nice opportunity for observing XFS's
>>> savvy with parallel i/o.
>>
>> You mentioned using a 6-drive RAID10 in your first email, with XFS on
>> top of that.  Stan is the expert here, but my understanding is that you
>> should go for three 2-drive RAID1 pairs, and then use an md linear
>> "raid" for these pairs and put XFS on top of that in order to get the
>> full benefits of XFS parallelism.
> 
> XFS on a concatenation, which is what you described above, is a very
> workload specific storage architecture.  It is not a general use
> architecture, and almost never good for database workloads.  Here most
> of the data is stored in a single file or a small set of files, in a
> single directory.  With such a DB workload and 3 concatenated mirrors,
> only 1/3rd of the spindles would see the vast majority of the IO.
> 

That's a good point - while I had noted that the OP was running a
database, I forgot it was a virtual windows machine and MS SQL database.
 The virtual machine will use a single large file for its virtual
harddisk image, and so RAID10 + XFS will beat RAID1 + concat + XFS.

On the other hand, he is also serving 100+ freenx desktop users.  As far
as I understand it (and I'm very happy for corrections if I'm wrong),
that will mean a /home directory with 100+ sub-directories for the
different users - and that /is/ one of the ideal cases for concat+XFS
parallelism.

Only the OP can say which type of access is going to dominate and where
the balance should go.


As a more general point, I don't know that you can generalise that
database workloads normally store data in a single big file or a small
set of files.  I haven't worked with many databases, and none more than
a few hundred MB, so I am theorising here on things I have read rather
than personal practice.  But certainly with postgresql the data is split
across multiple files - each database has its own directory, and each
table its own file (or files).  Very big tables are split into multiple
segment files - and at some point,
they will hit the allocation group size and then be split over multiple
AG's, leading to parallelism (with a bit of luck).  I am guessing other
databases are somewhat similar.  Of course, like any database tuning,
this will all be highly load-dependent.







* Re: Linux MD? Or an H710p?
@ 2013-10-23 19:05 Drew
  0 siblings, 0 replies; 17+ messages in thread
From: Drew @ 2013-10-23 19:05 UTC (permalink / raw)
  To: David Brown; +Cc: Stan Hoeppner, Steve Bergman, Linux RAID Mailing List

> As a more general point, I don't know that you can generalise that
> database workloads normally store data in a single big file or a small
> set of files.  I haven't worked with many databases, and none more than
> a few hundred MB, so I am theorising here on things I have read rather
> than personal practice.  But certainly with postgresql the data is split
> across multiple files - each database has its own directory, and each
> table its own file (or files).  Very big tables are split into multiple
> segment files - and at some point,
> they will hit the allocation group size and then be split over multiple
> AG's, leading to parallelism (with a bit of luck).  I am guessing other
> databases are somewhat similar.  Of course, like any database tuning,
> this will all be highly load-dependent.

MS SQL Server does tend to store each database in its own file no
matter the size. Ran into this with a VMware ESXi cluster maintained
by a vCenter instance running SQL Server Express on Windows Server
2008r2.

Both SQL Server Express 2005 & 2008 store the entire DB in one large
file. I know this because I ran up against the file size limit of
Express '05 when the DB tables storing performance data grew to its
allowed maximum. I had to upgrade to '08 and clean out old performance
data to make vCenter happy again.


-- 
Drew

"Nothing in life is to be feared. It is only to be understood."
--Marie Curie

"This started out as a hobby and spun horribly out of control."
-Unknown


* Re: Linux MD? Or an H710p?
  2013-10-23  7:03         ` David Brown
@ 2013-10-24  6:23           ` Stan Hoeppner
  2013-10-24  7:26             ` David Brown
  0 siblings, 1 reply; 17+ messages in thread
From: Stan Hoeppner @ 2013-10-24  6:23 UTC (permalink / raw)
  To: David Brown; +Cc: Steve Bergman, linux-raid

On 10/23/2013 2:03 AM, David Brown wrote:

> On the other hand, he is also serving 100+ freenx desktop users.  As far
> as I understand it (and I'm very happy for corrections if I'm wrong),
> that will mean a /home directory with 100+ sub-directories for the
> different users - and that /is/ one of the ideal cases for concat+XFS
> parallelism.

No, it is /not/.  Homedir storage is not an ideal use case.  It's not
even in the ballpark.  There's simply not enough parallelism nor IOPS
involved, and file sizes can vary substantially, so the workload is not
deterministic, i.e. it is "general".  Recall I said in my last reply
that this "is a very workload specific storage architecture"?

Workloads that benefit from XFS over concatenated disks are those that:

1.  Expose inherent limitations and/or inefficiencies of striping,
    at the filesystem, elevator, and/or hardware level

2.  Exhibit a high degree of directory level parallelism

3.  Exhibit high IOPS or data rates

4.  Most importantly, exhibit relatively deterministic IO patterns

Typical homedir storage meets none of these criteria.  Homedir files on
a GUI desktop terminal server are not 'typical', but the TS workload
doesn't meet these criteria either.

-- 
Stan



* Re: Linux MD? Or an H710p?
  2013-10-24  6:23           ` Stan Hoeppner
@ 2013-10-24  7:26             ` David Brown
  2013-10-25  9:34               ` Stan Hoeppner
  0 siblings, 1 reply; 17+ messages in thread
From: David Brown @ 2013-10-24  7:26 UTC (permalink / raw)
  To: stan; +Cc: Steve Bergman, linux-raid

On 24/10/13 08:23, Stan Hoeppner wrote:
> On 10/23/2013 2:03 AM, David Brown wrote:
> 
>> On the other hand, he is also serving 100+ freenx desktop users.  As far
>> as I understand it (and I'm very happy for corrections if I'm wrong),
>> that will mean a /home directory with 100+ sub-directories for the
>> different users - and that /is/ one of the ideal cases for concat+XFS
>> parallelism.
> 
> No, it is /not/.  Homedir storage is not an ideal use case.  It's not
> even in the ballpark.  There's simply not enough parallelism nor IOPS
> involved, and file sizes can vary substantially, so the workload is not
> deterministic, i.e. it is "general".  Recall I said in my last reply
> that this "is a very workload specific storage architecture"?
> 
> Workloads that benefit from XFS over concatenated disks are those that:
> 
> 1.  Expose inherent limitations and/or inefficiencies of striping,
>     at the filesystem, elevator, and/or hardware level
> 
> 2.  Exhibit a high degree of directory level parallelism
> 
> 3.  Exhibit high IOPS or data rates
> 
> 4.  Most importantly, exhibit relatively deterministic IO patterns
> 
> Typical homedir storage meets none of these criteria.  Homedir files on
> a GUI desktop terminal server are not 'typical', but the TS workload
> doesn't meet these criteria either.
> 

I am trying to learn from your experience and knowledge here, so thank
you for your time so far.  Hopefully it is also of use and interest to
others - that's one of the beauties of public mailing lists.


Am I correct in thinking that a common "ideal use case" is a mail server
with lots of accounts, especially with maildir structures, so that
accesses are spread across lots of directories with typically many
parallel accesses to many small files?


First, to make sure I am not making any technical errors here, I believe
that when you make your XFS over a linear concat, the allocation groups
are spread evenly across the parts of the concat so that logically (by
number) adjacent AG's will be on different underlying disks.  When you
make a new directory on the filesystem, it gets put in a different AG
(wrapping around, of course, and overflowing when necessary).  Thus if
you make three directories, and put a file in each directory, then each
file will be on a different disk.  (I believe older XFS only allocated
different top-level directories to different AG's, but current XFS does
so for all directories).



I have been thinking about what the XFS over concat gives you compared
to XFS over raid0 on the same disks (or raid1 pairs - the details don't
matter much).

First, consider small files.  Access to small files (smaller than the
granularity of the raid0 chunks) will usually only involve one disk of
the raid0 stripe, and will /definitely/ only involve one disk of the
concat.  You should be able to access multiple small files in parallel,
if you are lucky in the mix (with raid0, this "luck" will be mostly
random, while with concat it will depend on the mix of files within
directories.  In particular, multiple files within the same directory
will not be paralleled).  With a concat, all relevant accesses such as
directory reads and inode table access will be within the same disk as
the file, while with raid0 it could easily be a different disk - but
such accesses are often cached in ram.  With raid0 you have the chance
of the small file spanning two disks, leading to longer latency for that
file and for other parallel accesses.

All in all, small file access should not be /too/ different - but my
guess is concat has the edge for lowest overall latency with multiple
parallel accesses, as I think concat will avoid jumps between disks better.


For large files, there is a bigger difference.  Raid0 gives striping for
higher throughput - but these accesses block the parallel accesses to
other files.  concat has slower throughput as there is no striping, but
the other disks are free for parallel accesses (big or small).


To my mind, this boils down to a question of balancing - concat gives
lower average latencies with highly parallel accesses, but sacrifices
maximum throughput of large files.  If you don't have lots of parallel
accesses, then concat gains little or nothing compared to raid0.


If I try to match up this with the points you made, point 1 about
striping is clear - this is a major difference between concat and raid0.
 Point 2 and 3 about parallelism and high IOPs (and therefore low
latency) is also clear - if you don't need such access, concat will give
you nothing.

Only the OP can decide if his usage will meet these points.

But I am struggling with point 4 - "most importantly, exhibit relatively
deterministic IO patterns".  All you need is to have your file accesses
spread amongst a range of directories.  If the number of (roughly)
parallel accesses is big enough, you'll get a fairly even spread across
the disks - and if it is not big enough for that, you haven't matched
point 2.  This is not really much different from raid0 - small accesses
will be scattered across the different disks.  The big difference comes
when there is a large file access - with raid0, you will block /all/
other accesses for a time, while with concat (over three disks) you will
block one third of the accesses for three times as long.




* Re: Linux MD? Or an H710p?
  2013-10-24  7:26             ` David Brown
@ 2013-10-25  9:34               ` Stan Hoeppner
  2013-10-25 11:42                 ` David Brown
  0 siblings, 1 reply; 17+ messages in thread
From: Stan Hoeppner @ 2013-10-25  9:34 UTC (permalink / raw)
  To: David Brown; +Cc: Steve Bergman, linux-raid

On 10/24/2013 2:26 AM, David Brown wrote:
> On 24/10/13 08:23, Stan Hoeppner wrote:
>> On 10/23/2013 2:03 AM, David Brown wrote:
>>
>>> On the other hand, he is also serving 100+ freenx desktop users.  As far
>>> as I understand it (and I'm very happy for corrections if I'm wrong),
>>> that will mean a /home directory with 100+ sub-directories for the
>>> different users - and that /is/ one of the ideal cases for concat+XFS
>>> parallelism.
>>
>> No, it is /not/.  Homedir storage is not an ideal use case.  It's not
>> even in the ballpark.  There's simply not enough parallelism nor IOPS
>> involved, and file sizes can vary substantially, so the workload is not
>> deterministic, i.e. it is "general".  Recall I said in my last reply
>> that this "is a very workload specific storage architecture"?
>>
>> Workloads that benefit from XFS over concatenated disks are those that:
>>
>> 1.  Expose inherent limitations and/or inefficiencies of striping,
>>     at the filesystem, elevator, and/or hardware level
>>
>> 2.  Exhibit a high degree of directory level parallelism
>>
>> 3.  Exhibit high IOPS or data rates
>>
>> 4.  Most importantly, exhibit relatively deterministic IO patterns
>>
>> Typical homedir storage meets none of these criteria.  Homedir files on
>> a GUI desktop terminal server are not 'typical', but the TS workload
>> doesn't meet these criteria either.

If you could sum up everything below into a couple of short, direct,
coherent questions you have, I'd be glad to address them.




> I am trying to learn from your experience and knowledge here, so thank
> you for your time so far.  Hopefully it is also of use and interest to
> others - that's one of the beauties of public mailing lists.
> 
> 
> Am I correct in thinking that a common "ideal use case" is a mail server
> with lots of accounts, especially with maildir structures, so that
> accesses are spread across lots of directories with typically many
> parallel accesses to many small files?
> 
> 
> First, to make sure I am not making any technical errors here, I believe
> that when you make your XFS over a linear concat, the allocation groups
> are spread evenly across the parts of the concat so that logically (by
> number) adjacent AG's will be on different underlying disks.  When you
> make a new directory on the filesystem, it gets put in a different AG
> (wrapping around, of course, and overflowing when necessary).  Thus if
> you make three directories, and put a file in each directory, then each
> file will be on a different disk.  (I believe older XFS only allocated
> different top-level directories to different AG's, but current XFS does
> so for all directories).
> 
> 
> 
> I have been thinking about what the XFS over concat gives you compared
> to XFS over raid0 on the same disks (or raid1 pairs - the details don't
> matter much).
> 
> First, consider small files.  Access to small files (smaller than the
> granularity of the raid0 chunks) will usually only involve one disk of
> the raid0 stripe, and will /definitely/ only involve one disk of the
> concat.  You should be able to access multiple small files in parallel,
> if you are lucky in the mix (with raid0, this "luck" will be mostly
> random, while with concat it will depend on the mix of files within
> directories.  In particular, multiple files within the same directory
> will not be paralleled).  With a concat, all relevant accesses such as
> directory reads and inode table access will be within the same disk as
> the file, while with raid0 it could easily be a different disk - but
> such accesses are often cached in ram.  With raid0 you have the chance
> of the small file spanning two disks, leading to longer latency for that
> file and for other parallel accesses.
> 
> All in all, small file access should not be /too/ different - but my
> guess is concat has the edge for lowest overall latency with multiple
> parallel accesses, as I think concat will avoid jumps between disks better.
> 
> 
> For large files, there is a bigger difference.  Raid0 gives striping for
> higher throughput - but these accesses block the parallel accesses to
> other files.  concat has slower throughput as there is no striping, but
> the other disks are free for parallel accesses (big or small).
> 
> 
> To my mind, this boils down to a question of balancing - concat gives
> lower average latencies with highly parallel accesses, but sacrifices
> maximum throughput of large files.  If you don't have lots of parallel
> accesses, then concat gains little or nothing compared to raid0.
> 
> 
> If I try to match up this with the points you made, point 1 about
> striping is clear - this is a major difference between concat and raid0.
>  Point 2 and 3 about parallelism and high IOPs (and therefore low
> latency) is also clear - if you don't need such access, concat will give
> you nothing.
> 
> Only the OP can decide if his usage will meet these points.
> 
> But I am struggling with point 4 - "most importantly, exhibit relatively
> deterministic IO patterns".  All you need is to have your file accesses
> spread amongst a range of directories.  If the number of (roughly)
> parallel accesses is big enough, you'll get a fairly even spread across
> the disks - and if it is not big enough for that, you haven't matched
> point 2.  This is not really much different from raid0 - small accesses
> will be scattered across the different disks.  The big difference comes
> when there is a large file access - with raid0, you will block /all/
> other accesses for a time, while with concat (over three disks) you will
> block one third of the accesses for three times as long.



* Re: Linux MD? Or an H710p?
  2013-10-25  9:34               ` Stan Hoeppner
@ 2013-10-25 11:42                 ` David Brown
  2013-10-26  9:37                   ` Stan Hoeppner
  0 siblings, 1 reply; 17+ messages in thread
From: David Brown @ 2013-10-25 11:42 UTC (permalink / raw)
  To: stan; +Cc: Steve Bergman, linux-raid

On 25/10/13 11:34, Stan Hoeppner wrote:
> On 10/24/2013 2:26 AM, David Brown wrote:
>> On 24/10/13 08:23, Stan Hoeppner wrote:
>>> On 10/23/2013 2:03 AM, David Brown wrote:
>>> 
>>>> On the other hand, he is also serving 100+ freenx desktop
>>>> users.  As far as I understand it (and I'm very happy for
>>>> corrections if I'm wrong), that will mean a /home directory
>>>> with 100+ sub-directories for the different users - and that
>>>> /is/ one of the ideal cases for concat+XFS parallelism.
>>> 
>>> No, it is /not/.  Homedir storage is not an ideal use case.  It's
>>> not even in the ballpark.  There's simply not enough parallelism
>>> nor IOPS involved, and file sizes can vary substantially, so the
>>> workload is not deterministic, i.e. it is "general".  Recall I
>>> said in my last reply that this "is a very workload specific
>>> storage architecture"?
>>> 
>>> Workloads that benefit from XFS over concatenated disks are those
>>> that:
>>> 
>>> 1.  Expose inherent limitations and/or inefficiencies of
>>> striping, at the filesystem, elevator, and/or hardware level
>>> 
>>> 2.  Exhibit a high degree of directory level parallelism
>>> 
>>> 3.  Exhibit high IOPS or data rates
>>> 
>>> 4.  Most importantly, exhibit relatively deterministic IO
>>> patterns
>>> 
>>> Typical homedir storage meets none of these criteria.  Homedir
>>> files on a GUI desktop terminal server are not 'typical', but the
>>> TS workload doesn't meet these criteria either.
> 
> If you could sum up everything below into a couple of short, direct, 
> coherent questions you have, I'd be glad to address them.
> 

Maybe I've been rambling a bit much.  I am not sure I can be very short
while still explaining my reasoning, but these are the three most
important paragraphs.  They are statements that I hope to get confirmed
or corrected, rather than questions as such.


First, to make sure I am not making any technical errors here, I
believe that when you make your XFS over a linear concat, the
allocation groups are spread evenly across the parts of the concat
so that logically (by number) adjacent AG's will be on different
underlying disks.  When you make a new directory on the filesystem,
it gets put in a different AG (wrapping around, of course, and
overflowing when necessary).  Thus if you make three directories,
and put a file in each directory, then each file will be on a
different disk.  (I believe older XFS only allocated different
top-level directories to different AG's, but current XFS does so
for all directories).

<snip>

To my mind, this boils down to a question of balancing - concat
gives lower average latencies with highly parallel accesses, but
sacrifices maximum throughput of large files.  If you don't have
lots of parallel accesses, then concat gains little or nothing
compared to raid0.

<snip>

But I am struggling with point 4 - "most importantly, exhibit
relatively deterministic IO patterns".  All you need is to have
your file accesses spread amongst a range of directories.  If the
number of (roughly) parallel accesses is big enough, you'll get a
fairly even spread across the disks - and if it is not big enough
for that, you haven't matched point 2.  This is not really much
different from raid0 - small accesses will be scattered across the
different disks.  The big difference comes when there is a large
file access - with raid0, you will block /all/ other accesses for a
time, while with concat (over three disks) you will block one third
of the accesses for three times as long.


mvh.,

David



> 
> 
> 
>> I am trying to learn from your experience and knowledge here, so
>> thank you for your time so far.  Hopefully it is also of use and
>> interest to others - that's one of the beauties of public mailing
>> lists.
>> 
>> 
>> Am I correct in thinking that a common "ideal use case" is a mail
>> server with lots of accounts, especially with maildir structures,
>> so that accesses are spread across lots of directories with
>> typically many parallel accesses to many small files?
>> 
>> 
>> First, to make sure I am not making any technical errors here, I
>> believe that when you make your XFS over a linear concat, the
>> allocation groups are spread evenly across the parts of the concat
>> so that logically (by number) adjacent AG's will be on different
>> underlying disks.  When you make a new directory on the filesystem,
>> it gets put in a different AG (wrapping around, of course, and
>> overflowing when necessary).  Thus if you make three directories,
>> and put a file in each directory, then each file will be on a
>> different disk.  (I believe older XFS only allocated different
>> top-level directories to different AG's, but current XFS does so
>> for all directories).
>> 
>> 
>> 
>> I have been thinking about what the XFS over concat gives you
>> compared to XFS over raid0 on the same disks (or raid1 pairs - the
>> details don't matter much).
>> 
>> First, consider small files.  Access to small files (smaller than
>> the granularity of the raid0 chunks) will usually only involve one
>> disk of the raid0 stripe, and will /definitely/ only involve one
>> disk of the concat.  You should be able to access multiple small
>> files in parallel, if you are lucky in the mix (with raid0, this
>> "luck" will be mostly random, while with concat it will depend on
>> the mix of files within directories.  In particular, multiple files
>> within the same directory will not be paralleled).  With a concat,
>> all relevant accesses such as directory reads and inode table
>> access will be within the same disk as the file, while with raid0
>> it could easily be a different disk - but such accesses are often
>> cached in ram.  With raid0 you have the chance of the small file
>> spanning two disks, leading to longer latency for that file and for
>> other parallel accesses.
>> 
>> All in all, small file access should not be /too/ different - but
>> my guess is concat has the edge for lowest overall latency with
>> multiple parallel accesses, as I think concat will avoid jumps
>> between disks better.
>> 
>> 
>> For large files, there is a bigger difference.  Raid0 gives
>> striping for higher throughput - but these accesses block the
>> parallel accesses to other files.  concat has slower throughput as
>> there is no striping, but the other disks are free for parallel
>> accesses (big or small).
>> 
>> 
>> To my mind, this boils down to a question of balancing - concat
>> gives lower average latencies with highly parallel accesses, but
>> sacrifices maximum throughput of large files.  If you don't have
>> lots of parallel accesses, then concat gains little or nothing
>> compared to raid0.
>> 
>> 
>> If I try to match up this with the points you made, point 1 about 
>> striping is clear - this is a major difference between concat and
>> raid0. Point 2 and 3 about parallelism and high IOPs (and therefore
>> low latency) is also clear - if you don't need such access, concat
>> will give you nothing.
>> 
>> Only the OP can decide if his usage will meet these points.
>> 
>> But I am struggling with point 4 - "most importantly, exhibit
>> relatively deterministic IO patterns".  All you need is to have
>> your file accesses spread amongst a range of directories.  If the
>> number of (roughly) parallel accesses is big enough, you'll get a
>> fairly even spread across the disks - and if it is not big enough
>> for that, you haven't matched point 2.  This is not really much
>> different from raid0 - small accesses will be scattered across the
>> different disks.  The big difference comes when there is a large
>> file access - with raid0, you will block /all/ other accesses for a
>> time, while with concat (over three disks) you will block one third
>> of the accesses for three times as long.
> 
> 



* Re: Linux MD? Or an H710p?
  2013-10-25 11:42                 ` David Brown
@ 2013-10-26  9:37                   ` Stan Hoeppner
  2013-10-27 22:08                     ` David Brown
  0 siblings, 1 reply; 17+ messages in thread
From: Stan Hoeppner @ 2013-10-26  9:37 UTC (permalink / raw)
  To: David Brown; +Cc: Steve Bergman, linux-raid

On 10/25/2013 6:42 AM, David Brown wrote:
> On 25/10/13 11:34, Stan Hoeppner wrote:
...
>>>> Workloads that benefit from XFS over concatenated disks are those
>>>> that:
>>>>
>>>> 1.  Expose inherent limitations and/or inefficiencies of
>>>> striping, at the filesystem, elevator, and/or hardware level
>>>>
>>>> 2.  Exhibit a high degree of directory level parallelism
>>>>
>>>> 3.  Exhibit high IOPS or data rates
>>>>
>>>> 4.  Most importantly, exhibit relatively deterministic IO
>>>> patterns
...

> allocation groups are spread evenly across the parts of the concat
> so that logically (by number) adjacent AG's will be on different
> underlying disks.

This is not correct.  The LBA sectors are numbered linearly, hence the
md name "linear", from the first sector of the first disk (or partition)
to the last sector of the last disk, creating one large virtual disk.
Thus mkfs.xfs divides the disk into equal sized AGs from beginning to
end.  So if you have 4 exactly equal sized disks in the concatenation
and default mkfs.xfs creates 8 AGs, then AG0/1 would be on the first
disk, AG2/3 would be on the second, and so on.  If the disks (or
partitions) are not precisely the same number of sectors you will end up
with portions of AGs lying across physical disk boundaries.  The AGs
are NOT adjacently interleaved across disks as you suggest.
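
If you want whole AGs to land on each member, you arrange that at mkfs
time; a sketch, assuming a linear array of 3 equal sized mirror pairs:

  # agcount a multiple of the member count, so AG boundaries coincide
  # with member boundaries (3 members, 2 AGs each)
  mkfs.xfs -d agcount=6 /dev/md10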

> To my mind, this boils down to a question of balancing - concat
> gives lower average latencies with highly parallel accesses, but

That's too general a statement.  Again, it depends on the workload, and
the type of parallel access.  For some parallel small file workloads
with high DLP, then yes.  For a parallel DB workload with a single table
file, no.  See #2 and #4 above.

> sacrifices maximum throughput of large files.  

Not true.  There are large file streaming workloads that perform better
with XFS over concatenation than with striped RAID.  Again, this is
workload dependent.  See #1-4 above.

> If you don't have
> lots of parallel accesses, then concat gains little or nothing
> compared to raid0.

You just repeated #2-3.

> But I am struggling with point 4 - "most importantly, exhibit
> relatively deterministic IO patterns".  

It means exactly what it says.  In the parallel workload, the file
sizes, IOPS, and/or data rate to each AG needs to be roughly equal.
Ergo the IO pattern is "deterministic".  Deterministic means we know
what the IO pattern is before we build the storage system and run the
application on it.

Again, this is a "workload specific storage architecture".

> All you need is to have
> your file accesses spread amongst a range of directories.  If the
> number of (roughly) parallel accesses is big enough, you'll get a
> fairly even spread across the disks - and if it is not big enough
> for that, you haven't matched point 2.  

And if you aim a shotgun at a flock of geese you might hit a couple.
This is not deterministic.

> This is not really much
> different from raid0 - small accesses will be scattered across the
> different disks.  

It's very different.  And no they won't be scattered across the disks
with a striped array.  When aligned to a striped array, XFS will
allocate all files at the start of a stripe.  If the file is smaller
than sunit it will reside entirely on the first disk.  This creates a
massive IO hotspot.  If the workload consists of files that are all or
mostly smaller than sunit, all other disks in the striped array will sit
idle until the filesystem is sufficiently full that no virgin stripes
remain.  At this point all allocation will become unaligned, or aligned
to sunit boundaries if possible, with new files being allocated into the
massive fragmented free space.  Performance can't be any worse than this
scenario.

You can format XFS without alignment on a striped array and avoid the
single drive hotspot above.  However, file placement within the AGs and
thus on the stripe is non-deterministic, because you're not aligned.
XFS doesn't know where the chunk and stripe boundaries are.  So you'll
still end up with hot spots, some disks more active than others.
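
For reference, the aligned and unaligned cases above correspond roughly
to mkfs invocations like these (a sketch; the chunk size and spindle
count are assumptions, not a recommendation):

  # aligned: tell XFS the chunk size (su) and number of data spindles (sw)
  mkfs.xfs -d su=512k,sw=3 /dev/md0
  # unaligned: suppress stripe unit/width alignment
  mkfs.xfs -d noalign /dev/md0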

This is where a properly designed XFS over concatenation may help.  I
say "may" because if you're not hitting #2-3 it doesn't matter.  The
load may not be sufficient to expose the architectural defect in either
of the striped architectures above.

So, again, use of XFS over concatenation is workload specific.  And 4 of
the criteria to evaluate whether it should be used are above.

> The big difference comes when there is a large
> file access - with raid0, you will block /all/ other accesses for a
> time, while with concat (over three disks) you will block one third
> of the accesses for three times as long.

You're assuming a mixed workload.  Again, XFS over concatenation is
never used with a mixed, i.e. non-deterministic, workload.  It is used
only with workloads that exhibit determinism.

Once again:  "This is a very workload specific storage architecture"

How many times have I repeated this on this list?  Apparently not enough.

-- 
Stan



* Re: Linux MD? Or an H710p?
  2013-10-26  9:37                   ` Stan Hoeppner
@ 2013-10-27 22:08                     ` David Brown
  0 siblings, 0 replies; 17+ messages in thread
From: David Brown @ 2013-10-27 22:08 UTC (permalink / raw)
  To: stan; +Cc: Steve Bergman, linux-raid

On 26/10/13 11:37, Stan Hoeppner wrote:
> On 10/25/2013 6:42 AM, David Brown wrote:
>> On 25/10/13 11:34, Stan Hoeppner wrote:
> ...
>>>>> Workloads that benefit from XFS over concatenated disks are those
>>>>> that:
>>>>>
>>>>> 1.  Expose inherent limitations and/or inefficiencies of
>>>>> striping, at the filesystem, elevator, and/or hardware level
>>>>>
>>>>> 2.  Exhibit a high degree of directory level parallelism
>>>>>
>>>>> 3.  Exhibit high IOPS or data rates
>>>>>
>>>>> 4.  Most importantly, exhibit relatively deterministic IO
>>>>> patterns
> ...
>
>> allocation groups are spread evenly across the parts of the concat
>> so that logically (by number) adjacent AG's will be on different
>> underlying disks.
>
> This is not correct.  The LBA sectors are numbered linearly, hence teh
> md name "linear", from the first sector of the first disk (or partition)
> to the last sector of the last disk, creating one large virtual disk.
> Thus mkfs.xfs divides the disk into equal sized AGs from beginning to
> end.  So if you have 4 exactly equal sized disks in the concatenation
> and default mkfs.xfs creates 8 AGs, then AG0/1 would be on the first
> disk, AG2/3 would be on the second, and so on.  If the disks (or
> partitions) are not precisely the same number of sectors you will end up
> with portions of AGs laying across physical disk boundaries.  The AGs
> are NOT adjacently interleaved across disks as you suggest.

OK.

>
>> To my mind, this boils down to a question of balancing - concat
>> gives lower average latencies with highly parallel accesses, but
>
> That's too general a statement.  Again, it depends on the workload, and
> the type of parallel access.  For some parallel small file workloads
> with high DLP, then yes.  For a parallel DB workload with a single table
> file, no.  See #2 and #4 above.

Fair enough.  I was thinking of parallel accesses to /different/ files, 
in different directories.  I think if I had said that, we would be 
closer here.

>
>> sacrifices maximum throughput of large files.
>
> Not true.  There are large file streaming workloads that perform better
> with XFS over concatenation than with striped RAID.  Again, this is
> workload dependent.  See #1-4 above.

That would be workloads where you have parallel accesses to large files 
in different directories?

>
>> If you don't have
>> lots of parallel accesses, then concat gains little or nothing
>> compared to raid0.
>
> You just repeated #2-3.
>

Yes.

>> But I am struggling with point 4 - "most importantly, exhibit
>> relatively deterministic IO patterns".
>
> It means exactly what is says.  In the parallel workload, the file
> sizes, IOPS, and/or data rate to each AG needs to be roughly equal.
> Ergo the IO pattern is "deterministic".  Deterministic means we know
> what the IO pattern is before we build the storage system and run the
> application on it.
>

I know what deterministic means, and I know what you are saying here.  I 
just did not understand why you felt it mattered so much - but your 
answer below makes it much clearer.

> Again, this is a "workload specific storage architecture".

No doubts there!

>
>> All you need is to have
>> your file accesses spread amongst a range of directories.  If the
>> number of (roughly) parallel accesses is big enough, you'll get a
>> fairly even spread across the disks - and if it is not big enough
>> for that, you haven't matched point 2.
>
> And if you aim a shotgun at a flock of geese you might hit a couple.
> This is not deterministic.
>

I think you would be hard pushed to get better than "random with known 
characteristics" for most workloads (as always, there are exceptions 
where the workload is known very accurately).  Enough independent random 
accesses and tight enough characteristics will give you the determinism 
you are looking for.  (If 50 people aim shotguns at a flock of geese, it 
doesn't matter if they aim randomly or at carefully assigned targets - 
the result is a fairly even spread across the flock.)

>> This is not really much
>> different from raid0 - small accesses will be scattered across the
>> different disks.
>
> It's very different.  And no they won't be scattered across the disks
> with a striped array.  When aligned to a striped array, XFS will
> allocate all files at the start of a stripe.  If the file is smaller
> than sunit it will reside entirely on the first disk.  This creates a
> massive IO hotspot.  If the workload consists of files that are all or
> mostly smaller than sunit, all other disks in the striped array will sit
> idle until the filesystem is sufficiently full that no virgin stripes
> remain.  At this point all allocation will become unaligned, or aligned
> to sunit boundaries if possible, with new files being allocated into the
> massive fragmented free space.  Performance can't be any worse than this
> scenario.

/This/ is a key point that is new to me.  It is a specific detail of XFS 
that I was not aware of, and I fully agree it makes a very significant 
difference.

I am trying to think /why/ XFS does it this way.  I assume there is a 
good reason.  Could it be the general point that big files usually start 
as small files, and that by allocating in this way XFS aims to reduce 
fragmentation and maximise stripe throughput as the file grows?


One thing I get from this is that if your workload is mostly small files 
(smaller than sunit), then linear concat is going to give you better 
performance than raid0 even if the accesses are not very evenly spread 
across allocation groups - pretty much anything is better than 
concentrating everything on the first disk only.  (Of course, if you are 
only accessing small files and you /don't/ have a lot of parallelism, 
then performance is unlikely to matter much.)

>
> You can format XFS without alignment on a striped array and avoid the
> single drive hotspot above.  However, file placement within the AGs and
> thus on the stripe is non-deterministic, because you're not aligned.
> XFS doesn't know where the chunk and stripe boundaries are.  So you'll
> still end up with hot spots, some disks more active than others.
>
> This is where a properly designed XFS over concatenation may help.  I
> say "may" because if you're not hitting #2-3 it doesn't matter.  The
> load may not be sufficient to expose the architectural defect in either
> of the striped architectures above.
>
> So, again, use of XFS over concatenation is workload specific.  And 4 of
> the criteria to evaluate whether it should be used are above.
>
>> The big difference comes when there is a large
>> file access - with raid0, you will block /all/ other accesses for a
>> time, while with concat (over three disks) you will block one third
>> of the accesses for three times as long.
>
> You're assuming a mixed workload.  Again, XFS over concatenation is
> never used with a mixed, i.e. non-deterministic, workload.  It is used
> only with workloads that exhibit determinism.
>

Yes, I am assuming a mixed workload (partly because that's what the OP has).

> Once again:  "This is a very workload specific storage architecture"
>

I think most people, including me, understand that it is 
workload-specific.  What we are learning is exactly what kinds of 
workload are best suited to which layout, and why.  The ideal situation 
is to be able to test out many different layouts under real-life loads, 
but I think that's unrealistic in most cases.  So the best we can do is 
try to learn the theory.

> How many times have I repeated this on this list?  Apparently not enough.
>

I try to listen in to most of these threads, and sometimes I join in. 
Usually I learn a little more each time.  I hope the same applies to 
others here.

The general point - that filesystem and raid layout is workload 
dependent - is one of these things that cannot be repeated too often, I 
think.

Thanks,

David


