* XFS + LVM + DM-Thin + Multi-Volume External RAID

From: Dave Hall @ 2016-11-24  1:23 UTC
To: linux-xfs

Hello,

I'm planning a storage installation on new hardware and I'd like to configure it for best performance. I will have 24 to 48 drives in a SAS-attached RAID box with dual 12Gb/s controllers (Dell MD3420 with 10K 1.8TB drives). The server is dual socket with 28 cores, 256GB RAM, dual 12Gb/s HBAs, and multiple 10GbE NICs.

My workload is NFS for user home directories - highly random access patterns with frequent bursts of random writes.

In order to maximize performance I'm planning to make multiple small RAID volumes (i.e. RAID5 4+1, or RAID6 8+2) that would be either striped or concatenated together.

I'm looking for information on:

- Are there any cautions or recommendations about XFS stability/performance on a thin volume with thin snapshots?

- I've read that there are tricks and calculations for aligning XFS to the RAID stripes. Can you suggest any guidelines or tools for calculating the right configuration?

- I've also read about tuning the number of allocation groups to reflect the CPU configuration of the server. Any suggestions on this?

Thanks.

-Dave
* Re: XFS + LVM + DM-Thin + Multi-Volume External RAID

From: Carlos Maiolino @ 2016-11-24  9:43 UTC
To: Dave Hall; +Cc: linux-xfs

Hi,

On Wed, Nov 23, 2016 at 08:23:42PM -0500, Dave Hall wrote:
> Hello,
>
> I'm planning a storage installation on new hardware and I'd like to configure it for best performance. I will have 24 to 48 drives in a SAS-attached RAID box with dual 12Gb/s controllers (Dell MD3420 with 10K 1.8TB drives). The server is dual socket with 28 cores, 256GB RAM, dual 12Gb/s HBAs, and multiple 10GbE NICs.
>
> My workload is NFS for user home directories - highly random access patterns with frequent bursts of random writes.
>
> In order to maximize performance I'm planning to make multiple small RAID volumes (i.e. RAID5 4+1, or RAID6 8+2) that would be either striped or concatenated together.
>
> I'm looking for information on:
>
> - Are there any cautions or recommendations about XFS stability/performance on a thin volume with thin snapshots?
>
> - I've read that there are tricks and calculations for aligning XFS to the RAID stripes. Can you suggest any guidelines or tools for calculating the right configuration?

There is no magical trick :), you need to configure the stripe unit and stripe width according to your RAID configuration. You should set the stripe unit (su option) to the per-disk chunk size of your RAID, and set the stripe width (sw option) to the number of data disks in your array (for a 4+1 RAID5 it should be 4; for an 8+2 RAID6 it should be 8).

> - I've also read about tuning the number of allocation groups to reflect the CPU configuration of the server. Any suggestions on this?

Allocation groups can't be bigger than 1TB. Assuming the count should reflect your CPU configuration is wrong: having too few or too many allocation groups can kill your performance, and you might also face other allocation problems in the future, as the filesystem ages, if it runs with very small allocation groups.

Determining the size of the allocation groups is a case-by-case approach, and it might need some experimenting.

Since you are dealing with thin-provisioned devices, I'd be even more careful. If you start with a small filesystem and use the default mkfs configuration, it will give you a number of AGs based on your current block device size, which can be a problem in the future when you decide to extend the filesystem: AG size can't be changed after you make the filesystem. Search the xfs list and you will see reports of performance problems that turned out to be caused by very small filesystems that were extended later, leaving them with lots of AGs.

So: what initial size do you expect these filesystems to have? How much do you expect to grow them? Those questions will help you get some idea of the right AG size.

Regarding thin provisioning, there are a couple of things that you should keep in mind.
- AGs segment the metadata across the whole disk and increase parallelism in the filesystem, but thin provisioning will make such allocations sequential regardless of where in the block device the filesystem tries to write; this is the nature of thin-provisioned devices. So I believe you should be more careful planning your dm-thin structure than the filesystem itself.

- There is a bug I'm working on with XFS on thin-provisioned devices where, if you overcommit the filesystem size (i.e. it's bigger than the amount of space the dm-thin device really has), you might hit problems when you try to write to the filesystem but there is no more space available in the dm-thin device. This thread contains part of the story: http://www.spinics.net/lists/linux-xfs/msg01248.html Which reminds me I need to come back to this bug ASAP.

Just my 0.02, some other folks might remember something else.

Cheers

> Thanks.
>
> -Dave

-- 
Carlos
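[As a concrete sketch of the su/sw settings described above - the 128 KiB chunk size, the 4+1 RAID5 layout and the device path are assumptions for illustration, not values taken from this thread - the mkfs invocation could look like this:

  # Hypothetical 4+1 RAID5 with a 128 KiB per-disk chunk; substitute the
  # geometry your array actually uses.  su = per-disk chunk size,
  # sw = number of data disks (parity disks excluded).
  mkfs.xfs -d su=128k,sw=4 /dev/vg_home/lv_home

  # Equivalent form in 512-byte sectors:
  #   sunit  = 128k / 512        = 256
  #   swidth = sunit * 4 data disks = 1024
  mkfs.xfs -d sunit=256,swidth=1024 /dev/vg_home/lv_home
]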
* Re: XFS + LVM + DM-Thin + Multi-Volume External RAID

From: Dave Chinner @ 2016-11-24 19:44 UTC
To: Dave Hall, linux-xfs

On Thu, Nov 24, 2016 at 10:43:32AM +0100, Carlos Maiolino wrote:
> Hi,
>
> On Wed, Nov 23, 2016 at 08:23:42PM -0500, Dave Hall wrote:
> > Hello,
> >
> > I'm planning a storage installation on new hardware and I'd like to configure it for best performance. I will have 24 to 48 drives in a SAS-attached RAID box with dual 12Gb/s controllers (Dell MD3420 with 10K 1.8TB drives). The server is dual socket with 28 cores, 256GB RAM, dual 12Gb/s HBAs, and multiple 10GbE NICs.
> >
> > My workload is NFS for user home directories - highly random access patterns with frequent bursts of random writes.
> >
> > In order to maximize performance I'm planning to make multiple small RAID volumes (i.e. RAID5 4+1, or RAID6 8+2) that would be either striped or concatenated together.
> >
> > I'm looking for information on:
> >
> > - Are there any cautions or recommendations about XFS stability/performance on a thin volume with thin snapshots?
> >
> > - I've read that there are tricks and calculations for aligning XFS to the RAID stripes. Can you suggest any guidelines or tools for calculating the right configuration?
>
> There is no magical trick :), you need to configure the stripe unit and stripe width according to your RAID configuration. You should set the stripe unit (su option) to the per-disk chunk size of your RAID, and set the stripe width (sw option) to the number of data disks in your array (for a 4+1 RAID5 it should be 4; for an 8+2 RAID6 it should be 8).

mkfs.xfs will do this setup automatically on software RAID and on any block device that exports the necessary information to set it up. In general, it's only older/cheaper hardware RAID that you have to worry about anymore.

> > - I've also read about tuning the number of allocation groups to reflect the CPU configuration of the server. Any suggestions on this?
>
> Allocation groups can't be bigger than 1TB. Assuming the count should reflect your CPU configuration is wrong: having too few or too many allocation groups can kill your performance, and you might also face other allocation problems in the future, as the filesystem ages, if it runs with very small allocation groups.

It also depends on your storage, mostly. SSDs can handle agcount=NCPUS*2 easily, but for spinning storage this will cause additional seek loading and slow things down. In this case, the defaults are best.

> Determining the size of the allocation groups is a case-by-case approach, and it might need some experimenting.
>
> Since you are dealing with thin-provisioned devices, I'd be even more careful. If you start with a small filesystem and use the default mkfs configuration, it will give you a number of AGs based on your current block device size, which can be a problem in the future when you decide to extend the filesystem: AG size can't be changed after you make the filesystem. Search the xfs list and you will see reports of performance problems that turned out to be caused by very small filesystems that were extended later, leaving them with lots of AGs.
Yup, the rule of thumb is that growing the fs size by an order of magnitude is fine; growing it by two orders of magnitude will cause problems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
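[To see what the defaults actually gave you, xfs_info reports the AG count and size after mkfs; the mount point, device path and agcount value below are assumptions for illustration only:

  # Report agcount= and agsize= for an existing, mounted filesystem:
  xfs_info /srv/home

  # Only if there is a concrete reason to deviate from the defaults
  # (e.g. all-SSD storage, per the NCPUS*2 rule of thumb above):
  mkfs.xfs -d agcount=56 /dev/vg_home/lv_home
]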
* Re: XFS + LVM + DM-Thin + Multi-Volume External RAID

From: Dave Hall @ 2016-11-25 14:50 UTC
To: Dave Chinner, linux-xfs

On 11/24/16 2:44 PM, Dave Chinner wrote:
>>> - I've read that there are tricks and calculations for aligning XFS to the RAID stripes. Can you suggest any guidelines or tools for calculating the right configuration?
>>
>> There is no magical trick :), you need to configure the stripe unit and stripe width according to your RAID configuration. You should set the stripe unit (su option) to the per-disk chunk size of your RAID, and set the stripe width (sw option) to the number of data disks in your array (for a 4+1 RAID5 it should be 4; for an 8+2 RAID6 it should be 8).
>
> mkfs.xfs will do this setup automatically on software RAID and on any block device that exports the necessary information to set it up. In general, it's only older/cheaper hardware RAID that you have to worry about anymore.

So how do we know for sure? Is there a way that we can be sure that the hardware RAID has exported this information? In lieu of this, is there a solid way to deduce or test for correct alignment?
* Re: XFS + LVM + DM-Thin + Multi-Volume External RAID

From: Eric Sandeen @ 2016-11-26 17:52 UTC
To: Dave Hall, Dave Chinner, linux-xfs

On 11/25/16 8:50 AM, Dave Hall wrote:
>> mkfs.xfs will do this setup automatically on software RAID and on any block device that exports the necessary information to set it up. In general, it's only older/cheaper hardware RAID that you have to worry about anymore.
>
> So how do we know for sure? Is there a way that we can be sure that the hardware RAID has exported this information?

If you run mkfs.xfs and it shows stripe geometry in the stdout info, then it detected stripe geometry.

Otherwise you can use lsblk -t to print the advertised topology:

# lsblk -t /dev/md121
NAME  ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
md121         0    512      0     512     512    1         128  128    0B

Min IO size is the stripe unit; optimal IO size is the stripe width. (In the above case there is no stripe geometry; 512-byte min IO and 0 optimal IO is uninteresting.)

> In lieu of this, is there a solid way to deduce or test for correct alignment?

If the device itself doesn't advertise a stripe geometry and you think it has one, you'll need to look at the device settings, BIOS, documentation, configuration, or whatever else to work it out on your own, and then specify that manually on the mkfs.xfs command line.

-Eric
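[The same topology fields lsblk prints can also be read directly from sysfs, which is handy for scripting; sdX is a placeholder device name, and the manual su/sw values simply repeat the hypothetical 4+1 / 128 KiB example used earlier:

  # minimum_io_size maps to the stripe unit, optimal_io_size to the stripe
  # width (both in bytes; 0 means "not advertised").
  cat /sys/block/sdX/queue/minimum_io_size
  cat /sys/block/sdX/queue/optimal_io_size

  # If nothing is advertised but the geometry is known, pass it by hand:
  mkfs.xfs -d su=128k,sw=4 /dev/sdX
]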
* Re: XFS + LVM + DM-Thin + Multi-Volume External RAID

From: Dave Hall @ 2016-11-25 15:20 UTC
To: linux-xfs; +Cc: Carlos Maiolino

On 11/25/16 6:18 AM, Carlos Maiolino wrote:
>>> Regarding thin provisioning, there are a couple of things that you should keep in mind.
>>>
>>> - AGs segment the metadata across the whole disk and increase parallelism in the filesystem, but thin provisioning will make such allocations sequential regardless of where in the block device the filesystem tries to write; this is the nature of thin-provisioned devices. So I believe you should be more careful planning your dm-thin structure than the filesystem itself.
>>
>> So it sounds like I should use striping for my logical volume to assure that data is distributed across the whole physical array?
>
> I'm not sure I understand your question here, or what kind of architecture you have in mind. All thin-provisioning allocations are sequential: block requested, next available block served (although with recent dm-thin versions it will serve blocks in bundles rather than on a block-by-block granularity, it is still a sequential alignment).
>
> I am really not sure what you have in mind to 'force' the distribution across the whole physical array. The only thing I could think of was to have 2 dm-thin devices, on different pools, and use them to build a striped LV. I don't know if that is possible tbh, I never tried such a configuration, but it's a setup bound to have problems IMHO.

Currently I have 4 LVM PVs that are mapped to explicit groups of physical disks (RAID 5) in my array. I would either stripe or concatenate them together, create a single large DM-Thin LV, and format it for XFS.

If the PVs are concatenated, it sounds like DM-Thin would fill up the first PV before moving to the next. It seems that DM-Thin on striped PVs would assure that disk activity is spread across all of the PVs and thus across all of the physical disks. Without DM-Thin, an XFS on concatenated PVs would probably tend to organize its AGs into single PVs, which would spread disk activity across all of the physical disks, just in a different way.
* Re: XFS + LVM + DM-Thin + Multi-Volume External RAID

From: Dave Hall @ 2016-11-25 17:09 UTC
To: linux-xfs; +Cc: Carlos Maiolino

-- 
Dave Hall
Binghamton University
kdhall@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)

On 11/25/16 10:20 AM, Dave Hall wrote:
> On 11/25/16 6:18 AM, Carlos Maiolino wrote:
>>>> Regarding thin provisioning, there are a couple of things that you should keep in mind.
>>>>
>>>> - AGs segment the metadata across the whole disk and increase parallelism in the filesystem, but thin provisioning will make such allocations sequential regardless of where in the block device the filesystem tries to write; this is the nature of thin-provisioned devices. So I believe you should be more careful planning your dm-thin structure than the filesystem itself.
>>>
>>> So it sounds like I should use striping for my logical volume to assure that data is distributed across the whole physical array?
>>
>> I'm not sure I understand your question here, or what kind of architecture you have in mind. All thin-provisioning allocations are sequential: block requested, next available block served (although with recent dm-thin versions it will serve blocks in bundles rather than on a block-by-block granularity, it is still a sequential alignment).
>>
>> I am really not sure what you have in mind to 'force' the distribution across the whole physical array. The only thing I could think of was to have 2 dm-thin devices, on different pools, and use them to build a striped LV. I don't know if that is possible tbh, I never tried such a configuration, but it's a setup bound to have problems IMHO.
>
> Currently I have 4 LVM PVs that are mapped to explicit groups of physical disks (RAID 5) in my array. I would either stripe or concatenate them together, create a single large DM-Thin LV, and format it for XFS.
>
> If the PVs are concatenated, it sounds like DM-Thin would fill up the first PV before moving to the next. It seems that DM-Thin on striped PVs would assure that disk activity is spread across all of the PVs and thus across all of the physical disks. Without DM-Thin, an XFS on concatenated PVs would probably tend to organize its AGs into single PVs, which would spread disk activity across all of the physical disks, just in a different way.

I'd like to add some clarification just to be sure...

The configuration strategy I've been using for my physical storage array is to map specific disks into a small RAID group and define a single LUN per RAID group. Thus, each LUN presented to the server is currently mapped to a group of 5 disks in RAID 5.

If I understand correctly, an LVM Logical Volume presents a single linear storage space to the file system (XFS) regardless of the underlying storage organization. XFS divides this space into a number of Allocation Groups that it perceives to be contiguous sub-volumes within the Logical Volume.

With a concatenated LV most AGs would be mapped to a single PV, but XFS would still disperse disk activity across all AGs and thus across all PVs. With a striped LV each AG would be striped across multiple PVs, which would change the distribution of disk activity across the PVs but still lead to all PVs being fairly active.
With DM-Thin, things would change. XFS would perceive that its AGs were fully allocated, but in reality new chunks of storage would be allocated as needed. If DM-Thin uses a linear allocation algorithm on a concatenated LV, it would seem that certain kinds of disk activity would tend to be concentrated in a single PV at a time. On the other hand, DM-Thin on a striped LV would tend to spread things around more evenly regardless of allocation patterns.

Please let me know if this perception is accurate.

Thanks.
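[A minimal sketch of the striped variant under discussion - the VG name, LUN device names, pool/volume sizes and stripe size are all assumptions, and lvcreate(8) on your lvm2 version should be checked for -i/-I support on thin pools - stripes the thin pool's data LV across the four PVs so that dm-thin's sequential chunk allocation still fans out over every RAID group:

  # Assumed names and sizes throughout; adjust to the real VG and LUNs.
  vgcreate vg_home /dev/mapper/lun0 /dev/mapper/lun1 /dev/mapper/lun2 /dev/mapper/lun3

  # Thin pool whose data LV is striped across all 4 PVs (-i), 256 KiB stripe (-I):
  lvcreate --type thin-pool -L 6T -i 4 -I 256k -n tp0 vg_home

  # Overcommitted thin volume carved out of that pool:
  lvcreate -V 20T -T vg_home/tp0 -n home

  mkfs.xfs /dev/vg_home/home
]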
* Re: XFS + LVM + DM-Thin + Multi-Volume External RAID

From: Dave Chinner @ 2016-11-27 21:56 UTC
To: Dave Hall; +Cc: linux-xfs, Carlos Maiolino

On Fri, Nov 25, 2016 at 12:09:10PM -0500, Dave Hall wrote:
> With a concatenated LV most AGs would be mapped to a single PV, but XFS would still disperse disk activity across all AGs and thus across all PVs.

Like all things, this is only partially true. For inode64 (the default) the allocation load is spread based on directory structure. If all your work hits a single directory, then it won't get spread across multiple devices. The log will land on a single device, so it will always be limited by the throughput of that device. And read/overwrite workloads will only hit single devices, too. So unless you have a largely concurrent, widely distributed set of access patterns, XFS won't distribute the IO load.

Now inode32, OTOH, distributes the data to different AGs at allocation time, meaning that data in a single directory is spread across multiple devices. However, all the metadata will be on the first device and that guarantees a device loading imbalance will occur.

> With a striped LV each AG would be striped across multiple PVs, which would change the distribution of disk activity across the PVs but still lead to all PVs being fairly active.

Striped devices can be thought of as the same as a single spindle - the characteristics from the filesystem perspective are the same, just with some added alignment constraints to optimise placement...

> With DM-Thin, things would change. XFS would perceive that its AGs were fully allocated, but in reality new chunks of storage would be allocated as needed. If DM-Thin uses a linear allocation algorithm on a concatenated LV, it would seem that certain kinds of disk activity would tend to be concentrated in a single PV at a time. On the other hand, DM-Thin on a striped LV would tend to spread things around more evenly regardless of allocation patterns.

Yup, exactly the same as for a filesystem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
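[The two allocator behaviours contrasted above are selected with mount options; inode64 has long been the default, so only inode32 normally needs to be spelled out. The device and mount point here are assumptions:

  # Default allocator: inodes placed near their parent directory, with the
  # load spread across AGs by directory structure (inode64).
  mount -o inode64 /dev/vg_home/home /srv/home

  # Legacy allocator: inodes kept in the low AGs, data rotored across AGs
  # (inode32).
  mount -o inode32 /dev/vg_home/home /srv/home
]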