* XFS on top of LVM span in AWS. Stripe or are AG's good enough?
@ 2016-08-15 23:36 Jeff Gibson
2016-08-16 0:59 ` Dave Chinner
0 siblings, 1 reply; 6+ messages in thread
From: Jeff Gibson @ 2016-08-15 23:36 UTC (permalink / raw)
To: xfs@oss.sgi.com
So I'm creating an LVM volume with 8 AWS EBS disks that are spanned (linear) per Redhat's documentation for Gluster (https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Deployment_Guide_for_Public_Cloud/ch02s03.html#Provisioning_Storage_for_Three-way_Replication_Volumes).
2 questions-
1. Will XFS's Allocation Groups essentially stripe the data for me or should I stripe the underlying volumes with LVM? I'm not worried as much about data integrity with a stripe/span since Gluster is doing the redundancy work.
2. AWS volumes sometimes have inconsistent performance. If I understand things correctly, AG's run in parallel. In a non-striped volume, if some of the AGs are temporarily slower to respond than others due to one of the underlying volumes being slow, will XFS prefer the quicker responding AGs, or is I/O always evenly distributed? If XFS prefers the more responsive AG's, it seems to me that it would be better NOT to stripe the underlying disk, since all AG's that are distributed in a stripe will continuously hit all component volumes, including the slow volume (unless XFS compensates for this?)
Thank you
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: XFS on top of LVM span in AWS. Stripe or are AG's good enough?
2016-08-15 23:36 XFS on top of LVM span in AWS. Stripe or are AG's good enough? Jeff Gibson
@ 2016-08-16 0:59 ` Dave Chinner
2016-08-16 17:05 ` Jeff Gibson
0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2016-08-16 0:59 UTC (permalink / raw)
To: Jeff Gibson; +Cc: xfs@oss.sgi.com
On Mon, Aug 15, 2016 at 11:36:14PM +0000, Jeff Gibson wrote:
> So I'm creating an LVM volume with 8 AWS EBS disks that are
> spanned (linear) per Redhat's documentation for Gluster
> (https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Deployment_Guide_for_Public_Cloud/ch02s03.html#Provisioning_Storage_for_Three-way_Replication_Volumes).
>
> 2 questions-
>
> 1. Will XFS's Allocation Groups essentially stripe the data for
> me
No. XFS does not stripe data. It does, however, *distribute* data
across different AGs according to locality policy (e.g. inode32 vs
inode64), so it uses all the AGs as the directory structure grows.
> or should I stripe the underlying volumes with LVM?
No, you're using EBS. Forget anything you know about storage layout
and geometry, because EBS has no guaranteed physical layout you can
optimise for.
> I'm not
> worried as much about data integrity with a stripe/span since
> Gluster is doing the redundancy work.
>
> 2. AWS volumes sometimes have inconsistent performance. If I
> understand things correctly, AG's run in parallel.
Define "run". AGs can allocate/free blocks in parallel. If IO does
not require allocation, then AGs play no part in the IO path.
> In a
> non-striped volume, if some of the AGs are temporarily slower to
> respond than others due to one of the underlying volumes being
> slow, will XFS prefer the quicker responding AGs
No, it does not.
> or is I/O always
> evenly distributed?
No, it is not.
> If XFS prefers the more responsive AG's it
> seems to me that it would be better NOT to stripe the underlying
> disk since all AG's that are distributed in a stripe will
> continuously hit all component volumes, including the slow volume
> (unless if XFS compensates for this?)
I think you have the wrong idea about what allocation groups do.
They are for maintaining allocation concurrency and locality of
related objects on disk - they have no influence on where IO is
directed based on IO load or response time.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: XFS on top of LVM span in AWS. Stripe or are AG's good enough?
2016-08-16 0:59 ` Dave Chinner
@ 2016-08-16 17:05 ` Jeff Gibson
2016-08-16 17:37 ` Eric Sandeen
0 siblings, 1 reply; 6+ messages in thread
From: Jeff Gibson @ 2016-08-16 17:05 UTC (permalink / raw)
To: xfs@oss.sgi.com
>On Mon, Aug 15, 2016 at 11:36:14PM +0000, Jeff Gibson wrote:
>> So I'm creating an LVM volume with 8 AWS EBS disks that are
>> spanned (linear) per Redhat's documentation for Gluster
>> (https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Deployment_Guide_for_Public_Cloud/ch02s03.html#Provisioning_Storage_for_Three-way_Replication_Volumes).
>>
>> 2 questions-
>>
>> 1. Will XFS's Allocation Groups essentially stripe the data for
>> me
>
>No. XFS does not stripe data. It does, however, *distribute* data
>different AGs according to locality policy (e.g. inode32 vs
>inode64), so it uses all the AGs as the directory structure grows.
Poor wording on my part. By "essentially stripe" I mean distribute data throughout all of the EBS subvolumes instead of just using one EBS subvolume at a time until full. I do plan on using inode64.
>> or should I stripe the underlying volumes with LVM?
>
>No, you're using EBS. Forget anything you know about storage layout
>and geometry, because EBS has no guaranteed physical layout you can
>optimise for.
Right. However, there could still be some gains from striping due to IOPS limits on single volumes - that is, the combined IOPS of all the volumes striped together can be higher than the IOPS of a single volume.
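[Editor's note: the trade-off above can be sketched with back-of-envelope arithmetic. The per-volume IOPS figure below is a hypothetical assumption, not a real AWS limit; actual EBS ceilings depend on volume type and size.]

```python
# Rough ceiling comparison for striped vs. linear (spanned) layouts.
# per_volume_iops is a hypothetical figure, not a real AWS limit.

per_volume_iops = 3000
volumes = 8

# Striped: every IO stream is spread over all members, so the
# aggregate ceiling is roughly the sum of the members.
striped_ceiling = per_volume_iops * volumes

# Linear span: a hot file (or hot region) lives on one member,
# so a single stream is capped by one volume's limit.
linear_hot_spot_ceiling = per_volume_iops

print(striped_ceiling, linear_hot_spot_ceiling)  # 24000 3000
```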
>> I'm not
>> worried as much about data integrity with a stripe/span since
>> Gluster is doing the redundancy work.
>>
>> 2. AWS volumes sometimes have inconsistent performance. If I
>> understand things correctly, AG's run in parallel.
>
>Define "run". AGs can allocate/free blocks in parallel.
By run I meant read/write data to/from the AGs.
>If IO does
>not require allocation, then AGs play no part in the IO path.
Can you explain this a bit please? From my understanding, data is written to and read from space inside AGs, so I don't see how they couldn't be part of the IO path. Or do you simply mean reads just use inodes and don't care about the AGs?
>> In a
>> non-striped volume, if some of the AGs are temporarily slower to
>> respond than others due to one of the underlying volumes being
>> slow, will XFS prefer the quicker responding AGs
>
>No, it does not.
>
>> or is I/O always
>> evenly distributed?
>
>No, it is not.
>
>> If XFS prefers the more responsive AG's it
>> seems to me that it would be better NOT to stripe the underlying
>> disk since all AG's that are distributed in a stripe will
>> continuously hit all component volumes, including the slow volume
>> (unless if XFS compensates for this?)
>
>I think you have the wrong idea about what allocation groups do.
I'm reading the XFS File System Structure doc on xfs.org. It says, "XFS filesystems are divided into a number of equally sized chunks called Allocation Groups. Each AG can almost be thought of as an individual filesystem." so that's where most of my assumptions are coming from.
>They are for maintaining allocation concurrency and locality of
>related objects on disk - they have no influence on where IO is
>directed based on IO load or response time.
I understand that XFS has locality as far as trying to write files to the same AG as the parent directory. Are there other cases?
I get that it's probably not measuring the responsiveness of each AG. I guess what I'm trying to ask is - will XFS *indirectly* compensate if one subvolume is busier? For example, if writes to a "slow" subvolume and resident AGs take longer to complete, will XFS tend to prefer to use other less-busy AGs more often (with the exception of locality) for writes? What is the basic algorithm for determining where new data is written? In load-balancer terms, does it round-robin, pick the least busy, etc?
Thank you very much!
JG
* Re: XFS on top of LVM span in AWS. Stripe or are AG's good enough?
2016-08-16 17:05 ` Jeff Gibson
@ 2016-08-16 17:37 ` Eric Sandeen
2016-08-17 16:23 ` Jeff Gibson
0 siblings, 1 reply; 6+ messages in thread
From: Eric Sandeen @ 2016-08-16 17:37 UTC (permalink / raw)
To: xfs
On 8/16/16 12:05 PM, Jeff Gibson wrote:
>> On Mon, Aug 15, 2016 at 11:36:14PM +0000, Jeff Gibson wrote:
...
>> Define "run". AGs can allocate/free blocks in parallel.
> By run I meant read/write data to/from the AGs.
>
>> If IO does
>> not require allocation, then AGs play no part in the IO path.
> Can you explain this a bit please? From my understanding data is
> written and read from space inside of AGs, so I don't see how it
> couldn't be part of the IO path. Or do you simply mean reads just use
> inodes and don't care about the AGs?
I think Dave just means that IO to already-allocated blocks simply
addresses the block and goes. There is no AG locking or concurrency or
anything else that comes into play w.r.t. the specific AG the block
under IO happens to live in.
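[Editor's note: a toy model of this distinction, with all names and structure invented for illustration - nothing below is XFS code:]

```python
# Toy model: an overwrite of an already-mapped block goes straight to
# the device; only a write into unmapped space calls the allocator,
# which is where per-AG state (locks, free-space trees) is touched.
# Purely illustrative, not XFS internals.

class ToyAllocator:
    def __init__(self):
        self.calls = 0          # stand-in for AG involvement
        self.next_block = 0

    def allocate(self):
        self.calls += 1
        self.next_block += 1
        return self.next_block

class ToyFile:
    def __init__(self, allocator):
        self.allocator = allocator
        self.extent_map = {}    # logical block -> physical block

    def write_block(self, logical):
        if logical not in self.extent_map:
            # unmapped: allocation path, an AG gets involved
            self.extent_map[logical] = self.allocator.allocate()
        # mapped: IO simply addresses the block and goes
        return self.extent_map[logical]

alloc = ToyAllocator()
f = ToyFile(alloc)
f.write_block(0)                # append: allocator involved
f.write_block(0)                # overwrite: allocator skipped
print(alloc.calls)              # 1
```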
>>> In a
>>> non-striped volume, if some of the AGs are temporarily slower to
>>> respond than others due to one of the underlying volumes being
>>> slow, will XFS prefer the quicker responding AGs
>>
>> No, it does not.
>>
>>> or is I/O always
>>> evenly distributed?
>>
>> No, it is not.
>>
>>> If XFS prefers the more responsive AG's it
>>> seems to me that it would be better NOT to stripe the underlying
>>> disk since all AG's that are distributed in a stripe will
>>> continuously hit all component volumes, including the slow volume
>>> (unless if XFS compensates for this?)
>>
>> I think you have the wrong idea about what allocation groups do.
> I'm reading the XFS File System Structure doc on xfs.org. It says,
> "XFS filesystems are divided into a number of equally sized chunks
> called Allocation Groups. Each AG can almost be thought of as an
> individual filesystem." so that's where most of my assumptions are
> coming from.
Well, the above quote is correct, but it doesn't say anything about
IO time, latency, responsiveness, or anything like that. Each AG
does indeed include its own structures to track allocation, but that's
unrelated to any notion of "fast" or "slow."
>> They are for maintaining allocation concurrency and locality of
>> related objects on disk - they have no influence on where IO is
>> directed based on IO load or response time.
> I understand that XFS has locality as far as trying to write files to
> the same AG as the parent directory. Are there other cases?
In general, new directories go to a new AG. Inodes within that directory
tend to stay in the same AG as their parent, and data blocks associated
with those inodes tend to stay nearby as well. That's the high-level
goal, but of course fragmented freespace and near-full conditions can
cause that to not remain true.
> I get that it's probably not measuring the responsiveness of each AG.
It is *definitely* not measuring the responsiveness of each AG :)
> I guess what I'm trying to ask is - will XFS *indirectly* compensate
> if one subvolume is busier? For example, if writes to a "slow"
> subvolume and resident AGs take longer to complete, will XFS tend to
> prefer to use other less-busy AGs more often (with the exception of
> locality) for writes? What is the basic algorithm for determining
> where new data is written? In load-balancer terms, does it
> round-robin, pick the least busy, etc?
xfs has no notion of fast vs slow regions. See above for the basic
algorithm; it's round-robin for new directories, keep inodes and blocks
near their parent if possible. There are a few other smaller-granularity
heuristics related to stripe geometry as well.
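[Editor's note: the placement policy described above can be sketched as a toy model. Note there is no latency input anywhere - a "slow" AG is chosen just as often. This is an illustrative sketch, not the real allocator.]

```python
# Toy sketch: new directories round-robin across AGs via a rotor;
# files inherit their parent directory's AG (locality).
# Illustrative only, not the real XFS placement code.

AG_COUNT = 8  # hypothetical AG count

class ToyPlacement:
    def __init__(self, ag_count):
        self.ag_count = ag_count
        self.rotor = 0          # next AG for a new directory
        self.ag_of = {}         # path -> AG number

    def mkdir(self, path):
        self.ag_of[path] = self.rotor
        self.rotor = (self.rotor + 1) % self.ag_count

    def create(self, parent, name):
        # files stay in the same AG as their parent directory
        self.ag_of[parent + "/" + name] = self.ag_of[parent]

p = ToyPlacement(AG_COUNT)
for d in ("logs", "db", "tmp"):
    p.mkdir(d)
    p.create(d, "file")

print(p.ag_of)
# {'logs': 0, 'logs/file': 0, 'db': 1, 'db/file': 1, 'tmp': 2, 'tmp/file': 2}
```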
-Eric
> Thank you very much!
> JG
>
* Re: XFS on top of LVM span in AWS. Stripe or are AG's good enough?
2016-08-16 17:37 ` Eric Sandeen
@ 2016-08-17 16:23 ` Jeff Gibson
2016-08-17 17:26 ` Eric Sandeen
0 siblings, 1 reply; 6+ messages in thread
From: Jeff Gibson @ 2016-08-17 16:23 UTC (permalink / raw)
To: xfs@oss.sgi.com
Thanks for the great info guys.
Sorry to beat a dead horse here. Just to be absolutely clear-
> > I guess what I'm trying to ask is - will XFS *indirectly* compensate
> > if one subvolume is busier? For example, if writes to a "slow"
> > subvolume and resident AGs take longer to complete, will XFS tend to
> > prefer to use other less-busy AGs more often (with the exception of
> > locality) for writes? What is the basic algorithm for determining
> > where new data is written? In load-balancer terms, does it
> > round-robin, pick the least busy, etc?
>
> xfs has no notion of fast vs slow regions. See above for the basic
> algorithm; it's round-robin for new directories, keep inodes and blocks
> near their parent if possible.
So if one EBS LVM subvolume has subpar performance, it will basically slow down writes to the whole XFS volume. XFS doesn't have any notion of a per-AG queue or any other mechanism for compensating for uneven performance across AGs.
> There are a few other smaller-granularity
> heuristics related to stripe geometry as well.
Oh, cool. Since I'm considering stripe vs. linear for the LVM volume, I'd be very interested in what these are.
Thank you again,
JG
* Re: XFS on top of LVM span in AWS. Stripe or are AG's good enough?
2016-08-17 16:23 ` Jeff Gibson
@ 2016-08-17 17:26 ` Eric Sandeen
0 siblings, 0 replies; 6+ messages in thread
From: Eric Sandeen @ 2016-08-17 17:26 UTC (permalink / raw)
To: xfs
On 8/17/16 11:23 AM, Jeff Gibson wrote:
> Thanks for the great info guys.
>
> Sorry to beat a dead horse here. Just to be absolutely clear-
>
>>> I guess what I'm trying to ask is - will XFS *indirectly* compensate
>>> if one subvolume is busier? For example, if writes to a "slow"
>>> subvolume and resident AGs take longer to complete, will XFS tend to
>>> prefer to use other less-busy AGs more often (with the exception of
>>> locality) for writes? What is the basic algorithm for determining
>>> where new data is written? In load-balancer terms, does it
>>> round-robin, pick the least busy, etc?
>>
>> xfs has no notion of fast vs slow regions. See above for the basic
>> algorithm; it's round-robin for new directories, keep inodes and blocks
>> near their parent if possible.
> So if one EBS LVM subvolume has subpar performance it will basically
> slow down writes to the whole XFS volume. XFS doesn't have any
> notion of a queue per AG or any other mechanism for compensating
> uneven performance of AGs.
It will slow down writes to blocks in that block device.
If those blocks gate other IO (i.e. core metadata structures, maybe
the log), then it could conceivably have an fs-wide impact.
i.e. -
If file "foo" has 100 blocks allocated in a slow-responding volume,
writing to those 100 blocks would only slow down that write.
If the log is allocated in a slow-responding volume and a workload
is log-bound, then it could have an fs-wide impact.
Again, xfs has no fast/slow notion. There is no compensation.
IO queues are below the filesystem; there is no IO queue per AG.
>> There are a few other smaller-granularity
>> heuristics related to stripe geometry as well.
> Oh, cool. Since I'm considering stripe vs. linear for the LVM volume, I'd be very interested in what these are.
Simply things like allocating files on stripe boundaries if possible.
So if you have a 64k stripe, it would try (IIRC) to allocate files
(at least larger files, not remembering details for sure) on 64k
boundaries.
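[Editor's note: the rounding heuristic Eric describes can be sketched like this. The 64k stripe unit and the function shape are assumptions for illustration, not kernel logic.]

```python
# Sketch of rounding an allocation request up to a stripe-unit
# boundary, in the spirit of the xfs_iomap comment quoted below.

def round_up(value, boundary):
    return ((value + boundary - 1) // boundary) * boundary

STRIPE_UNIT = 64 * 1024  # hypothetical 64k stripe unit (m_dalign)

def align_alloc(request_bytes, file_size):
    # only align when the file is at least one stripe unit long
    if file_size >= STRIPE_UNIT:
        return round_up(request_bytes, STRIPE_UNIT)
    return request_bytes

print(align_alloc(70 * 1024, 1024 * 1024))  # 131072 (two stripe units)
print(align_alloc(4096, 4096))              # 4096 (small file: unchanged)
```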
If you're keen to look at code, m_dalign is the stripe unit for the
fs. You'll find things like:
/*
 * Round up the allocation request to a stripe unit
 * (m_dalign) boundary if the file size is >= stripe unit
 * size, and we are allocating past the allocation eof.
 *
 * If mounted with the "-o swalloc" option the alignment is
 * increased from the strip unit size to the stripe width.
 */
or for inode allocation:
/*
 * Set the alignment for the allocation.
 * If stripe alignment is turned on then align at stripe unit
 * boundary.
 * If the cluster size is smaller than a filesystem block
 * then we're doing I/O for inodes in filesystem block size
 * pieces, so don't need alignment anyway.
 */
-Eric