* block layer API for file system creation - when to use multidisk mode
From: Ric Wheeler @ 2018-11-30 18:00 UTC
To: Dave Chinner
Cc: Jens Axboe, Eric Sandeen, linux-scsi@vger.kernel.org, Mike Snitzer, Eric Sandeen, xfs, IDE/ATA development list, device-mapper development, Mark Nelson, Ilya Dryomov

On 11/30/18 7:55 AM, Dave Chinner wrote:
> On Thu, Nov 29, 2018 at 06:53:14PM -0500, Ric Wheeler wrote:
>> On 11/29/18 4:48 PM, Dave Chinner wrote:
>>> On Thu, Nov 29, 2018 at 08:53:39AM -0500, Ric Wheeler wrote:
>>>> On 10/6/18 8:14 PM, Eric Sandeen wrote:
>>>>> On 10/6/18 6:20 PM, Dave Chinner wrote:
>>>>>>> Can you give an example of a use case that would be negatively affected if this heuristic was switched from "sunit" to "sunit < swidth"?
>>>>>> Any time you only know a single alignment characteristic of the underlying multi-disk storage, e.g. hardware RAID0/5/6 that sets iomin = ioopt, multi-level RAID constructs where only the largest alignment requirement is exposed, RAID1 devices exposing their chunk size, remote replication chunk alignment (because remote replication is slow and so we need more concurrency to keep the pipeline full), etc.
>>>>> So the tl;dr here is "given any iomin > 512, we should infer low seek latency and parallelism and adjust geometry accordingly?"
>>>>>
>>>>> -Eric
>>>> Chiming in late here, but I do think that every decade or two (no disrespect to xfs!), it is worth having a second look at how the storage has changed under us.
>>>>
>>>> The workload that has lots of file systems pounding on a shared device, for example, is one way to lay out container storage.
>>> The problem is that defaults can't cater for every use case. And in this case, we've got nothing to tell us that this is aggregated/shared storage rather than "the filesystem owns the entire device".
>>>
>>>> No argument about documenting how to fix this with command line tweaks for now, but maybe this would be a good topic for the next LSF/MM shared track of file & storage people to debate?
>>> Doubt it - this is really only an XFS problem at this point.
>>>
>>> i.e. if we can't infer what the user wants from existing information, then I don't see how the storage is going to be able to tell us anything different, either. i.e. somewhere in the stack the user is going to have to tell the block device that this is aggregated storage.
>>>
>>> But even then, if it's aggregated solid state storage, we still want to make use of the concurrency of an increased AG count because there is no seek penalty like spinning drives end up with. Or if the aggregated storage is thinly provisioned, the AG count of the filesystem just doesn't matter because the IO is going to be massively randomised (i.e. take random seek penalties) by the thinp layout.
>>>
>>> So there's really no good way of "guessing" whether aggregated storage should or shouldn't use elevated AG counts even if the storage says "this is aggregated storage". The user still has to give us some kind of explicit hint about how the filesystem should be configured.
>>>
>>> What we need is for a solid, reliable detection heuristic to be suggested by the people that need this functionality before there's anything we can talk about.
>> I think that is exactly the kind of discussion that the shared file/storage track is good for.
> Yes, but why on earth do we need to wait 6 months to have that conversation. Start it now...

Sure, that is definitely a good idea - I have added some of the storage lists to this reply. There is no perfect all-encompassing block layer list that I know of.

>> Other file systems also need to accommodate/probe behind the fictitious visible storage device layer... Specifically, is there something we can add per block device to help here? Number of independent devices
> That's how mkfs.xfs used to do stripe unit/stripe width calculations automatically on MD devices back in the 2000s. We got rid of that for more generally applicable configuration information such as minimum/optimal IO sizes so we could expose equivalent alignment information from lots of different types of storage device....
>
>> or a map of those regions?
> Not sure what this means or how we'd use it.
>
> Cheers,
>
> Dave.

What I was thinking of was a way of giving us a good outline of how many independent regions are behind one "virtual" block device like a Ceph RBD or device mapper device. My assumption is that we are trying to lay down (at least one) allocation group per region.

What we need to optimize for includes:

* how many independent regions are there?

* what are the boundaries of those regions?

* optimal IO size/alignment/etc.

Some of that we have, but the current assumptions don't work well for all device types.

Regards,

Ric
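[Editor's illustration] The two rules being compared above can be sketched directly from the queue limits a block device already exports. The snippet below is only a rough sketch, not mkfs.xfs source (mkfs.xfs obtains the same values through libblkid's topology probing rather than raw sysfs reads): it reads minimum_io_size and optimal_io_size for a device and evaluates both the "any iomin > 512 implies multidisk" rule and the proposed "iomin < ioopt" (i.e. sunit < swidth) rule. The device name argument is just an example.

/*
 * Sketch of the multidisk-detection heuristics under discussion.
 * Reads the queue limits a block device exposes in sysfs and compares
 * the current-style rule (any iomin > 512 implies multidisk) with the
 * proposed rule (iomin strictly smaller than ioopt, i.e. sunit < swidth).
 */
#include <stdio.h>

static unsigned long read_queue_limit(const char *dev, const char *attr)
{
	char path[256];
	unsigned long val = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
	f = fopen(path, "r");
	if (!f)
		return 0;
	if (fscanf(f, "%lu", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "sda";
	unsigned long iomin = read_queue_limit(dev, "minimum_io_size");
	unsigned long ioopt = read_queue_limit(dev, "optimal_io_size");

	/* current-style rule: any non-trivial iomin flags multidisk */
	int multidisk_now = iomin > 512;
	/* proposed rule: only a stripe width wider than the unit does */
	int multidisk_proposed = iomin && ioopt && iomin < ioopt;

	printf("%s: iomin=%lu ioopt=%lu multidisk(now)=%d multidisk(sunit<swidth)=%d\n",
	       dev, iomin, ioopt, multidisk_now, multidisk_proposed);
	return 0;
}

Note how the two rules diverge for the cases Dave lists: hardware RAID that reports iomin == ioopt, or a RAID1 device exposing only its chunk size, would be treated as multidisk by the first rule but not by the second.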
* Re: block layer API for file system creation - when to use multidisk mode
From: Mark Nelson @ 2018-11-30 18:05 UTC
To: Ric Wheeler, Dave Chinner
Cc: Jens Axboe, Eric Sandeen, Mike Snitzer, linux-scsi@vger.kernel.org, Eric Sandeen, xfs, IDE/ATA development list, device-mapper development, Ilya Dryomov

On 11/30/18 12:00 PM, Ric Wheeler wrote:
> On 11/30/18 7:55 AM, Dave Chinner wrote:
>> On Thu, Nov 29, 2018 at 06:53:14PM -0500, Ric Wheeler wrote:
>>> On 11/29/18 4:48 PM, Dave Chinner wrote:
>>>> On Thu, Nov 29, 2018 at 08:53:39AM -0500, Ric Wheeler wrote:
>>>>> On 10/6/18 8:14 PM, Eric Sandeen wrote:
>>>>>> On 10/6/18 6:20 PM, Dave Chinner wrote:
>>>>>>>> Can you give an example of a use case that would be negatively affected if this heuristic was switched from "sunit" to "sunit < swidth"?
>>>>>>> Any time you only know a single alignment characteristic of the underlying multi-disk storage, e.g. hardware RAID0/5/6 that sets iomin = ioopt, multi-level RAID constructs where only the largest alignment requirement is exposed, RAID1 devices exposing their chunk size, remote replication chunk alignment (because remote replication is slow and so we need more concurrency to keep the pipeline full), etc.
>>>>>> So the tl;dr here is "given any iomin > 512, we should infer low seek latency and parallelism and adjust geometry accordingly?"
>>>>>>
>>>>>> -Eric
>>>>> Chiming in late here, but I do think that every decade or two (no disrespect to xfs!), it is worth having a second look at how the storage has changed under us.
>>>>>
>>>>> The workload that has lots of file systems pounding on a shared device, for example, is one way to lay out container storage.
>>>> The problem is that defaults can't cater for every use case. And in this case, we've got nothing to tell us that this is aggregated/shared storage rather than "the filesystem owns the entire device".
>>>>
>>>>> No argument about documenting how to fix this with command line tweaks for now, but maybe this would be a good topic for the next LSF/MM shared track of file & storage people to debate?
>>>> Doubt it - this is really only an XFS problem at this point.
>>>>
>>>> i.e. if we can't infer what the user wants from existing information, then I don't see how the storage is going to be able to tell us anything different, either. i.e. somewhere in the stack the user is going to have to tell the block device that this is aggregated storage.
>>>>
>>>> But even then, if it's aggregated solid state storage, we still want to make use of the concurrency of an increased AG count because there is no seek penalty like spinning drives end up with. Or if the aggregated storage is thinly provisioned, the AG count of the filesystem just doesn't matter because the IO is going to be massively randomised (i.e. take random seek penalties) by the thinp layout.
>>>>
>>>> So there's really no good way of "guessing" whether aggregated storage should or shouldn't use elevated AG counts even if the storage says "this is aggregated storage". The user still has to give us some kind of explicit hint about how the filesystem should be configured.
>>>>
>>>> What we need is for a solid, reliable detection heuristic to be suggested by the people that need this functionality before there's anything we can talk about.
>>> I think that is exactly the kind of discussion that the shared file/storage track is good for.
>> Yes, but why on earth do we need to wait 6 months to have that conversation. Start it now...
>
> Sure, that is definitely a good idea - I have added some of the storage lists to this reply. There is no perfect all-encompassing block layer list that I know of.
>
>>> Other file systems also need to accommodate/probe behind the fictitious visible storage device layer... Specifically, is there something we can add per block device to help here? Number of independent devices
>> That's how mkfs.xfs used to do stripe unit/stripe width calculations automatically on MD devices back in the 2000s. We got rid of that for more generally applicable configuration information such as minimum/optimal IO sizes so we could expose equivalent alignment information from lots of different types of storage device....
>>
>>> or a map of those regions?
>> Not sure what this means or how we'd use it.
>>
>> Cheers,
>>
>> Dave.
>
> What I was thinking of was a way of giving us a good outline of how many independent regions are behind one "virtual" block device like a Ceph RBD or device mapper device. My assumption is that we are trying to lay down (at least one) allocation group per region.
>
> What we need to optimize for includes:
>
> * how many independent regions are there?
>
> * what are the boundaries of those regions?
>
> * optimal IO size/alignment/etc.
>
> Some of that we have, but the current assumptions don't work well for all device types.
>
> Regards,
>
> Ric

I won't comment on the details as there are others here that are far more knowledgeable than I am, but at a high level I think your idea is absolutely fantastic from the standpoint of making this decision process more explicit.

Mark
* Re: block layer API for file system creation - when to use multidisk mode
From: Dave Chinner @ 2018-12-01 4:35 UTC
To: Ric Wheeler
Cc: Jens Axboe, Eric Sandeen, linux-scsi@vger.kernel.org, Mike Snitzer, Eric Sandeen, xfs, IDE/ATA development list, device-mapper development, Mark Nelson, Ilya Dryomov

On Fri, Nov 30, 2018 at 01:00:52PM -0500, Ric Wheeler wrote:
> On 11/30/18 7:55 AM, Dave Chinner wrote:
>> On Thu, Nov 29, 2018 at 06:53:14PM -0500, Ric Wheeler wrote:
>>> Other file systems also need to accommodate/probe behind the fictitious visible storage device layer... Specifically, is there something we can add per block device to help here? Number of independent devices
>> That's how mkfs.xfs used to do stripe unit/stripe width calculations automatically on MD devices back in the 2000s. We got rid of that for more generally applicable configuration information such as minimum/optimal IO sizes so we could expose equivalent alignment information from lots of different types of storage device....
>>
>>> or a map of those regions?
>> Not sure what this means or how we'd use it.
>> Dave.
>
> What I was thinking of was a way of giving us a good outline of how many independent regions are behind one "virtual" block device like a Ceph RBD or device mapper device. My assumption is that we are trying to lay down (at least one) allocation group per region.
>
> What we need to optimize for includes:
>
> * how many independent regions are there?
>
> * what are the boundaries of those regions?
>
> * optimal IO size/alignment/etc.
>
> Some of that we have, but the current assumptions don't work well for all device types.

Oh, so essentially "independent regions" of the storage device. I wrote this up in 2008:

http://xfs.org/index.php/Reliable_Detection_and_Repair_of_Metadata_Corruption#Failure_Domains

This was derived from the ideas in prototype code I wrote in ~2007 to try to optimise file layout and load distribution across linear concats of multi-TB RAID6 luns. Some of that work was published long after I left SGI:

https://marc.info/?l=linux-xfs&m=123441191222714&w=2

Essentially, independent regions - called "Logical Extension Groups", or "legs" of the filesystem - would each be an aggregation of the AGs in that region. The concept was that we'd move the geometry information from the superblock into the legs, so we could have different AG geometry optimisations for each independent leg of the filesystem.

e.g. the SSD region could have numerous small AGs, while the large, contiguous RAID6 part could have maximally sized AGs or even make use of the RT allocator for free space management instead of the AG/btree allocator. Basically it was seen as a mechanism for getting rid of the need to specify block devices as command line or mount options.

Fundamentally, though, it was based on the concept that Linux would eventually grow an interface for the block device/volume manager to tell the filesystem where the independent regions in the device were(*), but that's not something that has ever appeared. If you can provide an independent region map in an easy to digest format (e.g. a set of {offset, len, geometry} tuples), then we can obviously make use of it in XFS....

Cheers,

Dave.

(*) Basically provide a Linux version of the functionality Irix volume managers had provided filesystems since the late 80s....

--
Dave Chinner
david@fromorbit.com
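[Editor's illustration] To make the "{offset, len, geometry} tuples" Dave asks for a little more tangible, one possible C layout for such a region map is sketched below. No interface like this exists in the block layer today; the structure and field names are invented purely for illustration of what a volume manager or device like a Ceph RBD might report per independent region ("leg").

/*
 * Hypothetical sketch only - no such interface exists in the kernel.
 * One way the {offset, len, geometry} tuples could be expressed if the
 * block layer ever exported an independent-region map to filesystems.
 */
#include <stdint.h>

struct blkdev_region_geometry {
	uint32_t io_min;	/* minimum I/O size for this region, bytes */
	uint32_t io_opt;	/* optimal I/O size (stripe width), bytes */
	uint8_t  rotational;	/* spinning media vs. solid state */
};

struct blkdev_region {
	uint64_t offset;	/* start of the independent region, bytes */
	uint64_t len;		/* length of the region, bytes */
	struct blkdev_region_geometry geom;
};

struct blkdev_region_map {
	uint32_t nr_regions;		/* number of independent regions */
	struct blkdev_region regions[];	/* one entry per region/"leg" */
};

With something like this, mkfs could size and place allocation groups per region, along the lines of the "legs" design Dave describes above.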
* Re: block layer API for file system creation - when to use multidisk mode
From: Ric Wheeler @ 2018-12-01 20:52 UTC
To: Dave Chinner
Cc: Jens Axboe, Eric Sandeen, linux-scsi@vger.kernel.org, Mike Snitzer, Eric Sandeen, xfs, IDE/ATA development list, device-mapper development, linux-block, Mark Nelson, Ilya Dryomov

On 11/30/18 11:35 PM, Dave Chinner wrote:
> On Fri, Nov 30, 2018 at 01:00:52PM -0500, Ric Wheeler wrote:
>> On 11/30/18 7:55 AM, Dave Chinner wrote:
>>> On Thu, Nov 29, 2018 at 06:53:14PM -0500, Ric Wheeler wrote:
>>>> Other file systems also need to accommodate/probe behind the fictitious visible storage device layer... Specifically, is there something we can add per block device to help here? Number of independent devices
>>> That's how mkfs.xfs used to do stripe unit/stripe width calculations automatically on MD devices back in the 2000s. We got rid of that for more generally applicable configuration information such as minimum/optimal IO sizes so we could expose equivalent alignment information from lots of different types of storage device....
>>>
>>>> or a map of those regions?
>>> Not sure what this means or how we'd use it.
>>> Dave.
>> What I was thinking of was a way of giving us a good outline of how many independent regions are behind one "virtual" block device like a Ceph RBD or device mapper device. My assumption is that we are trying to lay down (at least one) allocation group per region.
>>
>> What we need to optimize for includes:
>>
>> * how many independent regions are there?
>>
>> * what are the boundaries of those regions?
>>
>> * optimal IO size/alignment/etc.
>>
>> Some of that we have, but the current assumptions don't work well for all device types.
> Oh, so essentially "independent regions" of the storage device. I wrote this up in 2008:
>
> http://xfs.org/index.php/Reliable_Detection_and_Repair_of_Metadata_Corruption#Failure_Domains
>
> This was derived from the ideas in prototype code I wrote in ~2007 to try to optimise file layout and load distribution across linear concats of multi-TB RAID6 luns. Some of that work was published long after I left SGI:
>
> https://marc.info/?l=linux-xfs&m=123441191222714&w=2
>
> Essentially, independent regions - called "Logical Extension Groups", or "legs" of the filesystem - would each be an aggregation of the AGs in that region. The concept was that we'd move the geometry information from the superblock into the legs, so we could have different AG geometry optimisations for each independent leg of the filesystem.
>
> e.g. the SSD region could have numerous small AGs, while the large, contiguous RAID6 part could have maximally sized AGs or even make use of the RT allocator for free space management instead of the AG/btree allocator. Basically it was seen as a mechanism for getting rid of the need to specify block devices as command line or mount options.
>
> Fundamentally, though, it was based on the concept that Linux would eventually grow an interface for the block device/volume manager to tell the filesystem where the independent regions in the device were(*), but that's not something that has ever appeared. If you can provide an independent region map in an easy to digest format (e.g. a set of {offset, len, geometry} tuples), then we can obviously make use of it in XFS....
>
> Cheers,
>
> Dave.
>
> (*) Basically provide a Linux version of the functionality Irix volume managers had provided filesystems since the late 80s....

Hi Dave,

This is exactly the kind of thing I think would be useful. We might want to have a distinct value (like the rotational flag) that indicates this is a device with multiple "legs", so that in the normal case we can query that and not have to look for the more complicated information.

Regards,

Ric
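[Editor's illustration] The cheap "is this a multi-leg device?" probe Ric suggests, modelled on the existing rotational flag, might be used roughly as follows. The attribute name "independent_regions" is made up for illustration; nothing like it exists in today's sysfs, and a real interface would need agreement on where such a hint lives.

/*
 * Hypothetical usage sketch: probe a cheap per-device hint first and
 * only fetch a full region map when the device reports multiple legs.
 * The "independent_regions" attribute is invented for illustration.
 */
#include <stdio.h>

static long nr_independent_regions(const char *dev)
{
	char path[256];
	long nr = 1;	/* default: treat the device as a single region */
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/block/%s/queue/independent_regions", dev);
	f = fopen(path, "r");
	if (!f)
		return 1;	/* attribute absent: no region information */
	if (fscanf(f, "%ld", &nr) != 1 || nr < 1)
		nr = 1;
	fclose(f);
	return nr;
}

int main(void)
{
	long nr = nr_independent_regions("sda");

	if (nr > 1)
		printf("multi-leg device: %ld regions, fetch the full map\n", nr);
	else
		printf("single region: fall back to iomin/ioopt hints\n");
	return 0;
}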