From: Ric Wheeler <ricwheeler@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>, Eric Sandeen <sandeen@redhat.com>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	Mike Snitzer <snitzer@redhat.com>,
	Eric Sandeen <sandeen@sandeen.net>,
	xfs <linux-xfs@vger.kernel.org>,
	IDE/ATA development list <linux-ide@vger.kernel.org>,
	device-mapper development <dm-devel@redhat.com>,
	Mark Nelson <mnelson@redhat.com>,
	Ilya Dryomov <idryomov@gmail.com>
Subject: block layer API for file system creation - when to use multidisk mode
Date: Fri, 30 Nov 2018 13:00:52 -0500	[thread overview]
Message-ID: <3da04164-a89f-f4c0-1529-eab12b3226e1@gmail.com> (raw)
In-Reply-To: <20181130022510.GW6311@dastard>

On 11/30/18 7:55 AM, Dave Chinner wrote:
> On Thu, Nov 29, 2018 at 06:53:14PM -0500, Ric Wheeler wrote:
>> On 11/29/18 4:48 PM, Dave Chinner wrote:
>>> On Thu, Nov 29, 2018 at 08:53:39AM -0500, Ric Wheeler wrote:
>>>> On 10/6/18 8:14 PM, Eric Sandeen wrote:
>>>>> On 10/6/18 6:20 PM, Dave Chinner wrote:
>>>>>>> Can you give an example of a use case that would be negatively affected
>>>>>>> if this heuristic was switched from "sunit" to "sunit < swidth"?
>>>>>> Any time you only know a single alignment characteristic of the
>>>>>> underlying multi-disk storage. e.g. hardware RAID0/5/6 that sets
>>>>>> iomin = ioopt, multi-level RAID constructs where only the largest
>>>>>> alignment requirement is exposed, RAID1 devices exposing their chunk
>>>>>> size, remote replication chunk alignment (because remote rep. is
>>>>>> slow and so we need more concurrency to keep the pipeline full),
>>>>>> etc.
>>>>> So the tl;dr here is "given any iomin > 512, we should infer low seek
>>>>> latency and parallelism and adjust geometry accordingly?"
>>>>>
>>>>> -Eric
>>>> Chiming in late here, but I do think that every decade or two (no
>>>> disrespect to xfs!), it is worth having a second look at how the
>>>> storage has changed under us.
>>>>
>>>> The workload that has lots of file systems pounding on a shared
>>>> device, for example, is one way to lay out container storage.
>>> The problem is that defaults can't cater for every use case.
>>> And in this case, we've got nothing to tell us that this is
>>> aggregated/shared storage rather than "the filesystem owns the
>>> entire device".
>>>
>>>> No argument about documenting how to fix this with command line
>>>> tweaks for now, but maybe this would be a good topic for the next
>>>> LSF/MM shared track of file & storage people to debate?
>>> Doubt it - this is really only an XFS problem at this point.
>>>
>>> i.e. if we can't infer what the user wants from existing
>>> information, then I don't see how the storage is going to be able to
>>> tell us anything different, either.  i.e. somewhere in the stack the
>>> user is going to have to tell the block device that this is
>>> aggregated storage.
>>>
>>> But even then, if it's aggregated solid state storage, we still want
>>> to make use of the concurrency of an increased AG count because there is
>>> no seek penalty like spinning drives end up with. Or if the
>>> aggregated storage is thinly provisioned, the AG count of the filesystem
>>> just doesn't matter because the IO is going to be massively
>>> randomised (i.e. take random seek penalties) by the thinp layout.
>>>
>>> So there's really no good way of "guessing" whether aggregated
>>> storage should or shouldn't use elevated AG counts even if the
>>> storage says "this is aggregated storage". The user still has to
>>> give us some kind of explicit hint about how the filesystem should
>>> be configured.
>>>
>>> What we need is for a solid, reliable detection heuristic to be
>>> suggested by the people that need this functionality before there's
>>> anything we can talk about.
>> I think that is exactly the kind of discussion that the shared
>> file/storage track is good for.
> Yes, but why on earth do we need to wait 6 months to have that
> conversation. Start it now...


Sure, that is definitely a good idea - I've added some of the storage lists to 
this reply. There is no perfect, all-encompassing block layer list that I know of.


>
>> Other file systems also need to
>> accommodate/probe behind the fictitious visible storage device
>> layer... Specifically, is there something we can add per block
>> device to help here? Number of independent devices
> That's how mkfs.xfs used to do stripe unit/stripe width calculations
> automatically on MD devices back in the 2000s. We got rid of that
> for more generally applicable configuration information such as
> minimum/optimal IO sizes so we could expose equivalent alignment
> information from lots of different types of storage device....
>
>> or a map of
>> those regions?
> Not sure what this means or how we'd use it.
>
> Cheers,
>
> Dave.
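
For context, those hints are what the block layer already exports as queue
limits. A rough sketch (untested, purely to illustrate what a mkfs-style tool
can already see for a given device) of reading them from userspace:

/* Query the existing block layer alignment hints that mkfs-style tools
 * consume today: minimum and optimal I/O size. Assumes Linux and read
 * access to the block device node passed on the command line.
 */
#include <fcntl.h>
#include <linux/fs.h>      /* BLKIOMIN, BLKIOOPT */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	unsigned int iomin = 0, ioopt = 0;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* iomin feeds the stripe unit guess, ioopt the stripe width guess */
	if (ioctl(fd, BLKIOMIN, &iomin) || ioctl(fd, BLKIOOPT, &ioopt))
		perror("ioctl");
	else
		printf("minimum_io_size=%u optimal_io_size=%u\n", iomin, ioopt);

	close(fd);
	return 0;
}

The same values show up in /sys/block/<dev>/queue/minimum_io_size and
optimal_io_size, so the problem isn't getting at iomin/ioopt - it is that
they say nothing about how many independent devices sit underneath.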

What I was thinking of was a way of giving us a good outline of how many 
independent regions are behind one "virtual" block device like a ceph rbd 
or device mapper device. My assumption is that we are trying to lay down (at 
least one) allocation group per region.

What we need to optimize for includes:

     * how many independent regions are there?

     * what are the boundaries of those regions?

     * optimal IO size/alignment/etc

Some of that we have, but the current assumptions don't work well for all device 
types.
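
As a straw man (purely hypothetical - none of these names exist today), the
extra hint could be as small as a region table exposed alongside the existing
queue limits, e.g.:

/* Hypothetical sketch only - not an existing kernel interface.
 * One possible shape for a per-device "independent regions" hint that a
 * mkfs tool could use to place at least one allocation group per region.
 */
struct blk_region {
	unsigned long long start;	/* region start, in 512-byte sectors */
	unsigned long long len;		/* region length, in 512-byte sectors */
};

struct blk_region_map {
	unsigned int		nr_regions;	/* independent backing regions */
	unsigned int		flags;		/* e.g. rotational, thinly provisioned */
	struct blk_region	regions[];	/* boundaries of each region */
};

Something like a ceph rbd or a dm stripe/linear target could fill this in
from its own mapping, and mkfs.xfs could size and place AGs from it instead
of trying to infer the topology from sunit/swidth alone.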

Regards,

Ric



Thread overview: 27+ messages
2018-10-04 17:58 [PATCH] mkfs.xfs: don't go into multidisk mode if there is only one stripe Ilya Dryomov
2018-10-04 18:33 ` Eric Sandeen
2018-10-04 18:56   ` Ilya Dryomov
2018-10-04 22:29   ` Dave Chinner
2018-10-05 11:27     ` Ilya Dryomov
2018-10-05 13:51       ` Eric Sandeen
2018-10-05 23:27         ` Dave Chinner
2018-10-06 12:17           ` Ilya Dryomov
2018-10-06 23:20             ` Dave Chinner
2018-10-07  0:14               ` Eric Sandeen
2018-11-29 13:53                 ` Ric Wheeler
2018-11-29 21:48                   ` Dave Chinner
2018-11-29 23:53                     ` Ric Wheeler
2018-11-30  2:25                       ` Dave Chinner
2018-11-30 18:00                         ` Ric Wheeler [this message]
2018-11-30 18:00                           ` block layer API for file system creation - when to use multidisk mode Ric Wheeler
2018-11-30 18:05                           ` Mark Nelson
2018-11-30 18:05                             ` Mark Nelson
2018-12-01  4:35                           ` Dave Chinner
2018-12-01  4:35                             ` Dave Chinner
2018-12-01 20:52                             ` Ric Wheeler
2018-12-01 20:52                               ` Ric Wheeler
2018-10-07 13:54               ` [PATCH] mkfs.xfs: don't go into multidisk mode if there is only one stripe Ilya Dryomov
2018-10-10  0:28                 ` Dave Chinner
2018-10-05 14:50       ` Mike Snitzer
2018-10-05 14:55         ` Eric Sandeen
2018-10-05 17:21           ` Ilya Dryomov
