* REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: Neil Brown @ 2009-06-25  3:58 UTC
  To: Mike Snitzer, Martin K. Petersen, Linus Torvalds,
	Alasdair G Kergon, jens.axboe
  Cc: linux-scsi, linux-kernel, linux-raid, linux-ide, dm-devel,
	linux-fsdevel


Hi,
 I have (belatedly, I admit) been looking at the new 'topology'
 metrics that were introduced for 2.6.31.
 I have a few questions about them which I have been discussing with
 Martin, but there is one issue that I feel fairly strongly about and
 would like to see changed before it gets into a -final release.
 Hence this email to try to garner understanding and consensus.

 The various topology metrics are exposed to user-space through sysfs
 attributes in the 'queue' subdirectory of the block device
 (e.g. .../sda/queue).
 I think this is a poor choice and would like to find a better
 choice.

 To explain why, it probably helps to review the current situation.
 Before 2.6.30, 'queue' contained:
   hw_sector_size	max_hw_sectors_kb  nr_requests	  rq_affinity
   iosched/		max_sectors_kb	   read_ahead_kb  scheduler
   iostats		nomerges	   rotational

 Of these:

   max_hw_sectors_kb, nr_requests, rq_affinity, iosched/,
   max_sectors_kb, scheduler, nomerges, rotational

 are really only relevant to the elevator code and those devices that
 used that code (ide, scsi, etc).
 They are not relevant for dm or md (md has its own separate 'md'
 directory, and before 2.6.30, the '/queue' subdirectory did not even
 appear in dm or md devices).

 Of the others:
   hw_sector_size   - is applicable to all block devices, and could
                      reasonably be placed one level up in the device
                      directory (along side 'size').
   read_ahead_kb    - a duplicate of bdi/read_ahead_kb
   iostats          - is a switch to enable or disable accounting of
                      statistics that are reported in the 'stat'
                      file (one level up)

 So most of '/queue' is specific to one class of devices (admittedly a
 very large class).  The others could be argued to be aberrations.

 Adding a number of extra fields such as minimum_io_size,
 optimal_io_size etc to '/queue' seems to increase the number of
 aberrations and forces md and dm devices to have a /queue which is
 largely irrelevant.

 One approach that we could take would be to hide all those fields
 in 'queue' that are not relevant to the current device, and let
 'queue' be a 'dumping ground' for each block device to place whatever
 sysfs attributes they want (thus md would move all of md/* to
 queue/*, and leave 'md' as a symlink to 'queue').

 I don't like this approach because it does not make best use of the
 name space.  If 'md' and 'queue' have different directories, they are
 each free to create new attributes without risk of collision between
 different drivers - not that the collision would be a technical
 problem but it could be confusing to users.

 So, where to put these new fields?

   They could go in the device directory, along side 'size' and 'ro'.
   Like those fields, the new ones give guidance to filesystems on how
   to use the device.  Whether or not this is a good thing depends a
   bit on how many fields we are talking about.  One or two might be
   OK.  4 or more might look messy.
   There are currently 4 fields: logical_block_size,
   physical_block_size, minimum_io_size, optimal_io_size.
   I have suggested to Martin that 2 are enough.  While I don't
   particularly want to debate that in this thread, it could be
   relevant so I'll outline my idea below.

   They could go in 'bdi' along with read_ahead_kb.  read_ahead_kb
   gives guidance on optimal read sizes.  The topology fields give
   guidance on optimal write sizes.  There is a synergy there.  And
   these fields are undeniably info about a backing device.
   NFS has its own per-filesystem bdi so we would not want to impose
   fields on NFS that weren't at all relevant.  NFS has 'rsize' and
   'wsize' which are somewhat related.  So I feel somewhat positive
   about this possibility.  My only concern is that 'read_ahead_kb' is
   more about reading *files*, whereas the *_block_size and *_io_size
   are about writing to the *device*.  I'm not sure how important a
   difference this is.

   They could go in a new subdirectory of the block device, just like
   the integrity fields, e.g. 'topology/' or 'metrics/'.  This would
   be my preferred approach if there do turn out to be the full 4
   fields.

Thanks for your attention.  Comments most welcome.

NeilBrown

----------------------
Alternate implementation with only two fields.
According to Documentation/ABI/testing/sysfs-block, both
physical_block_size and minimum_io_size are the smallest unit of IO
that doesn't require read-modify-write.  The first is thought to
relate to drives with 4K blocks.  The second to RAID5 arrays.
But that doesn't make sense as it stands: you cannot have two things
that are both the smallest.

Also, minimum_io_size and optimal_io_size are both described as a
"preferred" for IO - presumably writes, not reads.  Again, we cannot
have two values that are both preferred.  There is again some
suggestion that one is for disks and the other is for RAID, but I
cannot see how a mkfs would choose between them.

My conclusion is that there are two issues of importance.
1/ avoiding read-modify-write, as that can affect correctness (when a
  write error happens, you can lose unrelated data).
2/ throughput.

For each of these issues, there are a number of sizes that are
relevant.
e.g. as you increase the request size, performance can increase,
but there are key points where a small increase in size can give a big
increase in performance.  These sizes might include block size, chunk
size, stripe size, and cache size.

So I suggested two fields, each of which can store multiple values:

safe_write_size:  512 4096 327680
preferred_write_size: 4096 65536 327680 10485760

The guidance for using these is simple:
 When choosing a size where atomicity of writes is important, choose
 the largest size from safe_write_size which is practical (or a
 multiple thereof).

 When choosing a size which doesn't require atomicity, but where
 throughput is important, choose a multiple of the largest size from
 preferred_write_size which is practical.

The smallest safe_write_size would be taken as the logical_block_size.
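
To illustrate the intended use, here is a rough C sketch of how a
mkfs-style tool might consume one of these files (the file name is,
of course, my proposal and not an existing ABI):

    #include <stdio.h>

    /*
     * Pick the largest advertised size that is still practical,
     * i.e. no bigger than the caller's budget.  The file holds
     * ascending, space-separated sizes in bytes.
     */
    unsigned long choose_size(const char *path, unsigned long budget)
    {
            unsigned long best = 512, v;    /* fall back to one sector */
            FILE *f = fopen(path, "r");

            if (!f)
                    return best;
            while (fscanf(f, "%lu", &v) == 1)
                    if (v <= budget)
                            best = v;
            fclose(f);
            return best;
    }

With the example values above, choosing a journal block size with a
practical limit of 64K would be
    choose_size("/sys/block/md0/safe_write_size", 65536)
and would return 4096 from the list above.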

If we just have these two fields, I would put them in the top level
directory for the block device.


* Re: REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: Martin K. Petersen @ 2009-06-25  8:00 UTC
  To: NeilBrown
  Cc: Martin K. Petersen, linux-scsi, Mike Snitzer, linux-kernel,
	linux-raid, linux-ide, device-mapper development, jens.axboe,
	linux-fsdevel, Linus Torvalds, Alasdair G Kergon

>>>>> "Neil" == Neil Brown <neilb@suse.de> writes:

Neil,

Neil>  Of these:

Neil>    max_hw_sectors_kb, nr_requests, rq_affinity, iosched/,
Neil>    max_sectors_kb, scheduler, nomerges, rotational

Neil>  are really only relevant to the elevator code and those devices
Neil>  that used that code (ide, scsi, etc).

I'm not sure I completely agree with putting rotational in that bucket.
It affects the choice of allocation policy in btrfs, for instance.



Neil>  Of the others:
Neil>    hw_sector_size - is applicable to all block devices, and could
Neil>                       reasonably be placed one level up in the
Neil>                       device directory (along side 'size').

hw_sector_size is deprecated.  It's now split into logical and
physical_block_size.


Neil>  Adding a number of extra fields such as minimum_io_size,
Neil>  optimal_io_size etc to '/queue' seems to increase the number of
Neil>  aberrations and forces md and dm devices to have a /queue which
Neil>  is largely irrelevant.

You seem to be hung up on the fact that you don't queue things.  I think
that's beside the point.  You *do* have a request_queue thanks to
calling blk_queue_make_request() in md.c.  And there is more to
request_queue than the values you brought up.  Like the callback
functions.  I'm not saying that all the values in request_queue apply to
MD, but I really don't understand what all the fuss is about.  Other
than the presence of the string "queue" in the choice of naming.

Anyway.  If you look at the request_queue in the current tree you'll see
that the very limits we are discussing are contained in a separate
struct.  We can easily move that somewhere else at a later date if that
is deemed the right thing to do.


Neil>    I have suggested to Martin that 2 are enough.  

I think I have covered this in a separate mail.  You are mixing up
hardware limitations and I/O hints on the grounds that they went in as
part of the same patch set and live in the same place.

fdisk/mdadm/dmsetup need to use physical_block_size and alignment_offset
to prevent us from misaligning when setting up partitions and virtual
block devices.  Also, when stacking devices I need to know these values
to ensure that the I/O hints set by MD/DM don't conflict with the
underlying hardware limitations.  There are also special cases like
shared disk setups and filesystem journal padding that may need to know
details of the hardware atomicity.
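
For the partitioning case, the arithmetic I have in mind is roughly
the following (an illustrative sketch, not code from any particular
tool; a start is well aligned when start_in_bytes minus
alignment_offset is a multiple of physical_block_size):

    /*
     * First partition start (in 512-byte sectors) at or after
     * min_start that falls on a physical block boundary.
     */
    unsigned long long aligned_start(unsigned long long min_start,
                                     unsigned int phys_size,
                                     unsigned int align_offset)
    {
            unsigned long long bytes = min_start * 512ULL;
            unsigned int want = align_offset % phys_size;
            unsigned int have = bytes % phys_size;

            if (have != want)
                    bytes += (phys_size + want - have) % phys_size;
            return bytes / 512;
    }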

mkfs.* can leverage minimum_io_size and optimal_io_size hints to choose
block sizes and to lay out data structures on stripe boundaries.  Just
like we're doing today except using a common interface for all block
devices instead of poking at MD and LVM internals.
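
In C terms that translation is trivial (a sketch; as in my MD patch,
minimum_io_size is the chunk size and optimal_io_size the full stripe
width, both in bytes):

    /* ext-style stride and stripe-width are counted in fs blocks */
    static void hints_to_ext(unsigned int block_size,
                             unsigned int minimum_io_size,
                             unsigned int optimal_io_size,
                             unsigned int *stride,
                             unsigned int *stripe_width)
    {
            *stride       = minimum_io_size / block_size;
            *stripe_width = optimal_io_size / block_size;
    }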


logical_block_size, physical_block_size and alignment_offset are
hardware limits that need to be honored when creating a (virtual) block
device or partition.

The minimum/optimal write sizes are hints to the *user* of the block
device about how to lay out things.  If you look at my MD patch you'll
see that I only set the I/O hints.  The hardware settings are off limits
for MD.


I don't particularly care whether we store the values in queue/,
topology/, metrics/, limits/ or in the device root.  Nor whether we call
it minimum_write_size instead of minimum_io_size.  I'll be happy to roll
up a renaming patch...

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: NeilBrown @ 2009-06-25 11:07 UTC
  To: Martin K. Petersen
  Cc: Mike Snitzer, Martin K. Petersen, Linus Torvalds,
	Alasdair G Kergon, jens.axboe, linux-scsi, linux-kernel,
	linux-raid, linux-ide, linux-fsdevel, device-mapper development

On Thu, June 25, 2009 6:00 pm, Martin K. Petersen wrote:
>>>>>> "Neil" == Neil Brown <neilb@suse.de> writes:
>
> Neil,
>
> Neil>  Of these:
>
> Neil>    max_hw_sectors_kb, nr_requests, rq_affinity, iosched/,
> Neil>    max_sectors_kb, scheduler, nomerges, rotational
>
> Neil>  are really only relevant to the elevator code and those devices
> Neil>  that used that code (ide, scsi, etc).
>
> I'm not sure I completely agree with putting rotational in that bucket.
> It affects the choice of allocation policy in btrfs, for instance.
>

I confess to have some uncertainty about this.  There is, as far as
I can tell, no documentation for it.

So I asked git why it was added, and it pointed to
  commit 1308835ffffe6d61ad1f48c5c381c9cc47f683ec

which suggests that it was added so that user space could tell the
kernel whether the device was rotational, rather than the other way
around.

But yes, maybe 'rotational' doesn't belong with the others.

>
> Neil>  Adding a number of extra fields such as minimum_io_size,
> Neil>  optimal_io_size etc to '/queue' seems to increase the number of
> Neil>  aberrations and forces md and dm devices to have a /queue which
> Neil>  is largely irrelevant.
>
> You seem to be hung up on the fact that you don't queue things.  I think
> that's beside the point.  You *do* have a request_queue thanks to
> calling blk_queue_make_request() in md.c.  And there is more to
> request_queue than the values you brought up.  Like the callback
> functions.  I'm not saying that all the values in request_queue apply to
> MD, but I really don't understand what all the fuss is about.  Other
> than the presence of the string "queue" in the choice of naming.
>

Well names are very important.  And as I said later we could possibly keep
them in 'queue' and make 'queue' a more generic directory.  I don't like
that but it is probably better than the current situation.

As you say, I do currently have a request_queue, but that is an internal
detail, not part of the externally visible interface, and it is something
that is very much in my sights as something I want to change.  I'm
still working out the details so I'm a fair way from a concrete proposal
and a long way from some code.  That change certainly doesn't have
to happen in any rush.  But we should get the externally visible
names "right" if we can.

> Anyway.  If you look at the request_queue in the current tree you'll see
> that the very limits we are discussing are contained in a separate
> struct.  We can easily move that somewhere else at a later date if that
> is deemed the right thing to do.

Agreed.  But not relevant at the moment.  The view from user-space is what
is important.

>
>
> Neil>    I have suggested to Martin that 2 are enough.
>
> I think I have covered this in a separate mail.  You are mixing up
> hardware limitations and I/O hints on the grounds that they went in as
> part of the same patch set and live in the same place.

I think I'm actually mixing them up because they look very much the
same.  Both say "try to make write requests at least this big" and
I cannot see the difference mattering much to the filesystem.

I tend to mix them up a bit with read_ahead_kb as they are in some
ways just the 'write' version of that.  But it didn't arrive in the
same patch set :-)

Also, I think you seem to be treating the read-modify-write behaviour
of a 4K-sector hard drive as different-in-kind to the read-modify-write
behaviour of raid5.  I cannot see that.  In both cases an error can cause
unexpected corruption and in both cases getting the alignment right helps
throughput a lot.  So the only difference between these two values
is the size.
If one is 4K and one is 40Meg and you have 512bytes of data that you
want to write as safely as possible, you might pad it to 4K, but you
won't pad it to 40Meg.
If you have 32Meg of data that you want to write as safely as you can,
you may well pad it to 40Meg, rather than say  "it is a multiple of
4K, that is enough for me".
So: the difference is only in the size.


>
> fdisk/mdadm/dmsetup need to use physical_block_size and alignment_offset
> to prevent us from misaligning when setting up partitions and virtual
> block devices.  Also, when stacking devices I need to know these values
> to ensure that the I/O hints set by MD/DM don't conflict with the
> underlying hardware limitations.  There are also special cases like
> shared disk setups and filesystem journal padding that may need to know
> details of the hardware atomicity.

"... of the *device* atomicity." It hardly matters whether the device is
hardware or software, it can still have atomicity issues.

>
> mkfs.* can leverage minimum_io_size and optimal_io_size hints to choose
> block sizes and to lay out data structures on stripe boundaries.  Just
> like we're doing today except using a common interface for all block
> devices instead of poking at MD and LVM internals.

As we are using generic terms like "minimum" and "optimal", let's keep
with those terms when describing filesystem behaviour and not mention
stripes.
"mkfs.* can leverage these values to choose the most appropriate sizes
and alignments for various data structures".
What is the most appropriate size?  It depends on how much data we have
to store, and how much wastage we can afford.  So if there are a range
of reasonably optimal sizes, the fs can choose the best fit, without
needing to pretend to understand why one is more optimal than another.

And the "Just like we're doing today ..." is actually a bit sad.
If you look at mkfs.ext3, it requires 3 values:
  - block size
  - stride-size
  - stripe-width

block size needs to be smallish and a power of 2 and at most
PAGE_SIZE (I think).  It would be ideal if this could be stripe-size on
a raid5, but the stripe is usually too large so the next best is
physical_block_size (so, a size based decision).

stripe-width is only really needed on raid4/5/6 as it is aimed at
avoiding read-modify-write, so it would be the stripe size, which would
be minimum_io_size.  Though on a SCSI device it should probably
be "OPTIMAL TRANSFER LENGTH GRANULARITY" which I thought would be
 optimal_io_size.  So maybe we use that and make minimum_io_size and
optimal_io_size the same on raid5 ??  Or just provide a list of useful
sizes and let the fs choose whatever is in the right ball-park.

stride-size is used for raid0 or raid4, or the "Asymmetric" raid5 layouts.
It allows ext3 to stagger which drive certain 'hot' data structures are
on.  The current metrics don't allow for that at all.
I'm not saying they should, and I'm not sure how they could.  But it
is sad.

>
> logical_block_size, physical_block_size and alignment_offset are
> hardware limits that need to be honored when creating a (virtual) block
> device or partition.

logical_block_size is a firmware limit.  The others are just hints.
Strong hints I agree.  Hints we should follow if at all possible.
But still hints.

>
> The minimum/optimal write sizes are hints to the *user* of the block
> device about how to lay out things.  If you look at my MD patch you'll
> see that I only set the I/O hints.  The hardware settings are off limits
> for MD.
>
>
> I don't particularly care whether we store the values in queue/,
> topology/, metrics/, limits/ or in the device root.  Nor whether we call
> it minimum_write_size instead of minimum_io_size.  I'll be happy to roll
> up a renaming patch...

Good, thanks..  I do strongly care that they don't go in queue/, and
that they have a name that is as close as possible to what they mean.
Let's see if anyone else has an opinion.

Thanks,
NeilBrown




* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: John Robinson @ 2009-06-25 11:36 UTC
  To: NeilBrown
  Cc: Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, jens.axboe, linux-scsi, linux-kernel,
	linux-raid, linux-ide, linux-fsdevel, device-mapper development

On 25/06/2009 12:07, NeilBrown wrote:
[...]
> stripe-width is only really needed on raid4/5/6 as it is aimed at
> avoiding read-modify-write, so it would be the stripe size, which would
> be minimum_io_size.
[...]
> stride-size is used for raid0 or raid4, or the "Asymmetric" raid5 layouts.
> It allows ext3 to stagger which drive certain 'hot' data structures are
> on.  The current metrics don't allow for that at all.
> I'm not saying they should, and I'm not sure how they could.  But it
> is sad.

Even sadder, when a raid 0/4/5/6 is reshaped over more discs (and 
probably other scenarios outwith md), both stripe-width and stride-size 
change. Is there any prospect this new stacking could give us the 
opportunity to tell our client (LVM, filesystem, whatever) about the 
change, or that they'll be able to take advantage of it?

Cheers,

John.


* RE: REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: berthiaume_wayne @ 2009-06-25 12:17 UTC
  To: neilb, martin.petersen
  Cc: martin.petersen, linux-scsi, snitzer, linux-kernel, linux-raid,
	linux-ide, dm-devel, jens.axboe, linux-fsdevel, torvalds, agk

Just my 2 cents for what it's worth, but I'd have to agree with Neil that
the location of these attributes should be somewhere logical where they
can be easily found, and the naming convention of the attributes such
that they make sense to the common man.  After all, not all of us are
kernel developers. =;^)


* Re: REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: Martin K. Petersen @ 2009-06-25 17:38 UTC
  To: NeilBrown
  Cc: Mike Snitzer, linux-scsi, jens.axboe, linux-kernel, linux-raid,
	linux-ide, device-mapper development, Martin K. Petersen,
	linux-fsdevel, Linus Torvalds, Alasdair G Kergon

>>>>> "Neil" == NeilBrown  <neilb@suse.de> writes:

[rotational flag]

Neil> So I asked git why it was added, and it pointed to
Neil>   commit 1308835ffffe6d61ad1f48c5c381c9cc47f683ec

Neil> which suggests that it was added so that user space could tell the
Neil> kernel whether the device was rotational, rather than the other
Neil> way around.

There's an option to do it via udev for broken devices that don't report
it.  But both SCSI and ATA have a setting that gets queried and the
queue flag set accordingly.


Neil> Also, I think you seem to be treating the read-modify-write
Neil> behaviour of a 4K-sector hard drive as different-in-kind to the
Neil> read-modify-write behaviour of raid5.  I cannot see that.  In both
Neil> cases an error can cause unexpected corruption and in both cases
Neil> getting the alignment right helps throughput a lot.

If you get a write error on a RAID5 component you are able to
reconstruct and remap the stripe given the cache and the remaining
drives.

If you get a write error on a 4KB phys/512 byte logical drive the result
is undefined.  In a single machine setting you can treat the 4KB block
as suspect.  In a clustered setting, however, the other machines will
unknowingly be reading garbage.

I realize this is moot in the context of MD given that it doesn't
support shared storage.  But MD is not the only virtual block device
driver that I need to support with the topology bits.


Neil> So the only difference between these two values is the size.  If
Neil> one is 4K and one is 40Meg and you have 512bytes of data that you
Neil> want to write as safely as possible, you might pad it to 4K, but
Neil> you won't pad it to 40Meg.  If you have 32Meg of data that you want
Neil> to write as safely as you can, you may well pad it to 40Meg,
Neil> rather than say "it is a multiple of 4K, that is enough for me".
Neil> So: the difference is only in the size.

Yep.  I call the lower boundary minimum_io_size and the upper boundary
optimal_io_size.

People have been putting filesystems and databases on top of RAID
devices for ages.  And generally the best practice has been to align and
write in multiples of the chunk size and try to write full stripe
widths.

Given the requirement for read-modify-write on RAID[456] I can
understand your predisposition to set minimum_io_size to the stripe
width.  However, I'm not really sure that's what the user wants.  Given
the stripe cache I'm also not convinced the performance impact of the MD
RAID[456] RMW cycle is as bad as that of the disk drive.  So I set
minimum_io_size to the chunk size in my patch.
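
To make that concrete with illustrative numbers: a 6-drive RAID5 with
a 64KiB chunk has 5 data drives per stripe, so my patch would export

    minimum_io_size:  65536     (one chunk)
    optimal_io_size: 327680     (5 * 64KiB, the full stripe width)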

If you can come up with better names for minimum and optimal then that's
ok with me.  SCSI uses the term granularity.  I used that for a while in
my patches but most people thought that was really weird.  Minimum and
optimal seemed easier to grasp.  Maximum also exists in the storage
device context but is literally the largest I/O the device can receive.

And just to make it clear: I completely agree with your argument that
which knob to choose is I/O size dependent.  My beef with your proposal
is that I believe the length of the list should be 2.

How we report this stuff is really something I'd like the FS guys to
comment on, though.  The knobs we have now correspond to what's
currently used by XFS (libdisk) and indirectly by ext2+.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: Martin K. Petersen @ 2009-06-25 17:43 UTC
  To: John Robinson
  Cc: linux-raid, Martin K. Petersen, Mike Snitzer, linux-kernel,
	linux-scsi, linux-ide, device-mapper development, jens.axboe,
	linux-fsdevel, Linus Torvalds, Alasdair G Kergon

>>>>> "John" == John Robinson <john.robinson@anonymous.org.uk> writes:

John> Even sadder, when a raid 0/4/5/6 is reshaped over more discs (and
John> probably other scenarios outwith md),

I'm not sure I agree with the whole "sad" sentiment.  Just because
ext[234] doesn't query devices automatically doesn't mean the technology
doesn't exist.  I wrote the MD/LVM extraction code for XFS in 2001.


John> both stripe-width and stride-size change. Is there any prospect
John> this new stacking could give us the opportunity to tell our client
John> (LVM, filesystem, whatever) about the change, or that they'll be
John> able to take advantage of it?

When you resize the fs it would have to compare the new topology to the
old one and adjust accordingly.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: Linus Torvalds @ 2009-06-25 17:46 UTC
  To: Martin K. Petersen
  Cc: Mike Snitzer, linux-scsi, linux-kernel, linux-raid, linux-ide,
	device-mapper development, jens.axboe, linux-fsdevel,
	Alasdair G Kergon



On Thu, 25 Jun 2009, Martin K. Petersen wrote:
> 
> Neil> So I asked git why it was added, and it pointed to
> Neil>   commit 1308835ffffe6d61ad1f48c5c381c9cc47f683ec
> 
> Neil> which suggests that it was added so that user space could tell the
> Neil> kernel whether the device was rotational, rather than the other
> Neil> way around.
> 
> There's an option to do it via udev for broken devices that don't report
> it.  But both SCSI and ATA have a setting that gets queried and the
> queue flag set accordingly.

.. except few devices actually set it. 

That flag is _definitely_ all about the user being able to override it.

		Linus


* Re: REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: Jens Axboe @ 2009-06-25 19:34 UTC
  To: Linus Torvalds
  Cc: Martin K. Petersen, Mike Snitzer, linux-kernel, linux-raid,
	linux-ide, device-mapper development, linux-scsi, linux-fsdevel,
	Alasdair G Kergon

On Thu, Jun 25 2009, Linus Torvalds wrote:
> 
> 
> On Thu, 25 Jun 2009, Martin K. Petersen wrote:
> > 
> > Neil> So I asked git why it was added, and it pointed to
> > Neil>   commit 1308835ffffe6d61ad1f48c5c381c9cc47f683ec
> > 
> > Neil> which suggests that it was added so that user space could tell the
> > Neil> kernel whether the device was rotational, rather than the other
> > Neil> way around.
> > 
> > There's an option to do it via udev for broken devices that don't report
> > it.  But both SCSI and ATA have a setting that gets queried and the
> > queue flag set accordingly.
> 
> .. except few devices actually set it. 
> 
> That flag is _definitely_ all about the user being able to override it.

Most certainly, the idea was to add udev rules to set it for drives.
Fortunately newer drives work right without a need for such rules,
but it should be handy for the ones released last year and earlier.

Most user space will not care what the setting is, it's mostly for
internal use. CFQ uses it, as does btrfs to decide allocation policy.

-- 
Jens Axboe


* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: Jens Axboe @ 2009-06-25 19:40 UTC
  To: NeilBrown
  Cc: Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, linux-scsi, linux-kernel, linux-raid,
	linux-ide, linux-fsdevel, device-mapper development

On Thu, Jun 25 2009, NeilBrown wrote:
> > You seem to be hung up on the fact that you don't queue things.  I think
> > that's beside the point.  You *do* have a request_queue thanks to
> > calling blk_queue_make_request() in md.c.  And there is more to
> > request_queue than the values you brought up.  Like the callback
> > functions.  I'm not saying that all the values in request_queue apply to
> > MD, but I really don't understand what all the fuss is about.  Other
> > than the presence of the string "queue" in the choice of naming.
> >
> 
> Well names are very important.  And as I said later we could possibly keep
> them in 'queue' and make 'queue' a more generic directory.  I don't like
> that but it is probably better than the current situation.

Sorry to ask the obvious question, but what would the point of all this
pain be? The existing values can't go anywhere else, so you'd have to
add symlinks back into queue/ anyway.

> As you say, I do currently have a request_queue, but that is an internal
> detail, not part of the externally visible interface, and it is something
> that is very much in my sights as something I want to change.  I'm
> still working out the details so I'm a fair way from a concrete proposal
> and a long way from some code.  That change certainly doesn't have
> to happen in any rush.  But we should get the externally visible
> names "right" if we can.

What crack are you smoking? :-)

A block device must have a request_queue, that's pretty much spread
throughout the kernel. The fact that md/dm is only using a subset of the
functionality is not a very good reason for re-writing large parts of
that design. We could save some space, but whether the queue is 200
bytes or 1400 bytes doesn't really make a whole lot of real-world
difference. It's not like we allocate/deallocate these all the time,
they are mostly static structures.

-- 
Jens Axboe



* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: Neil Brown @ 2009-06-26 11:58 UTC
  To: Martin K. Petersen
  Cc: Mike Snitzer, Linus Torvalds, Alasdair G Kergon, jens.axboe,
	linux-scsi, linux-kernel, linux-raid, linux-ide, linux-fsdevel,
	device-mapper development

On Thursday June 25, martin.petersen@oracle.com wrote:
> >>>>> "Neil" == NeilBrown  <neilb@suse.de> writes:
> 
> And just to make it clear: I completely agree with your argument that
> which knob to choose is I/O size dependent.  My beef with your proposal
> is that I believe the length of the list should be 2.

0, 1, or infinity are the only credible sizes for this sort of list.

However I feel I've written enough on this particular issue (the
particular meaning of the various fields, not the directory location
which I still feel strongly about).

Providing the fields are clearly and unambiguously documented so that
I can use the documentation to verify the implementation (in md at
least), I will be satisfied.  And if the names of the files actually
match the documented meaning (so e.g. s/io/write/) I might even be
happy.

I'm looking forward to seeing how you justify the name
"physical_block_size" in a way the encompasses possibilities like a
device that stripes over a heterogeneous set of disk drives ;-)

Thanks,
NeilBrown


* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: Neil Brown @ 2009-06-26 12:41 UTC
  To: Jens Axboe
  Cc: Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, linux-scsi, linux-kernel, linux-raid,
	linux-ide, linux-fsdevel, device-mapper development

On Thursday June 25, jens.axboe@oracle.com wrote:
> On Thu, Jun 25 2009, NeilBrown wrote:
> > > You seem to be hung up on the fact that you don't queue things.  I think
> > > that's beside the point.  You *do* have a request_queue thanks to
> > > calling blk_queue_make_request() in md.c.  And there is more to
> > > request_queue than the values you brought up.  Like the callback
> > > functions.  I'm not saying that all the values in request_queue apply to
> > > MD, but I really don't understand what all the fuss is about.  Other
> > > than the presence of the string "queue" in the choice of naming.
> > >
> > 
> > Well names are very important.  And as I said later we could possibly keep
> > them in 'queue' and make 'queue' a more generic directory.  I don't like
> > that but it is probably better than the current situation.
> 
> Sorry to ask the obvious question, but what would the point of all this
> pain be? The existing values can't go anywhere else, so you'd have to
> add symlinks back into queue/ anyway.

Why cannot the existing values go anywhere else?  I don't understand
that assertion at all.

> 
> > As you say, I do currently have a request_queue, but that is an internal
> > detail, not part of the externally visible interface, and it is something
> > that is very much in my sights as something I want to change.  I'm
> > still working out the details so I'm a fair way from a concrete proposal
> > and a long way from some code.  That change certainly doesn't have
> > to happen in any rush.  But we should get the externally visible
> > names "right" if we can.
> 
> What crack are you smoking? :-)

I have a special mix of crack that helps me see Patterns everywhere,
even in C code.  Some patterns are bright, shiny, and elegant.  Others
are muddy and confused.  struct request_queue has a distinct shadow
over it just now.

> 
> A block device must have a request_queue, that's pretty much spread
> throughout the kernel. The fact that md/dm is only using a subset of the
> functionality is not a very good reason for re-writing large parts of
> that design. We could save some space, but whether the queue is 200
> bytes or 1400 bytes doesn't really make a whole lot of real-world
> difference. It's not like we allocate/deallocate these all the time,
> they are mostly static structures.

It isn't about saving space.  It is about maintainability.  To be
maintainable, the code must be easy to understand.  For that, it must
be well structured.

Every block device has a 'gendisk' (aka generic disk).
Every block device also (currently) has a request_queue.
If I have generic data that is applicable to all disks, where should I
put it?  One would think "gendisk".
If I have data that is related to request handling (as in uses of 
'struct request'), where should that go?  in request_queue I suspect.
But there are several fields in request_queue that are not related to
requests, and are generic to all disks.

I think "generic data goes in gendisk" is something that it would be
easy for people to understand.
The current 'topology' values are intended to be generic to all disks
so they should ideally go in gendisk.  I'm not pushing for that now.
Longer term, that would be my aim, but for now I'm just focussing on
restoring the 'queue' subdirectory to its previous non-generic state.

i.e. revert the change that made 'queue' appear for md and dm and loop
and nbd and ...., none of which have ever needed it before.
And find somewhere else to put the topology data - probably just the
top level.
i.e. add the names as DEVICE_ATTRs in genhd.c, and write the 'show'
routine to dereference ->queue carefully, just like
disk_alignment_offset_show.
(oooo... just noticed that the 'alignment_offset' attribute can
 contain the string '-1'.... while I have been guilty of that sort of
 thing myself, I would much rather it said "misaligned".)
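
Roughly what I have in mind for one of those show routines, as an
untested sketch for genhd.c (queue_io_min() being the accessor added
by Martin's topology patches):

    static ssize_t disk_min_io_size_show(struct device *dev,
                                         struct device_attribute *attr,
                                         char *buf)
    {
            struct gendisk *disk = dev_to_disk(dev);

            /* every gendisk has a queue today; be defensive anyway */
            if (!disk->queue)
                    return sprintf(buf, "0\n");
            return sprintf(buf, "%u\n", queue_io_min(disk->queue));
    }

    static DEVICE_ATTR(minimum_io_size, S_IRUGO,
                       disk_min_io_size_show, NULL);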


As for how intrusive vs beneficial it would be to move all the generic
fields out of request_queue and allow md to not have a request queue,
that will have to be a discussion for another day.  I do hope to
eventually present you with a series of patches which does just that.
My aim would be to make sure each one was clearly beneficial.  And I
do have a grand vision involving this which is more than just tidying
up some small inelegances.  Only time will tell how it eventuates.


But for now, please please please can we revert the change which made
'queue' appear in md and dm devices, (and loop and ...) and put these
generic values somewhere ... generic?

Thanks.

NeilBrown



* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: Jens Axboe @ 2009-06-26 12:50 UTC
  To: Neil Brown
  Cc: Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, linux-scsi, linux-kernel, linux-raid,
	linux-ide, linux-fsdevel, device-mapper development

On Fri, Jun 26 2009, Neil Brown wrote:
> On Thursday June 25, jens.axboe@oracle.com wrote:
> > On Thu, Jun 25 2009, NeilBrown wrote:
> > > > You seem to be hung up on the fact that you don't queue things.  I think
> > > > that's beside the point.  You *do* have a request_queue thanks to
> > > > calling blk_queue_make_request() in md.c.  And there is more to
> > > > request_queue than the values you brought up.  Like the callback
> > > > functions.  I'm not saying that all the values in request_queue apply to
> > > > MD, but I really don't understand what all the fuss is about.  Other
> > > > than the presence of the string "queue" in the choice of naming.
> > > >
> > > 
> > > Well names are very important.  And as I said later we could possibly keep
> > > them in 'queue' and make 'queue' a more generic directory.  I don't like
> > > that but it is probably better than the current situation.
> > 
> > Sorry to ask the obvious question, but what would the point of all this
> > pain be? The existing values can't go anywhere else, so you'd have to
> > add symlinks back into queue/ anyway.
> 
> Why cannot the existing values go anywhere else?  I don't understand
> that assertion at all.

Because it's an exported interface, we can't just move things around at
will.

> > > As you say, I do currently have a request_queue, but that is an internal
> > > detail, not part of the externally visible interface, and it is something
> > > that is very much in my sights as something I want to change.  I'm
> > > still working out the details so I'm a fair way from a concrete proposal
> > > and a long way from some code.  That change certainly doesn't have
> > > to happen in any rush.  But we should get the externally visible
> > > names "right" if we can.
> > 
> > What crack are you smoking? :-)
> 
> I have a special mix of crack that helps me see Patterns everywhere,
> even in C code.  Some patterns are bright, shiny, and elegant.  Others
> are muddy and confused.  struct request_queue has a distinct shadow
> over it just now.
> 
> > 
> > A block device must have a request_queue, that's pretty much spread
> > throughout the kernel. The fact that md/dm is only using a subset of the
> > functionality is not a very good reason for re-writing large parts of
> > that design. We could save some space, but whether the queue is 200
> > bytes or 1400 bytes doesn't really make a whole lot of real-world
> > difference. It's not like we allocate/deallocate these all the time,
> > they are mostly static structures.
> 
> It isn't about saving space.  It is about maintainability.  To be
> maintainable, the code must be easy to understand.  For that, it must
> be well structured.
> 
> Every block device has a 'gendisk' (aka generic disk).
> Every block device also (currently) has a request_queue.

I don't know why you keep saying currently. It has always had a queue,
and I don't see a good reason why that should change for "special" block
devices like md/dm/loop/whatnot.

> If I have generic data that is applicable to all disks, where should I
> put it?  One would think "gendisk".
> If I have data that is related to request handling (as in uses of 
> 'struct request'), where should that go?  in request_queue I suspect.
> But there are several fields in request_queue that are not related to
> requests, and are generic to all disks.

It is indeed a bit of a toss-up there, since we do have a queue
associated with each gendisk.

> I think "generic data goes in gendisk" is something that it would be
> easy for people to understand.
> The current 'topology' values are intended to be generic to all disks
> so they should ideally go in gendisk.  I'm not pushing for that now.
> Longer term, that would be my aim, but for now I'm just focussing on
> restoring the 'queue' subdirectory to its previous non-generic state.
> 
> i.e. revert the change that made 'queue' appear for md and dm and loop
> and nbd and ...., none of which have ever needed it before.
> And find somewhere else to put the topology data - probably just the
> top level.
> i.e. add the names as DEVICE_ATTRs in genhd.c, and write the 'show'
> routine to dereference ->queue carefully, just like
> disk_alignment_offset_show.
> (oooo... just noticed that the 'alignment_offset' attribute can
>  contain the string '-1'.... while I have been guilty of that sort of
>  thing myself, I would much rather it said "misaligned".)
> 
> 
> As for how intrusive vs beneficial it would be to move all the generic
> fields out of request_queue and allow md to not have a request queue,
> that will have to be a discussion for another day.  I do hope to
> eventually present you with a series of patches which does just that.
> My aim would be to make sure each one was clearly beneficial.  And I
> do have a grand vision involving this which is more than just tidying
> up some small inelegances.  Only time will tell how it eventuates.
> 
> 
> But for now, please please please can we revert the change which made
> 'queue' appear in md and dm devices, (and loop and ...) and put these
> generic values somewhere ... generic?

No we cannot, not without a time machine. 2.6.30 is released, so it's
too late to revert things like that, even if we wanted.

-- 
Jens Axboe



* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: NeilBrown @ 2009-06-26 13:16 UTC
  To: Jens Axboe
  Cc: Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, linux-scsi, linux-kernel, linux-raid,
	linux-ide, linux-fsdevel, device-mapper development

On Fri, June 26, 2009 10:50 pm, Jens Axboe wrote:
> On Fri, Jun 26 2009, Neil Brown wrote:

>> But for now, please please please can we revert the change which made
>> 'queue' appear in md and dm devices, (and loop and ...) and put these
>> generic values somewhere ... generic?
>
> No we cannot, not without a time machine. 2.6.30 is released, so it's
> too late to revert things like that, even if we wanted.

Drat... for some reason I was thinking that it only came in in 2.6.31-rc.
However it is only documented in ABI/testing, not ABI/stable ... if there
is any value in that distinction, then it should be possible to move it??

Well, maybe I'm too late.
If I send you a patch to remove all those fields in 'queue' that are not
relevant to md/dm (when the queue is used for an md/dm device) would that
be OK ??

NeilBrown




* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: NeilBrown @ 2009-06-26 13:23 UTC
  To: Jens Axboe
  Cc: Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, linux-scsi, linux-kernel, linux-raid,
	linux-ide, linux-fsdevel, device-mapper development

On Fri, June 26, 2009 10:50 pm, Jens Axboe wrote:
> On Fri, Jun 26 2009, Neil Brown wrote:
>> Every block device has a 'gendisk' (aka generic disk).
>> Every block device also (currently) has a request_queue.
>
> I don't know why you keep saying currently. It has always had a queue,
> and I don't see a good reason why that should change for "special" block
> devices like md/dm/loop/whatnot.

I say "currently" because I'm planning to create patches to make it
optional, and I want to get you used to the idea :-)

And md/dm/loop/whatnot are not "special" block devices, any more than
scsi/ide are "special" block devices.  They are all just block devices.
They use different mechanisms, but each is just as valid as the other.

The current code makes 'struct request' based block devices into
first-class citizens, and all the rest are second class (having to
tag our data structures on ->queuedata and/or ->private_data).

I just want all block devices to be equal.

NeilBrown



* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
From: Jens Axboe @ 2009-06-26 13:27 UTC
  To: NeilBrown
  Cc: Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, linux-scsi, linux-kernel, linux-raid,
	linux-ide, linux-fsdevel, device-mapper development

On Fri, Jun 26 2009, NeilBrown wrote:
> On Fri, June 26, 2009 10:50 pm, Jens Axboe wrote:
> > On Fri, Jun 26 2009, Neil Brown wrote:
> 
> >> But for now, please please please can we revert the change which made
> >> 'queue' appear in md and dm devices, (and loop and ...) and put these
> >> generic values somewhere ... generic?
> >
> > No we cannot, not without a time machine. 2.6.30 is released, so it's
> > too late to revert things like that, even if we wanted.
> 
> Drat... for some reason I was thinking that it only came in in 2.6.31-rc.
> However it is only documented in ABI/testing, not ABI/stable ... if there
> is any value in that distinction, then it should be possible to move it??

The documentation doesn't really matter. The files are there and
released, so they cannot go away.

> Well, maybe I'm too late.
> If I send you a patch to remove all those fields in 'queue' that are not
> relevant to md/dm (when the queue is used for an md/dm device) would that
> be OK ??

What fields would that be? It's true that things like e.g.
max_hw_sectors_kb aren't strictly correct, but they still make some
sense on pseudo-devices.

-- 
Jens Axboe



* Re: REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-06-26 13:23           ` [dm-devel] " NeilBrown
@ 2009-06-26 13:29             ` Jens Axboe
  2009-06-27 12:32               ` Neil Brown
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2009-06-26 13:29 UTC (permalink / raw)
  To: NeilBrown
  Cc: Martin K. Petersen, Mike Snitzer, linux-kernel, linux-raid,
	linux-ide, device-mapper development, linux-scsi, linux-fsdevel,
	Linus Torvalds, Alasdair G Kergon

On Fri, Jun 26 2009, NeilBrown wrote:
> On Fri, June 26, 2009 10:50 pm, Jens Axboe wrote:
> > On Fri, Jun 26 2009, Neil Brown wrote:
> >> Every block device has a 'gendisk' (aka generic disk).
> >> Every block device also (currently) has a request_queue.
> >
> > I don't know why you keep saying currently. It has always had a queue,
> > and I don't see a good reason why that should change for "special" block
> > devices like md/dm/loop/whatnot.
> 
> I say "currently" because I'm planning to create patches to make it
> optional, and I want to get you used to the idea :-)
> 
> And md/dm/loop/whatnot are not "special" block devices, any more than
> scsi/ide are "special" block devices.  They are all just block devices.
> They use different mechanisms, but each is just as valid as the other.

That was why I put it in quotes, they are just regular devices.

> The current code makes 'struct request' based block devices into
> first-class citizens, and all the rest are second class (having to
> tag our data structures onto ->queuedata and/or ->private_data).

?? How is that different from devices, the ones you refer to as
first-class citizens? They too store device private data in e.g.
->queuedata.

> I just want all block devices to be equal.

There's no such thing as first or second class block devices. The fact
that drivers using ->make_request_fn directly do not utilize the full
scope of the queue isn't a very interesting fact, imho.

-- 
Jens Axboe


* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-06-26 13:16           ` NeilBrown
  2009-06-26 13:27             ` Jens Axboe
@ 2009-06-26 13:41             ` NeilBrown
  2009-06-26 13:49               ` Jens Axboe
  1 sibling, 1 reply; 33+ messages in thread
From: NeilBrown @ 2009-06-26 13:41 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, linux-scsi, linux-kernel, linux-raid,
	linux-ide, linux-fsdevel, device-mapper development

On Fri, June 26, 2009 11:16 pm, NeilBrown wrote:
> On Fri, June 26, 2009 10:50 pm, Jens Axboe wrote:
>> On Fri, Jun 26 2009, Neil Brown wrote:
>
>>> But for now, please please please can we revert the change which made
>>> 'queue' appear in md and dm devices, (and loop and ...) and put these
>>> generic values somewhere ... generic?
>>
>> No we cannot, not without a time machine. 2.6.30 is released, so it's
>> too late to revert things like that, even if we wanted.
>
> Drat... for some reason I was thinking that it only came in in 2.6.31-rc.
> However it is only documented in ABI/testing, not ABI/stable ... if there
> is any value in that distinction, then it should be possible to move it??
>

No, wait.  I was right.  The addition of 'queue' to md/dm devices
happened post-2.6.30.  So were the additions of physical_block_size etc.
So there is still time to put these in a properly generic place.

Do you have a good reason for them going in /queue?

NeilBrown




* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-06-26 13:41             ` NeilBrown
@ 2009-06-26 13:49               ` Jens Axboe
  2009-06-27 12:50                 ` Neil Brown
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2009-06-26 13:49 UTC (permalink / raw)
  To: NeilBrown
  Cc: Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, linux-scsi, linux-kernel, linux-raid,
	linux-ide, linux-fsdevel, device-mapper development

On Fri, Jun 26 2009, NeilBrown wrote:
> On Fri, June 26, 2009 11:16 pm, NeilBrown wrote:
> > On Fri, June 26, 2009 10:50 pm, Jens Axboe wrote:
> >> On Fri, Jun 26 2009, Neil Brown wrote:
> >
> >>> But for now, please please please can we revert the change which made
> >>> 'queue' appear in md and dm devices, (and loop and ...) and put these
> >>> generic values somewhere ... generic?
> >>
> >> No we cannot, not without a time machine. 2.6.30 is released, so it's
> >> too late to revert things like that, even if we wanted.
> >
> > Drat... for some reason I was thinking that it only came in in 2.6.31-rc.
> > However it is only documented in ABI/testing, not ABI/stable ... if there
> > is any value in that distinction, then it should be possible to move it??
> >
> 
> No, wait.  I was right.  The addition of 'queue' to md/dm devices
> happened post-2.6.30.  So were the additions of physical_block_size etc.
> So there is still time to put these in a properly generic place.

Yes you are right, the commit date had me fooled.

> Do you have a good reason for them going in /queue?

Do you have a good reason for moving them? In my opinion that makes the
separation between ->make_request_fn and ->request_fn devices bigger.

-- 
Jens Axboe



* Re: REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-06-26 11:58       ` [dm-devel] " Neil Brown
@ 2009-06-26 14:48         ` Martin K. Petersen
  2009-07-07  1:47           ` [dm-devel] " Neil Brown
  0 siblings, 1 reply; 33+ messages in thread
From: Martin K. Petersen @ 2009-06-26 14:48 UTC (permalink / raw)
  To: Neil Brown
  Cc: Mike Snitzer, linux-scsi, jens.axboe, linux-kernel, linux-raid,
	linux-ide, device-mapper development, Martin K. Petersen,
	linux-fsdevel, Linus Torvalds, Alasdair G Kergon

>>>>> "Neil" == Neil Brown <neilb@suse.de> writes:

Neil> Providing the fields are clearly and unambiguously documented so
Neil> that I can use the documentation to verify the implementation
Neil> (in md at least), I will be satisfied.

The current sysfs documentation says:

/sys/block/<disk>/queue/minimum_io_size:
[...] For RAID arrays it is often the stripe chunk size.

/sys/block/<disk>/queue/optimal_io_size:
[...] For RAID devices it is usually the stripe width or the internal
block size.

The latter should be "internal track size".  But in the context of MD I
think those two definitions are crystal clear.


As far as making the application of these values more obvious I propose
the following:

What:		/sys/block/<disk>/queue/minimum_io_size
Date:		April 2009
Contact:	Martin K. Petersen <martin.petersen@oracle.com>
Description:
		Storage devices may report a granularity or minimum I/O
		size which is the device's preferred unit of I/O.
		Requests smaller than this may incur a significant
		performance penalty.

		For disk drives this value corresponds to the physical
		block size. For RAID devices it is usually the stripe
		chunk size.

		A properly aligned multiple of minimum_io_size is the
		preferred request size for workloads where a high number
		of I/O operations is desired.


What:		/sys/block/<disk>/queue/optimal_io_size
Date:		April 2009
Contact:	Martin K. Petersen <martin.petersen@oracle.com>
Description:
		Storage devices may report an optimal transfer length or
		streaming I/O size which is the device's preferred unit
		of sustained I/O.  This value is a multiple of the
		device's minimum_io_size.

		optimal_io_size is rarely reported for disk drives. For
		RAID devices it is usually the stripe width or the
		internal track size.

		A properly aligned multiple of optimal_io_size is the
		preferred request size for workloads where sustained
		throughput is desired.
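
To make the RAID case concrete, a striped device would feed the queue
roughly like this (a sketch against the new queue-limits helpers; the
chunk size and disk count are illustrative only):

	/* sketch: an n_disks-way stripe with 64k chunks */
	unsigned int chunk = 65536;

	blk_queue_io_min(q, chunk);		/* minimum_io_size */
	blk_queue_io_opt(q, chunk * n_disks);	/* optimal_io_size */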

After contemplating for a bit I think I prefer to keep them I/O
direction agnostic.  Granted, the potential penalties mostly apply to
writes.  But I think the application of the values applies to reads as
well.  They will in a hw RAID context for sure.


Neil> I'm looking forward to seeing how you justify the name
Neil> "physical_block_size" in a way the encompasses possibilities like
Neil> a device that stripes over a heterogeneous set of disk drives ;-)

I explained that in my mails yesterday.  But that is of no concern to
MD.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-06-26 13:29             ` Jens Axboe
@ 2009-06-27 12:32               ` Neil Brown
  2009-06-29 10:18                 ` [dm-devel] " Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Neil Brown @ 2009-06-27 12:32 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, Mike Snitzer, linux-kernel, linux-raid,
	linux-ide, device-mapper development, linux-scsi, linux-fsdevel,
	Linus Torvalds, Alasdair G Kergon

On Friday June 26, jens.axboe@oracle.com wrote:
> On Fri, Jun 26 2009, NeilBrown wrote:
> > On Fri, June 26, 2009 10:50 pm, Jens Axboe wrote:
> > > On Fri, Jun 26 2009, Neil Brown wrote:
> > >> Every block device has a 'gendisk' (aka generic disk).
> > >> Every block device also (currently) has a request_queue.
> > >
> > > I don't know why you keep saying currently. It has always had a queue,
> > > and I don't see a good reason why that should change for "special" block
> > > devices like md/dm/loop/whatnot.
> > 
> > I say "currently" because I'm planning to create patches to make it
> > optional, and I want to get you used to the idea :-)
> > 
> > And md/dm/loop/whatnot are not "special" block devices, any more than
> > scsi/ide are "special" block devices.  They are all just block devices.
> > They use different mechanisms, but each is just as valid as the other.
> 
> That was why I put it in quotes, they are just regular devices.

Ahh... I missed your point.  You were saying that they aren't special,
they are just like everything else and so should still have a queue.
Sorry for misunderstanding.
See below for my response.

> 
> > The current code makes 'struct request' based block devices into
> > first-class citizens, and all the rest are second class (having to
> > tag our data structures onto ->queuedata and/or ->private_data).
> 
> ?? How is that different from devices, the ones you refer to as
> first-class citizens? They too store device private data in e.g.
> ->queuedata.
> 
> > I just want all block devices to be equal.
> 
> There's no such thing as first or second class block devices. The fact
> that drivers using ->make_request_fn directly do not utilize the full
> scope of the queue isn't a very interesting fact, imho.

 Your phrase "drivers using ->make_request_fn directly" seems to
 suggest you are looking at things very differently to me.

 From my perspective, all drivers use ->make_request_fn equally.
 Some set it to "__make_request", some to "md_make_request", others to
 "dm_request" or "loop_make_request" etc.

 Each of these different drivers need some private storage.
 __make_request uses struct request_queue
 md_make_request uses struct mddev_s
 dm_request uses struct mapped_device
 loop_make_request uses struct loop_device
  etc
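
 To illustrate the wiring I mean, a from-memory sketch of the
 2.6.30-era code (not exact lines from any of these drivers):

	/* each driver picks its own entry point at init time */
	blk_queue_make_request(mddev->queue, md_make_request);   /* md   */
	blk_queue_make_request(md->queue, dm_request);           /* dm   */
	blk_queue_make_request(lo->lo_queue, loop_make_request); /* loop */

	/* request-based drivers (scsi, ide, ...) instead call
	 * blk_init_queue(rfn, lock), which sets ->make_request_fn
	 * to __make_request for them */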

 These structures are all attached to gendisk one way or another.

 Of these examples, the first three have an extra level.  They are
 intermediaries or "midlayers" for multiple drivers and perform some
 processing before passing the request down.
 __make_request provides this for ide and scsi (etc) via ->request_fn and
    ->queuedata in struct request_queue (and other fields).
 md_make_request provides this for raid1 and raid5 (etc) via
    ->pers->make_request and ->private in struct mddev_s (and other
    fields).
 dm_request provides this for crypt and multipath (etc) via
    ->map->targets[]->type->map and ->map->targets[]->private (and
    other fields).

 Looked at from this perspective, the fact that some drivers 'do not
 utilise the full scope of the queue' certainly isn't the interesting
 point.  The interesting point is that they have to use parts of the
 queue at all.

 And from this perspective, __make_request is a class above everything
 else.  __make_request gets a dedicated field in gendisk (->queue) and
 every driver has to provide a queue.  Other (lower class) drivers get
 to share gendisk->private_data and/or gendisk->queue->queuedata.
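
 (The fields I mean, abbreviated from memory of 2.6.30's
 include/linux/genhd.h:

	struct gendisk {
		/* ... */
		struct request_queue *queue;	/* dedicated to the queue layer */
		void *private_data;		/* shared by everyone else */
		/* ... */
	};

 )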


NeilBrown


* Re: REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-06-26 13:49               ` Jens Axboe
@ 2009-06-27 12:50                 ` Neil Brown
  0 siblings, 0 replies; 33+ messages in thread
From: Neil Brown @ 2009-06-27 12:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, Mike Snitzer, linux-kernel, linux-raid,
	linux-ide, device-mapper development, linux-scsi, linux-fsdevel,
	Linus Torvalds, Alasdair G Kergon

On Friday June 26, jens.axboe@oracle.com wrote:
> On Fri, Jun 26 2009, NeilBrown wrote:
> > Do you have a good reason for them going in /queue?
> 
> Do you have a good reason for moving them? In my opinion that makes the
> separation between ->make_request_fn and ->request_fn devices bigger.

Yes, I have a good reason.

Put succinctly:  I have a reason for reverting the change to make
/queue visible in md/dm devices:  It contains mostly fields that are
irrelevant to those devices.
Given that, I have a reason for moving the new fields:  So that they
can be visible for md/dm/loop/etc devices.

More verbosely:
Currently (i.e. 2.6.30 and earlier) every block device has a device
directory in /sys/devices.  e.g. "sda".  It contains attributes that
are generally applicable to any block device (size, ro etc).

Some devices, which use the __make_request driver/layer, have a
'queue' directory with attributes that are specific to the
implementation of __make_request and related code.
Other devices, which make use of the md_make_request driver/layer,
have an 'md' directory with attributes that are specific to that
implementation.
Other devices have no such directory (relevant values are managed
exclusively via ioctls).

We are adding a number of attributes that are generally applicable to
all block devices.
Given the above description of the current state, it seems natural for
the new attributes to go in the device directory -
e.g. sda/new-attribute.
This is exactly what has been done for 'alignment_offset'.
The same should be done for physical_block_size, minimum_io_size, and
optimal_io_size.
Introducing the 'queue' directory - which mostly contains fields
completely irrelevant to non-__make_request drivers - just to store
some values that can easily go elsewhere doesn't seem to make sense to
me.
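
Concretely, the layout I am proposing would look like this
(illustrative, not actual output):

   /sys/block/md0/size
   /sys/block/md0/ro
   /sys/block/md0/alignment_offset
   /sys/block/md0/physical_block_size
   /sys/block/md0/minimum_io_size
   /sys/block/md0/optimal_io_size
   /sys/block/md0/md/            (driver-specific, as today)

with no 'queue' directory at all for md/dm/loop.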

As for increasing the "separation between ->make_request_fn and
->request_fn devices", I don't think that is a useful way to look at
the devices, as I detail in my other email.

Thanks,
NeilBrown


* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-06-27 12:32               ` Neil Brown
@ 2009-06-29 10:18                 ` Jens Axboe
  2009-06-29 10:52                   ` NeilBrown
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2009-06-29 10:18 UTC (permalink / raw)
  To: Neil Brown
  Cc: Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, linux-scsi, linux-kernel, linux-raid,
	linux-ide, linux-fsdevel, device-mapper development

On Sat, Jun 27 2009, Neil Brown wrote:
> > There's no such thing as first or second class block devices. The fact
> > that drivers using ->make_request_fn directly do not utilize the full
> > scope of the queue isn't a very interesting fact, imho.
> 
>  Your phrase "drivers using ->make_request_fn directly" seems to
>  suggest you are looking at things very differently to me.
> 
>  From my perspective, all drivers use ->make_request_fn equally.
>  Some set it to "__make_request", some to "md_make_request", others to
>  "dm_request" or "loop_make_request" etc.

Neil, will you please stop these silly games. Stop trying to invent
differences based on interpretations of what you read into my replies.

>  Each of these different drivers need some private storage.
>  __make_request uses struct request_queue
>  md_make_request uses struct mddev_s
>  dm_request uses struct mapped_device
>  loop_make_request uses struct loop_device
>   etc
> 
>  These structures are all attached to gendisk one way or another.
> 
>  Of these examples, the first three have an extra level.  They are
>  intermediaries or "midlayers" for multiple drivers and perform some
>  processing before passing the request down.
>  __make_request provides this for ide and scsi (etc) via ->request_fn and
>     ->queuedata in struct request_queue (and other fields).
>  md_make_request provides this for raid1 and raid5 (etc) via
>     ->pers->make_request and ->private in struct mddev_s (and other
>     fields).
>  dm_request provides this for crypt and multipath (etc) via
>     ->map->targets[]->type->map and ->map->targets[]->private (and
>     other fields).

Nothing - I repeat nothing - stops md/dm from removing that layer. It's
a layer they imposed themselves based on the design they chose to
implement internally. It has NOTHING to do with how the block layer is
designed. If md raid1 assigned raid1_dev (or whatever raid1 uses as its
device identifier structure) to ->queuedata, and had an mddev_s in its
raid1 structure, that would be a perfectly viable design as well.

Loop does that. md/dm have their own internal layering; if anything is a
"midlayer" (to keep to the apparent theme of design patterns), it's the
md and dm bits of the code.

>  Looked at from this perspective, the fact that some drivers 'do not
>  utilise the full scope of the queue' certainly isn't the interesting
>  point.  The interesting point is that they have to use parts of the
>  queue at all.
> 
>  And from this perspective, __make_request is a class above everything
>  else.  __make_request gets a dedicated field in gendisk (->queue) and
>  every driver has to provide a queue.  Other (lower class) drivers get
>  to share gendisk->private_data and/or gendisk->queue->queuedata.

That's just utter nonsense.

-- 
Jens Axboe



* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-06-29 10:18                 ` [dm-devel] " Jens Axboe
@ 2009-06-29 10:52                   ` NeilBrown
  2009-06-29 11:41                     ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2009-06-29 10:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, linux-scsi, linux-kernel, linux-raid,
	linux-ide, linux-fsdevel, device-mapper development

On Mon, June 29, 2009 8:18 pm, Jens Axboe wrote:
> On Sat, Jun 27 2009, Neil Brown wrote:
>> > There's no such thing as first or second class block devices. The fact
>> > that drivers using ->make_request_fn directly do not utilize the full
>> > scope of the queue isn't a very interesting fact, imho.
>>
>>  Your phrase "drivers using ->make_request_fn directly" seems to
>>  suggest you are looking at things very differently to me.
>>
>>  From my perspective, all drivers use ->make_request_fn equally.
>>  Some set it to "__make_request", some to "md_make_request", others to
>>  "dm_request" or "loop_make_request" etc.
>
> Neil, will you please stop these silly games. Stop trying to invent
> differences based on interpretations of what you read into my replies.


We do seem to be having trouble communicating don't we :-(
Be assured that it is not my intention to play games, silly or otherwise.

Maybe I should just try sending you patches.  Maybe that will
make my meaning clearer.

For the moment, I'm much more interested in the other question,
that of whether I can avoid having a 'queue' directory introduced into
md/dm/etc device directories in sysfs.

NeilBrown



* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-06-29 10:52                   ` NeilBrown
@ 2009-06-29 11:41                     ` Jens Axboe
  2009-06-29 12:45                       ` Boaz Harrosh
  2009-06-29 23:09                       ` Andreas Dilger
  0 siblings, 2 replies; 33+ messages in thread
From: Jens Axboe @ 2009-06-29 11:41 UTC (permalink / raw)
  To: NeilBrown
  Cc: Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, linux-scsi, linux-kernel, linux-raid,
	linux-ide, linux-fsdevel, device-mapper development

On Mon, Jun 29 2009, NeilBrown wrote:
> On Mon, June 29, 2009 8:18 pm, Jens Axboe wrote:
> > On Sat, Jun 27 2009, Neil Brown wrote:
> >> > There's no such thing as first or second class block devices. The fact
> >> > that drivers using ->make_request_fn directly do not utilize the full
> >> > scope of the queue isn't a very interesting fact, imho.
> >>
> >>  Your phrase "drivers using ->make_request_fn directly" seems to
> >>  suggest you are looking at things very differently to me.
> >>
> >>  From my perspective, all drivers use ->make_request_fn equally.
> >>  Some set it to "__make_request", some to "md_make_request", others to
> >>  "dm_request" or "loop_make_request" etc.
> >
> > Neil, will you please stop these silly games. Stop trying to invent
> > differences based on interpretations of what you read into my replies.
> 
> 
> We do seem to be having trouble communicating don't we :-(
> Be assured that it is not my intention to play games, silly or otherwise.
> 
> Maybe I should just try sending you patches.  Maybe that will
> make my meaning clearer.
> 
> For the moment, I'm much more interested in the other question,
> that of whether I can avoid having a 'queue' directory introduced into
> md/dm/etc device directories in sysfs.

We already talked about this, several times. My answer is that it seems
pointless to begin with internally, and externally it just makes the API
worse since tools then have to know which device type they are talking
to.

So I still see absolutely zero point in making such a change, quite the
opposite.

-- 
Jens Axboe



* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-06-29 11:41                     ` Jens Axboe
@ 2009-06-29 12:45                       ` Boaz Harrosh
  2009-06-29 12:52                         ` Jens Axboe
  2009-06-29 23:09                       ` Andreas Dilger
  1 sibling, 1 reply; 33+ messages in thread
From: Boaz Harrosh @ 2009-06-29 12:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: NeilBrown, Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, linux-scsi, linux-kernel, linux-raid,
	linux-ide, linux-fsdevel, device-mapper development

On 06/29/2009 02:41 PM, Jens Axboe wrote:
> On Mon, Jun 29 2009, NeilBrown wrote:
>> On Mon, June 29, 2009 8:18 pm, Jens Axboe wrote:
>>> On Sat, Jun 27 2009, Neil Brown wrote:
>>>>> There's no such thing as first or second class block devices. The fact
>>>>> that drivers using ->make_request_fn directly do not utilize the full
>>>>> scope of the queue isn't a very interesting fact, imho.
>>>>  Your phrase "drivers using ->make_request_fn directly" seems to
>>>>  suggest you are looking at things very differently to me.
>>>>
>>>>  From my perspective, all drivers use ->make_request_fn equally.
>>>>  Some set it to "__make_request", some to "md_make_request", others to
>>>>  "dm_request" or "loop_make_request" etc.
>>> Neil, will you please stop these silly games. Stop trying to invent
>>> differences based on interpretations of what you read into my replies.
>>
>> We do seem to be having trouble communicating don't we :-(
>> Be assured that it is not my intention to play games, silly or otherwise.
>>
>> Maybe I should just try sending you patches.  Maybe that will
>> make my meaning clearer.
>>
>> For the moment, I'm much more interested in the other question,
>> that of whether I can avoid having a 'queue' directory introduced into
>> md/dm/etc device directories in sysfs.
> 
> We already talked about this, several times. My answer is that it seems
> pointless to begin with internally, and externally it just makes the API
> worse since tools then have to know which device type they are talking
> to.
> 

I do however see a problem with sysfs files that work for most devices
but silently do nothing for some "device type". At least make these
directories/files read-only for the un-used cases (eg. dm/md), and return
proper values to indicate their un-usefulness, like "-1" or "NA".
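
Something along these lines in blk-sysfs.c, say (a sketch only - the
"does this hint mean anything here" predicate is hypothetical):

	static ssize_t queue_io_min_show(struct request_queue *q, char *page)
	{
		if (!queue_io_min_valid(q))	/* hypothetical predicate */
			return sprintf(page, "NA\n");

		return queue_var_show(queue_io_min(q), page);
	}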

> So I still see absolutely zero point in making such a change, quite the
> opposite.
> 

Just my $0.017
Boaz


* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-06-29 12:45                       ` Boaz Harrosh
@ 2009-06-29 12:52                         ` Jens Axboe
  0 siblings, 0 replies; 33+ messages in thread
From: Jens Axboe @ 2009-06-29 12:52 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: NeilBrown, Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, linux-scsi, linux-kernel, linux-raid,
	linux-ide, linux-fsdevel, device-mapper development

On Mon, Jun 29 2009, Boaz Harrosh wrote:
> On 06/29/2009 02:41 PM, Jens Axboe wrote:
> > On Mon, Jun 29 2009, NeilBrown wrote:
> >> On Mon, June 29, 2009 8:18 pm, Jens Axboe wrote:
> >>> On Sat, Jun 27 2009, Neil Brown wrote:
> >>>>> There's no such thing as first or second class block devices. The fact
> >>>>> that drivers using ->make_request_fn directly do not utilize the full
> >>>>> scope of the queue isn't a very interesting fact, imho.
> >>>>  Your phrase "drivers using ->make_request_fn directly" seems to
> >>>>  suggest you are looking at things very differently to me.
> >>>>
> >>>>  From my perspective, all drivers use ->make_request_fn equally.
> >>>>  Some set it to "__make_request", some to "md_make_request", others to
> >>>>  "dm_request" or "loop_make_request" etc.
> >>> Neil, will you please stop these silly games. Stop trying to invent
> >>> differences based on interpretations of what you read into my replies.
> >>
> >> We do seem to be having trouble communicating don't we :-(
> >> Be assured that it is not my intention to play games, silly or otherwise.
> >>
> >> Maybe I should just try sending you patches.  Maybe that will
> >> make my meaning clearer.
> >>
> >> For the moment, I'm much more interested in the other question,
> >> that of whether I can avoid having a 'queue' directory introduced into
> >> md/dm/etc device directories in sysfs.
> > 
> > We already talked about this, several times. My answer is that it seems
> > pointless to begin with internally, and externally it just makes the API
> > worse since tools then have to know which device type they are talking
> > to.
> > 
> 
> I do however see a problem with sysfs files that work for most
> devices but silently do nothing for some "device type". At least
> make these directories/files read-only for the un-used cases (eg.
> dm/md), and return proper values to indicate their un-usefulness,
> like "-1" or "NA".

I agree, but most of the files do have a meaning for the device types.
It's all about making the interface sane; moving things around is not a
sane interface. Making sure that the files propagate their proper
meaning is, so if some file does not make sense for device type X, then
by all means we should make sure that is apparent.

-- 
Jens Axboe



* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-06-29 11:41                     ` Jens Axboe
  2009-06-29 12:45                       ` Boaz Harrosh
@ 2009-06-29 23:09                       ` Andreas Dilger
  2009-07-01  0:29                         ` Neil Brown
  1 sibling, 1 reply; 33+ messages in thread
From: Andreas Dilger @ 2009-06-29 23:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: NeilBrown, Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, linux-scsi, linux-kernel, linux-raid,
	linux-ide, linux-fsdevel, device-mapper development

On Jun 29, 2009  13:41 +0200, Jens Axboe wrote:
> ... externally it just makes the API worse since tools then have to know
> which device type they are talking to.
> 
> So I still see absolutely zero point in making such a change, quite the
> opposite.

Exactly correct.  Changing these tunables just for the sake of giving
them a slightly different name is madness.  Making all block devices
appear more uniform to userspace (even if they don't strictly need all
of the semantics) is very sensible.  The whole point of the kernel is
to abstract away the underlying details so that userspace doesn't need
to understand it all again.

In order to get good throughput on RAID arrays we need to tune the
queue/max_* values to ensure the IO requests don't get split.

It would be great if the MD queue/max_* values would pass these tunings
down to the underlying disk devices as well.  As it stands now, we have
to follow the /sys/block/*/slaves tree to set all of these ourselves,
and before "slaves/" was introduced it was nigh impossible to automatically
tune these values.
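
What we do today is essentially this (a minimal userspace sketch; md0
and the 1024 value are illustrative only):

	#include <glob.h>
	#include <limits.h>
	#include <stdio.h>

	/* push a max_sectors_kb setting down to every member of md0 */
	int main(void)
	{
		glob_t g;
		char path[PATH_MAX];

		if (glob("/sys/block/md0/slaves/*", 0, NULL, &g) != 0)
			return 1;
		for (size_t i = 0; i < g.gl_pathc; i++) {
			snprintf(path, sizeof(path),
				 "%s/queue/max_sectors_kb", g.gl_pathv[i]);
			FILE *f = fopen(path, "w");
			if (f) {
				fprintf(f, "1024\n");
				fclose(f);
			}
		}
		globfree(&g);
		return 0;
	}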

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-06-29 23:09                       ` Andreas Dilger
@ 2009-07-01  0:29                         ` Neil Brown
  0 siblings, 0 replies; 33+ messages in thread
From: Neil Brown @ 2009-07-01  0:29 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jens Axboe, Martin K. Petersen, Mike Snitzer, Linus Torvalds,
	Alasdair G Kergon, linux-scsi, linux-kernel, linux-raid,
	linux-ide, linux-fsdevel, device-mapper development

On Tuesday June 30, adilger@sun.com wrote:
> On Jun 29, 2009  13:41 +0200, Jens Axboe wrote:
> > ... externally it just makes the API worse since tools then have to know
> > which device type they are talking to.
> > 
> > So I still see absolutely zero point in making such a change, quite the
> > opposite.
> 
> Exactly correct.  Changing these tunables just for the sake of giving
> them a slightly different name is madness.  Making all block devices
> appear more uniform to userspace (even if they don't strictly need all
> of the semantics) is very sensible.  The whole point of the kernel is
> to abstract away the underlying details so that userspace doesn't need
> to understand it all again.

Uniformity is certainly desirable.  But we shouldn't take it so far
as to make apples look like oranges.

We wouldn't want a SATA disk drive to have 'chunk_size' and 'raid_disks'.
Nor would we want a software RAID array to have 'scheduler' or
'iosched' attributes.

> 
> In order to get good throughput on RAID arrays we need to tune the
> queue/max_* values to ensure the IO requests don't get split.
> 
> It would be great if the MD queue/max_* values would pass these tunings
> down to the underlying disk devices as well.  As it stands now, we have
> to follow the /sys/block/*/slaves tree to set all of these ourselves,
> and before "slaves/" was introduced it was nigh impossible to automatically
> tune these values.

I don't think that passing these values down is - in general - a
well-defined problem.  This is (in part) because md/dm devices can be based on
partitions, and partitions don't have independent max_* values.

In your particular case, I don't expect that you use partitions, so it
makes perfect sense to do the tuning on a per-array basis.  But I
don't think that it is a concept that fits in the kernel.  As you say,
we have 'slaves/', which makes it practical to do this in user-space
and I would rather it stayed there.

NeilBrown


* Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-06-26 14:48         ` Martin K. Petersen
@ 2009-07-07  1:47           ` Neil Brown
  2009-07-07  5:29             ` Martin K. Petersen
  2009-07-07 22:06             ` Bill Davidsen
  0 siblings, 2 replies; 33+ messages in thread
From: Neil Brown @ 2009-07-07  1:47 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Mike Snitzer, Linus Torvalds, Alasdair G Kergon, jens.axboe,
	linux-scsi, linux-kernel, linux-raid, linux-ide, linux-fsdevel,
	device-mapper development

On Friday June 26, martin.petersen@oracle.com wrote:
> >>>>> "Neil" == Neil Brown <neilb@suse.de> writes:
> 
> Neil> Providing the fields are clearly and unambiguously documented so
> Neil> that I can use the documentation to verify the implementation
> Neil> (in md at least), I will be satisfied.
> 
> The current sysfs documentation says:
> 
> /sys/block/<disk>/queue/minimum_io_size:
> [...] For RAID arrays it is often the stripe chunk size.
> 
> /sys/block/<disk>/queue/optimal_io_size:
> [...] For RAID devices it is usually the stripe width or the internal
> block size.
> 
> The latter should be "internal track size".  But in the context of MD I
> think those two definitions are crystal clear.

They might be "clear" but I'm not convinced that they are "correct".

> 
> 
> As far as making the application of these values more obvious I propose
> the following:
> 
> What:		/sys/block/<disk>/queue/minimum_io_size
> Date:		April 2009
> Contact:	Martin K. Petersen <martin.petersen@oracle.com>
> Description:
> 		Storage devices may report a granularity or minimum I/O
> 		size which is the device's preferred unit of I/O.
> 		Requests smaller than this may incur a significant
> 		performance penalty.
> 
> 		For disk drives this value corresponds to the physical
> 		block size. For RAID devices it is usually the stripe
> 		chunk size.

These two paragraphs are contradictory.  There is no sense in which a
RAID chunk size is a preferred minimum I/O size.

To some degree it is actually a 'maximum' preferred size for random
IO.  If you do random IO in blocks larger than the chunk size then you
risk causing more 'head contention' (at least with RAID0 - with RAID5
the tradeoff is more complex).

If you are talking about "alignment", then yes - the chunk size is an
appropriate size to align on.  But so are the block size and the
stripe size and none is, in general, any better than any other.


Also, you say "may" report.  If a device does not report, what happens
to this file?  Is it not present, or empty, or does it contain a special
"undefined" value?
I think the answer is that "512" is reported.  It might be good to
explicitly document that.


> 
> 		A properly aligned multiple of minimum_io_size is the
> 		preferred request size for workloads where a high number
> 		of I/O operations is desired.
> 
> 
> What:		/sys/block/<disk>/queue/optimal_io_size
> Date:		April 2009
> Contact:	Martin K. Petersen <martin.petersen@oracle.com>
> Description:
> 		Storage devices may report an optimal transfer length or
> 		streaming I/O size which is the device's preferred unit
> 		of sustained I/O.  This value is a multiple of the
> 		device's minimum_io_size.
> 
> 		optimal_io_size is rarely reported for disk drives. For
> 		RAID devices it is usually the stripe width or the
> 		internal track size.
> 
> 		A properly aligned multiple of optimal_io_size is the
> 		preferred request size for workloads where sustained
> 		throughput is desired.

In this case, if a device does not report an optimal size, the file
contains "0" - correct?  Should that be explicit?

I'd really like to see an example of how you expect filesystems to use
this.
I can well imagine the VM or elevator using this to assemble IO
requests into properly aligned requests.  But I cannot imagine how
e.g. mkfs would use it.
Or am I misunderstanding and this is for programs that use O_DIRECT on
the block device so they can optimise their request stream?

Thanks,
NeilBrown


* Re: REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-07-07  1:47           ` [dm-devel] " Neil Brown
@ 2009-07-07  5:29             ` Martin K. Petersen
  2009-07-09  0:42               ` Neil Brown
  2009-07-07 22:06             ` Bill Davidsen
  1 sibling, 1 reply; 33+ messages in thread
From: Martin K. Petersen @ 2009-07-07  5:29 UTC (permalink / raw)
  To: Neil Brown
  Cc: Mike Snitzer, linux-scsi, jens.axboe, linux-kernel, linux-raid,
	linux-ide, device-mapper development, Martin K. Petersen,
	linux-fsdevel, Linus Torvalds, Alasdair G Kergon

>>>>> "Neil" == Neil Brown <neilb@suse.de> writes:

>> What: /sys/block/<disk>/queue/minimum_io_size Date: April 2009
>> Contact: Martin K. Petersen <martin.petersen@oracle.com> Description:
>> Storage devices may report a granularity or minimum I/O size which is
>> the device's preferred unit of I/O.  Requests smaller than this may
>> incur a significant performance penalty.
>> 
>> For disk drives this value corresponds to the physical block
>> size. For RAID devices it is usually the stripe chunk size.

Neil> These two paragraphs are contradictory.  There is no sense in
Neil> which a RAID chunk size is a preferred minimum I/O size.

Maybe not for MD.  This is not just about MD.

This is a hint that says "Please don't send me random I/Os smaller than
this.  And please align to a multiple of this value".

I agree that for MD devices the alignment portion of that is the
important one.  However, putting a lower boundary on the size *is* quite
important for 4KB disk drives.  There are also HW RAID devices that
choke on requests smaller than the chunk size.

I appreciate the difficulty in filling out these hints in a way that
makes sense for all the supported RAID levels in MD.  However, I really
don't consider the hints particularly interesting in the isolated
context of MD.  To me the hints are conduits for characteristics of the
physical storage.  The question you should be asking yourself is: "What
do I put in these fields to help the filesystem so that we get the most
out of the underlying, slow hardware?".

I think it is futile to keep spending time coming up with terminology
that encompasses all current and future software and hardware storage
devices with 100% accuracy.


Neil> To some degree it is actually a 'maximum' preferred size for
Neil> random IO.  If you do random IO in blocks larger than the chunk
Neil> size then you risk causing more 'head contention' (at least with
Neil> RAID0 - with RAID5 the tradeoff is more complex).

Please elaborate.


Neil> Also, you say "may" report.  If a device does not report, what
Neil> happens to this file?  Is it not present, or empty, or does it contain
Neil> a special "undefined" value?  I think the answer is that "512" is
Neil> reported.

The answer is physical_block_size.


Neil> In this case, if a device does not report an optimal size, the
Neil> file contains "0" - correct?  Should that be explicit?

Now documented.


Neil> I'd really like to see an example of how you expect filesystems to
Neil> use this.  I can well imagine the VM or elevator using this to
Neil> assemble IO requests in to properly aligned requests.  But I
Neil> cannot imagine how e.g mkfs would use it.  Or am I
Neil> misunderstanding and this is for programs that use O_DIRECT on the
Neil> block device so they can optimise their request stream?

The way it has been working so far (with the manual ioctl pokage) is
that mkfs will align metadata as well as data on a minimum_io_size
boundary.  And it will try to use the minimum_io_size as filesystem
block size.  On Linux that's currently limited by the fact that we can't
have blocks bigger than a page.  The filesystem can also report the
optimal I/O size in statfs.  For XFS the stripe width also affects how
the realtime/GRIO allocators work.
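
For instance, instead of the old per-filesystem ioctl poking, a mkfs
can now do something like this (sketch; BLKIOMIN/BLKIOOPT are the new
topology ioctls, the device path is illustrative):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	int main(void)
	{
		unsigned int io_min = 0, io_opt = 0;
		int fd = open("/dev/sda", O_RDONLY);

		if (fd < 0)
			return 1;
		/* granularity to align data and metadata to */
		ioctl(fd, BLKIOMIN, &io_min);
		/* stripe width, for allocator tuning */
		ioctl(fd, BLKIOOPT, &io_opt);
		printf("io_min=%u io_opt=%u\n", io_min, io_opt);
		return 0;
	}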

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-07-07  1:47           ` [dm-devel] " Neil Brown
  2009-07-07  5:29             ` Martin K. Petersen
@ 2009-07-07 22:06             ` Bill Davidsen
  1 sibling, 0 replies; 33+ messages in thread
From: Bill Davidsen @ 2009-07-07 22:06 UTC (permalink / raw)
  To: Neil Brown
  Cc: Mike Snitzer, linux-scsi, jens.axboe, linux-kernel, linux-raid,
	linux-ide, device-mapper development, Martin K. Petersen,
	linux-fsdevel, Linus Torvalds, Alasdair G Kergon

Neil Brown wrote:
> On Friday June 26, martin.petersen@oracle.com wrote:
>   
>> As far as making the application of these values more obvious I propose
>> the following:
>>
>> What:		/sys/block/<disk>/queue/minimum_io_size
>> Date:		April 2009
>> Contact:	Martin K. Petersen <martin.petersen@oracle.com>
>> Description:
>> 		Storage devices may report a granularity or minimum I/O
>> 		size which is the device's preferred unit of I/O.
>> 		Requests smaller than this may incur a significant
>> 		performance penalty.
>>
>> 		For disk drives this value corresponds to the physical
>> 		block size. For RAID devices it is usually the stripe
>> 		chunk size.
>>     
>
> These two paragraphs are contradictory.  There is no sense in which a
> RAID chunk size is a preferred minimum I/O size.
>
> To some degree it is actually a 'maximum' preferred size for random
> IO.  If you do random IO in blocks larger than the chunk size then you
> risk causing more 'head contention' (at least with RAID0 - with RAID5
> the tradeoff is more complex).
>
>   
Actually this is the allocation unit, and the array can be assumed to be a
series of sets of contiguous bytes of this size. Given LBA addressing,
array members which are not simple whole devices, etc, this doesn't
(can't) promise much for the physical layout. And any read which resides
entirely within a chunk would not have a performance penalty, although a
write might, if it were not a multiple of the sector size of the array
member(s) involved.

> If you are talking about "alignment", then yes - the chunk size is an
> appropriate size to align on.  But so are the block size and the
> stripe size and none is, in general, any better than any other.
>
>   
I would assume that a chunk, aligned on a chunk boundary, would be 
allocated in a contiguous series of bytes on the underlying array 
member. And that any i/o not aligned on a chunk boundary would be more 
likely to access multiple array members.

Feel free to clarify my assumptions.

> Also, you say "may" report.  If a device does not report, what happens
> to this file?  Is it not present, or empty, or does it contain a special
> "undefined" value?
> I think the answer is that "512" is reported.  It might be good to
> explicitly document that.
>   
> I'd really like to see an example of how you expect filesystems to use
> this.
> I can well imagine the VM or elevator using this to assemble IO
> requests into properly aligned requests.  But I cannot imagine how
> e.g. mkfs would use it.
> Or am I misunderstanding and this is for programs that use O_DIRECT on
> the block device so they can optimise their request stream?
>   

-- 
Bill Davidsen <davidsen@tmr.com>
  Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one error occurs during
wildcard (glob) expansion.


* Re: REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.
  2009-07-07  5:29             ` Martin K. Petersen
@ 2009-07-09  0:42               ` Neil Brown
  0 siblings, 0 replies; 33+ messages in thread
From: Neil Brown @ 2009-07-09  0:42 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Mike Snitzer, linux-scsi, linux-kernel, linux-raid, linux-ide,
	device-mapper development, jens.axboe, linux-fsdevel,
	Linus Torvalds, Alasdair G Kergon

On Tuesday July 7, martin.petersen@oracle.com wrote:
> >>>>> "Neil" == Neil Brown <neilb@suse.de> writes:
> 
> >> What: /sys/block/<disk>/queue/minimum_io_size Date: April 2009
> >> Contact: Martin K. Petersen <martin.petersen@oracle.com> Description:
> >> Storage devices may report a granularity or minimum I/O size which is
> >> the device's preferred unit of I/O.  Requests smaller than this may
> >> incur a significant performance penalty.
> >> 
> >> For disk drives this value corresponds to the physical block
> >> size. For RAID devices it is usually the stripe chunk size.
> 
> Neil> These two paragraphs are contradictory.  There is no sense in
> Neil> which a RAID chunk size is a preferred minimum I/O size.
> 
> Maybe not for MD.  This is not just about MD.

I wasn't just thinking about MD.  I was thinking about the generic
concept of RAID.

> 
> This is a hint that says "Please don't send me random I/Os smaller than
> this.  And please align to a multiple of this value".
> 
> I agree that for MD devices the alignment portion of that is the
> important one.  However, putting a lower boundary on the size *is* quite
> important for 4KB disk drives.  There are also HW RAID devices that
> choke on requests smaller than the chunk size.

I certainly see that the lower boundary is important for 4KB disk
drives, but we have that information encoded in physical_block_size,
so duplicating it here doesn't seem to be a high priority.
I'm surprised that a HW RAID device would choke on requests smaller
than the chunk size.  If that is the case, then I guess it could be
useful to have this number separate from physical_block_size....

> 
> I appreciate the difficulty in filling out these hints in a way that
> makes sense for all the supported RAID levels in MD.  However, I really
> don't consider the hints particularly interesting in the isolated
> context of MD.  To me the hints are conduits for characteristics of the
> physical storage.  The question you should be asking yourself is: "What
> do I put in these fields to help the filesystem so that we get the most
> out of the underlying, slow hardware?".

That is certainly a good approach, but to be able to answer that
question, I would need to know how the filesystem is going to use this
information.   You have included that below (thanks).  Maybe including
it in the documentation would be helpful.

> 
> I think it is futile to keep spending time coming up with terminology
> that encompasses all current and future software and hardware storage
> devices with 100% accuracy.

And I think it is futile to export a value with such a vague meaning.
Concrete usage examples would help make the meaning less vague.

> 
> 
> Neil> To some degree it is actually a 'maximum' preferred size for
> Neil> random IO.  If you do random IO in blocks larger than the chunk
> Neil> size then you risk causing more 'head contention' (at least with
> Neil> RAID0 - with RAID5 the tradeoff is more complex).
> 
> Please elaborate.

If you are performing random IO on a 4-drive RAID0 and every IO fits
within a chunk, then you can expect to get 4 times the throughput of a
single drive as you get 4-way parallelism on the seeks.
If you are performing random IO on that same array but every IO
crosses a chunk boundary, then you need 2 drives to satisfy each
request, and you only get 2-way parallelism on the seeks.  As random
IO tends to be seek-bound, this will probably be slower.
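
(To put a number on it - my arithmetic, not anything measured: with
64k chunks, a 64k request issued at a random 4k-aligned offset starts
at one of 16 positions within a chunk, and 15 of those 16 cross a
chunk boundary.  So nearly every such request occupies two spindles
instead of one.)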

The same is true for reading from RAID5.  For writing to RAID5, the
parity updates confuse the numbers, so it is hard to make such general
statements, though for largish arrays (6 or more devices) you probably
get a similar effect.


> 
> 
> Neil> Also, you say "may" report.  If a device does not report, what
> Neil> happens to this file.  Is it not present, or empty, or contain a
> Neil> special "undefined" value?  I think the answer is that "512" is
> Neil> reported.
> 
> The answer is physical_block_size.
> 
> 
> Neil> In this case, if a device does not report an optimal size, the
> Neil> file contains "0" - correct?  Should that be explicit?
> 
> Now documented.

Thanks.

> 
> 
> Neil> I'd really like to see an example of how you expect filesystems to
> Neil> use this.  I can well imagine the VM or elevator using this to
> Neil> assemble IO requests in to properly aligned requests.  But I
> Neil> cannot imagine how e.g mkfs would use it.  Or am I
> Neil> misunderstanding and this is for programs that use O_DIRECT on the
> Neil> block device so they can optimise their request stream?
> 
> The way it has been working so far (with the manual ioctl pokage) is
> that mkfs will align metadata as well as data on a minimum_io_size
> boundary.  And it will try to use the minimum_io_size as filesystem
> block size.  On Linux that's currently limited by the fact that we can't
> have blocks bigger than a page.  The filesystem can also report the
> optimal I/O size in statfs.  For XFS the stripe width also affects how
> the realtime/GRIO allocators work.

Thanks for these details.

It seems from these that the primary issue is alignment, while size is
secondary (with the exception of physical_block_size, where it is very
important).
Yet in your documentation, alignment doesn't get mentioned until the
last paragraph, making it seem like an afterthought.

(I couldn't find an "optimal I/O size" in statfs.  Maybe you mean
 "st_blksize" in 'stat'??  I think that one really has to be the
 filesystem block size.)


I think all these numbers are probably useful numbers to export.  But
I think that trying to give them labels like "optimal" or "minimum" or
"physical" is trying to give them a meaning which they don't really
possess.  It is treading a middle-ground which is less useful than
either extreme.
I think they should either be:
   block_size   chunk_size  stripe_size
or
   align-A   align-B   align-C
    With the uniform definition:
      IO requests of this alignment (and possibly size) give
      significantly better performance than non-aligned requests.
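
Spelled out in the style of the existing ABI entries (the 'align-A'
name is purely illustrative):

What:		/sys/block/<disk>/align-A
Description:
		IO requests aligned to this boundary (and preferably a
		multiple of it in size) give significantly better
		performance than non-aligned requests.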


Yes, I know.  We've been here before.  You didn't understand/believe
me before so there is no reason for me to expect to be more successful
this time.  I'll go find some bugs that I can be successful in fixing.

NeilBrown


end of thread

Thread overview: 33+ messages
2009-06-25  3:58 REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory Neil Brown
2009-06-25  8:00 ` Martin K. Petersen
2009-06-25 11:07   ` [dm-devel] " NeilBrown
2009-06-25 11:36     ` John Robinson
2009-06-25 17:43       ` Martin K. Petersen
2009-06-25 12:17     ` berthiaume_wayne
2009-06-25 17:38     ` Martin K. Petersen
2009-06-25 17:46       ` Linus Torvalds
2009-06-25 19:34         ` Jens Axboe
2009-06-26 11:58       ` [dm-devel] " Neil Brown
2009-06-26 14:48         ` Martin K. Petersen
2009-07-07  1:47           ` [dm-devel] " Neil Brown
2009-07-07  5:29             ` Martin K. Petersen
2009-07-09  0:42               ` Neil Brown
2009-07-07 22:06             ` Bill Davidsen
2009-06-25 19:40     ` [dm-devel] " Jens Axboe
2009-06-26 12:41       ` Neil Brown
2009-06-26 12:50         ` Jens Axboe
2009-06-26 13:16           ` NeilBrown
2009-06-26 13:27             ` Jens Axboe
2009-06-26 13:41             ` NeilBrown
2009-06-26 13:49               ` Jens Axboe
2009-06-27 12:50                 ` Neil Brown
2009-06-26 13:23           ` [dm-devel] " NeilBrown
2009-06-26 13:29             ` Jens Axboe
2009-06-27 12:32               ` Neil Brown
2009-06-29 10:18                 ` [dm-devel] " Jens Axboe
2009-06-29 10:52                   ` NeilBrown
2009-06-29 11:41                     ` Jens Axboe
2009-06-29 12:45                       ` Boaz Harrosh
2009-06-29 12:52                         ` Jens Axboe
2009-06-29 23:09                       ` Andreas Dilger
2009-07-01  0:29                         ` Neil Brown
