All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] dm: Fix alignment stacking on partitioned devices
@ 2009-12-18  7:30 Martin K. Petersen
  2009-12-18 17:33 ` Mike Snitzer
  0 siblings, 1 reply; 16+ messages in thread
From: Martin K. Petersen @ 2009-12-18  7:30 UTC (permalink / raw)
  To: device-mapper development, Mike Snitzer; +Cc: jens.axboe


The dm device limits function passes the start sector within the block
device to the block layer stacking function.  However, the partition's
offset on the physical device is not added, resulting in incorrect
alignment reporting.

Add the partition offset to the values passed to blk_stack_limits().

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

---

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index be62547..67efac9 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -495,6 +495,7 @@ int dm_set_device_limits(struct dm_target *ti, struct dm_dev *dev,
 	struct queue_limits *limits = data;
 	struct block_device *bdev = dev->bdev;
 	struct request_queue *q = bdev_get_queue(bdev);
+	sector_t offset = (get_start_sect(bdev) + start) << 9;
 	char b[BDEVNAME_SIZE];
 
 	if (unlikely(!q)) {
@@ -503,7 +504,7 @@ int dm_set_device_limits(struct dm_target *ti, struct dm_dev *dev,
 		return 0;
 	}
 
-	if (blk_stack_limits(limits, &q->limits, start << 9) < 0)
+	if (blk_stack_limits(limits, &q->limits, offset) < 0)
 		DMWARN("%s: target device %s is misaligned: "
 		       "physical_block_size=%u, logical_block_size=%u, "
 		       "alignment_offset=%u, start=%llu",
@@ -511,7 +512,7 @@ int dm_set_device_limits(struct dm_target *ti, struct dm_dev *dev,
 		       q->limits.physical_block_size,
 		       q->limits.logical_block_size,
 		       q->limits.alignment_offset,
-		       (unsigned long long) start << 9);
+		       (unsigned long long) offset);
 
 
 	/*

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: dm: Fix alignment stacking on partitioned devices
  2009-12-18  7:30 [PATCH] dm: Fix alignment stacking on partitioned devices Martin K. Petersen
@ 2009-12-18 17:33 ` Mike Snitzer
  2009-12-21 17:49   ` Martin K. Petersen
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Snitzer @ 2009-12-18 17:33 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: device-mapper development, jens.axboe

On Fri, Dec 18 2009 at  2:30am -0500,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> 
> The dm device limits function passes the start sector within the block
> device to the block layer stacking function.  However, the partition's
> offset on the physical device is not added, resulting in incorrect
> alignment reporting.
> 
> Add the partition offset to the values passed to blk_stack_limits().
> 
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

Martin,

This is not required because DM assumes alignment_offset has already
been accounted for by the caller (e.g. LVM2 or some other ficitional DM
consumer).

Below, "start" represents the aligned start of the data for a given volume.
That start must have already been shifted by alignment_offset; as is the
case with lvm2 (>= 2.02.51), see the following lvm2 commits:
http://sources.redhat.com/git/gitweb.cgi?p=lvm2.git;a=commit;h=282029eb45e56
http://sources.redhat.com/git/gitweb.cgi?p=lvm2.git;a=commit;h=6c88b6c660020

All this being said, how did you arrive at this patch?  Why do you feel
it is needed?  Was it just from code inspection?

Regards,
Mike


> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index be62547..67efac9 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -495,6 +495,7 @@ int dm_set_device_limits(struct dm_target *ti, struct dm_dev *dev,
>  	struct queue_limits *limits = data;
>  	struct block_device *bdev = dev->bdev;
>  	struct request_queue *q = bdev_get_queue(bdev);
> +	sector_t offset = (get_start_sect(bdev) + start) << 9;
>  	char b[BDEVNAME_SIZE];
>  
>  	if (unlikely(!q)) {
> @@ -503,7 +504,7 @@ int dm_set_device_limits(struct dm_target *ti, struct dm_dev *dev,
>  		return 0;
>  	}
>  
> -	if (blk_stack_limits(limits, &q->limits, start << 9) < 0)
> +	if (blk_stack_limits(limits, &q->limits, offset) < 0)
>  		DMWARN("%s: target device %s is misaligned: "
>  		       "physical_block_size=%u, logical_block_size=%u, "
>  		       "alignment_offset=%u, start=%llu",
> @@ -511,7 +512,7 @@ int dm_set_device_limits(struct dm_target *ti, struct dm_dev *dev,
>  		       q->limits.physical_block_size,
>  		       q->limits.logical_block_size,
>  		       q->limits.alignment_offset,
> -		       (unsigned long long) start << 9);
> +		       (unsigned long long) offset);
>  
>  
>  	/*

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: dm: Fix alignment stacking on partitioned devices
  2009-12-18 17:33 ` Mike Snitzer
@ 2009-12-21 17:49   ` Martin K. Petersen
  2009-12-21 19:52     ` Mike Snitzer
  0 siblings, 1 reply; 16+ messages in thread
From: Martin K. Petersen @ 2009-12-21 17:49 UTC (permalink / raw)
  To: Mike Snitzer, device-mapper development; +Cc: Martin K. Petersen, jens.axboe

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> This is not required because DM assumes alignment_offset has
Mike> already been accounted for by the caller (e.g. LVM2 or some other
Mike> ficitional DM consumer).

Mike> Below, "start" represents the aligned start of the data for a
Mike> given volume.  That start must have already been shifted by
Mike> alignment_offset; as is the case with lvm2 (>= 2.02.51), see the
Mike> following lvm2 commits:
Mike> http://sources.redhat.com/git/gitweb.cgi?p=lvm2.git;a=commit;h=282029eb45e56
Mike> http://sources.redhat.com/git/gitweb.cgi?p=lvm2.git;a=commit;h=6c88b6c660020

Mike> All this being said, how did you arrive at this patch?  Why do you
Mike> feel it is needed?  Was it just from code inspection?

Interesting.

I have one case in my topology test scripts that prints a DM alignment
warning with the old stacking algorithm but doesn't with the new one.  A
bit surprising given that the new approach is much more picky.  MD
correctly prints a warning for the same device with both algorithms.

So I added a few printks and noticed that DM was calling the stacking
function with the same start sector for both devices.  That seemed odd
because the PV devices are misaligned partitions on the same disk.

Note that this was run using an old EL5 LVM toolkit because I'm not
interested in having userland compensate for any misalignment.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: dm: Fix alignment stacking on partitioned devices
  2009-12-21 17:49   ` Martin K. Petersen
@ 2009-12-21 19:52     ` Mike Snitzer
  2009-12-22  4:27       ` Martin K. Petersen
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Snitzer @ 2009-12-21 19:52 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: device-mapper development, jens.axboe

On Mon, Dec 21 2009 at 12:49pm -0500,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> 
> Mike> This is not required because DM assumes alignment_offset has
> Mike> already been accounted for by the caller (e.g. LVM2 or some other
> Mike> ficitional DM consumer).
> 
> Mike> Below, "start" represents the aligned start of the data for a
> Mike> given volume.  That start must have already been shifted by
> Mike> alignment_offset; as is the case with lvm2 (>= 2.02.51), see the
> Mike> following lvm2 commits:
> Mike> http://sources.redhat.com/git/gitweb.cgi?p=lvm2.git;a=commit;h=282029eb45e56
> Mike> http://sources.redhat.com/git/gitweb.cgi?p=lvm2.git;a=commit;h=6c88b6c660020
> 
> Mike> All this being said, how did you arrive at this patch?  Why do you
> Mike> feel it is needed?  Was it just from code inspection?
> 
> Interesting.
> 
> I have one case in my topology test scripts that prints a DM alignment
> warning with the old stacking algorithm but doesn't with the new one.  A
> bit surprising given that the new approach is much more picky.  MD
> correctly prints a warning for the same device with both algorithms.
> 
> So I added a few printks and noticed that DM was calling the stacking
> function with the same start sector for both devices.  That seemed odd
> because the PV devices are misaligned partitions on the same disk.

That is interesting.

Do you have traces from the kernel with the old stacking function?
(BTW. you don't _need_ kernel traces because the 'dmsetup table' output
gives you the "start").  I have to believe the same "start" values are
being provided regardless of the stacking function.

If that is the case then the data "start" should be different for these
individual PVs (one following the other).  If not, can you provide the
'dmsetup table' output for the associated DM devices?
 
> Note that this was run using an old EL5 LVM toolkit because I'm not
> interested in having userland compensate for any misalignment.

OK, strikes me as odd but you must have your reasons (maybe you're just
testing that the old untrained tools create misaligned partitions, LVs,
etc?).

Mike

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: dm: Fix alignment stacking on partitioned devices
  2009-12-21 19:52     ` Mike Snitzer
@ 2009-12-22  4:27       ` Martin K. Petersen
  2009-12-22 14:32         ` Mike Snitzer
  0 siblings, 1 reply; 16+ messages in thread
From: Martin K. Petersen @ 2009-12-22  4:27 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: device-mapper development, Martin K. Petersen, jens.axboe

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> If that is the case then the data "start" should be different for
Mike> these individual PVs (one following the other).  If not, can you
Mike> provide the 'dmsetup table' output for the associated DM devices?

Did some more runs.  It turns out the old stacking algorithm identified
the two alignments as being incompatible for the wrong reasons.  That's
why a message was printed with the older kernel.

I do find it a bit strange that you pass relative offsets to the
stacking function and rely on the userland utilities to do the right
thing, though.  Doesn't that mean that existing misaligned volumes will
go undetected?  Or do you perform a check when you prepare the table
before feeding it to the kernel?


>> Note that this was run using an old EL5 LVM toolkit because I'm not
>> interested in having userland compensate for any misalignment.

Mike> OK, strikes me as odd but you must have your reasons

I'm testing the stacking function.  I'm not interested in things being
lined up.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: dm: Fix alignment stacking on partitioned devices
  2009-12-22  4:27       ` Martin K. Petersen
@ 2009-12-22 14:32         ` Mike Snitzer
  2009-12-22 17:41           ` Martin K. Petersen
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Snitzer @ 2009-12-22 14:32 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: device-mapper development, jens.axboe

On Mon, Dec 21 2009 at 11:27pm -0500,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> 
> Mike> If that is the case then the data "start" should be different for
> Mike> these individual PVs (one following the other).  If not, can you
> Mike> provide the 'dmsetup table' output for the associated DM devices?
> 
> Did some more runs.  It turns out the old stacking algorithm identified
> the two alignments as being incompatible for the wrong reasons.  That's
> why a message was printed with the older kernel.

OK, so is MD somehow getting things wrong?  (You originally said that
with the new stacking function MD resulted in an error but DM did not).
 
> I do find it a bit strange that you pass relative offsets to the
> stacking function and rely on the userland utilities to do the right
> thing, though.  Doesn't that mean that existing misaligned volumes will
> go undetected?  Or do you perform a check when you prepare the table
> before feeding it to the kernel?

"start" isn't a relative offset.  The "start" is absolute from the
beginning of the device.  So for old LVs that are already misaligned
they'll pass down a start that _should_ cause misalignment messages in
the kernel (via the stacking function).  I know I've seen such messages
when LVM doesn't compensate for alignment_offset but I'll revisit this
with your new stacking function.

The justification for relying on userspace to consume and adjust for
alignment_offset is: you're not creating LVM LVs with the kernel.
Metadata that (may) precede the data "start" of a volume gets layed out
(and managed) by userspace (LVM).  Not having the kernel involved in
generating the "correct" offset to data allows such concerns to be
purely in userspace (having the kernel contribute to that calculation
doesn't get us anything other than potential for split brain bugs).

Mike

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: dm: Fix alignment stacking on partitioned devices
  2009-12-22 14:32         ` Mike Snitzer
@ 2009-12-22 17:41           ` Martin K. Petersen
  2009-12-22 21:42             ` Mike Snitzer
  0 siblings, 1 reply; 16+ messages in thread
From: Martin K. Petersen @ 2009-12-22 17:41 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: device-mapper development, Martin K. Petersen, jens.axboe

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike,

Mike> OK, so is MD somehow getting things wrong?  (You originally said
Mike> that with the new stacking function MD resulted in an error but DM
Mike> did not).

MD is passing in absolute offsets and got the right result in both
cases.


Mike> "start" isn't a relative offset.  

It's relative to the beginning of the partition (block_device), not
relative to the beginning of the disk (request_queue).

My beef here is that DM created a device like this:

[root@10 ~]# pvs -o +pe_start
  PV         VG   Fmt  Attr PSize   PFree   1st PE 
  /dev/sde1  foo  lvm2 a-   508.00M 256.00M 192.00K
  /dev/sde2  foo  lvm2 a-   512.00M 260.00M 192.00K
[root@10 ~]# dmsetup table
foo-bar: 0 1032192 striped 2 32 8:66 384 8:65 384

/dev/sde  has an alignment_offset of 3584
/dev/sde1 has an alignment_offset of 0
/dev/sde2 has an alignment_offset of 1024

dm_set_device_limits() calls blk_stack_limits() with a byte offset of
196608 for both sde1 and sde2.  And that offset is checked for alignment
with the queue's limits which has an alignment_offset of 3584.

The two PVs are misaligned but that information is lost because you use
blk_stack_limits() with partition-relative offsets.  My patch was an
attempt to fix that.

Test case:

# modprobe scsi_debug dev_size_mb=1024 num_parts=2 lowest_aligned=7 physblk_exp=3
# pvcreate /dev/sde1
# pvcreate /dev/sde2
# vgcreate foo /dev/sde1 /dev/sde2
# lvcreate -I 16 -i 2 -L 500M -n bar foo

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: dm: Fix alignment stacking on partitioned devices
  2009-12-22 17:41           ` Martin K. Petersen
@ 2009-12-22 21:42             ` Mike Snitzer
  2009-12-22 22:13               ` Mike Snitzer
  2009-12-23  6:05               ` Martin K. Petersen
  0 siblings, 2 replies; 16+ messages in thread
From: Mike Snitzer @ 2009-12-22 21:42 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: device-mapper development, jens.axboe

Martin,

Apologies for the long mail here...

On Tue, Dec 22 2009 at 12:41pm -0500,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> 
> Mike,
> 
> Mike> OK, so is MD somehow getting things wrong?  (You originally said
> Mike> that with the new stacking function MD resulted in an error but DM
> Mike> did not).
> 
> MD is passing in absolute offsets and got the right result in both
> cases.
> 
> 
> Mike> "start" isn't a relative offset.  
> 
> It's relative to the beginning of the partition (block_device), not
> relative to the beginning of the disk (request_queue).

Ah, thanks for clarifying!  No idea why I thought "start" was relative
to the beginning of the disk (request_queue).

> My beef here is that DM created a device like this:
> 
> [root@10 ~]# pvs -o +pe_start
>   PV         VG   Fmt  Attr PSize   PFree   1st PE 
>   /dev/sde1  foo  lvm2 a-   508.00M 256.00M 192.00K
>   /dev/sde2  foo  lvm2 a-   512.00M 260.00M 192.00K
> [root@10 ~]# dmsetup table
> foo-bar: 0 1032192 striped 2 32 8:66 384 8:65 384
> 
> /dev/sde  has an alignment_offset of 3584
> /dev/sde1 has an alignment_offset of 0
> /dev/sde2 has an alignment_offset of 1024

I'm seeing the following (when I use the scsi_debug from below):
# cat /sys/block/sdb/alignment_offset 
3584
# cat /sys/block/sdb/sdb1/alignment_offset 
512
# cat /sys/block/sdb/sdb2/alignment_offset 
512
 
> dm_set_device_limits() calls blk_stack_limits() with a byte offset of
> 196608 for both sde1 and sde2.  And that offset is checked for alignment
> with the queue's limits which has an alignment_offset of 3584.
> 
> The two PVs are misaligned but that information is lost because you use
> blk_stack_limits() with partition-relative offsets.  My patch was an
> attempt to fix that.

Coming full circle, your patch makes sense.  Sorry about giving you the
run around!  But I have new concerns (at the very end below) about the
new blk_stack_limits(); I also need to review your patch further
(relative to virtual device stacking: does get_start_sect(bdev) always
work?).

> Test case:
> 
> # modprobe scsi_debug dev_size_mb=1024 num_parts=2 lowest_aligned=7 physblk_exp=3
> # pvcreate /dev/sde1
> # pvcreate /dev/sde2
> # vgcreate foo /dev/sde1 /dev/sde2
> # lvcreate -I 16 -i 2 -L 500M -n bar foo

Here I'm just showing that I verified what you were seeing above.  With
an updated LVM but the older blk_stack_limits() the same test gives me:

# pvs -o +pe_start
  PV         VG   Fmt  Attr PSize   PFree   1st PE 
  /dev/sdb1       lvm2 --    99.48m  99.48m 192.50k
  /dev/sdb2       lvm2 --   100.50m 100.50m 192.50k
# vgcreate foo /dev/sdb1 /dev/sdb2
# lvcreate -I 16 -i 2 -L100M -n bar foo
# dmsetup table foo-bar
0 212992 striped 2 32 8:18 385 8:17 385

device-mapper: table: 253:1: target device sdb2 is misaligned: physical_block_size=4096, logical_block_size=512, alignment_offset=3584, start=197120
device-mapper: table: 253:1: target device sdb1 is misaligned: physical_block_size=4096, logical_block_size=512, alignment_offset=3584, start=197120

If I disable LVM's alignment_offset detection (runs the same as your
older LVM) I get:
# pvcreate --config 'devices {data_alignment_offset_detection=0}' /dev/sdb1
# pvcreate --config 'devices {data_alignment_offset_detection=0}' /dev/sdb2
# pvs -o +pe_start
  PV         VG   Fmt  Attr PSize   PFree   1st PE 
  /dev/sdb1       lvm2 --    99.48m  99.48m 192.00k
  /dev/sdb2       lvm2 --   100.50m 100.50m 192.00k
...
# dmsetup table foo-bar
0 212992 striped 2 32 8:18 384 8:17 384

device-mapper: table: 253:1: target device sdb2 is misaligned: physical_block_size=4096, logical_block_size=512, alignment_offset=3584, start=196608
device-mapper: table: 253:1: target device sdb1 is misaligned: physical_block_size=4096, logical_block_size=512, alignment_offset=3584, start=196608


Now if I apply your patch to add the partition offset -- which makes the
offset that DM passes to blk_stack_limits() absolute:

The old blk_stack_limits(), using LVM2 w/
data_alignment_offset_detection=1 shows this aligned case as
"misaligned":
device-mapper: table: 253:1: target device sdb2 is misaligned: physical_block_size=4096, logical_block_size=512, alignment_offset=3584, start=104530432
device-mapper: table: 253:1: target device sdb1 is misaligned: physical_block_size=4096, logical_block_size=512, alignment_offset=3584, start=213504

The new blk_stack_limits() "works" (believes the device is aligned).
But if I use pvcreate w/ data_alignment_offset_detection=0 it
_incorrectly_ believes the device is aligned.

Why does the new blk_stack_limits() think the misaligned offset is
aligned?  I added some debugging to dm_set_device_limits() to see the
following:

1) MISALIGNED-case: LVM2 with data_alignment_offset_detection=0:
device-mapper: table: 254:1: target device sdb2 call to blk_stack_limits(): physical_block_size=4096, logical_block_size=512, alignment_offset=3584, start=104529920
device-mapper: table: 254:1: target device sdb1 call to blk_stack_limits(): physical_block_size=4096, logical_block_size=512, alignment_offset=3584, start=212992

This is not properly aligned relative to physical_block_size=4096, sdb's
alignment_offset=3584 and each partitions' alignment_offset=512:

>>> 104529920%4096
0
>>> 212992%4096
0

2) ALIGNED-case: LVM2 with data_alignment_offset_detection=1:
device-mapper: table: 254:1: target device sdb2 call to blk_stack_limits(): physical_block_size=4096, logical_block_size=512, alignment_offset=3584, start=104530432
device-mapper: table: 254:1: target device sdb1 call to blk_stack_limits(): physical_block_size=4096, logical_block_size=512, alignment_offset=3584, start=213504

In this 2nd case, LVM2 is correctly adding the alignment_offset (512)
for each partitioned device.  Each data "start" includes the
alignment_offset of 512.  So the device is properly aligned:

>>> 104530432%4096
512
>>> 213504%4096
512


I'll dig into blk_stack_limits() to see if I can make sense of this.
And in the end; what, if anything, is DM doing wrong?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: dm: Fix alignment stacking on partitioned devices
  2009-12-22 21:42             ` Mike Snitzer
@ 2009-12-22 22:13               ` Mike Snitzer
  2009-12-23  6:05               ` Martin K. Petersen
  1 sibling, 0 replies; 16+ messages in thread
From: Mike Snitzer @ 2009-12-22 22:13 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: device-mapper development, jens.axboe

On Tue, Dec 22 2009 at  4:42pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> Martin,
> 
> Apologies for the long mail here...
> 
> On Tue, Dec 22 2009 at 12:41pm -0500,
> Martin K. Petersen <martin.petersen@oracle.com> wrote:
> 
> > >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> > 
> > Mike,
> > 
> > Mike> OK, so is MD somehow getting things wrong?  (You originally said
> > Mike> that with the new stacking function MD resulted in an error but DM
> > Mike> did not).
> > 
> > MD is passing in absolute offsets and got the right result in both
> > cases.
> > 
> > 
> > Mike> "start" isn't a relative offset.  
> > 
> > It's relative to the beginning of the partition (block_device), not
> > relative to the beginning of the disk (request_queue).
> 
> Ah, thanks for clarifying!  No idea why I thought "start" was relative
> to the beginning of the disk (request_queue).
> 
> > My beef here is that DM created a device like this:
> > 
> > [root@10 ~]# pvs -o +pe_start
> >   PV         VG   Fmt  Attr PSize   PFree   1st PE 
> >   /dev/sde1  foo  lvm2 a-   508.00M 256.00M 192.00K
> >   /dev/sde2  foo  lvm2 a-   512.00M 260.00M 192.00K
> > [root@10 ~]# dmsetup table
> > foo-bar: 0 1032192 striped 2 32 8:66 384 8:65 384
> > 
> > /dev/sde  has an alignment_offset of 3584
> > /dev/sde1 has an alignment_offset of 0
> > /dev/sde2 has an alignment_offset of 1024
> 
> I'm seeing the following (when I use the scsi_debug from below):
> # cat /sys/block/sdb/alignment_offset 
> 3584
> # cat /sys/block/sdb/sdb1/alignment_offset 
> 512
> # cat /sys/block/sdb/sdb2/alignment_offset 
> 512

Bleh, I used dev_size_mb=200 to get the above alignment_offsets.

> > # modprobe scsi_debug dev_size_mb=1024 num_parts=2 lowest_aligned=7 physblk_exp=3

If I increase to dev_size_mb=500 I get:
sdb1 alignment_offset=0
sdb2 alignment_offset=512

If I increase to dev_size_mb=1000 I also get:
sdb1 alignment_offset=0
sdb2 alignment_offset=1024

Anyway, this doesn't change the fact that the new blk_stack_limits()
isn't detecting the offset that DM provides as misaligned (in the legacy
LVM case, data_alignment_offset_detection=0).

In an earlier mail you said:
dm_set_device_limits() calls blk_stack_limits() with a byte offset of
196608 for both sde1 and sde2.  And that offset is checked for alignment
with the queue's limits which has an alignment_offset of 3584.

Applying that to this example that I shared:

> The new blk_stack_limits() "works" (believes the device is aligned).
> But if I use pvcreate w/ data_alignment_offset_detection=0 it
> _incorrectly_ believes the device is aligned.
> 
> Why does the new blk_stack_limits() think the misaligned offset is
> aligned?  I added some debugging to dm_set_device_limits() to see the
> following:
> 
> 1) MISALIGNED-case: LVM2 with data_alignment_offset_detection=0:
> device-mapper: table: 254:1: target device sdb2 call to blk_stack_limits(): physical_block_size=4096, logical_block_size=512, alignment_offset=3584, start=104529920
> device-mapper: table: 254:1: target device sdb1 call to blk_stack_limits(): physical_block_size=4096, logical_block_size=512, alignment_offset=3584, start=212992
> 
> This is not properly aligned relative to physical_block_size=4096, sdb's
> alignment_offset=3584 and each partitions' alignment_offset=512:
> 
> >>> 104529920%4096
> 0
> >>> 212992%4096
> 0

dm_set_device_limits() calls blk_stack_limits() with a byte offset of
212992 for sdb1 and 104529920 for sdb2.  And those offsets are checked
for alignment with the queue's limits which has an alignment_offset of
3584.

/me now goes to look at the new blk_stack_limits()

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: dm: Fix alignment stacking on partitioned devices
  2009-12-22 21:42             ` Mike Snitzer
  2009-12-22 22:13               ` Mike Snitzer
@ 2009-12-23  6:05               ` Martin K. Petersen
  2009-12-23 14:26                 ` Mike Snitzer
  1 sibling, 1 reply; 16+ messages in thread
From: Martin K. Petersen @ 2009-12-23  6:05 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: device-mapper development, Martin K. Petersen, jens.axboe

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> I also need to review your patch further (relative to virtual
Mike> device stacking: does get_start_sect(bdev) always work?).

Yep.

I found a bug in the partition alignment reporting.  The following patch
should fix that.  Please try it out.

Concerning your test cases: It is perfectly valid for two component
devices to be misaligned with regards to their underlying physical
devices as long as they have identical alignment.  In that case the top
level device will report a suitable alignment_offset.


block: Fix topology stacking for data and discard alignment

The stacking code incorrectly scaled up the data offset in some cases
causing misaligned devices to report alignment.  Rewrite the stacking
algorithm to remedy this and apply the same alignment principles to the
discard handling.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 6ae118d..ca4f0a4 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -517,9 +517,8 @@ static unsigned int lcm(unsigned int a, unsigned int b)
 int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		     sector_t offset)
 {
-	int ret;
-
-	ret = 0;
+	sector_t alignment;
+	unsigned int top, bottom;
 
 	t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
 	t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors);
@@ -537,6 +536,18 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_segment_size = min_not_zero(t->max_segment_size,
 					   b->max_segment_size);
 
+	alignment = queue_limit_alignment_offset(b, offset);
+
+	if (t->alignment_offset != alignment) {
+
+		top = max(t->physical_block_size, t->io_min)
+			+ t->alignment_offset;
+		bottom = max(b->physical_block_size, b->io_min) + alignment;
+
+		if (max(top, bottom) & (min(top, bottom) - 1))
+			t->misaligned = 1;
+	}
+
 	t->logical_block_size = max(t->logical_block_size,
 				    b->logical_block_size);
 
@@ -544,54 +555,55 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 				     b->physical_block_size);
 
 	t->io_min = max(t->io_min, b->io_min);
+	t->io_opt = lcm(t->io_opt, b->io_opt);
+
 	t->no_cluster |= b->no_cluster;
 	t->discard_zeroes_data &= b->discard_zeroes_data;
 
-	/* Bottom device offset aligned? */
-	if (offset &&
-	    (offset & (b->physical_block_size - 1)) != b->alignment_offset) {
+	if (t->physical_block_size & (t->logical_block_size - 1)) {
+		t->physical_block_size = t->logical_block_size;
 		t->misaligned = 1;
-		ret = -1;
 	}
 
-	/*
-	 * Temporarily disable discard granularity. It's currently buggy
-	 * since we default to 0 for discard_granularity, hence this
-	 * "failure" will always trigger for non-zero offsets.
-	 */
-#if 0
-	if (offset &&
-	    (offset & (b->discard_granularity - 1)) != b->discard_alignment) {
-		t->discard_misaligned = 1;
-		ret = -1;
+	if (t->io_min & (t->physical_block_size - 1)) {
+		t->io_min = t->physical_block_size;
+		t->misaligned = 1;
 	}
-#endif
 
-	/* If top has no alignment offset, inherit from bottom */
-	if (!t->alignment_offset)
-		t->alignment_offset =
-			b->alignment_offset & (b->physical_block_size - 1);
+	if (t->io_opt & (t->physical_block_size - 1)) {
+		t->io_opt = 0;
+		t->misaligned = 1;
+	}
 
-	if (!t->discard_alignment)
-		t->discard_alignment =
-			b->discard_alignment & (b->discard_granularity - 1);
+	t->alignment_offset = lcm(t->alignment_offset, alignment)
+		& (max(t->physical_block_size, t->io_min) - 1);
 
-	/* Top device aligned on logical block boundary? */
-	if (t->alignment_offset & (t->logical_block_size - 1)) {
+	if (t->alignment_offset & (t->logical_block_size - 1))
 		t->misaligned = 1;
-		ret = -1;
-	}
 
-	/* Find lcm() of optimal I/O size and granularity */
-	t->io_opt = lcm(t->io_opt, b->io_opt);
-	t->discard_granularity = lcm(t->discard_granularity,
-				     b->discard_granularity);
+	/* Discard alignment and granularity */
+	if (b->discard_granularity) {
+
+		alignment = b->discard_alignment -
+			(offset & (b->discard_granularity - 1));
+
+		if (t->discard_granularity != 0 &&
+		    t->discard_alignment != alignment) {
+			top = t->discard_granularity + t->discard_alignment;
+			bottom = b->discard_granularity + alignment;
 
-	/* Verify that optimal I/O size is a multiple of io_min */
-	if (t->io_min && t->io_opt % t->io_min)
-		ret = -1;
+			/* Verify that top and bottom intervals line up */
+			if (max(top, bottom) & (min(top, bottom) - 1))
+				t->discard_misaligned = 1;
+		}
+
+		t->discard_granularity = max(t->discard_granularity,
+					     b->discard_granularity);
+		t->discard_alignment = lcm(t->discard_alignment, alignment) &
+			(t->discard_granularity - 1);
+	}
 
-	return ret;
+	return t->misaligned ? -1 : 0;
 }
 EXPORT_SYMBOL(blk_stack_limits);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 784a919..af0ffac 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1116,11 +1116,19 @@ static inline int queue_alignment_offset(struct request_queue *q)
 	return q->limits.alignment_offset;
 }
 
+static inline int queue_limit_alignment_offset(struct queue_limits *lim, sector_t offset)
+{
+	unsigned int granularity = max(lim->physical_block_size, lim->io_min);
+
+	offset &= granularity - 1;
+
+	return (granularity + lim->alignment_offset - offset) & (granularity - 1);
+}
+
 static inline int queue_sector_alignment_offset(struct request_queue *q,
 						sector_t sector)
 {
-	return ((sector << 9) - q->limits.alignment_offset)
-		& (q->limits.io_min - 1);
+	return queue_limit_alignment_offset(&q->limits, sector << 9);
 }
 
 static inline int bdev_alignment_offset(struct block_device *bdev)

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: dm: Fix alignment stacking on partitioned devices
  2009-12-23  6:05               ` Martin K. Petersen
@ 2009-12-23 14:26                 ` Mike Snitzer
  2009-12-23 16:33                   ` Martin K. Petersen
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Snitzer @ 2009-12-23 14:26 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: device-mapper development, jens.axboe

On Wed, Dec 23 2009 at  1:05am -0500,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> 
> Mike> I also need to review your patch further (relative to virtual
> Mike> device stacking: does get_start_sect(bdev) always work?).
> 
> Yep.
> 
> I found a bug in the partition alignment reporting.  The following patch
> should fix that.  Please try it out.

alignment_offset(s) look good now.  Though I have one observation below.

> Concerning your test cases: It is perfectly valid for two component
> devices to be misaligned with regards to their underlying physical
> devices as long as they have identical alignment.  In that case the top
> level device will report a suitable alignment_offset.

One thing I noticed along the way is that: when stacking one or more
misaligned devices there is the potential for misleading error messages,
e.g.:

foo-bar: 0 212992 striped 2 32 8:18 384 8:17 384

sdb alignment_offset is 3584
sdb1 alignment_offset is 0
sdb2 alignment_offset is 3072

when loading this table sdb2 is stacked before sdb1; sdb2 is misaligned
but sdb1 is not, yet:

device-mapper: table: 254:2: target device sdb2 call to blk_stack_limits(): physical_block_size=4096, logical_block_size=512, alignment_offset=3584, start=534839808
device-mapper: table: 254:2: target device sdb1 is misaligned: physical_block_size=4096, logical_block_size=512, alignment_offset=3584, start=228864

>>> 228864 % 4096
3584
>>> 534839808 % 4096
512

So sdb2 is taken to be aligned (even though it isn't, relative to the
queue's alignment_offset of 3584) and then when sdb1 is stacked it is
flagged as misaligned.  Doesn't seem right.

> block: Fix topology stacking for data and discard alignment
> 
> The stacking code incorrectly scaled up the data offset in some cases
> causing misaligned devices to report alignment.  Rewrite the stacking
> algorithm to remedy this and apply the same alignment principles to the
> discard handling.
> 
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

With this patch the alignment_offset values are correct.  They
accurately reflect the established baseline (queue's limits
alignment_offset).

Here is the incremental diff (for those following along):

---
 block/blk-settings.c   |    7 +++----
 include/linux/blkdev.h |   12 ++++++++++--
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index e14fcbc..ca4f0a4 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -518,7 +518,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		     sector_t offset)
 {
 	sector_t alignment;
-	unsigned int top, bottom, granularity;
+	unsigned int top, bottom;
 
 	t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
 	t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors);
@@ -536,14 +536,13 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_segment_size = min_not_zero(t->max_segment_size,
 					   b->max_segment_size);
 
-	granularity = max(b->physical_block_size, b->io_min);
-	alignment = b->alignment_offset - (offset & (granularity - 1));
+	alignment = queue_limit_alignment_offset(b, offset);
 
 	if (t->alignment_offset != alignment) {
 
 		top = max(t->physical_block_size, t->io_min)
 			+ t->alignment_offset;
-		bottom = granularity + alignment;
+		bottom = max(b->physical_block_size, b->io_min) + alignment;
 
 		if (max(top, bottom) & (min(top, bottom) - 1))
 			t->misaligned = 1;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 784a919..af0ffac 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1116,11 +1116,19 @@ static inline int queue_alignment_offset(struct request_queue *q)
 	return q->limits.alignment_offset;
 }
 
+static inline int queue_limit_alignment_offset(struct queue_limits *lim, sector_t offset)
+{
+	unsigned int granularity = max(lim->physical_block_size, lim->io_min);
+
+	offset &= granularity - 1;
+
+	return (granularity + lim->alignment_offset - offset) & (granularity - 1);
+}
+
 static inline int queue_sector_alignment_offset(struct request_queue *q,
 						sector_t sector)
 {
-	return ((sector << 9) - q->limits.alignment_offset)
-		& (q->limits.io_min - 1);
+	return queue_limit_alignment_offset(&q->limits, sector << 9);
 }
 
 static inline int bdev_alignment_offset(struct block_device *bdev)

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: dm: Fix alignment stacking on partitioned devices
  2009-12-23 14:26                 ` Mike Snitzer
@ 2009-12-23 16:33                   ` Martin K. Petersen
  2009-12-23 17:13                     ` Mike Snitzer
  0 siblings, 1 reply; 16+ messages in thread
From: Martin K. Petersen @ 2009-12-23 16:33 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: device-mapper development, Martin K. Petersen, jens.axboe

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> One thing I noticed along the way is that: when stacking one or
Mike> more misaligned devices there is the potential for misleading
Mike> error messages, e.g.:

[...]

Mike> So sdb2 is taken to be aligned (even though it isn't, relative to
Mike> the queue's alignment_offset of 3584) and then when sdb1 is
Mike> stacked it is flagged as misaligned.  Doesn't seem right.

I know.  It's an unfortunate side-effect of having an iterative stacking
process.  I don't know the whole picture so I don't know which device
the real offender is.  At the time you add device #2 I have no
recollection of device #1.

When you create a new top-level (DM/MD) device and add the first disk to
it I have no option but to inherit that component's queue limits.  So
that's what initially defines the top-level.

In your case adding a misaligned device will cause the stacking to
report that alignment can be obtained by padding the top level device
with an offset. Then you add a device which happens to be naturally
aligned. And that means the stacking logic breaks down because the
0-alignment is incompatible with the top's nonzero alignment_offset.

Originally (before the Great Naming Debate) the `misaligned' flag was
called `inconsistent' (actually, it was called `consistent' and the
logic was reversed).  I think that's a better term for it.  Compatible
would also work, I guess.

So when blk_stack_limits() returns an error it is because no consistent
alignment could be calculated.  blk_stack_limits() returning 0 does not
imply "alignment" (i.e. an alignment_offset of 0).  A return value of 0
means that we have successfully added the bottom device and the
resulting top device queue_limits now apply.

Maybe you should print something along the lines of "adding device sdX
caused an alignment consistency error on DM device foo".  That's a bit
vague but doesn't single out sdX as much.

The alternative is to add a device list to the queue_limits so we can
record previously stacked devices.  But that opens up a whole other can
of worms...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: dm: Fix alignment stacking on partitioned devices
  2009-12-23 16:33                   ` Martin K. Petersen
@ 2009-12-23 17:13                     ` Mike Snitzer
  2009-12-23 20:31                       ` Topology fixes Martin K. Petersen
                                         ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Mike Snitzer @ 2009-12-23 17:13 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: device-mapper development, jens.axboe

On Wed, Dec 23 2009 at 11:33am -0500,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> 
> Mike> One thing I noticed along the way is that: when stacking one or
> Mike> more misaligned devices there is the potential for misleading
> Mike> error messages, e.g.:
> 
> [...]
> 
> Mike> So sdb2 is taken to be aligned (even though it isn't, relative to
> Mike> the queue's alignment_offset of 3584) and then when sdb1 is
> Mike> stacked it is flagged as misaligned.  Doesn't seem right.
> 
> I know.  It's an unfortunate side-effect of having an iterative stacking
> process.  I don't know the whole picture so I don't know which device
> the real offender is.  At the time you add device #2 I have no
> recollection of device #1.
> 
> When you create a new top-level (DM/MD) device and add the first disk to
> it I have no option but to inherit that component's queue limits.  So
> that's what initially defines the top-level.
> 
> In your case adding a misaligned device will cause the stacking to
> report that alignment can be obtained by padding the top level device
> with an offset. Then you add a device which happens to be naturally
> aligned. And that means the stacking logic breaks down because the
> 0-alignment is incompatible with the top's nonzero alignment_offset.

Understood.  Thanks for clarifying.

> Originally (before the Great Naming Debate) the `misaligned' flag was
> called `inconsistent' (actually, it was called `consistent' and the
> logic was reversed).  I think that's a better term for it.  Compatible
> would also work, I guess.
> 
> So when blk_stack_limits() returns an error it is because no consistent
> alignment could be calculated.  blk_stack_limits() returning 0 does not
> imply "alignment" (i.e. an alignment_offset of 0).  A return value of 0
> means that we have successfully added the bottom device and the
> resulting top device queue_limits now apply.

I think it would be great if you added something along those lines as a
comment block above blk_stack_limits() (rather than "Returns 0 if
alignment didn't change.")

Also, your most recent patch eliminates many of the inline comments you
had in blk_stack_limits().  I think that is unfortunate considering
those comments helped give context for code that wasn't always
immediately clear on its own.

That observation aside, please feel free to add 'Tested-by: Mike Snitzer
<snitzer@redhat.com>' to your most recent blk_stack_limits() patch
header.

> Maybe you should print something along the lines of "adding device sdX
> caused an alignment consistency error on DM device foo".  That's a bit
> vague but doesn't single out sdX as much.

Good suggestion.  I'll work that in to the DM patch you originally
posted (the start of this thread) and repost it to dm-devel.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Topology fixes
  2009-12-23 17:13                     ` Mike Snitzer
@ 2009-12-23 20:31                       ` Martin K. Petersen
  2009-12-23 20:31                       ` [PATCH 1/2] block: Fix incorrect reporting of partition alignment Martin K. Petersen
  2009-12-23 20:31                       ` [PATCH 2/2] block: Fix topology stacking for data and discard alignment Martin K. Petersen
  2 siblings, 0 replies; 16+ messages in thread
From: Martin K. Petersen @ 2009-12-23 20:31 UTC (permalink / raw)
  To: jens.axboe, snitzer, dm-devel

These two patches fix several block I/O topology stacking issues.

The first patch corrects an error in calculating partition
offsets.  It should go into stable as well as .33.

The second patch is a revamp of the stacking algorithm for both
data and discard. 2.6.33 only.

--
Martin K. Petersen      Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/2] block: Fix incorrect reporting of partition alignment
  2009-12-23 17:13                     ` Mike Snitzer
  2009-12-23 20:31                       ` Topology fixes Martin K. Petersen
@ 2009-12-23 20:31                       ` Martin K. Petersen
  2009-12-23 20:31                       ` [PATCH 2/2] block: Fix topology stacking for data and discard alignment Martin K. Petersen
  2 siblings, 0 replies; 16+ messages in thread
From: Martin K. Petersen @ 2009-12-23 20:31 UTC (permalink / raw)
  To: jens.axboe, snitzer, dm-devel; +Cc: Martin K. Petersen

queue_sector_alignment_offset returned the wrong value which caused
partitions to report an incorrect alignment_offset. Since offset
calculation is needed several places it has been split into a separate
helper function.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
---
 include/linux/blkdev.h |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 784a919..59b832b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1116,11 +1116,18 @@ static inline int queue_alignment_offset(struct request_queue *q)
 	return q->limits.alignment_offset;
 }
 
+static inline int queue_limit_alignment_offset(struct queue_limits *lim, sector_t offset)
+{
+	unsigned int granularity = max(lim->physical_block_size, lim->io_min);
+
+	offset &= granularity - 1;
+	return (granularity + lim->alignment_offset - offset) & (granularity - 1);
+}
+
 static inline int queue_sector_alignment_offset(struct request_queue *q,
 						sector_t sector)
 {
-	return ((sector << 9) - q->limits.alignment_offset)
-		& (q->limits.io_min - 1);
+	return queue_limit_alignment_offset(&q->limits, sector << 9);
 }
 
 static inline int bdev_alignment_offset(struct block_device *bdev)
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/2] block: Fix topology stacking for data and discard alignment
  2009-12-23 17:13                     ` Mike Snitzer
  2009-12-23 20:31                       ` Topology fixes Martin K. Petersen
  2009-12-23 20:31                       ` [PATCH 1/2] block: Fix incorrect reporting of partition alignment Martin K. Petersen
@ 2009-12-23 20:31                       ` Martin K. Petersen
  2 siblings, 0 replies; 16+ messages in thread
From: Martin K. Petersen @ 2009-12-23 20:31 UTC (permalink / raw)
  To: jens.axboe, snitzer, dm-devel; +Cc: Martin K. Petersen

The stacking code incorrectly scaled up the data offset in some cases
causing misaligned devices to report alignment.  Rewrite the stacking
algorithm to remedy this and apply the same alignment principles to the
discard handling.  Clarify documentation and comments.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-settings.c |  121 ++++++++++++++++++++++++++++++++------------------
 1 files changed, 78 insertions(+), 43 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 6ae118d..d52d4ad 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -505,21 +505,30 @@ static unsigned int lcm(unsigned int a, unsigned int b)
 
 /**
  * blk_stack_limits - adjust queue_limits for stacked devices
- * @t:	the stacking driver limits (top)
- * @b:  the underlying queue limits (bottom)
+ * @t:	the stacking driver limits (top device)
+ * @b:  the underlying queue limits (bottom, component device)
  * @offset:  offset to beginning of data within component device
  *
  * Description:
- *    Merges two queue_limit structs.  Returns 0 if alignment didn't
- *    change.  Returns -1 if adding the bottom device caused
- *    misalignment.
+ *    This function is used by stacking drivers like MD and DM to ensure
+ *    that all component devices have compatible block sizes and
+ *    alignments.  The stacking driver must provide a queue_limits
+ *    struct (top) and then iteratively call the stacking function for
+ *    all component (bottom) devices.  The stacking function will
+ *    attempt to combine the values and ensure proper alignment.
+ *
+ *    Returns 0 if the top and bottom queue_limits are compatible.  The
+ *    top device's block sizes and alignment offsets may be adjusted to
+ *    ensure alignment with the bottom device. If no compatible sizes
+ *    and alignments exist, -1 is returned and the resulting top
+ *    queue_limits will have the misaligned flag set to indicate that
+ *    the alignment_offset is undefined.
  */
 int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		     sector_t offset)
 {
-	int ret;
-
-	ret = 0;
+	sector_t alignment;
+	unsigned int top, bottom;
 
 	t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
 	t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors);
@@ -537,6 +546,22 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_segment_size = min_not_zero(t->max_segment_size,
 					   b->max_segment_size);
 
+	alignment = queue_limit_alignment_offset(b, offset);
+
+	/* Bottom device has different alignment.  Check that it is
+	 * compatible with the current top alignment.
+	 */
+	if (t->alignment_offset != alignment) {
+
+		top = max(t->physical_block_size, t->io_min)
+			+ t->alignment_offset;
+		bottom = max(b->physical_block_size, b->io_min) + alignment;
+
+		/* Verify that top and bottom intervals line up */
+		if (max(top, bottom) & (min(top, bottom) - 1))
+			t->misaligned = 1;
+	}
+
 	t->logical_block_size = max(t->logical_block_size,
 				    b->logical_block_size);
 
@@ -544,54 +569,64 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 				     b->physical_block_size);
 
 	t->io_min = max(t->io_min, b->io_min);
+	t->io_opt = lcm(t->io_opt, b->io_opt);
+
 	t->no_cluster |= b->no_cluster;
 	t->discard_zeroes_data &= b->discard_zeroes_data;
 
-	/* Bottom device offset aligned? */
-	if (offset &&
-	    (offset & (b->physical_block_size - 1)) != b->alignment_offset) {
+	/* Physical block size a multiple of the logical block size? */
+	if (t->physical_block_size & (t->logical_block_size - 1)) {
+		t->physical_block_size = t->logical_block_size;
 		t->misaligned = 1;
-		ret = -1;
 	}
 
-	/*
-	 * Temporarily disable discard granularity. It's currently buggy
-	 * since we default to 0 for discard_granularity, hence this
-	 * "failure" will always trigger for non-zero offsets.
-	 */
-#if 0
-	if (offset &&
-	    (offset & (b->discard_granularity - 1)) != b->discard_alignment) {
-		t->discard_misaligned = 1;
-		ret = -1;
+	/* Minimum I/O a multiple of the physical block size? */
+	if (t->io_min & (t->physical_block_size - 1)) {
+		t->io_min = t->physical_block_size;
+		t->misaligned = 1;
 	}
-#endif
-
-	/* If top has no alignment offset, inherit from bottom */
-	if (!t->alignment_offset)
-		t->alignment_offset =
-			b->alignment_offset & (b->physical_block_size - 1);
 
-	if (!t->discard_alignment)
-		t->discard_alignment =
-			b->discard_alignment & (b->discard_granularity - 1);
-
-	/* Top device aligned on logical block boundary? */
-	if (t->alignment_offset & (t->logical_block_size - 1)) {
+	/* Optimal I/O a multiple of the physical block size? */
+	if (t->io_opt & (t->physical_block_size - 1)) {
+		t->io_opt = 0;
 		t->misaligned = 1;
-		ret = -1;
 	}
 
-	/* Find lcm() of optimal I/O size and granularity */
-	t->io_opt = lcm(t->io_opt, b->io_opt);
-	t->discard_granularity = lcm(t->discard_granularity,
-				     b->discard_granularity);
+	/* Find lowest common alignment_offset */
+	t->alignment_offset = lcm(t->alignment_offset, alignment)
+		& (max(t->physical_block_size, t->io_min) - 1);
 
-	/* Verify that optimal I/O size is a multiple of io_min */
-	if (t->io_min && t->io_opt % t->io_min)
-		ret = -1;
+	/* Verify that new alignment_offset is on a logical block boundary */
+	if (t->alignment_offset & (t->logical_block_size - 1))
+		t->misaligned = 1;
+
+	/* Discard alignment and granularity */
+	if (b->discard_granularity) {
+		unsigned int granularity = b->discard_granularity;
+		offset &= granularity - 1;
+
+		alignment = (granularity + b->discard_alignment - offset)
+			& (granularity - 1);
+
+		if (t->discard_granularity != 0 &&
+		    t->discard_alignment != alignment) {
+			top = t->discard_granularity + t->discard_alignment;
+			bottom = b->discard_granularity + alignment;
+
+			/* Verify that top and bottom intervals line up */
+			if (max(top, bottom) & (min(top, bottom) - 1))
+				t->discard_misaligned = 1;
+		}
+
+		t->max_discard_sectors = min_not_zero(t->max_discard_sectors,
+						      b->max_discard_sectors);
+		t->discard_granularity = max(t->discard_granularity,
+					     b->discard_granularity);
+		t->discard_alignment = lcm(t->discard_alignment, alignment) &
+			(t->discard_granularity - 1);
+	}
 
-	return ret;
+	return t->misaligned ? -1 : 0;
 }
 EXPORT_SYMBOL(blk_stack_limits);
 
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2009-12-23 20:31 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-12-18  7:30 [PATCH] dm: Fix alignment stacking on partitioned devices Martin K. Petersen
2009-12-18 17:33 ` Mike Snitzer
2009-12-21 17:49   ` Martin K. Petersen
2009-12-21 19:52     ` Mike Snitzer
2009-12-22  4:27       ` Martin K. Petersen
2009-12-22 14:32         ` Mike Snitzer
2009-12-22 17:41           ` Martin K. Petersen
2009-12-22 21:42             ` Mike Snitzer
2009-12-22 22:13               ` Mike Snitzer
2009-12-23  6:05               ` Martin K. Petersen
2009-12-23 14:26                 ` Mike Snitzer
2009-12-23 16:33                   ` Martin K. Petersen
2009-12-23 17:13                     ` Mike Snitzer
2009-12-23 20:31                       ` Topology fixes Martin K. Petersen
2009-12-23 20:31                       ` [PATCH 1/2] block: Fix incorrect reporting of partition alignment Martin K. Petersen
2009-12-23 20:31                       ` [PATCH 2/2] block: Fix topology stacking for data and discard alignment Martin K. Petersen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.