linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* I/O topology fixes for big physical block size
@ 2010-09-27 16:41 Martin K. Petersen
  2010-09-27 16:41 ` [PATCH 1/2] block: Ensure physical block size is unsigned int Martin K. Petersen
                   ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Martin K. Petersen @ 2010-09-27 16:41 UTC (permalink / raw)
  To: jaxboe, James.Bottomley, snitzer, linux-scsi

Mike Snitzer reported that he has access to a device that supports thin
provisioning but does not use the Block Limits VPD page to indicate
discard granularity. Instead it reports a huge (1MB) physical block
size. That caused a bit of fallout in the topology stack which assumed a
physical block size of 4KiB or less.

[PATCH 1/2] block: Ensure physical block size is unsigned int
[PATCH 2/2] sd: Fix overflow with big physical blocks

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/2] block: Ensure physical block size is unsigned int
  2010-09-27 16:41 I/O topology fixes for big physical block size Martin K. Petersen
@ 2010-09-27 16:41 ` Martin K. Petersen
  2010-09-27 17:40   ` Mike Snitzer
  2010-09-27 16:41 ` [PATCH 2/2] sd: Fix overflow with big physical blocks Martin K. Petersen
  2010-09-27 16:54 ` I/O topology fixes for big physical block size Jens Axboe
  2 siblings, 1 reply; 36+ messages in thread
From: Martin K. Petersen @ 2010-09-27 16:41 UTC (permalink / raw)
  To: jaxboe, James.Bottomley, snitzer, linux-scsi; +Cc: Martin K. Petersen

Physical block size was declared unsigned int to accomodate the maximum
size reported by READ CAPACITY(16).  Make sure we use the right type in
the related functions.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 block/blk-settings.c   |    2 +-
 include/linux/blkdev.h |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index a234f4b..450577d 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -343,7 +343,7 @@ EXPORT_SYMBOL(blk_queue_logical_block_size);
  *   hardware can operate on without reverting to read-modify-write
  *   operations.
  */
-void blk_queue_physical_block_size(struct request_queue *q, unsigned short size)
+void blk_queue_physical_block_size(struct request_queue *q, unsigned int size)
 {
 	q->limits.physical_block_size = size;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2c54906..9e443b9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -851,7 +851,7 @@ extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
 extern void blk_queue_max_discard_sectors(struct request_queue *q,
 		unsigned int max_discard_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned short);
-extern void blk_queue_physical_block_size(struct request_queue *, unsigned short);
+extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_alignment_offset(struct request_queue *q,
 				       unsigned int alignment);
 extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
@@ -1004,7 +1004,7 @@ static inline unsigned int queue_physical_block_size(struct request_queue *q)
 	return q->limits.physical_block_size;
 }
 
-static inline int bdev_physical_block_size(struct block_device *bdev)
+static inline unsigned int bdev_physical_block_size(struct block_device *bdev)
 {
 	return queue_physical_block_size(bdev_get_queue(bdev));
 }
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/2] sd: Fix overflow with big physical blocks
  2010-09-27 16:41 I/O topology fixes for big physical block size Martin K. Petersen
  2010-09-27 16:41 ` [PATCH 1/2] block: Ensure physical block size is unsigned int Martin K. Petersen
@ 2010-09-27 16:41 ` Martin K. Petersen
  2010-09-27 17:42   ` Mike Snitzer
  2010-09-27 18:13   ` [PATCH] block: eliminate potential for infinite loop in blkdev_issue_discard Mike Snitzer
  2010-09-27 16:54 ` I/O topology fixes for big physical block size Jens Axboe
  2 siblings, 2 replies; 36+ messages in thread
From: Martin K. Petersen @ 2010-09-27 16:41 UTC (permalink / raw)
  To: jaxboe, James.Bottomley, snitzer, linux-scsi; +Cc: Martin K. Petersen

The hw_sector_size variable could overflow if a device reported huge
physical blocks.  Switch to the more accurate physical_block_size
terminology and make sure we use an unsigned int to match the range
permitted by READ CAPACITY(16).

Also print a warning of the physical block size exceeds the page size.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/scsi/sd.c |   19 +++++++++++++------
 drivers/scsi/sd.h |    2 +-
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index ffa0689..1c14436 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1554,7 +1554,7 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp,
 	}
 
 	/* Logical blocks per physical block exponent */
-	sdkp->hw_sector_size = (1 << (buffer[13] & 0xf)) * sector_size;
+	sdkp->physical_block_size = (1 << (buffer[13] & 0xf)) * sector_size;
 
 	/* Lowest aligned logical block */
 	alignment = ((buffer[14] & 0x3f) << 8 | buffer[15]) * sector_size;
@@ -1567,7 +1567,7 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp,
 		struct request_queue *q = sdp->request_queue;
 
 		sdkp->thin_provisioning = 1;
-		q->limits.discard_granularity = sdkp->hw_sector_size;
+		q->limits.discard_granularity = sdkp->physical_block_size;
 		q->limits.max_discard_sectors = 0xffffffff;
 
 		if (buffer[14] & 0x40) /* TPRZ */
@@ -1635,7 +1635,7 @@ static int read_capacity_10(struct scsi_disk *sdkp, struct scsi_device *sdp,
 	}
 
 	sdkp->capacity = lba + 1;
-	sdkp->hw_sector_size = sector_size;
+	sdkp->physical_block_size = sector_size;
 	return sector_size;
 }
 
@@ -1756,10 +1756,16 @@ got_data:
 				  (unsigned long long)sdkp->capacity,
 				  sector_size, cap_str_10, cap_str_2);
 
-			if (sdkp->hw_sector_size != sector_size)
+			if (sdkp->physical_block_size != sector_size)
 				sd_printk(KERN_NOTICE, sdkp,
 					  "%u-byte physical blocks\n",
-					  sdkp->hw_sector_size);
+					  sdkp->physical_block_size);
+
+			if (sdkp->physical_block_size > PAGE_CACHE_SIZE)
+				sd_printk(KERN_WARNING, sdkp,
+					  "physical block size %u is bigger "
+					  "than system page size\n",
+					  sdkp->physical_block_size);
 		}
 	}
 
@@ -1773,7 +1779,8 @@ got_data:
 	else if (sector_size == 256)
 		sdkp->capacity >>= 1;
 
-	blk_queue_physical_block_size(sdp->request_queue, sdkp->hw_sector_size);
+	blk_queue_physical_block_size(sdp->request_queue,
+				      sdkp->physical_block_size);
 	sdkp->device->sector_size = sector_size;
 }
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index f81a930..f947140 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -50,7 +50,7 @@ struct scsi_disk {
 	atomic_t	openers;
 	sector_t	capacity;	/* size in 512-byte sectors */
 	u32		index;
-	unsigned short	hw_sector_size;
+	unsigned int	physical_block_size;
 	u8		media_present;
 	u8		write_prot;
 	u8		protection_type;/* Data Integrity Field */
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-27 16:41 I/O topology fixes for big physical block size Martin K. Petersen
  2010-09-27 16:41 ` [PATCH 1/2] block: Ensure physical block size is unsigned int Martin K. Petersen
  2010-09-27 16:41 ` [PATCH 2/2] sd: Fix overflow with big physical blocks Martin K. Petersen
@ 2010-09-27 16:54 ` Jens Axboe
  2010-09-27 17:20   ` Martin K. Petersen
  2010-09-27 17:23   ` Mike Snitzer
  2 siblings, 2 replies; 36+ messages in thread
From: Jens Axboe @ 2010-09-27 16:54 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: James.Bottomley@hansenpartnership.com, snitzer@redhat.com,
	linux-scsi@vger.kernel.org

On 2010-09-28 01:41, Martin K. Petersen wrote:
> Mike Snitzer reported that he has access to a device that supports thin
> provisioning but does not use the Block Limits VPD page to indicate
> discard granularity. Instead it reports a huge (1MB) physical block
> size. That caused a bit of fallout in the topology stack which assumed a
> physical block size of 4KiB or less.

Fixing the overflow aside, I question the validity of setting the physical
block size to something larger than PAGE_SIZE as there's no way that that
could really work in the current kernel.

I would suggest doing something similar as we do with other 'invalid'
settings that we cannot honor, print a warning and drop the queue
limits to PAGE_SIZE.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-27 16:54 ` I/O topology fixes for big physical block size Jens Axboe
@ 2010-09-27 17:20   ` Martin K. Petersen
  2010-09-27 22:21     ` Jens Axboe
  2010-09-27 17:23   ` Mike Snitzer
  1 sibling, 1 reply; 36+ messages in thread
From: Martin K. Petersen @ 2010-09-27 17:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, James.Bottomley@hansenpartnership.com,
	snitzer@redhat.com, linux-scsi@vger.kernel.org

>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:

Jens> Fixing the overflow aside, I question the validity of setting the
Jens> physical block size to something larger than PAGE_SIZE as there's
Jens> no way that that could really work in the current kernel.

Jens> I would suggest doing something similar as we do with other
Jens> 'invalid' settings that we cannot honor, print a warning and drop
Jens> the queue limits to PAGE_SIZE.

Mike and I have been bouncing this back and forth on this the last
couple of days.

I totally agree with enforcing hard limits like the logical blocks size
inside the kernel. We have to. But the physical block size is just an
I/O hint. And consequently I prefer mkfs to do sanity checking in this
case and not the kernel. We report the topology truthfully and then
userland can treat the data as it sees fit.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-27 16:54 ` I/O topology fixes for big physical block size Jens Axboe
  2010-09-27 17:20   ` Martin K. Petersen
@ 2010-09-27 17:23   ` Mike Snitzer
  2010-09-27 21:58     ` James Bottomley
  1 sibling, 1 reply; 36+ messages in thread
From: Mike Snitzer @ 2010-09-27 17:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, James.Bottomley@hansenpartnership.com,
	linux-scsi@vger.kernel.org

On Mon, Sep 27 2010 at 12:54pm -0400,
Jens Axboe <jaxboe@fusionio.com> wrote:

> On 2010-09-28 01:41, Martin K. Petersen wrote:
> > Mike Snitzer reported that he has access to a device that supports thin
> > provisioning but does not use the Block Limits VPD page to indicate
> > discard granularity. Instead it reports a huge (1MB) physical block
> > size. That caused a bit of fallout in the topology stack which assumed a
> > physical block size of 4KiB or less.
> 
> Fixing the overflow aside, I question the validity of setting the physical
> block size to something larger than PAGE_SIZE as there's no way that that
> could really work in the current kernel.
> 
> I would suggest doing something similar as we do with other 'invalid'
> settings that we cannot honor, print a warning and drop the queue
> limits to PAGE_SIZE.

I'm inclined to agree.  Doesn't make a whole lot of sense.

But could this cap of PAGE_SIZE be enforced with a follow-on patch?  Or
would you rather see it be dealt with in a single revised 2/2 patch?

Mike

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] block: Ensure physical block size is unsigned int
  2010-09-27 16:41 ` [PATCH 1/2] block: Ensure physical block size is unsigned int Martin K. Petersen
@ 2010-09-27 17:40   ` Mike Snitzer
  2010-10-08  5:15     ` Martin K. Petersen
  0 siblings, 1 reply; 36+ messages in thread
From: Mike Snitzer @ 2010-09-27 17:40 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: jaxboe, James.Bottomley, linux-scsi

On Mon, Sep 27 2010 at 12:41pm -0400,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> Physical block size was declared unsigned int to accomodate the maximum
> size reported by READ CAPACITY(16).  Make sure we use the right type in
> the related functions.
> 
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

Acked-by: Mike Snitzer <snitzer@redhat.com>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] sd: Fix overflow with big physical blocks
  2010-09-27 16:41 ` [PATCH 2/2] sd: Fix overflow with big physical blocks Martin K. Petersen
@ 2010-09-27 17:42   ` Mike Snitzer
  2010-09-27 18:13   ` [PATCH] block: eliminate potential for infinite loop in blkdev_issue_discard Mike Snitzer
  1 sibling, 0 replies; 36+ messages in thread
From: Mike Snitzer @ 2010-09-27 17:42 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: jaxboe, James.Bottomley, linux-scsi

On Mon, Sep 27 2010 at 12:41pm -0400,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> The hw_sector_size variable could overflow if a device reported huge
> physical blocks.  Switch to the more accurate physical_block_size
> terminology and make sure we use an unsigned int to match the range
> permitted by READ CAPACITY(16).
> 
> Also print a warning of the physical block size exceeds the page size.
> 
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

Acked-by: Mike Snitzer <snitzer@redhat.com>

(with the understanding that we may/will look to limit
physical_block_size to PAGE_SIZE in a follow-on patch).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] block: eliminate potential for infinite loop in blkdev_issue_discard
  2010-09-27 16:41 ` [PATCH 2/2] sd: Fix overflow with big physical blocks Martin K. Petersen
  2010-09-27 17:42   ` Mike Snitzer
@ 2010-09-27 18:13   ` Mike Snitzer
  2010-10-14 21:37     ` Mike Snitzer
  1 sibling, 1 reply; 36+ messages in thread
From: Mike Snitzer @ 2010-09-27 18:13 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: jaxboe, James.Bottomley, linux-scsi, dm-devel

Due to the recently identified overflow in read_capacity_16() it was
possible for max_discard_sectors to be zero but still have discards
enabled on the associated device's queue.

Eliminate the possibility for blkdev_issue_discard to infinitely loop.

Interestingly this issue wasn't identified until a device, whose
discard_granularity was 0 due to read_capacity_16 overflow, was consumed
by blk_stack_limits() to construct limits for a higher-level DM
multipath device.  The multipath device's resulting limits never had the
discard limits stacked because blk_stack_limits() will only do so if
the bottom device's discard_granularity != 0.  This resulted in the
multipath device's limits.max_discard_sectors being 0.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-lib.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index c392029..186f249 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -56,7 +56,10 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	 * granularity
 	 */
 	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
-	if (q->limits.discard_granularity) {
+	if (unlikely(!max_discard_sectors)) {
+		/* Avoid infinite loop (below) */
+		return -EOPNOTSUPP;
+	} else if (q->limits.discard_granularity) {
 		unsigned int disc_sects = q->limits.discard_granularity >> 9;
 
 		max_discard_sectors &= ~(disc_sects - 1);

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-27 17:23   ` Mike Snitzer
@ 2010-09-27 21:58     ` James Bottomley
  2010-09-27 22:03       ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: James Bottomley @ 2010-09-27 21:58 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, Martin K. Petersen, linux-scsi@vger.kernel.org

On Mon, 2010-09-27 at 13:23 -0400, Mike Snitzer wrote:
> On Mon, Sep 27 2010 at 12:54pm -0400,
> Jens Axboe <jaxboe@fusionio.com> wrote:
> 
> > On 2010-09-28 01:41, Martin K. Petersen wrote:
> > > Mike Snitzer reported that he has access to a device that supports thin
> > > provisioning but does not use the Block Limits VPD page to indicate
> > > discard granularity. Instead it reports a huge (1MB) physical block
> > > size. That caused a bit of fallout in the topology stack which assumed a
> > > physical block size of 4KiB or less.
> > 
> > Fixing the overflow aside, I question the validity of setting the physical
> > block size to something larger than PAGE_SIZE as there's no way that that
> > could really work in the current kernel.
> > 
> > I would suggest doing something similar as we do with other 'invalid'
> > settings that we cannot honor, print a warning and drop the queue
> > limits to PAGE_SIZE.
> 
> I'm inclined to agree.  Doesn't make a whole lot of sense.
> 
> But could this cap of PAGE_SIZE be enforced with a follow-on patch?  Or
> would you rather see it be dealt with in a single revised 2/2 patch?

It's up to Jens, but I'd prefer, unless there's a very good reason, that
the patches we put into the kernel be right first time around, since
that generates a cleaner history and a better example should anyone go
looking for one in the tree.

James



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-27 21:58     ` James Bottomley
@ 2010-09-27 22:03       ` Jens Axboe
  2010-09-27 22:14         ` Martin K. Petersen
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2010-09-27 22:03 UTC (permalink / raw)
  To: James Bottomley
  Cc: Mike Snitzer, Martin K. Petersen, linux-scsi@vger.kernel.org

On 2010-09-28 06:58, James Bottomley wrote:
> On Mon, 2010-09-27 at 13:23 -0400, Mike Snitzer wrote:
>> On Mon, Sep 27 2010 at 12:54pm -0400,
>> Jens Axboe <jaxboe@fusionio.com> wrote:
>>
>>> On 2010-09-28 01:41, Martin K. Petersen wrote:
>>>> Mike Snitzer reported that he has access to a device that supports thin
>>>> provisioning but does not use the Block Limits VPD page to indicate
>>>> discard granularity. Instead it reports a huge (1MB) physical block
>>>> size. That caused a bit of fallout in the topology stack which assumed a
>>>> physical block size of 4KiB or less.
>>>
>>> Fixing the overflow aside, I question the validity of setting the physical
>>> block size to something larger than PAGE_SIZE as there's no way that that
>>> could really work in the current kernel.
>>>
>>> I would suggest doing something similar as we do with other 'invalid'
>>> settings that we cannot honor, print a warning and drop the queue
>>> limits to PAGE_SIZE.
>>
>> I'm inclined to agree.  Doesn't make a whole lot of sense.
>>
>> But could this cap of PAGE_SIZE be enforced with a follow-on patch?  Or
>> would you rather see it be dealt with in a single revised 2/2 patch?
> 
> It's up to Jens, but I'd prefer, unless there's a very good reason, that
> the patches we put into the kernel be right first time around, since
> that generates a cleaner history and a better example should anyone go
> looking for one in the tree.

Yes, from a correctness point of view it doesn't matter, but when
people go looking up fixes for whatever reason, it's much better to
include such a fix in the original patch so it's not missed.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-27 22:03       ` Jens Axboe
@ 2010-09-27 22:14         ` Martin K. Petersen
  2010-09-27 22:24           ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Martin K. Petersen @ 2010-09-27 22:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: James Bottomley, Mike Snitzer, Martin K. Petersen,
	linux-scsi@vger.kernel.org

>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:

Jens> Yes, from a correctness point of view it doesn't matter, but when
Jens> people go looking up fixes for whatever reason, it's much better
Jens> to include such a fix in the original patch so it's not missed.

I have talked to a few standards people today. They are of the opinion
that the device's usage of the physical block exponent is incorrect. And
that the device must provide the Block Limits and the TP VPD if thin
provisioning is enabled.

However, devices with 8KiB physical blocks are shipping and 16KiB ditto
are right around the corner.  Which says to me that it's important to
report the correct thing to userland so we can cause allocators to align
on the right boundaries, etc. If we artificially clamp the physical
block size parameter in the kernel we are losing information. Note that
there are no kernel users of the physical block size parameter at all.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-27 17:20   ` Martin K. Petersen
@ 2010-09-27 22:21     ` Jens Axboe
  2010-09-27 22:36       ` Martin K. Petersen
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2010-09-27 22:21 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: James.Bottomley@hansenpartnership.com, snitzer@redhat.com,
	linux-scsi@vger.kernel.org

On 2010-09-28 02:20, Martin K. Petersen wrote:
>>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:
> 
> Jens> Fixing the overflow aside, I question the validity of setting the
> Jens> physical block size to something larger than PAGE_SIZE as there's
> Jens> no way that that could really work in the current kernel.
> 
> Jens> I would suggest doing something similar as we do with other
> Jens> 'invalid' settings that we cannot honor, print a warning and drop
> Jens> the queue limits to PAGE_SIZE.
> 
> Mike and I have been bouncing this back and forth on this the last
> couple of days.
> 
> I totally agree with enforcing hard limits like the logical blocks size
> inside the kernel. We have to. But the physical block size is just an
> I/O hint. And consequently I prefer mkfs to do sanity checking in this
> case and not the kernel. We report the topology truthfully and then
> userland can treat the data as it sees fit.

So it's just the hint, not the actual hardware sector size. The naming
is pretty bad on that, physical and logic... So that does look better,
but in that case I don't think that sd should dump a warning.

Does mkfs do the right thing?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-27 22:14         ` Martin K. Petersen
@ 2010-09-27 22:24           ` Jens Axboe
  2010-09-28 18:48             ` Martin K. Petersen
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2010-09-27 22:24 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: James Bottomley, Mike Snitzer, linux-scsi@vger.kernel.org

On 2010-09-28 07:14, Martin K. Petersen wrote:
>>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:
> 
> Jens> Yes, from a correctness point of view it doesn't matter, but when
> Jens> people go looking up fixes for whatever reason, it's much better
> Jens> to include such a fix in the original patch so it's not missed.
> 
> I have talked to a few standards people today. They are of the opinion
> that the device's usage of the physical block exponent is incorrect. And
> that the device must provide the Block Limits and the TP VPD if thin
> provisioning is enabled.
> 
> However, devices with 8KiB physical blocks are shipping and 16KiB ditto
> are right around the corner.  Which says to me that it's important to
> report the correct thing to userland so we can cause allocators to align
> on the right boundaries, etc. If we artificially clamp the physical
> block size parameter in the kernel we are losing information. Note that
> there are no kernel users of the physical block size parameter at all.

With the revised understanding that this is purely the IO hint, then yes
I agree we should not clamp it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-27 22:21     ` Jens Axboe
@ 2010-09-27 22:36       ` Martin K. Petersen
  2010-09-27 23:15         ` Mike Snitzer
  0 siblings, 1 reply; 36+ messages in thread
From: Martin K. Petersen @ 2010-09-27 22:36 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, James.Bottomley@hansenpartnership.com,
	snitzer@redhat.com, linux-scsi@vger.kernel.org

>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:

Jens> So it's just the hint, not the actual hardware sector size. The
Jens> naming is pretty bad on that, physical and logic... 

This is the official T10/T13 terminology. Logical block size is the unit
used when talking to the device via ACS/SBC. Physical block size is the
unit used by the device internally. There's a pretty vague description
of what that means in SBC. But it was obviously conceived mainly to
handle disk drives with 512-byte logical and 4096-byte physical
(hardware) sectors.

The notion of using the physical block size as an indicator of erase
block size is something this unknown vendor came up with. And 1MB is
quite the ways out of range for what we expect in this field.

But my point is that devices with physical block sizes bigger than the
system page size are already out there.  So we need to do the right
thing here. And since this is not something affecting runtime behavior
I'm in favor of deferring the decision what to do about it to userland.


Jens> So that does look better, but in that case I don't think that sd
Jens> should dump a warning.

That's fine by me.


Jens> Does mkfs do the right thing?

Depends on which mkfs it is. Mike has tested things and can chip in
here...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-27 22:36       ` Martin K. Petersen
@ 2010-09-27 23:15         ` Mike Snitzer
  2010-09-28  4:30           ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Mike Snitzer @ 2010-09-27 23:15 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Jens Axboe, James.Bottomley@hansenpartnership.com,
	linux-scsi@vger.kernel.org, linux-ext4

On Mon, Sep 27 2010 at  6:36pm -0400,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:
> Jens> Does mkfs do the right thing?
> 
> Depends on which mkfs it is. Mike has tested things and can chip in
> here...

I haven't test all mkfs.* but...

mkfs.xfs just works with 1M physical_block_size.

mkfs.ext4 won't by default but -F "fixes" that:

# mkfs.ext4 -b 4096 -F /dev/mapper/20017380023360006
mke2fs 1.41.12 (17-May-2010)
Warning: specified blocksize 4096 is less than device physical sectorsize 1048576, forced to continue
...

I'll check fdisk and parted tomorrow (I know lvm2 doesn't look at
physical_block_size).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-27 23:15         ` Mike Snitzer
@ 2010-09-28  4:30           ` Jens Axboe
  2010-09-28  5:20             ` Eric Sandeen
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2010-09-28  4:30 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Martin K. Petersen, James.Bottomley@hansenpartnership.com,
	linux-scsi@vger.kernel.org, linux-ext4@vger.kernel.org

On 2010-09-28 08:15, Mike Snitzer wrote:
> On Mon, Sep 27 2010 at  6:36pm -0400,
> Martin K. Petersen <martin.petersen@oracle.com> wrote:
> 
>>>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:
>> Jens> Does mkfs do the right thing?
>>
>> Depends on which mkfs it is. Mike has tested things and can chip in
>> here...
> 
> I haven't test all mkfs.* but...
> 
> mkfs.xfs just works with 1M physical_block_size.
> 
> mkfs.ext4 won't by default but -F "fixes" that:
> 
> # mkfs.ext4 -b 4096 -F /dev/mapper/20017380023360006
> mke2fs 1.41.12 (17-May-2010)
> Warning: specified blocksize 4096 is less than device physical sectorsize 1048576, forced to continue

OK, so that's not exactly doing the right thing, but at least you can
work around it with a parameter. So I'd say that is good enough.

> I'll check fdisk and parted tomorrow (I know lvm2 doesn't look at
> physical_block_size).

Thanks!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-28  4:30           ` Jens Axboe
@ 2010-09-28  5:20             ` Eric Sandeen
  2010-09-28 14:15               ` Mike Snitzer
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Sandeen @ 2010-09-28  5:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mike Snitzer, Martin K. Petersen,
	James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org,
	linux-ext4@vger.kernel.org

Jens Axboe wrote:
> On 2010-09-28 08:15, Mike Snitzer wrote:
>> On Mon, Sep 27 2010 at  6:36pm -0400,
>> Martin K. Petersen <martin.petersen@oracle.com> wrote:
>>
>>>>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:
>>> Jens> Does mkfs do the right thing?
>>>
>>> Depends on which mkfs it is. Mike has tested things and can chip in
>>> here...
>> I haven't test all mkfs.* but...
>>
>> mkfs.xfs just works with 1M physical_block_size.
>>
>> mkfs.ext4 won't by default but -F "fixes" that:
>>
>> # mkfs.ext4 -b 4096 -F /dev/mapper/20017380023360006
>> mke2fs 1.41.12 (17-May-2010)
>> Warning: specified blocksize 4096 is less than device physical sectorsize 1048576, forced to continue
> 
> OK, so that's not exactly doing the right thing, but at least you can
> work around it with a parameter. So I'd say that is good enough.

Which part of it is the wrong thing...?

Today mkfs.ext4 refuses to create an fs blocksize which is smaller than logical
or physical by default, because one is suboptimal and the other is impossible.
-F (force) can override the suboptimal fs blocksize < logical blocksize case...

Should we change something?

Thanks,
-Eric

>> I'll check fdisk and parted tomorrow (I know lvm2 doesn't look at
>> physical_block_size).
> 
> Thanks!
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-28  5:20             ` Eric Sandeen
@ 2010-09-28 14:15               ` Mike Snitzer
  2010-09-28 20:57                 ` Ted Ts'o
  0 siblings, 1 reply; 36+ messages in thread
From: Mike Snitzer @ 2010-09-28 14:15 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jens Axboe, Martin K. Petersen,
	James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org,
	linux-ext4@vger.kernel.org

On Tue, Sep 28 2010 at  1:20am -0400,
Eric Sandeen <sandeen@redhat.com> wrote:

> Jens Axboe wrote:
> > On 2010-09-28 08:15, Mike Snitzer wrote:
> >> On Mon, Sep 27 2010 at  6:36pm -0400,
> >> Martin K. Petersen <martin.petersen@oracle.com> wrote:
> >>
> >>>>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:
> >>> Jens> Does mkfs do the right thing?
> >>>
> >>> Depends on which mkfs it is. Mike has tested things and can chip in
> >>> here...
> >> I haven't test all mkfs.* but...
> >>
> >> mkfs.xfs just works with 1M physical_block_size.
> >>
> >> mkfs.ext4 won't by default but -F "fixes" that:
> >>
> >> # mkfs.ext4 -b 4096 -F /dev/mapper/20017380023360006
> >> mke2fs 1.41.12 (17-May-2010)
> >> Warning: specified blocksize 4096 is less than device physical sectorsize 1048576, forced to continue
> > 
> > OK, so that's not exactly doing the right thing, but at least you can
> > work around it with a parameter. So I'd say that is good enough.
> 
> Which part of it is the wrong thing...?
> 
> Today mkfs.ext4 refuses to create an fs blocksize which is smaller than logical
> or physical by default, because one is suboptimal and the other is impossible.
> -F (force) can override the suboptimal fs blocksize < logical blocksize case...

Actually, -F allows one to override fs blocksize < physical_block_size.

In this instance we have the following:
# cat /sys/block/dm-2/queue/physical_block_size 
1048576
# cat /sys/block/dm-2/queue/logical_block_size 
512
 
> Should we change something?

Unclear.  I could see maybe automatically capping the fs block size at
4096 if physical_block_size is larger and is a multiple of 4096?

> >> I'll check fdisk and parted tomorrow (I know lvm2 doesn't look at
> >> physical_block_size).

Both fdisk and parted look good (partitions are physical_block_size
aligned, will warn if you attempt to stray from that alignment).  I'll
spare you detials of the creation steps...

Results of fdisk:
-----------------

# fdisk /dev/sdb
...
The device presents a logical sector size that is smaller than
the physical sector size. Aligning to a physical sector (or optimal
I/O) size boundary is recommended, or performance may be impacted.
...

# fdisk -l -u /dev/sdb

Disk /dev/sdb: 17.2 GB, 17179869184 bytes
255 heads, 63 sectors/track, 2088 cylinders, total 33554432 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 1048576 bytes
I/O size (minimum/optimal): 1048576 bytes / 1048576 bytes
Disk identifier: 0x0009bf46

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048    16775167     8386560   83  Linux


Results of parted:
------------------
Also looks good, doesn't care about physical_block_size.  Is more
concerned with {minimum,optimal}_io_size.

(parted) unit MiB
(parted) p
Model: XXXXXXXXXXXXX
Disk /dev/sdb: 16384MiB
Sector size (logical/physical): 512B/1048576B
Partition Table: msdos

Number  Start    End      Size     Type     File system  Flags
 1      1.00MiB  8191MiB  8190MiB  primary

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-27 22:24           ` Jens Axboe
@ 2010-09-28 18:48             ` Martin K. Petersen
  2010-09-28 18:54               ` Mike Snitzer
  0 siblings, 1 reply; 36+ messages in thread
From: Martin K. Petersen @ 2010-09-28 18:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, James Bottomley, Mike Snitzer,
	linux-scsi@vger.kernel.org

>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:

Jens> With the revised understanding that this is purely the IO hint,
Jens> then yes I agree we should not clamp it.

Ok, here's an updated sd patch that does not print a warning...


sd: Fix overflow with big physical blocks

The hw_sector_size variable could overflow if a device reported huge
physical blocks.  Switch to the more accurate physical_block_size
terminology and make sure we use an unsigned int to match the range
permitted by READ CAPACITY(16).

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index ffa0689..5446c4d 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1554,7 +1554,7 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp,
 	}
 
 	/* Logical blocks per physical block exponent */
-	sdkp->hw_sector_size = (1 << (buffer[13] & 0xf)) * sector_size;
+	sdkp->physical_block_size = (1 << (buffer[13] & 0xf)) * sector_size;
 
 	/* Lowest aligned logical block */
 	alignment = ((buffer[14] & 0x3f) << 8 | buffer[15]) * sector_size;
@@ -1567,7 +1567,7 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp,
 		struct request_queue *q = sdp->request_queue;
 
 		sdkp->thin_provisioning = 1;
-		q->limits.discard_granularity = sdkp->hw_sector_size;
+		q->limits.discard_granularity = sdkp->physical_block_size;
 		q->limits.max_discard_sectors = 0xffffffff;
 
 		if (buffer[14] & 0x40) /* TPRZ */
@@ -1635,7 +1635,7 @@ static int read_capacity_10(struct scsi_disk *sdkp, struct scsi_device *sdp,
 	}
 
 	sdkp->capacity = lba + 1;
-	sdkp->hw_sector_size = sector_size;
+	sdkp->physical_block_size = sector_size;
 	return sector_size;
 }
 
@@ -1756,10 +1756,10 @@ got_data:
 				  (unsigned long long)sdkp->capacity,
 				  sector_size, cap_str_10, cap_str_2);
 
-			if (sdkp->hw_sector_size != sector_size)
+			if (sdkp->physical_block_size != sector_size)
 				sd_printk(KERN_NOTICE, sdkp,
 					  "%u-byte physical blocks\n",
-					  sdkp->hw_sector_size);
+					  sdkp->physical_block_size);
 		}
 	}
 
@@ -1773,7 +1773,8 @@ got_data:
 	else if (sector_size == 256)
 		sdkp->capacity >>= 1;
 
-	blk_queue_physical_block_size(sdp->request_queue, sdkp->hw_sector_size);
+	blk_queue_physical_block_size(sdp->request_queue,
+				      sdkp->physical_block_size);
 	sdkp->device->sector_size = sector_size;
 }
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index f81a930..f947140 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -50,7 +50,7 @@ struct scsi_disk {
 	atomic_t	openers;
 	sector_t	capacity;	/* size in 512-byte sectors */
 	u32		index;
-	unsigned short	hw_sector_size;
+	unsigned int	physical_block_size;
 	u8		media_present;
 	u8		write_prot;
 	u8		protection_type;/* Data Integrity Field */

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-28 18:48             ` Martin K. Petersen
@ 2010-09-28 18:54               ` Mike Snitzer
  0 siblings, 0 replies; 36+ messages in thread
From: Mike Snitzer @ 2010-09-28 18:54 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Jens Axboe, James Bottomley, linux-scsi@vger.kernel.org

On Tue, Sep 28 2010 at  2:48pm -0400,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes:
> 
> Jens> With the revised understanding that this is purely the IO hint,
> Jens> then yes I agree we should not clamp it.
> 
> Ok, here's an updated sd patch that does not print a warning...
> 
> 
> sd: Fix overflow with big physical blocks
> 
> The hw_sector_size variable could overflow if a device reported huge
> physical blocks.  Switch to the more accurate physical_block_size
> terminology and make sure we use an unsigned int to match the range
> permitted by READ CAPACITY(16).
> 
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

Acked-by: Mike Snitzer <snitzer@redhat.com>

Thanks Martin.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-28 14:15               ` Mike Snitzer
@ 2010-09-28 20:57                 ` Ted Ts'o
  2010-09-28 21:24                   ` Martin K. Petersen
  0 siblings, 1 reply; 36+ messages in thread
From: Ted Ts'o @ 2010-09-28 20:57 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Eric Sandeen, Jens Axboe, Martin K. Petersen,
	James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org,
	linux-ext4@vger.kernel.org

On Tue, Sep 28, 2010 at 10:15:45AM -0400, Mike Snitzer wrote:
> Actually, -F allows one to override fs blocksize < physical_block_size.
> 
> In this instance we have the following:
> # cat /sys/block/dm-2/queue/physical_block_size 
> 1048576
> # cat /sys/block/dm-2/queue/logical_block_size 
> 512
>  
> > Should we change something?
> 
> Unclear.  I could see maybe automatically capping the fs block size at
> 4096 if physical_block_size is larger and is a multiple of 4096?

Can we decide soon what the right thing should be?  I'm about to
release e2fsrogs 1.41.13, and if I should put in some sanity checking
code so mke2fs does something sane when it sees a 1M physical block
size, I can do that.

Or if the kernel is going to do that, it's fine too....

      	  	    	     	      	   - Ted

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-28 20:57                 ` Ted Ts'o
@ 2010-09-28 21:24                   ` Martin K. Petersen
  2010-09-28 21:36                     ` Eric Sandeen
  0 siblings, 1 reply; 36+ messages in thread
From: Martin K. Petersen @ 2010-09-28 21:24 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Mike Snitzer, Eric Sandeen, Jens Axboe, Martin K. Petersen,
	James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org,
	linux-ext4@vger.kernel.org

>>>>> "Ted" == Ted Ts'o <tytso@mit.edu> writes:

Ted> Can we decide soon what the right thing should be?  I'm about to
Ted> release e2fsrogs 1.41.13, and if I should put in some sanity
Ted> checking code so mke2fs does something sane when it sees a 1M
Ted> physical block size, I can do that.

I don't think it's entirely clear what the "right thing" would be.

Let's ignore the 1MB block size for now. That's clearly a fluke and a
buggy device. But there are SSDs that will advertise an 8KiB physical
block size. And apparently 16KiB devices are in the pipeline.

How do we want to handle these devices? Allowing blocks bigger than the
page size is going to be painful.

So the question is whether we can tweak the filesystem layout in a way
that would alleviate the pain without having to change the filesystem
block size in the traditional sense.

At least we're talking about SSDs and arrays here. I assume the partial
block write penalty for these devices would be smaller than it is for
rotating media.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-28 21:24                   ` Martin K. Petersen
@ 2010-09-28 21:36                     ` Eric Sandeen
  2010-09-30 16:30                       ` Ted Ts'o
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Sandeen @ 2010-09-28 21:36 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Ted Ts'o, Mike Snitzer, Jens Axboe,
	James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org,
	linux-ext4@vger.kernel.org

Martin K. Petersen wrote:

>>>>>> "Ted" == Ted Ts'o <tytso@mit.edu> writes:
>>>>>>             
>
> Ted> Can we decide soon what the right thing should be?  I'm about to
> Ted> release e2fsrogs 1.41.13, and if I should put in some sanity
> Ted> checking code so mke2fs does something sane when it sees a 1M
> Ted> physical block size, I can do that.
>
> I don't think it's entirely clear what the "right thing" would be.
>
> Let's ignore the 1MB block size for now. That's clearly a fluke and a
> buggy device. But there are SSDs that will advertise an 8KiB physical
> block size. And apparently 16KiB devices are in the pipeline.
>   
Ok, then it sounds like mkfs.ext4's refusal to make fs blocksize less
than device physical sectorsize without -F is broken, and that should
be removed.  I'd say issue a warning in the case but if there's a 16k
physical device maybe there's no point in warning either?

> How do we want to handle these devices? Allowing blocks bigger than the
> page size is going to be painful.
>
> So the question is whether we can tweak the filesystem layout in a way
> that would alleviate the pain without having to change the filesystem
> block size in the traditional sense.
>
> At least we're talking about SSDs and arrays here. I assume the partial
> block write penalty for these devices would be smaller than it is for
> rotating media.
>
>   
I guess it must be.

Anyway here's a patch to remove the force requirement and just give the
user whatever they want, since apparently we can't avoid fs blocksize
less than physical sector size in general.  It does still warn
that the fs blocksize is less than physical sectorsize, but *shrug*


diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index add7c0c..6010fc1 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -1634,17 +1634,15 @@ static void PRS(int argc, char *argv[])
 					ext2fs_blocks_count(&fs_param) /
 					(blocksize / 1024));
 	} else {
-		if (blocksize < lsector_size ||			/* Impossible */
-		    (!force && (blocksize < psector_size))) {	/* Suboptimal */
+		if (blocksize < lsector_size) {			/* Impossible */
 			com_err(program_name, EINVAL,
 				_("while setting blocksize; too small "
 				  "for device\n"));
 			exit(1);
-		} else if (blocksize < psector_size) {
+		} else if (blocksize < psector_size) {		/* Suboptimal */
 			fprintf(stderr, _("Warning: specified blocksize %d is "
-				"less than device physical sectorsize %d, "
-				"forced to continue\n"), blocksize,
-				psector_size);
+				"less than device physical sectorsize %d\n")
+				blocksize, psector_size);
 		}
 	}
 

-Eric



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-28 21:36                     ` Eric Sandeen
@ 2010-09-30 16:30                       ` Ted Ts'o
  2010-09-30 17:07                         ` Eric Sandeen
       [not found]                         ` <4CA4C3B6.9000104@redhat.com>
  0 siblings, 2 replies; 36+ messages in thread
From: Ted Ts'o @ 2010-09-30 16:30 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Martin K. Petersen, Mike Snitzer, Jens Axboe,
	James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org,
	linux-ext4@vger.kernel.org

On Tue, Sep 28, 2010 at 04:36:42PM -0500, Eric Sandeen wrote:
> Ok, then it sounds like mkfs.ext4's refusal to make fs blocksize less
> than device physical sectorsize without -F is broken, and that should
> be removed.  I'd say issue a warning in the case but if there's a 16k
> physical device maybe there's no point in warning either?

If the device physical sectorsize is that big, should we perhaps use
that as a hint to align writes to that blocks aligned with that
physical sectorsize?  Right now we use the optimal I/O size, but if
the optimal I/O size is not specified and the physical sectorsize is,
say, 16k or 32k, maybe we should use to calculate for
s_raid_stripe_width?

						- Ted

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-30 16:30                       ` Ted Ts'o
@ 2010-09-30 17:07                         ` Eric Sandeen
       [not found]                         ` <4CA4C3B6.9000104@redhat.com>
  1 sibling, 0 replies; 36+ messages in thread
From: Eric Sandeen @ 2010-09-30 17:07 UTC (permalink / raw)
  To: Ted Ts'o, Martin K. Petersen, Mike Snitzer, Jens Axboe,
	James.Bottomley

On 09/30/2010 11:30 AM, Ted Ts'o wrote:
> On Tue, Sep 28, 2010 at 04:36:42PM -0500, Eric Sandeen wrote:
>> Ok, then it sounds like mkfs.ext4's refusal to make fs blocksize less
>> than device physical sectorsize without -F is broken, and that should
>> be removed.  I'd say issue a warning in the case but if there's a 16k
>> physical device maybe there's no point in warning either?
> 
> If the device physical sectorsize is that big, should we perhaps use
> that as a hint to align writes to that blocks aligned with that
> physical sectorsize?  Right now we use the optimal I/O size, but if
> the optimal I/O size is not specified and the physical sectorsize is,

I can't keep track of all the parameters, is it ever true that optimal
I/O size isn't specified?

> say, 16k or 32k, maybe we should use to calculate for
> s_raid_stripe_width?

Perhaps, though really ext4 still doesn't do -that- much with the value,
anyway...

-Eric

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
       [not found]                         ` <4CA4C3B6.9000104@redhat.com>
@ 2010-09-30 17:33                           ` Mike Snitzer
  2010-10-01 14:24                             ` Ted Ts'o
  0 siblings, 1 reply; 36+ messages in thread
From: Mike Snitzer @ 2010-09-30 17:33 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Ted Ts'o, Martin K. Petersen, Jens Axboe,
	James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org,
	linux-ext4@vger.kernel.org

On Thu, Sep 30 2010 at  1:07pm -0400,
Eric Sandeen <sandeen@redhat.com> wrote:

> On 09/30/2010 11:30 AM, Ted Ts'o wrote:
> > On Tue, Sep 28, 2010 at 04:36:42PM -0500, Eric Sandeen wrote:
> >> Ok, then it sounds like mkfs.ext4's refusal to make fs blocksize less
> >> than device physical sectorsize without -F is broken, and that should
> >> be removed.  I'd say issue a warning in the case but if there's a 16k
> >> physical device maybe there's no point in warning either?
> > 
> > If the device physical sectorsize is that big, should we perhaps use
> > that as a hint to align writes to that blocks aligned with that
> > physical sectorsize?  Right now we use the optimal I/O size, but if
> > the optimal I/O size is not specified and the physical sectorsize is,
> 
> I can't keep track of all the parameters, is it ever true that optimal
> I/O size isn't specified?

Yes optimal_io_size may be 0.  But minimum_io_size will always be scaled
up to at least match physical_block_size.

In any case: this 1MB physical_block_size device, which started this
thread, also has 1MB for both minimum_io_size and optimal_io_size.

Mike

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-09-30 17:33                           ` Mike Snitzer
@ 2010-10-01 14:24                             ` Ted Ts'o
  2010-10-01 22:19                               ` Martin K. Petersen
  0 siblings, 1 reply; 36+ messages in thread
From: Ted Ts'o @ 2010-10-01 14:24 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Eric Sandeen, Martin K. Petersen, Jens Axboe,
	James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org,
	linux-ext4@vger.kernel.org

On Thu, Sep 30, 2010 at 01:33:43PM -0400, Mike Snitzer wrote:
> 
> Yes optimal_io_size may be 0.  But minimum_io_size will always be scaled
> up to at least match physical_block_size.

Woah!  Are we sure we want to do that?  According to Jens, 8k physical
blockes are here already and 16k physical blocks sizes are right
around the corner.  If we scale minimum_io_size up to the physical
block size, then even though these devices will have 512 or 4k logical
block sizes, minimum_io_size will be 16k?  That sounds wrong,
incorrect, and given that the Linux VM can't handle file system block
sizes greater than page size.  And if we scale the minimum_io_size to
the physical block size, mke2fs will refuse to create a 4k blocksize
filesystem --- since presumably "minimum io size" means we can't do
I/O's smaller than that.

Please tell me you meant to say __logical__ blocksize above?

Or am I misunderstanding what you meant?

						- Ted

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-10-01 14:24                             ` Ted Ts'o
@ 2010-10-01 22:19                               ` Martin K. Petersen
  2010-10-02  2:31                                 ` Ted Ts'o
  0 siblings, 1 reply; 36+ messages in thread
From: Martin K. Petersen @ 2010-10-01 22:19 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Mike Snitzer, Eric Sandeen, Martin K. Petersen, Jens Axboe,
	James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org,
	linux-ext4@vger.kernel.org

>>>>> "Ted" == Ted Ts'o <tytso@mit.edu> writes:

Ted> If we scale minimum_io_size up to the physical block size, then
Ted> even though these devices will have 512 or 4k logical block sizes,
Ted> minimum_io_size will be 16k?  That sounds wrong, incorrect, and
Ted> given that the Linux VM can't handle file system block sizes
Ted> greater than page size.  And if we scale the minimum_io_size to the
Ted> physical block size, mke2fs will refuse to create a 4k blocksize
Ted> filesystem --- since presumably "minimum io size" means we can't do
Ted> I/O's smaller than that.

logical <= physical <= minimum 

	logical is the smallest unit we can address.  Usually 512 bytes.

        physical is the allocation unit the device claims to use
        internally.  Typically 512 or 4096.  8 and 16 KiB coming.

        minimal is the device's preferred minimum random I/O unit.  This
        is usually identical to the physical block size.  Arrays might
        report a multiple of the physical block size here (stripe chunk
        size).

        optimal (if provided) is the preferred sequential I/O unit and a
        multiple of minimal (stripe width).

The logical and physical parameters are device protocol-centric values.
The minimum and optimal I/O sizes are the two "soft" values that
filesystems should be looking at for layout hints.

A filesystem should use minimal as a cue for block size and optimal as a
cue for stripe width.  minimum may indeed be bigger than page size and
this discussion was started to figure out if there were thing we could
do to accommodate these device without actually changing the filesystem
block size in the traditional sense.

Since not all drives guarantee that read-modify-write cycle on a 4 KiB
physical block won't clobber adjacent 512-byte logical blocks it may be
a good idea to look at physical block size if there are atomicity
concerns.  I.e. filesystems that depend on atomic journal writes may
want to look at the reported physical block size.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-10-01 22:19                               ` Martin K. Petersen
@ 2010-10-02  2:31                                 ` Ted Ts'o
  2010-10-04 19:49                                   ` Martin K. Petersen
  0 siblings, 1 reply; 36+ messages in thread
From: Ted Ts'o @ 2010-10-02  2:31 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Mike Snitzer, Eric Sandeen, Jens Axboe,
	James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org,
	linux-ext4@vger.kernel.org

On Fri, Oct 01, 2010 at 06:19:21PM -0400, Martin K. Petersen wrote:
> Since not all drives guarantee that read-modify-write cycle on a 4 KiB
> physical block won't clobber adjacent 512-byte logical blocks it may be
> a good idea to look at physical block size if there are atomicity
> concerns.  I.e. filesystems that depend on atomic journal writes may
> want to look at the reported physical block size.

OK, but what do we do when we start seeing devices with 8k or 16k
physical block sizes?  The VM doesn't deal well with block sizes >
page size.

					- Ted

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: I/O topology fixes for big physical block size
  2010-10-02  2:31                                 ` Ted Ts'o
@ 2010-10-04 19:49                                   ` Martin K. Petersen
  0 siblings, 0 replies; 36+ messages in thread
From: Martin K. Petersen @ 2010-10-04 19:49 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Martin K. Petersen, Mike Snitzer, Eric Sandeen, Jens Axboe,
	James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org,
	linux-ext4@vger.kernel.org

>>>>> "Ted" == Ted Ts'o <tytso@mit.edu> writes:

Ted,

Ted> OK, but what do we do when we start seeing devices with 8k or 16k
Ted> physical block sizes?  The VM doesn't deal well with block sizes >
Ted> page size.

I don't think we're going to see devices reporting logical blocks bigger
than 4KiB anytime soon.  Too much pain for everybody in the industry
(most other operating systems can't even deal with 4KiB logical blocks
yet).  Eventually we will have to do the required page cache surgery to
support filesystem block sizes bigger than the page size.  But I don't
think that's something we'll have to deal with in the immediate future.

In the meantime, however, the question is whether there is something we
can do in the allocators to mitigate effects of devices reporting
physical blocks bigger than PAGE_CACHE_SIZE.  Obviously this would be in
the I/O hint/alignment category and not something which would guarantee
that all writes would be aligned multiples of that physical block size.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] block: Ensure physical block size is unsigned int
  2010-09-27 17:40   ` Mike Snitzer
@ 2010-10-08  5:15     ` Martin K. Petersen
  2010-10-13 19:12       ` Mike Snitzer
  0 siblings, 1 reply; 36+ messages in thread
From: Martin K. Petersen @ 2010-10-08  5:15 UTC (permalink / raw)
  To: jaxboe, James.Bottomley; +Cc: Mike Snitzer, linux-scsi

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

>> Physical block size was declared unsigned int to accomodate the
>> maximum size reported by READ CAPACITY(16).  Make sure we use the
>> right type in the related functions.
>> 
>> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

Mike> Acked-by: Mike Snitzer <snitzer@redhat.com>

Jens, ping on the block fix.

James, ping on the sd ditto.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] block: Ensure physical block size is unsigned int
  2010-10-08  5:15     ` Martin K. Petersen
@ 2010-10-13 19:12       ` Mike Snitzer
  2010-10-13 19:15         ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Mike Snitzer @ 2010-10-13 19:12 UTC (permalink / raw)
  To: jaxboe; +Cc: Martin K. Petersen, James.Bottomley, linux-scsi

On Fri, Oct 08 2010 at  1:15am -0400,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> 
> >> Physical block size was declared unsigned int to accomodate the
> >> maximum size reported by READ CAPACITY(16).  Make sure we use the
> >> right type in the related functions.
> >> 
> >> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> 
> Mike> Acked-by: Mike Snitzer <snitzer@redhat.com>
> 
> Jens, ping on the block fix.
> 
> James, ping on the sd ditto.

Hi Jens,

James had added the sd fix to his scsi-misc-2.6 tree:
http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=526f7c7950b

Any chance you can stage the block fix for-2.6.37?

(guess its too late for 2.6.36... maybe cc: stable when adding to the
block fix to your tree?  with a requires on the sd fix?)

Thanks,
Mike

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] block: Ensure physical block size is unsigned int
  2010-10-13 19:12       ` Mike Snitzer
@ 2010-10-13 19:15         ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2010-10-13 19:15 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Martin K. Petersen, James.Bottomley@hansenpartnership.com,
	linux-scsi@vger.kernel.org

On 2010-10-13 21:12, Mike Snitzer wrote:
> On Fri, Oct 08 2010 at  1:15am -0400,
> Martin K. Petersen <martin.petersen@oracle.com> wrote:
> 
>>>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
>>
>>>> Physical block size was declared unsigned int to accomodate the
>>>> maximum size reported by READ CAPACITY(16).  Make sure we use the
>>>> right type in the related functions.
>>>>
>>>> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
>>
>> Mike> Acked-by: Mike Snitzer <snitzer@redhat.com>
>>
>> Jens, ping on the block fix.
>>
>> James, ping on the sd ditto.
> 
> Hi Jens,
> 
> James had added the sd fix to his scsi-misc-2.6 tree:
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=526f7c7950b
> 
> Any chance you can stage the block fix for-2.6.37?
> 
> (guess its too late for 2.6.36... maybe cc: stable when adding to the
> block fix to your tree?  with a requires on the sd fix?)

Yep will do, and mark for stable.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: block: eliminate potential for infinite loop in blkdev_issue_discard
  2010-09-27 18:13   ` [PATCH] block: eliminate potential for infinite loop in blkdev_issue_discard Mike Snitzer
@ 2010-10-14 21:37     ` Mike Snitzer
  2010-10-15 11:05       ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Mike Snitzer @ 2010-10-14 21:37 UTC (permalink / raw)
  To: jaxboe; +Cc: Martin K. Petersen, dm-devel, linux-scsi

On Mon, Sep 27 2010 at  2:13pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> Due to the recently identified overflow in read_capacity_16() it was
> possible for max_discard_sectors to be zero but still have discards
> enabled on the associated device's queue.
> 
> Eliminate the possibility for blkdev_issue_discard to infinitely loop.
> 
> Interestingly this issue wasn't identified until a device, whose
> discard_granularity was 0 due to read_capacity_16 overflow, was consumed
> by blk_stack_limits() to construct limits for a higher-level DM
> multipath device.  The multipath device's resulting limits never had the
> discard limits stacked because blk_stack_limits() will only do so if
> the bottom device's discard_granularity != 0.  This resulted in the
> multipath device's limits.max_discard_sectors being 0.

Hi Jens,

This patch would only serve as a future safety-net now that the
elimination of the overflow in read_capacity_16() has been staged for
2.6.37.  Defensive programming and all...

What do you (and others) think about this patch?

Thanks,
Mike

> ---
>  block/blk-lib.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index c392029..186f249 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -56,7 +56,10 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  	 * granularity
>  	 */
>  	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
> -	if (q->limits.discard_granularity) {
> +	if (unlikely(!max_discard_sectors)) {
> +		/* Avoid infinite loop (below) */
> +		return -EOPNOTSUPP;
> +	} else if (q->limits.discard_granularity) {
>  		unsigned int disc_sects = q->limits.discard_granularity >> 9;
>  
>  		max_discard_sectors &= ~(disc_sects - 1);
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: block: eliminate potential for infinite loop in  blkdev_issue_discard
  2010-10-14 21:37     ` Mike Snitzer
@ 2010-10-15 11:05       ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2010-10-15 11:05 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Martin K. Petersen, dm-devel@redhat.com,
	linux-scsi@vger.kernel.org

On 2010-10-14 23:37, Mike Snitzer wrote:
> On Mon, Sep 27 2010 at  2:13pm -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
>> Due to the recently identified overflow in read_capacity_16() it was
>> possible for max_discard_sectors to be zero but still have discards
>> enabled on the associated device's queue.
>>
>> Eliminate the possibility for blkdev_issue_discard to infinitely loop.
>>
>> Interestingly this issue wasn't identified until a device, whose
>> discard_granularity was 0 due to read_capacity_16 overflow, was consumed
>> by blk_stack_limits() to construct limits for a higher-level DM
>> multipath device.  The multipath device's resulting limits never had the
>> discard limits stacked because blk_stack_limits() will only do so if
>> the bottom device's discard_granularity != 0.  This resulted in the
>> multipath device's limits.max_discard_sectors being 0.
> 
> Hi Jens,
> 
> This patch would only serve as a future safety-net now that the
> elimination of the overflow in read_capacity_16() has been staged for
> 2.6.37.  Defensive programming and all...
> 
> What do you (and others) think about this patch?

Agree, may as well play it safe(r).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2010-10-15 11:05 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-09-27 16:41 I/O topology fixes for big physical block size Martin K. Petersen
2010-09-27 16:41 ` [PATCH 1/2] block: Ensure physical block size is unsigned int Martin K. Petersen
2010-09-27 17:40   ` Mike Snitzer
2010-10-08  5:15     ` Martin K. Petersen
2010-10-13 19:12       ` Mike Snitzer
2010-10-13 19:15         ` Jens Axboe
2010-09-27 16:41 ` [PATCH 2/2] sd: Fix overflow with big physical blocks Martin K. Petersen
2010-09-27 17:42   ` Mike Snitzer
2010-09-27 18:13   ` [PATCH] block: eliminate potential for infinite loop in blkdev_issue_discard Mike Snitzer
2010-10-14 21:37     ` Mike Snitzer
2010-10-15 11:05       ` Jens Axboe
2010-09-27 16:54 ` I/O topology fixes for big physical block size Jens Axboe
2010-09-27 17:20   ` Martin K. Petersen
2010-09-27 22:21     ` Jens Axboe
2010-09-27 22:36       ` Martin K. Petersen
2010-09-27 23:15         ` Mike Snitzer
2010-09-28  4:30           ` Jens Axboe
2010-09-28  5:20             ` Eric Sandeen
2010-09-28 14:15               ` Mike Snitzer
2010-09-28 20:57                 ` Ted Ts'o
2010-09-28 21:24                   ` Martin K. Petersen
2010-09-28 21:36                     ` Eric Sandeen
2010-09-30 16:30                       ` Ted Ts'o
2010-09-30 17:07                         ` Eric Sandeen
     [not found]                         ` <4CA4C3B6.9000104@redhat.com>
2010-09-30 17:33                           ` Mike Snitzer
2010-10-01 14:24                             ` Ted Ts'o
2010-10-01 22:19                               ` Martin K. Petersen
2010-10-02  2:31                                 ` Ted Ts'o
2010-10-04 19:49                                   ` Martin K. Petersen
2010-09-27 17:23   ` Mike Snitzer
2010-09-27 21:58     ` James Bottomley
2010-09-27 22:03       ` Jens Axboe
2010-09-27 22:14         ` Martin K. Petersen
2010-09-27 22:24           ` Jens Axboe
2010-09-28 18:48             ` Martin K. Petersen
2010-09-28 18:54               ` Mike Snitzer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).