* I/O topology fixes for big physical block size @ 2010-09-27 16:41 Martin K. Petersen 2010-09-27 16:41 ` [PATCH 1/2] block: Ensure physical block size is unsigned int Martin K. Petersen ` (2 more replies) 0 siblings, 3 replies; 36+ messages in thread From: Martin K. Petersen @ 2010-09-27 16:41 UTC (permalink / raw) To: jaxboe, James.Bottomley, snitzer, linux-scsi Mike Snitzer reported that he has access to a device that supports thin provisioning but does not use the Block Limits VPD page to indicate discard granularity. Instead it reports a huge (1MB) physical block size. That caused a bit of fallout in the topology stack which assumed a physical block size of 4KiB or less. [PATCH 1/2] block: Ensure physical block size is unsigned int [PATCH 2/2] sd: Fix overflow with big physical blocks ^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 1/2] block: Ensure physical block size is unsigned int 2010-09-27 16:41 I/O topology fixes for big physical block size Martin K. Petersen @ 2010-09-27 16:41 ` Martin K. Petersen 2010-09-27 17:40 ` Mike Snitzer 2010-09-27 16:41 ` [PATCH 2/2] sd: Fix overflow with big physical blocks Martin K. Petersen 2010-09-27 16:54 ` I/O topology fixes for big physical block size Jens Axboe 2 siblings, 1 reply; 36+ messages in thread From: Martin K. Petersen @ 2010-09-27 16:41 UTC (permalink / raw) To: jaxboe, James.Bottomley, snitzer, linux-scsi; +Cc: Martin K. Petersen Physical block size was declared unsigned int to accomodate the maximum size reported by READ CAPACITY(16). Make sure we use the right type in the related functions. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> --- block/blk-settings.c | 2 +- include/linux/blkdev.h | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/block/blk-settings.c b/block/blk-settings.c index a234f4b..450577d 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -343,7 +343,7 @@ EXPORT_SYMBOL(blk_queue_logical_block_size); * hardware can operate on without reverting to read-modify-write * operations. */ -void blk_queue_physical_block_size(struct request_queue *q, unsigned short size) +void blk_queue_physical_block_size(struct request_queue *q, unsigned int size) { q->limits.physical_block_size = size; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 2c54906..9e443b9 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -851,7 +851,7 @@ extern void blk_queue_max_segment_size(struct request_queue *, unsigned int); extern void blk_queue_max_discard_sectors(struct request_queue *q, unsigned int max_discard_sectors); extern void blk_queue_logical_block_size(struct request_queue *, unsigned short); -extern void blk_queue_physical_block_size(struct request_queue *, unsigned short); +extern void blk_queue_physical_block_size(struct request_queue *, unsigned int); extern void blk_queue_alignment_offset(struct request_queue *q, unsigned int alignment); extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min); @@ -1004,7 +1004,7 @@ static inline unsigned int queue_physical_block_size(struct request_queue *q) return q->limits.physical_block_size; } -static inline int bdev_physical_block_size(struct block_device *bdev) +static inline unsigned int bdev_physical_block_size(struct block_device *bdev) { return queue_physical_block_size(bdev_get_queue(bdev)); } -- 1.7.2.2 ^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH 1/2] block: Ensure physical block size is unsigned int 2010-09-27 16:41 ` [PATCH 1/2] block: Ensure physical block size is unsigned int Martin K. Petersen @ 2010-09-27 17:40 ` Mike Snitzer 2010-10-08 5:15 ` Martin K. Petersen 0 siblings, 1 reply; 36+ messages in thread From: Mike Snitzer @ 2010-09-27 17:40 UTC (permalink / raw) To: Martin K. Petersen; +Cc: jaxboe, James.Bottomley, linux-scsi On Mon, Sep 27 2010 at 12:41pm -0400, Martin K. Petersen <martin.petersen@oracle.com> wrote: > Physical block size was declared unsigned int to accomodate the maximum > size reported by READ CAPACITY(16). Make sure we use the right type in > the related functions. > > Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 1/2] block: Ensure physical block size is unsigned int 2010-09-27 17:40 ` Mike Snitzer @ 2010-10-08 5:15 ` Martin K. Petersen 2010-10-13 19:12 ` Mike Snitzer 0 siblings, 1 reply; 36+ messages in thread From: Martin K. Petersen @ 2010-10-08 5:15 UTC (permalink / raw) To: jaxboe, James.Bottomley; +Cc: Mike Snitzer, linux-scsi >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes: >> Physical block size was declared unsigned int to accomodate the >> maximum size reported by READ CAPACITY(16). Make sure we use the >> right type in the related functions. >> >> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Mike> Acked-by: Mike Snitzer <snitzer@redhat.com> Jens, ping on the block fix. James, ping on the sd ditto. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 1/2] block: Ensure physical block size is unsigned int 2010-10-08 5:15 ` Martin K. Petersen @ 2010-10-13 19:12 ` Mike Snitzer 2010-10-13 19:15 ` Jens Axboe 0 siblings, 1 reply; 36+ messages in thread From: Mike Snitzer @ 2010-10-13 19:12 UTC (permalink / raw) To: jaxboe; +Cc: Martin K. Petersen, James.Bottomley, linux-scsi On Fri, Oct 08 2010 at 1:15am -0400, Martin K. Petersen <martin.petersen@oracle.com> wrote: > >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes: > > >> Physical block size was declared unsigned int to accomodate the > >> maximum size reported by READ CAPACITY(16). Make sure we use the > >> right type in the related functions. > >> > >> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> > > Mike> Acked-by: Mike Snitzer <snitzer@redhat.com> > > Jens, ping on the block fix. > > James, ping on the sd ditto. Hi Jens, James had added the sd fix to his scsi-misc-2.6 tree: http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=526f7c7950b Any chance you can stage the block fix for-2.6.37? (guess its too late for 2.6.36... maybe cc: stable when adding to the block fix to your tree? with a requires on the sd fix?) Thanks, Mike ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 1/2] block: Ensure physical block size is unsigned int 2010-10-13 19:12 ` Mike Snitzer @ 2010-10-13 19:15 ` Jens Axboe 0 siblings, 0 replies; 36+ messages in thread From: Jens Axboe @ 2010-10-13 19:15 UTC (permalink / raw) To: Mike Snitzer Cc: Martin K. Petersen, James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org On 2010-10-13 21:12, Mike Snitzer wrote: > On Fri, Oct 08 2010 at 1:15am -0400, > Martin K. Petersen <martin.petersen@oracle.com> wrote: > >>>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes: >> >>>> Physical block size was declared unsigned int to accomodate the >>>> maximum size reported by READ CAPACITY(16). Make sure we use the >>>> right type in the related functions. >>>> >>>> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> >> >> Mike> Acked-by: Mike Snitzer <snitzer@redhat.com> >> >> Jens, ping on the block fix. >> >> James, ping on the sd ditto. > > Hi Jens, > > James had added the sd fix to his scsi-misc-2.6 tree: > http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=526f7c7950b > > Any chance you can stage the block fix for-2.6.37? > > (guess its too late for 2.6.36... maybe cc: stable when adding to the > block fix to your tree? with a requires on the sd fix?) Yep will do, and mark for stable. -- Jens Axboe ^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 2/2] sd: Fix overflow with big physical blocks 2010-09-27 16:41 I/O topology fixes for big physical block size Martin K. Petersen 2010-09-27 16:41 ` [PATCH 1/2] block: Ensure physical block size is unsigned int Martin K. Petersen @ 2010-09-27 16:41 ` Martin K. Petersen 2010-09-27 17:42 ` Mike Snitzer 2010-09-27 18:13 ` [PATCH] block: eliminate potential for infinite loop in blkdev_issue_discard Mike Snitzer 2010-09-27 16:54 ` I/O topology fixes for big physical block size Jens Axboe 2 siblings, 2 replies; 36+ messages in thread From: Martin K. Petersen @ 2010-09-27 16:41 UTC (permalink / raw) To: jaxboe, James.Bottomley, snitzer, linux-scsi; +Cc: Martin K. Petersen The hw_sector_size variable could overflow if a device reported huge physical blocks. Switch to the more accurate physical_block_size terminology and make sure we use an unsigned int to match the range permitted by READ CAPACITY(16). Also print a warning of the physical block size exceeds the page size. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> --- drivers/scsi/sd.c | 19 +++++++++++++------ drivers/scsi/sd.h | 2 +- 2 files changed, 14 insertions(+), 7 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index ffa0689..1c14436 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1554,7 +1554,7 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp, } /* Logical blocks per physical block exponent */ - sdkp->hw_sector_size = (1 << (buffer[13] & 0xf)) * sector_size; + sdkp->physical_block_size = (1 << (buffer[13] & 0xf)) * sector_size; /* Lowest aligned logical block */ alignment = ((buffer[14] & 0x3f) << 8 | buffer[15]) * sector_size; @@ -1567,7 +1567,7 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp, struct request_queue *q = sdp->request_queue; sdkp->thin_provisioning = 1; - q->limits.discard_granularity = sdkp->hw_sector_size; + q->limits.discard_granularity = sdkp->physical_block_size; q->limits.max_discard_sectors = 0xffffffff; if (buffer[14] & 0x40) /* TPRZ */ @@ -1635,7 +1635,7 @@ static int read_capacity_10(struct scsi_disk *sdkp, struct scsi_device *sdp, } sdkp->capacity = lba + 1; - sdkp->hw_sector_size = sector_size; + sdkp->physical_block_size = sector_size; return sector_size; } @@ -1756,10 +1756,16 @@ got_data: (unsigned long long)sdkp->capacity, sector_size, cap_str_10, cap_str_2); - if (sdkp->hw_sector_size != sector_size) + if (sdkp->physical_block_size != sector_size) sd_printk(KERN_NOTICE, sdkp, "%u-byte physical blocks\n", - sdkp->hw_sector_size); + sdkp->physical_block_size); + + if (sdkp->physical_block_size > PAGE_CACHE_SIZE) + sd_printk(KERN_WARNING, sdkp, + "physical block size %u is bigger " + "than system page size\n", + sdkp->physical_block_size); } } @@ -1773,7 +1779,8 @@ got_data: else if (sector_size == 256) sdkp->capacity >>= 1; - blk_queue_physical_block_size(sdp->request_queue, sdkp->hw_sector_size); + blk_queue_physical_block_size(sdp->request_queue, + sdkp->physical_block_size); sdkp->device->sector_size = sector_size; } diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index f81a930..f947140 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -50,7 +50,7 @@ struct scsi_disk { atomic_t openers; sector_t capacity; /* size in 512-byte sectors */ u32 index; - unsigned short hw_sector_size; + unsigned int physical_block_size; u8 media_present; u8 write_prot; u8 protection_type;/* Data Integrity Field */ -- 1.7.2.2 ^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH 2/2] sd: Fix overflow with big physical blocks 2010-09-27 16:41 ` [PATCH 2/2] sd: Fix overflow with big physical blocks Martin K. Petersen @ 2010-09-27 17:42 ` Mike Snitzer 2010-09-27 18:13 ` [PATCH] block: eliminate potential for infinite loop in blkdev_issue_discard Mike Snitzer 1 sibling, 0 replies; 36+ messages in thread From: Mike Snitzer @ 2010-09-27 17:42 UTC (permalink / raw) To: Martin K. Petersen; +Cc: jaxboe, James.Bottomley, linux-scsi On Mon, Sep 27 2010 at 12:41pm -0400, Martin K. Petersen <martin.petersen@oracle.com> wrote: > The hw_sector_size variable could overflow if a device reported huge > physical blocks. Switch to the more accurate physical_block_size > terminology and make sure we use an unsigned int to match the range > permitted by READ CAPACITY(16). > > Also print a warning of the physical block size exceeds the page size. > > Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> (with the understanding that we may/will look to limit physical_block_size to PAGE_SIZE in a follow-on patch). ^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH] block: eliminate potential for infinite loop in blkdev_issue_discard 2010-09-27 16:41 ` [PATCH 2/2] sd: Fix overflow with big physical blocks Martin K. Petersen 2010-09-27 17:42 ` Mike Snitzer @ 2010-09-27 18:13 ` Mike Snitzer 2010-10-14 21:37 ` Mike Snitzer 1 sibling, 1 reply; 36+ messages in thread From: Mike Snitzer @ 2010-09-27 18:13 UTC (permalink / raw) To: Martin K. Petersen; +Cc: jaxboe, James.Bottomley, linux-scsi, dm-devel Due to the recently identified overflow in read_capacity_16() it was possible for max_discard_sectors to be zero but still have discards enabled on the associated device's queue. Eliminate the possibility for blkdev_issue_discard to infinitely loop. Interestingly this issue wasn't identified until a device, whose discard_granularity was 0 due to read_capacity_16 overflow, was consumed by blk_stack_limits() to construct limits for a higher-level DM multipath device. The multipath device's resulting limits never had the discard limits stacked because blk_stack_limits() will only do so if the bottom device's discard_granularity != 0. This resulted in the multipath device's limits.max_discard_sectors being 0. Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- block/blk-lib.c | 5 ++++- 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index c392029..186f249 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -56,7 +56,10 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector, * granularity */ max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9); - if (q->limits.discard_granularity) { + if (unlikely(!max_discard_sectors)) { + /* Avoid infinite loop (below) */ + return -EOPNOTSUPP; + } else if (q->limits.discard_granularity) { unsigned int disc_sects = q->limits.discard_granularity >> 9; max_discard_sectors &= ~(disc_sects - 1); ^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: block: eliminate potential for infinite loop in blkdev_issue_discard 2010-09-27 18:13 ` [PATCH] block: eliminate potential for infinite loop in blkdev_issue_discard Mike Snitzer @ 2010-10-14 21:37 ` Mike Snitzer 2010-10-15 11:05 ` Jens Axboe 0 siblings, 1 reply; 36+ messages in thread From: Mike Snitzer @ 2010-10-14 21:37 UTC (permalink / raw) To: jaxboe; +Cc: Martin K. Petersen, dm-devel, linux-scsi On Mon, Sep 27 2010 at 2:13pm -0400, Mike Snitzer <snitzer@redhat.com> wrote: > Due to the recently identified overflow in read_capacity_16() it was > possible for max_discard_sectors to be zero but still have discards > enabled on the associated device's queue. > > Eliminate the possibility for blkdev_issue_discard to infinitely loop. > > Interestingly this issue wasn't identified until a device, whose > discard_granularity was 0 due to read_capacity_16 overflow, was consumed > by blk_stack_limits() to construct limits for a higher-level DM > multipath device. The multipath device's resulting limits never had the > discard limits stacked because blk_stack_limits() will only do so if > the bottom device's discard_granularity != 0. This resulted in the > multipath device's limits.max_discard_sectors being 0. Hi Jens, This patch would only serve as a future safety-net now that the elimination of the overflow in read_capacity_16() has been staged for 2.6.37. Defensive programming and all... What do you (and others) think about this patch? Thanks, Mike > --- > block/blk-lib.c | 5 ++++- > 1 files changed, 4 insertions(+), 1 deletions(-) > > diff --git a/block/blk-lib.c b/block/blk-lib.c > index c392029..186f249 100644 > --- a/block/blk-lib.c > +++ b/block/blk-lib.c > @@ -56,7 +56,10 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector, > * granularity > */ > max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9); > - if (q->limits.discard_granularity) { > + if (unlikely(!max_discard_sectors)) { > + /* Avoid infinite loop (below) */ > + return -EOPNOTSUPP; > + } else if (q->limits.discard_granularity) { > unsigned int disc_sects = q->limits.discard_granularity >> 9; > > max_discard_sectors &= ~(disc_sects - 1); > > -- > dm-devel mailing list > dm-devel@redhat.com > https://www.redhat.com/mailman/listinfo/dm-devel ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: block: eliminate potential for infinite loop in blkdev_issue_discard 2010-10-14 21:37 ` Mike Snitzer @ 2010-10-15 11:05 ` Jens Axboe 0 siblings, 0 replies; 36+ messages in thread From: Jens Axboe @ 2010-10-15 11:05 UTC (permalink / raw) To: Mike Snitzer Cc: Martin K. Petersen, dm-devel@redhat.com, linux-scsi@vger.kernel.org On 2010-10-14 23:37, Mike Snitzer wrote: > On Mon, Sep 27 2010 at 2:13pm -0400, > Mike Snitzer <snitzer@redhat.com> wrote: > >> Due to the recently identified overflow in read_capacity_16() it was >> possible for max_discard_sectors to be zero but still have discards >> enabled on the associated device's queue. >> >> Eliminate the possibility for blkdev_issue_discard to infinitely loop. >> >> Interestingly this issue wasn't identified until a device, whose >> discard_granularity was 0 due to read_capacity_16 overflow, was consumed >> by blk_stack_limits() to construct limits for a higher-level DM >> multipath device. The multipath device's resulting limits never had the >> discard limits stacked because blk_stack_limits() will only do so if >> the bottom device's discard_granularity != 0. This resulted in the >> multipath device's limits.max_discard_sectors being 0. > > Hi Jens, > > This patch would only serve as a future safety-net now that the > elimination of the overflow in read_capacity_16() has been staged for > 2.6.37. Defensive programming and all... > > What do you (and others) think about this patch? Agree, may as well play it safe(r). -- Jens Axboe ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-27 16:41 I/O topology fixes for big physical block size Martin K. Petersen 2010-09-27 16:41 ` [PATCH 1/2] block: Ensure physical block size is unsigned int Martin K. Petersen 2010-09-27 16:41 ` [PATCH 2/2] sd: Fix overflow with big physical blocks Martin K. Petersen @ 2010-09-27 16:54 ` Jens Axboe 2010-09-27 17:20 ` Martin K. Petersen 2010-09-27 17:23 ` Mike Snitzer 2 siblings, 2 replies; 36+ messages in thread From: Jens Axboe @ 2010-09-27 16:54 UTC (permalink / raw) To: Martin K. Petersen Cc: James.Bottomley@hansenpartnership.com, snitzer@redhat.com, linux-scsi@vger.kernel.org On 2010-09-28 01:41, Martin K. Petersen wrote: > Mike Snitzer reported that he has access to a device that supports thin > provisioning but does not use the Block Limits VPD page to indicate > discard granularity. Instead it reports a huge (1MB) physical block > size. That caused a bit of fallout in the topology stack which assumed a > physical block size of 4KiB or less. Fixing the overflow aside, I question the validity of setting the physical block size to something larger than PAGE_SIZE as there's no way that that could really work in the current kernel. I would suggest doing something similar as we do with other 'invalid' settings that we cannot honor, print a warning and drop the queue limits to PAGE_SIZE. -- Jens Axboe ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-27 16:54 ` I/O topology fixes for big physical block size Jens Axboe @ 2010-09-27 17:20 ` Martin K. Petersen 2010-09-27 22:21 ` Jens Axboe 2010-09-27 17:23 ` Mike Snitzer 1 sibling, 1 reply; 36+ messages in thread From: Martin K. Petersen @ 2010-09-27 17:20 UTC (permalink / raw) To: Jens Axboe Cc: Martin K. Petersen, James.Bottomley@hansenpartnership.com, snitzer@redhat.com, linux-scsi@vger.kernel.org >>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes: Jens> Fixing the overflow aside, I question the validity of setting the Jens> physical block size to something larger than PAGE_SIZE as there's Jens> no way that that could really work in the current kernel. Jens> I would suggest doing something similar as we do with other Jens> 'invalid' settings that we cannot honor, print a warning and drop Jens> the queue limits to PAGE_SIZE. Mike and I have been bouncing this back and forth on this the last couple of days. I totally agree with enforcing hard limits like the logical blocks size inside the kernel. We have to. But the physical block size is just an I/O hint. And consequently I prefer mkfs to do sanity checking in this case and not the kernel. We report the topology truthfully and then userland can treat the data as it sees fit. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-27 17:20 ` Martin K. Petersen @ 2010-09-27 22:21 ` Jens Axboe 2010-09-27 22:36 ` Martin K. Petersen 0 siblings, 1 reply; 36+ messages in thread From: Jens Axboe @ 2010-09-27 22:21 UTC (permalink / raw) To: Martin K. Petersen Cc: James.Bottomley@hansenpartnership.com, snitzer@redhat.com, linux-scsi@vger.kernel.org On 2010-09-28 02:20, Martin K. Petersen wrote: >>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes: > > Jens> Fixing the overflow aside, I question the validity of setting the > Jens> physical block size to something larger than PAGE_SIZE as there's > Jens> no way that that could really work in the current kernel. > > Jens> I would suggest doing something similar as we do with other > Jens> 'invalid' settings that we cannot honor, print a warning and drop > Jens> the queue limits to PAGE_SIZE. > > Mike and I have been bouncing this back and forth on this the last > couple of days. > > I totally agree with enforcing hard limits like the logical blocks size > inside the kernel. We have to. But the physical block size is just an > I/O hint. And consequently I prefer mkfs to do sanity checking in this > case and not the kernel. We report the topology truthfully and then > userland can treat the data as it sees fit. So it's just the hint, not the actual hardware sector size. The naming is pretty bad on that, physical and logic... So that does look better, but in that case I don't think that sd should dump a warning. Does mkfs do the right thing? -- Jens Axboe ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-27 22:21 ` Jens Axboe @ 2010-09-27 22:36 ` Martin K. Petersen 2010-09-27 23:15 ` Mike Snitzer 0 siblings, 1 reply; 36+ messages in thread From: Martin K. Petersen @ 2010-09-27 22:36 UTC (permalink / raw) To: Jens Axboe Cc: Martin K. Petersen, James.Bottomley@hansenpartnership.com, snitzer@redhat.com, linux-scsi@vger.kernel.org >>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes: Jens> So it's just the hint, not the actual hardware sector size. The Jens> naming is pretty bad on that, physical and logic... This is the official T10/T13 terminology. Logical block size is the unit used when talking to the device via ACS/SBC. Physical block size is the unit used by the device internally. There's a pretty vague description of what that means in SBC. But it was obviously conceived mainly to handle disk drives with 512-byte logical and 4096-byte physical (hardware) sectors. The notion of using the physical block size as an indicator of erase block size is something this unknown vendor came up with. And 1MB is quite the ways out of range for what we expect in this field. But my point is that devices with physical block sizes bigger than the system page size are already out there. So we need to do the right thing here. And since this is not something affecting runtime behavior I'm in favor of deferring the decision what to do about it to userland. Jens> So that does look better, but in that case I don't think that sd Jens> should dump a warning. That's fine by me. Jens> Does mkfs do the right thing? Depends on which mkfs it is. Mike has tested things and can chip in here... -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-27 22:36 ` Martin K. Petersen @ 2010-09-27 23:15 ` Mike Snitzer 2010-09-28 4:30 ` Jens Axboe 0 siblings, 1 reply; 36+ messages in thread From: Mike Snitzer @ 2010-09-27 23:15 UTC (permalink / raw) To: Martin K. Petersen Cc: Jens Axboe, James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org, linux-ext4 On Mon, Sep 27 2010 at 6:36pm -0400, Martin K. Petersen <martin.petersen@oracle.com> wrote: > >>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes: > Jens> Does mkfs do the right thing? > > Depends on which mkfs it is. Mike has tested things and can chip in > here... I haven't test all mkfs.* but... mkfs.xfs just works with 1M physical_block_size. mkfs.ext4 won't by default but -F "fixes" that: # mkfs.ext4 -b 4096 -F /dev/mapper/20017380023360006 mke2fs 1.41.12 (17-May-2010) Warning: specified blocksize 4096 is less than device physical sectorsize 1048576, forced to continue ... I'll check fdisk and parted tomorrow (I know lvm2 doesn't look at physical_block_size). ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-27 23:15 ` Mike Snitzer @ 2010-09-28 4:30 ` Jens Axboe 2010-09-28 5:20 ` Eric Sandeen 0 siblings, 1 reply; 36+ messages in thread From: Jens Axboe @ 2010-09-28 4:30 UTC (permalink / raw) To: Mike Snitzer Cc: Martin K. Petersen, James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org, linux-ext4@vger.kernel.org On 2010-09-28 08:15, Mike Snitzer wrote: > On Mon, Sep 27 2010 at 6:36pm -0400, > Martin K. Petersen <martin.petersen@oracle.com> wrote: > >>>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes: >> Jens> Does mkfs do the right thing? >> >> Depends on which mkfs it is. Mike has tested things and can chip in >> here... > > I haven't test all mkfs.* but... > > mkfs.xfs just works with 1M physical_block_size. > > mkfs.ext4 won't by default but -F "fixes" that: > > # mkfs.ext4 -b 4096 -F /dev/mapper/20017380023360006 > mke2fs 1.41.12 (17-May-2010) > Warning: specified blocksize 4096 is less than device physical sectorsize 1048576, forced to continue OK, so that's not exactly doing the right thing, but at least you can work around it with a parameter. So I'd say that is good enough. > I'll check fdisk and parted tomorrow (I know lvm2 doesn't look at > physical_block_size). Thanks! -- Jens Axboe ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-28 4:30 ` Jens Axboe @ 2010-09-28 5:20 ` Eric Sandeen 2010-09-28 14:15 ` Mike Snitzer 0 siblings, 1 reply; 36+ messages in thread From: Eric Sandeen @ 2010-09-28 5:20 UTC (permalink / raw) To: Jens Axboe Cc: Mike Snitzer, Martin K. Petersen, James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org, linux-ext4@vger.kernel.org Jens Axboe wrote: > On 2010-09-28 08:15, Mike Snitzer wrote: >> On Mon, Sep 27 2010 at 6:36pm -0400, >> Martin K. Petersen <martin.petersen@oracle.com> wrote: >> >>>>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes: >>> Jens> Does mkfs do the right thing? >>> >>> Depends on which mkfs it is. Mike has tested things and can chip in >>> here... >> I haven't test all mkfs.* but... >> >> mkfs.xfs just works with 1M physical_block_size. >> >> mkfs.ext4 won't by default but -F "fixes" that: >> >> # mkfs.ext4 -b 4096 -F /dev/mapper/20017380023360006 >> mke2fs 1.41.12 (17-May-2010) >> Warning: specified blocksize 4096 is less than device physical sectorsize 1048576, forced to continue > > OK, so that's not exactly doing the right thing, but at least you can > work around it with a parameter. So I'd say that is good enough. Which part of it is the wrong thing...? Today mkfs.ext4 refuses to create an fs blocksize which is smaller than logical or physical by default, because one is suboptimal and the other is impossible. -F (force) can override the suboptimal fs blocksize < logical blocksize case... Should we change something? Thanks, -Eric >> I'll check fdisk and parted tomorrow (I know lvm2 doesn't look at >> physical_block_size). > > Thanks! > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-28 5:20 ` Eric Sandeen @ 2010-09-28 14:15 ` Mike Snitzer 2010-09-28 20:57 ` Ted Ts'o 0 siblings, 1 reply; 36+ messages in thread From: Mike Snitzer @ 2010-09-28 14:15 UTC (permalink / raw) To: Eric Sandeen Cc: Jens Axboe, Martin K. Petersen, James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org, linux-ext4@vger.kernel.org On Tue, Sep 28 2010 at 1:20am -0400, Eric Sandeen <sandeen@redhat.com> wrote: > Jens Axboe wrote: > > On 2010-09-28 08:15, Mike Snitzer wrote: > >> On Mon, Sep 27 2010 at 6:36pm -0400, > >> Martin K. Petersen <martin.petersen@oracle.com> wrote: > >> > >>>>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes: > >>> Jens> Does mkfs do the right thing? > >>> > >>> Depends on which mkfs it is. Mike has tested things and can chip in > >>> here... > >> I haven't test all mkfs.* but... > >> > >> mkfs.xfs just works with 1M physical_block_size. > >> > >> mkfs.ext4 won't by default but -F "fixes" that: > >> > >> # mkfs.ext4 -b 4096 -F /dev/mapper/20017380023360006 > >> mke2fs 1.41.12 (17-May-2010) > >> Warning: specified blocksize 4096 is less than device physical sectorsize 1048576, forced to continue > > > > OK, so that's not exactly doing the right thing, but at least you can > > work around it with a parameter. So I'd say that is good enough. > > Which part of it is the wrong thing...? > > Today mkfs.ext4 refuses to create an fs blocksize which is smaller than logical > or physical by default, because one is suboptimal and the other is impossible. > -F (force) can override the suboptimal fs blocksize < logical blocksize case... Actually, -F allows one to override fs blocksize < physical_block_size. In this instance we have the following: # cat /sys/block/dm-2/queue/physical_block_size 1048576 # cat /sys/block/dm-2/queue/logical_block_size 512 > Should we change something? Unclear. I could see maybe automatically capping the fs block size at 4096 if physical_block_size is larger and is a multiple of 4096? > >> I'll check fdisk and parted tomorrow (I know lvm2 doesn't look at > >> physical_block_size). Both fdisk and parted look good (partitions are physical_block_size aligned, will warn if you attempt to stray from that alignment). I'll spare you detials of the creation steps... Results of fdisk: ----------------- # fdisk /dev/sdb ... The device presents a logical sector size that is smaller than the physical sector size. Aligning to a physical sector (or optimal I/O) size boundary is recommended, or performance may be impacted. ... # fdisk -l -u /dev/sdb Disk /dev/sdb: 17.2 GB, 17179869184 bytes 255 heads, 63 sectors/track, 2088 cylinders, total 33554432 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 1048576 bytes I/O size (minimum/optimal): 1048576 bytes / 1048576 bytes Disk identifier: 0x0009bf46 Device Boot Start End Blocks Id System /dev/sdb1 2048 16775167 8386560 83 Linux Results of parted: ------------------ Also looks good, doesn't care about physical_block_size. Is more concerned with {minimum,optimal}_io_size. (parted) unit MiB (parted) p Model: XXXXXXXXXXXXX Disk /dev/sdb: 16384MiB Sector size (logical/physical): 512B/1048576B Partition Table: msdos Number Start End Size Type File system Flags 1 1.00MiB 8191MiB 8190MiB primary ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-28 14:15 ` Mike Snitzer @ 2010-09-28 20:57 ` Ted Ts'o 2010-09-28 21:24 ` Martin K. Petersen 0 siblings, 1 reply; 36+ messages in thread From: Ted Ts'o @ 2010-09-28 20:57 UTC (permalink / raw) To: Mike Snitzer Cc: Eric Sandeen, Jens Axboe, Martin K. Petersen, James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org, linux-ext4@vger.kernel.org On Tue, Sep 28, 2010 at 10:15:45AM -0400, Mike Snitzer wrote: > Actually, -F allows one to override fs blocksize < physical_block_size. > > In this instance we have the following: > # cat /sys/block/dm-2/queue/physical_block_size > 1048576 > # cat /sys/block/dm-2/queue/logical_block_size > 512 > > > Should we change something? > > Unclear. I could see maybe automatically capping the fs block size at > 4096 if physical_block_size is larger and is a multiple of 4096? Can we decide soon what the right thing should be? I'm about to release e2fsrogs 1.41.13, and if I should put in some sanity checking code so mke2fs does something sane when it sees a 1M physical block size, I can do that. Or if the kernel is going to do that, it's fine too.... - Ted ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-28 20:57 ` Ted Ts'o @ 2010-09-28 21:24 ` Martin K. Petersen 2010-09-28 21:36 ` Eric Sandeen 0 siblings, 1 reply; 36+ messages in thread From: Martin K. Petersen @ 2010-09-28 21:24 UTC (permalink / raw) To: Ted Ts'o Cc: Mike Snitzer, Eric Sandeen, Jens Axboe, Martin K. Petersen, James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org, linux-ext4@vger.kernel.org >>>>> "Ted" == Ted Ts'o <tytso@mit.edu> writes: Ted> Can we decide soon what the right thing should be? I'm about to Ted> release e2fsrogs 1.41.13, and if I should put in some sanity Ted> checking code so mke2fs does something sane when it sees a 1M Ted> physical block size, I can do that. I don't think it's entirely clear what the "right thing" would be. Let's ignore the 1MB block size for now. That's clearly a fluke and a buggy device. But there are SSDs that will advertise an 8KiB physical block size. And apparently 16KiB devices are in the pipeline. How do we want to handle these devices? Allowing blocks bigger than the page size is going to be painful. So the question is whether we can tweak the filesystem layout in a way that would alleviate the pain without having to change the filesystem block size in the traditional sense. At least we're talking about SSDs and arrays here. I assume the partial block write penalty for these devices would be smaller than it is for rotating media. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-28 21:24 ` Martin K. Petersen @ 2010-09-28 21:36 ` Eric Sandeen 2010-09-30 16:30 ` Ted Ts'o 0 siblings, 1 reply; 36+ messages in thread From: Eric Sandeen @ 2010-09-28 21:36 UTC (permalink / raw) To: Martin K. Petersen Cc: Ted Ts'o, Mike Snitzer, Jens Axboe, James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org, linux-ext4@vger.kernel.org Martin K. Petersen wrote: >>>>>> "Ted" == Ted Ts'o <tytso@mit.edu> writes: >>>>>> > > Ted> Can we decide soon what the right thing should be? I'm about to > Ted> release e2fsrogs 1.41.13, and if I should put in some sanity > Ted> checking code so mke2fs does something sane when it sees a 1M > Ted> physical block size, I can do that. > > I don't think it's entirely clear what the "right thing" would be. > > Let's ignore the 1MB block size for now. That's clearly a fluke and a > buggy device. But there are SSDs that will advertise an 8KiB physical > block size. And apparently 16KiB devices are in the pipeline. > Ok, then it sounds like mkfs.ext4's refusal to make fs blocksize less than device physical sectorsize without -F is broken, and that should be removed. I'd say issue a warning in the case but if there's a 16k physical device maybe there's no point in warning either? > How do we want to handle these devices? Allowing blocks bigger than the > page size is going to be painful. > > So the question is whether we can tweak the filesystem layout in a way > that would alleviate the pain without having to change the filesystem > block size in the traditional sense. > > At least we're talking about SSDs and arrays here. I assume the partial > block write penalty for these devices would be smaller than it is for > rotating media. > > I guess it must be. Anyway here's a patch to remove the force requirement and just give the user whatever they want, since apparently we can't avoid fs blocksize less than physical sector size in general. It does still warn that the fs blocksize is less than physical sectorsize, but *shrug* diff --git a/misc/mke2fs.c b/misc/mke2fs.c index add7c0c..6010fc1 100644 --- a/misc/mke2fs.c +++ b/misc/mke2fs.c @@ -1634,17 +1634,15 @@ static void PRS(int argc, char *argv[]) ext2fs_blocks_count(&fs_param) / (blocksize / 1024)); } else { - if (blocksize < lsector_size || /* Impossible */ - (!force && (blocksize < psector_size))) { /* Suboptimal */ + if (blocksize < lsector_size) { /* Impossible */ com_err(program_name, EINVAL, _("while setting blocksize; too small " "for device\n")); exit(1); - } else if (blocksize < psector_size) { + } else if (blocksize < psector_size) { /* Suboptimal */ fprintf(stderr, _("Warning: specified blocksize %d is " - "less than device physical sectorsize %d, " - "forced to continue\n"), blocksize, - psector_size); + "less than device physical sectorsize %d\n") + blocksize, psector_size); } } -Eric ^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-28 21:36 ` Eric Sandeen @ 2010-09-30 16:30 ` Ted Ts'o 2010-09-30 17:07 ` Eric Sandeen [not found] ` <4CA4C3B6.9000104@redhat.com> 0 siblings, 2 replies; 36+ messages in thread From: Ted Ts'o @ 2010-09-30 16:30 UTC (permalink / raw) To: Eric Sandeen Cc: Martin K. Petersen, Mike Snitzer, Jens Axboe, James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org, linux-ext4@vger.kernel.org On Tue, Sep 28, 2010 at 04:36:42PM -0500, Eric Sandeen wrote: > Ok, then it sounds like mkfs.ext4's refusal to make fs blocksize less > than device physical sectorsize without -F is broken, and that should > be removed. I'd say issue a warning in the case but if there's a 16k > physical device maybe there's no point in warning either? If the device physical sectorsize is that big, should we perhaps use that as a hint to align writes to that blocks aligned with that physical sectorsize? Right now we use the optimal I/O size, but if the optimal I/O size is not specified and the physical sectorsize is, say, 16k or 32k, maybe we should use to calculate for s_raid_stripe_width? - Ted ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-30 16:30 ` Ted Ts'o @ 2010-09-30 17:07 ` Eric Sandeen [not found] ` <4CA4C3B6.9000104@redhat.com> 1 sibling, 0 replies; 36+ messages in thread From: Eric Sandeen @ 2010-09-30 17:07 UTC (permalink / raw) To: Ted Ts'o, Martin K. Petersen, Mike Snitzer, Jens Axboe, James.Bottomley On 09/30/2010 11:30 AM, Ted Ts'o wrote: > On Tue, Sep 28, 2010 at 04:36:42PM -0500, Eric Sandeen wrote: >> Ok, then it sounds like mkfs.ext4's refusal to make fs blocksize less >> than device physical sectorsize without -F is broken, and that should >> be removed. I'd say issue a warning in the case but if there's a 16k >> physical device maybe there's no point in warning either? > > If the device physical sectorsize is that big, should we perhaps use > that as a hint to align writes to that blocks aligned with that > physical sectorsize? Right now we use the optimal I/O size, but if > the optimal I/O size is not specified and the physical sectorsize is, I can't keep track of all the parameters, is it ever true that optimal I/O size isn't specified? > say, 16k or 32k, maybe we should use to calculate for > s_raid_stripe_width? Perhaps, though really ext4 still doesn't do -that- much with the value, anyway... -Eric ^ permalink raw reply [flat|nested] 36+ messages in thread
[parent not found: <4CA4C3B6.9000104@redhat.com>]
* Re: I/O topology fixes for big physical block size [not found] ` <4CA4C3B6.9000104@redhat.com> @ 2010-09-30 17:33 ` Mike Snitzer 2010-10-01 14:24 ` Ted Ts'o 0 siblings, 1 reply; 36+ messages in thread From: Mike Snitzer @ 2010-09-30 17:33 UTC (permalink / raw) To: Eric Sandeen Cc: Ted Ts'o, Martin K. Petersen, Jens Axboe, James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org, linux-ext4@vger.kernel.org On Thu, Sep 30 2010 at 1:07pm -0400, Eric Sandeen <sandeen@redhat.com> wrote: > On 09/30/2010 11:30 AM, Ted Ts'o wrote: > > On Tue, Sep 28, 2010 at 04:36:42PM -0500, Eric Sandeen wrote: > >> Ok, then it sounds like mkfs.ext4's refusal to make fs blocksize less > >> than device physical sectorsize without -F is broken, and that should > >> be removed. I'd say issue a warning in the case but if there's a 16k > >> physical device maybe there's no point in warning either? > > > > If the device physical sectorsize is that big, should we perhaps use > > that as a hint to align writes to that blocks aligned with that > > physical sectorsize? Right now we use the optimal I/O size, but if > > the optimal I/O size is not specified and the physical sectorsize is, > > I can't keep track of all the parameters, is it ever true that optimal > I/O size isn't specified? Yes optimal_io_size may be 0. But minimum_io_size will always be scaled up to at least match physical_block_size. In any case: this 1MB physical_block_size device, which started this thread, also has 1MB for both minimum_io_size and optimal_io_size. Mike ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-30 17:33 ` Mike Snitzer @ 2010-10-01 14:24 ` Ted Ts'o 2010-10-01 22:19 ` Martin K. Petersen 0 siblings, 1 reply; 36+ messages in thread From: Ted Ts'o @ 2010-10-01 14:24 UTC (permalink / raw) To: Mike Snitzer Cc: Eric Sandeen, Martin K. Petersen, Jens Axboe, James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org, linux-ext4@vger.kernel.org On Thu, Sep 30, 2010 at 01:33:43PM -0400, Mike Snitzer wrote: > > Yes optimal_io_size may be 0. But minimum_io_size will always be scaled > up to at least match physical_block_size. Woah! Are we sure we want to do that? According to Jens, 8k physical blockes are here already and 16k physical blocks sizes are right around the corner. If we scale minimum_io_size up to the physical block size, then even though these devices will have 512 or 4k logical block sizes, minimum_io_size will be 16k? That sounds wrong, incorrect, and given that the Linux VM can't handle file system block sizes greater than page size. And if we scale the minimum_io_size to the physical block size, mke2fs will refuse to create a 4k blocksize filesystem --- since presumably "minimum io size" means we can't do I/O's smaller than that. Please tell me you meant to say __logical__ blocksize above? Or am I misunderstanding what you meant? - Ted ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-10-01 14:24 ` Ted Ts'o @ 2010-10-01 22:19 ` Martin K. Petersen 2010-10-02 2:31 ` Ted Ts'o 0 siblings, 1 reply; 36+ messages in thread From: Martin K. Petersen @ 2010-10-01 22:19 UTC (permalink / raw) To: Ted Ts'o Cc: Mike Snitzer, Eric Sandeen, Martin K. Petersen, Jens Axboe, James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org, linux-ext4@vger.kernel.org >>>>> "Ted" == Ted Ts'o <tytso@mit.edu> writes: Ted> If we scale minimum_io_size up to the physical block size, then Ted> even though these devices will have 512 or 4k logical block sizes, Ted> minimum_io_size will be 16k? That sounds wrong, incorrect, and Ted> given that the Linux VM can't handle file system block sizes Ted> greater than page size. And if we scale the minimum_io_size to the Ted> physical block size, mke2fs will refuse to create a 4k blocksize Ted> filesystem --- since presumably "minimum io size" means we can't do Ted> I/O's smaller than that. logical <= physical <= minimum logical is the smallest unit we can address. Usually 512 bytes. physical is the allocation unit the device claims to use internally. Typically 512 or 4096. 8 and 16 KiB coming. minimal is the device's preferred minimum random I/O unit. This is usually identical to the physical block size. Arrays might report a multiple of the physical block size here (stripe chunk size). optimal (if provided) is the preferred sequential I/O unit and a multiple of minimal (stripe width). The logical and physical parameters are device protocol-centric values. The minimum and optimal I/O sizes are the two "soft" values that filesystems should be looking at for layout hints. A filesystem should use minimal as a cue for block size and optimal as a cue for stripe width. minimum may indeed be bigger than page size and this discussion was started to figure out if there were thing we could do to accommodate these device without actually changing the filesystem block size in the traditional sense. Since not all drives guarantee that read-modify-write cycle on a 4 KiB physical block won't clobber adjacent 512-byte logical blocks it may be a good idea to look at physical block size if there are atomicity concerns. I.e. filesystems that depend on atomic journal writes may want to look at the reported physical block size. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-10-01 22:19 ` Martin K. Petersen @ 2010-10-02 2:31 ` Ted Ts'o 2010-10-04 19:49 ` Martin K. Petersen 0 siblings, 1 reply; 36+ messages in thread From: Ted Ts'o @ 2010-10-02 2:31 UTC (permalink / raw) To: Martin K. Petersen Cc: Mike Snitzer, Eric Sandeen, Jens Axboe, James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org, linux-ext4@vger.kernel.org On Fri, Oct 01, 2010 at 06:19:21PM -0400, Martin K. Petersen wrote: > Since not all drives guarantee that read-modify-write cycle on a 4 KiB > physical block won't clobber adjacent 512-byte logical blocks it may be > a good idea to look at physical block size if there are atomicity > concerns. I.e. filesystems that depend on atomic journal writes may > want to look at the reported physical block size. OK, but what do we do when we start seeing devices with 8k or 16k physical block sizes? The VM doesn't deal well with block sizes > page size. - Ted ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-10-02 2:31 ` Ted Ts'o @ 2010-10-04 19:49 ` Martin K. Petersen 0 siblings, 0 replies; 36+ messages in thread From: Martin K. Petersen @ 2010-10-04 19:49 UTC (permalink / raw) To: Ted Ts'o Cc: Martin K. Petersen, Mike Snitzer, Eric Sandeen, Jens Axboe, James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org, linux-ext4@vger.kernel.org >>>>> "Ted" == Ted Ts'o <tytso@mit.edu> writes: Ted, Ted> OK, but what do we do when we start seeing devices with 8k or 16k Ted> physical block sizes? The VM doesn't deal well with block sizes > Ted> page size. I don't think we're going to see devices reporting logical blocks bigger than 4KiB anytime soon. Too much pain for everybody in the industry (most other operating systems can't even deal with 4KiB logical blocks yet). Eventually we will have to do the required page cache surgery to support filesystem block sizes bigger than the page size. But I don't think that's something we'll have to deal with in the immediate future. In the meantime, however, the question is whether there is something we can do in the allocators to mitigate effects of devices reporting physical blocks bigger than PAGE_CACHE_SIZE. Obviously this would be in the I/O hint/alignment category and not something which would guarantee that all writes would be aligned multiples of that physical block size. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-27 16:54 ` I/O topology fixes for big physical block size Jens Axboe 2010-09-27 17:20 ` Martin K. Petersen @ 2010-09-27 17:23 ` Mike Snitzer 2010-09-27 21:58 ` James Bottomley 1 sibling, 1 reply; 36+ messages in thread From: Mike Snitzer @ 2010-09-27 17:23 UTC (permalink / raw) To: Jens Axboe Cc: Martin K. Petersen, James.Bottomley@hansenpartnership.com, linux-scsi@vger.kernel.org On Mon, Sep 27 2010 at 12:54pm -0400, Jens Axboe <jaxboe@fusionio.com> wrote: > On 2010-09-28 01:41, Martin K. Petersen wrote: > > Mike Snitzer reported that he has access to a device that supports thin > > provisioning but does not use the Block Limits VPD page to indicate > > discard granularity. Instead it reports a huge (1MB) physical block > > size. That caused a bit of fallout in the topology stack which assumed a > > physical block size of 4KiB or less. > > Fixing the overflow aside, I question the validity of setting the physical > block size to something larger than PAGE_SIZE as there's no way that that > could really work in the current kernel. > > I would suggest doing something similar as we do with other 'invalid' > settings that we cannot honor, print a warning and drop the queue > limits to PAGE_SIZE. I'm inclined to agree. Doesn't make a whole lot of sense. But could this cap of PAGE_SIZE be enforced with a follow-on patch? Or would you rather see it be dealt with in a single revised 2/2 patch? Mike ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-27 17:23 ` Mike Snitzer @ 2010-09-27 21:58 ` James Bottomley 2010-09-27 22:03 ` Jens Axboe 0 siblings, 1 reply; 36+ messages in thread From: James Bottomley @ 2010-09-27 21:58 UTC (permalink / raw) To: Mike Snitzer; +Cc: Jens Axboe, Martin K. Petersen, linux-scsi@vger.kernel.org On Mon, 2010-09-27 at 13:23 -0400, Mike Snitzer wrote: > On Mon, Sep 27 2010 at 12:54pm -0400, > Jens Axboe <jaxboe@fusionio.com> wrote: > > > On 2010-09-28 01:41, Martin K. Petersen wrote: > > > Mike Snitzer reported that he has access to a device that supports thin > > > provisioning but does not use the Block Limits VPD page to indicate > > > discard granularity. Instead it reports a huge (1MB) physical block > > > size. That caused a bit of fallout in the topology stack which assumed a > > > physical block size of 4KiB or less. > > > > Fixing the overflow aside, I question the validity of setting the physical > > block size to something larger than PAGE_SIZE as there's no way that that > > could really work in the current kernel. > > > > I would suggest doing something similar as we do with other 'invalid' > > settings that we cannot honor, print a warning and drop the queue > > limits to PAGE_SIZE. > > I'm inclined to agree. Doesn't make a whole lot of sense. > > But could this cap of PAGE_SIZE be enforced with a follow-on patch? Or > would you rather see it be dealt with in a single revised 2/2 patch? It's up to Jens, but I'd prefer, unless there's a very good reason, that the patches we put into the kernel be right first time around, since that generates a cleaner history and a better example should anyone go looking for one in the tree. James ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-27 21:58 ` James Bottomley @ 2010-09-27 22:03 ` Jens Axboe 2010-09-27 22:14 ` Martin K. Petersen 0 siblings, 1 reply; 36+ messages in thread From: Jens Axboe @ 2010-09-27 22:03 UTC (permalink / raw) To: James Bottomley Cc: Mike Snitzer, Martin K. Petersen, linux-scsi@vger.kernel.org On 2010-09-28 06:58, James Bottomley wrote: > On Mon, 2010-09-27 at 13:23 -0400, Mike Snitzer wrote: >> On Mon, Sep 27 2010 at 12:54pm -0400, >> Jens Axboe <jaxboe@fusionio.com> wrote: >> >>> On 2010-09-28 01:41, Martin K. Petersen wrote: >>>> Mike Snitzer reported that he has access to a device that supports thin >>>> provisioning but does not use the Block Limits VPD page to indicate >>>> discard granularity. Instead it reports a huge (1MB) physical block >>>> size. That caused a bit of fallout in the topology stack which assumed a >>>> physical block size of 4KiB or less. >>> >>> Fixing the overflow aside, I question the validity of setting the physical >>> block size to something larger than PAGE_SIZE as there's no way that that >>> could really work in the current kernel. >>> >>> I would suggest doing something similar as we do with other 'invalid' >>> settings that we cannot honor, print a warning and drop the queue >>> limits to PAGE_SIZE. >> >> I'm inclined to agree. Doesn't make a whole lot of sense. >> >> But could this cap of PAGE_SIZE be enforced with a follow-on patch? Or >> would you rather see it be dealt with in a single revised 2/2 patch? > > It's up to Jens, but I'd prefer, unless there's a very good reason, that > the patches we put into the kernel be right first time around, since > that generates a cleaner history and a better example should anyone go > looking for one in the tree. Yes, from a correctness point of view it doesn't matter, but when people go looking up fixes for whatever reason, it's much better to include such a fix in the original patch so it's not missed. -- Jens Axboe ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-27 22:03 ` Jens Axboe @ 2010-09-27 22:14 ` Martin K. Petersen 2010-09-27 22:24 ` Jens Axboe 0 siblings, 1 reply; 36+ messages in thread From: Martin K. Petersen @ 2010-09-27 22:14 UTC (permalink / raw) To: Jens Axboe Cc: James Bottomley, Mike Snitzer, Martin K. Petersen, linux-scsi@vger.kernel.org >>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes: Jens> Yes, from a correctness point of view it doesn't matter, but when Jens> people go looking up fixes for whatever reason, it's much better Jens> to include such a fix in the original patch so it's not missed. I have talked to a few standards people today. They are of the opinion that the device's usage of the physical block exponent is incorrect. And that the device must provide the Block Limits and the TP VPD if thin provisioning is enabled. However, devices with 8KiB physical blocks are shipping and 16KiB ditto are right around the corner. Which says to me that it's important to report the correct thing to userland so we can cause allocators to align on the right boundaries, etc. If we artificially clamp the physical block size parameter in the kernel we are losing information. Note that there are no kernel users of the physical block size parameter at all. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-27 22:14 ` Martin K. Petersen @ 2010-09-27 22:24 ` Jens Axboe 2010-09-28 18:48 ` Martin K. Petersen 0 siblings, 1 reply; 36+ messages in thread From: Jens Axboe @ 2010-09-27 22:24 UTC (permalink / raw) To: Martin K. Petersen Cc: James Bottomley, Mike Snitzer, linux-scsi@vger.kernel.org On 2010-09-28 07:14, Martin K. Petersen wrote: >>>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes: > > Jens> Yes, from a correctness point of view it doesn't matter, but when > Jens> people go looking up fixes for whatever reason, it's much better > Jens> to include such a fix in the original patch so it's not missed. > > I have talked to a few standards people today. They are of the opinion > that the device's usage of the physical block exponent is incorrect. And > that the device must provide the Block Limits and the TP VPD if thin > provisioning is enabled. > > However, devices with 8KiB physical blocks are shipping and 16KiB ditto > are right around the corner. Which says to me that it's important to > report the correct thing to userland so we can cause allocators to align > on the right boundaries, etc. If we artificially clamp the physical > block size parameter in the kernel we are losing information. Note that > there are no kernel users of the physical block size parameter at all. With the revised understanding that this is purely the IO hint, then yes I agree we should not clamp it. -- Jens Axboe ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-27 22:24 ` Jens Axboe @ 2010-09-28 18:48 ` Martin K. Petersen 2010-09-28 18:54 ` Mike Snitzer 0 siblings, 1 reply; 36+ messages in thread From: Martin K. Petersen @ 2010-09-28 18:48 UTC (permalink / raw) To: Jens Axboe Cc: Martin K. Petersen, James Bottomley, Mike Snitzer, linux-scsi@vger.kernel.org >>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes: Jens> With the revised understanding that this is purely the IO hint, Jens> then yes I agree we should not clamp it. Ok, here's an updated sd patch that does not print a warning... sd: Fix overflow with big physical blocks The hw_sector_size variable could overflow if a device reported huge physical blocks. Switch to the more accurate physical_block_size terminology and make sure we use an unsigned int to match the range permitted by READ CAPACITY(16). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index ffa0689..5446c4d 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1554,7 +1554,7 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp, } /* Logical blocks per physical block exponent */ - sdkp->hw_sector_size = (1 << (buffer[13] & 0xf)) * sector_size; + sdkp->physical_block_size = (1 << (buffer[13] & 0xf)) * sector_size; /* Lowest aligned logical block */ alignment = ((buffer[14] & 0x3f) << 8 | buffer[15]) * sector_size; @@ -1567,7 +1567,7 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp, struct request_queue *q = sdp->request_queue; sdkp->thin_provisioning = 1; - q->limits.discard_granularity = sdkp->hw_sector_size; + q->limits.discard_granularity = sdkp->physical_block_size; q->limits.max_discard_sectors = 0xffffffff; if (buffer[14] & 0x40) /* TPRZ */ @@ -1635,7 +1635,7 @@ static int read_capacity_10(struct scsi_disk *sdkp, struct scsi_device *sdp, } sdkp->capacity = lba + 1; - sdkp->hw_sector_size = sector_size; + sdkp->physical_block_size = sector_size; return sector_size; } @@ -1756,10 +1756,10 @@ got_data: (unsigned long long)sdkp->capacity, sector_size, cap_str_10, cap_str_2); - if (sdkp->hw_sector_size != sector_size) + if (sdkp->physical_block_size != sector_size) sd_printk(KERN_NOTICE, sdkp, "%u-byte physical blocks\n", - sdkp->hw_sector_size); + sdkp->physical_block_size); } } @@ -1773,7 +1773,8 @@ got_data: else if (sector_size == 256) sdkp->capacity >>= 1; - blk_queue_physical_block_size(sdp->request_queue, sdkp->hw_sector_size); + blk_queue_physical_block_size(sdp->request_queue, + sdkp->physical_block_size); sdkp->device->sector_size = sector_size; } diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index f81a930..f947140 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -50,7 +50,7 @@ struct scsi_disk { atomic_t openers; sector_t capacity; /* size in 512-byte sectors */ u32 index; - unsigned short hw_sector_size; + unsigned int physical_block_size; u8 media_present; u8 write_prot; u8 protection_type;/* Data Integrity Field */ ^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: I/O topology fixes for big physical block size 2010-09-28 18:48 ` Martin K. Petersen @ 2010-09-28 18:54 ` Mike Snitzer 0 siblings, 0 replies; 36+ messages in thread From: Mike Snitzer @ 2010-09-28 18:54 UTC (permalink / raw) To: Martin K. Petersen Cc: Jens Axboe, James Bottomley, linux-scsi@vger.kernel.org On Tue, Sep 28 2010 at 2:48pm -0400, Martin K. Petersen <martin.petersen@oracle.com> wrote: > >>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes: > > Jens> With the revised understanding that this is purely the IO hint, > Jens> then yes I agree we should not clamp it. > > Ok, here's an updated sd patch that does not print a warning... > > > sd: Fix overflow with big physical blocks > > The hw_sector_size variable could overflow if a device reported huge > physical blocks. Switch to the more accurate physical_block_size > terminology and make sure we use an unsigned int to match the range > permitted by READ CAPACITY(16). > > Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Thanks Martin. ^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2010-10-15 11:05 UTC | newest] Thread overview: 36+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-09-27 16:41 I/O topology fixes for big physical block size Martin K. Petersen 2010-09-27 16:41 ` [PATCH 1/2] block: Ensure physical block size is unsigned int Martin K. Petersen 2010-09-27 17:40 ` Mike Snitzer 2010-10-08 5:15 ` Martin K. Petersen 2010-10-13 19:12 ` Mike Snitzer 2010-10-13 19:15 ` Jens Axboe 2010-09-27 16:41 ` [PATCH 2/2] sd: Fix overflow with big physical blocks Martin K. Petersen 2010-09-27 17:42 ` Mike Snitzer 2010-09-27 18:13 ` [PATCH] block: eliminate potential for infinite loop in blkdev_issue_discard Mike Snitzer 2010-10-14 21:37 ` Mike Snitzer 2010-10-15 11:05 ` Jens Axboe 2010-09-27 16:54 ` I/O topology fixes for big physical block size Jens Axboe 2010-09-27 17:20 ` Martin K. Petersen 2010-09-27 22:21 ` Jens Axboe 2010-09-27 22:36 ` Martin K. Petersen 2010-09-27 23:15 ` Mike Snitzer 2010-09-28 4:30 ` Jens Axboe 2010-09-28 5:20 ` Eric Sandeen 2010-09-28 14:15 ` Mike Snitzer 2010-09-28 20:57 ` Ted Ts'o 2010-09-28 21:24 ` Martin K. Petersen 2010-09-28 21:36 ` Eric Sandeen 2010-09-30 16:30 ` Ted Ts'o 2010-09-30 17:07 ` Eric Sandeen [not found] ` <4CA4C3B6.9000104@redhat.com> 2010-09-30 17:33 ` Mike Snitzer 2010-10-01 14:24 ` Ted Ts'o 2010-10-01 22:19 ` Martin K. Petersen 2010-10-02 2:31 ` Ted Ts'o 2010-10-04 19:49 ` Martin K. Petersen 2010-09-27 17:23 ` Mike Snitzer 2010-09-27 21:58 ` James Bottomley 2010-09-27 22:03 ` Jens Axboe 2010-09-27 22:14 ` Martin K. Petersen 2010-09-27 22:24 ` Jens Axboe 2010-09-28 18:48 ` Martin K. Petersen 2010-09-28 18:54 ` Mike Snitzer
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).