linux-fsdevel.vger.kernel.org archive mirror
* impact of 4k sector size on the IO & FS stack
@ 2007-03-11 22:51 Ric Wheeler
  2007-03-11 23:14 ` Jan Engelhardt
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Ric Wheeler @ 2007-03-11 22:51 UTC (permalink / raw)
  To: linux-scsi, linux-fsdevel, Linux-ide


During the recent IO/FS workshop, we spoke briefly about the coming 
change to a 4k sector size for disks on linux. If I recall correctly, 
the general feeling was that the impact was not significant since we 
already do most file system IO in 4k page sizes and should be fine as 
long as we partition drives correctly and avoid non-4k aligned partitions.

Are there other concerns in the IO or FS stack that we should bring up 
with vendors?  I have been asked to summarize the impact of 4k sectors 
on linux for a disk vendor gathering and want to make sure that I put 
all of our linux specific items into that summary...

ric



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: impact of 4k sector size on the IO & FS stack
  2007-03-11 22:51 impact of 4k sector size on the IO & FS stack Ric Wheeler
@ 2007-03-11 23:14 ` Jan Engelhardt
  2007-03-12  2:45   ` Ric Wheeler
  2007-03-12 14:36   ` Jeff Garzik
  2007-03-12  0:02 ` Alan Cox
  2007-03-12  8:18 ` Christoph Hellwig
  2 siblings, 2 replies; 29+ messages in thread
From: Jan Engelhardt @ 2007-03-11 23:14 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: linux-scsi, linux-fsdevel, Linux-ide


On Mar 11 2007 18:51, Ric Wheeler wrote:
>
> During the recent IO/FS workshop, we spoke briefly about the
> coming change to a 4k sector size for disks on linux. If I
> recall correctly, the general feeling was that the impact was
> not significant since we already do most file system IO in 4k
> page sizes and should be fine as long as we partition drives
> correctly and avoid non-4k aligned partitions.

Sorry about jumping right in, but what about an 'old-style'
partition table that relies on 512 as a unit?

> Are there other concerns in the IO or FS stack that we should
> bring up with vendors?  I have been asked to summarize the
> impact of 4k sectors on linux for a disk vendor gathering and
> want to make sure that I put all of our linux specific items
> into that summary...

Jan
-- 


* Re: impact of 4k sector size on the IO & FS stack
  2007-03-11 22:51 impact of 4k sector size on the IO & FS stack Ric Wheeler
  2007-03-11 23:14 ` Jan Engelhardt
@ 2007-03-12  0:02 ` Alan Cox
  2007-03-12  0:44   ` Jeff Garzik
  2007-03-12  2:41   ` Ric Wheeler
  2007-03-12  8:18 ` Christoph Hellwig
  2 siblings, 2 replies; 29+ messages in thread
From: Alan Cox @ 2007-03-12  0:02 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: linux-scsi, linux-fsdevel, Linux-ide

> Are there other concerns in the IO or FS stack that we should bring up 
> with vendors?  I have been asked to summarize the impact of 4k sectors 
> on linux  for a disk vendor gathering and want to make sure that I put 
> all of our linux specific items into that summary...

We need to make sure the physical sector size is correctly reported by
the disk (eg in the ATA7 identify data) but I think for libata at least
the right bits are already there and we've got a fair amount of scsi disk
experience with other media sizes (eg 2K) already. 256byte/sector media
is still broken btw 8)
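
As a rough illustration of what "correctly reported" involves, here is a minimal Python sketch that decodes the sector-size fields of IDENTIFY DEVICE data. The bit layout (word 106 for the physical/logical relationship, words 117-118 for a non-512 logical size) is stated here as an assumption from the ATA-7/ATA8-ACS drafts and should be checked against the spec; the function name is hypothetical and this is not the libata code.

```python
def decode_sector_sizes(word106, words_117_118=(0, 0)):
    """Return (logical_bytes, physical_bytes) from IDENTIFY DEVICE words.

    Assumed layout (verify against the ATA spec):
      word 106 bits 15:14 == 01b -> word contents are valid
      word 106 bit 13            -> multiple logical sectors per physical sector
      word 106 bit 12            -> logical sector longer than 256 words
      word 106 bits 3:0          -> log2(logical sectors per physical sector)
      words 117-118              -> logical sector size in 16-bit words
    """
    logical = 512          # default when nothing else is reported
    per_physical = 1
    if (word106 & 0xC000) == 0x4000:
        if word106 & (1 << 12):
            low, high = words_117_118
            logical = ((high << 16) | low) * 2   # words -> bytes
        if word106 & (1 << 13):
            per_physical = 1 << (word106 & 0xF)
    return logical, logical * per_physical


if __name__ == "__main__":
    print(decode_sector_sizes(0x0000))             # legacy drive
    print(decode_sector_sizes(0x6001))             # 512 logical, 1K physical
    print(decode_sector_sizes(0x6003))             # 512 logical, 4K physical
    print(decode_sector_sizes(0x5000, (2048, 0)))  # 4K logical and physical
```

A drive that leaves word 106 zeroed is indistinguishable from a plain 512-byte drive, which is why the reporting has to be right.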

I would be interested to know what the disk vendors intend to use as
their strategy when (with ATA) they have a 512 byte write from an older
file system/setup into a 4K block. The case where errors magically appear
in other parts of the fs when such an error occurs is not IMHO too well
considered.

Alan


* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12  0:02 ` Alan Cox
@ 2007-03-12  0:44   ` Jeff Garzik
  2007-03-12  2:37     ` Ric Wheeler
  2007-03-12 12:24     ` Alan Cox
  2007-03-12  2:41   ` Ric Wheeler
  1 sibling, 2 replies; 29+ messages in thread
From: Jeff Garzik @ 2007-03-12  0:44 UTC (permalink / raw)
  To: Alan Cox; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

Alan Cox wrote:
> I would be interested to know what the disk vendors intend to use as
> their strategy when (with ATA) they have a 512 byte write from an older
> file system/setup into a 4K block. The case where errors magically appear

Well, you have logical and physical sector size changes.

First generation of 1K sector drives will continue to use the same 
512-byte ATA sector size you are familiar with.  A single 512-byte write 
will cause the drive to perform a read-modify-write cycle.  This 
configuration is physical 1K sector, logical 512b sector.

A future configuration will change the logical ATA interface away from 
512-byte sectors to 1K or 4K.  Here, it is impossible to read a quantity 
smaller than 1K or 4K, whatever the sector size is.
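
To make the read-modify-write case concrete, here is a small Python sketch of a toy drive with a 512-byte logical interface on top of larger physical sectors. It is only a model under stated assumptions (the class name, the in-memory "media" and the counters are all hypothetical); real firmware behaviour is not visible through the ATA interface.

```python
LOGICAL = 512  # logical sector size presented on the interface


class RmwDisk:
    """Toy model: 512-byte logical sectors over larger physical sectors."""

    def __init__(self, physical_size, n_physical):
        self.physical_size = physical_size
        self.media = bytearray(physical_size * n_physical)
        self.media_reads = 0    # physical sectors read back for RMW
        self.media_writes = 0   # physical sectors written

    def write_logical(self, lba, data):
        assert len(data) > 0 and len(data) % LOGICAL == 0
        start = lba * LOGICAL
        end = start + len(data)
        first = start // self.physical_size
        last = (end - 1) // self.physical_size
        for p in range(first, last + 1):
            p_start = p * self.physical_size
            p_end = p_start + self.physical_size
            lo, hi = max(start, p_start), min(end, p_end)
            if lo == p_start and hi == p_end:
                # Physical sector fully covered: plain write, no read needed.
                buf = bytearray(self.physical_size)
            else:
                # Partially covered: read the old contents first (the "R").
                buf = bytearray(self.media[p_start:p_end])
                self.media_reads += 1
            # The "M" happens in the buffer, i.e. in drive RAM.
            buf[lo - p_start:hi - p_start] = data[lo - start:hi - start]
            # The "W" rewrites the whole physical sector.
            self.media[p_start:p_end] = buf
            self.media_writes += 1


if __name__ == "__main__":
    disk = RmwDisk(physical_size=1024, n_physical=4)
    disk.write_logical(1, b"\xaa" * 512)    # half a physical sector: RMW
    disk.write_logical(2, b"\xbb" * 1024)   # one whole physical sector: no RMW
    print(disk.media_reads, disk.media_writes)
```

In this model only the write that covers part of a physical sector pays for the extra read.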

	Jeff




* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12  0:44   ` Jeff Garzik
@ 2007-03-12  2:37     ` Ric Wheeler
  2007-03-12 12:24     ` Alan Cox
  1 sibling, 0 replies; 29+ messages in thread
From: Ric Wheeler @ 2007-03-12  2:37 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Alan Cox, linux-scsi, linux-fsdevel, Linux-ide



Jeff Garzik wrote:
> Alan Cox wrote:
>> I would be interested to know what the disk vendors intend to use as
>> their strategy when (with ATA) they have a 512 byte write from an older
>> file system/setup into a 4K block. The case where errors magically 
>> appear
>
> Well, you have logical and physical sector size changes.
>
> First generation of 1K sector drives will continue to use the same 
> 512-byte ATA sector size you are familiar with.  A single 512-byte 
> write will cause the drive to perform a read-modify-write cycle.  This 
> configuration is physical 1K sector, logical 512b sector.
It would seem that most writes would avoid this - hopefully the drive 
firmware could use the write cache to coalesce contiguous IO's into 1k 
multiples when getting streams of 512 byte write requests.
>
> A future configuration will change the logical ATA interface away from 
> 512-byte sectors to 1K or 4K.  Here, it is impossible to read a 
> quantity smaller than 1K or 4K, whatever the sector size is.
>
>     Jeff
I will try and see if I can get some specific information on when the 
various flavors of this are going to appear...

ric



* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12  0:02 ` Alan Cox
  2007-03-12  0:44   ` Jeff Garzik
@ 2007-03-12  2:41   ` Ric Wheeler
  1 sibling, 0 replies; 29+ messages in thread
From: Ric Wheeler @ 2007-03-12  2:41 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-scsi, linux-fsdevel, Linux-ide


Alan Cox wrote:
>> Are there other concerns in the IO or FS stack that we should bring up 
>> with vendors?  I have been asked to summarize the impact of 4k sectors 
>> on linux  for a disk vendor gathering and want to make sure that I put 
>> all of our linux specific items into that summary...
>>     
>
> We need to make sure the physical sector size is correctly reported by
> the disk (eg in the ATA7 identify data) but I think for libata at least
> the right bits are already there and we've got a fair amount of scsi disk
> experience with other media sizes (eg 2K) already. 256byte/sector media
> is still broken btw 8)
>   
It would be really interesting to see if we can validate this with 
prototype drives.
> I would be interested to know what the disk vendors intend to use as
> their strategy when (with ATA) they have a 512 byte write from an older
> file system/setup into a 4K block. The case where errors magically appear
> in other parts of the fs when such an error occurs is not IMHO too well
> considered.
>
> Alan
As Jeff mentioned, I think that they would have to do a 
read-modify-write simulation which would kill performance for a small, 
random write work load...

ric



* Re: impact of 4k sector size on the IO & FS stack
  2007-03-11 23:14 ` Jan Engelhardt
@ 2007-03-12  2:45   ` Ric Wheeler
  2007-03-12  3:27     ` Jan Engelhardt
  2007-03-12 14:36   ` Jeff Garzik
  1 sibling, 1 reply; 29+ messages in thread
From: Ric Wheeler @ 2007-03-12  2:45 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: linux-scsi, linux-fsdevel, Linux-ide



Jan Engelhardt wrote:
> On Mar 11 2007 18:51, Ric Wheeler wrote:
>   
>> During the recent IO/FS workshop, we spoke briefly about the
>> coming change to a 4k sector size for disks on linux. If I
>> recall correctly, the general feeling was that the impact was
>> not significant since we already do most file system IO in 4k
>> page sizes and should be fine as long as we partition drives
>> correctly and avoid non-4k aligned partitions.
>>     
>
> Sorry about jumping right in, but what about an 'old-style'
> partition table that relies on 512 as a unit?
>
>   
I think that the normal case would involve new drives which would need 
to be partitioned in 4k aligned partitions. Shouldn't that work 
regardless of the unit used in the partition table?


ric





* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12  2:45   ` Ric Wheeler
@ 2007-03-12  3:27     ` Jan Engelhardt
  2007-03-12  3:46       ` Andreas Dilger
                         ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Jan Engelhardt @ 2007-03-12  3:27 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: linux-scsi, linux-fsdevel, Linux-ide


On Mar 11 2007 22:45, Ric Wheeler wrote:
> Jan Engelhardt wrote:
>> On Mar 11 2007 18:51, Ric Wheeler wrote:
>> 
>> > During the recent IO/FS workshop, we spoke briefly about the
>> > coming change to a 4k sector size for disks on linux. If I
>> > recall correctly, the general feeling was that the impact was
>> > not significant since we already do most file system IO in 4k
>> > page sizes and should be fine as long as we partition drives
>> > correctly and avoid non-4k aligned partitions.
>> > 
>> 
>> Sorry about jumping right in, but what about an 'old-style'
>> partition table that relies on 512 as a unit?
>> 
>> 
> I think that the normal case would involve new drives which
> would need to be partitioned in 4k aligned partitions.
> Shouldn't that work regardless of the unit used in the
> partition table?

Assume this partition table on my current HD:

	Disk /dev/hdc: 251.0 GB, 251000193024 bytes
	255 heads, 63 sectors/track, 30515 cylinders
	Units = cylinders of 16065 * 512 = 8225280 bytes
	
	   Device Start  End      Blocks   Id  System
	/dev/hdc1   1     33      265041   82  Linux swap / Solaris
	/dev/hdc2  34  30515   244846665    5  Extended

That is, 255 * 63 * 30515 * 512 == roughly 251 GB.

Now, if this disk was copied byte per byte (/bin/dd) to a
4096-based disk, and Linux would start using a sector size of
4096, then I would suddenly have

255 * 63 * 30515 * 4096 == 2 TB

Although I would not mind the 2 TB, the partition table would
read quite differently (note the Blocks column which is
multiplied by 8 (512x8=4096))

           Device Start  End      Blocks   Id  System
        /dev/hdc1   1     33     2120328   82  Linux swap / Solaris
        /dev/hdc2  34  30515  1958773320    5  Extended

Which would mean that the swap partition reaches into the real
data partition and would corrupt it.

That's what I am concerned about.
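
The arithmetic behind this concern can be checked with a short Python sketch. It assumes the geometry from the fdisk output above and, as a simplification, treats partitions as starting exactly on cylinder boundaries (a real DOS table starts the first partition at sector 63); the helper names are made up for illustration.

```python
HEADS, SECTORS_PER_TRACK, CYLINDERS = 255, 63, 30515
SECTORS_PER_CYLINDER = HEADS * SECTORS_PER_TRACK   # 16065


def disk_bytes(sector_size):
    """Capacity implied by the CHS geometry for a given sector size."""
    return SECTORS_PER_CYLINDER * CYLINDERS * sector_size


def partition_byte_range(first_cyl, last_cyl, sector_size):
    """Byte range [start, end) of a 1-based, inclusive cylinder range."""
    start = (first_cyl - 1) * SECTORS_PER_CYLINDER * sector_size
    end = last_cyl * SECTORS_PER_CYLINDER * sector_size
    return start, end


if __name__ == "__main__":
    print(disk_bytes(512))     # about 251 GB
    print(disk_bytes(4096))    # about 2 TB
    print(4096 // 512)         # scale factor applied to every byte offset

    # Where the data really sits after a byte-for-byte copy (512-byte units):
    old_swap = partition_byte_range(1, 33, 512)
    old_ext = partition_byte_range(34, 30515, 512)
    # Where the unchanged table claims swap lives if units become 4096:
    new_swap = partition_byte_range(1, 33, 4096)
    print(old_swap, old_ext, new_swap)
    print(new_swap[1] > old_ext[0])   # swap now overlaps the old data area
```

The same sector counts interpreted with a larger unit stretch every partition, so the reinterpreted swap range covers bytes that hold the copied data of the second partition.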


Jan
-- 


* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12  3:27     ` Jan Engelhardt
@ 2007-03-12  3:46       ` Andreas Dilger
  2007-03-12 12:17       ` Alan Cox
  2007-03-12 14:41       ` Jeff Garzik
  2 siblings, 0 replies; 29+ messages in thread
From: Andreas Dilger @ 2007-03-12  3:46 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

On Mar 12, 2007  04:27 +0100, Jan Engelhardt wrote:
> Assume this partition table on my current HD:
> 
> 	Disk /dev/hdc: 251.0 GB, 251000193024 bytes
> 	255 heads, 63 sectors/track, 30515 cylinders
> 	Units = cylinders of 16065 * 512 = 8225280 bytes
> 	
> 	   Device Start  End      Blocks   Id  System
> 	/dev/hdc1   1     33      265041   82  Linux swap / Solaris
> 	/dev/hdc2  34  30515   244846665    5  Extended
> 
> That is, 255 * 63 * 30515 * 512 == roughly 251 GB.
> 
> Now, if this disk was copied byte per byte (/bin/dd) to a
> 4096-based disk, and Linux would start using a sector size of
> 4096

The easy answer is "don't do that".  You should make a new partition
table on the 4096-byte sector drive (each of the partitions at least
as large as the old ones), and then copy the content of each of the
partitions separately onto the new disk.

> Although I would not mind the 2 TB, the partition table would
> read quite differently (note the Blocks column which is
> multiplied by 8 (512x8=4096))
> 
>            Device Start  End      Blocks   Id  System
>         /dev/hdc1   1     33     2120328   82  Linux swap / Solaris
>         /dev/hdc2  34  30515  1958773320    5  Extended
> 
> Which would mean that the swap partition reaches into the real
> data partition and would corrupt it.

In the same way you can't copy raw disks from one vendor's RAID 5
array and put them into another vendor's (or even model's) RAID 5 array,
or you can't do a raw copy of a partitioned disk and expect it to
suddenly become an LVM volume, you can't do raw disk copies between
drives with different sector size.

You also won't be able to use a copy of an ext3 filesystems with 1kB
blocksize onto a 4kB sector size device - the ext3 code will detect
this and refuse to mount.  At that point you need to do a tar/untar
(or whatever) to copy the data instead of a raw partition copy.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



* Re: impact of 4k sector size on the IO & FS stack
  2007-03-11 22:51 impact of 4k sector size on the IO & FS stack Ric Wheeler
  2007-03-11 23:14 ` Jan Engelhardt
  2007-03-12  0:02 ` Alan Cox
@ 2007-03-12  8:18 ` Christoph Hellwig
  2007-03-12 14:40   ` James Bottomley
  2007-03-12 14:45   ` Jeff Garzik
  2 siblings, 2 replies; 29+ messages in thread
From: Christoph Hellwig @ 2007-03-12  8:18 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: linux-scsi, linux-fsdevel, Linux-ide

On Sun, Mar 11, 2007 at 06:51:53PM -0400, Ric Wheeler wrote:
> 
> During the recent IO/FS workshop, we spoke briefly about the coming 
> change to a 4k sector size for disks on linux. If I recall correctly, 
> the general feeling was that the impact was not significant since we 
> already do most file system IO in 4k page sizes and should be fine as 
> long as we partition drives correctly and avoid non-4k aligned partitions.
> 
> Are there other concerns in the IO or FS stack that we should bring up 
> with vendors?  I have been asked to summarize the impact of 4k sectors 
> on linux  for a disk vendor gathering and want to make sure that I put 
> all of our linux specific items into that summary...

The FS stack and higher levels of the I/O stack should be mostly ready.
The S/390 DASDs are commonly used with 4k sector sizes, and we've had
the occasional 2k sector SCSI MO device as well.  It would be nice to
get samples of large sector size ATA devices into the hands of developers
to do real world testing of the whole stack.


* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12  3:27     ` Jan Engelhardt
  2007-03-12  3:46       ` Andreas Dilger
@ 2007-03-12 12:17       ` Alan Cox
  2007-03-12 14:41       ` Jeff Garzik
  2 siblings, 0 replies; 29+ messages in thread
From: Alan Cox @ 2007-03-12 12:17 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

> Now, if this disk was copied byte per byte (/bin/dd) to a
> 4096-based disk, and Linux would start using a sector size of
> 4096, then I would suddenly have

The ATA drives I'm aware of report 512 byte sector size, do 512 byte
I/O's but use 4K physical sectors, and to get sane performance expect the
OS to issue sensibly sized I/O requests.


* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12  0:44   ` Jeff Garzik
  2007-03-12  2:37     ` Ric Wheeler
@ 2007-03-12 12:24     ` Alan Cox
  2007-03-12 13:32       ` Ric Wheeler
  2007-03-12 14:26       ` Jeff Garzik
  1 sibling, 2 replies; 29+ messages in thread
From: Alan Cox @ 2007-03-12 12:24 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

> First generation of 1K sector drives will continue to use the same 
> 512-byte ATA sector size you are familiar with.  A single 512-byte write 
> will cause the drive to perform a read-modify-write cycle.  This 
> configuration is physical 1K sector, logical 512b sector.

The problem case is "read-modify-screwup"

At that point we've trashed the block we were writing (a well studied
recovery case), and we've blasted some previously sane, totally
unrelated sector of data out of existence. That's why we need to know
ideally if they are doing the write to a different physical block when
they do this, so that we don't lose the old data. My guess is they won't
as it'll be hard.
 
> A future configuration will change the logical ATA interface away from 
> 512-byte sectors to 1K or 4K.  Here, it is impossible to read a quantity 
> smaller than 1K or 4K, whatever the sector size is.

That one I'm not worried about - other than "guess how Redmond decide to
make partition tables work" that one is mostly easy (be fun to see how
many controllers simply can't cope with the command formats)

Alan


* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12 12:24     ` Alan Cox
@ 2007-03-12 13:32       ` Ric Wheeler
  2007-03-12 15:21         ` Douglas Gilbert
  2007-03-12 14:26       ` Jeff Garzik
  1 sibling, 1 reply; 29+ messages in thread
From: Ric Wheeler @ 2007-03-12 13:32 UTC (permalink / raw)
  To: Alan Cox; +Cc: Jeff Garzik, linux-scsi, linux-fsdevel, Linux-ide

Alan Cox wrote:
>> First generation of 1K sector drives will continue to use the same 
>> 512-byte ATA sector size you are familiar with.  A single 512-byte write 
>> will cause the drive to perform a read-modify-write cycle.  This 
>> configuration is physical 1K sector, logical 512b sector.
> 
> The problem case is "read-modify-screwup"
> 
> At that point we've trashed the block we were writing (a well studied
> recovery case), and we've blasted some previously sane, totally
> unrelated sector of data out of existence. That's why we need to know
> ideally if they are doing the write to a different physical block when
> they do this, so that we don't lose the old data. My guess is they won't
> as it'll be hard.

I think that the firmware would have to do this in the drive's write 
cache and would always write the modified data back to the same physical 
sector (unless a media error forces a sector remap).

If firmware modifies the seven other 512-byte sectors that it read in order 
to do the one 512-byte sector write, then we certainly would see what you 
describe happen.

In general, it would seem to be a bad idea to allocate a different 
physical sector to underpin this kind of read-modify-write since that 
would kill contiguous layout of files, etc.

>> A future configuration will change the logical ATA interface away from 
>> 512-byte sectors to 1K or 4K.  Here, it is impossible to read a quantity 
>> smaller than 1K or 4K, whatever the sector size is.
> 
> That one I'm not worried about - other than "guess how Redmond decide to
> make partition tables work" that one is mostly easy (be fun to see how
> many controllers simply can't cope with the command formats)
> 

This will be interesting to find out. I will be sharing a panel with 
some BIOS & MS people, so I will update all on what I hear,

ric


* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12 12:24     ` Alan Cox
  2007-03-12 13:32       ` Ric Wheeler
@ 2007-03-12 14:26       ` Jeff Garzik
  2007-03-13  5:11         ` Andreas Dilger
  1 sibling, 1 reply; 29+ messages in thread
From: Jeff Garzik @ 2007-03-12 14:26 UTC (permalink / raw)
  To: Alan Cox; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

Alan Cox wrote:
>> First generation of 1K sector drives will continue to use the same 
>> 512-byte ATA sector size you are familiar with.  A single 512-byte write 
>> will cause the drive to perform a read-modify-write cycle.  This 
>> configuration is physical 1K sector, logical 512b sector.
> 
> The problem case is "read-modify-screwup"
> 
> At that point we've trashed the block we were writing (a well studied
> recovery case), and we've blasted some previously sane, totally
> unrelated sector of data out of existence. That's why we need to know
> ideally if they are doing the write to a different physical block when
> they do this, so that we don't lose the old data. My guess is they won't
> as it'll be hard.

Strict ATA command set answer:  you will have no idea what goes on under 
the hood.  The current 512-b interface stays /exactly/ the same, save 
for a word or two in IDENTIFY DEVICE telling you the "secret" physical 
sector size.  If all your I/Os are aligned properly, then you need not 
worry about RMW cycles, as they will not occur.

Intuition answer:  they will use their firmware-internal standard code 
for scheduling reads and writes, and will only reallocate sectors as 
needed by media failure or similar events.

The "M" part of the modify cycle happens in disk ram.  So from the 
disk's point of view, a single 512-b write would require reading a 
single 1K hard sector, updating the contents in cache RAM, and then 
writing a single 1K hard sector.  The reading of the unknown half of the 
sector can be scheduled well in advance, usually, since writeback 
caching gives the drive plenty of time (relatively speaking) to optimize 
things.

Overall, it definitely adds a few more points of failure, but we can't 
do much at all about those points of failure.

In my own experiments on my own Fedora workstation, ~66% of IOs in Linux 
started on an odd sector, and ~33% started on even-numbered sectors.  For 
a 1K-sector drive with 'odd' alignment, the configuration Microsoft will 
likely want, that means the majority of disk transactions will avoid a 
RMW cycle, but a still-numerous minority will not.  I did not test 
transfer length, to see how many transfers /ended/ on an odd sector, 
thus determining how many RMW cycles the tail of an average I/O requires.
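
The head/tail question can be sketched in Python. This assumes a simple model in which physical sector boundaries fall on logical LBAs congruent to `offset` modulo `per_physical` (so `offset=1, per_physical=2` is the 'odd' alignment on a 1K drive); the function names and the sample trace are hypothetical, not measured data.

```python
def rmw_edges(start_lba, n_sectors, per_physical=2, offset=1):
    """Return (head_rmw, tail_rmw): is each end of the write misaligned?"""
    head = (start_lba - offset) % per_physical != 0
    tail = (start_lba + n_sectors - offset) % per_physical != 0
    return head, tail


def rmw_cycles(start_lba, n_sectors, per_physical=2, offset=1):
    """Count physical sectors that need a read-modify-write for one write."""
    head, tail = rmw_edges(start_lba, n_sectors, per_physical, offset)
    first_phys = (start_lba - offset) // per_physical
    last_phys = (start_lba + n_sectors - 1 - offset) // per_physical
    if first_phys == last_phys:
        # Whole write sits inside one physical sector: at most one RMW.
        return 1 if (head or tail) else 0
    return int(head) + int(tail)


if __name__ == "__main__":
    # Made-up (start_lba, n_sectors) pairs, purely for illustration.
    trace = [(63, 8), (64, 8), (64, 1), (71, 2), (100, 3)]
    for start, count in trace:
        print(start, count, rmw_edges(start, count), rmw_cycles(start, count))
```

Feeding a real block trace through `rmw_cycles` would give the tail-side figure that the start-sector parity count alone does not capture.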



>> A future configuration will change the logical ATA interface away from 
>> 512-byte sectors to 1K or 4K.  Here, it is impossible to read a quantity 
>> smaller than 1K or 4K, whatever the sector size is.
> 
> That one I'm not worried about - other than "guess how Redmond decide to
> make partition tables work" that one is mostly easy (be fun to see how
> many controllers simply can't cope with the command formats)

Indeed...

	Jeff





* Re: impact of 4k sector size on the IO & FS stack
  2007-03-11 23:14 ` Jan Engelhardt
  2007-03-12  2:45   ` Ric Wheeler
@ 2007-03-12 14:36   ` Jeff Garzik
  2007-03-12 15:45     ` Alan Cox
  2007-03-12 18:31     ` Bryan Henderson
  1 sibling, 2 replies; 29+ messages in thread
From: Jeff Garzik @ 2007-03-12 14:36 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

Jan Engelhardt wrote:
> On Mar 11 2007 18:51, Ric Wheeler wrote:
>> During the recent IO/FS workshop, we spoke briefly about the
>> coming change to a 4k sector size for disks on linux. If I
>> recall correctly, the general feeling was that the impact was
>> not significant since we already do most file system IO in 4k
>> page sizes and should be fine as long as we partition drives
>> correctly and avoid non-4k aligned partitions.
> 
> Sorry about jumping right in, but what about an 'old-style'
> partition table that relies on 512 as a unit?

For 1K/4K physical sector size, where logical sector size remains 512-b, 
nothing changes.  DOS partitions start partitions on odd-numbered 
sectors, so presuming you have odd-aligned disks, life is good.

For 1K/4K logical sector sizes, who knows.  EFI?  <grins and runs>

Certainly seems incompatible with the current popular DOS partition format.

	Jeff





* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12  8:18 ` Christoph Hellwig
@ 2007-03-12 14:40   ` James Bottomley
  2007-03-12 14:45   ` Jeff Garzik
  1 sibling, 0 replies; 29+ messages in thread
From: James Bottomley @ 2007-03-12 14:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

On Mon, 2007-03-12 at 08:18 +0000, Christoph Hellwig wrote:
> The FS stack and higher levels of the I/O stack should be mostly ready.
> The S/390 DASDs are commonly used with 4k sector sizes, and we've had
> the occasional 2k sector SCSI MO device as well.  It would be nice to
> get samples of large sector size ATA devices into the hands of developers
> to do real world testing of the whole stack.

Theoretically, we already have the capacity to verify this, although
not with ATA. However, since ATA uses virtually the same paths as SCSI,
we could test with variable sector SCSI devices, and SCSI does allow you
to reformat the device with different sector sizes.

James





* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12  3:27     ` Jan Engelhardt
  2007-03-12  3:46       ` Andreas Dilger
  2007-03-12 12:17       ` Alan Cox
@ 2007-03-12 14:41       ` Jeff Garzik
  2 siblings, 0 replies; 29+ messages in thread
From: Jeff Garzik @ 2007-03-12 14:41 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

Jan Engelhardt wrote:
> On Mar 11 2007 22:45, Ric Wheeler wrote:
>> Jan Engelhardt wrote:
>>> On Mar 11 2007 18:51, Ric Wheeler wrote:
>>>
>>>> During the recent IO/FS workshop, we spoke briefly about the
>>>> coming change to a 4k sector size for disks on linux. If I
>>>> recall correctly, the general feeling was that the impact was
>>>> not significant since we already do most file system IO in 4k
>>>> page sizes and should be fine as long as we partition drives
>>>> correctly and avoid non-4k aligned partitions.
>>>>
>>> Sorry about jumping right in, but what about an 'old-style'
>>> partition table that relies on 512 as a unit?
>>>
>>>
>> I think that the normal case would involve new drives which
>> would need to be partitioned in 4k aligned partitions.
>> Shouldn't that work regardless of the unit used in the
>> partition table?
> 
> Assume this partition table on my current HD:
> 
> 	Disk /dev/hdc: 251.0 GB, 251000193024 bytes
> 	255 heads, 63 sectors/track, 30515 cylinders
> 	Units = cylinders of 16065 * 512 = 8225280 bytes
> 	
> 	   Device Start  End      Blocks   Id  System
> 	/dev/hdc1   1     33      265041   82  Linux swap / Solaris
> 	/dev/hdc2  34  30515   244846665    5  Extended
> 
> That is, 255 * 63 * 30515 * 512 == roughly 251 GB.
> 
> Now, if this disk was copied byte per byte (/bin/dd) to a
> 4096-based disk, and Linux would start using a sector size of
> 4096, then I would suddenly have
> 
> 255 * 63 * 30515 * 4096 == 2 TB
> 
> Although I would not mind the 2 TB, the partition table would
> read quite differently (note the Blocks column which is
> multiplied by 8 (512x8=4096))

At this level, for RMW drives, nothing changes.  The partition software, 
ATA driver, and all other bits continue to think that sector size == 512 
bytes.

The partition software /hopefully/ becomes smart enough to understand 
the alignment necessary, but that is not a requirement.

This is the key to understanding the difference between a physical 
(==platters) sector size change without a logical (==ATA interface) 
sector size change.


>            Device Start  End      Blocks   Id  System
>         /dev/hdc1   1     33     2120328   82  Linux swap / Solaris
>         /dev/hdc2  34  30515  1958773320    5  Extended
> 
> Which would mean that the swap partition reaches into the real
> data partition and would corrupt it.

For RMW drives, RMW cycles would occur but not corruption.

For non-RMW drives, this just wouldn't occur.

	Jeff




* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12  8:18 ` Christoph Hellwig
  2007-03-12 14:40   ` James Bottomley
@ 2007-03-12 14:45   ` Jeff Garzik
  2007-03-12 14:57     ` Christoph Hellwig
  1 sibling, 1 reply; 29+ messages in thread
From: Jeff Garzik @ 2007-03-12 14:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

Christoph Hellwig wrote:
> the occasional 2k sector SCSI MO device as well.  It would be nice to
> get samples of large sector size ATA devices into the hands of developers
> to do real world testing of the whole stack.

"hands of developers" meaning you specifically?  :)

I've had a 512b-logical/1K-physical ATA test drive for a few months now, 
and another couple arrived today.

Hopefully people can parse what I've been posting, since I cannot give 
out raw numbers or data at this time.

Of course, with RMW drives that leave the 512-b logical interface 
untouched, I had expected that they would Just Work(tm) and that is 
pretty much what happened.

	Jeff




* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12 14:45   ` Jeff Garzik
@ 2007-03-12 14:57     ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2007-03-12 14:57 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Christoph Hellwig, Ric Wheeler, linux-scsi, linux-fsdevel,
	Linux-ide

On Mon, Mar 12, 2007 at 10:45:16AM -0400, Jeff Garzik wrote:
> Christoph Hellwig wrote:
> >the occasional 2k sector SCSI MO device as well.  It would be nice to
> >get samples of large sector size ATA devices into the hands of developers
> >to do real world testing of the whole stack.
> 
> "hands of developers" meaning you specifically?  :)

No.  I probably wouldn't have time to deal with it as well.

> I've had a 512b-logical/1K-physical ATA test drive for a few months now, 
> and another couple arrived today.

Ok, that's exactly what I meant.



* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12 13:32       ` Ric Wheeler
@ 2007-03-12 15:21         ` Douglas Gilbert
  2007-03-12 16:08           ` Martin K. Petersen
  0 siblings, 1 reply; 29+ messages in thread
From: Douglas Gilbert @ 2007-03-12 15:21 UTC (permalink / raw)
  To: ric; +Cc: Alan Cox, Jeff Garzik, linux-scsi, linux-fsdevel, Linux-ide

Ric Wheeler wrote:
> Alan Cox wrote:
>>> First generation of 1K sector drives will continue to use the same
>>> 512-byte ATA sector size you are familiar with.  A single 512-byte
>>> write will cause the drive to perform a read-modify-write cycle. 
>>> This configuration is physical 1K sector, logical 512b sector.
>>
>> The problem case is "read-modify-screwup"
>>
>> At that point we've trashed the block we were writing (a well studied
>> recovery case), and we've blasted some previously sane, totally
>> unrelated sector of data out of existence. That's why we need to know
>> ideally if they are doing the write to a different physical block when
>> they do this, so that we don't lose the old data. My guess is they won't
>> as it'll be hard.
> 
> I think that the firmware would have to do this in the drive's write
> cache and would always write the modified data back to the same physical
> sector (unless a media error forces a sector remap).
> 
> If firmware modifies the 7 512 byte sectors that it read to do the 1 512
> byte sector write, then we certainly would see what you describe happen.
> 
> In general, it would seem to be a bad idea to allocate a different
> physical sector to underpin this kind of read-modify-write since that
> would kill contiguous layout of files, etc.
> 
>>> A future configuration will change the logical ATA interface away
>>> from 512-byte sectors to 1K or 4K.  Here, it is impossible to read a
>>> quantity smaller than 1K or 4K, whatever the sector size is.
>>
>> That one I'm not worried about - other than "guess how Redmond decide to
>> make partition tables work" that one is mostly easy (be fun to see how
>> many controllers simply can't cope with the command formats)
>>
> 
> This will be interesting to find out. I will be sharing a panel with
> some BIOS & MS people, so I will update all on what I hear.

Ric,
Just to add a SCSI perspective, it looks like 4 KB sectored
disks will be almost exclusively ATA devices. It is being
done to improve capacity at the expense of performance.
[SCSI/FC/SAS disks typically trade off capacity for better
performance.]

Support for disks with smaller logical block size than
physical block size has already been added to SBC-3. The
overview of this document gives a rationale:
www.t10.org/ftp/t10/document.06/06-034r5.pdf

SAT is now a standard and an agenda item for SAT-2 is
to wire ATA8-ACS's large sector size support to the
additions to SBC-3 mentioned above.


I'm not sure how this stuff plays with end to end data
protection :-)
Most SCSI disks currently allow formatting sizes of 512
up to 528 bytes per logical block.

Doug Gilbert
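
The read-modify-write cycle quoted above can be sketched roughly as
follows. This is a toy model only, assuming a 4096-byte physical sector
that holds eight 512-byte logical sectors; `write_logical` and the
`media` buffer are made-up names for illustration and do not describe
any real drive firmware.

```python
LOGICAL = 512                     # logical sector size seen by the host (bytes)
PHYSICAL = 4096                   # assumed physical sector size on the media
PER_PHYS = PHYSICAL // LOGICAL    # 8 logical sectors per physical sector


def write_logical(media: bytearray, lba: int, data: bytes) -> None:
    """Write one logical sector via read-modify-write of its physical sector."""
    assert len(data) == LOGICAL
    phys = lba // PER_PHYS                    # physical sector holding this LBA
    start = phys * PHYSICAL
    # Read: fetch the whole physical sector (all 8 logical sectors).
    buf = bytearray(media[start:start + PHYSICAL])
    # Modify: patch only the 512 bytes the host asked to change.
    off = (lba % PER_PHYS) * LOGICAL
    buf[off:off + LOGICAL] = data
    # Write: the whole physical sector goes back to the same location, so a
    # failure at this step puts the 7 untouched neighbours at risk as well.
    media[start:start + PHYSICAL] = buf


media = bytearray(b"A" * PHYSICAL * 2)        # two physical sectors of "old" data
write_logical(media, 3, b"B" * LOGICAL)       # host rewrites logical sector 3 only
```

After the call only bytes 1536..2047 differ from the old contents, yet
the full 4096-byte physical sector was rewritten, which is the window
Alan's "read-modify-screwup" case is about.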





* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12 14:36   ` Jeff Garzik
@ 2007-03-12 15:45     ` Alan Cox
  2007-03-12 18:31     ` Bryan Henderson
  1 sibling, 0 replies; 29+ messages in thread
From: Alan Cox @ 2007-03-12 15:45 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Jan Engelhardt, Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

> For 1K/4K logical sector sizes, who knows.  EFI?  <grins and runs>
> Certainly seems incompatible with the current popular DOS partition format.

It's a bit messier than that. There are two interpretations of "DOS"
partition formats found on 2K sector size magneto opticals. One is that
everything is the same as before (as if sectors were 512 byte), the other
is a different "everything is the same" which scales by the 2K sector
size. The two are of course wonderfully incompatible.
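
A small sketch of why the two readings disagree, under the assumption
that the difference is the unit of the start field in the partition
table entry (512-byte units versus native 2K sectors). The function
name and the start value 63 are illustrative only and are not taken
from a real MO disk.

```python
def partition_byte_offset(start_field: int, sector_size: int, scaled: bool) -> int:
    """Byte offset of a partition whose table entry holds `start_field`.

    scaled=False: the field counts 512-byte units, as on an ordinary disk.
    scaled=True:  the field counts native sectors of `sector_size` bytes.
    """
    unit = sector_size if scaled else 512
    return start_field * unit


start = 63                                     # hypothetical table value
as_512 = partition_byte_offset(start, 2048, scaled=False)
as_2k = partition_byte_offset(start, 2048, scaled=True)
print(as_512, as_2k)
```

The same table entry lands the partition at two different byte offsets,
and under the unscaled reading the offset need not even fall on a 2K
sector boundary.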


* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12 15:21         ` Douglas Gilbert
@ 2007-03-12 16:08           ` Martin K. Petersen
  0 siblings, 0 replies; 29+ messages in thread
From: Martin K. Petersen @ 2007-03-12 16:08 UTC (permalink / raw)
  To: dougg; +Cc: ric, Alan Cox, Jeff Garzik, linux-scsi, linux-fsdevel, Linux-ide

>>>>> "Doug" == Douglas Gilbert <dougg@torque.net> writes:

Doug> SAT is now a standard and an agenda item for SAT-2 is to wire
Doug> ATA8-ACS's large sector size support to the additions to SBC-3
Doug> mentioned above.

Doug> I'm not sure how this stuff plays with end to end data
Doug> protection :-) 

The proposal you forwarded talks about "transformed protection
information" but doesn't go into details.  

Assuming the drive has 4KB physical blocks and receives 512 byte
logical blocks, it's easy to verify the integrity of the 512 byte
sector and then do R-M-W on the physical.  Similarly, on the way out
logical guard and ref tags could be generated after integrity of the
physical has been verified.

The only thing that really bites is that the app tag will be per
physical block and not per logical (unless the drive leaves enough
space to store 8 tags per 4KB sector).

-- 
Martin K. Petersen	Oracle Linux Engineering
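
To put a number on "8 tags per 4KB sector", here is a rough sketch. It
assumes the usual 8-byte protection tuple per logical block (2-byte
guard, 2-byte application tag, 4-byte reference tag, big-endian) and a
guard computed as a CRC-16 with polynomial 0x8BB7; neither detail is
spelled out in this thread, and the helper names are invented for the
example.

```python
import struct

LOGICAL = 512
PER_PHYS = 8            # 512-byte logical blocks per assumed 4 KB physical block


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16, polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protection_tuple(block: bytes, app_tag: int, ref_tag: int) -> bytes:
    """Pack guard (2 bytes), app tag (2 bytes), ref tag (4 bytes)."""
    return struct.pack(">HHI", crc16_t10dif(block), app_tag, ref_tag & 0xFFFFFFFF)


physical = bytes(LOGICAL * PER_PHYS)           # one 4 KB physical block of zeros
tuples = [
    protection_tuple(physical[i * LOGICAL:(i + 1) * LOGICAL],
                     app_tag=i, ref_tag=100 + i)
    for i in range(PER_PHYS)
]
pi_bytes = b"".join(tuples)                    # room needed to keep all 8 tuples
```

Under these assumptions a drive that wants to preserve a distinct app
tag per logical block has to set aside 64 bytes per 4 KB physical
sector, rather than the 8 bytes a single per-physical tuple would take.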



* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12 14:36   ` Jeff Garzik
  2007-03-12 15:45     ` Alan Cox
@ 2007-03-12 18:31     ` Bryan Henderson
  2007-03-12 18:37       ` Sergei Shtylyov
  2007-03-12 19:16       ` Douglas Gilbert
  1 sibling, 2 replies; 29+ messages in thread
From: Bryan Henderson @ 2007-03-12 18:31 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Jan Engelhardt, linux-fsdevel, Linux-ide, linux-scsi, Ric Wheeler

>DOS partitions start partitions on odd-numbered sectors

I don't get this.  If you mean partitions defined by the classic DOS 
partition table format, then AFAICS, such a partition can start in any 
sector.

>so presuming you have odd-aligned disks, life is good.

What is an odd-aligned disk?



* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12 18:31     ` Bryan Henderson
@ 2007-03-12 18:37       ` Sergei Shtylyov
  2007-03-12 20:52         ` Bryan Henderson
  2007-03-12 19:16       ` Douglas Gilbert
  1 sibling, 1 reply; 29+ messages in thread
From: Sergei Shtylyov @ 2007-03-12 18:37 UTC (permalink / raw)
  To: Bryan Henderson
  Cc: Jeff Garzik, Jan Engelhardt, linux-fsdevel, Linux-ide, linux-scsi,
	Ric Wheeler

Hello.

Bryan Henderson wrote:

>>DOS partitions start partitions on odd-numbered sectors

> I don't get this.  If you mean partitions defined by the classic DOS 
> partition table format, then AFAICS, such a partition can start in any 
> sector.

    Only at "logical cylinder boundary" (except for the first partition).



* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12 18:31     ` Bryan Henderson
  2007-03-12 18:37       ` Sergei Shtylyov
@ 2007-03-12 19:16       ` Douglas Gilbert
  2007-03-12 19:28         ` Jeff Garzik
  1 sibling, 1 reply; 29+ messages in thread
From: Douglas Gilbert @ 2007-03-12 19:16 UTC (permalink / raw)
  To: Bryan Henderson
  Cc: Jeff Garzik, Jan Engelhardt, linux-fsdevel, Linux-ide, linux-scsi,
	Ric Wheeler

Bryan Henderson wrote:
>> DOS partitions start partitions on odd-numbered sectors
> 
> I don't get this.  If you mean partitions defined by the classic DOS 
> partition table format, then AFAICS, such a partition can start in any 
> sector.

Bryan,
Typically the first partition on a DOS partitioned disk
starts at the next available sector after the MBR
which, for some bizarre reason, is 63 sectors long.
Hence:

# fdisk -lu /dev/hda

Disk /dev/hda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders, total 156301488 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *          63    18314099     9157018+   c  W95 FAT32 (LBA)
/dev/hda2        18314100    19551104      618502+  82  Linux swap / Solaris
/dev/hda4        19551105   156296384    68372640   83  Linux


> 
>> so presuming you have odd-aligned disks, life is good.
> 
> What is an odd-aligned disk?

s/disk/partition/ ?
Perhaps hda1 and hda4 above are examples.

Doug Gilbert
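
The start sectors in the listing above can be checked mechanically. A
minimal sketch, assuming that physical sector 0 begins at LBA 0 (the
"even-aligned" case); `aligned` is a made-up helper, not part of fdisk.

```python
# Start sectors (512-byte units) copied from the fdisk listing above.
starts = {"hda1": 63, "hda2": 18314100, "hda4": 19551105}


def aligned(start_lba: int, physical_bytes: int, logical_bytes: int = 512) -> bool:
    """True if start_lba falls on a physical-sector boundary.

    Assumes physical sector 0 begins at LBA 0.
    """
    return start_lba % (physical_bytes // logical_bytes) == 0


report = {
    name: {"odd": lba % 2 == 1, "1K": aligned(lba, 1024), "4K": aligned(lba, 4096)}
    for name, lba in starts.items()
}
for name, row in report.items():
    print(name, row)
```

hda1 and hda4 start on odd sectors and hda2 on an even one; none of the
three would be 4K-aligned under this assumption.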



* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12 19:16       ` Douglas Gilbert
@ 2007-03-12 19:28         ` Jeff Garzik
  0 siblings, 0 replies; 29+ messages in thread
From: Jeff Garzik @ 2007-03-12 19:28 UTC (permalink / raw)
  To: dougg
  Cc: Bryan Henderson, Jan Engelhardt, linux-fsdevel, Linux-ide,
	linux-scsi, Ric Wheeler

Douglas Gilbert wrote:
> Bryan Henderson wrote:
>> What is an odd-aligned disk?
> 
> s/disk/partition/ ?


Example:  An odd-aligned disk in the 512-b logical / 1K-physical 
scenario is where odd LBAs indicate the start of a 1K physical sector. 
An even-aligned disk is where even LBAs indicate the start of a 1K 
physical sector.

In order to avoid too many RMW cycles, partition software SHOULD (using 
IETF language) be aware of the underlying physical sector size 
alignment, in order to align partitions for optimal performance.

	Jeff
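
One way to make the odd-aligned/even-aligned distinction concrete is to
count how many physical sectors a write only partly covers, since each
of those needs a read-modify-write. This is a sketch under the
definitions given above; `rmw_cycles` and its parameters are
hypothetical names, not an existing kernel or libata interface.

```python
def rmw_cycles(lba: int, count: int, per_phys: int = 2, first_boundary: int = 0) -> int:
    """Number of physical sectors only partly covered by a write of
    `count` logical sectors starting at `lba`.

    per_phys:       logical sectors per physical sector (2 for 512b/1K).
    first_boundary: an LBA at which a physical sector starts
                    (0 = even-aligned, 1 = odd-aligned when per_phys=2).
    """
    if count <= 0:
        return 0
    rel_start = lba - first_boundary
    rel_end = rel_start + count                  # one past the last sector
    head = rel_start % per_phys                  # sectors skipped at the front
    tail = rel_end % per_phys                    # sectors used in the last one
    first_phys = rel_start // per_phys
    last_phys = (rel_end - 1) // per_phys
    if first_phys == last_phys:
        return 1 if (head or tail) else 0
    return (1 if head else 0) + (1 if tail else 0)


# A 4 KB write starting at LBA 63 on a 512b-logical / 1K-physical drive:
odd_aligned = rmw_cycles(63, 8, per_phys=2, first_boundary=1)
even_aligned = rmw_cycles(63, 8, per_phys=2, first_boundary=0)
```

The same I/O costs no read-modify-write on the odd-aligned drive and
two (head and tail) on the even-aligned one, which is why partitioning
software would need to know the drive's alignment.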




* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12 18:37       ` Sergei Shtylyov
@ 2007-03-12 20:52         ` Bryan Henderson
  0 siblings, 0 replies; 29+ messages in thread
From: Bryan Henderson @ 2007-03-12 20:52 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: Jeff Garzik, Jan Engelhardt, linux-fsdevel, Linux-ide, linux-scsi,
	Ric Wheeler

>> I don't get this.  If you mean partitions defined by the classic DOS 
>> partition table format, then AFAICS, such a partition can start in any 
>> sector.
>
>    Only at "logical cylinder boundary" (except for the first partition).

That's a requirement in ancient DOS systems that use CHS addressing 
(physical CHS, no less), isn't it (so you can properly convert a 
within-partition address to a within-disk address)?

While I would guess most people still partition disks that way (even 
util-linux fdisk seems to do it by default), they don't have to.

Doesn't matter for this discussion, though.  As Doug demonstrated, even 
when you do start at cylinder boundaries, half your partitions start on an 
even sector, because typical cylinders have an odd number of sectors.
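
The arithmetic behind that last point, as a short sketch using the
255-head, 63-sector geometry and the start sectors from Doug's fdisk
listing (the variable names are only for illustration):

```python
HEADS, SECTORS_PER_TRACK = 255, 63           # geometry from the fdisk listing
CYL = HEADS * SECTORS_PER_TRACK              # 16065 sectors per logical cylinder


def cylinder_start(n: int) -> int:
    """First LBA of logical cylinder n."""
    return n * CYL


# Because CYL is odd, consecutive cylinder starts alternate between
# even and odd LBAs.
parities = [cylinder_start(n) % 2 for n in range(6)]

# The hda2 and hda4 start sectors are exact cylinder multiples:
# one on an even-numbered cylinder, one on an odd-numbered cylinder.
hda2_cyl, hda2_rem = divmod(18314100, CYL)
hda4_cyl, hda4_rem = divmod(19551105, CYL)
```

Here hda2 begins at cylinder 1140 (an even LBA) and hda4 at cylinder
1217 (an odd LBA), so cylinder alignment by itself gives no consistent
parity.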



* Re: impact of 4k sector size on the IO & FS stack
  2007-03-12 14:26       ` Jeff Garzik
@ 2007-03-13  5:11         ` Andreas Dilger
  2007-03-13  6:34           ` Chris Wedgwood
  0 siblings, 1 reply; 29+ messages in thread
From: Andreas Dilger @ 2007-03-13  5:11 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Alan Cox, Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

On Mar 12, 2007  10:26 -0400, Jeff Garzik wrote:
> In my own experiments on my own Fedora workstation, ~66% of IOs in Linux 
> start on an odd sector, and ~33% started on even-numbered sectors.  For 
> a 1K-sector drive with 'odd' alignment, the configuration Microsoft will 
> likely want, that means the majority of disk transactions will avoid a 
> RMW cycle, but a still-numerous minority will not.

Isn't that purely an artifact of the DOS partition table alignment, possibly
skewed by the fact that most of your IO is on partition 1 & 3?  Hard to
believe this because of the nice even numbers though.

Since ext3 has at least 1kB blocksize and defaults to 4kB blocksize with
most modern disks because they are > 500MB in size, you should never
have misaligned writes generated by the filesystem itself.

> I did not test 
> transfer length, to see how many transfers /ended/ on an odd sector, 
> thus determining how many RMW cycles the tail of an average I/O requires.

I'd guess a vast majority of IO will have the end similarly misaligned as
the start.  Very little filesystem IO is 512 bytes, possibly excluding XFS
in an unusual mode.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
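
A quick sketch of the argument that the odd/even split comes from the
partition start rather than from the filesystem. It assumes 4 kB
filesystem blocks on 512-byte logical sectors and reuses two start
sectors from Doug's fdisk listing as examples; `absolute_lba` is a
made-up helper.

```python
FS_BLOCK = 4096
SECTOR = 512
SECTORS_PER_BLOCK = FS_BLOCK // SECTOR       # 8 logical sectors per fs block


def absolute_lba(partition_start: int, fs_block: int) -> int:
    """Disk LBA at which filesystem block `fs_block` begins."""
    return partition_start + fs_block * SECTORS_PER_BLOCK


# Every block of a partition starting at an odd LBA begins on an odd LBA...
odd_part = {absolute_lba(63, b) % 2 for b in range(1000)}
# ...and every block of a partition starting at an even LBA on an even one.
even_part = {absolute_lba(18314100, b) % 2 for b in range(1000)}
```

Since each block offset is a multiple of 8 sectors, the parity of every
block-aligned I/O is fixed by the partition's start sector alone.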



* Re: impact of 4k sector size on the IO & FS stack
  2007-03-13  5:11         ` Andreas Dilger
@ 2007-03-13  6:34           ` Chris Wedgwood
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Wedgwood @ 2007-03-13  6:34 UTC (permalink / raw)
  To: Jeff Garzik, Alan Cox, Ric Wheeler, linux-scsi, linux-fsdevel,
	Linux-ide

On Tue, Mar 13, 2007 at 01:11:44AM -0400, Andreas Dilger wrote:

> I'd guess a vast majority of IO will have the end similarly
> misaligned as the start.  Very little filesystem IO is 512 bytes,
> possibly excluding XFS in an unusual mode.

XFS (mkfs.xfs) can be told what the native sector size is and will
adjust writes accordingly.


end of thread, other threads:[~2007-03-13  6:34 UTC | newest]

Thread overview: 29+ messages
2007-03-11 22:51 impact of 4k sector size on the IO & FS stack Ric Wheeler
2007-03-11 23:14 ` Jan Engelhardt
2007-03-12  2:45   ` Ric Wheeler
2007-03-12  3:27     ` Jan Engelhardt
2007-03-12  3:46       ` Andreas Dilger
2007-03-12 12:17       ` Alan Cox
2007-03-12 14:41       ` Jeff Garzik
2007-03-12 14:36   ` Jeff Garzik
2007-03-12 15:45     ` Alan Cox
2007-03-12 18:31     ` Bryan Henderson
2007-03-12 18:37       ` Sergei Shtylyov
2007-03-12 20:52         ` Bryan Henderson
2007-03-12 19:16       ` Douglas Gilbert
2007-03-12 19:28         ` Jeff Garzik
2007-03-12  0:02 ` Alan Cox
2007-03-12  0:44   ` Jeff Garzik
2007-03-12  2:37     ` Ric Wheeler
2007-03-12 12:24     ` Alan Cox
2007-03-12 13:32       ` Ric Wheeler
2007-03-12 15:21         ` Douglas Gilbert
2007-03-12 16:08           ` Martin K. Petersen
2007-03-12 14:26       ` Jeff Garzik
2007-03-13  5:11         ` Andreas Dilger
2007-03-13  6:34           ` Chris Wedgwood
2007-03-12  2:41   ` Ric Wheeler
2007-03-12  8:18 ` Christoph Hellwig
2007-03-12 14:40   ` James Bottomley
2007-03-12 14:45   ` Jeff Garzik
2007-03-12 14:57     ` Christoph Hellwig
