* impact of 4k sector size on the IO & FS stack
@ 2007-03-11 22:51 Ric Wheeler
2007-03-11 23:14 ` Jan Engelhardt
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Ric Wheeler @ 2007-03-11 22:51 UTC (permalink / raw)
To: linux-scsi, linux-fsdevel, Linux-ide
During the recent IO/FS workshop, we spoke briefly about the coming
change to a 4k sector size for disks on linux. If I recall correctly,
the general feeling was that the impact was not significant since we
already do most file system IO in 4k page sizes and should be fine as
long as we partition drives correctly and avoid non-4k aligned partitions.
Are there other concerns in the IO or FS stack that we should bring up
with vendors? I have been asked to summarize the impact of 4k sectors
on linux for a disk vendor gathering and want to make sure that I put
all of our linux specific items into that summary...
ric
^ permalink raw reply [flat|nested] 29+ messages in thread

* Re: impact of 4k sector size on the IO & FS stack
2007-03-11 22:51 impact of 4k sector size on the IO & FS stack Ric Wheeler
@ 2007-03-11 23:14 ` Jan Engelhardt
2007-03-12 2:45 ` Ric Wheeler
2007-03-12 14:36 ` Jeff Garzik
2007-03-12 0:02 ` Alan Cox
2007-03-12 8:18 ` Christoph Hellwig
2 siblings, 2 replies; 29+ messages in thread
From: Jan Engelhardt @ 2007-03-11 23:14 UTC (permalink / raw)
To: Ric Wheeler; +Cc: linux-scsi, linux-fsdevel, Linux-ide

On Mar 11 2007 18:51, Ric Wheeler wrote:
>
> During the recent IO/FS workshop, we spoke briefly about the
> coming change to a 4k sector size for disks on linux. If I
> recall correctly, the general feeling was that the impact was
> not significant since we already do most file system IO in 4k
> page sizes and should be fine as long as we partition drives
> correctly and avoid non-4k aligned partitions.

Sorry about jumping right in, but what about an 'old-style'
partition table that relies on 512 as a unit?

> Are there other concerns in the IO or FS stack that we should
> bring up with vendors? I have been asked to summarize the
> impact of 4k sectors on linux for a disk vendor gathering and
> want to make sure that I put all of our linux specific items
> into that summary...

Jan
--
* Re: impact of 4k sector size on the IO & FS stack
2007-03-11 23:14 ` Jan Engelhardt
@ 2007-03-12 2:45 ` Ric Wheeler
2007-03-12 3:27 ` Jan Engelhardt
2007-03-12 14:36 ` Jeff Garzik
1 sibling, 1 reply; 29+ messages in thread
From: Ric Wheeler @ 2007-03-12 2:45 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: linux-scsi, linux-fsdevel, Linux-ide

Jan Engelhardt wrote:
> On Mar 11 2007 18:51, Ric Wheeler wrote:
>
>> During the recent IO/FS workshop, we spoke briefly about the
>> coming change to a 4k sector size for disks on linux. If I
>> recall correctly, the general feeling was that the impact was
>> not significant since we already do most file system IO in 4k
>> page sizes and should be fine as long as we partition drives
>> correctly and avoid non-4k aligned partitions.
>>
>
> Sorry about jumping right in, but what about an 'old-style'
> partition table that relies on 512 as a unit?
>
>
I think that the normal case would involve new drives which would need
to be partitioned in 4k aligned partitions. Shouldn't that work
regardless of the unit used in the partition table?

ric
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 2:45 ` Ric Wheeler
@ 2007-03-12 3:27 ` Jan Engelhardt
2007-03-12 3:46 ` Andreas Dilger
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Jan Engelhardt @ 2007-03-12 3:27 UTC (permalink / raw)
To: Ric Wheeler; +Cc: linux-scsi, linux-fsdevel, Linux-ide

On Mar 11 2007 22:45, Ric Wheeler wrote:
> Jan Engelhardt wrote:
>> On Mar 11 2007 18:51, Ric Wheeler wrote:
>>
>> > During the recent IO/FS workshop, we spoke briefly about the
>> > coming change to a 4k sector size for disks on linux. If I
>> > recall correctly, the general feeling was that the impact was
>> > not significant since we already do most file system IO in 4k
>> > page sizes and should be fine as long as we partition drives
>> > correctly and avoid non-4k aligned partitions.
>> >
>>
>> Sorry about jumping right in, but what about an 'old-style'
>> partition table that relies on 512 as a unit?
>>
>>
> I think that the normal case would involve new drives which
> would need to be partitioned in 4k aligned partitions.
> Shouldn't that work regardless of the unit used in the
> partition table?

Assume this partition table on my current HD:

Disk /dev/hdc: 251.0 GB, 251000193024 bytes
255 heads, 63 sectors/track, 30515 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device      Start    End     Blocks  Id  System
/dev/hdc1          1     33     265041  82  Linux swap / Solaris
/dev/hdc2         34  30515  244846665   5  Extended

That is, 255 * 63 * 30515 * 512 == roughly 251 GB.

Now, if this disk was copied byte per byte (/bin/dd) to a
4096-based disk, and Linux would start using a sector size of
4096, then I would suddenly have

255 * 63 * 30515 * 4096 == 2 TB

Although I would not mind the 2 TB, the partition table would
read quite differently (note the Blocks column which is
multiplied by 4 (512x4=4096))

   Device      Start    End     Blocks  Id  System
/dev/hdc1          1     33    1060164  82  Linux swap / Solaris
/dev/hdc2         34  30515  979386660   5  Extended

Which would mean that the swap partition reaches into the real
data partition and would corrupt it.

That's what I am concerned about.

Jan
--
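The arithmetic in the message above can be checked with a short Python sketch. It assumes that the partition table stores sector counts and that fdisk's Blocks column is in 1 KiB units; the function names are illustrative only. Under those assumptions 4096 / 512 is 8, so the Blocks values would grow by a factor of 8 rather than the 4 quoted above, which makes the overlap concern larger, not smaller.

```python
# Geometry taken from the fdisk listing quoted in the thread.
HEADS, SECTORS_PER_TRACK, CYLINDERS = 255, 63, 30515


def capacity_bytes(sector_bytes):
    """Capacity implied by the CHS geometry for a given sector size."""
    return HEADS * SECTORS_PER_TRACK * CYLINDERS * sector_bytes


def reinterpreted_blocks(blocks_1k, old_sector=512, new_sector=4096):
    """Blocks (1 KiB units, an assumption about fdisk's column) when the
    same sector count is read back with a larger sector size."""
    sectors = blocks_1k * 1024 // old_sector
    return sectors * new_sector // 1024


if __name__ == "__main__":
    print(capacity_bytes(512))           # about 251 GB
    print(capacity_bytes(4096))          # about 2 TB
    print(reinterpreted_blocks(265041))  # swap partition, 8x larger
```
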
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 3:27 ` Jan Engelhardt
@ 2007-03-12 3:46 ` Andreas Dilger
2007-03-12 12:17 ` Alan Cox
2007-03-12 14:41 ` Jeff Garzik
2 siblings, 0 replies; 29+ messages in thread
From: Andreas Dilger @ 2007-03-12 3:46 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

On Mar 12, 2007 04:27 +0100, Jan Engelhardt wrote:
> Assume this partition table on my current HD:
>
> Disk /dev/hdc: 251.0 GB, 251000193024 bytes
> 255 heads, 63 sectors/track, 30515 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
>    Device      Start    End     Blocks  Id  System
> /dev/hdc1          1     33     265041  82  Linux swap / Solaris
> /dev/hdc2         34  30515  244846665   5  Extended
>
> That is, 255 * 63 * 30515 * 512 == roughly 251 GB.
>
> Now, if this disk was copied byte per byte (/bin/dd) to a
> 4096-based disk, and Linux would start using a sector size of
> 4096

The easy answer is "don't do that". You should make a new partition
table on the 4096-byte sector drive (each of the partitions at least as
large as the old ones), and then copy the content of each of the
partitions separately onto the new disk.

> Although I would not mind the 2 TB, the partition table would
> read quite differently (note the Blocks column which is
> multiplied by 4 (512x4=4096))
>
>    Device      Start    End     Blocks  Id  System
> /dev/hdc1          1     33    1060164  82  Linux swap / Solaris
> /dev/hdc2         34  30515  979386660   5  Extended
>
> Which would mean that the swap partition reaches into the real
> data partition and would corrupt it.

In the same way you can't copy raw disks from one vendor's RAID 5 array
and put them into another vendor's (or even model's) RAID 5 array, or
you can't do a raw copy of a partitioned disk and expect it to suddenly
become an LVM volume, you can't do raw disk copies between drives with
different sector sizes.

You also won't be able to copy an ext3 filesystem with a 1kB blocksize
onto a 4kB sector size device - the ext3 code will detect this and
refuse to mount. At that point you need to do a tar/untar (or whatever)
to copy the data instead of a raw partition copy.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 3:27 ` Jan Engelhardt
2007-03-12 3:46 ` Andreas Dilger
@ 2007-03-12 12:17 ` Alan Cox
2007-03-12 14:41 ` Jeff Garzik
2 siblings, 0 replies; 29+ messages in thread
From: Alan Cox @ 2007-03-12 12:17 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

> Now, if this disk was copied byte per byte (/bin/dd) to a
> 4096-based disk, and Linux would start using a sector size of
> 4096, then I would suddenly have

The ATA drives I'm aware of report a 512 byte sector size and do 512
byte I/Os, but use 4K physical sectors and, to get sane performance,
expect the OS to issue sensibly sized I/O requests.
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 3:27 ` Jan Engelhardt
2007-03-12 3:46 ` Andreas Dilger
2007-03-12 12:17 ` Alan Cox
@ 2007-03-12 14:41 ` Jeff Garzik
2 siblings, 0 replies; 29+ messages in thread
From: Jeff Garzik @ 2007-03-12 14:41 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

Jan Engelhardt wrote:
> On Mar 11 2007 22:45, Ric Wheeler wrote:
>> Jan Engelhardt wrote:
>>> On Mar 11 2007 18:51, Ric Wheeler wrote:
>>>
>>>> During the recent IO/FS workshop, we spoke briefly about the
>>>> coming change to a 4k sector size for disks on linux. If I
>>>> recall correctly, the general feeling was that the impact was
>>>> not significant since we already do most file system IO in 4k
>>>> page sizes and should be fine as long as we partition drives
>>>> correctly and avoid non-4k aligned partitions.
>>>>
>>> Sorry about jumping right in, but what about an 'old-style'
>>> partition table that relies on 512 as a unit?
>>>
>>>
>> I think that the normal case would involve new drives which
>> would need to be partitioned in 4k aligned partitions.
>> Shouldn't that work regardless of the unit used in the
>> partition table?
>
> Assume this partition table on my current HD:
>
> Disk /dev/hdc: 251.0 GB, 251000193024 bytes
> 255 heads, 63 sectors/track, 30515 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
>    Device      Start    End     Blocks  Id  System
> /dev/hdc1          1     33     265041  82  Linux swap / Solaris
> /dev/hdc2         34  30515  244846665   5  Extended
>
> That is, 255 * 63 * 30515 * 512 == roughly 251 GB.
>
> Now, if this disk was copied byte per byte (/bin/dd) to a
> 4096-based disk, and Linux would start using a sector size of
> 4096, then I would suddenly have
>
> 255 * 63 * 30515 * 4096 == 2 TB
>
> Although I would not mind the 2 TB, the partition table would
> read quite differently (note the Blocks column which is
> multiplied by 4 (512x4=4096))

At this level, for RMW drives, nothing changes.

The partition software, ATA driver, and all other bits continue to
think that sector size == 512 bytes. The partition software /hopefully/
becomes smart enough to understand the alignment necessary, but that is
not a requirement.

This is the key to understanding the difference between a physical
(==platters) sector size change without a logical (==ATA interface)
sector size change.

>    Device      Start    End     Blocks  Id  System
> /dev/hdc1          1     33    1060164  82  Linux swap / Solaris
> /dev/hdc2         34  30515  979386660   5  Extended
>
> Which would mean that the swap partition reaches into the real
> data partition and would corrupt it.

For RMW drives, RMW cycles would occur but not corruption.

For non-RMW drives, this just wouldn't occur.

Jeff
* Re: impact of 4k sector size on the IO & FS stack
2007-03-11 23:14 ` Jan Engelhardt
2007-03-12 2:45 ` Ric Wheeler
@ 2007-03-12 14:36 ` Jeff Garzik
2007-03-12 15:45 ` Alan Cox
2007-03-12 18:31 ` Bryan Henderson
1 sibling, 2 replies; 29+ messages in thread
From: Jeff Garzik @ 2007-03-12 14:36 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

Jan Engelhardt wrote:
> On Mar 11 2007 18:51, Ric Wheeler wrote:
>> During the recent IO/FS workshop, we spoke briefly about the
>> coming change to a 4k sector size for disks on linux. If I
>> recall correctly, the general feeling was that the impact was
>> not significant since we already do most file system IO in 4k
>> page sizes and should be fine as long as we partition drives
>> correctly and avoid non-4k aligned partitions.
>
> Sorry about jumping right in, but what about an 'old-style'
> partition table that relies on 512 as a unit?

For 1K/4K physical sector size, where logical sector size remains
512-b, nothing changes. DOS partitions start partitions on odd-numbered
sectors, so presuming you have odd-aligned disks, life is good.

For 1K/4K logical sector sizes, who knows. EFI? <grins and runs>
Certainly seems incompatible with the current popular DOS partition
format.

Jeff
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 14:36 ` Jeff Garzik
@ 2007-03-12 15:45 ` Alan Cox
2007-03-12 18:31 ` Bryan Henderson
1 sibling, 0 replies; 29+ messages in thread
From: Alan Cox @ 2007-03-12 15:45 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Jan Engelhardt, Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

> For 1K/4K logical sector sizes, who knows. EFI? <grins and runs>
> Certainly seems incompatible with the current popular DOS partition format.

It's a bit messier than that. There are two interpretations of "DOS"
partition formats found on 2K sector size magneto opticals. One is that
everything is the same as before (as if sectors were 512 byte), the
other is a different "everything is the same" which scales by the 2K
sector size.

The two are of course wonderfully incompatible.
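The two interpretations described above can be sketched as a small Python example. It assumes that the partition table's start field is a plain integer and that the only difference is the unit it is multiplied by; the function and parameter names are hypothetical.

```python
LEGACY_UNIT_BYTES = 512  # the unit a classic DOS partition table assumes


def partition_byte_offset(start_field, device_sector_bytes, scale_by_device_sector):
    """Byte offset of a partition under one of the two interpretations.

    scale_by_device_sector=False: the start field counts 512-byte units
    regardless of the medium.
    scale_by_device_sector=True: the start field counts native sectors of
    the medium (2K on the magneto opticals mentioned above).
    """
    unit = device_sector_bytes if scale_by_device_sector else LEGACY_UNIT_BYTES
    return start_field * unit


if __name__ == "__main__":
    # The same on-disk start value lands in two different places.
    print(partition_byte_offset(63, 2048, False))
    print(partition_byte_offset(63, 2048, True))
```

A reader that guesses the wrong interpretation looks for the filesystem at an offset that is off by a factor of four on 2K media, which is why the two conventions cannot coexist on one disk.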
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 14:36 ` Jeff Garzik
2007-03-12 15:45 ` Alan Cox
@ 2007-03-12 18:31 ` Bryan Henderson
2007-03-12 18:37 ` Sergei Shtylyov
2007-03-12 19:16 ` Douglas Gilbert
1 sibling, 2 replies; 29+ messages in thread
From: Bryan Henderson @ 2007-03-12 18:31 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Jan Engelhardt, linux-fsdevel, Linux-ide, linux-scsi, Ric Wheeler

>DOS partitions start partitions on odd-numbered sectors

I don't get this. If you mean partitions defined by the classic DOS
partition table format, then AFAICS, such a partition can start in any
sector.

>so presuming you have odd-aligned disks, life is good.

What is an odd-aligned disk?
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 18:31 ` Bryan Henderson
@ 2007-03-12 18:37 ` Sergei Shtylyov
2007-03-12 20:52 ` Bryan Henderson
2007-03-12 19:16 ` Douglas Gilbert
1 sibling, 1 reply; 29+ messages in thread
From: Sergei Shtylyov @ 2007-03-12 18:37 UTC (permalink / raw)
To: Bryan Henderson; +Cc: Jeff Garzik, Jan Engelhardt, linux-fsdevel, Linux-ide, linux-scsi, Ric Wheeler

Hello.

Bryan Henderson wrote:
>>DOS partitions start partitions on odd-numbered sectors

> I don't get this. If you mean partitions defined by the classic DOS
> partition table format, then AFAICS, such a partition can start in any
> sector.

Only at "logical cylinder boundary" (except for the first partition).
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 18:37 ` Sergei Shtylyov
@ 2007-03-12 20:52 ` Bryan Henderson
0 siblings, 0 replies; 29+ messages in thread
From: Bryan Henderson @ 2007-03-12 20:52 UTC (permalink / raw)
To: Sergei Shtylyov; +Cc: Jeff Garzik, Jan Engelhardt, linux-fsdevel, Linux-ide, linux-scsi, Ric Wheeler

>> I don't get this. If you mean partitions defined by the classic DOS
>> partition table format, then AFAICS, such a partition can start in any
>> sector.
>
> Only at "logical cylinder boundary" (except for the first partition).

That's a requirement in ancient DOS systems that use CHS addressing
(physical CHS, no less), isn't it (so you can properly convert a
within-partition address to a within-disk address)? While I would guess
most people still partition disks that way (even util-linux fdisk seems
to do it by default), they don't have to.

Doesn't matter for this discussion, though. As Doug demonstrated, even
when you do start at cylinder boundaries, half your partitions start on
an even sector, because typical cylinders have an odd number of sectors.
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 18:31 ` Bryan Henderson
2007-03-12 18:37 ` Sergei Shtylyov
@ 2007-03-12 19:16 ` Douglas Gilbert
2007-03-12 19:28 ` Jeff Garzik
1 sibling, 1 reply; 29+ messages in thread
From: Douglas Gilbert @ 2007-03-12 19:16 UTC (permalink / raw)
To: Bryan Henderson; +Cc: Jeff Garzik, Jan Engelhardt, linux-fsdevel, Linux-ide, linux-scsi, Ric Wheeler

Bryan Henderson wrote:
>> DOS partitions start partitions on odd-numbered sectors
>
> I don't get this. If you mean partitions defined by the classic DOS
> partition table format, then AFAICS, such a partition can start in any
> sector.

Bryan,
Typically the first partition on a DOS partitioned disk starts at the
next available sector after the mbr which, for some bizarre reason, is
63 sectors long. Hence:

# fdisk -lu /dev/hda

Disk /dev/hda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders, total 156301488 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start        End    Blocks  Id  System
/dev/hda1   *          63   18314099  9157018+   c  W95 FAT32 (LBA)
/dev/hda2        18314100   19551104   618502+  82  Linux swap / Solaris
/dev/hda4        19551105  156296384  68372640  83  Linux

>
>> so presuming you have odd-aligned disks, life is good.
>
> What is an odd-aligned disk?

s/disk/partition/ ?

Perhaps hda1 and hda4 above are examples.

Doug Gilbert
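The start sectors in the fdisk listing above can be examined with a short Python sketch. The start values and the 255 x 63 geometry come from the listing; the helper name and the choice of fields are illustrative assumptions. It shows that cylinder-aligned starts can be odd or even (a cylinder holds 16065 sectors, an odd number) and that none of these starts falls on a 4K (8 x 512-byte) boundary.

```python
SECTORS_PER_CYLINDER = 255 * 63  # 16065, an odd number

# Start LBAs copied from the fdisk -lu output quoted in the thread.
PARTITION_STARTS = {"hda1": 63, "hda2": 18314100, "hda4": 19551105}


def describe_start(start_lba):
    """Summarise how a partition start relates to cylinders and 4K sectors."""
    return {
        "on_cylinder_boundary": start_lba % SECTORS_PER_CYLINDER == 0,
        "parity": "odd" if start_lba % 2 else "even",
        "offset_in_4k": start_lba % 8,  # 0 would mean 4K-aligned
    }


if __name__ == "__main__":
    for name, start in PARTITION_STARTS.items():
        print(name, describe_start(start))
```
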
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 19:16 ` Douglas Gilbert
@ 2007-03-12 19:28 ` Jeff Garzik
0 siblings, 0 replies; 29+ messages in thread
From: Jeff Garzik @ 2007-03-12 19:28 UTC (permalink / raw)
To: dougg; +Cc: Bryan Henderson, Jan Engelhardt, linux-fsdevel, Linux-ide, linux-scsi, Ric Wheeler

Douglas Gilbert wrote:
> Bryan Henderson wrote:
>> What is an odd-aligned disk?
>
> s/disk/partition/ ?

Example:

An odd-aligned disk in the 512-b logical / 1K-physical scenario is
where odd LBAs indicate the start of a 1K physical sector.

An even-aligned disk is where even LBAs indicate the start of a 1K
physical sector.

In order to avoid too many RMW cycles, partition software SHOULD (using
IETF language) be aware of the underlying physical sector size
alignment, in order to align partitions for optimal performance.

Jeff
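The odd-aligned and even-aligned definitions above can be written as a one-line predicate. This is only a sketch: the function name and the `first_aligned_lba` parameter (the lowest LBA that begins a physical sector) are assumptions introduced for illustration, not terms from any standard quoted in the thread.

```python
def starts_physical_sector(lba, logical_per_physical, first_aligned_lba):
    """True if this logical block address is the first logical sector of a
    physical sector.

    For 512-byte logical / 1K physical drives, logical_per_physical is 2;
    first_aligned_lba=1 models an 'odd-aligned' disk and 0 an
    'even-aligned' one.
    """
    return (lba - first_aligned_lba) % logical_per_physical == 0


if __name__ == "__main__":
    # A partition starting at LBA 63 on odd- and even-aligned 1K drives.
    print(starts_physical_sector(63, 2, 1))
    print(starts_physical_sector(63, 2, 0))
```

Partitioning software that knows both parameters could round a proposed start LBA up to the next value for which this predicate holds.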
* Re: impact of 4k sector size on the IO & FS stack
2007-03-11 22:51 impact of 4k sector size on the IO & FS stack Ric Wheeler
2007-03-11 23:14 ` Jan Engelhardt
@ 2007-03-12 0:02 ` Alan Cox
2007-03-12 0:44 ` Jeff Garzik
2007-03-12 2:41 ` Ric Wheeler
2007-03-12 8:18 ` Christoph Hellwig
2 siblings, 2 replies; 29+ messages in thread
From: Alan Cox @ 2007-03-12 0:02 UTC (permalink / raw)
To: Ric Wheeler; +Cc: linux-scsi, linux-fsdevel, Linux-ide

> Are there other concerns in the IO or FS stack that we should bring up
> with vendors? I have been asked to summarize the impact of 4k sectors
> on linux for a disk vendor gathering and want to make sure that I put
> all of our linux specific items into that summary...

We need to make sure the physical sector size is correctly reported by
the disk (eg in the ATA7 identify data) but I think for libata at least
the right bits are already there and we've got a fair amount of scsi
disk experience with other media sizes (eg 2K) already. 256byte/sector
media is still broken btw 8)

I would be interested to know what the disk vendors intend to use as
their strategy when (with ATA) they have a 512 byte write from an older
file system/setup into a 4K block. The case where errors magically
appear in other parts of the fs when such an error occurs is not IMHO
too well considered.

Alan
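As an illustration of how the physical sector size might be reported in the identify data, here is a hedged Python sketch of a decoder. The bit layout used (word 106: bit 14 set and bit 15 clear for validity, bit 13 for multiple logical sectors per physical sector, bit 12 for a logical sector longer than 256 words, bits 3:0 as a power-of-two exponent; words 117-118 as the logical sector size in 16-bit words) is my reading of ATA/ATAPI-7 and ATA8-ACS and is not stated in the thread, so it should be checked against the specification. The function name is hypothetical.

```python
def decode_sector_sizes(word106, words117_118=0):
    """Return (logical_bytes, physical_bytes) from IDENTIFY DEVICE data.

    Assumed layout (verify against the ATA specification):
      word 106 bit 15 = 0 and bit 14 = 1 -> word contains valid info
      word 106 bit 13 -> several logical sectors per physical sector
      word 106 bit 12 -> logical sector longer than 256 words
      word 106 bits 3:0 -> log2(logical sectors per physical sector)
      words 117-118 -> logical sector size in 16-bit words
    """
    if (word106 & 0xC000) != 0x4000:
        return 512, 512  # field not valid: classic 512-byte sectors

    logical = 512
    if word106 & (1 << 12):
        logical = words117_118 * 2  # reported in words, so double for bytes

    per_physical = 1
    if word106 & (1 << 13):
        per_physical = 1 << (word106 & 0x000F)

    return logical, logical * per_physical


if __name__ == "__main__":
    print(decode_sector_sizes(0x0000))  # field not valid
    print(decode_sector_sizes(0x6003))  # 512 logical, 8 per physical
```
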
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 0:02 ` Alan Cox
@ 2007-03-12 0:44 ` Jeff Garzik
2007-03-12 2:37 ` Ric Wheeler
2007-03-12 12:24 ` Alan Cox
2007-03-12 2:41 ` Ric Wheeler
1 sibling, 2 replies; 29+ messages in thread
From: Jeff Garzik @ 2007-03-12 0:44 UTC (permalink / raw)
To: Alan Cox; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

Alan Cox wrote:
> I would be interested to know what the disk vendors intend to use as
> their strategy when (with ATA) they have a 512 byte write from an older
> file system/setup into a 4K block. The case where errors magically appear

Well, you have logical and physical sector size changes.

First generation of 1K sector drives will continue to use the same
512-byte ATA sector size you are familiar with. A single 512-byte write
will cause the drive to perform a read-modify-write cycle. This
configuration is physical 1K sector, logical 512b sector.

A future configuration will change the logical ATA interface away from
512-byte sectors to 1K or 4K. Here, it is impossible to read a quantity
smaller than 1K or 4K, whatever the sector size is.

Jeff
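The read-modify-write behaviour described above can be modelled with a small Python sketch that counts how many physical sectors a write only partially covers. It is a simplified model under stated assumptions: the drive does one RMW per partially covered physical sector, with no caching or coalescing, and the function and parameter names are hypothetical.

```python
def rmw_cycles(start_lba, count, logical_per_physical=2, first_aligned_lba=0):
    """Number of physical sectors that a write of `count` logical sectors
    starting at `start_lba` covers only partially.

    logical_per_physical is 2 for 512-byte logical / 1K physical and 8 for
    512-byte logical / 4K physical. first_aligned_lba is the lowest LBA
    that begins a physical sector (an assumption of this model).
    """
    if count <= 0:
        raise ValueError("count must be positive")

    k = logical_per_physical
    end_lba = start_lba + count  # one past the last logical sector written

    partial = set()
    if (start_lba - first_aligned_lba) % k:
        # The write begins in the middle of a physical sector.
        partial.add((start_lba - first_aligned_lba) // k)
    if (end_lba - first_aligned_lba) % k:
        # The write ends in the middle of a physical sector.
        partial.add((end_lba - 1 - first_aligned_lba) // k)
    return len(partial)


if __name__ == "__main__":
    print(rmw_cycles(0, 1))  # a lone 512-byte write on a 1K drive
    print(rmw_cycles(0, 2))  # an aligned 1K write
    print(rmw_cycles(1, 2))  # a 1K write straddling two physical sectors
```

Using a set means a short write that starts and ends inside the same physical sector is counted once rather than twice.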
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 0:44 ` Jeff Garzik
@ 2007-03-12 2:37 ` Ric Wheeler
2007-03-12 12:24 ` Alan Cox
1 sibling, 0 replies; 29+ messages in thread
From: Ric Wheeler @ 2007-03-12 2:37 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Alan Cox, linux-scsi, linux-fsdevel, Linux-ide

Jeff Garzik wrote:
> Alan Cox wrote:
>> I would be interested to know what the disk vendors intend to use as
>> their strategy when (with ATA) they have a 512 byte write from an older
>> file system/setup into a 4K block. The case where errors magically
>> appear
>
> Well, you have logical and physical sector size changes.
>
> First generation of 1K sector drives will continue to use the same
> 512-byte ATA sector size you are familiar with. A single 512-byte
> write will cause the drive to perform a read-modify-write cycle. This
> configuration is physical 1K sector, logical 512b sector.

It would seem that most writes would avoid this - hopefully the drive
firmware could use the write cache to coalesce contiguous IO's into 1k
multiples when getting streams of 512 byte write requests.

>
> A future configuration will change the logical ATA interface away from
> 512-byte sectors to 1K or 4K. Here, it is impossible to read a
> quantity smaller than 1K or 4K, whatever the sector size is.
>
> Jeff

I will try and see if I can get some specific information on when the
various flavors of this are going to appear...

ric
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 0:44 ` Jeff Garzik
2007-03-12 2:37 ` Ric Wheeler
@ 2007-03-12 12:24 ` Alan Cox
2007-03-12 13:32 ` Ric Wheeler
2007-03-12 14:26 ` Jeff Garzik
1 sibling, 2 replies; 29+ messages in thread
From: Alan Cox @ 2007-03-12 12:24 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

> First generation of 1K sector drives will continue to use the same
> 512-byte ATA sector size you are familiar with. A single 512-byte write
> will cause the drive to perform a read-modify-write cycle. This
> configuration is physical 1K sector, logical 512b sector.

The problem case is "read-modify-screwup"

At that point we've trashed the block we were writing (a well studied
recovery case), and we've blasted some previously sane, totally
unrelated sector of data out of existence. That's why we need to know
ideally if they are doing the write to a different physical block when
they do this, so that we don't lose the old data. My guess is they won't
as it'll be hard.

> A future configuration will change the logical ATA interface away from
> 512-byte sectors to 1K or 4K. Here, it is impossible to read a quantity
> smaller than 1K or 4K, whatever the sector size is.

That one I'm not worried about - other than "guess how Redmond decide to
make partition tables work" that one is mostly easy (be fun to see how
many controllers simply can't cope with the command formats)

Alan
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 12:24 ` Alan Cox
@ 2007-03-12 13:32 ` Ric Wheeler
2007-03-12 15:21 ` Douglas Gilbert
2007-03-12 14:26 ` Jeff Garzik
1 sibling, 1 reply; 29+ messages in thread
From: Ric Wheeler @ 2007-03-12 13:32 UTC (permalink / raw)
To: Alan Cox; +Cc: Jeff Garzik, linux-scsi, linux-fsdevel, Linux-ide

Alan Cox wrote:
>> First generation of 1K sector drives will continue to use the same
>> 512-byte ATA sector size you are familiar with. A single 512-byte write
>> will cause the drive to perform a read-modify-write cycle. This
>> configuration is physical 1K sector, logical 512b sector.
>
> The problem case is "read-modify-screwup"
>
> At that point we've trashed the block we were writing (a well studied
> recovery case), and we've blasted some previously sane, totally
> unrelated sector of data out of existence. That's why we need to know
> ideally if they are doing the write to a different physical block when
> they do this, so that we don't lose the old data. My guess is they won't
> as it'll be hard.

I think that the firmware would have to do this in the drive's write
cache and would always write the modified data back to the same physical
sector (unless a media error forces a sector remap).

If firmware modifies the 7 512 byte sectors that it read to do the 1 512
byte sector write, then we certainly would see what you describe happen.

In general, it would seem to be a bad idea to allocate a different
physical sector to underpin this kind of read-modify-write since that
would kill contiguous layout of files, etc.

>> A future configuration will change the logical ATA interface away from
>> 512-byte sectors to 1K or 4K. Here, it is impossible to read a quantity
>> smaller than 1K or 4K, whatever the sector size is.
>
> That one I'm not worried about - other than "guess how Redmond decide to
> make partition tables work" that one is mostly easy (be fun to see how
> many controllers simply can't cope with the command formats)
>

This will be interesting to find out. I will be sharing a panel with
some BIOS & MS people, so I will update all on what I hear,

ric
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 13:32 ` Ric Wheeler
@ 2007-03-12 15:21 ` Douglas Gilbert
2007-03-12 16:08 ` Martin K. Petersen
0 siblings, 1 reply; 29+ messages in thread
From: Douglas Gilbert @ 2007-03-12 15:21 UTC (permalink / raw)
To: ric; +Cc: Alan Cox, Jeff Garzik, linux-scsi, linux-fsdevel, Linux-ide

Ric Wheeler wrote:
> Alan Cox wrote:
>>> First generation of 1K sector drives will continue to use the same
>>> 512-byte ATA sector size you are familiar with. A single 512-byte
>>> write will cause the drive to perform a read-modify-write cycle.
>>> This configuration is physical 1K sector, logical 512b sector.
>>
>> The problem case is "read-modify-screwup"
>>
>> At that point we've trashed the block we were writing (a well studied
>> recovery case), and we've blasted some previously sane, totally
>> unrelated sector of data out of existence. That's why we need to know
>> ideally if they are doing the write to a different physical block when
>> they do this, so that we don't lose the old data. My guess is they won't
>> as it'll be hard.
>
> I think that the firmware would have to do this in the drive's write
> cache and would always write the modified data back to the same physical
> sector (unless a media error forces a sector remap).
>
> If firmware modifies the 7 512 byte sectors that it read to do the 1 512
> byte sector write, then we certainly would see what you describe happen.
>
> In general, it would seem to be a bad idea to allocate a different
> physical sector to underpin this kind of read-modify-write since that
> would kill contiguous layout of files, etc.
>
>>> A future configuration will change the logical ATA interface away
>>> from 512-byte sectors to 1K or 4K. Here, it is impossible to read a
>>> quantity smaller than 1K or 4K, whatever the sector size is.
>>
>> That one I'm not worried about - other than "guess how Redmond decide to
>> make partition tables work" that one is mostly easy (be fun to see how
>> many controllers simply can't cope with the command formats)
>>
>
> This will be interesting to find out. I will be sharing a panel with
> some BIOS & MS people, so I will update all on what I hear,

Ric,
Just to add a SCSI perspective, it looks like 4 KB sectored disks will
be almost exclusively ATA devices. It is being done to improve capacity
at the expense of performance. [SCSI/FC/SAS disks typically trade off
capacity for better performance.]

Support for disks with smaller logical block size than physical block
size has already been added to SBC-3. The overview of this document
gives a rationale:
www.t10.org/ftp/t10/document.06/06-034r5.pdf

SAT is now a standard and an agenda item for SAT-2 is to wire
ATA8-ACS's large sector size support to the additions to SBC-3
mentioned above.

I'm not sure how this stuff plays with end to end data protection :-)
Most SCSI disks currently allow formatting sizes of 512 up to 528 bytes
per logical block.

Doug Gilbert
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 15:21 ` Douglas Gilbert
@ 2007-03-12 16:08 ` Martin K. Petersen
0 siblings, 0 replies; 29+ messages in thread
From: Martin K. Petersen @ 2007-03-12 16:08 UTC (permalink / raw)
To: dougg; +Cc: ric, Alan Cox, Jeff Garzik, linux-scsi, linux-fsdevel, Linux-ide

>>>>> "Doug" == Douglas Gilbert <dougg@torque.net> writes:

Doug> SAT is now a standard and an agenda item for SAT-2 is to wire
Doug> ATA8-ACS's large sector size support to the additions to SBC-3
Doug> mentioned above.

Doug> I'm not sure how this stuff plays with end to end data
Doug> protection :-)

The proposal you forwarded talks about "transformed protection
information" but doesn't go into details.

Assuming the drive has 4KB physical blocks and receives 512 byte
logical blocks, it's easy to verify the integrity of the 512 byte
sector and then do R-M-W on the physical. Similarly, on the way out
logical guard and ref tags could be generated after integrity of the
physical has been verified.

The only thing that really bites is that the app tag will be per
physical block and not per logical (unless the drive leaves enough
space to store 8 tags per 4KB sector).

--
Martin K. Petersen
Oracle Linux Engineering
* Re: impact of 4k sector size on the IO & FS stack 2007-03-12 12:24 ` Alan Cox 2007-03-12 13:32 ` Ric Wheeler @ 2007-03-12 14:26 ` Jeff Garzik 2007-03-13 5:11 ` Andreas Dilger 1 sibling, 1 reply; 29+ messages in thread From: Jeff Garzik @ 2007-03-12 14:26 UTC (permalink / raw) To: Alan Cox; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide Alan Cox wrote: >> First generation of 1K sector drives will continue to use the same >> 512-byte ATA sector size you are familiar with. A single 512-byte write >> will cause the drive to perform a read-modify-write cycle. This >> configuration is physical 1K sector, logical 512b sector. > > The problem case is "read-modify-screwup" > > At that point we've trashed the block we were writing (a well studied > recovery case), and we've blasted some previously sane, totally > unrelated sector of data out of existance. Thats why we need to know > ideally if they are doing the write to a different physical block when > they do this, so that we don't lose the old data. My guess is they won't > as it'll be hard. Strict ATA command set answer: you will have no idea what goes on under the hood. The current 512-b interface stays /exactly/ the same, save for a word or two in IDENTIFY DEVICE telling you the "secret" physical sector size. If all your I/Os are aligned properly, then you need not worry about RMW cycles, as they will not occur. Intuition answer: they will use their firmware-internal standard code for scheduling reads and writes, and will only reallocate sectors as needed by media failure or similar events. The "M" part of the modify cycle happens in disk ram. So from the disk's point of view, a single 512-b write would require reading a single 1K hard sector, updating the contents in cache RAM, and then writing a single 1K hard sector. The reading of the unknown half of the sector can be scheduled well in advance, usually, since writeback caching gives the drive plenty of time (relatively speaking) to optimize things. 
Overall, it definitely adds a few more points of failure, but we can't
do much at all about those points of failure.

In my own experiments on my own Fedora workstation, ~66% of IOs in Linux
start on an odd sector, and ~33% started on even-numbered sectors. For
a 1K-sector drive with 'odd' alignment, the configuration Microsoft will
likely want, that means the majority of disk transactions will avoid a
RMW cycle, but a still-numerous minority will not. I did not test
transfer length, to see how many transfers /ended/ on an odd sector,
thus determining how many RMW cycles the tail of an average I/O requires.

>> A future configuration will change the logical ATA interface away from
>> 512-byte sectors to 1K or 4K. Here, it is impossible to read a quantity
>> smaller than 1K or 4K, whatever the sector size is.
>
> That one I'm not worried about - other than "guess how Redmond decide to
> make partition tables work" that one is mostly easy (be fun to see how
> many controllers simply can't cope with the command formats)

Indeed...

	Jeff

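The head and tail cases mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `rmw_sector_count` is a hypothetical helper, `alignment` is taken to mean the lowest logical sector number that begins a physical sector (0 for 'even' alignment, 1 for the 'odd' alignment discussed in the message), and the model ignores anything the drive's cache might do.

```python
def rmw_sector_count(start_lba, nsectors, logical_per_physical=2, alignment=0):
    """Count the physical sectors that a write only partially covers.

    start_lba and nsectors are in logical (512-byte) sectors.
    Each partially covered physical sector costs one RMW cycle.
    """
    if nsectors <= 0:
        raise ValueError("nsectors must be positive")
    lpp = logical_per_physical
    first = start_lba - alignment   # shift so physical sectors start at 0
    end = first + nsectors          # exclusive end
    head = first % lpp != 0         # write starts inside a physical sector
    tail = end % lpp != 0           # write ends inside a physical sector
    # A short write can have its head and tail in the same physical sector.
    if head and tail and first // lpp == (end - 1) // lpp:
        return 1
    return int(head) + int(tail)


# A 4K write (8 logical sectors) starting at the odd LBA 63:
odd_aligned = rmw_sector_count(63, 8, logical_per_physical=2, alignment=1)
even_aligned = rmw_sector_count(63, 8, logical_per_physical=2, alignment=0)
```

Under these assumptions the same odd-start request is free of RMW on the 'odd' aligned drive but needs a cycle at both ends on the 'even' aligned one, which matches the point that start alignment alone does not tell you what the tail costs.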
* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 14:26 ` Jeff Garzik
@ 2007-03-13 5:11 ` Andreas Dilger
2007-03-13 6:34 ` Chris Wedgwood
0 siblings, 1 reply; 29+ messages in thread
From: Andreas Dilger @ 2007-03-13 5:11 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Alan Cox, Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

On Mar 12, 2007 10:26 -0400, Jeff Garzik wrote:
> In my own experiments on my own Fedora workstation, ~66% of IOs in Linux
> start on an odd sector, and ~33% started on even-numbered sectors. For
> a 1K-sector drive with 'odd' alignment, the configuration Microsoft will
> likely want, that means the majority of disk transactions will avoid a
> RMW cycle, but a still-numerous minority will not.

Isn't that purely an artifact of the DOS partition table alignment,
possibly skewed by the fact that most of your IO is on partition 1 & 3?
Hard to believe this because of the nice even numbers though.

Since ext3 has at least 1kB blocksize and defaults to 4kB blocksize with
most modern disks because they are > 500MB in size, you should never
have misaligned writes generated by the filesystem itself.

> I did not test
> transfer length, to see how many transfers /ended/ on an odd sector,
> thus determining how many RMW cycles the tail of an average I/O requires.

I'd guess a vast majority of IO will have the end similarly misaligned
as the start. Very little filesystem IO is 512 bytes, possibly excluding
XFS in an unusual mode.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

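To illustrate how a partition offset alone can make every filesystem block start on an odd sector, here is a small sketch. The start sector 63 is an assumption (the traditional first-partition offset in DOS partition tables); the message does not give a number, and the helper names are hypothetical.

```python
SECTOR = 512  # logical sector size in bytes


def block_start_lba(partition_start, block_index, block_size=4096):
    """Disk LBA at which filesystem block `block_index` begins."""
    return partition_start + block_index * (block_size // SECTOR)


def partition_is_aligned(partition_start, physical_size=4096):
    """True if the partition begins on a physical sector boundary."""
    return (partition_start * SECTOR) % physical_size == 0


# Assumed traditional DOS layout: first partition starts at sector 63.
starts = [block_start_lba(63, n) for n in range(4)]
all_odd = all(lba % 2 == 1 for lba in starts)
```

Each 4kB block advances the LBA by 8, so an odd partition start keeps every block start odd even though the filesystem itself only issues block-aligned I/O relative to the partition.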
* Re: impact of 4k sector size on the IO & FS stack
2007-03-13 5:11 ` Andreas Dilger
@ 2007-03-13 6:34 ` Chris Wedgwood
0 siblings, 0 replies; 29+ messages in thread
From: Chris Wedgwood @ 2007-03-13 6:34 UTC (permalink / raw)
To: Jeff Garzik, Alan Cox, Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

On Tue, Mar 13, 2007 at 01:11:44AM -0400, Andreas Dilger wrote:

> I'd guess a vast majority of IO will have the end similarly
> misaligned as the start. Very little filesystem IO is 512 bytes,
> possibly excluding XFS in an unusual mode.

XFS (mkfs.xfs) can be told what the native sector size is and will
adjust writes accordingly.

* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 0:02 ` Alan Cox
2007-03-12 0:44 ` Jeff Garzik
@ 2007-03-12 2:41 ` Ric Wheeler
1 sibling, 0 replies; 29+ messages in thread
From: Ric Wheeler @ 2007-03-12 2:41 UTC (permalink / raw)
To: Alan Cox; +Cc: linux-scsi, linux-fsdevel, Linux-ide

Alan Cox wrote:
>> Are there other concerns in the IO or FS stack that we should bring up
>> with vendors? I have been asked to summarize the impact of 4k sectors
>> on linux for a disk vendor gathering and want to make sure that I put
>> all of our linux specific items into that summary...
>>
>
> We need to make sure the physical sector size is correctly reported by
> the disk (eg in the ATA7 identify data) but I think for libata at least
> the right bits are already there and we've got a fair amount of scsi disk
> experience with other media sizes (eg 2K) already. 256byte/sector media
> is still broken btw 8)
>

It would be really interesting to see if we can validate this with
prototype drives.

> I would be interested to know what the disk vendors intend to use as
> their strategy when (with ATA) they have a 512 byte write from an older
> file system/setup into a 4K block. The case where errors magically appear
> in other parts of the fs when such an error occurs are not IMHO too well
> considered.
>
> Alan

As Jeff mentioned, I think that they would have to do a
read-modify-write simulation which would kill performance for a small,
random write work load...

ric

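For reference, a sketch of how a host might decode the sector sizes from IDENTIFY DEVICE data. The word numbers and bit positions below (word 106 for the physical/logical relationship, words 117-118 for a large logical sector size) are written from memory of the ATA command set documents and should be treated as assumptions to check against the specification; `decode_sector_sizes` is a hypothetical helper, not a libata function.

```python
def decode_sector_sizes(identify):
    """Return (logical_bytes, physical_bytes) from IDENTIFY DEVICE words.

    `identify` is a sequence of 256 16-bit words.
    Assumed layout of word 106:
      bits 15:14 == 01b  -> the word contains valid information
      bit 13             -> multiple logical sectors per physical sector
      bit 12             -> logical sector is longer than 256 words
      bits 3:0           -> log2(logical sectors per physical sector)
    Assumed: words 117-118 hold the logical sector size in 16-bit words.
    """
    word106 = identify[106]
    logical = 512
    per_physical = 1
    if (word106 & 0xC000) == 0x4000:
        if word106 & (1 << 12):
            size_in_words = identify[117] | (identify[118] << 16)
            logical = size_in_words * 2
        if word106 & (1 << 13):
            per_physical = 1 << (word106 & 0x000F)
    return logical, logical * per_physical


# A made-up IDENTIFY block for a 512-byte logical / 4K physical drive.
example = [0] * 256
example[106] = 0x4000 | (1 << 13) | 3   # valid, 2**3 logical per physical
sizes = decode_sector_sizes(example)
```

A drive that leaves word 106 invalid falls through to the 512/512 default in this sketch, which is how legacy devices would be treated.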
* Re: impact of 4k sector size on the IO & FS stack
2007-03-11 22:51 impact of 4k sector size on the IO & FS stack Ric Wheeler
2007-03-11 23:14 ` Jan Engelhardt
2007-03-12 0:02 ` Alan Cox
@ 2007-03-12 8:18 ` Christoph Hellwig
2007-03-12 14:40 ` James Bottomley
2007-03-12 14:45 ` Jeff Garzik
2 siblings, 2 replies; 29+ messages in thread
From: Christoph Hellwig @ 2007-03-12 8:18 UTC (permalink / raw)
To: Ric Wheeler; +Cc: linux-scsi, linux-fsdevel, Linux-ide

On Sun, Mar 11, 2007 at 06:51:53PM -0400, Ric Wheeler wrote:
>
> During the recent IO/FS workshop, we spoke briefly about the coming
> change to a 4k sector size for disks on linux. If I recall correctly,
> the general feeling was that the impact was not significant since we
> already do most file system IO in 4k page sizes and should be fine as
> long as we partition drives correctly and avoid non-4k aligned partitions.
>
> Are there other concerns in the IO or FS stack that we should bring up
> with vendors? I have been asked to summarize the impact of 4k sectors
> on linux for a disk vendor gathering and want to make sure that I put
> all of our linux specific items into that summary...

The FS stack and higher levels of the I/O stack should be mostly ready.
The S/390 DASDs are commonly used with 4k sector sizes, and we've had
the occasional 2k sector SCSI MO device as well. It would be nice to
get samples of large sector size ATA devices into the hands of developers
to do real world testing of the whole stack.

* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 8:18 ` Christoph Hellwig
@ 2007-03-12 14:40 ` James Bottomley
2007-03-12 14:45 ` Jeff Garzik
1 sibling, 0 replies; 29+ messages in thread
From: James Bottomley @ 2007-03-12 14:40 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

On Mon, 2007-03-12 at 08:18 +0000, Christoph Hellwig wrote:
> The FS stack and higher levels of the I/O stack should be mostly ready.
> The S/390 DASDs are commonly used with 4k sector sizes, and we've had
> the occasional 2k sector SCSI MO device as well. It would be nice to
> get samples of large sector size ATA devices into the hands of developers
> to do real world testing of the whole stack.

Theoretically, we already have the capacity to verify this, although
not with ATA. However, since ATA uses virtually the same paths as SCSI,
we could test with variable sector SCSI devices, and SCSI does allow you
to reformat the device with different sector sizes.

James

* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 8:18 ` Christoph Hellwig
2007-03-12 14:40 ` James Bottomley
@ 2007-03-12 14:45 ` Jeff Garzik
2007-03-12 14:57 ` Christoph Hellwig
1 sibling, 1 reply; 29+ messages in thread
From: Jeff Garzik @ 2007-03-12 14:45 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

Christoph Hellwig wrote:
> the occasional 2k sector SCSI MO device as well. It would be nice to
> get samples of large sector size ATA devices into the hands of developers
> to do real world testing of the whole stack.

"hands of developers" meaning you specifically? :)

I've had a 512b-logical/1K-physical ATA test drive for a few months now,
and another couple arrived today. Hopefully people can parse what I've
been posting, since I cannot give out raw numbers or data at this time.

Of course, with RMW drives that leave the 512-b logical interface
untouched, I had expected that they would Just Work(tm) and that is
pretty much what happened.

	Jeff

* Re: impact of 4k sector size on the IO & FS stack
2007-03-12 14:45 ` Jeff Garzik
@ 2007-03-12 14:57 ` Christoph Hellwig
0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2007-03-12 14:57 UTC (permalink / raw)
To: Jeff Garzik
Cc: Christoph Hellwig, Ric Wheeler, linux-scsi, linux-fsdevel, Linux-ide

On Mon, Mar 12, 2007 at 10:45:16AM -0400, Jeff Garzik wrote:
> Christoph Hellwig wrote:
> >the occasional 2k sector SCSI MO device as well. It would be nice to
> >get samples of large sector size ATA devices into the hands of developers
> >to do real world testing of the whole stack.
>
> "hands of developers" meaning you specifically? :)

No. I probably wouldn't have time to deal with it as well.

> I've had a 512b-logical/1K-physical ATA test drive for a few months now,
> and another couple arrived today.

Ok, that's exactly what I meant.
