* Re: Implementing NVMHCI... [not found] <20090412091228.GA29937@elte.hu> @ 2009-04-12 15:14 ` Szabolcs Szakacsits 2009-04-12 15:20 ` Alan Cox 2009-04-12 15:41 ` Linus Torvalds 0 siblings, 2 replies; 45+ messages in thread From: Szabolcs Szakacsits @ 2009-04-12 15:14 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Linus Torvalds wrote: > And people tend to really dislike hardware that forces a particular > filesystem on them. Guess how big the user base is going to be if you > cannot format the device as NTFS, for example? Hint: if a piece of > hardware only works well with special filesystems, that piece of hardware > won't be a big seller. > > Modern technology needs big volume to become cheap and relevant. > > And maybe I'm wrong, and NTFS works fine as-is with sectors >4kB. But let > me doubt that. I did not hear about NTFS using >4kB sectors yet but technically it should work. The atomic building units (sector size, block size, etc) of NTFS are entirely parametric. The maximum values could be bigger than the currently "configured" maximum limits. At present the limits are set in the BIOS Parameter Block in the NTFS Boot Sector. This is 2 bytes for the "Bytes Per Sector" and 1 byte for "Sectors Per Block". So >4kB sector size should work since 1993. 64kB+ sector size could be possible by bootstrapping NTFS drivers in a different way. Szaka -- NTFS-3G: http://ntfs-3g.org ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-12 15:14 ` Implementing NVMHCI Szabolcs Szakacsits @ 2009-04-12 15:20 ` Alan Cox 2009-04-12 16:15 ` Avi Kivity 2009-04-12 15:41 ` Linus Torvalds 1 sibling, 1 reply; 45+ messages in thread From: Alan Cox @ 2009-04-12 15:20 UTC (permalink / raw) To: Szabolcs Szakacsits Cc: Linus Torvalds, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven > The atomic building units (sector size, block size, etc) of NTFS are > entirely parametric. The maximum values could be bigger than the > currently "configured" maximum limits. That isn't what bites you - you can run 8K-32K ext2 file systems but if your physical page size is smaller than the fs page size you have a problem. The question is whether the NT VM can cope rather than the fs. Alan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-12 15:20 ` Alan Cox @ 2009-04-12 16:15 ` Avi Kivity 2009-04-12 17:11 ` Linus Torvalds 0 siblings, 1 reply; 45+ messages in thread From: Avi Kivity @ 2009-04-12 16:15 UTC (permalink / raw) To: Alan Cox Cc: Szabolcs Szakacsits, Linus Torvalds, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Alan Cox wrote: >> The atomic building units (sector size, block size, etc) of NTFS are >> entirely parametric. The maximum values could be bigger than the >> currently "configured" maximum limits. >> > > That isn't what bites you - you can run 8K-32K ext2 file systems but if > your physical page size is smaller than the fs page size you have a > problem. > > The question is whether the NT VM can cope rather than the fs. > A quick test shows that it can. I didn't try mmap(), but copying files around worked. Did you expect it not to work? -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-12 16:15 ` Avi Kivity @ 2009-04-12 17:11 ` Linus Torvalds 2009-04-13 6:32 ` Avi Kivity 0 siblings, 1 reply; 45+ messages in thread From: Linus Torvalds @ 2009-04-12 17:11 UTC (permalink / raw) To: Avi Kivity Cc: Alan Cox, Szabolcs Szakacsits, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven On Sun, 12 Apr 2009, Avi Kivity wrote: > > A quick test shows that it can. I didn't try mmap(), but copying files around > worked. You being who you are, I'm assuming you're doing this in a virtual environment, so you might be able to see the IO patterns.. Can you tell if it does the IO in chunks of 16kB or smaller? That can be hard to see with trivial tests (since any filesystem will try to chunk up writes regardless of how small the cache entry is, and on file creation it will have to write the full 16kB anyway just to initialize the newly allocated blocks on disk), but there's a couple of things that should be reasonably good litmus tests of what WNT does internally: - create a big file, then rewrite just a few bytes in it, and look at the IO pattern of the result. Does it actually do the rewrite IO as one 16kB IO, or does it do sub-blocking? If the latter, then the 16kB thing is just a filesystem layout issue, not an internal block-size issue, and WNT would likely have exactly the same issues as Linux. - can you tell how many small files it will cache in RAM without doing IO? If it always uses 16kB blocks for caching, it will be able to cache a _lot_ fewer files in the same amount of RAM than with a smaller block size. Of course, the _really_ conclusive thing (in a virtualized environment) is to just make the virtual disk only able to do 16kB IO accesses (and with 16kB alignment). IOW, actually emulate a disk with a 16kB hard sector size, and reporting a 16kB sector size to the READ CAPACITY command. If it works then, then clearly WNT has no issues with bigger sectors. Linus ^ permalink raw reply [flat|nested] 45+ messages in thread
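[For readers who don't have the SCSI spec handy: READ CAPACITY(10) returns an 8-byte payload whose first four bytes are the last logical block address and whose last four bytes are the logical block length in bytes, both big-endian, so an emulated disk advertises 16kB sectors simply by returning 16384 in that field. A minimal decoding sketch follows -- illustrative user-space code only, not taken from qemu or the kernel.]

    /*
     * Sketch: how a 16kB logical block size would appear in a SCSI
     * READ CAPACITY(10) reply.  Bytes 0-3 = last LBA, bytes 4-7 =
     * block length in bytes, both big-endian.
     */
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t get_be32(const uint8_t *p)
    {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }

    int main(void)
    {
        /* Example response from a hypothetical 16kB-sector disk. */
        uint8_t resp[8] = { 0x00, 0x0F, 0xFF, 0xFF,    /* last LBA      */
                            0x00, 0x00, 0x40, 0x00 };  /* 16384 bytes   */

        uint32_t last_lba   = get_be32(&resp[0]);
        uint32_t block_size = get_be32(&resp[4]);

        printf("blocks: %u, block size: %u bytes\n",
               last_lba + 1, block_size);
        return 0;
    }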
* Re: Implementing NVMHCI... 2009-04-12 17:11 ` Linus Torvalds @ 2009-04-13 6:32 ` Avi Kivity 2009-04-13 15:10 ` Linus Torvalds 0 siblings, 1 reply; 45+ messages in thread From: Avi Kivity @ 2009-04-13 6:32 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Szabolcs Szakacsits, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Linus Torvalds wrote: > On Sun, 12 Apr 2009, Avi Kivity wrote: > >> A quick test shows that it can. I didn't try mmap(), but copying files around >> worked. >> > > You being who you are, I'm assuming you're doing this in a virtual > environment, so you might be able to see the IO patterns.. > > Yes. I just used the Windows performance counters rather than mess with qemu for the test below. > Can you tell if it does the IO in chunks of 16kB or smaller? That can be > hard to see with trivial tests (since any filesystem will try to chunk up > writes regardless of how small the cache entry is, and on file creation it > will have to write the full 16kB anyway just to initialize the newly > allocated blocks on disk), but there's a couple of things that should be > reasonably good litmus tests of what WNT does internally: > > - create a big file, Just creating a 5GB file in a 64KB filesystem was interesting - Windows was throwing out 256KB I/Os even though I was generating 1MB writes (and cached too). Looks like a paranoid IDE driver (qemu exposes a PIIX4). > then rewrite just a few bytes in it, and look at the > IO pattern of the result. Does it actually do the rewrite IO as one > 16kB IO, or does it do sub-blocking? > It generates 4KB writes (I was generating aligned 512 byte overwrites). What's more interesting, it was also issuing 32KB reads to fill the cache, not 64KB. Since the number of reads and writes per second is almost equal, it's not splitting a 64KB read into two. > If the latter, then the 16kB thing is just a filesystem layout issue, > not an internal block-size issue, and WNT would likely have exactly the > same issues as Linux. > A 1 byte write on an ordinary file generates a RMW, same as a 4KB write on a 16KB block. So long as the filesystem is just a layer behind the pagecache (which I think is the case on Windows), I don't see what issues it can have. > - can you tell how many small files it will cache in RAM without doing > IO? If it always uses 16kB blocks for caching, it will be able to cache > a _lot_ fewer files in the same amount of RAM than with a smaller block > size. > I'll do this later, but given the 32KB reads for the test above, I'm guessing it will cache pages, not blocks. > Of course, the _really_ conclusive thing (in a virtualized environment) is > to just make the virtual disk only able to do 16kB IO accesses (and with > 16kB alignment). IOW, actually emulate a disk with a 16kB hard sector > size, and reporting a 16kB sector size to the READ CAPACITY command. If it > works then, then clearly WNT has no issues with bigger sectors. > I don't think IDE supports this? And Windows 2008 doesn't like the LSI emulated device we expose. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-13 6:32 ` Avi Kivity @ 2009-04-13 15:10 ` Linus Torvalds 2009-04-13 15:38 ` James Bottomley ` (2 more replies) 0 siblings, 3 replies; 45+ messages in thread From: Linus Torvalds @ 2009-04-13 15:10 UTC (permalink / raw) To: Avi Kivity Cc: Alan Cox, Szabolcs Szakacsits, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven On Mon, 13 Apr 2009, Avi Kivity wrote: > > > > - create a big file, > > Just creating a 5GB file in a 64KB filesystem was interesting - Windows > was throwing out 256KB I/Os even though I was generating 1MB writes (and > cached too). Looks like a paranoid IDE driver (qemu exposes a PIIX4). Heh, ok. So the "big file" really only needed to be big enough to not be cached, and 5GB was probably overkill. In fact, if there's some way to blow the cache, you could have made it much smaller. But 5G certainly works ;) And yeah, I'm not surprised it limits the size of the IO. Linux will generally do the same. I forget what our default maximum bio size is, but I suspect it is in that same kind of range. There are often problems with bigger IO's (latency being one, actual controller bugs being another), and even if the hardware has no bugs and its limits are higher, you usually don't want to have excessively large DMA mapping tables _and_ the advantage of bigger IO is usually not that big once you pass the "reasonably sized" limit (which is 64kB+). Plus they happen seldom enough in practice anyway that it's often not worth optimizing for. > > then rewrite just a few bytes in it, and look at the IO pattern of the > > result. Does it actually do the rewrite IO as one 16kB IO, or does it > > do sub-blocking? > > It generates 4KB writes (I was generating aligned 512 byte overwrites). > What's more interesting, it was also issuing 32KB reads to fill the > cache, not 64KB. Since the number of reads and writes per second is > almost equal, it's not splitting a 64KB read into two. Ok, that sounds pretty much _exactly_ like the Linux IO patterns would likely be. The 32kB read has likely nothing to do with any filesystem layout issues (especially as you used a 64kB cluster size), but is simply because (a) Windows caches things with a 4kB granularity, so the 512-byte write turned into a read-modify-write (b) the read was really for just 4kB, but once you start reading you want to do read-ahead anyway since it hardly gets any more expensive to read a few pages than to read just one. So once it had to do the read anyway, windows just read 8 pages instead of one - very reasonable. > > If the latter, then the 16kB thing is just a filesystem layout > > issue, not an internal block-size issue, and WNT would likely have > > exactly the same issues as Linux. > > A 1 byte write on an ordinary file generates a RMW, same as a 4KB write on a > 16KB block. So long as the filesystem is just a layer behind the pagecache > (which I think is the case on Windows), I don't see what issues it can have. Right. It's all very straightforward from a filesystem layout issue. The problem is all about managing memory. You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for your example!). It's a total disaster. Imagine what would happen to user application performance if kmalloc() always returned 16kB-aligned chunks of memory, all sized as integer multiples of 16kB? It would absolutely _suck_. Sure, it would be fine for your large allocations, but any time you handle strings, you'd allocate 16kB of memory for any small 5-byte string. 
You'd have horrible cache behavior, and you'd run out of memory much too quickly. The same is true in the kernel. The single biggest memory user under almost all normal loads is the disk cache. That _is_ the normal allocator for any OS kernel. Everything else is almost details (ok, so Linux in particular does cache metadata very aggressively, so the dcache and inode cache are seldom "just details", but the page cache is still generally the most important part). So having a 16kB or 64kB granularity is a _disaster_. Which is why no sane system does that. It's only useful if you absolutely _only_ work with large files - ie you're a database server. For just about any other workload, that kind of granularity is totally unacceptable. So doing a read-modify-write on a 1-byte (or 512-byte) write, when the block size is 4kB, is easy - we just have to do it anyway. Doing a read-modify-write on a 4kB write and a 16kB (or 64kB) blocksize is also _doable_, and from the IO pattern standpoint it is no different. But from a memory allocation pattern standpoint it's a disaster - because now you're always working with chunks that are just 'too big' to be good building blocks of a reasonable allocator. If you always allocate 64kB for file caches, and you work with lots of small files (like a source tree), you will literally waste all your memory. And if you have some "dynamic" scheme, you'll have tons and tons of really nasty cases when you have to grow a 4kB allocation to a 64kB one when the file grows. Imagine doing "realloc()", but doing it in a _threaded_ environment, where any number of threads may be using the old allocation at the same time. And that's a kernel - it has to be _the_ most threaded program on the whole machine, because otherwise the kernel would be the scaling bottleneck. And THAT is why 64kB blocks are such a disaster. > > - can you tell how many small files it will cache in RAM without doing > > IO? If it always uses 16kB blocks for caching, it will be able to cache > > a _lot_ fewer files in the same amount of RAM than with a smaller block > > size. > > I'll do this later, but given the 32KB reads for the test above, I'm guessing > it will cache pages, not blocks. Yeah, you don't need to. I can already guarantee that Windows does caching on a page granularity. I can also pretty much guarantee that that is also why Windows stops compressing files once the blocksize is bigger than 4kB: because at that point, the block compressions would need to handle _multiple_ cache entities, and that's really painful for all the same reasons that bigger sectors would be really painful - you'd always need to make sure that you always have all of those cache entries in memory together, and you could never treat your cache entries as individual entities. > > Of course, the _really_ conclusive thing (in a virtualized environment) is > > to just make the virtual disk only able to do 16kB IO accesses (and with > > 16kB alignment). IOW, actually emulate a disk with a 16kB hard sector > > size, and reporting a 16kB sector size to the READ CAPACITY command. If it > > works then, then clearly WNT has no issues with bigger sectors. > > I don't think IDE supports this? And Windows 2008 doesn't like the LSI > emulated device we expose. Yeah, you'd have to have the OS use the SCSI commands for disk discovery, so at least a SATA interface. With IDE disks, the sector size always has to be 512 bytes, I think. Linus ^ permalink raw reply [flat|nested] 45+ messages in thread
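[To make the waste figure concrete: the ~4kB median source-file size is the number Linus quotes above, the rest is simple arithmetic. A file smaller than the cache granule still pins a whole granule of memory:

     4kB granularity:  4kB file in a  4kB block  ->  0% of the block wasted
    16kB granularity:  4kB file in a 16kB block  ->  12/16 = 75% wasted
    64kB granularity:  4kB file in a 64kB block  ->  60/64 ~ 94% wasted

which is where the "page cache effectively cut in a quarter" comes from for the 16kB case.]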
* Re: Implementing NVMHCI... 2009-04-13 15:10 ` Linus Torvalds @ 2009-04-13 15:38 ` James Bottomley 2009-04-14 7:22 ` Andi Kleen 2009-04-14 9:59 ` Avi Kivity 2 siblings, 0 replies; 45+ messages in thread From: James Bottomley @ 2009-04-13 15:38 UTC (permalink / raw) To: Linus Torvalds Cc: Avi Kivity, Alan Cox, Szabolcs Szakacsits, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven On Mon, 2009-04-13 at 08:10 -0700, Linus Torvalds wrote: > On Mon, 13 Apr 2009, Avi Kivity wrote: > > > Of course, the _really_ conclusive thing (in a virtualized environment) is > > > to just make the virtual disk only able to do 16kB IO accesses (and with > > > 16kB alignment). IOW, actually emulate a disk with a 16kB hard sector size, > > > and reporting a 16kB sector size to the READ CAPACITY command. If it works > > > then, then clearly WNT has no issues with bigger sectors. > > > > I don't think IDE supports this? And Windows 2008 doesn't like the LSI > > emulated device we expose. > > Yeah, you'd have to have the OS use the SCSI commands for disk discovery, > so at least a SATA interface. With IDE disks, the sector size always has > to be 512 bytes, I think. Actually, the latest ATA rev supports different sector sizes in preparation for native 4k sector size SATA disks (words 117-118 of IDENTIFY). Matthew Wilcox already has the patches for libata ready. James ^ permalink raw reply [flat|nested] 45+ messages in thread
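[For reference, a sketch of how the IDENTIFY fields James mentions decode -- this is my reading of ATA8-ACS, so treat the exact bit positions as assumptions to verify against the spec: word 106 is valid when bit 15 is clear and bit 14 is set, bit 12 means words 117-118 carry the logical sector size in 16-bit words, and bit 13 together with bits 3:0 gives logical sectors per physical sector.]

    /* Illustrative user-space decoder for a 256-word IDENTIFY block. */
    #include <stdint.h>
    #include <stdio.h>

    static unsigned int logical_sector_size(const uint16_t *id)
    {
        uint16_t w106 = id[106];

        if ((w106 & 0xc000) == 0x4000 && (w106 & (1 << 12)))
            return ((uint32_t)id[118] << 16 | id[117]) * 2;  /* words -> bytes */
        return 512;                        /* default logical sector */
    }

    static unsigned int logical_per_physical(const uint16_t *id)
    {
        uint16_t w106 = id[106];

        if ((w106 & 0xc000) == 0x4000 && (w106 & (1 << 13)))
            return 1u << (w106 & 0xf);     /* 2^N logical per physical */
        return 1;
    }

    int main(void)
    {
        uint16_t id[256] = { 0 };

        /* Hypothetical drive: 512-byte logical, 4kB physical sectors. */
        id[106] = 0x4000 | (1 << 13) | 3;  /* 2^3 = 8 logical/physical */

        printf("logical %u bytes, %u logical sectors per physical\n",
               logical_sector_size(id), logical_per_physical(id));
        return 0;
    }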
* Re: Implementing NVMHCI... 2009-04-13 15:10 ` Linus Torvalds 2009-04-13 15:38 ` James Bottomley @ 2009-04-14 7:22 ` Andi Kleen 2009-04-14 10:07 ` Avi Kivity 2009-04-14 9:59 ` Avi Kivity 2 siblings, 1 reply; 45+ messages in thread From: Andi Kleen @ 2009-04-14 7:22 UTC (permalink / raw) To: Linus Torvalds Cc: Avi Kivity, Alan Cox, Szabolcs Szakacsits, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Linus Torvalds <torvalds@linux-foundation.org> writes: > > You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for > your example!). AFAIK at least for user visible anonymous memory Windows uses 64k chunks. At least that is what Cygwin's mmap exposes. I don't know if it does the same for disk cache. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-14 7:22 ` Andi Kleen @ 2009-04-14 10:07 ` Avi Kivity 0 siblings, 0 replies; 45+ messages in thread From: Avi Kivity @ 2009-04-14 10:07 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Alan Cox, Szabolcs Szakacsits, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Andi Kleen wrote: > Linus Torvalds <torvalds@linux-foundation.org> writes: > >> You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for >> your example!). >> > > AFAIK at least for user visible anonymous memory Windows uses 64k > chunks. At least that is what Cygwin's mmap exposes. I don't know > if it does the same for disk cache. > I think that's just the region address and size granularity (as in vmas). For paging they still use the mmu page size. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-13 15:10 ` Linus Torvalds 2009-04-13 15:38 ` James Bottomley 2009-04-14 7:22 ` Andi Kleen @ 2009-04-14 9:59 ` Avi Kivity 2009-04-14 10:23 ` Jeff Garzik 2 siblings, 1 reply; 45+ messages in thread From: Avi Kivity @ 2009-04-14 9:59 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Szabolcs Szakacsits, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Linus Torvalds wrote: > On Mon, 13 Apr 2009, Avi Kivity wrote: > >>> - create a big file, >>> >> Just creating a 5GB file in a 64KB filesystem was interesting - Windows >> was throwing out 256KB I/Os even though I was generating 1MB writes (and >> cached too). Looks like a paranoid IDE driver (qemu exposes a PIIX4). >> > > Heh, ok. So the "big file" really only needed to be big enough to not be > cached, and 5GB was probably overkill. In fact, if there's some way to > blow the cache, you could have made it much smaller. But 5G certainly > works ;) > I wanted to make sure my random writes later don't get coalesced. A 1GB file, half of which is cached (I used a 1GB guest), offers lots of chances for coalescing if Windows delays the writes sufficiently. At 5GB, Windows can only cache 10% of the file, so it will be continuously flushing. > > (a) Windows caches things with a 4kB granularity, so the 512-byte write > turned into a read-modify-write > > [...] > You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for > your example!). It's a total disaster. Imagine what would happen to user > application performance if kmalloc() always returned 16kB-aligned chunks > of memory, all sized as integer multiples of 16kB? It would absolutely > _suck_. Sure, it would be fine for your large allocations, but any time > you handle strings, you'd allocate 16kB of memory for any small 5-byte > string. You'd have horrible cache behavior, and you'd run out of memory > much too quickly. > > The same is true in the kernel. The single biggest memory user under > almost all normal loads is the disk cache. That _is_ the normal allocator > for any OS kernel. Everything else is almost details (ok, so Linux in > particular does cache metadata very aggressively, so the dcache and inode > cache are seldom "just details", but the page cache is still generally the > most important part). > > So having a 16kB or 64kB granularity is a _disaster_. Which is why no sane > system does that. It's only useful if you absolutely _only_ work with > large files - ie you're a database server. For just about any other > workload, that kind of granularity is totally unnacceptable. > > So doing a read-modify-write on a 1-byte (or 512-byte) write, when the > block size is 4kB is easy - we just have to do it anyway. > > Doing a read-modify-write on a 4kB write and a 16kB (or 64kB) blocksize is > also _doable_, and from the IO pattern standpoint it is no different. But > from a memory allocation pattern standpoint it's a disaster - because now > you're always working with chunks that are just 'too big' to be good > building blocks of a reasonable allocator. > > If you always allocate 64kB for file caches, and you work with lots of > small files (like a source tree), you will literally waste all your > memory. > > Well, no one is talking about 64KB granularity for in-core files. Like you noticed, Windows uses the mmu page size. We could keep doing that, and still have 16KB+ sector sizes. It just means a RMW if you don't happen to have the adjoining clean pages in cache. 
Sure, on a rotating disk that's a disaster, but we're talking SSD here, so while you're doubling your access time, you're doubling a fairly small quantity. The controller would do the same if it exposed smaller sectors, so there's no huge loss. We still lose on disk storage efficiency, but I'm guessing that for a modern tree with some object files with debug information and a .git directory, it won't be such a great hit. For more mainstream uses, it would be negligible. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-14 9:59 ` Avi Kivity @ 2009-04-14 10:23 ` Jeff Garzik 2009-04-14 10:37 ` Avi Kivity 2009-04-25 8:26 ` Pavel Machek 0 siblings, 2 replies; 45+ messages in thread From: Jeff Garzik @ 2009-04-14 10:23 UTC (permalink / raw) To: Avi Kivity Cc: Linus Torvalds, Alan Cox, Szabolcs Szakacsits, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Avi Kivity wrote: > Well, no one is talking about 64KB granularity for in-core files. Like > you noticed, Windows uses the mmu page size. We could keep doing that, > and still have 16KB+ sector sizes. It just means a RMW if you don't > happen to have the adjoining clean pages in cache. > > Sure, on a rotating disk that's a disaster, but we're talking SSD here, > so while you're doubling your access time, you're doubling a fairly > small quantity. The controller would do the same if it exposed smaller > sectors, so there's no huge loss. > > We still lose on disk storage efficiency, but I'm guessing that a modern > tree with some object files with debug information and a .git directory > it won't be such a great hit. For more mainstream uses, it would be > negligible. Speaking of RMW... in one sense, we have to deal with RMW anyway. Upcoming ATA hard drives will be configured with a normal 512b sector API interface, but underlying physical sector size is 1k or 4k. The disk performs the RMW for us, but we must be aware of physical sector size in order to determine proper alignment of on-disk data, to minimize RMW cycles. At the moment, it seems like most of the effort to get these ATA devices to perform efficiently is in getting partition / RAID stripe offsets set up properly. So perhaps for NVMHCI we could (a) hardcode NVM sector size maximum at 4k (b) do RMW in the driver for sector size >4k, and (c) export information indicating the true sector size, in a manner similar to how the ATA driver passes that info to userland partitioning tools. Jeff ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-14 10:23 ` Jeff Garzik @ 2009-04-14 10:37 ` Avi Kivity 2009-04-14 11:45 ` Jeff Garzik 2009-04-25 8:26 ` Pavel Machek 1 sibling, 1 reply; 45+ messages in thread From: Avi Kivity @ 2009-04-14 10:37 UTC (permalink / raw) To: Jeff Garzik Cc: Linus Torvalds, Alan Cox, Szabolcs Szakacsits, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Jeff Garzik wrote: > Speaking of RMW... in one sense, we have to deal with RMW anyway. > Upcoming ATA hard drives will be configured with a normal 512b sector > API interface, but underlying physical sector size is 1k or 4k. > > The disk performs the RMW for us, but we must be aware of physical > sector size in order to determine proper alignment of on-disk data, to > minimize RMW cycles. > Virtualization has the same issue. OS installers will typically setup the first partition at sector 63, and that means every page-sized block access will be misaligned. Particularly bad when the guest's disk is backed on a regular file. Windows 2008 aligns partitions on a 1MB boundary, IIRC. > At the moment, it seems like most of the effort to get these ATA > devices to perform efficiently is in getting partition / RAID stripe > offsets set up properly. > > So perhaps for NVMHCI we could > (a) hardcode NVM sector size maximum at 4k > (b) do RMW in the driver for sector size >4k, and Why not do it in the block layer? That way it isn't limited to one driver. > (c) export information indicating the true sector size, in a manner > similar to how the ATA driver passes that info to userland > partitioning tools. Eventually we'll want to allow filesystems to make use of the native sector size. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-14 10:37 ` Avi Kivity @ 2009-04-14 11:45 ` Jeff Garzik 2009-04-14 11:58 ` Szabolcs Szakacsits 2009-04-14 12:08 ` Avi Kivity 0 siblings, 2 replies; 45+ messages in thread From: Jeff Garzik @ 2009-04-14 11:45 UTC (permalink / raw) To: Avi Kivity Cc: Linus Torvalds, Alan Cox, Szabolcs Szakacsits, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Avi Kivity wrote: > Jeff Garzik wrote: >> Speaking of RMW... in one sense, we have to deal with RMW anyway. >> Upcoming ATA hard drives will be configured with a normal 512b sector >> API interface, but underlying physical sector size is 1k or 4k. >> >> The disk performs the RMW for us, but we must be aware of physical >> sector size in order to determine proper alignment of on-disk data, to >> minimize RMW cycles. >> > > Virtualization has the same issue. OS installers will typically setup > the first partition at sector 63, and that means every page-sized block > access will be misaligned. Particularly bad when the guest's disk is > backed on a regular file. > > Windows 2008 aligns partitions on a 1MB boundary, IIRC. Makes a lot of sense... >> At the moment, it seems like most of the effort to get these ATA >> devices to perform efficiently is in getting partition / RAID stripe >> offsets set up properly. >> >> So perhaps for NVMHCI we could >> (a) hardcode NVM sector size maximum at 4k >> (b) do RMW in the driver for sector size >4k, and > > Why not do it in the block layer? That way it isn't limited to one driver. Sure. "in the driver" is a highly relative phrase :) If there is code to be shared among multiple callsites, let's share it. >> (c) export information indicating the true sector size, in a manner >> similar to how the ATA driver passes that info to userland >> partitioning tools. > > Eventually we'll want to allow filesystems to make use of the native > sector size. At the kernel level, you mean? Filesystems already must deal with issues such as avoiding RAID stripe boundaries (man mke2fs, search for 'RAID'). So I hope that same code should be applicable to cases where the "logical sector size" (as exported by storage interface) differs from "physical sector size" (the underlying hardware sector size, not directly accessible by OS). But if you are talking about filesystems directly supporting sector sizes >4kb, well, I'll let Linus and others settle that debate :) I will just write the driver once the dust settles... Jeff ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-14 11:45 ` Jeff Garzik @ 2009-04-14 11:58 ` Szabolcs Szakacsits 2009-04-17 22:45 ` H. Peter Anvin 2009-04-14 12:08 ` Avi Kivity 1 sibling, 1 reply; 45+ messages in thread From: Szabolcs Szakacsits @ 2009-04-14 11:58 UTC (permalink / raw) To: Jeff Garzik Cc: Avi Kivity, Linus Torvalds, Alan Cox, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven On Tue, 14 Apr 2009, Jeff Garzik wrote: > Avi Kivity wrote: > > Jeff Garzik wrote: > > > Speaking of RMW... in one sense, we have to deal with RMW anyway. > > > Upcoming ATA hard drives will be configured with a normal 512b sector API > > > interface, but underlying physical sector size is 1k or 4k. > > > > > > The disk performs the RMW for us, but we must be aware of physical sector > > > size in order to determine proper alignment of on-disk data, to minimize > > > RMW cycles. > > > > > > > Virtualization has the same issue. OS installers will typically setup the > > first partition at sector 63, and that means every page-sized block access > > will be misaligned. Particularly bad when the guest's disk is backed on a > > regular file. > > > > Windows 2008 aligns partitions on a 1MB boundary, IIRC. > > Makes a lot of sense... Since Vista at least the first partition is 2048 sector aligned. Szaka -- NTFS-3G: http://ntfs-3g.org ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-14 11:58 ` Szabolcs Szakacsits @ 2009-04-17 22:45 ` H. Peter Anvin 0 siblings, 0 replies; 45+ messages in thread From: H. Peter Anvin @ 2009-04-17 22:45 UTC (permalink / raw) To: Szabolcs Szakacsits Cc: Jeff Garzik, Avi Kivity, Linus Torvalds, Alan Cox, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Szabolcs Szakacsits wrote: >>> >>> Windows 2008 aligns partitions on a 1MB boundary, IIRC. >> Makes a lot of sense... > > Since Vista at least the first partition is 2048 sector aligned. > > Szaka > 2048 * 512 = 1 MB, yes. I *think* it's actually 1 MB and not 2048 sectors, but yes, they've finally dumped the idiotic DOS misalignment. Unfortunately the GNU parted people have said that the parted code is too fragile to fix parted in this way. Sigh. -hpa ^ permalink raw reply [flat|nested] 45+ messages in thread
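[The alignment arithmetic behind this sub-thread, spelled out: a partition start is aligned for a given physical sector (or stripe) size when (start_lba * 512) % physical_size == 0.

    DOS-era layout:   63 * 512  =    32,256 bytes  -> not a multiple of 4096, so every
                                                      4kB block straddles two physical sectors
    Vista/2008:     2048 * 512  = 1,048,576 bytes  -> 1 MiB, aligned for any power-of-two
                                                      sector or stripe size up to 1 MiB
]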
* Re: Implementing NVMHCI... 2009-04-14 11:45 ` Jeff Garzik 2009-04-14 11:58 ` Szabolcs Szakacsits @ 2009-04-14 12:08 ` Avi Kivity 2009-04-14 12:21 ` Jeff Garzik 1 sibling, 1 reply; 45+ messages in thread From: Avi Kivity @ 2009-04-14 12:08 UTC (permalink / raw) To: Jeff Garzik Cc: Linus Torvalds, Alan Cox, Szabolcs Szakacsits, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Jeff Garzik wrote: >>> (c) export information indicating the true sector size, in a manner >>> similar to how the ATA driver passes that info to userland >>> partitioning tools. >> >> Eventually we'll want to allow filesystems to make use of the native >> sector size. > > At the kernel level, you mean? > Yes. You'll want to align extents and I/O requests on that boundary. > > But if you are talking about filesystems directly supporting sector > sizes >4kb, well, I'll let Linus and others settle that debate :) I > will just write the driver once the dust settles... IMO drivers should expose whatever sector size the device have, filesystems should expose their block size, and the block layer should correct any impedance mismatches by doing RMW. Unfortunately, sector size > fs block size means a lot of pointless locking for the RMW, so if large sector sizes take off, we'll have to adjust filesystems to use larger block sizes. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-14 12:08 ` Avi Kivity @ 2009-04-14 12:21 ` Jeff Garzik 0 siblings, 0 replies; 45+ messages in thread From: Jeff Garzik @ 2009-04-14 12:21 UTC (permalink / raw) To: Avi Kivity Cc: Linus Torvalds, Alan Cox, Szabolcs Szakacsits, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Avi Kivity wrote: > Jeff Garzik wrote: >>>> (c) export information indicating the true sector size, in a manner >>>> similar to how the ATA driver passes that info to userland >>>> partitioning tools. >>> >>> Eventually we'll want to allow filesystems to make use of the native >>> sector size. >> >> At the kernel level, you mean? >> > > Yes. You'll want to align extents and I/O requests on that boundary. Sure. And RAID today presents these issues to the filesystem... man mke2fs(8), and look at extended options 'stride' and 'stripe-width'. It includes mention of RMW issues. >> But if you are talking about filesystems directly supporting sector >> sizes >4kb, well, I'll let Linus and others settle that debate :) I >> will just write the driver once the dust settles... > > IMO drivers should expose whatever sector size the device have, > filesystems should expose their block size, and the block layer should > correct any impedance mismatches by doing RMW. > > Unfortunately, sector size > fs block size means a lot of pointless > locking for the RMW, so if large sector sizes take off, we'll have to > adjust filesystems to use larger block sizes. Don't forget the case where the device does RMW for you, and does not permit direct access to physical sector size (all operations are in terms of logical sector size). Jeff ^ permalink raw reply [flat|nested] 45+ messages in thread
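[A worked example of the stride/stripe-width arithmetic Jeff points at, for a hypothetical RAID5 array of five drives (four data disks) with a 64KiB chunk size and 4KiB filesystem blocks. The option names are the ones the mke2fs(8) man page of that era uses (spelling may vary between e2fsprogs versions), and the device name is made up:

    stride       = chunk size / block size   = 64KiB / 4KiB = 16 blocks
    stripe-width = stride * data disks       = 16 * 4       = 64 blocks

    mke2fs -b 4096 -E stride=16,stripe-width=64 /dev/md0
]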
* Re: Implementing NVMHCI... 2009-04-14 10:23 ` Jeff Garzik 2009-04-14 10:37 ` Avi Kivity @ 2009-04-25 8:26 ` Pavel Machek 1 sibling, 0 replies; 45+ messages in thread From: Pavel Machek @ 2009-04-25 8:26 UTC (permalink / raw) To: Jeff Garzik Cc: Avi Kivity, Linus Torvalds, Alan Cox, Szabolcs Szakacsits, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Hi! >> Well, no one is talking about 64KB granularity for in-core files. Like >> you noticed, Windows uses the mmu page size. We could keep doing that, >> and still have 16KB+ sector sizes. It just means a RMW if you don't >> happen to have the adjoining clean pages in cache. >> >> Sure, on a rotating disk that's a disaster, but we're talking SSD here, >> so while you're doubling your access time, you're doubling a fairly >> small quantity. The controller would do the same if it exposed smaller >> sectors, so there's no huge loss. >> >> We still lose on disk storage efficiency, but I'm guessing that a >> modern tree with some object files with debug information and a .git >> directory it won't be such a great hit. For more mainstream uses, it >> would be negligible. > > > Speaking of RMW... in one sense, we have to deal with RMW anyway. > Upcoming ATA hard drives will be configured with a normal 512b sector > API interface, but underlying physical sector size is 1k or 4k. > > The disk performs the RMW for us, but we must be aware of physical > sector size in order to determine proper alignment of on-disk data, to > minimize RMW cycles. Also... RMW has some nasty reliability implications. If we use 1KB block size ext3 (or something like that), unrelated data may now be damaged during powerfail. Filesystems cannot handle that :-(. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-12 15:14 ` Implementing NVMHCI Szabolcs Szakacsits 2009-04-12 15:20 ` Alan Cox @ 2009-04-12 15:41 ` Linus Torvalds 2009-04-12 17:02 ` Robert Hancock ` (2 more replies) 1 sibling, 3 replies; 45+ messages in thread From: Linus Torvalds @ 2009-04-12 15:41 UTC (permalink / raw) To: Szabolcs Szakacsits Cc: Alan Cox, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven On Sun, 12 Apr 2009, Szabolcs Szakacsits wrote: > > I did not hear about NTFS using >4kB sectors yet but technically > it should work. > > The atomic building units (sector size, block size, etc) of NTFS are > entirely parametric. The maximum values could be bigger than the > currently "configured" maximum limits. It's probably trivial to make ext3 support 16kB blocksizes (if it doesn't already). That's not the problem. The "filesystem layout" part is just a parameter. The problem is then trying to actually access such a filesystem, in particular trying to write to it, or trying to mmap() small chunks of it. The FS layout is the trivial part. > At present the limits are set in the BIOS Parameter Block in the NTFS > Boot Sector. This is 2 bytes for the "Bytes Per Sector" and 1 byte for > "Sectors Per Block". So >4kB sector size should work since 1993. > > 64kB+ sector size could be possible by bootstrapping NTFS drivers > in a different way. Try it. And I don't mean "try to create that kind of filesystem". Try to _use_ it. Does Windows actually support using it, or is it just a matter of "the filesystem layout is _specified_ for up to 64kB block sizes"? And I really don't know. Maybe Windows does support it. I'm just very suspicious. I think there's a damn good reason why NTFS supports larger block sizes in theory, BUT EVERYBODY USES A 4kB BLOCKSIZE DESPITE THAT! Because it really is a hard problem. It's really pretty nasty to have your cache blocking be smaller than the actual filesystem blocksize (the other way is much easier, although it's certainly not pleasant either - Linux supports it because we _have_ to, but if the hardware sector size had traditionally been 4kB, I'd certainly also argue against adding complexity just to make it smaller, the same way I argue against making it much larger). And don't get me wrong - we could (fairly) trivially make the PAGE_CACHE_SIZE be bigger - even eventually go so far as to make it a per-mapping thing, so that you could have some filesystems with that bigger sector size and some with smaller ones. I think Andrea had patches that did a fair chunk of it, and that _almost_ worked. But it ABSOLUTELY SUCKS. If we did a 16kB page-cache-size, it would absolutely blow chunks. It would be disgustingly horrible. Putting the kernel source tree on such a filesystem would waste about 75% of all memory (the median size of a source file is just about 4kB), so your page cache would be effectively cut in a quarter for a lot of real loads. And to fix up _that_, you'd need to now do things like sub-page allocations, and now your page-cache size isn't even fixed per filesystem, it would be per-file, and the filesystem (and the drivers!) would have to handle the cases of getting those 4kB partial pages (and do r-m-w IO after all if your hardware sector size is >4kB). IOW, there are simple things we can do - but they would SUCK. And there are really complicated things we could do - and they would _still_ SUCK, plus now I pretty much guarantee that your system would also be a lot less stable. It really isn't worth it. 
It's much better for everybody to just be aware of the incredible level of pure suckage of a general-purpose disk that has hardware sectors >4kB. Just educate people that it's not good. Avoid the whole insane suckage early, rather than be disappointed in hardware that is total and utter CRAP and just causes untold problems. Now, for specialty uses, things are different. CD-ROM's have had 2kB sector sizes for a long time, and the reason it was never as big of a problem isn't that they are still smaller than 4kB - it's that they are read-only, and use special filesystems. And people _know_ they are special. Yes, even when you write to them, it's a very special op. You'd never try to put NTFS on a CD-ROM, and everybody knows it's not a disk replacement. In _those_ kinds of situations, a 64kB block isn't much of a problem. We can do read-only media (where "read-only" doesn't have to be absolute: the important part is that writing is special), and never have problems. That's easy. Almost all the problems with block-size go away if you think reading is 99.9% of the load. But if you want to see it as a _disk_ (ie replacing SSD's or rotational media), 4kB blocksize is the maximum sane one for Linux/x86 (or, indeed, any "Linux/not-just-database-server" - it really isn't so much about x86, as it is about large cache granularity causing huge memory fragmentation issues). Linus ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-12 15:41 ` Linus Torvalds @ 2009-04-12 17:02 ` Robert Hancock 2009-04-12 17:20 ` Linus Torvalds 2009-04-12 17:23 ` James Bottomley [not found] ` <6934efce0904141052j3d4f87cey9fc4b802303aa73b@mail.gmail.com> 2 siblings, 1 reply; 45+ messages in thread From: Robert Hancock @ 2009-04-12 17:02 UTC (permalink / raw) To: Linus Torvalds Cc: Szabolcs Szakacsits, Alan Cox, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Linus Torvalds wrote: > > On Sun, 12 Apr 2009, Szabolcs Szakacsits wrote: >> I did not hear about NTFS using >4kB sectors yet but technically >> it should work. >> >> The atomic building units (sector size, block size, etc) of NTFS are >> entirely parametric. The maximum values could be bigger than the >> currently "configured" maximum limits. > > It's probably trivial to make ext3 support 16kB blocksizes (if it doesn't > already). > > That's not the problem. The "filesystem layout" part is just a parameter. > > The problem is then trying to actually access such a filesystem, in > particular trying to write to it, or trying to mmap() small chunks of it. > The FS layout is the trivial part. > >> At present the limits are set in the BIOS Parameter Block in the NTFS >> Boot Sector. This is 2 bytes for the "Bytes Per Sector" and 1 byte for >> "Sectors Per Block". So >4kB sector size should work since 1993. >> >> 64kB+ sector size could be possible by bootstrapping NTFS drivers >> in a different way. > > Try it. And I don't mean "try to create that kind of filesystem". Try to > _use_ it. Does Window actually support using it it, or is it just a matter > of "the filesystem layout is _specified_ for up to 64kB block sizes"? > > And I really don't know. Maybe Windows does support it. I'm just very > suspicious. I think there's a damn good reason why NTFS supports larger > block sizes in theory, BUT EVERYBODY USES A 4kB BLOCKSIZE DESPITE THAT! I can't find any mention that any formattable block size can't be used, other than the fact that "The maximum default cluster size under Windows NT 3.51 and later is 4K due to the fact that NTFS file compression is not possible on drives with a larger allocation size. So format will never use larger than 4k clusters unless the user specifically overrides the defaults". It could be there are other downsides to >4K cluster sizes as well, but that's the reason they state. What about FAT? It supports cluster sizes up to 32K at least (possibly up to 256K as well, although somewhat nonstandard), and that works.. We support that in Linux, don't we? > > Because it really is a hard problem. It's really pretty nasty to have your > cache blocking be smaller than the actual filesystem blocksize (the other > way is much easier, although it's certainly not pleasant either - Linux > supports it because we _have_ to, but sector-size of hardware had > traditionally been 4kB, I'd certainly also argue against adding complexity > just to make it smaller, the same way I argue against making it much > larger). > > And don't get me wrong - we could (fairly) trivially make the > PAGE_CACHE_SIZE be bigger - even eventually go so far as to make it a > per-mapping thing, so that you could have some filesystems with that > bigger sector size and some with smaller ones. I think Andrea had patches > that did a fair chunk of it, and that _almost_ worked. > > But it ABSOLUTELY SUCKS. If we did a 16kB page-cache-size, it would > absolutely blow chunks. It would be disgustingly horrible. 
Putting the > kernel source tree on such a filesystem would waste about 75% of all > memory (the median size of a source file is just about 4kB), so your page > cache would be effectively cut in a quarter for a lot of real loads. > > And to fix up _that_, you'd need to now do things like sub-page > allocations, and now your page-cache size isn't even fixed per filesystem, > it would be per-file, and the filesystem (and the drievrs!) would hav to > handle the cases of getting those 4kB partial pages (and do r-m-w IO after > all if your hardware sector size is >4kB). > > IOW, there are simple things we can do - but they would SUCK. And there > are really complicated things we could do - and they would _still_ SUCK, > plus now I pretty much guarantee that your system would also be a lot less > stable. > > It really isn't worth it. It's much better for everybody to just be aware > of the incredible level of pure suckage of a general-purpose disk that has > hardware sectors >4kB. Just educate people that it's not good. Avoid the > whole insane suckage early, rather than be disappointed in hardware that > is total and utter CRAP and just causes untold problems. > > Now, for specialty uses, things are different. CD-ROM's have had 2kB > sector sizes for a long time, and the reason it was never as big of a > problem isn't that they are still smaller than 4kB - it's that they are > read-only, and use special filesystems. And people _know_ they are > special. Yes, even when you write to them, it's a very special op. You'd > never try to put NTFS on a CD-ROM, and everybody knows it's not a disk > replacement. > > In _those_ kinds of situations, a 64kB block isn't much of a problem. We > can do read-only media (where "read-only" doesn't have to be absolute: the > important part is that writing is special), and never have problems. > That's easy. Almost all the problems with block-size go away if you think > reading is 99.9% of the load. > > But if you want to see it as a _disk_ (ie replacing SSD's or rotational > media), 4kB blocksize is the maximum sane one for Linux/x86 (or, indeed, > any "Linux/not-just-database-server" - it really isn't so much about x86, > as it is about large cache granularity causing huge memory fragmentation > issues). > > Linus > -- > To unsubscribe from this list: send the line "unsubscribe linux-ide" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-12 17:02 ` Robert Hancock @ 2009-04-12 17:20 ` Linus Torvalds 2009-04-12 18:35 ` Robert Hancock 2009-04-13 11:18 ` Avi Kivity 0 siblings, 2 replies; 45+ messages in thread From: Linus Torvalds @ 2009-04-12 17:20 UTC (permalink / raw) To: Robert Hancock Cc: Szabolcs Szakacsits, Alan Cox, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven On Sun, 12 Apr 2009, Robert Hancock wrote: > > What about FAT? It supports cluster sizes up to 32K at least (possibly up to > 256K as well, although somewhat nonstandard), and that works.. We support that > in Linux, don't we? Sure. The thing is, "cluster size" in an FS is totally different from sector size. People are missing the point here. You can trivially implement bigger cluster sizes by just writing multiple sectors. In fact, even just a 4kB cluster size is actually writing 8 512-byte hardware sectors on all normal disks. So you can support big clusters without having big sectors. A 32kB cluster size in FAT is absolutely trivial to do: it's really purely an allocation size. So a fat filesystem allocates disk-space in 32kB chunks, but then when you actually do IO to it, you can still write things 4kB at a time (or smaller), because once the allocation has been made, you still treat the disk as a series of smaller blocks. IOW, when you allocate a new 32kB cluster, you will have to allocate 8 pages to do IO on it (since you'll have to initialize the diskspace), but you can still literally treat those pages as _individual_ pages, and you can write them out in any order, and you can free them (and then look them up) one at a time. Notice? The cluster size really only ends up being a disk-space allocation issue, not an issue for actually caching the end result or for the actual size of the IO. The hardware sector size is very different. If you have a 32kB hardware sector size, that implies that _all_ IO has to be done with that granularity. Now you can no longer treat the eight pages as individual pages - you _have_ to write them out and read them in as one entity. If you dirty one page, you effectively dirty them all. You can not drop and re-allocate pages one at a time any more. Linus ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-12 17:20 ` Linus Torvalds @ 2009-04-12 18:35 ` Robert Hancock 2009-04-13 11:18 ` Avi Kivity 1 sibling, 0 replies; 45+ messages in thread From: Robert Hancock @ 2009-04-12 18:35 UTC (permalink / raw) To: Linus Torvalds Cc: Szabolcs Szakacsits, Alan Cox, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Linus Torvalds wrote: > IOW, when you allocate a new 32kB cluster, you will have to allocate 8 > pages to do IO on it (since you'll have to initialize the diskspace), but > you can still literally treat those pages as _individual_ pages, and you > can write them out in any order, and you can free them (and then look them > up) one at a time. > > Notice? The cluster size really only ends up being a disk-space allocation > issue, not an issue for actually caching the end result or for the actual > size of the IO. Right.. I didn't realize we were actually that smart (not writing out the entire cluster when dirtying one page) but I guess it makes sense. > > The hardware sector size is very different. If you have a 32kB hardware > sector size, that implies that _all_ IO has to be done with that > granularity. Now you can no longer treat the eight pages as individual > pages - you _have_ to write them out and read them in as one entity. If > you dirty one page, you effectively dirty them all. You can not drop and > re-allocate pages one at a time any more. > > Linus I suspect that in this case trying to gang together multiple pages inside the VM to actually handle it this way all the way through would be insanity. My guess is the only way you could sanely do it is the read-modify-write approach when writing out the data (in the block layer maybe?) where the read can be optimized away if the pages for the entire hardware sector are already in cache or the write is large enough to replace the entire sector. I assume we already do this in the md code somewhere for cases like software RAID 5 with a stripe size of >4KB.. That obviously would have some performance drawbacks compared to a smaller sector size, but if the device is bound and determined to use bigger sectors internally one way or the other and the alternative is the drive does R-M-W internally to emulate smaller sectors - which for some devices seems to be the case - maybe it makes more sense to do it in the kernel if we have more information to allow us to do it more efficiently. (Though, at least on the normal ATA disk side of things, 4K is the biggest number I've heard tossed about for a future expanded sector size, but flash devices like this may be another story..) ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-12 17:20 ` Linus Torvalds 2009-04-12 18:35 ` Robert Hancock @ 2009-04-13 11:18 ` Avi Kivity 1 sibling, 0 replies; 45+ messages in thread From: Avi Kivity @ 2009-04-13 11:18 UTC (permalink / raw) To: Linus Torvalds Cc: Robert Hancock, Szabolcs Szakacsits, Alan Cox, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Linus Torvalds wrote: > The hardware sector size is very different. If you have a 32kB hardware > sector size, that implies that _all_ IO has to be done with that > granularity. Now you can no longer treat the eight pages as individual > pages - you _have_ to write them out and read them in as one entity. If > you dirty one page, you effectively dirty them all. You can not drop and > re-allocate pages one at a time any more. > You can still drop clean pages. Sure, that costs you performance as you'll have to do re-read them in order to write a dirty page, but in the common case, the clean pages around would still be available and you'd avoid it. Applications that randomly write to large files can be tuned to use the disk sector size. As for the rest, they're either read-only (executable mappings) or sequential. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-12 15:41 ` Linus Torvalds 2009-04-12 17:02 ` Robert Hancock @ 2009-04-12 17:23 ` James Bottomley [not found] ` <6934efce0904141052j3d4f87cey9fc4b802303aa73b@mail.gmail.com> 2 siblings, 0 replies; 45+ messages in thread From: James Bottomley @ 2009-04-12 17:23 UTC (permalink / raw) To: Linus Torvalds Cc: Szabolcs Szakacsits, Alan Cox, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven On Sun, 2009-04-12 at 08:41 -0700, Linus Torvalds wrote: > > On Sun, 12 Apr 2009, Szabolcs Szakacsits wrote: > > > > I did not hear about NTFS using >4kB sectors yet but technically > > it should work. > > > > The atomic building units (sector size, block size, etc) of NTFS are > > entirely parametric. The maximum values could be bigger than the > > currently "configured" maximum limits. > > It's probably trivial to make ext3 support 16kB blocksizes (if it doesn't > already). > > That's not the problem. The "filesystem layout" part is just a parameter. > > The problem is then trying to actually access such a filesystem, in > particular trying to write to it, or trying to mmap() small chunks of it. > The FS layout is the trivial part. > > > At present the limits are set in the BIOS Parameter Block in the NTFS > > Boot Sector. This is 2 bytes for the "Bytes Per Sector" and 1 byte for > > "Sectors Per Block". So >4kB sector size should work since 1993. > > > > 64kB+ sector size could be possible by bootstrapping NTFS drivers > > in a different way. > > Try it. And I don't mean "try to create that kind of filesystem". Try to > _use_ it. Does Window actually support using it it, or is it just a matter > of "the filesystem layout is _specified_ for up to 64kB block sizes"? > > And I really don't know. Maybe Windows does support it. I'm just very > suspicious. I think there's a damn good reason why NTFS supports larger > block sizes in theory, BUT EVERYBODY USES A 4kB BLOCKSIZE DESPITE THAT! > > Because it really is a hard problem. It's really pretty nasty to have your > cache blocking be smaller than the actual filesystem blocksize (the other > way is much easier, although it's certainly not pleasant either - Linux > supports it because we _have_ to, but sector-size of hardware had > traditionally been 4kB, I'd certainly also argue against adding complexity > just to make it smaller, the same way I argue against making it much > larger). > > And don't get me wrong - we could (fairly) trivially make the > PAGE_CACHE_SIZE be bigger - even eventually go so far as to make it a > per-mapping thing, so that you could have some filesystems with that > bigger sector size and some with smaller ones. I think Andrea had patches > that did a fair chunk of it, and that _almost_ worked. > > But it ABSOLUTELY SUCKS. If we did a 16kB page-cache-size, it would > absolutely blow chunks. It would be disgustingly horrible. Putting the > kernel source tree on such a filesystem would waste about 75% of all > memory (the median size of a source file is just about 4kB), so your page > cache would be effectively cut in a quarter for a lot of real loads. > > And to fix up _that_, you'd need to now do things like sub-page > allocations, and now your page-cache size isn't even fixed per filesystem, > it would be per-file, and the filesystem (and the drievrs!) would hav to > handle the cases of getting those 4kB partial pages (and do r-m-w IO after > all if your hardware sector size is >4kB). We might not have to go that far for a device with these special characteristics. 
It should be possible to build a block size remapping Read Modify Write type device to present a 4k block size to the OS while operating in n*4k blocks for the device. We could implement the read operations as readahead in the page cache, so if we're lucky we mostly end up operating on full n*4k blocks anyway. For the cases where we've lost pieces of the n*4k native block and we have to do a write, we'd just suck it up and do a read modify write on a separate memory area, a bit like the new 4k sector devices do when emulating 512 byte blocks. The suck factor of this double I/O plus memory copy overhead should be mitigated partially by the fact that the underlying device is very fast. James ^ permalink raw reply [flat|nested] 45+ messages in thread
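A simplified sketch of the remapping arithmetic James describes, assuming an 8:1 ratio (4k logical blocks on a 32k-native device). The struct and the read_native()/write_native() callbacks are placeholders invented for the illustration, not a real kernel interface.

#include <stdint.h>
#include <string.h>

#define LBLK       4096u               /* block size presented to the OS */
#define NATIVE_BLK 32768u              /* assumed native block size of the device */
#define RATIO      (NATIVE_BLK / LBLK)

struct rmw_dev {
        /* placeholders for the real device access methods */
        int (*read_native)(struct rmw_dev *d, uint64_t nblk, void *buf);
        int (*write_native)(struct rmw_dev *d, uint64_t nblk, const void *buf);
        uint8_t bounce[NATIVE_BLK];    /* the "separate memory area" */
};

/* Write one 4k logical block by read-modify-writing its native block. */
static int rmw_write_lblk(struct rmw_dev *d, uint64_t lblk, const void *data)
{
        uint64_t nblk = lblk / RATIO;             /* containing native block */
        unsigned int off = (lblk % RATIO) * LBLK; /* offset of the 4k piece  */
        int err;

        err = d->read_native(d, nblk, d->bounce);    /* read   */
        if (err)
                return err;
        memcpy(d->bounce + off, data, LBLK);         /* modify */
        return d->write_native(d, nblk, d->bounce);  /* write  */
}

When the surrounding logical blocks are still clean in the page cache (the readahead case above), the read step could be skipped and the native block assembled from cached data instead.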
[parent not found: <6934efce0904141052j3d4f87cey9fc4b802303aa73b@mail.gmail.com>]
* Re: Implementing NVMHCI... [not found] ` <6934efce0904141052j3d4f87cey9fc4b802303aa73b@mail.gmail.com> @ 2009-04-15 6:37 ` Artem Bityutskiy 2009-04-30 22:51 ` Jörn Engel 0 siblings, 1 reply; 45+ messages in thread From: Artem Bityutskiy @ 2009-04-15 6:37 UTC (permalink / raw) To: Jared Hulbert Cc: Linus Torvalds, Szabolcs Szakacsits, Alan Cox, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven, David Woodhouse, Jörn Engel On Tue, 2009-04-14 at 10:52 -0700, Jared Hulbert wrote: > It really isn't worth it. It's much better for everybody to > just be aware > of the incredible level of pure suckage of a general-purpose > disk that has > hardware sectors >4kB. Just educate people that it's not good. > Avoid the > whole insane suckage early, rather than be disappointed in > hardware that > is total and utter CRAP and just causes untold problems. > > I don't disagree that >4KB DISKS are a bad idea. But I don't think > that's what's going on here. As I read it, NVMHCI would plug into the > MTD subsystem, not the block subsystem. > > > NVMHCI, as far as I understand the spec, is not trying to be a > general-purpose disk, it's for exposing more or less the raw NAND. As > far as I can tell it's a DMA engine spec for large arrays of NAND. > BTW, anybody actually seen a NVMHCI device or plan on making one? I briefly glanced at the doc, and it does not look like this is an interface to expose raw NAND. E.g., I could not find an "erase" operation. I could not find information about bad eraseblocks. It looks like it is not about raw NANDs. Maybe about "managed" NANDs. Also, the following sentences from the "Outside of Scope" sub-section suggest I'm right: "NVMHCI is also specified above any non-volatile memory management, like wear leveling. Erases and other management tasks for NVM technologies like NAND are abstracted.". So it says NVMHCI is _above_ wear levelling, which means FTL would be _inside_ the NVMHCI device, which is not about raw NAND. But I may be wrong; I spent less than 10 minutes looking at the doc, sorry. -- Best regards, Artem Bityutskiy (Битюцкий Артём) ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-15 6:37 ` Artem Bityutskiy @ 2009-04-30 22:51 ` Jörn Engel 2009-04-30 23:36 ` Jeff Garzik 0 siblings, 1 reply; 45+ messages in thread From: Jörn Engel @ 2009-04-30 22:51 UTC (permalink / raw) To: Artem Bityutskiy Cc: Jared Hulbert, Linus Torvalds, Szabolcs Szakacsits, Alan Cox, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven, David Woodhouse On Wed, 15 April 2009 09:37:50 +0300, Artem Bityutskiy wrote: > > I briefly glanced at the doc, and it does not look like this is an > interface to expose raw NAND. E.g., I could not find an "erase" operation. > I could not find information about bad eraseblocks. > > It looks like it is not about raw NANDs. Maybe about "managed" NANDs. I'm not sure whether your distinction is exactly valid anymore. "raw NAND" used to mean two things. 1) A single chip of silicon without additional hardware. 2) NAND without FTL. Traditionally the FTL was implemented either in software or in a controller chip. So you could not get "cooked" flash as in FTL without "cooked" flash as in extra hardware. Today you can, which makes "raw NAND" a less useful term. And I'm not sure what to think about flash chips with the (likely crappy) FTL inside either. Not having to deal with bad blocks anymore is bliss. Not having to deal with wear leveling anymore is a lie. Not knowing whether errors occurred and whether uncorrected data was left on the device or replaced with corrected data is a pain. But like it or not, the market seems to be moving in that direction. Which means we will have "block devices" that have all the interfaces of disks and behave much like flash - modulo the crap FTL. Jörn -- Courage is not the absence of fear, but rather the judgement that something else is more important than fear. -- Ambrose Redmoon ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-30 22:51 ` Jörn Engel @ 2009-04-30 23:36 ` Jeff Garzik 0 siblings, 0 replies; 45+ messages in thread From: Jeff Garzik @ 2009-04-30 23:36 UTC (permalink / raw) To: Jörn Engel Cc: Artem Bityutskiy, Jared Hulbert, Linus Torvalds, Szabolcs Szakacsits, Alan Cox, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven, David Woodhouse Jörn Engel wrote: > But like it or not, the market seems to be moving in that direction. > Which means we will have "block devices" that have all the interfaces of > disks and behave much like flash - modulo the crap FTL. One driving goal behind NVMHCI was to avoid disk-originated interfaces, because they are not as well suited to flash storage. The NVMHCI command set (distinguished from NVMHCI, the silicon) is specifically targeted towards flash. Jeff ^ permalink raw reply [flat|nested] 45+ messages in thread
* Implementing NVMHCI...
@ 2009-04-11 17:33 Jeff Garzik
2009-04-11 19:32 ` Alan Cox
0 siblings, 1 reply; 45+ messages in thread
From: Jeff Garzik @ 2009-04-11 17:33 UTC (permalink / raw)
To: Linux IDE mailing list; +Cc: LKML, Jens Axboe, Arjan van de Ven, Linus Torvalds
Has anybody looked into working on NVMHCI support? It is a new
controller + new command set for direct interaction with non-volatile
memory devices:
http://download.intel.com/standards/nvmhci/spec.pdf
Although NVMHCI is nice from a hardware design perspective, it is a bit
problematic for Linux because
* NVMHCI might be implemented as part of an AHCI controller's
register set, much like how Marvell's AHCI clones implement
a PATA port: with wholly different per-port registers
and DMA data structures, buried inside the standard AHCI
per-port interrupt dispatch mechanism.
Or, NVMHCI might be implemented as its own PCI device,
wholly independent from the AHCI PCI device.
The per-port registers and DMA data structures remain the same,
whether it is embedded within AHCI or not.
* NVMHCI introduces a brand new command set, completely
incompatible with ATA or SCSI. Presumably it is tuned
specifically for non-volatile memory.
* The sector size can vary wildly from device to device. There
is no 512-byte legacy to deal with, for a brand new
command set. We should handle this OK, but...... who knows
until you try.
The spec describes the sector size as
"512, 1k, 2k, 4k, 8k, etc." It will be interesting to reach
"etc" territory.
Here is my initial idea:
- Move 95% of ahci.c into libahci.c.
This will make implementation of AHCI-and-more devices like
NVMHCI (AHCI 1.3) and Marvell much easier, while avoiding
the cost of NVMHCI or Marvell support, for those users without
such hardware.
- ahci.c becomes a tiny stub with a pci_device_id match table,
calling functions in libahci.c (a rough sketch of such a stub appears just below this list).
- I can move my libata-dev.git#mv-ahci-pata work, recently refreshed,
into mv-ahci.c.
- nvmhci.c implements the NVMHCI controller standard. Maybe referenced
from ahci.c, or used standalone.
- nvmhci-blk.c implements a block device for NVMHCI-attached devices,
using the new NVMHCI command set.
With a brand new command set, might as well avoid SCSI completely IMO,
and create a brand new block device.
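To make the stub idea above concrete, here is a rough sketch of what ahci.c could shrink to once the bulk of the code lives in a library. The libahci.h header and libahci_host_activate() are hypothetical names standing in for whatever interface the split would actually define; the ID table is abbreviated to a single example entry.

/* ahci.c after the split: PCI glue only, everything else lives in libahci */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/pci.h>
#include "libahci.h"                    /* hypothetical library header */

static const struct pci_device_id ahci_pci_tbl[] = {
        { PCI_VDEVICE(INTEL, 0x2922) }, /* ICH9 AHCI, one example entry */
        /* ... the rest of the existing ID table ... */
        { }
};
MODULE_DEVICE_TABLE(pci, ahci_pci_tbl);

static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
{
        void __iomem *mmio;
        int rc;

        rc = pcim_enable_device(pdev);
        if (rc)
                return rc;

        mmio = pcim_iomap(pdev, 5, 0);  /* AHCI registers live in BAR 5 */
        if (!mmio)
                return -ENOMEM;

        /* hypothetical libahci entry point that does all the real work;
         * a real driver would also need a .remove hook undoing this */
        return libahci_host_activate(&pdev->dev, mmio, pdev->irq);
}

static struct pci_driver ahci_pci_driver = {
        .name     = "ahci",
        .id_table = ahci_pci_tbl,
        .probe    = ahci_init_one,
};

static int __init ahci_init(void)
{
        return pci_register_driver(&ahci_pci_driver);
}

static void __exit ahci_exit(void)
{
        pci_unregister_driver(&ahci_pci_driver);
}

module_init(ahci_init);
module_exit(ahci_exit);
MODULE_LICENSE("GPL");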
Open questions are...
1) When will we see hardware? This is a feature newly introduced in
AHCI 1.3. AHCI 1.3 spec is public, but I have not seen any machines
yet. http://download.intel.com/technology/serialata/pdf/rev1_3.pdf
My ICH10 box uses AHCI 1.2.
dmesg | grep '^ahci'
> ahci 0000:00:1f.2: AHCI 0001.0200 32 slots 6 ports 3 Gbps 0x3f impl SATA mode
> ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ems
2) Has anyone else started working on this? All relevant specs are
public on intel.com.
3) Are there major objections to doing this as a native block device (as
opposed to faking SCSI, for example...) ?
Thanks,
Jeff (engaging in some light Saturday reading...)
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-11 17:33 Jeff Garzik @ 2009-04-11 19:32 ` Alan Cox 2009-04-11 19:52 ` Linus Torvalds 2009-04-11 19:54 ` Jeff Garzik 0 siblings, 2 replies; 45+ messages in thread From: Alan Cox @ 2009-04-11 19:32 UTC (permalink / raw) To: Jeff Garzik Cc: Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven, Linus Torvalds > The spec describes the sector size as > "512, 1k, 2k, 4k, 8k, etc." It will be interesting to reach > "etc" territory. Over 4K will be fun. > - ahci.c becomes a tiny stub with a pci_device_id match table, > calling functions in libahci.c. It needs to be a little bit bigger because of the folks wanting to do non-PCI AHCI, so you need a little bit of PCI wrapping etc > With a brand new command set, might as well avoid SCSI completely IMO, > and create a brand new block device. Providing we allow for the (inevitable ;)) joys of NVMHCI over SAS etc 8) Alan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-11 19:32 ` Alan Cox @ 2009-04-11 19:52 ` Linus Torvalds 2009-04-11 20:21 ` Jeff Garzik 2009-04-11 21:49 ` Grant Grundler 2009-04-11 19:54 ` Jeff Garzik 1 sibling, 2 replies; 45+ messages in thread From: Linus Torvalds @ 2009-04-11 19:52 UTC (permalink / raw) To: Alan Cox Cc: Jeff Garzik, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven On Sat, 11 Apr 2009, Alan Cox wrote: > > > The spec describes the sector size as > > "512, 1k, 2k, 4k, 8k, etc." It will be interesting to reach > > "etc" territory. > > Over 4K will be fun. And by "fun", you mean "irrelevant". If anybody does that, they'll simply not work. And it's not worth it even trying to handle it. That said, I'm pretty certain Windows has the same 4k issue, so we can hope nobody will ever do that kind of idiotically broken hardware. Of course, hardware people often do incredibly stupid things, so no guarantees. Linus ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-11 19:52 ` Linus Torvalds @ 2009-04-11 20:21 ` Jeff Garzik 0 siblings, 0 replies; 45+ messages in thread From: Jeff Garzik @ 2009-04-11 20:21 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Linus Torvalds wrote: > > On Sat, 11 Apr 2009, Alan Cox wrote: >>> The spec describes the sector size as >>> "512, 1k, 2k, 4k, 8k, etc." It will be interesting to reach >>> "etc" territory. >> Over 4K will be fun. > > And by "fun", you mean "irrelevant". > > If anybody does that, they'll simply not work. And it's not worth it even > trying to handle it. FSVO trying to handle... At the driver level, it would be easy to clamp sector size to 4k, and point the scatterlist to a zero-filled region for the >4k portion of each sector. Inefficient, sure, but it is low-cost to the driver and gives the user something other than a brick.
        if (too_large_sector_size)
                nvmhci_fill_sg_clamped_interleave()
        else
                nvmhci_fill_sg()
Regards, Jeff ^ permalink raw reply [flat|nested] 45+ messages in thread
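A rough model of the interleaving Jeff sketches, in plain C rather than the kernel's scatterlist API (the struct and function names are invented for the illustration): each oversized hardware sector contributes one 4k segment backed by the real buffer plus one throwaway segment backed by a shared scratch area, so the device still transfers whole sectors while the OS only ever sees the first 4k of each.

#include <stddef.h>
#include <stdint.h>

#define OS_BLOCK 4096u

struct sg_seg {
        void     *buf;
        uint32_t  len;
};

/*
 * Build a segment list for 'nsectors' hardware sectors of size 'hw_sector'
 * (> OS_BLOCK).  The first 4k of each sector comes from/goes to the real
 * data buffer; the remainder of each sector is pointed at a shared scratch
 * area (zero-filled for writes, simply ignored on reads).
 * Returns the number of segments filled in, i.e. 2 * nsectors.
 */
static size_t fill_sg_clamped_interleave(struct sg_seg *sg, uint8_t *data,
                                         void *scratch, uint32_t hw_sector,
                                         size_t nsectors)
{
        size_t i, n = 0;

        for (i = 0; i < nsectors; i++) {
                sg[n].buf = data + i * OS_BLOCK;  /* the part the OS uses */
                sg[n].len = OS_BLOCK;
                n++;
                sg[n].buf = scratch;              /* the discarded tail   */
                sg[n].len = hw_sector - OS_BLOCK;
                n++;
        }
        return n;
}

The cost is the one discussed below: only the first 4k of every hardware sector is ever used, so effective capacity drops by the ratio of the two sizes.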
* Re: Implementing NVMHCI... 2009-04-11 19:52 ` Linus Torvalds 2009-04-11 20:21 ` Jeff Garzik @ 2009-04-11 21:49 ` Grant Grundler 2009-04-11 22:33 ` Linus Torvalds 2009-04-11 23:25 ` Alan Cox 1 sibling, 2 replies; 45+ messages in thread From: Grant Grundler @ 2009-04-11 21:49 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Jeff Garzik, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven On Sat, Apr 11, 2009 at 12:52 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Sat, 11 Apr 2009, Alan Cox wrote: >> >> > The spec describes the sector size as >> > "512, 1k, 2k, 4k, 8k, etc." It will be interesting to reach >> > "etc" territory. >> >> Over 4K will be fun. > > And by "fun", you mean "irrelevant". > > If anybody does that, they'll simply not work. And it's not worth it even > trying to handle it. Why does it matter what the sector size is? I'm failing to see what the fuss is about. We've abstracted the DMA mapping/SG list handling enough that the block size should make no more difference than it does for the MTU size of a network. And the Linux VM does handle bigger than 4k pages (several architectures have implemented it) - even if x86 only supports 4k as base page size. Block size just defines the granularity of the device's address space in the same way the VM base page size defines the virtual address space. > That said, I'm pretty certain Windows has the same 4k issue, so we can > hope nobody will ever do that kind of idiotically broken hardware. Of > course, hardware people often do incredibly stupid things, so no > guarantees. That's just flame-bait. Not touching that. thanks, grant > > Linus ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-11 21:49 ` Grant Grundler @ 2009-04-11 22:33 ` Linus Torvalds 2009-04-12 5:08 ` Leslie Rhorer 0 siblings, 1 reply; 45+ messages in thread From: Linus Torvalds @ 2009-04-11 22:33 UTC (permalink / raw) To: Grant Grundler Cc: Alan Cox, Jeff Garzik, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven On Sat, 11 Apr 2009, Grant Grundler wrote: > > Why does it matter what the sector size is? > I'm failing to see what the fuss is about. > > We've abstracted the DMA mapping/SG list handling enough that the > block size should make no more difference than it does for the > MTU size of a network. The VM is not ready or willing to do more than 4kB pages for any normal caching scheme. > And the Linux VM does handle bigger than 4k pages (several architectures > have implemented it) - even if x86 only supports 4k as base page size. 4k is not just the "supported" base page size, it's the only sane one. Bigger pages waste memory like mad on any normal load due to fragmentation. Only basically single-purpose servers are worth doing bigger pages for. > Block size just defines the granularity of the device's address space in > the same way the VM base page size defines the virtual address space. .. and the point is, if you have granularity that is bigger than 4kB, you lose binary compatibility on x86, for example. The 4kB thing is encoded in mmap() semantics. In other words, if you have sector size >4kB, your hardware is CRAP. It's unusable sh*t. No ifs, buts or maybe's about it. Sure, we can work around it. We can work around it by doing things like read-modify-write cycles with bounce buffers (and where DMA remapping can be used to avoid the copy). Or we can work around it by saying that if you mmap files on such a filesystem, your mmap's will have to have 8kB alignment semantics, and the hardware is only useful for servers. Or we can just tell people what a total piece of shit the hardware is. So if you're involved with any such hardware or know people who are, you might give people strong hints that sector sizes >4kB will not be taken seriously by a huge number of people. Maybe it's not too late to head the crap off at the pass. Btw, this is not a new issue. Sandisk and some other totally clueless SSD manufacturers tried to convince people that 64kB access sizes were the RightThing(tm) to do. The reason? Their SSD's were crap, and couldn't do anything better, so they tried to blame software. Then Intel came out with their controller, and now the same people who tried to sell their sh*t-for-brain SSD's are finally admitting that it was crap hardware. Do you really want to go through that one more time? Linus ^ permalink raw reply [flat|nested] 45+ messages in thread
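A small userspace illustration of the 4kB granularity encoded in mmap() semantics (the file name is made up; this sketch is not from the thread): file offsets passed to mmap() only have to be page-aligned, and existing x86 binaries assume that page size is 4096, so a cache block larger than 4kB could not be managed independently behind such a mapping.

#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);      /* 4096 on x86 */
        int fd = open("some_file", O_RDONLY);

        if (fd < 0)
                return 1;

        /* Legal today: the offset is one 4kB page into the file.  With a
         * 16kB cache block that offset is no longer block-aligned, yet the
         * ABI promises it works - which is the compatibility problem. */
        void *p = mmap(NULL, page, PROT_READ, MAP_PRIVATE, fd, page);
        if (p == MAP_FAILED)
                return 1;

        printf("page size %ld, mapped at %p\n", page, p);
        munmap(p, page);
        close(fd);
        return 0;
}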
* RE: Implementing NVMHCI... 2009-04-11 22:33 ` Linus Torvalds @ 2009-04-12 5:08 ` Leslie Rhorer 0 siblings, 0 replies; 45+ messages in thread From: Leslie Rhorer @ 2009-04-12 5:08 UTC (permalink / raw) To: linux-ide > So if you're involved with any such hardware or know people who are, you > might give people strong hints that sector sizes >4kB will not be taken > seriously by a huge number of people. Maybe it's not too late to head the > crap off at the pass. > > Btw, this is not a new issue. Sandisk and some other totally clueless SSD > manufacturers tried to convince people that 64kB access sizes were the > RightThing(tm) to do. The reason? Their SSD's were crap, and couldn't do > anything better, so they tried to blame software. > > Then Intel came out with their controller, and now the same people who > tried to sell their sh*t-for-brain SSD's are finally admittign that > it was crap hardware. > > Do you really want to go through that one more time? So with drive sizes soon to be at 10TB, and arrays soon to exceed 250 and perhaps even 1000TB, what's the long term solution? I'm far from being an expert, and you most certainly are an expert, so do you really feel a 4KB sector is the way to go for the indeterminate future? My mind reels at the thought of dealing with perhaps more than a trillion sectors on a drive system. Clearly it's possible a 64 bit file system may be a simpler proposition than a 64K sector, but still... ^ permalink raw reply [flat|nested] 45+ messages in thread
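Rough back-of-the-envelope arithmetic on the sector-count worry (not a reply from the thread), keeping 4kB sectors:

    10 TB   / 4 kB  =  10^13 / 4096   ~  2.4 * 10^9   sectors  (~2^31)
    1000 TB / 4 kB  =  10^15 / 4096   ~  2.4 * 10^11  sectors  (~2^38)
    a 48-bit LBA at 4 kB already addresses 2^48 * 4 kB = 1 EiB (~10^6 TB),
    and a 64-bit LBA space is another 65536 times larger.

So the pressure toward larger sectors arguably comes from device internals (erase blocks, ECC granularity), not from running out of addressable sectors.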
* Re: Implementing NVMHCI... 2009-04-11 21:49 ` Grant Grundler 2009-04-11 22:33 ` Linus Torvalds @ 2009-04-11 23:25 ` Alan Cox 2009-04-11 23:51 ` Jeff Garzik ` (2 more replies) 1 sibling, 3 replies; 45+ messages in thread From: Alan Cox @ 2009-04-11 23:25 UTC (permalink / raw) To: Grant Grundler Cc: Linus Torvalds, Jeff Garzik, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven > We've abstract the DMA mapping/SG list handling enough that the > block size should make no more difference than it does for the > MTU size of a network. You need to start managing groups of pages in the vm and keeping them together and writing them out together and paging them together even if one of them is dirty and the other isn't. You have to deal with cases where a process forks and the two pages are dirtied one in each but still have to be written together. Alternatively you go for read-modify-write (nasty performance hit especially for RAID or a log structured fs). Yes you can do it but it sure won't be pretty with a conventional fs. Some of the log structured file systems have no problems with this and some kinds of journalling can help but for a typical block file system it'll suck. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-11 23:25 ` Alan Cox @ 2009-04-11 23:51 ` Jeff Garzik 2009-04-12 0:49 ` Linus Torvalds 2009-04-12 1:15 ` david 2009-04-12 14:23 ` Mark Lord 2 siblings, 1 reply; 45+ messages in thread From: Jeff Garzik @ 2009-04-11 23:51 UTC (permalink / raw) To: Alan Cox Cc: Grant Grundler, Linus Torvalds, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Alan Cox wrote: >> We've abstract the DMA mapping/SG list handling enough that the >> block size should make no more difference than it does for the >> MTU size of a network. > > You need to start managing groups of pages in the vm and keeping them > together and writing them out together and paging them together even if > one of them is dirty and the other isn't. You have to deal with cases > where a process forks and the two pages are dirtied one in each but still > have to be written together. > > Alternatively you go for read-modify-write (nasty performance hit > especially for RAID or a log structured fs). Or just ignore the extra length, thereby excising the 'read-modify' step... Total storage is halved or worse, but you don't take as much of a performance hit. Jeff ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-11 23:51 ` Jeff Garzik @ 2009-04-12 0:49 ` Linus Torvalds 2009-04-12 1:59 ` Jeff Garzik 0 siblings, 1 reply; 45+ messages in thread From: Linus Torvalds @ 2009-04-12 0:49 UTC (permalink / raw) To: Jeff Garzik Cc: Alan Cox, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven On Sat, 11 Apr 2009, Jeff Garzik wrote: > > Or just ignore the extra length, thereby excising the 'read-modify' step... > Total storage is halved or worse, but you don't take as much of a performance > hit. Well, the people who want > 4kB sectors usually want _much_ bigger (ie 32kB sectors), and if you end up doing the "just use the first part" thing, you're wasting 7/8ths of the space. Yes, it's doable, and yes, it obviously makes for a simple driver thing, but no, I don't think people will consider it acceptable to lose that much of their effective size of the disk. I suspect people would scream even with a 8kB sector. Treating all writes as read-modify-write cycles on a driver level (and then opportunistically avoiding the read part when you are lucky and see bigger contiguous writes) is likely more acceptable. But it _will_ suck dick from a performance angle, because no regular filesystem will care enough, so even with nicely behaved big writes, the two end-points will have a fairly high chance of requiring a rmw cycle. Even the journaling ones that might have nice logging write behavior tend to have a non-logging part that then will behave badly. Rather few filesystems are _purely_ log-based, and the ones that are tend to have various limitations. Most commonly read performance just sucks. We just merged nilfs2, and I _think_ that one is a pure logging filesystem with just linear writes (within a segment). But I think random read performance (think: loading executables off the disk) is bad. And people tend to really dislike hardware that forces a particular filesystem on them. Guess how big the user base is going to be if you cannot format the device as NTFS, for example? Hint: if a piece of hardware only works well with special filesystems, that piece of hardware won't be a big seller. Modern technology needs big volume to become cheap and relevant. And maybe I'm wrong, and NTFS works fine as-is with sectors >4kB. But let me doubt that. Linus ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-12 0:49 ` Linus Torvalds @ 2009-04-12 1:59 ` Jeff Garzik 0 siblings, 0 replies; 45+ messages in thread From: Jeff Garzik @ 2009-04-12 1:59 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Grant Grundler, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Linus Torvalds wrote: > And maybe I'm wrong, and NTFS works fine as-is with sectors >4kB. But let > me doubt that. FWIW... No clue about sector size, but NTFS cluster size (i.e. block size) goes up to 64k. Compression is disabled after 4k. Jeff ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-11 23:25 ` Alan Cox 2009-04-11 23:51 ` Jeff Garzik @ 2009-04-12 1:15 ` david 2009-04-12 3:13 ` Linus Torvalds 2009-04-12 14:23 ` Mark Lord 2 siblings, 1 reply; 45+ messages in thread From: david @ 2009-04-12 1:15 UTC (permalink / raw) To: Alan Cox Cc: Grant Grundler, Linus Torvalds, Jeff Garzik, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven On Sun, 12 Apr 2009, Alan Cox wrote: >> We've abstract the DMA mapping/SG list handling enough that the >> block size should make no more difference than it does for the >> MTU size of a network. > > You need to start managing groups of pages in the vm and keeping them > together and writing them out together and paging them together even if > one of them is dirty and the other isn't. You have to deal with cases > where a process forks and the two pages are dirtied one in each but still > have to be written together. gaining this sort of ability would not be a bad thing. with current hardware (SSDs and raid arrays) you can very easily be in a situation where it's much cheaper to deal with a group of related pages as one group rather than processing them individually. this is just an extention of the same issue. David Lang > Alternatively you go for read-modify-write (nasty performance hit > especially for RAID or a log structured fs). > > Yes you can do it but it sure won't be pretty with a conventional fs. > Some of the log structured file systems have no problems with this and > some kinds of journalling can help but for a typical block file system > it'll suck. > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-12 1:15 ` david @ 2009-04-12 3:13 ` Linus Torvalds 0 siblings, 0 replies; 45+ messages in thread From: Linus Torvalds @ 2009-04-12 3:13 UTC (permalink / raw) To: david Cc: Alan Cox, Grant Grundler, Jeff Garzik, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven On Sat, 11 Apr 2009, david@lang.hm wrote: > > gaining this sort of ability would not be a bad thing. .. and if my house was built of gold, that wouldn't be a bad thing either. What's your point? Are you going to create the magical patches that make that happen? Are you going to maintain the added complexity that comes from suddenly having multiple dirty bits per "page"? Are you going to create the mythical filesystems that magically start doing tail packing in order to not waste tons of disk-space with small files, even if they have a 32kB block-size? In other words, your whole argument is built in "wouldn't it be nice". And I'm just the grumpy old guy who tells you that there's this small thing called REALITY that comes and bites you in the *ss. And I'm sorry, but the very nature of "reality" is that it doesn't care one whit whether you believe me or not. The fact is, >4kB sectors just aren't realistic right now, and I don't think you have any _clue_ about the pain of trying to make them so. You're just throwing pennies down a wishing well. Linus ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-11 23:25 ` Alan Cox 2009-04-11 23:51 ` Jeff Garzik 2009-04-12 1:15 ` david @ 2009-04-12 14:23 ` Mark Lord 2009-04-12 17:29 ` Jeff Garzik 2 siblings, 1 reply; 45+ messages in thread From: Mark Lord @ 2009-04-12 14:23 UTC (permalink / raw) To: Alan Cox Cc: Grant Grundler, Linus Torvalds, Jeff Garzik, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Alan Cox wrote: .. > Alternatively you go for read-modify-write (nasty performance hit > especially for RAID or a log structured fs). .. Initially, at least, I'd guess that this NVM-HCI thing is all about built-in flash memory on motherboards, to hold the "instant-boot" software that hardware companies (eg. ASUS) are rapidly growing fond of. At present, that means a mostly read-only Linux installation, though MS for sure are hoping for Moore's Law to kick in and provide sufficient space for a copy of Vista there or something. The point being, its probable *initial* intended use is for a run-time read-only filesystem, so having to do dirty R-M-W sequences for writes might not be a significant issue. At present. And even if it were, it might not be much worse than having the hardware itself do it internally, which is what would have to happen if it always only ever showed 4KB to us. Longer term, as flash densities increase, we're going to end up with motherboards that have huge SSDs built-in, through an interface like this one, or over a virtual SATA link or something. I wonder how long until "desktop/notebook" computers no longer have replaceable "hard disks" at all? Cheers ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-12 14:23 ` Mark Lord @ 2009-04-12 17:29 ` Jeff Garzik 0 siblings, 0 replies; 45+ messages in thread From: Jeff Garzik @ 2009-04-12 17:29 UTC (permalink / raw) To: Mark Lord Cc: Alan Cox, Grant Grundler, Linus Torvalds, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven Mark Lord wrote: > Alan Cox wrote: > .. >> Alternatively you go for read-modify-write (nasty performance hit >> especially for RAID or a log structured fs). > .. > > Initially, at least, I'd guess that this NVM-HCI thing is all about > built-in flash memory on motherboards, to hold the "instant-boot" > software that hardware companies (eg. ASUS) are rapidly growing fond of. > > At present, that means a mostly read-only Linux installation, > though MS for sure are hoping for Moore's Law to kick in and > provide sufficient space for a copy of Vista there or something. Yeah... instant boot, and "trusted boot" (booting a signed image), storage of useful details like boot drive layouts, etc. I'm sure we can come up with other fun uses, too... Jeff ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-11 19:32 ` Alan Cox 2009-04-11 19:52 ` Linus Torvalds @ 2009-04-11 19:54 ` Jeff Garzik 2009-04-11 21:08 ` John Stoffel 1 sibling, 1 reply; 45+ messages in thread From: Jeff Garzik @ 2009-04-11 19:54 UTC (permalink / raw) To: Alan Cox Cc: Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven, Linus Torvalds Alan Cox wrote: >> The spec describes the sector size as >> "512, 1k, 2k, 4k, 8k, etc." It will be interesting to reach >> "etc" territory. > > Over 4K will be fun. > >> - ahci.c becomes a tiny stub with a pci_device_id match table, >> calling functions in libahci.c. > > It needs to a be a little bit bigger because of the folks wanting to do > non PCI AHCI, so you need a little bit of PCI wrapping etc True... >> With a brand new command set, might as well avoid SCSI completely IMO, >> and create a brand new block device. > > Providing we allow for the (inevitable ;)) joys of NVHCI over SAS etc 8) Perhaps... from what I can tell, this is a direct, asynchronous NVM interface. It appears to lack any concept of bus or bus enumeration. No worries about link up/down, storage device hotplug, etc. (you still have PCI hotplug case, of course) Jeff ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-11 19:54 ` Jeff Garzik @ 2009-04-11 21:08 ` John Stoffel 2009-04-11 21:31 ` John Stoffel 0 siblings, 1 reply; 45+ messages in thread From: John Stoffel @ 2009-04-11 21:08 UTC (permalink / raw) To: Jeff Garzik Cc: Alan Cox, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven, Linus Torvalds >>>>> "Jeff" == Jeff Garzik <jeff@garzik.org> writes: Jeff> Alan Cox wrote: >>> With a brand new command set, might as well avoid SCSI completely >>> IMO, and create a brand new block device. >> >> Providing we allow for the (inevitable ;)) joys of NVMHCI over SAS etc 8) Jeff> Perhaps... from what I can tell, this is a direct, asynchronous Jeff> NVM interface. It appears to lack any concept of bus or bus Jeff> enumeration. No worries about link up/down, storage device Jeff> hotplug, etc. (you still have PCI hotplug case, of course) Didn't we just spend years merging the old IDE PATA block devices into the libata/scsi block device setup to get a more unified userspace and to share common code? I'm a total ignoramus here, but it would seem that it would be nice to keep the /dev/sd# stuff around for this, esp since it is supported through/with/around AHCI and libata stuff. Honestly, I don't care as long as userspace isn't too affected and I can just format it using ext3. :] Which I realize would be silly since it's probably nothing like regular disk access, but more like the NVRAM used on Netapps for caching writes to disk so they can be acknowledged quicker to the clients. Or like the old PrestoServe NVRAM modules on DECsystems and Alphas. John ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Implementing NVMHCI... 2009-04-11 21:08 ` John Stoffel @ 2009-04-11 21:31 ` John Stoffel 0 siblings, 0 replies; 45+ messages in thread From: John Stoffel @ 2009-04-11 21:31 UTC (permalink / raw) To: John Stoffel Cc: Jeff Garzik, Alan Cox, Linux IDE mailing list, LKML, Jens Axboe, Arjan van de Ven, Linus Torvalds >>>>> "John" == John Stoffel <john@stoffel.org> writes: >>>>> "Jeff" == Jeff Garzik <jeff@garzik.org> writes: Jeff> Alan Cox wrote: >>>> With a brand new command set, might as well avoid SCSI completely >>>> IMO, and create a brand new block device. >>> >>> Providing we allow for the (inevitable ;)) joys of NVMHCI over SAS etc 8) Jeff> Perhaps... from what I can tell, this is a direct, asynchronous Jeff> NVM interface. It appears to lack any concept of bus or bus Jeff> enumeration. No worries about link up/down, storage device Jeff> hotplug, etc. (you still have PCI hotplug case, of course) John> Didn't we just spend years merging the old IDE PATA block devices into John> the libata/scsi block device setup to get a more unified userspace and John> to share common code? John> I'm a total ignoramus here, but it would seem that it would be nice John> to keep the /dev/sd# stuff around for this, esp since it is supported John> through/with/around AHCI and libata stuff. John> Honestly, I don't care as long as userspace isn't too affected and I John> can just format it using ext3. :] Which I realize would be silly John> since it's probably nothing like regular disk access, but more like John> the NVRAM used on Netapps for caching writes to disk so they can be John> acknowledged quicker to the clients. Or like the old PrestoServe John> NVRAM modules on DECsystems and Alphas. And actually spending some thought on this, I'm thinking that this will be like the MTD block device and such... separate specialized block devices, but still usable. So maybe I'll just shut up now. :] John ^ permalink raw reply [flat|nested] 45+ messages in thread
Thread overview: 45+ messages
[not found] <20090412091228.GA29937@elte.hu>
2009-04-12 15:14 ` Implementing NVMHCI Szabolcs Szakacsits
2009-04-12 15:20 ` Alan Cox
2009-04-12 16:15 ` Avi Kivity
2009-04-12 17:11 ` Linus Torvalds
2009-04-13 6:32 ` Avi Kivity
2009-04-13 15:10 ` Linus Torvalds
2009-04-13 15:38 ` James Bottomley
2009-04-14 7:22 ` Andi Kleen
2009-04-14 10:07 ` Avi Kivity
2009-04-14 9:59 ` Avi Kivity
2009-04-14 10:23 ` Jeff Garzik
2009-04-14 10:37 ` Avi Kivity
2009-04-14 11:45 ` Jeff Garzik
2009-04-14 11:58 ` Szabolcs Szakacsits
2009-04-17 22:45 ` H. Peter Anvin
2009-04-14 12:08 ` Avi Kivity
2009-04-14 12:21 ` Jeff Garzik
2009-04-25 8:26 ` Pavel Machek
2009-04-12 15:41 ` Linus Torvalds
2009-04-12 17:02 ` Robert Hancock
2009-04-12 17:20 ` Linus Torvalds
2009-04-12 18:35 ` Robert Hancock
2009-04-13 11:18 ` Avi Kivity
2009-04-12 17:23 ` James Bottomley
[not found] ` <6934efce0904141052j3d4f87cey9fc4b802303aa73b@mail.gmail.com>
2009-04-15 6:37 ` Artem Bityutskiy
2009-04-30 22:51 ` Jörn Engel
2009-04-30 23:36 ` Jeff Garzik
2009-04-11 17:33 Jeff Garzik
2009-04-11 19:32 ` Alan Cox
2009-04-11 19:52 ` Linus Torvalds
2009-04-11 20:21 ` Jeff Garzik
2009-04-11 21:49 ` Grant Grundler
2009-04-11 22:33 ` Linus Torvalds
2009-04-12 5:08 ` Leslie Rhorer
2009-04-11 23:25 ` Alan Cox
2009-04-11 23:51 ` Jeff Garzik
2009-04-12 0:49 ` Linus Torvalds
2009-04-12 1:59 ` Jeff Garzik
2009-04-12 1:15 ` david
2009-04-12 3:13 ` Linus Torvalds
2009-04-12 14:23 ` Mark Lord
2009-04-12 17:29 ` Jeff Garzik
2009-04-11 19:54 ` Jeff Garzik
2009-04-11 21:08 ` John Stoffel
2009-04-11 21:31 ` John Stoffel