linux-ide.vger.kernel.org archive mirror
From: Avi Kivity <avi@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>,
	Szabolcs Szakacsits <szaka@ntfs-3g.com>,
	Grant Grundler <grundler@google.com>,
	Linux IDE mailing list <linux-ide@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Jens Axboe <jens.axboe@oracle.com>,
	Arjan van de Ven <arjan@infradead.org>
Subject: Re: Implementing NVMHCI...
Date: Tue, 14 Apr 2009 12:59:56 +0300
Message-ID: <49E45E9C.1020105@redhat.com>
In-Reply-To: <alpine.LFD.2.00.0904130747440.4583@localhost.localdomain>

Linus Torvalds wrote:
> On Mon, 13 Apr 2009, Avi Kivity wrote:
>   
>>>  - create a big file,
>>>       
>> Just creating a 5GB file in a 64KB filesystem was interesting - Windows 
>> was throwing out 256KB I/Os even though I was generating 1MB writes (and 
>> cached too).  Looks like a paranoid IDE driver (qemu exposes a PIIX4).
>>     
>
> Heh, ok. So the "big file" really only needed to be big enough to not be 
> cached, and 5GB was probably overkill. In fact, if there's some way to 
> blow the cache, you could have made it much smaller. But 5G certainly 
> works ;)
>   

I wanted to make sure my random writes later don't get coalesced.  A 1GB 
file, half of which is cached (I used a 1GB guest), offers lots of 
chances for coalescing if Windows delays the writes sufficiently.  At 
5GB, Windows can only cache 10% of the file, so it will be continuously 
flushing.
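
The arithmetic behind those percentages can be sketched as follows. This
is only an illustration, assuming roughly half of the 1GB guest's memory
ends up as file cache (the figure implied above); cached_fraction and
GIB are names made up for the example.

```python
GIB = 1 << 30


def cached_fraction(file_bytes: int, cache_bytes: int) -> float:
    """Upper bound on the fraction of a file that can sit in the cache."""
    return min(1.0, cache_bytes / file_bytes)


# Assumption: about half of the 1GB guest's RAM is available as file cache.
cache = GIB // 2

for size_gib in (1, 5):
    frac = cached_fraction(size_gib * GIB, cache)
    print(f"{size_gib}GB file: at most {frac:.0%} cached")
# prints:
# 1GB file: at most 50% cached
# 5GB file: at most 10% cached
```

The smaller the cached fraction, the less room the OS has to hold dirty
pages back and merge neighbouring random writes before flushing.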


>
>  (a) Windows caches things with a 4kB granularity, so the 512-byte write 
>      turned into a read-modify-write
>   
>   
[...]

> You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for 
> your example!). It's a total disaster. Imagine what would happen to user 
> application performance if kmalloc() always returned 16kB-aligned chunks 
> of memory, all sized as integer multiples of 16kB? It would absolutely 
> _suck_. Sure, it would be fine for your large allocations, but any time 
> you handle strings, you'd allocate 16kB of memory for any small 5-byte 
> string. You'd have horrible cache behavior, and you'd run out of memory 
> much too quickly.
>
> The same is true in the kernel. The single biggest memory user under 
> almost all normal loads is the disk cache. That _is_ the normal allocator 
> for any OS kernel. Everything else is almost details (ok, so Linux in 
> particular does cache metadata very aggressively, so the dcache and inode 
> cache are seldom "just details", but the page cache is still generally the 
> most important part).
>
> So having a 16kB or 64kB granularity is a _disaster_. Which is why no sane 
> system does that. It's only useful if you absolutely _only_ work with 
> large files - ie you're a database server. For just about any other 
> workload, that kind of granularity is totally unacceptable.
>
> So doing a read-modify-write on a 1-byte (or 512-byte) write, when the 
> block size is 4kB is easy - we just have to do it anyway. 
>
> Doing a read-modify-write on a 4kB write and a 16kB (or 64kB) blocksize is 
> also _doable_, and from the IO pattern standpoint it is no different. But 
> from a memory allocation pattern standpoint it's a disaster - because now 
> you're always working with chunks that are just 'too big' to be good 
> building blocks of a reasonable allocator.
>
> If you always allocate 64kB for file caches, and you work with lots of 
> small files (like a source tree), you will literally waste all your 
> memory.
>
>   

Well, no one is talking about 64KB granularity for in-core files.  As 
you noticed, Windows uses the MMU page size.  We could keep doing that 
and still have 16KB+ sector sizes.  It just means an RMW if you don't 
happen to have the adjoining clean pages in cache.
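
A rough sketch of that check, to make the idea concrete. It assumes a
cache kept in 4KB pages on top of a hypothetical 16KB device sector;
sector_pages and missing_pages are illustrative names, not existing
kernel functions.

```python
PAGE_SIZE = 4 * 1024      # MMU page size, i.e. the cache granularity
SECTOR_SIZE = 16 * 1024   # hypothetical large device sector


def sector_pages(page_index, page_size=PAGE_SIZE, sector_size=SECTOR_SIZE):
    """Indices of all pages that share a device sector with page_index."""
    per_sector = sector_size // page_size
    first = (page_index // per_sector) * per_sector
    return range(first, first + per_sector)


def missing_pages(dirty_page, cached):
    """Neighbouring pages that must be read before the sector holding
    dirty_page can be written back in one piece."""
    return [p for p in sector_pages(dirty_page)
            if p != dirty_page and p not in cached]


# All neighbours are already cached: write the sector directly, no RMW.
print(missing_pages(5, cached={4, 5, 6, 7}))   # []
# Only the dirty page is cached: read the other three pages first.
print(missing_pages(5, cached={5}))            # [4, 6, 7]
```

A non-empty result is the read-modify-write case; an empty one means the
full sector can be assembled from memory.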

Sure, on a rotating disk that's a disaster, but we're talking SSD here, 
so while you're doubling your access time, you're doubling a fairly 
small quantity.  The controller would do the same if it exposed smaller 
sectors, so there's no huge loss.

We still lose on disk storage efficiency, but I'm guessing that for a 
modern tree with some object files, debug information, and a .git 
directory, it won't be such a great hit.  For more mainstream uses, it 
would be negligible.
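
One way to put a number on that guess is to round every file up to whole
blocks and compare block sizes. The sketch below does that; the file
sizes in it are invented for illustration and are not measurements of
any real tree.

```python
def allocated_bytes(file_size, block_size):
    """Space a file occupies once rounded up to whole blocks."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def slack_ratio(file_sizes, block_size):
    """Fraction of the allocated space that is padding, not file data."""
    used = sum(file_sizes)
    alloc = sum(allocated_bytes(s, block_size) for s in file_sizes)
    return (alloc - used) / alloc if alloc else 0.0


# Hypothetical mix: a few small sources plus large object and pack files.
sizes = [200, 3_000, 12_000, 2_000_000, 50_000_000]

for block in (4 * 1024, 16 * 1024):
    print(f"{block // 1024}KB blocks: {slack_ratio(sizes, block):.3%} slack")
```

Running it over the real file sizes of a tree (for example, collected
with os.walk and os.path.getsize) would show how much the few large
files dilute the per-file padding.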

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

