Re: Implementing NVMHCI...

From: Avi Kivity <avi@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>,
	Szabolcs Szakacsits <szaka@ntfs-3g.com>,
	Grant Grundler <grundler@google.com>,
	Linux IDE mailing list <linux-ide@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Jens Axboe <jens.axboe@oracle.com>,
	Arjan van de Ven <arjan@infradead.org>
Subject: Re: Implementing NVMHCI...
Date: Tue, 14 Apr 2009 12:59:56 +0300
Message-ID: <49E45E9C.1020105@redhat.com>
In-Reply-To: <alpine.LFD.2.00.0904130747440.4583@localhost.localdomain>

Linus Torvalds wrote:
> On Mon, 13 Apr 2009, Avi Kivity wrote:
>   
>>>  - create a big file,
>>>       
>> Just creating a 5GB file in a filesystem with a 64KB block size was 
>> interesting - Windows was throwing out 256KB I/Os even though I was 
>> generating 1MB writes (and cached too).  Looks like a paranoid IDE 
>> driver (qemu exposes a PIIX4).
>>     
>
> Heh, ok. So the "big file" really only needed to be big enough to not be 
> cached, and 5GB was probably overkill. In fact, if there's some way to 
> blow the cache, you could have made it much smaller. But 5G certainly 
> works ;)
>   

I wanted to make sure my random writes later don't get coalesced.  A 1GB 
file, half of which is cached (I used a 1GB guest), offers lots of 
chances for coalescing if Windows delays the writes sufficiently.  At 
5GB, Windows can only cache 10% of the file, so it will be continuously 
flushing.


>
>  (a) Windows caches things with a 4kB granularity, so the 512-byte write 
>      turned into a read-modify-write
>   
>   
[...]

> You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for 
> your example!). It's a total disaster. Imagine what would happen to user 
> application performance if malloc() always returned 16kB-aligned chunks 
> of memory, all sized as integer multiples of 16kB? It would absolutely 
> _suck_. Sure, it would be fine for your large allocations, but any time 
> you handle strings, you'd allocate 16kB of memory for any small 5-byte 
> string. You'd have horrible cache behavior, and you'd run out of memory 
> much too quickly.
>
> The same is true in the kernel. The single biggest memory user under 
> almost all normal loads is the disk cache. That _is_ the normal allocator 
> for any OS kernel. Everything else is almost details (ok, so Linux in 
> particular does cache metadata very aggressively, so the dcache and inode 
> cache are seldom "just details", but the page cache is still generally the 
> most important part).
>
> So having a 16kB or 64kB granularity is a _disaster_. Which is why no sane 
> system does that. It's only useful if you absolutely _only_ work with 
> large files - ie you're a database server. For just about any other 
> workload, that kind of granularity is totally unacceptable.
>
> So doing a read-modify-write on a 1-byte (or 512-byte) write, when the 
> block size is 4kB is easy - we just have to do it anyway. 
>
> Doing a read-modify-write on a 4kB write and a 16kB (or 64kB) blocksize is 
> also _doable_, and from the IO pattern standpoint it is no different. But 
> from a memory allocation pattern standpoint it's a disaster - because now 
> you're always working with chunks that are just 'too big' to be good 
> building blocks of a reasonable allocator.
>
> If you always allocate 64kB for file caches, and you work with lots of 
> small files (like a source tree), you will literally waste all your 
> memory.
>
>   

Well, no one is talking about 64KB granularity for in-core files.  Like 
you noticed, Windows uses the mmu page size.  We could keep doing that, 
and still have 16KB+ sector sizes.  It just means a RMW if you don't 
happen to have the adjoining clean pages in cache.
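
To make the read-modify-write case concrete, here is a minimal sketch in 
C.  It is not kernel code: the in-memory "device", the 4KB/16KB sizes and 
every name in it (fake_dev, write_page, cached_sector, ...) are 
assumptions made up for illustration.  It only shows that the extra 
sector read is needed when the neighbouring pages are not already in 
cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_PAGE_SIZE 4096u   /* assumed mmu page size (cache granularity) */
#define DEV_SECTOR_SIZE 16384u  /* assumed large device sector */
#define PAGES_PER_SECTOR (DEV_SECTOR_SIZE / CACHE_PAGE_SIZE)

/* Hypothetical in-memory device that only transfers whole sectors. */
struct fake_dev {
    uint8_t *data;
    size_t   nsectors;
    unsigned reads;   /* sector reads issued so far */
    unsigned writes;  /* sector writes issued so far */
};

static void dev_read_sector(struct fake_dev *d, size_t s, uint8_t *buf)
{
    assert(s < d->nsectors);
    memcpy(buf, d->data + s * DEV_SECTOR_SIZE, DEV_SECTOR_SIZE);
    d->reads++;
}

static void dev_write_sector(struct fake_dev *d, size_t s, const uint8_t *buf)
{
    assert(s < d->nsectors);
    memcpy(d->data + s * DEV_SECTOR_SIZE, buf, DEV_SECTOR_SIZE);
    d->writes++;
}

/*
 * Write back one dirty page.  If cached_sector is non-NULL it holds the
 * current contents of the whole sector (the adjoining clean pages are in
 * cache), so no device read is needed.  Otherwise the sector is read
 * first: that read is the extra cost of the RMW.
 */
static void write_page(struct fake_dev *d, size_t page_idx,
                       const uint8_t *page, const uint8_t *cached_sector)
{
    static uint8_t buf[DEV_SECTOR_SIZE];
    size_t sector = page_idx / PAGES_PER_SECTOR;
    size_t offset = (page_idx % PAGES_PER_SECTOR) * CACHE_PAGE_SIZE;

    if (cached_sector)
        memcpy(buf, cached_sector, DEV_SECTOR_SIZE);  /* neighbours in core */
    else
        dev_read_sector(d, sector, buf);              /* read   */

    memcpy(buf + offset, page, CACHE_PAGE_SIZE);      /* modify */
    dev_write_sector(d, sector, buf);                 /* write  */
}
```

With cached_sector == NULL each page write costs one sector read plus one 
sector write; with the neighbours cached it costs only the write.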

Sure, on a rotating disk that's a disaster, but we're talking SSD here, 
so while you're doubling your access time, you're doubling a fairly 
small quantity.  The controller would do the same if it exposed smaller 
sectors, so there's no huge loss.

We still lose on disk storage efficiency, but I'm guessing that for a 
modern tree, with some object files with debug information and a .git 
directory, it won't be such a great hit.  For more mainstream uses, it 
would be negligible.
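
The storage-efficiency loss can be estimated with a few lines of C.  
This is a rough sketch under stated assumptions: files are allocated in 
whole blocks, metadata and any packing of small files are ignored, and 
the helper names (on_disk_bytes, slack_fraction) are made up for this 
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes a file of `size` occupies when allocated in whole blocks. */
static uint64_t on_disk_bytes(uint64_t size, uint64_t block_size)
{
    assert(block_size > 0);
    return (size + block_size - 1) / block_size * block_size;
}

/* Fraction of the allocated space that is slack for a set of files. */
static double slack_fraction(const uint64_t *sizes, size_t n,
                             uint64_t block_size)
{
    uint64_t used = 0, allocated = 0;

    for (size_t i = 0; i < n; i++) {
        used += sizes[i];
        allocated += on_disk_bytes(sizes[i], block_size);
    }
    /* An empty set (or only empty files) wastes nothing. */
    return allocated ? (double)(allocated - used) / (double)allocated : 0.0;
}
```

Feeding it the file sizes of a real tree, once with a 4KB and once with a 
16KB or 64KB block size, would show how much of the hit comes from small 
source files and how much is amortized by large object files and packs.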

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
