From: Avi Kivity <avi@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>,
Szabolcs Szakacsits <szaka@ntfs-3g.com>,
Grant Grundler <grundler@google.com>,
Linux IDE mailing list <linux-ide@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>,
Jens Axboe <jens.axboe@oracle.com>,
Arjan van de Ven <arjan@infradead.org>
Subject: Re: Implementing NVMHCI...
Date: Tue, 14 Apr 2009 12:59:56 +0300
Message-ID: <49E45E9C.1020105@redhat.com>
In-Reply-To: <alpine.LFD.2.00.0904130747440.4583@localhost.localdomain>
Linus Torvalds wrote:
> On Mon, 13 Apr 2009, Avi Kivity wrote:
>
>>> - create a big file,
>>>
>> Just creating a 5GB file in a 64KB filesystem was interesting - Windows
>> was throwing out 256KB I/Os even though I was generating 1MB writes (and
>> cached too). Looks like a paranoid IDE driver (qemu exposes a PIIX4).
>>
>
> Heh, ok. So the "big file" really only needed to be big enough to not be
> cached, and 5GB was probably overkill. In fact, if there's some way to
> blow the cache, you could have made it much smaller. But 5G certainly
> works ;)
>
I wanted to make sure my random writes later don't get coalesced. A 1GB
file, half of which is cached (I used a 1GB guest), offers lots of
chances for coalescing if Windows delays the writes sufficiently. At
5GB, Windows can only cache 10% of the file, so it will be continuously
flushing.
>
> (a) Windows caches things with a 4kB granularity, so the 512-byte write
> turned into a read-modify-write
>
>
[...]
> You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for
> your example!). It's a total disaster. Imagine what would happen to user
> application performance if kmalloc() always returned 16kB-aligned chunks
> of memory, all sized as integer multiples of 16kB? It would absolutely
> _suck_. Sure, it would be fine for your large allocations, but any time
> you handle strings, you'd allocate 16kB of memory for any small 5-byte
> string. You'd have horrible cache behavior, and you'd run out of memory
> much too quickly.
>
> The same is true in the kernel. The single biggest memory user under
> almost all normal loads is the disk cache. That _is_ the normal allocator
> for any OS kernel. Everything else is almost details (ok, so Linux in
> particular does cache metadata very aggressively, so the dcache and inode
> cache are seldom "just details", but the page cache is still generally the
> most important part).
>
> So having a 16kB or 64kB granularity is a _disaster_. Which is why no sane
> system does that. It's only useful if you absolutely _only_ work with
> large files - ie you're a database server. For just about any other
> workload, that kind of granularity is totally unacceptable.
>
> So doing a read-modify-write on a 1-byte (or 512-byte) write, when the
> block size is 4kB is easy - we just have to do it anyway.
>
> Doing a read-modify-write on a 4kB write and a 16kB (or 64kB) blocksize is
> also _doable_, and from the IO pattern standpoint it is no different. But
> from a memory allocation pattern standpoint it's a disaster - because now
> you're always working with chunks that are just 'too big' to be good
> building blocks of a reasonable allocator.
>
> If you always allocate 64kB for file caches, and you work with lots of
> small files (like a source tree), you will literally waste all your
> memory.
>
>
Well, no one is talking about 64KB granularity for in-core files. Like
you noticed, Windows uses the mmu page size. We could keep doing that,
and still have 16KB+ sector sizes. It just means a RMW if you don't
happen to have the adjoining clean pages in cache.
Sure, on a rotating disk that's a disaster, but we're talking SSD here,
so while you're doubling your access time, you're doubling a fairly
small quantity. The controller would do the same if it exposed smaller
sectors, so there's no huge loss.
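
As a rough illustration of that RMW, here is a minimal sketch assuming 4KB pages in the cache and a 16KB device sector. Everything in it (PAGE, SECTOR, write_page, the dict used as a cache, the bytearray used as a disk) is a made-up stand-in for illustration, not a kernel or Windows interface.

```python
PAGE = 4096          # assumed in-core cache granularity (MMU page size)
SECTOR = 16 * 1024   # assumed device sector size


def write_page(disk, cache, page_no, data):
    """Write one page to a device that only accepts whole sectors.

    disk  : bytearray standing in for the device (hypothetical)
    cache : dict mapping page number -> bytes of clean cached pages
    Returns the number of extra device reads needed (0 or 1).
    """
    assert len(data) == PAGE
    pages_per_sector = SECTOR // PAGE
    first = (page_no // pages_per_sector) * pages_per_sector
    sector_pages = range(first, first + pages_per_sector)
    start = first * PAGE

    reads = 0
    if not all(p in cache for p in sector_pages if p != page_no):
        # An adjoining page is missing from the cache: read the whole
        # sector first (the "read" in read-modify-write).
        sector = bytes(disk[start:start + SECTOR])
        reads = 1
        for i, p in enumerate(sector_pages):
            cache.setdefault(p, sector[i * PAGE:(i + 1) * PAGE])

    cache[page_no] = bytes(data)  # modify
    # Write the full sector back, assembled from cached pages.
    disk[start:start + SECTOR] = b"".join(cache[p] for p in sector_pages)
    return reads
```

When the neighbouring pages are already cached and clean, the function returns 0 and the write costs a single sector write; otherwise it costs one extra sector read.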
We still lose on disk storage efficiency, but I'm guessing that for a
modern tree with some object files with debug information and a .git
directory, it won't be such a great hit. For more mainstream uses, it
would be negligible.
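
To put rough numbers on the storage-efficiency question, a small sketch that rounds each file up to a whole number of blocks and reports the slack. The file sizes below are invented for illustration and are not measurements of any real tree; the result depends entirely on the mix of file sizes.

```python
def allocated(size, block):
    """Bytes occupied on disk when `size` bytes are stored in whole blocks."""
    return -(-size // block) * block  # ceiling division


def overhead(sizes, block):
    """Fraction of the allocated space that is slack (unused tail of blocks)."""
    used = sum(sizes)
    alloc = sum(allocated(s, block) for s in sizes)
    return (alloc - used) / alloc


# Hypothetical mix: many small source files, some object files with
# debug information, and one large pack file.
sizes = [2_000] * 1000 + [300_000] * 50 + [200_000_000]

for block in (4096, 16384, 65536):
    print(block, round(overhead(sizes, block), 3))
```

The more of the tree's bytes sit in large files, the smaller the slack fraction becomes at any given block size.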
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.