From: Daniel Phillips <phillips@arcor.de>
To: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@osdl.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
hugh@veritas.com
Subject: Re: [RFC][patch 0/2] mm: remove PageReserved
Date: Sat, 13 Aug 2005 05:34:53 +1000
Message-ID: <200508130534.54155.phillips@arcor.de>
In-Reply-To: <3521.1123757360@warthog.cambridge.redhat.com>

On Thursday 11 August 2005 20:49, David Howells wrote:
> Daniel Phillips <phillips@arcor.de> wrote:
> > To be honest I'm having some trouble following this through logically.
> > I'll read through a few more times and see if that fixes the problem.
> > This seems cluster-related, so I have an interest.
>
> Well, perhaps I can explain the function for which I'm using this page flag
> more clearly. You'll have to excuse me if it's covering stuff you don't
> know, but I want to take it from first principles; plus this stuff might
> well find its way into the kernel docs.
>
>
> We want to use a relatively fast medium (such as RAM or local disk) to
> speed up repeated accesses to a relatively slow medium (such as NFS, NBD,
> CDROM) by means of caching the results of previous accesses to the slow
> medium on the fast medium.
>
> Now we already do this at one level: RAM. The page cache _is_ such a cache,
> but whilst it's much faster than a disk, it is severely restricted in size
Did you just suggest that 16 TB/address_space is too small to cache NFS pages?
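For reference, the 16 TB figure follows if one assumes a 32-bit page index and 4 KiB pages, neither of which is spelled out above. A quick check of that arithmetic, using a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes addressable through one mapping when each
 * index value names one page. The 32-bit index and 4 KiB page size used
 * in the note below are assumptions, not values taken from this thread.
 */
static uint64_t mapping_limit_bytes(unsigned index_bits, uint64_t page_size)
{
    /* Keep the shift and the multiplication inside 64 bits. */
    assert(index_bits < 64);
    uint64_t pages = (uint64_t)1 << index_bits;
    assert(page_size != 0 && pages <= UINT64_MAX / page_size);
    return pages * page_size;
}
```

With those assumptions, mapping_limit_bytes(32, 4096) is 2^44 bytes, i.e. 16 TiB per address_space.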
> compared to media such as disks, it's more expensive
It is?
> and its contents generally don't last over power failure or reboots.
When used by RAMFS maybe. But fortunately the page cache has a backing store
API, in fact, that is its raison d'etre.
> The major attribute of the page cache is that the CPU can access it
> directly.
You seem to have forgotten about non-resident pages.
> So we want to add another level: local disk. The FS-Cache/CacheFS patches
> permit such as AFS and NFS to use local disk as a cache.
The page cache already lets you do that. I have not yet discerned a
fundamental reason why you need to interface to another filesystem to
implement backing store for an address_space.
> So, assume that NFS is using a local disk cache (it doesn't matter whether
> it's CacheFS, CacheFiles, or something else), and assume a process has a
> file open through NFS.
>
> The process attempts to read from the file. This causes the NFS readpage()
> or readpages() operation to be invoked to load the data into the page cache
> so that the CPU can make use of it.
>
> So the NFS page reading algorithm first consults the disk cache. Assume
> this returns a negative response - NFS will then read from the server into
> the page cache. Under cacheless operation, it would then unlock the page
> and the kernel could then let userspace play with it, but we're dealing
> with a cache, and so the newly fetched data must be stored in the disk
> cache for future retrieval.
>
> NFS now has three choices:
>
> (1) It could instigate a write to the disk cache and wait for that to
> complete before unlocking the page and letting userspace see it, but
> we don't know how long that might take.
Pages are typically unlocked while being written to backing store, e.g.:
http://lxr.linux.no/source/fs/buffer.c#L1839
What makes NFS special in this regard?
> CacheFS immediately dispatches a write BIO to get it DMA'd to the disk
> as soon as possible, but something like CacheFiles is dependent on an
> underlying filesystem - be it EXT3, ReiserFS, XFS, etc. - to perform the
> write, and we've no control over that.
That is a problem you are in the process of inventing.
> Time to unlock: CacheMiss + NetRead + CacheWrite
> Cache reliable: Yes
>
> (2) It could just unlock the page and let userspace scribble on it whilst
> simultaneously writing it to the cache. But that means the DMA to the
> disk may pick up some of userspace's scribblings, and that means you
> can't trust what's in the cache in the event of a power loss.
I thought I saw a journal in there. Anyway, if the user has asked for a racy
write, that is what they should get.
> This can be alleviated by marking untrustworthy files in the cache,
> but that then extends the management time in several ways.
>
> Time to unlock: CacheMiss + NetRead
> Cache reliable: No
I think your definition of trustworthy goes beyond what is required by POSIX
or Linux local filesystem semantics.
> (3) It could tell the cache that the page needs writing to disk and then
> unlock it for userspace to read, but intercept the change of a PTE
> pointing to this page when it loses its write protection (PTEs start
> off read-only, generating a write protection fault on the first write).
We need to do something like this to implement cross-node caching of
shared-writeable mmaps. This is another reason that your ideas need clear
explanations: we need to go the rest of the way and get this sorted out for
cluster filesystems in general, not just NFS (v4). It does help a lot that
you are attempting to explain what the needs of NFS actually are.
Unfortunately, it seems you are proposing that this mechanism is essential
even for single-node use, which is far from clear.
> The interceptor would then force userspace to wait for the cache to
> finish DMA'ing the page before writing to it.
>
> Similarly, the write() or prepare_write() operations would wait for
> the cache to finish with that page.
Here you return to the assumption that the VFS should enforce per-page write
granularity. There is no such rule as far as I know.
> Time to unlock: CacheMiss + NetRead
> Cache reliable: Yes
>
> I originally chose option (1), but then I saw just how much it affected
> performance and worked on option (3).
>
> I discarded option (2) because I want to be able to have some surety about
> the state in the cache - I don't want to have to reinitialise it after a
> power failure. Imagine if you cache /usr... Imagine if everyone in a very
> large office caches /usr...
>
>
> So, the way I implemented (3) is to use an extra page flag to indicate a
> write underway to the cache, and thus allow cache write status to be
> determined when someone wants to scribble on a page.
>
> The fscache_write_page() function takes a pointer to a callback function.
> In NFS this function clears the PG_fs_misc bit on the appropriate pages and
> wakes up anyone who was waiting for this event (end_page_fs_misc()).
>
> The NFS page_mkwrite() VMA op calls wait_on_page_fs_misc() to wait on that
> page bit if it is set.
>
> > Who is using this interface?
>
> AFS and NFS will both use it. There may be others eventually who use it for
> the same purpose. CacheFS has a different use for it internally.
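To pin down what is being claimed for option (3), here is a minimal userspace sketch of the flag-and-wait scheme as described above. It is not the FS-Cache code: every demo_* name is hypothetical, a mutex plus condition variable stands in for the page flag's wait queue, and a memcpy stands in for the DMA to the cache.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace stand-in for a page cache page. */
struct demo_page {
    pthread_mutex_t lock;
    pthread_cond_t  write_done;
    bool            fs_misc;     /* models PG_fs_misc: cache write in flight */
    char            data[8];     /* page contents */
    char            cached[8];   /* what the "disk cache" ends up holding */
};

/* Models the cache write plus its completion callback. */
static void *demo_cache_writer(void *arg)
{
    struct demo_page *p = arg;

    /* The flag must have been set before the write was dispatched. */
    assert(p->fs_misc);
    memcpy(p->cached, p->data, sizeof(p->cached)); /* the "DMA", lock not held */

    pthread_mutex_lock(&p->lock);
    p->fs_misc = false;                     /* models end_page_fs_misc() */
    pthread_cond_broadcast(&p->write_done); /* wake anyone waiting on the bit */
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Models wait_on_page_fs_misc() as called on the first write fault. */
static void demo_wait_on_fs_misc(struct demo_page *p)
{
    pthread_mutex_lock(&p->lock);
    while (p->fs_misc)
        pthread_cond_wait(&p->write_done, &p->lock);
    pthread_mutex_unlock(&p->lock);
}

/* Returns true if the cache captured the data as read from the network. */
static bool demo_run(void)
{
    struct demo_page p;
    pthread_t writer;

    memset(&p, 0, sizeof(p));
    pthread_mutex_init(&p.lock, NULL);
    pthread_cond_init(&p.write_done, NULL);
    memcpy(p.data, "netread", 8);
    p.fs_misc = true;                 /* set before the page is "unlocked" */

    if (pthread_create(&writer, NULL, demo_cache_writer, &p) != 0)
        return false;

    demo_wait_on_fs_misc(&p);         /* the writer to the page blocks here */
    memcpy(p.data, "scribbl", 8);     /* scribbling happens only afterwards */

    pthread_join(writer, NULL);
    pthread_cond_destroy(&p.write_done);
    pthread_mutex_destroy(&p.lock);
    return strcmp(p.cached, "netread") == 0;
}
```

In this model readers are never blocked. Only the first write to the page waits, which corresponds to the "Time to unlock: CacheMiss + NetRead" figure quoted for option (3).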
Let's try to clear up the page write atomicity question, please. It seems
your argument depends on it.

Regards,

Daniel