public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: David Howells <dhowells@redhat.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache
Date: Thu, 24 May 2007 13:38:21 -0700	[thread overview]
Message-ID: <20070524133821.3ee9c9f3.akpm@linux-foundation.org> (raw)
In-Reply-To: <20070523191524.24135.2609.stgit@warthog.cambridge.redhat.com>

On Wed, 23 May 2007 20:15:24 +0100
David Howells <dhowells@redhat.com> wrote:

> Add a function - cancel_rejected_write() - to excise a rejected write from the
> pagecache.  This function is related to the truncation family of routines.  It
> permits the pages modified by a network filesystem client (such as AFS) to be
> excised and discarded from the pagecache if the attempt to write them back to
> the server fails.

Could you please flesh this out a bit, from a higher level?

As in: what does it mean for a server to "reject" a write?  What's actually
going on here?  Why does the server do this?  I assume that the application
will see a short write or an EIO or something?

How does this interact with MAP_SHARED?

Do we expect that any other networked filesystem would want to do this? 
(and if not, why not?) Or do they already attempt to do it in some other
fashion?

> The dirty and writeback states of the afflicted pages are cancelled and the
> pages themselves are detached for recycling.  All PTEs referring to those
> pages are removed.
> 
> Note that the locking is tricky as it's very easy to deadlock against
> truncate() and other routines once the pages have been unlocked as part of the
> writeback process.  To this end, the PG_error flag is set, then the
> PG_writeback flag is cleared, and only *then* can lock_page() be called.

hm.

> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  include/linux/mm.h |    5 ++-
>  mm/truncate.c      |   83 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 86 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e4183c6..73688d4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1086,12 +1086,13 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);
>  
>  extern unsigned long do_brk(unsigned long, unsigned long);
>  
> -/* filemap.c */
> -extern unsigned long page_unuse(struct page *);
> +/* truncate.c */
>  extern void truncate_inode_pages(struct address_space *, loff_t);
>  extern void truncate_inode_pages_range(struct address_space *,
>  				       loff_t lstart, loff_t lend);
> +extern void cancel_rejected_write(struct address_space *, pgoff_t, pgoff_t);
>  
> +/* filemap.c */
>  /* generic vm_area_ops exported for stackable file systems */
>  extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int *);
>  extern int filemap_populate(struct vm_area_struct *, unsigned long,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 4fbe1a2..f742096 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -443,3 +443,86 @@ int invalidate_inode_pages2(struct address_space *mapping)
>  	return invalidate_inode_pages2_range(mapping, 0, -1);
>  }
>  EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
> +
> +/*
> + * Cancel that part of a rejected write that affects a particular page
> + */
> +static void cancel_rejected_page(struct address_space *mapping,
> +				 struct page *page, pgoff_t *_next)
> +{
> +	if (!TestSetPageError(page)) {
> +		/* can't lock the page until we've cleared PG_writeback lest we
> +		 * deadlock with truncate (amongst other things) */
> +		end_page_writeback(page);
> +		if (page->mapping == mapping) {
> +			lock_page(page);
> +			if (page->mapping == mapping) {
> +				truncate_complete_page(mapping, page);
> +				*_next = page->index + 1;
> +			}
> +			unlock_page(page);
> +		}
> +	} else if (PageWriteback(page) || PageDirty(page)) {
> +		BUG();
> +	}
> +}

Why can't we just run the end_page_writeback() unconditionally? 
PG_writeback _must_ be set here, and it is the caller who set PG_writeback,
so this thread of control "owns" that flag.

Also, are you really really sure that there is no way in which PG_writeback
can get itself set again after the end_page_writeback()?  I'd have thought
that we should be doing a wait_on_page_writeback() after the lock_page()
there.  Remember, other filesystems might want to be calling this, so we
shouldn't be designing around AFS implementation specifics.

> +/**
> + * cancel_rejected_write - Cancel a write on a contiguous set of pages
> + * @mapping: mapping affected
> + * @start: first page in set
> + * @end: last page in set
> + *
> + * Cancel a write of a contiguous set of pages when the writeback was rejected
> + * by the target medium or server.
> + *
> + * The pages in question are detached and discarded from the pagecache, and the
> + * writeback and dirty states are cleared prior to invalidation.  The caller
> + * must make sure that all the pages in the range are present in the pagecache,
> + * and the caller must hold PG_writeback on each of them.  NOTE! All the pages
> + * are locked and unlocked as part of this process, so the caller must take
> + * care to avoid deadlock.
> + *
> + * The PTEs pointing to those pages are also cleared, leading to the PTEs being
> + * reset when new pages are allocated and the contents reloaded.
> + */
> +void cancel_rejected_write(struct address_space *mapping,
> +			   pgoff_t start, pgoff_t end)
> +{
> +	struct pagevec pvec;
> +	pgoff_t n;
> +	int i;
> +
> +	BUG_ON(mapping->nrpages < end - start + 1);
> +
> +	/* dispose of any PTEs pointing to the affected pages */
> +	unmap_mapping_range(mapping,
> +			    (loff_t)start << PAGE_CACHE_SHIFT,
> +			    (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT,
> +			    0);
> +
> +	pagevec_init(&pvec, 0);
> +	do {
> +		cond_resched();
> +		n = end - start + 1;
> +		if (n > PAGEVEC_SIZE)
> +			n = PAGEVEC_SIZE;
> +		n = pagevec_lookup(&pvec, mapping, start, n);
> +		for (i = 0; i < n; i++) {
> +			struct page *page = pvec.pages[i];
> +
> +			if (page->index < start || page->index > end)
> +				continue;
> +			start++;
> +			cancel_rejected_page(mapping, page, &start);
> +		}
> +		pagevec_release(&pvec);
> +	} while (start - 1 < end);
> +
> +	/* dispose of any new PTEs pointing to the affected pages */
> +	unmap_mapping_range(mapping,
> +			    (loff_t)start << PAGE_CACHE_SHIFT,
> +			    (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT,
> +			    0);
> +}
> +EXPORT_SYMBOL_GPL(cancel_rejected_write);

hm, is the pte-unmapping here completely solid?  Are there any race windows
in which a faulter can end up owning, say, an anonymous page?  We've had
heaps of problems with that sort of thing in generic_file_direct_IO() and I
don't expect it's this easy.






  reply	other threads:[~2007-05-24 20:38 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-05-23 19:15 [PATCH 1/4] AFS: Add TestSetPageError() David Howells
2007-05-23 19:15 ` [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache David Howells
2007-05-24 20:38   ` Andrew Morton [this message]
2007-05-24 21:35     ` David Howells
2007-05-24 21:47       ` Andrew Morton
2007-05-24 22:34         ` David Howells
2007-05-24 22:46           ` Andrew Morton
2007-05-24 23:08             ` David Howells
2007-05-24 23:24               ` Andrew Morton
2007-05-24 23:37                 ` David Howells
2007-05-24 22:35       ` Trond Myklebust
2007-05-24 23:18         ` David Howells
2007-05-24 23:54           ` Trond Myklebust
2007-05-30 10:35             ` David Howells
2007-05-30 17:39               ` Trond Myklebust
2007-05-23 19:15 ` [PATCH 3/4] AFS: Improve handling of a rejected writeback David Howells
2007-05-23 19:15 ` [PATCH 4/4] AFS: Implement shared-writable mmap David Howells

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070524133821.3ee9c9f3.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=dhowells@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox