public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/4] AFS: Add TestSetPageError()
@ 2007-05-23 19:15 David Howells
  2007-05-23 19:15 ` [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache David Howells
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: David Howells @ 2007-05-23 19:15 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, dhowells

Add a TestSetPageError() macro to the suite of page flag manipulators.  This
can be used by AFS to prevent over-excision of rejected writes from the page
cache.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/page-flags.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index ae2d79f..cc8d952 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -121,6 +121,7 @@
 #define PageError(page)		test_bit(PG_error, &(page)->flags)
 #define SetPageError(page)	set_bit(PG_error, &(page)->flags)
 #define ClearPageError(page)	clear_bit(PG_error, &(page)->flags)
+#define TestSetPageError(page)	test_and_set_bit(PG_error, &(page)->flags)
 
 #define PageReferenced(page)	test_bit(PG_referenced, &(page)->flags)
 #define SetPageReferenced(page)	set_bit(PG_referenced, &(page)->flags)


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache
  2007-05-23 19:15 [PATCH 1/4] AFS: Add TestSetPageError() David Howells
@ 2007-05-23 19:15 ` David Howells
  2007-05-24 20:38   ` Andrew Morton
  2007-05-23 19:15 ` [PATCH 3/4] AFS: Improve handling of a rejected writeback David Howells
  2007-05-23 19:15 ` [PATCH 4/4] AFS: Implement shared-writable mmap David Howells
  2 siblings, 1 reply; 17+ messages in thread
From: David Howells @ 2007-05-23 19:15 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, dhowells

Add a function - cancel_rejected_write() - to excise a rejected write from the
pagecache.  This function is related to the truncation family of routines.  It
permits the pages modified by a network filesystem client (such as AFS) to be
excised and discarded from the pagecache if the attempt to write them back to
the server fails.

The dirty and writeback states of the afflicted pages are cancelled and the
pages themselves are detached for recycling.  All PTEs referring to those
pages are removed.

Note that the locking is tricky as it's very easy to deadlock against
truncate() and other routines once the pages have been unlocked as part of the
writeback process.  To this end, the PG_error flag is set, then the
PG_writeback flag is cleared, and only *then* can lock_page() be called.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/mm.h |    5 ++-
 mm/truncate.c      |   83 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e4183c6..73688d4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1086,12 +1086,13 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 extern unsigned long do_brk(unsigned long, unsigned long);
 
-/* filemap.c */
-extern unsigned long page_unuse(struct page *);
+/* truncate.c */
 extern void truncate_inode_pages(struct address_space *, loff_t);
 extern void truncate_inode_pages_range(struct address_space *,
 				       loff_t lstart, loff_t lend);
+extern void cancel_rejected_write(struct address_space *, pgoff_t, pgoff_t);
 
+/* filemap.c */
 /* generic vm_area_ops exported for stackable file systems */
 extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int *);
 extern int filemap_populate(struct vm_area_struct *, unsigned long,
diff --git a/mm/truncate.c b/mm/truncate.c
index 4fbe1a2..f742096 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -443,3 +443,86 @@ int invalidate_inode_pages2(struct address_space *mapping)
 	return invalidate_inode_pages2_range(mapping, 0, -1);
 }
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
+
+/*
+ * Cancel that part of a rejected write that affects a particular page
+ */
+static void cancel_rejected_page(struct address_space *mapping,
+				 struct page *page, pgoff_t *_next)
+{
+	if (!TestSetPageError(page)) {
+		/* can't lock the page until we've cleared PG_writeback lest we
+		 * deadlock with truncate (amongst other things) */
+		end_page_writeback(page);
+		if (page->mapping == mapping) {
+			lock_page(page);
+			if (page->mapping == mapping) {
+				truncate_complete_page(mapping, page);
+				*_next = page->index + 1;
+			}
+			unlock_page(page);
+		}
+	} else if (PageWriteback(page) || PageDirty(page)) {
+		BUG();
+	}
+}
+
+/**
+ * cancel_rejected_write - Cancel a write on a contiguous set of pages
+ * @mapping: mapping affected
+ * @start: first page in set
+ * @end: last page in set
+ *
+ * Cancel a write of a contiguous set of pages when the writeback was rejected
+ * by the target medium or server.
+ *
+ * The pages in question are detached and discarded from the pagecache, and the
+ * writeback and dirty states are cleared prior to invalidation.  The caller
+ * must make sure that all the pages in the range are present in the pagecache,
+ * and the caller must hold PG_writeback on each of them.  NOTE! All the pages
+ * are locked and unlocked as part of this process, so the caller must take
+ * care to avoid deadlock.
+ *
+ * The PTEs pointing to those pages are also cleared, leading to the PTEs being
+ * reset when new pages are allocated and the contents reloaded.
+ */
+void cancel_rejected_write(struct address_space *mapping,
+			   pgoff_t start, pgoff_t end)
+{
+	struct pagevec pvec;
+	pgoff_t n;
+	int i;
+
+	BUG_ON(mapping->nrpages < end - start + 1);
+
+	/* dispose of any PTEs pointing to the affected pages */
+	unmap_mapping_range(mapping,
+			    (loff_t)start << PAGE_CACHE_SHIFT,
+			    (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT,
+			    0);
+
+	pagevec_init(&pvec, 0);
+	do {
+		cond_resched();
+		n = end - start + 1;
+		if (n > PAGEVEC_SIZE)
+			n = PAGEVEC_SIZE;
+		n = pagevec_lookup(&pvec, mapping, start, n);
+		for (i = 0; i < n; i++) {
+			struct page *page = pvec.pages[i];
+
+			if (page->index < start || page->index > end)
+				continue;
+			start++;
+			cancel_rejected_page(mapping, page, &start);
+		}
+		pagevec_release(&pvec);
+	} while (start - 1 < end);
+
+	/* dispose of any new PTEs pointing to the affected pages */
+	unmap_mapping_range(mapping,
+			    (loff_t)start << PAGE_CACHE_SHIFT,
+			    (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT,
+			    0);
+}
+EXPORT_SYMBOL_GPL(cancel_rejected_write);


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/4] AFS: Improve handling of a rejected writeback
  2007-05-23 19:15 [PATCH 1/4] AFS: Add TestSetPageError() David Howells
  2007-05-23 19:15 ` [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache David Howells
@ 2007-05-23 19:15 ` David Howells
  2007-05-23 19:15 ` [PATCH 4/4] AFS: Implement shared-writable mmap David Howells
  2 siblings, 0 replies; 17+ messages in thread
From: David Howells @ 2007-05-23 19:15 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, dhowells

Improve the handling of the case of a server rejecting an attempt to write back
a cached write.  AFS operates a write-back cache, so the following sequence of
events can theoretically occur:

	CLIENT 1		CLIENT 2
	=======================	=======================
	cat data >/the/file
	 (sits in pagecache)
				fs setacl -dir /the/dir/of/the/file \
					-acl system:administrators rlidka
				 (write permission removed for client 1)
	sync
	 (writeback attempt fails)

The way AFS attempts to handle this is:

 (1) The affected region will be excised and discarded on the basis that it
     can't be written back, yet we don't want it lurking in the page cache
     either.  The contents of the affected region will be reread from the
     server when called for again.

 (2) The EOF size will be set to the current server-based file size - usually
     that which it was before the affected write was made - assuming no
     conflicting write has been appended, and assuming the affected write
     extended the file.


This patch makes the following changes:

 (1) Zero-length short reads don't produce EBADMSG now just because the OpenAFS
     server puts a silly value as the size of the returned data.  This prevents
     excised pages beyond the revised EOF being reinstantiated with a surprise
     PG_error.

 (2) Writebacks can now be put into a 'rejected' state in which all further
     attempts to write them back will result in excision of the affected pages
     instead.

 (3) Preparing a page for overwriting now reads the whole page instead of just
     those parts of it that aren't to be covered by the copy to be made.  This
     handles the possibility that the copy might fail on EFAULT.  Corollary to
     this, PG_update can now be set by afs_prepare_page() on behalf of
     afs_prepare_write() rather than setting it in afs_commit_write().

 (4) In the case of a conflicting write, afs_prepare_write() will attempt to
     flush the write to the server, and will then wait for PG_writeback to go
     away - after unlocking the page.  This helps prevent deadlock against the
     writeback-rejection handler.  AOP_TRUNCATED_PAGE is then returned to the
     caller to signify that the page has been unlocked, and that it should be
     revalidated.

 (5) The writeback-rejection handler now calls cancel_rejected_write() added by
     the previous patch to excise the affected pages rather than clearing the
     PG_uptodate flag on all the pages.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/fsclient.c |    4 +
 fs/afs/internal.h |    1 
 fs/afs/write.c    |  154 ++++++++++++++++++++++++++++-------------------------
 3 files changed, 85 insertions(+), 74 deletions(-)

diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index 5dff130..60f9500 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -353,7 +353,9 @@ static int afs_deliver_fs_fetch_data(struct afs_call *call,
 
 		call->count = ntohl(call->tmp);
 		_debug("DATA length: %u", call->count);
-		if (call->count > PAGE_SIZE)
+		if ((s32) call->count < 0)
+			call->count = 0; /* access completely beyond EOF */
+		else if (call->count > PAGE_SIZE)
 			return -EBADMSG;
 		call->offset = 0;
 		call->unmarshall++;
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 2dac3ad..ddb18d9 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -154,6 +154,7 @@ struct afs_writeback {
 		AFS_WBACK_PENDING,		/* write pending */
 		AFS_WBACK_CONFLICTING,		/* conflicting writes posted */
 		AFS_WBACK_WRITING,		/* writing back */
+		AFS_WBACK_REJECTED,		/* the writeback was rejected */
 		AFS_WBACK_COMPLETE		/* the writeback record has been unlinked */
 	} state __attribute__((packed));
 };
diff --git a/fs/afs/write.c b/fs/afs/write.c
index a03b92a..ac621e8 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -81,18 +81,16 @@ void afs_put_writeback(struct afs_writeback *wb)
 }
 
 /*
- * partly or wholly fill a page that's under preparation for writing
+ * fill a page that's under preparation for writing
  */
 static int afs_fill_page(struct afs_vnode *vnode, struct key *key,
-			 unsigned start, unsigned len, struct page *page)
+			 unsigned len, struct page *page)
 {
 	int ret;
 
-	_enter(",,%u,%u", start, len);
+	_enter(",,%u,", len);
 
-	ASSERTCMP(start + len, <=, PAGE_SIZE);
-
-	ret = afs_vnode_fetch_data(vnode, key, start, len, page);
+	ret = afs_vnode_fetch_data(vnode, key, 0, len, page);
 	if (ret < 0) {
 		if (ret == -ENOENT) {
 			_debug("got NOENT from server"
@@ -110,18 +108,15 @@ static int afs_fill_page(struct afs_vnode *vnode, struct key *key,
  * prepare a page for being written to
  */
 static int afs_prepare_page(struct afs_vnode *vnode, struct page *page,
-			    struct key *key, unsigned offset, unsigned to)
+			    struct key *key)
 {
-	unsigned eof, tail, start, stop, len;
+	unsigned len;
 	loff_t i_size, pos;
 	void *p;
 	int ret;
 
 	_enter("");
 
-	if (offset == 0 && to == PAGE_SIZE)
-		return 0;
-
 	p = kmap_atomic(page, KM_USER0);
 
 	i_size = i_size_read(&vnode->vfs_inode);
@@ -129,10 +124,7 @@ static int afs_prepare_page(struct afs_vnode *vnode, struct page *page,
 	if (pos >= i_size) {
 		/* partial write, page beyond EOF */
 		_debug("beyond");
-		if (offset > 0)
-			memset(p, 0, offset);
-		if (to < PAGE_SIZE)
-			memset(p + to, 0, PAGE_SIZE - to);
+		memset(p, 0, PAGE_SIZE);
 		kunmap_atomic(p, KM_USER0);
 		return 0;
 	}
@@ -140,31 +132,20 @@ static int afs_prepare_page(struct afs_vnode *vnode, struct page *page,
 	if (i_size - pos >= PAGE_SIZE) {
 		/* partial write, page entirely before EOF */
 		_debug("before");
-		tail = eof = PAGE_SIZE;
+		len = PAGE_SIZE;
 	} else {
 		/* partial write, page overlaps EOF */
-		eof = i_size - pos;
-		_debug("overlap %u", eof);
-		tail = max(eof, to);
-		if (tail < PAGE_SIZE)
-			memset(p + tail, 0, PAGE_SIZE - tail);
-		if (offset > eof)
-			memset(p + eof, 0, PAGE_SIZE - eof);
+		len = i_size - pos;
+		_debug("overlap %u", len);
+		ASSERTRANGE(0, <, len, <, PAGE_SIZE);
+		memset(p + len, 0, PAGE_SIZE - len);
 	}
 
 	kunmap_atomic(p, KM_USER0);
 
-	ret = 0;
-	if (offset > 0 || eof > to) {
-		/* need to fill one or two bits that aren't going to be written
-		 * (cover both fillers in one read if there are two) */
-		start = (offset > 0) ? 0 : to;
-		stop = (eof > to) ? eof : offset;
-		len = stop - start;
-		_debug("wr=%u-%u av=0-%u rd=%u@%u",
-		       offset, to, eof, start, len);
-		ret = afs_fill_page(vnode, key, start, len, page);
-	}
+	ret = afs_fill_page(vnode, key, len, page);
+	if (ret == 0)
+		SetPageUptodate(page);
 
 	_leave(" = %d", ret);
 	return ret;
@@ -187,6 +168,8 @@ int afs_prepare_write(struct file *file, struct page *page,
 	_enter("{%x:%u},{%lx},%u,%u",
 	       vnode->fid.vid, vnode->fid.vnode, page->index, offset, to);
 
+	BUG_ON(PageError(page));
+
 	candidate = kzalloc(sizeof(*candidate), GFP_KERNEL);
 	if (!candidate)
 		return -ENOMEM;
@@ -200,7 +183,7 @@ int afs_prepare_write(struct file *file, struct page *page,
 
 	if (!PageUptodate(page)) {
 		_debug("not up to date");
-		ret = afs_prepare_page(vnode, page, key, offset, to);
+		ret = afs_prepare_page(vnode, page, key);
 		if (ret < 0) {
 			kfree(candidate);
 			_leave(" = %d [prep]", ret);
@@ -269,21 +252,41 @@ flush_conflicting_wb:
 	_debug("flush conflict");
 	if (wb->state == AFS_WBACK_PENDING)
 		wb->state = AFS_WBACK_CONFLICTING;
+	wb->usage++;
 	spin_unlock(&vnode->writeback_lock);
-	if (PageDirty(page)) {
+	if (!PageDirty(page) && !PageWriteback(page)) {
+		/* no change outstanding - just reuse the page */
+		_debug("reuse");
+		afs_put_writeback(wb);
+		set_page_private(page, 0);
+		ClearPagePrivate(page);
+		goto try_again;
+	}
+
+	kfree(candidate);
+
+	/* if we're busy writing back a conflicting write, then unlock the page
+	 * and wait for the writeback to complete - this lets the process doing
+	 * the write-out handle rejection without deadlock */
+	if (PageWriteback(page)) {
+		_debug("wait wb");
+		unlock_page(page);
+	} else {
+		/* there's a conflicting modification we have to write back and
+		 * wait for before letting the next one proceed */
+		_debug("dirty");
 		ret = afs_write_back_from_locked_page(wb, page);
 		if (ret < 0) {
-			afs_put_writeback(candidate);
+			afs_put_writeback(wb);
 			_leave(" = %d", ret);
 			return ret;
 		}
 	}
 
-	/* the page holds a ref on the writeback record */
 	afs_put_writeback(wb);
-	set_page_private(page, 0);
-	ClearPagePrivate(page);
-	goto try_again;
+	wait_on_page_writeback(page);
+	_leave(" = A_T_P");
+	return AOP_TRUNCATED_PAGE;
 }
 
 /*
@@ -310,7 +313,6 @@ int afs_commit_write(struct file *file, struct page *page,
 		spin_unlock(&vnode->writeback_lock);
 	}
 
-	SetPageUptodate(page);
 	set_page_dirty(page);
 	if (PageDirty(page))
 		_debug("dirtied");
@@ -319,38 +321,35 @@ int afs_commit_write(struct file *file, struct page *page,
 }
 
 /*
- * kill all the pages in the given range
+ * note the failure of a write, either due to an error or to a permission
+ * failure
+ * - all the pages in the affected range must have PG_writeback set
+ * - the caller must be responsible for the pages: no-one else should be trying
+ *   to note rejection
  */
-static void afs_kill_pages(struct afs_vnode *vnode, bool error,
-			   pgoff_t first, pgoff_t last)
+static void afs_write_rejected(struct afs_writeback *wb, bool error,
+			       pgoff_t first, pgoff_t last)
 {
-	struct pagevec pv;
-	unsigned count, loop;
+	struct afs_vnode *vnode = wb->vnode;
+	loff_t i_size;
 
 	_enter("{%x:%u},%lx-%lx",
 	       vnode->fid.vid, vnode->fid.vnode, first, last);
 
-	pagevec_init(&pv, 0);
-
-	do {
-		_debug("kill %lx-%lx", first, last);
-
-		count = last - first + 1;
-		if (count > PAGEVEC_SIZE)
-			count = PAGEVEC_SIZE;
-		pv.nr = find_get_pages_contig(vnode->vfs_inode.i_mapping,
-					      first, count, pv.pages);
-		ASSERTCMP(pv.nr, ==, count);
-
-		for (loop = 0; loop < count; loop++) {
-			ClearPageUptodate(pv.pages[loop]);
-			if (error)
-				SetPageError(pv.pages[loop]);
-			end_page_writeback(pv.pages[loop]);
-		}
+	spin_lock(&vnode->writeback_lock);
+	wb->state = AFS_WBACK_REJECTED;
+
+	/* wind back the file size if this write extended the file, and wasn't
+	 * followed by a conflicting write */
+	i_size = ((loff_t) wb->last) << PAGE_SHIFT;
+	i_size += wb->to_last;
+	if (i_size_read(&vnode->vfs_inode) == i_size) {
+		_debug("shorten");
+		i_size_write(&vnode->vfs_inode, vnode->status.size);
+	}
 
-		__pagevec_release(&pv);
-	} while (first < last);
+	spin_unlock(&vnode->writeback_lock);
+	cancel_rejected_write(vnode->vfs_inode.i_mapping, first, last);
 
 	_leave("");
 }
@@ -358,6 +357,7 @@ static void afs_kill_pages(struct afs_vnode *vnode, bool error,
 /*
  * synchronously write back the locked page and any subsequent non-locked dirty
  * pages also covered by the same writeback record
+ * - all pages written will be unlocked prior to returning
  */
 static int afs_write_back_from_locked_page(struct afs_writeback *wb,
 					   struct page *primary_page)
@@ -407,6 +407,7 @@ static int afs_write_back_from_locked_page(struct afs_writeback *wb,
 			if (TestSetPageLocked(page))
 				break;
 			if (!PageDirty(page) ||
+			    PageWriteback(page) ||
 			    page_private(page) != (unsigned long) wb) {
 				unlock_page(page);
 				break;
@@ -430,14 +431,22 @@ static int afs_write_back_from_locked_page(struct afs_writeback *wb,
 
 no_more:
 	/* we now have a contiguous set of dirty pages, each with writeback set
-	 * and the dirty mark cleared; the first page is locked and must remain
-	 * so, all the rest are unlocked */
+	 * and the dirty mark cleared; all the pages barring the first are now
+	 * unlocked */
 	first = primary_page->index;
 	last = first + count - 1;
+	unlock_page(primary_page);
 
 	offset = (first == wb->first) ? wb->offset_first : 0;
 	to = (last == wb->last) ? wb->to_last : PAGE_SIZE;
 
+	if (wb->state == AFS_WBACK_REJECTED) {
+		cancel_rejected_write(wb->vnode->vfs_inode.i_mapping,
+				      first, last);
+		ret = 0;
+		goto out;
+	}
+
 	_debug("write back %lx[%u..] to %lx[..%u]", first, offset, last, to);
 
 	ret = afs_vnode_store_data(wb, first, last, offset, to);
@@ -455,7 +464,7 @@ no_more:
 		case -ENOENT:
 		case -ENOMEDIUM:
 		case -ENXIO:
-			afs_kill_pages(wb->vnode, true, first, last);
+			afs_write_rejected(wb, true, first, last);
 			set_bit(AS_EIO, &wb->vnode->vfs_inode.i_mapping->flags);
 			break;
 		case -EACCES:
@@ -464,7 +473,7 @@ no_more:
 		case -EKEYEXPIRED:
 		case -EKEYREJECTED:
 		case -EKEYREVOKED:
-			afs_kill_pages(wb->vnode, false, first, last);
+			afs_write_rejected(wb, false, first, last);
 			break;
 		default:
 			break;
@@ -473,6 +482,7 @@ no_more:
 		ret = count;
 	}
 
+out:
 	_leave(" = %d", ret);
 	return ret;
 }
@@ -493,7 +503,6 @@ int afs_writepage(struct page *page, struct writeback_control *wbc)
 	ASSERT(wb != NULL);
 
 	ret = afs_write_back_from_locked_page(wb, page);
-	unlock_page(page);
 	if (ret < 0) {
 		_leave(" = %d", ret);
 		return 0;
@@ -565,7 +574,6 @@ int afs_writepages_region(struct address_space *mapping,
 		spin_unlock(&wb->vnode->writeback_lock);
 
 		ret = afs_write_back_from_locked_page(wb, page);
-		unlock_page(page);
 		page_cache_release(page);
 		if (ret < 0) {
 			_leave(" = %d", ret);


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/4] AFS: Implement shared-writable mmap
  2007-05-23 19:15 [PATCH 1/4] AFS: Add TestSetPageError() David Howells
  2007-05-23 19:15 ` [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache David Howells
  2007-05-23 19:15 ` [PATCH 3/4] AFS: Improve handling of a rejected writeback David Howells
@ 2007-05-23 19:15 ` David Howells
  2 siblings, 0 replies; 17+ messages in thread
From: David Howells @ 2007-05-23 19:15 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, dhowells

Implement shared-writable mmap for AFS.

The key with which to access the file is obtained from the VMA at the point
where the PTE is made writable by the page_mkwrite() VMA op and cached in the
affected page.

If there's an outstanding write on the page made with a different key, then
page_mkwrite() will flush it before attaching a record of the new key.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/file.c     |   21 ++++++++++++++++++++-
 fs/afs/internal.h |    1 +
 fs/afs/write.c    |   35 +++++++++++++++++++++++++++++++++++
 3 files changed, 56 insertions(+), 1 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 9c0e721..1547500 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -22,6 +22,7 @@ static int afs_readpage(struct file *file, struct page *page);
 static void afs_invalidatepage(struct page *page, unsigned long offset);
 static int afs_releasepage(struct page *page, gfp_t gfp_flags);
 static int afs_launder_page(struct page *page);
+static int afs_mmap(struct file *file, struct vm_area_struct *vma);
 
 const struct file_operations afs_file_operations = {
 	.open		= afs_open,
@@ -31,7 +32,7 @@ const struct file_operations afs_file_operations = {
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
 	.aio_write	= afs_file_write,
-	.mmap		= generic_file_readonly_mmap,
+	.mmap		= afs_mmap,
 	.sendfile	= generic_file_sendfile,
 	.fsync		= afs_fsync,
 };
@@ -54,6 +55,12 @@ const struct address_space_operations afs_fs_aops = {
 	.writepages	= afs_writepages,
 };
 
+static struct vm_operations_struct afs_file_vm_ops = {
+	.nopage		= filemap_nopage,
+	.populate	= filemap_populate,
+	.page_mkwrite	= afs_page_mkwrite,
+};
+
 /*
  * open an AFS file or directory and attach a key to it
  */
@@ -293,3 +300,15 @@ static int afs_releasepage(struct page *page, gfp_t gfp_flags)
 	_leave(" = 0");
 	return 0;
 }
+
+/*
+ * memory map part of an AFS file
+ */
+static int afs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	_enter("");
+
+	file_accessed(file);
+	vma->vm_ops = &afs_file_vm_ops;
+	return 0;
+}
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index ddb18d9..bc39eba 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -711,6 +711,7 @@ extern ssize_t afs_file_write(struct kiocb *, const struct iovec *,
 			      unsigned long, loff_t);
 extern int afs_writeback_all(struct afs_vnode *);
 extern int afs_fsync(struct file *, struct dentry *, int);
+extern int afs_page_mkwrite(struct vm_area_struct *, struct page *);
 
 
 /*****************************************************************************/
diff --git a/fs/afs/write.c b/fs/afs/write.c
index ac621e8..dd471f0 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -155,6 +155,8 @@ static int afs_prepare_page(struct afs_vnode *vnode, struct page *page,
  * prepare to perform part of a write to a page
  * - the caller holds the page locked, preventing it from being written out or
  *   modified by anyone else
+ * - may be called from afs_page_mkwrite() to set up a page for modification
+ *   through shared-writable mmap
  */
 int afs_prepare_write(struct file *file, struct page *page,
 		      unsigned offset, unsigned to)
@@ -833,3 +835,36 @@ int afs_fsync(struct file *file, struct dentry *dentry, int datasync)
 	_leave(" = %d", ret);
 	return ret;
 }
+
+/*
+ * notification that a previously read-only page is about to become writable
+ * - if it returns an error, the caller will deliver a bus error signal
+ *
+ * we use this to make a record of the key with which the writeback should be
+ * performed and to flush any outstanding writes made with a different key
+ *
+ * the key to be used is attached to the struct file pinned by the VMA
+ */
+int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+	struct afs_vnode *vnode = AFS_FS_I(vma->vm_file->f_mapping->host);
+	struct key *key = vma->vm_file->private_data;
+	int ret;
+
+	_enter("{{%x:%u},%x},{%lx}",
+	       vnode->fid.vid, vnode->fid.vnode, key_serial(key), page->index);
+
+	do {
+		lock_page(page);
+		if (page->mapping == vma->vm_file->f_mapping)
+			ret = afs_prepare_write(vma->vm_file, page, 0,
+						PAGE_SIZE);
+		else
+			ret = 0; /* seems there was interference - let the
+				  * caller deal with it */
+		unlock_page(page);
+	} while (ret == AOP_TRUNCATED_PAGE);
+
+	_leave(" = %d", ret);
+	return ret;
+}


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache
  2007-05-23 19:15 ` [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache David Howells
@ 2007-05-24 20:38   ` Andrew Morton
  2007-05-24 21:35     ` David Howells
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2007-05-24 20:38 UTC (permalink / raw)
  To: David Howells; +Cc: linux-kernel

On Wed, 23 May 2007 20:15:24 +0100
David Howells <dhowells@redhat.com> wrote:

> Add a function - cancel_rejected_write() - to excise a rejected write from the
> pagecache.  This function is related to the truncation family of routines.  It
> permits the pages modified by a network filesystem client (such as AFS) to be
> excised and discarded from the pagecache if the attempt to write them back to
> the server fails.

Could you please flesh this out a bit, from a higher level?

As in: what does it mean for a server to "reject" a write?  What's actually
going on here?  Why does the server do this?  I assume that the application
will see a short write or an EIO or something?

How does this interact with MAP_SHARED?

Do we expect that any other networked filesystem would want to do this? 
(and if not, why not?) Or do they already attempt to do it in some other
fashion?

> The dirty and writeback states of the afflicted pages are cancelled and the
> pages themselves are detached for recycling.  All PTEs referring to those
> pages are removed.
> 
> Note that the locking is tricky as it's very easy to deadlock against
> truncate() and other routines once the pages have been unlocked as part of the
> writeback process.  To this end, the PG_error flag is set, then the
> PG_writeback flag is cleared, and only *then* can lock_page() be called.

hm.

> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  include/linux/mm.h |    5 ++-
>  mm/truncate.c      |   83 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 86 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e4183c6..73688d4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1086,12 +1086,13 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);
>  
>  extern unsigned long do_brk(unsigned long, unsigned long);
>  
> -/* filemap.c */
> -extern unsigned long page_unuse(struct page *);
> +/* truncate.c */
>  extern void truncate_inode_pages(struct address_space *, loff_t);
>  extern void truncate_inode_pages_range(struct address_space *,
>  				       loff_t lstart, loff_t lend);
> +extern void cancel_rejected_write(struct address_space *, pgoff_t, pgoff_t);
>  
> +/* filemap.c */
>  /* generic vm_area_ops exported for stackable file systems */
>  extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int *);
>  extern int filemap_populate(struct vm_area_struct *, unsigned long,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 4fbe1a2..f742096 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -443,3 +443,86 @@ int invalidate_inode_pages2(struct address_space *mapping)
>  	return invalidate_inode_pages2_range(mapping, 0, -1);
>  }
>  EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
> +
> +/*
> + * Cancel that part of a rejected write that affects a particular page
> + */
> +static void cancel_rejected_page(struct address_space *mapping,
> +				 struct page *page, pgoff_t *_next)
> +{
> +	if (!TestSetPageError(page)) {
> +		/* can't lock the page until we've cleared PG_writeback lest we
> +		 * deadlock with truncate (amongst other things) */
> +		end_page_writeback(page);
> +		if (page->mapping == mapping) {
> +			lock_page(page);
> +			if (page->mapping == mapping) {
> +				truncate_complete_page(mapping, page);
> +				*_next = page->index + 1;
> +			}
> +			unlock_page(page);
> +		}
> +	} else if (PageWriteback(page) || PageDirty(page)) {
> +		BUG();
> +	}
> +}

Why can't we just run the end_page_writeback() unconditionally? 
PG_writeback _must_ be set here, and it is the caller who set PG_writeback,
so this thread of control "owns" that flag.

Also, are you really really sure that there is no way in which PG_writeback
can get itself set again after the end_page_writeback()?  I'd have thought
that we should be doing a wait_on_page_writeback() after the lock_page()
there.  Remember, other filesystems might want to be calling this, so we
shouldn't be designing around AFS implementation specifics.

> +/**
> + * cancel_rejected_write - Cancel a write on a contiguous set of pages
> + * @mapping: mapping affected
> + * @start: first page in set
> + * @end: last page in set
> + *
> + * Cancel a write of a contiguous set of pages when the writeback was rejected
> + * by the target medium or server.
> + *
> + * The pages in question are detached and discarded from the pagecache, and the
> + * writeback and dirty states are cleared prior to invalidation.  The caller
> + * must make sure that all the pages in the range are present in the pagecache,
> + * and the caller must hold PG_writeback on each of them.  NOTE! All the pages
> + * are locked and unlocked as part of this process, so the caller must take
> + * care to avoid deadlock.
> + *
> + * The PTEs pointing to those pages are also cleared, leading to the PTEs being
> + * reset when new pages are allocated and the contents reloaded.
> + */
> +void cancel_rejected_write(struct address_space *mapping,
> +			   pgoff_t start, pgoff_t end)
> +{
> +	struct pagevec pvec;
> +	pgoff_t n;
> +	int i;
> +
> +	BUG_ON(mapping->nrpages < end - start + 1);
> +
> +	/* dispose of any PTEs pointing to the affected pages */
> +	unmap_mapping_range(mapping,
> +			    (loff_t)start << PAGE_CACHE_SHIFT,
> +			    (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT,
> +			    0);
> +
> +	pagevec_init(&pvec, 0);
> +	do {
> +		cond_resched();
> +		n = end - start + 1;
> +		if (n > PAGEVEC_SIZE)
> +			n = PAGEVEC_SIZE;
> +		n = pagevec_lookup(&pvec, mapping, start, n);
> +		for (i = 0; i < n; i++) {
> +			struct page *page = pvec.pages[i];
> +
> +			if (page->index < start || page->index > end)
> +				continue;
> +			start++;
> +			cancel_rejected_page(mapping, page, &start);
> +		}
> +		pagevec_release(&pvec);
> +	} while (start - 1 < end);
> +
> +	/* dispose of any new PTEs pointing to the affected pages */
> +	unmap_mapping_range(mapping,
> +			    (loff_t)start << PAGE_CACHE_SHIFT,
> +			    (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT,
> +			    0);
> +}
> +EXPORT_SYMBOL_GPL(cancel_rejected_write);

hm, is the pte-unmapping here completely solid?  Are there any race windows
in which a faulter can end up owning, say, an anonymous page?  We've had
heaps of problems with that sort of thing in generic_file_direct_IO() and I
don't expect it's this easy.






^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache
  2007-05-24 20:38   ` Andrew Morton
@ 2007-05-24 21:35     ` David Howells
  2007-05-24 21:47       ` Andrew Morton
  2007-05-24 22:35       ` Trond Myklebust
  0 siblings, 2 replies; 17+ messages in thread
From: David Howells @ 2007-05-24 21:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Andrew Morton <akpm@linux-foundation.org> wrote:

> Could you please flesh this out a bit, from a higher level?

See the description for patch 3/4.

> As in: what does it mean for a server to "reject" a write? What's actually
> going on here?

Simply, it means that the server refused to perform the write.  The example I'm
thinking of is that it refused to do so because the ACL governing that file
changed between the permission() check and the attempt to write the data back.

All it takes is a write-back cache and a tiny little delay between the copy to
page cache and the write back to the server and there's a window in which
another client can leap in and make it impossible for the client to write the
data.

There are other possibilities, for instance:

 (1) The key governing the writeback expires.

 (2) The user's account is deleted or disabled.

The file being deleted is handled elsewhere.  In such a case, everyone's
writes are just discarded.

> Why does the server do this?

Because it's told to by some other client.  Think chmod, or in AFS's case:

	fs setacl -dir . -acl system:administrators rlidka

This removes write access by not including a 'w' amongst 'rlidka'.

(Note in AFS, the ACLs attached to a directory control the files in that
directory).

> I assume that the application will see a short write or an EIO or something?

EACCES or similar most likely.  In AFS's case you should get an abort with some
appropriate error code.  This is then mapped to EACCES, EKEYREVOKED,
EKEYEXPIRED, etc.

> How does this interact with MAP_SHARED?

MAP_SHARED/PROT_WRITE is just another way of doing writes to the pagecache.  It
isn't really special.  The way I've done it is to use page_mkwrite() to track
the key with which a write through a mapping is written back.

> Do we expect that any other networked filesystem would want to do this? 
> (and if not, why not?) Or do they already attempt to do it in some other
> fashion?

_Any_ network filesystem that (a) has access controls, and (b) does writeback
caching (be the interval ever so brief) is liable to this issue.  It could
possible for a netfs to avoid with this by getting a guarantee that the write
will be accepted upfront (CIFS might do this), or by blocking attempts to
change the access controls through some sort of locking (again CIFS might do
this).

However, looking at NFS, it appears that NFS may be liable to this problem.  It
does appear to use writeback caching, though it flushes its writes back fairly
promptly making it quite difficult to test.  I talked to Steve Dickson about
this, and I get the impression that NFS will just sit their and keep retrying
the writeback.

So, yes, other netfs's may well want to do this, and perhaps *should* be doing
this.

> > Note that the locking is tricky as it's very easy to deadlock against
> > truncate() and other routines once the pages have been unlocked as part of
> > the writeback process.  To this end, the PG_error flag is set, then the
> > PG_writeback flag is cleared, and only *then* can lock_page() be called.
> 
> hm.

Yeah.  Hm.  I spent a lot of time trying to work around deadlocks and finding
new ones.

> Why can't we just run the end_page_writeback() unconditionally? 
> PG_writeback _must_ be set here, and it is the caller who set PG_writeback,
> so this thread of control "owns" that flag.

You may be right.  I'm trying to avoid a race with truncate and attempts to
rewrite the page both, but as I set PG_error, that might not be a problem.

> Also, are you really really sure that there is no way in which PG_writeback
> can get itself set again after the end_page_writeback()?

PG_error ought to take care of that.  To set PG_writeback again, we have to
wait for any outstanding PG_writeback to go away first - at which point
PG_error should come to light.

That's also the reason for the second part of the if-complex - the one that
BUGs if PG_error is set *and* the page is dirty or undergoing writeback.  I
want to make sure I catch such a situation.

> I'd have thought that we should be doing a wait_on_page_writeback() after the
> lock_page() there.

That would require us to begin writing back a page marked PG_error, which
probably ought to be considered "a bad thing".

> Remember, other filesystems might want to be calling this, so we shouldn't be
> designing around AFS implementation specifics.

I know.  Part of the reason that I put it here is so that we can have a common
policy.  However, without trying to get it called from those other filesystems,
it's hard to see where I've made AFS-specific assumptions.  I don't think that
there are actually any, but...

> hm, is the pte-unmapping here completely solid?

I'm not sure.  Certainly by doing it after the grunging of the affected pages,
we should evict all the PTEs and they shouldn't come back after that point.

I'm tempted to rearrange the algorithm to be this:

 (1) Mark all affected pages PG_error.

 (2) Ditch all PTEs mapping to those pages.

 (3) Truncate all affected pages.

Which has the downside of traversing the set of affected pages twice, but the
upside of only whacking the PTEs once.  The calling netfs would then have to
make sure that nopage() didn't resurrect a PG_error page, but that could
possibly be built into filemap_nopage() or do_no_page() as a convenience.

> Are there any race windows in which a faulter can end up owning, say, an
> anonymous page?  We've had heaps of problems with that sort of thing in
> generic_file_direct_IO() and I don't expect it's this easy.

At the moment, I do two passes of PTE erasure to make sure that all these pages
are properly expunged.  However, if I use the above set of steps, and provided
PG_error is properly honoured, I don't think this should be a problem.

David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache
  2007-05-24 21:35     ` David Howells
@ 2007-05-24 21:47       ` Andrew Morton
  2007-05-24 22:34         ` David Howells
  2007-05-24 22:35       ` Trond Myklebust
  1 sibling, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2007-05-24 21:47 UTC (permalink / raw)
  To: David Howells; +Cc: linux-kernel

On Thu, 24 May 2007 22:35:22 +0100
David Howells <dhowells@redhat.com> wrote:

> 
> > Why can't we just run the end_page_writeback() unconditionally? 
> > PG_writeback _must_ be set here, and it is the caller who set PG_writeback,
> > so this thread of control "owns" that flag.
> 
> You may be right.  I'm trying to avoid a race with truncate and attempts to
> rewrite the page both, but as I set PG_error, that might not be a problem.

Thing is, this new interpretation and usage of PG_error is worrisome.  For
example, I don't think we've previously made any effort to avoid starting
writeback against PageError() pages.  We may be avoiding it in certain
places, but it wasn't a design rule of any sort.  And any such code hasn't
had any useful testing: PageError() is a damn rare thing.

So my reason for asking the above is to try to find a way to make all these
new PG-error games just go away.

> > Also, are you really really sure that there is no way in which PG_writeback
> > can get itself set again after the end_page_writeback()?
> 
> PG_error ought to take care of that.  To set PG_writeback again, we have to
> wait for any outstanding PG_writeback to go away first - at which point
> PG_error should come to light.
> 
> That's also the reason for the second part of the if-complex - the one that
> BUGs if PG_error is set *and* the page is dirty or undergoing writeback.  I
> want to make sure I catch such a situation.
> 
> > I'd have thought that we should be doing a wait_on_page_writeback() after the
> > lock_page() there.
> 
> That would require us to begin writing back a page marked PG_error, which
> probably ought to be considered "a bad thing".

As I say, no effort has been made to enforce anything like that.

> > Remember, other filesystems might want to be calling this, so we shouldn't be
> > designing around AFS implementation specifics.
> 
> I know.  Part of the reason that I put it here is so that we can have a common
> policy.  However, without trying to get it called from those other filesystems,
> it's hard to see where I've made AFS-specific assumptions.  I don't think that
> there are actually any, but...
> 
> > hm, is the pte-unmapping here completely solid?
> 
> I'm not sure.  Certainly by doing it after the grunging of the affected pages,
> we should evict all the PTEs and they shouldn't come back after that point.
> 
> I'm tempted to rearrange the algorithm to be this:
> 
>  (1) Mark all affected pages PG_error.
> 
>  (2) Ditch all PTEs mapping to those pages.
> 
>  (3) Truncate all affected pages.
> 
> Which has the downside of traversing the set of affected pages twice, but the
> upside of only whacking the PTEs once.  The calling netfs would then have to
> make sure that nopage() didn't resurrect a PG_error page, but that could
> possibly be built into filemap_nopage() or do_no_page() as a convenience.
> 
> > Are there any race windows in which a faulter can end up owning, say, an
> > anonymous page?  We've had heaps of problems with that sort of thing in
> > generic_file_direct_IO() and I don't expect it's this easy.
> 
> At the moment, I do two passes of PTE erasure to make sure that all these pages
> are properly expunged.  However, if I use the above set of steps, and provided
> PG_error is properly honoured, I don't think this should be a problem.
> 

Would be better, I think, to be able to remove all this new PG_error stuff.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache
  2007-05-24 21:47       ` Andrew Morton
@ 2007-05-24 22:34         ` David Howells
  2007-05-24 22:46           ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: David Howells @ 2007-05-24 22:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Andrew Morton <akpm@linux-foundation.org> wrote:

> So my reason for asking the above is to try to find a way to make all these
> new PG-error games just go away.

Yeah.  However, there needs to be something to cover the gap between releasing
PG_writeback and getting PG_lock.  They have to be done in that order to avoid
deadlocking against truncate and other stuff, but that leaves a window in which
the page appears to be in a good state - one in which prepare_write() or
page_mkwrite() can potentially leak through.

Nick Piggin talked about using an extra lock, but as far as I can tell, that
just compounds the deadlock problems.

I suppose I could leave something in page->private that indicated that the
page was defunct, but that'd have to be done by the filesystem, probably
before calling cancel_rejected_write().

David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache
  2007-05-24 21:35     ` David Howells
  2007-05-24 21:47       ` Andrew Morton
@ 2007-05-24 22:35       ` Trond Myklebust
  2007-05-24 23:18         ` David Howells
  1 sibling, 1 reply; 17+ messages in thread
From: Trond Myklebust @ 2007-05-24 22:35 UTC (permalink / raw)
  To: David Howells; +Cc: Andrew Morton, linux-kernel

On Thu, 2007-05-24 at 22:35 +0100, David Howells wrote:
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > Could you please flesh this out a bit, from a higher level?
> 
> See the description for patch 3/4.
> 
> > As in: what does it mean for a server to "reject" a write? What's actually
> > going on here?
> 
> Simply, it means that the server refused to perform the write.  The example I'm
> thinking of is that it refused to do so because the ACL governing that file
> changed between the permission() check and the attempt to write the data back.
> 
> All it takes is a write-back cache and a tiny little delay between the copy to
> page cache and the write back to the server and there's a window in which
> another client can leap in and make it impossible for the client to write the
> data.
> 
> There are other possibilities, for instance:
> 
>  (1) The key governing the writeback expires.
> 
>  (2) The user's account is deleted or disabled.
> 
> The file being deleted is handled elsewhere.  In such a case, everyone's
> writes are just discarded.
> 
> > Why does the server do this?
> 
> Because it's told to by some other client.  Think chmod, or in AFS's case:
> 
> 	fs setacl -dir . -acl system:administrators rlidka
> 
> This removes write access by not including a 'w' amongst 'rlidka'.
> 
> (Note in AFS, the ACLs attached to a directory control the files in that
> directory).
> 
> > I assume that the application will see a short write or an EIO or something?
> 
> EACCES or similar most likely.  In AFS's case you should get an abort with some
> appropriate error code.  This is then mapped to EACCES, EKEYREVOKED,
> EKEYEXPIRED, etc.
> 
> > How does this interact with MAP_SHARED?
> 
> MAP_SHARED/PROT_WRITE is just another way of doing writes to the pagecache.  It
> isn't really special.  The way I've done it is to use page_mkwrite() to track
> the key with which a write through a mapping is written back.
> 
> > Do we expect that any other networked filesystem would want to do this? 
> > (and if not, why not?) Or do they already attempt to do it in some other
> > fashion?
> 
> _Any_ network filesystem that (a) has access controls, and (b) does writeback
> caching (be the interval ever so brief) is liable to this issue.  It could
> possible for a netfs to avoid with this by getting a guarantee that the write
> will be accepted upfront (CIFS might do this), or by blocking attempts to
> change the access controls through some sort of locking (again CIFS might do
> this).
> 
> However, looking at NFS, it appears that NFS may be liable to this problem.  It
> does appear to use writeback caching, though it flushes its writes back fairly
> promptly making it quite difficult to test.  I talked to Steve Dickson about
> this, and I get the impression that NFS will just sit their and keep retrying
> the writeback.

No. If the write fails, then NFS will mark the mapping as invalid and
attempt to call invalidate_inode_pages2() at the earliest possible
moment.

I'm adding in a patch to defer marking the page as uptodate until the
write is successful in cases where NFS is writing a pristine page.

As for pages that are already marked as uptodate, well you already have
a race: you have deferred the page write, and so other processes may
already have read the rejected data before you tried to write it out.

Trond


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache
  2007-05-24 22:34         ` David Howells
@ 2007-05-24 22:46           ` Andrew Morton
  2007-05-24 23:08             ` David Howells
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2007-05-24 22:46 UTC (permalink / raw)
  To: David Howells; +Cc: linux-kernel

On Thu, 24 May 2007 23:34:33 +0100
David Howells <dhowells@redhat.com> wrote:

> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > So my reason for asking the above is to try to find a way to make all these
> > new PG-error games just go away.
> 
> Yeah.  However, there needs to be something to cover the gap between releasing
> PG_writeback and getting PG_lock.  They have to be done in that order to avoid
> deadlocking against truncate and other stuff, but that leaves a window in which
> the page appears to be in a good state - one in which prepare_write() or
> page_mkwrite() can potentially leak through.

hm.  I don't see why that race window would be a problem in practice: the
page-exciser does a lock_page();wait_on_page_writeback() as normal, then
proceeds with its business?

But given that this doesn't work right for some reason, can we use PG_error
and then handle that appropriately in the filesystem's ->prepare_write() and
->page_mkwrite()?

> Nick Piggin talked about using an extra lock, but as far as I can tell, that
> just compounds the deadlock problems.
> 
> I suppose I could leave something in page->private that indicated that the
> page was defunct, but that'd have to be done by the filesystem, probably
> before calling cancel_rejected_write().

Well, using PG_error is OK and appropriate for that if it's localised to
the fs.  But I'd be a bit worried about requiring that the VFS maintain
some special protocol for it, partly because it would be such a
rarely-tested thing.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache
  2007-05-24 22:46           ` Andrew Morton
@ 2007-05-24 23:08             ` David Howells
  2007-05-24 23:24               ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: David Howells @ 2007-05-24 23:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Andrew Morton <akpm@linux-foundation.org> wrote:

> hm.  I don't see why that race window would be a problem in practice: the
> page-exciser does a lock_page();wait_on_page_writeback() as normal, then
> proceeds with its business?

No.  The page-exciser ends (cancels) PG_writeback, not waits for it (something
has to clear the flag).  The problem is that the truncation routines may be
sat there holding a lock on the page whilst waiting for PG_writeback to go
away - so we have to clear PG_writeback before we can think about getting
PG_lock:-(

> But given that this doesn't work right for some reason, can we use PG_error
> and then handle that appropriately in the filesystem's ->prepare_write() and
> ->page_mkwrite()?

Possibly, though I'd rather they didn't see such a page.

David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache
  2007-05-24 22:35       ` Trond Myklebust
@ 2007-05-24 23:18         ` David Howells
  2007-05-24 23:54           ` Trond Myklebust
  0 siblings, 1 reply; 17+ messages in thread
From: David Howells @ 2007-05-24 23:18 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Andrew Morton, linux-kernel

Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> No. If the write fails, then NFS will mark the mapping as invalid and
> attempt to call invalidate_inode_pages2() at the earliest possible
> moment.

Will it erase *all* unwritten writes?  Or is that what launder_page() is for?

How do you deal with pages that were in the process of being written out when
that particular write was rejected?  Do you just summarily clear PG_writeback
and hope no-one else looks at the page until invalidate_inode_pages2() gets
around to excising it?  Or do you have a better way?

> I'm adding in a patch to defer marking the page as uptodate until the
> write is successful in cases where NFS is writing a pristine page.

That sounds reasonable, though it doesn't help in the case I'm looking at.  Do
you also munge i_size if the write fails?

> As for pages that are already marked as uptodate, well you already have
> a race: you have deferred the page write, and so other processes may
> already have read the rejected data before you tried to write it out.

Yeah, I know, and that's very difficult to deal with without some formal
transaction rollback mechanism.  I think that the best I can do is to discard
the dodgy data that I've got lurking in the pagecache, but I still have to
deal with writes made by other users to that file after the rejected write.

There isn't a perfect way of dealing with it, given the circumstances.

David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache
  2007-05-24 23:08             ` David Howells
@ 2007-05-24 23:24               ` Andrew Morton
  2007-05-24 23:37                 ` David Howells
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2007-05-24 23:24 UTC (permalink / raw)
  To: David Howells; +Cc: linux-kernel

On Fri, 25 May 2007 00:08:43 +0100
David Howells <dhowells@redhat.com> wrote:

> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > hm.  I don't see why that race window would be a problem in practice: the
> > page-exciser does a lock_page();wait_on_page_writeback() as normal, then
> > proceeds with its business?
> 
> No.  The page-exciser ends (cancels) PG_writeback, not waits for it (something
> has to clear the flag).  The problem is that the truncation routines may be
> sat there holding a lock on the page whilst waiting for PG_writeback to go
> away - so we have to clear PG_writeback before we can think about getting
> PG_lock:-(

But we already covered that?  Your exciser can do an unconditional
end_page_writeback(), because it is this thread of control which did the
set_page_writeback().  So we end up with:

	end_page_writeback(page);
	lock_page(page);
	wait_on_page_writeback(page);
	<now nail it>

> > But given that this doesn't work right for some reason, can we use PG_error
> > and then handle that appropriately in the filesystem's ->prepare_write() and
> > ->page_mkwrite()?
> 
> Possibly, though I'd rather they didn't see such a page.

Well someone needs to be taught all about this case.  Question is, should
it be the VFS, or should it just be the address_space(s) which brought
this state about, and which care about it?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache
  2007-05-24 23:24               ` Andrew Morton
@ 2007-05-24 23:37                 ` David Howells
  0 siblings, 0 replies; 17+ messages in thread
From: David Howells @ 2007-05-24 23:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Andrew Morton <akpm@linux-foundation.org> wrote:

> But we already covered that?  Your exciser can do an unconditional
> end_page_writeback(), because it is this thread of control which did the
> set_page_writeback().  So we end up with:

Ah, I misunderstood what you meant.  I assumed you meant to wait insted of
ending.

Of course, if we've decided to excise this page, it really oughtn't to get
PG_writeback set again.  I'll have to think about that.

> Well someone needs to be taught all about this case.  Question is, should
> it be the VFS, or should it just be the address_space(s) which brought
> this state about, and which care about it?

For the most part, I think that this only applies to netfs's, and possibly not
all of those, so making it fully general might be overkill.

David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache
  2007-05-24 23:18         ` David Howells
@ 2007-05-24 23:54           ` Trond Myklebust
  2007-05-30 10:35             ` David Howells
  0 siblings, 1 reply; 17+ messages in thread
From: Trond Myklebust @ 2007-05-24 23:54 UTC (permalink / raw)
  To: David Howells; +Cc: Andrew Morton, linux-kernel

On Fri, 2007-05-25 at 00:18 +0100, David Howells wrote:
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> 
> > No. If the write fails, then NFS will mark the mapping as invalid and
> > attempt to call invalidate_inode_pages2() at the earliest possible
> > moment.
> 
> Will it erase *all* unwritten writes?  Or is that what launder_page() is for?

Yes. launder_page() will flush out any writes that may have raced with
the call to invalidate_inode_pages2() (you're supposed to attempt to
flush them out before you call the latter).

> How do you deal with pages that were in the process of being written out when
> that particular write was rejected?  Do you just summarily clear PG_writeback
> and hope no-one else looks at the page until invalidate_inode_pages2() gets
> around to excising it?  Or do you have a better way?

All I do is to protect new calls to read() and write() with a call to
check if the page cache needs invalidating. As I said earlier, you
cannot avoid the race. At best you can reduce the window of opportunity,
and so I'd argue that if you can't do that cheaply, then it isn't worth
it.

> > I'm adding in a patch to defer marking the page as uptodate until the
> > write is successful in cases where NFS is writing a pristine page.
> 
> That sounds reasonable, though it doesn't help in the case I'm looking at.  Do
> you also munge i_size if the write fails?

I just mark the inode as needing revalidation so that we update the size
on the next read()/write()/getattr(). That won't stop any existing
append writes from punching ugly holes into the file, but trying to
recover from that sort of thing would be _really_ painful!

> > As for pages that are already marked as uptodate, well you already have
> > a race: you have deferred the page write, and so other processes may
> > already have read the rejected data before you tried to write it out.
> 
> Yeah, I know, and that's very difficult to deal with without some formal
> transaction rollback mechanism.  I think that the best I can do is to discard
> the dodgy data that I've got lurking in the pagecache, but I still have to
> deal with writes made by other users to that file after the rejected write.
> 
> There isn't a perfect way of dealing with it, given the circumstances.

Agreed.

Trond


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache
  2007-05-24 23:54           ` Trond Myklebust
@ 2007-05-30 10:35             ` David Howells
  2007-05-30 17:39               ` Trond Myklebust
  0 siblings, 1 reply; 17+ messages in thread
From: David Howells @ 2007-05-30 10:35 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Andrew Morton, linux-kernel

Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> All I do is to protect new calls to read() and write() with a call to
> check if the page cache needs invalidating.

What about mmap()?  What if someone gets a mapping on a section of file that
subsequently has a write rejected on it?  If you invalidate only on
read()/write(), what do you do about such a mapping?

> That won't stop any existing append writes from punching ugly holes into the
> file, but trying to recover from that sort of thing would be _really_
> painful!

Definitely.

David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache
  2007-05-30 10:35             ` David Howells
@ 2007-05-30 17:39               ` Trond Myklebust
  0 siblings, 0 replies; 17+ messages in thread
From: Trond Myklebust @ 2007-05-30 17:39 UTC (permalink / raw)
  To: David Howells; +Cc: Andrew Morton, linux-kernel

On Wed, 2007-05-30 at 11:35 +0100, David Howells wrote:
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> 
> > All I do is to protect new calls to read() and write() with a call to
> > check if the page cache needs invalidating.
> 
> What about mmap()?  What if someone gets a mapping on a section of file that
> subsequently has a write rejected on it?  If you invalidate only on
> read()/write(), what do you do about such a mapping?

Same philosophy: The people who already have the file mapped will
already be working with corrupted data, so there is little point in
bending over backwards to "fix" the problem. Instead we flush out the
page cache using invalidate_inode_pages2() whenever someone tries to
mmap() a new section of the file.

Trond


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2007-05-30 17:39 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-23 19:15 [PATCH 1/4] AFS: Add TestSetPageError() David Howells
2007-05-23 19:15 ` [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache David Howells
2007-05-24 20:38   ` Andrew Morton
2007-05-24 21:35     ` David Howells
2007-05-24 21:47       ` Andrew Morton
2007-05-24 22:34         ` David Howells
2007-05-24 22:46           ` Andrew Morton
2007-05-24 23:08             ` David Howells
2007-05-24 23:24               ` Andrew Morton
2007-05-24 23:37                 ` David Howells
2007-05-24 22:35       ` Trond Myklebust
2007-05-24 23:18         ` David Howells
2007-05-24 23:54           ` Trond Myklebust
2007-05-30 10:35             ` David Howells
2007-05-30 17:39               ` Trond Myklebust
2007-05-23 19:15 ` [PATCH 3/4] AFS: Improve handling of a rejected writeback David Howells
2007-05-23 19:15 ` [PATCH 4/4] AFS: Implement shared-writable mmap David Howells

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox