* [PATCH 1/4] AFS: Add TestSetPageError()
@ 2007-05-23 19:15 David Howells
2007-05-23 19:15 ` [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache David Howells
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: David Howells @ 2007-05-23 19:15 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, dhowells
Add a TestSetPageError() macro to the suite of page flag manipulators. This
can be used by AFS to prevent over-excision of rejected writes from the page
cache.
Signed-off-by: David Howells <dhowells@redhat.com>
---
include/linux/page-flags.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index ae2d79f..cc8d952 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -121,6 +121,7 @@
#define PageError(page) test_bit(PG_error, &(page)->flags)
#define SetPageError(page) set_bit(PG_error, &(page)->flags)
#define ClearPageError(page) clear_bit(PG_error, &(page)->flags)
+#define TestSetPageError(page) test_and_set_bit(PG_error, &(page)->flags)
#define PageReferenced(page) test_bit(PG_referenced, &(page)->flags)
#define SetPageReferenced(page) set_bit(PG_referenced, &(page)->flags)
^ permalink raw reply related [flat|nested] 17+ messages in thread* [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache 2007-05-23 19:15 [PATCH 1/4] AFS: Add TestSetPageError() David Howells @ 2007-05-23 19:15 ` David Howells 2007-05-24 20:38 ` Andrew Morton 2007-05-23 19:15 ` [PATCH 3/4] AFS: Improve handling of a rejected writeback David Howells 2007-05-23 19:15 ` [PATCH 4/4] AFS: Implement shared-writable mmap David Howells 2 siblings, 1 reply; 17+ messages in thread From: David Howells @ 2007-05-23 19:15 UTC (permalink / raw) To: akpm; +Cc: linux-kernel, dhowells Add a function - cancel_rejected_write() - to excise a rejected write from the pagecache. This function is related to the truncation family of routines. It permits the pages modified by a network filesystem client (such as AFS) to be excised and discarded from the pagecache if the attempt to write them back to the server fails. The dirty and writeback states of the afflicted pages are cancelled and the pages themselves are detached for recycling. All PTEs referring to those pages are removed. Note that the locking is tricky as it's very easy to deadlock against truncate() and other routines once the pages have been unlocked as part of the writeback process. To this end, the PG_error flag is set, then the PG_writeback flag is cleared, and only *then* can lock_page() be called. Signed-off-by: David Howells <dhowells@redhat.com> --- include/linux/mm.h | 5 ++- mm/truncate.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 86 insertions(+), 2 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e4183c6..73688d4 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1086,12 +1086,13 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t); extern unsigned long do_brk(unsigned long, unsigned long); -/* filemap.c */ -extern unsigned long page_unuse(struct page *); +/* truncate.c */ extern void truncate_inode_pages(struct address_space *, loff_t); extern void truncate_inode_pages_range(struct address_space *, loff_t lstart, loff_t lend); +extern void cancel_rejected_write(struct address_space *, pgoff_t, pgoff_t); +/* filemap.c */ /* generic vm_area_ops exported for stackable file systems */ extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int *); extern int filemap_populate(struct vm_area_struct *, unsigned long, diff --git a/mm/truncate.c b/mm/truncate.c index 4fbe1a2..f742096 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -443,3 +443,86 @@ int invalidate_inode_pages2(struct address_space *mapping) return invalidate_inode_pages2_range(mapping, 0, -1); } EXPORT_SYMBOL_GPL(invalidate_inode_pages2); + +/* + * Cancel that part of a rejected write that affects a particular page + */ +static void cancel_rejected_page(struct address_space *mapping, + struct page *page, pgoff_t *_next) +{ + if (!TestSetPageError(page)) { + /* can't lock the page until we've cleared PG_writeback lest we + * deadlock with truncate (amongst other things) */ + end_page_writeback(page); + if (page->mapping == mapping) { + lock_page(page); + if (page->mapping == mapping) { + truncate_complete_page(mapping, page); + *_next = page->index + 1; + } + unlock_page(page); + } + } else if (PageWriteback(page) || PageDirty(page)) { + BUG(); + } +} + +/** + * cancel_rejected_write - Cancel a write on a contiguous set of pages + * @mapping: mapping affected + * @start: first page in set + * @end: last page in set + * + * Cancel a write of a contiguous set of pages when the writeback was rejected + * by the target medium or server. + * + * The pages in question are detached and discarded from the pagecache, and the + * writeback and dirty states are cleared prior to invalidation. The caller + * must make sure that all the pages in the range are present in the pagecache, + * and the caller must hold PG_writeback on each of them. NOTE! All the pages + * are locked and unlocked as part of this process, so the caller must take + * care to avoid deadlock. + * + * The PTEs pointing to those pages are also cleared, leading to the PTEs being + * reset when new pages are allocated and the contents reloaded. + */ +void cancel_rejected_write(struct address_space *mapping, + pgoff_t start, pgoff_t end) +{ + struct pagevec pvec; + pgoff_t n; + int i; + + BUG_ON(mapping->nrpages < end - start + 1); + + /* dispose of any PTEs pointing to the affected pages */ + unmap_mapping_range(mapping, + (loff_t)start << PAGE_CACHE_SHIFT, + (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT, + 0); + + pagevec_init(&pvec, 0); + do { + cond_resched(); + n = end - start + 1; + if (n > PAGEVEC_SIZE) + n = PAGEVEC_SIZE; + n = pagevec_lookup(&pvec, mapping, start, n); + for (i = 0; i < n; i++) { + struct page *page = pvec.pages[i]; + + if (page->index < start || page->index > end) + continue; + start++; + cancel_rejected_page(mapping, page, &start); + } + pagevec_release(&pvec); + } while (start - 1 < end); + + /* dispose of any new PTEs pointing to the affected pages */ + unmap_mapping_range(mapping, + (loff_t)start << PAGE_CACHE_SHIFT, + (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT, + 0); +} +EXPORT_SYMBOL_GPL(cancel_rejected_write); ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache 2007-05-23 19:15 ` [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache David Howells @ 2007-05-24 20:38 ` Andrew Morton 2007-05-24 21:35 ` David Howells 0 siblings, 1 reply; 17+ messages in thread From: Andrew Morton @ 2007-05-24 20:38 UTC (permalink / raw) To: David Howells; +Cc: linux-kernel On Wed, 23 May 2007 20:15:24 +0100 David Howells <dhowells@redhat.com> wrote: > Add a function - cancel_rejected_write() - to excise a rejected write from the > pagecache. This function is related to the truncation family of routines. It > permits the pages modified by a network filesystem client (such as AFS) to be > excised and discarded from the pagecache if the attempt to write them back to > the server fails. Could you please flesh this out a bit, from a higher level? As in: what does it mean for a server to "reject" a write? What's actually going on here? Why does the server do this? I assume that the application will see a short write or an EIO or something? How does this interact with MAP_SHARED? Do we expect that any other networked filesystem would want to do this? (and if not, why not?) Or do they already attempt to do it in some other fashion? > The dirty and writeback states of the afflicted pages are cancelled and the > pages themselves are detached for recycling. All PTEs referring to those > pages are removed. > > Note that the locking is tricky as it's very easy to deadlock against > truncate() and other routines once the pages have been unlocked as part of the > writeback process. To this end, the PG_error flag is set, then the > PG_writeback flag is cleared, and only *then* can lock_page() be called. hm. > Signed-off-by: David Howells <dhowells@redhat.com> > --- > > include/linux/mm.h | 5 ++- > mm/truncate.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 86 insertions(+), 2 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e4183c6..73688d4 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1086,12 +1086,13 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t); > > extern unsigned long do_brk(unsigned long, unsigned long); > > -/* filemap.c */ > -extern unsigned long page_unuse(struct page *); > +/* truncate.c */ > extern void truncate_inode_pages(struct address_space *, loff_t); > extern void truncate_inode_pages_range(struct address_space *, > loff_t lstart, loff_t lend); > +extern void cancel_rejected_write(struct address_space *, pgoff_t, pgoff_t); > > +/* filemap.c */ > /* generic vm_area_ops exported for stackable file systems */ > extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int *); > extern int filemap_populate(struct vm_area_struct *, unsigned long, > diff --git a/mm/truncate.c b/mm/truncate.c > index 4fbe1a2..f742096 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -443,3 +443,86 @@ int invalidate_inode_pages2(struct address_space *mapping) > return invalidate_inode_pages2_range(mapping, 0, -1); > } > EXPORT_SYMBOL_GPL(invalidate_inode_pages2); > + > +/* > + * Cancel that part of a rejected write that affects a particular page > + */ > +static void cancel_rejected_page(struct address_space *mapping, > + struct page *page, pgoff_t *_next) > +{ > + if (!TestSetPageError(page)) { > + /* can't lock the page until we've cleared PG_writeback lest we > + * deadlock with truncate (amongst other things) */ > + end_page_writeback(page); > + if (page->mapping == mapping) { > + lock_page(page); > + if (page->mapping == mapping) { > + truncate_complete_page(mapping, page); > + *_next = page->index + 1; > + } > + unlock_page(page); > + } > + } else if (PageWriteback(page) || PageDirty(page)) { > + BUG(); > + } > +} Why can't we just run the end_page_writeback() unconditionally? PG_writeback _must_ be set here, and it is the caller who set PG_writeback, so this thread of control "owns" that flag. Also, are you really really sure that there is no way in which PG_writeback can get itself set again after the end_page_writeback()? I'd have thought that we should be doing a wait_on_page_writeback() after the lock_page() there. Remember, other filesystems might want to be calling this, so we shouldn't be designing around AFS implementation specifics. > +/** > + * cancel_rejected_write - Cancel a write on a contiguous set of pages > + * @mapping: mapping affected > + * @start: first page in set > + * @end: last page in set > + * > + * Cancel a write of a contiguous set of pages when the writeback was rejected > + * by the target medium or server. > + * > + * The pages in question are detached and discarded from the pagecache, and the > + * writeback and dirty states are cleared prior to invalidation. The caller > + * must make sure that all the pages in the range are present in the pagecache, > + * and the caller must hold PG_writeback on each of them. NOTE! All the pages > + * are locked and unlocked as part of this process, so the caller must take > + * care to avoid deadlock. > + * > + * The PTEs pointing to those pages are also cleared, leading to the PTEs being > + * reset when new pages are allocated and the contents reloaded. > + */ > +void cancel_rejected_write(struct address_space *mapping, > + pgoff_t start, pgoff_t end) > +{ > + struct pagevec pvec; > + pgoff_t n; > + int i; > + > + BUG_ON(mapping->nrpages < end - start + 1); > + > + /* dispose of any PTEs pointing to the affected pages */ > + unmap_mapping_range(mapping, > + (loff_t)start << PAGE_CACHE_SHIFT, > + (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT, > + 0); > + > + pagevec_init(&pvec, 0); > + do { > + cond_resched(); > + n = end - start + 1; > + if (n > PAGEVEC_SIZE) > + n = PAGEVEC_SIZE; > + n = pagevec_lookup(&pvec, mapping, start, n); > + for (i = 0; i < n; i++) { > + struct page *page = pvec.pages[i]; > + > + if (page->index < start || page->index > end) > + continue; > + start++; > + cancel_rejected_page(mapping, page, &start); > + } > + pagevec_release(&pvec); > + } while (start - 1 < end); > + > + /* dispose of any new PTEs pointing to the affected pages */ > + unmap_mapping_range(mapping, > + (loff_t)start << PAGE_CACHE_SHIFT, > + (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT, > + 0); > +} > +EXPORT_SYMBOL_GPL(cancel_rejected_write); hm, is the pte-unmapping here completely solid? Are there any race windows in which a faulter can end up owning, say, an anonymous page? We've had heaps of problems with that sort of thing in generic_file_direct_IO() and I don't expect it's this easy. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache 2007-05-24 20:38 ` Andrew Morton @ 2007-05-24 21:35 ` David Howells 2007-05-24 21:47 ` Andrew Morton 2007-05-24 22:35 ` Trond Myklebust 0 siblings, 2 replies; 17+ messages in thread From: David Howells @ 2007-05-24 21:35 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel Andrew Morton <akpm@linux-foundation.org> wrote: > Could you please flesh this out a bit, from a higher level? See the description for patch 3/4. > As in: what does it mean for a server to "reject" a write? What's actually > going on here? Simply, it means that the server refused to perform the write. The example I'm thinking of is that it refused to do so because the ACL governing that file changed between the permission() check and the attempt to write the data back. All it takes is a write-back cache and a tiny little delay between the copy to page cache and the write back to the server and there's a window in which another client can leap in and make it impossible for the client to write the data. There are other possibilities, for instance: (1) The key governing the writeback expires. (2) The user's account is deleted or disabled. The file being deleted is handled elsewhere. In such a case, everyone's writes are just discarded. > Why does the server do this? Because it's told to by some other client. Think chmod, or in AFS's case: fs setacl -dir . -acl system:administrators rlidka This removes write access by not including a 'w' amongst 'rlidka'. (Note in AFS, the ACLs attached to a directory control the files in that directory). > I assume that the application will see a short write or an EIO or something? EACCES or similar most likely. In AFS's case you should get an abort with some appropriate error code. This is then mapped to EACCES, EKEYREVOKED, EKEYEXPIRED, etc. > How does this interact with MAP_SHARED? MAP_SHARED/PROT_WRITE is just another way of doing writes to the pagecache. It isn't really special. The way I've done it is to use page_mkwrite() to track the key with which a write through a mapping is written back. > Do we expect that any other networked filesystem would want to do this? > (and if not, why not?) Or do they already attempt to do it in some other > fashion? _Any_ network filesystem that (a) has access controls, and (b) does writeback caching (be the interval ever so brief) is liable to this issue. It could possible for a netfs to avoid with this by getting a guarantee that the write will be accepted upfront (CIFS might do this), or by blocking attempts to change the access controls through some sort of locking (again CIFS might do this). However, looking at NFS, it appears that NFS may be liable to this problem. It does appear to use writeback caching, though it flushes its writes back fairly promptly making it quite difficult to test. I talked to Steve Dickson about this, and I get the impression that NFS will just sit their and keep retrying the writeback. So, yes, other netfs's may well want to do this, and perhaps *should* be doing this. > > Note that the locking is tricky as it's very easy to deadlock against > > truncate() and other routines once the pages have been unlocked as part of > > the writeback process. To this end, the PG_error flag is set, then the > > PG_writeback flag is cleared, and only *then* can lock_page() be called. > > hm. Yeah. Hm. I spent a lot of time trying to work around deadlocks and finding new ones. > Why can't we just run the end_page_writeback() unconditionally? > PG_writeback _must_ be set here, and it is the caller who set PG_writeback, > so this thread of control "owns" that flag. You may be right. I'm trying to avoid a race with truncate and attempts to rewrite the page both, but as I set PG_error, that might not be a problem. > Also, are you really really sure that there is no way in which PG_writeback > can get itself set again after the end_page_writeback()? PG_error ought to take care of that. To set PG_writeback again, we have to wait for any outstanding PG_writeback to go away first - at which point PG_error should come to light. That's also the reason for the second part of the if-complex - the one that BUGs if PG_error is set *and* the page is dirty or undergoing writeback. I want to make sure I catch such a situation. > I'd have thought that we should be doing a wait_on_page_writeback() after the > lock_page() there. That would require us to begin writing back a page marked PG_error, which probably ought to be considered "a bad thing". > Remember, other filesystems might want to be calling this, so we shouldn't be > designing around AFS implementation specifics. I know. Part of the reason that I put it here is so that we can have a common policy. However, without trying to get it called from those other filesystems, it's hard to see where I've made AFS-specific assumptions. I don't think that there are actually any, but... > hm, is the pte-unmapping here completely solid? I'm not sure. Certainly by doing it after the grunging of the affected pages, we should evict all the PTEs and they shouldn't come back after that point. I'm tempted to rearrange the algorithm to be this: (1) Mark all affected pages PG_error. (2) Ditch all PTEs mapping to those pages. (3) Truncate all affected pages. Which has the downside of traversing the set of affected pages twice, but the upside of only whacking the PTEs once. The calling netfs would then have to make sure that nopage() didn't resurrect a PG_error page, but that could possibly be built into filemap_nopage() or do_no_page() as a convenience. > Are there any race windows in which a faulter can end up owning, say, an > anonymous page? We've had heaps of problems with that sort of thing in > generic_file_direct_IO() and I don't expect it's this easy. At the moment, I do two passes of PTE erasure to make sure that all these pages are properly expunged. However, if I use the above set of steps, and provided PG_error is properly honoured, I don't think this should be a problem. David ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache 2007-05-24 21:35 ` David Howells @ 2007-05-24 21:47 ` Andrew Morton 2007-05-24 22:34 ` David Howells 2007-05-24 22:35 ` Trond Myklebust 1 sibling, 1 reply; 17+ messages in thread From: Andrew Morton @ 2007-05-24 21:47 UTC (permalink / raw) To: David Howells; +Cc: linux-kernel On Thu, 24 May 2007 22:35:22 +0100 David Howells <dhowells@redhat.com> wrote: > > > Why can't we just run the end_page_writeback() unconditionally? > > PG_writeback _must_ be set here, and it is the caller who set PG_writeback, > > so this thread of control "owns" that flag. > > You may be right. I'm trying to avoid a race with truncate and attempts to > rewrite the page both, but as I set PG_error, that might not be a problem. Thing is, this new interpretation and usage of PG_error is worrisome. For example, I don't think we've previously made any effort to avoid starting writeback against PageError() pages. We may be avoiding it in certain places, but it wasn't a design rule of any sort. And any such code hasn't had any useful testing: PageError() is a damn rare thing. So my reason for asking the above is to try to find a way to make all these new PG-error games just go away. > > Also, are you really really sure that there is no way in which PG_writeback > > can get itself set again after the end_page_writeback()? > > PG_error ought to take care of that. To set PG_writeback again, we have to > wait for any outstanding PG_writeback to go away first - at which point > PG_error should come to light. > > That's also the reason for the second part of the if-complex - the one that > BUGs if PG_error is set *and* the page is dirty or undergoing writeback. I > want to make sure I catch such a situation. > > > I'd have thought that we should be doing a wait_on_page_writeback() after the > > lock_page() there. > > That would require us to begin writing back a page marked PG_error, which > probably ought to be considered "a bad thing". As I say, no effort has been made to enforce anything like that. > > Remember, other filesystems might want to be calling this, so we shouldn't be > > designing around AFS implementation specifics. > > I know. Part of the reason that I put it here is so that we can have a common > policy. However, without trying to get it called from those other filesystems, > it's hard to see where I've made AFS-specific assumptions. I don't think that > there are actually any, but... > > > hm, is the pte-unmapping here completely solid? > > I'm not sure. Certainly by doing it after the grunging of the affected pages, > we should evict all the PTEs and they shouldn't come back after that point. > > I'm tempted to rearrange the algorithm to be this: > > (1) Mark all affected pages PG_error. > > (2) Ditch all PTEs mapping to those pages. > > (3) Truncate all affected pages. > > Which has the downside of traversing the set of affected pages twice, but the > upside of only whacking the PTEs once. The calling netfs would then have to > make sure that nopage() didn't resurrect a PG_error page, but that could > possibly be built into filemap_nopage() or do_no_page() as a convenience. > > > Are there any race windows in which a faulter can end up owning, say, an > > anonymous page? We've had heaps of problems with that sort of thing in > > generic_file_direct_IO() and I don't expect it's this easy. > > At the moment, I do two passes of PTE erasure to make sure that all these pages > are properly expunged. However, if I use the above set of steps, and provided > PG_error is properly honoured, I don't think this should be a problem. > Would be better, I think, to be able to remove all this new PG_error stuff. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache 2007-05-24 21:47 ` Andrew Morton @ 2007-05-24 22:34 ` David Howells 2007-05-24 22:46 ` Andrew Morton 0 siblings, 1 reply; 17+ messages in thread From: David Howells @ 2007-05-24 22:34 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel Andrew Morton <akpm@linux-foundation.org> wrote: > So my reason for asking the above is to try to find a way to make all these > new PG-error games just go away. Yeah. However, there needs to be something to cover the gap between releasing PG_writeback and getting PG_lock. They have to be done in that order to avoid deadlocking against truncate and other stuff, but that leaves a window in which the page appears to be in a good state - one in which prepare_write() or page_mkwrite() can potentially leak through. Nick Piggin talked about using an extra lock, but as far as I can tell, that just compounds the deadlock problems. I suppose I could leave something in page->private that indicated that the page was defunct, but that'd have to be done by the filesystem, probably before calling cancel_rejected_write(). David ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache 2007-05-24 22:34 ` David Howells @ 2007-05-24 22:46 ` Andrew Morton 2007-05-24 23:08 ` David Howells 0 siblings, 1 reply; 17+ messages in thread From: Andrew Morton @ 2007-05-24 22:46 UTC (permalink / raw) To: David Howells; +Cc: linux-kernel On Thu, 24 May 2007 23:34:33 +0100 David Howells <dhowells@redhat.com> wrote: > Andrew Morton <akpm@linux-foundation.org> wrote: > > > So my reason for asking the above is to try to find a way to make all these > > new PG-error games just go away. > > Yeah. However, there needs to be something to cover the gap between releasing > PG_writeback and getting PG_lock. They have to be done in that order to avoid > deadlocking against truncate and other stuff, but that leaves a window in which > the page appears to be in a good state - one in which prepare_write() or > page_mkwrite() can potentially leak through. hm. I don't see why that race window would be a problem in practice: the page-exciser does a lock_page();wait_on_page_writeback() as normal, then proceeds with its business? But given that this doesn't work right for some reason, can we use PG_error and then handle that appropriately in the filesystem's ->prepare_write() and ->page_mkwrite()? > Nick Piggin talked about using an extra lock, but as far as I can tell, that > just compounds the deadlock problems. > > I suppose I could leave something in page->private that indicated that the > page was defunct, but that'd have to be done by the filesystem, probably > before calling cancel_rejected_write(). Well, using PG_error is OK and appropriate for that if it's localised to the fs. But I'd be a bit worried about requiring that the VFS maintain some special protocol for it, partly because it would be such a rarely-tested thing. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache 2007-05-24 22:46 ` Andrew Morton @ 2007-05-24 23:08 ` David Howells 2007-05-24 23:24 ` Andrew Morton 0 siblings, 1 reply; 17+ messages in thread From: David Howells @ 2007-05-24 23:08 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel Andrew Morton <akpm@linux-foundation.org> wrote: > hm. I don't see why that race window would be a problem in practice: the > page-exciser does a lock_page();wait_on_page_writeback() as normal, then > proceeds with its business? No. The page-exciser ends (cancels) PG_writeback, not waits for it (something has to clear the flag). The problem is that the truncation routines may be sat there holding a lock on the page whilst waiting for PG_writeback to go away - so we have to clear PG_writeback before we can think about getting PG_lock:-( > But given that this doesn't work right for some reason, can we use PG_error > and then handle that appropriately in the filesystem's ->prepare_write() and > ->page_mkwrite()? Possibly, though I'd rather they didn't see such a page. David ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache 2007-05-24 23:08 ` David Howells @ 2007-05-24 23:24 ` Andrew Morton 2007-05-24 23:37 ` David Howells 0 siblings, 1 reply; 17+ messages in thread From: Andrew Morton @ 2007-05-24 23:24 UTC (permalink / raw) To: David Howells; +Cc: linux-kernel On Fri, 25 May 2007 00:08:43 +0100 David Howells <dhowells@redhat.com> wrote: > Andrew Morton <akpm@linux-foundation.org> wrote: > > > hm. I don't see why that race window would be a problem in practice: the > > page-exciser does a lock_page();wait_on_page_writeback() as normal, then > > proceeds with its business? > > No. The page-exciser ends (cancels) PG_writeback, not waits for it (something > has to clear the flag). The problem is that the truncation routines may be > sat there holding a lock on the page whilst waiting for PG_writeback to go > away - so we have to clear PG_writeback before we can think about getting > PG_lock:-( But we already covered that? Your exciser can do an unconditional end_page_writeback(), because it is this thread of control which did the set_page_writeback(). So we end up with: end_page_writeback(page); lock_page(page); wait_on_page_writeback(page); <now nail it> > > But given that this doesn't work right for some reason, can we use PG_error > > and then handle that appropriately in the filesystem's ->prepare_write() and > > ->page_mkwrite()? > > Possibly, though I'd rather they didn't see such a page. Well someone needs to be taught all about this case. Question is, should it be the VFS, or should it just be the address_space(s) which brought this state about, and which care about it? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache 2007-05-24 23:24 ` Andrew Morton @ 2007-05-24 23:37 ` David Howells 0 siblings, 0 replies; 17+ messages in thread From: David Howells @ 2007-05-24 23:37 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel Andrew Morton <akpm@linux-foundation.org> wrote: > But we already covered that? Your exciser can do an unconditional > end_page_writeback(), because it is this thread of control which did the > set_page_writeback(). So we end up with: Ah, I misunderstood what you meant. I assumed you meant to wait insted of ending. Of course, if we've decided to excise this page, it really oughtn't to get PG_writeback set again. I'll have to think about that. > Well someone needs to be taught all about this case. Question is, should > it be the VFS, or should it just be the address_space(s) which brought > this state about, and which care about it? For the most part, I think that this only applies to netfs's, and possibly not all of those, so making it fully general might be overkill. David ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache 2007-05-24 21:35 ` David Howells 2007-05-24 21:47 ` Andrew Morton @ 2007-05-24 22:35 ` Trond Myklebust 2007-05-24 23:18 ` David Howells 1 sibling, 1 reply; 17+ messages in thread From: Trond Myklebust @ 2007-05-24 22:35 UTC (permalink / raw) To: David Howells; +Cc: Andrew Morton, linux-kernel On Thu, 2007-05-24 at 22:35 +0100, David Howells wrote: > Andrew Morton <akpm@linux-foundation.org> wrote: > > > Could you please flesh this out a bit, from a higher level? > > See the description for patch 3/4. > > > As in: what does it mean for a server to "reject" a write? What's actually > > going on here? > > Simply, it means that the server refused to perform the write. The example I'm > thinking of is that it refused to do so because the ACL governing that file > changed between the permission() check and the attempt to write the data back. > > All it takes is a write-back cache and a tiny little delay between the copy to > page cache and the write back to the server and there's a window in which > another client can leap in and make it impossible for the client to write the > data. > > There are other possibilities, for instance: > > (1) The key governing the writeback expires. > > (2) The user's account is deleted or disabled. > > The file being deleted is handled elsewhere. In such a case, everyone's > writes are just discarded. > > > Why does the server do this? > > Because it's told to by some other client. Think chmod, or in AFS's case: > > fs setacl -dir . -acl system:administrators rlidka > > This removes write access by not including a 'w' amongst 'rlidka'. > > (Note in AFS, the ACLs attached to a directory control the files in that > directory). > > > I assume that the application will see a short write or an EIO or something? > > EACCES or similar most likely. In AFS's case you should get an abort with some > appropriate error code. This is then mapped to EACCES, EKEYREVOKED, > EKEYEXPIRED, etc. > > > How does this interact with MAP_SHARED? > > MAP_SHARED/PROT_WRITE is just another way of doing writes to the pagecache. It > isn't really special. The way I've done it is to use page_mkwrite() to track > the key with which a write through a mapping is written back. > > > Do we expect that any other networked filesystem would want to do this? > > (and if not, why not?) Or do they already attempt to do it in some other > > fashion? > > _Any_ network filesystem that (a) has access controls, and (b) does writeback > caching (be the interval ever so brief) is liable to this issue. It could > possible for a netfs to avoid with this by getting a guarantee that the write > will be accepted upfront (CIFS might do this), or by blocking attempts to > change the access controls through some sort of locking (again CIFS might do > this). > > However, looking at NFS, it appears that NFS may be liable to this problem. It > does appear to use writeback caching, though it flushes its writes back fairly > promptly making it quite difficult to test. I talked to Steve Dickson about > this, and I get the impression that NFS will just sit their and keep retrying > the writeback. No. If the write fails, then NFS will mark the mapping as invalid and attempt to call invalidate_inode_pages2() at the earliest possible moment. I'm adding in a patch to defer marking the page as uptodate until the write is successful in cases where NFS is writing a pristine page. As for pages that are already marked as uptodate, well you already have a race: you have deferred the page write, and so other processes may already have read the rejected data before you tried to write it out. Trond ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache 2007-05-24 22:35 ` Trond Myklebust @ 2007-05-24 23:18 ` David Howells 2007-05-24 23:54 ` Trond Myklebust 0 siblings, 1 reply; 17+ messages in thread From: David Howells @ 2007-05-24 23:18 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, linux-kernel Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > No. If the write fails, then NFS will mark the mapping as invalid and > attempt to call invalidate_inode_pages2() at the earliest possible > moment. Will it erase *all* unwritten writes? Or is that what launder_page() is for? How do you deal with pages that were in the process of being written out when that particular write was rejected? Do you just summarily clear PG_writeback and hope no-one else looks at the page until invalidate_inode_pages2() gets around to excising it? Or do you have a better way? > I'm adding in a patch to defer marking the page as uptodate until the > write is successful in cases where NFS is writing a pristine page. That sounds reasonable, though it doesn't help in the case I'm looking at. Do you also munge i_size if the write fails? > As for pages that are already marked as uptodate, well you already have > a race: you have deferred the page write, and so other processes may > already have read the rejected data before you tried to write it out. Yeah, I know, and that's very difficult to deal with without some formal transaction rollback mechanism. I think that the best I can do is to discard the dodgy data that I've got lurking in the pagecache, but I still have to deal with writes made by other users to that file after the rejected write. There isn't a perfect way of dealing with it, given the circumstances. David ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache 2007-05-24 23:18 ` David Howells @ 2007-05-24 23:54 ` Trond Myklebust 2007-05-30 10:35 ` David Howells 0 siblings, 1 reply; 17+ messages in thread From: Trond Myklebust @ 2007-05-24 23:54 UTC (permalink / raw) To: David Howells; +Cc: Andrew Morton, linux-kernel On Fri, 2007-05-25 at 00:18 +0100, David Howells wrote: > Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > No. If the write fails, then NFS will mark the mapping as invalid and > > attempt to call invalidate_inode_pages2() at the earliest possible > > moment. > > Will it erase *all* unwritten writes? Or is that what launder_page() is for? Yes. launder_page() will flush out any writes that may have raced with the call to invalidate_inode_pages2() (you're supposed to attempt to flush them out before you call the latter). > How do you deal with pages that were in the process of being written out when > that particular write was rejected? Do you just summarily clear PG_writeback > and hope no-one else looks at the page until invalidate_inode_pages2() gets > around to excising it? Or do you have a better way? All I do is to protect new calls to read() and write() with a call to check if the page cache needs invalidating. As I said earlier, you cannot avoid the race. At best you can reduce the window of opportunity, and so I'd argue that if you can't do that cheaply, then it isn't worth it. > > I'm adding in a patch to defer marking the page as uptodate until the > > write is successful in cases where NFS is writing a pristine page. > > That sounds reasonable, though it doesn't help in the case I'm looking at. Do > you also munge i_size if the write fails? I just mark the inode as needing revalidation so that we update the size on the next read()/write()/getattr(). That won't stop any existing append writes from punching ugly holes into the file, but trying to recover from that sort of thing would be _really_ painful! > > As for pages that are already marked as uptodate, well you already have > > a race: you have deferred the page write, and so other processes may > > already have read the rejected data before you tried to write it out. > > Yeah, I know, and that's very difficult to deal with without some formal > transaction rollback mechanism. I think that the best I can do is to discard > the dodgy data that I've got lurking in the pagecache, but I still have to > deal with writes made by other users to that file after the rejected write. > > There isn't a perfect way of dealing with it, given the circumstances. Agreed. Trond ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache 2007-05-24 23:54 ` Trond Myklebust @ 2007-05-30 10:35 ` David Howells 2007-05-30 17:39 ` Trond Myklebust 0 siblings, 1 reply; 17+ messages in thread From: David Howells @ 2007-05-30 10:35 UTC (permalink / raw) To: Trond Myklebust; +Cc: Andrew Morton, linux-kernel Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > All I do is to protect new calls to read() and write() with a call to > check if the page cache needs invalidating. What about mmap()? What if someone gets a mapping on a section of file that subsequently has a write rejected on it? If you invalidate only on read()/write(), what do you do about such a mapping? > That won't stop any existing append writes from punching ugly holes into the > file, but trying to recover from that sort of thing would be _really_ > painful! Definitely. David ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache 2007-05-30 10:35 ` David Howells @ 2007-05-30 17:39 ` Trond Myklebust 0 siblings, 0 replies; 17+ messages in thread From: Trond Myklebust @ 2007-05-30 17:39 UTC (permalink / raw) To: David Howells; +Cc: Andrew Morton, linux-kernel On Wed, 2007-05-30 at 11:35 +0100, David Howells wrote: > Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > > All I do is to protect new calls to read() and write() with a call to > > check if the page cache needs invalidating. > > What about mmap()? What if someone gets a mapping on a section of file that > subsequently has a write rejected on it? If you invalidate only on > read()/write(), what do you do about such a mapping? Same philosophy: The people who already have the file mapped will already be working with corrupted data, so there is little point in bending over backwards to "fix" the problem. Instead we flush out the page cache using invalidate_inode_pages2() whenever someone tries to mmap() a new section of the file. Trond ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 3/4] AFS: Improve handling of a rejected writeback 2007-05-23 19:15 [PATCH 1/4] AFS: Add TestSetPageError() David Howells 2007-05-23 19:15 ` [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache David Howells @ 2007-05-23 19:15 ` David Howells 2007-05-23 19:15 ` [PATCH 4/4] AFS: Implement shared-writable mmap David Howells 2 siblings, 0 replies; 17+ messages in thread From: David Howells @ 2007-05-23 19:15 UTC (permalink / raw) To: akpm; +Cc: linux-kernel, dhowells Improve the handling of the case of a server rejecting an attempt to write back a cached write. AFS operates a write-back cache, so the following sequence of events can theoretically occur: CLIENT 1 CLIENT 2 ======================= ======================= cat data >/the/file (sits in pagecache) fs setacl -dir /the/dir/of/the/file \ -acl system:administrators rlidka (write permission removed for client 1) sync (writeback attempt fails) The way AFS attempts to handle this is: (1) The affected region will be excised and discarded on the basis that it can't be written back, yet we don't want it lurking in the page cache either. The contents of the affected region will be reread from the server when called for again. (2) The EOF size will be set to the current server-based file size - usually that which it was before the affected write was made - assuming no conflicting write has been appended, and assuming the affected write extended the file. This patch makes the following changes: (1) Zero-length short reads don't produce EBADMSG now just because the OpenAFS server puts a silly value as the size of the returned data. This prevents excised pages beyond the revised EOF being reinstantiated with a surprise PG_error. (2) Writebacks can now be put into a 'rejected' state in which all further attempts to write them back will result in excision of the affected pages instead. (3) Preparing a page for overwriting now reads the whole page instead of just those parts of it that aren't to be covered by the copy to be made. This handles the possibility that the copy might fail on EFAULT. Corollary to this, PG_update can now be set by afs_prepare_page() on behalf of afs_prepare_write() rather than setting it in afs_commit_write(). (4) In the case of a conflicting write, afs_prepare_write() will attempt to flush the write to the server, and will then wait for PG_writeback to go away - after unlocking the page. This helps prevent deadlock against the writeback-rejection handler. AOP_TRUNCATED_PAGE is then returned to the caller to signify that the page has been unlocked, and that it should be revalidated. (5) The writeback-rejection handler now calls cancel_rejected_write() added by the previous patch to excise the affected pages rather than clearing the PG_uptodate flag on all the pages. Signed-off-by: David Howells <dhowells@redhat.com> --- fs/afs/fsclient.c | 4 + fs/afs/internal.h | 1 fs/afs/write.c | 154 ++++++++++++++++++++++++++++------------------------- 3 files changed, 85 insertions(+), 74 deletions(-) diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c index 5dff130..60f9500 100644 --- a/fs/afs/fsclient.c +++ b/fs/afs/fsclient.c @@ -353,7 +353,9 @@ static int afs_deliver_fs_fetch_data(struct afs_call *call, call->count = ntohl(call->tmp); _debug("DATA length: %u", call->count); - if (call->count > PAGE_SIZE) + if ((s32) call->count < 0) + call->count = 0; /* access completely beyond EOF */ + else if (call->count > PAGE_SIZE) return -EBADMSG; call->offset = 0; call->unmarshall++; diff --git a/fs/afs/internal.h b/fs/afs/internal.h index 2dac3ad..ddb18d9 100644 --- a/fs/afs/internal.h +++ b/fs/afs/internal.h @@ -154,6 +154,7 @@ struct afs_writeback { AFS_WBACK_PENDING, /* write pending */ AFS_WBACK_CONFLICTING, /* conflicting writes posted */ AFS_WBACK_WRITING, /* writing back */ + AFS_WBACK_REJECTED, /* the writeback was rejected */ AFS_WBACK_COMPLETE /* the writeback record has been unlinked */ } state __attribute__((packed)); }; diff --git a/fs/afs/write.c b/fs/afs/write.c index a03b92a..ac621e8 100644 --- a/fs/afs/write.c +++ b/fs/afs/write.c @@ -81,18 +81,16 @@ void afs_put_writeback(struct afs_writeback *wb) } /* - * partly or wholly fill a page that's under preparation for writing + * fill a page that's under preparation for writing */ static int afs_fill_page(struct afs_vnode *vnode, struct key *key, - unsigned start, unsigned len, struct page *page) + unsigned len, struct page *page) { int ret; - _enter(",,%u,%u", start, len); + _enter(",,%u,", len); - ASSERTCMP(start + len, <=, PAGE_SIZE); - - ret = afs_vnode_fetch_data(vnode, key, start, len, page); + ret = afs_vnode_fetch_data(vnode, key, 0, len, page); if (ret < 0) { if (ret == -ENOENT) { _debug("got NOENT from server" @@ -110,18 +108,15 @@ static int afs_fill_page(struct afs_vnode *vnode, struct key *key, * prepare a page for being written to */ static int afs_prepare_page(struct afs_vnode *vnode, struct page *page, - struct key *key, unsigned offset, unsigned to) + struct key *key) { - unsigned eof, tail, start, stop, len; + unsigned len; loff_t i_size, pos; void *p; int ret; _enter(""); - if (offset == 0 && to == PAGE_SIZE) - return 0; - p = kmap_atomic(page, KM_USER0); i_size = i_size_read(&vnode->vfs_inode); @@ -129,10 +124,7 @@ static int afs_prepare_page(struct afs_vnode *vnode, struct page *page, if (pos >= i_size) { /* partial write, page beyond EOF */ _debug("beyond"); - if (offset > 0) - memset(p, 0, offset); - if (to < PAGE_SIZE) - memset(p + to, 0, PAGE_SIZE - to); + memset(p, 0, PAGE_SIZE); kunmap_atomic(p, KM_USER0); return 0; } @@ -140,31 +132,20 @@ static int afs_prepare_page(struct afs_vnode *vnode, struct page *page, if (i_size - pos >= PAGE_SIZE) { /* partial write, page entirely before EOF */ _debug("before"); - tail = eof = PAGE_SIZE; + len = PAGE_SIZE; } else { /* partial write, page overlaps EOF */ - eof = i_size - pos; - _debug("overlap %u", eof); - tail = max(eof, to); - if (tail < PAGE_SIZE) - memset(p + tail, 0, PAGE_SIZE - tail); - if (offset > eof) - memset(p + eof, 0, PAGE_SIZE - eof); + len = i_size - pos; + _debug("overlap %u", len); + ASSERTRANGE(0, <, len, <, PAGE_SIZE); + memset(p + len, 0, PAGE_SIZE - len); } kunmap_atomic(p, KM_USER0); - ret = 0; - if (offset > 0 || eof > to) { - /* need to fill one or two bits that aren't going to be written - * (cover both fillers in one read if there are two) */ - start = (offset > 0) ? 0 : to; - stop = (eof > to) ? eof : offset; - len = stop - start; - _debug("wr=%u-%u av=0-%u rd=%u@%u", - offset, to, eof, start, len); - ret = afs_fill_page(vnode, key, start, len, page); - } + ret = afs_fill_page(vnode, key, len, page); + if (ret == 0) + SetPageUptodate(page); _leave(" = %d", ret); return ret; @@ -187,6 +168,8 @@ int afs_prepare_write(struct file *file, struct page *page, _enter("{%x:%u},{%lx},%u,%u", vnode->fid.vid, vnode->fid.vnode, page->index, offset, to); + BUG_ON(PageError(page)); + candidate = kzalloc(sizeof(*candidate), GFP_KERNEL); if (!candidate) return -ENOMEM; @@ -200,7 +183,7 @@ int afs_prepare_write(struct file *file, struct page *page, if (!PageUptodate(page)) { _debug("not up to date"); - ret = afs_prepare_page(vnode, page, key, offset, to); + ret = afs_prepare_page(vnode, page, key); if (ret < 0) { kfree(candidate); _leave(" = %d [prep]", ret); @@ -269,21 +252,41 @@ flush_conflicting_wb: _debug("flush conflict"); if (wb->state == AFS_WBACK_PENDING) wb->state = AFS_WBACK_CONFLICTING; + wb->usage++; spin_unlock(&vnode->writeback_lock); - if (PageDirty(page)) { + if (!PageDirty(page) && !PageWriteback(page)) { + /* no change outstanding - just reuse the page */ + _debug("reuse"); + afs_put_writeback(wb); + set_page_private(page, 0); + ClearPagePrivate(page); + goto try_again; + } + + kfree(candidate); + + /* if we're busy writing back a conflicting write, then unlock the page + * and wait for the writeback to complete - this lets the process doing + * the write-out handle rejection without deadlock */ + if (PageWriteback(page)) { + _debug("wait wb"); + unlock_page(page); + } else { + /* there's a conflicting modification we have to write back and + * wait for before letting the next one proceed */ + _debug("dirty"); ret = afs_write_back_from_locked_page(wb, page); if (ret < 0) { - afs_put_writeback(candidate); + afs_put_writeback(wb); _leave(" = %d", ret); return ret; } } - /* the page holds a ref on the writeback record */ afs_put_writeback(wb); - set_page_private(page, 0); - ClearPagePrivate(page); - goto try_again; + wait_on_page_writeback(page); + _leave(" = A_T_P"); + return AOP_TRUNCATED_PAGE; } /* @@ -310,7 +313,6 @@ int afs_commit_write(struct file *file, struct page *page, spin_unlock(&vnode->writeback_lock); } - SetPageUptodate(page); set_page_dirty(page); if (PageDirty(page)) _debug("dirtied"); @@ -319,38 +321,35 @@ int afs_commit_write(struct file *file, struct page *page, } /* - * kill all the pages in the given range + * note the failure of a write, either due to an error or to a permission + * failure + * - all the pages in the affected range must have PG_writeback set + * - the caller must be responsible for the pages: no-one else should be trying + * to note rejection */ -static void afs_kill_pages(struct afs_vnode *vnode, bool error, - pgoff_t first, pgoff_t last) +static void afs_write_rejected(struct afs_writeback *wb, bool error, + pgoff_t first, pgoff_t last) { - struct pagevec pv; - unsigned count, loop; + struct afs_vnode *vnode = wb->vnode; + loff_t i_size; _enter("{%x:%u},%lx-%lx", vnode->fid.vid, vnode->fid.vnode, first, last); - pagevec_init(&pv, 0); - - do { - _debug("kill %lx-%lx", first, last); - - count = last - first + 1; - if (count > PAGEVEC_SIZE) - count = PAGEVEC_SIZE; - pv.nr = find_get_pages_contig(vnode->vfs_inode.i_mapping, - first, count, pv.pages); - ASSERTCMP(pv.nr, ==, count); - - for (loop = 0; loop < count; loop++) { - ClearPageUptodate(pv.pages[loop]); - if (error) - SetPageError(pv.pages[loop]); - end_page_writeback(pv.pages[loop]); - } + spin_lock(&vnode->writeback_lock); + wb->state = AFS_WBACK_REJECTED; + + /* wind back the file size if this write extended the file, and wasn't + * followed by a conflicting write */ + i_size = ((loff_t) wb->last) << PAGE_SHIFT; + i_size += wb->to_last; + if (i_size_read(&vnode->vfs_inode) == i_size) { + _debug("shorten"); + i_size_write(&vnode->vfs_inode, vnode->status.size); + } - __pagevec_release(&pv); - } while (first < last); + spin_unlock(&vnode->writeback_lock); + cancel_rejected_write(vnode->vfs_inode.i_mapping, first, last); _leave(""); } @@ -358,6 +357,7 @@ static void afs_kill_pages(struct afs_vnode *vnode, bool error, /* * synchronously write back the locked page and any subsequent non-locked dirty * pages also covered by the same writeback record + * - all pages written will be unlocked prior to returning */ static int afs_write_back_from_locked_page(struct afs_writeback *wb, struct page *primary_page) @@ -407,6 +407,7 @@ static int afs_write_back_from_locked_page(struct afs_writeback *wb, if (TestSetPageLocked(page)) break; if (!PageDirty(page) || + PageWriteback(page) || page_private(page) != (unsigned long) wb) { unlock_page(page); break; @@ -430,14 +431,22 @@ static int afs_write_back_from_locked_page(struct afs_writeback *wb, no_more: /* we now have a contiguous set of dirty pages, each with writeback set - * and the dirty mark cleared; the first page is locked and must remain - * so, all the rest are unlocked */ + * and the dirty mark cleared; all the pages barring the first are now + * unlocked */ first = primary_page->index; last = first + count - 1; + unlock_page(primary_page); offset = (first == wb->first) ? wb->offset_first : 0; to = (last == wb->last) ? wb->to_last : PAGE_SIZE; + if (wb->state == AFS_WBACK_REJECTED) { + cancel_rejected_write(wb->vnode->vfs_inode.i_mapping, + first, last); + ret = 0; + goto out; + } + _debug("write back %lx[%u..] to %lx[..%u]", first, offset, last, to); ret = afs_vnode_store_data(wb, first, last, offset, to); @@ -455,7 +464,7 @@ no_more: case -ENOENT: case -ENOMEDIUM: case -ENXIO: - afs_kill_pages(wb->vnode, true, first, last); + afs_write_rejected(wb, true, first, last); set_bit(AS_EIO, &wb->vnode->vfs_inode.i_mapping->flags); break; case -EACCES: @@ -464,7 +473,7 @@ no_more: case -EKEYEXPIRED: case -EKEYREJECTED: case -EKEYREVOKED: - afs_kill_pages(wb->vnode, false, first, last); + afs_write_rejected(wb, false, first, last); break; default: break; @@ -473,6 +482,7 @@ no_more: ret = count; } +out: _leave(" = %d", ret); return ret; } @@ -493,7 +503,6 @@ int afs_writepage(struct page *page, struct writeback_control *wbc) ASSERT(wb != NULL); ret = afs_write_back_from_locked_page(wb, page); - unlock_page(page); if (ret < 0) { _leave(" = %d", ret); return 0; @@ -565,7 +574,6 @@ int afs_writepages_region(struct address_space *mapping, spin_unlock(&wb->vnode->writeback_lock); ret = afs_write_back_from_locked_page(wb, page); - unlock_page(page); page_cache_release(page); if (ret < 0) { _leave(" = %d", ret); ^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 4/4] AFS: Implement shared-writable mmap 2007-05-23 19:15 [PATCH 1/4] AFS: Add TestSetPageError() David Howells 2007-05-23 19:15 ` [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache David Howells 2007-05-23 19:15 ` [PATCH 3/4] AFS: Improve handling of a rejected writeback David Howells @ 2007-05-23 19:15 ` David Howells 2 siblings, 0 replies; 17+ messages in thread From: David Howells @ 2007-05-23 19:15 UTC (permalink / raw) To: akpm; +Cc: linux-kernel, dhowells Implement shared-writable mmap for AFS. The key with which to access the file is obtained from the VMA at the point where the PTE is made writable by the page_mkwrite() VMA op and cached in the affected page. If there's an outstanding write on the page made with a different key, then page_mkwrite() will flush it before attaching a record of the new key. Signed-off-by: David Howells <dhowells@redhat.com> --- fs/afs/file.c | 21 ++++++++++++++++++++- fs/afs/internal.h | 1 + fs/afs/write.c | 35 +++++++++++++++++++++++++++++++++++ 3 files changed, 56 insertions(+), 1 deletions(-) diff --git a/fs/afs/file.c b/fs/afs/file.c index 9c0e721..1547500 100644 --- a/fs/afs/file.c +++ b/fs/afs/file.c @@ -22,6 +22,7 @@ static int afs_readpage(struct file *file, struct page *page); static void afs_invalidatepage(struct page *page, unsigned long offset); static int afs_releasepage(struct page *page, gfp_t gfp_flags); static int afs_launder_page(struct page *page); +static int afs_mmap(struct file *file, struct vm_area_struct *vma); const struct file_operations afs_file_operations = { .open = afs_open, @@ -31,7 +32,7 @@ const struct file_operations afs_file_operations = { .write = do_sync_write, .aio_read = generic_file_aio_read, .aio_write = afs_file_write, - .mmap = generic_file_readonly_mmap, + .mmap = afs_mmap, .sendfile = generic_file_sendfile, .fsync = afs_fsync, }; @@ -54,6 +55,12 @@ const struct address_space_operations afs_fs_aops = { .writepages = afs_writepages, }; +static struct vm_operations_struct afs_file_vm_ops = { + .nopage = filemap_nopage, + .populate = filemap_populate, + .page_mkwrite = afs_page_mkwrite, +}; + /* * open an AFS file or directory and attach a key to it */ @@ -293,3 +300,15 @@ static int afs_releasepage(struct page *page, gfp_t gfp_flags) _leave(" = 0"); return 0; } + +/* + * memory map part of an AFS file + */ +static int afs_mmap(struct file *file, struct vm_area_struct *vma) +{ + _enter(""); + + file_accessed(file); + vma->vm_ops = &afs_file_vm_ops; + return 0; +} diff --git a/fs/afs/internal.h b/fs/afs/internal.h index ddb18d9..bc39eba 100644 --- a/fs/afs/internal.h +++ b/fs/afs/internal.h @@ -711,6 +711,7 @@ extern ssize_t afs_file_write(struct kiocb *, const struct iovec *, unsigned long, loff_t); extern int afs_writeback_all(struct afs_vnode *); extern int afs_fsync(struct file *, struct dentry *, int); +extern int afs_page_mkwrite(struct vm_area_struct *, struct page *); /*****************************************************************************/ diff --git a/fs/afs/write.c b/fs/afs/write.c index ac621e8..dd471f0 100644 --- a/fs/afs/write.c +++ b/fs/afs/write.c @@ -155,6 +155,8 @@ static int afs_prepare_page(struct afs_vnode *vnode, struct page *page, * prepare to perform part of a write to a page * - the caller holds the page locked, preventing it from being written out or * modified by anyone else + * - may be called from afs_page_mkwrite() to set up a page for modification + * through shared-writable mmap */ int afs_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to) @@ -833,3 +835,36 @@ int afs_fsync(struct file *file, struct dentry *dentry, int datasync) _leave(" = %d", ret); return ret; } + +/* + * notification that a previously read-only page is about to become writable + * - if it returns an error, the caller will deliver a bus error signal + * + * we use this to make a record of the key with which the writeback should be + * performed and to flush any outstanding writes made with a different key + * + * the key to be used is attached to the struct file pinned by the VMA + */ +int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page) +{ + struct afs_vnode *vnode = AFS_FS_I(vma->vm_file->f_mapping->host); + struct key *key = vma->vm_file->private_data; + int ret; + + _enter("{{%x:%u},%x},{%lx}", + vnode->fid.vid, vnode->fid.vnode, key_serial(key), page->index); + + do { + lock_page(page); + if (page->mapping == vma->vm_file->f_mapping) + ret = afs_prepare_write(vma->vm_file, page, 0, + PAGE_SIZE); + else + ret = 0; /* seems there was interference - let the + * caller deal with it */ + unlock_page(page); + } while (ret == AOP_TRUNCATED_PAGE); + + _leave(" = %d", ret); + return ret; +} ^ permalink raw reply related [flat|nested] 17+ messages in thread
end of thread, other threads:[~2007-05-30 17:39 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-05-23 19:15 [PATCH 1/4] AFS: Add TestSetPageError() David Howells 2007-05-23 19:15 ` [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache David Howells 2007-05-24 20:38 ` Andrew Morton 2007-05-24 21:35 ` David Howells 2007-05-24 21:47 ` Andrew Morton 2007-05-24 22:34 ` David Howells 2007-05-24 22:46 ` Andrew Morton 2007-05-24 23:08 ` David Howells 2007-05-24 23:24 ` Andrew Morton 2007-05-24 23:37 ` David Howells 2007-05-24 22:35 ` Trond Myklebust 2007-05-24 23:18 ` David Howells 2007-05-24 23:54 ` Trond Myklebust 2007-05-30 10:35 ` David Howells 2007-05-30 17:39 ` Trond Myklebust 2007-05-23 19:15 ` [PATCH 3/4] AFS: Improve handling of a rejected writeback David Howells 2007-05-23 19:15 ` [PATCH 4/4] AFS: Implement shared-writable mmap David Howells
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox