* [patch 2/4] vfs: add set_page_dirty_notag
@ 2009-02-13 11:56 Edward Shishkin
2009-02-13 13:08 ` Peter Zijlstra
0 siblings, 1 reply; 24+ messages in thread
From: Edward Shishkin @ 2009-02-13 11:56 UTC (permalink / raw)
To: Andrew Morton
Cc: Nick Piggin, Ryan Hope, Randy Dunlap, linux-kernel,
Edward Shishkin, ReiserFS Mailing List
Andrew, please, review.
I'll change things, if needed..
This is a fixup for the following "todo":
akpm wrote:
> reiser4_set_page_dirty_internal() pokes around in VFS internals.
> Use __set_page_dirty_no_buffers() or create a new library function
> in mm/page-writeback.c.
Problem:
In accordance with reiser4 transactional model every dirty page
should be "captured" by some atom. However, outside reiser4 context
dirty page can not be captured in some cases, as it is accompanied
with specific work (jnode creation, etc). Reiser4 recognizes such
"anonymous" pages (i.e. pages that were dirtied outside of reiser4)
by the tag PAGECACHE_TAG_DIRTY. Pages dirtied inside reiser4 context
are not tagged at all: we don't need this. Indeed, once page is
dirtied and captured, it is attached to a jnode (a special header
to keep a track of transactions).
reiser4_set_page_dirty_internal() was the internal reiser4 function
that set dirty bit without tagging the page. Having such internal
function led to real problems (incorrect task io accounting, etc.
because of not updating this internal "friend").
Solution:
The following patch adds a core library function that sets a dirty
bit without tagging the page. It should be modified simultaneously
with its "friends": __set_page_dirty_nobuffers, __set_page_dirty.
Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com>
---
include/linux/mm.h | 1 +
mm/page-writeback.c | 28 ++++++++++++++++++++++++++++
2 files changed, 29 insertions(+)
--- mmotm.orig/include/linux/mm.h
+++ mmotm/include/linux/mm.h
@@ -841,6 +841,7 @@ int redirty_page_for_writepage(struct wr
struct page *page);
int set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
+int set_page_dirty_notag(struct page *page);
int clear_page_dirty_for_io(struct page *page);
extern unsigned long move_page_tables(struct vm_area_struct *vma,
--- mmotm.orig/mm/page-writeback.c
+++ mmotm/mm/page-writeback.c
@@ -1248,6 +1248,34 @@ int __set_page_dirty_nobuffers(struct pa
EXPORT_SYMBOL(__set_page_dirty_nobuffers);
/*
+ * The same as __set_page_dirty_nobuffers, but this function
+ * 1) doesn't tag the page in its radix tree;
+ * 2) makes an assumption that there is no races with truncate.
+ * 3) is not for anonymous or swap pages.
+ */
+int set_page_dirty_notag(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+
+ if (!TestSetPageDirty(page)) {
+ WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
+ if (mapping_cap_account_dirty(mapping)) {
+ preempt_disable();
+ __inc_zone_page_state(page, NR_FILE_DIRTY);
+ __inc_bdi_stat(mapping->backing_dev_info,
+ BDI_RECLAIMABLE);
+ task_dirty_inc(current);
+ task_io_account_write(PAGE_CACHE_SIZE);
+ preempt_enable();
+ }
+ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+ return 1;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(set_page_dirty_notag);
+
+/*
* When a writepage implementation decides that it doesn't want to write this
* page for some reason, it should redirty the locked page via
* redirty_page_for_writepage() and it should then unlock the page and return 0
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [patch 2/4] vfs: add set_page_dirty_notag 2009-02-13 11:56 [patch 2/4] vfs: add set_page_dirty_notag Edward Shishkin @ 2009-02-13 13:08 ` Peter Zijlstra 2009-02-13 13:57 ` Edward Shishkin 0 siblings, 1 reply; 24+ messages in thread From: Peter Zijlstra @ 2009-02-13 13:08 UTC (permalink / raw) To: Edward Shishkin Cc: Andrew Morton, Nick Piggin, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List On Fri, 2009-02-13 at 14:56 +0300, Edward Shishkin wrote: > --- > include/linux/mm.h | 1 + > mm/page-writeback.c | 28 ++++++++++++++++++++++++++++ > 2 files changed, 29 insertions(+) > > --- mmotm.orig/include/linux/mm.h > +++ mmotm/include/linux/mm.h > @@ -841,6 +841,7 @@ int redirty_page_for_writepage(struct wr > struct page *page); > int set_page_dirty(struct page *page); > int set_page_dirty_lock(struct page *page); > +int set_page_dirty_notag(struct page *page); > int clear_page_dirty_for_io(struct page *page); > > extern unsigned long move_page_tables(struct vm_area_struct *vma, > --- mmotm.orig/mm/page-writeback.c > +++ mmotm/mm/page-writeback.c > @@ -1248,6 +1248,34 @@ int __set_page_dirty_nobuffers(struct pa > EXPORT_SYMBOL(__set_page_dirty_nobuffers); > > /* > + * The same as __set_page_dirty_nobuffers, but this function > + * 1) doesn't tag the page in its radix tree; > + * 2) makes an assumption that there is no races with truncate. > + * 3) is not for anonymous or swap pages. > + */ > +int set_page_dirty_notag(struct page *page) > +{ > + struct address_space *mapping = page->mapping; > + > + if (!TestSetPageDirty(page)) { > + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); > + if (mapping_cap_account_dirty(mapping)) { > + preempt_disable(); > + __inc_zone_page_state(page, NR_FILE_DIRTY); > + __inc_bdi_stat(mapping->backing_dev_info, > + BDI_RECLAIMABLE); > + task_dirty_inc(current); > + task_io_account_write(PAGE_CACHE_SIZE); > + preempt_enable(); > + } > + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); > + return 1; > + } > + return 0; > +} > +EXPORT_SYMBOL(set_page_dirty_notag); > + Eew, so reiser4 will totally side-step the regular vm inode writeback paths -- or is this fixed by a more elaborate than usual a_ops->writepages() ? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 2/4] vfs: add set_page_dirty_notag 2009-02-13 13:08 ` Peter Zijlstra @ 2009-02-13 13:57 ` Edward Shishkin 2009-02-13 14:09 ` Peter Zijlstra 0 siblings, 1 reply; 24+ messages in thread From: Edward Shishkin @ 2009-02-13 13:57 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrew Morton, Nick Piggin, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List Peter Zijlstra wrote: > On Fri, 2009-02-13 at 14:56 +0300, Edward Shishkin wrote: > > >> --- >> include/linux/mm.h | 1 + >> mm/page-writeback.c | 28 ++++++++++++++++++++++++++++ >> 2 files changed, 29 insertions(+) >> >> --- mmotm.orig/include/linux/mm.h >> +++ mmotm/include/linux/mm.h >> @@ -841,6 +841,7 @@ int redirty_page_for_writepage(struct wr >> struct page *page); >> int set_page_dirty(struct page *page); >> int set_page_dirty_lock(struct page *page); >> +int set_page_dirty_notag(struct page *page); >> int clear_page_dirty_for_io(struct page *page); >> >> extern unsigned long move_page_tables(struct vm_area_struct *vma, >> --- mmotm.orig/mm/page-writeback.c >> +++ mmotm/mm/page-writeback.c >> @@ -1248,6 +1248,34 @@ int __set_page_dirty_nobuffers(struct pa >> EXPORT_SYMBOL(__set_page_dirty_nobuffers); >> >> /* >> + * The same as __set_page_dirty_nobuffers, but this function >> + * 1) doesn't tag the page in its radix tree; >> + * 2) makes an assumption that there is no races with truncate. >> + * 3) is not for anonymous or swap pages. >> + */ >> +int set_page_dirty_notag(struct page *page) >> +{ >> + struct address_space *mapping = page->mapping; >> + >> + if (!TestSetPageDirty(page)) { >> + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); >> + if (mapping_cap_account_dirty(mapping)) { >> + preempt_disable(); >> + __inc_zone_page_state(page, NR_FILE_DIRTY); >> + __inc_bdi_stat(mapping->backing_dev_info, >> + BDI_RECLAIMABLE); >> + task_dirty_inc(current); >> + task_io_account_write(PAGE_CACHE_SIZE); >> + preempt_enable(); >> + } >> + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); >> + return 1; >> + } >> + return 0; >> +} >> +EXPORT_SYMBOL(set_page_dirty_notag); >> + >> > > Eew, so reiser4 will totally side-step the regular vm inode writeback > paths -- or is this fixed by a more elaborate than usual > a_ops->writepages() ? > The second. reiser4_writepages() catches the anonymous (tagged) pages, captures them mandatory, then commits all atoms of the file. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 2/4] vfs: add set_page_dirty_notag 2009-02-13 13:57 ` Edward Shishkin @ 2009-02-13 14:09 ` Peter Zijlstra 2009-02-14 13:11 ` Edward Shishkin 0 siblings, 1 reply; 24+ messages in thread From: Peter Zijlstra @ 2009-02-13 14:09 UTC (permalink / raw) To: Edward Shishkin Cc: Andrew Morton, Nick Piggin, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List On Fri, 2009-02-13 at 16:57 +0300, Edward Shishkin wrote: > > > > Eew, so reiser4 will totally side-step the regular vm inode > writeback > > paths -- or is this fixed by a more elaborate than usual > > a_ops->writepages() ? > > > The second. > > reiser4_writepages() catches the anonymous (tagged) > pages, captures them mandatory, then commits all atoms > of the file. OK, can you then make it painfully clear in the function comment, esp. since you export this function? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 2/4] vfs: add set_page_dirty_notag 2009-02-13 14:09 ` Peter Zijlstra @ 2009-02-14 13:11 ` Edward Shishkin 2009-02-14 21:11 ` Peter Zijlstra 0 siblings, 1 reply; 24+ messages in thread From: Edward Shishkin @ 2009-02-14 13:11 UTC (permalink / raw) To: Peter Zijlstra Cc: Edward Shishkin, Andrew Morton, Nick Piggin, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List Peter Zijlstra writes: > On Fri, 2009-02-13 at 16:57 +0300, Edward Shishkin wrote: > > > > > > Eew, so reiser4 will totally side-step the regular vm inode > > writeback > > > paths -- or is this fixed by a more elaborate than usual > > > a_ops->writepages() ? > > > > > The second. > > > > reiser4_writepages() catches the anonymous (tagged) > > pages, captures them mandatory, then commits all atoms > > of the file. > > OK, can you then make it painfully clear in the function comment, esp. > since you export this function? Hello. Does it look better? Thanks, Edward. This is a fixup for the following "todo": akpm wrote: > reiser4_set_page_dirty_internal() pokes around in VFS internals. > Use __set_page_dirty_no_buffers() or create a new library function > in mm/page-writeback.c. Problem: In accordance with reiser4 transactional model every dirty page should be "captured" by some atom. However, outside reiser4 context dirty page can not be captured in some cases, as it is accompanied with specific work (jnode creation, etc). Reiser4 recognizes such "anonymous" pages (i.e. pages that were dirtied outside of reiser4) by the tag PAGECACHE_TAG_DIRTY. Pages dirtied inside reiser4 context are not tagged at all: we don't need this. Indeed, once page is dirtied and captured, it is attached to a jnode (a special header to keep a track of transactions). reiser4_set_page_dirty_internal() was the internal reiser4 function that set dirty bit without tagging the page. Having such internal function led to real problems (incorrect task io accounting, etc. because of not updating this internal "friend"). Solution: The following patch adds a core library function that sets a dirty bit without tagging the page. It should be modified simultaneously with its "friends": __set_page_dirty_nobuffers, __set_page_dirty. Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com> --- include/linux/mm.h | 1 + mm/page-writeback.c | 40 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 41 insertions(+) --- mmotm.orig/include/linux/mm.h +++ mmotm/include/linux/mm.h @@ -841,6 +841,7 @@ int redirty_page_for_writepage(struct wr struct page *page); int set_page_dirty(struct page *page); int set_page_dirty_lock(struct page *page); +int set_page_dirty_notag(struct page *page); int clear_page_dirty_for_io(struct page *page); extern unsigned long move_page_tables(struct vm_area_struct *vma, --- mmotm.orig/mm/page-writeback.c +++ mmotm/mm/page-writeback.c @@ -1248,6 +1248,46 @@ int __set_page_dirty_nobuffers(struct pa EXPORT_SYMBOL(__set_page_dirty_nobuffers); /* + * Some filesystems, which don't use buffers, provide their own + * writeback means. And it can happen, that even dirty tag, which + * is used by generic methods is not needed. In this case it would + * be reasonably to use the following lightweight version of + * __set_page_dirty_nobuffers: + * + * Don't tag page as dirty in its radix tree, just set dirty bit + * and update the accounting. + * NOTE: This function also doesn't take care of races, i.e. the + * caller should guarantee that page can not be truncated. + */ +int set_page_dirty_notag(struct page *page) +{ + struct address_space *mapping = page->mapping; + + if (!TestSetPageDirty(page)) { + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); + if (mapping_cap_account_dirty(mapping)) { + /* + * Since we don't tag the page as dirty, + * acquiring the tree_lock is replaced + * with disabling preemption to protect + * per-cpu data used for accounting. + */ + preempt_disable(); + __inc_zone_page_state(page, NR_FILE_DIRTY); + __inc_bdi_stat(mapping->backing_dev_info, + BDI_RECLAIMABLE); + task_dirty_inc(current); + task_io_account_write(PAGE_CACHE_SIZE); + preempt_enable(); + } + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); + return 1; + } + return 0; +} +EXPORT_SYMBOL(set_page_dirty_notag); + +/* * When a writepage implementation decides that it doesn't want to write this * page for some reason, it should redirty the locked page via * redirty_page_for_writepage() and it should then unlock the page and return 0 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 2/4] vfs: add set_page_dirty_notag 2009-02-14 13:11 ` Edward Shishkin @ 2009-02-14 21:11 ` Peter Zijlstra 2009-02-16 22:43 ` Edward Shishkin 0 siblings, 1 reply; 24+ messages in thread From: Peter Zijlstra @ 2009-02-14 21:11 UTC (permalink / raw) To: Edward Shishkin Cc: Andrew Morton, Nick Piggin, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List On Sat, 2009-02-14 at 16:11 +0300, Edward Shishkin wrote: > Peter Zijlstra writes: > > On Fri, 2009-02-13 at 16:57 +0300, Edward Shishkin wrote: > > > > > > > > Eew, so reiser4 will totally side-step the regular vm inode > > > writeback > > > > paths -- or is this fixed by a more elaborate than usual > > > > a_ops->writepages() ? > > > > > > > The second. > > > > > > reiser4_writepages() catches the anonymous (tagged) > > > pages, captures them mandatory, then commits all atoms > > > of the file. > > > > OK, can you then make it painfully clear in the function comment, esp. > > since you export this function? > > Hello. > Does it look better? > > Thanks, > Edward. > > This is a fixup for the following "todo": > akpm wrote: > > reiser4_set_page_dirty_internal() pokes around in VFS internals. > > Use __set_page_dirty_no_buffers() or create a new library function > > in mm/page-writeback.c. > > Problem: > > In accordance with reiser4 transactional model every dirty page > should be "captured" by some atom. However, outside reiser4 context > dirty page can not be captured in some cases, as it is accompanied > with specific work (jnode creation, etc). Reiser4 recognizes such > "anonymous" pages (i.e. pages that were dirtied outside of reiser4) > by the tag PAGECACHE_TAG_DIRTY. Pages dirtied inside reiser4 context > are not tagged at all: we don't need this. Indeed, once page is > dirtied and captured, it is attached to a jnode (a special header > to keep a track of transactions). > > reiser4_set_page_dirty_internal() was the internal reiser4 function > that set dirty bit without tagging the page. Having such internal > function led to real problems (incorrect task io accounting, etc. > because of not updating this internal "friend"). > > Solution: > > The following patch adds a core library function that sets a dirty > bit without tagging the page. It should be modified simultaneously > with its "friends": __set_page_dirty_nobuffers, __set_page_dirty. > > Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com> > --- > include/linux/mm.h | 1 + > mm/page-writeback.c | 40 ++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 41 insertions(+) > > --- mmotm.orig/include/linux/mm.h > +++ mmotm/include/linux/mm.h > @@ -841,6 +841,7 @@ int redirty_page_for_writepage(struct wr > struct page *page); > int set_page_dirty(struct page *page); > int set_page_dirty_lock(struct page *page); > +int set_page_dirty_notag(struct page *page); > int clear_page_dirty_for_io(struct page *page); > > extern unsigned long move_page_tables(struct vm_area_struct *vma, > --- mmotm.orig/mm/page-writeback.c > +++ mmotm/mm/page-writeback.c > @@ -1248,6 +1248,46 @@ int __set_page_dirty_nobuffers(struct pa > EXPORT_SYMBOL(__set_page_dirty_nobuffers); > > /* > + * Some filesystems, which don't use buffers, provide their own > + * writeback means. And it can happen, that even dirty tag, which > + * is used by generic methods is not needed. In this case it would > + * be reasonably to use the following lightweight version of > + * __set_page_dirty_nobuffers: > + * > + * Don't tag page as dirty in its radix tree, just set dirty bit > + * and update the accounting. > + * NOTE: This function also doesn't take care of races, i.e. the > + * caller should guarantee that page can not be truncated. > + */ Maybe something like /* * set_page_dirty_notag() -- similar to __set_page_dirty_nobuffers() * except it doesn't tag the page dirty in the page-cache radix tree. * This means that the address space using this cannot use the regular * filemap ->writepages() helpers and must provide its own means of * tracking and finding dirty pages. * * NOTE: furthermore, this version also doesn't handle truncate races. */ > +int set_page_dirty_notag(struct page *page) > +{ > + struct address_space *mapping = page->mapping; > + > + if (!TestSetPageDirty(page)) { > + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); > + if (mapping_cap_account_dirty(mapping)) { > + /* > + * Since we don't tag the page as dirty, > + * acquiring the tree_lock is replaced > + * with disabling preemption to protect > + * per-cpu data used for accounting. > + */ This should be local_irq_save(flags) > + preempt_disable(); > + __inc_zone_page_state(page, NR_FILE_DIRTY); > + __inc_bdi_stat(mapping->backing_dev_info, > + BDI_RECLAIMABLE); > + task_dirty_inc(current); > + task_io_account_write(PAGE_CACHE_SIZE); > + preempt_enable(); local_irq_restore() These accounting functions rely on being atomic wrt interrupts. > + } > + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); > + return 1; > + } > + return 0; > +} > +EXPORT_SYMBOL(set_page_dirty_notag); How much performance gain do you see by avoiding that radix tree op? I take it the only reason you don't use the regular __set_page_dirty_nobuffers() and just clear the tag when you do the write-out by whatever alternative means you have to find the page, is that it gains you some performance. It would be good to have some numbers to judge this on. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 2/4] vfs: add set_page_dirty_notag 2009-02-14 21:11 ` Peter Zijlstra @ 2009-02-16 22:43 ` Edward Shishkin 2009-02-17 9:09 ` Peter Zijlstra 2009-02-17 22:35 ` [patch 2/4] vfs: add set_page_dirty_notag Andrew Morton 0 siblings, 2 replies; 24+ messages in thread From: Edward Shishkin @ 2009-02-16 22:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Edward Shishkin, Andrew Morton, Nick Piggin, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List Peter Zijlstra writes: [...] > Maybe something like > > /* > * set_page_dirty_notag() -- similar to __set_page_dirty_nobuffers() > * except it doesn't tag the page dirty in the page-cache radix tree. > * This means that the address space using this cannot use the regular > * filemap ->writepages() helpers and must provide its own means of > * tracking and finding dirty pages. > * > * NOTE: furthermore, this version also doesn't handle truncate races. > */ > Yup, your version is better. I just have replaced ^finding dirty pages^finding non-tagged dirty pages, see below.. > > +int set_page_dirty_notag(struct page *page) > > +{ > > + struct address_space *mapping = page->mapping; > > + > > + if (!TestSetPageDirty(page)) { > > + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); > > + if (mapping_cap_account_dirty(mapping)) { > > + /* > > + * Since we don't tag the page as dirty, > > + * acquiring the tree_lock is replaced > > + * with disabling preemption to protect > > + * per-cpu data used for accounting. > > + */ > > This should be local_irq_save(flags) > > > + preempt_disable(); > > + __inc_zone_page_state(page, NR_FILE_DIRTY); > > + __inc_bdi_stat(mapping->backing_dev_info, > > + BDI_RECLAIMABLE); > > + task_dirty_inc(current); > > + task_io_account_write(PAGE_CACHE_SIZE); > > + preempt_enable(); > > local_irq_restore() > > These accounting functions rely on being atomic wrt interrupts. > Thanks! > > + } > > + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); > > + return 1; > > + } > > + return 0; > > +} > > +EXPORT_SYMBOL(set_page_dirty_notag); > > How much performance gain do you see by avoiding that radix tree op? > Nop. We want to use it with extended semantics. All dirty pages are divided into 2 categories: A) tagged in the radix tree (with PAGECACHE_TAG_DIRTY). B) captured by atoms (usual linked lists). reiser4_writepages() looks for pages of "A" in the radix tree and moves them to "B". set_page_dirty_notag(), introduced by my patch, is needed for pages of "B". If "B" is empty, then we get the traditional semantics with regular ->writepages(). That's all! > I take it the only reason you don't use the regular > __set_page_dirty_nobuffers() and just clear the tag when you do the > write-out by whatever alternative means you have to find the page, is > that it gains you some performance. > > It would be good to have some numbers to judge this on. > So, performance is not a concern. We want to extend functionality, which is currently restricted by the unnecessary(*) property "dirty bit is set <=> dirty tag is installed". (*) Reiser4 successfully uses such design since 2002. I don't remember any BUG_ONs, and related problems. Please, consider the appended patch. Thanks, Edward. Add set_page_dirty_notag() to the core library to enable extended functionality of radix tree attached to inode->i_mapping. Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com> --- include/linux/mm.h | 1 + mm/page-writeback.c | 36 ++++++++++++++++++++++++++++++++++++ 2 files changed, 37 insertions(+) --- mmotm.orig/include/linux/mm.h +++ mmotm/include/linux/mm.h @@ -841,6 +841,7 @@ int redirty_page_for_writepage(struct wr struct page *page); int set_page_dirty(struct page *page); int set_page_dirty_lock(struct page *page); +int set_page_dirty_notag(struct page *page); int clear_page_dirty_for_io(struct page *page); extern unsigned long move_page_tables(struct vm_area_struct *vma, --- mmotm.orig/mm/page-writeback.c +++ mmotm/mm/page-writeback.c @@ -1248,6 +1248,42 @@ int __set_page_dirty_nobuffers(struct pa EXPORT_SYMBOL(__set_page_dirty_nobuffers); /* + * set_page_dirty_notag() -- similar to __set_page_dirty_nobuffers() + * except it doesn't tag the page dirty in the page-cache radix tree. + * This means that the address space using this cannot use the regular + * filemap ->writepages() helpers and must provide its own means of + * tracking and finding non-tagged dirty pages. + * + * NOTE: furthermore, this version also doesn't handle truncate races. + */ +int set_page_dirty_notag(struct page *page) +{ + struct address_space *mapping = page->mapping; + + if (!TestSetPageDirty(page)) { + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); + if (mapping_cap_account_dirty(mapping)) { + /* + * The accounting functions rely on + * being atomic wrt interrupts. + */ + unsigned long flags; + local_irq_save(flags); + __inc_zone_page_state(page, NR_FILE_DIRTY); + __inc_bdi_stat(mapping->backing_dev_info, + BDI_RECLAIMABLE); + task_dirty_inc(current); + task_io_account_write(PAGE_CACHE_SIZE); + local_irq_restore(flags); + } + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); + return 1; + } + return 0; +} +EXPORT_SYMBOL(set_page_dirty_notag); + +/* * When a writepage implementation decides that it doesn't want to write this * page for some reason, it should redirty the locked page via * redirty_page_for_writepage() and it should then unlock the page and return 0 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 2/4] vfs: add set_page_dirty_notag 2009-02-16 22:43 ` Edward Shishkin @ 2009-02-17 9:09 ` Peter Zijlstra 2009-02-17 9:38 ` Nick Piggin 2009-02-17 22:35 ` [patch 2/4] vfs: add set_page_dirty_notag Andrew Morton 1 sibling, 1 reply; 24+ messages in thread From: Peter Zijlstra @ 2009-02-17 9:09 UTC (permalink / raw) To: Edward Shishkin Cc: Andrew Morton, Nick Piggin, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List On Tue, 2009-02-17 at 01:43 +0300, Edward Shishkin wrote: > > How much performance gain do you see by avoiding that radix tree op? > > > > Nop. We want to use it with extended semantics. > All dirty pages are divided into 2 categories: > > A) tagged in the radix tree (with PAGECACHE_TAG_DIRTY). > B) captured by atoms (usual linked lists). > > reiser4_writepages() looks for pages of "A" in the radix tree > and moves them to "B". set_page_dirty_notag(), introduced by > my patch, is needed for pages of "B". > > If "B" is empty, then we get the traditional semantics with > regular ->writepages(). > > That's all! Ah, indeed. I had not considered such a scheme. > Add set_page_dirty_notag() to the core library to enable > extended functionality of radix tree attached to inode->i_mapping. > > Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com> Looks good to me Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > --- > include/linux/mm.h | 1 + > mm/page-writeback.c | 36 ++++++++++++++++++++++++++++++++++++ > 2 files changed, 37 insertions(+) > > --- mmotm.orig/include/linux/mm.h > +++ mmotm/include/linux/mm.h > @@ -841,6 +841,7 @@ int redirty_page_for_writepage(struct wr > struct page *page); > int set_page_dirty(struct page *page); > int set_page_dirty_lock(struct page *page); > +int set_page_dirty_notag(struct page *page); > int clear_page_dirty_for_io(struct page *page); > > extern unsigned long move_page_tables(struct vm_area_struct *vma, > --- mmotm.orig/mm/page-writeback.c > +++ mmotm/mm/page-writeback.c > @@ -1248,6 +1248,42 @@ int __set_page_dirty_nobuffers(struct pa > EXPORT_SYMBOL(__set_page_dirty_nobuffers); > > /* > + * set_page_dirty_notag() -- similar to __set_page_dirty_nobuffers() > + * except it doesn't tag the page dirty in the page-cache radix tree. > + * This means that the address space using this cannot use the regular > + * filemap ->writepages() helpers and must provide its own means of > + * tracking and finding non-tagged dirty pages. > + * > + * NOTE: furthermore, this version also doesn't handle truncate races. > + */ > +int set_page_dirty_notag(struct page *page) > +{ > + struct address_space *mapping = page->mapping; > + > + if (!TestSetPageDirty(page)) { > + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); > + if (mapping_cap_account_dirty(mapping)) { > + /* > + * The accounting functions rely on > + * being atomic wrt interrupts. > + */ > + unsigned long flags; > + local_irq_save(flags); > + __inc_zone_page_state(page, NR_FILE_DIRTY); > + __inc_bdi_stat(mapping->backing_dev_info, > + BDI_RECLAIMABLE); > + task_dirty_inc(current); > + task_io_account_write(PAGE_CACHE_SIZE); > + local_irq_restore(flags); > + } > + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); > + return 1; > + } > + return 0; > +} > +EXPORT_SYMBOL(set_page_dirty_notag); > + > +/* > * When a writepage implementation decides that it doesn't want to write this > * page for some reason, it should redirty the locked page via > * redirty_page_for_writepage() and it should then unlock the page and return 0 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 2/4] vfs: add set_page_dirty_notag 2009-02-17 9:09 ` Peter Zijlstra @ 2009-02-17 9:38 ` Nick Piggin 2009-02-17 10:05 ` Peter Zijlstra 0 siblings, 1 reply; 24+ messages in thread From: Nick Piggin @ 2009-02-17 9:38 UTC (permalink / raw) To: Peter Zijlstra Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List On Tue, Feb 17, 2009 at 10:09:41AM +0100, Peter Zijlstra wrote: > On Tue, 2009-02-17 at 01:43 +0300, Edward Shishkin wrote: > > > > How much performance gain do you see by avoiding that radix tree op? > > > > > > > Nop. We want to use it with extended semantics. > > All dirty pages are divided into 2 categories: > > > > A) tagged in the radix tree (with PAGECACHE_TAG_DIRTY). > > B) captured by atoms (usual linked lists). > > > > reiser4_writepages() looks for pages of "A" in the radix tree > > and moves them to "B". set_page_dirty_notag(), introduced by > > my patch, is needed for pages of "B". > > > > If "B" is empty, then we get the traditional semantics with > > regular ->writepages(). > > > > That's all! > > Ah, indeed. I had not considered such a scheme. It is a great shame that filesystems are not properly notified that a page may become dirty before the actual set_page_dirty event (which is not allowed to fail and is called after the page is already dirty). This is a big problem I have with fsblock simply in trying to make the memory allocation robust. page_mkwrite unfortunately is racy and I've fixed problems there... the big problem though is get_user_pages. Fixing that properly seems to require fixing callers so it is not really realistic in the short term. As such... > > Add set_page_dirty_notag() to the core library to enable > > extended functionality of radix tree attached to inode->i_mapping. > > > > Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com> > > Looks good to me > > Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> I'm fine with it too. Acked-by: Nick Piggin <npiggin@suse.de> You know... it wouldn't be terribly painful to introduce a new pagecache radix tree tag for filesystem private use (doesn't bloat the radix tree node size) if there is a strong need for it. But if you have this workaround then I think it is also reasonable. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 2/4] vfs: add set_page_dirty_notag 2009-02-17 9:38 ` Nick Piggin @ 2009-02-17 10:05 ` Peter Zijlstra 2009-02-17 10:24 ` Nick Piggin 0 siblings, 1 reply; 24+ messages in thread From: Peter Zijlstra @ 2009-02-17 10:05 UTC (permalink / raw) To: Nick Piggin Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List On Tue, 2009-02-17 at 10:38 +0100, Nick Piggin wrote: > It is a great shame that filesystems are not properly notified > that a page may become dirty before the actual set_page_dirty > event (which is not allowed to fail and is called after the > page is already dirty). Not quite true, for example the set_page_dirty() done by the write fault code is done _before_ the page becomes dirty. This before/after thing was the reason for that horrid file corruption bug that dragged on for a few weeks back in .19 (IIRC). > This is a big problem I have with fsblock simply in trying to > make the memory allocation robust. page_mkwrite unfortunately > is racy and I've fixed problems there... the big problem though > is get_user_pages. Fixing that properly seems to require fixing > callers so it is not really realistic in the short term. Right, I'm just not sure what we can do, even with a prepage_page_dirty() function, what are you going to do, fail the fault? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 2/4] vfs: add set_page_dirty_notag 2009-02-17 10:05 ` Peter Zijlstra @ 2009-02-17 10:24 ` Nick Piggin 2009-02-17 10:40 ` set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag) Peter Zijlstra 0 siblings, 1 reply; 24+ messages in thread From: Nick Piggin @ 2009-02-17 10:24 UTC (permalink / raw) To: Peter Zijlstra Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List On Tue, Feb 17, 2009 at 11:05:16AM +0100, Peter Zijlstra wrote: > On Tue, 2009-02-17 at 10:38 +0100, Nick Piggin wrote: > > > It is a great shame that filesystems are not properly notified > > that a page may become dirty before the actual set_page_dirty > > event (which is not allowed to fail and is called after the > > page is already dirty). > > Not quite true, for example the set_page_dirty() done by the write fault > code is done _before_ the page becomes dirty. > > This before/after thing was the reason for that horrid file corruption > bug that dragged on for a few weeks back in .19 (IIRC). Yeah, there are actually races though. The page can become cleaned before set_page_dirty is reached, and there are also nasty races with truncate. > > This is a big problem I have with fsblock simply in trying to > > make the memory allocation robust. page_mkwrite unfortunately > > is racy and I've fixed problems there... the big problem though > > is get_user_pages. Fixing that properly seems to require fixing > > callers so it is not really realistic in the short term. > > Right, I'm just not sure what we can do, even with a > prepage_page_dirty() function, what are you going to do, fail the fault? Oh, for regular page fault functions using page_mkwrite, they definitely want to fail the fault with a SIGBUS, and actually XFS already does that (for fsblock robust memory allocations you would also want to fail OOM on metadata allocation failure). What is the other option? Silently fail the write? For XFS purpose (ie. -ENOSPC handling), the current code is reasonable although there could be some truncate races with block allocation. But mostly probably works. For something like fsblock it can be much more common to have the metadata refcount reach 0 and freed before spd is called. In that case the code actually goes into a bug situation so it is a bit more critical. But no that's the "easy" part. The hard part is get_user_pages because the caller can hold onto the page indefinitely simply with a refcount, and go along happily dirtying it at any stage (actually writing to the page memory) before actually calling set_page_dirty. The "cleanest" way to fix this from VM point of view is probably to force gup callers to hold the page locked for the duration to prevent truncation or writeout after the filesystem notification. Don't know if that would be very popular, however. ^ permalink raw reply [flat|nested] 24+ messages in thread
* set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag) 2009-02-17 10:24 ` Nick Piggin @ 2009-02-17 10:40 ` Peter Zijlstra 2009-02-17 11:25 ` Nick Piggin 0 siblings, 1 reply; 24+ messages in thread From: Peter Zijlstra @ 2009-02-17 10:40 UTC (permalink / raw) To: Nick Piggin Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List On Tue, 2009-02-17 at 11:24 +0100, Nick Piggin wrote: > On Tue, Feb 17, 2009 at 11:05:16AM +0100, Peter Zijlstra wrote: > > On Tue, 2009-02-17 at 10:38 +0100, Nick Piggin wrote: > > > > > It is a great shame that filesystems are not properly notified > > > that a page may become dirty before the actual set_page_dirty > > > event (which is not allowed to fail and is called after the > > > page is already dirty). > > > > Not quite true, for example the set_page_dirty() done by the write fault > > code is done _before_ the page becomes dirty. > > > > This before/after thing was the reason for that horrid file corruption > > bug that dragged on for a few weeks back in .19 (IIRC). > > Yeah, there are actually races though. The page can become cleaned > before set_page_dirty is reached, and there are also nasty races with > truncate. Hmm, so you're saying that never got properly fixed? > > > This is a big problem I have with fsblock simply in trying to > > > make the memory allocation robust. page_mkwrite unfortunately > > > is racy and I've fixed problems there... the big problem though > > > is get_user_pages. Fixing that properly seems to require fixing > > > callers so it is not really realistic in the short term. > > > > Right, I'm just not sure what we can do, even with a > > prepage_page_dirty() function, what are you going to do, fail the fault? > > Oh, for regular page fault functions using page_mkwrite, they > definitely want to fail the fault with a SIGBUS, and actually XFS > already does that (for fsblock robust memory allocations you > would also want to fail OOM on metadata allocation failure). What > is the other option? Silently fail the write? OK, agreed. > For XFS purpose (ie. -ENOSPC handling), the current code is reasonable > although there could be some truncate races with block allocation. But > mostly probably works. For something like fsblock it can be much more > common to have the metadata refcount reach 0 and freed before spd is > called. In that case the code actually goes into a bug situation so it > is a bit more critical. > > But no that's the "easy" part. The hard part is get_user_pages > because the caller can hold onto the page indefinitely simply with a > refcount, and go along happily dirtying it at any stage (actually > writing to the page memory) before actually calling set_page_dirty. Should a gup user not specify .write=1 if it wants to dirty the page, at which point the follow_page() will do the dirty-fault thingy. Ah, but then we can clean it because we're not holding the page-lock. I see. > The "cleanest" way to fix this from VM point of view is probably to > force gup callers to hold the page locked for the duration to > prevent truncation or writeout after the filesystem notification. > Don't know if that would be very popular, however. Right, so you'd want to keep the page locked over gup(.write=1) sections. So should we extend the gup() with put_user_page()? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag) 2009-02-17 10:40 ` set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag) Peter Zijlstra @ 2009-02-17 11:25 ` Nick Piggin 2009-02-17 11:39 ` Peter Zijlstra 0 siblings, 1 reply; 24+ messages in thread From: Nick Piggin @ 2009-02-17 11:25 UTC (permalink / raw) To: Peter Zijlstra Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List On Tue, Feb 17, 2009 at 11:40:00AM +0100, Peter Zijlstra wrote: > On Tue, 2009-02-17 at 11:24 +0100, Nick Piggin wrote: > > On Tue, Feb 17, 2009 at 11:05:16AM +0100, Peter Zijlstra wrote: > > > On Tue, 2009-02-17 at 10:38 +0100, Nick Piggin wrote: > > > > > > > It is a great shame that filesystems are not properly notified > > > > that a page may become dirty before the actual set_page_dirty > > > > event (which is not allowed to fail and is called after the > > > > page is already dirty). > > > > > > Not quite true, for example the set_page_dirty() done by the write fault > > > code is done _before_ the page becomes dirty. > > > > > > This before/after thing was the reason for that horrid file corruption > > > bug that dragged on for a few weeks back in .19 (IIRC). > > > > Yeah, there are actually races though. The page can become cleaned > > before set_page_dirty is reached, and there are also nasty races with > > truncate. > > Hmm, so you're saying that never got properly fixed? For the purposes of dirty accounting and syncing they are OK I think. The problem is with page_mkwrite which is just another thing in the equation. Unfortunately it was introduced without a very clear statement of what kind of locking and concurrency it was expected, and a vauge promise that the filesystem would be able to handle synchronisation with the pagecache and virtual memory (yeah right). I should hopefully get around to splitting up fsblock and submitting some of the generic code changes (there aren't too many but mainly giving filesystems better and and less racy dirty page handling ability). > > For XFS purpose (ie. -ENOSPC handling), the current code is reasonable > > although there could be some truncate races with block allocation. But > > mostly probably works. For something like fsblock it can be much more > > common to have the metadata refcount reach 0 and freed before spd is > > called. In that case the code actually goes into a bug situation so it > > is a bit more critical. > > > > But no that's the "easy" part. The hard part is get_user_pages > > because the caller can hold onto the page indefinitely simply with a > > refcount, and go along happily dirtying it at any stage (actually > > writing to the page memory) before actually calling set_page_dirty. > > Should a gup user not specify .write=1 if it wants to dirty the page, at > which point the follow_page() will do the dirty-fault thingy. > > Ah, but then we can clean it because we're not holding the page-lock. I > see. Yep. > > The "cleanest" way to fix this from VM point of view is probably to > > force gup callers to hold the page locked for the duration to > > prevent truncation or writeout after the filesystem notification. > > Don't know if that would be very popular, however. > > Right, so you'd want to keep the page locked over gup(.write=1) > sections. Something like that. lock_page might cause complaints ;) But that appears to be the easiest and cleanest way, so I don't want to hurt my brain thinking of something better yet! > So should we extend the gup() with put_user_page()? put_user_pages simply defined to do the put_page thing for now would yes allow callers to slowly migrate over and perhaps one day allow a solution. At least it would make things more flexible in case any other issues arise in future. I had a patch... can't find it now. I don't think I'd done many driver conversions. Here is another one to kick it off. Thoughts? -- Introduce put_user_pages function. In order to have more flexibility to deal with issues surrounding get_user_pages difficulties[*], introduce put_user_pages function intended to release pages acquired by get_user_pages. For now, just do the regular put_page thing. If all callers are converted, it could be used to help with such races. In the meantime, it will actually serve as a small extra piece of documentation for the code. [*] eg. get_user_pages caller can bypass page_mkwrite calls into the filesystem to notify of page dirty activity if the page gets cleaned before the caller calls its final set_page_dirty). --- include/linux/mm.h | 1 + mm/memory.c | 13 ++++++++++++- 2 files changed, 13 insertions(+), 1 deletion(-) Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -826,6 +826,7 @@ extern int access_process_vm(struct task int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int len, int write, int force, struct page **pages, struct vm_area_struct **vmas); +void put_user_pages(struct page **pages, int nr); extern int try_to_release_page(struct page * page, gfp_t gfp_mask); extern void do_invalidatepage(struct page *page, unsigned long offset); Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -1370,9 +1370,20 @@ int get_user_pages(struct task_struct *t start, len, flags, pages, vmas); } - EXPORT_SYMBOL(get_user_pages); +/* + * put_user_pages should be used to release pages acquired with get_user_pages. + */ +void put_user_pages(struct page **pages, int nr) +{ + int i; + + for (i = 0; i < nr; i++) + put_page(pages[i]); +} +EXPORT_SYMBOL(put_user_pages); + pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr, spinlock_t **ptl) { ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag) 2009-02-17 11:25 ` Nick Piggin @ 2009-02-17 11:39 ` Peter Zijlstra 2009-02-17 11:55 ` Nick Piggin 0 siblings, 1 reply; 24+ messages in thread From: Peter Zijlstra @ 2009-02-17 11:39 UTC (permalink / raw) To: Nick Piggin Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List On Tue, 2009-02-17 at 12:25 +0100, Nick Piggin wrote: > Introduce put_user_pages function. > > In order to have more flexibility to deal with issues surrounding > get_user_pages difficulties[*], introduce put_user_pages function > intended to release pages acquired by get_user_pages. For now, just > do the regular put_page thing. If all callers are converted, it could > be used to help with such races. In the meantime, it will actually > serve as a small extra piece of documentation for the code. > > [*] eg. get_user_pages caller can bypass page_mkwrite calls into the > filesystem to notify of page dirty activity if the page gets cleaned > before the caller calls its final set_page_dirty). Hmm, if we want to distinguish between .write=1 and .write=0, we would have to pass .write to pup too, right? > --- > include/linux/mm.h | 1 + > mm/memory.c | 13 ++++++++++++- > 2 files changed, 13 insertions(+), 1 deletion(-) > > Index: linux-2.6/include/linux/mm.h > =================================================================== > --- linux-2.6.orig/include/linux/mm.h > +++ linux-2.6/include/linux/mm.h > @@ -826,6 +826,7 @@ extern int access_process_vm(struct task > int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, > int len, int write, int force, struct page **pages, struct vm_area_struct **vmas); > > +void put_user_pages(struct page **pages, int nr); > extern int try_to_release_page(struct page * page, gfp_t gfp_mask); > extern void do_invalidatepage(struct page *page, unsigned long offset); > > Index: linux-2.6/mm/memory.c > =================================================================== > --- linux-2.6.orig/mm/memory.c > +++ linux-2.6/mm/memory.c > @@ -1370,9 +1370,20 @@ int get_user_pages(struct task_struct *t > start, len, flags, > pages, vmas); > } > - > EXPORT_SYMBOL(get_user_pages); > > +/* > + * put_user_pages should be used to release pages acquired with get_user_pages. > + */ > +void put_user_pages(struct page **pages, int nr) > +{ > + int i; > + > + for (i = 0; i < nr; i++) > + put_page(pages[i]); > +} > +EXPORT_SYMBOL(put_user_pages); > + > pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr, > spinlock_t **ptl) > { > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag) 2009-02-17 11:39 ` Peter Zijlstra @ 2009-02-17 11:55 ` Nick Piggin 2009-02-17 12:05 ` Peter Zijlstra 0 siblings, 1 reply; 24+ messages in thread From: Nick Piggin @ 2009-02-17 11:55 UTC (permalink / raw) To: Peter Zijlstra Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List On Tue, Feb 17, 2009 at 12:39:32PM +0100, Peter Zijlstra wrote: > On Tue, 2009-02-17 at 12:25 +0100, Nick Piggin wrote: > > > Introduce put_user_pages function. > > > > In order to have more flexibility to deal with issues surrounding > > get_user_pages difficulties[*], introduce put_user_pages function > > intended to release pages acquired by get_user_pages. For now, just > > do the regular put_page thing. If all callers are converted, it could > > be used to help with such races. In the meantime, it will actually > > serve as a small extra piece of documentation for the code. > > > > [*] eg. get_user_pages caller can bypass page_mkwrite calls into the > > filesystem to notify of page dirty activity if the page gets cleaned > > before the caller calls its final set_page_dirty). > > Hmm, if we want to distinguish between .write=1 and .write=0, we would > have to pass .write to pup too, right? Doh, yeah. I hand edited the patch to put that parameter in, but quilt refresh must have outsmarted me! If nobody thinks it is insane, I'll resend to Andrew in a new thread. > > --- > > include/linux/mm.h | 1 + > > mm/memory.c | 13 ++++++++++++- > > 2 files changed, 13 insertions(+), 1 deletion(-) > > > > Index: linux-2.6/include/linux/mm.h > > =================================================================== > > --- linux-2.6.orig/include/linux/mm.h > > +++ linux-2.6/include/linux/mm.h > > @@ -826,6 +826,7 @@ extern int access_process_vm(struct task > > int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, > > int len, int write, int force, struct page **pages, struct vm_area_struct **vmas); > > > > +void put_user_pages(struct page **pages, int nr); > > extern int try_to_release_page(struct page * page, gfp_t gfp_mask); > > extern void do_invalidatepage(struct page *page, unsigned long offset); > > > > Index: linux-2.6/mm/memory.c > > =================================================================== > > --- linux-2.6.orig/mm/memory.c > > +++ linux-2.6/mm/memory.c > > @@ -1370,9 +1370,20 @@ int get_user_pages(struct task_struct *t > > start, len, flags, > > pages, vmas); > > } > > - > > EXPORT_SYMBOL(get_user_pages); > > > > +/* > > + * put_user_pages should be used to release pages acquired with get_user_pages. > > + */ > > +void put_user_pages(struct page **pages, int nr) > > +{ > > + int i; > > + > > + for (i = 0; i < nr; i++) > > + put_page(pages[i]); > > +} > > +EXPORT_SYMBOL(put_user_pages); > > + > > pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr, > > spinlock_t **ptl) > > { > > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag) 2009-02-17 11:55 ` Nick Piggin @ 2009-02-17 12:05 ` Peter Zijlstra 2009-02-17 12:30 ` Nick Piggin 0 siblings, 1 reply; 24+ messages in thread From: Peter Zijlstra @ 2009-02-17 12:05 UTC (permalink / raw) To: Nick Piggin Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List On Tue, 2009-02-17 at 12:55 +0100, Nick Piggin wrote: > On Tue, Feb 17, 2009 at 12:39:32PM +0100, Peter Zijlstra wrote: > > On Tue, 2009-02-17 at 12:25 +0100, Nick Piggin wrote: > > > > > Introduce put_user_pages function. > > > > > > In order to have more flexibility to deal with issues surrounding > > > get_user_pages difficulties[*], introduce put_user_pages function > > > intended to release pages acquired by get_user_pages. For now, just > > > do the regular put_page thing. If all callers are converted, it could > > > be used to help with such races. In the meantime, it will actually > > > serve as a small extra piece of documentation for the code. > > > > > > [*] eg. get_user_pages caller can bypass page_mkwrite calls into the > > > filesystem to notify of page dirty activity if the page gets cleaned > > > before the caller calls its final set_page_dirty). > > > > Hmm, if we want to distinguish between .write=1 and .write=0, we would > > have to pass .write to pup too, right? > > Doh, yeah. I hand edited the patch to put that parameter in, but quilt > refresh must have outsmarted me! > > If nobody thinks it is insane, I'll resend to Andrew in a new thread. Right, gup_fast() seems to also respect .write properly, so it would also be used to balance that. I guess gup_fast() would need to use trylock_page(), and fall back to the slow path when we start taking PG_locked on .write. I suppose we should start converting a few gup users over to pup before handing the thing to Andrew, to have at least a few examples in-kernel. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag) 2009-02-17 12:05 ` Peter Zijlstra @ 2009-02-17 12:30 ` Nick Piggin 0 siblings, 0 replies; 24+ messages in thread From: Nick Piggin @ 2009-02-17 12:30 UTC (permalink / raw) To: Peter Zijlstra Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap, linux-kernel, ReiserFS Mailing List On Tue, Feb 17, 2009 at 01:05:34PM +0100, Peter Zijlstra wrote: > On Tue, 2009-02-17 at 12:55 +0100, Nick Piggin wrote: > > If nobody thinks it is insane, I'll resend to Andrew in a new thread. > > Right, gup_fast() seems to also respect .write properly, Phew! :) > so it would > also be used to balance that. > > I guess gup_fast() would need to use trylock_page(), and fall back to > the slow path when we start taking PG_locked on .write. Yeah, you're right there. It might also be possible to have a flag somewhere to avoid the lock if the underlying filesystem doesn't have a page_mkwrite or doesn't account dirty... which could avoid the overhead for the common case of anonymous or tmpfs memory. For gup_fast that pretty much implies an extra page flag I think. But let's not get too worried with details... I don't think put_user_pages hurts, even if it only remains as put_page loop. Just to help reader through the page refcounting. > I suppose we should start converting a few gup users over to pup before > handing the thing to Andrew, to have at least a few examples in-kernel. Could do. There are quite a few easy ones. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 2/4] vfs: add set_page_dirty_notag 2009-02-16 22:43 ` Edward Shishkin 2009-02-17 9:09 ` Peter Zijlstra @ 2009-02-17 22:35 ` Andrew Morton 2009-02-18 0:26 ` Edward Shishkin 1 sibling, 1 reply; 24+ messages in thread From: Andrew Morton @ 2009-02-17 22:35 UTC (permalink / raw) To: Edward Shishkin Cc: peterz, edward.shishkin, npiggin, rmh3093, randy.dunlap, linux-kernel, reiserfs-devel On Tue, 17 Feb 2009 01:43:28 +0300 Edward Shishkin <edward.shishkin@gmail.com> wrote: > + */ > +int set_page_dirty_notag(struct page *page) > +{ > + struct address_space *mapping = page->mapping; > + > + if (!TestSetPageDirty(page)) { > + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); > + if (mapping_cap_account_dirty(mapping)) { > + /* > + * The accounting functions rely on > + * being atomic wrt interrupts. > + */ > + unsigned long flags; > + local_irq_save(flags); > + __inc_zone_page_state(page, NR_FILE_DIRTY); > + __inc_bdi_stat(mapping->backing_dev_info, > + BDI_RECLAIMABLE); > + task_dirty_inc(current); > + task_io_account_write(PAGE_CACHE_SIZE); > + local_irq_restore(flags); > + } > + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); > + return 1; > + } > + return 0; > +} I'll maintain this in -mm, alongside the resier4 patches which need it. Of course, this rather obviates the purpose - if someone changes, say, __set_page_dirty_nobuffers() then they won't similarly update set_page_dirty_notag(). Oh well. This problem would fix itself if those two functions were to substantially share code. And I think we can do that - something like static void update_page_accounting(struct page *page, struct address_space *mapping) { __inc_zone_page_state(page, NR_FILE_DIRTY); __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); task_dirty_inc(current); task_io_account_write(PAGE_CACHE_SIZE); } maybe? We could do that as a separate patch and merge it into mainline - it should have zero impact on code generation if gcc does the right thing (please check this). ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 2/4] vfs: add set_page_dirty_notag 2009-02-17 22:35 ` [patch 2/4] vfs: add set_page_dirty_notag Andrew Morton @ 2009-02-18 0:26 ` Edward Shishkin 2009-02-18 0:38 ` Andrew Morton 0 siblings, 1 reply; 24+ messages in thread From: Edward Shishkin @ 2009-02-18 0:26 UTC (permalink / raw) To: Andrew Morton Cc: peterz, npiggin, rmh3093, randy.dunlap, linux-kernel, reiserfs-devel Andrew Morton wrote: > On Tue, 17 Feb 2009 01:43:28 +0300 > Edward Shishkin <edward.shishkin@gmail.com> wrote: > > >> + */ >> +int set_page_dirty_notag(struct page *page) >> +{ >> + struct address_space *mapping = page->mapping; >> + >> + if (!TestSetPageDirty(page)) { >> + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); >> + if (mapping_cap_account_dirty(mapping)) { >> + /* >> + * The accounting functions rely on >> + * being atomic wrt interrupts. >> + */ >> + unsigned long flags; >> + local_irq_save(flags); >> + __inc_zone_page_state(page, NR_FILE_DIRTY); >> + __inc_bdi_stat(mapping->backing_dev_info, >> + BDI_RECLAIMABLE); >> + task_dirty_inc(current); >> + task_io_account_write(PAGE_CACHE_SIZE); >> + local_irq_restore(flags); >> + } >> + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); >> + return 1; >> + } >> + return 0; >> +} >> > > I'll maintain this in -mm, alongside the resier4 patches which need it. > > Of course, this rather obviates the purpose - if someone changes, say, > __set_page_dirty_nobuffers() then they won't similarly update > set_page_dirty_notag(). Oh well. > > This problem would fix itself if those two functions were to > substantially share code. It will fix the problem only partially: there is one more friend __set_page_dirty() in buffer.c Maybe it makes sense to add comments with warnings in all such places, or create a header file with a static inline function update_page_accounting() ? > And I think we can do that - something like > > static void > update_page_accounting(struct page *page, struct address_space *mapping) > { > __inc_zone_page_state(page, NR_FILE_DIRTY); > __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); > task_dirty_inc(current); > task_io_account_write(PAGE_CACHE_SIZE); > } > > maybe? > > We could do that as a separate patch and merge it into mainline - it > should have zero impact on code generation if gcc does the right thing > (please check this). > > > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 2/4] vfs: add set_page_dirty_notag 2009-02-18 0:26 ` Edward Shishkin @ 2009-02-18 0:38 ` Andrew Morton 2009-02-18 13:27 ` [patch 1/2] vfs: add/use update_page_accounting Edward Shishkin 2009-02-18 13:27 ` [patch 2/2] vfs: (take 2)add set_page_dirty_notag Edward Shishkin 0 siblings, 2 replies; 24+ messages in thread From: Andrew Morton @ 2009-02-18 0:38 UTC (permalink / raw) To: Edward Shishkin Cc: peterz, npiggin, rmh3093, randy.dunlap, linux-kernel, reiserfs-devel On Wed, 18 Feb 2009 03:26:16 +0300 Edward Shishkin <edward.shishkin@gmail.com> wrote: > Andrew Morton wrote: > > On Tue, 17 Feb 2009 01:43:28 +0300 > > Edward Shishkin <edward.shishkin@gmail.com> wrote: > > > > > >> + */ > >> +int set_page_dirty_notag(struct page *page) > >> +{ > >> + struct address_space *mapping = page->mapping; > >> + > >> + if (!TestSetPageDirty(page)) { > >> + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); > >> + if (mapping_cap_account_dirty(mapping)) { > >> + /* > >> + * The accounting functions rely on > >> + * being atomic wrt interrupts. > >> + */ > >> + unsigned long flags; > >> + local_irq_save(flags); > >> + __inc_zone_page_state(page, NR_FILE_DIRTY); > >> + __inc_bdi_stat(mapping->backing_dev_info, > >> + BDI_RECLAIMABLE); > >> + task_dirty_inc(current); > >> + task_io_account_write(PAGE_CACHE_SIZE); > >> + local_irq_restore(flags); > >> + } > >> + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); > >> + return 1; > >> + } > >> + return 0; > >> +} > >> > > > > I'll maintain this in -mm, alongside the resier4 patches which need it. > > > > Of course, this rather obviates the purpose - if someone changes, say, > > __set_page_dirty_nobuffers() then they won't similarly update > > set_page_dirty_notag(). Oh well. > > > > This problem would fix itself if those two functions were to > > substantially share code. > > It will fix the problem only partially: > there is one more friend __set_page_dirty() in buffer.c But all three functions do the same thing? if (mapping_cap_account_dirty(mapping)) { __inc_zone_page_state(page, NR_FILE_DIRTY); __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); task_dirty_inc(current); task_io_account_write(PAGE_CACHE_SIZE); } ? > Maybe it makes sense to add comments with warnings > in all such places, or create a header file with a static inline > function update_page_accounting() ? Could just uninline the helper function I guess - if you look, those four statements already involve doing a heck of a lot of stuff. Try it, see how it looks? ^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 1/2] vfs: add/use update_page_accounting 2009-02-18 0:38 ` Andrew Morton @ 2009-02-18 13:27 ` Edward Shishkin 2009-02-18 14:06 ` Nick Piggin 2009-02-18 13:27 ` [patch 2/2] vfs: (take 2)add set_page_dirty_notag Edward Shishkin 1 sibling, 1 reply; 24+ messages in thread From: Edward Shishkin @ 2009-02-18 13:27 UTC (permalink / raw) To: Andrew Morton Cc: Edward Shishkin, peterz, npiggin, rmh3093, randy.dunlap, linux-kernel, reiserfs-devel Andrew Morton writes: > On Wed, 18 Feb 2009 03:26:16 +0300 > Edward Shishkin <edward.shishkin@gmail.com> wrote: > > > Andrew Morton wrote: > > > On Tue, 17 Feb 2009 01:43:28 +0300 > > > Edward Shishkin <edward.shishkin@gmail.com> wrote: > > > > > > > > >> + */ > > >> +int set_page_dirty_notag(struct page *page) > > >> +{ > > >> + struct address_space *mapping = page->mapping; > > >> + > > >> + if (!TestSetPageDirty(page)) { > > >> + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); > > >> + if (mapping_cap_account_dirty(mapping)) { > > >> + /* > > >> + * The accounting functions rely on > > >> + * being atomic wrt interrupts. > > >> + */ > > >> + unsigned long flags; > > >> + local_irq_save(flags); > > >> + __inc_zone_page_state(page, NR_FILE_DIRTY); > > >> + __inc_bdi_stat(mapping->backing_dev_info, > > >> + BDI_RECLAIMABLE); > > >> + task_dirty_inc(current); > > >> + task_io_account_write(PAGE_CACHE_SIZE); > > >> + local_irq_restore(flags); > > >> + } > > >> + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); > > >> + return 1; > > >> + } > > >> + return 0; > > >> +} > > >> > > > > > > I'll maintain this in -mm, alongside the resier4 patches which need it. > > > > > > Of course, this rather obviates the purpose - if someone changes, say, > > > __set_page_dirty_nobuffers() then they won't similarly update > > > set_page_dirty_notag(). Oh well. > > > > > > This problem would fix itself if those two functions were to > > > substantially share code. > > > > It will fix the problem only partially: > > there is one more friend __set_page_dirty() in buffer.c > > But all three functions do the same thing? > > if (mapping_cap_account_dirty(mapping)) { > __inc_zone_page_state(page, NR_FILE_DIRTY); > __inc_bdi_stat(mapping->backing_dev_info, > BDI_RECLAIMABLE); > task_dirty_inc(current); > task_io_account_write(PAGE_CACHE_SIZE); > } > > ? Yup, exactly the same. > > > Maybe it makes sense to add comments with warnings > > in all such places, or create a header file with a static inline > > function update_page_accounting() ? > > Could just uninline the helper function I guess - if you look, those > four statements already involve doing a heck of a lot of stuff. > > Try it, see how it looks? > Done. Please, review. Add/use a helper function update_page_accounting(). Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com> --- fs/buffer.c | 9 +-------- include/linux/mm.h | 1 + mm/page-writeback.c | 22 +++++++++++++++------- 3 files changed, 17 insertions(+), 15 deletions(-) --- mmotm.orig/fs/buffer.c +++ mmotm/fs/buffer.c @@ -803,14 +803,7 @@ static int __set_page_dirty(struct page spin_lock_irq(&mapping->tree_lock); if (page->mapping) { /* Race with truncate? */ WARN_ON_ONCE(warn && !PageUptodate(page)); - - if (mapping_cap_account_dirty(mapping)) { - __inc_zone_page_state(page, NR_FILE_DIRTY); - __inc_bdi_stat(mapping->backing_dev_info, - BDI_RECLAIMABLE); - task_dirty_inc(current); - task_io_account_write(PAGE_CACHE_SIZE); - } + update_page_accounting(page, mapping); radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } --- mmotm.orig/include/linux/mm.h +++ mmotm/include/linux/mm.h @@ -839,6 +839,7 @@ int __set_page_dirty_nobuffers(struct pa int __set_page_dirty_no_writeback(struct page *page); int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page); +void update_page_accounting(struct page *page, struct address_space *mapping); int set_page_dirty(struct page *page); int set_page_dirty_lock(struct page *page); int clear_page_dirty_for_io(struct page *page); --- mmotm.orig/mm/page-writeback.c +++ mmotm/mm/page-writeback.c @@ -1198,6 +1198,20 @@ int __set_page_dirty_no_writeback(struct } /* + * Helper function for set_page_dirty family. + * NOTE: This relies on being atomic wrt interrupts. + */ +void update_page_accounting(struct page *page, struct address_space *mapping) +{ + if (mapping_cap_account_dirty(mapping)) { + __inc_zone_page_state(page, NR_FILE_DIRTY); + __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); + task_dirty_inc(current); + task_io_account_write(PAGE_CACHE_SIZE); + } +} + +/* * For address_spaces which do not use buffers. Just tag the page as dirty in * its radix tree. * @@ -1226,13 +1240,7 @@ int __set_page_dirty_nobuffers(struct pa if (mapping2) { /* Race with truncate? */ BUG_ON(mapping2 != mapping); WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); - if (mapping_cap_account_dirty(mapping)) { - __inc_zone_page_state(page, NR_FILE_DIRTY); - __inc_bdi_stat(mapping->backing_dev_info, - BDI_RECLAIMABLE); - task_dirty_inc(current); - task_io_account_write(PAGE_CACHE_SIZE); - } + update_page_accounting(page, mapping); radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 1/2] vfs: add/use update_page_accounting 2009-02-18 13:27 ` [patch 1/2] vfs: add/use update_page_accounting Edward Shishkin @ 2009-02-18 14:06 ` Nick Piggin 2009-02-18 18:23 ` Andrew Morton 0 siblings, 1 reply; 24+ messages in thread From: Nick Piggin @ 2009-02-18 14:06 UTC (permalink / raw) To: Edward Shishkin Cc: Andrew Morton, peterz, rmh3093, randy.dunlap, linux-kernel, reiserfs-devel On Wed, Feb 18, 2009 at 04:27:02PM +0300, Edward Shishkin wrote: > > > Maybe it makes sense to add comments with warnings > > > in all such places, or create a header file with a static inline > > > function update_page_accounting() ? > > > > Could just uninline the helper function I guess - if you look, those > > four statements already involve doing a heck of a lot of stuff. > > > > Try it, see how it looks? > > > > Done. > Please, review. > > Add/use a helper function update_page_accounting(). Fine patch, except the name I don't like. account_set_page_dirty, or account_page_dirtied, or something to hint it is for accounting dirty. > Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com> > --- > fs/buffer.c | 9 +-------- > include/linux/mm.h | 1 + > mm/page-writeback.c | 22 +++++++++++++++------- > 3 files changed, 17 insertions(+), 15 deletions(-) > > --- mmotm.orig/fs/buffer.c > +++ mmotm/fs/buffer.c > @@ -803,14 +803,7 @@ static int __set_page_dirty(struct page > spin_lock_irq(&mapping->tree_lock); > if (page->mapping) { /* Race with truncate? */ > WARN_ON_ONCE(warn && !PageUptodate(page)); > - > - if (mapping_cap_account_dirty(mapping)) { > - __inc_zone_page_state(page, NR_FILE_DIRTY); > - __inc_bdi_stat(mapping->backing_dev_info, > - BDI_RECLAIMABLE); > - task_dirty_inc(current); > - task_io_account_write(PAGE_CACHE_SIZE); > - } > + update_page_accounting(page, mapping); > radix_tree_tag_set(&mapping->page_tree, > page_index(page), PAGECACHE_TAG_DIRTY); > } > --- mmotm.orig/include/linux/mm.h > +++ mmotm/include/linux/mm.h > @@ -839,6 +839,7 @@ int __set_page_dirty_nobuffers(struct pa > int __set_page_dirty_no_writeback(struct page *page); > int redirty_page_for_writepage(struct writeback_control *wbc, > struct page *page); > +void update_page_accounting(struct page *page, struct address_space *mapping); > int set_page_dirty(struct page *page); > int set_page_dirty_lock(struct page *page); > int clear_page_dirty_for_io(struct page *page); > --- mmotm.orig/mm/page-writeback.c > +++ mmotm/mm/page-writeback.c > @@ -1198,6 +1198,20 @@ int __set_page_dirty_no_writeback(struct > } > > /* > + * Helper function for set_page_dirty family. > + * NOTE: This relies on being atomic wrt interrupts. > + */ > +void update_page_accounting(struct page *page, struct address_space *mapping) > +{ > + if (mapping_cap_account_dirty(mapping)) { > + __inc_zone_page_state(page, NR_FILE_DIRTY); > + __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); > + task_dirty_inc(current); > + task_io_account_write(PAGE_CACHE_SIZE); > + } > +} > + > +/* > * For address_spaces which do not use buffers. Just tag the page as dirty in > * its radix tree. > * > @@ -1226,13 +1240,7 @@ int __set_page_dirty_nobuffers(struct pa > if (mapping2) { /* Race with truncate? */ > BUG_ON(mapping2 != mapping); > WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); > - if (mapping_cap_account_dirty(mapping)) { > - __inc_zone_page_state(page, NR_FILE_DIRTY); > - __inc_bdi_stat(mapping->backing_dev_info, > - BDI_RECLAIMABLE); > - task_dirty_inc(current); > - task_io_account_write(PAGE_CACHE_SIZE); > - } > + update_page_accounting(page, mapping); > radix_tree_tag_set(&mapping->page_tree, > page_index(page), PAGECACHE_TAG_DIRTY); > } ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 1/2] vfs: add/use update_page_accounting 2009-02-18 14:06 ` Nick Piggin @ 2009-02-18 18:23 ` Andrew Morton 0 siblings, 0 replies; 24+ messages in thread From: Andrew Morton @ 2009-02-18 18:23 UTC (permalink / raw) To: Nick Piggin Cc: Edward Shishkin, peterz, rmh3093, randy.dunlap, linux-kernel, reiserfs-devel On Wed, 18 Feb 2009 15:06:26 +0100 Nick Piggin <npiggin@suse.de> wrote: > On Wed, Feb 18, 2009 at 04:27:02PM +0300, Edward Shishkin wrote: > > > > Maybe it makes sense to add comments with warnings > > > > in all such places, or create a header file with a static inline > > > > function update_page_accounting() ? > > > > > > Could just uninline the helper function I guess - if you look, those > > > four statements already involve doing a heck of a lot of stuff. > > > > > > Try it, see how it looks? > > > > > > > Done. > > Please, review. > > > > Add/use a helper function update_page_accounting(). > > Fine patch, except the name I don't like. account_set_page_dirty, or > account_page_dirtied, or something to hint it is for accounting > dirty. yep, sorry, I didn't think too hard about that for-example. I'll edit the diffs.. ^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 2/2] vfs: (take 2)add set_page_dirty_notag 2009-02-18 0:38 ` Andrew Morton 2009-02-18 13:27 ` [patch 1/2] vfs: add/use update_page_accounting Edward Shishkin @ 2009-02-18 13:27 ` Edward Shishkin 1 sibling, 0 replies; 24+ messages in thread From: Edward Shishkin @ 2009-02-18 13:27 UTC (permalink / raw) To: Andrew Morton Cc: Edward Shishkin, peterz, npiggin, rmh3093, randy.dunlap, linux-kernel, reiserfs-devel In accordance with reiser4 transactional model every dirty page should be "captured" by some atom. However, outside reiser4 context dirty page can not be captured in some cases, as it is accompanied with specific work (jnode creation, etc). Reiser4 recognizes such "anonymous" pages (i.e. pages that were dirtied outside of reiser4) by the tag PAGECACHE_TAG_DIRTY. Pages dirtied inside reiser4 context are not tagged at all: we don't need this. Indeed, once page is dirtied and captured, it is attached to a jnode (a special header to keep a track of transactions). reiser4_set_page_dirty_internal() was the internal reiser4 function that set dirty bit without tagging the page. Having such internal function led to real problems (incorrect task io accounting, etc. because of not updating this internal "friend"). Solution: The following patch adds a core library function that sets a dirty bit without tagging the page. Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com> --- include/linux/mm.h | 1 + mm/page-writeback.c | 26 ++++++++++++++++++++++++++ 2 files changed, 27 insertions(+) --- mmotm.orig/include/linux/mm.h +++ mmotm/include/linux/mm.h @@ -842,6 +842,7 @@ int redirty_page_for_writepage(struct wr void update_page_accounting(struct page *page, struct address_space *mapping); int set_page_dirty(struct page *page); int set_page_dirty_lock(struct page *page); +int set_page_dirty_notag(struct page *page); int clear_page_dirty_for_io(struct page *page); extern unsigned long move_page_tables(struct vm_area_struct *vma, --- mmotm.orig/mm/page-writeback.c +++ mmotm/mm/page-writeback.c @@ -1256,6 +1256,32 @@ int __set_page_dirty_nobuffers(struct pa EXPORT_SYMBOL(__set_page_dirty_nobuffers); /* + * set_page_dirty_notag() -- similar to __set_page_dirty_nobuffers() + * except it doesn't tag the page dirty in the page-cache radix tree. + * This means that the address space using this cannot use the regular + * filemap ->writepages() helpers and must provide its own means of + * tracking and finding non-tagged dirty pages. + * + * NOTE: furthermore, this version also doesn't handle truncate races. + */ +int set_page_dirty_notag(struct page *page) +{ + struct address_space *mapping = page->mapping; + + if (!TestSetPageDirty(page)) { + unsigned long flags; + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); + local_irq_save(flags); + update_page_accounting(page, mapping); + local_irq_restore(flags); + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); + return 1; + } + return 0; +} +EXPORT_SYMBOL(set_page_dirty_notag); + +/* * When a writepage implementation decides that it doesn't want to write this * page for some reason, it should redirty the locked page via * redirty_page_for_writepage() and it should then unlock the page and return 0 ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2009-02-18 18:24 UTC | newest] Thread overview: 24+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-02-13 11:56 [patch 2/4] vfs: add set_page_dirty_notag Edward Shishkin 2009-02-13 13:08 ` Peter Zijlstra 2009-02-13 13:57 ` Edward Shishkin 2009-02-13 14:09 ` Peter Zijlstra 2009-02-14 13:11 ` Edward Shishkin 2009-02-14 21:11 ` Peter Zijlstra 2009-02-16 22:43 ` Edward Shishkin 2009-02-17 9:09 ` Peter Zijlstra 2009-02-17 9:38 ` Nick Piggin 2009-02-17 10:05 ` Peter Zijlstra 2009-02-17 10:24 ` Nick Piggin 2009-02-17 10:40 ` set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag) Peter Zijlstra 2009-02-17 11:25 ` Nick Piggin 2009-02-17 11:39 ` Peter Zijlstra 2009-02-17 11:55 ` Nick Piggin 2009-02-17 12:05 ` Peter Zijlstra 2009-02-17 12:30 ` Nick Piggin 2009-02-17 22:35 ` [patch 2/4] vfs: add set_page_dirty_notag Andrew Morton 2009-02-18 0:26 ` Edward Shishkin 2009-02-18 0:38 ` Andrew Morton 2009-02-18 13:27 ` [patch 1/2] vfs: add/use update_page_accounting Edward Shishkin 2009-02-18 14:06 ` Nick Piggin 2009-02-18 18:23 ` Andrew Morton 2009-02-18 13:27 ` [patch 2/2] vfs: (take 2)add set_page_dirty_notag Edward Shishkin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox