stable.vger.kernel.org archive mirror
* Re: [PATCH] aio: fix incorrect dirty pages accounting when truncating AIO ring buffer
       [not found] <1414750042-20832-1-git-send-email-guz.fnst@cn.fujitsu.com>
@ 2014-11-05  9:53 ` Gu Zheng
  2014-11-05 14:56   ` Benjamin LaHaise
  0 siblings, 1 reply; 4+ messages in thread
From: Gu Zheng @ 2014-11-05  9:53 UTC (permalink / raw)
  To: bcrl, akpm; +Cc: Gu Zheng, linux-aio, m.koenigshaus, linux-kernel, stable

ping...

On 10/31/2014 06:07 PM, Gu Zheng wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=86831
> 
> Markus reported that shutting down mysqld (with AIO support, on an
> ext3-formatted hard drive) leads to a negative number of dirty pages
> (the counter underruns). The negative number drastically reduces write
> performance because the page cache is bypassed: the kernel believes
> roughly 2^32 dirty pages are still outstanding.
> 
> Adding a warning in __dec_zone_state() catches this easily:
> 
> static inline void __dec_zone_state(struct zone *zone, enum
> 	zone_stat_item item)
> {
>      atomic_long_dec(&zone->vm_stat[item]);
> +    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
> 	atomic_long_read(&zone->vm_stat[item]) < 0);
>      atomic_long_dec(&vm_stat[item]);
> }
> 
> [   21.341632] ------------[ cut here ]------------
> [   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
> cancel_dirty_page+0x164/0x224()
> [   21.355296] Modules linked in: wutbox_cp sata_mv
> [   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
> [   21.366793] Workqueue: events free_ioctx
> [   21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>]
> (show_stack+0x20/0x24)
> [   21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>]
> (dump_stack+0x24/0x28)
> [   21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>]
> (warn_slowpath_common+0x84/0x9c)
> [   21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>]
> (warn_slowpath_null+0x2c/0x34)
> [   21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>]
> (cancel_dirty_page+0x164/0x224)
> [   21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>]
> (truncate_inode_page+0x8c/0x158)
> [   21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>]
> (truncate_inode_pages_range+0x11c/0x53c)
> [   21.429890] [<c00c0a94>] (truncate_inode_pages_range) from
> [<c00c0f6c>] (truncate_pagecache+0x88/0xac)
> [   21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>]
> (truncate_setsize+0x5c/0x74)
> [   21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>]
> (put_aio_ring_file.isra.14+0x34/0x90)
> [   21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from
> [<c013b424>] (aio_free_ring+0x20/0xcc)
> [   21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>]
> (free_ioctx+0x24/0x44)
> [   21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>]
> (process_one_work+0x134/0x47c)
> [   21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>]
> (worker_thread+0x130/0x414)
> [   21.489350] [<c003e988>] (worker_thread) from [<c00448ac>]
> (kthread+0xd4/0xec)
> [   21.496621] [<c00448ac>] (kthread) from [<c000ec18>]
> (ret_from_fork+0x14/0x20)
> [   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---
> 
> The cause is that we mark the aio ring file pages *dirty* via
> SetPageDirty() at init time, which bypasses the VFS dirty-page
> accounting increment, while the aio fs uses *default_backing_dev_info*
> as its backing dev, which does not disable the dirty-page accounting
> capability. Truncating the aio ring file therefore decrements
> dirty-page counters that were never incremented, and the error occurs.
> The original goal was to keep these pages in memory for their lifetime
> (so they can not be reclaimed or swapped) by marking them dirty. On
> reflection, pinning the pages by elevating their refcount already
> achieves that goal, so the SetPageDirty() is unnecessary.
> To fix the issue, use __set_page_dirty_no_writeback() instead of the
> nop .set_page_dirty, and drop the SetPageDirty() (don't manually set
> the dirty flag, don't disable set_page_dirty(), rely on the default
> behaviour). A private aio backing dev info (with the
> ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities disabled) is introduced to
> replace the default one so dirty pages are never accounted.
> 
> Reported-by: Markus Königshaus <m.koenigshaus@wut.de>
> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
> Cc: stable <stable@vger.kernel.org>
> ---
>  fs/aio.c |   18 +++++++++++-------
>  1 files changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 84a7510..b23dc31 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -165,6 +165,12 @@ static struct vfsmount *aio_mnt;
>  static const struct file_operations aio_ring_fops;
>  static const struct address_space_operations aio_ctx_aops;
>  
> +static struct backing_dev_info aio_fs_backing_dev_info = {
> +	.name           = "aiofs",
> +	.state          = 0,
> +	.capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_MAP_COPY,
> +};
> +
>  static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
>  {
>  	struct qstr this = QSTR_INIT("[aio]", 5);
> @@ -176,6 +182,7 @@ static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
>  
>  	inode->i_mapping->a_ops = &aio_ctx_aops;
>  	inode->i_mapping->private_data = ctx;
> +	inode->i_mapping->backing_dev_info = &aio_fs_backing_dev_info;
>  	inode->i_size = PAGE_SIZE * nr_pages;
>  
>  	path.dentry = d_alloc_pseudo(aio_mnt->mnt_sb, &this);
> @@ -220,6 +227,9 @@ static int __init aio_setup(void)
>  	if (IS_ERR(aio_mnt))
>  		panic("Failed to create aio fs mount.");
>  
> +	if (bdi_init(&aio_fs_backing_dev_info))
> +		panic("Failed to init aio fs backing dev info.");
> +
>  	kiocb_cachep = KMEM_CACHE(kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
>  	kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);
>  
> @@ -281,11 +291,6 @@ static const struct file_operations aio_ring_fops = {
>  	.mmap = aio_ring_mmap,
>  };
>  
> -static int aio_set_page_dirty(struct page *page)
> -{
> -	return 0;
> -}
> -
>  #if IS_ENABLED(CONFIG_MIGRATION)
>  static int aio_migratepage(struct address_space *mapping, struct page *new,
>  			struct page *old, enum migrate_mode mode)
> @@ -357,7 +362,7 @@ out:
>  #endif
>  
>  static const struct address_space_operations aio_ctx_aops = {
> -	.set_page_dirty = aio_set_page_dirty,
> +	.set_page_dirty = __set_page_dirty_no_writeback,
>  #if IS_ENABLED(CONFIG_MIGRATION)
>  	.migratepage	= aio_migratepage,
>  #endif
> @@ -412,7 +417,6 @@ static int aio_setup_ring(struct kioctx *ctx)
>  		pr_debug("pid(%d) page[%d]->count=%d\n",
>  			 current->pid, i, page_count(page));
>  		SetPageUptodate(page);
> -		SetPageDirty(page);
>  		unlock_page(page);
>  
>  		ctx->ring_pages[i] = page;




* Re: [PATCH] aio: fix incorrect dirty pages accounting when truncating AIO ring buffer
  2014-11-05  9:53 ` [PATCH] aio: fix incorrect dirty pages accounting when truncating AIO ring buffer Gu Zheng
@ 2014-11-05 14:56   ` Benjamin LaHaise
  2014-11-05 21:00     ` Andrew Morton
  0 siblings, 1 reply; 4+ messages in thread
From: Benjamin LaHaise @ 2014-11-05 14:56 UTC (permalink / raw)
  To: Gu Zheng; +Cc: akpm, linux-aio, m.koenigshaus, linux-kernel, stable

On Wed, Nov 05, 2014 at 05:53:11PM +0800, Gu Zheng wrote:
> ping...

I need someone a bit more familiar with this area of code to chime in on 
reviewing this.  Andrew, can you provide any feedback on this fix?

		-ben

> On 10/31/2014 06:07 PM, Gu Zheng wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=86831
> > 
> > Markus reported that shutting down mysqld (with AIO support, on an
> > ext3-formatted hard drive) leads to a negative number of dirty pages
> > (the counter underruns). The negative number drastically reduces write
> > performance because the page cache is bypassed: the kernel believes
> > roughly 2^32 dirty pages are still outstanding.
> > 
> > Adding a warning in __dec_zone_state() catches this easily:
> > 
> > static inline void __dec_zone_state(struct zone *zone, enum
> > 	zone_stat_item item)
> > {
> >      atomic_long_dec(&zone->vm_stat[item]);
> > +    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
> > 	atomic_long_read(&zone->vm_stat[item]) < 0);
> >      atomic_long_dec(&vm_stat[item]);
> > }
> > 
> > [   21.341632] ------------[ cut here ]------------
> > [   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
> > cancel_dirty_page+0x164/0x224()
> > [   21.355296] Modules linked in: wutbox_cp sata_mv
> > [   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
> > [   21.366793] Workqueue: events free_ioctx
> > [   21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>]
> > (show_stack+0x20/0x24)
> > [   21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>]
> > (dump_stack+0x24/0x28)
> > [   21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>]
> > (warn_slowpath_common+0x84/0x9c)
> > [   21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>]
> > (warn_slowpath_null+0x2c/0x34)
> > [   21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>]
> > (cancel_dirty_page+0x164/0x224)
> > [   21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>]
> > (truncate_inode_page+0x8c/0x158)
> > [   21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>]
> > (truncate_inode_pages_range+0x11c/0x53c)
> > [   21.429890] [<c00c0a94>] (truncate_inode_pages_range) from
> > [<c00c0f6c>] (truncate_pagecache+0x88/0xac)
> > [   21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>]
> > (truncate_setsize+0x5c/0x74)
> > [   21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>]
> > (put_aio_ring_file.isra.14+0x34/0x90)
> > [   21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from
> > [<c013b424>] (aio_free_ring+0x20/0xcc)
> > [   21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>]
> > (free_ioctx+0x24/0x44)
> > [   21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>]
> > (process_one_work+0x134/0x47c)
> > [   21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>]
> > (worker_thread+0x130/0x414)
> > [   21.489350] [<c003e988>] (worker_thread) from [<c00448ac>]
> > (kthread+0xd4/0xec)
> > [   21.496621] [<c00448ac>] (kthread) from [<c000ec18>]
> > (ret_from_fork+0x14/0x20)
> > [   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---
> > 
> > The cause is that we mark the aio ring file pages *dirty* via
> > SetPageDirty() at init time, which bypasses the VFS dirty-page
> > accounting increment, while the aio fs uses *default_backing_dev_info*
> > as its backing dev, which does not disable the dirty-page accounting
> > capability. Truncating the aio ring file therefore decrements
> > dirty-page counters that were never incremented, and the error occurs.
> > The original goal was to keep these pages in memory for their lifetime
> > (so they can not be reclaimed or swapped) by marking them dirty. On
> > reflection, pinning the pages by elevating their refcount already
> > achieves that goal, so the SetPageDirty() is unnecessary.
> > To fix the issue, use __set_page_dirty_no_writeback() instead of the
> > nop .set_page_dirty, and drop the SetPageDirty() (don't manually set
> > the dirty flag, don't disable set_page_dirty(), rely on the default
> > behaviour). A private aio backing dev info (with the
> > ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities disabled) is introduced to
> > replace the default one so dirty pages are never accounted.
> > 
> > Reported-by: Markus Königshaus <m.koenigshaus@wut.de>
> > Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
> > Cc: stable <stable@vger.kernel.org>
> > ---
> >  fs/aio.c |   18 +++++++++++-------
> >  1 files changed, 11 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/aio.c b/fs/aio.c
> > index 84a7510..b23dc31 100644
> > --- a/fs/aio.c
> > +++ b/fs/aio.c
> > @@ -165,6 +165,12 @@ static struct vfsmount *aio_mnt;
> >  static const struct file_operations aio_ring_fops;
> >  static const struct address_space_operations aio_ctx_aops;
> >  
> > +static struct backing_dev_info aio_fs_backing_dev_info = {
> > +	.name           = "aiofs",
> > +	.state          = 0,
> > +	.capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_MAP_COPY,
> > +};
> > +
> >  static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
> >  {
> >  	struct qstr this = QSTR_INIT("[aio]", 5);
> > @@ -176,6 +182,7 @@ static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
> >  
> >  	inode->i_mapping->a_ops = &aio_ctx_aops;
> >  	inode->i_mapping->private_data = ctx;
> > +	inode->i_mapping->backing_dev_info = &aio_fs_backing_dev_info;
> >  	inode->i_size = PAGE_SIZE * nr_pages;
> >  
> >  	path.dentry = d_alloc_pseudo(aio_mnt->mnt_sb, &this);
> > @@ -220,6 +227,9 @@ static int __init aio_setup(void)
> >  	if (IS_ERR(aio_mnt))
> >  		panic("Failed to create aio fs mount.");
> >  
> > +	if (bdi_init(&aio_fs_backing_dev_info))
> > +		panic("Failed to init aio fs backing dev info.");
> > +
> >  	kiocb_cachep = KMEM_CACHE(kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
> >  	kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);
> >  
> > @@ -281,11 +291,6 @@ static const struct file_operations aio_ring_fops = {
> >  	.mmap = aio_ring_mmap,
> >  };
> >  
> > -static int aio_set_page_dirty(struct page *page)
> > -{
> > -	return 0;
> > -}
> > -
> >  #if IS_ENABLED(CONFIG_MIGRATION)
> >  static int aio_migratepage(struct address_space *mapping, struct page *new,
> >  			struct page *old, enum migrate_mode mode)
> > @@ -357,7 +362,7 @@ out:
> >  #endif
> >  
> >  static const struct address_space_operations aio_ctx_aops = {
> > -	.set_page_dirty = aio_set_page_dirty,
> > +	.set_page_dirty = __set_page_dirty_no_writeback,
> >  #if IS_ENABLED(CONFIG_MIGRATION)
> >  	.migratepage	= aio_migratepage,
> >  #endif
> > @@ -412,7 +417,6 @@ static int aio_setup_ring(struct kioctx *ctx)
> >  		pr_debug("pid(%d) page[%d]->count=%d\n",
> >  			 current->pid, i, page_count(page));
> >  		SetPageUptodate(page);
> > -		SetPageDirty(page);
> >  		unlock_page(page);
> >  
> >  		ctx->ring_pages[i] = page;
> 

-- 
"Thought is the essence of where you are now."


* Re: [PATCH] aio: fix incorrect dirty pages accounting when truncating AIO ring buffer
  2014-11-05 14:56   ` Benjamin LaHaise
@ 2014-11-05 21:00     ` Andrew Morton
  2014-11-06  1:14       ` Gu Zheng
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2014-11-05 21:00 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Gu Zheng, linux-aio, m.koenigshaus, linux-kernel, stable

On Wed, 5 Nov 2014 09:56:20 -0500 Benjamin LaHaise <bcrl@kvack.org> wrote:

> On Wed, Nov 05, 2014 at 05:53:11PM +0800, Gu Zheng wrote:
> > ping...
> 
> I need someone a bit more familiar with this area of code to chime in on 
> reviewing this.  Andrew, can you provide any feedback on this fix?

That looks like a sensible solution and the changelog is good.  A
couple of suggested modifications:

- Don't use the term "pinned" in the changelog without explaining it
  - it's a terribly ambiguous term.  I assume here you mean "pinned by
  elevating the page's refcount"?

- Could we have just one wee little code comment, explaining why
  aio_fs_backing_dev_info exists, what function it is serving?  Put
  yourself in the position of a reader wondering "why is this here".


* Re: [PATCH] aio: fix incorrect dirty pages accounting when truncating AIO ring buffer
  2014-11-05 21:00     ` Andrew Morton
@ 2014-11-06  1:14       ` Gu Zheng
  0 siblings, 0 replies; 4+ messages in thread
From: Gu Zheng @ 2014-11-06  1:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Benjamin LaHaise, linux-aio, m.koenigshaus, linux-kernel, stable

Hi Andrew,
On 11/06/2014 05:00 AM, Andrew Morton wrote:

> On Wed, 5 Nov 2014 09:56:20 -0500 Benjamin LaHaise <bcrl@kvack.org> wrote:
> 
>> On Wed, Nov 05, 2014 at 05:53:11PM +0800, Gu Zheng wrote:
>>> ping...
>>
>> I need someone a bit more familiar with this area of code to chime in on 
>> reviewing this.  Andrew, can you provide any feedback on this fix?
> 
> That looks like a sensible solution and the changelog is good.  A
> couple of suggested modifications:
> 
> - Don't use the term "pinned" in the changelog without explaining it
>   - it's a terribly ambiguous term.  I assume here you mean "pinned by
>   elevating the page's refcount"?

Yes.

> 
> - Could we have just one wee little code comment, explaining why
>   aio_fs_backing_dev_info exists, what function it is serving?  Put
>   yourself in the position of a reader wondering "why is this here".
> .

Got it.
Thanks for your suggestion. Will update it soon.

Regards,
Gu

> 




end of thread, other threads:[~2014-11-06  1:14 UTC | newest]

Thread overview: 4+ messages
     [not found] <1414750042-20832-1-git-send-email-guz.fnst@cn.fujitsu.com>
2014-11-05  9:53 ` [PATCH] aio: fix incorrect dirty pages accounting when truncating AIO ring buffer Gu Zheng
2014-11-05 14:56   ` Benjamin LaHaise
2014-11-05 21:00     ` Andrew Morton
2014-11-06  1:14       ` Gu Zheng
