Re: [RFC 2/2] x86_64: expand kernel stack to 16K

From: Minchan Kim <minchan@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>, "H. Peter Anvin" <hpa@zytor.com>,
	Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Hugh Dickins <hughd@google.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Steven Rostedt <rostedt@goodmis.org>
Subject: Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Date: Fri, 30 May 2014 15:12:15 +0900
Message-ID: <20140530061215.GS10092@bbox>
In-Reply-To: <20140530021247.GR10092@bbox>

Final result:

I tested the machine with the patch below (Dave's suggestion plus a part
I modified) and could no longer see the problem. That is after 4 hours of
testing; I will queue it on the machine over the weekend for a
long-running test unless I get a more refined version before leaving the
office today. However, as I reported in the interim result, the VM's
stack usage is still high.

Anyway, that is a separate issue: we really should put the VM functions
on a diet (e.g. uninlining the slow-path part of __alloc_pages_nodemask,
the alloc_info idea from Linus, and more).
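
To illustrate what uninlining a slow path can buy, here is a rough
user-space sketch. It is not kernel code: alloc_fastpath(),
alloc_slowpath() and the 256-byte scratch buffer are made-up stand-ins,
and the noinline attribute is the GCC/Clang spelling. The idea is that
the slow path's locals then live in their own frame, which exists only
while the slow path is running, rather than being folded into the frame
of the common path.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical slow path. The noinline attribute (GCC/Clang) keeps the
 * compiler from folding this function, and its 256-byte buffer, into the
 * caller's stack frame.
 */
__attribute__((noinline))
static int alloc_slowpath(int order)
{
	char scratch[256];	/* stand-in for the slow path's many locals */

	snprintf(scratch, sizeof(scratch), "reclaim for order %d", order);
	return (int)strlen(scratch);
}

/* Hypothetical fast path: it needs only a small frame when it succeeds. */
static int alloc_fastpath(int order, int freelist_has_page)
{
	if (freelist_has_page)
		return 0;		/* common case, slow path never entered */
	return alloc_slowpath(order);	/* large frame exists only from here down */
}
```

One way to see the effect is to build with GCC's -fstack-usage and
compare the per-function numbers with and without the attribute. How
much it saves depends on the compiler and optimization level.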

Looking forward to seeing the blk_plug_start_async approach.
Thanks, Dave!

---
 block/blk-core.c    | 2 +-
 block/blk-mq.c      | 2 +-
 kernel/sched/core.c | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bfe16d5af9f9..0c81aacec75b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1585,7 +1585,7 @@ get_rq:
 			trace_block_plug(q);
 		else {
 			if (request_count >= BLK_MAX_REQUEST_COUNT) {
-				blk_flush_plug_list(plug, false);
+				blk_flush_plug_list(plug, true);
 				trace_block_plug(q);
 			}
 		}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 883f72089015..6e72e700d11e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -897,7 +897,7 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 			if (list_empty(&plug->mq_list))
 				trace_block_plug(q);
 			else if (request_count >= BLK_MAX_REQUEST_COUNT) {
-				blk_flush_plug_list(plug, false);
+				blk_flush_plug_list(plug, true);
 				trace_block_plug(q);
 			}
 			list_add_tail(&rq->queuelist, &plug->mq_list);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..ebca9e1f200f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4244,7 +4244,7 @@ void __sched io_schedule(void)
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;
@@ -4260,7 +4260,7 @@ long __sched io_schedule_timeout(long timeout)
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 	current->in_iowait = 1;
 	ret = schedule_timeout(timeout);
 	current->in_iowait = 0;
-- 
1.9.2
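
For readers who want the gist of the true/false change without the block
layer in front of them, here is a toy user-space model. Everything in it
(toy_queue, toy_plug, toy_submit() and so on) is hypothetical and only
mimics the shape of the logic: when the plug is full, the flush either
issues the requests on the submitter's own stack (from_schedule false) or
hands them to a worker that issues them later from its own shallow stack
(from_schedule true, standing in for kblockd).

```c
#include <stdbool.h>

#define TOY_MAX_REQUEST_COUNT 16	/* mirrors the 16-bio limit mentioned in the thread */

struct toy_queue {
	int dispatched_inline;		/* requests issued on the submitter's stack */
	int dispatched_deferred;	/* requests issued later by the worker */
	int deferred_pending;		/* handed to the worker, not yet issued */
};

struct toy_plug {
	int count;			/* requests currently held by the plug */
};

/* Empty the plug, either inline or by deferring to the worker. */
static void toy_flush_plug_list(struct toy_queue *q, struct toy_plug *plug,
				bool from_schedule)
{
	if (from_schedule)
		q->deferred_pending += plug->count;	/* worker will issue these */
	else
		q->dispatched_inline += plug->count;	/* issued right here, deep in the caller's stack */
	plug->count = 0;
}

/* Add one request to the plug, flushing first if the plug is full. */
static void toy_submit(struct toy_queue *q, struct toy_plug *plug,
		       bool from_schedule)
{
	if (plug->count >= TOY_MAX_REQUEST_COUNT)
		toy_flush_plug_list(q, plug, from_schedule);
	plug->count++;
}

/* Stand-in for the worker: issues whatever was deferred. */
static void toy_worker_run(struct toy_queue *q)
{
	q->dispatched_deferred += q->deferred_pending;
	q->deferred_pending = 0;
}
```

In this model, submitting 20 requests with from_schedule set to true
issues nothing from the submitter's context; the 16 flushed requests are
issued only when toy_worker_run() is called. The trade-off is dispatch
latency, which is the point made in the quoted discussion below.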


On Fri, May 30, 2014 at 11:12:47AM +0900, Minchan Kim wrote:
> On Fri, May 30, 2014 at 10:15:58AM +1000, Dave Chinner wrote:
> > On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
> > > Hello Dave,
> > > 
> > > On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> > > > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > > > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > > > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > > > > Author: Jens Axboe <jaxboe@fusionio.com>
> > > > > Date:   Sat Apr 16 13:27:55 2011 +0200
> > > > > 
> > > > >     block: let io_schedule() flush the plug inline
> > > > >     
> > > > >     Linus correctly observes that the most important dispatch cases
> > > > >     are now done from kblockd, this isn't ideal for latency reasons.
> > > > >     The original reason for switching dispatches out-of-line was to
> > > > >     avoid too deep a stack, so by _only_ letting the "accidental"
> > > > >     flush directly in schedule() be guarded by offload to kblockd,
> > > > >     we should be able to get the best of both worlds.
> > > > >     
> > > > >     So add a blk_schedule_flush_plug() that offloads to kblockd,
> > > > >     and only use that from the schedule() path.
> > > > >     
> > > > >     Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> > > > > 
> > > > > And now we have too deep a stack due to unplugging from io_schedule()...
> > > > 
> > > > So, if we make io_schedule() push the plug list off to the kblockd
> > > > like is done for schedule()....
> > ....
> > > I ran the hacky test below to try your idea, and the result is an overflow again.
> > > So, again, this argues for stack expansion. Otherwise, we should prevent
> > > swapout in direct reclaim.
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index f5c6635b806c..95f169e85dbe 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
> > >  void __sched io_schedule(void)
> > >  {
> > >  	struct rq *rq = raw_rq();
> > > +	struct blk_plug *plug = current->plug;
> > >  
> > >  	delayacct_blkio_start();
> > >  	atomic_inc(&rq->nr_iowait);
> > > -	blk_flush_plug(current);
> > > +	if (plug)
> > > +		blk_flush_plug_list(plug, true);
> > > +
> > >  	current->in_iowait = 1;
> > >  	schedule();
> > >  	current->in_iowait = 0;
> > 
> > .....
> > 
> > >         Depth    Size   Location    (46 entries)
> > >
> > >   0)     7200       8   _raw_spin_lock_irqsave+0x51/0x60
> > >   1)     7192     296   get_page_from_freelist+0x886/0x920
> > >   2)     6896     352   __alloc_pages_nodemask+0x5e1/0xb20
> > >   3)     6544       8   alloc_pages_current+0x10f/0x1f0
> > >   4)     6536     168   new_slab+0x2c5/0x370
> > >   5)     6368       8   __slab_alloc+0x3a9/0x501
> > >   6)     6360      80   __kmalloc+0x1cb/0x200
> > >   7)     6280     376   vring_add_indirect+0x36/0x200
> > >   8)     5904     144   virtqueue_add_sgs+0x2e2/0x320
> > >   9)     5760     288   __virtblk_add_req+0xda/0x1b0
> > >  10)     5472      96   virtio_queue_rq+0xd3/0x1d0
> > >  11)     5376     128   __blk_mq_run_hw_queue+0x1ef/0x440
> > >  12)     5248      16   blk_mq_run_hw_queue+0x35/0x40
> > >  13)     5232      96   blk_mq_insert_requests+0xdb/0x160
> > >  14)     5136     112   blk_mq_flush_plug_list+0x12b/0x140
> > >  15)     5024     112   blk_flush_plug_list+0xc7/0x220
> > >  16)     4912     128   blk_mq_make_request+0x42a/0x600
> > >  17)     4784      48   generic_make_request+0xc0/0x100
> > >  18)     4736     112   submit_bio+0x86/0x160
> > >  19)     4624     160   __swap_writepage+0x198/0x230
> > >  20)     4464      32   swap_writepage+0x42/0x90
> > >  21)     4432     320   shrink_page_list+0x676/0xa80
> > >  22)     4112     208   shrink_inactive_list+0x262/0x4e0
> > >  23)     3904     304   shrink_lruvec+0x3e1/0x6a0
> > 
> > The device is supposed to be plugged here in shrink_lruvec().
> > 
> > Oh, a plug can only hold 16 individual bios, and then it does a
> > synchronous flush. Hmmm - perhaps that should also defer the flush
> > to the kblockd, because if we are overrunning a plug then we've
> > already surrendered IO dispatch latency....
> > 
> > So, in blk_mq_make_request(), can you do:
> > 
> > 			if (list_empty(&plug->mq_list))
> > 				trace_block_plug(q);
> > 			else if (request_count >= BLK_MAX_REQUEST_COUNT) {
> > -				blk_flush_plug_list(plug, false);
> > +				blk_flush_plug_list(plug, true);
> > 				trace_block_plug(q);
> > 			}
> > 			list_add_tail(&rq->queuelist, &plug->mq_list);
> > 
> > To see if that defers all the swap IO to kblockd?
> > 
> 
> Interim report:
> 
> I applied the patch below (we also need to fix io_schedule_timeout because of mempool_alloc):
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index bfe16d5af9f9..0c81aacec75b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1585,7 +1585,7 @@ get_rq:
>  			trace_block_plug(q);
>  		else {
>  			if (request_count >= BLK_MAX_REQUEST_COUNT) {
> -				blk_flush_plug_list(plug, false);
> +				blk_flush_plug_list(plug, true);
>  				trace_block_plug(q);
>  			}
>  		}
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f5c6635b806c..ebca9e1f200f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4244,7 +4244,7 @@ void __sched io_schedule(void)
>  
>  	delayacct_blkio_start();
>  	atomic_inc(&rq->nr_iowait);
> -	blk_flush_plug(current);
> +	blk_schedule_flush_plug(current);
>  	current->in_iowait = 1;
>  	schedule();
>  	current->in_iowait = 0;
> @@ -4260,7 +4260,7 @@ long __sched io_schedule_timeout(long timeout)
>  
>  	delayacct_blkio_start();
>  	atomic_inc(&rq->nr_iowait);
> -	blk_flush_plug(current);
> +	blk_schedule_flush_plug(current);
>  	current->in_iowait = 1;
>  	ret = schedule_timeout(timeout);
>  	current->in_iowait = 0;
> 
> The result is as follows. It reduces stack usage by about 800 bytes
> compared to my first report, but usage still seems high.
> The VM functions really need a diet.
> 
>         Depth    Size   Location
>         -----    ----   --------
>   0)     6896      16   lookup_address+0x28/0x30
>   1)     6880      16   _lookup_address_cpa.isra.3+0x3b/0x40
>   2)     6864     304   __change_page_attr_set_clr+0xe0/0xb50
>   3)     6560     112   kernel_map_pages+0x6c/0x120
>   4)     6448     256   get_page_from_freelist+0x489/0x920
>   5)     6192     352   __alloc_pages_nodemask+0x5e1/0xb20
>   6)     5840       8   alloc_pages_current+0x10f/0x1f0
>   7)     5832     168   new_slab+0x35d/0x370
>   8)     5664       8   __slab_alloc+0x3a9/0x501
>   9)     5656      80   kmem_cache_alloc+0x1ac/0x1c0
>  10)     5576     296   mempool_alloc_slab+0x15/0x20
>  11)     5280     128   mempool_alloc+0x5e/0x170
>  12)     5152      96   bio_alloc_bioset+0x10b/0x1d0
>  13)     5056      48   get_swap_bio+0x30/0x90
>  14)     5008     160   __swap_writepage+0x150/0x230
>  15)     4848      32   swap_writepage+0x42/0x90
>  16)     4816     320   shrink_page_list+0x676/0xa80
>  17)     4496     208   shrink_inactive_list+0x262/0x4e0
>  18)     4288     304   shrink_lruvec+0x3e1/0x6a0
>  19)     3984      80   shrink_zone+0x3f/0x110
>  20)     3904     128   do_try_to_free_pages+0x156/0x4c0
>  21)     3776     208   try_to_free_pages+0xf7/0x1e0
>  22)     3568     352   __alloc_pages_nodemask+0x783/0xb20
>  23)     3216       8   alloc_pages_current+0x10f/0x1f0
>  24)     3208     168   new_slab+0x2c5/0x370
>  25)     3040       8   __slab_alloc+0x3a9/0x501
>  26)     3032      80   kmem_cache_alloc+0x1ac/0x1c0
>  27)     2952     296   mempool_alloc_slab+0x15/0x20
>  28)     2656     128   mempool_alloc+0x5e/0x170
>  29)     2528      96   bio_alloc_bioset+0x10b/0x1d0
>  30)     2432      48   mpage_alloc+0x38/0xa0
>  31)     2384     208   do_mpage_readpage+0x49b/0x5d0
>  32)     2176     224   mpage_readpages+0xcf/0x120
>  33)     1952      48   ext4_readpages+0x45/0x60
>  34)     1904     224   __do_page_cache_readahead+0x222/0x2d0
>  35)     1680      16   ra_submit+0x21/0x30
>  36)     1664     112   filemap_fault+0x2d7/0x4f0
>  37)     1552     144   __do_fault+0x6d/0x4c0
>  38)     1408     160   handle_mm_fault+0x1a6/0xaf0
>  39)     1248     272   __do_page_fault+0x18a/0x590
>  40)      976      16   do_page_fault+0xc/0x10
>  41)      960     208   page_fault+0x22/0x30
>  42)      752      16   clear_user+0x2e/0x40
>  43)      736      16   padzero+0x2d/0x40
>  44)      720     304   load_elf_binary+0xa47/0x1a40
>  45)      416      48   search_binary_handler+0x9c/0x1a0
>  46)      368     144   do_execve_common.isra.25+0x58d/0x700
>  47)      224      16   do_execve+0x18/0x20
>  48)      208      32   SyS_execve+0x2e/0x40
>  49)      176     176   stub_execve+0x69/0xa0
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim


2014-05-30 17:24                                   ` Dave Hansen
2014-05-30 17:24                                     ` Dave Hansen
2014-05-30 18:12                                     ` H. Peter Anvin
2014-05-30 18:12                                       ` H. Peter Anvin
2014-10-21  2:00                               ` Dave Jones
2014-10-21  4:59                                 ` Andy Lutomirski
2014-05-30  9:48                 ` Richard Weinberger
2014-05-30  9:48                   ` Richard Weinberger
2014-05-30 15:36                   ` Linus Torvalds
2014-05-30 15:36                     ` Linus Torvalds
2014-05-31  2:06             ` Jens Axboe
2014-05-31  2:06               ` Jens Axboe
2014-06-02 22:59               ` Dave Chinner
2014-06-02 22:59                 ` Dave Chinner
2014-06-03 13:02               ` Konstantin Khlebnikov
2014-06-03 13:02                 ` Konstantin Khlebnikov
2014-05-29  3:46     ` Minchan Kim
2014-05-29  3:46       ` Minchan Kim
2014-05-29  4:13       ` Linus Torvalds
2014-05-29  4:13         ` Linus Torvalds
2014-05-29  5:10         ` Minchan Kim
2014-05-29  5:10           ` Minchan Kim
2014-05-30 21:23     ` Andi Kleen
2014-05-30 21:23       ` Andi Kleen
2014-05-28 16:18 ` [PATCH 1/2] ftrace: print stack usage right before Oops Steven Rostedt
2014-05-28 16:18   ` Steven Rostedt
2014-05-29  3:52   ` Minchan Kim
2014-05-29  3:52     ` Minchan Kim
2014-05-29  3:01 ` Steven Rostedt
2014-05-29  3:01   ` Steven Rostedt
2014-05-29  3:49   ` Minchan Kim
2014-05-29  3:49     ` Minchan Kim
