[PATCH] writeback: call writeback tracepoints withoud holding list_lock in wb

linux-rt-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] writeback: call writeback tracepoints withoud holding list_lock in wb_writeback()
@ 2016-02-24 22:47 Yang Shi
  2016-02-25  2:40 ` Steven Rostedt
  0 siblings, 1 reply; 9+ messages in thread
From: Yang Shi @ 2016-02-24 22:47 UTC (permalink / raw)
  To: tj, jack, axboe, fengguang.wu, tglx, rostedt, bigeasy, akpm
  Cc: linux-kernel, linux-mm, linux-rt-users, linaro-kernel, yang.shi

commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback
tracepoints to report cgroup") made writeback tracepoints report cgroup
writeback, but it may trigger the below bug on -rt kernel due to the list_lock
held for the for loop in wb_writeback().

BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930
in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3
INFO: lockdep is turned off.
Preemption disabled at:[<ffffffc000374a5c>] wb_writeback+0xec/0x830

CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Workqueue: writeback wb_workfn (flush-7:0)
Call trace:
[<ffffffc00008d708>] dump_backtrace+0x0/0x200
[<ffffffc00008d92c>] show_stack+0x24/0x30
[<ffffffc0007b0f40>] dump_stack+0x88/0xa8
[<ffffffc000127d74>] ___might_sleep+0x2ec/0x300
[<ffffffc000d5d550>] rt_spin_lock+0x38/0xb8
[<ffffffc0003e0548>] kernfs_path_len+0x30/0x90
[<ffffffc00036b360>] trace_event_raw_event_writeback_work_class+0xe8/0x2e8
[<ffffffc000374f90>] wb_writeback+0x620/0x830
[<ffffffc000376224>] wb_workfn+0x61c/0x950
[<ffffffc000110adc>] process_one_work+0x3ac/0xb30
[<ffffffc0001112fc>] worker_thread+0x9c/0x7a8
[<ffffffc00011a9e8>] kthread+0x190/0x1b0
[<ffffffc000086ca0>] ret_from_fork+0x10/0x30

The list_lock was moved outside the for loop by commit
e8dfc30582995ae12454cda517b17d6294175b07 ("writeback: elevate queue_io()
into wb_writeback())", however, the commit log says "No behavior change", so
it sounds safe to have the list_lock acquired inside the for loop as it did
before.

Just acquire list_lock at the necessary points and keep all writeback
tracepoints outside the critical area protected by list_lock in
wb_writeback().

Signed-off-by: Yang Shi <yang.shi@linaro.org>
---
 fs/fs-writeback.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1f76d89..9b7b5f6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1623,7 +1623,6 @@ static long wb_writeback(struct bdi_writeback *wb,
 	work->older_than_this = &oldest_jif;
 
 	blk_start_plug(&plug);
-	spin_lock(&wb->list_lock);
 	for (;;) {
 		/*
 		 * Stop writeback when nr_pages has been consumed
@@ -1661,15 +1660,19 @@ static long wb_writeback(struct bdi_writeback *wb,
 			oldest_jif = jiffies;
 
 		trace_writeback_start(wb, work);
+
+		spin_lock(&wb->list_lock);
 		if (list_empty(&wb->b_io))
 			queue_io(wb, work);
 		if (work->sb)
 			progress = writeback_sb_inodes(work->sb, wb, work);
 		else
 			progress = __writeback_inodes_wb(wb, work);
-		trace_writeback_written(wb, work);
 
 		wb_update_bandwidth(wb, wb_start);
+		spin_unlock(&wb->list_lock);
+
+		trace_writeback_written(wb, work);
 
 		/*
 		 * Did we write something? Try for more
@@ -1693,15 +1696,14 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 */
 		if (!list_empty(&wb->b_more_io))  {
 			trace_writeback_wait(wb, work);
+			spin_lock(&wb->list_lock);
 			inode = wb_inode(wb->b_more_io.prev);
-			spin_lock(&inode->i_lock);
 			spin_unlock(&wb->list_lock);
+			spin_lock(&inode->i_lock);
 			/* This function drops i_lock... */
 			inode_sleep_on_writeback(inode);
-			spin_lock(&wb->list_lock);
 		}
 	}
-	spin_unlock(&wb->list_lock);
 	blk_finish_plug(&plug);
 
 	return nr_pages - work->nr_pages;
-- 
2.0.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] writeback: call writeback tracepoints withoud holding list_lock in wb_writeback()
  2016-02-24 22:47 [PATCH] writeback: call writeback tracepoints withoud holding list_lock in wb_writeback() Yang Shi
@ 2016-02-25  2:40 ` Steven Rostedt
  2016-02-25 19:38   ` Shi, Yang
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2016-02-25  2:40 UTC (permalink / raw)
  To: Yang Shi
  Cc: tj, jack, axboe, fengguang.wu, tglx, bigeasy, akpm, linux-kernel,
	linux-mm, linux-rt-users, linaro-kernel

On Wed, 24 Feb 2016 14:47:23 -0800
Yang Shi <yang.shi@linaro.org> wrote:

> commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback
> tracepoints to report cgroup") made writeback tracepoints report cgroup
> writeback, but it may trigger the below bug on -rt kernel due to the list_lock
> held for the for loop in wb_writeback().

list_lock is a sleeping mutex, it's not disabling preemption. Moving it
doesn't make a difference.

> 
> BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930
> in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3

Something else disabled preemption. And note, nothing in the tracepoint
should have called a sleeping function.


> INFO: lockdep is turned off.
> Preemption disabled at:[<ffffffc000374a5c>] wb_writeback+0xec/0x830
> 
> CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
> Hardware name: Freescale Layerscape 2085a RDB Board (DT)
> Workqueue: writeback wb_workfn (flush-7:0)
> Call trace:
> [<ffffffc00008d708>] dump_backtrace+0x0/0x200
> [<ffffffc00008d92c>] show_stack+0x24/0x30
> [<ffffffc0007b0f40>] dump_stack+0x88/0xa8
> [<ffffffc000127d74>] ___might_sleep+0x2ec/0x300
> [<ffffffc000d5d550>] rt_spin_lock+0x38/0xb8
> [<ffffffc0003e0548>] kernfs_path_len+0x30/0x90
> [<ffffffc00036b360>] trace_event_raw_event_writeback_work_class+0xe8/0x2e8

How accurate is this trace back? Here's the code that is executed in
this tracepoint:

	TP_fast_assign(
		struct device *dev = bdi->dev;
		if (!dev)
			dev = default_backing_dev_info.dev;
		strncpy(__entry->name, dev_name(dev), 32);
		__entry->nr_pages = work->nr_pages;
		__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
		__entry->sync_mode = work->sync_mode;
		__entry->for_kupdate = work->for_kupdate;
		__entry->range_cyclic = work->range_cyclic;
		__entry->for_background	= work->for_background;
		__entry->reason = work->reason;
	),

See anything that would sleep?

> [<ffffffc000374f90>] wb_writeback+0x620/0x830
> [<ffffffc000376224>] wb_workfn+0x61c/0x950
> [<ffffffc000110adc>] process_one_work+0x3ac/0xb30
> [<ffffffc0001112fc>] worker_thread+0x9c/0x7a8
> [<ffffffc00011a9e8>] kthread+0x190/0x1b0
> [<ffffffc000086ca0>] ret_from_fork+0x10/0x30
> 
> The list_lock was moved outside the for loop by commit
> e8dfc30582995ae12454cda517b17d6294175b07 ("writeback: elevate queue_io()
> into wb_writeback())", however, the commit log says "No behavior change", so
> it sounds safe to have the list_lock acquired inside the for loop as it did
> before.
> 
> Just acquire list_lock at the necessary points and keep all writeback
> tracepoints outside the critical area protected by list_lock in
> wb_writeback().

But list_lock itself is a sleeping lock. This doesn't make sense.

This is not the bug you are looking for.

-- Steve

> 
> Signed-off-by: Yang Shi <yang.shi@linaro.org>
> ---
>  fs/fs-writeback.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 1f76d89..9b7b5f6 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1623,7 +1623,6 @@ static long wb_writeback(struct bdi_writeback *wb,
>  	work->older_than_this = &oldest_jif;
>  
>  	blk_start_plug(&plug);
> -	spin_lock(&wb->list_lock);
>  	for (;;) {
>  		/*
>  		 * Stop writeback when nr_pages has been consumed
> @@ -1661,15 +1660,19 @@ static long wb_writeback(struct bdi_writeback *wb,
>  			oldest_jif = jiffies;
>  
>  		trace_writeback_start(wb, work);
> +
> +		spin_lock(&wb->list_lock);
>  		if (list_empty(&wb->b_io))
>  			queue_io(wb, work);
>  		if (work->sb)
>  			progress = writeback_sb_inodes(work->sb, wb, work);
>  		else
>  			progress = __writeback_inodes_wb(wb, work);
> -		trace_writeback_written(wb, work);
>  
>  		wb_update_bandwidth(wb, wb_start);
> +		spin_unlock(&wb->list_lock);
> +
> +		trace_writeback_written(wb, work);
>  
>  		/*
>  		 * Did we write something? Try for more
> @@ -1693,15 +1696,14 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		 */
>  		if (!list_empty(&wb->b_more_io))  {
>  			trace_writeback_wait(wb, work);
> +			spin_lock(&wb->list_lock);
>  			inode = wb_inode(wb->b_more_io.prev);
> -			spin_lock(&inode->i_lock);
>  			spin_unlock(&wb->list_lock);
> +			spin_lock(&inode->i_lock);
>  			/* This function drops i_lock... */
>  			inode_sleep_on_writeback(inode);
> -			spin_lock(&wb->list_lock);
>  		}
>  	}
> -	spin_unlock(&wb->list_lock);
>  	blk_finish_plug(&plug);
>  
>  	return nr_pages - work->nr_pages;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] writeback: call writeback tracepoints withoud holding list_lock in wb_writeback()
  2016-02-25  2:40 ` Steven Rostedt
@ 2016-02-25 19:38   ` Shi, Yang
  2016-02-25 19:54     ` Steven Rostedt
  0 siblings, 1 reply; 9+ messages in thread
From: Shi, Yang @ 2016-02-25 19:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: tj, jack, axboe, fengguang.wu, tglx, bigeasy, akpm, linux-kernel,
	linux-mm, linux-rt-users, linaro-kernel

On 2/24/2016 6:40 PM, Steven Rostedt wrote:
> On Wed, 24 Feb 2016 14:47:23 -0800
> Yang Shi <yang.shi@linaro.org> wrote:
>
>> commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback
>> tracepoints to report cgroup") made writeback tracepoints report cgroup
>> writeback, but it may trigger the below bug on -rt kernel due to the list_lock
>> held for the for loop in wb_writeback().
>
> list_lock is a sleeping mutex, it's not disabling preemption. Moving it
> doesn't make a difference.
>
>>
>> BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930
>> in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3
>
> Something else disabled preemption. And note, nothing in the tracepoint
> should have called a sleeping function.

Yes, it makes me confused too. It sounds like the preempt_ip address is 
not that accurate.

>
>
>> INFO: lockdep is turned off.
>> Preemption disabled at:[<ffffffc000374a5c>] wb_writeback+0xec/0x830
>>
>> CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
>> Hardware name: Freescale Layerscape 2085a RDB Board (DT)
>> Workqueue: writeback wb_workfn (flush-7:0)
>> Call trace:
>> [<ffffffc00008d708>] dump_backtrace+0x0/0x200
>> [<ffffffc00008d92c>] show_stack+0x24/0x30
>> [<ffffffc0007b0f40>] dump_stack+0x88/0xa8
>> [<ffffffc000127d74>] ___might_sleep+0x2ec/0x300
>> [<ffffffc000d5d550>] rt_spin_lock+0x38/0xb8
>> [<ffffffc0003e0548>] kernfs_path_len+0x30/0x90
>> [<ffffffc00036b360>] trace_event_raw_event_writeback_work_class+0xe8/0x2e8
>
> How accurate is this trace back? Here's the code that is executed in
> this tracepoint:
>
> 	TP_fast_assign(
> 		struct device *dev = bdi->dev;
> 		if (!dev)
> 			dev = default_backing_dev_info.dev;
> 		strncpy(__entry->name, dev_name(dev), 32);
> 		__entry->nr_pages = work->nr_pages;
> 		__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
> 		__entry->sync_mode = work->sync_mode;
> 		__entry->for_kupdate = work->for_kupdate;
> 		__entry->range_cyclic = work->range_cyclic;
> 		__entry->for_background	= work->for_background;
> 		__entry->reason = work->reason;
> 	),
>
> See anything that would sleep?

According to the stack backtrace, kernfs_path_len calls slepping lock, 
which is called by __trace_wb_cgroup_size(wb) in __dynamic_array(char, 
cgroup, __trace_wb_cgroup_size(wb)).

The below is the definition:

DECLARE_EVENT_CLASS(writeback_work_class,
         TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work),
         TP_ARGS(wb, work),
         TP_STRUCT__entry(
                 __array(char, name, 32)
                 __field(long, nr_pages)
                 __field(dev_t, sb_dev)
                 __field(int, sync_mode)
                 __field(int, for_kupdate)
                 __field(int, range_cyclic)
                 __field(int, for_background)
                 __field(int, reason)
                 __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))

Thanks,
Yang

>
>> [<ffffffc000374f90>] wb_writeback+0x620/0x830
>> [<ffffffc000376224>] wb_workfn+0x61c/0x950
>> [<ffffffc000110adc>] process_one_work+0x3ac/0xb30
>> [<ffffffc0001112fc>] worker_thread+0x9c/0x7a8
>> [<ffffffc00011a9e8>] kthread+0x190/0x1b0
>> [<ffffffc000086ca0>] ret_from_fork+0x10/0x30
>>
>> The list_lock was moved outside the for loop by commit
>> e8dfc30582995ae12454cda517b17d6294175b07 ("writeback: elevate queue_io()
>> into wb_writeback())", however, the commit log says "No behavior change", so
>> it sounds safe to have the list_lock acquired inside the for loop as it did
>> before.
>>
>> Just acquire list_lock at the necessary points and keep all writeback
>> tracepoints outside the critical area protected by list_lock in
>> wb_writeback().
>
> But list_lock itself is a sleeping lock. This doesn't make sense.
>
> This is not the bug you are looking for.
>
> -- Steve
>
>>
>> Signed-off-by: Yang Shi <yang.shi@linaro.org>
>> ---
>>   fs/fs-writeback.c | 12 +++++++-----
>>   1 file changed, 7 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> index 1f76d89..9b7b5f6 100644
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -1623,7 +1623,6 @@ static long wb_writeback(struct bdi_writeback *wb,
>>   	work->older_than_this = &oldest_jif;
>>
>>   	blk_start_plug(&plug);
>> -	spin_lock(&wb->list_lock);
>>   	for (;;) {
>>   		/*
>>   		 * Stop writeback when nr_pages has been consumed
>> @@ -1661,15 +1660,19 @@ static long wb_writeback(struct bdi_writeback *wb,
>>   			oldest_jif = jiffies;
>>
>>   		trace_writeback_start(wb, work);
>> +
>> +		spin_lock(&wb->list_lock);
>>   		if (list_empty(&wb->b_io))
>>   			queue_io(wb, work);
>>   		if (work->sb)
>>   			progress = writeback_sb_inodes(work->sb, wb, work);
>>   		else
>>   			progress = __writeback_inodes_wb(wb, work);
>> -		trace_writeback_written(wb, work);
>>
>>   		wb_update_bandwidth(wb, wb_start);
>> +		spin_unlock(&wb->list_lock);
>> +
>> +		trace_writeback_written(wb, work);
>>
>>   		/*
>>   		 * Did we write something? Try for more
>> @@ -1693,15 +1696,14 @@ static long wb_writeback(struct bdi_writeback *wb,
>>   		 */
>>   		if (!list_empty(&wb->b_more_io))  {
>>   			trace_writeback_wait(wb, work);
>> +			spin_lock(&wb->list_lock);
>>   			inode = wb_inode(wb->b_more_io.prev);
>> -			spin_lock(&inode->i_lock);
>>   			spin_unlock(&wb->list_lock);
>> +			spin_lock(&inode->i_lock);
>>   			/* This function drops i_lock... */
>>   			inode_sleep_on_writeback(inode);
>> -			spin_lock(&wb->list_lock);
>>   		}
>>   	}
>> -	spin_unlock(&wb->list_lock);
>>   	blk_finish_plug(&plug);
>>
>>   	return nr_pages - work->nr_pages;
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] writeback: call writeback tracepoints withoud holding list_lock in wb_writeback()
  2016-02-25 19:38   ` Shi, Yang
@ 2016-02-25 19:54     ` Steven Rostedt
  2016-02-25 23:16       ` Shi, Yang
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2016-02-25 19:54 UTC (permalink / raw)
  To: Shi, Yang
  Cc: tj, jack, axboe, fengguang.wu, tglx, bigeasy, akpm, linux-kernel,
	linux-mm, linux-rt-users, linaro-kernel

On Thu, 25 Feb 2016 11:38:48 -0800
"Shi, Yang" <yang.shi@linaro.org> wrote:

> On 2/24/2016 6:40 PM, Steven Rostedt wrote:
> > On Wed, 24 Feb 2016 14:47:23 -0800
> > Yang Shi <yang.shi@linaro.org> wrote:
> >  
> >> commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback
> >> tracepoints to report cgroup") made writeback tracepoints report cgroup
> >> writeback, but it may trigger the below bug on -rt kernel due to the list_lock
> >> held for the for loop in wb_writeback().  
> >
> > list_lock is a sleeping mutex, it's not disabling preemption. Moving it
> > doesn't make a difference.
> >  
> >>
> >> BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930
> >> in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3  
> >
> > Something else disabled preemption. And note, nothing in the tracepoint
> > should have called a sleeping function.  
> 
> Yes, it makes me confused too. It sounds like the preempt_ip address is 
> not that accurate.

Yep, but the change you made doesn't look to be the fix.

> 
> >
> >  
> >> INFO: lockdep is turned off.
> >> Preemption disabled at:[<ffffffc000374a5c>] wb_writeback+0xec/0x830

Can you disassemble the vmlinux file to see exactly where that call is.
I use gdb to find the right locations.

 gdb> li *0xffffffc000374a5c
 gdb> disass 0xffffffc000374a5c

> >>
> >> CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
> >> Hardware name: Freescale Layerscape 2085a RDB Board (DT)
> >> Workqueue: writeback wb_workfn (flush-7:0)
> >> Call trace:
> >> [<ffffffc00008d708>] dump_backtrace+0x0/0x200
> >> [<ffffffc00008d92c>] show_stack+0x24/0x30
> >> [<ffffffc0007b0f40>] dump_stack+0x88/0xa8
> >> [<ffffffc000127d74>] ___might_sleep+0x2ec/0x300
> >> [<ffffffc000d5d550>] rt_spin_lock+0x38/0xb8
> >> [<ffffffc0003e0548>] kernfs_path_len+0x30/0x90
> >> [<ffffffc00036b360>] trace_event_raw_event_writeback_work_class+0xe8/0x2e8  
> >
> > How accurate is this trace back? Here's the code that is executed in
> > this tracepoint:
> >
> > 	TP_fast_assign(
> > 		struct device *dev = bdi->dev;
> > 		if (!dev)
> > 			dev = default_backing_dev_info.dev;
> > 		strncpy(__entry->name, dev_name(dev), 32);
> > 		__entry->nr_pages = work->nr_pages;
> > 		__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
> > 		__entry->sync_mode = work->sync_mode;
> > 		__entry->for_kupdate = work->for_kupdate;
> > 		__entry->range_cyclic = work->range_cyclic;
> > 		__entry->for_background	= work->for_background;
> > 		__entry->reason = work->reason;
> > 	),
> >
> > See anything that would sleep?  
> 
> According to the stack backtrace, kernfs_path_len calls slepping lock, 
> which is called by __trace_wb_cgroup_size(wb) in __dynamic_array(char, 
> cgroup, __trace_wb_cgroup_size(wb)).
> 
> The below is the definition:
> 
> DECLARE_EVENT_CLASS(writeback_work_class,
>          TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work),
>          TP_ARGS(wb, work),
>          TP_STRUCT__entry(
>                  __array(char, name, 32)
>                  __field(long, nr_pages)
>                  __field(dev_t, sb_dev)
>                  __field(int, sync_mode)
>                  __field(int, for_kupdate)
>                  __field(int, range_cyclic)
>                  __field(int, for_background)
>                  __field(int, reason)
>                  __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
> 

Ah, thanks for pointing that out. I missed that.

-- Steve

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] writeback: call writeback tracepoints withoud holding list_lock in wb_writeback()
  2016-02-25 19:54     ` Steven Rostedt
@ 2016-02-25 23:16       ` Shi, Yang
  2016-02-25 23:31         ` Steven Rostedt
  0 siblings, 1 reply; 9+ messages in thread
From: Shi, Yang @ 2016-02-25 23:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: tj, jack, axboe, fengguang.wu, tglx, bigeasy, akpm, linux-kernel,
	linux-mm, linux-rt-users, linaro-kernel

On 2/25/2016 11:54 AM, Steven Rostedt wrote:
> On Thu, 25 Feb 2016 11:38:48 -0800
> "Shi, Yang" <yang.shi@linaro.org> wrote:
>
>> On 2/24/2016 6:40 PM, Steven Rostedt wrote:
>>> On Wed, 24 Feb 2016 14:47:23 -0800
>>> Yang Shi <yang.shi@linaro.org> wrote:
>>>
>>>> commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback
>>>> tracepoints to report cgroup") made writeback tracepoints report cgroup
>>>> writeback, but it may trigger the below bug on -rt kernel due to the list_lock
>>>> held for the for loop in wb_writeback().
>>>
>>> list_lock is a sleeping mutex, it's not disabling preemption. Moving it
>>> doesn't make a difference.
>>>
>>>>
>>>> BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930
>>>> in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3
>>>
>>> Something else disabled preemption. And note, nothing in the tracepoint
>>> should have called a sleeping function.
>>
>> Yes, it makes me confused too. It sounds like the preempt_ip address is
>> not that accurate.
>
> Yep, but the change you made doesn't look to be the fix.

Actually, regardless whether this is the right fix for the splat, it 
makes me be wondering if the spin lock which protects the whole for loop 
is really necessary. It sounds feasible to move it into the for loop and 
just protect the necessary area.

>
>>
>>>
>>>
>>>> INFO: lockdep is turned off.
>>>> Preemption disabled at:[<ffffffc000374a5c>] wb_writeback+0xec/0x830
>
> Can you disassemble the vmlinux file to see exactly where that call is.
> I use gdb to find the right locations.
>
>   gdb> li *0xffffffc000374a5c
>   gdb> disass 0xffffffc000374a5c

I use gdb to get the code too.

It does point to the spin_lock.

(gdb) list *0xffffffc000374a5c
0xffffffc000374a5c is in wb_writeback (fs/fs-writeback.c:1621).
1616
1617            oldest_jif = jiffies;
1618            work->older_than_this = &oldest_jif;
1619
1620            blk_start_plug(&plug);
1621            spin_lock(&wb->list_lock);
1622            for (;;) {
1623                    /*
1624                     * Stop writeback when nr_pages has been consumed
1625                     */


The disassemble:
    0xffffffc000374a58 <+232>:   bl      0xffffffc0001300b0 
<migrate_disable>
    0xffffffc000374a5c <+236>:   mov     x0, x22
    0xffffffc000374a60 <+240>:   bl      0xffffffc000d5d518 <rt_spin_lock>

>
>>>>
>>>> CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
>>>> Hardware name: Freescale Layerscape 2085a RDB Board (DT)
>>>> Workqueue: writeback wb_workfn (flush-7:0)
>>>> Call trace:
>>>> [<ffffffc00008d708>] dump_backtrace+0x0/0x200
>>>> [<ffffffc00008d92c>] show_stack+0x24/0x30
>>>> [<ffffffc0007b0f40>] dump_stack+0x88/0xa8
>>>> [<ffffffc000127d74>] ___might_sleep+0x2ec/0x300
>>>> [<ffffffc000d5d550>] rt_spin_lock+0x38/0xb8
>>>> [<ffffffc0003e0548>] kernfs_path_len+0x30/0x90
>>>> [<ffffffc00036b360>] trace_event_raw_event_writeback_work_class+0xe8/0x2e8
>>>
>>> How accurate is this trace back? Here's the code that is executed in
>>> this tracepoint:
>>>
>>> 	TP_fast_assign(
>>> 		struct device *dev = bdi->dev;
>>> 		if (!dev)
>>> 			dev = default_backing_dev_info.dev;
>>> 		strncpy(__entry->name, dev_name(dev), 32);
>>> 		__entry->nr_pages = work->nr_pages;
>>> 		__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
>>> 		__entry->sync_mode = work->sync_mode;
>>> 		__entry->for_kupdate = work->for_kupdate;
>>> 		__entry->range_cyclic = work->range_cyclic;
>>> 		__entry->for_background	= work->for_background;
>>> 		__entry->reason = work->reason;
>>> 	),
>>>
>>> See anything that would sleep?
>>
>> According to the stack backtrace, kernfs_path_len calls slepping lock,
>> which is called by __trace_wb_cgroup_size(wb) in __dynamic_array(char,
>> cgroup, __trace_wb_cgroup_size(wb)).
>>
>> The below is the definition:
>>
>> DECLARE_EVENT_CLASS(writeback_work_class,
>>           TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work),
>>           TP_ARGS(wb, work),
>>           TP_STRUCT__entry(
>>                   __array(char, name, 32)
>>                   __field(long, nr_pages)
>>                   __field(dev_t, sb_dev)
>>                   __field(int, sync_mode)
>>                   __field(int, for_kupdate)
>>                   __field(int, range_cyclic)
>>                   __field(int, for_background)
>>                   __field(int, reason)
>>                   __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
>>
>
> Ah, thanks for pointing that out. I missed that.

It sounds not correct if tracepoint doesn't allow sleep.

I considered to change sleeping lock to raw lock in kernfs_* functions, 
but it sounds not reasonable since they are used heavily by cgroup.

Thanks,
Yang

>
> -- Steve
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] writeback: call writeback tracepoints withoud holding list_lock in wb_writeback()
  2016-02-25 23:16       ` Shi, Yang
@ 2016-02-25 23:31         ` Steven Rostedt
  2016-02-25 23:47           ` Shi, Yang
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2016-02-25 23:31 UTC (permalink / raw)
  To: Shi, Yang
  Cc: tj, jack, axboe, fengguang.wu, tglx, bigeasy, akpm, linux-kernel,
	linux-mm, linux-rt-users, linaro-kernel

On Thu, 25 Feb 2016 15:16:54 -0800
"Shi, Yang" <yang.shi@linaro.org> wrote:


> Actually, regardless whether this is the right fix for the splat, it 
> makes me be wondering if the spin lock which protects the whole for loop 
> is really necessary. It sounds feasible to move it into the for loop and 
> just protect the necessary area.

That's a separate issue, which may have its own merits that should be
decided by the writeback folks.

> 
> >  
> >>  
> >>>
> >>>  
> >>>> INFO: lockdep is turned off.
> >>>> Preemption disabled at:[<ffffffc000374a5c>] wb_writeback+0xec/0x830  
> >
> > Can you disassemble the vmlinux file to see exactly where that call is.
> > I use gdb to find the right locations.
> >  
> >   gdb> li *0xffffffc000374a5c
> >   gdb> disass 0xffffffc000374a5c  
> 
> I use gdb to get the code too.
> 
> It does point to the spin_lock.
> 
> (gdb) list *0xffffffc000374a5c
> 0xffffffc000374a5c is in wb_writeback (fs/fs-writeback.c:1621).
> 1616
> 1617            oldest_jif = jiffies;
> 1618            work->older_than_this = &oldest_jif;
> 1619
> 1620            blk_start_plug(&plug);
> 1621            spin_lock(&wb->list_lock);
> 1622            for (;;) {
> 1623                    /*
> 1624                     * Stop writeback when nr_pages has been consumed
> 1625                     */
> 
> 
> The disassemble:
>     0xffffffc000374a58 <+232>:   bl      0xffffffc0001300b0 

The above is the place it recorded. But I just realized, this isn't the
issue. I know where the problem is.


> <migrate_disable>
>     0xffffffc000374a5c <+236>:   mov     x0, x22
>     0xffffffc000374a60 <+240>:   bl      0xffffffc000d5d518 <rt_spin_lock>
> 
> >  



> >> DECLARE_EVENT_CLASS(writeback_work_class,
> >>           TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work),
> >>           TP_ARGS(wb, work),
> >>           TP_STRUCT__entry(
> >>                   __array(char, name, 32)
> >>                   __field(long, nr_pages)
> >>                   __field(dev_t, sb_dev)
> >>                   __field(int, sync_mode)
> >>                   __field(int, for_kupdate)
> >>                   __field(int, range_cyclic)
> >>                   __field(int, for_background)
> >>                   __field(int, reason)
> >>                   __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
> >>  
> >
> > Ah, thanks for pointing that out. I missed that.  
> 
> It sounds not correct if tracepoint doesn't allow sleep.
> 
> I considered to change sleeping lock to raw lock in kernfs_* functions, 
> but it sounds not reasonable since they are used heavily by cgroup.

It is the kernfs_* that can't sleep. Tracepoints use
rcu_read_lock_sched_notrace(), which disables preemption, and not only
that, hides itself from lockdep as the last place to disable preemption.

Is there a way to not use the kernfs_* function? At least for -rt?

-- Steve

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] writeback: call writeback tracepoints withoud holding list_lock in wb_writeback()
  2016-02-25 23:31         ` Steven Rostedt
@ 2016-02-25 23:47           ` Shi, Yang
  2016-02-25 23:54             ` Shi, Yang
  0 siblings, 1 reply; 9+ messages in thread
From: Shi, Yang @ 2016-02-25 23:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: tj, jack, axboe, fengguang.wu, tglx, bigeasy, akpm, linux-kernel,
	linux-mm, linux-rt-users, linaro-kernel

On 2/25/2016 3:31 PM, Steven Rostedt wrote:
> On Thu, 25 Feb 2016 15:16:54 -0800
> "Shi, Yang" <yang.shi@linaro.org> wrote:
>
>
>> Actually, regardless whether this is the right fix for the splat, it
>> makes me be wondering if the spin lock which protects the whole for loop
>> is really necessary. It sounds feasible to move it into the for loop and
>> just protect the necessary area.
>
> That's a separate issue, which may have its own merits that should be
> decided by the writeback folks.

Yes, definitely. I will rework my commit log for this part.

>
>>
>>>
>>>>
>>>>>
>>>>>
>>>>>> INFO: lockdep is turned off.
>>>>>> Preemption disabled at:[<ffffffc000374a5c>] wb_writeback+0xec/0x830
>>>
>>> Can you disassemble the vmlinux file to see exactly where that call is.
>>> I use gdb to find the right locations.
>>>
>>>    gdb> li *0xffffffc000374a5c
>>>    gdb> disass 0xffffffc000374a5c
>>
>> I use gdb to get the code too.
>>
>> It does point to the spin_lock.
>>
>> (gdb) list *0xffffffc000374a5c
>> 0xffffffc000374a5c is in wb_writeback (fs/fs-writeback.c:1621).
>> 1616
>> 1617            oldest_jif = jiffies;
>> 1618            work->older_than_this = &oldest_jif;
>> 1619
>> 1620            blk_start_plug(&plug);
>> 1621            spin_lock(&wb->list_lock);
>> 1622            for (;;) {
>> 1623                    /*
>> 1624                     * Stop writeback when nr_pages has been consumed
>> 1625                     */
>>
>>
>> The disassemble:
>>      0xffffffc000374a58 <+232>:   bl      0xffffffc0001300b0
>
> The above is the place it recorded. But I just realized, this isn't the
> issue. I know where the problem is.
>
>
>> <migrate_disable>
>>      0xffffffc000374a5c <+236>:   mov     x0, x22
>>      0xffffffc000374a60 <+240>:   bl      0xffffffc000d5d518 <rt_spin_lock>
>>
>>>
>
>
>
>>>> DECLARE_EVENT_CLASS(writeback_work_class,
>>>>            TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work),
>>>>            TP_ARGS(wb, work),
>>>>            TP_STRUCT__entry(
>>>>                    __array(char, name, 32)
>>>>                    __field(long, nr_pages)
>>>>                    __field(dev_t, sb_dev)
>>>>                    __field(int, sync_mode)
>>>>                    __field(int, for_kupdate)
>>>>                    __field(int, range_cyclic)
>>>>                    __field(int, for_background)
>>>>                    __field(int, reason)
>>>>                    __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
>>>>
>>>
>>> Ah, thanks for pointing that out. I missed that.
>>
>> It sounds not correct if tracepoint doesn't allow sleep.
>>
>> I considered to change sleeping lock to raw lock in kernfs_* functions,
>> but it sounds not reasonable since they are used heavily by cgroup.
>
> It is the kernfs_* that can't sleep. Tracepoints use
> rcu_read_lock_sched_notrace(), which disables preemption, and not only
> that, hides itself from lockdep as the last place to disable preemption.

Ah, thanks for pointing out this.

>
> Is there a way to not use the kernfs_* function? At least for -rt?

I'm not quite sure if there is straightforward replacement. However, I'm 
wondering if lock free version could be used by tracing.

For example, create __kernfs_path_len which doesn't acquire any lock for 
writeback tracing as long as there is not any race condition.

At least we could rule out preemption.

Thanks,
Yang

>
> -- Steve
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] writeback: call writeback tracepoints withoud holding list_lock in wb_writeback()
  2016-02-25 23:47           ` Shi, Yang
@ 2016-02-25 23:54             ` Shi, Yang
  2016-02-26  0:09               ` Steven Rostedt
  0 siblings, 1 reply; 9+ messages in thread
From: Shi, Yang @ 2016-02-25 23:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: tj, jack, axboe, fengguang.wu, tglx, bigeasy, akpm, linux-kernel,
	linux-mm, linux-rt-users, linaro-kernel

On 2/25/2016 3:47 PM, Shi, Yang wrote:
> On 2/25/2016 3:31 PM, Steven Rostedt wrote:
>> On Thu, 25 Feb 2016 15:16:54 -0800
>> "Shi, Yang" <yang.shi@linaro.org> wrote:
>>
>>
>>> Actually, regardless whether this is the right fix for the splat, it
>>> makes me be wondering if the spin lock which protects the whole for loop
>>> is really necessary. It sounds feasible to move it into the for loop and
>>> just protect the necessary area.
>>
>> That's a separate issue, which may have its own merits that should be
>> decided by the writeback folks.
>
> Yes, definitely. I will rework my commit log for this part.
>
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>> INFO: lockdep is turned off.
>>>>>>> Preemption disabled at:[<ffffffc000374a5c>] wb_writeback+0xec/0x830
>>>>
>>>> Can you disassemble the vmlinux file to see exactly where that call is.
>>>> I use gdb to find the right locations.
>>>>
>>>>    gdb> li *0xffffffc000374a5c
>>>>    gdb> disass 0xffffffc000374a5c
>>>
>>> I use gdb to get the code too.
>>>
>>> It does point to the spin_lock.
>>>
>>> (gdb) list *0xffffffc000374a5c
>>> 0xffffffc000374a5c is in wb_writeback (fs/fs-writeback.c:1621).
>>> 1616
>>> 1617            oldest_jif = jiffies;
>>> 1618            work->older_than_this = &oldest_jif;
>>> 1619
>>> 1620            blk_start_plug(&plug);
>>> 1621            spin_lock(&wb->list_lock);
>>> 1622            for (;;) {
>>> 1623                    /*
>>> 1624                     * Stop writeback when nr_pages has been
>>> consumed
>>> 1625                     */
>>>
>>>
>>> The disassemble:
>>>      0xffffffc000374a58 <+232>:   bl      0xffffffc0001300b0
>>
>> The above is the place it recorded. But I just realized, this isn't the
>> issue. I know where the problem is.
>>
>>
>>> <migrate_disable>
>>>      0xffffffc000374a5c <+236>:   mov     x0, x22
>>>      0xffffffc000374a60 <+240>:   bl      0xffffffc000d5d518
>>> <rt_spin_lock>
>>>
>>>>
>>
>>
>>
>>>>> DECLARE_EVENT_CLASS(writeback_work_class,
>>>>>            TP_PROTO(struct bdi_writeback *wb, struct
>>>>> wb_writeback_work *work),
>>>>>            TP_ARGS(wb, work),
>>>>>            TP_STRUCT__entry(
>>>>>                    __array(char, name, 32)
>>>>>                    __field(long, nr_pages)
>>>>>                    __field(dev_t, sb_dev)
>>>>>                    __field(int, sync_mode)
>>>>>                    __field(int, for_kupdate)
>>>>>                    __field(int, range_cyclic)
>>>>>                    __field(int, for_background)
>>>>>                    __field(int, reason)
>>>>>                    __dynamic_array(char, cgroup,
>>>>> __trace_wb_cgroup_size(wb))
>>>>>
>>>>
>>>> Ah, thanks for pointing that out. I missed that.
>>>
>>> It sounds not correct if tracepoint doesn't allow sleep.
>>>
>>> I considered to change sleeping lock to raw lock in kernfs_* functions,
>>> but it sounds not reasonable since they are used heavily by cgroup.
>>
>> It is the kernfs_* that can't sleep. Tracepoints use
>> rcu_read_lock_sched_notrace(), which disables preemption, and not only
>> that, hides itself from lockdep as the last place to disable preemption.
>
> Ah, thanks for pointing out this.
>
>>
>> Is there a way to not use the kernfs_* function? At least for -rt?
>
> I'm not quite sure if there is straightforward replacement. However, I'm
> wondering if lock free version could be used by tracing.
>
> For example, create __kernfs_path_len which doesn't acquire any lock for
> writeback tracing as long as there is not any race condition.
>
> At least we could rule out preemption.

Can we disable irqs in tracepoints since spin_lock_irqsave is used by 
kernfs_* functions.

Thanks,
Yang

>
> Thanks,
> Yang
>
>>
>> -- Steve
>>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] writeback: call writeback tracepoints withoud holding list_lock in wb_writeback()
  2016-02-25 23:54             ` Shi, Yang
@ 2016-02-26  0:09               ` Steven Rostedt
  0 siblings, 0 replies; 9+ messages in thread
From: Steven Rostedt @ 2016-02-26  0:09 UTC (permalink / raw)
  To: Shi, Yang
  Cc: tj, jack, axboe, fengguang.wu, tglx, bigeasy, akpm, linux-kernel,
	linux-mm, linux-rt-users, linaro-kernel

On Thu, 25 Feb 2016 15:54:13 -0800
"Shi, Yang" <yang.shi@linaro.org> wrote:


> Can we disable irqs in tracepoints since spin_lock_irqsave is used by 
> kernfs_* functions.

Disabling preemption or irqs is fine a tracepoint. You just can't
sleep, which spin_lock_irqsave() would do on the -rt kernel.

-- Steve

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2016-02-26  0:09 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-02-24 22:47 [PATCH] writeback: call writeback tracepoints withoud holding list_lock in wb_writeback() Yang Shi
2016-02-25  2:40 ` Steven Rostedt
2016-02-25 19:38   ` Shi, Yang
2016-02-25 19:54     ` Steven Rostedt
2016-02-25 23:16       ` Shi, Yang
2016-02-25 23:31         ` Steven Rostedt
2016-02-25 23:47           ` Shi, Yang
2016-02-25 23:54             ` Shi, Yang
2016-02-26  0:09               ` Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).