rcu_sched self-detect stall when disable vif device

* rcu_sched self-detect stall when disable vif device
@ 2015-01-27 16:03 Julien Grall
  2015-01-27 16:45 ` Wei Liu
  0 siblings, 1 reply; 10+ messages in thread
From: Julien Grall @ 2015-01-27 16:03 UTC (permalink / raw)
  To: xen-devel, Wei Liu, Ian Campbell

Hi,

While working on support for 64K pages in netfront, I got
an rcu_sched self-detected stall message. It happens when netback is
disabling the vif device due to an error.

I'm using Linux 3.19-rc5 on Seattle (ARM64). Any idea why
the processor is stuck in xenvif_rx_queue_purge?

Here is the log:

vif vif-20-0 vif20.0: txreq.offset: 3410, size: 342, end: 1382
vif vif-20-0 vif20.0: fatal error; disabling device
INFO: rcu_sched self-detected stall on CPU { 1}  (t=2101 jiffies g=37266 c=37265 q=2649)
Task dump for CPU 1:
vif20.0-q0-gues R  running task        0 12617      2 0x00000002
Call trace:
[<ffff800000089038>] dump_backtrace+0x0/0x124
[<ffff80000008916c>] show_stack+0x10/0x1c
[<ffff8000000bb1b4>] sched_show_task+0x98/0xf8
[<ffff8000000bdca0>] dump_cpu_task+0x3c/0x4c
[<ffff8000000d9fd4>] rcu_dump_cpu_stacks+0xa4/0xf8
[<ffff8000000dd1a8>] rcu_check_callbacks+0x478/0x748
[<ffff8000000e0b20>] update_process_times+0x38/0x6c
[<ffff8000000eedd0>] tick_sched_timer+0x64/0x1b4
[<ffff8000000e10a8>] __run_hrtimer+0x88/0x234
[<ffff8000000e19c0>] hrtimer_interrupt+0x108/0x2b0
[<ffff8000005934c4>] arch_timer_handler_virt+0x28/0x38
[<ffff8000000d5e88>] handle_percpu_devid_irq+0x88/0x11c
[<ffff8000000d1ec0>] generic_handle_irq+0x30/0x4c
[<ffff8000000d21dc>] __handle_domain_irq+0x5c/0xac
[<ffff8000000823b8>] gic_handle_irq+0x30/0x80
Exception stack(0xffff800013a07c20 to 0xffff800013a07d40)
7c20: 058ed000 ffff0000 058ed9d8 ffff0000 13a07d60 ffff8000 0053c418 ffff8000
7c40: 00000000 00000000 0000ecf2 00000000 058ed9ec ffff0000 00000000 00000000
7c60: 00000001 00000000 00000000 00000000 00001800 00000000 feacbe9d 0000060d
7c80: 1ce5d6e0 ffff8000 13a07a90 ffff8000 00000400 00000000 ffffffff ffffffff
7ca0: 0013d000 00000000 00000090 00000000 00000040 00000000 9a272028 0000ffff
7cc0: 00099e64 ffff8000 00411010 00000000 df8fbb70 0000ffff 058ed000 ffff0000
7ce0: 058ed9d8 ffff0000 058ed000 ffff0000 058ed988 ffff0000 00956000 ffff8000
7d00: 19204840 ffff8000 000c75f8 ffff8000 13a04000 ffff8000 008a0598 ffff8000
7d20: 00000000 00000000 13a07d60 ffff8000 0053c3bc ffff8000 13a07d60 ffff8000
[<ffff8000000854e4>] el1_irq+0x64/0xc0
[<ffff80000053c448>] xenvif_rx_queue_purge+0x1c/0x30
[<ffff80000053ea34>] xenvif_kthread_guest_rx+0x210/0x29c
[<ffff8000000b1060>] kthread+0xd8/0xf0


Regards,

-- 
Julien Grall

* Re: rcu_sched self-detect stall when disable vif device
  2015-01-27 16:03 rcu_sched self-detect stall when disable vif device Julien Grall
@ 2015-01-27 16:45 ` Wei Liu
  2015-01-27 16:47   ` Julien Grall
  2015-01-27 16:56   ` David Vrabel
  0 siblings, 2 replies; 10+ messages in thread
From: Wei Liu @ 2015-01-27 16:45 UTC (permalink / raw)
  To: Julien Grall; +Cc: Wei Liu, Ian Campbell, xen-devel

On Tue, Jan 27, 2015 at 04:03:52PM +0000, Julien Grall wrote:
> Hi,
> 
> While working on support for 64K pages in netfront, I got
> an rcu_sched self-detected stall message. It happens when netback is
> disabling the vif device due to an error.
> 
> I'm using Linux 3.19-rc5 on Seattle (ARM64). Any idea why
> the processor is stuck in xenvif_rx_queue_purge?
> 

When you release an SKB, the core network code needs to enter an RCU
critical section to clean up. dst_release, for one, calls call_rcu.

Wei.

* Re: rcu_sched self-detect stall when disable vif device
  2015-01-27 16:45 ` Wei Liu
@ 2015-01-27 16:47   ` Julien Grall
  2015-01-27 16:53     ` Wei Liu
  2015-01-27 16:56   ` David Vrabel
  1 sibling, 1 reply; 10+ messages in thread
From: Julien Grall @ 2015-01-27 16:47 UTC (permalink / raw)
  To: Wei Liu; +Cc: Ian Campbell, xen-devel

On 27/01/15 16:45, Wei Liu wrote:
> When you release an SKB, the core network code needs to enter an RCU
> critical section to clean up. dst_release, for one, calls call_rcu.

But this message shouldn't happen under normal conditions or because of
netfront, right?

Regards,

-- 
Julien Grall

* Re: rcu_sched self-detect stall when disable vif device
  2015-01-27 16:47   ` Julien Grall
@ 2015-01-27 16:53     ` Wei Liu
  2015-01-28 16:45       ` Julien Grall
  0 siblings, 1 reply; 10+ messages in thread
From: Wei Liu @ 2015-01-27 16:53 UTC (permalink / raw)
  To: Julien Grall; +Cc: Wei Liu, Ian Campbell, xen-devel

On Tue, Jan 27, 2015 at 04:47:45PM +0000, Julien Grall wrote:
> But this message shouldn't happen under normal conditions or because of
> netfront, right?
> 

I've never seen a report like this before, even when netfront is
buggy.

Wei.

* Re: rcu_sched self-detect stall when disable vif device
  2015-01-27 16:53     ` Wei Liu
@ 2015-01-28 16:45       ` Julien Grall
  2015-01-28 17:06         ` David Vrabel
  0 siblings, 1 reply; 10+ messages in thread
From: Julien Grall @ 2015-01-28 16:45 UTC (permalink / raw)
  To: Wei Liu; +Cc: David Vrabel, Ian Campbell, xen-devel

On 27/01/15 16:53, Wei Liu wrote:
> I've never seen a report like this before, even when netfront is
> buggy.

This only happens when preemption is not enabled (i.e.
CONFIG_PREEMPT_NONE in the config file) in the backend kernel.

When the vif is disabled, the loop in xenvif_kthread_guest_rx turns
into an infinite loop. In my case, the code executed looks like:


 1. for (;;) {
 2.	xenvif_wait_for_rx_work(queue);
 3.
 4.	if (kthread_should_stop())
 5.		break;
 6.
 7.	if (unlikely(vif->disabled && queue->id == 0)) {
 8.		xenvif_carrier_off(vif);
 9.		xenvif_rx_queue_purge(queue);
10.		continue;
11.	}
12. }

The wait on line 2 returns immediately because the vif is disabled
(see xenvif_have_rx_work).

We are on queue 0, so the condition on line 7 is true. Therefore we
hit the continue on line 10 and go straight back to line 2, and so on.

On a kernel where preemption is not enabled, this thread never
yields to another thread (unless the domain is destroyed).
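
To make the spinning concrete, here is a minimal user-space sketch of the
loop above. It is not kernel code: struct sim_queue, sim_wait_for_rx_work()
and sim_guest_rx() are hypothetical stand-ins invented for this
illustration, and the iteration cap only exists so the simulation
terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the netback queue state. */
struct sim_queue {
	int id;
	bool vif_disabled;	/* models vif->disabled */
	bool should_stop;	/* models kthread_should_stop() */
	unsigned long purges;	/* how often the queue was purged */
};

/*
 * Models xenvif_wait_for_rx_work(): returns true if the thread would
 * have slept. A disabled vif always counts as "work", so it never sleeps.
 */
static bool sim_wait_for_rx_work(const struct sim_queue *q)
{
	return !q->vif_disabled && !q->should_stop;
}

/*
 * Runs the loop for at most max_iters iterations and returns how many
 * times it went round without sleeping or exiting.
 * exit_on_disable selects "break" (true) or "continue" (false).
 */
static unsigned long sim_guest_rx(struct sim_queue *q, bool exit_on_disable,
				  unsigned long max_iters)
{
	unsigned long spins = 0;

	assert(q != NULL);

	while (spins < max_iters) {
		if (sim_wait_for_rx_work(q))
			break;		/* the thread would sleep here */

		if (q->should_stop)
			break;

		if (q->vif_disabled && q->id == 0) {
			q->purges++;
			if (exit_on_disable)
				break;	/* thread leaves the loop */
			spins++;
			continue;	/* back to the wait, which returns at once */
		}
		break;			/* normal rx work is outside this sketch */
	}
	return spins;
}
```

Calling sim_guest_rx() on a disabled queue 0 with exit_on_disable set to
false uses up every allowed iteration without ever reaching a sleep, which
is the behaviour a non-preemptible kernel reports as a stall.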

Regards,

-- 
Julien Grall

* Re: rcu_sched self-detect stall when disable vif device
  2015-01-28 16:45       ` Julien Grall
@ 2015-01-28 17:06         ` David Vrabel
  2015-01-28 17:27           ` Julien Grall
  0 siblings, 1 reply; 10+ messages in thread
From: David Vrabel @ 2015-01-28 17:06 UTC (permalink / raw)
  To: Julien Grall, Wei Liu; +Cc: Ian Campbell, xen-devel

On 28/01/15 16:45, Julien Grall wrote:
> This only happens when preemption is not enabled (i.e.
> CONFIG_PREEMPT_NONE in the config file) in the backend kernel.
> 
> When the vif is disabled, the loop in xenvif_kthread_guest_rx turns
> into an infinite loop. In my case, the code executed looks like:
> 
> 
>  1. for (;;) {
>  2.	xenvif_wait_for_rx_work(queue);
>  3.
>  4.	if (kthread_should_stop())
>  5.		break;
>  6.
>  7.	if (unlikely(vif->disabled && queue->id == 0)) {
>  8.		xenvif_carrier_off(vif);
>  9.		xenvif_rx_queue_purge(queue);
> 10.		continue;
> 11.	}
> 12. }
> 
> The wait on line 2 returns immediately because the vif is disabled
> (see xenvif_have_rx_work).
> 
> We are on queue 0, so the condition on line 7 is true. Therefore we
> hit the continue on line 10 and go straight back to line 2, and so on.
> 
> On a kernel where preemption is not enabled, this thread never
> yields to another thread (unless the domain is destroyed).

I'm not sure why we have a continue in the vif->disabled case and not
just a break.  Can you try that?

David

* Re: rcu_sched self-detect stall when disable vif device
  2015-01-28 17:06         ` David Vrabel
@ 2015-01-28 17:27           ` Julien Grall
  2015-01-30 16:04             ` David Vrabel
  0 siblings, 1 reply; 10+ messages in thread
From: Julien Grall @ 2015-01-28 17:27 UTC (permalink / raw)
  To: David Vrabel, Wei Liu; +Cc: Ian Campbell, xen-devel

On 28/01/15 17:06, David Vrabel wrote:
> I'm not sure why we have a continue in the vif->disabled case and not
> just a break.  Can you try that?

So I applied this small patch:

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 908e65e..9448c6c 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -2110,7 +2110,7 @@ int xenvif_kthread_guest_rx(void *data)
                if (unlikely(vif->disabled && queue->id == 0)) {
                        xenvif_carrier_off(vif);
                        xenvif_rx_queue_purge(queue);
-                       continue;
+                       break;
                }
 
                if (!skb_queue_empty(&queue->rx_queue))


While I no longer get the rcu_sched stall message, when I destroy the
guest, the backend hits a NULL pointer dereference:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = ffff800000a50000
[00000000] *pgd=00000083de82a003, *pud=00000083de82b003, *pmd=00000083de82c003, *pte=00600000e1110707
Internal error: Oops: 96000006 [#1] SMP
Modules linked in:
CPU: 4 PID: 34 Comm: xenwatch Not tainted 3.19.0-rc5-xen-seattle+ #13
Hardware name: AMD Seattle (RevA) Development Board (Overdrive) (DT)
task: ffff80001ea39480 ti: ffff80001ea78000 task.ti: ffff80001ea78000
PC is at exit_creds+0x18/0x70
LR is at __put_task_struct+0x3c/0xd4
pc : [<ffff8000000b2d94>] lr : [<ffff800000094990>] pstate: 80000145
sp : ffff80001ea7bc50
x29: ffff80001ea7bc50 x28: 0000000000000000 
x27: 0000000000000000 x26: 0000000000000000 
x25: 0000000000000000 x24: ffff80001eb3c840 
x23: ffff80001eb3c840 x22: 000000000006c560 
x21: ffff0000011f7000 x20: 0000000000000000 
x19: ffff80001ba06680 x18: 0000ffffd2635bd0 
x17: 0000ffff839e4074 x16: 00000000deadbeef 
x15: ffffffffffffffff x14: 0ffffffffffffffe 
x13: 0000000000000028 x12: 0000000000000010 
x11: 0000000000000030 x10: 0101010101010101 
x9 : ffff80001ea7b8e0 x8 : ffff7c01cf6e2740 
x7 : 0000000000000000 x6 : 0000000000002fc9 
x5 : 0000000000000000 x4 : 0000000000000001 
x3 : 0000000000000000 x2 : ffff80001ba06690 
x1 : 0000000000000000 x0 : 0000000000000000 

Process xenwatch (pid: 34, stack limit = 0xffff80001ea78058)
Stack: (0xffff80001ea7bc50 to 0xffff80001ea7c000)
bc40:                                     1ea7bc70 ffff8000 00094990 ffff8000
bc60: 1ba06680 ffff8000 008b45a8 ffff8000 1ea7bc90 ffff8000 000b15f0 ffff8000
bc80: 1ba06680 ffff8000 005bcab8 ffff8000 1ea7bcc0 ffff8000 00541efc ffff8000
bca0: 011ed000 ffff0000 00000000 00000000 011f7000 ffff0000 00000006 00000000
bcc0: 1ea7bd00 ffff8000 00540984 ffff8000 1ce23680 ffff8000 00000006 00000000
bce0: 00752cf0 ffff8000 00000001 00000000 00752e38 ffff8000 1ea7bd98 ffff8000
bd00: 1ea7bd40 ffff8000 00540bcc ffff8000 1ce23680 ffff8000 1cce0c00 ffff8000
bd20: 00000000 00000000 1cce0c00 ffff8000 009b0288 ffff8000 1ea7be20 ffff8000
bd40: 1ea7bd70 ffff8000 0048011c ffff8000 1ce23700 ffff8000 1cf71000 ffff8000
bd60: 009a6258 ffff8000 00a36d38 00000000 1ea7bdb0 ffff8000 00480ea4 ffff8000
bd80: 1b89d800 ffff8000 009a62b0 ffff8000 009a6258 ffff8000 00a36d38 ffff8000
bda0: 00a36e30 ffff8000 0047f7c0 ffff8000 1ea7bdc0 ffff8000 0047f82c ffff8000
bdc0: 1ea7be30 ffff8000 000b1064 ffff8000 1ea48cc0 ffff8000 009dbfe8 ffff8000
bde0: 008552d8 ffff8000 00000000 00000000 0047f778 ffff8000 00000000 00000000
be00: 1ea7be30 ffff8000 00000000 ffff8000 1ea39480 ffff8000 000c75f8 ffff8000
be20: 1ea7be20 ffff8000 1ea7be20 ffff8000 00000000 00000000 00085930 ffff8000
be40: 000b0f88 ffff8000 1ea48cc0 ffff8000 00000000 00000000 00000000 00000000
be60: 00000000 00000000 1ea48cc0 ffff8000 00000000 00000000 00000000 00000000
be80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bea0: 1ea7bea0 ffff8000 1ea7bea0 ffff8000 00000000 ffff8000 00000000 00000000
bec0: 1ea7bec0 ffff8000 1ea7bec0 ffff8000 00000000 00000000 00000000 00000000
bee0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf00: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf20: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf40: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf60: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bfa0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000005 00000000
bfe0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Call trace:
[<ffff8000000b2d94>] exit_creds+0x18/0x70
[<ffff80000009498c>] __put_task_struct+0x38/0xd4
[<ffff8000000b15ec>] kthread_stop+0xc0/0x130
[<ffff800000541ef8>] xenvif_disconnect+0x58/0xd0
[<ffff800000540980>] set_backend_state+0x134/0x278
[<ffff800000540bc8>] frontend_changed+0x8c/0xec
[<ffff800000480118>] xenbus_otherend_changed+0x9c/0xa4
[<ffff800000480ea0>] frontend_changed+0xc/0x18
[<ffff80000047f828>] xenwatch_thread+0xb0/0x140
[<ffff8000000b1060>] kthread+0xd8/0xf0
Code: f9000bf3 aa0003f3 f9422401 f9422000 (b9400021) 
---[ end trace af11d521ee530da8 ]---

Regards,

-- 
Julien Grall

* Re: rcu_sched self-detect stall when disable vif device
  2015-01-28 17:27           ` Julien Grall
@ 2015-01-30 16:04             ` David Vrabel
  2015-02-02 13:54               ` Julien Grall
  0 siblings, 1 reply; 10+ messages in thread
From: David Vrabel @ 2015-01-30 16:04 UTC (permalink / raw)
  To: Julien Grall, David Vrabel, Wei Liu; +Cc: Ian Campbell, xen-devel

On 28/01/15 17:27, Julien Grall wrote:
> So I applied this small patch:
> 
> diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
> index 908e65e..9448c6c 100644
> --- a/drivers/net/xen-netback/netback.c
> +++ b/drivers/net/xen-netback/netback.c
> @@ -2110,7 +2110,7 @@ int xenvif_kthread_guest_rx(void *data)
>                 if (unlikely(vif->disabled && queue->id == 0)) {
>                         xenvif_carrier_off(vif);
>                         xenvif_rx_queue_purge(queue);
> -                       continue;
> +                       break;
>                 }
>  
>                 if (!skb_queue_empty(&queue->rx_queue))

How about this?

8<------------------------------------------
xen-netback: stop the guest rx thread after a fatal error

After commit e9d8b2c2968499c1f96563e6522c56958d5a1d0d (xen-netback:
disable rogue vif in kthread context), a fatal (protocol) error would
leave the guest Rx thread spinning, wasting CPU time.  Commit
ecf08d2dbb96d5a4b4bcc53a39e8d29cc8fef02e (xen-netback: reintroduce
guest Rx stall detection) made this even worse by removing a
cond_resched() from this path.

A fatal error is non-recoverable so just allow the guest Rx thread to
exit.  This requires taking additional refs to the task so the thread
exiting early is handled safely.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>

diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 9259a73..037f74f 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -578,6 +578,7 @@ int xenvif_connect(struct xenvif_queue *queue, unsigned long tx_ring_ref,
 		goto err_rx_unbind;
 	}
 	queue->task = task;
+	get_task_struct(task);

 	task = kthread_create(xenvif_dealloc_kthread,
 			      (void *)queue, "%s-dealloc", queue->name);
@@ -634,6 +635,7 @@ void xenvif_disconnect(struct xenvif *vif)

 		if (queue->task) {
 			kthread_stop(queue->task);
+			put_task_struct(queue->task);
 			queue->task = NULL;
 		}

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 908e65e..c8ce701 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -2109,8 +2109,7 @@ int xenvif_kthread_guest_rx(void *data)
 		 */
 		if (unlikely(vif->disabled && queue->id == 0)) {
 			xenvif_carrier_off(vif);
-			xenvif_rx_queue_purge(queue);
-			continue;
+			break;
 		}

 		if (!skb_queue_empty(&queue->rx_queue))
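
As a rough illustration of why the extra reference matters, the sketch
below models the task lifetime in user space. It is a deliberate
simplification, not the kernel's real task lifetime rules: struct sim_task
and every sim_* function are hypothetical names made up for this example,
and the assumption that a thread holds exactly one reference that it drops
on exit is mine. The point is only that a thread which exits early can
release its task before the later kthread_stop() in xenvif_disconnect(),
unless someone else still holds a reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of a reference-counted task. */
struct sim_task {
	int refs;	/* outstanding references */
	bool freed;	/* set once the last reference is dropped */
};

/* Models get_task_struct(): take one more reference. */
static void sim_get_task(struct sim_task *t)
{
	assert(!t->freed);
	t->refs++;
}

/* Models put_task_struct(): drop a reference, free on the last one. */
static void sim_put_task(struct sim_task *t)
{
	assert(t->refs > 0);
	if (--t->refs == 0)
		t->freed = true;
}

/* Assumption: a new thread starts with one reference of its own. */
static struct sim_task sim_task_create(void)
{
	struct sim_task t = { .refs = 1, .freed = false };
	return t;
}

/* Models the rx thread leaving its loop via "break" and exiting. */
static void sim_thread_exit(struct sim_task *t)
{
	sim_put_task(t);
}

/* A later stop is only safe if the task has not been freed yet. */
static bool sim_stop_is_safe(const struct sim_task *t)
{
	return !t->freed;
}
```

In this model, creating a task and letting it exit leaves
sim_stop_is_safe() false, which corresponds to the oops seen with the
one-line "break" change. Taking a reference with sim_get_task() at connect
time keeps the task valid until the matching sim_put_task() at disconnect.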

* Re: rcu_sched self-detect stall when disable vif device
  2015-01-30 16:04             ` David Vrabel
@ 2015-02-02 13:54               ` Julien Grall
  0 siblings, 0 replies; 10+ messages in thread
From: Julien Grall @ 2015-02-02 13:54 UTC (permalink / raw)
  To: David Vrabel, Wei Liu; +Cc: Ian Campbell, xen-devel

Hi David,

On 30/01/15 16:04, David Vrabel wrote:
> How about this?

This is working for me. Thanks!

> 8<------------------------------------------
> xen-netback: stop the guest rx thread after a fatal error
> 
> After commit e9d8b2c2968499c1f96563e6522c56958d5a1d0d (xen-netback:
> disable rogue vif in kthread context), a fatal (protocol) error would
> leave the guest Rx thread spinning, wasting CPU time.  Commit
> ecf08d2dbb96d5a4b4bcc53a39e8d29cc8fef02e (xen-netback: reintroduce
> guest Rx stall detection) made this even worse by removing a
> cond_resched() from this path.
> 
> A fatal error is non-recoverable so just allow the guest Rx thread to
> exit.  This requires taking additional refs to the task so the thread
> exiting early is handled safely.
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reported-by: Julien Grall <julien.grall@linaro.org>
Tested-by: Julien Grall <julien.grall@linaro.org>

Regards,

-- 
Julien Grall

* Re: rcu_sched self-detect stall when disable vif device
  2015-01-27 16:45 ` Wei Liu
  2015-01-27 16:47   ` Julien Grall
@ 2015-01-27 16:56   ` David Vrabel
  1 sibling, 0 replies; 10+ messages in thread
From: David Vrabel @ 2015-01-27 16:56 UTC (permalink / raw)
  To: xen-devel

On 27/01/15 16:45, Wei Liu wrote:
> On Tue, Jan 27, 2015 at 04:03:52PM +0000, Julien Grall wrote:
>> Hi,
>>
>> While working on support for 64K pages in netfront, I got
>> an rcu_sched self-detected stall message. It happens when netback is
>> disabling the vif device due to an error.
>>
>> I'm using Linux 3.19-rc5 on Seattle (ARM64). Any idea why
>> the processor is stuck in xenvif_rx_queue_purge?
>>
> 
> When you release an SKB, the core network code needs to enter an RCU
> critical section to clean up. dst_release, for one, calls call_rcu.

This is RCU detecting a soft-lockup.  You're either spinning on the
spinlock or the guest rx queue is corrupt and cannot be drained.

David

