* rcu_sched self-detect stall when disable vif device
From: Julien Grall @ 2015-01-27 16:03 UTC (permalink / raw)
To: xen-devel, Wei Liu, Ian Campbell
Hi,
While working on 64K page support in netfront, I got an
rcu_sched self-detected stall message. It happens when netback
disables the vif device due to an error.
I'm using Linux 3.19-rc5 on Seattle (ARM64). Any idea why
the processor is stuck in xenvif_rx_queue_purge?
Here is the log:
vif vif-20-0 vif20.0: txreq.offset: 3410, size: 342, end: 1382
vif vif-20-0 vif20.0: fatal error; disabling device
INFO: rcu_sched self-detected stall on CPU { 1} (t=2101 jiffies g=37266 c=37265 q=2649)
Task dump for CPU 1:
vif20.0-q0-gues R running task 0 12617 2 0x00000002
Call trace:
[<ffff800000089038>] dump_backtrace+0x0/0x124
[<ffff80000008916c>] show_stack+0x10/0x1c
[<ffff8000000bb1b4>] sched_show_task+0x98/0xf8
[<ffff8000000bdca0>] dump_cpu_task+0x3c/0x4c
[<ffff8000000d9fd4>] rcu_dump_cpu_stacks+0xa4/0xf8
[<ffff8000000dd1a8>] rcu_check_callbacks+0x478/0x748
[<ffff8000000e0b20>] update_process_times+0x38/0x6c
[<ffff8000000eedd0>] tick_sched_timer+0x64/0x1b4
[<ffff8000000e10a8>] __run_hrtimer+0x88/0x234
[<ffff8000000e19c0>] hrtimer_interrupt+0x108/0x2b0
[<ffff8000005934c4>] arch_timer_handler_virt+0x28/0x38
[<ffff8000000d5e88>] handle_percpu_devid_irq+0x88/0x11c
[<ffff8000000d1ec0>] generic_handle_irq+0x30/0x4c
[<ffff8000000d21dc>] __handle_domain_irq+0x5c/0xac
[<ffff8000000823b8>] gic_handle_irq+0x30/0x80
Exception stack(0xffff800013a07c20 to 0xffff800013a07d40)
7c20: 058ed000 ffff0000 058ed9d8 ffff0000 13a07d60 ffff8000 0053c418 ffff8000
7c40: 00000000 00000000 0000ecf2 00000000 058ed9ec ffff0000 00000000 00000000
7c60: 00000001 00000000 00000000 00000000 00001800 00000000 feacbe9d 0000060d
7c80: 1ce5d6e0 ffff8000 13a07a90 ffff8000 00000400 00000000 ffffffff ffffffff
7ca0: 0013d000 00000000 00000090 00000000 00000040 00000000 9a272028 0000ffff
7cc0: 00099e64 ffff8000 00411010 00000000 df8fbb70 0000ffff 058ed000 ffff0000
7ce0: 058ed9d8 ffff0000 058ed000 ffff0000 058ed988 ffff0000 00956000 ffff8000
7d00: 19204840 ffff8000 000c75f8 ffff8000 13a04000 ffff8000 008a0598 ffff8000
7d20: 00000000 00000000 13a07d60 ffff8000 0053c3bc ffff8000 13a07d60 ffff8000
[<ffff8000000854e4>] el1_irq+0x64/0xc0
[<ffff80000053c448>] xenvif_rx_queue_purge+0x1c/0x30
[<ffff80000053ea34>] xenvif_kthread_guest_rx+0x210/0x29c
[<ffff8000000b1060>] kthread+0xd8/0xf0
Regards,
--
Julien Grall
* Re: rcu_sched self-detect stall when disable vif device
From: Wei Liu @ 2015-01-27 16:45 UTC
To: Julien Grall; +Cc: Wei Liu, Ian Campbell, xen-devel

On Tue, Jan 27, 2015 at 04:03:52PM +0000, Julien Grall wrote:
> I'm using Linux 3.19-rc5 on Seattle (ARM64). Any idea why
> the processor is stuck in xenvif_rx_queue_purge?

When you release an SKB, the core network code needs to enter an
RCU critical section to clean up. dst_release, for one, calls call_rcu.

Wei.
* Re: rcu_sched self-detect stall when disable vif device
From: Julien Grall @ 2015-01-27 16:47 UTC
To: Wei Liu; +Cc: Ian Campbell, xen-devel

On 27/01/15 16:45, Wei Liu wrote:
> When you release an SKB, the core network code needs to enter an
> RCU critical section to clean up. dst_release, for one, calls call_rcu.

But this message shouldn't happen under normal conditions or because of
netfront, right?

Regards,
--
Julien Grall
* Re: rcu_sched self-detect stall when disable vif device
From: Wei Liu @ 2015-01-27 16:53 UTC
To: Julien Grall; +Cc: Wei Liu, Ian Campbell, xen-devel

On Tue, Jan 27, 2015 at 04:47:45PM +0000, Julien Grall wrote:
> But this message shouldn't happen under normal conditions or because of
> netfront, right?

I have never seen a report like this before, even when netfront is
buggy.

Wei.
* Re: rcu_sched self-detect stall when disable vif device
From: Julien Grall @ 2015-01-28 16:45 UTC
To: Wei Liu; +Cc: David Vrabel, Ian Campbell, xen-devel

On 27/01/15 16:53, Wei Liu wrote:
> I have never seen a report like this before, even when netfront is
> buggy.

This only happens when preemption is not enabled (i.e.
CONFIG_PREEMPT_NONE in the config file) in the backend kernel.

When the vif is disabled, the loop in xenvif_kthread_guest_rx turns
into an infinite loop. In my case, the code executed looks like:

    1.  for (;;) {
    2.      xenvif_wait_for_rx_work(queue);
    3.
    4.      if (kthread_should_stop())
    5.          break;
    6.
    7.      if (unlikely(vif->disabled && queue->id == 0)) {
    8.          xenvif_carrier_off(vif);
    9.          xenvif_rx_queue_purge(queue);
    10.         continue;
    11.     }
    12. }

The wait on line 2 returns immediately because the vif is disabled
(see xenvif_have_rx_work).

We are on queue 0, so the condition on line 7 is true. The continue on
line 10 therefore takes us straight back to line 2, and so on.

On a platform where preemption is not enabled, this thread will never
yield to another thread (unless the domain is destroyed).

Regards,
--
Julien Grall
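The walk-through above can be reproduced outside the kernel. What follows is a minimal userspace sketch in C, not netback code: the `model_*` names, the `exit_policy` enum and the iteration cap are all hypothetical, introduced only to count how many iterations run back to back without the thread ever sleeping. It assumes, as described above, that the wait returns at once whenever the vif is disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the netback state; this is not kernel code. */
struct model_vif {
    bool disabled;            /* set after a fatal error */
};

struct model_queue {
    struct model_vif *vif;
    unsigned int id;
    unsigned int pending;     /* packets waiting on the rx queue */
};

/* What the thread does once it sees the disabled vif on queue 0. */
enum exit_policy { POLICY_CONTINUE, POLICY_BREAK };

/*
 * Assumption taken from the analysis above: the wait returns at once
 * when the vif is disabled, so the thread only sleeps when the vif is
 * enabled and nothing is pending.
 */
static bool model_wait_would_sleep(const struct model_queue *q)
{
    return !q->vif->disabled && q->pending == 0;
}

/*
 * Run the modelled loop for at most max_iters iterations and return how
 * many iterations executed back to back without sleeping.
 */
static unsigned int model_rx_loop(struct model_queue *q,
                                  enum exit_policy policy,
                                  unsigned int max_iters)
{
    unsigned int spins = 0;

    assert(q != NULL && q->vif != NULL);

    while (spins < max_iters) {
        if (model_wait_would_sleep(q))
            break;                  /* a real thread would yield the CPU here */
        spins++;

        if (q->vif->disabled && q->id == 0) {
            q->pending = 0;         /* models purging the rx queue */
            if (policy == POLICY_BREAK)
                break;              /* leave the loop for good */
            continue;               /* back to a wait that never sleeps */
        }

        if (q->pending > 0)
            q->pending--;           /* normal path: handle one packet */
    }
    return spins;
}
```

With a disabled vif on queue 0, `model_rx_loop(&q, POLICY_CONTINUE, 1000)` uses up the whole iteration budget, whereas `POLICY_BREAK` stops after a single iteration. With an enabled vif the loop stops as soon as the queue is empty, which is the point where the real thread would sleep.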
* Re: rcu_sched self-detect stall when disable vif device
From: David Vrabel @ 2015-01-28 17:06 UTC
To: Julien Grall, Wei Liu; +Cc: Ian Campbell, xen-devel

On 28/01/15 16:45, Julien Grall wrote:
> On a platform where preemption is not enabled, this thread will never
> yield to another thread (unless the domain is destroyed).

I'm not sure why we have a continue in the vif->disabled case and not
just a break. Can you try that?

David
* Re: rcu_sched self-detect stall when disable vif device
From: Julien Grall @ 2015-01-28 17:27 UTC
To: David Vrabel, Wei Liu; +Cc: Ian Campbell, xen-devel

On 28/01/15 17:06, David Vrabel wrote:
> I'm not sure why we have a continue in the vif->disabled case and not
> just a break. Can you try that?

So I applied this small patch:

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 908e65e..9448c6c 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -2110,7 +2110,7 @@ int xenvif_kthread_guest_rx(void *data)
                 if (unlikely(vif->disabled && queue->id == 0)) {
                         xenvif_carrier_off(vif);
                         xenvif_rx_queue_purge(queue);
-                        continue;
+                        break;
                 }
 
                 if (!skb_queue_empty(&queue->rx_queue))

While I no longer get the rcu_sched stall message, when I destroy the
guest the backend hits a NULL pointer dereference:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = ffff800000a50000
[00000000] *pgd=00000083de82a003, *pud=00000083de82b003, *pmd=00000083de82c003, *pte=00600000e1110707
Internal error: Oops: 96000006 [#1] SMP
Modules linked in:
CPU: 4 PID: 34 Comm: xenwatch Not tainted 3.19.0-rc5-xen-seattle+ #13
Hardware name: AMD Seattle (RevA) Development Board (Overdrive) (DT)
task: ffff80001ea39480 ti: ffff80001ea78000 task.ti: ffff80001ea78000
PC is at exit_creds+0x18/0x70
LR is at __put_task_struct+0x3c/0xd4
pc : [<ffff8000000b2d94>] lr : [<ffff800000094990>] pstate: 80000145
sp : ffff80001ea7bc50
x29: ffff80001ea7bc50 x28: 0000000000000000
x27: 0000000000000000 x26: 0000000000000000
x25: 0000000000000000 x24: ffff80001eb3c840
x23: ffff80001eb3c840 x22: 000000000006c560
x21: ffff0000011f7000 x20: 0000000000000000
x19: ffff80001ba06680 x18: 0000ffffd2635bd0
x17: 0000ffff839e4074 x16: 00000000deadbeef
x15: ffffffffffffffff x14: 0ffffffffffffffe
x13: 0000000000000028 x12: 0000000000000010
x11: 0000000000000030 x10: 0101010101010101
x9 : ffff80001ea7b8e0 x8 : ffff7c01cf6e2740
x7 : 0000000000000000 x6 : 0000000000002fc9
x5 : 0000000000000000 x4 : 0000000000000001
x3 : 0000000000000000 x2 : ffff80001ba06690
x1 : 0000000000000000 x0 : 0000000000000000
Process xenwatch (pid: 34, stack limit = 0xffff80001ea78058)
Stack: (0xffff80001ea7bc50 to 0xffff80001ea7c000)
bc40:                                     1ea7bc70 ffff8000 00094990 ffff8000
bc60: 1ba06680 ffff8000 008b45a8 ffff8000 1ea7bc90 ffff8000 000b15f0 ffff8000
bc80: 1ba06680 ffff8000 005bcab8 ffff8000 1ea7bcc0 ffff8000 00541efc ffff8000
bca0: 011ed000 ffff0000 00000000 00000000 011f7000 ffff0000 00000006 00000000
bcc0: 1ea7bd00 ffff8000 00540984 ffff8000 1ce23680 ffff8000 00000006 00000000
bce0: 00752cf0 ffff8000 00000001 00000000 00752e38 ffff8000 1ea7bd98 ffff8000
bd00: 1ea7bd40 ffff8000 00540bcc ffff8000 1ce23680 ffff8000 1cce0c00 ffff8000
bd20: 00000000 00000000 1cce0c00 ffff8000 009b0288 ffff8000 1ea7be20 ffff8000
bd40: 1ea7bd70 ffff8000 0048011c ffff8000 1ce23700 ffff8000 1cf71000 ffff8000
bd60: 009a6258 ffff8000 00a36d38 00000000 1ea7bdb0 ffff8000 00480ea4 ffff8000
bd80: 1b89d800 ffff8000 009a62b0 ffff8000 009a6258 ffff8000 00a36d38 ffff8000
bda0: 00a36e30 ffff8000 0047f7c0 ffff8000 1ea7bdc0 ffff8000 0047f82c ffff8000
bdc0: 1ea7be30 ffff8000 000b1064 ffff8000 1ea48cc0 ffff8000 009dbfe8 ffff8000
bde0: 008552d8 ffff8000 00000000 00000000 0047f778 ffff8000 00000000 00000000
be00: 1ea7be30 ffff8000 00000000 ffff8000 1ea39480 ffff8000 000c75f8 ffff8000
be20: 1ea7be20 ffff8000 1ea7be20 ffff8000 00000000 00000000 00085930 ffff8000
be40: 000b0f88 ffff8000 1ea48cc0 ffff8000 00000000 00000000 00000000 00000000
be60: 00000000 00000000 1ea48cc0 ffff8000 00000000 00000000 00000000 00000000
be80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bea0: 1ea7bea0 ffff8000 1ea7bea0 ffff8000 00000000 ffff8000 00000000 00000000
bec0: 1ea7bec0 ffff8000 1ea7bec0 ffff8000 00000000 00000000 00000000 00000000
bee0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf00: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf20: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf40: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf60: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bfa0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000005 00000000
bfe0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Call trace:
[<ffff8000000b2d94>] exit_creds+0x18/0x70
[<ffff80000009498c>] __put_task_struct+0x38/0xd4
[<ffff8000000b15ec>] kthread_stop+0xc0/0x130
[<ffff800000541ef8>] xenvif_disconnect+0x58/0xd0
[<ffff800000540980>] set_backend_state+0x134/0x278
[<ffff800000540bc8>] frontend_changed+0x8c/0xec
[<ffff800000480118>] xenbus_otherend_changed+0x9c/0xa4
[<ffff800000480ea0>] frontend_changed+0xc/0x18
[<ffff80000047f828>] xenwatch_thread+0xb0/0x140
[<ffff8000000b1060>] kthread+0xd8/0xf0
Code: f9000bf3 aa0003f3 f9422401 f9422000 (b9400021)
---[ end trace af11d521ee530da8 ]---

Regards,
--
Julien Grall
* Re: rcu_sched self-detect stall when disable vif device
From: David Vrabel @ 2015-01-30 16:04 UTC
To: Julien Grall, David Vrabel, Wei Liu; +Cc: Ian Campbell, xen-devel

On 28/01/15 17:27, Julien Grall wrote:
> So I applied this small patches:
>
> diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
> index 908e65e..9448c6c 100644
> --- a/drivers/net/xen-netback/netback.c
> +++ b/drivers/net/xen-netback/netback.c
> @@ -2110,7 +2110,7 @@ int xenvif_kthread_guest_rx(void *data)
>                  if (unlikely(vif->disabled && queue->id == 0)) {
>                          xenvif_carrier_off(vif);
>                          xenvif_rx_queue_purge(queue);
> -                        continue;
> +                        break;
>                  }
>
>                  if (!skb_queue_empty(&queue->rx_queue))

How about this?

8<------------------------------------------
xen-netback: stop the guest rx thread after a fatal error

After commit e9d8b2c2968499c1f96563e6522c56958d5a1d0d (xen-netback:
disable rogue vif in kthread context), a fatal (protocol) error would
leave the guest Rx thread spinning, wasting CPU time. Commit
ecf08d2dbb96d5a4b4bcc53a39e8d29cc8fef02e (xen-netback: reintroduce
guest Rx stall detection) made this even worse by removing a
cond_resched() from this path.

A fatal error is non-recoverable so just allow the guest Rx thread to
exit. This requires taking additional refs to the task so the thread
exiting early is handled safely.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>

diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 9259a73..037f74f 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -578,6 +578,7 @@ int xenvif_connect(struct xenvif_queue *queue, unsigned long tx_ring_ref,
                 goto err_rx_unbind;
         }
         queue->task = task;
+        get_task_struct(task);
 
         task = kthread_create(xenvif_dealloc_kthread,
                               (void *)queue, "%s-dealloc", queue->name);
@@ -634,6 +635,7 @@ void xenvif_disconnect(struct xenvif *vif)
 
                 if (queue->task) {
                         kthread_stop(queue->task);
+                        put_task_struct(queue->task);
                         queue->task = NULL;
                 }
 
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 908e65e..c8ce701 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -2109,8 +2109,7 @@ int xenvif_kthread_guest_rx(void *data)
                  */
                 if (unlikely(vif->disabled && queue->id == 0)) {
                         xenvif_carrier_off(vif);
-                        xenvif_rx_queue_purge(queue);
-                        continue;
+                        break;
                 }
 
                 if (!skb_queue_empty(&queue->rx_queue))
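The commit message says that letting the thread exit early "requires taking additional refs to the task". A minimal userspace sketch of that reasoning follows. It is a deliberately simplified model, not the kernel's task_struct handling: the `model_*` names are hypothetical, and it assumes the running thread holds the only reference to its task unless the creator takes one more.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted task; not the kernel's task_struct. */
struct model_task {
    int refs;       /* outstanding references */
    bool freed;     /* set when the last reference is dropped */
};

/* Models get_task_struct(): pin the task so it cannot go away. */
static void model_get_task(struct model_task *t)
{
    assert(!t->freed);
    t->refs++;
}

/* Models put_task_struct(): drop a reference, free on the last one. */
static void model_put_task(struct model_task *t)
{
    assert(t->refs > 0);
    if (--t->refs == 0)
        t->freed = true;
}

/* The thread function returned by itself, dropping the thread's own reference. */
static void model_thread_exit(struct model_task *t)
{
    model_put_task(t);
}

/* A later stop request is only safe if the task has not been freed yet. */
static bool model_stop_is_safe(const struct model_task *t)
{
    return !t->freed;
}
```

In this model, a task created with one reference is already freed when the thread exits on its own, so a later stop request would touch freed memory. If the creator calls `model_get_task()` right after creation, the task survives the early exit, the stop request is safe, and the matching `model_put_task()` afterwards releases it. That mirrors the get_task_struct()/put_task_struct() pair added around kthread_stop() in the patch.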
* Re: rcu_sched self-detect stall when disable vif device
From: Julien Grall @ 2015-02-02 13:54 UTC
To: David Vrabel, Wei Liu; +Cc: Ian Campbell, xen-devel

Hi David,

On 30/01/15 16:04, David Vrabel wrote:
> How about this?

This is working for me. Thanks!

> 8<------------------------------------------
> xen-netback: stop the guest rx thread after a fatal error
>
> After commit e9d8b2c2968499c1f96563e6522c56958d5a1d0d (xen-netback:
> disable rogue vif in kthread context), a fatal (protocol) error would
> leave the guest Rx thread spinning, wasting CPU time. Commit
> ecf08d2dbb96d5a4b4bcc53a39e8d29cc8fef02e (xen-netback: reintroduce
> guest Rx stall detection) made this even worse by removing a
> cond_resched() from this path.
>
> A fatal error is non-recoverable so just allow the guest Rx thread to
> exit. This requires taking additional refs to the task so the thread
> exiting early is handled safely.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>

Reported-by: Julien Grall <julien.grall@linaro.org>
Tested-by: Julien Grall <julien.grall@linaro.org>

Regards,
--
Julien Grall
* Re: rcu_sched self-detect stall when disable vif device
From: David Vrabel @ 2015-01-27 16:56 UTC
To: xen-devel

On 27/01/15 16:45, Wei Liu wrote:
> When you try to release a SKB, core network driver need to enter some
> RCU cirital region to clean up. dst_release for one, calls call_rcu.

This is RCU detecting a soft-lockup. You're either spinning on the
spinlock or the guest rx queue is corrupt and cannot be drained.

David