* kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
From: Jay Rolette @ 2015-02-11 1:33 UTC
To: Dev

Environment:
* DPDK 1.6.0r2
* Ubuntu 14.04 LTS
* kernel: 3.13.0-38-generic

When we exercise KNI heavily (transferring files across it, both sending
and receiving), I see a fair number of these kernel lockups:

kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]

Frequently I can't do much other than get a screenshot of the error message
coming across the console session once we get into this state, so debugging
what is happening is "interesting"...

I've seen this on multiple hardware platforms (so not box specific) as well
as virtual machines.

Are there any known issues with KNI that would cause kernel lockups in DPDK
1.6? Really hoping someone who knows KNI well can point me in the right
direction.

KNI in the 1.8 tree is significantly different, so it didn't look
straightforward to back-port it, although I do see a few changes that
might be relevant.

Any suggestions, pointers or other general help for tracking this down?

Thanks!
Jay

* Re: kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
From: Alejandro Lucero @ 2015-02-11 16:25 UTC
To: dev

Hi Jay,

I saw these errors when I worked in the HPC sector. They usually come with
a kernel dump for each core in the machine, so after some peering at the
kernel code you can work out how the soft lockup triggers. When I did that,
it was always an issue with the memory.

So on those occasions when you can still work on the machine after the
problem, look at the kernel messages. I will be glad to look at them.

* Re: kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
From: Jay Rolette @ 2015-02-16 13:32 UTC
To: Alejandro Lucero; +Cc: dev

Thanks Alejandro.

I'll look into the kernel dump if there is one. The system is extremely
brittle once this happens. Usually I can't do much other than power-cycle
the box. Anything requiring sudo just locks the terminal up, so little to
look at besides the messages on the console.

Matthew Hall also suggested a few things for me to look into, so I'm
following up on that as well.

Jay

* Re: kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
From: Jay Rolette @ 2015-02-16 16:33 UTC
To: Dev

Found the problem. No patch to submit since it's already fixed in later
versions of DPDK, but thought I'd follow up with the details since I'm sure
we aren't the only ones trying to use bleeding-edge versions of DPDK...

In kni_net_rx_normal(), it was calling netif_receive_skb() instead of
netif_rx(). The source for netif_receive_skb() points out that it should
only be called from soft-irq context, which isn't the case for KNI.

As is typical, it's a simple fix once you track it down.

Yao-Po Wang's fix: commit 41a6ebded53982107c1adfc0652d6cc1375a7db9.

Cheers,
Jay

* Re: kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
From: Matthew Hall @ 2015-02-17 1:00 UTC
To: Jay Rolette; +Cc: Dev

On Mon, Feb 16, 2015 at 10:33:52AM -0600, Jay Rolette wrote:
> In kni_net_rx_normal(), it was calling netif_receive_skb() instead of
> netif_rx(). The source for netif_receive_skb() points out that it should
> only be called from soft-irq context, which isn't the case for KNI.

For the uninitiated among us, what was the practical effect of the coding
error? Waiting forever for a lock which will never be available in IRQ
context, or causing unintended re-entrancy, or what?

Thanks,
Matthew.

* Re: kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
From: Jay Rolette @ 2015-02-17 15:57 UTC
To: Matthew Hall; +Cc: Dev

On Mon, Feb 16, 2015 at 7:00 PM, Matthew Hall <mhall-Hv3ogNYU3JfZZajBQzqCxQ@public.gmane.org> wrote:
> On Mon, Feb 16, 2015 at 10:33:52AM -0600, Jay Rolette wrote:
> > In kni_net_rx_normal(), it was calling netif_receive_skb() instead of
> > netif_rx(). The source for netif_receive_skb() points out that it should
> > only be called from soft-irq context, which isn't the case for KNI.
>
> For the uninitiated among us, what was the practical effect of the coding
> error? Waiting forever for a lock which will never be available in IRQ
> context, or causing unintended re-entrancy, or what?

Sadly, I'm not really one of the enlightened ones when it comes to the
Linux kernel. VxWorks? Sure. Linux kernel? Learning as required.

I didn't chase it down to the specific mechanism in this case. Unusual for
me, but this time I took the expedient route: I found a likely explanation,
plus Yao's fix on that same code with his explanation of a deadlock, and
went with it. It'll be a few more days before we've had enough run time on
it to absolutely confirm (not an easy bug to repro).

If I get hand-wavy about it, my assumption is that the requirement that
netif_receive_skb() be called in soft-irq context means it doesn't expect
to be pre-empted or reentrant. When you call netif_rx() instead, it puts
the skb on the backlog and it gets processed from there. Part of that code
disables interrupts during part of the processing.

Not sure what else is coming in and actually deadlocking things. Honestly,
I don't understand enough details of how everything works in the Linux
network stack yet. I've done tons of work on the network path of stack-less
systems and a bit of work in device drivers, but I have only touched the
surface of the internals of the Linux network stack. The meat of my product
avoids that like the plague because it is too slow.

Sorry, lots of words but not much light being shed this time...

Jay

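The difference between the two delivery styles discussed in this thread can be sketched with a small userspace model in C. This is only an illustration under stated assumptions, not kernel or DPDK code: `deliver_now()` stands in for the netif_receive_skb() style (process the packet immediately, in the caller's context), `deliver_deferred()` stands in for the netif_rx() style (append to a backlog and return), and `softirq_drain()` stands in for the later pass that empties the backlog. All of these names, along with `struct pkt`, `struct backlog`, and `current_ctx`, are hypothetical. The model does not reproduce the lockup and says nothing about the specific mechanism behind it, which the thread leaves open.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */

enum ctx { CTX_PROCESS, CTX_SOFTIRQ };

struct pkt {
    int id;
    int handled;          /* set to 1 once the model "stack" has seen it */
    enum ctx handled_in;  /* context in which it was handled */
    struct pkt *next;     /* backlog link */
};

struct backlog {
    struct pkt *head;
    struct pkt *tail;
    size_t len;
};

/* Context the model is currently "running" in. */
enum ctx current_ctx = CTX_PROCESS;

/* Stand-in for protocol processing: records where it ran. */
void stack_input(struct pkt *p)
{
    p->handled = 1;
    p->handled_in = current_ctx;
}

/* Immediate style: run the stack right now, in the caller's context. */
void deliver_now(struct pkt *p)
{
    stack_input(p);
}

/* Deferred style: only append to the backlog; no processing happens here. */
void deliver_deferred(struct backlog *q, struct pkt *p)
{
    p->next = NULL;
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    q->len++;
}

/* Later pass that drains the backlog in "softirq" context.
 * Returns the number of packets processed. */
size_t softirq_drain(struct backlog *q)
{
    enum ctx saved = current_ctx;
    size_t n = 0;

    current_ctx = CTX_SOFTIRQ;
    while (q->head) {
        struct pkt *p = q->head;

        q->head = p->next;
        p->next = NULL;
        stack_input(p);
        q->len--;
        n++;
    }
    q->tail = NULL;
    assert(q->len == 0);
    current_ctx = saved;
    return n;
}
```

In this model, a packet passed to `deliver_now()` from process context is handled in process context, while a packet passed to `deliver_deferred()` stays unhandled until `softirq_drain()` runs and is then handled in the softirq context. That matches the hand-wavy description above: the caller only queues the packet, and the stack processes it later from the context it expects.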