From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Vrabel
Subject: Re: rcu_sched self-detect stall when disable vif device
Date: Wed, 28 Jan 2015 17:06:26 +0000
Message-ID: <54C91712.3020806@citrix.com>
References: <54C7B6E8.9080106@linaro.org> <20150127164539.GJ24026@zion.uk.xensource.com> <54C7C131.9030502@linaro.org> <20150127165312.GK24026@zion.uk.xensource.com> <54C91229.8090104@linaro.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <54C91229.8090104@linaro.org>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Julien Grall, Wei Liu
Cc: Ian Campbell, xen-devel
List-Id: xen-devel@lists.xenproject.org

On 28/01/15 16:45, Julien Grall wrote:
> On 27/01/15 16:53, Wei Liu wrote:
>> On Tue, Jan 27, 2015 at 04:47:45PM +0000, Julien Grall wrote:
>>> On 27/01/15 16:45, Wei Liu wrote:
>>>> On Tue, Jan 27, 2015 at 04:03:52PM +0000, Julien Grall wrote:
>>>>> Hi,
>>>>>
>>>>> While I'm working on support for 64K page in netfront, I got
>>>>> an rcu_sched self-detect message. It happens when netback is
>>>>> disabling the vif device due to an error.
>>>>>
>>>>> I'm using Linux 3.19-rc5 on seattle (ARM64). Any idea why
>>>>> the processor is stuck in xenvif_rx_queue_purge?
>>>>>
>>>>
>>>> When you try to release a SKB, core network driver need to enter some
>>>> RCU critical region to clean up. dst_release for one, calls call_rcu.
>>>
>>> But this message shouldn't happen in normal condition or because of
>>> netfront. Right?
>>>
>>
>> Never saw report like this before, even in the case that netfront is
>> buggy.
>
> This is only happening when preemption is not enabled (i.e.
> CONFIG_PREEMPT_NONE in the config file) in the backend kernel.
>
> When the vif is disabled, the loop in xenvif_kthread_guest_rx turned
> into an infinite loop. In my case, the code executed looks like:
>
>
> 1. for (;;) {
> 2.         xenvif_wait_for_rx_work(queue);
> 3.
> 4.         if (kthread_should_stop())
> 5.                 break;
> 6.
> 7.         if (unlikely(vif->disabled && queue->id == 0)) {
> 8.                 xenvif_carrier_off(vif);
> 9.                 xenvif_rx_queue_purge(queue);
> 10.                continue;
> 11.        }
> 12. }
>
> The wait on line 2 will return directly because the vif is disabled
> (see xenvif_have_rx_work)
>
> We are on queue 0, so the condition on line 7 is true. Therefore we will
> loop on line 10. And so on...
>
> On platform where preemption is not enabled, this thread will never
> yield/give the hand to another thread (unless the domain is destroyed).

I'm not sure why we have a continue in the vif->disabled case and not
just a break.

Can you try that?

David
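
To make the difference between continue and break in the vif->disabled
case concrete, here is a minimal user-space sketch of the loop quoted
above. It is not netback code: struct sim_vif, sim_wait_for_rx_work(),
sim_guest_rx_loop() and the iteration cap are all hypothetical stand-ins
invented for illustration, and the kthread_should_stop() check is left
out. The only behaviour modelled is the one described in the thread: the
wait returns immediately while the vif is disabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the netback vif state; not the real struct. */
struct sim_vif {
    bool disabled;    /* models vif->disabled */
    int purge_calls;  /* how often the purge step ran */
};

/* What the loop does once it sees a disabled vif. */
enum disabled_action { ACTION_CONTINUE, ACTION_BREAK };

/*
 * Models the wait on line 2 of the quoted loop: it returns at once when
 * the vif is disabled. Returns true if the thread would have slept.
 */
static bool sim_wait_for_rx_work(const struct sim_vif *vif)
{
    return !vif->disabled;
}

/*
 * Runs the simulated loop for at most max_iterations passes (the cap only
 * exists so the simulation terminates). Returns the number of passes made
 * without ever sleeping or yielding.
 */
int sim_guest_rx_loop(struct sim_vif *vif, enum disabled_action action,
                      int max_iterations)
{
    int busy_passes = 0;

    assert(max_iterations > 0);

    for (int i = 0; i < max_iterations; i++) {
        if (sim_wait_for_rx_work(vif))
            return busy_passes;  /* thread would sleep; nothing to model */

        busy_passes++;

        if (vif->disabled) {
            vif->purge_calls++;  /* models the carrier-off and purge steps */
            if (action == ACTION_BREAK)
                break;           /* leave the loop after one pass */
            continue;            /* go straight back to the wait */
        }
    }
    return busy_passes;
}
```

With a disabled vif, ACTION_CONTINUE uses up every allowed pass without
sleeping, which is the spin that a kernel without preemption cannot
interrupt. ACTION_BREAK stops after a single pass. Whether leaving the
loop is otherwise safe for the real kthread is not something this sketch
can show.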