From mboxrd@z Thu Jan 1 00:00:00 1970
From: jerry
Subject: Re: netback BUG_ON when using copy_skb=1
Date: Thu, 17 Oct 2013 15:41:51 +0800
Message-ID: <525F94BF.6050500@huawei.com>
References: <525E125B.80100@huawei.com> <525E903602000078000FB6DF@nat28.tlf.novell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Received: from mail6.bemta3.messagelabs.com ([195.245.230.39])
 by lists.xen.org with esmtp (Exim 4.72) id 1VWiEX-0004mq-0U
 for xen-devel@lists.xenproject.org; Thu, 17 Oct 2013 07:43:01 +0000
In-Reply-To: <525E903602000078000FB6DF@nat28.tlf.novell.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Jan Beulich
Cc: xen-devel, Wei Liu, stefano.stabellini@eu.citrix.com
List-Id: xen-devel@lists.xenproject.org

Hi Jan,

Thanks for your reply.

Yes, as you assumed, I am using the SLE11 kernel 3.0.58, which is not
up to date. I found a related patch named xen-netback-generalize, which
was committed on Aug 7 and has been applied to the SLE11 kernel 3.0.98.
The BUG_ON(netbk->mmap_pages[idx] != page) has been removed by this
patch.

However, there may still be a concurrency problem in my test. If the
page replacement in copy_pending_req() happens after
netif_get_page_ext() in netbk_gop_frag(), copy_gop->flags is wrongly
marked with GNTCOPY_source_gref. By that point the page in the skb has
been replaced with Dom0 local memory, so the later
HYPERVISOR_multicall() with GNTTABOP_copy in netbk_rx_actions() fails.
The message shown is:

(XEN) grant_table.c:305:d0 Bad flags (0) or dom (0). (expected dom 0)

Could you share your opinion on this?

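The ordering I suspect can be modelled in user space roughly as below.
This is only a sketch under my own assumptions: page_model,
mmap_pages_model, choose_src() and stale_after_replace() are
hypothetical names made up for illustration, not the netback code, and
the two threads are replayed sequentially instead of running
concurrently.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a page: foreign means grant-mapped from a guest. */
struct page_model { bool foreign; };

#define SLOTS 4
/* Hypothetical stand-in for the per-netback table of mapped pages. */
static struct page_model *mmap_pages_model[SLOTS];

enum copy_src { SRC_GREF, SRC_LOCAL };

/* Decide how a grant copy would address the page currently in slot idx. */
static enum copy_src choose_src(size_t idx)
{
    return mmap_pages_model[idx]->foreign ? SRC_GREF : SRC_LOCAL;
}

/*
 * Replay the interleaving: the receive side checks the page, the send
 * side optionally swaps in a local copy, then the copy is issued.
 * Returns true when the decision taken at check time no longer matches
 * what the slot holds at issue time.
 */
static bool stale_after_replace(bool replace_between)
{
    static struct page_model foreign_pg = { .foreign = true };
    static struct page_model local_pg = { .foreign = false };
    const size_t idx = 0;

    mmap_pages_model[idx] = &foreign_pg;

    /* Receive side: classify the page (the "foreign?" check). */
    enum copy_src chosen = choose_src(idx);

    /* Send side: delayed copy replaces the slot with local memory. */
    if (replace_between)
        mmap_pages_model[idx] = &local_pg;

    /* Receive side: state of the slot when the copy is actually issued. */
    enum copy_src actual = choose_src(idx);

    return chosen != actual;
}
```

In this model stale_after_replace(true) reports a mismatch and
stale_after_replace(false) does not, which corresponds to the check and
the use needing to happen with no replacement in between.
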
Regards,
Jerry

On 2013/10/16 19:10, Jan Beulich wrote:
>>>> On 16.10.13 at 06:13, jerry wrote:
>> Hi Wei Liu,
>>
>> I am doing some network performance on Xen4.1.2 and kernel 3.0, and get a
>> crash with BUG_ON(netbk->mmap_pages[idx] != page) in netbk_gop_frag()
>> accidentally.
>>
>> By analyzing the module drivers/xen/netback,
>
> You aren't looking at the upstream driver, are you? If so, Wei is
> very likely the wrong addressee.
>
> Assuming that you instead talk of the SLE11 kernel, I can only
> point out that a problem in that code was found and fixed a
> couple of months ago (resulting in the BUG_ON() you quoted not
> being there anymore), so you're simply not looking at up-to-date
> code.
>
> Jan
>
>> I think the reason is as
>> follows when sending packets from VM1 to VM2:
>> 1) The two netback thread(the first for VM1 sending, second for VM2
>> receiving) run concurrently.
>> 2) In first netback thread, it will do delayed copy from a foreign granted
>> page to local memory when some outstanding packets have been pending too
>> long( above half of one HZ).
>> Then netbk->mmap_pages[idx] will be replaced with new allocated page.
>> 3) If the packets are forwarded to VM2 by virtual switch, netbk_gop_frag()
>> will be called in second netback thread.
>> And that function will judge whether the pages in skb frags[] is foreign
>> in order to make sure how to do grant copy.
>> 4) If the page replacing was done after the page foreign judge in
>> netbk_gop_frag(), the BUG will be invoked because the page from skb frags[]
>> are different with mmap_pages[idx].
>>
>> I tried to using spin_lock to protect the page accessing, but no appropriate
>> solutions was found.
>> How to fix this problem? Would you like to share some opinions?
>>
>> In addition, I have tried to turn off copy_skb. Then the vif netdevice may
>> not be released after shutting down VM,
>> that's because outstanding packets hold the reference count of the device
>> too long for some unknown reason.
>> The reason may be that the NIC does not release packets after DMA.
>> Does anyone have met such problems? Thanks.
>>
>> Best regards,
>> Jerry
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel