From mboxrd@z Thu Jan 1 00:00:00 1970
From: Roger Pau Monné
Subject: Re: Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles
Date: Thu, 27 Feb 2014 16:36:52 +0100
Message-ID: <530F5B94.3050308@citrix.com>
References: <1772884781.20140218222513@eikelenboom.it> <5305CFC6.3080502@oracle.com> <587238484.20140220121842@eikelenboom.it> <5306F2E8.5090509@oracle.com> <824074181.20140226101442@eikelenboom.it> <59358334.20140226161123@eikelenboom.it> <20140227141812.GE16241@zion.uk.xensource.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <20140227141812.GE16241@zion.uk.xensource.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Wei Liu, Sander Eikelenboom
Cc: annie li, Paul Durrant, Zoltan Kiss, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On 27/02/14 15:18, Wei Liu wrote:
> On Wed, Feb 26, 2014 at 04:11:23PM +0100, Sander Eikelenboom wrote:
>>
>> Wednesday, February 26, 2014, 10:14:42 AM, you wrote:
>>
>>
>>> Friday, February 21, 2014, 7:32:08 AM, you wrote:
>>
>>
>>>> On 2014/2/20 19:18, Sander Eikelenboom wrote:
>>>>> Thursday, February 20, 2014, 10:49:58 AM, you wrote:
>>>>>
>>>>>
>>>>>> On 2014/2/19 5:25, Sander Eikelenboom wrote:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I'm currently having some network troubles with Xen and recent linux kernels.
>>>>>>>
>>>>>>> - When running with a 3.14-rc3 kernel in dom0 and a 3.13 kernel in domU
>>>>>>> I get what seems to be described in this thread: http://www.spinics.net/lists/netdev/msg242953.html
>>>>>>>
>>>>>>> In the guest:
>>>>>>> [57539.859584] net eth0: rx->offset: 0, size: 4294967295
>>>>>>> [57539.859599] net eth0: rx->offset: 0, size: 4294967295
>>>>>>> [57539.859605] net eth0: rx->offset: 0, size: 4294967295
>>>>>>> [57539.859610] net eth0: Need more slots
>>>>>>> [58157.675939] net eth0: Need more slots
>>>>>>> [58725.344712] net eth0: Need more slots
>>>>>>> [61815.849180] net eth0: rx->offset: 0, size: 4294967295
>>>>>>> [61815.849205] net eth0: rx->offset: 0, size: 4294967295
>>>>>>> [61815.849216] net eth0: rx->offset: 0, size: 4294967295
>>>>>>> [61815.849225] net eth0: Need more slots
>>>>>> This issue is familiar... and I thought it get fixed.
>>>>>> From original analysis for similar issue I hit before, the root cause
>>>>>> is netback still creates response when the ring is full. I remember
>>>>>> larger MTU can trigger this issue before, what is the MTU size?
>>>>> In dom0 both for the physical nics and the guest vif's MTU=1500
>>>>> In domU the eth0 also has MTU=1500.
>>>>>
>>>>> So it's not jumbo frames .. just everywhere the same plain defaults ..
>>>>>
>>>>> With the patch from Wei that solves the other issue, i'm still seeing the Need more slots issue on 3.14-rc3+wei's patch now.
>>>>> I have extended the "need more slots warn" to also print the cons, slots, max, rx->offset, size, hope that gives some more insight.
>>>>> But it indeed is the VM were i had similar issues before, the primary thing this VM does is 2 simultaneous rsync's (one push one pull) with some gigabytes of data.
>>>>>
>>>>> This time it was also acompanied by a "grant_table.c:1857:d0 Bad grant reference " as seen below, don't know if it's a cause or a effect though.
>>
>>>> The log "grant_table.c:1857:d0 Bad grant reference " was also seen before.
>>>> Probably the response overlaps the request and grantcopy return error
>>>> when using wrong grant reference, Netback returns resp->status with
>>>> XEN_NETIF_RSP_ERROR(-1) which is 4294967295 printed above from frontend.
>>>> Would it be possible to print log in xenvif_rx_action of netback to see
>>>> whether something wrong with max slots and used slots?
>>
>>>> Thanks
>>>> Annie
>>
>>> Looking more closely it are perhaps 2 different issues ... the bad grant references do not happen
>>> at the same time as the netfront messages in the guest.
>>
>>> I added some debugpatches to the kernel netback, netfront and xen granttable code (see below)
>>> One of the things was to simplify the code for the debug key to print the granttables, the present code
>>> takes too long to execute and brings down the box due to stalls and NMI's. So it now only prints
>>> the nr of entries per domain.
>>
>>
>>> Issue 1: grant_table.c:1858:d0 Bad grant reference
>>
>>> After running the box for just one night (with 15 VM's) i get these mentions of "Bad grant reference".
>>> The maptrack also seems to increase quite fast and the number of entries seem to have gone up quite fast as well.
>>
>>> Most domains have just one disk(blkfront/blkback) and one nic, a few have a second disk.
>>> The blk drivers use persistent grants so i would assume it would reuse those and not increase it (by much).
>>
>
> As far as I can tell netfront has a pool of grant references and it
> will BUG_ON() if there's no grefs in the pool when you request one.
> Since your DomU didn't crash so I suspect the book-keeping is still
> intact.
>
>>> Domain 1 seems to have increased it's nr_grant_entries from 2048 to 3072 somewhere this night.
>>> Domain 7 is the domain that happens to give the netfront messages.
>>
>>> I also don't get why it is reporting the "Bad grant reference" for domain 0, which seems to have 0 active entries ..
>>> Also is this amount of grant entries "normal" ? or could it be a leak somewhere ?
>>
>
> I suppose Dom0 expanding its maptrack is normal. I see as well when I
> increase the number of domains. But if it keeps increasing while the
> number of DomUs stay the same then it is not normal.

blkfront/blkback will allocate persistent grants on demand, so it's not
strange to see the number of grants increasing while the domain is
running (although it should reach a stable state at some point).

Roger.
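
To illustrate the on-demand behaviour described above, here is a toy model in Python. It is only a sketch, not the blkfront/blkback code: the class, the function and the in-flight numbers are invented for illustration, and it assumes grants are never revoked once created. A grant is created only when no idle one is available, so the total rises with the peak number of segments in flight and then stays flat.

```python
class PersistentGrantPool:
    """Toy model (hypothetical): grants are created lazily and kept for reuse."""

    def __init__(self):
        self.free = []       # grants that exist but are not in use right now
        self.allocated = 0   # total grants created so far (never revoked here)

    def get(self):
        # Reuse an idle grant if there is one; otherwise create a new one.
        if self.free:
            return self.free.pop()
        self.allocated += 1
        return self.allocated  # stand-in for a grant reference

    def put(self, gref):
        # I/O finished: keep the grant around instead of revoking it.
        self.free.append(gref)


def run(pool, in_flight_per_round):
    """Return the total number of grants after each round of I/O."""
    history = []
    for n in in_flight_per_round:
        grefs = [pool.get() for _ in range(n)]  # n segments in flight at once
        for g in grefs:
            pool.put(g)
        history.append(pool.allocated)
    return history


# Made-up workload: the number of segments in flight varies per round.
history = run(PersistentGrantPool(), [4, 8, 8, 16, 12, 16, 10])
print(history)
```

In this model the count only grows when a round needs more grants than any earlier round did. A count that keeps climbing under a steady workload would therefore point at something other than on-demand allocation.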