From mboxrd@z Thu Jan  1 00:00:00 1970
From: =?ISO-8859-1?Q?Roger_Pau_Monn=E9?= <roger.pau@citrix.com>
Subject: Re: Xen-unstable Linux 3.14-rc3 and 3.13 Network
	troubles
Date: Thu, 27 Feb 2014 16:36:52 +0100
Message-ID: <530F5B94.3050308@citrix.com>
References: <1772884781.20140218222513@eikelenboom.it>	<5305CFC6.3080502@oracle.com>	<587238484.20140220121842@eikelenboom.it>	<5306F2E8.5090509@oracle.com>	<824074181.20140226101442@eikelenboom.it>	<59358334.20140226161123@eikelenboom.it>
	<20140227141812.GE16241@zion.uk.xensource.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <20140227141812.GE16241@zion.uk.xensource.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Wei Liu <wei.liu2@citrix.com>, Sander Eikelenboom <linux@eikelenboom.it>
Cc: annie li <annie.li@oracle.com>, Paul Durrant <Paul.Durrant@citrix.com>, Zoltan Kiss <zoltan.kiss@citrix.com>, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On 27/02/14 15:18, Wei Liu wrote:
> On Wed, Feb 26, 2014 at 04:11:23PM +0100, Sander Eikelenboom wrote:
>>
>> Wednesday, February 26, 2014, 10:14:42 AM, you wrote:
>>
>>
>>> Friday, February 21, 2014, 7:32:08 AM, you wrote:
>>
>>
>>>> On 2014/2/20 19:18, Sander Eikelenboom wrote:
>>>>> Thursday, February 20, 2014, 10:49:58 AM, you wrote:
>>>>>
>>>>>
>>>>>> On 2014/2/19 5:25, Sander Eikelenboom wrote:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I'm currently having some network troubles with Xen and recent linux kernels.
>>>>>>>
>>>>>>> - When running with a 3.14-rc3 kernel in dom0 and a 3.13 kernel in domU
>>>>>>>     I get what seems to be described in this thread: http://www.spinics.net/lists/netdev/msg242953.html
>>>>>>>
>>>>>>>     In the guest:
>>>>>>>     [57539.859584] net eth0: rx->offset: 0, size: 4294967295
>>>>>>>     [57539.859599] net eth0: rx->offset: 0, size: 4294967295
>>>>>>>     [57539.859605] net eth0: rx->offset: 0, size: 4294967295
>>>>>>>     [57539.859610] net eth0: Need more slots
>>>>>>>     [58157.675939] net eth0: Need more slots
>>>>>>>     [58725.344712] net eth0: Need more slots
>>>>>>>     [61815.849180] net eth0: rx->offset: 0, size: 4294967295
>>>>>>>     [61815.849205] net eth0: rx->offset: 0, size: 4294967295
>>>>>>>     [61815.849216] net eth0: rx->offset: 0, size: 4294967295
>>>>>>>     [61815.849225] net eth0: Need more slots
>>>>>> This issue is familiar... and I thought it got fixed.
>>>>>>   From the original analysis of a similar issue I hit before, the root cause
>>>>>> is that netback still creates a response when the ring is full. I remember
>>>>>> that a larger MTU could trigger this issue before; what is the MTU size?
>>>>> In dom0 both for the physical nics and the guest vif's MTU=1500
>>>>> In domU the eth0 also has MTU=1500.
>>>>>
>>>>> So it's not jumbo frames .. just everywhere the same plain defaults ..
>>>>>
>>>>> With the patch from Wei that solves the other issue, I'm still seeing the "Need more slots" issue on 3.14-rc3 + Wei's patch.
>>>>> I have extended the "Need more slots" warning to also print cons, slots, max, rx->offset and size; I hope that gives some more insight.
>>>>> It is indeed the VM where I had similar issues before; the primary thing this VM does is 2 simultaneous rsyncs (one push, one pull) with some gigabytes of data.
>>>>>
>>>>> This time it was also accompanied by a "grant_table.c:1857:d0 Bad grant reference " as seen below; I don't know whether it's a cause or an effect, though.
>>
>>>> The log "grant_table.c:1857:d0 Bad grant reference " was also seen before.
>>>> Probably the response overlaps the request and the grant copy returns an error
>>>> when using the wrong grant reference. Netback then returns resp->status with
>>>> XEN_NETIF_RSP_ERROR (-1), which is the 4294967295 printed above by the frontend.
>>>> Would it be possible to add logging in xenvif_rx_action of netback to see
>>>> whether something is wrong with max slots and used slots?
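
To make the -1 versus 4294967295 point concrete, here is a minimal sketch of
the conversion. It assumes the status field is a signed 16-bit value that ends
up printed with an unsigned format; the helper names are made up for
illustration, and the macro is defined locally using the value quoted above
rather than taken from the real header.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Local stand-in using the value quoted in the thread (-1). */
#define XEN_NETIF_RSP_ERROR (-1)

/* Hypothetical helper: the value a "%u" debug print would show for a
 * signed 16-bit status. The conversion wraps modulo 2^32 on platforms
 * where unsigned int is 32 bits wide. */
static unsigned int status_as_printed(int16_t status)
{
    return (unsigned int)status;
}

/* Hypothetical demo: prints the same "size" value seen in the guest log. */
static void show_error_status(void)
{
    unsigned int printed = status_as_printed(XEN_NETIF_RSP_ERROR);

    /* Holds where unsigned int is 32 bits. */
    assert(printed == UINT32_MAX);
    printf("size: %u\n", printed);
}
```

So a "size: 4294967295" line in the guest is most likely an error status rather
than a real length.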
>>
>>>> Thanks
>>>> Annie
>>
>>> Looking more closely, these are perhaps 2 different issues ... the bad grant references do not happen
>>> at the same time as the netfront messages in the guest.
>>
>>> I added some debug patches to the kernel netback, netfront and Xen grant table code (see below).
>>> One of the changes simplifies the code for the debug key that prints the grant tables; the present code
>>> takes too long to execute and brings down the box due to stalls and NMIs. So it now only prints
>>> the number of entries per domain.
>>
>>
>>> Issue 1: grant_table.c:1858:d0 Bad grant reference
>>
>>> After running the box for just one night (with 15 VMs) I get these mentions of "Bad grant reference".
>>> The maptrack also seems to grow quite fast, and the number of entries seems to have gone up quite fast as well.
>>
>>> Most domains have just one disk (blkfront/blkback) and one NIC; a few have a second disk.
>>> The blk drivers use persistent grants, so I would assume they reuse those and the count does not increase (by much).
>>
> 
> As far as I can tell, netfront has a pool of grant references and it
> will BUG_ON() if there are no grefs in the pool when you request one.
> Since your DomU didn't crash, I suspect the bookkeeping is still
> intact.
> 
>>> Domain 1 seems to have increased its nr_grant_entries from 2048 to 3072 at some point during the night.
>>> Domain 7 is the domain that happens to give the netfront messages.
>>
>>> I also don't get why it is reporting the "Bad grant reference" for domain 0, which seems to have 0 active entries.
>>> Also, is this number of grant entries "normal", or could there be a leak somewhere?
>>
> 
> I suppose Dom0 expanding its maptrack is normal. I see that as well when I
> increase the number of domains. But if it keeps increasing while the
> number of DomUs stays the same, then it is not normal.

blkfront/blkback will allocate persistent grants on demand, so it's not
strange to see the number of grants increasing while the domain is
running (although it should reach a stable state at some point).
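
As a rough illustration of why the count grows and then levels off, here is a
toy model of on-demand allocation with reuse. This is only a sketch under
assumptions: the structure, the function names and the MAX_GRANTS cap are all
hypothetical and are not the actual blkfront/blkback code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GRANTS 8 /* hypothetical cap, not a real blkfront constant */

struct grant_pool {
    int in_use[MAX_GRANTS]; /* 1 while a request holds the grant */
    size_t allocated;       /* grants created so far; never shrinks */
};

/* Reuse a free persistent grant if one exists, otherwise allocate a new
 * one. Returns the grant index, or -1 if the cap has been reached. */
static int pool_get(struct grant_pool *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < p->allocated; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = 1;
            return (int)i;
        }
    }
    if (p->allocated == MAX_GRANTS)
        return -1;
    size_t idx = p->allocated;
    p->in_use[idx] = 1;
    p->allocated++;
    return (int)idx;
}

/* Release a grant for reuse; it stays allocated (persistent). */
static void pool_put(struct grant_pool *p, int idx)
{
    assert(p != NULL && idx >= 0 && (size_t)idx < p->allocated);
    p->in_use[idx] = 0;
}
```

In this model the allocated count only rises when more grants are needed at
once than ever before, so it settles at the peak number in flight and stays
there under a steady workload.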

Roger.