From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:52389) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1dCdMk-00046r-L6 for qemu-devel@nongnu.org; Sun, 21 May 2017 22:46:39 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1dCdMh-0000Mg-JQ for qemu-devel@nongnu.org; Sun, 21 May 2017 22:46:38 -0400 Received: from mx1.redhat.com ([209.132.183.28]:38126) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1dCdMh-0000MT-AN for qemu-devel@nongnu.org; Sun, 21 May 2017 22:46:35 -0400 References: <1495021572-20852-1-git-send-email-sameeh@daynix.com> From: Jason Wang Message-ID: <39c5030d-6de6-a3b9-355d-2abeaacc3317@redhat.com> Date: Mon, 22 May 2017 10:46:29 +0800 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Sameeh Jubran Cc: Dmitry Fleytman , Yan Vugenfirer , QEMU Developers On 2017=E5=B9=B405=E6=9C=8819=E6=97=A5 22:04, Sameeh Jubran wrote: > On Fri, May 19, 2017 at 9:25 AM, Jason Wang wrote= : > >> >> On 2017=E5=B9=B405=E6=9C=8817=E6=97=A5 19:46, Sameeh Jubran wrote: >> >>> The bug was caused by the "receive overrun" (bit #6 of the ICR regist= er) >>> interrupt >>> which would be triggered post migration in a heavy traffic environmen= t. >>> Even though the >>> "receive overrun" bit (#6) is masked out by the IMS register (refer t= o >>> the log below) >>> the driver still receives an interrupt as the "receive overrun" bit (= #6) >>> causes the >>> "Other" - bit #24 of the ICR register - bit to be set as documented >>> below. The driver >>> handles the interrupt and clears the "Other" bit (#24) but doesn't cl= ear >>> the >>> "receive overrun" bit (#6) which leads to an infinite loop. Apparentl= y >>> the Windows >>> driver expects that the "receive overrun" bit and other ones - docume= nted >>> below - to be >>> cleared when the "Other" bit (#24) is cleared. >>> >>> So to sum that up: >>> 1. Bit #6 of the ICR register is set by heavy traffic >>> 2. As a results of setting bit #6, bit #24 is set >>> 3. The driver receives an interrupt for bit 24 (it doesn't receieve a= n >>> interrupt for bit #6 as it is masked out by IMS) >>> 4. The driver handles and clears the interrupt of bit #24 >>> 5. Bit #6 is still set. >>> 6. 2 happens all over again >>> >>> The Interrupt Cause Read - ICR register: >>> >>> The ICR has the "Other" bit - bit #24 - that is set when one or more = of >>> the following >>> ICR register's bits are set: >>> >>> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit = #17, >>> MNG - bit #18 >>> >>> Log sample of the storm: >>> >>> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: >>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004) >>> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x= 0 >>> (ICR: 0x815000c2, IMS: 0xa00004) >>> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x= 0 >>> (ICR: 0x815000c2, IMS: 0xa00004) >>> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x= 0 >>> (ICR: 0x815000c2, IMS: 0xa00004) >>> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x= 0 >>> (ICR: 0x815000c2, IMS: 0xa00004) >>> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x= 0 >>> (ICR: 0x815000c2, IMS: 0xa00004) >>> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x= 0 >>> (ICR: 0x815000c2, IMS: 0xa00004) >>> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: >>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004) >>> >>> This commit solves: >>> https://bugzilla.redhat.com/show_bug.cgi?id=3D1447935 >>> https://bugzilla.redhat.com/show_bug.cgi?id=3D1449490 >>> >>> Signed-off-by: Sameeh Jubran >>> --- >>> hw/net/e1000e_core.c | 7 +++++-- >>> 1 file changed, 5 insertions(+), 2 deletions(-) >>> >>> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c >>> index 28c5be1..8174b53 100644 >>> --- a/hw/net/e1000e_core.c >>> +++ b/hw/net/e1000e_core.c >>> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, >>> uint32_t val) >>> static void >>> e1000e_set_icr(E1000ECore *core, int index, uint32_t val) >>> { >>> + uint32_t icr =3D 0; >>> if ((core->mac[ICR] & E1000_ICR_ASSERTED) && >>> (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) { >>> trace_e1000e_irq_icr_process_iame(); >>> e1000e_clear_ims_bits(core, core->mac[IAM]); >>> } >>> - trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR= ] & >>> ~val); >>> - core->mac[ICR] &=3D ~val; >>> + icr =3D core->mac[ICR] & ~val; >>> + icr =3D (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES= ) : >>> icr; >>> + trace_e1000e_irq_icr_write(val, core->mac[ICR], icr); >>> + core->mac[ICR] =3D icr; >>> e1000e_update_interrupt_state(core); >>> } >>> >>> >> Thanks for the patch. >> >> So this is an undocumented behavior, we must be careful on this. Sever= al >> question below: >> >> - have you verified this on real hardware? >> > No I haven't Any chance to verify the behavior on real hardware? > >> - is MSIX enabled in this case? >> > Yes it is, I have tested the patch with msi disabled too. Does the problem exist in no msi case. If not, is this better to hack=20 EIAC behavior? > >> - according to the steps you've summed up above, it's not specific to >> migration? >> > True Then we probably need tweak the title. Thanks. > >> Thanks >> > >