[Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
@ 2017-05-17 11:46 Sameeh Jubran
  2017-05-17 12:14 ` Dmitry Fleytman
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Sameeh Jubran @ 2017-05-17 11:46 UTC (permalink / raw)
  To: qemu-devel, Jason Wang, Dmitry Fleytman; +Cc: Yan Vugenfirer

The bug was caused by the "receive overrun" (bit #6 of the ICR register) interrupt
which would be triggered post migration in a heavy traffic environment. Even though the
"receive overrun" bit (#6) is masked out by the IMS register (refer to the log below)
the driver still receives an interrupt as the "receive overrun" bit (#6) causes the
"Other" - bit #24 of the ICR register - bit to be set as documented below. The driver
handles the interrupt and clears the "Other" bit (#24) but doesn't clear the
"receive overrun" bit (#6) which leads to an infinite loop. Apparently the Windows
driver expects that the "receive overrun" bit and other ones - documented below - to be
cleared when the "Other" bit (#24) is cleared.

So to sum that up:
1. Bit #6 of the ICR register is set by heavy traffic
2. As a results of setting bit #6, bit #24 is set
3. The driver receives an interrupt for bit 24 (it doesn't receieve an interrupt for bit #6 as it is masked out by IMS)
4. The driver handles and clears the interrupt of bit #24
5. Bit #6 is still set.
6. 2 happens all over again

The Interrupt Cause Read - ICR register:

The ICR has the "Other" bit - bit #24 - that is set when one or more of the following
ICR register's bits are set:

LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17, MNG - bit #18

Log sample of the storm:

27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)

This commit solves:
https://bugzilla.redhat.com/show_bug.cgi?id=1447935
https://bugzilla.redhat.com/show_bug.cgi?id=1449490

Signed-off-by: Sameeh Jubran <sjubran@redhat.com>
---
 hw/net/e1000e_core.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
index 28c5be1..8174b53 100644
--- a/hw/net/e1000e_core.c
+++ b/hw/net/e1000e_core.c
@@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, uint32_t val)
 static void
 e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
 {
+    uint32_t icr = 0;
     if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
         (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
         trace_e1000e_irq_icr_process_iame();
         e1000e_clear_ims_bits(core, core->mac[IAM]);
     }

-    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] & ~val);
-    core->mac[ICR] &= ~val;
+    icr = core->mac[ICR] & ~val;
+    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) : icr;
+    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
+    core->mac[ICR] = icr;
     e1000e_update_interrupt_state(core);
 }

-- 
2.8.1.185.gdc0db2c

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
  2017-05-17 11:46 [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration Sameeh Jubran
@ 2017-05-17 12:14 ` Dmitry Fleytman
  2017-05-18 10:04 ` Dr. David Alan Gilbert
  2017-05-19  6:25 ` Jason Wang
  2 siblings, 0 replies; 8+ messages in thread
From: Dmitry Fleytman @ 2017-05-17 12:14 UTC (permalink / raw)
  To: Sameeh Jubran; +Cc: Qemu Developers, Jason Wang, Yan Vugenfirer


> On 17 May 2017, at 14:46 PM, Sameeh Jubran <sameeh@daynix.com> wrote:
> 
> The bug was caused by the "receive overrun" (bit #6 of the ICR register) interrupt
> which would be triggered post migration in a heavy traffic environment. Even though the
> "receive overrun" bit (#6) is masked out by the IMS register (refer to the log below)
> the driver still receives an interrupt as the "receive overrun" bit (#6) causes the
> "Other" - bit #24 of the ICR register - bit to be set as documented below. The driver
> handles the interrupt and clears the "Other" bit (#24) but doesn't clear the
> "receive overrun" bit (#6) which leads to an infinite loop. Apparently the Windows
> driver expects that the "receive overrun" bit and other ones - documented below - to be
> cleared when the "Other" bit (#24) is cleared.
> 
> So to sum that up:
> 1. Bit #6 of the ICR register is set by heavy traffic
> 2. As a results of setting bit #6, bit #24 is set
> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an interrupt for bit #6 as it is masked out by IMS)
> 4. The driver handles and clears the interrupt of bit #24
> 5. Bit #6 is still set.
> 6. 2 happens all over again
> 
> The Interrupt Cause Read - ICR register:
> 
> The ICR has the "Other" bit - bit #24 - that is set when one or more of the following
> ICR register's bits are set:
> 
> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17, MNG - bit #18
> 
> Log sample of the storm:
> 
> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
> 
> This commit solves:
> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
> https://bugzilla.redhat.com/show_bug.cgi?id=1449490 <https://bugzilla.redhat.com/show_bug.cgi?id=1449490>


Reviewed-by: Dmitry Fleytman <dmitry@daynix.com>


> 
> Signed-off-by: Sameeh Jubran <sjubran@redhat.com>
> ---
> hw/net/e1000e_core.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
> index 28c5be1..8174b53 100644
> --- a/hw/net/e1000e_core.c
> +++ b/hw/net/e1000e_core.c
> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, uint32_t val)
> static void
> e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
> {
> +    uint32_t icr = 0;
>     if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>         (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>         trace_e1000e_irq_icr_process_iame();
>         e1000e_clear_ims_bits(core, core->mac[IAM]);
>     }
> 
> -    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] & ~val);
> -    core->mac[ICR] &= ~val;
> +    icr = core->mac[ICR] & ~val;
> +    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) : icr;
> +    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
> +    core->mac[ICR] = icr;
>     e1000e_update_interrupt_state(core);
> }
> 
> -- 
> 2.8.1.185.gdc0db2c
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
  2017-05-17 11:46 [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration Sameeh Jubran
  2017-05-17 12:14 ` Dmitry Fleytman
@ 2017-05-18 10:04 ` Dr. David Alan Gilbert
  2017-05-18 10:40   ` Paolo Bonzini
  2017-05-19  6:25 ` Jason Wang
  2 siblings, 1 reply; 8+ messages in thread
From: Dr. David Alan Gilbert @ 2017-05-18 10:04 UTC (permalink / raw)
  To: Sameeh Jubran; +Cc: qemu-devel, Jason Wang, Dmitry Fleytman, Yan Vugenfirer

* Sameeh Jubran (sameeh@daynix.com) wrote:
> The bug was caused by the "receive overrun" (bit #6 of the ICR register) interrupt
> which would be triggered post migration in a heavy traffic environment. Even though the
> "receive overrun" bit (#6) is masked out by the IMS register (refer to the log below)
> the driver still receives an interrupt as the "receive overrun" bit (#6) causes the
> "Other" - bit #24 of the ICR register - bit to be set as documented below. The driver
> handles the interrupt and clears the "Other" bit (#24) but doesn't clear the
> "receive overrun" bit (#6) which leads to an infinite loop. Apparently the Windows
> driver expects that the "receive overrun" bit and other ones - documented below - to be
> cleared when the "Other" bit (#24) is cleared.
> 
> So to sum that up:
> 1. Bit #6 of the ICR register is set by heavy traffic
> 2. As a results of setting bit #6, bit #24 is set
> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an interrupt for bit #6 as it is masked out by IMS)
> 4. The driver handles and clears the interrupt of bit #24
> 5. Bit #6 is still set.
> 6. 2 happens all over again
> 
> The Interrupt Cause Read - ICR register:
> 
> The ICR has the "Other" bit - bit #24 - that is set when one or more of the following
> ICR register's bits are set:
> 
> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17, MNG - bit #18
> 
> Log sample of the storm:
> 
> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
> 
> This commit solves:
> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
> 
> Signed-off-by: Sameeh Jubran <sjubran@redhat.com>

Thanks, I tested this with our downstream and it does
fix the reproducer for 1447935 for me, so:


Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Dave

> ---
>  hw/net/e1000e_core.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
> index 28c5be1..8174b53 100644
> --- a/hw/net/e1000e_core.c
> +++ b/hw/net/e1000e_core.c
> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, uint32_t val)
>  static void
>  e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>  {
> +    uint32_t icr = 0;
>      if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>          (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>          trace_e1000e_irq_icr_process_iame();
>          e1000e_clear_ims_bits(core, core->mac[IAM]);
>      }
>  
> -    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] & ~val);
> -    core->mac[ICR] &= ~val;
> +    icr = core->mac[ICR] & ~val;
> +    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) : icr;
> +    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
> +    core->mac[ICR] = icr;
>      e1000e_update_interrupt_state(core);
>  }
>  
> -- 
> 2.8.1.185.gdc0db2c
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
  2017-05-18 10:04 ` Dr. David Alan Gilbert
@ 2017-05-18 10:40   ` Paolo Bonzini
  0 siblings, 0 replies; 8+ messages in thread
From: Paolo Bonzini @ 2017-05-18 10:40 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Sameeh Jubran
  Cc: Dmitry Fleytman, Yan Vugenfirer, Jason Wang, qemu-devel,
	qemu-stable



On 18/05/2017 12:04, Dr. David Alan Gilbert wrote:
> * Sameeh Jubran (sameeh@daynix.com) wrote:
>> The bug was caused by the "receive overrun" (bit #6 of the ICR register) interrupt
>> which would be triggered post migration in a heavy traffic environment. Even though the
>> "receive overrun" bit (#6) is masked out by the IMS register (refer to the log below)
>> the driver still receives an interrupt as the "receive overrun" bit (#6) causes the
>> "Other" - bit #24 of the ICR register - bit to be set as documented below. The driver
>> handles the interrupt and clears the "Other" bit (#24) but doesn't clear the
>> "receive overrun" bit (#6) which leads to an infinite loop. Apparently the Windows
>> driver expects that the "receive overrun" bit and other ones - documented below - to be
>> cleared when the "Other" bit (#24) is cleared.
>>
>> So to sum that up:
>> 1. Bit #6 of the ICR register is set by heavy traffic
>> 2. As a results of setting bit #6, bit #24 is set
>> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an interrupt for bit #6 as it is masked out by IMS)
>> 4. The driver handles and clears the interrupt of bit #24
>> 5. Bit #6 is still set.
>> 6. 2 happens all over again
>>
>> The Interrupt Cause Read - ICR register:
>>
>> The ICR has the "Other" bit - bit #24 - that is set when one or more of the following
>> ICR register's bits are set:
>>
>> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17, MNG - bit #18
>>
>> Log sample of the storm:
>>
>> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>>
>> This commit solves:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
>> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
>>
>> Signed-off-by: Sameeh Jubran <sjubran@redhat.com>
> 
> Thanks, I tested this with our downstream and it does
> fix the reproducer for 1447935 for me, so:
> 
> 
> Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Dave
> 
>> ---
>>  hw/net/e1000e_core.c | 7 +++++--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
>> index 28c5be1..8174b53 100644
>> --- a/hw/net/e1000e_core.c
>> +++ b/hw/net/e1000e_core.c
>> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, uint32_t val)
>>  static void
>>  e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>>  {
>> +    uint32_t icr = 0;
>>      if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>>          (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>>          trace_e1000e_irq_icr_process_iame();
>>          e1000e_clear_ims_bits(core, core->mac[IAM]);
>>      }
>>  
>> -    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] & ~val);
>> -    core->mac[ICR] &= ~val;
>> +    icr = core->mac[ICR] & ~val;
>> +    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) : icr;
>> +    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
>> +    core->mac[ICR] = icr;
>>      e1000e_update_interrupt_state(core);
>>  }
>>  
>> -- 
>> 2.8.1.185.gdc0db2c
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

Cc: qemu-stable@nongnu.org

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
  2017-05-17 11:46 [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration Sameeh Jubran
  2017-05-17 12:14 ` Dmitry Fleytman
  2017-05-18 10:04 ` Dr. David Alan Gilbert
@ 2017-05-19  6:25 ` Jason Wang
  2017-05-19 14:04   ` Sameeh Jubran
  2 siblings, 1 reply; 8+ messages in thread
From: Jason Wang @ 2017-05-19  6:25 UTC (permalink / raw)
  To: Sameeh Jubran, qemu-devel, Dmitry Fleytman; +Cc: Yan Vugenfirer



On 2017年05月17日 19:46, Sameeh Jubran wrote:
> The bug was caused by the "receive overrun" (bit #6 of the ICR register) interrupt
> which would be triggered post migration in a heavy traffic environment. Even though the
> "receive overrun" bit (#6) is masked out by the IMS register (refer to the log below)
> the driver still receives an interrupt as the "receive overrun" bit (#6) causes the
> "Other" - bit #24 of the ICR register - bit to be set as documented below. The driver
> handles the interrupt and clears the "Other" bit (#24) but doesn't clear the
> "receive overrun" bit (#6) which leads to an infinite loop. Apparently the Windows
> driver expects that the "receive overrun" bit and other ones - documented below - to be
> cleared when the "Other" bit (#24) is cleared.
>
> So to sum that up:
> 1. Bit #6 of the ICR register is set by heavy traffic
> 2. As a results of setting bit #6, bit #24 is set
> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an interrupt for bit #6 as it is masked out by IMS)
> 4. The driver handles and clears the interrupt of bit #24
> 5. Bit #6 is still set.
> 6. 2 happens all over again
>
> The Interrupt Cause Read - ICR register:
>
> The ICR has the "Other" bit - bit #24 - that is set when one or more of the following
> ICR register's bits are set:
>
> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17, MNG - bit #18
>
> Log sample of the storm:
>
> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>
> This commit solves:
> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
>
> Signed-off-by: Sameeh Jubran <sjubran@redhat.com>
> ---
>   hw/net/e1000e_core.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
> index 28c5be1..8174b53 100644
> --- a/hw/net/e1000e_core.c
> +++ b/hw/net/e1000e_core.c
> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, uint32_t val)
>   static void
>   e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>   {
> +    uint32_t icr = 0;
>       if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>           (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>           trace_e1000e_irq_icr_process_iame();
>           e1000e_clear_ims_bits(core, core->mac[IAM]);
>       }
>   
> -    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] & ~val);
> -    core->mac[ICR] &= ~val;
> +    icr = core->mac[ICR] & ~val;
> +    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) : icr;
> +    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
> +    core->mac[ICR] = icr;
>       e1000e_update_interrupt_state(core);
>   }
>   

Thanks for the patch.

So this is an undocumented behavior, we must be careful on this. Several 
question below:

- have you verified this on real hardware?
- is MSIX enabled in this case?
- according to the steps you've summed up above, it's not specific to 
migration?

Thanks

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
  2017-05-19  6:25 ` Jason Wang
@ 2017-05-19 14:04   ` Sameeh Jubran
  2017-05-22  2:46     ` Jason Wang
  0 siblings, 1 reply; 8+ messages in thread
From: Sameeh Jubran @ 2017-05-19 14:04 UTC (permalink / raw)
  To: Jason Wang; +Cc: QEMU Developers, Dmitry Fleytman, Yan Vugenfirer

On Fri, May 19, 2017 at 9:25 AM, Jason Wang <jasowang@redhat.com> wrote:

>
>
> On 2017年05月17日 19:46, Sameeh Jubran wrote:
>
>> The bug was caused by the "receive overrun" (bit #6 of the ICR register)
>> interrupt
>> which would be triggered post migration in a heavy traffic environment.
>> Even though the
>> "receive overrun" bit (#6) is masked out by the IMS register (refer to
>> the log below)
>> the driver still receives an interrupt as the "receive overrun" bit (#6)
>> causes the
>> "Other" - bit #24 of the ICR register - bit to be set as documented
>> below. The driver
>> handles the interrupt and clears the "Other" bit (#24) but doesn't clear
>> the
>> "receive overrun" bit (#6) which leads to an infinite loop. Apparently
>> the Windows
>> driver expects that the "receive overrun" bit and other ones - documented
>> below - to be
>> cleared when the "Other" bit (#24) is cleared.
>>
>> So to sum that up:
>> 1. Bit #6 of the ICR register is set by heavy traffic
>> 2. As a results of setting bit #6, bit #24 is set
>> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an
>> interrupt for bit #6 as it is masked out by IMS)
>> 4. The driver handles and clears the interrupt of bit #24
>> 5. Bit #6 is still set.
>> 6. 2 happens all over again
>>
>> The Interrupt Cause Read - ICR register:
>>
>> The ICR has the "Other" bit - bit #24 - that is set when one or more of
>> the following
>> ICR register's bits are set:
>>
>> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17,
>> MNG - bit #18
>>
>> Log sample of the storm:
>>
>> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING:
>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING:
>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>>
>> This commit solves:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
>> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
>>
>> Signed-off-by: Sameeh Jubran <sjubran@redhat.com>
>> ---
>>   hw/net/e1000e_core.c | 7 +++++--
>>   1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
>> index 28c5be1..8174b53 100644
>> --- a/hw/net/e1000e_core.c
>> +++ b/hw/net/e1000e_core.c
>> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index,
>> uint32_t val)
>>   static void
>>   e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>>   {
>> +    uint32_t icr = 0;
>>       if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>>           (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>>           trace_e1000e_irq_icr_process_iame();
>>           e1000e_clear_ims_bits(core, core->mac[IAM]);
>>       }
>>   -    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] &
>> ~val);
>> -    core->mac[ICR] &= ~val;
>> +    icr = core->mac[ICR] & ~val;
>> +    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) :
>> icr;
>> +    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
>> +    core->mac[ICR] = icr;
>>       e1000e_update_interrupt_state(core);
>>   }
>>
>>
>
> Thanks for the patch.
>
> So this is an undocumented behavior, we must be careful on this. Several
> question below:
>
> - have you verified this on real hardware?
>
No I haven't

> - is MSIX enabled in this case?
>
Yes it is, I have tested the patch with msi disabled too.

> - according to the steps you've summed up above, it's not specific to
> migration?
>
True

>
> Thanks
>



-- 
Respectfully,
*Sameeh Jubran*
*Linkedin <https://il.linkedin.com/pub/sameeh-jubran/87/747/a8a>*
*Software Engineer @ Daynix <http://www.daynix.com>.*

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
  2017-05-19 14:04   ` Sameeh Jubran
@ 2017-05-22  2:46     ` Jason Wang
  2017-05-22 11:05       ` Sameeh Jubran
  0 siblings, 1 reply; 8+ messages in thread
From: Jason Wang @ 2017-05-22  2:46 UTC (permalink / raw)
  To: Sameeh Jubran; +Cc: Dmitry Fleytman, Yan Vugenfirer, QEMU Developers



On 2017年05月19日 22:04, Sameeh Jubran wrote:
> On Fri, May 19, 2017 at 9:25 AM, Jason Wang <jasowang@redhat.com> wrote:
>
>>
>> On 2017年05月17日 19:46, Sameeh Jubran wrote:
>>
>>> The bug was caused by the "receive overrun" (bit #6 of the ICR register)
>>> interrupt
>>> which would be triggered post migration in a heavy traffic environment.
>>> Even though the
>>> "receive overrun" bit (#6) is masked out by the IMS register (refer to
>>> the log below)
>>> the driver still receives an interrupt as the "receive overrun" bit (#6)
>>> causes the
>>> "Other" - bit #24 of the ICR register - bit to be set as documented
>>> below. The driver
>>> handles the interrupt and clears the "Other" bit (#24) but doesn't clear
>>> the
>>> "receive overrun" bit (#6) which leads to an infinite loop. Apparently
>>> the Windows
>>> driver expects that the "receive overrun" bit and other ones - documented
>>> below - to be
>>> cleared when the "Other" bit (#24) is cleared.
>>>
>>> So to sum that up:
>>> 1. Bit #6 of the ICR register is set by heavy traffic
>>> 2. As a results of setting bit #6, bit #24 is set
>>> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an
>>> interrupt for bit #6 as it is masked out by IMS)
>>> 4. The driver handles and clears the interrupt of bit #24
>>> 5. Bit #6 is still set.
>>> 6. 2 happens all over again
>>>
>>> The Interrupt Cause Read - ICR register:
>>>
>>> The ICR has the "Other" bit - bit #24 - that is set when one or more of
>>> the following
>>> ICR register's bits are set:
>>>
>>> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17,
>>> MNG - bit #18
>>>
>>> Log sample of the storm:
>>>
>>> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING:
>>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>>> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING:
>>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>>>
>>> This commit solves:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
>>>
>>> Signed-off-by: Sameeh Jubran <sjubran@redhat.com>
>>> ---
>>>    hw/net/e1000e_core.c | 7 +++++--
>>>    1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
>>> index 28c5be1..8174b53 100644
>>> --- a/hw/net/e1000e_core.c
>>> +++ b/hw/net/e1000e_core.c
>>> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index,
>>> uint32_t val)
>>>    static void
>>>    e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>>>    {
>>> +    uint32_t icr = 0;
>>>        if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>>>            (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>>>            trace_e1000e_irq_icr_process_iame();
>>>            e1000e_clear_ims_bits(core, core->mac[IAM]);
>>>        }
>>>    -    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] &
>>> ~val);
>>> -    core->mac[ICR] &= ~val;
>>> +    icr = core->mac[ICR] & ~val;
>>> +    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) :
>>> icr;
>>> +    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
>>> +    core->mac[ICR] = icr;
>>>        e1000e_update_interrupt_state(core);
>>>    }
>>>
>>>
>> Thanks for the patch.
>>
>> So this is an undocumented behavior, we must be careful on this. Several
>> question below:
>>
>> - have you verified this on real hardware?
>>
> No I haven't

Any chance to verify the behavior on real hardware?

>
>> - is MSIX enabled in this case?
>>
> Yes it is, I have tested the patch with msi disabled too.

Does the problem exist in no msi case. If not, is this better to hack 
EIAC behavior?

>
>> - according to the steps you've summed up above, it's not specific to
>> migration?
>>
> True

Then we probably need tweak the title.

Thanks.

>
>> Thanks
>>
>
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
  2017-05-22  2:46     ` Jason Wang
@ 2017-05-22 11:05       ` Sameeh Jubran
  0 siblings, 0 replies; 8+ messages in thread
From: Sameeh Jubran @ 2017-05-22 11:05 UTC (permalink / raw)
  To: Jason Wang; +Cc: Dmitry Fleytman, Yan Vugenfirer, QEMU Developers

On Mon, May 22, 2017 at 5:46 AM, Jason Wang <jasowang@redhat.com> wrote:

>
>
> On 2017年05月19日 22:04, Sameeh Jubran wrote:
>
>> On Fri, May 19, 2017 at 9:25 AM, Jason Wang <jasowang@redhat.com> wrote:
>>
>>
>>> On 2017年05月17日 19:46, Sameeh Jubran wrote:
>>>
>>> The bug was caused by the "receive overrun" (bit #6 of the ICR register)
>>>> interrupt
>>>> which would be triggered post migration in a heavy traffic environment.
>>>> Even though the
>>>> "receive overrun" bit (#6) is masked out by the IMS register (refer to
>>>> the log below)
>>>> the driver still receives an interrupt as the "receive overrun" bit (#6)
>>>> causes the
>>>> "Other" - bit #24 of the ICR register - bit to be set as documented
>>>> below. The driver
>>>> handles the interrupt and clears the "Other" bit (#24) but doesn't clear
>>>> the
>>>> "receive overrun" bit (#6) which leads to an infinite loop. Apparently
>>>> the Windows
>>>> driver expects that the "receive overrun" bit and other ones -
>>>> documented
>>>> below - to be
>>>> cleared when the "Other" bit (#24) is cleared.
>>>>
>>>> So to sum that up:
>>>> 1. Bit #6 of the ICR register is set by heavy traffic
>>>> 2. As a results of setting bit #6, bit #24 is set
>>>> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an
>>>> interrupt for bit #6 as it is masked out by IMS)
>>>> 4. The driver handles and clears the interrupt of bit #24
>>>> 5. Bit #6 is still set.
>>>> 6. 2 happens all over again
>>>>
>>>> The Interrupt Cause Read - ICR register:
>>>>
>>>> The ICR has the "Other" bit - bit #24 - that is set when one or more of
>>>> the following
>>>> ICR register's bits are set:
>>>>
>>>> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit
>>>> #17,
>>>> MNG - bit #18
>>>>
>>>> Log sample of the storm:
>>>>
>>>> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING:
>>>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>>>> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING:
>>>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>>>>
>>>> This commit solves:
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
>>>>
>>>> Signed-off-by: Sameeh Jubran <sjubran@redhat.com>
>>>> ---
>>>>    hw/net/e1000e_core.c | 7 +++++--
>>>>    1 file changed, 5 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
>>>> index 28c5be1..8174b53 100644
>>>> --- a/hw/net/e1000e_core.c
>>>> +++ b/hw/net/e1000e_core.c
>>>> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index,
>>>> uint32_t val)
>>>>    static void
>>>>    e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>>>>    {
>>>> +    uint32_t icr = 0;
>>>>        if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>>>>            (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>>>>            trace_e1000e_irq_icr_process_iame();
>>>>            e1000e_clear_ims_bits(core, core->mac[IAM]);
>>>>        }
>>>>    -    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR]
>>>> &
>>>> ~val);
>>>> -    core->mac[ICR] &= ~val;
>>>> +    icr = core->mac[ICR] & ~val;
>>>> +    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) :
>>>> icr;
>>>> +    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
>>>> +    core->mac[ICR] = icr;
>>>>        e1000e_update_interrupt_state(core);
>>>>    }
>>>>
>>>>
>>>> Thanks for the patch.
>>>
>>> So this is an undocumented behavior, we must be careful on this. Several
>>> question below:
>>>
>>> - have you verified this on real hardware?
>>>
>>> No I haven't
>>
>
> Any chance to verify the behavior on real hardware?

This can be very challenging to reproduce on a real hardware as the Windows
source code isn't available and the bug doesn't
reproduce with the Linux driver; Moreover, the Linux driver configures the
device differently than the Windows driver.

>
>
>
>> - is MSIX enabled in this case?
>>>
>>> Yes it is, I have tested the patch with msi disabled too.
>>
>
> Does the problem exist in no msi case. If not, is this better to hack EIAC
> behavior?

Even though it doesn't reproduce in no msi case, this certainly can occur
as the ICR register is present in both msi and
no msi cases while the EIAC register is for MSI only.

>
>> - according to the steps you've summed up above, it's not specific to
>>> migration?
>>>
>>> True
>>
>
> Then we probably need tweak the title.
>
No problem, I'll send a v2 soon.

> Thanks.
>
>
>> Thanks
>>>
>>>
>>
>>
>


-- 
Respectfully,
*Sameeh Jubran*
*Linkedin <https://il.linkedin.com/pub/sameeh-jubran/87/747/a8a>*
*Software Engineer @ Daynix <http://www.daynix.com>.*

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-05-22 11:05 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-05-17 11:46 [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration Sameeh Jubran
2017-05-17 12:14 ` Dmitry Fleytman
2017-05-18 10:04 ` Dr. David Alan Gilbert
2017-05-18 10:40   ` Paolo Bonzini
2017-05-19  6:25 ` Jason Wang
2017-05-19 14:04   ` Sameeh Jubran
2017-05-22  2:46     ` Jason Wang
2017-05-22 11:05       ` Sameeh Jubran

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).