[Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration

Sameeh Jubran posted 1 patch 6 years, 11 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/1495021572-20852-1-git-send-email-sameeh@daynix.com
Test checkpatch passed
Test docker passed
Test s390x passed
hw/net/e1000e_core.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
[Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
Posted by Sameeh Jubran 6 years, 11 months ago
The bug was caused by the "receive overrun" (bit #6 of the ICR register) interrupt
which would be triggered post migration in a heavy traffic environment. Even though the
"receive overrun" bit (#6) is masked out by the IMS register (refer to the log below)
the driver still receives an interrupt as the "receive overrun" bit (#6) causes the
"Other" - bit #24 of the ICR register - bit to be set as documented below. The driver
handles the interrupt and clears the "Other" bit (#24) but doesn't clear the
"receive overrun" bit (#6) which leads to an infinite loop. Apparently the Windows
driver expects that the "receive overrun" bit and other ones - documented below - to be
cleared when the "Other" bit (#24) is cleared.

So to sum that up:
1. Bit #6 of the ICR register is set by heavy traffic
2. As a results of setting bit #6, bit #24 is set
3. The driver receives an interrupt for bit 24 (it doesn't receieve an interrupt for bit #6 as it is masked out by IMS)
4. The driver handles and clears the interrupt of bit #24
5. Bit #6 is still set.
6. 2 happens all over again

The Interrupt Cause Read - ICR register:

The ICR has the "Other" bit - bit #24 - that is set when one or more of the following
ICR register's bits are set:

LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17, MNG - bit #18

Log sample of the storm:

27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)

This commit solves:
https://bugzilla.redhat.com/show_bug.cgi?id=1447935
https://bugzilla.redhat.com/show_bug.cgi?id=1449490

Signed-off-by: Sameeh Jubran <sjubran@redhat.com>
---
 hw/net/e1000e_core.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
index 28c5be1..8174b53 100644
--- a/hw/net/e1000e_core.c
+++ b/hw/net/e1000e_core.c
@@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, uint32_t val)
 static void
 e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
 {
+    uint32_t icr = 0;
     if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
         (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
         trace_e1000e_irq_icr_process_iame();
         e1000e_clear_ims_bits(core, core->mac[IAM]);
     }
 
-    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] & ~val);
-    core->mac[ICR] &= ~val;
+    icr = core->mac[ICR] & ~val;
+    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) : icr;
+    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
+    core->mac[ICR] = icr;
     e1000e_update_interrupt_state(core);
 }
 
-- 
2.8.1.185.gdc0db2c


Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
Posted by Dmitry Fleytman 6 years, 11 months ago
> On 17 May 2017, at 14:46 PM, Sameeh Jubran <sameeh@daynix.com> wrote:
> 
> The bug was caused by the "receive overrun" (bit #6 of the ICR register) interrupt
> which would be triggered post migration in a heavy traffic environment. Even though the
> "receive overrun" bit (#6) is masked out by the IMS register (refer to the log below)
> the driver still receives an interrupt as the "receive overrun" bit (#6) causes the
> "Other" - bit #24 of the ICR register - bit to be set as documented below. The driver
> handles the interrupt and clears the "Other" bit (#24) but doesn't clear the
> "receive overrun" bit (#6) which leads to an infinite loop. Apparently the Windows
> driver expects that the "receive overrun" bit and other ones - documented below - to be
> cleared when the "Other" bit (#24) is cleared.
> 
> So to sum that up:
> 1. Bit #6 of the ICR register is set by heavy traffic
> 2. As a results of setting bit #6, bit #24 is set
> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an interrupt for bit #6 as it is masked out by IMS)
> 4. The driver handles and clears the interrupt of bit #24
> 5. Bit #6 is still set.
> 6. 2 happens all over again
> 
> The Interrupt Cause Read - ICR register:
> 
> The ICR has the "Other" bit - bit #24 - that is set when one or more of the following
> ICR register's bits are set:
> 
> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17, MNG - bit #18
> 
> Log sample of the storm:
> 
> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
> 
> This commit solves:
> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
> https://bugzilla.redhat.com/show_bug.cgi?id=1449490 <https://bugzilla.redhat.com/show_bug.cgi?id=1449490>


Reviewed-by: Dmitry Fleytman <dmitry@daynix.com>


> 
> Signed-off-by: Sameeh Jubran <sjubran@redhat.com>
> ---
> hw/net/e1000e_core.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
> index 28c5be1..8174b53 100644
> --- a/hw/net/e1000e_core.c
> +++ b/hw/net/e1000e_core.c
> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, uint32_t val)
> static void
> e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
> {
> +    uint32_t icr = 0;
>     if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>         (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>         trace_e1000e_irq_icr_process_iame();
>         e1000e_clear_ims_bits(core, core->mac[IAM]);
>     }
> 
> -    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] & ~val);
> -    core->mac[ICR] &= ~val;
> +    icr = core->mac[ICR] & ~val;
> +    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) : icr;
> +    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
> +    core->mac[ICR] = icr;
>     e1000e_update_interrupt_state(core);
> }
> 
> -- 
> 2.8.1.185.gdc0db2c
> 

Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
Posted by Dr. David Alan Gilbert 6 years, 11 months ago
* Sameeh Jubran (sameeh@daynix.com) wrote:
> The bug was caused by the "receive overrun" (bit #6 of the ICR register) interrupt
> which would be triggered post migration in a heavy traffic environment. Even though the
> "receive overrun" bit (#6) is masked out by the IMS register (refer to the log below)
> the driver still receives an interrupt as the "receive overrun" bit (#6) causes the
> "Other" - bit #24 of the ICR register - bit to be set as documented below. The driver
> handles the interrupt and clears the "Other" bit (#24) but doesn't clear the
> "receive overrun" bit (#6) which leads to an infinite loop. Apparently the Windows
> driver expects that the "receive overrun" bit and other ones - documented below - to be
> cleared when the "Other" bit (#24) is cleared.
> 
> So to sum that up:
> 1. Bit #6 of the ICR register is set by heavy traffic
> 2. As a results of setting bit #6, bit #24 is set
> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an interrupt for bit #6 as it is masked out by IMS)
> 4. The driver handles and clears the interrupt of bit #24
> 5. Bit #6 is still set.
> 6. 2 happens all over again
> 
> The Interrupt Cause Read - ICR register:
> 
> The ICR has the "Other" bit - bit #24 - that is set when one or more of the following
> ICR register's bits are set:
> 
> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17, MNG - bit #18
> 
> Log sample of the storm:
> 
> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
> 
> This commit solves:
> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
> 
> Signed-off-by: Sameeh Jubran <sjubran@redhat.com>

Thanks, I tested this with our downstream and it does
fix the reproducer for 1447935 for me, so:


Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Dave

> ---
>  hw/net/e1000e_core.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
> index 28c5be1..8174b53 100644
> --- a/hw/net/e1000e_core.c
> +++ b/hw/net/e1000e_core.c
> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, uint32_t val)
>  static void
>  e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>  {
> +    uint32_t icr = 0;
>      if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>          (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>          trace_e1000e_irq_icr_process_iame();
>          e1000e_clear_ims_bits(core, core->mac[IAM]);
>      }
>  
> -    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] & ~val);
> -    core->mac[ICR] &= ~val;
> +    icr = core->mac[ICR] & ~val;
> +    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) : icr;
> +    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
> +    core->mac[ICR] = icr;
>      e1000e_update_interrupt_state(core);
>  }
>  
> -- 
> 2.8.1.185.gdc0db2c
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
Posted by Paolo Bonzini 6 years, 11 months ago

On 18/05/2017 12:04, Dr. David Alan Gilbert wrote:
> * Sameeh Jubran (sameeh@daynix.com) wrote:
>> The bug was caused by the "receive overrun" (bit #6 of the ICR register) interrupt
>> which would be triggered post migration in a heavy traffic environment. Even though the
>> "receive overrun" bit (#6) is masked out by the IMS register (refer to the log below)
>> the driver still receives an interrupt as the "receive overrun" bit (#6) causes the
>> "Other" - bit #24 of the ICR register - bit to be set as documented below. The driver
>> handles the interrupt and clears the "Other" bit (#24) but doesn't clear the
>> "receive overrun" bit (#6) which leads to an infinite loop. Apparently the Windows
>> driver expects that the "receive overrun" bit and other ones - documented below - to be
>> cleared when the "Other" bit (#24) is cleared.
>>
>> So to sum that up:
>> 1. Bit #6 of the ICR register is set by heavy traffic
>> 2. As a results of setting bit #6, bit #24 is set
>> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an interrupt for bit #6 as it is masked out by IMS)
>> 4. The driver handles and clears the interrupt of bit #24
>> 5. Bit #6 is still set.
>> 6. 2 happens all over again
>>
>> The Interrupt Cause Read - ICR register:
>>
>> The ICR has the "Other" bit - bit #24 - that is set when one or more of the following
>> ICR register's bits are set:
>>
>> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17, MNG - bit #18
>>
>> Log sample of the storm:
>>
>> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>>
>> This commit solves:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
>> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
>>
>> Signed-off-by: Sameeh Jubran <sjubran@redhat.com>
> 
> Thanks, I tested this with our downstream and it does
> fix the reproducer for 1447935 for me, so:
> 
> 
> Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Dave
> 
>> ---
>>  hw/net/e1000e_core.c | 7 +++++--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
>> index 28c5be1..8174b53 100644
>> --- a/hw/net/e1000e_core.c
>> +++ b/hw/net/e1000e_core.c
>> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, uint32_t val)
>>  static void
>>  e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>>  {
>> +    uint32_t icr = 0;
>>      if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>>          (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>>          trace_e1000e_irq_icr_process_iame();
>>          e1000e_clear_ims_bits(core, core->mac[IAM]);
>>      }
>>  
>> -    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] & ~val);
>> -    core->mac[ICR] &= ~val;
>> +    icr = core->mac[ICR] & ~val;
>> +    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) : icr;
>> +    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
>> +    core->mac[ICR] = icr;
>>      e1000e_update_interrupt_state(core);
>>  }
>>  
>> -- 
>> 2.8.1.185.gdc0db2c
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

Cc: qemu-stable@nongnu.org

Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
Posted by Jason Wang 6 years, 11 months ago

On 2017年05月17日 19:46, Sameeh Jubran wrote:
> The bug was caused by the "receive overrun" (bit #6 of the ICR register) interrupt
> which would be triggered post migration in a heavy traffic environment. Even though the
> "receive overrun" bit (#6) is masked out by the IMS register (refer to the log below)
> the driver still receives an interrupt as the "receive overrun" bit (#6) causes the
> "Other" - bit #24 of the ICR register - bit to be set as documented below. The driver
> handles the interrupt and clears the "Other" bit (#24) but doesn't clear the
> "receive overrun" bit (#6) which leads to an infinite loop. Apparently the Windows
> driver expects that the "receive overrun" bit and other ones - documented below - to be
> cleared when the "Other" bit (#24) is cleared.
>
> So to sum that up:
> 1. Bit #6 of the ICR register is set by heavy traffic
> 2. As a results of setting bit #6, bit #24 is set
> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an interrupt for bit #6 as it is masked out by IMS)
> 4. The driver handles and clears the interrupt of bit #24
> 5. Bit #6 is still set.
> 6. 2 happens all over again
>
> The Interrupt Cause Read - ICR register:
>
> The ICR has the "Other" bit - bit #24 - that is set when one or more of the following
> ICR register's bits are set:
>
> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17, MNG - bit #18
>
> Log sample of the storm:
>
> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 0x815000c2, IMS: 0xa00004)
> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>
> This commit solves:
> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
>
> Signed-off-by: Sameeh Jubran <sjubran@redhat.com>
> ---
>   hw/net/e1000e_core.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
> index 28c5be1..8174b53 100644
> --- a/hw/net/e1000e_core.c
> +++ b/hw/net/e1000e_core.c
> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, uint32_t val)
>   static void
>   e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>   {
> +    uint32_t icr = 0;
>       if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>           (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>           trace_e1000e_irq_icr_process_iame();
>           e1000e_clear_ims_bits(core, core->mac[IAM]);
>       }
>   
> -    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] & ~val);
> -    core->mac[ICR] &= ~val;
> +    icr = core->mac[ICR] & ~val;
> +    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) : icr;
> +    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
> +    core->mac[ICR] = icr;
>       e1000e_update_interrupt_state(core);
>   }
>   

Thanks for the patch.

So this is an undocumented behavior, we must be careful on this. Several 
question below:

- have you verified this on real hardware?
- is MSIX enabled in this case?
- according to the steps you've summed up above, it's not specific to 
migration?

Thanks

Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
Posted by Sameeh Jubran 6 years, 11 months ago
On Fri, May 19, 2017 at 9:25 AM, Jason Wang <jasowang@redhat.com> wrote:

>
>
> On 2017年05月17日 19:46, Sameeh Jubran wrote:
>
>> The bug was caused by the "receive overrun" (bit #6 of the ICR register)
>> interrupt
>> which would be triggered post migration in a heavy traffic environment.
>> Even though the
>> "receive overrun" bit (#6) is masked out by the IMS register (refer to
>> the log below)
>> the driver still receives an interrupt as the "receive overrun" bit (#6)
>> causes the
>> "Other" - bit #24 of the ICR register - bit to be set as documented
>> below. The driver
>> handles the interrupt and clears the "Other" bit (#24) but doesn't clear
>> the
>> "receive overrun" bit (#6) which leads to an infinite loop. Apparently
>> the Windows
>> driver expects that the "receive overrun" bit and other ones - documented
>> below - to be
>> cleared when the "Other" bit (#24) is cleared.
>>
>> So to sum that up:
>> 1. Bit #6 of the ICR register is set by heavy traffic
>> 2. As a results of setting bit #6, bit #24 is set
>> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an
>> interrupt for bit #6 as it is masked out by IMS)
>> 4. The driver handles and clears the interrupt of bit #24
>> 5. Bit #6 is still set.
>> 6. 2 happens all over again
>>
>> The Interrupt Cause Read - ICR register:
>>
>> The ICR has the "Other" bit - bit #24 - that is set when one or more of
>> the following
>> ICR register's bits are set:
>>
>> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17,
>> MNG - bit #18
>>
>> Log sample of the storm:
>>
>> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING:
>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa00004)
>> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING:
>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>>
>> This commit solves:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
>> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
>>
>> Signed-off-by: Sameeh Jubran <sjubran@redhat.com>
>> ---
>>   hw/net/e1000e_core.c | 7 +++++--
>>   1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
>> index 28c5be1..8174b53 100644
>> --- a/hw/net/e1000e_core.c
>> +++ b/hw/net/e1000e_core.c
>> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index,
>> uint32_t val)
>>   static void
>>   e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>>   {
>> +    uint32_t icr = 0;
>>       if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>>           (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>>           trace_e1000e_irq_icr_process_iame();
>>           e1000e_clear_ims_bits(core, core->mac[IAM]);
>>       }
>>   -    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] &
>> ~val);
>> -    core->mac[ICR] &= ~val;
>> +    icr = core->mac[ICR] & ~val;
>> +    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) :
>> icr;
>> +    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
>> +    core->mac[ICR] = icr;
>>       e1000e_update_interrupt_state(core);
>>   }
>>
>>
>
> Thanks for the patch.
>
> So this is an undocumented behavior, we must be careful on this. Several
> question below:
>
> - have you verified this on real hardware?
>
No I haven't

> - is MSIX enabled in this case?
>
Yes it is, I have tested the patch with msi disabled too.

> - according to the steps you've summed up above, it's not specific to
> migration?
>
True

>
> Thanks
>



-- 
Respectfully,
*Sameeh Jubran*
*Linkedin <https://il.linkedin.com/pub/sameeh-jubran/87/747/a8a>*
*Software Engineer @ Daynix <http://www.daynix.com>.*
Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
Posted by Jason Wang 6 years, 11 months ago

On 2017年05月19日 22:04, Sameeh Jubran wrote:
> On Fri, May 19, 2017 at 9:25 AM, Jason Wang <jasowang@redhat.com> wrote:
>
>>
>> On 2017年05月17日 19:46, Sameeh Jubran wrote:
>>
>>> The bug was caused by the "receive overrun" (bit #6 of the ICR register)
>>> interrupt
>>> which would be triggered post migration in a heavy traffic environment.
>>> Even though the
>>> "receive overrun" bit (#6) is masked out by the IMS register (refer to
>>> the log below)
>>> the driver still receives an interrupt as the "receive overrun" bit (#6)
>>> causes the
>>> "Other" - bit #24 of the ICR register - bit to be set as documented
>>> below. The driver
>>> handles the interrupt and clears the "Other" bit (#24) but doesn't clear
>>> the
>>> "receive overrun" bit (#6) which leads to an infinite loop. Apparently
>>> the Windows
>>> driver expects that the "receive overrun" bit and other ones - documented
>>> below - to be
>>> cleared when the "Other" bit (#24) is cleared.
>>>
>>> So to sum that up:
>>> 1. Bit #6 of the ICR register is set by heavy traffic
>>> 2. As a results of setting bit #6, bit #24 is set
>>> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an
>>> interrupt for bit #6 as it is masked out by IMS)
>>> 4. The driver handles and clears the interrupt of bit #24
>>> 5. Bit #6 is still set.
>>> 6. 2 happens all over again
>>>
>>> The Interrupt Cause Read - ICR register:
>>>
>>> The ICR has the "Other" bit - bit #24 - that is set when one or more of
>>> the following
>>> ICR register's bits are set:
>>>
>>> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17,
>>> MNG - bit #18
>>>
>>> Log sample of the storm:
>>>
>>> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING:
>>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>>> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING:
>>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>>>
>>> This commit solves:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
>>>
>>> Signed-off-by: Sameeh Jubran <sjubran@redhat.com>
>>> ---
>>>    hw/net/e1000e_core.c | 7 +++++--
>>>    1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
>>> index 28c5be1..8174b53 100644
>>> --- a/hw/net/e1000e_core.c
>>> +++ b/hw/net/e1000e_core.c
>>> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index,
>>> uint32_t val)
>>>    static void
>>>    e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>>>    {
>>> +    uint32_t icr = 0;
>>>        if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>>>            (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>>>            trace_e1000e_irq_icr_process_iame();
>>>            e1000e_clear_ims_bits(core, core->mac[IAM]);
>>>        }
>>>    -    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] &
>>> ~val);
>>> -    core->mac[ICR] &= ~val;
>>> +    icr = core->mac[ICR] & ~val;
>>> +    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) :
>>> icr;
>>> +    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
>>> +    core->mac[ICR] = icr;
>>>        e1000e_update_interrupt_state(core);
>>>    }
>>>
>>>
>> Thanks for the patch.
>>
>> So this is an undocumented behavior, we must be careful on this. Several
>> question below:
>>
>> - have you verified this on real hardware?
>>
> No I haven't

Any chance to verify the behavior on real hardware?

>
>> - is MSIX enabled in this case?
>>
> Yes it is, I have tested the patch with msi disabled too.

Does the problem exist in no msi case. If not, is this better to hack 
EIAC behavior?

>
>> - according to the steps you've summed up above, it's not specific to
>> migration?
>>
> True

Then we probably need tweak the title.

Thanks.

>
>> Thanks
>>
>
>


Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
Posted by Sameeh Jubran 6 years, 11 months ago
On Mon, May 22, 2017 at 5:46 AM, Jason Wang <jasowang@redhat.com> wrote:

>
>
> On 2017年05月19日 22:04, Sameeh Jubran wrote:
>
>> On Fri, May 19, 2017 at 9:25 AM, Jason Wang <jasowang@redhat.com> wrote:
>>
>>
>>> On 2017年05月17日 19:46, Sameeh Jubran wrote:
>>>
>>> The bug was caused by the "receive overrun" (bit #6 of the ICR register)
>>>> interrupt
>>>> which would be triggered post migration in a heavy traffic environment.
>>>> Even though the
>>>> "receive overrun" bit (#6) is masked out by the IMS register (refer to
>>>> the log below)
>>>> the driver still receives an interrupt as the "receive overrun" bit (#6)
>>>> causes the
>>>> "Other" - bit #24 of the ICR register - bit to be set as documented
>>>> below. The driver
>>>> handles the interrupt and clears the "Other" bit (#24) but doesn't clear
>>>> the
>>>> "receive overrun" bit (#6) which leads to an infinite loop. Apparently
>>>> the Windows
>>>> driver expects that the "receive overrun" bit and other ones -
>>>> documented
>>>> below - to be
>>>> cleared when the "Other" bit (#24) is cleared.
>>>>
>>>> So to sum that up:
>>>> 1. Bit #6 of the ICR register is set by heavy traffic
>>>> 2. As a results of setting bit #6, bit #24 is set
>>>> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an
>>>> interrupt for bit #6 as it is masked out by IMS)
>>>> 4. The driver handles and clears the interrupt of bit #24
>>>> 5. Bit #6 is still set.
>>>> 6. 2 happens all over again
>>>>
>>>> The Interrupt Cause Read - ICR register:
>>>>
>>>> The ICR has the "Other" bit - bit #24 - that is set when one or more of
>>>> the following
>>>> ICR register's bits are set:
>>>>
>>>> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit
>>>> #17,
>>>> MNG - bit #18
>>>>
>>>> Log sample of the storm:
>>>>
>>>> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING:
>>>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>>>> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING:
>>>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>>>>
>>>> This commit solves:
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
>>>>
>>>> Signed-off-by: Sameeh Jubran <sjubran@redhat.com>
>>>> ---
>>>>    hw/net/e1000e_core.c | 7 +++++--
>>>>    1 file changed, 5 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
>>>> index 28c5be1..8174b53 100644
>>>> --- a/hw/net/e1000e_core.c
>>>> +++ b/hw/net/e1000e_core.c
>>>> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index,
>>>> uint32_t val)
>>>>    static void
>>>>    e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>>>>    {
>>>> +    uint32_t icr = 0;
>>>>        if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>>>>            (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>>>>            trace_e1000e_irq_icr_process_iame();
>>>>            e1000e_clear_ims_bits(core, core->mac[IAM]);
>>>>        }
>>>>    -    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR]
>>>> &
>>>> ~val);
>>>> -    core->mac[ICR] &= ~val;
>>>> +    icr = core->mac[ICR] & ~val;
>>>> +    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) :
>>>> icr;
>>>> +    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
>>>> +    core->mac[ICR] = icr;
>>>>        e1000e_update_interrupt_state(core);
>>>>    }
>>>>
>>>>
>>>> Thanks for the patch.
>>>
>>> So this is an undocumented behavior, we must be careful on this. Several
>>> question below:
>>>
>>> - have you verified this on real hardware?
>>>
>>> No I haven't
>>
>
> Any chance to verify the behavior on real hardware?

This can be very challenging to reproduce on a real hardware as the Windows
source code isn't available and the bug doesn't
reproduce with the Linux driver; Moreover, the Linux driver configures the
device differently than the Windows driver.

>
>
>
>> - is MSIX enabled in this case?
>>>
>>> Yes it is, I have tested the patch with msi disabled too.
>>
>
> Does the problem exist in no msi case. If not, is this better to hack EIAC
> behavior?

Even though it doesn't reproduce in no msi case, this certainly can occur
as the ICR register is present in both msi and
no msi cases while the EIAC register is for MSI only.

>
>> - according to the steps you've summed up above, it's not specific to
>>> migration?
>>>
>>> True
>>
>
> Then we probably need tweak the title.
>
No problem, I'll send a v2 soon.

> Thanks.
>
>
>> Thanks
>>>
>>>
>>
>>
>


-- 
Respectfully,
*Sameeh Jubran*
*Linkedin <https://il.linkedin.com/pub/sameeh-jubran/87/747/a8a>*
*Software Engineer @ Daynix <http://www.daynix.com>.*