[Qemu-devel] [PATCH 0/2] e1000: Correct TX offload context handling

Ed Swierk via Qemu-devel posted 2 patches 6 years, 5 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/1510701814-52973-1-git-send-email-eswierk@skyportsystems.com
Test checkpatch passed
Test docker passed
Test ppc passed
Test s390x passed
hw/net/e1000.c         | 92 ++++++++++++++++++++++++++++----------------------
hw/net/e1000e.c        |  4 +--
hw/net/e1000e_core.c   | 16 ++++-----
hw/net/e1000e_core.h   |  2 ++
hw/net/e1000x_common.h |  2 --
5 files changed, 64 insertions(+), 52 deletions(-)
[Qemu-devel] [PATCH 0/2] e1000: Correct TX offload context handling
Posted by Ed Swierk via Qemu-devel 6 years, 5 months ago
The transmit offload implementation in QEMU's e1000 device is
deficient and causes packet data corruption in some situations.

According to the Intel 8254x software developer's manual[1], the
device maintains two separate contexts: the TCP segmentation offload
context includes parameters for both segmentation offload and checksum
offload, and the normal (checksum-offload-only) context includes only
checksum offload parameters. These parameters specify over which
packet data to compute the checksum, and where in the packet to store
the computed checksum(s).

[1] https://www.intel.com/content/dam/doc/manual/pci-pci-x-family-gbe-controllers-software-dev-manual.pdf

The e1000 driver can update either of these contexts by sending a
transmit context descriptor. The TSE bit in the TUCMD field controls
which context is modified by the descriptor. Crucially, a transmit
context descriptor with TSE=1 changes only the TSO context, leaving
the non-TSO context unchanged; with TSE=0 the opposite is true.

Fields in the transmit data descriptor determine which (if either) of
these two contexts the device uses when actually transmitting some
data:

- If the TSE bit in the DCMD field is set, then the device performs
  TCP segmentation offload using the parameters previously set in the
  TSO context. In addition, if TXSM and/or IXSM is set in the POPTS
  field, the device performs the appropriate checksum offloads using
  the parameters in the same (TSO) context.

- Otherwise, if the TSE bit in the DCMD field is clear, then there is
  no TCP segmentation offload. If TXSM and/or IXSM is set in the POPTS
  field, the device performs the appropriate checksum offloads using
  the parameters in the non-TSO context.

The e1000 driver is free to set up the TSO and non-TSO contexts and
then transmit a mixture of data, with each data descriptor using a
different (or neither) context. This is what the e1000 driver for
Windows (Intel(R) PRO/1000 MT Network Connection, aka E1G6023E.sys)
does in certain cases. Sometimes with quite undesirable results, since
the QEMU e1000 device doesn't work as described above.

Instead, the QEMU e1000 device maintains only one context in its state
structure. When it receives a transmit context descriptor from the
driver, it overwrites the context parameters regardless of the TSE bit
in the TUCMD field.

To see why this is wrong, suppose the driver first sets up a non-TSO
context with UDP checksum offload parameters (say, TUCSO pointing to
the appropriate offset for a UDP checksum, 6 bytes into the header),
and then sets up a TSO context with TCP checksum offload parameters
(TUCSO pointing to the appropriate offset for a TCP checksum, 16 bytes
into the header). The driver then sends a transmit data descriptor
with TSO=0 and TXSM=1 along with a UDP datagram. The QEMU e1000 device
computes the checksum using the last set of checksum offload
parameters, and writes the checksum to offset 16, stomping on two
bytes of UDP data, and leaving the wrong checksum in the UDP checksum
field.

To make matters worse, if the network stack on the host running QEMU
treats data transmitted from a VM as locally originated, it may do its
own UDP checksum computation, "correcting" it to match the corrupt
data before sending it on the wire. Now the corrupt UDP packet makes
its way all the way to the peer application.

I have reproduced this behavior with a Windows 10 guest, rather easily
with a TCP iperf and a UDP iperf running in parallel. With the
patchlet below, you'll see an error message whenever the bug is
triggered.


 --- a/hw/net/e1000.c
 +++ b/hw/net/e1000.c
 @@ -534,6 +534,30 @@ e1000_send_packet(E1000State *s, const uint8_t *buf, int size)
  }
  
  static void
 +debug_csum(struct e1000_tx *tp, uint16_t oldsum)
 +{
 +    e1000x_txd_props *props = &tp->props;
 +    uint8_t proto = tp->data[14 + 9];
 +    uint32_t sumoff = props->tucso - props->tucss;
 +
 +    if ((proto == 17 && sumoff != 6) ||
 +        (proto == 6 && sumoff != 16)) {
 +        DBGOUT(TXERR, "txsum bug! ver %d src %08x dst %08x len %d proto %d "
 +               "cptse %d sum_needed %x oldsum %x newsum %x realsum %x\n",
 +               tp->data[14] >> 4,
 +               ldl_be_p(tp->data + 14 + 12),
 +               ldl_be_p(tp->data + 14 + 16),
 +               lduw_be_p(tp->data + 14 + 2),
 +               proto,
 +               props->cptse,
 +               props->sum_needed,
 +               oldsum,
 +               lduw_be_p(tp->data + props->tucso),
 +               lduw_be_p(tp->data + props->tucss + (proto == 6 ? 16 : 6)));
 +    }
 +}
 +
 +static void
  xmit_seg(E1000State *s)
  {
      uint16_t len;
 @@ -577,8 +601,10 @@ xmit_seg(E1000State *s)
      }
  
      if (tp->props.sum_needed & E1000_TXD_POPTS_TXSM) {
 +        uint16_t oldsum = lduw_be_p(tp->data + tp->props.tucso);
          putsum(tp->data, tp->size, tp->props.tucso,
                 tp->props.tucss, tp->props.tucse);
 +        debug_csum(tp, oldsum); /* FIXME: remove */
      }
      if (tp->props.sum_needed & E1000_TXD_POPTS_IXSM) {
          putsum(tp->data, tp->size, tp->props.ipcso,


This patch series first moves per-TX packet flags out of the context
struct, which is used by both e1000 and e1000e. ("Used" is generous,
as e1000e ignores most of the context parameters supplied by the guest
and does its own thing, which is only sometimes correct. Taming e1000e
is a project for another day. The change to e1000e here is a baby step
in that direction.)

Then we separate the e1000 TSO and non-TSO contexts, which fixes the
UDP TX corruption bug.

Ed Swierk (2):
  e1000, e1000e: Move per-packet TX offload flags out of context state
  e1000: Separate TSO and non-TSO contexts, fixing UDP TX corruption

 hw/net/e1000.c         | 92 ++++++++++++++++++++++++++++----------------------
 hw/net/e1000e.c        |  4 +--
 hw/net/e1000e_core.c   | 16 ++++-----
 hw/net/e1000e_core.h   |  2 ++
 hw/net/e1000x_common.h |  2 --
 5 files changed, 64 insertions(+), 52 deletions(-)

-- 
1.9.1


Re: [Qemu-devel] [PATCH 0/2] e1000: Correct TX offload context handling
Posted by Jason Wang 6 years, 5 months ago

On 2017年11月15日 07:23, Ed Swierk via Qemu-devel wrote:
> The transmit offload implementation in QEMU's e1000 device is
> deficient and causes packet data corruption in some situations.
>
> According to the Intel 8254x software developer's manual[1], the
> device maintains two separate contexts: the TCP segmentation offload
> context includes parameters for both segmentation offload and checksum
> offload, and the normal (checksum-offload-only) context includes only
> checksum offload parameters. These parameters specify over which
> packet data to compute the checksum, and where in the packet to store
> the computed checksum(s).
>
> [1] https://www.intel.com/content/dam/doc/manual/pci-pci-x-family-gbe-controllers-software-dev-manual.pdf
>
> The e1000 driver can update either of these contexts by sending a
> transmit context descriptor. The TSE bit in the TUCMD field controls
> which context is modified by the descriptor. Crucially, a transmit
> context descriptor with TSE=1 changes only the TSO context, leaving
> the non-TSO context unchanged; with TSE=0 the opposite is true.
>
> Fields in the transmit data descriptor determine which (if either) of
> these two contexts the device uses when actually transmitting some
> data:
>
> - If the TSE bit in the DCMD field is set, then the device performs
>    TCP segmentation offload using the parameters previously set in the
>    TSO context. In addition, if TXSM and/or IXSM is set in the POPTS
>    field, the device performs the appropriate checksum offloads using
>    the parameters in the same (TSO) context.
>
> - Otherwise, if the TSE bit in the DCMD field is clear, then there is
>    no TCP segmentation offload. If TXSM and/or IXSM is set in the POPTS
>    field, the device performs the appropriate checksum offloads using
>    the parameters in the non-TSO context.
>
> The e1000 driver is free to set up the TSO and non-TSO contexts and
> then transmit a mixture of data, with each data descriptor using a
> different (or neither) context. This is what the e1000 driver for
> Windows (Intel(R) PRO/1000 MT Network Connection, aka E1G6023E.sys)
> does in certain cases. Sometimes with quite undesirable results, since
> the QEMU e1000 device doesn't work as described above.
>
> Instead, the QEMU e1000 device maintains only one context in its state
> structure. When it receives a transmit context descriptor from the
> driver, it overwrites the context parameters regardless of the TSE bit
> in the TUCMD field.
>
> To see why this is wrong, suppose the driver first sets up a non-TSO
> context with UDP checksum offload parameters (say, TUCSO pointing to
> the appropriate offset for a UDP checksum, 6 bytes into the header),
> and then sets up a TSO context with TCP checksum offload parameters
> (TUCSO pointing to the appropriate offset for a TCP checksum, 16 bytes
> into the header). The driver then sends a transmit data descriptor
> with TSO=0 and TXSM=1 along with a UDP datagram. The QEMU e1000 device
> computes the checksum using the last set of checksum offload
> parameters, and writes the checksum to offset 16, stomping on two
> bytes of UDP data, and leaving the wrong checksum in the UDP checksum
> field.
>
> To make matters worse, if the network stack on the host running QEMU
> treats data transmitted from a VM as locally originated, it may do its
> own UDP checksum computation, "correcting" it to match the corrupt
> data before sending it on the wire. Now the corrupt UDP packet makes
> its way all the way to the peer application.
>
> I have reproduced this behavior with a Windows 10 guest, rather easily
> with a TCP iperf and a UDP iperf running in parallel. With the
> patchlet below, you'll see an error message whenever the bug is
> triggered.
>
>
>   --- a/hw/net/e1000.c
>   +++ b/hw/net/e1000.c
>   @@ -534,6 +534,30 @@ e1000_send_packet(E1000State *s, const uint8_t *buf, int size)
>    }
>    
>    static void
>   +debug_csum(struct e1000_tx *tp, uint16_t oldsum)
>   +{
>   +    e1000x_txd_props *props = &tp->props;
>   +    uint8_t proto = tp->data[14 + 9];
>   +    uint32_t sumoff = props->tucso - props->tucss;
>   +
>   +    if ((proto == 17 && sumoff != 6) ||
>   +        (proto == 6 && sumoff != 16)) {
>   +        DBGOUT(TXERR, "txsum bug! ver %d src %08x dst %08x len %d proto %d "
>   +               "cptse %d sum_needed %x oldsum %x newsum %x realsum %x\n",
>   +               tp->data[14] >> 4,
>   +               ldl_be_p(tp->data + 14 + 12),
>   +               ldl_be_p(tp->data + 14 + 16),
>   +               lduw_be_p(tp->data + 14 + 2),
>   +               proto,
>   +               props->cptse,
>   +               props->sum_needed,
>   +               oldsum,
>   +               lduw_be_p(tp->data + props->tucso),
>   +               lduw_be_p(tp->data + props->tucss + (proto == 6 ? 16 : 6)));
>   +    }
>   +}
>   +
>   +static void
>    xmit_seg(E1000State *s)
>    {
>        uint16_t len;
>   @@ -577,8 +601,10 @@ xmit_seg(E1000State *s)
>        }
>    
>        if (tp->props.sum_needed & E1000_TXD_POPTS_TXSM) {
>   +        uint16_t oldsum = lduw_be_p(tp->data + tp->props.tucso);
>            putsum(tp->data, tp->size, tp->props.tucso,
>                   tp->props.tucss, tp->props.tucse);
>   +        debug_csum(tp, oldsum); /* FIXME: remove */
>        }
>        if (tp->props.sum_needed & E1000_TXD_POPTS_IXSM) {
>            putsum(tp->data, tp->size, tp->props.ipcso,
>
>
> This patch series first moves per-TX packet flags out of the context
> struct, which is used by both e1000 and e1000e. ("Used" is generous,
> as e1000e ignores most of the context parameters supplied by the guest
> and does its own thing, which is only sometimes correct. Taming e1000e
> is a project for another day. The change to e1000e here is a baby step
> in that direction.)
>
> Then we separate the e1000 TSO and non-TSO contexts, which fixes the
> UDP TX corruption bug.
>
> Ed Swierk (2):
>    e1000, e1000e: Move per-packet TX offload flags out of context state
>    e1000: Separate TSO and non-TSO contexts, fixing UDP TX corruption
>
>   hw/net/e1000.c         | 92 ++++++++++++++++++++++++++++----------------------
>   hw/net/e1000e.c        |  4 +--
>   hw/net/e1000e_core.c   | 16 ++++-----
>   hw/net/e1000e_core.h   |  2 ++
>   hw/net/e1000x_common.h |  2 --
>   5 files changed, 64 insertions(+), 52 deletions(-)
>

Consider the risk for rc1, I've queued this for 2.12.

Thanks