* netif.h clarifications
From: Roger Pau Monné @ 2016-05-19 16:27 UTC
  To: xen-devel; +Cc: Paul Durrant, Wei Liu, David Vrabel

Hello,

While trying to solve a FreeBSD netfront bug [0] I came across a couple 
of dark spots in netif.h that I think should be documented in the header 
itself. I'm willing to make those changes, but first I want to make sure 
my understanding is right.

Regarding checksum offloading, I had a hard time figuring out what the 
different flags actually mean:

/* Packet data has been validated against protocol checksum. */
#define _NETRXF_data_validated (0)
#define  NETRXF_data_validated (1U<<_NETRXF_data_validated)

/* Protocol checksum field is blank in the packet (hardware offload)? */
#define _NETRXF_csum_blank     (1)
#define  NETRXF_csum_blank     (1U<<_NETRXF_csum_blank)

(The same applies to the TX flags; I'm not copying them here because they 
are identical.)

First of all, I assume "protocol" here refers to the Layer 3 and Layer 4 
protocols, so that would be IP and TCP/UDP/SCTP checksum offloading? In any 
case this needs clarification and proper wording.

Then, I have some questions regarding the meaning of the flags themselves 
and the content of the checksum field in all the possible scenarios.

On RX path:

 - NETRXF_data_validated only: data has been validated, but what's the state 
   of the checksum field itself? If the data were validated again, would it 
   match the checksum?
 - NETRXF_csum_blank only: I don't think this makes much sense; the data is 
   in an unknown state and the checksum is not present, so there's no way to 
   validate it. Should the packet be dropped?
 - NETRXF_data_validated | NETRXF_csum_blank: this is the combination that 
   makes the most sense to me: the data is valid, but the checksum is not 
   there. This matches what some real NICs already do, which is to provide 
   the result of the checksum check _without_ actually providing the 
   checksum itself on the RX path. (See the sketch after this list for how 
   I would map these combinations in a frontend.)
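
For what it's worth, this is roughly how I would map those combinations onto 
FreeBSD mbuf state in netfront. A sketch only; the helper function is made 
up:

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <xen/interface/io/netif.h>   /* NETRXF_* flags */

    /* Hypothetical helper: translate netif RX flags into mbuf state. */
    static void
    xn_rx_csum(struct mbuf *m, uint16_t flags)
    {
            if (flags & NETRXF_data_validated) {
                    /* The backend vouches for the payload, so mark it as
                     * already verified and let the stack skip the check. */
                    m->m_pkthdr.csum_flags |= CSUM_DATA_VALID | CSUM_PSEUDO_HDR;
                    m->m_pkthdr.csum_data = 0xffff;
            }
            /* csum_blank without data_validated leaves nothing to verify
             * against; arguably such a packet should simply be dropped. */
    }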

On TX path:

 - NETTXF_data_validated only: I don't think this makes any sense; the data 
   is always valid from the sender's point of view.
 - NETTXF_csum_blank only: checksum calculation offload; it should be 
   performed by the other end.
 - NETTXF_data_validated | NETTXF_csum_blank: again, I don't think this 
   makes much sense; the data is always valid from the sender's point of 
   view, or else why bother sending it?

So it looks to me like we could get away with just two flags: one on the RX 
side that signals that the packet doesn't carry a checksum but that checksum 
validation has already been performed, and another one on the TX side to 
signal that the packet doesn't have a calculated checksum (typical checksum 
offload).
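
In the style of the existing definitions, the strawman would be something 
like this (names are just placeholders):

    /* RX: packet carries no checksum, but validation was already performed. */
    #define _NETRXF_csum_validated (0)
    #define  NETRXF_csum_validated (1U<<_NETRXF_csum_validated)

    /* TX: checksum field is blank; calculation is offloaded to the peer. */
    #define _NETTXF_csum_offload   (0)
    #define  NETTXF_csum_offload   (1U<<_NETTXF_csum_offload)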

And then I've also seen some issues with TSO/LRO (GSO in Linux terminology) 
when using packet forwarding inside a FreeBSD DomU, for example in the 
following scenario:

                                   +
                                   |
   +---------+           +--------------------+           +----------+
   |         |A         B|       router       |C         D|          |
   | Guest 1 +-----------+         +          +-----------+ Guest 2  |
   |         |  bridge0  |         |          |  bridge1  |          |
   +---------+           +--------------------+           +----------+
   172.16.1.67          172.16.1.66|   10.0.1.1           10.0.1.2
                                   |
             +--------------------------------------------->
              ssh 10.0.1.2         |
                                   |
                                   |
                                   |
                                   +

All those VMs are inside the same host, and one of them acts as a gateway 
between the other two because they are on two different subnets. In this 
case I'm seeing issues: even though I disable TSO/LRO on the "router" at 
runtime, the backend doesn't watch the xenstore feature flag, and never 
disables it on the vif attached to the Dom0 bridge. This causes large, 
unsegmented LRO packets to be received at point 'C', and then when the 
gateway tries to inject them into the other NIC it fails, because their 
size is greater than the MTU and the "don't fragment" (DF) bit is set.

How does Linux deal with this situation? Does it simply ignore the DF flag 
and fragment the packet? Or does it inject the packet into the other end, 
ignoring the MTU and propagating the GSO flag?

Roger.

[0] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=188261


* Re: netif.h clarifications
From: Paul Durrant @ 2016-05-20  9:09 UTC
  To: Roger Pau Monne, xen-devel@lists.xenproject.org; +Cc: Wei Liu, David Vrabel

> -----Original Message-----
> From: Roger Pau Monné [mailto:roger.pau@citrix.com]
> Sent: 19 May 2016 17:28
> To: xen-devel@lists.xenproject.org
> Cc: Wei Liu; David Vrabel; Paul Durrant
> Subject: netif.h clarifications
> 
> Hello,
> 
> While trying to solve a FreeBSD netfront bug [0] I came across a couple
> of netif.h dark spots that I think should be documented in the netif.h
> header. I'm willing to make those changes, but I want to make sure my
> understanding is right.
> 
> Regarding checksum offloading, I had a hard time figuring out what the
> different flags actually mean:
> 
> /* Packet data has been validated against protocol checksum. */
> #define _NETRXF_data_validated (0)
> #define  NETRXF_data_validated (1U<<_NETRXF_data_validated)
> 
> /* Protocol checksum field is blank in the packet (hardware offload)? */
> #define _NETRXF_csum_blank     (1)
> #define  NETRXF_csum_blank     (1U<<_NETRXF_csum_blank)
> 
> (Same applies to the TX flags, I'm not copying them there because they are
> the same)
> 
> First of all, I assume "protocol" here refers to Layer 3 and Layer 4
> protocol, so that would be IP and TCP/UDP/SCTP checksum offloading? In any
> case this needs clarification and proper wording.
> 
> Then, I have some questions regarding the meaning of the flags themselves
> and the content of the checksum field in all the possible scenarios.
> 
> On RX path:
> 
>  - NETRXF_data_validated only: data has been validated, but what's the state
>    of the checksum field itself? If the data is validated again, would it
>    match against the checksum?
>  - NETRXF_csum_blank only: I don't think this makes much sense, data is in
>    unknown state and checksum is not present, so there's no way to validate
>    it. Packet should be dropped?

Yes, in practice it's not used on its own. As you say, I don't think it makes any sense.

>  - NETRXF_data_validated | NETRXF_csum_blank: this combination seems to be
>    the one that makes more sense to me, data is valid, but checksum is not
>    there. This matches what some real NICs already do, that is to provide
>    the result of the checksum check _without_ actually providing the
>    checksum itself on the RX path.
> 

In Linux netback this is set if the checksum info is partial, which I take to mean that the packet has a valid pseudo-header checksum. I think packets coming from NICs are more likely to be 'checksum unnecessary', which results in NETRXF_data_validated only; I take that to mean that the checksum has been verified but may have been trashed in the process.

> On TX path:
> 
>  - NETTXF_data_validated only: I don't think this makes any sense, data is
>    always valid from the senders point of view.
>  - NETTXF_csum_blank only: checksum calculation offload, it should be
>    performed by the other end.
>  - NETTXF_data_validated | NETTXF_csum_blank: again, I don't think it makes
>    much sense, data is always valid from the senders point of view, or else
>    why bother sending it?
> 

In Linux netback, the code goes (comments mine):

		if (txp->flags & XEN_NETTXF_csum_blank)
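			/* Pseudo-header csum present; real csum still to be completed. */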
			skb->ip_summed = CHECKSUM_PARTIAL;
		else if (txp->flags & XEN_NETTXF_data_validated)
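			/* Checksum already verified; no need to check it again. */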
			skb->ip_summed = CHECKSUM_UNNECESSARY;

So, csum_blank (with or without data_validated) means that it's assumed that the packet contains a valid pseudo-header checksum. If csum_blank is not set, then data_validated means that the data is good but the checksum is in an unknown state, which is ok if the packet is then forwarded to another vif. I assume 'unnecessary' is ignored by NIC drivers on their TX side (I guess they would only be interested in 'partial').
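
To make 'partial' concrete: as I understand it the TCP checksum field 
already holds the pseudo-header sum, so completing the checksum just means 
folding the rest of the segment into it. A sketch of mine (not code lifted 
from netback):

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative RFC 1071 checksum completion. 'tcp' points at the TCP
     * header, whose checksum field is pre-seeded with the pseudo-header
     * sum; 'len' covers the header plus payload. */
    static uint16_t finish_partial_csum(const uint8_t *tcp, size_t len)
    {
            uint32_t sum = 0;
            size_t i;

            for (i = 0; i + 1 < len; i += 2)        /* 16-bit big-endian words */
                    sum += ((uint32_t)tcp[i] << 8) | tcp[i + 1];
            if (len & 1)                            /* pad an odd trailing byte */
                    sum += (uint32_t)tcp[len - 1] << 8;
            while (sum >> 16)                       /* fold the carries */
                    sum = (sum & 0xffff) + (sum >> 16);
            return (uint16_t)~sum;                  /* value for the checksum field */
    }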

> So it looks to me like we could get away with just two flags, one on the RX
> side that signals that the packet doesn't have a checksum but that the
> checksum validation has already been performed, and another one on the TX
> side to signal that the packet doesn't have a calculated checksum
> (typical checksum offload).
> 

On the TX side it would be useful to have flags which indicate:

- Full checksum present
- Pseudo-header checksum present
- State of checksum is unknown

On the RX side it would be useful to have (a strawman encoding of both sets follows the list):

- Data validated, checksum state unknown
- Data validated, checksum correct
- Data not validated
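
Purely as a strawman (names and bit assignments are illustrative, not a 
proposal for netif.h as-is), the encoding might look like:

    /* TX: exactly one of these would be set by the frontend. */
    #define NETTXF_csum_full      (1U<<0) /* full checksum present */
    #define NETTXF_csum_pseudo    (1U<<1) /* pseudo-header checksum present */
    #define NETTXF_csum_unknown   (1U<<2) /* state of checksum is unknown */

    /* RX: neither bit set means the data was not validated. */
    #define NETRXF_validated      (1U<<0) /* data validated, csum state unknown */
    #define NETRXF_csum_correct   (1U<<1) /* data validated, csum correct */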

> And then I've also seen some issues with TSO/LRO (GSO in Linux terminology)
> when using packet forwarding inside of a FreeBSD DomU. For example in the
> following scenario:
> 
>                                    +
>                                    |
>    +---------+           +--------------------+           +----------+
>    |         |A         B|       router       |C         D|          |
>    | Guest 1 +-----------+         +          +-----------+ Guest 2  |
>    |         |  bridge0  |         |          |  bridge1  |          |
>    +---------+           +--------------------+           +----------+
>    172.16.1.67          172.16.1.66|   10.0.1.1           10.0.1.2
>                                    |
>              +--------------------------------------------->
>               ssh 10.0.1.2         |
>                                    |
>                                    |
>                                    |
>                                    +
> 
> All those VMs are inside of the same host, and one of them acts as a gateway
> between them because they are on two different subnets. In this case I'm
> seeing issues because even though I disable TSO/LRO on the "router" at
> runtime, the backend doesn't watch the xenstore feature flag, and never
> disables it from the vif on the Dom0 bridge. This causes LRO packets
> (non-fragmented) to be received at point 'C', and then when the gateway
> tries to inject them into the other NIC it fails because the size is greater
> than the MTU, and the "no fragment" bit is set.
> 

Yes, GSO cannot be disabled/enabled dynamically on the netback tx side (i.e. guest rx side) so you can't turn it off. The Windows PV driver leaves it on all the time and does the fragmentation itself if the stack doesn't want GRO. Doing the fragmentation in the frontend makes more sense anyway, since the cpu cycles are burned by the VM rather than dom0 and so it scales better.
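
In outline that frontend-side segmentation is just this (a sketch of the 
idea only; struct pkt and all the helpers are made up, this is not the 
Windows driver's actual code):

    #include <stddef.h>

    struct pkt;                             /* hypothetical mbuf/skb-like type */
    size_t pkt_payload_len(const struct pkt *p);
    struct pkt *pkt_clone_headers(const struct pkt *p);   /* copy IP+TCP headers */
    void pkt_copy_payload(struct pkt *s, const struct pkt *p, size_t off, size_t n);
    void pkt_fixup(struct pkt *s, size_t off, size_t n);  /* seq += off, lengths, csum */
    void pkt_transmit(struct pkt *s);

    /* Split one large TCP segment into MSS-sized segments in software. */
    static void gso_segment_sw(const struct pkt *large, size_t mss)
    {
            size_t off, total = pkt_payload_len(large);

            for (off = 0; off < total; off += mss) {
                    size_t n = (total - off > mss) ? mss : total - off;
                    struct pkt *seg = pkt_clone_headers(large);

                    pkt_copy_payload(seg, large, off, n);
                    pkt_fixup(seg, off, n);
                    pkt_transmit(seg);
            }
    }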

> How does Linux deal with this situation? Does it simply ignore the no
> fragment flag and fragments the packet? Does it simply inject the packet to
> the other end ignoring the MTU and propagating the GSO flag?
> 

I've not looked at the netfront rx code but I assume that the large packet that is passed from netback is just marked as GSO and makes its way to wherever it's going (being fragmented by the stack if it's forwarded to an interface that doesn't have the TSO flag set).

  Paul

> Roger.
> 
> [0] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=188261


* Re: netif.h clarifications
From: Roger Pau Monne @ 2016-05-20 10:46 UTC
  To: Paul Durrant; +Cc: xen-devel@lists.xenproject.org, Wei Liu, David Vrabel

On Fri, May 20, 2016 at 10:09:43AM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Roger Pau Monné [mailto:roger.pau@citrix.com]
> > Sent: 19 May 2016 17:28
> > To: xen-devel@lists.xenproject.org
> > Cc: Wei Liu; David Vrabel; Paul Durrant
> > Subject: netif.h clarifications
> > 
> > Hello,
> > 
> > While trying to solve a FreeBSD netfront bug [0] I came across a couple
> > of netif.h dark spots that I think should be documented in the netif.h
> > header. I'm willing to make those changes, but I want to make sure my
> > understanding is right.
> > 
> > Regarding checksum offloading, I had a hard time figuring out what the
> > different flags actually mean:
> > 
> > /* Packet data has been validated against protocol checksum. */
> > #define _NETRXF_data_validated (0)
> > #define  NETRXF_data_validated (1U<<_NETRXF_data_validated)
> > 
> > /* Protocol checksum field is blank in the packet (hardware offload)? */
> > #define _NETRXF_csum_blank     (1)
> > #define  NETRXF_csum_blank     (1U<<_NETRXF_csum_blank)
> > 
> > (Same applies to the TX flags, I'm not copying them there because they are
> > the same)
> > 
> > First of all, I assume "protocol" here refers to Layer 3 and Layer 4
> > protocol, so that would be IP and TCP/UDP/SCTP checksum offloading? In any
> > case this needs clarification and proper wording.
> > 
> > Then, I have some questions regarding the meaning of the flags themselves
> > and the content of the checksum field in all the possible scenarios.
> > 
> > On RX path:
> > 
> >  - NETRXF_data_validated only: data has been validated, but what's the state
> >    of the checksum field itself? If the data is validated again, would it
> >    match against the checksum?
> >  - NETRXF_csum_blank only: I don't think this makes much sense, data is in
> >    unknown state and checksum is not present, so there's no way to validate
> >    it. Packet should be dropped?
> 
> Yes, in practice it's not used on its own. As you say, I don't think it makes any sense.
> 
> >  - NETRXF_data_validated | NETRXF_csum_blank: this combination seems to be
> >    the one that makes more sense to me, data is valid, but checksum is not
> >    there. This matches what some real NICs already do, that is to provide
> >    the result of the checksum check _without_ actually providing the
> >    checksum itself on the RX path.
> > 
> 
> In Linux netback this is set if the checksum info is partial, which I take to mean that the packet has a valid pseudo-header checksum. I think packets coming from NICs are more likely to be 'checksum unnecessary' which results in NETRXF_data_validated only, which I take to mean that the checksum has been verified but may have been trashed in the process.

Hm, if the checksum might have been trashed, I think NETRXF_csum_blank 
should be set; in the absence of NETRXF_csum_blank I would like to be able 
to assume that the checksum is there and hasn't been trashed.

> 
> > On TX path:
> > 
> >  - NETTXF_data_validated only: I don't think this makes any sense, data is
> >    always valid from the senders point of view.
> >  - NETTXF_csum_blank only: checksum calculation offload, it should be
> >    performed by the other end.
> >  - NETTXF_data_validated | NETTXF_csum_blank: again, I don't think it makes
> >    much sense, data is always valid from the senders point of view, or else
> >    why bother sending it?
> > 
> 
> In Linux netback, the code goes:
> 
> 		if (txp->flags & XEN_NETTXF_csum_blank)
> 			skb->ip_summed = CHECKSUM_PARTIAL;
> 		else if (txp->flags & XEN_NETTXF_data_validated)
> 			skb->ip_summed = CHECKSUM_UNNECESSARY;
> 
> So, csum_blank with or without data_validated means that it's assumed that the packet contains a valid pseudo-header checksum, but if csum_blank is not set then data_validated means that the data is good but the checksum is in an unknown state which is ok if the packet is then forwarded to another vif, and I assume 'unnecessary' is ignored by NIC drivers on their TX side (I guess they would only be interested in 'partial').

OK, so csum_blank means there's a pseudo-header checksum, and 
data_validated alone means there's no usable checksum at all? Sorry, this 
is all very confusing.

> > So it looks to me like we could get away with just two flags, one on the RX
> > side that signals that the packet doesn't have a checksum but that the
> > checksum validation has already been performed, and another one on the TX
> > side to signal that the packet doesn't have a calculated checksum
> > (typical checksum offload).
> > 
> 
> On the TX side it would be useful to have flags which indicate:
> 
> - Full checksum present
> - Pseudo-header checksum present
> - State of checksum is unknown
> 
> On the RX side it would be useful to have
> 
> - Data validated, checksum state unknown
> - Data validated, checksum correct
> - Data not validated

+1 clearly.

> > And then I've also seen some issues with TSO/LRO (GSO in Linux
> > terminology) when using packet forwarding inside of a FreeBSD DomU. For
> > example in the following scenario:
> > 
> >                                    +
> >                                    |
> >    +---------+           +--------------------+           +----------+
> >    |         |A         B|       router       |C         D|          |
> >    | Guest 1 +-----------+         +          +-----------+ Guest 2  |
> >    |         |  bridge0  |         |          |  bridge1  |          |
> >    +---------+           +--------------------+           +----------+
> >    172.16.1.67          172.16.1.66|   10.0.1.1           10.0.1.2
> >                                    |
> >              +--------------------------------------------->
> >               ssh 10.0.1.2         |
> >                                    |
> >                                    |
> >                                    |
> >                                    +
> > 
> > All those VMs are inside of the same host, and one of them acts as a gateway
> > between them because they are on two different subnets. In this case I'm
> > seeing issues because even though I disable TSO/LRO on the "router" at
> > runtime, the backend doesn't watch the xenstore feature flag, and never
> > disables it from the vif on the Dom0 bridge. This causes LRO packets
> > (non-fragmented) to be received at point 'C', and then when the gateway
> > tries to inject them into the other NIC it fails because the size is greater
> > than the MTU, and the "no fragment" bit is set.
> > 
> 
> Yes, GSO cannot be disabled/enabled dynamically on the netback tx side (i.e. guest rx side) so you can't turn it off. The Windows PV driver leaves it on all the time and does the fragmentation itself if the stack doesn't want GRO. Doing the fragmentation in the frontend makes more sense anyway since the cpu cycles are burned by the VM rather than dom0 and so it scales better.

The weird thing is that GSO can usually be dynamically enabled/disabled on 
all network cards, so it would make sense to allow netfront to do the same. 
I guess the only way is to reset the netfront/netback connection when 
changing this property.

> > How does Linux deal with this situation? Does it simply ignore the no
> > fragment flag and fragments the packet? Does it simply inject the packet to
> > the other end ignoring the MTU and propagating the GSO flag?
> > 
> 
> I've not looked at the netfront rx code but I assume that the large packet that is passed from netback is just marked as GSO and makes its way to wherever it's going (being fragmented by the stack if it's forwarded to an interface that doesn't have the TSO flag set).

But it cannot be fragmented if it has the IP "don't fragment" flag set. 

What I'm seeing here is that at point C netback passes GSO packets to the 
"router" VM; these packets have not been fragmented, and then when the 
router VM tries to forward them to point B it has to issue a "need 
fragmentation" ICMP message, because the MTU of the interface is 1500 and 
the IP header has the "don't fragment" bit set (and of course the GSO chain 
is bigger than 1500).

Is Linux ignoring the "don't fragment" IP flag here and simply fragmenting 
it? Because AFAICT I don't see any option in Linux netfront to prevent 
advertising "feature-gso-tcpv4", so netback will always inject GSO packets 
on the netfront RX path if possible.
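
If the backend did want to honour runtime changes, I would imagine 
something along these lines on the netback side (hypothetical sketch; 
struct vif stands in for the real per-interface state, and the node is 
read from the frontend's xenstore area):

    #include <xen/xenbus.h>

    struct vif { int gso_enabled; };        /* stand-in for the real state */

    /* Hypothetical: re-read feature-gso-tcpv4 when the node changes,
     * instead of sampling it only once at connect time. */
    static void gso_flag_changed(struct xenbus_device *dev, struct vif *vif)
    {
            int val;

            if (xenbus_scanf(XBT_NIL, dev->otherend,
                             "feature-gso-tcpv4", "%d", &val) != 1)
                    val = 0;                /* node absent: assume no GSO */
            vif->gso_enabled = !!val;
    }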

Roger.


* Re: netif.h clarifications
From: Paul Durrant @ 2016-05-20 11:55 UTC
  To: Roger Pau Monne; +Cc: xen-devel@lists.xenproject.org, Wei Liu, David Vrabel

> -----Original Message-----
[snip]
> > > And then I've also seen some issues with TSO/LRO (GSO in Linux
> > > terminology) when using packet forwarding inside of a FreeBSD DomU.
> > > For example in the following scenario:
> > >
> > >                                    +
> > >                                    |
> > >    +---------+           +--------------------+           +----------+
> > >    |         |A         B|       router       |C         D|          |
> > >    | Guest 1 +-----------+         +          +-----------+ Guest 2  |
> > >    |         |  bridge0  |         |          |  bridge1  |          |
> > >    +---------+           +--------------------+           +----------+
> > >    172.16.1.67          172.16.1.66|   10.0.1.1           10.0.1.2
> > >                                    |
> > >              +--------------------------------------------->
> > >               ssh 10.0.1.2         |
> > >                                    |
> > >                                    |
> > >                                    |
> > >                                    +
> > >
> > > All those VMs are inside of the same host, and one of them acts as a
> > > gateway between them because they are on two different subnets. In this
> > > case I'm seeing issues because even though I disable TSO/LRO on the
> > > "router" at runtime, the backend doesn't watch the xenstore feature
> > > flag, and never disables it from the vif on the Dom0 bridge. This
> > > causes LRO packets (non-fragmented) to be received at point 'C', and
> > > then when the gateway tries to inject them into the other NIC it fails
> > > because the size is greater than the MTU, and the "no fragment" bit is
> > > set.
> > >
> >
> > Yes, GSO cannot be disabled/enabled dynamically on the netback tx side
> > (i.e. guest rx side) so you can't turn it off. The Windows PV driver
> > leaves it on all the time and does the fragmentation itself if the stack
> > doesn't want GRO. Doing the fragmentation in the frontend makes more
> > sense anyway since the cpu cycles are burned by the VM rather than dom0
> > and so it scales better.
> 
> The weird thing is that GSO can usually be dynamically enabled/disabled on
> all network cards, so it would make sense to allow netfront to do the same.
> I guess the only way is to reset the netfront/netback connection when
> changing this property.

Or implement GSO fragmentation in netfront, as I did for Windows.

> 
> > > How does Linux deal with this situation? Does it simply ignore the no
> > > fragment flag and fragments the packet? Does it simply inject the
> > > packet to the other end ignoring the MTU and propagating the GSO flag?
> > >
> >
> > I've not looked at the netfront rx code but I assume that the large
> > packet that is passed from netback is just marked as GSO and makes its
> > way to wherever it's going (being fragmented by the stack if it's
> > forwarded to an interface that doesn't have the TSO flag set).
> 
> But it cannot be fragmented if it has the IP "don't fragment" flag set.
> 

Huh? This is GSO we're talking about here, not IP fragmentation. They are not the same thing.

> What I'm seeing here is that at point C netback passes GSO packets to the
> "router" VM, these packets have not been fragmented, and then when the
> router VM tries to forward them to point B it has to issue a "need
> fragmentation" icmp message because the MTU of the interface is 1500 and
> the IP header has the "don't fragment" set (and of course the GSO chain is
> bigger than 1500).
> 

That's presumably because they've lost the GSO information somewhere (i.e. the flag saying it's GSO and the MSS).

> Is Linux ignoring the "don't fragment" IP flag here and simply fragmenting
> it?

Yes. As I said, GSO != IP fragmentation; the DF bit has no bearing on it. You do need the GSO information though.

  Paul

> Because AFAICT I don't see any option in Linux netfront to prevent
> advertising "feature-gso-tcpv4", so netback will always inject GSO packets
> on the netfront RX path if possible.
> 
> Roger.


* Re: netif.h clarifications
From: Roger Pau Monne @ 2016-05-20 12:34 UTC
  To: Paul Durrant; +Cc: xen-devel@lists.xenproject.org, Wei Liu, David Vrabel

On Fri, May 20, 2016 at 12:55:16PM +0100, Paul Durrant wrote:
> > -----Original Message-----
> [snip]
> > > > And then I've also seen some issues with TSO/LRO (GSO in Linux
> > > > terminology) when using packet forwarding inside of a FreeBSD DomU.
> > > > For example in the following scenario:
> > > >
> > > >                                    +
> > > >                                    |
> > > >    +---------+           +--------------------+           +----------+
> > > >    |         |A         B|       router       |C         D|          |
> > > >    | Guest 1 +-----------+         +          +-----------+ Guest 2  |
> > > >    |         |  bridge0  |         |          |  bridge1  |          |
> > > >    +---------+           +--------------------+           +----------+
> > > >    172.16.1.67          172.16.1.66|   10.0.1.1           10.0.1.2
> > > >                                    |
> > > >              +--------------------------------------------->
> > > >               ssh 10.0.1.2         |
> > > >                                    |
> > > >                                    |
> > > >                                    |
> > > >                                    +
> > > >
> > > > All those VMs are inside of the same host, and one of them acts as a
> > > > gateway between them because they are on two different subnets. In
> > > > this case I'm seeing issues because even though I disable TSO/LRO on
> > > > the "router" at runtime, the backend doesn't watch the xenstore
> > > > feature flag, and never disables it from the vif on the Dom0 bridge.
> > > > This causes LRO packets (non-fragmented) to be received at point 'C',
> > > > and then when the gateway tries to inject them into the other NIC it
> > > > fails because the size is greater than the MTU, and the "no fragment"
> > > > bit is set.
> > > >
> > >
> > > Yes, GSO cannot be disabled/enabled dynamically on the netback tx side
> > > (i.e. guest rx side) so you can't turn it off. The Windows PV driver
> > > leaves it on all the time and does the fragmentation itself if the
> > > stack doesn't want GRO. Doing the fragmentation in the frontend makes
> > > more sense anyway since the cpu cycles are burned by the VM rather
> > > than dom0 and so it scales better.
> > 
> > The weird thing is that GSO can usually be dynamically enabled/disabled on
> > all network cards, so it would make sense to allow netfront to do the same.
> > I guess the only way is to reset the netfront/netback connection when
> > changing this property.
> 
> Or implement GSO fragmentation in netfront, as I did for Windows.
>
> > 
> > > > How does Linux deal with this situation? Does it simply ignore the no
> > > > fragment flag and fragments the packet? Does it simply inject the
> > > > packet to the other end ignoring the MTU and propagating the GSO flag?
> > > >
> > >
> > > I've not looked at the netfront rx code but I assume that the large
> > > packet that is passed from netback is just marked as GSO and makes its
> > > way to wherever it's going (being fragmented by the stack if it's
> > > forwarded to an interface that doesn't have the TSO flag set).
> > 
> > But it cannot be fragmented if it has the IP "don't fragment" flag set.
> > 
> 
> Huh? This is GSO we're talking about here, not IP fragmentation. They are not the same thing.

Well, as I understand it GSO works by offloading the fragmentation to the 
NIC, so the NIC performs the TCP/IP fragmentation itself. In which case I 
think it's relevant, because if you receive a 64KB GSO packet with the 
"don't fragment" IP flag set, you should not fragment it AFAIK, even if 
it's a GSO packet.

I think this is all caused because there's no real media here; it's all 
bridges and virtual network interfaces on the same host. The bridge has no 
real MTU, but in the real world the packet would be fragmented the moment 
it hits the wire.

OTOH, when using the PV net protocol we are basically passing mbufs (or 
skbs in the Linux world) around, so is it expected that the fragmentation 
is going to be performed when the packet is put on a real wire that has a 
real MTU, i.e. that the last entity that touches it must do the 
fragmentation?

IMHO, this approach seems very dangerous, and we are breaking the 
end-to-end principle.

> > What I'm seeing here is that at point C netback passes GSO packets to
> > the "router" VM, these packets have not been fragmented, and then when
> > the router VM tries to forward them to point B it has to issue a "need
> > fragmentation" icmp message because the MTU of the interface is 1500
> > and the IP header has the "don't fragment" set (and of course the GSO
> > chain is bigger than 1500).
> > 
> 
> That's presumably because they've lost the GSO information somewhere (i.e. the flag saying it's GSO and the MSS).

AFAICT, I'm correctly passing the GSO information around.

> > Is Linux ignoring the "don't fragment" IP flag here and simply fragmenting
> > it?
> 
> Yes. As I said GSO != IP fragmentation; the DF bit has no bearing on it. You do need the GSO information though.

I'm sorry but I don't think I'm following here. GSO basically offloads 
IP/TCP fragmentation to the NIC, so I don't see why the DF bit is not 
relevant here. The DF bit is clearly not relevant if it's a locally 
generated packet, but it matters if it's a packet coming from another 
entity.

In the diagram that I've posted above, for example, if you replace bridge0 
with physical media, and the guests at both ends want to establish an SSH 
connection, the fragmentation would then be done at point B (for packets 
going from guest 2 to 1), which seems completely wrong to me for packets 
that have the DF bit set, because the fragmentation would be done by the 
_router_, not the sender (which AFAICT is what the DF flag is trying to 
avoid).

Roger.


* Re: netif.h clarifications
From: Paul Durrant @ 2016-05-20 12:49 UTC
  To: Roger Pau Monne; +Cc: xen-devel@lists.xenproject.org, Wei Liu, David Vrabel

> -----Original Message-----
> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: 20 May 2016 13:34
> To: Paul Durrant
> Cc: xen-devel@lists.xenproject.org; Wei Liu; David Vrabel
> Subject: Re: netif.h clarifications
> 
> On Fri, May 20, 2016 at 12:55:16PM +0100, Paul Durrant wrote:
> > > -----Original Message-----
> > [snip]
> > > > > And then I've also seen some issues with TSO/LRO (GSO in Linux
> > > > > terminology) when using packet forwarding inside of a FreeBSD DomU.
> > > > > For example in the following scenario:
> > > > >
> > > > >                                    +
> > > > >                                    |
> > > > >    +---------+           +--------------------+           +----------+
> > > > >    |         |A         B|       router       |C         D|          |
> > > > >    | Guest 1 +-----------+         +          +-----------+ Guest 2  |
> > > > >    |         |  bridge0  |         |          |  bridge1  |          |
> > > > >    +---------+           +--------------------+           +----------+
> > > > >    172.16.1.67          172.16.1.66|   10.0.1.1           10.0.1.2
> > > > >                                    |
> > > > >              +--------------------------------------------->
> > > > >               ssh 10.0.1.2         |
> > > > >                                    |
> > > > >                                    |
> > > > >                                    |
> > > > >                                    +
> > > > >
> > > > > All those VMs are inside of the same host, and one of them acts as
> > > > > a gateway between them because they are on two different subnets.
> > > > > In this case I'm seeing issues because even though I disable
> > > > > TSO/LRO on the "router" at runtime, the backend doesn't watch the
> > > > > xenstore feature flag, and never disables it from the vif on the
> > > > > Dom0 bridge. This causes LRO packets (non-fragmented) to be
> > > > > received at point 'C', and then when the gateway tries to inject
> > > > > them into the other NIC it fails because the size is greater than
> > > > > the MTU, and the "no fragment" bit is set.
> > > > >
> > > >
> > > > Yes, GSO cannot be disabled/enabled dynamically on the netback tx
> > > > side (i.e. guest rx side) so you can't turn it off. The Windows PV
> > > > driver leaves it on all the time and does the fragmentation itself
> > > > if the stack doesn't want GRO. Doing the fragmentation in the
> > > > frontend makes more sense anyway since the cpu cycles are burned by
> > > > the VM rather than dom0 and so it scales better.
> > >
> > > The weird thing is that GSO can usually be dynamically enabled/disabled
> > > on all network cards, so it would make sense to allow netfront to do
> > > the same. I guess the only way is to reset the netfront/netback
> > > connection when changing this property.
> >
> > Or implement GSO fragmentation in netfront, as I did for Windows.
> >
> > >
> > > > > How does Linux deal with this situation? Does it simply ignore the
> > > > > no fragment flag and fragments the packet? Does it simply inject
> > > > > the packet to the other end ignoring the MTU and propagating the
> > > > > GSO flag?
> > > > >
> > > >
> > > > I've not looked at the netfront rx code but I assume that the large
> > > > packet that is passed from netback is just marked as GSO and makes
> > > > its way to wherever it's going (being fragmented by the stack if
> > > > it's forwarded to an interface that doesn't have the TSO flag set).
> > >
> > > But it cannot be fragmented if it has the IP "don't fragment" flag set.
> > >
> >
> > Huh? This is GSO we're talking about here, not IP fragmentation. They
> > are not the same thing.
> 
> Well, as I understand it GSO works by offloading the fragmentation to the
> NIC, so the NIC performs the TCP/IP fragmentation itself. In which case I
> think it's relevant, because if you receive a 64KB GSO packet with the
> "don't fragment" IP flags set, you should not fragment it AFAIK, even if
> it's a GSO packet.

I don't believe that is the case. The DF bit is not relevant because you are not fragmenting an IP packet; you are taking a large TCP segment and splitting it into MSS-sized segments.

> 
> I think this is all caused because there's no real media here, it's all
> bridges and virtual network interfaces on the same host. The bridge has no
> real MTU, but in the real world the packet would be fragmented the moment
> it hits the wire.
> 

Yes. MTU is not really relevant until a bit of wire gets involved.

> OTOH, when using the PV net protocol we are basically passing mbufs (or
> skbs in Linux world) around, so is it expected that the fragmentation is
> going to be performed when the packet is put on a real wire that has a
> real MTU, so the last entity that touches it must do the fragmentation?
> 

Let's call it 'segmentation' to avoid getting confused again... Yes, if something along the path does not know about large TCP segments (a.k.a. GSO packets) then the segmentation must be done at that boundary. So, for VM <-> VM traffic on the same host you really want everything to know about GSO packets so that the payload never has to be segmented.

> IMHO, this approach seems very dangerous, and we are breaking the
> end-to-end principle.

Why? What principle are we breaking? As long as the MSS information is carried in the packet metadata, the payload can always be segmented at the point where the packet is handed to something that either doesn't know how to handle that metadata, or when it needs to go onto a bit of wire.

> 
> > > What I'm seeing here is that at point C netback passes GSO packets to
> > > the "router" VM, these packets have not been fragmented, and then when
> > > the router VM tries to forward them to point B it has to issue a "need
> > > fragmentation" icmp message because the MTU of the interface is 1500
> > > and the IP header has the "don't fragment" set (and of course the GSO
> > > chain is bigger than 1500).
> > >
> >
> > That's presumably because they've lost the GSO information somewhere
> > (i.e. the flag saying it's GSO and the MSS).
> 
> AFAICT, I'm correctly passing the GSO information around.
> 
> > > Is Linux ignoring the "don't fragment" IP flag here and simply
> > > fragmenting it?
> >
> > Yes. As I said GSO != IP fragmentation; the DF bit has no bearing on it.
> > You do need the GSO information though.
> 
> I'm sorry but I don't think I'm following here. GSO basically offloads
> IP/TCP fragmentation to the NIC, so I don't see why the DF bit is not
> relevant here. The DF bit is clearly not relevant if it's a locally
> generated packet, but it matters if it's a packet coming from another
> entity.
> 

The DF bit says whether the *IP packet* may be fragmented. IP packets are not fragmented when TCP segmentation offload is done.

> In the diagram that I've posted above for example, if you change bridge0
> with a physical media, and the guests at both ends want to establish an SSH
> connection the fragmentation would then be done at point B (for packets
> going from guest 2 to 1), which seems completely wrong to me for packets
> that have the DF bit set, because the fragmentation would be done by the
> _router_, not the sender (which AFAICT is what the DF flag is trying to
> avoid).
> 

It would be wrong if IP packets were being fragmented, but they are not.

  Paul

> Roger.

