[MPTCP] Status

All of lore.kernel.org
 help / color / mirror / Atom feed

* [MPTCP] Status
@ 2017-10-02 21:14 Christoph Paasch
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Paasch @ 2017-10-02 21:14 UTC (permalink / raw)
  To: mptcp 

[-- Attachment #1: Type: text/plain, Size: 1161 bytes --]

Hello,

just wanted to ring in here.

I started working on porting TCP_MD5 to use the TCP extra-option framework
from Mat's branch.

It allows to nicely cleanup the TCP_MD5-code out of the TCP data-path.
There are some changes/extensions I needed to do to Mat's framework. But
nothing. I will post patches in the coming days here on the list.

I keep on moving mptcp_trunk upwards to track upstream Linux. Currently I'm
stuck at v4.9 (there is a nasty bug that popped up with the merge and I
wasn't able to fix that yet).

The merge with v4.9 also forced me to bump skb->cb to 80 bytes... :/
I have been thinking back and forth on how we could handle this. The best
way I see at the moment is to create a scratch-area at the end of the skb's
data (like skb_shared_info). I think it also would quite nicely fit with a
KCM/ULP-style architecture where we could have a BPF-program that does the
scheduling.
I haven't dived very deep into the skb->cb problem yet.

Anyways, at the moment I am focusing on fixing mptcp_trunk's merge with v4.9
and the TCP_MD5 cleanup (which I think would be of interest for netdev).

Christoph

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [MPTCP] Status
@ 2017-10-02 23:00 Mat Martineau
  0 siblings, 0 replies; 9+ messages in thread
From: Mat Martineau @ 2017-10-02 23:00 UTC (permalink / raw)
  To: mptcp 

[-- Attachment #1: Type: text/plain, Size: 2034 bytes --]

Hi Christoph -

On Mon, 2 Oct 2017, Christoph Paasch wrote:

> Hello,
>
> just wanted to ring in here.
>
>
> I started working on porting TCP_MD5 to use the TCP extra-option framework
> from Mat's branch.
>
> It allows to nicely cleanup the TCP_MD5-code out of the TCP data-path.
> There are some changes/extensions I needed to do to Mat's framework. But
> nothing. I will post patches in the coming days here on the list.
>

I'm glad the framework is working well for the MD5 cleanup. I had set that 
aside for a while, but one thing I remember wanting to fix was the option 
writing callback. It seemed like a per-socket callback (especially for 
connections in the 'established' state) would be an improvement, so every 
socket doesn't have to tranverse the entire list of callbacks.

>
> I keep on moving mptcp_trunk upwards to track upstream Linux. Currently I'm
> stuck at v4.9 (there is a nasty bug that popped up with the merge and I
> wasn't able to fix that yet).
>
>
> The merge with v4.9 also forced me to bump skb->cb to 80 bytes... :/
> I have been thinking back and forth on how we could handle this. The best
> way I see at the moment is to create a scratch-area at the end of the skb's
> data (like skb_shared_info). I think it also would quite nicely fit with a
> KCM/ULP-style architecture where we could have a BPF-program that does the
> scheduling.
> I haven't dived very deep into the skb->cb problem yet.
>

I don't think we're the first ones to want more control block bytes, seems 
like finding a solution would help outside of MPTCP too. I've looked at 
skb_tailroom_reserve a little bit, and also given some preliminary thought 
to stashing header info in skb_shared_info->frags (maybe by creating 
"header fragments").

>
> Anyways, at the moment I am focusing on fixing mptcp_trunk's merge with v4.9
> and the TCP_MD5 cleanup (which I think would be of interest for netdev).

I really appreciate the update, thank you!

--
Mat Martineau
Intel OTC

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [MPTCP] Status
@ 2017-10-02 23:28 Christoph Paasch
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Paasch @ 2017-10-02 23:28 UTC (permalink / raw)
  To: mptcp 

[-- Attachment #1: Type: text/plain, Size: 4513 bytes --]

Hello Mat,

On 02/10/17 - 16:00:56, Mat Martineau wrote:
> On Mon, 2 Oct 2017, Christoph Paasch wrote:
> > just wanted to ring in here.
> > 
> > 
> > I started working on porting TCP_MD5 to use the TCP extra-option framework
> > from Mat's branch.
> > 
> > It allows to nicely cleanup the TCP_MD5-code out of the TCP data-path.
> > There are some changes/extensions I needed to do to Mat's framework. But
> > nothing. I will post patches in the coming days here on the list.
> > 
> 
> I'm glad the framework is working well for the MD5 cleanup. I had set that
> aside for a while, but one thing I remember wanting to fix was the option
> writing callback. It seemed like a per-socket callback (especially for
> connections in the 'established' state) would be an improvement, so every
> socket doesn't have to tranverse the entire list of callbacks.

+1 on the per-socket callback. I have this on my list of things that would
be good to add.

Basically, I think we should move tcp_option_list to struct tcp_sock.
Dynamically allocated there a pointer to the callbacks and also some
additional memory region.

That way, we have a generic way to store the state needed for the extra TCP
option (md5sig_info in TCP_MD5's case).

I have some more comments, but it will be clearer with the TCP_MD5-code at
hand.

> > I keep on moving mptcp_trunk upwards to track upstream Linux. Currently I'm
> > stuck at v4.9 (there is a nasty bug that popped up with the merge and I
> > wasn't able to fix that yet).
> > 
> > 
> > The merge with v4.9 also forced me to bump skb->cb to 80 bytes... :/
> > I have been thinking back and forth on how we could handle this. The best
> > way I see at the moment is to create a scratch-area at the end of the skb's
> > data (like skb_shared_info). I think it also would quite nicely fit with a
> > KCM/ULP-style architecture where we could have a BPF-program that does the
> > scheduling.
> > I haven't dived very deep into the skb->cb problem yet.
> > 
> 
> I don't think we're the first ones to want more control block bytes, seems
> like finding a solution would help outside of MPTCP too. I've looked at
> skb_tailroom_reserve a little bit, and also given some preliminary thought
> to stashing header info in skb_shared_info->frags (maybe by creating "header
> fragments").

Yes, I also have to look a bit more at tailroom_reserve.

Can you elaborate a bit more on the "header fragments" ?

At one point, I had a more or less crazy idea of storing it inside the
payload.

Here was my train of thought:

Basically, the big problem with MPTCP (ignoring implementations) is that the
IETF decided to put the DSS-option in the TCP option-space. Thus, this
inherintly links a TCP-option with the payload of the packet (due to the
DSS-mapping).
Such linking is bad, for TSO, LRO/GRO, middleboxes splitting segments,...

It would have been much better if the IETF would have placed the DSS-option
(not the DATA_ACK) in the payload and leave the TCP-options just for truly
signalling options (DATA_ACK, ADD_ADDR, REMOVE_ADDR, MP_PRIO,...).

So, I was thinking that we could fake this and the MPTCP-level would do a
regular tcp_sendmsg on the subsockets with the DSS-mapping as part of the
payload. It would also just pass a flag down to tcp_sendmsg, that indicates
that the payload contains a DSS-mapping. This flag would then be stored in
the relevant skb (just one bit - I think we have that space).

Then, later in tcp_options_write, we just need to check on that flag and
extract the DSS-mapping and write it into the TCP header space (and adjust
skb->data,...).

In principle, I think this would have been very clean IMO.

But it doesn't work, because this DSS-mapping will no be accounted in TCP's
sequence space (aka., snd_nxt,...) but in the end it won't be sent out. So,
that would screw up TCP completly. Basically, skb->len will include the
DSS-mapping in the payload but it won't be sent out as part of the payload
but as part of the TCP option-space.

So, because of this I gave up on this avenue.

If you think this could work in another way or something like that, let me
know :)

Christoph

> 
> > 
> > Anyways, at the moment I am focusing on fixing mptcp_trunk's merge with v4.9
> > and the TCP_MD5 cleanup (which I think would be of interest for netdev).
> 
> I really appreciate the update, thank you!
> 
> 
> --
> Mat Martineau
> Intel OTC

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [MPTCP] Status
@ 2017-10-03 19:26 Mat Martineau
  0 siblings, 0 replies; 9+ messages in thread
From: Mat Martineau @ 2017-10-03 19:26 UTC (permalink / raw)
  To: mptcp 

[-- Attachment #1: Type: text/plain, Size: 4259 bytes --]


On Mon, 2 Oct 2017, Christoph Paasch wrote:

> Hello Mat,
>
> On 02/10/17 - 16:00:56, Mat Martineau wrote:
>> On Mon, 2 Oct 2017, Christoph Paasch wrote:
>
>>> I keep on moving mptcp_trunk upwards to track upstream Linux. Currently I'm
>>> stuck at v4.9 (there is a nasty bug that popped up with the merge and I
>>> wasn't able to fix that yet).
>>>
>>>
>>> The merge with v4.9 also forced me to bump skb->cb to 80 bytes... :/
>>> I have been thinking back and forth on how we could handle this. The best
>>> way I see at the moment is to create a scratch-area at the end of the skb's
>>> data (like skb_shared_info). I think it also would quite nicely fit with a
>>> KCM/ULP-style architecture where we could have a BPF-program that does the
>>> scheduling.
>>> I haven't dived very deep into the skb->cb problem yet.
>>>
>>
>> I don't think we're the first ones to want more control block bytes, seems
>> like finding a solution would help outside of MPTCP too. I've looked at
>> skb_tailroom_reserve a little bit, and also given some preliminary thought
>> to stashing header info in skb_shared_info->frags (maybe by creating "header
>> fragments").
>
>
>
> Yes, I also have to look a bit more at tailroom_reserve.
>
> Can you elaborate a bit more on the "header fragments" ?

The idea is to modify struct skb_frag_struct to represent different types 
of fragments by making the page pointer a union and providing a way to 
detect the difference (size == 0?). In addition to pages of payload, it 
could also point to other types of content, like header information, that 
would be ignored as payload but available for other use like a 
pre-populated TCP option.

A downside is that this would involve making sure that every user of the 
skb frags array understands what's up, which would touch a lot of driver 
code.

>
> At one point, I had a more or less crazy idea of storing it inside the
> payload.
>
> Here was my train of thought:
>
> Basically, the big problem with MPTCP (ignoring implementations) is that the
> IETF decided to put the DSS-option in the TCP option-space. Thus, this
> inherintly links a TCP-option with the payload of the packet (due to the
> DSS-mapping).
> Such linking is bad, for TSO, LRO/GRO, middleboxes splitting segments,...
>
> It would have been much better if the IETF would have placed the DSS-option
> (not the DATA_ACK) in the payload and leave the TCP-options just for truly
> signalling options (DATA_ACK, ADD_ADDR, REMOVE_ADDR, MP_PRIO,...).

Agreed!

> So, I was thinking that we could fake this and the MPTCP-level would do a
> regular tcp_sendmsg on the subsockets with the DSS-mapping as part of the
> payload. It would also just pass a flag down to tcp_sendmsg, that indicates
> that the payload contains a DSS-mapping. This flag would then be stored in
> the relevant skb (just one bit - I think we have that space).
>
> Then, later in tcp_options_write, we just need to check on that flag and
> extract the DSS-mapping and write it into the TCP header space (and adjust
> skb->data,...).
>
>
> In principle, I think this would have been very clean IMO.
>
>
> But it doesn't work, because this DSS-mapping will no be accounted in TCP's
> sequence space (aka., snd_nxt,...) but in the end it won't be sent out. So,
> that would screw up TCP completly. Basically, skb->len will include the
> DSS-mapping in the payload but it won't be sent out as part of the payload
> but as part of the TCP option-space.
>
> So, because of this I gave up on this avenue.
>
> If you think this could work in another way or something like that, let me
> know :)

I've been looking at a very similar approach, pre-populating the DSS 
option and setting a flag telling tcp_options_write to use those option 
bytes as-is and build the rest of the TCP header before it. The TCP 
sequence numbers could also detect the prepopulated header and account for 
it. I don't think this has the problem you identified since all of the skb 
payload is sent. Maybe this doesn't play nice with cloned skbs (which 
can't modify the payload), but doesn't the mapping need to stay the same 
if it's retransmitted?


--
Mat Martineau
Intel OTC

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [MPTCP] Status
@ 2017-10-03 21:13 Christoph Paasch
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Paasch @ 2017-10-03 21:13 UTC (permalink / raw)
  To: mptcp 

[-- Attachment #1: Type: text/plain, Size: 6168 bytes --]

On 03/10/17 - 12:26:22, Mat Martineau wrote:
> 
> On Mon, 2 Oct 2017, Christoph Paasch wrote:
> 
> > Hello Mat,
> > 
> > On 02/10/17 - 16:00:56, Mat Martineau wrote:
> > > On Mon, 2 Oct 2017, Christoph Paasch wrote:
> > 
> > > > I keep on moving mptcp_trunk upwards to track upstream Linux. Currently I'm
> > > > stuck at v4.9 (there is a nasty bug that popped up with the merge and I
> > > > wasn't able to fix that yet).
> > > > 
> > > > 
> > > > The merge with v4.9 also forced me to bump skb->cb to 80 bytes... :/
> > > > I have been thinking back and forth on how we could handle this. The best
> > > > way I see at the moment is to create a scratch-area at the end of the skb's
> > > > data (like skb_shared_info). I think it also would quite nicely fit with a
> > > > KCM/ULP-style architecture where we could have a BPF-program that does the
> > > > scheduling.
> > > > I haven't dived very deep into the skb->cb problem yet.
> > > > 
> > > 
> > > I don't think we're the first ones to want more control block bytes, seems
> > > like finding a solution would help outside of MPTCP too. I've looked at
> > > skb_tailroom_reserve a little bit, and also given some preliminary thought
> > > to stashing header info in skb_shared_info->frags (maybe by creating "header
> > > fragments").
> > 
> > 
> > 
> > Yes, I also have to look a bit more at tailroom_reserve.
> > 
> > Can you elaborate a bit more on the "header fragments" ?
> 
> The idea is to modify struct skb_frag_struct to represent different types of
> fragments by making the page pointer a union and providing a way to detect
> the difference (size == 0?). In addition to pages of payload, it could also
> point to other types of content, like header information, that would be
> ignored as payload but available for other use like a pre-populated TCP
> option.

What if instead of using struct skb_frag_struct, we just add a field to
struct skb_shared_info? It should not be as restricted wrt to adding fields
as is struct sk_buff, no?

> A downside is that this would involve making sure that every user of the skb
> frags array understands what's up, which would touch a lot of driver code.

True, this would impact a big code base and could be error-prone.

> 
> > 
> > At one point, I had a more or less crazy idea of storing it inside the
> > payload.
> > 
> > Here was my train of thought:
> > 
> > Basically, the big problem with MPTCP (ignoring implementations) is that the
> > IETF decided to put the DSS-option in the TCP option-space. Thus, this
> > inherintly links a TCP-option with the payload of the packet (due to the
> > DSS-mapping).
> > Such linking is bad, for TSO, LRO/GRO, middleboxes splitting segments,...
> > 
> > It would have been much better if the IETF would have placed the DSS-option
> > (not the DATA_ACK) in the payload and leave the TCP-options just for truly
> > signalling options (DATA_ACK, ADD_ADDR, REMOVE_ADDR, MP_PRIO,...).
> 
> Agreed!
> 
> > So, I was thinking that we could fake this and the MPTCP-level would do a
> > regular tcp_sendmsg on the subsockets with the DSS-mapping as part of the
> > payload. It would also just pass a flag down to tcp_sendmsg, that indicates
> > that the payload contains a DSS-mapping. This flag would then be stored in
> > the relevant skb (just one bit - I think we have that space).
> > 
> > Then, later in tcp_options_write, we just need to check on that flag and
> > extract the DSS-mapping and write it into the TCP header space (and adjust
> > skb->data,...).
> > 
> > 
> > In principle, I think this would have been very clean IMO.
> > 
> > 
> > But it doesn't work, because this DSS-mapping will no be accounted in TCP's
> > sequence space (aka., snd_nxt,...) but in the end it won't be sent out. So,
> > that would screw up TCP completly. Basically, skb->len will include the
> > DSS-mapping in the payload but it won't be sent out as part of the payload
> > but as part of the TCP option-space.
> > 
> > So, because of this I gave up on this avenue.
> > 
> > If you think this could work in another way or something like that, let me
> > know :)
> 
> I've been looking at a very similar approach, pre-populating the DSS option
> and setting a flag telling tcp_options_write to use those option bytes as-is
> and build the rest of the TCP header before it.

These pre-populated bytes would be sitting in the linear part of the memory
above skb->data ?

We had a similar approach prior to switching to the tcp_skb_cb->dss mode
(which was done by Octavian). I detailled this old approach in my thesis
(https://inl.info.ucl.ac.be/system/files/phd-thesis_1.pdf), starting at page
101.


I just reread what the issue was and the problem was that pskb_copy()
would then not copy the DSS-option. So, before every call to pskb_copy() in
the TCP-stack we were adjusting skb->data and skb->len to make sure that it
all got taken care of.

Now, if we can handle this cleaner in pskb_copy() we could go back to what
we had before. I think we just need to store in the skb how much data is
sitting on top of skb->data that needs to be copied as well. And then we are
good to go IMO.


> The TCP sequence numbers
> could also detect the prepopulated header and account for it.

I'm not sure what you mean by that. How would the sequence numbers account
for the prepopulated header?

> I don't think
> this has the problem you identified since all of the skb payload is sent.
> Maybe this doesn't play nice with cloned skbs (which can't modify the
> payload), but doesn't the mapping need to stay the same if it's
> retransmitted?

Yes, it must stay the same. Otherwise there will be trouble with the
DSS-csum.

But it does not need to be on all segments that are covered by the mapping.
Meaning, if we split an skb thanks to an incoming SACK-block or a partial ACK,
we don't have to make sure that the new skb has the DSS-option as well.
Because part of it has been ack'd (or sack'd) and thus the receiver got
the DSS-option.


Christoph


> 
> 
> --
> Mat Martineau
> Intel OTC

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [MPTCP] Status
@ 2017-10-04  0:22 Mat Martineau
  0 siblings, 0 replies; 9+ messages in thread
From: Mat Martineau @ 2017-10-04  0:22 UTC (permalink / raw)
  To: mptcp 

[-- Attachment #1: Type: text/plain, Size: 7726 bytes --]


On Tue, 3 Oct 2017, Christoph Paasch wrote:

> On 03/10/17 - 12:26:22, Mat Martineau wrote:
>>
>> On Mon, 2 Oct 2017, Christoph Paasch wrote:
>>
>>> Hello Mat,
>>>
>>> On 02/10/17 - 16:00:56, Mat Martineau wrote:
>>>> On Mon, 2 Oct 2017, Christoph Paasch wrote:
>>>
>>>>> I keep on moving mptcp_trunk upwards to track upstream Linux. Currently I'm
>>>>> stuck at v4.9 (there is a nasty bug that popped up with the merge and I
>>>>> wasn't able to fix that yet).
>>>>>
>>>>>
>>>>> The merge with v4.9 also forced me to bump skb->cb to 80 bytes... :/
>>>>> I have been thinking back and forth on how we could handle this. The best
>>>>> way I see at the moment is to create a scratch-area at the end of the skb's
>>>>> data (like skb_shared_info). I think it also would quite nicely fit with a
>>>>> KCM/ULP-style architecture where we could have a BPF-program that does the
>>>>> scheduling.
>>>>> I haven't dived very deep into the skb->cb problem yet.
>>>>>
>>>>
>>>> I don't think we're the first ones to want more control block bytes, seems
>>>> like finding a solution would help outside of MPTCP too. I've looked at
>>>> skb_tailroom_reserve a little bit, and also given some preliminary thought
>>>> to stashing header info in skb_shared_info->frags (maybe by creating "header
>>>> fragments").
>>>
>>>
>>>
>>> Yes, I also have to look a bit more at tailroom_reserve.
>>>
>>> Can you elaborate a bit more on the "header fragments" ?
>>
>> The idea is to modify struct skb_frag_struct to represent different types of
>> fragments by making the page pointer a union and providing a way to detect
>> the difference (size == 0?). In addition to pages of payload, it could also
>> point to other types of content, like header information, that would be
>> ignored as payload but available for other use like a pre-populated TCP
>> option.
>
> What if instead of using struct skb_frag_struct, we just add a field to
> struct skb_shared_info? It should not be as restricted wrt to adding fields
> as is struct sk_buff, no?
>

I've been assuming that the resistance to increasing the size of sk_buff 
would also apply to skb_shared_info since you end up with larger 
allocations for every skb either way. The impact of a larger 
skb_shared_info is spread across any clones, so it's not as bad as 
changing sk_buff, but it's still an increase.

However, a larger skb_shared_info could be *optional* if we defined a new 
SKB_ALLOC_SHARED_CB flag to pass to __alloc_skb. We get a bunch of space 
to play with and everyone else gets the same size structs and allocations 
that they're used to.

>> A downside is that this would involve making sure that every user of the skb
>> frags array understands what's up, which would touch a lot of driver code.
>
> True, this would impact a big code base and could be error-prone.
>
>>
>>>
>>> At one point, I had a more or less crazy idea of storing it inside the
>>> payload.
>>>
>>> Here was my train of thought:
>>>
>>> Basically, the big problem with MPTCP (ignoring implementations) is that the
>>> IETF decided to put the DSS-option in the TCP option-space. Thus, this
>>> inherintly links a TCP-option with the payload of the packet (due to the
>>> DSS-mapping).
>>> Such linking is bad, for TSO, LRO/GRO, middleboxes splitting segments,...
>>>
>>> It would have been much better if the IETF would have placed the DSS-option
>>> (not the DATA_ACK) in the payload and leave the TCP-options just for truly
>>> signalling options (DATA_ACK, ADD_ADDR, REMOVE_ADDR, MP_PRIO,...).
>>
>> Agreed!
>>
>>> So, I was thinking that we could fake this and the MPTCP-level would do a
>>> regular tcp_sendmsg on the subsockets with the DSS-mapping as part of the
>>> payload. It would also just pass a flag down to tcp_sendmsg, that indicates
>>> that the payload contains a DSS-mapping. This flag would then be stored in
>>> the relevant skb (just one bit - I think we have that space).
>>>
>>> Then, later in tcp_options_write, we just need to check on that flag and
>>> extract the DSS-mapping and write it into the TCP header space (and adjust
>>> skb->data,...).
>>>
>>>
>>> In principle, I think this would have been very clean IMO.
>>>
>>>
>>> But it doesn't work, because this DSS-mapping will no be accounted in TCP's
>>> sequence space (aka., snd_nxt,...) but in the end it won't be sent out. So,
>>> that would screw up TCP completly. Basically, skb->len will include the
>>> DSS-mapping in the payload but it won't be sent out as part of the payload
>>> but as part of the TCP option-space.
>>>
>>> So, because of this I gave up on this avenue.
>>>
>>> If you think this could work in another way or something like that, let me
>>> know :)
>>
>> I've been looking at a very similar approach, pre-populating the DSS option
>> and setting a flag telling tcp_options_write to use those option bytes as-is
>> and build the rest of the TCP header before it.
>
> These pre-populated bytes would be sitting in the linear part of the memory
> above skb->data ?

By "above" I'm not sure if you mean "in the headroom between skb->head and 
skb->data" or "starting at skb->data". The latter is what I intended: 
skb->data points to the option kind byte.

>
> We had a similar approach prior to switching to the tcp_skb_cb->dss mode
> (which was done by Octavian). I detailled this old approach in my thesis
> (https://inl.info.ucl.ac.be/system/files/phd-thesis_1.pdf), starting at page
> 101.
>
>
> I just reread what the issue was and the problem was that pskb_copy()
> would then not copy the DSS-option. So, before every call to pskb_copy() in
> the TCP-stack we were adjusting skb->data and skb->len to make sure that it
> all got taken care of.

I've re-read that now as well, and it sounds like I am reinventing the 
older v0.88 approach that has various corner cases but not the pksb_copy 
problem. Unfortunately it sounds like it was complicated in practice!

> Now, if we can handle this cleaner in pskb_copy() we could go back to what
> we had before. I think we just need to store in the skb how much data is
> sitting on top of skb->data that needs to be copied as well. And then we are
> good to go IMO.

Another variation is to have a "copy_headroom" bit in sk_buff that makes 
pskb_copy copy everything from skb->head to skb->tail. But I like the 
SKB_ALLOC_SHARED_CB idea better.

>> The TCP sequence numbers
>> could also detect the prepopulated header and account for it.
>
> I'm not sure what you mean by that. How would the sequence numbers account
> for the prepopulated header?

I didn't phrase that well, my apologies. If the TCP option was 
prepopulated at skb->data, the current TCP code would see it as part of 
the payload. The code doing the TCP sequence number assignment would have 
to know the actual TCP payload length in order to set the sequence numbers 
correctly in the header.

>> I don't think
>> this has the problem you identified since all of the skb payload is sent.
>> Maybe this doesn't play nice with cloned skbs (which can't modify the
>> payload), but doesn't the mapping need to stay the same if it's
>> retransmitted?
>
> Yes, it must stay the same. Otherwise there will be trouble with the
> DSS-csum.
>
> But it does not need to be on all segments that are covered by the mapping.
> Meaning, if we split an skb thanks to an incoming SACK-block or a partial ACK,
> we don't have to make sure that the new skb has the DSS-option as well.
> Because part of it has been ack'd (or sack'd) and thus the receiver got
> the DSS-option.

Regards,

--
Mat Martineau
Intel OTC

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [MPTCP] Status
@ 2017-10-04  6:22 Christoph Paasch
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Paasch @ 2017-10-04  6:22 UTC (permalink / raw)
  To: mptcp 

[-- Attachment #1: Type: text/plain, Size: 9257 bytes --]

On 03/10/17 - 17:22:23, Mat Martineau wrote:
> 
> On Tue, 3 Oct 2017, Christoph Paasch wrote:
> 
> > On 03/10/17 - 12:26:22, Mat Martineau wrote:
> > > 
> > > On Mon, 2 Oct 2017, Christoph Paasch wrote:
> > > 
> > > > Hello Mat,
> > > > 
> > > > On 02/10/17 - 16:00:56, Mat Martineau wrote:
> > > > > On Mon, 2 Oct 2017, Christoph Paasch wrote:
> > > > 
> > > > > > I keep on moving mptcp_trunk upwards to track upstream Linux. Currently I'm
> > > > > > stuck at v4.9 (there is a nasty bug that popped up with the merge and I
> > > > > > wasn't able to fix that yet).
> > > > > > 
> > > > > > 
> > > > > > The merge with v4.9 also forced me to bump skb->cb to 80 bytes... :/
> > > > > > I have been thinking back and forth on how we could handle this. The best
> > > > > > way I see at the moment is to create a scratch-area at the end of the skb's
> > > > > > data (like skb_shared_info). I think it also would quite nicely fit with a
> > > > > > KCM/ULP-style architecture where we could have a BPF-program that does the
> > > > > > scheduling.
> > > > > > I haven't dived very deep into the skb->cb problem yet.
> > > > > > 
> > > > > 
> > > > > I don't think we're the first ones to want more control block bytes, seems
> > > > > like finding a solution would help outside of MPTCP too. I've looked at
> > > > > skb_tailroom_reserve a little bit, and also given some preliminary thought
> > > > > to stashing header info in skb_shared_info->frags (maybe by creating "header
> > > > > fragments").
> > > > 
> > > > 
> > > > 
> > > > Yes, I also have to look a bit more at tailroom_reserve.
> > > > 
> > > > Can you elaborate a bit more on the "header fragments" ?
> > > 
> > > The idea is to modify struct skb_frag_struct to represent different types of
> > > fragments by making the page pointer a union and providing a way to detect
> > > the difference (size == 0?). In addition to pages of payload, it could also
> > > point to other types of content, like header information, that would be
> > > ignored as payload but available for other use like a pre-populated TCP
> > > option.
> > 
> > What if instead of using struct skb_frag_struct, we just add a field to
> > struct skb_shared_info? It should not be as restricted wrt to adding fields
> > as is struct sk_buff, no?
> > 
> 
> I've been assuming that the resistance to increasing the size of sk_buff
> would also apply to skb_shared_info since you end up with larger allocations
> for every skb either way. The impact of a larger skb_shared_info is spread
> across any clones, so it's not as bad as changing sk_buff, but it's still an
> increase.
> 
> However, a larger skb_shared_info could be *optional* if we defined a new
> SKB_ALLOC_SHARED_CB flag to pass to __alloc_skb. We get a bunch of space to
> play with and everyone else gets the same size structs and allocations that
> they're used to.

I like that!

> 
> > > A downside is that this would involve making sure that every user of the skb
> > > frags array understands what's up, which would touch a lot of driver code.
> > 
> > True, this would impact a big code base and could be error-prone.
> > 
> > > 
> > > > 
> > > > At one point, I had a more or less crazy idea of storing it inside the
> > > > payload.
> > > > 
> > > > Here was my train of thought:
> > > > 
> > > > Basically, the big problem with MPTCP (ignoring implementations) is that the
> > > > IETF decided to put the DSS-option in the TCP option-space. Thus, this
> > > > inherintly links a TCP-option with the payload of the packet (due to the
> > > > DSS-mapping).
> > > > Such linking is bad, for TSO, LRO/GRO, middleboxes splitting segments,...
> > > > 
> > > > It would have been much better if the IETF would have placed the DSS-option
> > > > (not the DATA_ACK) in the payload and leave the TCP-options just for truly
> > > > signalling options (DATA_ACK, ADD_ADDR, REMOVE_ADDR, MP_PRIO,...).
> > > 
> > > Agreed!
> > > 
> > > > So, I was thinking that we could fake this and the MPTCP-level would do a
> > > > regular tcp_sendmsg on the subsockets with the DSS-mapping as part of the
> > > > payload. It would also just pass a flag down to tcp_sendmsg, that indicates
> > > > that the payload contains a DSS-mapping. This flag would then be stored in
> > > > the relevant skb (just one bit - I think we have that space).
> > > > 
> > > > Then, later in tcp_options_write, we just need to check on that flag and
> > > > extract the DSS-mapping and write it into the TCP header space (and adjust
> > > > skb->data,...).
> > > > 
> > > > 
> > > > In principle, I think this would have been very clean IMO.
> > > > 
> > > > 
> > > > But it doesn't work, because this DSS-mapping will no be accounted in TCP's
> > > > sequence space (aka., snd_nxt,...) but in the end it won't be sent out. So,
> > > > that would screw up TCP completly. Basically, skb->len will include the
> > > > DSS-mapping in the payload but it won't be sent out as part of the payload
> > > > but as part of the TCP option-space.
> > > > 
> > > > So, because of this I gave up on this avenue.
> > > > 
> > > > If you think this could work in another way or something like that, let me
> > > > know :)
> > > 
> > > I've been looking at a very similar approach, pre-populating the DSS option
> > > and setting a flag telling tcp_options_write to use those option bytes as-is
> > > and build the rest of the TCP header before it.
> > 
> > These pre-populated bytes would be sitting in the linear part of the memory
> > above skb->data ?
> 
> By "above" I'm not sure if you mean "in the headroom between skb->head and
> skb->data" or "starting at skb->data". The latter is what I intended:
> skb->data points to the option kind byte.

I meant, in the headroom between skb->head and skb->data.

Putting it below skb->data was what I had envisioned in my original e-mail
here. But that has other side-effects (see below at [1]).

> 
> > 
> > We had a similar approach prior to switching to the tcp_skb_cb->dss mode
> > (which was done by Octavian). I detailled this old approach in my thesis
> > (https://inl.info.ucl.ac.be/system/files/phd-thesis_1.pdf), starting at page
> > 101.
> > 
> > 
> > I just reread what the issue was and the problem was that pskb_copy()
> > would then not copy the DSS-option. So, before every call to pskb_copy() in
> > the TCP-stack we were adjusting skb->data and skb->len to make sure that it
> > all got taken care of.
> 
> I've re-read that now as well, and it sounds like I am reinventing the older
> v0.88 approach that has various corner cases but not the pksb_copy problem.
> Unfortunately it sounds like it was complicated in practice!
> 
> > Now, if we can handle this cleaner in pskb_copy() we could go back to what
> > we had before. I think we just need to store in the skb how much data is
> > sitting on top of skb->data that needs to be copied as well. And then we are
> > good to go IMO.
> 
> Another variation is to have a "copy_headroom" bit in sk_buff that makes
> pskb_copy copy everything from skb->head to skb->tail. But I like the
> SKB_ALLOC_SHARED_CB idea better.

Agreed, SKB_ALLOC_SHARED_CB looks good to me. It's very generic and can be
used later on by other protocols,...

> 
> > > The TCP sequence numbers
> > > could also detect the prepopulated header and account for it.
> > 
> > I'm not sure what you mean by that. How would the sequence numbers account
> > for the prepopulated header?
> 
> I didn't phrase that well, my apologies. If the TCP option was prepopulated
> at skb->data, the current TCP code would see it as part of the payload. The
> code doing the TCP sequence number assignment would have to know the actual
> TCP payload length in order to set the sequence numbers correctly in the
> header.

[1]

Would you then set skb->len so that it includes the TCP-option starting at
skb->data? I guess you have to, because otherwise functions like
__pskb_trim_head() (called from tcp_trim_head()) will have problems.

That would mean that the TCP-stack can't use skb->len anymore in its
output path. Making sure that this still works would be very tricky I think.



From what I see, the SKB_ALLOC_SHARED_CB approach seems to me the most
promising and generic. Let's try to aim for this.

Do you have cycles to investigate this approach?


Christoph



> 
> > > I don't think
> > > this has the problem you identified since all of the skb payload is sent.
> > > Maybe this doesn't play nice with cloned skbs (which can't modify the
> > > payload), but doesn't the mapping need to stay the same if it's
> > > retransmitted?
> > 
> > Yes, it must stay the same. Otherwise there will be trouble with the
> > DSS-csum.
> > 
> > But it does not need to be on all segments that are covered by the mapping.
> > Meaning, if we split an skb thanks to an incoming SACK-block or a partial ACK,
> > we don't have to make sure that the new skb has the DSS-option as well.
> > Because part of it has been ack'd (or sack'd) and thus the receiver got
> > the DSS-option.
> 
> Regards,
> 
> --
> Mat Martineau
> Intel OTC

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [MPTCP] Status
@ 2017-10-04 16:13 Mat Martineau
  0 siblings, 0 replies; 9+ messages in thread
From: Mat Martineau @ 2017-10-04 16:13 UTC (permalink / raw)
  To: mptcp 

[-- Attachment #1: Type: text/plain, Size: 4786 bytes --]


On Tue, 3 Oct 2017, Christoph Paasch wrote:

> On 03/10/17 - 17:22:23, Mat Martineau wrote:
>>
>> On Tue, 3 Oct 2017, Christoph Paasch wrote:
>>
>>> On 03/10/17 - 12:26:22, Mat Martineau wrote:
>>>>
>>>> On Mon, 2 Oct 2017, Christoph Paasch wrote:
>>>>
>>>>> Hello Mat,
>>>>>
>>>>> On 02/10/17 - 16:00:56, Mat Martineau wrote:
>>>>>> On Mon, 2 Oct 2017, Christoph Paasch wrote:
>>>>>
>>>>>>> I keep on moving mptcp_trunk upwards to track upstream Linux. Currently I'm
>>>>>>> stuck at v4.9 (there is a nasty bug that popped up with the merge and I
>>>>>>> wasn't able to fix that yet).
>>>>>>>
>>>>>>>
>>>>>>> The merge with v4.9 also forced me to bump skb->cb to 80 bytes... :/
>>>>>>> I have been thinking back and forth on how we could handle this. The best
>>>>>>> way I see at the moment is to create a scratch-area at the end of the skb's
>>>>>>> data (like skb_shared_info). I think it also would quite nicely fit with a
>>>>>>> KCM/ULP-style architecture where we could have a BPF-program that does the
>>>>>>> scheduling.
>>>>>>> I haven't dived very deep into the skb->cb problem yet.
>>>>>>>
>>>>>>
>>>>>> I don't think we're the first ones to want more control block bytes, seems
>>>>>> like finding a solution would help outside of MPTCP too. I've looked at
>>>>>> skb_tailroom_reserve a little bit, and also given some preliminary thought
>>>>>> to stashing header info in skb_shared_info->frags (maybe by creating "header
>>>>>> fragments").
>>>>>
>>>>>
>>>>>
>>>>> Yes, I also have to look a bit more at tailroom_reserve.
>>>>>
>>>>> Can you elaborate a bit more on the "header fragments" ?
>>>>
>>>> The idea is to modify struct skb_frag_struct to represent different types of
>>>> fragments by making the page pointer a union and providing a way to detect
>>>> the difference (size == 0?). In addition to pages of payload, it could also
>>>> point to other types of content, like header information, that would be
>>>> ignored as payload but available for other use like a pre-populated TCP
>>>> option.
>>>
>>> What if instead of using struct skb_frag_struct, we just add a field to
>>> struct skb_shared_info? It should not be as restricted wrt to adding fields
>>> as is struct sk_buff, no?
>>>
>>
>> I've been assuming that the resistance to increasing the size of sk_buff
>> would also apply to skb_shared_info since you end up with larger allocations
>> for every skb either way. The impact of a larger skb_shared_info is spread
>> across any clones, so it's not as bad as changing sk_buff, but it's still an
>> increase.
>>
>> However, a larger skb_shared_info could be *optional* if we defined a new
>> SKB_ALLOC_SHARED_CB flag to pass to __alloc_skb. We get a bunch of space to
>> play with and everyone else gets the same size structs and allocations that
>> they're used to.
>
> I like that!
>

...

>
>>
>>>
>>> We had a similar approach prior to switching to the tcp_skb_cb->dss mode
>>> (which was done by Octavian). I detailled this old approach in my thesis
>>> (https://inl.info.ucl.ac.be/system/files/phd-thesis_1.pdf), starting at page
>>> 101.
>>>
>>>
>>> I just reread what the issue was and the problem was that pskb_copy()
>>> would then not copy the DSS-option. So, before every call to pskb_copy() in
>>> the TCP-stack we were adjusting skb->data and skb->len to make sure that it
>>> all got taken care of.
>>
>> I've re-read that now as well, and it sounds like I am reinventing the older
>> v0.88 approach that has various corner cases but not the pksb_copy problem.
>> Unfortunately it sounds like it was complicated in practice!
>>
>>> Now, if we can handle this cleaner in pskb_copy() we could go back to what
>>> we had before. I think we just need to store in the skb how much data is
>>> sitting on top of skb->data that needs to be copied as well. And then we are
>>> good to go IMO.
>>
>> Another variation is to have a "copy_headroom" bit in sk_buff that makes
>> pskb_copy copy everything from skb->head to skb->tail. But I like the
>> SKB_ALLOC_SHARED_CB idea better.
>
> Agreed, SKB_ALLOC_SHARED_CB looks good to me. It's very generic and can be
> used later on by other protocols,...
>

...

>
> From what I see, the SKB_ALLOC_SHARED_CB approach seems to me the most
> promising and generic. Let's try to aim for this.
>
> Do you have cycles to investigate this approach?
>

Yes, I'll work on SKB_ALLOC_SHARED_CB.

I like the idea of a variable-length shared control buffer, but it would 
introduce more complexity to request a specific length and ensure that 
it's accessed safely. Slightly leaning toward a fixed length 48 bytes, 
just like sk_buff->cb. Any thoughts on that?

--
Mat Martineau
Intel OTC

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [MPTCP] Status
@ 2017-10-04 16:38 Christoph Paasch
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Paasch @ 2017-10-04 16:38 UTC (permalink / raw)
  To: mptcp 

[-- Attachment #1: Type: text/plain, Size: 5126 bytes --]



> On Oct 4, 2017, at 9:13 AM, Mat Martineau <mathew.j.martineau(a)linux.intel.com> wrote:
> 
> 
>> On Tue, 3 Oct 2017, Christoph Paasch wrote:
>> 
>>> On 03/10/17 - 17:22:23, Mat Martineau wrote:
>>> 
>>>> On Tue, 3 Oct 2017, Christoph Paasch wrote:
>>>> 
>>>>> On 03/10/17 - 12:26:22, Mat Martineau wrote:
>>>>> 
>>>>>> On Mon, 2 Oct 2017, Christoph Paasch wrote:
>>>>>> 
>>>>>> Hello Mat,
>>>>>> 
>>>>>>> On 02/10/17 - 16:00:56, Mat Martineau wrote:
>>>>>>> On Mon, 2 Oct 2017, Christoph Paasch wrote:
>>>>>> 
>>>>>>>> I keep on moving mptcp_trunk upwards to track upstream Linux. Currently I'm
>>>>>>>> stuck at v4.9 (there is a nasty bug that popped up with the merge and I
>>>>>>>> wasn't able to fix that yet).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The merge with v4.9 also forced me to bump skb->cb to 80 bytes... :/
>>>>>>>> I have been thinking back and forth on how we could handle this. The best
>>>>>>>> way I see at the moment is to create a scratch-area at the end of the skb's
>>>>>>>> data (like skb_shared_info). I think it also would quite nicely fit with a
>>>>>>>> KCM/ULP-style architecture where we could have a BPF-program that does the
>>>>>>>> scheduling.
>>>>>>>> I haven't dived very deep into the skb->cb problem yet.
>>>>>>>> 
>>>>>>> 
>>>>>>> I don't think we're the first ones to want more control block bytes, seems
>>>>>>> like finding a solution would help outside of MPTCP too. I've looked at
>>>>>>> skb_tailroom_reserve a little bit, and also given some preliminary thought
>>>>>>> to stashing header info in skb_shared_info->frags (maybe by creating "header
>>>>>>> fragments").
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Yes, I also have to look a bit more at tailroom_reserve.
>>>>>> 
>>>>>> Can you elaborate a bit more on the "header fragments" ?
>>>>> 
>>>>> The idea is to modify struct skb_frag_struct to represent different types of
>>>>> fragments by making the page pointer a union and providing a way to detect
>>>>> the difference (size == 0?). In addition to pages of payload, it could also
>>>>> point to other types of content, like header information, that would be
>>>>> ignored as payload but available for other use like a pre-populated TCP
>>>>> option.
>>>> 
>>>> What if instead of using struct skb_frag_struct, we just add a field to
>>>> struct skb_shared_info? It should not be as restricted wrt to adding fields
>>>> as is struct sk_buff, no?
>>>> 
>>> 
>>> I've been assuming that the resistance to increasing the size of sk_buff
>>> would also apply to skb_shared_info since you end up with larger allocations
>>> for every skb either way. The impact of a larger skb_shared_info is spread
>>> across any clones, so it's not as bad as changing sk_buff, but it's still an
>>> increase.
>>> 
>>> However, a larger skb_shared_info could be *optional* if we defined a new
>>> SKB_ALLOC_SHARED_CB flag to pass to __alloc_skb. We get a bunch of space to
>>> play with and everyone else gets the same size structs and allocations that
>>> they're used to.
>> 
>> I like that!
>> 
> 
> ...
> 
>> 
>>> 
>>>> 
>>>> We had a similar approach prior to switching to the tcp_skb_cb->dss mode
>>>> (which was done by Octavian). I detailled this old approach in my thesis
>>>> (https://inl.info.ucl.ac.be/system/files/phd-thesis_1.pdf), starting at page
>>>> 101.
>>>> 
>>>> 
>>>> I just reread what the issue was and the problem was that pskb_copy()
>>>> would then not copy the DSS-option. So, before every call to pskb_copy() in
>>>> the TCP-stack we were adjusting skb->data and skb->len to make sure that it
>>>> all got taken care of.
>>> 
>>> I've re-read that now as well, and it sounds like I am reinventing the older
>>> v0.88 approach that has various corner cases but not the pksb_copy problem.
>>> Unfortunately it sounds like it was complicated in practice!
>>> 
>>>> Now, if we can handle this cleaner in pskb_copy() we could go back to what
>>>> we had before. I think we just need to store in the skb how much data is
>>>> sitting on top of skb->data that needs to be copied as well. And then we are
>>>> good to go IMO.
>>> 
>>> Another variation is to have a "copy_headroom" bit in sk_buff that makes
>>> pskb_copy copy everything from skb->head to skb->tail. But I like the
>>> SKB_ALLOC_SHARED_CB idea better.
>> 
>> Agreed, SKB_ALLOC_SHARED_CB looks good to me. It's very generic and can be
>> used later on by other protocols,...
>> 
> 
> ...
> 
>> 
>> From what I see, the SKB_ALLOC_SHARED_CB approach seems to me the most
>> promising and generic. Let's try to aim for this.
>> 
>> Do you have cycles to investigate this approach?
>> 
> 
> Yes, I'll work on SKB_ALLOC_SHARED_CB.
> 
> I like the idea of a variable-length shared control buffer, but it would introduce more complexity to request a specific length and ensure that it's accessed safely. Slightly leaning toward a fixed length 48 bytes, just like sk_buff->cb. Any thoughts on that?

Yes, fixed length of 48 bytes sounds good to me.


Christoph 

> 
> --
> Mat Martineau
> Intel OTC

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2017-10-04 16:38 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-10-04  6:22 [MPTCP] Status Christoph Paasch
  -- strict thread matches above, loose matches on Subject: below --
2017-10-04 16:38 Christoph Paasch
2017-10-04 16:13 Mat Martineau
2017-10-04  0:22 Mat Martineau
2017-10-03 21:13 Christoph Paasch
2017-10-03 19:26 Mat Martineau
2017-10-02 23:28 Christoph Paasch
2017-10-02 23:00 Mat Martineau
2017-10-02 21:14 Christoph Paasch

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.