* [RFC] tcp: add support for scheduling TCP options on TCP sockets
@ 2014-05-06 18:05 Octavian Purdila
2014-05-07 5:38 ` David Miller
0 siblings, 1 reply; 13+ messages in thread
From: Octavian Purdila @ 2014-05-06 18:05 UTC (permalink / raw)
To: netdev; +Cc: Octavian Purdila
TCP was designed to be extensible with the help of TCP
options. However, even though a new TCP extension could, in theory, be
orthogonal to the core TCP stack, the current implementation for
handling TCP options makes it hard to create modular TCP extensions.
Adding a new TCP option currently requires modifying several core TCP
functions (tcp_options_write, tcp_established_options,
tcp_syn_options, tcp_synack_options). More importantly, options are
passed via tcp_skb_cb, whose space is limited: only 4 bytes are
currently available.
Extending the skb dynamically to store the options does not fully
solve the issue, as the TCP option space is limited in the TCP header
itself and when other options are set by the core TCP stack
(e.g. SACK) we might not have enough space. In this case, ideally we
should send the option in the next packet.
This patch tries to address this issue by creating a mechanism that
allows scheduling TCP options to be sent either with a SYN packet or
with a packet with a specific sequence number (or a later packet if
there is not enough space in the TCP header).
Options can be scheduled reliably; in that case, retransmissions of a
packet initially sent with a scheduled option will include the option
as well.
In certain cases we may not have space in the header and run out of
packets to be sent. To avoid blocking until more data can be sent,
duplicate ACKs may be requested when scheduling options.
This mechanism can be used to implement TCP extensions such as
MultiPath TCP [1] in a modular fashion, which would otherwise require
intrusive changes in the TCP stack [2]. We plan to use it to implement
an MPTCP layer that sits on top of TCP and thus avoid most of the
changes in the existing TCP code.
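To illustrate the intended flow, here is a rough user-space model of
the mechanism (the list handling and names here are ours, not the
patch's): option buffers are queued sorted by the sequence number they
are scheduled for, and at transmit time each eligible option is
emitted only if it fits the remaining option space, otherwise it stays
queued for a later packet.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical user-space sketch of the proposed option scheduler. */
struct opt_buff {
	struct opt_buff *next;
	uint32_t seq;	/* emit with the packet covering this seq */
	unsigned size;	/* option length, multiple of 4 */
};

/* Insert in ascending seq order (mirrors what __tcp_queue_opt does). */
static void sched_opt(struct opt_buff **head, struct opt_buff *tob)
{
	while (*head && (*head)->seq <= tob->seq)
		head = &(*head)->next;
	tob->next = *head;
	*head = tob;
}

/* Emit options eligible for a packet at 'seq' into 'remaining' bytes
 * of option space; return the bytes consumed.  Options that do not
 * fit stay on the list and ride in a later packet. */
static unsigned emit_opts(struct opt_buff **head, uint32_t seq,
			  unsigned remaining)
{
	unsigned used = 0;

	while (*head && (*head)->seq <= seq) {
		struct opt_buff *tob = *head;

		if (tob->size > remaining - used) {
			head = &tob->next;	/* defer: no space left */
			continue;
		}
		used += tob->size;
		*head = tob->next;
		free(tob);
	}
	return used;
}
```

The real patch additionally distinguishes SYN-only options, reliable
(retransmittable) options, and the dup-ACK fallback; this sketch only
shows the seq-ordered queue and the "send later if it doesn't fit"
behavior.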
Pardon the rough patch, but I hope it is enough to get some feedback
on the overall approach.
[1] http://tools.ietf.org/html/rfc6824
[2] https://github.com/multipath-tcp/mptcp
---
include/linux/tcp.h | 6 +++
include/net/tcp.h | 25 ++++++++++-
net/ipv4/tcp.c | 53 ++++++++++++++++++++++
net/ipv4/tcp_input.c | 15 +++++++
net/ipv4/tcp_ipv4.c | 2 +
net/ipv4/tcp_output.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++---
6 files changed, 214 insertions(+), 7 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d686334..a06519c 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -320,6 +320,12 @@ struct tcp_sock {
* socket. Used to retransmit SYNACKs etc.
*/
struct request_sock *fastopen_rsk;
+
+ struct list_head sched_opts; /* Scheduled options (to be sent) */
+ struct list_head sent_opts; /* Sent options (for retransmission) */
+	u32	req_dup_ack;	/* Trigger a dup-ack if critical options
+				 * were not sent due to not enough
+				 * space in the header */
};
enum tsq_flags {
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 70e55d2..22c6721 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -725,7 +725,7 @@ struct tcp_skb_cb {
#define TCPCB_RETRANS (TCPCB_SACKED_RETRANS|TCPCB_EVER_RETRANS)
__u8 ip_dsfield; /* IPv4 tos or IPv6 dsfield */
- /* 1 byte hole */
+ __u8 retrans:1; /* SKB is being retransmitted */
__u32 ack_seq; /* Sequence number ACK'd */
};
@@ -1600,6 +1600,29 @@ struct tcp_request_sock_ops {
int tcpv4_offload_init(void);
+struct tcp_opt_buff {
+ struct list_head list;
+ __u32 seq;
+ __be32 data[MAX_TCP_OPTION_SPACE/sizeof(__be32)];
+ __u8 size;
+ __u8 flags;
+#define TOB_F_RELIABLE 0x01 /* This option must be retransmitted. */
+#define TOB_F_DUP_ACK 0x02 /* Allow triggering a dup-ack if all
+ * data is gone but we didn't find
+ * space for this option */
+#define TOB_F_NO_GSO 0x04 /* This option is not compatible with GSO. */
+#define TOB_F_SYN 0x08 /* Send this option with the SYN packet. */
+#define TOB_USER_MASK 0x0F /* Mask for public flags */
+#define TOB_F_NOSPACE 0x10 /* Not enough space to send this option */
+};
+
+int __tcp_queue_opt(struct tcp_sock *tp, struct list_head *head,
+ struct tcp_opt_buff *tob);
+
+int tcp_sched_opt(struct tcp_sock *tp, struct tcp_opt_buff *tob);
+void tcp_sched_opt_purge(struct tcp_sock *tp);
+
+
void tcp_v4_init(void);
void tcp_init(void);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8e8529d..1f78d91 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -417,6 +417,9 @@ void tcp_init_sock(struct sock *sk)
sk->sk_sndbuf = sysctl_tcp_wmem[1];
sk->sk_rcvbuf = sysctl_tcp_rmem[1];
+ INIT_LIST_HEAD(&tp->sched_opts);
+ INIT_LIST_HEAD(&tp->sent_opts);
+
local_bh_disable();
sock_update_memcg(sk);
sk_sockets_allocated_inc(sk);
@@ -3060,6 +3063,56 @@ EXPORT_SYMBOL(tcp_md5_hash_key);
#endif
+int __tcp_queue_opt(struct tcp_sock *tp, struct list_head *head,
+ struct tcp_opt_buff *tob)
+{
+ struct tcp_opt_buff *i;
+
+ if (tob->flags & TOB_F_SYN) {
+ struct sock *sk = (struct sock *)tp;
+ if ((1 << sk->sk_state) & ~(TCPF_CLOSE | TCPF_LISTEN))
+ return -EINVAL;
+ list_add(&tob->list, head);
+ return 0;
+ } else {
+ if (before(tob->seq, tp->write_seq))
+ return -EINVAL;
+ }
+
+ list_for_each_entry_reverse(i, &tp->sched_opts, list) {
+ if (before(i->seq, tob->seq))
+ break;
+ }
+ list_add_tail(&tob->list, &i->list);
+ return 0;
+}
+
+int tcp_sched_opt(struct tcp_sock *tp, struct tcp_opt_buff *tob)
+{
+ tob->flags &= TOB_USER_MASK;
+ if (tob->size != ALIGN(tob->size, 4))
+ return -EINVAL;
+
+ return __tcp_queue_opt(tp, &tp->sched_opts, tob);
+}
+EXPORT_SYMBOL(tcp_sched_opt);
+
+
+void tcp_sched_opt_purge(struct tcp_sock *tp)
+{
+ struct tcp_opt_buff *tob, *tmp;
+
+ list_for_each_entry_safe(tob, tmp, &tp->sched_opts, list) {
+ list_del(&tob->list);
+ kfree(tob);
+ }
+ list_for_each_entry_safe(tob, tmp, &tp->sent_opts, list) {
+ list_del(&tob->list);
+ kfree(tob);
+ }
+}
+
+
void tcp_done(struct sock *sk)
{
struct request_sock *req = tcp_sk(sk)->fastopen_rsk;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c53b7f3..40d1436 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3015,6 +3015,19 @@ static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
return packets_acked;
}
+static void tcp_clean_sent_opts(struct tcp_sock *tp)
+{
+ struct tcp_opt_buff *tob, *tmp;
+
+ list_for_each_entry_safe(tob, tmp, &tp->sent_opts, list) {
+ if (after(tob->seq, tp->snd_una))
+ break;
+
+ list_del(&tob->list);
+ kfree(tob);
+ }
+}
+
/* Remove acknowledged frames from the retransmission queue. If our packet
* is before the ack sequence we can discard it as it's confirmed to have
* arrived at the other end.
@@ -3452,6 +3465,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una, sack_rtt);
acked -= tp->packets_out;
+ tcp_clean_sent_opts(tp);
+
/* Advance cwnd if state allows */
if (tcp_may_raise_cwnd(sk, flag))
tcp_cong_avoid(sk, ack, acked, prior_in_flight);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 14bba8a..59271c8 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2158,6 +2158,8 @@ void tcp_v4_destroy_sock(struct sock *sk)
/* Cleanup up the write buffer. */
tcp_write_queue_purge(sk);
+ tcp_sched_opt_purge(tp);
+
/* Cleans up our, hopefully empty, out_of_order_queue. */
__skb_queue_purge(&tp->out_of_order_queue);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6728546..c3680a5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -388,6 +388,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
#define OPTION_TS (1 << 1)
#define OPTION_MD5 (1 << 2)
#define OPTION_WSCALE (1 << 3)
+#define OPTION_SCHED_OPTIONS (1 << 4)
#define OPTION_FAST_OPEN_COOKIE (1 << 8)
struct tcp_out_options {
@@ -399,6 +400,7 @@ struct tcp_out_options {
__u8 *hash_location; /* temporary pointer, overloaded */
__u32 tsval, tsecr; /* need to include OPTION_TS */
struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
+ struct list_head sched_opts;
};
/* Write previously computed TCP options to the packet.
@@ -498,6 +500,100 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
}
ptr += (foc->len + 3) >> 2;
}
+
+ if (unlikely(OPTION_SCHED_OPTIONS & options)) {
+ struct tcp_opt_buff *tob, *tmp;
+
+ list_for_each_entry_safe(tob, tmp, &opts->sched_opts, list) {
+
+ memcpy(ptr, tob->data, tob->size);
+ ptr += tob->size;
+
+ list_del(&tob->list);
+
+ if (tob->flags & TOB_F_RELIABLE)
+ __tcp_queue_opt(tp, &tp->sent_opts, tob);
+ else
+ kfree(tob);
+ }
+ }
+}
+
+static void __tcp_sched_options(struct tcp_sock *tp, struct sk_buff *skb,
+ struct tcp_out_options *options,
+ unsigned int *remaining,
+ struct list_head *queue)
+{
+ struct tcp_opt_buff *tob, *tmp;
+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+
+ list_for_each_entry_safe(tob, tmp, queue, list) {
+
+ if (tob->flags & TOB_F_SYN) {
+ if (!(tcb->tcp_flags & TCPHDR_SYN)) {
+ /* TODO signal error */
+ list_del(&tob->list);
+ kfree(tob);
+ continue;
+ }
+ } else {
+ if (before(tob->seq, tcb->seq)) {
+ continue;
+ }
+ }
+
+ if (*remaining == 0)
+ break;
+
+ if (skb_is_gso(skb) && (tob->flags & TOB_F_NO_GSO))
+ continue;
+
+ if (*remaining < tob->size) {
+ if (tob->flags & TOB_F_SYN) {
+ /* TODO signal error */
+ list_del(&tob->list);
+ kfree(tob);
+ continue;
+ }
+ if ((tob->flags & TOB_F_DUP_ACK) &&
+ !(tob->flags & TOB_F_NOSPACE)) {
+ tp->req_dup_ack++;
+ tob->flags |= TOB_F_NOSPACE;
+ }
+ continue;
+ }
+
+ options->options |= OPTION_SCHED_OPTIONS;
+ *remaining -= tob->size;
+ tob->seq = tcb->seq;
+
+ list_del(&tob->list);
+ list_add_tail(&tob->list, &options->sched_opts);
+
+ if (tob->flags & TOB_F_NOSPACE) {
+ tob->flags &= ~TOB_F_NOSPACE;
+ tp->req_dup_ack--;
+ }
+ }
+
+}
+
+static void tcp_sched_options(struct tcp_sock *tp, struct sk_buff *skb,
+ struct tcp_out_options *options,
+ unsigned int *remaining)
+{
+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+
+ if (!skb)
+ return;
+
+ INIT_LIST_HEAD(&options->sched_opts);
+ if (tcb->retrans)
+ __tcp_sched_options(tp, skb, options, remaining,
+ &tp->sent_opts);
+ else
+ __tcp_sched_options(tp, skb, options, remaining,
+ &tp->sched_opts);
}
/* Compute TCP options for SYN packets. This is not the final
@@ -561,6 +657,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
}
}
+ tcp_sched_options(tp, skb, opts, &remaining);
+
return MAX_TCP_OPTION_SPACE - remaining;
}
@@ -622,6 +720,8 @@ static unsigned int tcp_synack_options(struct sock *sk,
}
}
+ tcp_sched_options(tcp_sk(sk), skb, opts, &remaining);
+
return MAX_TCP_OPTION_SPACE - remaining;
}
@@ -634,7 +734,7 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
{
struct tcp_skb_cb *tcb = skb ? TCP_SKB_CB(skb) : NULL;
struct tcp_sock *tp = tcp_sk(sk);
- unsigned int size = 0;
+ unsigned int remaining = MAX_TCP_OPTION_SPACE;
unsigned int eff_sacks;
opts->options = 0;
@@ -643,7 +743,7 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
*md5 = tp->af_specific->md5_lookup(sk, sk);
if (unlikely(*md5)) {
opts->options |= OPTION_MD5;
- size += TCPOLEN_MD5SIG_ALIGNED;
+ remaining -= TCPOLEN_MD5SIG_ALIGNED;
}
#else
*md5 = NULL;
@@ -653,21 +753,22 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
opts->options |= OPTION_TS;
opts->tsval = tcb ? tcb->when + tp->tsoffset : 0;
opts->tsecr = tp->rx_opt.ts_recent;
- size += TCPOLEN_TSTAMP_ALIGNED;
+ remaining -= TCPOLEN_TSTAMP_ALIGNED;
}
eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
if (unlikely(eff_sacks)) {
- const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
opts->num_sack_blocks =
min_t(unsigned int, eff_sacks,
(remaining - TCPOLEN_SACK_BASE_ALIGNED) /
TCPOLEN_SACK_PERBLOCK);
- size += TCPOLEN_SACK_BASE_ALIGNED +
+ remaining -= TCPOLEN_SACK_BASE_ALIGNED +
opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
}
- return size;
+ tcp_sched_options(tp, skb, opts, &remaining);
+
+ return MAX_TCP_OPTION_SPACE - remaining;
}
@@ -1912,6 +2013,12 @@ repair:
break;
}
+ /* Critical options were not sent because of not enough space
+ * in the header. Force a dup-ack to allow the critical option
+ * to get out. */
+ if (tp->req_dup_ack)
+ tcp_send_ack(sk);
+
if (likely(sent_pkts)) {
if (tcp_in_cwnd_reduction(sk))
tp->prr_out += sent_pkts;
@@ -2357,6 +2464,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
* is still in somebody's hands, else make a clone.
*/
TCP_SKB_CB(skb)->when = tcp_time_stamp;
+ TCP_SKB_CB(skb)->retrans = 1;
/* make sure skb->data is aligned on arches that require it
* and check if ack-trimming & collapsing extended the headroom
--
1.8.3.2
* Re: [RFC] tcp: add support for scheduling TCP options on TCP sockets
2014-05-06 18:05 [RFC] tcp: add support for scheduling TCP options on TCP sockets Octavian Purdila
@ 2014-05-07 5:38 ` David Miller
2014-05-07 7:30 ` Octavian Purdila
[not found] ` <bc98535539ef4d2c9cb6f53f85068b65@UCL-MBX03.OASIS.UCLOUVAIN.BE>
0 siblings, 2 replies; 13+ messages in thread
From: David Miller @ 2014-05-07 5:38 UTC (permalink / raw)
To: octavian.purdila; +Cc: netdev
From: Octavian Purdila <octavian.purdila@intel.com>
Date: Tue, 6 May 2014 21:05:24 +0300
> Pardon the rough patch, but I hope it is enough to get some feedback
> on the overall approach.
Sorry I don't like this.
Walking a linked list unnecessarily is going to add overhead to every
single packet transmission. I think more people want our TCP stack to
be fast (everyone) than those who want option processing to be
abstracted enough to be modular (you).
Just make the intrusive changes, they are necessary as they force you
to think fully about how one option might interact with another.
I also disagree with the "if the option doesn't fit, send it in the
next packet" idea. Where did that come from?
For SACK for example, that doesn't make any sense, and it's SACK
that usually can put us past the amount of space available. For
SACK the thing to do is send the SACK information for the area
closest to what we've fully ACKd and just forget about advertising
the rest of the SACK blocks.
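For reference, the trimming David describes for SACK is roughly the
arithmetic tcp_established_options() already performs. A user-space
sketch (constants copied from the kernel's include/net/tcp.h; the
helper name is ours):

```c
#include <assert.h>

/* Constants as in the kernel's include/net/tcp.h */
#define MAX_TCP_OPTION_SPACE      40
#define TCPOLEN_TSTAMP_ALIGNED    12
#define TCPOLEN_SACK_BASE_ALIGNED  4
#define TCPOLEN_SACK_PERBLOCK      8

/* How many SACK blocks actually fit once other options took their
 * share: advertise the closest blocks and drop the rest. */
static unsigned sack_blocks_that_fit(unsigned eff_sacks,
				     unsigned remaining)
{
	unsigned max = (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
		       TCPOLEN_SACK_PERBLOCK;

	return eff_sacks < max ? eff_sacks : max;
}
```

With timestamps in the segment (28 bytes of option space left), at
most 3 of the desired SACK blocks fit; the remainder is simply not
advertised rather than deferred.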
* Re: [RFC] tcp: add support for scheduling TCP options on TCP sockets
2014-05-07 5:38 ` David Miller
@ 2014-05-07 7:30 ` Octavian Purdila
2014-05-07 16:48 ` David Miller
[not found] ` <09b16b4c1a6147f8aa4137e2c50e2e74@UCL-MBX03.OASIS.UCLOUVAIN.BE>
[not found] ` <bc98535539ef4d2c9cb6f53f85068b65@UCL-MBX03.OASIS.UCLOUVAIN.BE>
1 sibling, 2 replies; 13+ messages in thread
From: Octavian Purdila @ 2014-05-07 7:30 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Christoph Paasch
On Wed, May 7, 2014 at 8:38 AM, David Miller <davem@davemloft.net> wrote:
>
> From: Octavian Purdila <octavian.purdila@intel.com>
> Date: Tue, 6 May 2014 21:05:24 +0300
>
> > Pardon the rough patch, but I hope it is enough to get some feedback
> > on the overall approach.
>
> Sorry I don't like this.
>
> Walking a linked list unnecessary is going to add overhead to every
> single packet transmission. I think more people want our TCP stack to
> be fast (everyone) than those who want option processing to be
> abstracted enough to be modular (you).
>
> Just make the intrusive changes, they are necessary as they force you
> to think fully about how one option might interact with another.
>
Unfortunately tcp_skb_cb does not have enough space to hold
information for new, large options. To work around that, the MPTCP
implementation pushes the option data into the skb and then
occasionally uses the following when pskb_copy is used:
-	else
+	if (unlikely(skb_cloned(skb))) {
+		struct sk_buff *newskb;
+		if (mptcp_is_data_seq(skb))
+			skb_push(skb, MPTCP_SUB_LEN_DSS_ALIGN +
+				      MPTCP_SUB_LEN_ACK_ALIGN +
+				      MPTCP_SUB_LEN_SEQ_ALIGN);
+
+		newskb = pskb_copy(skb, gfp_mask);
+
+		if (mptcp_is_data_seq(skb)) {
+			skb_pull(skb, MPTCP_SUB_LEN_DSS_ALIGN +
+				      MPTCP_SUB_LEN_ACK_ALIGN +
+				      MPTCP_SUB_LEN_SEQ_ALIGN);
+			if (newskb)
+				skb_pull(newskb, MPTCP_SUB_LEN_DSS_ALIGN +
+					 MPTCP_SUB_LEN_ACK_ALIGN +
+					 MPTCP_SUB_LEN_SEQ_ALIGN);
+		}
+		skb = newskb;
+	} else {
 		skb = skb_clone(skb, gfp_mask);
+	}
MPTCP has many other intrusive changes in the TCP stack. To avoid that
complexity, we could do the bulk of the implementation in a separate
layer, on top of TCP. But we would need a mechanism to pass the
options down to the TCP layer somehow.
> I also disagree with the "if the option doesn't fit, send it in the
> next packet" idea. Where did that come from?
>
> For SACK for example, that doesn't make any sense, and it's SACK
> that usually can put us past the amount of space available. For
> SACK the thing to do is send the SACK information for the area
> closest to what we've fully ACKd and just forget about advertising
> the rest of the SACK blocks.
It is useful for MPTCP. A few MPTCP options are 12-20 bytes, and if
you already have timestamps and SACK in the packet there is not enough
space left to include them. However, MPTCP does not require you to
send the options immediately; they can be sent at a later time.
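To make the space pressure concrete, a back-of-the-envelope sketch
(constants as in include/net/tcp.h; the helper is ours): with
timestamps plus two SACK blocks, only 8 bytes of the 40-byte option
space remain, less than the smallest 12-byte MPTCP option.

```c
#include <assert.h>

/* Option space left in an established-connection segment once
 * timestamps and SACK have taken their share (constants as in the
 * kernel's include/net/tcp.h). */
#define MAX_TCP_OPTION_SPACE      40
#define TCPOLEN_TSTAMP_ALIGNED    12
#define TCPOLEN_SACK_BASE_ALIGNED  4
#define TCPOLEN_SACK_PERBLOCK      8

static unsigned opt_space_left(unsigned sack_blocks)
{
	return MAX_TCP_OPTION_SPACE - TCPOLEN_TSTAMP_ALIGNED -
	       TCPOLEN_SACK_BASE_ALIGNED -
	       sack_blocks * TCPOLEN_SACK_PERBLOCK;
}
```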
The patch may look like "abstraction bloat", but it is just a way of
trying to avoid adding more complexity to the TCP stack when
implementing MPTCP. If people have other ideas of how we can
accomplish that, so that we can integrate MPTCP upstream, I am keenly
looking for suggestions.
* Re: [RFC] tcp: add support for scheduling TCP options on TCP sockets
[not found] ` <bc98535539ef4d2c9cb6f53f85068b65@UCL-MBX03.OASIS.UCLOUVAIN.BE>
@ 2014-05-07 13:46 ` Christoph Paasch
2014-05-07 14:04 ` Octavian Purdila
[not found] ` <906b020ccf1b4e1b98ac414147259a65@UCL-MBX03.OASIS.UCLOUVAIN.BE>
0 siblings, 2 replies; 13+ messages in thread
From: Christoph Paasch @ 2014-05-07 13:46 UTC (permalink / raw)
To: Octavian Purdila; +Cc: David Miller, netdev@vger.kernel.org
Hello Octavian,
On 07/05/14 - 07:30:23, Octavian Purdila wrote:
> On Wed, May 7, 2014 at 8:38 AM, David Miller <davem@davemloft.net> wrote:
> > Sorry I don't like this.
> >
> > Walking a linked list unnecessary is going to add overhead to every
> > single packet transmission. I think more people want our TCP stack to
> > be fast (everyone) than those who want option processing to be
> > abstracted enough to be modular (you).
> >
> > Just make the intrusive changes, they are necessary as they force you
> > to think fully about how one option might interact with another.
> >
>
> Unfortunately skb_tcp_cb does not have enough space to hold
> information for new large options. To work around that, the MPTCP
> implementation is pushing the option data in the skb and then
> occasionally uses the following when the pskb_copy is used:
>
> -	else
> +	if (unlikely(skb_cloned(skb))) {
> +		struct sk_buff *newskb;
> +		if (mptcp_is_data_seq(skb))
> +			skb_push(skb, MPTCP_SUB_LEN_DSS_ALIGN +
> +				      MPTCP_SUB_LEN_ACK_ALIGN +
> +				      MPTCP_SUB_LEN_SEQ_ALIGN);
> +
> +		newskb = pskb_copy(skb, gfp_mask);
> +
> +		if (mptcp_is_data_seq(skb)) {
> +			skb_pull(skb, MPTCP_SUB_LEN_DSS_ALIGN +
> +				      MPTCP_SUB_LEN_ACK_ALIGN +
> +				      MPTCP_SUB_LEN_SEQ_ALIGN);
> +			if (newskb)
> +				skb_pull(newskb, MPTCP_SUB_LEN_DSS_ALIGN +
> +					 MPTCP_SUB_LEN_ACK_ALIGN +
> +					 MPTCP_SUB_LEN_SEQ_ALIGN);
> +		}
> +		skb = newskb;
> +	} else {
>  		skb = skb_clone(skb, gfp_mask);
> +	}
>
> MPTCP has many other intrusive changes in the TCP stack. To avoid that
> complexity, we could do the bulk of the implementation in a separate
> layer, on top of TCP. But we would need a mechanism to pass the
> options down to the TCP layer somehow.
Why not extend the head-space of the linear data of the skb, as we
already discussed previously on mptcp-dev? Just in a similar way as
'struct can_skb_priv' is being used. This would avoid the expensive
list-processing and clean up the example you give above.
Or did something else prevent doing it that way?
> > I also disagree with the "if the option doesn't fit, send it in the
> > next packet" idea. Where did that come from?
> >
> > For SACK for example, that doesn't make any sense, and it's SACK
> > that usually can put us past the amount of space available. For
> > SACK the thing to do is send the SACK information for the area
> > closest to what we've fully ACKd and just forget about advertising
> > the rest of the SACK blocks.
>
> It is useful for MPTCP. A few MPTCP options are 12-20 bytes but if you
> already have timestamps and SACK in the packet there is not enough
> space to include it. However, MPTCP does not bound you to sent the
> options immediately, they can be sent at a later time.
Be careful which options you defer to a later time.
For the DSS option it is better to temporarily reduce the number of
SACK blocks to 2 instead of omitting the DSS option. Otherwise, you
need to support late receipt of DSS options inside pure ACKs at the
receiver (probably out of order), and thus be able to store a list of
DSS options.
Only ADD_ADDR/REMOVE_ADDR and MP_PRIO should be allowed to be sent at
a 'later' time. But for them you can also explicitly schedule an ACK.
Cheers,
Christoph
> The patch may look like an "abstraction bloat", but it is just a way
> of trying to to keep adding more complexity in the TCP stack when
> implementing MPTCP. If people have other ideas of how can we
> accomplish that, so that we can integrate MPTCP upstream, I am keenly
> looking for suggestions.
* Re: [RFC] tcp: add support for scheduling TCP options on TCP sockets
2014-05-07 13:46 ` Christoph Paasch
@ 2014-05-07 14:04 ` Octavian Purdila
[not found] ` <906b020ccf1b4e1b98ac414147259a65@UCL-MBX03.OASIS.UCLOUVAIN.BE>
1 sibling, 0 replies; 13+ messages in thread
From: Octavian Purdila @ 2014-05-07 14:04 UTC (permalink / raw)
To: Christoph Paasch; +Cc: David Miller, netdev@vger.kernel.org
On Wed, May 7, 2014 at 4:46 PM, Christoph Paasch
<christoph.paasch@uclouvain.be> wrote:
> Hello Octavian,
>
> On 07/05/14 - 07:30:23, Octavian Purdila wrote:
>> On Wed, May 7, 2014 at 8:38 AM, David Miller <davem@davemloft.net> wrote:
>> > Sorry I don't like this.
>> >
>> > Walking a linked list unnecessary is going to add overhead to every
>> > single packet transmission. I think more people want our TCP stack to
>> > be fast (everyone) than those who want option processing to be
>> > abstracted enough to be modular (you).
>> >
>> > Just make the intrusive changes, they are necessary as they force you
>> > to think fully about how one option might interact with another.
>> >
>>
>> Unfortunately skb_tcp_cb does not have enough space to hold
>> information for new large options. To work around that, the MPTCP
>> implementation is pushing the option data in the skb and then
>> occasionally uses the following when the pskb_copy is used:
>>
>> - else
>> + if (unlikely(skb_cloned(skb))) {
>> + struct sk_buff *newskb;
>> + if (mptcp_is_data_seq(skb))
>> + skb_push(skb, MPTCP_SUB_LEN_DSS_ALIGN +
>> + MPTCP_SUB_LEN_ACK_ALIGN +
>> + MPTCP_SUB_LEN_SEQ_ALIGN);
>> +
>> + newskb = pskb_copy(skb, gfp_mask);
>> +
>> + if (mptcp_is_data_seq(skb)) {
>> + skb_pull(skb, MPTCP_SUB_LEN_DSS_ALIGN +
>> + MPTCP_SUB_LEN_ACK_ALIGN +
>> + MPTCP_SUB_LEN_SEQ_ALIGN);
>> + if (newskb)
>> + skb_pull(newskb,
>> MPTCP_SUB_LEN_DSS_ALIGN +
>> +
>> MPTCP_SUB_LEN_ACK_ALIGN +
>> +
>> MPTCP_SUB_LEN_SEQ_ALIGN);
>> + }
>> + skb = newskb;
>> + } else {
>> skb = skb_clone(skb, gfp_mask);
>> + }
>>
>> MPTCP has many other intrusive changes in the TCP stack. To avoid that
>> complexity, we could do the bulk of the implementation in a separate
>> layer, on top of TCP. But we would need a mechanism to pass the
>> options down to the TCP layer somehow.
>
> Why not extend the head-space of the linear data of the skb as we discussed
> already previously on mptcp-dev? Just in a similar way as 'struct can_skb_priv'
> is being used. This would avoid this expensive list-processing and clean up
> the above example you give.
>
> Or did something else prevented to do it in such a way?
>
You mean storing options at skb->head? Wouldn't we have the same issue
as above for pskb_copy?
* Re: [RFC] tcp: add support for scheduling TCP options on TCP sockets
[not found] ` <906b020ccf1b4e1b98ac414147259a65@UCL-MBX03.OASIS.UCLOUVAIN.BE>
@ 2014-05-07 14:11 ` Christoph Paasch
2014-05-07 15:17 ` Octavian Purdila
2014-05-07 17:23 ` David Miller
0 siblings, 2 replies; 13+ messages in thread
From: Christoph Paasch @ 2014-05-07 14:11 UTC (permalink / raw)
To: Octavian Purdila; +Cc: David Miller, netdev@vger.kernel.org
On 07/05/14 - 14:04:36, Octavian Purdila wrote:
> On Wed, May 7, 2014 at 4:46 PM, Christoph Paasch
> <christoph.paasch@uclouvain.be> wrote:
> > On 07/05/14 - 07:30:23, Octavian Purdila wrote:
> >> Unfortunately skb_tcp_cb does not have enough space to hold
> >> information for new large options. To work around that, the MPTCP
> >> implementation is pushing the option data in the skb and then
> >> occasionally uses the following when the pskb_copy is used:
> >>
> >> -	else
> >> +	if (unlikely(skb_cloned(skb))) {
> >> +		struct sk_buff *newskb;
> >> +		if (mptcp_is_data_seq(skb))
> >> +			skb_push(skb, MPTCP_SUB_LEN_DSS_ALIGN +
> >> +				      MPTCP_SUB_LEN_ACK_ALIGN +
> >> +				      MPTCP_SUB_LEN_SEQ_ALIGN);
> >> +
> >> +		newskb = pskb_copy(skb, gfp_mask);
> >> +
> >> +		if (mptcp_is_data_seq(skb)) {
> >> +			skb_pull(skb, MPTCP_SUB_LEN_DSS_ALIGN +
> >> +				      MPTCP_SUB_LEN_ACK_ALIGN +
> >> +				      MPTCP_SUB_LEN_SEQ_ALIGN);
> >> +			if (newskb)
> >> +				skb_pull(newskb, MPTCP_SUB_LEN_DSS_ALIGN +
> >> +					 MPTCP_SUB_LEN_ACK_ALIGN +
> >> +					 MPTCP_SUB_LEN_SEQ_ALIGN);
> >> +		}
> >> +		skb = newskb;
> >> +	} else {
> >>  		skb = skb_clone(skb, gfp_mask);
> >> +	}
> >>
> >> MPTCP has many other intrusive changes in the TCP stack. To avoid that
> >> complexity, we could do the bulk of the implementation in a separate
> >> layer, on top of TCP. But we would need a mechanism to pass the
> >> options down to the TCP layer somehow.
> >
> > Why not extend the head-space of the linear data of the skb as we discussed
> > already previously on mptcp-dev? Just in a similar way as 'struct can_skb_priv'
> > is being used. This would avoid this expensive list-processing and clean up
> > the above example you give.
> >
> > Or did something else prevented to do it in such a way?
> >
>
> You mean storing options at skb->head? Wouldn't we have the same issue
> as above for pskb_copy?
Yes, but it could be done in a more "clean" way, so that future
extensions to TCP are no longer limited by the limitations of struct
tcp_skb_cb.
Basically, allow some memory inside the linear part to be used by the
layer the skb is currently at, and let pskb_copy handle it properly
(not like the current 'hack' in tcp_transmit_skb).
This allows extensions at any layer that are not widely enough used to
justify increasing skb->cb.
Cheers,
Christoph
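A rough user-space analogue of the idea (names and sizes here are
ours, loosely modeled on how 'struct can_skb_priv' reserves head-room
in the kernel): the layer-private region travels in front of the
payload, and the copy helper preserves it, so no push/pull dance is
needed around the copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of a buffer that reserves 'priv_len' bytes in front of the
 * payload for per-layer data; the copy helper keeps that region. */
struct buf {
	size_t priv_len;		/* layer-private bytes before payload */
	size_t data_len;		/* payload bytes */
	unsigned char mem[64];		/* priv region followed by payload */
};

static unsigned char *buf_priv(struct buf *b) { return b->mem; }
static unsigned char *buf_data(struct buf *b) { return b->mem + b->priv_len; }

/* Analogue of pskb_copy() handling the private head-room properly:
 * the copy carries the priv region along with the payload. */
static struct buf *buf_copy(const struct buf *b)
{
	struct buf *n = malloc(sizeof(*n));

	if (n)
		memcpy(n, b, sizeof(*n));
	return n;
}
```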
* Re: [RFC] tcp: add support for scheduling TCP options on TCP sockets
2014-05-07 14:11 ` Christoph Paasch
@ 2014-05-07 15:17 ` Octavian Purdila
2014-05-07 15:32 ` Eric Dumazet
2014-05-07 17:24 ` David Miller
2014-05-07 17:23 ` David Miller
1 sibling, 2 replies; 13+ messages in thread
From: Octavian Purdila @ 2014-05-07 15:17 UTC (permalink / raw)
To: Christoph Paasch; +Cc: David Miller, netdev@vger.kernel.org
On Wed, May 7, 2014 at 5:11 PM, Christoph Paasch
<christoph.paasch@uclouvain.be> wrote:
> On 07/05/14 - 14:04:36, Octavian Purdila wrote:
>> On Wed, May 7, 2014 at 4:46 PM, Christoph Paasch
>> <christoph.paasch@uclouvain.be> wrote:
>> > On 07/05/14 - 07:30:23, Octavian Purdila wrote:
>> >> Unfortunately skb_tcp_cb does not have enough space to hold
>> >> information for new large options. To work around that, the MPTCP
>> >> implementation is pushing the option data in the skb and then
>> >> occasionally uses the following when the pskb_copy is used:
>> >>
>> >> -	else
>> >> +	if (unlikely(skb_cloned(skb))) {
>> >> +		struct sk_buff *newskb;
>> >> +		if (mptcp_is_data_seq(skb))
>> >> +			skb_push(skb, MPTCP_SUB_LEN_DSS_ALIGN +
>> >> +				      MPTCP_SUB_LEN_ACK_ALIGN +
>> >> +				      MPTCP_SUB_LEN_SEQ_ALIGN);
>> >> +
>> >> +		newskb = pskb_copy(skb, gfp_mask);
>> >> +
>> >> +		if (mptcp_is_data_seq(skb)) {
>> >> +			skb_pull(skb, MPTCP_SUB_LEN_DSS_ALIGN +
>> >> +				      MPTCP_SUB_LEN_ACK_ALIGN +
>> >> +				      MPTCP_SUB_LEN_SEQ_ALIGN);
>> >> +			if (newskb)
>> >> +				skb_pull(newskb, MPTCP_SUB_LEN_DSS_ALIGN +
>> >> +					 MPTCP_SUB_LEN_ACK_ALIGN +
>> >> +					 MPTCP_SUB_LEN_SEQ_ALIGN);
>> >> +		}
>> >> +		skb = newskb;
>> >> +	} else {
>> >>  		skb = skb_clone(skb, gfp_mask);
>> >> +	}
>> >>
>> >> MPTCP has many other intrusive changes in the TCP stack. To avoid that
>> >> complexity, we could do the bulk of the implementation in a separate
>> >> layer, on top of TCP. But we would need a mechanism to pass the
>> >> options down to the TCP layer somehow.
>> >
>> > Why not extend the head-space of the linear data of the skb as we discussed
>> > already previously on mptcp-dev? Just in a similar way as 'struct can_skb_priv'
>> > is being used. This would avoid this expensive list-processing and clean up
>> > the above example you give.
>> >
>> > Or did something else prevented to do it in such a way?
>> >
>>
>> You mean storing options at skb->head? Wouldn't we have the same issue
>> as above for pskb_copy?
>
> Yes, but it could be done in a more "clean" way so that future extensions to
> TCP are no more limited by the limitation of struct tcp_skb_cb.
>
> Basically, allow some memory inside the linear part to be used by the layer
> the skb is currently at and let pskb_copy handle it properly (not like the
> current 'hack' in tcp_transmit_skb).
> This allows extensions at any layer who are not widely enough used to justify increasing
> skb->cb.
>
That would require adding a new field to sk_buff to keep track of how
much we need to copy in pskb_copy. Fortunately it seems it has some
holes we could use.
David, does that seem reasonable?
* Re: [RFC] tcp: add support for scheduling TCP options on TCP sockets
2014-05-07 15:17 ` Octavian Purdila
@ 2014-05-07 15:32 ` Eric Dumazet
2014-05-07 18:15 ` Octavian Purdila
2014-05-07 17:24 ` David Miller
1 sibling, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2014-05-07 15:32 UTC (permalink / raw)
To: Octavian Purdila; +Cc: Christoph Paasch, David Miller, netdev@vger.kernel.org
On Wed, 2014-05-07 at 18:17 +0300, Octavian Purdila wrote:
>
> That would require adding a new field to sk_buff to keep track of how
> much we need to copy in pskb_copy. Fortunately it seems it has some
> holes we could use.
>
> David, does that seem reasonable?
MPTCP and/or Minion could be implemented in user land, with a generic
infrastructure in the kernel to directly pass raw packets once demux
has been done by the kernel.
Large CDNs can't always assume that the multiple paths used to hit a
particular destination can all originate from a single host in the
datacenter.
MPTCP has had this hard assumption from day one.
Our team tries to focus on making TCP better; we believe all these
experimental features do not belong in the kernel yet.
* Re: [RFC] tcp: add support for scheduling TCP options on TCP sockets
2014-05-07 7:30 ` Octavian Purdila
@ 2014-05-07 16:48 ` David Miller
[not found] ` <09b16b4c1a6147f8aa4137e2c50e2e74@UCL-MBX03.OASIS.UCLOUVAIN.BE>
1 sibling, 0 replies; 13+ messages in thread
From: David Miller @ 2014-05-07 16:48 UTC (permalink / raw)
To: octavian.purdila; +Cc: netdev, christoph.paasch
From: Octavian Purdila <octavian.purdila@intel.com>
Date: Wed, 7 May 2014 10:30:23 +0300
> On Wed, May 7, 2014 at 8:38 AM, David Miller <davem@davemloft.net> wrote:
>>
>> From: Octavian Purdila <octavian.purdila@intel.com>
>> Date: Tue, 6 May 2014 21:05:24 +0300
>>
>> > Pardon the rough patch, but I hope it is enough to get some feedback
>> > on the overall approach.
>>
>> Sorry I don't like this.
>>
>> Walking a linked list unnecessarily is going to add overhead to every
>> single packet transmission. I think more people want our TCP stack to
>> be fast (everyone) than those who want option processing to be
>> abstracted enough to be modular (you).
>>
>> Just make the intrusive changes, they are necessary as they force you
>> to think fully about how one option might interact with another.
>>
>
> Unfortunately tcp_skb_cb does not have enough space to hold
> information for new large options. To work around that, the MPTCP
> implementation pushes the option data into the skb and then
> occasionally uses the following when pskb_copy is used:
Why not deal with the problem directly by trying to find a way to
compress the existing use of tcp_skb_cb so that there is actually
the amount of space you need?
That is the approach I definitely prefer you take.
* Re: [RFC] tcp: add support for scheduling TCP options on TCP sockets
2014-05-07 14:11 ` Christoph Paasch
2014-05-07 15:17 ` Octavian Purdila
@ 2014-05-07 17:23 ` David Miller
1 sibling, 0 replies; 13+ messages in thread
From: David Miller @ 2014-05-07 17:23 UTC (permalink / raw)
To: christoph.paasch; +Cc: octavian.purdila, netdev
From: Christoph Paasch <christoph.paasch@uclouvain.be>
Date: Wed, 7 May 2014 16:11:08 +0200
> Yes, but it could be done in a more "clean" way so that future
> extensions to TCP are no more limited by the limitation of struct
> tcp_skb_cb.
Would you stop harping on this point?
Adding new options to TCP is a really big deal, it implicitly
involves deeply understanding how it interacts with other options
and what is the most optimal way to insert those options in the
output path.
It should be "hard".
* Re: [RFC] tcp: add support for scheduling TCP options on TCP sockets
2014-05-07 15:17 ` Octavian Purdila
2014-05-07 15:32 ` Eric Dumazet
@ 2014-05-07 17:24 ` David Miller
1 sibling, 0 replies; 13+ messages in thread
From: David Miller @ 2014-05-07 17:24 UTC (permalink / raw)
To: octavian.purdila; +Cc: christoph.paasch, netdev
From: Octavian Purdila <octavian.purdila@intel.com>
Date: Wed, 7 May 2014 18:17:08 +0300
> That would require adding a new field to sk_buff to keep track of how
> much we need to copy in pskb_copy. Fortunately it seems it has some
> holes we could use.
>
> David, does that seem reasonable?
As long as you can find a way to delete something in sk_buff to make
room for it, sure.
I am very serious; I don't want sk_buff to turn into something like
mm_struct, which seems to double in size every time I look at it.
* Re: [RFC] tcp: add support for scheduling TCP options on TCP sockets
2014-05-07 15:32 ` Eric Dumazet
@ 2014-05-07 18:15 ` Octavian Purdila
0 siblings, 0 replies; 13+ messages in thread
From: Octavian Purdila @ 2014-05-07 18:15 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Christoph Paasch, David Miller, netdev@vger.kernel.org
On Wed, May 7, 2014 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2014-05-07 at 18:17 +0300, Octavian Purdila wrote:
>
>>
>> That would require adding a new field to sk_buff to keep track of how
>> much we need to copy in pskb_copy. Fortunately it seems it has some
>> holes we could use.
>>
>> David, does that seem reasonable?
>
> MPTCP and/or Minion could be implemented in user land, with a generic
> infrastructure in the kernel to directly pass raw packets once demux
> has been done by the kernel.
>
Implementing MPTCP in userspace makes a lot of sense to me, as you can
naturally consider MPTCP a layer on top of TCP. Passing raw packets
is one option; another is creating the infrastructure to
send/receive TCP options with standard TCP socket APIs like sendmsg
or recvmsg.
The first option is more flexible, but on the other hand (at least for
MPTCP) it means that userspace must duplicate the "basic" TCP
functionality. The second option allows one to reuse the TCP stack in
the kernel while adding new functionality on top of it.
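[Editorial sketch] The second option could plausibly be exposed through ancillary data on sendmsg()/recvmsg(). The sketch below is purely hypothetical: TCP_SCHED_OPT is not a real kernel API, and the cmsg layout is an assumption used only to illustrate the idea.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical cmsg type that would let sendmsg() schedule a raw TCP
 * option on an outgoing segment. This constant does NOT exist in the
 * kernel; it is an assumption for illustration only. */
#define TCP_SCHED_OPT 200

/* Fill a control buffer with one scheduled-option cmsg carrying the
 * raw option bytes (kind, length, payload); returns the controllen
 * the caller would pass to sendmsg(). */
static size_t build_opt_cmsg(char *ctrl, size_t ctrl_len,
                             const uint8_t *opt, size_t opt_len)
{
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_control = ctrl;
    msg.msg_controllen = ctrl_len;

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = IPPROTO_TCP;
    cmsg->cmsg_type = TCP_SCHED_OPT;
    cmsg->cmsg_len = CMSG_LEN(opt_len);
    memcpy(CMSG_DATA(cmsg), opt, opt_len);

    return CMSG_SPACE(opt_len);
}
```

With such an interface, userspace would pass this control buffer to sendmsg() alongside its payload, and recvmsg() could deliver options from the peer the same way, keeping the core TCP stack unchanged.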
> Large CDNs can't always assume that the multiple paths to a particular
> destination can all originate from a single host in the datacenter.
>
> MPTCP has had this hard assumption since day one.
>
> Our team tries to focus on making TCP better; we believe these
> experimental features do not belong in the kernel yet.
I hear you; the TCP stack is already very complex. That is why I am
looking at ways to manage the increased complexity that MPTCP brings
by using a layered approach.
* Re: [RFC] tcp: add support for scheduling TCP options on TCP sockets
[not found] ` <09b16b4c1a6147f8aa4137e2c50e2e74@UCL-MBX03.OASIS.UCLOUVAIN.BE>
@ 2014-05-08 8:32 ` Christoph Paasch
0 siblings, 0 replies; 13+ messages in thread
From: Christoph Paasch @ 2014-05-08 8:32 UTC (permalink / raw)
To: David Miller; +Cc: octavian.purdila@intel.com, netdev@vger.kernel.org
On 07/05/14 - 16:48:59, David Miller wrote:
> From: Octavian Purdila <octavian.purdila@intel.com>
> Date: Wed, 7 May 2014 10:30:23 +0300
>
> > On Wed, May 7, 2014 at 8:38 AM, David Miller <davem@davemloft.net> wrote:
> >>
> >> From: Octavian Purdila <octavian.purdila@intel.com>
> >> Date: Tue, 6 May 2014 21:05:24 +0300
> >>
> >> > Pardon the rough patch, but I hope it is enough to get some feedback
> >> > on the overall approach.
> >>
> >> Sorry I don't like this.
> >>
> >> Walking a linked list unnecessarily is going to add overhead to every
> >> single packet transmission. I think more people want our TCP stack to
> >> be fast (everyone) than those who want option processing to be
> >> abstracted enough to be modular (you).
> >>
> >> Just make the intrusive changes, they are necessary as they force you
> >> to think fully about how one option might interact with another.
> >>
> >
> > Unfortunately tcp_skb_cb does not have enough space to hold
> > information for new large options. To work around that, the MPTCP
> > implementation pushes the option data into the skb and then
> > occasionally uses the following when pskb_copy is used:
>
> Why not deal with the problem directly by trying to find a way to
> compress the existing use of tcp_skb_cb so that there is actually
> the amount of space you need?
It might be possible to replace accesses to end_seq by calculating
(seq + len + fin/syn). That way, we gain 4 bytes. Would this be acceptable?
And we could union tcp_flags/ip_dsfield, as suggested in commit
b82d1bb4fd206 ("tcp: unalias tcp_skb_cb flags and ip_dsfield").
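[Editorial sketch] The end_seq suggestion can be sketched in plain C. The struct below is a simplified stand-in, not the real tcp_skb_cb; only the derivation of end_seq from seq, payload length, and the SYN/FIN flags reflects the suggestion above.

```c
#include <stdint.h>

/* TCP header flag bits, as defined in include/net/tcp.h */
#define TCPHDR_FIN 0x01
#define TCPHDR_SYN 0x02

/* Simplified stand-in for a compressed tcp_skb_cb: end_seq is no
 * longer stored, freeing 4 bytes; tcp_flags is the single byte that
 * can also be unioned with ip_dsfield (commit b82d1bb4fd206). */
struct tcb_small {
    uint32_t seq;      /* starting sequence number */
    uint8_t tcp_flags; /* TCP header flags */
};

/* Derive end_seq on demand instead of storing it. */
static inline uint32_t tcb_end_seq(const struct tcb_small *cb, uint32_t len)
{
    /* SYN and FIN each consume one sequence number */
    return cb->seq + len +
           !!(cb->tcp_flags & TCPHDR_SYN) +
           !!(cb->tcp_flags & TCPHDR_FIN);
}
```

The cost is an add (and flag tests) per access instead of a load, which is the trade-off such a change would have to justify.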
Cheers,
Christoph
end of thread
Thread overview: 13+ messages
2014-05-06 18:05 [RFC] tcp: add support for scheduling TCP options on TCP sockets Octavian Purdila
2014-05-07 5:38 ` David Miller
2014-05-07 7:30 ` Octavian Purdila
2014-05-07 16:48 ` David Miller
[not found] ` <09b16b4c1a6147f8aa4137e2c50e2e74@UCL-MBX03.OASIS.UCLOUVAIN.BE>
2014-05-08 8:32 ` Christoph Paasch
[not found] ` <bc98535539ef4d2c9cb6f53f85068b65@UCL-MBX03.OASIS.UCLOUVAIN.BE>
2014-05-07 13:46 ` Christoph Paasch
2014-05-07 14:04 ` Octavian Purdila
[not found] ` <906b020ccf1b4e1b98ac414147259a65@UCL-MBX03.OASIS.UCLOUVAIN.BE>
2014-05-07 14:11 ` Christoph Paasch
2014-05-07 15:17 ` Octavian Purdila
2014-05-07 15:32 ` Eric Dumazet
2014-05-07 18:15 ` Octavian Purdila
2014-05-07 17:24 ` David Miller
2014-05-07 17:23 ` David Miller