Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] bpf: fix bpf_tail_call() x64 JIT
From: Eric Dumazet @ 2017-10-03 22:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S . Miller, Daniel Borkmann, Martin KaFai Lau, netdev,
	kernel-team
In-Reply-To: <20171003223720.1517246-1-ast@fb.com>

On Tue, 2017-10-03 at 15:37 -0700, Alexei Starovoitov wrote:
> - bpf prog_array just like all other types of bpf array accepts 32-bit index.
>   Clarify that in the comment.
> - fix x64 JIT of bpf_tail_call which was incorrectly loading 8 instead of 4 bytes
> - tighten corresponding check in the interpreter to stay consistent
> 
> The JIT bug can be triggered after introduction of BPF_F_NUMA_NODE flag
> in commit 96eabe7a40aa in 4.14. Before that the map_flags would stay zero and
> though JIT code is wrong it will check bounds correctly.
> Hence two fixes tags. All other JITs don't have this problem.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> Fixes: 96eabe7a40aa ("bpf: Allow selecting numa node during map creation")
> Fixes: b52f00e6a715 ("x86: bpf_jit: implement bpf_tail_call() helper")
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Martin KaFai Lau <kafai@fb.com>
> ---
> Backport to stable would be nice, but not strictly necessary.

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks !

^ permalink raw reply

* Re: [PATCH V2] Fix a sleep-in-atomic bug in shash_setkey_unaligned
From: Marcelo Ricardo Leitner @ 2017-10-03 22:46 UTC (permalink / raw)
  To: Jia-Ju Bai
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q,
	nhorman-2XuSBdqkA4R54TAoqtyWWQ, vyasevich-Re5JQEeQqe8AvxtiuMwx3w,
	luto-DgEjT+Ai2ygdnm+yROfE0A, kvalo-sgV2jX0FEOL9JmXXK+q4OQ,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-sctp-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20171003223308.GD19750-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>

On Tue, Oct 03, 2017 at 07:33:08PM -0300, Marcelo Ricardo Leitner wrote:
> On Tue, Oct 03, 2017 at 10:25:22AM +0800, Jia-Ju Bai wrote:
> > The SCTP program may sleep under a spinlock, and the function call path is:
> > sctp_generate_t3_rtx_event (acquire the spinlock)
> >   sctp_do_sm
> >     sctp_side_effects
> >       sctp_cmd_interpreter
> >         sctp_make_init_ack
> >           sctp_pack_cookie
> >             crypto_shash_setkey
> >               shash_setkey_unaligned
> >                 kmalloc(GFP_KERNEL)
> 
> Are you sure this can happen?
> The host is not supposed to store any information when replying to an
> INIT packet (which generated the INIT_ACK listed above). That said,
> it's weird to see the timer function triggering so.
> 
> Checking now, that code is dead actually:
> $ git grep -A 2 SCTP_CMD_GEN_INIT_ACK
> sm_sideeffect.c:                case SCTP_CMD_GEN_INIT_ACK:
> sm_sideeffect.c-                        /* Generate an INIT ACK chunk.
> */
> sm_sideeffect.c-                        new_obj =
> sctp_make_init_ack(asoc, chunk, GFP_ATOMIC,
> 
> Nobody is triggering a call to sctp_cmd_interpreter with
> SCTP_CMD_GEN_INIT_ACK command, which would generate the callstack
> above.

Nevertheless, the issue is real through other call paths.

Thanks,
Marcelo

^ permalink raw reply

* Re: [PATCH V2] Fix a sleep-in-atomic bug in shash_setkey_unaligned
From: Marcelo Ricardo Leitner @ 2017-10-03 22:45 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Andy Lutomirski, Jia-Ju Bai, David S. Miller, Neil Horman,
	vyasevich, Kalle Valo, Linux Crypto Mailing List,
	Network Development, linux-sctp, Linux Wireless List
In-Reply-To: <20171003052643.GB22750@gondor.apana.org.au>

On Tue, Oct 03, 2017 at 01:26:43PM +0800, Herbert Xu wrote:
> On Mon, Oct 02, 2017 at 09:18:24PM -0700, Andy Lutomirski wrote:
> > > On Oct 2, 2017, at 7:25 PM, Jia-Ju Bai <baijiaju1990@163.com> wrote:
> > >
> > > The SCTP program may sleep under a spinlock, and the function call path is:
> > > sctp_generate_t3_rtx_event (acquire the spinlock)
> > >  sctp_do_sm
> > >    sctp_side_effects
> > >      sctp_cmd_interpreter
> > >        sctp_make_init_ack
> > >          sctp_pack_cookie
> > >            crypto_shash_setkey
> > >              shash_setkey_unaligned
> > >                kmalloc(GFP_KERNEL)
> > 
> > I'm going to go out on a limb here: why on Earth is out crypto API so
> > full of indirection that we allocate memory at all here?
> 
> The crypto API operates on a one key per-tfm basis.  So normally
> tfm allocation and key setting is done once only and not done on
> the data path.
> 
> I have looked at the SCTP code and it appears to fit this paradigm.
> That is, we should be able to allocate the tfm and set the key when
> the key is actually generated via get_random_bytes, rather than every
> time the key is used which is not only a waste but as you see runs
> into API issues.

Fair point, but

> 
> Usually if you're invoking setkey from a non-sleeping code-path
> you're probably doing something wrong.

Usually but not always. There are 3 calls to that function on SCTP
code:
- pack a cookie, which is sent on an INIT_ACK packet to the client
- unpack the cookie above, after it is sent back by the client on a
  COOKIE_ECHO packet
- send a chunk authenticated by a hash

the first two happen during softirq processing, while processing a
packet that was received.

As I explained on the other email, SCTP code is not supposed to store
any information about the peer between the 1st and the 2nd moments
above, to be less vulnerable to DoS attacks (it's planned so by the
RFC), thus why it uses the cookie.

The 3rd one we probably can improve, but I don't think we can do much
about the 2 first ones from the SCTP side.

Note on sctp_sf_do_5_1B_init() how sctp_make_init_ack() is explicitly
called with GFP_ATOMIC, and also on sctp_sf_do_unexpected_init().
Though we can't propagate that to crypto_shash_setkey.

Ideas?

Thanks,
Marcelo

> 
> As someone else noted recently, there is no single forum for
> reviewing code that uses the crypto API so buggy code like this
> is not surprising.
> 
> > We're synchronously computing a hash of a small amount of data using
> > either HMAC-SHA1 or HMAC-SHA256 (determined at runtime) if I read it
> > right.  There's a sane way to do this that doesn't need kmalloc,
> > alloca, or fancy indirection.  And then there's crypto_shash_xyz().
> 
> There are some legitimate cases where you want to use a different
> key for every hashing operation.  But so far these are uses have
> been very few so there has been no need to provide an API for them.
> 
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH net-next 0/3] A own subdirectory for shared TCP code
From: Eric Dumazet @ 2017-10-03 22:42 UTC (permalink / raw)
  To: Richard Sailer; +Cc: netdev, davem, kuznet, yoshfuji
In-Reply-To: <20171003222213.7996-1-richard_siegfried@systemli.org>

On Wed, 2017-10-04 at 00:22 +0200, Richard Sailer wrote:
> net/ipv4 currently contains around 100 source files containing the
> IP implementation and lots of other functionality (UDP, TCP, xfrm, etc.)
> 
> # 1/3
> To make the networking source tree more self documenting and well 
> structured 1/3 moves the 30 shared TCP source files to a own 
> subdirectory of net/ipv4/.
> 
> # 2/3 
> A huge part of the TCP code is congestion control algorithms.
> With the same goal as above, this moves the cc algos to a own
> subdirectory of net/ipv4/tcp/
> 
> # 3/3
> Implementation of a suggestion of valdis, to have a new directory
> net/ipv4v6_shared that contains all the code shared between
> IPv4 and IPv6. This creates net/ipv4v6_shared and moves the 
> net/ipv4/tcp directory of shared TCP code there.
> 
> While every commit depends on its previous commits, it is
> also possible and sensible to only apply (1) or (1,2) , depending
> on how conservative one wants to decide regarding directory
> structure changes.
> 
> [PATCH net-next 1/3] net/ipv4: Move shared tcp code to own subdirectory
> [PATCH net-next 2/3] tcp: Move cc algorithms to own subdirectory
> [PATCH net-next 3/3] Move shared tcp code to net/ipv4v6shared


NACK for the whole series.

Simple reason :

This is going to be a maintenance nightmare, for people dealing with TCP
stack every day.

^ permalink raw reply

* Re: [PATCH] net: stmmac: dwmac-rk: Add RK3128 GMAC support
From: David Miller @ 2017-10-03 22:40 UTC (permalink / raw)
  To: david.wu
  Cc: heiko, peppe.cavallaro, alexandre.torgue, huangtao, netdev,
	linux-rockchip, linux-kernel
In-Reply-To: <1506764843-24681-1-git-send-email-david.wu@rock-chips.com>

From: David Wu <david.wu@rock-chips.com>
Date: Sat, 30 Sep 2017 17:47:23 +0800

> Add constants and callback functions for the dwmac on rk3128 soc.
> As can be seen, the base structure is the same, only registers
> and the bits in them moved slightly.
> 
> Signed-off-by: David Wu <david.wu@rock-chips.com>

Applied, thank you.

^ permalink raw reply

* [PATCH net] bpf: fix bpf_tail_call() x64 JIT
From: Alexei Starovoitov @ 2017-10-03 22:37 UTC (permalink / raw)
  To: David S . Miller; +Cc: Daniel Borkmann, Martin KaFai Lau, netdev, kernel-team

- bpf prog_array just like all other types of bpf array accepts 32-bit index.
  Clarify that in the comment.
- fix x64 JIT of bpf_tail_call which was incorrectly loading 8 instead of 4 bytes
- tighten corresponding check in the interpreter to stay consistent

The JIT bug can be triggered after introduction of BPF_F_NUMA_NODE flag
in commit 96eabe7a40aa in 4.14. Before that the map_flags would stay zero and
though JIT code is wrong it will check bounds correctly.
Hence two fixes tags. All other JITs don't have this problem.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 96eabe7a40aa ("bpf: Allow selecting numa node during map creation")
Fixes: b52f00e6a715 ("x86: bpf_jit: implement bpf_tail_call() helper")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
Backport to stable would be nice, but not strictly necessary.

 arch/x86/net/bpf_jit_comp.c | 4 ++--
 include/uapi/linux/bpf.h    | 2 +-
 kernel/bpf/core.c           | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 8c9573660d51..0554e8aef4d5 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -284,9 +284,9 @@ static void emit_bpf_tail_call(u8 **pprog)
 	/* if (index >= array->map.max_entries)
 	 *   goto out;
 	 */
-	EMIT4(0x48, 0x8B, 0x46,                   /* mov rax, qword ptr [rsi + 16] */
+	EMIT2(0x89, 0xD2);                        /* mov edx, edx */
+	EMIT3(0x39, 0x56,                         /* cmp dword ptr [rsi + 16], edx */
 	      offsetof(struct bpf_array, map.max_entries));
-	EMIT3(0x48, 0x39, 0xD0);                  /* cmp rax, rdx */
 #define OFFSET1 43 /* number of bytes to jump */
 	EMIT2(X86_JBE, OFFSET1);                  /* jbe out */
 	label1 = cnt;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 43ab5c402f98..f90860d1f897 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -312,7 +312,7 @@ union bpf_attr {
  *     jump into another BPF program
  *     @ctx: context pointer passed to next program
  *     @prog_array_map: pointer to map which type is BPF_MAP_TYPE_PROG_ARRAY
- *     @index: index inside array that selects specific program to run
+ *     @index: 32-bit index inside array that selects specific program to run
  *     Return: 0 on success or negative error
  *
  * int bpf_clone_redirect(skb, ifindex, flags)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 917cc04a0a94..7b62df86be1d 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1022,7 +1022,7 @@ static unsigned int ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn,
 		struct bpf_map *map = (struct bpf_map *) (unsigned long) BPF_R2;
 		struct bpf_array *array = container_of(map, struct bpf_array, map);
 		struct bpf_prog *prog;
-		u64 index = BPF_R3;
+		u32 index = BPF_R3;
 
 		if (unlikely(index >= array->map.max_entries))
 			goto out;
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH V2] Fix a sleep-in-atomic bug in shash_setkey_unaligned
From: Marcelo Ricardo Leitner @ 2017-10-03 22:33 UTC (permalink / raw)
  To: Jia-Ju Bai
  Cc: davem, herbert, nhorman, vyasevich, luto, kvalo, linux-crypto,
	netdev, linux-sctp, linux-wireless
In-Reply-To: <1506997522-26684-1-git-send-email-baijiaju1990@163.com>

On Tue, Oct 03, 2017 at 10:25:22AM +0800, Jia-Ju Bai wrote:
> The SCTP program may sleep under a spinlock, and the function call path is:
> sctp_generate_t3_rtx_event (acquire the spinlock)
>   sctp_do_sm
>     sctp_side_effects
>       sctp_cmd_interpreter
>         sctp_make_init_ack
>           sctp_pack_cookie
>             crypto_shash_setkey
>               shash_setkey_unaligned
>                 kmalloc(GFP_KERNEL)

Are you sure this can happen?
The host is not supposed to store any information when replying to an
INIT packet (which generated the INIT_ACK listed above). That said,
it's weird to see the timer function triggering so.

Checking now, that code is dead actually:
$ git grep -A 2 SCTP_CMD_GEN_INIT_ACK
sm_sideeffect.c:                case SCTP_CMD_GEN_INIT_ACK:
sm_sideeffect.c-                        /* Generate an INIT ACK chunk.
*/
sm_sideeffect.c-                        new_obj =
sctp_make_init_ack(asoc, chunk, GFP_ATOMIC,

Nobody is triggering a call to sctp_cmd_interpreter with
SCTP_CMD_GEN_INIT_ACK command, which would generate the callstack
above.

  Marcelo

^ permalink raw reply

* [PATCH net-next 2/3] tcp: Move cc algorithms to own subdirectory
From: Richard Sailer @ 2017-10-03 22:22 UTC (permalink / raw)
  To: netdev, davem; +Cc: kuznet, yoshfuji
In-Reply-To: <20171003222213.7996-1-richard_siegfried@systemli.org>

Essentially the TCP code files consist of two groups:

  * main code and some utilities (13 files)
  * pluggable Congestion Control Algorithms (17 files)

Similar to the previous commit, this moves the Congestion control Algorithms to
a own subdirecty and updates the Makefiles accordingly to make the source tree
more well structured and self explaining.

Signed-off-by: Richard Sailer <richard_siegfried@systemli.org>
---
 net/ipv4/tcp/Makefile                       | 18 ++----------------
 net/ipv4/tcp/cc_algos/Makefile              | 21 +++++++++++++++++++++
 net/ipv4/tcp/{ => cc_algos}/tcp_bbr.c       |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_bic.c       |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_cdg.c       |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_cubic.c     |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_dctcp.c     |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_highspeed.c |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_htcp.c      |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_hybla.c     |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_illinois.c  |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_lp.c        |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_nv.c        |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_scalable.c  |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_vegas.c     |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_vegas.h     |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_veno.c      |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_westwood.c  |  0
 net/ipv4/tcp/{ => cc_algos}/tcp_yeah.c      |  0
 19 files changed, 23 insertions(+), 16 deletions(-)
 create mode 100644 net/ipv4/tcp/cc_algos/Makefile
 rename net/ipv4/tcp/{ => cc_algos}/tcp_bbr.c (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_bic.c (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_cdg.c (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_cubic.c (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_dctcp.c (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_highspeed.c (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_htcp.c (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_hybla.c (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_illinois.c (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_lp.c (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_nv.c (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_scalable.c (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_vegas.c (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_vegas.h (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_veno.c (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_westwood.c (100%)
 rename net/ipv4/tcp/{ => cc_algos}/tcp_yeah.c (100%)

diff --git a/net/ipv4/tcp/Makefile b/net/ipv4/tcp/Makefile
index 91d2c991a243..60e96525a489 100644
--- a/net/ipv4/tcp/Makefile
+++ b/net/ipv4/tcp/Makefile
@@ -8,19 +8,5 @@ obj-y     := tcp.o tcp_input.o tcp_output.o tcp_timer.o \
 
 obj-$(CONFIG_INET_TCP_DIAG)     +=  tcp_diag.o
 obj-$(CONFIG_NET_TCPPROBE)      +=  tcp_probe.o
-obj-$(CONFIG_TCP_CONG_BBR)      +=  tcp_bbr.o
-obj-$(CONFIG_TCP_CONG_BIC)      +=  tcp_bic.o
-obj-$(CONFIG_TCP_CONG_CDG)      +=  tcp_cdg.o
-obj-$(CONFIG_TCP_CONG_CUBIC)    +=  tcp_cubic.o
-obj-$(CONFIG_TCP_CONG_DCTCP)    +=  tcp_dctcp.o
-obj-$(CONFIG_TCP_CONG_WESTWOOD) +=  tcp_westwood.o
-obj-$(CONFIG_TCP_CONG_HSTCP)    +=  tcp_highspeed.o
-obj-$(CONFIG_TCP_CONG_HYBLA)    +=  tcp_hybla.o
-obj-$(CONFIG_TCP_CONG_HTCP)     +=  tcp_htcp.o
-obj-$(CONFIG_TCP_CONG_VEGAS)    +=  tcp_vegas.o
-obj-$(CONFIG_TCP_CONG_NV)       +=  tcp_nv.o
-obj-$(CONFIG_TCP_CONG_VENO)     +=  tcp_veno.o
-obj-$(CONFIG_TCP_CONG_SCALABLE) +=  tcp_scalable.o
-obj-$(CONFIG_TCP_CONG_LP)       +=  tcp_lp.o
-obj-$(CONFIG_TCP_CONG_YEAH)     +=  tcp_yeah.o
-obj-$(CONFIG_TCP_CONG_ILLINOIS) +=  tcp_illinois.o
+
+obj-$(CONFIG_TCP_CONG_ADVANCED) +=  cc_algos/
diff --git a/net/ipv4/tcp/cc_algos/Makefile b/net/ipv4/tcp/cc_algos/Makefile
new file mode 100644
index 000000000000..81999dd2a045
--- /dev/null
+++ b/net/ipv4/tcp/cc_algos/Makefile
@@ -0,0 +1,21 @@
+#
+# Makefile for all pluggable TCP congestion control algorithms
+#
+
+
+obj-$(CONFIG_TCP_CONG_BBR)      +=  tcp_bbr.o
+obj-$(CONFIG_TCP_CONG_BIC)      +=  tcp_bic.o
+obj-$(CONFIG_TCP_CONG_CDG)      +=  tcp_cdg.o
+obj-$(CONFIG_TCP_CONG_CUBIC)    +=  tcp_cubic.o
+obj-$(CONFIG_TCP_CONG_DCTCP)    +=  tcp_dctcp.o
+obj-$(CONFIG_TCP_CONG_WESTWOOD) +=  tcp_westwood.o
+obj-$(CONFIG_TCP_CONG_HSTCP)    +=  tcp_highspeed.o
+obj-$(CONFIG_TCP_CONG_HYBLA)    +=  tcp_hybla.o
+obj-$(CONFIG_TCP_CONG_HTCP)     +=  tcp_htcp.o
+obj-$(CONFIG_TCP_CONG_VEGAS)    +=  tcp_vegas.o
+obj-$(CONFIG_TCP_CONG_NV)       +=  tcp_nv.o
+obj-$(CONFIG_TCP_CONG_VENO)     +=  tcp_veno.o
+obj-$(CONFIG_TCP_CONG_SCALABLE) +=  tcp_scalable.o
+obj-$(CONFIG_TCP_CONG_LP)       +=  tcp_lp.o
+obj-$(CONFIG_TCP_CONG_YEAH)     +=  tcp_yeah.o
+obj-$(CONFIG_TCP_CONG_ILLINOIS) +=  tcp_illinois.o
diff --git a/net/ipv4/tcp/tcp_bbr.c b/net/ipv4/tcp/cc_algos/tcp_bbr.c
similarity index 100%
rename from net/ipv4/tcp/tcp_bbr.c
rename to net/ipv4/tcp/cc_algos/tcp_bbr.c
diff --git a/net/ipv4/tcp/tcp_bic.c b/net/ipv4/tcp/cc_algos/tcp_bic.c
similarity index 100%
rename from net/ipv4/tcp/tcp_bic.c
rename to net/ipv4/tcp/cc_algos/tcp_bic.c
diff --git a/net/ipv4/tcp/tcp_cdg.c b/net/ipv4/tcp/cc_algos/tcp_cdg.c
similarity index 100%
rename from net/ipv4/tcp/tcp_cdg.c
rename to net/ipv4/tcp/cc_algos/tcp_cdg.c
diff --git a/net/ipv4/tcp/tcp_cubic.c b/net/ipv4/tcp/cc_algos/tcp_cubic.c
similarity index 100%
rename from net/ipv4/tcp/tcp_cubic.c
rename to net/ipv4/tcp/cc_algos/tcp_cubic.c
diff --git a/net/ipv4/tcp/tcp_dctcp.c b/net/ipv4/tcp/cc_algos/tcp_dctcp.c
similarity index 100%
rename from net/ipv4/tcp/tcp_dctcp.c
rename to net/ipv4/tcp/cc_algos/tcp_dctcp.c
diff --git a/net/ipv4/tcp/tcp_highspeed.c b/net/ipv4/tcp/cc_algos/tcp_highspeed.c
similarity index 100%
rename from net/ipv4/tcp/tcp_highspeed.c
rename to net/ipv4/tcp/cc_algos/tcp_highspeed.c
diff --git a/net/ipv4/tcp/tcp_htcp.c b/net/ipv4/tcp/cc_algos/tcp_htcp.c
similarity index 100%
rename from net/ipv4/tcp/tcp_htcp.c
rename to net/ipv4/tcp/cc_algos/tcp_htcp.c
diff --git a/net/ipv4/tcp/tcp_hybla.c b/net/ipv4/tcp/cc_algos/tcp_hybla.c
similarity index 100%
rename from net/ipv4/tcp/tcp_hybla.c
rename to net/ipv4/tcp/cc_algos/tcp_hybla.c
diff --git a/net/ipv4/tcp/tcp_illinois.c b/net/ipv4/tcp/cc_algos/tcp_illinois.c
similarity index 100%
rename from net/ipv4/tcp/tcp_illinois.c
rename to net/ipv4/tcp/cc_algos/tcp_illinois.c
diff --git a/net/ipv4/tcp/tcp_lp.c b/net/ipv4/tcp/cc_algos/tcp_lp.c
similarity index 100%
rename from net/ipv4/tcp/tcp_lp.c
rename to net/ipv4/tcp/cc_algos/tcp_lp.c
diff --git a/net/ipv4/tcp/tcp_nv.c b/net/ipv4/tcp/cc_algos/tcp_nv.c
similarity index 100%
rename from net/ipv4/tcp/tcp_nv.c
rename to net/ipv4/tcp/cc_algos/tcp_nv.c
diff --git a/net/ipv4/tcp/tcp_scalable.c b/net/ipv4/tcp/cc_algos/tcp_scalable.c
similarity index 100%
rename from net/ipv4/tcp/tcp_scalable.c
rename to net/ipv4/tcp/cc_algos/tcp_scalable.c
diff --git a/net/ipv4/tcp/tcp_vegas.c b/net/ipv4/tcp/cc_algos/tcp_vegas.c
similarity index 100%
rename from net/ipv4/tcp/tcp_vegas.c
rename to net/ipv4/tcp/cc_algos/tcp_vegas.c
diff --git a/net/ipv4/tcp/tcp_vegas.h b/net/ipv4/tcp/cc_algos/tcp_vegas.h
similarity index 100%
rename from net/ipv4/tcp/tcp_vegas.h
rename to net/ipv4/tcp/cc_algos/tcp_vegas.h
diff --git a/net/ipv4/tcp/tcp_veno.c b/net/ipv4/tcp/cc_algos/tcp_veno.c
similarity index 100%
rename from net/ipv4/tcp/tcp_veno.c
rename to net/ipv4/tcp/cc_algos/tcp_veno.c
diff --git a/net/ipv4/tcp/tcp_westwood.c b/net/ipv4/tcp/cc_algos/tcp_westwood.c
similarity index 100%
rename from net/ipv4/tcp/tcp_westwood.c
rename to net/ipv4/tcp/cc_algos/tcp_westwood.c
diff --git a/net/ipv4/tcp/tcp_yeah.c b/net/ipv4/tcp/cc_algos/tcp_yeah.c
similarity index 100%
rename from net/ipv4/tcp/tcp_yeah.c
rename to net/ipv4/tcp/cc_algos/tcp_yeah.c
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next 3/3] Move shared tcp code to net/ipv4v6shared
From: Richard Sailer @ 2017-10-03 22:22 UTC (permalink / raw)
  To: netdev, davem; +Cc: kuznet, yoshfuji
In-Reply-To: <20171003222213.7996-1-richard_siegfried@systemli.org>

Currently a lot of Code shared between IPv4 and IPv6 resides in net/ipv4.

As an attempt to make things more modular and encapsulated + the source tree
more self documenting this commit:

  * introduces net/ipv4v6shared
  * moves the shared tcp code there
  * updates the makefiles accordingly
  * creates a new Kconfig file for the TCP code
  * updates all other Kconfig files accordingly

Of course I left the ipv4 specific parts of tcp (tcp_ipv4.c and tcp_offload.c)
in net/ipv4.

Thanks valdis from #kernelnewbies for the hint of doing this, when already
isolating tcp code.

Signed-off-by: Richard Sailer <richard_siegfried@systemli.org>
---
 net/Kconfig                                        |   1 +
 net/Makefile                                       |   6 +-
 net/ipv4/Kconfig                                   | 321 --------------------
 net/ipv4/Makefile                                  |   4 +-
 net/ipv4v6_shared/Makefile                         |   7 +
 net/ipv4v6_shared/tcp/Kconfig                      | 324 +++++++++++++++++++++
 net/{ipv4 => ipv4v6_shared}/tcp/Makefile           |   0
 net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/Makefile  |   0
 net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_bbr.c |   0
 net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_bic.c |   0
 net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_cdg.c |   0
 .../tcp/cc_algos/tcp_cubic.c                       |   0
 .../tcp/cc_algos/tcp_dctcp.c                       |   0
 .../tcp/cc_algos/tcp_highspeed.c                   |   0
 .../tcp/cc_algos/tcp_htcp.c                        |   0
 .../tcp/cc_algos/tcp_hybla.c                       |   0
 .../tcp/cc_algos/tcp_illinois.c                    |   0
 net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_lp.c  |   0
 net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_nv.c  |   0
 .../tcp/cc_algos/tcp_scalable.c                    |   0
 .../tcp/cc_algos/tcp_vegas.c                       |   0
 .../tcp/cc_algos/tcp_vegas.h                       |   0
 .../tcp/cc_algos/tcp_veno.c                        |   0
 .../tcp/cc_algos/tcp_westwood.c                    |   0
 .../tcp/cc_algos/tcp_yeah.c                        |   0
 net/{ipv4 => ipv4v6_shared}/tcp/tcp.c              |   0
 net/{ipv4 => ipv4v6_shared}/tcp/tcp_cong.c         |   0
 net/{ipv4 => ipv4v6_shared}/tcp/tcp_diag.c         |   0
 net/{ipv4 => ipv4v6_shared}/tcp/tcp_fastopen.c     |   0
 net/{ipv4 => ipv4v6_shared}/tcp/tcp_input.c        |   0
 net/{ipv4 => ipv4v6_shared}/tcp/tcp_metrics.c      |   0
 net/{ipv4 => ipv4v6_shared}/tcp/tcp_minisocks.c    |   0
 net/{ipv4 => ipv4v6_shared}/tcp/tcp_output.c       |   0
 net/{ipv4 => ipv4v6_shared}/tcp/tcp_probe.c        |   0
 net/{ipv4 => ipv4v6_shared}/tcp/tcp_rate.c         |   0
 net/{ipv4 => ipv4v6_shared}/tcp/tcp_recovery.c     |   0
 net/{ipv4 => ipv4v6_shared}/tcp/tcp_timer.c        |   0
 net/{ipv4 => ipv4v6_shared}/tcp/tcp_ulp.c          |   0
 38 files changed, 339 insertions(+), 324 deletions(-)
 create mode 100644 net/ipv4v6_shared/Makefile
 create mode 100644 net/ipv4v6_shared/tcp/Kconfig
 rename net/{ipv4 => ipv4v6_shared}/tcp/Makefile (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/Makefile (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_bbr.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_bic.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_cdg.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_cubic.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_dctcp.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_highspeed.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_htcp.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_hybla.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_illinois.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_lp.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_nv.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_scalable.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_vegas.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_vegas.h (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_veno.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_westwood.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/cc_algos/tcp_yeah.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/tcp.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/tcp_cong.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/tcp_diag.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/tcp_fastopen.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/tcp_input.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/tcp_metrics.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/tcp_minisocks.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/tcp_output.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/tcp_probe.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/tcp_rate.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/tcp_recovery.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/tcp_timer.c (100%)
 rename net/{ipv4 => ipv4v6_shared}/tcp/tcp_ulp.c (100%)

diff --git a/net/Kconfig b/net/Kconfig
index 9dba2715919d..caa1c51c577d 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -86,6 +86,7 @@ config INET
 
 if INET
 source "net/ipv4/Kconfig"
+source "net/ipv4v6_shared/tcp/Kconfig"
 source "net/ipv6/Kconfig"
 source "net/netlabel/Kconfig"
 
diff --git a/net/Makefile b/net/Makefile
index ae2fe2283d2f..b089a88d3d5d 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -14,7 +14,11 @@ obj-$(CONFIG_NET)		+= $(tmp-y)
 obj-$(CONFIG_LLC)		+= llc/
 obj-$(CONFIG_NET)		+= ethernet/ 802/ sched/ netlink/ bpf/
 obj-$(CONFIG_NETFILTER)		+= netfilter/
-obj-$(CONFIG_INET)		+= ipv4/
+
+# IPv6 depends on CONFIG_INET anyways, so this also includes the shared code
+# for IPv6 when needed.
+obj-$(CONFIG_INET)		+= ipv4/ ipv4v6_shared/
+
 obj-$(CONFIG_TLS)		+= tls/
 obj-$(CONFIG_XFRM)		+= xfrm/
 obj-$(CONFIG_UNIX)		+= unix/
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 91a2557942fa..aa952b0c6a8c 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -261,42 +261,6 @@ config IP_PIMSM_V2
 	  gated-5). This routing protocol is not used widely, so say N unless
 	  you want to play with it.
 
-config SYN_COOKIES
-	bool "IP: TCP syncookie support"
-	---help---
-	  Normal TCP/IP networking is open to an attack known as "SYN
-	  flooding". This denial-of-service attack prevents legitimate remote
-	  users from being able to connect to your computer during an ongoing
-	  attack and requires very little work from the attacker, who can
-	  operate from anywhere on the Internet.
-
-	  SYN cookies provide protection against this type of attack. If you
-	  say Y here, the TCP/IP stack will use a cryptographic challenge
-	  protocol known as "SYN cookies" to enable legitimate users to
-	  continue to connect, even when your machine is under attack. There
-	  is no need for the legitimate users to change their TCP/IP software;
-	  SYN cookies work transparently to them. For technical information
-	  about SYN cookies, check out <http://cr.yp.to/syncookies.html>.
-
-	  If you are SYN flooded, the source address reported by the kernel is
-	  likely to have been forged by the attacker; it is only reported as
-	  an aid in tracing the packets to their actual source and should not
-	  be taken as absolute truth.
-
-	  SYN cookies may prevent correct error reporting on clients when the
-	  server is really overloaded. If this happens frequently better turn
-	  them off.
-
-	  If you say Y here, you can disable SYN cookies at run time by
-	  saying Y to "/proc file system support" and
-	  "Sysctl support" below and executing the command
-
-	  echo 0 > /proc/sys/net/ipv4/tcp_syncookies
-
-	  after the /proc file system has been mounted.
-
-	  If unsure, say N.
-
 config NET_IPVTI
 	tristate "Virtual (secure) IP: tunneling"
 	select INET_TUNNEL
@@ -465,288 +429,3 @@ config INET_DIAG_DESTROY
 	  had been disconnected.
 	  If unsure, say N.
 
-menuconfig TCP_CONG_ADVANCED
-	bool "TCP: advanced congestion control"
-	---help---
-	  Support for selection of various TCP congestion control
-	  modules.
-
-	  Nearly all users can safely say no here, and a safe default
-	  selection will be made (CUBIC with new Reno as a fallback).
-
-	  If unsure, say N.
-
-if TCP_CONG_ADVANCED
-
-config TCP_CONG_BIC
-	tristate "Binary Increase Congestion (BIC) control"
-	default m
-	---help---
-	BIC-TCP is a sender-side only change that ensures a linear RTT
-	fairness under large windows while offering both scalability and
-	bounded TCP-friendliness. The protocol combines two schemes
-	called additive increase and binary search increase. When the
-	congestion window is large, additive increase with a large
-	increment ensures linear RTT fairness as well as good
-	scalability. Under small congestion windows, binary search
-	increase provides TCP friendliness.
-	See http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/
-
-config TCP_CONG_CUBIC
-	tristate "CUBIC TCP"
-	default y
-	---help---
-	This is version 2.0 of BIC-TCP which uses a cubic growth function
-	among other techniques.
-	See http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/cubic-paper.pdf
-
-config TCP_CONG_WESTWOOD
-	tristate "TCP Westwood+"
-	default m
-	---help---
-	TCP Westwood+ is a sender-side only modification of the TCP Reno
-	protocol stack that optimizes the performance of TCP congestion
-	control. It is based on end-to-end bandwidth estimation to set
-	congestion window and slow start threshold after a congestion
-	episode. Using this estimation, TCP Westwood+ adaptively sets a
-	slow start threshold and a congestion window which takes into
-	account the bandwidth used  at the time congestion is experienced.
-	TCP Westwood+ significantly increases fairness wrt TCP Reno in
-	wired networks and throughput over wireless links.
-
-config TCP_CONG_HTCP
-        tristate "H-TCP"
-        default m
-	---help---
-	H-TCP is a send-side only modifications of the TCP Reno
-	protocol stack that optimizes the performance of TCP
-	congestion control for high speed network links. It uses a
-	modeswitch to change the alpha and beta parameters of TCP Reno
-	based on network conditions and in a way so as to be fair with
-	other Reno and H-TCP flows.
-
-config TCP_CONG_HSTCP
-	tristate "High Speed TCP"
-	default n
-	---help---
-	Sally Floyd's High Speed TCP (RFC 3649) congestion control.
-	A modification to TCP's congestion control mechanism for use
-	with large congestion windows. A table indicates how much to
-	increase the congestion window by when an ACK is received.
- 	For more detail	see http://www.icir.org/floyd/hstcp.html
-
-config TCP_CONG_HYBLA
-	tristate "TCP-Hybla congestion control algorithm"
-	default n
-	---help---
-	TCP-Hybla is a sender-side only change that eliminates penalization of
-	long-RTT, large-bandwidth connections, like when satellite legs are
-	involved, especially when sharing a common bottleneck with normal
-	terrestrial connections.
-
-config TCP_CONG_VEGAS
-	tristate "TCP Vegas"
-	default n
-	---help---
-	TCP Vegas is a sender-side only change to TCP that anticipates
-	the onset of congestion by estimating the bandwidth. TCP Vegas
-	adjusts the sending rate by modifying the congestion
-	window. TCP Vegas should provide less packet loss, but it is
-	not as aggressive as TCP Reno.
-
-config TCP_CONG_NV
-       tristate "TCP NV"
-       default n
-       ---help---
-       TCP NV is a follow up to TCP Vegas. It has been modified to deal with
-       10G networks, measurement noise introduced by LRO, GRO and interrupt
-       coalescence. In addition, it will decrease its cwnd multiplicatively
-       instead of linearly.
-
-       Note that in general congestion avoidance (cwnd decreased when # packets
-       queued grows) cannot coexist with congestion control (cwnd decreased only
-       when there is packet loss) due to fairness issues. One scenario when they
-       can coexist safely is when the CA flows have RTTs << CC flows RTTs.
-
-       For further details see http://www.brakmo.org/networking/tcp-nv/
-
-config TCP_CONG_SCALABLE
-	tristate "Scalable TCP"
-	default n
-	---help---
-	Scalable TCP is a sender-side only change to TCP which uses a
-	MIMD congestion control algorithm which has some nice scaling
-	properties, though is known to have fairness issues.
-	See http://www.deneholme.net/tom/scalable/
-
-config TCP_CONG_LP
-	tristate "TCP Low Priority"
-	default n
-	---help---
-	TCP Low Priority (TCP-LP), a distributed algorithm whose goal is
-	to utilize only the excess network bandwidth as compared to the
-	``fair share`` of bandwidth as targeted by TCP.
-	See http://www-ece.rice.edu/networks/TCP-LP/
-
-config TCP_CONG_VENO
-	tristate "TCP Veno"
-	default n
-	---help---
-	TCP Veno is a sender-side only enhancement of TCP to obtain better
-	throughput over wireless networks. TCP Veno makes use of state
-	distinguishing to circumvent the difficult judgment of the packet loss
-	type. TCP Veno cuts down less congestion window in response to random
-	loss packets.
-	See <http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1177186> 
-
-config TCP_CONG_YEAH
-	tristate "YeAH TCP"
-	select TCP_CONG_VEGAS
-	default n
-	---help---
-	YeAH-TCP is a sender-side high-speed enabled TCP congestion control
-	algorithm, which uses a mixed loss/delay approach to compute the
-	congestion window. It's design goals target high efficiency,
-	internal, RTT and Reno fairness, resilience to link loss while
-	keeping network elements load as low as possible.
-
-	For further details look here:
-	  http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf
-
-config TCP_CONG_ILLINOIS
-	tristate "TCP Illinois"
-	default n
-	---help---
-	TCP-Illinois is a sender-side modification of TCP Reno for
-	high speed long delay links. It uses round-trip-time to
-	adjust the alpha and beta parameters to achieve a higher average
-	throughput and maintain fairness.
-
-	For further details see:
-	  http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
-
-config TCP_CONG_DCTCP
-	tristate "DataCenter TCP (DCTCP)"
-	default n
-	---help---
-	DCTCP leverages Explicit Congestion Notification (ECN) in the network to
-	provide multi-bit feedback to the end hosts. It is designed to provide:
-
-	- High burst tolerance (incast due to partition/aggregate),
-	- Low latency (short flows, queries),
-	- High throughput (continuous data updates, large file transfers) with
-	  commodity, shallow-buffered switches.
-
-	All switches in the data center network running DCTCP must support
-	ECN marking and be configured for marking when reaching defined switch
-	buffer thresholds. The default ECN marking threshold heuristic for
-	DCTCP on switches is 20 packets (30KB) at 1Gbps, and 65 packets
-	(~100KB) at 10Gbps, but might need further careful tweaking.
-
-	For further details see:
-	  http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
-
-config TCP_CONG_CDG
-	tristate "CAIA Delay-Gradient (CDG)"
-	default n
-	---help---
-	CAIA Delay-Gradient (CDG) is a TCP congestion control that modifies
-	the TCP sender in order to:
-
-	  o Use the delay gradient as a congestion signal.
-	  o Back off with an average probability that is independent of the RTT.
-	  o Coexist with flows that use loss-based congestion control.
-	  o Tolerate packet loss unrelated to congestion.
-
-	For further details see:
-	  D.A. Hayes and G. Armitage. "Revisiting TCP congestion control using
-	  delay gradients." In Networking 2011. Preprint: http://goo.gl/No3vdg
-
-config TCP_CONG_BBR
-	tristate "BBR TCP"
-	default n
-	---help---
-
-	BBR (Bottleneck Bandwidth and RTT) TCP congestion control aims to
-	maximize network utilization and minimize queues. It builds an explicit
-	model of the the bottleneck delivery rate and path round-trip
-	propagation delay. It tolerates packet loss and delay unrelated to
-	congestion. It can operate over LAN, WAN, cellular, wifi, or cable
-	modem links. It can coexist with flows that use loss-based congestion
-	control, and can operate with shallow buffers, deep buffers,
-	bufferbloat, policers, or AQM schemes that do not provide a delay
-	signal. It requires the fq ("Fair Queue") pacing packet scheduler.
-
-choice
-	prompt "Default TCP congestion control"
-	default DEFAULT_CUBIC
-	help
-	  Select the TCP congestion control that will be used by default
-	  for all connections.
-
-	config DEFAULT_BIC
-		bool "Bic" if TCP_CONG_BIC=y
-
-	config DEFAULT_CUBIC
-		bool "Cubic" if TCP_CONG_CUBIC=y
-
-	config DEFAULT_HTCP
-		bool "Htcp" if TCP_CONG_HTCP=y
-
-	config DEFAULT_HYBLA
-		bool "Hybla" if TCP_CONG_HYBLA=y
-
-	config DEFAULT_VEGAS
-		bool "Vegas" if TCP_CONG_VEGAS=y
-
-	config DEFAULT_VENO
-		bool "Veno" if TCP_CONG_VENO=y
-
-	config DEFAULT_WESTWOOD
-		bool "Westwood" if TCP_CONG_WESTWOOD=y
-
-	config DEFAULT_DCTCP
-		bool "DCTCP" if TCP_CONG_DCTCP=y
-
-	config DEFAULT_CDG
-		bool "CDG" if TCP_CONG_CDG=y
-
-	config DEFAULT_BBR
-		bool "BBR" if TCP_CONG_BBR=y
-
-	config DEFAULT_RENO
-		bool "Reno"
-endchoice
-
-endif
-
-config TCP_CONG_CUBIC
-	tristate
-	depends on !TCP_CONG_ADVANCED
-	default y
-
-config DEFAULT_TCP_CONG
-	string
-	default "bic" if DEFAULT_BIC
-	default "cubic" if DEFAULT_CUBIC
-	default "htcp" if DEFAULT_HTCP
-	default "hybla" if DEFAULT_HYBLA
-	default "vegas" if DEFAULT_VEGAS
-	default "westwood" if DEFAULT_WESTWOOD
-	default "veno" if DEFAULT_VENO
-	default "reno" if DEFAULT_RENO
-	default "dctcp" if DEFAULT_DCTCP
-	default "cdg" if DEFAULT_CDG
-	default "bbr" if DEFAULT_BBR
-	default "cubic"
-
-config TCP_MD5SIG
-	bool "TCP: MD5 Signature Option support (RFC2385)"
-	select CRYPTO
-	select CRYPTO_MD5
-	---help---
-	  RFC2385 specifies a method of giving MD5 protection to TCP sessions.
-	  Its main (only?) use is to protect BGP sessions between core routers
-	  on the Internet.
-
-	  If unsure, say N.
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index a106a1cf0e11..d170d6e1417c 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -1,8 +1,8 @@
 #
-# Makefile for the Linux TCP/IP (INET) layer.
+# Makefile for the Linux IP layer.
 #
 
-obj-y     := tcp/ route.o inetpeer.o protocol.o \
+obj-y     := route.o inetpeer.o protocol.o \
 	     ip_input.o ip_fragment.o ip_forward.o ip_options.o \
 	     ip_output.o ip_sockglue.o inet_hashtables.o \
 	     inet_timewait_sock.o inet_connection_sock.o \
diff --git a/net/ipv4v6_shared/Makefile b/net/ipv4v6_shared/Makefile
new file mode 100644
index 000000000000..41a67a0cf210
--- /dev/null
+++ b/net/ipv4v6_shared/Makefile
@@ -0,0 +1,7 @@
+#
+# Makefile for shared networking code called/used by IPv4 and IPv6
+#
+# Note: Currently this only contains the shared part of the TCP Implementation
+#   a lot of the shared code still resides in net/ipv4/
+
+obj-y     := tcp/
diff --git a/net/ipv4v6_shared/tcp/Kconfig b/net/ipv4v6_shared/tcp/Kconfig
new file mode 100644
index 000000000000..dff4d046a42b
--- /dev/null
+++ b/net/ipv4v6_shared/tcp/Kconfig
@@ -0,0 +1,324 @@
+#
+# TCP configuration
+#
+config SYN_COOKIES
+	bool "IP: TCP syncookie support"
+	---help---
+	  Normal TCP/IP networking is open to an attack known as "SYN
+	  flooding". This denial-of-service attack prevents legitimate remote
+	  users from being able to connect to your computer during an ongoing
+	  attack and requires very little work from the attacker, who can
+	  operate from anywhere on the Internet.
+
+	  SYN cookies provide protection against this type of attack. If you
+	  say Y here, the TCP/IP stack will use a cryptographic challenge
+	  protocol known as "SYN cookies" to enable legitimate users to
+	  continue to connect, even when your machine is under attack. There
+	  is no need for the legitimate users to change their TCP/IP software;
+	  SYN cookies work transparently to them. For technical information
+	  about SYN cookies, check out <http://cr.yp.to/syncookies.html>.
+
+	  If you are SYN flooded, the source address reported by the kernel is
+	  likely to have been forged by the attacker; it is only reported as
+	  an aid in tracing the packets to their actual source and should not
+	  be taken as absolute truth.
+
+	  SYN cookies may prevent correct error reporting on clients when the
+	  server is really overloaded. If this happens frequently better turn
+	  them off.
+
+	  If you say Y here, you can disable SYN cookies at run time by
+	  saying Y to "/proc file system support" and
+	  "Sysctl support" below and executing the command
+
+	  echo 0 > /proc/sys/net/ipv4/tcp_syncookies
+
+	  after the /proc file system has been mounted.
+
+	  If unsure, say N.
+
+menuconfig TCP_CONG_ADVANCED
+	bool "TCP: advanced congestion control"
+	---help---
+	  Support for selection of various TCP congestion control
+	  modules.
+
+	  Nearly all users can safely say no here, and a safe default
+	  selection will be made (CUBIC with new Reno as a fallback).
+
+	  If unsure, say N.
+
+if TCP_CONG_ADVANCED
+
+config TCP_CONG_BIC
+	tristate "Binary Increase Congestion (BIC) control"
+	default m
+	---help---
+	BIC-TCP is a sender-side only change that ensures a linear RTT
+	fairness under large windows while offering both scalability and
+	bounded TCP-friendliness. The protocol combines two schemes
+	called additive increase and binary search increase. When the
+	congestion window is large, additive increase with a large
+	increment ensures linear RTT fairness as well as good
+	scalability. Under small congestion windows, binary search
+	increase provides TCP friendliness.
+	See http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/
+
+config TCP_CONG_CUBIC
+	tristate "CUBIC TCP"
+	default y
+	---help---
+	This is version 2.0 of BIC-TCP which uses a cubic growth function
+	among other techniques.
+	See http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/cubic-paper.pdf
+
+config TCP_CONG_WESTWOOD
+	tristate "TCP Westwood+"
+	default m
+	---help---
+	TCP Westwood+ is a sender-side only modification of the TCP Reno
+	protocol stack that optimizes the performance of TCP congestion
+	control. It is based on end-to-end bandwidth estimation to set
+	congestion window and slow start threshold after a congestion
+	episode. Using this estimation, TCP Westwood+ adaptively sets a
+	slow start threshold and a congestion window which takes into
+	account the bandwidth used  at the time congestion is experienced.
+	TCP Westwood+ significantly increases fairness wrt TCP Reno in
+	wired networks and throughput over wireless links.
+
+config TCP_CONG_HTCP
+        tristate "H-TCP"
+        default m
+	---help---
+	H-TCP is a send-side only modifications of the TCP Reno
+	protocol stack that optimizes the performance of TCP
+	congestion control for high speed network links. It uses a
+	modeswitch to change the alpha and beta parameters of TCP Reno
+	based on network conditions and in a way so as to be fair with
+	other Reno and H-TCP flows.
+
+config TCP_CONG_HSTCP
+	tristate "High Speed TCP"
+	default n
+	---help---
+	Sally Floyd's High Speed TCP (RFC 3649) congestion control.
+	A modification to TCP's congestion control mechanism for use
+	with large congestion windows. A table indicates how much to
+	increase the congestion window by when an ACK is received.
+ 	For more detail	see http://www.icir.org/floyd/hstcp.html
+
+config TCP_CONG_HYBLA
+	tristate "TCP-Hybla congestion control algorithm"
+	default n
+	---help---
+	TCP-Hybla is a sender-side only change that eliminates penalization of
+	long-RTT, large-bandwidth connections, like when satellite legs are
+	involved, especially when sharing a common bottleneck with normal
+	terrestrial connections.
+
+config TCP_CONG_VEGAS
+	tristate "TCP Vegas"
+	default n
+	---help---
+	TCP Vegas is a sender-side only change to TCP that anticipates
+	the onset of congestion by estimating the bandwidth. TCP Vegas
+	adjusts the sending rate by modifying the congestion
+	window. TCP Vegas should provide less packet loss, but it is
+	not as aggressive as TCP Reno.
+
+config TCP_CONG_NV
+       tristate "TCP NV"
+       default n
+       ---help---
+       TCP NV is a follow up to TCP Vegas. It has been modified to deal with
+       10G networks, measurement noise introduced by LRO, GRO and interrupt
+       coalescence. In addition, it will decrease its cwnd multiplicatively
+       instead of linearly.
+
+       Note that in general congestion avoidance (cwnd decreased when # packets
+       queued grows) cannot coexist with congestion control (cwnd decreased only
+       when there is packet loss) due to fairness issues. One scenario when they
+       can coexist safely is when the CA flows have RTTs << CC flows RTTs.
+
+       For further details see http://www.brakmo.org/networking/tcp-nv/
+
+config TCP_CONG_SCALABLE
+	tristate "Scalable TCP"
+	default n
+	---help---
+	Scalable TCP is a sender-side only change to TCP which uses a
+	MIMD congestion control algorithm which has some nice scaling
+	properties, though is known to have fairness issues.
+	See http://www.deneholme.net/tom/scalable/
+
+config TCP_CONG_LP
+	tristate "TCP Low Priority"
+	default n
+	---help---
+	TCP Low Priority (TCP-LP), a distributed algorithm whose goal is
+	to utilize only the excess network bandwidth as compared to the
+	``fair share`` of bandwidth as targeted by TCP.
+	See http://www-ece.rice.edu/networks/TCP-LP/
+
+config TCP_CONG_VENO
+	tristate "TCP Veno"
+	default n
+	---help---
+	TCP Veno is a sender-side only enhancement of TCP to obtain better
+	throughput over wireless networks. TCP Veno makes use of state
+	distinguishing to circumvent the difficult judgment of the packet loss
+	type. TCP Veno cuts down less congestion window in response to random
+	loss packets.
+	See <http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1177186> 
+
+config TCP_CONG_YEAH
+	tristate "YeAH TCP"
+	select TCP_CONG_VEGAS
+	default n
+	---help---
+	YeAH-TCP is a sender-side high-speed enabled TCP congestion control
+	algorithm, which uses a mixed loss/delay approach to compute the
+	congestion window. It's design goals target high efficiency,
+	internal, RTT and Reno fairness, resilience to link loss while
+	keeping network elements load as low as possible.
+
+	For further details look here:
+	  http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf
+
+config TCP_CONG_ILLINOIS
+	tristate "TCP Illinois"
+	default n
+	---help---
+	TCP-Illinois is a sender-side modification of TCP Reno for
+	high speed long delay links. It uses round-trip-time to
+	adjust the alpha and beta parameters to achieve a higher average
+	throughput and maintain fairness.
+
+	For further details see:
+	  http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
+
+config TCP_CONG_DCTCP
+	tristate "DataCenter TCP (DCTCP)"
+	default n
+	---help---
+	DCTCP leverages Explicit Congestion Notification (ECN) in the network to
+	provide multi-bit feedback to the end hosts. It is designed to provide:
+
+	- High burst tolerance (incast due to partition/aggregate),
+	- Low latency (short flows, queries),
+	- High throughput (continuous data updates, large file transfers) with
+	  commodity, shallow-buffered switches.
+
+	All switches in the data center network running DCTCP must support
+	ECN marking and be configured for marking when reaching defined switch
+	buffer thresholds. The default ECN marking threshold heuristic for
+	DCTCP on switches is 20 packets (30KB) at 1Gbps, and 65 packets
+	(~100KB) at 10Gbps, but might need further careful tweaking.
+
+	For further details see:
+	  http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
+
+config TCP_CONG_CDG
+	tristate "CAIA Delay-Gradient (CDG)"
+	default n
+	---help---
+	CAIA Delay-Gradient (CDG) is a TCP congestion control that modifies
+	the TCP sender in order to:
+
+	  o Use the delay gradient as a congestion signal.
+	  o Back off with an average probability that is independent of the RTT.
+	  o Coexist with flows that use loss-based congestion control.
+	  o Tolerate packet loss unrelated to congestion.
+
+	For further details see:
+	  D.A. Hayes and G. Armitage. "Revisiting TCP congestion control using
+	  delay gradients." In Networking 2011. Preprint: http://goo.gl/No3vdg
+
+config TCP_CONG_BBR
+	tristate "BBR TCP"
+	default n
+	---help---
+
+	BBR (Bottleneck Bandwidth and RTT) TCP congestion control aims to
+	maximize network utilization and minimize queues. It builds an explicit
+	model of the the bottleneck delivery rate and path round-trip
+	propagation delay. It tolerates packet loss and delay unrelated to
+	congestion. It can operate over LAN, WAN, cellular, wifi, or cable
+	modem links. It can coexist with flows that use loss-based congestion
+	control, and can operate with shallow buffers, deep buffers,
+	bufferbloat, policers, or AQM schemes that do not provide a delay
+	signal. It requires the fq ("Fair Queue") pacing packet scheduler.
+
+choice
+	prompt "Default TCP congestion control"
+	default DEFAULT_CUBIC
+	help
+	  Select the TCP congestion control that will be used by default
+	  for all connections.
+
+	config DEFAULT_BIC
+		bool "Bic" if TCP_CONG_BIC=y
+
+	config DEFAULT_CUBIC
+		bool "Cubic" if TCP_CONG_CUBIC=y
+
+	config DEFAULT_HTCP
+		bool "Htcp" if TCP_CONG_HTCP=y
+
+	config DEFAULT_HYBLA
+		bool "Hybla" if TCP_CONG_HYBLA=y
+
+	config DEFAULT_VEGAS
+		bool "Vegas" if TCP_CONG_VEGAS=y
+
+	config DEFAULT_VENO
+		bool "Veno" if TCP_CONG_VENO=y
+
+	config DEFAULT_WESTWOOD
+		bool "Westwood" if TCP_CONG_WESTWOOD=y
+
+	config DEFAULT_DCTCP
+		bool "DCTCP" if TCP_CONG_DCTCP=y
+
+	config DEFAULT_CDG
+		bool "CDG" if TCP_CONG_CDG=y
+
+	config DEFAULT_BBR
+		bool "BBR" if TCP_CONG_BBR=y
+
+	config DEFAULT_RENO
+		bool "Reno"
+endchoice
+
+endif
+
+config TCP_CONG_CUBIC
+	tristate
+	depends on !TCP_CONG_ADVANCED
+	default y
+
+config DEFAULT_TCP_CONG
+	string
+	default "bic" if DEFAULT_BIC
+	default "cubic" if DEFAULT_CUBIC
+	default "htcp" if DEFAULT_HTCP
+	default "hybla" if DEFAULT_HYBLA
+	default "vegas" if DEFAULT_VEGAS
+	default "westwood" if DEFAULT_WESTWOOD
+	default "veno" if DEFAULT_VENO
+	default "reno" if DEFAULT_RENO
+	default "dctcp" if DEFAULT_DCTCP
+	default "cdg" if DEFAULT_CDG
+	default "bbr" if DEFAULT_BBR
+	default "cubic"
+
+config TCP_MD5SIG
+	bool "TCP: MD5 Signature Option support (RFC2385)"
+	select CRYPTO
+	select CRYPTO_MD5
+	---help---
+	  RFC2385 specifies a method of giving MD5 protection to TCP sessions.
+	  Its main (only?) use is to protect BGP sessions between core routers
+	  on the Internet.
+
+	  If unsure, say N.
diff --git a/net/ipv4/tcp/Makefile b/net/ipv4v6_shared/tcp/Makefile
similarity index 100%
rename from net/ipv4/tcp/Makefile
rename to net/ipv4v6_shared/tcp/Makefile
diff --git a/net/ipv4/tcp/cc_algos/Makefile b/net/ipv4v6_shared/tcp/cc_algos/Makefile
similarity index 100%
rename from net/ipv4/tcp/cc_algos/Makefile
rename to net/ipv4v6_shared/tcp/cc_algos/Makefile
diff --git a/net/ipv4/tcp/cc_algos/tcp_bbr.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_bbr.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_bbr.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_bbr.c
diff --git a/net/ipv4/tcp/cc_algos/tcp_bic.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_bic.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_bic.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_bic.c
diff --git a/net/ipv4/tcp/cc_algos/tcp_cdg.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_cdg.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_cdg.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_cdg.c
diff --git a/net/ipv4/tcp/cc_algos/tcp_cubic.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_cubic.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_cubic.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_cubic.c
diff --git a/net/ipv4/tcp/cc_algos/tcp_dctcp.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_dctcp.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_dctcp.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_dctcp.c
diff --git a/net/ipv4/tcp/cc_algos/tcp_highspeed.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_highspeed.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_highspeed.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_highspeed.c
diff --git a/net/ipv4/tcp/cc_algos/tcp_htcp.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_htcp.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_htcp.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_htcp.c
diff --git a/net/ipv4/tcp/cc_algos/tcp_hybla.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_hybla.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_hybla.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_hybla.c
diff --git a/net/ipv4/tcp/cc_algos/tcp_illinois.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_illinois.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_illinois.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_illinois.c
diff --git a/net/ipv4/tcp/cc_algos/tcp_lp.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_lp.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_lp.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_lp.c
diff --git a/net/ipv4/tcp/cc_algos/tcp_nv.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_nv.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_nv.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_nv.c
diff --git a/net/ipv4/tcp/cc_algos/tcp_scalable.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_scalable.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_scalable.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_scalable.c
diff --git a/net/ipv4/tcp/cc_algos/tcp_vegas.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_vegas.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_vegas.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_vegas.c
diff --git a/net/ipv4/tcp/cc_algos/tcp_vegas.h b/net/ipv4v6_shared/tcp/cc_algos/tcp_vegas.h
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_vegas.h
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_vegas.h
diff --git a/net/ipv4/tcp/cc_algos/tcp_veno.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_veno.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_veno.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_veno.c
diff --git a/net/ipv4/tcp/cc_algos/tcp_westwood.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_westwood.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_westwood.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_westwood.c
diff --git a/net/ipv4/tcp/cc_algos/tcp_yeah.c b/net/ipv4v6_shared/tcp/cc_algos/tcp_yeah.c
similarity index 100%
rename from net/ipv4/tcp/cc_algos/tcp_yeah.c
rename to net/ipv4v6_shared/tcp/cc_algos/tcp_yeah.c
diff --git a/net/ipv4/tcp/tcp.c b/net/ipv4v6_shared/tcp/tcp.c
similarity index 100%
rename from net/ipv4/tcp/tcp.c
rename to net/ipv4v6_shared/tcp/tcp.c
diff --git a/net/ipv4/tcp/tcp_cong.c b/net/ipv4v6_shared/tcp/tcp_cong.c
similarity index 100%
rename from net/ipv4/tcp/tcp_cong.c
rename to net/ipv4v6_shared/tcp/tcp_cong.c
diff --git a/net/ipv4/tcp/tcp_diag.c b/net/ipv4v6_shared/tcp/tcp_diag.c
similarity index 100%
rename from net/ipv4/tcp/tcp_diag.c
rename to net/ipv4v6_shared/tcp/tcp_diag.c
diff --git a/net/ipv4/tcp/tcp_fastopen.c b/net/ipv4v6_shared/tcp/tcp_fastopen.c
similarity index 100%
rename from net/ipv4/tcp/tcp_fastopen.c
rename to net/ipv4v6_shared/tcp/tcp_fastopen.c
diff --git a/net/ipv4/tcp/tcp_input.c b/net/ipv4v6_shared/tcp/tcp_input.c
similarity index 100%
rename from net/ipv4/tcp/tcp_input.c
rename to net/ipv4v6_shared/tcp/tcp_input.c
diff --git a/net/ipv4/tcp/tcp_metrics.c b/net/ipv4v6_shared/tcp/tcp_metrics.c
similarity index 100%
rename from net/ipv4/tcp/tcp_metrics.c
rename to net/ipv4v6_shared/tcp/tcp_metrics.c
diff --git a/net/ipv4/tcp/tcp_minisocks.c b/net/ipv4v6_shared/tcp/tcp_minisocks.c
similarity index 100%
rename from net/ipv4/tcp/tcp_minisocks.c
rename to net/ipv4v6_shared/tcp/tcp_minisocks.c
diff --git a/net/ipv4/tcp/tcp_output.c b/net/ipv4v6_shared/tcp/tcp_output.c
similarity index 100%
rename from net/ipv4/tcp/tcp_output.c
rename to net/ipv4v6_shared/tcp/tcp_output.c
diff --git a/net/ipv4/tcp/tcp_probe.c b/net/ipv4v6_shared/tcp/tcp_probe.c
similarity index 100%
rename from net/ipv4/tcp/tcp_probe.c
rename to net/ipv4v6_shared/tcp/tcp_probe.c
diff --git a/net/ipv4/tcp/tcp_rate.c b/net/ipv4v6_shared/tcp/tcp_rate.c
similarity index 100%
rename from net/ipv4/tcp/tcp_rate.c
rename to net/ipv4v6_shared/tcp/tcp_rate.c
diff --git a/net/ipv4/tcp/tcp_recovery.c b/net/ipv4v6_shared/tcp/tcp_recovery.c
similarity index 100%
rename from net/ipv4/tcp/tcp_recovery.c
rename to net/ipv4v6_shared/tcp/tcp_recovery.c
diff --git a/net/ipv4/tcp/tcp_timer.c b/net/ipv4v6_shared/tcp/tcp_timer.c
similarity index 100%
rename from net/ipv4/tcp/tcp_timer.c
rename to net/ipv4v6_shared/tcp/tcp_timer.c
diff --git a/net/ipv4/tcp/tcp_ulp.c b/net/ipv4v6_shared/tcp/tcp_ulp.c
similarity index 100%
rename from net/ipv4/tcp/tcp_ulp.c
rename to net/ipv4v6_shared/tcp/tcp_ulp.c
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next 1/3] net/ipv4: Move shared tcp code to own subdirectory
From: Richard Sailer @ 2017-10-03 22:22 UTC (permalink / raw)
  To: netdev, davem; +Cc: kuznet, yoshfuji
In-Reply-To: <20171003222213.7996-1-richard_siegfried@systemli.org>

net/ipv4 contains around 100 source files containing the IP implementation
and various other functionality (UDP, TCP, xfrm, etc.). 30 of them make up
the TCP implementation.

I think moving the TCP implementation to a own subdirectory
makes the source tree more explorable and well structured.

Note: This only affects the TCP Code shared by IPv4 and IPv6,
the IPv4 specific part (tcp_ipv4.c and tcp_offload.c) is left in
net/ipv4/.

This creates the directory, moves the files and updates the Makefiles
acordingly.

Signed-off-by: Richard Sailer <richard_siegfried@systemli.org>
---
 net/ipv4/Makefile                  | 25 ++-----------------------
 net/ipv4/tcp/Makefile              | 26 ++++++++++++++++++++++++++
 net/ipv4/{ => tcp}/tcp.c           |  0
 net/ipv4/{ => tcp}/tcp_bbr.c       |  0
 net/ipv4/{ => tcp}/tcp_bic.c       |  0
 net/ipv4/{ => tcp}/tcp_cdg.c       |  0
 net/ipv4/{ => tcp}/tcp_cong.c      |  0
 net/ipv4/{ => tcp}/tcp_cubic.c     |  0
 net/ipv4/{ => tcp}/tcp_dctcp.c     |  0
 net/ipv4/{ => tcp}/tcp_diag.c      |  0
 net/ipv4/{ => tcp}/tcp_fastopen.c  |  0
 net/ipv4/{ => tcp}/tcp_highspeed.c |  0
 net/ipv4/{ => tcp}/tcp_htcp.c      |  0
 net/ipv4/{ => tcp}/tcp_hybla.c     |  0
 net/ipv4/{ => tcp}/tcp_illinois.c  |  0
 net/ipv4/{ => tcp}/tcp_input.c     |  0
 net/ipv4/{ => tcp}/tcp_lp.c        |  0
 net/ipv4/{ => tcp}/tcp_metrics.c   |  0
 net/ipv4/{ => tcp}/tcp_minisocks.c |  0
 net/ipv4/{ => tcp}/tcp_nv.c        |  0
 net/ipv4/{ => tcp}/tcp_output.c    |  0
 net/ipv4/{ => tcp}/tcp_probe.c     |  0
 net/ipv4/{ => tcp}/tcp_rate.c      |  0
 net/ipv4/{ => tcp}/tcp_recovery.c  |  0
 net/ipv4/{ => tcp}/tcp_scalable.c  |  0
 net/ipv4/{ => tcp}/tcp_timer.c     |  0
 net/ipv4/{ => tcp}/tcp_ulp.c       |  0
 net/ipv4/{ => tcp}/tcp_vegas.c     |  0
 net/ipv4/{ => tcp}/tcp_vegas.h     |  0
 net/ipv4/{ => tcp}/tcp_veno.c      |  0
 net/ipv4/{ => tcp}/tcp_westwood.c  |  0
 net/ipv4/{ => tcp}/tcp_yeah.c      |  0
 32 files changed, 28 insertions(+), 23 deletions(-)
 create mode 100644 net/ipv4/tcp/Makefile
 rename net/ipv4/{ => tcp}/tcp.c (100%)
 rename net/ipv4/{ => tcp}/tcp_bbr.c (100%)
 rename net/ipv4/{ => tcp}/tcp_bic.c (100%)
 rename net/ipv4/{ => tcp}/tcp_cdg.c (100%)
 rename net/ipv4/{ => tcp}/tcp_cong.c (100%)
 rename net/ipv4/{ => tcp}/tcp_cubic.c (100%)
 rename net/ipv4/{ => tcp}/tcp_dctcp.c (100%)
 rename net/ipv4/{ => tcp}/tcp_diag.c (100%)
 rename net/ipv4/{ => tcp}/tcp_fastopen.c (100%)
 rename net/ipv4/{ => tcp}/tcp_highspeed.c (100%)
 rename net/ipv4/{ => tcp}/tcp_htcp.c (100%)
 rename net/ipv4/{ => tcp}/tcp_hybla.c (100%)
 rename net/ipv4/{ => tcp}/tcp_illinois.c (100%)
 rename net/ipv4/{ => tcp}/tcp_input.c (100%)
 rename net/ipv4/{ => tcp}/tcp_lp.c (100%)
 rename net/ipv4/{ => tcp}/tcp_metrics.c (100%)
 rename net/ipv4/{ => tcp}/tcp_minisocks.c (100%)
 rename net/ipv4/{ => tcp}/tcp_nv.c (100%)
 rename net/ipv4/{ => tcp}/tcp_output.c (100%)
 rename net/ipv4/{ => tcp}/tcp_probe.c (100%)
 rename net/ipv4/{ => tcp}/tcp_rate.c (100%)
 rename net/ipv4/{ => tcp}/tcp_recovery.c (100%)
 rename net/ipv4/{ => tcp}/tcp_scalable.c (100%)
 rename net/ipv4/{ => tcp}/tcp_timer.c (100%)
 rename net/ipv4/{ => tcp}/tcp_ulp.c (100%)
 rename net/ipv4/{ => tcp}/tcp_vegas.c (100%)
 rename net/ipv4/{ => tcp}/tcp_vegas.h (100%)
 rename net/ipv4/{ => tcp}/tcp_veno.c (100%)
 rename net/ipv4/{ => tcp}/tcp_westwood.c (100%)
 rename net/ipv4/{ => tcp}/tcp_yeah.c (100%)

diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index afcb435adfbe..a106a1cf0e11 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -2,14 +2,11 @@
 # Makefile for the Linux TCP/IP (INET) layer.
 #
 
-obj-y     := route.o inetpeer.o protocol.o \
+obj-y     := tcp/ route.o inetpeer.o protocol.o \
 	     ip_input.o ip_fragment.o ip_forward.o ip_options.o \
 	     ip_output.o ip_sockglue.o inet_hashtables.o \
 	     inet_timewait_sock.o inet_connection_sock.o \
-	     tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \
-	     tcp_minisocks.o tcp_cong.o tcp_metrics.o tcp_fastopen.o \
-	     tcp_rate.o tcp_recovery.o tcp_ulp.o \
-	     tcp_offload.o datagram.o raw.o udp.o udplite.o \
+	     datagram.o raw.o udp.o udplite.o tcp_offload.o tcp_ipv4.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o fib_notifier.o \
 	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
@@ -39,26 +36,8 @@ obj-$(CONFIG_INET_XFRM_MODE_TUNNEL) += xfrm4_mode_tunnel.o
 obj-$(CONFIG_IP_PNP) += ipconfig.o
 obj-$(CONFIG_NETFILTER)	+= netfilter.o netfilter/
 obj-$(CONFIG_INET_DIAG) += inet_diag.o 
-obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
 obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o
 obj-$(CONFIG_INET_RAW_DIAG) += raw_diag.o
-obj-$(CONFIG_NET_TCPPROBE) += tcp_probe.o
-obj-$(CONFIG_TCP_CONG_BBR) += tcp_bbr.o
-obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o
-obj-$(CONFIG_TCP_CONG_CDG) += tcp_cdg.o
-obj-$(CONFIG_TCP_CONG_CUBIC) += tcp_cubic.o
-obj-$(CONFIG_TCP_CONG_DCTCP) += tcp_dctcp.o
-obj-$(CONFIG_TCP_CONG_WESTWOOD) += tcp_westwood.o
-obj-$(CONFIG_TCP_CONG_HSTCP) += tcp_highspeed.o
-obj-$(CONFIG_TCP_CONG_HYBLA) += tcp_hybla.o
-obj-$(CONFIG_TCP_CONG_HTCP) += tcp_htcp.o
-obj-$(CONFIG_TCP_CONG_VEGAS) += tcp_vegas.o
-obj-$(CONFIG_TCP_CONG_NV) += tcp_nv.o
-obj-$(CONFIG_TCP_CONG_VENO) += tcp_veno.o
-obj-$(CONFIG_TCP_CONG_SCALABLE) += tcp_scalable.o
-obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o
-obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o
-obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
 obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
 
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
diff --git a/net/ipv4/tcp/Makefile b/net/ipv4/tcp/Makefile
new file mode 100644
index 000000000000..91d2c991a243
--- /dev/null
+++ b/net/ipv4/tcp/Makefile
@@ -0,0 +1,26 @@
+#
+# Makefile for Linux TCP
+#
+
+obj-y     := tcp.o tcp_input.o tcp_output.o tcp_timer.o \
+	     tcp_minisocks.o tcp_cong.o tcp_metrics.o tcp_fastopen.o \
+	     tcp_rate.o tcp_recovery.o tcp_ulp.o
+
+obj-$(CONFIG_INET_TCP_DIAG)     +=  tcp_diag.o
+obj-$(CONFIG_NET_TCPPROBE)      +=  tcp_probe.o
+obj-$(CONFIG_TCP_CONG_BBR)      +=  tcp_bbr.o
+obj-$(CONFIG_TCP_CONG_BIC)      +=  tcp_bic.o
+obj-$(CONFIG_TCP_CONG_CDG)      +=  tcp_cdg.o
+obj-$(CONFIG_TCP_CONG_CUBIC)    +=  tcp_cubic.o
+obj-$(CONFIG_TCP_CONG_DCTCP)    +=  tcp_dctcp.o
+obj-$(CONFIG_TCP_CONG_WESTWOOD) +=  tcp_westwood.o
+obj-$(CONFIG_TCP_CONG_HSTCP)    +=  tcp_highspeed.o
+obj-$(CONFIG_TCP_CONG_HYBLA)    +=  tcp_hybla.o
+obj-$(CONFIG_TCP_CONG_HTCP)     +=  tcp_htcp.o
+obj-$(CONFIG_TCP_CONG_VEGAS)    +=  tcp_vegas.o
+obj-$(CONFIG_TCP_CONG_NV)       +=  tcp_nv.o
+obj-$(CONFIG_TCP_CONG_VENO)     +=  tcp_veno.o
+obj-$(CONFIG_TCP_CONG_SCALABLE) +=  tcp_scalable.o
+obj-$(CONFIG_TCP_CONG_LP)       +=  tcp_lp.o
+obj-$(CONFIG_TCP_CONG_YEAH)     +=  tcp_yeah.o
+obj-$(CONFIG_TCP_CONG_ILLINOIS) +=  tcp_illinois.o
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp/tcp.c
similarity index 100%
rename from net/ipv4/tcp.c
rename to net/ipv4/tcp/tcp.c
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp/tcp_bbr.c
similarity index 100%
rename from net/ipv4/tcp_bbr.c
rename to net/ipv4/tcp/tcp_bbr.c
diff --git a/net/ipv4/tcp_bic.c b/net/ipv4/tcp/tcp_bic.c
similarity index 100%
rename from net/ipv4/tcp_bic.c
rename to net/ipv4/tcp/tcp_bic.c
diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp/tcp_cdg.c
similarity index 100%
rename from net/ipv4/tcp_cdg.c
rename to net/ipv4/tcp/tcp_cdg.c
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp/tcp_cong.c
similarity index 100%
rename from net/ipv4/tcp_cong.c
rename to net/ipv4/tcp/tcp_cong.c
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp/tcp_cubic.c
similarity index 100%
rename from net/ipv4/tcp_cubic.c
rename to net/ipv4/tcp/tcp_cubic.c
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp/tcp_dctcp.c
similarity index 100%
rename from net/ipv4/tcp_dctcp.c
rename to net/ipv4/tcp/tcp_dctcp.c
diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp/tcp_diag.c
similarity index 100%
rename from net/ipv4/tcp_diag.c
rename to net/ipv4/tcp/tcp_diag.c
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp/tcp_fastopen.c
similarity index 100%
rename from net/ipv4/tcp_fastopen.c
rename to net/ipv4/tcp/tcp_fastopen.c
diff --git a/net/ipv4/tcp_highspeed.c b/net/ipv4/tcp/tcp_highspeed.c
similarity index 100%
rename from net/ipv4/tcp_highspeed.c
rename to net/ipv4/tcp/tcp_highspeed.c
diff --git a/net/ipv4/tcp_htcp.c b/net/ipv4/tcp/tcp_htcp.c
similarity index 100%
rename from net/ipv4/tcp_htcp.c
rename to net/ipv4/tcp/tcp_htcp.c
diff --git a/net/ipv4/tcp_hybla.c b/net/ipv4/tcp/tcp_hybla.c
similarity index 100%
rename from net/ipv4/tcp_hybla.c
rename to net/ipv4/tcp/tcp_hybla.c
diff --git a/net/ipv4/tcp_illinois.c b/net/ipv4/tcp/tcp_illinois.c
similarity index 100%
rename from net/ipv4/tcp_illinois.c
rename to net/ipv4/tcp/tcp_illinois.c
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp/tcp_input.c
similarity index 100%
rename from net/ipv4/tcp_input.c
rename to net/ipv4/tcp/tcp_input.c
diff --git a/net/ipv4/tcp_lp.c b/net/ipv4/tcp/tcp_lp.c
similarity index 100%
rename from net/ipv4/tcp_lp.c
rename to net/ipv4/tcp/tcp_lp.c
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp/tcp_metrics.c
similarity index 100%
rename from net/ipv4/tcp_metrics.c
rename to net/ipv4/tcp/tcp_metrics.c
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp/tcp_minisocks.c
similarity index 100%
rename from net/ipv4/tcp_minisocks.c
rename to net/ipv4/tcp/tcp_minisocks.c
diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp/tcp_nv.c
similarity index 100%
rename from net/ipv4/tcp_nv.c
rename to net/ipv4/tcp/tcp_nv.c
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp/tcp_output.c
similarity index 100%
rename from net/ipv4/tcp_output.c
rename to net/ipv4/tcp/tcp_output.c
diff --git a/net/ipv4/tcp_probe.c b/net/ipv4/tcp/tcp_probe.c
similarity index 100%
rename from net/ipv4/tcp_probe.c
rename to net/ipv4/tcp/tcp_probe.c
diff --git a/net/ipv4/tcp_rate.c b/net/ipv4/tcp/tcp_rate.c
similarity index 100%
rename from net/ipv4/tcp_rate.c
rename to net/ipv4/tcp/tcp_rate.c
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp/tcp_recovery.c
similarity index 100%
rename from net/ipv4/tcp_recovery.c
rename to net/ipv4/tcp/tcp_recovery.c
diff --git a/net/ipv4/tcp_scalable.c b/net/ipv4/tcp/tcp_scalable.c
similarity index 100%
rename from net/ipv4/tcp_scalable.c
rename to net/ipv4/tcp/tcp_scalable.c
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp/tcp_timer.c
similarity index 100%
rename from net/ipv4/tcp_timer.c
rename to net/ipv4/tcp/tcp_timer.c
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp/tcp_ulp.c
similarity index 100%
rename from net/ipv4/tcp_ulp.c
rename to net/ipv4/tcp/tcp_ulp.c
diff --git a/net/ipv4/tcp_vegas.c b/net/ipv4/tcp/tcp_vegas.c
similarity index 100%
rename from net/ipv4/tcp_vegas.c
rename to net/ipv4/tcp/tcp_vegas.c
diff --git a/net/ipv4/tcp_vegas.h b/net/ipv4/tcp/tcp_vegas.h
similarity index 100%
rename from net/ipv4/tcp_vegas.h
rename to net/ipv4/tcp/tcp_vegas.h
diff --git a/net/ipv4/tcp_veno.c b/net/ipv4/tcp/tcp_veno.c
similarity index 100%
rename from net/ipv4/tcp_veno.c
rename to net/ipv4/tcp/tcp_veno.c
diff --git a/net/ipv4/tcp_westwood.c b/net/ipv4/tcp/tcp_westwood.c
similarity index 100%
rename from net/ipv4/tcp_westwood.c
rename to net/ipv4/tcp/tcp_westwood.c
diff --git a/net/ipv4/tcp_yeah.c b/net/ipv4/tcp/tcp_yeah.c
similarity index 100%
rename from net/ipv4/tcp_yeah.c
rename to net/ipv4/tcp/tcp_yeah.c
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next 0/3] A own subdirectory for shared TCP code
From: Richard Sailer @ 2017-10-03 22:22 UTC (permalink / raw)
  To: netdev, davem; +Cc: kuznet, yoshfuji

net/ipv4 currently contains around 100 source files containing the
IP implementation and lots of other functionality (UDP, TCP, xfrm, etc.)

# 1/3
To make the networking source tree more self documenting and well 
structured 1/3 moves the 30 shared TCP source files to a own 
subdirectory of net/ipv4/.

# 2/3 
A huge part of the TCP code is congestion control algorithms.
With the same goal as above, this moves the cc algos to a own
subdirectory of net/ipv4/tcp/

# 3/3
Implementation of a suggestion of valdis, to have a new directory
net/ipv4v6_shared that contains all the code shared between
IPv4 and IPv6. This creates net/ipv4v6_shared and moves the 
net/ipv4/tcp directory of shared TCP code there.

While every commit depends on its previous commits, it is
also possible and sensible to only apply (1) or (1,2) , depending
on how conservative one wants to decide regarding directory
structure changes.

[PATCH net-next 1/3] net/ipv4: Move shared tcp code to own subdirectory
[PATCH net-next 2/3] tcp: Move cc algorithms to own subdirectory
[PATCH net-next 3/3] Move shared tcp code to net/ipv4v6shared

^ permalink raw reply

* [PATCH net-next v2 10/10] sctp: introduce round robin stream scheduler
From: Marcelo Ricardo Leitner @ 2017-10-03 22:20 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1507069005.git.marcelo.leitner@gmail.com>

This patch introduces RFC Draft ndata section 3.2 Priority Based
Scheduler (SCTP_SS_RR).

Works by maintaining a list of enqueued streams and tracking the last
one used to send data. When the datamsg is done, it switches to the next
stream.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/sctp/structs.h |  11 +++
 include/uapi/linux/sctp.h  |   3 +-
 net/sctp/Makefile          |   3 +-
 net/sctp/stream_sched.c    |   2 +
 net/sctp/stream_sched_rr.c | 201 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 218 insertions(+), 2 deletions(-)
 create mode 100644 net/sctp/stream_sched_rr.c

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 40eb8d66a37c3ecee39141dc111663f7aac7326a..16f949eef52fdfd7c90fa15b44093334d1355aaf 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1348,6 +1348,10 @@ struct sctp_stream_out_ext {
 			struct list_head prio_list;
 			struct sctp_stream_priorities *prio_head;
 		};
+		/* Fields used by RR scheduler */
+		struct {
+			struct list_head rr_list;
+		};
 	};
 };
 
@@ -1374,6 +1378,13 @@ struct sctp_stream {
 			/* List of priorities scheduled */
 			struct list_head prio_list;
 		};
+		/* Fields used by RR scheduler */
+		struct {
+			/* List of streams scheduled */
+			struct list_head rr_list;
+			/* The next stream stream in line */
+			struct sctp_stream_out_ext *rr_next;
+		};
 	};
 };
 
diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 850fa8b29d7e8163dc4ee88af192309bb2535ae9..6cd7d416ca406e59d3214976fc425bb805f5c6cc 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -1100,7 +1100,8 @@ struct sctp_add_streams {
 enum sctp_sched_type {
 	SCTP_SS_FCFS,
 	SCTP_SS_PRIO,
-	SCTP_SS_MAX = SCTP_SS_PRIO
+	SCTP_SS_RR,
+	SCTP_SS_MAX = SCTP_SS_RR
 };
 
 #endif /* _UAPI_SCTP_H */
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index 647c9cfd4e95be4429d25792e5832d7be2efc5c8..bf90c53977190ff563c2b43af31afb7c431d4534 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -12,7 +12,8 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
 	  inqueue.o outqueue.o ulpqueue.o \
 	  tsnmap.o bind_addr.o socket.o primitive.o \
 	  output.o input.o debug.o stream.o auth.o \
-	  offload.o stream_sched.o stream_sched_prio.o
+	  offload.o stream_sched.o stream_sched_prio.o \
+	  stream_sched_rr.o
 
 sctp_probe-y := probe.o
 
diff --git a/net/sctp/stream_sched.c b/net/sctp/stream_sched.c
index 115ddb7651695cca7417cb63004a1a59c93523b8..03513a9fa110b5317af4502f98ab37702c1eddb9 100644
--- a/net/sctp/stream_sched.c
+++ b/net/sctp/stream_sched.c
@@ -122,10 +122,12 @@ static struct sctp_sched_ops sctp_sched_fcfs = {
 /* API to other parts of the stack */
 
 extern struct sctp_sched_ops sctp_sched_prio;
+extern struct sctp_sched_ops sctp_sched_rr;
 
 struct sctp_sched_ops *sctp_sched_ops[] = {
 	&sctp_sched_fcfs,
 	&sctp_sched_prio,
+	&sctp_sched_rr,
 };
 
 int sctp_sched_set_sched(struct sctp_association *asoc,
diff --git a/net/sctp/stream_sched_rr.c b/net/sctp/stream_sched_rr.c
new file mode 100644
index 0000000000000000000000000000000000000000..7612a438c5b939ae1c26c4acc06902749b601524
--- /dev/null
+++ b/net/sctp/stream_sched_rr.c
@@ -0,0 +1,201 @@
+/* SCTP kernel implementation
+ * (C) Copyright Red Hat Inc. 2017
+ *
+ * This file is part of the SCTP kernel implementation
+ *
+ * These functions manipulate sctp stream queue/scheduling.
+ *
+ * This SCTP implementation is free software;
+ * you can redistribute it and/or modify it under the terms of
+ * the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This SCTP implementation is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; without even the implied
+ *                 ************************
+ * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GNU CC; see the file COPYING.  If not, see
+ * <http://www.gnu.org/licenses/>.
+ *
+ * Please send any bug reports or fixes you make to the
+ * email addresched(es):
+ *    lksctp developers <linux-sctp@vger.kernel.org>
+ *
+ * Written or modified by:
+ *    Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
+ */
+
+#include <linux/list.h>
+#include <net/sctp/sctp.h>
+#include <net/sctp/sm.h>
+#include <net/sctp/stream_sched.h>
+
+/* Priority handling
+ * RFC DRAFT ndata section 3.2
+ */
+static void sctp_sched_rr_unsched_all(struct sctp_stream *stream);
+
+static void sctp_sched_rr_next_stream(struct sctp_stream *stream)
+{
+	struct list_head *pos;
+
+	pos = stream->rr_next->rr_list.next;
+	if (pos == &stream->rr_list)
+		pos = pos->next;
+	stream->rr_next = list_entry(pos, struct sctp_stream_out_ext, rr_list);
+}
+
+static void sctp_sched_rr_unsched(struct sctp_stream *stream,
+				  struct sctp_stream_out_ext *soute)
+{
+	if (stream->rr_next == soute)
+		/* Try to move to the next stream */
+		sctp_sched_rr_next_stream(stream);
+
+	list_del_init(&soute->rr_list);
+
+	/* If we have no other stream queued, clear next */
+	if (list_empty(&stream->rr_list))
+		stream->rr_next = NULL;
+}
+
+static void sctp_sched_rr_sched(struct sctp_stream *stream,
+				struct sctp_stream_out_ext *soute)
+{
+	if (!list_empty(&soute->rr_list))
+		/* Already scheduled. */
+		return;
+
+	/* Schedule the stream */
+	list_add_tail(&soute->rr_list, &stream->rr_list);
+
+	if (!stream->rr_next)
+		stream->rr_next = soute;
+}
+
+static int sctp_sched_rr_set(struct sctp_stream *stream, __u16 sid,
+			     __u16 prio, gfp_t gfp)
+{
+	return 0;
+}
+
+static int sctp_sched_rr_get(struct sctp_stream *stream, __u16 sid,
+			     __u16 *value)
+{
+	return 0;
+}
+
+static int sctp_sched_rr_init(struct sctp_stream *stream)
+{
+	INIT_LIST_HEAD(&stream->rr_list);
+	stream->rr_next = NULL;
+
+	return 0;
+}
+
+static int sctp_sched_rr_init_sid(struct sctp_stream *stream, __u16 sid,
+				  gfp_t gfp)
+{
+	INIT_LIST_HEAD(&stream->out[sid].ext->rr_list);
+
+	return 0;
+}
+
+static void sctp_sched_rr_free(struct sctp_stream *stream)
+{
+	sctp_sched_rr_unsched_all(stream);
+}
+
+static void sctp_sched_rr_enqueue(struct sctp_outq *q,
+				  struct sctp_datamsg *msg)
+{
+	struct sctp_stream *stream;
+	struct sctp_chunk *ch;
+	__u16 sid;
+
+	ch = list_first_entry(&msg->chunks, struct sctp_chunk, frag_list);
+	sid = sctp_chunk_stream_no(ch);
+	stream = &q->asoc->stream;
+	sctp_sched_rr_sched(stream, stream->out[sid].ext);
+}
+
+static struct sctp_chunk *sctp_sched_rr_dequeue(struct sctp_outq *q)
+{
+	struct sctp_stream *stream = &q->asoc->stream;
+	struct sctp_stream_out_ext *soute;
+	struct sctp_chunk *ch = NULL;
+
+	/* Bail out quickly if queue is empty */
+	if (list_empty(&q->out_chunk_list))
+		goto out;
+
+	/* Find which chunk is next */
+	if (stream->out_curr)
+		soute = stream->out_curr->ext;
+	else
+		soute = stream->rr_next;
+	ch = list_entry(soute->outq.next, struct sctp_chunk, stream_list);
+
+	sctp_sched_dequeue_common(q, ch);
+
+out:
+	return ch;
+}
+
+static void sctp_sched_rr_dequeue_done(struct sctp_outq *q,
+				       struct sctp_chunk *ch)
+{
+	struct sctp_stream_out_ext *soute;
+	__u16 sid;
+
+	/* Last chunk on that msg, move to the next stream */
+	sid = sctp_chunk_stream_no(ch);
+	soute = q->asoc->stream.out[sid].ext;
+
+	sctp_sched_rr_next_stream(&q->asoc->stream);
+
+	if (list_empty(&soute->outq))
+		sctp_sched_rr_unsched(&q->asoc->stream, soute);
+}
+
+static void sctp_sched_rr_sched_all(struct sctp_stream *stream)
+{
+	struct sctp_association *asoc;
+	struct sctp_stream_out_ext *soute;
+	struct sctp_chunk *ch;
+
+	asoc = container_of(stream, struct sctp_association, stream);
+	list_for_each_entry(ch, &asoc->outqueue.out_chunk_list, list) {
+		__u16 sid;
+
+		sid = sctp_chunk_stream_no(ch);
+		soute = stream->out[sid].ext;
+		if (soute)
+			sctp_sched_rr_sched(stream, soute);
+	}
+}
+
+static void sctp_sched_rr_unsched_all(struct sctp_stream *stream)
+{
+	struct sctp_stream_out_ext *soute, *tmp;
+
+	list_for_each_entry_safe(soute, tmp, &stream->rr_list, rr_list)
+		sctp_sched_rr_unsched(stream, soute);
+}
+
+struct sctp_sched_ops sctp_sched_rr = {
+	.set = sctp_sched_rr_set,
+	.get = sctp_sched_rr_get,
+	.init = sctp_sched_rr_init,
+	.init_sid = sctp_sched_rr_init_sid,
+	.free = sctp_sched_rr_free,
+	.enqueue = sctp_sched_rr_enqueue,
+	.dequeue = sctp_sched_rr_dequeue,
+	.dequeue_done = sctp_sched_rr_dequeue_done,
+	.sched_all = sctp_sched_rr_sched_all,
+	.unsched_all = sctp_sched_rr_unsched_all,
+};
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next v2 09/10] sctp: introduce priority based stream scheduler
From: Marcelo Ricardo Leitner @ 2017-10-03 22:20 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1507069005.git.marcelo.leitner@gmail.com>

This patch introduces RFC Draft ndata section 3.4 Priority Based
Scheduler (SCTP_SS_PRIO).

It works by having a struct sctp_stream_priority for each priority
configured. This struct is then enlisted on a queue ordered per priority
if, and only if, there is a stream with data queued, so that dequeueing
is very straightforward: either finish current datamsg or simply dequeue
from the highest priority queued, which is the next stream pointed, and
that's it.

If there are multiple streams assigned with the same priority and with
data queued, it will do round robin amongst them while respecting
datamsgs boundaries (when not using idata chunks), to be reasonably
fair.

We intentionally don't maintain a list of priorities nor a list of all
streams with the same priority to save memory. The first would mean at
least 2 other pointers per priority (which, for 1000 priorities, that
can mean 16kB) and the second would also mean 2 other pointers but per
stream. As SCTP supports up to 65535 streams on a given asoc, that's
1MB. This impacts when giving a priority to some stream, as we have to
find out if the new priority is already being used and if we can free
the old one, and also when tearing down.

The new fields in struct sctp_stream_out_ext and sctp_stream are added
under a union because that memory is to be shared with other schedulers.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/sctp/structs.h   |  24 +++
 include/uapi/linux/sctp.h    |   3 +-
 net/sctp/Makefile            |   2 +-
 net/sctp/stream_sched.c      |   3 +
 net/sctp/stream_sched_prio.c | 347 +++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 377 insertions(+), 2 deletions(-)
 create mode 100644 net/sctp/stream_sched_prio.c

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 3c22a30fd71b4ef87419a77cf69b00807a5986bb..40eb8d66a37c3ecee39141dc111663f7aac7326a 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1328,10 +1328,27 @@ struct sctp_inithdr_host {
 	__u32 initial_tsn;
 };
 
+struct sctp_stream_priorities {
+	/* List of priorities scheduled */
+	struct list_head prio_sched;
+	/* List of streams scheduled */
+	struct list_head active;
+	/* The next stream stream in line */
+	struct sctp_stream_out_ext *next;
+	__u16 prio;
+};
+
 struct sctp_stream_out_ext {
 	__u64 abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
 	__u64 abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
 	struct list_head outq; /* chunks enqueued by this stream */
+	union {
+		struct {
+			/* Scheduled streams list */
+			struct list_head prio_list;
+			struct sctp_stream_priorities *prio_head;
+		};
+	};
 };
 
 struct sctp_stream_out {
@@ -1351,6 +1368,13 @@ struct sctp_stream {
 	__u16 incnt;
 	/* Current stream being sent, if any */
 	struct sctp_stream_out *out_curr;
+	union {
+		/* Fields used by priority scheduler */
+		struct {
+			/* List of priorities scheduled */
+			struct list_head prio_list;
+		};
+	};
 };
 
 #define SCTP_STREAM_CLOSED		0x00
diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 00ac417d2c4f8468ea2aad32e59806be5c5aa08d..850fa8b29d7e8163dc4ee88af192309bb2535ae9 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -1099,7 +1099,8 @@ struct sctp_add_streams {
 /* SCTP Stream schedulers */
 enum sctp_sched_type {
 	SCTP_SS_FCFS,
-	SCTP_SS_MAX = SCTP_SS_FCFS
+	SCTP_SS_PRIO,
+	SCTP_SS_MAX = SCTP_SS_PRIO
 };
 
 #endif /* _UAPI_SCTP_H */
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index 0f6e6d1d69fd336b4a99f896851b0120f9a0d1e0..647c9cfd4e95be4429d25792e5832d7be2efc5c8 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -12,7 +12,7 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
 	  inqueue.o outqueue.o ulpqueue.o \
 	  tsnmap.o bind_addr.o socket.o primitive.o \
 	  output.o input.o debug.o stream.o auth.o \
-	  offload.o stream_sched.o
+	  offload.o stream_sched.o stream_sched_prio.o
 
 sctp_probe-y := probe.o
 
diff --git a/net/sctp/stream_sched.c b/net/sctp/stream_sched.c
index 40a9a9de2b98a56786a4c8585f5ad514be9189af..115ddb7651695cca7417cb63004a1a59c93523b8 100644
--- a/net/sctp/stream_sched.c
+++ b/net/sctp/stream_sched.c
@@ -121,8 +121,11 @@ static struct sctp_sched_ops sctp_sched_fcfs = {
 
 /* API to other parts of the stack */
 
+extern struct sctp_sched_ops sctp_sched_prio;
+
 struct sctp_sched_ops *sctp_sched_ops[] = {
 	&sctp_sched_fcfs,
+	&sctp_sched_prio,
 };
 
 int sctp_sched_set_sched(struct sctp_association *asoc,
diff --git a/net/sctp/stream_sched_prio.c b/net/sctp/stream_sched_prio.c
new file mode 100644
index 0000000000000000000000000000000000000000..384dbf3c876096e2ad98a6b6185d9da5cc4145c6
--- /dev/null
+++ b/net/sctp/stream_sched_prio.c
@@ -0,0 +1,347 @@
+/* SCTP kernel implementation
+ * (C) Copyright Red Hat Inc. 2017
+ *
+ * This file is part of the SCTP kernel implementation
+ *
+ * These functions manipulate sctp stream queue/scheduling.
+ *
+ * This SCTP implementation is free software;
+ * you can redistribute it and/or modify it under the terms of
+ * the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This SCTP implementation is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; without even the implied
+ *                 ************************
+ * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GNU CC; see the file COPYING.  If not, see
+ * <http://www.gnu.org/licenses/>.
+ *
+ * Please send any bug reports or fixes you make to the
+ * email addresched(es):
+ *    lksctp developers <linux-sctp@vger.kernel.org>
+ *
+ * Written or modified by:
+ *    Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
+ */
+
+#include <linux/list.h>
+#include <net/sctp/sctp.h>
+#include <net/sctp/sm.h>
+#include <net/sctp/stream_sched.h>
+
+/* Priority handling
+ * RFC DRAFT ndata section 3.4
+ */
+
+static void sctp_sched_prio_unsched_all(struct sctp_stream *stream);
+
+static struct sctp_stream_priorities *sctp_sched_prio_new_head(
+			struct sctp_stream *stream, int prio, gfp_t gfp)
+{
+	struct sctp_stream_priorities *p;
+
+	p = kmalloc(sizeof(*p), gfp);
+	if (!p)
+		return NULL;
+
+	INIT_LIST_HEAD(&p->prio_sched);
+	INIT_LIST_HEAD(&p->active);
+	p->next = NULL;
+	p->prio = prio;
+
+	return p;
+}
+
+static struct sctp_stream_priorities *sctp_sched_prio_get_head(
+			struct sctp_stream *stream, int prio, gfp_t gfp)
+{
+	struct sctp_stream_priorities *p;
+	int i;
+
+	/* Look into scheduled priorities first, as they are sorted and
+	 * we can find it fast IF it's scheduled.
+	 */
+	list_for_each_entry(p, &stream->prio_list, prio_sched) {
+		if (p->prio == prio)
+			return p;
+		if (p->prio > prio)
+			break;
+	}
+
+	/* No luck. So we search on all streams now. */
+	for (i = 0; i < stream->outcnt; i++) {
+		if (!stream->out[i].ext)
+			continue;
+
+		p = stream->out[i].ext->prio_head;
+		if (!p)
+			/* Means all other streams won't be initialized
+			 * as well.
+			 */
+			break;
+		if (p->prio == prio)
+			return p;
+	}
+
+	/* If not even there, allocate a new one. */
+	return sctp_sched_prio_new_head(stream, prio, gfp);
+}
+
+static void sctp_sched_prio_next_stream(struct sctp_stream_priorities *p)
+{
+	struct list_head *pos;
+
+	pos = p->next->prio_list.next;
+	if (pos == &p->active)
+		pos = pos->next;
+	p->next = list_entry(pos, struct sctp_stream_out_ext, prio_list);
+}
+
+static bool sctp_sched_prio_unsched(struct sctp_stream_out_ext *soute)
+{
+	bool scheduled = false;
+
+	if (!list_empty(&soute->prio_list)) {
+		struct sctp_stream_priorities *prio_head = soute->prio_head;
+
+		/* Scheduled */
+		scheduled = true;
+
+		if (prio_head->next == soute)
+			/* Try to move to the next stream */
+			sctp_sched_prio_next_stream(prio_head);
+
+		list_del_init(&soute->prio_list);
+
+		/* Also unsched the priority if this was the last stream */
+		if (list_empty(&prio_head->active)) {
+			list_del_init(&prio_head->prio_sched);
+			/* If there is no stream left, clear next */
+			prio_head->next = NULL;
+		}
+	}
+
+	return scheduled;
+}
+
+static void sctp_sched_prio_sched(struct sctp_stream *stream,
+				  struct sctp_stream_out_ext *soute)
+{
+	struct sctp_stream_priorities *prio, *prio_head;
+
+	prio_head = soute->prio_head;
+
+	/* Nothing to do if already scheduled */
+	if (!list_empty(&soute->prio_list))
+		return;
+
+	/* Schedule the stream. If there is a next, we schedule the new
+	 * one before it, so it's the last in round robin order.
+	 * If there isn't, we also have to schedule the priority.
+	 */
+	if (prio_head->next) {
+		list_add(&soute->prio_list, prio_head->next->prio_list.prev);
+		return;
+	}
+
+	list_add(&soute->prio_list, &prio_head->active);
+	prio_head->next = soute;
+
+	list_for_each_entry(prio, &stream->prio_list, prio_sched) {
+		if (prio->prio > prio_head->prio) {
+			list_add(&prio_head->prio_sched, prio->prio_sched.prev);
+			return;
+		}
+	}
+
+	list_add_tail(&prio_head->prio_sched, &stream->prio_list);
+}
+
+static int sctp_sched_prio_set(struct sctp_stream *stream, __u16 sid,
+			       __u16 prio, gfp_t gfp)
+{
+	struct sctp_stream_out *sout = &stream->out[sid];
+	struct sctp_stream_out_ext *soute = sout->ext;
+	struct sctp_stream_priorities *prio_head, *old;
+	bool reschedule = false;
+	int i;
+
+	prio_head = sctp_sched_prio_get_head(stream, prio, gfp);
+	if (!prio_head)
+		return -ENOMEM;
+
+	reschedule = sctp_sched_prio_unsched(soute);
+	old = soute->prio_head;
+	soute->prio_head = prio_head;
+	if (reschedule)
+		sctp_sched_prio_sched(stream, soute);
+
+	if (!old)
+		/* Happens when we set the priority for the first time */
+		return 0;
+
+	for (i = 0; i < stream->outcnt; i++) {
+		soute = stream->out[i].ext;
+		if (soute && soute->prio_head == old)
+			/* It's still in use, nothing else to do here. */
+			return 0;
+	}
+
+	/* No hits, we are good to free it. */
+	kfree(old);
+
+	return 0;
+}
+
+static int sctp_sched_prio_get(struct sctp_stream *stream, __u16 sid,
+			       __u16 *value)
+{
+	*value = stream->out[sid].ext->prio_head->prio;
+	return 0;
+}
+
+static int sctp_sched_prio_init(struct sctp_stream *stream)
+{
+	INIT_LIST_HEAD(&stream->prio_list);
+
+	return 0;
+}
+
+static int sctp_sched_prio_init_sid(struct sctp_stream *stream, __u16 sid,
+				    gfp_t gfp)
+{
+	INIT_LIST_HEAD(&stream->out[sid].ext->prio_list);
+	return sctp_sched_prio_set(stream, sid, 0, gfp);
+}
+
+static void sctp_sched_prio_free(struct sctp_stream *stream)
+{
+	struct sctp_stream_priorities *prio, *n;
+	LIST_HEAD(list);
+	int i;
+
+	/* As we don't keep a list of priorities, to avoid multiple
+	 * frees we have to do it in 3 steps:
+	 *   1. unsched everyone, so the lists are free to use in 2.
+	 *   2. build the list of the priorities
+	 *   3. free the list
+	 */
+	sctp_sched_prio_unsched_all(stream);
+	for (i = 0; i < stream->outcnt; i++) {
+		if (!stream->out[i].ext)
+			continue;
+		prio = stream->out[i].ext->prio_head;
+		if (prio && list_empty(&prio->prio_sched))
+			list_add(&prio->prio_sched, &list);
+	}
+	list_for_each_entry_safe(prio, n, &list, prio_sched) {
+		list_del_init(&prio->prio_sched);
+		kfree(prio);
+	}
+}
+
+static void sctp_sched_prio_enqueue(struct sctp_outq *q,
+				    struct sctp_datamsg *msg)
+{
+	struct sctp_stream *stream;
+	struct sctp_chunk *ch;
+	__u16 sid;
+
+	ch = list_first_entry(&msg->chunks, struct sctp_chunk, frag_list);
+	sid = sctp_chunk_stream_no(ch);
+	stream = &q->asoc->stream;
+	sctp_sched_prio_sched(stream, stream->out[sid].ext);
+}
+
+static struct sctp_chunk *sctp_sched_prio_dequeue(struct sctp_outq *q)
+{
+	struct sctp_stream *stream = &q->asoc->stream;
+	struct sctp_stream_priorities *prio;
+	struct sctp_stream_out_ext *soute;
+	struct sctp_chunk *ch = NULL;
+
+	/* Bail out quickly if queue is empty */
+	if (list_empty(&q->out_chunk_list))
+		goto out;
+
+	/* Find which chunk is next. It's easy, it's either the current
+	 * one or the first chunk on the next active stream.
+	 */
+	if (stream->out_curr) {
+		soute = stream->out_curr->ext;
+	} else {
+		prio = list_entry(stream->prio_list.next,
+				  struct sctp_stream_priorities, prio_sched);
+		soute = prio->next;
+	}
+	ch = list_entry(soute->outq.next, struct sctp_chunk, stream_list);
+	sctp_sched_dequeue_common(q, ch);
+
+out:
+	return ch;
+}
+
+static void sctp_sched_prio_dequeue_done(struct sctp_outq *q,
+					 struct sctp_chunk *ch)
+{
+	struct sctp_stream_priorities *prio;
+	struct sctp_stream_out_ext *soute;
+	__u16 sid;
+
+	/* Last chunk on that msg, move to the next stream on
+	 * this priority.
+	 */
+	sid = sctp_chunk_stream_no(ch);
+	soute = q->asoc->stream.out[sid].ext;
+	prio = soute->prio_head;
+
+	sctp_sched_prio_next_stream(prio);
+
+	if (list_empty(&soute->outq))
+		sctp_sched_prio_unsched(soute);
+}
+
+static void sctp_sched_prio_sched_all(struct sctp_stream *stream)
+{
+	struct sctp_association *asoc;
+	struct sctp_stream_out *sout;
+	struct sctp_chunk *ch;
+
+	asoc = container_of(stream, struct sctp_association, stream);
+	list_for_each_entry(ch, &asoc->outqueue.out_chunk_list, list) {
+		__u16 sid;
+
+		sid = sctp_chunk_stream_no(ch);
+		sout = &stream->out[sid];
+		if (sout->ext)
+			sctp_sched_prio_sched(stream, sout->ext);
+	}
+}
+
+static void sctp_sched_prio_unsched_all(struct sctp_stream *stream)
+{
+	struct sctp_stream_priorities *p, *tmp;
+	struct sctp_stream_out_ext *soute, *souttmp;
+
+	list_for_each_entry_safe(p, tmp, &stream->prio_list, prio_sched)
+		list_for_each_entry_safe(soute, souttmp, &p->active, prio_list)
+			sctp_sched_prio_unsched(soute);
+}
+
+struct sctp_sched_ops sctp_sched_prio = {
+	.set = sctp_sched_prio_set,
+	.get = sctp_sched_prio_get,
+	.init = sctp_sched_prio_init,
+	.init_sid = sctp_sched_prio_init_sid,
+	.free = sctp_sched_prio_free,
+	.enqueue = sctp_sched_prio_enqueue,
+	.dequeue = sctp_sched_prio_dequeue,
+	.dequeue_done = sctp_sched_prio_dequeue_done,
+	.sched_all = sctp_sched_prio_sched_all,
+	.unsched_all = sctp_sched_prio_unsched_all,
+};
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next v2 08/10] sctp: add sockopt to get/set stream scheduler parameters
From: Marcelo Ricardo Leitner @ 2017-10-03 22:20 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1507069005.git.marcelo.leitner@gmail.com>

As defined per RFC Draft ndata Section 4.3.3, named as
SCTP_STREAM_SCHEDULER_VALUE.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/uapi/linux/sctp.h |  7 +++++
 net/sctp/socket.c         | 77 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 84 insertions(+)

diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 0050f10087d224bad87c8c54ad318003381aee12..00ac417d2c4f8468ea2aad32e59806be5c5aa08d 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -123,6 +123,7 @@ typedef __s32 sctp_assoc_t;
 #define SCTP_ADD_STREAMS	121
 #define SCTP_SOCKOPT_PEELOFF_FLAGS 122
 #define SCTP_STREAM_SCHEDULER	123
+#define SCTP_STREAM_SCHEDULER_VALUE	124
 
 /* PR-SCTP policies */
 #define SCTP_PR_SCTP_NONE	0x0000
@@ -815,6 +816,12 @@ struct sctp_assoc_value {
     uint32_t                assoc_value;
 };
 
+struct sctp_stream_value {
+	sctp_assoc_t assoc_id;
+	uint16_t stream_id;
+	uint16_t stream_value;
+};
+
 /*
  * 7.2.2 Peer Address Information
  *
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index ae35dbf2810f78c71ce77115ffe4b0e27a672abc..88c28421ec151e83665efcbcbd8a6403b122205a 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -3945,6 +3945,34 @@ static int sctp_setsockopt_scheduler(struct sock *sk,
 	return retval;
 }
 
+static int sctp_setsockopt_scheduler_value(struct sock *sk,
+					   char __user *optval,
+					   unsigned int optlen)
+{
+	struct sctp_association *asoc;
+	struct sctp_stream_value params;
+	int retval = -EINVAL;
+
+	if (optlen < sizeof(params))
+		goto out;
+
+	optlen = sizeof(params);
+	if (copy_from_user(&params, optval, optlen)) {
+		retval = -EFAULT;
+		goto out;
+	}
+
+	asoc = sctp_id2assoc(sk, params.assoc_id);
+	if (!asoc)
+		goto out;
+
+	retval = sctp_sched_set_value(asoc, params.stream_id,
+				      params.stream_value, GFP_KERNEL);
+
+out:
+	return retval;
+}
+
 /* API 6.2 setsockopt(), getsockopt()
  *
  * Applications use setsockopt() and getsockopt() to set or retrieve
@@ -4129,6 +4157,9 @@ static int sctp_setsockopt(struct sock *sk, int level, int optname,
 	case SCTP_STREAM_SCHEDULER:
 		retval = sctp_setsockopt_scheduler(sk, optval, optlen);
 		break;
+	case SCTP_STREAM_SCHEDULER_VALUE:
+		retval = sctp_setsockopt_scheduler_value(sk, optval, optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
@@ -6864,6 +6895,48 @@ static int sctp_getsockopt_scheduler(struct sock *sk, int len,
 	return retval;
 }
 
+static int sctp_getsockopt_scheduler_value(struct sock *sk, int len,
+					   char __user *optval,
+					   int __user *optlen)
+{
+	struct sctp_stream_value params;
+	struct sctp_association *asoc;
+	int retval = -EFAULT;
+
+	if (len < sizeof(params)) {
+		retval = -EINVAL;
+		goto out;
+	}
+
+	len = sizeof(params);
+	if (copy_from_user(&params, optval, len))
+		goto out;
+
+	asoc = sctp_id2assoc(sk, params.assoc_id);
+	if (!asoc) {
+		retval = -EINVAL;
+		goto out;
+	}
+
+	retval = sctp_sched_get_value(asoc, params.stream_id,
+				      &params.stream_value);
+	if (retval)
+		goto out;
+
+	if (put_user(len, optlen)) {
+		retval = -EFAULT;
+		goto out;
+	}
+
+	if (copy_to_user(optval, &params, len)) {
+		retval = -EFAULT;
+		goto out;
+	}
+
+out:
+	return retval;
+}
+
 static int sctp_getsockopt(struct sock *sk, int level, int optname,
 			   char __user *optval, int __user *optlen)
 {
@@ -7050,6 +7123,10 @@ static int sctp_getsockopt(struct sock *sk, int level, int optname,
 		retval = sctp_getsockopt_scheduler(sk, len, optval,
 						   optlen);
 		break;
+	case SCTP_STREAM_SCHEDULER_VALUE:
+		retval = sctp_getsockopt_scheduler_value(sk, len, optval,
+							 optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next v2 07/10] sctp: add sockopt to get/set stream scheduler
From: Marcelo Ricardo Leitner @ 2017-10-03 22:20 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1507069005.git.marcelo.leitner@gmail.com>

As defined per RFC Draft ndata Section 4.3.2, named as
SCTP_STREAM_SCHEDULER.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/uapi/linux/sctp.h |  1 +
 net/sctp/socket.c         | 75 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 76 insertions(+)

diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 4487e7625ddbd48be1868a8292a807ecd0a314bc..0050f10087d224bad87c8c54ad318003381aee12 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -122,6 +122,7 @@ typedef __s32 sctp_assoc_t;
 #define SCTP_RESET_ASSOC	120
 #define SCTP_ADD_STREAMS	121
 #define SCTP_SOCKOPT_PEELOFF_FLAGS 122
+#define SCTP_STREAM_SCHEDULER	123
 
 /* PR-SCTP policies */
 #define SCTP_PR_SCTP_NONE	0x0000
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index d207734326b085e60625e4333f74221481114892..ae35dbf2810f78c71ce77115ffe4b0e27a672abc 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -79,6 +79,7 @@
 #include <net/sock.h>
 #include <net/sctp/sctp.h>
 #include <net/sctp/sm.h>
+#include <net/sctp/stream_sched.h>
 
 /* Forward declarations for internal helper functions. */
 static int sctp_writeable(struct sock *sk);
@@ -3914,6 +3915,36 @@ static int sctp_setsockopt_add_streams(struct sock *sk,
 	return retval;
 }
 
+static int sctp_setsockopt_scheduler(struct sock *sk,
+				     char __user *optval,
+				     unsigned int optlen)
+{
+	struct sctp_association *asoc;
+	struct sctp_assoc_value params;
+	int retval = -EINVAL;
+
+	if (optlen < sizeof(params))
+		goto out;
+
+	optlen = sizeof(params);
+	if (copy_from_user(&params, optval, optlen)) {
+		retval = -EFAULT;
+		goto out;
+	}
+
+	if (params.assoc_value > SCTP_SS_MAX)
+		goto out;
+
+	asoc = sctp_id2assoc(sk, params.assoc_id);
+	if (!asoc)
+		goto out;
+
+	retval = sctp_sched_set_sched(asoc, params.assoc_value);
+
+out:
+	return retval;
+}
+
 /* API 6.2 setsockopt(), getsockopt()
  *
  * Applications use setsockopt() and getsockopt() to set or retrieve
@@ -4095,6 +4126,9 @@ static int sctp_setsockopt(struct sock *sk, int level, int optname,
 	case SCTP_ADD_STREAMS:
 		retval = sctp_setsockopt_add_streams(sk, optval, optlen);
 		break;
+	case SCTP_STREAM_SCHEDULER:
+		retval = sctp_setsockopt_scheduler(sk, optval, optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
@@ -6793,6 +6827,43 @@ static int sctp_getsockopt_enable_strreset(struct sock *sk, int len,
 	return retval;
 }
 
+static int sctp_getsockopt_scheduler(struct sock *sk, int len,
+				     char __user *optval,
+				     int __user *optlen)
+{
+	struct sctp_assoc_value params;
+	struct sctp_association *asoc;
+	int retval = -EFAULT;
+
+	if (len < sizeof(params)) {
+		retval = -EINVAL;
+		goto out;
+	}
+
+	len = sizeof(params);
+	if (copy_from_user(&params, optval, len))
+		goto out;
+
+	asoc = sctp_id2assoc(sk, params.assoc_id);
+	if (!asoc) {
+		retval = -EINVAL;
+		goto out;
+	}
+
+	params.assoc_value = sctp_sched_get_sched(asoc);
+
+	if (put_user(len, optlen))
+		goto out;
+
+	if (copy_to_user(optval, &params, len))
+		goto out;
+
+	retval = 0;
+
+out:
+	return retval;
+}
+
 static int sctp_getsockopt(struct sock *sk, int level, int optname,
 			   char __user *optval, int __user *optlen)
 {
@@ -6975,6 +7046,10 @@ static int sctp_getsockopt(struct sock *sk, int level, int optname,
 		retval = sctp_getsockopt_enable_strreset(sk, len, optval,
 							 optlen);
 		break;
+	case SCTP_STREAM_SCHEDULER:
+		retval = sctp_getsockopt_scheduler(sk, len, optval,
+						   optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next v2 06/10] sctp: introduce stream scheduler foundations
From: Marcelo Ricardo Leitner @ 2017-10-03 22:20 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1507069005.git.marcelo.leitner@gmail.com>

This patch introduces the hooks necessary to do stream scheduling, as
per RFC Draft ndata.  It also introduces the first scheduler, which is
what we do today but now factored out: first come first served (FCFS).

With stream scheduling now we have to track which chunk was enqueued on
which stream and be able to select another other than the in front of
the main outqueue. So we introduce a list on sctp_stream_out_ext
structure for this purpose.

We reuse sctp_chunk->transmitted_list space for the list above, as the
chunk cannot belong to the two lists at the same time. By using the
union in there, we can have distinct names for these moments.

sctp_sched_ops are the operations expected to be implemented by each
scheduler. The dequeueing is a bit particular to this implementation but
it is to match how we dequeue packets today. We first dequeue and then
check if it fits the packet and if not, we requeue it at head. Thus why
we don't have a peek operation but have dequeue_done instead, which is
called once the chunk can be safely considered as transmitted.

The check removed from sctp_outq_flush is now performed by
sctp_stream_outq_migrate, which is only called during assoc setup.
(sctp_sendmsg() also checks for it)

The only operation that is foreseen but not yet added here is a way to
signalize that a new packet is starting or that the packet is done, for
round robin scheduler per packet, but is intentionally left to the
patch that actually implements it.

Support for I-DATA chunks, also described in this RFC, with user message
interleaving is straightforward as it just requires the schedulers to
probe for the feature and ignore datamsg boundaries when dequeueing.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/sctp/stream_sched.h |  72 +++++++++++
 include/net/sctp/structs.h      |  15 ++-
 include/uapi/linux/sctp.h       |   6 +
 net/sctp/Makefile               |   2 +-
 net/sctp/outqueue.c             |  59 +++++----
 net/sctp/sm_sideeffect.c        |   3 +
 net/sctp/stream.c               |  88 +++++++++++--
 net/sctp/stream_sched.c         | 270 ++++++++++++++++++++++++++++++++++++++++
 8 files changed, 477 insertions(+), 38 deletions(-)
 create mode 100644 include/net/sctp/stream_sched.h
 create mode 100644 net/sctp/stream_sched.c

diff --git a/include/net/sctp/stream_sched.h b/include/net/sctp/stream_sched.h
new file mode 100644
index 0000000000000000000000000000000000000000..c676550a4c7dd0ea27ac0e14437d0a2b451ef499
--- /dev/null
+++ b/include/net/sctp/stream_sched.h
@@ -0,0 +1,72 @@
+/* SCTP kernel implementation
+ * (C) Copyright Red Hat Inc. 2017
+ *
+ * These are definitions used by the stream schedulers, defined in RFC
+ * draft ndata (https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-11)
+ *
+ * This SCTP implementation is free software;
+ * you can redistribute it and/or modify it under the terms of
+ * the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This SCTP implementation  is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; without even the implied
+ *                 ************************
+ * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GNU CC; see the file COPYING.  If not, see
+ * <http://www.gnu.org/licenses/>.
+ *
+ * Please send any bug reports or fixes you make to the
+ * email addresses:
+ *    lksctp developers <linux-sctp@vger.kernel.org>
+ *
+ * Written or modified by:
+ *   Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
+ */
+
+#ifndef __sctp_stream_sched_h__
+#define __sctp_stream_sched_h__
+
+struct sctp_sched_ops {
+	/* Property handling for a given stream */
+	int (*set)(struct sctp_stream *stream, __u16 sid, __u16 value,
+		   gfp_t gfp);
+	int (*get)(struct sctp_stream *stream, __u16 sid, __u16 *value);
+
+	/* Init the specific scheduler */
+	int (*init)(struct sctp_stream *stream);
+	/* Init a stream */
+	int (*init_sid)(struct sctp_stream *stream, __u16 sid, gfp_t gfp);
+	/* Frees the entire thing */
+	void (*free)(struct sctp_stream *stream);
+
+	/* Enqueue a chunk */
+	void (*enqueue)(struct sctp_outq *q, struct sctp_datamsg *msg);
+	/* Dequeue a chunk */
+	struct sctp_chunk *(*dequeue)(struct sctp_outq *q);
+	/* Called only if the chunk fit the packet */
+	void (*dequeue_done)(struct sctp_outq *q, struct sctp_chunk *chunk);
+	/* Sched all chunks already enqueued */
+	void (*sched_all)(struct sctp_stream *steam);
+	/* Unched all chunks already enqueued */
+	void (*unsched_all)(struct sctp_stream *steam);
+};
+
+int sctp_sched_set_sched(struct sctp_association *asoc,
+			 enum sctp_sched_type sched);
+int sctp_sched_get_sched(struct sctp_association *asoc);
+int sctp_sched_set_value(struct sctp_association *asoc, __u16 sid,
+			 __u16 value, gfp_t gfp);
+int sctp_sched_get_value(struct sctp_association *asoc, __u16 sid,
+			 __u16 *value);
+void sctp_sched_dequeue_done(struct sctp_outq *q, struct sctp_chunk *ch);
+
+void sctp_sched_dequeue_common(struct sctp_outq *q, struct sctp_chunk *ch);
+int sctp_sched_init_sid(struct sctp_stream *stream, __u16 sid, gfp_t gfp);
+struct sctp_sched_ops *sctp_sched_ops_from_stream(struct sctp_stream *stream);
+
+#endif /* __sctp_stream_sched_h__ */
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index c48f7999fe9b80c5b5e41910a3608059b94140a7..3c22a30fd71b4ef87419a77cf69b00807a5986bb 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -84,7 +84,6 @@ struct sctp_ulpq;
 struct sctp_ep_common;
 struct crypto_shash;
 struct sctp_stream;
-struct sctp_stream_out;
 
 
 #include <net/sctp/tsnmap.h>
@@ -531,8 +530,12 @@ struct sctp_chunk {
 	/* How many times this chunk have been sent, for prsctp RTX policy */
 	int sent_count;
 
-	/* This is our link to the per-transport transmitted list.  */
-	struct list_head transmitted_list;
+	union {
+		/* This is our link to the per-transport transmitted list.  */
+		struct list_head transmitted_list;
+		/* List in specific stream outq */
+		struct list_head stream_list;
+	};
 
 	/* This field is used by chunks that hold fragmented data.
 	 * For the first fragment this is the list that holds the rest of
@@ -1019,6 +1022,9 @@ struct sctp_outq {
 	/* Data pending that has never been transmitted.  */
 	struct list_head out_chunk_list;
 
+	/* Stream scheduler being used */
+	struct sctp_sched_ops *sched;
+
 	unsigned int out_qlen;	/* Total length of queued data chunks. */
 
 	/* Error of send failed, may used in SCTP_SEND_FAILED event. */
@@ -1325,6 +1331,7 @@ struct sctp_inithdr_host {
 struct sctp_stream_out_ext {
 	__u64 abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
 	__u64 abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
+	struct list_head outq; /* chunks enqueued by this stream */
 };
 
 struct sctp_stream_out {
@@ -1342,6 +1349,8 @@ struct sctp_stream {
 	struct sctp_stream_in *in;
 	__u16 outcnt;
 	__u16 incnt;
+	/* Current stream being sent, if any */
+	struct sctp_stream_out *out_curr;
 };
 
 #define SCTP_STREAM_CLOSED		0x00
diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 6217ff8500a1d818fd1002fbd6f81c0c11974665..4487e7625ddbd48be1868a8292a807ecd0a314bc 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -1088,4 +1088,10 @@ struct sctp_add_streams {
 	uint16_t sas_outstrms;
 };
 
+/* SCTP Stream schedulers */
+enum sctp_sched_type {
+	SCTP_SS_FCFS,
+	SCTP_SS_MAX = SCTP_SS_FCFS
+};
+
 #endif /* _UAPI_SCTP_H */
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index 70f1b570bab9764d692f1c2e605d76d056cda2cd..0f6e6d1d69fd336b4a99f896851b0120f9a0d1e0 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -12,7 +12,7 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
 	  inqueue.o outqueue.o ulpqueue.o \
 	  tsnmap.o bind_addr.o socket.o primitive.o \
 	  output.o input.o debug.o stream.o auth.o \
-	  offload.o
+	  offload.o stream_sched.o
 
 sctp_probe-y := probe.o
 
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 746b07b7937d8730824b9e09917d947aa7863ec6..4db012aa25f7a042f063bc17b56270effebc6cc6 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -50,6 +50,7 @@
 
 #include <net/sctp/sctp.h>
 #include <net/sctp/sm.h>
+#include <net/sctp/stream_sched.h>
 
 /* Declare internal functions here.  */
 static int sctp_acked(struct sctp_sackhdr *sack, __u32 tsn);
@@ -72,32 +73,38 @@ static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp);
 
 /* Add data to the front of the queue. */
 static inline void sctp_outq_head_data(struct sctp_outq *q,
-					struct sctp_chunk *ch)
+				       struct sctp_chunk *ch)
 {
+	struct sctp_stream_out_ext *oute;
+	__u16 stream;
+
 	list_add(&ch->list, &q->out_chunk_list);
 	q->out_qlen += ch->skb->len;
+
+	stream = sctp_chunk_stream_no(ch);
+	oute = q->asoc->stream.out[stream].ext;
+	list_add(&ch->stream_list, &oute->outq);
 }
 
 /* Take data from the front of the queue. */
 static inline struct sctp_chunk *sctp_outq_dequeue_data(struct sctp_outq *q)
 {
-	struct sctp_chunk *ch = NULL;
-
-	if (!list_empty(&q->out_chunk_list)) {
-		struct list_head *entry = q->out_chunk_list.next;
-
-		ch = list_entry(entry, struct sctp_chunk, list);
-		list_del_init(entry);
-		q->out_qlen -= ch->skb->len;
-	}
-	return ch;
+	return q->sched->dequeue(q);
 }
+
 /* Add data chunk to the end of the queue. */
 static inline void sctp_outq_tail_data(struct sctp_outq *q,
 				       struct sctp_chunk *ch)
 {
+	struct sctp_stream_out_ext *oute;
+	__u16 stream;
+
 	list_add_tail(&ch->list, &q->out_chunk_list);
 	q->out_qlen += ch->skb->len;
+
+	stream = sctp_chunk_stream_no(ch);
+	oute = q->asoc->stream.out[stream].ext;
+	list_add_tail(&ch->stream_list, &oute->outq);
 }
 
 /*
@@ -207,6 +214,7 @@ void sctp_outq_init(struct sctp_association *asoc, struct sctp_outq *q)
 	INIT_LIST_HEAD(&q->retransmit);
 	INIT_LIST_HEAD(&q->sacked);
 	INIT_LIST_HEAD(&q->abandoned);
+	sctp_sched_set_sched(asoc, SCTP_SS_FCFS);
 }
 
 /* Free the outqueue structure and any related pending chunks.
@@ -258,6 +266,7 @@ static void __sctp_outq_teardown(struct sctp_outq *q)
 
 	/* Throw away any leftover data chunks. */
 	while ((chunk = sctp_outq_dequeue_data(q)) != NULL) {
+		sctp_sched_dequeue_done(q, chunk);
 
 		/* Mark as send failure. */
 		sctp_chunk_fail(chunk, q->error);
@@ -391,13 +400,14 @@ static int sctp_prsctp_prune_unsent(struct sctp_association *asoc,
 	struct sctp_outq *q = &asoc->outqueue;
 	struct sctp_chunk *chk, *temp;
 
+	q->sched->unsched_all(&asoc->stream);
+
 	list_for_each_entry_safe(chk, temp, &q->out_chunk_list, list) {
 		if (!SCTP_PR_PRIO_ENABLED(chk->sinfo.sinfo_flags) ||
 		    chk->sinfo.sinfo_timetolive <= sinfo->sinfo_timetolive)
 			continue;
 
-		list_del_init(&chk->list);
-		q->out_qlen -= chk->skb->len;
+		sctp_sched_dequeue_common(q, chk);
 		asoc->sent_cnt_removable--;
 		asoc->abandoned_unsent[SCTP_PR_INDEX(PRIO)]++;
 		if (chk->sinfo.sinfo_stream < asoc->stream.outcnt) {
@@ -415,6 +425,8 @@ static int sctp_prsctp_prune_unsent(struct sctp_association *asoc,
 			break;
 	}
 
+	q->sched->sched_all(&asoc->stream);
+
 	return msg_len;
 }
 
@@ -1033,22 +1045,9 @@ static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
 		while ((chunk = sctp_outq_dequeue_data(q)) != NULL) {
 			__u32 sid = ntohs(chunk->subh.data_hdr->stream);
 
-			/* RFC 2960 6.5 Every DATA chunk MUST carry a valid
-			 * stream identifier.
-			 */
-			if (chunk->sinfo.sinfo_stream >= asoc->stream.outcnt) {
-
-				/* Mark as failed send. */
-				sctp_chunk_fail(chunk, SCTP_ERROR_INV_STRM);
-				if (asoc->peer.prsctp_capable &&
-				    SCTP_PR_PRIO_ENABLED(chunk->sinfo.sinfo_flags))
-					asoc->sent_cnt_removable--;
-				sctp_chunk_free(chunk);
-				continue;
-			}
-
 			/* Has this chunk expired? */
 			if (sctp_chunk_abandoned(chunk)) {
+				sctp_sched_dequeue_done(q, chunk);
 				sctp_chunk_fail(chunk, 0);
 				sctp_chunk_free(chunk);
 				continue;
@@ -1070,6 +1069,7 @@ static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
 				new_transport = asoc->peer.active_path;
 			if (new_transport->state == SCTP_UNCONFIRMED) {
 				WARN_ONCE(1, "Attempt to send packet on unconfirmed path.");
+				sctp_sched_dequeue_done(q, chunk);
 				sctp_chunk_fail(chunk, 0);
 				sctp_chunk_free(chunk);
 				continue;
@@ -1133,6 +1133,11 @@ static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
 				else
 					asoc->stats.oodchunks++;
 
+				/* Only now it's safe to consider this
+				 * chunk as sent, sched-wise.
+				 */
+				sctp_sched_dequeue_done(q, chunk);
+
 				break;
 
 			default:
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index e6a2974e020e1a4232d94e6c2933eebff5f8acb4..402bfbb888cda53248dd192d3756a2f4db1d2a7f 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -50,6 +50,7 @@
 #include <net/sock.h>
 #include <net/sctp/sctp.h>
 #include <net/sctp/sm.h>
+#include <net/sctp/stream_sched.h>
 
 static int sctp_cmd_interpreter(enum sctp_event event_type,
 				union sctp_subtype subtype,
@@ -1089,6 +1090,8 @@ static void sctp_cmd_send_msg(struct sctp_association *asoc,
 
 	list_for_each_entry(chunk, &msg->chunks, frag_list)
 		sctp_outq_tail(&asoc->outqueue, chunk, gfp);
+
+	asoc->outqueue.sched->enqueue(&asoc->outqueue, msg);
 }
 
 
diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 055ca25bbc91bf932db8048c72a1b11cc2214942..5ea33a2c453b4272c5c22fa61e8e8bec06001f8b 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -32,8 +32,61 @@
  *    Xin Long <lucien.xin@gmail.com>
  */
 
+#include <linux/list.h>
 #include <net/sctp/sctp.h>
 #include <net/sctp/sm.h>
+#include <net/sctp/stream_sched.h>
+
+/* Migrates chunks from stream queues to new stream queues if needed,
+ * but not across associations. Also, removes those chunks to streams
+ * higher than the new max.
+ */
+static void sctp_stream_outq_migrate(struct sctp_stream *stream,
+				     struct sctp_stream *new, __u16 outcnt)
+{
+	struct sctp_association *asoc;
+	struct sctp_chunk *ch, *temp;
+	struct sctp_outq *outq;
+	int i;
+
+	asoc = container_of(stream, struct sctp_association, stream);
+	outq = &asoc->outqueue;
+
+	list_for_each_entry_safe(ch, temp, &outq->out_chunk_list, list) {
+		__u16 sid = sctp_chunk_stream_no(ch);
+
+		if (sid < outcnt)
+			continue;
+
+		sctp_sched_dequeue_common(outq, ch);
+		/* No need to call dequeue_done here because
+		 * the chunks are not scheduled by now.
+		 */
+
+		/* Mark as failed send. */
+		sctp_chunk_fail(ch, SCTP_ERROR_INV_STRM);
+		if (asoc->peer.prsctp_capable &&
+		    SCTP_PR_PRIO_ENABLED(ch->sinfo.sinfo_flags))
+			asoc->sent_cnt_removable--;
+
+		sctp_chunk_free(ch);
+	}
+
+	if (new) {
+		/* Here we actually move the old ext stuff into the new
+		 * buffer, because we want to keep it. Then
+		 * sctp_stream_update will swap ->out pointers.
+		 */
+		for (i = 0; i < outcnt; i++) {
+			kfree(new->out[i].ext);
+			new->out[i].ext = stream->out[i].ext;
+			stream->out[i].ext = NULL;
+		}
+	}
+
+	for (i = outcnt; i < stream->outcnt; i++)
+		kfree(stream->out[i].ext);
+}
 
 static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
 				 gfp_t gfp)
@@ -87,7 +140,8 @@ static int sctp_stream_alloc_in(struct sctp_stream *stream, __u16 incnt,
 int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 		     gfp_t gfp)
 {
-	int i;
+	struct sctp_sched_ops *sched = sctp_sched_ops_from_stream(stream);
+	int i, ret = 0;
 
 	gfp |= __GFP_NOWARN;
 
@@ -97,6 +151,11 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 	if (outcnt == stream->outcnt)
 		goto in;
 
+	/* Filter out chunks queued on streams that won't exist anymore */
+	sched->unsched_all(stream);
+	sctp_stream_outq_migrate(stream, NULL, outcnt);
+	sched->sched_all(stream);
+
 	i = sctp_stream_alloc_out(stream, outcnt, gfp);
 	if (i)
 		return i;
@@ -105,20 +164,27 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 	for (i = 0; i < stream->outcnt; i++)
 		stream->out[i].state = SCTP_STREAM_OPEN;
 
+	sched->init(stream);
+
 in:
 	if (!incnt)
-		return 0;
+		goto out;
 
 	i = sctp_stream_alloc_in(stream, incnt, gfp);
 	if (i) {
-		kfree(stream->out);
-		stream->out = NULL;
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto free;
 	}
 
 	stream->incnt = incnt;
+	goto out;
 
-	return 0;
+free:
+	sched->free(stream);
+	kfree(stream->out);
+	stream->out = NULL;
+out:
+	return ret;
 }
 
 int sctp_stream_init_ext(struct sctp_stream *stream, __u16 sid)
@@ -130,13 +196,15 @@ int sctp_stream_init_ext(struct sctp_stream *stream, __u16 sid)
 		return -ENOMEM;
 	stream->out[sid].ext = soute;
 
-	return 0;
+	return sctp_sched_init_sid(stream, sid, GFP_KERNEL);
 }
 
 void sctp_stream_free(struct sctp_stream *stream)
 {
+	struct sctp_sched_ops *sched = sctp_sched_ops_from_stream(stream);
 	int i;
 
+	sched->free(stream);
 	for (i = 0; i < stream->outcnt; i++)
 		kfree(stream->out[i].ext);
 	kfree(stream->out);
@@ -156,6 +224,10 @@ void sctp_stream_clear(struct sctp_stream *stream)
 
 void sctp_stream_update(struct sctp_stream *stream, struct sctp_stream *new)
 {
+	struct sctp_sched_ops *sched = sctp_sched_ops_from_stream(stream);
+
+	sched->unsched_all(stream);
+	sctp_stream_outq_migrate(stream, new, new->outcnt);
 	sctp_stream_free(stream);
 
 	stream->out = new->out;
@@ -163,6 +235,8 @@ void sctp_stream_update(struct sctp_stream *stream, struct sctp_stream *new)
 	stream->outcnt = new->outcnt;
 	stream->incnt  = new->incnt;
 
+	sched->sched_all(stream);
+
 	new->out = NULL;
 	new->in  = NULL;
 }
diff --git a/net/sctp/stream_sched.c b/net/sctp/stream_sched.c
new file mode 100644
index 0000000000000000000000000000000000000000..40a9a9de2b98a56786a4c8585f5ad514be9189af
--- /dev/null
+++ b/net/sctp/stream_sched.c
@@ -0,0 +1,270 @@
+/* SCTP kernel implementation
+ * (C) Copyright Red Hat Inc. 2017
+ *
+ * This file is part of the SCTP kernel implementation
+ *
+ * These functions manipulate sctp stream queue/scheduling.
+ *
+ * This SCTP implementation is free software;
+ * you can redistribute it and/or modify it under the terms of
+ * the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This SCTP implementation is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; without even the implied
+ *                 ************************
+ * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GNU CC; see the file COPYING.  If not, see
+ * <http://www.gnu.org/licenses/>.
+ *
+ * Please send any bug reports or fixes you make to the
+ * email addresched(es):
+ *    lksctp developers <linux-sctp@vger.kernel.org>
+ *
+ * Written or modified by:
+ *    Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
+ */
+
+#include <linux/list.h>
+#include <net/sctp/sctp.h>
+#include <net/sctp/sm.h>
+#include <net/sctp/stream_sched.h>
+
+/* First Come First Serve (a.k.a. FIFO)
+ * RFC DRAFT ndata Section 3.1
+ */
+static int sctp_sched_fcfs_set(struct sctp_stream *stream, __u16 sid,
+			       __u16 value, gfp_t gfp)
+{
+	return 0;
+}
+
+static int sctp_sched_fcfs_get(struct sctp_stream *stream, __u16 sid,
+			       __u16 *value)
+{
+	*value = 0;
+	return 0;
+}
+
+static int sctp_sched_fcfs_init(struct sctp_stream *stream)
+{
+	return 0;
+}
+
+static int sctp_sched_fcfs_init_sid(struct sctp_stream *stream, __u16 sid,
+				    gfp_t gfp)
+{
+	return 0;
+}
+
+static void sctp_sched_fcfs_free(struct sctp_stream *stream)
+{
+}
+
+static void sctp_sched_fcfs_enqueue(struct sctp_outq *q,
+				    struct sctp_datamsg *msg)
+{
+}
+
+static struct sctp_chunk *sctp_sched_fcfs_dequeue(struct sctp_outq *q)
+{
+	struct sctp_stream *stream = &q->asoc->stream;
+	struct sctp_chunk *ch = NULL;
+	struct list_head *entry;
+
+	if (list_empty(&q->out_chunk_list))
+		goto out;
+
+	if (stream->out_curr) {
+		ch = list_entry(stream->out_curr->ext->outq.next,
+				struct sctp_chunk, stream_list);
+	} else {
+		entry = q->out_chunk_list.next;
+		ch = list_entry(entry, struct sctp_chunk, list);
+	}
+
+	sctp_sched_dequeue_common(q, ch);
+
+out:
+	return ch;
+}
+
+static void sctp_sched_fcfs_dequeue_done(struct sctp_outq *q,
+					 struct sctp_chunk *chunk)
+{
+}
+
+static void sctp_sched_fcfs_sched_all(struct sctp_stream *stream)
+{
+}
+
+static void sctp_sched_fcfs_unsched_all(struct sctp_stream *stream)
+{
+}
+
+static struct sctp_sched_ops sctp_sched_fcfs = {
+	.set = sctp_sched_fcfs_set,
+	.get = sctp_sched_fcfs_get,
+	.init = sctp_sched_fcfs_init,
+	.init_sid = sctp_sched_fcfs_init_sid,
+	.free = sctp_sched_fcfs_free,
+	.enqueue = sctp_sched_fcfs_enqueue,
+	.dequeue = sctp_sched_fcfs_dequeue,
+	.dequeue_done = sctp_sched_fcfs_dequeue_done,
+	.sched_all = sctp_sched_fcfs_sched_all,
+	.unsched_all = sctp_sched_fcfs_unsched_all,
+};
+
+/* API to other parts of the stack */
+
+struct sctp_sched_ops *sctp_sched_ops[] = {
+	&sctp_sched_fcfs,
+};
+
+int sctp_sched_set_sched(struct sctp_association *asoc,
+			 enum sctp_sched_type sched)
+{
+	struct sctp_sched_ops *n = sctp_sched_ops[sched];
+	struct sctp_sched_ops *old = asoc->outqueue.sched;
+	struct sctp_datamsg *msg = NULL;
+	struct sctp_chunk *ch;
+	int i, ret = 0;
+
+	if (old == n)
+		return ret;
+
+	if (sched > SCTP_SS_MAX)
+		return -EINVAL;
+
+	if (old) {
+		old->free(&asoc->stream);
+
+		/* Give the next scheduler a clean slate. */
+		for (i = 0; i < asoc->stream.outcnt; i++) {
+			void *p = asoc->stream.out[i].ext;
+
+			if (!p)
+				continue;
+
+			p += offsetofend(struct sctp_stream_out_ext, outq);
+			memset(p, 0, sizeof(struct sctp_stream_out_ext) -
+				     offsetofend(struct sctp_stream_out_ext, outq));
+		}
+	}
+
+	asoc->outqueue.sched = n;
+	n->init(&asoc->stream);
+	for (i = 0; i < asoc->stream.outcnt; i++) {
+		if (!asoc->stream.out[i].ext)
+			continue;
+
+		ret = n->init_sid(&asoc->stream, i, GFP_KERNEL);
+		if (ret)
+			goto err;
+	}
+
+	/* We have to requeue all chunks already queued. */
+	list_for_each_entry(ch, &asoc->outqueue.out_chunk_list, list) {
+		if (ch->msg == msg)
+			continue;
+		msg = ch->msg;
+		n->enqueue(&asoc->outqueue, msg);
+	}
+
+	return ret;
+
+err:
+	n->free(&asoc->stream);
+	asoc->outqueue.sched = &sctp_sched_fcfs; /* Always safe */
+
+	return ret;
+}
+
+int sctp_sched_get_sched(struct sctp_association *asoc)
+{
+	int i;
+
+	for (i = 0; i <= SCTP_SS_MAX; i++)
+		if (asoc->outqueue.sched == sctp_sched_ops[i])
+			return i;
+
+	return 0;
+}
+
+int sctp_sched_set_value(struct sctp_association *asoc, __u16 sid,
+			 __u16 value, gfp_t gfp)
+{
+	if (sid >= asoc->stream.outcnt)
+		return -EINVAL;
+
+	if (!asoc->stream.out[sid].ext) {
+		int ret;
+
+		ret = sctp_stream_init_ext(&asoc->stream, sid);
+		if (ret)
+			return ret;
+	}
+
+	return asoc->outqueue.sched->set(&asoc->stream, sid, value, gfp);
+}
+
+int sctp_sched_get_value(struct sctp_association *asoc, __u16 sid,
+			 __u16 *value)
+{
+	if (sid >= asoc->stream.outcnt)
+		return -EINVAL;
+
+	if (!asoc->stream.out[sid].ext)
+		return 0;
+
+	return asoc->outqueue.sched->get(&asoc->stream, sid, value);
+}
+
+void sctp_sched_dequeue_done(struct sctp_outq *q, struct sctp_chunk *ch)
+{
+	if (!list_is_last(&ch->frag_list, &ch->msg->chunks)) {
+		struct sctp_stream_out *sout;
+		__u16 sid;
+
+		/* datamsg is not finish, so save it as current one,
+		 * in case application switch scheduler or a higher
+		 * priority stream comes in.
+		 */
+		sid = sctp_chunk_stream_no(ch);
+		sout = &q->asoc->stream.out[sid];
+		q->asoc->stream.out_curr = sout;
+		return;
+	}
+
+	q->asoc->stream.out_curr = NULL;
+	q->sched->dequeue_done(q, ch);
+}
+
+/* Auxiliary functions for the schedulers */
+void sctp_sched_dequeue_common(struct sctp_outq *q, struct sctp_chunk *ch)
+{
+	list_del_init(&ch->list);
+	list_del_init(&ch->stream_list);
+	q->out_qlen -= ch->skb->len;
+}
+
+int sctp_sched_init_sid(struct sctp_stream *stream, __u16 sid, gfp_t gfp)
+{
+	struct sctp_sched_ops *sched = sctp_sched_ops_from_stream(stream);
+
+	INIT_LIST_HEAD(&stream->out[sid].ext->outq);
+	return sched->init_sid(stream, sid, gfp);
+}
+
+struct sctp_sched_ops *sctp_sched_ops_from_stream(struct sctp_stream *stream)
+{
+	struct sctp_association *asoc;
+
+	asoc = container_of(stream, struct sctp_association, stream);
+
+	return asoc->outqueue.sched;
+}
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next v2 05/10] sctp: introduce sctp_chunk_stream_no
From: Marcelo Ricardo Leitner @ 2017-10-03 22:20 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1507069005.git.marcelo.leitner@gmail.com>

Add a helper to fetch the stream number from a given chunk.

Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/sctp/structs.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 9b2b30b3ba4dfd10c24c3e06ed80779180a06baf..c48f7999fe9b80c5b5e41910a3608059b94140a7 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -642,6 +642,11 @@ void sctp_init_addrs(struct sctp_chunk *, union sctp_addr *,
 		     union sctp_addr *);
 const union sctp_addr *sctp_source(const struct sctp_chunk *chunk);
 
+static inline __u16 sctp_chunk_stream_no(struct sctp_chunk *ch)
+{
+	return ntohs(ch->subh.data_hdr->stream);
+}
+
 enum {
 	SCTP_ADDR_NEW,		/* new address added to assoc/ep */
 	SCTP_ADDR_SRC,		/* address can be used as source */
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next v2 04/10] sctp: introduce struct sctp_stream_out_ext
From: Marcelo Ricardo Leitner @ 2017-10-03 22:20 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1507069005.git.marcelo.leitner@gmail.com>

With the stream schedulers, sctp_stream_out will become too big to be
allocated by kmalloc and as we need to allocate with BH disabled, we
cannot use __vmalloc in sctp_stream_init().

This patch moves out the stats from sctp_stream_out to
sctp_stream_out_ext, which will be allocated only when the application
tries to sendmsg something on it.

Just the introduction of sctp_stream_out_ext would already fix the issue
described above by splitting the allocation in two. Moving the stats
to it also reduces the pressure on the allocator as we will ask for less
memory atomically when creating the socket and we will use GFP_KERNEL
later.

Then, for stream schedulers, we will just use sctp_stream_out_ext.

Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/sctp/structs.h | 10 ++++++++--
 net/sctp/chunk.c           |  6 +++---
 net/sctp/outqueue.c        |  4 ++--
 net/sctp/socket.c          | 27 +++++++++++++++++++++------
 net/sctp/stream.c          | 16 ++++++++++++++++
 5 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 0477945de1a3cf5c27348e99d9a30e02c491d1de..9b2b30b3ba4dfd10c24c3e06ed80779180a06baf 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -84,6 +84,7 @@ struct sctp_ulpq;
 struct sctp_ep_common;
 struct crypto_shash;
 struct sctp_stream;
+struct sctp_stream_out;
 
 
 #include <net/sctp/tsnmap.h>
@@ -380,6 +381,7 @@ struct sctp_sender_hb_info {
 
 int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 		     gfp_t gfp);
+int sctp_stream_init_ext(struct sctp_stream *stream, __u16 sid);
 void sctp_stream_free(struct sctp_stream *stream);
 void sctp_stream_clear(struct sctp_stream *stream);
 void sctp_stream_update(struct sctp_stream *stream, struct sctp_stream *new);
@@ -1315,11 +1317,15 @@ struct sctp_inithdr_host {
 	__u32 initial_tsn;
 };
 
+struct sctp_stream_out_ext {
+	__u64 abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
+	__u64 abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
+};
+
 struct sctp_stream_out {
 	__u16	ssn;
 	__u8	state;
-	__u64	abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
-	__u64	abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
+	struct sctp_stream_out_ext *ext;
 };
 
 struct sctp_stream_in {
diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index 3afac275ee82dbec825dd71378dffe69a53718a7..7b261afc47b9d709fdd780a93aaba874f35d79be 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -311,10 +311,10 @@ int sctp_chunk_abandoned(struct sctp_chunk *chunk)
 
 		if (chunk->sent_count) {
 			chunk->asoc->abandoned_sent[SCTP_PR_INDEX(TTL)]++;
-			streamout->abandoned_sent[SCTP_PR_INDEX(TTL)]++;
+			streamout->ext->abandoned_sent[SCTP_PR_INDEX(TTL)]++;
 		} else {
 			chunk->asoc->abandoned_unsent[SCTP_PR_INDEX(TTL)]++;
-			streamout->abandoned_unsent[SCTP_PR_INDEX(TTL)]++;
+			streamout->ext->abandoned_unsent[SCTP_PR_INDEX(TTL)]++;
 		}
 		return 1;
 	} else if (SCTP_PR_RTX_ENABLED(chunk->sinfo.sinfo_flags) &&
@@ -323,7 +323,7 @@ int sctp_chunk_abandoned(struct sctp_chunk *chunk)
 			&chunk->asoc->stream.out[chunk->sinfo.sinfo_stream];
 
 		chunk->asoc->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
-		streamout->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
+		streamout->ext->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
 		return 1;
 	} else if (!SCTP_PR_POLICY(chunk->sinfo.sinfo_flags) &&
 		   chunk->msg->expires_at &&
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 2966ff400755fe93e3658e09d3bb44b9d7d19d2e..746b07b7937d8730824b9e09917d947aa7863ec6 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -366,7 +366,7 @@ static int sctp_prsctp_prune_sent(struct sctp_association *asoc,
 		streamout = &asoc->stream.out[chk->sinfo.sinfo_stream];
 		asoc->sent_cnt_removable--;
 		asoc->abandoned_sent[SCTP_PR_INDEX(PRIO)]++;
-		streamout->abandoned_sent[SCTP_PR_INDEX(PRIO)]++;
+		streamout->ext->abandoned_sent[SCTP_PR_INDEX(PRIO)]++;
 
 		if (!chk->tsn_gap_acked) {
 			if (chk->transport)
@@ -404,7 +404,7 @@ static int sctp_prsctp_prune_unsent(struct sctp_association *asoc,
 			struct sctp_stream_out *streamout =
 				&asoc->stream.out[chk->sinfo.sinfo_stream];
 
-			streamout->abandoned_unsent[SCTP_PR_INDEX(PRIO)]++;
+			streamout->ext->abandoned_unsent[SCTP_PR_INDEX(PRIO)]++;
 		}
 
 		msg_len -= SCTP_DATA_SNDSIZE(chk) +
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index d4730ada7f3233367be7a0e3bb10e286a25602c8..d207734326b085e60625e4333f74221481114892 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1927,6 +1927,13 @@ static int sctp_sendmsg(struct sock *sk, struct msghdr *msg, size_t msg_len)
 		goto out_free;
 	}
 
+	/* Allocate sctp_stream_out_ext if not already done */
+	if (unlikely(!asoc->stream.out[sinfo->sinfo_stream].ext)) {
+		err = sctp_stream_init_ext(&asoc->stream, sinfo->sinfo_stream);
+		if (err)
+			goto out_free;
+	}
+
 	if (sctp_wspace(asoc) < msg_len)
 		sctp_prsctp_prune(asoc, sinfo, msg_len - sctp_wspace(asoc));
 
@@ -6645,7 +6652,7 @@ static int sctp_getsockopt_pr_streamstatus(struct sock *sk, int len,
 					   char __user *optval,
 					   int __user *optlen)
 {
-	struct sctp_stream_out *streamout;
+	struct sctp_stream_out_ext *streamoute;
 	struct sctp_association *asoc;
 	struct sctp_prstatus params;
 	int retval = -EINVAL;
@@ -6668,21 +6675,29 @@ static int sctp_getsockopt_pr_streamstatus(struct sock *sk, int len,
 	if (!asoc || params.sprstat_sid >= asoc->stream.outcnt)
 		goto out;
 
-	streamout = &asoc->stream.out[params.sprstat_sid];
+	streamoute = asoc->stream.out[params.sprstat_sid].ext;
+	if (!streamoute) {
+		/* Not allocated yet, means all stats are 0 */
+		params.sprstat_abandoned_unsent = 0;
+		params.sprstat_abandoned_sent = 0;
+		retval = 0;
+		goto out;
+	}
+
 	if (policy == SCTP_PR_SCTP_NONE) {
 		params.sprstat_abandoned_unsent = 0;
 		params.sprstat_abandoned_sent = 0;
 		for (policy = 0; policy <= SCTP_PR_INDEX(MAX); policy++) {
 			params.sprstat_abandoned_unsent +=
-				streamout->abandoned_unsent[policy];
+				streamoute->abandoned_unsent[policy];
 			params.sprstat_abandoned_sent +=
-				streamout->abandoned_sent[policy];
+				streamoute->abandoned_sent[policy];
 		}
 	} else {
 		params.sprstat_abandoned_unsent =
-			streamout->abandoned_unsent[__SCTP_PR_INDEX(policy)];
+			streamoute->abandoned_unsent[__SCTP_PR_INDEX(policy)];
 		params.sprstat_abandoned_sent =
-			streamout->abandoned_sent[__SCTP_PR_INDEX(policy)];
+			streamoute->abandoned_sent[__SCTP_PR_INDEX(policy)];
 	}
 
 	if (put_user(len, optlen) || copy_to_user(optval, &params, len)) {
diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 952437d656cc71ad1c133a736c539eff9a8d80c2..055ca25bbc91bf932db8048c72a1b11cc2214942 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -121,8 +121,24 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 	return 0;
 }
 
+int sctp_stream_init_ext(struct sctp_stream *stream, __u16 sid)
+{
+	struct sctp_stream_out_ext *soute;
+
+	soute = kzalloc(sizeof(*soute), GFP_KERNEL);
+	if (!soute)
+		return -ENOMEM;
+	stream->out[sid].ext = soute;
+
+	return 0;
+}
+
 void sctp_stream_free(struct sctp_stream *stream)
 {
+	int i;
+
+	for (i = 0; i < stream->outcnt; i++)
+		kfree(stream->out[i].ext);
 	kfree(stream->out);
 	kfree(stream->in);
 }
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next v2 03/10] sctp: factor out stream->in allocation
From: Marcelo Ricardo Leitner @ 2017-10-03 22:20 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1507069005.git.marcelo.leitner@gmail.com>

There is 1 place allocating it and another reallocating. Move such
procedures to a common function.

v2: updated changelog

Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 net/sctp/stream.c | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 6d0e997d301f89e165367106c02e82f8a6c3a877..952437d656cc71ad1c133a736c539eff9a8d80c2 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -59,6 +59,31 @@ static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
 	return 0;
 }
 
+static int sctp_stream_alloc_in(struct sctp_stream *stream, __u16 incnt,
+				gfp_t gfp)
+{
+	struct sctp_stream_in *in;
+
+	in = kmalloc_array(incnt, sizeof(*stream->in), gfp);
+
+	if (!in)
+		return -ENOMEM;
+
+	if (stream->in) {
+		memcpy(in, stream->in, min(incnt, stream->incnt) *
+				       sizeof(*in));
+		kfree(stream->in);
+	}
+
+	if (incnt > stream->incnt)
+		memset(in + stream->incnt, 0,
+		       (incnt - stream->incnt) * sizeof(*in));
+
+	stream->in = in;
+
+	return 0;
+}
+
 int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 		     gfp_t gfp)
 {
@@ -84,8 +109,8 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 	if (!incnt)
 		return 0;
 
-	stream->in = kcalloc(incnt, sizeof(*stream->in), gfp);
-	if (!stream->in) {
+	i = sctp_stream_alloc_in(stream, incnt, gfp);
+	if (i) {
 		kfree(stream->out);
 		stream->out = NULL;
 		return -ENOMEM;
@@ -623,7 +648,6 @@ struct sctp_chunk *sctp_process_strreset_addstrm_out(
 	struct sctp_strreset_addstrm *addstrm = param.v;
 	struct sctp_stream *stream = &asoc->stream;
 	__u32 result = SCTP_STRRESET_DENIED;
-	struct sctp_stream_in *streamin;
 	__u32 request_seq, incnt;
 	__u16 in, i;
 
@@ -670,13 +694,9 @@ struct sctp_chunk *sctp_process_strreset_addstrm_out(
 	if (!in || incnt > SCTP_MAX_STREAM)
 		goto out;
 
-	streamin = krealloc(stream->in, incnt * sizeof(*streamin),
-			    GFP_ATOMIC);
-	if (!streamin)
+	if (sctp_stream_alloc_in(stream, incnt, GFP_ATOMIC))
 		goto out;
 
-	memset(streamin + stream->incnt, 0, in * sizeof(*streamin));
-	stream->in = streamin;
 	stream->incnt = incnt;
 
 	result = SCTP_STRRESET_PERFORMED;
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next v2 02/10] sctp: factor out stream->out allocation
From: Marcelo Ricardo Leitner @ 2017-10-03 22:20 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1507069005.git.marcelo.leitner@gmail.com>

There is 1 place allocating it and 2 other reallocating. Move such
procedures to a common function.

Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 net/sctp/stream.c | 52 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 20 deletions(-)

diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 1afa9555808390d5fc736727422d9700a3855613..6d0e997d301f89e165367106c02e82f8a6c3a877 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -35,6 +35,30 @@
 #include <net/sctp/sctp.h>
 #include <net/sctp/sm.h>
 
+static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
+				 gfp_t gfp)
+{
+	struct sctp_stream_out *out;
+
+	out = kmalloc_array(outcnt, sizeof(*out), gfp);
+	if (!out)
+		return -ENOMEM;
+
+	if (stream->out) {
+		memcpy(out, stream->out, min(outcnt, stream->outcnt) *
+					 sizeof(*out));
+		kfree(stream->out);
+	}
+
+	if (outcnt > stream->outcnt)
+		memset(out + stream->outcnt, 0,
+		       (outcnt - stream->outcnt) * sizeof(*out));
+
+	stream->out = out;
+
+	return 0;
+}
+
 int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 		     gfp_t gfp)
 {
@@ -48,11 +72,9 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 	if (outcnt == stream->outcnt)
 		goto in;
 
-	kfree(stream->out);
-
-	stream->out = kcalloc(outcnt, sizeof(*stream->out), gfp);
-	if (!stream->out)
-		return -ENOMEM;
+	i = sctp_stream_alloc_out(stream, outcnt, gfp);
+	if (i)
+		return i;
 
 	stream->outcnt = outcnt;
 	for (i = 0; i < stream->outcnt; i++)
@@ -276,15 +298,9 @@ int sctp_send_add_streams(struct sctp_association *asoc,
 	}
 
 	if (out) {
-		struct sctp_stream_out *streamout;
-
-		streamout = krealloc(stream->out, outcnt * sizeof(*streamout),
-				     GFP_KERNEL);
-		if (!streamout)
+		retval = sctp_stream_alloc_out(stream, outcnt, GFP_KERNEL);
+		if (retval)
 			goto out;
-
-		memset(streamout + stream->outcnt, 0, out * sizeof(*streamout));
-		stream->out = streamout;
 	}
 
 	chunk = sctp_make_strreset_addstrm(asoc, out, in);
@@ -682,10 +698,10 @@ struct sctp_chunk *sctp_process_strreset_addstrm_in(
 	struct sctp_strreset_addstrm *addstrm = param.v;
 	struct sctp_stream *stream = &asoc->stream;
 	__u32 result = SCTP_STRRESET_DENIED;
-	struct sctp_stream_out *streamout;
 	struct sctp_chunk *chunk = NULL;
 	__u32 request_seq, outcnt;
 	__u16 out, i;
+	int ret;
 
 	request_seq = ntohl(addstrm->request_seq);
 	if (TSN_lt(asoc->strreset_inseq, request_seq) ||
@@ -714,14 +730,10 @@ struct sctp_chunk *sctp_process_strreset_addstrm_in(
 	if (!out || outcnt > SCTP_MAX_STREAM)
 		goto out;
 
-	streamout = krealloc(stream->out, outcnt * sizeof(*streamout),
-			     GFP_ATOMIC);
-	if (!streamout)
+	ret = sctp_stream_alloc_out(stream, outcnt, GFP_ATOMIC);
+	if (ret)
 		goto out;
 
-	memset(streamout + stream->outcnt, 0, out * sizeof(*streamout));
-	stream->out = streamout;
-
 	chunk = sctp_make_strreset_addstrm(asoc, out, 0);
 	if (!chunk)
 		goto out;
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next v2 01/10] sctp: silence warns on sctp_stream_init allocations
From: Marcelo Ricardo Leitner @ 2017-10-03 22:20 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1507069005.git.marcelo.leitner@gmail.com>

As SCTP supports up to 65535 streams, that can lead to very large
allocations in sctp_stream_init(). As Xin Long noticed, systems with
small amounts of memory are more prone to not have enough memory and
dump warnings on dmesg initiated by user actions. Thus, silence them.

Also, if the reallocation of stream->out is not necessary, skip it and
keep the memory we already have.

Reported-by: Xin Long <lucien.xin@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 net/sctp/stream.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 63ea1550371493ec8863627c7a43f46a22f4a4c9..1afa9555808390d5fc736727422d9700a3855613 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -40,9 +40,14 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 {
 	int i;
 
+	gfp |= __GFP_NOWARN;
+
 	/* Initial stream->out size may be very big, so free it and alloc
-	 * a new one with new outcnt to save memory.
+	 * a new one with new outcnt to save memory if needed.
 	 */
+	if (outcnt == stream->outcnt)
+		goto in;
+
 	kfree(stream->out);
 
 	stream->out = kcalloc(outcnt, sizeof(*stream->out), gfp);
@@ -53,6 +58,7 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 	for (i = 0; i < stream->outcnt; i++)
 		stream->out[i].state = SCTP_STREAM_OPEN;
 
+in:
 	if (!incnt)
 		return 0;
 
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next v2 00/10] Introduce SCTP Stream Schedulers
From: Marcelo Ricardo Leitner @ 2017-10-03 22:20 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight

This patchset introduces the SCTP Stream Schedulers are defined by
https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13

It provides 3 schedulers at the moment: FCFS, Priority and Round Robin.
The other 3, Round Robin per packet, Fair Capacity and Weighted Fair
Capacity will be added later. More specifically, WFQ is required by
WebRTC Datachannels.

The draft also defines the idata chunk, allowing a usermsg to be
interrupted by another piece of idata from another stream. This patchset
*doesn't* include it. It will be posted later by Xin Long.  Its
integration with this patchset is very simple and it basically only
requires a tweak in sctp_sched_dequeue_done(), to ignore datamsg
boundaries.

The first 5 patches are a preparation for the next ones. The most
relevant patches are the 4th and 6th ones. More details are available on
each patch.

v2: changelog update on patch 3

Marcelo Ricardo Leitner (10):
  sctp: silence warns on sctp_stream_init allocations
  sctp: factor out stream->out allocation
  sctp: factor out stream->in allocation
  sctp: introduce struct sctp_stream_out_ext
  sctp: introduce sctp_chunk_stream_no
  sctp: introduce stream scheduler foundations
  sctp: add sockopt to get/set stream scheduler
  sctp: add sockopt to get/set stream scheduler parameters
  sctp: introduce priority based stream scheduler
  sctp: introduce round robin stream scheduler

 include/net/sctp/stream_sched.h |  72 +++++++++
 include/net/sctp/structs.h      |  63 +++++++-
 include/uapi/linux/sctp.h       |  16 ++
 net/sctp/Makefile               |   3 +-
 net/sctp/chunk.c                |   6 +-
 net/sctp/outqueue.c             |  63 ++++----
 net/sctp/sm_sideeffect.c        |   3 +
 net/sctp/socket.c               | 179 ++++++++++++++++++++-
 net/sctp/stream.c               | 196 +++++++++++++++++++----
 net/sctp/stream_sched.c         | 275 +++++++++++++++++++++++++++++++
 net/sctp/stream_sched_prio.c    | 347 ++++++++++++++++++++++++++++++++++++++++
 net/sctp/stream_sched_rr.c      | 201 +++++++++++++++++++++++
 12 files changed, 1347 insertions(+), 77 deletions(-)
 create mode 100644 include/net/sctp/stream_sched.h
 create mode 100644 net/sctp/stream_sched.c
 create mode 100644 net/sctp/stream_sched_prio.c
 create mode 100644 net/sctp/stream_sched_rr.c

-- 
2.13.5

^ permalink raw reply

* Re: [PATCH net] net: fib_rules: Fix fib_rules_ops->compare implementations to support exact match
From: David Miller @ 2017-10-03 21:54 UTC (permalink / raw)
  To: shmulik; +Cc: netdev, mateusz.bajorski, dsa, tgraf, shmulik.ladkani
In-Reply-To: <20170930085909.1103-1-shmulik@nsof.io>

From: Shmulik Ladkani <shmulik@nsof.io>
Date: Sat, 30 Sep 2017 11:59:09 +0300

> This leads to inconsistencies, depending on order of operations, e.g.:

I don't see any inconsistency.  When you insert using NLM_F_EXCL the
insertion fails if any existing rule matches or overlaps in any way
with the keys in the new rule.

Sorry I'm not going to apply this.

^ permalink raw reply

* Re: [PATCH] nfp: convert nfp_eth_set_bit_config() into a macro
From: Jakub Kicinski @ 2017-10-03 21:50 UTC (permalink / raw)
  To: Matthias Kaehlcke
  Cc: David S . Miller, Simon Horman, Dirk van der Merwe, oss-drivers,
	netdev, linux-kernel, Renato Golin, Manoj Gupta, Guenter Roeck,
	Doug Anderson
In-Reply-To: <20171003200546.165731-1-mka@chromium.org>

On Tue,  3 Oct 2017 13:05:46 -0700, Matthias Kaehlcke wrote:
> nfp_eth_set_bit_config() is marked as __always_inline to allow gcc to
> identify the 'mask' parameter as known to be constant at compile time,
> which is required to use the FIELD_GET() macro.
> 
> The forced inlining does the trick for gcc, but for kernel builds with
> clang it results in undefined symbols:
> 
> drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp_eth.o: In function
>   `__nfp_eth_set_aneg':
> drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp_eth.c:(.text+0x787):
>   undefined reference to `__compiletime_assert_492'
> drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp_eth.c:(.text+0x7b1):
>   undefined reference to `__compiletime_assert_496'
> 
> These __compiletime_assert_xyx() calls would have been optimized away if
> the compiler had seen 'mask' as a constant.
> 
> Convert nfp_eth_set_bit_config() into a macro, which allows both gcc and
> clang to identify 'mask' as a compile time constant.
> 
> Signed-off-by: Matthias Kaehlcke <mka@chromium.org>

:(

Is there no chance of fixing the constant propagation in the compiler?

^ permalink raw reply

* Re: [PATCH next] bonding: speed/duplex update at NETDEV_UP event
From: David Miller @ 2017-10-03 21:32 UTC (permalink / raw)
  To: mahesh; +Cc: j.vosburgh, andy, vfalico, netdev, maheshb
In-Reply-To: <20170928010349.8988-1-mahesh@bandewar.net>

From: Mahesh Bandewar <mahesh@bandewar.net>
Date: Wed, 27 Sep 2017 18:03:49 -0700

> From: Mahesh Bandewar <maheshb@google.com>
> 
> Some NIC drivers don't have correct speed/duplex settings at the
> time they send NETDEV_UP notification and that messes up the
> bonding state. Especially 802.3ad mode which is very sensitive
> to these settings. In the current implementation we invoke
> bond_update_speed_duplex() when we receive NETDEV_UP, however,
> ignore the return value. If the values we get are invalid
> (UNKNOWN), then slave gets removed from the aggregator with
> speed and duplex set to UNKNOWN while link is still marked as UP.
> 
> This patch fixes this scenario. Also 802.3ad mode is sensitive to
> these conditions while other modes are not, so making sure that it
> doesn't change the behavior for other modes.
> 
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>

Applied, thanks.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox