Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] bpf: validate bpf_func when BPF_JIT is enabled
From: Sami Tolvanen @ 2019-09-11 21:07 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Björn Töpel, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Kees Cook, Martin Lau, Song Liu,
	netdev@vger.kernel.org, bpf@vger.kernel.org,
	linux-kernel@vger.kernel.org, Jesper Dangaard Brouer
In-Reply-To: <87impzt4pu.fsf@toke.dk>

On Wed, Sep 11, 2019 at 5:09 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Björn Töpel <bjorn.topel@intel.com> writes:
> > I ran the "xdp_rxq_info" sample with and without Sami's patch:
>
> Thanks for doing this!

Yes, thanks for testing this Björn!

> Or (1/22998700 - 1/23923874) * 10**9 == 1.7 nanoseconds of overhead.
>
> I guess that is not *too* bad; but it's still chipping away at
> performance; anything we could do to lower the overhead?

The check is already rather minimal, but I could move this to a static
inline function to help ensure the compiler doesn't generate an
additional function call for this. I'm also fine with gating this
behind a separate config option, but I'm not sure if that's worth it.
Any thoughts?

Sami

^ permalink raw reply

* Re: [PATCH net] tcp: remove empty skb from write queue in error cases
From: Eric Dumazet @ 2019-09-11 20:59 UTC (permalink / raw)
  To: Christoph Paasch, Greg Kroah-Hartman
  Cc: David S . Miller, netdev, Soheil Hassas Yeganeh, Neal Cardwell,
	Eric Dumazet, Jason Baron, Vladimir Rutsky
In-Reply-To: <CALMXkpZfufqWhvd9F4kbtC18bFYCgNWrkEvL7Tw976i24f1EFw@mail.gmail.com>

On Wed, Sep 11, 2019 at 7:36 PM Christoph Paasch
<christoph.paasch@gmail.com> wrote:
>
> Hello,
>
> On Mon, Aug 26, 2019 at 11:04 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > Vladimir Rutsky reported stuck TCP sessions after memory pressure
> > events. Edge Trigger epoll() user would never receive an EPOLLOUT
> > notification allowing them to retry a sendmsg().
> >
> > Jason tested the case of sk_stream_alloc_skb() returning NULL,
> > but there are other paths that could lead both sendmsg() and sendpage()
> > to return -1 (EAGAIN), with an empty skb queued on the write queue.
> >
> > This patch makes sure we remove this empty skb so that
> > Jason code can detect that the queue is empty, and
> > call sk->sk_write_space(sk) accordingly.
> >
> > Fixes: ce5ec440994b ("tcp: ensure epoll edge trigger wakeup when write queue is empty")
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Jason Baron <jbaron@akamai.com>
> > Reported-by: Vladimir Rutsky <rutsky@google.com>
> > Cc: Soheil Hassas Yeganeh <soheil@google.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > ---
> >  net/ipv4/tcp.c | 30 ++++++++++++++++++++----------
> >  1 file changed, 20 insertions(+), 10 deletions(-)
>
> I got syzkaller complaining now on 4.14.143 with the following reproducer:
>
> # {Threaded:true Collide:true Repeat:true RepeatTimes:0 Procs:1
> Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false
> UseTmpDir:false EnableCgroups:false EnableNetdev:false ResetNet:false
> HandleSegv:false Repro:false Trace:false}
> r0 = socket$inet_tcp(0x2, 0x1, 0x0)
> setsockopt$inet_tcp_TCP_REPAIR(r0, 0x6, 0x13, &(0x7f0000000040)=0x1, 0x4)
> setsockopt$inet_tcp_TCP_REPAIR_QUEUE(r0, 0x6, 0x14, &(0x7f00000012c0)=0x2, 0x4)
> setsockopt$inet_tcp_int(r0, 0x6, 0x19, &(0x7f0000000000)=0x9, 0x4)
> setsockopt$inet_tcp_TCP_MD5SIG(r0, 0x6, 0xe,
> &(0x7f00000001c0)={@in={{0x2, 0x0, @empty}}, 0x0, 0x2, 0x0,
> "c157cf4809151e5e89cfd6d934fbe981ec8ff6afc252ccf486c325c7ff3d35f3a89412a5cb6430e169092617df2ba65bf0ab844572e4e7dd4ece8ec1de5ac1ccd870067b018cb3b1f05f2391d872b67d"},
> 0xd8)
> connect$inet(r0, &(0x7f0000000080)={0x2, 0x0, @dev={0xac, 0x14, 0x14,
> 0x1d}}, 0x10)
> sendto(r0, 0x0, 0x87, 0x0, 0x0, 0x391)
>
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: 0000 [#1] SMP KASAN PTI
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Modules linked in:
> CPU: 1 PID: 2529 Comm: syz-executor709 Not tainted 4.14.143 #5
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
> task: ffff8880677fdc00 task.stack: ffff8880642b0000
> RIP: 0010:tcp_sendmsg_locked+0x6b4/0x4390 net/ipv4/tcp.c:1350
> RSP: 0018:ffff8880642bf718 EFLAGS: 00010206
> RAX: 0000000000000014 RBX: 0000000000000087 RCX: ffff88806a794f50
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000a0
> RBP: ffff8880642bfaa8 R08: 0000000000000006 R09: ffff8880677fe3a0
> R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
> R13: ffff88806a794f50 R14: ffff88806a794d00 R15: 0000000000000087
> FS:  00007f644b697700(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ffcd37370b0 CR3: 00000000679f2006 CR4: 00000000003606e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  tcp_sendmsg+0x2a/0x40 net/ipv4/tcp.c:1533
>  inet_sendmsg+0x173/0x4e0 net/ipv4/af_inet.c:784
>  sock_sendmsg_nosec net/socket.c:646 [inline]
>  sock_sendmsg+0xc3/0x100 net/socket.c:656
>  SYSC_sendto+0x35d/0x5e0 net/socket.c:1766
>  do_syscall_64+0x241/0x680 arch/x86/entry/common.c:292
>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> RIP: 0033:0x7f644afc6469
> RSP: 002b:00007f644b696f28 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
> RAX: ffffffffffffffda RBX: 0000000000602130 RCX: 00007f644afc6469
> RDX: 0000000000000087 RSI: 0000000000000000 RDI: 0000000000000003
> RBP: 0000000000602138 R08: 0000000000000000 R09: 0000000000000391
> R10: 0000000000000000 R11: 0000000000000246 R12: 000000000060213c
> R13: 00007ffcd373700f R14: 00007f644b677000 R15: 0000000000000003
> Code: 74 08 3c 03 0f 8e f1 32 00 00 8b 85 98 fd ff ff 89 85 60 fd ff
> ff 48 8b 85 70 fd ff ff 48 8d b8 a0 00 00 00 48 89 f8 48 c1 e8 03 <42>
> 0f b6 04 20 84 c0 74 06 0f 8e d2 32 00 00 4c 8b bd 70 fd ff
> RIP: tcp_sendmsg_locked+0x6b4/0x4390 net/ipv4/tcp.c:1350 RSP: ffff8880642bf718
> ---[ end trace 70f07f242cd3b9d8 ]---
>
>
> It's because skb is NULL in tcp_sendmsg_locked at:
>                   skb = tcp_write_queue_tail(sk);
>                   if (tcp_send_head(sk)) {
>                           if (skb->ip_summed == CHECKSUM_NONE)
>
>
> I think we need this here on pre-rb-tree kernels :
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 5ce069ce2a97..efe767e20d01 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -924,8 +924,7 @@ static void tcp_remove_empty_skb(struct sock *sk,
> struct sk_buff *skb)
>  {
>   if (skb && !skb->len) {
>   tcp_unlink_write_queue(skb, sk);
> - if (tcp_write_queue_empty(sk))
> - tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
> + tcp_check_send_head(sk, skb);
>   sk_wmem_free_skb(sk, skb);
>   }
>  }
>
> Does that look good?
>

Yes the backport to 4.14.143 was not done properly.

Thanks

^ permalink raw reply

* Re: [PATCH] bpf: validate bpf_func when BPF_JIT is enabled
From: Sami Tolvanen @ 2019-09-11 20:29 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Daniel Borkmann, Kees Cook, Martin Lau,
	Song Liu, netdev@vger.kernel.org, bpf@vger.kernel.org,
	linux-kernel@vger.kernel.org, Toke Høiland-Jørgensen,
	Björn Töpel
In-Reply-To: <c7c7668e-6336-0367-42b3-2f6026c466dd@fb.com>

On Wed, Sep 11, 2019 at 12:43 AM Yonghong Song <yhs@fb.com> wrote:
> How about this:
>
>         if (!IS_ENABLED(CONFIG_BPF_JIT_ALWAYS_ON) && !prog->jited)
>                 goto out;
>
>         if (unlikely(hdr->magic != BPF_BINARY_HEADER_MAGIC ||
>             !arch_bpf_jit_check_func(prog))) {
>                 WARN(1, "attempt to jump to an invalid address");
>                 return 0;
>         }
> out:
>         return prog->bpf_func(ctx, prog->insnsi);

Sure, that does look cleaner. I'll use this in the next version. Thanks.

Sami

^ permalink raw reply

* [PATCH net-next] ip: support SO_MARK cmsg
From: Willem de Bruijn @ 2019-09-11 19:50 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Enable setting skb->mark for UDP and RAW sockets using cmsg.

This is analogous to existing support for TOS, TTL, txtime, etc.

Packet sockets already support this as of commit c7d39e32632e
("packet: support per-packet fwmark for af_packet sendmsg").

Similar to other fields, implement by
1. initialize the sockcm_cookie.mark from socket option sk_mark
2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl
3. initialize inet_cork.mark from sockcm_cookie.mark
4. initialize each (usually just one) skb->mark from inet_cork.mark

Step 1 is handled in one location for most protocols by ipcm_init_sk
as of commit 351782067b6b ("ipv4: ipcm_cookie initializers").

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/net/inet_sock.h | 1 +
 include/net/ip.h        | 1 +
 net/ipv4/ip_output.c    | 3 ++-
 net/ipv4/ping.c         | 2 +-
 net/ipv4/raw.c          | 4 ++--
 net/ipv4/udp.c          | 2 +-
 net/ipv6/ip6_output.c   | 3 ++-
 net/ipv6/raw.c          | 4 +++-
 net/ipv6/udp.c          | 3 ++-
 9 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 7769c9b36d75..34c4436fd18f 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -160,6 +160,7 @@ struct inet_cork {
 	char			priority;
 	__u16			gso_size;
 	u64			transmit_time;
+	u32			mark;
 };
 
 struct inet_cork_full {
diff --git a/include/net/ip.h b/include/net/ip.h
index 29d89de39822..95bb77f95bcc 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -88,6 +88,7 @@ static inline void ipcm_init_sk(struct ipcm_cookie *ipcm,
 {
 	ipcm_init(ipcm);
 
+	ipcm->sockc.mark = inet->sk.sk_mark;
 	ipcm->sockc.tsflags = inet->sk.sk_tsflags;
 	ipcm->oif = inet->sk.sk_bound_dev_if;
 	ipcm->addr = inet->inet_saddr;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index cc7ef0d05bbd..5eb73775c3f7 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1266,6 +1266,7 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
 	cork->length = 0;
 	cork->ttl = ipc->ttl;
 	cork->tos = ipc->tos;
+	cork->mark = ipc->sockc.mark;
 	cork->priority = ipc->priority;
 	cork->transmit_time = ipc->sockc.transmit_time;
 	cork->tx_flags = 0;
@@ -1529,7 +1530,7 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
 	}
 
 	skb->priority = (cork->tos != -1) ? cork->priority: sk->sk_priority;
-	skb->mark = sk->sk_mark;
+	skb->mark = cork->mark;
 	skb->tstamp = cork->transmit_time;
 	/*
 	 * Steal rt from cork.dst to avoid a pair of atomic_inc/atomic_dec
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 9d24ef5c5d8f..535427292194 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -781,7 +781,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	} else if (!ipc.oif)
 		ipc.oif = inet->uc_index;
 
-	flowi4_init_output(&fl4, ipc.oif, sk->sk_mark, tos,
+	flowi4_init_output(&fl4, ipc.oif, ipc.sockc.mark, tos,
 			   RT_SCOPE_UNIVERSE, sk->sk_protocol,
 			   inet_sk_flowi_flags(sk), faddr, saddr, 0, 0,
 			   sk->sk_uid);
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 40a6abbc9cf6..80da5a66d5d7 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -375,7 +375,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 	skb_reserve(skb, hlen);
 
 	skb->priority = sk->sk_priority;
-	skb->mark = sk->sk_mark;
+	skb->mark = sockc->mark;
 	skb->tstamp = sockc->transmit_time;
 	skb_dst_set(skb, &rt->dst);
 	*rtp = NULL;
@@ -623,7 +623,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		}
 	}
 
-	flowi4_init_output(&fl4, ipc.oif, sk->sk_mark, tos,
+	flowi4_init_output(&fl4, ipc.oif, ipc.sockc.mark, tos,
 			   RT_SCOPE_UNIVERSE,
 			   hdrincl ? IPPROTO_RAW : sk->sk_protocol,
 			   inet_sk_flowi_flags(sk) |
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d88821c794fb..fbcd9be3a470 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1130,7 +1130,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 		fl4 = &fl4_stack;
 
-		flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
+		flowi4_init_output(fl4, ipc.oif, ipc.sockc.mark, tos,
 				   RT_SCOPE_UNIVERSE, sk->sk_protocol,
 				   flow_flags,
 				   faddr, saddr, dport, inet->inet_sport,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 8e49fd62eea9..89a4c7c2e25d 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1294,6 +1294,7 @@ static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork,
 	cork->base.fragsize = mtu;
 	cork->base.gso_size = ipc6->gso_size;
 	cork->base.tx_flags = 0;
+	cork->base.mark = ipc6->sockc.mark;
 	sock_tx_timestamp(sk, ipc6->sockc.tsflags, &cork->base.tx_flags);
 
 	if (dst_allfrag(xfrm_dst_path(&rt->dst)))
@@ -1764,7 +1765,7 @@ struct sk_buff *__ip6_make_skb(struct sock *sk,
 	hdr->daddr = *final_dst;
 
 	skb->priority = sk->sk_priority;
-	skb->mark = sk->sk_mark;
+	skb->mark = cork->base.mark;
 
 	skb->tstamp = cork->base.transmit_time;
 
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 8a6131991e38..6e1888ee4036 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -646,7 +646,7 @@ static int rawv6_send_hdrinc(struct sock *sk, struct msghdr *msg, int length,
 
 	skb->protocol = htons(ETH_P_IPV6);
 	skb->priority = sk->sk_priority;
-	skb->mark = sk->sk_mark;
+	skb->mark = sockc->mark;
 	skb->tstamp = sockc->transmit_time;
 
 	skb_put(skb, length);
@@ -810,6 +810,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	ipcm6_init(&ipc6);
 	ipc6.sockc.tsflags = sk->sk_tsflags;
+	ipc6.sockc.mark = sk->sk_mark;
 
 	if (sin6) {
 		if (addr_len < SIN6_LEN_RFC2133)
@@ -891,6 +892,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	opt = ipv6_fixup_options(&opt_space, opt);
 
 	fl6.flowi6_proto = proto;
+	fl6.flowi6_mark = ipc6.sockc.mark;
 
 	if (!hdrincl) {
 		rfv.msg = msg;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 827fe7385078..2c8beb3896d1 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1230,6 +1230,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	ipcm6_init(&ipc6);
 	ipc6.gso_size = up->gso_size;
 	ipc6.sockc.tsflags = sk->sk_tsflags;
+	ipc6.sockc.mark = sk->sk_mark;
 
 	/* destination address check */
 	if (sin6) {
@@ -1352,7 +1353,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	if (!fl6.flowi6_oif)
 		fl6.flowi6_oif = np->sticky_pktinfo.ipi6_ifindex;
 
-	fl6.flowi6_mark = sk->sk_mark;
+	fl6.flowi6_mark = ipc6.sockc.mark;
 	fl6.flowi6_uid = sk->sk_uid;
 
 	if (msg->msg_controllen) {
-- 
2.23.0.162.g0b9fbb3734-goog


^ permalink raw reply related

* Re: [PATCH v1 net-next 12/15] net: dsa: sja1105: Configure the Time-Aware Scheduler via tc-taprio offload
From: Vinicius Costa Gomes @ 2019-09-11 19:45 UTC (permalink / raw)
  To: Vladimir Oltean, f.fainelli, vivien.didelot, andrew, davem,
	vedang.patel, richardcochran
  Cc: weifeng.voon, jiri, m-karicheri2, Jose.Abreu, ilias.apalodimas,
	jhs, xiyou.wangcong, kurt.kanzenbach, netdev, Vladimir Oltean
In-Reply-To: <20190902162544.24613-13-olteanv@gmail.com>

Hi,

Vladimir Oltean <olteanv@gmail.com> writes:

> This qdisc offload is the closest thing to what the SJA1105 supports in
> hardware for time-based egress shaping. The switch core really is built
> around SAE AS6802/TTEthernet (a TTTech standard) but can be made to
> operate similarly to IEEE 802.1Qbv with some constraints:
>
> - The gate control list is a global list for all ports. There are 8
>   execution threads that iterate through this global list in parallel.
>   I don't know why 8, there are only 4 front-panel ports.
>
> - Care must be taken by the user to make sure that two execution threads
>   never get to execute a GCL entry simultaneously. I created a O(n^4)
>   checker for this hardware limitation, prior to accepting a taprio
>   offload configuration as valid.
>
> - The spec says that if a GCL entry's interval is shorter than the frame
>   length, you shouldn't send it (and end up in head-of-line blocking).
>   Well, this switch does anyway.
>
> - The switch has no concept of ADMIN and OPER configurations. Because
>   it's so simple, the TAS settings are loaded through the static config
>   tables interface, so there isn't even place for any discussion about
>   'graceful switchover between ADMIN and OPER'. You just reset the
>   switch and upload a new OPER config.
>
> - The switch accepts multiple time sources for the gate events. Right
>   now I am using the standalone clock source as opposed to PTP. So the
>   base time parameter doesn't really do much. Support for the PTP clock
>   source will be added in the next patch.
>
> Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
> ---
> Changes since RFC:
> - Removed the sja1105_tas_config_work workqueue.
> - Allocating memory with GFP_KERNEL.
> - Made the ASCII art drawing fit in < 80 characters.
> - Made most of the time-holding variables s64 instead of u64 (for fear
>   of them not holding the result of signed arithmetics properly).
>
>  drivers/net/dsa/sja1105/Kconfig        |   8 +
>  drivers/net/dsa/sja1105/Makefile       |   4 +
>  drivers/net/dsa/sja1105/sja1105.h      |   5 +
>  drivers/net/dsa/sja1105/sja1105_main.c |  19 +-
>  drivers/net/dsa/sja1105/sja1105_tas.c  | 420 +++++++++++++++++++++++++
>  drivers/net/dsa/sja1105/sja1105_tas.h  |  42 +++
>  6 files changed, 497 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/net/dsa/sja1105/sja1105_tas.c
>  create mode 100644 drivers/net/dsa/sja1105/sja1105_tas.h
>
> diff --git a/drivers/net/dsa/sja1105/Kconfig b/drivers/net/dsa/sja1105/Kconfig
> index 770134a66e48..55424f39cb0d 100644
> --- a/drivers/net/dsa/sja1105/Kconfig
> +++ b/drivers/net/dsa/sja1105/Kconfig
> @@ -23,3 +23,11 @@ config NET_DSA_SJA1105_PTP
>  	help
>  	  This enables support for timestamping and PTP clock manipulations in
>  	  the SJA1105 DSA driver.
> +
> +config NET_DSA_SJA1105_TAS
> +	bool "Support for the Time-Aware Scheduler on NXP SJA1105"
> +	depends on NET_DSA_SJA1105
> +	help
> +	  This enables support for the TTEthernet-based egress scheduling
> +	  engine in the SJA1105 DSA driver, which is controlled using a
> +	  hardware offload of the tc-tqprio qdisc.
> diff --git a/drivers/net/dsa/sja1105/Makefile b/drivers/net/dsa/sja1105/Makefile
> index 4483113e6259..66161e874344 100644
> --- a/drivers/net/dsa/sja1105/Makefile
> +++ b/drivers/net/dsa/sja1105/Makefile
> @@ -12,3 +12,7 @@ sja1105-objs := \
>  ifdef CONFIG_NET_DSA_SJA1105_PTP
>  sja1105-objs += sja1105_ptp.o
>  endif
> +
> +ifdef CONFIG_NET_DSA_SJA1105_TAS
> +sja1105-objs += sja1105_tas.o
> +endif
> diff --git a/drivers/net/dsa/sja1105/sja1105.h b/drivers/net/dsa/sja1105/sja1105.h
> index 3ca0b87aa3e4..d95f9ce3b4f9 100644
> --- a/drivers/net/dsa/sja1105/sja1105.h
> +++ b/drivers/net/dsa/sja1105/sja1105.h
> @@ -21,6 +21,7 @@
>  #define SJA1105_AGEING_TIME_MS(ms)	((ms) / 10)
>  
>  #include "sja1105_ptp.h"
> +#include "sja1105_tas.h"
>  
>  /* Keeps the different addresses between E/T and P/Q/R/S */
>  struct sja1105_regs {
> @@ -96,6 +97,7 @@ struct sja1105_private {
>  	struct mutex mgmt_lock;
>  	struct sja1105_tagger_data tagger_data;
>  	struct sja1105_ptp_data ptp_data;
> +	struct sja1105_tas_data tas_data;
>  };
>  
>  #include "sja1105_dynamic_config.h"
> @@ -111,6 +113,9 @@ typedef enum {
>  	SPI_WRITE = 1,
>  } sja1105_spi_rw_mode_t;
>  
> +/* From sja1105_main.c */
> +int sja1105_static_config_reload(struct sja1105_private *priv);
> +
>  /* From sja1105_spi.c */
>  int sja1105_spi_send_packed_buf(const struct sja1105_private *priv,
>  				sja1105_spi_rw_mode_t rw, u64 reg_addr,
> diff --git a/drivers/net/dsa/sja1105/sja1105_main.c b/drivers/net/dsa/sja1105/sja1105_main.c
> index 8b930cc2dabc..4b393782cc84 100644
> --- a/drivers/net/dsa/sja1105/sja1105_main.c
> +++ b/drivers/net/dsa/sja1105/sja1105_main.c
> @@ -22,6 +22,7 @@
>  #include <linux/if_ether.h>
>  #include <linux/dsa/8021q.h>
>  #include "sja1105.h"
> +#include "sja1105_tas.h"
>  
>  static void sja1105_hw_reset(struct gpio_desc *gpio, unsigned int pulse_len,
>  			     unsigned int startup_delay)
> @@ -1382,7 +1383,7 @@ static void sja1105_bridge_leave(struct dsa_switch *ds, int port,
>   * modify at runtime (currently only MAC) and restore them after uploading,
>   * such that this operation is relatively seamless.
>   */
> -static int sja1105_static_config_reload(struct sja1105_private *priv)
> +int sja1105_static_config_reload(struct sja1105_private *priv)
>  {
>  	struct ptp_system_timestamp ptp_sts_before;
>  	struct ptp_system_timestamp ptp_sts_after;
> @@ -1761,6 +1762,7 @@ static void sja1105_teardown(struct dsa_switch *ds)
>  {
>  	struct sja1105_private *priv = ds->priv;
>  
> +	sja1105_tas_teardown(priv);
>  	cancel_work_sync(&priv->tagger_data.rxtstamp_work);
>  	skb_queue_purge(&priv->tagger_data.skb_rxtstamp_queue);
>  	sja1105_ptp_clock_unregister(priv);
> @@ -2088,6 +2090,18 @@ static bool sja1105_port_txtstamp(struct dsa_switch *ds, int port,
>  	return true;
>  }
>  
> +static int sja1105_port_setup_tc(struct dsa_switch *ds, int port,
> +				 enum tc_setup_type type,
> +				 void *type_data)
> +{
> +	switch (type) {
> +	case TC_SETUP_QDISC_TAPRIO:
> +		return sja1105_setup_tc_taprio(ds, port, type_data);
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +}
> +
>  static const struct dsa_switch_ops sja1105_switch_ops = {
>  	.get_tag_protocol	= sja1105_get_tag_protocol,
>  	.setup			= sja1105_setup,
> @@ -2120,6 +2134,7 @@ static const struct dsa_switch_ops sja1105_switch_ops = {
>  	.port_hwtstamp_set	= sja1105_hwtstamp_set,
>  	.port_rxtstamp		= sja1105_port_rxtstamp,
>  	.port_txtstamp		= sja1105_port_txtstamp,
> +	.port_setup_tc		= sja1105_port_setup_tc,
>  };
>  
>  static int sja1105_check_device_id(struct sja1105_private *priv)
> @@ -2229,6 +2244,8 @@ static int sja1105_probe(struct spi_device *spi)
>  	}
>  	mutex_init(&priv->mgmt_lock);
>  
> +	sja1105_tas_setup(priv);
> +
>  	return dsa_register_switch(priv->ds);
>  }
>  
> diff --git a/drivers/net/dsa/sja1105/sja1105_tas.c b/drivers/net/dsa/sja1105/sja1105_tas.c
> new file mode 100644
> index 000000000000..769e1d8e5e8f
> --- /dev/null
> +++ b/drivers/net/dsa/sja1105/sja1105_tas.c
> @@ -0,0 +1,420 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2019, Vladimir Oltean <olteanv@gmail.com>
> + */
> +#include "sja1105.h"
> +
> +#define SJA1105_TAS_CLKSRC_DISABLED	0
> +#define SJA1105_TAS_CLKSRC_STANDALONE	1
> +#define SJA1105_TAS_CLKSRC_AS6802	2
> +#define SJA1105_TAS_CLKSRC_PTP		3
> +#define SJA1105_GATE_MASK		GENMASK_ULL(SJA1105_NUM_TC - 1, 0)
> +#define SJA1105_TAS_MAX_DELTA		BIT(19)
> +
> +/* This is not a preprocessor macro because the "ns" argument may or may not be
> + * s64 at caller side. This ensures it is properly type-cast before div_s64.
> + */
> +static s64 ns_to_sja1105_delta(s64 ns)
> +{
> +	return div_s64(ns, 200);
> +}
> +
> +/* Lo and behold: the egress scheduler from hell.
> + *
> + * At the hardware level, the Time-Aware Shaper holds a global linear arrray of
> + * all schedule entries for all ports. These are the Gate Control List (GCL)
> + * entries, let's call them "timeslots" for short. This linear array of
> + * timeslots is held in BLK_IDX_SCHEDULE.
> + *
> + * Then there are a maximum of 8 "execution threads" inside the switch, which
> + * iterate cyclically through the "schedule". Each "cycle" has an entry point
> + * and an exit point, both being timeslot indices in the schedule table. The
> + * hardware calls each cycle a "subschedule".
> + *
> + * Subschedule (cycle) i starts when
> + *   ptpclkval >= ptpschtm + BLK_IDX_SCHEDULE_ENTRY_POINTS[i].delta.
> + *
> + * The hardware scheduler iterates BLK_IDX_SCHEDULE with a k ranging from
> + *   k = BLK_IDX_SCHEDULE_ENTRY_POINTS[i].address to
> + *   k = BLK_IDX_SCHEDULE_PARAMS.subscheind[i]
> + *
> + * For each schedule entry (timeslot) k, the engine executes the gate control
> + * list entry for the duration of BLK_IDX_SCHEDULE[k].delta.
> + *
> + *         +---------+
> + *         |         | BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS
> + *         +---------+
> + *              |
> + *              +-----------------+
> + *                                | .actsubsch
> + *  BLK_IDX_SCHEDULE_ENTRY_POINTS v
> + *                 +-------+-------+
> + *                 |cycle 0|cycle 1|
> + *                 +-------+-------+
> + *                   |  |      |  |
> + *  +----------------+  |      |  +-------------------------------------+
> + *  |   .subschindx     |      |             .subschindx                |
> + *  |                   |      +---------------+                        |
> + *  |          .address |        .address      |                        |
> + *  |                   |                      |                        |
> + *  |                   |                      |                        |
> + *  |  BLK_IDX_SCHEDULE v                      v                        |
> + *  |              +-------+-------+-------+-------+-------+------+     |
> + *  |              |entry 0|entry 1|entry 2|entry 3|entry 4|entry5|     |
> + *  |              +-------+-------+-------+-------+-------+------+     |
> + *  |                                  ^                    ^  ^  ^     |
> + *  |                                  |                    |  |  |     |
> + *  |        +-------------------------+                    |  |  |     |
> + *  |        |              +-------------------------------+  |  |     |
> + *  |        |              |              +-------------------+  |     |
> + *  |        |              |              |                      |     |
> + *  | +---------------------------------------------------------------+ |
> + *  | |subscheind[0]<=subscheind[1]<=subscheind[2]<=...<=subscheind[7]| |
> + *  | +---------------------------------------------------------------+ |
> + *  |        ^              ^                BLK_IDX_SCHEDULE_PARAMS    |
> + *  |        |              |                                           |
> + *  +--------+              +-------------------------------------------+
> + *
> + *  In the above picture there are two subschedules (cycles):
> + *
> + *  - cycle 0: iterates the schedule table from 0 to 2 (and back)
> + *  - cycle 1: iterates the schedule table from 3 to 5 (and back)
> + *
> + *  All other possible execution threads must be marked as unused by making
> + *  their "subschedule end index" (subscheind) equal to the last valid
> + *  subschedule's end index (in this case 5).
> + */
> +static int sja1105_init_scheduling(struct sja1105_private *priv)
> +{
> +	struct sja1105_schedule_entry_points_entry *schedule_entry_points;
> +	struct sja1105_schedule_entry_points_params_entry
> +					*schedule_entry_points_params;
> +	struct sja1105_schedule_params_entry *schedule_params;
> +	struct sja1105_tas_data *tas_data = &priv->tas_data;
> +	struct sja1105_schedule_entry *schedule;
> +	struct sja1105_table *table;
> +	int subscheind[8] = {0};
> +	int schedule_start_idx;
> +	s64 entry_point_delta;
> +	int schedule_end_idx;
> +	int num_entries = 0;
> +	int num_cycles = 0;
> +	int cycle = 0;
> +	int i, k = 0;
> +	int port;
> +
> +	/* Discard previous Schedule Table */
> +	table = &priv->static_config.tables[BLK_IDX_SCHEDULE];
> +	if (table->entry_count) {
> +		kfree(table->entries);
> +		table->entry_count = 0;
> +	}
> +
> +	/* Discard previous Schedule Entry Points Parameters Table */
> +	table = &priv->static_config.tables[BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS];
> +	if (table->entry_count) {
> +		kfree(table->entries);
> +		table->entry_count = 0;
> +	}
> +
> +	/* Discard previous Schedule Parameters Table */
> +	table = &priv->static_config.tables[BLK_IDX_SCHEDULE_PARAMS];
> +	if (table->entry_count) {
> +		kfree(table->entries);
> +		table->entry_count = 0;
> +	}
> +
> +	/* Discard previous Schedule Entry Points Table */
> +	table = &priv->static_config.tables[BLK_IDX_SCHEDULE_ENTRY_POINTS];
> +	if (table->entry_count) {
> +		kfree(table->entries);
> +		table->entry_count = 0;
> +	}
> +
> +	/* Figure out the dimensioning of the problem */
> +	for (port = 0; port < SJA1105_NUM_PORTS; port++) {
> +		if (tas_data->config[port]) {
> +			num_entries += tas_data->config[port]->num_entries;
> +			num_cycles++;
> +		}
> +	}
> +
> +	/* Nothing to do */
> +	if (!num_cycles)
> +		return 0;
> +
> +	/* Pre-allocate space in the static config tables */
> +
> +	/* Schedule Table */
> +	table = &priv->static_config.tables[BLK_IDX_SCHEDULE];
> +	table->entries = kcalloc(num_entries, table->ops->unpacked_entry_size,
> +				 GFP_KERNEL);
> +	if (!table->entries)
> +		return -ENOMEM;
> +	table->entry_count = num_entries;
> +	schedule = table->entries;
> +
> +	/* Schedule Points Parameters Table */
> +	table = &priv->static_config.tables[BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS];
> +	table->entries = kcalloc(SJA1105_MAX_SCHEDULE_ENTRY_POINTS_PARAMS_COUNT,
> +				 table->ops->unpacked_entry_size, GFP_KERNEL);
> +	if (!table->entries)
> +		return -ENOMEM;

Should this free the previous allocation, in case this one fails?
(also applies to the statements below)

> +	table->entry_count = SJA1105_MAX_SCHEDULE_ENTRY_POINTS_PARAMS_COUNT;
> +	schedule_entry_points_params = table->entries;
> +
> +	/* Schedule Parameters Table */
> +	table = &priv->static_config.tables[BLK_IDX_SCHEDULE_PARAMS];
> +	table->entries = kcalloc(SJA1105_MAX_SCHEDULE_PARAMS_COUNT,
> +				 table->ops->unpacked_entry_size, GFP_KERNEL);
> +	if (!table->entries)
> +		return -ENOMEM;
> +	table->entry_count = SJA1105_MAX_SCHEDULE_PARAMS_COUNT;
> +	schedule_params = table->entries;
> +
> +	/* Schedule Entry Points Table */
> +	table = &priv->static_config.tables[BLK_IDX_SCHEDULE_ENTRY_POINTS];
> +	table->entries = kcalloc(num_cycles, table->ops->unpacked_entry_size,
> +				 GFP_KERNEL);
> +	if (!table->entries)
> +		return -ENOMEM;
> +	table->entry_count = num_cycles;
> +	schedule_entry_points = table->entries;
> +
> +	/* Finally start populating the static config tables */
> +	schedule_entry_points_params->clksrc = SJA1105_TAS_CLKSRC_STANDALONE;
> +	schedule_entry_points_params->actsubsch = num_cycles - 1;
> +
> +	for (port = 0; port < SJA1105_NUM_PORTS; port++) {
> +		const struct tc_taprio_qopt_offload *tas_config;
> +
> +		tas_config = tas_data->config[port];
> +		if (!tas_config)
> +			continue;
> +
> +		schedule_start_idx = k;
> +		schedule_end_idx = k + tas_config->num_entries - 1;
> +		/* TODO this is only a relative base time for the subschedule
> +		 * (relative to PTPSCHTM). But as we're using standalone and
> +		 * not PTP clock as time reference, leave it like this for now.
> +		 * Later we'll have to enforce that all ports' base times are
> +		 * within SJA1105_TAS_MAX_DELTA 200ns cycles of one another.
> +		 */
> +		entry_point_delta = ns_to_sja1105_delta(tas_config->base_time);
> +
> +		schedule_entry_points[cycle].subschindx = cycle;
> +		schedule_entry_points[cycle].delta = entry_point_delta;
> +		schedule_entry_points[cycle].address = schedule_start_idx;
> +
> +		for (i = cycle; i < 8; i++)
> +			subscheind[i] = schedule_end_idx;
> +
> +		for (i = 0; i < tas_config->num_entries; i++, k++) {
> +			s64 delta_ns = tas_config->entries[i].interval;
> +
> +			schedule[k].delta = ns_to_sja1105_delta(delta_ns);
> +			schedule[k].destports = BIT(port);
> +			schedule[k].resmedia_en = true;
> +			schedule[k].resmedia = SJA1105_GATE_MASK &
> +					~tas_config->entries[i].gate_mask;
> +		}
> +		cycle++;
> +	}
> +
> +	for (i = 0; i < 8; i++)
> +		schedule_params->subscheind[i] = subscheind[i];
> +
> +	return 0;
> +}
> +
> +/* Be there 2 port subschedules, each executing an arbitrary number of gate
> + * open/close events cyclically.
> + * None of those gate events must ever occur at the exact same time, otherwise
> + * the switch is known to act in exotically strange ways.
> + * However the hardware doesn't bother performing these integrity checks - the
> + * designers probably said "nah, let's leave that to the experts" - oh well,
> + * now we're the experts.
> + * So here we are with the task of validating whether the new @qopt has any
> + * conflict with the already established TAS configuration in tas_data->config.
> + * We already know the other ports are in harmony with one another, otherwise
> + * we wouldn't have saved them.
> + * Each gate event executes periodically, with a period of @cycle_time and a
> + * phase given by its cycle's @base_time plus its offset within the cycle
> + * (which in turn is given by the length of the events prior to it).
> + * There are two aspects to possible collisions:
> + * - Collisions within one cycle's (actually the longest cycle's) time frame.
> + *   For that, we need to compare the cartesian product of each possible
> + *   occurrence of each event within one cycle time.
> + * - Collisions in the future. Events may not collide within one cycle time,
> + *   but if two port schedules don't have the same periodicity (aka the cycle
> + *   times aren't multiples of one another), they surely will some time in the
> + *   future (actually they will collide an infinite amount of times).
> + */
> +static bool
> +sja1105_tas_check_conflicts(struct sja1105_private *priv,
> +			    const struct tc_taprio_qopt_offload *qopt)
> +{
> +	struct sja1105_tas_data *tas_data = &priv->tas_data;
> +	int port;
> +
> +	for (port = 0; port < SJA1105_NUM_PORTS; port++) {
> +		const struct tc_taprio_qopt_offload *tas_config;
> +		s64 max_cycle_time, min_cycle_time;
> +		s64 delta1, delta2;
> +		s64 rbt1, rbt2;
> +		s64 stop_time;
> +		s64 t1, t2;
> +		int i, j;
> +		s32 rem;
> +
> +		tas_config = tas_data->config[port];
> +
> +		if (!tas_config)
> +			continue;
> +
> +		/* Check if the two cycle times are multiples of one another.
> +		 * If they aren't, then they will surely collide.
> +		 */
> +		max_cycle_time = max(tas_config->cycle_time, qopt->cycle_time);
> +		min_cycle_time = min(tas_config->cycle_time, qopt->cycle_time);
> +		div_s64_rem(max_cycle_time, min_cycle_time, &rem);
> +		if (rem)
> +			return true;
> +
> +		/* Calculate the "reduced" base time of each of the two cycles
> +		 * (transposed back as close to 0 as possible) by dividing to
> +		 * the cycle time.
> +		 */
> +		div_s64_rem(tas_config->base_time, tas_config->cycle_time,
> +			    &rem);
> +		rbt1 = rem;
> +
> +		div_s64_rem(qopt->base_time, qopt->cycle_time, &rem);
> +		rbt2 = rem;
> +
> +		stop_time = max_cycle_time + max(rbt1, rbt2);
> +
> +		/* delta1 is the relative base time of each GCL entry within
> +		 * the established ports' TAS config.
> +		 */
> +		for (i = 0, delta1 = 0;
> +		     i < tas_config->num_entries;
> +		     delta1 += tas_config->entries[i].interval, i++) {
> +
> +			/* delta2 is the relative base time of each GCL entry
> +			 * within the newly added TAS config.
> +			 */
> +			for (j = 0, delta2 = 0;
> +			     j < qopt->num_entries;
> +			     delta2 += qopt->entries[j].interval, j++) {
> +
> +				/* t1 follows all possible occurrences of the
> +				 * established ports' GCL entry i within the
> +				 * first cycle time.
> +				 */
> +				for (t1 = rbt1 + delta1;
> +				     t1 <= stop_time;
> +				     t1 += tas_config->cycle_time) {
> +
> +					/* t2 follows all possible occurrences
> +					 * of the newly added GCL entry j
> +					 * within the first cycle time.
> +					 */
> +					for (t2 = rbt2 + delta2;
> +					     t2 <= stop_time;
> +					     t2 += qopt->cycle_time) {
> +
> +						if (t1 == t2) {
> +							dev_warn(priv->ds->dev,
> +								 "GCL entry %d collides with entry %d of port %d\n",
> +								 j, i, port);
> +							return true;
> +						}
> +					}
> +				}
> +			}
> +		}
> +	}
> +
> +	return false;
> +}
> +
> +int sja1105_setup_tc_taprio(struct dsa_switch *ds, int port,
> +			    struct tc_taprio_qopt_offload *tas_config)
> +{
> +	struct sja1105_private *priv = ds->priv;
> +	struct sja1105_tas_data *tas_data = &priv->tas_data;
> +	int rc, i;
> +
> +	/* Can't change an already configured port (must delete qdisc first).
> +	 * Can't delete the qdisc from an unconfigured port.
> +	 */
> +	if (!!tas_data->config[port] == tas_config->enable)
> +		return -EINVAL;
> +
> +	if (!tas_config->enable) {
> +		taprio_free(tas_data->config[port]);
> +		tas_data->config[port] = NULL;
> +
> +		rc = sja1105_init_scheduling(priv);
> +		if (rc < 0)
> +			return rc;
> +
> +		return sja1105_static_config_reload(priv);
> +	}
> +
> +	/* The cycle time extension is the amount of time the last cycle from
> +	 * the old OPER needs to be extended in order to phase-align with the
> +	 * base time of the ADMIN when that becomes the new OPER.
> +	 * But of course our switch needs to be reset to switch-over between
> +	 * the ADMIN and the OPER configs - so much for a seamless transition.
> +	 * So don't add insult over injury and just say we don't support cycle
> +	 * time extension.
> +	 */
> +	if (tas_config->cycle_time_extension)
> +		return -ENOTSUPP;
> +
> +	if (!ns_to_sja1105_delta(tas_config->base_time)) {
> +		dev_err(ds->dev, "A base time of zero is not hardware-allowed\n");
> +		return -ERANGE;
> +	}
> +
> +	for (i = 0; i < tas_config->num_entries; i++) {
> +		s64 delta_ns = tas_config->entries[i].interval;
> +		s64 delta_cycles = ns_to_sja1105_delta(delta_ns);
> +		bool too_long, too_short;
> +
> +		too_long = (delta_cycles >= SJA1105_TAS_MAX_DELTA);
> +		too_short = (delta_cycles == 0);
> +		if (too_long || too_short) {
> +			dev_err(priv->ds->dev,
> +				"Interval %llu too %s for GCL entry %d\n",
> +				delta_ns, too_long ? "long" : "short", i);
> +			return -ERANGE;
> +		}
> +	}
> +
> +	if (sja1105_tas_check_conflicts(priv, tas_config))
> +		return -ERANGE;
> +
> +	tas_data->config[port] = taprio_get(tas_config);
> +
> +	rc = sja1105_init_scheduling(priv);
> +	if (rc < 0)
> +		return rc;
> +
> +	return sja1105_static_config_reload(priv);
> +}
> +
> +void sja1105_tas_setup(struct sja1105_private *priv)
> +{
> +}
> +
> +void sja1105_tas_teardown(struct sja1105_private *priv)
> +{
> +	struct sja1105_tas_data *tas_data = &priv->tas_data;
> +	int port;
> +
> +	for (port = 0; port < SJA1105_NUM_PORTS; port++)
> +		if (tas_data->config[port])
> +			taprio_free(tas_data->config[port]);
> +}
> diff --git a/drivers/net/dsa/sja1105/sja1105_tas.h b/drivers/net/dsa/sja1105/sja1105_tas.h
> new file mode 100644
> index 000000000000..0ef82810d9d7
> --- /dev/null
> +++ b/drivers/net/dsa/sja1105/sja1105_tas.h
> @@ -0,0 +1,42 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + * Copyright (c) 2019, Vladimir Oltean <olteanv@gmail.com>
> + */
> +#ifndef _SJA1105_TAS_H
> +#define _SJA1105_TAS_H
> +
> +#include <net/pkt_sched.h>
> +
> +#if IS_ENABLED(CONFIG_NET_DSA_SJA1105_TAS)
> +
> +struct sja1105_tas_data {
> +	struct tc_taprio_qopt_offload *config[SJA1105_NUM_PORTS];
> +};
> +
> +int sja1105_setup_tc_taprio(struct dsa_switch *ds, int port,
> +			    struct tc_taprio_qopt_offload *qopt);
> +
> +void sja1105_tas_setup(struct sja1105_private *priv);
> +
> +void sja1105_tas_teardown(struct sja1105_private *priv);
> +
> +#else
> +
> +/* C doesn't allow empty structures, bah! */
> +struct sja1105_tas_data {
> +	u8 dummy;
> +};
> +
> +static inline int
> +sja1105_setup_tc_taprio(struct dsa_switch *ds, int port,
> +			struct tc_taprio_qopt_offload *qopt)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline void sja1105_tas_setup(struct sja1105_private *priv) { }
> +
> +static inline void sja1105_tas_teardown(struct sja1105_private *priv) { }
> +
> +#endif /* IS_ENABLED(CONFIG_NET_DSA_SJA1105_TAS) */
> +
> +#endif /* _SJA1105_TAS_H */
> -- 
> 2.17.1

^ permalink raw reply

* Re: [PATCH v1 net-next 15/15] net: dsa: sja1105: Implement state machine for TAS with PTP clock source
From: Vinicius Costa Gomes @ 2019-09-11 19:43 UTC (permalink / raw)
  To: Vladimir Oltean, f.fainelli, vivien.didelot, andrew, davem,
	vedang.patel, richardcochran
  Cc: weifeng.voon, jiri, m-karicheri2, Jose.Abreu, ilias.apalodimas,
	jhs, xiyou.wangcong, kurt.kanzenbach, netdev, Vladimir Oltean
In-Reply-To: <20190902162544.24613-16-olteanv@gmail.com>

Hi Vladimir,

Vladimir Oltean <olteanv@gmail.com> writes:

> Tested using the following bash script and the tc from iproute2-next:
>
> 	#!/bin/bash
>
> 	set -e -u -o pipefail
>
> 	NSEC_PER_SEC="1000000000"
>
> 	gatemask() {
> 		local tc_list="$1"
> 		local mask=0
>
> 		for tc in ${tc_list}; do
> 			mask=$((${mask} | (1 << ${tc})))
> 		done
>
> 		printf "%02x" ${mask}
> 	}
>
> 	if ! systemctl is-active --quiet ptp4l; then
> 		echo "Please start the ptp4l service"
> 		exit
> 	fi
>
> 	now=$(phc_ctl /dev/ptp1 get | gawk '/clock time is/ { print $5; }')
> 	# Phase-align the base time to the start of the next second.
> 	sec=$(echo "${now}" | gawk -F. '{ print $1; }')
> 	base_time="$(((${sec} + 1) * ${NSEC_PER_SEC}))"
>
> 	echo 'file drivers/net/dsa/sja1105/sja1105_tas.c +plm' | \
> 		sudo tee /sys/kernel/debug/dynamic_debug/control
>
> 	tc qdisc add dev swp5 parent root handle 100 taprio \
> 		num_tc 8 \
> 		map 0 1 2 3 5 6 7 \
> 		queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
> 		base-time ${base_time} \
> 		sched-entry S $(gatemask 7) 100000 \
> 		sched-entry S $(gatemask "0 1 2 3 4 5 6") 400000 \
> 		clockid CLOCK_TAI flags 2
>
> The "state machine" is a workqueue invoked after each manipulation
> command on the PTP clock (reset, adjust time, set time, adjust
> frequency) which checks over the state of the time-aware scheduler.
> So it is not monitored periodically, only in reaction to a PTP command
> typically triggered from a userspace daemon (linuxptp). Otherwise there
> is no reason for things to go wrong.
>
> Now that the timecounter/cyclecounter has been replaced with hardware
> operations on the PTP clock, the TAS Kconfig now depends upon PTP and
> the standalone clocksource operating mode has been removed.
>
> Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
> ---
> Changes since RFC:
> - Used the "delta" terminology instead of "TAS cycle" to be more
>   consistent and avoid confusion with the cyclic schedule (of which the
>   "delta" is only the most granular unit, there is no other connection).
>
>  drivers/net/dsa/sja1105/Kconfig       |   2 +-
>  drivers/net/dsa/sja1105/sja1105.h     |   2 +
>  drivers/net/dsa/sja1105/sja1105_ptp.c |  26 +-
>  drivers/net/dsa/sja1105/sja1105_ptp.h |  13 +
>  drivers/net/dsa/sja1105/sja1105_spi.c |   4 +
>  drivers/net/dsa/sja1105/sja1105_tas.c | 426 +++++++++++++++++++++++++-
>  drivers/net/dsa/sja1105/sja1105_tas.h |  27 ++
>  7 files changed, 486 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/dsa/sja1105/Kconfig b/drivers/net/dsa/sja1105/Kconfig
> index 4dc873e985e6..9316a23b7c30 100644
> --- a/drivers/net/dsa/sja1105/Kconfig
> +++ b/drivers/net/dsa/sja1105/Kconfig
> @@ -35,7 +35,7 @@ config NET_DSA_SJA1105_PTP
>  
>  config NET_DSA_SJA1105_TAS
>  	bool "Support for the Time-Aware Scheduler on NXP SJA1105"
> -	depends on NET_DSA_SJA1105
> +	depends on NET_DSA_SJA1105_PTP
>  	help
>  	  This enables support for the TTEthernet-based egress scheduling
>  	  engine in the SJA1105 DSA driver, which is controlled using a
> diff --git a/drivers/net/dsa/sja1105/sja1105.h b/drivers/net/dsa/sja1105/sja1105.h
> index 44f7385c51b5..e8f95b6fadfa 100644
> --- a/drivers/net/dsa/sja1105/sja1105.h
> +++ b/drivers/net/dsa/sja1105/sja1105.h
> @@ -40,6 +40,8 @@ struct sja1105_regs {
>  	u64 ptp_control;
>  	u64 ptpclk;
>  	u64 ptpclkrate;
> +	u64 ptpclkcorp;
> +	u64 ptpschtm;
>  	u64 ptpegr_ts[SJA1105_NUM_PORTS];
>  	u64 pad_mii_tx[SJA1105_NUM_PORTS];
>  	u64 pad_mii_id[SJA1105_NUM_PORTS];
> diff --git a/drivers/net/dsa/sja1105/sja1105_ptp.c b/drivers/net/dsa/sja1105/sja1105_ptp.c
> index ed80278a3521..b037834ff820 100644
> --- a/drivers/net/dsa/sja1105/sja1105_ptp.c
> +++ b/drivers/net/dsa/sja1105/sja1105_ptp.c
> @@ -67,6 +67,8 @@ void sja1105et_ptp_cmd_packing(u8 *buf, struct sja1105_ptp_cmd *cmd,
>  	u64 valid = 1;
>  
>  	sja1105_packing(buf, &valid,           31, 31, size, op);
> +	sja1105_packing(buf, &cmd->ptpstrtsch, 30, 30, size, op);
> +	sja1105_packing(buf, &cmd->ptpstopsch, 29, 29, size, op);
>  	sja1105_packing(buf, &cmd->resptp,      2,  2, size, op);
>  	sja1105_packing(buf, &cmd->corrclk4ts,  1,  1, size, op);
>  	sja1105_packing(buf, &cmd->ptpclkadd,   0,  0, size, op);
> @@ -80,14 +82,16 @@ void sja1105pqrs_ptp_cmd_packing(u8 *buf, struct sja1105_ptp_cmd *cmd,
>  	u64 valid = 1;
>  
>  	sja1105_packing(buf, &valid,           31, 31, size, op);
> +	sja1105_packing(buf, &cmd->ptpstrtsch, 30, 30, size, op);
> +	sja1105_packing(buf, &cmd->ptpstopsch, 29, 29, size, op);
>  	sja1105_packing(buf, &cmd->resptp,      3,  3, size, op);
>  	sja1105_packing(buf, &cmd->corrclk4ts,  2,  2, size, op);
>  	sja1105_packing(buf, &cmd->ptpclkadd,   0,  0, size, op);
>  }
>  
> -static int sja1105_ptp_commit(struct sja1105_private *priv,
> -			      struct sja1105_ptp_cmd *cmd,
> -			      sja1105_spi_rw_mode_t rw)
> +int sja1105_ptp_commit(struct sja1105_private *priv,
> +		       struct sja1105_ptp_cmd *cmd,
> +		       sja1105_spi_rw_mode_t rw)
>  {
>  	const struct sja1105_regs *regs = priv->info->regs;
>  	u8 buf[SJA1105_SIZE_PTP_CMD] = {0};
> @@ -222,6 +226,8 @@ int sja1105_ptp_reset(struct sja1105_private *priv)
>  	dev_dbg(priv->ds->dev, "Resetting PTP clock\n");
>  	rc = sja1105_ptp_commit(priv, &cmd, SPI_WRITE);
>  
> +	sja1105_tas_clockstep(priv);
> +
>  	mutex_unlock(&ptp_data->lock);
>  
>  	return rc;
> @@ -291,7 +297,11 @@ int __sja1105_ptp_settime(struct sja1105_private *priv, u64 ns,
>  		return rc;
>  	}
>  
> -	return sja1105_ptpclkval_write(priv, ticks, ptp_sts);
> +	rc = sja1105_ptpclkval_write(priv, ticks, ptp_sts);
> +
> +	sja1105_tas_clockstep(priv);
> +
> +	return rc;
>  }
>  
>  static int sja1105_ptp_settime(struct ptp_clock_info *ptp,
> @@ -331,6 +341,8 @@ static int sja1105_ptp_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
>  	rc = sja1105_spi_send_int(priv, SPI_WRITE, regs->ptpclkrate,
>  				  &clkrate, 4, NULL);
>  
> +	sja1105_tas_adjfreq(priv);
> +
>  	mutex_unlock(&priv->ptp_data.lock);
>  
>  	return rc;
> @@ -366,7 +378,11 @@ int __sja1105_ptp_adjtime(struct sja1105_private *priv, s64 delta)
>  		return rc;
>  	}
>  
> -	return sja1105_ptpclkval_write(priv, ticks, NULL);
> +	rc = sja1105_ptpclkval_write(priv, ticks, NULL);
> +
> +	sja1105_tas_clockstep(priv);
> +
> +	return rc;
>  }
>  
>  static int sja1105_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
> diff --git a/drivers/net/dsa/sja1105/sja1105_ptp.h b/drivers/net/dsa/sja1105/sja1105_ptp.h
> index c24c40115650..da68e5881e5f 100644
> --- a/drivers/net/dsa/sja1105/sja1105_ptp.h
> +++ b/drivers/net/dsa/sja1105/sja1105_ptp.h
> @@ -29,6 +29,8 @@ enum sja1105_ptp_clk_mode {
>  };
>  
>  struct sja1105_ptp_cmd {
> +	u64 ptpstrtsch;		/* start schedule */
> +	u64 ptpstopsch;		/* stop schedule */
>  	u64 resptp;		/* reset */
>  	u64 corrclk4ts;		/* use the corrected clock for timestamps */
>  	u64 ptpclkadd;		/* enum sja1105_ptp_clk_mode */
> @@ -73,6 +75,10 @@ int __sja1105_ptp_settime(struct sja1105_private *priv, u64 ns,
>  
>  int __sja1105_ptp_adjtime(struct sja1105_private *priv, s64 delta);
>  
> +int sja1105_ptp_commit(struct sja1105_private *priv,
> +		       struct sja1105_ptp_cmd *cmd,
> +		       sja1105_spi_rw_mode_t rw);
> +
>  #else
>  
>  struct sja1105_ptp_cmd;
> @@ -135,6 +141,13 @@ static inline int __sja1105_ptp_adjtime(struct sja1105_private *priv, s64 delta)
>  	return 0;
>  }
>  
> +static inline int sja1105_ptp_commit(struct sja1105_private *priv,
> +				     struct sja1105_ptp_cmd *cmd,
> +				     sja1105_spi_rw_mode_t rw)
> +{
> +	return 0;
> +}
> +
>  #define sja1105et_ptp_cmd_packing NULL
>  
>  #define sja1105pqrs_ptp_cmd_packing NULL
> diff --git a/drivers/net/dsa/sja1105/sja1105_spi.c b/drivers/net/dsa/sja1105/sja1105_spi.c
> index 794cc5077565..f6df050c15ec 100644
> --- a/drivers/net/dsa/sja1105/sja1105_spi.c
> +++ b/drivers/net/dsa/sja1105/sja1105_spi.c
> @@ -526,9 +526,11 @@ static struct sja1105_regs sja1105et_regs = {
>  	.rmii_ref_clk = {0x100015, 0x10001C, 0x100023, 0x10002A, 0x100031},
>  	.rmii_ext_tx_clk = {0x100018, 0x10001F, 0x100026, 0x10002D, 0x100034},
>  	.ptpegr_ts = {0xC0, 0xC2, 0xC4, 0xC6, 0xC8},
> +	.ptpschtm = 0x12, /* Spans 0x12 to 0x13 */
>  	.ptp_control = 0x17,
>  	.ptpclk = 0x18, /* Spans 0x18 to 0x19 */
>  	.ptpclkrate = 0x1A,
> +	.ptpclkcorp = 0x1D,
>  };
>  
>  static struct sja1105_regs sja1105pqrs_regs = {
> @@ -556,9 +558,11 @@ static struct sja1105_regs sja1105pqrs_regs = {
>  	.rmii_ext_tx_clk = {0x100017, 0x10001D, 0x100023, 0x100029, 0x10002F},
>  	.qlevel = {0x604, 0x614, 0x624, 0x634, 0x644},
>  	.ptpegr_ts = {0xC0, 0xC4, 0xC8, 0xCC, 0xD0},
> +	.ptpschtm = 0x13, /* Spans 0x13 to 0x14 */
>  	.ptp_control = 0x18,
>  	.ptpclk = 0x19,
>  	.ptpclkrate = 0x1B,
> +	.ptpclkcorp = 0x1E,
>  };
>  
>  struct sja1105_info sja1105e_info = {
> diff --git a/drivers/net/dsa/sja1105/sja1105_tas.c b/drivers/net/dsa/sja1105/sja1105_tas.c
> index 769e1d8e5e8f..ed0c3f00c09d 100644
> --- a/drivers/net/dsa/sja1105/sja1105_tas.c
> +++ b/drivers/net/dsa/sja1105/sja1105_tas.c
> @@ -10,6 +10,11 @@
>  #define SJA1105_GATE_MASK		GENMASK_ULL(SJA1105_NUM_TC - 1, 0)
>  #define SJA1105_TAS_MAX_DELTA		BIT(19)
>  
> +#define work_to_sja1105_tas(d) \
> +	container_of((d), struct sja1105_tas_data, tas_work)
> +#define tas_to_sja1105(d) \
> +	container_of((d), struct sja1105_private, tas_data)
> +
>  /* This is not a preprocessor macro because the "ns" argument may or may not be
>   * s64 at caller side. This ensures it is properly type-cast before div_s64.
>   */
> @@ -18,6 +23,102 @@ static s64 ns_to_sja1105_delta(s64 ns)
>  	return div_s64(ns, 200);
>  }
>  
> +static s64 sja1105_delta_to_ns(s64 delta)
> +{
> +	return delta * 200;
> +}
> +
> +/* Calculate the first base_time in the future that satisfies this
> + * relationship:
> + *
> + * future_base_time = base_time + N x cycle_time >= now, or
> + *
> + *      now - base_time
> + * N >= ---------------
> + *         cycle_time
> + *
> + * Because N is an integer, the ceiling value of the above "a / b" ratio
> + * is in fact precisely the floor value of "(a + b - 1) / b", which is
> + * easier to calculate only having integer division tools.
> + */
> +static s64 future_base_time(s64 base_time, s64 cycle_time, s64 now)
> +{
> +	s64 a, b, n;
> +
> +	if (base_time >= now)
> +		return base_time;
> +
> +	a = now - base_time;
> +	b = cycle_time;
> +	n = div_s64(a + b - 1, b);
> +
> +	return base_time + n * cycle_time;
> +}
> +
> +static int sja1105_tas_set_runtime_params(struct sja1105_private *priv)
> +{
> +	struct sja1105_tas_data *tas_data = &priv->tas_data;
> +	s64 earliest_base_time = S64_MAX;
> +	s64 latest_base_time = 0;
> +	s64 its_cycle_time = 0;
> +	s64 max_cycle_time = 0;
> +	int port;
> +
> +	tas_data->enabled = false;
> +
> +	for (port = 0; port < SJA1105_NUM_PORTS; port++) {
> +		const struct tc_taprio_qopt_offload *tas_config;
> +
> +		tas_config = tas_data->config[port];
> +		if (!tas_config)
> +			continue;
> +
> +		tas_data->enabled = true;
> +
> +		if (max_cycle_time < tas_config->cycle_time)
> +			max_cycle_time = tas_config->cycle_time;
> +		if (latest_base_time < tas_config->base_time)
> +			latest_base_time = tas_config->base_time;
> +		if (earliest_base_time > tas_config->base_time) {
> +			earliest_base_time = tas_config->base_time;
> +			its_cycle_time = tas_config->cycle_time;
> +		}
> +	}
> +
> +	if (!tas_data->enabled)
> +		return 0;
> +
> +	/* Roll the earliest base time over until it is in a comparable
> +	 * time base with the latest, then compare their deltas.
> +	 * We want to enforce that all ports' base times are within
> +	 * SJA1105_TAS_MAX_DELTA 200ns cycles of one another.
> +	 */
> +	earliest_base_time = future_base_time(earliest_base_time,
> +					      its_cycle_time,
> +					      latest_base_time);
> +	while (earliest_base_time > latest_base_time)
> +		earliest_base_time -= its_cycle_time;
> +	if (latest_base_time - earliest_base_time >
> +	    sja1105_delta_to_ns(SJA1105_TAS_MAX_DELTA)) {
> +		dev_err(priv->ds->dev,
> +			"Base times too far apart: min %llu max %llu\n",
> +			earliest_base_time, latest_base_time);
> +		return -ERANGE;
> +	}
> +
> +	tas_data->earliest_base_time = earliest_base_time;
> +	tas_data->max_cycle_time = max_cycle_time;
> +
> +	dev_dbg(priv->ds->dev, "earliest base time %lld ns\n",
> +		tas_data->earliest_base_time);
> +	dev_dbg(priv->ds->dev, "latest base time %lld ns\n",
> +		tas_data->earliest_base_time);
> +	dev_dbg(priv->ds->dev, "longest cycle time %lld ns\n",
> +		tas_data->max_cycle_time);
> +
> +	return 0;
> +}
> +
>  /* Lo and behold: the egress scheduler from hell.
>   *
>   * At the hardware level, the Time-Aware Shaper holds a global linear arrray of
> @@ -100,7 +201,11 @@ static int sja1105_init_scheduling(struct sja1105_private *priv)
>  	int num_cycles = 0;
>  	int cycle = 0;
>  	int i, k = 0;
> -	int port;
> +	int port, rc;
> +
> +	rc = sja1105_tas_set_runtime_params(priv);
> +	if (rc < 0)
> +		return rc;
>  
>  	/* Discard previous Schedule Table */
>  	table = &priv->static_config.tables[BLK_IDX_SCHEDULE];
> @@ -181,11 +286,13 @@ static int sja1105_init_scheduling(struct sja1105_private *priv)
>  	schedule_entry_points = table->entries;
>  
>  	/* Finally start populating the static config tables */
> -	schedule_entry_points_params->clksrc = SJA1105_TAS_CLKSRC_STANDALONE;
> +	schedule_entry_points_params->clksrc = SJA1105_TAS_CLKSRC_PTP;
>  	schedule_entry_points_params->actsubsch = num_cycles - 1;
>  
>  	for (port = 0; port < SJA1105_NUM_PORTS; port++) {
>  		const struct tc_taprio_qopt_offload *tas_config;
> +		/* Relative base time */
> +		s64 rbt;
>  
>  		tas_config = tas_data->config[port];
>  		if (!tas_config)
> @@ -193,13 +300,20 @@ static int sja1105_init_scheduling(struct sja1105_private *priv)
>  
>  		schedule_start_idx = k;
>  		schedule_end_idx = k + tas_config->num_entries - 1;
> -		/* TODO this is only a relative base time for the subschedule
> -		 * (relative to PTPSCHTM). But as we're using standalone and
> -		 * not PTP clock as time reference, leave it like this for now.
> -		 * Later we'll have to enforce that all ports' base times are
> -		 * within SJA1105_TAS_MAX_DELTA 200ns cycles of one another.
> +		/* This is only a relative base time for the subschedule
> +		 * (relative to PTPSCHTM - aka the operational base time).
>  		 */
> -		entry_point_delta = ns_to_sja1105_delta(tas_config->base_time);
> +		rbt = future_base_time(tas_config->base_time,
> +				       tas_config->cycle_time,
> +				       tas_data->earliest_base_time);
> +		rbt -= tas_data->earliest_base_time;
> +		/* UM10944.pdf 4.2.2. Schedule Entry Points table says that
> +		 * delta cannot be zero, which is shitty. Advance all relative
> +		 * base times by 1 TAS delta, so that even the earliest base
> +		 * time becomes 1 in relative terms. Then start the operational
> +		 * base time (PTPSCHTM) one TAS delta earlier than planned.
> +		 */
> +		entry_point_delta = ns_to_sja1105_delta(rbt) + 1;
>  
>  		schedule_entry_points[cycle].subschindx = cycle;
>  		schedule_entry_points[cycle].delta = entry_point_delta;
> @@ -405,8 +519,302 @@ int sja1105_setup_tc_taprio(struct dsa_switch *ds, int port,
>  	return sja1105_static_config_reload(priv);
>  }
>  
> +static int sja1105_tas_check_running(struct sja1105_private *priv)
> +{
> +	struct sja1105_tas_data *tas_data = &priv->tas_data;
> +	struct sja1105_ptp_cmd cmd = {0};
> +	int rc;
> +
> +	rc = sja1105_ptp_commit(priv, &cmd, SPI_READ);
> +	if (rc < 0)
> +		return rc;
> +
> +	if (cmd.ptpstrtsch == 1)
> +		/* Schedule successfully started */
> +		tas_data->state = SJA1105_TAS_STATE_RUNNING;
> +	else if (cmd.ptpstopsch == 1)
> +		/* Schedule is stopped */
> +		tas_data->state = SJA1105_TAS_STATE_DISABLED;
> +	else
> +		/* Schedule is probably not configured with PTP clock source */
> +		rc = -EINVAL;
> +
> +	return rc;
> +}
> +
> +/* Write to PTPCLKCORP */
> +static int sja1105_tas_adjust_drift(struct sja1105_private *priv,
> +				    u64 correction)
> +{
> +	const struct sja1105_regs *regs = priv->info->regs;
> +	u64 ptpclkcorp = ns_to_sja1105_ticks(correction);
> +
> +	return sja1105_spi_send_int(priv, SPI_WRITE, regs->ptpclkcorp,
> +				    &ptpclkcorp, 4, NULL);
> +}
> +
> +/* Write to PTPSCHTM */
> +static int sja1105_tas_set_base_time(struct sja1105_private *priv,
> +				     u64 base_time)
> +{
> +	const struct sja1105_regs *regs = priv->info->regs;
> +	u64 ptpschtm = ns_to_sja1105_ticks(base_time);
> +
> +	return sja1105_spi_send_int(priv, SPI_WRITE, regs->ptpschtm,
> +				    &ptpschtm, 8, NULL);
> +}
> +
> +static int sja1105_tas_start(struct sja1105_private *priv)
> +{
> +	struct sja1105_tas_data *tas_data = &priv->tas_data;
> +	struct sja1105_ptp_cmd *cmd = &priv->ptp_data.cmd;
> +	int rc;
> +
> +	dev_dbg(priv->ds->dev, "Starting the TAS\n");
> +
> +	if (tas_data->state == SJA1105_TAS_STATE_ENABLED_NOT_RUNNING ||
> +	    tas_data->state == SJA1105_TAS_STATE_RUNNING) {
> +		dev_err(priv->ds->dev, "TAS already started\n");
> +		return -EINVAL;
> +	}
> +
> +	cmd->ptpstrtsch = 1;
> +	cmd->ptpstopsch = 0;
> +
> +	rc = sja1105_ptp_commit(priv, cmd, SPI_WRITE);
> +	if (rc < 0)
> +		return rc;
> +
> +	tas_data->state = SJA1105_TAS_STATE_ENABLED_NOT_RUNNING;
> +
> +	return 0;
> +}
> +
> +static int sja1105_tas_stop(struct sja1105_private *priv)
> +{
> +	struct sja1105_tas_data *tas_data = &priv->tas_data;
> +	struct sja1105_ptp_cmd *cmd = &priv->ptp_data.cmd;
> +	int rc;
> +
> +	dev_dbg(priv->ds->dev, "Stopping the TAS\n");
> +
> +	if (tas_data->state == SJA1105_TAS_STATE_DISABLED) {
> +		dev_err(priv->ds->dev, "TAS already disabled\n");
> +		return -EINVAL;
> +	}
> +
> +	cmd->ptpstopsch = 1;
> +	cmd->ptpstrtsch = 0;
> +
> +	rc = sja1105_ptp_commit(priv, cmd, SPI_WRITE);
> +	if (rc < 0)
> +		return rc;
> +
> +	tas_data->state = SJA1105_TAS_STATE_DISABLED;
> +
> +	return 0;
> +}
> +
> +/* The schedule engine and the PTP clock are driven by the same oscillator, and
> + * they run in parallel. But whilst the PTP clock can keep an absolute
> + * time-of-day, the schedule engine is only running in 'ticks' (25 ticks make
> + * up a delta, which is 200ns), and wrapping around at the end of each cycle.
> + * The schedule engine is started when the PTP clock reaches the PTPSCHTM time
> + * (in PTP domain).
> + * Because the PTP clock can be rate-corrected (accelerated or slowed down) by
> + * a software servo, and the schedule engine clock runs in parallel to the PTP
> + * clock, there is logic internal to the switch that periodically keeps the
> + * schedule engine from drifting away. The frequency with which this internal
> + * syntonization happens is the PTP clock correction period (PTPCLKCORP). It is
> + * a value also in the PTP clock domain, and is also rate-corrected.
> + * To be precise, during a correction period, there is logic to determine by
> + * how many scheduler clock ticks has the PTP clock drifted. At the end of each
> + * correction period/beginning of new one, the length of a delta is shrunk or
> + * expanded with an integer number of ticks, compared with the typical 25.
> + * So a delta lasts for 200ns (or 25 ticks) only on average.
> + * Sometimes it is longer, sometimes it is shorter. The internal syntonization
> + * logic can adjust for at most 5 ticks each 20 ticks.
> + *
> + * The first implication is that you should choose your schedule correction
> + * period to be an integer multiple of the schedule length. Preferably one.
> + * In case there are schedules of multiple ports active, then the correction
> + * period needs to be a multiple of them all. Given the restriction that the
> + * cycle times have to be multiples of one another anyway, this means the
> + * correction period can simply be the largest cycle time, hence the current
> + * choice. This way, the updates are always synchronous to the transmission
> + * cycle, and therefore predictable.
> + *
> + * The second implication is that at the beginning of a correction period, the
> + * first few deltas will be modulated in time, until the schedule engine is
> + * properly phase-aligned with the PTP clock. For this reason, you should place
> + * your best-effort traffic at the beginning of a cycle, and your
> + * time-triggered traffic afterwards.
> + *
> + * The third implication is that once the schedule engine is started, it can
> + * only adjust for so much drift within a correction period. In the servo you
> + * can only change the PTPCLKRATE, but not step the clock (PTPCLKADD). If you
> + * want to do the latter, you need to stop and restart the schedule engine,
> + * which is what the state machine handles.
> + */
> +static void sja1105_tas_state_machine(struct work_struct *work)
> +{
> +	struct sja1105_tas_data *tas_data = work_to_sja1105_tas(work);
> +	struct sja1105_private *priv = tas_to_sja1105(tas_data);
> +	struct sja1105_ptp_data *ptp_data = &priv->ptp_data;
> +	struct timespec64 base_time_ts, now_ts;
> +	struct dsa_switch *ds = priv->ds;
> +	struct timespec64 diff;
> +	s64 base_time, now;
> +	int rc = 0;
> +
> +	mutex_lock(&ptp_data->lock);
> +
> +	switch (tas_data->state) {
> +	case SJA1105_TAS_STATE_DISABLED:
> +
> +		dev_dbg(ds->dev, "TAS state: disabled\n");
> +		/* Can't do anything at all if clock is still being stepped */
> +		if (tas_data->last_op != SJA1105_PTP_ADJUSTFREQ)
> +			break;
> +
> +		rc = sja1105_tas_adjust_drift(priv, tas_data->max_cycle_time);
> +		if (rc < 0)
> +			break;
> +
> +		now = __sja1105_ptp_gettimex(priv, NULL);
> +
> +		/* Plan to start the earliest schedule first. The others
> +		 * will be started in hardware, by way of their respective
> +		 * entry points delta.
> +		 * Try our best to avoid fringe cases (race condition between
> +		 * ptpschtm and ptpstrtsch) by pushing the oper_base_time at
> +		 * least one second in the future from now. This is not ideal,
> +		 * but this only needs to buy us time until the
> +		 * sja1105_tas_start command below gets executed.
> +		 */
> +		base_time = future_base_time(tas_data->earliest_base_time,
> +					     tas_data->max_cycle_time,
> +					     now + 1ull * NSEC_PER_SEC);
> +		base_time -= sja1105_delta_to_ns(1);
> +
> +		rc = sja1105_tas_set_base_time(priv, base_time);
> +		if (rc < 0)
> +			break;
> +
> +		tas_data->oper_base_time = base_time;
> +
> +		rc = sja1105_tas_start(priv);
> +		if (rc < 0)
> +			break;
> +
> +		base_time_ts = ns_to_timespec64(base_time);
> +		now_ts = ns_to_timespec64(now);
> +
> +		dev_dbg(ds->dev, "OPER base time %lld.%09ld (now %lld.%09ld)\n",
> +			base_time_ts.tv_sec, base_time_ts.tv_nsec,
> +			now_ts.tv_sec, now_ts.tv_nsec);
> +
> +		break;
> +
> +	case SJA1105_TAS_STATE_ENABLED_NOT_RUNNING:
> +		/* Check if TAS has actually started, by comparing the
> +		 * scheduled start time with the SJA1105 PTP clock
> +		 */
> +		dev_dbg(ds->dev, "TAS state: enabled but not running\n");
> +
> +		/* Clock was stepped.. bad news for TAS */
> +		if (tas_data->last_op != SJA1105_PTP_ADJUSTFREQ) {
> +			sja1105_tas_stop(priv);
> +			break;
> +		}
> +
> +		now = __sja1105_ptp_gettimex(priv, NULL);
> +
> +		if (now < tas_data->oper_base_time) {
> +			/* TAS has not started yet */
> +			diff = ns_to_timespec64(tas_data->oper_base_time - now);
> +			dev_dbg(ds->dev, "time to start: [%lld.%09ld]",
> +				diff.tv_sec, diff.tv_nsec);
> +			break;
> +		}
> +
> +		/* Time elapsed, what happened? */
> +		rc = sja1105_tas_check_running(priv);
> +		if (rc < 0)
> +			break;
> +
> +		if (tas_data->state == SJA1105_TAS_STATE_RUNNING)
> +			/* TAS has started */
> +			dev_dbg(ds->dev, "TAS state: transitioned to running\n");
> +		else
> +			dev_err(ds->dev, "TAS state: not started despite time elapsed\n");
> +
> +		break;
> +
> +	case SJA1105_TAS_STATE_RUNNING:
> +		dev_dbg(ds->dev, "TAS state: running\n");
> +
> +		/* Clock was stepped.. bad news for TAS */
> +		if (tas_data->last_op != SJA1105_PTP_ADJUSTFREQ) {
> +			sja1105_tas_stop(priv);
> +			break;
> +		}
> +
> +		rc = sja1105_tas_check_running(priv);
> +		if (rc < 0)
> +			break;
> +
> +		if (tas_data->state != SJA1105_TAS_STATE_RUNNING) {
> +			dev_err(ds->dev, "TAS surprisingly stopped\n");
> +			break;
> +		}
> +
> +		now = __sja1105_ptp_gettimex(priv, NULL);
> +
> +		diff = ns_to_timespec64(now - tas_data->oper_base_time);
> +
> +		dev_dbg(ds->dev, "Time since TAS started: [%lld.%09ld]\n",
> +			diff.tv_sec, diff.tv_nsec);
> +		break;

I got the feeling that some of the debug statements are more leftovers
from development that things that could help debug issues.

> +
> +	default:
> +		if (net_ratelimit())
> +			dev_err(ds->dev, "TAS in an invalid state (incorrect use of API)!\n");
> +	}
> +
> +	if (rc && net_ratelimit())
> +		dev_err(ds->dev, "An operation returned %d\n", rc);
> +
> +	mutex_unlock(&ptp_data->lock);
> +}
> +
> +void sja1105_tas_clockstep(struct sja1105_private *priv)
> +{
> +	struct sja1105_tas_data *tas_data = &priv->tas_data;
> +
> +	if (!tas_data->enabled)
> +		return;
> +
> +	tas_data->last_op = SJA1105_PTP_CLOCKSTEP;
> +	schedule_work(&tas_data->tas_work);
> +}
> +
> +void sja1105_tas_adjfreq(struct sja1105_private *priv)
> +{
> +	struct sja1105_tas_data *tas_data = &priv->tas_data;
> +
> +	if (!tas_data->enabled)
> +		return;
> +
> +	tas_data->last_op = SJA1105_PTP_ADJUSTFREQ;
> +	schedule_work(&tas_data->tas_work);
> +}
> +
>  void sja1105_tas_setup(struct sja1105_private *priv)
>  {
> +	INIT_WORK(&priv->tas_data.tas_work, sja1105_tas_state_machine);
> +	priv->tas_data.state = SJA1105_TAS_STATE_DISABLED;
> +	priv->tas_data.last_op = SJA1105_PTP_NONE;
>  }
>  
>  void sja1105_tas_teardown(struct sja1105_private *priv)
> @@ -414,6 +822,8 @@ void sja1105_tas_teardown(struct sja1105_private *priv)
>  	struct sja1105_tas_data *tas_data = &priv->tas_data;
>  	int port;
>  
> +	cancel_work_sync(&tas_data->tas_work);
> +

I think you should set 'tas_data->enabled' to false somewhere around
here: wondering if it's possible for a PTP function (via a ioctl() or
something) to call sja1105_tas_clockstep() at the very wrong time, and
re-start the workqueue.

>  	for (port = 0; port < SJA1105_NUM_PORTS; port++)
>  		if (tas_data->config[port])
>  			taprio_free(tas_data->config[port]);
> diff --git a/drivers/net/dsa/sja1105/sja1105_tas.h b/drivers/net/dsa/sja1105/sja1105_tas.h
> index 0ef82810d9d7..ecc95624e3f6 100644
> --- a/drivers/net/dsa/sja1105/sja1105_tas.h
> +++ b/drivers/net/dsa/sja1105/sja1105_tas.h
> @@ -8,8 +8,27 @@
>  
>  #if IS_ENABLED(CONFIG_NET_DSA_SJA1105_TAS)
>  
> +enum sja1105_tas_state {
> +	SJA1105_TAS_STATE_DISABLED,
> +	SJA1105_TAS_STATE_ENABLED_NOT_RUNNING,
> +	SJA1105_TAS_STATE_RUNNING,
> +};
> +
> +enum sja1105_ptp_op {
> +	SJA1105_PTP_NONE,
> +	SJA1105_PTP_CLOCKSTEP,
> +	SJA1105_PTP_ADJUSTFREQ,
> +};
> +
>  struct sja1105_tas_data {
>  	struct tc_taprio_qopt_offload *config[SJA1105_NUM_PORTS];
> +	enum sja1105_tas_state state;
> +	enum sja1105_ptp_op last_op;
> +	struct work_struct tas_work;
> +	s64 earliest_base_time;
> +	s64 oper_base_time;
> +	u64 max_cycle_time;
> +	bool enabled;
>  };
>  
>  int sja1105_setup_tc_taprio(struct dsa_switch *ds, int port,
> @@ -19,6 +38,10 @@ void sja1105_tas_setup(struct sja1105_private *priv);
>  
>  void sja1105_tas_teardown(struct sja1105_private *priv);
>  
> +void sja1105_tas_clockstep(struct sja1105_private *priv);
> +
> +void sja1105_tas_adjfreq(struct sja1105_private *priv);
> +
>  #else
>  
>  /* C doesn't allow empty structures, bah! */
> @@ -37,6 +60,10 @@ static inline void sja1105_tas_setup(struct sja1105_private *priv) { }
>  
>  static inline void sja1105_tas_teardown(struct sja1105_private *priv) { }
>  
> +static inline void sja1105_tas_clockstep(struct sja1105_private *priv) { }
> +
> +static inline void sja1105_tas_adjfreq(struct sja1105_private *priv) { }
> +
>  #endif /* IS_ENABLED(CONFIG_NET_DSA_SJA1105_TAS) */
>  
>  #endif /* _SJA1105_TAS_H */
> -- 
> 2.17.1

^ permalink raw reply

* [bpf-next,v3] samples: bpf: add max_pckt_size option at xdp_adjust_tail
From: Daniel T. Lee @ 2019-09-11 19:02 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov; +Cc: netdev, bpf

Currently, at xdp_adjust_tail_kern.c, MAX_PCKT_SIZE is limited
to 600. To make this size flexible, a new map 'pcktsz' is added.

By updating new packet size to this map from the userland,
xdp_adjust_tail_kern.o will use this value as a new max_pckt_size.

If no '-P <MAX_PCKT_SIZE>' option is used, the size of maximum packet
will be 600 as a default.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>

---
Changes in v2:
    - Change the helper to fetch map from 'bpf_map__next' to 
    'bpf_object__find_map_fd_by_name'.
 
 samples/bpf/xdp_adjust_tail_kern.c | 23 +++++++++++++++++++----
 samples/bpf/xdp_adjust_tail_user.c | 28 ++++++++++++++++++++++------
 2 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/samples/bpf/xdp_adjust_tail_kern.c b/samples/bpf/xdp_adjust_tail_kern.c
index 411fdb21f8bc..d6d84ffe6a7a 100644
--- a/samples/bpf/xdp_adjust_tail_kern.c
+++ b/samples/bpf/xdp_adjust_tail_kern.c
@@ -25,6 +25,13 @@
 #define ICMP_TOOBIG_SIZE 98
 #define ICMP_TOOBIG_PAYLOAD_SIZE 92
 
+struct bpf_map_def SEC("maps") pcktsz = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
 struct bpf_map_def SEC("maps") icmpcnt = {
 	.type = BPF_MAP_TYPE_ARRAY,
 	.key_size = sizeof(__u32),
@@ -64,7 +71,8 @@ static __always_inline void ipv4_csum(void *data_start, int data_size,
 	*csum = csum_fold_helper(*csum);
 }
 
-static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
+static __always_inline int send_icmp4_too_big(struct xdp_md *xdp,
+					      __u32 max_pckt_size)
 {
 	int headroom = (int)sizeof(struct iphdr) + (int)sizeof(struct icmphdr);
 
@@ -92,7 +100,7 @@ static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
 	orig_iph = data + off;
 	icmp_hdr->type = ICMP_DEST_UNREACH;
 	icmp_hdr->code = ICMP_FRAG_NEEDED;
-	icmp_hdr->un.frag.mtu = htons(MAX_PCKT_SIZE-sizeof(struct ethhdr));
+	icmp_hdr->un.frag.mtu = htons(max_pckt_size - sizeof(struct ethhdr));
 	icmp_hdr->checksum = 0;
 	ipv4_csum(icmp_hdr, ICMP_TOOBIG_PAYLOAD_SIZE, &csum);
 	icmp_hdr->checksum = csum;
@@ -118,14 +126,21 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
 {
 	void *data_end = (void *)(long)xdp->data_end;
 	void *data = (void *)(long)xdp->data;
+	__u32 max_pckt_size = MAX_PCKT_SIZE;
+	__u32 *pckt_sz;
+	__u32 key = 0;
 	int pckt_size = data_end - data;
 	int offset;
 
-	if (pckt_size > MAX_PCKT_SIZE) {
+	pckt_sz = bpf_map_lookup_elem(&pcktsz, &key);
+	if (pckt_sz && *pckt_sz)
+		max_pckt_size = *pckt_sz;
+
+	if (pckt_size > max_pckt_size) {
 		offset = pckt_size - ICMP_TOOBIG_SIZE;
 		if (bpf_xdp_adjust_tail(xdp, 0 - offset))
 			return XDP_PASS;
-		return send_icmp4_too_big(xdp);
+		return send_icmp4_too_big(xdp, max_pckt_size);
 	}
 	return XDP_PASS;
 }
diff --git a/samples/bpf/xdp_adjust_tail_user.c b/samples/bpf/xdp_adjust_tail_user.c
index a3596b617c4c..aef6c69a48a7 100644
--- a/samples/bpf/xdp_adjust_tail_user.c
+++ b/samples/bpf/xdp_adjust_tail_user.c
@@ -23,6 +23,7 @@
 #include "libbpf.h"
 
 #define STATS_INTERVAL_S 2U
+#define MAX_PCKT_SIZE 600
 
 static int ifindex = -1;
 static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
@@ -72,6 +73,7 @@ static void usage(const char *cmd)
 	printf("Usage: %s [...]\n", cmd);
 	printf("    -i <ifname|ifindex> Interface\n");
 	printf("    -T <stop-after-X-seconds> Default: 0 (forever)\n");
+	printf("    -P <MAX_PCKT_SIZE> Default: %u\n", MAX_PCKT_SIZE);
 	printf("    -S use skb-mode\n");
 	printf("    -N enforce native mode\n");
 	printf("    -F force loading prog\n");
@@ -85,13 +87,14 @@ int main(int argc, char **argv)
 		.prog_type	= BPF_PROG_TYPE_XDP,
 	};
 	unsigned char opt_flags[256] = {};
-	const char *optstr = "i:T:SNFh";
+	const char *optstr = "i:T:P:SNFh";
 	struct bpf_prog_info info = {};
 	__u32 info_len = sizeof(info);
+	__u32 max_pckt_size = 0;
+	__u32 key = 0;
 	unsigned int kill_after_s = 0;
 	int i, prog_fd, map_fd, opt;
 	struct bpf_object *obj;
-	struct bpf_map *map;
 	char filename[256];
 	int err;
 
@@ -110,6 +113,9 @@ int main(int argc, char **argv)
 		case 'T':
 			kill_after_s = atoi(optarg);
 			break;
+		case 'P':
+			max_pckt_size = atoi(optarg);
+			break;
 		case 'S':
 			xdp_flags |= XDP_FLAGS_SKB_MODE;
 			break;
@@ -150,12 +156,22 @@ int main(int argc, char **argv)
 	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
 		return 1;
 
-	map = bpf_map__next(NULL, obj);
-	if (!map) {
-		printf("finding a map in obj file failed\n");
+	/* update pcktsz map */
+	if (max_pckt_size) {
+		map_fd = bpf_object__find_map_fd_by_name(obj, "pcktsz");
+		if (!map_fd) {
+			printf("finding a pcktsz map in obj file failed\n");
+			return 1;
+		}
+		bpf_map_update_elem(map_fd, &key, &max_pckt_size, BPF_ANY);
+	}
+
+	/* fetch icmpcnt map */
+	map_fd = bpf_object__find_map_fd_by_name(obj, "icmpcnt");
+	if (!map_fd) {
+		printf("finding a icmpcnt map in obj file failed\n");
 		return 1;
 	}
-	map_fd = bpf_map__fd(map);
 
 	if (!prog_fd) {
 		printf("load_bpf_file: %s\n", strerror(errno));
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH] ftgmac100: Disable HW checksum generation on AST2500
From: Vijay Khemka @ 2019-09-11 18:50 UTC (permalink / raw)
  To: Florian Fainelli, David S. Miller, YueHaibing, Andrew Lunn,
	Kate Stewart, Mauro Carvalho Chehab, Luis Chamberlain,
	Thomas Gleixner, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
  Cc: openbmc @ lists . ozlabs . org, Sai Dasari,
	linux-aspeed@lists.ozlabs.org
In-Reply-To: <c9876340-c8d0-ba8b-2ae1-9900958f1834@gmail.com>



On 9/11/19, 11:34 AM, "Florian Fainelli" <f.fainelli@gmail.com> wrote:

    On 9/11/19 11:30 AM, Vijay Khemka wrote:
    > 
    > 
    > On 9/10/19, 4:08 PM, "Linux-aspeed on behalf of Vijay Khemka" <linux-aspeed-bounces+vijaykhemka=fb.com@lists.ozlabs.org on behalf of vijaykhemka@fb.com> wrote:
    > 
    >     
    >     
    >     On 9/10/19, 3:50 PM, "Linux-aspeed on behalf of Vijay Khemka" <linux-aspeed-bounces+vijaykhemka=fb.com@lists.ozlabs.org on behalf of vijaykhemka@fb.com> wrote:
    >     
    >         
    >         
    >         On 9/10/19, 3:05 PM, "Florian Fainelli" <f.fainelli@gmail.com> wrote:
    >         
    >             On 9/10/19 2:37 PM, Vijay Khemka wrote:
    >             > HW checksum generation is not working for AST2500, specially with IPV6
    >             > over NCSI. All TCP packets with IPv6 get dropped. By disabling this
    >             > it works perfectly fine with IPV6.
    >             > 
    >             > Verified with IPV6 enabled and can do ssh.
    >             
    >             How about IPv4, do these packets have problem? If not, can you continue
    >             advertising NETIF_F_IP_CSUM but take out NETIF_F_IPV6_CSUM?
    >         
    >         I changed code from (netdev->hw_features &= ~NETIF_F_HW_CSUM) to 
    >         (netdev->hw_features &= ~NETIF_F_ IPV6_CSUM). And it is not working. 
    >         Don't know why. IPV4 works without any change but IPv6 needs HW_CSUM
    >         Disabled.
    >     
    >     Now I changed to
    >     netdev->hw_features &= (~NETIF_F_HW_CSUM) | NETIF_F_IP_CSUM;
    >     And it works.
    > 
    > I investigated more on these features and found that we cannot set NETIF_F_IP_CSUM 
    > While NETIF_F_HW_CSUM is set. So I disabled NETIF_F_HW_CSUM first and enabled
    > NETIF_F_IP_CSUM in next statement. And it works fine.
    > 
    > But as per line 166 in include/linux/skbuff.h,  
    > *   NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM are being deprecated in favor of
    >  *   NETIF_F_HW_CSUM. New devices should use NETIF_F_HW_CSUM to indicate
    >  *   checksum offload capability.
    > 
    > Please suggest which of below 2 I should do. As both works for me.
    > 1. Disable completely NETIF_F_HW_CSUM and do nothing. This is original patch.
    > 2. Enable NETIF_F_IP_CSUM in addition to 1. I can have v2 if this is accepted.
    
    Sounds like 2 would leave the option of offloading IPv4 checksum
    offload, so that would be a better middle group than flat out disable
    checksum offload for both IPv4 and IPv6, no?
Sounds good, I will have v2 after lot more testing.
    -- 
    Florian
    


^ permalink raw reply

* [v2 3/3] samples: pktgen: allow to specify destination IP range (CIDR)
From: Daniel T. Lee @ 2019-09-11 18:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, David S . Miller; +Cc: netdev
In-Reply-To: <20190911184807.21770-1-danieltimlee@gmail.com>

Currently, kernel pktgen has the feature to specify destination
address range for sending packet. (e.g. pgset "dst_min/dst_max")

But on samples, each of the scripts doesn't have any option to achieve this.

This commit adds the feature to specify the destination address range with CIDR.

    -d : ($DEST_IP)   destination IP. CIDR (e.g. 198.18.0.0/15) is also allowed

    # ./pktgen_sample01_simple.sh -6 -d fe80::20/126 -p 3000 -n 4
    # tcpdump ip6 and udp
    05:14:18.082285 IP6 fe80::99.71 > fe80::23.3000: UDP, length 16
    05:14:18.082564 IP6 fe80::99.43 > fe80::23.3000: UDP, length 16
    05:14:18.083366 IP6 fe80::99.107 > fe80::22.3000: UDP, length 16
    05:14:18.083585 IP6 fe80::99.97 > fe80::21.3000: UDP, length 16

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
---
 samples/pktgen/README.rst                             |  2 +-
 samples/pktgen/parameters.sh                          |  2 +-
 .../pktgen/pktgen_bench_xmit_mode_netif_receive.sh    |  4 +++-
 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh   |  4 +++-
 samples/pktgen/pktgen_sample01_simple.sh              |  4 +++-
 samples/pktgen/pktgen_sample02_multiqueue.sh          |  4 +++-
 samples/pktgen/pktgen_sample03_burst_single_flow.sh   |  4 +++-
 samples/pktgen/pktgen_sample04_many_flows.sh          | 11 ++++++++---
 samples/pktgen/pktgen_sample05_flow_per_thread.sh     |  4 +++-
 .../pktgen_sample06_numa_awared_queue_irq_affinity.sh |  4 +++-
 10 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/samples/pktgen/README.rst b/samples/pktgen/README.rst
index fd39215db508..3f6483e8b2df 100644
--- a/samples/pktgen/README.rst
+++ b/samples/pktgen/README.rst
@@ -18,7 +18,7 @@ across the sample scripts.  Usage example is printed on errors::
  Usage: ./pktgen_sample01_simple.sh [-vx] -i ethX
   -i : ($DEV)       output interface/device (required)
   -s : ($PKT_SIZE)  packet size
-  -d : ($DEST_IP)   destination IP
+  -d : ($DEST_IP)   destination IP. CIDR (e.g. 198.18.0.0/15) is also allowed
   -m : ($DST_MAC)   destination MAC-addr
   -p : ($DST_PORT)  destination PORT range (e.g. 433-444) is also allowed
   -t : ($THREADS)   threads to start
diff --git a/samples/pktgen/parameters.sh b/samples/pktgen/parameters.sh
index a06b00a0c7b6..ff0ed474fee9 100644
--- a/samples/pktgen/parameters.sh
+++ b/samples/pktgen/parameters.sh
@@ -8,7 +8,7 @@ function usage() {
     echo "Usage: $0 [-vx] -i ethX"
     echo "  -i : (\$DEV)       output interface/device (required)"
     echo "  -s : (\$PKT_SIZE)  packet size"
-    echo "  -d : (\$DEST_IP)   destination IP"
+    echo "  -d : (\$DEST_IP)   destination IP. CIDR (e.g. 198.18.0.0/15) is also allowed"
     echo "  -m : (\$DST_MAC)   destination MAC-addr"
     echo "  -p : (\$DST_PORT)  destination PORT range (e.g. 433-444) is also allowed"
     echo "  -t : (\$THREADS)   threads to start"
diff --git a/samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh b/samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh
index 9b74502c58f7..da6cb711b7f4 100755
--- a/samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh
+++ b/samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh
@@ -41,6 +41,7 @@ fi
 [ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
 [ -z "$BURST" ] && BURST=1024
 [ -z "$COUNT" ] && COUNT="10000000" # Zero means indefinitely
+[ -n "$DEST_IP" ] && read -r DST_MIN DST_MAX <<< $(parse_addr${IP6} $DEST_IP)
 if [ -n "$DST_PORT" ]; then
     read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
     validate_ports $UDP_DST_MIN $UDP_DST_MAX
@@ -71,7 +72,8 @@ for ((thread = $F_THREAD; thread <= $L_THREAD; thread++)); do
 
     # Destination
     pg_set $dev "dst_mac $DST_MAC"
-    pg_set $dev "dst$IP6 $DEST_IP"
+    pg_set $dev "dst${IP6}_min $DST_MIN"
+    pg_set $dev "dst${IP6}_max $DST_MAX"
 
     if [ -n "$DST_PORT" ]; then
 	# Single destination port or random port range
diff --git a/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh b/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh
index 0f332555b40d..355937787364 100755
--- a/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh
+++ b/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh
@@ -24,6 +24,7 @@ if [[ -n "$BURST" ]]; then
     err 1 "Bursting not supported for this mode"
 fi
 [ -z "$COUNT" ] && COUNT="10000000" # Zero means indefinitely
+[ -n "$DEST_IP" ] && read -r DST_MIN DST_MAX <<< $(parse_addr${IP6} $DEST_IP)
 if [ -n "$DST_PORT" ]; then
     read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
     validate_ports $UDP_DST_MIN $UDP_DST_MAX
@@ -54,7 +55,8 @@ for ((thread = $F_THREAD; thread <= $L_THREAD; thread++)); do
 
     # Destination
     pg_set $dev "dst_mac $DST_MAC"
-    pg_set $dev "dst$IP6 $DEST_IP"
+    pg_set $dev "dst${IP6}_min $DST_MIN"
+    pg_set $dev "dst${IP6}_max $DST_MAX"
 
     if [ -n "$DST_PORT" ]; then
 	# Single destination port or random port range
diff --git a/samples/pktgen/pktgen_sample01_simple.sh b/samples/pktgen/pktgen_sample01_simple.sh
index 063ec0998906..08995fa70025 100755
--- a/samples/pktgen/pktgen_sample01_simple.sh
+++ b/samples/pktgen/pktgen_sample01_simple.sh
@@ -22,6 +22,7 @@ fi
 # Example enforce param "-m" for dst_mac
 [ -z "$DST_MAC" ] && usage && err 2 "Must specify -m dst_mac"
 [ -z "$COUNT" ]   && COUNT="100000" # Zero means indefinitely
+[ -n "$DEST_IP" ] && read -r DST_MIN DST_MAX <<< $(parse_addr${IP6} $DEST_IP)
 if [ -n "$DST_PORT" ]; then
     read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
     validate_ports $UDP_DST_MIN $UDP_DST_MAX
@@ -61,7 +62,8 @@ pg_set $DEV "flag NO_TIMESTAMP"
 
 # Destination
 pg_set $DEV "dst_mac $DST_MAC"
-pg_set $DEV "dst$IP6 $DEST_IP"
+pg_set $DEV "dst${IP6}_min $DST_MIN"
+pg_set $DEV "dst${IP6}_max $DST_MAX"
 
 if [ -n "$DST_PORT" ]; then
     # Single destination port or random port range
diff --git a/samples/pktgen/pktgen_sample02_multiqueue.sh b/samples/pktgen/pktgen_sample02_multiqueue.sh
index a4726fb50197..9b806e41c23a 100755
--- a/samples/pktgen/pktgen_sample02_multiqueue.sh
+++ b/samples/pktgen/pktgen_sample02_multiqueue.sh
@@ -29,6 +29,7 @@ if [ -z "$DEST_IP" ]; then
     [ -z "$IP6" ] && DEST_IP="198.18.0.42" || DEST_IP="FD00::1"
 fi
 [ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
+[ -n "$DEST_IP" ] && read -r DST_MIN DST_MAX <<< $(parse_addr${IP6} $DEST_IP)
 if [ -n "$DST_PORT" ]; then
     read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
     validate_ports $UDP_DST_MIN $UDP_DST_MAX
@@ -62,7 +63,8 @@ for ((thread = $F_THREAD; thread <= $L_THREAD; thread++)); do
 
     # Destination
     pg_set $dev "dst_mac $DST_MAC"
-    pg_set $dev "dst$IP6 $DEST_IP"
+    pg_set $dev "dst${IP6}_min $DST_MIN"
+    pg_set $dev "dst${IP6}_max $DST_MAX"
 
     if [ -n "$DST_PORT" ]; then
 	# Single destination port or random port range
diff --git a/samples/pktgen/pktgen_sample03_burst_single_flow.sh b/samples/pktgen/pktgen_sample03_burst_single_flow.sh
index dfea91a09ccc..cb067788ceb3 100755
--- a/samples/pktgen/pktgen_sample03_burst_single_flow.sh
+++ b/samples/pktgen/pktgen_sample03_burst_single_flow.sh
@@ -33,6 +33,7 @@ fi
 [ -z "$BURST" ]     && BURST=32
 [ -z "$CLONE_SKB" ] && CLONE_SKB="0" # No need for clones when bursting
 [ -z "$COUNT" ]     && COUNT="0" # Zero means indefinitely
+[ -n "$DEST_IP" ]   && read -r DST_MIN DST_MAX <<< $(parse_addr${IP6} $DEST_IP)
 if [ -n "$DST_PORT" ]; then
     read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
     validate_ports $UDP_DST_MIN $UDP_DST_MAX
@@ -62,7 +63,8 @@ for ((thread = $F_THREAD; thread <= $L_THREAD; thread++)); do
 
     # Destination
     pg_set $dev "dst_mac $DST_MAC"
-    pg_set $dev "dst$IP6 $DEST_IP"
+    pg_set $dev "dst${IP6}_min $DST_MIN"
+    pg_set $dev "dst${IP6}_max $DST_MAX"
 
     if [ -n "$DST_PORT" ]; then
 	# Single destination port or random port range
diff --git a/samples/pktgen/pktgen_sample04_many_flows.sh b/samples/pktgen/pktgen_sample04_many_flows.sh
index 7ea9b4a3acf6..626e33016869 100755
--- a/samples/pktgen/pktgen_sample04_many_flows.sh
+++ b/samples/pktgen/pktgen_sample04_many_flows.sh
@@ -17,6 +17,7 @@ source ${basedir}/parameters.sh
 [ -z "$DST_MAC" ]   && DST_MAC="90:e2:ba:ff:ff:ff"
 [ -z "$CLONE_SKB" ] && CLONE_SKB="0"
 [ -z "$COUNT" ]     && COUNT="0" # Zero means indefinitely
+[ -n "$DEST_IP" ]   && read -r DST_MIN DST_MAX <<< $(parse_addr $DEST_IP)
 if [ -n "$DST_PORT" ]; then
     read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
     validate_ports $UDP_DST_MIN $UDP_DST_MAX
@@ -37,6 +38,9 @@ if [[ -n "$BURST" ]]; then
     err 1 "Bursting not supported for this mode"
 fi
 
+# 198.18.0.0 / 198.19.255.255
+read -r SRC_MIN SRC_MAX <<< $(parse_addr 198.18.0.0/15)
+
 # General cleanup everything since last run
 pg_ctrl "reset"
 
@@ -58,7 +62,8 @@ for ((thread = $F_THREAD; thread <= $L_THREAD; thread++)); do
 
     # Single destination
     pg_set $dev "dst_mac $DST_MAC"
-    pg_set $dev "dst $DEST_IP"
+    pg_set $dev "dst_min $DST_MIN"
+    pg_set $dev "dst_max $DST_MAX"
 
     if [ -n "$DST_PORT" ]; then
 	# Single destination port or random port range
@@ -69,8 +74,8 @@ for ((thread = $F_THREAD; thread <= $L_THREAD; thread++)); do
 
     # Randomize source IP-addresses
     pg_set $dev "flag IPSRC_RND"
-    pg_set $dev "src_min 198.18.0.0"
-    pg_set $dev "src_max 198.19.255.255"
+    pg_set $dev "src_min $SRC_MIN"
+    pg_set $dev "src_max $SRC_MAX"
 
     # Limit number of flows (max 65535)
     pg_set $dev "flows $FLOWS"
diff --git a/samples/pktgen/pktgen_sample05_flow_per_thread.sh b/samples/pktgen/pktgen_sample05_flow_per_thread.sh
index fbfafe029e11..cb79de073e9d 100755
--- a/samples/pktgen/pktgen_sample05_flow_per_thread.sh
+++ b/samples/pktgen/pktgen_sample05_flow_per_thread.sh
@@ -22,6 +22,7 @@ source ${basedir}/parameters.sh
 [ -z "$CLONE_SKB" ] && CLONE_SKB="0"
 [ -z "$BURST" ]     && BURST=32
 [ -z "$COUNT" ]     && COUNT="0" # Zero means indefinitely
+[ -n "$DEST_IP" ]   && read -r DST_MIN DST_MAX <<< $(parse_addr $DEST_IP)
 if [ -n "$DST_PORT" ]; then
     read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
     validate_ports $UDP_DST_MIN $UDP_DST_MAX
@@ -51,7 +52,8 @@ for ((thread = $F_THREAD; thread <= $L_THREAD; thread++)); do
 
     # Single destination
     pg_set $dev "dst_mac $DST_MAC"
-    pg_set $dev "dst $DEST_IP"
+    pg_set $dev "dst_min $DST_MIN"
+    pg_set $dev "dst_max $DST_MAX"
 
     if [ -n "$DST_PORT" ]; then
 	# Single destination port or random port range
diff --git a/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
index 755e662183f1..739adcda5b5f 100755
--- a/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
+++ b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
@@ -35,6 +35,7 @@ if [ -z "$DEST_IP" ]; then
     [ -z "$IP6" ] && DEST_IP="198.18.0.42" || DEST_IP="FD00::1"
 fi
 [ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
+[ -n "$DEST_IP" ] && read -r DST_MIN DST_MAX <<< $(parse_addr${IP6} $DEST_IP)
 if [ -n "$DST_PORT" ]; then
     read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
     validate_ports $UDP_DST_MIN $UDP_DST_MAX
@@ -79,7 +80,8 @@ for ((i = 0; i < $THREADS; i++)); do
 
     # Destination
     pg_set $dev "dst_mac $DST_MAC"
-    pg_set $dev "dst$IP6 $DEST_IP"
+    pg_set $dev "dst${IP6}_min $DST_MIN"
+    pg_set $dev "dst${IP6}_max $DST_MAX"
 
     if [ -n "$DST_PORT" ]; then
 	# Single destination port or random port range
-- 
2.20.1


^ permalink raw reply related

* [v2 2/3] samples: pktgen: add helper functions for IP(v4/v6) CIDR parsing
From: Daniel T. Lee @ 2019-09-11 18:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, David S . Miller; +Cc: netdev
In-Reply-To: <20190911184807.21770-1-danieltimlee@gmail.com>

This commit adds CIDR parsing and IP validate helper function to parse
single IP or range of IP with CIDR. (e.g. 198.18.0.0/15)

Helpers will be used in prior to set target address in samples/pktgen.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
---
 samples/pktgen/functions.sh | 122 ++++++++++++++++++++++++++++++++++++
 1 file changed, 122 insertions(+)

diff --git a/samples/pktgen/functions.sh b/samples/pktgen/functions.sh
index 4af4046d71be..8be5a6b6c097 100644
--- a/samples/pktgen/functions.sh
+++ b/samples/pktgen/functions.sh
@@ -163,6 +163,128 @@ function get_node_cpus()
 	echo $node_cpu_list
 }
 
+# Extend shrunken IPv6 address.
+# fe80::42:bcff:fe84:e10a => fe80:0:0:0:42:bcff:fe84:e10a
+function extend_addr6()
+{
+    local addr=$1
+    local sep=: sep2=::
+    local sep_cnt=$(tr -cd $sep <<< $1 | wc -c)
+    local shrink
+
+    # separator count : should be between 2, 7.
+    if [[ $sep_cnt -lt 2 || $sep_cnt -gt 7 ]]; then
+        err 5 "Invalid IP6 address sep: $1"
+    fi
+
+    # if shrink '::' occurs multiple, it's malformed.
+    shrink=( $(egrep -o "$sep{2,}" <<< $addr) )
+    if [[ ${#shrink[@]} -ne 0 ]]; then
+        if [[ ${#shrink[@]} -gt 1 || ( ${shrink[0]} != $sep2 ) ]]; then
+            err 5 "Invalid IP$IP6 address shr: $1"
+        fi
+    fi
+
+    # add 0 at begin & end, and extend addr by adding :0
+    [[ ${addr:0:1} == $sep ]] && addr=0${addr}
+    [[ ${addr: -1} == $sep ]] && addr=${addr}0
+    echo "${addr/$sep2/$(printf ':0%.s' $(seq $[8-sep_cnt])):}"
+}
+
+
+# Given a single IP(v4/v6) address, whether it is valid.
+function validate_addr()
+{
+    # check function is called with (funcname)6
+    [[ ${FUNCNAME[1]: -1} == 6 ]] && local IP6=6
+    local len=$[ IP6 ? 8 : 4 ]
+    local max=$[ 2**(len*2)-1 ]
+    local addr sep
+
+    # set separator for each IP(v4/v6)
+    [[ $IP6 ]] && sep=: || sep=.
+    IFS=$sep read -a addr <<< $1
+
+    # array length
+    if [[ ${#addr[@]} != $len ]]; then
+        err 5 "Invalid IP$IP6 address: $1"
+    fi
+
+    # check each digit between 0, $max
+    for digit in "${addr[@]}"; do
+        [[ $IP6 ]] && digit=$[ 16#$digit ]
+        if [[ $digit -lt 0 || $digit -gt $max ]]; then
+            err 5 "Invalid IP$IP6 address: $1"
+        fi
+    done
+
+    return 0
+}
+
+function validate_addr6() { validate_addr $@ ; }
+
+# Given a single IP(v4/v6) or CIDR, return minimum and maximum IP addr.
+function parse_addr()
+{
+    # check function is called with (funcname)6
+    [[ ${FUNCNAME[1]: -1} == 6 ]] && local IP6=6
+    local bitlen=$[ IP6 ? 128 : 32 ]
+    local octet=$[ IP6 ? 16 : 8 ]
+
+    local addr=$1
+    local net prefix
+    local min_ip max_ip
+
+    IFS='/' read net prefix <<< $addr
+    [[ $IP6 ]] && net=$(extend_addr6 $net)
+    validate_addr$IP6 $net
+
+    if [[ $prefix -gt $bitlen ]]; then
+        err 5 "Invalid prefix: $prefix"
+    elif [[ -z $prefix ]]; then
+        min_ip=$net
+        max_ip=$net
+    else
+        # defining array for converting Decimal 2 Binary
+        # 00000000 00000001 00000010 00000011 00000100 ...
+        local d2b='{0..1}{0..1}{0..1}{0..1}{0..1}{0..1}{0..1}{0..1}'
+        [[ $IP6 ]] && d2b+=$d2b
+        eval local D2B=($d2b)
+
+        local shift=$[ bitlen-prefix ]
+        local min_mask max_mask
+        local min max
+        local ip_bit
+        local ip sep
+
+        # set separator for each IP(v4/v6)
+        [[ $IP6 ]] && sep=: || sep=.
+        IFS=$sep read -ra ip <<< $net
+
+        min_mask="$(printf '1%.s' $(seq $prefix))$(printf '0%.s' $(seq $shift))"
+        max_mask="$(printf '0%.s' $(seq $prefix))$(printf '1%.s' $(seq $shift))"
+
+        # calculate min/max ip with &,| operator
+        for i in "${!ip[@]}"; do
+            digit=$[ IP6 ? 16#${ip[$i]} : ${ip[$i]} ]
+            ip_bit=${D2B[$digit]}
+
+            idx=$[ octet*i ]
+            min[$i]=$[ 2#$ip_bit & 2#${min_mask:$idx:$octet} ]
+            max[$i]=$[ 2#$ip_bit | 2#${max_mask:$idx:$octet} ]
+            [[ $IP6 ]] && { min[$i]=$(printf '%X' ${min[$i]});
+                            max[$i]=$(printf '%X' ${max[$i]}); }
+        done
+
+        min_ip=$(IFS=$sep; echo "${min[*]}")
+        max_ip=$(IFS=$sep; echo "${max[*]}")
+    fi
+
+    echo $min_ip $max_ip
+}
+
+function parse_addr6() { parse_addr $@ ; }
+
 # Given a single or range of port(s), return minimum and maximum port number.
 function parse_ports()
 {
-- 
2.20.1


^ permalink raw reply related

* Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
From: Kalle Valo @ 2019-09-11 18:48 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linus Torvalds, David S. Miller, linux-wireless, Netdev,
	Linux List Kernel Mailing, ath10k
In-Reply-To: <383b145b608e0fe3a35ffb0ceb99fdf938d4e2bb.camel@sipsolutions.net>

Johannes Berg <johannes@sipsolutions.net> writes:

> On Wed, 2019-09-11 at 21:19 +0300, Kalle Valo wrote:
>> > Looks like indeed the driver gives the device at least *3 seconds* for
>> > every command, see ath10k_wmi_cmd_send(), so most likely this would
>> > eventually have finished, but who knows how many firmware commands it
>> > would still have attempted to send...
>> 
>> 3 seconds is a bit short but in normal cases it should be enough. Of
>> course we could increase the delay but I'm skeptic it would help here.
>
> I was thinking 3 seconds is way too long :-)

Heh :)

>> > Perhaps the driver should mark the device as dead and fail quickly once
>> > it timed out once, or so, but I'll let Kalle comment on that.
>> 
>> Actually we do try to restart the device when a timeout happens in
>> ath10k_wmi_cmd_send():
>> 
>>         if (ret == -EAGAIN) {
>>                 ath10k_warn(ar, "wmi command %d timeout, restarting hardware\n",
>>                             cmd_id);
>>                 queue_work(ar->workqueue, &ar->restart_work);
>>         }
>
> Yeah, and this is the problem, in a sense, I'd think. It seems to me
> that at this point the code needs to tag the device as "dead" and
> immediately return from any further commands submitted to it with an
> error (e.g. -EIO).

Yeah, ath10k_core_restart() is supposed change to state
ATH10K_STATE_RESTARTING but very few mac80211 ops in ath10k_ops are
checking for it, and to me it looks like quite late even. I think a
proper fix for ops which can sleep is to check ar->state is
ATH10K_STATE_ON and for ops which cannot sleep check
ATH10K_FLAG_CRASH_FLUSH.

But of course this just fixes the symptoms, the root cause for timeouts
needs to be found as well.

-- 
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply

* [v2 1/3] samples: pktgen: make variable consistent with option
From: Daniel T. Lee @ 2019-09-11 18:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, David S . Miller; +Cc: netdev

This commit changes variable names that can cause confusion.

For example, variable DST_MIN is quite confusing since the
keyword 'udp_dst_min' and keyword 'dst_min' is used with pg_ctrl.

On the following commit, 'dst_min' will be used to set destination IP,
and the existing variable name DST_MIN should be changed.

Variable names are matched to the exact keyword used with pg_ctrl.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
---
 .../pktgen_bench_xmit_mode_netif_receive.sh      |  8 ++++----
 .../pktgen/pktgen_bench_xmit_mode_queue_xmit.sh  |  8 ++++----
 samples/pktgen/pktgen_sample01_simple.sh         | 16 ++++++++--------
 samples/pktgen/pktgen_sample02_multiqueue.sh     | 16 ++++++++--------
 .../pktgen/pktgen_sample03_burst_single_flow.sh  |  8 ++++----
 samples/pktgen/pktgen_sample04_many_flows.sh     |  8 ++++----
 .../pktgen/pktgen_sample05_flow_per_thread.sh    |  8 ++++----
 ...en_sample06_numa_awared_queue_irq_affinity.sh | 16 ++++++++--------
 8 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh b/samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh
index e14b1a9144d9..9b74502c58f7 100755
--- a/samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh
+++ b/samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh
@@ -42,8 +42,8 @@ fi
 [ -z "$BURST" ] && BURST=1024
 [ -z "$COUNT" ] && COUNT="10000000" # Zero means indefinitely
 if [ -n "$DST_PORT" ]; then
-    read -r DST_MIN DST_MAX <<< $(parse_ports $DST_PORT)
-    validate_ports $DST_MIN $DST_MAX
+    read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
+    validate_ports $UDP_DST_MIN $UDP_DST_MAX
 fi
 
 # Base Config
@@ -76,8 +76,8 @@ for ((thread = $F_THREAD; thread <= $L_THREAD; thread++)); do
     if [ -n "$DST_PORT" ]; then
 	# Single destination port or random port range
 	pg_set $dev "flag UDPDST_RND"
-	pg_set $dev "udp_dst_min $DST_MIN"
-	pg_set $dev "udp_dst_max $DST_MAX"
+	pg_set $dev "udp_dst_min $UDP_DST_MIN"
+	pg_set $dev "udp_dst_max $UDP_DST_MAX"
     fi
 
     # Inject packet into RX path of stack
diff --git a/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh b/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh
index 82c3e504e056..0f332555b40d 100755
--- a/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh
+++ b/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh
@@ -25,8 +25,8 @@ if [[ -n "$BURST" ]]; then
 fi
 [ -z "$COUNT" ] && COUNT="10000000" # Zero means indefinitely
 if [ -n "$DST_PORT" ]; then
-    read -r DST_MIN DST_MAX <<< $(parse_ports $DST_PORT)
-    validate_ports $DST_MIN $DST_MAX
+    read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
+    validate_ports $UDP_DST_MIN $UDP_DST_MAX
 fi
 
 # Base Config
@@ -59,8 +59,8 @@ for ((thread = $F_THREAD; thread <= $L_THREAD; thread++)); do
     if [ -n "$DST_PORT" ]; then
 	# Single destination port or random port range
 	pg_set $dev "flag UDPDST_RND"
-	pg_set $dev "udp_dst_min $DST_MIN"
-	pg_set $dev "udp_dst_max $DST_MAX"
+	pg_set $dev "udp_dst_min $UDP_DST_MIN"
+	pg_set $dev "udp_dst_max $UDP_DST_MAX"
     fi
 
     # Inject packet into TX qdisc egress path of stack
diff --git a/samples/pktgen/pktgen_sample01_simple.sh b/samples/pktgen/pktgen_sample01_simple.sh
index d1702fdde8f3..063ec0998906 100755
--- a/samples/pktgen/pktgen_sample01_simple.sh
+++ b/samples/pktgen/pktgen_sample01_simple.sh
@@ -23,16 +23,16 @@ fi
 [ -z "$DST_MAC" ] && usage && err 2 "Must specify -m dst_mac"
 [ -z "$COUNT" ]   && COUNT="100000" # Zero means indefinitely
 if [ -n "$DST_PORT" ]; then
-    read -r DST_MIN DST_MAX <<< $(parse_ports $DST_PORT)
-    validate_ports $DST_MIN $DST_MAX
+    read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
+    validate_ports $UDP_DST_MIN $UDP_DST_MAX
 fi
 
 # Base Config
 DELAY="0"        # Zero means max speed
 
 # Flow variation random source port between min and max
-UDP_MIN=9
-UDP_MAX=109
+UDP_SRC_MIN=9
+UDP_SRC_MAX=109
 
 # General cleanup everything since last run
 # (especially important if other threads were configured by other scripts)
@@ -66,14 +66,14 @@ pg_set $DEV "dst$IP6 $DEST_IP"
 if [ -n "$DST_PORT" ]; then
     # Single destination port or random port range
     pg_set $DEV "flag UDPDST_RND"
-    pg_set $DEV "udp_dst_min $DST_MIN"
-    pg_set $DEV "udp_dst_max $DST_MAX"
+    pg_set $DEV "udp_dst_min $UDP_DST_MIN"
+    pg_set $DEV "udp_dst_max $UDP_DST_MAX"
 fi
 
 # Setup random UDP port src range
 pg_set $DEV "flag UDPSRC_RND"
-pg_set $DEV "udp_src_min $UDP_MIN"
-pg_set $DEV "udp_src_max $UDP_MAX"
+pg_set $DEV "udp_src_min $UDP_SRC_MIN"
+pg_set $DEV "udp_src_max $UDP_SRC_MAX"
 
 # start_run
 echo "Running... ctrl^C to stop" >&2
diff --git a/samples/pktgen/pktgen_sample02_multiqueue.sh b/samples/pktgen/pktgen_sample02_multiqueue.sh
index 7f7a9a27548f..a4726fb50197 100755
--- a/samples/pktgen/pktgen_sample02_multiqueue.sh
+++ b/samples/pktgen/pktgen_sample02_multiqueue.sh
@@ -21,8 +21,8 @@ DELAY="0"        # Zero means max speed
 [ -z "$CLONE_SKB" ] && CLONE_SKB="0"
 
 # Flow variation random source port between min and max
-UDP_MIN=9
-UDP_MAX=109
+UDP_SRC_MIN=9
+UDP_SRC_MAX=109
 
 # (example of setting default params in your script)
 if [ -z "$DEST_IP" ]; then
@@ -30,8 +30,8 @@ if [ -z "$DEST_IP" ]; then
 fi
 [ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
 if [ -n "$DST_PORT" ]; then
-    read -r DST_MIN DST_MAX <<< $(parse_ports $DST_PORT)
-    validate_ports $DST_MIN $DST_MAX
+    read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
+    validate_ports $UDP_DST_MIN $UDP_DST_MAX
 fi
 
 # General cleanup everything since last run
@@ -67,14 +67,14 @@ for ((thread = $F_THREAD; thread <= $L_THREAD; thread++)); do
     if [ -n "$DST_PORT" ]; then
 	# Single destination port or random port range
 	pg_set $dev "flag UDPDST_RND"
-	pg_set $dev "udp_dst_min $DST_MIN"
-	pg_set $dev "udp_dst_max $DST_MAX"
+	pg_set $dev "udp_dst_min $UDP_DST_MIN"
+	pg_set $dev "udp_dst_max $UDP_DST_MAX"
     fi
 
     # Setup random UDP port src range
     pg_set $dev "flag UDPSRC_RND"
-    pg_set $dev "udp_src_min $UDP_MIN"
-    pg_set $dev "udp_src_max $UDP_MAX"
+    pg_set $dev "udp_src_min $UDP_SRC_MIN"
+    pg_set $dev "udp_src_max $UDP_SRC_MAX"
 done
 
 # start_run
diff --git a/samples/pktgen/pktgen_sample03_burst_single_flow.sh b/samples/pktgen/pktgen_sample03_burst_single_flow.sh
index b520637817ce..dfea91a09ccc 100755
--- a/samples/pktgen/pktgen_sample03_burst_single_flow.sh
+++ b/samples/pktgen/pktgen_sample03_burst_single_flow.sh
@@ -34,8 +34,8 @@ fi
 [ -z "$CLONE_SKB" ] && CLONE_SKB="0" # No need for clones when bursting
 [ -z "$COUNT" ]     && COUNT="0" # Zero means indefinitely
 if [ -n "$DST_PORT" ]; then
-    read -r DST_MIN DST_MAX <<< $(parse_ports $DST_PORT)
-    validate_ports $DST_MIN $DST_MAX
+    read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
+    validate_ports $UDP_DST_MIN $UDP_DST_MAX
 fi
 
 # Base Config
@@ -67,8 +67,8 @@ for ((thread = $F_THREAD; thread <= $L_THREAD; thread++)); do
     if [ -n "$DST_PORT" ]; then
 	# Single destination port or random port range
 	pg_set $dev "flag UDPDST_RND"
-	pg_set $dev "udp_dst_min $DST_MIN"
-	pg_set $dev "udp_dst_max $DST_MAX"
+	pg_set $dev "udp_dst_min $UDP_DST_MIN"
+	pg_set $dev "udp_dst_max $UDP_DST_MAX"
     fi
 
     # Setup burst, for easy testing -b 0 disable bursting
diff --git a/samples/pktgen/pktgen_sample04_many_flows.sh b/samples/pktgen/pktgen_sample04_many_flows.sh
index 5b6e9d9cb5b5..7ea9b4a3acf6 100755
--- a/samples/pktgen/pktgen_sample04_many_flows.sh
+++ b/samples/pktgen/pktgen_sample04_many_flows.sh
@@ -18,8 +18,8 @@ source ${basedir}/parameters.sh
 [ -z "$CLONE_SKB" ] && CLONE_SKB="0"
 [ -z "$COUNT" ]     && COUNT="0" # Zero means indefinitely
 if [ -n "$DST_PORT" ]; then
-    read -r DST_MIN DST_MAX <<< $(parse_ports $DST_PORT)
-    validate_ports $DST_MIN $DST_MAX
+    read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
+    validate_ports $UDP_DST_MIN $UDP_DST_MAX
 fi
 
 # NOTICE:  Script specific settings
@@ -63,8 +63,8 @@ for ((thread = $F_THREAD; thread <= $L_THREAD; thread++)); do
     if [ -n "$DST_PORT" ]; then
 	# Single destination port or random port range
 	pg_set $dev "flag UDPDST_RND"
-	pg_set $dev "udp_dst_min $DST_MIN"
-	pg_set $dev "udp_dst_max $DST_MAX"
+	pg_set $dev "udp_dst_min $UDP_DST_MIN"
+	pg_set $dev "udp_dst_max $UDP_DST_MAX"
     fi
 
     # Randomize source IP-addresses
diff --git a/samples/pktgen/pktgen_sample05_flow_per_thread.sh b/samples/pktgen/pktgen_sample05_flow_per_thread.sh
index 0c06e63fbe97..fbfafe029e11 100755
--- a/samples/pktgen/pktgen_sample05_flow_per_thread.sh
+++ b/samples/pktgen/pktgen_sample05_flow_per_thread.sh
@@ -23,8 +23,8 @@ source ${basedir}/parameters.sh
 [ -z "$BURST" ]     && BURST=32
 [ -z "$COUNT" ]     && COUNT="0" # Zero means indefinitely
 if [ -n "$DST_PORT" ]; then
-    read -r DST_MIN DST_MAX <<< $(parse_ports $DST_PORT)
-    validate_ports $DST_MIN $DST_MAX
+    read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
+    validate_ports $UDP_DST_MIN $UDP_DST_MAX
 fi
 
 # Base Config
@@ -56,8 +56,8 @@ for ((thread = $F_THREAD; thread <= $L_THREAD; thread++)); do
     if [ -n "$DST_PORT" ]; then
 	# Single destination port or random port range
 	pg_set $dev "flag UDPDST_RND"
-	pg_set $dev "udp_dst_min $DST_MIN"
-	pg_set $dev "udp_dst_max $DST_MAX"
+	pg_set $dev "udp_dst_min $UDP_DST_MIN"
+	pg_set $dev "udp_dst_max $UDP_DST_MAX"
     fi
 
     # Setup source IP-addresses based on thread number
diff --git a/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
index 97f0266c0356..755e662183f1 100755
--- a/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
+++ b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
@@ -20,8 +20,8 @@ DELAY="0"        # Zero means max speed
 [ -z "$CLONE_SKB" ] && CLONE_SKB="0"
 
 # Flow variation random source port between min and max
-UDP_MIN=9
-UDP_MAX=109
+UDP_SRC_MIN=9
+UDP_SRC_MAX=109
 
 node=`get_iface_node $DEV`
 irq_array=(`get_iface_irqs $DEV`)
@@ -36,8 +36,8 @@ if [ -z "$DEST_IP" ]; then
 fi
 [ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
 if [ -n "$DST_PORT" ]; then
-    read -r DST_MIN DST_MAX <<< $(parse_ports $DST_PORT)
-    validate_ports $DST_MIN $DST_MAX
+    read -r UDP_DST_MIN UDP_DST_MAX <<< $(parse_ports $DST_PORT)
+    validate_ports $UDP_DST_MIN $UDP_DST_MAX
 fi
 
 # General cleanup everything since last run
@@ -84,14 +84,14 @@ for ((i = 0; i < $THREADS; i++)); do
     if [ -n "$DST_PORT" ]; then
 	# Single destination port or random port range
 	pg_set $dev "flag UDPDST_RND"
-	pg_set $dev "udp_dst_min $DST_MIN"
-	pg_set $dev "udp_dst_max $DST_MAX"
+	pg_set $dev "udp_dst_min $UDP_DST_MIN"
+	pg_set $dev "udp_dst_max $UDP_DST_MAX"
     fi
 
     # Setup random UDP port src range
     pg_set $dev "flag UDPSRC_RND"
-    pg_set $dev "udp_src_min $UDP_MIN"
-    pg_set $dev "udp_src_max $UDP_MAX"
+    pg_set $dev "udp_src_min $UDP_SRC_MIN"
+    pg_set $dev "udp_src_max $UDP_SRC_MAX"
 done
 
 # start_run
-- 
2.20.1


^ permalink raw reply related

* Re: ixgbe: driver drops packets routed from an IPSec interface with a "bad sa_idx" error
From: Jeff Kirsher @ 2019-09-11 18:45 UTC (permalink / raw)
  To: Michael Marley, Steffen Klassert; +Cc: Shannon Nelson, netdev
In-Reply-To: <12d6d2313eeb61a51731a2ba9b1fa9bf@michaelmarley.com>

[-- Attachment #1: Type: text/plain, Size: 1975 bytes --]

On Wed, 2019-09-11 at 10:50 -0400, Michael Marley wrote:
> On 2019-09-11 02:15, Steffen Klassert wrote:
> > On Tue, Sep 10, 2019 at 06:53:30PM -0400, Michael Marley wrote:
> > > StrongSwan has hardware offload disabled by default, and I
> > > didn't 
> > > enable
> > > it explicitly.  I also already tried turning off all those
> > > switches 
> > > with
> > > ethtool and it has no effect.  This doesn't surprise me though, 
> > > because
> > > as I said, I don't actually have the IPSec connection running
> > > over the
> > > ixgbe device.  The IPSec connection runs over another network
> > > adapter
> > > that doesn't support IPSec offload at all.  The problem comes
> > > when
> > > traffic received over the IPSec interface is then routed back out
> > > (unencrypted) through the ixgbe device into the local network.
> > 
> > Seems like the ixgbe driver tries to use the sec_path
> > from RX to setup an offload at the TX side.
> > 
> > Can you please try this (completely untested) patch?

Steffen, can you send your patch to intel-wired-lan@lists.osuosl.org
mailing list?

> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > index 9bcae44e9883..ae31bd57127c 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > @@ -36,6 +36,7 @@
> >  #include <net/vxlan.h>
> >  #include <net/mpls.h>
> >  #include <net/xdp_sock.h>
> > +#include <net/xfrm.h>
> > 
> >  #include "ixgbe.h"
> >  #include "ixgbe_common.h"
> > @@ -8696,7 +8697,7 @@ netdev_tx_t ixgbe_xmit_frame_ring(struct
> > sk_buff 
> > *skb,
> >  #endif /* IXGBE_FCOE */
> > 
> >  #ifdef CONFIG_IXGBE_IPSEC
> > -	if (secpath_exists(skb) &&
> > +	if (xfrm_offload(skb) &&
> >  	    !ixgbe_ipsec_tx(tx_ring, first, &ipsec_tx))
> >  		goto out_drop;
> >  #endif
> With the patch, the problem is gone.  Thanks!
> 
> Michael


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [Patch net] sch_sfb: fix a crash in sfb_destroy()
From: Cong Wang @ 2019-09-11 18:34 UTC (permalink / raw)
  To: netdev
  Cc: Cong Wang, syzbot+d5870a903591faaca4ae, Linus Torvalds,
	Jamal Hadi Salim, Jiri Pirko

When tcf_block_get() fails in sfb_init(), q->qdisc is still a NULL
pointer which leads to a crash in sfb_destroy().

Linus suggested three solutions for this problem, the simplest fix
is just moving the noop_qdisc assignment before tcf_block_get()
so that qdisc_put() would become a nop.

Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure")
Reported-by: syzbot+d5870a903591faaca4ae@syzkaller.appspotmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/sch_sfb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_sfb.c b/net/sched/sch_sfb.c
index 1dff8506a715..db1c8eb521a2 100644
--- a/net/sched/sch_sfb.c
+++ b/net/sched/sch_sfb.c
@@ -552,11 +552,11 @@ static int sfb_init(struct Qdisc *sch, struct nlattr *opt,
 	struct sfb_sched_data *q = qdisc_priv(sch);
 	int err;
 
+	q->qdisc = &noop_qdisc;
+
 	err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
 	if (err)
 		return err;
-
-	q->qdisc = &noop_qdisc;
 	return sfb_change(sch, opt, extack);
 }
 
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH] ftgmac100: Disable HW checksum generation on AST2500
From: Florian Fainelli @ 2019-09-11 18:34 UTC (permalink / raw)
  To: Vijay Khemka, David S. Miller, YueHaibing, Andrew Lunn,
	Kate Stewart, Mauro Carvalho Chehab, Luis Chamberlain,
	Thomas Gleixner, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
  Cc: openbmc @ lists . ozlabs . org, Sai Dasari,
	linux-aspeed@lists.ozlabs.org
In-Reply-To: <8A8392C8-5E5E-444D-AB1B-E0FAD3C29425@fb.com>

On 9/11/19 11:30 AM, Vijay Khemka wrote:
> 
> 
> On 9/10/19, 4:08 PM, "Linux-aspeed on behalf of Vijay Khemka" <linux-aspeed-bounces+vijaykhemka=fb.com@lists.ozlabs.org on behalf of vijaykhemka@fb.com> wrote:
> 
>     
>     
>     On 9/10/19, 3:50 PM, "Linux-aspeed on behalf of Vijay Khemka" <linux-aspeed-bounces+vijaykhemka=fb.com@lists.ozlabs.org on behalf of vijaykhemka@fb.com> wrote:
>     
>         
>         
>         On 9/10/19, 3:05 PM, "Florian Fainelli" <f.fainelli@gmail.com> wrote:
>         
>             On 9/10/19 2:37 PM, Vijay Khemka wrote:
>             > HW checksum generation is not working for AST2500, specially with IPV6
>             > over NCSI. All TCP packets with IPv6 get dropped. By disabling this
>             > it works perfectly fine with IPV6.
>             > 
>             > Verified with IPV6 enabled and can do ssh.
>             
>             How about IPv4, do these packets have problem? If not, can you continue
>             advertising NETIF_F_IP_CSUM but take out NETIF_F_IPV6_CSUM?
>         
>         I changed code from (netdev->hw_features &= ~NETIF_F_HW_CSUM) to 
>         (netdev->hw_features &= ~NETIF_F_ IPV6_CSUM). And it is not working. 
>         Don't know why. IPV4 works without any change but IPv6 needs HW_CSUM
>         Disabled.
>     
>     Now I changed to
>     netdev->hw_features &= (~NETIF_F_HW_CSUM) | NETIF_F_IP_CSUM;
>     And it works.
> 
> I investigated more on these features and found that we cannot set NETIF_F_IP_CSUM 
> While NETIF_F_HW_CSUM is set. So I disabled NETIF_F_HW_CSUM first and enabled
> NETIF_F_IP_CSUM in next statement. And it works fine.
> 
> But as per line 166 in include/linux/skbuff.h,  
> *   NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM are being deprecated in favor of
>  *   NETIF_F_HW_CSUM. New devices should use NETIF_F_HW_CSUM to indicate
>  *   checksum offload capability.
> 
> Please suggest which of below 2 I should do. As both works for me.
> 1. Disable completely NETIF_F_HW_CSUM and do nothing. This is original patch.
> 2. Enable NETIF_F_IP_CSUM in addition to 1. I can have v2 if this is accepted.

Sounds like 2 would leave the option of offloading IPv4 checksum
offload, so that would be a better middle group than flat out disable
checksum offload for both IPv4 and IPv6, no?
-- 
Florian

^ permalink raw reply

* Re: [PATCH] ftgmac100: Disable HW checksum generation on AST2500
From: Vijay Khemka @ 2019-09-11 18:30 UTC (permalink / raw)
  To: Florian Fainelli, David S. Miller, YueHaibing, Andrew Lunn,
	Kate Stewart, Mauro Carvalho Chehab, Luis Chamberlain,
	Thomas Gleixner, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
  Cc: openbmc @ lists . ozlabs . org, Sai Dasari,
	linux-aspeed@lists.ozlabs.org
In-Reply-To: <D79D04CC-4A02-4E51-8FDF-48B7C7EB6CC2@fb.com>



On 9/10/19, 4:08 PM, "Linux-aspeed on behalf of Vijay Khemka" <linux-aspeed-bounces+vijaykhemka=fb.com@lists.ozlabs.org on behalf of vijaykhemka@fb.com> wrote:

    
    
    On 9/10/19, 3:50 PM, "Linux-aspeed on behalf of Vijay Khemka" <linux-aspeed-bounces+vijaykhemka=fb.com@lists.ozlabs.org on behalf of vijaykhemka@fb.com> wrote:
    
        
        
        On 9/10/19, 3:05 PM, "Florian Fainelli" <f.fainelli@gmail.com> wrote:
        
            On 9/10/19 2:37 PM, Vijay Khemka wrote:
            > HW checksum generation is not working for AST2500, specially with IPV6
            > over NCSI. All TCP packets with IPv6 get dropped. By disabling this
            > it works perfectly fine with IPV6.
            > 
            > Verified with IPV6 enabled and can do ssh.
            
            How about IPv4, do these packets have problem? If not, can you continue
            advertising NETIF_F_IP_CSUM but take out NETIF_F_IPV6_CSUM?
        
        I changed code from (netdev->hw_features &= ~NETIF_F_HW_CSUM) to 
        (netdev->hw_features &= ~NETIF_F_ IPV6_CSUM). And it is not working. 
        Don't know why. IPV4 works without any change but IPv6 needs HW_CSUM
        Disabled.
    
    Now I changed to
    netdev->hw_features &= (~NETIF_F_HW_CSUM) | NETIF_F_IP_CSUM;
    And it works.

I investigated more on these features and found that we cannot set NETIF_F_IP_CSUM 
While NETIF_F_HW_CSUM is set. So I disabled NETIF_F_HW_CSUM first and enabled
NETIF_F_IP_CSUM in next statement. And it works fine.

But as per line 166 in include/linux/skbuff.h,  
*   NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM are being deprecated in favor of
 *   NETIF_F_HW_CSUM. New devices should use NETIF_F_HW_CSUM to indicate
 *   checksum offload capability.

Please suggest which of below 2 I should do. As both works for me.
1. Disable completely NETIF_F_HW_CSUM and do nothing. This is original patch.
2. Enable NETIF_F_IP_CSUM in addition to 1. I can have v2 if this is accepted.
            
            > 
            > Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
            > ---
            >  drivers/net/ethernet/faraday/ftgmac100.c | 5 +++--
            >  1 file changed, 3 insertions(+), 2 deletions(-)
            > 
            > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
            > index 030fed65393e..591c9725002b 100644
            > --- a/drivers/net/ethernet/faraday/ftgmac100.c
            > +++ b/drivers/net/ethernet/faraday/ftgmac100.c
            > @@ -1839,8 +1839,9 @@ static int ftgmac100_probe(struct platform_device *pdev)
            >  	if (priv->use_ncsi)
            >  		netdev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER;
            >  
            > -	/* AST2400  doesn't have working HW checksum generation */
            > -	if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac")))
            > +	/* AST2400  and AST2500 doesn't have working HW checksum generation */
            > +	if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac") ||
            > +		   of_device_is_compatible(np, "aspeed,ast2500-mac")))
            >  		netdev->hw_features &= ~NETIF_F_HW_CSUM;
            >  	if (np && of_get_property(np, "no-hw-checksum", NULL))
            >  		netdev->hw_features &= ~(NETIF_F_HW_CSUM | NETIF_F_RXCSUM);
            > 
            
            
            -- 
            Florian
            
        
        
    
    


^ permalink raw reply

* Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
From: Johannes Berg @ 2019-09-11 18:23 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Linus Torvalds, David S. Miller, linux-wireless, Netdev,
	Linux List Kernel Mailing, ath10k
In-Reply-To: <87ef0mlmqg.fsf@tynnyri.adurom.net>

On Wed, 2019-09-11 at 21:19 +0300, Kalle Valo wrote:
> > Looks like indeed the driver gives the device at least *3 seconds* for
> > every command, see ath10k_wmi_cmd_send(), so most likely this would
> > eventually have finished, but who knows how many firmware commands it
> > would still have attempted to send...
> 
> 3 seconds is a bit short but in normal cases it should be enough. Of
> course we could increase the delay but I'm skeptic it would help here.

I was thinking 3 seconds is way too long :-)

> > Perhaps the driver should mark the device as dead and fail quickly once
> > it timed out once, or so, but I'll let Kalle comment on that.
> 
> Actually we do try to restart the device when a timeout happens in
> ath10k_wmi_cmd_send():
> 
>         if (ret == -EAGAIN) {
>                 ath10k_warn(ar, "wmi command %d timeout, restarting hardware\n",
>                             cmd_id);
>                 queue_work(ar->workqueue, &ar->restart_work);
>         }

Yeah, and this is the problem, in a sense, I'd think. It seems to me
that at this point the code needs to tag the device as "dead" and
immediately return from any further commands submitted to it with an
error (e.g. -EIO). You can can actually see in the initial report that
while the restart was triggered, it too is waiting to acquire the RTNL:

>    Workqueue: events_freezable ieee80211_restart_work [mac80211]
>    Call Trace:
>     schedule+0x39/0xa0
>     schedule_preempt_disabled+0xa/0x10
>     __mutex_lock.isra.0+0x263/0x4b0
>     ieee80211_restart_work+0x54/0xe0 [mac80211]
>     process_one_work+0x1cf/0x370
>     worker_thread+0x4a/0x3c0
>     kthread+0xfb/0x130
>     ret_from_fork+0x35/0x40


So basically all this delay is mac80211 and the driver doing stuff with
the device, but every single thing has to time out and probably some
stuff loops etc., and then it just takes long enough with the RTNL held
that everything goes south.

johannes


^ permalink raw reply

* Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
From: Kalle Valo @ 2019-09-11 18:19 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linus Torvalds, David S. Miller, Kalle Valo, linux-wireless,
	Netdev, Linux List Kernel Mailing, ath10k
In-Reply-To: <feecebfcceba521703f13c8ee7f5bb9016924cb6.camel@sipsolutions.net>

Johannes Berg <johannes@sipsolutions.net> writes:

>>    ath10k_pci 0000:02:00.0: wmi command 16387 timeout, restarting hardware
>>    ath10k_pci 0000:02:00.0: failed to set 5g txpower 23: -11
>>    ath10k_pci 0000:02:00.0: failed to setup tx power 23: -11
>>    ath10k_pci 0000:02:00.0: failed to recalc tx power: -11
>>    ath10k_pci 0000:02:00.0: failed to set inactivity time for vdev 0: -108
>>    ath10k_pci 0000:02:00.0: failed to setup powersave: -108
>> 
>> That certainly looks like something did try to set a power limit, but
>> eventually failed.
>
> Yeah, that does seem a bit fishy. Kalle would have to comment for
> ath10k.
>
>> Immediately after that:
>> 
>>    wlp2s0: deauthenticating from 54:ec:2f:05:70:2c by local choice
>> (Reason: 3=DEAUTH_LEAVING)
>
> I don't _think_ any of the above would be a reason to disconnect, but it
> clearly looks like the device got stuck at this point, since everything
> just fails afterwards.

Yeah, to me it looks anything ath10k tries to do with the devie fails,
even resetting the device.

> Looks like indeed the driver gives the device at least *3 seconds* for
> every command, see ath10k_wmi_cmd_send(), so most likely this would
> eventually have finished, but who knows how many firmware commands it
> would still have attempted to send...

3 seconds is a bit short but in normal cases it should be enough. Of
course we could increase the delay but I'm skeptic it would help here.

> Perhaps the driver should mark the device as dead and fail quickly once
> it timed out once, or so, but I'll let Kalle comment on that.

Actually we do try to restart the device when a timeout happens in
ath10k_wmi_cmd_send():

        if (ret == -EAGAIN) {
                ath10k_warn(ar, "wmi command %d timeout, restarting hardware\n",
                            cmd_id);
                queue_work(ar->workqueue, &ar->restart_work);
        }
                        

-- 
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply

* [PATCH bpf] bpf: respect CAP_IPC_LOCK in RLIMIT_MEMLOCK check
From: Christian Barcenas @ 2019-09-11 18:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, Christian Barcenas,
	bpf

A process can lock memory addresses into physical RAM explicitly
(via mlock, mlockall, shmctl, etc.) or implicitly (via VFIO,
perf ring-buffers, bpf maps, etc.), subject to RLIMIT_MEMLOCK limits.

CAP_IPC_LOCK allows a process to exceed these limits, and throughout
the kernel this capability is checked before allowing/denying an attempt
to lock memory regions into RAM.

Because bpf locks its programs and maps into RAM, it should respect
CAP_IPC_LOCK. Previously, bpf would return EPERM when RLIMIT_MEMLOCK was
exceeded by a privileged process, which is contrary to documented
RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior.

Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Christian Barcenas <christian@cbarcenas.com>
---
 kernel/bpf/syscall.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 272071e9112f..e551961f364b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -183,8 +183,9 @@ void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
 static int bpf_charge_memlock(struct user_struct *user, u32 pages)
 {
 	unsigned long memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	unsigned long locked = atomic_long_add_return(pages, &user->locked_vm);
 
-	if (atomic_long_add_return(pages, &user->locked_vm) > memlock_limit) {
+	if (locked > memlock_limit && !capable(CAP_IPC_LOCK)) {
 		atomic_long_sub(pages, &user->locked_vm);
 		return -EPERM;
 	}
@@ -1231,7 +1232,7 @@ int __bpf_prog_charge(struct user_struct *user, u32 pages)
 
 	if (user) {
 		user_bufs = atomic_long_add_return(pages, &user->locked_vm);
-		if (user_bufs > memlock_limit) {
+		if (user_bufs > memlock_limit && !capable(CAP_IPC_LOCK)) {
 			atomic_long_sub(pages, &user->locked_vm);
 			return -EPERM;
 		}
-- 
2.23.0

^ permalink raw reply related

* Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
From: Kalle Valo @ 2019-09-11 18:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Berg, David S. Miller, linux-wireless, Netdev,
	Linux List Kernel Mailing, ath10k
In-Reply-To: <CAHk-=wgBuu8PiYpD7uWgxTSY8aUOJj6NJ=ivNQPYjAKO=cRinA@mail.gmail.com>

+ ath10k list

Linus Torvalds <torvalds@linux-foundation.org> writes:

> So I'm at LCA, reading email, using my laptop more than I normally do,
> and with different networking than I normally do.
>
> And I just had a 802.11 WARN_ON() trigger, followed by essentially a
> dead machine due to some lock held (maybe rtnl_lock).
>
> It's possible that the lock held thing happened before, and is the
> _reason_ for the delay, I don't know. I had to reboot the machine, but
> I gathered as much information as made sense and was obvious before I
> did so. That's appended.

Some notes while investigating this:

> But wait!
>
> ... then 10+ minutes later:
>
>    ath10k_pci 0000:02:00.0: wmi command 16387 timeout, restarting hardware
>    ath10k_pci 0000:02:00.0: failed to set 5g txpower 23: -11
>    ath10k_pci 0000:02:00.0: failed to setup tx power 23: -11
>    ath10k_pci 0000:02:00.0: failed to recalc tx power: -11
>    ath10k_pci 0000:02:00.0: failed to set inactivity time for vdev 0: -108
>    ath10k_pci 0000:02:00.0: failed to setup powersave: -108
>
> That certainly looks like something did try to set a power limit, but
> eventually failed.

I suspect the failing WMI command is called from:

ath10k_bss_info_changed()
ath10k_mac_txpower_recalc()
ath10k_mac_txpower_setup()
ath10k_wmi_pdev_set_param()
ath10k_wmi_cmd_send()
ath10k_wmi_cmd_send_nowait()
ath10k_htc_send()

-11 is -EAGAIN which would mean that the HTC credits have run out some
 reason for the WMI command:

if (ep->tx_credits < credits) {
        ath10k_dbg(ar, ATH10K_DBG_HTC,
                "htc insufficient credits ep %d required %d available %d\n",
                eid, credits, ep->tx_credits);
        spin_unlock_bh(&htc->tx_lock);
        ret = -EAGAIN;
        goto err_pull;
}

Credits can run out, for example, if there's a lot of WMI command/event
activity and are not returned during the 3s wait, firmware crashed or
problems with the PCI bus. But when the WMI command timeout happens
ath10k is supposed to restart the firmware and everything should be
usable again.

> Immediately after that:
>
>    wlp2s0: deauthenticating from 54:ec:2f:05:70:2c by local choice
> (Reason: 3=DEAUTH_LEAVING)
>    ath10k_pci 0000:02:00.0: failed to read hi_board_data address: -16
>    ath10k_pci 0000:02:00.0: failed to receive initialized event from
> target: 00000000
>    ath10k_pci 0000:02:00.0: failed to receive initialized event from
> target: 00000000
>    ath10k_pci 0000:02:00.0: failed to wait for target init: -110

I suspect here ath10k tries to reset the target during stop operation,
"failed to receive initialized event from target" comes from:

ath10k_pci_hif_stop()
ath10k_pci_safe_chip_reset()
ath10k_pci_warm_reset()
ath10k_pci_wait_for_target_init()

It shouldn't fail like that, which makes me suspect either a low level
problem or a bug in qca6174 firmware restart code. To check the latter,
could you please try to force a firmware crash and see if firmware
restart is working for you?

To crash the firmware you need to write either "hard" or "assert" (I
forgot which one QCA6174 firmware supports) to
/sys/kernel/debug/ieee80211/phy*/ath10k/simulate_fw_crash. And what
should happen is that the firmware crashes, ath10k prints a big pile of
warnings, restarts it and in few seconds everything resumes to normal
without user space even noticing it.

-- 
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply

* Re: [PATCH net-next 5/5] sctp: add spt_pathcpthld in struct sctp_paddrthlds
From: Xin Long @ 2019-09-11 17:47 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: David Laight, network dev, linux-sctp@vger.kernel.org,
	Neil Horman, davem@davemloft.net
In-Reply-To: <20190911125609.GC3499@localhost.localdomain>

On Wed, Sep 11, 2019 at 8:56 PM Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
>
> On Wed, Sep 11, 2019 at 05:38:33PM +0800, Xin Long wrote:
> > On Wed, Sep 11, 2019 at 5:21 PM Xin Long <lucien.xin@gmail.com> wrote:
> > >
> > > On Wed, Sep 11, 2019 at 5:03 PM David Laight <David.Laight@aculab.com> wrote:
> > > >
> > > > From: Xin Long [mailto:lucien.xin@gmail.com]
> > > > > Sent: 11 September 2019 09:52
> > > > > On Tue, Sep 10, 2019 at 9:19 PM David Laight <David.Laight@aculab.com> wrote:
> > > > > >
> > > > > > From: Xin Long
> > > > > > > Sent: 09 September 2019 08:57
> > > > > > > Section 7.2 of rfc7829: "Peer Address Thresholds (SCTP_PEER_ADDR_THLDS)
> > > > > > > Socket Option" extends 'struct sctp_paddrthlds' with 'spt_pathcpthld'
> > > > > > > added to allow a user to change ps_retrans per sock/asoc/transport, as
> > > > > > > other 2 paddrthlds: pf_retrans, pathmaxrxt.
> > > > > > >
> > > > > > > Note that ps_retrans is not allowed to be greater than pf_retrans.
> > > > > > >
> > > > > > > Signed-off-by: Xin Long <lucien.xin@gmail.com>
> > > > > > > ---
> > > > > > >  include/uapi/linux/sctp.h |  1 +
> > > > > > >  net/sctp/socket.c         | 10 ++++++++++
> > > > > > >  2 files changed, 11 insertions(+)
> > > > > > >
> > > > > > > diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
> > > > > > > index a15cc28..dfd81e1 100644
> > > > > > > --- a/include/uapi/linux/sctp.h
> > > > > > > +++ b/include/uapi/linux/sctp.h
> > > > > > > @@ -1069,6 +1069,7 @@ struct sctp_paddrthlds {
> > > > > > >       struct sockaddr_storage spt_address;
> > > > > > >       __u16 spt_pathmaxrxt;
> > > > > > >       __u16 spt_pathpfthld;
> > > > > > > +     __u16 spt_pathcpthld;
> > > > > > >  };
> > > > > > >
> > > > > > >  /*
> > > > > > > diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> > > > > > > index 5e2098b..5b9774d 100644
> > > > > > > --- a/net/sctp/socket.c
> > > > > > > +++ b/net/sctp/socket.c
> > > > > > > @@ -3954,6 +3954,9 @@ static int sctp_setsockopt_paddr_thresholds(struct sock *sk,
> > > > > >
> > > > > > This code does:
> > > > > >         if (optlen < sizeof(struct sctp_paddrthlds))
> > > > > >                 return -EINVAL;
> > > > > here will become:
> > > > >
> > > > >         if (optlen >= sizeof(struct sctp_paddrthlds)) {
> > > > >                 optlen = sizeof(struct sctp_paddrthlds);
> > > > >         } else if (optlen >= ALIGN(offsetof(struct sctp_paddrthlds,
> > > > >                                             spt_pathcpthld), 4))
> > > > >                 optlen = ALIGN(offsetof(struct sctp_paddrthlds,
> > > > >                                         spt_pathcpthld), 4);
> > > > >                 val.spt_pathcpthld = 0xffff;
> > > > >         else {
> > > > >                 return -EINVAL;
> > > > >         }
> > > >
> > > > Hmmm...
> > > > If the kernel has to default 'val.spt_pathcpthld = 0xffff'
> > > > then recompiling an existing application with the new uapi
> > > > header is going to lead to very unexpected behaviour.
> > > >
> > > > The best you can hope for is that the application memset the
> > > > structure to zero.
> > > > But more likely it is 'random' on-stack data.
> > > 0xffff is a value to disable the feature 'Primary Path Switchover'.
> > > you're right that user might set it to zero unexpectly with their
> > > old application rebuilt.
> > >
> > > A safer way is to introduce "sysctl net.sctp.ps_retrans", it won't
> > > matter if users set spt_pathcpthld properly when they're not aware
> > > of this feature, like "sysctl net.sctp.pF_retrans". Looks better?
> > Sorry for confusing,  "sysctl net.sctp.ps_retrans" is already there
> > (its value is 0xffff by default),
> > we just need to do this in sctp_setsockopt_paddr_thresholds():
> >
> >         if (copy_from_user(&val, (struct sctp_paddrthlds __user *)optval,
> >                            optlen))
> >                 return -EFAULT;
> >
> >         if (sock_net(sk)->sctp.ps_retrans == 0xffff)
> >                 val.spt_pathcpthld = 0xffff;
>
> I'm confused with the snippets, but if I got them right, this is after
> dealing with proper len and could leave val.spt_pathcpthld
> uninitialized if the application used the old format and sysctl is !=
> 0xffff.
right, how about this in sctp_setsockopt_paddr_thresholds():

        offset = ALIGN(offsetof(struct sctp_paddrthlds, spt_pathcpthld), 4);
        if (optlen < offset)
                return -EINVAL;
        if (optlen < sizeof(val) || sock_net(sk)->sctp.ps_retrans == 0xffff) {
                optlen = offset;
                val.spt_pathcpthld = 0xffff;
        } else {
                optlen = sizeof(val);
        }

        if (copy_from_user(&val, (struct sctp_paddrthlds __user *)optval,
                           optlen))
                return -EFAULT;

        if (val.spt_pathpfthld > val.spt_pathcpthld)
                return -EINVAL;

Which means we will 'skip' spt_pathcpthld if (it's using old format) or
(ps_retrans is disabled and it's using new format).
Note that  ps_retrans < pf_retrans is not allowed in rfc7829.

and in sctp_getsockopt_paddr_thresholds():

        offset = ALIGN(offsetof(struct sctp_paddrthlds, spt_pathcpthld), 4);
        if (len < offset)
                return -EINVAL;
        if (len < sizeof(val) || sock_net(sk)->sctp.ps_retrans == 0xffff)
                len = offset;
        else
                len = sizeof(val);

        if (copy_from_user(&val, (struct sctp_paddrthlds __user *)optval, len))
                return -EFAULT;


>
> >
> >         if (val.spt_pathpfthld > val.spt_pathcpthld)
> >                 return -EINVAL;
> >
> > >
> > > >
> > > >         David
> > > >
> > > > -
> > > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > > > Registration No: 1397386 (Wales)
> >

^ permalink raw reply

* Re: [PATCH] ftgmac100: Disable HW checksum generation on AST2500
From: Vijay Khemka @ 2019-09-11 17:44 UTC (permalink / raw)
  To: Joel Stanley, Florian Fainelli, Benjamin Herrenschmidt
  Cc: David S. Miller, YueHaibing, Andrew Lunn, Kate Stewart,
	Mauro Carvalho Chehab, Luis Chamberlain, Thomas Gleixner,
	netdev@vger.kernel.org, Linux Kernel Mailing List,
	openbmc @ lists . ozlabs . org, linux-aspeed, Sai Dasari
In-Reply-To: <CACPK8XcS4iKfKigPbPg0BFbmjbT-kdyjiPDXjk1k5XaS5bCdAA@mail.gmail.com>



On 9/11/19, 7:49 AM, "Joel Stanley" <joel@jms.id.au> wrote:

    Hi Ben,
    
    On Tue, 10 Sep 2019 at 22:05, Florian Fainelli <f.fainelli@gmail.com> wrote:
    >
    > On 9/10/19 2:37 PM, Vijay Khemka wrote:
    > > HW checksum generation is not working for AST2500, specially with IPV6
    > > over NCSI. All TCP packets with IPv6 get dropped. By disabling this
    > > it works perfectly fine with IPV6.
    > >
    > > Verified with IPV6 enabled and can do ssh.
    >
    > How about IPv4, do these packets have problem? If not, can you continue
    > advertising NETIF_F_IP_CSUM but take out NETIF_F_IPV6_CSUM?
    >
    > >
    > > Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
    > > ---
    > >  drivers/net/ethernet/faraday/ftgmac100.c | 5 +++--
    > >  1 file changed, 3 insertions(+), 2 deletions(-)
    > >
    > > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
    > > index 030fed65393e..591c9725002b 100644
    > > --- a/drivers/net/ethernet/faraday/ftgmac100.c
    > > +++ b/drivers/net/ethernet/faraday/ftgmac100.c
    > > @@ -1839,8 +1839,9 @@ static int ftgmac100_probe(struct platform_device *pdev)
    > >       if (priv->use_ncsi)
    > >               netdev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER;
    > >
    > > -     /* AST2400  doesn't have working HW checksum generation */
    > > -     if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac")))
    > > +     /* AST2400  and AST2500 doesn't have working HW checksum generation */
    > > +     if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac") ||
    > > +                of_device_is_compatible(np, "aspeed,ast2500-mac")))
    
    Do you recall under what circumstances we need to disable hardware checksumming?
Mainly, TCP packets over IPV6 getting dropped. After disabling it was working.
    
    Cheers,
    
    Joel
    
    > >               netdev->hw_features &= ~NETIF_F_HW_CSUM;
    > >       if (np && of_get_property(np, "no-hw-checksum", NULL))
    > >               netdev->hw_features &= ~(NETIF_F_HW_CSUM | NETIF_F_RXCSUM);
    > >
    >
    >
    > --
    > Florian
    


^ permalink raw reply

* Re: [PATCH net] tcp: remove empty skb from write queue in error cases
From: Christoph Paasch @ 2019-09-11 17:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Soheil Hassas Yeganeh, Neal Cardwell,
	Eric Dumazet, Jason Baron, Vladimir Rutsky
In-Reply-To: <20190826161915.81676-1-edumazet@google.com>

Hello,

On Mon, Aug 26, 2019 at 11:04 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Vladimir Rutsky reported stuck TCP sessions after memory pressure
> events. Edge Trigger epoll() user would never receive an EPOLLOUT
> notification allowing them to retry a sendmsg().
>
> Jason tested the case of sk_stream_alloc_skb() returning NULL,
> but there are other paths that could lead both sendmsg() and sendpage()
> to return -1 (EAGAIN), with an empty skb queued on the write queue.
>
> This patch makes sure we remove this empty skb so that
> Jason code can detect that the queue is empty, and
> call sk->sk_write_space(sk) accordingly.
>
> Fixes: ce5ec440994b ("tcp: ensure epoll edge trigger wakeup when write queue is empty")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jason Baron <jbaron@akamai.com>
> Reported-by: Vladimir Rutsky <rutsky@google.com>
> Cc: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> ---
>  net/ipv4/tcp.c | 30 ++++++++++++++++++++----------
>  1 file changed, 20 insertions(+), 10 deletions(-)

I got syzkaller complaining now on 4.14.143 with the following reproducer:

# {Threaded:true Collide:true Repeat:true RepeatTimes:0 Procs:1
Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false
UseTmpDir:false EnableCgroups:false EnableNetdev:false ResetNet:false
HandleSegv:false Repro:false Trace:false}
r0 = socket$inet_tcp(0x2, 0x1, 0x0)
setsockopt$inet_tcp_TCP_REPAIR(r0, 0x6, 0x13, &(0x7f0000000040)=0x1, 0x4)
setsockopt$inet_tcp_TCP_REPAIR_QUEUE(r0, 0x6, 0x14, &(0x7f00000012c0)=0x2, 0x4)
setsockopt$inet_tcp_int(r0, 0x6, 0x19, &(0x7f0000000000)=0x9, 0x4)
setsockopt$inet_tcp_TCP_MD5SIG(r0, 0x6, 0xe,
&(0x7f00000001c0)={@in={{0x2, 0x0, @empty}}, 0x0, 0x2, 0x0,
"c157cf4809151e5e89cfd6d934fbe981ec8ff6afc252ccf486c325c7ff3d35f3a89412a5cb6430e169092617df2ba65bf0ab844572e4e7dd4ece8ec1de5ac1ccd870067b018cb3b1f05f2391d872b67d"},
0xd8)
connect$inet(r0, &(0x7f0000000080)={0x2, 0x0, @dev={0xac, 0x14, 0x14,
0x1d}}, 0x10)
sendto(r0, 0x0, 0x87, 0x0, 0x0, 0x391)

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN PTI
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 2529 Comm: syz-executor709 Not tainted 4.14.143 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
task: ffff8880677fdc00 task.stack: ffff8880642b0000
RIP: 0010:tcp_sendmsg_locked+0x6b4/0x4390 net/ipv4/tcp.c:1350
RSP: 0018:ffff8880642bf718 EFLAGS: 00010206
RAX: 0000000000000014 RBX: 0000000000000087 RCX: ffff88806a794f50
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000a0
RBP: ffff8880642bfaa8 R08: 0000000000000006 R09: ffff8880677fe3a0
R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
R13: ffff88806a794f50 R14: ffff88806a794d00 R15: 0000000000000087
FS:  00007f644b697700(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffcd37370b0 CR3: 00000000679f2006 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 tcp_sendmsg+0x2a/0x40 net/ipv4/tcp.c:1533
 inet_sendmsg+0x173/0x4e0 net/ipv4/af_inet.c:784
 sock_sendmsg_nosec net/socket.c:646 [inline]
 sock_sendmsg+0xc3/0x100 net/socket.c:656
 SYSC_sendto+0x35d/0x5e0 net/socket.c:1766
 do_syscall_64+0x241/0x680 arch/x86/entry/common.c:292
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x7f644afc6469
RSP: 002b:00007f644b696f28 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000602130 RCX: 00007f644afc6469
RDX: 0000000000000087 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000602138 R08: 0000000000000000 R09: 0000000000000391
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000060213c
R13: 00007ffcd373700f R14: 00007f644b677000 R15: 0000000000000003
Code: 74 08 3c 03 0f 8e f1 32 00 00 8b 85 98 fd ff ff 89 85 60 fd ff
ff 48 8b 85 70 fd ff ff 48 8d b8 a0 00 00 00 48 89 f8 48 c1 e8 03 <42>
0f b6 04 20 84 c0 74 06 0f 8e d2 32 00 00 4c 8b bd 70 fd ff
RIP: tcp_sendmsg_locked+0x6b4/0x4390 net/ipv4/tcp.c:1350 RSP: ffff8880642bf718
---[ end trace 70f07f242cd3b9d8 ]---


It's because skb is NULL in tcp_sendmsg_locked at:
                  skb = tcp_write_queue_tail(sk);
                  if (tcp_send_head(sk)) {
                          if (skb->ip_summed == CHECKSUM_NONE)


I think we need this here on pre-rb-tree kernels :

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5ce069ce2a97..efe767e20d01 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -924,8 +924,7 @@ static void tcp_remove_empty_skb(struct sock *sk,
struct sk_buff *skb)
 {
  if (skb && !skb->len) {
  tcp_unlink_write_queue(skb, sk);
- if (tcp_write_queue_empty(sk))
- tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
+ tcp_check_send_head(sk, skb);
  sk_wmem_free_skb(sk, skb);
  }
 }

Does that look good?


Thanks,
Christoph

^ permalink raw reply related

* [PATCH bpf-next 3/3] samples/bpf: fix xdpsock l2fwd tx for unaligned mode
From: Ciara Loftus @ 2019-09-11 17:24 UTC (permalink / raw)
  To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jonathan.lemon
  Cc: bruce.richardson, bpf, intel-wired-lan, kevin.laatz, Ciara Loftus
In-Reply-To: <20190911172435.21042-1-ciara.loftus@intel.com>

Preserve the offset of the address of the received descriptor, and include
it in the address set for the tx descriptor, so the kernel can correctly
locate the start of the packet data.

Fixes: 03895e63ff97 ("samples/bpf: add buffer recycling for unaligned chunks to xdpsock")
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
 samples/bpf/xdpsock_user.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 102eace22956..df011ac33402 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -685,7 +685,7 @@ static void l2fwd(struct xsk_socket_info *xsk, struct pollfd *fds)
 	for (i = 0; i < rcvd; i++) {
 		u64 addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
 		u32 len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++)->len;
-		u64 orig = xsk_umem__extract_addr(addr);
+		u64 orig = addr;
 
 		addr = xsk_umem__add_offset_to_addr(addr);
 		char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 2/3] ixgbe: fix xdp handle calculations
From: Ciara Loftus @ 2019-09-11 17:24 UTC (permalink / raw)
  To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jonathan.lemon
  Cc: bruce.richardson, bpf, intel-wired-lan, kevin.laatz, Ciara Loftus
In-Reply-To: <20190911172435.21042-1-ciara.loftus@intel.com>

Commit 7cbbf9f1fa23 ("ixgbe: fix xdp handle calculations") reintroduced
the addition of the umem headroom to the xdp handle in the ixgbe_zca_free,
ixgbe_alloc_buffer_slow_zc and ixgbe_alloc_buffer_zc functions. However,
the headroom is already added to the handle in the function
ixgbe_run_xdp_zc. This commit removes the latter addition and fixes the
case where the headroom is non-zero.

Fixes: 7cbbf9f1fa23 ("ixgbe: fix xdp handle calculations")
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index ad802a8909e0..5ed8b5a257cf 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -145,7 +145,7 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
 {
 	struct xdp_umem *umem = rx_ring->xsk_umem;
 	int err, result = IXGBE_XDP_PASS;
-	u64 offset = umem->headroom;
+	u64 offset;
 	struct bpf_prog *xdp_prog;
 	struct xdp_frame *xdpf;
 	u32 act;
@@ -153,7 +153,7 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
 	rcu_read_lock();
 	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
-	offset += xdp->data - xdp->data_hard_start;
+	offset = xdp->data - xdp->data_hard_start;
 
 	xdp->handle = xsk_umem_adjust_offset(umem, xdp->handle, offset);
 
-- 
2.17.1


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox