Netdev List
 help / color / mirror / Atom feed
* [PATCH bpf] bpf: fix wrong helper enablement in cgroup local storage
From: Daniel Borkmann @ 2018-10-26 22:49 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann, Roman Gushchin

Commit cd3394317653 ("bpf: introduce the bpf_get_local_storage()
helper function") enabled the bpf_get_local_storage() helper also
for BPF program types where it does not make sense to use them.

They have been added both in sk_skb_func_proto() and sk_msg_func_proto()
even though both program types are not invoked in combination with
cgroups, and neither through BPF_PROG_RUN_ARRAY(). In the latter the
bpf_cgroup_storage_set() is set shortly before BPF program invocation.

Later, the helper bpf_get_local_storage() retrieves this prior set
up per-cpu pointer and hands the buffer to the BPF program. The map
argument in there solely retrieves the enum bpf_cgroup_storage_type
from a local storage map associated with the program and based on the
type returns either the global or per-cpu storage. However, there
is no specific association between the program's map and the actual
content in bpf_cgroup_storage[].

Meaning, any BPF program that would have been properly run from the
cgroup side through BPF_PROG_RUN_ARRAY() where bpf_cgroup_storage_set()
was performed, and that is later unloaded such that prog / maps are
teared down will cause a use after free if that pointer is retrieved
from programs that are not run through BPF_PROG_RUN_ARRAY() but have
the cgroup local storage helper enabled in their func proto.

Lets just remove it from the two sock_map program types to fix it.
Auditing through the types where this helper is enabled, it appears
that these are the only ones where it was mistakenly allowed.

Fixes: cd3394317653 ("bpf: introduce the bpf_get_local_storage() helper function")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Roman Gushchin <guro@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/filter.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 35c6933..4d7c511 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5264,8 +5264,6 @@ sk_msg_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_msg_pull_data_proto;
 	case BPF_FUNC_msg_push_data:
 		return &bpf_msg_push_data_proto;
-	case BPF_FUNC_get_local_storage:
-		return &bpf_get_local_storage_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -5296,8 +5294,6 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_sk_redirect_map_proto;
 	case BPF_FUNC_sk_redirect_hash:
 		return &bpf_sk_redirect_hash_proto;
-	case BPF_FUNC_get_local_storage:
-		return &bpf_get_local_storage_proto;
 #ifdef CONFIG_INET
 	case BPF_FUNC_sk_lookup_tcp:
 		return &bpf_sk_lookup_tcp_proto;
-- 
2.9.5

^ permalink raw reply related

* Re: [Intel-wired-lan] [RFC PATCH 1/4] ptp: add PTP_SYS_OFFSET_EXTENDED ioctl
From: Vinicius Costa Gomes @ 2018-10-26 22:16 UTC (permalink / raw)
  To: Miroslav Lichvar, netdev; +Cc: Richard Cochran, intel-wired-lan
In-Reply-To: <20181026162742.631-2-mlichvar@redhat.com>

Hi Miroslav,

Miroslav Lichvar <mlichvar@redhat.com> writes:

> The PTP_SYS_OFFSET ioctl, which can be used to measure the offset
> between a PHC and the system clock, includes the total time that the
> gettime64 function of a driver needs to read the PHC timestamp.
>
> This typically involves reading of multiple PCI registers (sometimes in
> multiple iterations) and the register that contains the lowest bits of
> the timestamp is not read in the middle between the two readings of the
> system clock. This asymmetry causes the measured offset to have a
> significant error.
>
> Introduce a new ioctl, driver function, and helper functions, which
> allow the reading of the lowest register to be isolated from the other
> readings in order to reduce the asymmetry. The ioctl and driver function
> return three timestamps for each measurement:
> - system time right before reading the lowest bits of the PHC timestamp
> - PHC time
> - system time immediately after reading the lowest bits of the PHC
>   timestamp

Cool stuff!

Just one little thing below. Feel free to add my ack to the series.

Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>

>
> Cc: Richard Cochran <richardcochran@gmail.com>
> Cc: Jacob Keller <jacob.e.keller@intel.com>
> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
> ---
>  drivers/ptp/ptp_chardev.c        | 39 ++++++++++++++++++++++++++++++++
>  include/linux/ptp_clock_kernel.h | 26 +++++++++++++++++++++
>  include/uapi/linux/ptp_clock.h   | 12 ++++++++++
>  3 files changed, 77 insertions(+)
>
> diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c
> index 2012551d93e0..1a04c437fd4f 100644
> --- a/drivers/ptp/ptp_chardev.c
> +++ b/drivers/ptp/ptp_chardev.c
> @@ -124,11 +124,13 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
>  	struct ptp_clock_caps caps;
>  	struct ptp_clock_request req;
>  	struct ptp_sys_offset *sysoff = NULL;
> +	struct ptp_sys_offset_extended *sysoff_extended = NULL;
>  	struct ptp_sys_offset_precise precise_offset;
>  	struct ptp_pin_desc pd;
>  	struct ptp_clock *ptp = container_of(pc, struct ptp_clock, clock);
>  	struct ptp_clock_info *ops = ptp->info;
>  	struct ptp_clock_time *pct;
> +	struct ptp_system_timestamp sts;
>  	struct timespec64 ts;
>  	struct system_device_crosststamp xtstamp;
>  	int enable, err = 0;
> @@ -211,6 +213,43 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
>  			err = -EFAULT;
>  		break;
>  
> +	case PTP_SYS_OFFSET_EXTENDED:
> +		if (!ptp->info->gettimex64) {
> +			err = -EOPNOTSUPP;
> +			break;
> +		}
> +		sysoff_extended = memdup_user((void __user *)arg,
> +					      sizeof(*sysoff_extended));

Looks like you forgot to free 'sysoff_extended', no? 

^ permalink raw reply

* [iproute2 PATCH v2] bridge: fix vlan show stats formatting
From: Tobias Jungel @ 2018-10-26 21:51 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20181022100001.3413fe1c@xeon-e3>

The output of -statistics vlan show was broken previous change for json
output. This aligns the format to vlan show.

v2: fixed too greedy deletion that caused a -Wmaybe-uninitialized

Signed-off-by: Tobias Jungel <tobias.jungel@gmail.com>
---
 bridge/vlan.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/bridge/vlan.c b/bridge/vlan.c
index a111d5e6..d075a42d 100644
--- a/bridge/vlan.c
+++ b/bridge/vlan.c
@@ -484,7 +484,7 @@ static void print_vlan_stats_attr(struct rtattr *attr, int ifindex)
 	rem = RTA_PAYLOAD(list);

 	ifname = ll_index_to_name(ifindex);
-	open_json_object(ifname);
+	open_vlan_port(ifindex);

 	print_color_string(PRINT_FP, COLOR_IFNAME,
 			   NULL, "%-16s", ifname);
@@ -505,8 +505,7 @@ static void print_vlan_stats_attr(struct rtattr *attr, int ifindex)

 		print_one_vlan_stats(vstats);
 	}
-	close_json_object();
-
+	close_vlan_port();
 }

 static int print_vlan_stats(struct nlmsghdr *n, void *arg)

^ permalink raw reply related

* Re: [PATCH net v3] net/ipv6: Add anycast addresses to a global hashtable
From: David Ahern @ 2018-10-26 21:44 UTC (permalink / raw)
  To: Jeff Barnhill, netdev; +Cc: davem, kuznet, yoshfuji
In-Reply-To: <20181026212242.9661-1-0xeffeff@gmail.com>

On 10/26/18 3:22 PM, Jeff Barnhill wrote:
> @@ -275,6 +356,13 @@ int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr)
>  		err = -ENOMEM;
>  		goto out;
>  	}
> +	err = ipv6_add_acaddr_hash(dev_net(idev->dev), addr);
> +	if (err) {
> +		fib6_info_release(f6i);
> +		fib6_info_release(f6i);
> +		kfree(aca);
> +		goto out;
> +	}

I think aca_put() makes this less confusing as it will do the
fib6_info_release(f6i) and kfree(aca);

^ permalink raw reply

* Re: [PATCH] net: allow traceroute with a specified interface in a vrf
From: David Ahern @ 2018-10-26 21:23 UTC (permalink / raw)
  To: Mike Manning, netdev
In-Reply-To: <20181026112435.12159-1-mmanning@vyatta.att-mail.com>

On 10/26/18 5:24 AM, Mike Manning wrote:
> Traceroute executed in a vrf succeeds if no device is given or if the
> vrf is given as the device, but fails if the interface is given as the
> device. This is for default UDP probes, it succeeds for TCP SYN or ICMP
> ECHO probes. As the skb bound dev is the interface and the sk dev is
> the vrf, sk lookup fails for ICMP_DEST_UNREACH and ICMP_TIME_EXCEEDED
> messages. The solution is for the secondary dev to be passed so that
> the interface is available for the device match to succeed, in the same
> way as is already done for non-error cases.
> 
> Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
> ---
>  net/ipv4/udp.c | 4 ++--
>  net/ipv6/udp.c | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 


Reviewed-by: David Ahern <dsahern@gmail.com>

^ permalink raw reply

* [PATCH] ptp: drop redundant kasprintf() to create worker name
From: Rasmus Villemoes @ 2018-10-26 21:22 UTC (permalink / raw)
  To: Richard Cochran; +Cc: Kees Cook, Rasmus Villemoes, netdev, linux-kernel

Building with -Wformat-nonliteral, gcc complains

drivers/ptp/ptp_clock.c: In function ‘ptp_clock_register’:
drivers/ptp/ptp_clock.c:239:26: warning: format not a string literal and no format arguments [-Wformat-nonliteral]
            worker_name : info->name);

kthread_create_worker takes fmt+varargs to set the name of the
worker, and that happens with a vsnprintf() to a stack buffer (that is
then copied into task_comm). So there's no reason not to just pass
"ptp%d", ptp->index to kthread_create_worker() and avoid the
intermediate worker_name variable.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
 drivers/ptp/ptp_clock.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/ptp/ptp_clock.c b/drivers/ptp/ptp_clock.c
index 7eacc1c4b3b1..5419a89d300e 100644
--- a/drivers/ptp/ptp_clock.c
+++ b/drivers/ptp/ptp_clock.c
@@ -232,12 +232,8 @@ struct ptp_clock *ptp_clock_register(struct ptp_clock_info *info,
 	init_waitqueue_head(&ptp->tsev_wq);
 
 	if (ptp->info->do_aux_work) {
-		char *worker_name = kasprintf(GFP_KERNEL, "ptp%d", ptp->index);
-
 		kthread_init_delayed_work(&ptp->aux_work, ptp_aux_kworker);
-		ptp->kworker = kthread_create_worker(0, worker_name ?
-						     worker_name : info->name);
-		kfree(worker_name);
+		ptp->kworker = kthread_create_worker(0, "ptp%d", ptp->index);
 		if (IS_ERR(ptp->kworker)) {
 			err = PTR_ERR(ptp->kworker);
 			pr_err("failed to create ptp aux_worker %d\n", err);
-- 
2.19.1.6.gbde171bbf5

^ permalink raw reply related

* [PATCH net v3] net/ipv6: Add anycast addresses to a global hashtable
From: Jeff Barnhill @ 2018-10-26 21:22 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuznet, yoshfuji, Jeff Barnhill
In-Reply-To: <CAL6e_pdbk9+5g-6dr54f9aT4v1RHV_thxM+zb4w41J0n_ydNcA@mail.gmail.com>

icmp6_send() function is expensive on systems with a large number of
interfaces. Every time it’s called, it has to verify that the source
address does not correspond to an existing anycast address by looping
through every device and every anycast address on the device.  This can
result in significant delays for a CPU when there are a large number of
neighbors and ND timers are frequently timing out and calling
neigh_invalidate().

Add anycast addresses to a global hashtable to allow quick searching for
matching anycast addresses.  This is based on inet6_addr_lst in addrconf.c.

Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
---
 include/net/addrconf.h |   2 +
 include/net/if_inet6.h |   8 ++++
 net/ipv6/af_inet6.c    |   5 ++
 net/ipv6/anycast.c     | 122 ++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 135 insertions(+), 2 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 14b789a123e7..799af1a037d1 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -317,6 +317,8 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 			 const struct in6_addr *addr);
 bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
 			     const struct in6_addr *addr);
+int anycast_init(void);
+void anycast_cleanup(void);
 
 /* Device notifier */
 int register_inet6addr_notifier(struct notifier_block *nb);
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index d7578cf49c3a..a445014b981d 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -142,6 +142,14 @@ struct ipv6_ac_socklist {
 	struct ipv6_ac_socklist *acl_next;
 };
 
+struct ipv6_ac_addrlist {
+	struct in6_addr		acal_addr;
+	possible_net_t		acal_pnet;
+	refcount_t		acal_users;
+	struct hlist_node	acal_lst; /* inet6_acaddr_lst */
+	struct rcu_head		rcu;
+};
+
 struct ifacaddr6 {
 	struct in6_addr		aca_addr;
 	struct fib6_info	*aca_rt;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 3f4d61017a69..ddc8a6dbfba2 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1001,6 +1001,9 @@ static int __init inet6_init(void)
 	err = ip6_flowlabel_init();
 	if (err)
 		goto ip6_flowlabel_fail;
+	err = anycast_init();
+	if (err)
+		goto anycast_fail;
 	err = addrconf_init();
 	if (err)
 		goto addrconf_fail;
@@ -1091,6 +1094,8 @@ static int __init inet6_init(void)
 ipv6_exthdrs_fail:
 	addrconf_cleanup();
 addrconf_fail:
+	anycast_cleanup();
+anycast_fail:
 	ip6_flowlabel_cleanup();
 ip6_flowlabel_fail:
 	ndisc_late_cleanup();
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 4e0ff7031edd..1040d08867ab 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -44,8 +44,22 @@
 
 #include <net/checksum.h>
 
+#define IN6_ADDR_HSIZE_SHIFT	8
+#define IN6_ADDR_HSIZE		BIT(IN6_ADDR_HSIZE_SHIFT)
+/*	anycast address hash table
+ */
+static struct hlist_head inet6_acaddr_lst[IN6_ADDR_HSIZE];
+static DEFINE_SPINLOCK(acaddr_hash_lock);
+
 static int ipv6_dev_ac_dec(struct net_device *dev, const struct in6_addr *addr);
 
+static u32 inet6_acaddr_hash(struct net *net, const struct in6_addr *addr)
+{
+	u32 val = ipv6_addr_hash(addr) ^ net_hash_mix(net);
+
+	return hash_32(val, IN6_ADDR_HSIZE_SHIFT);
+}
+
 /*
  *	socket join an anycast group
  */
@@ -204,6 +218,73 @@ void ipv6_sock_ac_close(struct sock *sk)
 	rtnl_unlock();
 }
 
+static struct ipv6_ac_addrlist *acal_alloc(struct net *net,
+					   const struct in6_addr *addr)
+{
+	struct ipv6_ac_addrlist *acal;
+
+	acal = kzalloc(sizeof(*acal), GFP_ATOMIC);
+	if (!acal)
+		return NULL;
+
+	acal->acal_addr = *addr;
+	write_pnet(&acal->acal_pnet, net);
+	refcount_set(&acal->acal_users, 1);
+	INIT_HLIST_NODE(&acal->acal_lst);
+
+	return acal;
+}
+
+static int ipv6_add_acaddr_hash(struct net *net, const struct in6_addr *addr)
+{
+	unsigned int hash = inet6_acaddr_hash(net, addr);
+	struct ipv6_ac_addrlist *acal;
+	int err = 0;
+
+	spin_lock(&acaddr_hash_lock);
+	hlist_for_each_entry(acal, &inet6_acaddr_lst[hash], acal_lst) {
+		if (!net_eq(read_pnet(&acal->acal_pnet), net))
+			continue;
+		if (ipv6_addr_equal(&acal->acal_addr, addr)) {
+			refcount_inc(&acal->acal_users);
+			goto out;
+		}
+	}
+
+	acal = acal_alloc(net, addr);
+	if (!acal) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	hlist_add_head_rcu(&acal->acal_lst, &inet6_acaddr_lst[hash]);
+
+out:
+	spin_unlock(&acaddr_hash_lock);
+	return err;
+}
+
+static void ipv6_del_acaddr_hash(struct net *net, const struct in6_addr *addr)
+{
+	unsigned int hash = inet6_acaddr_hash(net, addr);
+	struct ipv6_ac_addrlist *acal;
+
+	spin_lock(&acaddr_hash_lock);
+	hlist_for_each_entry(acal, &inet6_acaddr_lst[hash], acal_lst) {
+		if (!net_eq(read_pnet(&acal->acal_pnet), net))
+			continue;
+		if (ipv6_addr_equal(&acal->acal_addr, addr)) {
+			if (refcount_dec_and_test(&acal->acal_users)) {
+				hlist_del_init_rcu(&acal->acal_lst);
+				kfree_rcu(acal, rcu);
+			}
+			spin_unlock(&acaddr_hash_lock);
+			return;
+		}
+	}
+	spin_unlock(&acaddr_hash_lock);
+}
+
 static void aca_get(struct ifacaddr6 *aca)
 {
 	refcount_inc(&aca->aca_refcnt);
@@ -275,6 +356,13 @@ int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr)
 		err = -ENOMEM;
 		goto out;
 	}
+	err = ipv6_add_acaddr_hash(dev_net(idev->dev), addr);
+	if (err) {
+		fib6_info_release(f6i);
+		fib6_info_release(f6i);
+		kfree(aca);
+		goto out;
+	}
 
 	aca->aca_next = idev->ac_list;
 	idev->ac_list = aca;
@@ -324,6 +412,7 @@ int __ipv6_dev_ac_dec(struct inet6_dev *idev, const struct in6_addr *addr)
 		prev_aca->aca_next = aca->aca_next;
 	else
 		idev->ac_list = aca->aca_next;
+	ipv6_del_acaddr_hash(dev_net(idev->dev), &aca->aca_addr);
 	write_unlock_bh(&idev->lock);
 	addrconf_leave_solict(idev, &aca->aca_addr);
 
@@ -350,6 +439,8 @@ void ipv6_ac_destroy_dev(struct inet6_dev *idev)
 	write_lock_bh(&idev->lock);
 	while ((aca = idev->ac_list) != NULL) {
 		idev->ac_list = aca->aca_next;
+		ipv6_del_acaddr_hash(dev_net(idev->dev), &aca->aca_addr);
+
 		write_unlock_bh(&idev->lock);
 
 		addrconf_leave_solict(idev, &aca->aca_addr);
@@ -390,17 +481,23 @@ static bool ipv6_chk_acast_dev(struct net_device *dev, const struct in6_addr *ad
 bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 			 const struct in6_addr *addr)
 {
+	unsigned int hash = inet6_acaddr_hash(net, addr);
+	struct ipv6_ac_addrlist *acal;
 	bool found = false;
 
 	rcu_read_lock();
 	if (dev)
 		found = ipv6_chk_acast_dev(dev, addr);
 	else
-		for_each_netdev_rcu(net, dev)
-			if (ipv6_chk_acast_dev(dev, addr)) {
+		hlist_for_each_entry_rcu(acal, &inet6_acaddr_lst[hash],
+					 acal_lst) {
+			if (!net_eq(read_pnet(&acal->acal_pnet), net))
+				continue;
+			if (ipv6_addr_equal(&acal->acal_addr, addr)) {
 				found = true;
 				break;
 			}
+		}
 	rcu_read_unlock();
 	return found;
 }
@@ -539,4 +636,25 @@ void ac6_proc_exit(struct net *net)
 {
 	remove_proc_entry("anycast6", net->proc_net);
 }
+
+/*	Init / cleanup code
+ */
+int __init anycast_init(void)
+{
+	int i;
+
+	for (i = 0; i < IN6_ADDR_HSIZE; i++)
+		INIT_HLIST_HEAD(&inet6_acaddr_lst[i]);
+	return 0;
+}
+
+void anycast_cleanup(void)
+{
+	int i;
+
+	spin_lock(&acaddr_hash_lock);
+	for (i = 0; i < IN6_ADDR_HSIZE; i++)
+		WARN_ON(!hlist_empty(&inet6_acaddr_lst[i]));
+	spin_unlock(&acaddr_hash_lock);
+}
 #endif
-- 
2.14.1

^ permalink raw reply related

* Re: KASAN: slab-out-of-bounds Read in sctp_getsockopt
From: Xin Long @ 2018-10-27  5:58 UTC (permalink / raw)
  To: syzbot+5da0d0a72a9e7d791748
  Cc: davem, LKML, linux-sctp, Marcelo Ricardo Leitner, network dev,
	Neil Horman, syzkaller-bugs, Vlad Yasevich
In-Reply-To: <000000000000e80c930579245751@google.com>

On Sat, Oct 27, 2018 at 1:38 AM syzbot
<syzbot+5da0d0a72a9e7d791748@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    bd6bf7c10484 Merge tag 'pci-v4.20-changes' of git://git.ke..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=16fd6bcb400000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=2dd8629d56664133
> dashboard link: https://syzkaller.appspot.com/bug?extid=5da0d0a72a9e7d791748
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=16b3ea33400000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=17f9f1bd400000
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+5da0d0a72a9e7d791748@syzkaller.appspotmail.com
>
> ==================================================================
> BUG: KASAN: slab-out-of-bounds in sctp_getsockopt_pr_streamstatus
> net/sctp/socket.c:7174 [inline]
Forgot to change to use "&" in sctp_getsockopt_pr_streamstatus() in
"sctp: get pr_assoc and pr_stream all status with SCTP_PR_SCTP_ALL instead"

@@ -7158,7 +7158,7 @@ static int
sctp_getsockopt_pr_streamstatus(struct sock *sk, int len,
                goto out;
        }

-       if (policy == SCTP_PR_SCTP_ALL) {
+       if (policy & SCTP_PR_SCTP_ALL) {

> BUG: KASAN: slab-out-of-bounds in sctp_getsockopt+0x7516/0x7cc2
> net/sctp/socket.c:7582
> Read of size 8 at addr ffff8801d89f0968 by task syz-executor278/5330
>
> CPU: 1 PID: 5330 Comm: syz-executor278 Not tainted 4.19.0+ #303
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0x244/0x39d lib/dump_stack.c:113
>   print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
>   kasan_report_error mm/kasan/report.c:354 [inline]
>   kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
>   __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
>   sctp_getsockopt_pr_streamstatus net/sctp/socket.c:7174 [inline]
>   sctp_getsockopt+0x7516/0x7cc2 net/sctp/socket.c:7582
>   sock_common_getsockopt+0x9a/0xe0 net/core/sock.c:2937
>   __sys_getsockopt+0x1ad/0x390 net/socket.c:1939
>   __do_sys_getsockopt net/socket.c:1950 [inline]
>   __se_sys_getsockopt net/socket.c:1947 [inline]
>   __x64_sys_getsockopt+0xbe/0x150 net/socket.c:1947
>   do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x445789
> Code: e8 6c b6 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7
> 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
> ff 0f 83 2b 12 fc ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:00007effdb293db8 EFLAGS: 00000246 ORIG_RAX: 0000000000000037
> RAX: ffffffffffffffda RBX: 00000000006dac48 RCX: 0000000000445789
> RDX: 0000000000000074 RSI: 0000000000000084 RDI: 0000000000000003
> RBP: 00000000006dac40 R08: 0000000020000040 R09: 0000000000000000
> R10: 0000000020000080 R11: 0000000000000246 R12: 00000000006dac4c
> R13: 00007ffcfc408c6f R14: 00007effdb2949c0 R15: 00000000006dad2c
>
> Allocated by task 5329:
>   save_stack+0x43/0xd0 mm/kasan/kasan.c:448
>   set_track mm/kasan/kasan.c:460 [inline]
>   kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
>   kmem_cache_alloc_trace+0x152/0x750 mm/slab.c:3620
>   kmalloc include/linux/slab.h:513 [inline]
>   kzalloc include/linux/slab.h:707 [inline]
>   sctp_stream_init_ext+0x4f/0xf0 net/sctp/stream.c:237
>   sctp_sendmsg_to_asoc+0x1308/0x1a20 net/sctp/socket.c:1896
>   sctp_sendmsg+0x13c2/0x1da0 net/sctp/socket.c:2113
>   inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
>   sock_sendmsg_nosec net/socket.c:621 [inline]
>   sock_sendmsg+0xd5/0x120 net/socket.c:631
>   __sys_sendto+0x3d7/0x670 net/socket.c:1788
>   __do_sys_sendto net/socket.c:1800 [inline]
>   __se_sys_sendto net/socket.c:1796 [inline]
>   __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1796
>   do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> Freed by task 3223:
>   save_stack+0x43/0xd0 mm/kasan/kasan.c:448
>   set_track mm/kasan/kasan.c:460 [inline]
>   __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
>   kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
>   __cache_free mm/slab.c:3498 [inline]
>   kfree+0xcf/0x230 mm/slab.c:3813
>   kzfree+0x28/0x30 mm/slab_common.c:1543
>   aa_free_file_ctx security/apparmor/include/file.h:76 [inline]
>   apparmor_file_free_security+0x133/0x1a0 security/apparmor/lsm.c:448
>   security_file_free+0x4a/0x80 security/security.c:900
>   file_free fs/file_table.c:54 [inline]
>   __fput+0x4e8/0xa30 fs/file_table.c:294
>   ____fput+0x15/0x20 fs/file_table.c:309
>   task_work_run+0x1e8/0x2a0 kernel/task_work.c:113
>   tracehook_notify_resume include/linux/tracehook.h:188 [inline]
>   exit_to_usermode_loop+0x318/0x380 arch/x86/entry/common.c:166
>   prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
>   syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
>   do_syscall_64+0x6be/0x820 arch/x86/entry/common.c:293
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> The buggy address belongs to the object at ffff8801d89f0900
>   which belongs to the cache kmalloc-96 of size 96
> The buggy address is located 8 bytes to the right of
>   96-byte region [ffff8801d89f0900, ffff8801d89f0960)
> The buggy address belongs to the page:
> page:ffffea0007627c00 count:1 mapcount:0 mapping:ffff8801da8004c0 index:0x0
> flags: 0x2fffc0000000100(slab)
> raw: 02fffc0000000100 ffffea0007646748 ffffea0007613488 ffff8801da8004c0
> raw: 0000000000000000 ffff8801d89f0000 0000000100000020 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>   ffff8801d89f0800: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>   ffff8801d89f0880: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> > ffff8801d89f0900: 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc
>                                                            ^
>   ffff8801d89f0980: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>   ffff8801d89f0a00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> ==================================================================
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
> syzbot.
> syzbot can test patches for this bug, for details see:
> https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* [PATCH] Fix ss Netid column and Local/Peer_Address
From: Yoann P. @ 2018-10-26 20:53 UTC (permalink / raw)
  To: netdev; +Cc: yoann.p.public

When using ss -Hutn4 or -utn3, Netid and State columns are sometime merged, it 
can be confusing when trying to pipe into awk or column.
Details (before and after output) are available on this github issue: https://
github.com/shemminger/iproute2/issues/20

Signed-off-by: YoyPa <yoann.p.public@gmail.com>
---
 misc/ss.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index c8970438..5e46cc0e 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -144,9 +144,9 @@ static struct column columns[] = {
        { ALIGN_LEFT,   "State",                " ",    0, 0, 0 },
        { ALIGN_LEFT,   "Recv-Q",               " ",    0, 0, 0 },
        { ALIGN_LEFT,   "Send-Q",               " ",    0, 0, 0 },
-       { ALIGN_RIGHT,  "Local Address:",       " ",    0, 0, 0 },
+       { ALIGN_RIGHT,  "Local_Address:",       " ",    0, 0, 0 },
        { ALIGN_LEFT,   "Port",                 "",     0, 0, 0 },
-       { ALIGN_RIGHT,  "Peer Address:",        " ",    0, 0, 0 },
+       { ALIGN_RIGHT,  "Peer_Address:",        " ",    0, 0, 0 },
        { ALIGN_LEFT,   "Port",                 "",     0, 0, 0 },
        { ALIGN_LEFT,   "",                     "",     0, 0, 0 },
 };
@@ -1334,7 +1334,7 @@ static void sock_state_print(struct sockstat *s)
                out("`- %s", sctp_sstate_name[s->state]);
        } else {
                field_set(COL_NETID);
-               out("%s", sock_name);
+               out("%-6s", sock_name);
                field_set(COL_STATE);
                out("%s", sstate_name[s->state]);
        }
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH v2 00/17] octeontx2-af: NPC parser and NIX blocks initialization
From: Arnd Bergmann @ 2018-10-26 19:28 UTC (permalink / raw)
  To: Sunil Kovvuri; +Cc: David Miller, Networking, linux-soc, Sunil Goutham
In-Reply-To: <CA+sq2Cdia2ETBOEHo5mezWe6urrARNa2azbZnWfxUjmEbgyJgw@mail.gmail.com>

On Fri, Oct 26, 2018 at 6:33 PM Sunil Kovvuri <sunil.kovvuri@gmail.com> wrote:
> On Fri, Oct 26, 2018 at 9:56 PM Sunil Kovvuri <sunil.kovvuri@gmail.com> wrote:
> > On Fri, Oct 26, 2018 at 7:34 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > On 10/26/18, Sunil Kovvuri <sunil.kovvuri@gmail.com> wrote:
> > > > On Fri, Oct 26, 2018 at 6:24 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > >
> > > > As of now it's only mbox based configuration that is supported.
> > >
> > > Ok, thanks for the clarification.
> > >
> > > Does this mean that you intend to have user space tools that use
> > > the mbox based interface on VFIO devices to perform configuration
> > > for virtual network devices, or just that the configuration interface
> > > is something that needs to be designed later?
> > >
> >
> > No there is no need for any userspace tools.
> > It's the virtual network device's driver which will send commands
> > like resource allocation, configuration, stats retrieval to this
> > AF device via mbox interface.
> >
> > eg: A user using ethtool changes RSS settings for the network device,
> > network device's driver receives the data, prepares a mailbox command
> > sends it to this driver for configuring the same in HW.

Ok, that part is mostly fine, as within a given host you can just have
multiple network interfaces that you can each configure independently,
and the mailbox interface for the most part is an implementation detail.

Doing the same in virtual machines means that the mailbox interface
becomes an ABI between the driver in the guest and the driver in the
host. This is still not too bad, in the worst case the guest might have
to detect the version of the host that's running and support both
an old and new version of the interface. There is reasonable hope
that you don't need to revise the interface here; it's not much different
from talking to firmware, and having both sides of the interface under
the control of Linux may in fact be better.

Once the interface gets exposed to stuff like ODP or DPDK, it effectively
becomes a user space interface, and that carries risk of being abused
for passing lots of other stuff, so this is the point where we have to be
very careful.

Aside from this, there is the stuff that Andrew mentioned, which is the
most important: For anything that should /not/ be controlled by a
network interface for itself, you still need an administrative interface.
An example of this would be creating additional virtual functions,
assigning bandwidth allocation between them, or limiting the
data that can be transferred to and from a virtual function.

Can you explain what your plan is to handle those?

> To be more clear there is no mbox 'interface' as such.
> Here PCI devices shares a memory region, one device prepares a command
> in this shared memory and writes into a doorbell kind of register which triggers
> an IRQ to other device. Which then takes the command processes it.

Yes, this part was already clear to me.

       Arnd

^ permalink raw reply

* Re: [PATCH 1/1] ipmr: Make cache queue length configurable
From: Stephen Hemminger @ 2018-10-26 18:24 UTC (permalink / raw)
  To: Brodie Greenfield
  Cc: davem, kuznet, yoshfuji, netdev, linux-kernel, chris.packham,
	luuk.paulussen
In-Reply-To: <20181026020219.22401-2-brodie.greenfield@alliedtelesis.co.nz>

On Fri, 26 Oct 2018 15:02:19 +1300
Brodie Greenfield <brodie.greenfield@alliedtelesis.co.nz> wrote:

> We want to be able to keep more spaces available in our queue for
> processing incoming multicast traffic (adding (S,G) entries) - this lets
> us learn more groups faster, rather than dropping them at this stage.
> 
> Signed-off-by: Brodie Greenfield <brodie.greenfield@alliedtelesis.co.nz>
> ---
>  Documentation/networking/ip-sysctl.txt | 7 +++++++
>  include/net/netns/ipv4.h               | 1 +
>  include/uapi/linux/sysctl.h            | 1 +
>  kernel/sysctl_binary.c                 | 1 +
>  net/ipv4/af_inet.c                     | 2 ++
>  net/ipv4/ipmr.c                        | 4 +++-
>  net/ipv4/sysctl_net_ipv4.c             | 7 +++++++
>  7 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 960de8fe3f40..dfc70ef6c42b 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -864,6 +864,13 @@ ip_local_reserved_ports - list of comma separated ranges
>  
>  	Default: Empty
>  
> +ip_mr_cache_queue_length - INTEGER
> +	Limit the number of multicast packets we can have in the queue to be
> +	resolved.
> +	Bear in mind that this causes an O(n) traversal of the same size when
> +	the queue is full. This should be considered if increasing.
> +	Default: 10
>

Thanks for updating documentation.  The second two sentences aren't clear.
Does it mean that setting queue length causes O(n) traversal or that each
multicast packet received causes O(n) traversal

>  ip_unprivileged_port_start - INTEGER
>  	This is a per-namespace sysctl.  It defines the first
>  	unprivileged port in the network namespace.  Privileged ports
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index e47503b4e4d1..1ca5cabe2d3b 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -184,6 +184,7 @@ struct netns_ipv4 {
>  	int sysctl_igmp_max_msf;
>  	int sysctl_igmp_llm_reports;
>  	int sysctl_igmp_qrv;
> +	int sysctl_ip_mr_cache_queue_length;

Maybe unsigned because negative value is not meaningful.

>  
>  	struct ping_group_range ping_group_range;
>  
> diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
> index d71013fffaf6..32e32d4904cd 100644
> --- a/include/uapi/linux/sysctl.h
> +++ b/include/uapi/linux/sysctl.h
> @@ -425,6 +425,7 @@ enum
>  	NET_TCP_ALLOWED_CONG_CONTROL=123,
>  	NET_TCP_MAX_SSTHRESH=124,
>  	NET_TCP_FRTO_RESPONSE=125,
> +	NET_IPV4_IP_MR_CACHE_QUEUE_LENGTH=126,
>  };

The numeric sysctl enum is considered deprecated, new sysctl's
need not be added here.

>  enum {
> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> index 07148b497451..8db94e8d97ed 100644
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -367,6 +367,7 @@ static const struct bin_table bin_net_ipv4_table[] = {
>  	{ CTL_INT,	NET_IPV4_LOCAL_PORT_RANGE,		"ip_local_port_range" },
>  	{ CTL_INT,	NET_IPV4_IGMP_MAX_MEMBERSHIPS,		"igmp_max_memberships" },
>  	{ CTL_INT,	NET_IPV4_IGMP_MAX_MSF,			"igmp_max_msf" },
> +	{ CTL_INT,	NET_IPV4_IP_MR_CACHE_QUEUE_LENGTH,	"ip_mr_cache_queue_length" },
>  	{ CTL_INT,	NET_IPV4_INET_PEER_THRESHOLD,		"inet_peer_threshold" },
>  	{ CTL_INT,	NET_IPV4_INET_PEER_MINTTL,		"inet_peer_minttl" },
>  	{ CTL_INT,	NET_IPV4_INET_PEER_MAXTTL,		"inet_peer_maxttl" },
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 1fbe2f815474..4b78d12aca36 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1818,6 +1818,8 @@ static __net_init int inet_init_net(struct net *net)
>  	net->ipv4.sysctl_igmp_llm_reports = 1;
>  	net->ipv4.sysctl_igmp_qrv = 2;
>  
> +	/* ipmr unresolved queue length max */
> +	net->ipv4.sysctl_ip_mr_cache_queue_length = 10;

Comment here is not necessary, is obvious.

>  	return 0;
>  }
>  
> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
> index 5660adcf7a04..2864f80e2f2a 100644
> --- a/net/ipv4/ipmr.c
> +++ b/net/ipv4/ipmr.c
> @@ -1128,6 +1128,7 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
>  	struct mfc_cache *c;
>  	bool found = false;
>  	int err;
> +	struct net *net = dev_net(dev);

The network layer coding style is to use reverse christmas tree
style declarations, move this up.

>  
>  	spin_lock_bh(&mfc_unres_lock);
>  	list_for_each_entry(c, &mrt->mfc_unres_queue, _c.list) {
> @@ -1140,7 +1141,8 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
>  
>  	if (!found) {
>  		/* Create a new entry if allowable */
> -		if (atomic_read(&mrt->cache_resolve_queue_len) >= 10 ||
> +		if (atomic_read(&mrt->cache_resolve_queue_len) >=
> +		    net->ipv4.sysctl_ip_mr_cache_queue_length ||
>  		    (c = ipmr_cache_alloc_unres()) == NULL) {
>  			spin_unlock_bh(&mfc_unres_lock);
>  
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 891ed2f91467..b249932ee24e 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -772,6 +772,13 @@ static struct ctl_table ipv4_net_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec
>  	},
> +	{
> +		.procname	= "ip_mr_cache_queue_length",
> +		.data		= &init_net.ipv4.sysctl_ip_mr_cache_queue_length,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec
> +	},
>  #ifdef CONFIG_IP_MULTICAST
>  	{
>  		.procname	= "igmp_qrv",

This sysctl is not needed if CONFIG_IP_MULTICAST is not defined.

^ permalink raw reply

* Re: [PATCH net] net: sched: Remove TCA_OPTIONS from policy
From: Cong Wang @ 2018-10-26 18:19 UTC (permalink / raw)
  To: Marco Berizzi; +Cc: dsahern, David Miller, Linux Kernel Network Developers
In-Reply-To: <1305358874.1795395.1540553653206@mail.libero.it>

On Fri, Oct 26, 2018 at 4:35 AM Marco Berizzi <pupilla@libero.it> wrote:
> Apologies for bothering you again.
> I applied your patch to 4.19, but after issuing this
> command:
>
> root@Calimero:~# tc qdisc add dev eth0 root handle 1:0 hfsc default 1
> root@Calimero:~# ping 10.81.104.1
> PING 10.81.104.1 (10.81.104.1) 56(84) bytes of data.
> ^C
> --- 10.81.104.1 ping statistics ---
> 2 packets transmitted, 0 received, 100% packet loss, time 1001ms
>
> I'm losing ipv4 connectivity.
> If I remove the qdisc everything is going to work again:

Did this really work before?

You specify a default class without adding it, so the packets are dropped.

How would you expect this to work? :)

^ permalink raw reply

* [RFT net-next] net: stmmac: Fix RX packet size > 8191
From: thor.thayer @ 2018-10-26 18:15 UTC (permalink / raw)
  To: peppe.cavallaro, alexandre.torgue, joabreu; +Cc: davem, netdev, Thor Thayer

From: Thor Thayer <thor.thayer@linux.intel.com>

Ping problems with packets > 8191 as shown:

PING 192.168.1.99 (192.168.1.99) 8150(8178) bytes of data.
8158 bytes from 192.168.1.99: icmp_seq=1 ttl=64 time=0.669 ms
wrong data byte 8144 should be 0xd0 but was 0x0
16    10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f
      20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f
%< ---------------snip--------------------------------------
8112  b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf
      c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf
8144  0 0 0 0 d0 d1
      ^^^^^^^
Notice the 4 bytes of 0 before the expected byte of d0.

Databook notes that the RX buffer must be a multiple of 4/8/16
bytes [1].

Add a new define for RX DMA Buffer size since the TX descriptors
don't have this limitation. Use this new define in all the RX
buffer setup and refill functions.
Also fixup the normal descriptor RX buffer size since that has
the same limitation.

[1] Synopsys DesignWare Cores Ethernet MAC Universal v3.70a
    [section 8.4.2 - Table 8-24]

[RFT] Request testing on a platform that has normal descriptors.

Tested on SoCFPGA Stratix10 with ping sweep from 100 to 8300 byte packets.

Fixes: 286a83721720 ("stmmac: add CHAINED descriptor mode support (V4)")
Suggested-by: Jose Abreu <jose.abreu@synopsys.com>
Signed-off-by: Thor Thayer <thor.thayer@linux.intel.com>
---
 drivers/net/ethernet/stmicro/stmmac/common.h      | 2 ++
 drivers/net/ethernet/stmicro/stmmac/descs_com.h   | 4 ++--
 drivers/net/ethernet/stmicro/stmmac/norm_desc.c   | 2 +-
 drivers/net/ethernet/stmicro/stmmac/ring_mode.c   | 8 ++++----
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 +-
 5 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index b1b305f8f414..ffc6b344a81c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -366,6 +366,8 @@ struct dma_features {
 /* GMAC TX FIFO is 8K, Rx FIFO is 16K */
 #define BUF_SIZE_16KiB 16384
 #define BUF_SIZE_8KiB 8192
+/* RX Buffer size must be < 8191 and multiple of 4/8/16 bytes */
+#define RX_BUF_SIZE_8KiB 8188
 #define BUF_SIZE_4KiB 4096
 #define BUF_SIZE_2KiB 2048
 
diff --git a/drivers/net/ethernet/stmicro/stmmac/descs_com.h b/drivers/net/ethernet/stmicro/stmmac/descs_com.h
index ca9d7e48034c..4043ef6e8698 100644
--- a/drivers/net/ethernet/stmicro/stmmac/descs_com.h
+++ b/drivers/net/ethernet/stmicro/stmmac/descs_com.h
@@ -31,7 +31,7 @@
 /* Enhanced descriptors */
 static inline void ehn_desc_rx_set_on_ring(struct dma_desc *p, int end)
 {
-	p->des1 |= cpu_to_le32(((BUF_SIZE_8KiB - 1)
+	p->des1 |= cpu_to_le32((RX_BUF_SIZE_8KiB
 			<< ERDES1_BUFFER2_SIZE_SHIFT)
 		   & ERDES1_BUFFER2_SIZE_MASK);
 
@@ -61,7 +61,7 @@ static inline void enh_set_tx_desc_len_on_ring(struct dma_desc *p, int len)
 /* Normal descriptors */
 static inline void ndesc_rx_set_on_ring(struct dma_desc *p, int end)
 {
-	p->des1 |= cpu_to_le32(((BUF_SIZE_2KiB - 1)
+	p->des1 |= cpu_to_le32((BUF_SIZE_2KiB
 				<< RDES1_BUFFER2_SIZE_SHIFT)
 		    & RDES1_BUFFER2_SIZE_MASK);
 
diff --git a/drivers/net/ethernet/stmicro/stmmac/norm_desc.c b/drivers/net/ethernet/stmicro/stmmac/norm_desc.c
index de65bb29feba..74a563682945 100644
--- a/drivers/net/ethernet/stmicro/stmmac/norm_desc.c
+++ b/drivers/net/ethernet/stmicro/stmmac/norm_desc.c
@@ -138,7 +138,7 @@ static void ndesc_init_rx_desc(struct dma_desc *p, int disable_rx_ic, int mode,
 			       int end)
 {
 	p->des0 |= cpu_to_le32(RDES0_OWN);
-	p->des1 |= cpu_to_le32((BUF_SIZE_2KiB - 1) & RDES1_BUFFER1_SIZE_MASK);
+	p->des1 |= cpu_to_le32(BUF_SIZE_2KiB & RDES1_BUFFER1_SIZE_MASK);
 
 	if (mode == STMMAC_CHAIN_MODE)
 		ndesc_rx_set_on_chain(p, end);
diff --git a/drivers/net/ethernet/stmicro/stmmac/ring_mode.c b/drivers/net/ethernet/stmicro/stmmac/ring_mode.c
index abc3f85270cd..09974a626b49 100644
--- a/drivers/net/ethernet/stmicro/stmmac/ring_mode.c
+++ b/drivers/net/ethernet/stmicro/stmmac/ring_mode.c
@@ -114,14 +114,14 @@ static void refill_desc3(void *priv_ptr, struct dma_desc *p)
 	struct stmmac_priv *priv = (struct stmmac_priv *)priv_ptr;
 
 	/* Fill DES3 in case of RING mode */
-	if (priv->dma_buf_sz >= BUF_SIZE_8KiB)
-		p->des3 = cpu_to_le32(le32_to_cpu(p->des2) + BUF_SIZE_8KiB);
+	if (priv->dma_buf_sz >= RX_BUF_SIZE_8KiB)
+		p->des3 = cpu_to_le32(le32_to_cpu(p->des2) + RX_BUF_SIZE_8KiB);
 }
 
 /* In ring mode we need to fill the desc3 because it is used as buffer */
 static void init_desc3(struct dma_desc *p)
 {
-	p->des3 = cpu_to_le32(le32_to_cpu(p->des2) + BUF_SIZE_8KiB);
+	p->des3 = cpu_to_le32(le32_to_cpu(p->des2) + RX_BUF_SIZE_8KiB);
 }
 
 static void clean_desc3(void *priv_ptr, struct dma_desc *p)
@@ -140,7 +140,7 @@ static void clean_desc3(void *priv_ptr, struct dma_desc *p)
 static int set_16kib_bfsize(int mtu)
 {
 	int ret = 0;
-	if (unlikely(mtu >= BUF_SIZE_8KiB))
+	if (unlikely(mtu > RX_BUF_SIZE_8KiB))
 		ret = BUF_SIZE_16KiB;
 	return ret;
 }
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 076a8be18d67..5e314050eb38 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1086,7 +1086,7 @@ static int stmmac_set_bfsize(int mtu, int bufsize)
 	int ret = bufsize;
 
 	if (mtu >= BUF_SIZE_4KiB)
-		ret = BUF_SIZE_8KiB;
+		ret = RX_BUF_SIZE_8KiB;
 	else if (mtu >= BUF_SIZE_2KiB)
 		ret = BUF_SIZE_4KiB;
 	else if (mtu > DEFAULT_BUFSIZE)
-- 
2.7.4

^ permalink raw reply related

* RE: [PATCH v2 net] igb: shorten maximum PHC timecounter update interval
From: Keller, Jacob E @ 2018-10-26 17:20 UTC (permalink / raw)
  To: Miroslav Lichvar, netdev@vger.kernel.org
  Cc: intel-wired-lan@lists.osuosl.org, Richard Cochran,
	Thomas Gleixner
In-Reply-To: <20181026171300.12720-1-mlichvar@redhat.com>

> -----Original Message-----
> From: Miroslav Lichvar [mailto:mlichvar@redhat.com]
> Sent: Friday, October 26, 2018 10:13 AM
> To: netdev@vger.kernel.org
> Cc: intel-wired-lan@lists.osuosl.org; Miroslav Lichvar <mlichvar@redhat.com>;
> Keller, Jacob E <jacob.e.keller@intel.com>; Richard Cochran
> <richardcochran@gmail.com>; Thomas Gleixner <tglx@linutronix.de>
> Subject: [PATCH v2 net] igb: shorten maximum PHC timecounter update interval
> 
> The timecounter needs to be updated at least once per ~550 seconds in
> order to avoid a 40-bit SYSTIM timestamp to be misinterpreted as an old
> timestamp.
> 
> Since commit 500462a9de65 ("timers: Switch to a non-cascading wheel"),
> scheduling of delayed work seems to be less accurate and a requested
> delay of 540 seconds may actually be longer than 550 seconds. Also, the
> PHC may be adjusted to run up to 6% faster than real time and the system
> clock up to 10% slower. Shorten the delay to 360 seconds to be sure the
> timecounter is updated in time.
> 
> This fixes an issue with HW timestamps on 82580/I350/I354 being off by
> ~1100 seconds for few seconds every ~9 minutes.
> 

Acked-by: Jacob Keller <jacob.e.keller@intel.com>

> Cc: Jacob Keller <jacob.e.keller@intel.com>
> Cc: Richard Cochran <richardcochran@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
> ---
>  drivers/net/ethernet/intel/igb/igb_ptp.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c
> b/drivers/net/ethernet/intel/igb/igb_ptp.c
> index 9f4d700e09df..2b95dc9c7a6a 100644
> --- a/drivers/net/ethernet/intel/igb/igb_ptp.c
> +++ b/drivers/net/ethernet/intel/igb/igb_ptp.c
> @@ -51,9 +51,17 @@
>   *
>   * The 40 bit 82580 SYSTIM overflows every
>   *   2^40 * 10^-9 /  60  = 18.3 minutes.
> + *
> + * SYSTIM is converted to real time using a timecounter. As
> + * timecounter_cyc2time() allows old timestamps, the timecounter needs
> + * to be updated at least once per half of the SYSTIM interval.
> + * Scheduling of delayed work is not very accurate, and also the NIC
> + * clock can be adjusted to run up to 6% faster and the system clock
> + * up to 10% slower, so we aim for 6 minutes to be sure the actual
> + * interval in the NIC time is shorter than 9.16 minutes.
>   */
> 
> -#define IGB_SYSTIM_OVERFLOW_PERIOD	(HZ * 60 * 9)
> +#define IGB_SYSTIM_OVERFLOW_PERIOD	(HZ * 60 * 6)
>  #define IGB_PTP_TX_TIMEOUT		(HZ * 15)
>  #define INCPERIOD_82576			BIT(E1000_TIMINCA_16NS_SHIFT)
>  #define INCVALUE_82576_MASK
> 	GENMASK(E1000_TIMINCA_16NS_SHIFT - 1, 0)
> --
> 2.17.2

^ permalink raw reply

* Re: [PATCH v2 00/17] octeontx2-af: NPC parser and NIX blocks initialization
From: David Miller @ 2018-10-26 17:19 UTC (permalink / raw)
  To: arnd; +Cc: sunil.kovvuri, netdev, linux-soc, sgoutham
In-Reply-To: <CAK8P3a3qL15dmeRYX7W2EC5bGO1qFCXEDRj+xTcGt8Fakwc3vw@mail.gmail.com>

From: Arnd Bergmann <arnd@arndb.de>
Date: Fri, 26 Oct 2018 16:04:44 +0200

> I fear that setting a precedent of using the mbox for user-level
> configuration management would mean that we would have to
> treat each of these interfaces as an ABI, which in turn requires
> much deeper review as well as raising the fundamental question
> on how this should be done across drivers. The mailbox interface
> seem inherently nonportable to other hardware here, which is
> a significant downside.

+1

^ permalink raw reply

* Re: [PATCH v2 00/17] octeontx2-af: NPC parser and NIX blocks initialization
From: Andrew Lunn @ 2018-10-26 17:16 UTC (permalink / raw)
  To: Sunil Kovvuri
  Cc: Arnd Bergmann, David S. Miller, Linux Netdev List, linux-soc,
	Sunil Goutham
In-Reply-To: <CA+sq2CcuUYNoFHaAXGYMSTUqZy9LQT+Xk+g+zVCRTSDUcw27sQ@mail.gmail.com>

> No there is no need for any userspace tools.

Cool.

The problem with the Freescale architecture is that you seem to need
to provision the hardware. Tell it how many instances of various
things to create, and how to logically connect them together. They
have a user space tool to do this, in an opaque way.

But it sounds like you don't need anything like this.

    Andrew

^ permalink raw reply

* [PATCH v2 net] igb: shorten maximum PHC timecounter update interval
From: Miroslav Lichvar @ 2018-10-26 17:13 UTC (permalink / raw)
  To: netdev
  Cc: intel-wired-lan, Miroslav Lichvar, Jacob Keller, Richard Cochran,
	Thomas Gleixner

The timecounter needs to be updated at least once per ~550 seconds in
order to avoid a 40-bit SYSTIM timestamp to be misinterpreted as an old
timestamp.

Since commit 500462a9de65 ("timers: Switch to a non-cascading wheel"),
scheduling of delayed work seems to be less accurate and a requested
delay of 540 seconds may actually be longer than 550 seconds. Also, the
PHC may be adjusted to run up to 6% faster than real time and the system
clock up to 10% slower. Shorten the delay to 360 seconds to be sure the
timecounter is updated in time.

This fixes an issue with HW timestamps on 82580/I350/I354 being off by
~1100 seconds for few seconds every ~9 minutes.

Cc: Jacob Keller <jacob.e.keller@intel.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
---
 drivers/net/ethernet/intel/igb/igb_ptp.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c b/drivers/net/ethernet/intel/igb/igb_ptp.c
index 9f4d700e09df..2b95dc9c7a6a 100644
--- a/drivers/net/ethernet/intel/igb/igb_ptp.c
+++ b/drivers/net/ethernet/intel/igb/igb_ptp.c
@@ -51,9 +51,17 @@
  *
  * The 40 bit 82580 SYSTIM overflows every
  *   2^40 * 10^-9 /  60  = 18.3 minutes.
+ *
+ * SYSTIM is converted to real time using a timecounter. As
+ * timecounter_cyc2time() allows old timestamps, the timecounter needs
+ * to be updated at least once per half of the SYSTIM interval.
+ * Scheduling of delayed work is not very accurate, and also the NIC
+ * clock can be adjusted to run up to 6% faster and the system clock
+ * up to 10% slower, so we aim for 6 minutes to be sure the actual
+ * interval in the NIC time is shorter than 9.16 minutes.
  */
 
-#define IGB_SYSTIM_OVERFLOW_PERIOD	(HZ * 60 * 9)
+#define IGB_SYSTIM_OVERFLOW_PERIOD	(HZ * 60 * 6)
 #define IGB_PTP_TX_TIMEOUT		(HZ * 15)
 #define INCPERIOD_82576			BIT(E1000_TIMINCA_16NS_SHIFT)
 #define INCVALUE_82576_MASK		GENMASK(E1000_TIMINCA_16NS_SHIFT - 1, 0)
-- 
2.17.2

^ permalink raw reply related

* Re: [PATCH net] net/neigh: fix NULL deref in pneigh_dump_table()
From: David Ahern @ 2018-10-26 17:07 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <20181026163327.219797-1-edumazet@google.com>

On 10/26/18 10:33 AM, Eric Dumazet wrote:
> pneigh can have NULL device pointer, so we need to make

hmmm.. missed that.

> neigh_master_filtered() and neigh_ifindex_filtered() more robust.
> 
...

> Fixes: 6f52f80e8530 ("net/neigh: Extend dump filter to proxy neighbor dumps")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: David Ahern <dsahern@gmail.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>
> ---
>  net/core/neighbour.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 

Thanks for the fix

Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>

^ permalink raw reply

* RE: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
From: Keller, Jacob E @ 2018-10-26 16:54 UTC (permalink / raw)
  To: Miroslav Lichvar, netdev@vger.kernel.org
  Cc: intel-wired-lan@lists.osuosl.org, Richard Cochran
In-Reply-To: <20181026162742.631-5-mlichvar@redhat.com>



> -----Original Message-----
> From: Miroslav Lichvar [mailto:mlichvar@redhat.com]
> Sent: Friday, October 26, 2018 9:28 AM
> To: netdev@vger.kernel.org
> Cc: intel-wired-lan@lists.osuosl.org; Richard Cochran <richardcochran@gmail.com>;
> Keller, Jacob E <jacob.e.keller@intel.com>; Miroslav Lichvar <mlichvar@redhat.com>
> Subject: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
> 
> Cc: Richard Cochran <richardcochran@gmail.com>
> Cc: Jacob Keller <jacob.e.keller@intel.com>
> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c | 57 ++++++++++++++++++++
>  1 file changed, 57 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
> index b3e0d8bb5cbd..d31e8d3effc7 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
> @@ -466,6 +466,60 @@ static int ixgbe_ptp_gettime(struct ptp_clock_info *ptp,
> struct timespec64 *ts)
>  	return 0;
>  }
> 
> +/**
> + * ixgbe_ptp_gettimex
> + * @ptp: the ptp clock structure
> + * @sts: structure to hold the system time before reading the PHC,
> + * the PHC timestamp, and system time after reading the PHC
> + *
> + * read the timecounter and return the correct value on ns,
> + * after converting it into a struct timespec.
> + */
> +static int ixgbe_ptp_gettimex(struct ptp_clock_info *ptp,
> +			      struct ptp_system_timestamp *sts)
> +{
> +	struct ixgbe_adapter *adapter =
> +		container_of(ptp, struct ixgbe_adapter, ptp_caps);
> +	struct ixgbe_hw *hw = &adapter->hw;
> +	unsigned long flags;
> +	struct timespec64 ts;
> +	u64 ns, stamp;
> +
> +	spin_lock_irqsave(&adapter->tmreg_lock, flags);
> +
> +	switch (adapter->hw.mac.type) {
> +	case ixgbe_mac_X550:
> +	case ixgbe_mac_X550EM_x:
> +	case ixgbe_mac_x550em_a:
> +		/* Upper 32 bits represent billions of cycles, lower 32 bits
> +		 * represent cycles. However, we use timespec64_to_ns for the
> +		 * correct math even though the units haven't been corrected
> +		 * yet.
> +		 */
> +		ptp_read_system_prets(sts);
> +		IXGBE_READ_REG(hw, IXGBE_SYSTIMR);
> +		ptp_read_system_postts(sts);
> +		ts.tv_nsec = IXGBE_READ_REG(hw, IXGBE_SYSTIML);
> +		ts.tv_sec = IXGBE_READ_REG(hw, IXGBE_SYSTIMH);
> +		stamp = timespec64_to_ns(&ts);
> +		break;
> +	default:
> +		ptp_read_system_prets(sts);
> +		stamp = IXGBE_READ_REG(hw, IXGBE_SYSTIML);
> +		ptp_read_system_postts(sts);
> +		stamp |= (u64)IXGBE_READ_REG(hw, IXGBE_SYSTIMH) << 32;
> +		break;
> +	}
> +
> +	ns = timecounter_cyc2time(&adapter->hw_tc, stamp);
> +
> +	spin_unlock_irqrestore(&adapter->tmreg_lock, flags);
> +
> +	sts->phc_ts = ns_to_timespec64(ns);
> +
> +	return 0;
> +}
> +


What about replacing gettime64 with:

static int ixgbe_ptp_gettimex(struct ptp_clock_info *ptp, struct timespec64 *ts)
{
    struct ptp_system_timestamp sts
    
    ixgbe_ptp_gettimex(ptp, &tst);
    *ts = sts.phc_ts
}

Actually, could that even just be provided by the PTP core if gettime64 isn't implemented? This way new drivers only have to implement the new interface, and userspace will just get the old behavior if they use the old call?

Thanks,
Jake

>  /**
>   * ixgbe_ptp_settime
>   * @ptp: the ptp clock structure
> @@ -1217,6 +1271,7 @@ static long ixgbe_ptp_create_clock(struct ixgbe_adapter
> *adapter)
>  		adapter->ptp_caps.adjfreq = ixgbe_ptp_adjfreq_82599;
>  		adapter->ptp_caps.adjtime = ixgbe_ptp_adjtime;
>  		adapter->ptp_caps.gettime64 = ixgbe_ptp_gettime;
> +		adapter->ptp_caps.gettimex64 = ixgbe_ptp_gettimex;
>  		adapter->ptp_caps.settime64 = ixgbe_ptp_settime;
>  		adapter->ptp_caps.enable = ixgbe_ptp_feature_enable;
>  		adapter->ptp_setup_sdp = ixgbe_ptp_setup_sdp_x540;
> @@ -1234,6 +1289,7 @@ static long ixgbe_ptp_create_clock(struct ixgbe_adapter
> *adapter)
>  		adapter->ptp_caps.adjfreq = ixgbe_ptp_adjfreq_82599;
>  		adapter->ptp_caps.adjtime = ixgbe_ptp_adjtime;
>  		adapter->ptp_caps.gettime64 = ixgbe_ptp_gettime;
> +		adapter->ptp_caps.gettimex64 = ixgbe_ptp_gettimex;
>  		adapter->ptp_caps.settime64 = ixgbe_ptp_settime;
>  		adapter->ptp_caps.enable = ixgbe_ptp_feature_enable;
>  		break;
> @@ -1250,6 +1306,7 @@ static long ixgbe_ptp_create_clock(struct ixgbe_adapter
> *adapter)
>  		adapter->ptp_caps.adjfreq = ixgbe_ptp_adjfreq_X550;
>  		adapter->ptp_caps.adjtime = ixgbe_ptp_adjtime;
>  		adapter->ptp_caps.gettime64 = ixgbe_ptp_gettime;
> +		adapter->ptp_caps.gettimex64 = ixgbe_ptp_gettimex;
>  		adapter->ptp_caps.settime64 = ixgbe_ptp_settime;
>  		adapter->ptp_caps.enable = ixgbe_ptp_feature_enable;
>  		adapter->ptp_setup_sdp = NULL;
> --
> 2.17.2

^ permalink raw reply

* RE: [RFC PATCH 0/4] More accurate PHC<->system clock synchronization
From: Keller, Jacob E @ 2018-10-26 16:51 UTC (permalink / raw)
  To: Miroslav Lichvar, netdev@vger.kernel.org
  Cc: intel-wired-lan@lists.osuosl.org, Richard Cochran
In-Reply-To: <20181026162742.631-1-mlichvar@redhat.com>

> -----Original Message-----
> From: Miroslav Lichvar [mailto:mlichvar@redhat.com]
> Sent: Friday, October 26, 2018 9:28 AM
> To: netdev@vger.kernel.org
> Cc: intel-wired-lan@lists.osuosl.org; Richard Cochran <richardcochran@gmail.com>;
> Keller, Jacob E <jacob.e.keller@intel.com>; Miroslav Lichvar <mlichvar@redhat.com>
> Subject: [RFC PATCH 0/4] More accurate PHC<->system clock synchronization
> 

I read the whole series, and it looks good to me.

Acked-by: Jacob Keller <jacob.e.keller@intel.com>

> There is some duplication of code in the igb and ixgbe drivers, which I
> don't like very much, but I thought it's better than extending and
> wrapping the existing functions like in the e1000e driver. Also, mixing
> SYSTIM and "system time" in the code will probably be confusing.
> 
> I wasn't able to find a better name for the ioctl, the structures, and
> the driver function. If anyone has suggestions, please let me know.
> 

Hmm.. Yea, I don't really have better names either.

> Miroslav Lichvar (4):
>   ptp: add PTP_SYS_OFFSET_EXTENDED ioctl
>   e1000e: add support for extended PHC gettime
>   igb: add support for extended PHC gettime
>   ixgbe: add support for extended PHC gettime
> 
>  drivers/net/ethernet/intel/e1000e/e1000.h    |  3 ++
>  drivers/net/ethernet/intel/e1000e/netdev.c   | 48 +++++++++++++----
>  drivers/net/ethernet/intel/e1000e/ptp.c      | 21 ++++++++
>  drivers/net/ethernet/intel/igb/igb_ptp.c     | 43 +++++++++++++++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c | 57 ++++++++++++++++++++
>  drivers/ptp/ptp_chardev.c                    | 39 ++++++++++++++
>  include/linux/ptp_clock_kernel.h             | 26 +++++++++
>  include/uapi/linux/ptp_clock.h               | 12 +++++
>  8 files changed, 239 insertions(+), 10 deletions(-)
> 
> --
> 2.17.2

^ permalink raw reply

* RE: [RFC PATCH 0/4] More accurate PHC<->system clock synchronization
From: Keller, Jacob E @ 2018-10-26 16:40 UTC (permalink / raw)
  To: Miroslav Lichvar, netdev@vger.kernel.org
  Cc: intel-wired-lan@lists.osuosl.org, Richard Cochran
In-Reply-To: <20181026162742.631-1-mlichvar@redhat.com>

Hi Miroslav,

> -----Original Message-----
> From: Miroslav Lichvar [mailto:mlichvar@redhat.com]
> Sent: Friday, October 26, 2018 9:28 AM
> To: netdev@vger.kernel.org
> Cc: intel-wired-lan@lists.osuosl.org; Richard Cochran <richardcochran@gmail.com>;
> Keller, Jacob E <jacob.e.keller@intel.com>; Miroslav Lichvar <mlichvar@redhat.com>
> Subject: [RFC PATCH 0/4] More accurate PHC<->system clock synchronization
> 
> This series adds support for a more accurate synchronization between a
> PTP hardware clock and the system clock.
> 
> The first patch adds an extended version of the PTP_SYS_OFFSET ioctl,
> which returns three timestamps for each measurement. The idea is to
> shorten the interval between the system timestamps to contain just the
> reading of the lowest register of the PHC in order to reduce the error
> in the measured offset and give a better bound on the maximum error.
> 
> The other patches add support for the new ioctl to the e1000e, igb,
> and ixgbe driver. Tests with few different NICs in different machines
> (and PCIe slots) show that:
> - with an I219 (e1000e) the measured delay improved from 2500 to 1300 ns
>   and the error in the measured offset, when compared to cross
>   timestamping, was reduced by a factor of 5
> - with an I210 (igb) the delay improved from 5100 to 1700 ns
> - with an I350 (igb) the delay improved from 2300 to 750 ns
> - with an X550 (ixgbe) the delay improved from 1950 to 650 ns
> 

That is some very significant improvements! Excellent find.

> There is some duplication of code in the igb and ixgbe drivers, which I
> don't like very much, but I thought it's better than extending and
> wrapping the existing functions like in the e1000e driver. Also, mixing
> SYSTIM and "system time" in the code will probably be confusing.
> 

Yea...

> I wasn't able to find a better name for the ioctl, the structures, and
> the driver function. If anyone has suggestions, please let me know.
> 

I don't have any good suggestions yet. I'll reply after reviewing if I think of any.

Thanks,
Jake

^ permalink raw reply

* Re: [PATCH v2 00/17] octeontx2-af: NPC parser and NIX blocks initialization
From: Sunil Kovvuri @ 2018-10-26 16:33 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: David S. Miller, Linux Netdev List, linux-soc, Sunil Goutham
In-Reply-To: <CA+sq2CcuUYNoFHaAXGYMSTUqZy9LQT+Xk+g+zVCRTSDUcw27sQ@mail.gmail.com>

On Fri, Oct 26, 2018 at 9:56 PM Sunil Kovvuri <sunil.kovvuri@gmail.com> wrote:
>
> On Fri, Oct 26, 2018 at 7:34 PM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On 10/26/18, Sunil Kovvuri <sunil.kovvuri@gmail.com> wrote:
> > > On Fri, Oct 26, 2018 at 6:24 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > >>
> > >> I see this has been applied, but I'd still like to understand better how
> > >> the
> > >> configuration interface is expected to work once the driver is complete.
> > >>
> > >> In particular, so far the interfaces all assume that configuration is
> > >> done through the mailbox between PCI devices, which could be done
> > >> from a virtual machine kernel with access to PCI, or through the use
> > >> of VFIO from a user application.
> > >>
> > >> Is that the only method of configuring it that you support, or will there
> > >> also be a devlink based interface or something like that to configure
> > >> the aspects of a virtual device that should not be accessible to the
> > >> VF itself?
> > >>
> > >
> > >
> > > As of now it's only mbox based configuration that is supported.
> >
> > Ok, thanks for the clarification.
> >
> > Does this mean that you intend to have user space tools that use
> > the mbox based interface on VFIO devices to perform configuration
> > for virtual network devices, or just that the configuration interface
> > is something that needs to be designed later?
> >
>
> No there is no need for any userspace tools.
> It's the virtual network device's driver which will send commands
> like resource allocation, configuration, stats retrieval to this
> AF device via mbox interface.
>
> eg: A user using ethtool changes RSS settings for the network device,
> network device's driver receives the data, prepares a mailbox command
> sends it to this driver for configuring the same in HW.
>
> Thanks,
> Sunil.

To be more clear there is no mbox 'interface' as such.
Here PCI devices shares a memory region, one device prepares a command
in this shared memory and writes into a doorbell kind of register which triggers
an IRQ to other device. Which then takes the command processes it.

Thanks,
Sunil.

^ permalink raw reply

* [PATCH net] net/neigh: fix NULL deref in pneigh_dump_table()
From: Eric Dumazet @ 2018-10-26 16:33 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet, David Ahern

pneigh can have NULL device pointer, so we need to make
neigh_master_filtered() and neigh_ifindex_filtered() more robust.

syzbot report :

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 15867 Comm: syz-executor2 Not tainted 4.19.0+ #276
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:179 [inline]
RIP: 0010:list_empty include/linux/list.h:203 [inline]
RIP: 0010:netdev_master_upper_dev_get+0xa1/0x250 net/core/dev.c:6467
RSP: 0018:ffff8801bfaf7220 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000000001 RCX: ffffc90005e92000
RDX: 0000000000000016 RSI: ffffffff860b44d9 RDI: 0000000000000005
RBP: ffff8801bfaf72b0 R08: ffff8801c4c84080 R09: fffffbfff139a580
R10: fffffbfff139a580 R11: ffffffff89cd2c07 R12: 1ffff10037f5ee45
R13: 0000000000000000 R14: ffff8801bfaf7288 R15: 00000000000000b0
FS:  00007f65cc68d700(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b33a21000 CR3: 00000001c6116000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 neigh_master_filtered net/core/neighbour.c:2367 [inline]
 pneigh_dump_table net/core/neighbour.c:2456 [inline]
 neigh_dump_info+0x7a9/0x1ce0 net/core/neighbour.c:2577
 netlink_dump+0x606/0x1080 net/netlink/af_netlink.c:2244
 __netlink_dump_start+0x59a/0x7c0 net/netlink/af_netlink.c:2352
 netlink_dump_start include/linux/netlink.h:216 [inline]
 rtnetlink_rcv_msg+0x809/0xc20 net/core/rtnetlink.c:4898
 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4953
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x5a5/0x760 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0xa18/0xfc0 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:621 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:631
 sock_write_iter+0x35e/0x5c0 net/socket.c:900
 call_write_iter include/linux/fs.h:1808 [inline]
 new_sync_write fs/read_write.c:474 [inline]
 __vfs_write+0x6b8/0x9f0 fs/read_write.c:487
 vfs_write+0x1fc/0x560 fs/read_write.c:549
 ksys_write+0x101/0x260 fs/read_write.c:598
 __do_sys_write fs/read_write.c:610 [inline]
 __se_sys_write fs/read_write.c:607 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:607
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457569

Fixes: 6f52f80e8530 ("net/neigh: Extend dump filter to proxy neighbor dumps")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@gmail.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
---
 net/core/neighbour.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index ee605d9d8bd4ea0826d0c03aa189ccef4c5cdc61..41954e42a2ded2ac1770c4b107505a5eee57a2be 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2364,7 +2364,7 @@ static bool neigh_master_filtered(struct net_device *dev, int master_idx)
 	if (!master_idx)
 		return false;
 
-	master = netdev_master_upper_dev_get(dev);
+	master = dev ? netdev_master_upper_dev_get(dev) : NULL;
 	if (!master || master->ifindex != master_idx)
 		return true;
 
@@ -2373,7 +2373,7 @@ static bool neigh_master_filtered(struct net_device *dev, int master_idx)
 
 static bool neigh_ifindex_filtered(struct net_device *dev, int filter_idx)
 {
-	if (filter_idx && dev->ifindex != filter_idx)
+	if (filter_idx && (!dev || dev->ifindex != filter_idx))
 		return true;
 
 	return false;
-- 
2.19.1.568.g152ad8e336-goog

^ permalink raw reply related

* [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
From: Miroslav Lichvar @ 2018-10-26 16:27 UTC (permalink / raw)
  To: netdev; +Cc: intel-wired-lan, Richard Cochran, Jacob Keller, Miroslav Lichvar
In-Reply-To: <20181026162742.631-1-mlichvar@redhat.com>

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c | 57 ++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
index b3e0d8bb5cbd..d31e8d3effc7 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
@@ -466,6 +466,60 @@ static int ixgbe_ptp_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts)
 	return 0;
 }
 
+/**
+ * ixgbe_ptp_gettimex
+ * @ptp: the ptp clock structure
+ * @sts: structure to hold the system time before reading the PHC,
+ * the PHC timestamp, and system time after reading the PHC
+ *
+ * read the timecounter and return the correct value on ns,
+ * after converting it into a struct timespec.
+ */
+static int ixgbe_ptp_gettimex(struct ptp_clock_info *ptp,
+			      struct ptp_system_timestamp *sts)
+{
+	struct ixgbe_adapter *adapter =
+		container_of(ptp, struct ixgbe_adapter, ptp_caps);
+	struct ixgbe_hw *hw = &adapter->hw;
+	unsigned long flags;
+	struct timespec64 ts;
+	u64 ns, stamp;
+
+	spin_lock_irqsave(&adapter->tmreg_lock, flags);
+
+	switch (adapter->hw.mac.type) {
+	case ixgbe_mac_X550:
+	case ixgbe_mac_X550EM_x:
+	case ixgbe_mac_x550em_a:
+		/* Upper 32 bits represent billions of cycles, lower 32 bits
+		 * represent cycles. However, we use timespec64_to_ns for the
+		 * correct math even though the units haven't been corrected
+		 * yet.
+		 */
+		ptp_read_system_prets(sts);
+		IXGBE_READ_REG(hw, IXGBE_SYSTIMR);
+		ptp_read_system_postts(sts);
+		ts.tv_nsec = IXGBE_READ_REG(hw, IXGBE_SYSTIML);
+		ts.tv_sec = IXGBE_READ_REG(hw, IXGBE_SYSTIMH);
+		stamp = timespec64_to_ns(&ts);
+		break;
+	default:
+		ptp_read_system_prets(sts);
+		stamp = IXGBE_READ_REG(hw, IXGBE_SYSTIML);
+		ptp_read_system_postts(sts);
+		stamp |= (u64)IXGBE_READ_REG(hw, IXGBE_SYSTIMH) << 32;
+		break;
+	}
+
+	ns = timecounter_cyc2time(&adapter->hw_tc, stamp);
+
+	spin_unlock_irqrestore(&adapter->tmreg_lock, flags);
+
+	sts->phc_ts = ns_to_timespec64(ns);
+
+	return 0;
+}
+
 /**
  * ixgbe_ptp_settime
  * @ptp: the ptp clock structure
@@ -1217,6 +1271,7 @@ static long ixgbe_ptp_create_clock(struct ixgbe_adapter *adapter)
 		adapter->ptp_caps.adjfreq = ixgbe_ptp_adjfreq_82599;
 		adapter->ptp_caps.adjtime = ixgbe_ptp_adjtime;
 		adapter->ptp_caps.gettime64 = ixgbe_ptp_gettime;
+		adapter->ptp_caps.gettimex64 = ixgbe_ptp_gettimex;
 		adapter->ptp_caps.settime64 = ixgbe_ptp_settime;
 		adapter->ptp_caps.enable = ixgbe_ptp_feature_enable;
 		adapter->ptp_setup_sdp = ixgbe_ptp_setup_sdp_x540;
@@ -1234,6 +1289,7 @@ static long ixgbe_ptp_create_clock(struct ixgbe_adapter *adapter)
 		adapter->ptp_caps.adjfreq = ixgbe_ptp_adjfreq_82599;
 		adapter->ptp_caps.adjtime = ixgbe_ptp_adjtime;
 		adapter->ptp_caps.gettime64 = ixgbe_ptp_gettime;
+		adapter->ptp_caps.gettimex64 = ixgbe_ptp_gettimex;
 		adapter->ptp_caps.settime64 = ixgbe_ptp_settime;
 		adapter->ptp_caps.enable = ixgbe_ptp_feature_enable;
 		break;
@@ -1250,6 +1306,7 @@ static long ixgbe_ptp_create_clock(struct ixgbe_adapter *adapter)
 		adapter->ptp_caps.adjfreq = ixgbe_ptp_adjfreq_X550;
 		adapter->ptp_caps.adjtime = ixgbe_ptp_adjtime;
 		adapter->ptp_caps.gettime64 = ixgbe_ptp_gettime;
+		adapter->ptp_caps.gettimex64 = ixgbe_ptp_gettimex;
 		adapter->ptp_caps.settime64 = ixgbe_ptp_settime;
 		adapter->ptp_caps.enable = ixgbe_ptp_feature_enable;
 		adapter->ptp_setup_sdp = NULL;
-- 
2.17.2

^ permalink raw reply related

* [RFC PATCH 3/4] igb: add support for extended PHC gettime
From: Miroslav Lichvar @ 2018-10-26 16:27 UTC (permalink / raw)
  To: netdev; +Cc: intel-wired-lan, Richard Cochran, Jacob Keller, Miroslav Lichvar
In-Reply-To: <20181026162742.631-1-mlichvar@redhat.com>

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
---
 drivers/net/ethernet/intel/igb/igb_ptp.c | 43 ++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c b/drivers/net/ethernet/intel/igb/igb_ptp.c
index 29ced6b74d36..6294d18b5a60 100644
--- a/drivers/net/ethernet/intel/igb/igb_ptp.c
+++ b/drivers/net/ethernet/intel/igb/igb_ptp.c
@@ -310,6 +310,46 @@ static int igb_ptp_gettime_i210(struct ptp_clock_info *ptp,
 	return 0;
 }
 
+static int igb_ptp_gettimex(struct ptp_clock_info *ptp,
+			    struct ptp_system_timestamp *sts)
+{
+	struct igb_adapter *igb = container_of(ptp, struct igb_adapter,
+					       ptp_caps);
+	struct e1000_hw *hw = &igb->hw;
+	unsigned long flags;
+	u32 lo, hi;
+	u64 ns;
+
+	spin_lock_irqsave(&igb->tmreg_lock, flags);
+
+	/* 82576 doesn't have SYSTIMR */
+	if (igb->hw.mac.type == e1000_82576) {
+		ptp_read_system_prets(sts);
+		lo = rd32(E1000_SYSTIML);
+		ptp_read_system_postts(sts);
+		hi = rd32(E1000_SYSTIMH);
+	} else {
+		ptp_read_system_prets(sts);
+		rd32(E1000_SYSTIMR);
+		ptp_read_system_postts(sts);
+		lo = rd32(E1000_SYSTIML);
+		hi = rd32(E1000_SYSTIMH);
+	}
+
+	/* SYSTIM on I210/I211 counts time in seconds and nanoseconds */
+	if (igb->hw.mac.type == e1000_i210 || igb->hw.mac.type == e1000_i211) {
+		sts->phc_ts.tv_sec = hi;
+		sts->phc_ts.tv_nsec = lo;
+	} else {
+		ns = timecounter_cyc2time(&igb->tc, ((u64)hi << 32) | lo);
+		sts->phc_ts = ns_to_timespec64(ns);
+	}
+
+	spin_unlock_irqrestore(&igb->tmreg_lock, flags);
+
+	return 0;
+}
+
 static int igb_ptp_settime_82576(struct ptp_clock_info *ptp,
 				 const struct timespec64 *ts)
 {
@@ -1125,6 +1165,7 @@ void igb_ptp_init(struct igb_adapter *adapter)
 		adapter->ptp_caps.adjfreq = igb_ptp_adjfreq_82576;
 		adapter->ptp_caps.adjtime = igb_ptp_adjtime_82576;
 		adapter->ptp_caps.gettime64 = igb_ptp_gettime_82576;
+		adapter->ptp_caps.gettimex64 = igb_ptp_gettimex;
 		adapter->ptp_caps.settime64 = igb_ptp_settime_82576;
 		adapter->ptp_caps.enable = igb_ptp_feature_enable;
 		adapter->cc.read = igb_ptp_read_82576;
@@ -1144,6 +1185,7 @@ void igb_ptp_init(struct igb_adapter *adapter)
 		adapter->ptp_caps.adjfine = igb_ptp_adjfine_82580;
 		adapter->ptp_caps.adjtime = igb_ptp_adjtime_82576;
 		adapter->ptp_caps.gettime64 = igb_ptp_gettime_82576;
+		adapter->ptp_caps.gettimex64 = igb_ptp_gettimex;
 		adapter->ptp_caps.settime64 = igb_ptp_settime_82576;
 		adapter->ptp_caps.enable = igb_ptp_feature_enable;
 		adapter->cc.read = igb_ptp_read_82580;
@@ -1172,6 +1214,7 @@ void igb_ptp_init(struct igb_adapter *adapter)
 		adapter->ptp_caps.adjfine = igb_ptp_adjfine_82580;
 		adapter->ptp_caps.adjtime = igb_ptp_adjtime_i210;
 		adapter->ptp_caps.gettime64 = igb_ptp_gettime_i210;
+		adapter->ptp_caps.gettimex64 = igb_ptp_gettimex;
 		adapter->ptp_caps.settime64 = igb_ptp_settime_i210;
 		adapter->ptp_caps.enable = igb_ptp_feature_enable_i210;
 		adapter->ptp_caps.verify = igb_ptp_verify_pin;
-- 
2.17.2

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox