Netdev List


* Re: [patch net-next 3/3] net/sched: Change act_api and act_xxx modules to use IDR
From: Jamal Hadi Salim @ 2017-08-28 21:56 UTC
  To: Chris Mi, netdev; +Cc: xiyou.wangcong, jiri, davem, mawilcox
In-Reply-To: <1503902477-39829-4-git-send-email-chrism@mellanox.com>

On 17-08-28 02:41 AM, Chris Mi wrote:
> Typically, each TC filter has its own action. All the actions of the
> same type are saved in that type's hash table. But there are so few
> hash buckets that the table degrades to a list, and performance
> suffers greatly. For example, it takes about 0m11.914s to insert 64K
> rules. If we convert the hash table to IDR, it only takes about
> 0m1.500s. The improvement is huge.
> 
> But please note that the test result is based on the previous patch,
> which converts cls_flower to use IDR.
> 
> Signed-off-by: Chris Mi <chrism@mellanox.com>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal


* Re: [patch net-next 2/3] net/sched: Change cls_flower to use IDR
From: Jamal Hadi Salim @ 2017-08-28 21:55 UTC
  To: Chris Mi, netdev; +Cc: xiyou.wangcong, jiri, davem, mawilcox
In-Reply-To: <1503902477-39829-3-git-send-email-chrism@mellanox.com>

On 17-08-28 02:41 AM, Chris Mi wrote:
> Currently, all filters with the same priority are linked in a doubly
> linked list. Every filter should have a unique handle. To make the
> handle unique, we need to iterate the list every time to see if the
> handle exists or not when inserting a new filter. It is time-consuming.
> For example, it takes about 5m3.169s to insert 64K rules.
> 
> This patch changes cls_flower to use IDR. With this patch, it
> takes about 0m1.127s to insert 64K rules. The improvement is huge.
> 
> But please note that in this testing, all filters share the same action.
> If every filter has a unique action, that is another bottleneck.
> Follow-up patch in this patchset addresses that.
> 
> Signed-off-by: Chris Mi <chrism@mellanox.com>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

As Cong asked last time - any plans to add this to other classifiers?

cheers,
jamal
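
To illustrate the cost described in the commit message above, here is a
minimal userspace sketch. It is not the kernel IDR API and not the patch
itself: the struct and function names are hypothetical, and a flat array
stands in for the IDR (which is a tree-based structure in the kernel). It
only contrasts a per-insert list walk with an index keyed by handle.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical filter node; stands in for a cls_flower filter. */
struct filter {
	uint32_t handle;
	struct filter *next;
};

/* List approach: every insert walks the whole list to check uniqueness,
 * so inserting N filters costs O(N^2) comparisons overall. */
static bool list_handle_in_use(const struct filter *head, uint32_t handle)
{
	for (const struct filter *f = head; f; f = f->next)
		if (f->handle == handle)
			return true;
	return false;
}

/* Index approach: a table keyed by handle answers the same question
 * without scanning. The array size is an arbitrary choice for the sketch. */
#define MAX_HANDLES 65536u

struct handle_index {
	struct filter *slot[MAX_HANDLES];
};

static bool index_handle_in_use(const struct handle_index *idx, uint32_t handle)
{
	return handle < MAX_HANDLES && idx->slot[handle] != NULL;
}

/* Returns 0 on success, -1 if the handle is out of range or already taken. */
static int index_insert(struct handle_index *idx, struct filter *f)
{
	if (f->handle >= MAX_HANDLES || idx->slot[f->handle])
		return -1;
	idx->slot[f->handle] = f;
	return 0;
}
```

With the list, the uniqueness check grows with the number of filters already
installed; with the index, it stays constant per insert, which is consistent
with the drop in insertion time reported above.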


* [PATCH v2 23/30] net: Restrict unwhitelisted proto caches to size 0
From: Kees Cook @ 2017-08-28 21:35 UTC
  To: linux-kernel
  Cc: Kees Cook, David S. Miller, Eric Dumazet, Paolo Abeni,
	David Howells, netdev, linux-mm, kernel-hardening, David Windsor
In-Reply-To: <1503956111-36652-1-git-send-email-keescook@chromium.org>

Now that protocols have been annotated (the copy of icsk_ca_ops->name
is of an ops field from outside the slab cache):

$ git grep 'copy_.*_user.*sk.*->'
caif/caif_socket.c: copy_from_user(&cf_sk->conn_req.param.data, ov, ol)) {
ipv4/raw.c:   if (copy_from_user(&raw_sk(sk)->filter, optval, optlen))
ipv4/raw.c:       copy_to_user(optval, &raw_sk(sk)->filter, len))
ipv4/tcp.c:       if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
ipv4/tcp.c:       if (copy_to_user(optval, icsk->icsk_ulp_ops->name, len))
ipv6/raw.c:       if (copy_from_user(&raw6_sk(sk)->filter, optval, optlen))
ipv6/raw.c:           if (copy_to_user(optval, &raw6_sk(sk)->filter, len))
sctp/socket.c: if (copy_from_user(&sctp_sk(sk)->subscribe, optval, optlen))
sctp/socket.c: if (copy_to_user(optval, &sctp_sk(sk)->subscribe, len))
sctp/socket.c: if (copy_to_user(optval, &sctp_sk(sk)->initmsg, len))

we can switch the default proto usercopy region to size 0. Any protocols
needing to add whitelisted regions must annotate the fields with the
useroffset and usersize fields of struct proto.

This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 02dab98ca3e3..c7d0afa1d0b1 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3112,9 +3112,7 @@ int proto_register(struct proto *prot, int alloc_slab)
 		prot->slab = kmem_cache_create_usercopy(prot->name,
 					prot->obj_size, 0,
 					SLAB_HWCACHE_ALIGN | prot->slab_flags,
-					prot->usersize ? prot->useroffset : 0,
-					prot->usersize ? prot->usersize
-						       : prot->obj_size,
+					prot->useroffset, prot->usersize,
 					NULL);
 
 		if (prot->slab == NULL) {
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH v2 22/30] sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
From: Kees Cook @ 2017-08-28 21:35 UTC
  To: linux-kernel
  Cc: Kees Cook, David Windsor, Vlad Yasevich, Neil Horman,
	David S. Miller, linux-sctp, netdev, linux-mm, kernel-hardening
In-Reply-To: <1503956111-36652-1-git-send-email-keescook@chromium.org>

From: David Windsor <dave@nullcore.net>

The autoclose field can be copied with put_user(), so there is no need to
use copy_to_user(). In both cases, hardened usercopy is being bypassed
since the size is constant, and not open to runtime manipulation.

This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log]
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-sctp@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 net/sctp/socket.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index c8784cb216e4..a29e41e19d64 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4882,7 +4882,7 @@ static int sctp_getsockopt_autoclose(struct sock *sk, int len, char __user *optv
 	len = sizeof(int);
 	if (put_user(len, optlen))
 		return -EFAULT;
-	if (copy_to_user(optval, &sctp_sk(sk)->autoclose, sizeof(int)))
+	if (put_user(sctp_sk(sk)->autoclose, (int __user *)optval))
 		return -EFAULT;
 	return 0;
 }
-- 
2.7.4


* [PATCH v2 21/30] sctp: Define usercopy region in SCTP proto slab cache
From: Kees Cook @ 2017-08-28 21:35 UTC
  To: linux-kernel
  Cc: Kees Cook, David Windsor, Vlad Yasevich, Neil Horman,
	David S. Miller, linux-sctp, netdev, linux-mm, kernel-hardening
In-Reply-To: <1503956111-36652-1-git-send-email-keescook@chromium.org>

From: David Windsor <dave@nullcore.net>

The SCTP socket event notification subscription information needs to be
copied to/from userspace. In support of usercopy hardening, this patch
defines a region in the struct proto slab cache in which userspace copy
operations are allowed. Additionally, it moves the usercopy fields so that
they are adjacent, allowing a single region to cover both.

example usage trace:

    net/sctp/socket.c:
        sctp_getsockopt_events(...):
            ...
            copy_to_user(..., &sctp_sk(sk)->subscribe, len)

        sctp_setsockopt_events(...):
            ...
            copy_from_user(&sctp_sk(sk)->subscribe, ..., optlen)

        sctp_getsockopt_initmsg(...):
            ...
            copy_to_user(..., &sctp_sk(sk)->initmsg, len)

This region is known as the slab cache's usercopy region. Slab caches can
now check that each copy operation involving cache-managed memory falls
entirely within the slab's usercopy region.

This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: split from network patch, move struct member adjacent, provide usage]
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-sctp@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 include/net/sctp/structs.h | 9 +++++++--
 net/sctp/socket.c          | 4 ++++
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 5ab29af8ca8a..f1d7810e200e 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -202,12 +202,17 @@ struct sctp_sock {
 	/* Flags controlling Heartbeat, SACK delay, and Path MTU Discovery. */
 	__u32 param_flags;
 
-	struct sctp_initmsg initmsg;
 	struct sctp_rtoinfo rtoinfo;
 	struct sctp_paddrparams paddrparam;
-	struct sctp_event_subscribe subscribe;
 	struct sctp_assocparams assocparams;
 
+	/*
+	 * These two structures must be grouped together for the usercopy
+	 * whitelist region.
+	 */
+	struct sctp_event_subscribe subscribe;
+	struct sctp_initmsg initmsg;
+
 	int user_frag;
 
 	__u32 autoclose;
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 1db478e34520..c8784cb216e4 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -8235,6 +8235,10 @@ struct proto sctp_prot = {
 	.unhash      =	sctp_unhash,
 	.get_port    =	sctp_get_port,
 	.obj_size    =  sizeof(struct sctp_sock),
+	.useroffset  =  offsetof(struct sctp_sock, subscribe),
+	.usersize    =  offsetof(struct sctp_sock, initmsg) -
+				offsetof(struct sctp_sock, subscribe) +
+				sizeof_field(struct sctp_sock, initmsg),
 	.sysctl_mem  =  sysctl_sctp_mem,
 	.sysctl_rmem =  sysctl_sctp_rmem,
 	.sysctl_wmem =  sysctl_sctp_wmem,
-- 
2.7.4
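
The usersize expression in the diff above spans two adjacent fields. A
minimal userspace sketch of the same arithmetic follows; the mock struct,
its field sizes, and the field_size() macro are assumptions made for
illustration (field_size() stands in for the kernel's sizeof_field()).

```c
#include <stddef.h>
#include <stdint.h>

/* Mock structure: field names mirror the patch, sizes are arbitrary. */
struct mock_sock {
	uint32_t param_flags;
	char     subscribe[13];	/* first whitelisted field */
	uint16_t initmsg[4];	/* last whitelisted field, may follow padding */
	int      user_frag;
};

/* Userspace stand-in for the kernel's sizeof_field(). */
#define field_size(type, member) sizeof(((type *)0)->member)

/* Start of the region: offset of the first whitelisted field. */
static size_t region_offset(void)
{
	return offsetof(struct mock_sock, subscribe);
}

/* Size of the region: distance from the start of the first field to the
 * end of the last one. Computing it from offsets (rather than summing the
 * two field sizes) also covers any padding between the fields. */
static size_t region_size(void)
{
	return offsetof(struct mock_sock, initmsg) -
	       offsetof(struct mock_sock, subscribe) +
	       field_size(struct mock_sock, initmsg);
}
```

This only works while the two fields stay next to each other, which is why
the patch adds a comment asking that they remain grouped.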


* [PATCH v2 20/30] caif: Define usercopy region in caif proto slab cache
From: Kees Cook @ 2017-08-28 21:35 UTC
  To: linux-kernel
  Cc: Kees Cook, David Windsor, Dmitry Tarnyagin, David S. Miller,
	netdev, linux-mm, kernel-hardening
In-Reply-To: <1503956111-36652-1-git-send-email-keescook@chromium.org>

From: David Windsor <dave@nullcore.net>

The CAIF channel connection request parameters need to be copied to/from
userspace. In support of usercopy hardening, this patch defines a region
in the struct proto slab cache in which userspace copy operations are
allowed.

example usage trace:

    net/caif/caif_socket.c:
        setsockopt(...):
            ...
            copy_from_user(&cf_sk->conn_req.param.data, ..., ol)

This region is known as the slab cache's usercopy region. Slab caches can
now check that each copy operation involving cache-managed memory falls
entirely within the slab's usercopy region.

This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: split from network patch, provide usage trace]
Cc: Dmitry Tarnyagin <dmitry.tarnyagin@lockless.no>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 net/caif/caif_socket.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
index 632d5a416d97..c76d513b9a7a 100644
--- a/net/caif/caif_socket.c
+++ b/net/caif/caif_socket.c
@@ -1032,6 +1032,8 @@ static int caif_create(struct net *net, struct socket *sock, int protocol,
 	static struct proto prot = {.name = "PF_CAIF",
 		.owner = THIS_MODULE,
 		.obj_size = sizeof(struct caifsock),
+		.useroffset = offsetof(struct caifsock, conn_req.param),
+		.usersize = sizeof_field(struct caifsock, conn_req.param)
 	};

 	if (!capable(CAP_SYS_ADMIN) && !capable(CAP_NET_ADMIN))
-- 
2.7.4


* [PATCH v2 19/30] ip: Define usercopy region in IP proto slab cache
From: Kees Cook @ 2017-08-28 21:35 UTC
  To: linux-kernel
  Cc: Kees Cook, David Windsor, David S. Miller, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, netdev, linux-mm, kernel-hardening
In-Reply-To: <1503956111-36652-1-git-send-email-keescook@chromium.org>

From: David Windsor <dave@nullcore.net>

The ICMP filters for IPv4 and IPv6 raw sockets need to be copied to/from
userspace. In support of usercopy hardening, this patch defines a region
in the struct proto slab cache in which userspace copy operations are
allowed.

example usage trace:

    net/ipv4/raw.c:
        raw_seticmpfilter(...):
            ...
            copy_from_user(&raw_sk(sk)->filter, ..., optlen)

        raw_geticmpfilter(...):
            ...
            copy_to_user(..., &raw_sk(sk)->filter, len)

    net/ipv6/raw.c:
        rawv6_seticmpfilter(...):
            ...
            copy_from_user(&raw6_sk(sk)->filter, ..., optlen)

        rawv6_geticmpfilter(...):
            ...
            copy_to_user(..., &raw6_sk(sk)->filter, len)

This region is known as the slab cache's usercopy region. Slab caches can
now check that each copy operation involving cache-managed memory falls
entirely within the slab's usercopy region.

This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: split from network patch, provide usage trace]
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 net/ipv4/raw.c | 2 ++
 net/ipv6/raw.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index b0bb5d0a30bd..6c7f8d2eb3af 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -964,6 +964,8 @@ struct proto raw_prot = {
 	.hash		   = raw_hash_sk,
 	.unhash		   = raw_unhash_sk,
 	.obj_size	   = sizeof(struct raw_sock),
+	.useroffset	   = offsetof(struct raw_sock, filter),
+	.usersize	   = sizeof_field(struct raw_sock, filter),
 	.h.raw_hash	   = &raw_v4_hashinfo,
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_raw_setsockopt,
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 60be012fe708..27dd9a5f71c6 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -1265,6 +1265,8 @@ struct proto rawv6_prot = {
 	.hash		   = raw_hash_sk,
 	.unhash		   = raw_unhash_sk,
 	.obj_size	   = sizeof(struct raw6_sock),
+	.useroffset	   = offsetof(struct raw6_sock, filter),
+	.usersize	   = sizeof_field(struct raw6_sock, filter),
 	.h.raw_hash	   = &raw_v6_hashinfo,
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_rawv6_setsockopt,
-- 
2.7.4


* [PATCH v2 18/30] net: Define usercopy region in struct proto slab cache
From: Kees Cook @ 2017-08-28 21:34 UTC
  To: linux-kernel
  Cc: Kees Cook, David Windsor, David S. Miller, Eric Dumazet,
	Paolo Abeni, David Howells, netdev, linux-mm, kernel-hardening
In-Reply-To: <1503956111-36652-1-git-send-email-keescook@chromium.org>

From: David Windsor <dave@nullcore.net>

In support of usercopy hardening, this patch defines a region in the
struct proto slab cache in which userspace copy operations are allowed.
Some protocols need to copy objects to/from userspace, and they can
declare the region via their proto structure with the new usersize and
useroffset fields. Initially, if no region is specified (usersize ==
0), the entire object is marked as whitelisted. This allows protocols
to be whitelisted in subsequent patches. Once all protocols have been
annotated, the full-whitelist default can be removed.

This region is known as the slab cache's usercopy region. Slab caches can
now check that each copy operation involving cache-managed memory falls
entirely within the slab's usercopy region.

This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split off per-proto patches]
[kees: add logic for by-default full-whitelist]
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 include/net/sock.h | 2 ++
 net/core/sock.c    | 6 +++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 7c0632c7e870..170d5b2dbcb6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1106,6 +1106,8 @@ struct proto {
 	struct kmem_cache	*slab;
 	unsigned int		obj_size;
 	int			slab_flags;
+	size_t			useroffset;	/* Usercopy region offset */
+	size_t			usersize;	/* Usercopy region size */

 	struct percpu_counter	*orphan_count;

diff --git a/net/core/sock.c b/net/core/sock.c
index ac2a404c73eb..02dab98ca3e3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3109,8 +3109,12 @@ static int req_prot_init(const struct proto *prot)
 int proto_register(struct proto *prot, int alloc_slab)
 {
 	if (alloc_slab) {
-		prot->slab = kmem_cache_create(prot->name, prot->obj_size, 0,
+		prot->slab = kmem_cache_create_usercopy(prot->name,
+					prot->obj_size, 0,
 					SLAB_HWCACHE_ALIGN | prot->slab_flags,
+					prot->usersize ? prot->useroffset : 0,
+					prot->usersize ? prot->usersize
+						       : prot->obj_size,
 					NULL);

 		if (prot->slab == NULL) {
-- 
2.7.4
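
As a rough illustration of the check described in this commit message, the
sketch below models a usercopy region and the interim default from the diff
(usersize == 0 whitelists the whole object). It is a userspace approximation
with hypothetical names, not the kernel's actual usercopy checking code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a cache's whitelisted window. */
struct usercopy_region {
	size_t obj_size;
	size_t useroffset;
	size_t usersize;
};

/* Mirrors the interim default in the diff: if no region is declared
 * (usersize == 0), the whole object is treated as whitelisted. */
static struct usercopy_region region_with_default(size_t obj_size,
						  size_t useroffset,
						  size_t usersize)
{
	struct usercopy_region r = {
		.obj_size   = obj_size,
		.useroffset = usersize ? useroffset : 0,
		.usersize   = usersize ? usersize : obj_size,
	};
	return r;
}

/* True if [offset, offset + len) lies entirely inside the region.
 * Written with subtractions so that offset + len cannot overflow. */
static bool copy_allowed(const struct usercopy_region *r,
			 size_t offset, size_t len)
{
	if (offset < r->useroffset)
		return false;
	offset -= r->useroffset;
	return offset <= r->usersize && len <= r->usersize - offset;
}
```

For example, with a 64-byte object and a declared region of 8 bytes at
offset 16, a copy of 8 bytes from offset 16 passes and a copy of 9 bytes
from the same offset does not. Once the full-whitelist default is removed,
as patch 23/30 does, an undeclared region has size 0 and every non-empty
copy is rejected.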


* Re: Get ARP/ND tables from kernel
From: Bassam Alsanie @ 2017-08-28 21:34 UTC
  To: Waskiewicz Jr, Peter; +Cc: netdev@vger.kernel.org
In-Reply-To: <E0D909EE5BB15A4699798539EA149D7F07796D3C@ORSMSX103.amr.corp.intel.com>

PJ,
Thank you. strace is really gonna help :)
It seems that using the Netlink (NETLINK_ROUTE) interface is the way to
get the ARP/ND tables from the kernel programmatically.

Thank you,
Bassam

On Sun, Aug 27, 2017 at 7:53 PM, Waskiewicz Jr, Peter
<peter.waskiewicz.jr@intel.com> wrote:
> On 8/27/17 9:25 PM, Bassam Alsanie wrote:
>> Hello everyone,
>> I am looking for a good way (stable and compatible with a large number
>> of distros) to get the ARP/ND cache from the kernel to user space, for
>> both IPv4 and IPv6.
>>
>> It seems ioctl (SIOCGARP) can't do that; you can only get the MAC
>> address for a provided IP address. ioctl can't give the full ARP/ND
>> table.
>> The other option is the Netlink interface. I tried it and I got the
>> ARP/ND table :).
>> The third option is using /proc/net/arp, which is restricted to IPv4.
>>
>> There are command-line utilities, which I am excluding in my case.
>>
>> Is there another way to do it? What is the best way in my case?
>>
>> Thank you all.
>
> # strace arp -an
> [...]
> open("/proc/net/arp", O_RDONLY)         = 4
> fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> read(4, "IP address       HW type     Fla"..., 1024) = 310
> [...]
>
> # strace ip -6 neighbor show
> [...]
> socket(AF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, NETLINK_ROUTE) = 3
> setsockopt(3, SOL_SOCKET, SO_SNDBUF, [32768], 4) = 0
> setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1048576], 4) = 0
> bind(3, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 0
> getsockname(3, {sa_family=AF_NETLINK, nl_pid=30292, nl_groups=00000000},
> [12]) = 0
> sendto(3, {{len=40, type=RTM_GETLINK, flags=NLM_F_REQUEST|NLM_F_DUMP,
> seq=1503888680, pid=0},
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\35\0\1\0\0\0"}, 40, 0, NULL, 0) = 40
> recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0,
> nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=1268,
> type=RTM_NEWLINK, flags=NLM_F_MULTI, seq=1503888680, pid=30292},
> "\0\0\4\3\1\0\0\0I\0\1\0\0\0\0\0\7\0\3\0lo\0\0\10\0\r\0\350\3\0\0"...},
> {{len=1280, type=RTM_NEWLINK, flags=NLM_F_MULTI, seq=1503888680,
> pid=30292},
> "\0\0\1\0\2\0\0\0C\20\1\0\0\0\0\0\t\0\3\0eno1\0\0\0\0\10\0\r\0"...}],
> iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 2548
> [...]
>
> Seems like it's pretty obvious if you don't want to use the existing
> tools, just look at how the existing tools get this data.  IPv4 uses
> /proc/net/arp, IPv6 uses netlink.
>
> Cheers,
> -PJ
>
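
A minimal sketch of the Netlink approach discussed in this thread, assuming
a Linux system with the UAPI headers installed. It sends an RTM_GETNEIGH
dump request on a NETLINK_ROUTE socket and counts the RTM_NEWNEIGH replies.
The helper names and the request struct are hypothetical, the RTM_GETNEIGH
request is an assumption (the trace excerpt above only shows the initial
RTM_GETLINK exchange), and error handling is kept to the bare minimum.

```c
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/neighbour.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Request layout: netlink header followed by a neighbour message. */
struct neigh_dump_req {
	struct nlmsghdr nlh;
	struct ndmsg ndm;
};

/* Fill in a dump request. family is AF_INET (ARP), AF_INET6 (ND),
 * or AF_UNSPEC for both. */
void build_neigh_dump_req(struct neigh_dump_req *req,
			  unsigned char family, unsigned int seq)
{
	memset(req, 0, sizeof(*req));
	req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ndmsg));
	req->nlh.nlmsg_type = RTM_GETNEIGH;
	req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	req->nlh.nlmsg_seq = seq;
	req->ndm.ndm_family = family;
}

/* Returns the number of neighbour entries reported, or -1 on error.
 * Each RTM_NEWNEIGH message carries attributes such as NDA_DST (IP
 * address) and NDA_LLADDR (link-layer address), not parsed here. */
int dump_neighbours(unsigned char family)
{
	struct neigh_dump_req req;
	_Alignas(struct nlmsghdr) char buf[16384];
	int count = 0, done = 0;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (fd < 0)
		return -1;

	build_neigh_dump_req(&req, family, 1);
	if (send(fd, &req, req.nlh.nlmsg_len, 0) < 0)
		goto fail;

	while (!done) {
		ssize_t n = recv(fd, buf, sizeof(buf), 0);
		int len = (int)n;

		if (n <= 0)
			goto fail;

		/* A dump reply is a series of messages ending in NLMSG_DONE. */
		for (struct nlmsghdr *h = (struct nlmsghdr *)buf;
		     NLMSG_OK(h, len); h = NLMSG_NEXT(h, len)) {
			if (h->nlmsg_type == NLMSG_DONE) {
				done = 1;
				break;
			}
			if (h->nlmsg_type == NLMSG_ERROR)
				goto fail;
			if (h->nlmsg_type == RTM_NEWNEIGH)
				count++;
		}
	}
	close(fd);
	return count;
fail:
	close(fd);
	return -1;
}
```

Calling dump_neighbours(AF_INET6) would cover the ND case that
/proc/net/arp cannot, and dump_neighbours(AF_INET) the ARP case, so one
code path can serve both address families.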


* Re: [PATCH net-next] Revert "ipv4: make net_protocol const"
From: David Miller @ 2017-08-28 21:30 UTC
  To: dsahern; +Cc: netdev, bhumirks
In-Reply-To: <80c558d4-dea8-3a60-3786-c70e55946ff2@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Mon, 28 Aug 2017 15:03:34 -0600

> On 8/28/17 3:01 PM, David Miller wrote:
>> From: David Ahern <dsahern@gmail.com>
>> Date: Mon, 28 Aug 2017 13:23:09 -0700
>> 
>>> This reverts commit aa8db499ea67cff1f5f049033810ffede2fe5ae4.
>> ...
>>> I think this is the second time such a patch has been reverted.
>> 
>> Then please add a comment, it will help prevent this from happening
>> again.
>> 
> 
> Was going to do that in both places after the revert. You want it as
> part of this one?

You're right, a clean revert first is better.

I'll apply this now, thanks David.


* Re: [PATCH] ipv6: sr: fix get_srh() to comply with IPv6 standard "RFC 8200"
From: Ahmed Abdelsalam @ 2017-08-28 21:27 UTC
  To: David Lebrun; +Cc: davem, yoshfuji, netdev
In-Reply-To: <b84d5301-ba0a-f23a-4f8a-697e3cad1fae@uclouvain.be>

On Mon, 28 Aug 2017 19:48:15 +0100
David Lebrun <david.lebrun@uclouvain.be> wrote:

> On 08/28/2017 07:20 PM, Ahmed Abdelsalam wrote:
> > This patch fixes the get_srh(), so it gets the segment routing header
> > regardless of its position in the chain of the extension headers in IPv6
> > packet, and makes sure that the IPv6 routing extension header is of
> > Type 4.
> 
> Ahmed,
> 
> You need to initialize srhoff to 0, otherwise ipv6_find_hdr() will crash
> the kernel by dereferencing an uninitialized pointer.
> 
> Please test your patches before submitting them.
> 
> Furthermore, your pskb_may_pull() check should happen right after the
> call to ipv6_find_hdr, with srhoff + sizeof(*srh) as argument. Once you
> have checked the SRH type, you can then do another pskb_may_pull with
> srhoff + len.
> 
> David
> 

Thanks David, 
 
I will address the comments and re-submit the patch after testing.

-- 
Ahmed 


* Re: [PATCH 2/6] nbd: make device_attribute const
From: Jens Axboe @ 2017-08-28 21:21 UTC
  To: Bhumika Goyal, julia.lawall, rjw, lenb, jbacik, jikos,
	benjamin.tissoires, manish.chopra, rahul.verma,
	Dept-GELinuxNICDev, harish.patil, cascardo, don, dvhart, andy,
	sre, linux-acpi, linux-kernel, linux-block, nbd-general,
	linux-input, netdev, platform-driver-x86, linux-pm
In-Reply-To: <1503315792-14837-3-git-send-email-bhumirks@gmail.com>

On 08/21/2017 05:43 AM, Bhumika Goyal wrote:
> Make this const as it is only passed as an argument to the
> functions device_create_file and device_remove_file, and the
> corresponding arguments are of type const.
> Done using Coccinelle.

Added for 4.14, thanks.

-- 
Jens Axboe


* Re: [PATCH net-next] Revert "ipv4: make net_protocol const"
From: Stephen Hemminger @ 2017-08-28 21:18 UTC
  To: David Ahern; +Cc: netdev, bhumirks, davem
In-Reply-To: <1503951789-31836-1-git-send-email-dsahern@gmail.com>

On Mon, 28 Aug 2017 13:23:09 -0700
David Ahern <dsahern@gmail.com> wrote:

> This reverts commit aa8db499ea67cff1f5f049033810ffede2fe5ae4.
> 
> Early demux structs can not be made const. Doing so results in:
> [   84.967355] BUG: unable to handle kernel paging request at ffffffff81684b10
> [   84.969272] IP: proc_configure_early_demux+0x1e/0x3d
> [   84.970544] PGD 1a0a067
> [   84.970546] P4D 1a0a067
> [   84.971212] PUD 1a0b063
> [   84.971733] PMD 80000000016001e1
> 
> [   84.972669] Oops: 0003 [#1] SMP
> [   84.973065] Modules linked in: ip6table_filter ip6_tables veth vrf
> [   84.973833] CPU: 0 PID: 955 Comm: sysctl Not tainted 4.13.0-rc6+ #22
> [   84.974612] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
> [   84.975855] task: ffff88003854ce00 task.stack: ffffc900005a4000
> [   84.976580] RIP: 0010:proc_configure_early_demux+0x1e/0x3d
> [   84.977253] RSP: 0018:ffffc900005a7dd0 EFLAGS: 00010246
> [   84.977891] RAX: ffffffff81684b10 RBX: 0000000000000001 RCX: 0000000000000000
> [   84.978759] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000000
> [   84.979628] RBP: ffffc900005a7dd0 R08: 0000000000000000 R09: 0000000000000000
> [   84.980501] R10: 0000000000000001 R11: 0000000000000008 R12: 0000000000000001
> [   84.981373] R13: ffffffffffffffea R14: ffffffff81a9b4c0 R15: 0000000000000002
> [   84.982249] FS:  00007feb237b7700(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
> [   84.983231] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   84.983941] CR2: ffffffff81684b10 CR3: 0000000038492000 CR4: 00000000000406f0
> [   84.984817] Call Trace:
> [   84.985133]  proc_tcp_early_demux+0x29/0x30
> 
> I think this is the second time such a patch has been reverted.
> 
> Cc: Bhumika Goyal <bhumirks@gmail.com>
> Signed-off-by: David Ahern <dsahern@gmail.com>

This would have been caught at compile time if you tried setting inet_protos to const.
Start with that if you want to give it a try.



diff --git a/include/net/protocol.h b/include/net/protocol.h
index 65ba335b0e7e..373fa92d33ff 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -93,7 +93,7 @@ struct inet_protosw {
 #define INET_PROTOSW_PERMANENT 0x02  /* Permanent protocols are unremovable. */
 #define INET_PROTOSW_ICSK      0x04  /* Is this an inet_connection_sock? */
 
-extern struct net_protocol __rcu *inet_protos[MAX_INET_PROTOS];
+extern const struct net_protocol __rcu *inet_protos[MAX_INET_PROTOS];
 extern const struct net_offload __rcu *inet_offloads[MAX_INET_PROTOS];
 extern const struct net_offload __rcu *inet6_offloads[MAX_INET_PROTOS];
 
diff --git a/net/ipv4/protocol.c b/net/ipv4/protocol.c
index 32a691b7ce2c..4b7c0ec65251 100644
--- a/net/ipv4/protocol.c
+++ b/net/ipv4/protocol.c
@@ -28,7 +28,7 @@
 #include <linux/spinlock.h>
 #include <net/protocol.h>
 
-struct net_protocol __rcu *inet_protos[MAX_INET_PROTOS] __read_mostly;
+const struct net_protocol __rcu *inet_protos[MAX_INET_PROTOS] __read_mostly;
 const struct net_offload __rcu *inet_offloads[MAX_INET_PROTOS] __read_mostly;
 EXPORT_SYMBOL(inet_offloads);


net/ipv4/sysctl_net_ipv4.c: In function ‘proc_configure_early_demux’:
net/ipv4/sysctl_net_ipv4.c:310:9: warning: assignment discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
  ipprot = rcu_dereference(inet_protos[protocol]);


* Re: Fwd: DA850-evm MAC Address is random
From: Tony Lindgren @ 2017-08-28 21:12 UTC
  To: Adam Ford; +Cc: Grygorii Strashko, linux-omap, netdev, Sekhar Nori
In-Reply-To: <CAHCN7xLHEjda9DcVw5iHA9RaJ7UnyMVxf7e4e8NTEj9Op1JaBw@mail.gmail.com>

* Adam Ford <aford173@gmail.com> [170828 13:33]:
> On Mon, Aug 28, 2017 at 1:54 PM, Grygorii Strashko
> <grygorii.strashko@ti.com> wrote:
> > Cc: Sekhar
> >
> > On 08/28/2017 10:32 AM, Adam Ford wrote:
> >>
> >> The davinci_emac MAC address seems to attempt a call to
> >> ti_cm_get_macid in cpsw-common.c but it returns the message
> >> 'davinci_emac davinci_emac.1: incompatible machine/device type for
> >> reading mac address ' and then generates a random MAC address.
> >>
> >> The function appears to look up various boards using
> >> 'of_machine_is_compatible' and supports dm8148, am33xx, am3517, dm816,
> >> am4372 and dra7.  I don't see the ti,davinci-dm6467-emac which is
> >> what's shown in the da850 device tree.
> >>
> >> Is there a patch somewhere for supporting the da850-evm?
> >
> >
> > Not sure if the MAC address can be read from the Control module.
> > Maybe Sekhar can say more?
> 
> My understanding is that the MAC address is programmed by Logic PD
> into the SPI flash.  The Bootloader reads this from either SPI or its
> env variables.  Looking at the partition info listed in the
> da850-evm.dts file, it appears as if they've reserved space for it.
> Unfortunately, I don't see any code that reads it out.  I was hoping
> there might be a way to just pass cmdline parameter from the
> bootloader to the kernel to accept the MAC address.
> 
> >
> >>
> >> If not, is there a way to pass the MAC address from U-Boot to the
> >> driver so it doesn't generate a random MAC?
> >
> >
> > "local-mac-address" dt prop
> 
> The downside here, is that we'd have to have the Bootloader modify the
> device tree.

That piece of code exists somewhere in u-boot already. Note how
we are populating the mac address for USB Ethernet drivers in
u-boot and then the Ethernet driver code parses it. See commit
055d31de7158 ("ARM: omap3: beagleboard-xm: dt: Add ethernet to
the device tree") for some more information.

I think u-boot needs the ethernet alias for finding the interface.

Regards,

Tony


* Re: [PATCH net-next] Revert "ipv4: make net_protocol const"
From: David Ahern @ 2017-08-28 21:03 UTC
  To: David Miller; +Cc: netdev, bhumirks
In-Reply-To: <20170828.140133.2126200702424568912.davem@davemloft.net>

On 8/28/17 3:01 PM, David Miller wrote:
> From: David Ahern <dsahern@gmail.com>
> Date: Mon, 28 Aug 2017 13:23:09 -0700
> 
>> This reverts commit aa8db499ea67cff1f5f049033810ffede2fe5ae4.
> ...
>> I think this is the second time such a patch has been reverted.
> 
> Then please add a comment, it will help prevent this from happening
> again.
> 

Was going to do that in both places after the revert. You want it as
part of this one?


* Re: [PATCH net-next] Revert "ipv4: make net_protocol const"
From: David Miller @ 2017-08-28 21:01 UTC
  To: dsahern; +Cc: netdev, bhumirks
In-Reply-To: <1503951789-31836-1-git-send-email-dsahern@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Mon, 28 Aug 2017 13:23:09 -0700

> This reverts commit aa8db499ea67cff1f5f049033810ffede2fe5ae4.
...
> I think this is the second time such a patch has been reverted.

Then please add a comment, it will help prevent this from happening
again.


* RE: [RFC PATCH] net: limit maximum number of packets to mark with xmit_more
From: Keller, Jacob E @ 2017-08-28 20:56 UTC
  To: Jakub Kicinski; +Cc: netdev@vger.kernel.org
In-Reply-To: <20170825153418.53864810@cakuba>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
> Behalf Of Jakub Kicinski
> Sent: Friday, August 25, 2017 12:34 PM
> To: Keller, Jacob E <jacob.e.keller@intel.com>
> Cc: netdev@vger.kernel.org
> Subject: Re: [RFC PATCH] net: limit maximum number of packets to mark with
> xmit_more
> 
> On Fri, 25 Aug 2017 08:24:49 -0700, Jacob Keller wrote:
> > Under some circumstances, such as with many stacked devices, it is
> > possible that dev_hard_start_xmit will bundle many packets together, and
> > mark them all with xmit_more.
> 
> Excuse my ignorance but what are those stacked devices?  Could they
> perhaps be fixed somehow?  My intuition was that long xmit_more
> sequences can only happen if NIC and/or BQL are back pressuring, and
> therefore we shouldn't be seeing a long xmit_more "train" arriving at
> an empty device ring...

The stack is a veth device connecting a VM to the host, which is connected to a bridge, which is connected to a VLAN interface on top of a bond, which is hooked in active-backup mode to a physical device.

Sorry if that isn't the correct way to refer to these; I just think of them as devices stacked on top of each other.

During root cause investigation I found that we (the i40e driver) sometimes received 100 or more SKBs in a row with xmit_more set. We were also incorrectly using xmit_more as a hint to not mark packets for writeback, which caused significant throughput issues. Additionally, there was concern that so many packets in a row without a tail bump would cause latency issues, so I thought it might be best to simply guarantee that the stack didn't send us too many packets marked with xmit_more at once.

Based on the discussion, it seems it should be up to the driver to decide exactly how to handle the xmit_more hint and when it isn't actually helpful, so I no longer think this patch makes sense.

Thanks,
Jake 


* Re: [PATCH 1/6] ACPI: make device_attribute const
From: Rafael J. Wysocki @ 2017-08-28 20:56 UTC (permalink / raw)
  To: Bhumika Goyal
  Cc: julia.lawall, lenb, jbacik, jikos, benjamin.tissoires,
	manish.chopra, rahul.verma, Dept-GELinuxNICDev, harish.patil,
	cascardo, don, dvhart, andy, sre, linux-acpi, linux-kernel,
	linux-block, nbd-general, linux-input, netdev,
	platform-driver-x86, linux-pm
In-Reply-To: <1503315792-14837-2-git-send-email-bhumirks@gmail.com>

On Monday, August 21, 2017 1:43:07 PM CEST Bhumika Goyal wrote:
> Make these const as they are only passed as an argument to the function
> device_create_file and device_remove_file and the corresponding
> arguments are of type const.
> Done using Coccinelle
> 
> Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
> ---
>  drivers/acpi/battery.c | 2 +-
>  drivers/acpi/sbs.c     | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/acpi/battery.c b/drivers/acpi/battery.c
> index 1cbb88d..13e7b56 100644
> --- a/drivers/acpi/battery.c
> +++ b/drivers/acpi/battery.c
> @@ -620,7 +620,7 @@ static ssize_t acpi_battery_alarm_store(struct device *dev,
>  	return count;
>  }
>  
> -static struct device_attribute alarm_attr = {
> +static const struct device_attribute alarm_attr = {
>  	.attr = {.name = "alarm", .mode = 0644},
>  	.show = acpi_battery_alarm_show,
>  	.store = acpi_battery_alarm_store,
> diff --git a/drivers/acpi/sbs.c b/drivers/acpi/sbs.c
> index a184637..a2428e9 100644
> --- a/drivers/acpi/sbs.c
> +++ b/drivers/acpi/sbs.c
> @@ -474,7 +474,7 @@ static ssize_t acpi_battery_alarm_store(struct device *dev,
>  	return count;
>  }
>  
> -static struct device_attribute alarm_attr = {
> +static const struct device_attribute alarm_attr = {
>  	.attr = {.name = "alarm", .mode = 0644},
>  	.show = acpi_battery_alarm_show,
>  	.store = acpi_battery_alarm_store,
> 

Applied, thanks!




* [PATCH net-next] ipv6: Use rt6i_idev index for echo replies to a local address
From: David Ahern @ 2017-08-28 20:53 UTC (permalink / raw)
  To: netdev, tariqt; +Cc: David Ahern

Tariq reported that local pings to a link-local address are failing:
$ ifconfig ens8
ens8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 11.141.16.6  netmask 255.255.0.0  broadcast 11.141.255.255
        inet6 fe80::7efe:90ff:fecb:7502  prefixlen 64  scopeid 0x20<link>
        ether 7c:fe:90:cb:75:02  txqueuelen 1000  (Ethernet)
        RX packets 12  bytes 1164 (1.1 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 30  bytes 2484 (2.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

$  /bin/ping6 -c 3 fe80::7efe:90ff:fecb:7502%ens8
PING fe80::7efe:90ff:fecb:7502%ens8(fe80::7efe:90ff:fecb:7502) 56 data bytes

--- fe80::7efe:90ff:fecb:7502%ens8 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2043ms

icmpv6_echo_reply needs to use the rt6i_idev dev index for local traffic
similar to how icmp6_send does. Convert the change for icmp6_send into a
helper that can be used in both places. Add the long overdue
skb_rt6_info helper to convert dst on an skb to rt6_info similar to
skb_rtable for ipv4.

Fixes: 4832c30d5458 ("net: ipv6: put host and anycast routes on
       device with address")
Reported-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip6_route.h | 10 ++++++++++
 net/ipv6/icmp.c         | 33 ++++++++++++++++++++-------------
 2 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 882bc3c7ccde..ee96f402cb75 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -164,6 +164,16 @@ void rt6_mtu_change(struct net_device *dev, unsigned int mtu);
 void rt6_remove_prefsrc(struct inet6_ifaddr *ifp);
 void rt6_clean_tohost(struct net *net, struct in6_addr *gateway);
 
+static inline const struct rt6_info *skb_rt6_info(const struct sk_buff *skb)
+{
+	const struct dst_entry *dst = skb_dst(skb);
+	const struct rt6_info *rt6 = NULL;
+
+	if (dst)
+		rt6 = container_of(dst, struct rt6_info, dst);
+
+	return rt6;
+}
 
 /*
  *	Store a destination cache entry in a socket
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index dd7608cf1d72..c25b5954cfbb 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -399,6 +399,24 @@ static struct dst_entry *icmpv6_route_lookup(struct net *net,
 	return ERR_PTR(err);
 }
 
+static int icmp6_iif(const struct sk_buff *skb)
+{
+	int iif = skb->dev->ifindex;
+
+	/* for local traffic to local address, skb dev is the loopback
+	 * device. Check if there is a dst attached to the skb and if so
+	 * get the real device index.
+	 */
+	if (unlikely(iif == LOOPBACK_IFINDEX)) {
+		const struct rt6_info *rt6 = skb_rt6_info(skb);
+
+		if (rt6)
+			iif = rt6->rt6i_idev->dev->ifindex;
+	}
+
+	return iif;
+}
+
 /*
  *	Send an ICMP message in response to a packet in error
  */
@@ -460,18 +478,7 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
 	 */
 
 	if (__ipv6_addr_needs_scope_id(addr_type)) {
-		iif = skb->dev->ifindex;
-
-		/* for local packets, get the real device index */
-		if (iif == LOOPBACK_IFINDEX) {
-			dst = skb_dst(skb);
-			if (dst) {
-				struct rt6_info *rt;
-
-				rt = container_of(dst, struct rt6_info, dst);
-				iif = rt->rt6i_idev->dev->ifindex;
-			}
-		}
+		iif = icmp6_iif(skb);
 	} else {
 		dst = skb_dst(skb);
 		iif = l3mdev_master_ifindex(dst ? dst->dev : skb->dev);
@@ -694,7 +701,7 @@ static void icmpv6_echo_reply(struct sk_buff *skb)
 	fl6.daddr = ipv6_hdr(skb)->saddr;
 	if (saddr)
 		fl6.saddr = *saddr;
-	fl6.flowi6_oif = skb->dev->ifindex;
+	fl6.flowi6_oif = icmp6_iif(skb);
 	fl6.fl6_icmp_type = ICMPV6_ECHO_REPLY;
 	fl6.flowi6_mark = mark;
 	fl6.flowi6_uid = sock_net_uid(net, NULL);
-- 
2.1.4
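
The two helpers in the patch above rely on the container_of idiom: the
dst_entry is embedded in the rt6_info, so a pointer to the member can be
turned back into a pointer to the enclosing structure. A minimal
user-space sketch of that idea follows. The *_model types, the
container_of_model macro and pick_iif() are hypothetical stand-ins, not
kernel code, and the embedded member is deliberately not placed first so
that the offset arithmetic is visible.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; fields are illustrative. */
struct dst_entry_model {
	int dummy;
};

struct rt6_info_model {
	int ifindex;                /* stands in for rt6i_idev->dev->ifindex */
	struct dst_entry_model dst; /* embedded member */
};

#define LOOPBACK_IFINDEX_MODEL 1

/* Const-preserving user-space version of the container_of idea. */
#define container_of_model(ptr, type, member) \
	((const type *)((const char *)(ptr) - offsetof(type, member)))

/* Mirrors the shape of skb_rt6_info(): NULL in, NULL out. */
static const struct rt6_info_model *
dst_to_rt6(const struct dst_entry_model *dst)
{
	if (!dst)
		return NULL;
	return container_of_model(dst, struct rt6_info_model, dst);
}

/* Mirrors the shape of icmp6_iif(): only loopback traffic is remapped. */
static int pick_iif(int skb_ifindex, const struct dst_entry_model *dst)
{
	int iif = skb_ifindex;

	assert(skb_ifindex > 0);	/* interface indexes are positive */

	if (iif == LOOPBACK_IFINDEX_MODEL) {
		const struct rt6_info_model *rt6 = dst_to_rt6(dst);

		if (rt6)
			iif = rt6->ifindex;
	}
	return iif;
}
```

With a route whose model ifindex is 7, pick_iif() returns 7 for a
loopback skb and leaves any other incoming index untouched.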


* RE: [RFC PATCH] net: limit maximum number of packets to mark with xmit_more
From: Keller, Jacob E @ 2017-08-28 20:46 UTC (permalink / raw)
  To: Alexander Duyck, Stephen Hemminger
  Cc: Waskiewicz Jr, Peter, netdev@vger.kernel.org
In-Reply-To: <CAKgT0UdC_8hTbc-p6cQPNy4=gbFnodZ3ZyKts329Yu8kiv-ZWw@mail.gmail.com>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
> Behalf Of Alexander Duyck
> Sent: Friday, August 25, 2017 3:34 PM
> To: Stephen Hemminger <stephen@networkplumber.org>
> Cc: Waskiewicz Jr, Peter <peter.waskiewicz.jr@intel.com>; Keller, Jacob E
> <jacob.e.keller@intel.com>; netdev@vger.kernel.org
> Subject: Re: [RFC PATCH] net: limit maximum number of packets to mark with
> xmit_more
> 
> On Fri, Aug 25, 2017 at 8:58 AM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> > On Fri, 25 Aug 2017 15:36:22 +0000
> > "Waskiewicz Jr, Peter" <peter.waskiewicz.jr@intel.com> wrote:
> >
> >> On 8/25/17 11:25 AM, Jacob Keller wrote:
> >> > Under some circumstances, such as with many stacked devices, it is
> >> > possible that dev_hard_start_xmit will bundle many packets together, and
> >> > mark them all with xmit_more.
> >> >
> >> > Most drivers respond to xmit_more by skipping tail bumps on packet
> >> > rings, or similar behavior as long as xmit_more is set. This is
> >> > a performance win since it means drivers can avoid notifying hardware of
> >> > new packets repeatedly, and thus avoid wasting unnecessary PCIe or other
> >> > bandwidth.
> >> >
> >> > This use of xmit_more comes with a trade off because bundling too many
> >> > packets can increase latency of the Tx packets. To avoid this, we should
> >> > limit the maximum number of packets with xmit_more.
> >> >
> >> > Driver authors could modify their drivers to check for some determined
> >> > limit, but this requires all drivers to be modified in order to gain
> >> > advantage.
> >> >
> >> > Instead, add a sysctl "xmit_more_max" which can be used to configure the
> >> > maximum number of xmit_more skbs to send in a sequence. This ensures
> >> > that all drivers benefit, and allows system administrators the option to
> >> > tune the value to their environment.
> >> >
> >> > Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> >> > ---
> >> >
> >> > Stray thoughts and further questions....
> >> >
> >> > Is this the right approach? Did I miss any other places where we should
> >> > limit? Does the limit make sense? Should it be a per-device
> >> > tuning knob instead of a global? Is 32 a good default?
> >>
> >> I actually like the idea of a per-device knob.  A xmit_more_max that's
> >> global in a system with 1GbE devices along with a 25/50GbE or more just
> >> doesn't make much sense to me.  Or having heterogeneous vendor devices
> >> in the same system that have different HW behaviors could mask issues
> >> with latency.
> >>
> >> This seems like another incarnation of possible buffer-bloat if the max
> >> is too high...
> >>
> >> >
> >> >   Documentation/sysctl/net.txt |  6 ++++++
> >> >   include/linux/netdevice.h    |  2 ++
> >> >   net/core/dev.c               | 10 +++++++++-
> >> >   net/core/sysctl_net_core.c   |  7 +++++++
> >> >   4 files changed, 24 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
> >> > index b67044a2575f..3d995e8f4448 100644
> >> > --- a/Documentation/sysctl/net.txt
> >> > +++ b/Documentation/sysctl/net.txt
> >> > @@ -230,6 +230,12 @@ netdev_max_backlog
> >> >   Maximum number  of  packets,  queued  on  the  INPUT  side, when the interface
> >> >   receives packets faster than kernel can process them.
> >> >
> >> > +xmit_more_max
> >> > +-------------
> >> > +
> >> > +Maximum number of packets in a row to mark with skb->xmit_more. A value of zero
> >> > +indicates no limit.
> >>
> >> What defines "packet?"  MTU-sized packets, or payloads coming down from
> >> the stack (e.g. TSO's)?
> >
> > xmit_more is only a hint to the device. The device driver should ignore it unless
> > there are hardware advantages. The device driver is the place with HW specific
> > knowledge (like 4 Tx descriptors is equivalent to one PCI transaction on this
> > device).
> >
> > Anything that pushes that optimization out to the user is only useful for
> > benchmarks and embedded devices.
> 
> Actually I think I might have an idea what is going on here and I
> agree that this is probably something that needs to be fixed in the
> drivers. Especially since the problem isn't so much the skbs but
> descriptors in the descriptor ring.
> 
> If I am not mistaken the issue is most drivers will honor the
> xmit_more unless the ring cannot enqueue another packet. The problem
> is if the clean-up is occurring on a different CPU than transmit we
> can cause the clean-up CPU/device DMA to go idle by not providing any
> notifications to the device that new packets are present. What we
> should probably do is look at adding another condition which is to
> force us to flush the packet if we have used over half of the
> descriptors in a given ring without notifying the device. Then that
> way we can be filling half while the device is processing the other
> half which should result in us operating smoothly.
> 
> - Alex

OK, that is definitely driver-specific, so I'm comfortable leaving it up to the driver implementation. I'll look at creating a patch that does something like this for i40e.

Thanks,
Jake
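
The extra condition Alex describes could be sketched roughly as below.
This is a minimal user-space model under stated assumptions, not i40e
code: struct tx_ring_model, tx_enqueue() and the field names are
hypothetical, and the half-ring threshold is simply the value proposed
in the quoted text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a Tx ring; not taken from any real driver. */
struct tx_ring_model {
	unsigned int size;       /* total descriptors in the ring */
	unsigned int unnotified; /* descriptors queued since the last tail bump */
	unsigned int tail_bumps; /* how often the hardware was notified */
};

/*
 * Queue ndesc descriptors for one packet. Honor the xmit_more hint
 * unless more than half of the ring is pending without a notification.
 * Returns true if the (modelled) tail register was written.
 */
static bool tx_enqueue(struct tx_ring_model *ring, unsigned int ndesc,
		       bool xmit_more)
{
	assert(ndesc <= ring->size);

	ring->unnotified += ndesc;

	if (xmit_more && ring->unnotified <= ring->size / 2)
		return false;	/* defer the tail bump */

	ring->unnotified = 0;
	ring->tail_bumps++;
	return true;
}
```

With an 8-entry ring, the tail is bumped as soon as a fifth descriptor
is pending, even if every packet so far carried xmit_more, so the device
can work on one half of the ring while the other half is being filled.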


* Re: Fwd: DA850-evm MAC Address is random
From: Adam Ford @ 2017-08-28 20:32 UTC (permalink / raw)
  To: Grygorii Strashko; +Cc: linux-omap, netdev, Sekhar Nori
In-Reply-To: <b7aab9fc-17ba-077b-3ff3-28087e0b25bc@ti.com>

On Mon, Aug 28, 2017 at 1:54 PM, Grygorii Strashko
<grygorii.strashko@ti.com> wrote:
> Cc: Sekhar
>
> On 08/28/2017 10:32 AM, Adam Ford wrote:
>>
>> The davinci_emac MAC address seems to attempt a call to
>> ti_cm_get_macid in cpsw-common.c but it returns the message
>> 'davinci_emac davinci_emac.1: incompatible machine/device type for
>> reading mac address ' and then generates a random MAC address.
>>
>> The function appears to look up various boards using
>> 'of_machine_is_compatible' and supports dm8148, am33xx, am3517, dm816,
>> am4372 and dra7.  I don't see the ti,davinci-dm6467-emac which is
>> what's shown in the da850 device tree.
>>
>> Is there a patch somewhere for supporting the da850-evm?
>
>
> I'm not sure the MAC address can be read from the Control module.
> Maybe Sekhar can say more?

My understanding is that the MAC address is programmed by Logic PD
into the SPI flash.  The Bootloader reads this from either SPI or its
env variables.  Looking at the partition info listed in the
da850-evm.dts file, it appears as if they've reserved space for it.
Unfortunately, I don't see any code that reads it out.  I was hoping
there might be a way to just pass a cmdline parameter from the
bootloader to the kernel so that it accepts the MAC address.

>
>>
>> If not, is there a way to pass the MAC address from U-Boot to the
>> driver so it doesn't generate a random MAC?
>
>
> "local-mac-address" dt prop

The downside here is that we'd have to have the Bootloader modify the
device tree.
>
> --
> regards,
> -grygorii

thanks

adam
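
Whichever route is used to hand the address over (a device tree
property or a command line string), the receiving side has to parse and
sanity-check it before use. A minimal user-space sketch of that step,
assuming a textual "aa:bb:cc:dd:ee:ff" form; parse_mac() is a
hypothetical helper, not something davinci_emac provides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse "aa:bb:cc:dd:ee:ff" into mac[]. Returns false for malformed
 * input, trailing characters, multicast addresses and all zeroes.
 */
static bool parse_mac(const char *str, unsigned char mac[6])
{
	static const unsigned char zero[6];
	unsigned int b[6];
	char extra;
	int i;

	assert(str != NULL);

	/* The trailing %c only converts if something follows the 6th octet. */
	if (sscanf(str, "%2x:%2x:%2x:%2x:%2x:%2x%c",
		   &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &extra) != 6)
		return false;

	for (i = 0; i < 6; i++)
		mac[i] = (unsigned char)b[i];

	/* Low bit of the first octet marks a multicast address. */
	if (mac[0] & 0x01)
		return false;

	return memcmp(mac, zero, sizeof(zero)) != 0;
}
```

A driver that gets false back from such a check would fall back to a
random address, which is the behaviour reported at the start of this
thread.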


* [PATCH net-next v1] amd-xgbe: Interrupt summary bits are h/w version dependent
From: Tom Lendacky @ 2017-08-28 20:29 UTC (permalink / raw)
  To: netdev; +Cc: David Miller

There is a difference in the bit position of the normal interrupt summary
enable (NIE) and abnormal interrupt summary enable (AIE) between revisions
of the hardware.  For older revisions the NIE and AIE bits are at positions
16 and 15 respectively.  For newer revisions the NIE and AIE bits are at
positions 15 and 14.  The effect of changing the bit position is that
newer hardware won't receive AIE interrupts in the current version of the
driver.  Specifically, the driver uses this interrupt to collect
statistics on when a receive buffer unavailable event occurs and to
restart the driver/device when a fatal bus error occurs.

Update the driver to set the interrupt enable bit based on the reported
version of the hardware.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 drivers/net/ethernet/amd/xgbe/xgbe-common.h |    8 ++++++--
 drivers/net/ethernet/amd/xgbe/xgbe-dev.c    |   13 ++++++++++---
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-common.h b/drivers/net/ethernet/amd/xgbe/xgbe-common.h
index 9431330..7ea72ef 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-common.h
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-common.h
@@ -210,11 +210,15 @@
 #define DMA_CH_CR_PBLX8_WIDTH		1
 #define DMA_CH_CR_SPH_INDEX		24
 #define DMA_CH_CR_SPH_WIDTH		1
-#define DMA_CH_IER_AIE_INDEX		15
+#define DMA_CH_IER_AIE20_INDEX		15
+#define DMA_CH_IER_AIE20_WIDTH		1
+#define DMA_CH_IER_AIE_INDEX		14
 #define DMA_CH_IER_AIE_WIDTH		1
 #define DMA_CH_IER_FBEE_INDEX		12
 #define DMA_CH_IER_FBEE_WIDTH		1
-#define DMA_CH_IER_NIE_INDEX		16
+#define DMA_CH_IER_NIE20_INDEX		16
+#define DMA_CH_IER_NIE20_WIDTH		1
+#define DMA_CH_IER_NIE_INDEX		15
 #define DMA_CH_IER_NIE_WIDTH		1
 #define DMA_CH_IER_RBUE_INDEX		7
 #define DMA_CH_IER_RBUE_WIDTH		1
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-dev.c b/drivers/net/ethernet/amd/xgbe/xgbe-dev.c
index 671203d..e107e18 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-dev.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-dev.c
@@ -649,13 +649,15 @@ static void xgbe_config_flow_control(struct xgbe_prv_data *pdata)
 static void xgbe_enable_dma_interrupts(struct xgbe_prv_data *pdata)
 {
 	struct xgbe_channel *channel;
-	unsigned int i;
+	unsigned int i, ver;
 
 	/* Set the interrupt mode if supported */
 	if (pdata->channel_irq_mode)
 		XGMAC_IOWRITE_BITS(pdata, DMA_MR, INTM,
 				   pdata->channel_irq_mode);
 
+	ver = XGMAC_GET_BITS(pdata->hw_feat.version, MAC_VR, SNPSVER);
+
 	for (i = 0; i < pdata->channel_count; i++) {
 		channel = pdata->channel[i];
 
@@ -671,8 +673,13 @@ static void xgbe_enable_dma_interrupts(struct xgbe_prv_data *pdata)
 		 *   AIE  - Abnormal Interrupt Summary Enable
 		 *   FBEE - Fatal Bus Error Enable
 		 */
-		XGMAC_SET_BITS(channel->curr_ier, DMA_CH_IER, NIE, 1);
-		XGMAC_SET_BITS(channel->curr_ier, DMA_CH_IER, AIE, 1);
+		if (ver < 0x21) {
+			XGMAC_SET_BITS(channel->curr_ier, DMA_CH_IER, NIE20, 1);
+			XGMAC_SET_BITS(channel->curr_ier, DMA_CH_IER, AIE20, 1);
+		} else {
+			XGMAC_SET_BITS(channel->curr_ier, DMA_CH_IER, NIE, 1);
+			XGMAC_SET_BITS(channel->curr_ier, DMA_CH_IER, AIE, 1);
+		}
 		XGMAC_SET_BITS(channel->curr_ier, DMA_CH_IER, FBEE, 1);
 
 		if (channel->tx_ring) {
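
The version check in the patch above boils down to choosing one of two
masks. A small stand-alone sketch of that selection, using the bit
positions and the 0x21 threshold from the patch; ier_summary_mask() and
the *_BIT names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Bit positions as given in the patch description. */
#define NIE20_BIT 16u	/* older hardware */
#define AIE20_BIT 15u
#define NIE_BIT   15u	/* newer hardware */
#define AIE_BIT   14u

/* Return the NIE|AIE enable mask for a given Synopsys version value. */
static uint32_t ier_summary_mask(unsigned int snps_ver)
{
	if (snps_ver < 0x21)
		return (UINT32_C(1) << NIE20_BIT) | (UINT32_C(1) << AIE20_BIT);

	return (UINT32_C(1) << NIE_BIT) | (UINT32_C(1) << AIE_BIT);
}
```

Bit 15 appears in both masks, but with a different meaning: writing the
old mask to newer hardware would enable NIE and leave AIE (bit 14)
clear, which matches the missing abnormal interrupts described in the
commit message.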


* [PATCH net-next] Revert "ipv4: make net_protocol const"
From: David Ahern @ 2017-08-28 20:23 UTC (permalink / raw)
  To: netdev, bhumirks, davem; +Cc: David Ahern

This reverts commit aa8db499ea67cff1f5f049033810ffede2fe5ae4.

Early demux structs cannot be made const. Doing so results in:
[   84.967355] BUG: unable to handle kernel paging request at ffffffff81684b10
[   84.969272] IP: proc_configure_early_demux+0x1e/0x3d
[   84.970544] PGD 1a0a067
[   84.970546] P4D 1a0a067
[   84.971212] PUD 1a0b063
[   84.971733] PMD 80000000016001e1

[   84.972669] Oops: 0003 [#1] SMP
[   84.973065] Modules linked in: ip6table_filter ip6_tables veth vrf
[   84.973833] CPU: 0 PID: 955 Comm: sysctl Not tainted 4.13.0-rc6+ #22
[   84.974612] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[   84.975855] task: ffff88003854ce00 task.stack: ffffc900005a4000
[   84.976580] RIP: 0010:proc_configure_early_demux+0x1e/0x3d
[   84.977253] RSP: 0018:ffffc900005a7dd0 EFLAGS: 00010246
[   84.977891] RAX: ffffffff81684b10 RBX: 0000000000000001 RCX: 0000000000000000
[   84.978759] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000000
[   84.979628] RBP: ffffc900005a7dd0 R08: 0000000000000000 R09: 0000000000000000
[   84.980501] R10: 0000000000000001 R11: 0000000000000008 R12: 0000000000000001
[   84.981373] R13: ffffffffffffffea R14: ffffffff81a9b4c0 R15: 0000000000000002
[   84.982249] FS:  00007feb237b7700(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[   84.983231] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   84.983941] CR2: ffffffff81684b10 CR3: 0000000038492000 CR4: 00000000000406f0
[   84.984817] Call Trace:
[   84.985133]  proc_tcp_early_demux+0x29/0x30

I think this is the second time such a patch has been reverted.

Cc: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
---
Bhumika: How are you testing these constify changes? In this case a simple
sysctl -w net.ipv4.tcp_early_demux=1 would have shown the problem.

 net/ipv4/af_inet.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 19aee073ba29..d678820e4306 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1596,7 +1596,7 @@ static const struct net_protocol igmp_protocol = {
 };
 #endif
 
-static const struct net_protocol tcp_protocol = {
+static struct net_protocol tcp_protocol = {
 	.early_demux	=	tcp_v4_early_demux,
 	.early_demux_handler =  tcp_v4_early_demux,
 	.handler	=	tcp_v4_rcv,
@@ -1606,7 +1606,7 @@ static const struct net_protocol tcp_protocol = {
 	.icmp_strict_tag_validation = 1,
 };
 
-static const struct net_protocol udp_protocol = {
+static struct net_protocol udp_protocol = {
 	.early_demux =	udp_v4_early_demux,
 	.early_demux_handler =	udp_v4_early_demux,
 	.handler =	udp_rcv,
-- 
2.1.4
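
Why const breaks here can be shown with a small user-space model. It
assumes the sysctl handler toggles the early_demux pointer in place,
which is what the faulting write in proc_configure_early_demux suggests
(Oops error code 0003 is a write to a present page). All *_model names
and configure_early_demux() are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*demux_fn)(void);

/* Trimmed-down stand-in for struct net_protocol. */
struct net_protocol_model {
	demux_fn early_demux;         /* active hook, rewritten at runtime */
	demux_fn early_demux_handler; /* the real handler, never changes */
};

static void tcp_demux_model(void)
{
}

/* Must stay non-const: configure_early_demux() stores into it. */
static struct net_protocol_model tcp_protocol_model = {
	.early_demux         = tcp_demux_model,
	.early_demux_handler = tcp_demux_model,
};

/* Models what happens when the early demux sysctl is written. */
static void configure_early_demux(struct net_protocol_model *prot,
				  int enabled)
{
	assert(prot != NULL);
	prot->early_demux = enabled ? prot->early_demux_handler : NULL;
}
```

If tcp_protocol_model were declared const, the compiler would reject
passing it to configure_early_demux(); in the kernel case the write
reached read-only memory at runtime instead, presumably because the
object was reached through a non-const pointer.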


* Re: [PATCH net-next v2 01/10] net: dsa: add debugfs interface
From: Andrew Lunn @ 2017-08-28 20:19 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Egil Hjelmeland, John Crispin, Woojung Huh, Sean Wang,
	Nikita Yushchenko, Chris Healy
In-Reply-To: <20170828191748.19492-2-vivien.didelot@savoirfairelinux.com>

On Mon, Aug 28, 2017 at 03:17:39PM -0400, Vivien Didelot wrote:
> This commit adds a DEBUG_FS dependent DSA core file creating a generic
> debug filesystem interface for the DSA switch devices.
> 
> The interface can be mounted with:
> 
>     # mount -t debugfs none /sys/kernel/debug
> 
> The dsa directory contains one directory per switch chip:
> 
>     # cd /sys/kernel/debug/dsa/
>     # ls
>     switch0  switch1 switch2
> 
> Each chip directory contains one directory per port:
> 
>     # ls -l switch0/
>     drwxr-xr-x 2 root root 0 Jan  1 00:00 port0
>     drwxr-xr-x 2 root root 0 Jan  1 00:00 port1
>     drwxr-xr-x 2 root root 0 Jan  1 00:00 port2
>     drwxr-xr-x 2 root root 0 Jan  1 00:00 port5
>     drwxr-xr-x 2 root root 0 Jan  1 00:00 port6
> 
> Future patches will add entry files to these directories.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew


* Re: [PATCH net-next v2 03/10] net: dsa: debugfs: add tag_protocol
From: Andrew Lunn @ 2017-08-28 20:16 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Egil Hjelmeland, John Crispin, Woojung Huh, Sean Wang,
	Nikita Yushchenko, Chris Healy
In-Reply-To: <20170828191748.19492-4-vivien.didelot@savoirfairelinux.com>

On Mon, Aug 28, 2017 at 03:17:41PM -0400, Vivien Didelot wrote:
> Add a debug filesystem "tag_protocol" entry to query the switch tagging
> protocol through the .get_tag_protocol operation.
> 
>     # cat switch1/tag_protocol
>     EDSA
> 
> To ease maintenance of tag protocols, add a dsa_tag_protocol_name helper
> to the public API which converts a tag protocol enum to a string.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew
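
An enum-to-string helper of the kind the commit message describes is
commonly built on a designated-initializer table. The sketch below is
only an illustration of that pattern: enum tag_proto_model, the table
and tag_proto_name() are hypothetical and are not the names or values
used by the DSA core.

```c
#include <stddef.h>

/* Hypothetical subset of tagging protocols; values are illustrative. */
enum tag_proto_model {
	TAG_PROTO_NONE,
	TAG_PROTO_DSA,
	TAG_PROTO_EDSA,
	TAG_PROTO_LAST,	/* must stay last */
};

static const char *const tag_proto_names[TAG_PROTO_LAST] = {
	[TAG_PROTO_NONE] = "none",
	[TAG_PROTO_DSA]  = "DSA",
	[TAG_PROTO_EDSA] = "EDSA",
};

/* Map a protocol value to its name; never returns NULL. */
static const char *tag_proto_name(unsigned int proto)
{
	if (proto >= TAG_PROTO_LAST || !tag_proto_names[proto])
		return "unknown";
	return tag_proto_names[proto];
}
```

Keeping the names in one table means a new tagger only needs one new
enum value and one new table entry, and out-of-range values still yield
a printable string.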

