Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: Why do we need MSG_SENDPAGE_NOTLAST?
From: Eric Dumazet @ 2017-05-06 17:13 UTC (permalink / raw)
  To: Ilya Lesokhin; +Cc: netdev@vger.kernel.org, tls-fpga-sw-dev, Dave Watson
In-Reply-To: <VI1PR0502MB29571B2B852BF681FC1DB9B0D4E80@VI1PR0502MB2957.eurprd05.prod.outlook.com>

Do not top-post on netdev, please.


On Sat, 2017-05-06 at 05:46 +0000, Ilya Lesokhin wrote:
> I don't follow.
> Why can't splice use MSG_MORE for the individual pages?
> Why does tcp_sendpage need to know if the MORE indicator is coming from the user or from splice?
> 
> I also don't understand your comment about partial writes.
> 

Make sure that sendpage() wont end up with a stall on TCP, if the socket
has not enough room to store the 16 pages provided by splice() or
sendpage()

Just use MSG_SENDPAGE_NOTLAST and be happy.


> Thanks,
> Ilya
> 
> > -----Original Message-----
> > From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> > Sent: Thursday, May 4, 2017 9:33 PM
> > To: Ilya Lesokhin <ilyal@mellanox.com>
> > Cc: netdev@vger.kernel.org; tls-fpga-sw-dev <tls-fpga-sw-
> > dev@mellanox.com>; Dave Watson <davejwatson@fb.com>
> > Subject: Re: Why do we need MSG_SENDPAGE_NOTLAST?
> > 
> > On Thu, 2017-05-04 at 17:03 +0000, Ilya Lesokhin wrote:
> > > I don't understand the need for MSG_SENDPAGE_NOTLAST and I'm hoping
> > > someone can enlighten me.
> > >
> > > According to commit 35f9c09 ('tcp: tcp_sendpages() should call
> > > tcp_push() once'):
> > > "We need to call tcp_flush() at the end of the last page processed in
> > > tcp_sendpages(), or else transmits can be deferred and future sends
> > > stall."
> > >
> > > I don't understand why we need to differentiate between the user
> > > setting MSG_MORE
> > > and splice indicating that more data is going to be sent.
> > > if the user passed MSG_MORE and didn't push any extra data, isn't it
> > > the users fault?
> > > Do we need it because poorly written applications were broken when
> > > MSG_MORE was added to tcp_sendpage? Or is there a deeper reason?
> > >
> > 
> > The answer lies to how splice() is working.
> > 
> > User can issue one splice without MSG_MORE semantic, right ?
> > 
> > Still, we want an implicit MORE behavior for all individual pages, but
> > the last one.
> > 
> > 
> > > The reason I'm asking is that we are working on a kernel TLS
> > > implementation
> > > and I would like to know if we can coalesce multiple tls_sendpage
> > > calls with MSG_MORE into a single
> > > tls record or whether we must push out the record as soon as
> > > MSG_SENDPAGE_NOTLAST is cleared?
> > 
> > Make sure you handle partial writes (you want to coalesce 10 pages, but
> > stack will only take 5 of them)
> > 
> > 
> 

^ permalink raw reply

* Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
From: Alexander Duyck @ 2017-05-06 18:07 UTC (permalink / raw)
  To: Ding Tianhong
  Cc: Casey Leedom, Raj, Ashok, Bjorn Helgaas, Michael Werner,
	Ganesh GR, Arjun V., Asit K Mallick, Patrick J Cramer,
	Suravee Suthikulpanit, Bob Shaw, h, Mark Rutland, Amir Ancel,
	Gabriele Paoloni, Catalin Marinas, Will Deacon, LinuxArm,
	David Laight, Jeff Kirsher, Netdev
In-Reply-To: <22c46382-5778-022f-d0bd-74abec9e9c7c@huawei.com>

On Fri, May 5, 2017 at 8:08 PM, Ding Tianhong <dingtianhong@huawei.com> wrote:
>
>
> On 2017/5/5 22:04, Alexander Duyck wrote:
>> On Thu, May 4, 2017 at 2:01 PM, Casey Leedom <leedom@chelsio.com> wrote:
>>> | From: Alexander Duyck <alexander.duyck@gmail.com>
>>> | Sent: Wednesday, May 3, 2017 9:02 AM
>>> | ...
>>> | It sounds like we are more or less in agreement. My only concern is
>>> | really what we default this to. On x86 I would say we could probably
>>> | default this to disabled for existing platforms since my understanding
>>> | is that relaxed ordering doesn't provide much benefit on what is out
>>> | there right now when performing DMA through the root complex. As far
>>> | as peer-to-peer I would say we should probably look at enabling the
>>> | ability to have Relaxed Ordering enabled for some channels but not
>>> | others. In those cases the hardware needs to be smart enough to allow
>>> | for you to indicate you want it disabled by default for most of your
>>> | DMA channels, and then enabled for the select channels that are
>>> | handling the peer-to-peer traffic.
>>>
>>>   Yes, I think that we are mostly in agreement.  I had just wanted to make
>>> sure that whatever scheme was developed would allow for simultaneously
>>> supporting non-Relaxed Ordering for some PCIe End Points and Relaxed
>>> Ordering for others within the same system.  I.e. not simply
>>> enabling/disabling/etc.  based solely on System Platform Architecture.
>>>
>>>   By the way, I've started our QA folks off looking at what things look like
>>> in Linux Virtual Machines under different Hypervisors to see what
>>> information they may provide to the VM in the way of what Root Complex Port
>>> is being used, etc.  So far they've got Windows HyperV done and there
>>> there's no PCIe Fabric exposed in any way: just the attached device.  I'll
>>> have to see what pci_find_pcie_root_port() returns in that environment.
>>> Maybe NULL?
>>
>> I believe NULL is one of the options. It all depends on what qemu is
>> emulating. Most likely you won't find a pcie root port on KVM because
>> the default is to emulate an older system that only supports PCI.
>>
>>>   With your reservations (which I also share), I think that it probably
>>> makes sense to have a per-architecture definition of the "Can I Use Relaxed
>>> Ordering With TLPs Directed At This End Point" predicate, with the default
>>> being "No" for any architecture which doesn't implement the predicate.  And
>>> if the specified (struct pci_dev *) End Node is NULL, it ought to return
>>> False for that as well.  I can't see any reason to pass in the Source End
>>> Node but I may be missing something.
>>>
>>>   At this point, this is pretty far outside my level of expertise.  I'm
>>> happy to give it a go, but I'd be even happier if someone with a lot more
>>> experience in the PCIe Infrastructure were to want to carry the ball
>>> forward.  I'm not super familiar with the Linux Kernel "Rules Of
>>> Engagement", so let me know what my next step should be.  Thanks.
>>>
>>> Casey
>>
>> For now we can probably keep this on the linux-pci mailing list. Going
>> that route is the most straight forward for now since step one is
>> probably just making sure we are setting the relaxed ordering bit in
>> the setups that make sense. I would say we could probably keep it
>> simple. We just need to enable relaxed ordering by default for SPARC
>> architectures, on most others we can probably default it to off.
>>
>
> Casey, Alexander:
>
> Thanks for the wonderful discussion, it is more clearly that what to do next,
> I agree that enable relaxed ordering by default only for SPARC and ARM64
> is more safe for all the other platform, as no one want to break anything.
>
>> I believe this all had started as Ding Tianhong was hoping to enable
>> this for the ARM architecture. That is the only one I can think of
>> where it might be difficult to figure out which way to default as we
>> were attempting to follow the same code that was enabled for SPARC and
>> that is what started this tug-of-war about how this should be done.
>> What we might do is take care of this in two phases. The first one
>> enables the infrastructure generically but leaves it defaulted to off
>> for everyone but SPARC. Then we can go through and start enabling it
>> for other platforms such as some of those on ARM in the platforms that
>> Ding Tianhong was working with.
>>
>
> According the suggestion, I could only think of this code:
>
> @@ -3979,6 +3979,15 @@ static void quirk_tw686x_class(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_CLASS_EARLY(0x1797, 0x6869, PCI_CLASS_NOT_DEFINED, 8,
>                               quirk_tw686x_class);
>
> +static void quirk_relaxedordering_disable(struct pci_dev *dev)
> +{
> + if (dev->vendor != PCI_VENDOR_ID_HUAWEI &&
> +     dev->vendor != PCI_VENDOR_ID_SUN)
> +         dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
> +}
> +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_INTEL_ID, PCI_ANY_ID, PCI_CLASS_NOT_DEFINED, 8,
> +                       quirk_relaxedordering_disable);
> +
>  /*
>   * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same
>   * values for the Attribute as were supplied in the header of the
>
>
> What do you think of it?
>
> Thanks
> Ding
>

This is a bit simplistic but it is a start.

The other bit I was getting at is that we need to update the core PCIe
code so that when we configure devices and the root complex reports no
support for relaxed ordering it should be clearing the relaxed
ordering bits in the PCIe configuration registers on the upstream
facing devices.

The last bit we need in all this is a way to allow for setups where
peer-to-peer wants to perform relaxed ordering but for writes to the
host we have to not use relaxed ordering. For that we need to enable a
special case and that isn't handled right now in any of the solutions
we have coded up so far.

- Alex

^ permalink raw reply

* Re: arch: arm: bpf: Converting cBPF to eBPF for arm 32 bit
From: David Miller @ 2017-05-06 18:38 UTC (permalink / raw)
  To: illusionist.neo
  Cc: daniel, keescook, mgherzan, netdev, kernel-hardening,
	linux-arm-kernel, ast
In-Reply-To: <CAHgaXdK1XjDnhTqtk=nCq2crrigHiuCaLhF+1GupJt9bPThwZw@mail.gmail.com>

From: Shubham Bansal <illusionist.neo@gmail.com>
Date: Sat, 6 May 2017 22:18:16 +0530

> Hi Daniel,
> 
> Thanks for the last reply about the testing of eBPF JIT.
> 
> I have one issue though, I am not able to find what BPF_ABS and
> BPF_IND instruction does exactly.

They are not instructions, they are modifiers for the BPF_LD
instruction which indicate an SKB load is to be performed.

You never need to ask what a BPF instruction does, it is clear
defined in the BPF interperter found in kernel/bpf/core.c

Look for the case statement LD_ABS_W and friends in __bpf_prog_run().

^ permalink raw reply

* [PATCH iproute2 v2 0/1] vxlan: support for modifying vxlan device
From: Girish Moodalbail @ 2017-05-06 18:37 UTC (permalink / raw)
  To: stephen; +Cc: netdev

Hello all,

This patch adds support for modifying VXLAN device attributes. I have
refactored the vxlan_parse_opt() function to be more readable and not
use lot of bool variables.

I have tested my changes by running Linux Test Project's VXLAN
testcases, and I didn't see any regression.
---
v1->v2
- refactored vxlan_parse_opt() to not to use a bunch of
  foo_set variables

Girish Moodalbail (1):
  vxlan: Add support for modifying vxlan device attributes

 ip/iplink_vxlan.c | 251 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 143 insertions(+), 108 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* [PATCH iproute2 v2 1/1] vxlan: Add support for modifying vxlan device attributes
From: Girish Moodalbail @ 2017-05-06 18:37 UTC (permalink / raw)
  To: stephen; +Cc: netdev
In-Reply-To: <1494095863-21916-1-git-send-email-girish.moodalbail@oracle.com>

Ability to change vxlan device attributes was added to kernel through
commit 8bcdc4f3a20b ("vxlan: add changelink support"), however one
cannot do the same through ip(8) command.  Changing the allowed vxlan
device attributes using 'ip link set dev <vxlan_name> type vxlan
<allowed_attributes>' currently fails with 'operation not supported'
error.  This failure is due to the incorrect rtnetlink message
construction for the 'ip link set' operation.

The vxlan_parse_opt() callback function is called for parsing options
for both 'ip link add' and 'ip link set'. For the 'add' case, we pass
down default values for those attributes that were not provided as CLI
options. However, for the 'set' case we should be only passing down the
explicitly provided attributes and not any other (default) attributes.

Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
---
 ip/iplink_vxlan.c | 251 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 143 insertions(+), 108 deletions(-)

diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index b4ebb13..2bd619d 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -21,6 +21,8 @@
 #include "utils.h"
 #include "ip_common.h"
 
+#define VXLAN_ATTRSET(attrs, type) (((attrs) & (1L << (type))) != 0)
+
 static void print_explain(FILE *f)
 {
 	fprintf(f,
@@ -59,54 +61,50 @@ static void explain(void)
 	print_explain(stderr);
 }
 
+static void check_duparg(__u64 *attrs, int type, const char *key,
+			 const char *argv)
+{
+	if (!VXLAN_ATTRSET(*attrs, type)) {
+		*attrs |= (1L << type);
+		return;
+	}
+	duparg2(key, argv);
+}
+
 static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 			  struct nlmsghdr *n)
 {
 	__u32 vni = 0;
-	int vni_set = 0;
-	__u32 saddr = 0;
 	__u32 gaddr = 0;
 	__u32 daddr = 0;
-	struct in6_addr saddr6 = IN6ADDR_ANY_INIT;
 	struct in6_addr gaddr6 = IN6ADDR_ANY_INIT;
 	struct in6_addr daddr6 = IN6ADDR_ANY_INIT;
-	unsigned int link = 0;
-	__u8 tos = 0;
-	__u8 ttl = 0;
-	__u32 label = 0;
 	__u8 learning = 1;
-	__u8 proxy = 0;
-	__u8 rsc = 0;
-	__u8 l2miss = 0;
-	__u8 l3miss = 0;
-	__u8 noage = 0;
-	__u32 age = 0;
-	__u32 maxaddr = 0;
 	__u16 dstport = 0;
-	__u8 udpcsum = 0;
-	bool udpcsum_set = false;
-	__u8 udp6zerocsumtx = 0;
-	bool udp6zerocsumtx_set = false;
-	__u8 udp6zerocsumrx = 0;
-	bool udp6zerocsumrx_set = false;
-	__u8 remcsumtx = 0;
-	__u8 remcsumrx = 0;
 	__u8 metadata = 0;
-	__u8 gbp = 0;
-	__u8 gpe = 0;
-	int dst_port_set = 0;
-	struct ifla_vxlan_port_range range = { 0, 0 };
+	__u64 attrs = 0;
+	bool set_op = (n->nlmsg_type == RTM_NEWLINK &&
+		       !(n->nlmsg_flags & NLM_F_CREATE));
 
 	while (argc > 0) {
 		if (!matches(*argv, "id") ||
 		    !matches(*argv, "vni")) {
+			/* We will add ID attribute outside of the loop since we
+			 * need to consider metadata information as well.
+			 */
 			NEXT_ARG();
+			check_duparg(&attrs, IFLA_VXLAN_ID, "id", *argv);
 			if (get_u32(&vni, *argv, 0) ||
 			    vni >= 1u << 24)
 				invarg("invalid id", *argv);
-			vni_set = 1;
 		} else if (!matches(*argv, "group")) {
+			if (daddr || !IN6_IS_ADDR_UNSPECIFIED(&daddr6)) {
+				fprintf(stderr, "vxlan: both group and remote");
+				fprintf(stderr, " cannot be specified\n");
+				return -1;
+			}
 			NEXT_ARG();
+			check_duparg(&attrs, IFLA_VXLAN_GROUP, "group", *argv);
 			if (!inet_get_addr(*argv, &gaddr, &gaddr6)) {
 				fprintf(stderr, "Invalid address \"%s\"\n", *argv);
 				return -1;
@@ -114,7 +112,13 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 			if (!IN6_IS_ADDR_MULTICAST(&gaddr6) && !IN_MULTICAST(ntohl(gaddr)))
 				invarg("invalid group address", *argv);
 		} else if (!matches(*argv, "remote")) {
+			if (gaddr || !IN6_IS_ADDR_UNSPECIFIED(&gaddr6)) {
+				fprintf(stderr, "vxlan: both group and remote");
+				fprintf(stderr, " cannot be specified\n");
+				return -1;
+			}
 			NEXT_ARG();
+			check_duparg(&attrs, IFLA_VXLAN_GROUP, "remote", *argv);
 			if (!inet_get_addr(*argv, &daddr, &daddr6)) {
 				fprintf(stderr, "Invalid address \"%s\"\n", *argv);
 				return -1;
@@ -122,7 +126,11 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 			if (IN6_IS_ADDR_MULTICAST(&daddr6) || IN_MULTICAST(ntohl(daddr)))
 				invarg("invalid remote address", *argv);
 		} else if (!matches(*argv, "local")) {
+			__u32 saddr = 0;
+			struct in6_addr saddr6 = IN6ADDR_ANY_INIT;
+
 			NEXT_ARG();
+			check_duparg(&attrs, IFLA_VXLAN_LOCAL, "local", *argv);
 			if (strcmp(*argv, "any")) {
 				if (!inet_get_addr(*argv, &saddr, &saddr6)) {
 					fprintf(stderr, "Invalid address \"%s\"\n", *argv);
@@ -132,19 +140,31 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 
 			if (IN_MULTICAST(ntohl(saddr)) || IN6_IS_ADDR_MULTICAST(&saddr6))
 				invarg("invalid local address", *argv);
+
+			if (saddr)
+				addattr_l(n, 1024, IFLA_VXLAN_LOCAL, &saddr, 4);
+			else if (!IN6_IS_ADDR_UNSPECIFIED(&saddr6))
+				addattr_l(n, 1024, IFLA_VXLAN_LOCAL6, &saddr6,
+					  sizeof(struct in6_addr));
 		} else if (!matches(*argv, "dev")) {
+			unsigned int link;
+
 			NEXT_ARG();
+			check_duparg(&attrs, IFLA_VXLAN_LINK, "dev", *argv);
 			link = if_nametoindex(*argv);
 			if (link == 0) {
 				fprintf(stderr, "Cannot find device \"%s\"\n",
 					*argv);
 				exit(-1);
 			}
+			addattr32(n, 1024, IFLA_VXLAN_LINK, link);
 		} else if (!matches(*argv, "ttl") ||
 			   !matches(*argv, "hoplimit")) {
 			unsigned int uval;
+			__u8 ttl = 0;
 
 			NEXT_ARG();
+			check_duparg(&attrs, IFLA_VXLAN_TTL, "ttl", *argv);
 			if (strcmp(*argv, "inherit") != 0) {
 				if (get_unsigned(&uval, *argv, 0))
 					invarg("invalid TTL", *argv);
@@ -152,106 +172,163 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 					invarg("TTL must be <= 255", *argv);
 				ttl = uval;
 			}
+			addattr8(n, 1024, IFLA_VXLAN_TTL, ttl);
 		} else if (!matches(*argv, "tos") ||
 			   !matches(*argv, "dsfield")) {
 			__u32 uval;
+			__u8 tos;
 
 			NEXT_ARG();
+			check_duparg(&attrs, IFLA_VXLAN_TOS, "tos", *argv);
 			if (strcmp(*argv, "inherit") != 0) {
 				if (rtnl_dsfield_a2n(&uval, *argv))
 					invarg("bad TOS value", *argv);
 				tos = uval;
 			} else
 				tos = 1;
+			addattr8(n, 1024, IFLA_VXLAN_TOS, tos);
 		} else if (!matches(*argv, "label") ||
 			   !matches(*argv, "flowlabel")) {
 			__u32 uval;
 
 			NEXT_ARG();
+			check_duparg(&attrs, IFLA_VXLAN_LABEL, "flowlabel",
+				     *argv);
 			if (get_u32(&uval, *argv, 0) ||
 			    (uval & ~LABEL_MAX_MASK))
 				invarg("invalid flowlabel", *argv);
-			label = htonl(uval);
+			addattr32(n, 1024, IFLA_VXLAN_LABEL, htonl(uval));
 		} else if (!matches(*argv, "ageing")) {
+			__u32 age;
+
 			NEXT_ARG();
+			check_duparg(&attrs, IFLA_VXLAN_AGEING, "ageing",
+				     *argv);
 			if (strcmp(*argv, "none") == 0)
-				noage = 1;
+				age = 0;
 			else if (get_u32(&age, *argv, 0))
 				invarg("ageing timer", *argv);
+			addattr32(n, 1024, IFLA_VXLAN_AGEING, age);
 		} else if (!matches(*argv, "maxaddress")) {
+			__u32 maxaddr;
+
 			NEXT_ARG();
+			check_duparg(&attrs, IFLA_VXLAN_LIMIT,
+				     "maxaddress", *argv);
 			if (strcmp(*argv, "unlimited") == 0)
 				maxaddr = 0;
 			else if (get_u32(&maxaddr, *argv, 0))
 				invarg("max addresses", *argv);
+			addattr32(n, 1024, IFLA_VXLAN_LIMIT, maxaddr);
 		} else if (!matches(*argv, "port") ||
 			   !matches(*argv, "srcport")) {
+			struct ifla_vxlan_port_range range = { 0, 0 };
+
 			NEXT_ARG();
+			check_duparg(&attrs, IFLA_VXLAN_PORT_RANGE, "srcport",
+				     *argv);
 			if (get_be16(&range.low, *argv, 0))
 				invarg("min port", *argv);
 			NEXT_ARG();
 			if (get_be16(&range.high, *argv, 0))
 				invarg("max port", *argv);
+			if (range.low || range.high) {
+				addattr_l(n, 1024, IFLA_VXLAN_PORT_RANGE,
+					  &range, sizeof(range));
+			}
 		} else if (!matches(*argv, "dstport")) {
 			NEXT_ARG();
+			check_duparg(&attrs, IFLA_VXLAN_PORT, "dstport", *argv);
 			if (get_u16(&dstport, *argv, 0))
 				invarg("dst port", *argv);
-			dst_port_set = 1;
 		} else if (!matches(*argv, "nolearning")) {
+			check_duparg(&attrs, IFLA_VXLAN_LEARNING, *argv, *argv);
 			learning = 0;
 		} else if (!matches(*argv, "learning")) {
+			check_duparg(&attrs, IFLA_VXLAN_LEARNING, *argv, *argv);
 			learning = 1;
 		} else if (!matches(*argv, "noproxy")) {
-			proxy = 0;
+			check_duparg(&attrs, IFLA_VXLAN_PROXY, *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_PROXY, 0);
 		} else if (!matches(*argv, "proxy")) {
-			proxy = 1;
+			check_duparg(&attrs, IFLA_VXLAN_PROXY, *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_PROXY, 1);
 		} else if (!matches(*argv, "norsc")) {
-			rsc = 0;
+			check_duparg(&attrs, IFLA_VXLAN_RSC, *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_RSC, 0);
 		} else if (!matches(*argv, "rsc")) {
-			rsc = 1;
+			check_duparg(&attrs, IFLA_VXLAN_RSC, *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_RSC, 1);
 		} else if (!matches(*argv, "nol2miss")) {
-			l2miss = 0;
+			check_duparg(&attrs, IFLA_VXLAN_L2MISS, *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_L2MISS, 0);
 		} else if (!matches(*argv, "l2miss")) {
-			l2miss = 1;
+			check_duparg(&attrs, IFLA_VXLAN_L2MISS, *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_L2MISS, 1);
 		} else if (!matches(*argv, "nol3miss")) {
-			l3miss = 0;
+			check_duparg(&attrs, IFLA_VXLAN_L3MISS, *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_L3MISS, 0);
 		} else if (!matches(*argv, "l3miss")) {
-			l3miss = 1;
+			check_duparg(&attrs, IFLA_VXLAN_L3MISS, *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_L3MISS, 1);
 		} else if (!matches(*argv, "udpcsum")) {
-			udpcsum = 1;
-			udpcsum_set = true;
+			check_duparg(&attrs, IFLA_VXLAN_UDP_CSUM, *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_UDP_CSUM, 1);
 		} else if (!matches(*argv, "noudpcsum")) {
-			udpcsum = 0;
-			udpcsum_set = true;
+			check_duparg(&attrs, IFLA_VXLAN_UDP_CSUM, *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_UDP_CSUM, 0);
 		} else if (!matches(*argv, "udp6zerocsumtx")) {
-			udp6zerocsumtx = 1;
-			udp6zerocsumtx_set = true;
+			check_duparg(&attrs, IFLA_VXLAN_UDP_ZERO_CSUM6_TX,
+				     *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_UDP_ZERO_CSUM6_TX, 1);
 		} else if (!matches(*argv, "noudp6zerocsumtx")) {
-			udp6zerocsumtx = 0;
-			udp6zerocsumtx_set = true;
+			check_duparg(&attrs, IFLA_VXLAN_UDP_ZERO_CSUM6_TX,
+				     *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_UDP_ZERO_CSUM6_TX, 0);
 		} else if (!matches(*argv, "udp6zerocsumrx")) {
-			udp6zerocsumrx = 1;
-			udp6zerocsumrx_set = true;
+			check_duparg(&attrs, IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
+				     *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
 		} else if (!matches(*argv, "noudp6zerocsumrx")) {
-			udp6zerocsumrx = 0;
-			udp6zerocsumrx_set = true;
+			check_duparg(&attrs, IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
+				     *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 0);
 		} else if (!matches(*argv, "remcsumtx")) {
-			remcsumtx = 1;
+			check_duparg(&attrs, IFLA_VXLAN_REMCSUM_TX,
+				     *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_REMCSUM_TX, 1);
 		} else if (!matches(*argv, "noremcsumtx")) {
-			remcsumtx = 0;
+			check_duparg(&attrs, IFLA_VXLAN_REMCSUM_TX,
+				     *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_REMCSUM_TX, 0);
 		} else if (!matches(*argv, "remcsumrx")) {
-			remcsumrx = 1;
+			check_duparg(&attrs, IFLA_VXLAN_REMCSUM_RX,
+				     *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_REMCSUM_RX, 1);
 		} else if (!matches(*argv, "noremcsumrx")) {
-			remcsumrx = 0;
+			check_duparg(&attrs, IFLA_VXLAN_REMCSUM_RX,
+				     *argv, *argv);
+			addattr8(n, 1024, IFLA_VXLAN_REMCSUM_RX, 0);
 		} else if (!matches(*argv, "external")) {
+			check_duparg(&attrs, IFLA_VXLAN_COLLECT_METADATA,
+				     *argv, *argv);
 			metadata = 1;
 			learning = 0;
+			/* we will add LEARNING attribute outside of the loop */
+			addattr8(n, 1024, IFLA_VXLAN_COLLECT_METADATA,
+				 metadata);
 		} else if (!matches(*argv, "noexternal")) {
+			check_duparg(&attrs, IFLA_VXLAN_COLLECT_METADATA,
+				     *argv, *argv);
 			metadata = 0;
+			addattr8(n, 1024, IFLA_VXLAN_COLLECT_METADATA,
+				 metadata);
 		} else if (!matches(*argv, "gbp")) {
-			gbp = 1;
+			check_duparg(&attrs, IFLA_VXLAN_GBP, *argv, *argv);
+			addattr_l(n, 1024, IFLA_VXLAN_GBP, NULL, 0);
 		} else if (!matches(*argv, "gpe")) {
-			gpe = 1;
+			check_duparg(&attrs, IFLA_VXLAN_GPE, *argv, *argv);
+			addattr_l(n, 1024, IFLA_VXLAN_GPE, NULL, 0);
 		} else if (matches(*argv, "help") == 0) {
 			explain();
 			return -1;
@@ -263,31 +340,26 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 		argc--, argv++;
 	}
 
-	if (metadata && vni_set) {
+	if (metadata && VXLAN_ATTRSET(attrs, IFLA_VXLAN_ID)) {
 		fprintf(stderr, "vxlan: both 'external' and vni cannot be specified\n");
 		return -1;
 	}
 
-	if (!metadata && !vni_set) {
+	if (!metadata && !VXLAN_ATTRSET(attrs, IFLA_VXLAN_ID)) {
 		fprintf(stderr, "vxlan: missing virtual network identifier\n");
 		return -1;
 	}
 
-	if ((gaddr && daddr) ||
-	    (!IN6_IS_ADDR_UNSPECIFIED(&gaddr6) &&
-	     !IN6_IS_ADDR_UNSPECIFIED(&daddr6))) {
-		fprintf(stderr, "vxlan: both group and remote cannot be specified\n");
-		return -1;
-	}
-
-	if ((gaddr || !IN6_IS_ADDR_UNSPECIFIED(&gaddr6)) && !link) {
+	if ((gaddr || !IN6_IS_ADDR_UNSPECIFIED(&gaddr6)) &&
+	    !VXLAN_ATTRSET(attrs, IFLA_VXLAN_LINK)) {
 		fprintf(stderr, "vxlan: 'group' requires 'dev' to be specified\n");
 		return -1;
 	}
 
-	if (!dst_port_set && gpe) {
+	if (!VXLAN_ATTRSET(attrs, IFLA_VXLAN_PORT) &&
+	    VXLAN_ATTRSET(attrs, IFLA_VXLAN_GPE)) {
 		dstport = 4790;
-	} else if (!dst_port_set) {
+	} else if (!VXLAN_ATTRSET(attrs, IFLA_VXLAN_PORT) && !set_op) {
 		fprintf(stderr, "vxlan: destination port not specified\n"
 			"Will use Linux kernel default (non-standard value)\n");
 		fprintf(stderr,
@@ -309,49 +381,12 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 	else if (preferred_family == AF_INET6)
 		addattr_l(n, 1024, IFLA_VXLAN_GROUP6, &daddr6, sizeof(struct in6_addr));
 
-	if (saddr)
-		addattr_l(n, 1024, IFLA_VXLAN_LOCAL, &saddr, 4);
-	else if (!IN6_IS_ADDR_UNSPECIFIED(&saddr6))
-		addattr_l(n, 1024, IFLA_VXLAN_LOCAL6, &saddr6, sizeof(struct in6_addr));
-
-	if (link)
-		addattr32(n, 1024, IFLA_VXLAN_LINK, link);
-	addattr32(n, 1024, IFLA_VXLAN_LABEL, label);
-	addattr8(n, 1024, IFLA_VXLAN_TTL, ttl);
-	addattr8(n, 1024, IFLA_VXLAN_TOS, tos);
-	addattr8(n, 1024, IFLA_VXLAN_LEARNING, learning);
-	addattr8(n, 1024, IFLA_VXLAN_PROXY, proxy);
-	addattr8(n, 1024, IFLA_VXLAN_RSC, rsc);
-	addattr8(n, 1024, IFLA_VXLAN_L2MISS, l2miss);
-	addattr8(n, 1024, IFLA_VXLAN_L3MISS, l3miss);
-	addattr8(n, 1024, IFLA_VXLAN_REMCSUM_TX, remcsumtx);
-	addattr8(n, 1024, IFLA_VXLAN_REMCSUM_RX, remcsumrx);
-	addattr8(n, 1024, IFLA_VXLAN_COLLECT_METADATA, metadata);
-
-	if (udpcsum_set)
-		addattr8(n, 1024, IFLA_VXLAN_UDP_CSUM, udpcsum);
-	if (udp6zerocsumtx_set)
-		addattr8(n, 1024, IFLA_VXLAN_UDP_ZERO_CSUM6_TX, udp6zerocsumtx);
-	if (udp6zerocsumrx_set)
-		addattr8(n, 1024, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, udp6zerocsumrx);
-	if (noage)
-		addattr32(n, 1024, IFLA_VXLAN_AGEING, 0);
-	else if (age)
-		addattr32(n, 1024, IFLA_VXLAN_AGEING, age);
-	if (maxaddr)
-		addattr32(n, 1024, IFLA_VXLAN_LIMIT, maxaddr);
-	if (range.low || range.high)
-		addattr_l(n, 1024, IFLA_VXLAN_PORT_RANGE,
-			  &range, sizeof(range));
+	if (!set_op || VXLAN_ATTRSET(attrs, IFLA_VXLAN_LEARNING))
+		addattr8(n, 1024, IFLA_VXLAN_LEARNING, learning);
+
 	if (dstport)
 		addattr16(n, 1024, IFLA_VXLAN_PORT, htons(dstport));
 
-	if (gbp)
-		addattr_l(n, 1024, IFLA_VXLAN_GBP, NULL, 0);
-	if (gpe)
-		addattr_l(n, 1024, IFLA_VXLAN_GPE, NULL, 0);
-
-
 	return 0;
 }
 
-- 
1.8.3.1

^ permalink raw reply related

* Re: arch: arm: bpf: Converting cBPF to eBPF for arm 32 bit
From: Shubham Bansal @ 2017-05-06 20:27 UTC (permalink / raw)
  To: David Miller
  Cc: Kees Cook, Daniel Borkmann, kernel-hardening, Network Development,
	ast, Mircea Gherzan, linux-arm-kernel
In-Reply-To: <20170506.143816.2259768618578706384.davem@davemloft.net>

Thanks David.

Hi all,

I have two questions about the code at arch/arm64/net/bpf_jit_comp.c.

1. At line 708, " const u8 r1 = bpf2a64[BPF_REG_1]; /* r1: struct
sk_buff *skb */ ".
    Why is this code using BPF_REG_1 before saving it? As far as I
know, BPF_REG_1 has pointer to bpf program context and this code
clearly is overwriting that pointer which makes that pointer useless
for future usage. It clearly looks like a bug.

2. At line 256, " emit(A64_LDR64(prg, tmp, r3), ctx); ".
    This line of code is used to load an array( of pointers ) element,
where r3 is used as an index of that array. Shouldn't it be be
arithmetic left shifted by 3 or multiplied by 8 to get the right
address in that array of pointers ?

Apologies if any of the above question is stupid to ask.

Best,
Shubham
Best,
Shubham Bansal


On Sun, May 7, 2017 at 12:08 AM, David Miller <davem@davemloft.net> wrote:
> From: Shubham Bansal <illusionist.neo@gmail.com>
> Date: Sat, 6 May 2017 22:18:16 +0530
>
>> Hi Daniel,
>>
>> Thanks for the last reply about the testing of eBPF JIT.
>>
>> I have one issue though, I am not able to find what BPF_ABS and
>> BPF_IND instruction does exactly.
>
> They are not instructions, they are modifiers for the BPF_LD
> instruction which indicate an SKB load is to be performed.
>
> You never need to ask what a BPF instruction does, it is clear
> defined in the BPF interperter found in kernel/bpf/core.c
>
> Look for the case statement LD_ABS_W and friends in __bpf_prog_run().

^ permalink raw reply

* [PATCH iproute2] tc: bpf: add ppc64 and sparc64 to list of archs with eBPF support
From: Alexander Alemayhu @ 2017-05-06 20:30 UTC (permalink / raw)
  To: netdev; +Cc: daniel, stephen, Alexander Alemayhu

sparc64 support was added in 7a12b5031c6b (sparc64: Add eBPF JIT., 2017-04-17)[0]
and ppc64 in 156d0e290e96 (powerpc/ebpf/jit: Implement JIT compiler for extended BPF, 2016-06-22)[1].

[0]: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=7a12b5031c6b
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=156d0e290e96
Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>
---
 man/man8/tc-bpf.8 | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/man/man8/tc-bpf.8 b/man/man8/tc-bpf.8
index e371964d06ab..2e9812ede028 100644
--- a/man/man8/tc-bpf.8
+++ b/man/man8/tc-bpf.8
@@ -75,9 +75,9 @@ In Linux, it's generally considered that eBPF is the successor of cBPF.
 The kernel internally transforms cBPF expressions into eBPF expressions and
 executes the latter. Execution of them can be performed in an interpreter
 or at setup time, they can be just-in-time compiled (JIT'ed) to run as
-native machine code. Currently, x86_64, ARM64 and s390 architectures have
-eBPF JIT support, whereas PPC, SPARC, ARM and MIPS have cBPF, but did not
-(yet) switch to eBPF JIT support.
+native machine code. Currently, x86_64, ARM64, s390, ppc64 and sparc64
+architectures have eBPF JIT support, whereas PPC, SPARC, ARM and MIPS have
+cBPF, but did not (yet) switch to eBPF JIT support.
 
 eBPF's instruction set has similar underlying principles as the cBPF
 instruction set, it however is modelled closer to the underlying
-- 
2.7.4

^ permalink raw reply related

* [RFC PATCH 0/3] udp: scalability improvements
From: Paolo Abeni @ 2017-05-06 20:42 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet

This patch series implement an idea suggested by Eric Dumazet to
reduce the contention of the udp sk_receive_queue lock when the socket is
under flood.

An ancillary queue is added to the udp socket, and the socket always
tries first to read packets from such queue. If it's empty, we splice
the content from sk_receive_queue into the ancillary queue.

The first patch introduces some helpers to keep the udp code small, and the
following two implement the ancillary queue strategy. The code is split
to hopefully help the reviewing process.

The measured overall gain under udp flood is in the 20-35% range depending on
the numa layout and the number of ingress queue used by the relevant nic.

On a single numa node host, the peak tput is now reached when the traffic
targeting the udp socket uses multiple nic rx queues, while on current net-next
the tput always decreases when moving from a single rx queue to multiple ones.

Paolo Abeni (3):
  net/sock: factor out dequeue/peek with offset code
  udp: use a separate rx queue for packet reception
  udp: keep the sk_receive_queue held when splicing

 include/linux/skbuff.h |   7 +++
 include/linux/udp.h    |   3 +
 include/net/sock.h     |   4 +-
 include/net/udp.h      |   9 +--
 include/net/udplite.h  |   2 +-
 net/core/datagram.c    |  90 +++++++++++++++------------
 net/ipv4/udp.c         | 162 +++++++++++++++++++++++++++++++++++++++++++------
 net/ipv6/udp.c         |   3 +-
 8 files changed, 211 insertions(+), 69 deletions(-)

-- 
2.9.3

^ permalink raw reply

* [RFC PATCH 1/3] net/sock: factor out dequeue/peek with offset code
From: Paolo Abeni @ 2017-05-06 20:42 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet
In-Reply-To: <cover.1494103184.git.pabeni@redhat.com>

And update __sk_queue_drop_skb() to work on the specified queue.
This will help the udp protocol to use an additional private
rx queue in a later patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/skbuff.h |  7 ++++
 include/net/sock.h     |  4 +--
 net/core/datagram.c    | 90 ++++++++++++++++++++++++++++----------------------
 3 files changed, 60 insertions(+), 41 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a098d95..bfc7892 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3056,6 +3056,13 @@ static inline void skb_frag_list_init(struct sk_buff *skb)
 
 int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
 				const struct sk_buff *skb);
+struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
+					  struct sk_buff_head *queue,
+					  unsigned int flags,
+					  void (*destructor)(struct sock *sk,
+							   struct sk_buff *skb),
+					  int *peeked, int *off, int *err,
+					  struct sk_buff **last);
 struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
 					void (*destructor)(struct sock *sk,
 							   struct sk_buff *skb),
diff --git a/include/net/sock.h b/include/net/sock.h
index 66349e4..49d226f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2035,8 +2035,8 @@ void sk_reset_timer(struct sock *sk, struct timer_list *timer,
 
 void sk_stop_timer(struct sock *sk, struct timer_list *timer);
 
-int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
-			unsigned int flags,
+int __sk_queue_drop_skb(struct sock *sk, struct sk_buff_head *sk_queue,
+			struct sk_buff *skb, unsigned int flags,
 			void (*destructor)(struct sock *sk,
 					   struct sk_buff *skb));
 int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index db1866f2..a4592b4 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -161,6 +161,43 @@ static struct sk_buff *skb_set_peeked(struct sk_buff *skb)
 	return skb;
 }
 
+struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
+					  struct sk_buff_head *queue,
+					  unsigned int flags,
+					  void (*destructor)(struct sock *sk,
+							   struct sk_buff *skb),
+					  int *peeked, int *off, int *err,
+					  struct sk_buff **last)
+{
+	struct sk_buff *skb;
+
+	*last = queue->prev;
+	skb_queue_walk(queue, skb) {
+		if (flags & MSG_PEEK) {
+			if (*off >= skb->len && (skb->len || *off ||
+						 skb->peeked)) {
+				*off -= skb->len;
+				continue;
+			}
+			if (!skb->len) {
+				skb = skb_set_peeked(skb);
+				if (unlikely(IS_ERR(skb))) {
+					*err = PTR_ERR(skb);
+					return skb;
+				}
+			}
+			*peeked = 1;
+			atomic_inc(&skb->users);
+		} else {
+			__skb_unlink(skb, queue);
+			if (destructor)
+				destructor(sk, skb);
+		}
+		return skb;
+	}
+	return NULL;
+}
+
 /**
  *	__skb_try_recv_datagram - Receive a datagram skbuff
  *	@sk: socket
@@ -216,46 +253,20 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
 
 	*peeked = 0;
 	do {
+		int _off = *off;
+
 		/* Again only user level code calls this function, so nothing
 		 * interrupt level will suddenly eat the receive_queue.
 		 *
 		 * Look at current nfs client by the way...
 		 * However, this function was correct in any case. 8)
 		 */
-		int _off = *off;
-
-		*last = (struct sk_buff *)queue;
 		spin_lock_irqsave(&queue->lock, cpu_flags);
-		skb_queue_walk(queue, skb) {
-			*last = skb;
-			if (flags & MSG_PEEK) {
-				if (_off >= skb->len && (skb->len || _off ||
-							 skb->peeked)) {
-					_off -= skb->len;
-					continue;
-				}
-				if (!skb->len) {
-					skb = skb_set_peeked(skb);
-					if (IS_ERR(skb)) {
-						error = PTR_ERR(skb);
-						spin_unlock_irqrestore(&queue->lock,
-								       cpu_flags);
-						goto no_packet;
-					}
-				}
-				*peeked = 1;
-				atomic_inc(&skb->users);
-			} else {
-				__skb_unlink(skb, queue);
-				if (destructor)
-					destructor(sk, skb);
-			}
-			spin_unlock_irqrestore(&queue->lock, cpu_flags);
-			*off = _off;
-			return skb;
-		}
-
+		skb = __skb_try_recv_from_queue(sk, queue, flags, destructor,
+						peeked, &_off, err, last);
 		spin_unlock_irqrestore(&queue->lock, cpu_flags);
+		if (skb)
+			return skb;
 
 		if (!sk_can_busy_loop(sk))
 			break;
@@ -335,8 +346,8 @@ void __skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb, int len)
 }
 EXPORT_SYMBOL(__skb_free_datagram_locked);
 
-int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
-			unsigned int flags,
+int __sk_queue_drop_skb(struct sock *sk, struct sk_buff_head *sk_queue,
+			struct sk_buff *skb, unsigned int flags,
 			void (*destructor)(struct sock *sk,
 					   struct sk_buff *skb))
 {
@@ -344,15 +355,15 @@ int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
 
 	if (flags & MSG_PEEK) {
 		err = -ENOENT;
-		spin_lock_bh(&sk->sk_receive_queue.lock);
-		if (skb == skb_peek(&sk->sk_receive_queue)) {
-			__skb_unlink(skb, &sk->sk_receive_queue);
+		spin_lock_bh(&sk_queue->lock);
+		if (skb == skb_peek(sk_queue)) {
+			__skb_unlink(skb, sk_queue);
 			atomic_dec(&skb->users);
 			if (destructor)
 				destructor(sk, skb);
 			err = 0;
 		}
-		spin_unlock_bh(&sk->sk_receive_queue.lock);
+		spin_unlock_bh(&sk_queue->lock);
 	}
 
 	atomic_inc(&sk->sk_drops);
@@ -383,7 +394,8 @@ EXPORT_SYMBOL(__sk_queue_drop_skb);
 
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags)
 {
-	int err = __sk_queue_drop_skb(sk, skb, flags, NULL);
+	int err = __sk_queue_drop_skb(sk, &sk->sk_receive_queue, skb, flags,
+				      NULL);
 
 	kfree_skb(skb);
 	sk_mem_reclaim_partial(sk);
-- 
2.9.3

^ permalink raw reply related

* [RFC PATCH 2/3] udp: use a separate rx queue for packet reception
From: Paolo Abeni @ 2017-05-06 20:42 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet
In-Reply-To: <cover.1494103184.git.pabeni@redhat.com>

under udp flood the sk_receive_queue spinlock is heavily contended.
This patch try to reduce the contention on such lock adding a
second receive queue to the udp sockets; recvmsg() looks first
in such queue and, only if empty, tries to fetch the data from
sk_receive_queue. The latter is spliced into the newly added
queue every time the receive path has to acquire the
sk_receive_queue lock.

The accounting of forward allocated memory is still protected with
the sk_receive_queue lock, so udp_rmem_release() needs to acquire
both locks when the forward deficit is flushed.

On specific scenarios we can end up acquiring and releasing the
sk_receive_queue lock multiple times; that will be covered by
the next patch

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/udp.h   |   3 ++
 include/net/udp.h     |   9 +---
 include/net/udplite.h |   2 +-
 net/ipv4/udp.c        | 138 ++++++++++++++++++++++++++++++++++++++++++++------
 net/ipv6/udp.c        |   3 +-
 5 files changed, 131 insertions(+), 24 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 6cb4061..eaea63b 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -80,6 +80,9 @@ struct udp_sock {
 						struct sk_buff *skb,
 						int nhoff);
 
+	/* udp_recvmsg try to use this before splicing sk_receive_queue */
+	struct sk_buff_head	reader_queue ____cacheline_aligned_in_smp;
+
 	/* This field is dirtied by udp_recvmsg() */
 	int		forward_deficit;
 };
diff --git a/include/net/udp.h b/include/net/udp.h
index 3391dbd..1468dbd 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -249,13 +249,8 @@ void udp_destruct_sock(struct sock *sk);
 void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
 int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb);
 void udp_skb_destructor(struct sock *sk, struct sk_buff *skb);
-static inline struct sk_buff *
-__skb_recv_udp(struct sock *sk, unsigned int flags, int noblock, int *peeked,
-	       int *off, int *err)
-{
-	return __skb_recv_datagram(sk, flags | (noblock ? MSG_DONTWAIT : 0),
-				   udp_skb_destructor, peeked, off, err);
-}
+struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
+			       int noblock, int *peeked, int *off, int *err);
 static inline struct sk_buff *skb_recv_udp(struct sock *sk, unsigned int flags,
 					   int noblock, int *err)
 {
diff --git a/include/net/udplite.h b/include/net/udplite.h
index ea34052..b7a18f6 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -26,8 +26,8 @@ static __inline__ int udplite_getfrag(void *from, char *to, int  offset,
 /* Designate sk as UDP-Lite socket */
 static inline int udplite_sk_init(struct sock *sk)
 {
+	udp_init_sock(sk);
 	udp_sk(sk)->pcflag = UDPLITE_BIT;
-	sk->sk_destruct = udp_destruct_sock;
 	return 0;
 }
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ea6e4cf..492c76b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1167,19 +1167,24 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 static void udp_rmem_release(struct sock *sk, int size, int partial)
 {
 	struct udp_sock *up = udp_sk(sk);
+	struct sk_buff_head *sk_queue;
 	int amt;
 
 	if (likely(partial)) {
 		up->forward_deficit += size;
 		size = up->forward_deficit;
 		if (size < (sk->sk_rcvbuf >> 2) &&
-		    !skb_queue_empty(&sk->sk_receive_queue))
+		    !skb_queue_empty(&up->reader_queue))
 			return;
 	} else {
 		size += up->forward_deficit;
 	}
 	up->forward_deficit = 0;
 
+	/* acquire the sk_receive_queue for fwd allocated memory scheduling */
+	sk_queue = &sk->sk_receive_queue;
+	spin_lock(&sk_queue->lock);
+
 	sk->sk_forward_alloc += size;
 	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
 	sk->sk_forward_alloc -= amt;
@@ -1188,9 +1193,14 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
 		__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
 
 	atomic_sub(size, &sk->sk_rmem_alloc);
+
+	/* this can save us from acquiring the rx queue lock on next receive */
+	skb_queue_splice_tail_init(sk_queue, &up->reader_queue);
+
+	spin_unlock(&sk_queue->lock);
 }
 
-/* Note: called with sk_receive_queue.lock held.
+/* Note: called with reader_queue.lock held.
  * Instead of using skb->truesize here, find a copy of it in skb->dev_scratch
  * This avoids a cache line miss while receive_queue lock is held.
  * Look at __udp_enqueue_schedule_skb() to find where this copy is done.
@@ -1306,10 +1316,12 @@ EXPORT_SYMBOL_GPL(__udp_enqueue_schedule_skb);
 void udp_destruct_sock(struct sock *sk)
 {
 	/* reclaim completely the forward allocated memory */
+	struct udp_sock *up = udp_sk(sk);
 	unsigned int total = 0;
 	struct sk_buff *skb;
 
-	while ((skb = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+	skb_queue_splice_tail_init(&sk->sk_receive_queue, &up->reader_queue);
+	while ((skb = __skb_dequeue(&up->reader_queue)) != NULL) {
 		total += skb->truesize;
 		kfree_skb(skb);
 	}
@@ -1321,6 +1333,7 @@ EXPORT_SYMBOL_GPL(udp_destruct_sock);
 
 int udp_init_sock(struct sock *sk)
 {
+	skb_queue_head_init(&udp_sk(sk)->reader_queue);
 	sk->sk_destruct = udp_destruct_sock;
 	return 0;
 }
@@ -1338,6 +1351,26 @@ void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len)
 }
 EXPORT_SYMBOL_GPL(skb_consume_udp);
 
+static struct sk_buff *__first_packet_length(struct sock *sk,
+					     struct sk_buff_head *rcvq,
+					     int *total)
+{
+	struct sk_buff *skb;
+
+	while ((skb = skb_peek(rcvq)) != NULL &&
+	       udp_lib_checksum_complete(skb)) {
+		__UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS,
+				IS_UDPLITE(sk));
+		__UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS,
+				IS_UDPLITE(sk));
+		atomic_inc(&sk->sk_drops);
+		__skb_unlink(skb, rcvq);
+		*total += skb->truesize;
+		kfree_skb(skb);
+	}
+	return skb;
+}
+
 /**
  *	first_packet_length	- return length of first packet in receive queue
  *	@sk: socket
@@ -1347,22 +1380,20 @@ EXPORT_SYMBOL_GPL(skb_consume_udp);
  */
 static int first_packet_length(struct sock *sk)
 {
-	struct sk_buff_head *rcvq = &sk->sk_receive_queue;
+	struct sk_buff_head *rcvq = &udp_sk(sk)->reader_queue;
+	struct sk_buff_head *sk_queue = &sk->sk_receive_queue;
 	struct sk_buff *skb;
 	int total = 0;
 	int res;
 
 	spin_lock_bh(&rcvq->lock);
-	while ((skb = skb_peek(rcvq)) != NULL &&
-		udp_lib_checksum_complete(skb)) {
-		__UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS,
-				IS_UDPLITE(sk));
-		__UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS,
-				IS_UDPLITE(sk));
-		atomic_inc(&sk->sk_drops);
-		__skb_unlink(skb, rcvq);
-		total += skb->truesize;
-		kfree_skb(skb);
+	skb = __first_packet_length(sk, rcvq, &total);
+	if (!skb && !skb_queue_empty(sk_queue)) {
+		spin_lock(&sk_queue->lock);
+		skb_queue_splice_tail_init(sk_queue, rcvq);
+		spin_unlock(&sk_queue->lock);
+
+		skb = __first_packet_length(sk, rcvq, &total);
 	}
 	res = skb ? skb->len : -1;
 	if (total)
@@ -1400,6 +1431,79 @@ int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 }
 EXPORT_SYMBOL(udp_ioctl);
 
+struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
+			       int noblock, int *peeked, int *off, int *err)
+{
+	struct sk_buff_head *sk_queue = &sk->sk_receive_queue;
+	struct sk_buff_head *queue;
+	struct sk_buff *last;
+	long timeo;
+	int error;
+
+	queue = &udp_sk(sk)->reader_queue;
+	flags |= noblock ? MSG_DONTWAIT : 0;
+	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+	do {
+		struct sk_buff *skb;
+
+		error = sock_error(sk);
+		if (error)
+			break;
+
+		error = -EAGAIN;
+		*peeked = 0;
+		do {
+			int _off = *off;
+
+			spin_lock_bh(&queue->lock);
+			skb = __skb_try_recv_from_queue(sk, queue, flags,
+							udp_skb_destructor,
+							peeked, &_off, err,
+							&last);
+			if (skb) {
+				spin_unlock_bh(&queue->lock);
+				*off = _off;
+				return skb;
+			}
+
+			if (skb_queue_empty(sk_queue)) {
+				spin_unlock_bh(&queue->lock);
+				goto busy_check;
+			}
+
+			/* refill the reader queue and walk it again */
+			_off = *off;
+			spin_lock(&sk_queue->lock);
+			skb_queue_splice_tail_init(sk_queue, queue);
+			spin_unlock(&sk_queue->lock);
+
+			skb = __skb_try_recv_from_queue(sk, queue, flags,
+							udp_skb_destructor,
+							peeked, &_off, err,
+							&last);
+			spin_unlock_bh(&queue->lock);
+			if (skb) {
+				*off = _off;
+				return skb;
+			}
+
+busy_check:
+			if (!sk_can_busy_loop(sk))
+				break;
+
+			sk_busy_loop(sk, flags & MSG_DONTWAIT);
+		} while (!skb_queue_empty(sk_queue));
+
+		/* sk_queue is empty, reader_queue may contain peeked packets */
+	} while (timeo &&
+		 !__skb_wait_for_more_packets(sk, &error, &timeo,
+					      (struct sk_buff *)sk_queue));
+
+	*err = error;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(__skb_recv_udp);
+
 /*
  * 	This should be easy, if there is something there we
  * 	return it, otherwise we block.
@@ -1490,7 +1594,8 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 	return err;
 
 csum_copy_err:
-	if (!__sk_queue_drop_skb(sk, skb, flags, udp_skb_destructor)) {
+	if (!__sk_queue_drop_skb(sk, &udp_sk(sk)->reader_queue, skb, flags,
+				 udp_skb_destructor)) {
 		UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
 		UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
 	}
@@ -2325,6 +2430,9 @@ unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait)
 	unsigned int mask = datagram_poll(file, sock, wait);
 	struct sock *sk = sock->sk;
 
+	if (!skb_queue_empty(&udp_sk(sk)->reader_queue))
+		mask |= POLLIN | POLLRDNORM;
+
 	sock_rps_record_flow(sk);
 
 	/* Check for false positives due to checksum errors */
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 04862ab..f78fdf8 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -455,7 +455,8 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	return err;
 
 csum_copy_err:
-	if (!__sk_queue_drop_skb(sk, skb, flags, udp_skb_destructor)) {
+	if (!__sk_queue_drop_skb(sk, &udp_sk(sk)->reader_queue, skb, flags,
+				 udp_skb_destructor)) {
 		if (is_udp4) {
 			UDP_INC_STATS(sock_net(sk),
 				      UDP_MIB_CSUMERRORS, is_udplite);
-- 
2.9.3

^ permalink raw reply related

* [RFC PATCH 3/3] udp: keep the sk_receive_queue held when splicing
From: Paolo Abeni @ 2017-05-06 20:42 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet
In-Reply-To: <cover.1494103184.git.pabeni@redhat.com>

On packet reception, when we are forced to splice the
sk_receive_queue, we can keep the related lock held, so
that we can avoid re-acquiring it, if fwd memory
scheduling is required.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/udp.c | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 492c76b..d698973 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1164,7 +1164,8 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 }
 
 /* fully reclaim rmem/fwd memory allocated for skb */
-static void udp_rmem_release(struct sock *sk, int size, int partial)
+static void udp_rmem_release(struct sock *sk, int size, int partial,
+			     int rx_queue_lock_held)
 {
 	struct udp_sock *up = udp_sk(sk);
 	struct sk_buff_head *sk_queue;
@@ -1181,9 +1182,13 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
 	}
 	up->forward_deficit = 0;
 
-	/* acquire the sk_receive_queue for fwd allocated memory scheduling */
+	/* acquire the sk_receive_queue for fwd allocated memory scheduling,
+	 * if the called don't held it already
+	 */
 	sk_queue = &sk->sk_receive_queue;
-	spin_lock(&sk_queue->lock);
+	if (!rx_queue_lock_held)
+		spin_lock(&sk_queue->lock);
+
 
 	sk->sk_forward_alloc += size;
 	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
@@ -1197,7 +1202,8 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
 	/* this can save us from acquiring the rx queue lock on next receive */
 	skb_queue_splice_tail_init(sk_queue, &up->reader_queue);
 
-	spin_unlock(&sk_queue->lock);
+	if (!rx_queue_lock_held)
+		spin_unlock(&sk_queue->lock);
 }
 
 /* Note: called with reader_queue.lock held.
@@ -1207,10 +1213,16 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
  */
 void udp_skb_destructor(struct sock *sk, struct sk_buff *skb)
 {
-	udp_rmem_release(sk, skb->dev_scratch, 1);
+	udp_rmem_release(sk, skb->dev_scratch, 1, 0);
 }
 EXPORT_SYMBOL(udp_skb_destructor);
 
+/* as above, but the caller held the rx queue lock, too */
+void udp_skb_dtor_locked(struct sock *sk, struct sk_buff *skb)
+{
+	udp_rmem_release(sk, skb->dev_scratch, 1, 1);
+}
+
 /* Idea of busylocks is to let producers grab an extra spinlock
  * to relieve pressure on the receive_queue spinlock shared by consumer.
  * Under flood, this means that only one producer can be in line
@@ -1325,7 +1337,7 @@ void udp_destruct_sock(struct sock *sk)
 		total += skb->truesize;
 		kfree_skb(skb);
 	}
-	udp_rmem_release(sk, total, 0);
+	udp_rmem_release(sk, total, 0, 1);
 
 	inet_sock_destruct(sk);
 }
@@ -1397,7 +1409,7 @@ static int first_packet_length(struct sock *sk)
 	}
 	res = skb ? skb->len : -1;
 	if (total)
-		udp_rmem_release(sk, total, 1);
+		udp_rmem_release(sk, total, 1, 0);
 	spin_unlock_bh(&rcvq->lock);
 	return res;
 }
@@ -1471,16 +1483,20 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
 				goto busy_check;
 			}
 
-			/* refill the reader queue and walk it again */
+			/* refill the reader queue and walk it again
+			 * keep both queues locked to avoid re-acquiring
+			 * the sk_receive_queue lock if fwd memory scheduling
+			 * is needed.
+			 */
 			_off = *off;
 			spin_lock(&sk_queue->lock);
 			skb_queue_splice_tail_init(sk_queue, queue);
-			spin_unlock(&sk_queue->lock);
 
 			skb = __skb_try_recv_from_queue(sk, queue, flags,
-							udp_skb_destructor,
+							udp_skb_dtor_locked,
 							peeked, &_off, err,
 							&last);
+			spin_unlock(&sk_queue->lock);
 			spin_unlock_bh(&queue->lock);
 			if (skb) {
 				*off = _off;
-- 
2.9.3

^ permalink raw reply related

* Re: arch: arm: bpf: Converting cBPF to eBPF for arm 32 bit
From: Shubham Bansal @ 2017-05-06 22:17 UTC (permalink / raw)
  To: David Miller
  Cc: Daniel Borkmann, Kees Cook, Mircea Gherzan, Network Development,
	kernel-hardening, linux-arm-kernel, ast
In-Reply-To: <CAHgaXd+2vF=d5zpWTg+a=vSnv2ZA+8LCR_MXT_MDKn4r2s8sig@mail.gmail.com>

Okay. My mistake. I just checked the verify function.

Apologies.
Best,
Shubham Bansal


On Sun, May 7, 2017 at 1:57 AM, Shubham Bansal
<illusionist.neo@gmail.com> wrote:
> Thanks David.
>
> Hi all,
>
> I have two questions about the code at arch/arm64/net/bpf_jit_comp.c.
>
> 1. At line 708, " const u8 r1 = bpf2a64[BPF_REG_1]; /* r1: struct
> sk_buff *skb */ ".
>     Why is this code using BPF_REG_1 before saving it? As far as I
> know, BPF_REG_1 has pointer to bpf program context and this code
> clearly is overwriting that pointer which makes that pointer useless
> for future usage. It clearly looks like a bug.
>
> 2. At line 256, " emit(A64_LDR64(prg, tmp, r3), ctx); ".
>     This line of code is used to load an array( of pointers ) element,
> where r3 is used as an index of that array. Shouldn't it be be
> arithmetic left shifted by 3 or multiplied by 8 to get the right
> address in that array of pointers ?
>
> Apologies if any of the above question is stupid to ask.
>
> Best,
> Shubham
> Best,
> Shubham Bansal
>
>
> On Sun, May 7, 2017 at 12:08 AM, David Miller <davem@davemloft.net> wrote:
>> From: Shubham Bansal <illusionist.neo@gmail.com>
>> Date: Sat, 6 May 2017 22:18:16 +0530
>>
>>> Hi Daniel,
>>>
>>> Thanks for the last reply about the testing of eBPF JIT.
>>>
>>> I have one issue though, I am not able to find what BPF_ABS and
>>> BPF_IND instruction does exactly.
>>
>> They are not instructions, they are modifiers for the BPF_LD
>> instruction which indicate an SKB load is to be performed.
>>
>> You never need to ask what a BPF instruction does, it is clear
>> defined in the BPF interperter found in kernel/bpf/core.c
>>
>> Look for the case statement LD_ABS_W and friends in __bpf_prog_run().

^ permalink raw reply

* Re: [RFC PATCH 0/3] udp: scalability improvements
From: Tom Herbert @ 2017-05-06 23:09 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Linux Kernel Network Developers, David S. Miller, Eric Dumazet
In-Reply-To: <cover.1494103184.git.pabeni@redhat.com>

On Sat, May 6, 2017 at 1:42 PM, Paolo Abeni <pabeni@redhat.com> wrote:
> This patch series implement an idea suggested by Eric Dumazet to
> reduce the contention of the udp sk_receive_queue lock when the socket is
> under flood.
>
> An ancillary queue is added to the udp socket, and the socket always
> tries first to read packets from such queue. If it's empty, we splice
> the content from sk_receive_queue into the ancillary queue.
>
> The first patch introduces some helpers to keep the udp code small, and the
> following two implement the ancillary queue strategy. The code is split
> to hopefully help the reviewing process.
>
> The measured overall gain under udp flood is in the 20-35% range depending on
> the numa layout and the number of ingress queue used by the relevant nic.
>
Certainly sounds good, but can you give real reproducible performance
numbers including the test that was run?

Tom

> On a single numa node host, the peak tput is now reached when the traffic
> targeting the udp socket uses multiple nic rx queues, while on current net-next
> the tput always decreases when moving from a single rx queue to multiple ones.
>
>
> Paolo Abeni (3):
>   net/sock: factor out dequeue/peek with offset code
>   udp: use a separate rx queue for packet reception
>   udp: keep the sk_receive_queue held when splicing
>
>  include/linux/skbuff.h |   7 +++
>  include/linux/udp.h    |   3 +
>  include/net/sock.h     |   4 +-
>  include/net/udp.h      |   9 +--
>  include/net/udplite.h  |   2 +-
>  net/core/datagram.c    |  90 +++++++++++++++------------
>  net/ipv4/udp.c         | 162 +++++++++++++++++++++++++++++++++++++++++++------
>  net/ipv6/udp.c         |   3 +-
>  8 files changed, 211 insertions(+), 69 deletions(-)
>
> --
> 2.9.3
>

^ permalink raw reply

* Re: [RFC iproute2 0/8] RDMA tool
From: Leon Romanovsky @ 2017-05-07  6:33 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jiri Benc, Stephen Hemminger, Doug Ledford, Jiri Pirko,
	Ariel Almog, Dennis Dalessandro, Ram Amrani, Bart Van Assche,
	Sagi Grimberg, Jason Gunthorpe, Christoph Hellwig, Or Gerlitz,
	Linux RDMA, Linux Netdev
In-Reply-To: <20170506104826.GD2017@nanopsycho>

[-- Attachment #1: Type: text/plain, Size: 2162 bytes --]

On Sat, May 06, 2017 at 12:48:26PM +0200, Jiri Pirko wrote:
> Fri, May 05, 2017 at 03:17:54PM CEST, leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org wrote:
> >On Fri, May 05, 2017 at 08:54:57AM +0200, Jiri Benc wrote:
> >> On Thu,  4 May 2017 21:02:08 +0300, Leon Romanovsky wrote:
> >> > In order to close object model, ensure reuse of existing code and make this
> >> > tool usable from day one, we decided to implement wrappers over legacy sysfs
> >> > prior to implementing netlink functionality. As a nice bonus, it will allow
> >> > to use this tool with old kernels too.
> >>
> >> This sounds wrong. We don't support legacy ioctl interface for the 'ip'
> >> command, either. I think rdma should be converted to netlink first and
> >> the new tool should only use netlink.
> >
> >RDMA in slightly different situation than "ip" tool was. "ip" was implemented
> >when tools like ifconfig existed. It allowed to old and new systems to be
> >configured to some degree. In RDMA community, there are no similar tools like
> >"ifconfig". Implementation in netlink-only interface will leave old systems without
> >common tool at all.
> >
> >As an upstream-oriented person, I personally fine with that, but anyway would
> >like to get wider agreement/disagreement on that, before removing sysfs
> >parsing logic from the rdmatool.
>
> I tend to agree with Jiri Benc. I fear that supporting sysfs + netlink
> api later on for the same things will make the code unnecessary complex.
> Also, the legacy sysfs will most likely stay there forever so there will
> be no actual motivation to port the existing things to the new netlink
> api.
>
> For the prototyping purposes, I belive that what you did makes perfect
> sense. But for the actual mergable version, my feeling is that we need
> to strictly stick with new netlink rdma interface and just forget about
> the old sysfs one. Distros would have to backport the new kernel
> rdma netlink api.

Thanks,
It looks like that most of the comments are in favor of netlink-only
solution.

>
> Yes, this will be little bit more painful at the beginning, but in the
> long run, I believe it will save some severe headaches.
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 3/5] dsa: add DSA switch driver for Microchip KSZ9477
From: Florian Fainelli @ 2017-05-07 21:37 UTC (permalink / raw)
  To: Woojung.Huh, andrew, vivien.didelot; +Cc: netdev, davem, UNGLinuxDriver
In-Reply-To: <9235D6609DB808459E95D78E17F2E43D40A5C001@CHN-SV-EXMX02.mchp-main.com>

Hi Woojung,

Le 05/05/17 à 16:18, Woojung.Huh@microchip.com a écrit :
> From: Woojung Huh <Woojung.Huh@microchip.com>
> 
> The KSZ9477 is a fully integrated layer 2, managed, 7 ports GigE switch
> with numerous advanced features. 5 ports incorporate
> 10/100/1000 Mbps PHYs. The other 2 ports have interfaces that can be
> configured as SGMII, RGMII, MII or RMII.
> Either of these may connect directly to a host processor or to
> an external PHY. The SGMII port may interface to a fiber optic transceiver.
> 
> This driver currently supports vlan, fdb, mdb & mirror dsa switch operations.
> 
> Signed-off-by: Woojung Huh <Woojung.Huh@microchip.com>
> ---

Overall, this looks really nice and clean, there are a few minor nits,
most of them being that busy polling loop have no timeout to make sure
they terminate in a finite amount of time.

For VLAN programming, can you describe how the CPU port works? Is it
similar to B53 where we explictly need to have the CPU port be part of a
VLAN entry (including tagged/untagged attribute) or is it similar to
Marvell where it seems to be smarter and requires little to no
configuration?

> +
> +static u64 mib_value[TOTAL_SWITCH_COUNTER_NUM];

This probably needs to be driver instance specific and in the ksz_device
structure?

> +
> +static void ksz_cfg(struct ksz_device *dev, u32 addr, u8 bits, int set)
> +{

bool set?

> +	u8 data;
> +
> +	ksz_read8(dev, addr, &data);
> +	if (set)
> +		data |= bits;
> +	else
> +		data &= ~bits;
> +	ksz_write8(dev, addr, data);
> +}
> +
> +static void ksz_cfg32(struct ksz_device *dev, u32 addr, u32 bits, int set)
> +{

Same thing

> +	u32 data;
> +
> +	ksz_read32(dev, addr, &data);
> +	if (set)
> +		data |= bits;
> +	else
> +		data &= ~bits;
> +	ksz_write32(dev, addr, data);
> +}
> +
> +static void ksz_port_cfg(struct ksz_device *dev, int port, int offset, u8 bits,
> +			 int set)
> +{
> +	u32 addr;
> +	u8 data;
> +
> +	addr = PORT_CTRL_ADDR(port, offset);
> +	ksz_read8(dev, addr, &data);
> +
> +	if (set)
> +		data |= bits;
> +	else
> +		data &= ~bits;
> +
> +	ksz_write8(dev, addr, data);
> +}
> +
> +static void ksz_port_cfg32(struct ksz_device *dev, int port, int offset,
> +			   u32 bits, int set)
> +{
> +	u32 addr;
> +	u32 data;
> +
> +	addr = PORT_CTRL_ADDR(port, offset);
> +	ksz_read32(dev, addr, &data);
> +
> +	if (set)
> +		data |= bits;
> +	else
> +		data &= ~bits;
> +
> +	ksz_write32(dev, addr, data);
> +}
> +
> +static void get_vlan_table(struct dsa_switch *ds, u16 vid, u32 *vlan_table)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	u8 data;
> +
> +	ksz_write16(dev, REG_SW_VLAN_ENTRY_INDEX__2, vid & VLAN_INDEX_M);
> +	ksz_write8(dev, REG_SW_VLAN_CTRL, VLAN_READ | VLAN_START);
> +	/* wait to be cleared */
> +	data = 0;
> +	do {
> +		ksz_read8(dev, REG_SW_VLAN_CTRL, &data);
> +	} while (data & VLAN_START);

You need to put an upper boundary on how this can take, potentially
spinning forever is not very safe.

> +
> +	ksz_read32(dev, REG_SW_VLAN_ENTRY__4, &vlan_table[0]);
> +	ksz_read32(dev, REG_SW_VLAN_ENTRY_UNTAG__4, &vlan_table[1]);
> +	ksz_read32(dev, REG_SW_VLAN_ENTRY_PORTS__4, &vlan_table[2]);
> +
> +	ksz_write8(dev, REG_SW_VLAN_CTRL, 0);
> +}
> +
> +static void set_vlan_table(struct dsa_switch *ds, u16 vid, u32 *vlan_table)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	u8 data;
> +
> +	ksz_write32(dev, REG_SW_VLAN_ENTRY__4, vlan_table[0]);
> +	ksz_write32(dev, REG_SW_VLAN_ENTRY_UNTAG__4, vlan_table[1]);
> +	ksz_write32(dev, REG_SW_VLAN_ENTRY_PORTS__4, vlan_table[2]);
> +
> +	ksz_write16(dev, REG_SW_VLAN_ENTRY_INDEX__2, vid & VLAN_INDEX_M);
> +	ksz_write8(dev, REG_SW_VLAN_CTRL, VLAN_START | VLAN_WRITE);
> +
> +	do {
> +		ksz_read8(dev, REG_SW_VLAN_CTRL, &data);
> +	} while (data & VLAN_START);

Same here.

> +
> +	ksz_write8(dev, REG_SW_VLAN_CTRL, 0);
> +
> +	/* update vlan cache table */
> +	dev->vlan_cache[vid].table[0] = vlan_table[0];
> +	dev->vlan_cache[vid].table[1] = vlan_table[1];
> +	dev->vlan_cache[vid].table[2] = vlan_table[2];
> +}
> +
> +static void read_table(struct dsa_switch *ds, u32 *table)
> +{
> +	struct ksz_device *dev = ds->priv;
> +
> +	ksz_read32(dev, REG_SW_ALU_VAL_A, &table[0]);
> +	ksz_read32(dev, REG_SW_ALU_VAL_B, &table[1]);
> +	ksz_read32(dev, REG_SW_ALU_VAL_C, &table[2]);
> +	ksz_read32(dev, REG_SW_ALU_VAL_D, &table[3]);
> +}
> +
> +static void write_table(struct dsa_switch *ds, u32 *table)
> +{
> +	struct ksz_device *dev = ds->priv;
> +
> +	ksz_write32(dev, REG_SW_ALU_VAL_A, table[0]);
> +	ksz_write32(dev, REG_SW_ALU_VAL_B, table[1]);
> +	ksz_write32(dev, REG_SW_ALU_VAL_C, table[2]);
> +	ksz_write32(dev, REG_SW_ALU_VAL_D, table[3]);
> +}
> +
> +static int ksz_reset_switch(struct dsa_switch *ds)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	u8 data8;
> +	u16 data16;
> +	u32 data32;
> +
> +	/* reset switch */
> +	ksz_cfg(dev, REG_SW_OPERATION, SW_RESET, true);
> +
> +	/* turn off SPI DO Edge select */
> +	ksz_read8(dev, REG_SW_GLOBAL_SERIAL_CTRL_0, &data8);
> +	data8 &= ~SPI_AUTO_EDGE_DETECTION;
> +	ksz_write8(dev, REG_SW_GLOBAL_SERIAL_CTRL_0, data8);
> +
> +	/* default configuration */
> +	ksz_read8(dev, REG_SW_LUE_CTRL_1, &data8);
> +	data8 = SW_AGING_ENABLE | SW_LINK_AUTO_AGING |
> +	      SW_SRC_ADDR_FILTER | SW_FLUSH_STP_TABLE | SW_FLUSH_MSTP_TABLE;
> +	ksz_write8(dev, REG_SW_LUE_CTRL_1, data8);
> +
> +	/* disable interrupts */
> +	ksz_write32(dev, REG_SW_INT_MASK__4, SWITCH_INT_MASK);
> +	ksz_write32(dev, REG_SW_PORT_INT_MASK__4, 0x7F);
> +	ksz_read32(dev, REG_SW_PORT_INT_STATUS__4, &data32);
> +
> +	/* set broadcast storm protection 10% rate */
> +	ksz_read16(dev, REG_SW_MAC_CTRL_2, &data16);
> +	data16 &= ~BROADCAST_STORM_RATE;
> +	data16 |= (BROADCAST_STORM_VALUE * BROADCAST_STORM_PROT_RATE) / 100;
> +	ksz_write16(dev, REG_SW_MAC_CTRL_2, data16);
> +
> +	memset(dev->vlan_cache, 0, sizeof(*dev->vlan_cache) * dev->num_vlans);
> +
> +	return 0;
> +}
> +
> +static int ksz_setup_ports(struct dsa_switch *ds)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	u8 data8;
> +	u16 data16;
> +	int i;
> +
> +	ds->num_ports = dev->port_cnt;
> +
> +	for (i = 0; i < ds->num_ports; i++) {
> +		if (dsa_is_cpu_port(ds, i)) {
> +			dev->cpu_port = i;
> +			/* enable tag tail for host port */
> +			ksz_port_cfg(dev, i, REG_PORT_CTRL_0,
> +				     PORT_TAIL_TAG_ENABLE, true);
> +		}

With the exception of this configuration above, which is specific to the
CPU port

> +
> +		ksz_port_cfg(dev, i, REG_PORT_MAC_CTRL_1,
> +			     PORT_BACK_PRESSURE, true);
> +		ksz_port_cfg(dev, i, REG_PORT_CTRL_0,
> +			     PORT_FORCE_TX_FLOW_CTRL | PORT_FORCE_RX_FLOW_CTRL,
> +			     false);
> +
> +		/* configure MAC to 1G & RGMII mode */
> +		ksz_pread8(dev, i, REG_PORT_XMII_CTRL_1, &data8);
> +		data8 |= PORT_RGMII_ID_EG_ENABLE;
> +		data8 &= ~PORT_MII_NOT_1GBIT;
> +		data8 &= ~PORT_MII_SEL_M;
> +		data8 |= PORT_RGMII_SEL;
> +		ksz_pwrite8(dev, i, REG_PORT_XMII_CTRL_1, data8);
> +
> +		/* clear pending interrupts */
> +		ksz_pread16(dev, i, REG_PORT_PHY_INT_ENABLE, &data16);

All of this configuration should probably be moved to the port_enable
method.

NB: in the original DSA drivers, there was no per-port setup function,
so we should think about migrating existing drivers to per-port
port_enable instead of keeping this here.

> +	}
> +
> +	return 0;
> +}
> +
> +static int ksz_setup(struct dsa_switch *ds)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	int ret = 0;
> +
> +	dev->vlan_cache = devm_kmalloc_array(dev->dev,
> +					     sizeof(struct vlan_table),
> +					     dev->num_vlans, GFP_KERNEL);
> +	if (!dev->vlan_cache)
> +		return -ENOMEM;
> +
> +	ret = ksz_reset_switch(ds);
> +	if (ret) {
> +		dev_err(ds->dev, "failed to reset switch\n");
> +		return ret;
> +	}
> +
> +	/* accept packet up to 2000bytes */
> +	ksz_cfg(dev, REG_SW_MAC_CTRL_1, SW_LEGAL_PACKET_DISABLE, true);
> +
> +	ksz_setup_ports(ds);
> +
> +	ksz_cfg(dev, REG_SW_MAC_CTRL_1, MULTICAST_STORM_DISABLE, true);
> +
> +	/* queue based egress rate limit */
> +	ksz_cfg(dev, REG_SW_MAC_CTRL_5, SW_OUT_RATE_LIMIT_QUEUE_BASED, true);
> +
> +	/* start switch */
> +	ksz_cfg(dev, REG_SW_OPERATION, SW_START, true);
> +
> +	return 0;
> +}
> +
> +static enum dsa_tag_protocol ksz_get_tag_protocol(struct dsa_switch *ds)
> +{
> +	return DSA_TAG_PROTO_KSZ;
> +}
> +
> +static int ksz_phy_read16(struct dsa_switch *ds, int addr, int reg)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	u16 val = 0;
> +
> +	ksz_pread16(dev, addr, 0x100 + (reg << 1), &val);
> +
> +	return (int)val;

The cast is probably not necessary here.

> +}
> +
> +static int ksz_phy_write16(struct dsa_switch *ds, int addr, int reg, u16 val)
> +{
> +	struct ksz_device *dev = ds->priv;
> +
> +	ksz_pwrite16(dev, addr, 0x100 + (reg << 1), val);
> +
> +	return 0;
> +}
> +
> +static int ksz_enable_port(struct dsa_switch *ds, int port,
> +			   struct phy_device *phy)
> +{
> +	struct ksz_device *dev = ds->priv;
> +
> +	ksz_port_cfg(dev, port, REG_PORT_CTRL_0, PORT_MAC_LOOPBACK, false);
> +
> +	/* set back pressure */
> +	ksz_port_cfg(dev, port, REG_PORT_MAC_CTRL_1, PORT_BACK_PRESSURE, true);
> +
> +	/* set flow control */
> +	ksz_port_cfg(dev, port, REG_PORT_CTRL_0,
> +		     PORT_FORCE_TX_FLOW_CTRL | PORT_FORCE_RX_FLOW_CTRL, true);
> +
> +	/* enable boradcast storm limit */
> +	ksz_port_cfg(dev, port, P_BCAST_STORM_CTRL, PORT_BROADCAST_STORM, true);
> +
> +	/* disable DiffServ priority */
> +	ksz_port_cfg(dev, port, P_PRIO_CTRL, PORT_DIFFSERV_PRIO_ENABLE, false);
> +
> +	/* replace priority */
> +	ksz_port_cfg(dev, port, REG_PORT_MRI_MAC_CTRL,
> +		     PORT_USER_PRIO_CEILING, false);
> +	ksz_port_cfg32(dev, port, REG_PORT_MTI_QUEUE_CTRL_0__4,
> +		       MTI_PVID_REPLACE, false);
> +
> +	/* enable 802.1p priority */
> +	ksz_port_cfg(dev, port, P_PRIO_CTRL, PORT_802_1P_PRIO_ENABLE, true);
> +
> +	return 0;
> +}
> +
> +static void ksz_disable_port(struct dsa_switch *ds, int port,
> +			     struct phy_device *phy)
> +{
> +	struct ksz_device *dev = ds->priv;
> +
> +	/* there is no port disable */
> +	ksz_port_cfg(dev, port, REG_PORT_CTRL_0, PORT_MAC_LOOPBACK, true);
> +}
> +
> +static int ksz_sset_count(struct dsa_switch *ds)
> +{
> +	return TOTAL_SWITCH_COUNTER_NUM;
> +}
> +
> +static void ksz_get_strings(struct dsa_switch *ds, int port, uint8_t *buf)
> +{
> +	int i;
> +
> +	for (i = 0; i < TOTAL_SWITCH_COUNTER_NUM; i++) {
> +		memcpy(buf + (i * ETH_GSTRING_LEN), mib_names[i].string,
> +		       ETH_GSTRING_LEN);

Parenthesis around i * ETH_GSTRINGS_LEN should not be needed.

> +	}
> +}
> +
> +static void ksz_get_ethtool_stats(struct dsa_switch *ds, int port,
> +				  uint64_t *buf)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	int i;
> +	u32 data;
> +
> +	mutex_lock(&dev->stats_mutex);
> +
> +	for (i = 0; i < TOTAL_SWITCH_COUNTER_NUM; i++) {
> +		data = MIB_COUNTER_READ;
> +		data |= ((mib_names[i].index & 0xFF) << MIB_COUNTER_INDEX_S);
> +		ksz_pwrite32(dev, port, REG_PORT_MIB_CTRL_STAT__4, data);
> +
> +		do {
> +			ksz_pread32(dev, port, REG_PORT_MIB_CTRL_STAT__4,
> +				    &data);
> +			usleep_range(10, 50);
> +		} while (data & MIB_COUNTER_READ);

Same thing here we need to bound this with a timeout count of some sort.

> +
> +		/* count resets upon read */
> +		ksz_pread32(dev, port, REG_PORT_MIB_DATA, &data);
> +
> +		mib_value[i] += (uint64_t)data;
> +		buf[i] = mib_value[i];
> +	}
> +
> +	mutex_unlock(&dev->stats_mutex);
> +}
> +
> +static void ksz_port_stp_state_set(struct dsa_switch *ds, int port, u8 state)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	u8 data;
> +
> +	ksz_pread8(dev, port, P_STP_CTRL, &data);
> +	switch (state) {
> +	case BR_STATE_DISABLED:
> +		data &= ~(PORT_TX_ENABLE | PORT_RX_ENABLE);
> +		data |= PORT_LEARN_DISABLE;
> +		break;
> +	case BR_STATE_LISTENING:
> +		data &= ~PORT_TX_ENABLE;
> +		data |= (PORT_RX_ENABLE | PORT_LEARN_DISABLE);
> +		break;
> +	case BR_STATE_LEARNING:
> +		data &= ~(PORT_TX_ENABLE | PORT_LEARN_DISABLE);
> +		data |= PORT_RX_ENABLE;
> +		break;
> +	case BR_STATE_FORWARDING:
> +		data &= ~PORT_LEARN_DISABLE;
> +		data |= (PORT_TX_ENABLE | PORT_RX_ENABLE);
> +		break;
> +	case BR_STATE_BLOCKING:
> +		data &= ~(PORT_TX_ENABLE | PORT_RX_ENABLE);
> +		data |= PORT_LEARN_DISABLE;
> +		break;
> +	default:
> +		dev_err(ds->dev, "invalid STP state: %d\n", state);
> +		return;
> +	}

It would be clearer to move the masking of all bits outside of the
switch()/case statement, have the switch()/case just assign values, and
do the logial oring at the end of this statement.

> +
> +	ksz_pwrite8(dev, port, P_STP_CTRL, data);
> +}
> +
> +static void ksz_port_fast_age(struct dsa_switch *ds, int port)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	u8 data8;
> +
> +	ksz_read8(dev, REG_SW_LUE_CTRL_1, &data8);
> +	data8 |= SW_FAST_AGING;
> +	ksz_write8(dev, REG_SW_LUE_CTRL_1, data8);
> +
> +	data8 &= ~SW_FAST_AGING;
> +	ksz_write8(dev, REG_SW_LUE_CTRL_1, data8);
> +}
> +
> +static int ksz_port_vlan_filtering(struct dsa_switch *ds, int port, bool flag)
> +{
> +	struct ksz_device *dev = ds->priv;
> +
> +	if (flag) {
> +		ksz_port_cfg(dev, port, REG_PORT_LUE_CTRL,
> +			     PORT_VLAN_LOOKUP_VID_0, true);
> +		ksz_cfg32(dev, REG_SW_QM_CTRL__4, UNICAST_VLAN_BOUNDARY, true);
> +		ksz_cfg(dev, REG_SW_LUE_CTRL_0, SW_VLAN_ENABLE, true);
> +	} else {
> +		ksz_cfg(dev, REG_SW_LUE_CTRL_0, SW_VLAN_ENABLE, false);
> +		ksz_cfg32(dev, REG_SW_QM_CTRL__4, UNICAST_VLAN_BOUNDARY, false);
> +		ksz_port_cfg(dev, port, REG_PORT_LUE_CTRL,
> +			     PORT_VLAN_LOOKUP_VID_0, false);
> +	}
> +
> +	return 0;
> +}
> +
> +static int ksz_port_vlan_prepare(struct dsa_switch *ds, int port,
> +				 const struct switchdev_obj_port_vlan *vlan,
> +				 struct switchdev_trans *trans)
> +{
> +	/* nothing needed */
> +
> +	return 0;
> +}
> +
> +static void ksz_port_vlan_add(struct dsa_switch *ds, int port,
> +			      const struct switchdev_obj_port_vlan *vlan,
> +			      struct switchdev_trans *trans)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	u32 vlan_table[3];
> +	u16 vid;
> +	bool untagged = vlan->flags & BRIDGE_VLAN_INFO_UNTAGGED;
> +
> +	for (vid = vlan->vid_begin; vid <= vlan->vid_end; vid++) {
> +		get_vlan_table(ds, vid, vlan_table);
> +
> +		vlan_table[0] = VLAN_VALID | (vid & VLAN_FID_M);
> +		if (untagged)
> +			vlan_table[1] |= BIT(port);
> +		else
> +			vlan_table[1] &= ~BIT(port);
> +		vlan_table[1] &= ~(BIT(dev->cpu_port));
> +
> +		vlan_table[2] |= BIT(port) | BIT(dev->cpu_port);
> +
> +		set_vlan_table(ds, vid, vlan_table);
> +
> +		/* change PVID */
> +		if (vlan->flags & BRIDGE_VLAN_INFO_PVID)
> +			ksz_pwrite16(dev, port, REG_PORT_DEFAULT_VID, vid);
> +	}
> +}
> +
> +static int ksz_port_vlan_del(struct dsa_switch *ds, int port,
> +			     const struct switchdev_obj_port_vlan *vlan)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	bool untagged = vlan->flags & BRIDGE_VLAN_INFO_UNTAGGED;
> +	u32 vlan_table[3];
> +	u16 vid;
> +	u16 pvid;
> +
> +	ksz_pread16(dev, port, REG_PORT_DEFAULT_VID, &pvid);
> +	pvid = pvid & 0xFFF;
> +
> +	for (vid = vlan->vid_begin; vid <= vlan->vid_end; vid++) {
> +		get_vlan_table(ds, vid, vlan_table);
> +
> +		vlan_table[2] &= ~BIT(port);
> +
> +		if (pvid == vid)
> +			pvid = 1;
> +
> +		if (untagged)
> +			vlan_table[1] &= ~BIT(port);
> +
> +		set_vlan_table(ds, vid, vlan_table);
> +	}
> +
> +	ksz_pwrite16(dev, port, REG_PORT_DEFAULT_VID, pvid);
> +
> +	return 0;
> +}
> +
> +static int ksz_port_vlan_dump(struct dsa_switch *ds, int port,
> +			      struct switchdev_obj_port_vlan *vlan,
> +			      int (*cb)(struct switchdev_obj *obj))
> +{
> +	struct ksz_device *dev = ds->priv;
> +	u16 vid;
> +	u16 data;
> +	struct vlan_table *vlan_cache;
> +	int err = 0;
> +
> +	/* use dev->vlan_cache due to lack of searching valid vlan entry */
> +	for (vid = vlan->vid_begin; vid < dev->num_vlans; vid++) {
> +		vlan_cache = &dev->vlan_cache[vid];
> +
> +		if (!(vlan_cache->table[0] & VLAN_VALID))
> +			continue;
> +
> +		vlan->vid_begin = vid;
> +		vlan->vid_end = vid;
> +		vlan->flags = 0;
> +		if (vlan_cache->table[2] & BIT(port)) {
> +			if (vlan_cache->table[1] & BIT(port))
> +				vlan->flags |= BRIDGE_VLAN_INFO_UNTAGGED;
> +			ksz_pread16(dev, port, REG_PORT_DEFAULT_VID, &data);
> +			if (vid == (data & 0xFFFFF))
> +				vlan->flags |= BRIDGE_VLAN_INFO_PVID;
> +
> +			err = cb(&vlan->obj);
> +			if (err)
> +				break;
> +		}
> +	}
> +
> +	return err;
> +}
> +
> +static int ksz_port_fdb_prepare(struct dsa_switch *ds, int port,
> +				const struct switchdev_obj_port_fdb *fdb,
> +				struct switchdev_trans *trans)
> +{
> +	/* nothing needed */
> +
> +	return 0;
> +}
> +
> +struct alu_struct {
> +	/* entry 1 */
> +	u8	is_static:1;
> +	u8	is_src_filter:1;
> +	u8	is_dst_filter:1;
> +	u8	prio_age:3;
> +	u32	_reserv_0_1:23;
> +	u8	mstp:3;
> +	/* entry 2 */
> +	u8	is_override:1;
> +	u8	is_use_fid:1;
> +	u32	_reserv_1_1:23;
> +	u8	port_forward:7;
> +	/* entry 3 & 4*/
> +	u32	_reserv_2_1:9;
> +	u8	fid:7;
> +	u8	mac[ETH_ALEN];
> +};
> +
> +static void ksz_port_fdb_add(struct dsa_switch *ds, int port,
> +			     const struct switchdev_obj_port_fdb *fdb,
> +			     struct switchdev_trans *trans)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	u32 alu_table[4];
> +	u32 data;
> +
> +	mutex_lock(&dev->alu_mutex);
> +
> +	/* find any entry with mac & vid */
> +	data = fdb->vid << ALU_FID_INDEX_S;
> +	data |= ((fdb->addr[0] << 8) | fdb->addr[1]);
> +	ksz_write32(dev, REG_SW_ALU_INDEX_0, data);
> +
> +	data = ((fdb->addr[2] << 24) | (fdb->addr[3] << 16));
> +	data |= ((fdb->addr[4] << 8) | fdb->addr[5]);
> +	ksz_write32(dev, REG_SW_ALU_INDEX_1, data);
> +
> +	/* start read operation */
> +	ksz_write32(dev, REG_SW_ALU_CTRL__4, ALU_READ | ALU_START);
> +
> +	/* wait to be finished */
> +	do {
> +		ksz_read32(dev, REG_SW_ALU_CTRL__4, &data);
> +	} while (data & ALU_START);
> +
> +	/* read ALU entry */
> +	read_table(ds, alu_table);
> +
> +	mutex_unlock(&dev->alu_mutex);
> +
> +	/* update ALU entry */
> +	alu_table[0] = ALU_V_STATIC_VALID;
> +	alu_table[1] |= BIT(port);
> +	if (fdb->vid)
> +		alu_table[1] |= ALU_V_USE_FID;
> +	alu_table[2] = (fdb->vid << ALU_V_FID_S);
> +	alu_table[2] |= ((fdb->addr[0] << 8) | fdb->addr[1]);
> +	alu_table[3] = ((fdb->addr[2] << 24) | (fdb->addr[3] << 16));
> +	alu_table[3] |= ((fdb->addr[4] << 8) | fdb->addr[5]);
> +
> +	mutex_lock(&dev->alu_mutex);
> +
> +	write_table(ds, alu_table);
> +
> +	ksz_write32(dev, REG_SW_ALU_CTRL__4, ALU_WRITE | ALU_START);
> +
> +	/* wait to be finished */
> +	do {
> +		ksz_read32(dev, REG_SW_ALU_CTRL__4, &data);
> +	} while (data & ALU_START);

Same thing here we also need a timeout count, there are more occurences
of this throughout the entire driver.

> +
> +	mutex_unlock(&dev->alu_mutex);
> +}
> +
> +static int ksz_port_fdb_del(struct dsa_switch *ds, int port,
> +			    const struct switchdev_obj_port_fdb *fdb)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	u32 alu_table[4];
> +	u32 data;
> +
> +	mutex_lock(&dev->alu_mutex);
> +
> +	/* read any entry with mac & vid */
> +	data = fdb->vid << ALU_FID_INDEX_S;
> +	data |= ((fdb->addr[0] << 8) | fdb->addr[1]);
> +	ksz_write32(dev, REG_SW_ALU_INDEX_0, data);
> +
> +	data = ((fdb->addr[2] << 24) | (fdb->addr[3] << 16));
> +	data |= ((fdb->addr[4] << 8) | fdb->addr[5]);
> +	ksz_write32(dev, REG_SW_ALU_INDEX_1, data);
> +
> +	/* start read operation */
> +	ksz_write32(dev, REG_SW_ALU_CTRL__4, ALU_READ | ALU_START);
> +
> +	/* wait to be finished */
> +	do {
> +		ksz_read32(dev, REG_SW_ALU_CTRL__4, &data);
> +	} while (data & ALU_START);
> +
> +	ksz_read32(dev, REG_SW_ALU_VAL_A, &alu_table[0]);
> +	if (alu_table[0] & ALU_V_STATIC_VALID) {
> +		ksz_read32(dev, REG_SW_ALU_VAL_B, &alu_table[1]);
> +		ksz_read32(dev, REG_SW_ALU_VAL_C, &alu_table[2]);
> +		ksz_read32(dev, REG_SW_ALU_VAL_D, &alu_table[3]);
> +
> +		/* clear forwarding port */
> +		alu_table[2] &= ~BIT(port);
> +
> +		/* if there is no port to forward, clear table */
> +		if ((alu_table[2] & ALU_V_PORT_MAP) == 0) {
> +			alu_table[0] = 0;
> +			alu_table[1] = 0;
> +			alu_table[2] = 0;
> +			alu_table[3] = 0;
> +		}
> +	} else {
> +		alu_table[0] = 0;
> +		alu_table[1] = 0;
> +		alu_table[2] = 0;
> +		alu_table[3] = 0;
> +	}
> +
> +	write_table(ds, alu_table);
> +
> +	ksz_write32(dev, REG_SW_ALU_CTRL__4, ALU_WRITE | ALU_START);
> +
> +	/* wait to be finished */
> +	do {
> +		ksz_read32(dev, REG_SW_ALU_CTRL__4, &data);
> +	} while (data & ALU_START);
> +
> +	mutex_unlock(&dev->alu_mutex);
> +
> +	return 0;
> +}
> +
> +static void convert_alu(struct alu_struct *alu, u32 *alu_table)
> +{
> +	alu->is_static = !!(alu_table[0] & ALU_V_STATIC_VALID);
> +	alu->is_src_filter = !!(alu_table[0] & ALU_V_SRC_FILTER);
> +	alu->is_dst_filter = !!(alu_table[0] & ALU_V_DST_FILTER);
> +	alu->prio_age = (alu_table[0] >> ALU_V_PRIO_AGE_CNT_S) &
> +			ALU_V_PRIO_AGE_CNT_M;
> +	alu->mstp = alu_table[0] & ALU_V_MSTP_M;
> +
> +	alu->is_override = !!(alu_table[1] & ALU_V_OVERRIDE);
> +	alu->is_use_fid = !!(alu_table[1] & ALU_V_USE_FID);
> +	alu->port_forward = alu_table[1] & ALU_V_PORT_MAP;
> +
> +	alu->fid = (alu_table[2] >> ALU_V_FID_S) & ALU_V_FID_M;
> +
> +	alu->mac[0] = (alu_table[2] >> 8) & 0xFF;
> +	alu->mac[1] = alu_table[2] & 0xFF;
> +	alu->mac[2] = (alu_table[3] >> 24) & 0xFF;
> +	alu->mac[3] = (alu_table[3] >> 16) & 0xFF;
> +	alu->mac[4] = (alu_table[3] >> 8) & 0xFF;
> +	alu->mac[5] = alu_table[3] & 0xFF;
> +}
> +
> +static int ksz_port_fdb_dump(struct dsa_switch *ds, int port,
> +			     struct switchdev_obj_port_fdb *fdb,
> +			     int (*cb)(struct switchdev_obj *obj))
> +{
> +	struct ksz_device *dev = ds->priv;
> +	int ret = 0;
> +	u32 data;
> +	u32 alu_table[4];
> +	struct alu_struct alu;
> +
> +	mutex_lock(&dev->alu_mutex);
> +
> +	/* start ALU search */
> +	ksz_write32(dev, REG_SW_ALU_CTRL__4, ALU_START | ALU_SEARCH);
> +
> +	do {
> +		do {
> +			ksz_read32(dev, REG_SW_ALU_CTRL__4, &data);
> +		} while (!(data & ALU_VALID) && (data & ALU_START));
> +
> +		/* read ALU table */
> +		read_table(ds, alu_table);
> +
> +		convert_alu(&alu, alu_table);
> +
> +		if (alu.port_forward & BIT(port)) {
> +			fdb->vid = alu.fid;
> +			if (alu.is_static)
> +				fdb->ndm_state = NUD_NOARP;
> +			else
> +				fdb->ndm_state = NUD_REACHABLE;
> +			ether_addr_copy(fdb->addr, alu.mac);
> +
> +			ret = cb(&fdb->obj);
> +			if (ret)
> +				goto exit;
> +		}
> +	} while (data & ALU_START);
> +
> +exit:
> +
> +	/* stop ALU search */
> +	ksz_write32(dev, REG_SW_ALU_CTRL__4, 0);
> +
> +	mutex_unlock(&dev->alu_mutex);
> +
> +	return ret;
> +}
> +
> +static int ksz_port_mdb_prepare(struct dsa_switch *ds, int port,
> +				const struct switchdev_obj_port_mdb *mdb,
> +				struct switchdev_trans *trans)
> +{
> +	/* nothing to do */
> +	return 0;
> +}
> +
> +static void ksz_port_mdb_add(struct dsa_switch *ds, int port,
> +			     const struct switchdev_obj_port_mdb *mdb,
> +			     struct switchdev_trans *trans)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	u32 static_table[4];
> +	u32 data;
> +	int index;
> +	u32 mac_hi, mac_lo;
> +
> +	mac_hi = ((mdb->addr[0] << 8) | mdb->addr[1]);
> +	mac_lo = ((mdb->addr[2] << 24) | (mdb->addr[3] << 16));
> +	mac_lo |= ((mdb->addr[4] << 8) | mdb->addr[5]);
> +
> +	for (index = 0; index < dev->num_statics; index++) {
> +		mutex_lock(&dev->alu_mutex);
> +
> +		/* find empty slot first */
> +		data = (index << ALU_STAT_INDEX_S) |
> +			ALU_STAT_READ | ALU_STAT_START;
> +		ksz_write32(dev, REG_SW_ALU_STAT_CTRL__4, data);
> +
> +		/* wait to be finished */
> +		do {
> +			ksz_read32(dev, REG_SW_ALU_STAT_CTRL__4, &data);
> +		} while (data & ALU_STAT_START);
> +
> +		/* read ALU static table */
> +		read_table(ds, static_table);
> +
> +		mutex_unlock(&dev->alu_mutex);
> +
> +		if (static_table[0] & ALU_V_STATIC_VALID) {
> +			/* check this has same vid & mac address */
> +
> +			if (((static_table[2] >> ALU_V_FID_S) == (mdb->vid)) &&
> +			    ((static_table[2] & ALU_V_MAC_ADDR_HI) == mac_hi) &&
> +			    (static_table[3] == mac_lo)) {
> +				/* found matching one */
> +				break;
> +			}
> +		} else {
> +			/* found empty one */
> +			break;
> +		}
> +	}
> +
> +	/* no available entry */
> +	if (index == dev->num_statics)
> +		return;
> +
> +	/* add entry */
> +	static_table[0] = ALU_V_STATIC_VALID;
> +	static_table[1] |= BIT(port);
> +	if (mdb->vid)
> +		static_table[1] |= ALU_V_USE_FID;
> +	static_table[2] = (mdb->vid << ALU_V_FID_S);
> +	static_table[2] |= mac_hi;
> +	static_table[3] = mac_lo;
> +
> +	mutex_lock(&dev->alu_mutex);
> +
> +	write_table(ds, static_table);
> +
> +	data = (index << ALU_STAT_INDEX_S) | ALU_STAT_START;
> +	ksz_write32(dev, REG_SW_ALU_STAT_CTRL__4, data);
> +
> +	/* wait to be finished */
> +	do {
> +		ksz_read32(dev, REG_SW_ALU_STAT_CTRL__4, &data);
> +	} while (data & ALU_STAT_START);
> +
> +	mutex_unlock(&dev->alu_mutex);
> +}
> +
> +static int ksz_port_mdb_del(struct dsa_switch *ds, int port,
> +			    const struct switchdev_obj_port_mdb *mdb)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	u32 static_table[4];
> +	u32 data;
> +	int index;
> +	u32 mac_hi, mac_lo;
> +
> +	mac_hi = ((mdb->addr[0] << 8) | mdb->addr[1]);
> +	mac_lo = ((mdb->addr[2] << 24) | (mdb->addr[3] << 16));
> +	mac_lo |= ((mdb->addr[4] << 8) | mdb->addr[5]);
> +
> +	for (index = 0; index < dev->num_statics; index++) {
> +		mutex_lock(&dev->alu_mutex);
> +
> +		/* find empty slot first */
> +		data = (index << ALU_STAT_INDEX_S) |
> +			ALU_STAT_READ | ALU_STAT_START;
> +		ksz_write32(dev, REG_SW_ALU_STAT_CTRL__4, data);
> +
> +		/* wait to be finished */
> +		do {
> +			ksz_read32(dev, REG_SW_ALU_STAT_CTRL__4, &data);
> +		} while (data & ALU_STAT_START);
> +
> +		/* read ALU static table */
> +		read_table(ds, static_table);
> +
> +		mutex_unlock(&dev->alu_mutex);
> +
> +		if (static_table[0] & ALU_V_STATIC_VALID) {
> +			/* check this has same vid & mac address */
> +
> +			if (((static_table[2] >> ALU_V_FID_S) == (mdb->vid)) &&
> +			    ((static_table[2] & ALU_V_MAC_ADDR_HI) == mac_hi) &&
> +			    (static_table[3] == mac_lo)) {
> +				/* found matching one */
> +				break;
> +			}
> +		}
> +	}
> +
> +	/* no available entry */
> +	if (index == dev->num_statics)
> +		return -EINVAL;
> +
> +	/* clear port */
> +	static_table[1] &= ~BIT(port);
> +
> +	if ((static_table[1] & ALU_V_PORT_MAP) == 0) {
> +		/* delete entry */
> +		static_table[0] = 0;
> +		static_table[1] = 0;
> +		static_table[2] = 0;
> +		static_table[3] = 0;
> +	}
> +
> +	mutex_lock(&dev->alu_mutex);
> +
> +	write_table(ds, static_table);
> +
> +	data = (index << ALU_STAT_INDEX_S) | ALU_STAT_START;
> +	ksz_write32(dev, REG_SW_ALU_STAT_CTRL__4, data);
> +
> +	/* wait to be finished */
> +	do {
> +		ksz_read32(dev, REG_SW_ALU_STAT_CTRL__4, &data);
> +	} while (data & ALU_STAT_START);
> +
> +	mutex_unlock(&dev->alu_mutex);
> +
> +	return 0;
> +}
> +
> +static int ksz_port_mdb_dump(struct dsa_switch *ds, int port,
> +			     struct switchdev_obj_port_mdb *mdb,
> +			     int (*cb)(struct switchdev_obj *obj))
> +{
> +	/* this is not called by switch layer */
> +	return 0;
> +}
> +
> +static int ksz_port_mirror_add(struct dsa_switch *ds, int port,
> +			       struct dsa_mall_mirror_tc_entry *mirror,
> +			       bool ingress)
> +{
> +	struct ksz_device *dev = ds->priv;
> +
> +	if (ingress)
> +		ksz_port_cfg(dev, port, P_MIRROR_CTRL, PORT_MIRROR_RX, true);
> +	else
> +		ksz_port_cfg(dev, port, P_MIRROR_CTRL, PORT_MIRROR_TX, true);
> +
> +	ksz_port_cfg(dev, port, P_MIRROR_CTRL, PORT_MIRROR_SNIFFER, false);
> +
> +	/* configure mirror port */
> +	ksz_port_cfg(dev, mirror->to_local_port, P_MIRROR_CTRL,
> +		     PORT_MIRROR_SNIFFER, true);
> +
> +	ksz_cfg(dev, S_MIRROR_CTRL, SW_MIRROR_RX_TX, false);
> +
> +	return 0;
> +}
> +
> +static void ksz_port_mirror_del(struct dsa_switch *ds, int port,
> +				struct dsa_mall_mirror_tc_entry *mirror)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	u8 data;
> +
> +	if (mirror->ingress)
> +		ksz_port_cfg(dev, port, P_MIRROR_CTRL, PORT_MIRROR_RX, false);
> +	else
> +		ksz_port_cfg(dev, port, P_MIRROR_CTRL, PORT_MIRROR_TX, false);
> +
> +	ksz_pread8(dev, port, P_MIRROR_CTRL, &data);
> +
> +	if (!(data & (PORT_MIRROR_RX | PORT_MIRROR_TX)))
> +		ksz_port_cfg(dev, mirror->to_local_port, P_MIRROR_CTRL,
> +			     PORT_MIRROR_SNIFFER, false);
> +}
> +
> +static const struct dsa_switch_ops ksz_switch_ops = {
> +	.get_tag_protocol	= ksz_get_tag_protocol,
> +	.setup			= ksz_setup,
> +	.phy_read		= ksz_phy_read16,
> +	.phy_write		= ksz_phy_write16,
> +	.port_enable		= ksz_enable_port,
> +	.port_disable		= ksz_disable_port,
> +	.get_strings		= ksz_get_strings,
> +	.get_ethtool_stats	= ksz_get_ethtool_stats,
> +	.get_sset_count		= ksz_sset_count,
> +	.port_stp_state_set	= ksz_port_stp_state_set,
> +	.port_fast_age		= ksz_port_fast_age,
> +	.port_vlan_filtering	= ksz_port_vlan_filtering,
> +	.port_vlan_prepare	= ksz_port_vlan_prepare,
> +	.port_vlan_add		= ksz_port_vlan_add,
> +	.port_vlan_del		= ksz_port_vlan_del,
> +	.port_vlan_dump		= ksz_port_vlan_dump,
> +	.port_fdb_prepare	= ksz_port_fdb_prepare,
> +	.port_fdb_dump		= ksz_port_fdb_dump,
> +	.port_fdb_add		= ksz_port_fdb_add,
> +	.port_fdb_del		= ksz_port_fdb_del,
> +	.port_mdb_prepare       = ksz_port_mdb_prepare,
> +	.port_mdb_add           = ksz_port_mdb_add,
> +	.port_mdb_del           = ksz_port_mdb_del,
> +	.port_mdb_dump          = ksz_port_mdb_dump,
> +	.port_mirror_add	= ksz_port_mirror_add,
> +	.port_mirror_del	= ksz_port_mirror_del,
> +};
> +
> +struct ksz_chip_data {
> +	u32 chip_id;
> +	const char *dev_name;
> +	int num_vlans;
> +	int num_alus;
> +	int num_statics;
> +	u16 enabled_ports;
> +	int cpu_port;
> +	int port_cnt;
> +	int phy_port_cnt;
> +};
> +
> +static const struct ksz_chip_data ksz_switch_chips[] = {
> +	{
> +		.chip_id = 0x00947700,
> +		.dev_name = "KSZ9477",
> +		.num_vlans = 4096,
> +		.num_alus = 4096,
> +		.num_statics = 16,
> +		.enabled_ports = 0x1F,	/* port0-4 */
> +		.cpu_port = 5,		/* port5 (RGMII) */
> +		.port_cnt = 7,
> +		.phy_port_cnt = 5,
> +	},
> +};

Do you have an idea of which other models would be largely software
compatible?

> +
> +static int ksz_switch_init(struct ksz_device *dev)
> +{
> +	int i;
> +
> +	mutex_init(&dev->reg_mutex);
> +	mutex_init(&dev->stats_mutex);
> +	mutex_init(&dev->alu_mutex);
> +
> +	dev->ds->ops = &ksz_switch_ops;
> +
> +	for (i = 0; i < ARRAY_SIZE(ksz_switch_chips); i++) {
> +		const struct ksz_chip_data *chip = &ksz_switch_chips[i];
> +
> +		if (dev->chip_id == chip->chip_id) {
> +			if (!dev->enabled_ports)
> +				dev->enabled_ports = chip->enabled_ports;
> +			dev->name = chip->dev_name;
> +			dev->num_vlans = chip->num_vlans;
> +			dev->num_alus = chip->num_alus;
> +			dev->num_statics = chip->num_statics;
> +			/* default cpu port */
> +			dev->cpu_port = chip->cpu_port;
> +			dev->port_cnt = chip->port_cnt;
> +
> +			break;
> +		}
> +	}
> +
> +	/* no switch found */
> +	if (!dev->port_cnt)
> +		return -ENODEV;
> +
> +	return 0;
> +}
> +
> +struct ksz_device *ksz_switch_alloc(struct device *base,
> +				    const struct ksz_io_ops *ops,
> +				    void *priv)
> +{
> +	struct dsa_switch *ds;
> +	struct ksz_device *swdev;
> +
> +	ds = dsa_switch_alloc(base, DSA_MAX_PORTS);
> +	if (!ds)
> +		return NULL;
> +
> +	swdev = devm_kzalloc(base, sizeof(*swdev), GFP_KERNEL);
> +	if (!swdev)
> +		return NULL;
> +
> +	ds->priv = swdev;
> +	swdev->dev = base;
> +
> +	swdev->ds = ds;
> +	swdev->priv = priv;
> +	swdev->ops = ops;
> +
> +	return swdev;
> +}
> +EXPORT_SYMBOL(ksz_switch_alloc);
> +
> +int ksz_switch_detect(struct ksz_device *dev)
> +{
> +	u8 data8;
> +	u32 id32;
> +	int ret;
> +
> +	/* turn off SPI DO Edge select */
> +	ret = ksz_read8(dev, REG_SW_GLOBAL_SERIAL_CTRL_0, &data8);
> +	if (ret)
> +		return ret;
> +
> +	data8 &= ~SPI_AUTO_EDGE_DETECTION;
> +	ret = ksz_write8(dev, REG_SW_GLOBAL_SERIAL_CTRL_0, data8);
> +	if (ret)
> +		return ret;
> +
> +	/* read chip id */
> +	ret = ksz_read32(dev, REG_CHIP_ID0__1, &id32);
> +	if (ret)
> +		return ret;
> +
> +	dev->chip_id = id32;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(ksz_switch_detect);
> +
> +int ksz_switch_register(struct ksz_device *dev)
> +{
> +	int ret;
> +
> +	if (dev->pdata) {
> +		dev->chip_id = dev->pdata->chip_id;
> +		dev->enabled_ports = dev->pdata->enabled_ports;
> +	}
> +
> +	if (!dev->chip_id && ksz_switch_detect(dev))
> +		return -EINVAL;
> +
> +	ret = ksz_switch_init(dev);
> +	if (ret)
> +		return ret;
> +
> +	return dsa_register_switch(dev->ds, dev->ds->dev);
> +}
> +EXPORT_SYMBOL(ksz_switch_register);
> +
> +void ksz_switch_remove(struct ksz_device *dev)
> +{
> +	dsa_unregister_switch(dev->ds);
> +}
> +EXPORT_SYMBOL(ksz_switch_remove);
> +
> +MODULE_AUTHOR("Woojung Huh <Woojung.Huh@microchip.com>");
> +MODULE_DESCRIPTION("Microchip KSZ Series Switch DSA Driver");
> diff --git a/drivers/net/dsa/microchip/ksz_priv.h b/drivers/net/dsa/microchip/ksz_priv.h
> new file mode 100644
> index 0000000..5619acd
> --- /dev/null
> +++ b/drivers/net/dsa/microchip/ksz_priv.h
> @@ -0,0 +1,207 @@
> +/*
> + * Microchip KSZ series switch common definitions
> + *
> + * Copyright (C) 2017
> + *
> + * Permission to use, copy, modify, and/or distribute this software for any
> + * purpose with or without fee is hereby granted, provided that the above
> + * copyright notice and this permission notice appear in all copies.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
> + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
> + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
> + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
> + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
> + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
> + */
> +
> +#ifndef __KSZ_PRIV_H
> +#define __KSZ_PRIV_H
> +
> +#include <linux/kernel.h>
> +#include <linux/mutex.h>
> +#include <linux/phy.h>
> +#include <linux/etherdevice.h>
> +#include <net/dsa.h>
> +
> +#include "ksz_9477_reg.h"
> +
> +struct ksz_io_ops;
> +
> +struct vlan_table {
> +	u32 table[3];
> +};
> +
> +struct ksz_device {
> +	struct dsa_switch *ds;
> +	struct ksz_platform_data *pdata;
> +	const char *name;
> +
> +	struct mutex reg_mutex;		/* register access */
> +	struct mutex stats_mutex;	/* status access */
> +	struct mutex alu_mutex;		/* ALU access */
> +	const struct ksz_io_ops *ops;
> +
> +	struct device *dev;
> +
> +	void *priv;
> +
> +	/* chip specific data */
> +	u32 chip_id;
> +	u16 enabled_ports;
> +	int num_vlans;
> +	int num_alus;
> +	int num_statics;
> +	int cpu_port;
> +	int port_cnt;
> +
> +	struct vlan_table *vlan_cache;
> +};
> +
> +struct ksz_io_ops {
> +	int (*read8)(struct ksz_device *dev, u32 reg, u8 *value);
> +	int (*read16)(struct ksz_device *dev, u32 reg, u16 *value);
> +	int (*read24)(struct ksz_device *dev, u32 reg, u32 *value);
> +	int (*read32)(struct ksz_device *dev, u32 reg, u32 *value);
> +	int (*write8)(struct ksz_device *dev, u32 reg, u8 value);
> +	int (*write16)(struct ksz_device *dev, u32 reg, u16 value);
> +	int (*write24)(struct ksz_device *dev, u32 reg, u32 value);
> +	int (*write32)(struct ksz_device *dev, u32 reg, u32 value);
> +	int (*phy_read16)(struct ksz_device *dev, int addr, int reg,
> +			  u16 *value);
> +	int (*phy_write16)(struct ksz_device *dev, int addr, int reg,
> +			   u16 value);
> +};
> +
> +struct ksz_device *ksz_switch_alloc(struct device *base,
> +				    const struct ksz_io_ops *ops, void *priv);
> +int ksz_switch_detect(struct ksz_device *dev);
> +int ksz_switch_register(struct ksz_device *dev);
> +void ksz_switch_remove(struct ksz_device *dev);
> +
> +static inline int ksz_read8(struct ksz_device *dev, u32 reg, u8 *val)
> +{
> +	int ret;
> +
> +	mutex_lock(&dev->reg_mutex);
> +	ret = dev->ops->read8(dev, reg, val);
> +	mutex_unlock(&dev->reg_mutex);
> +
> +	return ret;
> +}
> +
> +static inline int ksz_read16(struct ksz_device *dev, u32 reg, u16 *val)
> +{
> +	int ret;
> +
> +	mutex_lock(&dev->reg_mutex);
> +	ret = dev->ops->read16(dev, reg, val);
> +	mutex_unlock(&dev->reg_mutex);
> +
> +	return ret;
> +}
> +
> +static inline int ksz_read24(struct ksz_device *dev, u32 reg, u32 *val)
> +{
> +	int ret;
> +
> +	mutex_lock(&dev->reg_mutex);
> +	ret = dev->ops->read24(dev, reg, val);
> +	mutex_unlock(&dev->reg_mutex);
> +
> +	return ret;
> +}
> +
> +static inline int ksz_read32(struct ksz_device *dev, u32 reg, u32 *val)
> +{
> +	int ret;
> +
> +	mutex_lock(&dev->reg_mutex);
> +	ret = dev->ops->read32(dev, reg, val);
> +	mutex_unlock(&dev->reg_mutex);
> +
> +	return ret;
> +}
> +
> +static inline int ksz_write8(struct ksz_device *dev, u32 reg, u8 value)
> +{
> +	int ret;
> +
> +	mutex_lock(&dev->reg_mutex);
> +	ret = dev->ops->write8(dev, reg, value);
> +	mutex_unlock(&dev->reg_mutex);
> +
> +	return ret;
> +}
> +
> +static inline int ksz_write16(struct ksz_device *dev, u32 reg, u16 value)
> +{
> +	int ret;
> +
> +	mutex_lock(&dev->reg_mutex);
> +	ret = dev->ops->write16(dev, reg, value);
> +	mutex_unlock(&dev->reg_mutex);
> +
> +	return ret;
> +}
> +
> +static inline int ksz_write24(struct ksz_device *dev, u32 reg, u32 value)
> +{
> +	int ret;
> +
> +	mutex_lock(&dev->reg_mutex);
> +	ret = dev->ops->write24(dev, reg, value);
> +	mutex_unlock(&dev->reg_mutex);
> +
> +	return ret;
> +}
> +
> +static inline int ksz_write32(struct ksz_device *dev, u32 reg, u32 value)
> +{
> +	int ret;
> +
> +	mutex_lock(&dev->reg_mutex);
> +	ret = dev->ops->write32(dev, reg, value);
> +	mutex_unlock(&dev->reg_mutex);
> +
> +	return ret;
> +}
> +
> +static inline void ksz_pread8(struct ksz_device *dev, int port, int offset,
> +			      u8 *data)
> +{
> +	ksz_read8(dev, PORT_CTRL_ADDR(port, offset), data);
> +}
> +
> +static inline void ksz_pread16(struct ksz_device *dev, int port, int offset,
> +			       u16 *data)
> +{
> +	ksz_read16(dev, PORT_CTRL_ADDR(port, offset), data);
> +}
> +
> +static inline void ksz_pread32(struct ksz_device *dev, int port, int offset,
> +			       u32 *data)
> +{
> +	ksz_read32(dev, PORT_CTRL_ADDR(port, offset), data);
> +}
> +
> +static inline void ksz_pwrite8(struct ksz_device *dev, int port, int offset,
> +			       u8 data)
> +{
> +	ksz_write8(dev, PORT_CTRL_ADDR(port, offset), data);
> +}
> +
> +static inline void ksz_pwrite16(struct ksz_device *dev, int port, int offset,
> +				u16 data)
> +{
> +	ksz_write16(dev, PORT_CTRL_ADDR(port, offset), data);
> +}
> +
> +static inline void ksz_pwrite32(struct ksz_device *dev, int port, int offset,
> +				u32 data)
> +{
> +	ksz_write32(dev, PORT_CTRL_ADDR(port, offset), data);
> +}
> +
> +#endif
> diff --git a/drivers/net/dsa/microchip/ksz_spi.c b/drivers/net/dsa/microchip/ksz_spi.c
> new file mode 100644
> index 0000000..f92e41e
> --- /dev/null
> +++ b/drivers/net/dsa/microchip/ksz_spi.c
> @@ -0,0 +1,215 @@
> +/*
> + * Microchip KSZ series register access through SPI
> + *
> + * Copyright (C) 2017
> + *
> + * Permission to use, copy, modify, and/or distribute this software for any
> + * purpose with or without fee is hereby granted, provided that the above
> + * copyright notice and this permission notice appear in all copies.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
> + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
> + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
> + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
> + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
> + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
> + */
> +
> +#include <asm/unaligned.h>
> +
> +#include <linux/delay.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/spi/spi.h>
> +
> +#include "ksz_priv.h"
> +
> +/* SPI frame opcodes */
> +#define KS_SPIOP_RD			3
> +#define KS_SPIOP_WR			2
> +
> +#define SPI_ADDR_SHIFT			24
> +#define SPI_ADDR_MASK			(BIT(SPI_ADDR_SHIFT) - 1)
> +#define SPI_TURNAROUND_SHIFT		5
> +
> +static inline int ksz_spi_read_reg(struct spi_device *spi, u32 reg, u8 *val,
> +				   unsigned int len)
> +{
> +	u32 txbuf;
> +	int ret;
> +
> +	txbuf = reg & SPI_ADDR_MASK;
> +	txbuf |= KS_SPIOP_RD << SPI_ADDR_SHIFT;
> +	txbuf <<= SPI_TURNAROUND_SHIFT;
> +	txbuf = cpu_to_be32(txbuf);
> +
> +	ret = spi_write_then_read(spi, &txbuf, 4, val, len);
> +	return ret;
> +}
> +
> +static int ksz_spi_read(struct ksz_device *dev, u32 reg, u8 *data,
> +			unsigned int len)
> +{
> +	struct spi_device *spi = dev->priv;
> +
> +	return ksz_spi_read_reg(spi, reg, data, len);
> +}
> +
> +static int ksz_spi_read8(struct ksz_device *dev, u32 reg, u8 *val)
> +{
> +	return ksz_spi_read(dev, reg, val, 1);
> +}
> +
> +static int ksz_spi_read16(struct ksz_device *dev, u32 reg, u16 *val)
> +{
> +	int ret = ksz_spi_read(dev, reg, (u8 *)val, 2);
> +
> +	if (!ret)
> +		*val = be16_to_cpu(*val);
> +
> +	return ret;
> +}
> +
> +static int ksz_spi_read24(struct ksz_device *dev, u32 reg, u32 *val)
> +{
> +	int ret;
> +
> +	*val = 0;
> +	ret = ksz_spi_read(dev, reg, (u8 *)val, 3);
> +	if (!ret) {
> +		*val = be32_to_cpu(*val);
> +		/* convert to 24bit */
> +		*val >>= 8;
> +	}
> +
> +	return ret;
> +}
> +
> +static int ksz_spi_read32(struct ksz_device *dev, u32 reg, u32 *val)
> +{
> +	int ret = ksz_spi_read(dev, reg, (u8 *)val, 4);
> +
> +	if (!ret)
> +		*val = be32_to_cpu(*val);
> +
> +	return ret;
> +}
> +
> +static inline int ksz_spi_write_reg(struct spi_device *spi, u32 reg, u8 *val,
> +				    unsigned int len)
> +{
> +	u32 txbuf;
> +	u8 data[12];
> +	int i;
> +
> +	txbuf = reg & SPI_ADDR_MASK;
> +	txbuf |= (KS_SPIOP_WR << SPI_ADDR_SHIFT);
> +	txbuf <<= SPI_TURNAROUND_SHIFT;
> +	txbuf = cpu_to_be32(txbuf);
> +
> +	data[0] = txbuf & 0xFF;
> +	data[1] = (txbuf & 0xFF00) >> 8;
> +	data[2] = (txbuf & 0xFF0000) >> 16;
> +	data[3] = (txbuf & 0xFF000000) >> 24;
> +	for (i = 0; i < len; i++)
> +		data[i + 4] = val[i];
> +
> +	return spi_write(spi, &data, 4 + len);
> +}
> +
> +static int ksz_spi_write8(struct ksz_device *dev, u32 reg, u8 value)
> +{
> +	struct spi_device *spi = dev->priv;
> +
> +	return ksz_spi_write_reg(spi, reg, &value, 1);
> +}
> +
> +static int ksz_spi_write16(struct ksz_device *dev, u32 reg, u16 value)
> +{
> +	struct spi_device *spi = dev->priv;
> +
> +	value = cpu_to_be16(value);
> +	return ksz_spi_write_reg(spi, reg, (u8 *)&value, 2);
> +}
> +
> +static int ksz_spi_write24(struct ksz_device *dev, u32 reg, u32 value)
> +{
> +	struct spi_device *spi = dev->priv;
> +
> +	/* make it to big endian 24bit from MSB */
> +	value <<= 8;
> +	value = cpu_to_be32(value);
> +	return ksz_spi_write_reg(spi, reg, (u8 *)&value, 3);
> +}
> +
> +static int ksz_spi_write32(struct ksz_device *dev, u32 reg, u32 value)
> +{
> +	struct spi_device *spi = dev->priv;
> +
> +	value = cpu_to_be32(value);
> +	return ksz_spi_write_reg(spi, reg, (u8 *)&value, 4);
> +}
> +
> +static const struct ksz_io_ops ksz_spi_ops = {
> +	.read8 = ksz_spi_read8,
> +	.read16 = ksz_spi_read16,
> +	.read24 = ksz_spi_read24,
> +	.read32 = ksz_spi_read32,
> +	.write8 = ksz_spi_write8,
> +	.write16 = ksz_spi_write16,
> +	.write24 = ksz_spi_write24,
> +	.write32 = ksz_spi_write32,
> +};
> +
> +static int ksz_spi_probe(struct spi_device *spi)
> +{
> +	struct ksz_device *dev;
> +	int ret;
> +
> +	dev = ksz_switch_alloc(&spi->dev, &ksz_spi_ops, spi);
> +	if (!dev)
> +		return -ENOMEM;
> +
> +	if (spi->dev.platform_data)
> +		dev->pdata = spi->dev.platform_data;
> +
> +	ret = ksz_switch_register(dev);
> +	if (ret)
> +		return ret;
> +
> +	spi_set_drvdata(spi, dev);
> +
> +	return 0;
> +}
> +
> +static int ksz_spi_remove(struct spi_device *spi)
> +{
> +	struct ksz_device *dev = spi_get_drvdata(spi);
> +
> +	if (dev)
> +		ksz_switch_remove(dev);
> +
> +	return 0;
> +}
> +
> +static const struct of_device_id ksz_dt_ids[] = {
> +	{ .compatible = "microchip,ksz9477" },
> +	{},
> +};
> +MODULE_DEVICE_TABLE(of, ksz_dt_ids);
> +
> +static struct spi_driver ksz_spi_driver = {
> +	.driver = {
> +		.name	= "ksz9477-switch",
> +		.owner	= THIS_MODULE,
> +		.of_match_table = of_match_ptr(ksz_dt_ids),
> +	},
> +	.probe	= ksz_spi_probe,
> +	.remove	= ksz_spi_remove,
> +};
> +
> +module_spi_driver(ksz_spi_driver);
> +
> +MODULE_AUTHOR("Woojung Huh <Woojung.Huh@microchip.com>");
> +MODULE_DESCRIPTION("Microchip KSZ Series Switch SPI access Driver");
> diff --git a/include/linux/platform_data/microchip-ksz.h b/include/linux/platform_data/microchip-ksz.h
> new file mode 100644
> index 0000000..84789ca
> --- /dev/null
> +++ b/include/linux/platform_data/microchip-ksz.h
> @@ -0,0 +1,29 @@
> +/*
> + * Microchip KSZ series switch platform data
> + *
> + * Copyright (C) 2017
> + *
> + * Permission to use, copy, modify, and/or distribute this software for any
> + * purpose with or without fee is hereby granted, provided that the above
> + * copyright notice and this permission notice appear in all copies.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
> + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
> + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
> + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
> + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
> + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
> + */
> +
> +#ifndef __MICROCHIP_KSZ_H
> +#define __MICROCHIP_KSZ_H
> +
> +#include <linux/kernel.h>
> +
> +struct ksz_platform_data {
> +	u32 chip_id;
> +	u16 enabled_ports;
> +};
> +
> +#endif
> 


-- 
Florian

^ permalink raw reply

* Re: [RFC iproute2 0/8] RDMA tool
From: Jiri Pirko @ 2017-05-07 10:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: leonro@mellanox.com, jiri@mellanox.com,
	linux-rdma@vger.kernel.org, ram.amrani@cavium.com,
	sagi@grimberg.me, ogerlitz@mellanox.com,
	dennis.dalessandro@intel.com, hch@lst.de, netdev@vger.kernel.org,
	leon@kernel.org, stephen@networkplumber.org, dledford@redhat.com,
	jgunthorpe@obsidianresearch.com, ariela@mellanox.com
In-Reply-To: <1494081623.3228.1.camel@sandisk.com>

Sat, May 06, 2017 at 04:40:24PM CEST, Bart.VanAssche@sandisk.com wrote:
>On Sat, 2017-05-06 at 12:40 +0200, Jiri Pirko wrote:
>> Thu, May 04, 2017 at 08:10:54PM CEST, Bart.VanAssche@sandisk.com wrote:
>> > On Thu, 2017-05-04 at 21:02 +0300, Leon Romanovsky wrote:
>> > > Following our discussion both in mailing list [1] and at the LPC 2016 [2],
>> > > we would like to propose this RDMA tool to be part of iproute2 package
>> > > and finally improve this situation.
>> > 
>> > Although I really appreciate your work: can you clarify why you would like to
>> > add *RDMA* functionality to an *IP routing* tool? I haven't found any motivation
>> > for adding RDMA functionality to iproute2 in [1].
>> 
>> Bart, please realize that iproute2 is much more than "*IP routing* tool".
>> I understand you got confused by the name. Please see sources. Your comment
>> is totally pointless...
>
>I asked for a clarification that should have been in the cover letter but that
>was missing from that cover letter. So I think that was the right thing to do

I think that was just complete misunderstanding about what iproute2 is.


>instead of pointless. BTW, can you explain why you are using an e-mail address
>that is hiding that you are a Mellanox employee?

Who sais I have to do that? This is funny...

^ permalink raw reply

* Re: [PATCH iproute2 v2 1/1] vxlan: Add support for modifying vxlan device attributes
From: Roman Mashak @ 2017-05-07 14:40 UTC (permalink / raw)
  To: Girish Moodalbail; +Cc: stephen, netdev
In-Reply-To: <1494095863-21916-2-git-send-email-girish.moodalbail@oracle.com>

Girish Moodalbail <girish.moodalbail@oracle.com> writes:


[...]

>  ip/iplink_vxlan.c | 251 +++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 143 insertions(+), 108 deletions(-)
>
> diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
> index b4ebb13..2bd619d 100644
> --- a/ip/iplink_vxlan.c
> +++ b/ip/iplink_vxlan.c
> @@ -21,6 +21,8 @@
>  #include "utils.h"
>  #include "ip_common.h"
>  
> +#define VXLAN_ATTRSET(attrs, type) (((attrs) & (1L << (type))) != 0)

I think you can drop '!= 0' part in the macro.

[...]

>  
> +static void check_duparg(__u64 *attrs, int type, const char *key,
> +			 const char *argv)
> +{
> +	if (!VXLAN_ATTRSET(*attrs, type)) {
> +		*attrs |= (1L << type);
> +		return;
> +	}
> +	duparg2(key, argv);
> +}

[...]

^ permalink raw reply

* [PATCH net v2] Check for skb_put_padto return value
From: Peter Heise @ 2017-05-07 17:15 UTC (permalink / raw)
  To: netdev; +Cc: mail

Return value of skb_put_padto is now checked as
reported by Dan Carpenter. skb might be freed in
case of error in skb_put_padto.
---
 net/hsr/hsr_device.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/hsr/hsr_device.c b/net/hsr/hsr_device.c
index c73160fb11e7..a1545d09a3c9 100644
--- a/net/hsr/hsr_device.c
+++ b/net/hsr/hsr_device.c
@@ -314,7 +314,8 @@ static void send_hsr_supervision_frame(struct hsr_port *master,
 	hsr_sp = (typeof(hsr_sp)) skb_put(skb, sizeof(struct hsr_sup_payload));
 	ether_addr_copy(hsr_sp->MacAddressA, master->dev->dev_addr);
 
-	skb_put_padto(skb, ETH_ZLEN + HSR_HLEN);
+	if (skb_put_padto(skb, ETH_ZLEN + HSR_HLEN))
+		return;
 
 	hsr_forward_skb(skb, master);
 	return;
-- 
2.11.0

^ permalink raw reply related

* Re: [Patch net] ipv4: restore rt->fi for reference counting
From: Eric Dumazet @ 2017-05-07 18:53 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, andreyknvl, edumazet
In-Reply-To: <1493934857-6693-1-git-send-email-xiyou.wangcong@gmail.com>

On Thu, 2017-05-04 at 14:54 -0700, Cong Wang wrote:
> IPv4 dst could use fi->fib_metrics to store metrics but fib_info
> itself is refcnt'ed, so without taking a refcnt fi and
> fi->fib_metrics could be freed while dst metrics still points to
> it. This triggers use-after-free as reported by Andrey twice.
> 
> This patch reverts commit 2860583fe840 ("ipv4: Kill rt->fi") to
> restore this reference counting. It is a quick fix for -net and
> -stable, for -net-next, as Eric suggested, we can consider doing
> reference counting for metrics itself instead of relying on fib_info.
> 
> IPv6 is very different, it copies or steals the metrics from mx6_config
> in fib6_commit_metrics() so probably doesn't need a refcnt.
> 
> Decnet has already done the refcnt'ing, see dn_fib_semantic_match().
> 
> Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Tested-by: Andrey Konovalov <andreyknvl@google.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Acked-by: Eric Dumazet <edumazet@google.com>

Thanks !

^ permalink raw reply

* Re: [RFC iproute2 0/8] RDMA tool
From: Leon Romanovsky @ 2017-05-07  6:14 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jiri@resnulli.us, jiri@mellanox.com, linux-rdma@vger.kernel.org,
	ram.amrani@cavium.com, sagi@grimberg.me, ogerlitz@mellanox.com,
	dennis.dalessandro@intel.com, hch@lst.de, netdev@vger.kernel.org,
	stephen@networkplumber.org, dledford@redhat.com,
	jgunthorpe@obsidianresearch.com, ariela@mellanox.com
In-Reply-To: <1494081623.3228.1.camel@sandisk.com>

[-- Attachment #1: Type: text/plain, Size: 1519 bytes --]

On Sat, May 06, 2017 at 02:40:24PM +0000, Bart Van Assche wrote:
> On Sat, 2017-05-06 at 12:40 +0200, Jiri Pirko wrote:
> > Thu, May 04, 2017 at 08:10:54PM CEST, Bart.VanAssche@sandisk.com wrote:
> > > On Thu, 2017-05-04 at 21:02 +0300, Leon Romanovsky wrote:
> > > > Following our discussion both in mailing list [1] and at the LPC 2016 [2],
> > > > we would like to propose this RDMA tool to be part of iproute2 package
> > > > and finally improve this situation.
> > >
> > > Although I really appreciate your work: can you clarify why you would like to
> > > add *RDMA* functionality to an *IP routing* tool? I haven't found any motivation
> > > for adding RDMA functionality to iproute2 in [1].
> >
> > Bart, please realize that iproute2 is much more than "*IP routing* tool".
> > I understand you got confused by the name. Please see sources. Your comment
> > is totally pointless...
>
> I asked for a clarification that should have been in the cover letter but that
> was missing from that cover letter. So I think that was the right thing to do
> instead of pointless. BTW, can you explain why you are using an e-mail address
> that is hiding that you are a Mellanox employee?

For the same reason as I do. It is much easier to use outside servers
than Mellanox's IT infrastructure.

Right now, I'm speaking for myself, but for me, to properly answer with @mellanox.com,
I need to use very specific mail setup, while for any other addresses I can and use any
sane setup, including mobile application.

>
> Bart.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [RFC iproute2 0/8] RDMA tool
From: Stephen Hemminger @ 2017-05-07 21:02 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jiri Pirko, Jiri Benc, Doug Ledford, Jiri Pirko, Ariel Almog,
	Dennis Dalessandro, Ram Amrani, Bart Van Assche, Sagi Grimberg,
	Jason Gunthorpe, Christoph Hellwig, Or Gerlitz, Linux RDMA,
	Linux Netdev
In-Reply-To: <20170507063329.GL22833@mtr-leonro.local>

[-- Attachment #1: Type: text/plain, Size: 2370 bytes --]

On Sun, 7 May 2017 09:33:29 +0300
Leon Romanovsky <leon@kernel.org> wrote:

> On Sat, May 06, 2017 at 12:48:26PM +0200, Jiri Pirko wrote:
> > Fri, May 05, 2017 at 03:17:54PM CEST, leon@kernel.org wrote:  
> > >On Fri, May 05, 2017 at 08:54:57AM +0200, Jiri Benc wrote:  
> > >> On Thu,  4 May 2017 21:02:08 +0300, Leon Romanovsky wrote:  
> > >> > In order to close object model, ensure reuse of existing code and make this
> > >> > tool usable from day one, we decided to implement wrappers over legacy sysfs
> > >> > prior to implementing netlink functionality. As a nice bonus, it will allow
> > >> > to use this tool with old kernels too.  
> > >>
> > >> This sounds wrong. We don't support legacy ioctl interface for the 'ip'
> > >> command, either. I think rdma should be converted to netlink first and
> > >> the new tool should only use netlink.  
> > >
> > >RDMA in slightly different situation than "ip" tool was. "ip" was implemented
> > >when tools like ifconfig existed. It allowed to old and new systems to be
> > >configured to some degree. In RDMA community, there are no similar tools like
> > >"ifconfig". Implementation in netlink-only interface will leave old systems without
> > >common tool at all.
> > >
> > >As an upstream-oriented person, I personally fine with that, but anyway would
> > >like to get wider agreement/disagreement on that, before removing sysfs
> > >parsing logic from the rdmatool.  
> >
> > I tend to agree with Jiri Benc. I fear that supporting sysfs + netlink
> > api later on for the same things will make the code unnecessary complex.
> > Also, the legacy sysfs will most likely stay there forever so there will
> > be no actual motivation to port the existing things to the new netlink
> > api.
> >
> > For the prototyping purposes, I belive that what you did makes perfect
> > sense. But for the actual mergable version, my feeling is that we need
> > to strictly stick with new netlink rdma interface and just forget about
> > the old sysfs one. Distros would have to backport the new kernel
> > rdma netlink api.  
> 
> Thanks,
> It looks like that most of the comments are in favor of netlink-only
> solution.

If current (like 4.10 or later) kernel support netlink only solution, that makes sense.
When I created bridge command; it also abandoned the old ioctl interface.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* RE: [PATCH net-next 2/5] phy: micrel: add Microchip KSZ 9477 Switch PHY support
From: Woojung.Huh @ 2017-05-07  1:02 UTC (permalink / raw)
  To: andrew; +Cc: f.fainelli, vivien.didelot, netdev, davem, UNGLinuxDriver
In-Reply-To: <20170506035021.GB14749@lunn.ch>

>> +}, {
>> +     .phy_id         = PHY_ID_KSZ9477,
>> +     .phy_id_mask    = MICREL_PHY_ID_MASK,
>> +     .name           = "Microchip KSZ9477",
>> +     .features       = PHY_GBIT_FEATURES,
>> +     .flags          = PHY_HAS_MAGICANEG,
>
>Is this magic still used anywhere? I could not find anything.
>
Hi Andrew,

Really no usage in define and flags.
Will double check and remove it.

- Woojung

^ permalink raw reply

* Re: [PATCH net v4 00/12] Fix possbile memleaks when fail to register_netdevice
From: David Miller @ 2017-05-07 22:25 UTC (permalink / raw)
  To: gfree.wind
  Cc: jiri, mareklindner, sw, a, kuznet, jmorris, yoshfuji, kaber,
	steffen.klassert, herbert, netdev
In-Reply-To: <cover.1493699451.git.gfree.wind@foxmail.com>

From: gfree.wind@foxmail.com
Date: Tue,  2 May 2017 13:58:42 +0800

> These following drivers allocate kinds of resources in its ndo_init
> func, free some of them or all in the destructor func. Then there is
> one memleak that some errors happen after register_netdevice invokes
> the ndo_init callback. Because only the ndo_uninit callback is invoked
> in the error handler of register_netdevice, but destructor not.
> 
> In my original approach, I tried to free the resources in the
> newlink func when fail to register_netdevice, like destructor did
> except not free the net_dev. This method is not good when destructor
> is changed, and the memleak could be not fixed when there is no
> newlink callback.
> 
> Now create one new func used to free the resources in the
> destructor, and the ndo_uninit func also could invokes it when fail
> to register the net_device by comparing the dev->reg_state with
> NETREG_UNINITIALIZED.  If there is no existing ndo_uninit, just add
> one.
> 
> This solution doesn't only make sure free all resources in any case,
> but also follows the original desgin that some resources could be
> kept until the destructor executes normally after register the
> device successfully.

Device private teardown is in two stages for the following reason.
The issue is that netdev_ops->ndo_init() allocates two types of
resources.

One type is OK to release during destruction before the netdev refs
goes to zero.  This is what netdev_ops->ndo_uninit() is for.

The second type is for releasing things which are not safe to drop
until the very last netdev reference disappears.  This is what
netdev->destructor() is for.

If you look around there are hacks in place all over to try and deal
with this issue.  Basically, look for code that checks the return
value of register_netdev() and if an error is indicated it does
some local driver state freeing.  Bonding is one example.  It is
trying to deal with the problem this patch set is targetting.

What really needs to happen is we must divorce the logic of
dev->destructor() from free_netdev().

That way we can do the free_netdev in the unregister netdevice path
after calling netdev->destructor().

Then the only issue callers of register_netdevice() need to be aware
of is that if an error is returned, that caller must call
free_netdev().  Which has been the case for decades.

So I would fix this as follows:

1) Rename netdev->destructor() into "netdev->priv_destructor()"  It
   performs all post ndo_ops->ndo_uninit() cleanups, _except_
   free_netdev().  Update all drivers with netdev->destructor().

2) Add a boolean state to netdev, which indicates if free_netdev()
   should be performed after netdev->priv_destructor() during
   unregister.

That provides all of the flexibility necessary to fix this bug in
the core.

In register_netdevice() if something after ndo_ops->ndo_init()
succeeeds, we invoke _both_ ndo_ops->ndo_uninit() and
netdev->priv_destructor().  We do not look at the netdev "needs
free_netdev()" boolean.

In netdev_run_todo(), where we have the one and only
netdev->destructor() call, change it to:

	if (dev->priv_destructor)
		dev->priv_destructor(dev);
	if (dev->needs_free_netdev)
		free_netdev(dev);

That fixes the bug in all cases.  And makes the purpose and logic
extremely clear.  Also, no internal state tests leak into the
drivers.

Finally, drivers that try to cover up this issue, such as bonding,
need to be changed to no try and free up device private state if
their invocation of register_netdevice() fails.

Thanks.

^ permalink raw reply

* Re: [PATCH net] bpf: don't let ldimm64 leak map addresses on unprivileged
From: Jann Horn @ 2017-05-07 22:26 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: David S. Miller, Alexei Starovoitov, kafai, netdev
In-Reply-To: <793c517a7d163c613ab886eb02d32efea9f902fd.1494194233.git.daniel@iogearbox.net>

On Mon, May 8, 2017 at 12:04 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> The patch fixes two things at once:
>
> 1) It checks the env->allow_ptr_leaks and only prints the map address to
>    the log if we have the privileges to do so, otherwise it just dumps 0
>    as we would when kptr_restrict is enabled on %pK. Given the latter is
>    off by default and not every distro sets it, I don't want to rely on
>    this, hence the 0 by default for unprivileged.
>
> 2) Printing of ldimm64 in the verifier log is currently broken in that
>    we don't print the full immediate, but only the 32 bit part of the
>    first insn part for ldimm64. Thus, fix this up as well; it's okay to
>    access, since we verified all ldimm64 earlier already (including just
>    constants) through replace_map_fd_with_map_ptr().
>
> Fixes: cbd357008604 ("bpf: verifier (add ability to receive verification log)")
> Reported-by: Jann Horn <jannh@google.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[...]
> @@ -362,9 +363,19 @@ static void print_bpf_insn(struct bpf_insn *insn)
>                                 insn->code,
>                                 bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
>                                 insn->src_reg, insn->imm);
> -               } else if (BPF_MODE(insn->code) == BPF_IMM) {
> -                       verbose("(%02x) r%d = 0x%x\n",
> -                               insn->code, insn->dst_reg, insn->imm);
> +               } else if (BPF_MODE(insn->code) == BPF_IMM &&
> +                          BPF_SIZE(insn->code) == BPF_DW) {
> +                       /* At this point, we already made sure that the second
> +                        * part of the ldimm64 insn is accessible.
> +                        */
> +                       u64 imm = ((u64)(insn + 1)->imm << 32) | (u32)insn->imm;
> +                       bool map_ptr = insn->src_reg == BPF_PSEUDO_MAP_FD;
> +
> +                       if (map_ptr && !env->allow_ptr_leaks)
> +                               imm = 0;
> +
> +                       verbose("(%02x) r%d = 0x%llx\n", insn->code,
> +                               insn->dst_reg, (unsigned long long)imm);
>                 } else {
>                         verbose("BUG_ld_%02x\n", insn->code);
>                         return;

You replaced the `BPF_MODE(insn->code) == BPF_IMM` branch with a
`BPF_MODE(insn->code) == BPF_IMM && BPF_SIZE(insn->code) == BPF_DW`
branch. Doesn't that break printing normal immediates?

^ permalink raw reply

* RE: [PATCH net-next 4/5] dsa: Microchip KSZ switches SPI devicetree configuration
From: Woojung.Huh @ 2017-05-07  1:04 UTC (permalink / raw)
  To: sergei.shtylyov, andrew, f.fainelli, vivien.didelot
  Cc: netdev, davem, UNGLinuxDriver
In-Reply-To: <1ade6f7e-7a28-db47-3811-d35d90f57b56@cogentembedded.com>

Hi Sergei

>> +             spi1: spi@f8008000 {
>> +             pinctrl-0 = <&pinctrl_spi_ksz>;
>> +             cs-gpios = <&pioC 25 0>;
>> +             id = <1>;
>> +             status = "okay";
>
>    4 lines above should be indented more to the right.
>> +             };
>> +             };
>
>    Unmatched }?

Will fix both.

- Woojung

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox