Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: How do I avoid recvmsg races with IP_RECVERR?
From: Andy Lutomirski @ 2016-04-09  0:02 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Andy Lutomirski, Network Development
In-Reply-To: <1433291591.3300318.285256449.713A7E1E@webmail.messagingengine.com>

On Tue, Jun 2, 2015 at 5:33 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On Wed, Jun 3, 2015, at 02:03, Andy Lutomirski wrote:
>> On Tue, Jun 2, 2015 at 2:50 PM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>> >> My proposal would be to make the error conversion lazy:
>> >>
>> >> Keeping duplicate data is not a good idea in general: So we shouldn't
>> >> use sk->sk_err if IP_RECVERR is set at all but let sock_error just use
>> >> the sk_error_queue and extract the error code from there.
>> >>
>> >> Only if IP_RECVERR was not set, we use sk->sk_err logic.
>> >>
>> >> What do you think?
>> >
>> > I just noticed that this will probably break existing user space
>> > applications which require that icmp errors are transient even with
>> > IP_RECVERR. We can mark that with a bit in the sk_error_queue pointer
>> > and xchg the pointer, hmmm....
>>
>> Do you mean to fix the race like this but to otherwise leave the
>> semantics
>> alone?  That would be an improvement, but it might be nice to also add
>> a non-crappy API for this, too.
>
> Yes, keep current semantics but fix the race you reported.
>
> I currently don't have good proposals for a decent API to handle this
> besides adding some ancillary cmsg data to msg_control. This still would
> not solve the problem fundamentally, as a -EFAULT/-EINVAL return value
> could also mean that msg_control should not be touched, thus we end up
> again relying on errno checking. :/ Thus checking error queue after
> receiving an error indications is my best hunch so far.
>
> Your proposal with MSG_IGNORE_ERROR seems reasonable so far for ping or
> udp, but I haven't fully grasped the TCP semantics of sk->sk_err, yet.

I was looking at this a bit, and I was thinking about adding a new
socket option, but I'm a bit vague on how all this fits together.

One option would be a socket option that simply causes sock_error to
return 0 (and change SO_ERROR to peek at sk_err directly).  But there
seem to be sock_error callers all over the place, and maybe this
change would cause problems.

Another option would be to add a socket option that explicitly turns
off everything that queues soft errors to sk_err.

I think that, for IP datagrams at least, the ideal semantics would be
for soft errors not to affect sk_err and for POLLERR to be set if the
error queue is nonempty.

--Andy

^ permalink raw reply

* Re: [PATCH net-next v2] vxlan: synchronously and race-free destruction of vxlan sockets
From: Hannes Frederic Sowa @ 2016-04-08 23:55 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, Eric Dumazet, Jiri Benc,
	Marcelo Ricardo Leitner, Stephen Hemminger
In-Reply-To: <CAM_iQpUbTz2dRS+ZTZSSL0AvkOb_2_HEfXo+ap9__aQZmajMjA@mail.gmail.com>



On Sat, Apr 9, 2016, at 01:24, Cong Wang wrote:
> On Fri, Apr 8, 2016 at 1:55 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> > Due to the fact that the udp socket is destructed asynchronously in a
> > work queue, we have some nondeterministic behavior during shutdown of
> > vxlan tunnels and creating new ones. Fix this by keeping the destruction
> > process synchronous in regards to the user space process so IFF_UP can
> > be reliably set.
> >
> > udp_tunnel_sock_release destroys vs->sock->sk if reference counter
> > indicates so. We expect to have the same lifetime of vxlan_sock and
> > vxlan_sock->sock->sk even in fast paths with only rcu locks held. So
> > only destruct the whole socket after we can be sure it cannot be found
> > by searching vxlan_net->sock_list.
> >
> 
> I am wondering what is the reason why we used work queue from
> the beginning?

I actually don't know. It was like that from the beginning. I cc'ed
Stephen, maybe he remembers?

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH V3] net: emac: emac gigabit ethernet controller driver
From: Bjorn Andersson @ 2016-04-08 23:25 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Andrew Lunn, Rob Herring, Gilad Avidov, netdev,
	linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
	linux-arm-msm, Sagar Dharia, shankerd, Greg Kroah-Hartman,
	vikrams, Christopher Covington
In-Reply-To: <57083834.2060102@codeaurora.org>

On Fri 08 Apr 16:01 PDT 2016, Timur Tabi wrote:

> Bjorn Andersson wrote:
> 
> >It sounds like you're trying to say that the pins used can be are
> >muxed as GPIO or MDIO, in the TLMM.
> 
> I'm not 100% sure, but I think that's correct.  If you don't want to have
> normal networking, you could connect those external pins to some GPIO device
> (like an LED or whatever), and then configure the pin muxing for GPIO
> purposes.  But if that's true, it's only true on the FSM9900. On the
> QDF2432, those lines are not connected to the TLMM.  They are instead
> hard-wired to the Emac.
> 

Then through proper use of the pinctrl framework you should configure
the FSM9900 to mux these pins appropriately and the two solutions are
equivalent.

> >In the downstream kernel this is often seen with the drivers calling
> >gpio_request() to "reserve" said pins, but all you should do is
> >described the desired configuration and muxing in the pinctrl node,
> >reference that from your driver and simply ignore the fact that those
> >pins could have been used as GPIO pins.
> 
> That makes sense, but I think the driver already does that.
> 
> https://patchwork.ozlabs.org/patch/561667/
> 
> Function emac_probe_resources() has a call to of_get_named_gpio().  And then
> emac_mac_up() calls gpio_request().  As far as I can tell, that's it.
> 
> I'm guessing that the of_get_named_gpio() call needs to be changed somehow,
> but I'm not sure how.
> 

Thanks for the link.

In short those call to the gpio framework should just be removed. They
should only be there if you're using the gpiolib to control the state of
those pins, and you're not as far as I can see.


The general outline of what you should have in your dts instead is:

soc {
	tlmm {
		compatible = "qcom,pinctrl-xyz";

		mdio_pins_a: mdio {
			state {
				pins = "gpio0", "gpio1";
				function = "mdio";
			};
		};
	};

	emac {
		compatible = "qcom,somthing-emac";

		pinctrl-names = "default";
		pinctrl-0 = <&mdio_pins_a>;
	};
};

Regards,
Bjorn

^ permalink raw reply

* Re: [PATCH net-next v2] vxlan: synchronously and race-free destruction of vxlan sockets
From: Cong Wang @ 2016-04-08 23:24 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Linux Kernel Network Developers, Eric Dumazet, Jiri Benc,
	Marcelo Ricardo Leitner
In-Reply-To: <1460148901-23740-1-git-send-email-hannes@stressinduktion.org>

On Fri, Apr 8, 2016 at 1:55 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Due to the fact that the udp socket is destructed asynchronously in a
> work queue, we have some nondeterministic behavior during shutdown of
> vxlan tunnels and creating new ones. Fix this by keeping the destruction
> process synchronous in regards to the user space process so IFF_UP can
> be reliably set.
>
> udp_tunnel_sock_release destroys vs->sock->sk if reference counter
> indicates so. We expect to have the same lifetime of vxlan_sock and
> vxlan_sock->sock->sk even in fast paths with only rcu locks held. So
> only destruct the whole socket after we can be sure it cannot be found
> by searching vxlan_net->sock_list.
>

I am wondering what is the reason why we used work queue from
the beginning?

^ permalink raw reply

* Re: [PATCH V3] net: emac: emac gigabit ethernet controller driver
From: Timur Tabi @ 2016-04-08 23:01 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: Andrew Lunn, Rob Herring, Gilad Avidov, netdev,
	linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
	linux-arm-msm, Sagar Dharia, shankerd, Greg Kroah-Hartman,
	vikrams, Christopher Covington
In-Reply-To: <CAOCOHw5R=PnkafRqx2eGKKm10eLgD4H3u7TjbbhdZMbqVR7LrA@mail.gmail.com>

Bjorn Andersson wrote:

> It sounds like you're trying to say that the pins used can be are
> muxed as GPIO or MDIO, in the TLMM.

I'm not 100% sure, but I think that's correct.  If you don't want to 
have normal networking, you could connect those external pins to some 
GPIO device (like an LED or whatever), and then configure the pin muxing 
for GPIO purposes.  But if that's true, it's only true on the FSM9900. 
On the QDF2432, those lines are not connected to the TLMM.  They are 
instead hard-wired to the Emac.

> In the downstream kernel this is often seen with the drivers calling
> gpio_request() to "reserve" said pins, but all you should do is
> described the desired configuration and muxing in the pinctrl node,
> reference that from your driver and simply ignore the fact that those
> pins could have been used as GPIO pins.

That makes sense, but I think the driver already does that.

https://patchwork.ozlabs.org/patch/561667/

Function emac_probe_resources() has a call to of_get_named_gpio().  And 
then emac_mac_up() calls gpio_request().  As far as I can tell, that's it.

I'm guessing that the of_get_named_gpio() call needs to be changed 
somehow, but I'm not sure how.

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation collaborative project.

^ permalink raw reply

* Re: [PATCH V3] net: emac: emac gigabit ethernet controller driver
From: Bjorn Andersson @ 2016-04-08 22:43 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Andrew Lunn, Rob Herring, Gilad Avidov, netdev,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-arm-msm,
	Sagar Dharia, shankerd-sgV2jX0FEOL9JmXXK+q4OQ, Greg Kroah-Hartman,
	vikrams-sgV2jX0FEOL9JmXXK+q4OQ, Christopher Covington
In-Reply-To: <5708013F.90207-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>

On Fri, Apr 8, 2016 at 12:06 PM, Timur Tabi <timur-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org> wrote:
> Andrew Lunn wrote:
>
>> There are two different things here. One is configuring the pin to be
>> a GPIO. The second is using the GPIO as a GPIO. In this case,
>> bit-banging the MDIO bus.
>>
>> The firmware could be doing the configuration, setting the pin as a
>> GPIO. However, the firmware cannot be doing the MDIO bit-banging to
>> make an MDIO bus available. Linux has to do that.
>>
>> Or it could be we have all completely misunderstood the hardware, and
>> we are not doing bit-banging GPIO MDIO. There is a real MDIO
>> controller there, we don't use these pins as GPIOs, etc....
>
>
> Actually, I think there is a misunderstanding.
>
> On the FSM9900 SOC (which uses device-tree), the two pins that connect to
> the external PHY are gpio pins.  However, the driver needs to reprogram the
> pinmux so that those pins are wired to the Emac controller.  That's what the
> the gpio code in this driver is doing: it's just configuring the pins so
> that they connect directly between the Emac and the external PHY.  After
> that, they are no longer GPIO pins, and you cannot use the "GPIO controlled
> MDIO bus".  There is no MDIO controller on the SOC.  The external PHY is
> controlled directly from the Emac and also from the internal PHY.  It is
> screwy, I know, but that's what Gilad was trying to explain.
>

It sounds like you're trying to say that the pins used can be are
muxed as GPIO or MDIO, in the TLMM.

In the downstream kernel this is often seen with the drivers calling
gpio_request() to "reserve" said pins, but all you should do is
described the desired configuration and muxing in the pinctrl node,
reference that from your driver and simply ignore the fact that those
pins could have been used as GPIO pins.

Regards,
Bjorn
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH iproute2 -master 2/3] tc, bpf: further improve error reporting
From: Daniel Borkmann @ 2016-04-08 22:32 UTC (permalink / raw)
  To: stephen; +Cc: netdev, alexei.starovoitov, Daniel Borkmann
In-Reply-To: <cover.1460153854.git.daniel@iogearbox.net>

Make it easier to spot issues when loading the object file fails. This
includes reporting in what pinned object specs differ, better indication
when we've reached instruction limits. Don't retry to load a non relo
program once we failed with bpf(2), and report out of bounds tail call key.

Also, add truncation of huge log outputs by default. Sometimes errors are
quite easy to spot by only looking at the tail of the verifier log, but
logs can get huge in size e.g. up to few MB (due to verifier checking all
possible program paths). Thus, by default limit output to the last 4096
bytes and indicate that it's truncated. For the full log, the verbose option
can be used.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tc/tc_bpf.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 tc/tc_bpf.h |  4 +++
 2 files changed, 69 insertions(+), 17 deletions(-)

diff --git a/tc/tc_bpf.c b/tc/tc_bpf.c
index d94af82..0c59427 100644
--- a/tc/tc_bpf.c
+++ b/tc/tc_bpf.c
@@ -184,7 +184,7 @@ static int bpf_ops_parse(int argc, char **argv, struct sock_filter *bpf_ops,
 	}
 
 	if (i != bpf_len) {
-		fprintf(stderr, "Parsed program length is less than encodedlength parameter!\n");
+		fprintf(stderr, "Parsed program length is less than encoded length parameter!\n");
 		ret = -EINVAL;
 		goto out;
 	}
@@ -214,6 +214,27 @@ void bpf_print_ops(FILE *f, struct rtattr *bpf_ops, __u16 len)
 		ops[i].jf, ops[i].k);
 }
 
+static void bpf_map_pin_report(const struct bpf_elf_map *pin,
+			       const struct bpf_elf_map *obj)
+{
+	fprintf(stderr, "Map specification differs from pinned file!\n");
+
+	if (obj->type != pin->type)
+		fprintf(stderr, " - Type:         %u (obj) != %u (pin)\n",
+			obj->type, pin->type);
+	if (obj->size_key != pin->size_key)
+		fprintf(stderr, " - Size key:     %u (obj) != %u (pin)\n",
+			obj->size_key, pin->size_key);
+	if (obj->size_value != pin->size_value)
+		fprintf(stderr, " - Size value:   %u (obj) != %u (pin)\n",
+			obj->size_value, pin->size_value);
+	if (obj->max_elem != pin->max_elem)
+		fprintf(stderr, " - Max elems:    %u (obj) != %u (pin)\n",
+			obj->max_elem, pin->max_elem);
+
+	fprintf(stderr, "\n");
+}
+
 static int bpf_map_selfcheck_pinned(int fd, const struct bpf_elf_map *map,
 				    int length)
 {
@@ -256,7 +277,7 @@ static int bpf_map_selfcheck_pinned(int fd, const struct bpf_elf_map *map,
 		if (!memcmp(&tmp, &zero, length))
 			return 0;
 
-		fprintf(stderr, "Map specs from pinned file differ!\n");
+		bpf_map_pin_report(&tmp, map);
 		return -EINVAL;
 	}
 }
@@ -735,7 +756,19 @@ bpf_dump_error(struct bpf_elf_ctx *ctx, const char *format, ...)
 	va_end(vl);
 
 	if (ctx->log && ctx->log[0]) {
-		fprintf(stderr, "%s\n", ctx->log);
+		if (ctx->verbose) {
+			fprintf(stderr, "%s\n", ctx->log);
+		} else {
+			unsigned int off = 0, len = strlen(ctx->log);
+
+			if (len > BPF_MAX_LOG) {
+				off = len - BPF_MAX_LOG;
+				fprintf(stderr, "Skipped %u bytes, use \'verb\' option for the full verbose log.\n[...]\n",
+					off);
+			}
+			fprintf(stderr, "%s\n", ctx->log + off);
+		}
+
 		memset(ctx->log, 0, ctx->log_size);
 	}
 }
@@ -1055,14 +1088,16 @@ static void bpf_prog_report(int fd, const char *section,
 			    const struct bpf_elf_prog *prog,
 			    struct bpf_elf_ctx *ctx)
 {
-	fprintf(stderr, "Prog section \'%s\' %s%s (%d)!\n", section,
+	unsigned int insns = prog->size / sizeof(struct bpf_insn);
+
+	fprintf(stderr, "\nProg section \'%s\' %s%s (%d)!\n", section,
 		fd < 0 ? "rejected: " : "loaded",
 		fd < 0 ? strerror(errno) : "",
 		fd < 0 ? errno : fd);
 
 	fprintf(stderr, " - Type:         %u\n", prog->type);
-	fprintf(stderr, " - Instructions: %zu\n",
-		prog->size / sizeof(struct bpf_insn));
+	fprintf(stderr, " - Instructions: %u (%u over limit)\n",
+		insns, insns > BPF_MAXINSNS ? insns - BPF_MAXINSNS : 0);
 	fprintf(stderr, " - License:      %s\n\n", prog->license);
 
 	bpf_dump_error(ctx, "Verifier analysis:\n\n");
@@ -1283,6 +1318,11 @@ static int bpf_fetch_strtab(struct bpf_elf_ctx *ctx, int section,
 	return 0;
 }
 
+static bool bpf_has_map_data(const struct bpf_elf_ctx *ctx)
+{
+	return ctx->sym_tab && ctx->str_tab && ctx->sec_maps;
+}
+
 static int bpf_fetch_ancillary(struct bpf_elf_ctx *ctx)
 {
 	struct bpf_elf_sec_data data;
@@ -1306,13 +1346,13 @@ static int bpf_fetch_ancillary(struct bpf_elf_ctx *ctx)
 			 !strcmp(data.sec_name, ".strtab"))
 			ret = bpf_fetch_strtab(ctx, i, &data);
 		if (ret < 0) {
-			fprintf(stderr, "Error parsing section %d! Perhapscheck with readelf -a?\n",
+			fprintf(stderr, "Error parsing section %d! Perhaps check with readelf -a?\n",
 				i);
 			break;
 		}
 	}
 
-	if (ctx->sym_tab && ctx->str_tab && ctx->sec_maps) {
+	if (bpf_has_map_data(ctx)) {
 		ret = bpf_maps_attach_all(ctx);
 		if (ret < 0) {
 			fprintf(stderr, "Error loading maps into kernel!\n");
@@ -1348,7 +1388,7 @@ static int bpf_fetch_prog(struct bpf_elf_ctx *ctx, const char *section)
 
 		fd = bpf_prog_attach(section, &prog, ctx);
 		if (fd < 0)
-			continue;
+			break;
 
 		ctx->sec_done[i] = true;
 		break;
@@ -1412,7 +1452,8 @@ static int bpf_apply_relo_data(struct bpf_elf_ctx *ctx,
 	return 0;
 }
 
-static int bpf_fetch_prog_relo(struct bpf_elf_ctx *ctx, const char *section)
+static int bpf_fetch_prog_relo(struct bpf_elf_ctx *ctx, const char *section,
+			       bool *lderr)
 {
 	struct bpf_elf_sec_data data_relo, data_insn;
 	struct bpf_elf_prog prog;
@@ -1442,8 +1483,10 @@ static int bpf_fetch_prog_relo(struct bpf_elf_ctx *ctx, const char *section)
 		prog.license = ctx->license;
 
 		fd = bpf_prog_attach(section, &prog, ctx);
-		if (fd < 0)
-			continue;
+		if (fd < 0) {
+			*lderr = true;
+			break;
+		}
 
 		ctx->sec_done[i]   = true;
 		ctx->sec_done[idx] = true;
@@ -1455,11 +1498,12 @@ static int bpf_fetch_prog_relo(struct bpf_elf_ctx *ctx, const char *section)
 
 static int bpf_fetch_prog_sec(struct bpf_elf_ctx *ctx, const char *section)
 {
+	bool lderr = false;
 	int ret = -1;
 
-	if (ctx->sym_tab)
-		ret = bpf_fetch_prog_relo(ctx, section);
-	if (ret < 0)
+	if (bpf_has_map_data(ctx))
+		ret = bpf_fetch_prog_relo(ctx, section, &lderr);
+	if (ret < 0 && !lderr)
 		ret = bpf_fetch_prog(ctx, section);
 
 	return ret;
@@ -1504,8 +1548,12 @@ static int bpf_fill_prog_arrays(struct bpf_elf_ctx *ctx)
 
 		ret = bpf_map_update(ctx->map_fds[idx], &key_id,
 				     &fd, BPF_ANY);
-		if (ret < 0)
-			return -ENOENT;
+		if (ret < 0) {
+			if (errno == E2BIG)
+				fprintf(stderr, "Tail call key %u for map %u out of bounds?\n",
+					key_id, map_id);
+			return -errno;
+		}
 
 		ctx->sec_done[i] = true;
 	}
diff --git a/tc/tc_bpf.h b/tc/tc_bpf.h
index 93f7f0e..30306de 100644
--- a/tc/tc_bpf.h
+++ b/tc/tc_bpf.h
@@ -33,6 +33,10 @@ enum {
 #define BPF_ENV_UDS	"TC_BPF_UDS"
 #define BPF_ENV_MNT	"TC_BPF_MNT"
 
+#ifndef BPF_MAX_LOG
+# define BPF_MAX_LOG	4096
+#endif
+
 #ifndef BPF_FS_MAGIC
 # define BPF_FS_MAGIC	0xcafe4a11
 #endif
-- 
1.9.3

^ permalink raw reply related

* [PATCH iproute2 -master 3/3] tc, bpf: add support for map pre/allocation
From: Daniel Borkmann @ 2016-04-08 22:32 UTC (permalink / raw)
  To: stephen; +Cc: netdev, alexei.starovoitov, Daniel Borkmann
In-Reply-To: <cover.1460153854.git.daniel@iogearbox.net>

Follow-up to kernel commit 6c9059817432 ("bpf: pre-allocate hash map
elements"). Add flags support, so that we can pass in BPF_F_NO_PREALLOC
flag for disallowing preallocation. Update examples accordingly and also
remove the BPF_* map helper macros from them as they were not very useful.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 examples/bpf/bpf_cyclic.c   |  9 ++++++++-
 examples/bpf/bpf_graft.c    |  8 +++++++-
 examples/bpf/bpf_prog.c     |  2 ++
 examples/bpf/bpf_shared.c   |  8 +++++++-
 examples/bpf/bpf_tailcall.c | 29 +++++++++++++++++++++++++----
 include/bpf_api.h           | 45 ---------------------------------------------
 include/bpf_elf.h           |  1 +
 tc/tc_bpf.c                 | 16 ++++++++++++----
 8 files changed, 62 insertions(+), 56 deletions(-)

diff --git a/examples/bpf/bpf_cyclic.c b/examples/bpf/bpf_cyclic.c
index 36745a3..11d1c06 100644
--- a/examples/bpf/bpf_cyclic.c
+++ b/examples/bpf/bpf_cyclic.c
@@ -6,7 +6,14 @@
  */
 #define JMP_MAP_ID	0xabccba
 
-BPF_PROG_ARRAY(jmp_tc, JMP_MAP_ID, PIN_OBJECT_NS, 1);
+struct bpf_elf_map __section_maps jmp_tc = {
+	.type		= BPF_MAP_TYPE_PROG_ARRAY,
+	.id		= JMP_MAP_ID,
+	.size_key	= sizeof(uint32_t),
+	.size_value	= sizeof(uint32_t),
+	.pinning	= PIN_OBJECT_NS,
+	.max_elem	= 1,
+};
 
 __section_tail(JMP_MAP_ID, 0)
 int cls_loop(struct __sk_buff *skb)
diff --git a/examples/bpf/bpf_graft.c b/examples/bpf/bpf_graft.c
index 20784ff..07113d4 100644
--- a/examples/bpf/bpf_graft.c
+++ b/examples/bpf/bpf_graft.c
@@ -33,7 +33,13 @@
  *   [...]
  */
 
-BPF_PROG_ARRAY(jmp_tc, 0, PIN_GLOBAL_NS, 1);
+struct bpf_elf_map __section_maps jmp_tc = {
+	.type		= BPF_MAP_TYPE_PROG_ARRAY,
+	.size_key	= sizeof(uint32_t),
+	.size_value	= sizeof(uint32_t),
+	.pinning	= PIN_GLOBAL_NS,
+	.max_elem	= 1,
+};
 
 __section("aaa")
 int cls_aaa(struct __sk_buff *skb)
diff --git a/examples/bpf/bpf_prog.c b/examples/bpf/bpf_prog.c
index f15e876..d6caf37 100644
--- a/examples/bpf/bpf_prog.c
+++ b/examples/bpf/bpf_prog.c
@@ -192,6 +192,7 @@ struct bpf_elf_map __section("maps") map_proto = {
 	.size_key	=	sizeof(uint8_t),
 	.size_value	=	sizeof(struct count_tuple),
 	.max_elem	=	256,
+	.flags		=	BPF_F_NO_PREALLOC,
 };
 
 struct bpf_elf_map __section("maps") map_queue = {
@@ -200,6 +201,7 @@ struct bpf_elf_map __section("maps") map_queue = {
 	.size_key	=	sizeof(uint32_t),
 	.size_value	=	sizeof(struct count_queue),
 	.max_elem	=	1024,
+	.flags		=	BPF_F_NO_PREALLOC,
 };
 
 struct bpf_elf_map __section("maps") map_drops = {
diff --git a/examples/bpf/bpf_shared.c b/examples/bpf/bpf_shared.c
index 7fe9ef3..21fe6f1 100644
--- a/examples/bpf/bpf_shared.c
+++ b/examples/bpf/bpf_shared.c
@@ -18,7 +18,13 @@
  * instance is being created.
  */
 
-BPF_ARRAY4(map_sh, 0, PIN_OBJECT_NS, 1); /* or PIN_GLOBAL_NS, or PIN_NONE */
+struct bpf_elf_map __section_maps map_sh = {
+	.type		= BPF_MAP_TYPE_ARRAY,
+	.size_key	= sizeof(uint32_t),
+	.size_value	= sizeof(uint32_t),
+	.pinning	= PIN_OBJECT_NS, /* or PIN_GLOBAL_NS, or PIN_NONE */
+	.max_elem	= 1,
+};
 
 __section("egress")
 int emain(struct __sk_buff *skb)
diff --git a/examples/bpf/bpf_tailcall.c b/examples/bpf/bpf_tailcall.c
index f545430..1a30426 100644
--- a/examples/bpf/bpf_tailcall.c
+++ b/examples/bpf/bpf_tailcall.c
@@ -26,10 +26,31 @@
  * classifier behaviour.
  */
 
-BPF_PROG_ARRAY(jmp_tc, FOO, PIN_OBJECT_NS, MAX_JMP_SIZE);
-BPF_PROG_ARRAY(jmp_ex, BAR, PIN_OBJECT_NS, 1);
-
-BPF_ARRAY4(map_sh, 0, PIN_OBJECT_NS, 1);
+struct bpf_elf_map __section_maps jmp_tc = {
+	.type		= BPF_MAP_TYPE_PROG_ARRAY,
+	.id		= FOO,
+	.size_key	= sizeof(uint32_t),
+	.size_value	= sizeof(uint32_t),
+	.pinning	= PIN_OBJECT_NS,
+	.max_elem	= MAX_JMP_SIZE,
+};
+
+struct bpf_elf_map __section_maps jmp_ex = {
+	.type		= BPF_MAP_TYPE_PROG_ARRAY,
+	.id		= BAR,
+	.size_key	= sizeof(uint32_t),
+	.size_value	= sizeof(uint32_t),
+	.pinning	= PIN_OBJECT_NS,
+	.max_elem	= 1,
+};
+
+struct bpf_elf_map __section_maps map_sh = {
+	.type		= BPF_MAP_TYPE_ARRAY,
+	.size_key	= sizeof(uint32_t),
+	.size_value	= sizeof(uint32_t),
+	.pinning	= PIN_OBJECT_NS,
+	.max_elem	= 1,
+};
 
 __section_tail(FOO, ENTRY_0)
 int cls_case1(struct __sk_buff *skb)
diff --git a/include/bpf_api.h b/include/bpf_api.h
index 0f278f0..1b250d2 100644
--- a/include/bpf_api.h
+++ b/include/bpf_api.h
@@ -99,51 +99,6 @@
 	char ____license[] __section_license = NAME
 #endif
 
-#ifndef __BPF_MAP
-# define __BPF_MAP(NAME, TYPE, ID, SIZE_KEY, SIZE_VALUE, PIN, MAX_ELEM)	\
-	struct bpf_elf_map __section_maps NAME = {			\
-		.type		= (TYPE),				\
-		.id		= (ID),					\
-		.size_key	= (SIZE_KEY),				\
-		.size_value	= (SIZE_VALUE),				\
-		.pinning	= (PIN),				\
-		.max_elem	= (MAX_ELEM),				\
-	}
-#endif
-
-#ifndef BPF_HASH
-# define BPF_HASH(NAME, ID, SIZE_KEY, SIZE_VALUE, PIN, MAX_ELEM)	\
-	__BPF_MAP(NAME, BPF_MAP_TYPE_HASH, ID, SIZE_KEY, SIZE_VALUE,	\
-		  PIN, MAX_ELEM)
-#endif
-
-#ifndef BPF_ARRAY
-# define BPF_ARRAY(NAME, ID, SIZE_VALUE, PIN, MAX_ELEM)			\
-	__BPF_MAP(NAME, BPF_MAP_TYPE_ARRAY, ID, sizeof(uint32_t), 	\
-		  SIZE_VALUE, PIN, MAX_ELEM)
-#endif
-
-#ifndef BPF_ARRAY2
-# define BPF_ARRAY2(NAME, ID, PIN, MAX_ELEM)				\
-	BPF_ARRAY(NAME, ID, sizeof(uint16_t), PIN, MAX_ELEM)
-#endif
-
-#ifndef BPF_ARRAY4
-# define BPF_ARRAY4(NAME, ID, PIN, MAX_ELEM)				\
-	BPF_ARRAY(NAME, ID, sizeof(uint32_t), PIN, MAX_ELEM)
-#endif
-
-#ifndef BPF_ARRAY8
-# define BPF_ARRAY8(NAME, ID, PIN, MAX_ELEM)				\
-	BPF_ARRAY(NAME, ID, sizeof(uint64_t), PIN, MAX_ELEM)
-#endif
-
-#ifndef BPF_PROG_ARRAY
-# define BPF_PROG_ARRAY(NAME, ID, PIN, MAX_ELEM)			\
-	__BPF_MAP(NAME, BPF_MAP_TYPE_PROG_ARRAY, ID, sizeof(uint32_t),	\
-		  sizeof(uint32_t), PIN, MAX_ELEM)
-#endif
-
 /** Classifier helper */
 
 #ifndef BPF_H_DEFAULT
diff --git a/include/bpf_elf.h b/include/bpf_elf.h
index 31a8974..36cc988 100644
--- a/include/bpf_elf.h
+++ b/include/bpf_elf.h
@@ -32,6 +32,7 @@ struct bpf_elf_map {
 	__u32 size_key;
 	__u32 size_value;
 	__u32 max_elem;
+	__u32 flags;
 	__u32 id;
 	__u32 pinning;
 };
diff --git a/tc/tc_bpf.c b/tc/tc_bpf.c
index 0c59427..fe927ac 100644
--- a/tc/tc_bpf.c
+++ b/tc/tc_bpf.c
@@ -231,6 +231,9 @@ static void bpf_map_pin_report(const struct bpf_elf_map *pin,
 	if (obj->max_elem != pin->max_elem)
 		fprintf(stderr, " - Max elems:    %u (obj) != %u (pin)\n",
 			obj->max_elem, pin->max_elem);
+	if (obj->flags != pin->flags)
+		fprintf(stderr, " - Flags:        %#x (obj) != %#x (pin)\n",
+			obj->flags, pin->flags);
 
 	fprintf(stderr, "\n");
 }
@@ -261,6 +264,8 @@ static int bpf_map_selfcheck_pinned(int fd, const struct bpf_elf_map *map,
 			tmp.size_value = val;
 		else if (sscanf(buff, "max_entries:\t%u", &val) == 1)
 			tmp.max_elem = val;
+		else if (sscanf(buff, "map_flags:\t%i", &val) == 1)
+			tmp.flags = val;
 	}
 
 	fclose(fp);
@@ -796,8 +801,9 @@ static int bpf_log_realloc(struct bpf_elf_ctx *ctx)
 	return 0;
 }
 
-static int bpf_map_create(enum bpf_map_type type, unsigned int size_key,
-			  unsigned int size_value, unsigned int max_elem)
+static int bpf_map_create(enum bpf_map_type type, uint32_t size_key,
+			  uint32_t size_value, uint32_t max_elem,
+			  uint32_t flags)
 {
 	union bpf_attr attr;
 
@@ -806,6 +812,7 @@ static int bpf_map_create(enum bpf_map_type type, unsigned int size_key,
 	attr.key_size = size_key;
 	attr.value_size = size_value;
 	attr.max_entries = max_elem;
+	attr.map_flags = flags;
 
 	return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 }
@@ -1147,7 +1154,8 @@ static void bpf_map_report(int fd, const char *name,
 	fprintf(stderr, " - Pinning:      %u\n", map->pinning);
 	fprintf(stderr, " - Size key:     %u\n", map->size_key);
 	fprintf(stderr, " - Size value:   %u\n", map->size_value);
-	fprintf(stderr, " - Max elems:    %u\n\n", map->max_elem);
+	fprintf(stderr, " - Max elems:    %u\n", map->max_elem);
+	fprintf(stderr, " - Flags:        %#x\n\n", map->flags);
 }
 
 static int bpf_map_attach(const char *name, const struct bpf_elf_map *map,
@@ -1174,7 +1182,7 @@ static int bpf_map_attach(const char *name, const struct bpf_elf_map *map,
 
 	errno = 0;
 	fd = bpf_map_create(map->type, map->size_key, map->size_value,
-			    map->max_elem);
+			    map->max_elem, map->flags);
 	if (fd < 0 || ctx->verbose) {
 		bpf_map_report(fd, name, map, ctx);
 		if (fd < 0)
-- 
1.9.3

^ permalink raw reply related

* [PATCH iproute2 -master 1/3] tc, bpf: add new csum and tunnel signatures
From: Daniel Borkmann @ 2016-04-08 22:32 UTC (permalink / raw)
  To: stephen; +Cc: netdev, alexei.starovoitov, Daniel Borkmann
In-Reply-To: <cover.1460153854.git.daniel@iogearbox.net>

Add new signatures for BPF_FUNC_csum_diff, BPF_FUNC_skb_get_tunnel_opt
and BPF_FUNC_skb_set_tunnel_opt.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/bpf_api.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/bpf_api.h b/include/bpf_api.h
index 4b16d25..0f278f0 100644
--- a/include/bpf_api.h
+++ b/include/bpf_api.h
@@ -212,6 +212,8 @@ static int BPF_FUNC(l3_csum_replace, struct __sk_buff *skb, uint32_t off,
 		    uint32_t from, uint32_t to, uint32_t flags);
 static int BPF_FUNC(l4_csum_replace, struct __sk_buff *skb, uint32_t off,
 		    uint32_t from, uint32_t to, uint32_t flags);
+static int BPF_FUNC(csum_diff, const void *from, uint32_t from_size,
+		    const void *to, uint32_t to_size, uint32_t seed);
 
 /* Packet vlan encap/decap */
 static int BPF_FUNC(skb_vlan_push, struct __sk_buff *skb, uint16_t proto,
@@ -225,6 +227,11 @@ static int BPF_FUNC(skb_set_tunnel_key, struct __sk_buff *skb,
 		    const struct bpf_tunnel_key *from, uint32_t size,
 		    uint32_t flags);
 
+static int BPF_FUNC(skb_get_tunnel_opt, struct __sk_buff *skb,
+		    void *to, uint32_t size);
+static int BPF_FUNC(skb_set_tunnel_opt, struct __sk_buff *skb,
+		    const void *from, uint32_t size);
+
 /** LLVM built-ins, mem*() routines work for constant size */
 
 #ifndef lock_xadd
-- 
1.9.3

^ permalink raw reply related

* [PATCH iproute2 -master 0/3] Minor tc/bpf updates
From: Daniel Borkmann @ 2016-04-08 22:32 UTC (permalink / raw)
  To: stephen; +Cc: netdev, alexei.starovoitov, Daniel Borkmann

Some minor updates to improve error reporting, add signatures
and recently introduced map flags attribute. Set is against
master branch.

Thanks!

Daniel Borkmann (3):
  tc, bpf: add new csum and tunnel signatures
  tc, bpf: further improve error reporting
  tc, bpf: add support for map pre/allocation

 examples/bpf/bpf_cyclic.c   |  9 ++++-
 examples/bpf/bpf_graft.c    |  8 +++-
 examples/bpf/bpf_prog.c     |  2 +
 examples/bpf/bpf_shared.c   |  8 +++-
 examples/bpf/bpf_tailcall.c | 29 ++++++++++++--
 include/bpf_api.h           | 52 ++++--------------------
 include/bpf_elf.h           |  1 +
 tc/tc_bpf.c                 | 98 +++++++++++++++++++++++++++++++++++----------
 tc/tc_bpf.h                 |  4 ++
 9 files changed, 138 insertions(+), 73 deletions(-)

-- 
1.9.3

^ permalink raw reply

* [net-next][PATCH 2/2] RDS: Fix the atomicity for congestion map update
From: Santosh Shilimkar @ 2016-04-08 22:26 UTC (permalink / raw)
  To: netdev, davem; +Cc: linux-kernel, Santosh Shilimkar
In-Reply-To: <1460154400-14546-1-git-send-email-santosh.shilimkar@oracle.com>

Two different threads with different rds sockets may be in
rds_recv_rcvbuf_delta() via receive path. If their ports
both map to the same word in the congestion map, then
using non-atomic ops to update it could cause the map to
be incorrect. Lets use atomics to avoid such an issue.

Full credit to Wengang <wen.gang.wang@oracle.com> for
finding the issue, analysing it and also pointing out
to offending code with spin lock based fix.

Reviewed-by: Leon Romanovsky <leon@leon.nu>
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
---
 net/rds/cong.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/cong.c b/net/rds/cong.c
index e6144b8..6641bcf 100644
--- a/net/rds/cong.c
+++ b/net/rds/cong.c
@@ -299,7 +299,7 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port)
 	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
 	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
 
-	__set_bit_le(off, (void *)map->m_page_addrs[i]);
+	set_bit_le(off, (void *)map->m_page_addrs[i]);
 }
 
 void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
@@ -313,7 +313,7 @@ void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
 	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
 	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
 
-	__clear_bit_le(off, (void *)map->m_page_addrs[i]);
+	clear_bit_le(off, (void *)map->m_page_addrs[i]);
 }
 
 static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port)
-- 
1.9.1

^ permalink raw reply related

* [net-next][PATCH 1/2] RDS: fix endianness for dp_ack_seq
From: Santosh Shilimkar @ 2016-04-08 22:26 UTC (permalink / raw)
  To: netdev, davem; +Cc: linux-kernel, Santosh Shilimkar
In-Reply-To: <1460154400-14546-1-git-send-email-santosh.shilimkar@oracle.com>

From: Qing Huang <qing.huang@oracle.com>

dp->dp_ack_seq is used in big endian format. We need to do the
big endianness conversion when we assign a value in host format
to it.

Signed-off-by: Qing Huang <qing.huang@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
---
 net/rds/ib_cm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 8764970..310cabc 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -194,7 +194,7 @@ static void rds_ib_cm_fill_conn_param(struct rds_connection *conn,
 		dp->dp_protocol_major = RDS_PROTOCOL_MAJOR(protocol_version);
 		dp->dp_protocol_minor = RDS_PROTOCOL_MINOR(protocol_version);
 		dp->dp_protocol_minor_mask = cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS);
-		dp->dp_ack_seq = rds_ib_piggyb_ack(ic);
+		dp->dp_ack_seq = cpu_to_be64(rds_ib_piggyb_ack(ic));
 
 		/* Advertise flow control */
 		if (ic->i_flowctl) {
-- 
1.9.1

^ permalink raw reply related

* [net-next][PATCH 0/2] RDS: couple of fixes for 4.6
From: Santosh Shilimkar @ 2016-04-08 22:26 UTC (permalink / raw)
  To: netdev, davem; +Cc: linux-kernel

Patches are also available at below git tree. 

git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git for_4.6/net-next/rds-fixes

Qing Huang (1):
  RDS: fix endianness for dp_ack_seq

Santosh Shilimkar (1):
  RDS: Fix the atomicity for congestion map update

 net/rds/cong.c  | 4 ++--
 net/rds/ib_cm.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

-- 
1.9.1

^ permalink raw reply

* [PATCH RFT 2/2] macb: kill PHY reset code
From: Sergei Shtylyov @ 2016-04-08 22:25 UTC (permalink / raw)
  To: nicolas.ferre, netdev; +Cc: linux-kernel
In-Reply-To: <81129033.NXiOLTg1so@wasted.cogentembedded.com>

With  the 'phylib' now  being aware of  the "reset-gpios" PHY node property,
there should be no need to frob the PHY reset in this  driver anymore...

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>

---
 drivers/net/ethernet/cadence/macb.c |   17 -----------------
 drivers/net/ethernet/cadence/macb.h |    1 -
 2 files changed, 18 deletions(-)

Index: net-next/drivers/net/ethernet/cadence/macb.c
===================================================================
--- net-next.orig/drivers/net/ethernet/cadence/macb.c
+++ net-next/drivers/net/ethernet/cadence/macb.c
@@ -2884,7 +2884,6 @@ static int macb_probe(struct platform_de
 					      = macb_clk_init;
 	int (*init)(struct platform_device *) = macb_init;
 	struct device_node *np = pdev->dev.of_node;
-	struct device_node *phy_node;
 	const struct macb_config *macb_config = NULL;
 	struct clk *pclk, *hclk = NULL, *tx_clk = NULL;
 	unsigned int queue_mask, num_queues;
@@ -2977,18 +2976,6 @@ static int macb_probe(struct platform_de
 	else
 		macb_get_hwaddr(bp);
 
-	/* Power up the PHY if there is a GPIO reset */
-	phy_node =  of_get_next_available_child(np, NULL);
-	if (phy_node) {
-		int gpio = of_get_named_gpio(phy_node, "reset-gpios", 0);
-
-		if (gpio_is_valid(gpio)) {
-			bp->reset_gpio = gpio_to_desc(gpio);
-			gpiod_direction_output(bp->reset_gpio, 1);
-		}
-	}
-	of_node_put(phy_node);
-
 	err = of_get_phy_mode(np);
 	if (err < 0) {
 		pdata = dev_get_platdata(&pdev->dev);
@@ -3054,10 +3041,6 @@ static int macb_remove(struct platform_d
 		mdiobus_unregister(bp->mii_bus);
 		mdiobus_free(bp->mii_bus);
 
-		/* Shutdown the PHY if there is a GPIO reset */
-		if (bp->reset_gpio)
-			gpiod_set_value(bp->reset_gpio, 0);
-
 		unregister_netdev(dev);
 		clk_disable_unprepare(bp->tx_clk);
 		clk_disable_unprepare(bp->hclk);
Index: net-next/drivers/net/ethernet/cadence/macb.h
===================================================================
--- net-next.orig/drivers/net/ethernet/cadence/macb.h
+++ net-next/drivers/net/ethernet/cadence/macb.h
@@ -832,7 +832,6 @@ struct macb {
 	unsigned int		dma_burst_length;
 
 	phy_interface_t		phy_interface;
-	struct gpio_desc	*reset_gpio;
 
 	/* AT91RM9200 transmit */
 	struct sk_buff *skb;			/* holds skb until xmit interrupt completes */

^ permalink raw reply

* [PATCH RFT 1/2] phylib: add device reset GPIO support
From: Sergei Shtylyov @ 2016-04-08 22:22 UTC (permalink / raw)
  To: grant.likely, robh+dt, devicetree, f.fainelli, netdev,
	frowand.list, pawel.moll, mark.rutland, ijc+devicetree, galak
  Cc: linux-kernel
In-Reply-To: <81129033.NXiOLTg1so@wasted.cogentembedded.com>

The PHY  devices sometimes do have their reset signal (maybe even power
supply?) tied to some GPIO and sometimes it also does happen that a boot
loader does not leave it deasserted. So far this issue has been attacked
from (as I believe) a wrong angle: by teaching the MAC driver to manipulate
the GPIO in question;  that solution, when  applied to the device trees,
led to adding the PHY reset GPIO properties to the MAC device node, with
one exception: Cadence MACB driver which could handle the "reset-gpios"
prop  in a PHY device  subnode.  I believe that the correct approach is to
teach the 'phylib' to get the MDIO device reset GPIO from the device tree
node corresponding to this device -- which this patch is doing...

Note that I had to modify the  AT803x PHY driver as it would stop working
otherwise as it made use of the reset GPIO for its own purposes...

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>

---
 Documentation/devicetree/bindings/net/phy.txt |    2 +
 drivers/net/phy/at803x.c                      |   19 ++------------
 drivers/net/phy/mdio_bus.c                    |    4 +++
 drivers/net/phy/mdio_device.c                 |   27 +++++++++++++++++++--
 drivers/net/phy/phy_device.c                  |   33 ++++++++++++++++++++++++--
 drivers/of/of_mdio.c                          |   16 ++++++++++++
 include/linux/mdio.h                          |    3 ++
 include/linux/phy.h                           |    5 +++
 8 files changed, 89 insertions(+), 20 deletions(-)

Index: net-next/Documentation/devicetree/bindings/net/phy.txt
===================================================================
--- net-next.orig/Documentation/devicetree/bindings/net/phy.txt
+++ net-next/Documentation/devicetree/bindings/net/phy.txt
@@ -35,6 +35,8 @@ Optional Properties:
 - broken-turn-around: If set, indicates the PHY device does not correctly
   release the turn around line low at the end of a MDIO transaction.
 
+- reset-gpios: The GPIO phandle and specifier for the PHY reset signal.
+
 Example:
 
 ethernet-phy@0 {
Index: net-next/drivers/net/phy/at803x.c
===================================================================
--- net-next.orig/drivers/net/phy/at803x.c
+++ net-next/drivers/net/phy/at803x.c
@@ -65,7 +65,6 @@ MODULE_LICENSE("GPL");
 
 struct at803x_priv {
 	bool phy_reset:1;
-	struct gpio_desc *gpiod_reset;
 };
 
 struct at803x_context {
@@ -271,22 +270,10 @@ static int at803x_probe(struct phy_devic
 {
 	struct device *dev = &phydev->mdio.dev;
 	struct at803x_priv *priv;
-	struct gpio_desc *gpiod_reset;
 
 	priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
 	if (!priv)
 		return -ENOMEM;
-
-	if (phydev->drv->phy_id != ATH8030_PHY_ID)
-		goto does_not_require_reset_workaround;
-
-	gpiod_reset = devm_gpiod_get_optional(dev, "reset", GPIOD_OUT_LOW);
-	if (IS_ERR(gpiod_reset))
-		return PTR_ERR(gpiod_reset);
-
-	priv->gpiod_reset = gpiod_reset;
-
-does_not_require_reset_workaround:
 	phydev->priv = priv;
 
 	return 0;
@@ -361,14 +348,14 @@ static void at803x_link_change_notify(st
 	 */
 	if (phydev->drv->phy_id == ATH8030_PHY_ID) {
 		if (phydev->state == PHY_NOLINK) {
-			if (priv->gpiod_reset && !priv->phy_reset) {
+			if (phydev->mdio.reset && !priv->phy_reset) {
 				struct at803x_context context;
 
 				at803x_context_save(phydev, &context);
 
-				gpiod_set_value(priv->gpiod_reset, 1);
+				phy_device_reset(phydev, 1);
 				msleep(1);
-				gpiod_set_value(priv->gpiod_reset, 0);
+				phy_device_reset(phydev, 0);
 				msleep(1);
 
 				at803x_context_restore(phydev, &context);
Index: net-next/drivers/net/phy/mdio_bus.c
===================================================================
--- net-next.orig/drivers/net/phy/mdio_bus.c
+++ net-next/drivers/net/phy/mdio_bus.c
@@ -35,6 +35,7 @@
 #include <linux/phy.h>
 #include <linux/io.h>
 #include <linux/uaccess.h>
+#include <linux/gpio/consumer.h>
 
 #include <asm/irq.h>
 
@@ -371,6 +372,9 @@ void mdiobus_unregister(struct mii_bus *
 		if (!mdiodev)
 			continue;
 
+		if (mdiodev->reset)
+			gpiod_put(mdiodev->reset);
+
 		mdiodev->device_remove(mdiodev);
 		mdiodev->device_free(mdiodev);
 	}
Index: net-next/drivers/net/phy/mdio_device.c
===================================================================
--- net-next.orig/drivers/net/phy/mdio_device.c
+++ net-next/drivers/net/phy/mdio_device.c
@@ -12,6 +12,8 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include <linux/errno.h>
+#include <linux/gpio.h>
+#include <linux/gpio/consumer.h>
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
@@ -103,6 +105,13 @@ void mdio_device_remove(struct mdio_devi
 }
 EXPORT_SYMBOL(mdio_device_remove);
 
+void mdio_device_reset(struct mdio_device *mdiodev, int value)
+{
+	if (mdiodev->reset)
+		gpiod_set_value(mdiodev->reset, value);
+}
+EXPORT_SYMBOL(mdio_device_reset);
+
 /**
  * mdio_probe - probe an MDIO device
  * @dev: device to probe
@@ -117,9 +126,16 @@ static int mdio_probe(struct device *dev
 	struct mdio_driver *mdiodrv = to_mdio_driver(drv);
 	int err = 0;
 
-	if (mdiodrv->probe)
+	if (mdiodrv->probe) {
+		/* Deassert the reset signal */
+		mdio_device_reset(mdiodev, 0);
+
 		err = mdiodrv->probe(mdiodev);
 
+		/* Assert the reset signal */
+		mdio_device_reset(mdiodev, 1);
+	}
+
 	return err;
 }
 
@@ -129,9 +145,16 @@ static int mdio_remove(struct device *de
 	struct device_driver *drv = mdiodev->dev.driver;
 	struct mdio_driver *mdiodrv = to_mdio_driver(drv);
 
-	if (mdiodrv->remove)
+	if (mdiodrv->remove) {
+		/* Deassert the reset signal */
+		mdio_device_reset(mdiodev, 0);
+
 		mdiodrv->remove(mdiodev);
 
+		/* Assert the reset signal */
+		mdio_device_reset(mdiodev, 1);
+	}
+
 	return 0;
 }
 
Index: net-next/drivers/net/phy/phy_device.c
===================================================================
--- net-next.orig/drivers/net/phy/phy_device.c
+++ net-next/drivers/net/phy/phy_device.c
@@ -589,6 +589,9 @@ int phy_device_register(struct phy_devic
 	if (err)
 		return err;
 
+	/* Deassert the reset signal */
+	phy_device_reset(phydev, 0);
+
 	/* Run all of the fixups for this PHY */
 	err = phy_scan_fixups(phydev);
 	if (err) {
@@ -604,9 +607,15 @@ int phy_device_register(struct phy_devic
 		goto out;
 	}
 
+	/* Assert the reset signal */
+	phy_device_reset(phydev, 1);
+
 	return 0;
 
  out:
+	/* Assert the reset signal */
+	phy_device_reset(phydev, 1);
+
 	mdiobus_unregister_device(&phydev->mdio);
 	return err;
 }
@@ -792,6 +801,9 @@ int phy_init_hw(struct phy_device *phyde
 {
 	int ret = 0;
 
+	/* Deassert the reset signal */
+	phy_device_reset(phydev, 0);
+
 	if (!phydev->drv || !phydev->drv->config_init)
 		return 0;
 
@@ -997,6 +1009,9 @@ void phy_detach(struct phy_device *phyde
 
 	put_device(&phydev->mdio.dev);
 	module_put(bus->owner);
+
+	/* Assert the reset signal */
+	phy_device_reset(phydev, 1);
 }
 EXPORT_SYMBOL(phy_detach);
 
@@ -1595,9 +1610,16 @@ static int phy_probe(struct device *dev)
 	/* Set the state to READY by default */
 	phydev->state = PHY_READY;
 
-	if (phydev->drv->probe)
+	if (phydev->drv->probe) {
+		/* Deassert the reset signal */
+		phy_device_reset(phydev, 0);
+
 		err = phydev->drv->probe(phydev);
 
+		/* Assert the reset signal */
+		phy_device_reset(phydev, 1);
+	}
+
 	mutex_unlock(&phydev->lock);
 
 	return err;
@@ -1611,8 +1633,15 @@ static int phy_remove(struct device *dev
 	phydev->state = PHY_DOWN;
 	mutex_unlock(&phydev->lock);
 
-	if (phydev->drv->remove)
+	if (phydev->drv->remove) {
+		/* Deassert the reset signal */
+		phy_device_reset(phydev, 0);
+
 		phydev->drv->remove(phydev);
+
+		/* Assert the reset signal */
+		phy_device_reset(phydev, 1);
+	}
 	phydev->drv = NULL;
 
 	return 0;
Index: net-next/drivers/of/of_mdio.c
===================================================================
--- net-next.orig/drivers/of/of_mdio.c
+++ net-next/drivers/of/of_mdio.c
@@ -44,6 +44,7 @@ static int of_get_phy_id(struct device_n
 static int of_mdiobus_register_phy(struct mii_bus *mdio, struct device_node *child,
 				   u32 addr)
 {
+	struct gpio_desc *gpiod;
 	struct phy_device *phy;
 	bool is_c45;
 	int rc;
@@ -52,10 +53,17 @@ static int of_mdiobus_register_phy(struc
 	is_c45 = of_device_is_compatible(child,
 					 "ethernet-phy-ieee802.3-c45");
 
+	gpiod = fwnode_get_named_gpiod(&child->fwnode, "reset-gpios");
+	/* Deassert the reset signal */
+	if (!IS_ERR(gpiod))
+		gpiod_direction_output(gpiod, 0);
 	if (!is_c45 && !of_get_phy_id(child, &phy_id))
 		phy = phy_device_create(mdio, addr, phy_id, 0, NULL);
 	else
 		phy = get_phy_device(mdio, addr, is_c45);
+	/* Assert the reset signal again */
+	if (!IS_ERR(gpiod))
+		gpiod_set_value(gpiod, 1);
 	if (IS_ERR_OR_NULL(phy))
 		return 1;
 
@@ -75,6 +83,9 @@ static int of_mdiobus_register_phy(struc
 	of_node_get(child);
 	phy->mdio.dev.of_node = child;
 
+	if (!IS_ERR(gpiod))
+		phy->mdio.reset = gpiod;
+
 	/* All data is now stored in the phy struct;
 	 * register it */
 	rc = phy_device_register(phy);
@@ -95,6 +106,7 @@ static int of_mdiobus_register_device(st
 				      u32 addr)
 {
 	struct mdio_device *mdiodev;
+	struct gpio_desc *gpiod;
 	int rc;
 
 	mdiodev = mdio_device_create(mdio, addr);
@@ -107,6 +119,10 @@ static int of_mdiobus_register_device(st
 	of_node_get(child);
 	mdiodev->dev.of_node = child;
 
+	gpiod = fwnode_get_named_gpiod(&child->fwnode, "reset-gpios");
+	if (!IS_ERR(gpiod))
+		mdiodev->reset = gpiod;
+
 	/* All data is now stored in the mdiodev struct; register it. */
 	rc = mdio_device_register(mdiodev);
 	if (rc) {
Index: net-next/include/linux/mdio.h
===================================================================
--- net-next.orig/include/linux/mdio.h
+++ net-next/include/linux/mdio.h
@@ -11,6 +11,7 @@
 
 #include <uapi/linux/mdio.h>
 
+struct gpio_desc;
 struct mii_bus;
 
 struct mdio_device {
@@ -26,6 +27,7 @@ struct mdio_device {
 	/* Bus address of the MDIO device (0-31) */
 	int addr;
 	int flags;
+	struct gpio_desc *reset;
 };
 #define to_mdio_device(d) container_of(d, struct mdio_device, dev)
 
@@ -58,6 +60,7 @@ void mdio_device_free(struct mdio_device
 struct mdio_device *mdio_device_create(struct mii_bus *bus, int addr);
 int mdio_device_register(struct mdio_device *mdiodev);
 void mdio_device_remove(struct mdio_device *mdiodev);
+void mdio_device_reset(struct mdio_device *mdiodev, int value);
 int mdio_driver_register(struct mdio_driver *drv);
 void mdio_driver_unregister(struct mdio_driver *drv);
 
Index: net-next/include/linux/phy.h
===================================================================
--- net-next.orig/include/linux/phy.h
+++ net-next/include/linux/phy.h
@@ -769,6 +769,11 @@ static inline int phy_read_status(struct
 	return phydev->drv->read_status(phydev);
 }
 
+static inline void phy_device_reset(struct phy_device *phydev, int value)
+{
+	mdio_device_reset(&phydev->mdio, value);
+}
+
 #define phydev_err(_phydev, format, args...)	\
 	dev_err(&_phydev->mdio.dev, format, ##args)
 

^ permalink raw reply

* [PATCH RFT 0/2] Teach phylib hard-resetting devices
From: Sergei Shtylyov @ 2016-04-08 22:21 UTC (permalink / raw)
  To: grant.likely, robh+dt, devicetree, f.fainelli, netdev,
	frowand.list, pawel.moll, mark.rutland, ijc+devicetree, galak
  Cc: linux-kernel

Hello.

   Here's the set of 2 patches against DaveM's 'net-next.git' repo. They add to
'phylib' support for resetting devices via GPIO and do some clean up after
doing that...

[1/2] phylib: add device reset GPIO support
[2/2] macb: kill PHY reset code

MBR, Sergei

^ permalink raw reply

* Re: [net-next PATCH 2/5] GSO: Add GSO type for fixed IPv4 ID
From: Alexander Duyck @ 2016-04-08 22:12 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Alexander Duyck, Herbert Xu, Tom Herbert, Eric Dumazet,
	Linux Kernel Network Developers, David Miller
In-Reply-To: <CAEh+42j21N+3ScmR9D9P3esgUL-G0kWxNT1x2zfGi_9M-bPKzw@mail.gmail.com>

On Fri, Apr 8, 2016 at 2:41 PM, Jesse Gross <jesse@kernel.org> wrote:
> On Fri, Apr 8, 2016 at 5:33 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
>> This patch adds support for TSO using IPv4 headers with a fixed IP ID
>> field.  This is meant to allow us to do a lossless GRO in the case of TCP
>> flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
>> headers.
>>
>> In addition I am adding a feature that for now I am referring to TSO with
>> IP ID mangling.  Basically when this flag is enabled the device has the
>> option to either output the flow with incrementing IP IDs or with a fixed
>> IP ID regardless of what the original IP ID ordering was.  This is useful
>> in cases where the DF bit is set and we do not care if the original IP ID
>> value is maintained.
>>
>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>
> I think SKB_GSO_TCP_FIXEDID would also need to be added to the list of
> enumerated OK GSO types for MPLS GSO.

I'll have that fixed for v2.

- Alex

^ permalink raw reply

* Re: [PATCH net-next 1/8] perf: optimize perf_fetch_caller_regs
From: Steven Rostedt @ 2016-04-08 22:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexei Starovoitov, David S . Miller, Ingo Molnar,
	Daniel Borkmann, Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik,
	Brendan Gregg, netdev, linux-kernel, kernel-team
In-Reply-To: <20160405120626.GM3448@twins.programming.kicks-ass.net>

On Tue, 5 Apr 2016 14:06:26 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Apr 04, 2016 at 09:52:47PM -0700, Alexei Starovoitov wrote:
> > avoid memset in perf_fetch_caller_regs, since it's the critical path of all tracepoints.
> > It's called from perf_sw_event_sched, perf_event_task_sched_in and all of perf_trace_##call
> > with this_cpu_ptr(&__perf_regs[..]) which are zero initialized by perpcu_alloc  
> 
> Its not actually allocated; but because its a static uninitialized
> variable we get .bss like behaviour and the initial value is copied to
> all CPUs when the per-cpu allocator thingy bootstraps SMP IIRC.
> 
> > and
> > subsequent call to perf_arch_fetch_caller_regs initializes the same fields on all archs,
> > so we can safely drop memset from all of the above cases and   
> 
> Indeed.
> 
> > move it into
> > perf_ftrace_function_call that calls it with stack allocated pt_regs.  
> 
> Hmm, is there a reason that's still on-stack instead of using the
> per-cpu thing, Steve?

Well, what do you do when you are tracing with regs in an interrupt
that already set the per cpu regs field? We could create our own
per-cpu one as well, but then that would require checking which level
we are in, as we can have one for normal context, one for softirq
context, one for irq context and one for nmi context.

-- Steve



> 
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>  
> 
> In any case,
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply

* Re: [PATCH v4] route: do not cache fib route info on local routes with oif
From: Julian Anastasov @ 2016-04-08 22:08 UTC (permalink / raw)
  To: Chris Friesen; +Cc: netdev
In-Reply-To: <570820DA.2070007@windriver.com>


	Hello,

On Fri, 8 Apr 2016, Chris Friesen wrote:

> For local routes that require a particular output interface we do not want
> to cache the result.  Caching the result causes incorrect behaviour when
> there are multiple source addresses on the interface.  The end result
> being that if the intended recipient is waiting on that interface for the
> packet he won't receive it because it will be delivered on the loopback
> interface and the IP_PKTINFO ipi_ifindex will be set to the loopback
> interface as well.
> 
> This can be tested by running a program such as "dhcp_release" which
> attempts to inject a packet on a particular interface so that it is
> received by another program on the same board.  The receiving process
> should see an IP_PKTINFO ipi_ifndex value of the source interface
> (e.g., eth1) instead of the loopback interface (e.g., lo).  The packet
> will still appear on the loopback interface in tcpdump but the important
> aspect is that the CMSG info is correct.
> 
> Sample dhcp_release command line:
> 
>    dhcp_release eth1 192.168.204.222 02:11:33:22:44:66
> 
> Signed-off-by: Allain Legacy <allain.legacy@windriver.com>
> Signed off-by: Chris Friesen <chris.friesen@windriver.com>

	Looks good to me.

Reviewed-by: Julian Anastasov <ja@ssi.bg>

> ---
>  net/ipv4/route.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 02c6229..b050cf9 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2045,6 +2045,18 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
>  		 */
>  		if (fi && res->prefixlen < 4)
>  			fi = NULL;
> +	} else if ((type == RTN_LOCAL) && (orig_oif != 0) &&
> +		   (orig_oif != dev_out->ifindex)) {
> +		/* For local routes that require a particular output interface
> +		 * we do not want to cache the result.  Caching the result
> +		 * causes incorrect behaviour when there are multiple source
> +		 * addresses on the interface, the end result being that if the
> +		 * intended recipient is waiting on that interface for the
> +		 * packet he won't receive it because it will be delivered on
> +		 * the loopback interface and the IP_PKTINFO ipi_ifindex will
> +		 * be set to the loopback interface as well.
> +		 */
> +		fi = NULL;
>  	}
>  
>  	fnhe = NULL;

Regards

^ permalink raw reply

* Re: [RFC PATCH 07/11] GENEVE: Add option to mangle IP IDs on inner headers when using TSO
From: Alexander Duyck @ 2016-04-08 22:04 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Alexander Duyck, Herbert Xu, Tom Herbert, Eric Dumazet,
	Linux Kernel Network Developers, David Miller
In-Reply-To: <CAEh+42gjqeGK94xvVQFuiUF=c3YKY2SCWOL=y=2gBsHyUxQ+GA@mail.gmail.com>

On Fri, Apr 8, 2016 at 2:40 PM, Jesse Gross <jesse@kernel.org> wrote:
> On Thu, Apr 7, 2016 at 8:52 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> Just a thought.  What if I replaced NETIF_F_TSO_FIXEDID with something
>> that meant we could mange the IP ID like a NETIF_F_TSO_IPID_MANGLE
>> (advice for better name welcome).  Instead of the feature flag meaning
>> we are going to transmit packets with a fixed ID it would mean we
>> don't care about the ID and are free to mangle it as we see fit.  The
>> GSO type can retain the same meaning as far as that requiring the same
>> ID for all, but the feature would mean we will take fixed and convert
>> it to incrementing, or incrementing and convert it to fixed.
>
> I saw the new version of the code that you posted with this idea and
> now that I understand it better, it seems like a reasonable choice to
> me - it's nice that it is consistent with GRO and not tunnel specific.
> It also makes behavior consistent across drivers in regards to
> incrementing IDs in the default case, which was one of my concerns
> from before.
>
> Maybe I missed it but I didn't see any checks for the DF bit being set
> when we transmit a packet with NETIF_F_TSO_MANGLEID. Even if I am
> comfortable mangling my IDs in the DF case, I don't think this would
> ever extend to non-DF packets. In the documentation you noted that it
> is the driver's responsibility to do this check but I couldn't find it
> in either ixgbe or igb. It would also be nice if the core stack could
> enforce it somehow as well rather than each driver.

Yeah I had glossed over that in the igb and ixgbe patches.  A check is
only really needed for the incrementing to non-incrementing case and I
wasn't sure how common it was to have TCP with an IP header that
didn't set the DF bit.  In the case of the outer headers igb and ixgbe
will increment the IP ID always so we don't have to worry about if DF
is set of not there.  For the inner headers I had fudged it a bit and
didn't add the validation.  If needed I can see about adding that
shortly.

- Alex

^ permalink raw reply

* Re: [PATCH V3] net: emac: emac gigabit ethernet controller driver
From: Timur Tabi @ 2016-04-08 21:45 UTC (permalink / raw)
  To: Vikram Sethi, Andrew Lunn
  Cc: Rob Herring, Gilad Avidov, netdev,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-arm-msm,
	Sagar Dharia, shankerd-sgV2jX0FEOL9JmXXK+q4OQ, Greg Kroah-Hartman,
	Christopher Covington
In-Reply-To: <57081D96.902-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>

Vikram Sethi wrote:

>> On the FSM9900 SOC (which uses device-tree), the two pins that connect to the external PHY are gpio pins.  However, the driver needs to reprogram the pinmux so that those pins are wired to the Emac controller.  That's what the the gpio code in this driver is doing: it's just configuring the pins so that they connect directly between the Emac and the external PHY.  After that, they are no longer GPIO pins, and you cannot use the "GPIO controlled MDIO bus".  There is no MDIO controller on the SOC.  The external PHY is controlled directly from the Emac and also from the internal PHY.  It is screwy, I know, but that's what Gilad was trying to explain.
> It is incorrect to say there's no MDIO controller on the SoC. The EMAC Core on the SoC itself has a MDIO controller which talks to the external PHY. The internal SGMII is not on MDIO however.
> Please see the EMAC specification.

Sorry, I should have said that there is no *independent* MDIO controller 
(one that has its own driver).  As you said, you can only talk to the 
external PHY through the Emac.

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation collaborative project.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [net-next PATCH 2/5] GSO: Add GSO type for fixed IPv4 ID
From: Jesse Gross @ 2016-04-08 21:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Herbert Xu, Tom Herbert, Alexander Duyck, Eric Dumazet,
	Linux Kernel Network Developers, David Miller
In-Reply-To: <20160408203318.12838.59720.stgit@ahduyck-xeon-server>

On Fri, Apr 8, 2016 at 5:33 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
> This patch adds support for TSO using IPv4 headers with a fixed IP ID
> field.  This is meant to allow us to do a lossless GRO in the case of TCP
> flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
> headers.
>
> In addition I am adding a feature that for now I am referring to TSO with
> IP ID mangling.  Basically when this flag is enabled the device has the
> option to either output the flow with incrementing IP IDs or with a fixed
> IP ID regardless of what the original IP ID ordering was.  This is useful
> in cases where the DF bit is set and we do not care if the original IP ID
> value is maintained.
>
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>

I think SKB_GSO_TCP_FIXEDID would also need to be added to the list of
enumerated OK GSO types for MPLS GSO.

^ permalink raw reply

* Re: [RFC PATCH 07/11] GENEVE: Add option to mangle IP IDs on inner headers when using TSO
From: Jesse Gross @ 2016-04-08 21:40 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, Herbert Xu, Tom Herbert, Eric Dumazet,
	Linux Kernel Network Developers, David Miller
In-Reply-To: <CAKgT0Ufc8EiqCR5opSFUCE2v3hFCmnS641o+epyi+G92bD+yww@mail.gmail.com>

On Thu, Apr 7, 2016 at 8:52 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> Just a thought.  What if I replaced NETIF_F_TSO_FIXEDID with something
> that meant we could mange the IP ID like a NETIF_F_TSO_IPID_MANGLE
> (advice for better name welcome).  Instead of the feature flag meaning
> we are going to transmit packets with a fixed ID it would mean we
> don't care about the ID and are free to mangle it as we see fit.  The
> GSO type can retain the same meaning as far as that requiring the same
> ID for all, but the feature would mean we will take fixed and convert
> it to incrementing, or incrementing and convert it to fixed.

I saw the new version of the code that you posted with this idea and
now that I understand it better, it seems like a reasonable choice to
me - it's nice that it is consistent with GRO and not tunnel specific.
It also makes behavior consistent across drivers in regards to
incrementing IDs in the default case, which was one of my concerns
from before.

Maybe I missed it but I didn't see any checks for the DF bit being set
when we transmit a packet with NETIF_F_TSO_MANGLEID. Even if I am
comfortable mangling my IDs in the DF case, I don't think this would
ever extend to non-DF packets. In the documentation you noted that it
is the driver's responsibility to do this check but I couldn't find it
in either ixgbe or igb. It would also be nice if the core stack could
enforce it somehow as well rather than each driver.

^ permalink raw reply

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Alexei Starovoitov @ 2016-04-08 21:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Brenden Blanco, davem, netdev, tom, ogerlitz, daniel,
	eric.dumazet, ecree, john.fastabend, tgraf, johannes,
	eranlinuxmellanox, lorenzo, linux-mm
In-Reply-To: <20160408220808.682630d7@redhat.com>

On Fri, Apr 08, 2016 at 10:08:08PM +0200, Jesper Dangaard Brouer wrote:
> On Fri, 8 Apr 2016 10:26:53 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Fri, Apr 08, 2016 at 02:33:40PM +0200, Jesper Dangaard Brouer wrote:
> > > 
> > > On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > >   
> > > > > +/* user return codes for PHYS_DEV prog type */
> > > > > +enum bpf_phys_dev_action {
> > > > > +	BPF_PHYS_DEV_DROP,
> > > > > +	BPF_PHYS_DEV_OK,
> > > > > +};    
> > > > 
> > > > I can imagine these extra return codes:
> > > > 
> > > >  BPF_PHYS_DEV_MODIFIED,   /* Packet page/payload modified */
> > > >  BPF_PHYS_DEV_STOLEN,     /* E.g. forward use-case */
> > > >  BPF_PHYS_DEV_SHARED,     /* Queue for async processing, e.g. tcpdump use-case */
> > > > 
> > > > The "STOLEN" and "SHARED" use-cases require some refcnt manipulations,
> > > > which we can look at when we get that far...  
> > > 
> > > I want to point out something which is quite FUNDAMENTAL, for
> > > understanding these return codes (and network stack).
> > > 
> > > 
> > > At driver RX time, the network stack basically have two ways of
> > > building an SKB, which is send up the stack.
> > > 
> > > Option-A (fastest): The packet page is writable. The SKB can be
> > > allocated and skb->data/head can point directly to the page.  And
> > > we place/write skb_shared_info in the end/tail-room. (This is done by
> > > calling build_skb()).
> > > 
> > > Option-B (slower): The packet page is read-only.  The SKB cannot point
> > > skb->data/head directly to the page, because skb_shared_info need to be
> > > written into skb->end (slightly hidden via skb_shinfo() casting).  To
> > > get around this, a separate piece of memory is allocated (speedup by
> > > __alloc_page_frag) for pointing skb->data/head, so skb_shared_info can
> > > be written. (This is done when calling netdev/napi_alloc_skb()).
> > >   Drivers then need to copy over packet headers, and assign + adjust
> > > skb_shinfo(skb)->frags[0] offset to skip copied headers.
> > > 
> > > 
> > > Unfortunately most drivers use option-B.  Due to cost of calling the
> > > page allocator.  It is only slightly most expensive to get a larger
> > > compound page from the page allocator, which then can be partitioned into
> > > page-fragments, thus amortizing the page alloc cost.  Unfortunately the
> > > cost is added later, when constructing the SKB.
> > >  Another reason for option-B, is that archs with expensive IOMMU
> > > requirements (like PowerPC), don't need to dma_unmap on every packet,
> > > but only on the compound page level.
> > > 
> > > Side-note: Most drivers have a "copy-break" optimization.  Especially
> > > for option-B, when copying header data anyhow. For small packet, one
> > > might as well free (or recycle) the RX page, if header size fits into
> > > the newly allocated memory (for skb_shared_info).  
> > 
> > I think you guys are going into overdesign territory, so
> > . nack on read-only pages
> 
> Unfortunately you cannot just ignore or nack read-only pages. They are
> a fact in the current drivers.
> 
> Most drivers today (at-least the ones we care about) only deliver
> read-only pages.  If you don't accept read-only pages day-1, then you
> first have to rewrite a lot of drivers... and that will stall the
> project!  How will you deal with this fact?
> 
> The early drop filter use-case in this patchset, can ignore read-only
> pages.  But ABI wise we need to deal with the future case where we do
> need/require writeable pages.  A simple need-writable pages in the API
> could help us move forward.

the program should never need to worry about whether dma buffer is
writeable or not. Complicating drivers, api, abi, usability
for the single use case of fast packet drop is not acceptable.
XDP is not going to be a fit for all drivers and all architectures.
That is cruicial 'performance vs generality' aspect of the design.
All kernel-bypasses are taking advantage of specific architecture.
We have to take advantage of it as well. If it doesn't fit
powerpc with iommu, so be it. XDP will return -enotsupp.
That is fundamental point. We have to cut such corners and avoid
all cases where unnecessary generality hurts performance.
Read-only pages is clearly such thing.

> > The whole thing must be dead simple to use. Above is not simple by any means.
> 
> Maybe you missed that the above was a description of how the current
> network stack handles this, which is not simple... which is root of the
> hole performance issue.

Disagree. The stack has copy-break, gro, gso and everything else because
it's serving _host_ use case. XDP is packet forwarder use case.
The requirements are completely different. Ex. the host needs gso
in the core and drivers. It needs to deliver data all the way
to user space and back. That is hard and that's where complexity
comes from. For packet forwarder none of it is needed. So saying,
look we have this complexity, so XDP needs it too, is flawed argument.
The kernel is serving host and applications.
XDP is pure packet-in/packet-out framework to achieve better
performance than kernel-bypass, since kernel is the right
place to do it. It has clean access to interrupts, per-cpu,
scheduler, device registers and so on.
Though there are only two broad use cases packet drop and forward,
they cover a ton of real cases: firewalls, dos prevention,
load balancer, nat, etc. In other words mostly stateless.
As soon as packet needs to be queued somewhere we have to
instantiate skb and pass it to the stack.
So no queues in XDP and no 'stolen' and 'shared' return codes.
The program always runs to completion with single packet.
There is no header vs payload split. There is no header
from program point of view. It's raw bytes in dma buffer.

> I do like the idea of rejecting XDP eBPF programs based on the DMA
> setup is not compatible, or if the driver does not implement e.g.
> writable DMA pages.

exactly.

> Customers wanting this feature will then go buy the NIC which support
> this feature.  There is nothing more motivating for NIC vendors seeing
> customers buying the competitors hardware. And it only require a driver
> change to get this market...

exactly.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH v4] route: do not cache fib route info on local routes with oif
From: Chris Friesen @ 2016-04-08 21:21 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: netdev
In-Reply-To: <alpine.LFD.2.11.1604082328460.2124@ja.home.ssi.bg>

For local routes that require a particular output interface we do not want
to cache the result.  Caching the result causes incorrect behaviour when
there are multiple source addresses on the interface.  The end result
being that if the intended recipient is waiting on that interface for the
packet he won't receive it because it will be delivered on the loopback
interface and the IP_PKTINFO ipi_ifindex will be set to the loopback
interface as well.

This can be tested by running a program such as "dhcp_release" which
attempts to inject a packet on a particular interface so that it is
received by another program on the same board.  The receiving process
should see an IP_PKTINFO ipi_ifndex value of the source interface
(e.g., eth1) instead of the loopback interface (e.g., lo).  The packet
will still appear on the loopback interface in tcpdump but the important
aspect is that the CMSG info is correct.

Sample dhcp_release command line:

   dhcp_release eth1 192.168.204.222 02:11:33:22:44:66

Signed-off-by: Allain Legacy <allain.legacy@windriver.com>
Signed off-by: Chris Friesen <chris.friesen@windriver.com>
---
 net/ipv4/route.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 02c6229..b050cf9 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2045,6 +2045,18 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 		 */
 		if (fi && res->prefixlen < 4)
 			fi = NULL;
+	} else if ((type == RTN_LOCAL) && (orig_oif != 0) &&
+		   (orig_oif != dev_out->ifindex)) {
+		/* For local routes that require a particular output interface
+		 * we do not want to cache the result.  Caching the result
+		 * causes incorrect behaviour when there are multiple source
+		 * addresses on the interface, the end result being that if the
+		 * intended recipient is waiting on that interface for the
+		 * packet he won't receive it because it will be delivered on
+		 * the loopback interface and the IP_PKTINFO ipi_ifindex will
+		 * be set to the loopback interface as well.
+		 */
+		fi = NULL;
 	}

 	fnhe = NULL;

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox