Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
From: Michal Kubecek @ 2017-09-29 13:06 UTC (permalink / raw)
  To: Hangbin Liu; +Cc: netdev, Stephen Hemminger, Phil Sutter, Hangbin Liu
In-Reply-To: <1506605626-1744-3-git-send-email-haliu@redhat.com>

On Thu, Sep 28, 2017 at 09:33:46PM +0800, Hangbin Liu wrote:
> From: Hangbin Liu <liuhangbin@gmail.com>
> 
> This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
> iplink_get()"). After update, we will not need to double the buffer size
> every time when VFs number increased.
> 
> With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
> length parameter.
> 
> With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
> answer to avoid overwrite data in nlh, because it may has more info after
> nlh. also this will avoid nlh buffer not enough issue.
> 
> We need to free answer after using.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> Signed-off-by: Phil Sutter <phil@nwl.cc>
> ---

Reviewed-by: Michal Kubecek <mkubecek@suse.cz>

> diff --git a/ip/link_gre.c b/ip/link_gre.c
> index 9ea2970..35782ca 100644
> --- a/ip/link_gre.c
> +++ b/ip/link_gre.c
> @@ -68,7 +68,6 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
>  	struct {
>  		struct nlmsghdr n;
>  		struct ifinfomsg i;
> -		char buf[16384];
>  	} req = {
>  		.n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi)),
>  		.n.nlmsg_flags = NLM_F_REQUEST,
> @@ -76,6 +75,7 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
>  		.i.ifi_family = preferred_family,
>  		.i.ifi_index = ifi->ifi_index,
>  	};
> +	struct nlmsghdr *answer = NULL;
>  	struct rtattr *tb[IFLA_MAX + 1];
>  	struct rtattr *linkinfo[IFLA_INFO_MAX+1];
>  	struct rtattr *greinfo[IFLA_GRE_MAX + 1];
> @@ -100,19 +100,20 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
>  	__u32 erspan_idx = 0;
>  
>  	if (!(n->nlmsg_flags & NLM_F_CREATE)) {
> -		if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0) {
> +		if (rtnl_talk(&rth, &req.n, &answer) < 0) {
>  get_failed:
>  			fprintf(stderr,
>  				"Failed to get existing tunnel info.\n");
> +			free(answer);
>  			return -1;
>  		}

Took me a moment to realize answer is still NULL if we get here via
failed rtnl_talk() but non-NULL if we get here via "goto get_failed"
later. Nice trick. :-)

Michal Kubecek

^ permalink raw reply

* Re: [net-next PATCH 3/5] bpf: cpumap xdp_buff to skb conversion and allocation
From: Jesper Dangaard Brouer @ 2017-09-29 13:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, mchan, John Fastabend,
	peter.waskiewicz.jr, Daniel Borkmann, Alexei Starovoitov,
	Andy Gospodarek, brouer
In-Reply-To: <1c37f945-0e2f-1eec-fe88-a740815026d3@redhat.com>

On Fri, 29 Sep 2017 17:49:23 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2017年09月28日 20:57, Jesper Dangaard Brouer wrote:
> > +};
> > +
> > +/* Convert xdp_buff to xdp_pkt */
> > +static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
> > +{
> > +	struct xdp_pkt *xdp_pkt;
> > +	int headroom;
> > +
> > +	/* Assure headroom is available for storing info */
> > +	headroom = xdp->data - xdp->data_hard_start;
> > +	if (headroom < sizeof(*xdp_pkt))
> > +		return NULL;  
> 
> Hi Jesper:
> 
> Do you consider this as a trick or a long term solution? Is it better to 
> store XDP in a circular buffer? (I'm asking since I meet similar issue 
> when doing xdp_xmit for tun).

(The way you ask the question is slightly ambiguous, but I hope I understand.)

IMHO the best solution to allow queueing of XDP packets is to create a
meta-data structure, with the needed info.  For performance reasons, we
don't want to allocate a new memory area for this.  Thus, we simply use
the available headroom in the page that the packet is stored into.
Notice that DPDK also use the first cache-line of the packet data, for
its packet meta-data structure. (This is not a performance problem.
I've done several PoC benchmarks, before choosing to do this)

For now, this "trick" is local to the cpumap, and thus not exposed as
any API.  Thus we can evolve and change the contents easily.  But I
would in time, like to see this generalized. When/if more places need
to queue XDP packets, this header meta-data format should be
standardized.

Pipe-dreaming: Taking this to the extreme... if I could get away with
it, I would actually like to store the (232 bytes) SKB meta-data header
inside headroom too.  That would eliminate any real SKB memory alloc.

> > +
> > +	/* Store info in top of packet */
> > +	xdp_pkt = xdp->data_hard_start;
> > +
> > +	xdp_pkt->data = xdp->data;
> > +	xdp_pkt->len  = xdp->data_end - xdp->data;
> > +	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
> > +  
> 
> Is wmb() needed here?

No. This xdp_pkt is queued into a into a ptr_ring, which have a
spin_lock on enqueue, and any atomic operation works as a full memory
barrirer mb().

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net-next 03/10] sctp: factor out stream->in allocation
From: 'Marcelo Ricardo Leitner' @ 2017-09-29 13:05 UTC (permalink / raw)
  To: David Laight
  Cc: netdev@vger.kernel.org, linux-sctp@vger.kernel.org, Neil Horman,
	Vlad Yasevich, Xin Long
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DD0084F8E@AcuExch.aculab.com>

On Fri, Sep 29, 2017 at 10:04:15AM +0000, David Laight wrote:
> From: Marcelo Ricardo Leitner
> > Sent: 28 September 2017 21:25
> > Same idea as previous patch.
> 
> That needs a proper description.

Alright.

  Marcelo

^ permalink raw reply

* Re: [PATCHv4 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough
From: Michal Kubecek @ 2017-09-29 12:55 UTC (permalink / raw)
  To: Hangbin Liu; +Cc: netdev, Stephen Hemminger, Phil Sutter, Hangbin Liu
In-Reply-To: <1506605626-1744-2-git-send-email-haliu@redhat.com>

On Thu, Sep 28, 2017 at 09:33:45PM +0800, Hangbin Liu wrote:
> From: Hangbin Liu <liuhangbin@gmail.com>
> 
> With commit 72b365e8e0fd ("libnetlink: Double the dump buffer size")
> we doubled the buffer size to support more VFs. But the VFs number is
> increasing all the time. Some customers even use more than 200 VFs now.
> 
> We could not double it everytime when the buffer is not enough. Let's just
> not hard code the buffer size and malloc the correct number when running.
> 
> Introduce function rtnl_recvmsg() to always return a newly allocated buffer.
> The caller need to free it after using.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> Signed-off-by: Phil Sutter <phil@nwl.cc>
> ---
>  lib/libnetlink.c | 114 ++++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 80 insertions(+), 34 deletions(-)
> 

Reviewed-by: Michal Kubecek <mkubecek@suse.cz>

^ permalink raw reply

* [PATCH v2 net] net: mvpp2: Fix clock resource by adding an optional bus clock
From: Gregory CLEMENT @ 2017-09-29 12:27 UTC (permalink / raw)
  To: David S. Miller, linux-kernel, netdev
  Cc: Jason Cooper, Andrew Lunn, Sebastian Hesselbarth, Gregory CLEMENT,
	Thomas Petazzoni, linux-arm-kernel, Antoine Tenart,
	Miquèl Raynal, Nadav Haklai, Shadi Ammouri, Yehuda Yitschak,
	Omri Itach, Hanna Hawa, Igal Liberman, Marcin Wojtas

On Armada 7K/8K we need to explicitly enable the bus clock. The bus clock
is optional because not all the SoCs need them but at least for Armada
7K/8K it is actually mandatory.

The binding documentation is updating accordingly.

Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
---
Changelog:
v1 -> v2:
 - manage the -EPROBE_DEFER case
 - fix typos in documentation
 - remove useless test before clk_disable_unprepare()

 Documentation/devicetree/bindings/net/marvell-pp2.txt | 10 ++++++----
 drivers/net/ethernet/marvell/mvpp2.c                  | 15 +++++++++++++++
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/marvell-pp2.txt b/Documentation/devicetree/bindings/net/marvell-pp2.txt
index 7e2dad08a12e..1814fa13f6ab 100644
--- a/Documentation/devicetree/bindings/net/marvell-pp2.txt
+++ b/Documentation/devicetree/bindings/net/marvell-pp2.txt
@@ -21,8 +21,9 @@ Required properties:
 	- main controller clock (for both armada-375-pp2 and armada-7k-pp2)
 	- GOP clock (for both armada-375-pp2 and armada-7k-pp2)
 	- MG clock (only for armada-7k-pp2)
-- clock-names: names of used clocks, must be "pp_clk", "gop_clk" and
-  "mg_clk" (the latter only for armada-7k-pp2).
+	- AXI clock (only for armada-7k-pp2)
+- clock-names: names of used clocks, must be "pp_clk", "gop_clk", "mg_clk"
+  and "axi_clk" (the 2 latter only for armada-7k-pp2).
 
 The ethernet ports are represented by subnodes. At least one port is
 required.
@@ -78,8 +79,9 @@ Example for marvell,armada-7k-pp2:
 cpm_ethernet: ethernet@0 {
 	compatible = "marvell,armada-7k-pp22";
 	reg = <0x0 0x100000>, <0x129000 0xb000>;
-	clocks = <&cpm_syscon0 1 3>, <&cpm_syscon0 1 9>, <&cpm_syscon0 1 5>;
-	clock-names = "pp_clk", "gop_clk", "gp_clk";
+	clocks = <&cpm_syscon0 1 3>, <&cpm_syscon0 1 9>,
+		 <&cpm_syscon0 1 5>, <&cpm_syscon0 1 18>;
+	clock-names = "pp_clk", "gop_clk", "gp_clk", "axi_clk";
 
 	eth0: eth0 {
 		interrupts = <ICU_GRP_NSR 39 IRQ_TYPE_LEVEL_HIGH>,
diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
index dd0ee2691c86..f2656112986b 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -792,6 +792,7 @@ struct mvpp2 {
 	struct clk *pp_clk;
 	struct clk *gop_clk;
 	struct clk *mg_clk;
+	struct clk *axi_clk;
 
 	/* List of pointers to port structures */
 	struct mvpp2_port **port_list;
@@ -7963,6 +7964,18 @@ static int mvpp2_probe(struct platform_device *pdev)
 		err = clk_prepare_enable(priv->mg_clk);
 		if (err < 0)
 			goto err_gop_clk;
+
+		priv->axi_clk = devm_clk_get(&pdev->dev, "axi_clk");
+		if (IS_ERR(priv->axi_clk)) {
+			err = PTR_ERR(priv->axi_clk);
+			if (err == -EPROBE_DEFER)
+				goto err_gop_clk;
+			priv->axi_clk = NULL;
+		} else {
+			err = clk_prepare_enable(priv->axi_clk);
+			if (err < 0)
+				goto err_gop_clk;
+		}
 	}
 
 	/* Get system's tclk rate */
@@ -8015,6 +8028,7 @@ static int mvpp2_probe(struct platform_device *pdev)
 	return 0;
 
 err_mg_clk:
+	clk_disable_unprepare(priv->axi_clk);
 	if (priv->hw_version == MVPP22)
 		clk_disable_unprepare(priv->mg_clk);
 err_gop_clk:
@@ -8052,6 +8066,7 @@ static int mvpp2_remove(struct platform_device *pdev)
 				  aggr_txq->descs_dma);
 	}
 
+	clk_disable_unprepare(priv->axi_clk);
 	clk_disable_unprepare(priv->mg_clk);
 	clk_disable_unprepare(priv->pp_clk);
 	clk_disable_unprepare(priv->gop_clk);
-- 
2.14.1

^ permalink raw reply related

* Re: [PATCH RFC 3/5] Add KSZ8795 switch driver
From: Andrew Lunn @ 2017-09-29 12:12 UTC (permalink / raw)
  To: David Laight
  Cc: Tristram.Ha@microchip.com, muvarov@gmail.com, pavel@ucw.cz,
	nathan.leigh.conrad@gmail.com,
	vivien.didelot@savoirfairelinux.com, f.fainelli@gmail.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Woojung.Huh@microchip.com
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DD0084EDC@AcuExch.aculab.com>

On Fri, Sep 29, 2017 at 09:14:26AM +0000, David Laight wrote:
> From: Andrew Lunn
> > Sent: 28 September 2017 20:34
> ...
> > > There are 34 counters.  In normal case using generic bus I/O or PCI to read them
> > > is very quick, but the switch is mostly accessed using SPI, or even I2C.  As the SPI
> > > access is very slow.
> > 
> > How slow is it? The Marvell switches all use MDIO. It is probably a
> > bit faster than I2C, but it is a lot slower than MMIO or PCI.
> > 
> > ethtool -S lan0 takes about 25ms.
> 
> Is the SPI access software bit-banged?

That will depend on the board design. I've used mdio bit banging, and
that was painfully slow for stats.

But we should primarily think about average hardware. It is going to
have hardware SPI or I2C. If statistics reading with hardware I2C is
reasonable, i would avoid caching, and just ensure other accesses are
permitted between individual statistic reads.

It also requires Microchip also post new code. They have been very
silent for quite a while....

	  Andrew

^ permalink raw reply

* Fwd: [PATCH] net: ethernet: sfc: There is a typo, modify curent to current.
From: Hao Zhang @ 2017-09-29 11:53 UTC (permalink / raw)
  To: netdev, davem
In-Reply-To: <20170929101937.GA8963@arx-kt>

Resend to maintainer.



Regards :)


---------- Forwarded message ----------
From: Hao Zhang <hao5781286@gmail.com>
Date: 2017-09-29 18:19 GMT+08:00
Subject: [PATCH] net: ethernet: sfc: There is a typo, modify curent to current.
To: linux-net-drivers@solarflare.com, ecree@solarflare.com,
bkenward@solarflare.com
抄送： linux-kernel@vger.kernel.org, hao5781286@gmail.com


There is a typo, fix it by modify curent to current.

Signed-off-by: Hao Zhang <hao5781286@gmail.com>
---
 drivers/net/ethernet/sfc/mcdi_pcol.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/sfc/mcdi_pcol.h
b/drivers/net/ethernet/sfc/mcdi_pcol.h
index 91fb54f..8bfc8d4 100644
--- a/drivers/net/ethernet/sfc/mcdi_pcol.h
+++ b/drivers/net/ethernet/sfc/mcdi_pcol.h
@@ -11505,7 +11505,7 @@

 /***********************************/
 /* MC_CMD_GET_SNAPSHOT_LENGTH
- * Obtain the curent range of allowable values for the SNAPSHOT_LENGTH
+ * Obtain the current range of allowable values for the SNAPSHOT_LENGTH
  * parameter to MC_CMD_INIT_RXQ.
  */
 #define MC_CMD_GET_SNAPSHOT_LENGTH 0x101

^ permalink raw reply related

* Re: [PATCH] net: ethernet: sfc: There is a typo, modify curent to current.
From: Hao Zhang @ 2017-09-29 11:51 UTC (permalink / raw)
  To: davem, bkenward, ecree, netdev, linux-net-drivers
In-Reply-To: <20170929101937.GA8963@arx-kt>

Resend to maintainer.



Regards :)

2017-09-29 18:19 GMT+08:00 Hao Zhang <hao5781286@gmail.com>:
> There is a typo, fix it by modify curent to current.
>
> Signed-off-by: Hao Zhang <hao5781286@gmail.com>
> ---
>  drivers/net/ethernet/sfc/mcdi_pcol.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/sfc/mcdi_pcol.h b/drivers/net/ethernet/sfc/mcdi_pcol.h
> index 91fb54f..8bfc8d4 100644
> --- a/drivers/net/ethernet/sfc/mcdi_pcol.h
> +++ b/drivers/net/ethernet/sfc/mcdi_pcol.h
> @@ -11505,7 +11505,7 @@
>
>  /***********************************/
>  /* MC_CMD_GET_SNAPSHOT_LENGTH
> - * Obtain the curent range of allowable values for the SNAPSHOT_LENGTH
> + * Obtain the current range of allowable values for the SNAPSHOT_LENGTH
>   * parameter to MC_CMD_INIT_RXQ.
>   */
>  #define MC_CMD_GET_SNAPSHOT_LENGTH 0x101
> --
> 2.7.4
>

^ permalink raw reply

* [PATCH iproute2] ip xfrm: use correct key length for netlink message
From: Michal Kubecek @ 2017-09-29 11:41 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

When SA is added manually using "ip xfrm state add", xfrm_state_modify()
uses alg_key_len field of struct xfrm_algo for the length of key passed to
kernel in the netlink message. However alg_key_len is bit length of the key
while we need byte length here. This is usually harmless as kernel ignores
the excess data but when the bit length of the key exceeds 512
(XFRM_ALGO_KEY_BUF_SIZE), it can result in buffer overflow.

We can simply divide by 8 here as the only place setting alg_key_len is in
xfrm_algo_parse() where it is always set to a multiple of 8 (and there are
already multiple places using "algo->alg_key_len / 8").

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
---
 ip/xfrm_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ip/xfrm_state.c b/ip/xfrm_state.c
index 4483fb8f71d2..99fdec2325ec 100644
--- a/ip/xfrm_state.c
+++ b/ip/xfrm_state.c
@@ -539,7 +539,7 @@ static int xfrm_state_modify(int cmd, unsigned int flags, int argc, char **argv)

 				xfrm_algo_parse((void *)&alg, type, name, key,
 						buf, sizeof(alg.buf));
-				len += alg.u.alg.alg_key_len;
+				len += alg.u.alg.alg_key_len / 8;

 				addattr_l(&req.n, sizeof(req.buf), type,
 					  (void *)&alg, len);
-- 
2.14.2

^ permalink raw reply related

* RE: [patch net-next 1/7] skbuff: Add the offload_mr_fwd_mark field
From: Yuval Mintz @ 2017-09-29 11:36 UTC (permalink / raw)
  To: Davide Caratti, Jiri Pirko, netdev@vger.kernel.org
  Cc: davem@davemloft.net, Yotam Gigi, Ido Schimmel, mlxsw,
	nikolay@cumulusnetworks.com, andrew@lunn.ch,
	dsa@cumulusnetworks.com, edumazet@google.com, willemb@google.com,
	johannes.berg@intel.com, pabeni@redhat.com, daniel@iogearbox.net,
	f.fainelli@gmail.com, fw@strlen.de, gfree.wind@vip.163.com
In-Reply-To: <1506683686.2980.44.camel@redhat.com>

> hello Jiri and Yotam,
> 
> On Thu, 2017-09-28 at 19:34 +0200, Jiri Pirko wrote:
> > From: Yotam Gigi <yotamg@mellanox.com>
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 19e64bf..ada8214 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -772,6 +772,7 @@ struct sk_buff {
> >  	__u8			remcsum_offload:1;
> >  #ifdef CONFIG_NET_SWITCHDEV
> >  	__u8			offload_fwd_mark:1;
> > +	__u8			offload_mr_fwd_mark:1;
> 
> I had a look at the pahole output:
> 
> $ make allyesconfig
> $ make net/core/skbuff.o
> $ pahole net/core/skbuff.o | grep -C7 tc_from_ingress
> 
> 
>         __u8                       ipvs_property:1;      /*   147: 7  1 */
>         __u8                       inner_protocol_type:1; /*   147: 6  1 */
>         __u8                       remcsum_offload:1;    /*   147: 5  1 */
>         __u8                       offload_fwd_mark:1;   /*   147: 4  1 */
>         __u8                       tc_skip_classify:1;   /*   147: 3  1 */
>         __u8                       tc_at_ingress:1;      /*   147: 2  1 */
>         __u8                       tc_redirected:1;      /*   147: 1  1 */
>         __u8                       tc_from_ingress:1;    /*   147: 0  1 */
>         __u16                      tc_index;             /*   148     2 */
> 
>         /* XXX 2 bytes hole, try to pack */
> 
>         union {
>                 __wsum             csum;                 /*           4 */
>                 struct {
> 
> apparently there are no more spare bits to use at that offset: therefore,
> adding 'offload_mr_fwd_mark' before 'tc_skip_classify' will make
> 'tc_from_ingress' slip at offset 148, and tc_index at offset 150.
> I think you can use that 2-bytes hole below tc_index, and also move the
> offload_fwd_mark bit there, as we use both when
> CONFIG_NET_SWITCHDEV is
> enabled. This way we will also gain one spare bit, without changing the
> struct size or worsening the cacheline alignments.
> 
> what do you think?

Your pahole output still shows a 2B hole until the following union
which is 4B-aligned.
While it's true tc_index moves to offset 150, the union will not move 
[I.e., stay at offset 152] so the layout doesn't really change [greatly]
nor the size of the struct. And we have the benefit of all the bits
remaining consecutive.


^ permalink raw reply

* Re: [PATCH v3] ebtables: fix race condition in frame_filter_net_init()
From: Pablo Neira Ayuso @ 2017-09-29 11:29 UTC (permalink / raw)
  To: Artem Savkov; +Cc: Florian Westphal, netdev, linux-kernel, netfilter-devel
In-Reply-To: <20170926163545.3668-1-asavkov@redhat.com>

On Tue, Sep 26, 2017 at 06:35:45PM +0200, Artem Savkov wrote:
> It is possible for ebt_in_hook to be triggered before ebt_table is assigned
> resulting in a NULL-pointer dereference. Make sure hooks are
> registered as the last step.

Applied, thanks.

^ permalink raw reply

* [PATCH net-next] net: core: decouple ifalias get/set from rtnl lock
From: Florian Westphal @ 2017-09-29 11:21 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, Florian Westphal

Device alias can be set by either rtnetlink (rtnl is held) or sysfs.

rtnetlink hold the rtnl mutex, sysfs acquires it for this purpose.
Add an extra mutex for it plus a seqcount to get a consistent snapshot
of the alias buffer.

This allows the sysfs path to not take rtnl and would later allow
to not hold it when dumping ifalias.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netdevice.h |  3 +-
 net/core/dev.c            | 70 +++++++++++++++++++++++++++++++++++++++--------
 net/core/net-sysfs.c      | 14 ++++------
 net/core/rtnetlink.c      | 13 +++++++--
 4 files changed, 77 insertions(+), 23 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1..0bcedb498f6f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1632,7 +1632,7 @@ enum netdev_priv_flags {
 struct net_device {
 	char			name[IFNAMSIZ];
 	struct hlist_node	name_hlist;
-	char 			*ifalias;
+	char			__rcu *ifalias;
 	/*
 	 *	I/O specific fields
 	 *	FIXME: Merge these and struct ifmap into one
@@ -3275,6 +3275,7 @@ void __dev_notify_flags(struct net_device *, unsigned int old_flags,
 			unsigned int gchanges);
 int dev_change_name(struct net_device *, const char *);
 int dev_set_alias(struct net_device *, const char *, size_t);
+int dev_get_alias(const struct net_device *, char *, size_t);
 int dev_change_net_namespace(struct net_device *, struct net *, const char *);
 int __dev_set_mtu(struct net_device *, int);
 int dev_set_mtu(struct net_device *, int);
diff --git a/net/core/dev.c b/net/core/dev.c
index e350c768d4b5..a7ac2902d702 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -188,6 +188,9 @@ static struct napi_struct *napi_by_id(unsigned int napi_id);
 DEFINE_RWLOCK(dev_base_lock);
 EXPORT_SYMBOL(dev_base_lock);
 
+static DEFINE_MUTEX(ifalias_mutex);
+static seqcount_t ifalias_rename_seq;
+
 /* protects napi_hash addition/deletion and napi_gen_id */
 static DEFINE_SPINLOCK(napi_hash_lock);
 
@@ -1265,29 +1268,72 @@ int dev_change_name(struct net_device *dev, const char *newname)
  */
 int dev_set_alias(struct net_device *dev, const char *alias, size_t len)
 {
-	char *new_ifalias;
-
-	ASSERT_RTNL();
+	char *new_ifalias, *old_ifalias;
 
 	if (len >= IFALIASZ)
 		return -EINVAL;
 
+	mutex_lock(&ifalias_mutex);
+
+	old_ifalias = rcu_dereference_protected(dev->ifalias,
+						mutex_is_locked(&ifalias_mutex));
 	if (!len) {
-		kfree(dev->ifalias);
-		dev->ifalias = NULL;
-		return 0;
+		RCU_INIT_POINTER(dev->ifalias, NULL);
+		goto out;
 	}
 
-	new_ifalias = krealloc(dev->ifalias, len + 1, GFP_KERNEL);
-	if (!new_ifalias)
+	new_ifalias = __krealloc(old_ifalias, len + 1, GFP_KERNEL);
+	if (!new_ifalias) {
+		mutex_unlock(&ifalias_mutex);
 		return -ENOMEM;
-	dev->ifalias = new_ifalias;
-	memcpy(dev->ifalias, alias, len);
-	dev->ifalias[len] = 0;
+	}
 
+	if (new_ifalias == old_ifalias) {
+		write_seqcount_begin(&ifalias_rename_seq);
+		memcpy(new_ifalias, alias, len);
+		new_ifalias[len] = 0;
+		write_seqcount_end(&ifalias_rename_seq);
+		mutex_unlock(&ifalias_mutex);
+		return len;
+	}
+
+	memcpy(new_ifalias, alias, len);
+	new_ifalias[len] = 0;
+
+	rcu_assign_pointer(dev->ifalias, new_ifalias);
+out:
+	mutex_unlock(&ifalias_mutex);
+	if (old_ifalias) {
+		synchronize_net();
+		kfree(old_ifalias);
+	}
 	return len;
 }
 
+int dev_get_alias(const struct net_device *dev, char *alias, size_t len)
+{
+	unsigned int seq;
+	int ret;
+
+	for (;;) {
+		const char *name;
+
+		ret = 0;
+		rcu_read_lock();
+		name = rcu_dereference(dev->ifalias);
+		seq = raw_seqcount_begin(&ifalias_rename_seq);
+		if (name)
+			ret = snprintf(alias, len, "%s", name);
+		rcu_read_unlock();
+
+		if (!read_seqcount_retry(&ifalias_rename_seq, seq))
+			break;
+
+		cond_resched();
+	}
+
+	return ret;
+}
 
 /**
  *	netdev_features_change - device changes features
@@ -8770,6 +8816,8 @@ static int __init net_dev_init(void)
 				       NULL, dev_cpu_dead);
 	WARN_ON(rc < 0);
 	rc = 0;
+
+	seqcount_init(&ifalias_rename_seq);
 out:
 	return rc;
 }
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 927a6dcbad96..3d85119d3104 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -391,10 +391,7 @@ static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 	if (len >  0 && buf[len - 1] == '\n')
 		--count;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
 	ret = dev_set_alias(netdev, buf, count);
-	rtnl_unlock();
 
 	return ret < 0 ? ret : len;
 }
@@ -403,13 +400,12 @@ static ssize_t ifalias_show(struct device *dev,
 			    struct device_attribute *attr, char *buf)
 {
 	const struct net_device *netdev = to_net_dev(dev);
+	char tmp[IFALIASZ];
 	ssize_t ret = 0;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
-	if (netdev->ifalias)
-		ret = sprintf(buf, "%s\n", netdev->ifalias);
-	rtnl_unlock();
+	ret = dev_get_alias(netdev, tmp, sizeof(tmp));
+	if (ret > 0)
+		ret = sprintf(buf, "%s\n", tmp);
 	return ret;
 }
 static DEVICE_ATTR_RW(ifalias);
@@ -1488,7 +1484,7 @@ static void netdev_release(struct device *d)
 
 	BUG_ON(dev->reg_state != NETREG_RELEASED);
 
-	kfree(dev->ifalias);
+	kfree(rcu_access_pointer(dev->ifalias));
 	netdev_freemem(dev);
 }
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index e6955da0d58d..3961f87cdc76 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1366,6 +1366,16 @@ static int nla_put_iflink(struct sk_buff *skb, const struct net_device *dev)
 	return nla_put_u32(skb, IFLA_LINK, ifindex);
 }
 
+static noinline_for_stack int nla_put_ifalias(struct sk_buff *skb,
+					      struct net_device *dev)
+{
+	char buf[IFALIASZ];
+	int ret;
+
+	ret = dev_get_alias(dev, buf, sizeof(buf));
+	return ret > 0 ? nla_put_string(skb, IFLA_IFALIAS, buf) : 0;
+}
+
 static int rtnl_fill_link_netnsid(struct sk_buff *skb,
 				  const struct net_device *dev)
 {
@@ -1425,8 +1435,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	    nla_put_u8(skb, IFLA_CARRIER, netif_carrier_ok(dev)) ||
 	    (dev->qdisc &&
 	     nla_put_string(skb, IFLA_QDISC, dev->qdisc->ops->id)) ||
-	    (dev->ifalias &&
-	     nla_put_string(skb, IFLA_IFALIAS, dev->ifalias)) ||
+	    nla_put_ifalias(skb, dev) ||
 	    nla_put_u32(skb, IFLA_CARRIER_CHANGES,
 			atomic_read(&dev->carrier_changes)) ||
 	    nla_put_u8(skb, IFLA_PROTO_DOWN, dev->proto_down))
-- 
2.13.5

^ permalink raw reply related

* Re: [PATCH net-next] net: ipv4: remove fib_weight
From: David Miller @ 2017-09-29  5:20 UTC (permalink / raw)
  To: dsahern; +Cc: netdev
In-Reply-To: <1506564480-20374-1-git-send-email-dsahern@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Wed, 27 Sep 2017 19:08:00 -0700

> fib_weight in fib_info is set but not used. Remove it and the
> helpers for setting it.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net: ipv4: remove fib_info arg to fib_check_nh
From: David Miller @ 2017-09-29  5:20 UTC (permalink / raw)
  To: dsahern; +Cc: netdev
In-Reply-To: <1506570119-17088-1-git-send-email-dsahern@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Wed, 27 Sep 2017 20:41:59 -0700

> fib_check_nh does not use the fib_info arg; remove t.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>

Applied.

^ permalink raw reply

* Re: [patch net-next 1/7] skbuff: Add the offload_mr_fwd_mark field
From: Davide Caratti @ 2017-09-29 11:14 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, yotamg, idosch, mlxsw, nikolay, andrew, dsa, edumazet,
	willemb, johannes.berg, pabeni, daniel, f.fainelli, fw,
	gfree.wind
In-Reply-To: <20170928173415.15551-2-jiri@resnulli.us>

hello Jiri and Yotam,

On Thu, 2017-09-28 at 19:34 +0200, Jiri Pirko wrote:
> From: Yotam Gigi <yotamg@mellanox.com>
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 19e64bf..ada8214 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -772,6 +772,7 @@ struct sk_buff {
>  	__u8			remcsum_offload:1;
>  #ifdef CONFIG_NET_SWITCHDEV
>  	__u8			offload_fwd_mark:1;
> +	__u8			offload_mr_fwd_mark:1;

I had a look at the pahole output:

$ make allyesconfig
$ make net/core/skbuff.o
$ pahole net/core/skbuff.o | grep -C7 tc_from_ingress


        __u8                       ipvs_property:1;      /*   147: 7  1 */
        __u8                       inner_protocol_type:1; /*   147: 6  1 */
        __u8                       remcsum_offload:1;    /*   147: 5  1 */
        __u8                       offload_fwd_mark:1;   /*   147: 4  1 */
        __u8                       tc_skip_classify:1;   /*   147: 3  1 */
        __u8                       tc_at_ingress:1;      /*   147: 2  1 */
        __u8                       tc_redirected:1;      /*   147: 1  1 */
        __u8                       tc_from_ingress:1;    /*   147: 0  1 */
        __u16                      tc_index;             /*   148     2 */

        /* XXX 2 bytes hole, try to pack */

        union {
                __wsum             csum;                 /*           4 */
                struct {

apparently there are no more spare bits to use at that offset: therefore,
adding 'offload_mr_fwd_mark' before 'tc_skip_classify' will make
'tc_from_ingress' slip at offset 148, and tc_index at offset 150.
I think you can use that 2-bytes hole below tc_index, and also move the
offload_fwd_mark bit there, as we use both when CONFIG_NET_SWITCHDEV is
enabled. This way we will also gain one spare bit, without changing the
struct size or worsening the cacheline alignments.

what do you think?

regards,
-- 
davide

^ permalink raw reply

* Re: [PATCH net-next] net: bridge: add per-port group_fwd_mask with less restrictions
From: David Miller @ 2017-09-29  5:04 UTC (permalink / raw)
  To: nikolay; +Cc: netdev, roopa, bridge, stephen
In-Reply-To: <1506517964-17479-1-git-send-email-nikolay@cumulusnetworks.com>

From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Date: Wed, 27 Sep 2017 16:12:44 +0300

> We need to be able to transparently forward most link-local frames via
> tunnels (e.g. vxlan, qinq). Currently the bridge's group_fwd_mask has a
> mask which restricts the forwarding of STP and LACP, but we need to be able
> to forward these over tunnels and control that forwarding on a per-port
> basis thus add a new per-port group_fwd_mask option which only disallows
> mac pause frames to be forwarded (they're always dropped anyway).
> The patch does not change the current default situation - all of the others
> are still restricted unless configured for forwarding.
> We have successfully tested this patch with LACP and STP forwarding over
> VxLAN and qinq tunnels.
> 
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH v2 net-next 0/2] net/sched: support tunnel options in cls_flower and act_tunnel_key
From: David Miller @ 2017-09-29  4:54 UTC (permalink / raw)
  To: simon.horman; +Cc: jiri, jhs, xiyou.wangcong, netdev, oss-drivers
In-Reply-To: <1506500194-17637-1-git-send-email-simon.horman@netronome.com>

From: Simon Horman <simon.horman@netronome.com>
Date: Wed, 27 Sep 2017 10:16:32 +0200

> Users of options:
> 
> * There are eBPF hooks to allow getting on and setting tunnel metadata:
>   bpf_skb_set_tunnel_opt, bpf_skb_get_tunnel_opt.
> 
> * Open vSwitch is able to match and set Geneve and VXLAN-GBP options.
> 
> Neither of the above appear to assume any structure for the data.

I really worry about this.

These metadata option blobs are internal kernel datastructure which we
could change at any point in time.  They are not exported to
userspace as a UAPI.

It's kinda OK for eBPF programs to access this stuff since they are
expected to cope with changes to internal data-structures.

But for anything user facing, this really doesn't work.

^ permalink raw reply

* Re: [PATCH net] ipv6: fix net.ipv6.conf.all interface DAD handlers
From: Matteo Croce @ 2017-09-29 10:47 UTC (permalink / raw)
  To: Erik Kline; +Cc: David Miller, netdev, linux-doc
In-Reply-To: <CAAedzxrLurc=oGy64ePO30j3QQ_sL5kJCEz+RnTk3LMSn_yvyQ@mail.gmail.com>

On Thu, Sep 28, 2017 at 1:22 PM, Erik Kline <ek@google.com> wrote:
> Upon further reflection, doesn't the whole premise of this change
> means that it's no longer possible to selectively disable these
> features if they are set on "all"?  Or are we saying that this mode is
> only support with "default" enable + "ifname" disable?

Hi Erik, thanks for the review.
Yes the behaviour seems wrong when writing all.accept_dad respect what
the documentation says.

BTW the previous behaviour was not defined, I put them in OR because
that's what other handlers do, eg. send_redirects.
If you think that it's better to put them in AND we can change the
documentation accordingly.
What do you think?

-- 
Matteo Croce
per aspera ad upstream

^ permalink raw reply

* netlink backwards compatibility in userspace tools
From: Jason A. Donenfeld @ 2017-09-29 10:22 UTC (permalink / raw)
  To: Netdev, LKML; +Cc: Daniel Kahn Gillmor

Hi guys,

One handy aspect of Netlink is that it's backwards compatible. This
means that you can run old userspace utilities on new kernels, even if
the new kernel supports new features and netlink attributes. The wire
format is stable enough that the data marshaled can be extended
without breaking compat. Neat.

I was wondering, though, what you think the best stance is toward
these old userspace utilities. What should they do if the kernel sends
it netlink attributes that it does not recognize? At the moment, I'm
doing something like this:

static void warn_unrecognized(void)
{
    static bool once = false;
    if (once)
        return;
    once = true;
    fprintf(stderr,
        "Warning: this program received from your kernel one or more\n"
        "attributes that it did not recognize. It is possible that\n"
        "this version of wg(8) is older than your kernel. You may\n"
        "want to update this program.\n");
}

This seems like a somewhat sensible warning, but then I wonder about
distributions like Debian, which has a long stable life cycle, so it
frequently has very old tools (ancient iproute2 for example). Then,
VPS providers have these Debian images run on top of newer kernels.
People in this situation would undoubtedly see the above warning a lot
and not be able to do anything about it. Not horrible, but a bit
annoying. Is this an okay annoyance? Or is it advised to just have no
warning at all? One idea would be to put it behind an environment
variable flag, but I don't like too many nobs.

I'm generally wondering about attitudes toward this kind of userspace
program behavior in response to newer kernels.

Thanks,
Jason

^ permalink raw reply

* RE: [PATCH net-next 03/10] sctp: factor out stream->in allocation
From: David Laight @ 2017-09-29 10:04 UTC (permalink / raw)
  To: 'Marcelo Ricardo Leitner', netdev@vger.kernel.org
  Cc: linux-sctp@vger.kernel.org, Neil Horman, Vlad Yasevich, Xin Long
In-Reply-To: <a2d837095740147003164fe573794c140001ec71.1506536044.git.marcelo.leitner@gmail.com>

From: Marcelo Ricardo Leitner
> Sent: 28 September 2017 21:25
> Same idea as previous patch.

That needs a proper description.

	David

^ permalink raw reply

* Re: [patch net-next 2/7] ipv4: ipmr: Add the parent ID field to VIF struct
From: Nikolay Aleksandrov @ 2017-09-29  9:50 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, yotamg, idosch, mlxsw, andrew, dsa, edumazet, willemb,
	johannes.berg, dcaratti, pabeni, daniel, f.fainelli, fw,
	gfree.wind
In-Reply-To: <20170928173415.15551-3-jiri@resnulli.us>

On 28/09/17 20:34, Jiri Pirko wrote:
> From: Yotam Gigi <yotamg@mellanox.com>
> 
> In order to allow the ipmr module to do partial multicast forwarding
> according to the device parent ID, add the device parent ID field to the
> VIF struct. This way, the forwarding path can use the parent ID field
> without invoking switchdev calls, which requires the RTNL lock.
> 
> When a new VIF is added, set the device parent ID field in it by invoking
> the switchdev_port_attr_get call.
> 
> Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
> Reviewed-by: Ido Schimmel <idosch@mellanox.com>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> ---
>  include/linux/mroute.h | 2 ++
>  net/ipv4/ipmr.c        | 9 +++++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/include/linux/mroute.h b/include/linux/mroute.h
> index b072a84..a46577f 100644
> --- a/include/linux/mroute.h
> +++ b/include/linux/mroute.h
> @@ -57,6 +57,8 @@ static inline bool ipmr_rule_default(const struct fib_rule *rule)
>  
>  struct vif_device {
>  	struct net_device 	*dev;			/* Device we are using */
> +	struct netdev_phys_item_id dev_parent_id;	/* Device parent ID    */
> +	bool		dev_parent_id_valid;
>  	unsigned long	bytes_in,bytes_out;
>  	unsigned long	pkt_in,pkt_out;		/* Statistics 			*/
>  	unsigned long	rate_limit;		/* Traffic shaping (NI) 	*/
> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
> index 292a8e8..4566c54 100644
> --- a/net/ipv4/ipmr.c
> +++ b/net/ipv4/ipmr.c
> @@ -67,6 +67,7 @@
>  #include <net/fib_rules.h>
>  #include <linux/netconf.h>
>  #include <net/nexthop.h>
> +#include <net/switchdev.h>
>  
>  struct ipmr_rule {
>  	struct fib_rule		common;
> @@ -868,6 +869,9 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>  		   struct vifctl *vifc, int mrtsock)
>  {
>  	int vifi = vifc->vifc_vifi;
> +	struct switchdev_attr attr = {
> +		.id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
> +	};
>  	struct vif_device *v = &mrt->vif_table[vifi];
>  	struct net_device *dev;
>  	struct in_device *in_dev;
> @@ -942,6 +946,11 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>  
>  	/* Fill in the VIF structures */
>  
> +	attr.orig_dev = dev;
> +	if (!switchdev_port_attr_get(dev, &attr)) {
> +		v->dev_parent_id_valid = true;
> +		memcpy(v->dev_parent_id.id, attr.u.ppid.id, attr.u.ppid.id_len);
> +	}
>  	v->rate_limit = vifc->vifc_rate_limit;
>  	v->local = vifc->vifc_lcl_addr.s_addr;
>  	v->remote = vifc->vifc_rmt_addr.s_addr;
> 

One more thing - what happens on vif delete, then add with the same vif index of another
device that doesn't have a parent id ? I think the vif will be stuck with its parent_id
when it gets set.

^ permalink raw reply

* Re: [net-next PATCH 3/5] bpf: cpumap xdp_buff to skb conversion and allocation
From: Jason Wang @ 2017-09-29  9:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev
  Cc: jakub.kicinski, Michael S. Tsirkin, mchan, John Fastabend,
	peter.waskiewicz.jr, Daniel Borkmann, Alexei Starovoitov,
	Andy Gospodarek
In-Reply-To: <150660343811.2808.7680200486950101509.stgit@firesoul>



On 2017年09月28日 20:57, Jesper Dangaard Brouer wrote:
> +};
> +
> +/* Convert xdp_buff to xdp_pkt */
> +static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
> +{
> +	struct xdp_pkt *xdp_pkt;
> +	int headroom;
> +
> +	/* Assure headroom is available for storing info */
> +	headroom = xdp->data - xdp->data_hard_start;
> +	if (headroom < sizeof(*xdp_pkt))
> +		return NULL;

Hi Jesper:

Do you consider this as a trick or a long term solution? Is it better to 
store XDP in a circular buffer? (I'm asking since I meet similar issue 
when doing xdp_xmit for tun).

> +
> +	/* Store info in top of packet */
> +	xdp_pkt = xdp->data_hard_start;
> +
> +	xdp_pkt->data = xdp->data;
> +	xdp_pkt->len  = xdp->data_end - xdp->data;
> +	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
> +

Is wmb() needed here?

> +	return xdp_pkt;
> +}

Thanks

^ permalink raw reply

* Re: [patch net-next 2/7] ipv4: ipmr: Add the parent ID field to VIF struct
From: Nikolay Aleksandrov @ 2017-09-29  9:45 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, yotamg, idosch, mlxsw, andrew, dsa, edumazet, willemb,
	johannes.berg, dcaratti, pabeni, daniel, f.fainelli, fw,
	gfree.wind
In-Reply-To: <7618b2e8-e7f9-d2c6-b13c-53aef0f50de0@cumulusnetworks.com>

On 29/09/17 12:29, Nikolay Aleksandrov wrote:
> On 28/09/17 20:34, Jiri Pirko wrote:
>> From: Yotam Gigi <yotamg@mellanox.com>
>>
>> In order to allow the ipmr module to do partial multicast forwarding
>> according to the device parent ID, add the device parent ID field to the
>> VIF struct. This way, the forwarding path can use the parent ID field
>> without invoking switchdev calls, which requires the RTNL lock.
>>
>> When a new VIF is added, set the device parent ID field in it by invoking
>> the switchdev_port_attr_get call.
>>
>> Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
>> Reviewed-by: Ido Schimmel <idosch@mellanox.com>
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>> ---
>>  include/linux/mroute.h | 2 ++
>>  net/ipv4/ipmr.c        | 9 +++++++++
>>  2 files changed, 11 insertions(+)
>>
>> diff --git a/include/linux/mroute.h b/include/linux/mroute.h
>> index b072a84..a46577f 100644
>> --- a/include/linux/mroute.h
>> +++ b/include/linux/mroute.h
>> @@ -57,6 +57,8 @@ static inline bool ipmr_rule_default(const struct fib_rule *rule)
>>  
>>  struct vif_device {
>>  	struct net_device 	*dev;			/* Device we are using */
>> +	struct netdev_phys_item_id dev_parent_id;	/* Device parent ID    */
>> +	bool		dev_parent_id_valid;
>>  	unsigned long	bytes_in,bytes_out;
>>  	unsigned long	pkt_in,pkt_out;		/* Statistics 			*/
>>  	unsigned long	rate_limit;		/* Traffic shaping (NI) 	*/
>> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
>> index 292a8e8..4566c54 100644
>> --- a/net/ipv4/ipmr.c
>> +++ b/net/ipv4/ipmr.c
>> @@ -67,6 +67,7 @@
>>  #include <net/fib_rules.h>
>>  #include <linux/netconf.h>
>>  #include <net/nexthop.h>
>> +#include <net/switchdev.h>
>>  
>>  struct ipmr_rule {
>>  	struct fib_rule		common;
>> @@ -868,6 +869,9 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>>  		   struct vifctl *vifc, int mrtsock)
>>  {
>>  	int vifi = vifc->vifc_vifi;
>> +	struct switchdev_attr attr = {
>> +		.id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
>> +	};
>>  	struct vif_device *v = &mrt->vif_table[vifi];
>>  	struct net_device *dev;
>>  	struct in_device *in_dev;
>> @@ -942,6 +946,11 @@ static int vif_add(struct net *net, struct mr_table *mrt,
>>  
>>  	/* Fill in the VIF structures */
>>  
>> +	attr.orig_dev = dev;
>> +	if (!switchdev_port_attr_get(dev, &attr)) {
>> +		v->dev_parent_id_valid = true;
>> +		memcpy(v->dev_parent_id.id, attr.u.ppid.id, attr.u.ppid.id_len);
> 
> Hmm, shouldn't you set dev_parent_id.id_len too ? It would seem netdev_phys_item_id_same()
> uses it in the comparison and without the len it would always look like they're the same
> because memcmp will simply return 0 with count = 0.

Also maybe we can use the non-zero id_len as a signal that it was set and drop the dev_parent_id_valid
field altogether, it would seem there's no valid reason to have id_len == 0 and yet expect a valid
parent_id.

> 
>> +	}
>>  	v->rate_limit = vifc->vifc_rate_limit;
>>  	v->local = vifc->vifc_lcl_addr.s_addr;
>>  	v->remote = vifc->vifc_rmt_addr.s_addr;
>>
> 

^ permalink raw reply

* Re: [PATCH net-next RFC 0/9] net: dsa: PTP timestamping for mv88e6xxx
From: Richard Cochran @ 2017-09-29  9:43 UTC (permalink / raw)
  To: Brandon Streiff
  Cc: netdev, linux-kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot, Erik Hons
In-Reply-To: <1506612341-18061-1-git-send-email-brandon.streiff@ni.com>

Brandon,

On Thu, Sep 28, 2017 at 10:25:32AM -0500, Brandon Streiff wrote:
> - Patch #2: We expose the switch time as a PTP clock but don't support
>   adjustment (max_adj=0).

The driver should implement a cyclecounter/timecounter.

> Our platform adjusted a systemwide oscillator
>   from userspace, so we didn't need adjustment at this layer, but other
>   PTP clock drivers support this and we probably should too.

We don't currently have any way to support this kind of HW or anything
like an external VCO.  I would like to find a way to do this, but that
is a different kettle of fish as it might require changes in the PHC
subsystem.  For this driver, I think we should get it merged using the
cyclecounter/timecounter (as that will benefit lots of users) and
worry about the external oscillator later.

> Feedback is appreciated.

I happy to see this series.  I just finished porting an out-of-tree
PHC driver for the Marvell mv88e635x, and I want to mainline it, but I
also have a few uglies.

Unfortunately I am in the middle of a move right now, and so my review
of this series might have to wait a bit.  However, I am looking
forward to comparing notes, and then getting this support in.

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH net-next 0/3] support changing steering policies in tuntap
From: Jason Wang @ 2017-09-29  9:41 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Michael S. Tsirkin, Network Development, LKML
In-Reply-To: <CAF=yD-KuDZK0-YUfYde=dk_6M6fgj_opuiK2VTfCKDM_=MqDcw@mail.gmail.com>



On 2017年09月29日 00:09, Willem de Bruijn wrote:
> On Thu, Sep 28, 2017 at 3:23 AM, Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2017年09月28日 07:25, Willem de Bruijn wrote:
>>>>> In the future, both simple and sophisticated policy like RSS or other
>>>>> guest
>>>>> driven steering policies could be done on top.
>>>> IMHO there should be a more practical example before adding all this
>>>> indirection. And it would be nice to understand why this queue selection
>>>> needs to be tun specific.
>>> I was thinking the same and this reminds me of the various strategies
>>> implemented in packet fanout. tun_cpu_select_queue is analogous to
>>> fanout_demux_cpu though it is tun-specific in that it requires
>>> tun->numqueues.
>>
>> Right, the main idea is to introduce a way to change flow steering policy
>> for tun. I think fanout policy could be implemented through the API
>> introduced in this series. (Current flow caches based automatic steering
>> method is tun specific).
>>
>>> Fanout accrued various strategies until it gained an eBPF variant. Just
>>> supporting BPF is probably sufficient here, too.
>>
>> Technically yes, but for tun, it also serve for virt. We probably still need
>> some hard coded policy which could be changed by guest until we can accept
>> an BPF program from guest I think?
> When would a guest choose the policy? As long as this is under control
> of a host user, possibly unprivileged, allowing BPF here is moot, as any
> user can run socket filter BPF already. Programming from the guest is
> indeed different. I don't fully understand that use case.

The problem is userspace (qemu) know little about what kind of workloads 
will be done by guest, so we need guest controllable method here since 
it knows the best steering policy. Rethink about this, instead of 
passing eBPF from guest, qemu can have some pre-defined sets of polices. 
I will change the cpu id based to eBPF based in V2.

Thanks

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox