Netdev List


* Re: [PATCH v2 net-next 06/10] mlxsw: core: Fix arg name of MLXSW_CORE_RES_VALID and MLXSW_CORE_RES_GET
From: Sasha Levin @ 2018-04-05  1:33 UTC (permalink / raw)
  To: Sasha Levin, Ido Schimmel, Jiri Pirko, netdev@vger.kernel.org
  Cc: davem@davemloft.net, jiri@mellanox.com, petrm@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <20180401143459.30770-7-idosch@mellanox.com>

Hi Ido Schimmel, Jiri Pirko,

[This is an automated email]

This commit has been processed by the -stable helper bot and determined
to be a high probability candidate for -stable trees. (score: 1.5151)

The bot has tested the following trees: v4.15.15, v4.14.32, v4.9.92, v4.4.126.

v4.15.15: Build OK!
v4.14.32: Build OK!
v4.9.92: Failed to apply! Possible dependencies:
    c1a3831121f6: ("mlxsw: Convert resources into array")

v4.4.126: Failed to apply! Possible dependencies:
    403547d38d0b: ("mlxsw: profile: Add KVD resources to profile config")
    489107bda1d1: ("mlxsw: Add KVD sizes configuration into profile")
    57d316ba2017: ("mlxsw: pci: Add resources query implementation.")
    8060646a0fd1: ("mlxsw: core: Add support for packets received from LAG port")
    89309da39f55: ("mlxsw: core: Implement temperature hwmon interface")
    932762b69a28: ("mlxsw: Move devlink port registration into common core code")
    90183b980d0a: ("mlxsw: spectrum: Initialize egress scheduling")
    7f71eb46a485: ("mlxsw: spectrum: Split vFID range in two")
    bd40e9d6d538: ("mlxsw: spectrum: Allocate active VLANs only for port netdevs")
    0d65fc13042f: ("mlxsw: spectrum: Implement LAG port join/leave")
    c4745500e988: ("mlxsw: Implement devlink interface")


Please let us know if you'd like to have this patch included in a stable tree.


* [rtlwifi-btcoex] Suspicious code in halbtc8821a1ant driver
From: Gustavo A. R. Silva @ 2018-04-05  1:25 UTC (permalink / raw)
  To: Yan-Hsuan Chuang, Ping-Ke Shih, Kalle Valo
  Cc: linux-wireless, netdev, linux-kernel, Gustavo A. R. Silva

Hi all,

While doing some static analysis I came across the following piece of code at drivers/net/wireless/realtek/rtlwifi/btcoexist/halbtc8821a1ant.c:1581:

1581 static void btc8821a1ant_act_bt_sco_hid_only_busy(struct btc_coexist *btcoexist,
1582                                                   u8 wifi_status)
1583 {
1584         /* tdma and coex table */
1585         btc8821a1ant_ps_tdma(btcoexist, NORMAL_EXEC, true, 5);
1586 
1587         if (BT_8821A_1ANT_WIFI_STATUS_NON_CONNECTED_ASSO_AUTH_SCAN ==
1588             wifi_status)
1589                 btc8821a1ant_coex_table_with_type(btcoexist, NORMAL_EXEC, 1);
1590         else
1591                 btc8821a1ant_coex_table_with_type(btcoexist, NORMAL_EXEC, 1);
1592 }

The issue here is that the code for both branches of the if-else statement is identical.

The if-else was introduced a year ago in this commit c6821613e653

I wonder if an argument should be changed in any of the calls to btc8821a1ant_coex_table_with_type?

What do you think?

Thanks
--
Gustavo


* Re: [PATCH net] net: dsa: Discard frames from unused ports
From: Florian Fainelli @ 2018-04-05  0:49 UTC (permalink / raw)
  To: Andrew Lunn, David Miller; +Cc: netdev, Vivien Didelot
In-Reply-To: <1522886204-1545-1-git-send-email-andrew@lunn.ch>

On 04/04/2018 04:56 PM, Andrew Lunn wrote:
> The Marvell switches under some conditions will pass a frame to the
> host with the port being the CPU port. Such frames are invalid, and
> should be dropped. Not dropping them can result in a crash when
> incrementing the receive statistics for an invalid port.
> 
> Reported-by: Chris Healy <cphealy@gmail.com>
> Fixes: 5f6b4e14cada ("net: dsa: User per-cpu 64-bit statistics")

Are you sure this is the commit that introduced the problem?

> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
> ---
>  net/dsa/dsa_priv.h | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
> index 70de7895e5b8..6c1bf3d9f652 100644
> --- a/net/dsa/dsa_priv.h
> +++ b/net/dsa/dsa_priv.h
> @@ -126,6 +126,7 @@ static inline struct net_device *dsa_master_find_slave(struct net_device *dev,
>  	struct dsa_port *cpu_dp = dev->dsa_ptr;
>  	struct dsa_switch_tree *dst = cpu_dp->dst;
>  	struct dsa_switch *ds;
> +	struct dsa_port *slave_port;
>  
>  	if (device < 0 || device >= DSA_MAX_SWITCHES)
>  		return NULL;
> @@ -137,7 +138,12 @@ static inline struct net_device *dsa_master_find_slave(struct net_device *dev,
>  	if (port < 0 || port >= ds->num_ports)
>  		return NULL;
>  
> -	return ds->ports[port].slave;
> +	slave_port = &ds->ports[port];
> +
> +	if (slave_port->type != DSA_PORT_TYPE_USER)

Can we optimize this with an unlikely()?

> +		return NULL;
> +
> +	return slave_port->slave;
>  }
>  
>  /* port.c */
> 
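
For reference, a minimal user-space sketch of what the unlikely() hint on this check could look like. The struct, enum and function names below are simplified, hypothetical stand-ins rather than the kernel's DSA definitions; only the __builtin_expect() form of unlikely() mirrors the kernel macro. Whether the hint pays off in practice would need measuring.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the kernel's unlikely(): a branch-prediction hint. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical, simplified stand-ins for the DSA port types. */
enum port_type { PORT_TYPE_UNUSED, PORT_TYPE_CPU, PORT_TYPE_USER };

struct port {
	enum port_type type;
	const char *slave;	/* stands in for struct net_device * */
};

struct sw {
	size_t num_ports;
	struct port *ports;
};

/* Return the slave of a user port, or NULL for any other port. */
static const char *find_slave(const struct sw *ds, int port)
{
	const struct port *p;

	assert(ds != NULL);

	if (port < 0 || (size_t)port >= ds->num_ports)
		return NULL;

	p = &ds->ports[port];

	/* Frames tagged with a non-user port are assumed to be rare. */
	if (unlikely(p->type != PORT_TYPE_USER))
		return NULL;

	return p->slave;
}
```

The hint changes only how the compiler lays out the branch; the function returns the same values with or without it.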


-- 
Florian


* Re: [PATCH v2 00/21] Allow compile-testing NO_DMA (drivers)
From: Rob Herring @ 2018-04-05  0:32 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Ulf Hansson, Wolfram Sang, linux-iio, linux-fpga,
	open list:REMOTE PROCESSOR (REMOTEPROC) SUBSYSTEM, Linux-ALSA,
	Bjorn Andersson, Eric Anholt, netdev, linux-mtd, Linux I2C,
	linux1394-devel, Christoph Hellwig, Stefan Wahren, Boris Brezillon,
	James E . J . Bottomley, Herbert Xu, linux-scsi, Richard Weinberger,
	Jassi Brar,
	Marek Vasut, "open list:SERIAL DRIVERS" <linux-serial@
In-Reply-To: <1521208314-4783-1-git-send-email-geert-Td1EMuHUCqxL1ZNQvxDV9g@public.gmane.org>

On Fri, Mar 16, 2018 at 8:51 AM, Geert Uytterhoeven
<geert-Td1EMuHUCqxL1ZNQvxDV9g@public.gmane.org> wrote:
>         Hi all,
>
> If NO_DMA=y, get_dma_ops() returns a reference to the non-existing
> symbol bad_dma_ops, thus causing a link failure if it is ever used.
>
> The intention of this is twofold:
>   1. To catch users of the DMA API on systems that do not support the DMA
>      mapping API,
>   2. To avoid building drivers that cannot work on such systems anyway.
>
> However, the disadvantage is that we have to keep on adding dependencies
> on HAS_DMA all over the place.
>
> Thanks to the COMPILE_TEST symbol, lots of drivers now depend on one or
> more platform dependencies (that imply HAS_DMA) || COMPILE_TEST, thus
> already covering intention #2.  Having to add an explicit dependency on
> HAS_DMA here is cumbersome, and hinders compile-testing.

The same can be said for CONFIG_IOMEM and CONFIG_OF. Any plans to
remove those too? CONFIG_IOMEM is mostly just a !CONFIG_UM option.

Rob


* [PATCH] brcm80211: brcmsmac: phy_lcn: remove duplicate code
From: Gustavo A. R. Silva @ 2018-04-05  0:09 UTC (permalink / raw)
  To: Arend van Spriel, Franky Lin, Hante Meuleman, Chi-Hsien Lin,
	Wright Feng, Kalle Valo
  Cc: linux-wireless, brcm80211-dev-list.pdl, brcm80211-dev-list,
	netdev, linux-kernel, Gustavo A. R. Silva

Remove and refactor some code in order to avoid having identical code
for different branches.

Notice that this piece of code hasn't been modified since 2011.

Addresses-Coverity-ID: 1226756 ("Identical code for different branches")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
 drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_lcn.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_lcn.c b/drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_lcn.c
index 93d4cde..9d830d2 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_lcn.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_lcn.c
@@ -3388,13 +3388,8 @@ void wlc_lcnphy_deaf_mode(struct brcms_phy *pi, bool mode)
 	u8 phybw40;
 	phybw40 = CHSPEC_IS40(pi->radio_chanspec);
 
-	if (LCNREV_LT(pi->pubpi.phy_rev, 2)) {
-		mod_phy_reg(pi, 0x4b0, (0x1 << 5), (mode) << 5);
-		mod_phy_reg(pi, 0x4b1, (0x1 << 9), 0 << 9);
-	} else {
-		mod_phy_reg(pi, 0x4b0, (0x1 << 5), (mode) << 5);
-		mod_phy_reg(pi, 0x4b1, (0x1 << 9), 0 << 9);
-	}
+	mod_phy_reg(pi, 0x4b0, (0x1 << 5), (mode) << 5);
+	mod_phy_reg(pi, 0x4b1, (0x1 << 9), 0 << 9);
 
 	if (phybw40 == 0) {
 		mod_phy_reg((pi), 0x410,
-- 
2.7.4


* Re: [RFC] net: bump the default number of RSS queues
From: Jakub Kicinski @ 2018-04-05  0:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev, brouer, alexander.duyck, oss-drivers
In-Reply-To: <3f6f40f3-cca2-0f43-8940-05003da88524@gmail.com>

On Tue, 3 Apr 2018 17:20:49 -0700, Eric Dumazet wrote:
> On 04/03/2018 05:14 PM, Jakub Kicinski wrote:
> > Some popular NIC vendors are not adhering to
> > netif_get_num_default_rss_queues() which leads to users being
> > surprised and filing bugs :)  Bump the number of default RX
> > queues to something more reasonable for modern machines.
> > 
> > Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> > ---
> > I'm mostly wondering what's the policy on this default?  When
> > should it be applied?  Why was 8 chosen as the default?  We
> > can abandon using netif_get_num_default_rss_queues() for the
> > nfp but I wonder what's the correct course of action here...
> > Should new drivers use netif_get_num_default_rss_queues() for
> > example?
> > 
> >  include/linux/netdevice.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 2a2d9cf50aa2..26fe145ada2a 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -3260,7 +3260,7 @@ static inline unsigned int get_netdev_rx_queue_index(
> >  }
> >  #endif
> >  
> > -#define DEFAULT_MAX_NUM_RSS_QUEUES	(8)
> > +#define DEFAULT_MAX_NUM_RSS_QUEUES	(64)
> >  int netif_get_num_default_rss_queues(void);  
> 
> There is no evidence having so many queues is beneficial.
> 
> Too many queues -> lots of overhead in many cases.
> 
> So I would rather not touch this, unless you can present good numbers ;)

Thank you for the comment!  I don't have convincing numbers; it was more
of a matter of consistency :)

Now I think I forgot about aRFS, when aRFS support for the nfp is added
we will probably start ignoring the default as well.


* Re: [PATCH v2 bpf-next 0/3] bpf/verifier: subprog/func_call simplifications
From: Edward Cree @ 2018-04-04 23:58 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Daniel Borkmann, netdev
In-Reply-To: <20180403233718.rrzh6ds67hraxhax@ast-mbp>

On 04/04/18 00:37, Alexei Starovoitov wrote:
> hmm. that doesn't fail for me and any other bots didn't complain.
> Are you sure you're running the latest kernel and tests?
Ah, test_progs isn't actually rebuilding because __NR_bpf is undeclared;
 something must be going wrong with header files.
Never mind.

> hmm. what's wrong with bsearch? It's trivial and fast. 
bsearch is O(log n), and the sort() call on the subprog_starts (which happens
 every time add_subprog() is called) is O(n log n).
Whereas reading aux->subprogno is O(1).
As far as I'm concerned, that's a sign that the latter data structure is the
 appropriate one.
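
The complexity argument above can be illustrated with a small user-space sketch. It is not the verifier code: `struct insn_aux`, both function names and the array layout are hypothetical, chosen only to contrast a bsearch() over sorted subprog start offsets with a direct read of a per-instruction subprog number.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback for bsearch() over a sorted int array. */
static int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;

	return (x > y) - (x < y);
}

/* O(log n): find the subprog whose first instruction is exactly at off. */
static int find_subprog_bsearch(const int *starts, size_t cnt, int off)
{
	const int *p = bsearch(&off, starts, cnt, sizeof(*starts), cmp_int);

	return p ? (int)(p - starts) : -1;
}

/* Hypothetical per-instruction record, loosely modelled on aux->subprogno. */
struct insn_aux {
	int subprogno;
};

/* O(1): read the subprog number stored alongside the instruction. */
static int find_subprog_direct(const struct insn_aux *aux, size_t insn_cnt,
			       int off)
{
	assert(off >= 0 && (size_t)off < insn_cnt);
	return aux[off].subprogno;
}
```

Note that in this sketch the direct lookup answers for any instruction, while the bsearch() variant only matches offsets that start a subprog; the array for bsearch() must also be kept sorted as entries are added.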

> Even if we don't see the solution today we have to work towards it.
I guess I'm just not confident "towards" is the direction you think it is.

> Compiler designers could have combined multiple of such passes into
> fewer ones, but it's not done, because it increases complexity and
> causes tough bugs where pass is partially complete.
I'm not trying to combine together multiple 'for bb in prog/for insn in bb'-
 type passes.  The combining I was doing was more on 'for all possible
 execution paths'-type passes, because it's those that explode combinatorially.
Happily I think we can go a long way towards getting rid of them; but while I
 think we can get down to only having 1, I don't think we can reach 0.

> The prime example where more than 4k instructions and loops are mandatory
> is user space stack analysis inside the program. Like walking python stack
> requires non-trivial pointer chasing. With 'pragma unroll' the stack depth
> limit today is ~18. That's not really usable. Looping through 100 python
> frames would require about 16k bpf assembler instructions.
But this would be solved by having support for bounded loops, and I think I've
 successfully shown that this is not inherently incompatible with a do_check()
 style walk.

> Hence do_check approach must go. The rough idea is to compute per basic
> block a set of INs (registers and stack) that basic block needs
> to see to be safe and corresponding set of OUTs.
> Then propagate this knowledge across cfg edges.
> Once we have such set per bpf function, it will essentially answer the question
> 'what arguments this function needs to see to be safe and what it returns'
> To make bpf libraries scale we'd need to keep such information
> around after the verification, so dynamic linking and indirect calls
> are fast to verify.
> It's very high level obviously. There are many gotchas to resolve.
I agree that if we can do this it'll be ideal.  But that's a big 'if'; my
 example code was intended to demonstrate that the "set of INs bb/func needs to
 see to be safe" can be an arbitrarily complicated disjunction, and that instead
 of a combinatorially exploding number of paths to walk (the do_check() approach)
 you now have combinatorially exploding IN-constraints to propagate backwards.

> Please do, since that's my concern with tsort.
> The verifier is the key piece of bpf infra and to be effective maintainers
> we need to thoroughly understand the verifier code.
> We cannot just take the patch based on the cover letter. The author may
> disappear tomorrow and what we're going to do with the code?
I entirely accept this argument.  Unfortunately, when writing explanations,
 it's difficult to know when one has reached something that will be
 understood, so inevitably there will have to be a few iterations to get
 there ;-)


* [PATCH net] net: dsa: Discard frames from unused ports
From: Andrew Lunn @ 2018-04-04 23:56 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Florian Fainelli, Vivien Didelot, Andrew Lunn

The Marvell switches under some conditions will pass a frame to the
host with the port being the CPU port. Such frames are invalid, and
should be dropped. Not dropping them can result in a crash when
incrementing the receive statistics for an invalid port.

Reported-by: Chris Healy <cphealy@gmail.com>
Fixes: 5f6b4e14cada ("net: dsa: User per-cpu 64-bit statistics")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
---
 net/dsa/dsa_priv.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 70de7895e5b8..6c1bf3d9f652 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -126,6 +126,7 @@ static inline struct net_device *dsa_master_find_slave(struct net_device *dev,
 	struct dsa_port *cpu_dp = dev->dsa_ptr;
 	struct dsa_switch_tree *dst = cpu_dp->dst;
 	struct dsa_switch *ds;
+	struct dsa_port *slave_port;
 
 	if (device < 0 || device >= DSA_MAX_SWITCHES)
 		return NULL;
@@ -137,7 +138,12 @@ static inline struct net_device *dsa_master_find_slave(struct net_device *dev,
 	if (port < 0 || port >= ds->num_ports)
 		return NULL;
 
-	return ds->ports[port].slave;
+	slave_port = &ds->ports[port];
+
+	if (slave_port->type != DSA_PORT_TYPE_USER)
+		return NULL;
+
+	return slave_port->slave;
 }
 
 /* port.c */
-- 
2.16.3


* Re: [PATCH iproute2] rdma: Ignore unknown netlink attributes
From: Stephen Hemminger @ 2018-04-04 23:43 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Leon Romanovsky, netdev, RDMA mailing list, David Ahern,
	Steve Wise
In-Reply-To: <20180403072842.32153-1-leon@kernel.org>

On Tue,  3 Apr 2018 10:28:42 +0300
Leon Romanovsky <leon@kernel.org> wrote:

> From: Leon Romanovsky <leonro@mellanox.com>
> 
> The check whether more netlink attributes were supplied than the maximum
> supported is too strict and may lead to backward compatibility issues when
> an old application runs on a newer kernel that supports a new attribute.
> 
> CC: Steve Wise <swise@opengridcomputing.com>
> Fixes: 74bd75c2b68d ("rdma: Add basic infrastructure for RDMA tool")
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>

Applied


* Re: [PATCH iproute2] ip/l2tp: remove offset and peer-offset options
From: Stephen Hemminger @ 2018-04-04 23:43 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: netdev, James Chapman
In-Reply-To: <46d360e852235de84e44c796fa934f9cfce988b1.1522769819.git.g.nault@alphalink.fr>

On Tue, 3 Apr 2018 17:39:54 +0200
Guillaume Nault <g.nault@alphalink.fr> wrote:

> Ignore options "peer-offset" and "offset" when creating sessions. Keep
> them when dumping sessions in order to avoid breaking external scripts.
> 
> "peer-offset" has always been a noop in iproute2. "offset" is now
> ignored in Linux 4.16 (and was broken before that).
> 
> Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>

Sure, this makes sense. Applied.
In theory, you could have just dropped them from the JSON output.


* Re: [PATCH iproute2-next] tc: Correct json output for actions
From: Stephen Hemminger @ 2018-04-04 23:42 UTC (permalink / raw)
  To: Roman Mashak; +Cc: Yuval Mintz, dsahern, mlxsw, netdev
In-Reply-To: <858ta2brwv.fsf@mojatatu.com>

On Wed, 04 Apr 2018 17:09:04 -0400
Roman Mashak <mrv@mojatatu.com> wrote:

> Yuval Mintz <yuvalm@mellanox.com> writes:
> 
> > Commit 9fd3f0b255d9 ("tc: enable json output for actions") added JSON
> > support for tc-actions at the expense of breaking other use cases that
> > reach tc_print_action(), as the latter don't expect the 'actions' array
> > to be a new object.
> >
> > Consider the following taken during a run of the tc_chain.sh selftest,
> > and see the latter command output is broken:
> >
> > $ ./tc/tc -j -p actions list action gact | grep -C 3 actions
> > [ {
> >         "total acts": 1
> >     },{
> >         "actions": [ {
> >                 "order": 0,
> >
> > $ ./tc/tc -p -j -s filter show dev enp3s0np2 ingress | grep -C 3 actions
> >             },
> >             "skip_hw": true,
> >             "not_in_hw": true,{
> >                 "actions": [ {
> >                         "order": 1,
> >                         "kind": "gact",
> >                         "control_action": {
> >
> > Relocate the open/close of the JSON object to declare the object only
> > for the case that needs it.
> >
> > Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>  
> 
> [...]
> 
> 
> Good catch, thanks Yuval.
> 
> Tested-by: Roman Mashak <mrv@mojatatu.com>

Applied


* Re: [PATCH net-next 09/11] devlink: convert occ_get op to separate registration
From: David Ahern @ 2018-04-04 23:00 UTC (permalink / raw)
  To: Jakub Kicinski, Jiri Pirko
  Cc: Ido Schimmel, netdev, davem, jiri, petrm, mlxsw
In-Reply-To: <20180404155910.7baf2ec9@cakuba.netronome.com>

On 4/4/18 4:59 PM, Jakub Kicinski wrote:
> On Wed, 4 Apr 2018 08:25:11 +0200, Jiri Pirko wrote:
>>> Jiri, I am not aware of any other API where a driver registers with it
>>> yet doesn't want the handler to be called so either waits to register  
>>
>> Again, the thing is, this is kind of unusual because of the reload
>> thing. 
> 
> FWIW my knee jerk thought is that it's strange that devlink ops can
> be executed at all while reload is happening (incl. another reload
> request).  I'm probably missing the real issue..
> 

Just responding with the same question ...

Why are you not unregistering resources on a reload?


* Re: [PATCH net-next 09/11] devlink: convert occ_get op to separate registration
From: Jakub Kicinski @ 2018-04-04 22:59 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: David Ahern, Ido Schimmel, netdev, davem, jiri, petrm, mlxsw
In-Reply-To: <20180404062511.GO3313@nanopsycho>

On Wed, 4 Apr 2018 08:25:11 +0200, Jiri Pirko wrote:
> >Jiri, I am not aware of any other API where a driver registers with it
> >yet doesn't want the handler to be called so either waits to register  
> 
> Again, the thing is, this is kind of unusual because of the reload
> thing. 

FWIW my knee jerk thought is that it's strange that devlink ops can
be executed at all while reload is happening (incl. another reload
request).  I'm probably missing the real issue..


* Re: [PATCH net] netns: filter uevents correctly
From: Eric W. Biederman @ 2018-04-04 22:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: davem, gregkh, netdev, linux-kernel, avagin, ktkhai, serge
In-Reply-To: <20180404203048.GA21118@gmail.com>

Christian Brauner <christian.brauner@canonical.com> writes:

> On Wed, Apr 04, 2018 at 09:48:57PM +0200, Christian Brauner wrote:
>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>> 
>> enabled sending hotplug events into all network namespaces back in 2010.
>> Over time the set of uevents that get sent into all network namespaces has
>> shrunk. We have now reached the point where hotplug events for all devices
>> that carry a namespace tag are filtered according to that namespace.
>> 
>> Specifically, they are filtered whenever the namespace tag of the kobject
>> does not match the namespace tag of the netlink socket. One example is
>> network devices. Uevents for network devices only show up in the network
>> namespaces these devices are moved to or created in.
>> 
>> However, any uevent for a kobject that does not have a namespace tag
>> associated with it will not be filtered and we will *try* to broadcast it
>> into all network namespaces.
>> 
>> The original patchset was written in 2010 before user namespaces were a
>> thing. With the introduction of user namespaces sending out uevents became
>> partially isolated as they were filtered by user namespaces:
>> 
>> net/netlink/af_netlink.c:do_one_broadcast()
>> 
>> if (!net_eq(sock_net(sk), p->net)) {
>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>>                 return;
>> 
>>         if (!peernet_has_id(sock_net(sk), p->net))
>>                 return;
>> 
>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>>                              CAP_NET_BROADCAST))
>>                 return;
>> }
>> 
>> The file_ns_capable() check will check whether the caller had
>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>> namespace of interest. This check is fine in general but seems insufficient
>> to me when paired with uevents. The reason is that devices always belong to
>> the initial user namespace so uevents for kobjects that do not carry a
>> namespace tag should never be sent into another user namespace. This has
>> been the intention all along. But there's one case where this breaks,
>> namely if a new user namespace is created by root on the host and an
>> identity mapping is established between root on the host and root in the
>> new user namespace. Here's a reproducer:
>> 
>>  sudo unshare -U --map-root
>>  udevadm monitor -k
>>  # Now change to initial user namespace and e.g. do
>>  modprobe kvm
>>  # or
>>  rmmod kvm
>> 
>> will allow the non-initial user namespace to retrieve all uevents from the
>> host. This seems very anecdotal given that in the general case user
>> namespaces do not see any uevents and also can't really do anything useful
>> with them.
>> 
>> Additionally, it is now possible to send uevents from userspace. As such we
>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>> namespace of the network namespace of the netlink socket) userspace process
>> make a decision what uevents should be sent.
>> 
>> This makes me think that we should simply ensure that uevents for kobjects
>> that do not carry a namespace tag are *always* filtered by user namespace
>> in kobj_bcast_filter(). Specifically:
>> - If the owning user namespace of the uevent socket is not init_user_ns the
>>   event will always be filtered.
>> - If the network namespace the uevent socket belongs to was created in the
>>   initial user namespace but was opened from a non-initial user namespace
>>   the event will be filtered as well.
>> Put another way, uevents for kobjects not carrying a namespace tag are now
>> always only sent to the initial user namespace. The regression potential
>> for this is near to non-existent since user namespaces can't really do
>> anything with interesting devices.
>> 
>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>
> That was supposed to be [PATCH net] not [PATCH net-next] which is
> obviously closed. Sorry about that.

This does not appear to be a fix.
This looks like feature work.
The motivation appears to be "that looks wrong, let's change it".

So let's please leave this for when net-next opens again so we can
have time to fully consider a change in semantics.

Thank you,
Eric


* RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun
From: Jon Rosen (jrosen) @ 2018-04-04 22:31 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Willem de Bruijn, Eric Dumazet, Kees Cook,
	David Windsor, Rosen, Rami, Reshetova, Elena, Mike Maloney,
	Benjamin Poirier, open list:NETWORKING [GENERAL], open list
In-Reply-To: <CAF=yD-LP7RjenkfbFq+dG9XFZx2D7-JwqOeVD+1tLjgXxh8SpA@mail.gmail.com>

> >> >    One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
> >> >    is that the documentation of the tp_status field is somewhat
> >> >    inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
> >> >    meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
> >> >    meaning the entry is owned by user space.  In other places ownership
> >> >    by user space is defined by the TP_STATUS_USER(1) bit being set.
> >>
> >> But indeed this example in packet_mmap.txt is problematic
> >>
> >>     if (status == TP_STATUS_KERNEL)
> >>         retval = poll(&pfd, 1, timeout);
> >>
> >> It does not really matter whether the docs are possibly inconsistent and
> >> which one is authoritative. Examples like the above make it likely that
> >> some user code expects such code to work.
> >
> > Yes, that's exactly my concern.  Yet another troubling example seems to be
> > libpcap which also is looking specifically for status to be anything other than
> > TP_STATUS_KERNEL(0) to indicate a frame is available in user space.
> 
> Good catch. If pcap-linux.c relies on this then the status field
> cannot be changed. Other fields can be modified freely while tp_status
> remains 0, perhaps that's an option.

Possibly. Someone else suggested something similar but in at least the
one example we thought through it still seemed like it didn't address the problem.

For example, let's say we used tp_len == -1 to indicate to other kernel threads
that the entry was already in progress.  This would require that user space never
set tp_len = -1 before returning the entry back to the kernel.  If it did then no
kernel thread would ever claim ownership and the ring would hang.

Now, it seems pretty unlikely that user space would do such a thing so maybe we
could look past that, but then we run into the issue that there is still a window
of opportunity for other kernel threads to come in and wrap the ring.

The reason is we can't set tp_len to the correct length after setting tp_status because
user space could grab the entry and see tp_len == -1 so we have to set tp_len
before we set tp_status. This means that there is still a window where other
kernel threads could come in and see tp_len as something other than -1 and
a tp_status of TP_STATUS_KERNEL and think it's ok to allocate the entry.
This puts us back to where we are today (arguably with a smaller window,
but a window none the less).

Alternatively we could reacquire the spin_lock to then set tp_len followed by
tp_status.  This would give the necessary indivisibility in the kernel while 
preserving proper order as made visible to user space, but it comes at the cost
of another spin_lock.

Thanks for the suggestion.  If you can think of a way around this I'm all ears.
I'll think on this some more but so far I'm stuck on how to get past having to
broaden the scope of the spin_lock, reacquire the spin_lock, or use some sort
of atomic construct along with a parallel shadow ring structure (still thinking
through that one as well).
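
One way the "atomic construct along with a parallel shadow ring" idea might look is sketched below in user-space C11. Everything here is a hypothetical model, not af_packet code: the struct layout, `RING_SIZE`, and the `slot_claim()`/`slot_publish()` helpers are assumptions made for illustration. The shadow flag is private to the writers, so tp_status as seen by user space only ever moves between the kernel-owned and user-owned values.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE	4	/* hypothetical ring size */
#define STATUS_KERNEL	0u	/* mirrors TP_STATUS_KERNEL */
#define STATUS_USER	1u	/* mirrors TP_STATUS_USER */

struct slot {
	unsigned int tp_len;
	atomic_uint tp_status;	/* the only ownership word user space sees */
};

struct ring {
	struct slot slots[RING_SIZE];	/* shared with user space */
	atomic_bool in_use[RING_SIZE];	/* writer-private shadow ring */
};

/* Claim slot idx for one writer; fails if it is owned by anyone else. */
static bool slot_claim(struct ring *r, unsigned int idx)
{
	bool expected = false;

	assert(idx < RING_SIZE);

	/* Take the shadow flag first so only one writer gets past here. */
	if (!atomic_compare_exchange_strong(&r->in_use[idx], &expected, true))
		return false;

	/* The slot must also have been handed back by user space. */
	if (atomic_load(&r->slots[idx].tp_status) != STATUS_KERNEL) {
		atomic_store(&r->in_use[idx], false);
		return false;
	}
	return true;
}

/* Publish a filled slot: length first, then status, then drop the claim. */
static void slot_publish(struct ring *r, unsigned int idx, unsigned int len)
{
	assert(idx < RING_SIZE);
	r->slots[idx].tp_len = len;
	atomic_store(&r->slots[idx].tp_status, STATUS_USER);
	atomic_store(&r->in_use[idx], false);
}
```

In this model a second writer can fail to claim a slot in the short window between the status store and the release of the shadow flag, which drops a frame rather than overrunning the ring. Whether that trade-off and the extra memory are acceptable is exactly the open question in this thread.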


* Re: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun
From: Willem de Bruijn @ 2018-04-04 21:44 UTC (permalink / raw)
  To: Jon Rosen (jrosen)
  Cc: David S. Miller, Willem de Bruijn, Eric Dumazet, Kees Cook,
	David Windsor, Rosen, Rami, Reshetova, Elena, Mike Maloney,
	Benjamin Poirier, open list:NETWORKING [GENERAL], open list
In-Reply-To: <057239cdf1c34bc69636009c9e099214@XCH-RTP-016.cisco.com>

>> >    One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
>> >    is that the documentation of the tp_status field is somewhat
>> >    inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
>> >    meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
>> >    meaning the entry is owned by user space.  In other places ownership
>> >    by user space is defined by the TP_STATUS_USER(1) bit being set.
>>
>> But indeed this example in packet_mmap.txt is problematic
>>
>>     if (status == TP_STATUS_KERNEL)
>>         retval = poll(&pfd, 1, timeout);
>>
>> It does not really matter whether the docs are possibly inconsistent and
>> which one is authoritative. Examples like the above make it likely that
>> some user code expects such code to work.
>
> Yes, that's exactly my concern.  Yet another troubling example seems to be
> libpcap which also is looking specifically for status to be anything other than
> TP_STATUS_KERNEL(0) to indicate a frame is available in user space.

Good catch. If pcap-linux.c relies on this then the status field
cannot be changed. Other fields can be modified freely while tp_status
remains 0, perhaps that's an option.


* Re: [PATCH iproute2-next] tc: Correct json output for actions
From: Roman Mashak @ 2018-04-04 21:09 UTC (permalink / raw)
  To: Yuval Mintz; +Cc: dsahern, mlxsw, netdev
In-Reply-To: <1522844653-37136-1-git-send-email-yuvalm@mellanox.com>

Yuval Mintz <yuvalm@mellanox.com> writes:

> Commit 9fd3f0b255d9 ("tc: enable json output for actions") added JSON
> support for tc-actions at the expense of breaking other use cases that
> reach tc_print_action(), as the latter don't expect the 'actions' array
> to be a new object.
>
> Consider the following taken during a run of the tc_chain.sh selftest,
> and see the latter command output is broken:
>
> $ ./tc/tc -j -p actions list action gact | grep -C 3 actions
> [ {
>         "total acts": 1
>     },{
>         "actions": [ {
>                 "order": 0,
>
> $ ./tc/tc -p -j -s filter show dev enp3s0np2 ingress | grep -C 3 actions
>             },
>             "skip_hw": true,
>             "not_in_hw": true,{
>                 "actions": [ {
>                         "order": 1,
>                         "kind": "gact",
>                         "control_action": {
>
> Relocate the open/close of the JSON object to declare the object only
> for the case that needs it.
>
> Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>

[...]


Good catch, thanks Yuval.

Tested-by: Roman Mashak <mrv@mojatatu.com>

^ permalink raw reply

* Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
From: Md. Islam @ 2018-04-04 21:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, David Miller, David Ahern, Stephen Hemminger,
	Anton Gary Ceph, Pavel Emelyanov, Eric Dumazet,
	alexei.starovoitov
In-Reply-To: <20180404081604.422e8a97@redhat.com>

On Wed, Apr 4, 2018 at 2:16 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Sun, 1 Apr 2018 20:47:28 -0400 "Md. Islam" <mislam4@kent.edu> wrote:
>
>> [...] More specifically, header parsing and fib
>> lookup only takes around 82 ns. This shows that this could be used to
>> implement linerate packet forwarding in kernel.
>
> I cannot resist correcting you...
>
> You didn't specify the link speed, but assuming 10Gbit/s, the
> linerate is 14.88Mpps, which is 67.2 ns between arriving packets. Thus,
> if the lookup cost is 82 ns, you cannot claim linerate performance
> with these numbers.
>
>
> Details:
>
> This is calculated based on the minimum Ethernet frame size,
> 84 bytes; see https://en.wikipedia.org/wiki/Ethernet_frame for why this
> is the minimum size.
>
> 10*10^9/(84*8) = 14,880,952 pps
> 1/last*10^9    = 67.2 ns
>

Yes, it's not actually line-rate forwarding, but it shows the intent
towards that. Currently we are doing many things in fib_table_lookup()
that can be simplified for a router. fib_get_table() and FIB_RES_DEV()
would be simplified if we disable IP_ROUTE_MULTIPATH and
IP_MULTIPLE_TABLES. We can increase throughput by doing less :-)
Moreover, if a network mostly carries larger packets (for instance, a
network exclusively used for video streaming), then a 40Gb NIC
delivers a packet every 300 ns.

40*10^9/(1500*8) = 3.33 Mpps
1/last*10^9    = 300 ns
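
As a sanity check on the arithmetic above, here is a minimal C sketch. The helper names packets_per_second() and ns_per_packet() are made up for illustration, and wire_bytes is whatever per-packet size the caller assumes (84 bytes for a minimum Ethernet frame including overhead, 1500 bytes in the video example, which ignores framing overhead).

```c
#include <assert.h>

/* Packets per second on a link when every packet occupies wire_bytes
 * on the wire (the caller decides what overhead to include).
 */
static double packets_per_second(double link_bps, unsigned int wire_bytes)
{
	assert(wire_bytes > 0);
	return link_bps / (wire_bytes * 8.0);
}

/* Time budget per packet, in nanoseconds, at that packet rate. */
static double ns_per_packet(double link_bps, unsigned int wire_bytes)
{
	return 1e9 / packets_per_second(link_bps, wire_bytes);
}
```

With these helpers, ns_per_packet(10e9, 84) comes out at about 67.2 ns and ns_per_packet(40e9, 1500) at about 300 ns, matching the figures quoted in this thread.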

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net] netns: filter uevents correctly
From: Christian Brauner @ 2018-04-04 20:30 UTC (permalink / raw)
  To: ebiederm, davem, gregkh, netdev, linux-kernel; +Cc: avagin, ktkhai, serge
In-Reply-To: <20180404194857.29375-1-christian.brauner@ubuntu.com>

On Wed, Apr 04, 2018 at 09:48:57PM +0200, Christian Brauner wrote:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> 
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk. We have now reached the point where hotplug events for all devices
> that carry a namespace tag are filtered according to that namespace.
> 
> Specifically, they are filtered whenever the namespace tag of the kobject
> does not match the namespace tag of the netlink socket. One example is
> network devices. Uevents for network devices only show up in the network
> namespaces these devices are moved to or created in.
> 
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will *try* to broadcast it
> into all network namespaces.
> 
> The original patchset was written in 2010 before user namespaces were a
> thing. With the introduction of user namespaces sending out uevents became
> partially isolated as they were filtered by user namespaces:
> 
> net/netlink/af_netlink.c:do_one_broadcast()
> 
> if (!net_eq(sock_net(sk), p->net)) {
>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>                 return;
> 
>         if (!peernet_has_id(sock_net(sk), p->net))
>                 return;
> 
>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>                              CAP_NET_BROADCAST))
>                 return;
> }
> 
> The file_ns_capable() check will check whether the caller had
> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> namespace of interest. This check is fine in general but seems insufficient
> to me when paired with uevents. The reason is that devices always belong to
> the initial user namespace so uevents for kobjects that do not carry a
> namespace tag should never be sent into another user namespace. This has
> been the intention all along. But there's one case where this breaks,
> namely if a new user namespace is created by root on the host and an
> identity mapping is established between root on the host and root in the
> new user namespace. Here's a reproducer:
> 
>  sudo unshare -U --map-root
>  udevadm monitor -k
>  # Now change to initial user namespace and e.g. do
>  modprobe kvm
>  # or
>  rmmod kvm
> 
> This will allow the non-initial user namespace to retrieve all uevents from the
> host. This seems very anecdotal given that in the general case user
> namespaces do not see any uevents and also can't really do anything useful
> with them.
> 
> Additionally, it is now possible to send uevents from userspace. As such we
> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> namespace of the network namespace of the netlink socket) userspace process
> make a decision what uevents should be sent.
> 
> This makes me think that we should simply ensure that uevents for kobjects
> that do not carry a namespace tag are *always* filtered by user namespace
> in kobj_bcast_filter(). Specifically:
> - If the owning user namespace of the uevent socket is not init_user_ns the
>   event will always be filtered.
> - If the network namespace the uevent socket belongs to was created in the
>   initial user namespace but was opened from a non-initial user namespace
>   the event will be filtered as well.
> Put another way, uevents for kobjects not carrying a namespace tag are now
> always only sent to the initial user namespace. The regression potential
> for this is near to non-existent since user namespaces can't really do
> anything with interesting devices.
> 
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

That was supposed to be [PATCH net] not [PATCH net-next] which is
obviously closed. Sorry about that.

Christian

> ---
>  lib/kobject_uevent.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 15ea216a67ce..cb98cddb6e3b 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>  		return sock_ns != ns;
>  	}
>  
> -	return 0;
> +	/*
> +	 * The kobject does not carry a namespace tag so filter by user
> +	 * namespace below.
> +	 */
> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> +		return 1;
> +
> +	/* Check if socket was opened from non-initial user namespace. */
> +	return sk_user_ns(dsk) != &init_user_ns;
>  }
>  #endif
>  
> -- 
> 2.15.1
> 
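
To make the two filtering rules from the commit message concrete, here is a user-space C sketch of the decision for kobjects that carry no namespace tag. The types struct user_ns and struct uevent_sock and the object init_user_ns_model are hypothetical stand-ins for this illustration, not kernel definitions; in the patch itself the two checks use sock_net(dsk)->user_ns and sk_user_ns(dsk).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a user namespace; only identity matters. */
struct user_ns {
	int id;
};

/* Hypothetical model of a uevent netlink socket. */
struct uevent_sock {
	const struct user_ns *netns_owner; /* owner of the socket's net ns */
	const struct user_ns *opener;      /* user ns the socket was opened from */
};

/* Models the initial user namespace. */
static const struct user_ns init_user_ns_model = { 0 };

/* Returns true when an untagged uevent must NOT be delivered. */
static bool filter_untagged_uevent(const struct uevent_sock *s)
{
	assert(s != NULL);

	/* Rule 1: the net ns is owned by a non-initial user ns. */
	if (s->netns_owner != &init_user_ns_model)
		return true;

	/* Rule 2: the socket was opened from a non-initial user ns. */
	return s->opener != &init_user_ns_model;
}
```

In this model only a socket whose network namespace is owned by the initial user namespace, and which was also opened from it, still receives untagged uevents. The unshare -U --map-root reproducer corresponds to the second rule.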

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Andrew Lunn @ 2018-04-04 20:08 UTC (permalink / raw)
  To: David Ahern
  Cc: Siwei Liu, Jiri Pirko, Si-Wei Liu, Michael S. Tsirkin,
	Stephen Hemminger, Alexander Duyck, David Miller,
	Brandeburg, Jesse, Jakub Kicinski, Jason Wang, Samudrala, Sridhar,
	Netdev, virtualization
In-Reply-To: <b0f5e27b-0be1-311e-f3f3-f79af5cd4521@gmail.com>

> Networking vendors have out of tree kernel modules. Those modules use a
> netdev (call it a master netdev, a control netdev, cpu port, whatever)
> to pull packets from the ASIC and deliver to virtual netdevices
> representing physical ports.

Sounds a lot like DSA. Please ask the vendor to contribute the drivers
:-)

> The master netdev should not be mucked with by a user. It should be
> ignored by certain s/w, with lldpd as just one *example*.

I have come across occasional problems with the master device in DSA.
But nothing too serious. Generally the switch will just toss frames it
gets which don't have the needed header, when they come direct from
the master device, rather than via the slave devices.

    Andrew

^ permalink raw reply

* [jkirsher/next-queue, RFC PATCH 3/3] net-sysfs: Add interface for Rx queue map per Tx queue
From: Amritha Nambiar @ 2018-04-04 20:00 UTC (permalink / raw)
  To: intel-wired-lan, jeffrey.t.kirsher
  Cc: alexander.h.duyck, amritha.nambiar, netdev, edumazet,
	sridhar.samudrala, hannes, tom
In-Reply-To: <152287164664.5088.10567280431867626085.stgit@anamdev.jf.intel.com>

Extend transmit queue sysfs attribute to configure Rx queue map
per Tx queue. By default no receive queues are configured for the
Tx queue.

- /sys/class/net/eth0/queues/tx-*/xps_rxqs

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 net/core/net-sysfs.c |   81 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index d7abd33..0654243 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1283,6 +1283,86 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
 
 static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init
 	= __ATTR_RW(xps_cpus);
+
+static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
+{
+	struct net_device *dev = queue->dev;
+	struct xps_dev_maps *dev_maps;
+	unsigned long *mask, index;
+	int j, len, num_tc = 1, tc = 0;
+
+	index = get_netdev_queue_index(queue);
+
+	if (dev->num_tc) {
+		num_tc = dev->num_tc;
+		tc = netdev_txq_to_tc(dev, index);
+		if (tc < 0)
+			return -EINVAL;
+	}
+
+	mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+		       GFP_KERNEL);
+	if (!mask)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_RXQS]);
+	if (dev_maps) {
+		for (j = -1; j = attrmask_next(j, NULL, dev->num_rx_queues),
+		     j < dev->num_rx_queues;) {
+			int i, tci = j * num_tc + tc;
+			struct xps_map *map;
+
+			map = rcu_dereference(dev_maps->attr_map[tci]);
+			if (!map)
+				continue;
+
+			for (i = map->len; i--;) {
+				if (map->queues[i] == index) {
+					set_bit(j, mask);
+					break;
+				}
+			}
+		}
+	}
+
+	len = bitmap_print_to_pagebuf(false, buf, mask, dev->num_rx_queues);
+	rcu_read_unlock();
+	kfree(mask);
+
+	return len < PAGE_SIZE ? len : -EINVAL;
+}
+
+static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
+			      size_t len)
+{
+	struct net_device *dev = queue->dev;
+	unsigned long *mask, index;
+	int err;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+		       GFP_KERNEL);
+	if (!mask)
+		return -ENOMEM;
+
+	index = get_netdev_queue_index(queue);
+
+	err = bitmap_parse(buf, len, mask, dev->num_rx_queues);
+	if (err) {
+		kfree(mask);
+		return err;
+	}
+
+	err = __netif_set_xps_queue(dev, mask, index, XPS_MAP_RXQS);
+	kfree(mask);
+	return err ? : len;
+}
+
+static struct netdev_queue_attribute xps_rxqs_attribute __ro_after_init
+	= __ATTR_RW(xps_rxqs);
 #endif /* CONFIG_XPS */
 
 static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
@@ -1290,6 +1370,7 @@ static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
 	&queue_traffic_class.attr,
 #ifdef CONFIG_XPS
 	&xps_cpus_attribute.attr,
+	&xps_rxqs_attribute.attr,
 	&queue_tx_maxrate.attr,
 #endif
 	NULL
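
For readers following the show path, the slot arithmetic (tci = j * num_tc + tc) and the reverse lookup that builds the printed mask can be sketched in user space as below. This is only a model under stated assumptions: struct map_model, slot_index() and rxqs_for_txq() are hypothetical names, the per-slot array is bounded by a made-up MAX_TXQS_PER_MAP, and the result is a 64-bit word rather than a kernel bitmap, so it covers at most 64 Rx queues.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TXQS_PER_MAP 8 /* hypothetical bound for this sketch */

/* Simplified model of struct xps_map: Tx queues attached to one slot. */
struct map_model {
	unsigned int len;
	unsigned short queues[MAX_TXQS_PER_MAP];
};

/* Slot for Rx queue rxq within traffic class tc. */
static unsigned int slot_index(unsigned int rxq, unsigned int num_tc,
			       unsigned int tc)
{
	assert(num_tc > 0 && tc < num_tc);
	return rxq * num_tc + tc;
}

/* Bitmask of the Rx queues whose slot lists Tx queue txq. */
static unsigned long long rxqs_for_txq(const struct map_model *const *slots,
				       unsigned int num_rxqs,
				       unsigned int num_tc, unsigned int tc,
				       unsigned short txq)
{
	unsigned long long mask = 0;
	unsigned int rxq, i;

	assert(num_rxqs <= 64);
	for (rxq = 0; rxq < num_rxqs; rxq++) {
		const struct map_model *map = slots[slot_index(rxq, num_tc, tc)];

		if (!map)
			continue; /* nothing configured for this Rx queue */

		for (i = 0; i < map->len; i++) {
			if (map->queues[i] == txq) {
				mask |= 1ULL << rxq;
				break;
			}
		}
	}
	return mask;
}
```

With a single traffic class and slots { {0, 3}, NULL, {3} }, Tx queue 3 is reported for Rx queues 0 and 2 (mask 0x5), which is the kind of value a read of xps_rxqs is meant to return for that Tx queue.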

^ permalink raw reply related

* [jkirsher/next-queue, RFC PATCH 2/3] net: Enable Tx queue selection based on Rx queues
From: Amritha Nambiar @ 2018-04-04 20:00 UTC (permalink / raw)
  To: intel-wired-lan, jeffrey.t.kirsher
  Cc: alexander.h.duyck, amritha.nambiar, netdev, edumazet,
	sridhar.samudrala, hannes, tom
In-Reply-To: <152287164664.5088.10567280431867626085.stgit@anamdev.jf.intel.com>

This patch adds support to pick Tx queue based on the Rx queue map
configuration set by the admin through the sysfs attribute
for each Tx queue. If the user configuration for receive
queue map does not apply, then the Tx queue selection falls back
to CPU map based selection and finally to hashing.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 include/net/sock.h       |   18 ++++++++++++++++++
 net/core/dev.c           |   36 ++++++++++++++++++++++++++++++------
 net/core/sock.c          |    5 +++++
 net/ipv4/tcp_input.c     |    7 +++++++
 net/ipv4/tcp_ipv4.c      |    1 +
 net/ipv4/tcp_minisocks.c |    1 +
 6 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 49bd2c1..53d58bc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -139,6 +139,8 @@ typedef __u64 __bitwise __addrpair;
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  *	@skc_tx_queue_mapping: tx queue number for this connection
+ *	@skc_rx_queue_mapping: rx queue number for this connection
+ *	@skc_rx_ifindex: rx ifindex for this connection
  *	@skc_flags: place holder for sk_flags
  *		%SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
  *		%SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -215,6 +217,10 @@ struct sock_common {
 		struct hlist_nulls_node skc_nulls_node;
 	};
 	int			skc_tx_queue_mapping;
+#ifdef CONFIG_XPS
+	int			skc_rx_queue_mapping;
+	int			skc_rx_ifindex;
+#endif
 	union {
 		int		skc_incoming_cpu;
 		u32		skc_rcv_wnd;
@@ -326,6 +332,10 @@ struct sock {
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
 #define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
+#ifdef CONFIG_XPS
+#define sk_rx_queue_mapping	__sk_common.skc_rx_queue_mapping
+#define sk_rx_ifindex		__sk_common.skc_rx_ifindex
+#endif
 
 #define sk_dontcopy_begin	__sk_common.skc_dontcopy_begin
 #define sk_dontcopy_end		__sk_common.skc_dontcopy_end
@@ -1691,6 +1701,14 @@ static inline int sk_tx_queue_get(const struct sock *sk)
 	return sk ? sk->sk_tx_queue_mapping : -1;
 }
 
+static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS
+	sk->sk_rx_ifindex = skb->skb_iif;
+	sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
+#endif
+}
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
 	sk_tx_queue_clear(sk);
diff --git a/net/core/dev.c b/net/core/dev.c
index 4cfc179..d43f1c2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3457,18 +3457,14 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 }
 #endif /* CONFIG_NET_EGRESS */
 
-static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
+			       struct xps_dev_maps *dev_maps, unsigned int tci)
 {
 #ifdef CONFIG_XPS
-	struct xps_dev_maps *dev_maps;
 	struct xps_map *map;
 	int queue_index = -1;
 
-	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]);
 	if (dev_maps) {
-		unsigned int tci = skb->sender_cpu - 1;
-
 		if (dev->num_tc) {
 			tci *= dev->num_tc;
 			tci += netdev_get_prio_tc_map(dev, skb->priority);
@@ -3485,6 +3481,34 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 				queue_index = -1;
 		}
 	}
+	return queue_index;
+#else
+	return -1;
+#endif
+}
+
+static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS
+	enum xps_map_type i = XPS_MAP_RXQS;
+	struct xps_dev_maps *dev_maps;
+	struct sock *sk = skb->sk;
+	int queue_index = -1;
+	unsigned int tci = 0;
+
+	if (sk && sk->sk_rx_queue_mapping < dev->real_num_rx_queues &&
+	    dev->ifindex == sk->sk_rx_ifindex)
+		tci = sk->sk_rx_queue_mapping;
+
+	rcu_read_lock();
+	while (queue_index < 0 && i < __XPS_MAP_MAX) {
+		if (i == XPS_MAP_CPUS)
+			tci = skb->sender_cpu - 1;
+		dev_maps = rcu_dereference(dev->xps_maps[i]);
+		queue_index = __get_xps_queue_idx(dev, skb, dev_maps, tci);
+		i++;
+	}
+
 	rcu_read_unlock();
 
 	return queue_index;
diff --git a/net/core/sock.c b/net/core/sock.c
index 6444525..bd053db 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2817,6 +2817,11 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_pacing_rate = ~0U;
 	sk->sk_pacing_shift = 10;
 	sk->sk_incoming_cpu = -1;
+
+#ifdef CONFIG_XPS
+	sk->sk_rx_ifindex = -1;
+	sk->sk_rx_queue_mapping = -1;
+#endif
 	/*
 	 * Before updating sk_refcnt, we must commit prior changes to memory
 	 * (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 367def6..521b85c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -78,6 +78,7 @@
 #include <linux/errqueue.h>
 #include <trace/events/tcp.h>
 #include <linux/static_key.h>
+#include <net/busy_poll.h>
 
 int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 
@@ -5502,6 +5503,11 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 		__tcp_fast_path_on(tp, tp->snd_wnd);
 	else
 		tp->pred_flags = 0;
+
+	if (skb) {
+		sk_mark_napi_id(sk, skb);
+		sk_mark_rx_queue(sk, skb);
+	}
 }
 
 static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
@@ -6310,6 +6316,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	tcp_rsk(req)->snt_isn = isn;
 	tcp_rsk(req)->txhash = net_tx_rndhash();
 	tcp_openreq_init_rwin(req, sk, dst);
+	sk_mark_rx_queue(req_to_sk(req), skb);
 	if (!want_cookie) {
 		tcp_reqsk_record_syn(sk, req, skb);
 		fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b..132d9af 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1467,6 +1467,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 
 		sock_rps_save_rxhash(sk, skb);
 		sk_mark_napi_id(sk, skb);
+		sk_mark_rx_queue(sk, skb);
 		if (dst) {
 			if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||
 			    !dst->ops->check(dst, 0)) {
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 57b5468..c18d6f2 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -835,6 +835,7 @@ int tcp_child_process(struct sock *parent, struct sock *child,
 
 	/* record NAPI ID of child */
 	sk_mark_napi_id(child, skb);
+	sk_mark_rx_queue(child, skb);
 
 	tcp_segs_in(tcp_sk(child), skb);
 	if (!sock_owned_by_user(child)) {
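
The fallback order described in the commit message (Rx queue map first, then CPU map, then hashing) can be illustrated with the user-space sketch below. It is a deliberately simplified model, not the patch's code: struct dev_model, lookup() and pick_tx_queue() are hypothetical, each slot holds a single Tx queue instead of an xps_map with several candidates, traffic classes are left out, and a plain modulo stands in for the kernel's hash-based fallback.

```c
#include <assert.h>
#include <stddef.h>

/* Same ordering as enum xps_map_type in patch 1/3. */
enum map_type { MAP_RXQS, MAP_CPUS, MAP_MAX };

/* Hypothetical flattened device model: slot -> Tx queue, -1 if unset. */
struct dev_model {
	const int *maps[MAP_MAX];
	unsigned int num_slots[MAP_MAX];
	unsigned int num_txqs;
};

/* Look up one slot; returns -1 when no map or no entry applies. */
static int lookup(const struct dev_model *dev, enum map_type type, int slot)
{
	if (!dev->maps[type] || slot < 0 ||
	    (unsigned int)slot >= dev->num_slots[type])
		return -1;
	return dev->maps[type][slot];
}

/* Try the Rx queue map, then the CPU map, then fall back to hashing. */
static int pick_tx_queue(const struct dev_model *dev, int rx_queue, int cpu,
			 unsigned int flow_hash)
{
	int q;

	assert(dev != NULL && dev->num_txqs > 0);

	q = lookup(dev, MAP_RXQS, rx_queue);
	if (q < 0)
		q = lookup(dev, MAP_CPUS, cpu);
	if (q < 0)
		q = (int)(flow_hash % dev->num_txqs);
	return q;
}
```

In this model a socket with no recorded Rx queue (rx_queue == -1) skips straight to the CPU map, which is the behaviour the commit message describes for the case where the receive queue map does not apply.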

^ permalink raw reply related

* [jkirsher/next-queue, RFC PATCH 1/3] net: Refactor XPS for CPUs and Rx queues
From: Amritha Nambiar @ 2018-04-04 19:59 UTC (permalink / raw)
  To: intel-wired-lan, jeffrey.t.kirsher
  Cc: alexander.h.duyck, amritha.nambiar, netdev, edumazet,
	sridhar.samudrala, hannes, tom
In-Reply-To: <152287164664.5088.10567280431867626085.stgit@anamdev.jf.intel.com>

Refactor XPS code to support Tx queue selection based on
CPU map or Rx queue map.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 include/linux/netdevice.h |   82 +++++++++++++++++-
 net/core/dev.c            |  208 ++++++++++++++++++++++++++++++---------------
 net/core/net-sysfs.c      |    4 -
 3 files changed, 218 insertions(+), 76 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf44503..37dbffe 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -730,10 +730,21 @@ struct xps_map {
  */
 struct xps_dev_maps {
 	struct rcu_head rcu;
-	struct xps_map __rcu *cpu_map[0];
+	struct xps_map __rcu *attr_map[0];
 };
-#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +		\
+
+#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +	\
 	(nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
+
+#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
+	(_rxqs * (_tcs) * sizeof(struct xps_map *)))
+
+enum xps_map_type {
+	XPS_MAP_RXQS,
+	XPS_MAP_CPUS,
+	__XPS_MAP_MAX
+};
+
 #endif /* CONFIG_XPS */
 
 #define TC_MAX_QUEUE	16
@@ -1867,7 +1878,7 @@ struct net_device {
 	int			watchdog_timeo;
 
 #ifdef CONFIG_XPS
-	struct xps_dev_maps __rcu *xps_maps;
+	struct xps_dev_maps __rcu *xps_maps[__XPS_MAP_MAX];
 #endif
 #ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc __rcu	*miniq_egress;
@@ -3204,6 +3215,71 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
 #ifdef CONFIG_XPS
 int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			u16 index);
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+			  u16 index, enum xps_map_type type);
+
+static inline bool attr_test_mask(unsigned long j, const unsigned long *mask,
+				  unsigned int nr_bits)
+{
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+	WARN_ON_ONCE(j >= nr_bits);
+#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+	return test_bit(j, mask);
+}
+
+static inline bool attr_test_online(unsigned long j,
+				    const unsigned long *online_mask,
+				    unsigned int nr_bits)
+{
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+	WARN_ON_ONCE(j >= nr_bits);
+#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+
+	if (online_mask)
+		return test_bit(j, online_mask);
+
+	if (j >= 0 && j < nr_bits)
+		return true;
+
+	return false;
+}
+
+static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
+					 unsigned int nr_bits)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1) {
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+		WARN_ON_ONCE(n >= nr_bits);
+#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+	}
+
+	if (srcp)
+		return find_next_bit(srcp, nr_bits, n + 1);
+
+	return n + 1;
+}
+
+static inline int attrmask_next_and(int n, const unsigned long *src1p,
+				    const unsigned long *src2p,
+				    unsigned int nr_bits)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1) {
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+		WARN_ON_ONCE(n >= nr_bits);
+#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+	}
+
+	if (src1p && src2p)
+		return find_next_and_bit(src1p, src2p, nr_bits, n + 1);
+	else if (src1p)
+		return find_next_bit(src1p, nr_bits, n + 1);
+	else if (src2p)
+		return find_next_bit(src2p, nr_bits, n + 1);
+
+	return n + 1;
+}
 #else
 static inline int netif_set_xps_queue(struct net_device *dev,
 				      const struct cpumask *mask,
diff --git a/net/core/dev.c b/net/core/dev.c
index 9b04a9f..4cfc179 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2091,7 +2091,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
 	int pos;
 
 	if (dev_maps)
-		map = xmap_dereference(dev_maps->cpu_map[tci]);
+		map = xmap_dereference(dev_maps->attr_map[tci]);
 	if (!map)
 		return false;
 
@@ -2104,7 +2104,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
 			break;
 		}
 
-		RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL);
+		RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
 		kfree_rcu(map, rcu);
 		return false;
 	}
@@ -2137,30 +2137,49 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 				   u16 count)
 {
+	const unsigned long *possible_mask = NULL;
+	enum xps_map_type type = XPS_MAP_RXQS;
 	struct xps_dev_maps *dev_maps;
-	int cpu, i;
 	bool active = false;
+	unsigned int nr_ids;
+	int i, j;
 
 	mutex_lock(&xps_map_mutex);
-	dev_maps = xmap_dereference(dev->xps_maps);
 
-	if (!dev_maps)
-		goto out_no_maps;
-
-	for_each_possible_cpu(cpu)
-		active |= remove_xps_queue_cpu(dev, dev_maps, cpu,
-					       offset, count);
+	while (type < __XPS_MAP_MAX) {
+		dev_maps = xmap_dereference(dev->xps_maps[type]);
+		if (!dev_maps)
+			goto out_no_maps;
+
+		if (type == XPS_MAP_CPUS) {
+			if (num_possible_cpus() > 1)
+				possible_mask = cpumask_bits(cpu_possible_mask);
+			nr_ids = nr_cpu_ids;
+		} else if (type == XPS_MAP_RXQS) {
+			nr_ids = dev->num_rx_queues;
+		}
 
-	if (!active) {
-		RCU_INIT_POINTER(dev->xps_maps, NULL);
-		kfree_rcu(dev_maps, rcu);
-	}
+		for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+		     j < nr_ids;){
+			active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
+						       count);
+		}
 
-	for (i = offset + (count - 1); count--; i--)
-		netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
-					     NUMA_NO_NODE);
+		if (!active) {
+			RCU_INIT_POINTER(dev->xps_maps[type], NULL);
+			kfree_rcu(dev_maps, rcu);
+		}
 
+		if (type == XPS_MAP_CPUS) {
+			for (i = offset + (count - 1); count--; i--)
+				netdev_queue_numa_node_write(
+					netdev_get_tx_queue(dev, i),
+							    NUMA_NO_NODE);
+		}
 out_no_maps:
+		type++;
+	}
+
 	mutex_unlock(&xps_map_mutex);
 }
 
@@ -2169,11 +2188,11 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
 	netif_reset_xps_queues(dev, index, dev->num_tx_queues - index);
 }
 
-static struct xps_map *expand_xps_map(struct xps_map *map,
-				      int cpu, u16 index)
+static struct xps_map *expand_xps_map(struct xps_map *map, int attr_index,
+				      u16 index, enum xps_map_type type)
 {
-	struct xps_map *new_map;
 	int alloc_len = XPS_MIN_MAP_ALLOC;
+	struct xps_map *new_map = NULL;
 	int i, pos;
 
 	for (pos = 0; map && pos < map->len; pos++) {
@@ -2182,7 +2201,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
 		return map;
 	}
 
-	/* Need to add queue to this CPU's existing map */
+	/* Need to add tx-queue to this CPU's/rx-queue's existing map */
 	if (map) {
 		if (pos < map->alloc_len)
 			return map;
@@ -2190,9 +2209,14 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
 		alloc_len = map->alloc_len * 2;
 	}
 
-	/* Need to allocate new map to store queue on this CPU's map */
-	new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
-			       cpu_to_node(cpu));
+	/* Need to allocate new map to store tx-queue on this CPU's/rx-queue's
+	 *  map
+	 */
+	if (type == XPS_MAP_RXQS)
+		new_map = kzalloc(XPS_MAP_SIZE(alloc_len), GFP_KERNEL);
+	else if (type == XPS_MAP_CPUS)
+		new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
+				       cpu_to_node(attr_index));
 	if (!new_map)
 		return NULL;
 
@@ -2204,14 +2228,16 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
 	return new_map;
 }
 
-int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
-			u16 index)
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+			  u16 index, enum xps_map_type type)
 {
+	const unsigned long *online_mask = NULL, *possible_mask = NULL;
 	struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
-	int i, cpu, tci, numa_node_id = -2;
+	int i, j, tci, numa_node_id = -2;
 	int maps_sz, num_tc = 1, tc = 0;
 	struct xps_map *map, *new_map;
 	bool active = false;
+	unsigned int nr_ids;
 
 	if (dev->num_tc) {
 		num_tc = dev->num_tc;
@@ -2220,16 +2246,33 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			return -EINVAL;
 	}
 
-	maps_sz = XPS_DEV_MAPS_SIZE(num_tc);
+	switch (type) {
+	case XPS_MAP_RXQS:
+		maps_sz = XPS_RXQ_DEV_MAPS_SIZE(num_tc, dev->num_rx_queues);
+		dev_maps = xmap_dereference(dev->xps_maps[XPS_MAP_RXQS]);
+		nr_ids = dev->num_rx_queues;
+		break;
+	case XPS_MAP_CPUS:
+		maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
+		if (num_possible_cpus() > 1) {
+			online_mask = cpumask_bits(cpu_online_mask);
+			possible_mask = cpumask_bits(cpu_possible_mask);
+		}
+		dev_maps = xmap_dereference(dev->xps_maps[XPS_MAP_CPUS]);
+		nr_ids = nr_cpu_ids;
+		break;
+	default:
+		return -EINVAL;
+	}
+
 	if (maps_sz < L1_CACHE_BYTES)
 		maps_sz = L1_CACHE_BYTES;
 
 	mutex_lock(&xps_map_mutex);
 
-	dev_maps = xmap_dereference(dev->xps_maps);
-
 	/* allocate memory for queue storage */
-	for_each_cpu_and(cpu, cpu_online_mask, mask) {
+	for (j = -1; j = attrmask_next_and(j, online_mask, mask, nr_ids),
+	     j < nr_ids;) {
 		if (!new_dev_maps)
 			new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
 		if (!new_dev_maps) {
@@ -2237,73 +2280,81 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			return -ENOMEM;
 		}
 
-		tci = cpu * num_tc + tc;
-		map = dev_maps ? xmap_dereference(dev_maps->cpu_map[tci]) :
+		tci = j * num_tc + tc;
+		map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
 				 NULL;
 
-		map = expand_xps_map(map, cpu, index);
+		map = expand_xps_map(map, j, index, type);
 		if (!map)
 			goto error;
 
-		RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+		RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 	}
 
 	if (!new_dev_maps)
 		goto out_no_new_maps;
 
-	for_each_possible_cpu(cpu) {
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
 		/* copy maps belonging to foreign traffic classes */
-		for (i = tc, tci = cpu * num_tc; dev_maps && i--; tci++) {
+		for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
 			/* fill in the new device map from the old device map */
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
-			RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
+			RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 		}
 
 		/* We need to explicitly update tci as prevous loop
 		 * could break out early if dev_maps is NULL.
 		 */
-		tci = cpu * num_tc + tc;
+		tci = j * num_tc + tc;
 
-		if (cpumask_test_cpu(cpu, mask) && cpu_online(cpu)) {
-			/* add queue to CPU maps */
+		if (attr_test_mask(j, mask, nr_ids) &&
+		    attr_test_online(j, online_mask, nr_ids)) {
+			/* add tx-queue to CPU/rx-queue maps */
 			int pos = 0;
 
-			map = xmap_dereference(new_dev_maps->cpu_map[tci]);
+			map = xmap_dereference(new_dev_maps->attr_map[tci]);
 			while ((pos < map->len) && (map->queues[pos] != index))
 				pos++;
 
 			if (pos == map->len)
 				map->queues[map->len++] = index;
 #ifdef CONFIG_NUMA
-			if (numa_node_id == -2)
-				numa_node_id = cpu_to_node(cpu);
-			else if (numa_node_id != cpu_to_node(cpu))
-				numa_node_id = -1;
+			if (type == XPS_MAP_CPUS) {
+				if (numa_node_id == -2)
+					numa_node_id = cpu_to_node(j);
+				else if (numa_node_id != cpu_to_node(j))
+					numa_node_id = -1;
+			}
 #endif
 		} else if (dev_maps) {
 			/* fill in the new device map from the old device map */
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
-			RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
+			RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 		}
 
 		/* copy maps belonging to foreign traffic classes */
 		for (i = num_tc - tc, tci++; dev_maps && --i; tci++) {
 			/* fill in the new device map from the old device map */
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
-			RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
+			RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 		}
 	}
 
-	rcu_assign_pointer(dev->xps_maps, new_dev_maps);
+	if (type == XPS_MAP_RXQS)
+		rcu_assign_pointer(dev->xps_maps[XPS_MAP_RXQS], new_dev_maps);
+	else if (type == XPS_MAP_CPUS)
+		rcu_assign_pointer(dev->xps_maps[XPS_MAP_CPUS], new_dev_maps);
 
 	/* Cleanup old maps */
 	if (!dev_maps)
 		goto out_no_old_maps;
 
-	for_each_possible_cpu(cpu) {
-		for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
-			new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
+		for (i = num_tc, tci = j * num_tc; i--; tci++) {
+			new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
 			if (map && map != new_map)
 				kfree_rcu(map, rcu);
 		}
@@ -2316,19 +2367,23 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	active = true;
 
 out_no_new_maps:
-	/* update Tx queue numa node */
-	netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
-				     (numa_node_id >= 0) ? numa_node_id :
-				     NUMA_NO_NODE);
+	if (type == XPS_MAP_CPUS) {
+		/* update Tx queue numa node */
+		netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
+					     (numa_node_id >= 0) ?
+					     numa_node_id : NUMA_NO_NODE);
+	}
 
 	if (!dev_maps)
 		goto out_no_maps;
 
-	/* removes queue from unused CPUs */
-	for_each_possible_cpu(cpu) {
-		for (i = tc, tci = cpu * num_tc; i--; tci++)
+	/* removes tx-queue from unused CPUs/rx-queues */
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
+		for (i = tc, tci = j * num_tc; i--; tci++)
 			active |= remove_xps_queue(dev_maps, tci, index);
-		if (!cpumask_test_cpu(cpu, mask) || !cpu_online(cpu))
+		if (!attr_test_mask(j, mask, nr_ids) ||
+		    !attr_test_online(j, online_mask, nr_ids))
 			active |= remove_xps_queue(dev_maps, tci, index);
 		for (i = num_tc - tc, tci++; --i; tci++)
 			active |= remove_xps_queue(dev_maps, tci, index);
@@ -2336,7 +2391,10 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 
 	/* free map if not active */
 	if (!active) {
-		RCU_INIT_POINTER(dev->xps_maps, NULL);
+		if (type == XPS_MAP_RXQS)
+			RCU_INIT_POINTER(dev->xps_maps[XPS_MAP_RXQS], NULL);
+		else if (type == XPS_MAP_CPUS)
+			RCU_INIT_POINTER(dev->xps_maps[XPS_MAP_CPUS], NULL);
 		kfree_rcu(dev_maps, rcu);
 	}
 
@@ -2346,11 +2404,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	return 0;
 error:
 	/* remove any maps that we added */
-	for_each_possible_cpu(cpu) {
-		for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
-			new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
+		for (i = num_tc, tci = j * num_tc; i--; tci++) {
+			new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
 			map = dev_maps ?
-			      xmap_dereference(dev_maps->cpu_map[tci]) :
+			      xmap_dereference(dev_maps->attr_map[tci]) :
 			      NULL;
 			if (new_map && new_map != map)
 				kfree(new_map);
@@ -2362,6 +2421,13 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	kfree(new_dev_maps);
 	return -ENOMEM;
 }
+
+int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
+			u16 index)
+{
+	return __netif_set_xps_queue(dev, cpumask_bits(mask), index,
+				     XPS_MAP_CPUS);
+}
 EXPORT_SYMBOL(netif_set_xps_queue);
 
 #endif
@@ -3399,7 +3465,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	int queue_index = -1;
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps);
+	dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]);
 	if (dev_maps) {
 		unsigned int tci = skb->sender_cpu - 1;
 
@@ -3408,7 +3474,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 			tci += netdev_get_prio_tc_map(dev, skb->priority);
 		}
 
-		map = rcu_dereference(dev_maps->cpu_map[tci]);
+		map = rcu_dereference(dev_maps->attr_map[tci]);
 		if (map) {
 			if (map->len == 1)
 				queue_index = map->queues[0];
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index c476f07..d7abd33 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1227,13 +1227,13 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 	}
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps);
+	dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]);
 	if (dev_maps) {
 		for_each_possible_cpu(cpu) {
 			int i, tci = cpu * num_tc + tc;
 			struct xps_map *map;
 
-			map = rcu_dereference(dev_maps->cpu_map[tci]);
+			map = rcu_dereference(dev_maps->attr_map[tci]);
 			if (!map)
 				continue;
 


* [jkirsher/next-queue, RFC PATCH 0/3] Symmetric queue selection using XPS for Rx queues
From: Amritha Nambiar @ 2018-04-04 19:59 UTC (permalink / raw)
  To: intel-wired-lan, jeffrey.t.kirsher
  Cc: alexander.h.duyck, amritha.nambiar, netdev, edumazet,
	sridhar.samudrala, hannes, tom

This patch series implements Tx queue selection based on an Rx queue
map. The Rx queue map is configured per Tx queue through a sysfs
attribute. If the user configuration for Rx queues does not apply, Tx
queue selection falls back to XPS using CPUs and finally to hashing.
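
As a rough illustration of the fallback order described above (this is
not code from the series), a minimal userspace sketch in C could look
like the following. The names struct toy_dev, toy_pick_tx_queue() and
NO_QUEUE are hypothetical, and each map entry is reduced to a single Tx
queue, whereas the kernel's struct xps_map can hold several.

```c
#include <assert.h>
#include <stddef.h>

#define NO_QUEUE (-1)

/* Hypothetical, simplified stand-in for the per-device XPS state. */
struct toy_dev {
	int num_tx;             /* number of Tx queues, must be > 0 */
	const int *rxq_to_txq;  /* indexed by Rx queue, may be NULL */
	int num_rxq;
	const int *cpu_to_txq;  /* indexed by CPU, may be NULL */
	int num_cpu;
};

/*
 * Pick a Tx queue: Rx-queue map first, then CPU map, then hash.
 * Assumption: this mirrors only the order stated in the cover letter.
 */
static int toy_pick_tx_queue(const struct toy_dev *dev, int rx_queue,
			     int cpu, unsigned int hash)
{
	assert(dev->num_tx > 0);

	/* 1) User-configured Rx queue map, if it applies. */
	if (dev->rxq_to_txq && rx_queue >= 0 && rx_queue < dev->num_rxq &&
	    dev->rxq_to_txq[rx_queue] != NO_QUEUE)
		return dev->rxq_to_txq[rx_queue];

	/* 2) XPS using CPUs. */
	if (dev->cpu_to_txq && cpu >= 0 && cpu < dev->num_cpu &&
	    dev->cpu_to_txq[cpu] != NO_QUEUE)
		return dev->cpu_to_txq[cpu];

	/* 3) Hashing as the last resort. */
	return (int)(hash % (unsigned int)dev->num_tx);
}
```

With both maps left unset, every packet takes the hashing path, which
matches the default of no receive queues being configured.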

XPS is refactored to support Tx queue selection based on either the
CPU map or the Rx-queue map. The config option CONFIG_XPS needs to be
enabled. By default no receive queues are configured for the Tx queue.
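
To make the refactoring concrete, here is a small sketch of how one
map layout can serve both CPUs and Rx queues, following the
attr_map[tci] indexing (tci = id * num_tc + tc) and the per-type
xps_maps[] array used in patch 1. All toy_* names are hypothetical, and
each entry holds one queue index instead of a struct xps_map pointer.

```c
#include <stddef.h>

/* Hypothetical mirror of the two map types used in the patch. */
enum toy_map_type { TOY_MAP_CPUS, TOY_MAP_RXQS, TOY_MAP_TYPES_MAX };

struct toy_dev_maps {
	unsigned int nr_ids;  /* number of CPUs or Rx queues covered */
	unsigned int num_tc;  /* traffic classes per id */
	const int *attr_map;  /* nr_ids * num_tc entries, -1 = unset */
};

struct toy_netdev {
	/* One map per type; NULL when that type is not configured. */
	const struct toy_dev_maps *xps_maps[TOY_MAP_TYPES_MAX];
};

/* Return the Tx queue for (type, id, tc), or -1 if nothing is mapped. */
static int toy_lookup(const struct toy_netdev *dev, enum toy_map_type type,
		      unsigned int id, unsigned int tc)
{
	const struct toy_dev_maps *maps = dev->xps_maps[type];

	if (!maps || id >= maps->nr_ids || tc >= maps->num_tc)
		return -1;

	/* Flat index: one run of num_tc slots per CPU or Rx queue. */
	return maps->attr_map[(size_t)id * maps->num_tc + tc];
}
```

The same lookup works for either type because only the meaning of id
changes (CPU number or Rx queue number), not the layout.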

- /sys/class/net/eth0/queues/tx-*/xps_rxqs
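
For illustration only, a userspace helper that configures this
attribute might look like the sketch below. It assumes the file accepts
a hexadecimal bitmap of Rx queues, in the same style as the existing
xps_cpus attribute; the cover letter does not state the format, so
treat that as an assumption. The helper names are hypothetical.

```c
#include <stdio.h>

/*
 * Build "/sys/class/net/<ifname>/queues/tx-<txq>/xps_rxqs" into buf.
 * Returns 0 on success, -1 if the path does not fit.
 */
static int xps_rxqs_path(char *buf, size_t len, const char *ifname,
			 unsigned int txq)
{
	int n = snprintf(buf, len, "/sys/class/net/%s/queues/tx-%u/xps_rxqs",
			 ifname, txq);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write an Rx queue bitmap to the given path.
 * Assumption: the attribute parses a hex bitmap (bit N = Rx queue N).
 * Returns 0 on success, -1 on any I/O error.
 */
static int write_rxq_mask(const char *path, unsigned long rxq_mask)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fprintf(f, "%lx\n", rxq_mask) > 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Under that assumption, pairing tx-2 with Rx queue 2 on eth0 would mean
building the path with xps_rxqs_path(buf, sizeof(buf), "eth0", 2) and
writing the mask 0x4.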

This enables sending packets on the same Tx-Rx queue pair, which is
useful for busy-polling multi-threaded workloads where it is not
possible to pin the threads to a CPU. This is a rework of Sridhar's
patch for symmetric queueing via socket option:
https://www.spinics.net/lists/netdev/msg453106.html

---

Amritha Nambiar (3):
      net: Refactor XPS for CPUs and Rx queues
      net: Enable Tx queue selection based on Rx queues
      net-sysfs: Add interface for Rx queue map per Tx queue

 include/linux/netdevice.h |   82 +++++++++++++++
 include/net/sock.h        |   18 +++
 net/core/dev.c            |  242 +++++++++++++++++++++++++++++++--------------
 net/core/net-sysfs.c      |   85 +++++++++++++++-
 net/core/sock.c           |    5 +
 net/ipv4/tcp_input.c      |    7 +
 net/ipv4/tcp_ipv4.c       |    1 
 net/ipv4/tcp_minisocks.c  |    1 
 8 files changed, 360 insertions(+), 81 deletions(-)


* Re: [PATCH 00/15] ARM: pxa: switch to DMA slave maps
From: Boris Brezillon @ 2018-04-04 19:56 UTC (permalink / raw)
  To: Robert Jarzmik
  Cc: Ulf Hansson, alsa-devel, Jaroslav Kysela, linux-ide, netdev,
	linux-mtd, driverdevel, Boris Brezillon, Vinod Koul,
	Richard Weinberger, Takashi Iwai, Marek Vasut, Ezequiel Garcia,
	linux-media, Samuel Ortiz, Arnd Bergmann,
	Bartlomiej Zolnierkiewicz, Haojian Zhuang, dmaengine, Mark Brown,
	Mauro Carvalho Chehab, Linux ARM, Nicolas Pitre,
	Greg Kroah-Hartman
In-Reply-To: <874lkq4urd.fsf@belgarion.home>

On Wed, 04 Apr 2018 21:49:26 +0200
Robert Jarzmik <robert.jarzmik@free.fr> wrote:

> Ulf Hansson <ulf.hansson@linaro.org> writes:
> 
> > On 2 April 2018 at 16:26, Robert Jarzmik <robert.jarzmik@free.fr> wrote:  
> >> Hi,
> >>
> >> This series is aimed at removing the dmaengine slave compat use, and transferring
> >> knowledge of the DMA requestors into architecture code.
> >> As this looks like a patch bomb, each maintainer expressing for his tree either
> >> an Ack or "I want to take through my tree" will be spared in the next iterations
> >> of this series.  
> >
> > Perhaps an option is to send this whole series as a PR for 3.17 rc1, that
> > would remove some churn and make this faster/easier? Well, if you
> > receive the needed acks of course.  
> For 3.17-rc1 it looks a bit optimistic with the review time ... If I have all

Especially since 3.17-rc1 was released more than 3 years ago :-),
but I guess you meant 4.17-rc1.

> acks, I'll queue it into my pxa tree. If at least one maintainer withholds his
> ack, the end of the series (phase 3) won't be applied until it is sorted out.
> 
> Cheers.
> 
> --
> Robert

