* Re: [PATCH iproute2-next] tc: Correct json output for actions
From: Roman Mashak @ 2018-04-04 21:09 UTC (permalink / raw)
To: Yuval Mintz; +Cc: dsahern, mlxsw, netdev
In-Reply-To: <1522844653-37136-1-git-send-email-yuvalm@mellanox.com>
Yuval Mintz <yuvalm@mellanox.com> writes:
> Commit 9fd3f0b255d9 ("tc: enable json output for actions") added JSON
> support for tc-actions at the expense of breaking other use cases that
> reach tc_print_action(), as the latter don't expect the 'actions' array
> to be a new object.
>
> Consider the following, taken during a run of the tc_chain.sh selftest,
> and note that the output of the latter command is broken:
>
> $ ./tc/tc -j -p actions list action gact | grep -C 3 actions
> [ {
> "total acts": 1
> },{
> "actions": [ {
> "order": 0,
>
> $ ./tc/tc -p -j -s filter show dev enp3s0np2 ingress | grep -C 3 actions
> },
> "skip_hw": true,
> "not_in_hw": true,{
> "actions": [ {
> "order": 1,
> "kind": "gact",
> "control_action": {
>
> Relocate the open/close of the JSON object to declare the object only
> for the case that needs it.
>
> Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
[...]
Good catch, thanks Yuval.
Tested-by: Roman Mashak <mrv@mojatatu.com>
^ permalink raw reply
* Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
From: Md. Islam @ 2018-04-04 21:09 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, David Miller, David Ahern, Stephen Hemminger,
Anton Gary Ceph, Pavel Emelyanov, Eric Dumazet,
alexei.starovoitov
In-Reply-To: <20180404081604.422e8a97@redhat.com>
On Wed, Apr 4, 2018 at 2:16 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Sun, 1 Apr 2018 20:47:28 -0400 "Md. Islam" <mislam4@kent.edu> wrote:
>
>> [...] More specifically, header parsing and fib
>> lookup only takes around 82 ns. This shows that this could be used to
>> implement linerate packet forwarding in kernel.
>
> I cannot resist correcting you...
>
> You didn't specify the link speed, but assuming 10Gbit/s, then the
> linerate is 14.88Mpps, which is 67.2 ns between arriving packets. Thus,
> if the lookup cost is 82 ns, you cannot claim linerate performance
> with these numbers.
>
>
> Details:
>
> This is calculated based on the minimum Ethernet frame size
> 84-bytes, see https://en.wikipedia.org/wiki/Ethernet_frame for why this
> is the minimum size.
>
> 10*10^9/(84*8) = 14,880,952 pps
> 1/last*10^9 = 67.2 ns
>
Yes, it's not actually line-rate forwarding, but it shows the intent
towards that. Currently we are doing many things in fib_table_lookup()
that can be simplified for a router. fib_get_table() and FIB_RES_DEV()
would be simplified if we disable IP_ROUTE_MULTIPATH and
IP_MULTIPLE_TABLES. We can increase throughput by doing less :-)
Moreover, if a network mostly carries larger packets (for instance, a
network used exclusively for video streaming), then a 40Gb NIC
delivers a packet roughly every 300 ns:
40*10^9/(1500*8) = 3.33 Mpps
1/last*10^9 = 300 ns
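The arithmetic in this exchange can be reproduced with a small userspace
sketch. The helper names (packets_per_second, interarrival_ns,
print_budget) are made up for illustration and are not kernel functions.
The frame sizes are the ones quoted in the thread: 84 bytes on the wire
for a minimum-size frame, and 1500 bytes for the large-packet case, which
ignores framing overhead.

```c
#include <stdio.h>

/* Packets per second on a link of link_bps bits/s when every frame
 * occupies frame_bytes bytes on the wire. */
double packets_per_second(double link_bps, double frame_bytes)
{
	return link_bps / (frame_bytes * 8.0);
}

/* Time budget available per packet, in nanoseconds. */
double interarrival_ns(double link_bps, double frame_bytes)
{
	return 1e9 / packets_per_second(link_bps, frame_bytes);
}

/* Print both figures for one link speed / frame size pair. */
void print_budget(const char *label, double link_bps, double frame_bytes)
{
	printf("%s: %.0f pps, %.1f ns per packet\n", label,
	       packets_per_second(link_bps, frame_bytes),
	       interarrival_ns(link_bps, frame_bytes));
}
```

Calling print_budget("10G, 84B", 10e9, 84) gives the per-packet budget
that the 82 ns lookup cost is compared against, and
print_budget("40G, 1500B", 40e9, 1500) gives the large-packet case.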
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply
* Re: [PATCH net] netns: filter uevents correctly
From: Christian Brauner @ 2018-04-04 20:30 UTC (permalink / raw)
To: ebiederm, davem, gregkh, netdev, linux-kernel; +Cc: avagin, ktkhai, serge
In-Reply-To: <20180404194857.29375-1-christian.brauner@ubuntu.com>
On Wed, Apr 04, 2018 at 09:48:57PM +0200, Christian Brauner wrote:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk. We have now reached the point where hotplug events for all devices
> that carry a namespace tag are filtered according to that namespace.
>
> Specifically, they are filtered whenever the namespace tag of the kobject
> does not match the namespace tag of the netlink socket. One example is
> network devices. Uevents for network devices only show up in the network
> namespaces these devices are moved to or created in.
>
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will *try* to broadcast it
> into all network namespaces.
>
> The original patchset was written in 2010 before user namespaces were a
> thing. With the introduction of user namespaces sending out uevents became
> partially isolated as they were filtered by user namespaces:
>
> net/netlink/af_netlink.c:do_one_broadcast()
>
> if (!net_eq(sock_net(sk), p->net)) {
> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> return;
>
> if (!peernet_has_id(sock_net(sk), p->net))
> return;
>
> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> CAP_NET_BROADCAST))
> return;
> }
>
> The file_ns_capable() check will check whether the caller had
> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> namespace of interest. This check is fine in general but seems insufficient
> to me when paired with uevents. The reason is that devices always belong to
> the initial user namespace so uevents for kobjects that do not carry a
> namespace tag should never be sent into another user namespace. This has
> been the intention all along. But there's one case where this breaks,
> namely if a new user namespace is created by root on the host and an
> identity mapping is established between root on the host and root in the
> new user namespace. Here's a reproducer:
>
> sudo unshare -U --map-root
> udevadm monitor -k
> # Now change to initial user namespace and e.g. do
> modprobe kvm
> # or
> rmmod kvm
>
> will allow the non-initial user namespace to retrieve all uevents from the
> host. This seems very anecdotal given that in the general case user
> namespaces do not see any uevents and also can't really do anything useful
> with them.
>
> Additionally, it is now possible to send uevents from userspace. As such we
> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> namespace of the network namespace of the netlink socket) userspace process
> make a decision what uevents should be sent.
>
> This makes me think that we should simply ensure that uevents for kobjects
> that do not carry a namespace tag are *always* filtered by user namespace
> in kobj_bcast_filter(). Specifically:
> - If the owning user namespace of the uevent socket is not init_user_ns the
> event will always be filtered.
> - If the network namespace the uevent socket belongs to was created in the
> initial user namespace but was opened from a non-initial user namespace
> the event will be filtered as well.
> Put another way, uevents for kobjects not carrying a namespace tag are now
> always only sent to the initial user namespace. The regression potential
> for this is near to non-existent since user namespaces can't really do
> anything with interesting devices.
>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
That was supposed to be [PATCH net] not [PATCH net-next] which is
obviously closed. Sorry about that.
Christian
> ---
> lib/kobject_uevent.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 15ea216a67ce..cb98cddb6e3b 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> return sock_ns != ns;
> }
>
> - return 0;
> + /*
> + * The kobject does not carry a namespace tag so filter by user
> + * namespace below.
> + */
> + if (sock_net(dsk)->user_ns != &init_user_ns)
> + return 1;
> +
> + /* Check if socket was opened from non-initial user namespace. */
> + return sk_user_ns(dsk) != &init_user_ns;
> }
> #endif
>
> --
> 2.15.1
>
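The filtering rule described in the quoted commit message can be modelled
in userspace as follows. This is a simplified sketch, not the kernel code:
the struct and function names (user_ns, net_ns, uevent_sock,
uevent_filtered) are stand-ins invented for illustration, and the real
kobj_bcast_filter() performs additional checks on the kobject's namespace
type that are omitted here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel objects involved; only what the
 * filtering decision needs is modelled. */
struct user_ns { int id; };
struct net_ns { const struct user_ns *owner; };

struct uevent_sock {
	const struct net_ns *net;         /* network namespace of the socket */
	const struct user_ns *opener_ns;  /* user namespace it was opened from */
};

/* Returns true when the uevent must NOT be delivered to the socket.
 * kobj_ns is NULL when the kobject carries no namespace tag. */
bool uevent_filtered(const struct uevent_sock *sk,
		     const struct net_ns *kobj_ns,
		     const struct user_ns *init_user_ns)
{
	/* Tagged kobject: deliver only into the matching network namespace. */
	if (kobj_ns)
		return sk->net != kobj_ns;

	/* Untagged kobject: the socket's network namespace must be owned
	 * by the initial user namespace ... */
	if (sk->net->owner != init_user_ns)
		return true;

	/* ... and the socket must have been opened from it as well. */
	return sk->opener_ns != init_user_ns;
}
```

In this model, the "unshare -U --map-root" reproducer corresponds to a
socket whose net is the host network namespace but whose opener_ns is
not the initial user namespace, so the second check filters the event.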
^ permalink raw reply
* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Andrew Lunn @ 2018-04-04 20:08 UTC (permalink / raw)
To: David Ahern
Cc: Siwei Liu, Jiri Pirko, Si-Wei Liu, Michael S. Tsirkin,
Stephen Hemminger, Alexander Duyck, David Miller,
Brandeburg, Jesse, Jakub Kicinski, Jason Wang, Samudrala, Sridhar,
Netdev, virtualization
In-Reply-To: <b0f5e27b-0be1-311e-f3f3-f79af5cd4521@gmail.com>
> Networking vendors have out of tree kernel modules. Those modules use a
> netdev (call it a master netdev, a control netdev, cpu port, whatever)
> to pull packets from the ASIC and deliver to virtual netdevices
> representing physical ports.
Sounds a lot like DSA. Please ask the vendor to contribute the drivers
:-)
> The master netdev should not be mucked with by a user. It should be
> ignored by certain s/w with lldpd as just an *example*.
I have come across occasional problems with the master device in DSA.
But nothing too serious. Generally the switch will just toss frames it
gets which don't have the needed header, when they come direct from
the master device, rather than via the slave devices.
Andrew
^ permalink raw reply
* [jkirsher/next-queue, RFC PATCH 3/3] net-sysfs: Add interface for Rx queue map per Tx queue
From: Amritha Nambiar @ 2018-04-04 20:00 UTC (permalink / raw)
To: intel-wired-lan, jeffrey.t.kirsher
Cc: alexander.h.duyck, amritha.nambiar, netdev, edumazet,
sridhar.samudrala, hannes, tom
In-Reply-To: <152287164664.5088.10567280431867626085.stgit@anamdev.jf.intel.com>
Extend transmit queue sysfs attribute to configure Rx queue map
per Tx queue. By default no receive queues are configured for the
Tx queue.
- /sys/class/net/eth0/queues/tx-*/xps_rxqs
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
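Since xps_rxqs_store() in this patch parses its input with
bitmap_parse(), the attribute takes a hexadecimal bitmask in which bit N
selects Rx queue N. The userspace helper below is a hypothetical sketch
(rxq_mask_to_hex is not part of the patch) that builds such a string for
a list of Rx queue indices. It is limited to queues 0..31 so that the
mask fits in a single 32-bit group.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format the Rx queue indices in rxqs[0..n-1] as a hex bitmask string.
 * Only queues 0..31 are supported in this sketch.
 * Returns 0 on success, -1 on a bad index or a too-small buffer. */
int rxq_mask_to_hex(const unsigned int *rxqs, size_t n, char *buf, size_t len)
{
	uint32_t mask = 0;
	size_t i;
	int written;

	for (i = 0; i < n; i++) {
		if (rxqs[i] >= 32)
			return -1;
		mask |= UINT32_C(1) << rxqs[i];
	}

	written = snprintf(buf, len, "%x", (unsigned int)mask);
	if (written < 0 || (size_t)written >= len)
		return -1;

	return 0;
}
```

For Rx queues 0 and 2 the helper produces "5", which would then be
written to the tx-*/xps_rxqs file of the chosen Tx queue.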
net/core/net-sysfs.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 81 insertions(+)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index d7abd33..0654243 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1283,6 +1283,86 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init
= __ATTR_RW(xps_cpus);
+
+static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
+{
+ struct net_device *dev = queue->dev;
+ struct xps_dev_maps *dev_maps;
+ unsigned long *mask, index;
+ int j, len, num_tc = 1, tc = 0;
+
+ mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+ GFP_KERNEL);
+ if (!mask)
+ return -ENOMEM;
+
+ index = get_netdev_queue_index(queue);
+
+ if (dev->num_tc) {
+ num_tc = dev->num_tc;
+ tc = netdev_txq_to_tc(dev, index);
+ if (tc < 0)
+ return -EINVAL;
+ }
+
+ rcu_read_lock();
+ dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_RXQS]);
+ if (dev_maps) {
+ for (j = -1; j = attrmask_next(j, NULL, dev->num_rx_queues),
+ j < dev->num_rx_queues;) {
+ int i, tci = j * num_tc + tc;
+ struct xps_map *map;
+
+ map = rcu_dereference(dev_maps->attr_map[tci]);
+ if (!map)
+ continue;
+
+ for (i = map->len; i--;) {
+ if (map->queues[i] == index) {
+ set_bit(j, mask);
+ break;
+ }
+ }
+ }
+ }
+
+ len = bitmap_print_to_pagebuf(false, buf, mask, dev->num_rx_queues);
+ rcu_read_unlock();
+ kfree(mask);
+
+ return len < PAGE_SIZE ? len : -EINVAL;
+}
+
+static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
+ size_t len)
+{
+ struct net_device *dev = queue->dev;
+ unsigned long *mask, index;
+ int err;
+
+ if (!capable(CAP_NET_ADMIN))
+ return -EPERM;
+
+ mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+ GFP_KERNEL);
+ if (!mask)
+ return -ENOMEM;
+
+ index = get_netdev_queue_index(queue);
+
+ err = bitmap_parse(buf, len, mask, dev->num_rx_queues);
+ if (err) {
+ kfree(mask);
+ return err;
+ }
+
+ err = __netif_set_xps_queue(dev, mask, index, XPS_MAP_RXQS);
+ kfree(mask);
+ return err ? : len;
+}
+
+static struct netdev_queue_attribute xps_rxqs_attribute __ro_after_init
+ = __ATTR_RW(xps_rxqs);
#endif /* CONFIG_XPS */
static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
@@ -1290,6 +1370,7 @@ static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
&queue_traffic_class.attr,
#ifdef CONFIG_XPS
&xps_cpus_attribute.attr,
+ &xps_rxqs_attribute.attr,
&queue_tx_maxrate.attr,
#endif
NULL
^ permalink raw reply related
* [jkirsher/next-queue, RFC PATCH 2/3] net: Enable Tx queue selection based on Rx queues
From: Amritha Nambiar @ 2018-04-04 20:00 UTC (permalink / raw)
To: intel-wired-lan, jeffrey.t.kirsher
Cc: alexander.h.duyck, amritha.nambiar, netdev, edumazet,
sridhar.samudrala, hannes, tom
In-Reply-To: <152287164664.5088.10567280431867626085.stgit@anamdev.jf.intel.com>
This patch adds support to pick Tx queue based on the Rx queue map
configuration set by the admin through the sysfs attribute
for each Tx queue. If the user configuration for receive
queue map does not apply, then the Tx queue selection falls back
to CPU map based selection and finally to hashing.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
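The fallback order described above (Rx queue map first, then CPU map,
then hashing) can be sketched in userspace as follows. This is a
simplified model, not the kernel implementation: the names (map_type,
xps_model, lookup, pick_tx_queue) are invented for illustration, each key
maps to at most one Tx queue, traffic classes are ignored, and the final
hashing step is reduced to a plain modulo.

```c
#include <stddef.h>

enum map_type { MAP_RXQS, MAP_CPUS, MAP_MAX };

/* One lookup table per map type: entry[t][key] is a Tx queue index,
 * or -1 when nothing is configured for that key. */
struct xps_model {
	const int *entry[MAP_MAX];
	size_t nr_keys[MAP_MAX];
};

/* Returns the configured Tx queue for key, or -1 if there is none. */
int lookup(const struct xps_model *m, enum map_type t, int key)
{
	if (!m->entry[t] || key < 0 || (size_t)key >= m->nr_keys[t])
		return -1;
	return m->entry[t][key];
}

/* Try the Rx queue map, then the CPU map, then fall back to hashing.
 * nr_tx_queues must be non-zero. */
int pick_tx_queue(const struct xps_model *m, int rx_queue, int cpu,
		  unsigned int flow_hash, unsigned int nr_tx_queues)
{
	int q = lookup(m, MAP_RXQS, rx_queue);

	if (q < 0)
		q = lookup(m, MAP_CPUS, cpu);
	if (q < 0)
		q = (int)(flow_hash % nr_tx_queues);

	return q;
}
```

A socket with no recorded Rx queue (rx_queue == -1 in this model) skips
straight to the CPU map, which matches the behaviour before this patch.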
include/net/sock.h | 18 ++++++++++++++++++
net/core/dev.c | 36 ++++++++++++++++++++++++++++++------
net/core/sock.c | 5 +++++
net/ipv4/tcp_input.c | 7 +++++++
net/ipv4/tcp_ipv4.c | 1 +
net/ipv4/tcp_minisocks.c | 1 +
6 files changed, 62 insertions(+), 6 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 49bd2c1..53d58bc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -139,6 +139,8 @@ typedef __u64 __bitwise __addrpair;
* @skc_node: main hash linkage for various protocol lookup tables
* @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
* @skc_tx_queue_mapping: tx queue number for this connection
+ * @skc_rx_queue_mapping: rx queue number for this connection
+ * @skc_rx_ifindex: rx ifindex for this connection
* @skc_flags: place holder for sk_flags
* %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
* %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -215,6 +217,10 @@ struct sock_common {
struct hlist_nulls_node skc_nulls_node;
};
int skc_tx_queue_mapping;
+#ifdef CONFIG_XPS
+ int skc_rx_queue_mapping;
+ int skc_rx_ifindex;
+#endif
union {
int skc_incoming_cpu;
u32 skc_rcv_wnd;
@@ -326,6 +332,10 @@ struct sock {
#define sk_nulls_node __sk_common.skc_nulls_node
#define sk_refcnt __sk_common.skc_refcnt
#define sk_tx_queue_mapping __sk_common.skc_tx_queue_mapping
+#ifdef CONFIG_XPS
+#define sk_rx_queue_mapping __sk_common.skc_rx_queue_mapping
+#define sk_rx_ifindex __sk_common.skc_rx_ifindex
+#endif
#define sk_dontcopy_begin __sk_common.skc_dontcopy_begin
#define sk_dontcopy_end __sk_common.skc_dontcopy_end
@@ -1691,6 +1701,14 @@ static inline int sk_tx_queue_get(const struct sock *sk)
return sk ? sk->sk_tx_queue_mapping : -1;
}
+static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS
+ sk->sk_rx_ifindex = skb->skb_iif;
+ sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
+#endif
+}
+
static inline void sk_set_socket(struct sock *sk, struct socket *sock)
{
sk_tx_queue_clear(sk);
diff --git a/net/core/dev.c b/net/core/dev.c
index 4cfc179..d43f1c2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3457,18 +3457,14 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
}
#endif /* CONFIG_NET_EGRESS */
-static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
+ struct xps_dev_maps *dev_maps, unsigned int tci)
{
#ifdef CONFIG_XPS
- struct xps_dev_maps *dev_maps;
struct xps_map *map;
int queue_index = -1;
- rcu_read_lock();
- dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]);
if (dev_maps) {
- unsigned int tci = skb->sender_cpu - 1;
-
if (dev->num_tc) {
tci *= dev->num_tc;
tci += netdev_get_prio_tc_map(dev, skb->priority);
@@ -3485,6 +3481,34 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
queue_index = -1;
}
}
+ return queue_index;
+#else
+ return -1;
+#endif
+}
+
+static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS
+ enum xps_map_type i = XPS_MAP_RXQS;
+ struct xps_dev_maps *dev_maps;
+ struct sock *sk = skb->sk;
+ int queue_index = -1;
+ unsigned int tci = 0;
+
+ if (sk && sk->sk_rx_queue_mapping <= dev->real_num_rx_queues &&
+ dev->ifindex == sk->sk_rx_ifindex)
+ tci = sk->sk_rx_queue_mapping;
+
+ rcu_read_lock();
+ while (queue_index < 0 && i < __XPS_MAP_MAX) {
+ if (i == XPS_MAP_CPUS)
+ tci = skb->sender_cpu - 1;
+ dev_maps = rcu_dereference(dev->xps_maps[i]);
+ queue_index = __get_xps_queue_idx(dev, skb, dev_maps, tci);
+ i++;
+ }
+
rcu_read_unlock();
return queue_index;
diff --git a/net/core/sock.c b/net/core/sock.c
index 6444525..bd053db 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2817,6 +2817,11 @@ void sock_init_data(struct socket *sock, struct sock *sk)
sk->sk_pacing_rate = ~0U;
sk->sk_pacing_shift = 10;
sk->sk_incoming_cpu = -1;
+
+#ifdef CONFIG_XPS
+ sk->sk_rx_ifindex = -1;
+ sk->sk_rx_queue_mapping = -1;
+#endif
/*
* Before updating sk_refcnt, we must commit prior changes to memory
* (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 367def6..521b85c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -78,6 +78,7 @@
#include <linux/errqueue.h>
#include <trace/events/tcp.h>
#include <linux/static_key.h>
+#include <net/busy_poll.h>
int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
@@ -5502,6 +5503,11 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
__tcp_fast_path_on(tp, tp->snd_wnd);
else
tp->pred_flags = 0;
+
+ if (skb) {
+ sk_mark_napi_id(sk, skb);
+ sk_mark_rx_queue(sk, skb);
+ }
}
static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
@@ -6310,6 +6316,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
tcp_rsk(req)->snt_isn = isn;
tcp_rsk(req)->txhash = net_tx_rndhash();
tcp_openreq_init_rwin(req, sk, dst);
+ sk_mark_rx_queue(req_to_sk(req), skb);
if (!want_cookie) {
tcp_reqsk_record_syn(sk, req, skb);
fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b..132d9af 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1467,6 +1467,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
sock_rps_save_rxhash(sk, skb);
sk_mark_napi_id(sk, skb);
+ sk_mark_rx_queue(sk, skb);
if (dst) {
if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||
!dst->ops->check(dst, 0)) {
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 57b5468..c18d6f2 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -835,6 +835,7 @@ int tcp_child_process(struct sock *parent, struct sock *child,
/* record NAPI ID of child */
sk_mark_napi_id(child, skb);
+ sk_mark_rx_queue(child, skb);
tcp_segs_in(tcp_sk(child), skb);
if (!sock_owned_by_user(child)) {
^ permalink raw reply related
* [jkirsher/next-queue, RFC PATCH 1/3] net: Refactor XPS for CPUs and Rx queues
From: Amritha Nambiar @ 2018-04-04 19:59 UTC (permalink / raw)
To: intel-wired-lan, jeffrey.t.kirsher
Cc: alexander.h.duyck, amritha.nambiar, netdev, edumazet,
sridhar.samudrala, hannes, tom
In-Reply-To: <152287164664.5088.10567280431867626085.stgit@anamdev.jf.intel.com>
Refactor XPS code to support Tx queue selection based on
CPU map or Rx queue map.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
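The refactoring replaces the CPU-specific iterators with a generic walk
over either CPUs or Rx queues, and indexes the flat attr_map[] array by
"index * num_tc + tc". The userspace sketch below illustrates both
ideas. The names (next_index, map_slot, walk) are hypothetical, and the
mask is limited to a single unsigned long, unlike the kernel's
find_next_bit().

```c
#include <limits.h>
#include <stddef.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Next index after n (n == -1 starts the walk). With a NULL mask every
 * index counts as set; otherwise skip ahead to the next set bit.
 * Returns nr_bits once the walk is finished. Single-word masks only. */
unsigned int next_index(int n, const unsigned long *mask, unsigned int nr_bits)
{
	unsigned int i = (unsigned int)(n + 1);

	if (!mask)
		return i < nr_bits ? i : nr_bits;

	for (; i < nr_bits && i < WORD_BITS; i++)
		if (*mask & (1UL << i))
			return i;

	return nr_bits;
}

/* Slot in the flat map array for CPU/Rx queue j and traffic class tc. */
unsigned int map_slot(unsigned int j, unsigned int num_tc, unsigned int tc)
{
	return j * num_tc + tc;
}

/* Record every visited index in out[]; returns how many were stored. */
size_t walk(const unsigned long *mask, unsigned int nr_bits,
	    unsigned int *out, size_t cap)
{
	size_t count = 0;
	unsigned int j;

	for (j = next_index(-1, mask, nr_bits); j < nr_bits;
	     j = next_index((int)j, mask, nr_bits))
		if (count < cap)
			out[count++] = j;

	return count;
}
```

Passing a NULL mask corresponds to the Rx queue case in the patch, where
there is no "possible" mask and every queue index below the limit is
visited.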
include/linux/netdevice.h | 82 +++++++++++++++++-
net/core/dev.c | 208 ++++++++++++++++++++++++++++++---------------
net/core/net-sysfs.c | 4 -
3 files changed, 218 insertions(+), 76 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf44503..37dbffe 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -730,10 +730,21 @@ struct xps_map {
*/
struct xps_dev_maps {
struct rcu_head rcu;
- struct xps_map __rcu *cpu_map[0];
+ struct xps_map __rcu *attr_map[0];
};
-#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
+
+#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
(nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
+
+#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
+ (_rxqs * (_tcs) * sizeof(struct xps_map *)))
+
+enum xps_map_type {
+ XPS_MAP_RXQS,
+ XPS_MAP_CPUS,
+ __XPS_MAP_MAX
+};
+
#endif /* CONFIG_XPS */
#define TC_MAX_QUEUE 16
@@ -1867,7 +1878,7 @@ struct net_device {
int watchdog_timeo;
#ifdef CONFIG_XPS
- struct xps_dev_maps __rcu *xps_maps;
+ struct xps_dev_maps __rcu *xps_maps[__XPS_MAP_MAX];
#endif
#ifdef CONFIG_NET_CLS_ACT
struct mini_Qdisc __rcu *miniq_egress;
@@ -3204,6 +3215,71 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
#ifdef CONFIG_XPS
int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
u16 index);
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+ u16 index, enum xps_map_type type);
+
+static inline bool attr_test_mask(unsigned long j, const unsigned long *mask,
+ unsigned int nr_bits)
+{
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+ WARN_ON_ONCE(j >= nr_bits);
+#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+ return test_bit(j, mask);
+}
+
+static inline bool attr_test_online(unsigned long j,
+ const unsigned long *online_mask,
+ unsigned int nr_bits)
+{
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+ WARN_ON_ONCE(j >= nr_bits);
+#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+
+ if (online_mask)
+ return test_bit(j, online_mask);
+
+ if (j >= 0 && j < nr_bits)
+ return true;
+
+ return false;
+}
+
+static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
+ unsigned int nr_bits)
+{
+ /* -1 is a legal arg here. */
+ if (n != -1) {
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+ WARN_ON_ONCE(n >= nr_bits);
+#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+ }
+
+ if (srcp)
+ return find_next_bit(srcp, nr_bits, n + 1);
+
+ return n + 1;
+}
+
+static inline int attrmask_next_and(int n, const unsigned long *src1p,
+ const unsigned long *src2p,
+ unsigned int nr_bits)
+{
+ /* -1 is a legal arg here. */
+ if (n != -1) {
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+ WARN_ON_ONCE(n >= nr_bits);
+#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+ }
+
+ if (src1p && src2p)
+ return find_next_and_bit(src1p, src2p, nr_bits, n + 1);
+ else if (src1p)
+ return find_next_bit(src1p, nr_bits, n + 1);
+ else if (src2p)
+ return find_next_bit(src2p, nr_bits, n + 1);
+
+ return n + 1;
+}
#else
static inline int netif_set_xps_queue(struct net_device *dev,
const struct cpumask *mask,
diff --git a/net/core/dev.c b/net/core/dev.c
index 9b04a9f..4cfc179 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2091,7 +2091,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
int pos;
if (dev_maps)
- map = xmap_dereference(dev_maps->cpu_map[tci]);
+ map = xmap_dereference(dev_maps->attr_map[tci]);
if (!map)
return false;
@@ -2104,7 +2104,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
break;
}
- RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL);
+ RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
kfree_rcu(map, rcu);
return false;
}
@@ -2137,30 +2137,49 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
u16 count)
{
+ const unsigned long *possible_mask = NULL;
+ enum xps_map_type type = XPS_MAP_RXQS;
struct xps_dev_maps *dev_maps;
- int cpu, i;
bool active = false;
+ unsigned int nr_ids;
+ int i, j;
mutex_lock(&xps_map_mutex);
- dev_maps = xmap_dereference(dev->xps_maps);
- if (!dev_maps)
- goto out_no_maps;
-
- for_each_possible_cpu(cpu)
- active |= remove_xps_queue_cpu(dev, dev_maps, cpu,
- offset, count);
+ while (type < __XPS_MAP_MAX) {
+ dev_maps = xmap_dereference(dev->xps_maps[type]);
+ if (!dev_maps)
+ goto out_no_maps;
+
+ if (type == XPS_MAP_CPUS) {
+ if (num_possible_cpus() > 1)
+ possible_mask = cpumask_bits(cpu_possible_mask);
+ nr_ids = nr_cpu_ids;
+ } else if (type == XPS_MAP_RXQS) {
+ nr_ids = dev->num_rx_queues;
+ }
- if (!active) {
- RCU_INIT_POINTER(dev->xps_maps, NULL);
- kfree_rcu(dev_maps, rcu);
- }
+ for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+ j < nr_ids;){
+ active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
+ count);
+ }
- for (i = offset + (count - 1); count--; i--)
- netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
- NUMA_NO_NODE);
+ if (!active) {
+ RCU_INIT_POINTER(dev->xps_maps[type], NULL);
+ kfree_rcu(dev_maps, rcu);
+ }
+ if (type == XPS_MAP_CPUS) {
+ for (i = offset + (count - 1); count--; i--)
+ netdev_queue_numa_node_write(
+ netdev_get_tx_queue(dev, i),
+ NUMA_NO_NODE);
+ }
out_no_maps:
+ type++;
+ }
+
mutex_unlock(&xps_map_mutex);
}
@@ -2169,11 +2188,11 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
netif_reset_xps_queues(dev, index, dev->num_tx_queues - index);
}
-static struct xps_map *expand_xps_map(struct xps_map *map,
- int cpu, u16 index)
+static struct xps_map *expand_xps_map(struct xps_map *map, int attr_index,
+ u16 index, enum xps_map_type type)
{
- struct xps_map *new_map;
int alloc_len = XPS_MIN_MAP_ALLOC;
+ struct xps_map *new_map = NULL;
int i, pos;
for (pos = 0; map && pos < map->len; pos++) {
@@ -2182,7 +2201,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
return map;
}
- /* Need to add queue to this CPU's existing map */
+ /* Need to add tx-queue to this CPU's/rx-queue's existing map */
if (map) {
if (pos < map->alloc_len)
return map;
@@ -2190,9 +2209,14 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
alloc_len = map->alloc_len * 2;
}
- /* Need to allocate new map to store queue on this CPU's map */
- new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
- cpu_to_node(cpu));
+ /* Need to allocate new map to store tx-queue on this CPU's/rx-queue's
+ * map
+ */
+ if (type == XPS_MAP_RXQS)
+ new_map = kzalloc(XPS_MAP_SIZE(alloc_len), GFP_KERNEL);
+ else if (type == XPS_MAP_CPUS)
+ new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
+ cpu_to_node(attr_index));
if (!new_map)
return NULL;
@@ -2204,14 +2228,16 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
return new_map;
}
-int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
- u16 index)
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+ u16 index, enum xps_map_type type)
{
+ const unsigned long *online_mask = NULL, *possible_mask = NULL;
struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
- int i, cpu, tci, numa_node_id = -2;
+ int i, j, tci, numa_node_id = -2;
int maps_sz, num_tc = 1, tc = 0;
struct xps_map *map, *new_map;
bool active = false;
+ unsigned int nr_ids;
if (dev->num_tc) {
num_tc = dev->num_tc;
@@ -2220,16 +2246,33 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
return -EINVAL;
}
- maps_sz = XPS_DEV_MAPS_SIZE(num_tc);
+ switch (type) {
+ case XPS_MAP_RXQS:
+ maps_sz = XPS_RXQ_DEV_MAPS_SIZE(num_tc, dev->num_rx_queues);
+ dev_maps = xmap_dereference(dev->xps_maps[XPS_MAP_RXQS]);
+ nr_ids = dev->num_rx_queues;
+ break;
+ case XPS_MAP_CPUS:
+ maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
+ if (num_possible_cpus() > 1) {
+ online_mask = cpumask_bits(cpu_online_mask);
+ possible_mask = cpumask_bits(cpu_possible_mask);
+ }
+ dev_maps = xmap_dereference(dev->xps_maps[XPS_MAP_CPUS]);
+ nr_ids = nr_cpu_ids;
+ break;
+ default:
+ return -EINVAL;
+ }
+
if (maps_sz < L1_CACHE_BYTES)
maps_sz = L1_CACHE_BYTES;
mutex_lock(&xps_map_mutex);
- dev_maps = xmap_dereference(dev->xps_maps);
-
/* allocate memory for queue storage */
- for_each_cpu_and(cpu, cpu_online_mask, mask) {
+ for (j = -1; j = attrmask_next_and(j, online_mask, mask, nr_ids),
+ j < nr_ids;) {
if (!new_dev_maps)
new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
if (!new_dev_maps) {
@@ -2237,73 +2280,81 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
return -ENOMEM;
}
- tci = cpu * num_tc + tc;
- map = dev_maps ? xmap_dereference(dev_maps->cpu_map[tci]) :
+ tci = j * num_tc + tc;
+ map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
NULL;
- map = expand_xps_map(map, cpu, index);
+ map = expand_xps_map(map, j, index, type);
if (!map)
goto error;
- RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+ RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
}
if (!new_dev_maps)
goto out_no_new_maps;
- for_each_possible_cpu(cpu) {
+ for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+ j < nr_ids;) {
/* copy maps belonging to foreign traffic classes */
- for (i = tc, tci = cpu * num_tc; dev_maps && i--; tci++) {
+ for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
/* fill in the new device map from the old device map */
- map = xmap_dereference(dev_maps->cpu_map[tci]);
- RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+ map = xmap_dereference(dev_maps->attr_map[tci]);
+ RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
}
/* We need to explicitly update tci as prevous loop
* could break out early if dev_maps is NULL.
*/
- tci = cpu * num_tc + tc;
+ tci = j * num_tc + tc;
- if (cpumask_test_cpu(cpu, mask) && cpu_online(cpu)) {
- /* add queue to CPU maps */
+ if (attr_test_mask(j, mask, nr_ids) &&
+ attr_test_online(j, online_mask, nr_ids)) {
+ /* add tx-queue to CPU/rx-queue maps */
int pos = 0;
- map = xmap_dereference(new_dev_maps->cpu_map[tci]);
+ map = xmap_dereference(new_dev_maps->attr_map[tci]);
while ((pos < map->len) && (map->queues[pos] != index))
pos++;
if (pos == map->len)
map->queues[map->len++] = index;
#ifdef CONFIG_NUMA
- if (numa_node_id == -2)
- numa_node_id = cpu_to_node(cpu);
- else if (numa_node_id != cpu_to_node(cpu))
- numa_node_id = -1;
+ if (type == XPS_MAP_CPUS) {
+ if (numa_node_id == -2)
+ numa_node_id = cpu_to_node(j);
+ else if (numa_node_id != cpu_to_node(j))
+ numa_node_id = -1;
+ }
#endif
} else if (dev_maps) {
/* fill in the new device map from the old device map */
- map = xmap_dereference(dev_maps->cpu_map[tci]);
- RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+ map = xmap_dereference(dev_maps->attr_map[tci]);
+ RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
}
/* copy maps belonging to foreign traffic classes */
for (i = num_tc - tc, tci++; dev_maps && --i; tci++) {
/* fill in the new device map from the old device map */
- map = xmap_dereference(dev_maps->cpu_map[tci]);
- RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+ map = xmap_dereference(dev_maps->attr_map[tci]);
+ RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
}
}
- rcu_assign_pointer(dev->xps_maps, new_dev_maps);
+ if (type == XPS_MAP_RXQS)
+ rcu_assign_pointer(dev->xps_maps[XPS_MAP_RXQS], new_dev_maps);
+ else if (type == XPS_MAP_CPUS)
+ rcu_assign_pointer(dev->xps_maps[XPS_MAP_CPUS], new_dev_maps);
/* Cleanup old maps */
if (!dev_maps)
goto out_no_old_maps;
- for_each_possible_cpu(cpu) {
- for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
- new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
- map = xmap_dereference(dev_maps->cpu_map[tci]);
+ for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+ j < nr_ids;) {
+ for (i = num_tc, tci = j * num_tc; i--; tci++) {
+ new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
+ map = xmap_dereference(dev_maps->attr_map[tci]);
if (map && map != new_map)
kfree_rcu(map, rcu);
}
@@ -2316,19 +2367,23 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
active = true;
out_no_new_maps:
- /* update Tx queue numa node */
- netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
- (numa_node_id >= 0) ? numa_node_id :
- NUMA_NO_NODE);
+ if (type == XPS_MAP_CPUS) {
+ /* update Tx queue numa node */
+ netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
+ (numa_node_id >= 0) ?
+ numa_node_id : NUMA_NO_NODE);
+ }
if (!dev_maps)
goto out_no_maps;
- /* removes queue from unused CPUs */
- for_each_possible_cpu(cpu) {
- for (i = tc, tci = cpu * num_tc; i--; tci++)
+ /* removes tx-queue from unused CPUs/rx-queues */
+ for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+ j < nr_ids;) {
+ for (i = tc, tci = j * num_tc; i--; tci++)
active |= remove_xps_queue(dev_maps, tci, index);
- if (!cpumask_test_cpu(cpu, mask) || !cpu_online(cpu))
+ if (!attr_test_mask(j, mask, nr_ids) ||
+ !attr_test_online(j, online_mask, nr_ids))
active |= remove_xps_queue(dev_maps, tci, index);
for (i = num_tc - tc, tci++; --i; tci++)
active |= remove_xps_queue(dev_maps, tci, index);
@@ -2336,7 +2391,10 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
/* free map if not active */
if (!active) {
- RCU_INIT_POINTER(dev->xps_maps, NULL);
+ if (type == XPS_MAP_RXQS)
+ RCU_INIT_POINTER(dev->xps_maps[XPS_MAP_RXQS], NULL);
+ else if (type == XPS_MAP_CPUS)
+ RCU_INIT_POINTER(dev->xps_maps[XPS_MAP_CPUS], NULL);
kfree_rcu(dev_maps, rcu);
}
@@ -2346,11 +2404,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
return 0;
error:
/* remove any maps that we added */
- for_each_possible_cpu(cpu) {
- for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
- new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
+ for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+ j < nr_ids;) {
+ for (i = num_tc, tci = j * num_tc; i--; tci++) {
+ new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
map = dev_maps ?
- xmap_dereference(dev_maps->cpu_map[tci]) :
+ xmap_dereference(dev_maps->attr_map[tci]) :
NULL;
if (new_map && new_map != map)
kfree(new_map);
@@ -2362,6 +2421,13 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
kfree(new_dev_maps);
return -ENOMEM;
}
+
+int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
+ u16 index)
+{
+ return __netif_set_xps_queue(dev, cpumask_bits(mask), index,
+ XPS_MAP_CPUS);
+}
EXPORT_SYMBOL(netif_set_xps_queue);
#endif
@@ -3399,7 +3465,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
int queue_index = -1;
rcu_read_lock();
- dev_maps = rcu_dereference(dev->xps_maps);
+ dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]);
if (dev_maps) {
unsigned int tci = skb->sender_cpu - 1;
@@ -3408,7 +3474,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
tci += netdev_get_prio_tc_map(dev, skb->priority);
}
- map = rcu_dereference(dev_maps->cpu_map[tci]);
+ map = rcu_dereference(dev_maps->attr_map[tci]);
if (map) {
if (map->len == 1)
queue_index = map->queues[0];
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index c476f07..d7abd33 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1227,13 +1227,13 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
}
rcu_read_lock();
- dev_maps = rcu_dereference(dev->xps_maps);
+ dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]);
if (dev_maps) {
for_each_possible_cpu(cpu) {
int i, tci = cpu * num_tc + tc;
struct xps_map *map;
- map = rcu_dereference(dev_maps->cpu_map[tci]);
+ map = rcu_dereference(dev_maps->attr_map[tci]);
if (!map)
continue;
^ permalink raw reply related
* [jkirsher/next-queue, RFC PATCH 0/3] Symmetric queue selection using XPS for Rx queues
From: Amritha Nambiar @ 2018-04-04 19:59 UTC (permalink / raw)
To: intel-wired-lan, jeffrey.t.kirsher
Cc: alexander.h.duyck, amritha.nambiar, netdev, edumazet,
sridhar.samudrala, hannes, tom
This patch series implements support for Tx queue selection based on
the Rx queue map. This is done by configuring an Rx queue map per Tx queue
through a sysfs attribute. If the user configuration for Rx queues does
not apply, then the Tx queue selection falls back to XPS using CPUs and
finally to hashing.
XPS is refactored to support Tx queue selection based on either the
CPU map or the Rx-queue map. The config option CONFIG_XPS needs to be
enabled. By default no receive queues are configured for the Tx queue.
- /sys/class/net/eth0/queues/tx-*/xps_rxqs
This enables sending packets on the same Tx-Rx queue pair, which is
useful for busy-polling multi-threaded workloads where it is not
possible to pin the threads to a CPU. This is a rework of Sridhar's
patch for symmetric queueing via socket option:
https://www.spinics.net/lists/netdev/msg453106.html
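The fallback order described above (Rx queue map first, then the CPU map,
then hashing) can be sketched in userspace C roughly as follows. This is
only an illustrative model, not the code from the series: struct xps_model,
lookup() and pick_tx_queue() are hypothetical names, and each map entry
holds a single Tx queue, whereas a real XPS map entry can hold several.

```c
#include <assert.h>
#include <stdint.h>

#define NO_QUEUE (-1)

/* Hypothetical, simplified stand-in for the per-device XPS state. */
struct xps_model {
	const int *rxq_to_txq;	/* indexed by Rx queue; NO_QUEUE if unset; may be NULL */
	int nr_rxqs;
	const int *cpu_to_txq;	/* indexed by CPU; NO_QUEUE if unset; may be NULL */
	int nr_cpus;
	int nr_txqs;		/* number of Tx queues, used by the hash fallback */
};

/* Return the mapped Tx queue, or NO_QUEUE if the map or the entry is absent. */
static int lookup(const int *map, int nr, int id)
{
	if (!map || id < 0 || id >= nr)
		return NO_QUEUE;
	return map[id];
}

/* Try the Rx queue map, then the CPU map, then fall back to hashing. */
int pick_tx_queue(const struct xps_model *m, int rx_queue, int cpu,
		  uint32_t flow_hash)
{
	int q;

	assert(m && m->nr_txqs > 0);

	q = lookup(m->rxq_to_txq, m->nr_rxqs, rx_queue);
	if (q != NO_QUEUE)
		return q;

	q = lookup(m->cpu_to_txq, m->nr_cpus, cpu);
	if (q != NO_QUEUE)
		return q;

	return (int)(flow_hash % (uint32_t)m->nr_txqs);
}
```

Passing a negative rx_queue models a socket with no recorded Rx queue, in
which case the sketch goes straight to the CPU map.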
---
Amritha Nambiar (3):
net: Refactor XPS for CPUs and Rx queues
net: Enable Tx queue selection based on Rx queues
net-sysfs: Add interface for Rx queue map per Tx queue
include/linux/netdevice.h | 82 +++++++++++++++
include/net/sock.h | 18 +++
net/core/dev.c | 242 +++++++++++++++++++++++++++++++--------------
net/core/net-sysfs.c | 85 +++++++++++++++-
net/core/sock.c | 5 +
net/ipv4/tcp_input.c | 7 +
net/ipv4/tcp_ipv4.c | 1
net/ipv4/tcp_minisocks.c | 1
8 files changed, 360 insertions(+), 81 deletions(-)
^ permalink raw reply
* Re: [PATCH 00/15] ARM: pxa: switch to DMA slave maps
From: Boris Brezillon @ 2018-04-04 19:56 UTC (permalink / raw)
To: Robert Jarzmik
Cc: Ulf Hansson, alsa-devel, Jaroslav Kysela, linux-ide, netdev,
linux-mtd, driverdevel, Boris Brezillon, Vinod Koul,
Richard Weinberger, Takashi Iwai, Marek Vasut, Ezequiel Garcia,
linux-media, Samuel Ortiz, Arnd Bergmann,
Bartlomiej Zolnierkiewicz, Haojian Zhuang, dmaengine, Mark Brown,
Mauro Carvalho Chehab, Linux ARM, Nicolas Pitre,
Greg Kroah-Hartman
In-Reply-To: <874lkq4urd.fsf@belgarion.home>
On Wed, 04 Apr 2018 21:49:26 +0200
Robert Jarzmik <robert.jarzmik@free.fr> wrote:
> Ulf Hansson <ulf.hansson@linaro.org> writes:
>
> > On 2 April 2018 at 16:26, Robert Jarzmik <robert.jarzmik@free.fr> wrote:
> >> Hi,
> >>
> >> This series is aimed at removing the dmaengine slave compat use, and transferring
> >> knowledge of the DMA requestors into architecture code.
> >> As this looks like a patch bomb, each maintainer expressing for his tree either
> >> an Ack or "I want to take through my tree" will be spared in the next iterations
> >> of this series.
> >
> > Perhaps an option is to send this whole series as PR for 3.17 rc1, that
> > would remove some churn and make this faster/easier? Well, if you
> > receive the needed acks of course.
> For 3.17-rc1 it looks a bit optimistic with the review time ... If I have all
Especially since 3.17-rc1 was released more than 3 years ago :-),
but I guess you meant 4.17-rc1.
> acks, I'll queue it into my pxa tree. If at least one maintainer withholds his
> ack, the end of the series (phase 3) won't be applied until it is sorted out.
>
> Cheers.
>
> --
> Robert
^ permalink raw reply
* Re: [PATCH 00/15] ARM: pxa: switch to DMA slave maps
From: Robert Jarzmik @ 2018-04-04 19:49 UTC (permalink / raw)
To: Ulf Hansson
Cc: alsa-devel, Jaroslav Kysela, linux-ide, netdev, linux-mtd,
driverdevel, Boris Brezillon, Vinod Koul, Richard Weinberger,
Takashi Iwai, Marek Vasut, Ezequiel Garcia, linux-media,
Samuel Ortiz, Arnd Bergmann, Bartlomiej Zolnierkiewicz,
Haojian Zhuang, dmaengine, Mark Brown, Mauro Carvalho Chehab,
Linux ARM, Nicolas Pitre, Greg Kroah-Hartman,
"linux-mmc@vger.ke
In-Reply-To: <CAPDyKFot9dAST2jQL5s8E4U=bCHxkio=uwpqPd6S0N4FWJRB-w@mail.gmail.com>
Ulf Hansson <ulf.hansson@linaro.org> writes:
> On 2 April 2018 at 16:26, Robert Jarzmik <robert.jarzmik@free.fr> wrote:
>> Hi,
>>
>> This series is aimed at removing the dmaengine slave compat use, and transferring
>> knowledge of the DMA requestors into architecture code.
>> As this looks like a patch bomb, each maintainer expressing for his tree either
>> an Ack or "I want to take through my tree" will be spared in the next iterations
>> of this series.
>
> Perhaps an option is to send this whole series as PR for 3.17 rc1, that
> would remove some churn and make this faster/easier? Well, if you
> receive the needed acks of course.
For 3.17-rc1 it looks a bit optimistic with the review time ... If I have all
acks, I'll queue it into my pxa tree. If at least one maintainer withholds his
ack, the end of the series (phase 3) won't be applied until it is sorted out.
Cheers.
--
Robert
^ permalink raw reply
* [PATCH net-next] netns: filter uevents correctly
From: Christian Brauner @ 2018-04-04 19:48 UTC (permalink / raw)
To: ebiederm, davem, gregkh, netdev, linux-kernel
Cc: avagin, ktkhai, serge, Christian Brauner
commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk. We have now reached the point where hotplug events for all devices
that carry a namespace tag are filtered according to that namespace.
Specifically, they are filtered whenever the namespace tag of the kobject
does not match the namespace tag of the netlink socket. One example is
network devices. Uevents for network devices only show up in the network
namespaces these devices are moved to or created in.
However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will *try* to broadcast it
into all network namespaces.
The original patchset was written in 2010 before user namespaces were a
thing. With the introduction of user namespaces sending out uevents became
partially isolated as they were filtered by user namespaces:
net/netlink/af_netlink.c:do_one_broadcast()
if (!net_eq(sock_net(sk), p->net)) {
if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
return;
if (!peernet_has_id(sock_net(sk), p->net))
return;
if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
CAP_NET_BROADCAST))
return;
}
The file_ns_capable() check will check whether the caller had
CAP_NET_BROADCAST at the time of opening the netlink socket in the user
namespace of interest. This check is fine in general but seems insufficient
to me when paired with uevents. The reason is that devices always belong to
the initial user namespace so uevents for kobjects that do not carry a
namespace tag should never be sent into another user namespace. This has
been the intention all along. But there's one case where this breaks,
namely if a new user namespace is created by root on the host and an
identity mapping is established between root on the host and root in the
new user namespace. Here's a reproducer:
sudo unshare -U --map-root
udevadm monitor -k
# Now change to initial user namespace and e.g. do
modprobe kvm
# or
rmmod kvm
This will allow the non-initial user namespace to retrieve all uevents from the
host. This seems very anecdotal given that in the general case user
namespaces do not see any uevents and also can't really do anything useful
with them.
Additionally, it is now possible to send uevents from userspace. As such we
can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
namespace of the network namespace of the netlink socket) userspace process
decide which uevents should be sent.
This makes me think that we should simply ensure that uevents for kobjects
that do not carry a namespace tag are *always* filtered by user namespace
in kobj_bcast_filter(). Specifically:
- If the owning user namespace of the uevent socket is not init_user_ns the
event will always be filtered.
- If the network namespace the uevent socket belongs to was created in the
initial user namespace but was opened from a non-initial user namespace
the event will be filtered as well.
Put another way, uevents for kobjects not carrying a namespace tag are now
always only sent to the initial user namespace. The regression potential
for this is near to non-existent since user namespaces can't really do
anything with interesting devices.
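The two rules above can be modelled in userspace C as a small predicate.
This is a hedged sketch only: user_ns_model, uevent_sock_model and
untagged_uevent_filtered() are hypothetical stand-ins for the kernel
objects, and the function only mirrors the decision made for kobjects
without a namespace tag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a user namespace; identity is by address. */
struct user_ns_model {
	int id;
};

/* Models the initial user namespace. */
static const struct user_ns_model init_user_ns_model = { 0 };

/* Hypothetical view of a uevent netlink socket. */
struct uevent_sock_model {
	/* user namespace owning the socket's network namespace */
	const struct user_ns_model *net_owner_ns;
	/* user namespace the socket was opened from */
	const struct user_ns_model *opener_ns;
};

/*
 * Returns true when a uevent for a kobject without a namespace tag
 * should be filtered (that is, not delivered) for this socket.
 */
bool untagged_uevent_filtered(const struct uevent_sock_model *s)
{
	assert(s && s->net_owner_ns && s->opener_ns);

	/* Rule 1: the network namespace is not owned by the initial user ns. */
	if (s->net_owner_ns != &init_user_ns_model)
		return true;

	/* Rule 2: the socket was opened from a non-initial user namespace. */
	return s->opener_ns != &init_user_ns_model;
}
```

Only a socket whose network namespace is owned by the initial user
namespace, and which was also opened from it, passes both checks.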
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
lib/kobject_uevent.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..cb98cddb6e3b 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
return sock_ns != ns;
}
- return 0;
+ /*
+ * The kobject does not carry a namespace tag so filter by user
+ * namespace below.
+ */
+ if (sock_net(dsk)->user_ns != &init_user_ns)
+ return 1;
+
+ /* Check if socket was opened from non-initial user namespace. */
+ return sk_user_ns(dsk) != &init_user_ns;
}
#endif
--
2.15.1
^ permalink raw reply related
* Re: possible deadlock in skb_queue_tail
From: Dmitry Vyukov @ 2018-04-04 19:00 UTC (permalink / raw)
To: Cong Wang
Cc: Kirill Tkhai, Ingo Molnar, syzbot, David Miller, David Herrmann,
Denys Vlasenko, David Windsor, Reshetova, Elena, Hans Liljestrand,
Kees Cook, LKML, Matthew Dawson, Mateusz Jurczyk, netdev,
syzkaller-bugs, Al Viro, xemul
In-Reply-To: <CAM_iQpV89LRuJovA840T0pUQYXkZvecezLmg6=Hp0PxfziOxUQ@mail.gmail.com>
On Wed, Apr 4, 2018 at 7:08 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Tue, Apr 3, 2018 at 4:42 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> On 03.04.2018 14:25, Dmitry Vyukov wrote:
>>> On Tue, Apr 3, 2018 at 11:50 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>> sk_diag_dump_icons() dumps only sockets in TCP_LISTEN state.
>>>> TCP_LISTEN state may be assigned in only one place in net/unix/af_unix.c:
>>>> it's unix_listen(). The function is applied to stream and seqpacket
>>>> socket types.
>>>>
>>>> It can't be stream because of the second stack, and seqpacket also can't,
>>>> as I don't think it's possible for gcc to inline unix_seqpacket_sendmsg()
>>>> in the way, we don't see it in the stack.
>>>>
>>>> So, this looks like a false positive to me.
>>>>
>>>> Kirill
>>>
>>> Do you mean that these &(&u->lock)->rlock/1 referenced in 2 stacks are
>>> always different?
>>
>> In these 2 particular stacks they have to be different.
>
> So actually my patch could fix this false positive? I thought it couldn't.
> https://patchwork.ozlabs.org/patch/894342/
You know better!
If you suspect it can fix this report, and nobody has better
proposals, then we can just mark this as being fixed with your commit
and then see if it triggers again with your commit or not.
^ permalink raw reply
* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Jiri Pirko @ 2018-04-04 18:20 UTC (permalink / raw)
To: David Miller
Cc: dsahern, loseweigh, si-wei.liu, mst, stephen, alexander.h.duyck,
jesse.brandeburg, kubakici, jasowang, sridhar.samudrala, netdev,
virtualization, virtio-dev
In-Reply-To: <20180404.133749.1802514210170809419.davem@davemloft.net>
Wed, Apr 04, 2018 at 07:37:49PM CEST, davem@davemloft.net wrote:
>From: David Ahern <dsahern@gmail.com>
>Date: Wed, 4 Apr 2018 11:21:54 -0600
>
>> It is a netdev so there is no reason to have a separate ip command to
>> inspect it. 'ip link' is the right place.
>
>I agree on this.
>
>What I really don't understand still is the use case... really.
>
>So there are control netdevs, what exactly is the problem with that?
>
>Are we not exporting enough information for applications to handle
>these devices sanely? If so, then let's add that information.
>
>We can set netdev->type to ETH_P_LINUXCONTROL or something like that.
>
>Another alternative is to add an interface flag like IFF_CONTROL or
>similar, and that probably is much nicer.
>
>Hiding the devices means that we acknowledge that applications are
>currently broken with control netdevs... and we want them to stay
>broken!
>
>That doesn't sound like a good plan to me.
>
>So let's fix handling of control netdevs instead of hiding them.
Exactly. Don't workaround userspace issues by kernel patches.
^ permalink raw reply
* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Siwei Liu @ 2018-04-04 18:02 UTC (permalink / raw)
To: David Ahern
Cc: Jiri Pirko, Si-Wei Liu, Michael S. Tsirkin, Stephen Hemminger,
Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
Jason Wang, Samudrala, Sridhar, Netdev, virtualization
In-Reply-To: <54accf73-e6cc-e03f-6a1c-34e1bbd78047@gmail.com>
On Wed, Apr 4, 2018 at 10:21 AM, David Ahern <dsahern@gmail.com> wrote:
> On 4/4/18 1:36 AM, Siwei Liu wrote:
>> On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsahern@gmail.com> wrote:
>>> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>>
>>>>> There are other use cases that want to hide a device from userspace. I
>>>>
>>>> What usecases do you have in mind?
>>>
>>> As mentioned in a previous response some kernel drivers create control
>>> netdevs. Just as in this case users should not be mucking with it, and
>>> S/W like lldpd should ignore it.
>>>
>>>>
>>>>> would prefer a better solution than playing games with name prefixes and
>>>>> one that includes an API for users to list all devices -- even ones
>>>>> hidden by default.
>>>>
>>>> Netdevice hiding feels a bit scary for me. This smells like a workaround
>>>> for userspace issues. Why can't the netdevice be visible always and
>>>> userspace would know what is it and what should it do with it?
>>>>
>>>> Once we start with hiding, there are other things related to that which
>>>> appear. Like who can see what, levels of visibility etc...
>>>>
>>>
>>> I would not advocate for any API that does not allow users to have full
>>> introspection. The intent is to hide the netdev by default but have an
>>> option to see it.
>>
>> I'm fine with having a link dump API to inspect the hidden netdev. As
>> said, the name for hidden netdevs should be in a separate device
>> namespace, and we did not even get closer to what it should look like
>> as I don't want to make it just an option for ip link. Perhaps a new
>> set of sub-commands of, say, 'ip device'.
>
> It is a netdev so there is no reason to have a separate ip command to
> inspect it. 'ip link' is the right place.
If you're still thinking the visibility is part of link attribute
rather than a separate namespace, I'd say we are trying to solve
essentially different problems, and I really don't understand your
proposal in solving that problem to be honest.
So, let's step back on studying your case if that's the right thing
and let's talk about concrete examples.
-Siwei
^ permalink raw reply
* [GIT] Networking
From: David Miller @ 2018-04-04 17:52 UTC (permalink / raw)
To: torvalds; +Cc: akpm, netdev, linux-kernel
This fixes some fallout from the net-next merge the other day, plus
some non-merge-window-related bug fixes:
1) Fix sparse warnings in bcmgenet, systemport, b53, and mt7530, from
Florian Fainelli.
2) pptp does a bogus dst_release() on a route we have a single refcount
on, and attached to a socket, which needs that refcount. From Eric
Dumazet.
3) UDP connected sockets on ipv6 can race with route update handling,
resulting in a pre-PMTU update route still stuck on the socket and
thus continuing to get ICMPV6_PKT_TOOBIG errors. We end up never
seeing the updated route. Fix from Alexey Kodanev.
4) Missing list initializer(s) in TIPC, from Jon Maloy.
5) Connect phy early to prevent crashes in lan78xx driver, from
Alexander Graf.
6) Fix build with modular NVMEM, from Arnd Bergmann.
7) netdevsim cannot mark nsim_devlink_net_ops and nsim_fib_net_ops as
__net_initdata, as these are referenced from module unload
unconditionally. From Arnd Bergmann.
Please pull, thanks a lot!
The following changes since commit 17dec0a949153d9ac00760ba2f5b78cb583e995f:
Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace (2018-04-03 19:15:32 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git
for you to fetch changes up to 87248d31d1055b56e01a62d9320b4e118bc84e0e:
netdevsim: remove incorrect __net_initdata annotations (2018-04-04 12:53:37 -0400)
----------------------------------------------------------------
Alexander Graf (1):
lan78xx: Connect phy early
Alexey Kodanev (4):
ipv6: add a wrapper for ip6_dst_store() with flowi6 checks
ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()
ipv6: udp: convert 'connected' to bool type in udpv6_sendmsg()
ipv6: udp: set dst cache for a connected sk if current not valid
Arnd Bergmann (2):
nvmem: disallow modular CONFIG_NVMEM
netdevsim: remove incorrect __net_initdata annotations
Bert Kenward (1):
sfc: remove ctpio_dmabuf_start from stats
Cong Wang (1):
af_unix: remove redundant lockdep class
David Howells (1):
rxrpc: Fix undefined packet handling
David S. Miller (2):
Merge branch 'net-Broadcom-drivers-sparse-fixes'
Merge branch 'ipv6-udp-set-dst-cache-for-a-connected-sk-if-current-not-valid'
Dirk van der Merwe (1):
nfp: use full 40 bits of the NSP buffer address
Eric Dumazet (2):
pptp: remove a buggy dst release in pptp_connect()
inet: frags: fix ip6frag_low_thresh boundary
Florian Fainelli (4):
net: bcmgenet: Fix sparse warnings in bcmgenet_put_tx_csum()
net: systemport: Fix sparse warnings in bcm_sysport_insert_tsb()
net: dsa: b53: Fix sparse warnings in b53_mmap.c
net: dsa: mt7530: Use NULL instead of plain integer
GhantaKrishnamurthy MohanKrishna (1):
tipc: Fix namespace violation in tipc_sk_fill_sock_diag
Jakub Kicinski (1):
nfp: add a separate counter for packets with CHECKSUM_COMPLETE
Jon Maloy (1):
tipc: Fix missing list initializations in struct tipc_subscription
Paolo Abeni (1):
net: avoid unneeded atomic operation in ip*_append_data()
Russell King (1):
net: phy: marvell10g: add thermal hwmon device
Tan Xiaojun (1):
net: hns3: fix length overflow when CONFIG_ARM64_64K_PAGES
drivers/net/dsa/b53/b53_mmap.c | 33 +++++++++++++-----
drivers/net/dsa/mt7530.c | 6 ++--
drivers/net/ethernet/broadcom/bcmsysport.c | 11 +++---
drivers/net/ethernet/broadcom/genet/bcmgenet.c | 11 +++---
drivers/net/ethernet/hisilicon/hns3/hns3_enet.h | 2 +-
drivers/net/ethernet/netronome/nfp/nfp_net.h | 4 ++-
drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +-
drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c | 16 +++++----
drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c | 9 ++---
drivers/net/ethernet/sfc/ef10.c | 2 --
drivers/net/ethernet/sfc/nic.h | 1 -
drivers/net/netdevsim/devlink.c | 2 +-
drivers/net/netdevsim/fib.c | 2 +-
drivers/net/phy/marvell10g.c | 184 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
drivers/net/ppp/pptp.c | 1 -
drivers/net/usb/lan78xx.c | 34 +++++++++---------
drivers/nvmem/Kconfig | 2 +-
include/net/ip6_route.h | 3 ++
include/net/ipv6.h | 3 +-
net/ieee802154/6lowpan/reassembly.c | 2 --
net/ipv4/ip_fragment.c | 5 ++-
net/ipv4/ip_output.c | 3 +-
net/ipv6/datagram.c | 9 +----
net/ipv6/ip6_output.c | 18 +++++++---
net/ipv6/netfilter/nf_conntrack_reasm.c | 2 --
net/ipv6/ping.c | 2 +-
net/ipv6/reassembly.c | 2 --
net/ipv6/route.c | 17 +++++++++
net/ipv6/udp.c | 31 ++++-------------
net/rxrpc/input.c | 6 ++++
net/rxrpc/protocol.h | 6 ++++
net/tipc/socket.c | 3 +-
net/tipc/subscr.c | 2 ++
net/unix/af_unix.c | 10 ------
34 files changed, 326 insertions(+), 120 deletions(-)
^ permalink raw reply
* Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
From: Neal Cardwell @ 2018-04-04 17:44 UTC (permalink / raw)
To: Yuchung Cheng; +Cc: Ilpo Järvinen, Netdev, Eric Dumazet, sergei.shtylyov
In-Reply-To: <CAK6E8=f5GQxnEhNS=BbXzW-qFKR4mpFuYX8W8z4FpO0r+=DCRw@mail.gmail.com>
On Wed, Apr 4, 2018 at 1:41 PM Yuchung Cheng <ycheng@google.com> wrote:
> On Wed, Apr 4, 2018 at 10:22 AM, Neal Cardwell <ncardwell@google.com> wrote:
> > On Wed, Apr 4, 2018 at 1:13 PM Yuchung Cheng <ycheng@google.com> wrote:
> >> Agreed. That's a good point. And I would much prefer to rename that
> >> to FLAG_ORIG_PROGRESS (w/ updated comment).
> >
> >> so I think we're in agreement to use existing patch w/ the new name
> >> FLAG_ORIG_PROGRESS
> >
> > Yes, SGTM.
> >
> > I guess this "prevent bogus FRTO undos" patch would go to "net" branch and
> > the s/FLAG_ORIG_SACK_ACKED/FLAG_ORIG_PROGRESS/ would go in "net-next"
> > branch?
> huh? why not one patch ... this is getting close to patch-split paralysis.
The flag rename seemed like a cosmetic issue that was not needed for the
fix. Smelled like net-next to me. But I don't feel strongly. However you
guys want to package it is fine with me. :-)
neal
^ permalink raw reply
* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Stephen Hemminger @ 2018-04-04 17:44 UTC (permalink / raw)
To: David Ahern
Cc: Siwei Liu, Jiri Pirko, Si-Wei Liu, Michael S. Tsirkin,
Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
Jason Wang, Samudrala, Sridhar, Netdev, virtualization
In-Reply-To: <b0f5e27b-0be1-311e-f3f3-f79af5cd4521@gmail.com>
On Wed, 4 Apr 2018 11:37:52 -0600
David Ahern <dsahern@gmail.com> wrote:
> Networking vendors have out of tree kernel modules. Those modules use a
> netdev (call it a master netdev, a control netdev, cpu port, whatever)
> to pull packets from the ASIC and deliver to virtual netdevices
> representing physical ports. The master netdev should not be mucked with
> by a user. It should be ignored by certain s/w with lldpd as just an
> *example*.
Sorry, the linux kernel maintainers have a clear well defined attitude
about out of tree kernel modules...
^ permalink raw reply
* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: David Miller @ 2018-04-04 17:42 UTC (permalink / raw)
To: dsahern
Cc: loseweigh, jiri, si-wei.liu, mst, stephen, alexander.h.duyck,
jesse.brandeburg, kubakici, jasowang, sridhar.samudrala, netdev,
virtualization
In-Reply-To: <b0f5e27b-0be1-311e-f3f3-f79af5cd4521@gmail.com>
From: David Ahern <dsahern@gmail.com>
Date: Wed, 4 Apr 2018 11:37:52 -0600
> Networking vendors have out of tree kernel modules. Those modules use a
> netdev (call it a master netdev, a control netdev, cpu port, whatever)
> to pull packets from the ASIC and deliver to virtual netdevices
> representing physical ports. The master netdev should not be mucked with
> by a user. It should be ignored by certain s/w with lldpd as just an
> *example*.
Two approaches:
1) Add an IFF_CONTROL and make userspace understand this. It is probably
long overdue.
2) Design the driver properly. Have a non-netdev master device like
mlxsw does, and control it using devlink or similar. This is exactly
how this stuff was meant to be architected.
> From there I think you are confusing my intentions: I fundamentally do
> not believe the kernel should be hiding anything from an admin. Not
> showing data by default is completely different than not showing that
> data at all.
It is the same, David.
It means we have no intention of fixing applications to properly know
what to do with and how to handle these devices.
If you hide these objects, we are basically giving up on fixing the
tools and/or the drivers themselves to be architected differently
(see #2 above).
That really isn't acceptable in my opinion.
> The intention of my patch with the IFF_HIDDEN attribute is:
> 1. it is a netdev attribute
>
> 2. that attribute can be used by userspace to indicate to the kernel I
> want all or I want the default
>
> 3. that attribute can be controlled by an admin.
>
> The patches go beyond my specific use case (preventing a user from
> modifying a netdev it should not be touching) but to defining the
> semantics of a generic capability which is what the kernel should have.
"Teach, do not hide!" -Yoda
^ permalink raw reply
* Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
From: Yuchung Cheng @ 2018-04-04 17:40 UTC (permalink / raw)
To: Neal Cardwell; +Cc: Ilpo Järvinen, Netdev, Eric Dumazet, Sergei Shtylyov
In-Reply-To: <CADVnQy=XSWqbVwHKAgZdYUaHjQhhPOSU3PRJpF0Oe10DjyzMhQ@mail.gmail.com>
On Wed, Apr 4, 2018 at 10:22 AM, Neal Cardwell <ncardwell@google.com> wrote:
> On Wed, Apr 4, 2018 at 1:13 PM Yuchung Cheng <ycheng@google.com> wrote:
>> Agreed. That's a good point. And I would much prefer to rename that
>> to FLAG_ORIG_PROGRESS (w/ updated comment).
>
>> so I think we're in agreement to use existing patch w/ the new name
>> FLAG_ORIG_PROGRESS
>
> Yes, SGTM.
>
> I guess this "prevent bogus FRTO undos" patch would go to "net" branch and
> the s/FLAG_ORIG_SACK_ACKED/FLAG_ORIG_PROGRESS/ would go in "net-next"
> branch?
huh? why not one patch ... this is getting close to patch-split paralysis.
>
> neal
^ permalink raw reply
* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: David Ahern @ 2018-04-04 17:37 UTC (permalink / raw)
To: Siwei Liu
Cc: Jiri Pirko, Si-Wei Liu, Michael S. Tsirkin, Stephen Hemminger,
Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
Jason Wang, Samudrala, Sridhar, Netdev, virtualization
In-Reply-To: <CADGSJ22qmdzY8AiOfQdFdmR7T6rehTp-hHHj1U10XGad0bTb8A@mail.gmail.com>
[ dropping virtio-dev@lists.oasis-open.org since it is a closed list and
I am tired of deleting bounces ]
On 4/4/18 2:28 AM, Siwei Liu wrote:
> On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsahern@gmail.com> wrote:
>> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>
>>>> There are other use cases that want to hide a device from userspace. I
>>>
>>> What usecases do you have in mind?
>>
>> As mentioned in a previous response some kernel drivers create control
>> netdevs. Just as in this case users should not be mucking with it, and
>> S/W like lldpd should ignore it.
>
> I'm still not sure I understand your case: why you want to hide the
> control netdev, as I assume those devices could choose either to
> silently ignore the request, or fail loudly against user operations?
> Is it creating issues already, or what problem you want to solve if
> not making the netdev invisible. Why couldn't lldpd check some
> specific flag and ignore the control netdevice (can you please give an
> example of a concrete driver for control netdevice *in tree*).
>
> And I'm completely lost why you want an API to make a hidden netdev
> visible again for these control devices.
Networking vendors have out of tree kernel modules. Those modules use a
netdev (call it a master netdev, a control netdev, cpu port, whatever)
to pull packets from the ASIC and deliver to virtual netdevices
representing physical ports. The master netdev should not be mucked with
by a user. It should be ignored by certain s/w with lldpd as just an
*example*.
The short of it is that you have your reasons for wanting to hide the
virtio bypass device; other users have other arguments for wanting a
similar capability.
^ permalink raw reply
* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: David Miller @ 2018-04-04 17:37 UTC (permalink / raw)
To: dsahern
Cc: loseweigh, jiri, si-wei.liu, mst, stephen, alexander.h.duyck,
jesse.brandeburg, kubakici, jasowang, sridhar.samudrala, netdev,
virtualization, virtio-dev
In-Reply-To: <54accf73-e6cc-e03f-6a1c-34e1bbd78047@gmail.com>
From: David Ahern <dsahern@gmail.com>
Date: Wed, 4 Apr 2018 11:21:54 -0600
> It is a netdev so there is no reason to have a separate ip command to
> inspect it. 'ip link' is the right place.
I agree on this.
What I really don't understand still is the use case... really.
So there are control netdevs, what exactly is the problem with that?
Are we not exporting enough information for applications to handle
these devices sanely? If so, then let's add that information.
We can set netdev->type to ETH_P_LINUXCONTROL or something like that.
Another alternative is to add an interface flag like IFF_CONTROL or
similar, and that probably is much nicer.
Hiding the devices means that we acknowledge that applications are
currently broken with control netdevs... and we want them to stay
broken!
That doesn't sound like a good plan to me.
So let's fix handling of control netdevs instead of hiding them.
Thanks.
^ permalink raw reply
* Re: [Intel-wired-lan] [iwl next-queue PATCH 02/10] macvlan: Rename fwd_priv to accel_priv and add accessor function
From: Alexander Duyck @ 2018-04-04 17:33 UTC (permalink / raw)
To: Shannon Nelson; +Cc: Alexander Duyck, intel-wired-lan, Jeff Kirsher, Netdev
In-Reply-To: <7b6a4392-ed51-4324-9b2e-fa483f769882@oracle.com>
On Wed, Apr 4, 2018 at 9:53 AM, Shannon Nelson
<shannon.nelson@oracle.com> wrote:
> On 4/3/2018 2:16 PM, Alexander Duyck wrote:
>
> [...]
>>
>> +static inline void *macvlan_accel_priv(struct net_device *dev)
>> +{
>> + struct macvlan_dev *macvlan = netdev_priv(dev);
>> +
>> + return macvlan->accel_priv;
>
>
> Perhaps a check for (macvlan == NULL) before using it?
> sln
>
>
The macvlan pointer cannot be NULL. The netdev_priv() function adds an
offset to the dev pointer. So I would have to be checking for a NULL
netdev. If the netdev was NULL then there was probably no point in
calling this function in the first place.
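As a rough userspace illustration of that point (a simplified model, not the
kernel code): the private area sits at a fixed, aligned offset behind the
device structure, so the returned pointer is non-NULL whenever dev is. All
model_* names and the MODEL_ALIGN value are assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the kernel's alignment constant; the value is an assumption. */
#define MODEL_ALIGN 32
#define MODEL_ALIGN_UP(x) \
	(((x) + MODEL_ALIGN - 1) & ~((size_t)MODEL_ALIGN - 1))

/* Heavily reduced stand-ins for struct net_device and struct macvlan_dev. */
struct model_net_device {
	char name[16];
	unsigned int flags;
};

struct model_macvlan {
	void *accel_priv;
};

/* Private data lives directly behind the (aligned) device structure. */
static void *model_netdev_priv(const struct model_net_device *dev)
{
	return (char *)dev + MODEL_ALIGN_UP(sizeof(struct model_net_device));
}

/* Allocate the device and its private area as one zeroed block. */
static struct model_net_device *model_alloc_netdev(size_t priv_size)
{
	return calloc(1, MODEL_ALIGN_UP(sizeof(struct model_net_device)) +
			 priv_size);
}

/* Mirrors the accessor under review: no NULL check on the priv pointer. */
static void *model_macvlan_accel_priv(struct model_net_device *dev)
{
	struct model_macvlan *macvlan = model_netdev_priv(dev);

	return macvlan->accel_priv;
}
```

Since the result is just dev plus a constant, a NULL check on macvlan could
only ever catch a NULL dev, and it would have to be written as a check on dev
itself.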
- Alex
^ permalink raw reply
* Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
From: Neal Cardwell @ 2018-04-04 17:22 UTC (permalink / raw)
To: Yuchung Cheng; +Cc: Ilpo Järvinen, Netdev, Eric Dumazet, sergei.shtylyov
In-Reply-To: <CAK6E8=dL+Qz5hK8hj-u=N2kLf8h8FrY=5XNjoeo0NLsRifX1zA@mail.gmail.com>
On Wed, Apr 4, 2018 at 1:13 PM Yuchung Cheng <ycheng@google.com> wrote:
> Agreed. That's a good point. And I would much prefer to rename that
> to FLAG_ORIG_PROGRESS (w/ updated comment).
> so I think we're in agreement to use existing patch w/ the new name
> FLAG_ORIG_PROGRESS
Yes, SGTM.
I guess this "prevent bogus FRTO undos" patch would go to "net" branch and
the s/FLAG_ORIG_SACK_ACKED/FLAG_ORIG_PROGRESS/ would go in "net-next"
branch?
neal
^ permalink raw reply
* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: David Ahern @ 2018-04-04 17:21 UTC (permalink / raw)
To: Siwei Liu
Cc: Jiri Pirko, Si-Wei Liu, Michael S. Tsirkin, Stephen Hemminger,
Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
Jason Wang, Samudrala, Sridhar, Netdev, virtualization,
virtio-dev
In-Reply-To: <CADGSJ23Zr7_CLMr1W9qhcWux59+aCvFdTg_nQ7Mmp5B-FWL8=Q@mail.gmail.com>
On 4/4/18 1:36 AM, Siwei Liu wrote:
> On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsahern@gmail.com> wrote:
>> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>
>>>> There are other use cases that want to hide a device from userspace. I
>>>
>>> What usecases do you have in mind?
>>
>> As mentioned in a previous response some kernel drivers create control
>> netdevs. Just as in this case users should not be mucking with it, and
>> S/W like lldpd should ignore it.
>>
>>>
>>>> would prefer a better solution than playing games with name prefixes and
>>>> one that includes an API for users to list all devices -- even ones
>>>> hidden by default.
>>>
>>> Netdevice hiding feels a bit scary for me. This smells like a workaround
>>> for userspace issues. Why can't the netdevice be visible always and
>>> userspace would know what is it and what should it do with it?
>>>
>>> Once we start with hiding, there are other things related to that which
>>> appear. Like who can see what, levels of visibility etc...
>>>
>>
>> I would not advocate for any API that does not allow users to have full
>> introspection. The intent is to hide the netdev by default but have an
>> option to see it.
>
> I'm fine with having a link dump API to inspect the hidden netdev. As
> said, the name for hidden netdevs should be in a separate device
> namespace, and we did not even get closer to what it should look like
> as I don't want to make it just an option for ip link. Perhaps a new
> set of sub-commands of, say, 'ip device'.
It is a netdev so there is no reason to have a separate ip command to
inspect it. 'ip link' is the right place.
^ permalink raw reply
* [PATCH iproute2-next 1/1] tc: jsonify tunnel_key action
From: Roman Mashak @ 2018-04-04 17:21 UTC (permalink / raw)
To: dsahern; +Cc: stephen, netdev, kernel, jhs, xiyou.wangcong, jiri, Roman Mashak
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
---
tc/m_tunnel_key.c | 36 +++++++++++++++++++++++++-----------
1 file changed, 25 insertions(+), 11 deletions(-)
diff --git a/tc/m_tunnel_key.c b/tc/m_tunnel_key.c
index bac3c07fa90b..0fa461549ad9 100644
--- a/tc/m_tunnel_key.c
+++ b/tc/m_tunnel_key.c
@@ -221,7 +221,13 @@ static void tunnel_key_print_ip_addr(FILE *f, const char *name,
else
return;
- fprintf(f, "\n\t%s %s", name, rt_addr_n2a_rta(family, attr));
+ print_string(PRINT_FP, NULL, "%s", _SL_);
+ if (matches(name, "src_ip") == 0)
+ print_string(PRINT_ANY, "src_ip", "\tsrc_ip %s",
+ rt_addr_n2a_rta(family, attr));
+ else if (matches(name, "dst_ip") == 0)
+ print_string(PRINT_ANY, "dst_ip", "\tdst_ip %s",
+ rt_addr_n2a_rta(family, attr));
}
static void tunnel_key_print_key_id(FILE *f, const char *name,
@@ -229,7 +235,8 @@ static void tunnel_key_print_key_id(FILE *f, const char *name,
{
if (!attr)
return;
- fprintf(f, "\n\t%s %d", name, rta_getattr_be32(attr));
+ print_string(PRINT_FP, NULL, "%s", _SL_);
+ print_uint(PRINT_ANY, "key_id", "\tkey_id %u", rta_getattr_be32(attr));
}
static void tunnel_key_print_dst_port(FILE *f, char *name,
@@ -237,7 +244,9 @@ static void tunnel_key_print_dst_port(FILE *f, char *name,
{
if (!attr)
return;
- fprintf(f, "\n\t%s %d", name, rta_getattr_be16(attr));
+ print_string(PRINT_FP, NULL, "%s", _SL_);
+ print_uint(PRINT_ANY, "dst_port", "\tdst_port %u",
+ rta_getattr_be16(attr));
}
static void tunnel_key_print_flag(FILE *f, const char *name_on,
@@ -246,7 +255,9 @@ static void tunnel_key_print_flag(FILE *f, const char *name_on,
{
if (!attr)
return;
- fprintf(f, "\n\t%s", rta_getattr_u8(attr) ? name_on : name_off);
+ print_string(PRINT_FP, NULL, "%s", _SL_);
+ print_string(PRINT_ANY, "flag", "\t%s",
+ rta_getattr_u8(attr) ? name_on : name_off);
}
static int print_tunnel_key(struct action_util *au, FILE *f, struct rtattr *arg)
@@ -260,19 +271,20 @@ static int print_tunnel_key(struct action_util *au, FILE *f, struct rtattr *arg)
parse_rtattr_nested(tb, TCA_TUNNEL_KEY_MAX, arg);
if (!tb[TCA_TUNNEL_KEY_PARMS]) {
- fprintf(f, "[NULL tunnel_key parameters]");
+ print_string(PRINT_FP, NULL, "%s",
+ "[NULL tunnel_key parameters]");
return -1;
}
parm = RTA_DATA(tb[TCA_TUNNEL_KEY_PARMS]);
- fprintf(f, "tunnel_key");
+ print_string(PRINT_ANY, "kind", "%s ", "tunnel_key");
switch (parm->t_action) {
case TCA_TUNNEL_KEY_ACT_RELEASE:
- fprintf(f, " unset");
+ print_string(PRINT_ANY, "mode", " %s", "unset");
break;
case TCA_TUNNEL_KEY_ACT_SET:
- fprintf(f, " set");
+ print_string(PRINT_ANY, "mode", " %s", "set");
tunnel_key_print_ip_addr(f, "src_ip",
tb[TCA_TUNNEL_KEY_ENC_IPV4_SRC]);
tunnel_key_print_ip_addr(f, "dst_ip",
@@ -291,8 +303,10 @@ static int print_tunnel_key(struct action_util *au, FILE *f, struct rtattr *arg)
}
print_action_control(f, " ", parm->action, "");
- fprintf(f, "\n\tindex %d ref %d bind %d", parm->index, parm->refcnt,
- parm->bindcnt);
+ print_string(PRINT_FP, NULL, "%s", _SL_);
+ print_uint(PRINT_ANY, "index", "\t index %u", parm->index);
+ print_int(PRINT_ANY, "ref", " ref %d", parm->refcnt);
+ print_int(PRINT_ANY, "bind", " bind %d", parm->bindcnt);
if (show_stats) {
if (tb[TCA_TUNNEL_KEY_TM]) {
@@ -302,7 +316,7 @@ static int print_tunnel_key(struct action_util *au, FILE *f, struct rtattr *arg)
}
}
- fprintf(f, "\n ");
+ print_string(PRINT_FP, NULL, "%s", _SL_);
return 0;
}
--
2.7.4
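
For readers unfamiliar with the pattern used in the patch above, the following
is a simplified, self-contained model in C of how one call can serve both
plain-text and JSON output. It is only a sketch: the sketch_* names, the enum
values and the buffer-based interface are assumptions for illustration and do
not reproduce iproute2's actual print_uint()/print_string() helpers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified output selectors; the values are assumptions for this model. */
enum sketch_out {
	SKETCH_FP   = 1,	/* plain-text output only */
	SKETCH_JSON = 2,	/* JSON output only */
	SKETCH_ANY  = 3,	/* whichever mode is active */
};

/*
 * Format one unsigned value into buf (len must be > 0).
 * In JSON mode the key is emitted and fmt is ignored; in text mode
 * fmt is used and the key is ignored. Returns the number of
 * characters produced, or 0 if the call does not apply to the mode.
 */
static int sketch_print_uint(bool json_mode, enum sketch_out type,
			     const char *key, const char *fmt,
			     unsigned int value, char *buf, size_t len)
{
	buf[0] = '\0';

	if (json_mode) {
		if ((type & SKETCH_JSON) && key)
			return snprintf(buf, len, "\"%s\":%u", key, value);
		return 0;
	}

	if ((type & SKETCH_FP) && fmt)
		return snprintf(buf, len, fmt, value);
	return 0;
}
```

In this model, a text-only call (SKETCH_FP with a NULL key, like the _SL_
separator lines in the patch) produces nothing in JSON mode, which is why the
line breaks and tabs do not leak into the JSON output.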
^ permalink raw reply related