* Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family
From: Andrew Lunn @ 2016-09-13 19:07 UTC (permalink / raw)
To: Vivien Didelot
Cc: John Crispin, David S. Miller, Florian Fainelli, netdev,
linux-kernel, qsdk-review
In-Reply-To: <87twdjafmr.fsf@ketchup.mtl.sfl>
> Since the former alternative is prefered, we may want to remove the
> latter soon from DSA. If this phy_port_map is needed for that case, it'd
> be preferable not to add it.
O.K, so maybe we should solve it the device tree way:
&mdio0 {
phy_port1: phy@0 {
reg = <0>;
};
phy_port2: phy@1 {
reg = <1>;
};
phy_port3: phy@2 {
reg = <2>;
};
phy_port4: phy@3 {
reg = <3>;
};
phy_port5: phy@4 {
reg = <4>;
};
switch@0 {
compatible = "qca,qca8337";
#address-cells = <1>;
#size-cells = <0>;
reg = <30>;
ports {
port@11 {
reg = <11>;
label = "cpu";
ethernet = <&gmac1>;
phy-mode = "rgmii";
};
port@1 {
reg = <1>;
label = "lan1";
phy-handle = <&phy_port1>;
};
port@2 {
reg = <2>;
label = "lan2";
phy-handle = <&phy_port2>;
};
port@3 {
reg = <3>;
label = "lan3";
phy-handle = <&phy_port3>;
};
port@4 {
reg = <4>;
label = "lan4";
phy-handle = <&phy_port4>;
};
};
};
};
and remove the phy_read() and phy_write() functions.
Andrew
^ permalink raw reply
* Re: [PATCH net-next] MAINTAINERS: Add an entry for the core network DSA code
From: Florian Fainelli @ 2016-09-13 19:01 UTC (permalink / raw)
To: Andrew Lunn, David Miller; +Cc: Vivien Didelot, netdev
In-Reply-To: <1473793231-26130-1-git-send-email-andrew@lunn.ch>
On 09/13/2016 12:00 PM, Andrew Lunn wrote:
> The core distributed switch architecture code currently does not have
> a MAINTAINERS entry, which results in some contributions not landing
> in the right peoples inbox.
>
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
> ---
> MAINTAINERS | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ce80b36aab69..98cf0c4a14cf 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8169,6 +8169,13 @@ S: Maintained
> W: https://fedorahosted.org/dropwatch/
> F: net/core/drop_monitor.c
>
> +NETWORKING [DSA]
> +M: Andrew Lunn <andrew@lunn.ch>
> +M: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
> +M: Florian Fainelli <f.fainelli@gmail.com>
> +S: Maintained
> +F: net/dsa/
F: include/net/dsa.h
F: drivers/net/dsa/
Other than that, LGTM
--
Florian
^ permalink raw reply
* [PATCH net-next] MAINTAINERS: Add an entry for the core network DSA code
From: Andrew Lunn @ 2016-09-13 19:00 UTC (permalink / raw)
To: David Miller; +Cc: Florian Fainelli, Vivien Didelot, netdev, Andrew Lunn
The core distributed switch architecture code currently does not have
a MAINTAINERS entry, which results in some contributions not landing
in the right peoples inbox.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
---
MAINTAINERS | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index ce80b36aab69..98cf0c4a14cf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8169,6 +8169,13 @@ S: Maintained
W: https://fedorahosted.org/dropwatch/
F: net/core/drop_monitor.c
+NETWORKING [DSA]
+M: Andrew Lunn <andrew@lunn.ch>
+M: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
+M: Florian Fainelli <f.fainelli@gmail.com>
+S: Maintained
+F: net/dsa/
+
NETWORKING [GENERAL]
M: "David S. Miller" <davem@davemloft.net>
L: netdev@vger.kernel.org
--
2.9.3
^ permalink raw reply related
* Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
From: Greg @ 2016-09-13 18:33 UTC (permalink / raw)
To: Cyrill Gorcunov
Cc: netdev, linux-kernel, David Miller, dsa, eric.dumazet, kuznet,
jmorris, yoshfuji, kaber, avagin, stephen
In-Reply-To: <20160913171950.GC32643@uranus>
On Tue, 2016-09-13 at 20:19 +0300, Cyrill Gorcunov wrote:
> In criu we are actively using diag interface to collect sockets
> present in the system when dumping applications. And while for
> unix, tcp, udp[lite], packet, netlink it works as expected,
> the raw sockets do not have. Thus add it.
>
> v2:
> - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
> - implement @destroy for diag requests (by dsa@)
>
> v3:
> - add export of raw_abort for IPv6 (by dsa@)
> - pass net-admin flag into inet_sk_diag_fill due to
> changes in net-next branch (by dsa@)
>
> CC: David S. Miller <davem@davemloft.net>
> CC: Eric Dumazet <eric.dumazet@gmail.com>
> CC: David Ahern <dsa@cumulusnetworks.com>
> CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
> CC: James Morris <jmorris@namei.org>
> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
> CC: Patrick McHardy <kaber@trash.net>
> CC: Andrey Vagin <avagin@openvz.org>
> CC: Stephen Hemminger <stephen@networkplumber.org>
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> ---
>
> include/net/raw.h | 6 +
> include/net/rawv6.h | 7 +
> net/ipv4/Kconfig | 8 +
> net/ipv4/Makefile | 1
> net/ipv4/raw.c | 21 ++++
> net/ipv4/raw_diag.c | 226 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> net/ipv6/raw.c | 7 +
> 7 files changed, 272 insertions(+), 4 deletions(-)
>
> Index: linux-ml.git/include/net/raw.h
> ===================================================================
> --- linux-ml.git.orig/include/net/raw.h
> +++ linux-ml.git/include/net/raw.h
> @@ -23,6 +23,12 @@
>
> extern struct proto raw_prot;
>
> +extern struct raw_hashinfo raw_v4_hashinfo;
> +struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
> + unsigned short num, __be32 raddr,
> + __be32 laddr, int dif);
> +
> +int raw_abort(struct sock *sk, int err);
> void raw_icmp_error(struct sk_buff *, int, u32);
> int raw_local_deliver(struct sk_buff *, int);
>
> Index: linux-ml.git/include/net/rawv6.h
> ===================================================================
> --- linux-ml.git.orig/include/net/rawv6.h
> +++ linux-ml.git/include/net/rawv6.h
> @@ -3,6 +3,13 @@
>
> #include <net/protocol.h>
>
> +extern struct raw_hashinfo raw_v6_hashinfo;
> +struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
> + unsigned short num, const struct in6_addr *loc_addr,
> + const struct in6_addr *rmt_addr, int dif);
> +
> +int raw_abort(struct sock *sk, int err);
> +
> void raw6_icmp_error(struct sk_buff *, int nexthdr,
> u8 type, u8 code, int inner_offset, __be32);
> bool raw6_local_deliver(struct sk_buff *, int);
> Index: linux-ml.git/net/ipv4/Kconfig
> ===================================================================
> --- linux-ml.git.orig/net/ipv4/Kconfig
> +++ linux-ml.git/net/ipv4/Kconfig
> @@ -430,6 +430,14 @@ config INET_UDP_DIAG
> Support for UDP socket monitoring interface used by the ss tool.
> If unsure, say Y.
>
> +config INET_RAW_DIAG
> + tristate "RAW: socket monitoring interface"
> + depends on INET_DIAG && (IPV6 || IPV6=n)
> + default n
> + ---help---
> + Support for RAW socket monitoring interface used by the ss tool.
> + If unsure, say Y.
> +
> config INET_DIAG_DESTROY
> bool "INET: allow privileged process to administratively close sockets"
> depends on INET_DIAG
> Index: linux-ml.git/net/ipv4/Makefile
> ===================================================================
> --- linux-ml.git.orig/net/ipv4/Makefile
> +++ linux-ml.git/net/ipv4/Makefile
> @@ -40,6 +40,7 @@ obj-$(CONFIG_NETFILTER) += netfilter.o n
> obj-$(CONFIG_INET_DIAG) += inet_diag.o
> obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
> obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o
> +obj-$(CONFIG_INET_RAW_DIAG) += raw_diag.o
> obj-$(CONFIG_NET_TCPPROBE) += tcp_probe.o
> obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o
> obj-$(CONFIG_TCP_CONG_CDG) += tcp_cdg.o
> Index: linux-ml.git/net/ipv4/raw.c
> ===================================================================
> --- linux-ml.git.orig/net/ipv4/raw.c
> +++ linux-ml.git/net/ipv4/raw.c
> @@ -89,9 +89,10 @@ struct raw_frag_vec {
> int hlen;
> };
>
> -static struct raw_hashinfo raw_v4_hashinfo = {
> +struct raw_hashinfo raw_v4_hashinfo = {
> .lock = __RW_LOCK_UNLOCKED(raw_v4_hashinfo.lock),
> };
> +EXPORT_SYMBOL_GPL(raw_v4_hashinfo);
>
> int raw_hash_sk(struct sock *sk)
> {
> @@ -120,7 +121,7 @@ void raw_unhash_sk(struct sock *sk)
> }
> EXPORT_SYMBOL_GPL(raw_unhash_sk);
>
> -static struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
> +struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
> unsigned short num, __be32 raddr, __be32 laddr, int dif)
> {
> sk_for_each_from(sk) {
> @@ -136,6 +137,7 @@ static struct sock *__raw_v4_lookup(stru
> found:
> return sk;
> }
> +EXPORT_SYMBOL_GPL(__raw_v4_lookup);
>
> /*
> * 0 - deliver
> @@ -918,6 +920,20 @@ static int compat_raw_ioctl(struct sock
> }
> #endif
>
> +int raw_abort(struct sock *sk, int err)
> +{
> + lock_sock(sk);
> +
> + sk->sk_err = err;
> + sk->sk_error_report(sk);
> + udp_disconnect(sk, 0);
> +
> + release_sock(sk);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(raw_abort);
> +
> struct proto raw_prot = {
> .name = "RAW",
> .owner = THIS_MODULE,
> @@ -943,6 +959,7 @@ struct proto raw_prot = {
> .compat_getsockopt = compat_raw_getsockopt,
> .compat_ioctl = compat_raw_ioctl,
> #endif
> + .diag_destroy = raw_abort,
> };
>
> #ifdef CONFIG_PROC_FS
> Index: linux-ml.git/net/ipv4/raw_diag.c
> ===================================================================
> --- /dev/null
> +++ linux-ml.git/net/ipv4/raw_diag.c
> @@ -0,0 +1,226 @@
> +#include <linux/module.h>
> +
> +#include <linux/inet_diag.h>
> +#include <linux/sock_diag.h>
> +
> +#include <net/raw.h>
> +#include <net/rawv6.h>
> +
> +#ifdef pr_fmt
> +# undef pr_fmt
> +#endif
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +static struct raw_hashinfo *
> +raw_get_hashinfo(const struct inet_diag_req_v2 *r)
> +{
> + if (r->sdiag_family == AF_INET) {
> + return &raw_v4_hashinfo;
> +#if IS_ENABLED(CONFIG_IPV6)
> + } else if (r->sdiag_family == AF_INET6) {
> + return &raw_v6_hashinfo;
> +#endif
Someday Linux will be a modern OS that just includes IPV6 and forces a
config option to NOT have it.
That'll be great. All the IS_ENABLED_(CONFIG_IPV6) scattered everywhere
is nuts.
</editorial comment>
- Greg
> + } else {
> + pr_warn_once("Unexpected inet family %d\n",
> + r->sdiag_family);
> + WARN_ON_ONCE(1);
> + return ERR_PTR(-EINVAL);
> + }
> +}
> +
> +static struct sock *raw_lookup(struct net *net, struct sock *from,
> + const struct inet_diag_req_v2 *r)
> +{
> + struct sock *sk = NULL;
> +
> + if (r->sdiag_family == AF_INET)
> + sk = __raw_v4_lookup(net, from, r->sdiag_protocol,
> + r->id.idiag_dst[0],
> + r->id.idiag_src[0],
> + r->id.idiag_if);
> +#if IS_ENABLED(CONFIG_IPV6)
> + else
> + sk = __raw_v6_lookup(net, from, r->sdiag_protocol,
> + (const struct in6_addr *)r->id.idiag_src,
> + (const struct in6_addr *)r->id.idiag_dst,
> + r->id.idiag_if);
> +#endif
> + return sk;
> +}
> +
> +static struct sock *raw_sock_get(struct net *net, const struct inet_diag_req_v2 *r)
> +{
> + struct raw_hashinfo *hashinfo = raw_get_hashinfo(r);
> + struct sock *sk = NULL, *s;
> + int slot;
> +
> + if (IS_ERR(hashinfo))
> + return ERR_CAST(hashinfo);
> +
> + read_lock(&hashinfo->lock);
> + for (slot = 0; slot < RAW_HTABLE_SIZE; slot++) {
> + sk_for_each(s, &hashinfo->ht[slot]) {
> + sk = raw_lookup(net, s, r);
> + if (sk)
> + break;
> + }
> + }
> + if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
> + sk = NULL;
> + read_unlock(&hashinfo->lock);
> +
> + return sk ? sk : ERR_PTR(-ENOENT);
> +}
> +
> +static int raw_diag_dump_one(struct sk_buff *in_skb,
> + const struct nlmsghdr *nlh,
> + const struct inet_diag_req_v2 *r)
> +{
> + struct net *net = sock_net(in_skb->sk);
> + struct sk_buff *rep;
> + struct sock *sk;
> + int err;
> +
> + sk = raw_sock_get(net, r);
> + if (IS_ERR(sk))
> + return PTR_ERR(sk);
> +
> + rep = nlmsg_new(sizeof(struct inet_diag_msg) +
> + sizeof(struct inet_diag_meminfo) + 64,
> + GFP_KERNEL);
> + if (!rep) {
> + sock_put(sk);
> + return -ENOMEM;
> + }
> +
> + err = inet_sk_diag_fill(sk, NULL, rep, r,
> + sk_user_ns(NETLINK_CB(in_skb).sk),
> + NETLINK_CB(in_skb).portid,
> + nlh->nlmsg_seq, 0, nlh,
> + netlink_net_capable(in_skb, CAP_NET_ADMIN));
> + sock_put(sk);
> +
> + if (err < 0) {
> + kfree_skb(rep);
> + return err;
> + }
> +
> + err = netlink_unicast(net->diag_nlsk, rep,
> + NETLINK_CB(in_skb).portid,
> + MSG_DONTWAIT);
> + if (err > 0)
> + err = 0;
> + return err;
> +}
> +
> +static int sk_diag_dump(struct sock *sk, struct sk_buff *skb,
> + struct netlink_callback *cb,
> + const struct inet_diag_req_v2 *r,
> + struct nlattr *bc, bool net_admin)
> +{
> + if (!inet_diag_bc_sk(bc, sk))
> + return 0;
> +
> + return inet_sk_diag_fill(sk, NULL, skb, r,
> + sk_user_ns(NETLINK_CB(cb->skb).sk),
> + NETLINK_CB(cb->skb).portid,
> + cb->nlh->nlmsg_seq, NLM_F_MULTI,
> + cb->nlh, net_admin);
> +}
> +
> +static void raw_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
> + const struct inet_diag_req_v2 *r, struct nlattr *bc)
> +{
> + bool net_admin = netlink_net_capable(cb->skb, CAP_NET_ADMIN);
> + struct raw_hashinfo *hashinfo = raw_get_hashinfo(r);
> + struct net *net = sock_net(skb->sk);
> + int num, s_num, slot, s_slot;
> + struct sock *sk = NULL;
> +
> + if (IS_ERR(hashinfo))
> + return;
> +
> + s_slot = cb->args[0];
> + num = s_num = cb->args[1];
> +
> + read_lock(&hashinfo->lock);
> + for (slot = s_slot; slot < RAW_HTABLE_SIZE; s_num = 0, slot++) {
> + num = 0;
> +
> + sk_for_each(sk, &hashinfo->ht[slot]) {
> + struct inet_sock *inet = inet_sk(sk);
> +
> + if (!net_eq(sock_net(sk), net))
> + continue;
> + if (num < s_num)
> + goto next;
> + if (sk->sk_family != r->sdiag_family)
> + goto next;
> + if (r->id.idiag_sport != inet->inet_sport &&
> + r->id.idiag_sport)
> + goto next;
> + if (r->id.idiag_dport != inet->inet_dport &&
> + r->id.idiag_dport)
> + goto next;
> + if (sk_diag_dump(sk, skb, cb, r, bc, net_admin) < 0)
> + goto out_unlock;
> +next:
> + num++;
> + }
> + }
> +
> +out_unlock:
> + read_unlock(&hashinfo->lock);
> +
> + cb->args[0] = slot;
> + cb->args[1] = num;
> +}
> +
> +static void raw_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
> + void *info)
> +{
> + r->idiag_rqueue = sk_rmem_alloc_get(sk);
> + r->idiag_wqueue = sk_wmem_alloc_get(sk);
> +}
> +
> +#ifdef CONFIG_INET_DIAG_DESTROY
> +static int raw_diag_destroy(struct sk_buff *in_skb,
> + const struct inet_diag_req_v2 *r)
> +{
> + struct net *net = sock_net(in_skb->sk);
> + struct sock *sk;
> +
> + sk = raw_sock_get(net, r);
> + if (IS_ERR(sk))
> + return PTR_ERR(sk);
> + return sock_diag_destroy(sk, ECONNABORTED);
> +}
> +#endif
> +
> +static const struct inet_diag_handler raw_diag_handler = {
> + .dump = raw_diag_dump,
> + .dump_one = raw_diag_dump_one,
> + .idiag_get_info = raw_diag_get_info,
> + .idiag_type = IPPROTO_RAW,
> + .idiag_info_size = 0,
> +#ifdef CONFIG_INET_DIAG_DESTROY
> + .destroy = raw_diag_destroy,
> +#endif
> +};
> +
> +static int __init raw_diag_init(void)
> +{
> + return inet_diag_register(&raw_diag_handler);
> +}
> +
> +static void __exit raw_diag_exit(void)
> +{
> + inet_diag_unregister(&raw_diag_handler);
> +}
> +
> +module_init(raw_diag_init);
> +module_exit(raw_diag_exit);
> +MODULE_LICENSE("GPL");
> +MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_NETLINK, NETLINK_SOCK_DIAG, 2-255 /* AF_INET - IPPROTO_RAW */);
> +MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_NETLINK, NETLINK_SOCK_DIAG, 10-255 /* AF_INET6 - IPPROTO_RAW */);
> Index: linux-ml.git/net/ipv6/raw.c
> ===================================================================
> --- linux-ml.git.orig/net/ipv6/raw.c
> +++ linux-ml.git/net/ipv6/raw.c
> @@ -65,11 +65,12 @@
>
> #define ICMPV6_HDRLEN 4 /* ICMPv6 header, RFC 4443 Section 2.1 */
>
> -static struct raw_hashinfo raw_v6_hashinfo = {
> +struct raw_hashinfo raw_v6_hashinfo = {
> .lock = __RW_LOCK_UNLOCKED(raw_v6_hashinfo.lock),
> };
> +EXPORT_SYMBOL_GPL(raw_v6_hashinfo);
>
> -static struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
> +struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
> unsigned short num, const struct in6_addr *loc_addr,
> const struct in6_addr *rmt_addr, int dif)
> {
> @@ -102,6 +103,7 @@ static struct sock *__raw_v6_lookup(stru
> found:
> return sk;
> }
> +EXPORT_SYMBOL_GPL(__raw_v6_lookup);
>
> /*
> * 0 - deliver
> @@ -1252,6 +1254,7 @@ struct proto rawv6_prot = {
> .compat_getsockopt = compat_rawv6_getsockopt,
> .compat_ioctl = compat_rawv6_ioctl,
> #endif
> + .diag_destroy = raw_abort,
> };
>
> #ifdef CONFIG_PROC_FS
^ permalink raw reply
* Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
From: Alexei Starovoitov @ 2016-09-13 18:30 UTC (permalink / raw)
To: Rustad, Mark D
Cc: Eric Dumazet, Tom Herbert, Brenden Blanco,
Linux Kernel Network Developers, intel-wired-lan,
Jesper Dangaard Brouer, Cong Wang, David S. Miller, William Tu
In-Reply-To: <36ECC79A-0810-4D06-A72D-702AD9901E83@intel.com>
On Tue, Sep 13, 2016 at 06:28:03PM +0000, Rustad, Mark D wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> >I've looked through qemu and it appears only emulate e1k and tg3.
> >The latter is still used in the field, so the risk of touching
> >it is higher.
>
> I have no idea what makes you think that e1k is *not* "used in the field".
> I grant you it probably is used more virtualized than not these days, but it
> certainly exists and is used. You can still buy them new at Newegg for
> goodness sakes!
the point that it's only used virtualized, since PCI (not PCIE) have
been long dead.
^ permalink raw reply
* Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
From: Rustad, Mark D @ 2016-09-13 18:28 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Eric Dumazet, Tom Herbert, Brenden Blanco,
Linux Kernel Network Developers, intel-wired-lan,
Jesper Dangaard Brouer, Cong Wang, David S. Miller, William Tu
In-Reply-To: <20160913175926.GA38737@ast-mbp.thefacebook.com>
[-- Attachment #1: Type: text/plain, Size: 613 bytes --]
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> I've looked through qemu and it appears only emulate e1k and tg3.
> The latter is still used in the field, so the risk of touching
> it is higher.
I have no idea what makes you think that e1k is *not* "used in the field".
I grant you it probably is used more virtualized than not these days, but
it certainly exists and is used. You can still buy them new at Newegg for
goodness sakes!
Maybe I'll go home and plug in my old e100 into my machine that still has a
PCI slot, just for old times sake. Oh darn, I have a SCSI card in that
slot...
^ permalink raw reply
* Re: [PATCH net-next 2/3] net: ethernet: mediatek: add ethtool functions to configure RX flows of HW LRO
From: Florian Fainelli @ 2016-09-13 18:26 UTC (permalink / raw)
To: Nelson Chang, john, davem; +Cc: nbd, netdev, linux-mediatek, nelsonch.tw
In-Reply-To: <1473774866-3156-3-git-send-email-nelson.chang@mediatek.com>
On 09/13/2016 06:54 AM, Nelson Chang wrote:
> The codes add ethtool functions to set RX flows for HW LRO. Because the
> HW LRO hardware can only recognize the destination IP of TCP/IP RX flows,
> the ethtool command to add HW LRO flow is as below:
> ethtool -N [devname] flow-type tcp4 dst-ip [ip_addr] loc [0~1]
>
> Otherwise, cause the hardware can set total four destination IPs, each
> GMAC (GMAC1/GMAC2) can set two IPs separately at most.
>
> Signed-off-by: Nelson Chang <nelson.chang@mediatek.com>
> ---
> +
> +static int mtk_set_features(struct net_device *dev, netdev_features_t features)
> +{
> + int err = 0;
> +
> + if (!((dev->features ^ features) & NETIF_F_LRO))
> + return 0;
> +
> + if (!(features & NETIF_F_LRO))
> + mtk_hwlro_netdev_disable(dev);
you may want to implement a fix_features ndo operations which makes sure
that NETIF_F_LRO is turned on in case a RX flow is programmed,
otherwise, it may be confusing to the user that a flow was programmed,
but no offload is happening.
--
Florian
^ permalink raw reply
* Re: [PATCH net-next 3/3] net: ethernet: mediatek: add dts configuration to enable HW LRO
From: Florian Fainelli @ 2016-09-13 18:24 UTC (permalink / raw)
To: Nelson Chang, john, davem; +Cc: nbd, netdev, linux-mediatek, nelsonch.tw
In-Reply-To: <1473774866-3156-4-git-send-email-nelson.chang@mediatek.com>
On 09/13/2016 06:54 AM, Nelson Chang wrote:
> Add the configuration of HW LRO in the binding document.
>
> Signed-off-by: Nelson Chang <nelson.chang@mediatek.com>
> ---
> Documentation/devicetree/bindings/net/mediatek-net.txt | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/Documentation/devicetree/bindings/net/mediatek-net.txt b/Documentation/devicetree/bindings/net/mediatek-net.txt
> index 32eaaca..f43c0d1 100644
> --- a/Documentation/devicetree/bindings/net/mediatek-net.txt
> +++ b/Documentation/devicetree/bindings/net/mediatek-net.txt
> @@ -20,6 +20,7 @@ Required properties:
> - mediatek,ethsys: phandle to the syscon node that handles the port setup
> - mediatek,pctl: phandle to the syscon node that handles the ports slew rate
> and driver current
> +- mediatek,hwlro: set to enable HW LRO functions of PDMA rx rings
That sounds like implementing a enable/disable policy in the Device Tree
as opposed to providing an indication as to whether the HW supports LRO
or not. If all versions of the hardware support LRO, then you would
rather let the users change NETIF_F_LRO using ethtool features instead
of having this be defined in the Device Tree.
If, on the other hand, not all version of the HW support LRO, then you
would just want to rephrase the property description to say this
describes a capability.
--
Florian
^ permalink raw reply
* Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev
From: Jiri Slaby @ 2016-09-13 18:14 UTC (permalink / raw)
To: Tejun Heo, Dmitry Vyukov
Cc: Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller,
linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany,
Alexander Potapenko, Eric Dumazet, Takashi Iwai
In-Reply-To: <20160913153520.GC21123-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
On 09/13/2016, 05:35 PM, Tejun Heo wrote:
> Hello,
>
> On Sat, Sep 10, 2016 at 11:33:48AM +0200, Dmitry Vyukov wrote:
>> Hit the WARNING with the patch. It showed "Showing busy workqueues and
>> worker pools:" after the WARNING, but then no queue info. Was it
>> already destroyed and removed from the list?...
>
> Hmm... It either means that the work item which was in flight when
> WARN_ON() ran finished by the time the debug printout got to it or
> that it's something unrelated to busy work items.
>
>> [ 198.113838] WARNING: CPU: 2 PID: 26691 at kernel/workqueue.c:4042
>> destroy_workqueue+0x17b/0x630
>
> I don't seem to have the same source code that you have. Which exact
> WARN_ON() is this?
I assume Dmitry sees the same what I am still seeing, so I reported this
some time ago:
https://lkml.org/lkml/2016/3/21/492
This warning is trigerred there and still occurs with "HEAD":
(pwq != wq->dfl_pwq) && (pwq->refcnt > 1)
and the state dump is in the log empty too:
destroy_workqueue: name='hci0' pwq=ffff88006b5c8f00
wq->dfl_pwq=ffff88006b5c9b00 pwq->refcnt=2 pwq->nr_active=0 delayed_works:
pwq 13:
cpus=2-3 node=1 flags=0x4 nice=-20 active=0/1
in-flight: 2669:wq_barrier_func
thanks,
--
js
suse labs
^ permalink raw reply
* Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family
From: Florian Fainelli @ 2016-09-13 18:11 UTC (permalink / raw)
To: John Crispin, Andrew Lunn
Cc: David S. Miller, netdev, linux-kernel, qsdk-review
In-Reply-To: <44865cc8-19bf-54dc-ad05-82bdaa648eeb@phrozen.org>
On 09/13/2016 11:07 AM, John Crispin wrote:
>
>
> On 13/09/2016 19:09, Florian Fainelli wrote:
>> On 09/13/2016 08:59 AM, Andrew Lunn wrote:
>>>> Hi Andrew,
>>>>
>>>> this function does indeed duplicate the functionality of
>>>> phy_ethtool_get_eee() with the small difference, that e->eee_active is
>>>> also set which phy_ethtool_get_eee() does not set.
>>>>
>>>> dsa_slave_get_eee() will call phy_ethtool_get_eee() right after the
>>>> get_eee() op has been called. would it be ok to move the code setting
>>>> eee_active to phy_ethtool_get_eee().
>>
>> Humm, AFAIR, the reason why eee_active is set outside of
>> phy_ethtool_set_eee() is because this is a MAC + PHY thing, both need to
>> agree and support that, and so while the PHY may be configured to have
>> EEE advertised and enabled, you also need to take care of the MAC
>> portion and enable EEE in there as well. Is not there such a thing for
>> the qca8k switch where the PHY needs to be configured through the
>> standard phylib calls, but the switch's transmitter/receiver also needs
>> to have EEE enabled?
>>
>
> Hi Florian,
>
> the switch needs to enable the eee on a per mac absis, but there is no
> way to tell if the autonegotiate worked and eee is enabled without
> reading the phys registers.
OK, that does not sound atypical here, most drivers I see do have a way
to tell if EEE is active by reading e.g: the LPI indication register, or
something that is able to reflect the negotiated result.
>
> setting the eee_active inside phy_ethtool_get_eee() would break those
> dsa drivers that have a register telling if AN worked. if it is ok i
> will just call phy_ethtool_get_eee() inside get_eee().
Ok, let's see the code and then we can discuss from there, not very
clear on the proposed change here.
--
Florian
^ permalink raw reply
* Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family
From: John Crispin @ 2016-09-13 18:07 UTC (permalink / raw)
To: Florian Fainelli, Andrew Lunn
Cc: David S. Miller, netdev, linux-kernel, qsdk-review
In-Reply-To: <371ed906-5ce8-0553-1400-11f0abf4b489@gmail.com>
On 13/09/2016 19:09, Florian Fainelli wrote:
> On 09/13/2016 08:59 AM, Andrew Lunn wrote:
>>> Hi Andrew,
>>>
>>> this function does indeed duplicate the functionality of
>>> phy_ethtool_get_eee() with the small difference, that e->eee_active is
>>> also set which phy_ethtool_get_eee() does not set.
>>>
>>> dsa_slave_get_eee() will call phy_ethtool_get_eee() right after the
>>> get_eee() op has been called. would it be ok to move the code setting
>>> eee_active to phy_ethtool_get_eee().
>
> Humm, AFAIR, the reason why eee_active is set outside of
> phy_ethtool_set_eee() is because this is a MAC + PHY thing, both need to
> agree and support that, and so while the PHY may be configured to have
> EEE advertised and enabled, you also need to take care of the MAC
> portion and enable EEE in there as well. Is not there such a thing for
> the qca8k switch where the PHY needs to be configured through the
> standard phylib calls, but the switch's transmitter/receiver also needs
> to have EEE enabled?
>
Hi Florian,
the switch needs to enable the eee on a per mac absis, but there is no
way to tell if the autonegotiate worked and eee is enabled without
reading the phys registers.
setting the eee_active inside phy_ethtool_get_eee() would break those
dsa drivers that have a register telling if AN worked. if it is ok i
will just call phy_ethtool_get_eee() inside get_eee().
John
^ permalink raw reply
* [PATCHv2 net 6/6] sctp: not return ENOMEM err back in sctp_packet_transmit
From: Xin Long @ 2016-09-13 18:04 UTC (permalink / raw)
To: network dev, linux-sctp
Cc: davem, Marcelo Ricardo Leitner, Vlad Yasevich, daniel
In-Reply-To: <cover.1473789537.git.lucien.xin@gmail.com>
As David and Marcelo's suggestion, ENOMEM err shouldn't return back to
user in transmit path. Instead, sctp's retransmit would take care of
the chunks that fail to send because of ENOMEM.
This patch is only to do some release job when alloc_skb fails, not to
return ENOMEM back any more.
Besides, it also cleans up sctp_packet_transmit's err path, and fixes
some issues in err path:
- It didn't free the head skb in nomem: path.
- No need to check nskb in no_route: path.
- It should goto err: path if alloc_skb fails for head.
- Not all the NOMEMs should free nskb.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/sctp/output.c | 47 ++++++++++++++++++++++-------------------------
1 file changed, 22 insertions(+), 25 deletions(-)
diff --git a/net/sctp/output.c b/net/sctp/output.c
index f2597a9..0c605ec 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -442,14 +442,14 @@ int sctp_packet_transmit(struct sctp_packet *packet, gfp_t gfp)
* time. Application may notice this error.
*/
pr_err_once("Trying to GSO but underlying device doesn't support it.");
- goto nomem;
+ goto err;
}
} else {
pkt_size = packet->size;
}
head = alloc_skb(pkt_size + MAX_HEADER, gfp);
if (!head)
- goto nomem;
+ goto err;
if (gso) {
NAPI_GRO_CB(head)->last = head;
skb_shinfo(head)->gso_type = sk->sk_gso_type;
@@ -470,8 +470,12 @@ int sctp_packet_transmit(struct sctp_packet *packet, gfp_t gfp)
}
}
dst = dst_clone(tp->dst);
- if (!dst)
- goto no_route;
+ if (!dst) {
+ if (asoc)
+ IP_INC_STATS(sock_net(asoc->base.sk),
+ IPSTATS_MIB_OUTNOROUTES);
+ goto nodst;
+ }
skb_dst_set(head, dst);
/* Build the SCTP header. */
@@ -622,8 +626,10 @@ int sctp_packet_transmit(struct sctp_packet *packet, gfp_t gfp)
if (!gso)
break;
- if (skb_gro_receive(&head, nskb))
+ if (skb_gro_receive(&head, nskb)) {
+ kfree_skb(nskb);
goto nomem;
+ }
nskb = NULL;
if (WARN_ON_ONCE(skb_shinfo(head)->gso_segs >=
sk->sk_gso_max_segs))
@@ -717,18 +723,13 @@ int sctp_packet_transmit(struct sctp_packet *packet, gfp_t gfp)
}
head->ignore_df = packet->ipfragok;
tp->af_specific->sctp_xmit(head, tp);
+ goto out;
-out:
- sctp_packet_reset(packet);
- return err;
-no_route:
- kfree_skb(head);
- if (nskb != head)
- kfree_skb(nskb);
-
- if (asoc)
- IP_INC_STATS(sock_net(asoc->base.sk), IPSTATS_MIB_OUTNOROUTES);
+nomem:
+ if (packet->auth && list_empty(&packet->auth->list))
+ sctp_chunk_free(packet->auth);
+nodst:
/* FIXME: Returning the 'err' will effect all the associations
* associated with a socket, although only one of the paths of the
* association is unreachable.
@@ -737,22 +738,18 @@ no_route:
* required.
*/
/* err = -EHOSTUNREACH; */
-err:
- /* Control chunks are unreliable so just drop them. DATA chunks
- * will get resent or dropped later.
- */
+ kfree_skb(head);
+err:
list_for_each_entry_safe(chunk, tmp, &packet->chunk_list, list) {
list_del_init(&chunk->list);
if (!sctp_chunk_is_data(chunk))
sctp_chunk_free(chunk);
}
- goto out;
-nomem:
- if (packet->auth && list_empty(&packet->auth->list))
- sctp_chunk_free(packet->auth);
- err = -ENOMEM;
- goto err;
+
+out:
+ sctp_packet_reset(packet);
+ return err;
}
/********************************************************************
--
2.1.0
^ permalink raw reply related
* [PATCHv2 net 5/6] sctp: make sctp_outq_flush/tail/uncork return void
From: Xin Long @ 2016-09-13 18:04 UTC (permalink / raw)
To: network dev, linux-sctp
Cc: davem, Marcelo Ricardo Leitner, Vlad Yasevich, daniel
In-Reply-To: <cover.1473789537.git.lucien.xin@gmail.com>
sctp_outq_flush return value is meaningless now, this patch is
to make sctp_outq_flush return void, as well as sctp_outq_fail
and sctp_outq_uncork.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
include/net/sctp/structs.h | 4 ++--
net/sctp/outqueue.c | 19 +++++++------------
net/sctp/sm_sideeffect.c | 9 ++++-----
3 files changed, 13 insertions(+), 19 deletions(-)
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index f61fb7c..8693dc4 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1077,7 +1077,7 @@ struct sctp_outq {
void sctp_outq_init(struct sctp_association *, struct sctp_outq *);
void sctp_outq_teardown(struct sctp_outq *);
void sctp_outq_free(struct sctp_outq*);
-int sctp_outq_tail(struct sctp_outq *, struct sctp_chunk *chunk, gfp_t);
+void sctp_outq_tail(struct sctp_outq *, struct sctp_chunk *chunk, gfp_t);
int sctp_outq_sack(struct sctp_outq *, struct sctp_chunk *);
int sctp_outq_is_empty(const struct sctp_outq *);
void sctp_outq_restart(struct sctp_outq *);
@@ -1085,7 +1085,7 @@ void sctp_outq_restart(struct sctp_outq *);
void sctp_retransmit(struct sctp_outq *, struct sctp_transport *,
sctp_retransmit_reason_t);
void sctp_retransmit_mark(struct sctp_outq *, struct sctp_transport *, __u8);
-int sctp_outq_uncork(struct sctp_outq *, gfp_t gfp);
+void sctp_outq_uncork(struct sctp_outq *, gfp_t gfp);
void sctp_prsctp_prune(struct sctp_association *asoc,
struct sctp_sndrcvinfo *sinfo, int msg_len);
/* Uncork and flush an outqueue. */
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 052a479..8c3f446 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -68,7 +68,7 @@ static void sctp_mark_missing(struct sctp_outq *q,
static void sctp_generate_fwdtsn(struct sctp_outq *q, __u32 sack_ctsn);
-static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp);
+static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp);
/* Add data to the front of the queue. */
static inline void sctp_outq_head_data(struct sctp_outq *q,
@@ -285,10 +285,9 @@ void sctp_outq_free(struct sctp_outq *q)
}
/* Put a new chunk in an sctp_outq. */
-int sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk *chunk, gfp_t gfp)
+void sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk *chunk, gfp_t gfp)
{
struct net *net = sock_net(q->asoc->base.sk);
- int error = 0;
pr_debug("%s: outq:%p, chunk:%p[%s]\n", __func__, q, chunk,
chunk && chunk->chunk_hdr ?
@@ -318,9 +317,7 @@ int sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk *chunk, gfp_t gfp)
}
if (!q->cork)
- error = sctp_outq_flush(q, 0, gfp);
-
- return error;
+ sctp_outq_flush(q, 0, gfp);
}
/* Insert a chunk into the sorted list based on the TSNs. The retransmit list
@@ -748,12 +745,12 @@ redo:
}
/* Cork the outqueue so queued chunks are really queued. */
-int sctp_outq_uncork(struct sctp_outq *q, gfp_t gfp)
+void sctp_outq_uncork(struct sctp_outq *q, gfp_t gfp)
{
if (q->cork)
q->cork = 0;
- return sctp_outq_flush(q, 0, gfp);
+ sctp_outq_flush(q, 0, gfp);
}
@@ -766,7 +763,7 @@ int sctp_outq_uncork(struct sctp_outq *q, gfp_t gfp)
* locking concerns must be made. Today we use the sock lock to protect
* this function.
*/
-static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
+static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
{
struct sctp_packet *packet;
struct sctp_packet singleton;
@@ -891,7 +888,7 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
error = sctp_packet_transmit(&singleton, gfp);
if (error < 0) {
asoc->base.sk->sk_err = -error;
- return 0;
+ return;
}
break;
@@ -1175,8 +1172,6 @@ sctp_flush_out:
/* Clear the burst limited state, if any */
sctp_transport_burst_reset(t);
}
-
- return 0;
}
/* Update unack_data based on the incoming SACK chunk */
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index cf6e4f0..c345bf1 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -1421,8 +1421,7 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
local_cork = 1;
}
/* Send a chunk to our peer. */
- error = sctp_outq_tail(&asoc->outqueue, cmd->obj.chunk,
- gfp);
+ sctp_outq_tail(&asoc->outqueue, cmd->obj.chunk, gfp);
break;
case SCTP_CMD_SEND_PKT:
@@ -1676,7 +1675,7 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
case SCTP_CMD_FORCE_PRIM_RETRAN:
t = asoc->peer.retran_path;
asoc->peer.retran_path = asoc->peer.primary_path;
- error = sctp_outq_uncork(&asoc->outqueue, gfp);
+ sctp_outq_uncork(&asoc->outqueue, gfp);
local_cork = 0;
asoc->peer.retran_path = t;
break;
@@ -1733,9 +1732,9 @@ out:
*/
if (asoc && SCTP_EVENT_T_CHUNK == event_type && chunk) {
if (chunk->end_of_packet || chunk->singleton)
- error = sctp_outq_uncork(&asoc->outqueue, gfp);
+ sctp_outq_uncork(&asoc->outqueue, gfp);
} else if (local_cork)
- error = sctp_outq_uncork(&asoc->outqueue, gfp);
+ sctp_outq_uncork(&asoc->outqueue, gfp);
if (sp->data_ready_signalled)
sp->data_ready_signalled = 0;
--
2.1.0
^ permalink raw reply related
* [PATCHv2 net 3/6] sctp: free msg->chunks when sctp_primitive_SEND return err
From: Xin Long @ 2016-09-13 18:04 UTC (permalink / raw)
To: network dev, linux-sctp
Cc: davem, Marcelo Ricardo Leitner, Vlad Yasevich, daniel
In-Reply-To: <cover.1473789537.git.lucien.xin@gmail.com>
Last patch "sctp: do not return the transmit err back to sctp_sendmsg"
made sctp_primitive_SEND return err only when asoc state is unavailable.
In this case, chunks are not enqueued, they have no chance to be freed if
we don't take care of them later.
This Patch is actually to revert commit 1cd4d5c4326a ("sctp: remove the
unused sctp_datamsg_free()"), commit 69b5777f2e57 ("sctp: hold the chunks
only after the chunk is enqueued in outq") and commit 8b570dc9f7b6 ("sctp:
only drop the reference on the datamsg after sending a msg"), to use
sctp_datamsg_free to free the chunks of current msg.
Fixes: 8b570dc9f7b6 ("sctp: only drop the reference on the datamsg after sending a msg")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
include/net/sctp/structs.h | 1 +
net/sctp/chunk.c | 13 +++++++++++++
net/sctp/outqueue.c | 1 -
net/sctp/socket.c | 8 ++++++--
4 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index ce93c4b..f61fb7c 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -537,6 +537,7 @@ struct sctp_datamsg {
struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *,
struct sctp_sndrcvinfo *,
struct iov_iter *);
+void sctp_datamsg_free(struct sctp_datamsg *);
void sctp_datamsg_put(struct sctp_datamsg *);
void sctp_chunk_fail(struct sctp_chunk *, int error);
int sctp_chunk_abandoned(struct sctp_chunk *);
diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index a55e547..af9cc80 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -70,6 +70,19 @@ static struct sctp_datamsg *sctp_datamsg_new(gfp_t gfp)
return msg;
}
+void sctp_datamsg_free(struct sctp_datamsg *msg)
+{
+ struct sctp_chunk *chunk;
+
+ /* This doesn't have to be a _safe vairant because
+ * sctp_chunk_free() only drops the refs.
+ */
+ list_for_each_entry(chunk, &msg->chunks, frag_list)
+ sctp_chunk_free(chunk);
+
+ sctp_datamsg_put(msg);
+}
+
/* Final destructruction of datamsg memory. */
static void sctp_datamsg_destroy(struct sctp_datamsg *msg)
{
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index da2418b..6c109b0 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -304,7 +304,6 @@ int sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk *chunk, gfp_t gfp)
sctp_cname(SCTP_ST_CHUNK(chunk->chunk_hdr->type)) :
"illegal chunk");
- sctp_chunk_hold(chunk);
sctp_outq_tail_data(q, chunk);
if (chunk->asoc->prsctp_enable &&
SCTP_PR_PRIO_ENABLED(chunk->sinfo.sinfo_flags))
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 9fc417a..6cdc61c 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1958,6 +1958,8 @@ static int sctp_sendmsg(struct sock *sk, struct msghdr *msg, size_t msg_len)
/* Now send the (possibly) fragmented message. */
list_for_each_entry(chunk, &datamsg->chunks, frag_list) {
+ sctp_chunk_hold(chunk);
+
/* Do accounting for the write space. */
sctp_set_owner_w(chunk);
@@ -1970,13 +1972,15 @@ static int sctp_sendmsg(struct sock *sk, struct msghdr *msg, size_t msg_len)
* breaks.
*/
err = sctp_primitive_SEND(net, asoc, datamsg);
- sctp_datamsg_put(datamsg);
/* Did the lower layer accept the chunk? */
- if (err)
+ if (err) {
+ sctp_datamsg_free(datamsg);
goto out_free;
+ }
pr_debug("%s: we sent primitively\n", __func__);
+ sctp_datamsg_put(datamsg);
err = msg_len;
if (unlikely(wait_connect)) {
--
2.1.0
^ permalink raw reply related
* [PATCHv2 net 4/6] sctp: save transmit error to sk_err in sctp_outq_flush
From: Xin Long @ 2016-09-13 18:04 UTC (permalink / raw)
To: network dev, linux-sctp
Cc: davem, Marcelo Ricardo Leitner, Vlad Yasevich, daniel
In-Reply-To: <cover.1473789537.git.lucien.xin@gmail.com>
Every time when sctp calls sctp_outq_flush, it sends out the chunks of
control queue, retransmit queue and data queue. Even if some trunks are
failed to transmit, it still has to flush all the transports, as it's
the only chance to clean that transmit_list.
So the latest transmit error here should be returned back. This transmit
error is an internal error of sctp stack.
I checked all the places where it uses the transmit error (the return
value of sctp_outq_flush), most of them are actually just save it to
sk_err.
Except for sctp_assoc/endpoint_bh_rcv, they will drop the chunk if
it's failed to send a REPLY, which is actually incorrect, as we can't
be sure the error that sctp_outq_flush returns is from sending that
REPLY.
So it's meaningless for sctp_outq_flush to return error back.
This patch is to save transmit error to sk_err in sctp_outq_flush, the
new error can update the old value. Eventually, sctp_wait_for_* would
check for it.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/sctp/output.c | 3 ++-
net/sctp/outqueue.c | 21 ++++++++++++---------
2 files changed, 14 insertions(+), 10 deletions(-)
diff --git a/net/sctp/output.c b/net/sctp/output.c
index 31b7bc3..f2597a9 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -180,7 +180,6 @@ sctp_xmit_t sctp_packet_transmit_chunk(struct sctp_packet *packet,
int one_packet, gfp_t gfp)
{
sctp_xmit_t retval;
- int error = 0;
pr_debug("%s: packet:%p size:%Zu chunk:%p size:%d\n", __func__,
packet, packet->size, chunk, chunk->skb ? chunk->skb->len : -1);
@@ -188,6 +187,8 @@ sctp_xmit_t sctp_packet_transmit_chunk(struct sctp_packet *packet,
switch ((retval = (sctp_packet_append_chunk(packet, chunk)))) {
case SCTP_XMIT_PMTU_FULL:
if (!packet->has_cookie_echo) {
+ int error = 0;
+
error = sctp_packet_transmit(packet, gfp);
if (error < 0)
chunk->skb->sk->sk_err = -error;
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 6c109b0..052a479 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -533,7 +533,6 @@ void sctp_retransmit(struct sctp_outq *q, struct sctp_transport *transport,
sctp_retransmit_reason_t reason)
{
struct net *net = sock_net(q->asoc->base.sk);
- int error = 0;
switch (reason) {
case SCTP_RTXR_T3_RTX:
@@ -577,10 +576,7 @@ void sctp_retransmit(struct sctp_outq *q, struct sctp_transport *transport,
* will be flushed at the end.
*/
if (reason != SCTP_RTXR_FAST_RTX)
- error = sctp_outq_flush(q, /* rtx_timeout */ 1, GFP_ATOMIC);
-
- if (error)
- q->asoc->base.sk->sk_err = -error;
+ sctp_outq_flush(q, /* rtx_timeout */ 1, GFP_ATOMIC);
}
/*
@@ -893,8 +889,10 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
sctp_packet_config(&singleton, vtag, 0);
sctp_packet_append_chunk(&singleton, chunk);
error = sctp_packet_transmit(&singleton, gfp);
- if (error < 0)
- return error;
+ if (error < 0) {
+ asoc->base.sk->sk_err = -error;
+ return 0;
+ }
break;
case SCTP_CID_ABORT:
@@ -992,6 +990,8 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
retran:
error = sctp_outq_flush_rtx(q, packet,
rtx_timeout, &start_timer);
+ if (error < 0)
+ asoc->base.sk->sk_err = -error;
if (start_timer) {
sctp_transport_reset_t3_rtx(transport);
@@ -1166,14 +1166,17 @@ sctp_flush_out:
struct sctp_transport,
send_ready);
packet = &t->packet;
- if (!sctp_packet_empty(packet))
+ if (!sctp_packet_empty(packet)) {
error = sctp_packet_transmit(packet, gfp);
+ if (error < 0)
+ asoc->base.sk->sk_err = -error;
+ }
/* Clear the burst limited state, if any */
sctp_transport_burst_reset(t);
}
- return error;
+ return 0;
}
/* Update unack_data based on the incoming SACK chunk */
--
2.1.0
^ permalink raw reply related
* [PATCHv2 net 1/6] sctp: remove the unnecessary state check in sctp_outq_tail
From: Xin Long @ 2016-09-13 18:04 UTC (permalink / raw)
To: network dev, linux-sctp
Cc: davem, Marcelo Ricardo Leitner, Vlad Yasevich, daniel
In-Reply-To: <cover.1473789537.git.lucien.xin@gmail.com>
Data Chunks are only sent by sctp_primitive_SEND, in which sctp checks
the asoc's state through statetable before calling sctp_outq_tail. So
there's no need to check the asoc's state again in sctp_outq_tail.
Besides, sctp_do_sm is protected by lock_sock, even if sending msg is
interrupted by timer events, the event's processes still need to acquire
lock_sock first. It means no others CMDs can be enqueue into side effect
list before CMD_SEND_MSG to change asoc->state, so it's safe to remove it.
This patch is to remove redundant asoc->state check from sctp_outq_tail.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/sctp/outqueue.c | 53 ++++++++++++++---------------------------------------
1 file changed, 14 insertions(+), 39 deletions(-)
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 72e54a4..da2418b 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -299,50 +299,25 @@ int sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk *chunk, gfp_t gfp)
* immediately.
*/
if (sctp_chunk_is_data(chunk)) {
- /* Is it OK to queue data chunks? */
- /* From 9. Termination of Association
- *
- * When either endpoint performs a shutdown, the
- * association on each peer will stop accepting new
- * data from its user and only deliver data in queue
- * at the time of sending or receiving the SHUTDOWN
- * chunk.
- */
- switch (q->asoc->state) {
- case SCTP_STATE_CLOSED:
- case SCTP_STATE_SHUTDOWN_PENDING:
- case SCTP_STATE_SHUTDOWN_SENT:
- case SCTP_STATE_SHUTDOWN_RECEIVED:
- case SCTP_STATE_SHUTDOWN_ACK_SENT:
- /* Cannot send after transport endpoint shutdown */
- error = -ESHUTDOWN;
- break;
-
- default:
- pr_debug("%s: outqueueing: outq:%p, chunk:%p[%s])\n",
- __func__, q, chunk, chunk && chunk->chunk_hdr ?
- sctp_cname(SCTP_ST_CHUNK(chunk->chunk_hdr->type)) :
- "illegal chunk");
-
- sctp_chunk_hold(chunk);
- sctp_outq_tail_data(q, chunk);
- if (chunk->asoc->prsctp_enable &&
- SCTP_PR_PRIO_ENABLED(chunk->sinfo.sinfo_flags))
- chunk->asoc->sent_cnt_removable++;
- if (chunk->chunk_hdr->flags & SCTP_DATA_UNORDERED)
- SCTP_INC_STATS(net, SCTP_MIB_OUTUNORDERCHUNKS);
- else
- SCTP_INC_STATS(net, SCTP_MIB_OUTORDERCHUNKS);
- break;
- }
+ pr_debug("%s: outqueueing: outq:%p, chunk:%p[%s])\n",
+ __func__, q, chunk, chunk && chunk->chunk_hdr ?
+ sctp_cname(SCTP_ST_CHUNK(chunk->chunk_hdr->type)) :
+ "illegal chunk");
+
+ sctp_chunk_hold(chunk);
+ sctp_outq_tail_data(q, chunk);
+ if (chunk->asoc->prsctp_enable &&
+ SCTP_PR_PRIO_ENABLED(chunk->sinfo.sinfo_flags))
+ chunk->asoc->sent_cnt_removable++;
+ if (chunk->chunk_hdr->flags & SCTP_DATA_UNORDERED)
+ SCTP_INC_STATS(net, SCTP_MIB_OUTUNORDERCHUNKS);
+ else
+ SCTP_INC_STATS(net, SCTP_MIB_OUTORDERCHUNKS);
} else {
list_add_tail(&chunk->list, &q->control_chunk_list);
SCTP_INC_STATS(net, SCTP_MIB_OUTCTRLCHUNKS);
}
- if (error < 0)
- return error;
-
if (!q->cork)
error = sctp_outq_flush(q, 0, gfp);
--
2.1.0
^ permalink raw reply related
* [PATCHv2 net 2/6] sctp: do not return the transmit err back to sctp_sendmsg
From: Xin Long @ 2016-09-13 18:04 UTC (permalink / raw)
To: network dev, linux-sctp
Cc: davem, Marcelo Ricardo Leitner, Vlad Yasevich, daniel
In-Reply-To: <cover.1473789537.git.lucien.xin@gmail.com>
Once a chunk is enqueued successfully, sctp queues can take care of it.
Even if it is failed to transmit (like because of nomem), it should be
put into retransmit queue.
If sctp report this error to users, it confuses them, they may resend
that msg, but actually in kernel sctp stack is in charge of retransmit
it already.
Besides, this error probably is not from the failure of transmitting
current msg, but transmitting or retransmitting another msg's chunks,
as sctp_outq_flush just tries to send out all transports' chunks.
This patch is to make sctp_cmd_send_msg return avoid, and not return the
transmit err back to sctp_sendmsg
Fixes: 8b570dc9f7b6 ("sctp: only drop the reference on the datamsg after sending a msg")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/sctp/sm_sideeffect.c | 16 +++++-----------
1 file changed, 5 insertions(+), 11 deletions(-)
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index 12d4519..cf6e4f0 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -1020,19 +1020,13 @@ static void sctp_cmd_t1_timer_update(struct sctp_association *asoc,
* This way the whole message is queued up and bundling if
* encouraged for small fragments.
*/
-static int sctp_cmd_send_msg(struct sctp_association *asoc,
- struct sctp_datamsg *msg, gfp_t gfp)
+static void sctp_cmd_send_msg(struct sctp_association *asoc,
+ struct sctp_datamsg *msg, gfp_t gfp)
{
struct sctp_chunk *chunk;
- int error = 0;
-
- list_for_each_entry(chunk, &msg->chunks, frag_list) {
- error = sctp_outq_tail(&asoc->outqueue, chunk, gfp);
- if (error)
- break;
- }
- return error;
+ list_for_each_entry(chunk, &msg->chunks, frag_list)
+ sctp_outq_tail(&asoc->outqueue, chunk, gfp);
}
@@ -1709,7 +1703,7 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
sctp_outq_cork(&asoc->outqueue);
local_cork = 1;
}
- error = sctp_cmd_send_msg(asoc, cmd->obj.msg, gfp);
+ sctp_cmd_send_msg(asoc, cmd->obj.msg, gfp);
break;
case SCTP_CMD_SEND_NEXT_ASCONF:
sctp_cmd_send_asconf(asoc);
--
2.1.0
^ permalink raw reply related
* [PATCHv2 net 0/6] sctp: fix the transmit err process
From: Xin Long @ 2016-09-13 18:04 UTC (permalink / raw)
To: network dev, linux-sctp
Cc: davem, Marcelo Ricardo Leitner, Vlad Yasevich, daniel
This patchset is to improve the transmit err process and also fix some
issues.
After this patchset, once the chunks are enqueued successfully, even
if the chunks fail to send out, no matter because of nodst or nomem,
no err retruns back to users any more. Instead, they are taken care
of by retransmit.
v1->v2:
- add more details to the changelog in patch 1/6
- add Fixes: tag in patch 2/6, 3/6
- also revert 69b5777f2e57 in patch 3/6
Xin Long (6):
sctp: remove the unnecessary state check in sctp_outq_tail
sctp: do not return the transmit err back to sctp_sendmsg
sctp: free msg->chunks when sctp_primitive_SEND return err
sctp: save transmit error to sk_err in sctp_outq_flush
sctp: make sctp_outq_flush/tail/uncork return void
sctp: not return ENOMEM err back in sctp_packet_transmit
include/net/sctp/structs.h | 5 +--
net/sctp/chunk.c | 13 +++++++
net/sctp/output.c | 50 +++++++++++++-------------
net/sctp/outqueue.c | 88 ++++++++++++++++------------------------------
net/sctp/sm_sideeffect.c | 25 +++++--------
net/sctp/socket.c | 8 +++--
6 files changed, 85 insertions(+), 104 deletions(-)
--
2.1.0
^ permalink raw reply
* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
From: Alexei Starovoitov @ 2016-09-13 17:59 UTC (permalink / raw)
To: Eric Dumazet
Cc: Tom Herbert, John Fastabend, Brenden Blanco, Jeff Kirsher,
Jesper Dangaard Brouer, David S. Miller, Cong Wang,
intel-wired-lan, William Tu, Linux Kernel Network Developers
In-Reply-To: <1473788252.18970.177.camel@edumazet-glaptop3.roam.corp.google.com>
On Tue, Sep 13, 2016 at 10:37:32AM -0700, Eric Dumazet wrote:
> On Tue, 2016-09-13 at 10:13 -0700, Alexei Starovoitov wrote:
>
> > I'm afraid the point 'only for debugging' still didn't make it across.
> > xdp+e1k is for development (and debugging) of xdp-type of bpf
> > programs and _not_ for debugging of xdp itself, kernel or anything else.
> > The e1k provided interfaces and behavior needs to match exactly
> > what real hw nics (like mlx4, mlx5, igxbe, i40e) will do.
> > Doing special hacks are not acceptable. Therefore your
> > 'proposed fix' misses the mark, since:
> > 1. ignoring bql/qdisc is not a bug, but the requirement
> > 2. such 'fix' goes against the goal above since behaviors will be
> > different and xdp developer won't be able to build something like
> > xdp loadbalancer in the kvm.
> >
>
>
> Is e1k the only way a VM can receive and send packets ?
>
> Instead of adding more cruft to a legacy driver, risking breaking real
> old machines,
agree that it is the concern.
> I am sure we can find modern alternative.
I've looked through qemu and it appears only emulate e1k and tg3.
The latter is still used in the field, so the risk of touching
it is higher.
The other alternative is virtio, but it doesn't have dma and/or pages,
so it looks to me even messier hack.
The last alternative considered was to invent xdp-only fake 'hw' nic,
but it's too much work to get it into qemu then ask the world
to upgrade qemu.
At that point I ran out of ideas and settled on hacking e1k :(
Not proud of this hack at all.
^ permalink raw reply
* Re: [PATCH net 1/6] sctp: remove the unnecessary state check in sctp_outq_tail
From: Xin Long @ 2016-09-13 17:58 UTC (permalink / raw)
To: Marcelo Ricardo Leitner
Cc: Neil Horman, network dev, linux-sctp, davem, Vlad Yasevich,
daniel
In-Reply-To: <20160912194838.GE17689@localhost.localdomain>
>
> I also don't see an issue with this patch, btw.
>
> Xin, you may want to add more/such details to the changelog, specially
> about the timer versus primitive handling.
>
OK, I will post v2 of this patchset.
^ permalink raw reply
* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
From: Tom Herbert @ 2016-09-13 17:55 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Eric Dumazet, John Fastabend, Brenden Blanco, Jeff Kirsher,
Jesper Dangaard Brouer, David S. Miller, Cong Wang,
intel-wired-lan, William Tu, Linux Kernel Network Developers
In-Reply-To: <20160913171340.GA37341@ast-mbp.thefacebook.com>
On Tue, Sep 13, 2016 at 10:13 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Sep 13, 2016 at 09:21:47AM -0700, Tom Herbert wrote:
>> On Mon, Sep 12, 2016 at 6:28 PM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Mon, Sep 12, 2016 at 05:03:25PM -0700, Tom Herbert wrote:
>> >> On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >> > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
>> >> >
>> >> >> yep. there are various ways to shoot yourself in the foot with xdp.
>> >> >> The simplest program that drops all the packets will make the box unpingable.
>> >> >
>> >> > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
>> >> > scooter on 101 highway ;)
>> >> >
>> >> > This XDP_TX thing was one of the XDP marketing stuff, but there is
>> >> > absolutely no documentation on it, warning users about possible
>> >> > limitations/outcomes.
>> >> >
>> >> > BTW, I am not sure mlx4 implementation even works, vs BQL :
>> >> >
>> >> > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
>> >> > but tx completion will call netdev_tx_completed_queue() -> crash
>> >> >
>> >> > Do we have one test to validate that a XDP_TX implementation is actually
>> >> > correct ?
>> >> >
>> >> Obviously not for e1000 :-(. We really need some real test and
>> >> performance results and analysis on the interaction between the stack
>> >> data path and XDP data path.
>> >
>> > no. we don't need it for e1k and we cannot really do it.
>> > <broken record mode on> this patch is for debugging of xdp programs only.
>> >
>> You can say this "only for a debugging" a thousand times and that
>> still won't justify putting bad code into the kernel. Material issues
>> have been raised with these patches, I have proposed a fix for one
>> core issue, and we have requested a lot more testing. So, please, if
>> you really want to move these patches forward start addressing the
>> concerns being raised by reviewers.
>
> I'm afraid the point 'only for debugging' still didn't make it across.
> xdp+e1k is for development (and debugging) of xdp-type of bpf
> programs and _not_ for debugging of xdp itself, kernel or anything else.
> The e1k provided interfaces and behavior needs to match exactly
> what real hw nics (like mlx4, mlx5, igxbe, i40e) will do.
> Doing special hacks are not acceptable. Therefore your
> 'proposed fix' misses the mark, since:
> 1. ignoring bql/qdisc is not a bug, but the requirement
You don't seem to understand the problem. In the shared queue scenario
if one party (the stack) implements qdiscs, BQL, and such and the
other (XDP) just throws packets onto the queue then these are
incompatible behaviors and something will break. I suppose it's
possible that some how this does not affect the stack path, but
remains to be proven. In any case the patches under review look very
much like they break things; either a fix is needed or tests run to
show it's not a problem. Until this is resolved I am going to nack the
patch.
Tom
> 2. such 'fix' goes against the goal above since behaviors will be
> different and xdp developer won't be able to build something like
> xdp loadbalancer in the kvm.
>
> If you have other concerns please raise them or if you have
> suggestions on how to develop xdp programs without this e1k patch
> I would love hear them.
> Alexander's review comments are discussed in separate thread.
>
^ permalink raw reply
* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
From: Eric Dumazet @ 2016-09-13 17:37 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Tom Herbert, John Fastabend, Brenden Blanco, Jeff Kirsher,
Jesper Dangaard Brouer, David S. Miller, Cong Wang,
intel-wired-lan, William Tu, Linux Kernel Network Developers
In-Reply-To: <20160913171340.GA37341@ast-mbp.thefacebook.com>
On Tue, 2016-09-13 at 10:13 -0700, Alexei Starovoitov wrote:
> I'm afraid the point 'only for debugging' still didn't make it across.
> xdp+e1k is for development (and debugging) of xdp-type of bpf
> programs and _not_ for debugging of xdp itself, kernel or anything else.
> The e1k provided interfaces and behavior needs to match exactly
> what real hw nics (like mlx4, mlx5, igxbe, i40e) will do.
> Doing special hacks are not acceptable. Therefore your
> 'proposed fix' misses the mark, since:
> 1. ignoring bql/qdisc is not a bug, but the requirement
> 2. such 'fix' goes against the goal above since behaviors will be
> different and xdp developer won't be able to build something like
> xdp loadbalancer in the kvm.
>
Is e1k the only way a VM can receive and send packets ?
Instead of adding more cruft to a legacy driver, risking breaking real
old machines, I am sure we can find modern alternative.
^ permalink raw reply
* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
From: Pablo Neira Ayuso @ 2016-09-13 17:24 UTC (permalink / raw)
To: Daniel Mack
Cc: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
ast-b10kYP2dOMg, davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <da300784-284c-0d1f-a82e-aa0a0f8ae116-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:
> Hi,
>
> On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
> > On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
> >> This is v5 of the patch set to allow eBPF programs for network
> >> filtering and accounting to be attached to cgroups, so that they apply
> >> to all sockets of all tasks placed in that cgroup. The logic also
> >> allows to be extendeded for other cgroup based eBPF logic.
> >
> > 1) This infrastructure can only be useful to systemd, or any similar
> > orchestration daemon. Look, you can only apply filtering policies
> > to processes that are launched by systemd, so this only works
> > for server processes.
>
> Sorry, but both statements aren't true. The eBPF policies apply to every
> process that is placed in a cgroup, and my example program in 6/6 shows
> how that can be done from the command line.
Then you have to explain me how can anyone else than systemd use this
infrastructure?
> Also, systemd is able to control userspace processes just fine, and
> it not limited to 'server processes'.
My main point is that those processes *need* to be launched by the
orchestrator, which is was refering as 'server processes'.
> > For client processes this infrastructure is
> > *racy*, you have to add new processes in runtime to the cgroup,
> > thus there will be time some little time where no filtering policy
> > will be applied. For quality of service, this may be an acceptable
> > race, but this is aiming to deploy a filtering policy.
>
> That's a limitation that applies to many more control mechanisms in the
> kernel, and it's something that can easily be solved with fork+exec.
As long as you have control to launch the processes yes, but this
will not work in other scenarios. Just like cgroup net_cls and friends
are broken for filtering for things that you have no control to
fork+exec.
To use this infrastructure from a non-launcher process, you'll have to
rely on the proc connection to subscribe to new process events, then
echo that pid to the cgroup, and that interface is asynchronous so
*adding new processes to the cgroup is subject to races*.
> > 2) This aproach looks uninfrastructured to me. This provides a hook
> > to push a bpf blob at a place in the stack that deploys a filtering
> > policy that is not visible to others.
>
> That's just as transparent as SO_ATTACH_FILTER. What kind of
> introspection mechanism do you have in mind?
SO_ATTACH_FILTER is called from the process itself, so this is a local
filtering policy that you apply to your own process.
In this case, this filtering policy is *global*, other processes with
similar capabilities can get just a bpf blob at best...
[...]
> >> After chatting with Daniel Borkmann and Alexei off-list, we concluded
> >> that __dev_queue_xmit() is the place where the egress hooks should live
> >> when eBPF programs need access to the L2 bits of the skb.
> >
> > 3) This egress hook is coming very late, the only reason I find to
> > place it at __dev_queue_xmit() is that bpf naturally works with
> > layer 2 information in place. But this new hook is placed in
> > _everyone's output ath_ that only works for the very specific
> > usecase I exposed above.
>
> It's about filtering outgoing network packets of applications, and
> providing them with L2 information for filtering purposes. I don't think
> that's a very specific use-case.
>
> When the feature is not used at all, the added costs on the output path
> are close to zero, due to the use of static branches.
*You're proposing a socket filtering facility that hooks layer 2
output path*!
[...]
> > I have nothing against systemd or the needs for more
> > programmability/flexibility in the stack, but I think this needs to
> > fulfill some requirements to fit into the infrastructure that we have
> > in the right way.
>
> Well, as I explained already, this patch set results from endless
> discussions that went nowhere, about how such a thing can be achieved
> with netfilter.
That is only a rough ~30 lines kernel patchset to support this in
netfilter and only one extra input hook, with potential access to
conntrack and better integration with other existing subsystems.
^ permalink raw reply
* [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
From: Cyrill Gorcunov @ 2016-09-13 17:19 UTC (permalink / raw)
To: netdev, linux-kernel
Cc: David Miller, dsa, eric.dumazet, kuznet, jmorris, yoshfuji, kaber,
avagin, stephen
In criu we are actively using diag interface to collect sockets
present in the system when dumping applications. And while for
unix, tcp, udp[lite], packet, netlink it works as expected,
the raw sockets do not have. Thus add it.
v2:
- add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
- implement @destroy for diag requests (by dsa@)
v3:
- add export of raw_abort for IPv6 (by dsa@)
- pass net-admin flag into inet_sk_diag_fill due to
changes in net-next branch (by dsa@)
CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
include/net/raw.h | 6 +
include/net/rawv6.h | 7 +
net/ipv4/Kconfig | 8 +
net/ipv4/Makefile | 1
net/ipv4/raw.c | 21 ++++
net/ipv4/raw_diag.c | 226 ++++++++++++++++++++++++++++++++++++++++++++++++++++
net/ipv6/raw.c | 7 +
7 files changed, 272 insertions(+), 4 deletions(-)
Index: linux-ml.git/include/net/raw.h
===================================================================
--- linux-ml.git.orig/include/net/raw.h
+++ linux-ml.git/include/net/raw.h
@@ -23,6 +23,12 @@
extern struct proto raw_prot;
+extern struct raw_hashinfo raw_v4_hashinfo;
+struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
+ unsigned short num, __be32 raddr,
+ __be32 laddr, int dif);
+
+int raw_abort(struct sock *sk, int err);
void raw_icmp_error(struct sk_buff *, int, u32);
int raw_local_deliver(struct sk_buff *, int);
Index: linux-ml.git/include/net/rawv6.h
===================================================================
--- linux-ml.git.orig/include/net/rawv6.h
+++ linux-ml.git/include/net/rawv6.h
@@ -3,6 +3,13 @@
#include <net/protocol.h>
+extern struct raw_hashinfo raw_v6_hashinfo;
+struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
+ unsigned short num, const struct in6_addr *loc_addr,
+ const struct in6_addr *rmt_addr, int dif);
+
+int raw_abort(struct sock *sk, int err);
+
void raw6_icmp_error(struct sk_buff *, int nexthdr,
u8 type, u8 code, int inner_offset, __be32);
bool raw6_local_deliver(struct sk_buff *, int);
Index: linux-ml.git/net/ipv4/Kconfig
===================================================================
--- linux-ml.git.orig/net/ipv4/Kconfig
+++ linux-ml.git/net/ipv4/Kconfig
@@ -430,6 +430,14 @@ config INET_UDP_DIAG
Support for UDP socket monitoring interface used by the ss tool.
If unsure, say Y.
+config INET_RAW_DIAG
+ tristate "RAW: socket monitoring interface"
+ depends on INET_DIAG && (IPV6 || IPV6=n)
+ default n
+ ---help---
+ Support for RAW socket monitoring interface used by the ss tool.
+ If unsure, say Y.
+
config INET_DIAG_DESTROY
bool "INET: allow privileged process to administratively close sockets"
depends on INET_DIAG
Index: linux-ml.git/net/ipv4/Makefile
===================================================================
--- linux-ml.git.orig/net/ipv4/Makefile
+++ linux-ml.git/net/ipv4/Makefile
@@ -40,6 +40,7 @@ obj-$(CONFIG_NETFILTER) += netfilter.o n
obj-$(CONFIG_INET_DIAG) += inet_diag.o
obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o
+obj-$(CONFIG_INET_RAW_DIAG) += raw_diag.o
obj-$(CONFIG_NET_TCPPROBE) += tcp_probe.o
obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o
obj-$(CONFIG_TCP_CONG_CDG) += tcp_cdg.o
Index: linux-ml.git/net/ipv4/raw.c
===================================================================
--- linux-ml.git.orig/net/ipv4/raw.c
+++ linux-ml.git/net/ipv4/raw.c
@@ -89,9 +89,10 @@ struct raw_frag_vec {
int hlen;
};
-static struct raw_hashinfo raw_v4_hashinfo = {
+struct raw_hashinfo raw_v4_hashinfo = {
.lock = __RW_LOCK_UNLOCKED(raw_v4_hashinfo.lock),
};
+EXPORT_SYMBOL_GPL(raw_v4_hashinfo);
int raw_hash_sk(struct sock *sk)
{
@@ -120,7 +121,7 @@ void raw_unhash_sk(struct sock *sk)
}
EXPORT_SYMBOL_GPL(raw_unhash_sk);
-static struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
+struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
unsigned short num, __be32 raddr, __be32 laddr, int dif)
{
sk_for_each_from(sk) {
@@ -136,6 +137,7 @@ static struct sock *__raw_v4_lookup(stru
found:
return sk;
}
+EXPORT_SYMBOL_GPL(__raw_v4_lookup);
/*
* 0 - deliver
@@ -918,6 +920,20 @@ static int compat_raw_ioctl(struct sock
}
#endif
+int raw_abort(struct sock *sk, int err)
+{
+ lock_sock(sk);
+
+ sk->sk_err = err;
+ sk->sk_error_report(sk);
+ udp_disconnect(sk, 0);
+
+ release_sock(sk);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(raw_abort);
+
struct proto raw_prot = {
.name = "RAW",
.owner = THIS_MODULE,
@@ -943,6 +959,7 @@ struct proto raw_prot = {
.compat_getsockopt = compat_raw_getsockopt,
.compat_ioctl = compat_raw_ioctl,
#endif
+ .diag_destroy = raw_abort,
};
#ifdef CONFIG_PROC_FS
Index: linux-ml.git/net/ipv4/raw_diag.c
===================================================================
--- /dev/null
+++ linux-ml.git/net/ipv4/raw_diag.c
@@ -0,0 +1,226 @@
+#include <linux/module.h>
+
+#include <linux/inet_diag.h>
+#include <linux/sock_diag.h>
+
+#include <net/raw.h>
+#include <net/rawv6.h>
+
+#ifdef pr_fmt
+# undef pr_fmt
+#endif
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+static struct raw_hashinfo *
+raw_get_hashinfo(const struct inet_diag_req_v2 *r)
+{
+ if (r->sdiag_family == AF_INET) {
+ return &raw_v4_hashinfo;
+#if IS_ENABLED(CONFIG_IPV6)
+ } else if (r->sdiag_family == AF_INET6) {
+ return &raw_v6_hashinfo;
+#endif
+ } else {
+ pr_warn_once("Unexpected inet family %d\n",
+ r->sdiag_family);
+ WARN_ON_ONCE(1);
+ return ERR_PTR(-EINVAL);
+ }
+}
+
+static struct sock *raw_lookup(struct net *net, struct sock *from,
+ const struct inet_diag_req_v2 *r)
+{
+ struct sock *sk = NULL;
+
+ if (r->sdiag_family == AF_INET)
+ sk = __raw_v4_lookup(net, from, r->sdiag_protocol,
+ r->id.idiag_dst[0],
+ r->id.idiag_src[0],
+ r->id.idiag_if);
+#if IS_ENABLED(CONFIG_IPV6)
+ else
+ sk = __raw_v6_lookup(net, from, r->sdiag_protocol,
+ (const struct in6_addr *)r->id.idiag_src,
+ (const struct in6_addr *)r->id.idiag_dst,
+ r->id.idiag_if);
+#endif
+ return sk;
+}
+
+static struct sock *raw_sock_get(struct net *net, const struct inet_diag_req_v2 *r)
+{
+ struct raw_hashinfo *hashinfo = raw_get_hashinfo(r);
+ struct sock *sk = NULL, *s;
+ int slot;
+
+ if (IS_ERR(hashinfo))
+ return ERR_CAST(hashinfo);
+
+ read_lock(&hashinfo->lock);
+ for (slot = 0; slot < RAW_HTABLE_SIZE; slot++) {
+ sk_for_each(s, &hashinfo->ht[slot]) {
+ sk = raw_lookup(net, s, r);
+ if (sk)
+ break;
+ }
+ }
+ if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
+ sk = NULL;
+ read_unlock(&hashinfo->lock);
+
+ return sk ? sk : ERR_PTR(-ENOENT);
+}
+
+static int raw_diag_dump_one(struct sk_buff *in_skb,
+ const struct nlmsghdr *nlh,
+ const struct inet_diag_req_v2 *r)
+{
+ struct net *net = sock_net(in_skb->sk);
+ struct sk_buff *rep;
+ struct sock *sk;
+ int err;
+
+ sk = raw_sock_get(net, r);
+ if (IS_ERR(sk))
+ return PTR_ERR(sk);
+
+ rep = nlmsg_new(sizeof(struct inet_diag_msg) +
+ sizeof(struct inet_diag_meminfo) + 64,
+ GFP_KERNEL);
+ if (!rep) {
+ sock_put(sk);
+ return -ENOMEM;
+ }
+
+ err = inet_sk_diag_fill(sk, NULL, rep, r,
+ sk_user_ns(NETLINK_CB(in_skb).sk),
+ NETLINK_CB(in_skb).portid,
+ nlh->nlmsg_seq, 0, nlh,
+ netlink_net_capable(in_skb, CAP_NET_ADMIN));
+ sock_put(sk);
+
+ if (err < 0) {
+ kfree_skb(rep);
+ return err;
+ }
+
+ err = netlink_unicast(net->diag_nlsk, rep,
+ NETLINK_CB(in_skb).portid,
+ MSG_DONTWAIT);
+ if (err > 0)
+ err = 0;
+ return err;
+}
+
+static int sk_diag_dump(struct sock *sk, struct sk_buff *skb,
+ struct netlink_callback *cb,
+ const struct inet_diag_req_v2 *r,
+ struct nlattr *bc, bool net_admin)
+{
+ if (!inet_diag_bc_sk(bc, sk))
+ return 0;
+
+ return inet_sk_diag_fill(sk, NULL, skb, r,
+ sk_user_ns(NETLINK_CB(cb->skb).sk),
+ NETLINK_CB(cb->skb).portid,
+ cb->nlh->nlmsg_seq, NLM_F_MULTI,
+ cb->nlh, net_admin);
+}
+
+static void raw_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
+ const struct inet_diag_req_v2 *r, struct nlattr *bc)
+{
+ bool net_admin = netlink_net_capable(cb->skb, CAP_NET_ADMIN);
+ struct raw_hashinfo *hashinfo = raw_get_hashinfo(r);
+ struct net *net = sock_net(skb->sk);
+ int num, s_num, slot, s_slot;
+ struct sock *sk = NULL;
+
+ if (IS_ERR(hashinfo))
+ return;
+
+ s_slot = cb->args[0];
+ num = s_num = cb->args[1];
+
+ read_lock(&hashinfo->lock);
+ for (slot = s_slot; slot < RAW_HTABLE_SIZE; s_num = 0, slot++) {
+ num = 0;
+
+ sk_for_each(sk, &hashinfo->ht[slot]) {
+ struct inet_sock *inet = inet_sk(sk);
+
+ if (!net_eq(sock_net(sk), net))
+ continue;
+ if (num < s_num)
+ goto next;
+ if (sk->sk_family != r->sdiag_family)
+ goto next;
+ if (r->id.idiag_sport != inet->inet_sport &&
+ r->id.idiag_sport)
+ goto next;
+ if (r->id.idiag_dport != inet->inet_dport &&
+ r->id.idiag_dport)
+ goto next;
+ if (sk_diag_dump(sk, skb, cb, r, bc, net_admin) < 0)
+ goto out_unlock;
+next:
+ num++;
+ }
+ }
+
+out_unlock:
+ read_unlock(&hashinfo->lock);
+
+ cb->args[0] = slot;
+ cb->args[1] = num;
+}
+
+static void raw_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
+ void *info)
+{
+ r->idiag_rqueue = sk_rmem_alloc_get(sk);
+ r->idiag_wqueue = sk_wmem_alloc_get(sk);
+}
+
+#ifdef CONFIG_INET_DIAG_DESTROY
+static int raw_diag_destroy(struct sk_buff *in_skb,
+ const struct inet_diag_req_v2 *r)
+{
+ struct net *net = sock_net(in_skb->sk);
+ struct sock *sk;
+
+ sk = raw_sock_get(net, r);
+ if (IS_ERR(sk))
+ return PTR_ERR(sk);
+ return sock_diag_destroy(sk, ECONNABORTED);
+}
+#endif
+
+static const struct inet_diag_handler raw_diag_handler = {
+ .dump = raw_diag_dump,
+ .dump_one = raw_diag_dump_one,
+ .idiag_get_info = raw_diag_get_info,
+ .idiag_type = IPPROTO_RAW,
+ .idiag_info_size = 0,
+#ifdef CONFIG_INET_DIAG_DESTROY
+ .destroy = raw_diag_destroy,
+#endif
+};
+
+static int __init raw_diag_init(void)
+{
+ return inet_diag_register(&raw_diag_handler);
+}
+
+static void __exit raw_diag_exit(void)
+{
+ inet_diag_unregister(&raw_diag_handler);
+}
+
+module_init(raw_diag_init);
+module_exit(raw_diag_exit);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_NETLINK, NETLINK_SOCK_DIAG, 2-255 /* AF_INET - IPPROTO_RAW */);
+MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_NETLINK, NETLINK_SOCK_DIAG, 10-255 /* AF_INET6 - IPPROTO_RAW */);
Index: linux-ml.git/net/ipv6/raw.c
===================================================================
--- linux-ml.git.orig/net/ipv6/raw.c
+++ linux-ml.git/net/ipv6/raw.c
@@ -65,11 +65,12 @@
#define ICMPV6_HDRLEN 4 /* ICMPv6 header, RFC 4443 Section 2.1 */
-static struct raw_hashinfo raw_v6_hashinfo = {
+struct raw_hashinfo raw_v6_hashinfo = {
.lock = __RW_LOCK_UNLOCKED(raw_v6_hashinfo.lock),
};
+EXPORT_SYMBOL_GPL(raw_v6_hashinfo);
-static struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
+struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
unsigned short num, const struct in6_addr *loc_addr,
const struct in6_addr *rmt_addr, int dif)
{
@@ -102,6 +103,7 @@ static struct sock *__raw_v6_lookup(stru
found:
return sk;
}
+EXPORT_SYMBOL_GPL(__raw_v6_lookup);
/*
* 0 - deliver
@@ -1252,6 +1254,7 @@ struct proto rawv6_prot = {
.compat_getsockopt = compat_rawv6_getsockopt,
.compat_ioctl = compat_rawv6_ioctl,
#endif
+ .diag_destroy = raw_abort,
};
#ifdef CONFIG_PROC_FS
^ permalink raw reply
* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
From: Alexei Starovoitov @ 2016-09-13 17:13 UTC (permalink / raw)
To: Tom Herbert
Cc: Eric Dumazet, John Fastabend, Brenden Blanco, Jeff Kirsher,
Jesper Dangaard Brouer, David S. Miller, Cong Wang,
intel-wired-lan, William Tu, Linux Kernel Network Developers
In-Reply-To: <CALx6S36u+tQUEVhu0BaLZ6-_0=GSMwxd4m89-H+brVZ-rr-U2g@mail.gmail.com>
On Tue, Sep 13, 2016 at 09:21:47AM -0700, Tom Herbert wrote:
> On Mon, Sep 12, 2016 at 6:28 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Mon, Sep 12, 2016 at 05:03:25PM -0700, Tom Herbert wrote:
> >> On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
> >> >
> >> >> yep. there are various ways to shoot yourself in the foot with xdp.
> >> >> The simplest program that drops all the packets will make the box unpingable.
> >> >
> >> > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
> >> > scooter on 101 highway ;)
> >> >
> >> > This XDP_TX thing was one of the XDP marketing stuff, but there is
> >> > absolutely no documentation on it, warning users about possible
> >> > limitations/outcomes.
> >> >
> >> > BTW, I am not sure mlx4 implementation even works, vs BQL :
> >> >
> >> > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
> >> > but tx completion will call netdev_tx_completed_queue() -> crash
> >> >
> >> > Do we have one test to validate that a XDP_TX implementation is actually
> >> > correct ?
> >> >
> >> Obviously not for e1000 :-(. We really need some real test and
> >> performance results and analysis on the interaction between the stack
> >> data path and XDP data path.
> >
> > no. we don't need it for e1k and we cannot really do it.
> > <broken record mode on> this patch is for debugging of xdp programs only.
> >
> You can say this "only for a debugging" a thousand times and that
> still won't justify putting bad code into the kernel. Material issues
> have been raised with these patches, I have proposed a fix for one
> core issue, and we have requested a lot more testing. So, please, if
> you really want to move these patches forward start addressing the
> concerns being raised by reviewers.
I'm afraid the point 'only for debugging' still didn't make it across.
xdp+e1k is for development (and debugging) of xdp-type of bpf
programs and _not_ for debugging of xdp itself, kernel or anything else.
The e1k provided interfaces and behavior needs to match exactly
what real hw nics (like mlx4, mlx5, igxbe, i40e) will do.
Doing special hacks are not acceptable. Therefore your
'proposed fix' misses the mark, since:
1. ignoring bql/qdisc is not a bug, but the requirement
2. such 'fix' goes against the goal above since behaviors will be
different and xdp developer won't be able to build something like
xdp loadbalancer in the kvm.
If you have other concerns please raise them or if you have
suggestions on how to develop xdp programs without this e1k patch
I would love hear them.
Alexander's review comments are discussed in separate thread.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox