* [PATCH 07/14] netfilter: ipset: fix ip_set_list allocation failure
From: Pablo Neira Ayuso @ 2018-11-05 23:28 UTC
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20181105232832.21896-1-pablo@netfilter.org>
From: Andrey Ryabinin <aryabinin@virtuozzo.com>
ip_set_create() and ip_set_net_init() attempt to allocate physically
contiguous memory for ip_set_list. If memory is fragmented, the
allocations could easily fail:
vzctl: page allocation failure: order:7, mode:0xc0d0
Call Trace:
dump_stack+0x19/0x1b
warn_alloc_failed+0x110/0x180
__alloc_pages_nodemask+0x7bf/0xc60
alloc_pages_current+0x98/0x110
kmalloc_order+0x18/0x40
kmalloc_order_trace+0x26/0xa0
__kmalloc+0x279/0x290
ip_set_net_init+0x4b/0x90 [ip_set]
ops_init+0x3b/0xb0
setup_net+0xbb/0x170
copy_net_ns+0xf1/0x1c0
create_new_namespaces+0xf9/0x180
copy_namespaces+0x8e/0xd0
copy_process+0xb61/0x1a00
do_fork+0x91/0x320
Use kvcalloc() to fall back to 0-order allocations if a high-order
page isn't available.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/ipset/ip_set_core.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index fa15a831aeee..68db946df151 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -960,7 +960,7 @@ static int ip_set_create(struct net *net, struct sock *ctnl,
/* Wraparound */
goto cleanup;
- list = kcalloc(i, sizeof(struct ip_set *), GFP_KERNEL);
+ list = kvcalloc(i, sizeof(struct ip_set *), GFP_KERNEL);
if (!list)
goto cleanup;
/* nfnl mutex is held, both lists are valid */
@@ -972,7 +972,7 @@ static int ip_set_create(struct net *net, struct sock *ctnl,
/* Use new list */
index = inst->ip_set_max;
inst->ip_set_max = i;
- kfree(tmp);
+ kvfree(tmp);
ret = 0;
} else if (ret) {
goto cleanup;
@@ -2058,7 +2058,7 @@ ip_set_net_init(struct net *net)
if (inst->ip_set_max >= IPSET_INVALID_ID)
inst->ip_set_max = IPSET_INVALID_ID - 1;
- list = kcalloc(inst->ip_set_max, sizeof(struct ip_set *), GFP_KERNEL);
+ list = kvcalloc(inst->ip_set_max, sizeof(struct ip_set *), GFP_KERNEL);
if (!list)
return -ENOMEM;
inst->is_deleted = false;
@@ -2086,7 +2086,7 @@ ip_set_net_exit(struct net *net)
}
}
nfnl_unlock(NFNL_SUBSYS_IPSET);
- kfree(rcu_dereference_protected(inst->ip_set_list, 1));
+ kvfree(rcu_dereference_protected(inst->ip_set_list, 1));
}
static struct pernet_operations ip_set_net_ops = {
--
2.11.0
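For a sense of scale, the order:7 failure in the trace above is what one would expect for a list of roughly 65534 pointers. The following is a minimal userspace sketch, not kernel code: the 4 KiB page size, the 8-byte pointer size, the entry count and the sketch_* helper names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size */

/* Smallest order such that (SKETCH_PAGE_SIZE << order) >= size. */
static unsigned int sketch_alloc_order(size_t size)
{
	unsigned int order = 0;

	assert(size > 0);
	while ((SKETCH_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Bytes needed for a list of n pointers of ptr_size bytes each. */
static size_t sketch_list_bytes(size_t n, size_t ptr_size)
{
	return n * ptr_size;
}
```

Under these assumptions, sketch_alloc_order(sketch_list_bytes(65534, 8)) evaluates to 7, i.e. 128 physically contiguous pages, which is hard to find on a fragmented system. A kvcalloc()-style allocator avoids the problem by falling back to memory that is only virtually contiguous.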
* [PATCH 09/14] netfilter: xt_IDLETIMER: add sysfs filename checking routine
From: Pablo Neira Ayuso @ 2018-11-05 23:28 UTC
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20181105232832.21896-1-pablo@netfilter.org>
From: Taehee Yoo <ap420073@gmail.com>
When an IDLETIMER rule is added, a sysfs file is created under
/sys/class/xt_idletimer/timers/
Some label names must not be used, such as
".", "..", "power", "uevent", "subsystem", etc.
A sysfs filename checking routine is therefore needed.
test commands:
%iptables -I INPUT -j IDLETIMER --timeout 1 --label "power"
splat looks like:
[95765.423132] sysfs: cannot create duplicate filename '/devices/virtual/xt_idletimer/timers/power'
[95765.433418] CPU: 0 PID: 8446 Comm: iptables Not tainted 4.19.0-rc6+ #20
[95765.449755] Call Trace:
[95765.449755] dump_stack+0xc9/0x16b
[95765.449755] ? show_regs_print_info+0x5/0x5
[95765.449755] sysfs_warn_dup+0x74/0x90
[95765.449755] sysfs_add_file_mode_ns+0x352/0x500
[95765.449755] sysfs_create_file_ns+0x179/0x270
[95765.449755] ? sysfs_add_file_mode_ns+0x500/0x500
[95765.449755] ? idletimer_tg_checkentry+0x3e5/0xb1b [xt_IDLETIMER]
[95765.449755] ? rcu_read_lock_sched_held+0x114/0x130
[95765.449755] ? __kmalloc_track_caller+0x211/0x2b0
[95765.449755] ? memcpy+0x34/0x50
[95765.449755] idletimer_tg_checkentry+0x4e2/0xb1b [xt_IDLETIMER]
[ ... ]
Fixes: 0902b469bd25 ("netfilter: xtables: idletimer target implementation")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/xt_IDLETIMER.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/net/netfilter/xt_IDLETIMER.c b/net/netfilter/xt_IDLETIMER.c
index c6acfc2d9c84..eb4cbd244c3d 100644
--- a/net/netfilter/xt_IDLETIMER.c
+++ b/net/netfilter/xt_IDLETIMER.c
@@ -114,6 +114,22 @@ static void idletimer_tg_expired(struct timer_list *t)
schedule_work(&timer->work);
}
+static int idletimer_check_sysfs_name(const char *name, unsigned int size)
+{
+ int ret;
+
+ ret = xt_check_proc_name(name, size);
+ if (ret < 0)
+ return ret;
+
+ if (!strcmp(name, "power") ||
+ !strcmp(name, "subsystem") ||
+ !strcmp(name, "uevent"))
+ return -EINVAL;
+
+ return 0;
+}
+
static int idletimer_tg_create(struct idletimer_tg_info *info)
{
int ret;
@@ -124,6 +140,10 @@ static int idletimer_tg_create(struct idletimer_tg_info *info)
goto out;
}
+ ret = idletimer_check_sysfs_name(info->label, sizeof(info->label));
+ if (ret < 0)
+ goto out_free_timer;
+
sysfs_attr_init(&info->timer->attr.attr);
info->timer->attr.attr.name = kstrdup(info->label, GFP_KERNEL);
if (!info->timer->attr.attr.name) {
--
2.11.0
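The label check added by this patch can be modelled in userspace as below. This is only a sketch: sketch_check_proc_name() is a hypothetical stand-in for xt_check_proc_name(), and the behaviour given to it here (reject empty names, unterminated names, ".", ".." and names containing '/') is an assumption, not a copy of the kernel helper. The errno constants assume a POSIX <errno.h>.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for the generic name check (assumed behaviour). */
static int sketch_check_proc_name(const char *name, size_t size)
{
	assert(name != NULL && size > 0);
	if (name[0] == '\0')
		return -EINVAL;
	/* No NUL inside the buffer: the label is not terminated. */
	if (memchr(name, '\0', size) == NULL)
		return -ENAMETOOLONG;
	if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0 ||
	    strchr(name, '/') != NULL)
		return -EINVAL;
	return 0;
}

/* Generic check first, then the names sysfs reserves for itself. */
static int sketch_check_sysfs_name(const char *name, size_t size)
{
	int ret = sketch_check_proc_name(name, size);

	if (ret < 0)
		return ret;
	if (!strcmp(name, "power") || !strcmp(name, "subsystem") ||
	    !strcmp(name, "uevent"))
		return -EINVAL;
	return 0;
}
```

With this model, a label such as "power" is refused before any sysfs file is created, which is the point of the patch: the duplicate-filename splat is never reached.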
* [PATCH 08/14] netfilter: ipset: Correct rcu_dereference() call in ip_set_put_comment()
From: Pablo Neira Ayuso @ 2018-11-05 23:28 UTC
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20181105232832.21896-1-pablo@netfilter.org>
From: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
The function is called when rcu_read_lock() is held and not
when rcu_read_lock_bh() is held.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
include/linux/netfilter/ipset/ip_set_comment.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/netfilter/ipset/ip_set_comment.h b/include/linux/netfilter/ipset/ip_set_comment.h
index 8e2bab1e8e90..70877f8de7e9 100644
--- a/include/linux/netfilter/ipset/ip_set_comment.h
+++ b/include/linux/netfilter/ipset/ip_set_comment.h
@@ -43,11 +43,11 @@ ip_set_init_comment(struct ip_set *set, struct ip_set_comment *comment,
rcu_assign_pointer(comment->c, c);
}
-/* Used only when dumping a set, protected by rcu_read_lock_bh() */
+/* Used only when dumping a set, protected by rcu_read_lock() */
static inline int
ip_set_put_comment(struct sk_buff *skb, const struct ip_set_comment *comment)
{
- struct ip_set_comment_rcu *c = rcu_dereference_bh(comment->c);
+ struct ip_set_comment_rcu *c = rcu_dereference(comment->c);
if (!c)
return 0;
--
2.11.0
* [PATCH 06/14] netfilter: ipset: actually allow allowable CIDR 0 in hash:net,port,net
From: Pablo Neira Ayuso @ 2018-11-05 23:28 UTC
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20181105232832.21896-1-pablo@netfilter.org>
From: Eric Westbrook <eric@westbrook.io>
Allow /0 as advertised for hash:net,port,net sets.
For "hash:net,port,net", ipset(8) says that "either subnet
is permitted to be a /0 should you wish to match port
between all destinations."
Make that statement true.
Before:
# ipset create cidrzero hash:net,port,net
# ipset add cidrzero 0.0.0.0/0,12345,0.0.0.0/0
ipset v6.34: The value of the CIDR parameter of the IP address is invalid
# ipset create cidrzero6 hash:net,port,net family inet6
# ipset add cidrzero6 ::/0,12345,::/0
ipset v6.34: The value of the CIDR parameter of the IP address is invalid
After:
# ipset create cidrzero hash:net,port,net
# ipset add cidrzero 0.0.0.0/0,12345,0.0.0.0/0
# ipset test cidrzero 192.168.205.129,12345,172.16.205.129
192.168.205.129,tcp:12345,172.16.205.129 is in set cidrzero.
# ipset create cidrzero6 hash:net,port,net family inet6
# ipset add cidrzero6 ::/0,12345,::/0
# ipset test cidrzero6 fe80::1,12345,ff00::1
fe80::1,tcp:12345,ff00::1 is in set cidrzero6.
See also:
https://bugzilla.kernel.org/show_bug.cgi?id=200897
https://github.com/ewestbrook/linux/commit/df7ff6efb0934ab6acc11f003ff1a7580d6c1d9c
Signed-off-by: Eric Westbrook <linux@westbrook.io>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/ipset/ip_set_hash_netportnet.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/netfilter/ipset/ip_set_hash_netportnet.c b/net/netfilter/ipset/ip_set_hash_netportnet.c
index d391485a6acd..613e18e720a4 100644
--- a/net/netfilter/ipset/ip_set_hash_netportnet.c
+++ b/net/netfilter/ipset/ip_set_hash_netportnet.c
@@ -213,13 +213,13 @@ hash_netportnet4_uadt(struct ip_set *set, struct nlattr *tb[],
if (tb[IPSET_ATTR_CIDR]) {
e.cidr[0] = nla_get_u8(tb[IPSET_ATTR_CIDR]);
- if (!e.cidr[0] || e.cidr[0] > HOST_MASK)
+ if (e.cidr[0] > HOST_MASK)
return -IPSET_ERR_INVALID_CIDR;
}
if (tb[IPSET_ATTR_CIDR2]) {
e.cidr[1] = nla_get_u8(tb[IPSET_ATTR_CIDR2]);
- if (!e.cidr[1] || e.cidr[1] > HOST_MASK)
+ if (e.cidr[1] > HOST_MASK)
return -IPSET_ERR_INVALID_CIDR;
}
@@ -493,13 +493,13 @@ hash_netportnet6_uadt(struct ip_set *set, struct nlattr *tb[],
if (tb[IPSET_ATTR_CIDR]) {
e.cidr[0] = nla_get_u8(tb[IPSET_ATTR_CIDR]);
- if (!e.cidr[0] || e.cidr[0] > HOST_MASK)
+ if (e.cidr[0] > HOST_MASK)
return -IPSET_ERR_INVALID_CIDR;
}
if (tb[IPSET_ATTR_CIDR2]) {
e.cidr[1] = nla_get_u8(tb[IPSET_ATTR_CIDR2]);
- if (!e.cidr[1] || e.cidr[1] > HOST_MASK)
+ if (e.cidr[1] > HOST_MASK)
return -IPSET_ERR_INVALID_CIDR;
}
--
2.11.0
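To see why /0 is a meaningful prefix rather than an error, consider the following userspace sketch of the IPv4 case. The sketch_* names are hypothetical, addresses are plain host-order integers, and the value 32 is used for the host mask; none of this is taken from the ipset sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_HOST_MASK 32	/* IPv4 prefix length limit (assumed) */

/* Check before the patch: a zero prefix length is rejected. */
static bool sketch_cidr_ok_old(uint8_t cidr)
{
	return cidr != 0 && cidr <= SKETCH_HOST_MASK;
}

/* Check after the patch: only an oversized prefix is rejected. */
static bool sketch_cidr_ok_new(uint8_t cidr)
{
	return cidr <= SKETCH_HOST_MASK;
}

/* Netmask in host byte order; /0 yields an all-zero mask. */
static uint32_t sketch_cidr_mask(uint8_t cidr)
{
	assert(cidr <= SKETCH_HOST_MASK);
	/* Shifting by 32 would be undefined, so handle /0 separately. */
	return cidr ? (uint32_t)(UINT32_MAX << (32 - cidr)) : 0;
}

/* True if addr lies inside net/cidr. */
static bool sketch_in_net(uint32_t addr, uint32_t net, uint8_t cidr)
{
	uint32_t mask = sketch_cidr_mask(cidr);

	return (addr & mask) == (net & mask);
}
```

Because the /0 mask is all zeroes, every address compares equal to 0.0.0.0/0, which matches the "After:" behaviour shown in the commit message for 192.168.205.129 (0xC0A8CD81 in this representation).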
* [PATCH 10/14] netfilter: ipset: Fix calling ip_set() macro at dumping
From: Pablo Neira Ayuso @ 2018-11-05 23:28 UTC
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20181105232832.21896-1-pablo@netfilter.org>
From: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
At dumping, the ip_set() macro is called either with only
ip_set_ref_lock held or with no lock/nfnl mutex held at all. Take this
into account properly. Also, follow Pablo's suggestion and use
rcu_dereference_raw(), since ref_netlink protects the set.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/ipset/ip_set_core.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 68db946df151..1577f2f76060 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -55,11 +55,15 @@ MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>");
MODULE_DESCRIPTION("core IP set support");
MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_IPSET);
-/* When the nfnl mutex is held: */
+/* When the nfnl mutex or ip_set_ref_lock is held: */
#define ip_set_dereference(p) \
- rcu_dereference_protected(p, lockdep_nfnl_is_held(NFNL_SUBSYS_IPSET))
+ rcu_dereference_protected(p, \
+ lockdep_nfnl_is_held(NFNL_SUBSYS_IPSET) || \
+ lockdep_is_held(&ip_set_ref_lock))
#define ip_set(inst, id) \
ip_set_dereference((inst)->ip_set_list)[id]
+#define ip_set_ref_netlink(inst,id) \
+ rcu_dereference_raw((inst)->ip_set_list)[id]
/* The set types are implemented in modules and registered set types
* can be found in ip_set_type_list. Adding/deleting types is
@@ -1251,7 +1255,7 @@ ip_set_dump_done(struct netlink_callback *cb)
struct ip_set_net *inst =
(struct ip_set_net *)cb->args[IPSET_CB_NET];
ip_set_id_t index = (ip_set_id_t)cb->args[IPSET_CB_INDEX];
- struct ip_set *set = ip_set(inst, index);
+ struct ip_set *set = ip_set_ref_netlink(inst, index);
if (set->variant->uref)
set->variant->uref(set, cb, false);
@@ -1440,7 +1444,7 @@ ip_set_dump_start(struct sk_buff *skb, struct netlink_callback *cb)
release_refcount:
/* If there was an error or set is done, release set */
if (ret || !cb->args[IPSET_CB_ARG0]) {
- set = ip_set(inst, index);
+ set = ip_set_ref_netlink(inst, index);
if (set->variant->uref)
set->variant->uref(set, cb, false);
pr_debug("release set %s\n", set->name);
--
2.11.0
* [PATCH 12/14] netfilter: nfnetlink_cttimeout: pass default timeout policy to obj_to_nlattr
From: Pablo Neira Ayuso @ 2018-11-05 23:28 UTC
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20181105232832.21896-1-pablo@netfilter.org>
Otherwise, we hit a NULL pointer dereference since handlers always assume
that the default timeout policy is passed.
netlink: 24 bytes leftover after parsing attributes in process `syz-executor2'.
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 9575 Comm: syz-executor1 Not tainted 4.19.0+ #312
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:icmp_timeout_obj_to_nlattr+0x77/0x170 net/netfilter/nf_conntrack_proto_icmp.c:297
Fixes: c779e849608a ("netfilter: conntrack: remove get_timeout() indirection")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nfnetlink_cttimeout.c | 47 +++++++++++++++++++++++++++++++------
1 file changed, 40 insertions(+), 7 deletions(-)
diff --git a/net/netfilter/nfnetlink_cttimeout.c b/net/netfilter/nfnetlink_cttimeout.c
index e7a50af1b3d6..a518eb162344 100644
--- a/net/netfilter/nfnetlink_cttimeout.c
+++ b/net/netfilter/nfnetlink_cttimeout.c
@@ -382,7 +382,8 @@ static int cttimeout_default_set(struct net *net, struct sock *ctnl,
static int
cttimeout_default_fill_info(struct net *net, struct sk_buff *skb, u32 portid,
u32 seq, u32 type, int event, u16 l3num,
- const struct nf_conntrack_l4proto *l4proto)
+ const struct nf_conntrack_l4proto *l4proto,
+ const unsigned int *timeouts)
{
struct nlmsghdr *nlh;
struct nfgenmsg *nfmsg;
@@ -408,7 +409,7 @@ cttimeout_default_fill_info(struct net *net, struct sk_buff *skb, u32 portid,
if (!nest_parms)
goto nla_put_failure;
- ret = l4proto->ctnl_timeout.obj_to_nlattr(skb, NULL);
+ ret = l4proto->ctnl_timeout.obj_to_nlattr(skb, timeouts);
if (ret < 0)
goto nla_put_failure;
@@ -430,6 +431,7 @@ static int cttimeout_default_get(struct net *net, struct sock *ctnl,
struct netlink_ext_ack *extack)
{
const struct nf_conntrack_l4proto *l4proto;
+ unsigned int *timeouts = NULL;
struct sk_buff *skb2;
int ret, err;
__u16 l3num;
@@ -442,12 +444,44 @@ static int cttimeout_default_get(struct net *net, struct sock *ctnl,
l4num = nla_get_u8(cda[CTA_TIMEOUT_L4PROTO]);
l4proto = nf_ct_l4proto_find_get(l4num);
- /* This protocol is not supported, skip. */
- if (l4proto->l4proto != l4num) {
- err = -EOPNOTSUPP;
+ err = -EOPNOTSUPP;
+ if (l4proto->l4proto != l4num)
goto err;
+
+ switch (l4proto->l4proto) {
+ case IPPROTO_ICMP:
+ timeouts = &nf_icmp_pernet(net)->timeout;
+ break;
+ case IPPROTO_TCP:
+ timeouts = nf_tcp_pernet(net)->timeouts;
+ break;
+ case IPPROTO_UDP:
+ timeouts = nf_udp_pernet(net)->timeouts;
+ break;
+ case IPPROTO_DCCP:
+#ifdef CONFIG_NF_CT_PROTO_DCCP
+ timeouts = nf_dccp_pernet(net)->dccp_timeout;
+#endif
+ break;
+ case IPPROTO_ICMPV6:
+ timeouts = &nf_icmpv6_pernet(net)->timeout;
+ break;
+ case IPPROTO_SCTP:
+#ifdef CONFIG_NF_CT_PROTO_SCTP
+ timeouts = nf_sctp_pernet(net)->timeouts;
+#endif
+ break;
+ case 255:
+ timeouts = &nf_generic_pernet(net)->timeout;
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ break;
}
+ if (!timeouts)
+ goto err;
+
skb2 = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
if (skb2 == NULL) {
err = -ENOMEM;
@@ -458,8 +492,7 @@ static int cttimeout_default_get(struct net *net, struct sock *ctnl,
nlh->nlmsg_seq,
NFNL_MSG_TYPE(nlh->nlmsg_type),
IPCTNL_MSG_TIMEOUT_DEFAULT_SET,
- l3num,
- l4proto);
+ l3num, l4proto, timeouts);
if (ret <= 0) {
kfree_skb(skb2);
err = -ENOMEM;
--
2.11.0
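The shape of this fix can be sketched in userspace as follows: the caller looks up the per-protocol default policy and bails out before calling a handler that dereferences its argument unconditionally. The struct layout, the sketch_* names and the sample protocol numbers are assumptions for illustration only and do not mirror the kernel data structures; EOPNOTSUPP assumes a POSIX <errno.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical protocol identifiers for this sketch. */
enum sketch_proto { SKETCH_ICMP = 1, SKETCH_TCP = 6, SKETCH_UDP = 17 };

/* Hypothetical per-namespace defaults. */
struct sketch_net {
	unsigned int icmp_timeout;
	unsigned int tcp_timeouts[2];
	unsigned int udp_timeouts[2];
};

/* Like the handlers in the patch, this assumes timeouts is non-NULL. */
static unsigned int sketch_first_timeout(const unsigned int *timeouts)
{
	return timeouts[0];
}

/* Pick the default policy for a protocol, or NULL if unsupported. */
static const unsigned int *
sketch_default_timeouts(const struct sketch_net *net, int proto)
{
	switch (proto) {
	case SKETCH_ICMP:
		return &net->icmp_timeout;
	case SKETCH_TCP:
		return net->tcp_timeouts;
	case SKETCH_UDP:
		return net->udp_timeouts;
	default:
		return NULL;
	}
}

/* Returns 0 and stores the first timeout, or -EOPNOTSUPP. */
static int sketch_default_get(const struct sketch_net *net, int proto,
			      unsigned int *out)
{
	const unsigned int *timeouts = sketch_default_timeouts(net, proto);

	assert(out != NULL);
	if (!timeouts)
		return -EOPNOTSUPP;	/* never hand NULL to the handler */
	*out = sketch_first_timeout(timeouts);
	return 0;
}
```

The design point is that the NULL check lives in one place, in the caller, so that individual handlers can keep assuming a valid policy pointer.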
* [PATCH 11/14] netfilter: conntrack: add nf_{tcp,udp,sctp,icmp,dccp,icmpv6,generic}_pernet()
From: Pablo Neira Ayuso @ 2018-11-05 23:28 UTC
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20181105232832.21896-1-pablo@netfilter.org>
Expose these functions to access the conntrack protocol tracker netns
area; nfnetlink_cttimeout needs this.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
include/net/netfilter/nf_conntrack_l4proto.h | 39 ++++++++++++++++++++++++++++
net/netfilter/nf_conntrack_proto_dccp.c | 13 +++-------
net/netfilter/nf_conntrack_proto_generic.c | 11 +++-----
net/netfilter/nf_conntrack_proto_icmp.c | 11 +++-----
net/netfilter/nf_conntrack_proto_icmpv6.c | 11 +++-----
net/netfilter/nf_conntrack_proto_sctp.c | 11 +++-----
net/netfilter/nf_conntrack_proto_tcp.c | 15 ++++-------
net/netfilter/nf_conntrack_proto_udp.c | 11 +++-----
8 files changed, 63 insertions(+), 59 deletions(-)
diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
index eed04af9b75e..ae7b86f587f2 100644
--- a/include/net/netfilter/nf_conntrack_l4proto.h
+++ b/include/net/netfilter/nf_conntrack_l4proto.h
@@ -153,4 +153,43 @@ void nf_ct_l4proto_log_invalid(const struct sk_buff *skb,
const char *fmt, ...) { }
#endif /* CONFIG_SYSCTL */
+static inline struct nf_generic_net *nf_generic_pernet(struct net *net)
+{
+ return &net->ct.nf_ct_proto.generic;
+}
+
+static inline struct nf_tcp_net *nf_tcp_pernet(struct net *net)
+{
+ return &net->ct.nf_ct_proto.tcp;
+}
+
+static inline struct nf_udp_net *nf_udp_pernet(struct net *net)
+{
+ return &net->ct.nf_ct_proto.udp;
+}
+
+static inline struct nf_icmp_net *nf_icmp_pernet(struct net *net)
+{
+ return &net->ct.nf_ct_proto.icmp;
+}
+
+static inline struct nf_icmp_net *nf_icmpv6_pernet(struct net *net)
+{
+ return &net->ct.nf_ct_proto.icmpv6;
+}
+
+#ifdef CONFIG_NF_CT_PROTO_DCCP
+static inline struct nf_dccp_net *nf_dccp_pernet(struct net *net)
+{
+ return &net->ct.nf_ct_proto.dccp;
+}
+#endif
+
+#ifdef CONFIG_NF_CT_PROTO_SCTP
+static inline struct nf_sctp_net *nf_sctp_pernet(struct net *net)
+{
+ return &net->ct.nf_ct_proto.sctp;
+}
+#endif
+
#endif /*_NF_CONNTRACK_PROTOCOL_H*/
diff --git a/net/netfilter/nf_conntrack_proto_dccp.c b/net/netfilter/nf_conntrack_proto_dccp.c
index 171e9e122e5f..023c1445bc39 100644
--- a/net/netfilter/nf_conntrack_proto_dccp.c
+++ b/net/netfilter/nf_conntrack_proto_dccp.c
@@ -384,11 +384,6 @@ dccp_state_table[CT_DCCP_ROLE_MAX + 1][DCCP_PKT_SYNCACK + 1][CT_DCCP_MAX + 1] =
},
};
-static inline struct nf_dccp_net *dccp_pernet(struct net *net)
-{
- return &net->ct.nf_ct_proto.dccp;
-}
-
static noinline bool
dccp_new(struct nf_conn *ct, const struct sk_buff *skb,
const struct dccp_hdr *dh)
@@ -401,7 +396,7 @@ dccp_new(struct nf_conn *ct, const struct sk_buff *skb,
state = dccp_state_table[CT_DCCP_ROLE_CLIENT][dh->dccph_type][CT_DCCP_NONE];
switch (state) {
default:
- dn = dccp_pernet(net);
+ dn = nf_dccp_pernet(net);
if (dn->dccp_loose == 0) {
msg = "not picking up existing connection ";
goto out_invalid;
@@ -568,7 +563,7 @@ static int dccp_packet(struct nf_conn *ct, struct sk_buff *skb,
timeouts = nf_ct_timeout_lookup(ct);
if (!timeouts)
- timeouts = dccp_pernet(nf_ct_net(ct))->dccp_timeout;
+ timeouts = nf_dccp_pernet(nf_ct_net(ct))->dccp_timeout;
nf_ct_refresh_acct(ct, ctinfo, skb, timeouts[new_state]);
return NF_ACCEPT;
@@ -681,7 +676,7 @@ static int nlattr_to_dccp(struct nlattr *cda[], struct nf_conn *ct)
static int dccp_timeout_nlattr_to_obj(struct nlattr *tb[],
struct net *net, void *data)
{
- struct nf_dccp_net *dn = dccp_pernet(net);
+ struct nf_dccp_net *dn = nf_dccp_pernet(net);
unsigned int *timeouts = data;
int i;
@@ -814,7 +809,7 @@ static int dccp_kmemdup_sysctl_table(struct net *net, struct nf_proto_net *pn,
static int dccp_init_net(struct net *net)
{
- struct nf_dccp_net *dn = dccp_pernet(net);
+ struct nf_dccp_net *dn = nf_dccp_pernet(net);
struct nf_proto_net *pn = &dn->pn;
if (!pn->users) {
diff --git a/net/netfilter/nf_conntrack_proto_generic.c b/net/netfilter/nf_conntrack_proto_generic.c
index e10e867e0b55..5da19d5fbc76 100644
--- a/net/netfilter/nf_conntrack_proto_generic.c
+++ b/net/netfilter/nf_conntrack_proto_generic.c
@@ -27,11 +27,6 @@ static bool nf_generic_should_process(u8 proto)
}
}
-static inline struct nf_generic_net *generic_pernet(struct net *net)
-{
- return &net->ct.nf_ct_proto.generic;
-}
-
static bool generic_pkt_to_tuple(const struct sk_buff *skb,
unsigned int dataoff,
struct net *net, struct nf_conntrack_tuple *tuple)
@@ -58,7 +53,7 @@ static int generic_packet(struct nf_conn *ct,
}
if (!timeout)
- timeout = &generic_pernet(nf_ct_net(ct))->timeout;
+ timeout = &nf_generic_pernet(nf_ct_net(ct))->timeout;
nf_ct_refresh_acct(ct, ctinfo, skb, *timeout);
return NF_ACCEPT;
@@ -72,7 +67,7 @@ static int generic_packet(struct nf_conn *ct,
static int generic_timeout_nlattr_to_obj(struct nlattr *tb[],
struct net *net, void *data)
{
- struct nf_generic_net *gn = generic_pernet(net);
+ struct nf_generic_net *gn = nf_generic_pernet(net);
unsigned int *timeout = data;
if (!timeout)
@@ -138,7 +133,7 @@ static int generic_kmemdup_sysctl_table(struct nf_proto_net *pn,
static int generic_init_net(struct net *net)
{
- struct nf_generic_net *gn = generic_pernet(net);
+ struct nf_generic_net *gn = nf_generic_pernet(net);
struct nf_proto_net *pn = &gn->pn;
gn->timeout = nf_ct_generic_timeout;
diff --git a/net/netfilter/nf_conntrack_proto_icmp.c b/net/netfilter/nf_conntrack_proto_icmp.c
index 3598520bd19b..de64d8a5fdfd 100644
--- a/net/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/netfilter/nf_conntrack_proto_icmp.c
@@ -25,11 +25,6 @@
static const unsigned int nf_ct_icmp_timeout = 30*HZ;
-static inline struct nf_icmp_net *icmp_pernet(struct net *net)
-{
- return &net->ct.nf_ct_proto.icmp;
-}
-
static bool icmp_pkt_to_tuple(const struct sk_buff *skb, unsigned int dataoff,
struct net *net, struct nf_conntrack_tuple *tuple)
{
@@ -103,7 +98,7 @@ static int icmp_packet(struct nf_conn *ct,
}
if (!timeout)
- timeout = &icmp_pernet(nf_ct_net(ct))->timeout;
+ timeout = &nf_icmp_pernet(nf_ct_net(ct))->timeout;
nf_ct_refresh_acct(ct, ctinfo, skb, *timeout);
return NF_ACCEPT;
@@ -275,7 +270,7 @@ static int icmp_timeout_nlattr_to_obj(struct nlattr *tb[],
struct net *net, void *data)
{
unsigned int *timeout = data;
- struct nf_icmp_net *in = icmp_pernet(net);
+ struct nf_icmp_net *in = nf_icmp_pernet(net);
if (tb[CTA_TIMEOUT_ICMP_TIMEOUT]) {
if (!timeout)
@@ -337,7 +332,7 @@ static int icmp_kmemdup_sysctl_table(struct nf_proto_net *pn,
static int icmp_init_net(struct net *net)
{
- struct nf_icmp_net *in = icmp_pernet(net);
+ struct nf_icmp_net *in = nf_icmp_pernet(net);
struct nf_proto_net *pn = &in->pn;
in->timeout = nf_ct_icmp_timeout;
diff --git a/net/netfilter/nf_conntrack_proto_icmpv6.c b/net/netfilter/nf_conntrack_proto_icmpv6.c
index 378618feed5d..a15eefb8e317 100644
--- a/net/netfilter/nf_conntrack_proto_icmpv6.c
+++ b/net/netfilter/nf_conntrack_proto_icmpv6.c
@@ -30,11 +30,6 @@
static const unsigned int nf_ct_icmpv6_timeout = 30*HZ;
-static inline struct nf_icmp_net *icmpv6_pernet(struct net *net)
-{
- return &net->ct.nf_ct_proto.icmpv6;
-}
-
static bool icmpv6_pkt_to_tuple(const struct sk_buff *skb,
unsigned int dataoff,
struct net *net,
@@ -87,7 +82,7 @@ static bool icmpv6_invert_tuple(struct nf_conntrack_tuple *tuple,
static unsigned int *icmpv6_get_timeouts(struct net *net)
{
- return &icmpv6_pernet(net)->timeout;
+ return &nf_icmpv6_pernet(net)->timeout;
}
/* Returns verdict for packet, or -1 for invalid. */
@@ -286,7 +281,7 @@ static int icmpv6_timeout_nlattr_to_obj(struct nlattr *tb[],
struct net *net, void *data)
{
unsigned int *timeout = data;
- struct nf_icmp_net *in = icmpv6_pernet(net);
+ struct nf_icmp_net *in = nf_icmpv6_pernet(net);
if (!timeout)
timeout = icmpv6_get_timeouts(net);
@@ -348,7 +343,7 @@ static int icmpv6_kmemdup_sysctl_table(struct nf_proto_net *pn,
static int icmpv6_init_net(struct net *net)
{
- struct nf_icmp_net *in = icmpv6_pernet(net);
+ struct nf_icmp_net *in = nf_icmpv6_pernet(net);
struct nf_proto_net *pn = &in->pn;
in->timeout = nf_ct_icmpv6_timeout;
diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c
index 3d719d3eb9a3..d53e3e78f605 100644
--- a/net/netfilter/nf_conntrack_proto_sctp.c
+++ b/net/netfilter/nf_conntrack_proto_sctp.c
@@ -146,11 +146,6 @@ static const u8 sctp_conntracks[2][11][SCTP_CONNTRACK_MAX] = {
}
};
-static inline struct nf_sctp_net *sctp_pernet(struct net *net)
-{
- return &net->ct.nf_ct_proto.sctp;
-}
-
#ifdef CONFIG_NF_CONNTRACK_PROCFS
/* Print out the private part of the conntrack. */
static void sctp_print_conntrack(struct seq_file *s, struct nf_conn *ct)
@@ -480,7 +475,7 @@ static int sctp_packet(struct nf_conn *ct,
timeouts = nf_ct_timeout_lookup(ct);
if (!timeouts)
- timeouts = sctp_pernet(nf_ct_net(ct))->timeouts;
+ timeouts = nf_sctp_pernet(nf_ct_net(ct))->timeouts;
nf_ct_refresh_acct(ct, ctinfo, skb, timeouts[new_state]);
@@ -599,7 +594,7 @@ static int sctp_timeout_nlattr_to_obj(struct nlattr *tb[],
struct net *net, void *data)
{
unsigned int *timeouts = data;
- struct nf_sctp_net *sn = sctp_pernet(net);
+ struct nf_sctp_net *sn = nf_sctp_pernet(net);
int i;
/* set default SCTP timeouts. */
@@ -736,7 +731,7 @@ static int sctp_kmemdup_sysctl_table(struct nf_proto_net *pn,
static int sctp_init_net(struct net *net)
{
- struct nf_sctp_net *sn = sctp_pernet(net);
+ struct nf_sctp_net *sn = nf_sctp_pernet(net);
struct nf_proto_net *pn = &sn->pn;
if (!pn->users) {
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index 1bcf9984d45e..4dcbd51a8e97 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -272,11 +272,6 @@ static const u8 tcp_conntracks[2][6][TCP_CONNTRACK_MAX] = {
}
};
-static inline struct nf_tcp_net *tcp_pernet(struct net *net)
-{
- return &net->ct.nf_ct_proto.tcp;
-}
-
#ifdef CONFIG_NF_CONNTRACK_PROCFS
/* Print out the private part of the conntrack. */
static void tcp_print_conntrack(struct seq_file *s, struct nf_conn *ct)
@@ -475,7 +470,7 @@ static bool tcp_in_window(const struct nf_conn *ct,
const struct tcphdr *tcph)
{
struct net *net = nf_ct_net(ct);
- struct nf_tcp_net *tn = tcp_pernet(net);
+ struct nf_tcp_net *tn = nf_tcp_pernet(net);
struct ip_ct_tcp_state *sender = &state->seen[dir];
struct ip_ct_tcp_state *receiver = &state->seen[!dir];
const struct nf_conntrack_tuple *tuple = &ct->tuplehash[dir].tuple;
@@ -767,7 +762,7 @@ static noinline bool tcp_new(struct nf_conn *ct, const struct sk_buff *skb,
{
enum tcp_conntrack new_state;
struct net *net = nf_ct_net(ct);
- const struct nf_tcp_net *tn = tcp_pernet(net);
+ const struct nf_tcp_net *tn = nf_tcp_pernet(net);
const struct ip_ct_tcp_state *sender = &ct->proto.tcp.seen[0];
const struct ip_ct_tcp_state *receiver = &ct->proto.tcp.seen[1];
@@ -841,7 +836,7 @@ static int tcp_packet(struct nf_conn *ct,
const struct nf_hook_state *state)
{
struct net *net = nf_ct_net(ct);
- struct nf_tcp_net *tn = tcp_pernet(net);
+ struct nf_tcp_net *tn = nf_tcp_pernet(net);
struct nf_conntrack_tuple *tuple;
enum tcp_conntrack new_state, old_state;
unsigned int index, *timeouts;
@@ -1283,7 +1278,7 @@ static unsigned int tcp_nlattr_tuple_size(void)
static int tcp_timeout_nlattr_to_obj(struct nlattr *tb[],
struct net *net, void *data)
{
- struct nf_tcp_net *tn = tcp_pernet(net);
+ struct nf_tcp_net *tn = nf_tcp_pernet(net);
unsigned int *timeouts = data;
int i;
@@ -1508,7 +1503,7 @@ static int tcp_kmemdup_sysctl_table(struct nf_proto_net *pn,
static int tcp_init_net(struct net *net)
{
- struct nf_tcp_net *tn = tcp_pernet(net);
+ struct nf_tcp_net *tn = nf_tcp_pernet(net);
struct nf_proto_net *pn = &tn->pn;
if (!pn->users) {
diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
index a7aa70370913..c879d8d78cfd 100644
--- a/net/netfilter/nf_conntrack_proto_udp.c
+++ b/net/netfilter/nf_conntrack_proto_udp.c
@@ -32,14 +32,9 @@ static const unsigned int udp_timeouts[UDP_CT_MAX] = {
[UDP_CT_REPLIED] = 180*HZ,
};
-static inline struct nf_udp_net *udp_pernet(struct net *net)
-{
- return &net->ct.nf_ct_proto.udp;
-}
-
static unsigned int *udp_get_timeouts(struct net *net)
{
- return udp_pernet(net)->timeouts;
+ return nf_udp_pernet(net)->timeouts;
}
static void udp_error_log(const struct sk_buff *skb,
@@ -212,7 +207,7 @@ static int udp_timeout_nlattr_to_obj(struct nlattr *tb[],
struct net *net, void *data)
{
unsigned int *timeouts = data;
- struct nf_udp_net *un = udp_pernet(net);
+ struct nf_udp_net *un = nf_udp_pernet(net);
if (!timeouts)
timeouts = un->timeouts;
@@ -292,7 +287,7 @@ static int udp_kmemdup_sysctl_table(struct nf_proto_net *pn,
static int udp_init_net(struct net *net)
{
- struct nf_udp_net *un = udp_pernet(net);
+ struct nf_udp_net *un = nf_udp_pernet(net);
struct nf_proto_net *pn = &un->pn;
if (!pn->users) {
--
2.11.0
* [PATCH 13/14] netfilter: nft_compat: ebtables 'nat' table is normal chain type
From: Pablo Neira Ayuso @ 2018-11-05 23:28 UTC
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20181105232832.21896-1-pablo@netfilter.org>
From: Florian Westphal <fw@strlen.de>
Unlike ip(6)tables, the ebtables nat table has no special properties.
This bug causes 'ebtables -A' to fail when using a target such as
'snat' (ebt_snat target sets '.table = "nat"'). Targets that have
no table restrictions work fine.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nft_compat.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/net/netfilter/nft_compat.c b/net/netfilter/nft_compat.c
index 768292eac2a4..9d0ede474224 100644
--- a/net/netfilter/nft_compat.c
+++ b/net/netfilter/nft_compat.c
@@ -54,9 +54,11 @@ static bool nft_xt_put(struct nft_xt *xt)
return false;
}
-static int nft_compat_chain_validate_dependency(const char *tablename,
- const struct nft_chain *chain)
+static int nft_compat_chain_validate_dependency(const struct nft_ctx *ctx,
+ const char *tablename)
{
+ enum nft_chain_types type = NFT_CHAIN_T_DEFAULT;
+ const struct nft_chain *chain = ctx->chain;
const struct nft_base_chain *basechain;
if (!tablename ||
@@ -64,9 +66,12 @@ static int nft_compat_chain_validate_dependency(const char *tablename,
return 0;
basechain = nft_base_chain(chain);
- if (strcmp(tablename, "nat") == 0 &&
- basechain->type->type != NFT_CHAIN_T_NAT)
- return -EINVAL;
+ if (strcmp(tablename, "nat") == 0) {
+ if (ctx->family != NFPROTO_BRIDGE)
+ type = NFT_CHAIN_T_NAT;
+ if (basechain->type->type != type)
+ return -EINVAL;
+ }
return 0;
}
@@ -342,8 +347,7 @@ static int nft_target_validate(const struct nft_ctx *ctx,
if (target->hooks && !(hook_mask & target->hooks))
return -EINVAL;
- ret = nft_compat_chain_validate_dependency(target->table,
- ctx->chain);
+ ret = nft_compat_chain_validate_dependency(ctx, target->table);
if (ret < 0)
return ret;
}
@@ -590,8 +594,7 @@ static int nft_match_validate(const struct nft_ctx *ctx,
if (match->hooks && !(hook_mask & match->hooks))
return -EINVAL;
- ret = nft_compat_chain_validate_dependency(match->table,
- ctx->chain);
+ ret = nft_compat_chain_validate_dependency(ctx, match->table);
if (ret < 0)
return ret;
}
--
2.11.0
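The rule this patch implements can be summarised in a small userspace model: a "nat" table requires a NAT-type base chain, except in the bridge family, where a default-type chain is expected. The enums and the sketch_validate() helper below are hypothetical names for illustration, and the model leaves out the non-base-chain shortcut present in the real function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the family and chain type constants. */
enum sketch_family { SKETCH_IPV4, SKETCH_IPV6, SKETCH_BRIDGE };
enum sketch_chain_type { SKETCH_CHAIN_DEFAULT, SKETCH_CHAIN_NAT };

/* Returns 0 if the extension's table restriction fits the chain. */
static int sketch_validate(enum sketch_family family, const char *tablename,
			   enum sketch_chain_type chain_type)
{
	enum sketch_chain_type want = SKETCH_CHAIN_DEFAULT;

	if (!tablename)
		return 0;	/* no table restriction at all */

	if (strcmp(tablename, "nat") == 0) {
		/* Only ip(6)tables treats "nat" as a special chain type. */
		if (family != SKETCH_BRIDGE)
			want = SKETCH_CHAIN_NAT;
		if (chain_type != want)
			return -EINVAL;
	}
	return 0;
}
```

In this model a bridge-family "nat" extension on a default-type chain is accepted, which corresponds to the 'ebtables -A ... -j snat' case that failed before the patch.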
* [PATCH 14/14] netfilter: conntrack: fix calculation of next bucket number in early_drop
From: Pablo Neira Ayuso @ 2018-11-05 23:28 UTC
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20181105232832.21896-1-pablo@netfilter.org>
From: Vasily Khoruzhick <vasilykh@arista.com>
If there's no entry to drop in the bucket that corresponds to the hash,
early_drop() should look for one in other buckets. But since it increments
the hash instead of the bucket number, it actually looks in the same bucket
8 times: hsize is 16k by default (14 bits) and hash is a 32-bit value, so
reciprocal_scale(hash, hsize) returns the same value for hash..hash+7 in
most cases.
Fix it by incrementing the bucket number instead of the hash and rename
_hash to bucket to avoid future confusion.
Fixes: 3e86638e9a0b ("netfilter: conntrack: consider ct netns in early_drop logic")
Cc: <stable@vger.kernel.org> # v4.7+
Signed-off-by: Vasily Khoruzhick <vasilykh@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nf_conntrack_core.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index ca1168d67fac..e92e749aff53 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1073,19 +1073,22 @@ static unsigned int early_drop_list(struct net *net,
return drops;
}
-static noinline int early_drop(struct net *net, unsigned int _hash)
+static noinline int early_drop(struct net *net, unsigned int hash)
{
- unsigned int i;
+ unsigned int i, bucket;
for (i = 0; i < NF_CT_EVICTION_RANGE; i++) {
struct hlist_nulls_head *ct_hash;
- unsigned int hash, hsize, drops;
+ unsigned int hsize, drops;
rcu_read_lock();
nf_conntrack_get_ht(&ct_hash, &hsize);
- hash = reciprocal_scale(_hash++, hsize);
+ if (!i)
+ bucket = reciprocal_scale(hash, hsize);
+ else
+ bucket = (bucket + 1) % hsize;
- drops = early_drop_list(net, &ct_hash[hash]);
+ drops = early_drop_list(net, &ct_hash[bucket]);
rcu_read_unlock();
if (drops) {
--
2.11.0
^ permalink raw reply related
* Re: [PATCH v2 2/2] mm/page_alloc: use a single function to free page
From: Vlastimil Babka @ 2018-11-06 9:32 UTC (permalink / raw)
To: Aaron Lu
Cc: linux-mm, linux-kernel, netdev, Andrew Morton,
Paweł Staszewski, Jesper Dangaard Brouer, Eric Dumazet,
Tariq Toukan, Ilias Apalodimas, Yoel Caspersen, Mel Gorman,
Saeed Mahameed, Michal Hocko, Dave Hansen, Alexander Duyck
In-Reply-To: <20181106084746.GA24198@intel.com>
On 11/6/18 9:47 AM, Aaron Lu wrote:
> On Tue, Nov 06, 2018 at 09:16:20AM +0100, Vlastimil Babka wrote:
>> On 11/6/18 6:30 AM, Aaron Lu wrote:
>>> We have multiple places of freeing a page, most of them doing similar
>>> things and a common function can be used to reduce code duplicate.
>>>
>>> It also avoids bug fixed in one function but left in another.
>>>
>>> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
>>
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>
> Thanks.
>
>> I assume there's no arch that would run page_ref_sub_and_test(1) slower
>> than put_page_testzero(), for the critical __free_pages() case?
>
> Good question.
>
> I followed the non-arch specific calls and found that:
> page_ref_sub_and_test() ends up calling atomic_sub_return(i, v) while
> put_page_testzero() ends up calling atomic_sub_return(1, v). So they
> should be same for archs that do not have their own implementations.
x86 seems to distinguish between DECL and SUBL, see
arch/x86/include/asm/atomic.h, although I could not figure out where
e.g. arch_atomic_dec_and_test becomes atomic_dec_and_test to override the
generic implementation.
I don't know if the CPU e.g. executes DECL faster, but objectively it
has one parameter less. Maybe it doesn't matter?
> Back to your question: I don't know either.
> If this is deemed unsafe, we can probably keep the ref modify part in
> their original functions and only take the free part into a common
> function.
I guess you could also employ if (__builtin_constant_p(nr)) in
free_the_page(), but the result will be ugly I guess, and maybe not
worth it :)
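
For illustration only, the __builtin_constant_p() idea might look roughly
like the userspace sketch below. Everything in it is an assumption:
struct fake_page, ref_dec_and_test(), ref_sub_and_test() and ref_drop() are
hypothetical stand-ins built on C11 atomics, not the kernel's page_ref
helpers, and __builtin_constant_p() is a GCC/Clang extension.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reference counter in struct page. */
struct fake_page {
	atomic_int refcount;
};

/* Models put_page_testzero(): drop one reference, report if zero was reached. */
static inline bool ref_dec_and_test(struct fake_page *p)
{
	return atomic_fetch_sub(&p->refcount, 1) == 1;
}

/* Models page_ref_sub_and_test(): drop nr references, report if zero was reached. */
static inline bool ref_sub_and_test(struct fake_page *p, int nr)
{
	return atomic_fetch_sub(&p->refcount, nr) == nr;
}

/*
 * Use the decrement form only when the compiler can prove nr == 1 at
 * compile time; otherwise fall back to the generic subtract form.
 * Both paths give the same result, only the chosen primitive differs.
 */
static inline bool ref_drop(struct fake_page *p, int nr)
{
	if (__builtin_constant_p(nr) && nr == 1)
		return ref_dec_and_test(p);
	return ref_sub_and_test(p, nr);
}
```

Whether the compiler sees nr as constant depends on inlining and the
optimization level, which is part of why the result is hard to keep clean.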
> Regards,
> Aaron
>
>>> ---
>>> v2: move comments close to code as suggested by Dave.
>>>
>>> mm/page_alloc.c | 36 ++++++++++++++++--------------------
>>> 1 file changed, 16 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 91a9a6af41a2..4faf6b7bf225 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -4425,9 +4425,17 @@ unsigned long get_zeroed_page(gfp_t gfp_mask)
>>> }
>>> EXPORT_SYMBOL(get_zeroed_page);
>>>
>>> -void __free_pages(struct page *page, unsigned int order)
>>> +static inline void free_the_page(struct page *page, unsigned int order, int nr)
>>> {
>>> - if (put_page_testzero(page)) {
>>> + VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
>>> +
>>> + /*
>>> + * Free a page by reducing its ref count by @nr.
>>> + * If its refcount reaches 0, then according to its order:
>>> + * order0: send to PCP;
>>> + * high order: directly send to Buddy.
>>> + */
>>> + if (page_ref_sub_and_test(page, nr)) {
>>> if (order == 0)
>>> free_unref_page(page);
>>> else
>>> @@ -4435,6 +4443,10 @@ void __free_pages(struct page *page, unsigned int order)
>>> }
>>> }
>>>
>>> +void __free_pages(struct page *page, unsigned int order)
>>> +{
>>> + free_the_page(page, order, 1);
>>> +}
>>> EXPORT_SYMBOL(__free_pages);
>>>
>>> void free_pages(unsigned long addr, unsigned int order)
>>> @@ -4481,16 +4493,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>>>
>>> void __page_frag_cache_drain(struct page *page, unsigned int count)
>>> {
>>> - VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
>>> -
>>> - if (page_ref_sub_and_test(page, count)) {
>>> - unsigned int order = compound_order(page);
>>> -
>>> - if (order == 0)
>>> - free_unref_page(page);
>>> - else
>>> - __free_pages_ok(page, order);
>>> - }
>>> + free_the_page(page, compound_order(page), count);
>>> }
>>> EXPORT_SYMBOL(__page_frag_cache_drain);
>>>
>>> @@ -4555,14 +4558,7 @@ void page_frag_free(void *addr)
>>> {
>>> struct page *page = virt_to_head_page(addr);
>>>
>>> - if (unlikely(put_page_testzero(page))) {
>>> - unsigned int order = compound_order(page);
>>> -
>>> - if (order == 0)
>>> - free_unref_page(page);
>>> - else
>>> - __free_pages_ok(page, order);
>>> - }
>>> + free_the_page(page, compound_order(page), 1);
>>> }
>>> EXPORT_SYMBOL(page_frag_free);
>>>
>>>
>>
^ permalink raw reply
* Re: [PATCH net v2 1/2] rtnetlink: restore handling of dumpit return value in rtnl_dump_all()
From: David Miller @ 2018-11-06 1:06 UTC (permalink / raw)
To: alexey.kodanev; +Cc: netdev, dsahern
In-Reply-To: <1541175065-25931-1-git-send-email-alexey.kodanev@oracle.com>
From: Alexey Kodanev <alexey.kodanev@oracle.com>
Date: Fri, 2 Nov 2018 19:11:04 +0300
> For non-zero return from dumpit() we should break the loop
> in rtnl_dump_all() and return the result. Otherwise, e.g.,
> we could get the memory leak in inet6_dump_fib() [1]. The
> pointer to the allocated struct fib6_walker there (saved
> in cb->args) can be lost, reset on the next iteration.
>
> Fix it by partially restoring the previous behavior before
> commit c63586dc9b3e ("net: rtnl_dump_all needs to propagate
> error from dumpit function"). The returned error from
> dumpit() is still passed further.
...
> Fixes: c63586dc9b3e ("net: rtnl_dump_all needs to propagate error from dumpit function")
> Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Applied.
^ permalink raw reply
* Re: [PATCH net v2 2/2] ipv6: properly check return value in inet6_dump_all()
From: David Miller @ 2018-11-06 1:06 UTC (permalink / raw)
To: alexey.kodanev; +Cc: netdev, dsahern
In-Reply-To: <1541175065-25931-2-git-send-email-alexey.kodanev@oracle.com>
From: Alexey Kodanev <alexey.kodanev@oracle.com>
Date: Fri, 2 Nov 2018 19:11:05 +0300
> Make sure we call fib6_dump_end() if it happens that skb->len
> is zero. rtnl_dump_all() can reset cb->args on the next loop
> iteration there.
>
> Fixes: 08e814c9e8eb ("net/ipv6: Bail early if user only wants cloned entries")
> Fixes: ae677bbb4441 ("net: Don't return invalid table id error when dumping all families")
> Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Applied.
^ permalink raw reply
* Re: [PATCH net v2 1/2] rtnetlink: restore handling of dumpit return value in rtnl_dump_all()
From: David Ahern @ 2018-11-06 1:08 UTC (permalink / raw)
To: David Miller, alexey.kodanev; +Cc: netdev
In-Reply-To: <20181105.170619.795988090444501257.davem@davemloft.net>
On 11/5/18 6:06 PM, David Miller wrote:
> From: Alexey Kodanev <alexey.kodanev@oracle.com>
> Date: Fri, 2 Nov 2018 19:11:04 +0300
>
>> For non-zero return from dumpit() we should break the loop
>> in rtnl_dump_all() and return the result. Otherwise, e.g.,
>> we could get the memory leak in inet6_dump_fib() [1]. The
>> pointer to the allocated struct fib6_walker there (saved
>> in cb->args) can be lost, reset on the next iteration.
>>
>> Fix it by partially restoring the previous behavior before
>> commit c63586dc9b3e ("net: rtnl_dump_all needs to propagate
>> error from dumpit function"). The returned error from
>> dumpit() is still passed further.
> ...
>> Fixes: c63586dc9b3e ("net: rtnl_dump_all needs to propagate error from dumpit function")
>> Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
>
> Applied.
>
Lost track of these in the noise of the past few days.
Thanks for the fixes Alexey.
^ permalink raw reply
* Re: [PATCH] sock_diag: fix autoloading of the raw_diag module
From: David Miller @ 2018-11-06 1:10 UTC (permalink / raw)
To: avagin; +Cc: netdev, gorcunov, lucien.xin
In-Reply-To: <20181105063715.21639-1-avagin@gmail.com>
From: Andrei Vagin <avagin@gmail.com>
Date: Sun, 4 Nov 2018 22:37:15 -0800
> IPPROTO_RAW isn't registered as an inet protocol, so
> inet_protos[protocol] is always NULL for it.
>
> Cc: Cyrill Gorcunov <gorcunov@gmail.com>
> Cc: Xin Long <lucien.xin@gmail.com>
> Fixes: bf2ae2e4bf93 ("sock_diag: request _diag module only when the family or proto has been registered")
> Signed-off-by: Andrei Vagin <avagin@gmail.com>
Applied and queued up for -stable.
^ permalink raw reply
* Re: [PATCH net] net: bpfilter: fix iptables failure if bpfilter_umh is disabled
From: David Miller @ 2018-11-06 1:13 UTC (permalink / raw)
To: ap420073; +Cc: netdev, daniel, ast, pablo, fw
In-Reply-To: <20181105133141.31621-1-ap420073@gmail.com>
From: Taehee Yoo <ap420073@gmail.com>
Date: Mon, 5 Nov 2018 22:31:41 +0900
> When iptables command is executed, ip_{set/get}sockopt() try to upload
> bpfilter.ko if bpfilter is enabled. if it couldn't find bpfilter.ko,
> command is failed.
> bpfilter.ko is generated if CONFIG_BPFILTER_UMH is enabled.
> ip_{set/get}sockopt() only checks CONFIG_BPFILTER.
> So that if CONFIG_BPFILTER is enabled and CONFIG_BPFILTER_UMH is disabled,
> iptables command is always failed.
>
> test config:
> CONFIG_BPFILTER=y
> # CONFIG_BPFILTER_UMH is not set
>
> test command:
> %iptables -L
> iptables: No chain/target/match by that name.
>
> Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Applied, thanks Taehee-ssi.
^ permalink raw reply
* Re: [PATCH 00/14] Netfilter fixes for net
From: David Miller @ 2018-11-06 1:19 UTC (permalink / raw)
To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <20181105232832.21896-1-pablo@netfilter.org>
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Tue, 6 Nov 2018 00:28:18 +0100
> The following patchset contains the first batch of Netfilter fixes for
> your net tree:
...
> You can pull these changes from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git
Pulled, thank you.
^ permalink raw reply
* Re: [PATCH v2 2/2] mm/page_alloc: use a single function to free page
From: Aaron Lu @ 2018-11-06 11:20 UTC (permalink / raw)
To: Vlastimil Babka
Cc: linux-mm, linux-kernel, netdev, Andrew Morton,
Paweł Staszewski, Jesper Dangaard Brouer, Eric Dumazet,
Tariq Toukan, Ilias Apalodimas, Yoel Caspersen, Mel Gorman,
Saeed Mahameed, Michal Hocko, Dave Hansen, Alexander Duyck
In-Reply-To: <30aa9d1f-d619-c143-3de6-6876029538bc@suse.cz>
On Tue, Nov 06, 2018 at 10:32:00AM +0100, Vlastimil Babka wrote:
> On 11/6/18 9:47 AM, Aaron Lu wrote:
> > On Tue, Nov 06, 2018 at 09:16:20AM +0100, Vlastimil Babka wrote:
> >> On 11/6/18 6:30 AM, Aaron Lu wrote:
> >>> We have multiple places of freeing a page, most of them doing similar
> >>> things and a common function can be used to reduce code duplicate.
> >>>
> >>> It also avoids bug fixed in one function but left in another.
> >>>
> >>> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> >>
> >> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >
> > Thanks.
> >
> >> I assume there's no arch that would run page_ref_sub_and_test(1) slower
> >> than put_page_testzero(), for the critical __free_pages() case?
> >
> > Good question.
> >
> > I followed the non-arch specific calls and found that:
> > page_ref_sub_and_test() ends up calling atomic_sub_return(i, v) while
> > put_page_testzero() ends up calling atomic_sub_return(1, v). So they
> > should be same for archs that do not have their own implementations.
>
> x86 seems to distinguish between DECL and SUBL, see
Ah right.
> arch/x86/include/asm/atomic.h although I could not figure out where does
> e.g. arch_atomic_dec_and_test become atomic_dec_and_test to override the
> generic implementation.
I didn't check that either but I think it will :-)
> I don't know if the CPU e.g. executes DECL faster, but objectively it
> has one parameter less. Maybe it doesn't matter?
No immediate idea.
> > Back to your question: I don't know either.
> > If this is deemed unsafe, we can probably keep the ref modify part in
> > their original functions and only take the free part into a common
> > function.
>
> I guess you could also employ if (__builtin_constant_p(nr)) in
> free_the_page(), but the result will be ugly I guess, and maybe not
> worth it :)
Right, I can't make it clean.
I think I'll just move the free part to a common function and leave the ref
decreasing part as is to be safe.
Regards,
Aaron
> >>> ---
> >>> v2: move comments close to code as suggested by Dave.
> >>>
> >>> mm/page_alloc.c | 36 ++++++++++++++++--------------------
> >>> 1 file changed, 16 insertions(+), 20 deletions(-)
> >>>
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index 91a9a6af41a2..4faf6b7bf225 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -4425,9 +4425,17 @@ unsigned long get_zeroed_page(gfp_t gfp_mask)
> >>> }
> >>> EXPORT_SYMBOL(get_zeroed_page);
> >>>
> >>> -void __free_pages(struct page *page, unsigned int order)
> >>> +static inline void free_the_page(struct page *page, unsigned int order, int nr)
> >>> {
> >>> - if (put_page_testzero(page)) {
> >>> + VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> >>> +
> >>> + /*
> >>> + * Free a page by reducing its ref count by @nr.
> >>> + * If its refcount reaches 0, then according to its order:
> >>> + * order0: send to PCP;
> >>> + * high order: directly send to Buddy.
> >>> + */
> >>> + if (page_ref_sub_and_test(page, nr)) {
> >>> if (order == 0)
> >>> free_unref_page(page);
> >>> else
> >>> @@ -4435,6 +4443,10 @@ void __free_pages(struct page *page, unsigned int order)
> >>> }
> >>> }
> >>>
> >>> +void __free_pages(struct page *page, unsigned int order)
> >>> +{
> >>> + free_the_page(page, order, 1);
> >>> +}
> >>> EXPORT_SYMBOL(__free_pages);
> >>>
> >>> void free_pages(unsigned long addr, unsigned int order)
> >>> @@ -4481,16 +4493,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> >>>
> >>> void __page_frag_cache_drain(struct page *page, unsigned int count)
> >>> {
> >>> - VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> >>> -
> >>> - if (page_ref_sub_and_test(page, count)) {
> >>> - unsigned int order = compound_order(page);
> >>> -
> >>> - if (order == 0)
> >>> - free_unref_page(page);
> >>> - else
> >>> - __free_pages_ok(page, order);
> >>> - }
> >>> + free_the_page(page, compound_order(page), count);
> >>> }
> >>> EXPORT_SYMBOL(__page_frag_cache_drain);
> >>>
> >>> @@ -4555,14 +4558,7 @@ void page_frag_free(void *addr)
> >>> {
> >>> struct page *page = virt_to_head_page(addr);
> >>>
> >>> - if (unlikely(put_page_testzero(page))) {
> >>> - unsigned int order = compound_order(page);
> >>> -
> >>> - if (order == 0)
> >>> - free_unref_page(page);
> >>> - else
> >>> - __free_pages_ok(page, order);
> >>> - }
> >>> + free_the_page(page, compound_order(page), 1);
> >>> }
> >>> EXPORT_SYMBOL(page_frag_free);
> >>>
> >>>
> >>
>
^ permalink raw reply
* [PATCH v3 2/2] mm/page_alloc: use a single function to free page
From: Aaron Lu @ 2018-11-06 11:31 UTC (permalink / raw)
To: linux-mm, linux-kernel, netdev
Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
Mel Gorman, Saeed Mahameed, Michal Hocko, Vlastimil Babka,
Dave Hansen, Alexander Duyck
In-Reply-To: <20181106053037.GD6203@intel.com>
We free pages in multiple places, most of them doing similar things,
so a common function can be used to reduce code duplication.
It also avoids a bug being fixed in one function but left in another.
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
v3: Vlastimil mentioned a possible performance loss from using
page_ref_sub_and_test(page, 1) in place of put_page_testzero(page).
Since we aren't sure, be safe: keep the page ref decreasing code as
is and only move the page freeing part to a common function.
mm/page_alloc.c | 37 ++++++++++++++-----------------------
1 file changed, 14 insertions(+), 23 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91a9a6af41a2..431a03aa96f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4425,16 +4425,19 @@ unsigned long get_zeroed_page(gfp_t gfp_mask)
}
EXPORT_SYMBOL(get_zeroed_page);
-void __free_pages(struct page *page, unsigned int order)
+static inline void free_the_page(struct page *page, unsigned int order)
{
- if (put_page_testzero(page)) {
- if (order == 0)
- free_unref_page(page);
- else
- __free_pages_ok(page, order);
- }
+ if (order == 0)
+ free_unref_page(page);
+ else
+ __free_pages_ok(page, order);
}
+void __free_pages(struct page *page, unsigned int order)
+{
+ if (put_page_testzero(page))
+ free_the_page(page, order);
+}
EXPORT_SYMBOL(__free_pages);
void free_pages(unsigned long addr, unsigned int order)
@@ -4483,14 +4486,8 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
{
VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
- if (page_ref_sub_and_test(page, count)) {
- unsigned int order = compound_order(page);
-
- if (order == 0)
- free_unref_page(page);
- else
- __free_pages_ok(page, order);
- }
+ if (page_ref_sub_and_test(page, count))
+ free_the_page(page, compound_order(page));
}
EXPORT_SYMBOL(__page_frag_cache_drain);
@@ -4555,14 +4552,8 @@ void page_frag_free(void *addr)
{
struct page *page = virt_to_head_page(addr);
- if (unlikely(put_page_testzero(page))) {
- unsigned int order = compound_order(page);
-
- if (order == 0)
- free_unref_page(page);
- else
- __free_pages_ok(page, order);
- }
+ if (unlikely(put_page_testzero(page)))
+ free_the_page(page, compound_order(page));
}
EXPORT_SYMBOL(page_frag_free);
--
2.17.2
^ permalink raw reply related
* Re: [PATCH 0/5] VSOCK: support mergeable rx buffer in vhost-vsock
From: jiangyiwen @ 2018-11-06 2:17 UTC (permalink / raw)
To: Jason Wang, stefanha; +Cc: netdev, kvm, virtualization
In-Reply-To: <b9d535f8-ddc3-a4bc-21c9-ca21e808f0d1@redhat.com>
On 2018/11/5 17:21, Jason Wang wrote:
>
> On 2018/11/5 3:43 PM, jiangyiwen wrote:
>> Now vsock only support send/receive small packet, it can't achieve
>> high performance. As previous discussed with Jason Wang, I revisit the
>> idea of vhost-net about mergeable rx buffer and implement the mergeable
>> rx buffer in vhost-vsock, it can allow big packet to be scattered in
>> into different buffers and improve performance obviously.
>>
>> I write a tool to test the vhost-vsock performance, mainly send big
>> packet(64K) included guest->Host and Host->Guest. The result as
>> follows:
>>
>> Before performance:
>> Single socket Multiple sockets(Max Bandwidth)
>> Guest->Host ~400MB/s ~480MB/s
>> Host->Guest ~1450MB/s ~1600MB/s
>>
>> After performance:
>> Single socket Multiple sockets(Max Bandwidth)
>> Guest->Host ~1700MB/s ~2900MB/s
>> Host->Guest ~1700MB/s ~2900MB/s
>>
>> From the test results, the performance is improved obviously, and guest
>> memory will not be wasted.
>
>
> Hi:
>
> Thanks for the patches and the numbers are really impressive.
>
> But instead of duplicating codes between sock and net. I was considering to use virtio-net as a transport of vsock. Then we may have all existed features likes batching, mergeable rx buffers and multiqueue. Want to consider this idea? Thoughts?
>
>
Hi Jason,
I am not very familiar with virtio-net, so I am afraid I can't give
much useful advice. I do have several questions:
1. If we use virtio-net as a transport, the guest should see a virtio-net
device instead of a virtio-vsock device, right? Is vsock then only a
transport between the socket and the net_device? Users should still use
the AF_VSOCK type to create a socket, right?
2. I want to know whether work on this idea has already started, and what
the current progress is.
3. And what is Stefan's opinion?
Thanks,
Yiwen.
>>
>> ---
>>
>> Yiwen Jiang (5):
>> VSOCK: support fill mergeable rx buffer in guest
>> VSOCK: support fill data to mergeable rx buffer in host
>> VSOCK: support receive mergeable rx buffer in guest
>> VSOCK: modify default rx buf size to improve performance
>> VSOCK: batch sending rx buffer to increase bandwidth
>>
>> drivers/vhost/vsock.c | 135 +++++++++++++++++++++++------
>> include/linux/virtio_vsock.h | 15 +++-
>> include/uapi/linux/virtio_vsock.h | 5 ++
>> net/vmw_vsock/virtio_transport.c | 147 ++++++++++++++++++++++++++------
>> net/vmw_vsock/virtio_transport_common.c | 59 +++++++++++--
>> 5 files changed, 300 insertions(+), 61 deletions(-)
>>
>
> .
>
^ permalink raw reply
* Re: [PATCH] net: phy: realtek: fix RTL8201F sysfs name
From: Florian Fainelli @ 2018-11-06 2:24 UTC (permalink / raw)
To: Andrew Lunn, Holger Hoffstätte; +Cc: Netdev, David S. Miller
In-Reply-To: <20181104184346.GA27023@lunn.ch>
On 11/4/2018 10:43 AM, Andrew Lunn wrote:
> On Sun, Nov 04, 2018 at 07:02:42PM +0100, Holger Hoffstätte wrote:
>> Since 4.19 the following error in sysfs has appeared when using the
>> r8169 NIC driver:
>>
>> $cd /sys/module/realtek/drivers
>> $ls -l
>> ls: cannot access 'mdio_bus:RTL8201F 10/100Mbps Ethernet': No such file or directory
>> [..garbled dir entries follow..]
>>
>> Apparently the forward slash in "10/100Mbps Ethernet" is interpreted
>> as directory separator that leads nowhere, and was introduced in commit
>> 513588dd44b ("net: phy: realtek: add RTL8201F phy-id and functions").
>>
>> Fix this by removing the offending slash in the driver name.
>>
>> Other drivers in net/phy seem to have the same problem, but I cannot
>> test/verify them.
>>
>> Signed-off-by: Holger Hoffstätte <holger@applied-asynchrony.com>
>
> Fixes:513588dd44b ("net: phy: realtek: add RTL8201F phy-id and functions").
>
> Reviewed-by: Andrew Lunn <andrew@lunn.ch>
>
> David, please apply to net.
We should probably seek a more generic solution within sysfs to deny
specific problematic characters from being used, such as ., .., / etc.
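
As a rough illustration of such a check (a hypothetical helper, not an
existing sysfs or driver-core function), a name could be rejected when it
is empty, is "." or "..", or contains a slash:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical validator: returns true when name could be used as a
 * single directory entry, i.e. it is non-empty, is not "." or "..",
 * and contains no '/' path separator.
 */
static bool name_is_sysfs_safe(const char *name)
{
	if (!name || name[0] == '\0')
		return false;
	if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
		return false;
	return strchr(name, '/') == NULL;
}
```

Under this sketch "RTL8201F 10/100Mbps Ethernet" would be refused at
registration time instead of producing a broken directory entry.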
--
Florian
^ permalink raw reply
* [bindings][PATCH] bindings/net: DPAA Backplane Device Bindings
From: Florinel Iordache @ 2018-11-06 11:48 UTC (permalink / raw)
To: robh+dt@kernel.org, mark.rutland@arm.com, broonie@kernel.org,
horms+renesas@verge.net.au, geert+renesas@glider.be,
linus.walleij@linaro.org
Cc: devicetree@vger.kernel.org, davem@davemloft.net,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
Florinel Iordache
Device Tree Bindings for DPAA backplane available on Layerscape
communications processors.
Signed-off-by: Florinel Iordache <florinel.iordache@nxp.com>
---
.../devicetree/bindings/net/dpaa-backplane.txt | 105 +++++++++++++++++++++
1 file changed, 105 insertions(+)
create mode 100644 Documentation/devicetree/bindings/net/dpaa-backplane.txt
diff --git a/Documentation/devicetree/bindings/net/dpaa-backplane.txt b/Documentation/devicetree/bindings/net/dpaa-backplane.txt
new file mode 100644
index 0000000..f147c84
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/dpaa-backplane.txt
@@ -0,0 +1,105 @@
+=============================================================================
+DPAA Backplane Device Bindings
+
+CONTENTS
+ - SerDes Node
+ - PCS Phy Node
+
+=============================================================================
+SerDes Node
+
+DESCRIPTION
+
+SerDes (Serializer/Deserializer) HW peripheral
+
+PROPERTIES
+
+- compatible
+ Usage: required
+ Value type: <stringlist>
+ Definition: Specifies the type of SerDes.
+ Must include the prefix "fsl,serdes"
+ SerDes can be of different types:
+ - 10G SerDes must be specified as: "fsl,serdes-10g"
+ - 28G SerDes must be specified as: "fsl,serdes-28g"
+
+- reg
+ Usage: required
+ Value type: <prop-encoded-array>
+ Definition: Specifies the offset of the SerDes configuration registers
+
+- little-endian
+ Usage: optional
+ Value type: <Boolean>
+ Definition: Specifies endianness access to SerDes registers.
+ If omitted, big-endian will be used
+ See common-properties.txt for complete definition
+
+EXAMPLE
+
+Example of 10G SerDes node:
+
+serdes1: serdes@1ea0000 {
+ compatible = "fsl,serdes-10g";
+ reg = <0x0 0x1ea0000 0 0x00002000>;
+ little-endian;
+};
+
+=============================================================================
+PCS Phy Node
+
+DESCRIPTION
+
+PCS Phy (Physical Coding Sublayer / Physical layer) node
+
+PROPERTIES
+
+- compatible
+ Usage: required
+ Value type: <stringlist>
+ Definition: A standard property. Specifies the IEEE 802.3 Clause
+ Different IEEE 802.3 Clauses can be specified:
+ - Clause 22 must be specified as: "ethernet-phy-ieee802.3-c22"
+ - Clause 45 must be specified as: "ethernet-phy-ieee802.3-c45"
+ For complete definition see:
+ Documentation/devicetree/bindings/net/phy.txt
+
+- reg
+ Usage: required
+ Value type: <prop-encoded-array>
+ Definition: A standard property.
+ Specifies the offset of the PCS Phy configuration registers
+ For complete definition see:
+ Documentation/devicetree/bindings/net/phy.txt
+
+- backplane-mode
+ Usage: required
+ Value type: <stringlist>
+ Definition: Specifies the speed and type of the protocol used
+ Different speeds and backplane protocol types can be used:
+ - 10GBase-KR must be specified as: "10gbase-kr"
+ - 40GBase-KR must be specified as: "40gbase-kr"
+
+- fsl,lane-handle
+ Usage: required
+ Value type: <phandle>
+ Definition: Specifies the reference to a node representing the SerDes
+ device
+
+- fsl,lane-reg
+ Usage: required
+ Value type: <prop-encoded-array>
+ Definition: Specifies the offsets of the SerDes lanes configuration
+ registers
+
+EXAMPLE
+
+Example of pcs phy node for 10GBase-KR:
+
+pcs_phy1: ethernet-phy@0 {
+ compatible = "ethernet-phy-ieee802.3-c45";
+ backplane-mode = "10gbase-kr";
+ reg = <0x0>;
+ fsl,lane-handle = <&serdes1>;
+ fsl,lane-reg = <0xE00>; /* lane G */
+};
--
1.9.1
^ permalink raw reply related
* Re: [PATCH 0/5] VSOCK: support mergeable rx buffer in vhost-vsock
From: Jason Wang @ 2018-11-06 2:41 UTC (permalink / raw)
To: jiangyiwen, stefanha; +Cc: netdev, kvm, virtualization
In-Reply-To: <5BE0F9C9.2080003@huawei.com>
On 2018/11/6 10:17 AM, jiangyiwen wrote:
> On 2018/11/5 17:21, Jason Wang wrote:
>> On 2018/11/5 3:43 PM, jiangyiwen wrote:
>>> Now vsock only support send/receive small packet, it can't achieve
>>> high performance. As previous discussed with Jason Wang, I revisit the
>>> idea of vhost-net about mergeable rx buffer and implement the mergeable
>>> rx buffer in vhost-vsock, it can allow big packet to be scattered in
>>> into different buffers and improve performance obviously.
>>>
>>> I write a tool to test the vhost-vsock performance, mainly send big
>>> packet(64K) included guest->Host and Host->Guest. The result as
>>> follows:
>>>
>>> Before performance:
>>> Single socket Multiple sockets(Max Bandwidth)
>>> Guest->Host ~400MB/s ~480MB/s
>>> Host->Guest ~1450MB/s ~1600MB/s
>>>
>>> After performance:
>>> Single socket Multiple sockets(Max Bandwidth)
>>> Guest->Host ~1700MB/s ~2900MB/s
>>> Host->Guest ~1700MB/s ~2900MB/s
>>>
>>> From the test results, the performance is improved obviously, and guest
>>> memory will not be wasted.
>> Hi:
>>
>> Thanks for the patches and the numbers are really impressive.
>>
>> But instead of duplicating codes between sock and net. I was considering to use virtio-net as a transport of vsock. Then we may have all existed features likes batching, mergeable rx buffers and multiqueue. Want to consider this idea? Thoughts?
>>
>>
> Hi Jason,
>
> I am not very familiar with virtio-net, so I am afraid I can't give too
> much effective advice. Then I have several problems:
>
> 1. If use virtio-net as a transport, guest should see a virtio-net
> device instead of virtio-vsock device, right? Is vsock only as a
> transport between socket and net_device? User should still use
> AF_VSOCK type to create socket, right?
Well, there're many choices. What you need is just to keep the socket
API and hide the implementation. For example, you can keep the vsock
device in guest and switch to use vhost-net in host. We probably need a
new feature bit or header to let vhost know we are passing vsock packet.
And vhost-net could forward the packet to vsock core on host.
>
> 2. I want to know if this idea has already started, and how is
> the current progress?
Not yet started. I just want to hear from the community. If this sounds
good, are you interested in implementing it?
>
> 3. And what is stefan's idea?
I talked with Stefan a little about this during KVM Forum. I think he tends
to agree with this idea. Anyway, let's wait for his reply.
Thanks
>
> Thanks,
> Yiwen.
>
^ permalink raw reply
* Re: [PATCH v3 2/2] mm/page_alloc: use a single function to free page
From: Vlastimil Babka @ 2018-11-06 12:06 UTC (permalink / raw)
To: Aaron Lu, linux-mm, linux-kernel, netdev
Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
Mel Gorman, Saeed Mahameed, Michal Hocko, Dave Hansen,
Alexander Duyck
In-Reply-To: <20181106113149.GC24198@intel.com>
On 11/6/18 12:31 PM, Aaron Lu wrote:
> We have multiple places of freeing a page, most of them doing similar
> things and a common function can be used to reduce code duplicate.
>
> It also avoids bug fixed in one function but left in another.
>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Thanks!
> ---
> v3: Vlastimil mentioned the possible performance loss by using
> page_ref_sub_and_test(page, 1) for put_page_testzero(page), since
> we aren't sure so be safe by keeping page ref decreasing code as
> is, only move freeing page part to a common function.
^ permalink raw reply
* Re: [PATCH iproute2-next v3] rdma: Document IB device renaming option
From: David Ahern @ 2018-11-06 3:13 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Leon Romanovsky, netdev, RDMA mailing list, Stephen Hemminger
In-Reply-To: <20181104191122.11979-1-leon@kernel.org>
On 11/4/18 12:11 PM, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
>
> [leonro@server /]$ lspci |grep -i Ether
> 00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
> 00:09.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
> [leonro@server /]$ sudo rdma dev
> 1: mlx5_0: node_type ca fw 3.8.9999 node_guid 5254:00c0:fe12:3455
> sys_image_guid 5254:00c0:fe12:3455
> [leonro@server /]$ sudo rdma dev set mlx5_0 name hfi1_0
> [leonro@server /]$ sudo rdma dev
> 1: hfi1_0: node_type ca fw 3.8.9999 node_guid 5254:00c0:fe12:3455
> sys_image_guid 5254:00c0:fe12:3455
>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
> Changelog:
> v2->v3:
> * Dropped "to be named" words from example section of man
> ---
> man/man8/rdma-dev.8 | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
applied to iproute2-next. Thanks
^ permalink raw reply
* [PATCH iproute2] tc: f_u32: allow skip_hw and skip_sw flags to be last
From: Jakub Kicinski @ 2018-11-06 3:23 UTC (permalink / raw)
To: stephen, dsahern; +Cc: netdev, oss-drivers, Jakub Kicinski
u32 uses NEXT_ARG() incorrectly when parsing skip_hw and skip_sw
flags. NEXT_ARG() ensures there is another argument on the command
line, and is used in handling <keyword> <value> syntax to move past
<keyword> and ensure there is a <value> to read.
Commit 5e5b3008d1fb ("tc: f_u32: Add support for skip_hw and skip_sw
flags") seems to have copy pasted the handling from the previous
command - "police", which needs an extra parameter and is kind of
special due to the use of parse_police() helper.
The combination of NEXT_ARG() and continue worked fine as long as
skip_sw/skip_hw wasn't last, e.g.:
$ tc filter add dev dummy0 ingress prio 101 protocol ipv6 \
u32 match ip6 priority 0xa0 0xe0 skip_hw action pass
But would fail if it was last:
$ tc filter add dev dummy0 ingress prio 101 protocol ipv6 \
u32 match ip6 priority 0xa0 0xe0 flowid :1 skip_hw
Command line is not complete. Try option "help"
Remove the NEXT_ARG()s and the continues, and let the argc--; argv++;
at the end of the loop do its job.
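
The failure mode can be modelled with a small standalone parser. This is a
simplified sketch, not the f_u32.c code: parse_opts(), its buggy switch and
the FLAG_* values are hypothetical, NEXT_ARG() is modelled inline as
"advance, then fail if nothing is left", and unknown keywords are simply
skipped instead of being rejected.

```c
#include <string.h>

/* Hypothetical stand-ins for TCA_CLS_FLAGS_SKIP_HW / TCA_CLS_FLAGS_SKIP_SW. */
#define FLAG_SKIP_HW 1u
#define FLAG_SKIP_SW 2u

/*
 * Returns 0 on success and -1 for "Command line is not complete".
 * buggy != 0 reproduces the old handling of skip_hw.
 */
static int parse_opts(int argc, char **argv, int buggy, unsigned int *flags)
{
	*flags = 0;
	while (argc > 0) {
		if (strcmp(*argv, "flowid") == 0) {
			/* <keyword> <value>: step past the keyword, require a value. */
			argc--; argv++;
			if (argc <= 0)
				return -1;
		} else if (strcmp(*argv, "skip_hw") == 0) {
			if (buggy) {
				/* Old code: demand another argument, then skip the final advance. */
				argc--; argv++;
				if (argc <= 0)
					return -1;
				*flags |= FLAG_SKIP_HW;
				continue;
			}
			/* Fixed code: a bare flag, the common advance below moves past it. */
			*flags |= FLAG_SKIP_HW;
		}
		argc--; argv++;
	}
	return 0;
}
```

In this model "flowid :1 skip_hw" fails with the old handling and succeeds
with the fixed one, while "skip_hw flowid :1" is accepted by both.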
Fixes: 5e5b3008d1fb ("tc: f_u32: Add support for skip_hw and skip_sw flags")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
tc/f_u32.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/tc/f_u32.c b/tc/f_u32.c
index bff4be637728..e0a322d5a11c 100644
--- a/tc/f_u32.c
+++ b/tc/f_u32.c
@@ -1147,13 +1147,9 @@ static int u32_parse_opt(struct filter_util *qu, char *handle,
terminal_ok++;
continue;
} else if (strcmp(*argv, "skip_hw") == 0) {
- NEXT_ARG();
flags |= TCA_CLS_FLAGS_SKIP_HW;
- continue;
} else if (strcmp(*argv, "skip_sw") == 0) {
- NEXT_ARG();
flags |= TCA_CLS_FLAGS_SKIP_SW;
- continue;
} else if (strcmp(*argv, "help") == 0) {
explain();
return -1;
--
2.17.1
^ permalink raw reply related