* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Daniel Mack @ 2016-12-20 10:21 UTC
To: Andy Lutomirski, Alexei Starovoitov
Cc: Andy Lutomirski, Mickaël Salaün, Kees Cook, Jann Horn,
Tejun Heo, David Ahern, David S. Miller, Thomas Graf,
Michael Kerrisk, Peter Zijlstra, Linux API,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Network Development
In-Reply-To: <CALCETrXymvAo-9zhQe=amToz_fs9XGniK2KLZv5Fxc66qcUx6A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
Hi,
On 12/20/2016 04:50 AM, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 7:18 PM, Alexei Starovoitov
> <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote:
>>> I think we're still talking past each other. A big part of the point
>>> of changing it is that none of this is specific to bpf. You could (in
>>
>> the hooks and context passed into the program is very much bpf specific.
>> That's what I've been trying to convey all along.
>
> You mean BPF_CGROUP_RUN_PROG_INET_SOCK(sk)? There is nothing bpf
> specific about the hook except that the name of this macro has "BPF" in
> it. There is nothing whatsoever that's bpf-specific about the context
> -- sk is not bpf-specific at all.
>
> The only thing bpf-specific about it is that it currently only invokes
> bpf programs. That could easily change.
I'm not sure I follow. The code as it currently stands only supports
attaching bpf programs that have been created using BPF_PROG_LOAD to
cgroups. If cgroups were to support other program types in the future,
those would need to be stored in different data types anyway, and the
bpf syscall multiplexer would be the wrong entry point for accessing
them.
Whether we add bpf-specific code to the cgroup file parsers or
cgroup-specific code to the bpf layer does not make much of a semantic
difference, does it? As a matter of fact, my very first implementation
of this patch set implemented a cgroup controller that would allow
writing strings like "ingress 5" to its control file, where 5 is the fd
number that came out of BPF_PROG_LOAD. The main reason we decided to
ditch that was that echoing fd numbers into a text file seemed way worse
than going through a proper syscall layer with it, and ioctls are
unavailable on pseudo-fs.
The idea was rather to allow attaching bpf programs to things other than
just cgroups as well, which is why we called the member of 'union
bpf_attr' 'target_fd'; a cgroup is just one type of target here.
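For illustration, an attach through the bpf(2) syscall as described above might look roughly like the following. This is a minimal sketch, assuming kernel headers that provide BPF_PROG_ATTACH and BPF_CGROUP_INET_INGRESS; the helper name cgroup_attach_ingress is made up for this example and is not part of any API.

```c
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: attach an already-loaded bpf program (prog_fd, as
 * returned by BPF_PROG_LOAD) to the cgroup referred to by cgroup_fd (an fd
 * opened on the cgroup directory). Returns 0 on success, or -1 with errno
 * set on failure. */
int cgroup_attach_ingress(int cgroup_fd, int prog_fd)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.target_fd = cgroup_fd;      /* the target; here a cgroup */
	attr.attach_bpf_fd = prog_fd;    /* the program to attach */
	attr.attach_type = BPF_CGROUP_INET_INGRESS;

	return (int)syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}
```

The target is passed as a plain fd, so nothing in the call itself is tied to cgroups beyond the attach type.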
>> i'm assuming 'baadf00d' is bpf program fd expressed a text string?
>> and kernel needs to parse above? will you allow capital and lower
>> case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
>> can program fd expressed as decimal or hex or both?
>> how do you return the error? as a text string for user space
>> to parse?
>
> No. The kernel does not parse it because you cannot write this to the
> file. You set a bpf filter with ioctl and pass an fd.
An ioctl on what file, exactly?
> If you *read*
> the file, you get the same bpf program hash that fdinfo on the bpf
> object would show -- this is for debugging and (eventually) CRIU.
We need a debugging facility at some point, I agree with that. As the
code currently stands, that would rather need to go into the bpf(2)
syscall though, as setting a program through bpf(2) and reading it back
through cgroupfs is really nasty.
>> so you're proposing to add a bunch of hard coded logic to the kernel.
>> First to parse such text into some sort of syntax tree or list/set
>> and then have hard coded logic specifically for these two use cases?
>> While above two can be implemented as trivial bpf programs already?!
>> That goes 180 degrees against bpf philosophy. bpf is about moving
>> the specific code out of the kernel and keeping kernel generic that
>> it can solve as many use cases as possible by being programmable.
>
> I'm not seriously proposing implementing these. My point is that
> *bpf*, while wonderful, is not the be-all-and-end-all of kernel
> configurability, and other types of hooks might want to be hooked in
> here.
Sure, but nobody claimed it to be that be-all-and-end-all thing. It's
just one thing that a cgroup is now able to accommodate, and because
that new feature is specific to bpf, we decided to hook up the uapi to
the bpf syscall.
> So if I set up a cgroup that's monitored and call it /cgroup/a and
> enable delegation and if the program running there wants to do its own
> monitoring in /cgroup/a/b (via delegation), then you really want the
> outer monitor to silently drop events coming from /cgroup/a/b?
That's a fair point, and we've discussed it as well. The issue is, as
Alexei already pointed out, that we do not want to traverse the tree up
to the root for nested cgroups due to the runtime costs in the
networking fast-path. After all, we're running the bpf program for each
packet in flight. Hence, we opted for the approach to only look at the
leaf node for now, with the ability to open it up further in the future
using flags during attach etc.
> The current approach to bpf hooks will bite you down the road. David
> Ahern is already proposing using it for something that is not tracing
> at all, and someone will want that in a container, and there will be a
> problem.
Hmm, I thought we had sorted out the concerns about that by making sure
that we
a) lock down the API sufficiently so it doesn't cause any security
issues in its current form, and
b) make it possible to extend the functionality in the future by adding
flags to the command struct etc.
And I hoped we achieved that after discussing it for so long.
> How about slowing down a wee bit and trying to come up with cgroup
> hook semantics that work for all of these use cases?
I'm all for discussing things, but I don't think this was done in a rush.
I do agree though that adding functionality to cgroups that is not
limited to resource control is a delicate thing to do, which is why I
cc'ed cgroups@ in my patches. I should have also added linux-api@ I
guess, sorry I missed that.
> I think my proposal is quite close to workable.
So let's talk about how to proceed. I've seen different bits of your
proposal in different mails, and I think a summary of it would help the
discussion.
Thanks,
Daniel
* Re: [PATCH net] netfilter: check duplicate config when initializing in ipt_CLUSTERIP
From: Xin Long @ 2016-12-20 11:14 UTC
To: Pablo Neira Ayuso
Cc: network dev, netfilter-devel, davem, Marcelo Ricardo Leitner
In-Reply-To: <20161220004803.GA14656@salvia>
On Tue, Dec 20, 2016 at 8:48 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Thu, Dec 15, 2016 at 12:31:40PM +0800, Xin Long wrote:
>> @@ -185,6 +186,17 @@ clusterip_config_init(const struct ipt_clusterip_tgt_info *i, __be32 ip,
>> atomic_set(&c->refcount, 1);
>> atomic_set(&c->entries, 1);
>>
>> + spin_lock_bh(&cn->lock);
>> + if (__clusterip_config_find(net, ip)) {
>> + spin_unlock_bh(&cn->lock);
>> + kfree(c);
>> +
>> + return NULL;
>> + }
>
> This is going to result in ENOMEM error report to userspace on race,
> which can be confusing. Time for clusterip_config_init() to return
> PTR_ERR()?
will post v2 with PTR_ERR, thanks.
>
>> +
>> + list_add_rcu(&c->list, &cn->configs);
>> + spin_unlock_bh(&cn->lock);
>> +
>> #ifdef CONFIG_PROC_FS
>> {
>> char buffer[16];
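The error-pointer convention discussed above can be sketched in userspace as follows. This is only an illustration: ERR_PTR(), PTR_ERR() and IS_ERR() are re-implemented here as stand-ins for the kernel helpers, and config_init() with its 'exists' parameter is a hypothetical simplification, not the real clusterip_config_init().

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR():
 * small negative errno values are encoded in the top of the address space. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct config { unsigned int ip; };

/* Hypothetical init function: 'exists' simulates finding a duplicate
 * entry under the lock. Each failure returns a distinct error code
 * instead of a bare NULL. */
struct config *config_init(unsigned int ip, int exists)
{
	struct config *c;

	if (exists)
		return ERR_PTR(-EBUSY);

	c = calloc(1, sizeof(*c));
	if (!c)
		return ERR_PTR(-ENOMEM);

	c->ip = ip;
	return c;
}
```

A caller would then test the result with IS_ERR() and propagate PTR_ERR(), so a lost race is reported as -EBUSY rather than being folded into -ENOMEM.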
* [PATCHv2 net] netfilter: check duplicate config when initializing in ipt_CLUSTERIP
From: Xin Long @ 2016-12-20 11:14 UTC
To: network dev, netfilter-devel; +Cc: davem, pablo, Marcelo Ricardo Leitner
Now when adding an ipt_CLUSTERIP rule, it only checks for a duplicate
config in clusterip_config_find_get(). But after that, another thread may
still insert a config with the same ip, which leaves proc_create_data to
do the duplicate check.
It's more reasonable for ipt_CLUSTERIP to check for a duplicate config
itself, instead of relying on the proc fs duplicate file check. Before,
when proc fs allowed files with duplicate names in a directory, it could
even crash the kernel because of a use-after-free.
This patch is to check duplicate config under the protection of clusterip
net lock when initializing a new config and correct the return err.
Note that it also moves proc file node creation after adding new config, as
proc_create_data may sleep, it couldn't be called under the clusterip_net
lock. clusterip_config_find_get returns NULL if c->pde is null to make sure
it can't be used until the proc file node creation is done.
v1->v2:
correct the err clusterip_config_init returns.
Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/ipv4/netfilter/ipt_CLUSTERIP.c | 34 +++++++++++++++++++++++-----------
1 file changed, 23 insertions(+), 11 deletions(-)
diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index 21db00d..a6b8c1a 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -144,7 +144,7 @@ clusterip_config_find_get(struct net *net, __be32 clusterip, int entry)
rcu_read_lock_bh();
c = __clusterip_config_find(net, clusterip);
if (c) {
- if (unlikely(!atomic_inc_not_zero(&c->refcount)))
+ if (!c->pde || unlikely(!atomic_inc_not_zero(&c->refcount)))
c = NULL;
else if (entry)
atomic_inc(&c->entries);
@@ -166,14 +166,15 @@ clusterip_config_init_nodelist(struct clusterip_config *c,
static struct clusterip_config *
clusterip_config_init(const struct ipt_clusterip_tgt_info *i, __be32 ip,
- struct net_device *dev)
+ struct net_device *dev)
{
+ struct net *net = dev_net(dev);
struct clusterip_config *c;
- struct clusterip_net *cn = net_generic(dev_net(dev), clusterip_net_id);
+ struct clusterip_net *cn = net_generic(net, clusterip_net_id);
c = kzalloc(sizeof(*c), GFP_ATOMIC);
if (!c)
- return NULL;
+ return ERR_PTR(-ENOMEM);
c->dev = dev;
c->clusterip = ip;
@@ -185,6 +186,17 @@ clusterip_config_init(const struct ipt_clusterip_tgt_info *i, __be32 ip,
atomic_set(&c->refcount, 1);
atomic_set(&c->entries, 1);
+ spin_lock_bh(&cn->lock);
+ if (__clusterip_config_find(net, ip)) {
+ spin_unlock_bh(&cn->lock);
+ kfree(c);
+
+ return ERR_PTR(-EBUSY);
+ }
+
+ list_add_rcu(&c->list, &cn->configs);
+ spin_unlock_bh(&cn->lock);
+
#ifdef CONFIG_PROC_FS
{
char buffer[16];
@@ -195,16 +207,16 @@ clusterip_config_init(const struct ipt_clusterip_tgt_info *i, __be32 ip,
cn->procdir,
&clusterip_proc_fops, c);
if (!c->pde) {
+ spin_lock_bh(&cn->lock);
+ list_del_rcu(&c->list);
+ spin_unlock_bh(&cn->lock);
kfree(c);
- return NULL;
+
+ return ERR_PTR(-ENOMEM);
}
}
#endif
- spin_lock_bh(&cn->lock);
- list_add_rcu(&c->list, &cn->configs);
- spin_unlock_bh(&cn->lock);
-
return c;
}
@@ -410,9 +422,9 @@ static int clusterip_tg_check(const struct xt_tgchk_param *par)
config = clusterip_config_init(cipinfo,
e->ip.dst.s_addr, dev);
- if (!config) {
+ if (IS_ERR(config)) {
dev_put(dev);
- return -ENOMEM;
+ return PTR_ERR(config);
}
dev_mc_add(config->dev, config->clustermac);
}
--
2.1.0
* [PATCH] stmmac: CSR clock configuration fix
From: Joao Pinto @ 2016-12-20 11:21 UTC
To: peppe.cavallaro, davem
Cc: hock.leong.kweh, niklas.cassel, pavel, linux-kernel, netdev,
Joao Pinto
When testing stmmac with my QoS reference design I found a problem in the
CSR clock configuration that prevented PHY discovery, since every read
operation returned 0x0000ffff. This patch fixes the issue.
Signed-off-by: Joao Pinto <jpinto@synopsys.com>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
index 23322fd..fda01f7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
@@ -81,8 +81,8 @@ static int stmmac_mdio_read(struct mii_bus *bus, int phyaddr, int phyreg)
value |= (phyaddr << priv->hw->mii.addr_shift)
& priv->hw->mii.addr_mask;
value |= (phyreg << priv->hw->mii.reg_shift) & priv->hw->mii.reg_mask;
- value |= (priv->clk_csr & priv->hw->mii.clk_csr_mask)
- << priv->hw->mii.clk_csr_shift;
+ value |= (priv->clk_csr << priv->hw->mii.clk_csr_shift)
+ & priv->hw->mii.clk_csr_mask;
if (priv->plat->has_gmac4)
value |= MII_GMAC4_READ;
@@ -122,8 +122,8 @@ static int stmmac_mdio_write(struct mii_bus *bus, int phyaddr, int phyreg,
& priv->hw->mii.addr_mask;
value |= (phyreg << priv->hw->mii.reg_shift) & priv->hw->mii.reg_mask;
- value |= ((priv->clk_csr & priv->hw->mii.clk_csr_mask)
- << priv->hw->mii.clk_csr_shift);
+ value |= (priv->clk_csr << priv->hw->mii.clk_csr_shift)
+ & priv->hw->mii.clk_csr_mask;
if (priv->plat->has_gmac4)
value |= MII_GMAC4_WRITE;
--
2.9.3
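The difference between the two orderings in the patch above can be illustrated with a small standalone sketch. The field layout here (a 4-bit field at bits 5:2) is hypothetical and chosen only for the example; the assumption is that the mask is defined in register position, as the neighbouring addr_mask and reg_mask lines suggest.

```c
#include <stdint.h>

/* Hypothetical field layout for illustration: a 4-bit clock field at
 * bits 5:2, with the mask already expressed in register position. */
#define CLK_CSR_SHIFT 2
#define CLK_CSR_MASK  0x3Cu

/* Old order: masking the unshifted value with a register-position mask
 * discards the low bits before they are moved into place. */
uint32_t csr_mask_then_shift(uint32_t clk_csr)
{
	return (clk_csr & CLK_CSR_MASK) << CLK_CSR_SHIFT;
}

/* Fixed order: shift the value into position first, then mask. */
uint32_t csr_shift_then_mask(uint32_t clk_csr)
{
	return (clk_csr << CLK_CSR_SHIFT) & CLK_CSR_MASK;
}
```

With these example values, a clk_csr of 1 becomes 0 in the old order and 0x4 in the fixed order, so the clock selection would be lost entirely in the former case.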
* Re: [PATCHv2 net 1/2] sctp: reduce indent level in sctp_copy_local_addr_list
From: Marcelo Ricardo Leitner @ 2016-12-20 11:29 UTC
To: Xin Long; +Cc: network dev, linux-sctp, davem, Neil Horman
In-Reply-To: <b4d6428c565955e9e81e3b4623a6458e7567e593.1482212764.git.lucien.xin@gmail.com>
On Tue, Dec 20, 2016 at 01:49:49PM +0800, Xin Long wrote:
> This patch is to reduce indent level by using continue when the addr
> is not allowed, and also drop end_copy by using break.
>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
> net/sctp/protocol.c | 37 +++++++++++++++++++------------------
> 1 file changed, 19 insertions(+), 18 deletions(-)
>
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index 7b523e3..da5d82b 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -205,26 +205,27 @@ int sctp_copy_local_addr_list(struct net *net, struct sctp_bind_addr *bp,
> list_for_each_entry_rcu(addr, &net->sctp.local_addr_list, list) {
> if (!addr->valid)
> continue;
> - if (sctp_in_scope(net, &addr->a, scope)) {
> - /* Now that the address is in scope, check to see if
> - * the address type is really supported by the local
> - * sock as well as the remote peer.
> - */
> - if ((((AF_INET == addr->a.sa.sa_family) &&
> - (copy_flags & SCTP_ADDR4_PEERSUPP))) ||
> - (((AF_INET6 == addr->a.sa.sa_family) &&
> - (copy_flags & SCTP_ADDR6_ALLOWED) &&
> - (copy_flags & SCTP_ADDR6_PEERSUPP)))) {
> - error = sctp_add_bind_addr(bp, &addr->a,
> - sizeof(addr->a),
> - SCTP_ADDR_SRC, GFP_ATOMIC);
> - if (error)
> - goto end_copy;
> - }
> - }
> + if (!sctp_in_scope(net, &addr->a, scope))
> + continue;
> +
> + /* Now that the address is in scope, check to see if
> + * the address type is really supported by the local
> + * sock as well as the remote peer.
> + */
> + if (addr->a.sa.sa_family == AF_INET &&
> + !(copy_flags & SCTP_ADDR4_PEERSUPP))
> + continue;
> + if (addr->a.sa.sa_family == AF_INET6 &&
> + (!(copy_flags & SCTP_ADDR6_ALLOWED) ||
> + !(copy_flags & SCTP_ADDR6_PEERSUPP)))
> + continue;
> +
> + error = sctp_add_bind_addr(bp, &addr->a, sizeof(addr->a),
> + SCTP_ADDR_SRC, GFP_ATOMIC);
> + if (error)
> + break;
> }
>
> -end_copy:
> rcu_read_unlock();
> return error;
> }
> --
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
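As a sanity check of the refactoring quoted above, the nested condition and the early-continue form can be compared exhaustively in a small standalone program. This is a sketch only: the family constants and flag bit values below are illustrative stand-ins, not taken from the kernel headers.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the address families and copy flags. */
enum { FAM_INET = 2, FAM_INET6 = 10 };
#define ADDR6_ALLOWED  0x1
#define ADDR4_PEERSUPP 0x2
#define ADDR6_PEERSUPP 0x4

/* Original nested form: copy only if the combined condition holds. */
bool copy_nested(int family, int flags)
{
	return (family == FAM_INET && (flags & ADDR4_PEERSUPP)) ||
	       (family == FAM_INET6 && (flags & ADDR6_ALLOWED) &&
		(flags & ADDR6_PEERSUPP));
}

/* Refactored form: bail out early for each disallowed case. */
bool copy_early_continue(int family, int flags)
{
	if (family == FAM_INET && !(flags & ADDR4_PEERSUPP))
		return false;
	if (family == FAM_INET6 &&
	    (!(flags & ADDR6_ALLOWED) || !(flags & ADDR6_PEERSUPP)))
		return false;
	return true;
}

/* Returns true if both forms agree for every flag combination. */
bool forms_agree(int family)
{
	for (int flags = 0; flags < 8; flags++)
		if (copy_nested(family, flags) !=
		    copy_early_continue(family, flags))
			return false;
	return true;
}
```

The two forms agree for both families shown. They differ only for a family that is neither of the two, which the refactored form would let through; presumably that cannot occur on this list.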
* Re: [PATCHv2 net 2/2] sctp: not copying duplicate addrs to the assoc's bind address list
From: Marcelo Ricardo Leitner @ 2016-12-20 11:30 UTC
To: Xin Long; +Cc: network dev, linux-sctp, davem, Neil Horman
In-Reply-To: <5a0037123617b525dfe456db1055770f39fb1193.1482212764.git.lucien.xin@gmail.com>
On Tue, Dec 20, 2016 at 01:49:50PM +0800, Xin Long wrote:
> sctp.local_addr_list is a global address list that is supposed to include
> all the local addresses. sctp updates this list according to NETDEV_UP/
> NETDEV_DOWN notifications.
>
> However, if multiple NICs have the same address, the global list would
> have duplicate addresses. Even if for one NIC, promote secondaries in
> __inet_del_ifa can also lead to accumulating duplicate addresses.
>
> When sctp binds address 'ANY' and creates a connection, it copies all
> the addresses from global list into asoc's bind addr list, which makes
> sctp pack the duplicate addresses into INIT/INIT_ACK packets.
>
> This patch is to filter the duplicate addresses when copying the addrs
> from global list in sctp_copy_local_addr_list and unpacking addr_param
> from cookie in sctp_raw_to_bind_addrs to asoc's bind addr list.
>
> Note that we can't filter the duplicate addrs when global address list
> gets updated, as a NETDEV_DOWN event may remove an addr that still exists
> in another NIC.
>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
> net/sctp/bind_addr.c | 3 +++
> net/sctp/protocol.c | 3 +++
> 2 files changed, 6 insertions(+)
>
> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
> index 401c607..1ebc184 100644
> --- a/net/sctp/bind_addr.c
> +++ b/net/sctp/bind_addr.c
> @@ -292,6 +292,8 @@ int sctp_raw_to_bind_addrs(struct sctp_bind_addr *bp, __u8 *raw_addr_list,
> }
>
> af->from_addr_param(&addr, rawaddr, htons(port), 0);
> + if (sctp_bind_addr_state(bp, &addr) != -1)
> + goto next;
> retval = sctp_add_bind_addr(bp, &addr, sizeof(addr),
> SCTP_ADDR_SRC, gfp);
> if (retval) {
> @@ -300,6 +302,7 @@ int sctp_raw_to_bind_addrs(struct sctp_bind_addr *bp, __u8 *raw_addr_list,
> break;
> }
>
> +next:
> len = ntohs(param->length);
> addrs_len -= len;
> raw_addr_list += len;
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index da5d82b..616a942 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -220,6 +220,9 @@ int sctp_copy_local_addr_list(struct net *net, struct sctp_bind_addr *bp,
> !(copy_flags & SCTP_ADDR6_PEERSUPP)))
> continue;
>
> + if (sctp_bind_addr_state(bp, &addr->a) != -1)
> + continue;
> +
> error = sctp_add_bind_addr(bp, &addr->a, sizeof(addr->a),
> SCTP_ADDR_SRC, GFP_ATOMIC);
> if (error)
> --
> 2.1.0
>
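The filtering step in the patch quoted above (skip an address when the lookup does not return -1) can be sketched with a plain array. All names here are hypothetical; in particular addr_state() is only a stand-in that mirrors the "-1 means not found" convention and does not model the real sctp_bind_addr_state().

```c
#include <stddef.h>

#define MAX_ADDRS 16

struct addr_list {
	unsigned int addrs[MAX_ADDRS];
	size_t n;
};

/* Stand-in lookup: returns the index of addr, or -1 if it is absent. */
int addr_state(const struct addr_list *l, unsigned int addr)
{
	for (size_t i = 0; i < l->n; i++)
		if (l->addrs[i] == addr)
			return (int)i;
	return -1;
}

/* Copy src into dst, skipping addresses that dst already holds.
 * Returns 0 on success, or -1 if dst runs out of room. */
int copy_unique(struct addr_list *dst, const unsigned int *src, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (addr_state(dst, src[i]) != -1)
			continue;	/* duplicate: do not add it again */
		if (dst->n == MAX_ADDRS)
			return -1;
		dst->addrs[dst->n++] = src[i];
	}
	return 0;
}
```

Filtering at copy time leaves the source list untouched, which matches the note in the commit message that duplicates cannot simply be dropped from the global list.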
* [PATCH][V2] qed: fix memory leak of a qed_spq_entry on error failure paths
From: Colin King @ 2016-12-20 11:44 UTC
To: Yuval Mintz, Ariel Elior, everest-linux-l2, netdev; +Cc: linux-kernel
From: Colin Ian King <colin.king@canonical.com>
A qed_spq_entry entry is allocated by qed_sp_init_request but is not
kfree'd if an error occurs, causing a memory leak. Fix this by
returning the previously allocated spq entry and also setting *pp_ent
to NULL to be safe.
Thanks to Yuval Mintz for suggestions on how to improve my original
fix.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
drivers/net/ethernet/qlogic/qed/qed_sp_commands.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c b/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c
index a39ef2e..d2034fa 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c
@@ -55,8 +55,10 @@ int qed_sp_init_request(struct qed_hwfn *p_hwfn,
break;
case QED_SPQ_MODE_BLOCK:
- if (!p_data->p_comp_data)
- return -EINVAL;
+ if (!p_data->p_comp_data) {
+ rc = -EINVAL;
+ goto err;
+ }
p_ent->comp_cb.cookie = p_data->p_comp_data->cookie;
break;
@@ -71,7 +73,8 @@ int qed_sp_init_request(struct qed_hwfn *p_hwfn,
default:
DP_NOTICE(p_hwfn, "Unknown SPQE completion mode %d\n",
p_ent->comp_mode);
- return -EINVAL;
+ rc = -EINVAL;
+ goto err;
}
DP_VERBOSE(p_hwfn, QED_MSG_SPQ,
@@ -85,6 +88,10 @@ int qed_sp_init_request(struct qed_hwfn *p_hwfn,
memset(&p_ent->ramrod, 0, sizeof(p_ent->ramrod));
return 0;
+err:
+ qed_spq_return_entry(p_hwfn, *pp_ent);
+ *pp_ent = NULL;
+ return rc;
}
static enum tunnel_clss qed_tunn_get_clss_type(u8 type)
--
2.10.2
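The single-exit cleanup pattern used by the patch above can be sketched in userspace as follows. This is a simplified illustration: struct entry, init_request() and the mode numbers are hypothetical, and malloc/free stand in for the driver's entry get/return helpers.

```c
#include <errno.h>
#include <stdlib.h>

struct entry {
	int mode;
	void *cookie;
};

/* Hypothetical request init: allocates an entry, validates the mode, and
 * on any failure after the allocation releases the entry through a single
 * error label so it cannot leak. */
int init_request(struct entry **pp_ent, int mode, void *comp_data)
{
	struct entry *ent = calloc(1, sizeof(*ent));
	int rc;

	if (!ent)
		return -ENOMEM;
	*pp_ent = ent;

	switch (mode) {
	case 0:	/* callback mode: nothing to validate */
		break;
	case 1:	/* blocking mode requires completion data */
		if (!comp_data) {
			rc = -EINVAL;
			goto err;
		}
		ent->cookie = comp_data;
		break;
	default:
		rc = -EINVAL;
		goto err;
	}

	ent->mode = mode;
	return 0;

err:
	free(ent);
	*pp_ent = NULL;	/* do not leave a dangling pointer with the caller */
	return rc;
}
```

Routing every failure through one label keeps the release logic in a single place, so a later error path is less likely to forget it.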
* Re: Soft lockup in tc_classify
From: Daniel Borkmann @ 2016-12-20 11:47 UTC
To: Shahar Klein, Cong Wang
Cc: Or Gerlitz, Linux Netdev List, Roi Dayan, David Miller,
Jiri Pirko, John Fastabend, Hadar Hen Zion
In-Reply-To: <5a985705-11e5-1575-a049-723accb97608@mellanox.com>
Hi Shahar,
On 12/20/2016 07:22 AM, Shahar Klein wrote:
> On 12/19/2016 7:58 PM, Cong Wang wrote:
>> On Mon, Dec 19, 2016 at 8:39 AM, Shahar Klein <shahark@mellanox.com> wrote:
>>> On 12/13/2016 12:51 AM, Cong Wang wrote:
>>>>
>>>> On Mon, Dec 12, 2016 at 1:18 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>>>>>
>>>>> On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann <daniel@iogearbox.net>
>>>>> wrote:
>>>>>
>>>>>> Note that there's still the RCU fix missing for the deletion race that
>>>>>> Cong will still send out, but you say that the only thing you do is to
>>>>>> add a single rule, but no other operation is involved during that test?
>>>>>
>>>>> What's missing to have the deletion race fixed? making a patch or
>>>>> testing to a patch which was sent?
>>>>
>>>> If you think it would help for this problem, here is my patch rebased
>>>> on the latest net-next.
>>>>
>>>> Again, I don't see how it could help this case yet, especially I don't
>>>> see how we could have a loop in this singly linked list.
>>>
>>> I've applied cong's patch and hit a different lockup(full log attached):
>>
>> Are you sure this is really different? For me, it is still inside the loop
>> in tc_classify(), with only a slightly different offset.
>>
>>>
>>> Daniel suggested I'll add a print:
>>> case RTM_DELTFILTER:
>>> - err = tp->ops->delete(tp, fh);
>>> + printk(KERN_ERR "DEBUGG:SK %s:%d\n", __func__, __LINE__);
>>> + err = tp->ops->delete(tp, fh, &last);
>>> if (err == 0) {
>>>
>>> and I couldn't see this print in the output.....
>>
>> Hmm, that is odd, if this never prints, then my patch should not make any
>> difference.
>>
>> There are still two other cases where we could change tp->next, so do you
>> mind to add two more printk's for debugging?
>>
>> Attached is the delta patch.
>>
>> Thanks!
>
> I've added a slightly different debug print:
> @@ -368,11 +375,12 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
> if (tp_created) {
> RCU_INIT_POINTER(tp->next, rtnl_dereference(*back));
> rcu_assign_pointer(*back, tp);
> + printk(KERN_ERR "DEBUGG:SK add/change filter by: %pf tp=%p tp->next=%p\n", tp->ops->get, tp, tp->next);
> }
> tfilter_notify(net, skb, n, tp, fh, RTM_NEWTFILTER, false);
I'm curious, could you be a bit more verbose why you didn't go with Cong's
debug patch?
In particular, why you removed the hunk from the condition 'n->nlmsg_type ==
RTM_DELTFILTER && t->tcm_handle == 0' where we delete the whole tp instance?
Is it because if you have that printk() there, then the issue doesn't trigger
for you anymore? Or any other reason?
How many CPUs does your test machine have, I suspect more than 1, right?
So iff RTM_DELTFILTER with tcm_handle of 0 really played a role in this, I'm
wondering whether there was a subtle deletion + add race where the newly added
filter on the other CPU still saw a stale pointer in the list. But just a wild
guess at this point.
Hmm, could you try this below to see whether the issue still appears?
Thanks,
Daniel
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fbba79..4eee1cb 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -317,7 +317,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
if (n->nlmsg_type == RTM_DELTFILTER && t->tcm_handle == 0) {
struct tcf_proto *next = rtnl_dereference(tp->next);
- RCU_INIT_POINTER(*back, next);
+ rcu_assign_pointer(*back, next);
tfilter_notify(net, skb, n, tp, fh,
RTM_DELTFILTER, false);
> full output attached:
>
> [ 283.290271] Mirror/redirect action on
> [ 283.305031] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9432d704df60 tp->next= (null)
> [ 283.322563] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d240 tp->next= (null)
> [ 283.359997] GACT probability on
> [ 283.365923] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d240
> [ 283.378725] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
> [ 283.391310] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
> [ 283.403923] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
> [ 283.416542] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
> [ 308.538571] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
>
> Thanks
> Shahar
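The log above shows entries where tp and tp->next hold the same address. The following sketch does not claim to reproduce the actual race; it only illustrates, with hypothetical names, how head insertion of a node that is already the list head yields such a self-loop, and why a walk over the list then never terminates unless it is bounded.

```c
#include <stddef.h>

struct node {
	struct node *next;
};

/* Head insertion in the style of the add path: n->next = *back; *back = n. */
void insert_at(struct node **back, struct node *n)
{
	n->next = *back;
	*back = n;
}

/* Walk at most 'limit' nodes. Returns the list length, or -1 if the limit
 * is exceeded, which suggests the list contains a cycle. */
int bounded_len(const struct node *head, int limit)
{
	int len = 0;

	for (const struct node *p = head; p; p = p->next)
		if (++len > limit)
			return -1;
	return len;
}
```

An unbounded traversal of a list in that state would spin forever, which is at least consistent with a soft lockup inside a classification loop.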
* Re: wl1251 & mac address & calibration data
From: Kalle Valo @ 2016-12-20 11:47 UTC
To: Arend Van Spriel
Cc: Pali Rohár, Daniel Wagner, Luis R. Rodriguez, Tom Gundersen,
Johannes Berg, Ming Lei, Mimi Zohar, Bjorn Andersson,
Rafał Miłecki, Sebastian Reichel, Pavel Machek,
Michal Kazior, Ivaylo Dimitrov, Aaro Koskinen, Tony Lindgren,
linux-wireless, Network Development, linux-kernel@vger.kernel.org
In-Reply-To: <fd180271-1cd4-9891-e3d5-ae9cd0fb088b@broadcom.com>
Arend Van Spriel <arend.vanspriel@broadcom.com> writes:
> On 18-12-2016 13:09, Pali Rohár wrote:
>
>> File wl1251-nvs.bin is provided by linux-firmware package and contains
>> default data which should be overridden by model specific calibrated
>> data.
>
> Ah. Someone thought it was a good idea to provide the "one ring to rule
> them all". Nice.
Yes, that was a bad idea. wl1251-nvs.bin in linux-firmware.git should be
renamed to wl1251-nvs.bin.example, or something like that, as it should
be installed on a real system only if there's no real calibration data
available (only for developers to use, not real users).
>> But overwriting that one file is not possible as the next update of
>> linux-firmware package will overwrite it back. It breaks any normal usage
>> of package management.
>>
>> Also it is ridiculously broken by design if some "boot" files needs to
>> be overwritten to initialize hardware properly. To not break booting you
>> need to overwrite that file before first boot. But without booting
>> device you cannot read calibration data. So some hack with autoreboot
>> after boot is needed.
Providing the calibration data via Device Tree is the proper way to
solve this. Yes yes, I know N900 doesn't support it but that's a
deficiency in N900, not Linux.
>> And how to detect that we have real overwritten calibration data and
>> not default one from linux-firmware? Any heuristic or checks will be
>> broken here. And no, nothing like you need to reboot your device now
>> (and similar concept) from windows world is not accepted.
>
> Well. After reading and creating calibration data you could just rebind
> the driver to the device to have it probed again.
Or load wl1251 as a module and make sure calibration data is installed
before the module is loaded. LEDE does that with ath10k:
https://git.lede-project.org/?p=source.git;a=blob;f=target/linux/ar71xx/base-files/etc/hotplug.d/firmware/11-ath10k-caldata;h=97875bd79a579a0010da3f60324b6ec966fe9c6a;hb=HEAD
> But yeah, the default one from linux-firmware should never have been
> there in the first place.
Agreed.
--
Kalle Valo
* Re: [PATCH net-next] ixgbevf: fix 'Etherleak' in ixgbevf
From: Weilong Chen @ 2016-12-20 11:50 UTC
To: Alexander Duyck
Cc: Jeff Kirsher, intel-wired-lan, Netdev,
linux-kernel@vger.kernel.org, wangkefeng.wang
In-Reply-To: <CAKgT0UfA9VG4NQZLLxXvobtfp22yPpPssSM4Ge94KFa1RVnhyg@mail.gmail.com>
Hi,
Thanks for your reply.
We tested your patch, but the problem is still there; it does not seem to work.
I'm not sure why ixgbe uses the limit 17. The kernel uses ETH_ZLEN (60)
without FCS. A lot of drivers, such as e1000, use it. Any explanation?
Thanks.
On 2016/12/16 0:13, Alexander Duyck wrote:
> On Thu, Dec 15, 2016 at 3:40 AM, Weilong Chen <chenweilong@huawei.com> wrote:
>> Nessus reports that the vf appears to leak memory in network packets.
>> Fix this by padding all small packets manually.
>>
>> See also CVE-2003-0001:
>> https://ofirarkin.files.wordpress.com/2008/11/atstake_etherleak_report.pdf
>>
>> Signed-off-by: Weilong Chen <chenweilong@huawei.com>
>> ---
>> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
>> index 6d4bef5..137a154 100644
>> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
>> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
>> @@ -3654,6 +3654,13 @@ static int ixgbevf_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
>> return NETDEV_TX_OK;
>> }
>>
>> + /* On PCI/PCI-X HW, if packet size is less than ETH_ZLEN,
>> + * packets may get corrupted during padding by HW.
>> + * To WA this issue, pad all small packets manually.
>> + */
>> + if (eth_skb_pad(skb))
>> + return NETDEV_TX_OK;
>> +
>
> So the patch description for this probably isn't correct. It looks
> like the problem isn't leaking data it is the fact that the frames
> aren't being padded to prevent malicious events. The only issue is
> the patch is padding by a bit too much. I would recommend replacing
> this with the following from ixgbe:
>
> /*
> * The minimum packet size for olinfo paylen is 17 so pad the skb
> * in order to meet this minimum size requirement.
> */
> if (skb_put_padto(skb, 17))
> return NETDEV_TX_OK;
>
>
>> tx_ring = adapter->tx_ring[skb->queue_mapping];
>>
>> /* need: 1 descriptor per page * PAGE_SIZE/IXGBE_MAX_DATA_PER_TXD,
>> --
>> 1.7.12
>>
>
> .
>
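The padding being debated above amounts to zero-filling a short frame up to a minimum length before it is handed to the hardware. A minimal userspace sketch, assuming a plain byte buffer instead of an skb; pad_frame() and MIN_FRAME_LEN are hypothetical names, with 60 being the ETH_ZLEN value mentioned in the mail:

```c
#include <stddef.h>
#include <string.h>

/* Minimum Ethernet frame length without FCS (the ETH_ZLEN value). */
#define MIN_FRAME_LEN 60

/* Zero-fill buf from len up to min_len so that no stale bytes are sent
 * as padding. Returns the resulting frame length, or 0 if the buffer
 * capacity is too small to hold the padded frame. */
size_t pad_frame(unsigned char *buf, size_t len, size_t cap, size_t min_len)
{
	if (len >= min_len)
		return len;	/* already long enough */
	if (cap < min_len)
		return 0;	/* cannot pad in place */
	memset(buf + len, 0, min_len - len);
	return min_len;
}
```

The same routine with a min_len of 17 would correspond to the smaller limit suggested in the quoted reply; which limit is appropriate for the hardware is the open question in this thread.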
* Re: mlx4: Bug in XDP_TX + 16 rx-queues
From: Tariq Toukan @ 2016-12-20 12:02 UTC
To: Martin KaFai Lau, Tariq Toukan
Cc: Saeed Mahameed, netdev@vger.kernel.org, Alexei Starovoitov
In-Reply-To: <20161219233709.GA29858@kafai-mba.local>
Thanks Martin, nice catch!
On 20/12/2016 1:37 AM, Martin KaFai Lau wrote:
> Hi Tariq,
>
> On Sat, Dec 17, 2016 at 02:18:03AM -0800, Martin KaFai Lau wrote:
>> Hi All,
>>
>> I have been debugging with XDP_TX and 16 rx-queues.
>>
>> 1) When 16 rx-queues is used and an XDP prog is doing XDP_TX,
>> it seems that the packet cannot be XDP_TX out if the pkt
>> is received from some particular CPUs (/rx-queues).
>>
>> 2) If 8 rx-queues is used, it does not have problem.
>>
>> 3) The 16 rx-queues problem also went away after reverting these
>> two patches:
>> 15fca2c8eb41 net/mlx4_en: Add ethtool statistics for XDP cases
>> 67f8b1dcb9ee net/mlx4_en: Refactor the XDP forwarding rings scheme
>>
> After taking a closer look at 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
> and armed with the fact that '>8 rx-queues does not work', I have
> made the attached change that fixed the issue.
>
> Making change in mlx4_en_fill_qp_context() could be an easier fix
> but I think this change will be easier for discussion purpose.
>
> I don't want to lie that I know anything about how this variable
> works in CX3. If this change makes sense, I can cook up a diff.
> Otherwise, can you shed some light on what could be happening
> and hopefully can lead to a diff?
>
> Thanks
> --Martin
>
>
> diff --git i/drivers/net/ethernet/mellanox/mlx4/en_netdev.c w/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index bcd955339058..b3bfb987e493 100644
> --- i/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ w/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -1638,10 +1638,10 @@ int mlx4_en_start_port(struct net_device *dev)
>
> /* Configure tx cq's and rings */
> for (t = 0 ; t < MLX4_EN_NUM_TX_TYPES; t++) {
> - u8 num_tx_rings_p_up = t == TX ? priv->num_tx_rings_p_up : 1;
The bug lies in this line.
The number of rings per UP in the TX_XDP case should be
priv->tx_ring_num[TX_XDP], not 1.
Please try the following fix.
I can prepare and send it once the window opens again.
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index bcd955339058..edbe200ac2fa 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1638,7 +1638,8 @@ int mlx4_en_start_port(struct net_device *dev)
/* Configure tx cq's and rings */
for (t = 0 ; t < MLX4_EN_NUM_TX_TYPES; t++) {
- u8 num_tx_rings_p_up = t == TX ? priv->num_tx_rings_p_up : 1;
+ u8 num_tx_rings_p_up = t == TX ?
+ priv->num_tx_rings_p_up : priv->tx_ring_num[t];
for (i = 0; i < priv->tx_ring_num[t]; i++) {
/* Configure cq */
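To see why a divisor of 1 only misbehaves with more than 8 rx-queues, here
is a minimal sketch of the index arithmetic. It assumes that the number of
XDP TX rings follows the number of rx-queues and that valid user priorities
are 0..7 (suggested by the "i & 0x07" mask in Martin's change).
user_prio_for_ring(), max_user_prio() and MAX_USER_PRIO are hypothetical
names for illustration, not driver code.

```c
#include <assert.h>

/* Assumption: user priorities occupy 3 bits, so only 0..7 are valid. */
#define MAX_USER_PRIO 7

/* Mirrors the "i / num_tx_rings_p_up" expression in mlx4_en_start_port(). */
int user_prio_for_ring(int ring, int rings_per_up)
{
	return ring / rings_per_up;
}

/* Largest user priority produced across all rings of one TX type. */
int max_user_prio(int num_rings, int rings_per_up)
{
	int max = 0;

	for (int i = 0; i < num_rings; i++) {
		int up = user_prio_for_ring(i, rings_per_up);

		if (up > max)
			max = up;
	}
	return max;
}

/* Non-zero if every ring ends up with a user priority in 0..7. */
int user_prio_in_range(int num_rings, int rings_per_up)
{
	return max_user_prio(num_rings, rings_per_up) <= MAX_USER_PRIO;
}
```

With 8 rings and a divisor of 1 the result stays within 0..7, with 16 rings
it reaches 15, and with the divisor set to the ring count every ring maps
to user priority 0.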
> -
> for (i = 0; i < priv->tx_ring_num[t]; i++) {
> /* Configure cq */
> + int user_prio;
> +
> cq = priv->tx_cq[t][i];
> err = mlx4_en_activate_cq(priv, cq, i);
> if (err) {
> @@ -1660,9 +1660,14 @@ int mlx4_en_start_port(struct net_device *dev)
>
> /* Configure ring */
> tx_ring = priv->tx_ring[t][i];
> + if (t != TX_XDP)
> + user_prio = i / priv->num_tx_rings_p_up;
> + else
> + user_prio = i & 0x07;
> +
> err = mlx4_en_activate_tx_ring(priv, tx_ring,
> cq->mcq.cqn,
> - i / num_tx_rings_p_up);
> + user_prio);
> if (err) {
> en_err(priv, "Failed allocating Tx ring\n");
> mlx4_en_deactivate_cq(priv, cq);
Regards,
Tariq Toukan.
^ permalink raw reply related
* Re: [PATCH net] virtio_net: reject XDP programs using header adjustment
From: Jakub Kicinski @ 2016-12-20 12:30 UTC (permalink / raw)
To: John Fastabend; +Cc: netdev, kafai, Daniel Borkmann, alexei.starovoitov, mst
In-Reply-To: <58581E01.7070902@gmail.com>
On Mon, Dec 19, 2016 at 5:50 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 16-12-19 07:05 AM, Jakub Kicinski wrote:
>> commit 17bedab27231 ("bpf: xdp: Allow head adjustment in XDP prog")
>> added a new XDP helper to prepend and remove data from a frame.
>> Make virtio_net reject programs making use of this helper until
>> proper support is added.
>>
>> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>> ---
>> drivers/net/virtio_net.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 08327e005ccc..db761f37783e 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1677,6 +1677,11 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>> u16 xdp_qp = 0, curr_qp;
>> int i, err;
>>
>> + if (prog && prog->xdp_adjust_head) {
>> + netdev_warn(dev, "Does not support bpf_xdp_adjust_head()\n");
>> + return -EOPNOTSUPP;
>> + }
>> +
>> if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
>> virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6)) {
>> netdev_warn(dev, "can't set XDP while host is implementing LRO, disable LRO first\n");
>>
>
> Acked-by: John Fastabend <john.r.fastabend@intel.com>
>
> Thanks patch looks good. Alternatively we could push a "bug fix" to
> support the adjust header feature depending on how DaveM and MST feel
> about that. I don't have a strong opinion but I have the patch on my
> queue it just needs some more testing.
Cool! I thought to ask you what your plans are but then this patch is so
trivial I decided to just post it :) I'm perfectly happy with dropping it
for now and reposting after ~rc5 if needed.
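
For readers outside the thread, the check in the patch boils down to the
pattern below. This is a userspace sketch only: struct sketch_prog and
sketch_xdp_set() are hypothetical stand-ins for struct bpf_prog and
virtnet_xdp_set(), and the assumption is that the flag is set whenever the
program calls bpf_xdp_adjust_head().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one struct bpf_prog field the patch reads. */
struct sketch_prog {
	bool xdp_adjust_head;	/* program uses bpf_xdp_adjust_head() */
};

/*
 * Returns 0 if the program may be attached, -EOPNOTSUPP otherwise.
 * A NULL program means "detach" and is always accepted.
 */
int sketch_xdp_set(const struct sketch_prog *prog)
{
	if (prog && prog->xdp_adjust_head)
		return -EOPNOTSUPP;
	return 0;
}
```

Rejecting at attach time means the datapath never has to deal with a
program whose head adjustment it cannot honour.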
^ permalink raw reply
* [PATCH] stmmac: enable rx queues
From: Joao Pinto @ 2016-12-20 12:55 UTC (permalink / raw)
To: peppe.cavallaro, davem
Cc: hock.leong.kweh, niklas.cassel, pavel, linux-kernel, netdev,
Joao Pinto
When the hardware is synthesized with multiple queues, all queues are
disabled by default. This patch adds the rx queues configuration.
This patch was successfully tested on a Synopsys QoS reference design.
Signed-off-by: Joao Pinto <jpinto@synopsys.com>
---
drivers/net/ethernet/stmicro/stmmac/common.h | 2 ++
drivers/net/ethernet/stmicro/stmmac/dwmac4.h | 4 ++++
drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c | 11 +++++++++++
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 21 +++++++++++++++++++++
4 files changed, 38 insertions(+)
diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index b13a144..61bab50 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -454,6 +454,8 @@ struct stmmac_ops {
void (*core_init)(struct mac_device_info *hw, int mtu);
/* Enable and verify that the IPC module is supported */
int (*rx_ipc)(struct mac_device_info *hw);
+ /* Enable RX Queues */
+ void (*rx_queue_enable)(struct mac_device_info *hw, u32 queue);
/* Dump MAC registers */
void (*dump_regs)(struct mac_device_info *hw);
/* Handle extra events on specific interrupts hw dependent */
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
index 3e8d4fe..fd013bd 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
@@ -22,6 +22,7 @@
#define GMAC_HASH_TAB_32_63 0x00000014
#define GMAC_RX_FLOW_CTRL 0x00000090
#define GMAC_QX_TX_FLOW_CTRL(x) (0x70 + x * 4)
+#define GMAC_RXQ_CTRL0 0x000000a0
#define GMAC_INT_STATUS 0x000000b0
#define GMAC_INT_EN 0x000000b4
#define GMAC_PCS_BASE 0x000000e0
@@ -44,6 +45,9 @@
#define GMAC_MAX_PERFECT_ADDRESSES 128
+/* MAC RX Queue Enable*/
+#define GMAC_RX_QUEUE_ENABLE(queue) BIT(queue * 2)
+
/* MAC Flow Control RX */
#define GMAC_RX_FLOW_CTRL_RFE BIT(0)
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
index eaed7cb..7ec1887 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
@@ -59,6 +59,16 @@ static void dwmac4_core_init(struct mac_device_info *hw, int mtu)
writel(value, ioaddr + GMAC_INT_EN);
}
+static void dwmac4_rx_queue_enable(struct mac_device_info *hw, u32 queue)
+{
+ void __iomem *ioaddr = hw->pcsr;
+ u32 value = readl(ioaddr + GMAC_RXQ_CTRL0);
+
+ value |= GMAC_RX_QUEUE_ENABLE(queue);
+
+ writel(value, ioaddr + GMAC_RXQ_CTRL0);
+}
+
static void dwmac4_dump_regs(struct mac_device_info *hw)
{
void __iomem *ioaddr = hw->pcsr;
@@ -392,6 +402,7 @@ static void dwmac4_debug(void __iomem *ioaddr, struct stmmac_extra_stats *x)
static const struct stmmac_ops dwmac4_ops = {
.core_init = dwmac4_core_init,
.rx_ipc = dwmac4_rx_ipc_enable,
+ .rx_queue_enable = dwmac4_rx_queue_enable,
.dump_regs = dwmac4_dump_regs,
.host_irq_status = dwmac4_irq_status,
.flow_ctrl = dwmac4_flow_ctrl,
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 3e40578..e30034d 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1271,6 +1271,24 @@ static void free_dma_desc_resources(struct stmmac_priv *priv)
}
/**
+ * stmmac_mac_enable_rx_queues - Enable MAC rx queues
+ * @priv: driver private structure
+ * Description: It is used for enabling the rx queues in the MAC
+ */
+static void stmmac_mac_enable_rx_queues(struct stmmac_priv *priv)
+{
+ int rx_count = priv->dma_cap.number_rx_channel;
+ int queue = 0;
+
+ /* If GMAC does not have multiqueues, then this is not necessary*/
+ if (rx_count == 1)
+ return;
+
+ for (queue = 0; queue < rx_count; queue++)
+ priv->hw->mac->rx_queue_enable(priv->hw, queue);
+}
+
+/**
* stmmac_dma_operation_mode - HW DMA operation mode
* @priv: driver private structure
* Description: it is used for configuring the DMA operation mode register in
@@ -1691,6 +1709,9 @@ static int stmmac_hw_setup(struct net_device *dev, bool init_ptp)
/* Initialize the MAC Core */
priv->hw->mac->core_init(priv->hw, dev->mtu);
+ /* Initialize MAC RX Queues */
+ stmmac_mac_enable_rx_queues(priv);
+
ret = priv->hw->mac->rx_ipc(priv->hw);
if (!ret) {
netdev_warn(priv->dev, "RX IPC Checksum Offload disabled\n");
--
2.9.3
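
The register update in the patch above can be tried out in userspace with
the sketch below. It keeps the bit layout of GMAC_RX_QUEUE_ENABLE() (bit
2 * queue) but replaces readl()/writel() with a plain variable;
SKETCH_RX_QUEUE_ENABLE and sketch_enable_rx_queues() are hypothetical names.
Unlike the driver function, the sketch does not skip the single-queue case.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit layout as GMAC_RX_QUEUE_ENABLE() in the patch: bit 2 * queue. */
#define SKETCH_RX_QUEUE_ENABLE(queue)	(1u << ((queue) * 2))

/*
 * Read-modify-write on a plain value standing in for GMAC_RXQ_CTRL0.
 * Bits that are already set in reg are preserved.
 */
uint32_t sketch_enable_rx_queues(uint32_t reg, unsigned int rx_count)
{
	for (unsigned int queue = 0; queue < rx_count; queue++)
		reg |= SKETCH_RX_QUEUE_ENABLE(queue);
	return reg;
}
```

For four queues starting from a cleared register this yields 0x55, i.e.
bits 0, 2, 4 and 6.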
^ permalink raw reply related
* [PATCH 0/2 v2] mm, slab: consolidate KMALLOC_MAX_SIZE
From: Michal Hocko @ 2016-12-20 13:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Cristopher Lameter, Alexei Starovoitov, Andrey Konovalov, netdev,
linux-mm, LKML
Hi,
this is the second version of the patchset previously posted here [1].
Alexei has insisted on reordering the patches, which I've done in this
series. I've also updated the changelog of the second patch to mention
why KMALLOC_SHIFT_MAX has been used.
Andrey has revealed a discrepancy between KMALLOC_MAX_SIZE and the
maximum supported page allocator size [2]. The underlying problem
should be fixed in the ep_write_iter code of course, but I do not feel
qualified to do that. The discrepancy which it reveals (see patch 2)
is worth fixing anyway, though.
While I was looking into the code, I noticed that the only code which
uses KMALLOC_SHIFT_MAX outside of the slab code is bpf, so I've updated
it to use KMALLOC_MAX_SIZE instead. There shouldn't be any real reason
to use KMALLOC_SHIFT_MAX, which is a slab-internal constant, the same as
KMALLOC_SHIFT_{LOW,HIGH}.
[1] http://lkml.kernel.org/r/20161215164722.21586-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/CAAeHK+ztusS68DejO8AH3nn-EfiYQpD5FmBwmqKG8BWvoqPNqQ@mail.gm
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* [PATCH 1/2] mm, slab: make sure that KMALLOC_MAX_SIZE will fit into MAX_ORDER
From: Michal Hocko @ 2016-12-20 13:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Cristopher Lameter, Alexei Starovoitov, Andrey Konovalov, netdev,
linux-mm, LKML, Michal Hocko
In-Reply-To: <20161220130659.16461-1-mhocko@kernel.org>
From: Michal Hocko <mhocko@suse.com>
Andrey Konovalov has reported the following warning triggered by
the syzkaller fuzzer.
WARNING: CPU: 1 PID: 9935 at mm/page_alloc.c:3511
__alloc_pages_nodemask+0x159c/0x1e20
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 9935 Comm: syz-executor0 Not tainted 4.9.0-rc7+ #34
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
ffff88006949f2c8 ffffffff81f96b8a ffffffff00000200 1ffff1000d293dec
ffffed000d293de4 0000000000000a06 0000000041b58ab3 ffffffff8598b510
ffffffff81f968f8 0000000041b58ab3 ffffffff85942a58 ffffffff81432860
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff81f96b8a>] dump_stack+0x292/0x398 lib/dump_stack.c:51
[<ffffffff8168c88e>] panic+0x1cb/0x3a9 kernel/panic.c:179
[<ffffffff812b80b4>] __warn+0x1c4/0x1e0 kernel/panic.c:542
[<ffffffff812b831c>] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
[< inline >] __alloc_pages_slowpath mm/page_alloc.c:3511
[<ffffffff816c08ac>] __alloc_pages_nodemask+0x159c/0x1e20 mm/page_alloc.c:3781
[<ffffffff817cde17>] alloc_pages_current+0x1c7/0x6b0 mm/mempolicy.c:2072
[< inline >] alloc_pages include/linux/gfp.h:469
[<ffffffff8172fd8f>] kmalloc_order+0x1f/0x70 mm/slab_common.c:1015
[<ffffffff8172fdff>] kmalloc_order_trace+0x1f/0x160 mm/slab_common.c:1026
[< inline >] kmalloc_large include/linux/slab.h:422
[<ffffffff817e01f0>] __kmalloc+0x210/0x2d0 mm/slub.c:3723
[< inline >] kmalloc include/linux/slab.h:495
[<ffffffff832262a7>] ep_write_iter+0x167/0xb50 drivers/usb/gadget/legacy/inode.c:664
[< inline >] new_sync_write fs/read_write.c:499
[<ffffffff817fdcd3>] __vfs_write+0x483/0x760 fs/read_write.c:512
[<ffffffff817ff720>] vfs_write+0x170/0x4e0 fs/read_write.c:560
[< inline >] SYSC_write fs/read_write.c:607
[<ffffffff81803b2b>] SyS_write+0xfb/0x230 fs/read_write.c:599
[<ffffffff84f47ec1>] entry_SYSCALL_64_fastpath+0x1f/0xc2
The issue is caused by a missing check of the request size in
ep_write_iter, which should be fixed. It, however, points to another
problem: SLUB defines KMALLOC_MAX_SIZE too large because its
KMALLOC_SHIFT_MAX is (MAX_ORDER + PAGE_SHIFT), which means that the
resulting page allocator request might be of order MAX_ORDER, which is too
large (see __alloc_pages_slowpath). The same applies to the SLOB allocator,
which allows even larger sizes. Make sure that they are capped properly
and never request an order of MAX_ORDER or higher.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
---
include/linux/slab.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 084b12bad198..4c5363566815 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -226,7 +226,7 @@ static inline const char *__check_heap_object(const void *ptr,
* (PAGE_SIZE*2). Larger requests are passed to the page allocator.
*/
#define KMALLOC_SHIFT_HIGH (PAGE_SHIFT + 1)
-#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT)
+#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT - 1)
#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW 3
#endif
@@ -239,7 +239,7 @@ static inline const char *__check_heap_object(const void *ptr,
* be allocated from the same page.
*/
#define KMALLOC_SHIFT_HIGH PAGE_SHIFT
-#define KMALLOC_SHIFT_MAX 30
+#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT - 1)
#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW 3
#endif
--
2.10.2
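
The off-by-one described in the changelog can be checked with a few lines
of arithmetic. The sketch below assumes PAGE_SHIFT = 12 and MAX_ORDER = 11
(typical x86-64 values; both are configuration dependent) and uses a helper
in the spirit of the kernel's get_order(). All sketch_* names are
hypothetical.

```c
#include <assert.h>

/* Assumed typical x86-64 values; both depend on the kernel configuration. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_MAX_ORDER	11

/* Smallest order such that 2^order pages cover size bytes. */
unsigned int sketch_order(unsigned long size)
{
	unsigned long pages =
		(size + (1UL << SKETCH_PAGE_SHIFT) - 1) >> SKETCH_PAGE_SHIFT;
	unsigned int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

/* The page allocator only serves orders 0..MAX_ORDER-1. */
int sketch_order_is_valid(unsigned int order)
{
	return order < SKETCH_MAX_ORDER;
}

/* Order of the largest kmalloc() request for a given KMALLOC_SHIFT_MAX. */
unsigned int sketch_max_kmalloc_order(unsigned int kmalloc_shift_max)
{
	return sketch_order(1UL << kmalloc_shift_max);
}
```

Under these assumptions the old definition (MAX_ORDER + PAGE_SHIFT = 23)
allows an 8 MiB request of order 11, which the page allocator refuses, while
the new one (22) caps requests at 4 MiB and order 10.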
^ permalink raw reply related
* [PATCH 2/2] bpf: do not use KMALLOC_SHIFT_MAX
From: Michal Hocko @ 2016-12-20 13:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Cristopher Lameter, Alexei Starovoitov, Andrey Konovalov, netdev,
linux-mm, LKML, Michal Hocko
In-Reply-To: <20161220130659.16461-1-mhocko@kernel.org>
From: Michal Hocko <mhocko@suse.com>
01b3f52157ff ("bpf: fix allocation warnings in bpf maps and integer
overflow") has added checks for the maximum allocatable size. It
(ab)used KMALLOC_SHIFT_MAX for that purpose. While this is not incorrect,
it is not very clean because we already have KMALLOC_MAX_SIZE for this
very reason, so let's change both checks to use KMALLOC_MAX_SIZE instead.
The original motivation for using KMALLOC_SHIFT_MAX was to work around
an incorrect KMALLOC_MAX_SIZE which could lead to allocation warnings
but it is no longer needed since "slab: make sure that KMALLOC_MAX_SIZE
will fit into MAX_ORDER".
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
---
kernel/bpf/arraymap.c | 2 +-
kernel/bpf/hashtab.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index a2ac051c342f..229a5d5df977 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -56,7 +56,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
attr->value_size == 0 || attr->map_flags)
return ERR_PTR(-EINVAL);
- if (attr->value_size >= 1 << (KMALLOC_SHIFT_MAX - 1))
+ if (attr->value_size > KMALLOC_MAX_SIZE)
/* if value_size is bigger, the user space won't be able to
* access the elements.
*/
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index ad1bc67aff1b..c5ec7dc71c84 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -181,7 +181,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
*/
goto free_htab;
- if (htab->map.value_size >= (1 << (KMALLOC_SHIFT_MAX - 1)) -
+ if (htab->map.value_size >= KMALLOC_MAX_SIZE -
MAX_BPF_STACK - sizeof(struct htab_elem))
/* if value_size is bigger, the user space won't be able to
* access the elements via bpf syscall. This check also makes
--
2.10.2
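
To compare the old and new array-map limits, the sketch below evaluates
both conditions. It assumes SLUB's KMALLOC_MAX_SIZE is
1UL << KMALLOC_SHIFT_MAX, and that the old check ran with the pre-series
shift (23 for MAX_ORDER = 11, PAGE_SHIFT = 12) while the new check runs with
the corrected shift (22). The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Old check: reject value_size >= 1 << (KMALLOC_SHIFT_MAX - 1). */
bool sketch_old_rejects(unsigned long value_size, unsigned int shift_max)
{
	return value_size >= (1UL << (shift_max - 1));
}

/* New check: reject value_size > KMALLOC_MAX_SIZE (1 << KMALLOC_SHIFT_MAX). */
bool sketch_new_rejects(unsigned long value_size, unsigned int shift_max)
{
	return value_size > (1UL << shift_max);
}
```

Under these assumptions both checks draw the line at 4 MiB and differ only
for a value_size of exactly 4 MiB, which the old check rejected and the new
one accepts.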
^ permalink raw reply related
* [RFC PATCH 0/4] page_pool proof-of-concept early code
From: Jesper Dangaard Brouer @ 2016-12-20 13:28 UTC (permalink / raw)
To: linux-mm, Alexander Duyck
Cc: willemdebruijn.kernel, netdev, john.fastabend, Saeed Mahameed,
Jesper Dangaard Brouer, bjorn.topel, Alexei Starovoitov,
Tariq Toukan
This is an RFC patchset of my *work-in-progress* page_pool implementation.
This is NOT ready for inclusion. People asked to see the code, so here we go.
This patchset is focused on providing a generic replacement for the
driver page recycle caches, with mlx5 as the first user in patch-3.
Notice that patch-2 is more "MM-invasive" (modifies put_page) than
patch-4 which is less MM-aggressive (scaled back based on input from
Mel Gorman).
I do know that all page-flags are used (for 32bit), thus I'm open to
suggestions/ideas on how to work around this (need some way to identify
that a page belongs to a page pool).
This patchset is the bare-minimum PoC that allows me to benchmark
these ideas and see if performance is going in the right direction.
It is not safe, e.g. unloading the driver can crash the kernel.
---
Jesper Dangaard Brouer (4):
doc: page_pool introduction documentation
page_pool: basic implementation of page_pool
mlx5: use page_pool
page_pool: change refcnt model
Documentation/vm/page_pool/introduction.rst | 71 ++++
drivers/net/ethernet/mellanox/mlx5/core/en.h | 1
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 28 +
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 47 ++
include/linux/mm.h | 1
include/linux/mm_types.h | 11 +
include/linux/page-flags.h | 13 +
include/linux/page_pool.h | 168 +++++++++
include/linux/skbuff.h | 2
include/trace/events/mmflags.h | 3
mm/Makefile | 3
mm/page_alloc.c | 6
mm/page_pool.c | 402 +++++++++++++++++++++
mm/slub.c | 4
mm/swap.c | 3
15 files changed, 741 insertions(+), 22 deletions(-)
create mode 100644 Documentation/vm/page_pool/introduction.rst
create mode 100644 include/linux/page_pool.h
create mode 100644 mm/page_pool.c
--
^ permalink raw reply
* [RFC PATCH 1/4] doc: page_pool introduction documentation
From: Jesper Dangaard Brouer @ 2016-12-20 13:28 UTC (permalink / raw)
To: linux-mm, Alexander Duyck
Cc: willemdebruijn.kernel, netdev, john.fastabend, Saeed Mahameed,
Jesper Dangaard Brouer, bjorn.topel, Alexei Starovoitov,
Tariq Toukan
In-Reply-To: <20161220132444.18788.50875.stgit@firesoul>
Copied from:
https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/introduction.html
~/git/prototype-kernel/kernel/Documentation/vm/page_pool/introduction.rst
This will be updated from the above links before upstream submission.
Also, this needs to be "linked" into the new kernel doc system.
---
Documentation/vm/page_pool/introduction.rst | 71 +++++++++++++++++++++++++++
1 file changed, 71 insertions(+)
create mode 100644 Documentation/vm/page_pool/introduction.rst
diff --git a/Documentation/vm/page_pool/introduction.rst b/Documentation/vm/page_pool/introduction.rst
new file mode 100644
index 000000000000..db03b02f218c
--- /dev/null
+++ b/Documentation/vm/page_pool/introduction.rst
@@ -0,0 +1,71 @@
+============
+Introduction
+============
+
+The page_pool is a generic API for drivers that have a need for a pool
+of recycling pages used for streaming DMA.
+
+
+Motivation
+==========
+
+The page_pool is primarily motivated by two things (1) performance
+and (2) changing the memory model for drivers.
+
+Drivers have developed performance workarounds when the speed of the
+page allocator and the DMA APIs became too slow for their HW
+needs. The page pool solves them on a general level providing
+performance gains and benefits that local driver recycling hacks
+cannot realize.
+
+A fundamental property is that pages are returned to the page_pool.
+This property allows a certain class of optimizations, which is to move
+setup and tear-down operations out of the fast-path, sometimes known as
+constructor/destruction operations. DMA map/unmap is one example of
+operations this applies to. Certain page alloc/free validations can
+also be avoided in the fast-path. Another example could be
+pre-mapping pages into userspace, and clearing them (memset-zero)
+outside the fast-path.
+
+Memory model
+============
+
+Once drivers are converted to using page_pool API, then it will become
+easier to change the underlying memory model backing the driver with
+pages (without changing the driver).
+
+One prime use-case is NIC zero-copy RX into userspace. As DaveM
+describes in his `Google-plus post`_, the mapping and unmapping
+operations in the address space of the process has a cost that cancels
+out most of the gains of such zero-copy schemes.
+
+This mapping cost can be solved the same way as the keeping DMA mapped
+trick, by keeping the pages VM-mapped to userspace. This is a layer
+that can be added later to the page_pool. It will likely be
+beneficial to also consider using huge-pages (as backing) to reduce
+the TLB-stress.
+
+.. _Google-plus post:
+ https://plus.google.com/+DavidMiller/posts/EUDiGoXD6Xv
+
+Advantages
+==========
+
+Advantages of a recycling page pool as bullet points:
+
+1) Faster than going through page-allocator, given a specialized
+ allocator requires fewer checks, and can piggyback on drivers
+ resource protection (for alloc-side).
+
+2) DMA IOMMU mapping cost is removed by keeping pages mapped.
+
+3) Makes DMA pages writable by predictable DMA unmap point.
+
+4) OOM protection at device level, as having a feedback-loop knows
+ number of outstanding pages.
+
+5) Flexible memory model allowing zero-copy RX, solving memory early
+ demux (does depend on HW filters into RX queues)
+
+6) Less fragmentation of the page buddy algorithm, when driver
+ maintains a steady-state working-set.
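
The central idea of the introduction, paying the setup cost once and keeping
it across recycles, can be sketched in userspace as follows. Everything here
is hypothetical (struct sketch_page, struct sketch_pool, the "mapped" flag
standing in for a DMA mapping); it is not the API proposed in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_POOL_SIZE 4

/* Hypothetical page stand-in; "mapped" models an established DMA mapping. */
struct sketch_page {
	bool mapped;
};

struct sketch_pool {
	struct sketch_page pages[SKETCH_POOL_SIZE];
	struct sketch_page *free[SKETCH_POOL_SIZE];	/* LIFO of recycled pages */
	size_t nr_free;
	size_t nr_fresh;	/* pages never handed out yet */
	unsigned int map_calls;	/* how often the expensive setup ran */
};

void sketch_pool_init(struct sketch_pool *pool)
{
	pool->nr_free = 0;
	pool->nr_fresh = SKETCH_POOL_SIZE;
	pool->map_calls = 0;
	for (size_t i = 0; i < SKETCH_POOL_SIZE; i++)
		pool->pages[i].mapped = false;
}

/* Fast path reuses a recycled, still-mapped page; slow path maps a new one. */
struct sketch_page *sketch_pool_alloc(struct sketch_pool *pool)
{
	struct sketch_page *page;

	if (pool->nr_free > 0)
		return pool->free[--pool->nr_free];
	if (pool->nr_fresh == 0)
		return NULL;	/* pool exhausted */
	page = &pool->pages[SKETCH_POOL_SIZE - pool->nr_fresh];
	pool->nr_fresh--;
	page->mapped = true;	/* the "constructor" step, paid once per page */
	pool->map_calls++;
	return page;
}

/* Returning a page keeps its mapping, so no teardown happens here. */
void sketch_pool_put(struct sketch_pool *pool, struct sketch_page *page)
{
	pool->free[pool->nr_free++] = page;
}
```

Allocating and returning one page in a loop runs the setup step exactly
once, which is the effect advantage 2) above refers to.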
^ permalink raw reply related
* [RFC PATCH 2/4] page_pool: basic implementation of page_pool
From: Jesper Dangaard Brouer @ 2016-12-20 13:28 UTC (permalink / raw)
To: linux-mm, Alexander Duyck
Cc: willemdebruijn.kernel, netdev, john.fastabend, Saeed Mahameed,
Jesper Dangaard Brouer, bjorn.topel, Alexei Starovoitov,
Tariq Toukan
In-Reply-To: <20161220132444.18788.50875.stgit@firesoul>
The focus in this patch is getting the API around page_pool figured out.
The internal data structures for returning page_pool pages are not optimal.
This implementation uses ptr_ring for recycling, which is known not to scale
in case of multiple remote CPUs releasing/returning pages.
A bulking interface into the page allocator is also left for later. (This
requires cooperation with Mel Gorman, who just sent me some PoC patches for this).
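
The header added by this patch describes page_pool_params as an extensible
struct whose first member carries the caller's struct size, with later
members defaulting to zero. One common way such a size field is consumed is
sketched below. This is an assumption for illustration only: the structs
and sketch_params_import() are hypothetical, and the patch's own
page_pool_create() may handle this differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical older layout that an existing caller was compiled against. */
struct sketch_params_v1 {
	unsigned int size;	/* caller sets size of the struct it knows */
	unsigned int order;
};

/* Hypothetical newer layout: one member appended at the end. */
struct sketch_params {
	unsigned int size;
	unsigned int order;
	unsigned int pool_size;	/* new member; zero means "old behaviour" */
};

/*
 * Copy only what the caller provided and leave the rest zeroed, so that
 * members the caller does not know about keep their zero default.
 * Returns 0 on success, -1 if the advertised size is implausible.
 */
int sketch_params_import(struct sketch_params *dst, const void *src)
{
	unsigned int size;

	memcpy(&size, src, sizeof(size));	/* size is the first member */
	if (size < sizeof(size) || size > sizeof(*dst))
		return -1;
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, size);
	return 0;
}
```

An old caller passing the shorter struct then ends up with pool_size == 0,
matching the rule that a zero value must give the previous behaviour.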
---
include/linux/mm.h | 6 +
include/linux/mm_types.h | 11 +
include/linux/page-flags.h | 13 +
include/linux/page_pool.h | 158 +++++++++++++++
include/linux/skbuff.h | 2
include/trace/events/mmflags.h | 3
mm/Makefile | 3
mm/page_alloc.c | 10 +
mm/page_pool.c | 423 ++++++++++++++++++++++++++++++++++++++++
mm/slub.c | 4
10 files changed, 627 insertions(+), 6 deletions(-)
create mode 100644 include/linux/page_pool.h
create mode 100644 mm/page_pool.c
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4424784ac374..11b4d8fb280b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -23,6 +23,7 @@
#include <linux/page_ext.h>
#include <linux/err.h>
#include <linux/page_ref.h>
+#include <linux/page_pool.h>
struct mempolicy;
struct anon_vma;
@@ -765,6 +766,11 @@ static inline void put_page(struct page *page)
{
page = compound_head(page);
+ if (PagePool(page)) {
+ page_pool_put_page(page);
+ return;
+ }
+
if (put_page_testzero(page))
__put_page(page);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 08d947fc4c59..c74dea967f99 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -47,6 +47,12 @@ struct page {
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
union {
+ /* DISCUSS: Considered moving page_pool pointer here,
+ * but I'm unsure if 'mapping' is needed for userspace
+ * mapping the page, as this is a use-case the
+ * page_pool need to support in the future. (Basically
+ * mapping a NIC RX ring into userspace).
+ */
struct address_space *mapping; /* If low bit clear, points to
* inode address_space, or NULL.
* If page mapped as anonymous
@@ -63,6 +69,7 @@ struct page {
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* sl[aou]b first free object */
+ dma_addr_t dma_addr; /* used by page_pool */
/* page_deferred_list().prev -- second tail page */
};
@@ -117,6 +124,8 @@ struct page {
* avoid collision and false-positive PageTail().
*/
union {
+ /* XXX: Idea reuse lru list, in page_pool to align with PCP */
+
struct list_head lru; /* Pageout list, eg. active_list
* protected by zone_lru_lock !
* Can be used as a generic list
@@ -189,6 +198,8 @@ struct page {
#endif
#endif
struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */
+ /* XXX: Sure page_pool will have no users of "private"? */
+ struct page_pool *pool;
};
#ifdef CONFIG_MEMCG
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda91238..253d7f7cf89f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,7 +91,8 @@ enum pageflags {
PG_mappedtodisk, /* Has blocks allocated on-disk */
PG_reclaim, /* To be reclaimed asap */
PG_swapbacked, /* Page is backed by RAM/swap */
- PG_unevictable, /* Page is "unevictable" */
+/*20*/ PG_unevictable, /* Page is "unevictable" */
+// XXX stable flag?
#ifdef CONFIG_MMU
PG_mlocked, /* Page is vma mlocked */
#endif
@@ -101,6 +102,8 @@ enum pageflags {
#ifdef CONFIG_MEMORY_FAILURE
PG_hwpoison, /* hardware poisoned page. Don't touch */
#endif
+ /* Question: can we squeeze in here and avoid CONFIG_64BIT hacks?*/
+ PG_pool, // XXX macros called: SetPagePool / PagePool
#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
PG_young,
PG_idle,
@@ -347,6 +350,12 @@ PAGEFLAG_FALSE(HWPoison)
#define __PG_HWPOISON 0
#endif
+// XXX: Define some macros for page_pool
+// XXX: avoiding atomic set_bit() operation (like slab)
+// XXX: PF_HEAD vs PF_ANY vs PF_NO_TAIL????
+__PAGEFLAG(Pool, pool, PF_ANY)
+
+
#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
TESTPAGEFLAG(Young, young, PF_ANY)
SETPAGEFLAG(Young, young, PF_ANY)
@@ -700,7 +709,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
/*
* Flags checked when a page is freed. Pages being freed should not have
* these flags set. It they are, there is a problem.
- */
+ */ /* XXX add PG_pool here??? */
#define PAGE_FLAGS_CHECK_AT_FREE \
(1UL << PG_lru | 1UL << PG_locked | \
1UL << PG_private | 1UL << PG_private_2 | \
diff --git a/include/linux/page_pool.h b/include/linux/page_pool.h
new file mode 100644
index 000000000000..6f8f2ff6d758
--- /dev/null
+++ b/include/linux/page_pool.h
@@ -0,0 +1,158 @@
+/*
+ * page_pool.h
+ *
+ * Author: Jesper Dangaard Brouer <netoptimizer@brouer.com>
+ * Copyright (C) 2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or (at your
+ * option) any later version.
+ *
+ * The page_pool is primarily motivated by two things (1) performance
+ * and (2) changing the memory model for drivers.
+ *
+ * Drivers have developed performance workarounds when the speed of
+ * the page allocator and the DMA APIs became too slow for their HW
+ * needs. The page pool solves them on a general level providing
+ * performance gains and benefits that local driver recycling hacks
+ * cannot realize.
+ *
+ * A fundamental property is that pages are returned to the page_pool.
+ * This property allows a certain class of optimizations, which is to
+ * move setup and tear-down operations out of the fast-path, sometimes
+ * known as constructor/destruction operations. DMA map/unmap is one
+ * example of operations this applies to. Certain page alloc/free
+ * validations can also be avoided in the fast-path. Another example
+ * could be pre-mapping pages into userspace, and clearing them
+ * (memset-zero) outside the fast-path.
+ *
+ * This API is only meant for streaming DMA, which map/unmap frequently.
+ */
+#ifndef _LINUX_PAGE_POOL_H
+#define _LINUX_PAGE_POOL_H
+
+/*
+ * NOTES on page flags (PG_pool)... we might have a problem with
+ * enough page flags on 32 bit systems, example see PG_idle + PG_young
+ * include/linux/page_idle.h and CONFIG_IDLE_PAGE_TRACKING
+ */
+
+#include <linux/ptr_ring.h>
+
+//#include <linux/dma-mapping.h>
+#include <linux/dma-direction.h>
+
+// Not-used-atm #define PP_FLAG_NAPI 0x1
+#define PP_FLAG_ALL 0
+
+/*
+ * Fast allocation side cache array/stack
+ *
+ * The cache size and refill watermark is related to the network
+ * use-case. The NAPI budget is 64 packets. After a NAPI poll the RX
+ * ring is usually refilled and the max consumed elements will be 64,
+ * thus a natural max size of objects needed in the cache.
+ *
+ * Keeping room for more objects, is due to XDP_DROP use-case. As
+ * XDP_DROP allows the opportunity to recycle objects directly into
+ * this array, as it shares the same softirq/NAPI protection. If
+ * cache is already full (or partly full) then the XDP_DROP recycles
+ * would have to take a slower code path.
+ */
+#define PP_ALLOC_CACHE_SIZE 128
+#define PP_ALLOC_CACHE_REFILL 64
+struct pp_alloc_cache {
+ u32 count ____cacheline_aligned_in_smp;
+ u32 refill; /* not used atm */
+ void *cache[PP_ALLOC_CACHE_SIZE];
+};
+
+/*
+ * Extensible params struct. Focus on currently implemented features,
+ * extend later. Restriction: for subsequently added members, a value of
+ * zero must give the previous behaviour. This avoids the need to update
+ * every driver simultaneously (given they are likely in different subsystems).
+ */
+struct page_pool_params {
+ u32 size; /* caller sets size of struct */
+ unsigned int order;
+ unsigned long flags;
+ /* Associated with a specific device, for DMA pre-mapping purposes */
+ struct device *dev;
+ /* NUMA node id to allocate pages from */
+ int nid;
+ enum dma_data_direction dma_dir; /* DMA mapping direction */
+ unsigned int pool_size;
+ char end_marker[0]; /* must be last struct member */
+};
+#define PAGE_POOL_PARAMS_SIZE offsetof(struct page_pool_params, end_marker)
+
+struct page_pool {
+ struct page_pool_params p;
+
+ /*
+ * Data structure for allocation side
+ *
+ * Drivers allocation side usually already perform some kind
+ * of resource protection. Piggyback on this protection, and
+ * require driver to protect allocation side.
+ *
+ * For NIC drivers this means, allocate a page_pool per
+ * RX-queue. As the RX-queue is already protected by
+ * Softirq/BH scheduling and napi_schedule. NAPI schedule
+ * guarantee that a single napi_struct will only be scheduled
+ * on a single CPU (see napi_schedule).
+ */
+ struct pp_alloc_cache alloc;
+
+ /* Data structure for storing recycled pages.
+ *
+ * Returning/freeing pages is more complicated synchronization
+ * wise, because free's can happen on remote CPUs, with no
+ * association with allocation resource.
+ *
+ * For now use ptr_ring, as it separates consumer and
+ * producer, which is a common use-case. The ptr_ring is not
+ * thought of as the final data structure, expecting this to
+ * change into a more advanced data structure with more
+ * integration with page_alloc.c and data structs per CPU for
+ * returning pages in bulk.
+ *
+ */
+ struct ptr_ring ring;
+
+ /* TODO: Domain "id" add later, for RX zero-copy validation */
+
+ /* TODO: Need list pointers for keeping page_pool object on a
+ * cleanup list, given pages can be "outstanding" even after
+ * e.g. driver is unloaded.
+ */
+};
+
+struct page* page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
+
+static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
+{
+ gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN | __GFP_COLD);
+ return page_pool_alloc_pages(pool, gfp);
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params);
+
+void page_pool_destroy(struct page_pool *pool);
+
+/* Never call this directly, use helpers below */
+void __page_pool_put_page(struct page *page, bool allow_direct);
+
+static inline void page_pool_put_page(struct page *page)
+{
+ __page_pool_put_page(page, false);
+}
+/* Very limited use-cases allow recycle direct */
+static inline void page_pool_recycle_direct(struct page *page)
+{
+ __page_pool_put_page(page, true);
+}
+
+#endif /* _LINUX_PAGE_POOL_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ac7fa34db8a7..84294278039d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2584,7 +2584,7 @@ static inline void __skb_frag_ref(skb_frag_t *frag)
* @f: the fragment offset.
*
* Takes an additional reference on the @f'th paged fragment of @skb.
- */
+ */ // XXX
static inline void skb_frag_ref(struct sk_buff *skb, int f)
{
__skb_frag_ref(&skb_shinfo(skb)->frags[f]);
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 5a81ab48a2fb..ee15ca659ea1 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -99,7 +99,8 @@
{1UL << PG_mappedtodisk, "mappedtodisk" }, \
{1UL << PG_reclaim, "reclaim" }, \
{1UL << PG_swapbacked, "swapbacked" }, \
- {1UL << PG_unevictable, "unevictable" } \
+ {1UL << PG_unevictable, "unevictable" }, \
+ {1UL << PG_pool, "pool" } \
IF_HAVE_PG_MLOCK(PG_mlocked, "mlocked" ) \
IF_HAVE_PG_UNCACHED(PG_uncached, "uncached" ) \
IF_HAVE_PG_HWPOISON(PG_hwpoison, "hwpoison" ) \
diff --git a/mm/Makefile b/mm/Makefile
index 295bd7a9f76b..dbe5a7181e28 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -100,3 +100,6 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
+
+# Hack enable for compile testing
+obj-y += page_pool.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c6d5f64feca..655db05f0c1c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3873,6 +3873,11 @@ EXPORT_SYMBOL(get_zeroed_page);
void __free_pages(struct page *page, unsigned int order)
{
+ if (PagePool(page)) {
+ page_pool_put_page(page);
+ return;
+ }
+
if (put_page_testzero(page)) {
if (order == 0)
free_hot_cold_page(page, false);
@@ -4000,6 +4005,11 @@ void __free_page_frag(void *addr)
{
struct page *page = virt_to_head_page(addr);
+ if (PagePool(page)) {
+ page_pool_put_page(page);
+ return;
+ }
+
if (unlikely(put_page_testzero(page)))
__free_pages_ok(page, compound_order(page));
}
diff --git a/mm/page_pool.c b/mm/page_pool.c
new file mode 100644
index 000000000000..74138d5fe86d
--- /dev/null
+++ b/mm/page_pool.c
@@ -0,0 +1,423 @@
+/*
+ * page_pool.c
+ */
+
+/* Using the page pool from a driver involves:
+ *
+ * 1. Creating/allocating a page_pool per RX ring for the NIC
+ * 2. Using pages from page_pool to populate RX ring
+ * 3. Page pool will call dma_map/unmap
+ * 4. Driver is responsible for dma_sync part
+ * 5. On page put/free the page is returned to the page_pool
+ *
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+#include <linux/page_pool.h>
+#include <linux/dma-direction.h>
+#include <linux/dma-mapping.h>
+#include <linux/page-flags.h>
+#include <linux/mm.h> /* for __put_page() */
+
+/*
+ * The struct page_pool (likely) cannot be embedded into another
+ * structure, because freeing this struct depend on outstanding pages,
+ * which can point back to the page_pool. Thus, don't export "init".
+ */
+int page_pool_init(struct page_pool *pool,
+ const struct page_pool_params *params)
+{
+ int ring_qsize = 1024; /* Default */
+ int param_copy_sz;
+
+ if (!pool)
+ return -EFAULT;
+
+ /* Allow kernel devel trees and driver to progress at different rates */
+ param_copy_sz = PAGE_POOL_PARAMS_SIZE;
+ memset(&pool->p, 0, param_copy_sz);
+ if (params->size < param_copy_sz) {
+ /*
+ * Older module calling newer kernel, handled by only
+ * copying supplied size, and keep remaining params zero
+ */
+ param_copy_sz = params->size;
+ } else if (params->size > param_copy_sz) {
+ /*
+ * Newer module calling older kernel. Need to validate
+ * no new features were requested.
+ */
+ unsigned char *addr = (unsigned char*)params + param_copy_sz;
+ unsigned char *end = (unsigned char*)params + params->size;
+
+ for (; addr < end; addr++) {
+ if (*addr != 0)
+ return -E2BIG;
+ }
+ }
+ memcpy(&pool->p, params, param_copy_sz);
+
+ /* Validate only known flags were used */
+ if (pool->p.flags & ~(PP_FLAG_ALL))
+ return -EINVAL;
+
+ if (pool->p.pool_size)
+ ring_qsize = pool->p.pool_size;
+
+ /* ptr_ring is not meant as final struct, see page_pool.h */
+ if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0) {
+ return -ENOMEM;
+ }
+
+ /*
+ * DMA direction is either DMA_FROM_DEVICE or DMA_BIDIRECTIONAL.
+ * DMA_BIDIRECTIONAL allows the page to be used for DMA sending,
+ * which is the XDP_TX use-case.
+ */
+ if ((pool->p.dma_dir != DMA_FROM_DEVICE) &&
+ (pool->p.dma_dir != DMA_BIDIRECTIONAL))
+ return -EINVAL;
+
+ return 0;
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params)
+{
+ struct page_pool *pool;
+ int err = 0;
+
+ if (params->size < offsetof(struct page_pool_params, nid)) {
+ WARN(1, "Fix page_pool_params->size code\n");
+ return NULL;
+ }
+
+ pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, params->nid);
+ err = page_pool_init(pool, params);
+ if (err < 0) {
+ pr_warn("%s() gave up with errno %d\n", __func__, err);
+ kfree(pool);
+ return ERR_PTR(err);
+ }
+ return pool;
+}
+EXPORT_SYMBOL(page_pool_create);
+
+/* fast path */
+static struct page *__page_pool_get_cached(struct page_pool *pool)
+{
+ struct page *page;
+
+ /* FIXME: use another test for safe-context, caller should
+ * simply provide this guarantee
+ */
+ if (likely(in_serving_softirq())) { // FIXME add use of PP_FLAG_NAPI
+ struct ptr_ring *r;
+
+ if (likely(pool->alloc.count)) {
+ /* Fast-path */
+ page = pool->alloc.cache[--pool->alloc.count];
+ return page;
+ }
+ /* Slower-path: Alloc array empty, time to refill */
+ r = &pool->ring;
+ /* Open-coded bulk ptr_ring consumer.
+ *
+ * Discussion: ATM the ring consumer lock is not
+ * really needed due to the softirq/NAPI protection,
+ * but later MM-layer need the ability to reclaim
+ * pages on the ring. Thus, keeping the locks.
+ */
+ spin_lock(&r->consumer_lock);
+ while ((page = __ptr_ring_consume(r))) {
+ if (pool->alloc.count == PP_ALLOC_CACHE_REFILL)
+ break;
+ pool->alloc.cache[pool->alloc.count++] = page;
+ }
+ spin_unlock(&r->consumer_lock);
+ return page;
+ }
+
+ /* Slow-path: Get page from locked ring queue */
+ page = ptr_ring_consume(&pool->ring);
+ return page;
+}
+
+/* slow path */
+noinline
+static struct page *__page_pool_alloc_pages(struct page_pool *pool,
+ gfp_t _gfp)
+{
+ struct page *page;
+ gfp_t gfp = _gfp;
+ dma_addr_t dma;
+
+ /* We could always set __GFP_COMP, and avoid this branch, as
+ * prep_new_page() can handle order-0 with __GFP_COMP.
+ */
+ if (pool->p.order)
+ gfp |= __GFP_COMP;
+ /*
+ * Discuss GFP flags: e.g
+ * __GFP_NOWARN + __GFP_NORETRY + __GFP_NOMEMALLOC
+ */
+
+ /*
+ * FUTURE development:
+ *
+ * Current slow-path essentially falls back to single page
+ * allocations, which doesn't improve performance. This code
+ * needs bulk allocation support from the page allocator code.
+ *
+ * For now, page pool recycle cache is not refilled. Hint:
+ * when pages are returned, they will go into the recycle
+ * cache.
+ */
+
+ /* Cache was empty, do real allocation */
+ page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
+ if (!page)
+ return NULL;
+
+ /* FIXME: Add accounting of pages.
+ *
+ * TODO: Look into memcg_charge_slab/memcg_uncharge_slab
+ *
+ * What if page comes from pfmemalloc reserves?
+ * Should we abort to help memory pressure? (test err code path!)
+ * Code see SetPageSlabPfmemalloc(), __ClearPageSlabPfmemalloc()
+ * and page_is_pfmemalloc(page)
+ */
+
+ /* Setup DMA mapping:
+ * This mapping is kept for lifetime of page, until leaving pool.
+ */
+ dma = dma_map_page(pool->p.dev, page, 0,
+ (PAGE_SIZE << pool->p.order),
+ pool->p.dma_dir);
+ if (dma_mapping_error(pool->p.dev, dma)) {
+ put_page(page);
+ return NULL;
+ }
+ page->dma_addr = dma;
+
+ /* IDEA: When a page is just alloc'ed it should/must have refcnt 1.
+ * Should we do refcnt inc tricks to keep page mapped/owned by
+ * page_pool infrastructure? (like page_frag code)
+ */
+
+ /* TODO: Init fields in struct page. See slub code allocate_slab()
+ *
+ */
+ page->pool = pool; /* Save pool the page MUST be returned to */
+ __SetPagePool(page); /* Mark page with flag */
+
+ return page;
+}
+
+
+/* For using page_pool replace: alloc_pages() API calls, but provide
+ * synchronization guarantee for allocation side.
+ */
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
+{
+ struct page *page;
+
+ /* Fast-path: Get a page from cache */
+ page = __page_pool_get_cached(pool);
+ if (page)
+ return page;
+
+ /* Slow-path: cache empty, do real allocation */
+ page = __page_pool_alloc_pages(pool, gfp);
+ return page;
+}
+EXPORT_SYMBOL(page_pool_alloc_pages);
+
+/* Cleanup page_pool state from page */
+// Ideas taken from __free_slab()
+static void __page_pool_clean_page(struct page *page)
+{
+ struct page_pool *pool;
+
+ VM_BUG_ON_PAGE(!PagePool(page), page);
+
+ // mod_zone_page_state() ???
+
+ pool = page->pool;
+ __ClearPagePool(page);
+
+ /* DMA unmap */
+ dma_unmap_page(pool->p.dev, page->dma_addr,
+ PAGE_SIZE << pool->p.order,
+ pool->p.dma_dir);
+ page->dma_addr = 0;
+ /* Q: Use DMA macros???
+ *
+ * dma_unmap_page(pool->p.dev, dma_unmap_addr(page,dma_addr),
+ * PAGE_SIZE << pool->p.order,
+ * pool->p.dma_dir);
+ * dma_unmap_addr_set(page, dma_addr, 0);
+ */
+
+ /* FUTURE: Use Alex Duyck's DMA_ATTR_SKIP_CPU_SYNC changes
+ *
+ * dma_unmap_page_attrs(pool->p.dev, page->dma_addr,
+ * PAGE_SIZE << pool->p.order,
+ * pool->p.dma_dir,
+ * DMA_ATTR_SKIP_CPU_SYNC);
+ */
+
+ // page_mapcount_reset(page); // ??
+ // page->mapping = NULL; // ??
+
+ // Not really needed, but good for provoking bugs
+ page->pool = (void *)0xDEADBEE0;
+
+ /* FIXME: Add accounting of pages here!
+ *
+ * Look into: memcg_uncharge_page_pool(page, order, pool);
+ */
+
+ // FIXME: do we need this??? likely not as slub does not...
+// if (unlikely(is_zone_device_page(page)))
+// put_zone_device_page(page);
+
+}
+
+/* Return a page to the page allocator, cleaning up our state */
+static void __page_pool_return_page(struct page *page)
+{
+ struct page_pool *pool = page->pool;
+
+ __page_pool_clean_page(page);
+ /*
+ * Given page pool state and flags were just cleared, the page
+ * must be freed here. Thus, code invariant assumes
+ * refcnt==1, as __free_pages() call put_page_testzero().
+ */
+ __free_pages(page, pool->p.order);
+}
+
+bool __page_pool_recycle_into_ring(struct page_pool *pool,
+ struct page *page)
+{
+ int ret;
+ /* TODO: Use smarter data structure for recycle cache. Using
+ * ptr_ring will not scale when multiple remote CPUs want to
+ * recycle pages.
+ */
+
+ /* Need BH protection when free occurs from userspace e.g
+ * __kfree_skb() called via {tcp,inet,sock}_recvmsg
+ *
+ * Problematic for several reasons: (1) it is more costly,
+ * (2) the BH unlock can cause (re)sched of softirq.
+ *
+ * BH protection not needed if current is serving softirq
+ */
+ if (in_serving_softirq())
+ ret = ptr_ring_produce(&pool->ring, page);
+ else
+ ret = ptr_ring_produce_bh(&pool->ring, page);
+
+ return (ret == 0) ? true : false;
+}
+
+/*
+ * Only allow direct recycling in very special circumstances, into the
+ * alloc cache. E.g. XDP_DROP use-case.
+ *
+ * Caller must provide appropriate safe context.
+ */
+static bool __page_pool_recycle_direct(struct page *page,
+ struct page_pool *pool)
+{
+ // BUG_ON(!in_serving_softirq());
+
+ if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE))
+ return false;
+
+ /* Caller MUST have verified/know (page_ref_count(page) == 1) */
+ pool->alloc.cache[pool->alloc.count++] = page;
+ return true;
+}
+
+void __page_pool_put_page(struct page *page, bool allow_direct)
+{
+ struct page_pool *pool = page->pool;
+
+ /* This is a fast-path optimization, that avoids an atomic
+ * operation, in the case where a single object is (refcnt)
+ * using the page.
+ *
+ * refcnt == 1 means page_pool owns page, and can recycle it.
+ */
+ if (likely(page_ref_count(page) == 1)) {
+ /* Read barrier implicit paired with full MB of atomic ops */
+ smp_rmb();
+
+ if (allow_direct)
+ if (__page_pool_recycle_direct(page, pool))
+ return;
+
+ if (!__page_pool_recycle_into_ring(pool, page)) {
+ /* Cache full, do real __free_pages() */
+ __page_pool_return_page(page);
+ }
+ return;
+ }
+ /*
+ * Many drivers splitting up the page into fragments, and some
+ * want to keep doing this to save memory. The put_page_testzero()
+ * function as a refcnt decrement, and should not return true.
+ */
+ if (unlikely(put_page_testzero(page))) {
+ /*
+ * Reaching refcnt zero should not be possible,
+ * indicate code error. Don't crash but warn, handle
+ * case by not-recycling, but return page to page
+ * allocator.
+ */
+ WARN(1, "%s() violating page_pool invariance refcnt:%d\n",
+ __func__, page_ref_count(page));
+ /* Cleanup state before directly returning page */
+ __page_pool_clean_page(page);
+ __put_page(page);
+ }
+}
+EXPORT_SYMBOL(__page_pool_put_page);
+
+static void __destructor_put_page(void *ptr)
+{
+ struct page *page = ptr;
+
+ /* Verify the refcnt invariant of cached pages */
+ if (!(page_ref_count(page) == 1)) {
+ pr_crit("%s() page_pool refcnt %d violation\n",
+ __func__, page_ref_count(page));
+ BUG();
+ }
+ __page_pool_return_page(page);
+}
+
+/* Cleanup and release resources */
+void page_pool_destroy(struct page_pool *pool)
+{
+ /* Empty recycle ring */
+ ptr_ring_cleanup(&pool->ring, __destructor_put_page);
+
+ /* FIXME-mem-leak: cleanup array/stack cache
+ * pool->alloc. Driver usually will destroy RX ring after
+ * making sure nobody can alloc from it, thus it should be
+ * safe to just empty cache here
+ */
+
+ /* FIXME: before releasing the page_pool memory, we MUST make
+ * sure no pages point back to this page_pool.
+ */
+ kfree(pool);
+}
+EXPORT_SYMBOL(page_pool_destroy);
diff --git a/mm/slub.c b/mm/slub.c
index 067598a00849..7de478c20464 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1572,8 +1572,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
page->objects = oo_objects(oo);
order = compound_order(page);
- page->slab_cache = s;
- __SetPageSlab(page);
+ page->slab_cache = s; // Example: Saving kmem_cache in struct page
+ __SetPageSlab(page); // Example: Setting flag
if (page_is_pfmemalloc(page))
SetPageSlabPfmemalloc(page);
* [RFC PATCH 3/4] mlx5: use page_pool
From: Jesper Dangaard Brouer @ 2016-12-20 13:28 UTC (permalink / raw)
To: linux-mm, Alexander Duyck
Cc: willemdebruijn.kernel, netdev, john.fastabend, Saeed Mahameed,
Jesper Dangaard Brouer, bjorn.topel, Alexei Starovoitov,
Tariq Toukan
In-Reply-To: <20161220132444.18788.50875.stgit@firesoul>
The mlx5 driver already has a driver-local page recycle cache. This
page cache is only efficient when the number of outstanding pages is
small; the queue-based cache array size is 128. Furthermore, a single
page with an elevated refcnt can block the queue.
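The head-of-line blocking described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the mlx5 code: struct fake_page, struct fifo_cache, cache_put() and cache_get() are hypothetical names, and the only detail taken from the text is the array size of 128. The model assumes the cache only ever inspects the entry at the head of the queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SIZE 128	/* array size mentioned in the text above */

/* Hypothetical stand-in for struct page; only the refcnt matters here. */
struct fake_page {
	int refcnt;
};

/* Hypothetical FIFO recycle cache: pages enter at tail, leave at head. */
struct fifo_cache {
	struct fake_page *slot[CACHE_SIZE];
	unsigned int head;	/* next entry to hand out */
	unsigned int tail;	/* next free slot */
};

/* Queue a page for reuse. Fails when the ring is full. */
static bool cache_put(struct fifo_cache *c, struct fake_page *p)
{
	unsigned int next = (c->tail + 1) % CACHE_SIZE;

	assert(p != NULL);
	if (next == c->head)
		return false;
	c->slot[c->tail] = p;
	c->tail = next;
	return true;
}

/*
 * Hand out the head entry, but only if nobody else holds a reference.
 * Entries behind a busy head are never looked at, so a single page
 * with an elevated refcnt stalls the whole cache.
 */
static struct fake_page *cache_get(struct fifo_cache *c)
{
	struct fake_page *p;

	if (c->head == c->tail)
		return NULL;		/* empty */
	p = c->slot[c->head];
	if (p->refcnt != 1)
		return NULL;		/* head is busy: blocked */
	c->head = (c->head + 1) % CACHE_SIZE;
	return p;
}
```

Queueing a page with refcnt 2 followed by a page with refcnt 1 makes cache_get() return NULL until the first page drops back to refcnt 1, even though the second page could have been reused right away.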
Benchmarking on net-next at commit f5f99309fa74 ("sock: do not set
sk_err in sock_dequeue_err_skb"), which includes Paolo's UDP
performance optimizations (commit fc13fd398625 ("Merge branch
'udp-fwd-mem-sched-on-dequeue'")), showed a speedup of 29% for UDP
packets. Detailed ethtool stats showed that the mlx5 page recycler
didn't "work" in that benchmark. The XDP_DROP use-case showed a small
perf regression of +2.7 ns using page_pool. This corresponds well to
the 28% gain reported in commit 1bfecfca565c ("net/mlx5e: Build RX SKB
on demand").
UPDATE: On newer kernels, net-next at commit 52f40e9d65, the mlx5 page
recycle cache works again, and the performance gain is gone. Detailed
benchmarking shows that the RX-ksoftirq side is approx 10% faster, while
UDP socket delivery has the same performance.
For TC early ingress drop there is a small performance regression of
approx +4 ns. There are pending page_pool optimizations that will
close that gap.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 1
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 28 +++++++++++++
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 47 ++++++++++++++-------
3 files changed, 60 insertions(+), 16 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 951dbd58594d..b30d5b08d6a6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -361,6 +361,7 @@ struct mlx5e_rq {
struct mlx5e_tstamp *tstamp;
struct mlx5e_rq_stats stats;
struct mlx5e_cq cq;
+ struct page_pool *page_pool;
struct mlx5e_page_cache page_cache;
mlx5e_fp_handle_rx_cqe handle_rx_cqe;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index cbfa38fc72c0..cd71e5764ec1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -34,6 +34,7 @@
#include <net/pkt_cls.h>
#include <linux/mlx5/fs.h>
#include <net/vxlan.h>
+#include <linux/page_pool.h>
#include <linux/bpf.h>
#include "en.h"
#include "en_tc.h"
@@ -521,6 +522,7 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
struct mlx5e_rq_param *param,
struct mlx5e_rq *rq)
{
+ struct page_pool_params pp_params = { 0 };
struct mlx5e_priv *priv = c->priv;
struct mlx5_core_dev *mdev = priv->mdev;
void *rqc = param->rqc;
@@ -591,6 +593,7 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
default: /* MLX5_WQ_TYPE_LINKED_LIST */
rq->dma_info = kzalloc_node(wq_sz * sizeof(*rq->dma_info),
GFP_KERNEL, cpu_to_node(c->cpu));
+// rq->dma_info = NULL; //HACK ALWAYS FAIL TEST
if (!rq->dma_info) {
err = -ENOMEM;
goto err_rq_wq_destroy;
@@ -618,6 +621,24 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
npages = DIV_ROUND_UP(frag_sz, PAGE_SIZE);
rq->buff.page_order = order_base_2(npages);
+ pp_params.size = PAGE_POOL_PARAMS_SIZE;
+ pp_params.order = rq->buff.page_order;
+ pp_params.dev = c->pdev;
+ pp_params.nid = cpu_to_node(c->cpu);
+ pp_params.dma_dir = DMA_BIDIRECTIONAL;
+ pp_params.pool_size = 2000;
+ pr_info("XXX: %s() pp_params.size=%d end=%lu\n",
+ __func__, pp_params.size,
+ offsetof(struct page_pool_params, end_marker));
+
+ rq->page_pool = page_pool_create(&pp_params);
+ if (IS_ERR_OR_NULL(rq->page_pool)) {
+ rq->page_pool = NULL;
+ kfree(rq->dma_info);
+ err = -ENOMEM;
+ goto err_rq_wq_destroy;
+ }
+
byte_count |= MLX5_HW_START_PADDING;
rq->mkey_be = c->mkey_be;
}
@@ -662,6 +683,13 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
break;
default: /* MLX5_WQ_TYPE_LINKED_LIST */
kfree(rq->dma_info);
+ if (rq->page_pool)
+ page_pool_destroy(rq->page_pool);
+ else
+ // Can happen because mlx5 has some extra
+ // rq's for some other purposes... (explain?)
+ pr_err("XXX: %s() NULL pointer at rq->page_pool\n",
+ __func__);
}
for (i = rq->page_cache.head; i != rq->page_cache.tail;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 0e2fb3ed1790..0512632b30fd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -182,6 +182,7 @@ void mlx5e_modify_rx_cqe_compression(struct mlx5e_priv *priv, bool val)
#define RQ_PAGE_SIZE(rq) ((1 << rq->buff.page_order) << PAGE_SHIFT)
+// TODO: Remove mlx5-page-cache
static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
struct mlx5e_dma_info *dma_info)
{
@@ -198,6 +199,7 @@ static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
return true;
}
+// TODO: Remove mlx5-page-cache
static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
struct mlx5e_dma_info *dma_info)
{
@@ -228,20 +230,27 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
{
struct page *page;
- if (mlx5e_rx_cache_get(rq, dma_info))
- return 0;
+// if (mlx5e_rx_cache_get(rq, dma_info))
+// return 0;
- page = dev_alloc_pages(rq->buff.page_order);
+ //page = dev_alloc_pages(rq->buff.page_order);
+ page = page_pool_dev_alloc_pages(rq->page_pool);
if (unlikely(!page))
return -ENOMEM;
dma_info->page = page;
- dma_info->addr = dma_map_page(rq->pdev, page, 0,
- RQ_PAGE_SIZE(rq), rq->buff.map_dir);
- if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
- put_page(page);
- return -ENOMEM;
- }
+ dma_info->addr = page->dma_addr;
+// dma_info->addr = dma_map_page(rq->pdev, page, 0,
+// RQ_PAGE_SIZE(rq), rq->buff.map_dir);
+
+ /* DISCUSS: should this be moved into page_pool API? Here we
+ * sync the entire page, but some drivers might want to have more
+ * control, e.g. by using dma_sync_single_range_for_device()
+ * like Alex is doing in the Intel drivers...
+ */
+ dma_sync_single_for_device(rq->pdev, dma_info->addr,
+ RQ_PAGE_SIZE(rq),
+ DMA_FROM_DEVICE);
return 0;
}
@@ -249,11 +258,21 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
bool recycle)
{
- if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
+// if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
+// return;
+ // TODO: use page_pool_recycle_direct(dma_info->page);
+ if (recycle) {
+ page_pool_recycle_direct(dma_info->page);
return;
+ }
+
+// page_pool takes over dma_unmap
+// dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
+// rq->buff.map_dir);
+ // XXX: do we need to call dma_sync_single_range_for_cpu here???
+ // dma_sync_single_range_for_cpu(rq->pdev, dma_info->addr,
+ // RQ_PAGE_SIZE(rq), rq->buff.map_dir);
- dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
- rq->buff.map_dir);
put_page(dma_info->page);
}
@@ -773,10 +792,6 @@ struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
return NULL;
}
- /* queue up for recycling ..*/
- page_ref_inc(di->page);
- mlx5e_page_release(rq, di, true);
-
skb_reserve(skb, MLX5_RX_HEADROOM);
skb_put(skb, cqe_bcnt);
* [RFC PATCH 4/4] page_pool: change refcnt model
From: Jesper Dangaard Brouer @ 2016-12-20 13:28 UTC (permalink / raw)
To: linux-mm, Alexander Duyck
Cc: willemdebruijn.kernel, netdev, john.fastabend, Saeed Mahameed,
Jesper Dangaard Brouer, bjorn.topel, Alexei Starovoitov,
Tariq Toukan
In-Reply-To: <20161220132444.18788.50875.stgit@firesoul>
This is the direction the patch is going, after Mel's comments.
Most significantly: the refcnt must now reach zero (and not 1)
before the page goes into the recycle ring. Pages on the
pp_alloc_cache keep the refcnt==1 invariant, as this allows fast direct
recycling (allowed by XDP_DROP).
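To make the two invariants concrete, here is a minimal userspace model of the put/get paths under the new refcnt rules. It is a sketch only and not the kernel code: struct model_page, struct model_pool, model_put_page() and model_get_page() are hypothetical names, the sizes are arbitrary, and the ring is modelled as a plain array (taking one page at a time, last-in first-out) instead of a ptr_ring with bulk refill.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ALLOC_CACHE_SIZE 4	/* arbitrary sizes for the model */
#define RING_SIZE 4

/* Hypothetical stand-in for struct page. */
struct model_page {
	int refcnt;
	bool freed;	/* set when handed back to the "page allocator" */
};

struct model_pool {
	struct model_page *cache[ALLOC_CACHE_SIZE];	/* invariant: refcnt == 1 */
	unsigned int cache_count;
	struct model_page *ring[RING_SIZE];		/* invariant: refcnt == 0 */
	unsigned int ring_count;
};

static bool ring_produce(struct model_pool *pool, struct model_page *page)
{
	if (pool->ring_count == RING_SIZE)
		return false;
	pool->ring[pool->ring_count++] = page;
	return true;
}

/*
 * Put path: direct recycling keeps refcnt at 1 and needs no atomic
 * decrement. Otherwise drop one reference, and only the drop that
 * reaches zero moves the page onto the recycle ring.
 */
static void model_put_page(struct model_pool *pool, struct model_page *page,
			   bool allow_direct)
{
	assert(page->refcnt > 0);
	if (allow_direct && page->refcnt == 1 &&
	    pool->cache_count < ALLOC_CACHE_SIZE) {
		pool->cache[pool->cache_count++] = page;
		return;
	}
	if (--page->refcnt == 0 && !ring_produce(pool, page))
		page->freed = true;	/* ring full: page leaves the pool */
}

/* Get path: alloc cache first, then the ring, restoring refcnt to 1. */
static struct model_page *model_get_page(struct model_pool *pool)
{
	struct model_page *page;

	if (pool->cache_count)
		return pool->cache[--pool->cache_count];
	if (pool->ring_count) {
		page = pool->ring[--pool->ring_count];
		assert(page->refcnt == 0);
		page->refcnt = 1;
		return page;
	}
	return NULL;	/* caller would fall back to a real allocation */
}
```

In this model a page shared by two users (refcnt 2) only reaches the ring on the second put, while a page released with allow_direct set and refcnt 1 goes straight back to the alloc cache.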
When the mlx5 page recycle cache didn't work (on net-next at commit
f5f99309fa74) the benchmarks showed the gain was reduced to 14% by
this patch, or an added cost of approx 133 cycles (which was a higher
cycle cost than expected).
UPDATE: On net-next at commit 52f40e9d65 this patch shows no gain, perhaps
a small performance regression. The accuracy of the UDP measurements
is not good enough to draw conclusions from: ksoftirq +1.4% and UDP side -0.89%.
The TC ingress drop test is more significant and shows a 4.3% slowdown.
Thus, this patch makes page_pool slower than the driver-specific page
recycle cache. More optimizations are pending for the page_pool, thus
this can likely be regained.
The page_pool will still show a benefit for use-cases where the driver
page recycle cache doesn't work (>128 outstanding packets/pages).
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/linux/mm.h | 5 --
include/linux/page_pool.h | 10 +++
mm/page_alloc.c | 16 ++---
mm/page_pool.c | 141 +++++++++++++++++++--------------------------
mm/swap.c | 3 +
5 files changed, 79 insertions(+), 96 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 11b4d8fb280b..7315c1790f7c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -766,11 +766,6 @@ static inline void put_page(struct page *page)
{
page = compound_head(page);
- if (PagePool(page)) {
- page_pool_put_page(page);
- return;
- }
-
if (put_page_testzero(page))
__put_page(page);
diff --git a/include/linux/page_pool.h b/include/linux/page_pool.h
index 6f8f2ff6d758..40da1fac573d 100644
--- a/include/linux/page_pool.h
+++ b/include/linux/page_pool.h
@@ -112,6 +112,7 @@ struct page_pool {
* wise, because free's can happen on remote CPUs, with no
* association with allocation resource.
*
+ * XXX: Mel says drop comment
* For now use ptr_ring, as it separates consumer and
* producer, which is a common use-case. The ptr_ring is not
* thought of as the final data structure, expecting this to
@@ -145,6 +146,7 @@ void page_pool_destroy(struct page_pool *pool);
/* Never call this directly, use helpers below */
void __page_pool_put_page(struct page *page, bool allow_direct);
+/* XXX: Mel: needs descriptions*/
static inline void page_pool_put_page(struct page *page)
{
__page_pool_put_page(page, false);
@@ -155,4 +157,12 @@ static inline void page_pool_recycle_direct(struct page *page)
__page_pool_put_page(page, true);
}
+/*
+ * Called when refcnt reaches zero. On failure page_pool state is
+ * cleared, and caller can return page to page allocator.
+ */
+bool page_pool_recycle(struct page *page);
+// XXX: compile out trick, let this return false compile time,
+// or let PagePool() check compile to false.
+
#endif /* _LINUX_PAGE_POOL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 655db05f0c1c..5a68bdbc9dc1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1240,6 +1240,9 @@ static void __free_pages_ok(struct page *page, unsigned int order)
int migratetype;
unsigned long pfn = page_to_pfn(page);
+ if (PagePool(page) && page_pool_recycle(page))
+ return;
+
if (!free_pages_prepare(page, order, true))
return;
@@ -2448,6 +2451,9 @@ void free_hot_cold_page(struct page *page, bool cold)
unsigned long pfn = page_to_pfn(page);
int migratetype;
+ if (PagePool(page) && page_pool_recycle(page))
+ return;
+
if (!free_pcp_prepare(page))
return;
@@ -3873,11 +3879,6 @@ EXPORT_SYMBOL(get_zeroed_page);
void __free_pages(struct page *page, unsigned int order)
{
- if (PagePool(page)) {
- page_pool_put_page(page);
- return;
- }
-
if (put_page_testzero(page)) {
if (order == 0)
free_hot_cold_page(page, false);
@@ -4005,11 +4006,6 @@ void __free_page_frag(void *addr)
{
struct page *page = virt_to_head_page(addr);
- if (PagePool(page)) {
- page_pool_put_page(page);
- return;
- }
-
if (unlikely(put_page_testzero(page)))
__free_pages_ok(page, compound_order(page));
}
diff --git a/mm/page_pool.c b/mm/page_pool.c
index 74138d5fe86d..064034d89f8a 100644
--- a/mm/page_pool.c
+++ b/mm/page_pool.c
@@ -21,14 +21,15 @@
#include <linux/dma-mapping.h>
#include <linux/page-flags.h>
#include <linux/mm.h> /* for __put_page() */
+#include "internal.h" /* for set_page_refcounted() */
/*
* The struct page_pool (likely) cannot be embedded into another
* structure, because freeing this struct depend on outstanding pages,
* which can point back to the page_pool. Thus, don't export "init".
*/
-int page_pool_init(struct page_pool *pool,
- const struct page_pool_params *params)
+static int page_pool_init(struct page_pool *pool,
+ const struct page_pool_params *params)
{
int ring_qsize = 1024; /* Default */
int param_copy_sz;
@@ -108,40 +109,33 @@ EXPORT_SYMBOL(page_pool_create);
/* fast path */
static struct page *__page_pool_get_cached(struct page_pool *pool)
{
+ struct ptr_ring *r;
struct page *page;
- /* FIXME: use another test for safe-context, caller should
- * simply provide this guarantee
- */
- if (likely(in_serving_softirq())) { // FIXME add use of PP_FLAG_NAPI
- struct ptr_ring *r;
-
- if (likely(pool->alloc.count)) {
- /* Fast-path */
- page = pool->alloc.cache[--pool->alloc.count];
- return page;
- }
- /* Slower-path: Alloc array empty, time to refill */
- r = &pool->ring;
- /* Open-coded bulk ptr_ring consumer.
- *
- * Discussion: ATM the ring consumer lock is not
- * really needed due to the softirq/NAPI protection,
- * but later MM-layer need the ability to reclaim
- * pages on the ring. Thus, keeping the locks.
- */
- spin_lock(&r->consumer_lock);
- while ((page = __ptr_ring_consume(r))) {
- if (pool->alloc.count == PP_ALLOC_CACHE_REFILL)
- break;
- pool->alloc.cache[pool->alloc.count++] = page;
- }
- spin_unlock(&r->consumer_lock);
+ /* Caller guarantees safe context for accessing alloc.cache */
+ if (likely(pool->alloc.count)) {
+ /* Fast-path */
+ page = pool->alloc.cache[--pool->alloc.count];
return page;
}
- /* Slow-path: Get page from locked ring queue */
- page = ptr_ring_consume(&pool->ring);
+ /* Slower-path: Alloc array empty, time to refill */
+ r = &pool->ring;
+ /* Open-coded bulk ptr_ring consumer.
+ *
+ * Discussion: ATM ring *consumer* lock is not really needed
+ * due to caller protection, but later the MM-layer needs the
+ * ability to reclaim pages from ring. Thus, keeping locks.
+ */
+ spin_lock(&r->consumer_lock);
+ while ((page = __ptr_ring_consume(r))) {
+ /* Pages on ring refcnt==0, on alloc.cache refcnt==1 */
+ set_page_refcounted(page);
+ if (pool->alloc.count == PP_ALLOC_CACHE_REFILL)
+ break;
+ pool->alloc.cache[pool->alloc.count++] = page;
+ }
+ spin_unlock(&r->consumer_lock);
return page;
}
@@ -290,15 +284,9 @@ static void __page_pool_clean_page(struct page *page)
/* Return a page to the page allocator, cleaning up our state */
static void __page_pool_return_page(struct page *page)
{
- struct page_pool *pool = page->pool;
-
+ VM_BUG_ON_PAGE(page_ref_count(page) != 0, page);
__page_pool_clean_page(page);
- /*
- * Given page pool state and flags were just cleared, the page
- * must be freed here. Thus, code invariant assumes
- * refcnt==1, as __free_pages() call put_page_testzero().
- */
- __free_pages(page, pool->p.order);
+ __put_page(page);
}
bool __page_pool_recycle_into_ring(struct page_pool *pool,
@@ -332,70 +320,61 @@ bool __page_pool_recycle_into_ring(struct page_pool *pool,
*
* Caller must provide appropriate safe context.
*/
-static bool __page_pool_recycle_direct(struct page *page,
+// noinline /* hack for perf-record test */
+static
+bool __page_pool_recycle_direct(struct page *page,
struct page_pool *pool)
{
- // BUG_ON(!in_serving_softirq());
+ VM_BUG_ON_PAGE(page_ref_count(page) != 1, page);
+ /* page refcnt==1 invariant on alloc.cache */
if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE))
return false;
- /* Caller MUST have verified/know (page_ref_count(page) == 1) */
pool->alloc.cache[pool->alloc.count++] = page;
return true;
}
-void __page_pool_put_page(struct page *page, bool allow_direct)
+/*
+ * Called when refcnt reaches zero. On failure page_pool state is
+ * cleared, and caller can return page to page allocator.
+ */
+bool page_pool_recycle(struct page *page)
{
struct page_pool *pool = page->pool;
- /* This is a fast-path optimization, that avoids an atomic
- * operation, in the case where a single object is (refcnt)
- * using the page.
- *
- * refcnt == 1 means page_pool owns page, and can recycle it.
- */
- if (likely(page_ref_count(page) == 1)) {
- /* Read barrier implicit paired with full MB of atomic ops */
- smp_rmb();
-
- if (allow_direct)
- if (__page_pool_recycle_direct(page, pool))
- return;
+ VM_BUG_ON_PAGE(page_ref_count(page) != 0, page);
- if (!__page_pool_recycle_into_ring(pool, page)) {
- /* Cache full, do real __free_pages() */
- __page_pool_return_page(page);
- }
- return;
- }
- /*
- * Many drivers splitting up the page into fragments, and some
- * want to keep doing this to save memory. The put_page_testzero()
- * function as a refcnt decrement, and should not return true.
- */
- if (unlikely(put_page_testzero(page))) {
- /*
- * Reaching refcnt zero should not be possible,
- * indicate code error. Don't crash but warn, handle
- * case by not-recycling, but return page to page
- * allocator.
- */
- WARN(1, "%s() violating page_pool invariance refcnt:%d\n",
- __func__, page_ref_count(page));
- /* Cleanup state before directly returning page */
+ /* Pages on recycle ring have refcnt==0 */
+ if (!__page_pool_recycle_into_ring(pool, page)) {
__page_pool_clean_page(page);
- __put_page(page);
+ return false;
}
+ return true;
+}
+EXPORT_SYMBOL(page_pool_recycle);
+
+void __page_pool_put_page(struct page *page, bool allow_direct)
+{
+ struct page_pool *pool = page->pool;
+
+ if (allow_direct && (page_ref_count(page) == 1))
+ if (__page_pool_recycle_direct(page, pool))
+ return;
+
+ if (put_page_testzero(page))
+ if (!page_pool_recycle(page))
+ __put_page(page);
+
}
EXPORT_SYMBOL(__page_pool_put_page);
-static void __destructor_put_page(void *ptr)
+void __destructor_return_page(void *ptr)
{
struct page *page = ptr;
/* Verify the refcnt invariant of cached pages */
- if (!(page_ref_count(page) == 1)) {
+ if (page_ref_count(page) != 0) {
pr_crit("%s() page_pool refcnt %d violation\n",
__func__, page_ref_count(page));
BUG();
@@ -407,7 +386,7 @@ static void __destructor_put_page(void *ptr)
void page_pool_destroy(struct page_pool *pool)
{
/* Empty recycle ring */
- ptr_ring_cleanup(&pool->ring, __destructor_put_page);
+ ptr_ring_cleanup(&pool->ring, __destructor_return_page);
/* FIXME-mem-leak: cleanup array/stack cache
* pool->alloc. Driver usually will destroy RX ring after
diff --git a/mm/swap.c b/mm/swap.c
index 4dcf852e1e6d..d71c896cb1a1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -96,6 +96,9 @@ static void __put_compound_page(struct page *page)
void __put_page(struct page *page)
{
+ if (PagePool(page) && page_pool_recycle(page))
+ return;
+
if (unlikely(PageCompound(page)))
__put_compound_page(page);
else
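The reworked put path in the diff above can be modelled in userspace roughly as follows. This is only a sketch of the refcount flow, not kernel code: every demo_* name, both capacity constants and the result enum are hypothetical stand-ins introduced for illustration, and the pre-decrement models what put_page_testzero() does atomically in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_CACHE_SIZE 2	/* hypothetical; stands in for PP_ALLOC_CACHE_SIZE */
#define DEMO_RING_SIZE  2	/* hypothetical recycle-ring capacity */

struct demo_page {
	int refcnt;
	bool in_pool;		/* models the page_pool state on the page */
};

struct demo_pool {
	struct demo_page *cache[DEMO_CACHE_SIZE];
	int cache_count;
	struct demo_page *ring[DEMO_RING_SIZE];
	int ring_count;
	int freed;		/* pages handed back to the allocator */
};

enum demo_result { DEMO_CACHED, DEMO_RINGED, DEMO_FREED, DEMO_STILL_IN_USE };

enum demo_result demo_put_page(struct demo_pool *pool, struct demo_page *page,
			       bool allow_direct)
{
	/* Fast path: sole owner, keep refcnt == 1 and stash in the alloc cache. */
	if (allow_direct && page->refcnt == 1 &&
	    pool->cache_count < DEMO_CACHE_SIZE) {
		pool->cache[pool->cache_count++] = page;
		return DEMO_CACHED;
	}

	/* Models put_page_testzero(): drop one reference, go on only at zero. */
	if (--page->refcnt != 0)
		return DEMO_STILL_IN_USE;

	/* Pages on the recycle ring have refcnt == 0. */
	if (pool->ring_count < DEMO_RING_SIZE) {
		pool->ring[pool->ring_count++] = page;
		return DEMO_RINGED;
	}

	/* Ring full: clear pool state and return the page to the allocator. */
	page->in_pool = false;
	pool->freed++;
	return DEMO_FREED;
}
```

In this model a page with two references is first only decremented (DEMO_STILL_IN_USE) and lands on the ring when the last reference is dropped, which mirrors the invariants stated in the diff: refcnt==1 in the alloc cache, refcnt==0 on the ring.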
* [PATCH] ethernet: sfc: Add Kconfig entry for vendor Solarflare
From: Tobias Klauser @ 2016-12-20 13:38 UTC (permalink / raw)
To: netdev; +Cc: linux-net-drivers, ecree, bkenward
Since commit
5a6681e22c14 ("sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver")
there are two drivers for Solarflare devices, but both still show up
directly beneath "Ethernet driver support" in the Kconfig. Follow the
pattern of other vendors and group them beneath their own vendor Kconfig
entry for Solarflare.
Cc: Edward Cree <ecree@solarflare.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
---
drivers/net/ethernet/Kconfig | 1 -
drivers/net/ethernet/sfc/Kconfig | 21 +++++++++++++++++++++
2 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
index 6e16e441f85e..e4c28fed61d5 100644
--- a/drivers/net/ethernet/Kconfig
+++ b/drivers/net/ethernet/Kconfig
@@ -166,7 +166,6 @@ source "drivers/net/ethernet/seeq/Kconfig"
source "drivers/net/ethernet/silan/Kconfig"
source "drivers/net/ethernet/sis/Kconfig"
source "drivers/net/ethernet/sfc/Kconfig"
-source "drivers/net/ethernet/sfc/falcon/Kconfig"
source "drivers/net/ethernet/sgi/Kconfig"
source "drivers/net/ethernet/smsc/Kconfig"
source "drivers/net/ethernet/stmicro/Kconfig"
diff --git a/drivers/net/ethernet/sfc/Kconfig b/drivers/net/ethernet/sfc/Kconfig
index 46f7be85f5a3..2c032629c369 100644
--- a/drivers/net/ethernet/sfc/Kconfig
+++ b/drivers/net/ethernet/sfc/Kconfig
@@ -1,3 +1,20 @@
+#
+# Solarflare device configuration
+#
+
+config NET_VENDOR_SOLARFLARE
+ bool "Solarflare devices"
+ default y
+ ---help---
+ If you have a network (Ethernet) card belonging to this class, say Y.
+
+ Note that the answer to this question doesn't directly affect the
+ kernel: saying N will just cause the configurator to skip all
+ the questions about Solarflare devices. If you say Y, you will be asked
+ for your specific card in the following questions.
+
+if NET_VENDOR_SOLARFLARE
+
config SFC
tristate "Solarflare SFC9000/SFC9100-family support"
depends on PCI
@@ -44,3 +61,7 @@ config SFC_MCDI_LOGGING
Driver-Interface) commands and responses, allowing debugging of
driver/firmware interaction. The tracing is actually enabled by
a sysfs file 'mcdi_logging' under the PCI device.
+
+source "drivers/net/ethernet/sfc/falcon/Kconfig"
+
+endif # NET_VENDOR_SOLARFLARE
--
2.11.0
* Re: [PATCH perf/core REBASE 2/5] samples/bpf: Switch over to libbpf
From: Arnaldo Carvalho de Melo @ 2016-12-20 13:41 UTC (permalink / raw)
To: Joe Stringer
Cc: LKML, netdev, Wang Nan, ast, Daniel Borkmann,
Arnaldo Carvalho de Melo
In-Reply-To: <CAPWQB7EAN7Fg8+vO3Tn7WYWcZW_JpKQFNCJzY_K100ckab6JRg@mail.gmail.com>
On Thu, Dec 15, 2016 at 05:48:31PM -0800, Joe Stringer wrote:
> On 15 December 2016 at 14:00, Joe Stringer <joe@ovn.org> wrote:
> > On 15 December 2016 at 10:34, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> >> So, I'm stopping here so that I can push what I have to Ingo, then I'll get
> >> back to this, hopefully by then you beat me and I have just to retest 8-)
> > OK, thanks for the report. Looks like there was another difference
> > between the two libbpfs - one used total program size for its
> > load_program API; the actual kernel API uses instruction count. This
> > incremental should do the trick:
> > https://github.com/joestringer/linux/commit/6ff7726f20077bed66fb725f5189c13690154b6a
> The full branch with this change (fast-forward from your tmp branch)
> is available here:
> https://github.com/joestringer/linux/tree/submit/libbpf_samples_v5
> I tried running every selftest and BPF sample I could get my hands on;
> there's one or two that I couldn't run, but seemed more to do with my
> versions of TC/iproute and kernel config rather than libbpf changes.
> Let me know if you see any further trouble.
Finally getting back to this: now that I have figured out how to get patches
out of GitHub (wget commit + .patch), I applied this, and at least
samples/bpf/offwaketime seems to work as before. Applying.
- Arnaldo
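The size-versus-count mismatch described in the quoted text comes down to a division by the instruction size. A minimal sketch, assuming an 8-byte instruction layout like the kernel's struct bpf_insn; the demo_* names are hypothetical, and the two register fields are folded into one byte here for simplicity:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in with the same 8-byte size as struct bpf_insn. */
struct demo_bpf_insn {
	uint8_t code;	/* opcode */
	uint8_t regs;	/* dst/src registers (4 bits each in the kernel struct) */
	int16_t off;	/* signed offset */
	int32_t imm;	/* signed immediate */
};

/* Convert a program size in bytes into the instruction count. */
size_t demo_insn_count(size_t prog_bytes)
{
	return prog_bytes / sizeof(struct demo_bpf_insn);
}
```

A loader that keeps the program length in bytes would pass demo_insn_count(len) where an instruction count is expected; passing len unchanged would overstate the program length by a factor of eight.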
* [PATCH] netfilter: xt_connlimit: use rb_entry()
From: Geliang Tang @ 2016-12-20 14:02 UTC (permalink / raw)
To: Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik,
David S. Miller
Cc: Geliang Tang, netfilter-devel, coreteam, netdev, linux-kernel
In-Reply-To: <ddabc96c798df194791134d8e070d728e2a7b59f.1482203698.git.geliangtang@gmail.com>
To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
---
net/netfilter/xt_connlimit.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index 2aff2b7..660b61d 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -218,7 +218,7 @@ count_tree(struct net *net, struct rb_root *root,
int diff;
bool addit;
- rbconn = container_of(*rbnode, struct xt_connlimit_rb, node);
+ rbconn = rb_entry(*rbnode, struct xt_connlimit_rb, node);
parent = *rbnode;
diff = same_source_net(addr, mask, &rbconn->addr, family);
@@ -398,7 +398,7 @@ static void destroy_tree(struct rb_root *r)
struct rb_node *node;
while ((node = rb_first(r)) != NULL) {
- rbconn = container_of(node, struct xt_connlimit_rb, node);
+ rbconn = rb_entry(node, struct xt_connlimit_rb, node);
rb_erase(node, r);
--
2.9.3
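For readers unfamiliar with the macro: in the kernel, rb_entry() is defined in terms of container_of(), so the change above is about readability rather than behaviour. A userspace sketch of the idea follows; the macro definitions are simplified stand-ins, struct rb_node is reduced to two pointers, and struct demo_counter is a hypothetical payload type, none of which are the kernel's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel macros. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define rb_entry(ptr, type, member) container_of(ptr, type, member)

/* Reduced stand-in; the real struct lives in <linux/rbtree.h>. */
struct rb_node {
	struct rb_node *rb_left;
	struct rb_node *rb_right;
};

/* Hypothetical payload that embeds a tree node. */
struct demo_counter {
	int id;
	struct rb_node node;
};

/* Recover the enclosing demo_counter from its embedded tree node. */
struct demo_counter *demo_counter_from_node(struct rb_node *n)
{
	return rb_entry(n, struct demo_counter, node);
}
```

Given a struct demo_counter c, demo_counter_from_node(&c.node) yields &c again; spelling the call as rb_entry() simply tells the reader that the pointer being converted is a tree node.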
* [PATCH] net/mlx5: use rb_entry()
From: Geliang Tang @ 2016-12-20 14:02 UTC (permalink / raw)
To: Saeed Mahameed, Matan Barak, Leon Romanovsky
Cc: Geliang Tang, netdev, linux-rdma, linux-kernel
In-Reply-To: <ddabc96c798df194791134d8e070d728e2a7b59f.1482203698.git.geliangtang@gmail.com>
To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
---
drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
index 3b026c1..7431f63 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
@@ -75,7 +75,7 @@ static void mlx5_fc_stats_insert(struct rb_root *root, struct mlx5_fc *counter)
struct rb_node *parent = NULL;
while (*new) {
- struct mlx5_fc *this = container_of(*new, struct mlx5_fc, node);
+ struct mlx5_fc *this = rb_entry(*new, struct mlx5_fc, node);
int result = counter->id - this->id;
parent = *new;
--
2.9.3