Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [patch net-next 3/6] mlxsw: spectrum_router: Use FIB notifications instead of switchdev calls
From: David Ahern @ 2016-09-22 14:58 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, ogerlitz, roopa, nikolay,
	linville, andy, f.fainelli, jhs, vivien.didelot, andrew, ivecera,
	kaber, john
In-Reply-To: <1474458794-5512-4-git-send-email-jiri@resnulli.us>

On 9/21/16 5:53 AM, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
> 
> Until now, in order to offload a FIB entry to HW we use switchdev op.
> However that has limits. Mainly in case we need to make the HW aware of
> all route prefixes configured in kernel. HW needs to know those in order
> to properly trap appropriate packets and pass the to kernel to do
> the forwarding. Abort mechanism is now handled within the mlxsw driver.
> 
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlxsw/spectrum.h     |   9 +-
>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 386 ++++++++++++---------
>  .../ethernet/mellanox/mlxsw/spectrum_switchdev.c   |   9 -
>  3 files changed, 216 insertions(+), 188 deletions(-)
> 
> 


Make the move to fib notifiers and the removal of switchdev code separate patches. It will make the fib notifier change easier to follow.

^ permalink raw reply

* Re: [patch net-next 3/6] mlxsw: spectrum_router: Use FIB notifications instead of switchdev calls
From: Jiri Pirko @ 2016-09-22 15:05 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, davem, idosch, eladr, yotamg, nogahf, ogerlitz, roopa,
	nikolay, linville, andy, f.fainelli, jhs, vivien.didelot, andrew,
	ivecera, kaber, john
In-Reply-To: <5206ebd6-89c1-ec44-8b98-c5109ebe6140@cumulusnetworks.com>

Thu, Sep 22, 2016 at 04:58:20PM CEST, dsa@cumulusnetworks.com wrote:
>On 9/21/16 5:53 AM, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>> 
>> Until now, in order to offload a FIB entry to HW we use switchdev op.
>> However that has limits. Mainly in case we need to make the HW aware of
>> all route prefixes configured in kernel. HW needs to know those in order
>> to properly trap appropriate packets and pass the to kernel to do
>> the forwarding. Abort mechanism is now handled within the mlxsw driver.
>> 
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlxsw/spectrum.h     |   9 +-
>>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 386 ++++++++++++---------
>>  .../ethernet/mellanox/mlxsw/spectrum_switchdev.c   |   9 -
>>  3 files changed, 216 insertions(+), 188 deletions(-)
>> 
>> 
>
>
>Make the move to fib notifiers and the removal of switchdev code separate patches. It will make the fib notifier change easier to follow.

Since I'm changing one iface by another, I don't see how I can split it.

^ permalink raw reply

* Re: [patch net-next 3/6] mlxsw: spectrum_router: Use FIB notifications instead of switchdev calls
From: David Ahern @ 2016-09-22 15:12 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, idosch, eladr, yotamg, nogahf, ogerlitz, roopa,
	nikolay, linville, andy, f.fainelli, jhs, vivien.didelot, andrew,
	ivecera, kaber, john
In-Reply-To: <20160922150551.GE1830@nanopsycho.orion>

On 9/22/16 9:05 AM, Jiri Pirko wrote:
> Thu, Sep 22, 2016 at 04:58:20PM CEST, dsa@cumulusnetworks.com wrote:
>> On 9/21/16 5:53 AM, Jiri Pirko wrote:
>>> From: Jiri Pirko <jiri@mellanox.com>
>>>
>>> Until now, in order to offload a FIB entry to HW we use switchdev op.
>>> However that has limits. Mainly in case we need to make the HW aware of
>>> all route prefixes configured in kernel. HW needs to know those in order
>>> to properly trap appropriate packets and pass the to kernel to do
>>> the forwarding. Abort mechanism is now handled within the mlxsw driver.
>>>
>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>> ---
>>>  drivers/net/ethernet/mellanox/mlxsw/spectrum.h     |   9 +-
>>>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 386 ++++++++++++---------
>>>  .../ethernet/mellanox/mlxsw/spectrum_switchdev.c   |   9 -
>>>  3 files changed, 216 insertions(+), 188 deletions(-)
>>>
>>>
>>
>>
>> Make the move to fib notifiers and the removal of switchdev code separate patches. It will make the fib notifier change easier to follow.
> 
> Since I'm changing one iface by another, I don't see how I can split it.
> 

Make the previous API dead and remove it in a separate patch. 

The init/fini code changes -- just a code move, no logical changes? then make it a separate patch - or don't move it at all and just add prototypes.

Some important code blocks are hard to read through the diffs as they are now.

^ permalink raw reply

* Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs
From: Daniel Borkmann @ 2016-09-22 15:12 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Thomas Graf
  Cc: Daniel Mack, htejun, ast, davem, kafai, fw, harald, netdev,
	sargun, cgroups
In-Reply-To: <20160922120554.GA1738@salvia>

On 09/22/2016 02:05 PM, Pablo Neira Ayuso wrote:
> On Thu, Sep 22, 2016 at 11:54:11AM +0200, Thomas Graf wrote:
>> On 09/22/16 at 11:21am, Pablo Neira Ayuso wrote:
>>> I have a hard time to buy this new specific hook, I think we should
>>> shift focus of this debate, this is my proposal to untangle this:
>>>
>>> You add a net/netfilter/nft_bpf.c expression that allows you to run
>>> bpf programs from nf_tables. This expression can either run bpf
>>> programs in a similar fashion to tc+bpf or run the bpf program that
>>> you have attached to the cgroup.
>>
>> So for every packet processed, you want to require the user to load
>> and run a (unJITed) nft program acting as a wrapper to run a JITed
>> BPF program? What it the benefit of this model compared to what Daniel
>> is proposing? The hooking point is the same. This only introduces
>> additional per packet overhead in the fast path. Am I missing
>> something?
>
> Have a look at net/ipv4/netfilter/nft_chain_route_ipv4.c for instance.
> In your case, you have to add a new chain type:
>
> static const struct nf_chain_type nft_chain_bpf = {
>          .name           = "bpf",
>          .type           = NFT_CHAIN_T_BPF,
>          ...
>          .hooks          = {
>                  [NF_INET_LOCAL_IN]      = nft_do_bpf,
>                  [NF_INET_LOCAL_OUT]     = nft_do_bpf,
>                  [NF_INET_FORWARD]       = nft_do_bpf,
>                  [NF_INET_PRE_ROUTING]   = nft_do_bpf,
>                  [NF_INET_POST_ROUTING]  = nft_do_bpf,
>          },
> };
>
> nft_do_bpf() is the raw netfilter hook that you register, this hook
> will just execute to iterate over the list of bpf filters and run
> them.
>
> This new chain is created on demand, so no overhead if not needed, eg.
>
> nft add table bpf
> nft add chain bpf input { type bpf hook output priority 0\; }
>
> Then, you add a rule for each bpf program you want to run, just like
> tc+bpf.

But from a high-level point of view, this sounds like a huge hack to me,
in the sense that nft as a bytecode engine (from whole architecture I
mean) calls into another bytecode engine such as BPF as an extension. And
BPF code from there isn't using any of the features from nft besides being
invoked from the hook, yet it needs to be a hard dependency to be compiled
into the kernel when the only thing that is needed is to just load and
execute a BPF program that can do already everything needed by itself.

Also, if we look at history, at times such tings have been tried in the
past - just take tc calling into iptables as an example - I get goosebumps
(and probably you as well ;)), as the result looks worse than it would
have looked like when it would have been just kept separated.
I don't quite understand this detour, I mean, I would understand it when
the idea is that nft is using BPF as an intermediate to make use of all
the already existing JIT/offloading features that have been developed over
the years to not duplicate all the work for arch folks, but off-topic
in this discussion context here. I was hoping that nft would try to avoid
some of those exotic modules we have from xt, I would consider xt_bpf (no
offense ;)) as one of them; in the sense that in the context/model back
then of xt it made sense as most xt modules were kind of hard-coded, but
that was overcome with nft itself. So I'm not really keen on nft_bpf
idea besides that it also doesn't use anything else other than the hooks
themselves for what is proposed. Really, both are two different worlds
with different programming models and use-cases and it doesn't really make
sense to brute-force them together.

> Benefits are, rewording previous email:
>
> * You get access to all of the existing netfilter hooks in one go
>    to run bpf programs. No need for specific redundant hooks. This
>    provides raw access to the netfilter hook, you define the little
>    code that your hook runs before you bpf run invocation. So there
>    is *no need to bloat the stack with more hooks, we use what we
>    have.*

But also this doesn't really address the fundamental underlying problem
that is discussed here. nft doesn't even have cgroups v2 support, only
xt_cgroups has it so far, but even if it would have it, then it's still
a scalability issue that this model has over what is being proposed by
Daniel, since you still need to test linearly wrt cgroups v2 membership,
whereas in the set that is proposed it's integral part of cgroups and can
be extended further, also for non-networking users to use this facility.
Or would the idea be that the current netfilter hooks would be redone in
a way that they are generic enough so that any other user could make use
of it independent of netfilter?

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH] net: VRF: Fix receiving multicast traffic
From: David Ahern @ 2016-09-22 15:14 UTC (permalink / raw)
  To: Mark Tomlinson, netdev
In-Reply-To: <20160922041308.30848-1-mark.tomlinson@alliedtelesis.co.nz>

On 9/21/16 10:13 PM, Mark Tomlinson wrote:
> The previous patch to ensure that the original iif was used when
> checking for forwarding also meant that this same interface was used to
> determine whether multicast packets should be received or not. This was
> incorrect, and would cause multicast packets to be dropped.
> 
> The fix here is to use skb->dev when checking multicast addresses.
> skb->dev has been set to the l3mdev by this point, so the check will be
> against that, rather than the ingress interface.

l3mdev devices do not support IPv4 multicast so checking mcast against that device should not be working at all. For that reason I was fine with the change in the previous patch. ie., you want the real ingress device there not the vrf device.

What test are you running that says your previous patch broke something?

^ permalink raw reply

* Re: [PATCH net-next 2/3] udp: implement memory accounting helpers
From: Edward Cree @ 2016-09-22 15:21 UTC (permalink / raw)
  To: Paolo Abeni, Eric Dumazet
  Cc: netdev, David S. Miller, James Morris, Trond Myklebust,
	Alexander Duyck, Daniel Borkmann, Eric Dumazet, Tom Herbert,
	Hannes Frederic Sowa, linux-nfs
In-Reply-To: <1474540415.4845.69.camel@redhat.com>

On 22/09/16 11:33, Paolo Abeni wrote:
> Hi Eric,
>
> On Wed, 2016-09-21 at 16:31 -0700, Eric Dumazet wrote:
>> Also does inet_diag properly give the forward_alloc to user ?
>>
>> $ ss -mua
>> State      Recv-Q Send-Q Local Address:Port                 Peer Addres
>> s:Port
>> UNCONN     51584  0          *:52460      *:*                    
>> 	 skmem:(r51584,rb327680,t0,tb327680,f1664,w0,o0,bl0,d575)
> Thank you very much for reviewing this! 
>
> My bad, there is still a race which leads to temporary negative values
> of fwd. I feel the fix is trivial but it needs some investigation.
>
>> Couldn't we instead use an union of an atomic_t and int for
>> sk->sk_forward_alloc ?
> That was our first attempt, but we had some issue on mem scheduling; if
> we use:
>
>    if (atomic_sub_return(size, &sk->sk_forward_alloc_atomic) < 0) {
>         // fwd alloc
>    }
>
> that leads to inescapable, temporary, negative value for
> sk->sk_forward_alloc.
>
> Another option would be:
>
> again:
>         fwd = atomic_read(&sk->sk_forward_alloc_atomic);
>         if (fwd > size) {
> 		if (atomic_cmpxchg(&sk->sk_forward_alloc_atomic, fwd, fwd - size) != fwd)
> 			goto again;
>         } else 
>             // fwd alloc
>
> which would be bad under high contention.
Apologies if I'm misunderstanding the problem, but couldn't you have two
atomic_t fields, 'internal' and 'external' forward_alloc.  Then
    if (atomic_sub_return(size, &sk->sk_forward_alloc_internal) < 0) {
        atomic_sub(size, &sk->sk_forward_alloc);
        // fwd alloc
    } else {
        atomic_add(size, &sk->sk_forward_alloc_internal);
    }
or something like that.  Then sk->sk_forward_alloc never sees a negative
value, and is always >= sk->sk_forward_alloc_internal.  Of course places
that go the other way would have to add to sk->sk_forward_alloc first,
before adding to sk->sk_forward_alloc_internal, to maintain that invariant.

Would that help matters at all?

-Ed

^ permalink raw reply

* Re: [PATCH net-next v2 3/3] net: ethernet: mediatek: add the dts property to set if TRGMII supported on GMAC0
From: Sean Wang @ 2016-09-22 15:21 UTC (permalink / raw)
  To: sergei.shtylyov
  Cc: john, davem, nbd, netdev, linux-kernel, linux-mediatek, keyhaede
In-Reply-To: <b04e14ff-15db-9cc8-9490-204da8c5837d@cogentembedded.com>

Date: Thu, 22 Sep 2016 14:28:36 +0300, Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> wrote:
>Hello.
>
>On 9/22/2016 5:33 AM, sean.wang@mediatek.com wrote:
>
>> From: Sean Wang <sean.wang@mediatek.com>
>>
>> Add the dts property for the capability if TRGMII supported on GAMC0
>>
>> Signed-off-by: Sean Wang <sean.wang@mediatek.com>
>> ---
>>  Documentation/devicetree/bindings/net/mediatek-net.txt | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/devicetree/bindings/net/mediatek-net.txt b/Documentation/devicetree/bindings/net/mediatek-net.txt
>> index 6103e55..7111278 100644
>> --- a/Documentation/devicetree/bindings/net/mediatek-net.txt
>> +++ b/Documentation/devicetree/bindings/net/mediatek-net.txt
>> @@ -31,7 +31,10 @@ Optional properties:
>>  Required properties:
>>  - compatible: Should be "mediatek,eth-mac"
>>  - reg: The number of the MAC
>> -- phy-handle: see ethernet.txt file in the same directory.
>> +- phy-handle: see ethernet.txt file in the same directory and
>> +	the phy-mode "trgmii" required being provided when reg
>
>   Since you've modified the generic parser of the "phy-mode" to add your 
>"trgmii", you also need to update ethernet.txt...
>

>> +	is equal to 0 and the MAC uses fixed-link to connect
>> +	with inernal switch such as MT7530.
>
>   Internal.
>

thank you for taking time and effort to review this
I will fix it up

^ permalink raw reply

* [PATCH] fs/select: add vmalloc fallback for select(2)
From: Vlastimil Babka @ 2016-09-22 15:28 UTC (permalink / raw)
  To: Alexander Viro
  Cc: linux-fsdevel, linux-kernel, linux-mm, Michal Hocko, netdev,
	Vlastimil Babka

The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
with the number of fds passed. We had a customer report page allocation
failures of order-4 for this allocation. This is a costly order, so it might
easily fail, as the VM expects such allocation to have a lower-order fallback.

Such trivial fallback is vmalloc(), as the memory doesn't have to be
physically contiguous. Also the allocation is temporary for the duration of the
syscall, so it's unlikely to stress vmalloc too much.

Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
it doesn't need this kind of fallback.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 fs/select.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 8ed9da50896a..8fe5bddbe99b 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -29,6 +29,7 @@
 #include <linux/sched/rt.h>
 #include <linux/freezer.h>
 #include <net/busy_poll.h>
+#include <linux/vmalloc.h>
 
 #include <asm/uaccess.h>
 
@@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 	struct fdtable *fdt;
 	/* Allocate small arguments on the stack to save memory and be faster */
 	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
+	unsigned long alloc_size;
 
 	ret = -EINVAL;
 	if (n < 0)
@@ -580,10 +582,15 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 	bits = stack_fds;
 	if (size > sizeof(stack_fds) / 6) {
 		/* Not enough space in on-stack array; must use kmalloc */
+		alloc_size = 6 * size;
 		ret = -ENOMEM;
-		bits = kmalloc(6 * size, GFP_KERNEL);
-		if (!bits)
-			goto out_nofds;
+		bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
+		if (!bits && alloc_size > PAGE_SIZE) {
+			bits = vmalloc(alloc_size);
+
+			if (!bits)
+				goto out_nofds;
+		}
 	}
 	fds.in      = bits;
 	fds.out     = bits +   size;
@@ -618,7 +625,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 
 out:
 	if (bits != stack_fds)
-		kfree(bits);
+		kvfree(bits);
 out_nofds:
 	return ret;
 }
-- 
2.10.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs
From: Daniel Mack @ 2016-09-22 15:53 UTC (permalink / raw)
  To: Daniel Borkmann, Pablo Neira Ayuso, Thomas Graf
  Cc: htejun-b10kYP2dOMg, ast-b10kYP2dOMg, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	kafai-b10kYP2dOMg, fw-HFFVJYpyMKqzQB+pC5nmwQ,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <57E3F4F9.70300-FeC+5ew28dpmcu3hnIyYJQ@public.gmane.org>

On 09/22/2016 05:12 PM, Daniel Borkmann wrote:
> On 09/22/2016 02:05 PM, Pablo Neira Ayuso wrote:

>> Benefits are, rewording previous email:
>>
>> * You get access to all of the existing netfilter hooks in one go
>>    to run bpf programs. No need for specific redundant hooks. This
>>    provides raw access to the netfilter hook, you define the little
>>    code that your hook runs before you bpf run invocation. So there
>>    is *no need to bloat the stack with more hooks, we use what we
>>    have.*
> 
> But also this doesn't really address the fundamental underlying problem
> that is discussed here. nft doesn't even have cgroups v2 support, only
> xt_cgroups has it so far, but even if it would have it, then it's still
> a scalability issue that this model has over what is being proposed by
> Daniel, since you still need to test linearly wrt cgroups v2 membership,
> whereas in the set that is proposed it's integral part of cgroups and can
> be extended further, also for non-networking users to use this facility.
> Or would the idea be that the current netfilter hooks would be redone in
> a way that they are generic enough so that any other user could make use
> of it independent of netfilter?

Yes, that part I don't understand either.

Pablo, could you outline in more detail (in terms of syscalls, commands,
resulting nftables layout etc) how your proposed model would support
having per-cgroup byte and packet counters for both ingress and egress,
and filtering at least for ingress?

And how would that mitigate the race gaps you have been worried about,
between cgroup creation and filters taking effect for a task?


Thanks,
Daniel

^ permalink raw reply

* [PATCH net-next] net_sched: sch_fq: account for schedule/timers drifts
From: Eric Dumazet @ 2016-09-22 15:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

It looks like the following patch can make FQ very precise, even in VM
or stressed hosts. It matters at high pacing rates.

We take into account the difference between the time that was programmed
when last packet was sent, and current time (a drift of tens of usecs is
often observed)

Add an EWMA of the unthrottle latency to help diagnostics.

This latency is the difference between current time and oldest packet in
delayed RB-tree. This accounts for the high resolution timer latency,
but can be different under stress, as fq_check_throttled() can be
opportunistically be called from a dequeue() called after an enqueue()
for a different flow. 


Tested:
// Start a 10Gbit flow
$ netperf --google-pacing-rate 1250000000 -H lpaa24 -l 10000 -- -K bbr &

Before patch :
$ sar -n DEV 10 5 | grep eth0 | grep Average
Average:         eth0  17106.04 756876.84   1102.75 1119049.02      0.00      0.00      0.52

After patch :
$ sar -n DEV 10 5 | grep eth0 | grep Average
Average:         eth0  17867.00 800245.90   1151.77 1183172.12      0.00      0.00      0.52

A new iproute2 tc can output the 'unthrottle latency' :

$ tc -s qd sh dev eth0 | grep latency
  0 gc, 0 highprio, 32490767 throttled, 2382 ns latency

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/uapi/linux/pkt_sched.h |    2 +-
 net/sched/sch_fq.c             |   21 ++++++++++++++++++---
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index f8e39dbaa781..df7451d35131 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -811,7 +811,7 @@ struct tc_fq_qd_stats {
 	__u32	flows;
 	__u32	inactive_flows;
 	__u32	throttled_flows;
-	__u32	pad;
+	__u32	unthrottle_latency_ns;
 };
 
 /* Heavy-Hitter Filter */
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 5dd929cc1423..18e752439f6f 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -86,6 +86,7 @@ struct fq_sched_data {
 
 	struct rb_root	delayed;	/* for rate limited flows */
 	u64		time_next_delayed_flow;
+	unsigned long	unthrottle_latency_ns;
 
 	struct fq_flow	internal;	/* for non classified or high prio packets */
 	u32		quantum;
@@ -408,11 +409,19 @@ static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 
 static void fq_check_throttled(struct fq_sched_data *q, u64 now)
 {
+	unsigned long sample;
 	struct rb_node *p;
 
 	if (q->time_next_delayed_flow > now)
 		return;
 
+	/* Update unthrottle latency EWMA.
+	 * This is cheap and can help diagnosing timer/latency problems.
+	 */
+	sample = (unsigned long)(now - q->time_next_delayed_flow);
+	q->unthrottle_latency_ns -= q->unthrottle_latency_ns >> 3;
+	q->unthrottle_latency_ns += sample >> 3;
+
 	q->time_next_delayed_flow = ~0ULL;
 	while ((p = rb_first(&q->delayed)) != NULL) {
 		struct fq_flow *f = container_of(p, struct fq_flow, rate_node);
@@ -515,7 +524,12 @@ begin:
 			len = NSEC_PER_SEC;
 			q->stat_pkts_too_long++;
 		}
-
+		/* Account for schedule/timers drifts.
+		 * f->time_next_packet was set when prior packet was sent,
+		 * and current time (@now) can be too late by tens of us.
+		 */
+		if (f->time_next_packet)
+			len -= min(len/2, now - f->time_next_packet);
 		f->time_next_packet = now + len;
 	}
 out:
@@ -787,6 +801,7 @@ static int fq_init(struct Qdisc *sch, struct nlattr *opt)
 	q->initial_quantum	= 10 * psched_mtu(qdisc_dev(sch));
 	q->flow_refill_delay	= msecs_to_jiffies(40);
 	q->flow_max_rate	= ~0U;
+	q->time_next_delayed_flow = ~0ULL;
 	q->rate_enable		= 1;
 	q->new_flows.first	= NULL;
 	q->old_flows.first	= NULL;
@@ -854,8 +869,8 @@ static int fq_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
 	st.flows		  = q->flows;
 	st.inactive_flows	  = q->inactive_flows;
 	st.throttled_flows	  = q->throttled_flows;
-	st.pad			  = 0;
-
+	st.unthrottle_latency_ns  = min_t(unsigned long,
+					  q->unthrottle_latency_ns, ~0U);
 	sch_tree_unlock(sch);
 
 	return gnet_stats_copy_app(d, &st, sizeof(st));

^ permalink raw reply related

* Re: [PATCH iproute2 net-next] tc: m_vlan: Add vlan modify action
From: Stephen Hemminger @ 2016-09-22 16:05 UTC (permalink / raw)
  To: Shmulik Ladkani; +Cc: netdev, Jamal Hadi Salim, Jiri Pirko, Shmulik Ladkani
In-Reply-To: <1474536670-2495-1-git-send-email-shmulik.ladkani@gmail.com>

On Thu, 22 Sep 2016 12:31:10 +0300
Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:

> +
> +static const char *action_name(int action)
> +{
> +	static const char * const names[] = {
> +		[TCA_VLAN_ACT_POP] = "pop",
> +		[TCA_VLAN_ACT_PUSH] = "push",
> +		[TCA_VLAN_ACT_MODIFY] = "modify",
> +	};
> +	return names[action];
> +}
> +

Why are you wrapping a simple array lookup in a function?

^ permalink raw reply

* Re: [PATCH net-next 2/3] udp: implement memory accounting helpers
From: Paolo Abeni @ 2016-09-22 16:14 UTC (permalink / raw)
  To: Edward Cree
  Cc: Eric Dumazet, netdev-u79uwXL29TY76Z2rM5mHXA, David S. Miller,
	James Morris, Trond Myklebust, Alexander Duyck, Daniel Borkmann,
	Eric Dumazet, Tom Herbert, Hannes Frederic Sowa,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <589839b3-5930-2527-b0a3-315be254a175-s/n/eUQHGBpZroRs9YW3xA@public.gmane.org>

On Thu, 2016-09-22 at 16:21 +0100, Edward Cree wrote:
> On 22/09/16 11:33, Paolo Abeni wrote:
> > Hi Eric,
> >
> > On Wed, 2016-09-21 at 16:31 -0700, Eric Dumazet wrote:
> >> Also does inet_diag properly give the forward_alloc to user ?
> >>
> >> $ ss -mua
> >> State      Recv-Q Send-Q Local Address:Port                 Peer Addres
> >> s:Port
> >> UNCONN     51584  0          *:52460      *:*                    
> >> 	 skmem:(r51584,rb327680,t0,tb327680,f1664,w0,o0,bl0,d575)
> > Thank you very much for reviewing this! 
> >
> > My bad, there is still a race which leads to temporary negative values
> > of fwd. I feel the fix is trivial but it needs some investigation.
> >
> >> Couldn't we instead use an union of an atomic_t and int for
> >> sk->sk_forward_alloc ?
> > That was our first attempt, but we had some issue on mem scheduling; if
> > we use:
> >
> >    if (atomic_sub_return(size, &sk->sk_forward_alloc_atomic) < 0) {
> >         // fwd alloc
> >    }
> >
> > that leads to inescapable, temporary, negative value for
> > sk->sk_forward_alloc.
> >
> > Another option would be:
> >
> > again:
> >         fwd = atomic_read(&sk->sk_forward_alloc_atomic);
> >         if (fwd > size) {
> > 		if (atomic_cmpxchg(&sk->sk_forward_alloc_atomic, fwd, fwd - size) != fwd)
> > 			goto again;
> >         } else 
> >             // fwd alloc
> >
> > which would be bad under high contention.
> Apologies if I'm misunderstanding the problem, but couldn't you have two
> atomic_t fields, 'internal' and 'external' forward_alloc.  Then
>     if (atomic_sub_return(size, &sk->sk_forward_alloc_internal) < 0) {
>         atomic_sub(size, &sk->sk_forward_alloc);
>         // fwd alloc
>     } else {
>         atomic_add(size, &sk->sk_forward_alloc_internal);
>     }
> or something like that.  Then sk->sk_forward_alloc never sees a negative
> value, and is always >= sk->sk_forward_alloc_internal.  Of course places
> that go the other way would have to add to sk->sk_forward_alloc first,
> before adding to sk->sk_forward_alloc_internal, to maintain that invariant.

I think that the idea behind using atomic ops directly on
sk_forward_alloc is to avoid adding other fields to the udp_socket. 

If we can add some fields to the udp_sock structure, the schema proposed
in this patch should fit better (modulo bugs ;-), always requiring a
single atomic operation at memory reclaiming time and at memory
allocation time.

Paolo

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH net-next] Documentation: devicetree: revise ethernet device-tree binding about TRGMII
From: sean.wang @ 2016-09-22 16:16 UTC (permalink / raw)
  To: davem, sergei.shtylyov, john
  Cc: nbd, netdev, linux-kernel, linux-mediatek, devicetree, keyhaede,
	objelf, Sean Wang

From: Sean Wang <sean.wang@mediatek.com>

fix typo in mediatek-net.txt and add phy-mode "trgmii" to ethernet.txt

Cc: devicetree@vger.kernel.org
Reported-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
---
 Documentation/devicetree/bindings/net/ethernet.txt     | 4 ++--
 Documentation/devicetree/bindings/net/mediatek-net.txt | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/ethernet.txt b/Documentation/devicetree/bindings/net/ethernet.txt
index 5d88f37..e1d7681 100644
--- a/Documentation/devicetree/bindings/net/ethernet.txt
+++ b/Documentation/devicetree/bindings/net/ethernet.txt
@@ -11,8 +11,8 @@ The following properties are common to the Ethernet controllers:
   the maximum frame size (there's contradiction in ePAPR).
 - phy-mode: string, operation mode of the PHY interface; supported values are
   "mii", "gmii", "sgmii", "qsgmii", "tbi", "rev-mii", "rmii", "rgmii", "rgmii-id",
-  "rgmii-rxid", "rgmii-txid", "rtbi", "smii", "xgmii"; this is now a de-facto
-  standard property;
+  "rgmii-rxid", "rgmii-txid", "rtbi", "smii", "xgmii", "trgmii"; this is now a
+  de-facto standard property;
 - phy-connection-type: the same as "phy-mode" property but described in ePAPR;
 - phy-handle: phandle, specifies a reference to a node representing a PHY
   device; this property is described in ePAPR and so preferred;
diff --git a/Documentation/devicetree/bindings/net/mediatek-net.txt b/Documentation/devicetree/bindings/net/mediatek-net.txt
index bae848e..da05f8b 100644
--- a/Documentation/devicetree/bindings/net/mediatek-net.txt
+++ b/Documentation/devicetree/bindings/net/mediatek-net.txt
@@ -34,7 +34,7 @@ Required properties:
 - phy-handle: see ethernet.txt file in the same directory and
 	the phy-mode "trgmii" required being provided when reg
 	is equal to 0 and the MAC uses fixed-link to connect
-	with inernal switch such as MT7530.
+	with internal switch such as MT7530.
 
 Example:
 
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH] fs/select: add vmalloc fallback for select(2)
From: Eric Dumazet @ 2016-09-22 16:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, linux-mm,
	Michal Hocko, netdev
In-Reply-To: <20160922152831.24165-1-vbabka@suse.cz>

On Thu, 2016-09-22 at 17:28 +0200, Vlastimil Babka wrote:
> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> with the number of fds passed. We had a customer report page allocation
> failures of order-4 for this allocation. This is a costly order, so it might
> easily fail, as the VM expects such allocation to have a lower-order fallback.
> 
> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> physically contiguous. Also the allocation is temporary for the duration of the
> syscall, so it's unlikely to stress vmalloc too much.
> 
> Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
> it doesn't need this kind of fallback.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  fs/select.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/select.c b/fs/select.c
> index 8ed9da50896a..8fe5bddbe99b 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -29,6 +29,7 @@
>  #include <linux/sched/rt.h>
>  #include <linux/freezer.h>
>  #include <net/busy_poll.h>
> +#include <linux/vmalloc.h>
>  
>  #include <asm/uaccess.h>
>  
> @@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  	struct fdtable *fdt;
>  	/* Allocate small arguments on the stack to save memory and be faster */
>  	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
> +	unsigned long alloc_size;
>  
>  	ret = -EINVAL;
>  	if (n < 0)
> @@ -580,10 +582,15 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  	bits = stack_fds;
>  	if (size > sizeof(stack_fds) / 6) {
>  		/* Not enough space in on-stack array; must use kmalloc */
> +		alloc_size = 6 * size;
>  		ret = -ENOMEM;
> -		bits = kmalloc(6 * size, GFP_KERNEL);
> -		if (!bits)
> -			goto out_nofds;
> +		bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
> +		if (!bits && alloc_size > PAGE_SIZE) {
> +			bits = vmalloc(alloc_size);
> +
> +			if (!bits)
> +				goto out_nofds;

Test should happen if alloc_size <= PAGE_SIZE

> +		}

if (!bits && alloc_size > PAGE_SIZE)
    bits = vmalloc(alloc_size);

if (!bits)
      goto out_nofds;



>  	}
>  	fds.in      = bits;
>  	fds.out     = bits +   size;
> @@ -618,7 +625,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  
>  out:
>  	if (bits != stack_fds)
> -		kfree(bits);
> +		kvfree(bits);
>  out_nofds:
>  	return ret;
>  }


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH net 2/2] act_ife: Fix false parsing on decode side
From: Yotam Gigi @ 2016-09-22 12:55 UTC (permalink / raw)
  To: jhs, davem, netdev; +Cc: Yotam Gigi
In-Reply-To: <1474548926-22815-1-git-send-email-yotamg@mellanox.com>

On ife decode side, the action iterates over the tlvs in the ife header
and parses them one by one, where in each iteration the current pointer is
advanced according to the tlv size.

Before, the pointer was advanced in a wrong way, as the tlv type and size
bytes were not taken into account. This led to false parsing of ife
header. In addition, due to the fact that the loop counter was unsigned,
it could lead to infinite parsing loop.

This fix changes the loop counter to be signed and fixes the parsing to
take into account the tlv type and size.

Fixes: ef6980b6becb ("net sched: introduce IFE action")
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
---
 net/sched/act_ife.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 7f71a3d..4786b28 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -627,7 +627,7 @@ static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
 	struct tcf_ife_info *ife = to_ife(a);
 	int action = ife->tcf_action;
 	struct ifeheadr *ifehdr = (struct ifeheadr *)skb->data;
-	u16 ifehdrln = ifehdr->metalen;
+	int ifehdrln = (int)ifehdr->metalen;
 	struct meta_tlvhdr *tlv = (struct meta_tlvhdr *)(ifehdr->tlv_data);
 
 	spin_lock(&ife->tcf_lock);
@@ -668,8 +668,8 @@ static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
 			ife->tcf_qstats.overlimits++;
 		}
 
-		tlvdata += alen;
-		ifehdrln -= alen;
+		tlvdata += alen + sizeof(struct meta_tlvhdr);
+		ifehdrln -= alen + sizeof(struct meta_tlvhdr);
 		tlv = (struct meta_tlvhdr *)tlvdata;
 	}
 
-- 
2.4.11

^ permalink raw reply related

* [PATCH net 0/2] Fix tc-ife bugs
From: Yotam Gigi @ 2016-09-22 12:55 UTC (permalink / raw)
  To: jhs, davem, netdev; +Cc: Yotam Gigi

This patch-set contains two bugfixes in the tc-ife action, one fixing some
random behaviour in encode side, and one fixing the decode side packet
parsing logic.

Yotam Gigi (2):
  act_ife: Fix external mac header on encode
  act_ife: Fix false parsing on decode side

 net/sched/act_ife.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

-- 
2.4.11

^ permalink raw reply

* Re: [PATCH net-next 2/3] udp: implement memory accounting helpers
From: Eric Dumazet @ 2016-09-22 16:30 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Edward Cree, netdev, David S. Miller, James Morris,
	Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet,
	Tom Herbert, Hannes Frederic Sowa, linux-nfs
In-Reply-To: <1474560864.4845.78.camel@redhat.com>

On Thu, 2016-09-22 at 18:14 +0200, Paolo Abeni wrote:

> I think that the idea behind using atomic ops directly on
> sk_forward_alloc is to avoid adding other fields to the udp_socket. 
> 
> If we can add some fields to the udp_sock structure, the schema proposed
> in this patch should fit better (modulo bugs ;-), always requiring a
> single atomic operation at memory reclaiming time and at memory
> allocation time.

But do we want any additional atomic to begin with ?

Given typical number of UDP sockets on a host, we could reserve/forward
alloc at socket creation time, and when SO_RCVBUF is changed.

I totally understand that this schem does not work with TCP, because TCP
BDP mandates xx MB of buffers per socket, and we really handle millions
of TCP sockets per host,

But for UDP, most applications are working well with a default limit,
and we hardly have more than 1000 UDP sockets.

cat /proc/sys/net/core/rmem_default
212992

^ permalink raw reply

* Re: [PATCH] fs/select: add vmalloc fallback for select(2)
From: Vlastimil Babka @ 2016-09-22 16:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, linux-mm,
	Michal Hocko, netdev
In-Reply-To: <1474561478.23058.127.camel@edumazet-glaptop3.roam.corp.google.com>

On 09/22/2016 06:24 PM, Eric Dumazet wrote:

>> +		bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
>> +		if (!bits && alloc_size > PAGE_SIZE) {
>> +			bits = vmalloc(alloc_size);
>> +
>> +			if (!bits)
>> +				goto out_nofds;
>
> Test should happen if alloc_size <= PAGE_SIZE
>
>> +		}
>
> if (!bits && alloc_size > PAGE_SIZE)
>     bits = vmalloc(alloc_size);
>
> if (!bits)
>       goto out_nofds;
>

Thanks... stupid last-minute changes.

^ permalink raw reply

* [PATCH v3 1/2] netfilter: Prepare xt_hashlimit.c for revision 2
From: Vishwanath Pai @ 2016-09-22 16:42 UTC (permalink / raw)
  To: pablo, kaber, kadlec
  Cc: johunt, netfilter-devel, coreteam, netdev, pai.vishwain

I am planning to add a revision 2 for the hashlimit xtables module to
support higher packets per second rates. This patch renames all the
functions and variables related to revision 1 by adding _v1 at the
end of the names.

Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Joshua Hunt <johunt@akamai.com>
---
 include/uapi/linux/netfilter/xt_hashlimit.h |  2 +-
 net/netfilter/xt_hashlimit.c                | 61 +++++++++++++++--------------
 2 files changed, 32 insertions(+), 31 deletions(-)

diff --git a/include/uapi/linux/netfilter/xt_hashlimit.h b/include/uapi/linux/netfilter/xt_hashlimit.h
index 6db9037..ea8c1c0 100644
--- a/include/uapi/linux/netfilter/xt_hashlimit.h
+++ b/include/uapi/linux/netfilter/xt_hashlimit.h
@@ -5,7 +5,7 @@
 #include <linux/if.h>
 
 /* timings are in milliseconds. */
-#define XT_HASHLIMIT_SCALE 10000
+#define XT_HASHLIMIT_SCALE_v1 10000
 /* 1/10,000 sec period => max of 10,000/sec.  Min rate is then 429490
  * seconds, or one packet every 59 hours.
  */
diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 1786968..6b0ad93 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -56,7 +56,7 @@ static inline struct hashlimit_net *hashlimit_pernet(struct net *net)
 }
 
 /* need to declare this at the top */
-static const struct file_operations dl_file_ops;
+static const struct file_operations dl_file_ops_v1;
 
 /* hash table crap */
 struct dsthash_dst {
@@ -215,8 +215,8 @@ dsthash_free(struct xt_hashlimit_htable *ht, struct dsthash_ent *ent)
 }
 static void htable_gc(struct work_struct *work);
 
-static int htable_create(struct net *net, struct xt_hashlimit_mtinfo1 *minfo,
-			 u_int8_t family)
+static int htable_create_v1(struct net *net, struct xt_hashlimit_mtinfo1 *minfo,
+			    u_int8_t family)
 {
 	struct hashlimit_net *hashlimit_net = hashlimit_pernet(net);
 	struct xt_hashlimit_htable *hinfo;
@@ -265,7 +265,7 @@ static int htable_create(struct net *net, struct xt_hashlimit_mtinfo1 *minfo,
 	hinfo->pde = proc_create_data(minfo->name, 0,
 		(family == NFPROTO_IPV4) ?
 		hashlimit_net->ipt_hashlimit : hashlimit_net->ip6t_hashlimit,
-		&dl_file_ops, hinfo);
+		&dl_file_ops_v1, hinfo);
 	if (hinfo->pde == NULL) {
 		kfree(hinfo->name);
 		vfree(hinfo);
@@ -398,7 +398,7 @@ static void htable_put(struct xt_hashlimit_htable *hinfo)
    (slowest userspace tool allows), which means
    CREDITS_PER_JIFFY*HZ*60*60*24 < 2^32 ie.
 */
-#define MAX_CPJ (0xFFFFFFFF / (HZ*60*60*24))
+#define MAX_CPJ_v1 (0xFFFFFFFF / (HZ*60*60*24))
 
 /* Repeated shift and or gives us all 1s, final shift and add 1 gives
  * us the power of 2 below the theoretical max, so GCC simply does a
@@ -410,7 +410,7 @@ static void htable_put(struct xt_hashlimit_htable *hinfo)
 #define _POW2_BELOW32(x) (_POW2_BELOW16(x)|_POW2_BELOW16((x)>>16))
 #define POW2_BELOW32(x) ((_POW2_BELOW32(x)>>1) + 1)
 
-#define CREDITS_PER_JIFFY POW2_BELOW32(MAX_CPJ)
+#define CREDITS_PER_JIFFY_v1 POW2_BELOW32(MAX_CPJ_v1)
 
 /* in byte mode, the lowest possible rate is one packet/second.
  * credit_cap is used as a counter that tells us how many times we can
@@ -428,11 +428,12 @@ static u32 xt_hashlimit_len_to_chunks(u32 len)
 static u32 user2credits(u32 user)
 {
 	/* If multiplying would overflow... */
-	if (user > 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY))
+	if (user > 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY_v1))
 		/* Divide first. */
-		return (user / XT_HASHLIMIT_SCALE) * HZ * CREDITS_PER_JIFFY;
+		return (user / XT_HASHLIMIT_SCALE_v1) *\
+					HZ * CREDITS_PER_JIFFY_v1;
 
-	return (user * HZ * CREDITS_PER_JIFFY) / XT_HASHLIMIT_SCALE;
+	return (user * HZ * CREDITS_PER_JIFFY_v1) / XT_HASHLIMIT_SCALE_v1;
 }
 
 static u32 user2credits_byte(u32 user)
@@ -461,7 +462,7 @@ static void rateinfo_recalc(struct dsthash_ent *dh, unsigned long now, u32 mode)
 			return;
 		}
 	} else {
-		dh->rateinfo.credit += delta * CREDITS_PER_JIFFY;
+		dh->rateinfo.credit += delta * CREDITS_PER_JIFFY_v1;
 		cap = dh->rateinfo.credit_cap;
 	}
 	if (dh->rateinfo.credit > cap)
@@ -603,7 +604,7 @@ static u32 hashlimit_byte_cost(unsigned int len, struct dsthash_ent *dh)
 }
 
 static bool
-hashlimit_mt(const struct sk_buff *skb, struct xt_action_param *par)
+hashlimit_mt_v1(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct xt_hashlimit_mtinfo1 *info = par->matchinfo;
 	struct xt_hashlimit_htable *hinfo = info->hinfo;
@@ -660,7 +661,7 @@ hashlimit_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	return false;
 }
 
-static int hashlimit_mt_check(const struct xt_mtchk_param *par)
+static int hashlimit_mt_check_v1(const struct xt_mtchk_param *par)
 {
 	struct net *net = par->net;
 	struct xt_hashlimit_mtinfo1 *info = par->matchinfo;
@@ -701,7 +702,7 @@ static int hashlimit_mt_check(const struct xt_mtchk_param *par)
 	mutex_lock(&hashlimit_mutex);
 	info->hinfo = htable_find_get(net, info->name, par->family);
 	if (info->hinfo == NULL) {
-		ret = htable_create(net, info, par->family);
+		ret = htable_create_v1(net, info, par->family);
 		if (ret < 0) {
 			mutex_unlock(&hashlimit_mutex);
 			return ret;
@@ -711,7 +712,7 @@ static int hashlimit_mt_check(const struct xt_mtchk_param *par)
 	return 0;
 }
 
-static void hashlimit_mt_destroy(const struct xt_mtdtor_param *par)
+static void hashlimit_mt_destroy_v1(const struct xt_mtdtor_param *par)
 {
 	const struct xt_hashlimit_mtinfo1 *info = par->matchinfo;
 
@@ -723,10 +724,10 @@ static struct xt_match hashlimit_mt_reg[] __read_mostly = {
 		.name           = "hashlimit",
 		.revision       = 1,
 		.family         = NFPROTO_IPV4,
-		.match          = hashlimit_mt,
+		.match          = hashlimit_mt_v1,
 		.matchsize      = sizeof(struct xt_hashlimit_mtinfo1),
-		.checkentry     = hashlimit_mt_check,
-		.destroy        = hashlimit_mt_destroy,
+		.checkentry     = hashlimit_mt_check_v1,
+		.destroy        = hashlimit_mt_destroy_v1,
 		.me             = THIS_MODULE,
 	},
 #if IS_ENABLED(CONFIG_IP6_NF_IPTABLES)
@@ -734,10 +735,10 @@ static struct xt_match hashlimit_mt_reg[] __read_mostly = {
 		.name           = "hashlimit",
 		.revision       = 1,
 		.family         = NFPROTO_IPV6,
-		.match          = hashlimit_mt,
+		.match          = hashlimit_mt_v1,
 		.matchsize      = sizeof(struct xt_hashlimit_mtinfo1),
-		.checkentry     = hashlimit_mt_check,
-		.destroy        = hashlimit_mt_destroy,
+		.checkentry     = hashlimit_mt_check_v1,
+		.destroy        = hashlimit_mt_destroy_v1,
 		.me             = THIS_MODULE,
 	},
 #endif
@@ -786,8 +787,8 @@ static void dl_seq_stop(struct seq_file *s, void *v)
 	spin_unlock_bh(&htable->lock);
 }
 
-static int dl_seq_real_show(struct dsthash_ent *ent, u_int8_t family,
-				   struct seq_file *s)
+static int dl_seq_real_show_v1(struct dsthash_ent *ent, u_int8_t family,
+			       struct seq_file *s)
 {
 	const struct xt_hashlimit_htable *ht = s->private;
 
@@ -825,7 +826,7 @@ static int dl_seq_real_show(struct dsthash_ent *ent, u_int8_t family,
 	return seq_has_overflowed(s);
 }
 
-static int dl_seq_show(struct seq_file *s, void *v)
+static int dl_seq_show_v1(struct seq_file *s, void *v)
 {
 	struct xt_hashlimit_htable *htable = s->private;
 	unsigned int *bucket = (unsigned int *)v;
@@ -833,22 +834,22 @@ static int dl_seq_show(struct seq_file *s, void *v)
 
 	if (!hlist_empty(&htable->hash[*bucket])) {
 		hlist_for_each_entry(ent, &htable->hash[*bucket], node)
-			if (dl_seq_real_show(ent, htable->family, s))
+			if (dl_seq_real_show_v1(ent, htable->family, s))
 				return -1;
 	}
 	return 0;
 }
 
-static const struct seq_operations dl_seq_ops = {
+static const struct seq_operations dl_seq_ops_v1 = {
 	.start = dl_seq_start,
 	.next  = dl_seq_next,
 	.stop  = dl_seq_stop,
-	.show  = dl_seq_show
+	.show  = dl_seq_show_v1
 };
 
-static int dl_proc_open(struct inode *inode, struct file *file)
+static int dl_proc_open_v1(struct inode *inode, struct file *file)
 {
-	int ret = seq_open(file, &dl_seq_ops);
+	int ret = seq_open(file, &dl_seq_ops_v1);
 
 	if (!ret) {
 		struct seq_file *sf = file->private_data;
@@ -857,9 +858,9 @@ static int dl_proc_open(struct inode *inode, struct file *file)
 	return ret;
 }
 
-static const struct file_operations dl_file_ops = {
+static const struct file_operations dl_file_ops_v1 = {
 	.owner   = THIS_MODULE,
-	.open    = dl_proc_open,
+	.open    = dl_proc_open_v1,
 	.read    = seq_read,
 	.llseek  = seq_lseek,
 	.release = seq_release
-- 
1.9.1

^ permalink raw reply related

* [PATCH v3 2/2] netfilter: Create revision 2 of xt_hashlimit to support higher pps rates
From: Vishwanath Pai @ 2016-09-22 16:43 UTC (permalink / raw)
  To: pablo, kaber, kadlec
  Cc: johunt, netfilter-devel, coreteam, netdev, pai.vishwain

V2:
Removed the call to BUG() in cfg_copy, we return -EINVAL
instead and all calls to cfg_copy check for this

V3:
change "revision" in the call to cfg_copy inside htable_create to 2
previously this would pass down revision from the function parameter
this is wrong since *cfg here is always hashlimit_cfg2

There is no change in userspace patch so V2 will work

--

netfilter: Create revision 2 of xt_hashlimit to support higher pps rates

Create a new revision for the hashlimit iptables extension module. Rev 2
will support higher pps of upto 1 million, Version 1 supports only 10k.

To support this we have to increase the size of the variables avg and
burst in hashlimit_cfg to 64-bit. Create two new structs hashlimit_cfg2
and xt_hashlimit_mtinfo2 and also create newer versions of all the
functions for match, checkentry and destroy.

Some of the functions like hashlimit_mt, hashlimit_mt_check etc are very
similar in both rev1 and rev2 with only minor changes, so I have split
those functions and moved all the common code to a *_common function.

Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Joshua Hunt <johunt@akamai.com>
---
 include/uapi/linux/netfilter/xt_hashlimit.h |  23 ++
 net/netfilter/xt_hashlimit.c                | 330 ++++++++++++++++++++++------
 2 files changed, 285 insertions(+), 68 deletions(-)

diff --git a/include/uapi/linux/netfilter/xt_hashlimit.h b/include/uapi/linux/netfilter/xt_hashlimit.h
index ea8c1c0..be5d2e1 100644
--- a/include/uapi/linux/netfilter/xt_hashlimit.h
+++ b/include/uapi/linux/netfilter/xt_hashlimit.h
@@ -6,6 +6,7 @@
 
 /* timings are in milliseconds. */
 #define XT_HASHLIMIT_SCALE_v1 10000
+#define XT_HASHLIMIT_SCALE 1000000llu
 /* 1/10,000 sec period => max of 10,000/sec.  Min rate is then 429490
  * seconds, or one packet every 59 hours.
  */
@@ -63,6 +64,20 @@ struct hashlimit_cfg1 {
 	__u8 srcmask, dstmask;
 };
 
+struct hashlimit_cfg2 {
+	__u32 mode;	  /* bitmask of XT_HASHLIMIT_HASH_* */
+	__u64 avg;    /* Average secs between packets * scale */
+	__u64 burst;  /* Period multiplier for upper limit. */
+
+	/* user specified */
+	__u32 size;		/* how many buckets */
+	__u32 max;		/* max number of entries */
+	__u32 gc_interval;	/* gc interval */
+	__u32 expire;	/* when do entries expire? */
+
+	__u8 srcmask, dstmask;
+};
+
 struct xt_hashlimit_mtinfo1 {
 	char name[IFNAMSIZ];
 	struct hashlimit_cfg1 cfg;
@@ -71,4 +86,12 @@ struct xt_hashlimit_mtinfo1 {
 	struct xt_hashlimit_htable *hinfo __attribute__((aligned(8)));
 };
 
+struct xt_hashlimit_mtinfo2 {
+	char name[NAME_MAX];
+	struct hashlimit_cfg2 cfg;
+
+	/* Used internally by the kernel */
+	struct xt_hashlimit_htable *hinfo __attribute__((aligned(8)));
+};
+
 #endif /* _UAPI_XT_HASHLIMIT_H */
diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 6b0ad93..64579f3 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -57,6 +57,7 @@ static inline struct hashlimit_net *hashlimit_pernet(struct net *net)
 
 /* need to declare this at the top */
 static const struct file_operations dl_file_ops_v1;
+static const struct file_operations dl_file_ops;
 
 /* hash table crap */
 struct dsthash_dst {
@@ -86,8 +87,8 @@ struct dsthash_ent {
 	unsigned long expires;		/* precalculated expiry time */
 	struct {
 		unsigned long prev;	/* last modification */
-		u_int32_t credit;
-		u_int32_t credit_cap, cost;
+		u_int64_t credit;
+		u_int64_t credit_cap, cost;
 	} rateinfo;
 	struct rcu_head rcu;
 };
@@ -98,7 +99,7 @@ struct xt_hashlimit_htable {
 	u_int8_t family;
 	bool rnd_initialized;
 
-	struct hashlimit_cfg1 cfg;	/* config */
+	struct hashlimit_cfg2 cfg;	/* config */
 
 	/* used internally */
 	spinlock_t lock;		/* lock for list_head */
@@ -114,6 +115,30 @@ struct xt_hashlimit_htable {
 	struct hlist_head hash[0];	/* hashtable itself */
 };
 
+static int
+cfg_copy(struct hashlimit_cfg2 *to, void *from, int revision)
+{
+	if (revision == 1) {
+		struct hashlimit_cfg1 *cfg = (struct hashlimit_cfg1 *)from;
+
+		to->mode = cfg->mode;
+		to->avg = cfg->avg;
+		to->burst = cfg->burst;
+		to->size = cfg->size;
+		to->max = cfg->max;
+		to->gc_interval = cfg->gc_interval;
+		to->expire = cfg->expire;
+		to->srcmask = cfg->srcmask;
+		to->dstmask = cfg->dstmask;
+	} else if (revision == 2) {
+		memcpy(to, from, sizeof(struct hashlimit_cfg2));
+	} else {
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static DEFINE_MUTEX(hashlimit_mutex);	/* protects htables list */
 static struct kmem_cache *hashlimit_cachep __read_mostly;
 
@@ -215,16 +240,18 @@ dsthash_free(struct xt_hashlimit_htable *ht, struct dsthash_ent *ent)
 }
 static void htable_gc(struct work_struct *work);
 
-static int htable_create_v1(struct net *net, struct xt_hashlimit_mtinfo1 *minfo,
-			    u_int8_t family)
+static int htable_create(struct net *net, struct hashlimit_cfg2 *cfg,
+			 const char *name, u_int8_t family,
+			 struct xt_hashlimit_htable **out_hinfo,
+			 int revision)
 {
 	struct hashlimit_net *hashlimit_net = hashlimit_pernet(net);
 	struct xt_hashlimit_htable *hinfo;
-	unsigned int size;
-	unsigned int i;
+	unsigned int size, i;
+	int ret;
 
-	if (minfo->cfg.size) {
-		size = minfo->cfg.size;
+	if (cfg->size) {
+		size = cfg->size;
 	} else {
 		size = (totalram_pages << PAGE_SHIFT) / 16384 /
 		       sizeof(struct list_head);
@@ -238,10 +265,14 @@ static int htable_create_v1(struct net *net, struct xt_hashlimit_mtinfo1 *minfo,
 	                sizeof(struct list_head) * size);
 	if (hinfo == NULL)
 		return -ENOMEM;
-	minfo->hinfo = hinfo;
+	*out_hinfo = hinfo;
 
 	/* copy match config into hashtable config */
-	memcpy(&hinfo->cfg, &minfo->cfg, sizeof(hinfo->cfg));
+	ret = cfg_copy(&hinfo->cfg, (void *)cfg, 2);
+
+	if (ret)
+		return ret;
+
 	hinfo->cfg.size = size;
 	if (hinfo->cfg.max == 0)
 		hinfo->cfg.max = 8 * hinfo->cfg.size;
@@ -255,17 +286,18 @@ static int htable_create_v1(struct net *net, struct xt_hashlimit_mtinfo1 *minfo,
 	hinfo->count = 0;
 	hinfo->family = family;
 	hinfo->rnd_initialized = false;
-	hinfo->name = kstrdup(minfo->name, GFP_KERNEL);
+	hinfo->name = kstrdup(name, GFP_KERNEL);
 	if (!hinfo->name) {
 		vfree(hinfo);
 		return -ENOMEM;
 	}
 	spin_lock_init(&hinfo->lock);
 
-	hinfo->pde = proc_create_data(minfo->name, 0,
+	hinfo->pde = proc_create_data(name, 0,
 		(family == NFPROTO_IPV4) ?
 		hashlimit_net->ipt_hashlimit : hashlimit_net->ip6t_hashlimit,
-		&dl_file_ops_v1, hinfo);
+		(revision == 1) ? &dl_file_ops_v1 : &dl_file_ops,
+		hinfo);
 	if (hinfo->pde == NULL) {
 		kfree(hinfo->name);
 		vfree(hinfo);
@@ -399,6 +431,7 @@ static void htable_put(struct xt_hashlimit_htable *hinfo)
    CREDITS_PER_JIFFY*HZ*60*60*24 < 2^32 ie.
 */
 #define MAX_CPJ_v1 (0xFFFFFFFF / (HZ*60*60*24))
+#define MAX_CPJ (0xFFFFFFFFFFFFFFFF / (HZ*60*60*24))
 
 /* Repeated shift and or gives us all 1s, final shift and add 1 gives
  * us the power of 2 below the theoretical max, so GCC simply does a
@@ -408,8 +441,11 @@ static void htable_put(struct xt_hashlimit_htable *hinfo)
 #define _POW2_BELOW8(x) (_POW2_BELOW4(x)|_POW2_BELOW4((x)>>4))
 #define _POW2_BELOW16(x) (_POW2_BELOW8(x)|_POW2_BELOW8((x)>>8))
 #define _POW2_BELOW32(x) (_POW2_BELOW16(x)|_POW2_BELOW16((x)>>16))
+#define _POW2_BELOW64(x) (_POW2_BELOW32(x)|_POW2_BELOW32((x)>>32))
 #define POW2_BELOW32(x) ((_POW2_BELOW32(x)>>1) + 1)
+#define POW2_BELOW64(x) ((_POW2_BELOW64(x)>>1) + 1)
 
+#define CREDITS_PER_JIFFY POW2_BELOW64(MAX_CPJ)
 #define CREDITS_PER_JIFFY_v1 POW2_BELOW32(MAX_CPJ_v1)
 
 /* in byte mode, the lowest possible rate is one packet/second.
@@ -425,15 +461,24 @@ static u32 xt_hashlimit_len_to_chunks(u32 len)
 }
 
 /* Precision saver. */
-static u32 user2credits(u32 user)
+static u64 user2credits(u64 user, int revision)
 {
-	/* If multiplying would overflow... */
-	if (user > 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY_v1))
-		/* Divide first. */
-		return (user / XT_HASHLIMIT_SCALE_v1) *\
-					HZ * CREDITS_PER_JIFFY_v1;
+	if (revision == 1) {
+		/* If multiplying would overflow... */
+		if (user > 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY_v1))
+			/* Divide first. */
+			return (user / XT_HASHLIMIT_SCALE_v1) *\
+						HZ * CREDITS_PER_JIFFY_v1;
+
+		return (user * HZ * CREDITS_PER_JIFFY_v1) \
+						/ XT_HASHLIMIT_SCALE_v1;
+	} else {
+		if (user > 0xFFFFFFFFFFFFFFFF / (HZ*CREDITS_PER_JIFFY))
+			return (user / XT_HASHLIMIT_SCALE) *\
+						HZ * CREDITS_PER_JIFFY;
 
-	return (user * HZ * CREDITS_PER_JIFFY_v1) / XT_HASHLIMIT_SCALE_v1;
+		return (user * HZ * CREDITS_PER_JIFFY) / XT_HASHLIMIT_SCALE;
+	}
 }
 
 static u32 user2credits_byte(u32 user)
@@ -443,10 +488,11 @@ static u32 user2credits_byte(u32 user)
 	return (u32) (us >> 32);
 }
 
-static void rateinfo_recalc(struct dsthash_ent *dh, unsigned long now, u32 mode)
+static void rateinfo_recalc(struct dsthash_ent *dh, unsigned long now,
+			    u32 mode, int revision)
 {
 	unsigned long delta = now - dh->rateinfo.prev;
-	u32 cap;
+	u64 cap, cpj;
 
 	if (delta == 0)
 		return;
@@ -454,7 +500,7 @@ static void rateinfo_recalc(struct dsthash_ent *dh, unsigned long now, u32 mode)
 	dh->rateinfo.prev = now;
 
 	if (mode & XT_HASHLIMIT_BYTES) {
-		u32 tmp = dh->rateinfo.credit;
+		u64 tmp = dh->rateinfo.credit;
 		dh->rateinfo.credit += CREDITS_PER_JIFFY_BYTES * delta;
 		cap = CREDITS_PER_JIFFY_BYTES * HZ;
 		if (tmp >= dh->rateinfo.credit) {/* overflow */
@@ -462,7 +508,9 @@ static void rateinfo_recalc(struct dsthash_ent *dh, unsigned long now, u32 mode)
 			return;
 		}
 	} else {
-		dh->rateinfo.credit += delta * CREDITS_PER_JIFFY_v1;
+		cpj = (revision == 1) ?
+			CREDITS_PER_JIFFY_v1 : CREDITS_PER_JIFFY;
+		dh->rateinfo.credit += delta * cpj;
 		cap = dh->rateinfo.credit_cap;
 	}
 	if (dh->rateinfo.credit > cap)
@@ -470,7 +518,7 @@ static void rateinfo_recalc(struct dsthash_ent *dh, unsigned long now, u32 mode)
 }
 
 static void rateinfo_init(struct dsthash_ent *dh,
-			  struct xt_hashlimit_htable *hinfo)
+			  struct xt_hashlimit_htable *hinfo, int revision)
 {
 	dh->rateinfo.prev = jiffies;
 	if (hinfo->cfg.mode & XT_HASHLIMIT_BYTES) {
@@ -479,8 +527,8 @@ static void rateinfo_init(struct dsthash_ent *dh,
 		dh->rateinfo.credit_cap = hinfo->cfg.burst;
 	} else {
 		dh->rateinfo.credit = user2credits(hinfo->cfg.avg *
-						   hinfo->cfg.burst);
-		dh->rateinfo.cost = user2credits(hinfo->cfg.avg);
+						   hinfo->cfg.burst, revision);
+		dh->rateinfo.cost = user2credits(hinfo->cfg.avg, revision);
 		dh->rateinfo.credit_cap = dh->rateinfo.credit;
 	}
 }
@@ -604,15 +652,15 @@ static u32 hashlimit_byte_cost(unsigned int len, struct dsthash_ent *dh)
 }
 
 static bool
-hashlimit_mt_v1(const struct sk_buff *skb, struct xt_action_param *par)
+hashlimit_mt_common(const struct sk_buff *skb, struct xt_action_param *par,
+		    struct xt_hashlimit_htable *hinfo,
+		    const struct hashlimit_cfg2 *cfg, int revision)
 {
-	const struct xt_hashlimit_mtinfo1 *info = par->matchinfo;
-	struct xt_hashlimit_htable *hinfo = info->hinfo;
 	unsigned long now = jiffies;
 	struct dsthash_ent *dh;
 	struct dsthash_dst dst;
 	bool race = false;
-	u32 cost;
+	u64 cost;
 
 	if (hashlimit_init_dst(hinfo, &dst, skb, par->thoff) < 0)
 		goto hotdrop;
@@ -627,18 +675,18 @@ hashlimit_mt_v1(const struct sk_buff *skb, struct xt_action_param *par)
 		} else if (race) {
 			/* Already got an entry, update expiration timeout */
 			dh->expires = now + msecs_to_jiffies(hinfo->cfg.expire);
-			rateinfo_recalc(dh, now, hinfo->cfg.mode);
+			rateinfo_recalc(dh, now, hinfo->cfg.mode, revision);
 		} else {
 			dh->expires = jiffies + msecs_to_jiffies(hinfo->cfg.expire);
-			rateinfo_init(dh, hinfo);
+			rateinfo_init(dh, hinfo, revision);
 		}
 	} else {
 		/* update expiration timeout */
 		dh->expires = now + msecs_to_jiffies(hinfo->cfg.expire);
-		rateinfo_recalc(dh, now, hinfo->cfg.mode);
+		rateinfo_recalc(dh, now, hinfo->cfg.mode, revision);
 	}
 
-	if (info->cfg.mode & XT_HASHLIMIT_BYTES)
+	if (cfg->mode & XT_HASHLIMIT_BYTES)
 		cost = hashlimit_byte_cost(skb->len, dh);
 	else
 		cost = dh->rateinfo.cost;
@@ -648,70 +696,126 @@ hashlimit_mt_v1(const struct sk_buff *skb, struct xt_action_param *par)
 		dh->rateinfo.credit -= cost;
 		spin_unlock(&dh->lock);
 		rcu_read_unlock_bh();
-		return !(info->cfg.mode & XT_HASHLIMIT_INVERT);
+		return !(cfg->mode & XT_HASHLIMIT_INVERT);
 	}
 
 	spin_unlock(&dh->lock);
 	rcu_read_unlock_bh();
 	/* default match is underlimit - so over the limit, we need to invert */
-	return info->cfg.mode & XT_HASHLIMIT_INVERT;
+	return cfg->mode & XT_HASHLIMIT_INVERT;
 
  hotdrop:
 	par->hotdrop = true;
 	return false;
 }
 
-static int hashlimit_mt_check_v1(const struct xt_mtchk_param *par)
+static bool
+hashlimit_mt_v1(const struct sk_buff *skb, struct xt_action_param *par)
+{
+	const struct xt_hashlimit_mtinfo1 *info = par->matchinfo;
+	struct xt_hashlimit_htable *hinfo = info->hinfo;
+	struct hashlimit_cfg2 cfg = {};
+	int ret;
+
+	ret = cfg_copy(&cfg, (void *)&info->cfg, 1);
+
+	if (ret)
+		return ret;
+
+	return hashlimit_mt_common(skb, par, hinfo, &cfg, 1);
+}
+
+static bool
+hashlimit_mt(const struct sk_buff *skb, struct xt_action_param *par)
+{
+	const struct xt_hashlimit_mtinfo2 *info = par->matchinfo;
+	struct xt_hashlimit_htable *hinfo = info->hinfo;
+
+	return hashlimit_mt_common(skb, par, hinfo, &info->cfg, 2);
+}
+
+static int hashlimit_mt_check_common(const struct xt_mtchk_param *par,
+				     struct xt_hashlimit_htable **hinfo,
+				     struct hashlimit_cfg2 *cfg,
+				     const char *name, int revision)
 {
 	struct net *net = par->net;
-	struct xt_hashlimit_mtinfo1 *info = par->matchinfo;
 	int ret;
 
-	if (info->cfg.gc_interval == 0 || info->cfg.expire == 0)
-		return -EINVAL;
-	if (info->name[sizeof(info->name)-1] != '\0')
+	if (cfg->gc_interval == 0 || cfg->expire == 0)
 		return -EINVAL;
 	if (par->family == NFPROTO_IPV4) {
-		if (info->cfg.srcmask > 32 || info->cfg.dstmask > 32)
+		if (cfg->srcmask > 32 || cfg->dstmask > 32)
 			return -EINVAL;
 	} else {
-		if (info->cfg.srcmask > 128 || info->cfg.dstmask > 128)
+		if (cfg->srcmask > 128 || cfg->dstmask > 128)
 			return -EINVAL;
 	}
 
-	if (info->cfg.mode & ~XT_HASHLIMIT_ALL) {
+	if (cfg->mode & ~XT_HASHLIMIT_ALL) {
 		pr_info("Unknown mode mask %X, kernel too old?\n",
-						info->cfg.mode);
+						cfg->mode);
 		return -EINVAL;
 	}
 
 	/* Check for overflow. */
-	if (info->cfg.mode & XT_HASHLIMIT_BYTES) {
-		if (user2credits_byte(info->cfg.avg) == 0) {
-			pr_info("overflow, rate too high: %u\n", info->cfg.avg);
+	if (cfg->mode & XT_HASHLIMIT_BYTES) {
+		if (user2credits_byte(cfg->avg) == 0) {
+			pr_info("overflow, rate too high: %llu\n", cfg->avg);
 			return -EINVAL;
 		}
-	} else if (info->cfg.burst == 0 ||
-		    user2credits(info->cfg.avg * info->cfg.burst) <
-		    user2credits(info->cfg.avg)) {
-			pr_info("overflow, try lower: %u/%u\n",
-				info->cfg.avg, info->cfg.burst);
+	} else if (cfg->burst == 0 ||
+		    user2credits(cfg->avg * cfg->burst, revision) <
+		    user2credits(cfg->avg, revision)) {
+			pr_info("overflow, try lower: %llu/%llu\n",
+				cfg->avg, cfg->burst);
 			return -ERANGE;
 	}
 
 	mutex_lock(&hashlimit_mutex);
-	info->hinfo = htable_find_get(net, info->name, par->family);
-	if (info->hinfo == NULL) {
-		ret = htable_create_v1(net, info, par->family);
+	*hinfo = htable_find_get(net, name, par->family);
+	if (*hinfo == NULL) {
+		ret = htable_create(net, cfg, name, par->family,
+				    hinfo, revision);
 		if (ret < 0) {
 			mutex_unlock(&hashlimit_mutex);
 			return ret;
 		}
 	}
 	mutex_unlock(&hashlimit_mutex);
+
 	return 0;
 }
 
+static int hashlimit_mt_check_v1(const struct xt_mtchk_param *par)
+{
+	struct xt_hashlimit_mtinfo1 *info = par->matchinfo;
+	struct hashlimit_cfg2 cfg = {};
+	int ret;
+
+	if (info->name[sizeof(info->name) - 1] != '\0')
+		return -EINVAL;
+
+	ret = cfg_copy(&cfg, (void *)&info->cfg, 1);
+
+	if (ret)
+		return ret;
+
+	return hashlimit_mt_check_common(par, &info->hinfo,
+					 &cfg, info->name, 1);
+}
+
+static int hashlimit_mt_check(const struct xt_mtchk_param *par)
+{
+	struct xt_hashlimit_mtinfo2 *info = par->matchinfo;
+
+	if (info->name[sizeof(info->name) - 1] != '\0')
+		return -EINVAL;
+
+	return hashlimit_mt_check_common(par, &info->hinfo, &info->cfg,
+					 info->name, 2);
+}
+
 static void hashlimit_mt_destroy_v1(const struct xt_mtdtor_param *par)
 {
 	const struct xt_hashlimit_mtinfo1 *info = par->matchinfo;
@@ -719,6 +823,13 @@ static void hashlimit_mt_destroy_v1(const struct xt_mtdtor_param *par)
 	htable_put(info->hinfo);
 }
 
+static void hashlimit_mt_destroy(const struct xt_mtdtor_param *par)
+{
+	const struct xt_hashlimit_mtinfo2 *info = par->matchinfo;
+
+	htable_put(info->hinfo);
+}
+
 static struct xt_match hashlimit_mt_reg[] __read_mostly = {
 	{
 		.name           = "hashlimit",
@@ -730,6 +841,16 @@ static struct xt_match hashlimit_mt_reg[] __read_mostly = {
 		.destroy        = hashlimit_mt_destroy_v1,
 		.me             = THIS_MODULE,
 	},
+	{
+		.name           = "hashlimit",
+		.revision       = 2,
+		.family         = NFPROTO_IPV4,
+		.match          = hashlimit_mt,
+		.matchsize      = sizeof(struct xt_hashlimit_mtinfo2),
+		.checkentry     = hashlimit_mt_check,
+		.destroy        = hashlimit_mt_destroy,
+		.me             = THIS_MODULE,
+	},
 #if IS_ENABLED(CONFIG_IP6_NF_IPTABLES)
 	{
 		.name           = "hashlimit",
@@ -741,6 +862,16 @@ static struct xt_match hashlimit_mt_reg[] __read_mostly = {
 		.destroy        = hashlimit_mt_destroy_v1,
 		.me             = THIS_MODULE,
 	},
+	{
+		.name           = "hashlimit",
+		.revision       = 2,
+		.family         = NFPROTO_IPV6,
+		.match          = hashlimit_mt,
+		.matchsize      = sizeof(struct xt_hashlimit_mtinfo2),
+		.checkentry     = hashlimit_mt_check,
+		.destroy        = hashlimit_mt_destroy,
+		.me             = THIS_MODULE,
+	},
 #endif
 };
 
@@ -787,18 +918,12 @@ static void dl_seq_stop(struct seq_file *s, void *v)
 	spin_unlock_bh(&htable->lock);
 }
 
-static int dl_seq_real_show_v1(struct dsthash_ent *ent, u_int8_t family,
-			       struct seq_file *s)
+static void dl_seq_print(struct dsthash_ent *ent, u_int8_t family,
+			 struct seq_file *s)
 {
-	const struct xt_hashlimit_htable *ht = s->private;
-
-	spin_lock(&ent->lock);
-	/* recalculate to show accurate numbers */
-	rateinfo_recalc(ent, jiffies, ht->cfg.mode);
-
 	switch (family) {
 	case NFPROTO_IPV4:
-		seq_printf(s, "%ld %pI4:%u->%pI4:%u %u %u %u\n",
+		seq_printf(s, "%ld %pI4:%u->%pI4:%u %llu %llu %llu\n",
 			   (long)(ent->expires - jiffies)/HZ,
 			   &ent->dst.ip.src,
 			   ntohs(ent->dst.src_port),
@@ -809,7 +934,7 @@ static int dl_seq_real_show_v1(struct dsthash_ent *ent, u_int8_t family,
 		break;
 #if IS_ENABLED(CONFIG_IP6_NF_IPTABLES)
 	case NFPROTO_IPV6:
-		seq_printf(s, "%ld %pI6:%u->%pI6:%u %u %u %u\n",
+		seq_printf(s, "%ld %pI6:%u->%pI6:%u %llu %llu %llu\n",
 			   (long)(ent->expires - jiffies)/HZ,
 			   &ent->dst.ip6.src,
 			   ntohs(ent->dst.src_port),
@@ -822,6 +947,34 @@ static int dl_seq_real_show_v1(struct dsthash_ent *ent, u_int8_t family,
 	default:
 		BUG();
 	}
+}
+
+static int dl_seq_real_show_v1(struct dsthash_ent *ent, u_int8_t family,
+			       struct seq_file *s)
+{
+	const struct xt_hashlimit_htable *ht = s->private;
+
+	spin_lock(&ent->lock);
+	/* recalculate to show accurate numbers */
+	rateinfo_recalc(ent, jiffies, ht->cfg.mode, 1);
+
+	dl_seq_print(ent, family, s);
+
+	spin_unlock(&ent->lock);
+	return seq_has_overflowed(s);
+}
+
+static int dl_seq_real_show(struct dsthash_ent *ent, u_int8_t family,
+			    struct seq_file *s)
+{
+	const struct xt_hashlimit_htable *ht = s->private;
+
+	spin_lock(&ent->lock);
+	/* recalculate to show accurate numbers */
+	rateinfo_recalc(ent, jiffies, ht->cfg.mode, 2);
+
+	dl_seq_print(ent, family, s);
+
 	spin_unlock(&ent->lock);
 	return seq_has_overflowed(s);
 }
@@ -840,6 +993,20 @@ static int dl_seq_show_v1(struct seq_file *s, void *v)
 	return 0;
 }
 
+static int dl_seq_show(struct seq_file *s, void *v)
+{
+	struct xt_hashlimit_htable *htable = s->private;
+	unsigned int *bucket = (unsigned int *)v;
+	struct dsthash_ent *ent;
+
+	if (!hlist_empty(&htable->hash[*bucket])) {
+		hlist_for_each_entry(ent, &htable->hash[*bucket], node)
+			if (dl_seq_real_show(ent, htable->family, s))
+				return -1;
+	}
+	return 0;
+}
+
 static const struct seq_operations dl_seq_ops_v1 = {
 	.start = dl_seq_start,
 	.next  = dl_seq_next,
@@ -847,6 +1014,13 @@ static const struct seq_operations dl_seq_ops_v1 = {
 	.show  = dl_seq_show_v1
 };
 
+static const struct seq_operations dl_seq_ops = {
+	.start = dl_seq_start,
+	.next  = dl_seq_next,
+	.stop  = dl_seq_stop,
+	.show  = dl_seq_show
+};
+
 static int dl_proc_open_v1(struct inode *inode, struct file *file)
 {
 	int ret = seq_open(file, &dl_seq_ops_v1);
@@ -858,6 +1032,18 @@ static int dl_proc_open_v1(struct inode *inode, struct file *file)
 	return ret;
 }
 
+static int dl_proc_open(struct inode *inode, struct file *file)
+{
+	int ret = seq_open(file, &dl_seq_ops);
+
+	if (!ret) {
+		struct seq_file *sf = file->private_data;
+
+		sf->private = PDE_DATA(inode);
+	}
+	return ret;
+}
+
 static const struct file_operations dl_file_ops_v1 = {
 	.owner   = THIS_MODULE,
 	.open    = dl_proc_open_v1,
@@ -866,6 +1052,14 @@ static const struct file_operations dl_file_ops_v1 = {
 	.release = seq_release
 };
 
+static const struct file_operations dl_file_ops = {
+	.owner   = THIS_MODULE,
+	.open    = dl_proc_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = seq_release
+};
+
 static int __net_init hashlimit_proc_net_init(struct net *net)
 {
 	struct hashlimit_net *hashlimit_net = hashlimit_pernet(net);
-- 
1.9.1


^ permalink raw reply related

* [PATCH v2] fs/select: add vmalloc fallback for select(2)
From: Vlastimil Babka @ 2016-09-22 16:43 UTC (permalink / raw)
  To: Alexander Viro, Andrew Morton
  Cc: linux-fsdevel, linux-kernel, linux-mm, Michal Hocko, netdev,
	Eric Dumazet, Vlastimil Babka

The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
with the number of fds passed. We had a customer report page allocation
failures of order-4 for this allocation. This is a costly order, so it might
easily fail, as the VM expects such allocation to have a lower-order fallback.

Such trivial fallback is vmalloc(), as the memory doesn't have to be
physically contiguous. Also the allocation is temporary for the duration of the
syscall, so it's unlikely to stress vmalloc too much.

Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
it doesn't need this kind of fallback.

[eric.dumazet@gmail.com: fix failure path logic]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 fs/select.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 8ed9da50896a..b99e98524fde 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -29,6 +29,7 @@
 #include <linux/sched/rt.h>
 #include <linux/freezer.h>
 #include <net/busy_poll.h>
+#include <linux/vmalloc.h>
 
 #include <asm/uaccess.h>
 
@@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 	struct fdtable *fdt;
 	/* Allocate small arguments on the stack to save memory and be faster */
 	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
+	unsigned long alloc_size;
 
 	ret = -EINVAL;
 	if (n < 0)
@@ -580,8 +582,12 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 	bits = stack_fds;
 	if (size > sizeof(stack_fds) / 6) {
 		/* Not enough space in on-stack array; must use kmalloc */
+		alloc_size = 6 * size;
 		ret = -ENOMEM;
-		bits = kmalloc(6 * size, GFP_KERNEL);
+		bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
+		if (!bits && alloc_size > PAGE_SIZE)
+			bits = vmalloc(alloc_size);
+
 		if (!bits)
 			goto out_nofds;
 	}
@@ -618,7 +624,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 
 out:
 	if (bits != stack_fds)
-		kfree(bits);
+		kvfree(bits);
 out_nofds:
 	return ret;
 }
-- 
2.10.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: [PATCH net-next] Documentation: devicetree: revise ethernet device-tree binding about TRGMII
From: Sergei Shtylyov @ 2016-09-22 16:48 UTC (permalink / raw)
  To: sean.wang, davem, john
  Cc: nbd, netdev, linux-kernel, linux-mediatek, devicetree, keyhaede,
	objelf
In-Reply-To: <1474560996-20863-1-git-send-email-sean.wang@mediatek.com>

On 09/22/2016 07:16 PM, sean.wang@mediatek.com wrote:

> From: Sean Wang <sean.wang@mediatek.com>
>
> fix typo in mediatek-net.txt and add phy-mode "trgmii" to ethernet.txt

    These changes are unrelated to each other, so there should be 2 separate 
patches. And have the patches I reviewed been merged already, why are you 
sending an incremental patch?

> Cc: devicetree@vger.kernel.org
> Reported-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
> Signed-off-by: Sean Wang <sean.wang@mediatek.com>
[...]

MBR, Sergei

^ permalink raw reply

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
From: Eric Dumazet @ 2016-09-22 16:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev
In-Reply-To: <20160922164359.9035-1-vbabka@suse.cz>

On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> with the number of fds passed. We had a customer report page allocation
> failures of order-4 for this allocation. This is a costly order, so it might
> easily fail, as the VM expects such allocation to have a lower-order fallback.
> 
> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> physically contiguous. Also the allocation is temporary for the duration of the
> syscall, so it's unlikely to stress vmalloc too much.

vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.

So I guess allowing vmalloc() being called from an innocent application
doing a select() might be dangerous, especially if this select() happens
thousands of time per second.





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v3 2/2] netfilter: Create revision 2 of xt_hashlimit to support higher pps rates
From: Jan Engelhardt @ 2016-09-22 16:53 UTC (permalink / raw)
  To: Vishwanath Pai
  Cc: pablo, kaber, kadlec, johunt, netfilter-devel, coreteam, netdev,
	pai.vishwain
In-Reply-To: <20160922164344.GB20423@akamai.com>


On Thursday 2016-09-22 18:43, Vishwanath Pai wrote:
>+struct hashlimit_cfg2 {
>+	__u32 mode;	  /* bitmask of XT_HASHLIMIT_HASH_* */
>+	__u64 avg;    /* Average secs between packets * scale */
>+	__u64 burst;  /* Period multiplier for upper limit. */

This would have different sizes between i386 and x86_64,
necessiting additional compat functions. It should be padded
or reordered instead.

>+
>+	/* user specified */
>+	__u32 size;		/* how many buckets */
>+	__u32 max;		/* max number of entries */
>+	__u32 gc_interval;	/* gc interval */
>+	__u32 expire;	/* when do entries expire? */
>+
>+	__u8 srcmask, dstmask;
>+};
>+
>+struct xt_hashlimit_mtinfo2 {
>+	char name[NAME_MAX];
>+	struct hashlimit_cfg2 cfg;
>+
>+	/* Used internally by the kernel */
>+	struct xt_hashlimit_htable *hinfo __attribute__((aligned(8)));
>+};


^ permalink raw reply

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
From: Vlastimil Babka @ 2016-09-22 16:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev
In-Reply-To: <1474562982.23058.140.camel@edumazet-glaptop3.roam.corp.google.com>

On 09/22/2016 06:49 PM, Eric Dumazet wrote:
> On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
>> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
>> with the number of fds passed. We had a customer report page allocation
>> failures of order-4 for this allocation. This is a costly order, so it might
>> easily fail, as the VM expects such allocation to have a lower-order fallback.
>>
>> Such trivial fallback is vmalloc(), as the memory doesn't have to be
>> physically contiguous. Also the allocation is temporary for the duration of the
>> syscall, so it's unlikely to stress vmalloc too much.
>
> vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.
>
> So I guess allowing vmalloc() being called from an innocent application
> doing a select() might be dangerous, especially if this select() happens
> thousands of time per second.

Isn't seq_buf_alloc() similarly exposed? And ipc_alloc()?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox