Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH stable 3.2 3.4] ipv4: disable bh while doing route gc
From: David Miller @ 2014-10-13 16:52 UTC (permalink / raw)
  To: mleitner; +Cc: netdev, hannes
In-Reply-To: <6c3d6eca5d6a15c01393b010f2116bd169477c5a.1413215324.git.mleitner@redhat.com>

From: Marcelo Ricardo Leitner <mleitner@redhat.com>
Date: Mon, 13 Oct 2014 13:20:38 -0300

> Further tests revealed that after moving the garbage collector to a work
> queue and protecting it with a spinlock may leave the system prone to
> soft lockups if bottom half gets very busy.
> 
> It was reproced with a set of firewall rules that REJECTed packets. If
> the NIC bottom half handler ends up running on the same CPU that is
> running the garbage collector on a very large cache, the garbage
> collector will not be able to do its job due to the amount of work
> needed for handling the REJECTs and also won't reschedule.
> 
> The fix is to disable bottom half during the garbage collecting, as it
> already was in the first place (most calls to it came from softirqs).
> 
> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Please add my:

Acked-by: David S. Miller <davem@davemloft.net>

and submit this directly to -stable, thanks.

^ permalink raw reply

* Re: Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_
From: Eric Dumazet @ 2014-10-13 16:58 UTC (permalink / raw)
  To: Dave Taht
  Cc: Alexander Duyck, Jesper Dangaard Brouer, Hannes Frederic Sowa,
	John Fastabend, Jamal Hadi Salim, Daniel Borkmann,
	Florian Westphal, netdev@vger.kernel.org,
	Toke Høiland-Jørgensen, Tom Herbert, David Miller
In-Reply-To: <CAL4WiioOodM60_Ud6qyGt1_uyeqpDeGFF6PtEKtGa=NwHhVFiw@mail.gmail.com>


> On Oct 13, 2014 7:22 AM, "Dave Taht" <dave.taht@gmail.com> wrote:
>         When I first got cc'd on these threads, and saw
>         netperf-wrapper being
>         used on it,
>         I thought: "Oh god, I've created a monster.". My intent with
>         helping create
>         such a measurement tool was not to routinely drive a network
>         to saturation
>         but to be able to measure the impact on latency of doing so.
>         
>         I was trying get reasonable behavior when a "router" went into
>         overload.
>         
>         Servers, on the other hand, have more options to avoid
>         overload than
>         routers do. There's been a great deal of really nice work on
>         that
>         front. I love all that.
>         
>         and I like BQL because it provides enough backpressure to be
>         able to
>         do smarter things about scheduling packets higher in the
>         stack. (life
>         pre-BQL cost some hair)
>         
>         But tom once told me me "BQL's objective is to keep the
>         hardware busy".
>         It uses an MIAD controller instead of a more sane AIMD one, in
>         particular,
>         I'd much rather it ramped down to smaller values after
>         absorbing a burst.
>         
>         My objective is always to keep the *network's behavior
>         optimal*,
>         minimizing bursts, and subsequent tail loss on the other side,
>         and responding quickly to loss, and doing that by
>         preserving to the highest extent possible the ack clocking
>         that a
>         fluid model has. I LOVE BQL for providing more backpressure
>         than has
>         ever existed before, and I know it's incredibly difficult to
>         have fluid models
>         in a conventional cpu architecture that has to do other stuff.
>         
>         But in order to get the best results for network behavior I'm
>         willing to
>         sacrifice a great deal of cpu, interrupts, whatever it takes!
>         to get the
>         most packets to all the destinations specified, whatever the
>         workload,
>         with the *minimum amount of latency between ack and reply*
>         possible.
>         
>         What I'd hoped for in the new bulking and rcu stuff was to be
>         able to
>         see a net reduction in TSO/GSO Size, and/or BQL's size, and I
>         also did
>         keep hoping for some profiles on sch_fq, and for more complex
>         benchmarking of dozens or hundreds of realistically sized TCP
>         flows
>         (in both directions) to exercise it all.
>         
>         Some of the data presented showed that a single BQL'd queue
>         was >400K,
>         and with hardware multi-queue, 128K, when TSO and GSO were
>         used, but
>         with hardware multi-queue and no TSO/GSO, BQL was closer to
>         30K.
>         
>         This said to me that the maximum "right" size for a TSO/GSO
>         "packet" was
>         closer to 12k in this environment, and the right size for BQL,
>         30k,
>         before it started exerting backpressure to the qdisc.
>         
>         This would reduce the potential inter-flow network latency by
>         a factor
>         of 10 on the single hw queue scenario, and 4 in the multi
>         queue one.
>         
>         It would probably cost some interrupts, and in scenarios
>         lacking
>         packet loss, throughput, but in other scenarios with lots of
>         flows
>         each flow will ramp up in speed, faster, as you reduce the
>         RTTs.
>         Paying attention to this will also push profiling activities
>         into
>         areas of the stack that might be profitable.
>         
>         I would very much like to have profiles of happens now both
>         here and
>         elsewhere in the stack with this new code with TSO/GSO sizes
>         capped
>         thusly and BQL capped to 30k, and a smarter qdisc like fq
>         used.
>         
>         2) Most of the time, a server is not driving the wire to
>         saturation. If
>            it is, you are doing something wrong. The BQL queues are
>         empty, or
>            nearly so, so the instant someone creates a qdisc queue, it
>            drains.
>         
>         But: if there are two or more flows under contention, creating
>         a qdisc queue
>             better multiplexing the results is highly desirable, and
>         the stack
>            should be smart enough to make that overload only last
>         briefly.
>         
>            This is part of why I'm unfond of the deep and persistent
>         BQL queues as we
>         get today.
>         
>         3) Pure ack-only workloads are rare. It is a useful test case,
>         but...
>         
>         4) I thought the ring-cleanup optimization was rather
>         interesting and
>            could be made more dynamic.
>         
>         5) I remain amazed at the vast improvements in throughput,
>         reductions in
>         interrupts, lockless operation and the RCU stuff that have
>         come out of
>         this so far, but had to make these points in the hope that the
>         big picture
>         is retained.
>         
>         It does no good to blast packets through the network unless
>         there is a
>         high probability that they will actually be received on the
>         other side.
>         
>         thanks for listening.

Dave

I am kind of surprised you wrote this nonsense.

Being able to send data at full speed has nothing to do about how
packets are scheduled. You are concerned about packet scheduling, and
not about how fast we can send raw data on a 40Gb NIC.

We made all these changes so that we can spend cpu cycles at the right
place.

There are reasons we have fq_codel, and fq. Do not forget this, please.

Take a deep breath, some coffee, and rethink.

Thanks

^ permalink raw reply

* Re: [PATCH stable 3.2 3.4] ipv4: disable bh while doing route gc
From: Marcelo Ricardo Leitner @ 2014-10-13 16:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, hannes
In-Reply-To: <20141013.125225.1129718741415025861.davem@davemloft.net>

On 13-10-2014 13:52, David Miller wrote:
> From: Marcelo Ricardo Leitner <mleitner@redhat.com>
> Date: Mon, 13 Oct 2014 13:20:38 -0300
>
>> Further tests revealed that after moving the garbage collector to a work
>> queue and protecting it with a spinlock may leave the system prone to
>> soft lockups if bottom half gets very busy.
>>
>> It was reproced with a set of firewall rules that REJECTed packets. If
>> the NIC bottom half handler ends up running on the same CPU that is
>> running the garbage collector on a very large cache, the garbage
>> collector will not be able to do its job due to the amount of work
>> needed for handling the REJECTs and also won't reschedule.
>>
>> The fix is to disable bottom half during the garbage collecting, as it
>> already was in the first place (most calls to it came from softirqs).
>>
>> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
>> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
>
> Please add my:
>
> Acked-by: David S. Miller <davem@davemloft.net>
>
> and submit this directly to -stable, thanks.

Will do. Thanks.

Marcelo

^ permalink raw reply

* [PATCH] sfc: efx: add support for skb->xmit_more
From: Daniel Borkmann @ 2014-10-13 16:59 UTC (permalink / raw)
  To: davem; +Cc: nikolay, netdev, Shradha Shah

Add support for xmit_more batching: efx has BQL support, thus we need
to only move the queue hang check before the hw tail pointer write
and test for xmit_more bit resp. whether the queue/stack has stopped.
Thanks to Nikolay Aleksandrov who had a Solarflare NIC to test!

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Nikolay Aleksandrov <nikolay@redhat.com>
Cc: Shradha Shah <sshah@solarflare.com>
---
 drivers/net/ethernet/sfc/tx.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c
index 3206098..566432c 100644
--- a/drivers/net/ethernet/sfc/tx.c
+++ b/drivers/net/ethernet/sfc/tx.c
@@ -439,13 +439,13 @@ finish_packet:
 
 	netdev_tx_sent_queue(tx_queue->core_txq, skb->len);
 
+	efx_tx_maybe_stop_queue(tx_queue);
+
 	/* Pass off to hardware */
-	efx_nic_push_buffers(tx_queue);
+	if (!skb->xmit_more || netif_xmit_stopped(tx_queue->core_txq))
+		efx_nic_push_buffers(tx_queue);
 
 	tx_queue->tx_packets++;
-
-	efx_tx_maybe_stop_queue(tx_queue);
-
 	return NETDEV_TX_OK;
 
  dma_err:
@@ -1308,11 +1308,12 @@ static int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
 
 	netdev_tx_sent_queue(tx_queue->core_txq, skb->len);
 
-	/* Pass off to hardware */
-	efx_nic_push_buffers(tx_queue);
-
 	efx_tx_maybe_stop_queue(tx_queue);
 
+	/* Pass off to hardware */
+	if (!skb->xmit_more || netif_xmit_stopped(tx_queue->core_txq))
+		efx_nic_push_buffers(tx_queue);
+
 	tx_queue->tso_bursts++;
 	return NETDEV_TX_OK;
 
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH net-next] tg3: Add skb->xmit_more support
From: Rick Jones @ 2014-10-13 17:14 UTC (permalink / raw)
  To: Eric Dumazet, Prashant Sreedharan; +Cc: davem, netdev, dborkman, mchan
In-Reply-To: <1413218934.9362.104.camel@edumazet-glaptop2.roam.corp.google.com>

On 10/13/2014 09:48 AM, Eric Dumazet wrote:
> On Mon, 2014-10-13 at 09:21 -0700, Prashant Sreedharan wrote:
>> Ring TX doorbell only if xmit_more is not set or the queue is stopped.
>>
>> Suggested-by: Daniel Borkmann <dborkman@redhat.com>
>> Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
>> Signed-off-by: Michael Chan <mchan@broadcom.com>
>> ---
>>   drivers/net/ethernet/broadcom/tg3.c |   10 ++++++----
>>   1 files changed, 6 insertions(+), 4 deletions(-)
>
> Have you noticed any performance change ?
>
> I did the patch for bnx2x but got no real difference...

If ringing the doorbell is just a pio write, it may just get lost in the 
noise.  If though the programming model (or some sort of defect 
workaround) of the NIC requires a pio read of some sort when ringing the 
doorbell...

rick

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: kill nf_send_reset6() from include/net/netfilter/ipv6/nf_reject.h
From: Josh Boyer @ 2014-10-13 17:18 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Pablo Neira Ayuso, netfilter-devel, David Miller, netdev,
	Eric Dumazet
In-Reply-To: <CA+5PVA72ME31+CAzcM46JyvrP19dawTO4LSmta6Jt2eDZVbWaA@mail.gmail.com>

On Mon, Oct 13, 2014 at 12:01 PM, Josh Boyer <jwboyer@fedoraproject.org> wrote:
> On Mon, Oct 13, 2014 at 11:55 AM, Florian Westphal <fw@strlen.de> wrote:
>> Josh Boyer <jwboyer@fedoraproject.org> wrote:
>>> On Thu, Oct 9, 2014 at 2:27 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>>> > nf_send_reset6() now resides in net/ipv6/netfilter/nf_reject_ipv6.c
>>> >
>>> > Fixes: c8d7b98 ("netfilter: move nf_send_resetX() code to nf_reject_ipvX modules")
>>> > Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
>>> > Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
>>> > Acked-by: Eric Dumazet <edumazet@google.com>
>>> > ---
>>> >  include/net/netfilter/ipv6/nf_reject.h |  157 +-------------------------------
>>> >  1 file changed, 2 insertions(+), 155 deletions(-)
>>>
>>> Hi All,
>>>
>>> This morning I was testing a kernel build from Linus' tree as of Linux
>>> v3.17-7639-g90eac7eee2f4.  When I rebooted my test machines, I
>>> couldn't ssh back into any of them.  I poked around a bit and noticed
>>> that it seems the iptables rules weren't getting loaded properly.
>>> Traffic out worked fine, and I could ping the machine, but other
>>> incoming traffic was blocked.  Then I saw that the ip6t_REJECT and
>>> ip6t_rpfilter modules were not being loaded on the bad kernel.
>>> Looking in dmesg I see:
>>>
>>> [   14.619028] nf_reject_ipv6: module license 'unspecified' taints kernel.
>>> [   14.619125] nf_reject_ipv6: Unknown symbol ip6_local_out (err 0)
>>
>> Ouch. ip6_local_is EXPORT_SYMBOL_GPL.
>>
>> http://patchwork.ozlabs.org/patch/398501/
>>
>> should fix this.
>
> I believe you're correct.  I dug in some more and I was just about to
> send a similar patch.  I'll add it on top of my builds and test it
> out.  Thanks for the pointer.

The patch you pointed to does indeed fix the issues I was seeing.
Thank you very much for the fast response.

josh

^ permalink raw reply

* Re: Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_
From: Dave Taht @ 2014-10-13 17:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, Jesper Dangaard Brouer, Hannes Frederic Sowa,
	John Fastabend, Jamal Hadi Salim, Daniel Borkmann,
	Florian Westphal, netdev@vger.kernel.org,
	Toke Høiland-Jørgensen, Tom Herbert, David Miller
In-Reply-To: <1413219532.9362.109.camel@edumazet-glaptop2.roam.corp.google.com>

On Mon, Oct 13, 2014 at 9:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> On Oct 13, 2014 7:22 AM, "Dave Taht" <dave.taht@gmail.com> wrote:
>>         When I first got cc'd on these threads, and saw
>>         netperf-wrapper being
>>         used on it,
>>         I thought: "Oh god, I've created a monster.". My intent with
>>         helping create
>>         such a measurement tool was not to routinely drive a network
>>         to saturation
>>         but to be able to measure the impact on latency of doing so.
>>
>>         I was trying get reasonable behavior when a "router" went into
>>         overload.
>>
>>         Servers, on the other hand, have more options to avoid
>>         overload than
>>         routers do. There's been a great deal of really nice work on
>>         that
>>         front. I love all that.
>>
>>         and I like BQL because it provides enough backpressure to be
>>         able to
>>         do smarter things about scheduling packets higher in the
>>         stack. (life
>>         pre-BQL cost some hair)
>>
>>         But tom once told me me "BQL's objective is to keep the
>>         hardware busy".
>>         It uses an MIAD controller instead of a more sane AIMD one, in
>>         particular,
>>         I'd much rather it ramped down to smaller values after
>>         absorbing a burst.
>>
>>         My objective is always to keep the *network's behavior
>>         optimal*,
>>         minimizing bursts, and subsequent tail loss on the other side,
>>         and responding quickly to loss, and doing that by
>>         preserving to the highest extent possible the ack clocking
>>         that a
>>         fluid model has. I LOVE BQL for providing more backpressure
>>         than has
>>         ever existed before, and I know it's incredibly difficult to
>>         have fluid models
>>         in a conventional cpu architecture that has to do other stuff.
>>
>>         But in order to get the best results for network behavior I'm
>>         willing to
>>         sacrifice a great deal of cpu, interrupts, whatever it takes!
>>         to get the
>>         most packets to all the destinations specified, whatever the
>>         workload,
>>         with the *minimum amount of latency between ack and reply*
>>         possible.
>>
>>         What I'd hoped for in the new bulking and rcu stuff was to be
>>         able to
>>         see a net reduction in TSO/GSO Size, and/or BQL's size, and I
>>         also did
>>         keep hoping for some profiles on sch_fq, and for more complex
>>         benchmarking of dozens or hundreds of realistically sized TCP
>>         flows
>>         (in both directions) to exercise it all.
>>
>>         Some of the data presented showed that a single BQL'd queue
>>         was >400K,
>>         and with hardware multi-queue, 128K, when TSO and GSO were
>>         used, but
>>         with hardware multi-queue and no TSO/GSO, BQL was closer to
>>         30K.
>>
>>         This said to me that the maximum "right" size for a TSO/GSO
>>         "packet" was
>>         closer to 12k in this environment, and the right size for BQL,
>>         30k,
>>         before it started exerting backpressure to the qdisc.
>>
>>         This would reduce the potential inter-flow network latency by
>>         a factor
>>         of 10 on the single hw queue scenario, and 4 in the multi
>>         queue one.
>>
>>         It would probably cost some interrupts, and in scenarios
>>         lacking
>>         packet loss, throughput, but in other scenarios with lots of
>>         flows
>>         each flow will ramp up in speed, faster, as you reduce the
>>         RTTs.
>>         Paying attention to this will also push profiling activities
>>         into
>>         areas of the stack that might be profitable.
>>
>>         I would very much like to have profiles of happens now both
>>         here and
>>         elsewhere in the stack with this new code with TSO/GSO sizes
>>         capped
>>         thusly and BQL capped to 30k, and a smarter qdisc like fq
>>         used.
>>
>>         2) Most of the time, a server is not driving the wire to
>>         saturation. If
>>            it is, you are doing something wrong. The BQL queues are
>>         empty, or
>>            nearly so, so the instant someone creates a qdisc queue, it
>>            drains.
>>
>>         But: if there are two or more flows under contention, creating
>>         a qdisc queue
>>             better multiplexing the results is highly desirable, and
>>         the stack
>>            should be smart enough to make that overload only last
>>         briefly.
>>
>>            This is part of why I'm unfond of the deep and persistent
>>         BQL queues as we
>>         get today.
>>
>>         3) Pure ack-only workloads are rare. It is a useful test case,
>>         but...
>>
>>         4) I thought the ring-cleanup optimization was rather
>>         interesting and
>>            could be made more dynamic.
>>
>>         5) I remain amazed at the vast improvements in throughput,
>>         reductions in
>>         interrupts, lockless operation and the RCU stuff that have
>>         come out of
>>         this so far, but had to make these points in the hope that the
>>         big picture
>>         is retained.
>>
>>         It does no good to blast packets through the network unless
>>         there is a
>>         high probability that they will actually be received on the
>>         other side.
>>
>>         thanks for listening.
>
> Dave
>
> I am kind of surprised you wrote this nonsense.

Sorry, it was definately pre-coffee! I get twitchy when all the joy
seems to be in spewing packets at high rates rather than optimizing
for low RTTs in packet paired flow behavior.

> Being able to send data at full speed has nothing to do about how
> packets are scheduled. You are concerned about packet scheduling, and
> not about how fast we can send raw data on a 40Gb NIC.

I would like to also get better behavior out of gigE and below, and for
these changes to not impact the downstream behavior of the network
overall.

To give you an example, I would like to see the tcp flows in the
2nd chart here, to converge faster than the 5 seconds they currently
take at GigE speeds.

http://snapon.lab.bufferbloat.net/~cero2/nuc-to-puck/results.html

> We made all these changes so that we can spend cpu cycles at the right
> place.

I grok that.

> There are reasons we have fq_codel, and fq. Do not forget this, please.

Which is why I was hoping to see profiles along the way that showed where
else locks were being taken, what those cpu cycles were like, etc, and what were
those hotspots, when a smarter qdisc was engaged.


-- 
Dave Täht

https://www.bufferbloat.net/projects/make-wifi-fast

^ permalink raw reply

* Re: [PATCH v9 net-next 2/4] net: filter: split filter.h and expose eBPF to user space
From: Daniel Borkmann @ 2014-10-13 17:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Hannes Frederic Sowa, Chema Gonzalez,
	Eric Dumazet, Peter Zijlstra, H. Peter Anvin, Andrew Morton,
	Kees Cook, linux-api, netdev, linux-kernel
In-Reply-To: <540737DF.1010801@redhat.com>

On 09/03/2014 05:46 PM, Daniel Borkmann wrote:
...
> Ok, given you post the remaining two RFCs later on this window as
> you indicate, I have no objections:
>
> Acked-by: Daniel Borkmann <dborkman@redhat.com>

Ping, Alexei, are you still sending the patch for bpf_common.h or
do you want me to take care of this?

Cheers,
Daniel

^ permalink raw reply

* Re: [PATCH v2 net] x86: bpf_jit: fix two bugs in eBPF JIT compiler
From: Darrick J. Wong @ 2014-10-13 17:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Eric Dumazet, Daniel Borkmann, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, netdev, linux-kernel
In-Reply-To: <1412998223-3805-1-git-send-email-ast@plumgrid.com>

On Fri, Oct 10, 2014 at 08:30:23PM -0700, Alexei Starovoitov wrote:
> 1.
> JIT compiler using multi-pass approach to converge to final image size,
> since x86 instructions are variable length. It starts with large
> gaps between instructions (so some jumps may use imm32 instead of imm8)
> and iterates until total program size is the same as in previous pass.
> This algorithm works only if program size is strictly decreasing.
> Programs that use LD_ABS insn need additional code in prologue, but it
> was not emitted during 1st pass, so there was a chance that 2nd pass would
> adjust imm32->imm8 jump offsets to the same number of bytes as increase in
> prologue, which may cause algorithm to erroneously decide that size converged.
> Fix it by always emitting largest prologue in the first pass which
> is detected by oldproglen==0 check.
> Also change error check condition 'proglen != oldproglen' to fail gracefully.
> 
> 2.
> while staring at the code realized that 64-byte buffer may not be enough
> when 1st insn is large, so increase it to 128 to avoid buffer overflow
> (theoretical maximum size of prologue+div is 109) and add runtime check.
> 
> Fixes: 622582786c9e ("net: filter: x86: internal BPF JIT")
> Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>

This fixes the crash, thank you!

Tested-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
> v1->v2: reduce chances of stack corruption in case of future bugs (suggested by Eric)
> 
> note in classic BPF programs 1st insn is always short move, but native eBPF
> programs may trigger buffer overflow. I couldn't force the crash with overflow,
> since there are no further calls while this part of stack is used.
> Both are ugly bugs regardless.
> When net-next opens I will add narrowed down testcase from 'nmap' to testsuite.
> 
>  arch/x86/net/bpf_jit_comp.c |   25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index d56cd1f..3f62734 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -182,12 +182,17 @@ struct jit_context {
>  	bool seen_ld_abs;
>  };
>  
> +/* maximum number of bytes emitted while JITing one eBPF insn */
> +#define BPF_MAX_INSN_SIZE	128
> +#define BPF_INSN_SAFETY		64
> +
>  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>  		  int oldproglen, struct jit_context *ctx)
>  {
>  	struct bpf_insn *insn = bpf_prog->insnsi;
>  	int insn_cnt = bpf_prog->len;
> -	u8 temp[64];
> +	bool seen_ld_abs = ctx->seen_ld_abs | (oldproglen == 0);
> +	u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
>  	int i;
>  	int proglen = 0;
>  	u8 *prog = temp;
> @@ -225,7 +230,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>  	EMIT2(0x31, 0xc0); /* xor eax, eax */
>  	EMIT3(0x4D, 0x31, 0xED); /* xor r13, r13 */
>  
> -	if (ctx->seen_ld_abs) {
> +	if (seen_ld_abs) {
>  		/* r9d : skb->len - skb->data_len (headlen)
>  		 * r10 : skb->data
>  		 */
> @@ -685,7 +690,7 @@ xadd:			if (is_imm8(insn->off))
>  		case BPF_JMP | BPF_CALL:
>  			func = (u8 *) __bpf_call_base + imm32;
>  			jmp_offset = func - (image + addrs[i]);
> -			if (ctx->seen_ld_abs) {
> +			if (seen_ld_abs) {
>  				EMIT2(0x41, 0x52); /* push %r10 */
>  				EMIT2(0x41, 0x51); /* push %r9 */
>  				/* need to adjust jmp offset, since
> @@ -699,7 +704,7 @@ xadd:			if (is_imm8(insn->off))
>  				return -EINVAL;
>  			}
>  			EMIT1_off32(0xE8, jmp_offset);
> -			if (ctx->seen_ld_abs) {
> +			if (seen_ld_abs) {
>  				EMIT2(0x41, 0x59); /* pop %r9 */
>  				EMIT2(0x41, 0x5A); /* pop %r10 */
>  			}
> @@ -804,7 +809,8 @@ emit_jmp:
>  			goto common_load;
>  		case BPF_LD | BPF_ABS | BPF_W:
>  			func = CHOOSE_LOAD_FUNC(imm32, sk_load_word);
> -common_load:		ctx->seen_ld_abs = true;
> +common_load:
> +			ctx->seen_ld_abs = seen_ld_abs = true;
>  			jmp_offset = func - (image + addrs[i]);
>  			if (!func || !is_simm32(jmp_offset)) {
>  				pr_err("unsupported bpf func %d addr %p image %p\n",
> @@ -878,6 +884,11 @@ common_load:		ctx->seen_ld_abs = true;
>  		}
>  
>  		ilen = prog - temp;
> +		if (ilen > BPF_MAX_INSN_SIZE) {
> +			pr_err("bpf_jit_compile fatal insn size error\n");
> +			return -EFAULT;
> +		}
> +
>  		if (image) {
>  			if (unlikely(proglen + ilen > oldproglen)) {
>  				pr_err("bpf_jit_compile fatal error\n");
> @@ -934,9 +945,11 @@ void bpf_int_jit_compile(struct bpf_prog *prog)
>  			goto out;
>  		}
>  		if (image) {
> -			if (proglen != oldproglen)
> +			if (proglen != oldproglen) {
>  				pr_err("bpf_jit: proglen=%d != oldproglen=%d\n",
>  				       proglen, oldproglen);
> +				goto out;
> +			}
>  			break;
>  		}
>  		if (proglen == oldproglen) {
> -- 
> 1.7.9.5
> 

^ permalink raw reply

* Re: [PATCH net-next] tg3: Add skb->xmit_more support
From: Florian Westphal @ 2014-10-13 17:38 UTC (permalink / raw)
  To: Prashant Sreedharan; +Cc: davem, netdev, dborkman, mchan
In-Reply-To: <1413217302-15396-1-git-send-email-prashant@broadcom.com>

Prashant Sreedharan <prashant@broadcom.com> wrote:
> Ring TX doorbell only if xmit_more is not set or the queue is stopped.
> 
> Suggested-by: Daniel Borkmann <dborkman@redhat.com>
> Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
> Signed-off-by: Michael Chan <mchan@broadcom.com>
> ---
>  drivers/net/ethernet/broadcom/tg3.c |   10 ++++++----
>  1 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index ba49948..dbb41c1 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -8099,9 +8099,6 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	/* Sync BD data before updating mailbox */
>  	wmb();
>  
> -	/* Packets are ready, update Tx producer idx local and on card. */
> -	tw32_tx_mbox(tnapi->prodmbox, entry);
> -
>  	tnapi->tx_prod = entry;
>  	if (unlikely(tg3_tx_avail(tnapi) <= (MAX_SKB_FRAGS + 1))) {
>  		netif_tx_stop_queue(txq);
> @@ -8116,7 +8113,12 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  			netif_tx_wake_queue(txq);
>  	}
>  
> -	mmiowb();
> +	if (!skb->xmit_more || netif_xmit_stopped(txq)) {
> +		/* Packets are ready, update Tx producer idx on card. */
> +		tw32_tx_mbox(tnapi->prodmbox, entry);

I think you need to swap the test, i.e.

netif_xmit_stopped() || xmit_more

Otherwise, hw may not be aware of further packets wainting
and queue might not be restarted?

^ permalink raw reply

* Re: Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_
From: Tom Herbert @ 2014-10-13 17:43 UTC (permalink / raw)
  To: Dave Taht
  Cc: Eric Dumazet, Alexander Duyck, Jesper Dangaard Brouer,
	Hannes Frederic Sowa, John Fastabend, Jamal Hadi Salim,
	Daniel Borkmann, Florian Westphal, netdev@vger.kernel.org,
	Toke Høiland-Jørgensen, David Miller
In-Reply-To: <CAA93jw5MKVaNaObVD3Lm=DP0ie6HcjwOvt0X-whDLJD_6hF5bw@mail.gmail.com>

When TSO/GSO is enabled with BQL we tend to see large limits than if
they are disabled. This is due to the fact that we treat gso packets
as a single packet although way to queuing in device. BQL limit is
nominally 2*N+1 MSS where N is minimal number of bytes to keep queue
full. In gso MSS is up to 64K, so a limit of 192K is common (without
BQL I see limit of 30K).

For GSO, it seems like we can split larger segments. For instance if
in a bulk dequeue we need 30K but have 64K next in qdisc, maybe we can
split to do GSO on first 30K of segment, and requeue other 34K to the
qdisc.

With TSO we might do something similar, but probably harder to get
granularity since TX completions are only done on TSO packets (might
be interesting if a device could report partial completions somehow).


On Mon, Oct 13, 2014 at 10:20 AM, Dave Taht <dave.taht@gmail.com> wrote:
> On Mon, Oct 13, 2014 at 9:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>> On Oct 13, 2014 7:22 AM, "Dave Taht" <dave.taht@gmail.com> wrote:
>>>         When I first got cc'd on these threads, and saw
>>>         netperf-wrapper being
>>>         used on it,
>>>         I thought: "Oh god, I've created a monster.". My intent with
>>>         helping create
>>>         such a measurement tool was not to routinely drive a network
>>>         to saturation
>>>         but to be able to measure the impact on latency of doing so.
>>>
>>>         I was trying get reasonable behavior when a "router" went into
>>>         overload.
>>>
>>>         Servers, on the other hand, have more options to avoid
>>>         overload than
>>>         routers do. There's been a great deal of really nice work on
>>>         that
>>>         front. I love all that.
>>>
>>>         and I like BQL because it provides enough backpressure to be
>>>         able to
>>>         do smarter things about scheduling packets higher in the
>>>         stack. (life
>>>         pre-BQL cost some hair)
>>>
>>>         But tom once told me me "BQL's objective is to keep the
>>>         hardware busy".
>>>         It uses an MIAD controller instead of a more sane AIMD one, in
>>>         particular,
>>>         I'd much rather it ramped down to smaller values after
>>>         absorbing a burst.
>>>
>>>         My objective is always to keep the *network's behavior
>>>         optimal*,
>>>         minimizing bursts, and subsequent tail loss on the other side,
>>>         and responding quickly to loss, and doing that by
>>>         preserving to the highest extent possible the ack clocking
>>>         that a
>>>         fluid model has. I LOVE BQL for providing more backpressure
>>>         than has
>>>         ever existed before, and I know it's incredibly difficult to
>>>         have fluid models
>>>         in a conventional cpu architecture that has to do other stuff.
>>>
>>>         But in order to get the best results for network behavior I'm
>>>         willing to
>>>         sacrifice a great deal of cpu, interrupts, whatever it takes!
>>>         to get the
>>>         most packets to all the destinations specified, whatever the
>>>         workload,
>>>         with the *minimum amount of latency between ack and reply*
>>>         possible.
>>>
>>>         What I'd hoped for in the new bulking and rcu stuff was to be
>>>         able to
>>>         see a net reduction in TSO/GSO Size, and/or BQL's size, and I
>>>         also did
>>>         keep hoping for some profiles on sch_fq, and for more complex
>>>         benchmarking of dozens or hundreds of realistically sized TCP
>>>         flows
>>>         (in both directions) to exercise it all.
>>>
>>>         Some of the data presented showed that a single BQL'd queue
>>>         was >400K,
>>>         and with hardware multi-queue, 128K, when TSO and GSO were
>>>         used, but
>>>         with hardware multi-queue and no TSO/GSO, BQL was closer to
>>>         30K.
>>>
>>>         This said to me that the maximum "right" size for a TSO/GSO
>>>         "packet" was
>>>         closer to 12k in this environment, and the right size for BQL,
>>>         30k,
>>>         before it started exerting backpressure to the qdisc.
>>>
>>>         This would reduce the potential inter-flow network latency by
>>>         a factor
>>>         of 10 on the single hw queue scenario, and 4 in the multi
>>>         queue one.
>>>
>>>         It would probably cost some interrupts, and in scenarios
>>>         lacking
>>>         packet loss, throughput, but in other scenarios with lots of
>>>         flows
>>>         each flow will ramp up in speed, faster, as you reduce the
>>>         RTTs.
>>>         Paying attention to this will also push profiling activities
>>>         into
>>>         areas of the stack that might be profitable.
>>>
>>>         I would very much like to have profiles of happens now both
>>>         here and
>>>         elsewhere in the stack with this new code with TSO/GSO sizes
>>>         capped
>>>         thusly and BQL capped to 30k, and a smarter qdisc like fq
>>>         used.
>>>
>>>         2) Most of the time, a server is not driving the wire to
>>>         saturation. If
>>>            it is, you are doing something wrong. The BQL queues are
>>>         empty, or
>>>            nearly so, so the instant someone creates a qdisc queue, it
>>>            drains.
>>>
>>>         But: if there are two or more flows under contention, creating
>>>         a qdisc queue
>>>             better multiplexing the results is highly desirable, and
>>>         the stack
>>>            should be smart enough to make that overload only last
>>>         briefly.
>>>
>>>            This is part of why I'm unfond of the deep and persistent
>>>         BQL queues as we
>>>         get today.
>>>
>>>         3) Pure ack-only workloads are rare. It is a useful test case,
>>>         but...
>>>
>>>         4) I thought the ring-cleanup optimization was rather
>>>         interesting and
>>>            could be made more dynamic.
>>>
>>>         5) I remain amazed at the vast improvements in throughput,
>>>         reductions in
>>>         interrupts, lockless operation and the RCU stuff that have
>>>         come out of
>>>         this so far, but had to make these points in the hope that the
>>>         big picture
>>>         is retained.
>>>
>>>         It does no good to blast packets through the network unless
>>>         there is a
>>>         high probability that they will actually be received on the
>>>         other side.
>>>
>>>         thanks for listening.
>>
>> Dave
>>
>> I am kind of surprised you wrote this nonsense.
>
> Sorry, it was definately pre-coffee! I get twitchy when all the joy
> seems to be in spewing packets at high rates rather than optimizing
> for low RTTs in packet paired flow behavior.
>
There are multiple dimensions we are trying to optimize for. Bulk
dequeue should not adversely affect latency, we are merely doing work
in one operation previously done in several.

The lack of granularity in GSO segments might be something that could
be address though. When TSO/GSO is enabled with BQL we tend to see
large limits than if they are disabled. This is due to the fact that
we treat gso packets as a single packet all the way to queuing in
device. BQL limit is nominally 2*N+1 MSS where N is minimal number of
bytes to keep queue full. In gso MSS is up to 64K, so a limit of 192K
is common (without BQL I see limit of 30K).

For GSO, it seems like we can split larger segments. For instance if
in a bulk dequeue we need 30K but have 64K segment next in qdisc,
maybe we can split to do GSO on first 30K of segment, and requeue
other 34K to the qdisc.

With TSO we might do something similar, but probably harder to get
granularity since TX completions are only done on TSO packets (might
be interesting if a device could report partial completions somehow).

>> Being able to send data at full speed has nothing to do about how
>> packets are scheduled. You are concerned about packet scheduling, and
>> not about how fast we can send raw data on a 40Gb NIC.
>
> I would like to also get better behavior out of gigE and below, and for
> these changes to not impact the downstream behavior of the network
> overall.
>
> To give you an example, I would like to see the tcp flows in the
> 2nd chart here, to converge faster than the 5 seconds they currently
> take at GigE speeds.
>
> http://snapon.lab.bufferbloat.net/~cero2/nuc-to-puck/results.html
>
>> We made all these changes so that we can spend cpu cycles at the right
>> place.
>
> I grok that.
>
>> There are reasons we have fq_codel, and fq. Do not forget this, please.
>
> Which is why I was hoping to see profiles along the way that showed where
> else locks were being taken, what those cpu cycles were like, etc, and what were
> those hotspots, when a smarter qdisc was engaged.
>
>
> --
> Dave Täht
>
> https://www.bufferbloat.net/projects/make-wifi-fast

^ permalink raw reply

* Re: [PATCH stable v3.2 v3.4] ipv4: disable bh while doing route gc
From: David Miller @ 2014-10-13 17:51 UTC (permalink / raw)
  To: mleitner; +Cc: stable, netdev, hannes
In-Reply-To: <6c3d6eca5d6a15c01393b010f2116bd169477c5a.1413215324.git.mleitner@redhat.com>

From: Marcelo Ricardo Leitner <mleitner@redhat.com>
Date: Mon, 13 Oct 2014 14:03:30 -0300

> Further tests revealed that after moving the garbage collector to a work
> queue and protecting it with a spinlock may leave the system prone to
> soft lockups if bottom half gets very busy.
> 
> It was reproced with a set of firewall rules that REJECTed packets. If
> the NIC bottom half handler ends up running on the same CPU that is
> running the garbage collector on a very large cache, the garbage
> collector will not be able to do its job due to the amount of work
> needed for handling the REJECTs and also won't reschedule.
> 
> The fix is to disable bottom half during the garbage collecting, as it
> already was in the first place (most calls to it came from softirqs).
> 
> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Acked-by: David S. Miller <davem@davemloft.net>
> Cc: stable@vger.kernel.org

-stable folks, please integrate this directly, thanks!

^ permalink raw reply

* Re: [2/2] dsa: Replace mii_bus with a generic host device
From: Guenter Roeck @ 2014-10-13 17:51 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, f.fainelli, kernel, davem
In-Reply-To: <20140915170024.1261.97226.stgit@ahduyck-bv4.jf.intel.com>

On Mon, Sep 15, 2014 at 01:00:27PM -0400, Alexander Duyck wrote:
> This change makes it so that instead of passing and storing a mii_bus we
> instead pass and store a host_dev.  From there we can test to determine the
> exact type of device, and can verify it is the correct device for our switch.
> 
> So for example it would be possible to pass a device pointer from a pci_dev
> and instead of checking for a PHY ID we could check for a vendor and/or device
> ID.
> 

Hi Alexander,

We are thinking about porting a PCIe based switch to use the DSA infrastructure,
so this patch is coming quite handy.

Have you been working on a port or a PCIe based chip to DSA yourself, by any
chance ? If yes, could you possibly share the current state ?
That might save us some time.

Thanks,
Guenter

^ permalink raw reply

* Re: [PATCH net-next] tg3: Add skb->xmit_more support
From: Florian Westphal @ 2014-10-13 18:01 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Prashant Sreedharan, davem, netdev, dborkman, mchan
In-Reply-To: <20141013173806.GB26105@breakpoint.cc>

Florian Westphal <fw@strlen.de> wrote:
> Prashant Sreedharan <prashant@broadcom.com> wrote:
> > Ring TX doorbell only if xmit_more is not set or the queue is stopped.
> > 
> > Suggested-by: Daniel Borkmann <dborkman@redhat.com>
> > Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
> > Signed-off-by: Michael Chan <mchan@broadcom.com>
> > ---
> >  drivers/net/ethernet/broadcom/tg3.c |   10 ++++++----
> >  1 files changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> > index ba49948..dbb41c1 100644
> > --- a/drivers/net/ethernet/broadcom/tg3.c
> > +++ b/drivers/net/ethernet/broadcom/tg3.c
> > @@ -8099,9 +8099,6 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
> >  	/* Sync BD data before updating mailbox */
> >  	wmb();
> >  
> > -	/* Packets are ready, update Tx producer idx local and on card. */
> > -	tw32_tx_mbox(tnapi->prodmbox, entry);
> > -
> >  	tnapi->tx_prod = entry;
> >  	if (unlikely(tg3_tx_avail(tnapi) <= (MAX_SKB_FRAGS + 1))) {
> >  		netif_tx_stop_queue(txq);
> > @@ -8116,7 +8113,12 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
> >  			netif_tx_wake_queue(txq);
> >  	}
> >  
> > -	mmiowb();
> > +	if (!skb->xmit_more || netif_xmit_stopped(txq)) {
> > +		/* Packets are ready, update Tx producer idx on card. */
> > +		tw32_tx_mbox(tnapi->prodmbox, entry);
> 
> I think you need to swap the test, i.e.

Never mind, sorry for the noise.

^ permalink raw reply

* Re: [PATCH] sfc: efx: add support for skb->xmit_more
From: Edward Cree @ 2014-10-13 18:02 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: davem, nikolay, netdev, Shradha Shah, linux-net-drivers
In-Reply-To: <1413219585-15854-1-git-send-email-dborkman@redhat.com>

On 13/10/14 17:59, Daniel Borkmann wrote:
> Add support for xmit_more batching: efx has BQL support, thus we need
> to only move the queue hang check before the hw tail pointer write
> and test for xmit_more bit resp. whether the queue/stack has stopped.
> Thanks to Nikolay Aleksandrov who had a Solarflare NIC to test!
I see two issues here.

1) The error handling path (labels dma_err: and err:) will unwind back
to tx_queue->insert_count, which (AFAICT) only gets updated by
efx_nic_push_buffers() (at least on EF10, where that calls
efx_ef10_tx_write()).
So if we transmit a bunch of skbs with xmit_more, then get an error, we
will unwind all the way back to the start, whereas (again, AFAICT) the
previous skbs were nicely completed and safe to send.
The unwind will leave us in a good state, but we'll have failed to send
some packets that there was nothing wrong with.
I think the appropriate fix for this would be to maintain two separate
insert_counts, as xmit_more means we can't conflate "last write pointer
pushed" and "write pointer at start of this skb" any longer.

2) I think we shouldn't do PIO when xmit_more is set - it's probably bad
for performance (though I haven't measured), and I'm not entirely
convinced it will behave correctly.  I think
efx_nic_tx_is_empty(tx_queue) will return true if we have xmit_more'd a
PIO packet, enabling us to potentially PIO the next packet as well, thus
overwriting the PIO buffer before the NIC's had a chance to read it.
I'm also unsure about what happens to TX Push (efx_ef10_push_tx_desc),
for similar reasons.
So I think TX PIO and TX Push both need to be suppressed when
skb->xmit_more (as a correctness issue), and should also be suppressed
on the !xmit_more skb that 'uncorks' a previous row of xmit_mores - as a
performance issue at the least, and for TX Push I'm also unsure about
the correctness (though I haven't worked that one through in detail).

Until these issues are addressed, I have to NAK this.  Tomorrow I'll try
to rustle up an rfc patch that handles these issues, unless someone
beats me to it.

-Edward

> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Acked-by: Nikolay Aleksandrov <nikolay@redhat.com>
> Cc: Shradha Shah <sshah@solarflare.com>
> ---
>  drivers/net/ethernet/sfc/tx.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c
> index 3206098..566432c 100644
> --- a/drivers/net/ethernet/sfc/tx.c
> +++ b/drivers/net/ethernet/sfc/tx.c
> @@ -439,13 +439,13 @@ finish_packet:
>  
>  	netdev_tx_sent_queue(tx_queue->core_txq, skb->len);
>  
> +	efx_tx_maybe_stop_queue(tx_queue);
> +
>  	/* Pass off to hardware */
> -	efx_nic_push_buffers(tx_queue);
> +	if (!skb->xmit_more || netif_xmit_stopped(tx_queue->core_txq))
> +		efx_nic_push_buffers(tx_queue);
>  
>  	tx_queue->tx_packets++;
> -
> -	efx_tx_maybe_stop_queue(tx_queue);
> -
>  	return NETDEV_TX_OK;
>  
>   dma_err:
> @@ -1308,11 +1308,12 @@ static int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
>  
>  	netdev_tx_sent_queue(tx_queue->core_txq, skb->len);
>  
> -	/* Pass off to hardware */
> -	efx_nic_push_buffers(tx_queue);
> -
>  	efx_tx_maybe_stop_queue(tx_queue);
>  
> +	/* Pass off to hardware */
> +	if (!skb->xmit_more || netif_xmit_stopped(tx_queue->core_txq))
> +		efx_nic_push_buffers(tx_queue);
> +
>  	tx_queue->tso_bursts++;
>  	return NETDEV_TX_OK;
>  

^ permalink raw reply

* Re: IPv6 FIB related crash with MACVLANs in 3.9.11+ kernel.
From: Ben Greear @ 2014-10-13 18:06 UTC (permalink / raw)
  To: Vladislav Yasevich; +Cc: Cong Wang, Hannes Frederic Sowa, Hongmei Li, netdev
In-Reply-To: <CAGCdqXHh6AvabW2JwQEK5NcnTuxw9eMdRwd+JvYZUYqyHW4K7Q@mail.gmail.com>

On 10/12/2014 04:42 AM, Vladislav Yasevich wrote:
> On Mon, Sep 29, 2014 at 5:24 PM, Ben Greear <greearb@candelatech.com> wrote:
>> On 09/29/2014 01:39 PM, Cong Wang wrote:
>>> On Mon, Sep 29, 2014 at 12:48 PM, Ben Greear <greearb@candelatech.com> wrote:
>>>>
>>>> We are going to be running up to 20 concurrent scripts that
>>>> will be dumping routes & ips, and configuring ip addresses
>>>> and routes during bringup.
>>>>
>>>> I don't have a simple stand-alone way to reproduce this,
>>>> but at least when we reported the problem is was very easy
>>>> for us to reproduce with our tool.
>>>>
>>>> Maybe 20 scripts running in parallel that randomly configured and dumped routes
>>>> and ip addresses on random interfaces would do the trick?
>>>>
>>>
>>> I think the most important question is why is this related with macvlan?
>>> IPv6 is L3 while macvlan pure L2, if this is a IPv6 routing bug, it should
>>> not be limited to macvlan.
>>
>> We saw it using mac-vlans on top of ixgbe.  Probably it can be reproduced
>> elsewhere, but since we had a good test case, we didn't bother trying lots
>> of other combinations.
>>
>> Just enabling some debug code caused the problem to be much harder
>> to reproduce, so it is probably some sort of race.  Maybe lots of mac-vlans
>> make it easier to hit.
>>
>>> What does your IPv6 routing table look like? And how do you configure those
>>> macvlan interfaces?
>>
>> Each interface would have it's own routing table, with at least as subnet
>> route.  I'm not sure we were adding a default gateway or not.
>>
>> The main routing table would also have some auto-created routes
>> I think.
>>
>> It is configured using 'ip' to add and dump routes.
>>
> 
> Ben
> 
> Just saw this thread.  Can you check if this commit makes a difference:
> commit 40b8fe45d1f094e3babe7b2dc2b71557ab71401d
> Author: Vlad Yasevich <vyasevich@gmail.com>
> Date:   Mon Sep 22 16:34:17 2014 -0400
> 
>     macvtap: Fix race between device delete and open.

We can retest this, but we do not use macvtap, and we have higher-priority
things to test at the moment, so might be a while.

Thanks,
Ben

> 
> Thanks
> -vlad
> 
>> Thanks,
>> Ben
>>
>> --
>> Ben Greear <greearb@candelatech.com>
>> Candela Technologies Inc  http://www.candelatech.com
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [PATCH] sfc: efx: add support for skb->xmit_more
From: Daniel Borkmann @ 2014-10-13 19:18 UTC (permalink / raw)
  To: Edward Cree; +Cc: davem, nikolay, netdev, Shradha Shah, linux-net-drivers
In-Reply-To: <543C13AE.80606@solarflare.com>

On 10/13/2014 08:02 PM, Edward Cree wrote:
...
> Until these issues are addressed, I have to NAK this.  Tomorrow I'll try
> to rustle up an rfc patch that handles these issues, unless someone
> beats me to it.

Thanks for your feedback! Yes, please feel free to take this onwards and
integrate/adapt it to your needs for sfc's xmit_more support.

Thanks a lot,
Daniel

^ permalink raw reply

* Fwd: micrel: ksz8051 badly detected as ksz8031
From: Angelo Dureghello @ 2014-10-13 19:34 UTC (permalink / raw)
  To: netdev@vger.kernel.org
In-Reply-To: <543BB635.80608@gmail.com>

Hi,

i confirm, moving from kernel 3.5.1 to  3.17.0 the network Micrel phy 
link is non
functional.

The non working detection of KSZ8051 (taken default, ksz8031, from mdio 
driver)
breaks the functionality.

But this is probably due to the fact i am not using DT population, but 
just using
a minimal DT just to boot (to avoidto change the bootloader setup that's 
already
into production).

What change from 3.5.1 and 3.17.0 is mainly that DT support has been added.

If i copy the old 3.5.1 micrel.c i can have the link, but network was 
not working
properly.

So i brute-fixed the issue into 3.17.0 with this small patchfor my need,
Imiss the complete understanding still for a proper fix.Hope you can 
elucidate.




diff -crB drivers/net/ethernet/ti/davinci_mdio.c 
../linux-3.17/drivers/net/ethernet/ti/davinci_mdio.c
*** drivers/net/ethernet/ti/davinci_mdio.c      2014-10-13 
21:02:25.522022750 +0200
--- ../linux-3.17/drivers/net/ethernet/ti/davinci_mdio.c 2014-10-05 
21:23:04.000000000 +0200
***************
*** 320,327 ****
   }
   #endif

- extern struct phy_driver *micrel_pdrv;
-
   static int davinci_mdio_probe(struct platform_device *pdev)
   {
         struct mdio_platform_data *pdata = dev_get_platdata(&pdev->dev);
--- 320,325 ----
***************
*** 397,408 ****
         for (addr = 0; addr < PHY_MAX_ADDR; addr++) {
                 phy = data->bus->phy_map[addr];
                 if (phy) {
-                       /*
-                        * Angelo : micrel broken the link adding OF 
support.
-                        * waiting for the patch, i force the link to 
out phy.
-                        */
-                       phy->drv = micrel_pdrv;
-                       data->bus->phy_map[addr] = phy;
                         dev_info(dev, "phy[%d]: device %s, driver %s\n",
                                  phy->addr, dev_name(&phy->dev),
                                  phy->drv ? phy->drv->name : "unknown");
--- 395,400 ----

diff -crB drivers/net/phy/micrel.c ../linux-3.17/drivers/net/phy/micrel.c
*** drivers/net/phy/micrel.c    2014-10-13 21:00:03.273616086 +0200
--- ../linux-3.17/drivers/net/phy/micrel.c      2014-10-05 
21:23:04.000000000 +0200
***************
*** 640,650 ****
                 ARRAY_SIZE(ksphy_driver));
   }

- /*
-  * angelo, temporary patch
-  */
- struct phy_driver *micrel_pdrv = &ksphy_driver[5];
-
   module_init(ksphy_init);
   module_exit(ksphy_exit);



Regards, angelo

^ permalink raw reply

* Re: [PATCH] net: wireless: brcm80211: brcmfmac: dhd_sdio.c: Cleaning up missing null-terminate in conjunction with strncpy
From: Rickard Strandqvist @ 2014-10-13 20:14 UTC (permalink / raw)
  To: David Laight
  Cc: Brett Rudley, Arend van Spriel, Hante Meuleman, John W. Linville,
	Pieter-Paul Giesberts, Daniel Kim, linux-wireless@vger.kernel.org,
	brcm80211-dev-list@broadcom.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D174C992F@AcuExch.aculab.com>

2014-10-13 10:55 GMT+02:00 David Laight <David.Laight@aculab.com>:
> From: Rickard Strandqvist
>> Replacing strncpy with strlcpy to avoid strings that lacks null terminate.
>> And changed from using strncpy to strlcpy to simplify code.
>
> I think you should return an error if the strings get truncated.
> Silent truncation is going to lead to issues at some point in the future
> (in some places).
>
>> Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
>> ---
>>  drivers/net/wireless/brcm80211/brcmfmac/dhd_sdio.c |   25 ++++++++++----------
>>  1 file changed, 12 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/net/wireless/brcm80211/brcmfmac/dhd_sdio.c
>> b/drivers/net/wireless/brcm80211/brcmfmac/dhd_sdio.c
>> index f55f625..d20d4e6 100644
>> --- a/drivers/net/wireless/brcm80211/brcmfmac/dhd_sdio.c
>> +++ b/drivers/net/wireless/brcm80211/brcmfmac/dhd_sdio.c
>> @@ -670,7 +670,6 @@ static int brcmf_sdio_get_fwnames(struct brcmf_chip *ci,
>>                                 struct brcmf_sdio_dev *sdiodev)
>>  {
>>       int i;
>> -     uint fw_len, nv_len;
>>       char end;
>>
>>       for (i = 0; i < ARRAY_SIZE(brcmf_fwname_data); i++) {
>> @@ -684,25 +683,25 @@ static int brcmf_sdio_get_fwnames(struct brcmf_chip *ci,
>>               return -ENODEV;
>>       }
>>
>> -     fw_len = sizeof(sdiodev->fw_name) - 1;
>> -     nv_len = sizeof(sdiodev->nvram_name) - 1;
>>       /* check if firmware path is provided by module parameter */
>>       if (brcmf_firmware_path[0] != '\0') {
>> -             strncpy(sdiodev->fw_name, brcmf_firmware_path, fw_len);
>> -             strncpy(sdiodev->nvram_name, brcmf_firmware_path, nv_len);
>> -             fw_len -= strlen(sdiodev->fw_name);
>> -             nv_len -= strlen(sdiodev->nvram_name);
>> +             strlcpy(sdiodev->fw_name, brcmf_firmware_path,
>> +                     sizeof(sdiodev->fw_name));
>> +             strlcpy(sdiodev->nvram_name, brcmf_firmware_path,
>> +                     sizeof(sdiodev->nvram_name));
>>
>>               end = brcmf_firmware_path[strlen(brcmf_firmware_path) - 1];
>
> If you are doing a strlen() here, you could use the length for the copy
> and/or use it to avoid the strcat().
>
>>               if (end != '/') {
>> -                     strncat(sdiodev->fw_name, "/", fw_len);
>> -                     strncat(sdiodev->nvram_name, "/", nv_len);
>> -                     fw_len--;
>> -                     nv_len--;
>> +                     strlcat(sdiodev->fw_name, "/",
>> +                             sizeof(sdiodev->fw_name));
>> +                     strlcat(sdiodev->nvram_name, "/",
>> +                             sizeof(sdiodev->nvram_name));
>>               }
>>       }
>> -     strncat(sdiodev->fw_name, brcmf_fwname_data[i].bin, fw_len);
>> -     strncat(sdiodev->nvram_name, brcmf_fwname_data[i].nv, nv_len);
>> +     strlcat(sdiodev->fw_name, brcmf_fwname_data[i].bin,
>> +             sizeof(sdiodev->fw_name));
>> +     strlcat(sdiodev->nvram_name, brcmf_fwname_data[i].nv,
>> +             sizeof(sdiodev->nvram_name));
>
> I assume something ensures that fw_name[0] == 0 here.
>
>         David


Hi David


What do you mean you would use the strlen, can you give some example
code instead?


Arend van Spriel wanted me to change the title before.
So this should really continue the conversation in that new mail. And
then you can also see my other comments.
And I have the same type of objection to the:
brcmf_firmware_path[0] == '\0'

Se:
https://lkml.org/lkml/2014/10/12/42

Kind regards
Rickard Strandqvist

^ permalink raw reply

* Bug#763325: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
From: ael @ 2014-10-13 20:20 UTC (permalink / raw)
  To: netdev; +Cc: 763325

It was suggested that I post this bug here.
>From /var/log/messages:
=======================================================================

Oct  6 14:16:29 shelf kernel: [  961.633769] ------------[ cut here ]------------
Oct  6 14:16:29 shelf kernel: [  961.633790] WARNING: CPU: 0 PID: 0 at /build/linux-P15SNz/linux-3.16.3/net/sched/sch_generic.c:264
dev_watchdog+0x236/0x240()
Oct  6 14:16:29 shelf kernel: [  961.633792] NETDEV WATCHDOG: eth0
(r8169): transmit queue 0 timed out
Oct  6 14:16:29 shelf kernel: [  961.633794] Modules linked in:
sha256_ssse3 sha256_generic dm_crypt dm_mod usb_storage uinput cpuid
snd_hrtimer binfmt_misc snd_seq snd_seq_device bnep joydev nfsd
auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc
snd_hda_codec_realtek iTCO_wdt snd_hda_codec_hdmi snd_hda_codec_generic
iTCO_vendor_support ecb uvcvideo videobuf2_vmalloc btusb
videobuf2_memops videobuf2_core v4l2_common bluetooth videodev media
6lowpan_iphc arc4 rtsx_pci_sdmmc rtsx_pci_ms iwlmvm memstick mmc_core
mac80211 iwlwifi cfg80211 r8169 rtsx_pci mii rfkill x86_pkg_temp_thermal
intel_powerclamp intel_rapl coretemp kvm_intel kvm crc32_pclmul
crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul
glue_helper ablk_helper cryptd lpc_ich evdev psmouse snd_hda_intel
serio_raw sr_mod snd_hda_controller pcspkr sg snd_hda_codec cdrom
mfd_core tpm_infineon i915 snd_hwdep i2c_i801 wmi shpchp snd_pcm
ehci_pci snd_timer tpm_tis drm_kms_helper ehci_hcd xhci_hcd snd drm
soundcore tpm i2c_algo_bit mei_me usbcore i2c_core mei video usb_common
ac battery processor button fuse parport_pc ppdev lp parport autofs4
ext4 crc16 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic ahci libahci
crct10dif_pclmul crct10dif_common libata scsi_mod thermal thermal_sys
Oct  6 14:16:29 shelf kernel: [  961.633898] CPU: 0 PID: 0 Comm:
swapper/0 Not tainted 3.16-2-amd64 #1 Debian 3.16.3-2
Oct  6 14:16:29 shelf kernel: [  961.633899] Hardware name:
Notebook                         W54_55SU1,SUW/W54_55SU1,SUW, BIOS 4.6.5
05/29/2014
Oct  6 14:16:29 shelf kernel: [  961.633901]  0000000000000009 ffffffff81506188 ffff88041fa03e28 ffffffff81065707
Oct  6 14:16:29 shelf kernel: [  961.633903]  0000000000000000
ffff88041fa03e78 0000000000000001 0000000000000000
Oct  6 14:16:29 shelf kernel: [  961.633907]  ffff88040a0ec000
ffffffff8106576c ffffffff81775ca0 0000000000000030
Oct  6 14:16:29 shelf kernel: [  961.633912] Call Trace:
Oct  6 14:16:29 shelf kernel: [  961.633914]  <IRQ>
[<ffffffff81506188>] ? dump_stack+0x41/0x51
Oct  6 14:16:29 shelf kernel: [  961.633922]  [<ffffffff81065707>] ?
warn_slowpath_common+0x77/0x90
Oct  6 14:16:29 shelf kernel: [  961.633926]  [<ffffffff8106576c>] ?
warn_slowpath_fmt+0x4c/0x50
Oct  6 14:16:29 shelf kernel: [  961.633930]  [<ffffffff81439af6>] ?
dev_watchdog+0x236/0x240
Oct  6 14:16:29 shelf kernel: [  961.633936]  [<ffffffff814398c0>] ?
dev_graft_qdisc+0x70/0x70
Oct  6 14:16:29 shelf kernel: [  961.633941]  [<ffffffff810709c1>] ?
call_timer_fn+0x31/0x100
Oct  6 14:16:29 shelf kernel: [  961.633945]  [<ffffffff814398c0>] ?
dev_graft_qdisc+0x70/0x70
Oct  6 14:16:29 shelf kernel: [  961.633948]  [<ffffffff81071ff9>] ?
run_timer_softirq+0x209/0x2f0
Oct  6 14:16:29 shelf kernel: [  961.633953]  [<ffffffff8106a5a1>] ?
__do_softirq+0xf1/0x290
Oct  6 14:16:29 shelf kernel: [  961.633956]  [<ffffffff8106a975>] ?
irq_exit+0x95/0xa0
Oct  6 14:16:29 shelf kernel: [  961.633961]  [<ffffffff8150f155>] ?
smp_apic_timer_interrupt+0x45/0x60
Oct  6 14:16:29 shelf kernel: [  961.633965]  [<ffffffff8150d25d>] ?
apic_timer_interrupt+0x6d/0x80
Oct  6 14:16:29 shelf kernel: [  961.633968]  <EOI>
[<ffffffff81088b5d>] ? __hrtimer_start_range_ns+0x1cd/0x390
Oct  6 14:16:29 shelf kernel: [  961.633976]  [<ffffffff813d96e2>] ?
cpuidle_enter_state+0x52/0xc0
Oct  6 14:16:29 shelf kernel: [  961.633980]  [<ffffffff813d96d8>] ?
cpuidle_enter_state+0x48/0xc0
Oct  6 14:16:29 shelf kernel: [  961.633984]  [<ffffffff810a5d28>] ?
cpu_startup_entry+0x2f8/0x400
Oct  6 14:16:29 shelf kernel: [  961.633989]  [<ffffffff8190305a>] ?
start_kernel+0x47b/0x486
Oct  6 14:16:29 shelf kernel: [  961.633993]  [<ffffffff81902a04>] ?
set_init_arg+0x4e/0x4e
Oct  6 14:16:29 shelf kernel: [  961.633997]  [<ffffffff81902120>] ?
early_idt_handlers+0x120/0x120
Oct  6 14:16:29 shelf kernel: [  961.634000]  [<ffffffff8190271f>] ?
x86_64_start_kernel+0x14d/0x15c
Oct  6 14:16:29 shelf kernel: [  961.634004] ---[ end trace
065e688ac03d9c53 ]---
Oct  6 14:16:29 shelf kernel: [  961.657322] r8169 0000:03:00.1 eth0:
link up
===================================================================

I believe I was changing from ethernet to wireless frequently while
doing some testing just before this particular instance. However this
bug is typicaly trigggered several times when just using eth0 and has
been happening for the last week.

For completeness, here are extracts from the log before the instance
above (when I was probably explicitly taking eth0 up and down, so it
possibly has irrelevant noise from those activities):

Oct  6 13:59:06 shelf kernel: [    1.466463] r8169 0000:03:00.1 eth0:
RTL8411 at 0xffffc900018c4000, 80:fa:5b:04:ca:96, XID 1c800880 IRQ 44
Oct  6 13:59:06 shelf kernel: [    1.466470] r8169 0000:03:00.1 eth0:
jumbo features [frames: 9200 bytes, tx checksumming: ko]
--[snip]--
Oct  6 13:59:06 shelf kernel: [    1.556245] r8169 0000:03:00.1:
firmware: direct-loading firmware rtl_nic/rtl8411-2.fw
Oct  6 13:59:06 shelf kernel: [    1.571948] r8169 0000:03:00.1 eth0:
link down
--[snip]--
Oct  6 13:59:09 shelf kernel: [    5.463398] r8169 0000:03:00.1 eth0:
link up
--[snip]--
Oct  6 14:09:17 shelf kernel: [  545.893079] r8169 0000:03:00.1 eth0:
link down
--[snip]--
Oct  6 14:09:17 shelf kernel: [  547.647632] r8169 0000:03:00.1: no
hotplug settings from platform
Oct  6 14:09:17 shelf kernel: [  547.648327] done.
Oct  6 14:09:17 shelf kernel: [  547.660691] Bluetooth: hci0: read Intel
version: 370710018002030d37
Oct  6 14:09:17 shelf kernel: [  547.660693] Bluetooth: hci0: Intel
device is already patched. patch num: 37
Oct  6 14:09:17 shelf kernel: [  547.805664] [drm] Enabling RC6 states:
RC6 on, RC6p off, RC6pp off
Oct  6 14:09:23 shelf kernel: [  553.115079] r8169 0000:03:00.1 eth0:
link up
Oct  6 14:09:24 shelf kernel: [  554.134710] r8169 0000:03:00.1 eth0:
link down
Oct  6 14:09:27 shelf kernel: [  557.556657] r8169 0000:03:00.1 eth0:
link up
Oct  6 14:09:28 shelf kernel: [  558.578349] r8169 0000:03:00.1 eth0:
link down
Oct  6 14:09:32 shelf kernel: [  562.018553] r8169 0000:03:00.1 eth0:
link up
Oct  6 14:09:33 shelf kernel: [  563.040819] r8169 0000:03:00.1 eth0:
link down
Oct  6 14:09:36 shelf kernel: [  566.548111] r8169 0000:03:00.1 eth0:
link up
Oct  6 14:09:37 shelf kernel: [  567.568219] r8169 0000:03:00.1 eth0:
link down
Oct  6 14:09:41 shelf kernel: [  571.786079] r8169 0000:03:00.1 eth0:
link up
Oct  6 14:09:42 shelf kernel: [  572.808291] r8169 0000:03:00.1 eth0:
link down
Oct  6 14:09:45 shelf kernel: [  575.383331] r8169 0000:03:00.1 eth0:
link up
--[snip]--
Oct  6 14:11:59 shelf kernel: [  691.106397] r8169 0000:03:00.1 eth0:
link down
--[snip]--
Oct  6 14:11:59 shelf kernel: [  691.974414] r8169 0000:03:00.1: no
hotplug settings from platform
--[snip]--
Oct  6 14:12:05 shelf kernel: [  697.522170] r8169 0000:03:00.1 eth0:
link up
========================================================

lspci:
-----
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor DRAM Controller (rev 06)
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
00:03.0 Audio device: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller (rev 06)
00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05)
00:16.0 Communication controller: Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05)
00:1b.0 Audio device: Intel Corporation 8 Series/C220 Series Chipset High Definition Audio Controller (rev 05)
00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 (rev d5)
00:1c.2 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #3 (rev d5)
00:1c.3 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #4 (rev d5)
00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation HM86 Express LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05)
00:1f.3 SMBus: Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller (rev 05)
02:00.0 Network controller: Intel Corporation Wireless 7260 (rev 73)
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 5287 (rev 01)
03:00.1 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 12)
-----------------------------------------------------------

lspci -v -s 03:00.1:
-------------------

03:00.1 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 12)
	Subsystem: CLEVO/KAPOK Computer Device 5455
	Flags: bus master, fast devsel, latency 0, IRQ 43
	I/O ports at e000 [size=256]
	Memory at f7c14000 (64-bit, non-prefetchable) [size=4K]
	Memory at f7c10000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: r8169
--------------------------------------------------------------------

I have not set the debug parameter on r8169 but can do so if that would
be useful.

ael

^ permalink raw reply

* [PATCH 1/1 net-next] caif: replace kmalloc/memset 0 by kzalloc
From: Fabian Frederick @ 2014-10-13 20:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: Fabian Frederick, Dmitry Tarnyagin, David S. Miller, netdev

Also add blank line after declaration

Signed-off-by: Fabian Frederick <fabf@skynet.be>
---
 net/caif/cfmuxl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/caif/cfmuxl.c b/net/caif/cfmuxl.c
index 8c5d638..510aa5a 100644
--- a/net/caif/cfmuxl.c
+++ b/net/caif/cfmuxl.c
@@ -47,10 +47,10 @@ static struct cflayer *get_up(struct cfmuxl *muxl, u16 id);
 
 struct cflayer *cfmuxl_create(void)
 {
-	struct cfmuxl *this = kmalloc(sizeof(struct cfmuxl), GFP_ATOMIC);
+	struct cfmuxl *this = kzalloc(sizeof(struct cfmuxl), GFP_ATOMIC);
+
 	if (!this)
 		return NULL;
-	memset(this, 0, sizeof(*this));
 	this->layer.receive = cfmuxl_receive;
 	this->layer.transmit = cfmuxl_transmit;
 	this->layer.ctrlcmd = cfmuxl_ctrlcmd;
-- 
1.9.3

^ permalink raw reply related

* [PATCH 1/1 net-next] caif_usb: remove redundant memory message
From: Fabian Frederick @ 2014-10-13 20:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Fabian Frederick, Dmitry Tarnyagin, David S. Miller, netdev

Let MM subsystem display out of memory messages.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
---
 net/caif/caif_usb.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/caif/caif_usb.c b/net/caif/caif_usb.c
index ba02db0..0e487b0 100644
--- a/net/caif/caif_usb.c
+++ b/net/caif/caif_usb.c
@@ -87,10 +87,9 @@ static struct cflayer *cfusbl_create(int phyid, u8 ethaddr[ETH_ALEN],
 {
 	struct cfusbl *this = kmalloc(sizeof(struct cfusbl), GFP_ATOMIC);
 
-	if (!this) {
-		pr_warn("Out of memory\n");
+	if (!this)
 		return NULL;
-	}
+
 	caif_assert(offsetof(struct cfusbl, layer) == 0);
 
 	memset(this, 0, sizeof(struct cflayer));
-- 
1.9.3

^ permalink raw reply related

* Re: Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_
From: Jesper Dangaard Brouer @ 2014-10-13 20:27 UTC (permalink / raw)
  To: Dave Taht
  Cc: Eric Dumazet, Alexander Duyck, Hannes Frederic Sowa,
	John Fastabend, Jamal Hadi Salim, Daniel Borkmann,
	Florian Westphal, netdev@vger.kernel.org,
	Toke Høiland-Jørgensen, Tom Herbert, David Miller,
	brouer
In-Reply-To: <CAA93jw5MKVaNaObVD3Lm=DP0ie6HcjwOvt0X-whDLJD_6hF5bw@mail.gmail.com>


On Mon, 13 Oct 2014 10:20:17 -0700 Dave Taht <dave.taht@gmail.com> wrote:

> On Mon, Oct 13, 2014 at 9:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >> On Oct 13, 2014 7:22 AM, "Dave Taht" <dave.taht@gmail.com> wrote:
[...]
> 
> I would like to also get better behavior out of gigE and below, and for
> these changes to not impact the downstream behavior of the network
> overall.

I also care about 1Gbit/s and below, that why I did some many tests
(with igb at 10Mbit/s, 100Mbit/s and 1Gbit/s).  

 
> To give you an example, I would like to see the tcp flows in the
> 2nd chart here, to converge faster than the 5 seconds they currently
> take at GigE speeds.
> 
> http://snapon.lab.bufferbloat.net/~cero2/nuc-to-puck/results.html

In the last graph, where you cannot saturate the link, because you
turned off GSO, GRO and TSO.  Here I expect you will see the benefit of
the qdisc bulking. That is, will be able to saturate the link and
achieve the lower latency as BQL will cut off the bursts at +1 MTU.
I would be interested in the results...


> > We made all these changes so that we can spend cpu cycles at the right
> > place.

Exactly. 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH 1/1 net-next] caif_usb: remove redundant memory message
From: Joe Perches @ 2014-10-13 20:34 UTC (permalink / raw)
  To: Fabian Frederick; +Cc: linux-kernel, Dmitry Tarnyagin, David S. Miller, netdev
In-Reply-To: <1413231946-9914-1-git-send-email-fabf@skynet.be>

On Mon, 2014-10-13 at 22:25 +0200, Fabian Frederick wrote:
> Let MM subsystem display out of memory messages.
[]
> diff --git a/net/caif/caif_usb.c b/net/caif/caif_usb.c
[]
> @@ -87,10 +87,9 @@ static struct cflayer *cfusbl_create(int phyid, u8 ethaddr[ETH_ALEN],
>  {
>  	struct cfusbl *this = kmalloc(sizeof(struct cfusbl), GFP_ATOMIC);
>  
> -	if (!this) {
> -		pr_warn("Out of memory\n");
> +	if (!this)
>  		return NULL;
> -	}
> +
>  	caif_assert(offsetof(struct cfusbl, layer) == 0);
>  
>  	memset(this, 0, sizeof(struct cflayer));

This bit should probably be:

	memset(&this->layer, 0, sizeof(this->layer));

or the allocation above should use kzalloc.

^ permalink raw reply

* Re: [ovs-dev] [PATCH/RFC repost 7/8] ofproto: translate datapath select group action
From: Ben Pfaff @ 2014-10-13 20:46 UTC (permalink / raw)
  To: Simon Horman; +Cc: dev, netdev
In-Reply-To: <20141009011434.GE9339@vergenet.net>

On Thu, Oct 09, 2014 at 10:14:36AM +0900, Simon Horman wrote:
> On Fri, Sep 26, 2014 at 04:57:25PM -0700, Ben Pfaff wrote:
> > On Thu, Sep 18, 2014 at 10:55:10AM +0900, Simon Horman wrote:
> > > This patch is a prototype and has several limitations:
> > > 
> > > * It assumes that no actions follow a select group action
> > >   because the resulting packet after a select group action may
> > >   differ depending on the bucket used. It may be possible
> > >   to address this problem using recirculation. Or to not use
> > >   the datapath select group in such situations. In any case
> > >   this patch does not solve this problem or even prevent it
> > >   from occurring.
> > 
> > It seems like this limitation in particular is a pretty big one.  Do
> > you have a good plan in mind for how to resolve it?
> 
> Hi Ben,
> 
> it seems to me that this would be somewhat difficult to resolve in the
> datapath so I propose not doing so. And I have two ideas on how to
> resolve this problem outside of the datapath.
> 
> 1. Recirculation
> 
>    It seems to me that it ought to be possible to handle this by
>    recirculating if actions occur after an ODP select group action.
> 
>    This could be made slightly more selective by only recirculating
>    if the execution different buckets may result in different packet
>    contents and the actions after the ODP select group action rely on
>    the packet contents (e.g. set actions do but output actions do not).
> 
>    My feeling is that this could be implemented by adding a small amount
>    of extra state to action translation in ovs-vswitchd.
> 
> 2. Fall back to selecting buckets in ovs-vswtichd
> 
>    The idea here is to detect cases where there would be a problem
>    executing actions after an ODP select group action and in that
>    case to select buckets in ovs-vswtichd: that is use the existing bucket
>    translation code in ovs-vswtichd.
> 
>    Though this seems conceptually simpler than recirculation it
>    seems to me that it would be somewhat more difficult to implement
>    as it implies a two stage translation process: e.g. one stage to
>    determine if an ODP select group may be used; and one to perform
>    the translation.
> 
>    I seem to recall trying various two stage translation processes
>    as part some earlier unrelated work. And my recollection is that
>    the result of my previous efforts were not pretty.
> 
> Both of the above more or less negate any benefits of ODP select group
> action. In particular lowering flow setup cost and potentially allowing
> complete offload of select groups from the datapath to hardware. However I
> think that this case is not a common one as it requires both of the
> following. And I think they are both not usual use cases.
> 
> * Different buckets modifying packets in different ways
>   - My expectation is that it is common for buckets to be homogeneous in
>     regards to packet modifications. But perhaps this is na??ve in the
>     context of VLANs, MPLS, and similar tags that can be pushed and popped.
> * Actions that rely on packet contents after
>   - My expectation is that it is common to use a select group to output
>     packets and that is the final action performed.

I am glad that you have thought about it.  Your ideas seem like a good
start to me.  Personally, approach #2, of falling back to selecting
buckets in ovs-vswitchd, seems like a clean solution to me, although
if it really takes multiple stages in translation then that is
undesirable, so I hope that some clean and simple approach works out.

I think that we probably need a solution before we can apply the patch
series, because otherwise we end up with half-working code.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox