Netdev List
 help / color / mirror / Atom feed
* Re: [RFC PATCH net-next] ipvs: add missing lock in ip_vs_ftp_init_conn()
From: Simon Horman @ 2012-06-29  0:34 UTC (permalink / raw)
  To: Xiaotian Feng
  Cc: netdev, lvs-devel, netfilter-devel, netfilter, coreteam,
	linux-kernel, Xiaotian Feng, Wensong Zhang, Julian Anastasov,
	Pablo Neira Ayuso, Patrick McHardy, David S. Miller
In-Reply-To: <1340890587-8169-1-git-send-email-xtfeng@gmail.com>

On Thu, Jun 28, 2012 at 09:36:27PM +0800, Xiaotian Feng wrote:
> We met a kernel panic in 2.6.32.43 kernel:
> 
> [2680191.848044] IPVS: ip_vs_conn_hash(): request for already hashed, called from run_timer_softirq+0x175/0x1d0
> <snip>
> [2680311.849009] general protection fault: 0000 [#1] SMP
> [2680311.853001] RIP: 0010:[<ffffffff815f155c>]  [<ffffffff815f155c>] ip_vs_conn_expire+0xdc/0x2f0
> [2680311.853001] RSP: 0018:ffff880028303e70  EFLAGS: 00010202
> [2680311.853001] RAX: dead000000200200 RBX: ffff8801aad00b80 RCX: 0000000000001d90
> [2680311.853001] RDX: dead000000100100 RSI: 000000004fd59800 RDI: ffff8801aad00c08
> <snip>
> [2680311.853001] Call Trace:
> [2680311.853001]  <IRQ>
> [2680311.853001]  [<ffffffff815f1480>] ? ip_vs_conn_expire+0x0/0x2f0
> [2680311.853001]  [<ffffffff8104e2a5>] run_timer_softirq+0x175/0x1d0
> [2680311.853001]  [<ffffffff81021a48>] ? lapic_next_event+0x18/0x20
> [2680311.853001]  [<ffffffff81049a13>] __do_softirq+0xb3/0x150
> [2680311.853001]  [<ffffffff8100cc5c>] call_softirq+0x1c/0x30
> [2680311.853001]  [<ffffffff8100ea9a>] do_softirq+0x4a/0x80
> [2680311.853001]  [<ffffffff81049957>] irq_exit+0x77/0x80
> [2680311.853001]  [<ffffffff81021f2c>] smp_apic_timer_interrupt+0x6c/0xa0
> [2680311.853001]  [<ffffffff8100c633>] apic_timer_interrupt+0x13/0x20
> [2680311.853001]  <EOI>
> [2680311.853001]  [<ffffffff81013b52>] ? mwait_idle+0x52/0x70
> [2680311.853001]  [<ffffffff8100a7b0>] ? enter_idle+0x20/0x30
> [2680311.853001]  [<ffffffff8100ac62>] ? cpu_idle+0x52/0x80
> [2680311.853001]  [<ffffffff816d504d>] ? start_secondary+0x19d/0x280
> 
> rax and rdx is LIST_POISON1 and LIST_POISON2, so kernel is list_del() on an already deleted
> connection and result the general protect fault.
> 
> The "request for already hashed" warning, told us someone might change the connection flags
> incorrectly, like described in commit aea9d711, it changes the connection flags, but doesn't
> put the connection back to the list. So ip_vs_conn_hash() throw a warning and return.
> Later, when ip_vs_conn_expire fire again, ip_vs_conn_unhash() will find the HASHED connection
> and list_del() it, then kernel panic happened.
> 
> After code review, the only chance that kernel change connection flag without protection is
> in ip_vs_ftp_init_conn().

Thanks for being thorough.

> Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
> Cc: Wensong Zhang <wensong@linux-vs.org>
> Cc: Simon Horman <horms@verge.net.au>
> Cc: Julian Anastasov <ja@ssi.bg>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Cc: Patrick McHardy <kaber@trash.net>
> Cc: "David S. Miller" <davem@davemloft.net> 

Acked-by: Simon Horman <horms@verge.net.au>

Pablo, can you handle this directly through your tree?
I think it also needs to go to -stable.

> ---
>  net/netfilter/ipvs/ip_vs_ftp.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
> index b20b29c..c2bc264 100644
> --- a/net/netfilter/ipvs/ip_vs_ftp.c
> +++ b/net/netfilter/ipvs/ip_vs_ftp.c
> @@ -65,8 +65,10 @@ static int ip_vs_ftp_pasv;
>  static int
>  ip_vs_ftp_init_conn(struct ip_vs_app *app, struct ip_vs_conn *cp)
>  {
> +	spin_lock(&cp->lock);
>  	/* We use connection tracking for the command connection */
>  	cp->flags |= IP_VS_CONN_F_NFCT;
> +	spin_unlock(&cp->lock);
>  	return 0;
>  }

^ permalink raw reply

* Re: [RFC PATCH net-next] ipvs: add missing lock in ip_vs_ftp_init_conn()
From: Julian Anastasov @ 2012-06-29  0:17 UTC (permalink / raw)
  To: Xiaotian Feng
  Cc: netdev, lvs-devel, netfilter-devel, netfilter, coreteam,
	linux-kernel, Xiaotian Feng, Wensong Zhang, Simon Horman,
	Pablo Neira Ayuso, Patrick McHardy, David S. Miller
In-Reply-To: <1340890587-8169-1-git-send-email-xtfeng@gmail.com>


	Hello,

On Thu, 28 Jun 2012, Xiaotian Feng wrote:

> We met a kernel panic in 2.6.32.43 kernel:
> 
> [2680191.848044] IPVS: ip_vs_conn_hash(): request for already hashed, called from run_timer_softirq+0x175/0x1d0
> <snip>
> [2680311.849009] general protection fault: 0000 [#1] SMP
> [2680311.853001] RIP: 0010:[<ffffffff815f155c>]  [<ffffffff815f155c>] ip_vs_conn_expire+0xdc/0x2f0
> [2680311.853001] RSP: 0018:ffff880028303e70  EFLAGS: 00010202
> [2680311.853001] RAX: dead000000200200 RBX: ffff8801aad00b80 RCX: 0000000000001d90
> [2680311.853001] RDX: dead000000100100 RSI: 000000004fd59800 RDI: ffff8801aad00c08
> <snip>
> [2680311.853001] Call Trace:
> [2680311.853001]  <IRQ>
> [2680311.853001]  [<ffffffff815f1480>] ? ip_vs_conn_expire+0x0/0x2f0
> [2680311.853001]  [<ffffffff8104e2a5>] run_timer_softirq+0x175/0x1d0
> [2680311.853001]  [<ffffffff81021a48>] ? lapic_next_event+0x18/0x20
> [2680311.853001]  [<ffffffff81049a13>] __do_softirq+0xb3/0x150
> [2680311.853001]  [<ffffffff8100cc5c>] call_softirq+0x1c/0x30
> [2680311.853001]  [<ffffffff8100ea9a>] do_softirq+0x4a/0x80
> [2680311.853001]  [<ffffffff81049957>] irq_exit+0x77/0x80
> [2680311.853001]  [<ffffffff81021f2c>] smp_apic_timer_interrupt+0x6c/0xa0
> [2680311.853001]  [<ffffffff8100c633>] apic_timer_interrupt+0x13/0x20
> [2680311.853001]  <EOI>
> [2680311.853001]  [<ffffffff81013b52>] ? mwait_idle+0x52/0x70
> [2680311.853001]  [<ffffffff8100a7b0>] ? enter_idle+0x20/0x30
> [2680311.853001]  [<ffffffff8100ac62>] ? cpu_idle+0x52/0x80
> [2680311.853001]  [<ffffffff816d504d>] ? start_secondary+0x19d/0x280
> 
> rax and rdx is LIST_POISON1 and LIST_POISON2, so kernel is list_del() on an already deleted
> connection and result the general protect fault.
> 
> The "request for already hashed" warning, told us someone might change the connection flags
> incorrectly, like described in commit aea9d711, it changes the connection flags, but doesn't
> put the connection back to the list. So ip_vs_conn_hash() throw a warning and return.
> Later, when ip_vs_conn_expire fire again, ip_vs_conn_unhash() will find the HASHED connection
> and list_del() it, then kernel panic happened.
> 
> After code review, the only chance that kernel change connection flag without protection is
> in ip_vs_ftp_init_conn().

	Hm, ip_vs_ftp_init_conn is called before 1st hashing,
from ip_vs_bind_app() in ip_vs_conn_new() before
ip_vs_conn_hash(). It should be another problem with
the flags. How different is IPVS in 2.6.32.43 compared to
recent kernels? If commit aea9d711 is present, I'm not
aware of other similar problems.

	May be you can post some details for your setup on
lvs-devel@vger.kernel.org. I assume FTP is used,
what about master-backup sync? Can you confirm that
this fix solves the problem or it happens in rare cases?

> Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
> Cc: Wensong Zhang <wensong@linux-vs.org>
> Cc: Simon Horman <horms@verge.net.au>
> Cc: Julian Anastasov <ja@ssi.bg>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Cc: Patrick McHardy <kaber@trash.net>
> Cc: "David S. Miller" <davem@davemloft.net> 
> ---
>  net/netfilter/ipvs/ip_vs_ftp.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
> index b20b29c..c2bc264 100644
> --- a/net/netfilter/ipvs/ip_vs_ftp.c
> +++ b/net/netfilter/ipvs/ip_vs_ftp.c
> @@ -65,8 +65,10 @@ static int ip_vs_ftp_pasv;
>  static int
>  ip_vs_ftp_init_conn(struct ip_vs_app *app, struct ip_vs_conn *cp)
>  {
> +	spin_lock(&cp->lock);
>  	/* We use connection tracking for the command connection */
>  	cp->flags |= IP_VS_CONN_F_NFCT;
> +	spin_unlock(&cp->lock);
>  	return 0;
>  }
>  
> -- 
> 1.7.1

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH] davinci_cpdma: include linux/module.h
From: David Miller @ 2012-06-29  0:03 UTC (permalink / raw)
  To: zonque; +Cc: netdev, linux-arm-kernel, hvaibhav, christian.riesch
In-Reply-To: <1340899952-15201-1-git-send-email-zonque@gmail.com>

From: Daniel Mack <zonque@gmail.com>
Date: Thu, 28 Jun 2012 18:12:32 +0200

> This fixes a number of warnings such as:
> 
>   CC      drivers/net/ethernet/ti/davinci_cpdma.o
> drivers/net/ethernet/ti/davinci_cpdma.c:279:1: warning: data definition
> has no type or storage class
> drivers/net/ethernet/ti/davinci_cpdma.c:279:1: warning: type defaults to
> ‘int’ in declaration of ‘EXPORT_SYMBOL_GPL’
> drivers/net/ethernet/ti/davinci_cpdma.c:279:1: warning: parameter names
> (without types) in function declaration
> 
> Signed-off-by: Daniel Mack <zonque@gmail.com>

Applied, thanks.

^ permalink raw reply

* Re: pull request: wireless 2012-06-28
From: David Miller @ 2012-06-29  0:01 UTC (permalink / raw)
  To: linville-2XuSBdqkA4R54TAoqtyWWQ
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20120628184011.GC2115-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>

From: "John W. Linville" <linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
Date: Thu, 28 Jun 2012 14:40:12 -0400

> These fixes are intended for 3.5...
 ...
> Please let me know if there are problems!
 ...
>   git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless.git for-davem

Pulled, thanks John.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] gianfar: Fix RXICr/TXICr programming for multi-queue mode
From: David Miller @ 2012-06-28 23:57 UTC (permalink / raw)
  To: claudiu.manoil; +Cc: netdev
In-Reply-To: <1340894453-28556-1-git-send-email-claudiu.manoil@freescale.com>

From: Claudiu Manoil <claudiu.manoil@freescale.com>
Date: Thu, 28 Jun 2012 17:40:53 +0300

> The correct behavior is to program the interrupt coalescing regs
> (RXICr/TXICr) in accordance with the Rx/Tx Q's "rx/txcoalescing"
> flag. That is, if the coalescing flag is 0 for a given Rx/Tx queue
> then the corresponding coalescing register should be cleared.
> This behavior is correctly implemented for the single-queue mode
> (SQ_SG_MODE), but not for the multi-queue mode (MQ_MG_MODE).
> This fixes the later case.
> 
> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>

Applied, thanks.

^ permalink raw reply

* Re: [Xen-devel] [PATCH 1/1] xen/netback: only non-freed SKB is queued into tx_queue
From: David Miller @ 2012-06-28 23:55 UTC (permalink / raw)
  To: annie.li; +Cc: xen-devel, netdev, Ian.Campbell, konrad.wilk, kurt.hackel
In-Reply-To: <1340794018-17274-1-git-send-email-annie.li@oracle.com>

From: annie.li@oracle.com
Date: Wed, 27 Jun 2012 18:46:58 +0800

> From: Annie Li <Annie.li@oracle.com>
> 
> After SKB is queued into tx_queue, it will be freed if request_gop is NULL.
> However, no dequeue action is called in this situation, it is likely that
> tx_queue constains freed SKB. This patch should fix this issue, and it is
> based on 3.5.0-rc4+.
> 
> This issue is found through code inspection, no bug is seen with it currently.
> I run netperf test for several hours, and no network regression was found.
> 
> Signed-off-by: Annie Li <annie.li@oracle.com>

I lack the expertiece necessary to properly review this, so I really
need a Xen expert to look this over.

^ permalink raw reply

* Re: [PATCH] net: Downgrade CAP_SYS_MODULE deprecated message from error to warning.
From: David Miller @ 2012-06-28 23:55 UTC (permalink / raw)
  To: vlee; +Cc: edumazet, mirq-linux, jpirko, therbert, netdev, linux-kernel,
	tdmackey
In-Reply-To: <1340843527-9962-1-git-send-email-vlee@twitter.com>

From: Vinson Lee <vlee@twitter.com>
Date: Wed, 27 Jun 2012 17:32:07 -0700

> Make logging level consistent with other deprecation messages in net
> subsystem.
> 
> Signed-off-by: Vinson Lee <vlee@twitter.com>
> Cc: David Mackey <tdmackey@twitter.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net] net: qmi_wwan: fix Oops while disconnecting
From: David Miller @ 2012-06-28 23:54 UTC (permalink / raw)
  To: oliver-GvhC2dPhHPQdnm+yROfE0A
  Cc: bjorn-yOkvZcmFvRU, netdev-u79uwXL29TY76Z2rM5mHXA,
	tom.leiming-Re5JQEeQqe8AvxtiuMwx3w,
	linux-usb-u79uwXL29TY76Z2rM5mHXA,
	marius.kotsbak-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <201206281040.55402.oliver-GvhC2dPhHPQdnm+yROfE0A@public.gmane.org>

From: Oliver Neukum <oliver-GvhC2dPhHPQdnm+yROfE0A@public.gmane.org>
Date: Thu, 28 Jun 2012 10:40:55 +0200

> Am Donnerstag, 28. Juni 2012, 10:36:49 schrieb Bjørn Mork:
>> I'd like this patch applied to qmi_wwan regardless of the outcome of the
>> (now stalled?) generic usbnet_disconnect discussion.
>> 
>> The patch fixes a real Oops in 3.4 and 3.5, and I believe it should be
>> left in qmi_wwan even if the usbnet code is fixed to avoid this specific
>> bug.  The additional NULL test won't harm, and it makes the code more
>> robust should someone decide to rearrange usbnet_disconnect again at
>> some later point in time.
>> 
>> I really want this fixed in the next 3.4 stable release, if possible.
>> Should I resubmit the patch, or will you pick it up from
>> http://patchwork.ozlabs.org/patch/166542/ ?
> 
> Yes.
> 
> David, can you please apply this? It needs to also go into stable.
> We might remove this in later releases, but for now it is needed.

Done.  I left the stable CC: in there so Greg will pick this up
automatically once it hits Linus's tree.

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next] fq_codel: report congestion notification at enqueue time
From: Nandita Dukkipati @ 2012-06-28 23:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Dave Taht, codel, Tom Herbert, Matt Mathis,
	Yuchung Cheng, Neal Cardwell
In-Reply-To: <1340903237.13187.151.camel@edumazet-glaptop>

As you know I really like this idea. My main concern is that the same
packet could cause TCP to reduce cwnd twice within an RTT - first on
enqueue and then if this packet is ECN marked on dequeue. I don't
think this is the desired behavior. Can we avoid it?

On Thu, Jun 28, 2012 at 10:07 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> At enqueue time, check sojourn time of packet at head of the queue,
> and return NET_XMIT_CN instead of NET_XMIT_SUCCESS if this sejourn
> time is above codel @target.
>
> This permits local TCP stack to call tcp_enter_cwr() and reduce its cwnd
> without drops (for example if ECN is not enabled for the flow)
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Dave Taht <dave.taht@bufferbloat.net>
> Cc: Tom Herbert <therbert@google.com>
> Cc: Matt Mathis <mattmathis@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Nandita Dukkipati <nanditad@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> ---
>  include/linux/pkt_sched.h |    1 +
>  include/net/codel.h       |    2 +-
>  net/sched/sch_fq_codel.c  |   19 +++++++++++++++----
>  3 files changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
> index 32aef0a..4d409a5 100644
> --- a/include/linux/pkt_sched.h
> +++ b/include/linux/pkt_sched.h
> @@ -714,6 +714,7 @@ struct tc_fq_codel_qd_stats {
>                                 */
>        __u32   new_flows_len;  /* count of flows in new list */
>        __u32   old_flows_len;  /* count of flows in old list */
> +       __u32   congestion_count;
>  };
>
>  struct tc_fq_codel_cl_stats {
> diff --git a/include/net/codel.h b/include/net/codel.h
> index 550debf..8c7d6a7 100644
> --- a/include/net/codel.h
> +++ b/include/net/codel.h
> @@ -148,7 +148,7 @@ struct codel_vars {
>  * struct codel_stats - contains codel shared variables and stats
>  * @maxpacket: largest packet we've seen so far
>  * @drop_count:        temp count of dropped packets in dequeue()
> - * ecn_mark:   number of packets we ECN marked instead of dropping
> + * @ecn_mark:  number of packets we ECN marked instead of dropping
>  */
>  struct codel_stats {
>        u32             maxpacket;
> diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
> index 9fc1c62..c0485a0 100644
> --- a/net/sched/sch_fq_codel.c
> +++ b/net/sched/sch_fq_codel.c
> @@ -62,6 +62,7 @@ struct fq_codel_sched_data {
>        struct codel_stats cstats;
>        u32             drop_overlimit;
>        u32             new_flow_count;
> +       u32             congestion_count;
>
>        struct list_head new_flows;     /* list of new flows */
>        struct list_head old_flows;     /* list of old flows */
> @@ -196,16 +197,25 @@ static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch)
>                flow->deficit = q->quantum;
>                flow->dropped = 0;
>        }
> -       if (++sch->q.qlen < sch->limit)
> +       if (++sch->q.qlen < sch->limit) {
> +               codel_time_t hdelay = codel_get_enqueue_time(skb) -
> +                                     codel_get_enqueue_time(flow->head);
> +
> +               /* If this flow is congested, tell the caller ! */
> +               if (codel_time_after(hdelay, q->cparams.target)) {
> +                       q->congestion_count++;
> +                       return NET_XMIT_CN;
> +               }
>                return NET_XMIT_SUCCESS;
> -
> +       }
>        q->drop_overlimit++;
>        /* Return Congestion Notification only if we dropped a packet
>         * from this flow.
>         */
> -       if (fq_codel_drop(sch) == idx)
> +       if (fq_codel_drop(sch) == idx) {
> +               q->congestion_count++;
>                return NET_XMIT_CN;
> -
> +       }
>        /* As we dropped a packet, better let upper stack know this */
>        qdisc_tree_decrease_qlen(sch, 1);
>        return NET_XMIT_SUCCESS;
> @@ -467,6 +477,7 @@ static int fq_codel_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
>
>        st.qdisc_stats.maxpacket = q->cstats.maxpacket;
>        st.qdisc_stats.drop_overlimit = q->drop_overlimit;
> +       st.qdisc_stats.congestion_count = q->congestion_count;
>        st.qdisc_stats.ecn_mark = q->cstats.ecn_mark;
>        st.qdisc_stats.new_flow_count = q->new_flow_count;
>
>
>

^ permalink raw reply

* Re: [PATCH] net: unlock busylock before servicing qdisc
From: David Miller @ 2012-06-28 23:48 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, tparkin
In-Reply-To: <1340911362.13187.183.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 28 Jun 2012 21:22:42 +0200

> @@ -2412,13 +2412,13 @@ static inline int __dev_xmit_skb(struct sk_buff
> *skb, struct Qdisc *q,

Looks like your email client clobbered this, please resubmit.

Thanks.

^ permalink raw reply

* Re: [PATCH net-next] fq_codel: report congestion notification at enqueue time
From: Dave Taht @ 2012-06-28 23:47 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Nandita Dukkipati, netdev, codel, Matt Mathis, Neal Cardwell,
	David Miller
In-Reply-To: <CAK6E8=dM0HNjjT=pDWzkBHs2Npi9foAnk1zxXdVLqBxA63dSqw@mail.gmail.com>

On Thu, Jun 28, 2012 at 6:56 PM, Yuchung Cheng <ycheng@google.com> wrote:
> On Thu, Jun 28, 2012 at 11:12 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Thu, 2012-06-28 at 10:51 -0700, Dave Taht wrote:
>>
>>> clever idea. A problem is there are other forms of network traffic on
>>> a link, and this is punishing a single tcp
> Dave: it won't just punish a single TCP, all protocols that react to
> XMIT_CN will share similar fate.

What protocols in the kernel do and don't? was the crux of this question.

I'm not objecting to the idea, it's clever, as I said. I'm thinking I'll
apply it to cerowrt's next build and see what happens, if this
will apply against 3.3. Or maybe the ns3 model. Or both.

As a segue...

I am still fond of gaining an ability to also throw congestion notifications
(who, where, and why) up to userspace via netlink. Having a viable
metric for things like mesh routing and multipath has been a age-old
quest of mine, and getting that data out could be a start towards it.

Another example is something that lives in userspace like uTP.

>
>>> stream that may not be the source of the problem in the first place,
>>> and basically putting it into double jeopardy.
>>>
>>
>> Why ? In fact this patch helps the tcp session being signaled (as it
>> will be anyway) at enqueue time, instead of having to react to packet
>> losses indications given (after RTT) by receiver.

I tend to think more in terms of routing packets rather than originating
them.

>> Avoiding losses help receiver to consume data without having to buffer
>> it into Out Of Order queue.
>>
>> So its not jeopardy, but early congestion notification without RTT
>> delay.

Well there is the birthday problem and hashing to the same queues.
the sims we have do some interesting things on new streams in slow
start sometimes. But don't have enough of a grip on it to talk about it
yet...

>>
>> NET_XMIT_CN is a soft signal, far more disruptive than a DROP.
> I don't read here: you mean far "less" disruptive in terms of performance?

I figured eric meant less.

>>
>>> I am curious as to how often an enqueue is actually dropping in the
>>> codel/fq_codel case, the hope was that there would be plenty of
>>> headroom under far more circumstances on this qdisc.
>>>
>>
>> "tc -s qdisc show dev eth0" can show you all the counts.
>>
>> We never drop a packet at enqueue time, unless you hit the emergency
>> limit (10240 packets for fq_codel). When you reach this limit, you are
>> under trouble.

In my own tests with artificial streams that set but don't respect ecn,
I hit limit easily. But that's the subject of another thread on the codel list,
and a different problem entirely. I just am not testing at > 1GigE
speeds and I know you guys are. I worry about behaviors above 10GigE,
and here too, the NET_XMIT_CN idea seems like a good idea.

so, applause. new idea on top of fair queue-ing + codel. cool. So many
hard problems seem to be getting tractable!

-- 
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-6 is out
with fq_codel!"

^ permalink raw reply

* [PATCH net-next 4/4] tcp: plug dst leak in tcp_v6_conn_request()
From: Neal Cardwell @ 2012-06-28 22:34 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eric Dumazet, Neal Cardwell
In-Reply-To: <1340922861-11684-1-git-send-email-ncardwell@google.com>

The code in tcp_v6_conn_request() was implicitly assuming that
tcp_v6_send_synack() would take care of dst_release(), much as
tcp_v4_send_synack() already does. This resulted in
tcp_v6_conn_request() leaking a dst if sysctl_tw_recycle is enabled.

This commit restructures tcp_v6_send_synack() so that it accepts a dst
pointer and takes care of releasing the dst that is passed in, to plug
the leak and avoid future surprises by bringing the IPv6 behavior in
line with the IPv4 side.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
 net/ipv6/tcp_ipv6.c |   19 ++++++++++---------
 1 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index c221086..5edabf6 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -477,7 +477,8 @@ out:
 }
 
 
-static int tcp_v6_send_synack(struct sock *sk,
+static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
+			      struct flowi6 *fl6,
 			      struct request_sock *req,
 			      struct request_values *rvp,
 			      u16 queue_mapping)
@@ -486,12 +487,10 @@ static int tcp_v6_send_synack(struct sock *sk,
 	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct sk_buff * skb;
 	struct ipv6_txoptions *opt = np->opt;
-	struct flowi6 fl6;
-	struct dst_entry *dst;
 	int err = -ENOMEM;
 
-	dst = inet6_csk_route_req(sk, &fl6, req);
-	if (!dst)
+	/* First, grab a route. */
+	if (!dst && (dst = inet6_csk_route_req(sk, fl6, req)) == NULL)
 		goto done;
 
 	skb = tcp_make_synack(sk, dst, req, rvp);
@@ -499,9 +498,9 @@ static int tcp_v6_send_synack(struct sock *sk,
 	if (skb) {
 		__tcp_v6_send_check(skb, &treq->loc_addr, &treq->rmt_addr);
 
-		fl6.daddr = treq->rmt_addr;
+		fl6->daddr = treq->rmt_addr;
 		skb_set_queue_mapping(skb, queue_mapping);
-		err = ip6_xmit(sk, skb, &fl6, opt, np->tclass);
+		err = ip6_xmit(sk, skb, fl6, opt, np->tclass);
 		err = net_xmit_eval(err);
 	}
 
@@ -514,8 +513,10 @@ done:
 static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req,
 			     struct request_values *rvp)
 {
+	struct flowi6 fl6;
+
 	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
-	return tcp_v6_send_synack(sk, req, rvp, 0);
+	return tcp_v6_send_synack(sk, NULL, &fl6, req, rvp, 0);
 }
 
 static void tcp_v6_reqsk_destructor(struct request_sock *req)
@@ -1200,7 +1201,7 @@ have_isn:
 
 	security_inet_conn_request(sk, skb, req);
 
-	if (tcp_v6_send_synack(sk, req,
+	if (tcp_v6_send_synack(sk, dst, &fl6, req,
 			       (struct request_values *)&tmp_ext,
 			       skb_get_queue_mapping(skb)) ||
 	    want_cookie)
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH net-next 1/4] tcp: fix inet6_csk_route_req() for link-local addresses
From: Neal Cardwell @ 2012-06-28 22:34 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eric Dumazet, Neal Cardwell

Fix inet6_csk_route_req() to use as the flowi6_oif the treq->iif,
which is correctly fixed up in tcp_v6_conn_request() to handle the
case of link-local addresses. This brings it in line with the
tcp_v6_send_synack() code, which is already correctly using the
treq->iif in this way.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
 net/ipv6/inet6_connection_sock.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index e6cee52..e23d354 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -68,7 +68,7 @@ struct dst_entry *inet6_csk_route_req(struct sock *sk,
 	fl6.daddr = treq->rmt_addr;
 	final_p = fl6_update_dst(&fl6, np->opt, &final);
 	fl6.saddr = treq->loc_addr;
-	fl6.flowi6_oif = sk->sk_bound_dev_if;
+	fl6.flowi6_oif = treq->iif;
 	fl6.flowi6_mark = sk->sk_mark;
 	fl6.fl6_dport = inet_rsk(req)->rmt_port;
 	fl6.fl6_sport = inet_rsk(req)->loc_port;
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH net-next 3/4] tcp: use inet6_csk_route_req() in tcp_v6_send_synack()
From: Neal Cardwell @ 2012-06-28 22:34 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eric Dumazet, Neal Cardwell
In-Reply-To: <1340922861-11684-1-git-send-email-ncardwell@google.com>

With the recent change (earlier in this patch series) to set
flowi6_oif to treq->iif in inet6_csk_route_req(), the dst lookup in
these two functions is now identical, so tcp_v6_send_synack() can now
just call inet6_csk_route_req(), to reduce code duplication and keep
things closer to the IPv4 side, which is structured this way.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
 net/ipv6/tcp_ipv6.c |   29 ++++++-----------------------
 1 files changed, 6 insertions(+), 23 deletions(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3b79e94..c221086 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -485,34 +485,17 @@ static int tcp_v6_send_synack(struct sock *sk,
 	struct inet6_request_sock *treq = inet6_rsk(req);
 	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct sk_buff * skb;
-	struct ipv6_txoptions *opt = NULL;
-	struct in6_addr * final_p, final;
+	struct ipv6_txoptions *opt = np->opt;
 	struct flowi6 fl6;
 	struct dst_entry *dst;
-	int err;
-
-	memset(&fl6, 0, sizeof(fl6));
-	fl6.flowi6_proto = IPPROTO_TCP;
-	fl6.daddr = treq->rmt_addr;
-	fl6.saddr = treq->loc_addr;
-	fl6.flowlabel = 0;
-	fl6.flowi6_oif = treq->iif;
-	fl6.flowi6_mark = sk->sk_mark;
-	fl6.fl6_dport = inet_rsk(req)->rmt_port;
-	fl6.fl6_sport = inet_rsk(req)->loc_port;
-	security_req_classify_flow(req, flowi6_to_flowi(&fl6));
-
-	opt = np->opt;
-	final_p = fl6_update_dst(&fl6, opt, &final);
+	int err = -ENOMEM;
 
-	dst = ip6_dst_lookup_flow(sk, &fl6, final_p, false);
-	if (IS_ERR(dst)) {
-		err = PTR_ERR(dst);
-		dst = NULL;
+	dst = inet6_csk_route_req(sk, &fl6, req);
+	if (!dst)
 		goto done;
-	}
+
 	skb = tcp_make_synack(sk, dst, req, rvp);
-	err = -ENOMEM;
+
 	if (skb) {
 		__tcp_v6_send_check(skb, &treq->loc_addr, &treq->rmt_addr);
 
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH net-next 2/4] tcp: pass fl6 to inet6_csk_route_req()
From: Neal Cardwell @ 2012-06-28 22:34 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eric Dumazet, Neal Cardwell
In-Reply-To: <1340922861-11684-1-git-send-email-ncardwell@google.com>

This commit changes inet_csk_route_req() so that it uses a pointer to
a struct flowi6, rather than allocating its own on the stack. This
brings its behavior in line with its IPv4 cousin,
inet_csk_route_req(), and allows a follow-on patch to fix a dst leak.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
 include/net/inet6_connection_sock.h |    1 +
 net/ipv6/inet6_connection_sock.c    |   24 ++++++++++++------------
 net/ipv6/tcp_ipv6.c                 |    9 ++++++---
 3 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/include/net/inet6_connection_sock.h b/include/net/inet6_connection_sock.h
index 1866a67..df2a857 100644
--- a/include/net/inet6_connection_sock.h
+++ b/include/net/inet6_connection_sock.h
@@ -26,6 +26,7 @@ extern int inet6_csk_bind_conflict(const struct sock *sk,
 				   const struct inet_bind_bucket *tb, bool relax);
 
 extern struct dst_entry* inet6_csk_route_req(struct sock *sk,
+					     struct flowi6 *fl6,
 					     const struct request_sock *req);
 
 extern struct request_sock *inet6_csk_search_req(const struct sock *sk,
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index e23d354..bceb144 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -55,26 +55,26 @@ int inet6_csk_bind_conflict(const struct sock *sk,
 EXPORT_SYMBOL_GPL(inet6_csk_bind_conflict);
 
 struct dst_entry *inet6_csk_route_req(struct sock *sk,
+				      struct flowi6 *fl6,
 				      const struct request_sock *req)
 {
 	struct inet6_request_sock *treq = inet6_rsk(req);
 	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct in6_addr *final_p, final;
 	struct dst_entry *dst;
-	struct flowi6 fl6;
-
-	memset(&fl6, 0, sizeof(fl6));
-	fl6.flowi6_proto = IPPROTO_TCP;
-	fl6.daddr = treq->rmt_addr;
-	final_p = fl6_update_dst(&fl6, np->opt, &final);
-	fl6.saddr = treq->loc_addr;
-	fl6.flowi6_oif = treq->iif;
-	fl6.flowi6_mark = sk->sk_mark;
-	fl6.fl6_dport = inet_rsk(req)->rmt_port;
-	fl6.fl6_sport = inet_rsk(req)->loc_port;
-	security_req_classify_flow(req, flowi6_to_flowi(&fl6));
 
-	dst = ip6_dst_lookup_flow(sk, &fl6, final_p, false);
+	memset(fl6, 0, sizeof(*fl6));
+	fl6->flowi6_proto = IPPROTO_TCP;
+	fl6->daddr = treq->rmt_addr;
+	final_p = fl6_update_dst(fl6, np->opt, &final);
+	fl6->saddr = treq->loc_addr;
+	fl6->flowi6_oif = treq->iif;
+	fl6->flowi6_mark = sk->sk_mark;
+	fl6->fl6_dport = inet_rsk(req)->rmt_port;
+	fl6->fl6_sport = inet_rsk(req)->loc_port;
+	security_req_classify_flow(req, flowi6_to_flowi(fl6));
+
+	dst = ip6_dst_lookup_flow(sk, fl6, final_p, false);
 	if (IS_ERR(dst))
 		return NULL;
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 26a8862..3b79e94 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -477,7 +477,8 @@ out:
 }
 
 
-static int tcp_v6_send_synack(struct sock *sk, struct request_sock *req,
+static int tcp_v6_send_synack(struct sock *sk,
+			      struct request_sock *req,
 			      struct request_values *rvp,
 			      u16 queue_mapping)
 {
@@ -1058,6 +1059,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 isn = TCP_SKB_CB(skb)->when;
 	struct dst_entry *dst = NULL;
+	struct flowi6 fl6;
 	bool want_cookie = false;
 
 	if (skb->protocol == htons(ETH_P_IP))
@@ -1177,7 +1179,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 		 */
 		if (tmp_opt.saw_tstamp &&
 		    tcp_death_row.sysctl_tw_recycle &&
-		    (dst = inet6_csk_route_req(sk, req)) != NULL &&
+		    (dst = inet6_csk_route_req(sk, &fl6, req)) != NULL &&
 		    (peer = rt6_get_peer((struct rt6_info *)dst)) != NULL &&
 		    ipv6_addr_equal((struct in6_addr *)peer->daddr.addr.a6,
 				    &treq->rmt_addr)) {
@@ -1246,6 +1248,7 @@ static struct sock * tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 #ifdef CONFIG_TCP_MD5SIG
 	struct tcp_md5sig_key *key;
 #endif
+	struct flowi6 fl6;
 
 	if (skb->protocol == htons(ETH_P_IP)) {
 		/*
@@ -1308,7 +1311,7 @@ static struct sock * tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 		goto out_overflow;
 
 	if (!dst) {
-		dst = inet6_csk_route_req(sk, req);
+		dst = inet6_csk_route_req(sk, &fl6, req);
 		if (!dst)
 			goto out;
 	}
-- 
1.7.4.1

^ permalink raw reply related

* Re: 3.4.x regression: rtl8169: frequent resets
From: Nix @ 2012-06-28 23:42 UTC (permalink / raw)
  To: Stefan Lippers-Hollmann; +Cc: Francois Romieu, netdev, linux-kernel
In-Reply-To: <201206290131.49150.s.L-H@gmx.de>

On 29 Jun 2012, Stefan Lippers-Hollmann uttered the following:
> I received the same oops from a 3.4.4 user with these onboard network 
> cards:

Pedant point: it's not an oops, just a slowpath warning.

>                                                        no oops in 21
> hours uptime so far (while it usually shows up within about an hour).

Interesting. I wonder why it took two days to show up for me. It's not
like I'm a very light network user or anything, there are easily 300
packets per second in each direction on this interface every second of
the day (and often much more).

-- 
NULL && (void)

^ permalink raw reply

* Re: 3.4.x regression: rtl8169: frequent resets
From: Stefan Lippers-Hollmann @ 2012-06-28 23:31 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Nix, netdev, linux-kernel
In-Reply-To: <20120628193026.GA31627@electric-eye.fr.zoreil.com>

[-- Attachment #1: Type: Text/Plain, Size: 1386 bytes --]

Hi

On Thursday 28 June 2012, Francois Romieu wrote:
> Nix <nix@esperi.org.uk> :
> > I recently upgraded from 3.3.x to 3.4.4, and am now experiencing
> > networking problems with my desktop box's r8169 card. The symptoms are
> > that all traffic ceases for five to ten seconds, then the card appears
> > to reset and everything is back to normal -- until it happens again. It
> > can happen quite a lot:
> 
> Can you try and revert 036dafa28da1e2565a8529de2ae663c37b7a0060 ?
> 
> I would welcome a complete dmesg including the XID line from the
> r8169 driver.

I received the same oops from a 3.4.4 user with these onboard network 
cards:

r8169 0000:04:00.0: eth0: RTL8168d/8111d at 0xffffc90000c72000, 00:24:1d:72:7c:75, XID 081000c0 IRQ 44
r8169 0000:05:00.0: eth1: RTL8168d/8111d at 0xffffc90000c70000, 00:24:1d:72:7c:77, XID 081000c0 IRQ 45

Reverting 036dafa28da1e2565a8529de2ae663c37b7a0060 (Nix, trivial 
backport to 3.4.4 attached) did improve the situation, no oops in 21
hours uptime so far (while it usually shows up within about an hour).
Unfortunately his oops report was cut brief, so I've asked him to try 
reproducing it with an unpatched kernel again, to collect a full dmesg
(the test is still going on, past the one hour mark, but the oops 
hasn't triggered yet). I'll report back, as soon as I get confirmation 
and a full dmesg.

Regards
	Stefan Lippers-Hollmann

[-- Attachment #2: revert-r8169-add-byte-queue-limit-support.patch --]
[-- Type: text/x-patch, Size: 1832 bytes --]

revert r8169: add byte queue limit support.

--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5000,7 +5000,6 @@ static void rtl8169_tx_clear(struct rtl8
 {
 	rtl8169_tx_clear_range(tp, tp->dirty_tx, NUM_TX_DESC);
 	tp->cur_tx = tp->dirty_tx = 0;
-	netdev_reset_queue(tp->dev);
 }
 
 static void rtl_reset_work(struct rtl8169_private *tp)
@@ -5155,8 +5154,6 @@ static netdev_tx_t rtl8169_start_xmit(st
 
 	txd->opts2 = cpu_to_le32(opts[1]);
 
-	netdev_sent_queue(dev, skb->len);
-
 	skb_tx_timestamp(skb);
 
 	wmb();
@@ -5253,16 +5250,9 @@ static void rtl8169_pcierr_interrupt(str
 	rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
 }
 
-struct rtl_txc {
-	int packets;
-	int bytes;
-};
-
 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 {
-	struct rtl8169_stats *tx_stats = &tp->tx_stats;
 	unsigned int dirty_tx, tx_left;
-	struct rtl_txc txc = { 0, 0 };
 
 	dirty_tx = tp->dirty_tx;
 	smp_rmb();
@@ -5281,24 +5271,17 @@ static void rtl_tx(struct net_device *de
 		rtl8169_unmap_tx_skb(&tp->pci_dev->dev, tx_skb,
 				     tp->TxDescArray + entry);
 		if (status & LastFrag) {
-			struct sk_buff *skb = tx_skb->skb;
-
-			txc.packets++;
-			txc.bytes += skb->len;
-			dev_kfree_skb(skb);
+			u64_stats_update_begin(&tp->tx_stats.syncp);
+			tp->tx_stats.packets++;
+			tp->tx_stats.bytes += tx_skb->skb->len;
+			u64_stats_update_end(&tp->tx_stats.syncp);
+			dev_kfree_skb(tx_skb->skb);
 			tx_skb->skb = NULL;
 		}
 		dirty_tx++;
 		tx_left--;
 	}
 
-	u64_stats_update_begin(&tx_stats->syncp);
-	tx_stats->packets += txc.packets;
-	tx_stats->bytes += txc.bytes;
-	u64_stats_update_end(&tx_stats->syncp);
-
-	netdev_completed_queue(dev, txc.packets, txc.bytes);
-
 	if (tp->dirty_tx != dirty_tx) {
 		tp->dirty_tx = dirty_tx;
 		/* Sync with rtl8169_start_xmit:

^ permalink raw reply

* Re: [PATCH] ipv4: Create and use fib_compute_spec_dst() helper.
From: David Miller @ 2012-06-28 23:27 UTC (permalink / raw)
  To: ja; +Cc: netdev
In-Reply-To: <alpine.LFD.2.00.1206282309340.1722@ja.ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Fri, 29 Jun 2012 02:16:25 +0300 (EEST)

> 	Hello,

I really appreciate your detailed analysis, I'll look deeply into
all of your feedback.

Thanks!

^ permalink raw reply

* Re: 3.4.x regression: rtl8169: frequent resets
From: Nix @ 2012-06-28 23:10 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev, linux-kernel
In-Reply-To: <20120628222325.GA1579@electric-eye.fr.zoreil.com>

On 28 Jun 2012, Francois Romieu stated:

> Nix <nix@esperi.org.uk> :
>> Francois Romieu <romieu@fr.zoreil.com> :
>> > Can you try and revert 036dafa28da1e2565a8529de2ae663c37b7a0060 ?
>> 
>> I can try, but there's been a *lot* of code motion since then, 'git
>> revert' fails hilariously (trying to patch obviously the wrong places):
>> I'll have to do it by hand.
>
> There is a single line reject in rtl8169_start_xmit. Other than that it
> should patch -p1 -R fine.

... and indeed it does. Weird, why does git revert fail so badly when
patch and git apply are both happy?!

I'll reboot into this kernel tomorrow, and report back in a few days (or
sooner if it goes wrong).

>> [    1.341389] r8169 0000:06:00.0: eth0: jumbo features [frames: 6128 bytes, tx checksumming: ko]
>
> This chipset is not supposed to be pushed beyond 6128 bytes.

Interesting. It's always worked flawlessly at 7200 for me before, until,
uh, last year when it stopped working and I never noticed (in fact
a 7200-byte MTU was how the machine was shipped to me :) ).

I guess I'll knock it down to 6128 then, less than 1000 bytes isn't
going to ruin performance by any means...

... aand that works. thanks! (Let's see if the link stuttering continues. I
expect it will, though it's been three hours since the last stutter...)

-- 
NULL && (void)

^ permalink raw reply

* Re: [PATCH] ipv4: Create and use fib_compute_spec_dst() helper.
From: Julian Anastasov @ 2012-06-28 23:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120628.035946.1026666621448892198.davem@davemloft.net>


	Hello,

On Thu, 28 Jun 2012, David Miller wrote:

> +__be32 fib_compute_spec_dst(struct sk_buff *skb)
> +{
> +	struct net_device *dev = skb->dev;
> +	struct in_device *in_dev;
> +	struct fib_result res;
> +	struct flowi4 fl4;
> +	struct net *net;
> +

	This is bad for looped m/b-cast: ip_mc_output calls
ip_dev_loopback_xmit where pkt_type is set to PACKET_LOOPBACK.
May be we have to check skb_dst as below.

	Also, for rt_spec_dst __mkroute_output used fl4->daddr for
looped unicast traffic and fl4->saddr for
RTCF_BROADCAST | RTCF_MULTICAST. Can we check rt_flags instead?
It will catch some pkt_type values such as PACKET_LOOPBACK.
For example, check similar to icmp_send():

	struct rtable *rt = skb_rtable(skb);

	/* input or output route destined to local stack?
	 * daddr can be mcast/bcast
	 */
	if (rt && rt->rt_flags & RTCF_LOCAL) {
		if (rt_is_input_route(rt)) {
			if (!(rt->rt_flags & (RTCF_BROADCAST |
					      RTCF_MULTICAST)))
				return ip_hdr(skb)->daddr;
			/* select saddr */
		} else
			return (rt->rt_flags & (RTCF_BROADCAST | 
						RTCF_MULTICAST)) ?
				ip_hdr(skb)->saddr : ip_hdr(skb)->daddr;
	}

	Isn't pkt_type just a hint from sender? We can not use
it for such places. What if hackers use unicast mac (PACKET_HOST)
combined with multicast daddr? I'm not sure that pkt_type
is a strong validator for layer 3 addresses.

> +	if (skb->pkt_type != PACKET_BROADCAST &&
> +	    skb->pkt_type != PACKET_MULTICAST)
> +		return ip_hdr(skb)->daddr;

	So, above check should be removed if we check rt_flags.

> +
> +	in_dev = __in_dev_get_rcu(dev);
> +	BUG_ON(!in_dev);
> +	fl4.flowi4_oif = 0;
> +	fl4.flowi4_iif = 0;

	It is not clear to me why ip_route_input_mc and
ip_route_input_slow (brd_input) call fib_validate_source()
with arg oif=0. How would fib_rule_match match flowi_iif
for such traffic then?

	May be iif should be always set just like in
ip_route_output_slow because now we do output lookup to
unicast target?:

	fl4.flowi4_iif = net->loopback_dev->ifindex;

> +	fl4.daddr = ip_hdr(skb)->saddr;
> +	fl4.saddr = ip_hdr(skb)->daddr;

	Here only 0 is allowed for m/b-cast daddr, we are not
going to use ip_hdr(skb)->daddr, so no need to provide it:

	fl4.saddr = 0;

	fib_validate_source was called with dst=0 for bcast/mcast.

> +	fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
> +	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
> +	fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0;
> +
> +	net = dev_net(dev);

	Now net will be needed above for iif.

> +	if (!fib_lookup(net, &fl4, &res))
> +		return FIB_RES_PREFSRC(net, res);
> +	else

	What about providing ip_hdr(skb)->saddr as
2nd argument here (it is validated by input routing
to be unicast or 0.0.0.0):

> +		return inet_select_addr(dev, 0, RT_SCOPE_UNIVERSE);

	By this way we will prefer local address from the
same subnet as iph->saddr. Also, we should not call
fib_lookup if ipv4_is_zeronet(ip_hdr(skb)->saddr), we
should use RT_SCOPE_LINK for inet_select_addr in this case.
By this way we will prefer addresses with scope link for
target 0.0.0.0. There is such logic that uses RT_SCOPE_LINK in
ip_route_input_mc() and ip_route_input_slow().

>  int ip_options_compile(struct net *net,
>  		       struct ip_options *opt, struct sk_buff *skb)
>  {
> -	int l;
> -	unsigned char *iph;
> -	unsigned char *optptr;
> -	int optlen;
> +	__be32 spec_dst = (__force __be32) 0;
>  	unsigned char *pp_ptr = NULL;
> -	struct rtable *rt = NULL;
> +	unsigned char *optptr;
> +	unsigned char *iph;
> +	int optlen, l;
>  
>  	if (skb != NULL) {
> -		rt = skb_rtable(skb);
> +		spec_dst = fib_compute_spec_dst(skb);

	If we use rt_flags above I'm not sure whether
ip_options_compile is called with valid rt_flags from the
bridging code.

Regards
--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH net-next] fq_codel: report congestion notification at enqueue time
From: Yuchung Cheng @ 2012-06-28 22:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Dave Taht, David Miller, netdev, codel, Tom Herbert, Matt Mathis,
	Nandita Dukkipati, Neal Cardwell
In-Reply-To: <1340907151.13187.169.camel@edumazet-glaptop>

On Thu, Jun 28, 2012 at 11:12 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-06-28 at 10:51 -0700, Dave Taht wrote:
>
>> clever idea. A problem is there are other forms of network traffic on
>> a link, and this is punishing a single tcp
Dave: it won't just punish a single TCP, all protocols that react to
XMIT_CN will share similar fate.

>> stream that may not be the source of the problem in the first place,
>> and basically putting it into double jeopardy.
>>
>
> Why ? In fact this patch helps the tcp session being signaled (as it
> will be anyway) at enqueue time, instead of having to react to packet
> losses indications given (after RTT) by receiver.
>
> Avoiding losses help receiver to consume data without having to buffer
> it into Out Of Order queue.
>
> So its not jeopardy, but early congestion notification without RTT
> delay.
>
> NET_XMIT_CN is a soft signal, far more disruptive than a DROP.
I don't read here: you mean far "less" disruptive in terms of performance?

>
>> I am curious as to how often an enqueue is actually dropping in the
>> codel/fq_codel case, the hope was that there would be plenty of
>> headroom under far more circumstances on this qdisc.
>>
>
> "tc -s qdisc show dev eth0" can show you all the counts.
>
> We never drop a packet at enqueue time, unless you hit the emergency
> limit (10240 packets for fq_codel). When you reach this limit, you are
> under trouble.

Another nice feature is to resume TCP xmit operation at when the skb
hits the wire.
My hutch is that Eric is working on this too.
>
>
>

^ permalink raw reply

* Re: 3.4.x regression: rtl8169: frequent resets
From: Francois Romieu @ 2012-06-28 22:23 UTC (permalink / raw)
  To: Nix; +Cc: netdev, linux-kernel
In-Reply-To: <87fw9fp2e1.fsf@spindle.srvr.nix>

Nix <nix@esperi.org.uk> :
> Francois Romieu <romieu@fr.zoreil.com> :
> > Can you try and revert 036dafa28da1e2565a8529de2ae663c37b7a0060 ?
> 
> I can try, but there's been a *lot* of code motion since then, 'git
> revert' fails hilariously (trying to patch obviously the wrong places):
> I'll have to do it by hand.

There is a single line reject in rtl8169_start_xmit. Other than that it
should patch -p1 -R fine.

[...]
> I note that at some time (after the first reset?), my MTU either flipped
> back to 1500, from its initial jumbo default, or simply refused to go
> jumbo in the first place. I bring it up like so:
> 
> ip link set fastnet up multicast on txqueuelen 100 mtu 7200
> ip addr add local 192.168.16.20/24 broadcast 192.168.16.255 dev fastnet
> 
> but its MTU is now shown as 1500 :( so at some point either jumbo frames
> have stopped working or the reset is flipping them off.
[...]
> [    1.341389] r8169 0000:06:00.0: eth0: jumbo features [frames: 6128 bytes, tx checksumming: ko]

This chipset is not supposed to be pushed beyond 6128 bytes.

-- 
Ueimor

^ permalink raw reply

* Re: [patch net-next 1/4] net: introduce new priv_flag indicating iface capable of change mac when running
From: Ben Hutchings @ 2012-06-28 22:17 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, netdev, shimoda.hiroaki, virtualization, danny.kukawka,
	edumazet, davem
In-Reply-To: <1340892639-1111-2-git-send-email-jpirko@redhat.com>

On Thu, 2012-06-28 at 16:10 +0200, Jiri Pirko wrote:
> Introduce IFF_LIFE_ADDR_CHANGE priv_flag and use it to disable
> netif_running() check in eth_mac_addr()
>
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> ---
>  include/linux/if.h |    2 ++
>  net/ethernet/eth.c |    2 +-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/if.h b/include/linux/if.h
> index f995c66..fd9ee7c 100644
> --- a/include/linux/if.h
> +++ b/include/linux/if.h
> @@ -81,6 +81,8 @@
>  #define IFF_UNICAST_FLT	0x20000		/* Supports unicast filtering	*/
>  #define IFF_TEAM_PORT	0x40000		/* device used as team port */
>  #define IFF_SUPP_NOFCS	0x80000		/* device supports sending custom FCS */
> +#define IFF_LIFE_ADDR_CHANGE 0x100000	/* device supports hardware address
> +					 * change when it's running */
[...]

Any device that has IFF_UNICAST_FLT can update the unicast MAC filter
while it's running; doesn't that go hand-in-hand with being able to
handle changes to the primary MAC address?  Is the new flag really
necessary at all?

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: 3.4.x regression: rtl8169: frequent resets
From: Nix @ 2012-06-28 21:12 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev, linux-kernel
In-Reply-To: <20120628193026.GA31627@electric-eye.fr.zoreil.com>

[-- Attachment #1: Type: text/plain, Size: 2495 bytes --]

On 28 Jun 2012, Francois Romieu stated:

> Nix <nix@esperi.org.uk> :
>> I recently upgraded from 3.3.x to 3.4.4, and am now experiencing
>> networking problems with my desktop box's r8169 card. The symptoms are
>> that all traffic ceases for five to ten seconds, then the card appears
>> to reset and everything is back to normal -- until it happens again. It
>> can happen quite a lot:
>
> Can you try and revert 036dafa28da1e2565a8529de2ae663c37b7a0060 ?

I can try, but there's been a *lot* of code motion since then, 'git
revert' fails hilariously (trying to patch obviously the wrong places):
I'll have to do it by hand.

I'll try that tomorrow. (But, as before, it might be several days before
we know anything one way or the other. Assuming I can revert it without
fouling something else up.)

I'm not using BQL (yet, anyway).

I note that at some time (after the first reset?), my MTU either flipped
back to 1500, from its initial jumbo default, or simply refused to go
jumbo in the first place. I bring it up like so:

ip link set fastnet up multicast on txqueuelen 100 mtu 7200
ip addr add local 192.168.16.20/24 broadcast 192.168.16.255 dev fastnet

but its MTU is now shown as 1500 :( so at some point either jumbo frames
have stopped working or the reset is flipping them off. (It used to
work, with a warning yelling about how terribly dangerous jumbo frames
were because any attacker on the local subnet could subvert my machine.
Any attacker on the local subnet will be sitting in my lap and/or will
have root on the NFS server from which this machine is getting all its
data, so I don't care one jot about that. This may be because rx
checksumming is turned on by default, and as of last year that forces
jumbo frames off: I've turned that off and will see if jumbo frames
start working next time I bring the interface up. I *thought* my local
network was awful slow lately...)

> I would welcome a complete dmesg including the XID line from the
> r8169 driver.

Now that I can do :) it's long, so here's the relevant bit: complete
gzipped dmesg attached.

[    1.338060] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[    1.339700] r8169 0000:06:00.0: irq 70 for MSI/MSI-X
[    1.339793] r8169 0000:06:00.0: eth0: RTL8168c/8111c at 0xffffc90000048000, 00:24:8c:0f:64:18, XID 1c4000c0 IRQ 70
[    1.341389] r8169 0000:06:00.0: eth0: jumbo features [frames: 6128 bytes, tx checksumming: ko]

(the interface is renamed to 'fastnet' by udev a little later in the
boot process.)


[-- Attachment #2: gzipped complete dmesg --]
[-- Type: application/octet-stream, Size: 18208 bytes --]

^ permalink raw reply

* Re: [patch net-next 1/4] net: introduce new priv_flag indicating iface capable of change mac when running
From: Michael S. Tsirkin @ 2012-06-28 20:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jiri Pirko, netdev, shimoda.hiroaki, virtualization,
	danny.kukawka, edumazet, davem
In-Reply-To: <1340897092.13187.110.camel@edumazet-glaptop>

On Thu, Jun 28, 2012 at 05:24:52PM +0200, Eric Dumazet wrote:
> On Thu, 2012-06-28 at 16:10 +0200, Jiri Pirko wrote:
> > Introduce IFF_LIFE_ADDR_CHANGE priv_flag and use it to disable
> > netif_running() check in eth_mac_addr()
> > 
> > Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> > ---
> >  include/linux/if.h |    2 ++
> >  net/ethernet/eth.c |    2 +-
> >  2 files changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/if.h b/include/linux/if.h
> > index f995c66..fd9ee7c 100644
> > --- a/include/linux/if.h
> > +++ b/include/linux/if.h
> > @@ -81,6 +81,8 @@
> >  #define IFF_UNICAST_FLT	0x20000		/* Supports unicast filtering	*/
> >  #define IFF_TEAM_PORT	0x40000		/* device used as team port */
> >  #define IFF_SUPP_NOFCS	0x80000		/* device supports sending custom FCS */
> > +#define IFF_LIFE_ADDR_CHANGE 0x100000	/* device supports hardware address
> > +					 * change when it's running */
> >  
> > 
> >  #define IF_GET_IFACE	0x0001		/* for querying only */
> > diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
> > index 36e5880..8f8ded4 100644
> > --- a/net/ethernet/eth.c
> > +++ b/net/ethernet/eth.c
> > @@ -283,7 +283,7 @@ int eth_mac_addr(struct net_device *dev, void *p)
> >  {
> >  	struct sockaddr *addr = p;
> >  
> > -	if (netif_running(dev))
> > +	if (!(dev->priv_flags & IFF_LIFE_ADDR_CHANGE) && netif_running(dev))
> >  		return -EBUSY;
> >  	if (!is_valid_ether_addr(addr->sa_data))
> >  		return -EADDRNOTAVAIL;
> 
> Since the memcpy() is not atomic, there is a small window where a reader
> could get a half-changed mac address. I guess its a detail.
> 

At least for virtio nothing changes - we had this bug forever.
How'd you fix this?

-- 
MST

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox