Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit


From: Guillaume Nault <g.nault@alphalink.fr>
To: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>,
	Linux Kernel Network Developers <netdev@vger.kernel.org>
Subject: Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
Date: Mon, 1 Aug 2016 22:54:25 +0200
Message-ID: <20160801205425.GC3031@alphalink.fr>
In-Reply-To: <e6cb92be01af6190350b9b3765bee6e3@nuclearcat.com>

On Thu, Jul 28, 2016 at 02:28:23PM +0300, Denys Fedoryshchenko wrote:
> On 2016-07-28 14:09, Guillaume Nault wrote:
> > On Tue, Jul 12, 2016 at 10:31:18AM -0700, Cong Wang wrote:
> > > On Mon, Jul 11, 2016 at 12:45 PM,  <nuclearcat@nuclearcat.com> wrote:
> > > > Hi
> > > >
> > > > On the latest kernel I noticed a kernel panic happening 1-2 times per
> > > > day. It is also happening on older kernels (at least 4.5.3).
> > > >
> > > ...
> > > >  [42916.426463] Call Trace:
> > > >  [42916.426658]  <IRQ>
> > > >
> > > >  [42916.426719]  [<ffffffff81843786>] skb_push+0x36/0x37
> > > >  [42916.427111]  [<ffffffffa00e8ce5>] ppp_start_xmit+0x10f/0x150 [ppp_generic]
> > > >  [42916.427314]  [<ffffffff81853467>] dev_hard_start_xmit+0x25a/0x2d3
> > > >  [42916.427516]  [<ffffffff818530f2>] ? validate_xmit_skb.isra.107.part.108+0x11d/0x238
> > > >  [42916.427858]  [<ffffffff8186dee3>] sch_direct_xmit+0x89/0x1b5
> > > >  [42916.428060]  [<ffffffff8186e142>] __qdisc_run+0x133/0x170
> > > >  [42916.428261]  [<ffffffff81850034>] net_tx_action+0xe3/0x148
> > > >  [42916.428462]  [<ffffffff810c401a>] __do_softirq+0xb9/0x1a9
> > > >  [42916.428663]  [<ffffffff810c4251>] irq_exit+0x37/0x7c
> > > >  [42916.428862]  [<ffffffff8102b8f7>] smp_apic_timer_interrupt+0x3d/0x48
> > > >  [42916.429063]  [<ffffffff818cb15c>] apic_timer_interrupt+0x7c/0x90
> > > 
> > > Interesting, we call skb_cow_head() before skb_push() in
> > > ppp_start_xmit(), so I have no idea why this could happen.
> > > 
> > The skb is corrupted: head is at ffff8800b0bf2800 while data is at
> > ffa00500b0bf284c.
> > 
> > Figuring out how this corruption happened is going to be hard without a
> > way to reproduce the problem.
> > 
> > Denys, can you confirm you're using a vanilla kernel?
> > Also I guess the ppp devices and tc settings are handled by accel-ppp.
> > If so, can you share more info about your setup (accel-ppp.conf, radius
> > attributes, iptables...) so that I can try to reproduce it on my
> > machines?
> 
> I have a slight modification from vanilla:
> 
> --- linux/net/sched/sch_htb.c	2016-06-08 01:23:53.000000000 +0000
> +++ linux-new/net/sched/sch_htb.c	2016-06-21 14:03:08.398486593 +0000
> @@ -1495,10 +1495,10 @@
>  				cl->common.classid);
>  			cl->quantum = 1000;
>  		}
> -		if (!hopt->quantum && cl->quantum > 200000) {
> +		if (!hopt->quantum && cl->quantum > 2000000) {
>  			pr_warn("HTB: quantum of class %X is big. Consider r2q change.\n",
>  				cl->common.classid);
> -			cl->quantum = 200000;
> +			cl->quantum = 2000000;
>  		}
>  		if (hopt->quantum)
>  			cl->quantum = hopt->quantum;
> 
> But I guess it should not be the reason for the crash (it is related to
> another system; without it I was unable to shape over 7Gbps, maybe with the
> latest kernel I will not need this patch).
>
Such a big quantum is probably going to add some stress on HTB
because of longer dequeues, but that shouldn't make the kernel panic.
Anyway, I'm certainly not an HTB expert, so I can't comment further.
BTW, if you really need values this big, what about setting ->quantum
directly and dropping this patch?
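
To illustrate why an explicit quantum makes the patch unnecessary, here is
a simplified userspace model of the selection logic visible in the quoted
hunk. It is a sketch, not the kernel code: the name toy_htb_quantum and its
parameters are hypothetical, and both the "rate / r2q" derivation of the
default and the exact lower-bound condition are assumptions that the hunk
does not show. The 1000 and 200000 values come from the hunk itself.

```c
#include <stdint.h>

/* Limits taken from the quoted hunk of net/sched/sch_htb.c. */
#define TOY_QUANTUM_MIN 1000u
#define TOY_QUANTUM_MAX 200000u

/*
 * Hypothetical userspace model of how an HTB leaf class picks its quantum.
 * rate_bytes_ps: class rate in bytes per second
 * r2q:           rate-to-quantum divisor (must be non-zero)
 * user_quantum:  explicit quantum, 0 when unset (stands in for hopt->quantum)
 */
static uint32_t toy_htb_quantum(uint64_t rate_bytes_ps, uint32_t r2q,
				uint32_t user_quantum)
{
	uint64_t quantum;

	/* An explicit quantum wins and is never clamped. */
	if (user_quantum)
		return user_quantum;

	/* Assumption: the default quantum is derived as rate / r2q. */
	quantum = rate_bytes_ps / r2q;
	if (quantum < TOY_QUANTUM_MIN)
		quantum = TOY_QUANTUM_MIN;	/* lower bound */
	if (quantum > TOY_QUANTUM_MAX)
		quantum = TOY_QUANTUM_MAX;	/* upper bound changed by the patch */
	return (uint32_t)quantum;
}
```

In this model, a 7Gbps class (875000000 bytes per second) with r2q = 10 is
clamped: toy_htb_quantum(875000000, 10, 0) gives 200000. Passing the
quantum explicitly, toy_htb_quantum(875000000, 10, 2000000), gives 2000000
without touching the limit.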

> I'm trying to find reproducible conditions for the crash, because right now
> it happens only on some servers in large networks (completely different
> ISPs, so I excluded a possible hardware fault of a specific server). It is
> a complex config: I have accel-ppp, plus my own "shaping daemon" that
> applies several shapers on ppp interfaces. The worst thing is that it
> happens only with live customers; I am unable to reproduce it in stress
> tests. Also, until the recent kernel I was getting different panic
> messages (but all related to ppp).
> 
In the logs I commented earlier, the skb is probably corrupted before
the ppp_start_xmit() call. The PPP module hasn't done anything at this
stage, unless the packet was forwarded from another PPP interface.
In short, corruption could have happened anywhere. So we really need to
narrow down the scope or get a way to reproduce the problem.
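
For illustration, the check that trips in skb_push() can be modelled in
userspace with the two pointer values from the log. This is a minimal
sketch under stated assumptions: struct toy_skb and the toy_* helpers are
hypothetical stand-ins, not the kernel's sk_buff API, and only the
"data must not fall below head" comparison is modelled.

```c
#include <stdint.h>

/* Toy stand-in for the sk_buff fields discussed here (not the kernel struct). */
struct toy_skb {
	uint64_t head;	/* start of the allocated buffer */
	uint64_t data;	/* start of the packet data, expected to be >= head */
	uint32_t len;	/* number of bytes of packet data */
};

/* Pointer values reported in the panic log earlier in the thread. */
#define LOG_HEAD 0xffff8800b0bf2800ULL
#define LOG_DATA 0xffa00500b0bf284cULL

/* Returns 1 when data has not moved below head, 0 otherwise. */
static int toy_skb_data_ok(const struct toy_skb *skb)
{
	return skb->data >= skb->head;
}

/*
 * Models the skb_push() bookkeeping: reserve len bytes in front of data.
 * Returns 0 on success, -1 where the kernel would end up in skb_panic().
 */
static int toy_skb_push(struct toy_skb *skb, uint32_t len)
{
	skb->data -= len;
	skb->len += len;
	return toy_skb_data_ok(skb) ? 0 : -1;
}
```

Filling a toy_skb with LOG_HEAD and LOG_DATA makes toy_skb_data_ok()
return 0: the reported data pointer compares below head, even though the
low 32 bits of the two values differ by only 0x4c.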

> I think at least one cause of the crashes was also fixed by "ppp: defer
> netns reference release for ppp channel" in 4.7.0 (maybe that's why I am
> getting fewer crashes recently).
> I also tried various kernel debug options that don't cause major
> performance degradation (lock checking, freed memory poisoning, etc.),
> without any luck yet.
> Would it be useful if I posted the panics that occurred at least
> twice? (I will post an example below, which I got recently.)
Do you mean that you have many more different panic traces?
