netdev.vger.kernel.org archive mirror
* Kernel 4.1.12 crash
@ 2015-11-20 13:58 Andrew
  2015-11-20 23:13 ` Alexander Duyck
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew @ 2015-11-20 13:58 UTC (permalink / raw)
  To: netdev

Hi all.

Today some BRASes running the 4.1.12 kernel crashed.

Here's crash traces: http://pastebin.com/p68hNS8R 
http://pastebin.com/36ieRAM2 http://pastebin.com/3BRTVEB6

The same hardware works fine on the 3.2 kernel; the problems appeared 
after the kernel upgrade.

What additional info is needed?


* Re: Kernel 4.1.12 crash
  2015-11-20 13:58 Kernel 4.1.12 crash Andrew
@ 2015-11-20 23:13 ` Alexander Duyck
  2015-11-21  8:16   ` Andrew
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Duyck @ 2015-11-20 23:13 UTC (permalink / raw)
  To: Andrew, netdev

On 11/20/2015 05:58 AM, Andrew wrote:
> Hi all.
>
> Today some BRASes on 4.1.12 kernel were crashed.
>
> Here's crash traces: http://pastebin.com/p68hNS8R
> http://pastebin.com/36ieRAM2 http://pastebin.com/3BRTVEB6
>
> On 3.2 kernel same hardware works OK, troubles were noticed after kernel
> upgrade.
>
> What additional info is needed?

Looking over the traces there seem to be two areas called out.

The first is the fib_trie resize BUG_ON that was triggered due to the 
parent and child not being associated.  I think that might be due to 
memory corruption as I cannot find any spots where we are resizing 
without correctly setting up the parent-child relationship of the nodes 
first.

The other spot that is showing up is ppp_shutdown_interface and its 
related path.  It looks like there are a couple of patches you could try 
back-porting to see if they resolve the issue.  If they do then perhaps 
they should be considered candidates for stable:

8cb775bc0a3 ("ppp: fix device unregistration upon netns deletion")
58a89ecaca5 ("ppp: fix lockdep splat in ppp_dev_uninit()")

- Alex


* Re: Kernel 4.1.12 crash
  2015-11-20 23:13 ` Alexander Duyck
@ 2015-11-21  8:16   ` Andrew
  2015-11-22  5:17     ` Alexander Duyck
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew @ 2015-11-21  8:16 UTC (permalink / raw)
  To: netdev

Memory corruption, if it is happening, IMHO shouldn't be hardware-related: 
almost all of these boxes, except the H61M-based box from the 1st log, have 
worked for a long time with uptimes of more than a year, and only the 
software was changed on them. The H61M-based box ran memtest86 for tens of 
hours without any errors. If the crashes were caused by hardware, the boxes 
should have crashed earlier.

On rare occasions I have seen 'zram decompression error' messages on 
different servers (in this case I got such a message on the H61M-based box).

Also, other people who use accel-ppp as BRAS software see various 
kernel panics/bugs/oopses on recent kernels.

I'll try to apply these patches, and I'll try to switch back to kernels 
that were stable on some boxes.

21.11.2015 01:13, Alexander Duyck wrote:
> On 11/20/2015 05:58 AM, Andrew wrote:
>> Hi all.
>>
>> Today some BRASes on 4.1.12 kernel were crashed.
>>
>> Here's crash traces: http://pastebin.com/p68hNS8R
>> http://pastebin.com/36ieRAM2 http://pastebin.com/3BRTVEB6
>>
>> On 3.2 kernel same hardware works OK, troubles were noticed after kernel
>> upgrade.
>>
>> What additional info is needed?
>
> Looking over the traces there seem to be two areas called out.
>
> The first is the fib_trie resize BUG_ON that was triggered due to the 
> parent and child not being associated.  I think that might be due to 
> memory corruption as I cannot find any spots where we are resizing 
> without correctly setting up the parent-child relationship of the 
> nodes first.
>
> The other spot that is showing up is ppp_shutdown_interface and it's 
> related path.  It looks like there are a couple of patches you could 
> try back-porting to see if it resolves the issue.  If they do then 
> perhaps they should be considered candidates for stable:
>
> 8cb775bc0a3 ("ppp: fix device unregistration upon netns deletion")
> 58a89ecaca5 ("ppp: fix lockdep splat in ppp_dev_uninit()")
>
> - Alex


* Re: Kernel 4.1.12 crash
  2015-11-21  8:16   ` Andrew
@ 2015-11-22  5:17     ` Alexander Duyck
  2015-11-22 10:45       ` Andrew
  2015-11-24 22:59       ` Andrew
  0 siblings, 2 replies; 14+ messages in thread
From: Alexander Duyck @ 2015-11-22  5:17 UTC (permalink / raw)
  To: Andrew, netdev

On 11/21/2015 12:16 AM, Andrew wrote:
> Memory corruption, if happens, IMHO shouldn't be a hardware-related - 
> almost all of these boxes, except H61M-based box from 1st log, works 
> for a long time with uptime more than year; and only software was 
> changed on it; H61M-based box runs memtest86 for a tens of hours w/o 
> any error. If it was caused by hardware - they should crash even earlier.

I wasn't saying it was hardware related.  My thought is that it could be 
some sort of use after free or double free type issue. Basically what 
you end up with is the memory getting corrupted by software that is 
accessing regions it shouldn't be.

> Rarely on different servers I saw 'zram decompression error' messages 
> (in this case I've got such message on H61M-based box).
>
> Also, other people that uses accel-ppp as BRAS software, have 
> different kernel panics/bugs/oopses on fresh kernels.
>
> I'll try to apply these patches, and I'll try to switch back to 
> kernels that were stable on some boxes.

If you could bisect this it would be useful.  Basically we just need to 
determine where in the git history these issues started popping up so 
that we can then narrow down on the root cause.

- Alex


* Re: Kernel 4.1.12 crash
  2015-11-22  5:17     ` Alexander Duyck
@ 2015-11-22 10:45       ` Andrew
  2015-11-24 22:59       ` Andrew
  1 sibling, 0 replies; 14+ messages in thread
From: Andrew @ 2015-11-22 10:45 UTC (permalink / raw)
  To: netdev

22.11.2015 07:17, Alexander Duyck wrote:
> On 11/21/2015 12:16 AM, Andrew wrote:
>> Memory corruption, if happens, IMHO shouldn't be a hardware-related -
>> almost all of these boxes, except H61M-based box from 1st log, works
>> for a long time with uptime more than year; and only software was
>> changed on it; H61M-based box runs memtest86 for a tens of hours w/o
>> any error. If it was caused by hardware - they should crash even
>> earlier.
>
> I wasn't saying it was hardware related.  My thought is that it could
> be some sort of use after free or double free type issue. Basically
> what you end up with is the memory getting corrupted by software that
> is accessing regions it shouldn't be.
>
>> Rarely on different servers I saw 'zram decompression error' messages
>> (in this case I've got such message on H61M-based box).
>>
>> Also, other people that uses accel-ppp as BRAS software, have
>> different kernel panics/bugs/oopses on fresh kernels.
>>
>> I'll try to apply these patches, and I'll try to switch back to
>> kernels that were stable on some boxes.
>
> If you could bisect this it would be useful.  Basically we just need
> to determine where in the git history these issues started popping up
> so that we can then narrow down on the root cause.
>
> - Alex

IMHO bisecting will take too long, because these crashes aren't regular: 
one box may work for a month without trouble and then crash twice in a 
week under the same load.

If I create 10-20k sessions in a test environment, maybe that will 
trigger the crash, but I'm not sure. I'll try to check this.


* Re: Kernel 4.1.12 crash
  2015-11-22  5:17     ` Alexander Duyck
  2015-11-22 10:45       ` Andrew
@ 2015-11-24 22:59       ` Andrew
  2015-11-25  9:35         ` Andrew
  2015-11-25 14:10         ` Guillaume Nault
  1 sibling, 2 replies; 14+ messages in thread
From: Andrew @ 2015-11-24 22:59 UTC (permalink / raw)
  To: Alexander Duyck, netdev

Hi.

I tried to reproduce the errors in a virtual environment (some VMs on my 
notebook).

I tried to create 1000 client PPPoE sessions from this box with this script:
for i in `seq 1 1000`; do pppd plugin rp-pppoe.so user test password test \
    nodefaultroute maxfail 0 persist nodefaultroute holdoff 1 noauth eth0; done

On the VM used as the client I got strange random crashes (they occur 
only while the server is online, so they're network-related):

http://postimg.org/image/ohr2mu3rj/ - crash is here:
(gdb) list *process_one_work+0x32
0xc10607b2 is in process_one_work 
(/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/workqueue.c:1952).
1947    __releases(&pool->lock)
1948    __acquires(&pool->lock)
1949    {
1950        struct pool_workqueue *pwq = get_work_pwq(work);
1951        struct worker_pool *pool = worker->pool;
1952        bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
1953        int work_color;
1954        struct worker *collision;
1955    #ifdef CONFIG_LOCKDEP
1956        /*


http://postimg.org/image/x9mychssx/ - crash is here (noticed twice):
0xc10658bf is in kthread_data 
(/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/kthread.c:136).
131     * The caller is responsible for ensuring the validity of @task when
132     * calling this function.
133     */
134    void *kthread_data(struct task_struct *task)
135    {
136        return to_kthread(task)->data;
137    }

which is preceded by a strange location:
(gdb) list *kthread_create_on_node+0x120
0xc1065340 is in kthread 
(/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/kthread.c:176).
171    {
172        __kthread_parkme(to_kthread(current));
173    }
174
175    static int kthread(void *_create)
176    {
177        /* Copy data: it's on kthread's stack */
178        struct kthread_create_info *create = _create;
179        int (*threadfn)(void *data) = create->threadfn;
180        void *data = create->data;

And earlier:
(gdb) list *ret_from_kernel_thread+0x21
0xc13bb181 is at 
/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/arch/x86/kernel/entry_32.S:312.
307        popl_cfi %eax
308        pushl_cfi $0x0202        # Reset kernel eflags
309        popfl_cfi
310        movl PT_EBP(%esp),%eax
311        call *PT_EBX(%esp)
312        movl $0,PT_EAX(%esp)
313        jmp syscall_exit
314        CFI_ENDPROC
315    ENDPROC(ret_from_kernel_thread)
316

Stack corruption?..

I'll try to set up a test environment on real hardware, and I'll try to 
test with older kernels.

22.11.2015 07:17, Alexander Duyck wrote:
> On 11/21/2015 12:16 AM, Andrew wrote:
>> Memory corruption, if happens, IMHO shouldn't be a hardware-related - 
>> almost all of these boxes, except H61M-based box from 1st log, works 
>> for a long time with uptime more than year; and only software was 
>> changed on it; H61M-based box runs memtest86 for a tens of hours w/o 
>> any error. If it was caused by hardware - they should crash even 
>> earlier.
>
> I wasn't saying it was hardware related.  My thought is that it could 
> be some sort of use after free or double free type issue. Basically 
> what you end up with is the memory getting corrupted by software that 
> is accessing regions it shouldn't be.
>
>> Rarely on different servers I saw 'zram decompression error' messages 
>> (in this case I've got such message on H61M-based box).
>>
>> Also, other people that uses accel-ppp as BRAS software, have 
>> different kernel panics/bugs/oopses on fresh kernels.
>>
>> I'll try to apply these patches, and I'll try to switch back to 
>> kernels that were stable on some boxes.
>
> If you could bisect this it would be useful.  Basically we just need 
> to determine where in the git history these issues started popping up 
> so that we can then narrow down on the root cause.
>
> - Alex


* Re: Kernel 4.1.12 crash
  2015-11-24 22:59       ` Andrew
@ 2015-11-25  9:35         ` Andrew
  2015-11-25 14:10         ` Guillaume Nault
  1 sibling, 0 replies; 14+ messages in thread
From: Andrew @ 2015-11-25  9:35 UTC (permalink / raw)
  To: Alexander Duyck, netdev

Hm, an older image with 3.10.57 looks stable in the same test case, so at 
least one of the bugs can be bisected fairly easily. I'll try to downgrade 
the kernel with the same userland for testing, and then bisect to find the 
buggy commit.

25.11.2015 00:59, Andrew wrote:
> Hi.
>
> I tried to reproduce errors in virtual environment (some VMs on my 
> notebook).
>
> I've tried to create 1000 client PPPoE sessions from this box via script:
> for i in `seq 1 1000`; do pppd plugin rp-pppoe.so user test password 
> test nodefaultroute maxfail 0 persist nodefaultroute holdoff 1 noauth 
> eth0; done
>
> And on VM that is used as client I've got strange random crashes (that 
> are present only when server is online - so they're network-related):
>
> http://postimg.org/image/ohr2mu3rj/ - crash is here:
> (gdb) list *process_one_work+0x32
> 0xc10607b2 is in process_one_work 
> (/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/workqueue.c:1952).
> 1947    __releases(&pool->lock)
> 1948    __acquires(&pool->lock)
> 1949    {
> 1950        struct pool_workqueue *pwq = get_work_pwq(work);
> 1951        struct worker_pool *pool = worker->pool;
> 1952        bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
> 1953        int work_color;
> 1954        struct worker *collision;
> 1955    #ifdef CONFIG_LOCKDEP
> 1956        /*
>
>
> http://postimg.org/image/x9mychssx/ - crash is here (noticed twice):
> 0xc10658bf is in kthread_data 
> (/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/kthread.c:136).
> 131     * The caller is responsible for ensuring the validity of @task 
> when
> 132     * calling this function.
> 133     */
> 134    void *kthread_data(struct task_struct *task)
> 135    {
> 136        return to_kthread(task)->data;
> 137    }
>
> which is leaded by strange place:
> (gdb) list *kthread_create_on_node+0x120
> 0xc1065340 is in kthread 
> (/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/kthread.c:176).
> 171    {
> 172        __kthread_parkme(to_kthread(current));
> 173    }
> 174
> 175    static int kthread(void *_create)
> 176    {
> 177        /* Copy data: it's on kthread's stack */
> 178        struct kthread_create_info *create = _create;
> 179        int (*threadfn)(void *data) = create->threadfn;
> 180        void *data = create->data;
>
> And earlier:
> (gdb) list *ret_from_kernel_thread+0x21
> 0xc13bb181 is at 
> /var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/arch/x86/kernel/entry_32.S:312.
> 307        popl_cfi %eax
> 308        pushl_cfi $0x0202        # Reset kernel eflags
> 309        popfl_cfi
> 310        movl PT_EBP(%esp),%eax
> 311        call *PT_EBX(%esp)
> 312        movl $0,PT_EAX(%esp)
> 313        jmp syscall_exit
> 314        CFI_ENDPROC
> 315    ENDPROC(ret_from_kernel_thread)
> 316
>
> Stack corruption?..
>
> I'll try to make test environment on real hardware. And I'll try to 
> test with older kernels.
>
> 22.11.2015 07:17, Alexander Duyck wrote:
>> On 11/21/2015 12:16 AM, Andrew wrote:
>>> Memory corruption, if happens, IMHO shouldn't be a hardware-related 
>>> - almost all of these boxes, except H61M-based box from 1st log, 
>>> works for a long time with uptime more than year; and only software 
>>> was changed on it; H61M-based box runs memtest86 for a tens of hours 
>>> w/o any error. If it was caused by hardware - they should crash even 
>>> earlier.
>>
>> I wasn't saying it was hardware related.  My thought is that it could 
>> be some sort of use after free or double free type issue. Basically 
>> what you end up with is the memory getting corrupted by software that 
>> is accessing regions it shouldn't be.
>>
>>> Rarely on different servers I saw 'zram decompression error' 
>>> messages (in this case I've got such message on H61M-based box).
>>>
>>> Also, other people that uses accel-ppp as BRAS software, have 
>>> different kernel panics/bugs/oopses on fresh kernels.
>>>
>>> I'll try to apply these patches, and I'll try to switch back to 
>>> kernels that were stable on some boxes.
>>
>> If you could bisect this it would be useful.  Basically we just need 
>> to determine where in the git history these issues started popping up 
>> so that we can then narrow down on the root cause.
>>
>> - Alex
>


* Re: Kernel 4.1.12 crash
  2015-11-24 22:59       ` Andrew
  2015-11-25  9:35         ` Andrew
@ 2015-11-25 14:10         ` Guillaume Nault
       [not found]           ` <5655CCAE.6000300@seti.kr.ua>
  1 sibling, 1 reply; 14+ messages in thread
From: Guillaume Nault @ 2015-11-25 14:10 UTC (permalink / raw)
  To: Andrew; +Cc: Alexander Duyck, netdev

On Wed, Nov 25, 2015 at 12:59:52AM +0200, Andrew wrote:
> Hi.
> 
> I tried to reproduce errors in virtual environment (some VMs on my
> notebook).
> 
> I've tried to create 1000 client PPPoE sessions from this box via script:
> for i in `seq 1 1000`; do pppd plugin rp-pppoe.so user test password test
> nodefaultroute maxfail 0 persist nodefaultroute holdoff 1 noauth eth0; done
> 
I've tried to reproduce the bug with your script, but couldn't get
anything to crash (VM is Debian Jessie i386 running on KVM with upstream
kernel 4.1.12). Does the crash happen before all sessions get
established?

Can you reliably reproduce the bug? If so can you please try with 4.3?
It contains ppp fixes not included in 4.1.12.


* Re: Kernel 4.1.12 crash
       [not found]           ` <5655CCAE.6000300@seti.kr.ua>
@ 2015-11-26 16:44             ` Guillaume Nault
       [not found]               ` <565B7699.8030105@seti.kr.ua>
  0 siblings, 1 reply; 14+ messages in thread
From: Guillaume Nault @ 2015-11-26 16:44 UTC (permalink / raw)
  To: Andrew; +Cc: Alexander Duyck, netdev

On Wed, Nov 25, 2015 at 04:58:54PM +0200, Andrew wrote:
> 25.11.2015 16:10, Guillaume Nault wrote:
> >On Wed, Nov 25, 2015 at 12:59:52AM +0200, Andrew wrote:
> >>Hi.
> >>
> >>I tried to reproduce errors in virtual environment (some VMs on my
> >>notebook).
> >>
> >>I've tried to create 1000 client PPPoE sessions from this box via script:
> >>for i in `seq 1 1000`; do pppd plugin rp-pppoe.so user test password test
> >>nodefaultroute maxfail 0 persist nodefaultroute holdoff 1 noauth eth0; done
> >>
> >I've tried to reproduce the bug with your script, but couldn't get
> >anything to crash (VM is Debian Jessie i386 running on KVM with upstream
> >kernel 4.1.12). Does the crash happen before all sessions get
> >established?
> Yes, the crash happens even before all daemon instances are started. Sessions
> don't get established because the BRAS is configured to reject them (so a lot
> of concurrent connection retries happen): I still haven't created an account
> for the test user on it.
> 
Ok, I got the crash too. In fact I had misunderstood your previous
message: the crash happens when PPP sessions don't get established
(authentication failures in my case).

I'll investigate and let you know.


* Re: Kernel 4.1.12 crash
       [not found]               ` <565B7699.8030105@seti.kr.ua>
@ 2015-11-30 15:03                 ` Guillaume Nault
  2015-11-30 20:42                   ` Guillaume Nault
  0 siblings, 1 reply; 14+ messages in thread
From: Guillaume Nault @ 2015-11-30 15:03 UTC (permalink / raw)
  To: Andrew; +Cc: Alexander Duyck, netdev

On Mon, Nov 30, 2015 at 12:05:13AM +0200, Andrew wrote:
> 26.11.2015 18:44, Guillaume Nault wrote:
> >On Wed, Nov 25, 2015 at 04:58:54PM +0200, Andrew wrote:
> >>25.11.2015 16:10, Guillaume Nault wrote:
> >>>On Wed, Nov 25, 2015 at 12:59:52AM +0200, Andrew wrote:
> >>>>Hi.
> >>>>
> >>>>I tried to reproduce errors in virtual environment (some VMs on my
> >>>>notebook).
> >>>>
> >>>>I've tried to create 1000 client PPPoE sessions from this box via script:
> >>>>for i in `seq 1 1000`; do pppd plugin rp-pppoe.so user test password test
> >>>>nodefaultroute maxfail 0 persist nodefaultroute holdoff 1 noauth eth0; done
> >>>>
> >>>I've tried to reproduce the bug with your script, but couldn't get
> >>>anything to crash (VM is Debian Jessie i386 running on KVM with upstream
> >>>kernel 4.1.12). Does the crash happen before all sessions get
> >>>established?
> >>Yes, crash happens even before all daemon instances are started. Sessions
> >>don't get established because BRAS configured to reject sessions (so a lot
> >>of concurrent connection retries happens) - I still didn't created account
> >>for test user on it.
> >>
> >Ok, I got the crash too. In fact I had misunderstood your previous
> >message, crash happens when PPP sessions don't get established
> >(authentication failures in my case).
> >
> >I'll investigate on that and let you know.
> 
> It seems the bug appears when many ppp devices are removed at once (I planned
> to use this test environment to reproduce the periodic BRAS crashes, but
> unexpectedly got crashes on the test client instead).
> 
> I've checked it with some kernels: it's present in 4.3.0, but not in
> 3.10.57. I'll try to build 3.14/3.18 kernels to see how they behave in
> this case.

Yes, it most likely was introduced by 287f3a943fef ("pppoe: Use
workqueue to die properly when a PADT is received"). I still have to
figure out why.


* Re: Kernel 4.1.12 crash
  2015-11-30 15:03                 ` Guillaume Nault
@ 2015-11-30 20:42                   ` Guillaume Nault
  2015-12-02 17:23                     ` Guillaume Nault
  0 siblings, 1 reply; 14+ messages in thread
From: Guillaume Nault @ 2015-11-30 20:42 UTC (permalink / raw)
  To: Andrew; +Cc: Alexander Duyck, netdev, Simon Farnsworth

[Adding Simon to the discussion]

On Mon, Nov 30, 2015 at 04:03:37PM +0100, Guillaume Nault wrote:
> On Mon, Nov 30, 2015 at 12:05:13AM +0200, Andrew wrote:
> > 26.11.2015 18:44, Guillaume Nault wrote:
> > >On Wed, Nov 25, 2015 at 04:58:54PM +0200, Andrew wrote:
> > >>25.11.2015 16:10, Guillaume Nault wrote:
> > >>>On Wed, Nov 25, 2015 at 12:59:52AM +0200, Andrew wrote:
> > >>>>Hi.
> > >>>>
> > >>>>I tried to reproduce errors in virtual environment (some VMs on my
> > >>>>notebook).
> > >>>>
> > >>>>I've tried to create 1000 client PPPoE sessions from this box via script:
> > >>>>for i in `seq 1 1000`; do pppd plugin rp-pppoe.so user test password test
> > >>>>nodefaultroute maxfail 0 persist nodefaultroute holdoff 1 noauth eth0; done
> > >>>>
> > >>>I've tried to reproduce the bug with your script, but couldn't get
> > >>>anything to crash (VM is Debian Jessie i386 running on KVM with upstream
> > >>>kernel 4.1.12). Does the crash happen before all sessions get
> > >>>established?
> > >>Yes, crash happens even before all daemon instances are started. Sessions
> > >>don't get established because BRAS configured to reject sessions (so a lot
> > >>of concurrent connection retries happens) - I still didn't created account
> > >>for test user on it.
> > >>
> > >Ok, I got the crash too. In fact I had misunderstood your previous
> > >message, crash happens when PPP sessions don't get established
> > >(authentication failures in my case).
> > >
> > >I'll investigate on that and let you know.
> > 
> > It seems like bug appears on mass ppp devices removing (I planned to use
> > this test environment to reproduce BRAS periodical crashes, but suddenly
> > I've got crashes on test client).
> > 
> > I've checked it with some kernels - it's present in 4.3.0, but it isn't
> > present in 3.10.57. I'll try to build 3.14/3.18 kernels to look how they
> > will work in this case.
> 
> Yes, it most likely was introduced by 287f3a943fef ("pppoe: Use
> workqueue to die properly when a PADT is received"). I still have to
> figure out why.

I confirm the bug comes from this commit.

It happens if pppoe_connect() reinitialises po->proto.pppoe.padt_work
after pppoe_disc_rcv() has added it to the system's work queue, and
before that work got scheduled. Then when scheduling occurs, the worker
thread tries to run a corrupted structure and crashes.

I'm going to work on a patch.
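
The race described above can be illustrated with a small userspace model. This is only a sketch, not kernel code: struct node, struct work_model, work_model_init() and work_model_queue() are hypothetical stand-ins for the kernel's list and work item handling, and the queued_on back-pointer is a simplification invented for this example. It shows that re-initialising an item while it is still linked on a queue leaves the queue pointing at an item that no longer points back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Circular doubly linked list node, in the style of struct list_head. */
struct node {
    struct node *next, *prev;
};

/* Hypothetical stand-in for a work item: list linkage plus a
 * back-pointer to the queue it is pending on (NULL when idle). */
struct work_model {
    struct node entry;
    struct node *queued_on;
};

static void node_init(struct node *n)
{
    n->next = n;
    n->prev = n;
}

/* Models an unconditional initialisation: it resets the item whether
 * or not it is currently pending on a queue. */
static void work_model_init(struct work_model *w)
{
    node_init(&w->entry);
    w->queued_on = NULL;
}

/* Models queueing: link the item at the tail of the queue. */
static void work_model_queue(struct node *head, struct work_model *w)
{
    assert(w->queued_on == NULL); /* the item must be idle */
    w->entry.prev = head->prev;
    w->entry.next = head;
    head->prev->next = &w->entry;
    head->prev = &w->entry;
    w->queued_on = head;
}

/* The head of an intact circular list is linked symmetrically. */
static bool queue_head_is_intact(const struct node *head)
{
    return head->next->prev == head && head->prev->next == head;
}

/* Replays the sequence: init, queue, then init again before any
 * worker has run. Returns true if the queue ends up inconsistent. */
bool reinit_while_pending_corrupts(void)
{
    struct node queue;
    struct work_model padt_work;

    node_init(&queue);
    work_model_init(&padt_work);          /* first connect */
    work_model_queue(&queue, &padt_work); /* termination request queued */
    work_model_init(&padt_work);          /* second connect, work still pending */

    /* The queue still references the item, but the item now points
     * only at itself and has lost its back-pointer. */
    return !queue_head_is_intact(&queue) && padt_work.queued_on == NULL;
}
```

In this model a worker that later picks the item off the queue would find inconsistent links and a missing back-pointer, which is the kind of state the worker thread is not prepared to handle.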


* Re: Kernel 4.1.12 crash
  2015-11-30 20:42                   ` Guillaume Nault
@ 2015-12-02 17:23                     ` Guillaume Nault
  2015-12-03 15:35                       ` Guillaume Nault
  0 siblings, 1 reply; 14+ messages in thread
From: Guillaume Nault @ 2015-12-02 17:23 UTC (permalink / raw)
  To: Andrew; +Cc: Alexander Duyck, netdev, Simon Farnsworth

On Mon, Nov 30, 2015 at 09:42:08PM +0100, Guillaume Nault wrote:
> On Mon, Nov 30, 2015 at 04:03:37PM +0100, Guillaume Nault wrote:
> > Yes, it most likely was introduced by 287f3a943fef ("pppoe: Use
> > workqueue to die properly when a PADT is received"). I still have to
> > figure out why.
> 
> I confirm the bug comes from this commit.
> 
> It happens if pppoe_connect() reinitialises po->proto.pppoe.padt_work
> after pppoe_disc_rcv() has added it to the system's work queue, and
> before that work got scheduled. Then when scheduling occurs, the worker
> thread tries to run a corrupted structure and crashes.
> 
> I'm going to work on a patch.

You can try the following. It's not yet a proper fix as there are still
a few things that bug me in pppoe_connect().

---
diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index 5e0b432..865b74d 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -568,6 +568,9 @@ static int pppoe_create(struct net *net, struct socket *sock, int kern)
 	sk->sk_family		= PF_PPPOX;
 	sk->sk_protocol		= PX_PROTO_OE;
 
+	INIT_WORK(&pppox_sk(sk)->proto.pppoe.padt_work,
+		  pppoe_unbind_sock_work);
+
 	return 0;
 }
 
@@ -632,8 +635,6 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 
 	lock_sock(sk);
 
-	INIT_WORK(&po->proto.pppoe.padt_work, pppoe_unbind_sock_work);
-
 	error = -EINVAL;
 	if (sp->sa_protocol != PX_PROTO_OE)
 		goto end;
@@ -663,8 +664,6 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 			po->pppoe_dev = NULL;
 		}
 
-		memset(sk_pppox(po) + 1, 0,
-		       sizeof(struct pppox_sock) - sizeof(struct sock));
 		sk->sk_state = PPPOX_NONE;
 	}
 

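The idea behind the patch above (initialise the work item once when the socket is created, and stop resetting it on every connect) can be sketched in userspace as follows. This is an illustrative model only, under the assumption that a connect may run while the item is still pending: struct sock_model, sock_model_create() and sock_model_connect() are hypothetical names, not the driver's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Circular doubly linked list node, in the style of struct list_head. */
struct node {
    struct node *next, *prev;
};

/* Hypothetical socket model: an embedded work linkage plus a couple
 * of per-connection fields. */
struct sock_model {
    struct node padt_work;
    int state;
    int session_id;
};

static void node_init(struct node *n)
{
    n->next = n;
    n->prev = n;
}

static void node_add_tail(struct node *head, struct node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static bool queue_head_is_intact(const struct node *head)
{
    return head->next->prev == head && head->prev->next == head;
}

/* Creation is the only place where the work linkage is initialised. */
static void sock_model_create(struct sock_model *s)
{
    memset(s, 0, sizeof(*s));
    node_init(&s->padt_work);
}

/* Connect resets per-connection fields individually and leaves the
 * work linkage alone, so a pending item stays correctly linked. */
static void sock_model_connect(struct sock_model *s, int session_id)
{
    s->state = 1;
    s->session_id = session_id;
}

/* Replays: create, connect, queue the work, then connect again
 * before any worker has run. Returns true if the queue is intact. */
bool init_once_keeps_queue_intact(void)
{
    struct node queue;
    struct sock_model sk;

    node_init(&queue);
    sock_model_create(&sk);
    sock_model_connect(&sk, 1);
    node_add_tail(&queue, &sk.padt_work); /* termination request queued */
    sock_model_connect(&sk, 2);           /* reconnect while still pending */

    assert(sk.session_id == 2);
    return queue_head_is_intact(&queue);
}
```

The design choice mirrored here is that state shared with another execution context is set up once for the lifetime of the object, while a reconnect only touches fields that belong to the connection itself.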

* Re: Kernel 4.1.12 crash
  2015-12-02 17:23                     ` Guillaume Nault
@ 2015-12-03 15:35                       ` Guillaume Nault
  2015-12-03 21:09                         ` Andrew
  0 siblings, 1 reply; 14+ messages in thread
From: Guillaume Nault @ 2015-12-03 15:35 UTC (permalink / raw)
  To: Andrew; +Cc: Alexander Duyck, netdev, Simon Farnsworth

On Wed, Dec 02, 2015 at 06:23:35PM +0100, Guillaume Nault wrote:
> 
> You can try the following. It's not yet a proper fix as there are still
> a few things that bug me in pppoe_connect().
> 
> ---
> diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
> index 5e0b432..865b74d 100644
> --- a/drivers/net/ppp/pppoe.c
> +++ b/drivers/net/ppp/pppoe.c
> @@ -568,6 +568,9 @@ static int pppoe_create(struct net *net, struct socket *sock, int kern)
>  	sk->sk_family		= PF_PPPOX;
>  	sk->sk_protocol		= PX_PROTO_OE;
>  
> +	INIT_WORK(&pppox_sk(sk)->proto.pppoe.padt_work,
> +		  pppoe_unbind_sock_work);
> +
>  	return 0;
>  }
>  
> @@ -632,8 +635,6 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
>  
>  	lock_sock(sk);
>  
> -	INIT_WORK(&po->proto.pppoe.padt_work, pppoe_unbind_sock_work);
> -
>  	error = -EINVAL;
>  	if (sp->sa_protocol != PX_PROTO_OE)
>  		goto end;
> @@ -663,8 +664,6 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
>  			po->pppoe_dev = NULL;
>  		}
>  
> -		memset(sk_pppox(po) + 1, 0,
> -		       sizeof(struct pppox_sock) - sizeof(struct sock));
>  		sk->sk_state = PPPOX_NONE;
>  	}
>  
In the end, I'm going to send something similar to -net and keep the rest
of the pppoe_connect() modifications for net-next. This will ease
backporting to -stable.


* Re: Kernel 4.1.12 crash
  2015-12-03 15:35                       ` Guillaume Nault
@ 2015-12-03 21:09                         ` Andrew
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew @ 2015-12-03 21:09 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: Alexander Duyck, netdev, Simon Farnsworth

Hi.

Thanks, I'll rebuild the kernel with your patch "pppoe: fix memory 
corruption in padt work structure", try to check it in the test env, and 
then try to update the PPPoE servers.

03.12.2015 17:35, Guillaume Nault wrote:
> On Wed, Dec 02, 2015 at 06:23:35PM +0100, Guillaume Nault wrote:
>> You can try the following. It's not yet a proper fix as there are still
>> a few things that bug me in pppoe_connect().
>>
>> ---
>> diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
>> index 5e0b432..865b74d 100644
>> --- a/drivers/net/ppp/pppoe.c
>> +++ b/drivers/net/ppp/pppoe.c
>> @@ -568,6 +568,9 @@ static int pppoe_create(struct net *net, struct socket *sock, int kern)
>>   	sk->sk_family		= PF_PPPOX;
>>   	sk->sk_protocol		= PX_PROTO_OE;
>>   
>> +	INIT_WORK(&pppox_sk(sk)->proto.pppoe.padt_work,
>> +		  pppoe_unbind_sock_work);
>> +
>>   	return 0;
>>   }
>>   
>> @@ -632,8 +635,6 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
>>   
>>   	lock_sock(sk);
>>   
>> -	INIT_WORK(&po->proto.pppoe.padt_work, pppoe_unbind_sock_work);
>> -
>>   	error = -EINVAL;
>>   	if (sp->sa_protocol != PX_PROTO_OE)
>>   		goto end;
>> @@ -663,8 +664,6 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
>>   			po->pppoe_dev = NULL;
>>   		}
>>   
>> -		memset(sk_pppox(po) + 1, 0,
>> -		       sizeof(struct pppox_sock) - sizeof(struct sock));
>>   		sk->sk_state = PPPOX_NONE;
>>   	}
>>   
> Finally, I'm going to send something similar to -net and keep the rest
> of pppoe_connect() modifications for net-next. This will ease
> backporting to -stable.


Thread overview: 14+ messages
2015-11-20 13:58 Kernel 4.1.12 crash Andrew
2015-11-20 23:13 ` Alexander Duyck
2015-11-21  8:16   ` Andrew
2015-11-22  5:17     ` Alexander Duyck
2015-11-22 10:45       ` Andrew
2015-11-24 22:59       ` Andrew
2015-11-25  9:35         ` Andrew
2015-11-25 14:10         ` Guillaume Nault
     [not found]           ` <5655CCAE.6000300@seti.kr.ua>
2015-11-26 16:44             ` Guillaume Nault
     [not found]               ` <565B7699.8030105@seti.kr.ua>
2015-11-30 15:03                 ` Guillaume Nault
2015-11-30 20:42                   ` Guillaume Nault
2015-12-02 17:23                     ` Guillaume Nault
2015-12-03 15:35                       ` Guillaume Nault
2015-12-03 21:09                         ` Andrew
