* Kernel 4.1.12 crash
@ 2015-11-20 13:58 Andrew
2015-11-20 23:13 ` Alexander Duyck
0 siblings, 1 reply; 14+ messages in thread
From: Andrew @ 2015-11-20 13:58 UTC (permalink / raw)
To: netdev
Hi all.
Today some BRASes running the 4.1.12 kernel crashed.
Here are the crash traces: http://pastebin.com/p68hNS8R
http://pastebin.com/36ieRAM2 http://pastebin.com/3BRTVEB6
The same hardware works fine on the 3.2 kernel; the trouble was noticed after the kernel
upgrade.
What additional info is needed?
* Re: Kernel 4.1.12 crash
2015-11-20 13:58 Kernel 4.1.12 crash Andrew
@ 2015-11-20 23:13 ` Alexander Duyck
2015-11-21 8:16 ` Andrew
0 siblings, 1 reply; 14+ messages in thread
From: Alexander Duyck @ 2015-11-20 23:13 UTC (permalink / raw)
To: Andrew, netdev
On 11/20/2015 05:58 AM, Andrew wrote:
> Hi all.
>
> Today some BRASes on 4.1.12 kernel were crashed.
>
> Here's crash traces: http://pastebin.com/p68hNS8R
> http://pastebin.com/36ieRAM2 http://pastebin.com/3BRTVEB6
>
> On 3.2 kernel same hardware works OK, troubles were noticed after kernel
> upgrade.
>
> What additional info is needed?
Looking over the traces there seem to be two areas called out.
The first is the fib_trie resize BUG_ON that was triggered due to the
parent and child not being associated. I think that might be due to
memory corruption as I cannot find any spots where we are resizing
without correctly setting up the parent-child relationship of the nodes
first.
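For illustration only, the invariant behind that BUG_ON can be modelled in a few lines of C. This is a sketch with hypothetical toy_* names and a fixed fan-out of 2, not the fib_trie code: a node counts as correctly associated when its parent pointer and one of the parent's child slots agree with each other.

```c
#include <stddef.h>

#define TOY_FANOUT 2

/* Hypothetical trie node: a back-pointer plus a fixed set of child slots. */
struct toy_tnode {
    struct toy_tnode *parent;
    struct toy_tnode *child[TOY_FANOUT];
};

/* Link child into slot idx of parent, updating both directions at once. */
void toy_put_child(struct toy_tnode *parent, int idx, struct toy_tnode *child)
{
    parent->child[idx] = child;
    if (child)
        child->parent = parent;
}

/* Return 1 if node is the root or occupies one of its parent's slots. */
int toy_linked(const struct toy_tnode *node)
{
    if (!node->parent)
        return 1;
    for (int i = 0; i < TOY_FANOUT; i++)
        if (node->parent->child[i] == node)
            return 1;
    return 0;
}
```

A node whose parent pointer was overwritten by a stray write, without going through toy_put_child(), fails toy_linked(); that mismatch is the kind of state a resize check would trip over.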
The other spot that is showing up is ppp_shutdown_interface and its
related path. It looks like there are a couple of patches you could try
back-porting to see if it resolves the issue. If they do then perhaps
they should be considered candidates for stable:
8cb775bc0a3 ("ppp: fix device unregistration upon netns deletion")
58a89ecaca5 ("ppp: fix lockdep splat in ppp_dev_uninit()")
- Alex
* Re: Kernel 4.1.12 crash
2015-11-20 23:13 ` Alexander Duyck
@ 2015-11-21 8:16 ` Andrew
2015-11-22 5:17 ` Alexander Duyck
0 siblings, 1 reply; 14+ messages in thread
From: Andrew @ 2015-11-21 8:16 UTC (permalink / raw)
To: netdev
Memory corruption, if it is happening, IMHO shouldn't be hardware-related:
almost all of these boxes, except the H61M-based box from the 1st log, have worked for
a long time with uptimes of more than a year, and only the software was changed on
them. The H61M-based box ran memtest86 for tens of hours without any error. If
the crashes were caused by hardware, the boxes should have crashed earlier.
On rare occasions I saw 'zram decompression error' messages on different servers
(in this case I got such a message on the H61M-based box).
Also, other people who use accel-ppp as BRAS software see various
kernel panics/bugs/oopses on recent kernels.
I'll try to apply these patches, and I'll try to switch back to kernels
that were stable on some boxes.
21.11.2015 01:13, Alexander Duyck wrote:
> On 11/20/2015 05:58 AM, Andrew wrote:
>> Hi all.
>>
>> Today some BRASes on 4.1.12 kernel were crashed.
>>
>> Here's crash traces: http://pastebin.com/p68hNS8R
>> http://pastebin.com/36ieRAM2 http://pastebin.com/3BRTVEB6
>>
>> On 3.2 kernel same hardware works OK, troubles were noticed after kernel
>> upgrade.
>>
>> What additional info is needed?
>
> Looking over the traces there seem to be two areas called out.
>
> The first is the fib_trie resize BUG_ON that was triggered due to the
> parent and child not being associated. I think that might be due to
> memory corruption as I cannot find any spots where we are resizing
> without correctly setting up the parent-child relationship of the
> nodes first.
>
> The other spot that is showing up is ppp_shutdown_interface and it's
> related path. It looks like there are a couple of patches you could
> try back-porting to see if it resolves the issue. If they do then
> perhaps they should be considered candidates for stable:
>
> 8cb775bc0a3 ("ppp: fix device unregistration upon netns deletion")
> 58a89ecaca5 ("ppp: fix lockdep splat in ppp_dev_uninit()")
>
> - Alex
* Re: Kernel 4.1.12 crash
2015-11-21 8:16 ` Andrew
@ 2015-11-22 5:17 ` Alexander Duyck
2015-11-22 10:45 ` Andrew
2015-11-24 22:59 ` Andrew
0 siblings, 2 replies; 14+ messages in thread
From: Alexander Duyck @ 2015-11-22 5:17 UTC (permalink / raw)
To: Andrew, netdev
On 11/21/2015 12:16 AM, Andrew wrote:
> Memory corruption, if happens, IMHO shouldn't be a hardware-related -
> almost all of these boxes, except H61M-based box from 1st log, works
> for a long time with uptime more than year; and only software was
> changed on it; H61M-based box runs memtest86 for a tens of hours w/o
> any error. If it was caused by hardware - they should crash even earlier.
I wasn't saying it was hardware related. My thought is that it could be
some sort of use after free or double free type issue. Basically what
you end up with is the memory getting corrupted by software that is
accessing regions it shouldn't be.
> Rarely on different servers I saw 'zram decompression error' messages
> (in this case I've got such message on H61M-based box).
>
> Also, other people that uses accel-ppp as BRAS software, have
> different kernel panics/bugs/oopses on fresh kernels.
>
> I'll try to apply these patches, and I'll try to switch back to
> kernels that were stable on some boxes.
If you could bisect this it would be useful. Basically we just need to
determine where in the git history these issues started popping up so
that we can then narrow down on the root cause.
- Alex
* Re: Kernel 4.1.12 crash
2015-11-22 5:17 ` Alexander Duyck
@ 2015-11-22 10:45 ` Andrew
2015-11-24 22:59 ` Andrew
1 sibling, 0 replies; 14+ messages in thread
From: Andrew @ 2015-11-22 10:45 UTC (permalink / raw)
To: netdev
22.11.2015 07:17, Alexander Duyck wrote:
> On 11/21/2015 12:16 AM, Andrew wrote:
>> Memory corruption, if happens, IMHO shouldn't be a hardware-related -
>> almost all of these boxes, except H61M-based box from 1st log, works
>> for a long time with uptime more than year; and only software was
>> changed on it; H61M-based box runs memtest86 for a tens of hours w/o
>> any error. If it was caused by hardware - they should crash even
>> earlier.
>
> I wasn't saying it was hardware related. My thought is that it could
> be some sort of use after free or double free type issue. Basically
> what you end up with is the memory getting corrupted by software that
> is accessing regions it shouldn't be.
>
>> Rarely on different servers I saw 'zram decompression error' messages
>> (in this case I've got such message on H61M-based box).
>>
>> Also, other people that uses accel-ppp as BRAS software, have
>> different kernel panics/bugs/oopses on fresh kernels.
>>
>> I'll try to apply these patches, and I'll try to switch back to
>> kernels that were stable on some boxes.
>
> If you could bisect this it would be useful. Basically we just need
> to determine where in the git history these issues started popping up
> so that we can then narrow down on the root cause.
>
> - Alex
IMHO bisecting will take too long, because these crashes aren't regular:
a box may work for a month without trouble, and then crash twice
in a week under the same load.
Maybe creating 10-20k sessions in a test environment will
trigger the crash, but I'm not sure about this. I'll try to check.
* Re: Kernel 4.1.12 crash
2015-11-22 5:17 ` Alexander Duyck
2015-11-22 10:45 ` Andrew
@ 2015-11-24 22:59 ` Andrew
2015-11-25 9:35 ` Andrew
2015-11-25 14:10 ` Guillaume Nault
1 sibling, 2 replies; 14+ messages in thread
From: Andrew @ 2015-11-24 22:59 UTC (permalink / raw)
To: Alexander Duyck, netdev
Hi.
I tried to reproduce the errors in a virtual environment (some VMs on my
notebook).
I tried to create 1000 client PPPoE sessions from this box via this script:
for i in `seq 1 1000`; do pppd plugin rp-pppoe.so user test password test \
nodefaultroute maxfail 0 persist holdoff 1 noauth eth0; done
On the VM that is used as the client I got strange random crashes (they
occur only when the server is online, so they're network-related):
http://postimg.org/image/ohr2mu3rj/ - crash is here:
(gdb) list *process_one_work+0x32
0xc10607b2 is in process_one_work
(/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/workqueue.c:1952).
1947 __releases(&pool->lock)
1948 __acquires(&pool->lock)
1949 {
1950 struct pool_workqueue *pwq = get_work_pwq(work);
1951 struct worker_pool *pool = worker->pool;
1952 bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
1953 int work_color;
1954 struct worker *collision;
1955 #ifdef CONFIG_LOCKDEP
1956 /*
http://postimg.org/image/x9mychssx/ - crash is here (noticed twice):
0xc10658bf is in kthread_data
(/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/kthread.c:136).
131 * The caller is responsible for ensuring the validity of @task when
132 * calling this function.
133 */
134 void *kthread_data(struct task_struct *task)
135 {
136 return to_kthread(task)->data;
137 }
which is reached from a strange place:
(gdb) list *kthread_create_on_node+0x120
0xc1065340 is in kthread
(/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/kthread.c:176).
171 {
172 __kthread_parkme(to_kthread(current));
173 }
174
175 static int kthread(void *_create)
176 {
177 /* Copy data: it's on kthread's stack */
178 struct kthread_create_info *create = _create;
179 int (*threadfn)(void *data) = create->threadfn;
180 void *data = create->data;
And earlier:
(gdb) list *ret_from_kernel_thread+0x21
0xc13bb181 is at
/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/arch/x86/kernel/entry_32.S:312.
307 popl_cfi %eax
308 pushl_cfi $0x0202 # Reset kernel eflags
309 popfl_cfi
310 movl PT_EBP(%esp),%eax
311 call *PT_EBX(%esp)
312 movl $0,PT_EAX(%esp)
313 jmp syscall_exit
314 CFI_ENDPROC
315 ENDPROC(ret_from_kernel_thread)
316
Stack corruption?..
I'll try to set up a test environment on real hardware, and I'll try testing
with older kernels.
22.11.2015 07:17, Alexander Duyck wrote:
> On 11/21/2015 12:16 AM, Andrew wrote:
>> Memory corruption, if happens, IMHO shouldn't be a hardware-related -
>> almost all of these boxes, except H61M-based box from 1st log, works
>> for a long time with uptime more than year; and only software was
>> changed on it; H61M-based box runs memtest86 for a tens of hours w/o
>> any error. If it was caused by hardware - they should crash even
>> earlier.
>
> I wasn't saying it was hardware related. My thought is that it could
> be some sort of use after free or double free type issue. Basically
> what you end up with is the memory getting corrupted by software that
> is accessing regions it shouldn't be.
>
>> Rarely on different servers I saw 'zram decompression error' messages
>> (in this case I've got such message on H61M-based box).
>>
>> Also, other people that uses accel-ppp as BRAS software, have
>> different kernel panics/bugs/oopses on fresh kernels.
>>
>> I'll try to apply these patches, and I'll try to switch back to
>> kernels that were stable on some boxes.
>
> If you could bisect this it would be useful. Basically we just need
> to determine where in the git history these issues started popping up
> so that we can then narrow down on the root cause.
>
> - Alex
* Re: Kernel 4.1.12 crash
2015-11-24 22:59 ` Andrew
@ 2015-11-25 9:35 ` Andrew
2015-11-25 14:10 ` Guillaume Nault
1 sibling, 0 replies; 14+ messages in thread
From: Andrew @ 2015-11-25 9:35 UTC (permalink / raw)
To: Alexander Duyck, netdev
Hm, an older image with 3.10.57 looks stable in the same test case, so at least
one of the bugs can be bisected fairly easily. I'll try to downgrade the kernel
while keeping the same userland for testing, and then bisect to the buggy commit.
25.11.2015 00:59, Andrew wrote:
> Hi.
>
> I tried to reproduce errors in virtual environment (some VMs on my
> notebook).
>
> I've tried to create 1000 client PPPoE sessions from this box via script:
> for i in `seq 1 1000`; do pppd plugin rp-pppoe.so user test password
> test nodefaultroute maxfail 0 persist nodefaultroute holdoff 1 noauth
> eth0; done
>
> And on VM that is used as client I've got strange random crashes (that
> are present only when server is online - so they're network-related):
>
> http://postimg.org/image/ohr2mu3rj/ - crash is here:
> (gdb) list *process_one_work+0x32
> 0xc10607b2 is in process_one_work
> (/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/workqueue.c:1952).
> 1947 __releases(&pool->lock)
> 1948 __acquires(&pool->lock)
> 1949 {
> 1950 struct pool_workqueue *pwq = get_work_pwq(work);
> 1951 struct worker_pool *pool = worker->pool;
> 1952 bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
> 1953 int work_color;
> 1954 struct worker *collision;
> 1955 #ifdef CONFIG_LOCKDEP
> 1956 /*
>
>
> http://postimg.org/image/x9mychssx/ - crash is here (noticed twice):
> 0xc10658bf is in kthread_data
> (/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/kthread.c:136).
> 131 * The caller is responsible for ensuring the validity of @task
> when
> 132 * calling this function.
> 133 */
> 134 void *kthread_data(struct task_struct *task)
> 135 {
> 136 return to_kthread(task)->data;
> 137 }
>
> which is leaded by strange place:
> (gdb) list *kthread_create_on_node+0x120
> 0xc1065340 is in kthread
> (/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/kthread.c:176).
> 171 {
> 172 __kthread_parkme(to_kthread(current));
> 173 }
> 174
> 175 static int kthread(void *_create)
> 176 {
> 177 /* Copy data: it's on kthread's stack */
> 178 struct kthread_create_info *create = _create;
> 179 int (*threadfn)(void *data) = create->threadfn;
> 180 void *data = create->data;
>
> And earlier:
> (gdb) list *ret_from_kernel_thread+0x21
> 0xc13bb181 is at
> /var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/arch/x86/kernel/entry_32.S:312.
> 307 popl_cfi %eax
> 308 pushl_cfi $0x0202 # Reset kernel eflags
> 309 popfl_cfi
> 310 movl PT_EBP(%esp),%eax
> 311 call *PT_EBX(%esp)
> 312 movl $0,PT_EAX(%esp)
> 313 jmp syscall_exit
> 314 CFI_ENDPROC
> 315 ENDPROC(ret_from_kernel_thread)
> 316
>
> Stack corruption?..
>
> I'll try to make test environment on real hardware. And I'll try to
> test with older kernels.
>
> 22.11.2015 07:17, Alexander Duyck wrote:
>> On 11/21/2015 12:16 AM, Andrew wrote:
>>> Memory corruption, if happens, IMHO shouldn't be a hardware-related
>>> - almost all of these boxes, except H61M-based box from 1st log,
>>> works for a long time with uptime more than year; and only software
>>> was changed on it; H61M-based box runs memtest86 for a tens of hours
>>> w/o any error. If it was caused by hardware - they should crash even
>>> earlier.
>>
>> I wasn't saying it was hardware related. My thought is that it could
>> be some sort of use after free or double free type issue. Basically
>> what you end up with is the memory getting corrupted by software that
>> is accessing regions it shouldn't be.
>>
>>> Rarely on different servers I saw 'zram decompression error'
>>> messages (in this case I've got such message on H61M-based box).
>>>
>>> Also, other people that uses accel-ppp as BRAS software, have
>>> different kernel panics/bugs/oopses on fresh kernels.
>>>
>>> I'll try to apply these patches, and I'll try to switch back to
>>> kernels that were stable on some boxes.
>>
>> If you could bisect this it would be useful. Basically we just need
>> to determine where in the git history these issues started popping up
>> so that we can then narrow down on the root cause.
>>
>> - Alex
>
* Re: Kernel 4.1.12 crash
2015-11-24 22:59 ` Andrew
2015-11-25 9:35 ` Andrew
@ 2015-11-25 14:10 ` Guillaume Nault
[not found] ` <5655CCAE.6000300@seti.kr.ua>
1 sibling, 1 reply; 14+ messages in thread
From: Guillaume Nault @ 2015-11-25 14:10 UTC (permalink / raw)
To: Andrew; +Cc: Alexander Duyck, netdev
On Wed, Nov 25, 2015 at 12:59:52AM +0200, Andrew wrote:
> Hi.
>
> I tried to reproduce errors in virtual environment (some VMs on my
> notebook).
>
> I've tried to create 1000 client PPPoE sessions from this box via script:
> for i in `seq 1 1000`; do pppd plugin rp-pppoe.so user test password test
> nodefaultroute maxfail 0 persist nodefaultroute holdoff 1 noauth eth0; done
>
I've tried to reproduce the bug with your script, but couldn't get
anything to crash (VM is Debian Jessie i386 running on KVM with upstream
kernel 4.1.12). Does the crash happen before all sessions get
established?
Can you reliably reproduce the bug? If so can you please try with 4.3?
It contains ppp fixes not included in 4.1.12.
* Re: Kernel 4.1.12 crash
[not found] ` <5655CCAE.6000300@seti.kr.ua>
@ 2015-11-26 16:44 ` Guillaume Nault
[not found] ` <565B7699.8030105@seti.kr.ua>
0 siblings, 1 reply; 14+ messages in thread
From: Guillaume Nault @ 2015-11-26 16:44 UTC (permalink / raw)
To: Andrew; +Cc: Alexander Duyck, netdev
On Wed, Nov 25, 2015 at 04:58:54PM +0200, Andrew wrote:
> 25.11.2015 16:10, Guillaume Nault wrote:
> >On Wed, Nov 25, 2015 at 12:59:52AM +0200, Andrew wrote:
> >>Hi.
> >>
> >>I tried to reproduce errors in virtual environment (some VMs on my
> >>notebook).
> >>
> >>I've tried to create 1000 client PPPoE sessions from this box via script:
> >>for i in `seq 1 1000`; do pppd plugin rp-pppoe.so user test password test
> >>nodefaultroute maxfail 0 persist nodefaultroute holdoff 1 noauth eth0; done
> >>
> >I've tried to reproduce the bug with your script, but couldn't get
> >anything to crash (VM is Debian Jessie i386 running on KVM with upstream
> >kernel 4.1.12). Does the crash happen before all sessions get
> >established?
> Yes, crash happens even before all daemon instances are started. Sessions
> don't get established because BRAS configured to reject sessions (so a lot
> of concurrent connection retries happens) - I still didn't created account
> for test user on it.
>
Ok, I got the crash too. In fact I had misunderstood your previous
message, crash happens when PPP sessions don't get established
(authentication failures in my case).
I'll investigate that and let you know.
* Re: Kernel 4.1.12 crash
[not found] ` <565B7699.8030105@seti.kr.ua>
@ 2015-11-30 15:03 ` Guillaume Nault
2015-11-30 20:42 ` Guillaume Nault
0 siblings, 1 reply; 14+ messages in thread
From: Guillaume Nault @ 2015-11-30 15:03 UTC (permalink / raw)
To: Andrew; +Cc: Alexander Duyck, netdev
On Mon, Nov 30, 2015 at 12:05:13AM +0200, Andrew wrote:
> 26.11.2015 18:44, Guillaume Nault wrote:
> >On Wed, Nov 25, 2015 at 04:58:54PM +0200, Andrew wrote:
> >>25.11.2015 16:10, Guillaume Nault wrote:
> >>>On Wed, Nov 25, 2015 at 12:59:52AM +0200, Andrew wrote:
> >>>>Hi.
> >>>>
> >>>>I tried to reproduce errors in virtual environment (some VMs on my
> >>>>notebook).
> >>>>
> >>>>I've tried to create 1000 client PPPoE sessions from this box via script:
> >>>>for i in `seq 1 1000`; do pppd plugin rp-pppoe.so user test password test
> >>>>nodefaultroute maxfail 0 persist nodefaultroute holdoff 1 noauth eth0; done
> >>>>
> >>>I've tried to reproduce the bug with your script, but couldn't get
> >>>anything to crash (VM is Debian Jessie i386 running on KVM with upstream
> >>>kernel 4.1.12). Does the crash happen before all sessions get
> >>>established?
> >>Yes, crash happens even before all daemon instances are started. Sessions
> >>don't get established because BRAS configured to reject sessions (so a lot
> >>of concurrent connection retries happens) - I still didn't created account
> >>for test user on it.
> >>
> >Ok, I got the crash too. In fact I had misunderstood your previous
> >message, crash happens when PPP sessions don't get established
> >(authentication failures in my case).
> >
> >I'll investigate on that and let you know.
>
> It seems like bug appears on mass ppp devices removing (I planned to use
> this test environment to reproduce BRAS periodical crashes, but suddenly
> I've got crashes on test client).
>
> I've checked it with some kernels - it's present in 4.3.0, but it isn't
> present in 3.10.57. I'll try to build 3.14/3.18 kernels to look how they
> will work in this case.
Yes, it most likely was introduced by 287f3a943fef ("pppoe: Use
workqueue to die properly when a PADT is received"). I still have to
figure out why.
* Re: Kernel 4.1.12 crash
2015-11-30 15:03 ` Guillaume Nault
@ 2015-11-30 20:42 ` Guillaume Nault
2015-12-02 17:23 ` Guillaume Nault
0 siblings, 1 reply; 14+ messages in thread
From: Guillaume Nault @ 2015-11-30 20:42 UTC (permalink / raw)
To: Andrew; +Cc: Alexander Duyck, netdev, Simon Farnsworth
[Adding Simon to the discussion]
On Mon, Nov 30, 2015 at 04:03:37PM +0100, Guillaume Nault wrote:
> On Mon, Nov 30, 2015 at 12:05:13AM +0200, Andrew wrote:
> > 26.11.2015 18:44, Guillaume Nault wrote:
> > >On Wed, Nov 25, 2015 at 04:58:54PM +0200, Andrew wrote:
> > >>25.11.2015 16:10, Guillaume Nault wrote:
> > >>>On Wed, Nov 25, 2015 at 12:59:52AM +0200, Andrew wrote:
> > >>>>Hi.
> > >>>>
> > >>>>I tried to reproduce errors in virtual environment (some VMs on my
> > >>>>notebook).
> > >>>>
> > >>>>I've tried to create 1000 client PPPoE sessions from this box via script:
> > >>>>for i in `seq 1 1000`; do pppd plugin rp-pppoe.so user test password test
> > >>>>nodefaultroute maxfail 0 persist nodefaultroute holdoff 1 noauth eth0; done
> > >>>>
> > >>>I've tried to reproduce the bug with your script, but couldn't get
> > >>>anything to crash (VM is Debian Jessie i386 running on KVM with upstream
> > >>>kernel 4.1.12). Does the crash happen before all sessions get
> > >>>established?
> > >>Yes, crash happens even before all daemon instances are started. Sessions
> > >>don't get established because BRAS configured to reject sessions (so a lot
> > >>of concurrent connection retries happens) - I still didn't created account
> > >>for test user on it.
> > >>
> > >Ok, I got the crash too. In fact I had misunderstood your previous
> > >message, crash happens when PPP sessions don't get established
> > >(authentication failures in my case).
> > >
> > >I'll investigate on that and let you know.
> >
> > It seems like bug appears on mass ppp devices removing (I planned to use
> > this test environment to reproduce BRAS periodical crashes, but suddenly
> > I've got crashes on test client).
> >
> > I've checked it with some kernels - it's present in 4.3.0, but it isn't
> > present in 3.10.57. I'll try to build 3.14/3.18 kernels to look how they
> > will work in this case.
>
> Yes, it most likely was introduced by 287f3a943fef ("pppoe: Use
> workqueue to die properly when a PADT is received"). I still have to
> figure out why.
I confirm the bug comes from this commit.
It happens if pppoe_connect() reinitialises po->proto.pppoe.padt_work
after pppoe_disc_rcv() has added it to the system's work queue, and
before that work got scheduled. Then when scheduling occurs, the worker
thread tries to run a corrupted structure and crashes.
I'm going to work on a patch.
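A toy userspace model of the race described above may help make it concrete. It is not the kernel's workqueue code: every toy_* name is made up for this sketch, and the NULL check in toy_run_one() stands in for the point where a real worker would dereference a bad pointer. The model assumes, as INIT_WORK() does, that initialising a work item resets both its bookkeeping word and its list linkage.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_OK = 0, TOY_EMPTY = -1, TOY_CORRUPT = -2 };

struct toy_pwq { int flags; };

/* Hypothetical work item: owner pointer, list linkage, callback. */
struct toy_work {
    struct toy_pwq *pwq;            /* stands in for work->data  */
    struct toy_work *next, *prev;   /* stands in for work->entry */
    void (*func)(struct toy_work *);
};

struct toy_queue {
    struct toy_work head;           /* sentinel of a circular list */
    struct toy_pwq pwq;
};

static int toy_hits;
static void toy_unbind(struct toy_work *w) { (void)w; toy_hits++; }

/* Like INIT_WORK(): clears the owner and points the linkage at itself. */
static void toy_init_work(struct toy_work *w, void (*fn)(struct toy_work *))
{
    w->pwq = NULL;
    w->next = w->prev = w;
    w->func = fn;
}

static void toy_queue_init(struct toy_queue *q)
{
    toy_init_work(&q->head, NULL);
    q->pwq.flags = 0;
}

/* Append w to the tail of the queue and record which queue owns it. */
static void toy_queue_work(struct toy_queue *q, struct toy_work *w)
{
    assert(w->next == w);           /* must not already be queued */
    w->pwq = &q->pwq;
    w->prev = q->head.prev;
    w->next = &q->head;
    q->head.prev->next = w;
    q->head.prev = w;
}

/* Worker step: take the first item, validate it, unlink it, run it. */
static int toy_run_one(struct toy_queue *q)
{
    struct toy_work *w = q->head.next;

    if (w == &q->head)
        return TOY_EMPTY;
    if (w->pwq == NULL)             /* a real worker would crash here */
        return TOY_CORRUPT;
    w->prev->next = w->next;
    w->next->prev = w->prev;
    toy_init_work(w, w->func);
    w->func(w);
    return TOY_OK;
}

/* Queue the item (PADT received), optionally re-init it (connect), run. */
int toy_scenario(int reinit_while_pending)
{
    struct toy_queue q;
    struct toy_work padt;

    toy_queue_init(&q);
    toy_init_work(&padt, toy_unbind);
    toy_queue_work(&q, &padt);
    if (reinit_while_pending)
        toy_init_work(&padt, toy_unbind);
    return toy_run_one(&q);
}
```

In this model toy_scenario(0) returns TOY_OK, while toy_scenario(1) returns TOY_CORRUPT because the queue still points at an item whose owner pointer has been wiped. Initialising the item once, before it can ever be queued, avoids the second case.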
* Re: Kernel 4.1.12 crash
2015-11-30 20:42 ` Guillaume Nault
@ 2015-12-02 17:23 ` Guillaume Nault
2015-12-03 15:35 ` Guillaume Nault
0 siblings, 1 reply; 14+ messages in thread
From: Guillaume Nault @ 2015-12-02 17:23 UTC (permalink / raw)
To: Andrew; +Cc: Alexander Duyck, netdev, Simon Farnsworth
On Mon, Nov 30, 2015 at 09:42:08PM +0100, Guillaume Nault wrote:
> On Mon, Nov 30, 2015 at 04:03:37PM +0100, Guillaume Nault wrote:
> > Yes, it most likely was introduced by 287f3a943fef ("pppoe: Use
> > workqueue to die properly when a PADT is received"). I still have to
> > figure out why.
>
> I confirm the bug comes from this commit.
>
> It happens if pppoe_connect() reinitialises po->proto.pppoe.padt_work
> after pppoe_disc_rcv() has added it to the system's work queue, and
> before that work got scheduled. Then when scheduling occurs, the worker
> thread tries to run a corrupted structure and crashes.
>
> I'm going to work on a patch.
You can try the following. It's not yet a proper fix as there are still
a few things that bug me in pppoe_connect().
---
diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index 5e0b432..865b74d 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -568,6 +568,9 @@ static int pppoe_create(struct net *net, struct socket *sock, int kern)
sk->sk_family = PF_PPPOX;
sk->sk_protocol = PX_PROTO_OE;
+ INIT_WORK(&pppox_sk(sk)->proto.pppoe.padt_work,
+ pppoe_unbind_sock_work);
+
return 0;
}
@@ -632,8 +635,6 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
lock_sock(sk);
- INIT_WORK(&po->proto.pppoe.padt_work, pppoe_unbind_sock_work);
-
error = -EINVAL;
if (sp->sa_protocol != PX_PROTO_OE)
goto end;
@@ -663,8 +664,6 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
po->pppoe_dev = NULL;
}
- memset(sk_pppox(po) + 1, 0,
- sizeof(struct pppox_sock) - sizeof(struct sock));
sk->sk_state = PPPOX_NONE;
}
* Re: Kernel 4.1.12 crash
2015-12-02 17:23 ` Guillaume Nault
@ 2015-12-03 15:35 ` Guillaume Nault
2015-12-03 21:09 ` Andrew
0 siblings, 1 reply; 14+ messages in thread
From: Guillaume Nault @ 2015-12-03 15:35 UTC (permalink / raw)
To: Andrew; +Cc: Alexander Duyck, netdev, Simon Farnsworth
On Wed, Dec 02, 2015 at 06:23:35PM +0100, Guillaume Nault wrote:
>
> You can try the following. It's not yet a proper fix as there are still
> a few things that bug me in pppoe_connect().
>
> ---
> diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
> index 5e0b432..865b74d 100644
> --- a/drivers/net/ppp/pppoe.c
> +++ b/drivers/net/ppp/pppoe.c
> @@ -568,6 +568,9 @@ static int pppoe_create(struct net *net, struct socket *sock, int kern)
> sk->sk_family = PF_PPPOX;
> sk->sk_protocol = PX_PROTO_OE;
>
> + INIT_WORK(&pppox_sk(sk)->proto.pppoe.padt_work,
> + pppoe_unbind_sock_work);
> +
> return 0;
> }
>
> @@ -632,8 +635,6 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
>
> lock_sock(sk);
>
> - INIT_WORK(&po->proto.pppoe.padt_work, pppoe_unbind_sock_work);
> -
> error = -EINVAL;
> if (sp->sa_protocol != PX_PROTO_OE)
> goto end;
> @@ -663,8 +664,6 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
> po->pppoe_dev = NULL;
> }
>
> - memset(sk_pppox(po) + 1, 0,
> - sizeof(struct pppox_sock) - sizeof(struct sock));
> sk->sk_state = PPPOX_NONE;
> }
>
Finally, I'm going to send something similar to -net and keep the rest
of pppoe_connect() modifications for net-next. This will ease
backporting to -stable.
* Re: Kernel 4.1.12 crash
2015-12-03 15:35 ` Guillaume Nault
@ 2015-12-03 21:09 ` Andrew
0 siblings, 0 replies; 14+ messages in thread
From: Andrew @ 2015-12-03 21:09 UTC (permalink / raw)
To: Guillaume Nault; +Cc: Alexander Duyck, netdev, Simon Farnsworth
Hi.
Thanks, I'll rebuild the kernel with your patch "pppoe: fix memory
corruption in padt work structure", try to check it in the test env, and try
to update the PPPoE servers.
03.12.2015 17:35, Guillaume Nault wrote:
> On Wed, Dec 02, 2015 at 06:23:35PM +0100, Guillaume Nault wrote:
>> You can try the following. It's not yet a proper fix as there are still
>> a few things that bug me in pppoe_connect().
>>
>> ---
>> diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
>> index 5e0b432..865b74d 100644
>> --- a/drivers/net/ppp/pppoe.c
>> +++ b/drivers/net/ppp/pppoe.c
>> @@ -568,6 +568,9 @@ static int pppoe_create(struct net *net, struct socket *sock, int kern)
>> sk->sk_family = PF_PPPOX;
>> sk->sk_protocol = PX_PROTO_OE;
>>
>> + INIT_WORK(&pppox_sk(sk)->proto.pppoe.padt_work,
>> + pppoe_unbind_sock_work);
>> +
>> return 0;
>> }
>>
>> @@ -632,8 +635,6 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
>>
>> lock_sock(sk);
>>
>> - INIT_WORK(&po->proto.pppoe.padt_work, pppoe_unbind_sock_work);
>> -
>> error = -EINVAL;
>> if (sp->sa_protocol != PX_PROTO_OE)
>> goto end;
>> @@ -663,8 +664,6 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
>> po->pppoe_dev = NULL;
>> }
>>
>> - memset(sk_pppox(po) + 1, 0,
>> - sizeof(struct pppox_sock) - sizeof(struct sock));
>> sk->sk_state = PPPOX_NONE;
>> }
>>
> Finally, I'm going to send something similar to -net and keep the rest
> of pppoe_connect() modifications for net-next. This will ease
> backporting to -stable.
Thread overview: 14+ messages
2015-11-20 13:58 Kernel 4.1.12 crash Andrew
2015-11-20 23:13 ` Alexander Duyck
2015-11-21 8:16 ` Andrew
2015-11-22 5:17 ` Alexander Duyck
2015-11-22 10:45 ` Andrew
2015-11-24 22:59 ` Andrew
2015-11-25 9:35 ` Andrew
2015-11-25 14:10 ` Guillaume Nault
[not found] ` <5655CCAE.6000300@seti.kr.ua>
2015-11-26 16:44 ` Guillaume Nault
[not found] ` <565B7699.8030105@seti.kr.ua>
2015-11-30 15:03 ` Guillaume Nault
2015-11-30 20:42 ` Guillaume Nault
2015-12-02 17:23 ` Guillaume Nault
2015-12-03 15:35 ` Guillaume Nault
2015-12-03 21:09 ` Andrew