v3.18-RT

* v3.18-RT
@ 2016-05-31 18:42 David Hauck
  2016-06-03 16:37 ` v3.18-RT Sebastian Andrzej Siewior
  0 siblings, 1 reply; 11+ messages in thread
From: David Hauck @ 2016-05-31 18:42 UTC (permalink / raw)
  To: linux-rt-users@vger.kernel.org

Hi,

We've been working to integrate/support a new system board and have noticed boot-up stalls running 3.18.29-rt30 (in particular when the kernel is configured with PREEMPT_RT_FULL - as diagnostics markers, PREEMPT_RT_RTB and non-RT appear to be fine). The boot-up stalls appear fairly frequently - approximately once every 8-10 cold boots - and occur very early in the boot process. The SHB board houses dual Xeon E5-2680 v3 parts (i.e., 2x12 cores for a combined 24 cores).

My question is regarding the v3.18 tip (currently v3.18.34) and whether an RT release for this version might be worthwhile to try. We'd consider moving to a v4 test branch, but ultimately we'd like to test a new v3.18 in order to determine impact in the field. I hesitate to ask this but am wondering if anyone might have some feedback on the time frame for an RT refresh on the v3.18 tip?

Thanks in advance,
-David


* Re: v3.18-RT
  2016-05-31 18:42 v3.18-RT David Hauck
@ 2016-06-03 16:37 ` Sebastian Andrzej Siewior
  2016-06-03 17:15   ` v3.18-RT David Hauck
  0 siblings, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-06-03 16:37 UTC (permalink / raw)
  To: David Hauck; +Cc: linux-rt-users@vger.kernel.org

* David Hauck | 2016-05-31 18:42:08 [+0000]:

>Hi,
Hi,

>We've been working to integrate/support a new system board and have noticed boot-up stalls running 3.18.29-rt30 (in particular when the kernel is configured with PREEMPT_RT_FULL - as diagnostics markers, PREEMPT_RT_RTB and non-RT appear to be fine). The boot-up stalls appear fairly frequently - approximately once every 8-10 cold boots - and occur very early in the boot process. The SHB board houses dual Xeon E5-2680 v3 parts (i.e., 2x12 cores for a combined 24 cores).
>
>My question is regarding the v3.18 tip (currently v3.18.34) and whether an RT release for this version might be worthwhile to try. We'd consider moving to a v4 test branch, but ultimately we'd like to test a new v3.18 in order to determine impact in the field. I hesitate to ask this but am wondering if anyone might have some feedback on the time frame for an RT refresh on the v3.18 tip?

I am not aware of any lockup on the v3.18-RT tree. I just tried a few
boot-ups on two machines and it looks good. I currently don't have
access to anything with more than 4 cores.

>Thanks in advance,
>-David

Sebastian


* RE: v3.18-RT
  2016-06-03 16:37 ` v3.18-RT Sebastian Andrzej Siewior
@ 2016-06-03 17:15   ` David Hauck
  2016-06-06  7:01     ` v3.18-RT Sebastian Andrzej Siewior
  0 siblings, 1 reply; 11+ messages in thread
From: David Hauck @ 2016-06-03 17:15 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-rt-users@vger.kernel.org

Hi Sebastian,
 
On Fri, 3 Jun 2016 at 09:38:00, Sebastian Andrzej Siewior wrote:
>> We've been working to integrate/support a new system board and have
>> noticed boot-up stalls running 3.18.29-rt30 (in particular when the
>> kernel is configured with PREEMPT_RT_FULL - as diagnostics markers,
>> PREEMPT_RT_RTB and non-RT appear to be fine). The boot-up stalls appear
>> fairly frequently - approximately once every 8-10 cold boots - and
>> occur very early in the boot process. The SHB board houses dual Xeon
>> E5-2680 v3 parts (i.e., 2x12 cores for a combined 24 cores).
>>
>> My question is regarding the v3.18 tip (currently v3.18.34) and
>> whether an RT release for this version might be worthwhile to try.
>> We'd consider moving to a v4 test branch, but ultimately we'd like to
>> test a new v3.18 in order to determine impact in the field. I hesitate
>> to ask this but am wondering if anyone might have some feedback on the
>> time frame for an RT refresh on the v3.18 tip?
>
> I am not aware of any lockup on the v3.18-RT tree. I just tried a few
> boot-ups on two machines and it looks good. I currently don't have
> access to anything with more than 4 cores.

Thx. We've done further testing and see that v3.18.9 does not suffer the same problem.

I also have some dump information (all "unable to handle kernel paging request") and was wondering what the best way to pass this along to the list might be? Would a compressed archive of the (4) log files be OK to send along?

-David
 
>> Thanks in advance,
>> -David
> 
> Sebastian


* Re: v3.18-RT
  2016-06-03 17:15   ` v3.18-RT David Hauck
@ 2016-06-06  7:01     ` Sebastian Andrzej Siewior
  2016-06-06 17:45       ` v3.18-RT David Hauck
  0 siblings, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-06-06  7:01 UTC (permalink / raw)
  To: David Hauck; +Cc: linux-rt-users@vger.kernel.org

On 06/03/2016 07:15 PM, David Hauck wrote:
> Hi Sebastian,

Hi David,

>  
> On Fri, 3 Jun 2016 at 09:38:00, Sebastian Andrzej Siewior wrote:
>> I am not aware of any lockup on the v3.18-RT tree. I just tried a few boot-ups on two machines and it looks good.
>> I currently don't have access to anything with more than 4 cores.
> 
> Thx. We've done further testing and see that v3.18.9 does not suffer the same problem.
> 
> I also have some dump information (all "unable to handle kernel paging request") and was wondering what the best way to pass this along to the list might be? Would a compressed archive of the (4) log files be OK to send along?

That "unable to handle kernel paging request" shouldn't be much. Please
send it to the list. The first BUG backtrace is the important one.
Also, since you say that the v3.18.9-based RT tree worked, could you
please try v3.18.13-rt10? If that works as well, you could use the git tree

  https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/

and start a bisect between v3.18.13-rt10 and v3.18.29-rt30.

> -David
>  
>>> Thanks in advance,
>>> -David

Sebastian



* RE: v3.18-RT
  2016-06-06  7:01     ` v3.18-RT Sebastian Andrzej Siewior
@ 2016-06-06 17:45       ` David Hauck
  2016-06-07  8:36         ` v3.18-RT Sebastian Andrzej Siewior
  0 siblings, 1 reply; 11+ messages in thread
From: David Hauck @ 2016-06-06 17:45 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-rt-users@vger.kernel.org

Hi Sebastian,
 
On Monday, June 06, 2016 12:02 AM, Sebastian Andrzej Siewior wrote:
> On 06/03/2016 07:15 PM, David Hauck wrote:
> Hi David,
> 
>> On Fri, 3 Jun 2016 at 09:38:00, Sebastian Andrzej Siewior wrote:
>>> I am not aware of any lockup on the v3.18-RT tree. I just tried a few
>>> boot-ups on two machines and it looks good. I currently don't have
>>> access to anything with more than 4 cores.
>> 
>> Thx. We've done further testing and see that v3.18.9 does not suffer
>> the same problem.
>> 
>> I also have some dump information (all "unable to handle kernel
>> paging request") and was wondering what the best way to pass this
>> along to the list might be? Would a compressed archive of the (4) log
>> files be OK to send along?
> 
> That "unable to handle kernel paging request" shouldn't be much.
> Please send it to the list. The first BUG backtrace is the important one.

Thx, here's one - hope this might be helpful:

[    1.352165] BUG: unable to handle kernel paging request at a93c2560
[    1.352172] IP: [<c107c248>] can_migrate_task+0x58/0x220
[    1.352182] *pde = 00000000
[    1.352187] Oops: 0000 [#1] PREEMPT SMP
[    1.352194] Modules linked in:
[    1.352198] CPU: 5 PID: 238 Comm: kthreadd Not tainted 3.18.29-rt30 #2
[    1.352201] Hardware name: Default string Default string/HEP8225, BIOS HEPHF107 05/20/2016
[    1.352205] task: db7d1d40 ti: db27e000 task.ti: db27e000
[    1.352208] EIP: 0060:[<c107c248>] EFLAGS: 00010086 CPU: 5
[    1.352212] EIP is at can_migrate_task+0x58/0x220
[    1.352215] EAX: 00000005 EBX: db27fe10 ECX: 1830b404 EDX: a93c2560
[    1.352217] ESI: dc508000 EDI: 00000002 EBP: db27fdbc ESP: db27fdb0
[    1.352220]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[    1.352223] CR0: 80050033 CR2: a93c2560 CR3: 01a3a000 CR4: 001407d0
[    1.352225] Stack:
[    1.352228]  dc50805c 00000367 dcb53b88 db27fe60 c1083f81 00000000 00000000 00000005
[    1.352245]  dcae5b30 dbe5d608 00000000 dbe5d600 c1a30660 c1a30660 00000000 00000082
[    1.352261]  dcb53660 db27fe60 dbe3a1c0 dcb53b88 dcb53660 dc508000 000003ef 0000019b
[    1.352277] Call Trace:
[    1.352284]  [<c1083f81>] load_balance+0x321/0x8b0
[    1.352293]  [<c1084ad8>] pick_next_task_fair+0x5c8/0xb10
[    1.352300]  [<c1072bd1>] ? dequeue_task+0x91/0xc0
[    1.352307]  [<c16bce2a>] __schedule+0xfa/0xae0
[    1.352313]  [<c16c06c7>] ? _raw_spin_unlock_irqrestore+0x17/0x50
[    1.352320]  [<c107699b>] ? try_to_wake_up+0x5b/0x550
[    1.352324]  [<c1076f5f>] ? wake_up_state+0xf/0x20
[    1.352330]  [<c108b0cc>] ? __swait_wake_locked+0x3c/0x80
[    1.352336]  [<c1065930>] ? process_one_work+0x410/0x410
[    1.352342]  [<c16bd83b>] schedule+0x2b/0x90
[    1.352347]  [<c1069e73>] kthread+0x73/0xb0
[    1.352354]  [<c1060000>] ? SyS_olduname+0x100/0x180
[    1.352360]  [<c16c1081>] ret_from_kernel_thread+0x21/0x30
[    1.352365]  [<c1069e00>] ? kthread_worker_fn+0x160/0x160
[    1.352368] Code: 00 8b be 54 02 00 00 8d 96 60 02 00 00 85 ff 74 1a 8b 56 08 8b 4a 10 89 ca 83 e2 1f c1 e9 05 8d 14 95 24 d9 6c c1 c1 e1 02 29 ca <0f> a3 02 19 c0 85 c0 0f 85 c3 00 00 00 83 86 00 01 00 00 01 83
[    1.352485] EIP: [<c107c248>] can_migrate_task+0x58/0x220 SS:ESP 0068:db27fdb0
[    1.352493] CR2: 00000000a93c2560
[   71.711659] ---[ end trace 0000000000000001 ]---
[   71.711661] note: kthreadd[238] exited with preempt_count 2
[   71.711666] WARNING: CPU: 5 PID: 238 at kernel/smp.c:293 smp_call_function_single+0xb4/0xe0()
[   71.711667] Modules linked in:
[   71.711669] CPU: 5 PID: 238 Comm: kthreadd Tainted: G      D        3.18.29-rt30 #2
[   71.711669] Hardware name: Default string Default string/HEP8225, BIOS HEPHF107 05/20/2016
[   71.711671]  00000000 00000000 db27fb60 c16bb56f 00000000 db27fb94 c104f078 c185a058
[   71.711673]  00000005 000000ee c18504dc 00000125 c10bf314 00000125 c10bf314 ffffffff
[   71.711675]  00000005 c110fac0 db27fba4 c104f140 00000009 00000000 db27fbc4 c10bf314
[   71.711675] Call Trace:
[   71.711678]  [<c16bb56f>] dump_stack+0x46/0x5c
[   71.711680]  [<c104f078>] warn_slowpath_common+0x88/0xb0
[   71.711681]  [<c10bf314>] ? smp_call_function_single+0xb4/0xe0
[   71.711682]  [<c10bf314>] ? smp_call_function_single+0xb4/0xe0
[   71.711685]  [<c110fac0>] ? cpu_clock_event_add+0x20/0x20
[   71.711686]  [<c104f140>] warn_slowpath_null+0x20/0x30
[   71.711687]  [<c10bf314>] smp_call_function_single+0xb4/0xe0
[   71.711689]  [<c110fbc0>] ? perf_event_disable+0x90/0x90
[   71.711691]  [<c110e9fc>] task_function_call+0x3c/0x50
[   71.711692]  [<c1114fe0>] ? perf_cgroup_switch+0x1f0/0x1f0
[   71.711694]  [<c110fbdf>] perf_cgroup_exit+0x1f/0x30
[   71.711696]  [<c10cefd3>] cgroup_exit+0xb3/0x100
[   71.711698]  [<c105084a>] do_exit+0x32a/0x9c0
[   71.711699]  [<c16bab81>] ? printk+0x1c/0x1e
[   71.711702]  [<c1099d8b>] ? kmsg_dump+0xcb/0xd0
[   71.711704]  [<c1005eff>] oops_end+0x8f/0xd0
[   71.711707]  [<c1041430>] no_context+0xf0/0x230
[   71.711709]  [<c1041625>] __bad_area_nosemaphore+0xb5/0x150
[   71.711711]  [<c10834cd>] ? update_sd_lb_stats+0x12d/0x3d0
[   71.711713]  [<c10416d7>] bad_area_nosemaphore+0x17/0x20
[   71.711714]  [<c1041bbb>] __do_page_fault+0x9b/0x620
[   71.711716]  [<c10837a9>] ? find_busiest_group+0x39/0x4f0
[   71.711719]  [<c1042140>] ? __do_page_fault+0x620/0x620
[   71.711720]  [<c104214b>] do_page_fault+0xb/0x10
[   71.711721]  [<c16c1e3a>] error_code+0x5a/0x60
[   71.711723]  [<c1042140>] ? __do_page_fault+0x620/0x620
[   71.711725]  [<c107c248>] ? can_migrate_task+0x58/0x220
[   71.711726]  [<c1083f81>] load_balance+0x321/0x8b0
[   71.711729]  [<c1084ad8>] pick_next_task_fair+0x5c8/0xb10
[   71.711731]  [<c1072bd1>] ? dequeue_task+0x91/0xc0
[   71.711733]  [<c16bce2a>] __schedule+0xfa/0xae0
[   71.711734]  [<c16c06c7>] ? _raw_spin_unlock_irqrestore+0x17/0x50
[   71.711736]  [<c107699b>] ? try_to_wake_up+0x5b/0x550
[   71.711737]  [<c1076f5f>] ? wake_up_state+0xf/0x20
[   71.711738]  [<c108b0cc>] ? __swait_wake_locked+0x3c/0x80
[   71.711740]  [<c1065930>] ? process_one_work+0x410/0x410
[   71.711741]  [<c16bd83b>] schedule+0x2b/0x90
[   71.711743]  [<c1069e73>] kthread+0x73/0xb0
[   71.711744]  [<c1060000>] ? SyS_olduname+0x100/0x180
[   71.711746]  [<c16c1081>] ret_from_kernel_thread+0x21/0x30
[   71.711747]  [<c1069e00>] ? kthread_worker_fn+0x160/0x160
[   71.711748] ---[ end trace 0000000000000002 ]---

> Also, since you say that the v3.18.9-based RT tree worked, could you
> please try v3.18.13-rt10? If that works as well, you could use the git tree
> 
>   https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/
> and start a bisect between v3.18.13-rt10 and v3.18.29-rt30.

Great, thx, we'll get started on this this week.

Thanks again,
-David
 
>> -David
>> 
>>>> Thanks in advance,
>>>> -David
> 
> Sebastian


* Re: v3.18-RT
  2016-06-06 17:45       ` v3.18-RT David Hauck
@ 2016-06-07  8:36         ` Sebastian Andrzej Siewior
  2016-07-20 20:53           ` v3.18-RT Carol Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-06-07  8:36 UTC (permalink / raw)
  To: David Hauck; +Cc: linux-rt-users@vger.kernel.org

On 06/06/2016 07:45 PM, David Hauck wrote:
> Hi Sebastian,

Hi,

> Thx, here's one - hope this might be helpful:
> 
> [    1.352165] BUG: unable to handle kernel 
> [    1.352167] paging request
> [    1.352169]  at a93c2560
> [    1.352172] IP:
> [    1.352178]  [<c107c248>] can_migrate_task+0x58/0x220
> [    1.352182] *pde = 00000000 

Most of it should be on one line.

This comes down to
  if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {

in can_migrate_task(), and the crash happens because the pointer
returned by tsk_cpus_allowed() is invalid. Currently nothing rings a
bell.
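
For readers following along, here is a minimal user-space sketch of what
the check quoted above does. It is an illustration under stated
assumptions, not the kernel's code: every model_* name and MODEL_NR_CPUS
are hypothetical, and the real cpumask_test_cpu() and tsk_cpus_allowed()
are defined in the kernel sources. The point it tries to show is that the
bit test reads through the mask pointer, so an invalid pointer would
fault at exactly this step.

```c
/* User-space model only; none of these names exist in the kernel. */
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 24u /* 2x12 cores, as on the board in this thread */
#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_LONGS \
	((MODEL_NR_CPUS + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG)

/* One bit per CPU, packed into an array of unsigned longs. */
struct model_cpumask {
	unsigned long bits[MODEL_LONGS];
};

/* Hypothetical stand-in for a task; only the affinity mask is modelled. */
struct model_task {
	struct model_cpumask cpus_allowed;
};

static void model_cpumask_set_cpu(unsigned int cpu, struct model_cpumask *mask)
{
	assert(mask != NULL && cpu < MODEL_NR_CPUS);
	mask->bits[cpu / MODEL_BITS_PER_LONG] |=
		1UL << (cpu % MODEL_BITS_PER_LONG);
}

/* The read of mask->bits[] is where a bad 'mask' pointer would fault. */
static bool model_cpumask_test_cpu(unsigned int cpu,
				   const struct model_cpumask *mask)
{
	assert(mask != NULL && cpu < MODEL_NR_CPUS);
	return (mask->bits[cpu / MODEL_BITS_PER_LONG] >>
		(cpu % MODEL_BITS_PER_LONG)) & 1UL;
}

/* Models the "may this task run on dst_cpu?" question from the check. */
static bool model_can_run_on(const struct model_task *p, unsigned int dst_cpu)
{
	return model_cpumask_test_cpu(dst_cpu, &p->cpus_allowed);
}
```

In this model a task whose mask has only CPU 5 set would be accepted for
dst_cpu 5 and rejected for dst_cpu 2. The model cannot reproduce the
crash itself; it only shows which memory access depends on the pointer.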

> Great, thx, we'll get started on this this week.

okay.

> 
> Thanks again,
> -David

Sebastian


* RE: v3.18-RT
  2016-06-07  8:36         ` v3.18-RT Sebastian Andrzej Siewior
@ 2016-07-20 20:53           ` Carol Wong
  2016-07-29 16:19             ` v3.18-RT Sebastian Andrzej Siewior
  0 siblings, 1 reply; 11+ messages in thread
From: Carol Wong @ 2016-07-20 20:53 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users@vger.kernel.org, David Hauck, Preston Hauck

Hi Sebastian,

We finally traced the boot-up crash to the following patch in kernel/sched/core.c:

https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/commit/?h=v3.18-rt&id=62044e554f14547061afcfef7f0aceda43e28982

After reverting the two-line patch in 3.18.29-rt30, the crash no longer occurs on our dual Xeon (2x12 core) system.

Other observations:
- Does not reproduce on single processor (2 and 4 core) systems
- Reproduces under 3.18.27-rt27 and 3.18.36-rt38 on the dual Xeon
- Does not reproduce on 3.18.27-rt26 and earlier on the dual Xeon
- Reproduces more frequently on .29-rt30 (1 in 20 reboots) compared to .27-rt27 (1 in 100 reboots)
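
The reproduction rates above bear on how many clean boots are needed
before a kernel can be called good during a bisect. As a rough sketch,
assuming each cold boot fails independently with a fixed probability p
(an assumption for illustration, not something measured in this thread),
a bad kernel survives n boots unnoticed with probability (1 - p)^n. The
hypothetical helper below finds the smallest n that pushes this below a
chosen threshold alpha.

```c
#include <assert.h>

/*
 * Smallest n such that (1 - p)^n <= alpha, where p is the assumed
 * per-boot failure probability of a bad kernel and alpha is the
 * accepted chance of wrongly calling that kernel good.
 * Hypothetical helper for illustration only.
 */
static unsigned int clean_boots_needed(double p, double alpha)
{
	double miss = 1.0; /* probability that all boots so far were clean */
	unsigned int n = 0;

	assert(p > 0.0 && p < 1.0);
	assert(alpha > 0.0 && alpha < 1.0);

	while (miss > alpha) {
		miss *= 1.0 - p;
		n++;
	}
	return n;
}
```

With alpha = 0.05, this gives 59 clean boots for the 1-in-20 rate and
299 for the 1-in-100 rate, which suggests why testing the older,
rarer-failing kernels takes so much longer.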

So far we've not observed any side effects after reverting this patch.

I understand that a high core count system may not be easy to come by, so if there are diagnostics or patches you would like to try on the dual Xeon system, we can assist with that.

Cheers,
Carol Wong
NetAcquire Corporation

> -----Original Message-----
> From: linux-rt-users-owner@vger.kernel.org [mailto:linux-rt-users-
> owner@vger.kernel.org] On Behalf Of Sebastian Andrzej Siewior
> Sent: Tuesday, June 07, 2016 1:37 AM
> To: David Hauck
> Cc: linux-rt-users@vger.kernel.org
> Subject: Re: v3.18-RT
> 
> On 06/06/2016 07:45 PM, David Hauck wrote:
> > Hi Sebastian,
> 
> Hi,
> 
> > Thx, here's one - hope this might be helpful:
> >
> > [    1.352165] BUG: unable to handle kernel
> > [    1.352167] paging request
> > [    1.352169]  at a93c2560
> > [    1.352172] IP:
> > [    1.352178]  [<c107c248>] can_migrate_task+0x58/0x220
> > [    1.352182] *pde = 00000000
> 
> most of it should be one line.
> 
> this comes down to
>   if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
> 
> in can_migrate_task() and the crash happens because the pointer
> returned in tsk_cpus_allowed() is invalid. Currently nothing rings a
> bell.
> 
> > Great, thx, we'll get started on this this week.
> 
> okay.
> 
> >
> > Thanks again,
> > -David
> 
> Sebastian


* Re: v3.18-RT
  2016-07-20 20:53           ` v3.18-RT Carol Wong
@ 2016-07-29 16:19             ` Sebastian Andrzej Siewior
  2016-08-19  0:41               ` v3.18-RT Carol Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-07-29 16:19 UTC (permalink / raw)
  To: Carol Wong; +Cc: linux-rt-users@vger.kernel.org, David Hauck, Preston Hauck

* Carol Wong | 2016-07-20 20:53:21 [+0000]:

>Hi Sebastian,
Hi Carol,

>We finally traced the boot-up crash to the following patch in kernel/sched/core.c:
>
>https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/commit/?h=v3.18-rt&id=62044e554f14547061afcfef7f0aceda43e28982
>
>After reverting the two-line patch in 3.18.29-rt30, the crash no longer occurs on our dual Xeon (2x12 core) system.
>
>Other observations:
>- Does not reproduce on single processor (2 and 4 core) systems
>- Reproduces under 3.18.27-rt27 and 3.18.36-rt38 on the dual Xeon
>- Does not reproduce on 3.18.27-rt26 and earlier on the dual Xeon
>- Reproduces more frequently on .29-rt30 (1 in 20 reboots) compared to .27-rt27 (1 in 100 reboots)
>
>So far we've not observed any side effects after reverting this patch.

This was part of CPU hotplug fixups. Lockdep might be broken without it
but I am not sure whether that is the case most of the time or only
during hotplug.

>I understand that a high core count system may not be easy to come by, so if there are diagnostics or patches you would like to try on the dual Xeon system, we can assist with that.

With that patch, migrate_disable() skips the whole preempt-lazy +
pin-cpu code if called with IRQs off. Since interrupts are disabled, we
can't migrate to another CPU, so it is a possible optimisation.
It only makes a difference if the migrate_disable() + migrate_enable()
calls are not in balance. The commit
  https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/commit/?h=v3.18-rt&id=8d51d3a296b6ec4aebd0d6d7e1b7162cd9bf6662
is one example where I fixed the imbalance.
Do you get additional backtraces with CONFIG_SCHED_DEBUG enabled?

There is one thing the debug code does not cover, so could you please
add this chunk?

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 140ee06079b6..1f8613f77598 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3229,6 +3229,7 @@ void migrate_enable(void)
 
 	if (in_atomic() || irqs_disabled()) {
 #ifdef CONFIG_SCHED_DEBUG
+		WARN_ON_ONCE(p->migrate_disable_atomic <= 0);
 		p->migrate_disable_atomic--;
 #endif
 		return;
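
To illustrate what the added check could catch, here is a small
user-space model of the fast path and its debug counter. It is a sketch
under assumptions, not the kernel implementation: the model_* names, the
boolean argument standing in for in_atomic() || irqs_disabled(), and the
warning counter standing in for WARN_ON_ONCE() are all hypothetical, and
the slow path is reduced to a nesting counter.

```c
/* User-space model only; it does not pin CPUs or touch preemption. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_task {
	int migrate_disable;        /* slow-path nesting depth */
	int migrate_disable_atomic; /* fast-path nesting depth (debug) */
	int imbalance_warnings;     /* stands in for WARN_ON_ONCE() */
};

static void model_migrate_disable(struct model_task *p, bool atomic_or_irqs_off)
{
	assert(p != NULL);
	if (atomic_or_irqs_off) {
		/* Fast path: only the debug counter changes. */
		p->migrate_disable_atomic++;
		return;
	}
	p->migrate_disable++;
}

static void model_migrate_enable(struct model_task *p, bool atomic_or_irqs_off)
{
	assert(p != NULL);
	if (atomic_or_irqs_off) {
		/* Mirrors the check added by the chunk above. */
		if (p->migrate_disable_atomic <= 0)
			p->imbalance_warnings++;
		p->migrate_disable_atomic--;
		return;
	}
	/* Assumed counterpart check for the slow path. */
	if (p->migrate_disable <= 0)
		p->imbalance_warnings++;
	p->migrate_disable--;
}
```

In this model, a disable taken on the slow path (IRQs on) followed by an
enable on the fast path (IRQs off) leaves migrate_disable at 1, drives
migrate_disable_atomic to -1, and records one warning. Whether that
pattern actually occurs on the affected system is not established here.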

>Cheers,
>Carol Wong
>NetAcquire Corporation

Sebastian


* RE: v3.18-RT
  2016-07-29 16:19             ` v3.18-RT Sebastian Andrzej Siewior
@ 2016-08-19  0:41               ` Carol Wong
  2016-09-08 13:45                 ` v3.18-RT Sebastian Andrzej Siewior
  0 siblings, 1 reply; 11+ messages in thread
From: Carol Wong @ 2016-08-19  0:41 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users@vger.kernel.org, David Hauck, Preston Hauck

Hi Sebastian,

Were you able to gain any insight from the traces?

If we were to proceed with reverting the kernel/sched/core.c patch in our build of 3.18.29-rt30, would the addition of the WARN_ON_ONCE(p->migrate_disable_atomic <= 0) debug check that you recommended (2016/07/29) be sufficient for detecting imbalances? We would perform extended testing on multiple systems to determine the effects of reverting the patch.

Cheers,
Carol

> -----Original Message-----
> From: Carol Wong
> Sent: Wednesday, August 03, 2016 6:32 PM
> To: 'Sebastian Andrzej Siewior'
> Cc: linux-rt-users@vger.kernel.org; David Hauck; Preston Hauck
> Subject: RE: v3.18-RT
> 
> Hi Sebastian,
> 
> I made the suggested change to sched/core.c and verified that
> CONFIG_SCHED_DEBUG=y. I reproduced the crash 3 times and captured the
> attached traces.
> 
> Thanks,
> Carol
> 
> > -----Original Message-----
> > From: Sebastian Andrzej Siewior [mailto:bigeasy@linutronix.de]
> > Sent: Friday, July 29, 2016 9:20 AM
> > To: Carol Wong
> > Cc: linux-rt-users@vger.kernel.org; David Hauck; Preston Hauck
> > Subject: Re: v3.18-RT
> >
> > * Carol Wong | 2016-07-20 20:53:21 [+0000]:
> >
> > >Hi Sebastian,
> > Hi Carol,
> >
> > >We finally traced the boot-up crash to the following patch in kernel/sched/core.c:
> > >
> > >https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/commit/?h=v3.18-rt&id=62044e554f14547061afcfef7f0aceda43e28982
> > >
> > >After reverting the two-line patch in 3.18.29-rt30, the crash no longer occurs on our dual Xeon (2x12 core) system.
> > >
> > >Other observations:
> > >- Does not reproduce on single processor (2 and 4 core) systems
> > >- Reproduces under 3.18.27-rt27 and 3.18.36-rt38 on the dual Xeon
> > >- Does not reproduce on 3.18.27-rt26 and earlier on the dual Xeon
> > >- Reproduces more frequently on .29-rt30 (1 in 20 reboots) compared to .27-rt27 (1 in 100 reboots)
> > >
> > >So far we've not observed any side effects after reverting this patch.
> >
> > This was part of CPU hotplug fixups. Lockdep might be broken without it
> > but I am not sure whether that is the case most of the time or only
> > during hotplug.
> >
> > >I understand that a high core count system may not be easy to come by, so if there are diagnostics or patches you would like to try on the dual Xeon system, we can assist with that.
> >
> > With that patch, migrate_disable() skips the whole preempt-lazy +
> > pin-cpu code if called with IRQs off. Since interrupts are disabled, we
> > can't migrate to another CPU, so it is a possible optimisation.
> > It only makes a difference if the migrate_disable() + migrate_enable()
> > calls are not in balance. The commit
> >   https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/commit/?h=v3.18-rt&id=8d51d3a296b6ec4aebd0d6d7e1b7162cd9bf6662
> > is one example where I fixed the imbalance.
> > Do you get additional backtraces with CONFIG_SCHED_DEBUG enabled?
> >
> > There is one thing the debug code does not cover, so could you please
> > add this chunk?
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 140ee06079b6..1f8613f77598 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3229,6 +3229,7 @@ void migrate_enable(void)
> >
> >  	if (in_atomic() || irqs_disabled()) {
> >  #ifdef CONFIG_SCHED_DEBUG
> > +		WARN_ON_ONCE(p->migrate_disable_atomic <= 0);
> >  		p->migrate_disable_atomic--;
> >  #endif
> >  		return;
> >
> > >Cheers,
> > >Carol Wong
> > >NetAcquire Corporation
> >
> > Sebastian


* Re: v3.18-RT
  2016-08-19  0:41               ` v3.18-RT Carol Wong
@ 2016-09-08 13:45                 ` Sebastian Andrzej Siewior
  2016-09-20 18:27                   ` v3.18-RT Carol Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-09-08 13:45 UTC (permalink / raw)
  To: Carol Wong; +Cc: linux-rt-users@vger.kernel.org, David Hauck, Preston Hauck

On 2016-08-19 00:41:46 [+0000], Carol Wong wrote:
> Hi Sebastian,
Hi Carol,

> Were you able to gain any insight from the traces?

Not really. T00 shows a fault in
[    2.756284] BUG: unable to handle kernel NULL pointer dereference at 00000004
[    2.756289] IP: [<c11653e7>] kmem_cache_alloc+0x87/0x230
from ida_pre_get() / create_worker(). That is quite late so I have no
idea why that would happen.
The other two are not really helpful.

> If we were to proceed with reverting the kernel/sched/core.c patch in our build of 3.18.29-rt30, would the addition of the WARN_ON_ONCE(p->migrate_disable_atomic <= 0) debug check that you recommended (2016/07/29) be sufficient for detecting imbalances? We would perform extended testing on multiple systems to determine the effects of reverting the patch.

One thing on the bisect. The git tree has the patches in this order:
 (1) kernel: migrate_disable() do fastpath in atomic & irqs-off
 (2) kernel: softirq: unlock with irqs on

but you need to apply patch #2 before #1. So if you bisect and you hit
warnings due to #1, please note that you need to apply #2.

T01 and T02 probably show the same issue but there are too many warnings
coming in parallel. If this comes from the sched patch due to a #1/#2
mix-up, then don't bisect here or have them both applied.
The call path itself does look special as it would violate the rule of
atomic locking / unlocking (as it was fixed in #2 for instance).
At this point I assume that your bisect went wrong due to patch #1/#2.

> Cheers,
> Carol
> 
Sebastian


* RE: v3.18-RT
  2016-09-08 13:45                 ` v3.18-RT Sebastian Andrzej Siewior
@ 2016-09-20 18:27                   ` Carol Wong
  0 siblings, 0 replies; 11+ messages in thread
From: Carol Wong @ 2016-09-20 18:27 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users@vger.kernel.org, David Hauck, Preston Hauck

Hi Sebastian,

You wrote:
> One thing on the bisect. The git tree has the patches in this order:
>  (1) kernel: migrate_disable() do fastpath in atomic & irqs-off
>  (2) kernel: softirq: unlock with irqs on
> 
> but you need to apply patch #2 before #1. So if you bisect and you hit
> warnings due to #1, please note that you need to apply #2.
> 
> T01 and T02 probably show the same issue but there are too many
> warnings coming in parallel. If this comes from the sched patch due
> to a #1/#2 mix-up, then don't bisect here or have them both applied.
> The call path itself does look special as it would violate the rule
> of atomic locking / unlocking (as it was fixed in #2 for instance).
> At this point I assume that your bisect went wrong due to patch
> #1/#2.

The traces were produced using the original 3.18.29-rt30 kernel (with all patches) plus the addition of WARN_ON_ONCE(p->migrate_disable_atomic <= 0) in migrate_enable() and CONFIG_SCHED_DEBUG=y.

When I revert only patch #1 from the 3.18.29-rt30 kernel, the kernel never crashes. I've been performing long-running tests on a dual Xeon system and a quad-core i7 system with patch #1 reverted.

Cheers,
Carol

