public inbox for trinity@vger.kernel.org
 help / color / mirror / Atom feed
* BUG: soft lockup for a user mode linux image
@ 2013-10-02 18:30 Toralf Förster
  2013-10-02 18:31 ` Toralf Förster
  2013-10-02 19:55 ` [uml-devel] " Richard Weinberger
  0 siblings, 2 replies; 15+ messages in thread
From: Toralf Förster @ 2013-10-02 18:30 UTC (permalink / raw)
  To: trinity; +Cc: UML devel

Running trinity (1 process, no victim files, just "$>trinity -C1) for a longer time 
within a 32 bit user mode linux image with a recent git kernel (host: 3.11.3 guest 3.12-rc3-g...)
yields into this konsole message :

 * Starting local
net.core.warnings = 0                                                                         [ ok ]
BUG: soft lockup - CPU#0 stuck for 23s! [trinity-child0:2031]


and at the host t1 of the "linux"-processes eats all CPU cycles at 1 CPU core.
2 subsequent made back traces made with

$> sudo gdb /home/tfoerste/devel/linux/linux 28144 -n -batch -ex bt

shows nearly a similar position around __get_user_pages() - both are attached.

I'm not surprised that trinity harms a systems - I'm just wondering whether this particular picture is
expected or if it points to an issue.


FWIW the last lines of trinity log were :


[2031] [94] setsid() = 2031
[2031] [95] setresgid(rgid=0xffff33e3, egid=0xffffff93, sgid=0x22000040) = -1 (Operation not permitted)
[2031] [96] vmsplice(fd=5, iov=0x85501e0, nr_segs=300, flags=9) = 0x3000
[2031] [97] setresuid(ruid=0x80549193, euid=0xc61041e0, suid=0xff19b6fa) = -1 (Operation not permitted)
[2031] [98] setpriority(which=0xff010000, who=0xf3737373, niceval=0x8088960c) = -1 (Invalid argument)
[2031] [99] socketcall(call=1, args=0x8550200) = -1 (Address family not supported by protocol)
[2031] [100] access(filename="�", mode=2017) = -1 (Invalid argument)
[2031] [101] getgroups(gidsetsize=0, grouplist=0x80d0000[page_rand]) = 3
[2031] [102] msync(start=0xc0100220, len=0, flags=3) = -1 (Invalid argument)
[2031] [103] sigpending(set=0x40025000) = 0
[2031] [104] signalfd4(ufd=383, user_mask=1, sizemask=0xa4200000, flags=0x80800) = -1 (Invalid argument)
[2031] [105] sendfile(out_fd=383, in_fd=382, offset=0, count=4096) = -1 (Invalid argument)
[2031] [106] fanotify_mark(fanotify_fd=382, flags=5, mask=0x8000023, dfd=382, pathname="/proc/1092/task/1092/fdinfo/68") = -1 (Invalid argument)
[2031] [107] wait4(upid=1, stat_addr=4, options=0xd761979b, ru=8) = -1 (Invalid argument)
[2031] [108] sigpending(set=0x80ca000[page_zeros]) = 0
[2031] [109] setresuid(ruid=0xefffd6fc, euid=0x1bf4c92f, suid=0xffff2e33) = -1 (Operation not permitted)
[2031] [110] munlock(addr=0x40025000, len=34) = 0
[2031] [111] timer_delete(timer_id=0xffffffdc) = -1 (Invalid argument)
[2031] [112] sched_get_priority_max(policy=0x10000040) = -1 (Invalid argument)
[2031] [113] syslog(type=0xc1000000, buf=1, len=0x82a5) = -1 (Operation not permitted)
[2031] [114] setpriority(which=0xc4c806c6, who=0xffffff01, niceval=0xffff0682) = -1 (Invalid argument)
[2031] [115] getgroups16(gidsetsize=0xfffe, grouplist=1) = -1 (Bad address)
[2031] [116] rename(oldname=4, newname=8) = -1 (Bad address)
[2031] [117] inotify_init() = 654
[2031] [118] getgid() = 100
[2031] [119] fstatat64(dfd=382, filename="/sys/devices/virtual/net/sit0/duplex", statbuf=0, flag=0xb545d727) = -1 (Invalid argument)
[2031] [120] unlinkat(dfd=382, pathname="/proc/sys/net/ipv4/neigh/default/retrans_time", flag=0xc00ef76) = -1 (Invalid argument)
[2031] [121] timerfd_create(clockid=0, flags=0) = 655
[2031] [122] munlock(addr=4, len=0x3fff) = -1 (Cannot allocate memory)
[2031] [123] fremovexattr(fd=382, name=0) = -1 (Bad address)
[2031] [124] sched_get_priority_min(policy=0xff58bfef) = -1 (Invalid argument)
[2031] [125] mq_timedreceive(mqdes=397, u_msg_ptr=4, msg_len=5245, u_msg_prio=0xc0100220, u_abs_timeout=0xc0100220) = -1 (Bad address)
[2031] [126] chdir(filename="/proc/116/net/ptype") = -1 (Not a directory)
[2031] [127] ssetmask(newmask=0x88000092) = 0
[2031] [128] statfs(pathname="/proc/6/mounts", buf=0) = -1 (Bad address)
[2031] [129] fchown16(fd=397, user=104, group=0x94100000) = -1 (Operation not permitted)
[2031] [130] fchdir(fd=397) = -1 (Not a directory)
[2031] [131] mkdir(pathname="/proc/1092/task/1092/fdinfo/316", mode=525) = -1 (File exists)
[2031] [132] fsetxattr(fd=386, name=0x856f158, value=0x8571160, size=0, flags=0) = -1 (Numerical result out of range)
[2031] [133] io_setup(nr_events=4095, ctxp=0x40266000) ^CKilled by signal 2.

-- 
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* BUG: soft lockup for a user mode linux image
  2013-10-02 18:30 BUG: soft lockup for a user mode linux image Toralf Förster
@ 2013-10-02 18:31 ` Toralf Förster
  2013-10-02 19:55 ` [uml-devel] " Richard Weinberger
  1 sibling, 0 replies; 15+ messages in thread
From: Toralf Förster @ 2013-10-02 18:31 UTC (permalink / raw)
  To: trinity; +Cc: UML devel

[-- Attachment #1: Type: text/plain, Size: 4232 bytes --]

(resend - now with attached files)

Running trinity (1 process, no victim files, just "$>trinity -C1) for a longer time 
within a 32 bit user mode linux image with a recent git kernel (host: 3.11.3 guest 3.12-rc3-g...)
yields into this konsole message :

 * Starting local
net.core.warnings = 0                                                                         [ ok ]
BUG: soft lockup - CPU#0 stuck for 23s! [trinity-child0:2031]


and at the host t1 of the "linux"-processes eats all CPU cycles at 1 CPU core.
2 subsequent made back traces made with

$> sudo gdb /home/tfoerste/devel/linux/linux 28144 -n -batch -ex bt

shows nearly a similar position around __get_user_pages() - both are attached.

I'm not surprised that trinity harms a systems - I'm just wondering whether this particular picture is
expected or if it points to an issue.


FWIW the last lines of trinity log were :


[2031] [94] setsid() = 2031
[2031] [95] setresgid(rgid=0xffff33e3, egid=0xffffff93, sgid=0x22000040) = -1 (Operation not permitted)
[2031] [96] vmsplice(fd=5, iov=0x85501e0, nr_segs=300, flags=9) = 0x3000
[2031] [97] setresuid(ruid=0x80549193, euid=0xc61041e0, suid=0xff19b6fa) = -1 (Operation not permitted)
[2031] [98] setpriority(which=0xff010000, who=0xf3737373, niceval=0x8088960c) = -1 (Invalid argument)
[2031] [99] socketcall(call=1, args=0x8550200) = -1 (Address family not supported by protocol)
[2031] [100] access(filename="�", mode=2017) = -1 (Invalid argument)
[2031] [101] getgroups(gidsetsize=0, grouplist=0x80d0000[page_rand]) = 3
[2031] [102] msync(start=0xc0100220, len=0, flags=3) = -1 (Invalid argument)
[2031] [103] sigpending(set=0x40025000) = 0
[2031] [104] signalfd4(ufd=383, user_mask=1, sizemask=0xa4200000, flags=0x80800) = -1 (Invalid argument)
[2031] [105] sendfile(out_fd=383, in_fd=382, offset=0, count=4096) = -1 (Invalid argument)
[2031] [106] fanotify_mark(fanotify_fd=382, flags=5, mask=0x8000023, dfd=382, pathname="/proc/1092/task/1092/fdinfo/68") = -1 (Invalid argument)
[2031] [107] wait4(upid=1, stat_addr=4, options=0xd761979b, ru=8) = -1 (Invalid argument)
[2031] [108] sigpending(set=0x80ca000[page_zeros]) = 0
[2031] [109] setresuid(ruid=0xefffd6fc, euid=0x1bf4c92f, suid=0xffff2e33) = -1 (Operation not permitted)
[2031] [110] munlock(addr=0x40025000, len=34) = 0
[2031] [111] timer_delete(timer_id=0xffffffdc) = -1 (Invalid argument)
[2031] [112] sched_get_priority_max(policy=0x10000040) = -1 (Invalid argument)
[2031] [113] syslog(type=0xc1000000, buf=1, len=0x82a5) = -1 (Operation not permitted)
[2031] [114] setpriority(which=0xc4c806c6, who=0xffffff01, niceval=0xffff0682) = -1 (Invalid argument)
[2031] [115] getgroups16(gidsetsize=0xfffe, grouplist=1) = -1 (Bad address)
[2031] [116] rename(oldname=4, newname=8) = -1 (Bad address)
[2031] [117] inotify_init() = 654
[2031] [118] getgid() = 100
[2031] [119] fstatat64(dfd=382, filename="/sys/devices/virtual/net/sit0/duplex", statbuf=0, flag=0xb545d727) = -1 (Invalid argument)
[2031] [120] unlinkat(dfd=382, pathname="/proc/sys/net/ipv4/neigh/default/retrans_time", flag=0xc00ef76) = -1 (Invalid argument)
[2031] [121] timerfd_create(clockid=0, flags=0) = 655
[2031] [122] munlock(addr=4, len=0x3fff) = -1 (Cannot allocate memory)
[2031] [123] fremovexattr(fd=382, name=0) = -1 (Bad address)
[2031] [124] sched_get_priority_min(policy=0xff58bfef) = -1 (Invalid argument)
[2031] [125] mq_timedreceive(mqdes=397, u_msg_ptr=4, msg_len=5245, u_msg_prio=0xc0100220, u_abs_timeout=0xc0100220) = -1 (Bad address)
[2031] [126] chdir(filename="/proc/116/net/ptype") = -1 (Not a directory)
[2031] [127] ssetmask(newmask=0x88000092) = 0
[2031] [128] statfs(pathname="/proc/6/mounts", buf=0) = -1 (Bad address)
[2031] [129] fchown16(fd=397, user=104, group=0x94100000) = -1 (Operation not permitted)
[2031] [130] fchdir(fd=397) = -1 (Not a directory)
[2031] [131] mkdir(pathname="/proc/1092/task/1092/fdinfo/316", mode=525) = -1 (File exists)
[2031] [132] fsetxattr(fd=386, name=0x856f158, value=0x8571160, size=0, flags=0) = -1 (Numerical result out of range)
[2031] [133] io_setup(nr_events=4095, ctxp=0x40266000) ^CKilled by signal 2.

-- 
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3



[-- Attachment #2: 1 --]
[-- Type: text/plain, Size: 4078 bytes --]

0x082949e4 in number (buf=0x434dfb21 "/K\b\225\036`\b\243^K\b\243^K\bD\373MC\210\247\t\b", end=0x434dfb22 "K\b\225\036`\b\243^K\b\243^K\bD\373MC\210\247\t\b", num=597699752268362946, spec=...) at lib/vsprintf.c:381
381	char *number(char *buf, char *end, unsigned long long num,
#0  0x082949e4 in number (buf=0x434dfb21 "/K\b\225\036`\b\243^K\b\243^K\bD\373MC\210\247\t\b", end=0x434dfb22 "K\b\225\036`\b\243^K\b\243^K\bD\373MC\210\247\t\b", num=597699752268362946, spec=...) at lib/vsprintf.c:381
#1  0x08295ef7 in vsnprintf (buf=0x434dfb1c "+0x52/K\b\225\036`\b\243^K\b\243^K\bD\373MC\210\247\t\b", size=2147483647, fmt=0x84b74c2 "", args=0x434dfad0 "\t\373MC") at lib/vsprintf.c:1740
#2  0x08296547 in sprintf (buf=0x434dfb22 "K\b\225\036`\b\243^K\b\243^K\bD\373MC\210\247\t\b", fmt=0x434dfb22 "K\b\225\036`\b\243^K\b\243^K\bD\373MC\210\247\t\b") at lib/vsprintf.c:1875
#3  0x080b76cf in __sprint_symbol (buffer=0x434dfb09 "io_schedule_timeout+0x52/K\b\225\036`\b\243^K\b\243^K\bD\373MC\210\247\t\b", address=3276667675, symbol_offset=1, add_offset=1) at kernel/kallsyms.c:374
#4  0x080b781a in sprint_symbol (buffer=0x434dfb09 "io_schedule_timeout+0x52/K\b\225\036`\b\243^K\b\243^K\bD\373MC\210\247\t\b", address=138503010) at kernel/kallsyms.c:395
#5  0x080b787f in __print_symbol (fmt=0x434dfb22 "K\b\225\036`\b\243^K\b\243^K\bD\373MC\210\247\t\b", address=1129184034) at kernel/kallsyms.c:440
#6  0x08078200 in print_symbol (fmt=<optimized out>, addr=<optimized out>) at include/linux/kallsyms.h:119
#7  print_context_stack (ebp=<optimized out>, tinfo=<optimized out>, stack=<optimized out>) at arch/x86/um/sysrq_32.c:54
#8  show_trace (task=0x434dfffd, stack=0x434dfc44) at arch/x86/um/sysrq_32.c:95
#9  0x0805fb6f in show_stack (task=0x0, esp=0x434dfc44) at arch/um/kernel/sysrq.c:65
#10 0x08413ae6 in __dump_stack () at lib/dump_stack.c:15
#11 dump_stack () at lib/dump_stack.c:60
#12 0x08414981 in schedule_timeout (timeout=-4028) at kernel/timer.c:1460
#13 0x08416362 in io_schedule_timeout (timeout=1129184034) at kernel/sched/core.c:3992
#14 0x08411c31 in balance_dirty_pages (pages_dirtied=17, mapping=<optimized out>) at mm/page-writeback.c:1519
#15 0x080d1ce4 in balance_dirty_pages_ratelimited (mapping=0x60) at mm/page-writeback.c:1653
#16 0x080e2d0c in __do_fault (mm=0x47f24400, vma=0x442095d8, address=1080451072, pmd=0x434dfb22, pgoff=1129184034, flags=1155011416, orig_pte=<incomplete type>) at mm/memory.c:3452
#17 0x080e5286 in do_nonlinear_fault (orig_pte=..., flags=<optimized out>, pmd=<optimized out>, address=<optimized out>, vma=<optimized out>, mm=<optimized out>, page_table=<optimized out>) at mm/memory.c:3518
#18 handle_pte_fault (flags=<optimized out>, pmd=<optimized out>, pte=<optimized out>, address=<optimized out>, vma=<optimized out>, mm=<optimized out>) at mm/memory.c:3717
#19 __handle_mm_fault (flags=<optimized out>, address=<optimized out>, vma=<optimized out>, mm=<optimized out>) at mm/memory.c:3845
#20 handle_mm_fault (mm=0x47f24400, vma=0x442095d8, address=1080451072, flags=1) at mm/memory.c:3868
#21 0x080e5a07 in __get_user_pages (tsk=0x46e23600, mm=0x47f24400, start=1080451072, nr_pages=65, gup_flags=519, pages=0x46f34d20, vmas=0x0, nonblocking=0x0) at mm/memory.c:1822
#22 0x080e5cc3 in get_user_pages (tsk=0x434dfb22, mm=0x434dfb22, start=1129184034, nr_pages=1129184034, write=1, force=0, pages=0xc34dfb1b, vmas=0x60) at mm/memory.c:2019
#23 0x08140cce in aio_setup_ring (ctx=<optimized out>) at fs/aio.c:317
#24 ioctx_alloc (nr_events=<optimized out>) at fs/aio.c:582
#25 SYSC_io_setup (ctxp=<optimized out>, nr_events=<optimized out>) at fs/aio.c:1100
#26 SyS_io_setup (nr_events=4095, ctxp=1076256768) at fs/aio.c:1083
#27 0x080619c2 in handle_syscall (r=0x46e237d4) at arch/um/kernel/skas/syscall.c:35
#28 0x08073f2d in handle_trap (local_using_sysemu=<optimized out>, regs=<optimized out>, pid=<optimized out>) at arch/um/os-Linux/skas/process.c:198
#29 userspace (regs=0x46e237d4) at arch/um/os-Linux/skas/process.c:431
#30 0x0805e6ac in fork_handler () at arch/um/kernel/process.c:160
#31 0x00000000 in ?? ()

[-- Attachment #3: 2 --]
[-- Type: text/plain, Size: 3500 bytes --]

format_decode (fmt=0x84b74b8 "+%#lx/%#lx", spec=0x434dfa8c) at lib/vsprintf.c:1375
1375		if (spec->type == FORMAT_TYPE_WIDTH) {
#0  format_decode (fmt=0x84b74b8 "+%#lx/%#lx", spec=0x434dfa8c) at lib/vsprintf.c:1375
#1  0x08295b82 in vsnprintf (buf=0x434dfb19 "", size=2147483647, fmt=0x84b74b8 "+%#lx/%#lx", args=0x434dfac8 "A") at lib/vsprintf.c:1615
#2  0x08296547 in sprintf (buf=0x84b74b8 "+%#lx/%#lx", fmt=0x84b74b8 "+%#lx/%#lx") at lib/vsprintf.c:1875
#3  0x080b76cf in __sprint_symbol (buffer=0x434dfb09 "schedule_timeout", address=1129183884, symbol_offset=1, add_offset=1) at kernel/kallsyms.c:374
#4  0x080b781a in sprint_symbol (buffer=0x434dfb09 "schedule_timeout", address=138496385) at kernel/kallsyms.c:395
#5  0x080b787f in __print_symbol (fmt=0x84b74b8 "+%#lx/%#lx", address=139162808) at kernel/kallsyms.c:440
#6  0x08078200 in print_symbol (fmt=<optimized out>, addr=<optimized out>) at include/linux/kallsyms.h:119
#7  print_context_stack (ebp=<optimized out>, tinfo=<optimized out>, stack=<optimized out>) at arch/x86/um/sysrq_32.c:54
#8  show_trace (task=0x434dfffd, stack=0x434dfc44) at arch/x86/um/sysrq_32.c:95
#9  0x0805fb6f in show_stack (task=0x0, esp=0x434dfc44) at arch/um/kernel/sysrq.c:65
#10 0x08413ae6 in __dump_stack () at lib/dump_stack.c:15
#11 dump_stack () at lib/dump_stack.c:60
#12 0x08414981 in schedule_timeout (timeout=-4028) at kernel/timer.c:1460
#13 0x08416362 in io_schedule_timeout (timeout=139162808) at kernel/sched/core.c:3992
#14 0x08411c31 in balance_dirty_pages (pages_dirtied=17, mapping=<optimized out>) at mm/page-writeback.c:1519
#15 0x080d1ce4 in balance_dirty_pages_ratelimited (mapping=0x11) at mm/page-writeback.c:1653
#16 0x080e2d0c in __do_fault (mm=0x47f24400, vma=0x442095d8, address=1080451072, pmd=0x84b74b8, pgoff=139162808, flags=1155011416, orig_pte=<incomplete type>) at mm/memory.c:3452
#17 0x080e5286 in do_nonlinear_fault (orig_pte=..., flags=<optimized out>, pmd=<optimized out>, address=<optimized out>, vma=<optimized out>, mm=<optimized out>, page_table=<optimized out>) at mm/memory.c:3518
#18 handle_pte_fault (flags=<optimized out>, pmd=<optimized out>, pte=<optimized out>, address=<optimized out>, vma=<optimized out>, mm=<optimized out>) at mm/memory.c:3717
#19 __handle_mm_fault (flags=<optimized out>, address=<optimized out>, vma=<optimized out>, mm=<optimized out>) at mm/memory.c:3845
#20 handle_mm_fault (mm=0x47f24400, vma=0x442095d8, address=1080451072, flags=1) at mm/memory.c:3868
#21 0x080e5a07 in __get_user_pages (tsk=0x46e23600, mm=0x47f24400, start=1080451072, nr_pages=65, gup_flags=519, pages=0x46f34d20, vmas=0x0, nonblocking=0x0) at mm/memory.c:1822
#22 0x080e5cc3 in get_user_pages (tsk=0x84b74b8, mm=0x84b74b8, start=139162808, nr_pages=139162808, write=1, force=0, pages=0x434dfa8c, vmas=0x11) at mm/memory.c:2019
#23 0x08140cce in aio_setup_ring (ctx=<optimized out>) at fs/aio.c:317
#24 ioctx_alloc (nr_events=<optimized out>) at fs/aio.c:582
#25 SYSC_io_setup (ctxp=<optimized out>, nr_events=<optimized out>) at fs/aio.c:1100
#26 SyS_io_setup (nr_events=4095, ctxp=1076256768) at fs/aio.c:1083
#27 0x080619c2 in handle_syscall (r=0x46e237d4) at arch/um/kernel/skas/syscall.c:35
#28 0x08073f2d in handle_trap (local_using_sysemu=<optimized out>, regs=<optimized out>, pid=<optimized out>) at arch/um/os-Linux/skas/process.c:198
#29 userspace (regs=0x46e237d4) at arch/um/os-Linux/skas/process.c:431
#30 0x0805e6ac in fork_handler () at arch/um/kernel/process.c:160
#31 0x00000000 in ?? ()

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [uml-devel] BUG: soft lockup for a user mode linux image
  2013-10-02 18:30 BUG: soft lockup for a user mode linux image Toralf Förster
  2013-10-02 18:31 ` Toralf Förster
@ 2013-10-02 19:55 ` Richard Weinberger
  2013-10-03 18:54   ` Toralf Förster
  1 sibling, 1 reply; 15+ messages in thread
From: Richard Weinberger @ 2013-10-02 19:55 UTC (permalink / raw)
  To: Toralf Förster; +Cc: trinity, UML devel

On Wed, Oct 2, 2013 at 8:30 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
> Running trinity (1 process, no victim files, just "$>trinity -C1) for a longer time
> within a 32 bit user mode linux image with a recent git kernel (host: 3.11.3 guest 3.12-rc3-g...)
> yields into this konsole message :
>
>  * Starting local
> net.core.warnings = 0                                                                         [ ok ]
> BUG: soft lockup - CPU#0 stuck for 23s! [trinity-child0:2031]
>
>
> and at the host t1 of the "linux"-processes eats all CPU cycles at 1 CPU core.
> 2 subsequent made back traces made with
>
> $> sudo gdb /home/tfoerste/devel/linux/linux 28144 -n -batch -ex bt
>
> shows nearly a similar position around __get_user_pages() - both are attached.
>
> I'm not surprised that trinity harms a systems - I'm just wondering whether this particular picture is
> expected or if it points to an issue.
>
>
> FWIW the last lines of trinity log were :
>
>
> [2031] [94] setsid() = 2031
> [2031] [95] setresgid(rgid=0xffff33e3, egid=0xffffff93, sgid=0x22000040) = -1 (Operation not permitted)
> [2031] [96] vmsplice(fd=5, iov=0x85501e0, nr_segs=300, flags=9) = 0x3000
> [2031] [97] setresuid(ruid=0x80549193, euid=0xc61041e0, suid=0xff19b6fa) = -1 (Operation not permitted)
> [2031] [98] setpriority(which=0xff010000, who=0xf3737373, niceval=0x8088960c) = -1 (Invalid argument)
> [2031] [99] socketcall(call=1, args=0x8550200) = -1 (Address family not supported by protocol)
> [2031] [100] access(filename="�", mode=2017) = -1 (Invalid argument)
> [2031] [101] getgroups(gidsetsize=0, grouplist=0x80d0000[page_rand]) = 3
> [2031] [102] msync(start=0xc0100220, len=0, flags=3) = -1 (Invalid argument)
> [2031] [103] sigpending(set=0x40025000) = 0
> [2031] [104] signalfd4(ufd=383, user_mask=1, sizemask=0xa4200000, flags=0x80800) = -1 (Invalid argument)
> [2031] [105] sendfile(out_fd=383, in_fd=382, offset=0, count=4096) = -1 (Invalid argument)
> [2031] [106] fanotify_mark(fanotify_fd=382, flags=5, mask=0x8000023, dfd=382, pathname="/proc/1092/task/1092/fdinfo/68") = -1 (Invalid argument)
> [2031] [107] wait4(upid=1, stat_addr=4, options=0xd761979b, ru=8) = -1 (Invalid argument)
> [2031] [108] sigpending(set=0x80ca000[page_zeros]) = 0
> [2031] [109] setresuid(ruid=0xefffd6fc, euid=0x1bf4c92f, suid=0xffff2e33) = -1 (Operation not permitted)
> [2031] [110] munlock(addr=0x40025000, len=34) = 0
> [2031] [111] timer_delete(timer_id=0xffffffdc) = -1 (Invalid argument)
> [2031] [112] sched_get_priority_max(policy=0x10000040) = -1 (Invalid argument)
> [2031] [113] syslog(type=0xc1000000, buf=1, len=0x82a5) = -1 (Operation not permitted)
> [2031] [114] setpriority(which=0xc4c806c6, who=0xffffff01, niceval=0xffff0682) = -1 (Invalid argument)
> [2031] [115] getgroups16(gidsetsize=0xfffe, grouplist=1) = -1 (Bad address)
> [2031] [116] rename(oldname=4, newname=8) = -1 (Bad address)
> [2031] [117] inotify_init() = 654
> [2031] [118] getgid() = 100
> [2031] [119] fstatat64(dfd=382, filename="/sys/devices/virtual/net/sit0/duplex", statbuf=0, flag=0xb545d727) = -1 (Invalid argument)
> [2031] [120] unlinkat(dfd=382, pathname="/proc/sys/net/ipv4/neigh/default/retrans_time", flag=0xc00ef76) = -1 (Invalid argument)
> [2031] [121] timerfd_create(clockid=0, flags=0) = 655
> [2031] [122] munlock(addr=4, len=0x3fff) = -1 (Cannot allocate memory)
> [2031] [123] fremovexattr(fd=382, name=0) = -1 (Bad address)
> [2031] [124] sched_get_priority_min(policy=0xff58bfef) = -1 (Invalid argument)
> [2031] [125] mq_timedreceive(mqdes=397, u_msg_ptr=4, msg_len=5245, u_msg_prio=0xc0100220, u_abs_timeout=0xc0100220) = -1 (Bad address)
> [2031] [126] chdir(filename="/proc/116/net/ptype") = -1 (Not a directory)
> [2031] [127] ssetmask(newmask=0x88000092) = 0
> [2031] [128] statfs(pathname="/proc/6/mounts", buf=0) = -1 (Bad address)
> [2031] [129] fchown16(fd=397, user=104, group=0x94100000) = -1 (Operation not permitted)
> [2031] [130] fchdir(fd=397) = -1 (Not a directory)
> [2031] [131] mkdir(pathname="/proc/1092/task/1092/fdinfo/316", mode=525) = -1 (File exists)
> [2031] [132] fsetxattr(fd=386, name=0x856f158, value=0x8571160, size=0, flags=0) = -1 (Numerical result out of range)
> [2031] [133] io_setup(nr_events=4095, ctxp=0x40266000) ^CKilled by signal 2.

Reading your gdb backtraces show that schedule_timeout() got called
with a negative value.
Looks like an integer overflow.
The soft-lockup might also origin from that (very big integer which
did not overflow jet)

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [uml-devel] BUG: soft lockup for a user mode linux image
  2013-10-02 19:55 ` [uml-devel] " Richard Weinberger
@ 2013-10-03 18:54   ` Toralf Förster
  2013-10-03 19:04     ` Richard Weinberger
  0 siblings, 1 reply; 15+ messages in thread
From: Toralf Förster @ 2013-10-03 18:54 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: trinity, UML devel

On 10/02/2013 09:55 PM, Richard Weinberger wrote:
> On Wed, Oct 2, 2013 at 8:30 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
>> Running trinity (1 process, no victim files, just "$>trinity -C1) for a longer time
>> within a 32 bit user mode linux image with a recent git kernel (host: 3.11.3 guest 3.12-rc3-g...)
>> yields into this konsole message :
>>
>>  * Starting local
>> net.core.warnings = 0                                                                         [ ok ]
>> BUG: soft lockup - CPU#0 stuck for 23s! [trinity-child0:2031]
>>
>>
>> and at the host t1 of the "linux"-processes eats all CPU cycles at 1 CPU core.
>> 2 subsequent made back traces made with
>>
>> $> sudo gdb /home/tfoerste/devel/linux/linux 28144 -n -batch -ex bt
>>
>> shows nearly a similar position around __get_user_pages() - both are attached.
>>
>> I'm not surprised that trinity harms a systems - I'm just wondering whether this particular picture is
>> expected or if it points to an issue.
>>
>>
>> FWIW the last lines of trinity log were :
>>
>>
>> [2031] [94] setsid() = 2031
>> [2031] [95] setresgid(rgid=0xffff33e3, egid=0xffffff93, sgid=0x22000040) = -1 (Operation not permitted)
>> [2031] [96] vmsplice(fd=5, iov=0x85501e0, nr_segs=300, flags=9) = 0x3000
>> [2031] [97] setresuid(ruid=0x80549193, euid=0xc61041e0, suid=0xff19b6fa) = -1 (Operation not permitted)
>> [2031] [98] setpriority(which=0xff010000, who=0xf3737373, niceval=0x8088960c) = -1 (Invalid argument)
>> [2031] [99] socketcall(call=1, args=0x8550200) = -1 (Address family not supported by protocol)
>> [2031] [100] access(filename="�", mode=2017) = -1 (Invalid argument)
>> [2031] [101] getgroups(gidsetsize=0, grouplist=0x80d0000[page_rand]) = 3
>> [2031] [102] msync(start=0xc0100220, len=0, flags=3) = -1 (Invalid argument)
>> [2031] [103] sigpending(set=0x40025000) = 0
>> [2031] [104] signalfd4(ufd=383, user_mask=1, sizemask=0xa4200000, flags=0x80800) = -1 (Invalid argument)
>> [2031] [105] sendfile(out_fd=383, in_fd=382, offset=0, count=4096) = -1 (Invalid argument)
>> [2031] [106] fanotify_mark(fanotify_fd=382, flags=5, mask=0x8000023, dfd=382, pathname="/proc/1092/task/1092/fdinfo/68") = -1 (Invalid argument)
>> [2031] [107] wait4(upid=1, stat_addr=4, options=0xd761979b, ru=8) = -1 (Invalid argument)
>> [2031] [108] sigpending(set=0x80ca000[page_zeros]) = 0
>> [2031] [109] setresuid(ruid=0xefffd6fc, euid=0x1bf4c92f, suid=0xffff2e33) = -1 (Operation not permitted)
>> [2031] [110] munlock(addr=0x40025000, len=34) = 0
>> [2031] [111] timer_delete(timer_id=0xffffffdc) = -1 (Invalid argument)
>> [2031] [112] sched_get_priority_max(policy=0x10000040) = -1 (Invalid argument)
>> [2031] [113] syslog(type=0xc1000000, buf=1, len=0x82a5) = -1 (Operation not permitted)
>> [2031] [114] setpriority(which=0xc4c806c6, who=0xffffff01, niceval=0xffff0682) = -1 (Invalid argument)
>> [2031] [115] getgroups16(gidsetsize=0xfffe, grouplist=1) = -1 (Bad address)
>> [2031] [116] rename(oldname=4, newname=8) = -1 (Bad address)
>> [2031] [117] inotify_init() = 654
>> [2031] [118] getgid() = 100
>> [2031] [119] fstatat64(dfd=382, filename="/sys/devices/virtual/net/sit0/duplex", statbuf=0, flag=0xb545d727) = -1 (Invalid argument)
>> [2031] [120] unlinkat(dfd=382, pathname="/proc/sys/net/ipv4/neigh/default/retrans_time", flag=0xc00ef76) = -1 (Invalid argument)
>> [2031] [121] timerfd_create(clockid=0, flags=0) = 655
>> [2031] [122] munlock(addr=4, len=0x3fff) = -1 (Cannot allocate memory)
>> [2031] [123] fremovexattr(fd=382, name=0) = -1 (Bad address)
>> [2031] [124] sched_get_priority_min(policy=0xff58bfef) = -1 (Invalid argument)
>> [2031] [125] mq_timedreceive(mqdes=397, u_msg_ptr=4, msg_len=5245, u_msg_prio=0xc0100220, u_abs_timeout=0xc0100220) = -1 (Bad address)
>> [2031] [126] chdir(filename="/proc/116/net/ptype") = -1 (Not a directory)
>> [2031] [127] ssetmask(newmask=0x88000092) = 0
>> [2031] [128] statfs(pathname="/proc/6/mounts", buf=0) = -1 (Bad address)
>> [2031] [129] fchown16(fd=397, user=104, group=0x94100000) = -1 (Operation not permitted)
>> [2031] [130] fchdir(fd=397) = -1 (Not a directory)
>> [2031] [131] mkdir(pathname="/proc/1092/task/1092/fdinfo/316", mode=525) = -1 (File exists)
>> [2031] [132] fsetxattr(fd=386, name=0x856f158, value=0x8571160, size=0, flags=0) = -1 (Numerical result out of range)
>> [2031] [133] io_setup(nr_events=4095, ctxp=0x40266000) ^CKilled by signal 2.
> 
> Reading your gdb backtraces show that schedule_timeout() got called
> with a negative value.
> Looks like an integer overflow.
> The soft-lockup might also origin from that (very big integer which
> did not overflow jet)
> 

If the culprit is solved by this patch I'd like to send it out. But I'm
unsure whether it catches the culprit or if it just covers the root cause.

ommit 7b367d5478dfcba1635e8cde3c663515fee129ec
Author: Toralf Förster <toralf.foerster@gmx.de>
Date:   Thu Oct 3 20:46:38 2013 +0200

    mm/page-writeback.c: avoid integer overflow in balance_dirty_pages()

    avoid crash of a 32 bit UML (found using fuzz tool trinity)

    Signed-off-by: Toralf Förster <toralf.foerster@gmx.de>

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f5236f8..f5a3096 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1462,7 +1462,8 @@ static void balance_dirty_pages(struct
address_space *mapping,
                        pause = max_pause;
                        goto pause;
                }
-               period = HZ * pages_dirtied / task_ratelimit;
+               period = ((long) HZ) * pages_dirtied / task_ratelimit;
+
                pause = period;
                if (current->dirty_paused_when)
                        pause -= now - current->dirty_paused_when;


-- 
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: BUG: soft lockup for a user mode linux image
  2013-10-03 18:54   ` Toralf Förster
@ 2013-10-03 19:04     ` Richard Weinberger
  2013-10-03 19:16       ` [uml-devel] " Toralf Förster
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Weinberger @ 2013-10-03 19:04 UTC (permalink / raw)
  To: Toralf Förster; +Cc: UML devel, trinity

Am 03.10.2013 20:54, schrieb Toralf Förster:
> On 10/02/2013 09:55 PM, Richard Weinberger wrote:
>> On Wed, Oct 2, 2013 at 8:30 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
>>> Running trinity (1 process, no victim files, just "$>trinity -C1) for a longer time
>>> within a 32 bit user mode linux image with a recent git kernel (host: 3.11.3 guest 3.12-rc3-g...)
>>> yields into this konsole message :
>>>
>>>  * Starting local
>>> net.core.warnings = 0                                                                         [ ok ]
>>> BUG: soft lockup - CPU#0 stuck for 23s! [trinity-child0:2031]
>>>
>>>
>>> and at the host t1 of the "linux"-processes eats all CPU cycles at 1 CPU core.
>>> 2 subsequent made back traces made with
>>>
>>> $> sudo gdb /home/tfoerste/devel/linux/linux 28144 -n -batch -ex bt
>>>
>>> shows nearly a similar position around __get_user_pages() - both are attached.
>>>
>>> I'm not surprised that trinity harms a systems - I'm just wondering whether this particular picture is
>>> expected or if it points to an issue.
>>>
>>>
>>> FWIW the last lines of trinity log were :
>>>
>>>
>>> [2031] [94] setsid() = 2031
>>> [2031] [95] setresgid(rgid=0xffff33e3, egid=0xffffff93, sgid=0x22000040) = -1 (Operation not permitted)
>>> [2031] [96] vmsplice(fd=5, iov=0x85501e0, nr_segs=300, flags=9) = 0x3000
>>> [2031] [97] setresuid(ruid=0x80549193, euid=0xc61041e0, suid=0xff19b6fa) = -1 (Operation not permitted)
>>> [2031] [98] setpriority(which=0xff010000, who=0xf3737373, niceval=0x8088960c) = -1 (Invalid argument)
>>> [2031] [99] socketcall(call=1, args=0x8550200) = -1 (Address family not supported by protocol)
>>> [2031] [100] access(filename="�", mode=2017) = -1 (Invalid argument)
>>> [2031] [101] getgroups(gidsetsize=0, grouplist=0x80d0000[page_rand]) = 3
>>> [2031] [102] msync(start=0xc0100220, len=0, flags=3) = -1 (Invalid argument)
>>> [2031] [103] sigpending(set=0x40025000) = 0
>>> [2031] [104] signalfd4(ufd=383, user_mask=1, sizemask=0xa4200000, flags=0x80800) = -1 (Invalid argument)
>>> [2031] [105] sendfile(out_fd=383, in_fd=382, offset=0, count=4096) = -1 (Invalid argument)
>>> [2031] [106] fanotify_mark(fanotify_fd=382, flags=5, mask=0x8000023, dfd=382, pathname="/proc/1092/task/1092/fdinfo/68") = -1 (Invalid argument)
>>> [2031] [107] wait4(upid=1, stat_addr=4, options=0xd761979b, ru=8) = -1 (Invalid argument)
>>> [2031] [108] sigpending(set=0x80ca000[page_zeros]) = 0
>>> [2031] [109] setresuid(ruid=0xefffd6fc, euid=0x1bf4c92f, suid=0xffff2e33) = -1 (Operation not permitted)
>>> [2031] [110] munlock(addr=0x40025000, len=34) = 0
>>> [2031] [111] timer_delete(timer_id=0xffffffdc) = -1 (Invalid argument)
>>> [2031] [112] sched_get_priority_max(policy=0x10000040) = -1 (Invalid argument)
>>> [2031] [113] syslog(type=0xc1000000, buf=1, len=0x82a5) = -1 (Operation not permitted)
>>> [2031] [114] setpriority(which=0xc4c806c6, who=0xffffff01, niceval=0xffff0682) = -1 (Invalid argument)
>>> [2031] [115] getgroups16(gidsetsize=0xfffe, grouplist=1) = -1 (Bad address)
>>> [2031] [116] rename(oldname=4, newname=8) = -1 (Bad address)
>>> [2031] [117] inotify_init() = 654
>>> [2031] [118] getgid() = 100
>>> [2031] [119] fstatat64(dfd=382, filename="/sys/devices/virtual/net/sit0/duplex", statbuf=0, flag=0xb545d727) = -1 (Invalid argument)
>>> [2031] [120] unlinkat(dfd=382, pathname="/proc/sys/net/ipv4/neigh/default/retrans_time", flag=0xc00ef76) = -1 (Invalid argument)
>>> [2031] [121] timerfd_create(clockid=0, flags=0) = 655
>>> [2031] [122] munlock(addr=4, len=0x3fff) = -1 (Cannot allocate memory)
>>> [2031] [123] fremovexattr(fd=382, name=0) = -1 (Bad address)
>>> [2031] [124] sched_get_priority_min(policy=0xff58bfef) = -1 (Invalid argument)
>>> [2031] [125] mq_timedreceive(mqdes=397, u_msg_ptr=4, msg_len=5245, u_msg_prio=0xc0100220, u_abs_timeout=0xc0100220) = -1 (Bad address)
>>> [2031] [126] chdir(filename="/proc/116/net/ptype") = -1 (Not a directory)
>>> [2031] [127] ssetmask(newmask=0x88000092) = 0
>>> [2031] [128] statfs(pathname="/proc/6/mounts", buf=0) = -1 (Bad address)
>>> [2031] [129] fchown16(fd=397, user=104, group=0x94100000) = -1 (Operation not permitted)
>>> [2031] [130] fchdir(fd=397) = -1 (Not a directory)
>>> [2031] [131] mkdir(pathname="/proc/1092/task/1092/fdinfo/316", mode=525) = -1 (File exists)
>>> [2031] [132] fsetxattr(fd=386, name=0x856f158, value=0x8571160, size=0, flags=0) = -1 (Numerical result out of range)
>>> [2031] [133] io_setup(nr_events=4095, ctxp=0x40266000) ^CKilled by signal 2.
>>
>> Reading your gdb backtraces show that schedule_timeout() got called
>> with a negative value.
>> Looks like an integer overflow.
>> The soft-lockup might also origin from that (very big integer which
>> did not overflow jet)
>>
> 
> If the culprit is solved by this patch I'd like to send it out. But I'm
> unsure whether it catches the culprit or if it just covers the root cause.

I fear your Patch will not fix the issue.

Does the issue only trigger on 32bit UMLs?
How long does it take till trinity hits it?

Thanks,
//richard

------------------------------------------------------------------------------
October Webinars: Code for Performance
Free Intel webinars can help you accelerate application performance.
Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from 
the latest Intel processors and coprocessors. See abstracts and register >
http://pubads.g.doubleclick.net/gampad/clk?id=60134791&iu=/4140/ostg.clktrk
_______________________________________________
User-mode-linux-devel mailing list
User-mode-linux-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [uml-devel] BUG: soft lockup for a user mode linux image
  2013-10-03 19:04     ` Richard Weinberger
@ 2013-10-03 19:16       ` Toralf Förster
  2013-10-03 19:20         ` Richard Weinberger
  0 siblings, 1 reply; 15+ messages in thread
From: Toralf Förster @ 2013-10-03 19:16 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: Richard Weinberger, trinity, UML devel

On 10/03/2013 09:04 PM, Richard Weinberger wrote:
> Am 03.10.2013 20:54, schrieb Toralf Förster:
>> On 10/02/2013 09:55 PM, Richard Weinberger wrote:
>>> On Wed, Oct 2, 2013 at 8:30 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
>>>> Running trinity (1 process, no victim files, just "$>trinity -C1) for a longer time
>>>> within a 32 bit user mode linux image with a recent git kernel (host: 3.11.3 guest 3.12-rc3-g...)
>>>> yields into this konsole message :
>>>>
>>>>  * Starting local
>>>> net.core.warnings = 0                                                                         [ ok ]
>>>> BUG: soft lockup - CPU#0 stuck for 23s! [trinity-child0:2031]
>>>>
>>>>
>>>> and at the host t1 of the "linux"-processes eats all CPU cycles at 1 CPU core.
>>>> 2 subsequent made back traces made with
>>>>
>>>> $> sudo gdb /home/tfoerste/devel/linux/linux 28144 -n -batch -ex bt
>>>>
>>>> shows nearly a similar position around __get_user_pages() - both are attached.
>>>>
>>>> I'm not surprised that trinity harms a systems - I'm just wondering whether this particular picture is
>>>> expected or if it points to an issue.
>>>>
>>>>
>>>> FWIW the last lines of trinity log were :
>>>>
>>>>
>>>> [2031] [94] setsid() = 2031
>>>> [2031] [95] setresgid(rgid=0xffff33e3, egid=0xffffff93, sgid=0x22000040) = -1 (Operation not permitted)
>>>> [2031] [96] vmsplice(fd=5, iov=0x85501e0, nr_segs=300, flags=9) = 0x3000
>>>> [2031] [97] setresuid(ruid=0x80549193, euid=0xc61041e0, suid=0xff19b6fa) = -1 (Operation not permitted)
>>>> [2031] [98] setpriority(which=0xff010000, who=0xf3737373, niceval=0x8088960c) = -1 (Invalid argument)
>>>> [2031] [99] socketcall(call=1, args=0x8550200) = -1 (Address family not supported by protocol)
>>>> [2031] [100] access(filename="�", mode=2017) = -1 (Invalid argument)
>>>> [2031] [101] getgroups(gidsetsize=0, grouplist=0x80d0000[page_rand]) = 3
>>>> [2031] [102] msync(start=0xc0100220, len=0, flags=3) = -1 (Invalid argument)
>>>> [2031] [103] sigpending(set=0x40025000) = 0
>>>> [2031] [104] signalfd4(ufd=383, user_mask=1, sizemask=0xa4200000, flags=0x80800) = -1 (Invalid argument)
>>>> [2031] [105] sendfile(out_fd=383, in_fd=382, offset=0, count=4096) = -1 (Invalid argument)
>>>> [2031] [106] fanotify_mark(fanotify_fd=382, flags=5, mask=0x8000023, dfd=382, pathname="/proc/1092/task/1092/fdinfo/68") = -1 (Invalid argument)
>>>> [2031] [107] wait4(upid=1, stat_addr=4, options=0xd761979b, ru=8) = -1 (Invalid argument)
>>>> [2031] [108] sigpending(set=0x80ca000[page_zeros]) = 0
>>>> [2031] [109] setresuid(ruid=0xefffd6fc, euid=0x1bf4c92f, suid=0xffff2e33) = -1 (Operation not permitted)
>>>> [2031] [110] munlock(addr=0x40025000, len=34) = 0
>>>> [2031] [111] timer_delete(timer_id=0xffffffdc) = -1 (Invalid argument)
>>>> [2031] [112] sched_get_priority_max(policy=0x10000040) = -1 (Invalid argument)
>>>> [2031] [113] syslog(type=0xc1000000, buf=1, len=0x82a5) = -1 (Operation not permitted)
>>>> [2031] [114] setpriority(which=0xc4c806c6, who=0xffffff01, niceval=0xffff0682) = -1 (Invalid argument)
>>>> [2031] [115] getgroups16(gidsetsize=0xfffe, grouplist=1) = -1 (Bad address)
>>>> [2031] [116] rename(oldname=4, newname=8) = -1 (Bad address)
>>>> [2031] [117] inotify_init() = 654
>>>> [2031] [118] getgid() = 100
>>>> [2031] [119] fstatat64(dfd=382, filename="/sys/devices/virtual/net/sit0/duplex", statbuf=0, flag=0xb545d727) = -1 (Invalid argument)
>>>> [2031] [120] unlinkat(dfd=382, pathname="/proc/sys/net/ipv4/neigh/default/retrans_time", flag=0xc00ef76) = -1 (Invalid argument)
>>>> [2031] [121] timerfd_create(clockid=0, flags=0) = 655
>>>> [2031] [122] munlock(addr=4, len=0x3fff) = -1 (Cannot allocate memory)
>>>> [2031] [123] fremovexattr(fd=382, name=0) = -1 (Bad address)
>>>> [2031] [124] sched_get_priority_min(policy=0xff58bfef) = -1 (Invalid argument)
>>>> [2031] [125] mq_timedreceive(mqdes=397, u_msg_ptr=4, msg_len=5245, u_msg_prio=0xc0100220, u_abs_timeout=0xc0100220) = -1 (Bad address)
>>>> [2031] [126] chdir(filename="/proc/116/net/ptype") = -1 (Not a directory)
>>>> [2031] [127] ssetmask(newmask=0x88000092) = 0
>>>> [2031] [128] statfs(pathname="/proc/6/mounts", buf=0) = -1 (Bad address)
>>>> [2031] [129] fchown16(fd=397, user=104, group=0x94100000) = -1 (Operation not permitted)
>>>> [2031] [130] fchdir(fd=397) = -1 (Not a directory)
>>>> [2031] [131] mkdir(pathname="/proc/1092/task/1092/fdinfo/316", mode=525) = -1 (File exists)
>>>> [2031] [132] fsetxattr(fd=386, name=0x856f158, value=0x8571160, size=0, flags=0) = -1 (Numerical result out of range)
>>>> [2031] [133] io_setup(nr_events=4095, ctxp=0x40266000) ^CKilled by signal 2.
>>>
>>> Reading your gdb backtraces show that schedule_timeout() got called
>>> with a negative value.
>>> Looks like an integer overflow.
>>> The soft-lockup might also origin from that (very big integer which
>>> did not overflow jet)
>>>
>>
>> If the culprit is solved by this patch I'd like to send it out. But I'm
>> unsure whether it catches the culprit or if it just covers the root cause.
> 
> I fear your Patch will not fix the issue.
> 
> Does the issue only trigger on 32bit UMLs?
No diea, I do only have a 32 bit system here (both host and client).

> How long does it take till trinity hits it?
a command like

$> ssh tfoerste@trinity "rm -rf t3; mkdir t3; cd t3; trinity -C4"

usually needs 10 till 15 min to trigger the issue. With just 1 trinity
task (-C1) however it needs often a hour or more.

> Thanks,
> //richard
> 


-- 
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [uml-devel] BUG: soft lockup for a user mode linux image
  2013-10-03 19:16       ` [uml-devel] " Toralf Förster
@ 2013-10-03 19:20         ` Richard Weinberger
  2013-10-03 19:33           ` Toralf Förster
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Weinberger @ 2013-10-03 19:20 UTC (permalink / raw)
  To: Toralf Förster; +Cc: Richard Weinberger, trinity, UML devel

Am 03.10.2013 21:16, schrieb Toralf Förster:
> On 10/03/2013 09:04 PM, Richard Weinberger wrote:
>> Am 03.10.2013 20:54, schrieb Toralf Förster:
>>> On 10/02/2013 09:55 PM, Richard Weinberger wrote:
>>>> On Wed, Oct 2, 2013 at 8:30 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
>>>>> Running trinity (1 process, no victim files, just "$>trinity -C1) for a longer time
>>>>> within a 32 bit user mode linux image with a recent git kernel (host: 3.11.3 guest 3.12-rc3-g...)
>>>>> yields into this konsole message :
>>>>>
>>>>>  * Starting local
>>>>> net.core.warnings = 0                                                                         [ ok ]
>>>>> BUG: soft lockup - CPU#0 stuck for 23s! [trinity-child0:2031]
>>>>>
>>>>>
>>>>> and at the host t1 of the "linux"-processes eats all CPU cycles at 1 CPU core.
>>>>> 2 subsequent made back traces made with
>>>>>
>>>>> $> sudo gdb /home/tfoerste/devel/linux/linux 28144 -n -batch -ex bt
>>>>>
>>>>> shows nearly a similar position around __get_user_pages() - both are attached.
>>>>>
>>>>> I'm not surprised that trinity harms a systems - I'm just wondering whether this particular picture is
>>>>> expected or if it points to an issue.
>>>>>
>>>>>
>>>>> FWIW the last lines of trinity log were :
>>>>>
>>>>>
>>>>> [2031] [94] setsid() = 2031
>>>>> [2031] [95] setresgid(rgid=0xffff33e3, egid=0xffffff93, sgid=0x22000040) = -1 (Operation not permitted)
>>>>> [2031] [96] vmsplice(fd=5, iov=0x85501e0, nr_segs=300, flags=9) = 0x3000
>>>>> [2031] [97] setresuid(ruid=0x80549193, euid=0xc61041e0, suid=0xff19b6fa) = -1 (Operation not permitted)
>>>>> [2031] [98] setpriority(which=0xff010000, who=0xf3737373, niceval=0x8088960c) = -1 (Invalid argument)
>>>>> [2031] [99] socketcall(call=1, args=0x8550200) = -1 (Address family not supported by protocol)
>>>>> [2031] [100] access(filename="�", mode=2017) = -1 (Invalid argument)
>>>>> [2031] [101] getgroups(gidsetsize=0, grouplist=0x80d0000[page_rand]) = 3
>>>>> [2031] [102] msync(start=0xc0100220, len=0, flags=3) = -1 (Invalid argument)
>>>>> [2031] [103] sigpending(set=0x40025000) = 0
>>>>> [2031] [104] signalfd4(ufd=383, user_mask=1, sizemask=0xa4200000, flags=0x80800) = -1 (Invalid argument)
>>>>> [2031] [105] sendfile(out_fd=383, in_fd=382, offset=0, count=4096) = -1 (Invalid argument)
>>>>> [2031] [106] fanotify_mark(fanotify_fd=382, flags=5, mask=0x8000023, dfd=382, pathname="/proc/1092/task/1092/fdinfo/68") = -1 (Invalid argument)
>>>>> [2031] [107] wait4(upid=1, stat_addr=4, options=0xd761979b, ru=8) = -1 (Invalid argument)
>>>>> [2031] [108] sigpending(set=0x80ca000[page_zeros]) = 0
>>>>> [2031] [109] setresuid(ruid=0xefffd6fc, euid=0x1bf4c92f, suid=0xffff2e33) = -1 (Operation not permitted)
>>>>> [2031] [110] munlock(addr=0x40025000, len=34) = 0
>>>>> [2031] [111] timer_delete(timer_id=0xffffffdc) = -1 (Invalid argument)
>>>>> [2031] [112] sched_get_priority_max(policy=0x10000040) = -1 (Invalid argument)
>>>>> [2031] [113] syslog(type=0xc1000000, buf=1, len=0x82a5) = -1 (Operation not permitted)
>>>>> [2031] [114] setpriority(which=0xc4c806c6, who=0xffffff01, niceval=0xffff0682) = -1 (Invalid argument)
>>>>> [2031] [115] getgroups16(gidsetsize=0xfffe, grouplist=1) = -1 (Bad address)
>>>>> [2031] [116] rename(oldname=4, newname=8) = -1 (Bad address)
>>>>> [2031] [117] inotify_init() = 654
>>>>> [2031] [118] getgid() = 100
>>>>> [2031] [119] fstatat64(dfd=382, filename="/sys/devices/virtual/net/sit0/duplex", statbuf=0, flag=0xb545d727) = -1 (Invalid argument)
>>>>> [2031] [120] unlinkat(dfd=382, pathname="/proc/sys/net/ipv4/neigh/default/retrans_time", flag=0xc00ef76) = -1 (Invalid argument)
>>>>> [2031] [121] timerfd_create(clockid=0, flags=0) = 655
>>>>> [2031] [122] munlock(addr=4, len=0x3fff) = -1 (Cannot allocate memory)
>>>>> [2031] [123] fremovexattr(fd=382, name=0) = -1 (Bad address)
>>>>> [2031] [124] sched_get_priority_min(policy=0xff58bfef) = -1 (Invalid argument)
>>>>> [2031] [125] mq_timedreceive(mqdes=397, u_msg_ptr=4, msg_len=5245, u_msg_prio=0xc0100220, u_abs_timeout=0xc0100220) = -1 (Bad address)
>>>>> [2031] [126] chdir(filename="/proc/116/net/ptype") = -1 (Not a directory)
>>>>> [2031] [127] ssetmask(newmask=0x88000092) = 0
>>>>> [2031] [128] statfs(pathname="/proc/6/mounts", buf=0) = -1 (Bad address)
>>>>> [2031] [129] fchown16(fd=397, user=104, group=0x94100000) = -1 (Operation not permitted)
>>>>> [2031] [130] fchdir(fd=397) = -1 (Not a directory)
>>>>> [2031] [131] mkdir(pathname="/proc/1092/task/1092/fdinfo/316", mode=525) = -1 (File exists)
>>>>> [2031] [132] fsetxattr(fd=386, name=0x856f158, value=0x8571160, size=0, flags=0) = -1 (Numerical result out of range)
>>>>> [2031] [133] io_setup(nr_events=4095, ctxp=0x40266000) ^CKilled by signal 2.
>>>>
>>>> Reading your gdb backtraces show that schedule_timeout() got called
>>>> with a negative value.
>>>> Looks like an integer overflow.
>>>> The soft-lockup might also origin from that (very big integer which
>>>> did not overflow jet)
>>>>
>>>
>>> If the culprit is solved by this patch I'd like to send it out. But I'm
>>> unsure whether it catches the culprit or if it just covers the root cause.
>>
>> I fear your Patch will not fix the issue.
>>
>> Does the issue only trigger on 32bit UMLs?
> No diea, I do only have a 32 bit system here (both host and client).
> 
>> How long does it take till trinity hits it?
> a command like
> 
> $> ssh tfoerste@trinity "rm -rf t3; mkdir t3; cd t3; trinity -C4"
> 
> usually needs 10 till 15 min to trigger the issue. With just 1 trinity
> task (-C1) however it needs often a hour or more.

That's good. :-)
You can place some printk()s into balance_dirty_pages() and observe the values
of period, max_pause, min_pause, etc...
Maybe this will give us a clue.

So far the issue looks not really UML specific.
But maybe it is more likely to happen on UML because of the slow page faults...

Thanks,
//richard

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [uml-devel] BUG: soft lockup for a user mode linux image
  2013-10-03 19:20         ` Richard Weinberger
@ 2013-10-03 19:33           ` Toralf Förster
  2013-10-04  5:52             ` Richard Weinberger
  0 siblings, 1 reply; 15+ messages in thread
From: Toralf Förster @ 2013-10-03 19:33 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: Richard Weinberger, trinity, UML devel

On 10/03/2013 09:20 PM, Richard Weinberger wrote:
> Am 03.10.2013 21:16, schrieb Toralf Förster:
>> On 10/03/2013 09:04 PM, Richard Weinberger wrote:
>>> Am 03.10.2013 20:54, schrieb Toralf Förster:
>>>> On 10/02/2013 09:55 PM, Richard Weinberger wrote:
>>>>> On Wed, Oct 2, 2013 at 8:30 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
>>>>>> Running trinity (1 process, no victim files, just "$>trinity -C1) for a longer time
>>>>>> within a 32 bit user mode linux image with a recent git kernel (host: 3.11.3 guest 3.12-rc3-g...)
>>>>>> yields into this konsole message :
>>>>>>
>>>>>>  * Starting local
>>>>>> net.core.warnings = 0                                                                         [ ok ]
>>>>>> BUG: soft lockup - CPU#0 stuck for 23s! [trinity-child0:2031]
>>>>>>
>>>>>>
>>>>>> and at the host t1 of the "linux"-processes eats all CPU cycles at 1 CPU core.
>>>>>> 2 subsequent made back traces made with
>>>>>>
>>>>>> $> sudo gdb /home/tfoerste/devel/linux/linux 28144 -n -batch -ex bt
>>>>>>
>>>>>> shows nearly a similar position around __get_user_pages() - both are attached.
>>>>>>
>>>>>> I'm not surprised that trinity harms a systems - I'm just wondering whether this particular picture is
>>>>>> expected or if it points to an issue.
>>>>>>
>>>>>>
>>>>>> FWIW the last lines of trinity log were :
>>>>>>
>>>>>>
>>>>>> [2031] [94] setsid() = 2031
>>>>>> [2031] [95] setresgid(rgid=0xffff33e3, egid=0xffffff93, sgid=0x22000040) = -1 (Operation not permitted)
>>>>>> [2031] [96] vmsplice(fd=5, iov=0x85501e0, nr_segs=300, flags=9) = 0x3000
>>>>>> [2031] [97] setresuid(ruid=0x80549193, euid=0xc61041e0, suid=0xff19b6fa) = -1 (Operation not permitted)
>>>>>> [2031] [98] setpriority(which=0xff010000, who=0xf3737373, niceval=0x8088960c) = -1 (Invalid argument)
>>>>>> [2031] [99] socketcall(call=1, args=0x8550200) = -1 (Address family not supported by protocol)
>>>>>> [2031] [100] access(filename="�", mode=2017) = -1 (Invalid argument)
>>>>>> [2031] [101] getgroups(gidsetsize=0, grouplist=0x80d0000[page_rand]) = 3
>>>>>> [2031] [102] msync(start=0xc0100220, len=0, flags=3) = -1 (Invalid argument)
>>>>>> [2031] [103] sigpending(set=0x40025000) = 0
>>>>>> [2031] [104] signalfd4(ufd=383, user_mask=1, sizemask=0xa4200000, flags=0x80800) = -1 (Invalid argument)
>>>>>> [2031] [105] sendfile(out_fd=383, in_fd=382, offset=0, count=4096) = -1 (Invalid argument)
>>>>>> [2031] [106] fanotify_mark(fanotify_fd=382, flags=5, mask=0x8000023, dfd=382, pathname="/proc/1092/task/1092/fdinfo/68") = -1 (Invalid argument)
>>>>>> [2031] [107] wait4(upid=1, stat_addr=4, options=0xd761979b, ru=8) = -1 (Invalid argument)
>>>>>> [2031] [108] sigpending(set=0x80ca000[page_zeros]) = 0
>>>>>> [2031] [109] setresuid(ruid=0xefffd6fc, euid=0x1bf4c92f, suid=0xffff2e33) = -1 (Operation not permitted)
>>>>>> [2031] [110] munlock(addr=0x40025000, len=34) = 0
>>>>>> [2031] [111] timer_delete(timer_id=0xffffffdc) = -1 (Invalid argument)
>>>>>> [2031] [112] sched_get_priority_max(policy=0x10000040) = -1 (Invalid argument)
>>>>>> [2031] [113] syslog(type=0xc1000000, buf=1, len=0x82a5) = -1 (Operation not permitted)
>>>>>> [2031] [114] setpriority(which=0xc4c806c6, who=0xffffff01, niceval=0xffff0682) = -1 (Invalid argument)
>>>>>> [2031] [115] getgroups16(gidsetsize=0xfffe, grouplist=1) = -1 (Bad address)
>>>>>> [2031] [116] rename(oldname=4, newname=8) = -1 (Bad address)
>>>>>> [2031] [117] inotify_init() = 654
>>>>>> [2031] [118] getgid() = 100
>>>>>> [2031] [119] fstatat64(dfd=382, filename="/sys/devices/virtual/net/sit0/duplex", statbuf=0, flag=0xb545d727) = -1 (Invalid argument)
>>>>>> [2031] [120] unlinkat(dfd=382, pathname="/proc/sys/net/ipv4/neigh/default/retrans_time", flag=0xc00ef76) = -1 (Invalid argument)
>>>>>> [2031] [121] timerfd_create(clockid=0, flags=0) = 655
>>>>>> [2031] [122] munlock(addr=4, len=0x3fff) = -1 (Cannot allocate memory)
>>>>>> [2031] [123] fremovexattr(fd=382, name=0) = -1 (Bad address)
>>>>>> [2031] [124] sched_get_priority_min(policy=0xff58bfef) = -1 (Invalid argument)
>>>>>> [2031] [125] mq_timedreceive(mqdes=397, u_msg_ptr=4, msg_len=5245, u_msg_prio=0xc0100220, u_abs_timeout=0xc0100220) = -1 (Bad address)
>>>>>> [2031] [126] chdir(filename="/proc/116/net/ptype") = -1 (Not a directory)
>>>>>> [2031] [127] ssetmask(newmask=0x88000092) = 0
>>>>>> [2031] [128] statfs(pathname="/proc/6/mounts", buf=0) = -1 (Bad address)
>>>>>> [2031] [129] fchown16(fd=397, user=104, group=0x94100000) = -1 (Operation not permitted)
>>>>>> [2031] [130] fchdir(fd=397) = -1 (Not a directory)
>>>>>> [2031] [131] mkdir(pathname="/proc/1092/task/1092/fdinfo/316", mode=525) = -1 (File exists)
>>>>>> [2031] [132] fsetxattr(fd=386, name=0x856f158, value=0x8571160, size=0, flags=0) = -1 (Numerical result out of range)
>>>>>> [2031] [133] io_setup(nr_events=4095, ctxp=0x40266000) ^CKilled by signal 2.
>>>>>
>>>>> Reading your gdb backtraces show that schedule_timeout() got called
>>>>> with a negative value.
>>>>> Looks like an integer overflow.
>>>>> The soft-lockup might also origin from that (very big integer which
>>>>> did not overflow jet)
>>>>>
>>>>
>>>> If the culprit is solved by this patch I'd like to send it out. But I'm
>>>> unsure whether it catches the culprit or if it just covers the root cause.
>>>
>>> I fear your Patch will not fix the issue.
>>>
>>> Does the issue only trigger on 32bit UMLs?
>> No diea, I do only have a 32 bit system here (both host and client).
>>
>>> How long does it take till trinity hits it?
>> a command like
>>
>> $> ssh tfoerste@trinity "rm -rf t3; mkdir t3; cd t3; trinity -C4"
>>
>> usually needs 10 till 15 min to trigger the issue. With just 1 trinity
>> task (-C1) however it needs often a hour or more.
> 
> That's good. :-)
> You can place some printk()s into balance_dirty_pages() and observe the values
> of period, max_pause, min_pause, etc...
> Maybe this will give us a clue.

Will do it again. I did it before already but period was the only thing
which becames otherwise negative -
and because pages_dirtied was often in the range of 20 - 30 and perioad
around -3000

I just wondered if HZ is casts as a short int in the calculation ?
BTW printk doesn't showed anything in the syslog (ok, I used
printk(KERN_ERR ...)
BUG_ON however worked fine, but do not give any output ...

> 
> So far the issue looks not really UML specific.
> But maybe it is more likely to happen on UML because of the slow page faults...
> 
> Thanks,
> //richard
> 
> 

-- 
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [uml-devel] BUG: soft lockup for a user mode linux image
  2013-10-03 19:33           ` Toralf Förster
@ 2013-10-04  5:52             ` Richard Weinberger
  2013-10-06 14:17               ` Toralf Förster
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Weinberger @ 2013-10-04  5:52 UTC (permalink / raw)
  To: Toralf Förster; +Cc: trinity, UML devel

Am 03.10.2013 21:33, schrieb Toralf Förster:
> On 10/03/2013 09:20 PM, Richard Weinberger wrote:
>> Am 03.10.2013 21:16, schrieb Toralf Förster:
>>> On 10/03/2013 09:04 PM, Richard Weinberger wrote:
>>>> Am 03.10.2013 20:54, schrieb Toralf Förster:
>>>>> On 10/02/2013 09:55 PM, Richard Weinberger wrote:
>>>>>> On Wed, Oct 2, 2013 at 8:30 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
>>>>>>> Running trinity (1 process, no victim files, just "$>trinity -C1) for a longer time
>>>>>>> within a 32 bit user mode linux image with a recent git kernel (host: 3.11.3 guest 3.12-rc3-g...)
>>>>>>> yields into this konsole message :
>>>>>>>
>>>>>>>  * Starting local
>>>>>>> net.core.warnings = 0                                                                         [ ok ]
>>>>>>> BUG: soft lockup - CPU#0 stuck for 23s! [trinity-child0:2031]
>>>>>>>
>>>>>>>
>>>>>>> and at the host t1 of the "linux"-processes eats all CPU cycles at 1 CPU core.
>>>>>>> 2 subsequent made back traces made with
>>>>>>>
>>>>>>> $> sudo gdb /home/tfoerste/devel/linux/linux 28144 -n -batch -ex bt
>>>>>>>
>>>>>>> shows nearly a similar position around __get_user_pages() - both are attached.
>>>>>>>
>>>>>>> I'm not surprised that trinity harms a systems - I'm just wondering whether this particular picture is
>>>>>>> expected or if it points to an issue.
>>>>>>>
>>>>>>>
>>>>>>> FWIW the last lines of trinity log were :
>>>>>>>
>>>>>>>
>>>>>>> [2031] [94] setsid() = 2031
>>>>>>> [2031] [95] setresgid(rgid=0xffff33e3, egid=0xffffff93, sgid=0x22000040) = -1 (Operation not permitted)
>>>>>>> [2031] [96] vmsplice(fd=5, iov=0x85501e0, nr_segs=300, flags=9) = 0x3000
>>>>>>> [2031] [97] setresuid(ruid=0x80549193, euid=0xc61041e0, suid=0xff19b6fa) = -1 (Operation not permitted)
>>>>>>> [2031] [98] setpriority(which=0xff010000, who=0xf3737373, niceval=0x8088960c) = -1 (Invalid argument)
>>>>>>> [2031] [99] socketcall(call=1, args=0x8550200) = -1 (Address family not supported by protocol)
>>>>>>> [2031] [100] access(filename="�", mode=2017) = -1 (Invalid argument)
>>>>>>> [2031] [101] getgroups(gidsetsize=0, grouplist=0x80d0000[page_rand]) = 3
>>>>>>> [2031] [102] msync(start=0xc0100220, len=0, flags=3) = -1 (Invalid argument)
>>>>>>> [2031] [103] sigpending(set=0x40025000) = 0
>>>>>>> [2031] [104] signalfd4(ufd=383, user_mask=1, sizemask=0xa4200000, flags=0x80800) = -1 (Invalid argument)
>>>>>>> [2031] [105] sendfile(out_fd=383, in_fd=382, offset=0, count=4096) = -1 (Invalid argument)
>>>>>>> [2031] [106] fanotify_mark(fanotify_fd=382, flags=5, mask=0x8000023, dfd=382, pathname="/proc/1092/task/1092/fdinfo/68") = -1 (Invalid argument)
>>>>>>> [2031] [107] wait4(upid=1, stat_addr=4, options=0xd761979b, ru=8) = -1 (Invalid argument)
>>>>>>> [2031] [108] sigpending(set=0x80ca000[page_zeros]) = 0
>>>>>>> [2031] [109] setresuid(ruid=0xefffd6fc, euid=0x1bf4c92f, suid=0xffff2e33) = -1 (Operation not permitted)
>>>>>>> [2031] [110] munlock(addr=0x40025000, len=34) = 0
>>>>>>> [2031] [111] timer_delete(timer_id=0xffffffdc) = -1 (Invalid argument)
>>>>>>> [2031] [112] sched_get_priority_max(policy=0x10000040) = -1 (Invalid argument)
>>>>>>> [2031] [113] syslog(type=0xc1000000, buf=1, len=0x82a5) = -1 (Operation not permitted)
>>>>>>> [2031] [114] setpriority(which=0xc4c806c6, who=0xffffff01, niceval=0xffff0682) = -1 (Invalid argument)
>>>>>>> [2031] [115] getgroups16(gidsetsize=0xfffe, grouplist=1) = -1 (Bad address)
>>>>>>> [2031] [116] rename(oldname=4, newname=8) = -1 (Bad address)
>>>>>>> [2031] [117] inotify_init() = 654
>>>>>>> [2031] [118] getgid() = 100
>>>>>>> [2031] [119] fstatat64(dfd=382, filename="/sys/devices/virtual/net/sit0/duplex", statbuf=0, flag=0xb545d727) = -1 (Invalid argument)
>>>>>>> [2031] [120] unlinkat(dfd=382, pathname="/proc/sys/net/ipv4/neigh/default/retrans_time", flag=0xc00ef76) = -1 (Invalid argument)
>>>>>>> [2031] [121] timerfd_create(clockid=0, flags=0) = 655
>>>>>>> [2031] [122] munlock(addr=4, len=0x3fff) = -1 (Cannot allocate memory)
>>>>>>> [2031] [123] fremovexattr(fd=382, name=0) = -1 (Bad address)
>>>>>>> [2031] [124] sched_get_priority_min(policy=0xff58bfef) = -1 (Invalid argument)
>>>>>>> [2031] [125] mq_timedreceive(mqdes=397, u_msg_ptr=4, msg_len=5245, u_msg_prio=0xc0100220, u_abs_timeout=0xc0100220) = -1 (Bad address)
>>>>>>> [2031] [126] chdir(filename="/proc/116/net/ptype") = -1 (Not a directory)
>>>>>>> [2031] [127] ssetmask(newmask=0x88000092) = 0
>>>>>>> [2031] [128] statfs(pathname="/proc/6/mounts", buf=0) = -1 (Bad address)
>>>>>>> [2031] [129] fchown16(fd=397, user=104, group=0x94100000) = -1 (Operation not permitted)
>>>>>>> [2031] [130] fchdir(fd=397) = -1 (Not a directory)
>>>>>>> [2031] [131] mkdir(pathname="/proc/1092/task/1092/fdinfo/316", mode=525) = -1 (File exists)
>>>>>>> [2031] [132] fsetxattr(fd=386, name=0x856f158, value=0x8571160, size=0, flags=0) = -1 (Numerical result out of range)
>>>>>>> [2031] [133] io_setup(nr_events=4095, ctxp=0x40266000) ^CKilled by signal 2.
>>>>>>
>>>>>> Reading your gdb backtraces show that schedule_timeout() got called
>>>>>> with a negative value.
>>>>>> Looks like an integer overflow.
>>>>>> The soft-lockup might also origin from that (very big integer which
>>>>>> did not overflow jet)
>>>>>>
>>>>>
>>>>> If the culprit is solved by this patch I'd like to send it out. But I'm
>>>>> unsure whether it catches the culprit or if it just covers the root cause.
>>>>
>>>> I fear your Patch will not fix the issue.
>>>>
>>>> Does the issue only trigger on 32bit UMLs?
>>> No diea, I do only have a 32 bit system here (both host and client).
>>>
>>>> How long does it take till trinity hits it?
>>> a command like
>>>
>>> $> ssh tfoerste@trinity "rm -rf t3; mkdir t3; cd t3; trinity -C4"
>>>
>>> usually needs 10 till 15 min to trigger the issue. With just 1 trinity
>>> task (-C1) however it needs often a hour or more.
>>
>> That's good. :-)
>> You can place some printk()s into balance_dirty_pages() and observe the values
>> of period, max_pause, min_pause, etc...
>> Maybe this will give us a clue.
> 
> Will do it again. I did it before already but period was the only thing
> which becames otherwise negative -
> and because pages_dirtied was often in the range of 20 - 30 and perioad
> around -3000

Did you print it before or after
                if (current->dirty_paused_when)
                        pause -= now - current->dirty_paused_when;

I'm also wondering why
		if (pause < min_pause) {
did not trigger or did it?

> I just wondered if HZ is casts as a short int in the calculation ?
> BTW printk doesn't showed anything in the syslog (ok, I used
> printk(KERN_ERR ...)

syslog or dmesg?

Thanks,
//richard

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [uml-devel] BUG: soft lockup for a user mode linux image
  2013-10-04  5:52             ` Richard Weinberger
@ 2013-10-06 14:17               ` Toralf Förster
  2013-10-06 18:38                 ` Geert Uytterhoeven
  0 siblings, 1 reply; 15+ messages in thread
From: Toralf Förster @ 2013-10-06 14:17 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: trinity, UML devel

On 10/04/2013 07:52 AM, Richard Weinberger wrote:
> Am 03.10.2013 21:33, schrieb Toralf Förster:
>> On 10/03/2013 09:20 PM, Richard Weinberger wrote:
>>> Am 03.10.2013 21:16, schrieb Toralf Förster:
>>>> On 10/03/2013 09:04 PM, Richard Weinberger wrote:
>>>>> Am 03.10.2013 20:54, schrieb Toralf Förster:
>>>>>> On 10/02/2013 09:55 PM, Richard Weinberger wrote:
>>>>>>> On Wed, Oct 2, 2013 at 8:30 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
>>>>>>>> Running trinity (1 process, no victim files, just "$>trinity -C1) for a longer time
>>>>>>>> within a 32 bit user mode linux image with a recent git kernel (host: 3.11.3 guest 3.12-rc3-g...)
>>>>>>>> yields into this konsole message :
>>>>>>>>
>>>>>>>>  * Starting local
>>>>>>>> net.core.warnings = 0                                                                         [ ok ]
>>>>>>>> BUG: soft lockup - CPU#0 stuck for 23s! [trinity-child0:2031]
>>>>>>>>
>>>>>>>>
>>>>>>>> and at the host t1 of the "linux"-processes eats all CPU cycles at 1 CPU core.
>>>>>>>> 2 subsequent made back traces made with
>>>>>>>>
>>>>>>>> $> sudo gdb /home/tfoerste/devel/linux/linux 28144 -n -batch -ex bt
>>>>>>>>
>>>>>>>> shows nearly a similar position around __get_user_pages() - both are attached.
>>>>>>>>
>>>>>>>> I'm not surprised that trinity harms a systems - I'm just wondering whether this particular picture is
>>>>>>>> expected or if it points to an issue.
>>>>>>>>
>>>>>>>>
>>>>>>>> FWIW the last lines of trinity log were :
>>>>>>>>
>>>>>>>>
>>>>>>>> [2031] [94] setsid() = 2031
>>>>>>>> [2031] [95] setresgid(rgid=0xffff33e3, egid=0xffffff93, sgid=0x22000040) = -1 (Operation not permitted)
>>>>>>>> [2031] [96] vmsplice(fd=5, iov=0x85501e0, nr_segs=300, flags=9) = 0x3000
>>>>>>>> [2031] [97] setresuid(ruid=0x80549193, euid=0xc61041e0, suid=0xff19b6fa) = -1 (Operation not permitted)
>>>>>>>> [2031] [98] setpriority(which=0xff010000, who=0xf3737373, niceval=0x8088960c) = -1 (Invalid argument)
>>>>>>>> [2031] [99] socketcall(call=1, args=0x8550200) = -1 (Address family not supported by protocol)
>>>>>>>> [2031] [100] access(filename="�", mode=2017) = -1 (Invalid argument)
>>>>>>>> [2031] [101] getgroups(gidsetsize=0, grouplist=0x80d0000[page_rand]) = 3
>>>>>>>> [2031] [102] msync(start=0xc0100220, len=0, flags=3) = -1 (Invalid argument)
>>>>>>>> [2031] [103] sigpending(set=0x40025000) = 0
>>>>>>>> [2031] [104] signalfd4(ufd=383, user_mask=1, sizemask=0xa4200000, flags=0x80800) = -1 (Invalid argument)
>>>>>>>> [2031] [105] sendfile(out_fd=383, in_fd=382, offset=0, count=4096) = -1 (Invalid argument)
>>>>>>>> [2031] [106] fanotify_mark(fanotify_fd=382, flags=5, mask=0x8000023, dfd=382, pathname="/proc/1092/task/1092/fdinfo/68") = -1 (Invalid argument)
>>>>>>>> [2031] [107] wait4(upid=1, stat_addr=4, options=0xd761979b, ru=8) = -1 (Invalid argument)
>>>>>>>> [2031] [108] sigpending(set=0x80ca000[page_zeros]) = 0
>>>>>>>> [2031] [109] setresuid(ruid=0xefffd6fc, euid=0x1bf4c92f, suid=0xffff2e33) = -1 (Operation not permitted)
>>>>>>>> [2031] [110] munlock(addr=0x40025000, len=34) = 0
>>>>>>>> [2031] [111] timer_delete(timer_id=0xffffffdc) = -1 (Invalid argument)
>>>>>>>> [2031] [112] sched_get_priority_max(policy=0x10000040) = -1 (Invalid argument)
>>>>>>>> [2031] [113] syslog(type=0xc1000000, buf=1, len=0x82a5) = -1 (Operation not permitted)
>>>>>>>> [2031] [114] setpriority(which=0xc4c806c6, who=0xffffff01, niceval=0xffff0682) = -1 (Invalid argument)
>>>>>>>> [2031] [115] getgroups16(gidsetsize=0xfffe, grouplist=1) = -1 (Bad address)
>>>>>>>> [2031] [116] rename(oldname=4, newname=8) = -1 (Bad address)
>>>>>>>> [2031] [117] inotify_init() = 654
>>>>>>>> [2031] [118] getgid() = 100
>>>>>>>> [2031] [119] fstatat64(dfd=382, filename="/sys/devices/virtual/net/sit0/duplex", statbuf=0, flag=0xb545d727) = -1 (Invalid argument)
>>>>>>>> [2031] [120] unlinkat(dfd=382, pathname="/proc/sys/net/ipv4/neigh/default/retrans_time", flag=0xc00ef76) = -1 (Invalid argument)
>>>>>>>> [2031] [121] timerfd_create(clockid=0, flags=0) = 655
>>>>>>>> [2031] [122] munlock(addr=4, len=0x3fff) = -1 (Cannot allocate memory)
>>>>>>>> [2031] [123] fremovexattr(fd=382, name=0) = -1 (Bad address)
>>>>>>>> [2031] [124] sched_get_priority_min(policy=0xff58bfef) = -1 (Invalid argument)
>>>>>>>> [2031] [125] mq_timedreceive(mqdes=397, u_msg_ptr=4, msg_len=5245, u_msg_prio=0xc0100220, u_abs_timeout=0xc0100220) = -1 (Bad address)
>>>>>>>> [2031] [126] chdir(filename="/proc/116/net/ptype") = -1 (Not a directory)
>>>>>>>> [2031] [127] ssetmask(newmask=0x88000092) = 0
>>>>>>>> [2031] [128] statfs(pathname="/proc/6/mounts", buf=0) = -1 (Bad address)
>>>>>>>> [2031] [129] fchown16(fd=397, user=104, group=0x94100000) = -1 (Operation not permitted)
>>>>>>>> [2031] [130] fchdir(fd=397) = -1 (Not a directory)
>>>>>>>> [2031] [131] mkdir(pathname="/proc/1092/task/1092/fdinfo/316", mode=525) = -1 (File exists)
>>>>>>>> [2031] [132] fsetxattr(fd=386, name=0x856f158, value=0x8571160, size=0, flags=0) = -1 (Numerical result out of range)
>>>>>>>> [2031] [133] io_setup(nr_events=4095, ctxp=0x40266000) ^CKilled by signal 2.
>>>>>>>
>>>>>>> Reading your gdb backtraces show that schedule_timeout() got called
>>>>>>> with a negative value.
>>>>>>> Looks like an integer overflow.
>>>>>>> The soft-lockup might also origin from that (very big integer which
>>>>>>> did not overflow jet)
>>>>>>>
>>>>>>
>>>>>> If the culprit is solved by this patch I'd like to send it out. But I'm
>>>>>> unsure whether it catches the culprit or if it just covers the root cause.
>>>>>
>>>>> I fear your Patch will not fix the issue.
>>>>>
>>>>> Does the issue only trigger on 32bit UMLs?
>>>> No diea, I do only have a 32 bit system here (both host and client).
>>>>
>>>>> How long does it take till trinity hits it?
>>>> a command like
>>>>
>>>> $> ssh tfoerste@trinity "rm -rf t3; mkdir t3; cd t3; trinity -C4"
>>>>
>>>> usually needs 10 till 15 min to trigger the issue. With just 1 trinity
>>>> task (-C1) however it needs often a hour or more.
>>>
>>> That's good. :-)
>>> You can place some printk()s into balance_dirty_pages() and observe the values
>>> of period, max_pause, min_pause, etc...
>>> Maybe this will give us a clue.
>>
>> Will do it again. I did it before already but period was the only thing
>> which becames otherwise negative -
>> and because pages_dirtied was often in the range of 20 - 30 and perioad
>> around -3000
> 
> Did you print it before or after
>                 if (current->dirty_paused_when)
>                         pause -= now - current->dirty_paused_when;
> 
> I'm also wondering why
> 		if (pause < min_pause) {
> did not trigger or did it?
> 
>> I just wondered if HZ is casts as a short int in the calculation ?
>> BTW printk doesn't showed anything in the syslog (ok, I used
>> printk(KERN_ERR ...)
> 
> syslog or dmesg?
> 
> Thanks,
> //richard
> 

back from going a mushroom foray - I nearly surrounded all "pause=" statements
with BUG_ON and downgrades trinity (b/c their latest enhancements made trinity too weak)

The UML stopped here :
...
		if (unlikely(task_ratelimit == 0)) {
			period = max_pause;
			pause = max_pause;
			BUG_ON(pause < 0);
			goto pause;
		}
		BUG_ON(pages_dirtied < 0);
		BUG_ON(task_ratelimit < 0);
		period = HZ * pages_dirtied / task_ratelimit;
		BUG_ON(period < 0);         <----------------------here
		pause = period;
		BUG_ON(pause < 0);
		if (current->dirty_paused_when)
			pause -= now - current->dirty_paused_when;
		BUG_ON(pause < 0);
...

The back trace is :

tfoerste@n22 ~/devel/trinity $ gdb --core=/mnt/ramdisk/core /home/tfoerste/devel/linux/linux -batch -ex bt[New LWP 2522]
Core was generated by `/home/tfoerste/devel/linux/linux earlyprintk ubda=/home/tfoerste/virtual/uml/tr'.
Program terminated with signal 6, Aborted.
#0  0xb777a424 in __kernel_vsyscall ()
#0  0xb777a424 in __kernel_vsyscall ()
#1  0x083b33b5 in kill ()
#2  0x0807190d in uml_abort () at arch/um/os-Linux/util.c:93
#3  0x08071c45 in os_dump_core () at arch/um/os-Linux/util.c:148
#4  0x08061417 in panic_exit (self=0x85b9558 <panic_exit_notifier>, unused1=0, unused2=0x85ef720 <buf.16084>) at arch/um/kernel/um_arch.c:240
#5  0x0809a7d8 in notifier_call_chain (nl=0x0, val=0, v=0x85ef720 <buf.16084>, nr_to_call=-2, nr_calls=0x0) at kernel/notifier.c:93
#6  0x0809a923 in __atomic_notifier_call_chain (nr_calls=<optimized out>, nr_to_call=<optimized out>, v=<optimized out>, val=<optimized out>, nh=<optimized out>) at kernel/notifier.c:182
#7  atomic_notifier_call_chain (nh=0x85ef704 <panic_notifier_list>, val=0, v=0x85ef720 <buf.16084>) at kernel/notifier.c:191
#8  0x08410d1c in panic (fmt=0x0) at kernel/panic.c:130
#9  0x08411c64 in balance_dirty_pages (pages_dirtied=9, mapping=<optimized out>) at mm/page-writeback.c:1471
#10 0x080d1ce4 in balance_dirty_pages_ratelimited (mapping=0x6) at mm/page-writeback.c:1663
#11 0x080e2d0c in __do_fault (mm=0x4870da00, vma=0x487484c8, address=1095131136, pmd=0x0, pgoff=0, flags=1212323992, orig_pte=<incomplete type>) at mm/memory.c:3452
#12 0x080e5286 in do_nonlinear_fault (orig_pte=..., flags=<optimized out>, pmd=<optimized out>, address=<optimized out>, vma=<optimized out>, mm=<optimized out>, page_table=<optimized out>) at mm/memory.c:3518
#13 handle_pte_fault (flags=<optimized out>, pmd=<optimized out>, pte=<optimized out>, address=<optimized out>, vma=<optimized out>, mm=<optimized out>) at mm/memory.c:3717
#14 __handle_mm_fault (flags=<optimized out>, address=<optimized out>, vma=<optimized out>, mm=<optimized out>) at mm/memory.c:3845
#15 handle_mm_fault (mm=0x4870da00, vma=0x487484c8, address=1095131136, flags=1) at mm/memory.c:3868
#16 0x080e5a07 in __get_user_pages (tsk=0x47d5a400, mm=0x4870da00, start=1095131136, nr_pages=1025, gup_flags=519, pages=0x47c50000, vmas=0x0, nonblocking=0x0) at mm/memory.c:1822
#17 0x080e5cc3 in get_user_pages (tsk=0x0, mm=0x0, start=0, nr_pages=0, write=1, force=0, pages=0x47fafb98, vmas=0x6) at mm/memory.c:2019
#18 0x08140d0e in aio_setup_ring (ctx=<optimized out>) at fs/aio.c:340
#19 ioctx_alloc (nr_events=<optimized out>) at fs/aio.c:605
#20 SYSC_io_setup (ctxp=<optimized out>, nr_events=<optimized out>) at fs/aio.c:1122
#21 SyS_io_setup (nr_events=65535, ctxp=135045120) at fs/aio.c:1105
#22 0x080619c2 in handle_syscall (r=0x47d5a5d4) at arch/um/kernel/skas/syscall.c:35
#23 0x08073f2d in handle_trap (local_using_sysemu=<optimized out>, regs=<optimized out>, pid=<optimized out>) at arch/um/os-Linux/skas/process.c:198
#24 userspace (regs=0x47d5a5d4) at arch/um/os-Linux/skas/process.c:431
#25 0x0805e6ac in fork_handler () at arch/um/kernel/process.c:160
#26 0x00000000 in ?? ()

I'll try again to see if printk's will help - how do I force a sync after a printk ?

-- 
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [uml-devel] BUG: soft lockup for a user mode linux image
  2013-10-06 14:17               ` Toralf Förster
@ 2013-10-06 18:38                 ` Geert Uytterhoeven
  2013-10-06 20:08                   ` Toralf Förster
  0 siblings, 1 reply; 15+ messages in thread
From: Geert Uytterhoeven @ 2013-10-06 18:38 UTC (permalink / raw)
  To: Toralf Förster; +Cc: Richard Weinberger, UML devel, trinity

On Sun, Oct 6, 2013 at 4:17 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
> The UML stopped here :
> ...
>                 if (unlikely(task_ratelimit == 0)) {
>                         period = max_pause;
>                         pause = max_pause;
>                         BUG_ON(pause < 0);
>                         goto pause;
>                 }
>                 BUG_ON(pages_dirtied < 0);
>                 BUG_ON(task_ratelimit < 0);
>                 period = HZ * pages_dirtied / task_ratelimit;
>                 BUG_ON(period < 0);         <----------------------here

So pages_dirtied becomes that big compared to task_ratelimit (both are
"unsigned long"), that period (which is "long", just like "pause") overflows
into a negative number.

This is indeed much more likely to happen on 32-bit.

> The back trace is :

> #9  0x08411c64 in balance_dirty_pages (pages_dirtied=9, mapping=<optimized out>) at mm/page-writeback.c:1471

But here pages_dirtied is only 9??

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [uml-devel] BUG: soft lockup for a user mode linux image
  2013-10-06 18:38                 ` Geert Uytterhoeven
@ 2013-10-06 20:08                   ` Toralf Förster
  2013-10-06 20:26                     ` Geert Uytterhoeven
  0 siblings, 1 reply; 15+ messages in thread
From: Toralf Förster @ 2013-10-06 20:08 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: Richard Weinberger, UML devel, trinity

On 10/06/2013 08:38 PM, Geert Uytterhoeven wrote:
> On Sun, Oct 6, 2013 at 4:17 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
>> The UML stopped here :
>> ...
>>                 if (unlikely(task_ratelimit == 0)) {
>>                         period = max_pause;
>>                         pause = max_pause;
>>                         BUG_ON(pause < 0);
>>                         goto pause;
>>                 }
>>                 BUG_ON(pages_dirtied < 0);
>>                 BUG_ON(task_ratelimit < 0);
>>                 period = HZ * pages_dirtied / task_ratelimit;
>>                 BUG_ON(period < 0);         <----------------------here
> 
> So pages_dirtied becomes that big compared to task_ratelimit (both are
> "unsigned long"), that period (which is "long", just like "pause") overflows
> into a negative number.
> 
> This is indeed much more likely to happen on 32-bit.
> 
>> The back trace is :
> 
>> #9  0x08411c64 in balance_dirty_pages (pages_dirtied=9, mapping=<optimized out>) at mm/page-writeback.c:1471
> 
> But here pages_dirtied is only 9??
> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds
> 
Well, this points to an overflow or ? :

tfoerste@n22 ~/devel/linux $ nl -ba mm/page-writeback.c | grep -A 5 -B 5 1468
  1463                          BUG_ON(pause < 0);
  1464                          goto pause;
  1465                  }
  1466                  period = HZ * pages_dirtied / task_ratelimit;
  1467                  pause = period;
  1468                  BUG_ON(pause < 0 && pages_dirtied > 0 && task_ratelimit > 0);
  1469                  if (current->dirty_paused_when)
  1470                          pause -= now - current->dirty_paused_when;
  1471                  /*
  1472                   * For less than 1s think time (ext3/4 may block the dirtier
  1473                   * for up to 800ms from time to time on 1-HDD; so does xfs,


and the back trace is :

tfoerste@n22 ~/devel/linux $ gdb --core=/mnt/ramdisk/core /home/tfoerste/devel/linux/linux -batch -ex bt
[New LWP 13163]
Core was generated by `/home/tfoerste/devel/linux/linux earlyprintk ubda=/home/tfoerste/virtual/uml/tr'.
Program terminated with signal 6, Aborted.
#0  0xb77d2424 in __kernel_vsyscall ()
#0  0xb77d2424 in __kernel_vsyscall ()
#1  0x083b33b5 in kill ()
#2  0x0807190d in uml_abort () at arch/um/os-Linux/util.c:93
#3  0x08071c45 in os_dump_core () at arch/um/os-Linux/util.c:148
#4  0x08061417 in panic_exit (self=0x85b9558 <panic_exit_notifier>, unused1=0, unused2=0x85ef720 <buf.16084>) at arch/um/kernel/um_arch.c:240
#5  0x0809a7d8 in notifier_call_chain (nl=0x0, val=0, v=0x85ef720 <buf.16084>, nr_to_call=-2, nr_calls=0x0) at kernel/notifier.c:93
#6  0x0809a923 in __atomic_notifier_call_chain (nr_calls=<optimized out>, nr_to_call=<optimized out>, v=<optimized out>, val=<optimized out>, nh=<optimized out>) at kernel/notifier.c:182
#7  atomic_notifier_call_chain (nh=0x85ef704 <panic_notifier_list>, val=0, v=0x85ef720 <buf.16084>) at kernel/notifier.c:191
#8  0x08410d1c in panic (fmt=0x0) at kernel/panic.c:130
#9  0x08411c6c in balance_dirty_pages (pages_dirtied=0, mapping=<optimized out>) at mm/page-writeback.c:1468
#10 0x080d1ce4 in balance_dirty_pages_ratelimited (mapping=0x6) at mm/page-writeback.c:1657
#11 0x080e2d0c in __do_fault (mm=0x47b09600, vma=0x48bc9e58, address=1082572800, pmd=0x0, pgoff=0, flags=1167616488, orig_pte=<incomplete type>) at mm/memory.c:3452
#12 0x080e5286 in do_nonlinear_fault (orig_pte=..., flags=<optimized out>, pmd=<optimized out>, address=<optimized out>, vma=<optimized out>, mm=<optimized out>, page_table=<optimized out>) at mm/memory.c:3518
#13 handle_pte_fault (flags=<optimized out>, pmd=<optimized out>, pte=<optimized out>, address=<optimized out>, vma=<optimized out>, mm=<optimized out>) at mm/memory.c:3717
#14 __handle_mm_fault (flags=<optimized out>, address=<optimized out>, vma=<optimized out>, mm=<optimized out>) at mm/memory.c:3845
#15 handle_mm_fault (mm=0x47b09600, vma=0x48bc9e58, address=1082572800, flags=1) at mm/memory.c:3868
#16 0x080e5a07 in __get_user_pages (tsk=0x47a3ea00, mm=0x47b09600, start=1082572800, nr_pages=962, gup_flags=519, pages=0x47b96120, vmas=0x0, nonblocking=0x0) at mm/memory.c:1822
#17 0x080e5cc3 in get_user_pages (tsk=0x0, mm=0x0, start=0, nr_pages=0, write=1, force=0, pages=0x4789fb9c, vmas=0x6) at mm/memory.c:2019
#18 0x08140d0e in aio_setup_ring (ctx=<optimized out>) at fs/aio.c:340
#19 ioctx_alloc (nr_events=<optimized out>) at fs/aio.c:605
#20 SYSC_io_setup (ctxp=<optimized out>, nr_events=<optimized out>) at fs/aio.c:1122
#21 SyS_io_setup (nr_events=-2147422135, ctxp=135045120) at fs/aio.c:1105
#22 0x080619c2 in handle_syscall (r=0x47a3ebd4) at arch/um/kernel/skas/syscall.c:35
#23 0x08073f2d in handle_trap (local_using_sysemu=<optimized out>, regs=<optimized out>, pid=<optimized out>) at arch/um/os-Linux/skas/process.c:198
#24 userspace (regs=0x47a3ebd4) at arch/um/os-Linux/skas/process.c:431
#25 0x0805e6ac in fork_handler () at arch/um/kernel/process.c:160
#26 0x5a5a5a5a in ?? ()

-- 
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [uml-devel] BUG: soft lockup for a user mode linux image
  2013-10-06 20:08                   ` Toralf Förster
@ 2013-10-06 20:26                     ` Geert Uytterhoeven
  2013-10-06 21:01                       ` Toralf Förster
  0 siblings, 1 reply; 15+ messages in thread
From: Geert Uytterhoeven @ 2013-10-06 20:26 UTC (permalink / raw)
  To: Toralf Förster; +Cc: Richard Weinberger, UML devel, trinity

On Sun, Oct 6, 2013 at 10:08 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
> On 10/06/2013 08:38 PM, Geert Uytterhoeven wrote:
>> On Sun, Oct 6, 2013 at 4:17 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
>>> The UML stopped here :
>>> ...
>>>                 if (unlikely(task_ratelimit == 0)) {
>>>                         period = max_pause;
>>>                         pause = max_pause;
>>>                         BUG_ON(pause < 0);
>>>                         goto pause;
>>>                 }
>>>                 BUG_ON(pages_dirtied < 0);
>>>                 BUG_ON(task_ratelimit < 0);
>>>                 period = HZ * pages_dirtied / task_ratelimit;
>>>                 BUG_ON(period < 0);         <----------------------here
>>
>> So pages_dirtied becomes that big compared to task_ratelimit (both are
>> "unsigned long"), that period (which is "long", just like "pause") overflows
>> into a negative number.
>>
>> This is indeed much more likely to happen on 32-bit.
>>
>>> The back trace is :
>>
>>> #9  0x08411c64 in balance_dirty_pages (pages_dirtied=9, mapping=<optimized out>) at mm/page-writeback.c:1471
>>
>> But here pages_dirtied is only 9??

> Well, this points to an overflow or ? :

Negative indicates an overflow, but pages_dirtied doesn't.

> tfoerste@n22 ~/devel/linux $ nl -ba mm/page-writeback.c | grep -A 5 -B 5 1468
>   1463                          BUG_ON(pause < 0);
>   1464                          goto pause;
>   1465                  }
>   1466                  period = HZ * pages_dirtied / task_ratelimit;
>   1467                  pause = period;
>   1468                  BUG_ON(pause < 0 && pages_dirtied > 0 && task_ratelimit > 0);
>   1469                  if (current->dirty_paused_when)
>   1470                          pause -= now - current->dirty_paused_when;
>   1471                  /*
>   1472                   * For less than 1s think time (ext3/4 may block the dirtier
>   1473                   * for up to 800ms from time to time on 1-HDD; so does xfs,
>
>
> and the back trace is :
>
> #9  0x08411c6c in balance_dirty_pages (pages_dirtied=0, mapping=<optimized out>) at mm/page-writeback.c:1468

Hmm, now pages_dirtied is zero, according to the backtrace, but the BUG_ON()
asserts its strict positive?!?

Can you please try the following instead of the BUG_ON():

if (pause < 0) {
        printk("pages_dirtied = %lu\n", pages_dirtied);
        printk("task_ratelimit = %lu\n", task_ratelimit);
        printk("pause = %ld\n", pause);
}

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [uml-devel] BUG: soft lockup for a user mode linux image
  2013-10-06 20:26                     ` Geert Uytterhoeven
@ 2013-10-06 21:01                       ` Toralf Förster
  2013-10-08 20:07                         ` Geert Uytterhoeven
  0 siblings, 1 reply; 15+ messages in thread
From: Toralf Förster @ 2013-10-06 21:01 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: Richard Weinberger, UML devel, trinity

On 10/06/2013 10:26 PM, Geert Uytterhoeven wrote:
> On Sun, Oct 6, 2013 at 10:08 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
>> On 10/06/2013 08:38 PM, Geert Uytterhoeven wrote:
>>> On Sun, Oct 6, 2013 at 4:17 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
>>>> The UML stopped here :
>>>> ...
>>>>                 if (unlikely(task_ratelimit == 0)) {
>>>>                         period = max_pause;
>>>>                         pause = max_pause;
>>>>                         BUG_ON(pause < 0);
>>>>                         goto pause;
>>>>                 }
>>>>                 BUG_ON(pages_dirtied < 0);
>>>>                 BUG_ON(task_ratelimit < 0);
>>>>                 period = HZ * pages_dirtied / task_ratelimit;
>>>>                 BUG_ON(period < 0);         <----------------------here
>>>
>>> So pages_dirtied becomes that big compared to task_ratelimit (both are
>>> "unsigned long"), that period (which is "long", just like "pause") overflows
>>> into a negative number.
>>>
>>> This is indeed much more likely to happen on 32-bit.
>>>
>>>> The back trace is :
>>>
>>>> #9  0x08411c64 in balance_dirty_pages (pages_dirtied=9, mapping=<optimized out>) at mm/page-writeback.c:1471
>>>
>>> But here pages_dirtied is only 9??
> 
>> Well, this points to an overflow or ? :
> 
> Negative indicates an overflow, but pages_dirtied doesn't.
> 
>> tfoerste@n22 ~/devel/linux $ nl -ba mm/page-writeback.c | grep -A 5 -B 5 1468
>>   1463                          BUG_ON(pause < 0);
>>   1464                          goto pause;
>>   1465                  }
>>   1466                  period = HZ * pages_dirtied / task_ratelimit;
>>   1467                  pause = period;
>>   1468                  BUG_ON(pause < 0 && pages_dirtied > 0 && task_ratelimit > 0);
>>   1469                  if (current->dirty_paused_when)
>>   1470                          pause -= now - current->dirty_paused_when;
>>   1471                  /*
>>   1472                   * For less than 1s think time (ext3/4 may block the dirtier
>>   1473                   * for up to 800ms from time to time on 1-HDD; so does xfs,
>>
>>
>> and the back trace is :
>>
>> #9  0x08411c6c in balance_dirty_pages (pages_dirtied=0, mapping=<optimized out>) at mm/page-writeback.c:1468
> 
> Hmm, now pages_dirtied is zero, according to the backtrace, but the BUG_ON()
> asserts its strict positive?!?
> 
> Can you please try the following instead of the BUG_ON():
> 
> if (pause < 0) {
>         printk("pages_dirtied = %lu\n", pages_dirtied);
>         printk("task_ratelimit = %lu\n", task_ratelimit);
>         printk("pause = %ld\n", pause);
> }
> 
> Gr{oetje,eeting}s,
> 
>                         Geert
I tried it in different ways already - I'm completely unsuccessful in getting any printk output.
As soon as the issue happens I do have a 

BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:1521]

at stderr of the UML and then no further input is accepted. With uml_mconsole I'm however able
to run very basic commands like a crash dump, sysrq ond so on.



> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds
> 


-- 
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [uml-devel] BUG: soft lockup for a user mode linux image
  2013-10-06 21:01                       ` Toralf Förster
@ 2013-10-08 20:07                         ` Geert Uytterhoeven
  0 siblings, 0 replies; 15+ messages in thread
From: Geert Uytterhoeven @ 2013-10-08 20:07 UTC (permalink / raw)
  To: Toralf Förster; +Cc: Richard Weinberger, UML devel, trinity

On Sun, Oct 6, 2013 at 11:01 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
>> Hmm, now pages_dirtied is zero, according to the backtrace, but the BUG_ON()
>> asserts its strict positive?!?
>>
>> Can you please try the following instead of the BUG_ON():
>>
>> if (pause < 0) {
>>         printk("pages_dirtied = %lu\n", pages_dirtied);
>>         printk("task_ratelimit = %lu\n", task_ratelimit);
>>         printk("pause = %ld\n", pause);
>> }
>>
>> Gr{oetje,eeting}s,
>>
>>                         Geert
> I tried it in different ways already - I'm completely unsuccessful in getting any printk output.
> As soon as the issue happens I do have a
>
> BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:1521]
>
> at stderr of the UML and then no further input is accepted. With uml_mconsole I'm however able
> to run very basic commands like a crash dump, sysrq ond so on.

You may get an idea of the magnitude of pages_dirtied by using a chain of
BUG_ON()s, like:

BUG_ON(pages_dirtied > 2000000000);
BUG_ON(pages_dirtied > 1000000000);
BUG_ON(pages_dirtied > 100000000);
BUG_ON(pages_dirtied > 10000000);
BUG_ON(pages_dirtied > 1000000);

Probably 1 million is already too much for normal operation?

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2013-10-08 20:07 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-10-02 18:30 BUG: soft lockup for a user mode linux image Toralf Förster
2013-10-02 18:31 ` Toralf Förster
2013-10-02 19:55 ` [uml-devel] " Richard Weinberger
2013-10-03 18:54   ` Toralf Förster
2013-10-03 19:04     ` Richard Weinberger
2013-10-03 19:16       ` [uml-devel] " Toralf Förster
2013-10-03 19:20         ` Richard Weinberger
2013-10-03 19:33           ` Toralf Förster
2013-10-04  5:52             ` Richard Weinberger
2013-10-06 14:17               ` Toralf Förster
2013-10-06 18:38                 ` Geert Uytterhoeven
2013-10-06 20:08                   ` Toralf Förster
2013-10-06 20:26                     ` Geert Uytterhoeven
2013-10-06 21:01                       ` Toralf Förster
2013-10-08 20:07                         ` Geert Uytterhoeven

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox