From: Richard Weinberger <richard@nod.at>
To: "Toralf Förster" <toralf.foerster@gmx.de>
Cc: UML devel <user-mode-linux-devel@lists.sourceforge.net>,
trinity@vger.kernel.org
Subject: Re: [uml-devel] BUG: soft lockup for a user mode linux image
Date: Fri, 04 Oct 2013 07:52:58 +0200 [thread overview]
Message-ID: <524E57BA.805@nod.at> (raw)
In-Reply-To: <524DC675.4020201@gmx.de>
Am 03.10.2013 21:33, schrieb Toralf Förster:
> On 10/03/2013 09:20 PM, Richard Weinberger wrote:
>> Am 03.10.2013 21:16, schrieb Toralf Förster:
>>> On 10/03/2013 09:04 PM, Richard Weinberger wrote:
>>>> Am 03.10.2013 20:54, schrieb Toralf Förster:
>>>>> On 10/02/2013 09:55 PM, Richard Weinberger wrote:
>>>>>> On Wed, Oct 2, 2013 at 8:30 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
>>>>>>> Running trinity (1 process, no victim files, just "$>trinity -C1) for a longer time
>>>>>>> within a 32 bit user mode linux image with a recent git kernel (host: 3.11.3 guest 3.12-rc3-g...)
>>>>>>> yields into this konsole message :
>>>>>>>
>>>>>>> * Starting local
>>>>>>> net.core.warnings = 0 [ ok ]
>>>>>>> BUG: soft lockup - CPU#0 stuck for 23s! [trinity-child0:2031]
>>>>>>>
>>>>>>>
>>>>>>> and at the host t1 of the "linux"-processes eats all CPU cycles at 1 CPU core.
>>>>>>> 2 subsequent made back traces made with
>>>>>>>
>>>>>>> $> sudo gdb /home/tfoerste/devel/linux/linux 28144 -n -batch -ex bt
>>>>>>>
>>>>>>> shows nearly a similar position around __get_user_pages() - both are attached.
>>>>>>>
>>>>>>> I'm not surprised that trinity harms a systems - I'm just wondering whether this particular picture is
>>>>>>> expected or if it points to an issue.
>>>>>>>
>>>>>>>
>>>>>>> FWIW the last lines of trinity log were :
>>>>>>>
>>>>>>>
>>>>>>> [2031] [94] setsid() = 2031
>>>>>>> [2031] [95] setresgid(rgid=0xffff33e3, egid=0xffffff93, sgid=0x22000040) = -1 (Operation not permitted)
>>>>>>> [2031] [96] vmsplice(fd=5, iov=0x85501e0, nr_segs=300, flags=9) = 0x3000
>>>>>>> [2031] [97] setresuid(ruid=0x80549193, euid=0xc61041e0, suid=0xff19b6fa) = -1 (Operation not permitted)
>>>>>>> [2031] [98] setpriority(which=0xff010000, who=0xf3737373, niceval=0x8088960c) = -1 (Invalid argument)
>>>>>>> [2031] [99] socketcall(call=1, args=0x8550200) = -1 (Address family not supported by protocol)
>>>>>>> [2031] [100] access(filename="�", mode=2017) = -1 (Invalid argument)
>>>>>>> [2031] [101] getgroups(gidsetsize=0, grouplist=0x80d0000[page_rand]) = 3
>>>>>>> [2031] [102] msync(start=0xc0100220, len=0, flags=3) = -1 (Invalid argument)
>>>>>>> [2031] [103] sigpending(set=0x40025000) = 0
>>>>>>> [2031] [104] signalfd4(ufd=383, user_mask=1, sizemask=0xa4200000, flags=0x80800) = -1 (Invalid argument)
>>>>>>> [2031] [105] sendfile(out_fd=383, in_fd=382, offset=0, count=4096) = -1 (Invalid argument)
>>>>>>> [2031] [106] fanotify_mark(fanotify_fd=382, flags=5, mask=0x8000023, dfd=382, pathname="/proc/1092/task/1092/fdinfo/68") = -1 (Invalid argument)
>>>>>>> [2031] [107] wait4(upid=1, stat_addr=4, options=0xd761979b, ru=8) = -1 (Invalid argument)
>>>>>>> [2031] [108] sigpending(set=0x80ca000[page_zeros]) = 0
>>>>>>> [2031] [109] setresuid(ruid=0xefffd6fc, euid=0x1bf4c92f, suid=0xffff2e33) = -1 (Operation not permitted)
>>>>>>> [2031] [110] munlock(addr=0x40025000, len=34) = 0
>>>>>>> [2031] [111] timer_delete(timer_id=0xffffffdc) = -1 (Invalid argument)
>>>>>>> [2031] [112] sched_get_priority_max(policy=0x10000040) = -1 (Invalid argument)
>>>>>>> [2031] [113] syslog(type=0xc1000000, buf=1, len=0x82a5) = -1 (Operation not permitted)
>>>>>>> [2031] [114] setpriority(which=0xc4c806c6, who=0xffffff01, niceval=0xffff0682) = -1 (Invalid argument)
>>>>>>> [2031] [115] getgroups16(gidsetsize=0xfffe, grouplist=1) = -1 (Bad address)
>>>>>>> [2031] [116] rename(oldname=4, newname=8) = -1 (Bad address)
>>>>>>> [2031] [117] inotify_init() = 654
>>>>>>> [2031] [118] getgid() = 100
>>>>>>> [2031] [119] fstatat64(dfd=382, filename="/sys/devices/virtual/net/sit0/duplex", statbuf=0, flag=0xb545d727) = -1 (Invalid argument)
>>>>>>> [2031] [120] unlinkat(dfd=382, pathname="/proc/sys/net/ipv4/neigh/default/retrans_time", flag=0xc00ef76) = -1 (Invalid argument)
>>>>>>> [2031] [121] timerfd_create(clockid=0, flags=0) = 655
>>>>>>> [2031] [122] munlock(addr=4, len=0x3fff) = -1 (Cannot allocate memory)
>>>>>>> [2031] [123] fremovexattr(fd=382, name=0) = -1 (Bad address)
>>>>>>> [2031] [124] sched_get_priority_min(policy=0xff58bfef) = -1 (Invalid argument)
>>>>>>> [2031] [125] mq_timedreceive(mqdes=397, u_msg_ptr=4, msg_len=5245, u_msg_prio=0xc0100220, u_abs_timeout=0xc0100220) = -1 (Bad address)
>>>>>>> [2031] [126] chdir(filename="/proc/116/net/ptype") = -1 (Not a directory)
>>>>>>> [2031] [127] ssetmask(newmask=0x88000092) = 0
>>>>>>> [2031] [128] statfs(pathname="/proc/6/mounts", buf=0) = -1 (Bad address)
>>>>>>> [2031] [129] fchown16(fd=397, user=104, group=0x94100000) = -1 (Operation not permitted)
>>>>>>> [2031] [130] fchdir(fd=397) = -1 (Not a directory)
>>>>>>> [2031] [131] mkdir(pathname="/proc/1092/task/1092/fdinfo/316", mode=525) = -1 (File exists)
>>>>>>> [2031] [132] fsetxattr(fd=386, name=0x856f158, value=0x8571160, size=0, flags=0) = -1 (Numerical result out of range)
>>>>>>> [2031] [133] io_setup(nr_events=4095, ctxp=0x40266000) ^CKilled by signal 2.
>>>>>>
>>>>>> Reading your gdb backtraces show that schedule_timeout() got called
>>>>>> with a negative value.
>>>>>> Looks like an integer overflow.
>>>>>> The soft-lockup might also origin from that (very big integer which
>>>>>> did not overflow jet)
>>>>>>
>>>>>
>>>>> If the culprit is solved by this patch I'd like to send it out. But I'm
>>>>> unsure whether it catches the culprit or if it just covers the root cause.
>>>>
>>>> I fear your Patch will not fix the issue.
>>>>
>>>> Does the issue only trigger on 32bit UMLs?
>>> No diea, I do only have a 32 bit system here (both host and client).
>>>
>>>> How long does it take till trinity hits it?
>>> a command like
>>>
>>> $> ssh tfoerste@trinity "rm -rf t3; mkdir t3; cd t3; trinity -C4"
>>>
>>> usually needs 10 till 15 min to trigger the issue. With just 1 trinity
>>> task (-C1) however it needs often a hour or more.
>>
>> That's good. :-)
>> You can place some printk()s into balance_dirty_pages() and observe the values
>> of period, max_pause, min_pause, etc...
>> Maybe this will give us a clue.
>
> Will do it again. I did it before already but period was the only thing
> which becames otherwise negative -
> and because pages_dirtied was often in the range of 20 - 30 and perioad
> around -3000
Did you print it before or after
if (current->dirty_paused_when)
pause -= now - current->dirty_paused_when;
I'm also wondering why
if (pause < min_pause) {
did not trigger or did it?
> I just wondered if HZ is casts as a short int in the calculation ?
> BTW printk doesn't showed anything in the syslog (ok, I used
> printk(KERN_ERR ...)
syslog or dmesg?
Thanks,
//richard
------------------------------------------------------------------------------
October Webinars: Code for Performance
Free Intel webinars can help you accelerate application performance.
Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from
the latest Intel processors and coprocessors. See abstracts and register >
http://pubads.g.doubleclick.net/gampad/clk?id=60134791&iu=/4140/ostg.clktrk
_______________________________________________
User-mode-linux-devel mailing list
User-mode-linux-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel
WARNING: multiple messages have this Message-ID (diff)
From: Richard Weinberger <richard@nod.at>
To: "Toralf Förster" <toralf.foerster@gmx.de>
Cc: trinity@vger.kernel.org,
UML devel <user-mode-linux-devel@lists.sourceforge.net>
Subject: Re: [uml-devel] BUG: soft lockup for a user mode linux image
Date: Fri, 04 Oct 2013 07:52:58 +0200 [thread overview]
Message-ID: <524E57BA.805@nod.at> (raw)
In-Reply-To: <524DC675.4020201@gmx.de>
Am 03.10.2013 21:33, schrieb Toralf Förster:
> On 10/03/2013 09:20 PM, Richard Weinberger wrote:
>> Am 03.10.2013 21:16, schrieb Toralf Förster:
>>> On 10/03/2013 09:04 PM, Richard Weinberger wrote:
>>>> Am 03.10.2013 20:54, schrieb Toralf Förster:
>>>>> On 10/02/2013 09:55 PM, Richard Weinberger wrote:
>>>>>> On Wed, Oct 2, 2013 at 8:30 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
>>>>>>> Running trinity (1 process, no victim files, just "$>trinity -C1) for a longer time
>>>>>>> within a 32 bit user mode linux image with a recent git kernel (host: 3.11.3 guest 3.12-rc3-g...)
>>>>>>> yields into this konsole message :
>>>>>>>
>>>>>>> * Starting local
>>>>>>> net.core.warnings = 0 [ ok ]
>>>>>>> BUG: soft lockup - CPU#0 stuck for 23s! [trinity-child0:2031]
>>>>>>>
>>>>>>>
>>>>>>> and at the host t1 of the "linux"-processes eats all CPU cycles at 1 CPU core.
>>>>>>> 2 subsequent made back traces made with
>>>>>>>
>>>>>>> $> sudo gdb /home/tfoerste/devel/linux/linux 28144 -n -batch -ex bt
>>>>>>>
>>>>>>> shows nearly a similar position around __get_user_pages() - both are attached.
>>>>>>>
>>>>>>> I'm not surprised that trinity harms a systems - I'm just wondering whether this particular picture is
>>>>>>> expected or if it points to an issue.
>>>>>>>
>>>>>>>
>>>>>>> FWIW the last lines of trinity log were :
>>>>>>>
>>>>>>>
>>>>>>> [2031] [94] setsid() = 2031
>>>>>>> [2031] [95] setresgid(rgid=0xffff33e3, egid=0xffffff93, sgid=0x22000040) = -1 (Operation not permitted)
>>>>>>> [2031] [96] vmsplice(fd=5, iov=0x85501e0, nr_segs=300, flags=9) = 0x3000
>>>>>>> [2031] [97] setresuid(ruid=0x80549193, euid=0xc61041e0, suid=0xff19b6fa) = -1 (Operation not permitted)
>>>>>>> [2031] [98] setpriority(which=0xff010000, who=0xf3737373, niceval=0x8088960c) = -1 (Invalid argument)
>>>>>>> [2031] [99] socketcall(call=1, args=0x8550200) = -1 (Address family not supported by protocol)
>>>>>>> [2031] [100] access(filename="�", mode=2017) = -1 (Invalid argument)
>>>>>>> [2031] [101] getgroups(gidsetsize=0, grouplist=0x80d0000[page_rand]) = 3
>>>>>>> [2031] [102] msync(start=0xc0100220, len=0, flags=3) = -1 (Invalid argument)
>>>>>>> [2031] [103] sigpending(set=0x40025000) = 0
>>>>>>> [2031] [104] signalfd4(ufd=383, user_mask=1, sizemask=0xa4200000, flags=0x80800) = -1 (Invalid argument)
>>>>>>> [2031] [105] sendfile(out_fd=383, in_fd=382, offset=0, count=4096) = -1 (Invalid argument)
>>>>>>> [2031] [106] fanotify_mark(fanotify_fd=382, flags=5, mask=0x8000023, dfd=382, pathname="/proc/1092/task/1092/fdinfo/68") = -1 (Invalid argument)
>>>>>>> [2031] [107] wait4(upid=1, stat_addr=4, options=0xd761979b, ru=8) = -1 (Invalid argument)
>>>>>>> [2031] [108] sigpending(set=0x80ca000[page_zeros]) = 0
>>>>>>> [2031] [109] setresuid(ruid=0xefffd6fc, euid=0x1bf4c92f, suid=0xffff2e33) = -1 (Operation not permitted)
>>>>>>> [2031] [110] munlock(addr=0x40025000, len=34) = 0
>>>>>>> [2031] [111] timer_delete(timer_id=0xffffffdc) = -1 (Invalid argument)
>>>>>>> [2031] [112] sched_get_priority_max(policy=0x10000040) = -1 (Invalid argument)
>>>>>>> [2031] [113] syslog(type=0xc1000000, buf=1, len=0x82a5) = -1 (Operation not permitted)
>>>>>>> [2031] [114] setpriority(which=0xc4c806c6, who=0xffffff01, niceval=0xffff0682) = -1 (Invalid argument)
>>>>>>> [2031] [115] getgroups16(gidsetsize=0xfffe, grouplist=1) = -1 (Bad address)
>>>>>>> [2031] [116] rename(oldname=4, newname=8) = -1 (Bad address)
>>>>>>> [2031] [117] inotify_init() = 654
>>>>>>> [2031] [118] getgid() = 100
>>>>>>> [2031] [119] fstatat64(dfd=382, filename="/sys/devices/virtual/net/sit0/duplex", statbuf=0, flag=0xb545d727) = -1 (Invalid argument)
>>>>>>> [2031] [120] unlinkat(dfd=382, pathname="/proc/sys/net/ipv4/neigh/default/retrans_time", flag=0xc00ef76) = -1 (Invalid argument)
>>>>>>> [2031] [121] timerfd_create(clockid=0, flags=0) = 655
>>>>>>> [2031] [122] munlock(addr=4, len=0x3fff) = -1 (Cannot allocate memory)
>>>>>>> [2031] [123] fremovexattr(fd=382, name=0) = -1 (Bad address)
>>>>>>> [2031] [124] sched_get_priority_min(policy=0xff58bfef) = -1 (Invalid argument)
>>>>>>> [2031] [125] mq_timedreceive(mqdes=397, u_msg_ptr=4, msg_len=5245, u_msg_prio=0xc0100220, u_abs_timeout=0xc0100220) = -1 (Bad address)
>>>>>>> [2031] [126] chdir(filename="/proc/116/net/ptype") = -1 (Not a directory)
>>>>>>> [2031] [127] ssetmask(newmask=0x88000092) = 0
>>>>>>> [2031] [128] statfs(pathname="/proc/6/mounts", buf=0) = -1 (Bad address)
>>>>>>> [2031] [129] fchown16(fd=397, user=104, group=0x94100000) = -1 (Operation not permitted)
>>>>>>> [2031] [130] fchdir(fd=397) = -1 (Not a directory)
>>>>>>> [2031] [131] mkdir(pathname="/proc/1092/task/1092/fdinfo/316", mode=525) = -1 (File exists)
>>>>>>> [2031] [132] fsetxattr(fd=386, name=0x856f158, value=0x8571160, size=0, flags=0) = -1 (Numerical result out of range)
>>>>>>> [2031] [133] io_setup(nr_events=4095, ctxp=0x40266000) ^CKilled by signal 2.
>>>>>>
>>>>>> Reading your gdb backtraces show that schedule_timeout() got called
>>>>>> with a negative value.
>>>>>> Looks like an integer overflow.
>>>>>> The soft-lockup might also origin from that (very big integer which
>>>>>> did not overflow jet)
>>>>>>
>>>>>
>>>>> If the culprit is solved by this patch I'd like to send it out. But I'm
>>>>> unsure whether it catches the culprit or if it just covers the root cause.
>>>>
>>>> I fear your Patch will not fix the issue.
>>>>
>>>> Does the issue only trigger on 32bit UMLs?
>>> No diea, I do only have a 32 bit system here (both host and client).
>>>
>>>> How long does it take till trinity hits it?
>>> a command like
>>>
>>> $> ssh tfoerste@trinity "rm -rf t3; mkdir t3; cd t3; trinity -C4"
>>>
>>> usually needs 10 till 15 min to trigger the issue. With just 1 trinity
>>> task (-C1) however it needs often a hour or more.
>>
>> That's good. :-)
>> You can place some printk()s into balance_dirty_pages() and observe the values
>> of period, max_pause, min_pause, etc...
>> Maybe this will give us a clue.
>
> Will do it again. I did it before already but period was the only thing
> which becames otherwise negative -
> and because pages_dirtied was often in the range of 20 - 30 and perioad
> around -3000
Did you print it before or after
if (current->dirty_paused_when)
pause -= now - current->dirty_paused_when;
I'm also wondering why
if (pause < min_pause) {
did not trigger or did it?
> I just wondered if HZ is casts as a short int in the calculation ?
> BTW printk doesn't showed anything in the syslog (ok, I used
> printk(KERN_ERR ...)
syslog or dmesg?
Thanks,
//richard
next prev parent reply other threads:[~2013-10-04 5:53 UTC|newest]
Thread overview: 64+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-10-02 18:30 [uml-devel] BUG: soft lockup for a user mode linux image Toralf Förster
2013-10-02 18:30 ` Toralf Förster
2013-10-02 18:31 ` [uml-devel] " Toralf Förster
2013-10-02 18:31 ` Toralf Förster
2013-10-02 19:55 ` [uml-devel] " Richard Weinberger
2013-10-02 19:55 ` Richard Weinberger
2013-10-03 18:54 ` Toralf Förster
2013-10-03 18:54 ` Toralf Förster
2013-10-03 19:04 ` Richard Weinberger
2013-10-03 19:04 ` Richard Weinberger
2013-10-03 19:16 ` [uml-devel] " Toralf Förster
2013-10-03 19:16 ` Toralf Förster
2013-10-03 19:20 ` Richard Weinberger
2013-10-03 19:20 ` Richard Weinberger
2013-10-03 19:33 ` Toralf Förster
2013-10-03 19:33 ` Toralf Förster
2013-10-04 5:52 ` Richard Weinberger [this message]
2013-10-04 5:52 ` Richard Weinberger
2013-10-06 14:17 ` Toralf Förster
2013-10-06 14:17 ` Toralf Förster
2013-10-06 18:38 ` Geert Uytterhoeven
2013-10-06 18:38 ` Geert Uytterhoeven
2013-10-06 20:08 ` Toralf Förster
2013-10-06 20:26 ` Geert Uytterhoeven
2013-10-06 21:01 ` Toralf Förster
2013-10-08 20:07 ` Geert Uytterhoeven
2013-10-08 20:07 ` Geert Uytterhoeven
2013-10-09 17:26 ` Toralf Förster
2013-10-09 18:43 ` Richard Weinberger
2013-10-09 18:43 ` Richard Weinberger
2013-10-09 21:47 ` Jan Kara
2013-10-09 21:47 ` Jan Kara
2013-10-09 22:33 ` Richard Weinberger
2013-10-09 22:33 ` Richard Weinberger
2013-10-09 22:33 ` Richard Weinberger
2013-10-10 16:49 ` Toralf Förster
2013-10-10 16:49 ` Toralf Förster
2013-10-10 16:49 ` Toralf Förster
2013-10-11 1:16 ` Fengguang Wu
2013-10-11 1:16 ` Fengguang Wu
2013-10-11 8:42 ` Toralf Förster
2013-10-11 8:42 ` Toralf Förster
2013-10-11 8:57 ` Fengguang Wu
2013-10-11 8:57 ` Fengguang Wu
2013-10-11 8:57 ` Fengguang Wu
2013-10-11 9:05 ` Fengguang Wu
2013-10-11 9:05 ` Fengguang Wu
2013-10-11 14:12 ` Toralf Förster
2013-10-11 14:12 ` Toralf Förster
2013-10-11 14:12 ` Toralf Förster
2013-10-12 0:43 ` [PATCH] writeback: fix negative bdi max pause Fengguang Wu
2013-10-12 0:43 ` Fengguang Wu
2013-10-12 4:45 ` [PATCH v2] " Fengguang Wu
2013-10-12 4:45 ` Fengguang Wu
2013-10-14 12:34 ` Jan Kara
2013-10-14 12:34 ` Jan Kara
2013-10-10 2:46 ` [uml-devel] BUG: soft lockup for a user mode linux image Fengguang Wu
2013-10-10 2:46 ` Fengguang Wu
2013-10-10 6:52 ` Geert Uytterhoeven
2013-10-10 6:52 ` Geert Uytterhoeven
2013-10-10 7:03 ` Fengguang Wu
2013-10-10 7:03 ` Fengguang Wu
2013-10-08 19:56 ` Toralf Förster
2013-10-09 10:35 ` stian
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=524E57BA.805@nod.at \
--to=richard@nod.at \
--cc=toralf.foerster@gmx.de \
--cc=trinity@vger.kernel.org \
--cc=user-mode-linux-devel@lists.sourceforge.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.