* Re: [Bugme-new] [Bug 12201] New: long wait in call_usermodehelper() / queue_work() / wait_for_completion() [not found] <bug-12201-10286@http.bugzilla.kernel.org/> @ 2008-12-11 22:37 ` Andrew Morton 2008-12-11 22:48 ` Kay Sievers ` (2 more replies) 0 siblings, 3 replies; 7+ messages in thread From: Andrew Morton @ 2008-12-11 22:37 UTC (permalink / raw) To: mike; +Cc: bugme-daemon, netdev, linux-kernel, Al Viro (switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Thu, 11 Dec 2008 14:15:21 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=12201 > > Summary: long wait in call_usermodehelper() / queue_work() / > wait_for_completion() > Product: Process Management > Version: 2.5 > KernelVersion: 2.6.26.8 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Other > AssignedTo: process_other@kernel-bugs.osdl.org > ReportedBy: mike@nauticaltech.com > > > Latest working kernel version: None > Earliest failing kernel version: 2.6.26 (can't test any older) It'd be great if you could test something more recent please. > Distribution: CentOS 5 > Hardware Environment: Sun x4450, 16-cores, 128GB of RAM > Software Environment: CentOS 5 + Apache webserver > Problem Description: > My problem started with SSH using the audit library, and my kernel not having > AUDIT support. During strace, any call to socket(PF_NETLINK, SOCK_RAW, > NETLINK_AUDIT) took 1-2 seconds to return. During this time, sys% was high. Well this is bad. We don't want the kernel calling out to userspace each time you run socket(PF_NETLINK, ...). The performance could be awful. I don't know if this is a net problem, an audit problem or whatever. Probably the offending kernel code simply shouldn't exist if CONFIG_AUDIT=n. Please attach a copy of the config to http://bugzilla.kernel.org/show_bug.cgi?id=12201 > As I continued to dig deeper (using lots of printks), I found that these delays > were caused by the netlink_create() code calling request_module() to find/load > a module for AUDIT support which doesn't exist. > > Continuing to dig, I found that request_module() uses call_usermodehelper() to > run /sbin/modprobe to find/load the module. > > The farthest I got is that after the process is created, we call > wait_for_completion() to get the result of that process. This waiting process > takes 1-2 seconds. > > The big problem in troubleshooting here is that this only starts to happen > after the server has been online for a while (10 days maybe) and serving lots > of traffic. The delay gradually builds up and maxes out at around 2 seconds. > > If I manually call /sbin/modprobe on the commandline and provide it the same > arguments that call_usermodehelper() uses, the command returns instantly 100% > of the time (assuming server has been on for a while). > > If I write a small pilot program that calls socket(PF_NETLINK, SOCK_RAW, > NETLINK_AUDIT), it will delay by 1-2 seconds 100% of the time (assuming server > has been online for a while). Certain protocol types given to socket() have > zero delay (because no module needs to be loaded). > > Steps to reproduce: > Once server has been online for a while, a simple call to socket(PF_NETLINK, > SOCK_RAW, NETLINK_AUDIT) shows the problem. OK, weird. Please get sysrq working then get us a task trace, so we can see who is sleeping where. Do this: - run your "small pilot program" - wait one second (so we catch it while it is delaying) - echo t > /proc/sysrq-trigger - dmesg -s 1000000 > foo - send us a copy of foo. (foo will be large, so it would be best to attach it to http://bugzilla.kernel.org/show_bug.cgi?id=12201 then email us the URL) (Using `echo w > /proc/sysrq-trigger' would work too, if we know that the offending processes are stuck in D state. It will produce less output) ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Bugme-new] [Bug 12201] New: long wait in call_usermodehelper() / queue_work() / wait_for_completion() 2008-12-11 22:37 ` [Bugme-new] [Bug 12201] New: long wait in call_usermodehelper() / queue_work() / wait_for_completion() Andrew Morton @ 2008-12-11 22:48 ` Kay Sievers 2008-12-12 19:47 ` Michael Spiegle 2008-12-11 22:51 ` Evgeniy Polyakov 2008-12-12 19:42 ` Michael Spiegle 2 siblings, 1 reply; 7+ messages in thread From: Kay Sievers @ 2008-12-11 22:48 UTC (permalink / raw) To: Andrew Morton; +Cc: mike, bugme-daemon, netdev, linux-kernel, Al Viro On Thu, Dec 11, 2008 at 23:37, Andrew Morton <akpm@linux-foundation.org> wrote: >> As I continued to dig deeper (using lots of printks), I found that these delays >> were caused by the netlink_create() code calling request_module() to find/load >> a module for AUDIT support which doesn't exist. >> >> Continuing to dig, I found that request_module() uses call_usermodehelper() to >> run /sbin/modprobe to find/load the module. >> >> The farthest I got is that after the process is created, we call >> wait_for_completion() to get the result of that process. This waiting process >> takes 1-2 seconds. >> >> The big problem in troubleshooting here is that this only starts to happen >> after the server has been online for a while (10 days maybe) and serving lots >> of traffic. The delay gradually builds up and maxes out at around 2 seconds. >> >> If I manually call /sbin/modprobe on the commandline and provide it the same >> arguments that call_usermodehelper() uses, the command returns instantly 100% >> of the time (assuming server has been on for a while). >> >> If I write a small pilot program that calls socket(PF_NETLINK, SOCK_RAW, >> NETLINK_AUDIT), it will delay by 1-2 seconds 100% of the time (assuming server >> has been online for a while). Certain protocol types given to socket() have >> zero delay (because no module needs to be loaded). >> >> Steps to reproduce: >> Once server has been online for a while, a simple call to socket(PF_NETLINK, >> SOCK_RAW, NETLINK_AUDIT) shows the problem. If you replace /sbin/modprobe in the kernel module loader, does the delay go away: echo /bin/true > /proc/sys/kernel/modprobe ? Kay ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Bugme-new] [Bug 12201] New: long wait in call_usermodehelper() / queue_work() / wait_for_completion() 2008-12-11 22:48 ` Kay Sievers @ 2008-12-12 19:47 ` Michael Spiegle 0 siblings, 0 replies; 7+ messages in thread From: Michael Spiegle @ 2008-12-12 19:47 UTC (permalink / raw) To: Kay Sievers; +Cc: Andrew Morton, bugme-daemon, netdev, linux-kernel, Al Viro > If you replace /sbin/modprobe in the kernel module loader, does the > delay go away: > echo /bin/true > /proc/sys/kernel/modprobe Yes, that does reduce the delay significantly. The timing of the socket() call went from about 1500000us to 430000us. This points the finger at modprobe it seems? Michael Spiegle mike@nauticaltech.com ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Bugme-new] [Bug 12201] New: long wait in call_usermodehelper() / queue_work() / wait_for_completion() 2008-12-11 22:37 ` [Bugme-new] [Bug 12201] New: long wait in call_usermodehelper() / queue_work() / wait_for_completion() Andrew Morton 2008-12-11 22:48 ` Kay Sievers @ 2008-12-11 22:51 ` Evgeniy Polyakov 2008-12-11 23:05 ` Evgeniy Polyakov 2008-12-12 19:42 ` Michael Spiegle 2 siblings, 1 reply; 7+ messages in thread From: Evgeniy Polyakov @ 2008-12-11 22:51 UTC (permalink / raw) To: Andrew Morton; +Cc: mike, bugme-daemon, netdev, linux-kernel, Al Viro Hi. On Thu, Dec 11, 2008 at 02:37:58PM -0800, Andrew Morton (akpm@linux-foundation.org) wrote: > > If I write a small pilot program that calls socket(PF_NETLINK, SOCK_RAW, > > NETLINK_AUDIT), it will delay by 1-2 seconds 100% of the time (assuming server > > has been online for a while). Certain protocol types given to socket() have > > zero delay (because no module needs to be loaded). > > > > Steps to reproduce: > > Once server has been online for a while, a simple call to socket(PF_NETLINK, > > SOCK_RAW, NETLINK_AUDIT) shows the problem. > > OK, weird. > > Please get sysrq working then get us a task trace, so we can see who is > sleeping where. Do this: > > > - run your "small pilot program" Or send it to us and we will quickly fix the bug :) Thank you! -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Bugme-new] [Bug 12201] New: long wait in call_usermodehelper() / queue_work() / wait_for_completion() 2008-12-11 22:51 ` Evgeniy Polyakov @ 2008-12-11 23:05 ` Evgeniy Polyakov 2008-12-12 19:51 ` Michael Spiegle 0 siblings, 1 reply; 7+ messages in thread From: Evgeniy Polyakov @ 2008-12-11 23:05 UTC (permalink / raw) To: Andrew Morton; +Cc: mike, bugme-daemon, netdev, linux-kernel, Al Viro On Fri, Dec 12, 2008 at 01:51:58AM +0300, Evgeniy Polyakov (zbr@ioremap.net) wrote: > > > If I write a small pilot program that calls socket(PF_NETLINK, SOCK_RAW, > > > NETLINK_AUDIT), it will delay by 1-2 seconds 100% of the time (assuming server > > > has been online for a while). Certain protocol types given to socket() have > > > zero delay (because no module needs to be loaded). > > > > > > Steps to reproduce: > > > Once server has been online for a while, a simple call to socket(PF_NETLINK, > > > SOCK_RAW, NETLINK_AUDIT) shows the problem. > > > > OK, weird. > > > > Please get sysrq working then get us a task trace, so we can see who is > > sleeping where. Do this: > > > > > > - run your "small pilot program" > > Or send it to us and we will quickly fix the bug :) > > Thank you! Btw, does this ugly hack fix the problem? According to your description it may be the case. diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c index 480184a..2b64eb5 100644 --- a/net/netlink/af_netlink.c +++ b/net/netlink/af_netlink.c @@ -434,6 +434,10 @@ static int netlink_create(struct net *net, struct socket *sock, int protocol) if (protocol < 0 || protocol >= MAX_LINKS) return -EPROTONOSUPPORT; +#ifndef CONFIG_AUDIT + if (protocol == NETLINK_AUDIT) + return -EPROTONOSUPPORT; +#endif netlink_lock_table(); #ifdef CONFIG_MODULES if (!nl_table[protocol].registered) { -- Evgeniy Polyakov ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [Bugme-new] [Bug 12201] New: long wait in call_usermodehelper() / queue_work() / wait_for_completion() 2008-12-11 23:05 ` Evgeniy Polyakov @ 2008-12-12 19:51 ` Michael Spiegle 0 siblings, 0 replies; 7+ messages in thread From: Michael Spiegle @ 2008-12-12 19:51 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Andrew Morton, bugme-daemon, netdev, linux-kernel, Al Viro > Btw, does this ugly hack fix the problem? > According to your description it may be the case. > > diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c > index 480184a..2b64eb5 100644 > --- a/net/netlink/af_netlink.c > +++ b/net/netlink/af_netlink.c > @@ -434,6 +434,10 @@ static int netlink_create(struct net *net, struct socket *sock, int protocol) > if (protocol < 0 || protocol >= MAX_LINKS) > return -EPROTONOSUPPORT; > > +#ifndef CONFIG_AUDIT > + if (protocol == NETLINK_AUDIT) > + return -EPROTONOSUPPORT; > +#endif > netlink_lock_table(); > #ifdef CONFIG_MODULES > if (!nl_table[protocol].registered) { > > Yes, that would take care of it. My original workaround was to disable CONFIG_KMOD. I only have a few modules compiled (for servers that require those modules), but most of my servers don't need any modules. This would allow me to manually load modules myself and never have to worry about the kernel forking stuff in userspace. Unfortunately, it appears that CONFIG_KMOD is being deprecated soon and I didn't want to rely on it. What is the core reason for removing CONFIG_KMOD? I personally would rather have the choice as to whether my kernel tries to find its own modules or not. Michael Spiegle mike@nauticaltech.com ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Bugme-new] [Bug 12201] New: long wait in call_usermodehelper() / queue_work() / wait_for_completion() 2008-12-11 22:37 ` [Bugme-new] [Bug 12201] New: long wait in call_usermodehelper() / queue_work() / wait_for_completion() Andrew Morton 2008-12-11 22:48 ` Kay Sievers 2008-12-11 22:51 ` Evgeniy Polyakov @ 2008-12-12 19:42 ` Michael Spiegle 2 siblings, 0 replies; 7+ messages in thread From: Michael Spiegle @ 2008-12-12 19:42 UTC (permalink / raw) To: Andrew Morton; +Cc: bugme-daemon, netdev, linux-kernel, Al Viro > It'd be great if you could test something more recent please. Will do. It takes a little bit of time to reproduce, so I'll respond now with information from my current kernel and update once the new kernel is running. > Please attach a copy of the config to > http://bugzilla.kernel.org/show_bug.cgi?id=12201 Done: http://bugzilla.kernel.org/attachment.cgi?id=19263 > Please get sysrq working then get us a task trace, so we can see who is > sleeping where. Do this: Done: http://bugzilla.kernel.org/attachment.cgi?id=19264 I wanted to get the call to /sbin/modprobe in there as well, so I had to do it about 10 times before I could catch it. I also stopped unimportant processes on that box to eliminate additional variables. There is one other variable I didn't mention. The machine is diskless and running on an nfs root. I have tried to eliminate this variable via simple testing (just running /sbin/modprobe at the commandline returns instantly), but wanted to mention it. The box was 100% idle before starting the pilot program and running the task trace. I added gettimeofday() around my socket() call and got about 1290000us - 1490000us on average. Using a socket() call with protocol NETLINK_ROUTE returns in about 5us - 8us. Certain other protocol types like NETLINK_FIREWALL also have the same long delay. Michael Spiegle mike@nauticaltech.com ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2008-12-12 19:53 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <bug-12201-10286@http.bugzilla.kernel.org/>
2008-12-11 22:37 ` [Bugme-new] [Bug 12201] New: long wait in call_usermodehelper() / queue_work() / wait_for_completion() Andrew Morton
2008-12-11 22:48 ` Kay Sievers
2008-12-12 19:47 ` Michael Spiegle
2008-12-11 22:51 ` Evgeniy Polyakov
2008-12-11 23:05 ` Evgeniy Polyakov
2008-12-12 19:51 ` Michael Spiegle
2008-12-12 19:42 ` Michael Spiegle
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).