From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kay Sievers
Date: Fri, 01 Oct 2004 08:08:47 +0000
Subject: Re: Hanging udev process on nfs-mounted /dev
Message-Id: <1096618128.4295.47.camel@localhost.localdomain>
List-Id:
References: <415980BF.1020401@bio.ifi.lmu.de>
In-Reply-To: <415980BF.1020401@bio.ifi.lmu.de>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-hotplug@vger.kernel.org

On Fri, 2004-10-01 at 09:38 +0200, Frank Steiner wrote:
> Hi,
>
> here we go :-) On reboot, one of the clients ran into the hanging
> udev process. Although the timeout patch was applied, the hanging
> udev process was not killed.

That's ok. The signal handler does not kill the process. It is just a
timeout to interrupt a system call that is waiting for the kernel. The
tdb code returns unsuccessfully if it catches that timeout. The hanging
udev process is spinning by itself (not blocked in a system call), so
the timeout cannot interrupt it and it will spin forever.

> But it blocked a lot of other processes because there are messages
> about "timeout reached" in /var/log/messages. I had to reboot the
> PC (the professor's client :-)), but I tried to collect all information
> that might be helpful.

Yes, sure, it is. We're getting closer.

> I've put all the logs on a website. They include /var/log/messages
> from the point where the system booted until it hung, a "ps -aux" output
> while udev was hanging, and the straces for all udev processes started
> during the boot. Recall that I replaced /sbin/udev{start} by
>
> strace -o /var/log/udev.log.`uname -n`.${$} -f /sbin/utest/`basename $0` $@
>
> and moved the original udev and udevstart to /sbin/utest/.
> All the information is here: http://www.bio.ifi.lmu.de/~steiner/udev/
> The udev traces are sorted in "ls -lat" order.
>
> The udev process that was hanging had pid 9700. The matching strace
> is udev.log.noether.9652. After calling "pkill udev" to make the
> host usable again, three straces were changed. Those are listed
> with both versions, so that one can see what happened after killing
> (don't know if this helps). Again, the hanging udev process hung
> after F_SETLKW:
> ...
> 9700 fcntl64(5, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start(8, len=1}) = 0
> 9700 fcntl64(5, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, startt924, len=1}) = 0
> 9700 fcntl64(5, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, startt924, len=1}) = 0
> 9700 fcntl64(5, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start4, len=1}) = 0
> 9700 --- SIGALRM (Alarm clock) @ 0 (0) ---
> 9700 time([1096612648]) = 1096612648
> 9700 rt_sigaction(SIGPIPE, {0x40116ae0, [], SA_RESTORER, 0x40067aa8}, {SIG_DFL}, 8) = 0
> 9700 send(0, "<14>Oct 1 08:37:28 udev: error:"..., 137, 0) = 137
> 9700 rt_sigaction(SIGPIPE, {SIG_DFL}, NULL, 8) = 0
> 9700 sigreturn() = ? (mask now [])

Yes, that's the fault. It seems that this process locks the db-file and
then keeps spinning forever without making any system calls. It's just
a loop inside of the tdb code. It consumed a lot of your CPU:

> root 9688 0.0 0.0 1696 600 ? S< 08:37 0:00 strace -o /var/log/udev.log.noether.9652 -f /sbin/utest/udev scsi_generic
> root 9700 99.9 0.0 1664 604 ? R< 08:37 17:37 /sbin/utest/udev scsi_generic

Thanks,
Kay

_______________________________________________
Linux-hotplug-devel mailing list http://linux-hotplug.sourceforge.net
Linux-hotplug-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-hotplug-devel