From mboxrd@z Thu Jan  1 00:00:00 1970
From: Kay Sievers <kay.sievers@vrfy.org>
Date: Fri, 01 Oct 2004 08:08:47 +0000
Subject: Re: Hanging udev process on nfs-mounted /dev
Message-Id: <1096618128.4295.47.camel@localhost.localdomain>
List-Id: <linux-hotplug.vger.kernel.org>
References: <415980BF.1020401@bio.ifi.lmu.de>
In-Reply-To: <415980BF.1020401@bio.ifi.lmu.de>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-hotplug@vger.kernel.org

On Fri, 2004-10-01 at 09:38 +0200, Frank Steiner wrote:
> Hi,
> 
> here we go :-) On reboot, one of the clients ran into the hanging
> udev process. Although the timeout patch was applied, the hanging
> udev process was not killed.

That's ok. The signal handler does not kill the process. It is just a
timeout to interrupt a system call that is waiting for the kernel. The
tdb code returns unsuccessfully if it catches that timeout. The hanging
udev process, however, is spinning by itself (not blocked in a system
call), so it will keep doing that forever.

> But it blocked a lot of other processes because there are messages
> about "timeout reached" in /var/log/messages. I had to reboot the
> PC (the professor's client :-)), but I tried to collect all information
> that might be helpful.

Yes, sure, it is. We're getting closer.

> I've put all the logs on a website. They include /var/log/messages
> from the point where the system booted until it hung, a "ps -aux" output
> while udev was hanging, and the straces for all udev processes started
> during the boot. Recall that I replaced /sbin/udev{start} by
> 
> strace -o /var/log/udev.log.`uname -n`.${$} -f /sbin/utest/`basename $0` $@
> 
> and moved the original udev and udevstart to /sbin/utest/.
> All the information is here: http://www.bio.ifi.lmu.de/~steiner/udev/
> The udev traces are sorted in "ls -lat" order.
> 
> The udev process that was hanging had pid 9700. The matching strace
> is udev.log.noether.9652. After calling "pkill udev" to make the
> host usable again, three straces were changed. Those are listed
> with both versions, so that one can see what happened after killing
> (don't know if this helps). Again, the hanging udev process hung
> after F_SETLKW:
> ...
> 9700  fcntl64(5, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start(8, len=1}) = 0
> 9700  fcntl64(5, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, startt924, len=1}) = 0
> 9700  fcntl64(5, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, startt924, len=1}) = 0
> 9700  fcntl64(5, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start4, len=1}) = 0
> 9700  --- SIGALRM (Alarm clock) @ 0 (0) ---
> 9700  time([1096612648])                = 1096612648
> 9700  rt_sigaction(SIGPIPE, {0x40116ae0, [], SA_RESTORER, 0x40067aa8}, {SIG_DFL}, 8) = 0
> 9700  send(0, "<14>Oct  1 08:37:28 udev: error:"..., 137, 0) = 137
> 9700  rt_sigaction(SIGPIPE, {SIG_DFL}, NULL, 8) = 0
> 9700  sigreturn()                       = ? (mask now [])

Yes, that's the fault. It seems that this process locks the db file and
then keeps spinning forever without making any system calls. It's just a
loop inside the tdb code. It consumed a lot of your CPU:

> root      9688  0.0  0.0  1696  600 ?        S<   08:37   0:00 strace -o /var/log/udev.log.noether.9652 -f /sbin/utest/udev scsi_generic
> root      9700 99.9  0.0  1664  604 ?        R<   08:37  17:37 /sbin/utest/udev scsi_generic

Thanks,
Kay

