* open() on /dev/tty takes 30 seconds on 2.6.36
@ 2010-10-30 7:47 Eli Billauer
2010-10-30 18:46 ` Andrew Morton
0 siblings, 1 reply; 14+ messages in thread
From: Eli Billauer @ 2010-10-30 7:47 UTC (permalink / raw)
To: linux-kernel
Hello,
I'm sorry about my previous overconclusive email, but there is a real
problem with opening TTYs during a few minutes after a system boots
(Fedora 12 in my case). I'll stick to the facts this time.
The kernel involved is 2.6.36, downloaded a few days ago as "latest
stable kernel" at kernel.org.
Running strace on sshd with -T and -tt flags, hence showing the time
each call took, these two lines were found. The number in brackets is
the time the system call took (not to the time to the next call or
something).
...
21:35:21.131436 open("/dev/tty", O_RDWR|O_NOCTTY) = -1 ENXIO (No such
device or address) <30.022532>
...
21:35:51.175642 open("/dev/pts/0", O_RDWR) = 4 <30.063213>
...
So it took 30 seconds just to fail opening /dev/tty.
I then went on to add printk's in pty.c. Among others, I had:
static int ptmx_open(struct inode *inode, struct file *filp)
{
struct tty_struct *tty;
int retval;
int index;
nonseekable_open(inode, filp);
/* find a device that is not in use. */
printk(KERN_ALERT "34: ptmx_open to lock\n");
tty_lock();
printk(KERN_ALERT "35: ptmx_open locked\n");
[snipped here]
And then found in my /var/log/messages (no log lines between these two):
Oct 29 16:14:58 ocho kernel: 34: ptmx_open to lock
Oct 29 16:15:13 ocho kernel: 35: ptmx_open locked
So we can see it took 15 seconds to acquire a lock in this case. In all
other pairs of lock messages there was no time difference. To me it
looks like 15 seconds in order to acquire a lock in the kernel is a
smoking gun.
Please look into this.
Eli
--
Web: http://www.billauer.co.il
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: open() on /dev/tty takes 30 seconds on 2.6.36 2010-10-30 7:47 open() on /dev/tty takes 30 seconds on 2.6.36 Eli Billauer @ 2010-10-30 18:46 ` Andrew Morton 2010-10-30 20:46 ` Arnd Bergmann 2010-10-31 0:20 ` Eli Billauer 0 siblings, 2 replies; 14+ messages in thread From: Andrew Morton @ 2010-10-30 18:46 UTC (permalink / raw) To: Eli Billauer; +Cc: linux-kernel On Sat, 30 Oct 2010 09:47:26 +0200 Eli Billauer <eli@billauer.co.il> wrote: > Hello, > > I'm sorry about my previous overconclusive email, but there is a real > problem with opening TTYs during a few minutes after a system boots > (Fedora 12 in my case). I'll stick to the facts this time. > > The kernel involved is 2.6.36, downloaded a few days ago as "latest > stable kernel" at kernel.org. > > Running strace on sshd with -T and -tt flags, hence showing the time > each call took, these two lines were found. The number in brackets is > the time the system call took (not to the time to the next call or > something). > > ... > 21:35:21.131436 open("/dev/tty", O_RDWR|O_NOCTTY) = -1 ENXIO (No such > device or address) <30.022532> > ... > 21:35:51.175642 open("/dev/pts/0", O_RDWR) = 4 <30.063213> > ... > > So it took 30 seconds just to fail opening /dev/tty. > > I then went on to add printk's in pty.c. Among others, I had: > > static int ptmx_open(struct inode *inode, struct file *filp) > { > struct tty_struct *tty; > int retval; > int index; > > nonseekable_open(inode, filp); > > /* find a device that is not in use. */ > printk(KERN_ALERT "34: ptmx_open to lock\n"); > tty_lock(); > printk(KERN_ALERT "35: ptmx_open locked\n"); > > [snipped here] > > And then found in my /var/log/messages (no log lines between these two): > Oct 29 16:14:58 ocho kernel: 34: ptmx_open to lock > Oct 29 16:15:13 ocho kernel: 35: ptmx_open locked > > So we can see it took 15 seconds to acquire a lock in this case. In all > other pairs of lock messages there was no time difference. To me it > looks like 15 seconds in order to acquire a lock in the kernel is a > smoking gun. > Odd. Presumably someone else was holding big_tty_mutex for 15 seconds. We can find out who, with the sysrq-d command if you have the time please. You'll need to enable lockdep and magic sysrq in .config. Then run `dmesg -n 8' so all messages get printed by the kernel and then, in the middle of that 15-second delay, do echo d > /proc/sysrq-trigger I'll confess that I've never used sysrq-d and am unsure what the output looks like. Fingers crossed! ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: open() on /dev/tty takes 30 seconds on 2.6.36 2010-10-30 18:46 ` Andrew Morton @ 2010-10-30 20:46 ` Arnd Bergmann 2010-10-31 0:20 ` Eli Billauer 1 sibling, 0 replies; 14+ messages in thread From: Arnd Bergmann @ 2010-10-30 20:46 UTC (permalink / raw) To: Andrew Morton; +Cc: Eli Billauer, linux-kernel On Saturday 30 October 2010, Andrew Morton wrote: > > static int ptmx_open(struct inode *inode, struct file *filp) > > { > > struct tty_struct *tty; > > int retval; > > int index; > > > > nonseekable_open(inode, filp); > > > > /* find a device that is not in use. */ > > printk(KERN_ALERT "34: ptmx_open to lock\n"); > > tty_lock(); > > printk(KERN_ALERT "35: ptmx_open locked\n"); > > > > [snipped here] > > > > And then found in my /var/log/messages (no log lines between these two): > > Oct 29 16:14:58 ocho kernel: 34: ptmx_open to lock > > Oct 29 16:15:13 ocho kernel: 35: ptmx_open locked > > > > So we can see it took 15 seconds to acquire a lock in this case. In all > > other pairs of lock messages there was no time difference. To me it > > looks like 15 seconds in order to acquire a lock in the kernel is a > > smoking gun. Agreed. When I changed the locking to use a mutex instead of the BKL, I intentionally left it locked across sleeping functions, but none of them are supposed to sleep for multiple seconds. > > Odd. Presumably someone else was holding big_tty_mutex for 15 seconds. > > We can find out who, with the sysrq-d command if you have the time > please. You'll need to enable lockdep and magic sysrq in .config. > Then run `dmesg -n 8' so all messages get printed by the kernel and > then, in the middle of that 15-second delay, do > > echo d > /proc/sysrq-trigger > > I'll confess that I've never used sysrq-d and am unsure what the output > looks like. Fingers crossed! With a little luck, lockdep by itself might even be able to find an inconsistency in the BTM use and warn right away about it. I found a number of bugs in the TTY code just by running with lockdep in the mutex version. Arnd ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: open() on /dev/tty takes 30 seconds on 2.6.36 2010-10-30 18:46 ` Andrew Morton 2010-10-30 20:46 ` Arnd Bergmann @ 2010-10-31 0:20 ` Eli Billauer 2010-10-31 3:36 ` Arnd Bergmann 1 sibling, 1 reply; 14+ messages in thread From: Eli Billauer @ 2010-10-31 0:20 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel Andrew Morton wrote: > Odd. Presumably someone else was holding big_tty_mutex for 15 seconds. > > We can find out who, with the sysrq-d command if you have the time > please. You'll need to enable lockdep and magic sysrq in .config. > Then run `dmesg -n 8' so all messages get printed by the kernel and > then, in the middle of that 15-second delay, do > > echo d > /proc/sysrq-trigger > > I'll confess that I've never used sysrq-d and am unsure what the output > looks like. Fingers crossed! > First, I'll explain what I did: Since I can't login while this problem occurs, I had to modify sshd's service script to start sshd with strace, and also kick off a script (called who-is-locking.sh), which creates that sysrq call every five seconds. This could also be the place to mention, that I got an NULL dereference oops during one of these (IP at get_usage_chars+0x0/0x10e, while in the who-is-locking process). But that happened after the desired data was logged. Now, let's look at the interesting point of the strace (of sshd): 01:23:37.156475 open("/dev/tty", O_RDWR|O_NOCTTY) = -1 ENXIO (No such device or address) <29.999444> 01:24:07.156354 setsid() = 2336 <0.000167> The same open() taking 30 seconds, ending at 01:24:07. In /var/log/messages (below) we have Oct 31 01:24:07 ocho modem-manager: (ttyS2) opening serial device... Oct 31 01:24:07 ocho modem-manager: (ttyS3) closing serial device... The same things happens again, when ssh opens /dev/pts/0, which takes 30 seconds again, with the end of operation coinciding with modem-manager opening a tty device. As far as I know, there is only one real serial port on my computer. Maybe this snippet from my boot log may be of interest: Oct 31 01:14:28 ocho kernel: Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled Oct 31 01:14:28 ocho kernel: serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A Oct 31 01:14:28 ocho kernel: 00:07: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A Oct 31 01:14:28 ocho kernel: serial 0000:05:04.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16 Oct 31 01:14:28 ocho kernel: 0000:05:04.0: ttyS1 at I/O 0xbe20 (irq = 16) is a 8250 Oct 31 01:14:28 ocho kernel: 0000:05:04.0: ttyS2 at I/O 0xbe28 (irq = 16) is a 8250 Oct 31 01:14:28 ocho kernel: 0000:05:04.0: ttyS3 at I/O 0xbe40 (irq = 16) is a 8250 Oct 31 01:14:28 ocho kernel: Couldn't register serial port 0000:05:04.0: -28 Below is /var/log/message dump for the relevant time interval. I can submit more if necessary, but I think the picture is pretty clear. Hope this helps, Eli Oct 31 01:23:59 ocho kernel: ============================================= Oct 31 01:23:59 ocho kernel: Oct 31 01:24:04 ocho kernel: SysRq : Show Locks Held Oct 31 01:24:04 ocho kernel: Oct 31 01:24:04 ocho kernel: Showing all locks held in the system: Oct 31 01:24:04 ocho kernel: 2 locks held by modem-manager/1835: Oct 31 01:24:04 ocho kernel: #0: (big_tty_mutex){+.+.+.}, at: [<ffffffff814b9c09>] tty_lock+0x3f/0x4a Oct 31 01:24:04 ocho kernel: #1: (&port->mutex){+.+.+.}, at: [<ffffffff812fcace>] uart_close+0x82/0x312 Oct 31 01:24:04 ocho kernel: 2 locks held by who-is-locking./1850: Oct 31 01:24:04 ocho kernel: #0: (sysrq_key_table_lock){......}, at: [<ffffffff812e97de>] __handle_sysrq+0x28/0x15b Oct 31 01:24:04 ocho kernel: #1: (tasklist_lock){.?.+..}, at: [<ffffffff810798a5>] debug_show_all_locks+0x4c/0x178 Oct 31 01:24:04 ocho kernel: 1 lock held by gpm/1971: Oct 31 01:24:04 ocho kernel: #0: (tty_mutex){+.+.+.}, at: [<ffffffff812d5903>] tty_open+0x5a/0x4ac Oct 31 01:24:04 ocho kernel: 1 lock held by mingetty/2125: Oct 31 01:24:04 ocho kernel: #0: (&tty->atomic_read_lock){+.+...}, at: [<ffffffff812d7648>] n_tty_read+0x281/0x7b4 Oct 31 01:24:04 ocho kernel: 1 lock held by mingetty/2126: Oct 31 01:24:04 ocho kernel: #0: (&tty->atomic_read_lock){+.+...}, at: [<ffffffff812d7648>] n_tty_read+0x281/0x7b4 Oct 31 01:24:04 ocho kernel: 1 lock held by mingetty/2127: Oct 31 01:24:04 ocho kernel: #0: (&tty->atomic_read_lock){+.+...}, at: [<ffffffff812d7648>] n_tty_read+0x281/0x7b4 Oct 31 01:24:04 ocho kernel: 1 lock held by mingetty/2128: Oct 31 01:24:04 ocho kernel: #0: (&tty->atomic_read_lock){+.+...}, at: [<ffffffff812d7648>] n_tty_read+0x281/0x7b4 Oct 31 01:24:04 ocho kernel: 1 lock held by mingetty/2129: Oct 31 01:24:04 ocho kernel: #0: (&tty->atomic_read_lock){+.+...}, at: [<ffffffff812d7648>] n_tty_read+0x281/0x7b4 Oct 31 01:24:04 ocho kernel: 2 locks held by sshd/2336: Oct 31 01:24:04 ocho kernel: #0: (tty_mutex){+.+.+.}, at: [<ffffffff812d5903>] tty_open+0x5a/0x4ac Oct 31 01:24:04 ocho kernel: #1: (big_tty_mutex){+.+.+.}, at: [<ffffffff814b9c09>] tty_lock+0x3f/0x4a Oct 31 01:24:04 ocho kernel: Oct 31 01:24:04 ocho kernel: ============================================= Oct 31 01:24:04 ocho kernel: Oct 31 01:24:07 ocho modem-manager: (ttyS2) opening serial device... Oct 31 01:24:07 ocho modem-manager: (ttyS3) closing serial device... Oct 31 01:24:09 ocho kernel: SysRq : Show Locks Held Oct 31 01:24:09 ocho kernel: Oct 31 01:24:09 ocho kernel: Showing all locks held in the system: Oct 31 01:24:09 ocho kernel: 2 locks held by modem-manager/1835: Oct 31 01:24:09 ocho kernel: #0: (big_tty_mutex){+.+.+.}, at: [<ffffffff814b9c09>] tty_lock+0x3f/0x4a Oct 31 01:24:09 ocho kernel: #1: (&port->mutex){+.+.+.}, at: [<ffffffff812fcace>] uart_close+0x82/0x312 Oct 31 01:24:09 ocho kernel: 2 locks held by who-is-locking./1850: Oct 31 01:24:09 ocho kernel: #0: (sysrq_key_table_lock){......}, at: [<ffffffff812e97de>] __handle_sysrq+0x28/0x15b Oct 31 01:24:09 ocho kernel: #1: (tasklist_lock){.?.+..}, at: [<ffffffff810798a5>] debug_show_all_locks+0x4c/0x178 Oct 31 01:24:09 ocho kernel: 1 lock held by gpm/1971: Oct 31 01:24:09 ocho kernel: #0: (tty_mutex){+.+.+.}, at: [<ffffffff812d5903>] tty_open+0x5a/0x4ac Oct 31 01:24:09 ocho kernel: 1 lock held by mingetty/2125: Oct 31 01:24:09 ocho kernel: #0: (&tty->atomic_read_lock){+.+...}, at: [<ffffffff812d7648>] n_tty_read+0x281/0x7b4 Oct 31 01:24:09 ocho kernel: 1 lock held by mingetty/2126: Oct 31 01:24:09 ocho kernel: #0: (&tty->atomic_read_lock){+.+...}, at: [<ffffffff812d7648>] n_tty_read+0x281/0x7b4 Oct 31 01:24:09 ocho kernel: 1 lock held by mingetty/2127: Oct 31 01:24:09 ocho kernel: #0: (&tty->atomic_read_lock){+.+...}, at: [<ffffffff812d7648>] n_tty_read+0x281/0x7b4 Oct 31 01:24:09 ocho kernel: 1 lock held by mingetty/2128: Oct 31 01:24:09 ocho kernel: #0: (&tty->atomic_read_lock){+.+...}, at: [<ffffffff812d7648>] n_tty_read+0x281/0x7b4 Oct 31 01:24:09 ocho kernel: 1 lock held by mingetty/2129: Oct 31 01:24:09 ocho kernel: #0: (&tty->atomic_read_lock){+.+...}, at: [<ffffffff812d7648>] n_tty_read+0x281/0x7b4 Oct 31 01:24:09 ocho kernel: 2 locks held by sshd/2336: Oct 31 01:24:09 ocho kernel: #0: (tty_mutex){+.+.+.}, at: [<ffffffff812d5903>] tty_open+0x5a/0x4ac Oct 31 01:24:09 ocho kernel: #1: (big_tty_mutex){+.+.+.}, at: [<ffffffff814b9c09>] tty_lock+0x3f/0x4a Oct 31 01:24:09 ocho kernel: Oct 31 01:24:09 ocho kernel: ============================================= Oct 31 01:24:09 ocho kernel: -- Web: http://www.billauer.co.il ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: open() on /dev/tty takes 30 seconds on 2.6.36 2010-10-31 0:20 ` Eli Billauer @ 2010-10-31 3:36 ` Arnd Bergmann 2010-10-31 6:34 ` James Cloos 2010-11-01 1:12 ` Eli Billauer 0 siblings, 2 replies; 14+ messages in thread From: Arnd Bergmann @ 2010-10-31 3:36 UTC (permalink / raw) To: Eli Billauer; +Cc: Andrew Morton, linux-kernel On Sunday 31 October 2010, Eli Billauer wrote: > Andrew Morton wrote: > > First, I'll explain what I did: Since I can't login while this problem > occurs, I had to modify sshd's service script to start sshd with strace, > and also kick off a script (called who-is-locking.sh), which creates > that sysrq call every five seconds. This could also be the place to > mention, that I got an NULL dereference oops during one of these (IP at > get_usage_chars+0x0/0x10e, while in the who-is-locking process). But > that happened after the desired data was logged. An oops report is interesting most of the time, except when you know that you caused it with your own kernel changes. If you can still find the oops report, please post it here. > Oct 31 01:14:28 ocho kernel: Serial: 8250/16550 driver, 4 ports, IRQ > sharing enabled > Oct 31 01:14:28 ocho kernel: serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is > a 16550A > Oct 31 01:14:28 ocho kernel: 00:07: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A > Oct 31 01:14:28 ocho kernel: serial 0000:05:04.0: PCI INT A -> GSI 16 > (level, low) -> IRQ 16 > Oct 31 01:14:28 ocho kernel: 0000:05:04.0: ttyS1 at I/O 0xbe20 (irq = > 16) is a 8250 > Oct 31 01:14:28 ocho kernel: 0000:05:04.0: ttyS2 at I/O 0xbe28 (irq = > 16) is a 8250 > Oct 31 01:14:28 ocho kernel: 0000:05:04.0: ttyS3 at I/O 0xbe40 (irq = > 16) is a 8250 > Oct 31 01:14:28 ocho kernel: Couldn't register serial port 0000:05:04.0: -28 For some reason, your kernel finds five ports at boot time, while it's been configured at build time to support only four. > Oct 31 01:23:59 ocho kernel: ============================================= > Oct 31 01:23:59 ocho kernel: > Oct 31 01:24:04 ocho kernel: SysRq : Show Locks Held > Oct 31 01:24:04 ocho kernel: > Oct 31 01:24:04 ocho kernel: Showing all locks held in the system: > Oct 31 01:24:04 ocho kernel: 2 locks held by modem-manager/1835: > Oct 31 01:24:04 ocho kernel: #0: (big_tty_mutex){+.+.+.}, at: > [<ffffffff814b9c09>] tty_lock+0x3f/0x4a > Oct 31 01:24:04 ocho kernel: #1: (&port->mutex){+.+.+.}, at: > [<ffffffff812fcace>] uart_close+0x82/0x312 So modem-manager is the process you need to look at, as it holds the BTM, probably for a long time. It should not do that. First you should verify that when you don't have modem-manager running, the problem goes away. This may indeed be related to the misdetection of a serial port. The other thing to check is what modem-manager is actually doing. You can find out by looking at /proc/<pid>/stack, where <pid> is the pid of the modem-manager process, 1835 in your example. You can also look at <sysrq>+t to get all the stack traces for the running processes and finding the process in there. I think you're closing in on the bug, thanks to your providing exactly the right parts of information. Arnd ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: open() on /dev/tty takes 30 seconds on 2.6.36 2010-10-31 3:36 ` Arnd Bergmann @ 2010-10-31 6:34 ` James Cloos 2010-10-31 11:36 ` Arnd Bergmann 2010-11-01 1:12 ` Eli Billauer 1 sibling, 1 reply; 14+ messages in thread From: James Cloos @ 2010-10-31 6:34 UTC (permalink / raw) To: Arnd Bergmann; +Cc: Eli Billauer, Andrew Morton, linux-kernel [I'm not sure whether this is the same issue, but in case it helps... -JimC] I was about to prepare a bug report about a similar issue which as hit my (normally headless) compute node. It happened once before I first booted 2.6.36, and once running 2.6.36. The last two kernels which I compiled before v2.6.36 were 9c03f16 and bfa88ea; I cannot remember whether I ever booted 9c03f16. I may have compiled v2.6.36 before trying 9c03f16. The box is a Phenom, 64bit kernel and userland. All of the ptys stopped working, which seems like it might be related to a change in tty locking. Everything else worked normally; only pty i/o stopped. I couldn't log in, or use any of my open terminals, but pipes, networking, disk i/o, et al were all OK. I was able to use 'ssh server dmesg' to get the dmesg; it had this to say: WARNING: at kernel/workqueue.c:1180 worker_enter_idle+0xd6/0xe2() Hardware name: MS-7642 Modules linked in: tcp_diag inet_diag ipt_addrtype xt_dscp xt_string xt_owner xt_multiport xt_iprange xt_hashlimit xt_DSCP xt_NFQUEUE xt_mark xt_connmark tun snd_pcm_oss snd_mixer_oss snd_usb_audio snd_usbmidi_lib snd_rawmidi tpm_tis tpm ppdev parport_pc tpm_bios parport serio_raw edac_core k10temp pcspkr i2c_piix4 shpchp Pid: 8061, comm: kworker/0:1 Not tainted 2.6.36-carbon1 #18 Call Trace: [<ffffffff81050c60>] warn_slowpath_common+0x85/0x9d [<ffffffff81050c92>] warn_slowpath_null+0x1a/0x1c [<ffffffff81066861>] worker_enter_idle+0xd6/0xe2 [<ffffffff81068453>] worker_thread+0x182/0x19b [<ffffffff810682d1>] ? worker_thread+0x0/0x19b [<ffffffff8106ba81>] kthread+0x82/0x8a [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10 [<ffffffff8106b9ff>] ? kthread+0x0/0x8a [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10 ---[ end trace 756b0818a6415dca ]--- That was followed by a number of task blocked for more than 120 seconds messages, all due to waiting for pty input or output. An example trace: INFO: task gpg:15408 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. gpg D ffff8802507ec380 0 15408 15390 0x00000000 ffff8802cf24fd88 0000000000000082 ffff880200000001 0000000000014b00 0000000000014b00 ffff88010a45ae20 ffff8802cf24ffd8 0000000000014b00 0000000000014b00 0000000000014b00 0000000000014b00 ffff8802cf24ffd8 Call Trace: [<ffffffff815ae2fc>] schedule_timeout+0x36/0xe3 [<ffffffff81047b2b>] ? get_parent_ip+0x11/0x41 [<ffffffff81047b2b>] ? get_parent_ip+0x11/0x41 [<ffffffff815b2b6d>] ? sub_preempt_count+0x97/0xaa [<ffffffff815ad8a5>] wait_for_common+0xab/0x105 [<ffffffff8104c1c4>] ? default_wake_function+0x0/0x14 [<ffffffff81047b2b>] ? get_parent_ip+0x11/0x41 [<ffffffff81065dd1>] ? __need_more_worker+0x15/0x2c [<ffffffff810d39ba>] ? lru_add_drain_per_cpu+0x0/0x10 [<ffffffff815ad9b7>] wait_for_completion+0x1d/0x1f [<ffffffff81066d0d>] flush_work+0x110/0x12e [<ffffffff81066777>] ? wq_barrier_func+0x0/0x14 [<ffffffff81068831>] schedule_on_each_cpu+0xa8/0xd7 [<ffffffff810d3472>] lru_add_drain_all+0x15/0x17 [<ffffffff810e7786>] sys_mlock+0x30/0xdf [<ffffffff81009cc2>] system_call_fastpath+0x16/0x1b (I didn't save the dmesg the first time it happened; I wanted to try .36 before filing a bug report, in case it had already been fixed.) -JimC -- James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: open() on /dev/tty takes 30 seconds on 2.6.36 2010-10-31 6:34 ` James Cloos @ 2010-10-31 11:36 ` Arnd Bergmann 2010-11-02 10:28 ` Tejun Heo 0 siblings, 1 reply; 14+ messages in thread From: Arnd Bergmann @ 2010-10-31 11:36 UTC (permalink / raw) To: James Cloos; +Cc: Eli Billauer, Andrew Morton, linux-kernel, Tejun Heo On Sunday 31 October 2010, James Cloos wrote: > WARNING: at kernel/workqueue.c:1180 worker_enter_idle+0xd6/0xe2() > Hardware name: MS-7642 > Modules linked in: tcp_diag inet_diag ipt_addrtype xt_dscp xt_string xt_owner xt_multiport xt_iprange xt_hashlimit xt_DSCP xt_NFQUEUE xt_mark xt_connmark tun snd_pcm_oss snd_mixer_oss snd_usb_audio snd_usbmidi_lib snd_rawmidi tpm_tis tpm ppdev parport_pc tpm_bios parport serio_raw edac_core k10temp pcspkr i2c_piix4 shpchp > Pid: 8061, comm: kworker/0:1 Not tainted 2.6.36-carbon1 #18 > Call Trace: > [<ffffffff81050c60>] warn_slowpath_common+0x85/0x9d > [<ffffffff81050c92>] warn_slowpath_null+0x1a/0x1c > [<ffffffff81066861>] worker_enter_idle+0xd6/0xe2 > [<ffffffff81068453>] worker_thread+0x182/0x19b > [<ffffffff810682d1>] ? worker_thread+0x0/0x19b > [<ffffffff8106ba81>] kthread+0x82/0x8a > [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10 > [<ffffffff8106b9ff>] ? kthread+0x0/0x8a > [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10 > ---[ end trace 756b0818a6415dca ]--- > > That was followed by a number of task blocked for more than 120 seconds > messages, all due to waiting for pty input or output. An example trace: This is triggered by an internal check in the workqueue handling that Tejun introduced. I guess it might be possible that this is triggered by a worker thread being blocked on a mutex, but I don't understand exactly what it is checking for. Arnd ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: open() on /dev/tty takes 30 seconds on 2.6.36 2010-10-31 11:36 ` Arnd Bergmann @ 2010-11-02 10:28 ` Tejun Heo 0 siblings, 0 replies; 14+ messages in thread From: Tejun Heo @ 2010-11-02 10:28 UTC (permalink / raw) To: Arnd Bergmann; +Cc: James Cloos, Eli Billauer, Andrew Morton, linux-kernel Hello, On 10/31/2010 12:36 PM, Arnd Bergmann wrote: > On Sunday 31 October 2010, James Cloos wrote: >> WARNING: at kernel/workqueue.c:1180 worker_enter_idle+0xd6/0xe2() >> Hardware name: MS-7642 >> Modules linked in: tcp_diag inet_diag ipt_addrtype xt_dscp xt_string xt_owner xt_multiport xt_iprange xt_hashlimit xt_DSCP xt_NFQUEUE xt_mark xt_connmark tun snd_pcm_oss snd_mixer_oss snd_usb_audio snd_usbmidi_lib snd_rawmidi tpm_tis tpm ppdev parport_pc tpm_bios parport serio_raw edac_core k10temp pcspkr i2c_piix4 shpchp >> Pid: 8061, comm: kworker/0:1 Not tainted 2.6.36-carbon1 #18 >> Call Trace: >> [<ffffffff81050c60>] warn_slowpath_common+0x85/0x9d >> [<ffffffff81050c92>] warn_slowpath_null+0x1a/0x1c >> [<ffffffff81066861>] worker_enter_idle+0xd6/0xe2 >> [<ffffffff81068453>] worker_thread+0x182/0x19b >> [<ffffffff810682d1>] ? worker_thread+0x0/0x19b >> [<ffffffff8106ba81>] kthread+0x82/0x8a >> [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10 >> [<ffffffff8106b9ff>] ? kthread+0x0/0x8a >> [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10 >> ---[ end trace 756b0818a6415dca ]--- >> >> That was followed by a number of task blocked for more than 120 seconds >> messages, all due to waiting for pty input or output. An example trace: > > This is triggered by an internal check in the workqueue handling that > Tejun introduced. I guess it might be possible that this is triggered by > a worker thread being blocked on a mutex, but I don't understand > exactly what it is checking for. Hmmm... that's a condition which shouldn't happen regardless of being blocked on mutex or whatever. It means that internal book keeping used to manage concurrency went out of sync somehow. I've re-read the code managing nr_running but couldn't spot where the problem could be. Is there any way to reproduce the problem? Thanks. -- tejun ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: open() on /dev/tty takes 30 seconds on 2.6.36 2010-10-31 3:36 ` Arnd Bergmann 2010-10-31 6:34 ` James Cloos @ 2010-11-01 1:12 ` Eli Billauer 2010-11-01 19:39 ` Arnd Bergmann 1 sibling, 1 reply; 14+ messages in thread From: Eli Billauer @ 2010-11-01 1:12 UTC (permalink / raw) To: Arnd Bergmann; +Cc: Andrew Morton, linux-kernel Let me start with the headline: It's close() of a ttySx that takes 30 seconds, during which the big TTY lock is held. Now to details. Arnd Bergmann wrote: > So modem-manager is the process you need to look at, as it holds the > BTM, probably for a long time. It should not do that. First you should > verify that when you don't have modem-manager running, the problem goes > away. This may indeed be related to the misdetection of a serial port. > That test is built-in: When modem-manager stops poking around, everything becomes fine again. > The other thing to check is what modem-manager is actually doing. You can > find out by looking at /proc/<pid>/stack, where <pid> is the pid of the > modem-manager process, 1835 in your example. > OK, so this is what I did eventually: I ran strace on sshd and modem-manager, and I also added printk's to tell me when tty_lock() is invoked, when it actually got the lock, and when tty_unlock() is invoked. Pretty messy, and still. Let's start with sshd's strace. We have a 01:06:20.152787 open("/dev/tty", O_RDWR|O_NOCTTY) = -1 ENXIO (No such device or address) <30.008153> And this is what we have in /var/log/messages: Nov 1 01:06:20 ocho modem-manager: (ttyS2) closing serial device... Nov 1 01:06:20 ocho kernel: tty_lock() called Nov 1 01:06:20 ocho kernel: tty_lock() got lock Nov 1 01:06:20 ocho kernel: tty_lock() called Nov 1 01:06:50 ocho kernel: tty_unlock() called Nov 1 01:06:50 ocho kernel: tty_lock() got lock Nov 1 01:06:50 ocho kernel: tty_unlock() called Nov 1 01:06:50 ocho kernel: tty_lock() called So we can see that someone got the lock, and then didn't release it for 30 seconds, and that someone else tried to get the lock and waited for it 30 seconds. Here comes modem-manager's strace in the relevant time: 01:06:20.096730 sendto(3, "<29>Nov 1 01:06:20 modem-manager: (ttyS2) closing serial device...", 67, MSG_NOSIGNAL, NULL, 0) = 67 <0.000048> 01:06:20.096932 close(22) = 0 <30.064112> 01:06:50.161255 ioctl(22, SNDCTL_TMR_START or TCSETS, {B9600 opost isig icanon echo ...}) = -1 EBADF (Bad file descriptor) <0.000132> Surprise, surprise. modem-manager waits 30 seconds for a close() operation (maybe I should change this thread's subject?). But maybe modem-manager is waiting for someone else? Nope. In another part of modem-manager's strace we have: 01:05:50.001380 sendto(3, "<29>Nov 1 01:05:49 modem-manager: (ttyS1) closing serial device...", 67, MSG_NOSIGNAL, NULL, 0) = 67 <0.000237> 01:05:50.001765 close(21) = 0 <30.064420> Again (as always, actually), modem-manager waits 30 seconds for a close(). In /var/log/messages we have: Nov 1 01:05:50 ocho modem-manager: (ttyS1) closing serial device... Nov 1 01:05:50 ocho kernel: tty_lock() called Nov 1 01:05:50 ocho kernel: tty_lock() got lock Nov 1 01:05:54 ocho kernel: tty_lock() called Nov 1 01:06:20 ocho kernel: tty_unlock() called The last lock-related log entry was an unlock 9 seconds earlier. So it's clear that modem-manager's close() call got the lock right away, but held it for 30 seconds. In the middle we have some other process asking for the lock (and is put to wait). An strace of modem-manager running under vanilla kernel 2.6.35-4 shows that close() takes 30 seconds there as well. It just so happens that it went unnoticed (maybe because of the changes in the locking schema?). And in case we suspect that something special modem-manager did, I've attached the relevant piece of strace while handling some TTY. I've cut the log to cover a full cycle of a single TTY (file descriptor 21), but I chose not to omit the calls relevant to other ports, which are handled meanwhile. I should only note that all TTY close() calls take 30 seconds, and that all ports get a ioctl() and extra close() after the file descriptor is closed, with an EBADF as response. Trace follows. Hope this helps. Eli 01:06:20.067616 open("/dev/ttyS1", O_RDWR|O_EXCL|O_NOCTTY|O_NONBLOCK) = 21 <0.020836> 01:06:20.088688 ioctl(21, TIOCEXCL, 0) = 0 <0.000043> 01:06:20.088873 ioctl(21, TCFLSH, 0x2) = 0 <0.000096> 01:06:20.089106 ioctl(21, SNDCTL_TMR_TIMEBASE or TCGETS, {B57600 -opost isig -icanon -echo ...}) = 0 <0.000098> 01:06:20.089377 ioctl(21, SNDCTL_TMR_TIMEBASE or TCGETS, {B57600 -opost isig -icanon -echo ...}) = 0 <0.000092> 01:06:20.089699 ioctl(21, SNDCTL_TMR_START or TCSETS, {B115200 -opost -isig -icanon -echo ...}) = 0 <0.000072> 01:06:20.089909 ioctl(21, SNDCTL_TMR_TIMEBASE or TCGETS, {B115200 -opost -isig -icanon -echo ...}) = 0 <0.000056> 01:06:20.090088 fstat(21, {st_mode=S_IFCHR|0660, st_rdev=makedev(4, 65), ...}) = 0 <0.000091> 01:06:20.090294 fcntl(21, F_GETFL) = 0x8802 (flags O_RDWR|O_NONBLOCK|O_LARGEFILE) <0.000088> 01:06:20.090515 poll([{fd=22, events=POLLIN}, {fd=23, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=21, events=POLLIN}], 19, 0) = 0 (Timeout) <0.000034> 01:06:20.090752 poll([{fd=22, events=POLLIN}, {fd=23, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=21, events=POLLIN}], 19, 0) = 0 (Timeout) <0.000032> 01:06:20.090916 write(21, "\0", 1) = 1 <0.000028> 01:06:20.091012 nanosleep({0, 1000000}, NULL) = 0 <0.001152> 01:06:20.092246 write(21, "x", 1) = 1 <0.000027> 01:06:20.092425 nanosleep({0, 1000000}, NULL) = 0 <0.001090> 01:06:20.093690 write(21, "\360", 1) = 1 <0.000093> 01:06:20.093950 nanosleep({0, 1000000}, NULL) = 0 <0.001154> 01:06:20.095203 write(21, "~", 1) = 1 <0.000094> 01:06:20.095389 nanosleep({0, 1000000}, NULL) = 0 <0.001144> 01:06:20.096730 sendto(3, "<29>Nov 1 01:06:20 modem-manager: (ttyS2) closing serial device...", 67, MSG_NOSIGNAL, NULL, 0) = 67 <0.000048> 01:06:20.096932 close(22) = 0 <30.064112> 01:06:50.161255 ioctl(22, SNDCTL_TMR_START or TCSETS, {B9600 opost isig icanon echo ...}) = -1 EBADF (Bad file descriptor) <0.000132> 01:06:50.161623 close(22) = -1 EBADF (Bad file descriptor) <0.000121> 01:06:50.162061 sendto(3, "<29>Nov 1 01:06:50 modem-manager: (ttyS2) opening serial device...", 67, MSG_NOSIGNAL, NULL, 0) = 67 <0.000165> 01:06:50.162465 open("/dev/ttyS2", O_RDWR|O_EXCL|O_NOCTTY|O_NONBLOCK) = 22 <0.020821> 01:06:50.183512 ioctl(22, TIOCEXCL, 0) = 0 <0.000163> 01:06:50.183894 ioctl(22, TCFLSH, 0x2) = 0 <0.000102> 01:06:50.184121 ioctl(22, SNDCTL_TMR_TIMEBASE or TCGETS, {B57600 -opost isig -icanon -echo ...}) = 0 <0.000090> 01:06:50.184374 ioctl(22, SNDCTL_TMR_TIMEBASE or TCGETS, {B57600 -opost isig -icanon -echo ...}) = 0 <0.000092> 01:06:50.184631 ioctl(22, SNDCTL_TMR_START or TCSETS, {B115200 -opost -isig -icanon -echo ...}) = 0 <0.000050> 01:06:50.184768 ioctl(22, SNDCTL_TMR_TIMEBASE or TCGETS, {B115200 -opost -isig -icanon -echo ...}) = 0 <0.000074> 01:06:50.184934 fstat(22, {st_mode=S_IFCHR|0660, st_rdev=makedev(4, 66), ...}) = 0 <0.000020> 01:06:50.185040 fcntl(22, F_GETFL) = 0x8802 (flags O_RDWR|O_NONBLOCK|O_LARGEFILE) <0.000020> 01:06:50.185237 sendto(3, "<29>Nov 1 01:06:50 modem-manager: (ttyS3) closing serial device...", 67, MSG_NOSIGNAL, NULL, 0) = 67 <0.000026> 01:06:50.185358 close(23) = 0 <30.063507> 01:07:20.248957 ioctl(23, SNDCTL_TMR_START or TCSETS, {B9600 opost isig icanon echo ...}) = -1 EBADF (Bad file descriptor) <0.000025> 01:07:20.249074 close(23) = -1 EBADF (Bad file descriptor) <0.000020> 01:07:20.249242 sendto(3, "<29>Nov 1 01:07:20 modem-manager: (ttyS3) opening serial device...", 67, MSG_NOSIGNAL, NULL, 0) = 67 <0.000050> 01:07:20.249397 open("/dev/ttyS3", O_RDWR|O_EXCL|O_NOCTTY|O_NONBLOCK) = 23 <0.021043> 01:07:20.270531 ioctl(23, TIOCEXCL, 0) = 0 <0.000031> 01:07:20.270628 ioctl(23, TCFLSH, 0x2) = 0 <0.000104> 01:07:20.270816 ioctl(23, SNDCTL_TMR_TIMEBASE or TCGETS, {B57600 -opost isig -icanon -echo ...}) = 0 <0.000029> 01:07:20.270940 ioctl(23, SNDCTL_TMR_TIMEBASE or TCGETS, {B57600 -opost isig -icanon -echo ...}) = 0 <0.000023> 01:07:20.271039 ioctl(23, SNDCTL_TMR_START or TCSETS, {B115200 -opost -isig -icanon -echo ...}) = 0 <0.000032> 01:07:20.271147 ioctl(23, SNDCTL_TMR_TIMEBASE or TCGETS, {B115200 -opost -isig -icanon -echo ...}) = 0 <0.000023> 01:07:20.271248 fstat(23, {st_mode=S_IFCHR|0660, st_rdev=makedev(4, 67), ...}) = 0 <0.000022> 01:07:20.271362 fcntl(23, F_GETFL) = 0x8802 (flags O_RDWR|O_NONBLOCK|O_LARGEFILE) <0.000021> 01:07:20.271494 poll([{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 19, 0) = 0 (Timeout) <0.000031> 01:07:20.271666 poll([{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 19, 0) = 0 (Timeout) <0.000042> 01:07:20.271868 write(22, "\0", 1) = 1 <0.000040> 01:07:20.271983 nanosleep({0, 1000000}, NULL) = 0 <0.001189> 01:07:20.273259 write(22, "x", 1) = 1 <0.000041> 01:07:20.273378 nanosleep({0, 1000000}, NULL) = 0 <0.001188> 01:07:20.274653 write(22, "\360", 1) = 1 <0.000036> 01:07:20.274807 nanosleep({0, 1000000}, NULL) = 0 <0.001193> 01:07:20.276086 write(22, "~", 1) = 1 <0.000035> 01:07:20.276213 nanosleep({0, 1000000}, NULL) = 0 <0.001190> 01:07:20.277509 write(23, "\0", 1) = 1 <0.000039> 01:07:20.277626 nanosleep({0, 1000000}, NULL) = 0 <0.001189> 01:07:20.278900 write(23, "x", 1) = 1 <0.000035> 01:07:20.279011 nanosleep({0, 1000000}, NULL) = 0 <0.001192> 01:07:20.280288 write(23, "\360", 1) = 1 <0.000039> 01:07:20.280404 nanosleep({0, 1000000}, NULL) = 0 <0.001193> 01:07:20.281695 write(23, "~", 1) = 1 <0.000039> 01:07:20.281814 nanosleep({0, 1000000}, NULL) = 0 <0.001191> 01:07:20.283113 write(21, "\0", 1) = 1 <0.000039> 01:07:20.283232 nanosleep({0, 1000000}, NULL) = 0 <0.001190> 01:07:20.284506 write(21, "x", 1) = 1 <0.000034> 01:07:20.284615 nanosleep({0, 1000000}, NULL) = 0 <0.001189> 01:07:20.285888 write(21, "\360", 1) = 1 <0.000040> 01:07:20.286005 nanosleep({0, 1000000}, NULL) = 0 <0.001186> 01:07:20.287276 write(21, "~", 1) = 1 <0.000040> 01:07:20.287392 nanosleep({0, 1000000}, NULL) = 0 <0.001186> 01:07:20.288690 poll([{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 19, 3000) = 0 (Timeout) <3.003405> 01:07:23.292250 poll([{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 19, 707) = 0 (Timeout) <0.707898> 01:07:24.000331 poll([{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 19, 0) = 0 (Timeout) <0.000044> 01:07:24.000506 write(22, "\0", 1) = 1 <0.000038> 01:07:24.000620 nanosleep({0, 1000000}, NULL) = 0 <0.001209> 01:07:24.001914 write(22, "x", 1) = 1 <0.000039> 01:07:24.002028 nanosleep({0, 1000000}, NULL) = 0 <0.001208> 01:07:24.003320 write(22, "\360", 1) = 1 <0.000035> 01:07:24.003434 nanosleep({0, 1000000}, NULL) = 0 <0.001207> 01:07:24.004743 write(22, "~", 1) = 1 <0.000036> 01:07:24.004867 nanosleep({0, 1000000}, NULL) = 0 <0.001189> 01:07:24.006171 write(23, "\0", 1) = 1 <0.000039> 01:07:24.006300 nanosleep({0, 1000000}, NULL) = 0 <0.001177> 01:07:24.007571 write(23, "x", 1) = 1 <0.000034> 01:07:24.007696 nanosleep({0, 1000000}, NULL) = 0 <0.001189> 01:07:24.008988 write(23, "\360", 1) = 1 <0.000034> 01:07:24.009113 nanosleep({0, 1000000}, NULL) = 0 <0.001191> 01:07:24.010402 write(23, "~", 1) = 1 <0.000034> 01:07:24.010526 nanosleep({0, 1000000}, NULL) = 0 <0.001192> 01:07:24.011859 sendto(3, "<29>Nov 1 01:07:24 modem-manager: (ttyS1) closing serial device...", 67, MSG_NOSIGNAL, NULL, 0) = 67 <0.000084> 01:07:24.012044 close(21) = 0 <30.063920> 01:07:54.076064 ioctl(21, SNDCTL_TMR_START or TCSETS, {B57600 -opost isig -icanon -echo ...}) = -1 EBADF (Bad file descriptor) <0.000021> 01:07:54.076186 close(21) = -1 EBADF (Bad file descriptor) <0.000019> -- Web: http://www.billauer.co.il ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: open() on /dev/tty takes 30 seconds on 2.6.36 2010-11-01 1:12 ` Eli Billauer @ 2010-11-01 19:39 ` Arnd Bergmann 2010-11-01 20:46 ` Alan Cox 2010-11-03 0:15 ` Eli Billauer 0 siblings, 2 replies; 14+ messages in thread From: Arnd Bergmann @ 2010-11-01 19:39 UTC (permalink / raw) To: Eli Billauer; +Cc: Andrew Morton, linux-kernel, Greg KH, Alan Cox On Monday 01 November 2010, Eli Billauer wrote: > Let me start with the headline: It's close() of a ttySx that takes 30 > seconds, during which the big TTY lock is held. Now to details. Ok, good to hear you tracked it down this far! > Arnd Bergmann wrote: > > > So modem-manager is the process you need to look at, as it holds the > > BTM, probably for a long time. It should not do that. First you should > > verify that when you don't have modem-manager running, the problem goes > > away. This may indeed be related to the misdetection of a serial port. > > > That test is built-in: When modem-manager stops poking around, > everything becomes fine again. Ok. > > The other thing to check is what modem-manager is actually doing. You can > > find out by looking at /proc/<pid>/stack, where <pid> is the pid of the > > modem-manager process, 1835 in your example. > > > OK, so this is what I did eventually: I ran strace on sshd and > modem-manager, and I also added printk's to tell me when tty_lock() is > invoked, when it actually got the lock, and when tty_unlock() is > invoked. Pretty messy, and still. > > Let's start with sshd's strace. We have a > 01:06:20.152787 open("/dev/tty", O_RDWR|O_NOCTTY) = -1 ENXIO (No such > device or address) <30.008153> That looks unrelated, as far as I can tell. > And this is what we have in /var/log/messages: > > Nov 1 01:06:20 ocho modem-manager: (ttyS2) closing serial device... > Nov 1 01:06:20 ocho kernel: tty_lock() called > Nov 1 01:06:20 ocho kernel: tty_lock() got lock > Nov 1 01:06:20 ocho kernel: tty_lock() called > Nov 1 01:06:50 ocho kernel: tty_unlock() called > Nov 1 01:06:50 ocho kernel: tty_lock() got lock > Nov 1 01:06:50 ocho kernel: tty_unlock() called > Nov 1 01:06:50 ocho kernel: tty_lock() called > > So we can see that someone got the lock, and then didn't release it for > 30 seconds, and that someone else tried to get the lock and waited for > it 30 seconds. Ok, that's certainly modem-manager in the tty_release function, calling uart_close, which has this code: if (port->closing_wait != ASYNC_CLOSING_WAIT_NONE) { /* * hack: open-coded tty_wait_until_sent to avoid * recursive tty_lock */ long timeout = msecs_to_jiffies(port->closing_wait); if (wait_event_interruptible_timeout(tty->write_wait, !tty_chars_in_buffer(tty), timeout) >= 0) __uart_wait_until_sent(uport, timeout); } port->closing_wait gets initialized to 30 seconds, which fits your symptom. I'm not sure whether it blocks in wait_event or in __uart_wait_until_sent though, both get the same timeout. > An strace of modem-manager running under vanilla kernel 2.6.35-4 shows > that close() takes 30 seconds there as well. It just so happens that it > went unnoticed (maybe because of the changes in the locking schema?). Exactly. The code used to hold the BKL, which gets implicitly released during a sleeping operation like wait_event_interruptible_timeout or __uart_wait_until_sent. With the change to a mutex, this does not happen any more. When I did the change, I did not anticipate the call actually timing out the 30 seconds. > I should only note that all TTY close() calls take 30 seconds, and that > all ports get a ioctl() and extra close() after the file descriptor is > closed, with an EBADF as response. This must be an unrelated bug in modem-manager. Can you confirm that this only happens on tty1 through tty4, but not on tty0, which is your real uart? I think we need to find out why it's hitting the timeout most of all, and that means understanding what the hardware on your system is actually doing or supposed to be doing. What does "lspci -vvx" show about the port? Can you disable it in some way? Arnd ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: open() on /dev/tty takes 30 seconds on 2.6.36 2010-11-01 19:39 ` Arnd Bergmann @ 2010-11-01 20:46 ` Alan Cox 2010-11-03 0:15 ` Eli Billauer 1 sibling, 0 replies; 14+ messages in thread From: Alan Cox @ 2010-11-01 20:46 UTC (permalink / raw) To: Arnd Bergmann; +Cc: Eli Billauer, Andrew Morton, linux-kernel, Greg KH > I think we need to find out why it's hitting the timeout most of all, Its a perfectly normal happening in some cases - eg when you the remote end asserts flow control and then hangs up on you leaving it that way. So we need to handle this by dropping locks correctly. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: open() on /dev/tty takes 30 seconds on 2.6.36 2010-11-01 19:39 ` Arnd Bergmann 2010-11-01 20:46 ` Alan Cox @ 2010-11-03 0:15 ` Eli Billauer 2010-11-03 3:32 ` Arnd Bergmann 1 sibling, 1 reply; 14+ messages in thread From: Eli Billauer @ 2010-11-03 0:15 UTC (permalink / raw) To: Arnd Bergmann; +Cc: Andrew Morton, linux-kernel, Greg KH, Alan Cox Arnd Bergmann wrote: > Ok, that's certainly modem-manager in the tty_release function, calling > uart_close, which has this code: > > if (port->closing_wait != ASYNC_CLOSING_WAIT_NONE) { > /* > * hack: open-coded tty_wait_until_sent to avoid > * recursive tty_lock > */ > long timeout = msecs_to_jiffies(port->closing_wait); > if (wait_event_interruptible_timeout(tty->write_wait, > !tty_chars_in_buffer(tty), timeout) >= 0) > __uart_wait_until_sent(uport, timeout); > } > > port->closing_wait gets initialized to 30 seconds, which fits your > symptom. > I lost you here. Maybe my version is outdated? The most similar thing I have is this in tty_port.c, in tty_port_close_start(): if (test_bit(ASYNCB_INITIALIZED, &port->flags) && port->closing_wait != ASYNC_CLOSING_WAIT_NONE) tty_wait_until_sent(tty, port->closing_wait); I put a printk on invocations of tty_wait_until_sent(), which was called like crazy during bootup on behalf of tty1, and then never again after boot was completed. I should point out, that during this session, all I did was to wait a few minutes and then reboot the computer from the GUI login console. So I never logged in. Anyhow, the ttySx ports were opened and closed, the same 30 seconds delays, but no call to tty_wait_until_sent() until kernel logging was stopped. > What does "lspci -vvx" show about the > port? It turns out, that the device, to which ttyS1-ttyS3 are attached is a soft modem, which doesn't even have drivers for a 64 bits system. There is no /dev/ttyS4, which is consistent with the "Couldn't register serial port 0000:05:04.0: -28" message. Anyhow, I can't say I understand why any serial port was allocated to this modem. But it's not like I understand how it should work. 05:04.0 Modem: ALi Corporation SmartLink SmartPCI563 56K Modem (prog-if 00 [Generic]) Subsystem: Device 2020:545a Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 32 Interrupt: pin A routed to IRQ 16 Region 0: Memory at fbbff000 (32-bit, non-prefetchable) [size=4K] Region 1: I/O ports at be00 [size=256] Capabilities: [40] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Kernel driver in use: serial 00: b9 10 5a 54 07 00 90 02 00 00 03 07 00 20 00 00 10: 00 f0 bf fb 01 be 00 00 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 20 20 5a 54 30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 00 00 Eli -- Web: http://www.billauer.co.il ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: open() on /dev/tty takes 30 seconds on 2.6.36 2010-11-03 0:15 ` Eli Billauer @ 2010-11-03 3:32 ` Arnd Bergmann 2010-11-03 10:34 ` Alan Cox 0 siblings, 1 reply; 14+ messages in thread From: Arnd Bergmann @ 2010-11-03 3:32 UTC (permalink / raw) To: Eli Billauer; +Cc: Andrew Morton, linux-kernel, Greg KH, Alan Cox On Wednesday 03 November 2010, Eli Billauer wrote: > Arnd Bergmann wrote: > > > Ok, that's certainly modem-manager in the tty_release function, calling > > uart_close, which has this code: > > > > if (port->closing_wait != ASYNC_CLOSING_WAIT_NONE) { > > /* > > * hack: open-coded tty_wait_until_sent to avoid > > * recursive tty_lock > > */ > > long timeout = msecs_to_jiffies(port->closing_wait); > > if (wait_event_interruptible_timeout(tty->write_wait, > > !tty_chars_in_buffer(tty), timeout) >= 0) > > __uart_wait_until_sent(uport, timeout); > > } > > > > port->closing_wait gets initialized to 30 seconds, which fits your > > symptom. > > > I lost you here. Maybe my version is outdated? The most similar thing I > have is this in tty_port.c, in tty_port_close_start(): > > if (test_bit(ASYNCB_INITIALIZED, &port->flags) && > port->closing_wait != ASYNC_CLOSING_WAIT_NONE) > tty_wait_until_sent(tty, port->closing_wait); The code I quoted is from uart_close() in drivers/serial/serial_core.c > I put a printk on invocations of tty_wait_until_sent(), which was called > like crazy during bootup on behalf of tty1, and then never again after > boot was completed. I should point out, that during this session, all I > did was to wait a few minutes and then reboot the computer from the GUI > login console. So I never logged in. Anyhow, the ttySx ports were opened > and closed, the same 30 seconds delays, but no call to > tty_wait_until_sent() until kernel logging was stopped. Right. The uart_close() function does not call tty_wait_until_sent, it calls __uart_wait_until_sent directly. > > What does "lspci -vvx" show about the > > port? > It turns out, that the device, to which ttyS1-ttyS3 are attached is a > soft modem, which doesn't even have drivers for a 64 bits system. There > is no /dev/ttyS4, which is consistent with the "Couldn't register serial > port 0000:05:04.0: -28" message. Anyhow, I can't say I understand why > any serial port was allocated to this modem. But it's not like I > understand how it should work. > > 05:04.0 Modem: ALi Corporation SmartLink SmartPCI563 56K Modem (prog-if > 00 [Generic]) It should actually be covered by the intel8x0m.c driver as far as I can tell, but it's missing from the PCI IDs in that driver... As Alan said, we need to sort out the locking though. One problem is that we cannot just do tty_unlock/tty_lock around the sleeping call, because we also hold &port->mutex there, which nests inside of tty_lock(). I hope Alan can figure out if it's either safe to drop both here, or if we might be able to call uart_close without tty_lock() held in the first place. Arnd ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: open() on /dev/tty takes 30 seconds on 2.6.36 2010-11-03 3:32 ` Arnd Bergmann @ 2010-11-03 10:34 ` Alan Cox 0 siblings, 0 replies; 14+ messages in thread From: Alan Cox @ 2010-11-03 10:34 UTC (permalink / raw) To: Arnd Bergmann; +Cc: Eli Billauer, Andrew Morton, linux-kernel, Greg KH > I hope Alan can figure out if it's either safe to drop both here, or if we > might be able to call uart_close without tty_lock() held in the first place. That was always my intention and why I moved it to tty_port. I think it is safe to do that, but as far as I can tell the port mutex is assumed held by the low level drivers during the uart ops calls some of the time. Safest is probably to drop the tty lock before we take the port mutex and take it again when we exit. The tty_port fields are protected by the port mutex/lock The uport methods by the uport lock The only two points of concern I see are updating of closing_wait as it is read (no big deal), and the nasty - which is tty_ldisc_flush. I am not sure what assumptions lurk in the ldisc flush paths but I think it's ok. uart_wait_until_sent will also need to not take the tty lock at that point to fix it properly. ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2010-11-03 10:35 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-10-30 7:47 open() on /dev/tty takes 30 seconds on 2.6.36 Eli Billauer 2010-10-30 18:46 ` Andrew Morton 2010-10-30 20:46 ` Arnd Bergmann 2010-10-31 0:20 ` Eli Billauer 2010-10-31 3:36 ` Arnd Bergmann 2010-10-31 6:34 ` James Cloos 2010-10-31 11:36 ` Arnd Bergmann 2010-11-02 10:28 ` Tejun Heo 2010-11-01 1:12 ` Eli Billauer 2010-11-01 19:39 ` Arnd Bergmann 2010-11-01 20:46 ` Alan Cox 2010-11-03 0:15 ` Eli Billauer 2010-11-03 3:32 ` Arnd Bergmann 2010-11-03 10:34 ` Alan Cox
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.