* [PATCH for-4.22] char/ns16550: bound execution time of ns16550_interrupt()
@ 2026-06-23 10:31 Roger Pau Monne
  2026-06-23 13:36 ` Oleksii Kurochko
  2026-06-23 13:44 ` Jan Beulich
  0 siblings, 2 replies; 8+ messages in thread
From: Roger Pau Monne @ 2026-06-23 10:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Roger Pau Monne, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Jan Beulich, Julien Grall, Stefano Stabellini

The current logic in ns16550_interrupt() will loop until the device sets
the NOINT in IIR.  At least on the Lenovo ThinkSystem SR630 V4 the flow
control of the serial-over-lan emulated UART seems to be broken, as it
doesn't set the NOINT bit consistently.  The Transmitter Holding Register
Empty in LSR also seems to not be properly signaled, as even with it set
writes to the transmit register take ~6ms.  This leads to the watchdog
triggering very easily on such system.

Introduce an upper bound on the execution time of ns16550_interrupt(), this
is currently set as 4x the polling interval, which is calculated as the
time to fill RX FIFO and/or empty TX FIFO.  The current maximum is 5ms.
Once the timeout triggers the interrupt is disabled and the uart is
switched to polling mode.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
There's a possible alternative approach to solve this by moving the actual
interrupt processing to a softirq tasklet and disabling the interrupt
source until the processing is done, likely unifying the logic with the
timer task.  However that's a bigger change, and too risky for 4.22 at this
point.
---
 xen/drivers/char/ns16550.c | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/char/ns16550.c b/xen/drivers/char/ns16550.c
index 878da27f2ef8..008f673f52ee 100644
--- a/xen/drivers/char/ns16550.c
+++ b/xen/drivers/char/ns16550.c
@@ -62,6 +62,7 @@ static struct ns16550 {
 #endif
     unsigned int timeout_ms;
     bool intr_works;
+    bool force_polling;
     bool dw_usr_bsy;
 #ifdef NS16550_PCI
     /* PCI card parameters. */
@@ -190,12 +191,41 @@ static void cf_check ns16550_interrupt(int irq, void *dev_id)
 {
     struct serial_port *port = dev_id;
     struct ns16550 *uart = port->uart;
+    /* Set quite arbitrarily as 4x the time to drain the TX or fill RX FIFOs. */
+    const s_time_t timeout = NOW() + min(MILLISECS(uart->timeout_ms * 4),
+                                         MILLISECS(5));
+
+    if ( uart->force_polling )
+        return;

     uart->intr_works = 1;

     while ( !(ns_read_reg(uart, UART_IIR) & UART_IIR_NOINT) )
     {
         u8 lsr = ns_read_reg(uart, UART_LSR);
+        s_time_t now = NOW();
+
+        /* Break out of the loop if spending too much time. */
+        if ( now > timeout )
+        {
+            struct irq_desc *desc = irq_to_desc(irq);
+
+            /* Disable the interrupt source - it's never shared. */
+            spin_lock_irq(&desc->lock);
+            desc->status |= IRQ_DISABLED;
+            if ( desc->handler->disable )
+                desc->handler->disable(desc);
+            spin_unlock_irq(&desc->lock);
+
+            /* Disable interrupt generation on the device and arm the timer. */
+            uart->force_polling = true;
+            ns_write_reg(uart, UART_IER, 0);
+            set_timer(&uart->timer, now + MILLISECS(uart->timeout_ms));
+            printk(XENLOG_WARNING
+                   "uart interrupt taking too long, switched to polling\n");
+
+            return;
+        }

         if ( (lsr & uart->lsr_mask) == uart->lsr_mask )
             serial_tx_interrupt(port);
@@ -223,7 +253,7 @@ static void cf_check __ns16550_poll(const struct cpu_user_regs *regs)
     struct ns16550 *uart = port->uart;
     const struct cpu_user_regs *old_regs;

-    if ( uart->intr_works )
+    if ( uart->intr_works && !uart->force_polling )
         return;     /* Interrupts work - no more polling */

     /* Mimic interrupt context. */
@@ -313,6 +343,7 @@ static void ns16550_setup_preirq(struct ns16550 *uart)
     unsigned int divisor;

     uart->intr_works = 0;
+    uart->force_polling = false;

     pci_serial_early_init(uart);

-- 
2.53.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread
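The control flow the patch introduces can be tried outside of Xen with a small simulation. The following is only a sketch under stated assumptions: `struct fake_uart`, `fake_now_ms`, `irq_pending()` and `bounded_interrupt()` are hypothetical stand-ins invented for illustration, the clock is simulated, and none of this is the driver's real code. It shows a service loop that gives up and flags a switch to polling once the budget of min(4 x timeout_ms, 5ms) is exceeded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver state; not Xen's struct ns16550. */
struct fake_uart {
    bool force_polling;      /* set once the handler gives up on interrupts */
    unsigned int timeout_ms; /* polling interval, as in the patch */
    unsigned int pending;    /* events the simulated device still reports */
    bool stuck;              /* device never signals "no interrupt pending" */
};

static int64_t fake_now_ms;  /* simulated clock, advanced per serviced event */

/* Models the IIR read: true while the device claims work is pending. */
static bool irq_pending(const struct fake_uart *u)
{
    return u->stuck || u->pending > 0;
}

/* Service events until done or over budget; returns events serviced. */
static unsigned int bounded_interrupt(struct fake_uart *u, int64_t cost_ms)
{
    int64_t budget = (int64_t)u->timeout_ms * 4;
    int64_t deadline;
    unsigned int handled = 0;

    if ( budget > 5 )        /* 5ms cap, mirroring the patch */
        budget = 5;
    deadline = fake_now_ms + budget;

    while ( irq_pending(u) )
    {
        /* Break out of the loop if spending too much time. */
        if ( fake_now_ms > deadline )
        {
            u->force_polling = true;
            break;
        }

        if ( u->pending > 0 )
            u->pending--;
        fake_now_ms += cost_ms; /* each event costs cost_ms of simulated time */
        handled++;
    }

    return handled;
}
```

A well-behaved device drains its events and leaves `force_polling` clear, while a device that never stops signalling, with each access costing about 6ms as in the report, trips the deadline on the second pass.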
* Re: [PATCH for-4.22] char/ns16550: bound execution time of ns16550_interrupt()
  2026-06-23 10:31 [PATCH for-4.22] char/ns16550: bound execution time of ns16550_interrupt() Roger Pau Monne
@ 2026-06-23 13:36 ` Oleksii Kurochko
  2026-06-23 13:46   ` Jan Beulich
  2026-06-23 13:44 ` Jan Beulich
  1 sibling, 1 reply; 8+ messages in thread
From: Oleksii Kurochko @ 2026-06-23 13:36 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Stefano Stabellini

On 6/23/26 12:31 PM, Roger Pau Monne wrote:
> The current logic in ns16550_interrupt() will loop until the device sets
> the NOINT in IIR.  At least on the Lenovo ThinkSystem SR630 V4 the flow
> control of the serial-over-lan emulated UART seems to be broken, as it
> doesn't set the NOINT bit consistently.  The Transmitter Holding Register
> Empty in LSR also seems to not be properly signaled, as even with it set
> writes to the transmit register take ~6ms.  This leads to the watchdog
> triggering very easily on such system.
> 
> Introduce an upper bound on the execution time of ns16550_interrupt(), this
> is currently set as 4x the polling interval, which is calculated as the
> time to fill RX FIFO and/or empty TX FIFO.  The current maximum is 5ms.
> Once the timeout triggers the interrupt is disabled and the uart is
> switched to polling mode.
> 

Aren't you missing a Fixes: tag?

> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> There's a possible alternative approach to solve this by moving the actual
> interrupt processing to a softirq tasklet and disabling the interrupt
> source until the processing is done, likely unifying the logic with the
> timer task.  However that's a bigger change, and too risky for 4.22 at this
> point.
> ---

Agree, it would be better to stick to the current solution:
  Release-Acked-by: Oleksii Kurochko <oleskii.kurochko@gmail.com>

Thanks.

~ Oleksii

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH for-4.22] char/ns16550: bound execution time of ns16550_interrupt()
  2026-06-23 13:36 ` Oleksii Kurochko
@ 2026-06-23 13:46   ` Jan Beulich
  2026-06-23 14:19     ` Roger Pau Monné
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Beulich @ 2026-06-23 13:46 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, Roger Pau Monne, xen-devel

On 23.06.2026 15:36, Oleksii Kurochko wrote:
> On 6/23/26 12:31 PM, Roger Pau Monne wrote:
>> The current logic in ns16550_interrupt() will loop until the device sets
>> the NOINT in IIR.  At least on the Lenovo ThinkSystem SR630 V4 the flow
>> control of the serial-over-lan emulated UART seems to be broken, as it
>> doesn't set the NOINT bit consistently.  The Transmitter Holding Register
>> Empty in LSR also seems to not be properly signaled, as even with it set
>> writes to the transmit register take ~6ms.  This leads to the watchdog
>> triggering very easily on such system.
>>
>> Introduce an upper bound on the execution time of ns16550_interrupt(), this
>> is currently set as 4x the polling interval, which is calculated as the
>> time to fill RX FIFO and/or empty TX FIFO.  The current maximum is 5ms.
>> Once the timeout triggers the interrupt is disabled and the uart is
>> switched to polling mode.
> 
> Aren't you missing a Fixes: tag?

Fixes: "SoL on Lenovo ThinkSystem SR630 V4"

you mean? I think there's nothing wrong with our pre-existing code, and
the changes here instead are a workaround for some (apparently) badly
implemented SoL.

Jan

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH for-4.22] char/ns16550: bound execution time of ns16550_interrupt()
  2026-06-23 13:46   ` Jan Beulich
@ 2026-06-23 14:19     ` Roger Pau Monné
  0 siblings, 0 replies; 8+ messages in thread
From: Roger Pau Monné @ 2026-06-23 14:19 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksii Kurochko, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, Jun 23, 2026 at 03:46:11PM +0200, Jan Beulich wrote:
> On 23.06.2026 15:36, Oleksii Kurochko wrote:
> > On 6/23/26 12:31 PM, Roger Pau Monne wrote:
> >> The current logic in ns16550_interrupt() will loop until the device sets
> >> the NOINT in IIR.  At least on the Lenovo ThinkSystem SR630 V4 the flow
> >> control of the serial-over-lan emulated UART seems to be broken, as it
> >> doesn't set the NOINT bit consistently.  The Transmitter Holding Register
> >> Empty in LSR also seems to not be properly signaled, as even with it set
> >> writes to the transmit register take ~6ms.  This leads to the watchdog
> >> triggering very easily on such system.
> >>
> >> Introduce an upper bound on the execution time of ns16550_interrupt(), this
> >> is currently set as 4x the polling interval, which is calculated as the
> >> time to fill RX FIFO and/or empty TX FIFO.  The current maximum is 5ms.
> >> Once the timeout triggers the interrupt is disabled and the uart is
> >> switched to polling mode.
> > 
> > Aren't you missing a Fixes: tag?
> 
> Fixes: "SoL on Lenovo ThinkSystem SR630 V4"
> 
> you mean? I think there's nothing wrong with our pre-existing code, and
> the changes here instead are a workaround for some (apparently) badly
> implemented SoL.

It was on purpose that no Fixes tag was provided.  Xen code would be
fine with well-behaved uarts, however most of the serial-over-lan
emulated ones are not well behaved it seems.

There's a possible issue with the unbounded loop in
ns16550_interrupt() as it's relying solely on hardware register values
to terminate, which again would be OK if hardware was correctly
implemented.  I don't think this warrants a Fixes tag.

Thanks, Roger.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH for-4.22] char/ns16550: bound execution time of ns16550_interrupt()
  2026-06-23 10:31 [PATCH for-4.22] char/ns16550: bound execution time of ns16550_interrupt() Roger Pau Monne
  2026-06-23 13:36 ` Oleksii Kurochko
@ 2026-06-23 13:44 ` Jan Beulich
  2026-06-23 14:16   ` Roger Pau Monné
  1 sibling, 1 reply; 8+ messages in thread
From: Jan Beulich @ 2026-06-23 13:44 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Oleksii Kurochko, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Stefano Stabellini, xen-devel

On 23.06.2026 12:31, Roger Pau Monne wrote:
> The current logic in ns16550_interrupt() will loop until the device sets
> the NOINT in IIR.  At least on the Lenovo ThinkSystem SR630 V4 the flow
> control of the serial-over-lan emulated UART seems to be broken, as it
> doesn't set the NOINT bit consistently.  The Transmitter Holding Register
> Empty in LSR also seems to not be properly signaled, as even with it set
> writes to the transmit register take ~6ms.  This leads to the watchdog
> triggering very easily on such system.
> 
> Introduce an upper bound on the execution time of ns16550_interrupt(), this
> is currently set as 4x the polling interval, which is calculated as the
> time to fill RX FIFO and/or empty TX FIFO.  The current maximum is 5ms.
> Once the timeout triggers the interrupt is disabled and the uart is
> switched to polling mode.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> There's a possible alternative approach to solve this by moving the actual
> interrupt processing to a softirq tasklet and disabling the interrupt
> source until the processing is done, likely unifying the logic with the
> timer task.  However that's a bigger change, and too risky for 4.22 at this
> point.

+1

> --- a/xen/drivers/char/ns16550.c
> +++ b/xen/drivers/char/ns16550.c
> @@ -62,6 +62,7 @@ static struct ns16550 {
>  #endif
>      unsigned int timeout_ms;
>      bool intr_works;
> +    bool force_polling;
>      bool dw_usr_bsy;
>  #ifdef NS16550_PCI
>      /* PCI card parameters. */
> @@ -190,12 +191,41 @@ static void cf_check ns16550_interrupt(int irq, void *dev_id)
>  {
>      struct serial_port *port = dev_id;
>      struct ns16550 *uart = port->uart;
> +    /* Set quite arbitrarily as 4x the time to drain the TX or fill RX FIFOs. */

Nit: I'd drop the latter of the two "the".

> +    const s_time_t timeout = NOW() + min(MILLISECS(uart->timeout_ms * 4),
> +                                         MILLISECS(5));

MILLISECS(min(uart->timeout_ms * 4, 5U)) ?

> +    if ( uart->force_polling )
> +        return;

As the IRQ was disabled, is this even possible? I.e. should this be some
kind of assertion or alike?

>      uart->intr_works = 1;
>  
>      while ( !(ns_read_reg(uart, UART_IIR) & UART_IIR_NOINT) )
>      {
>          u8 lsr = ns_read_reg(uart, UART_LSR);
> +        s_time_t now = NOW();
> +
> +        /* Break out of the loop if spending too much time. */
> +        if ( now > timeout )
> +        {
> +            struct irq_desc *desc = irq_to_desc(irq);
> +
> +            /* Disable the interrupt source - it's never shared. */
> +            spin_lock_irq(&desc->lock);

This needs to be spin_lock_irqsave() - we may not rely on IRQs being on
when we make it here. However, ...

> +            desc->status |= IRQ_DISABLED;
> +            if ( desc->handler->disable )
> +                desc->handler->disable(desc);
> +            spin_unlock_irq(&desc->lock);

... all of this open-coding is quite bad anyway. We should probably add
a helper for this in IRQ handling code.

> +            /* Disable interrupt generation on the device and arm the timer. */
> +            uart->force_polling = true;
> +            ns_write_reg(uart, UART_IER, 0);
> +            set_timer(&uart->timer, now + MILLISECS(uart->timeout_ms));
> +            printk(XENLOG_WARNING
> +                   "uart interrupt taking too long, switched to polling\n");

Probably it is indeed best to keep this simple, but: A single instance of
this taking e.g. just over 5ms (perhaps with a low baud rate) may not be
indicative of an actual issue. To alleviate this at least some, perhaps
besides capping at 5ms we should also make sure that the timeout used isn't
below ->timeout_ms?

Jan

^ permalink raw reply	[flat|nested] 8+ messages in thread
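One possible reading of the two suggestions about the timeout (take min() on the millisecond value before scaling, and never let the budget drop below ->timeout_ms) can be written as a small standalone helper. This is a sketch only: `irq_budget_ms()` is a hypothetical name, and how the two bounds should interact when timeout_ms itself exceeds 5ms is an assumption here (the lower bound wins), not something the thread settles.

```c
/*
 * Hypothetical helper, for illustration only: returns the handler's
 * time budget in milliseconds for a given polling interval.
 */
static unsigned int irq_budget_ms(unsigned int timeout_ms)
{
    unsigned int budget = timeout_ms * 4;

    if ( budget > 5 )          /* cap at 5ms, as in the patch */
        budget = 5;
    if ( budget < timeout_ms ) /* but never below one polling interval */
        budget = timeout_ms;

    return budget;
}
```

With a 1ms polling interval the budget is 4ms, with 2ms it is capped at 5ms, and with a slow line where the interval is 10ms the budget follows the interval instead of the cap.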
* Re: [PATCH for-4.22] char/ns16550: bound execution time of ns16550_interrupt()
  2026-06-23 13:44 ` Jan Beulich
@ 2026-06-23 14:16   ` Roger Pau Monné
  2026-06-23 14:27     ` Jan Beulich
  0 siblings, 1 reply; 8+ messages in thread
From: Roger Pau Monné @ 2026-06-23 14:16 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksii Kurochko, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, Jun 23, 2026 at 03:44:06PM +0200, Jan Beulich wrote:
> On 23.06.2026 12:31, Roger Pau Monne wrote:
> > The current logic in ns16550_interrupt() will loop until the device sets
> > the NOINT in IIR.  At least on the Lenovo ThinkSystem SR630 V4 the flow
> > control of the serial-over-lan emulated UART seems to be broken, as it
> > doesn't set the NOINT bit consistently.  The Transmitter Holding Register
> > Empty in LSR also seems to not be properly signaled, as even with it set
> > writes to the transmit register take ~6ms.  This leads to the watchdog
> > triggering very easily on such system.
> > 
> > Introduce an upper bound on the execution time of ns16550_interrupt(), this
> > is currently set as 4x the polling interval, which is calculated as the
> > time to fill RX FIFO and/or empty TX FIFO.  The current maximum is 5ms.
> > Once the timeout triggers the interrupt is disabled and the uart is
> > switched to polling mode.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > There's a possible alternative approach to solve this by moving the actual
> > interrupt processing to a softirq tasklet and disabling the interrupt
> > source until the processing is done, likely unifying the logic with the
> > timer task.  However that's a bigger change, and too risky for 4.22 at this
> > point.
> 
> +1
> 
> > --- a/xen/drivers/char/ns16550.c
> > +++ b/xen/drivers/char/ns16550.c
> > @@ -62,6 +62,7 @@ static struct ns16550 {
> >  #endif
> >      unsigned int timeout_ms;
> >      bool intr_works;
> > +    bool force_polling;
> >      bool dw_usr_bsy;
> >  #ifdef NS16550_PCI
> >      /* PCI card parameters. */
> > @@ -190,12 +191,41 @@ static void cf_check ns16550_interrupt(int irq, void *dev_id)
> >  {
> >      struct serial_port *port = dev_id;
> >      struct ns16550 *uart = port->uart;
> > +    /* Set quite arbitrarily as 4x the time to drain the TX or fill RX FIFOs. */
> 
> Nit: I'd drop the latter of the two "the".
> 
> > +    const s_time_t timeout = NOW() + min(MILLISECS(uart->timeout_ms * 4),
> > +                                         MILLISECS(5));
> 
> MILLISECS(min(uart->timeout_ms * 4, 5U)) ?

Bah, yes, sorry, I've added the min() later and clearly didn't apply
much brainpower about its position.

> 
> > +    if ( uart->force_polling )
> > +        return;
> 
> As the IRQ was disabled, is this even possible? I.e. should this be some
> kind of assertion or alike?

Hm, I wasn't setting IRQ_DISABLED before, and hence needed this guard.
But now with IRQ_DISABLED being set in ->status do_IRQ() should filter
any stray interrupts.  I will attempt to add an ASSERT_UNREACHABLE()
here.

> >      uart->intr_works = 1;
> >  
> >      while ( !(ns_read_reg(uart, UART_IIR) & UART_IIR_NOINT) )
> >      {
> >          u8 lsr = ns_read_reg(uart, UART_LSR);
> > +        s_time_t now = NOW();
> > +
> > +        /* Break out of the loop if spending too much time. */
> > +        if ( now > timeout )
> > +        {
> > +            struct irq_desc *desc = irq_to_desc(irq);
> > +
> > +            /* Disable the interrupt source - it's never shared. */
> > +            spin_lock_irq(&desc->lock);
> 
> This needs to be spin_lock_irqsave() - we may not rely on IRQs being on
> when we make it here. However, ...

I was relying on do_IRQ() unconditionally enabling IRQs before calling
the handler, but it's safer to use the _irqsave() variant.

> > +            desc->status |= IRQ_DISABLED;
> > +            if ( desc->handler->disable )
> > +                desc->handler->disable(desc);
> > +            spin_unlock_irq(&desc->lock);
> 
> ... all of this open-coding is quite bad anyway. We should probably add
> a helper for this in IRQ handling code.

Can add, no problem.

> > +            /* Disable interrupt generation on the device and arm the timer. */
> > +            uart->force_polling = true;
> > +            ns_write_reg(uart, UART_IER, 0);
> > +            set_timer(&uart->timer, now + MILLISECS(uart->timeout_ms));
> > +            printk(XENLOG_WARNING
> > +                   "uart interrupt taking too long, switched to polling\n");
> 
> Probably it is indeed best to keep this simple, but: A single instance of
> this taking e.g. just over 5ms (perhaps with a low baud rate) may not be
> indicative of an actual issue.

It's all fiddly, yes, I could add a counter and maybe only disable if
we have a certain amount of executions of ns16550_interrupt()
exceeding the timeout, yet I wanted to keep this simple.

> To alleviate this at least some, perhaps
> besides capping at 5ms we should also make sure that the timeout used isn't
> below ->timeout_ms?

OK, let me see what I can do.

Thanks, Roger.

^ permalink raw reply	[flat|nested] 8+ messages in thread
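The counter idea mentioned in the reply above (only give up on interrupts after several handler runs have exceeded the timeout) could look roughly like the sketch below. Everything in it is hypothetical: `struct overrun_state`, `note_handler_run()` and the `OVERRUN_LIMIT` value of 3 are invented for illustration, and the thread does not say whether overruns should be counted cumulatively (as done here) or only when consecutive.

```c
#include <stdbool.h>

#define OVERRUN_LIMIT 3u /* hypothetical threshold, not from the thread */

/* Hypothetical per-UART bookkeeping; not part of struct ns16550. */
struct overrun_state {
    unsigned int overruns; /* handler runs that exceeded the time budget */
    bool force_polling;    /* set once the limit has been reached */
};

/*
 * Record the outcome of one handler invocation.  Returns true once the
 * device should be switched to polling mode.
 */
static bool note_handler_run(struct overrun_state *s, bool exceeded)
{
    if ( exceeded && ++s->overruns >= OVERRUN_LIMIT )
        s->force_polling = true;

    return s->force_polling;
}
```

A single slow invocation, for example on a low baud rate, would then not be enough to abandon interrupt mode, at the cost of tolerating a few long handler runs before the fallback takes effect.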
* Re: [PATCH for-4.22] char/ns16550: bound execution time of ns16550_interrupt()
  2026-06-23 14:16   ` Roger Pau Monné
@ 2026-06-23 14:27     ` Jan Beulich
  2026-06-23 15:54       ` Roger Pau Monné
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Beulich @ 2026-06-23 14:27 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Oleksii Kurochko, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Stefano Stabellini, xen-devel

On 23.06.2026 16:16, Roger Pau Monné wrote:
> On Tue, Jun 23, 2026 at 03:44:06PM +0200, Jan Beulich wrote:
>> On 23.06.2026 12:31, Roger Pau Monne wrote:
>>> +    if ( uart->force_polling )
>>> +        return;
>>
>> As the IRQ was disabled, is this even possible? I.e. should this be some
>> kind of assertion or alike?
> 
> Hm, I wasn't setting IRQ_DISABLED before, and hence needed this guard.
> But now with IRQ_DISABLED being set in ->status do_IRQ() should filter
> any stray interrupts.  I will attempt to add an ASSERT_UNREACHABLE()
> here.

Simply ASSERT(!uart->force_polling) should do here? It is not wrong to
run the code below in release builds in such an event. If we kept getting
interrupts (perhaps at a high frequency) we'd be in trouble anyway.

Jan

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH for-4.22] char/ns16550: bound execution time of ns16550_interrupt()
  2026-06-23 14:27     ` Jan Beulich
@ 2026-06-23 15:54       ` Roger Pau Monné
  0 siblings, 0 replies; 8+ messages in thread
From: Roger Pau Monné @ 2026-06-23 15:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksii Kurochko, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Stefano Stabellini, xen-devel

On Tue, Jun 23, 2026 at 04:27:12PM +0200, Jan Beulich wrote:
> On 23.06.2026 16:16, Roger Pau Monné wrote:
> > On Tue, Jun 23, 2026 at 03:44:06PM +0200, Jan Beulich wrote:
> >> On 23.06.2026 12:31, Roger Pau Monne wrote:
> >>> +    if ( uart->force_polling )
> >>> +        return;
> >>
> >> As the IRQ was disabled, is this even possible? I.e. should this be some
> >> kind of assertion or alike?
> > 
> > Hm, I wasn't setting IRQ_DISABLED before, and hence needed this guard.
> > But now with IRQ_DISABLED being set in ->status do_IRQ() should filter
> > any stray interrupts.  I will attempt to add an ASSERT_UNREACHABLE()
> > here.
> 
> Simply ASSERT(!uart->force_polling) should do here? It is not wrong to
> run the code below in release builds in such an event. If we kept getting
> interrupts (perhaps at a high frequency) we'd be in trouble anyway.

No, I'm afraid I can't do it like that, I can't put an ASSERT there,
because we can still get into ns16550_interrupt() after the interrupt
has been disabled.  In do_IRQ() we have the following loop:

    while ( desc->status & IRQ_PENDING )
    {
        desc->status &= ~IRQ_PENDING;
        spin_unlock_irq(&desc->lock);

        tsc_in = tb_init_done ? get_cycles() : 0;
        action->handler(irq, action->dev_id);
        TRACE_TIME(TRC_HW_IRQ_HANDLED, irq, tsc_in, get_cycles());

        spin_lock_irq(&desc->lock);
    }

So if the device is generating further interrupts in the window with
IRQs enabled (while we execute the handler), we will keep looping
around this, without taking into account the setting of IRQ_DISABLED.
This is something that we might want to fix, so that the loop is bound
by IRQ_PENDING being set, and IRQ_DISABLED not, ie:

    while ( (desc->status & (IRQ_PENDING | IRQ_DISABLED)) == IRQ_PENDING )

Regards, Roger.

^ permalink raw reply	[flat|nested] 8+ messages in thread
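The difference between the two loop conditions discussed above can be seen in a toy model. This is a sketch under assumptions, not Xen's do_IRQ(): the flag values, `struct fake_desc`, `fake_handler()` and `run_loop()` are all made up for illustration, and locking is omitted. The handler model disables the IRQ on its first call while the simulated device keeps re-raising the interrupt for a few more rounds.

```c
#include <stdbool.h>

#define IRQ_PENDING  0x1u /* illustrative values, not Xen's definitions */
#define IRQ_DISABLED 0x2u

/* Hypothetical descriptor; only the fields the model needs. */
struct fake_desc {
    unsigned int status;
    unsigned int rearm; /* times the device re-raises the IRQ mid-handler */
    unsigned int calls; /* handler invocations so far */
};

/* Handler model: marks the IRQ disabled; the device may re-raise it. */
static void fake_handler(struct fake_desc *d)
{
    d->calls++;
    d->status |= IRQ_DISABLED;
    if ( d->rearm > 0 )
    {
        d->rearm--;
        d->status |= IRQ_PENDING;
    }
}

/* Run the dispatch loop with either condition; returns handler calls. */
static unsigned int run_loop(struct fake_desc *d, bool honour_disabled)
{
    const unsigned int mask =
        honour_disabled ? (IRQ_PENDING | IRQ_DISABLED) : IRQ_PENDING;

    /* With mask == IRQ_PENDING this is the existing "status & PENDING". */
    while ( (d->status & mask) == IRQ_PENDING )
    {
        d->status &= ~IRQ_PENDING;
        fake_handler(d);
    }

    return d->calls;
}
```

In this model, a device that re-raises the interrupt twice causes three handler calls with the existing condition, but only one when the loop also checks IRQ_DISABLED.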