* [PATCH v1 2/2] serial: 8250_dma: add parameter to queue work on specific cpu
@ 2025-10-21 8:39 Xin Zhao
2025-10-21 11:28 ` Greg KH
0 siblings, 1 reply; 13+ messages in thread
From: Xin Zhao @ 2025-10-21 8:39 UTC (permalink / raw)
To: gregkh, jirislaby; +Cc: linux-kernel, linux-serial, tj, hch, Xin Zhao
On embedded platforms, certain critical data, such as IMU data, is
transmitted over UART. The tty_flip_buffer_push interface in the TTY
layer uses system_unbound_wq to handle the flipping of the TTY buffer.
Although the unbound workqueue can create new threads on demand and wake
up a kworker thread on an idle CPU, the priority of the kworker thread
itself is not high. Even if the CPU running this work was idle just a
moment ago, the work may be preempted by real-time tasks or other
high-priority tasks.
In our system, the processing interval for each frame of IMU data
transmitted via UART can experience significant jitter due to this issue.
Instead of the expected 10 to 15 ms frame processing interval, we see
spikes up to 30 to 35 ms. Moreover, in just one or two hours, there can
be 2 to 3 occurrences of such high jitter, which is quite frequent. This
jitter exceeds the software's tolerable limit of 20 ms.
Add a module parameter, tty_flip_cpu, to queue the RX-complete work on a
specific CPU via tty_flip_buffer_push_wq. The default value of
tty_flip_cpu is WORK_CPU_UNBOUND, which means the default
system_unbound_wq is used via tty_flip_buffer_push; otherwise the newly
added workqueue wq_tty_flip, created with WQ_HIGHPRI to improve latency,
is used.
We set tty_flip_cpu to a specific CPU core that has relatively few
real-time tasks running continuously for long periods. Additionally,
tasks on this core have some correlation with the UART data related to
the 8250 DMA operation. After queuing work to this designated CPU and
setting the workqueue to WQ_HIGHPRI, we can stably eliminate the jitter
and ensure that the frame processing interval remains between 10 and
15 ms. Without the WQ_HIGHPRI flag, our testing on the platform shows
that spikes still occur approximately once every hour.
Since any project using this optimization will likely have encountered
issues similar to ours, the WQ_HIGHPRI flag is added unconditionally in
this patch, without a new module parameter to select it.
Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
---
drivers/tty/serial/8250/8250.h | 2 ++
drivers/tty/serial/8250/8250_dma.c | 46 ++++++++++++++++++++++++++++--
2 files changed, 46 insertions(+), 2 deletions(-)
diff --git a/drivers/tty/serial/8250/8250.h b/drivers/tty/serial/8250/8250.h
index cfe6ba286..e6da925df 100644
--- a/drivers/tty/serial/8250/8250.h
+++ b/drivers/tty/serial/8250/8250.h
@@ -20,6 +20,8 @@ struct uart_8250_dma {
void (*prepare_tx_dma)(struct uart_8250_port *p);
void (*prepare_rx_dma)(struct uart_8250_port *p);
+ struct workqueue_struct *wq_tty_flip;
+
/* Filter function */
dma_filter_fn fn;
/* Parameter to the filter function */
diff --git a/drivers/tty/serial/8250/8250_dma.c b/drivers/tty/serial/8250/8250_dma.c
index bdd26c9f3..9a0abee62 100644
--- a/drivers/tty/serial/8250/8250_dma.c
+++ b/drivers/tty/serial/8250/8250_dma.c
@@ -38,6 +38,35 @@ static void __dma_tx_complete(void *param)
uart_port_unlock_irqrestore(&p->port, flags);
}
+#define TTY_FLIP_WORK_CPU WORK_CPU_UNBOUND
+
+static int wq_tty_flip_cpu = TTY_FLIP_WORK_CPU;
+
+static int param_set_tty_flip_cpu(const char *val,
+ const struct kernel_param *kp)
+{
+ int cpu;
+ int ret;
+
+ ret = kstrtoint(val, 0, &cpu);
+ if (ret)
+ return ret;
+
+ if ((cpu >= 0 && cpu < nr_cpu_ids) || cpu == WORK_CPU_UNBOUND)
+ wq_tty_flip_cpu = cpu;
+ else
+ return -EINVAL;
+
+ return 0;
+}
+
+static const struct kernel_param_ops tty_flip_cpu_ops = {
+ .set = param_set_tty_flip_cpu,
+ .get = param_get_int,
+};
+
+module_param_cb(tty_flip_cpu, &tty_flip_cpu_ops, &wq_tty_flip_cpu, 0644);
+
static void __dma_rx_complete(struct uart_8250_port *p)
{
struct uart_8250_dma *dma = p->dma;
@@ -61,7 +90,10 @@ static void __dma_rx_complete(struct uart_8250_port *p)
p->port.icount.rx += count;
dma->rx_running = 0;
- tty_flip_buffer_push(tty_port);
+ if (wq_tty_flip_cpu == WORK_CPU_UNBOUND)
+ tty_flip_buffer_push(tty_port);
+ else
+ tty_flip_buffer_push_wq(tty_port, dma->wq_tty_flip, wq_tty_flip_cpu);
}
static void dma_rx_complete(void *param)
@@ -244,6 +276,12 @@ int serial8250_request_dma(struct uart_8250_port *p)
goto release_rx;
}
+ dma->wq_tty_flip = alloc_workqueue("wq_tty_flip", WQ_PERCPU | WQ_HIGHPRI, 0);
+ if (!dma->wq_tty_flip) {
+ ret = -ENOMEM;
+ goto release_rx;
+ }
+
dmaengine_slave_config(dma->rxchan, &dma->rxconf);
/* Get a channel for TX */
@@ -252,7 +290,7 @@ int serial8250_request_dma(struct uart_8250_port *p)
p->port.dev, "tx");
if (!dma->txchan) {
ret = -ENODEV;
- goto release_rx;
+ goto release_rx_wq;
}
/* 8250 tx dma requires dmaengine driver to support terminate */
@@ -294,6 +332,8 @@ int serial8250_request_dma(struct uart_8250_port *p)
return 0;
err:
dma_release_channel(dma->txchan);
+release_rx_wq:
+ destroy_workqueue(dma->wq_tty_flip);
release_rx:
dma_release_channel(dma->rxchan);
return ret;
@@ -322,6 +362,8 @@ void serial8250_release_dma(struct uart_8250_port *p)
dma->txchan = NULL;
dma->tx_running = 0;
+ destroy_workqueue(dma->wq_tty_flip);
+
dev_dbg_ratelimited(p->port.dev, "dma channels released\n");
}
EXPORT_SYMBOL_GPL(serial8250_release_dma);
--
2.34.1
* Re: [PATCH v1 2/2] serial: 8250_dma: add parameter to queue work on specific cpu
From: Greg KH @ 2025-10-21 11:28 UTC (permalink / raw)
To: Xin Zhao; +Cc: jirislaby, linux-kernel, linux-serial, tj, hch
On Tue, Oct 21, 2025 at 04:39:47PM +0800, Xin Zhao wrote:
Meta-comment, this is a v2, and was not threaded with patch 1/1. Please
use a tool like b4 or git send-email to send patches out so that we can
properly review/apply them.
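For example, a series can be generated and sent as one threaded set (the
flags and paths here are only illustrative):

```shell
# Generate the series as a single threaded set (cover letter + patches):
git format-patch --cover-letter --thread -v2 -o outgoing/ HEAD~2

# Send everything in one git send-email invocation so all patches
# reply to the cover letter and stay in one thread:
git send-email outgoing/*.patch

# Alternatively, b4 can prepare and send the series:
#   b4 prep && b4 send
```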
> On the embedded platform, certain critical data, such as IMU data, is
> transmitted through UART. The tty_flip_buffer_push interface in the TTY
> layer uses system_unbound_wq to handle the flipping of the TTY buffer.
> Although the unbound workqueue can create new threads on demand and wake
> up the kworker thread on an idle CPU, the priority of the kworker thread
> itself is not high. Even if the CPU running this work was idle just a
> moment ago, it may be preempted by real-time tasks or other high-priority
> tasks.
Are you using the RT kernel? It seems odd that this is just coming up
now, given our normal latency issues in the tty layer. What changed to
cause this?
> In our system, the processing interval for each frame of IMU data
> transmitted via UART can experience significant jitter due to this issue.
> Instead of the expected 10 to 15 ms frame processing interval, we see
> spikes up to 30 to 35 ms. Moreover, in just one or two hours, there can
> be 2 to 3 occurrences of such high jitter, which is quite frequent. This
> jitter exceeds the software's tolerable limit of 20 ms.
Again, are you using the RT kernel? If not, can you try that?
> Add module param tty_flip_cpu to queue rx complete work on the specific
> cpu by tty_flip_buffer_push_wq.
This is not the 1990's, we don't add new module parameters :)
This will not work for systems with multiple tty devices/drivers, please
use a correct api for this if you really really need it.
But again, I think this might not be needed if you use the RT kernel.
> The default value of tty_flip_cpu is
> WORK_CPU_UNBOUND which means using the default system_unbound_wq called
> by tty_flip_buffer_push, otherwise we use the newly added workqueue
> wq_tty_flip which is set to WQ_HIGHPRI to promote performance.
> We set tty_flip_cpu to a specific CPU core that has relatively few
> real-time tasks running continuously for long periods. Additionally,
> tasks on this core have some correlation with the UART data related to
> the 8250 DMA operation. After queuing work to this designated CPU and
> set workqueue to WQ_HIGHPRI, we can stably eliminate the jitter and
> ensure that the frame processing interval remains between 10 and 15 ms.
> If we do not add the WQ_HIGHPRI flag, our testing on the platform shows
> that there are still spikes occurring approximately once every hour.
> Considering that projects utilizing this optimization feature must have
> encountered similar issues to ours, just add the WQ_HIGHPRI flag in the
> patch, without adding new module parameters for selection.
>
> Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
> ---
> drivers/tty/serial/8250/8250.h | 2 ++
> drivers/tty/serial/8250/8250_dma.c | 46 ++++++++++++++++++++++++++++--
> 2 files changed, 46 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/tty/serial/8250/8250.h b/drivers/tty/serial/8250/8250.h
> index cfe6ba286..e6da925df 100644
> --- a/drivers/tty/serial/8250/8250.h
> +++ b/drivers/tty/serial/8250/8250.h
> @@ -20,6 +20,8 @@ struct uart_8250_dma {
> void (*prepare_tx_dma)(struct uart_8250_port *p);
> void (*prepare_rx_dma)(struct uart_8250_port *p);
>
> + struct workqueue_struct *wq_tty_flip;
> +
> /* Filter function */
> dma_filter_fn fn;
> /* Parameter to the filter function */
> diff --git a/drivers/tty/serial/8250/8250_dma.c b/drivers/tty/serial/8250/8250_dma.c
> index bdd26c9f3..9a0abee62 100644
> --- a/drivers/tty/serial/8250/8250_dma.c
> +++ b/drivers/tty/serial/8250/8250_dma.c
> @@ -38,6 +38,35 @@ static void __dma_tx_complete(void *param)
> uart_port_unlock_irqrestore(&p->port, flags);
> }
>
> +#define TTY_FLIP_WORK_CPU WORK_CPU_UNBOUND
> +
> +static int wq_tty_flip_cpu = TTY_FLIP_WORK_CPU;
> +
> +static int param_set_tty_flip_cpu(const char *val,
> + const struct kernel_param *kp)
> +{
> + int cpu;
> + int ret;
> +
> + ret = kstrtoint(val, 0, &cpu);
> + if (ret)
> + return ret;
> +
> + if ((cpu >= 0 && cpu < nr_cpu_ids) || cpu == WORK_CPU_UNBOUND)
> + wq_tty_flip_cpu = cpu;
> + else
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +static const struct kernel_param_ops tty_flip_cpu_ops = {
> + .set = param_set_tty_flip_cpu,
> + .get = param_get_int,
> +};
> +
> +module_param_cb(tty_flip_cpu, &tty_flip_cpu_ops, &wq_tty_flip_cpu, 0644);
> +
> static void __dma_rx_complete(struct uart_8250_port *p)
> {
> struct uart_8250_dma *dma = p->dma;
> @@ -61,7 +90,10 @@ static void __dma_rx_complete(struct uart_8250_port *p)
> p->port.icount.rx += count;
> dma->rx_running = 0;
>
> - tty_flip_buffer_push(tty_port);
> + if (wq_tty_flip_cpu == WORK_CPU_UNBOUND)
> + tty_flip_buffer_push(tty_port);
> + else
> + tty_flip_buffer_push_wq(tty_port, dma->wq_tty_flip, wq_tty_flip_cpu);
So you just bound ALL tty devices to the same cpu? That feels very
wrong, how will that work with multiple devices on different interrupts
coming in on different cpus?
Why not just queue this up on the cpu that the irq happened on instead?
thanks,
greg k-h
* Re: [PATCH v1 2/2] serial: 8250_dma: add parameter to queue work on specific cpu
From: Xin Zhao @ 2025-10-21 13:05 UTC (permalink / raw)
To: gregkh; +Cc: hch, jackzxcui1989, jirislaby, linux-kernel, linux-serial, tj
On Tue, 21 Oct 2025 13:28:46 +0200 Greg KH <gregkh@linuxfoundation.org> wrote:
> Meta-comment, this is a v2, and was not threaded with patch 1/1. Please
> use a tool like b4 or git send-email to send patches out so that we can
> properly review/apply them.
I did use git send-email to send the patches. I have successfully used
git send-email to send a series of patches before, and that series was
later merged into 6.18-rc1, so I do not think my usage of git send-email
is the issue. The only difference in this submission is that after
sending each patch with git send-email, I waited for it to appear on
lkml.org before sending the next one. I had previously encountered issues
where sending multiple patches in a row caused only some of them to appear
on lkml.org, with the others taking a long time to show up. That's why I
deliberately waited a bit after each patch this time. I'm really not sure
why the system didn't group my three patches together; it's quite puzzling
to me.
> Are you using the RT kernel? It seems odd that this is just coming up
> now, given our normal latency issues in the tty layer. What changed to
> cause this?
> > In our system, the processing interval for each frame of IMU data
> > transmitted via UART can experience significant jitter due to this issue.
> > Instead of the expected 10 to 15 ms frame processing interval, we see
> > spikes up to 30 to 35 ms. Moreover, in just one or two hours, there can
> > be 2 to 3 occurrences of such high jitter, which is quite frequent. This
> > jitter exceeds the software's tolerable limit of 20 ms.
>
> Again, are you using the RT kernel? If not, can you try that?
Our system is indeed RT-Linux, version 6.1.146-rt53. Our project has been
ongoing for about 2 years, and we hadn't paid close attention to these jitter
issues before. It is only recently that we needed to focus on them specifically.
> This is not the 1990's, we don't add new module parameters :)
>
> This will not work for systems with multiple tty devices/drivers, please
> use a correct api for this if you really really need it.
>
> But again, I think this might not be needed if you use the RT kernel
> issue.
If user-configurable CPUs are really needed for queueing, do you have any
recommendations on how to implement this? Would it be appropriate to create
a new node under /sys/devices/platform/serial8250/tty/ttyS*?
As mentioned earlier, our system is an RT-Linux system. Since you say that
an RT kernel may not hit this issue, are you implying that RT-Linux has
priority-inheritance logic? However, the logic within work items isn't
always protected by locks. Even when everything is under lock protection,
a work item that has started executing but has not yet taken the lock can
still be preempted by real-time tasks. On an RT-Linux system, such
preemption of kworkers by real-time tasks is more common than on a
standard kernel.
> So you just bound ALL tty devices to the same cpu? That feels very
> wrong, how will that work with multiple devices on different interrupts
> coming in on different cpus?
>
> Why not just queue this up on the cpu that the irq happened on instead?
If possible, I will later add a node to separately configure each tty/ttyS*
device. The approach I intend to use is to add this tty_flip_cpu node under
/sys/devices/platform/serial8250/tty/ttyS*. Is it OK?
In our project, the DMA RX complete events for the tty are distributed
almost evenly across the cores. I believe it is necessary to make
queue_work more flexible about which CPU the work is queued on, as it is
difficult to predict from the driver code or low-level policy whether a
given core is running many long-duration real-time tasks.
--
Xin Zhao
* Re: [PATCH v1 2/2] serial: 8250_dma: add parameter to queue work on specific cpu
From: Greg KH @ 2025-10-21 16:26 UTC (permalink / raw)
To: Xin Zhao; +Cc: hch, jirislaby, linux-kernel, linux-serial, tj
On Tue, Oct 21, 2025 at 09:05:13PM +0800, Xin Zhao wrote:
> On Tue, 21 Oct 2025 13:28:46 +0200 Greg KH <gregkh@linuxfoundation.org> wrote:
>
> > Meta-comment, this is a v2, and was not threaded with patch 1/1. Please
> > use a tool like b4 or git send-email to send patches out so that we can
> > properly review/apply them.
>
> I did use git send-email to send the patches. I have successfully used the
> git send-email tool to send a series of patches before, and that series was
> later merged into 6.18-rc1. Therefore, I feel that my usage of git send-email
> should not be an issue. The only difference of my current submission is that
> after sending each patch with git send-email, I waited for the patches to
> appear on lkml.org before sending the next one. I had encountered issues before
> where sending multiple patches in a row caused only some of them to appear on
> lkml.org, with others taking a long time to show up. That's why I specifically
> decided to wait a bit after each patch this time. I'm really not sure why the
> system didn't group my three patches together; it's quite puzzling to me.
I do not know, but it was not threaded :(
Try sending it to yourself and see if the delay maybe caused a problem?
There should not be an issue sending just 3 messages at once.
> > Are you using the RT kernel? It seems odd that this is just coming up
> > now, given our normal latency issues in the tty layer. What changed to
> > cause this?
>
> > > In our system, the processing interval for each frame of IMU data
> > > transmitted via UART can experience significant jitter due to this issue.
> > > Instead of the expected 10 to 15 ms frame processing interval, we see
> > > spikes up to 30 to 35 ms. Moreover, in just one or two hours, there can
> > > be 2 to 3 occurrences of such high jitter, which is quite frequent. This
> > > jitter exceeds the software's tolerable limit of 20 ms.
> >
> > Again, are you using the RT kernel? If not, can you try that?
>
> Our system is indeed RT-Linux, version 6.1.146-rt53. Our project has been
> ongoing for about 2 years, and we hadn't paid close attention to these jitter
> issues before. It is only recently that we needed to focus on them specifically.
6.1 is very old, please try the latest 6.17.y release, or 6.18-rc1.
Lots and lots of RT stuff has been fixed since 6.1.
> > This is not the 1990's, we don't add new module parameters :)
> >
> > This will not work for systems with multiple tty devices/drivers, please
> > use a correct api for this if you really really need it.
> >
> > But again, I think this might not be needed if you use the RT kernel
> > issue.
>
> If user-configurable CPUs are really needed for queueing, do you have any
> recommendations on how to implement this? Would it be appropriate to create
> a new node under /sys/devices/platform/serial8250/tty/ttyS*?
> As mentioned earlier, our system is an RT-Linux system. As you say that
> using an RT kernel may not encounter this issue, are you implying that RT-Linux
> has priority inheritance logic?
I'm saying that the RT selection should cause you not to have those
larger variances in delays. Try a newer kernel and see.
> However, the logic within work items isn't
> always protected by locks. Even if everything is under lock protection, when a
> work item starts executing and hasn't yet entered the lock, it can still be
> preempted by real-time tasks. In an RT-Linux system, such preemption of kworker
> by real-time tasks is more common compared to a standard kernel.
True, which is something that the rt changes should have addressed.
This sounds like an overall rt issue, not a serial port issue. Have you
asked about this on the rt mailing list?
> > So you just bound ALL tty devices to the same cpu? That feels very
> > wrong, how will that work with multiple devices on different interrupts
> > coming in on different cpus?
> >
> > Why not just queue this up on the cpu that the irq happened on instead?
>
> If possible, I will later add a node to separately configure each tty/ttyS*
> device. The approach I intend to use is to add this tty_flip_cpu node under
> /sys/devices/platform/serial8250/tty/ttyS*. Is it OK?
> Based on our project, the DMA RX complete events for tty are almost evenly
> distributed across each core.
This should come from a hardware definition somewhere in your DT, not as
a user-selectable option. And again, why not just tie it to the cpu
where the irq came from automatically?
> I believe it is necessary to increase the
> flexibility of the queue_work to be located on a specific CPU, as it is difficult
> to predict whether a core is running many long-duration real-time tasks based on
> the driver code or low-level logic strategy.
That's a system design issue, not a kernel issue :)
thanks,
greg k-h
* Re: [PATCH v1 2/2] serial: 8250_dma: add parameter to queue work on specific cpu
From: Tejun Heo @ 2025-10-21 16:40 UTC (permalink / raw)
To: Greg KH; +Cc: Xin Zhao, hch, jirislaby, linux-kernel, linux-serial
Hello,
On Tue, Oct 21, 2025 at 06:26:53PM +0200, Greg KH wrote:
...
> True, which is something that the rt changes should have addressed.
> This sounds like an overall rt issue, not a serial port issue. Have you
> asked about this on the rt mailing list?
Maybe it's mostly RT; however, because of the shared worker pool
implementation, workqueue can introduce high tail latencies, especially
depending on what mm is doing. When a work item is scheduled, there would
usually be a worker thread that can service it right away; however, there's
no guarantee and when it gets unlucky it may have to create a new kworker
for the pool, which can easily add some msecs of delay. If the system is
under even moderate memory pressure, that can shoot up considerably too.
I'm not convinced WQ_HIGHPRI is a good solution tho. It changes scheduling
priority and it likely reduces the probability of having to create a new
kworker as highpri pools are a lot less utilized. However, that can still
happen. If consistent low latency is really important, it should use
dedicated kthread workers with appropriate priority bump.
Thanks.
--
tejun
* Re: [PATCH v1 2/2] serial: 8250_dma: add parameter to queue work on specific cpu
From: Xin Zhao @ 2025-10-21 17:39 UTC (permalink / raw)
To: tj; +Cc: gregkh, hch, jackzxcui1989, jirislaby, linux-kernel, linux-serial
On Tue, 21 Oct 2025 06:40:04 -1000 Tejun Heo <tj@kernel.org> wrote:
> > True, which is something that the rt changes should have addressed.
> > This sounds like an overall rt issue, not a serial port issue. Have you
> > asked about this on the rt mailing list?
>
> Maybe it's mostly RT; however, because of the shared worker pool
> implementation, workqueue can introduce high tail latencies, especially
> depending on what mm is doing. When a work item is scheduled, there would
> usually be a worker thread that can service it right away; however, there's
> no guarantee and when it gets unlucky it may have to create a new kworker
> for the pool, which can easily add some msecs of delay. If the system is
> under even moderate memory pressure, that can shoot up considerably too.
>
> I'm not convinced WQ_HIGHPRI is a good solution tho. It changes scheduling
> priority and it likely reduces the probabilty of having to create a new
> kworker as highpri pools are a lot less utilized. However, that can still
> happen. If consistent low latency is really important, it should use
> dedicated kthread workers with appropriate priority bump.
Thank you for your insights about high tail latencies. However, the spikes
I'm encountering in the project related to kworker still stem from RT
processes preempting kworker threads that are already executing. In our
project, I have already implemented my optimization approach by queuing
work to CPU0, since there aren't as many long-running real-time tasks on CPU0.
Moreover, some logic on CPU0 relies on the tty UART data, which provides a
degree of complementarity from the perspective of CPU resource usage. From
the experimental data, the most significant optimization effect comes
from queuing work to this fixed CPU0, rather than from the WQ_HIGHPRI flag,
although WQ_HIGHPRI does provide some improvement.
I also agree that tasks requiring continuous real-time execution should be
handled by kthread workers. However, while the ideal situation is appealing,
the reality is challenging. A large amount of driver code in the system uses
the system's pwq and worker_pool management, as this API is very convenient.
Refactoring the code carries significant change risks, and even with a team
effort, it's hard to bear such risks.
From a functional development perspective, adding flags like WQ_HIGHPRI,
or even introducing a WQ_RT, does not raise much concern; we only need to
check whether there is any impact on performance. In other words, it might
be difficult in practice for developers working on the tty driver to accept
changes to a large portion of the current work-item code in tty, which
makes such a rework quite challenging.
--
Xin Zhao
* Re: [PATCH v1 2/2] serial: 8250_dma: add parameter to queue work on specific cpu
From: Tejun Heo @ 2025-10-21 18:50 UTC (permalink / raw)
To: Xin Zhao; +Cc: gregkh, hch, jirislaby, linux-kernel, linux-serial
Hello,
On Wed, Oct 22, 2025 at 01:39:33AM +0800, Xin Zhao wrote:
...
> I also agree that tasks requiring continuous real-time execution should be
> handled by kthread workers. However, while the ideal situation is appealing,
> the reality is challenging. A large amount of driver code in the system uses
> the system's pwq and worker_pool management, as this API is very convenient.
> Refactoring the code carries significant change risks, and even with a team
> effort, it's hard to bear such risks.
kthread_work's API is really similar to workqueue to make it easy to switch
between the two. We probably didn't go far enough tho and it may make sense
to allow workqueue to always use dedicated fixed set of workers (e.g. by
allowing a workqueue to create a private pool of workers) when configured
from userspace so that this becomes a configuration problem which doesn't
require code changes.
> Adding flags like WQ_HIGHPRI or even introducing WQ_RT from a functional
> development perspective doesn't pose too much concern; we just need to focus
WQ_RT really doesn't make sense given that you can't even tell whether any
given work item would have a worker ready for it. What does RT priority do
when you need to go and create a new kworker?
Note that workqueue property changes don't need to be hard coded. If you
make the code use its own workqueue and set WQ_SYSFS, you can change its
properties from userspace through the sysfs interface.
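Concretely, a workqueue created with WQ_SYSFS appears under
/sys/devices/virtual/workqueue/; for an unbound workqueue its cpumask and
nice attributes can then be tuned at runtime (the workqueue name below is
illustrative):

```shell
# Attributes exported for a WQ_SYSFS workqueue (see
# Documentation/core-api/workqueue.rst); cpumask and nice are
# exposed only for unbound workqueues:
ls /sys/devices/virtual/workqueue/wq_tty_flip/

# Restrict the workqueue's workers to CPU 0 (hex mask):
echo 1 > /sys/devices/virtual/workqueue/wq_tty_flip/cpumask

# Raise the workers' priority by lowering their nice value:
echo -10 > /sys/devices/virtual/workqueue/wq_tty_flip/nice
```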
Thanks.
--
tejun
* Re: [PATCH v1 2/2] serial: 8250_dma: add parameter to queue work on specific cpu
From: Xin Zhao @ 2025-10-22 13:59 UTC (permalink / raw)
To: tj; +Cc: gregkh, hch, jackzxcui1989, jirislaby, linux-kernel, linux-serial
On Tue, 21 Oct 2025 08:50:16 -1000 Tejun Heo <tj@kernel.org> wrote:
> > I also agree that tasks requiring continuous real-time execution should be
> > handled by kthread workers. However, while the ideal situation is appealing,
> > the reality is challenging. A large amount of driver code in the system uses
> > the system's pwq and worker_pool management, as this API is very convenient.
> > Refactoring the code carries significant change risks, and even with a team
> > effort, it's hard to bear such risks.
>
> kthread_work's API is really similar to workqueue to make it easy to switch
> between the two. We probably didn't go far enough tho and it may make sense
> to allow workqueue to always use dedicated fixed set of workers (e.g. by
> allowing a workqueue to create a private pool of workers) when configured
> from userspace so that this becomes a configuration problem which doesn't
> require code changes.
kthread_work's API is great for systems that do not care about throughput
but are concerned with real-time performance. It is suitable for scenarios
where a single work instance corresponds to a single workqueue. In
situations where a workqueue corresponds to multiple different works,
kthread_work does not create threads on demand. To address this, we might
consider adding a new interface, perhaps called kthread_run_workqueue.
Unlike kthread_run_worker, this new interface would decide whether to
create a new thread based on the pointer to the work instance passed in,
ensuring that each work uniquely corresponds to a thread. This approach
has several advantages: it allows seamless switching between the existing
workqueue's queue-work logic and kthread_work, and it can help avoid
missing any work associated with a particular workqueue, especially in a
large codebase containing multiple layers.
If the kthread_work API does not provide the functionality to create
different threads for different works, then I think a private worker pool
is meaningful. However, it may still lead to potential blocking of
subsequent works if multiple different works arrive concurrently while
there is only one active kworker thread.
> > Adding flags like WQ_HIGHPRI or even introducing WQ_RT from a functional
> > development perspective doesn't pose too much concern; we just need to focus
>
> WQ_RT really doesn't make sense given that you can't even tell whether any
> given work item would have a worker ready for it. What does RT priority do
> when you need to go and create a new kworker?
Indeed, WQ_RT cannot address the issue of multiple different works arriving
concurrently in the same worker pool leading to delays, as just mentioned.
The on-demand creation of new threads in a workqueue is, in the current
code implementation, inherently non-real-time.
It seems that only the kthread_work mechanism can ensure that work tasks are
executed in real-time. I will use it to solve another GPU-related kworker
issue that I encounter in our system.
> Note that workqueue property changes don't need to be hard coded. If you
> make the code use its own workqueue and set WQ_SYSFS, you can change its
> properties from userspace through sysfs interface.
Regarding the current kworker preemption issue with 8250_dma, based on our
current measurements, using the kworker API is sufficient. Additionally, if
we switch to kthread_work, the tty driver layer would require significant
modifications. Therefore, I will adopt the WQ_SYSFS approach you mentioned,
as it is quite convenient for dynamic control from user space.
--
Xin Zhao
* Re: [PATCH v1 2/2] serial: 8250_dma: add parameter to queue work on specific cpu
From: Xin Zhao @ 2025-10-22 14:47 UTC (permalink / raw)
To: gregkh; +Cc: hch, jackzxcui1989, jirislaby, linux-kernel, linux-serial, tj
On Tue, 21 Oct 2025 18:26:53 +0200 Greg KH <gregkh@linuxfoundation.org> wrote:
> I do not know, but it was not threaded :(
>
> Try sending it to yourself and see if the delay maybe caused a problem?
> there should not be an issue sending just 3 messages at once.
Now that I think about it, both patches are small and address the same issue, so I
will combine them into a single patch for the next submission.
> 6.1 is very old, please try the latest 6.17.y release, or 6.18-rc1.
> Lots and lots of RT stuff has been fixed since 6.1.
>
> I'm saying that the RT selection should cause you not to have those
> larger variants in delays. Try a newer kernel and see.
I have now worked out, as you suspected, why there is no issue on the standard
kernel but there is one on RT-Linux. In dma_rx_complete() in 8250_dma.c, locking
is done with uart_port_lock_irqsave()/uart_port_unlock_irqrestore(), and
uart_port_lock_irqsave() calls spin_lock_irqsave(). On the standard kernel,
spin_lock_irqsave() disables both preemption and hardware interrupts. On
RT-Linux it disables neither; the lock only disables migration. This leads to a
situation where, after the lock is acquired, a real-time task can preempt the
current task, and because the lock-holding period implies migrate_disable(),
other CPUs that go idle cannot pull the preempted kworker. The work item can
only continue once the real-time task releases the CPU, which produces the high
latency. This cannot be fixed by moving to a newer kernel: in recent kernels the
RT implementation of spin_lock_irqsave() is still defined in terms of
spin_lock(), which still disables neither preemption nor hardware interrupts and
still calls migrate_disable(). Therefore, simply upgrading the kernel does not
resolve the issue.
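A simplified sketch of the PREEMPT_RT mapping described above (see include/linux/spinlock_rt.h; details vary by kernel version, so treat this as an approximation rather than the exact kernel code):

```c
/* Approximation of what spin_lock_irqsave() does under PREEMPT_RT:
 * hardware interrupts stay enabled, the lock is a sleeping
 * rt_mutex-based lock, and only task migration is disabled. */
#define spin_lock_irqsave_rt_sketch(lock, flags)		\
	do {							\
		(flags) = 0;	/* IRQs are NOT disabled */	\
		rt_spin_lock(lock); /* sleeping lock; internally \
				     * calls migrate_disable() */ \
	} while (0)
```

So while the lock is held, the task is preemptible by RT tasks but pinned to its CPU, which is exactly the window in which the kworker gets stuck.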
> This should come from a hardware definition somewhere in your DT, not as
> a user-selectable option. And again, why not just tie it to the cpu
> where the irq came from automatically?
I don't think binding the work item to the CPU that handles the interrupt is
feasible, because in practice this hardware interrupt is distributed evenly
across all cores in our system. Moreover, the ftrace data we captured shows that
the IRQ handler thread that wakes the kworker threads on RT-Linux is also spread
across various CPUs by default.
Since the problem is still limited to the RT-Linux scenario, if possible I will
create this workqueue only when CONFIG_PREEMPT_RT is enabled in the next patch.
Setting WQ_SYSFS will allow user space to modify it dynamically. Additionally,
in tty_flip_buffer_push() I will check whether a private workqueue has been
created and, if so, queue the work on it.
With this change there is no behavioral difference on the standard kernel. On an
RT-Linux system, if user space does not modify the workqueue's CPU affinity and
priority, the effect is the same as without the patch. What do you think of this
approach?
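The fallback check described above could be sketched roughly like this ('tty_flip_wq' is an illustrative name, not the identifier from the actual patch):

```c
/* Sketch: prefer the private workqueue when one was created,
 * otherwise keep the existing system_unbound_wq behaviour. */
#include <linux/workqueue.h>

static struct workqueue_struct *tty_flip_wq; /* NULL if not created */

static void tty_flip_queue(struct work_struct *work)
{
	queue_work(tty_flip_wq ?: system_unbound_wq, work);
}
```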
--
Xin Zhao
* Re: [PATCH v1 2/2] serial: 8250_dma: add parameter to queue work on specific cpu
2025-10-22 14:47 ` Xin Zhao
@ 2025-10-22 16:37 ` Tejun Heo
2025-10-22 17:15 ` Xin Zhao
2025-10-22 16:59 ` Greg KH
1 sibling, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2025-10-22 16:37 UTC (permalink / raw)
To: Xin Zhao; +Cc: gregkh, hch, jirislaby, linux-kernel, linux-serial
On Wed, Oct 22, 2025 at 10:47:07PM +0800, Xin Zhao wrote:
> Considering the current situation is still limited to the RT-Linux scenario, if
> possible, I will add the logic to create this workqueue only when CONFIG_PREEMPT_RT
> is enabled in the next patch. By setting WQ_SYSFS, it will allow user space to dynamically
> modify it. Additionally, in tty_flip_buffer_push, I will check if a private workqueue
> has been created; if so, I will use the private workqueue to queue the work task.
Creating a workqueue isn't that expensive. Might as well just always create
a dedicated workqueue.
Thanks.
--
tejun
* Re: [PATCH v1 2/2] serial: 8250_dma: add parameter to queue work on specific cpu
2025-10-22 16:37 ` Tejun Heo
@ 2025-10-22 17:15 ` Xin Zhao
0 siblings, 0 replies; 13+ messages in thread
From: Xin Zhao @ 2025-10-22 17:15 UTC (permalink / raw)
To: tj; +Cc: gregkh, hch, jackzxcui1989, jirislaby, linux-kernel, linux-serial
On Wed, 22 Oct 2025 06:37:37 -1000 Tejun Heo <tj@kernel.org> wrote:
> > Considering the current situation is still limited to the RT-Linux scenario, if
> > possible, I will add the logic to create this workqueue only when CONFIG_PREEMPT_RT
> > is enabled in the next patch. By setting WQ_SYSFS, it will allow user space to dynamically
> > modify it. Additionally, in tty_flip_buffer_push, I will check if a private workqueue
> > has been created; if so, I will use the private workqueue to queue the work task.
>
> Creating a workqueue isn't that expensive. Might as well just always create
> a dedicated workqueue.
OK, I'll just always create one. Thanks.
--
Xin Zhao
* Re: [PATCH v1 2/2] serial: 8250_dma: add parameter to queue work on specific cpu
2025-10-22 14:47 ` Xin Zhao
2025-10-22 16:37 ` Tejun Heo
@ 2025-10-22 16:59 ` Greg KH
2025-10-22 17:20 ` Xin Zhao
1 sibling, 1 reply; 13+ messages in thread
From: Greg KH @ 2025-10-22 16:59 UTC (permalink / raw)
To: Xin Zhao; +Cc: hch, jirislaby, linux-kernel, linux-serial, tj
On Wed, Oct 22, 2025 at 10:47:07PM +0800, Xin Zhao wrote:
> > This should come from a hardware definition somewhere in your DT, not as
> > a user-selectable option. And again, why not just tie it to the cpu
> > where the irq came from automatically?
>
> I don't think binding the work task to the CPU that handles the interrupt is feasible,
> because, in practice, this hardware interrupt is evenly distributed across all cores
> in our system.
I suggest fixing that, that's ripe for lots of latency as cores hit
cache misses and the like. Learn from the networking people, you want
the cpu that handled the irq to handle the data processing too. They
learned that years ago.
> Moreover, from the ftrace data we captured, the IRQ handler thread that
> wakes up the kworker threads in RT-Linux is also distributed across various CPUs by
> default.
Again, don't do that, bind things to cpus that previously handled the
data if at all possible to avoid these latencies. That's what you are
trying to do here anyway, so you kind of have proof of that being a
viable solution :)
good luck!
greg k-h
* Re: [PATCH v1 2/2] serial: 8250_dma: add parameter to queue work on specific cpu
2025-10-22 16:59 ` Greg KH
@ 2025-10-22 17:20 ` Xin Zhao
0 siblings, 0 replies; 13+ messages in thread
From: Xin Zhao @ 2025-10-22 17:20 UTC (permalink / raw)
To: gregkh; +Cc: hch, jackzxcui1989, jirislaby, linux-kernel, linux-serial, tj
On Wed, 22 Oct 2025 18:59:05 +0200 Greg KH <gregkh@linuxfoundation.org> wrote:
> > > This should come from a hardware definition somewhere in your DT, not as
> > > a user-selectable option. And again, why not just tie it to the cpu
> > > where the irq came from automatically?
> >
> > I don't think binding the work task to the CPU that handles the interrupt is feasible,
> > because, in practice, this hardware interrupt is evenly distributed across all cores
> > in our system.
>
> I suggest fixing that, that's ripe for lots of latency as cores hit
> cache misses and the like. Learn from the networking people, you want
> the cpu that handled the irq to handle the data processing too. They
> learned that years ago.
>
> > Moreover, from the ftrace data we captured, the IRQ handler thread that
> > wakes up the kworker threads in RT-Linux is also distributed across various CPUs by
> > default.
>
> Again, don't do that, bind things to cpus that previously handled the
> data if at all possible to avoid these latencies. That's what you are
> trying to do here anyway, so you kind of have proof of that being a
> viable solution :)
Thank you for the suggestion. I will set up the corresponding interrupt affinity
binding on our system.
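One hedged sketch of such a binding, done from the driver side (the irq number and helper name here are illustrative; the same effect can be had from user space by writing a mask to /proc/irq/<N>/smp_affinity):

```c
/* Sketch: hint that the UART irq should run on the same core that
 * will process the flipped data, as suggested above. */
#include <linux/interrupt.h>
#include <linux/cpumask.h>

static int pin_uart_irq(unsigned int irq, int cpu)
{
	/* Advisory affinity hint; irq and cpu come from platform
	 * knowledge, e.g. the DT, not hard-coded values. */
	return irq_set_affinity_hint(irq, cpumask_of(cpu));
}
```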
--
Xin Zhao
end of thread, other threads:[~2025-10-22 17:21 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-21 8:39 [PATCH v1 2/2] serial: 8250_dma: add parameter to queue work on specific cpu Xin Zhao
2025-10-21 11:28 ` Greg KH
2025-10-21 13:05 ` Xin Zhao
2025-10-21 16:26 ` Greg KH
2025-10-21 16:40 ` Tejun Heo
2025-10-21 17:39 ` Xin Zhao
2025-10-21 18:50 ` Tejun Heo
2025-10-22 13:59 ` Xin Zhao
2025-10-22 14:47 ` Xin Zhao
2025-10-22 16:37 ` Tejun Heo
2025-10-22 17:15 ` Xin Zhao
2025-10-22 16:59 ` Greg KH
2025-10-22 17:20 ` Xin Zhao