* [PATCH 0/3] serial: qcom-geni: fix lockups
@ 2024-06-24 13:31 Johan Hovold
2024-06-24 13:31 ` [PATCH 1/3] serial: qcom-geni: fix hard lockup on buffer flush Johan Hovold
` (3 more replies)
0 siblings, 4 replies; 19+ messages in thread
From: Johan Hovold @ 2024-06-24 13:31 UTC (permalink / raw)
To: Greg Kroah-Hartman, Jiri Slaby
Cc: Konrad Dybcio, Douglas Anderson, Bjorn Andersson, linux-arm-msm,
linux-serial, linux-kernel, Johan Hovold
Since 6.10-rc1, Qualcomm machines with a serial port can easily lock up
hard, for example, when stopping a getty on reboot.
The first patch in this series fixes this severe regression by restoring
the pre-6.10-rc1 behaviour of printing additional characters when
flushing the tx buffer.
The second patch fixes a long-standing issue in the GENI driver which
can lead to a soft lock up when using software flow control and on
suspend.
The third patch, addresses the old issue with additional characters
being printing when flushing the buffer.
Note that timeouts used when clearing the tx fifo are a bit excessive
since I'm reusing the current qcom_geni_serial_poll_bit() helper for
now.
I think at least the first patch should be merged for rc6 while we
consider the best way forward to address the remaining issues.
Doug has posted an alternative series of fixes here that depends on
reworking the driver a fair bit here:
https://lore.kernel.org/lkml/20240610222515.3023730-1-dianders@chromium.org/
Johan
Johan Hovold (3):
serial: qcom-geni: fix hard lockup on buffer flush
serial: qcom-geni: fix soft lockup on sw flow control and suspend
serial: qcom-geni: fix garbage output after buffer flush
drivers/tty/serial/qcom_geni_serial.c | 36 +++++++++++++++++++--------
1 file changed, 25 insertions(+), 11 deletions(-)
--
2.44.1
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 1/3] serial: qcom-geni: fix hard lockup on buffer flush
2024-06-24 13:31 [PATCH 0/3] serial: qcom-geni: fix lockups Johan Hovold
@ 2024-06-24 13:31 ` Johan Hovold
2024-06-24 17:39 ` Doug Anderson
2024-06-24 13:31 ` [PATCH 2/3] serial: qcom-geni: fix soft lockup on sw flow control and suspend Johan Hovold
` (2 subsequent siblings)
3 siblings, 1 reply; 19+ messages in thread
From: Johan Hovold @ 2024-06-24 13:31 UTC (permalink / raw)
To: Greg Kroah-Hartman, Jiri Slaby
Cc: Konrad Dybcio, Douglas Anderson, Bjorn Andersson, linux-arm-msm,
linux-serial, linux-kernel, Johan Hovold
The Qualcomm GENI serial driver does not handle buffer flushing and used
to print garbage characters when the circular buffer was cleared. Since
commit 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo") this
instead results in a lockup due to qcom_geni_serial_send_chunk_fifo()
spinning indefinitely in the interrupt handler.
This is easily triggered by interrupting a command such as dmesg in a
serial console but can also happen when stopping a serial getty on
reboot.
Fix the immediate issue by printing NUL characters until the current TX
command has been completed.
Fixes: 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo")
Reported-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
---
drivers/tty/serial/qcom_geni_serial.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 2bd25afe0d92..1d5d6045879a 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -862,7 +862,7 @@ static void qcom_geni_serial_send_chunk_fifo(struct uart_port *uport,
memset(buf, 0, sizeof(buf));
tx_bytes = min(remaining, BYTES_PER_FIFO_WORD);
- tx_bytes = uart_fifo_out(uport, buf, tx_bytes);
+ uart_fifo_out(uport, buf, tx_bytes);
iowrite32_rep(uport->membase + SE_GENI_TX_FIFOn, buf, 1);
--
2.44.1
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 2/3] serial: qcom-geni: fix soft lockup on sw flow control and suspend
2024-06-24 13:31 [PATCH 0/3] serial: qcom-geni: fix lockups Johan Hovold
2024-06-24 13:31 ` [PATCH 1/3] serial: qcom-geni: fix hard lockup on buffer flush Johan Hovold
@ 2024-06-24 13:31 ` Johan Hovold
2024-06-24 21:23 ` Doug Anderson
2024-06-24 13:31 ` [PATCH 3/3] serial: qcom-geni: fix garbage output after buffer flush Johan Hovold
2024-07-03 14:09 ` [PATCH 0/3] serial: qcom-geni: fix lockups Greg Kroah-Hartman
3 siblings, 1 reply; 19+ messages in thread
From: Johan Hovold @ 2024-06-24 13:31 UTC (permalink / raw)
To: Greg Kroah-Hartman, Jiri Slaby
Cc: Konrad Dybcio, Douglas Anderson, Bjorn Andersson, linux-arm-msm,
linux-serial, linux-kernel, Johan Hovold, stable
The stop_tx() callback is used to implement software flow control and
must not discard data as the Qualcomm GENI driver is currently doing
when there is an active TX command.
Cancelling an active command can also leave data in the hardware FIFO,
which prevents the watermark interrupt from being enabled when TX is
later restarted. This results in a soft lockup and is easily triggered
by stopping TX using software flow control in a serial console but this
can also happen after suspend.
Fix this by only stopping any active command, and effectively clearing
the hardware fifo, when shutting down the port. Make sure to temporarily
raise the watermark level so that the interrupt fires when TX is
restarted.
Fixes: c4f528795d1a ("tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP")
Cc: stable@vger.kernel.org # 4.17
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
---
drivers/tty/serial/qcom_geni_serial.c | 28 +++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 1d5d6045879a..72addeb9f461 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -651,13 +651,8 @@ static void qcom_geni_serial_start_tx_fifo(struct uart_port *uport)
{
u32 irq_en;
- if (qcom_geni_serial_main_active(uport) ||
- !qcom_geni_serial_tx_empty(uport))
- return;
-
irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
irq_en |= M_TX_FIFO_WATERMARK_EN | M_CMD_DONE_EN;
-
writel(DEF_TX_WM, uport->membase + SE_GENI_TX_WATERMARK_REG);
writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
}
@@ -665,16 +660,28 @@ static void qcom_geni_serial_start_tx_fifo(struct uart_port *uport)
static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
{
u32 irq_en;
- struct qcom_geni_serial_port *port = to_dev_port(uport);
irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
irq_en &= ~(M_CMD_DONE_EN | M_TX_FIFO_WATERMARK_EN);
writel(0, uport->membase + SE_GENI_TX_WATERMARK_REG);
writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
- /* Possible stop tx is called multiple times. */
+}
+
+static void qcom_geni_serial_clear_tx_fifo(struct uart_port *uport)
+{
+ struct qcom_geni_serial_port *port = to_dev_port(uport);
+
if (!qcom_geni_serial_main_active(uport))
return;
+ /*
+ * Increase watermark level so that TX can be restarted and wait for
+ * sequencer to start to prevent lockups.
+ */
+ writel(port->tx_fifo_depth, uport->membase + SE_GENI_TX_WATERMARK_REG);
+ qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
+ M_TX_FIFO_WATERMARK_EN, true);
+
geni_se_cancel_m_cmd(&port->se);
if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
M_CMD_CANCEL_EN, true)) {
@@ -684,6 +691,8 @@ static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
writel(M_CMD_ABORT_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
}
writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
+
+ port->tx_remaining = 0;
}
static void qcom_geni_serial_handle_rx_fifo(struct uart_port *uport, bool drop)
@@ -1069,11 +1078,10 @@ static void qcom_geni_serial_shutdown(struct uart_port *uport)
{
disable_irq(uport->irq);
- if (uart_console(uport))
- return;
-
qcom_geni_serial_stop_tx(uport);
qcom_geni_serial_stop_rx(uport);
+
+ qcom_geni_serial_clear_tx_fifo(uport);
}
static int qcom_geni_serial_port_setup(struct uart_port *uport)
--
2.44.1
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 3/3] serial: qcom-geni: fix garbage output after buffer flush
2024-06-24 13:31 [PATCH 0/3] serial: qcom-geni: fix lockups Johan Hovold
2024-06-24 13:31 ` [PATCH 1/3] serial: qcom-geni: fix hard lockup on buffer flush Johan Hovold
2024-06-24 13:31 ` [PATCH 2/3] serial: qcom-geni: fix soft lockup on sw flow control and suspend Johan Hovold
@ 2024-06-24 13:31 ` Johan Hovold
2024-06-24 22:19 ` Doug Anderson
2024-07-03 14:09 ` [PATCH 0/3] serial: qcom-geni: fix lockups Greg Kroah-Hartman
3 siblings, 1 reply; 19+ messages in thread
From: Johan Hovold @ 2024-06-24 13:31 UTC (permalink / raw)
To: Greg Kroah-Hartman, Jiri Slaby
Cc: Konrad Dybcio, Douglas Anderson, Bjorn Andersson, linux-arm-msm,
linux-serial, linux-kernel, Johan Hovold, stable
The Qualcomm GENI serial driver does not handle buffer flushing and
outputs garbage (or NUL) characters for the remainder of any active TX
command after the write buffer has been cleared.
Implement the flush_buffer() callback and use it to cancel any active TX
command when the write buffer has been emptied.
Fixes: a1fee899e5be ("tty: serial: qcom_geni_serial: Fix softlock")
Cc: stable@vger.kernel.org # 5.0
Reported-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
---
drivers/tty/serial/qcom_geni_serial.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 72addeb9f461..5fbb72f1c0c7 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -1084,6 +1084,11 @@ static void qcom_geni_serial_shutdown(struct uart_port *uport)
qcom_geni_serial_clear_tx_fifo(uport);
}
+static void qcom_geni_serial_flush_buffer(struct uart_port *uport)
+{
+ qcom_geni_serial_clear_tx_fifo(uport);
+}
+
static int qcom_geni_serial_port_setup(struct uart_port *uport)
{
struct qcom_geni_serial_port *port = to_dev_port(uport);
@@ -1540,6 +1545,7 @@ static const struct uart_ops qcom_geni_console_pops = {
.request_port = qcom_geni_serial_request_port,
.config_port = qcom_geni_serial_config_port,
.shutdown = qcom_geni_serial_shutdown,
+ .flush_buffer = qcom_geni_serial_flush_buffer,
.type = qcom_geni_serial_get_type,
.set_mctrl = qcom_geni_serial_set_mctrl,
.get_mctrl = qcom_geni_serial_get_mctrl,
--
2.44.1
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] serial: qcom-geni: fix hard lockup on buffer flush
2024-06-24 13:31 ` [PATCH 1/3] serial: qcom-geni: fix hard lockup on buffer flush Johan Hovold
@ 2024-06-24 17:39 ` Doug Anderson
2024-06-24 20:45 ` Doug Anderson
2024-06-25 14:40 ` Johan Hovold
0 siblings, 2 replies; 19+ messages in thread
From: Doug Anderson @ 2024-06-24 17:39 UTC (permalink / raw)
To: Johan Hovold
Cc: Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio, Bjorn Andersson,
linux-arm-msm, linux-serial, linux-kernel
Hi,
On Mon, Jun 24, 2024 at 6:31 AM Johan Hovold <johan+linaro@kernel.org> wrote:
>
> The Qualcomm GENI serial driver does not handle buffer flushing and used
> to print garbage characters when the circular buffer was cleared. Since
> commit 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo") this
> instead results in a lockup due to qcom_geni_serial_send_chunk_fifo()
> spinning indefinitely in the interrupt handler.
>
> This is easily triggered by interrupting a command such as dmesg in a
> serial console but can also happen when stopping a serial getty on
> reboot.
>
> Fix the immediate issue by printing NUL characters until the current TX
> command has been completed.
>
> Fixes: 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo")
> Reported-by: Douglas Anderson <dianders@chromium.org>
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---
> drivers/tty/serial/qcom_geni_serial.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
I don't love this, though it's better than a hard lockup. I will note
that it doesn't exactly restore the old behavior which would have
(most likely) continued to output data that had previously been in the
FIFO but that had been cancelled.
...actually, if we're looking for a short term fix that mimics the old
behavior more closely, what would you think about having a
driver-local buffer that we fill when we kick off the transfer. Then
the data can't go away from underneath us. It's an extra copy, but
it's just a memory-to-memory copy which is much faster than the MMIO
copy we'll eventually need to do anyway... This local buffer would
essentially act as a larger FIFO.
You could choose the local buffer size to balance being able to cancel
quickly vs. using the FIFO efficiently.
-Doug
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] serial: qcom-geni: fix hard lockup on buffer flush
2024-06-24 17:39 ` Doug Anderson
@ 2024-06-24 20:45 ` Doug Anderson
2024-06-25 14:53 ` Johan Hovold
2024-06-25 14:40 ` Johan Hovold
1 sibling, 1 reply; 19+ messages in thread
From: Doug Anderson @ 2024-06-24 20:45 UTC (permalink / raw)
To: Johan Hovold
Cc: Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio, Bjorn Andersson,
linux-arm-msm, linux-serial, linux-kernel
Hi,
On Mon, Jun 24, 2024 at 10:39 AM Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Mon, Jun 24, 2024 at 6:31 AM Johan Hovold <johan+linaro@kernel.org> wrote:
> >
> > The Qualcomm GENI serial driver does not handle buffer flushing and used
> > to print garbage characters when the circular buffer was cleared. Since
> > commit 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo") this
> > instead results in a lockup due to qcom_geni_serial_send_chunk_fifo()
> > spinning indefinitely in the interrupt handler.
> >
> > This is easily triggered by interrupting a command such as dmesg in a
> > serial console but can also happen when stopping a serial getty on
> > reboot.
> >
> > Fix the immediate issue by printing NUL characters until the current TX
> > command has been completed.
> >
> > Fixes: 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo")
> > Reported-by: Douglas Anderson <dianders@chromium.org>
> > Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> > ---
> > drivers/tty/serial/qcom_geni_serial.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
>
> I don't love this, though it's better than a hard lockup. I will note
> that it doesn't exactly restore the old behavior which would have
> (most likely) continued to output data that had previously been in the
> FIFO but that had been cancelled.
>
> ...actually, if we're looking for a short term fix that mimics the old
> behavior more closely, what would you think about having a
> driver-local buffer that we fill when we kick off the transfer. Then
> the data can't go away from underneath us. It's an extra copy, but
> it's just a memory-to-memory copy which is much faster than the MMIO
> copy we'll eventually need to do anyway... This local buffer would
> essentially act as a larger FIFO.
>
> You could choose the local buffer size to balance being able to cancel
> quickly vs. using the FIFO efficiently.
Also: if we're looking at quick/easy to land and just fix the hard
lockup, I'd vote for this (I can send a real patch, though I'm about
to go on vacation):
--
@@ -904,8 +904,8 @@ static void qcom_geni_serial_handle_tx_fifo(struct
uart_port *uport,
goto out_write_wakeup;
if (!port->tx_remaining) {
- qcom_geni_serial_setup_tx(uport, pending);
- port->tx_remaining = pending;
+ port->tx_remaining = min(avail, pending);
+ qcom_geni_serial_setup_tx(uport, port->tx_remaining);
irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
if (!(irq_en & M_TX_FIFO_WATERMARK_EN))
--
That will fix the hard lockup, is short and sweet, and also doesn't
end up outputting NUL bytes.
I measured time with that. I've been testing with a file I created
called "alphabet.txt" that just contains the letters A-Z repeated 3
times followed by a "\n", over and over again. I think gmail will kill
me with word wrapping, but basically:
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
...
...
FWIW:
head -200 /var/alphabet.txt | wc
200 200 15800
Before my patch I ran `time head -200 /var/alphabet.txt` and I got:
real 0m1.386s
After my patch I ran the same thing and got:
real 0m1.409s
So it's slower, but that's not 25% slower. I get 1.7% slower:
In [6]: (1.409 - 1.386) / 1.386 * 100
Out[6]: 1.659451659451669
IMO that seems like a fine slowdown in order to avoid printing NUL bytes.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] serial: qcom-geni: fix soft lockup on sw flow control and suspend
2024-06-24 13:31 ` [PATCH 2/3] serial: qcom-geni: fix soft lockup on sw flow control and suspend Johan Hovold
@ 2024-06-24 21:23 ` Doug Anderson
2024-06-24 21:58 ` Doug Anderson
2024-06-26 7:42 ` Johan Hovold
0 siblings, 2 replies; 19+ messages in thread
From: Doug Anderson @ 2024-06-24 21:23 UTC (permalink / raw)
To: Johan Hovold
Cc: Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio, Bjorn Andersson,
linux-arm-msm, linux-serial, linux-kernel, stable
Hi,
On Mon, Jun 24, 2024 at 6:31 AM Johan Hovold <johan+linaro@kernel.org> wrote:
>
> The stop_tx() callback is used to implement software flow control and
> must not discard data as the Qualcomm GENI driver is currently doing
> when there is an active TX command.
>
> Cancelling an active command can also leave data in the hardware FIFO,
> which prevents the watermark interrupt from being enabled when TX is
> later restarted. This results in a soft lockup and is easily triggered
> by stopping TX using software flow control in a serial console but this
> can also happen after suspend.
>
> Fix this by only stopping any active command, and effectively clearing
> the hardware fifo, when shutting down the port. Make sure to temporarily
> raise the watermark level so that the interrupt fires when TX is
> restarted.
Nice! I did quite a few experiments, but it sounds like you found
something that I wasn't able to find. Specifically once I cancelled an
ongoing command I could never manage to get it started back up, but it
must have just been that data was still in the FIFO and thus the
watermark never fired again.
When I was experimenting, I also swore that there were cases where
geni would sometimes fully drop bytes when I tried to "cancel" a
command, but maybe I was mistaken. Everything I figured out was
essentially by running experiments and I could easily have had a bug
in my experiment.
> Fixes: c4f528795d1a ("tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP")
> Cc: stable@vger.kernel.org # 4.17
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---
> drivers/tty/serial/qcom_geni_serial.c | 28 +++++++++++++++++----------
> 1 file changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
> index 1d5d6045879a..72addeb9f461 100644
> --- a/drivers/tty/serial/qcom_geni_serial.c
> +++ b/drivers/tty/serial/qcom_geni_serial.c
> @@ -651,13 +651,8 @@ static void qcom_geni_serial_start_tx_fifo(struct uart_port *uport)
> {
> u32 irq_en;
>
> - if (qcom_geni_serial_main_active(uport) ||
> - !qcom_geni_serial_tx_empty(uport))
> - return;
> -
> irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
> irq_en |= M_TX_FIFO_WATERMARK_EN | M_CMD_DONE_EN;
> -
> writel(DEF_TX_WM, uport->membase + SE_GENI_TX_WATERMARK_REG);
> writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
> }
> @@ -665,16 +660,28 @@ static void qcom_geni_serial_start_tx_fifo(struct uart_port *uport)
> static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
> {
> u32 irq_en;
> - struct qcom_geni_serial_port *port = to_dev_port(uport);
>
> irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
> irq_en &= ~(M_CMD_DONE_EN | M_TX_FIFO_WATERMARK_EN);
> writel(0, uport->membase + SE_GENI_TX_WATERMARK_REG);
> writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
> - /* Possible stop tx is called multiple times. */
If qcom_geni_serial_stop_tx_fifo() is supposed to be used for UART
flow control and you have a way to stop the transfer immediately
without losing data (by using geni_se_cancel_m_cmd), maybe we should
do that? If the other side wants us to stop transferring data and we
can stop it right away that would be ideal...
> +}
> +
> +static void qcom_geni_serial_clear_tx_fifo(struct uart_port *uport)
> +{
> + struct qcom_geni_serial_port *port = to_dev_port(uport);
> +
> if (!qcom_geni_serial_main_active(uport))
> return;
>
> + /*
> + * Increase watermark level so that TX can be restarted and wait for
> + * sequencer to start to prevent lockups.
> + */
> + writel(port->tx_fifo_depth, uport->membase + SE_GENI_TX_WATERMARK_REG);
> + qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> + M_TX_FIFO_WATERMARK_EN, true);
Oh, maybe this "wait for sequencer to start to prevent lockups." is
the part that I was missing? Can you explain more about what's going
on here? Why does waiting for the watermark interrupt to fire prevent
lockups? I would have imagined that the watermark interrupt would be
part of the geni hardware and have nothing to do with the firmware
running on the other end, so I'm not sure why it firing somehow would
prevent a lockup. Was this just by trial and error?
> @@ -684,6 +691,8 @@ static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
> writel(M_CMD_ABORT_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
> }
> writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
> +
> + port->tx_remaining = 0;
> }
>
> static void qcom_geni_serial_handle_rx_fifo(struct uart_port *uport, bool drop)
> @@ -1069,11 +1078,10 @@ static void qcom_geni_serial_shutdown(struct uart_port *uport)
> {
> disable_irq(uport->irq);
>
> - if (uart_console(uport))
> - return;
Can you explain this part of the patch? I'm not saying it's wrong to
remove this special case since this driver seems to have lots of
needless special cases that are already handled by the core or by
other parts of the driver, but this change seems unrelated to the rest
of the patch. Could it be a separate patch?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] serial: qcom-geni: fix soft lockup on sw flow control and suspend
2024-06-24 21:23 ` Doug Anderson
@ 2024-06-24 21:58 ` Doug Anderson
2024-06-26 7:54 ` Johan Hovold
2024-06-26 7:42 ` Johan Hovold
1 sibling, 1 reply; 19+ messages in thread
From: Doug Anderson @ 2024-06-24 21:58 UTC (permalink / raw)
To: Johan Hovold
Cc: Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio, Bjorn Andersson,
linux-arm-msm, linux-serial, linux-kernel, stable
Hi,
On Mon, Jun 24, 2024 at 2:23 PM Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Mon, Jun 24, 2024 at 6:31 AM Johan Hovold <johan+linaro@kernel.org> wrote:
> >
> > The stop_tx() callback is used to implement software flow control and
> > must not discard data as the Qualcomm GENI driver is currently doing
> > when there is an active TX command.
> >
> > Cancelling an active command can also leave data in the hardware FIFO,
> > which prevents the watermark interrupt from being enabled when TX is
> > later restarted. This results in a soft lockup and is easily triggered
> > by stopping TX using software flow control in a serial console but this
> > can also happen after suspend.
> >
> > Fix this by only stopping any active command, and effectively clearing
> > the hardware fifo, when shutting down the port. Make sure to temporarily
> > raise the watermark level so that the interrupt fires when TX is
> > restarted.
>
> Nice! I did quite a few experiments, but it sounds like you found
> something that I wasn't able to find. Specifically once I cancelled an
> ongoing command I could never manage to get it started back up, but it
> must have just been that data was still in the FIFO and thus the
> watermark never fired again.
>
> When I was experimenting, I also swore that there were cases where
> geni would sometimes fully drop bytes when I tried to "cancel" a
> command, but maybe I was mistaken. Everything I figured out was
> essentially by running experiments and I could easily have had a bug
> in my experiment.
>
>
> > Fixes: c4f528795d1a ("tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP")
> > Cc: stable@vger.kernel.org # 4.17
> > Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> > ---
> > drivers/tty/serial/qcom_geni_serial.c | 28 +++++++++++++++++----------
> > 1 file changed, 18 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
> > index 1d5d6045879a..72addeb9f461 100644
> > --- a/drivers/tty/serial/qcom_geni_serial.c
> > +++ b/drivers/tty/serial/qcom_geni_serial.c
> > @@ -651,13 +651,8 @@ static void qcom_geni_serial_start_tx_fifo(struct uart_port *uport)
> > {
> > u32 irq_en;
> >
> > - if (qcom_geni_serial_main_active(uport) ||
> > - !qcom_geni_serial_tx_empty(uport))
> > - return;
> > -
> > irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
> > irq_en |= M_TX_FIFO_WATERMARK_EN | M_CMD_DONE_EN;
> > -
> > writel(DEF_TX_WM, uport->membase + SE_GENI_TX_WATERMARK_REG);
> > writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
> > }
> > @@ -665,16 +660,28 @@ static void qcom_geni_serial_start_tx_fifo(struct uart_port *uport)
> > static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
> > {
> > u32 irq_en;
> > - struct qcom_geni_serial_port *port = to_dev_port(uport);
> >
> > irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
> > irq_en &= ~(M_CMD_DONE_EN | M_TX_FIFO_WATERMARK_EN);
> > writel(0, uport->membase + SE_GENI_TX_WATERMARK_REG);
> > writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
> > - /* Possible stop tx is called multiple times. */
>
> If qcom_geni_serial_stop_tx_fifo() is supposed to be used for UART
> flow control and you have a way to stop the transfer immediately
> without losing data (by using geni_se_cancel_m_cmd), maybe we should
> do that? If the other side wants us to stop transferring data and we
> can stop it right away that would be ideal...
>
>
> > +}
> > +
> > +static void qcom_geni_serial_clear_tx_fifo(struct uart_port *uport)
> > +{
> > + struct qcom_geni_serial_port *port = to_dev_port(uport);
> > +
> > if (!qcom_geni_serial_main_active(uport))
> > return;
> >
> > + /*
> > + * Increase watermark level so that TX can be restarted and wait for
> > + * sequencer to start to prevent lockups.
> > + */
> > + writel(port->tx_fifo_depth, uport->membase + SE_GENI_TX_WATERMARK_REG);
> > + qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> > + M_TX_FIFO_WATERMARK_EN, true);
>
> Oh, maybe this "wait for sequencer to start to prevent lockups." is
> the part that I was missing? Can you explain more about what's going
> on here? Why does waiting for the watermark interrupt to fire prevent
> lockups? I would have imagined that the watermark interrupt would be
> part of the geni hardware and have nothing to do with the firmware
> running on the other end, so I'm not sure why it firing somehow would
> prevent a lockup. Was this just by trial and error?
Actually, the more I look at it the more confused I am about your
qcom_geni_serial_clear_tx_fifo(). Can you explain and maybe add some
inline comments in the function since it's not obvious? Specifically,
things I'm confused about with your patch:
1. The function is named qcom_geni_serial_clear_tx_fifo() which
implies that when it finishes that the hardware FIFO will have nothing
in it. ...but how does your code ensure this?
2. If the function is really clearing the FIFOs then why do we need to
adjust the watermark level? The fact that you need to adjust the
watermark levels implies (to me) that there are things stuck in the
FIFO still. ...but then what happens to those characters? When are
they sent?
3. On my hardware you're setting the FIFO level to 16 here. The docs I
have say that if the FIFO level is "less than" the value you set here
then the interrupt will go off and further clarifies that if you set
the register to 1 here then you'll get interrupted when the FIFO is
empty. So what happens with your solution if the FIFO is completely
full? In that case you'd have to set this to 17, right? ...but then I
could believe that might confuse the interrupt handler which would get
told to start transmitting when there is no room for anything.
Maybe something is missing in my mental model here and testing your
patch and hitting Ctrl-C seems to work, but I don't really understand
why so hopefully you can clarify! :-)
-Doug
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] serial: qcom-geni: fix garbage output after buffer flush
2024-06-24 13:31 ` [PATCH 3/3] serial: qcom-geni: fix garbage output after buffer flush Johan Hovold
@ 2024-06-24 22:19 ` Doug Anderson
2024-06-26 8:01 ` Johan Hovold
0 siblings, 1 reply; 19+ messages in thread
From: Doug Anderson @ 2024-06-24 22:19 UTC (permalink / raw)
To: Johan Hovold
Cc: Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio, Bjorn Andersson,
linux-arm-msm, linux-serial, linux-kernel, stable
Hi,
On Mon, Jun 24, 2024 at 6:31 AM Johan Hovold <johan+linaro@kernel.org> wrote:
>
> The Qualcomm GENI serial driver does not handle buffer flushing and
> outputs garbage (or NUL) characters for the remainder of any active TX
> command after the write buffer has been cleared.
>
> Implement the flush_buffer() callback and use it to cancel any active TX
> command when the write buffer has been emptied.
I could be reading it wrong, but in the kernel-doc of `struct
tty_ldisc_ops` it seems to indicate that flush_buffer() is for the
other direction. Specifically, it says:
This function instructs the line discipline to clear its buffers of
any input characters it may have queued to be delivered to the user
mode process.
It's hard to figure out which direction that matches to, but looking
at the descriptions of "read" and "write" makes me believe that it's
supposed to flush characters that have been read by the UART, not
characters that are being written out to the UART. Maybe I'm
misunderstanding or the kernel doc is wrong/incomplete?
I guess the underlying worry I have is that there's no guarantee that
the flush function will be called when the kfifo loses bytes. If it
ever happens we'll fall back to writing NUL bytes out and that doesn't
seem amazing to me. To me it feels like
qcom_geni_serial_send_chunk_fifo() should detect this situation and
then it should be responsible for canceling, though better (in my
mind) is if we never initiate any big transfers if we can get away
with that and still be performant.
-Doug
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] serial: qcom-geni: fix hard lockup on buffer flush
2024-06-24 17:39 ` Doug Anderson
2024-06-24 20:45 ` Doug Anderson
@ 2024-06-25 14:40 ` Johan Hovold
2024-07-04 9:59 ` Johan Hovold
1 sibling, 1 reply; 19+ messages in thread
From: Johan Hovold @ 2024-06-25 14:40 UTC (permalink / raw)
To: Doug Anderson
Cc: Johan Hovold, Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio,
Bjorn Andersson, linux-arm-msm, linux-serial, linux-kernel
On Mon, Jun 24, 2024 at 10:39:07AM -0700, Doug Anderson wrote:
> On Mon, Jun 24, 2024 at 6:31 AM Johan Hovold <johan+linaro@kernel.org> wrote:
> >
> > The Qualcomm GENI serial driver does not handle buffer flushing and used
> > to print garbage characters when the circular buffer was cleared. Since
> > commit 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo") this
> > instead results in a lockup due to qcom_geni_serial_send_chunk_fifo()
> > spinning indefinitely in the interrupt handler.
> >
> > This is easily triggered by interrupting a command such as dmesg in a
> > serial console but can also happen when stopping a serial getty on
> > reboot.
> >
> > Fix the immediate issue by printing NUL characters until the current TX
> > command has been completed.
> >
> > Fixes: 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo")
> > Reported-by: Douglas Anderson <dianders@chromium.org>
> > Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> > ---
> > drivers/tty/serial/qcom_geni_serial.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
>
> I don't love this, though it's better than a hard lockup. I will note
> that it doesn't exactly restore the old behavior which would have
> (most likely) continued to output data that had previously been in the
> FIFO but that had been cancelled.
Ah, yes, you're right. I went back and compared with 6.9 and the effect
was indeed (often) that the machine felt sluggish when you hit ctrl-c to
interrupt something like dmesg and the driver would continue to print up
to 4k characters after that (e.g. 350 ms at 115200).
The idea here was to fix the lockup regression separately and then have
the third patch address the buffer flush failure, which could also be
backported without depending on the kfifo conversion.
But running with this series since yesterday, I realise there are still
some unresolved interaction with the console code, which can now trigger
a soft (instead of hard) lockup on reboot...
> ...actually, if we're looking for a short term fix that mimics the old
> behavior more closely, what would you think about having a
> driver-local buffer that we fill when we kick off the transfer. Then
> the data can't go away from underneath us. It's an extra copy, but
> it's just a memory-to-memory copy which is much faster than the MMIO
> copy we'll eventually need to do anyway... This local buffer would
> essentially act as a larger FIFO.
The idea did cross my mind, three levels of fifo...
> You could choose the local buffer size to balance being able to cancel
> quickly vs. using the FIFO efficiently.
Yeah, perhaps adding a smaller driver kfifo would work, but not sure how
clean it would be to implement.
Johan
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] serial: qcom-geni: fix hard lockup on buffer flush
2024-06-24 20:45 ` Doug Anderson
@ 2024-06-25 14:53 ` Johan Hovold
2024-06-25 16:27 ` Doug Anderson
0 siblings, 1 reply; 19+ messages in thread
From: Johan Hovold @ 2024-06-25 14:53 UTC (permalink / raw)
To: Doug Anderson
Cc: Johan Hovold, Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio,
Bjorn Andersson, linux-arm-msm, linux-serial, linux-kernel
On Mon, Jun 24, 2024 at 01:45:17PM -0700, Doug Anderson wrote:
> Also: if we're looking at quick/easy to land and just fix the hard
> lockup, I'd vote for this (I can send a real patch, though I'm about
> to go on vacation):
>
> --
>
> @@ -904,8 +904,8 @@ static void qcom_geni_serial_handle_tx_fifo(struct
> uart_port *uport,
> goto out_write_wakeup;
>
> if (!port->tx_remaining) {
> - qcom_geni_serial_setup_tx(uport, pending);
> - port->tx_remaining = pending;
> + port->tx_remaining = min(avail, pending);
> + qcom_geni_serial_setup_tx(uport, port->tx_remaining);
>
> irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
> if (!(irq_en & M_TX_FIFO_WATERMARK_EN))
>
> --
>
> That will fix the hard lockup, is short and sweet, and also doesn't
> end up outputting NUL bytes.
Yeah, this might be a good stop gap even if performance suffers.
> I measured time with that. I've been testing with a file I created
> called "alphabet.txt" that just contains the letters A-Z repeated 3
> times followed by a "\n", over and over again. I think gmail will kill
> me with word wrapping, but basically:
> head -200 /var/alphabet.txt | wc
> 200 200 15800
>
> Before my patch I ran `time head -200 /var/alphabet.txt` and I got:
>
> real 0m1.386s
>
> After my patch I ran the same thing and got:
>
> real 0m1.409s
>
> So it's slower, but that's not 25% slower. I get 1.7% slower:
>
> In [6]: (1.409 - 1.386) / 1.386 * 100
> Out[6]: 1.659451659451669
>
> IMO that seems like a fine slowdown in order to avoid printing NUL bytes.
With my 500K dmesg file test I see a similar performance drop as with
your full series even if seems to behave slightly better (e.g. 20% drop
instead of 24%).
Johan
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] serial: qcom-geni: fix hard lockup on buffer flush
2024-06-25 14:53 ` Johan Hovold
@ 2024-06-25 16:27 ` Doug Anderson
0 siblings, 0 replies; 19+ messages in thread
From: Doug Anderson @ 2024-06-25 16:27 UTC (permalink / raw)
To: Johan Hovold
Cc: Johan Hovold, Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio,
Bjorn Andersson, linux-arm-msm, linux-serial, linux-kernel
Hi,
On Tue, Jun 25, 2024 at 7:53 AM Johan Hovold <johan@kernel.org> wrote:
>
> On Mon, Jun 24, 2024 at 01:45:17PM -0700, Doug Anderson wrote:
>
> > Also: if we're looking at quick/easy to land and just fix the hard
> > lockup, I'd vote for this (I can send a real patch, though I'm about
> > to go on vacation):
> >
> > --
> >
> > @@ -904,8 +904,8 @@ static void qcom_geni_serial_handle_tx_fifo(struct
> > uart_port *uport,
> > goto out_write_wakeup;
> >
> > if (!port->tx_remaining) {
> > - qcom_geni_serial_setup_tx(uport, pending);
> > - port->tx_remaining = pending;
> > + port->tx_remaining = min(avail, pending);
> > + qcom_geni_serial_setup_tx(uport, port->tx_remaining);
> >
> > irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
> > if (!(irq_en & M_TX_FIFO_WATERMARK_EN))
> >
> > --
> >
> > That will fix the hard lockup, is short and sweet, and also doesn't
> > end up outputting NUL bytes.
>
> Yeah, this might be a good stop gap even if performance suffers.
I've officially posted this as:
https://lore.kernel.org/r/20240625092440.1.Icf914852be911b95aefa9d798b6f1cd1a180f985@changeid
I realized that I didn't need to re-calculate "chunk" so the patch is
very slightly different than I posted above but should be effectively
the same.
-Doug
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] serial: qcom-geni: fix soft lockup on sw flow control and suspend
2024-06-24 21:23 ` Doug Anderson
2024-06-24 21:58 ` Doug Anderson
@ 2024-06-26 7:42 ` Johan Hovold
1 sibling, 0 replies; 19+ messages in thread
From: Johan Hovold @ 2024-06-26 7:42 UTC (permalink / raw)
To: Doug Anderson
Cc: Johan Hovold, Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio,
Bjorn Andersson, linux-arm-msm, linux-serial, linux-kernel,
stable
On Mon, Jun 24, 2024 at 02:23:52PM -0700, Doug Anderson wrote:
> On Mon, Jun 24, 2024 at 6:31 AM Johan Hovold <johan+linaro@kernel.org> wrote:
> > @@ -665,16 +660,28 @@ static void qcom_geni_serial_start_tx_fifo(struct uart_port *uport)
> > static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
> > {
> > u32 irq_en;
> > - struct qcom_geni_serial_port *port = to_dev_port(uport);
> >
> > irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
> > irq_en &= ~(M_CMD_DONE_EN | M_TX_FIFO_WATERMARK_EN);
> > writel(0, uport->membase + SE_GENI_TX_WATERMARK_REG);
> > writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
> > - /* Possible stop tx is called multiple times. */
>
> If qcom_geni_serial_stop_tx_fifo() is supposed to be used for UART
> flow control and you have a way to stop the transfer immediately
> without losing data (by using geni_se_cancel_m_cmd), maybe we should
> do that? If the other side wants us to stop transferring data and we
> can stop it right away that would be ideal...
Right, but since cancelling commands seems fragile at best (e.g.
potentially lost data, lockups) it seems best to just let the fifo
drain. But sure, if we can get cancel and restart to work reliably
eventually then even better.
> > +}
> > +
> > +static void qcom_geni_serial_clear_tx_fifo(struct uart_port *uport)
> > +{
> > + struct qcom_geni_serial_port *port = to_dev_port(uport);
> > +
> > if (!qcom_geni_serial_main_active(uport))
> > return;
> >
> > + /*
> > + * Increase watermark level so that TX can be restarted and wait for
> > + * sequencer to start to prevent lockups.
> > + */
> > + writel(port->tx_fifo_depth, uport->membase + SE_GENI_TX_WATERMARK_REG);
> > + qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> > + M_TX_FIFO_WATERMARK_EN, true);
>
> Oh, maybe this "wait for sequencer to start to prevent lockups." is
> the part that I was missing? Can you explain more about what's going
> on here? Why does waiting for the watermark interrupt to fire prevent
> lockups? I would have imagined that the watermark interrupt would be
> part of the geni hardware and have nothing to do with the firmware
> running on the other end, so I'm not sure why it firing somehow would
> prevent a lockup. Was this just by trial and error?
Yes, I saw two kinds of lockups in my experiments. The first was due to
data being left in the fifo so that the watermark interrupt never fired
on start_tx(), but there was one more case where it seemed like the hw
would get stuck if a cancel command was issues immediately after a new
command had been started.
Waiting for one character to be sent to avoid that race and seems to
address the latter hang.
Note that I hit this also when never filling the FIFO completely (e.g.
so that a watermark of 16 should have fired as there were never more
than 15 words in the fifo).
> > @@ -684,6 +691,8 @@ static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
> > writel(M_CMD_ABORT_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
> > }
> > writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
> > +
> > + port->tx_remaining = 0;
> > }
> >
> > static void qcom_geni_serial_handle_rx_fifo(struct uart_port *uport, bool drop)
> > @@ -1069,11 +1078,10 @@ static void qcom_geni_serial_shutdown(struct uart_port *uport)
> > {
> > disable_irq(uport->irq);
> >
> > - if (uart_console(uport))
> > - return;
>
> Can you explain this part of the patch? I'm not saying it's wrong to
> remove this special case since this driver seems to have lots of
> needless special cases that are already handled by the core or by
> other parts of the driver, but this change seems unrelated to the rest
> of the patch. Could it be a separate patch?
We need to stop tx and clear the FIFO also when the port is used as a
console.
I added back the above check in commit 9aff74cc4e9e ("serial: qcom-geni:
fix console shutdown hang") as a quick way to work around a previous
regression where we would hit this soft lockup. With the issue fixed,
the workaround is no longer needed.
Johan
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] serial: qcom-geni: fix soft lockup on sw flow control and suspend
2024-06-24 21:58 ` Doug Anderson
@ 2024-06-26 7:54 ` Johan Hovold
2024-07-04 10:08 ` Johan Hovold
0 siblings, 1 reply; 19+ messages in thread
From: Johan Hovold @ 2024-06-26 7:54 UTC (permalink / raw)
To: Doug Anderson
Cc: Johan Hovold, Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio,
Bjorn Andersson, linux-arm-msm, linux-serial, linux-kernel,
stable
On Mon, Jun 24, 2024 at 02:58:39PM -0700, Doug Anderson wrote:
> On Mon, Jun 24, 2024 at 2:23 PM Doug Anderson <dianders@chromium.org> wrote:
> > On Mon, Jun 24, 2024 at 6:31 AM Johan Hovold <johan+linaro@kernel.org> wrote:
> > > +static void qcom_geni_serial_clear_tx_fifo(struct uart_port *uport)
> > > +{
> > > + struct qcom_geni_serial_port *port = to_dev_port(uport);
> > > +
> > > if (!qcom_geni_serial_main_active(uport))
> > > return;
> > >
> > > + /*
> > > + * Increase watermark level so that TX can be restarted and wait for
> > > + * sequencer to start to prevent lockups.
> > > + */
> > > + writel(port->tx_fifo_depth, uport->membase + SE_GENI_TX_WATERMARK_REG);
> > > + qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> > > + M_TX_FIFO_WATERMARK_EN, true);
> >
> > Oh, maybe this "wait for sequencer to start to prevent lockups." is
> > the part that I was missing? Can you explain more about what's going
> > on here? Why does waiting for the watermark interrupt to fire prevent
> > lockups? I would have imagined that the watermark interrupt would be
> > part of the geni hardware and have nothing to do with the firmware
> > running on the other end, so I'm not sure why it firing somehow would
> > prevent a lockup. Was this just by trial and error?
>
> Actually, the more I look at it the more confused I am about your
> qcom_geni_serial_clear_tx_fifo(). Can you explain and maybe add some
> inline comments in the function since it's not obvious? Specifically,
> things I'm confused about with your patch:
>
> 1. The function is named qcom_geni_serial_clear_tx_fifo() which
> implies that when it finishes that the hardware FIFO will have nothing
> in it. ...but how does your code ensure this?
Yeah, I realised after I sent out the series that this may not be the
case. I was under the impression that cancelling a command would discard
the data in the FIFO (e.g. when starting the next command) but that was
probably an error in my mental model.
Do you see any way to discard the FIFO in the docs you have access to?
> 2. If the function is really clearing the FIFOs then why do we need to
> adjust the watermark level? The fact that you need to adjust the
> watermark levels implies (to me) that there are things stuck in the
> FIFO still. ...but then what happens to those characters? When are
> they sent?
Exactly, there is data there according to the FIFO status, but I
erroneously interpreted it as a it would be discarded (e.g. when
starting the next command).
> 3. On my hardware you're setting the FIFO level to 16 here. The docs I
> have say that if the FIFO level is "less than" the value you set here
> then the interrupt will go off and further clarifies that if you set
> the register to 1 here then you'll get interrupted when the FIFO is
> empty. So what happens with your solution if the FIFO is completely
> full? In that case you'd have to set this to 17, right? ...but then I
> could believe that might confuse the interrupt handler which would get
> told to start transmitting when there is no room for anything.
Indeed. I may implicitly be relying on the absence of hardware flow
control as well so that waiting for one character to be sent is what
makes this work.
> Maybe something is missing in my mental model here and testing your
> patch and hitting Ctrl-C seems to work, but I don't really understand
> why so hopefully you can clarify! :-)
I spent too much time trying to reverse engineer this hw over the
weekend and wanted to get this out as a counter proposal as it indicated
that we may be able to find a smaller fix. The series addresses the
serial getty issues, but as I mentioned yesterday there is some
interaction with the console left to be resolved before this can be an
alternative.
If it wasn't for the hard lockup I would have sent the whole thing as
an RFC.
Johan
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] serial: qcom-geni: fix garbage output after buffer flush
2024-06-24 22:19 ` Doug Anderson
@ 2024-06-26 8:01 ` Johan Hovold
0 siblings, 0 replies; 19+ messages in thread
From: Johan Hovold @ 2024-06-26 8:01 UTC (permalink / raw)
To: Doug Anderson
Cc: Johan Hovold, Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio,
Bjorn Andersson, linux-arm-msm, linux-serial, linux-kernel,
stable
On Mon, Jun 24, 2024 at 03:19:33PM -0700, Doug Anderson wrote:
> On Mon, Jun 24, 2024 at 6:31 AM Johan Hovold <johan+linaro@kernel.org> wrote:
> >
> > The Qualcomm GENI serial driver does not handle buffer flushing and
> > outputs garbage (or NUL) characters for the remainder of any active TX
> > command after the write buffer has been cleared.
> >
> > Implement the flush_buffer() callback and use it to cancel any active TX
> > command when the write buffer has been emptied.
>
> I could be reading it wrong, but in the kernel-doc of `struct
> tty_ldisc_ops` it seems to indicate that flush_buffer() is for the
> other direction. Specifically, it says:
>
> This function instructs the line discipline to clear its buffers of
> any input characters it may have queued to be delivered to the user
> mode process.
Yes, but this a uart op (i.e. not tty_ldisc_ops), for which the doc
states:
Flush any write buffers, reset any DMA state and stop any
ongoing DMA transfers.
> I guess the underlying worry I have is that there's no guarantee that
> the flush function will be called when the kfifo loses bytes. If it
> ever happens we'll fall back to writing NUL bytes out and that doesn't
> seem amazing to me. To me it feels like
> qcom_geni_serial_send_chunk_fifo() should detect this situation and
> then it should be responsible for canceling, though better (in my
> mind) is if we never initiate any big transfers if we can get away
> with that and still be performant.
The flush buffer callback is called from the uart_flush_buffer() tty
operation (again, not tty_ldisc_ops) when the FIFO is reset.
Johan
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] serial: qcom-geni: fix lockups
2024-06-24 13:31 [PATCH 0/3] serial: qcom-geni: fix lockups Johan Hovold
` (2 preceding siblings ...)
2024-06-24 13:31 ` [PATCH 3/3] serial: qcom-geni: fix garbage output after buffer flush Johan Hovold
@ 2024-07-03 14:09 ` Greg Kroah-Hartman
2024-07-03 14:13 ` Johan Hovold
3 siblings, 1 reply; 19+ messages in thread
From: Greg Kroah-Hartman @ 2024-07-03 14:09 UTC (permalink / raw)
To: Johan Hovold
Cc: Jiri Slaby, Konrad Dybcio, Douglas Anderson, Bjorn Andersson,
linux-arm-msm, linux-serial, linux-kernel
On Mon, Jun 24, 2024 at 03:31:32PM +0200, Johan Hovold wrote:
> Since 6.10-rc1, Qualcomm machines with a serial port can easily lock up
> hard, for example, when stopping a getty on reboot.
>
> The first patch in this series fixes this severe regression by restoring
> the pre-6.10-rc1 behaviour of printing additional characters when
> flushing the tx buffer.
>
> The second patch fixes a long-standing issue in the GENI driver which
> can lead to a soft lock up when using software flow control and on
> suspend.
>
> The third patch, addresses the old issue with additional characters
> being printing when flushing the buffer.
>
> Note that timeouts used when clearing the tx fifo are a bit excessive
> since I'm reusing the current qcom_geni_serial_poll_bit() helper for
> now.
>
> I think at least the first patch should be merged for rc6 while we
> consider the best way forward to address the remaining issues.
>
> Doug has posted an alternative series of fixes here that depends on
> reworking the driver a fair bit here:
>
> https://lore.kernel.org/lkml/20240610222515.3023730-1-dianders@chromium.org/
I'm confused. Should I take this series, or Doug's, or Doug's single
patch that they say resolve the immediate issue? I can't tell what was
agreed on here at all, so I'm going to drop all of these patches and
wait for a resubmission that everyone agrees should be what is taken...
thanks,
greg k-h
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] serial: qcom-geni: fix lockups
2024-07-03 14:09 ` [PATCH 0/3] serial: qcom-geni: fix lockups Greg Kroah-Hartman
@ 2024-07-03 14:13 ` Johan Hovold
0 siblings, 0 replies; 19+ messages in thread
From: Johan Hovold @ 2024-07-03 14:13 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Johan Hovold, Jiri Slaby, Konrad Dybcio, Douglas Anderson,
Bjorn Andersson, linux-arm-msm, linux-serial, linux-kernel
On Wed, Jul 03, 2024 at 04:09:22PM +0200, Greg Kroah-Hartman wrote:
> On Mon, Jun 24, 2024 at 03:31:32PM +0200, Johan Hovold wrote:
> > Since 6.10-rc1, Qualcomm machines with a serial port can easily lock up
> > hard, for example, when stopping a getty on reboot.
> >
> > The first patch in this series fixes this severe regression by restoring
> > the pre-6.10-rc1 behaviour of printing additional characters when
> > flushing the tx buffer.
> >
> > The second patch fixes a long-standing issue in the GENI driver which
> > can lead to a soft lock up when using software flow control and on
> > suspend.
> >
> > The third patch, addresses the old issue with additional characters
> > being printing when flushing the buffer.
> >
> > Note that timeouts used when clearing the tx fifo are a bit excessive
> > since I'm reusing the current qcom_geni_serial_poll_bit() helper for
> > now.
> >
> > I think at least the first patch should be merged for rc6 while we
> > consider the best way forward to address the remaining issues.
> >
> > Doug has posted an alternative series of fixes here that depends on
> > reworking the driver a fair bit here:
> >
> > https://lore.kernel.org/lkml/20240610222515.3023730-1-dianders@chromium.org/
>
> I'm confused. Should I take this series, or Doug's, or Doug's single
> patch that they say resolve the immediate issue? I can't tell what was
> agreed on here at all, so I'm going to drop all of these patches and
> wait for a resubmission that everyone agrees should be what is taken...
Yes, sorry about the confusion. I'm preparing a v2 of this series which
fixes the regression without downsides of Doug's first series or minimal
fix. I should be able to post it tomorrow.
Johan
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] serial: qcom-geni: fix hard lockup on buffer flush
2024-06-25 14:40 ` Johan Hovold
@ 2024-07-04 9:59 ` Johan Hovold
0 siblings, 0 replies; 19+ messages in thread
From: Johan Hovold @ 2024-07-04 9:59 UTC (permalink / raw)
To: Doug Anderson
Cc: Johan Hovold, Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio,
Bjorn Andersson, linux-arm-msm, linux-serial, linux-kernel
On Tue, Jun 25, 2024 at 04:40:36PM +0200, Johan Hovold wrote:
> On Mon, Jun 24, 2024 at 10:39:07AM -0700, Doug Anderson wrote:
> > On Mon, Jun 24, 2024 at 6:31 AM Johan Hovold <johan+linaro@kernel.org> wrote:
> > >
> > > The Qualcomm GENI serial driver does not handle buffer flushing and used
> > > to print garbage characters when the circular buffer was cleared. Since
> > > commit 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo") this
> > > instead results in a lockup due to qcom_geni_serial_send_chunk_fifo()
> > > spinning indefinitely in the interrupt handler.
> > >
> > > This is easily triggered by interrupting a command such as dmesg in a
> > > serial console but can also happen when stopping a serial getty on
> > > reboot.
> > >
> > > Fix the immediate issue by printing NUL characters until the current TX
> > > command has been completed.
> > I don't love this, though it's better than a hard lockup. I will note
> > that it doesn't exactly restore the old behavior which would have
> > (most likely) continued to output data that had previously been in the
> > FIFO but that had been cancelled.
>
> Ah, yes, you're right. I went back and compared with 6.9 and the effect
> was indeed (often) that the machine felt sluggish when you hit ctrl-c to
> interrupt something like dmesg and the driver would continue to print up
> to 4k characters after that (e.g. 350 ms at 115200).
>
> The idea here was to fix the lockup regression separately and then have
> the third patch address the buffer flush failure, which could also be
> backported without depending on the kfifo conversion.
>
> But running with this series since yesterday, I realise there are still
> some unresolved interaction with the console code, which can now trigger
> a soft (instead of hard) lockup on reboot...
I've reworked my series to avoid the remaining lockup, which was due to
v1 not handling some cases where cancelling a command left stale data in
the fifo.
I've also reordered the patches to avoid printing NUL characters as an
intermediate fix.
Johan
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] serial: qcom-geni: fix soft lockup on sw flow control and suspend
2024-06-26 7:54 ` Johan Hovold
@ 2024-07-04 10:08 ` Johan Hovold
0 siblings, 0 replies; 19+ messages in thread
From: Johan Hovold @ 2024-07-04 10:08 UTC (permalink / raw)
To: Doug Anderson
Cc: Johan Hovold, Greg Kroah-Hartman, Jiri Slaby, Konrad Dybcio,
Bjorn Andersson, linux-arm-msm, linux-serial, linux-kernel,
stable
On Wed, Jun 26, 2024 at 09:54:40AM +0200, Johan Hovold wrote:
> On Mon, Jun 24, 2024 at 02:58:39PM -0700, Doug Anderson wrote:
> > 1. The function is named qcom_geni_serial_clear_tx_fifo() which
> > implies that when it finishes that the hardware FIFO will have nothing
> > in it. ...but how does your code ensure this?
>
> Yeah, I realised after I sent out the series that this may not be the
> case. I was under the impression that cancelling a command would discard
> the data in the FIFO (e.g. when starting the next command) but that was
> probably an error in my mental model.
I went back and did some more reverse engineering and have now confirmed
that the hardware works as I assumed for v1, that is, that cancelling a
command leaves data in the fifo, which is later discarded when a new
command is issued.
> > 3. On my hardware you're setting the FIFO level to 16 here. The docs I
> > have say that if the FIFO level is "less than" the value you set here
> > then the interrupt will go off and further clarifies that if you set
> > the register to 1 here then you'll get interrupted when the FIFO is
> > empty. So what happens with your solution if the FIFO is completely
> > full? In that case you'd have to set this to 17, right? ...but then I
> > could believe that might confuse the interrupt handler which would get
> > told to start transmitting when there is no room for anything.
>
> Indeed. I may implicitly be relying on the absence of hardware flow
> control as well so that waiting for one character to be sent is what
> makes this work.
I'm keeping the watermark level unchanged in v2 and instead restart tx
by issuing a short transfer command to clear any stale data from the
fifo which could prevent the watermark interrupt from firing.
Johan
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2024-07-04 10:08 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-24 13:31 [PATCH 0/3] serial: qcom-geni: fix lockups Johan Hovold
2024-06-24 13:31 ` [PATCH 1/3] serial: qcom-geni: fix hard lockup on buffer flush Johan Hovold
2024-06-24 17:39 ` Doug Anderson
2024-06-24 20:45 ` Doug Anderson
2024-06-25 14:53 ` Johan Hovold
2024-06-25 16:27 ` Doug Anderson
2024-06-25 14:40 ` Johan Hovold
2024-07-04 9:59 ` Johan Hovold
2024-06-24 13:31 ` [PATCH 2/3] serial: qcom-geni: fix soft lockup on sw flow control and suspend Johan Hovold
2024-06-24 21:23 ` Doug Anderson
2024-06-24 21:58 ` Doug Anderson
2024-06-26 7:54 ` Johan Hovold
2024-07-04 10:08 ` Johan Hovold
2024-06-26 7:42 ` Johan Hovold
2024-06-24 13:31 ` [PATCH 3/3] serial: qcom-geni: fix garbage output after buffer flush Johan Hovold
2024-06-24 22:19 ` Doug Anderson
2024-06-26 8:01 ` Johan Hovold
2024-07-03 14:09 ` [PATCH 0/3] serial: qcom-geni: fix lockups Greg Kroah-Hartman
2024-07-03 14:13 ` Johan Hovold
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).