[PATCH] amba-pl011: simplify TX handling

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] amba-pl011: simplify TX handling
       [not found]   ` <20150317154248.1675cd4c@north>
@ 2015-03-17 14:55     ` Andre Przywara
  0 siblings, 0 replies; 5+ messages in thread
From: Andre Przywara @ 2015-03-17 14:55 UTC (permalink / raw)
  To: linux-arm-kernel

[resent with LAKML address fixed]
Hi Jakub,

On 17/03/15 14:42, Jakub Kici?ski wrote:
> On Tue, 17 Mar 2015 13:58:44 +0000, Dave P Martin wrote:
>> On Mon, Mar 16, 2015 at 11:15:29PM +0000, Jakub Kicinski wrote:
>>> From: Jakub Kicinski <kubakici@wp.pl>

[ ... ]
>>
>> My update still keeps the softirq stuff.  I wanted to avoid adding
>> status polling inside the interrupt handler's core loop, due to
>> concerns about performance overhead: without the polling, the
>> writes to DR are fire-and-forget, whereas polling FR each time
>> involves a whole round-trip to the UART which may involve significant
>> extra time cost and/or IRQ disable latency; however.
>>
>> Your approach does mitigate some of the cost by only starting to
>> poll after count chars have been transmitted, and will typically
>> halve the number of IRQs taken -- which could lead to a net benefit.
>> It would be good to see some benchmarks to understand how much
>> difference it makes to performance.
> 
> I will *try* to come up with some figures but I don't have any
> high-speed UARTs so my tests run at 115k tops.

Have you tried this?
http://fw.hardijzer.nl/?p=138

This drives the RPi PL011 from a higher frequency clock, thus you can
achieve much higher baud rates. Haven't tried this myself, but it looks
reasonable and not too complicated. You need a matching speed at the
other end, of course, AFAIK the FTDI USB chip can do this.

Cheers,
Andre.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] amba-pl011: simplify TX handling
       [not found] <20150317163200.GE3759@e103592.cambridge.arm.com>
@ 2015-03-17 23:42 ` Jakub Kiciński
  2015-03-18 17:41   ` Dave P Martin
  0 siblings, 1 reply; 5+ messages in thread
From: Jakub Kiciński @ 2015-03-17 23:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 17 Mar 2015 16:32:00 +0000, Dave P Martin wrote:
> [Apologies for any duplicate mails received -- some MTAs keep
> choking on this.  Squshing to ASCII in case that helps.
> 
> [fixed typo'd address for linux-arm-kernel]
> 
> [Greg, to minimise confusion while these patches are still under
> discussion, can you drop the following patches from tty-next:
> 
>  * f2ee6df serial/amba-pl011: Leave the TX IRQ alone when the UART is not open
>  * 734745c serial/amba-pl011: Activate TX IRQ passively
> 
> You can pull in my updated series if you want, but at this point further
> changes are likely:
> 
> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-March/330619.html
> 
> See below for context.]
> 
>> [...]
> 
> -next is an automated merge of lots of people's branches.  This helps
> give advance warning of changes that are likely to break the kernel,
> or each other.  Stuff in -next isn't necessarily accepted by anyone
> for actual merging yet.  All history can get rewritten until patches
> land in torvalds/master.
> 
> Many maintainers keep separate branches for sending to -next, so that
> immature patches can get wider visibility before they accept them for
> merging; a separate branch (called for-linus or similar) is often used
> for sending pull requests to Linus.

OK, the reason I asked is that I come from the networking corner of the
kernel where rebasing even -next trees is more frowned upon.  However,
if we agree to merge my patch we could actually leave the tree as it
is, I think most of your v1 -> v2 changes are overwritten anyway.

> > > Can you try to rebase?
> > >
> > > My update still keeps the softirq stuff.  I wanted to avoid adding
> > > status polling inside the interrupt handler's core loop, due to
> > > concerns about performance overhead: without the polling, the
> > > writes to DR are fire-and-forget, whereas polling FR each time
> > > involves a whole round-trip to the UART which may involve significant
> > > extra time cost and/or IRQ disable latency; however.
> > >
> > > Your approach does mitigate some of the cost by only starting to
> > > poll after count chars have been transmitted, and will typically
> > > halve the number of IRQs taken -- which could lead to a net benefit.
> > > It would be good to see some benchmarks to understand how much
> > > difference it makes to performance.
> > 
> > I will *try* to come up with some figures but I don't have any
> > high-speed UARTs so my tests run at 115k tops.  I think dropping the FF
> > poll at the end of load from IRQ could be a good idea.  And I have to
> > respin anyway because I just noticed I added a double new line in
> > pl011_tx_chars()...
> 
> Maybe you can drop some getnstimeofday() into pl011_int() and/or
> pl011_tx_chars(), to collect some simple stats?
> 
> I might have a go at this if I get a moment -- but any info you
> could obtain would be helpful.  For one thing, the behvaiour
> might vary a bit between platforms, even for a fixed baud rate.

Unfortunately I had to give up trying to get perf to work on my RPi (I
read somewhere that SOC does not have PMU properly connected?).  What I
have instead are some simple, "proxy" numbers:

1) First I measured how much more data can be loaded thanks to the FF
   check at the end of guaranteed FIFO filling.  For 2Mbps I got
   following numbers (transferring the same file):

          | FIFO loads | FF seen** | additional chars loaded |  
no load   |   173350   |  169157   |         8429	     |
100% CPU* |   136757   |   50942   |       300375	     |

*  100% CPU utilization achieved by running iperf and UART traffic on
   other chips;
** after guaranteed count FF was set - no additional chars loaded.

Breakdown of how many additional chars could be loaded per IRQ:

               1    2    3    4    5    6    7    8    9  10 ...
no load  |  2094 1127  418  256  120   81   36   31   23   1 0 0 0 0 0 0
100% CPU | 33375 7336 6527 7324 7971 8586 6983 3387 4007   4 0 0 0 0 0 0

I also tested with 115kbps and only 5 times out of 30k checks IRQ was
able to put in a single additional character on a fully loaded system.


To sum up - on a fully loaded system FF check gives a significant (~20%)
drop in the number of pl011_tx_chars() calls (but they are more costly).


2) I run some CPU intensive and I/O intensive benchmarks to see the
   effect doing FF check has on general performance:

w/read - check performed
 /read - just load FIFO size/2; don't check FIFO_FULL

iperf  -c 10.1.1.16 -t 60  # [Mbps] usb ethernet stick
clean:  49.3 49.2 49.3 49.4 49.4 
w/read: 42.0 40.9 42.0 42.1 42.0
 /read: 42.0 43.1 42.2 42.1 42.6

time echo "scale=1500; a(1)*5" | bc -l  # [s] of real time
clean:  12.263 12.311 12.230 12.242 12.272
w/read: 14.512 14.646 14.541 14.894 14.504
 /read: 14.776 14.714 14.936 14.588 14.658


This simple test indicates that not doing the check helps I/O
performance but hurts the CPU computation speed.  The difference is not
big in either case but since my heart lies with networking I will
actually drop the FF check in v2 :)

I don't feel strongly about it though, let me know if you prefer to
leave it in.

> > > Looping in pl011_tx_chars() until either the FIFO is full or
> > > we run out of chars to transmit gets rid of the need for the
> > > softirq, so I wouldn't be opposed to keeping that if the
> > > performance impact is not too significant.
> > 
> > TBH the softirq stuff I don't quite understand.  I had never seen it
> > fire in my tests.  Can you explain its purpose?  Is there HW out there
> > which fails to fire the IRQ (but it wouldn't work until now anyhow)?
> 
> At the moment it's partly theoretical: I have seen the TX IRQ fail
> to fire on some software models -- it's not that uncommon for
> a model not to bother emulating the actual transmission delay, so
> some effectively drain every char written to the FIFO instantly.
> 
> It could happen in some hardware configurations too -- I once had an
> embedded UART running at 3Mbps, fed by a 15MHz ARM7TDMI at one point.
> Could have gone faster, but the FTDI uart at the other end of
> the link used a different base clock, making it hard to match
> the rates closely enough at high speeds.  Even on fast modern
> platforms, DVFS may slow down the CPU quite aggressively on some
> systems.

So the delayed work serves as TX watchdog.  I'm not convinced that we
should add it if we have no report of hardware in the wild which needs
it...  If we do add it - perhaps the watchdog should check all of the
IRQs (e.g. RX IRQ delivery can fail as well)? 

> That said, my current approach of backing off and scheduling a
> softirq doesn't necessarily bring a lot of benefit.
> 
> The main difference is that writing a huge buffer to a UART with
> an unfillable TX FIFO will just leave execution stuck in pl011_int()
> holding the port lock for a long time until all the data has been
> sent.  But I think that would happen even with the original driver code
> -- TXIS would stay asserted, so pl011_int() would just keep calling
> _tx_chars() without releasing the port lock or returning.
> 
> I think I'll have a try at getting rid of the softirq, following
> your example, and see how things look.
> 
> A system where feeding the UART really does take ~100%CPU really
> has a design problem, which we can't expect to fix in the pl011
> driver...

Agreed.

> > > On port shutdown, the TX interrupt will be masked by IMSC, but should
> > > remain asserted inside the PL011, triggering an interrupt the next time
> > > it is unmasked.
> > > 
> > > If you observed something that conflicts with these assumptions,
> > > I'd be interested to know what's going on...
> > 
> > Actually seeing this bug in practice was what motivated this work.
> > I triggered it by catting a long file over the UART and then hitting
> > ^C.  UART died every time...  That being said I test this on BCM2708
> > which is also notorious for transmitting while device in loopback
> > configuration, so perhaps it's implementation-dependent...
> 
> Can you describe in more detail what you did?  I don't think that
> ^C should break things ever, but I'd like to try to reproduce it
> just in case there is some faulty logic somewhere...

This is exactly what I did:
# stty -F /dev/ttyAMA0 115200 -onlcr
# cat 1MB_text_file > /dev/ttyAMA0
^C
Now AMA0 is dead.  If I waited until the whole file was written,
everything was fine.  This was 100% reproducible and I later checked
that on .shutdown() the FIFO was full (no IRQ pending yet).  Initially I
tried to play around with CR programming on .shutdown() but it didn't
change anything (according to Broadcom docs one should be very careful
not to touch CR while UART is busy).

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] amba-pl011: simplify TX handling
  2015-03-17 23:42 ` [PATCH] amba-pl011: simplify TX handling Jakub Kiciński
@ 2015-03-18 17:41   ` Dave P Martin
  2015-03-18 20:26     ` Jakub Kiciński
  2015-03-18 23:43     ` Russell King - ARM Linux
  0 siblings, 2 replies; 5+ messages in thread
From: Dave P Martin @ 2015-03-18 17:41 UTC (permalink / raw)
  To: linux-arm-kernel

From: Dave P Martin <Dave.Martin@arm.com>
To: Jakub Kicinski <moorray3@wp.pl>
CC: Russell King <linux@arm.linux.org.uk>, Greg Kroah-Hartman
 <gregkh@linuxfoundation.org>, "linux-serial at vger.kernel.org"
 <linux-serial@vger.kernel.org>, "linux-arm-kernel at lists.infradead.org"
 <linux-arm-kernel@lists.infradead.org>, Andre Przywara
 <Andre.Przywara@arm.com>, Karol Debogorski <k.debogorski@a2s.pl>, Jakub
 Kicinski <kubakici@wp.pl>
Subject: Re: [PATCH] amba-pl011: simplify TX handling

On Tue, Mar 17, 2015 at 11:42:55PM +0000, Jakub Kicinski wrote:
> On Tue, 17 Mar 2015 16:32:00 +0000, Dave P Martin wrote:

[...]

> > Many maintainers keep separate branches for sending to -next, so that
> > immature patches can get wider visibility before they accept them for
> > merging; a separate branch (called for-linus or similar) is often used
> > for sending pull requests to Linus.
> 
> OK, the reason I asked is that I come from the networking corner of the
> kernel where rebasing even -next trees is more frowned upon.  However,
> if we agree to merge my patch we could actually leave the tree as it
> is, I think most of your v1 -> v2 changes are overwritten anyway.

I guess I misread the tone of your question slightly... anyway I guess
it's up to Greg.

For now, I'll probably repost replacement patches and a delta, and he
can choose which to apply.


I'm aiming to repost tomorrow -- couldn't quite get to where I was
aiming for today.

More comments below.

> > Maybe you can drop some getnstimeofday() into pl011_int() and/or
> > pl011_tx_chars(), to collect some simple stats?
> > 
> > I might have a go at this if I get a moment -- but any info you
> > could obtain would be helpful.  For one thing, the behvaiour
> > might vary a bit between platforms, even for a fixed baud rate.
> 
> Unfortunately I had to give up trying to get perf to work on my RPi (I
> read somewhere that SOC does not have PMU properly connected?).  What I
> have instead are some simple, "proxy" numbers:
> 
> 1) First I measured how much more data can be loaded thanks to the FF
>    check at the end of guaranteed FIFO filling.  For 2Mbps I got
>    following numbers (transferring the same file):
> 
>           | FIFO loads | FF seen** | additional chars loaded |  
> no load   |   173350   |  169157   |         8429	     |
> 100% CPU* |   136757   |   50942   |       300375	     |
> 
> *  100% CPU utilization achieved by running iperf and UART traffic on
>    other chips;
> ** after guaranteed count FF was set - no additional chars loaded.
> 
> Breakdown of how many additional chars could be loaded per IRQ:
> 
>                1    2    3    4    5    6    7    8    9  10 ...
> no load  |  2094 1127  418  256  120   81   36   31   23   1 0 0 0 0 0 0
> 100% CPU | 33375 7336 6527 7324 7971 8586 6983 3387 4007   4 0 0 0 0 0 0
> 
> I also tested with 115kbps and only 5 times out of 30k checks IRQ was
> able to put in a single additional character on a fully loaded system.
> 
> 
> To sum up - on a fully loaded system FF check gives a significant (~20%)
> drop in the number of pl011_tx_chars() calls (but they are more costly).
> 
> 2) I run some CPU intensive and I/O intensive benchmarks to see the
>    effect doing FF check has on general performance:
> 
> w/read - check performed
>  /read - just load FIFO size/2; don't check FIFO_FULL
> 
> iperf  -c 10.1.1.16 -t 60  # [Mbps] usb ethernet stick
> clean:  49.3 49.2 49.3 49.4 49.4 
> w/read: 42.0 40.9 42.0 42.1 42.0
>  /read: 42.0 43.1 42.2 42.1 42.6
> 
> time echo "scale=1500; a(1)*5" | bc -l  # [s] of real time
> clean:  12.263 12.311 12.230 12.242 12.272
> w/read: 14.512 14.646 14.541 14.894 14.504
>  /read: 14.776 14.714 14.936 14.588 14.658
> 
> 
> This simple test indicates that not doing the check helps I/O
> performance but hurts the CPU computation speed.  The difference is not
> big in either case but since my heart lies with networking I will
> actually drop the FF check in v2 :)
> 
> I don't feel strongly about it though, let me know if you prefer to
> leave it in.

Thanks for looking into this -- it's interesting to see measurable
results, but overall I agree: this is an extreme case, and it looks
like we don't need to panic one war or the other.

My original instinct was to keep the original behaviour of only
sending the guaranteed number of chars on IRQ, so I guess we're
in agreement on this (though without strong feeling on either
side.)

[...]

> > It could happen in some hardware configurations too -- I once had an
> > embedded UART running at 3Mbps, fed by a 15MHz ARM7TDMI at one point.
> > Could have gone faster, but the FTDI uart at the other end of
> > the link used a different base clock, making it hard to match
> > the rates closely enough at high speeds.  Even on fast modern
> > platforms, DVFS may slow down the CPU quite aggressively on some
> > systems.
> 
> So the delayed work serves as TX watchdog.  I'm not convinced that we
> should add it if we have no report of hardware in the wild which needs
> it...  If we do add it - perhaps the watchdog should check all of the
> IRQs (e.g. RX IRQ delivery can fail as well)? 

Sure, I think it can be dropped.  It does not add much benefit for
that extreme case, and for more realistic cases it just adds complexity.

[...]

> > A system where feeding the UART really does take ~100%CPU really
> > has a design problem, which we can't expect to fix in the pl011
> > driver...
> 
> Agreed.

[...]

> This is exactly what I did:
> # stty -F /dev/ttyAMA0 115200 -onlcr
> # cat 1MB_text_file > /dev/ttyAMA0
> ^C
> Now AMA0 is dead.  If I waited until the whole file was written,
> everything was fine.  This was 100% reproducible and I later checked
> that on .shutdown() the FIFO was full (no IRQ pending yet).  Initially I
> tried to play around with CR programming on .shutdown() but it didn't
> change anything (according to Broadcom docs one should be very careful
> not to touch CR while UART is busy).

Interesting... I missed this because the systemd issue that got me
started on this only shows up of the console is on the PL011.  Once
a shell is running on ttyAMA0, the port is always open, so the effects
of shutting down and restarting the pl011 are not seen.

I'm actually suspicious of the correct behaviour here.  serial_core
waits for the UART to drain via uart_wait_until_sent(), but there are
some oddities:

 * The wait is abandoned early if there is a pending signal.  This
   means that some output already sent to the kernel via write() is
   is simply lost.  This feels a bit odd -- for all other I/O I can
   think of, write() does not guarantee that the data has reached
   its destination, but on return it usually does guarantee that the
   data _will_ reach its destination (except for unrecoverable I/O
   errors).

   This behaviour does mean that pl011_shutdown() is invoked with
   a non-empty FIFO if the only process with the port open is killed
   by a signal while output is pending, causing the hangup.

 * uart_wait_until_sent()'s timeout calculations aim to wait for
   no longer than it takes the FIFO to drain.  However, this function
   can get called when the serial_core xmit queue for the port is
   very non-empty -- meaning that the FIFO continues to be topped
   up for some time.  This can cause more data to be lost.


Since my "optimization" to hold TXIS asserted across _shutdown()..
_startup() was of questionable benefit in the first place I agree just
to drop it.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] amba-pl011: simplify TX handling
  2015-03-18 17:41   ` Dave P Martin
@ 2015-03-18 20:26     ` Jakub Kiciński
  2015-03-18 23:43     ` Russell King - ARM Linux
  1 sibling, 0 replies; 5+ messages in thread
From: Jakub Kiciński @ 2015-03-18 20:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 18 Mar 2015 17:41:57 +0000, Dave P Martin wrote:
> > This is exactly what I did:
> > # stty -F /dev/ttyAMA0 115200 -onlcr
> > # cat 1MB_text_file > /dev/ttyAMA0
> > ^C
> > Now AMA0 is dead.  If I waited until the whole file was written,
> > everything was fine.  This was 100% reproducible and I later checked
> > that on .shutdown() the FIFO was full (no IRQ pending yet).  Initially I
> > tried to play around with CR programming on .shutdown() but it didn't
> > change anything (according to Broadcom docs one should be very careful
> > not to touch CR while UART is busy).
> 
> Interesting... I missed this because the systemd issue that got me
> started on this only shows up of the console is on the PL011.  Once
> a shell is running on ttyAMA0, the port is always open, so the effects
> of shutting down and restarting the pl011 are not seen.
> 
> I'm actually suspicious of the correct behaviour here.  serial_core
> waits for the UART to drain via uart_wait_until_sent(), but there are
> some oddities:
> 
>  * The wait is abandoned early if there is a pending signal.  This
>    means that some output already sent to the kernel via write() is
>    is simply lost.  This feels a bit odd -- for all other I/O I can
>    think of, write() does not guarantee that the data has reached
>    its destination, but on return it usually does guarantee that the
>    data _will_ reach its destination (except for unrecoverable I/O
>    errors).
> 
>    This behaviour does mean that pl011_shutdown() is invoked with
>    a non-empty FIFO if the only process with the port open is killed
>    by a signal while output is pending, causing the hangup.
> 
>  * uart_wait_until_sent()'s timeout calculations aim to wait for
>    no longer than it takes the FIFO to drain.  However, this function
>    can get called when the serial_core xmit queue for the port is
>    very non-empty -- meaning that the FIFO continues to be topped
>    up for some time.  This can cause more data to be lost.

I confess that the way serial_core works is not very intuitive for me
either, I can only add to your list.  For instance I observed
that .start_tx() called eight times during the initial load.  Why?
Obviously the first one fills up the FIFO and the subsequent just waste
CPU time.  I remember looking at the serial_core and it seemed like the
repeated calls must be coming from the higher layers... Admittedly I was
too lazy to just add a dump_stack() and see for myself ;)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] amba-pl011: simplify TX handling
  2015-03-18 17:41   ` Dave P Martin
  2015-03-18 20:26     ` Jakub Kiciński
@ 2015-03-18 23:43     ` Russell King - ARM Linux
  1 sibling, 0 replies; 5+ messages in thread
From: Russell King - ARM Linux @ 2015-03-18 23:43 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Mar 18, 2015 at 05:41:57PM +0000, Dave P Martin wrote:
>  * uart_wait_until_sent()'s timeout calculations aim to wait for
>    no longer than it takes the FIFO to drain.  However, this function
>    can get called when the serial_core xmit queue for the port is
>    very non-empty -- meaning that the FIFO continues to be topped
>    up for some time.  This can cause more data to be lost.

The code used to wait for the xmit queue to empty (up to closing_wait)
before shutting the port down, which defaulted to 30 seconds.  If the
xmit queue doesn't drain in 30 seconds, it probably never will.

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2015-03-18 23:43 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20150317163200.GE3759@e103592.cambridge.arm.com>
2015-03-17 23:42 ` [PATCH] amba-pl011: simplify TX handling Jakub Kiciński
2015-03-18 17:41   ` Dave P Martin
2015-03-18 20:26     ` Jakub Kiciński
2015-03-18 23:43     ` Russell King - ARM Linux
     [not found] <1426547729-21255-1-git-send-email-moorray3@wp.pl>
     [not found] ` <20150317135844.GA3759@e103592.cambridge.arm.com>
     [not found]   ` <20150317154248.1675cd4c@north>
2015-03-17 14:55     ` Andre Przywara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).