* Proposed SDIO layer rework
@ 2008-09-05 11:45 Christer Weinigel
2008-09-05 13:18 ` Ben Dooks
2008-09-05 14:50 ` Pierre Ossman
0 siblings, 2 replies; 6+ messages in thread
From: Christer Weinigel @ 2008-09-05 11:45 UTC (permalink / raw)
To: linux-kernel; +Cc: Ben Dooks, Pierre Ossman
Hi Pierre,
I said I wanted to argue a bit about the SDIO layer, and here it is. :-)
Ben, this is about the s3cmci driver, so I guess you might be interested.
I've been trying to make SDIO run and run fast on a small embedded
system with a Samsung S3C24A0 processor with a 200 MHz ARM926 core. While
doing this I've discovered some changes that I'd like to make to the
generic SDIO layer.
First some measurements. I've connected the SDIO bus to an Agilent
scope to get some pretty plots. To measure CPU load I have a small test
program which just counts how many gettimeofday calls I can do in one
second. On an idle system I get the following:
269648 kcycles in 1.000 s, 269 kcycles/s
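The load meter can be sketched as a small userspace function; the original test program isn't shown, so this reconstruction is only an assumption of how it works:

```c
/* Count how many gettimeofday() calls complete in roughly one second.
 * A hypothetical reconstruction of the load meter described above. */
#include <stdio.h>
#include <sys/time.h>

static unsigned long count_gettimeofday_calls(void)
{
    struct timeval start, now;
    unsigned long calls = 0;

    gettimeofday(&start, NULL);
    do {
        gettimeofday(&now, NULL);
        calls++;
    } while ((now.tv_sec - start.tv_sec) * 1000000L +
             (now.tv_usec - start.tv_usec) < 1000000L);

    return calls;
}
```

Run it once on an idle system to get the baseline, then again under SDIO load; the drop in the returned count is the CPU time consumed by the transfers.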
I've written a test firmware for one of our SDIO chips which streams out
data using the SDIO Bluetooth Type A protocol at 250 packets per
second, 2 kBytes per packet, for a total transfer rate of about 4 Mbit/s.
The device driver I'm using is the s3cmci driver from Ben Dooks together
with my patches to do SDIO and with a working SDIO interrupt. The
driver does not support DMA yet. I've limited the SDIO bus clock to
10 MHz just to avoid any errors when connecting the scope probes.
If I use a test driver which just copies these packets to userspace and
then discards them (the logic in the test driver is basically the same
as in the btsdio driver), this is what I see on a scope:
http://zoo.weinigel.se/s3cmci/timing-normal.png
D4 is the CMD line and D5 is the CLK line. D0 to D3 are the data lines.
The SDIO interrupt is signalled on D1 and then you can see the
different SDIO transactions on the bus:
1. SDIO layer reads the interrupt register in function 0
2. Driver reads the BT function interrupt register
3. Driver clears the BT function interrupt register and
you can see the interrupt line going high again.
4. Driver reads the SDIO BT Type A header, you can see
the data transaction on the D0-D3 wires.
5. Driver reads the payload which is 2kBytes.
6. Driver acknowledges the packet.
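Sketched as code, the per-packet sequence looks roughly like this. The register offsets and the single-byte read/write mock are purely illustrative; they are not the real CCCR or Bluetooth Type-A register layout:

```c
#include <stdint.h>

/* Illustrative offsets only -- not the real register layout. */
#define REG_F0_INT_PENDING 0x05   /* func 0: pending-interrupt bitmap */
#define REG_BT_INT_STATUS  0x13   /* BT function: interrupt status    */
#define REG_BT_DATA        0x24   /* BT function: data port           */
#define PAYLOAD_LEN        2048

static uint8_t regs[256];
static unsigned bus_ops;          /* counts single-byte bus accesses  */

static uint8_t mock_readb(uint8_t addr)             { bus_ops++; return regs[addr]; }
static void    mock_writeb(uint8_t addr, uint8_t v) { bus_ops++; regs[addr] = v; }

/* Steps 1-6 from the list above, against the mock bus. */
static int handle_bt_packet(uint8_t *buf)
{
    if (!(mock_readb(REG_F0_INT_PENDING) & 0x02))   /* 1: func 1 = BT?   */
        return -1;
    uint8_t status = mock_readb(REG_BT_INT_STATUS); /* 2: read BT status */
    mock_writeb(REG_BT_INT_STATUS, status);         /* 3: clear it; the
                                                       IRQ line goes
                                                       high again       */
    uint8_t hdr[4];
    for (int i = 0; i < 4; i++)                     /* 4: Type A header  */
        hdr[i] = mock_readb(REG_BT_DATA);
    (void)hdr;
    for (int i = 0; i < PAYLOAD_LEN; i++)           /* 5: 2 kByte payload
                                                       (a CMD53 block
                                                       transfer in
                                                       reality)         */
        buf[i] = mock_readb(REG_BT_DATA);
    mock_writeb(REG_BT_INT_STATUS, 0);              /* 6: acknowledge    */
    return 0;
}
```

Each numbered step is a separate bus transaction, and in the current stack each one is also a separate wakeup of the sdio_irq thread.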
Each transaction requires the SDIO irq thread to wake up, and at 250
packets times 6 operations per packet, that is 1500 wakeups per second,
not including the initial wakeup for the SDIO interrupt itself. The gaps
when the CLK line stops in the 2kByte data transfer are probably due to
scheduling latencies before the pio_tasklet in the s3cmci driver is run.
As you can see, the latency from the interrupt being signalled to the
interrupt being cleared is about 340 us. The CPU load program reports:
130674 kcycles in 1.000 s, 154 kcycles/s
so about 45% of the CPU is spent just transferring data into kernel
space. Another problem is that the reference board I'm using has an SMC
ethernet chip which is extremely slow; it can spend up to a dozen
milliseconds just reading data from the chip. Since network traffic has
higher priority than the SDIO irq worker thread, if there is almost any
network traffic at all the SDIO interrupt latency is awful, often
reaching 10 or 20 ms.
To speed things up slightly, I removed the pio_tasklet in the s3cmci
driver and just called the function to read from the fifo directly from
the interrupt handler:
http://zoo.weinigel.se/s3cmci/timing-no-tasklet.png
156078 kcycles in 1.000 s, 156 kcycles/s
So the CPU load is slightly lower and it looks as if the clock never
stops during the 2kByte transfer. The latency is 320 us, but the
difference is negligible; it is well within the variation I see when
looking at the scope.
So, what I did next was to go through the whole SDIO layer, adding
asynchronous I/O operations all over the place. So an asynchronous
call to sdio_readb looks like this:
...
sdio_readb_start(func, address, done_callback);
}
And then I of course need a callback:
void done_callback(struct sdio_func *func)
{
	u8 value;
	int ret;

	value = sdio_readb_complete(func, &ret);
...
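Pieced together, the start/complete pattern can be shown self-contained with a mock host. The sdio_readb_start/sdio_readb_complete names follow the snippets above, but the surrounding struct fields and the host_irq helper are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for struct sdio_func: just enough state to carry
 * one in-flight byte read and its completion callback. */
struct sdio_func {
    void (*done)(struct sdio_func *func);
    uint8_t result;
    int error;
    int busy;
};

/* Start an async single-byte read.  A real implementation would queue
 * the CMD52 on the host controller; here we only latch the callback
 * and mark the transfer in flight. */
static void sdio_readb_start(struct sdio_func *func, unsigned addr,
                             void (*done)(struct sdio_func *))
{
    (void)addr;
    func->done = done;
    func->busy = 1;
}

/* Fetch the result inside the callback, as in the snippet above. */
static uint8_t sdio_readb_complete(struct sdio_func *func, int *err)
{
    *err = func->error;
    return func->result;
}

/* What the host controller's interrupt handler would do when the
 * CMD52 response arrives: record the data and fire the callback. */
static void host_irq(struct sdio_func *func, uint8_t data)
{
    func->result = data;
    func->error = 0;
    func->busy = 0;
    func->done(func);
}

static uint8_t seen;

static void done_callback(struct sdio_func *func)
{
    int ret;

    seen = sdio_readb_complete(func, &ret);
    assert(ret == 0);
}
```

The key point is that no thread sleeps between start and completion; everything after sdio_readb_start runs in the host's interrupt context.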
So instead of sleeping, it registers a callback function which will be
called directly from the host interrupt handler when the SDIO transfer
finishes. I also changed the SDIO interrupt handling so that the
interrupt handler is called directly from the mmc_signal_sdio_irq
function, instead of running from a thread. This made an enormous
difference in latency: it's down to 132 us, and the CPU load is down
as well:
http://zoo.weinigel.se/s3cmci/timing-async.png
185000 kcycles in 1.000 s, 185 kcycles/s
Most of the CPU is probably spent doing PIO transfers to the SDIO
controller; if DMA starts working in the s3cmci driver, the CPU load
difference will be even larger.
Since more and more high speed chips are being connected to embedded
devices using SDIO, I believe that to get good SDIO performance, the SDIO
layer has to be changed not to use a worker thread. This is
unfortunately rather painful, since it means that we have to add
asynchronous variants of all the I/O operations, the sdio_irq thread
has to be totally redone, and the SDIO IRQ enabling and disabling turned
out to be a bit tricky.
So, do you think that it is worth it to make such a major change to the
SDIO subsystem? I believe so, but I guess I'm a bit biased. I can clean
up the work I've done and make sure that everything is backwards
compatible so that existing SDIO function drivers keep working (my
current hack to the sdio_irq thread breaks existing drivers, and it is
too ugly to live anyway so I don't even want to show it to you yet), and
add a demo driver which shows how to use the asynchronous functions.
But if you don't like this idea at all, I'll probably just keep it as a
patch and not bother as much with backwards compatibility. This is a
lot more work for me though, and I'd much prefer to get it into the
official kernel.
Comments?
/Christer
ps. I've tested the same driver on a Lenovo laptop with the SDHCI driver
and it works fine, but it doesn't make as big a difference there. The
CPU load isn't big enough to matter even with the normal drivers, but
the latency gets slightly but not significantly better.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Proposed SDIO layer rework
2008-09-05 11:45 Proposed SDIO layer rework Christer Weinigel
@ 2008-09-05 13:18 ` Ben Dooks
2008-09-05 13:47 ` Christer Weinigel
2008-09-05 14:50 ` Pierre Ossman
1 sibling, 1 reply; 6+ messages in thread
From: Ben Dooks @ 2008-09-05 13:18 UTC (permalink / raw)
To: Christer Weinigel; +Cc: linux-kernel, Ben Dooks, Pierre Ossman
On Fri, Sep 05, 2008 at 01:45:55PM +0200, Christer Weinigel wrote:
> Hi Pierre,
>
> I said I wanted to argue a bit about the SDIO layer, and here it is. :-)
>
> Ben, this is about the s3cmci driver, so I guess you might be interested.
>
> I've been trying to make SDIO run and run fast on a small embedded
> system with a Samsung S3C24A0 processor with 200 MHz ARM926 core. While
> doing this I've discovered some changes that I'd like to make to the
> generic SDIO layer.
Many of the PDA-class SoCs are in the 200-400 MHz range; although
faster variants are turning up, there's a large contingent of similar
devices already out there.
> First some measurements. I've connected the SDIO bus to an Agilent
> scope to get some pretty plots. To measure CPU load I have a small test
> program which just counts how many gettimeofday calls I can do in one
> second. On on idle system I get the following:
>
> 269648 kcycles in 1.000 s, 269 kcycles/s
>
> I've written a test firmware for one of our SDIO chips which streams out
> data using the SDIO Bluetooth Type A protocol, at 250 packets per
> second, 2 kByte per packet for a total of about 4 MBit transfer rate.
>
> The device driver I'm using is the s3cmci driver from Ben Dooks together
> with my patches to do SDIO and with a working SDIO interrupt. The
> driver does not support DMA yet. I've limited the SDIO bus clock to
> 10 MHz just to avoid any errors when connecting the scope probes.
Well, it isn't really from me; I've been doing the maintenance and
required cleanup since the original driver was released.
> If I use a test driver which just copies these packets to userspace and
> then discards them (the logic in the test driver is basically the same
> as in the btsdio driver), this is what I see on a scope:
>
> http://zoo.weinigel.se/s3cmci/timing-normal.png
>
> D4 is the CMD line and D5 is the CLK line. D0 to D3 are the data lines.
> The SDIO interrupt is signalled on D1 and then you can see the
> different SDIO transactions on the bus:
>
> 1. SDIO layer reads the interrupt register in function 0
> 2. Driver reads the BT function nterrupt register
> 3. Driver clears the BT function interrupt register and
> you can see the interrupt line going high again.
> 4. Driver reads the SDIO BT Type A header, you can see
> the data transaction on the D0-D3 wires.
> 5. Driver reads the payload which is 2kBytes.
> 6. Driver acknowledges the packet.
>
> Each transaction requires the SDIO irq thread to wake up, and at 250
> packets times 6 operations per packet, that is 1500 wakeups per second,
> not including the initial wakeup for the SDIO interrupt itself. The gaps
> when the CLK line stops in the 2kByte data transfer is probably due to
> scheduling latencies before the pio_tasklet in the s3cmci driver is run.
> As you can see, the latency from the interrupt being signalled to the
> interrupt being cleared is about 340 us. The cpu load program reports:
>
> 130674 kcycles in 1.000 s, 154 kcycles/s
>
> so about 45% of the CPU is spent just transferring data into kernel
> space. Another problem is that the reference board I'm using has a SMC
> ethernet chip which is extremely slow, it can spend up to a dozen
> milliseconds just reading data from the chip. Since network traffic has
> higher priority than the SDIO irq worker thread, if there is almost any
> network traffic at all, the SDIO interrupt latency is awful, it often
> reaches 10 or 20 ms.
>
> To speed things up slightly, I removed the pio_tasklet in the s3cmci
> driver and just called the function to read from the fifo directly from
> the interrupt handler:
I've never really liked that PIO tasklet and was planning on vanquishing
it (or at least allowing it to be configured off) in the next release. I
also believe that we can probably improve the code in that region anyway.
> http://zoo.weinigel.se/s3cmci/timing-no-tasklet.png
> 156078 kcycles in 1.000 s, 156 kcycles/s
>
> So the CPU load is slightly lower and it looks as if the clock never
> stops during the 2kByte transfer. The latency is 320 us but this
> difference is negligible, it is well within the variation I see when
> looking at the the scope.
>
> So, what I did next was to go through the whole SDIO layer, adding
> asynchronous I/O operations all over the place. So an asynchronouse
> call to sdio_readb looks like this:
>
> ...
> sdio_readb_start(func, address, done_callback);
> }
>
> And then I of course need a callback:
>
> void done_callback(struct sdio_func *func)
> {
> u8 intrd;
> int ret;
>
> value = sdio_readb_complete(func, &ret);
> ...
>
> So instead of sleeping, it registers a callback function which will be
> called directly from the host interrupt handler when the SDIO transfer
> finishes. I also changed the SDIO interrupt handling so that the
> interrupt handler is called directly from the mmc_signal_sdio_irq
> function, instead of running from a thread. This made an enormous
> difference on the latency, it's down to 132 us, and the CPU load is down
> as well:
>
> http://zoo.weinigel.se/s3cmci/timing-async.png
> 185000 kcycles in 1.000 s, 185 kcycles/s
>
> Most of the CPU is probably spent doing PIO transfers to the SDIO
> controller, if DMA starts working in the s3cmci driver, the CPU load
> difference will be even larger.
I'm not sure if I'll get the time to look at this before the new kernel
is released... In any case, DMA may not be much of a win for smaller
transfers, since the setup (the cache will need to be cleaned or the
transfer memory made unbuffered) and completion time will add another
IRQ's worth of response time. This means small transfers are probably
better off using PIO.
> Since more and more high speed chips are being connected to embedded
> devices using SDIO, I belive that to get good SDIO performance, the SDIO
> layer has to be changed not to use a worker thread. This is
> unfortunately rather painful since it means that we have to add
> asynchronous variants of all the I/O operations, and the sdio_irq thread
> has to be totally redone, and the sdio IRQ enabling and disablin turned
> out to be a bit tricky.
>
> So, do you think that it is worth it to make such a major change to the
> SDIO subsystem? I belive so, but I guess I'm a bit biased. I can clean
> up the work I've done and make sure that everything is backwards
> compatible so that existing SDIO function drivers keep working (my
> current hack to the sdio_irq thread breaks existing drivers, and it is
> too ugly to live anyway so I don't even want to show it to you yet), and
> add a demo driver which shows how to use the asynchronous functions.
>
> But if you don't like this idea at all, I'll probably just keep it as a
> patch and not bother as much with backwards compatibility. This is a
> lot more work for me though, and I'd much prefer to get it into the
> official kernel.
>
> Comments?
>
> /Christer
>
> ps. I've tested the same driver on a Lenovo laptop with the SDHCI driver
> and it works fine, but it doesn't make as big a difference there. The
> CPU load isn't big enough to matter even with the normal drivers, but
> the latency gets slightly but not significantly better.
--
Ben
Q: What's a light-year?
A: One-third less calories than a regular year.
* Re: Proposed SDIO layer rework
2008-09-05 13:18 ` Ben Dooks
@ 2008-09-05 13:47 ` Christer Weinigel
2008-09-05 14:02 ` Ben Dooks
0 siblings, 1 reply; 6+ messages in thread
From: Christer Weinigel @ 2008-09-05 13:47 UTC (permalink / raw)
To: Ben Dooks; +Cc: linux-kernel, Pierre Ossman
Ben Dooks wrote:
>> Most of the CPU is probably spent doing PIO transfers to the SDIO
>> controller, if DMA starts working in the s3cmci driver, the CPU load
>> difference will be even larger.
>
> I'm not sure if I'll get the time to look at this before the new kernel
> is released... anyway DMA may not be much of a win for smaller transfers
> anyway, since the setup (the cache will need to be cleaned out or the
> transfer memory made unbuffered) and complete time will add another
> IRQ's worth of response time. This means small transfers are probably
> better off using PIO.
Yes. For the DMA-capable S3C SPI driver I wrote, I added some
thresholds: for transfers smaller than a certain number of bytes, I
skip DMA and just do a polled/interrupt transfer instead. For short
transfers at high clock rates it's not even worth taking an interrupt
per byte; it's better to just busy-wait for each byte, since the
interrupt overhead is larger than the time between bytes.
An SDIO command and its response are each 48 bits, so at 25 MHz the
whole exchange takes only about 4 us, and I think the interrupt
overhead is more than that. So if we really want to squeeze every last
clock cycle out of the SDIO driver it may be better to busy-wait for
the end of simple CMD52s instead of using an interrupt to complete the
transfer.
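As a sanity check on the arithmetic, and to sketch the threshold idea (the 64-byte DMA cut-off below is an illustrative guess, not a measured value):

```c
/* Bus time in microseconds for a given number of bits at clk_hz. */
static double bus_time_us(unsigned bits, unsigned clk_hz)
{
    return bits * 1e6 / (double)clk_hz;
}

/* A command token and its response are 48 bits each, so a CMD52
 * exchange occupies the bus for 96 bit-times. */
static double cmd52_time_us(unsigned clk_hz)
{
    return bus_time_us(2 * 48, clk_hz);
}

/* Skip DMA below a size threshold, as in the SPI driver described
 * above; 64 bytes is an invented placeholder, to be tuned against
 * the measured cache-maintenance and completion-IRQ overhead. */
static int use_dma(unsigned len)
{
    return len >= 64;
}
```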
I'll clean up my s3cmci patches and send them to you, but I can't
promise when I'll be done, so it'll probably have to wait for the next
kernel release.
/Christer
* Re: Proposed SDIO layer rework
2008-09-05 13:47 ` Christer Weinigel
@ 2008-09-05 14:02 ` Ben Dooks
2008-09-05 17:08 ` Ben Dooks
0 siblings, 1 reply; 6+ messages in thread
From: Ben Dooks @ 2008-09-05 14:02 UTC (permalink / raw)
To: Christer Weinigel; +Cc: Ben Dooks, linux-kernel, Pierre Ossman
On Fri, Sep 05, 2008 at 03:47:21PM +0200, Christer Weinigel wrote:
> Ben Dooks wrote:
> >>Most of the CPU is probably spent doing PIO transfers to the SDIO
> >>controller, if DMA starts working in the s3cmci driver, the CPU load
> >>difference will be even larger.
> >
> >I'm not sure if I'll get the time to look at this before the new kernel
> >is released... anyway DMA may not be much of a win for smaller transfers
> >anyway, since the setup (the cache will need to be cleaned out or the
> >transfer memory made unbuffered) and complete time will add another
> >IRQ's worth of response time. This means small transfers are probably
> >better off using PIO.
>
> Yes. For the DMA-capable S3C SPI driver I wrote, I added some
> thresholds, so for smaller transfers than a certain number of bytes, I
> skip DMA and just do a polled/interrupt transfer instead. For short
> transfers at high clock rates it's not even worth getting an interrupt
> per byte, it's better to just busy wait for each byte, since the
> interrupt overhead is larger than the time between each byte.
>
> A SDIO CMD/response packet is 48 bits, so at 25 MHz that is only about 4
> us and I think the interrupt overhead is more than that. So if we
> really want to squeeze every last clock cycle out of the SDIO driver it
> may be better to busy wait for the end of simple CMD52s instead of using
> the an interrupt to complete the transfer.
>
> I'll clean up my s3cmci patches and send them to you, but I can't
> promise when I'll be done, so it'll probably have to wait for the next
> kernel release.
Any chance of getting a list of what you've got in progress, and at
least the byte/word patch sorted out before the next merge window?
--
Ben
Q: What's a light-year?
A: One-third less calories than a regular year.
* Re: Proposed SDIO layer rework
2008-09-05 11:45 Proposed SDIO layer rework Christer Weinigel
2008-09-05 13:18 ` Ben Dooks
@ 2008-09-05 14:50 ` Pierre Ossman
1 sibling, 0 replies; 6+ messages in thread
From: Pierre Ossman @ 2008-09-05 14:50 UTC (permalink / raw)
To: Christer Weinigel; +Cc: linux-kernel, Ben Dooks
On Fri, 05 Sep 2008 13:45:55 +0200
Christer Weinigel <christer@weinigel.se> wrote:
> First some measurements. I've connected the SDIO bus to an Agilent
> scope to get some pretty plots.
Why doesn't Santa bring me any of the nice toys everyone else seems to
be getting :(
> Since more and more high speed chips are being connected to embedded
> devices using SDIO, I belive that to get good SDIO performance, the SDIO
> layer has to be changed not to use a worker thread. This is
> unfortunately rather painful since it means that we have to add
> asynchronous variants of all the I/O operations, and the sdio_irq thread
> has to be totally redone, and the sdio IRQ enabling and disablin turned
> out to be a bit tricky.
>
> So, do you think that it is worth it to make such a major change to the
> SDIO subsystem? I belive so, but I guess I'm a bit biased. I can clean
> up the work I've done and make sure that everything is backwards
> compatible so that existing SDIO function drivers keep working (my
> current hack to the sdio_irq thread breaks existing drivers, and it is
> too ugly to live anyway so I don't even want to show it to you yet), and
> add a demo driver which shows how to use the asynchronous functions.
The latency improvement is indeed impressive, but I am not convinced it
is worth it. An asynchronous API is much more complex and difficult to
work with (not to mention reading and trying to make sense of existing
code), and SDIO is not even an asynchronous bus to begin with.
Also, as the latency is caused by CPU cycles, the problem will shrink
as hardware improves.
But...
> But if you don't like this idea at all, I'll probably just keep it as a
> patch and not bother as much with backwards compatibility. This is a
> lot more work for me though, and I'd much prefer to get it into the
> official kernel.
I do like the idea of reducing latencies (and generally improving the
performance of the MMC stack). The primary reason I haven't done
anything myself is lack of stuff to test and proper instrumentation.
There are really two issues here, which aren't necessarily related:
actual interrupt latency and command completion latency.
The main culprit in your case is the command completion one. Perhaps
there is some other way of solving that inside the MMC core? E.g. we
could spin instead of sleeping while we wait for the request to
complete. Most drivers never use process context to handle a request
anyway. We would need to determine when the request is small enough
(all non-busy, non-data commands?) and that the CPU is slow enough. I
also saw something about a new trigger interface that could make this
efficient.
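A sketch of that idea, with a mock request in place of the real MMC core structures and a made-up spin budget:

```c
/* Stand-in for the MMC core's request object. */
struct req { volatile int done; };

/* Mock host controller: a real one raises an IRQ that sets
 * req->done; here polling completes the request after a few calls
 * so the example terminates. */
static int polls_left;
static void host_poll(struct req *r)
{
    if (--polls_left <= 0)
        r->done = 1;
}

/* Spin for short commands (non-busy, non-data) on the theory that
 * the response arrives within a few microseconds; fall back to a
 * sleeping wait otherwise.  The 1000-poll budget is a guess and
 * would need tuning against the CPU speed check described above. */
static void wait_for_req(struct req *r, int short_cmd)
{
    if (short_cmd)
        for (int i = 0; i < 1000 && !r->done; i++)
            host_poll(r);      /* cpu_relax() in a real driver */
    while (!r->done)
        host_poll(r);          /* stands in for a sleeping wait */
}
```

The attraction is that function drivers keep their existing synchronous API; only the core's wait strategy changes.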
Does this sound like something worth exploring?
Rgds
--
-- Pierre Ossman
Linux kernel, MMC maintainer http://www.kernel.org
rdesktop, core developer http://www.rdesktop.org
WARNING: This correspondence is being monitored by the
Swedish government. Make sure your server uses encryption
for SMTP traffic and consider using PGP for end-to-end
encryption.
* Re: Proposed SDIO layer rework
2008-09-05 14:02 ` Ben Dooks
@ 2008-09-05 17:08 ` Ben Dooks
0 siblings, 0 replies; 6+ messages in thread
From: Ben Dooks @ 2008-09-05 17:08 UTC (permalink / raw)
To: Ben Dooks; +Cc: Christer Weinigel, linux-kernel, Pierre Ossman
On Fri, Sep 05, 2008 at 03:02:16PM +0100, Ben Dooks wrote:
> On Fri, Sep 05, 2008 at 03:47:21PM +0200, Christer Weinigel wrote:
> > Ben Dooks wrote:
> > >>Most of the CPU is probably spent doing PIO transfers to the SDIO
> > >>controller, if DMA starts working in the s3cmci driver, the CPU load
> > >>difference will be even larger.
> > >
> > >I'm not sure if I'll get the time to look at this before the new kernel
> > >is released... anyway DMA may not be much of a win for smaller transfers
> > >anyway, since the setup (the cache will need to be cleaned out or the
> > >transfer memory made unbuffered) and complete time will add another
> > >IRQ's worth of response time. This means small transfers are probably
> > >better off using PIO.
> >
> > Yes. For the DMA-capable S3C SPI driver I wrote, I added some
> > thresholds, so for smaller transfers than a certain number of bytes, I
> > skip DMA and just do a polled/interrupt transfer instead. For short
> > transfers at high clock rates it's not even worth getting an interrupt
> > per byte, it's better to just busy wait for each byte, since the
> > interrupt overhead is larger than the time between each byte.
> >
> > A SDIO CMD/response packet is 48 bits, so at 25 MHz that is only about 4
> > us and I think the interrupt overhead is more than that. So if we
> > really want to squeeze every last clock cycle out of the SDIO driver it
> > may be better to busy wait for the end of simple CMD52s instead of using
> > the an interrupt to complete the transfer.
> >
> > I'll clean up my s3cmci patches and send them to you, but I can't
> > promise when I'll be done, so it'll probably have to wait for the next
> > kernel release.
>
> Any chance of getting a list of what you've got in progress and at-least
> the byte/word patch sorted out before the next merge window?
I should have added that at least if I know what you are working on,
then we can come to some arrangement on who should do what for the
next kernel release. I would like to get the fixups for word/byte code
in, and anything else that does not require huge amounts of testing.
--
Ben (ben@fluff.org, http://www.fluff.org/)
'a smiley only costs 4 bytes'
end of thread, other threads:[~2008-09-05 17:09 UTC | newest]
Thread overview: 6+ messages
2008-09-05 11:45 Proposed SDIO layer rework Christer Weinigel
2008-09-05 13:18 ` Ben Dooks
2008-09-05 13:47 ` Christer Weinigel
2008-09-05 14:02 ` Ben Dooks
2008-09-05 17:08 ` Ben Dooks
2008-09-05 14:50 ` Pierre Ossman