From mboxrd@z Thu Jan  1 00:00:00 1970
From: =?utf-8?Q?Beno=C3=AEt_Th=C3=A9baudeau?= <benoit.thebaudeau@advansee.com>
Date: Sun, 29 Jul 2012 02:48:18 +0200 (CEST)
Subject: [U-Boot] [PATCH v2 2/5] ehci-hcd: Boost transfer speed
In-Reply-To: <500E9CC8.2030507@herbrechtsmeier.net>
Message-ID: <635841778.759390.1343522898919.JavaMail.root@advansee.com>
List-Id: <u-boot.lists.denx.de>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: u-boot@lists.denx.de

Dear Stefan,

Sorry for the delay. I'm very busy, and there is much to tell on this topic.

On Tue, Jul 24, 2012 at 03:02:00 PM, Stefan Herbrechtsmeier wrote:
> Am 23.07.2012 19:15, schrieb Beno?t Th?baudeau:
> > On Monday 23 July 2012 15:35:25 Stefan Herbrechtsmeier wrote:
> >> Am 20.07.2012 17:35, schrieb Beno?t Th?baudeau:
> >>> On Friday 20 July 2012 17:15:13 Stefan Herbrechtsmeier wrote:
> >>>> Am 20.07.2012 17:03, schrieb Beno?t Th?baudeau:
> >>>>> On Friday 20 July 2012 16:51:33 Stefan Herbrechtsmeier wrote:
> >>>>>> Am 20.07.2012 15:56, schrieb Beno?t Th?baudeau:
> >>>>>>> Dear Marek Vasut,
> >>>>>>>
> >>>>>>> On Friday 20 July 2012 15:44:01 Marek Vasut wrote:
> >>>>>>>>> On Friday 20 July 2012 13:37:37 Stefan Herbrechtsmeier
> >>>>>>>>> wrote:
> >>>>>>>>>> Am 20.07.2012 13:26, schrieb Beno?t Th?baudeau:
> >>>>>>>>>>> +			int xfr_bytes = min(left_length,
> >>>>>>>>>>> +					    (QT_BUFFER_CNT * 4096 -
> >>>>>>>>>>> +					     ((uint32_t)buf_ptr & 4095)) &
> >>>>>>>>>>> +					    ~4095);
> >>>>>>>>>> Why you align the length to 4096?
> >>>>>>>>> It's to guarantee that each transfer length is a multiple
> >>>>>>>>> of
> >>>>>>>>> the
> >>>>>>>>> max packet
> >>>>>>>>> length. Otherwise, early short packets are issued, which
> >>>>>>>>> breaks
> >>>>>>>>> the
> >>>>>>>>> transfer and results in time-out error messages.
> >>>>>>>> Early short packets ? What do you mean?
> >>>>>>> During a USB transfer, all packets must have a length of max
> >>>>>>> packet
> >>>>>>> length for
> >>>>>>> the pipe/endpoint, except the final one that can be a short
> >>>>>>> packet.
> >>>>>>> Without the
> >>>>>>> alignment I make for xfr_bytes, short packets can occur
> >>>>>>> within
> >>>>>>> a
> >>>>>>> transfer,
> >>>>>>> because the hardware starts a new packet for each new queued
> >>>>>>> qTD
> >>>>>>> it
> >>>>>>> handles.
> >>>>>> But if I am right, the max packet length is 512 for bulk and
> >>>>>> 1024
> >>>>>> for
> >>>>>> Interrupt transfer.
> >>>>> There are indeed different max packet lengths for different
> >>>>> transfer types, but
> >>>>> it does not matter since the chosen alignment guarantees a
> >>>>> multiple
> >>>>> of all these
> >>>>> possible max packet lengths.
> >>>> But thereby you limit the transfer to 4 qT buffers for unaligned
> >>>> transfers.
> >>> Not exactly. The 5 qt_buffers are used for page-unaligned
> >>> buffers,
> >>> but that
> >>> results in only 4 full pages of unaligned data, requiring 5
> >>> aligned
> >>> pages.
> >> Sorry I mean 4 full pages of unaligned data.
> >>> For page-aligned buffers, the 5 qt_buffers result in 5 full pages
> >>> of aligned
> >>> data.
> >> Sure.
> >>> The unaligned case could be a little bit improved to always use
> >>> as
> >>> many packets
> >>> as possible per qTD, but that would over-complicate things for a
> >>> very negligible
> >>> speed and memory gain.
> >> In my use case (fragmented file on usb storage)  the gain would be
> >> nearly 20%. The reason is that the data are block aligned (512)
> >> and
> >> could be aligned to 4096 with the first transfer (5 qt_buffers).
> > Can you explain where this gain would come from? In both cases, the
> > data in USB
> > transfers would be organized in the same way, and it would be
> > accessed in memory
> > also in the same way (regarding bursts). The only difference would
> > be the fetch
> > time of a little bit more qTDs, which is extremely fast and
> > insignificant
> > compared to the transfer time of the payload, which remains
> > unchanged.
> You are right, the speed different will be minimal, only the memory
> usage will be lower.

If your point is only the memory gain, I agree. With your suggestion, there are
roughly 25% less qTDs used in the "(max wMaxPacketSize)-aligned but not
page-aligned" case since the number of qTDs is about (total transfer size) / 5
instead of (total transfer size) / 4. But this is still small compared to usual
heap sizes (at least on the kind of hardware I use).

> > Moreover, in your use case, if you are e.g. using FAT, on the one
> > hand, the
> > buffers in fat.c are never aligned to more than the DMA min
> > alignment, and on
> > the other hand, if you can align your user buffers to 512 bytes,
> > you can also
> > align them directly to 4 kB.
> The user buffer is aligned to 4kB, but this doesn't matter as a file
> load from a storage device (ex. fatload) can be segmented in partial
> USB
> transfers. This can lead to any block aligned buffer for a partial
> transfer.

What do you mean by "partial USB transfers"? As seen from EHCI users like the
MSC driver (usb_storage.c), USB transfers either succeed or fail, but they
cannot be "segmented".

On its side, the MSC driver will only segment the FAT layer requests if they are
larger than 65535 blocks, so still not what you describe.

As to the FAT stack, it will only read whole clusters while accessing file
payload, and the most usual cluster sizes are by default a multiple of 4 kiB
(see http://support.microsoft.com/kb/140365).

So I don't see "segmentation" anywhere, and for usual cluster sizes, the EHCI
buffer alignment is fully determined by the applicative buffer alignment and the
file position corresponding to the beginning of the applicative buffer. But
there are indeed some unusual use cases (e.g. smaller clusters) for which only
a block-aligned buffer will reach EHCI despite a page-aligned applicative
buffer.

> >> My suggestion would be to truncate the xfr_bytes with the max
> >> wMaxPacketSize (1024) and for the qtd_count use:
> >>
> >> if ((uint32_t)buffer & 1023)    /* wMaxPacketSize unaligned */
> >>       qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) +
> >>               length, (QT_BUFFER_CNT - 1) * 4096);
> >> else                /* wMaxPacketSize aligned */
> >>       qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) +
> >>               length, QT_BUFFER_CNT * 4096);
> >>
> >> This allows 50% of unaligned block data (512) to be transferred
> >> with
> >> min
> >> qTDs.
> > That would also require a realignment-to-page stage. This is
> > specific code for
> > specific buffer alignment from the upper layers. We could also skip
> > the
> > realignment to page and always keep the same qTD transfer size
> > except for the
> > last one, by adding as many packets as possible for the buffer
> > alignment.
> What you mean by realignment-to-page stage?

I mean that the alignment of the transfer to 1024 instead of 4096 can make the
first qTD transfer larger than the following ones, which guarantees that the
following qTD transfers are page-aligned, even if the first one was only aligned
to 1024. For the 1024-aligned case, this results in the change that you suggest,
but it also changes things for the unaligned case, which makes this part of the
code inaccurate. See below.

> > But I still don't see a significant reason to complicate code to do
> > that.
> I don't understand where you expect to complicate the code.
> 
> You limit the size of one transfer (xfr_bytes) to (QT_BUFFER_CNT - 1)
> *
> 4kB for unaligned buffers. But you only need to limit it to a
> multiple
> of the maximal possible wMaxPacketSize (1kB) to make sure that you
> always send full packages.
> 
> I only suggest to replace the causeless 4kB alignment with a
> reasonable
> 1kB alignment and adapte the qtd_count caluclation.
> 
>                          int xfr_bytes = min(left_length,
>                                              (QT_BUFFER_CNT * 4096 -
>                                               ((uint32_t)buf_ptr &
>                                               4095)) &
> -                                           ~4095);
> +                                           ~1023);

I agree for this part of the code. But for the allocation part, your suggestion
is already a little bit more complicated than my original code, while still
incomplete. Besides that, the "((uint32_t)buffer & 4095) +" for the
page-unaligned case in my code was actually useless, which emphasizes the
difference, even if it's a light complication.

For the allocation part, the appropriate code for your suggestion would be:

		if ((uint32_t)buffer & 1023) /* max-wMaxPacketSize-unaligned */
			qtd_count += DIV_ROUND_UP(max(length > 0, length -
				(4096 - ((uint32_t)buf_ptr & 4095) & ~1023)),
					(QT_BUFFER_CNT - 1) * 4096);
		else			     /* max-wMaxPacketSize-aligned */
			qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) +
					length, QT_BUFFER_CNT * 4096);

This code allocates exactly the required number of qTDs, no less, no more. It's
clearly more complicated than the 4096 version.

A test should also be added to make sure that qtd_count is never 0. Otherwise,
ZLPs are broken (this applies to my original code too).

If we want to compromise accuracy for simplicity, we can change that to:

		qtd_count += 2 + length /
			((QT_BUFFER_CNT - !!((uint32_t)buffer & 1023)) * 4096);

This code allocates enough qTDs for all cases, with at worst 2 extra qTDs (i.e.
a total of 128 bytes) that will be left unused. It also handles intrinsically
ZLPs.

Now, let's consider the possible values of wMaxPacketSize:
 - control endpoints:
    * LS: 8,
    * FS: 8, 16, 32 or 64,
    * HS: 64,
 - isochronous endpoints: not supported by ehci-hcd.c,
 - interrupt endpoints:
    * LS: <= 8,
    * FS: <= 64,
    * HS: <= 1024 (1x to 3x for high-bandwidth),
 - bulk endpoints:
    * LS: N/A,
    * FS: 8, 16, 32 or 64,
    * HS: 512.

My code assumes that wMaxPacketSize is a power of 2. This is not always true for
interrupt endpoints. Let's talk about these. Their handling is currently broken
in U-Boot since their transfers are made asynchronous instead of periodic.
Devices shouldn't care too much about that, as long as transfers do not exceed
wMaxPacketSize, in which case my code still works because wMaxPacketSize always
fits in a single qTD. Interrupt transfers larger than wMaxPacketSize do not seem
to be used by U-Boot. If they were used, the current code in U-Boot would have a
timing issue because the asynchronous scheme would break the interval requested
by devices, which could at worst make them fail in some way. So the only
solution would be that such transfers be split by the caller of submit_int_msg,
in which case my code still works. What would you think about failing with an
error message in submit_int_msg if length is larger than wMaxPacketSize? Marek,
what do you think?

For all other cases, wMaxPacketSize is a power of 2, so everything is fine,
except that in those cases wMaxPacketSize is@most 512, which means that with
the suggested limitation applied to submit_int_msg, your suggested 1024 could be
replaced with 512, which is good news since this is also the most common storage
sector size.

We could even use usb_maxpacket(dev, pipe) instead of 512, with some
restrictions. If we don't want to alter the misalignment check in
ehci_td_buffer, max(usb_maxpacket(dev, pipe), ARCH_DMA_MINALIGN) would actually
have to be used. This misalignment check could be limited to the first qTD
transfer of a USB transfer, but that would complicate things, all the more the
corresponding call to flush_dcache_range would have to be modified to fix
alignments.

So we have to make two choices:
 - between 4096, 1024, 512 and max(usb_maxpacket(dev, pipe), ARCH_DMA_MINALIGN),
 - between the accurate and simple allocations.
That makes a total of 8 working possibilities. What do you guys think we should
choose? On my side, I like max(usb_maxpacket(dev, pipe), ARCH_DMA_MINALIGN) with
the simple allocation. It's efficient as to code speed, size and readability, as
well as to RAM usage.

> > BTW, the 15x speed gain that I gave in my patch description was
> > compared to an
> > older version of the original code that used 20 blocks per transfer
> > in
> > usb_storage.c. This is now 40 blocks per transfer with a
> > page-aligned buffer, so
> > the speed gain compared to the current code should be rather about
> > 7x. I should
> > update that.
> I'm sure that there is a significant speed gain but you shouldn't
> miss
> the heap usage as the main CONFIG_SYS_MALLOC_LEN is 128kB.

I have checked all the config files. Among those using EHCI, most have a heap
size >= 1 MiB. The only exceptions are:

---------------------------------------------
| Heap Size |      Board       |  RAM Size  |
|-------------------------------------------|
|   512 kiB | MERGERBOX        |   > 10 MiB |
|           | MPC8315ERDB      |    128 MiB |
|           | MVBLM7           |    512 MiB |
|-------------------------------------------|
|   256 kiB | MPC8349ITX       |    256 MiB |
|           | omap4_panda      |      1 GiB |
|-------------------------------------------|
|   128 kiB | adp-ag102        |    256 MiB |
|           | at91sam9m10g45ek |    128 MiB |
|           | edminiv2         |     64 MiB |
|           | M52277EVB        |     64 MiB |
|           | omap3_beagle     | >= 128 MiB |
---------------------------------------------

As you can see, these small heap sizes are not linked to any hardware
constraint, but only to the lack of need to have larger heaps, so they could be
easily enlarged if needed. But even 128 kiB should be enough for common usage.

> Maybe you should also add a worst case heap usage and I'm not sure,
> if
> your calculation are right, as the size of struct qTD is allays 32B
> and
> thereby I get 50kB or 64kB.