From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stefan Herbrechtsmeier Date: Tue, 24 Jul 2012 15:02:00 +0200 Subject: [U-Boot] [PATCH v2 2/5] ehci-hcd: Boost transfer speed In-Reply-To: <962510849.461629.1343063751268.JavaMail.root@advansee.com> References: <962510849.461629.1343063751268.JavaMail.root@advansee.com> Message-ID: <500E9CC8.2030507@herbrechtsmeier.net> List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: u-boot@lists.denx.de Am 23.07.2012 19:15, schrieb Beno?t Th?baudeau: > On Monday 23 July 2012 15:35:25 Stefan Herbrechtsmeier wrote: >> Am 20.07.2012 17:35, schrieb Beno?t Th?baudeau: >>> On Friday 20 July 2012 17:15:13 Stefan Herbrechtsmeier wrote: >>>> Am 20.07.2012 17:03, schrieb Beno?t Th?baudeau: >>>>> On Friday 20 July 2012 16:51:33 Stefan Herbrechtsmeier wrote: >>>>>> Am 20.07.2012 15:56, schrieb Beno?t Th?baudeau: >>>>>>> Dear Marek Vasut, >>>>>>> >>>>>>> On Friday 20 July 2012 15:44:01 Marek Vasut wrote: >>>>>>>>> On Friday 20 July 2012 13:37:37 Stefan Herbrechtsmeier wrote: >>>>>>>>>> Am 20.07.2012 13:26, schrieb Beno?t Th?baudeau: >>>>>>>>>>> + int xfr_bytes = min(left_length, >>>>>>>>>>> + (QT_BUFFER_CNT * 4096 - >>>>>>>>>>> + ((uint32_t)buf_ptr & 4095)) & >>>>>>>>>>> + ~4095); >>>>>>>>>> Why you align the length to 4096? >>>>>>>>> It's to guarantee that each transfer length is a multiple of >>>>>>>>> the >>>>>>>>> max packet >>>>>>>>> length. Otherwise, early short packets are issued, which >>>>>>>>> breaks >>>>>>>>> the >>>>>>>>> transfer and results in time-out error messages. >>>>>>>> Early short packets ? What do you mean? >>>>>>> During a USB transfer, all packets must have a length of max >>>>>>> packet >>>>>>> length for >>>>>>> the pipe/endpoint, except the final one that can be a short >>>>>>> packet. >>>>>>> Without the >>>>>>> alignment I make for xfr_bytes, short packets can occur within >>>>>>> a >>>>>>> transfer, >>>>>>> because the hardware starts a new packet for each new queued >>>>>>> qTD >>>>>>> it >>>>>>> handles. >>>>>> But if I am right, the max packet length is 512 for bulk and >>>>>> 1024 >>>>>> for >>>>>> Interrupt transfer. >>>>> There are indeed different max packet lengths for different >>>>> transfer types, but >>>>> it does not matter since the chosen alignment guarantees a >>>>> multiple >>>>> of all these >>>>> possible max packet lengths. >>>> But thereby you limit the transfer to 4 qT buffers for unaligned >>>> transfers. >>> Not exactly. The 5 qt_buffers are used for page-unaligned buffers, >>> but that >>> results in only 4 full pages of unaligned data, requiring 5 aligned >>> pages. >> Sorry I mean 4 full pages of unaligned data. >>> For page-aligned buffers, the 5 qt_buffers result in 5 full pages >>> of aligned >>> data. >> Sure. >>> The unaligned case could be a little bit improved to always use as >>> many packets >>> as possible per qTD, but that would over-complicate things for a >>> very negligible >>> speed and memory gain. >> In my use case (fragmented file on usb storage) the gain would be >> nearly 20%. The reason is that the data are block aligned (512) and >> could be aligned to 4096 with the first transfer (5 qt_buffers). > Can you explain where this gain would come from? In both cases, the data in USB > transfers would be organized in the same way, and it would be accessed in memory > also in the same way (regarding bursts). The only difference would be the fetch > time of a little bit more qTDs, which is extremely fast and insignificant > compared to the transfer time of the payload, which remains unchanged. You are right, the speed different will be minimal, only the memory usage will be lower. > Moreover, in your use case, if you are e.g. using FAT, on the one hand, the > buffers in fat.c are never aligned to more than the DMA min alignment, and on > the other hand, if you can align your user buffers to 512 bytes, you can also > align them directly to 4 kB. The user buffer is aligned to 4kB, but this doesn't matter as a file load from a storage device (ex. fatload) can be segmented in partial USB transfers. This can lead to any block aligned buffer for a partial transfer. > >> My suggestion would be to truncate the xfr_bytes with the max >> wMaxPacketSize (1024) and for the qtd_count use: >> >> if ((uint32_t)buffer & 1023) /* wMaxPacketSize unaligned */ >> qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) + >> length, (QT_BUFFER_CNT - 1) * 4096); >> else /* wMaxPacketSize aligned */ >> qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) + >> length, QT_BUFFER_CNT * 4096); >> >> This allows 50% of unaligned block data (512) to be transferred with >> min >> qTDs. > That would also require a realignment-to-page stage. This is specific code for > specific buffer alignment from the upper layers. We could also skip the > realignment to page and always keep the same qTD transfer size except for the > last one, by adding as many packets as possible for the buffer alignment. What you mean by realignment-to-page stage? > But I still don't see a significant reason to complicate code to do that. I don't understand where you expect to complicate the code. You limit the size of one transfer (xfr_bytes) to (QT_BUFFER_CNT - 1) * 4kB for unaligned buffers. But you only need to limit it to a multiple of the maximal possible wMaxPacketSize (1kB) to make sure that you always send full packages. I only suggest to replace the causeless 4kB alignment with a reasonable 1kB alignment and adapte the qtd_count caluclation. int xfr_bytes = min(left_length, (QT_BUFFER_CNT * 4096 - ((uint32_t)buf_ptr & 4095)) & - ~4095); + ~1023); > BTW, the 15x speed gain that I gave in my patch description was compared to an > older version of the original code that used 20 blocks per transfer in > usb_storage.c. This is now 40 blocks per transfer with a page-aligned buffer, so > the speed gain compared to the current code should be rather about 7x. I should > update that. I'm sure that there is a significant speed gain but you shouldn't miss the heap usage as the main CONFIG_SYS_MALLOC_LEN is 128kB. Maybe you should also add a worst case heap usage and I'm not sure, if your calculation are right, as the size of struct qTD is allays 32B and thereby I get 50kB or 64kB. Best regards, Stefan