From mboxrd@z Thu Jan 1 00:00:00 1970
From: Stefan Herbrechtsmeier
Date: Tue, 31 Jul 2012 21:52:04 +0200
Subject: [U-Boot] [PATCH v2 2/5] ehci-hcd: Boost transfer speed
In-Reply-To: <1468394961.845242.1343696760020.JavaMail.root@advansee.com>
References: <1468394961.845242.1343696760020.JavaMail.root@advansee.com>
Message-ID: <50183764.7000804@herbrechtsmeier.net>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: u-boot@lists.denx.de

On 31.07.2012 03:06, Benoît Thébaudeau wrote:
> Dear Marek Vasut,
>
> On Tue, Jul 31, 2012 at 12:38:54 AM, Marek Vasut wrote:
>> [...]
>>
>>>>> Can you explain where this gain would come from? In both cases, the
>>>>> data in USB transfers would be organized in the same way, and it
>>>>> would be accessed in memory also in the same way (regarding
>>>>> bursts). The only difference would be the fetch time of a few more
>>>>> qTDs, which is extremely fast and insignificant compared to the
>>>>> transfer time of the payload, which remains unchanged.
>>>> You are right, the speed difference will be minimal; only the memory
>>>> usage will be lower.
>>> If your point is only the memory gain, I agree. With your suggestion,
>>> there are roughly 25% fewer qTDs used in the "(max
>>> wMaxPacketSize)-aligned but not page-aligned" case since the number
>>> of qTDs is about (total transfer size) / 5 instead of (total transfer
>>> size) / 4. But this is still small compared to usual heap sizes (at
>>> least on the kind of hardware I use).
>> Ok, I see the point. I understand it's not really a bug, just an
>> improvement.
> Exactly.
>
>> Maybe we can do a subsequent patch on top of these from Benoit and see
>> how it fares?
> If you wish. I'll do that.
>
>>>>> Moreover, in your use case, if you are e.g.
>>>>> using FAT, on the one hand, the buffers in fat.c are never aligned
>>>>> to more than the DMA min alignment, and on the other hand, if you
>>>>> can align your user buffers to 512 bytes, you can also align them
>>>>> directly to 4 kB.
>>>> The user buffer is aligned to 4 kB, but this doesn't matter as a
>>>> file load from a storage device (e.g. fatload) can be segmented
>>>> into partial USB transfers. This can lead to any block-aligned
>>>> buffer for a partial transfer.
>>> What do you mean by "partial USB transfers"? As seen from EHCI
>>> users like the MSC driver (usb_storage.c), USB transfers either
>>> succeed or fail, but they cannot be "segmented".
>> Segmented -- like multiple transfers being issued with a small
>> payload?
Right.
>> You cannot put these together at the USB level, since it's the
>> issuing code that has to be fixed.
If the segmentation comes from the file system handling, we cannot
avoid it.
>>> On its side, the MSC driver will only segment the FAT layer
>>> requests if they are larger than 65535 blocks, so still not what
>>> you describe.
>>>
>>> As to the FAT stack, it will only read whole clusters while
>>> accessing file payload, and the most usual cluster sizes are by
>>> default a multiple of 4 kiB (see
>>> http://support.microsoft.com/kb/140365).
>> 512b is minimum and it's quite often used.
> OK.
In my example I use a FAT partition of 128 MB with 1 kB clusters. The
file is read in two segments: the first transfer starts 4 kB aligned
but stops at a 1 kB boundary that is not 4 kB aligned, which leads to
an unaligned second transfer.
>>> So I don't see "segmentation" anywhere, and for usual cluster
>>> sizes, the EHCI buffer alignment is fully determined by the
>>> applicative buffer alignment and the file position corresponding to
>>> the beginning of the applicative buffer. But there are indeed some
>>> unusual use cases (e.g.
>>> smaller clusters) for which only a block-aligned buffer will reach
>>> EHCI despite a page-aligned applicative buffer.
>> I don't quite get this one.
> I meant that 512 bytes (the most usual storage block size) is what we
> should aim at to optimize the number of qTDs.
Right.
>
>>>>>> My suggestion would be to truncate the xfr_bytes with the max
>>>>>> wMaxPacketSize (1024) and for the qtd_count use:
>>>>>>
>>>>>> if ((uint32_t)buffer & 1023) /* wMaxPacketSize unaligned */
>>>>>> 	qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) +
>>>>>> 		length, (QT_BUFFER_CNT - 1) * 4096);
>>>>>> else /* wMaxPacketSize aligned */
>>>>>> 	qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) +
>>>>>> 		length, QT_BUFFER_CNT * 4096);
>>>>>>
>>>>>> This allows 50% of unaligned block data (512) to be transferred
>>>>>> with min qTDs.
>>>>> That would also require a realignment-to-page stage. This is
>>>>> specific code for specific buffer alignment from the upper
>>>>> layers. We could also skip the realignment to page and always
>>>>> keep the same qTD transfer size except for the last one, by
>>>>> adding as many packets as possible for the buffer alignment.
>>>> What do you mean by a realignment-to-page stage?
>>> I mean that the alignment of the transfer to 1024 instead of 4096
>>> can make the first qTD transfer larger than the following ones,
>>> which guarantees that the following qTD transfers are page-aligned,
>>> even if the first one was only aligned to 1024. For the
>>> 1024-aligned case, this results in the change that you suggest, but
>>> it also changes things for the unaligned case, which makes this
>>> part of the code inaccurate. See below.
You are right. It maximises the first transfer. All other transfers
are 5 * 4 kB (aligned) or 4 * 4 kB (unaligned) long.
>>>>> But I still don't see a significant reason to complicate the code
>>>>> to do that.
>>>> I don't understand where you expect to complicate the code.
>>>>
>>>> You limit the size of one transfer (xfr_bytes) to (QT_BUFFER_CNT -
>>>> 1) * 4 kB for unaligned buffers. But you only need to limit it to
>>>> a multiple of the maximal possible wMaxPacketSize (1 kB) to make
>>>> sure that you always send full packets.
>>>>
>>>> I only suggest replacing the needless 4 kB alignment with a
>>>> reasonable 1 kB alignment and adapting the qtd_count calculation.
>>>>
>>>> 	int xfr_bytes = min(left_length,
>>>> 			    (QT_BUFFER_CNT * 4096 -
>>>> 			     ((uint32_t)buf_ptr & 4095)) &
>>>> -			    ~4095);
>>>> +			    ~1023);
>>> I agree for this part of the code. But for the allocation part,
>>> your suggestion is already a little bit more complicated than my
>>> original code, while still incomplete. Besides that, the
>>> "((uint32_t)buffer & 4095) +" for the page-unaligned case in my
>>> code was actually useless, which emphasizes the difference, even if
>>> it's a light complication.
>>>
>>> For the allocation part, the appropriate code for your suggestion
>>> would be:
>>>
>>> if ((uint32_t)buffer & 1023) /* max-wMaxPacketSize-unaligned */
>>> 	qtd_count +=
>>> 		DIV_ROUND_UP(
>>> 			max(
>>> 				length > 0,
>>> 				length - (4096 - ((uint32_t)buf_ptr &
>>> 					  4095) & ~1023)
>>> 			),
>>> 			(QT_BUFFER_CNT - 1) * 4096
>>> 		);
>> Ok, I now think I understand what's going on here. I still have to
>> wonder how much the compiler would optimize of this if you
>> "decompressed" your code -- to make it easier to understand.
> I wouldn't go for this complicated version since it's not really
> useful compared to the simpler yet less accurate solution I gave
> below.
You are right, but your complicated code only saves one qTD.
>
>>> else /* max-wMaxPacketSize-aligned */
>>> 	qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) +
>>> 		length, QT_BUFFER_CNT * 4096);
>>>
>>> This code allocates exactly the required number of qTDs, no less,
>>> no more.
>>> It's clearly more complicated than the 4096 version.
>>>
>>> A test should also be added to make sure that qtd_count is never 0.
>>> Otherwise, ZLPs are broken (this applies to my original code too).
>>>
>>> If we want to compromise accuracy for simplicity, we can change
>>> that to:
>>>
>>> qtd_count += 2 + length /
>>> 	((QT_BUFFER_CNT - !!((uint32_t)buffer & 1023)) * 4096);
> It's this solution I'd like to use to optimize the number of qTDs
> (with 1023 or something else).
This sounds reasonable.
>
>>> This code allocates enough qTDs for all cases, with at worst 2
>>> extra qTDs (i.e. a total of 128 bytes) that will be left unused. It
>>> also intrinsically handles ZLPs.
>>>
>>> Now, let's consider the possible values of wMaxPacketSize:
>>> - control endpoints:
>>>   * LS: 8,
>>>   * FS: 8, 16, 32 or 64,
>>>   * HS: 64,
>>> - isochronous endpoints: not supported by ehci-hcd.c,
>>> - interrupt endpoints:
>>>   * LS: <= 8,
>>>   * FS: <= 64,
>>>   * HS: <= 1024 (1x to 3x for high-bandwidth),
>>> - bulk endpoints:
>>>   * LS: N/A,
>>>   * FS: 8, 16, 32 or 64,
>>>   * HS: 512.
>>>
>>> My code assumes that wMaxPacketSize is a power of 2. This is not
>>> always true for interrupt endpoints. Let's talk about these. Their
>>> handling is currently broken in U-Boot since their transfers are
>>> made asynchronous instead of periodic. Devices shouldn't care too
>>> much about that, as long as transfers do not exceed wMaxPacketSize,
>>> in which case my code still works because wMaxPacketSize always
>>> fits in a single qTD. Interrupt transfers larger than
>>> wMaxPacketSize do not seem to be used by U-Boot. If they were used,
>>> the current code in U-Boot would have a timing issue because the
>>> asynchronous scheme would break the interval requested by devices,
>>> which could at worst make them fail in some way. So the only
>>> solution would be that such transfers be split by the caller of
>>> submit_int_msg, in which case my code still works.
>>> What would you think about failing with an error message in
>>> submit_int_msg if length is larger than wMaxPacketSize? Marek, what
>>> do you think?
>> Let's do that ... I think the interrupt endpoint is only used for
>> the keyboard, and if someone needs it for something else, the code
>> will be there, just needing improvement. Comment and error message
>> are OK.
> OK. I have thought of another solution for this. You'll tell me which
> one you prefer.
>
> The ehci_submit_async code currently in U-Boot checks through
> ehci_td_buffer that length fits in the single qTD reserved for data
> payload only after work has begun, possibly after a SETUP transfer.
> With my series, this is checked at the very beginning, before the
> allocation. We could detect that wMaxPacketSize is not a power of 2
> (e.g. with __builtin_popcount), in which case the allocation for the
> data payload would be restricted to 1 qTD like now, and there would
> be a check at the very beginning to test if length fits in this qTD.
> In that way, there could be several packets per interrupt transfer as
> long as it fits in a single qTD, just like now, contrary to the
> limitation imposed by the error in submit_int_msg. But I'm not sure
> it's a good idea to allow this behavior.
I think this is not needed, as there is only one user (the keyboard)
with a max packet size of 8 bytes.
>
>>> For all other cases, wMaxPacketSize is a power of 2, so everything
>>> is fine, except that in those cases wMaxPacketSize is at most 512,
>>> which means that with the suggested limitation applied to
>>> submit_int_msg, your suggested 1024 could be replaced with 512,
>>> which is good news since this is also the most common storage
>>> sector size.
>>>
>>> We could even use usb_maxpacket(dev, pipe) instead of 512, with
>>> some restrictions. If we don't want to alter the misalignment check
>>> in ehci_td_buffer, max(usb_maxpacket(dev, pipe), ARCH_DMA_MINALIGN)
>>> would actually have to be used.
>>> This misalignment check could be limited to the first qTD transfer
>>> of a USB transfer, but that would complicate things, all the more
>>> as the corresponding call to flush_dcache_range would have to be
>>> modified to fix alignments.
>>>
>>> So we have to make two choices:
>>> - between 4096, 1024, 512 and max(usb_maxpacket(dev, pipe),
>>>   ARCH_DMA_MINALIGN),
>>> - between the accurate and simple allocations.
>>> That makes a total of 8 working possibilities. What do you guys
>>> think we should choose? On my side, I like max(usb_maxpacket(dev,
>>> pipe), ARCH_DMA_MINALIGN)
>> Won't maxpacket fall below 512 on occasions,
> Sure.
I would go with 512 as it is also the most common storage sector size.
>> which might cause trouble?
> Why?
>
>>> with the simple allocation. It's efficient as to code speed, size
>>> and readability, as well as to RAM usage.
>> For now, I'd go for the safest, easiest and dumbest solution and see
>> how it fares. A subsequent patch can be submitted to improve that
>> and measurements made.
>>
>> "We should forget about small efficiencies, say about 97% of the
>> time; premature optimization is the root of all evil"
>>     -- Donald E. Knuth, Structured Programming with go to Statements
>> [...]
> OK, so I'll stick to my original series, rework it lightly as we
> said, add Jim's patch, and add a further patch for these
> optimizations.
Okay.
>>> So we could perhaps issue a #error in ehci-hcd or in usb_storage if
>>> CONFIG_SYS_MALLOC_LEN is not large enough, but I don't think it's a
>>> good idea because:
>>> - the threshold value would have to depend on runtime block sizes
>>>   or something, which could lead to a big worst worst case that
>>>   would almost never happen in real life, so giving such an
>>>   unrealistic heap size constraint would be cumbersome,
>> #warning then?
> With which limit if so?
I would expect more than 128 kB, as this is a common worst case
(512 B block size).
>
>>> - reaching the top sizes would mean reading a huge file or
>>>   something to a large buffer (much larger than the qTDs this
>>>   transfer requires), which would very likely be heap-allocated
>>>   (except for commands like fatload), so CONFIG_SYS_MALLOC_LEN
>>>   would already have to be large for the application,
>>> - for command line operations like fatload, transfers of
>>>   uncontrolled lengths could simply crash U-Boot if they go too far
>>>   in memory
>> Why, because they overwrite it?
> Yes. U-Boot expands its allocations downward during startup, so it's
> often located at the end of the embedded RAM, which means that
> fatload will very likely use the beginning of the RAM.
>>>   , which means that users of such commands need to know what they
>>>   are doing anyway, so they have to control transfer sizes,
>>> - there is already a runtime error displayed in case of allocation
>>>   failure.
>> Ok
> So, #warning or not, besides this?
>
>>> Marek, what do you think?
>> Had a good evening with the EHCI r10 spec, hope I answered most of
>> your questions.
> Yes, thanks.

Regards,
Stefan