From mboxrd@z Thu Jan 1 00:00:00 1970
From: Stefan Herbrechtsmeier
Date: Tue, 31 Jul 2012 21:52:04 +0200
Subject: [U-Boot] [PATCH v2 2/5] ehci-hcd: Boost transfer speed
In-Reply-To: <1468394961.845242.1343696760020.JavaMail.root@advansee.com>
References: <1468394961.845242.1343696760020.JavaMail.root@advansee.com>
Message-ID: <50183764.7000804@herbrechtsmeier.net>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: u-boot@lists.denx.de

On 31.07.2012 03:06, Benoît Thébaudeau wrote:
> Dear Marek Vasut,
>
> On Tue, Jul 31, 2012 at 12:38:54 AM, Marek Vasut wrote:
>> [...]
>>
>>>>> Can you explain where this gain would come from? In both cases, the
>>>>> data in USB transfers would be organized in the same way, and it
>>>>> would be accessed in memory also in the same way (regarding
>>>>> bursts). The only difference would be the fetch time of a few more
>>>>> qTDs, which is extremely fast and insignificant compared to the
>>>>> transfer time of the payload, which remains unchanged.
>>>> You are right, the speed difference will be minimal; only the memory
>>>> usage will be lower.
>>> If your point is only the memory gain, I agree. With your suggestion,
>>> there are roughly 25% fewer qTDs used in the "(max
>>> wMaxPacketSize)-aligned but not page-aligned" case since the number
>>> of qTDs is about (total transfer size) / 5 instead of (total transfer
>>> size) / 4. But this is still small compared to usual heap sizes (at
>>> least on the kind of hardware I use).
>> Ok, I see the point. I understand it's not really a bug, just an
>> improvement.
> Exactly.
>
>> Maybe we can do a subsequent patch on top of these from Benoit and see
>> how it fares?
> If you wish. I'll do that.
>
>>>>> Moreover, in your use case, if you are e.g.
>>>>> using FAT, on the one hand, the buffers in fat.c are never aligned
>>>>> to more than the DMA min alignment, and on the other hand, if you
>>>>> can align your user buffers to 512 bytes, you can also align them
>>>>> directly to 4 kB.
>>>> The user buffer is aligned to 4 kB, but this doesn't matter as a
>>>> file load from a storage device (e.g. fatload) can be segmented
>>>> into partial USB transfers. This can lead to any block-aligned
>>>> buffer for a partial transfer.
>>> What do you mean by "partial USB transfers"? As seen from EHCI
>>> users like the MSC driver (usb_storage.c), USB transfers either
>>> succeed or fail, but they cannot be "segmented".
>> Segmented -- like multiple transfers being issued with a small
>> payload?
Right.
>> You cannot put these together at the USB level, since it's the
>> issuing code that has to be fixed.
If the segmentation comes from the file system handling, we cannot
avoid it.
>>> On its side, the MSC driver will only segment the FAT layer
>>> requests if they are larger than 65535 blocks, so still not what
>>> you describe.
>>>
>>> As to the FAT stack, it will only read whole clusters while
>>> accessing file payload, and the most usual cluster sizes are by
>>> default a multiple of 4 kiB (see
>>> http://support.microsoft.com/kb/140365).
>> 512b is minimum and it's quite often used.
> OK.
In my example I use a FAT partition of 128 MB with 1 kB clusters. The
file is read in two segments: the first transfer starts 4 kB aligned
but stops at a 1 kB boundary that is not 4 kB aligned, which leads to
an unaligned second transfer.
>>> So I don't see "segmentation" anywhere, and for usual cluster
>>> sizes, the EHCI buffer alignment is fully determined by the
>>> applicative buffer alignment and the file position corresponding to
>>> the beginning of the applicative buffer. But there are indeed some
>>> unusual use cases (e.g.
>>> smaller clusters) for which only a block-aligned buffer will reach
>>> EHCI despite a page-aligned applicative buffer.
>> I don't quite get this one.
> I meant that 512 bytes (the most usual storage block size) is what we
> should aim at to optimize the number of qTDs.
Right.
>
>>>>>> My suggestion would be to truncate the xfr_bytes with the max
>>>>>> wMaxPacketSize (1024) and for the qtd_count use:
>>>>>>
>>>>>> if ((uint32_t)buffer & 1023) /* wMaxPacketSize unaligned */
>>>>>> 	qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) +
>>>>>> 		length, (QT_BUFFER_CNT - 1) * 4096);
>>>>>> else /* wMaxPacketSize aligned */
>>>>>> 	qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) +
>>>>>> 		length, QT_BUFFER_CNT * 4096);
>>>>>>
>>>>>> This allows 50% of unaligned block data (512) to be transferred
>>>>>> with min qTDs.
>>>>> That would also require a realignment-to-page stage. This is
>>>>> specific code for specific buffer alignment from the upper
>>>>> layers. We could also skip the realignment to page and always
>>>>> keep the same qTD transfer size except for the last one, by
>>>>> adding as many packets as possible for the buffer alignment.
>>>> What do you mean by a realignment-to-page stage?
>>> I mean that the alignment of the transfer to 1024 instead of 4096
>>> can make the first qTD transfer larger than the following ones,
>>> which guarantees that the following qTD transfers are page-aligned,
>>> even if the first one was only aligned to 1024. For the
>>> 1024-aligned case, this results in the change that you suggest, but
>>> it also changes things for the unaligned case, which makes this
>>> part of the code inaccurate. See below.
You are right. It maximises the first transfer. All other transfers
are 5 * 4 kB (aligned) or 4 * 4 kB (unaligned) long.
>>>>> But I still don't see a significant reason to complicate the code
>>>>> to do that.
>>>> I don't understand where you expect to complicate the code.
>>>>
>>>> You limit the size of one transfer (xfr_bytes) to (QT_BUFFER_CNT -
>>>> 1) * 4 kB for unaligned buffers. But you only need to limit it to
>>>> a multiple of the maximal possible wMaxPacketSize (1 kB) to make
>>>> sure that you always send full packets.
>>>>
>>>> I only suggest replacing the needless 4 kB alignment with a
>>>> reasonable 1 kB alignment and adapting the qtd_count calculation.
>>>>
>>>> 	int xfr_bytes = min(left_length,
>>>> 			    (QT_BUFFER_CNT * 4096 -
>>>> 			     ((uint32_t)buf_ptr & 4095)) &
>>>> -			    ~4095);
>>>> +			    ~1023);
>>> I agree for this part of the code. But for the allocation part,
>>> your suggestion is already a little bit more complicated than my
>>> original code, while still incomplete. Besides that, the
>>> "((uint32_t)buffer & 4095) +" for the page-unaligned case in my
>>> code was actually useless, which emphasizes the difference, even if
>>> it's a light complication.
>>>
>>> For the allocation part, the appropriate code for your suggestion
>>> would be:
>>>
>>> if ((uint32_t)buffer & 1023) /* max-wMaxPacketSize-unaligned */
>>> 	qtd_count +=
>>> 		DIV_ROUND_UP(
>>> 			max(
>>> 				length > 0,
>>> 				length - (4096 - ((uint32_t)buf_ptr &
>>> 					  4095) & ~1023)
>>> 			),
>>> 			(QT_BUFFER_CNT - 1) * 4096
>>> 		);
>> Ok, I now think I understand what's going on here. I still have to
>> wonder how much the compiler would optimize of this if you
>> "decompressed" your code -- to make it easier to understand.
> I wouldn't go for this complicated version since it's not really
> useful compared to the simpler yet less accurate solution I gave
> below.
You are right, but your complicated code only saves one qTD.
>
>>> else /* max-wMaxPacketSize-aligned */
>>> 	qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) +
>>> 		length, QT_BUFFER_CNT * 4096);
>>>
>>> This code allocates exactly the required number of qTDs, no less,
>>> no more.
>>> It's clearly more complicated than the 4096 version.
>>>
>>> A test should also be added to make sure that qtd_count is never 0.
>>> Otherwise, ZLPs are broken (this applies to my original code too).
>>>
>>> If we want to compromise accuracy for simplicity, we can change
>>> that to:
>>>
>>> qtd_count += 2 + length /
>>> 	((QT_BUFFER_CNT - !!((uint32_t)buffer & 1023)) * 4096);
> It's this solution I'd like to use to optimize the number of qTDs
> (with 1023 or something else).
This sounds reasonable.
>
>>> This code allocates enough qTDs for all cases, with at worst 2
>>> extra qTDs (i.e. a total of 128 bytes) that will be left unused. It
>>> also intrinsically handles ZLPs.
>>>
>>> Now, let's consider the possible values of wMaxPacketSize:
>>> - control endpoints:
>>>   * LS: 8,
>>>   * FS: 8, 16, 32 or 64,
>>>   * HS: 64,
>>> - isochronous endpoints: not supported by ehci-hcd.c,
>>> - interrupt endpoints:
>>>   * LS: <= 8,
>>>   * FS: <= 64,
>>>   * HS: <= 1024 (1x to 3x for high-bandwidth),
>>> - bulk endpoints:
>>>   * LS: N/A,
>>>   * FS: 8, 16, 32 or 64,
>>>   * HS: 512.
>>>
>>> My code assumes that wMaxPacketSize is a power of 2. This is not
>>> always true for interrupt endpoints. Let's talk about these. Their
>>> handling is currently broken in U-Boot since their transfers are
>>> made asynchronous instead of periodic. Devices shouldn't care too
>>> much about that, as long as transfers do not exceed wMaxPacketSize,
>>> in which case my code still works because wMaxPacketSize always
>>> fits in a single qTD. Interrupt transfers larger than
>>> wMaxPacketSize do not seem to be used by U-Boot. If they were used,
>>> the current code in U-Boot would have a timing issue because the
>>> asynchronous scheme would break the interval requested by devices,
>>> which could at worst make them fail in some way. So the only
>>> solution would be that such transfers be split by the caller of
>>> submit_int_msg, in which case my code still works.
>>> What would you think about failing with an error message in
>>> submit_int_msg if length is larger than wMaxPacketSize? Marek, what
>>> do you think?
>> Let's do that ... I think the interrupt endpoint is only used for
>> the keyboard, and if someone needs it for something else, the code
>> will be there, just needing improvement. Comment and error message
>> are OK.
> OK. I have thought of another solution for this. You'll tell me which
> one you prefer.
>
> The ehci_submit_async code currently in U-Boot checks through
> ehci_td_buffer that length fits in the single qTD reserved for data
> payload only after work has begun, possibly after a SETUP transfer.
> With my series, this is checked at the very beginning, before the
> allocation. We could detect that wMaxPacketSize is not a power of 2
> (e.g. with __builtin_popcount), in which case the allocation for the
> data payload would be restricted to 1 qTD like now, and there would
> be a check at the very beginning to test if length fits in this qTD.
> In that way, there could be several packets per interrupt transfer as
> long as it fits in a single qTD, just like now, contrary to the
> limitation imposed by the error in submit_int_msg. But I'm not sure
> it's a good idea to allow this behavior.
I think this is not needed, as there is only one user (the keyboard)
with a max packet size of 8 bytes.
>
>>> For all other cases, wMaxPacketSize is a power of 2, so everything
>>> is fine, except that in those cases wMaxPacketSize is at most 512,
>>> which means that with the suggested limitation applied to
>>> submit_int_msg, your suggested 1024 could be replaced with 512,
>>> which is good news since this is also the most common storage
>>> sector size.
>>>
>>> We could even use usb_maxpacket(dev, pipe) instead of 512, with
>>> some restrictions. If we don't want to alter the misalignment check
>>> in ehci_td_buffer, max(usb_maxpacket(dev, pipe), ARCH_DMA_MINALIGN)
>>> would actually have to be used.
>>> This misalignment check could be limited to the first qTD transfer
>>> of a USB transfer, but that would complicate things, all the more
>>> as the corresponding call to flush_dcache_range would have to be
>>> modified to fix alignments.
>>>
>>> So we have to make two choices:
>>> - between 4096, 1024, 512 and max(usb_maxpacket(dev, pipe),
>>>   ARCH_DMA_MINALIGN),
>>> - between the accurate and simple allocations.
>>> That makes a total of 8 working possibilities. What do you guys
>>> think we should choose? On my side, I like max(usb_maxpacket(dev,
>>> pipe), ARCH_DMA_MINALIGN)
>> Won't maxpacket fall below 512 on occasions,
> Sure.
I would go with 512 as it is also the most common storage sector size.
>> which might cause trouble?
> Why?
>
>>> with the simple allocation. It's efficient as to code speed, size
>>> and readability, as well as to RAM usage.
>> For now, I'd go for the safest, easiest and dumbest solution and see
>> how it fares. A subsequent patch can be submitted to improve that
>> and measurements made.
>>
>> "We should forget about small efficiencies, say about 97% of the
>> time; premature optimization is the root of all evil"
>>     -- Donald E. Knuth, Structured Programming with go to Statements
>> [...]
> OK, so I'll stick to my original series, rework it lightly as we
> said, add Jim's patch, and add a further patch for these
> optimizations.
Okay.
>>> So we could perhaps issue a #error in ehci-hcd or in usb_storage if
>>> CONFIG_SYS_MALLOC_LEN is not large enough, but I don't think it's a
>>> good idea because:
>>> - the threshold value would have to depend on runtime block sizes
>>>   or something, which could lead to a big worst worst case that
>>>   would almost never happen in real life, so giving such an
>>>   unrealistic heap size constraint would be cumbersome,
>> #warning then?
> With which limit if so?
I would expect more than 128 kB, as this is a common worst case
(512 B block size).
>
>>> - reaching the top sizes would mean reading a huge file or
>>>   something to a large buffer (much larger than the qTDs this
>>>   transfer requires), which would very likely be heap-allocated
>>>   (except for commands like fatload), so CONFIG_SYS_MALLOC_LEN
>>>   would already have to be large for the application,
>>> - for command line operations like fatload, transfers of
>>>   uncontrolled lengths could simply crash U-Boot if they go too far
>>>   in memory
>> Why, because they overwrite it?
> Yes. U-Boot expands its allocations downward during startup, so it's
> often located at the end of the embedded RAM, which means that
> fatload will very likely use the beginning of the RAM.
>>>   , which means that users of such commands need to know what they
>>>   are doing anyway, so they have to control transfer sizes,
>>> - there is already a runtime error displayed in case of allocation
>>>   failure.
>> Ok
> So, #warning or not, besides this?
>
>>> Marek, what do you think?
>> Had a good evening with the EHCI r10 spec, hope I answered most of
>> your questions.
> Yes, thanks.

Regards,
Stefan