From mboxrd@z Thu Jan  1 00:00:00 1970
From: merez@codeaurora.org
Subject: Re: [PATCH v2 1/1] mmc: block: Add write packing control
Date: Tue, 24 Jul 2012 01:44:02 -0700 (PDT)
Message-ID: <34e29d774773a7dfa3e1b57232249b68.squirrel@www.codeaurora.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path: 
Sender: linux-arm-msm-owner@vger.kernel.org
To: "S, Venkatraman"
Cc: merez@codeaurora.org, Chris Ball , Muthu Kumar ,
	linux-mmc@vger.kernel.org, linux-arm-msm@vger.kernel.org,
	open list , Seungwon Jeon
List-Id: linux-mmc@vger.kernel.org

On Mon, July 23, 2012 5:22 am, S, Venkatraman wrote:
> On Mon, Jul 23, 2012 at 5:13 PM,  wrote:
>> On Wed, July 18, 2012 12:26 am, Chris Ball wrote:
>>> Hi,
>>>
>>> [removing Jens and the documentation list, since now we're talking
>>> about the MMC side only]
>>>
>>> On Wed, Jul 18 2012, merez@codeaurora.org wrote:
>>>> Is there anything else that holds this patch from being pushed to
>>>> mmc-next?
>>>
>>> Yes, I'm still uncomfortable with the write packing patchsets for a
>>> couple of reasons, and I suspect that the sum of those reasons means
>>> that we should probably plan on holding off merging it until after
>>> 3.6.
>>>
>>> Here are the open issues; please correct any misunderstandings:
>>>
>>> With Seungwon's patchset ("Support packed write command"):
>>>
>>> * I still don't have a good set of representative benchmarks showing
>>>   what kind of performance changes come with this patchset.  It seems
>>>   like we've had a small amount of testing on one controller/eMMC part
>>>   combo from Seungwon, and an entirely different test from Maya, and
>>>   the results aren't documented fully anywhere to the level of
>>>   describing what the hardware was, what the test was, and what the
>>>   results were before and after the patchset.
>>
>> Currently, there is only one card vendor that supports packed commands.
>> Following are our sequential write (LMDD) test results on 2 of our
>> targets (in MB/s):
>>
>>                          No packing    Packing
>> Target 1 (SDR 50MHz)         15           25
>> Target 2 (DDR 50MHz)         20           30
>>
>>> With the reads-during-writes regression:
>>>
>>> * Venkat still has open questions about the nature of the read
>>>   regression, and thinks we should understand it with blktrace before
>>>   trying to fix it.  Maya has a theory about writes overwhelming
>>>   reads, but Venkat doesn't understand why this would explain the
>>>   observed bandwidth drop.
>>
>> The degradation of reads due to writes is not a new behavior and exists
>> also without the write packing feature (which only increases the
>> degradation). Our investigation of this phenomenon led us to the
>> conclusion that a new scheduling policy should be used for mobile
>> devices, but this is not related to the current discussion of the write
>> packing feature.
>> The write packing feature increases the degradation of reads due to
>> writes since it allows the MMC to fetch many write requests in a row,
>> instead of fetching only one at a time. Therefore some of the read
>> requests will have to wait for the completion of more write requests
>> before they can be issued.
>
> I am a bit puzzled by this claim. One thing I checked carefully when
> reviewing the write packing patches from SJeon was that the code didn't
> plough through a mixed list of reads and writes and select only writes.
> This section of the code in "mmc_blk_prep_packed_list()", from the v8
> patchset:
>
> +		if (rq_data_dir(cur) != rq_data_dir(next)) {
> +			put_back = 1;
> +			break;
> +		}
>
> means that once a read is encountered in the middle of write packing,
> the packing is stopped at that point and it is executed. Then the next
> blk_fetch_request should get the next read and continue as before.
>
> IOW, the ordering of reads and writes is _not_ altered when using packed
> commands.
> For example, if there were 5 write requests, followed by 1 read,
> followed by 5 more write requests in the request_queue, the first 5
> writes will be executed as one "packed command", then the read will be
> executed, and then the remaining 5 writes will be executed as one
> "packed command". So the read does not have to wait any more than it
> waited before (the packing feature).

Let me try to explain better with your example.
Without packing, the MMC layer will fetch 2 write requests and wait for
the first write request's completion before fetching another write
request. During this time the read request can be inserted into the CFQ,
and since it has higher priority than the async writes it will be
dispatched in the next fetch. So the result would be 2 write requests
followed by one read request, and the read would have to wait for the
completion of only 2 write requests.
With packing, all 5 write requests will be fetched in a row, and then the
read will arrive and be dispatched in the next fetch. The read will then
have to wait for the completion of 5 write requests.

A few more clarifications:
Due to the plug list mechanism in the block layer, applications can
"aggregate" several requests to be inserted into the scheduler before the
MMC queue thread is woken. This leads to a situation where there are
several write requests in the CFQ queue when the MMC starts to do the
fetches.

If the read was inserted while we are building the packed command then I
agree that we should have seen less effect on the read performance.
However, the write packing statistics show that in most of the cases the
packing stopped due to an empty queue, meaning that the read was inserted
into the CFQ after all the pending write requests were fetched and packed.

Following is an example of write packing statistics for a READ/WRITE
parallel scenario:

write packing statistics:
Packed 1 reqs  - 448 times
Packed 2 reqs  - 38 times
Packed 3 reqs  - 23 times
Packed 4 reqs  - 30 times
Packed 5 reqs  - 14 times
Packed 6 reqs  - 8 times
Packed 7 reqs  - 4 times
Packed 8 reqs  - 1 times
Packed 10 reqs - 1 times
Packed 34 reqs - 1 times
stopped packing due to the following reasons:
2 times: wrong data direction (meaning a READ was fetched and stopped the
packing)
1 times: flush or discard
565 times: empty queue (meaning blk_fetch_request returned NULL)

>
> And I requested blktrace to confirm that this is indeed the behaviour.

The trace logs show that in the no-packing case a maximum of 3-4 requests
are issued before a read request, while with packing there are also cases
of 6 and 7 requests dispatched before a read request.
I'm waiting for approval to share the block trace logs.
Since this is a simple test to run, you can collect the trace logs and
let us know if you reach other conclusions.

Thanks,
Maya

>
> The rest of your arguments depend on this assertion anyway, so can you
> please clarify this.
>
>> To overcome this behavior, the solution would be to stop the write
>> packing when a read request is fetched, and this is the algorithm
>> suggested by the write packing control.
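
To make that algorithm a bit more concrete, the decision logic amounts to
roughly the sketch below. This is illustrative only: the structure, field
and function names (wr_packing_ctl, num_wr_reqs_to_enable_packing, ...)
are placeholders and not necessarily the identifiers used in the actual
patch.

	/*
	 * Illustrative sketch of the write packing control decision;
	 * all names here are placeholders, not the real patch.
	 */
	#include <linux/blkdev.h>

	struct wr_packing_ctl {
		bool		packing_enabled;
		unsigned int	num_of_potential_packed_wr_reqs;
		unsigned int	num_wr_reqs_to_enable_packing; /* per-platform trigger */
	};

	static void wr_packing_ctl_update(struct wr_packing_ctl *ctl,
					  struct request *req)
	{
		if (rq_data_dir(req) == READ) {
			/* A single fetched read disables packing immediately */
			ctl->packing_enabled = false;
			ctl->num_of_potential_packed_wr_reqs = 0;
			return;
		}

		/*
		 * Packing is re-enabled only after a run of consecutive
		 * write requests longer than the number-of-requests trigger.
		 */
		if (!ctl->packing_enabled &&
		    ++ctl->num_of_potential_packed_wr_reqs >=
		    ctl->num_wr_reqs_to_enable_packing)
			ctl->packing_enabled = true;
	}

With such a trigger, a read that arrives during a write burst stops the
packing for the rest of the burst, and packing only resumes once the
queue again looks write-only.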
>> Let's also keep in mind that lmdd benchmarking doesn't fully reflect
>> real life, in which there are not many scenarios that cause massive
>> read and write operations. In our common-user-scenario tests we saw
>> that in many cases the write packing decreases the read latency. This
>> can happen in cases where the same amount of write requests is fetched
>> with and without packing. In such a case the write packing decreases
>> the transfer time of the write requests and causes the read request to
>> wait for a shorter time.
>>
>>> With Maya's patchset ("write packing control"):
>>>
>>> * Venkat thinks that HPI should be used, and the number-of-requests
>>>   metric is too coarse, and it doesn't let you disable packing at the
>>>   right time, and you're essentially implementing a new I/O scheduler
>>>   inside the MMC subsystem without understanding the root cause for
>>>   why that's necessary.
>>
>> According to our measurements the stop transmission (CMD12) + HPI is a
>> heavy operation that may take up to several milliseconds. Therefore,
>> massive usage of HPI can cause a degradation of performance.
>> In addition, it doesn't provide a complete solution for read during
>> write, since it doesn't solve the problem of "what to do with the
>> interrupted write request remainder?". That is, a common interrupting
>> read request will usually be followed by another one. If we just
>> continue to write the interrupted write request's remainder we will
>> probably get another HPI due to the second read request, so eventually
>> we may end up with lots of HPIs and write retries. A complete solution
>> would be: stop the current write, change the packing mode to
>> non-packing, serve the read request, push the write remainders back to
>> the block I/O scheduler and let it schedule them again, probably after
>> the read burst ends (this requires block layer support, of course).
>> Regarding the packing control, there seems to be some confusion, since
>> the number-of-requests is the trigger for *enabling* the packing (after
>> it was disabled), while a single read request disables packing.
>> Therefore, the packing is stopped at the right time.
>> The packing control doesn't add any scheduling policy to the MMC layer.
>> The write packing feature is the one changing the scheduling policy, by
>> fetching many write requests in a row without a delay that allows read
>> requests to come in the middle.
>> By disabling the write packing, the write packing control restores the
>> old scheduling policy. It causes the MMC to fetch the requests one by
>> one, thus read requests are served as before.
>> It is correct that the trigger for enabling the write packing control
>> should be adjusted per platform and doesn't give a complete solution.
>> As I mentioned above, the complete solution will include the usage of
>> write packing control, a re-insert of the packed writes into the
>> scheduler when a read request is fetched, and usage of HPI to stop the
>> packing that is already being transferred.
>>
>> To summarize -
>> We recommend including the write packing in 3.6 for the following
>> reasons:
>> 1. It significantly improves the write throughput
>> 2. In some cases it even decreases the read latency
>> 3. The read degradation in simultaneous read-write flows already
>>    exists, even without this feature
>> As for the write packing control, it can be included in 3.6 to supply a
>> partial solution for the read degradation, or we can postpone it to 3.7
>> and integrate it as part of the complete solution.
>> Thanks,
>> Maya
>>
>>> My sense is that there's no way we can solve all of these to
>>> satisfaction in the next week (which is when the merge window will
>>> open), but that by waiting a cycle we might come up with some good
>>> answers.
>>>
>>> What do other people think?  If you're excited about these patchsets,
>>> now would be a fine time to come forward with your benchmarking
>>> results and to help understand the reads-during-writes regression.
>

-- 
Sent by consultant of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum