From: annie li
Subject: Re: Interesting observation with network event notification and batching
Date: Mon, 01 Jul 2013 23:59:45 +0800
Message-ID: <51D1A771.4000302@oracle.com>
References: <20130612101451.GF2765@zion.uk.xensource.com> <20130628161542.GF16643@zion.uk.xensource.com> <51D13456.1040609@oracle.com>
To: Stefano Stabellini
Cc: andrew.bennieston@citrix.com, Wei Liu, ian.campbell@citrix.com, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On 2013-7-1 22:19, Stefano Stabellini wrote:
> Could you please use plain text emails in the future?

Sure, sorry about that.

Thanks
Annie

>
> On Mon, 1 Jul 2013, annie li wrote:
>> On 2013-6-29 0:15, Wei Liu wrote:
>>
>> Hi all,
>>
>> After collecting more stats and comparing copying / mapping cases, I now
>> have some more interesting finds, which might contradict what I said
>> before.
>>
>> I tuned the runes I used for benchmark to make sure iperf and netperf
>> generate large packets (~64K). Here are the runes I use:
>>
>> iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
>> netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
>>
>>                     COPY       MAP
>> iperf Tput:         6.5Gb/s    14Gb/s (was 2.5Gb/s)
>>
>>
>> So with default iperf setting, copy is about 7.9G, and map is about 2.5G? How about the result of netperf without large packets?
>>
>> PPI                 2.90       1.07
>> SPI                 37.75      13.69
>> PPN                 2.90       1.07
>> SPN                 37.75      13.69
>> tx_count            31808      174769
>>
>>
>> Seems interrupt count does not affect the performance at all with -l 131072 -w 128k.
>>
>> nr_napi_schedule    31805      174697
>> total_packets       92354      187408
>> total_reqs          1200793    2392614
>>
>> netperf Tput:       5.8Gb/s    10.5Gb/s
>> PPI                 2.13       1.00
>> SPI                 36.70      16.73
>> PPN                 2.13       1.31
>> SPN                 36.70      16.75
>> tx_count            57635      205599
>> nr_napi_schedule    57633      205311
>> total_packets       122800     270254
>> total_reqs          2115068    3439751
>>
>> PPI: packets processed per interrupt
>> SPI: slots processed per interrupt
>> PPN: packets processed per napi schedule
>> SPN: slots processed per napi schedule
>> tx_count: interrupt count
>> total_reqs: total slots used during test
>>
>> * Notification and batching
>>
>> Is notification and batching really a problem? I'm not so sure now. My
>> first thought when I didn't measure PPI / PPN / SPI / SPN in copying
>> case was that "in that case netback *must* have better batching" which
>> turned out not very true -- copying mode makes netback slower, however
>> the batching gained is not huge.
>>
>> Ideally we still want to batch as much as possible. Possible ways
>> include playing with the 'weight' parameter in NAPI. But as the figures
>> show batching seems not to be very important for throughput, at least
>> for now. If the NAPI framework and netfront / netback are doing their
>> jobs as designed we might not need to worry about this now.
>>
>> Andrew, do you have any thought on this? You found out that NAPI didn't
>> scale well with multi-threaded iperf in DomU, do you have any handle on
>> how that can happen?
>>
>> * Thoughts on zero-copy TX
>>
>> With this hack we are able to achieve 10Gb/s single stream, which is
>> good. But, with the classic XenoLinux kernel which has zero copy TX we
>> weren't able to achieve this. I also developed another zero copy netback
>> prototype one year ago with Ian's out-of-tree skb frag destructor patch
>> series. That prototype couldn't achieve 10Gb/s either (IIRC the
>> performance was more or less the same as copying mode, about 6~7Gb/s).
>>
>> My hack maps all necessary pages permanently, there is no unmap, we
>> skip lots of page table manipulation and TLB flushes. So my basic
>> conclusion is that page table manipulation and TLB flushes do incur
>> heavy performance penalty.
>>
>> There is no way this hack can be upstreamed. If we're to re-introduce
>> zero-copy TX, we would need to implement some sort of lazy flushing
>> mechanism. I haven't thought this through. Presumably this mechanism
>> would also benefit blk somehow? I'm not sure yet.
>>
>> Could persistent mapping (with the to-be-developed reclaim / MRU list
>> mechanism) be useful here? So that we can unify blk and net drivers?
>>
>> * Changes required to introduce zero-copy TX
>>
>> 1. SKB frag destructor series: to track life cycle of SKB frags. This is
>> not yet upstreamed.
>>
>>
>> Are you referring to this one: http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html ?
>>
>>
>> 2. Mechanism to negotiate max slots frontend can use: mapping requires
>> backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
>>
>> 3. Lazy flushing mechanism or persistent grants: ???
>>
>>
>> I did some tests with persistent grants before, and they did not show better performance than grant copy. But I was using the default
>> params of netperf, and did not try large packet sizes. Your results remind me that maybe persistent grants would get similar
>> results with larger packet sizes too.
>>
>> Thanks
>> Annie
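
For reference, the PPI / SPI / PPN / SPN figures in the quoted iperf table can be reproduced from the raw counters, following the definitions given there (packets or slots divided by the interrupt count or the NAPI schedule count). The sketch below is only an illustration: the Counters class and the batching_ratios helper are hypothetical names, not anything in netback, and the numbers are copied from the iperf COPY and MAP columns.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Raw counters as listed in the quoted iperf table (hypothetical container)."""
    tx_count: int           # interrupt count
    nr_napi_schedule: int   # number of NAPI schedules
    total_packets: int      # packets seen during the test
    total_reqs: int         # total slots used during the test


def batching_ratios(c: Counters) -> dict:
    """Derive the per-interrupt and per-NAPI-schedule batching ratios."""
    return {
        "PPI": c.total_packets / c.tx_count,          # packets per interrupt
        "SPI": c.total_reqs / c.tx_count,             # slots per interrupt
        "PPN": c.total_packets / c.nr_napi_schedule,  # packets per NAPI schedule
        "SPN": c.total_reqs / c.nr_napi_schedule,     # slots per NAPI schedule
    }


# Values copied from the iperf rows of the quoted table.
IPERF_COPY = Counters(tx_count=31808, nr_napi_schedule=31805,
                      total_packets=92354, total_reqs=1200793)
IPERF_MAP = Counters(tx_count=174769, nr_napi_schedule=174697,
                     total_packets=187408, total_reqs=2392614)

if __name__ == "__main__":
    for name, counters in (("COPY", IPERF_COPY), ("MAP", IPERF_MAP)):
        ratios = batching_ratios(counters)
        print(name, {key: round(value, 2) for key, value in ratios.items()})
```

Computed this way, the map case handles roughly one packet per interrupt while the copy case handles close to three, which is the batching difference discussed above.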