netdev.vger.kernel.org archive mirror
* r8169 :  always copying the rx buffer to new skb
@ 2011-04-18 17:08 John Lumby
  2011-04-18 17:27 ` Ben Hutchings
  2011-04-18 18:21 ` Francois Romieu
  0 siblings, 2 replies; 16+ messages in thread
From: John Lumby @ 2011-04-18 17:08 UTC (permalink / raw)
  To: netdev

Summary  -  the current r8169 driver always memcpy's every received
             buffer into a new skb.  I'd like to propose that it should
             not,  which can improve throughput / reduce CPU
             utilization by around 10%.

In the patch
   2010-10-09 Stanislaw Gruszka r8169: allocate with GFP_KERNEL flag
when able to sleep
the code that made the decision
     "shall I copy the buffer into a new skb,  or unhook it from the
ring to pass up and allocate a new one"
based on a module param called rx_copybreak was removed.  Instead we
now always allocate naked data buffers (i.e. not wrapped in an skb)
and always memcpy each one into a new skb to pass up to netif.
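
To make that concrete,  here is a minimal sketch of the decision the
removed code used to make (the helper name and details are mine,  not
the exact old code;  only rx_copybreak comes from the old driver) :

static struct sk_buff *rx_copy_or_unhook(struct rtl8169_private *tp,
					 struct sk_buff *rx_skb,
					 int pkt_size)
{
	struct sk_buff *skb;

	if (pkt_size < rx_copybreak) {
		/* small frame: memcpy into a right-sized skb and
		 * leave the original buffer on the ring for re-use */
		skb = netdev_alloc_skb_ip_align(tp->dev, pkt_size);
		if (skb) {
			skb_copy_to_linear_data(skb, rx_skb->data,
						pkt_size);
			skb_put(skb, pkt_size);
		}
		return skb;
	}

	/* large frame: unhook the ring skb and pass it up;  the
	 * caller must allocate a fresh skb for the empty ring slot */
	return rx_skb;
}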

I think (though I'm not sure) this was related to a bug :
   Bug 629158 Network adapter "disappears" after resuming from acpi suspend
although the change from GFP_ATOMIC to GFP_KERNEL for the initial
allocation was made (I believe) in
   2010-04-30 Eric Dumazet r8169: Fix rtl8169_rx_interrupt()

So I am not entirely certain of the motivation for the removal of
rx_copybreak and the "always memcpy".  But I believe it is not
necessary,  at least not on all (most?) systems,  and I have measured
an 11% increase in throughput (aggregate Mbits/sec) on a heavy network
benchmark on my own machine,  on 2.6.39-rc2 with rps/rfs enabled,  by
patching the code back to a 2.6.39 equivalent of the old rx_copybreak
behaviour.

oprofile shows the improvement is all or mostly obtained from avoiding
the memcpy'ing.  And of course,  since the memcpy'ing is done in the
driver before netif/rps gets the packet,  all of it is done on CPU0
(true,  I think,  on any SMP machine with an rtl8169).

The measurements are :
                              aggregate (rx+tx) Mbits/sec through the NIC
     2.6.39-rc2 unpatched                                  918
     2.6.39-rc2   patched,  rx_copybreak=64               1026

Packet rates fluctuate around 60K - 70K pkts/sec rx plus 60K - 70K
pkts/sec tx.

I would like to propose that we :
   .  switch back to wrapping data buffers in skb's (so we have the
option of copying or unhooking)
   .  re-introduce the rx_copybreak module param so each sysadmin can
control it if wished.

That is my main proposal,    and I would be interested to hear thoughts 
on that.

As a second proposal (which I made before),  I'd like to suggest that
the number of rx buffers allocated at open should be configurable by
module param;  a sketch of both params follows below.    This is not
needed for my other proposal but may help reduce the possibility of a
temporary shortage of buffers.     There is really no justification
for assuming that 256 buffers is the correct number for every system
from a netbook to a 32-way server.     I ran my "patched" measurement
with num_rx_buffs=128 and there were no alloc failures logged.     I
would say that if there is any enthusiasm for the main avoid-memcpy
proposal,   then it is worth the small extra effort to do
num_rx_buffs as well.
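
A sketch of what the two module params might look like (the name
num_rx_buffs and the defaults here are my suggestion,  not existing
driver code) :

static int rx_copybreak = 200;
module_param(rx_copybreak, int, 0644);
MODULE_PARM_DESC(rx_copybreak,
	"Frames shorter than this are memcpy'd into a new skb; "
	"longer ones are unhooked from the ring");

static int num_rx_buffs = 256;	/* today's hard-coded ring size */
module_param(num_rx_buffs, int, 0444);
MODULE_PARM_DESC(num_rx_buffs,
	"Number of rx ring buffers to allocate at open");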

My current patch (including configurable num_rx_buffs) -
lines deleted   120
lines added     668

BTW if this proposal is acceptable,  I'm willing to do the patch work,
but I have only one machine with an r8169 (actually a RTL8168c) to
test on.

John Lumby

* r8169 :  always copying the rx buffer to new skb
@ 2011-06-27 22:54 John Lumby
  2011-06-28  7:55 ` Francois Romieu
  0 siblings, 1 reply; 16+ messages in thread
From: John Lumby @ 2011-06-27 22:54 UTC (permalink / raw)
  To: netdev


Summary of some results since previous posts in April :

Previously I suggested re-introducing the rx_copybreak parameter to provide the option of un-hooking the receive buffer rather than copying it,  in order to save the overhead of the memcpy,  which shows up as the highest tick-count in oprofile.  All buffer memcpy'ing is done on CPU0 on my system.

I then found that,  without the memcpy,  the driver and net stack incur overhead elsewhere,  particularly in too-frequent polling/interrupting.

Eric D pointed out that :
            Doing the copy of data and building an exact size skb has benefit of
            providing 'right' skb->truesize (might reduce RCVBUF contention and
            avoid backlog drops) and already cached data (hot in cpu caches).
            Next 'copy' is almost free (L1 cache access)

There was also some discussion off-line about using larger MTU size.

Since then,  I have explored some ideas for dealing with the too-frequent polling/interrupting and with the cache aspect,  with some success on the first and none on the second.   In summary:
   .  With an MTU of 1500 and a "normal" workload,   I see an improvement of between 4% - 6% in throughput,  depending on kernel release and kernel .config.    Specifically,  with the heaviest workload and most tuned kernel .config:
       no changes  -  ~  1440 Megabits/sec bi-directional
     with changes  -  ~  1530 Megabits/sec bi-directional
   (same .config for each,  of course)
      All 4 of my atom 330 logical CPUs (2 physical cores x 2 SMT threads each) were at 100% both with and without the changes for this workload,  but with very different profiles.
      These throughput numbers are higher than I reported before,  and the % improvement lower,  because of the tuning of the base system and workload.

   .  With an MTU of 6144,  I see a more dramatic effect  -  the same workload runs at 1725 Megabits/sec both with and without the changes (which may be a practical hardware limit on one of the adapters,  since it hits exactly this rate almost every time no matter what else I change),  but overall CPU utilization drops from ~ 80% without the changes to ~ 60% with them.    I feel this is significant,  but of course its use is limited to networks that can support this segment size everywhere.

Notes on the changes:

 Too-frequent polling/interrupting:
 These two are highly interrelated by NAPI.
     Too-frequent polling:
         The NAPI weight is a double-duty parameter,  controlling both the dynamic choice between continuing a NAPI polling session versus leaving it and resuming interrupts,  and also the maximum number of receive buffers to be passed up per poll.   It's also not configurable (hard-coded to 64).    I split it into two numbers,  one for each purpose,  made them both configurable,  and tried tuning them.    A good value for the poll/interrupt choice was 16,   while the max-size number was best left at 64.    This helps a bit,  but polling is still too frequent.
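
         As a sketch,  with my (hypothetical) names rx_poll_max for the per-poll maximum and rx_poll_exit for the continue-vs-interrupt threshold (the rtl8169_rx_interrupt signature is approximate) :

	static int rtl8169_poll(struct napi_struct *napi, int budget)
	{
		struct rtl8169_private *tp =
			container_of(napi, struct rtl8169_private, napi);
		/* never pass up more than rx_poll_max buffers per poll */
		int work_done = rtl8169_rx_interrupt(tp->dev, tp,
				tp->mmio_addr, min(budget, rx_poll_max));

		/* the poll-again vs re-enable-interrupts decision now
		 * uses its own,  smaller threshold */
		if (work_done < rx_poll_exit) {
			napi_complete(napi);
			rtl8169_irq_enable(tp);	/* hypothetical helper */
		}
		return work_done;
	}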
         I then made an interface up into ksoftirqd to let the driver tell it :
              "keep my napi session alive but sched_yield to other runnable processes before running another poll"
         I added a check to __do_softirq so that if the *only* pending softirq is NET_RX_SOFTIRQ and the rx_action routine requested this,  then it exits and tells the daemon to yield.
         I borrowed a bit in local_softirq_pending for this.  This helped a lot for certain workloads,  with a considerable drop in system CPU% on CPU0 and higher user CPU% there.
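
         In code terms,  the added check looks roughly like this (a sketch only;  NET_RX_YIELD stands for the borrowed local_softirq_pending bit,  set by the rx_action routine via the new interface) :

	/* at the bottom of the __do_softirq restart loop */
	pending = local_softirq_pending();
	if (pending == ((1 << NET_RX_SOFTIRQ) | NET_RX_YIELD)) {
		/* only net-rx is pending and the rx_action asked to
		 * yield:  don't restart the loop;  wake ksoftirqd,
		 * which will sched_yield() before polling again */
		wakeup_softirqd();
	} else if (pending && --max_restart)
		goto restart;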

     Too-frequent interrupting:
         I made use of the r8169's Interrupt Mitigation feature,   setting it to the maximum multiplied by a factor between 0 and 1 based inversely on the tx queue size  (large queue,  short delay,  and vice versa).     This also helped a lot.   The current driver sets these registers,  but only once per "up" session,  during rtl8169_open of the NIC;  but Hayes explained that the regs must be set on each enabling of interrupts.    This is the one case where (I think) I corrected a bug present in the current driver  -  harmless,  but not doing what was intended.
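
         A sketch of the rescaling,  called each time interrupts are re-enabled (IntrMitigate and RTL_W16 are existing driver names;  max_mitigation and the linear scaling are mine,  and the real register packs separate tx/rx timer and count fields) :

	static void rtl_rescale_intr_mitigate(struct rtl8169_private *tp)
	{
		void __iomem *ioaddr = tp->mmio_addr;
		/* tx descriptors currently in flight */
		unsigned int qlen = tp->cur_tx - tp->dirty_tx;
		/* large queue -> short delay,  small queue -> long */
		u16 mit = max_mitigation * (NUM_TX_DESC - qlen) /
			  NUM_TX_DESC;

		RTL_W16(IntrMitigate, mit);
	}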

      The effect of these two changes was to reduce the rate of hardware interrupts down to less than 1/20 of before,  and also hold the polling rate down (around 4-5 packets per poll on average on a typical run,  sometimes much higher).


  memory and caching:
  Here I failed to achieve anything.    Based on Eric's point about memcpy giving a "free" next copy,  I thought memory prefetching might provide something equivalent.   Specifically,  prefetch the skb and its data buffer immediately after un-dma'ing.
  For example,  with my changes and no memcpy,  I see eth_type_trans() high in the oprofile tick score on CPU0.
  This small function does very little work but is the first (I think) to access a field in the skb->data buffer  -  the ethernet header.  Prefetching ought to do better than memcpy'ing,  since only one copy of the data enters L1,  not two.    But my attempts at this achieved nothing,  or were negative.
  Note  -  the current driver does issue a prefetch of the original buffer prior to the memcpy.  But,  on my system (atom CPUs),  gdb disassembly of the object file r8169.o shows that no prefetch instructions are generated,  only an lea of the address to be prefetched.     I tried changing the prefetch call to an asm-generated prefetcht0/prefetchnta instruction,  with disappointing results.   I noticed some discussion of memory prefetch on this list earlier,  and maybe it is simply not useful.
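
  For reference,  the explicit variant I tried in place of the generic prefetch() call was along these lines (x86-only sketch) :

	/* force a real prefetchnta,  since gcc was turning the
	 * generic prefetch() into a bare lea on my atom */
	static inline void prefetch_nta(const void *p)
	{
		asm volatile("prefetchnta %0" : : "m" (*(const char *)p));
	}

		...
		/* immediately after un-dma'ing,  warm the line that
		 * eth_type_trans() will touch first */
		prefetch_nta(skb->data);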

  I tried to explore Eric's other point about skb->truesize but ran out of time researching it.     I guess my current results are negatively impacted by the memory and skb issues that Eric mentions,  but I could not find an answer.


  There was a question of how this changed driver handles memory pressure :
  Along with the rx_copybreak change,  I made the number of rx and tx ring buffers configurable and dynamically replenishable.    The changed driver can tolerate occasional or even bursty alloc failures without exposing any effects outside itself,  whereas the current driver drops packets.   However,  under extreme consecutive failures,  the changed driver will eventually run too low and stop completely,  whereas the current driver will (I assume) stay up.     I was unable to cause either of these in my tests.    Measurements with concurrent memory hogs confirmed this,  but did show a heavy drop in throughput for the changed driver.
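
  The replenish logic is roughly this (a sketch;  rx_filled and next_empty are hypothetical ring-state fields,  and rtl8169_map_to_asic stands for the existing descriptor-mapping step) :

	static void rtl8169_rx_refill(struct rtl8169_private *tp)
	{
		/* top the ring back up towards num_rx_buffs */
		while (tp->rx_filled < num_rx_buffs) {
			struct sk_buff *skb =
				netdev_alloc_skb_ip_align(tp->dev,
							  rx_buf_sz);

			if (!skb)
				break;	/* tolerate the failure and
					 * retry on the next poll */
			rtl8169_map_to_asic(tp, tp->next_empty, skb);
			tp->next_empty = (tp->next_empty + 1) %
					 num_rx_buffs;
			tp->rx_filled++;
		}
	}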


  I've tried these changes on all kernel release levels from 2.6.36 to 3.0-rc3 and see roughly comparable deltas on all,  but with slightly different tuning required to hit the optimum,  and some variability on everything after 2.6.37.     2.6.37 seemed to be slightly the "best";  I'm not sure why,  although I see some relevant changes to the scheduler between 2.6.37 - 38.    There is also a strange effect with the old RTC in 3.0  -  I had to remove it from the kernel to get good results,  whereas it was a module in the 2.6 levels (which I did not load for the tests);  I don't need it on my system except for one ancient utility.     I also found a major impact from iptables and cleared all tables for the tests.    That is the one item,  normally needed in a production setup,  that I turned off.   The overhead of iptables is presumably highly dependent on how many rules are in the filter chains  (I have rather a lot in INPUT).


I don't plan to do any more on this but can provide my patch (currently one monolithic one based on DaveM 3.0.0-rc1netnext-110615) and detailed results if anyone wants.

Cheers,   John Lumby

