From mboxrd@z Thu Jan  1 00:00:00 1970
From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Wed, 14 Oct 2015 15:40:49 -0700
Subject: [Intel-wired-lan] [next-queue 13/17] fm10k: Update adaptive ITR
 algorithm
In-Reply-To: <1444853538.26286.42.camel@intel.com>
References: <1444779554-20464-1-git-send-email-jacob.e.keller@intel.com>
 <1444779554-20464-13-git-send-email-jacob.e.keller@intel.com>
 <561EA08C.8090705@gmail.com> <1444853538.26286.42.camel@intel.com>
Message-ID: <561ED9F1.5060104@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: intel-wired-lan@osuosl.org
List-ID: <intel-wired-lan.osuosl.org>

On 10/14/2015 01:12 PM, Keller, Jacob E wrote:
> On Wed, 2015-10-14 at 11:35 -0700, Alexander Duyck wrote:
>> On 10/13/2015 04:39 PM, Jacob Keller wrote:
>>> +	 */
>>> +#define FM10K_ITR_SCALE_SMALL	6
>>> +#define FM10K_ITR_SCALE_MEDIUM	5
>>> +#define FM10K_ITR_SCALE_LARGE	4
>>> +
>>> +	if (avg_wire_size < 300)
>>> +		avg_wire_size *= FM10K_ITR_SCALE_SMALL;
>>> +	else if ((avg_wire_size >= 300) && (avg_wire_size < 1200))
>>> +		avg_wire_size *= FM10K_ITR_SCALE_MEDIUM;
>>>    	else
>>> -		avg_wire_size /= 2;
>>> +		avg_wire_size *= FM10K_ITR_SCALE_LARGE;
>>> +
>> Where is it these scaling values originated from?  Just looking
>> through
>> the values I am not sure having this broken out like it is provides
>> much
>> value.
>>
> I am not really sure what exactly you mean here?

The numbers are kind of all over the place.  The result of the math 
above swings back and forth in a saw tooth between values and doesn't 
seem to do anything really consistent.

For example a packet that is 275 in size will have an interrupt rate of 
125K interrupts per second, while a packet that is 276 bytes in size 
will be closer to 166K.  It is basically creating some odd peaks and 
valleys.
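
To make the saw tooth concrete, here is a minimal sketch that applies
only the three scale factors from the quoted hunk.  The helper name
scaled_wire_size() is made up for illustration, and I am assuming the
scaled value feeds the ITR more or less directly, so a drop in the
result means a jump in the interrupt rate.

```c
/* Scale factors copied from the quoted patch. */
#define FM10K_ITR_SCALE_SMALL	6
#define FM10K_ITR_SCALE_MEDIUM	5
#define FM10K_ITR_SCALE_LARGE	4

/* Hypothetical helper: applies only the scaling step quoted above. */
unsigned int scaled_wire_size(unsigned int avg_wire_size)
{
	if (avg_wire_size < 300)
		return avg_wire_size * FM10K_ITR_SCALE_SMALL;
	if (avg_wire_size < 1200)
		return avg_wire_size * FM10K_ITR_SCALE_MEDIUM;
	return avg_wire_size * FM10K_ITR_SCALE_LARGE;
}
```

A 299 byte average scales to 1794 while 300 bytes scales to 1500, and
1199 gives 5995 against 4800 for 1200, so the output steps backwards at
each threshold instead of rising with packet size.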

>
>> What I am getting at is that the input is a packet size, and the
>> output
>> is a value somewhere between 2 and 47.  (I think that 47 is still a
>> bit
>> high by the way and probably should be something more like 25 which I
>> believe you established as the minimum Tx interrupt rate in a later
>> patch.)
>>
> The input is two fold for calculation, packet size, and number of
> packets.

Last I knew though the average packet size is the only portion we end up 
using to actually set the interrupt moderation rate.

>> What you may want to do is look at pulling in the upper limit to
>> something more reasonable like 1536 for avg_wire_size, and then
>> simplify
>> this logic a bit.  Specifically what is it you are trying to
>> accomplish
>> by tweaking the scale factor like you are?  I assume you are wanting
>> to
>> approximate a curve.  If so you might want to look at possibly
>> including
>> an offset value so that you can smooth out the points where your
>> intersections occur.
>>
> I honestly don't know. I mostly took the original work from 6-7 months
> ago, and added your suggestions from that series, but maybe that isn't
> valid now?

I suspect some of my views have changed over the last several months 
after dealing with interrupt moderation issues on ixgbe.

The thing I started realizing is that with full sized Ethernet frames, 
1514 bytes in size, you have a maximum rate of something like 4 million 
packets per second at 50Gbps.  It seems like you should be firing an 
interrupt once every 100 packets at least, don't you think?  That is why 
I am now thinking at minimum 40K interrupts per second is necessary.  
However a side effect of that is that 100 packets will overrun a UDP 
buffer in most cases as there is only room for about 70 frames based on 
math I saw when I submitted the patch for ixgbe 
(https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c?id=8ac34f10a5ea4c7b6f57dfd52b0693a2b67d9ac4). 
So if I want to take UDP rmem_default into account we would be looking 
at something more like 60K interrupts per second which is getting fairly 
tiny at 17us.
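
The arithmetic behind those numbers, as a rough sketch.  The function
and macro names are hypothetical, and the 24 bytes of per-frame wire
overhead (preamble, FCS, inter-frame gap) is an assumption I am adding
here to get from frame size to bytes on the wire.

```c
#include <stdint.h>

#define LINK_BITS_PER_SEC	50000000000ULL	/* 50Gbps */
#define ETH_WIRE_OVERHEAD	24U		/* assumed bytes per frame */

/* Packets per second the link can carry at a given frame size. */
uint64_t pkts_per_sec(unsigned int frame_bytes)
{
	return LINK_BITS_PER_SEC / 8 / (frame_bytes + ETH_WIRE_OVERHEAD);
}

/* Interrupts per second needed to see at most pkts_per_int packets
 * between interrupts.
 */
uint64_t ints_per_sec(unsigned int frame_bytes, unsigned int pkts_per_int)
{
	return pkts_per_sec(frame_bytes) / pkts_per_int;
}
```

With 1514 byte frames this gives roughly 4.06 million packets per
second, 40637 interrupts per second at 100 packets per interrupt, and
58053 at 70 packets per interrupt, which is where the 40K and 60K
figures come from.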

>> For example what you may want to consider doing would be to instead
>> use
>> a multiplication factor for small, an addition value for medium, and
>> for
>> large you simply cap it at a certain value.
>>
> So, how would this look? The capping makes sense, we should probably
> cap it at around 30 or something? I'm not sure that 25 is the limit,
> since for some workloads I think we did calculate that we could use
> that few interrupts.. maybe the CPU savings aren't worth it though if
> it messes up too many other flows?

So I have done some math based on an assumption of a 212992 wmem_default 
and an overhead of 640 bytes per packet (256 skb, 64 headroom, 320 shared 
info).  Breaking it all down goes a little something like this:

     wmem_default / (size + overhead) = desired_pkts_per_int
     rate / bits_per_byte / (size + ethernet_overhead) = pkt_rate
     (desired_pkts_per_int / pkt_rate) * usec_per_sec = ITR value

So then when we start plugging in the numbers:
     212992 / (size + 640) = desired_pkts_per_int

we can simplify the second expression by just doing the division now.
     50,000,000,000 / 8 / (size + 24) = pkt_rate
     6,250,000,000/(size + 24) = pkt_rate

Then we just need to plug in the values and reduce:
     (212,992 / (size + 640)) / (6,250,000,000 / (size + 24)) * 1,000,000 = ITR value
     (212,992 / (size + 640)) / (6,250 / (size + 24)) = ITR value
     (34.078 / (size + 640)) / (1 / (size + 24)) = ITR value
     (34 * (size + 24)) / (size + 640) = ITR value
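
As a sanity check on the reduction, the sketch below evaluates the
three-step form and the reduced form side by side.  The constants are
the ones stated above; the function and macro names are hypothetical,
and floating point is used only to make the comparison easy to read.

```c
/* Values stated in the derivation above. */
#define WMEM_DEFAULT		212992.0	/* bytes */
#define PKT_MEM_OVERHEAD	640.0		/* 256 skb + 64 headroom + 320 shared info */
#define LINK_BYTES_PER_SEC	6250000000.0	/* 50Gbps / 8 */
#define ETH_OVERHEAD		24.0		/* extra bytes per frame on the wire */
#define USECS_PER_SEC		1000000.0

/* Unreduced three-step form; returns the ITR interval in usecs. */
double itr_usecs_full(double size)
{
	double desired_pkts_per_int = WMEM_DEFAULT / (size + PKT_MEM_OVERHEAD);
	double pkt_rate = LINK_BYTES_PER_SEC / (size + ETH_OVERHEAD);

	return desired_pkts_per_int / pkt_rate * USECS_PER_SEC;
}

/* Reduced form with the constant rounded down from 34.078 to 34. */
double itr_usecs_reduced(double size)
{
	return 34.0 * (size + ETH_OVERHEAD) / (size + PKT_MEM_OVERHEAD);
}
```

Rounding 34.078 down to 34 only ever shortens the interval slightly, so
the reduced form errs on the side of more interrupts rather than fewer.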

So now we have the expression we are looking for in order to determine 
the minimum interrupt rate, but we want to avoid the division.  So in 
the end we have something that looks like this in order to generate an 
approximation of the curve without having to do any unnecessary math 
(dropping the +24, and the 3000 limit from earlier):

     /* the following is a crude approximation of the expression:
      * (34 * (size + 24)) / (size + 640) = ITR value
      */
     if (avg_wire_size <= 360) {
         /* start at 333K ints/sec and gradually drop to 77K ints/sec */
         avg_wire_size *= 8;
         avg_wire_size += 376;
     } else if (avg_wire_size <= 1152) {
         /* 77K ints/sec to 45K ints/sec */
         avg_wire_size *= 3;
         avg_wire_size += 2176;
     } else if (avg_wire_size <= 1920) {
         /* 45K ints/sec to 38K ints/sec */
         avg_wire_size += 4480;
     } else {
         /* plateau at a limit of 38K ints/sec */
         avg_wire_size = 6656;
     }

Mind you the above is a crude approximation, but it should give decent 
performance and it only strays from the values of the original function 
by 1us or 2us and it stays under the curve.  It could probably use some 
tuning and tweaking as well but you get the general idea.
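
If you want to check how far the approximation strays, something along
these lines would do it.  This is only a sketch: the function names are
hypothetical, and the divide by 256 is my assumption about how the
scaled value maps to usecs, inferred from the comments (6656 / 256 is
26us, or about 38K ints/sec).

```c
/* Piecewise approximation from above, converted to usecs.
 * ASSUMPTION: the scaled value is in units of 1/256 usec.
 */
double itr_usecs_approx(unsigned int avg_wire_size)
{
	if (avg_wire_size <= 360) {
		avg_wire_size *= 8;
		avg_wire_size += 376;
	} else if (avg_wire_size <= 1152) {
		avg_wire_size *= 3;
		avg_wire_size += 2176;
	} else if (avg_wire_size <= 1920) {
		avg_wire_size += 4480;
	} else {
		avg_wire_size = 6656;
	}

	return avg_wire_size / 256.0;
}

/* The expression being approximated. */
double itr_usecs_exact(unsigned int size)
{
	return 34.0 * (size + 24) / (size + 640);
}

/* Largest (exact - approx) gap over sizes lo..hi inclusive. */
double max_gap_usecs(unsigned int lo, unsigned int hi)
{
	double gap, max = itr_usecs_exact(lo) - itr_usecs_approx(lo);
	unsigned int size;

	for (size = lo; size <= hi; size++) {
		gap = itr_usecs_exact(size) - itr_usecs_approx(size);
		if (gap > max)
			max = gap;
	}
	return max;
}

/* Smallest (exact - approx) gap; positive means under the curve. */
double min_gap_usecs(unsigned int lo, unsigned int hi)
{
	double gap, min = itr_usecs_exact(lo) - itr_usecs_approx(lo);
	unsigned int size;

	for (size = lo; size <= hi; size++) {
		gap = itr_usecs_exact(size) - itr_usecs_approx(size);
		if (gap < min)
			min = gap;
	}
	return min;
}
```

Under that assumption, for sizes from 64 through 1920 the gap stays
positive and below 2us.  Just past 1920 the 26us plateau sits very
slightly above the curve, which is one of the spots that could use the
tuning mentioned above.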

You may even want to tune this to be a bit more aggressive in terms of 
interrupts per second.  I know some distros such as RHEL are still 
running around with an untuned skb_shared_info and such and as a result 
they take up more space resulting in a larger memory footprint.  What 
this would represent is modifying the 640 value in the original function 
to increase based on the extra overhead.  Then it would be necessary to 
modify the slopes, offsets, and transition points to get the right 
approximation for the new curve.
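
To see which way the curve moves, here is the same expression with the
per-packet overhead as a parameter.  The function name is hypothetical,
and any overhead other than 640 is just a stand-in, not a measured
value for any particular distro.

```c
/* Reduced formula with the per-packet memory overhead as a parameter.
 * 640 is the value used above; anything larger is a hypothetical
 * stand-in for a kernel with a bigger skb_shared_info.
 */
double itr_usecs_for_overhead(unsigned int size, unsigned int overhead)
{
	return 34.0 * (size + 24) / (size + overhead);
}
```

For a 1514 byte average, going from an overhead of 640 to, say, 768
shortens the interval from a little over 24us to just under 23us, so
the approximation would need to be pulled down to match.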

> I could also try to take a completely different algorithm say from i40e
> instead? This one really has limited testing.

Yeah, this one was a "good enough" solution at the time and as I recall 
it was a clone of the igb interrupt moderation.  Now that you have real 
ports you probably need something tuned for more than 1Gbps.

The i40e one is from ixgbe, which was inherited from e1000.  The problem 
with the interrupt moderation in that design is that for any high 
throughput usage everything becomes bulk (8K ints per second). For 1G 
that works fine.  But I can tell you from what I have seen on 10Gbps 
NICs it doesn't do well under any kind of small packet, or single 
threaded throughput test.