Re: Initial thoughts on TXDP - Jesper Dangaard Brouer

From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: brouer@redhat.com, Tom Herbert <tom@herbertland.com>,
	Florian Westphal <fw@strlen.de>,
	Linux Kernel Network Developers <netdev@vger.kernel.org>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	John Fastabend <john.fastabend@gmail.com>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: Initial thoughts on TXDP
Date: Fri, 2 Dec 2016 14:01:02 +0100
Message-ID: <20161202140102.1d515e0b@redhat.com>
In-Reply-To: <859a0c99-f427-1db8-d260-1297777792fb@stressinduktion.org>

On Thu, 1 Dec 2016 23:47:44 +0100
Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:

> Side note:
> 
> On 01.12.2016 20:51, Tom Herbert wrote:
> >> > E.g. "mini-skb": Even if we assume that this provides a speedup
> >> > (where does that come from? should make no difference if a 32 or
> >> >  320 byte buffer gets allocated).

Yes, the size of the allocation from the SLUB allocator does not change
the base performance/cost much (at least for small objects, < 1024 bytes).

Do note that the base SLUB alloc+free cost is fairly high (compared to a
201 cycles budget), especially for networking, as the free-side is very
likely to hit a slow path.  The SLUB fast-path costs 53 cycles, and the
slow-path around 100 cycles (data from [1]).  I've tried to address this
with the kmem_cache bulk APIs, which reduce the cost to approx 30 cycles.
(Something we have not fully reaped the benefit from yet!)

[1] https://git.kernel.org/torvalds/c/ca257195511
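
To put these costs next to the budget, here is a rough userspace sketch
of the arithmetic.  It assumes the 201 cycles figure corresponds to
10Gbit/s wirespeed with minimum-size frames (84 bytes on the wire) on a
3GHz CPU; that derivation is not spelled out above, and the function
names are made up for illustration only.

```c
#include <stdio.h>

/* Cycles available per packet at wirespeed.
 * wire_bytes is the on-wire size, including preamble and inter-frame
 * gap (assumed 84 bytes for a minimum-size Ethernet frame). */
double cycles_per_packet(double link_bps, double wire_bytes, double cpu_hz)
{
	double pps = link_bps / (wire_bytes * 8.0);

	return cpu_hz / pps;
}

/* Share of the per-packet budget used by one alloc+free pair, in percent. */
double budget_share_pct(double cost_cycles, double budget_cycles)
{
	return 100.0 * cost_cycles / budget_cycles;
}

/* Print the three alloc+free costs quoted above against the budget. */
void print_slub_budget(void)
{
	/* Assumption: 10Gbit/s link, 84 bytes on the wire, 3GHz CPU */
	double budget = cycles_per_packet(10e9, 84.0, 3e9);

	printf("budget:     %.1f cycles/packet\n", budget);
	printf("fast-path:  %.1f%% of budget\n", budget_share_pct(53.0, budget));
	printf("slow-path:  %.1f%% of budget\n", budget_share_pct(100.0, budget));
	printf("bulk API:   %.1f%% of budget\n", budget_share_pct(30.0, budget));
}
```

Under these assumptions the fast-path eats roughly a quarter of the
per-packet budget, the slow-path roughly half, and the bulk APIs around
15%.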

> >> >  
> > It's the zero'ing of three cache lines. I believe we talked about that
> > as netdev.

Actually 4 cache-lines, but with some cleanup I believe we can get down
to clearing 192 bytes, i.e. 3 cache-lines.

> 
> Jesper and me played with that again very recently:
> 
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590
> 
> In micro-benchmarks we saw a pretty good speed up not using the rep
> stosb generated by gcc builtin but plain movq's. Probably the cost model
> for __builtin_memset in gcc is wrong?

Yes, I believe so.

> When Jesper is free we wanted to benchmark this and maybe come up with a
> arch specific way of cleaning if it turns out to really improve throughput.
> 
> SIMD instructions seem even faster but the kernel_fpu_begin/end() kill
> all the benefits.

One strange thing was that on my Skylake CPU (i7-6700K @4.00GHz),
Hannes's hand-optimized MOVQ ASM-code didn't go past 8 bytes per cycle,
or 32 cycles for 256 bytes.
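
To illustrate where the 32 cycles come from: clearing 256 bytes with
8-byte stores takes 32 stores, so one store per cycle gives 8 bytes per
cycle.  The snippet below is only a portable C stand-in for that idea,
not Hannes's ASM-code (which lives in time_bench_memset.c); the function
names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Clear len bytes using one 64-bit store per 8 bytes.
 * Assumes p is 8-byte aligned and len is a multiple of 8.
 * The volatile qualifier is only there to keep the compiler from
 * turning the loop back into a memset call. */
void clear_u64_stores(void *p, size_t len)
{
	volatile uint64_t *q = p;
	size_t i;

	for (i = 0; i < len / 8; i++)
		q[i] = 0;
}

/* Number of 8-byte stores needed for len bytes, which is also the
 * cycle count if the CPU sustains exactly one such store per cycle. */
size_t stores_needed(size_t len)
{
	return len / 8;
}

/* Fill a 256-byte buffer with 0xff, clear it, and verify the result.
 * Returns 1 if every byte ended up zero, 0 otherwise. */
int selftest_clear_256(void)
{
	uint64_t buf[32];	/* 256 bytes, naturally 8-byte aligned */
	size_t i;

	memset(buf, 0xff, sizeof(buf));
	clear_u64_stores(buf, sizeof(buf));
	for (i = 0; i < 32; i++)
		if (buf[i] != 0)
			return 0;
	return 1;
}
```

This says nothing about what the hardware can actually sustain; it only
shows the store count behind the 8 bytes per cycle figure.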

Talking to Alex and John during netdev, and reading up on the Intel arch,
I thought that this CPU should be able to perform 16 bytes per cycle.
The CPU can do it, as rep-stos shows once the size gets large
enough.

On this CPU the memset rep stos starts to win around 512 bytes:

 192/35 =  5.5 bytes/cycle
 256/36 =  7.1 bytes/cycle
 512/40 = 12.8 bytes/cycle
 768/46 = 16.7 bytes/cycle
1024/52 = 19.7 bytes/cycle
2048/84 = 24.4 bytes/cycle
4096/148= 27.7 bytes/cycle
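
The crossover can be read off the table mechanically.  A small sketch,
with the measurements above copied in as data (the struct and function
names are invented for this example), that finds the first measured
size where rep stos beats a given bytes/cycle rate:

```c
#include <stddef.h>

struct memset_sample {
	size_t bytes;		/* size cleared */
	unsigned int cycles;	/* measured cycles for rep stos memset */
};

/* Measurements copied from the table above */
static const struct memset_sample samples[] = {
	{  192,  35 }, {  256,  36 }, {  512,  40 }, { 768, 46 },
	{ 1024,  52 }, { 2048,  84 }, { 4096, 148 },
};
#define NSAMPLES (sizeof(samples) / sizeof(samples[0]))

double bytes_per_cycle(const struct memset_sample *s)
{
	return (double)s->bytes / s->cycles;
}

/* Smallest measured size where rep stos exceeds the given rate,
 * or 0 if no sample does. */
size_t first_size_above(double rate)
{
	size_t i;

	for (i = 0; i < NSAMPLES; i++)
		if (bytes_per_cycle(&samples[i]) > rate)
			return samples[i].bytes;
	return 0;
}
```

Calling first_size_above(8.0), i.e. comparing against the 8 bytes per
cycle of the MOVQ code, gives 512, and first_size_above(16.0) gives 768.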

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

