Netdev List

* Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP
From: Thomas Graf @ 2016-09-21 18:58 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jakub Kicinski, Alexei Starovoitov, David S. Miller,
	Linux Kernel Network Developers, Kernel Team, Tariq Toukan,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet,
	Jesper Dangaard Brouer
In-Reply-To: <CALx6S34c6Mz=YpbiPcVNe38wTkxfjB9SWxaqtz1LivLeGzssRQ@mail.gmail.com>

On 09/21/16 at 11:50am, Tom Herbert wrote:
> 50 lines in one driver is not a big deal, 50 lines in a hundred
> drivers is! I learned this lesson in BQL which was well abstracted out
> to be minimally invasive but we still saw many issues because of the
> peculiarities of different drivers.

You want to enable XDP in a hundred drivers? Are you planning to
deploy ISA NIC based ILA routers? ;-)

* Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP
From: Jakub Kicinski @ 2016-09-21 18:54 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alexei Starovoitov, David S. Miller,
	Linux Kernel Network Developers, Kernel Team, Tariq Toukan,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet,
	Jesper Dangaard Brouer
In-Reply-To: <CALx6S34c6Mz=YpbiPcVNe38wTkxfjB9SWxaqtz1LivLeGzssRQ@mail.gmail.com>

On Wed, 21 Sep 2016 11:50:06 -0700, Tom Herbert wrote:
> On Wed, Sep 21, 2016 at 11:45 AM, Jakub Kicinski <kubakici@wp.pl> wrote:
> > On Wed, 21 Sep 2016 10:39:40 -0700, Tom Herbert wrote:  
> >> On Wed, Sep 21, 2016 at 10:26 AM, Jakub Kicinski <kubakici@wp.pl> wrote:  
> >> > On Tue, 20 Sep 2016 17:01:39 -0700, Alexei Starovoitov wrote:  
> >> >>  >  - Reduces the amount of code and complexity needed in drivers to
> >> >>  >    manage XDP  
> >> >>
> >> >> hmm:
> >> >> 534 insertions(+), 144 deletions(-)
> >> >> looks like increase in complexity instead.  
> >> >
> >> > and more to come to tie this with HW offloads.  
> >>
> >> The amount of driver code did decrease with these patches:
> >>
> >> drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 64 ++++----------------------
> >> drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 25 ++++------
> >> drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 -
> >>
> >> Minimizing the complexity added to drivers for XDP is critical since
> >> we are basically asking every driver to replicate the function. This
> >> property should also apply to HW offloads: the more complexity we
> >> can abstract out of drivers into a common backend infrastructure, the
> >> better for supporting it across different drivers.
> >
> > I'm in the middle of writing/testing XDP support for the Netronome's
> > driver and generic infra is very much appreciated ;)  In my experience
> > the 50 lines of code which are required for assigning the programs and
> > freeing them aren't really a big deal, though.
> >  
> 
> 50 lines in one driver is not a big deal, 50 lines in a hundred
> drivers is! I learned this lesson in BQL which was well abstracted out
> to be minimally invasive but we still saw many issues because of the
> peculiarities of different drivers.

Agreed, I just meant to say that splitting rings and rewriting the RX path
to behave differently for the XDP vs non-XDP case is way more brain
consuming than a bit of boilerplate code, so if anyone could solve those
two it would be much appreciated :)  My main point was what I wrote
below, though.

> > Let's also separate putting xdp_prog in netdevice/napi_struct from the
> > generic hook infra.  All the simplifications to the driver AFAICS come
> > from the former.  If everyone is fine with growing napi_struct we can do
> > that but IMHO this is not an argument for the generic infra :)  

* Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP
From: Tom Herbert @ 2016-09-21 18:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, David S. Miller,
	Linux Kernel Network Developers, Kernel Team, Tariq Toukan,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet,
	Jesper Dangaard Brouer
In-Reply-To: <20160921194548.33b85cb3@jkicinski-Precision-T1700>

On Wed, Sep 21, 2016 at 11:45 AM, Jakub Kicinski <kubakici@wp.pl> wrote:
> On Wed, 21 Sep 2016 10:39:40 -0700, Tom Herbert wrote:
>> On Wed, Sep 21, 2016 at 10:26 AM, Jakub Kicinski <kubakici@wp.pl> wrote:
>> > On Tue, 20 Sep 2016 17:01:39 -0700, Alexei Starovoitov wrote:
>> >>  >  - Reduces the amount of code and complexity needed in drivers to
>> >>  >    manage XDP
>> >>
>> >> hmm:
>> >> 534 insertions(+), 144 deletions(-)
>> >> looks like increase in complexity instead.
>> >
>> > and more to come to tie this with HW offloads.
>>
>> The amount of driver code did decrease with these patches:
>>
>> drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 64 ++++----------------------
>> drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 25 ++++------
>> drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 -
>>
>> Minimizing the complexity added to drivers for XDP is critical since
>> we are basically asking every driver to replicate the function. This
>> property should also apply to HW offloads: the more complexity we
>> can abstract out of drivers into a common backend infrastructure, the
>> better for supporting it across different drivers.
>
> I'm in the middle of writing/testing XDP support for the Netronome's
> driver and generic infra is very much appreciated ;)  In my experience
> the 50 lines of code which are required for assigning the programs and
> freeing them aren't really a big deal, though.
>

50 lines in one driver is not a big deal, 50 lines in a hundred
drivers is! I learned this lesson in BQL which was well abstracted out
to be minimally invasive but we still saw many issues because of the
peculiarities of different drivers.

> Let's also separate putting xdp_prog in netdevice/napi_struct from the
> generic hook infra.  All the simplifications to the driver AFAICS come
> from the former.  If everyone is fine with growing napi_struct we can do
> that but IMHO this is not an argument for the generic infra :)

* Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs
From: Thomas Graf @ 2016-09-21 18:48 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Daniel Mack, htejun, daniel, ast, davem, kafai, fw, harald,
	netdev, sargun, cgroups
In-Reply-To: <20160921154533.GA13656@salvia>

On 09/21/16 at 05:45pm, Pablo Neira Ayuso wrote:
> On Tue, Sep 20, 2016 at 06:43:35PM +0200, Daniel Mack wrote:
> > The point is that from an application's perspective, restricting the
> > ability to bind a port and dropping packets that are being sent is a
> > very different thing. Applications will start to behave differently if
> > they can't bind to a port, and that's something we do not want to happen.
> 
> What exactly is the problem? Applications are not checking the return
> value from bind()? They should be fixed. If you want to collect
> statistics, I see no reason why you couldn't collect them for every
> EACCES on each bind() call.

It's not about applications not checking the return value of bind().
Unfortunately, many applications (or the respective libraries they use)
retry on connect() failure but handle bind() errors as a hard failure
and exit. Yes, it's an application or library bug, but these
applications have very specific expectations about how something fails.
Sometimes even going from drop to RST will break applications.

Paranoia speaking: by returning errors where no error was returned
before, undefined behaviour occurs. In Murphy speak: things break.

This is a given and we can't fix it from the kernel side. Returning an
error at the system call level has many benefits but it's not always an
option.

Adding the late hook does not prevent filtering at the socket layer from
being added as well. I think we need both.

* Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP
From: Jakub Kicinski @ 2016-09-21 18:45 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alexei Starovoitov, David S. Miller,
	Linux Kernel Network Developers, Kernel Team, Tariq Toukan,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet,
	Jesper Dangaard Brouer
In-Reply-To: <CALx6S37CkP44ZSNZM3UNBMv_ECFS6tf5T_37VSMyBCjm9ejiDw@mail.gmail.com>

On Wed, 21 Sep 2016 10:39:40 -0700, Tom Herbert wrote:
> On Wed, Sep 21, 2016 at 10:26 AM, Jakub Kicinski <kubakici@wp.pl> wrote:
> > On Tue, 20 Sep 2016 17:01:39 -0700, Alexei Starovoitov wrote:  
> >>  >  - Reduces the amount of code and complexity needed in drivers to
> >>  >    manage XDP  
> >>
> >> hmm:
> >> 534 insertions(+), 144 deletions(-)
> >> looks like increase in complexity instead.  
> >
> > and more to come to tie this with HW offloads.  
> 
> The amount of driver code did decrease with these patches:
> 
> drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 64 ++++----------------------
> drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 25 ++++------
> drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 -
> 
> Minimizing the complexity added to drivers for XDP is critical since
> we are basically asking every driver to replicate the function. This
> property should also apply to HW offloads: the more complexity we
> can abstract out of drivers into a common backend infrastructure, the
> better for supporting it across different drivers.

I'm in the middle of writing/testing XDP support for the Netronome's
driver and generic infra is very much appreciated ;)  In my experience
the 50 lines of code which are required for assigning the programs and
freeing them aren't really a big deal, though.

Let's also separate putting xdp_prog in netdevice/napi_struct from the
generic hook infra.  All the simplifications to the driver AFAICS come
from the former.  If everyone is fine with growing napi_struct we can do
that but IMHO this is not an argument for the generic infra :)

* [PATCH v2] tcp: fix wrong checksum calculation on MTU probing
From: Douglas Caetano dos Santos @ 2016-09-21 18:26 UTC (permalink / raw)
  To: David Miller; +Cc: kuznet, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <20160920.225729.1715125475365382873.davem@davemloft.net>

With TCP MTU probing enabled and offload TX checksumming disabled,
tcp_mtu_probe() calculated the wrong checksum when a fragment being copied
into the probe's SKB had an odd length. This was caused by the direct use
of skb_copy_and_csum_bits() to calculate the checksum, as it pads the
fragment being copied, if needed. When this fragment was not the last, a
subsequent call used the previous checksum without considering this
padding.

The effect was a connection stalled in one direction, as even
retransmissions wouldn't solve the problem, because the checksum was never
recalculated for the full SKB length.

Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br>
---
 net/ipv4/tcp_output.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f53d0cc..767135e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1968,10 +1968,12 @@ static int tcp_mtu_probe(struct sock *sk)
 		copy = min_t(int, skb->len, probe_size - len);
 		if (nskb->ip_summed)
 			skb_copy_bits(skb, 0, skb_put(nskb, copy), copy);
-		else
-			nskb->csum = skb_copy_and_csum_bits(skb, 0,
-							    skb_put(nskb, copy),
-							    copy, nskb->csum);
+		else {
+			__wsum csum = skb_copy_and_csum_bits(skb, 0,
+							     skb_put(nskb, copy),
+							     copy, 0);
+			nskb->csum = csum_block_add(nskb->csum, csum, len);
+		}

 		if (skb->len <= copy) {
 			/* We've eaten all the data from this skb.
-- 
2.5.0
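
The padding problem described in the commit message can be reproduced in
user space. The sketch below is not kernel code: fold(),
csum_partial_sketch() and csum_block_add_sketch() are hypothetical
stand-ins that mimic, under simplifying assumptions (big-endian 16-bit
words, a 32-bit one's-complement accumulator), what
skb_copy_and_csum_bits() and csum_block_add() do.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement accumulator down to 16 bits. */
static uint16_t fold(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * Sum buf as big-endian 16-bit words, starting from the running sum
 * `init`. A trailing odd byte is zero-padded, which is the padding the
 * commit message refers to.
 */
static uint32_t csum_partial_sketch(const uint8_t *buf, size_t len,
                                    uint32_t init)
{
    uint64_t sum = init;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)buf[i] << 8 | buf[i + 1];
    if (i < len)
        sum += (uint32_t)buf[i] << 8;   /* odd byte, padded with zero */
    while (sum >> 32)
        sum = (sum & 0xffffffff) + (sum >> 32);
    return (uint32_t)sum;
}

/*
 * Combine the checksum of a block that starts at byte `offset` of the
 * final buffer. At an odd offset every byte of the block sits in the
 * other half of its 16-bit word, so the block's sum is byte-swapped
 * before it is added.
 */
static uint32_t csum_block_add_sketch(uint32_t csum, uint32_t csum2,
                                      size_t offset)
{
    uint16_t part = fold(csum2);

    if (offset & 1)
        part = (uint16_t)(part << 8 | part >> 8);
    return fold(csum) + (uint32_t)part;   /* fold again when finalizing */
}
```

Feeding the running sum of an odd-length first fragment straight into the
next csum_partial_sketch() call reproduces the bug; summing the second
fragment from zero and combining it with csum_block_add_sketch() at the
fragment's offset matches the checksum of the whole buffer.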

* Re: [RFC PATCH v3 2/7] proc: Reduce cache miss in {snmp,netstat}_seq_show
From: Marcelo @ 2016-09-21 18:24 UTC (permalink / raw)
  To: hejianet
  Cc: netdev, linux-sctp, linux-kernel, davem, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Vlad Yasevich,
	Neil Horman, Steffen Klassert, Herbert Xu
In-Reply-To: <84aae301-0d00-d953-e6f6-d2d163d1136a@gmail.com>

On Thu, Sep 22, 2016 at 12:18:46AM +0800, hejianet wrote:
> Hi Marcelo
> 
> sorry for the late reply, I just came back from a vacation.

Hi, no problem. Hope your batteries are recharged now :-)

> 
> On 9/14/16 7:55 PM, Marcelo wrote:
> > Hi Jia,
> > 
> > On Wed, Sep 14, 2016 at 01:58:42PM +0800, hejianet wrote:
> > > Hi Marcelo
> > > 
> > > 
> > > On 9/13/16 2:57 AM, Marcelo wrote:
> > > > On Fri, Sep 09, 2016 at 02:33:57PM +0800, Jia He wrote:
> > > > > This is to use the generic interface snmp_get_cpu_field{,64}_batch to
> > > > > aggregate the data by going through all the items of each cpu sequentially.
> > > > > Then snmp_seq_show and netstat_seq_show are split into 2 parts to avoid build
> > > > > warning "the frame size" larger than 1024 on s390.
> > > > Yeah about that, did you test it with stack overflow detection?
> > > > These arrays can be quite large.
> > > > 
> > > > One more below..
> > > Do you think it is acceptable if the stack usage is a little larger than 1024?
> > > e.g. 1120
> > > I can't find any other way to reduce the stack usage except use "static" before
> > > unsigned long buff[TCP_MIB_MAX]
> > > 
> > > PS. sizeof buff is about TCP_MIB_MAX(116)*8=928
> > > B.R.
> > That's pretty much the question. Linux has the option on some archs to
> > run with 4 KB stacks (4KSTACKS option), so this function alone would be
> > using 25% of it in this last case. While on x86_64, it uses 16 KB
> > (6538b8ea886e ("x86_64: expand kernel stack to 16K")).
> > 
> > Adding static to it is not an option as it actually makes the variable
> > shared amongst the CPUs (and then you have concurrency issues), plus the
> > fact that it's always allocated, even while not in use.
> > 
> > Others here certainly know better than me if it's okay to make such
> > usage of the stack.
> What about this patch instead?
> It is a trade-off. I split the aggregation process into 2 parts, it will
> increase the cache miss a little bit, but it can reduce the stack usage.
> After this, stack usage is 672 bytes:
> objdump -d vmlinux | ./scripts/checkstack.pl ppc64 | grep seq_show
> 0xc0000000007f7cc0 netstat_seq_show_tcpext.isra.3 [vmlinux]:672
> 
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index c6ee8a2..cc41590 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -486,22 +486,37 @@ static const struct file_operations snmp_seq_fops = {
>   */
>  static int netstat_seq_show_tcpext(struct seq_file *seq, void *v)
>  {
> -       int i;
> -       unsigned long buff[LINUX_MIB_MAX];
> +       int i, c;
> +       unsigned long buff[LINUX_MIB_MAX/2 + 1];
>         struct net *net = seq->private;
> 
> -       memset(buff, 0, sizeof(unsigned long) * LINUX_MIB_MAX);
> +       memset(buff, 0, sizeof(unsigned long) * (LINUX_MIB_MAX/2 + 1));
> 
>         seq_puts(seq, "TcpExt:");
>         for (i = 0; snmp4_net_list[i].name; i++)
>                 seq_printf(seq, " %s", snmp4_net_list[i].name);
> 
>         seq_puts(seq, "\nTcpExt:");
> -       snmp_get_cpu_field_batch(buff, snmp4_net_list,
> -                                net->mib.net_statistics);
> -       for (i = 0; snmp4_net_list[i].name; i++)
> +       for_each_possible_cpu(c) {
> +               for (i = 0; i < LINUX_MIB_MAX/2; i++)
> +                       buff[i] += snmp_get_cpu_field(
> +                                               net->mib.net_statistics,
> +                                               c, snmp4_net_list[i].entry);
> +       }
> +       for (i = 0; i < LINUX_MIB_MAX/2; i++)
>                 seq_printf(seq, " %lu", buff[i]);
> 
> +       memset(buff, 0, sizeof(unsigned long) * (LINUX_MIB_MAX/2 + 1));
> +       for_each_possible_cpu(c) {
> +               for (i = LINUX_MIB_MAX/2; snmp4_net_list[i].name; i++)
> +                       buff[i - LINUX_MIB_MAX/2] += snmp_get_cpu_field(
> +                               net->mib.net_statistics,
> +                               c,
> +                               snmp4_net_list[i].entry);
> +       }
> +        for (i = LINUX_MIB_MAX/2; snmp4_net_list[i].name; i++)
> +                seq_printf(seq, " %lu", buff[i - LINUX_MIB_MAX/2]);
> +
>         return 0;
>  }
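
The trade-off in the quoted patch, a half-sized scratch buffer filled in
two passes over the CPUs, can be illustrated outside the kernel. This is
only a sketch: NCPU, NFIELDS, the counters table, sum_range() and
sum_two_pass() are made-up stand-ins for for_each_possible_cpu(),
LINUX_MIB_MAX and the per-CPU MIB fields, and the totals are copied out
instead of being printed to a seq_file.

```c
#include <stddef.h>
#include <string.h>

#define NCPU    4                    /* hypothetical CPU count */
#define NFIELDS 7                    /* hypothetical number of counters */
#define HALF    ((NFIELDS + 1) / 2)  /* size of the scratch buffer */

/* Sum fields [first, last) across all CPUs into out[0 .. last - first). */
static void sum_range(unsigned long counters[NCPU][NFIELDS],
                      size_t first, size_t last, unsigned long *out)
{
    memset(out, 0, (last - first) * sizeof(*out));
    for (size_t c = 0; c < NCPU; c++)
        for (size_t i = first; i < last; i++)
            out[i - first] += counters[c][i];
}

/* Aggregate every field while keeping only HALF entries on the stack. */
static void sum_two_pass(unsigned long counters[NCPU][NFIELDS],
                         unsigned long totals[NFIELDS])
{
    unsigned long buff[HALF];   /* roughly half of a full-size buffer */

    /* First pass: lower half of the fields. */
    sum_range(counters, 0, HALF, buff);
    memcpy(totals, buff, HALF * sizeof(buff[0]));

    /* Second pass: upper half, reusing the same scratch buffer. */
    sum_range(counters, HALF, NFIELDS, buff);
    memcpy(totals + HALF, buff, (NFIELDS - HALF) * sizeof(buff[0]));
}
```

Each CPU's counters are visited twice, which is where the extra cache
misses mentioned above come from.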

Yep, it halves the stack usage, but it doesn't look good heh

But well, you may try to post the patchset (with or without this last
change, you pick) officially and see how it goes. As you're posting as
RFC, it's not being evaluated as seriously.

FWIW, I tested your patches, using your test and /proc/net/snmp file on
a x86_64 box, Intel(R) Xeon(R) CPU E5-2643 v3.

Before the patches:

 Performance counter stats for './test /proc/net/snmp':

             5.225      cache-misses                                                
    12.708.673.785      L1-dcache-loads                                             
     1.288.450.174      L1-dcache-load-misses     #   10,14% of all L1-dcache hits  
     1.271.857.028      LLC-loads                                                   
             4.122      LLC-load-misses           #    0,00% of all LL-cache hits   

       9,174936524 seconds time elapsed

After:

 Performance counter stats for './test /proc/net/snmp':

             2.865      cache-misses                                                
    30.203.883.807      L1-dcache-loads                                             
     1.215.774.643      L1-dcache-load-misses     #    4,03% of all L1-dcache hits  
     1.181.662.831      LLC-loads                                                   
             2.685      LLC-load-misses           #    0,00% of all LL-cache hits   

      13,374445056 seconds time elapsed

Numbers were steady across multiple runs.

  Marcelo

> 
> > > > > +static int netstat_seq_show_ipext(struct seq_file *seq, void *v)
> > > > > +{
> > > > > +	int i;
> > > > > +	u64 buff64[IPSTATS_MIB_MAX];
> > > > > +	struct net *net = seq->private;
> > > > >    	seq_puts(seq, "\nIpExt:");
> > > > >    	for (i = 0; snmp4_ipextstats_list[i].name != NULL; i++)
> > > > >    		seq_printf(seq, " %s", snmp4_ipextstats_list[i].name);
> > > > >    	seq_puts(seq, "\nIpExt:");
> > > > You're missing a memset() call here.
> > Not sure if you missed this one or not..
> indeed, thanks
> B.R.
> Jia
> > Thanks,
> > Marcelo
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

* Re: [ANNOUNCE] ndiv: line-rate network traffic processing
From: Willy Tarreau @ 2016-09-21 18:06 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Kernel Network Developers
In-Reply-To: <CALx6S34Oy6QeOxR45PuuX70hfm-TR=2mpDwwqr82c7tvpWLX+Q@mail.gmail.com>

Hi Tom,

On Wed, Sep 21, 2016 at 10:16:45AM -0700, Tom Herbert wrote:
> This does seem interesting and indeed the driver datapath looks very
> much like XDP. If you could rebase and then maybe look at how this can
> work with XDP, that would be helpful.

OK I'll assign some time to rebase it then.

> The return actions are identical,

I'm not surprised that the same needs lead to the same designs when
these designs are constrained by CPU cycle count :-)

> but processing descriptor meta data
> (like checksum, vlan) is not yet implemented in XDP-- maybe this is
> something we can leverage from ndiv?

Yes possibly. It's not much work but it's absolutely mandatory if you
don't want to waste some smart devices' valuable performance improvements.
We changed our API when porting it to ixgbe to support what this NIC (and
many other ones) supports so that the application code doesn't have to
deal with checksums etc. By the way, VLAN is not yet implemented in the
mvneta driver. But this choice ensures that no application has to deal
with these details or risk introducing bugs.

Cheers,
Willy

* Re: [ANNOUNCE] ndiv: line-rate network traffic processing
From: Willy Tarreau @ 2016-09-21 18:00 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Tom Herbert, Alexei Starovoitov, Brenden Blanco,
	Tariq Toukan
In-Reply-To: <20160921182639.2fce960a@redhat.com>

Hi Jesper!

On Wed, Sep 21, 2016 at 06:26:39PM +0200, Jesper Dangaard Brouer wrote:
> I definitely want to study it!

Great, at least I've not put this online for nothing :-)

> You mention XDP.  If you didn't notice, I've created some documentation
> on XDP (it is very "live" documentation at this point and it will
> hopefully "materialize" later in the process).  But it should be a good
> starting point for understanding XDP:
> 
>  https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html

Thanks, I'll read it. We'll need to educate ourselves to see how to port
our anti-ddos to XDP in the future I guess, so better ensure the design
is fit from the beginning!

> > I presented it in 2014 at kernel recipes :
> >   http://kernel-recipes.org/en/2014/ndiv-a-low-overhead-network-traffic-diverter/
> 
> Cool, and it even has a video!

Yep, with a horrible English accent :-)

> > It now supports drivers mvneta, ixgbe, e1000e, e1000 and igb. It is
> > very light, and retrieves the packets in the NIC's driver before they
> > are converted to an skb, then submits them to a registered RX handler
> > running in softirq context so we have the best of all worlds by
> > benefitting from CPU scalability, delayed processing, and not paying
> > the cost of switching to userland. Also an rx_done() function allows
> > handlers to batch their processing. 
> 
> Wow - it does sound a lot like XDP!  I would say that sort of
> validates the current direction of XDP, and that there are real
> use-cases for this stuff.

Absolutely! In fact what drove us to this architecture is that we first
wrote our anti-ddos in userland using netmap. While userland might be OK
for switches and routers, in our case we have haproxy listening on TCP
sockets and waiting for these packets. So the packets were bouncing from
kernel to user, then to kernel again, losing checksums, GRO, GSO, etc...
We modified it to support all of these but the performance was still poor,
capping at about 8 Gbps of forwarded traffic instead of ~40. Thus we thought
that the processing would definitely need to be placed in the kernel to avoid
this bouncing, and to avoid turning rings into newer rings all the time.
That's when I realized that it could possibly also cover my needs for a
sniffer and we redesigned the initial code to support both use cases. Now
we don't even see it in regular traffic, which is pretty nice.

> > The RX handler returns an action
> > among accepting the packet as-is, accepting it modified (eg: vlan or
> > tunnel decapsulation), dropping it, postponing the processing
> > (equivalent to EAGAIN), or building a new packet to send back.
> 
> I'll be very interested in studying in detail how you implemented this
> and how you chose which actions to implement.

OK. The HTTP server is a good use case to study because it lets packets
pass through, be dropped, or be responded to, and the code is very
small, so easy to analyse.

> What was the need for postponing the processing (EAGAIN)?

Our SYN cookie generator. If the NIC's Tx queue is full and we cannot build
a SYN-ACK, we prefer to break out of the Rx loop because there's still room
in the Rx ring (statistically speaking).

> > This last function is the one requiring the most changes in existing
> > drivers, but offers the widest range of possibilities. We use it to
> > send SYN cookies, but I have also implemented a stateless HTTP server
> > supporting keep-alive using it, achieving line-rate traffic processing
> > on a single CPU core when the NIC supports it. It's very convenient to
> > test various stateful TCP components as it's easy to sustain millions
> > of connections per second on it.
> 
> Interesting, and controversial use-case.  One controversial use-case
> for XDP that I imagined was implementing a DNS accelerator, which
> answers simple and frequent requests.

We thought about such a use case as well, just like a ping responder
(rate limited to avoid serving as DDoS responders).

> You took it a step further with a HTTP server!

It's a fake HTTP server. You ask it to return 1kB of data and it sends you
1kB. It can even do multiple segments but then you're facing the risk of
losses that you'd preferably avoid. But in our case it's very useful to
test various things including netfilter, LVS and haproxy, because it needs
so little power to reach performance levels they cannot reach themselves
that you can set it up on a small machine (eg: a cheap USB-powered
ARM board saturates the GigE link with 340 kcps, 663 krps). However I found
that it *could* be fun to improve it to deliver favicons or small error
pages.

> > It does not support forwarding between NICs. It was my first goal
> > because I wanted to implement a TAP with it, bridging the traffic
> > between two ports, but figured it was adding some complexity to the
> > system back then. 
> 
> With all the XDP features at the moment, we have avoided going through
> the page allocator, by relying on different page recycling tricks.
> 
> When doing forwarding between NICs it is harder to do these page
> recycling tricks.  I've measured that the page allocator's fast-path
> ("recycling" same page) costs approx 270 cycles, and the 14Mpps cycle
> count on this 4GHz CPU is 268 cycles.  Thus, it is a non-starter...

Wow indeed. We're doing complete stateful inspection and policy-based
filtering in less than 67 ns, so indeed here it would be far too long.
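
The figures in this exchange follow from simple arithmetic. As a sketch,
assuming minimum-size Ethernet frames on a 10 Gb/s link (64 bytes of
frame plus 20 bytes of preamble, start delimiter and inter-frame gap,
i.e. 84 bytes on the wire), the per-packet budget can be computed as
below; the function names are illustrative only.

```c
/* Packets per second on a link of `bps` bits/s with `wire_bytes` per frame. */
static double packets_per_second(double bps, double wire_bytes)
{
    return bps / (wire_bytes * 8.0);
}

/* Time available per packet, in nanoseconds. */
static double ns_per_packet(double pps)
{
    return 1e9 / pps;
}

/* CPU cycles available per packet at a clock rate of `hz`. */
static double cycles_per_packet(double pps, double hz)
{
    return hz / pps;
}
```

With 10e9 bit/s and 84 bytes this gives roughly 14.88 Mpps, about 67.2 ns
per packet and about 268.8 cycles at 4 GHz, so a fast path costing around
270 cycles already exceeds the whole budget.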

> Did you have to modify the page allocator?
> Or implement some kind of recycling?

We "simply" implemented our own Tx ring, depending on what drivers and
hardware support. This is the most complicated part of the code because
it is very hardware-dependent and because you want to deal with conflicts
between the packets being generated on the Rx path and other packets being
delivered by other cores on the regular Tx path. In some drivers we cheat
on the skb pointer in the descriptors, we set bit 0 to 1 to mark it as
being ours so that we recycle it into our ring after it's sent instead of
releasing an skb. That's why it would be hard to implement forwarding. I
thought that I could at least implement it between two NICs making use of
the same driver, but it could lead to starvation of certain tx rings and
other ones filling up. However I don't have a solution for now because I
decided to stop thinking about it at the moment. Over the long term I
would love to see my mirabox being used as an inline tap logging to USB3 :-)
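
The trick of setting bit 0 of the stored skb pointer relies on the
pointed-to objects being at least 2-byte aligned, so the low bit is always
free. A minimal user-space sketch of such pointer tagging follows;
tag_ours(), is_ours(), untag() and on_tx_complete() are hypothetical
helpers, not functions from the ndiv code.

```c
#include <stdbool.h>
#include <stdint.h>

#define OWNED_BIT ((uintptr_t)1)   /* bit 0 marks "belongs to our ring" */

/* Mark a pointer as ours before storing it in a TX descriptor slot. */
static void *tag_ours(void *p)
{
    return (void *)((uintptr_t)p | OWNED_BIT);
}

/* Check whether a stored pointer carries the ownership mark. */
static bool is_ours(const void *p)
{
    return ((uintptr_t)p & OWNED_BIT) != 0;
}

/* Recover the real address from a stored (possibly tagged) pointer. */
static void *untag(void *p)
{
    return (void *)((uintptr_t)p & ~OWNED_BIT);
}

enum tx_done { TX_DONE_FREE_SKB, TX_DONE_RECYCLE };

/* On TX completion, decide between the regular path and recycling. */
static enum tx_done on_tx_complete(void *slot)
{
    return is_ours(slot) ? TX_DONE_RECYCLE : TX_DONE_FREE_SKB;
}
```

The tagged value must never be dereferenced directly; the completion path
has to call untag() first.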

Another important design choice that comes to my mind is that we purposely
decided to target the most capable devices. We decided this after seeing how
netmap uses only the least common denominator between all supportable NICs,
which makes every NIC behave like a dumb one. In our case, the driver has to feed
the checksum status, L3/L4 protocol types etc. If the NIC is too dumb to
support this, it just has to be implemented in software for this NIC only.
And in practice all NICs that matter support L3/L4 protocol identification
as well as checksum verification/computation, so it's not a problem and
the hardware continues to work for us for free.

> > However since then we've implemented traffic
> > capture in our product, exploiting this framework to capture without
> > losses at 14 Mpps. I may find some time to try to extract it later.
> > It uses the /sys API so that you can simply plug tcpdump -r on a
> > file there, though there's also an mmap version which uses less CPU
> > (that's important at 10G).
> 
> Interesting.  I do see a XDP use-case for RAW packet capture, but I've
> postponed that work until later.  I would be interested in how you solved
> it.  E.g. do you support zero-copy?

No, we intentionally copy. In fact on xeon processors, the memory bandwidth
is so huge that you don't even notice the copy. And by using small buffers,
you can even ensure that the memory blocks stay in L3 cache. We had the most
difficulties with the /sys API because it only supports page-size transfers
and uses lots of CPU just for this, hence we had to implement mmap support
to present the packets to user-space (without copy here). But even the
regular /sys API with a double copy supports line-rate with high CPU usage.

I'll ask my coworker Emeric who did the sniffer if he can take it out as a
standalone component. It will take a bit of time because we're moving to a
new office and that significantly mangles our priorities as you can expect,
but that's definitely something we'd like to do.

> > In its current form since the initial code's intent was to limit
> > core changes, it happens not to modify anything in the kernel by
> > default and to reuse the net_device's ax25_ptr to attach devices
> > (idea borrowed from netmap), so it can be used on an existing
> > kernel just by loading the patched network drivers (yes, I know
> > it's not a valid solution for the long term).
> > 
> > The current code is available here :
> > 
> >   http://git.kernel.org/cgit/linux/kernel/git/wtarreau/ndiv.git/
> 
> I was just about to complain that the link was broken... but it fixed
> itself while writing this email ;-)

I noticed the same thing a few times already, and whatever I do, the
description is not updated. I suspect there's some load balancing with
one server not being updated as fast as the other ones.

> Can you instead explain what branch to look at?

Sure! For the most up-to-date code, better use ndiv_v5-4.4. It contains
the core (a single .h file), and support for ixgbe, e1000e, e1000, igb,
and the dummy HTTP server (slhttpd). For a more readable version, better
use ndiv_v5-3.14 which also contains the mvneta driver, it's much simpler
than the other ones and makes the code more readable. I'll have to port
it to 4.4 soon but didn't have time yet. We don't support mlx4 yet, and
it's a chicken-and-egg problem : for lack of time we don't work on porting
it and since we don't support it we don't use it in our products. That's
too bad because from what I've been told we should be able to reach high
packet rates there as well.

> > Please let me know if there could be some interest in rebasing it
> > on more recent versions (currently 3.10, 3.14 and 4.4 are supported).
> 
> What, no support for 2.4 ;-)

Not yet :-) Jokes aside, given that the API is very simple, it could be
done if anyone needed, as it really doesn't rely on any existing
infrastructure. The API is reasonably OS-agnostic as it only wants
pointers and lengths. For sniffing and/or filtering on Rx/Tx paths
only, the code basically only is (synthetic code, just to illustrate) :

     ndiv = netdev_get_ndiv(dev);
     if (ndiv) {
         ret = ndiv->handle_rx(ndiv, l3ptr, l3len, l2len, l2ptr, NULL);
         if (ret & NDIV_RX_R_F_DROP)
            continue;
     }
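
For readers who want something compilable, the same idea can be mocked up
in user space. Everything below is hypothetical: struct ndiv_sketch, the
SKETCH_RX_DROP value, drop_runts() and rx_loop() only mirror the shape of
the snippet above, not the real ndiv API.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_RX_DROP 0x1u   /* assumed flag value, for illustration only */

struct ndiv_sketch {
    /* Returns a bitmask of actions for one received packet. */
    uint32_t (*handle_rx)(struct ndiv_sketch *nd,
                          const uint8_t *l3ptr, size_t l3len);
};

struct pkt {
    const uint8_t *l3ptr;
    size_t l3len;
};

/* Example handler: drop anything shorter than a 20-byte IPv4 header. */
static uint32_t drop_runts(struct ndiv_sketch *nd,
                           const uint8_t *l3ptr, size_t l3len)
{
    (void)nd;
    (void)l3ptr;
    return l3len < 20 ? SKETCH_RX_DROP : 0;
}

/* Driver-side RX loop: returns how many packets go on to the stack. */
static size_t rx_loop(struct ndiv_sketch *nd,
                      const struct pkt *pkts, size_t n)
{
    size_t passed = 0;

    for (size_t i = 0; i < n; i++) {
        if (nd) {
            uint32_t ret = nd->handle_rx(nd, pkts[i].l3ptr, pkts[i].l3len);

            if (ret & SKETCH_RX_DROP)
                continue;   /* handler consumed it: no skb is built */
        }
        passed++;           /* a real driver would build an skb here */
    }
    return passed;
}
```

When no handler is attached the loop behaves exactly as before, which is
the property that keeps the hook out of the way of regular traffic.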

Best regards,
Willy

* RE: [PATCH] net: fec: set mac address unconditionally
From: Andy Duan @ 2016-09-21 16:26 UTC (permalink / raw)
  To: Gavin Schenk; +Cc: netdev@vger.kernel.org, kernel@pengutronix.de
In-Reply-To: <1474464655-126940-1-git-send-email-g.schenk@eckelmann.de>

From: Gavin Schenk <g.schenk@eckelmann.de> Sent: Wednesday, September 21, 2016 9:31 PM
> To: Andy Duan <fugang.duan@nxp.com>
> Cc: netdev@vger.kernel.org; kernel@pengutronix.de; Gavin Schenk
> <g.schenk@eckelmann.de>
> Subject: [PATCH] net: fec: set mac address unconditionally
> 
> Fixes: 9638d19e4816 ("net: fec: add netif status check before set mac
> address")
> 
> If the mac address origin is not dt, you can only safely assign a mac address
> after "link up" of the device. If the link is down the clocks are disabled and
> because of issues assigning registers when clocks are down the new mac
> address is discarded on some SoCs. This fix sets the mac address
> unconditionally in fec_restart(...) and ensures consistency between fec
> registers and the network layer.
> 
> Signed-off-by: Gavin Schenk <g.schenk@eckelmann.de>
> ---

It makes sense, thanks.

Acked-by: Fugang Duan <fugang.duan@nxp.com>

>  drivers/net/ethernet/freescale/fec_main.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> index 2a03857cca18..bdabea6cd981 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -903,13 +903,11 @@ fec_restart(struct net_device *ndev)
>  	 * enet-mac reset will reset mac address registers too,
>  	 * so need to reconfigure it.
>  	 */
> -	if (fep->quirks & FEC_QUIRK_ENET_MAC) {
> -		memcpy(&temp_mac, ndev->dev_addr, ETH_ALEN);
> -		writel((__force u32)cpu_to_be32(temp_mac[0]),
> -		       fep->hwp + FEC_ADDR_LOW);
> -		writel((__force u32)cpu_to_be32(temp_mac[1]),
> -		       fep->hwp + FEC_ADDR_HIGH);
> -	}
> +	memcpy(&temp_mac, ndev->dev_addr, ETH_ALEN);
> +	writel((__force u32)cpu_to_be32(temp_mac[0]),
> +	       fep->hwp + FEC_ADDR_LOW);
> +	writel((__force u32)cpu_to_be32(temp_mac[1]),
> +	       fep->hwp + FEC_ADDR_HIGH);
> 
>  	/* Clear any outstanding interrupt. */
>  	writel(0xffffffff, fep->hwp + FEC_IEVENT);
> --
> 1.9.1
> 
> 
> Eckelmann AG
> Executive Board: Dipl.-Ing. Peter Frankenbach (Spokesman), Dipl.-Wi.-Ing.
> Philipp Eckelmann,
> Dr.-Ing. Frank-Thomas Mellert, Dr.-Ing. Marco Münchhof, Dr.-Ing. Frank
> Uhlemann
> Chairman of the Supervisory Board: Hubertus G. Krossa
> Registered office: Berliner Str. 161, 65205 Wiesbaden; Wiesbaden District
> Court, HRB 12636
> http://www.eckelmann.de

^ permalink raw reply

* Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP
From: Tom Herbert @ 2016-09-21 17:39 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, David S. Miller,
	Linux Kernel Network Developers, Kernel Team, Tariq Toukan,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet,
	Jesper Dangaard Brouer
In-Reply-To: <20160921182657.67681e67@jkicinski-Precision-T1700>

On Wed, Sep 21, 2016 at 10:26 AM, Jakub Kicinski <kubakici@wp.pl> wrote:
> On Tue, 20 Sep 2016 17:01:39 -0700, Alexei Starovoitov wrote:
>>  >  - Reduces the amount of code and complexity needed in drivers to
>>  >    manage XDP
>>
>> hmm:
>> 534 insertions(+), 144 deletions(-)
>> looks like increase in complexity instead.
>
> and more to come to tie this with HW offloads.

The amount of driver code did decrease with these patches:

drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 64 ++++----------------------
drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 25 ++++------
drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 -

Minimizing the complexity being added to drivers for XDP is critical since
we are basically asking every driver to replicate the function. This
property should also apply to HW offloads: the more complexity we
can abstract out of drivers into a common backend infrastructure, the
better for supporting it across different drivers.

Tom

^ permalink raw reply

* Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP
From: Jakub Kicinski @ 2016-09-21 17:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tom Herbert, davem, netdev, kernel-team, tariqt, bblanco,
	alexei.starovoitov, eric.dumazet, brouer
In-Reply-To: <57E1CDE3.5030404@fb.com>

On Tue, 20 Sep 2016 17:01:39 -0700, Alexei Starovoitov wrote:
>  >  - Reduces the amount of code and complexity needed in drivers to
>  >    manage XDP  
> 
> hmm:
> 534 insertions(+), 144 deletions(-)
> looks like increase in complexity instead.

and more to come to tie this with HW offloads.

^ permalink raw reply

* Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP
From: Jakub Kicinski @ 2016-09-21 17:26 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Tom Herbert, David S. Miller, Linux Kernel Network Developers,
	Kernel Team, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Eric Dumazet, Jesper Dangaard Brouer
In-Reply-To: <20160921115545.GA12789@pox.localdomain>

On Wed, 21 Sep 2016 13:55:45 +0200, Thomas Graf wrote:
> > I am looking at using this for ILA router. The problem I am hitting is
> > that not all packets that we need to translate go through the XDP
> > path. Some would go through the kernel path, some through XDP path but  
> 
> When you say kernel path, what do you mean specifically? One aspect of
> XDP I love is that XDP can act as an acceleration option for existing
> BPF programs attached to cls_bpf. Support for direct packet read and
> write at clsact level have made it straight forward to write programs
> which are compatible or at minimum share a lot of common code. They
> can share data structures, lookup functionality, etc.

My very humble dream was that XDP would be transparently offloaded from
cls_bpf if the program was simple enough.  That ship has most likely sailed
because XDP has different abort behaviour.  When possible though, trying
to offload higher-level hooks when the rules don't require access to
the full skb would be really cool.

^ permalink raw reply

* [PATCH net-next 3/3] udp: use it's own memory accounting schema
From: Paolo Abeni @ 2016-09-21 17:23 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, James Morris, Trond Myklebust, Alexander Duyck,
	Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa,
	linux-nfs
In-Reply-To: <cover.1474477902.git.pabeni@redhat.com>

Completely avoid default sock memory accounting and replace it
with udp-specific accounting.

Since the new memory accounting model does not require socket
locking, remove the lock on enqueue and free and avoid using the
backlog on enqueue.

Be sure to clean up rx queue memory on socket destruction, using
a udp-specific sk_destruct.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/udp.c        | 39 ++++++++++-----------------------------
 net/ipv6/udp.c        | 28 +++++++++-------------------
 net/sunrpc/svcsock.c  | 22 ++++++++++++++++++----
 net/sunrpc/xprtsock.c |  2 +-
 4 files changed, 38 insertions(+), 53 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 98480af..cb617ee 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1358,13 +1358,8 @@ static int first_packet_length(struct sock *sk)
 	res = skb ? skb->len : -1;
 	spin_unlock_bh(&rcvq->lock);
 
-	if (!skb_queue_empty(&list_kill)) {
-		bool slow = lock_sock_fast(sk);
-
-		__skb_queue_purge(&list_kill);
-		sk_mem_reclaim_partial(sk);
-		unlock_sock_fast(sk, slow);
-	}
+	if (!skb_queue_empty(&list_kill))
+		udp_queue_purge(sk, &list_kill, 1);
 	return res;
 }
 
@@ -1413,7 +1408,6 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 	int err;
 	int is_udplite = IS_UDPLITE(sk);
 	bool checksum_valid = false;
-	bool slow;
 
 	if (flags & MSG_ERRQUEUE)
 		return ip_recv_error(sk, msg, len, addr_len);
@@ -1454,13 +1448,12 @@ try_again:
 	}
 
 	if (unlikely(err)) {
-		trace_kfree_skb(skb, udp_recvmsg);
 		if (!peeked) {
 			atomic_inc(&sk->sk_drops);
 			UDP_INC_STATS(sock_net(sk),
 				      UDP_MIB_INERRORS, is_udplite);
 		}
-		skb_free_datagram_locked(sk, skb);
+		skb_free_udp(sk, skb);
 		return err;
 	}
 
@@ -1485,16 +1478,15 @@ try_again:
 	if (flags & MSG_TRUNC)
 		err = ulen;
 
-	__skb_free_datagram_locked(sk, skb, peeking ? -err : err);
+	skb_consume_udp(sk, skb, peeking ? -err : err);
 	return err;
 
 csum_copy_err:
-	slow = lock_sock_fast(sk);
-	if (!skb_kill_datagram(sk, skb, flags)) {
+	if (!__sk_queue_drop_skb(sk, skb, flags)) {
 		UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
 		UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
 	}
-	unlock_sock_fast(sk, slow);
+	skb_free_udp(sk, skb);
 
 	/* starting over for a new packet, but check if we need to yield */
 	cond_resched();
@@ -1613,7 +1605,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 		sk_incoming_cpu_update(sk);
 	}
 
-	rc = __sock_queue_rcv_skb(sk, skb);
+	rc = udp_rmem_schedule(sk, skb);
 	if (rc < 0) {
 		int is_udplite = IS_UDPLITE(sk);
 
@@ -1627,8 +1619,8 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 		return -1;
 	}
 
+	__sock_enqueue_skb(sk, skb);
 	return 0;
-
 }
 
 static struct static_key udp_encap_needed __read_mostly;
@@ -1650,7 +1642,6 @@ EXPORT_SYMBOL(udp_encap_enable);
 int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct udp_sock *up = udp_sk(sk);
-	int rc;
 	int is_udplite = IS_UDPLITE(sk);
 
 	/*
@@ -1743,19 +1734,8 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 		goto drop;
 	}
 
-	rc = 0;
-
 	ipv4_pktinfo_prepare(sk, skb);
-	bh_lock_sock(sk);
-	if (!sock_owned_by_user(sk))
-		rc = __udp_queue_rcv_skb(sk, skb);
-	else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
-		bh_unlock_sock(sk);
-		goto drop;
-	}
-	bh_unlock_sock(sk);
-
-	return rc;
+	return __udp_queue_rcv_skb(sk, skb);
 
 csum_error:
 	__UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
@@ -2365,6 +2345,7 @@ struct proto udp_prot = {
 	.connect	   = ip4_datagram_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
+	.init		   = udp_init_sock,
 	.destroy	   = udp_destroy_sock,
 	.setsockopt	   = udp_setsockopt,
 	.getsockopt	   = udp_getsockopt,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 9aa7c1c..6f8e160 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -334,7 +334,6 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	int is_udplite = IS_UDPLITE(sk);
 	bool checksum_valid = false;
 	int is_udp4;
-	bool slow;
 
 	if (flags & MSG_ERRQUEUE)
 		return ipv6_recv_error(sk, msg, len, addr_len);
@@ -388,7 +387,7 @@ try_again:
 				UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS,
 					       is_udplite);
 		}
-		skb_free_datagram_locked(sk, skb);
+		skb_free_udp(sk, skb);
 		return err;
 	}
 	if (!peeked) {
@@ -437,12 +436,11 @@ try_again:
 	if (flags & MSG_TRUNC)
 		err = ulen;
 
-	__skb_free_datagram_locked(sk, skb, peeking ? -err : err);
+	skb_consume_udp(sk, skb, peeking ? -err : err);
 	return err;
 
 csum_copy_err:
-	slow = lock_sock_fast(sk);
-	if (!skb_kill_datagram(sk, skb, flags)) {
+	if (!__sk_queue_drop_skb(sk, skb, flags)) {
 		if (is_udp4) {
 			UDP_INC_STATS(sock_net(sk),
 				      UDP_MIB_CSUMERRORS, is_udplite);
@@ -455,7 +453,7 @@ csum_copy_err:
 				       UDP_MIB_INERRORS, is_udplite);
 		}
 	}
-	unlock_sock_fast(sk, slow);
+	skb_free_udp(sk, skb);
 
 	/* starting over for a new packet, but check if we need to yield */
 	cond_resched();
@@ -523,7 +521,7 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 		sk_incoming_cpu_update(sk);
 	}
 
-	rc = __sock_queue_rcv_skb(sk, skb);
+	rc = udp_rmem_schedule(sk, skb);
 	if (rc < 0) {
 		int is_udplite = IS_UDPLITE(sk);
 
@@ -535,6 +533,8 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 		kfree_skb(skb);
 		return -1;
 	}
+
+	__sock_enqueue_skb(sk, skb);
 	return 0;
 }
 
@@ -556,7 +556,6 @@ EXPORT_SYMBOL(udpv6_encap_enable);
 int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct udp_sock *up = udp_sk(sk);
-	int rc;
 	int is_udplite = IS_UDPLITE(sk);
 
 	if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb))
@@ -630,17 +629,7 @@ int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
 	skb_dst_drop(skb);
 
-	bh_lock_sock(sk);
-	rc = 0;
-	if (!sock_owned_by_user(sk))
-		rc = __udpv6_queue_rcv_skb(sk, skb);
-	else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
-		bh_unlock_sock(sk);
-		goto drop;
-	}
-	bh_unlock_sock(sk);
-
-	return rc;
+	return __udpv6_queue_rcv_skb(sk, skb);
 
 csum_error:
 	__UDP6_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
@@ -1433,6 +1422,7 @@ struct proto udpv6_prot = {
 	.connect	   = ip6_datagram_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
+	.init		   = udp_init_sock,
 	.destroy	   = udpv6_destroy_sock,
 	.setsockopt	   = udpv6_setsockopt,
 	.getsockopt	   = udpv6_getsockopt,
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 57625f6..b5739c7 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -39,6 +39,7 @@
 #include <net/checksum.h>
 #include <net/ip.h>
 #include <net/ipv6.h>
+#include <net/udp.h>
 #include <net/tcp.h>
 #include <net/tcp_states.h>
 #include <asm/uaccess.h>
@@ -129,6 +130,20 @@ static void svc_release_skb(struct svc_rqst *rqstp)
 	}
 }
 
+static void svc_release_udp_skb(struct svc_rqst *rqstp)
+{
+	struct sk_buff *skb = rqstp->rq_xprt_ctxt;
+
+	if (skb) {
+		struct svc_sock *svsk =
+			container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
+		rqstp->rq_xprt_ctxt = NULL;
+
+		dprintk("svc: service %p, releasing skb %p\n", rqstp, skb);
+		skb_consume_udp(svsk->sk_sk, skb, 0);
+	}
+}
+
 union svc_pktinfo_u {
 	struct in_pktinfo pkti;
 	struct in6_pktinfo pkti6;
@@ -575,7 +590,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
 			goto out_free;
 		}
 		local_bh_enable();
-		skb_free_datagram_locked(svsk->sk_sk, skb);
+		skb_consume_udp(svsk->sk_sk, skb, 0);
 	} else {
 		/* we can use it in-place */
 		rqstp->rq_arg.head[0].iov_base = skb->data;
@@ -602,8 +617,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
 
 	return len;
 out_free:
-	trace_kfree_skb(skb, svc_udp_recvfrom);
-	skb_free_datagram_locked(svsk->sk_sk, skb);
+	skb_free_udp(svsk->sk_sk, skb);
 	return 0;
 }
 
@@ -660,7 +674,7 @@ static struct svc_xprt_ops svc_udp_ops = {
 	.xpo_create = svc_udp_create,
 	.xpo_recvfrom = svc_udp_recvfrom,
 	.xpo_sendto = svc_udp_sendto,
-	.xpo_release_rqst = svc_release_skb,
+	.xpo_release_rqst = svc_release_udp_skb,
 	.xpo_detach = svc_sock_detach,
 	.xpo_free = svc_sock_free,
 	.xpo_prep_reply_hdr = svc_udp_prep_reply_hdr,
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 8ede3bc..b75c2c3 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1074,7 +1074,7 @@ static void xs_udp_data_receive(struct sock_xprt *transport)
 		skb = skb_recv_datagram(sk, 0, 1, &err);
 		if (skb != NULL) {
 			xs_udp_data_read_skb(&transport->xprt, sk, skb);
-			skb_free_datagram(sk, skb);
+			skb_consume_udp(sk, skb, 0);
 			continue;
 		}
 		if (!test_and_clear_bit(XPRT_SOCK_DATA_READY, &transport->sock_state))
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 1/3] net/socket: factor out helpers for memory and queue manipulation
From: Paolo Abeni @ 2016-09-21 17:23 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, James Morris, Trond Myklebust, Alexander Duyck,
	Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa,
	linux-nfs
In-Reply-To: <cover.1474477902.git.pabeni@redhat.com>

Factor out basic sock operations that udp code can use with its own
memory accounting schema. No functional change is introduced
in the existing APIs.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/skbuff.h |  2 +-
 include/net/sock.h     |  5 +++
 net/core/datagram.c    | 36 +++++++++++--------
 net/core/skbuff.c      |  3 +-
 net/core/sock.c        | 96 +++++++++++++++++++++++++++++++++-----------------
 5 files changed, 94 insertions(+), 48 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index cfb7219..49c489d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3016,7 +3016,7 @@ static inline void skb_frag_list_init(struct sk_buff *skb)
 #define skb_walk_frags(skb, iter)	\
 	for (iter = skb_shinfo(skb)->frag_list; iter; iter = iter->next)
 
-
+void sock_rmem_free(struct sk_buff *skb);
 int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
 				const struct sk_buff *skb);
 struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
diff --git a/include/net/sock.h b/include/net/sock.h
index c797c57..a37362c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1274,7 +1274,9 @@ static inline struct inode *SOCK_INODE(struct socket *socket)
 /*
  * Functions for memory accounting
  */
+int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind);
 int __sk_mem_schedule(struct sock *sk, int size, int kind);
+void __sk_mem_reduce_allocated(struct sock *sk, int amount);
 void __sk_mem_reclaim(struct sock *sk, int amount);
 
 #define SK_MEM_QUANTUM ((int)PAGE_SIZE)
@@ -1940,6 +1942,9 @@ void sk_reset_timer(struct sock *sk, struct timer_list *timer,
 
 void sk_stop_timer(struct sock *sk, struct timer_list *timer);
 
+int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
+			unsigned int flags);
+void __sock_enqueue_skb(struct sock *sk, struct sk_buff *skb);
 int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 
diff --git a/net/core/datagram.c b/net/core/datagram.c
index b7de71f..bfb973a 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -323,6 +323,27 @@ void __skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb, int len)
 }
 EXPORT_SYMBOL(__skb_free_datagram_locked);
 
+int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
+			unsigned int flags)
+{
+	int err = 0;
+
+	if (flags & MSG_PEEK) {
+		err = -ENOENT;
+		spin_lock_bh(&sk->sk_receive_queue.lock);
+		if (skb == skb_peek(&sk->sk_receive_queue)) {
+			__skb_unlink(skb, &sk->sk_receive_queue);
+			atomic_dec(&skb->users);
+			err = 0;
+		}
+		spin_unlock_bh(&sk->sk_receive_queue.lock);
+	}
+
+	atomic_inc(&sk->sk_drops);
+	return err;
+}
+EXPORT_SYMBOL(__sk_queue_drop_skb);
+
 /**
  *	skb_kill_datagram - Free a datagram skbuff forcibly
  *	@sk: socket
@@ -346,23 +367,10 @@ EXPORT_SYMBOL(__skb_free_datagram_locked);
 
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags)
 {
-	int err = 0;
-
-	if (flags & MSG_PEEK) {
-		err = -ENOENT;
-		spin_lock_bh(&sk->sk_receive_queue.lock);
-		if (skb == skb_peek(&sk->sk_receive_queue)) {
-			__skb_unlink(skb, &sk->sk_receive_queue);
-			atomic_dec(&skb->users);
-			err = 0;
-		}
-		spin_unlock_bh(&sk->sk_receive_queue.lock);
-	}
+	int err = __sk_queue_drop_skb(sk, skb, flags);
 
 	kfree_skb(skb);
-	atomic_inc(&sk->sk_drops);
 	sk_mem_reclaim_partial(sk);
-
 	return err;
 }
 EXPORT_SYMBOL(skb_kill_datagram);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3864b4b6..4dce605 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3657,12 +3657,13 @@ int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer)
 }
 EXPORT_SYMBOL_GPL(skb_cow_data);
 
-static void sock_rmem_free(struct sk_buff *skb)
+void sock_rmem_free(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
 
 	atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
 }
+EXPORT_SYMBOL_GPL(sock_rmem_free);
 
 /*
  * Note: We dont mem charge error packets (no sk_forward_alloc changes)
diff --git a/net/core/sock.c b/net/core/sock.c
index 51a7304..752308d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -405,24 +405,12 @@ static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
 }
 
 
-int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+void __sock_enqueue_skb(struct sock *sk, struct sk_buff *skb)
 {
 	unsigned long flags;
 	struct sk_buff_head *list = &sk->sk_receive_queue;
 
-	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf) {
-		atomic_inc(&sk->sk_drops);
-		trace_sock_rcvqueue_full(sk, skb);
-		return -ENOMEM;
-	}
-
-	if (!sk_rmem_schedule(sk, skb, skb->truesize)) {
-		atomic_inc(&sk->sk_drops);
-		return -ENOBUFS;
-	}
-
 	skb->dev = NULL;
-	skb_set_owner_r(skb, sk);
 
 	/* we escape from rcu protected region, make sure we dont leak
 	 * a norefcounted dst
@@ -436,6 +424,24 @@ int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
 	if (!sock_flag(sk, SOCK_DEAD))
 		sk->sk_data_ready(sk);
+}
+EXPORT_SYMBOL(__sock_enqueue_skb);
+
+int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+{
+	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf) {
+		atomic_inc(&sk->sk_drops);
+		trace_sock_rcvqueue_full(sk, skb);
+		return -ENOMEM;
+	}
+
+	if (!sk_rmem_schedule(sk, skb, skb->truesize)) {
+		atomic_inc(&sk->sk_drops);
+		return -ENOBUFS;
+	}
+
+	skb_set_owner_r(skb, sk);
+	__sock_enqueue_skb(sk, skb);
 	return 0;
 }
 EXPORT_SYMBOL(__sock_queue_rcv_skb);
@@ -2088,24 +2094,18 @@ int sk_wait_data(struct sock *sk, long *timeo, const struct sk_buff *skb)
 EXPORT_SYMBOL(sk_wait_data);
 
 /**
- *	__sk_mem_schedule - increase sk_forward_alloc and memory_allocated
+ *	__sk_mem_raise_allocated - increase memory_allocated
  *	@sk: socket
  *	@size: memory size to allocate
+ *	@amt: pages to allocate
  *	@kind: allocation type
  *
- *	If kind is SK_MEM_SEND, it means wmem allocation. Otherwise it means
- *	rmem allocation. This function assumes that protocols which have
- *	memory_pressure use sk_wmem_queued as write buffer accounting.
+ *	Similar to __sk_mem_schedule(), but does not update sk_forward_alloc
  */
-int __sk_mem_schedule(struct sock *sk, int size, int kind)
+int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	struct proto *prot = sk->sk_prot;
-	int amt = sk_mem_pages(size);
-	long allocated;
-
-	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
-
-	allocated = sk_memory_allocated_add(sk, amt);
+	long allocated = sk_memory_allocated_add(sk, amt);
 
 	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
 	    !mem_cgroup_charge_skmem(sk->sk_memcg, amt))
@@ -2166,9 +2166,6 @@ suppress_allocation:
 
 	trace_sock_exceed_buf_limit(sk, prot, allocated);
 
-	/* Alas. Undo changes. */
-	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
-
 	sk_memory_allocated_sub(sk, amt);
 
 	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
@@ -2176,18 +2173,40 @@ suppress_allocation:
 
 	return 0;
 }
+EXPORT_SYMBOL(__sk_mem_raise_allocated);
+
+/**
+ *	__sk_mem_schedule - increase sk_forward_alloc and memory_allocated
+ *	@sk: socket
+ *	@size: memory size to allocate
+ *	@kind: allocation type
+ *
+ *	If kind is SK_MEM_SEND, it means wmem allocation. Otherwise it means
+ *	rmem allocation. This function assumes that protocols which have
+ *	memory_pressure use sk_wmem_queued as write buffer accounting.
+ */
+int __sk_mem_schedule(struct sock *sk, int size, int kind)
+{
+	int ret, amt = sk_mem_pages(size);
+
+	sk->sk_forward_alloc += amt << SK_MEM_QUANTUM_SHIFT;
+	ret = __sk_mem_raise_allocated(sk, size, amt, kind);
+	if (!ret)
+		sk->sk_forward_alloc -= amt << SK_MEM_QUANTUM_SHIFT;
+	return ret;
+}
 EXPORT_SYMBOL(__sk_mem_schedule);
 
 /**
- *	__sk_mem_reclaim - reclaim memory_allocated
+ *	__sk_mem_reduce_allocated - reclaim memory_allocated
  *	@sk: socket
- *	@amount: number of bytes (rounded down to a SK_MEM_QUANTUM multiple)
+ *	@amount: number of quanta
+ *
+ *	Similar to __sk_mem_reclaim(), but does not update sk_forward_alloc
  */
-void __sk_mem_reclaim(struct sock *sk, int amount)
+void __sk_mem_reduce_allocated(struct sock *sk, int amount)
 {
-	amount >>= SK_MEM_QUANTUM_SHIFT;
 	sk_memory_allocated_sub(sk, amount);
-	sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT;
 
 	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
 		mem_cgroup_uncharge_skmem(sk->sk_memcg, amount);
@@ -2196,6 +2215,19 @@ void __sk_mem_reclaim(struct sock *sk, int amount)
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
 		sk_leave_memory_pressure(sk);
 }
+EXPORT_SYMBOL(__sk_mem_reduce_allocated);
+
+/**
+ *	__sk_mem_reclaim - reclaim sk_forward_alloc and memory_allocated
+ *	@sk: socket
+ *	@amount: number of bytes (rounded down to a SK_MEM_QUANTUM multiple)
+ */
+void __sk_mem_reclaim(struct sock *sk, int amount)
+{
+	amount >>= SK_MEM_QUANTUM_SHIFT;
+	sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT;
+	__sk_mem_reduce_allocated(sk, amount);
+}
 EXPORT_SYMBOL(__sk_mem_reclaim);
 
 int sk_set_peek_off(struct sock *sk, int val)
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 2/3] udp: implement memory accounting helpers
From: Paolo Abeni @ 2016-09-21 17:23 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: David S. Miller, James Morris, Trond Myklebust, Alexander Duyck,
	Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <cover.1474477902.git.pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Avoid usage of common memory accounting functions, since
the logic is substantially different.

To account for forward allocation, a couple of new atomic_t
members are added to udp_sock: 'mem_allocated' and 'mem_freed'.
The current forward allocation is estimated as 'mem_allocated'
minus 'mem_freed' minus 'sk_rmem_alloc'.

When the forward allocation can't cope with the packet to be
enqueued, 'mem_allocated' is incremented by the packet size
rounded up to the next SK_MEM_QUANTUM.
After a dequeue, we try to partially reclaim the forward
allocated memory, rounded down to an SK_MEM_QUANTUM multiple, and
'mem_freed' is increased by that amount.
sk->sk_forward_alloc is set after each allocated/freed memory
update, to the currently estimated forward allocation, without
any lock or protection.
This value is updated/maintained only to expose some
semi-reasonable value to the eventual reader, and is guaranteed
to be 0 at socket destruction time.
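
The arithmetic described above can be sketched in plain user-space C. This
is a single-threaded model with no atomics, written against the prose
description rather than the patch line by line; QUANTUM_SHIFT, struct
fwd_model and the model_*() helpers are hypothetical names (the kernel uses
SK_MEM_QUANTUM, which is PAGE_SIZE), and the field names follow the
struct udp_sock hunk:

```c
#include <assert.h>

/* Hypothetical stand-ins for SK_MEM_QUANTUM_SHIFT / SK_MEM_QUANTUM. */
#define QUANTUM_SHIFT 12
#define QUANTUM (1 << QUANTUM_SHIFT)

/* Plain-integer model of the three counters. */
struct fwd_model {
	int mem_allocated;	/* bytes charged so far, multiple of QUANTUM */
	int mem_freed;		/* bytes given back so far, multiple of QUANTUM */
	int rmem_alloc;		/* bytes currently held by queued packets */
};

/* Estimated forward allocation: charged, not returned, not in use. */
int fwd_estimate(const struct fwd_model *m)
{
	return m->mem_allocated - m->mem_freed - m->rmem_alloc;
}

/* Round a byte count up to whole quanta. */
int quanta_up(int bytes)
{
	return (bytes + QUANTUM - 1) >> QUANTUM_SHIFT;
}

/* Enqueue: charge new quanta only if the estimate can't cover the packet. */
void model_enqueue(struct fwd_model *m, int truesize)
{
	assert(truesize > 0);
	m->rmem_alloc += truesize;
	if (fwd_estimate(m) <= 0)
		m->mem_allocated += quanta_up(truesize) << QUANTUM_SHIFT;
}

/*
 * Dequeue: give back whole quanta only, always leaving at least 'keep'
 * bytes of forward allocation. Returns the number of bytes reclaimed.
 */
int model_dequeue(struct fwd_model *m, int truesize, int keep)
{
	int fwd, amt;

	m->rmem_alloc -= truesize;
	fwd = fwd_estimate(m);
	if (fwd < QUANTUM + keep)
		return 0;
	amt = ((fwd - keep) >> QUANTUM_SHIFT) << QUANTUM_SHIFT;
	m->mem_freed += amt;
	return amt;
}
```

For example, with a 4096-byte quantum, two 768-byte packets need a single
charge: the first enqueue raises 'mem_allocated' to 4096 and the second is
covered by the remaining 3328 bytes of forward allocation.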

The above needs custom memory reclaiming on shutdown, provided
by the udp_destruct_sock() helper, which completely reclaims
the allocated forward memory.

Helpers are provided for skb free, consume and purge, respecting
the above constraints.

The socket lock is still used to protect the updates to sk_peek_off,
but is acquired only if peeking with offset is enabled.

As a consequence of the above schema, enqueue to sk_error_queue
will cause larger forward allocation on following normal data
(due to sk_rmem_alloc growth), but this allows amortizing the cost
of the atomic operation on SK_MEM_QUANTUM/skb->truesize packets.
The use of separate atomics for 'mem_allocated' and 'mem_freed'
allows the use of a single atomic operation to protect against
concurrent dequeue.
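
A rough sketch of that single-atomic-operation idea, using C11 atomics in
user space instead of the kernel's atomic_t API. struct udp_mem_model,
try_release() and the QUANTUM macros are hypothetical names, and the
amount is rounded down as described above; the point is only that
concurrent releasers race on one compare-and-swap of 'mem_freed', so at
most one of them returns any given quantum:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for SK_MEM_QUANTUM_SHIFT / SK_MEM_QUANTUM. */
#define QUANTUM_SHIFT 12
#define QUANTUM (1 << QUANTUM_SHIFT)

struct udp_mem_model {
	atomic_int mem_allocated;	/* only ever grows, on enqueue */
	atomic_int mem_freed;		/* only ever grows, on release */
	atomic_int rmem_alloc;		/* bytes held by queued packets */
};

/*
 * Try to release whole quanta while keeping 'keep' bytes around.
 * Returns the number of bytes this caller won the right to give back,
 * or 0 if there is too little to release or another releaser updated
 * mem_freed first.
 */
int try_release(struct udp_mem_model *m, int keep)
{
	int freed_old = atomic_load(&m->mem_freed);
	int fwd = atomic_load(&m->mem_allocated) - freed_old -
		  atomic_load(&m->rmem_alloc);
	int amt;

	if (fwd < QUANTUM + keep)
		return 0;

	amt = ((fwd - keep) >> QUANTUM_SHIFT) << QUANTUM_SHIFT;

	/* Succeeds only if mem_freed still holds the value we based
	 * the estimate on; a loser simply releases nothing.
	 */
	if (!atomic_compare_exchange_strong(&m->mem_freed, &freed_old,
					    freed_old + amt))
		return 0;
	return amt;
}
```

A caller that gets a non-zero return is then the only one responsible for
handing that many bytes back to the global accounting.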

Acked-by: Hannes Frederic Sowa <hannes-tFNcAqjVMyqKXQKiL6tip0B+6BGkLq7r@public.gmane.org>
Signed-off-by: Paolo Abeni <pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 include/linux/udp.h |   2 +
 include/net/udp.h   |   5 ++
 net/ipv4/udp.c      | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 158 insertions(+)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index d1fd8cd..cd72645 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -42,6 +42,8 @@ static inline u32 udp_hashfn(const struct net *net, u32 num, u32 mask)
 struct udp_sock {
 	/* inet_sock has to be the first member */
 	struct inet_sock inet;
+	atomic_t	 mem_allocated;
+	atomic_t	 mem_freed;
 #define udp_port_hash		inet.sk.__sk_common.skc_u16hashes[0]
 #define udp_portaddr_hash	inet.sk.__sk_common.skc_u16hashes[1]
 #define udp_portaddr_node	inet.sk.__sk_common.skc_portaddr_node
diff --git a/include/net/udp.h b/include/net/udp.h
index ea53a87..86307a4 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -246,6 +246,10 @@ static inline __be16 udp_flow_src_port(struct net *net, struct sk_buff *skb,
 }
 
 /* net/ipv4/udp.c */
+void skb_free_udp(struct sock *sk, struct sk_buff *skb);
+void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
+int udp_rmem_schedule(struct sock *sk, struct sk_buff *skb);
+
 void udp_v4_early_demux(struct sk_buff *skb);
 int udp_get_port(struct sock *sk, unsigned short snum,
 		 int (*saddr_cmp)(const struct sock *,
@@ -258,6 +262,7 @@ void udp_flush_pending_frames(struct sock *sk);
 void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst);
 int udp_rcv(struct sk_buff *skb);
 int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
+int udp_init_sock(struct sock *sk);
 int udp_disconnect(struct sock *sk, int flags);
 unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait);
 struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 058c312..98480af 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1178,6 +1178,157 @@ out:
 	return ret;
 }
 
+static inline int __udp_forward(struct udp_sock *up, int freed, int rmem)
+{
+	return atomic_read(&up->mem_allocated) - freed - rmem;
+}
+
+static int skb_unref(struct sk_buff *skb)
+{
+	if (likely(atomic_read(&skb->users) == 1))
+		smp_rmb();
+	else if (likely(!atomic_dec_and_test(&skb->users)))
+		return 0;
+
+	return skb->truesize;
+}
+
+static inline int udp_try_release(struct sock *sk, int *fwd, int partial)
+{
+	struct udp_sock *up = udp_sk(sk);
+	int freed_old, freed_new, amt;
+
+	freed_old = atomic_read(&up->mem_freed);
+	*fwd = __udp_forward(up, freed_old, atomic_read(&sk->sk_rmem_alloc));
+	if (*fwd < SK_MEM_QUANTUM + partial)
+		return 0;
+
+	/* we can have concurrent release; if we catch any conflict
+	 * via atomic_cmpxchg, let only one of them release the memory
+	 */
+	amt = sk_mem_pages(*fwd - partial) << SK_MEM_QUANTUM_SHIFT;
+	freed_new = atomic_cmpxchg(&up->mem_freed, freed_old, freed_old + amt);
+	return (freed_new == freed_old) ? amt : 0;
+}
+
+/* reclaim the allocated forward memory, except 'partial' quanta */
+static void skb_release_mem_udp(struct sock *sk, int partial)
+{
+	int fwd, delta = udp_try_release(sk, &fwd, partial);
+
+	if (delta)
+		__sk_mem_reduce_allocated(sk, delta >> SK_MEM_QUANTUM_SHIFT);
+	sk->sk_forward_alloc = fwd - delta;
+}
+
+void skb_free_udp(struct sock *sk, struct sk_buff *skb)
+{
+	int size = skb_unref(skb);
+
+	if (!size)
+		return;
+
+	trace_kfree_skb(skb, __builtin_return_address(0));
+	__kfree_skb(skb);
+	skb_release_mem_udp(sk, 1);
+}
+EXPORT_SYMBOL_GPL(skb_free_udp);
+
+void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len)
+{
+	int size = skb_unref(skb);
+
+	if (unlikely(READ_ONCE(sk->sk_peek_off) >= 0)) {
+		bool slow = lock_sock_fast(sk);
+
+		sk_peek_offset_bwd(sk, len);
+		unlock_sock_fast(sk, slow);
+	}
+	if (!size)
+		return;
+
+	__kfree_skb(skb);
+	skb_release_mem_udp(sk, 1);
+}
+EXPORT_SYMBOL_GPL(skb_consume_udp);
+
+static void udp_queue_purge(struct sock *sk, struct sk_buff_head *list,
+			    int partial)
+{
+	struct sk_buff *skb;
+	int size;
+
+	while ((skb = __skb_dequeue(list)) != NULL) {
+		size = skb_unref(skb);
+		if (size) {
+			trace_kfree_skb(skb, udp_queue_purge);
+			__kfree_skb(skb);
+		}
+	}
+	skb_release_mem_udp(sk, partial);
+}
+
+int udp_rmem_schedule(struct sock *sk, struct sk_buff *skb)
+{
+	int alloc, freed, fwd, amt, delta, rmem, err = -ENOMEM;
+	struct udp_sock *up = udp_sk(sk);
+
+	rmem = atomic_add_return(skb->truesize, &sk->sk_rmem_alloc);
+	if (rmem > sk->sk_rcvbuf)
+		goto drop;
+
+	freed = atomic_read(&up->mem_freed);
+	fwd = __udp_forward(up, freed, rmem);
+	if (fwd > 0)
+		goto no_alloc;
+
+	amt = sk_mem_pages(skb->truesize);
+	delta = amt << SK_MEM_QUANTUM_SHIFT;
+	if (!__sk_mem_raise_allocated(sk, delta, amt, SK_MEM_RECV)) {
+		err = -ENOBUFS;
+		goto drop;
+	}
+
+	/* if we have some skbs in the error queue, the forward allocation could
+	 * be underestimated, even below 0; avoid exporting such values
+	 */
+	alloc = atomic_add_return(delta, &up->mem_allocated);
+	fwd = alloc - freed - rmem;
+	if (fwd < 0)
+		fwd = SK_MEM_QUANTUM;
+
+no_alloc:
+	sk->sk_forward_alloc = fwd;
+	skb_orphan(skb);
+	skb->sk = sk;
+	skb->destructor = sock_rmem_free;
+	return 0;
+
+drop:
+	atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
+	atomic_inc(&sk->sk_drops);
+	return err;
+}
+EXPORT_SYMBOL_GPL(udp_rmem_schedule);
+
+static void udp_destruct_sock(struct sock *sk)
+{
+	/* reclaim completely the forward allocated memory */
+	udp_queue_purge(sk, &sk->sk_receive_queue, 0);
+	inet_sock_destruct(sk);
+}
+
+int udp_init_sock(struct sock *sk)
+{
+	struct udp_sock *up = udp_sk(sk);
+
+	atomic_set(&up->mem_allocated, 0);
+	atomic_set(&up->mem_freed, 0);
+	sk->sk_destruct = udp_destruct_sock;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(udp_init_sock);
+
 /**
  *	first_packet_length	- return length of first packet in receive queue
  *	@sk: socket
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH net-next 0/3] udp: refactor memory accounting
From: Paolo Abeni @ 2016-09-21 17:23 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: David S. Miller, James Morris, Trond Myklebust, Alexander Duyck,
	Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA

This patch series refactors the UDP memory accounting, replacing the
generic implementation with a custom one, in order to remove the need to
lock the socket on the enqueue and dequeue operations. The socket backlog
usage is dropped as well.

The first patch factors out core pieces of some queue and memory management
socket helpers, so that they can later be used by the UDP memory accounting
functions.
The second patch adds the memory accounting helpers, without using them.
The third patch replaces the old rx memory accounting path for UDP over IPv4
and UDP over IPv6. In-kernel UDP users are updated as well.

The memory accounting schema is described in detail in the individual patch
commit message.
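
A rough userspace model of the lockless idea described above, for
illustration only and not the code from these patches. It assumes C11
atomics in place of the kernel's atomic_t, and every name in it (struct
acct, acct_charge, acct_uncharge, QUANTUM) is hypothetical. The forward
allocation is derived from three counters as allocated - freed - rmem,
which mirrors the "fwd = alloc - freed - rmem" expression in patch 2/3;
the mem_freed counter is only read here, never updated.

```c
#include <assert.h>
#include <stdatomic.h>

#define QUANTUM 4096 /* hypothetical accounting unit, one page */

struct acct {
	atomic_int rmem_alloc;    /* bytes currently charged to the queue */
	atomic_int mem_allocated; /* bytes reserved from the shared pool */
	atomic_int mem_freed;     /* bytes handed back (never updated here) */
	int rcvbuf;               /* receive buffer limit in bytes */
};

void acct_init(struct acct *a, int rcvbuf)
{
	atomic_init(&a->rmem_alloc, 0);
	atomic_init(&a->mem_allocated, 0);
	atomic_init(&a->mem_freed, 0);
	a->rcvbuf = rcvbuf;
}

/* Charge size bytes without taking a lock. Returns the forward-allocated
 * bytes left after the charge, or -1 if the receive buffer would overflow.
 */
int acct_charge(struct acct *a, int size)
{
	int rmem, freed, fwd, delta, alloc;

	assert(size > 0);
	rmem = atomic_fetch_add(&a->rmem_alloc, size) + size;
	if (rmem > a->rcvbuf) {
		atomic_fetch_sub(&a->rmem_alloc, size); /* undo the charge */
		return -1;
	}

	freed = atomic_load(&a->mem_freed);
	fwd = atomic_load(&a->mem_allocated) - freed - rmem;
	if (fwd >= 0)
		return fwd; /* enough memory was already reserved */

	/* Reserve whole quanta to cover this buffer. */
	delta = ((size + QUANTUM - 1) / QUANTUM) * QUANTUM;
	alloc = atomic_fetch_add(&a->mem_allocated, delta) + delta;
	return alloc - freed - rmem;
}

/* Release a previous charge, e.g. when the buffer is consumed. */
void acct_uncharge(struct acct *a, int size)
{
	atomic_fetch_sub(&a->rmem_alloc, size);
}
```

With a 10000-byte limit, the first 1000-byte charge reserves one quantum
and later charges are served from the remainder until it runs out.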

The performance gain depends on the specific scenario; with few flows (and
little contention in the original code) the differences are in the noise range,
while with several flows contending for the same socket, the measured speed-up
is significant (over 100% in case of extreme contention).

Paolo Abeni (3):
  net/socket: factor out helpers for memory and queue manipulation
  udp: implement memory accounting helpers
  udp: use it's own memory accounting schema

 include/linux/skbuff.h |   2 +-
 include/linux/udp.h    |   2 +
 include/net/sock.h     |   5 ++
 include/net/udp.h      |   5 ++
 net/core/datagram.c    |  36 ++++++----
 net/core/skbuff.c      |   3 +-
 net/core/sock.c        |  96 ++++++++++++++++---------
 net/ipv4/udp.c         | 190 +++++++++++++++++++++++++++++++++++++++++--------
 net/ipv6/udp.c         |  28 +++-----
 net/sunrpc/svcsock.c   |  22 ++++--
 net/sunrpc/xprtsock.c  |   2 +-
 11 files changed, 290 insertions(+), 101 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* Re: [ANNOUNCE] ndiv: line-rate network traffic processing
From: Tom Herbert @ 2016-09-21 17:16 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Linux Kernel Network Developers
In-Reply-To: <20160921112852.GA991@1wt.eu>

On Wed, Sep 21, 2016 at 4:28 AM, Willy Tarreau <w@1wt.eu> wrote:
> Hi,
>
> Over the last 3 years I've been working a bit on high traffic processing
> for various reasons. It started with the wish to capture line-rate GigE
> traffic on very small fanless ARM machines and the framework has evolved
> to be used at my company as a basis for our anti-DDoS engine capable of
> dealing with multiple 10G links saturated with floods.
>
> I know it comes a bit late now that there is XDP, but it's my first
> vacation since then and I needed to have a bit of calm time to collect
> the patches from the various branches and put them together. Anyway I'm
> sending this here in case it can be of interest to anyone, for use or
> just to study it.
>
> I presented it in 2014 at kernel recipes :
>   http://kernel-recipes.org/en/2014/ndiv-a-low-overhead-network-traffic-diverter/
>
> It now supports drivers mvneta, ixgbe, e1000e, e1000 and igb. It is
> very light, and retrieves the packets in the NIC's driver before they
> are converted to an skb, then submits them to a registered RX handler
> running in softirq context so we have the best of all worlds by
> benefitting from CPU scalability, delayed processing, and not paying
> the cost of switching to userland. Also an rx_done() function allows
> handlers to batch their processing. The RX handler returns an action
> among accepting the packet as-is, accepting it modified (eg: vlan or
> tunnel decapsulation), dropping it, postponing the processing
> (equivalent to EAGAIN), or building a new packet to send back.
>
> This last function is the one requiring the most changes in existing
> drivers, but offers the widest range of possibilities. We use it to
> send SYN cookies, but I have also implemented a stateless HTTP server
> supporting keep-alive using it, achieving line-rate traffic processing
> on a single CPU core when the NIC supports it. It's very convenient to
> test various stateful TCP components as it's easy to sustain millions
> of connections per second on it.
>
> It does not support forwarding between NICs. It was my first goal
> because I wanted to implement a TAP with it, bridging the traffic
> between two ports, but figured it was adding some complexity to the
> system back then. However since then we've implemented traffic
> capture in our product, exploiting this framework to capture without
> losses at 14 Mpps. I may find some time to try to extract it later.
> It uses the /sys API so that you can simply plug tcpdump -r on a
> file there, though there's also an mmap version which uses less CPU
> (that's important at 10G).
>
> In its current form since the initial code's intent was to limit
> core changes, it happens not to modify anything in the kernel by
> default and to reuse the net_device's ax25_ptr to attach devices
> (idea borrowed from netmap), so it can be used on an existing
> kernel just by loading the patched network drivers (yes, I know
> it's not a valid solution for the long term).
>
> The current code is available here :
>
>   http://git.kernel.org/cgit/linux/kernel/git/wtarreau/ndiv.git/
>
> Please let me know if there could be some interest in rebasing it
> on more recent versions (currently 3.10, 3.14 and 4.4 are supported).
> I don't have much time to assign to it since it works fine as-is,
> but will be glad to do so if that can be useful.
>
Hi Willy,

This does seem interesting and indeed the driver datapath looks very
much like XDP. It would be quite interesting, and helpful, if you could
rebase and then maybe look at how this can work with XDP.
The return actions are identical, but processing descriptor metadata
(like checksum, vlan) is not yet implemented in XDP; maybe this is
something we can leverage from ndiv?

Tom

> Also the stateless HTTP server provided in it definitely is a nice
> use case for testing such a framework.
>
> Regards,
> Willy
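
The action-returning RX handler described in the announcement can be
sketched roughly as below. This is not ndiv's or XDP's actual API: the
enum values and rx_handle() are hypothetical names chosen for
illustration, and the handler works on a plain byte buffer rather than a
driver descriptor. It drops runt frames, strips an 802.1Q tag (the
"accepted modified" case), and passes everything else unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical verdicts, one per action listed in the announcement. */
enum rx_verdict {
	RX_PASS,          /* accept the packet as-is */
	RX_PASS_MODIFIED, /* accept after in-place modification */
	RX_DROP,          /* drop the packet */
	RX_LATER,         /* postpone processing (EAGAIN-like) */
	RX_REPLY          /* a new packet was built to send back */
};

#define ETH_HDR_LEN   14     /* dst MAC + src MAC + EtherType */
#define VLAN_TAG_LEN  4      /* TPID + TCI */
#define ETH_P_8021Q_V 0x8100 /* 802.1Q tag protocol identifier */

/* Inspect one raw Ethernet frame and return a verdict. *len is updated
 * when the frame is shortened.
 */
enum rx_verdict rx_handle(uint8_t *frame, size_t *len)
{
	uint16_t ethertype;

	assert(frame != NULL && len != NULL);
	if (*len < ETH_HDR_LEN)
		return RX_DROP;

	ethertype = (uint16_t)(frame[12] << 8 | frame[13]);
	if (ethertype == ETH_P_8021Q_V) {
		if (*len < ETH_HDR_LEN + VLAN_TAG_LEN)
			return RX_DROP;
		/* Remove the 4-byte tag that starts at offset 12. */
		memmove(frame + 12, frame + 12 + VLAN_TAG_LEN,
			*len - 12 - VLAN_TAG_LEN);
		*len -= VLAN_TAG_LEN;
		return RX_PASS_MODIFIED;
	}
	return RX_PASS;
}
```

RX_LATER and RX_REPLY are declared only to show the full set of verdicts;
a real handler would need driver support to act on them.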

^ permalink raw reply

* Re: [PATCH net-next V2] net/vxlan: Avoid unaligned access in vxlan_build_skb()
From: Alexei Starovoitov @ 2016-09-21 17:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Hannes Frederic Sowa, Sowmini Varadhan, netdev, jbenc, davem
In-Reply-To: <1474476811.23058.78.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Sep 21, 2016 at 09:53:31AM -0700, Eric Dumazet wrote:
> On Wed, 2016-09-21 at 09:14 -0700, Alexei Starovoitov wrote:
> 
> > 
> > I think it's the opposite. Even on x86 compiler will use byte loads.
> 
> Unless you tweaked gcc, it should still use word loads on x86.

> I checked that on x86-64 actually. Also clearly visible here:

ahh. ok. good to know. thanks guys!

^ permalink raw reply

* Re: [net-next 01/15] i40e: Introduce VF port representor/control netdevs
From: Samudrala, Sridhar @ 2016-09-21 16:59 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Jeff Kirsher, David Miller, Linux Netdev List, nhorman@redhat.com,
	sassmann@redhat.com, jogreene@redhat.com, guru.anbalagane,
	Ilya Lesokhin, Andy Gospodarek, John Fastabend, Jiri Pirko,
	Rony Efraim
In-Reply-To: <CAJ3xEMgMkz1FZ-QObMvc28JQT0Osrf-k7_RfanUYkYP=F8jZGQ@mail.gmail.com>



On 9/21/2016 12:04 AM, Or Gerlitz wrote:
> On Wed, Sep 21, 2016 at 8:45 AM, Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
>> On 9/20/2016 9:22 PM, Or Gerlitz wrote:
>>> On Wed, Sep 21, 2016 at 6:43 AM, Jeff Kirsher
>>> <jeffrey.t.kirsher@intel.com> wrote:
>>>> From: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>> This patch enables creation of a VF Port representor/Control netdev
>>>> associated with each VF. These netdevs can be used to control and
>>>> configure
>>>> VFs from PFs namespace. They enable exposing VF statistics, configuring
>>>> link state, mtu, fdb/vlan entries etc.
>>> What happens if someone does a xmit on the VF representor, does the
>>> packet show up @ the VF?
>>> and what happens if the VF xmits and there's no HW steering rule that
>>> matches this, does
>>> the frame show up @ the VF rep on the host?
>> TX/RX are not yet supported via VFPR netdevs in this patch series.
>> Will be submitting this support in the next patchset.
> Okay, good.
>
>>> In other words, can these VF reps serve for setting up host SW based
>>> switching which you
>>> can later offload (through TC, bridge, netfilter, etc)?
>> Yes. These offloads will be possible  via VFPRs.
> cool
>
>>> I am posing these questions because in downstream patch you are adding
>>> devlink support
>>> for set/get the e-switch mode and you declare the default mode to be switchdev.
>>> When the switchdev mode was introduced in 4.8 these RX/TX
>>> characteristics were defined
>>> to be an essential (== requirement) part for a driver to support that mode.
>> The current patchset introduces the basic VFPR support starting with
>> exposing VF stats and syncing link state between VFs and VFPRs.
>> We decided to declare the default mode to be switchdev so that the new code
>> paths will get exercised by default during normal testing.
> so what happens after this patchset is applied and before the future
> work is submitted?
> RX/TX slow path through the VFPRs isn't supported and what about fast
> path? in other words
> what happens when someone loads the driver, sets SRIOV (--> the driver
> set itself to switchdev mode
> and VFPRs are created) and then a VF sends a packet? do you still put
> into the HW the legacy DMAC
> based switching rules? I am not following...

The VF driver requests adding the DMAC-based filter rules via mailbox
messages to the PF, and that is not changed in this patchset.
Once we have VFPR TX/RX support, we will not allow the VF driver to add
these rules. Instead, a host-based program will be able to add these
rules to enable the fast path.

Thanks
Sridhar

^ permalink raw reply

* Re: [PATCH net-next 1/3] net: ethernet: mediatek: add extension of phy-mode for TRGMII
From: Florian Fainelli @ 2016-09-21 16:59 UTC (permalink / raw)
  To: Sean Wang; +Cc: john, davem, nbd, netdev, linux-mediatek, keyhaede
In-Reply-To: <1474443189-8836-1-git-send-email-sean.wang@mediatek.com>

On 09/21/2016 12:33 AM, Sean Wang wrote:
> Date: Tue, 20 Sep 2016 14:23:24 -0700, Florian Fainelli <f.fainelli@gmail.com> wrote:
>> On 09/20/2016 12:59 AM, sean.wang@mediatek.com wrote:
>>> From: Sean Wang <sean.wang@mediatek.com>
>>>
>>> adds PHY-mode "trgmii" as an extension for the operation
>>> mode of the PHY interface, TRGMII can be compatible with
>>> RGMII, so the extended mode doesn't really have effects on
>>> the target MAC and PHY, is used as the indication if the
>>> current MAC is connected to an internal switch or external
>>> PHY respectively by the given configuration on the board and
>>> then to perform the corresponding setup on TRGMII hardware
>>> module.
>>
>> Based on my googling, it seems like Turbo RGMII is a Mediatek-specific
>> thing for now, but this could become standard and used by other vendors
>> at some point, so I would be inclined to just extend the phy-mode
>> property to support trgmii as another interface type.
>>
>> If you do so, do you also mind proposing an update to the Device Tree
>> specification:
>>
>> https://www.devicetree.org/specifications/
>>
>> Thanks!
> 
> I am willing to do the following:
> 
> 1)
> In the next version, I will extend RGMII mode as another
> interface type, PHY_INTERFACE_MODE_TRGMII, defined in
> linux/phy.h, instead of an extension only inside the
> current driver. This change also helps to save some code.
> 
> 2)
> I will send a separate patch updating the Device Tree
> specification with a description of TRGMII.
> 
> Are these both okay for you?

Absolutely, thanks a lot!
-- 
Florian
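
For readers following along, the generic approach agreed on above amounts
to adding one more entry to a shared name-to-mode table instead of parsing
"trgmii" inside a single driver. A minimal sketch, assuming a simplified
table; the enum, table and function names are hypothetical and are not
the definitions in linux/phy.h:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's interface mode enum. */
enum phy_mode_sketch {
	PHY_MODE_SKETCH_NA,
	PHY_MODE_SKETCH_RGMII,
	PHY_MODE_SKETCH_TRGMII
};

static const struct {
	const char *name;
	enum phy_mode_sketch mode;
} phy_mode_names[] = {
	{ "rgmii",  PHY_MODE_SKETCH_RGMII },
	{ "trgmii", PHY_MODE_SKETCH_TRGMII }, /* the newly added entry */
};

/* Map a phy-mode property string to a mode, or NA if unknown. */
enum phy_mode_sketch phy_mode_from_string(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(phy_mode_names) / sizeof(phy_mode_names[0]); i++)
		if (strcmp(name, phy_mode_names[i].name) == 0)
			return phy_mode_names[i].mode;
	return PHY_MODE_SKETCH_NA;
}
```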

^ permalink raw reply

* Re: [PATCH net-next] MAINTAINERS: Update b44 maintainer.
From: Florian Fainelli @ 2016-09-21 16:56 UTC (permalink / raw)
  To: Michael Chan, davem; +Cc: netdev
In-Reply-To: <1474428795-25095-1-git-send-email-michael.chan@broadcom.com>

On 09/20/2016 08:33 PM, Michael Chan wrote:
> Taking over as maintainer since Gary Zambrano is no longer working
> for Broadcom.
> 
> Signed-off-by: Michael Chan <michael.chan@broadcom.com>

Acked-by: Florian Fainelli <f.fainelli@gmail.com>

Thanks Michael!
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next V2] net/vxlan: Avoid unaligned access in vxlan_build_skb()
From: Eric Dumazet @ 2016-09-21 16:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hannes Frederic Sowa, Sowmini Varadhan, netdev, jbenc, davem
In-Reply-To: <20160921161457.GA17116@ast-mbp.thefacebook.com>

On Wed, 2016-09-21 at 09:14 -0700, Alexei Starovoitov wrote:

> 
> I think it's the opposite. Even on x86 compiler will use byte loads.

Unless you tweaked gcc, it should still use word loads on x86.

^ permalink raw reply

* Re: [PATCH net-next V2] net/vxlan: Avoid unaligned access in vxlan_build_skb()
From: Hannes Frederic Sowa @ 2016-09-21 16:49 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Sowmini Varadhan, netdev, jbenc, davem
In-Reply-To: <20160921161457.GA17116@ast-mbp.thefacebook.com>

On 21.09.2016 18:14, Alexei Starovoitov wrote:
> On Wed, Sep 21, 2016 at 12:10:55PM +0200, Hannes Frederic Sowa wrote:
>> On 20.09.2016 20:57, Sowmini Varadhan wrote:
>>> The vxlan header may not be aligned to 4 bytes in
>>> vxlan_build_skb (e.g., for MLD packets). This patch
>>> avoids unaligned access traps from vxlan_build_skb
>>> (in platforms like sparc) by making struct vxlanhdr __packed.
>>>
>>> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>>
>> Performance wise this should only affect code generation for archs where
>> it matters anyway.
> 
> I think it's the opposite. Even on x86 compiler will use byte loads.

I checked that on x86-64 actually. Also clearly visible here:

https://godbolt.org/g/xsW2P1

Bye,
Hannes

^ permalink raw reply
