Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 4.9] bpf/verifier: Fix states_equal() comparison of pointer and UNKNOWN
From: Alexei Starovoitov @ 2017-12-23  3:51 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Greg Kroah-Hartman, stable, netdev, Edward Cree, Jann Horn,
	Alexei Starovoitov
In-Reply-To: <20171223022617.GO2971@decadent.org.uk>

On Sat, Dec 23, 2017 at 02:26:17AM +0000, Ben Hutchings wrote:
> An UNKNOWN_VALUE is not supposed to be derived from a pointer, unless
> pointer leaks are allowed.  Therefore, states_equal() must not treat
> a state with a pointer in a register as "equal" to a state with an
> UNKNOWN_VALUE in that register.
> 
> This was fixed differently upstream, but the code around here was
> largely rewritten in 4.14 by commit f1174f77b50c "bpf/verifier: rework
> value tracking".  The bug can be detected by the bpf/verifier sub-test
> "pointer/scalar confusion in state equality check (way 1)".
> 
> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
> Cc: Edward Cree <ecree@solarflare.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Alexei Starovoitov <ast@kernel.org>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: fix maximum stack depth tracking logic
From: Alexei Starovoitov @ 2017-12-23  3:03 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexei Starovoitov, David S . Miller, Daniel Borkmann,
	Network Development, kernel-team
In-Reply-To: <CAG48ez3j-uSEmJNWvOkVZRgdyawn7GAR+xgpA8LM3E4YGxJjKA@mail.gmail.com>

On Sat, Dec 23, 2017 at 03:26:27AM +0100, Jann Horn wrote:
> On Sat, Dec 23, 2017 at 3:07 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Sat, Dec 23, 2017 at 02:38:29AM +0100, Jann Horn wrote:
> >> On Fri, Dec 22, 2017 at 10:33 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> >> > instead of computing max stack depth for current call chain only
> >> > track the maximum possible stack depth of any function at given
> >> > frame position. Such algorithm is simple and fast, but conservative,
> >> > since it overestimates amount of stack used. Consider:
> >> > main() // stack 32
> >> > {
> >> >   A();
> >> >   B();
> >> > }
> >> >
> >> > A(){} // stack 256
> >> >
> >> > B()  // stack 64
> >> > {
> >> >   A();
> >> > }
> >> >
> >> > since A() is called at frame[1] and frame[2], the algorithm
> >> > will estimate the max stack depth as 32 + 256 + 256 and will reject
> >> > such program, though real max is 32 + 64 + 256.
> >> >
> >> > Fortunately the algorithm is good enough in practice. The alternative
> >> > would be to track max stack of every function in the fast pass through
> >> > the verifier and then do additional CFG walk just to compute max total.
> >> >
> >> > Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
> >> > Reported-by: Jann Horn <jannh@google.com>
> >> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >>
> >> Does this work in cases where multiple invocations of a function have
> >> different stack access patterns because their inputs have different
> >> bounds?
> >>
> >> Consider this pseudocode example:
> >>
> >> void main(void) {
> >>   func1(0);
> >>   func1(1);
> >>   func2(1);
> >> }
> >> void func1(int alloc_or_recurse) {
> >>   if (alloc_or_recurse) {
> >>     frame_pointer[-300] = 1;
> >>   } else {
> >>     func2(alloc_or_recurse);
> >>   }
> >> }
> >> void func2(int alloc_or_recurse) {
> >>   if (alloc_or_recurse) {
> >>     frame_pointer[-300] = 1;
> >>   }
> >> }
> >>
> >> AFAICS this will work as follows:
> >>
> >> Call to func1->func2 runs without any stack accesses because the
> >> verifier can prove that alloc_or_recurse is 0.
> >
> > argh. right.
> > I guess that ruins my attemp to do the stack check inline
> > with the main verifier pass.
> > Do you see an algorithm that can do it without extra
> > cfg walk at the end?
> 
> A crappy heuristic would be to forbid recursion (calling a function
> that is already present somewhere in the call stack) 

the recursion is already forbidden. It's a 'back-edge' from cfg point
of view. There are 3 tests that cover that in few variants.

> and then sum up
> the maximum stack depths of all functions at the end and see whether
> the sum is bigger than the maximum stack size. While it'd be horribly
> conservative, it might work for now? 512 bytes are a lot of stack.

That's what I tried first while developing bpf calls.
It should have been good enough, but round up to 32-byte makes even
tiny function into sizeable and both *_noinline.c tests fail to load.
Arguably they are artificial stress test that push the limits with
number of calls, argument passing and pointer accesses, but I don't
want to scale them down.

> Or as a more complicated, but slightly less conservative heuristic,
> you could forbid recursion, record the maximum number of stack frames
> (max_stack_frames), then at the end select the top max_stack_frames
> functions with the biggest stack sizes and sum up their sizes?

it's pretty much the same as previous one. In these two tests I'm
building long stack chain out of all functions.

> Anything else I can come up with is probably more complicated than an
> extra cfg walk.

I guess we have to generalize (or remember) cfg anyway for future use,
so will code it up now and see how it looks.

Thank you for the feedback!

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: fix maximum stack depth tracking logic
From: Jann Horn @ 2017-12-23  2:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, David S . Miller, Daniel Borkmann,
	Network Development, kernel-team
In-Reply-To: <20171223020749.j7vc7z4o4lf2xj33@ast-mbp>

On Sat, Dec 23, 2017 at 3:07 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Sat, Dec 23, 2017 at 02:38:29AM +0100, Jann Horn wrote:
>> On Fri, Dec 22, 2017 at 10:33 PM, Alexei Starovoitov <ast@kernel.org> wrote:
>> > instead of computing max stack depth for current call chain only
>> > track the maximum possible stack depth of any function at given
>> > frame position. Such algorithm is simple and fast, but conservative,
>> > since it overestimates amount of stack used. Consider:
>> > main() // stack 32
>> > {
>> >   A();
>> >   B();
>> > }
>> >
>> > A(){} // stack 256
>> >
>> > B()  // stack 64
>> > {
>> >   A();
>> > }
>> >
>> > since A() is called at frame[1] and frame[2], the algorithm
>> > will estimate the max stack depth as 32 + 256 + 256 and will reject
>> > such program, though real max is 32 + 64 + 256.
>> >
>> > Fortunately the algorithm is good enough in practice. The alternative
>> > would be to track max stack of every function in the fast pass through
>> > the verifier and then do additional CFG walk just to compute max total.
>> >
>> > Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
>> > Reported-by: Jann Horn <jannh@google.com>
>> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>
>> Does this work in cases where multiple invocations of a function have
>> different stack access patterns because their inputs have different
>> bounds?
>>
>> Consider this pseudocode example:
>>
>> void main(void) {
>>   func1(0);
>>   func1(1);
>>   func2(1);
>> }
>> void func1(int alloc_or_recurse) {
>>   if (alloc_or_recurse) {
>>     frame_pointer[-300] = 1;
>>   } else {
>>     func2(alloc_or_recurse);
>>   }
>> }
>> void func2(int alloc_or_recurse) {
>>   if (alloc_or_recurse) {
>>     frame_pointer[-300] = 1;
>>   }
>> }
>>
>> AFAICS this will work as follows:
>>
>> Call to func1->func2 runs without any stack accesses because the
>> verifier can prove that alloc_or_recurse is 0.
>
> argh. right.
> I guess that ruins my attemp to do the stack check inline
> with the main verifier pass.
> Do you see an algorithm that can do it without extra
> cfg walk at the end?

A crappy heuristic would be to forbid recursion (calling a function
that is already present somewhere in the call stack) and then sum up
the maximum stack depths of all functions at the end and see whether
the sum is bigger than the maximum stack size. While it'd be horribly
conservative, it might work for now? 512 bytes are a lot of stack.
Or as a more complicated, but slightly less conservative heuristic,
you could forbid recursion, record the maximum number of stack frames
(max_stack_frames), then at the end select the top max_stack_frames
functions with the biggest stack sizes and sum up their sizes? (Or if
you want to support recursion, check
max_stack_frames*biggest_frame_size<=MAX_BPF_STACK.)
Anything else I can come up with is probably more complicated than an
extra cfg walk.

^ permalink raw reply

* [PATCH 4.9] bpf/verifier: Fix states_equal() comparison of pointer and UNKNOWN
From: Ben Hutchings @ 2017-12-23  2:26 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: stable, netdev, Edward Cree, Jann Horn, Alexei Starovoitov

[-- Attachment #1: Type: text/plain, Size: 1368 bytes --]

An UNKNOWN_VALUE is not supposed to be derived from a pointer, unless
pointer leaks are allowed.  Therefore, states_equal() must not treat
a state with a pointer in a register as "equal" to a state with an
UNKNOWN_VALUE in that register.

This was fixed differently upstream, but the code around here was
largely rewritten in 4.14 by commit f1174f77b50c "bpf/verifier: rework
value tracking".  The bug can be detected by the bpf/verifier sub-test
"pointer/scalar confusion in state equality check (way 1)".

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Edward Cree <ecree@solarflare.com>
Cc: Jann Horn <jannh@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
---
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2722,11 +2722,12 @@ static bool states_equal(struct bpf_veri

 		/* If we didn't map access then again we don't care about the
 		 * mismatched range values and it's ok if our old type was
-		 * UNKNOWN and we didn't go to a NOT_INIT'ed reg.
+		 * UNKNOWN and we didn't go to a NOT_INIT'ed or pointer reg.
 		 */
 		if (rold->type == NOT_INIT ||
 		    (!varlen_map_access && rold->type == UNKNOWN_VALUE &&
-		     rcur->type != NOT_INIT))
+		     rcur->type != NOT_INIT &&
+		     !__is_pointer_value(env->allow_ptr_leaks, rcur)))
 			continue;

 		/* Don't care about the reg->id in this case. */

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: fix maximum stack depth tracking logic
From: Alexei Starovoitov @ 2017-12-23  2:07 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexei Starovoitov, David S . Miller, Daniel Borkmann,
	Network Development, kernel-team
In-Reply-To: <CAG48ez37FM=Kianqa9C0DVg-iO0dPXkZWXVCeHjfkkAAAB55rA@mail.gmail.com>

On Sat, Dec 23, 2017 at 02:38:29AM +0100, Jann Horn wrote:
> On Fri, Dec 22, 2017 at 10:33 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> > instead of computing max stack depth for current call chain only
> > track the maximum possible stack depth of any function at given
> > frame position. Such algorithm is simple and fast, but conservative,
> > since it overestimates amount of stack used. Consider:
> > main() // stack 32
> > {
> >   A();
> >   B();
> > }
> >
> > A(){} // stack 256
> >
> > B()  // stack 64
> > {
> >   A();
> > }
> >
> > since A() is called at frame[1] and frame[2], the algorithm
> > will estimate the max stack depth as 32 + 256 + 256 and will reject
> > such program, though real max is 32 + 64 + 256.
> >
> > Fortunately the algorithm is good enough in practice. The alternative
> > would be to track max stack of every function in the fast pass through
> > the verifier and then do additional CFG walk just to compute max total.
> >
> > Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
> > Reported-by: Jann Horn <jannh@google.com>
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> 
> Does this work in cases where multiple invocations of a function have
> different stack access patterns because their inputs have different
> bounds?
> 
> Consider this pseudocode example:
> 
> void main(void) {
>   func1(0);
>   func1(1);
>   func2(1);
> }
> void func1(int alloc_or_recurse) {
>   if (alloc_or_recurse) {
>     frame_pointer[-300] = 1;
>   } else {
>     func2(alloc_or_recurse);
>   }
> }
> void func2(int alloc_or_recurse) {
>   if (alloc_or_recurse) {
>     frame_pointer[-300] = 1;
>   }
> }
> 
> AFAICS this will work as follows:
> 
> Call to func1->func2 runs without any stack accesses because the
> verifier can prove that alloc_or_recurse is 0.

argh. right.
I guess that ruins my attemp to do the stack check inline
with the main verifier pass.
Do you see an algorithm that can do it without extra
cfg walk at the end?

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: fix maximum stack depth tracking logic
From: Jann Horn @ 2017-12-23  1:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S . Miller, Daniel Borkmann, Network Development,
	kernel-team
In-Reply-To: <20171222213328.993019-2-ast@kernel.org>

On Fri, Dec 22, 2017 at 10:33 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> instead of computing max stack depth for current call chain only
> track the maximum possible stack depth of any function at given
> frame position. Such algorithm is simple and fast, but conservative,
> since it overestimates amount of stack used. Consider:
> main() // stack 32
> {
>   A();
>   B();
> }
>
> A(){} // stack 256
>
> B()  // stack 64
> {
>   A();
> }
>
> since A() is called at frame[1] and frame[2], the algorithm
> will estimate the max stack depth as 32 + 256 + 256 and will reject
> such program, though real max is 32 + 64 + 256.
>
> Fortunately the algorithm is good enough in practice. The alternative
> would be to track max stack of every function in the fast pass through
> the verifier and then do additional CFG walk just to compute max total.
>
> Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
> Reported-by: Jann Horn <jannh@google.com>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Does this work in cases where multiple invocations of a function have
different stack access patterns because their inputs have different
bounds?

Consider this pseudocode example:

void main(void) {
  func1(0);
  func1(1);
  func2(1);
}
void func1(int alloc_or_recurse) {
  if (alloc_or_recurse) {
    frame_pointer[-300] = 1;
  } else {
    func2(alloc_or_recurse);
  }
}
void func2(int alloc_or_recurse) {
  if (alloc_or_recurse) {
    frame_pointer[-300] = 1;
  }
}

AFAICS this will work as follows:

Call to func1->func2 runs without any stack accesses because the
verifier can prove that alloc_or_recurse is 0.
Second call to func1 allocates stack memory, bumping up
frame_stack_depth[1].
Second call to func2 allocates stack memory, leaving
frame_stack_depth[1] the same.

So I think this will still pass the verifier, even though func1
and func2 will end up with 300 bytes stack memory each, causing
the func1->func2 call to use more stack memory than permitted.

^ permalink raw reply

* Re: [PATCH net-next] net/trace: fix printk format in inet_sock_set_state
From: Yafang Shao @ 2017-12-23  1:19 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: David Miller, netdev
In-Reply-To: <CALOAHbB10mLdV_0p7vWv8w4snG2JEnUSkYUXBjC-BvP+b5AVDw@mail.gmail.com>

On Sat, Dec 23, 2017 at 9:10 AM, Yafang Shao <laoar.shao@gmail.com> wrote:
> On Sat, Dec 23, 2017 at 1:04 AM, Sergei Shtylyov
> <sergei.shtylyov@cogentembedded.com> wrote:
>> Hello!
>>
>> On 12/22/2017 06:37 PM, Yafang Shao wrote:
>>
>>> There's a space character missed in the printk messages.
>>> This error should be prevented with checkscript.pl, but it couldn't caught
>>
>>                                                                      ^ be?
>
> It is checkpatch.pl.
>
>>
>>> by running with "checkscript.pl -f xxxx.patch", that's what I had run
>>> before.
>>> What a carelessness.
>>
>>
>>    You generally don't need to break up the messages violating 80-column
>> limit, and checkpatch.pl should be aware of this...
>>
>
> Oh. That's right.
> It can be aware of that.
>
> I just want to make the code easy to read and limit the textwidth to
> 80 character.
>
> If the message takes two lines as bellow,
>     printk("xxx "
>                      ^ space character.
>               "yyy");
> The checkpatch.pl  could also be aware of that if the first line is
> not end with space character, but it couldn't be aware of that if run
> with "checkpatch.pl -f xxxx.patch".
>

Should we need to check that error as well when we run with
"checkpatch.pl -f" ?

>
>>> Fixes: 563e0bb0dc74("net: tracepoint: replace tcp_set_state tracepoint
>>> with
>>> inet_sock_set_state tracepoint")
>>> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>>
>> [...]
>>
>> MBR, Sergei

^ permalink raw reply

* [PATCH net-next] net/trace: fix printk format in inet_sock_set_state
From: Yafang Shao @ 2017-12-23  1:14 UTC (permalink / raw)
  To: davem; +Cc: netdev, Yafang Shao, Sergei Shtylyov

There's a space character missed in the printk messages.
This error should be prevented with checkpatch.pl, but it couldn't caught
by running with "checkpatch.pl -f xxxx.patch", that's what I had run
before.
What a carelessness.

Fixes: 563e0bb0dc74("net: tracepoint: replace tcp_set_state tracepoint with
inet_sock_set_state tracepoint")
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/trace/events/sock.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index 3b9094a..598399da 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -160,7 +160,7 @@
 		}
 	),

-	TP_printk("protocol=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4"
+	TP_printk("protocol=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 "
 			"saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s",
 			show_inet_protocol_name(__entry->protocol),
 			__entry->sport, __entry->dport,

^ permalink raw reply related

* Re: [PATCH net-next] net/trace: fix printk format in inet_sock_set_state
From: Yafang Shao @ 2017-12-23  1:10 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: David Miller, netdev
In-Reply-To: <c92b431d-1a62-4c25-61f2-3a66dae97a08@cogentembedded.com>

On Sat, Dec 23, 2017 at 1:04 AM, Sergei Shtylyov
<sergei.shtylyov@cogentembedded.com> wrote:
> Hello!
>
> On 12/22/2017 06:37 PM, Yafang Shao wrote:
>
>> There's a space character missed in the printk messages.
>> This error should be prevented with checkscript.pl, but it couldn't caught
>
>                                                                      ^ be?

It is checkpatch.pl.

>
>> by running with "checkscript.pl -f xxxx.patch", that's what I had run
>> before.
>> What a carelessness.
>
>
>    You generally don't need to break up the messages violating 80-column
> limit, and checkpatch.pl should be aware of this...
>

Oh. That's right.
It can be aware of that.

I just want to make the code easy to read and limit the textwidth to
80 character.

If the message takes two lines as bellow,
    printk("xxx "
                     ^ space character.
              "yyy");
The checkpatch.pl  could also be aware of that if the first line is
not end with space character, but it couldn't be aware of that if run
with "checkpatch.pl -f xxxx.patch".


>> Fixes: 563e0bb0dc74("net: tracepoint: replace tcp_set_state tracepoint
>> with
>> inet_sock_set_state tracepoint")
>> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>
> [...]
>
> MBR, Sergei

^ permalink raw reply

* Re: [PATCH v2] openvswitch: Trim off padding before L3+ netfilter processing
From: Ed Swierk @ 2017-12-23  0:39 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: ovs-dev, Linux Kernel Network Developers, Benjamin Warren,
	Keith Holleman
In-Reply-To: <CAOrHB_D_xUoynncs_W3EcKtpc2yyYuvUgAg1G=7GNJvgnFt=ow@mail.gmail.com>

On Fri, Dec 22, 2017 at 3:31 PM, Pravin Shelar <pshelar@ovn.org> wrote:
> On Thu, Dec 21, 2017 at 7:17 AM, Ed Swierk <eswierk@skyportsystems.com> wrote:
>> IPv4 and IPv6 packets may arrive with lower-layer padding that is not
>> included in the L3 length. For example, a short IPv4 packet may have
>> up to 6 bytes of padding following the IP payload when received on an
>> Ethernet device. In the normal IPv4 receive path, ip_rcv() trims the
>> packet to ip_hdr->tot_len before invoking netfilter hooks (including
>> conntrack and nat).
>>
>> In the IPv6 receive path, ip6_rcv() does the same using
>> ipv6_hdr->payload_len. Similarly in the br_netfilter receive path,
>> br_validate_ipv4() and br_validate_ipv6() trim the packet to the L3
>> length before invoking NF_INET_PRE_ROUTING hooks.
>>
>> In the OVS conntrack receive path, ovs_ct_execute() pulls the skb to
>> the L3 header but does not trim it to the L3 length before calling
>> nf_conntrack_in(NF_INET_PRE_ROUTING). When nf_conntrack_proto_tcp
>> encounters a packet with lower-layer padding, nf_checksum() fails and
>> logs "nf_ct_tcp: bad TCP checksum". While extra zero bytes don't
>> affect the checksum, the length in the IP pseudoheader does. That
>> length is based on skb->len, and without trimming, it doesn't match
>> the length the sender used when computing the checksum.
>>
>> The assumption throughout nf_conntrack and nf_nat is that skb->len
>> reflects the length of the L3 header and payload, so there is no need
>> to refer back to ip_hdr->tot_len or ipv6_hdr->payload_len.
>>
>> This change brings OVS into line with other netfilter users, trimming
>> IPv4 and IPv6 packets prior to L3+ netfilter processing.
>>
>> Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
>> ---
>> v2:
>> - Trim packet in nat receive path as well as conntrack
>> - Free skb on error
>> ---
>>  net/openvswitch/conntrack.c | 34 ++++++++++++++++++++++++++++++++++
>>  1 file changed, 34 insertions(+)
>>
>> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>> index b27c5c6..1bdc78f 100644
>> --- a/net/openvswitch/conntrack.c
>> +++ b/net/openvswitch/conntrack.c
>> @@ -703,6 +703,33 @@ static bool skb_nfct_cached(struct net *net,
>>         return ct_executed;
>>  }
>>
>> +/* Trim the skb to the L3 length. Assumes the skb is already pulled to
>> + * the L3 header. The skb is freed on error.
>> + */
>> +static int skb_trim_l3(struct sk_buff *skb)
>> +{
>> +       unsigned int nh_len;
>> +       int err;
>> +
>> +       switch (skb->protocol) {
>> +       case htons(ETH_P_IP):
>> +               nh_len = ntohs(ip_hdr(skb)->tot_len);
>> +               break;
>> +       case htons(ETH_P_IPV6):
>> +               nh_len = ntohs(ipv6_hdr(skb)->payload_len)
>> +                       + sizeof(struct ipv6hdr);
>> +               break;
>> +       default:
>> +               nh_len = skb->len;
>> +       }
>> +
>> +       err = pskb_trim_rcsum(skb, nh_len);
>> +       if (err)
> This should is unlikely.

I'll add unlikely().

>> +               kfree_skb(skb);
>> +
>> +       return err;
>> +}
>> +
> This looks like a generic function, it probably does not belong to OVS
> code base.

Indeed. I'll move it to skbuff.c, unless you have a better idea.

>>  #ifdef CONFIG_NF_NAT_NEEDED
>>  /* Modelled after nf_nat_ipv[46]_fn().
>>   * range is only used for new, uninitialized NAT state.
>> @@ -715,8 +742,12 @@ static int ovs_ct_nat_execute(struct sk_buff *skb, struct nf_conn *ct,
>>  {
>>         int hooknum, nh_off, err = NF_ACCEPT;
>>
>> +       /* The nat module expects to be working at L3. */
>>         nh_off = skb_network_offset(skb);
>>         skb_pull_rcsum(skb, nh_off);
>> +       err = skb_trim_l3(skb);
>> +       if (err)
>> +               return err;
>>
> ct-nat is executed within ct action, so I do not see why you you call
> skb-trim again from ovs_ct_nat_execute().
> ovs_ct_execute() trim should take care of the skb.

I see. Doesn't that mean that skb_pull_rcsum() is also unnecessary in
ovs_ct_nat_execute(), as ovs_ct_execute() has already pulled the skb
to the L3 header?

>>         /* See HOOK2MANIP(). */
>>         if (maniptype == NF_NAT_MANIP_SRC)
>> @@ -1111,6 +1142,9 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb,
>>         /* The conntrack module expects to be working at L3. */
>>         nh_ofs = skb_network_offset(skb);
>>         skb_pull_rcsum(skb, nh_ofs);
>> +       err = skb_trim_l3(skb);
>> +       if (err)
>> +               return err;
>>
>>         if (key->ip.frag != OVS_FRAG_TYPE_NONE) {
>>                 err = handle_fragments(net, key, info->zone.id, skb);
>> --
>> 1.9.1
>>

^ permalink raw reply

* Re: [PATCH v2] openvswitch: Trim off padding before L3+ netfilter processing
From: Ed Swierk @ 2017-12-23  0:39 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: ovs-dev, Linux Kernel Network Developers, Benjamin Warren,
	Keith Holleman
In-Reply-To: <CAOrHB_D_xUoynncs_W3EcKtpc2yyYuvUgAg1G=7GNJvgnFt=ow@mail.gmail.com>

On Fri, Dec 22, 2017 at 3:31 PM, Pravin Shelar <pshelar@ovn.org> wrote:
> On Thu, Dec 21, 2017 at 7:17 AM, Ed Swierk <eswierk@skyportsystems.com> wrote:
>> IPv4 and IPv6 packets may arrive with lower-layer padding that is not
>> included in the L3 length. For example, a short IPv4 packet may have
>> up to 6 bytes of padding following the IP payload when received on an
>> Ethernet device. In the normal IPv4 receive path, ip_rcv() trims the
>> packet to ip_hdr->tot_len before invoking netfilter hooks (including
>> conntrack and nat).
>>
>> In the IPv6 receive path, ip6_rcv() does the same using
>> ipv6_hdr->payload_len. Similarly in the br_netfilter receive path,
>> br_validate_ipv4() and br_validate_ipv6() trim the packet to the L3
>> length before invoking NF_INET_PRE_ROUTING hooks.
>>
>> In the OVS conntrack receive path, ovs_ct_execute() pulls the skb to
>> the L3 header but does not trim it to the L3 length before calling
>> nf_conntrack_in(NF_INET_PRE_ROUTING). When nf_conntrack_proto_tcp
>> encounters a packet with lower-layer padding, nf_checksum() fails and
>> logs "nf_ct_tcp: bad TCP checksum". While extra zero bytes don't
>> affect the checksum, the length in the IP pseudoheader does. That
>> length is based on skb->len, and without trimming, it doesn't match
>> the length the sender used when computing the checksum.
>>
>> The assumption throughout nf_conntrack and nf_nat is that skb->len
>> reflects the length of the L3 header and payload, so there is no need
>> to refer back to ip_hdr->tot_len or ipv6_hdr->payload_len.
>>
>> This change brings OVS into line with other netfilter users, trimming
>> IPv4 and IPv6 packets prior to L3+ netfilter processing.
>>
>> Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
>> ---
>> v2:
>> - Trim packet in nat receive path as well as conntrack
>> - Free skb on error
>> ---
>>  net/openvswitch/conntrack.c | 34 ++++++++++++++++++++++++++++++++++
>>  1 file changed, 34 insertions(+)
>>
>> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>> index b27c5c6..1bdc78f 100644
>> --- a/net/openvswitch/conntrack.c
>> +++ b/net/openvswitch/conntrack.c
>> @@ -703,6 +703,33 @@ static bool skb_nfct_cached(struct net *net,
>>         return ct_executed;
>>  }
>>
>> +/* Trim the skb to the L3 length. Assumes the skb is already pulled to
>> + * the L3 header. The skb is freed on error.
>> + */
>> +static int skb_trim_l3(struct sk_buff *skb)
>> +{
>> +       unsigned int nh_len;
>> +       int err;
>> +
>> +       switch (skb->protocol) {
>> +       case htons(ETH_P_IP):
>> +               nh_len = ntohs(ip_hdr(skb)->tot_len);
>> +               break;
>> +       case htons(ETH_P_IPV6):
>> +               nh_len = ntohs(ipv6_hdr(skb)->payload_len)
>> +                       + sizeof(struct ipv6hdr);
>> +               break;
>> +       default:
>> +               nh_len = skb->len;
>> +       }
>> +
>> +       err = pskb_trim_rcsum(skb, nh_len);
>> +       if (err)
> This should is unlikely.

I'll add unlikely().

>> +               kfree_skb(skb);
>> +
>> +       return err;
>> +}
>> +
> This looks like a generic function, it probably does not belong to OVS
> code base.

Indeed. I'll move it to skbuff.c, unless you have a better idea.

>>  #ifdef CONFIG_NF_NAT_NEEDED
>>  /* Modelled after nf_nat_ipv[46]_fn().
>>   * range is only used for new, uninitialized NAT state.
>> @@ -715,8 +742,12 @@ static int ovs_ct_nat_execute(struct sk_buff *skb, struct nf_conn *ct,
>>  {
>>         int hooknum, nh_off, err = NF_ACCEPT;
>>
>> +       /* The nat module expects to be working at L3. */
>>         nh_off = skb_network_offset(skb);
>>         skb_pull_rcsum(skb, nh_off);
>> +       err = skb_trim_l3(skb);
>> +       if (err)
>> +               return err;
>>
> ct-nat is executed within ct action, so I do not see why you you call
> skb-trim again from ovs_ct_nat_execute().
> ovs_ct_execute() trim should take care of the skb.

I see. Doesn't that mean that skb_pull_rcsum() is also unnecessary in
ovs_ct_nat_execute(), as ovs_ct_execute() has already pulled the skb
to the L3 header?

>>         /* See HOOK2MANIP(). */
>>         if (maniptype == NF_NAT_MANIP_SRC)
>> @@ -1111,6 +1142,9 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb,
>>         /* The conntrack module expects to be working at L3. */
>>         nh_ofs = skb_network_offset(skb);
>>         skb_pull_rcsum(skb, nh_ofs);
>> +       err = skb_trim_l3(skb);
>> +       if (err)
>> +               return err;
>>
>>         if (key->ip.frag != OVS_FRAG_TYPE_NONE) {
>>                 err = handle_fragments(net, key, info->zone.id, skb);
>> --
>> 1.9.1
>>

^ permalink raw reply

* Re: thunderx sgmii interface hang
From: David Daney @ 2017-12-23  0:30 UTC (permalink / raw)
  To: Tim Harvey; +Cc: Andrew Lunn, Sunil Goutham, netdev
In-Reply-To: <CAJ+vNU16c5GbbW3A78rr=9YYAboBmFuTezWEA7=4mF0z8xWeBw@mail.gmail.com>

On 12/22/2017 04:22 PM, Tim Harvey wrote:
> On Fri, Dec 22, 2017 at 3:00 PM, David Daney <ddaney@caviumnetworks.com> wrote:
>> On 12/22/2017 02:19 PM, Tim Harvey wrote:
>>>
>>> On Tue, Dec 19, 2017 at 12:52 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>>>>
>>>> On Mon, Dec 18, 2017 at 01:53:47PM -0800, Tim Harvey wrote:
>>>>>
>>>>> On Wed, Dec 13, 2017 at 11:43 AM, Andrew Lunn <andrew@lunn.ch> wrote:
>>>>>>>
>>>>>>> The nic appears to work fine (pings, TCP etc) up until a performance
>>>>>>> test is attempted.
>>>>>>> When an iperf bandwidth test is attempted the nic ends up in a state
>>>>>>> where truncated-ip packets are being sent out (per a tcpdump from
>>>>>>> another board):
>>>>>>
>>>>>>
>>>>>> Hi Tim
>>>>>>
>>>>>> Are pause frames supported? Have you tried turning them off?
>>>>>>
>>>>>> Can you reproduce the issue with UDP? Or is it TCP only?
>>>>>>
>>>>>
>>>>> Andrew,
>>>>>
>>>>> Pause frames don't appear to be supported yet and the issue occurs
>>>>> when using UDP as well as TCP. I'm not clear what the best way to
>>>>> troubleshoot this is.
>>>>
>>>>
>>>> Hi Tim
>>>>
>>>> Is pause being negotiated? In theory, it should not be. The PHY should
>>>> not offer it, if the MAC has not enabled it. But some PHY drivers are
>>>> probably broken and offer pause when they should not.
>>>>
>>>> Also, can you trigger the issue using UDP at say 75% the maximum
>>>> bandwidth. That should be low enough that the peer never even tries to
>>>> use pause.
>>>>
>>>> All this pause stuff is just a stab in the dark. Something else to try
>>>> is to turn off various forms off acceleration, ethtook -K, and see if
>>>> that makes a difference.
>>>>
>>>
>>> Andrew,
>>>
>>> Currently I'm not using the DP83867_PHY driver (after verifying the
>>> issue occurs with or without that driver).
>>>
>>> It does not occur if I limit UDP (ie 950mbps). I disabled all offloads
>>> and the issue still occurs.
>>>
>>> I have found that once the issue occurs I can recover to a working
>>> state by clearing/setting BGX_CMRX_CFG[BGX_EN] and once I encounter
>>> the issue and recover with that, I can never trigger the issue again.
>>> If toggle that register bit upon power-up before the issue occurs it
>>> will still occur.
>>>
>>> The CN80XX reference manual describes BGX_CMRX_CFG[BGX_EN] as:
>>> - when cleared all dedicated BGX context state for LMAC (state
>>> machine, FIFOs, counters etc) are reset and LMAC access to shared BGX
>>> resources (data path, serdes lanes) is disabled
>>> - when set LMAC operation is enabled (link bring-up, sync, and tx/rx
>>> of idles and fault sequences)
>>
>>
>> You could try looking at
>> BGXX_GMP_PCS_INTX
>> BGXX_GMP_GMI_RXX_INT
>> BGXX_GMP_GMI_TXX_INT
>>
>> Those are all W1C registers that should contain all zeros.  If they don't,
>> just write back to them to clear before running a test.
>>
>> If there are bits asserting in these when the thing gets wedged up, it might
>> point to a possible cause.
> 
> David,
> 
> BGXX_GMP_GMI_TXX_INT[UNDFLW] is getting set when the issue is
> triggered. From CN80XX-HM-1.2P this is caused by:
> 
> "In the unlikely event that P2X data cannot keep the GMP TX FIFO full,
> the SGMII/1000BASE-X/ QSGMII packet transfer will underflow. This
> should be detected by the receiving device as an FCS error.
> Internally, the packet is drained and lost"
> 

Yikes!

> Perhaps this needs to be caught and handled in some way. There's some
> interrupt handlers in nicvf_main.c yet I'm not clear where to hook up
> this one.

This would be an interrupt generated by the BGX device, not the NIC 
device  It will have an MSI-X index of (6 + LMAC * 7).  See 
BGX_INT_VEC_E in the HRM.

Note that I am telling you which interrupt it is, but not recommending 
that catching it and doing something is necessarily the best thing to do.

> 
>>
>> You could also look at these RO registers:
>> BGXX_GMP_PCS_TXX_STATES
>> BGXX_GMP_PCS_RXX_STATES
>>
> 
> These show the same before/after triggering the issue and
> RX_BAD/TX_BAD are still 0.
> 
> Tim
> 

^ permalink raw reply

* Re: thunderx sgmii interface hang
From: Tim Harvey @ 2017-12-23  0:22 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Sunil Goutham, netdev
In-Reply-To: <20171222224507.GD22041@lunn.ch>

On Fri, Dec 22, 2017 at 2:45 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>> Currently I'm not using the DP83867_PHY driver (after verifying the
>> issue occurs with or without that driver).
>>
>> It does not occur if I limit UDP (ie 950mbps). I disabled all offloads
>> and the issue still occurs.
>
>> I'm told that the particular Cavium reference board with an SGMII phy
>> doesn't show this issue (I don't have that specific board to do my own
>> testing or comparisons against our board) so I'm inclined to think it
>> has something to do with an interaction with the DP83867 PHY. I would
>> like to start poking at PHY registers to see if I can find anything
>> unusual. The best way to do that from userspace is via
>> SIOCGMIIREG/SIOCSMIIREG right? The thunderx nic doesn't currently
>> support ioctl's so I guess I'll have to add that support unless
>> there's a way to get at phy registers from userspace through a phy
>> driver?
>
> phy_mii_ioctl() does what you need, and is simple to use.
>
> mii-tool will then give you access to the standard PHY registers.
>
>          Andre

I didn't think mii-tool or ethtool (its replacement right?) could
read/write specific phy registers via cmdline? I wrote my own tool
some time ago to do that [1].

Tim

[1] http://trac.gateworks.com/attachment/wiki/conformance_testing/mii-reg.c

^ permalink raw reply

* Re: thunderx sgmii interface hang
From: Tim Harvey @ 2017-12-23  0:22 UTC (permalink / raw)
  To: David Daney; +Cc: Andrew Lunn, Sunil Goutham, netdev
In-Reply-To: <7cb389a0-c035-81b5-17d5-c814e873c670@caviumnetworks.com>

On Fri, Dec 22, 2017 at 3:00 PM, David Daney <ddaney@caviumnetworks.com> wrote:
> On 12/22/2017 02:19 PM, Tim Harvey wrote:
>>
>> On Tue, Dec 19, 2017 at 12:52 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>>>
>>> On Mon, Dec 18, 2017 at 01:53:47PM -0800, Tim Harvey wrote:
>>>>
>>>> On Wed, Dec 13, 2017 at 11:43 AM, Andrew Lunn <andrew@lunn.ch> wrote:
>>>>>>
>>>>>> The nic appears to work fine (pings, TCP etc) up until a performance
>>>>>> test is attempted.
>>>>>> When an iperf bandwidth test is attempted the nic ends up in a state
>>>>>> where truncated-ip packets are being sent out (per a tcpdump from
>>>>>> another board):
>>>>>
>>>>>
>>>>> Hi Tim
>>>>>
>>>>> Are pause frames supported? Have you tried turning them off?
>>>>>
>>>>> Can you reproduce the issue with UDP? Or is it TCP only?
>>>>>
>>>>
>>>> Andrew,
>>>>
>>>> Pause frames don't appear to be supported yet and the issue occurs
>>>> when using UDP as well as TCP. I'm not clear what the best way to
>>>> troubleshoot this is.
>>>
>>>
>>> Hi Tim
>>>
>>> Is pause being negotiated? In theory, it should not be. The PHY should
>>> not offer it, if the MAC has not enabled it. But some PHY drivers are
>>> probably broken and offer pause when they should not.
>>>
>>> Also, can you trigger the issue using UDP at say 75% the maximum
>>> bandwidth. That should be low enough that the peer never even tries to
>>> use pause.
>>>
>>> All this pause stuff is just a stab in the dark. Something else to try
>>> is to turn off various forms off acceleration, ethtook -K, and see if
>>> that makes a difference.
>>>
>>
>> Andrew,
>>
>> Currently I'm not using the DP83867_PHY driver (after verifying the
>> issue occurs with or without that driver).
>>
>> It does not occur if I limit UDP (ie 950mbps). I disabled all offloads
>> and the issue still occurs.
>>
>> I have found that once the issue occurs I can recover to a working
>> state by clearing/setting BGX_CMRX_CFG[BGX_EN] and once I encounter
>> the issue and recover with that, I can never trigger the issue again.
>> If toggle that register bit upon power-up before the issue occurs it
>> will still occur.
>>
>> The CN80XX reference manual describes BGX_CMRX_CFG[BGX_EN] as:
>> - when cleared all dedicated BGX context state for LMAC (state
>> machine, FIFOs, counters etc) are reset and LMAC access to shared BGX
>> resources (data path, serdes lanes) is disabled
>> - when set LMAC operation is enabled (link bring-up, sync, and tx/rx
>> of idles and fault sequences)
>
>
> You could try looking at
> BGXX_GMP_PCS_INTX
> BGXX_GMP_GMI_RXX_INT
> BGXX_GMP_GMI_TXX_INT
>
> Those are all W1C registers that should contain all zeros.  If they don't,
> just write back to them to clear before running a test.
>
> If there are bits asserting in these when the thing gets wedged up, it might
> point to a possible cause.

David,

BGXX_GMP_GMI_TXX_INT[UNDFLW] is getting set when the issue is
triggered. From CN80XX-HM-1.2P this is caused by:

"In the unlikely event that P2X data cannot keep the GMP TX FIFO full,
the SGMII/1000BASE-X/ QSGMII packet transfer will underflow. This
should be detected by the receiving device as an FCS error.
Internally, the packet is drained and lost"

Perhaps this needs to be caught and handled in some way. There's some
interrupt handlers in nicvf_main.c yet I'm not clear where to hook up
this one.

>
> You could also look at these RO registers:
> BGXX_GMP_PCS_TXX_STATES
> BGXX_GMP_PCS_RXX_STATES
>

These show the same before/after triggering the issue and
RX_BAD/TX_BAD are still 0.

Tim

^ permalink raw reply

* Re: [bpf-next V2 PATCH 01/14] xdp: base API for new XDP rx-queue info concept
From: Jakub Kicinski @ 2017-12-23  0:14 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Alexei Starovoitov, netdev, dsahern, gospo,
	bjorn.topel, michael.chan
In-Reply-To: <151396269959.20006.11486855606275589519.stgit@firesoul>

On Fri, 22 Dec 2017 18:11:39 +0100, Jesper Dangaard Brouer wrote:
> +struct xdp_rxq_info {
> +	struct net_device *dev;
> +	u32 queue_index;
> +	u32 reg_state;
> +} ____cacheline_aligned; /* perf critical, avoid false-sharing */

I'm assuming this is cacheline_aligned, because of some stuff you will
add here in the future for the completion path?  (The comment could
mention that this data is read-mostly.)  Drivers are likely to already
have a read-mostly (or unused-mostly) section of the rx ring structure.
Would it be possible to define this in a way that would allow people
who carefully lay out their data path structures to save cache space?

^ permalink raw reply

* Re: [PATCH bpf 0/2] tools: bpftool: fix unlikely race and JSON output on error path
From: Daniel Borkmann @ 2017-12-23  0:12 UTC (permalink / raw)
  To: Jakub Kicinski, netdev, alexei.starovoitov; +Cc: oss-drivers
In-Reply-To: <20171222193606.19786-1-jakub.kicinski@netronome.com>

On 12/22/2017 08:36 PM, Jakub Kicinski wrote:
> Hi!
> 
> Two small fixes here to listing maps and programs.  The loop for showing
> maps is written slightly differently to programs which was missed in JSON
> output support, and output would be broken if any of the system calls
> failed.  Second fix is in very unlikely case that program or map disappears
> after we get its ID we should just skip over that object instead of failing.

Applied to bpf tree, thanks Jakub!

^ permalink raw reply

* Re: pull-request: bpf-next 2017-12-18
From: Daniel Borkmann @ 2017-12-23  0:12 UTC (permalink / raw)
  To: David Miller; +Cc: alexei.starovoitov, ast, netdev
In-Reply-To: <20171222.094241.1237980817048654974.davem@davemloft.net>

On 12/22/2017 03:42 PM, David Miller wrote:
> From: Daniel Borkmann <daniel@iogearbox.net>
> Date: Fri, 22 Dec 2017 00:48:22 +0100
> 
>> Looks good, one thing: If I spot this correctly, isn't here a ...
>>
>> 		prog->aux->jit_data = jit_data;
>>
>> ... missing? Otherwise the context from the initial pass is neither
>> saved for the extra pass nor freed.
> 
> Good catch, here is an updated patch:
> 
> ====================
> bpf: sparc64: Add JIT support for multi-function programs.
> 
> Modelled strongly upon the arm64 implementation.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

Applied to bpf-next, thanks David!

^ permalink raw reply

* [PATCH net-next 3/4] tcp: place all zerocopy payload in frags
From: Willem de Bruijn @ 2017-12-23  0:00 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn
In-Reply-To: <20171223000020.55509-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

This avoids an unnecessary copy of 1-2KB and improves tso_fragment,
which has to fall back to tcp_fragment if skb->len != skb_data_len.

It also avoids a surprising inconsistency in notifications:
Zerocopy packets sent over loopback have their frags copied, so set
SO_EE_CODE_ZEROCOPY_COPIED in the notification. But this currently
does not happen for small packets, because when all data fits in the
linear fragment, data is not copied in skb_orphan_frags_rx.

Reported-by: Tom Deseyn <tom.deseyn@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/tcp.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 44102484a76f..947348872c3e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1186,7 +1186,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	int flags, err, copied = 0;
 	int mss_now = 0, size_goal, copied_syn = 0;
 	bool process_backlog = false;
-	bool sg;
+	bool sg, zc = false;
 	long timeo;
 
 	flags = msg->msg_flags;
@@ -1204,7 +1204,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			goto out_err;
 		}
 
-		if (!(sk_check_csum_caps(sk) && sk->sk_route_caps & NETIF_F_SG))
+		zc = sk_check_csum_caps(sk) && sk->sk_route_caps & NETIF_F_SG;
+		if (!zc)
 			uarg->zerocopy = 0;
 	}
 
@@ -1325,13 +1326,13 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			copy = msg_data_left(msg);
 
 		/* Where to copy to? */
-		if (skb_availroom(skb) > 0) {
+		if (skb_availroom(skb) > 0 && !zc) {
 			/* We have some space in skb head. Superb! */
 			copy = min_t(int, copy, skb_availroom(skb));
 			err = skb_add_data_nocache(sk, skb, &msg->msg_iter, copy);
 			if (err)
 				goto do_fault;
-		} else if (!uarg || !uarg->zerocopy) {
+		} else if (!zc) {
 			bool merge = true;
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);
-- 
2.15.1.620.gb9897f4670-goog

^ permalink raw reply related

* [PATCH net-next 4/4] tcp: do not allocate linear memory for zerocopy skbs
From: Willem de Bruijn @ 2017-12-23  0:00 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn
In-Reply-To: <20171223000020.55509-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

Zerocopy payload is now always stored in frags, and space for headers
is reversed, so this memory is unused.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/tcp.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 947348872c3e..7ac583a2b9fe 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1104,12 +1104,15 @@ static int linear_payload_sz(bool first_skb)
 	return 0;
 }
 
-static int select_size(const struct sock *sk, bool sg, bool first_skb)
+static int select_size(const struct sock *sk, bool sg, bool first_skb, bool zc)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	int tmp = tp->mss_cache;
 
 	if (sg) {
+		if (zc)
+			return 0;
+
 		if (sk_can_gso(sk)) {
 			tmp = linear_payload_sz(first_skb);
 		} else {
@@ -1282,6 +1285,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 
 		if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
 			bool first_skb;
+			int linear;
 
 new_segment:
 			/* Allocate new segment. If the interface is SG,
@@ -1295,9 +1299,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				goto restart;
 			}
 			first_skb = tcp_rtx_and_write_queues_empty(sk);
-			skb = sk_stream_alloc_skb(sk,
-						  select_size(sk, sg, first_skb),
-						  sk->sk_allocation,
+			linear = select_size(sk, sg, first_skb, zc);
+			skb = sk_stream_alloc_skb(sk, linear, sk->sk_allocation,
 						  first_skb);
 			if (!skb)
 				goto wait_for_memory;
-- 
2.15.1.620.gb9897f4670-goog

^ permalink raw reply related

* [PATCH net-next 2/4] tcp: push full zerocopy packets
From: Willem de Bruijn @ 2017-12-23  0:00 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn
In-Reply-To: <20171223000020.55509-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

Skbs that reach MAX_SKB_FRAGS cannot be extended further. Do the
same for zerocopy frags as non-zerocopy frags and set the PSH bit.
This improves GRO assembly.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/tcp.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 67d39b79c801..44102484a76f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1371,8 +1371,10 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			pfrag->offset += copy;
 		} else {
 			err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg);
-			if (err == -EMSGSIZE || err == -EEXIST)
+			if (err == -EMSGSIZE || err == -EEXIST) {
+				tcp_mark_push(tp, skb);
 				goto new_segment;
+			}
 			if (err < 0)
 				goto do_error;
 			copy = err;
-- 
2.15.1.620.gb9897f4670-goog

^ permalink raw reply related

* [PATCH net-next 1/4] skbuff: in skb_segment, call zerocopy functions once per nskb
From: Willem de Bruijn @ 2017-12-23  0:00 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn
In-Reply-To: <20171223000020.55509-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

This is a net-next follow-up to commit 268b79067942 ("skbuff: orphan
frags before zerocopy clone"), which fixed a bug in net, but added a
call to skb_zerocopy_clone at each frag to do so.

When segmenting skbs with user frags, either the user frags must be
replaced with private copies and uarg released, or the uarg must have
its refcount increased for each new skb.

skb_orphan_frags does the first, except for cases that can handle
reference counting. skb_zerocopy_clone then does the second.

Call these once per nskb, instead of once per frag.

That is, in the common case. With a frag list, also refresh when the
origin skb (frag_skb) changes.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/core/skbuff.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a3cb0be4c6f3..00b0757830e2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3656,6 +3656,10 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 		skb_shinfo(nskb)->tx_flags |= skb_shinfo(head_skb)->tx_flags &
 					      SKBTX_SHARED_FRAG;
 
+		if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
+		    skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC))
+			goto err;
+
 		while (pos < offset + len) {
 			if (i >= nfrags) {
 				BUG_ON(skb_headlen(list_skb));
@@ -3667,6 +3671,11 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 				BUG_ON(!nfrags);
 
+				if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
+				    skb_zerocopy_clone(nskb, frag_skb,
+						       GFP_ATOMIC))
+					goto err;
+
 				list_skb = list_skb->next;
 			}
 
@@ -3678,11 +3687,6 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 				goto err;
 			}
 
-			if (unlikely(skb_orphan_frags(frag_skb, GFP_ATOMIC)))
-				goto err;
-			if (skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC))
-				goto err;
-
 			*nskb_frag = *frag;
 			__skb_frag_ref(nskb_frag);
 			size = skb_frag_size(nskb_frag);
-- 
2.15.1.620.gb9897f4670-goog

^ permalink raw reply related

* [PATCH net-next 0/4] zerocopy refinements
From: Willem de Bruijn @ 2017-12-23  0:00 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

1/4 is a small optimization follow-up to the earlier fix to skb_segment:
    check skb state once per skb, instead of once per frag.
2/4 makes behavior more consistent between standard and zerocopy send:
    set the PSH bit when hitting MAX_SKB_FRAGS. This helps GRO.
3/4 resolves a surprising inconsistency in notification:
    because small packets were not stored in frags, they would not set
    the copied error code over loopback. This change also optimizes
    the path by removing copying and making tso_fragment cheaper.
4/4 follows-up to 3/4 by no longer allocated now unused memory.
    this was actually already in RFC patches, but dropped as I pared
    down the patch set during revisions.
   
Willem de Bruijn (4):
  skbuff: in skb_segment, call zerocopy functions once per nskb
  tcp: push full zerocopy packets
  tcp: place all zerocopy payload in frags
  tcp: do not allocate linear memory for zerocopy skbs

 net/core/skbuff.c | 14 +++++++++-----
 net/ipv4/tcp.c    | 24 +++++++++++++++---------
 2 files changed, 24 insertions(+), 14 deletions(-)

-- 
2.15.1.620.gb9897f4670-goog

^ permalink raw reply

* Re: [PATCH RFC 13/18] r8168: replace speed_down with genphy_restart_aneg
From: Heiner Kallweit @ 2017-12-22 23:36 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Realtek linux nic maintainers, Chun-Hao Lin, David Miller,
	netdev@vger.kernel.org
In-Reply-To: <20171222101441.GK2431@lunn.ch>

Am 22.12.2017 um 11:14 schrieb Andrew Lunn:
> On Thu, Dec 21, 2017 at 09:50:39PM +0100, Heiner Kallweit wrote:
>> Dealing with link partner abilities is handled by phylib, so let's
>> just trigger autonegotiation here.
>>
>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>> ---
>>  drivers/net/ethernet/realtek/r8168.c | 26 +-------------------------
>>  1 file changed, 1 insertion(+), 25 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/realtek/r8168.c b/drivers/net/ethernet/realtek/r8168.c
>> index d33f93a31..6b398915f 100644
>> --- a/drivers/net/ethernet/realtek/r8168.c
>> +++ b/drivers/net/ethernet/realtek/r8168.c
>> @@ -4360,30 +4360,6 @@ static void rtl_init_mdio_ops(struct rtl8168_private *tp)
>>  	}
>>  }
>>  
>> -static void rtl_speed_down(struct rtl8168_private *tp)
>> -{
>> -	u32 adv;
>> -	int lpa;
>> -
>> -	rtl_writephy(tp, 0x1f, 0x0000);
>> -	lpa = rtl_readphy(tp, MII_LPA);
>> -
>> -	if (lpa & (LPA_10HALF | LPA_10FULL))
>> -		adv = ADVERTISED_10baseT_Half | ADVERTISED_10baseT_Full;
>> -	else if (lpa & (LPA_100HALF | LPA_100FULL))
>> -		adv = ADVERTISED_10baseT_Half | ADVERTISED_10baseT_Full |
>> -		      ADVERTISED_100baseT_Half | ADVERTISED_100baseT_Full;
>> -	else
>> -		adv = ADVERTISED_10baseT_Half | ADVERTISED_10baseT_Full |
>> -		      ADVERTISED_100baseT_Half | ADVERTISED_100baseT_Full |
>> -		      (tp->mii.supports_gmii ?
>> -		       ADVERTISED_1000baseT_Half |
>> -		       ADVERTISED_1000baseT_Full : 0);
>> -
>> -	rtl8168_set_speed(tp->dev, AUTONEG_ENABLE, SPEED_1000, DUPLEX_FULL,
>> -			  adv);
>> -}
>> -
>>  static void rtl_wol_suspend_quirk(struct rtl8168_private *tp)
>>  {
>>  	void __iomem *ioaddr = tp->mmio_addr;
>> @@ -4424,7 +4400,7 @@ static bool rtl_wol_pll_power_down(struct rtl8168_private *tp)
>>  	if (!(__rtl8168_get_wol(tp) & WAKE_ANY))
>>  		return false;
>>  
>> -	rtl_speed_down(tp);
>> +	genphy_restart_aneg(tp->dev->phydev);
>>  	rtl_wol_suspend_quirk(tp);
>>  
>>  	return true;
> 
> I'm not too clear what is going on here? Is this suspend while WOL is
> enabled? There should be no need to change the PHY settings. The PHY
> driver should leave the PHY running in whatever state it was
> configured to. The only danger here is that the MAC driver has called
> phy_stop() during suspend. That should not be done when WOL is
> enabled.
> 
> Is the wol being passed to the phylib? phy_ethtool_set_wol() and
> phy_ethtool_get_wol()?
> 
I also have a hard time to understand what this is good for.
Here's the mail thread regarding introduction of this function:
https://lkml.org/lkml/2013/4/2/669

If I understand this correctly it's about fixing an issue when
aneg is disabled and the PHY automatically changes the speed.

In this case we may have to use genphy_config_aneg here which
calls genphy_setup_forced if aneg is disabled.

> 	Andrew
> 

^ permalink raw reply

* Re: [PATCH v2] openvswitch: Trim off padding before L3+ netfilter processing
From: Pravin Shelar @ 2017-12-22 23:31 UTC (permalink / raw)
  To: Ed Swierk
  Cc: ovs-dev, Linux Kernel Network Developers, Benjamin Warren,
	Keith Holleman
In-Reply-To: <1513869437-20059-1-git-send-email-eswierk@skyportsystems.com>

On Thu, Dec 21, 2017 at 7:17 AM, Ed Swierk <eswierk@skyportsystems.com> wrote:
> IPv4 and IPv6 packets may arrive with lower-layer padding that is not
> included in the L3 length. For example, a short IPv4 packet may have
> up to 6 bytes of padding following the IP payload when received on an
> Ethernet device. In the normal IPv4 receive path, ip_rcv() trims the
> packet to ip_hdr->tot_len before invoking netfilter hooks (including
> conntrack and nat).
>
> In the IPv6 receive path, ip6_rcv() does the same using
> ipv6_hdr->payload_len. Similarly in the br_netfilter receive path,
> br_validate_ipv4() and br_validate_ipv6() trim the packet to the L3
> length before invoking NF_INET_PRE_ROUTING hooks.
>
> In the OVS conntrack receive path, ovs_ct_execute() pulls the skb to
> the L3 header but does not trim it to the L3 length before calling
> nf_conntrack_in(NF_INET_PRE_ROUTING). When nf_conntrack_proto_tcp
> encounters a packet with lower-layer padding, nf_checksum() fails and
> logs "nf_ct_tcp: bad TCP checksum". While extra zero bytes don't
> affect the checksum, the length in the IP pseudoheader does. That
> length is based on skb->len, and without trimming, it doesn't match
> the length the sender used when computing the checksum.
>
> The assumption throughout nf_conntrack and nf_nat is that skb->len
> reflects the length of the L3 header and payload, so there is no need
> to refer back to ip_hdr->tot_len or ipv6_hdr->payload_len.
>
> This change brings OVS into line with other netfilter users, trimming
> IPv4 and IPv6 packets prior to L3+ netfilter processing.
>
> Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
> ---
> v2:
> - Trim packet in nat receive path as well as conntrack
> - Free skb on error
> ---
>  net/openvswitch/conntrack.c | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index b27c5c6..1bdc78f 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -703,6 +703,33 @@ static bool skb_nfct_cached(struct net *net,
>         return ct_executed;
>  }
>
> +/* Trim the skb to the L3 length. Assumes the skb is already pulled to
> + * the L3 header. The skb is freed on error.
> + */
> +static int skb_trim_l3(struct sk_buff *skb)
> +{
> +       unsigned int nh_len;
> +       int err;
> +
> +       switch (skb->protocol) {
> +       case htons(ETH_P_IP):
> +               nh_len = ntohs(ip_hdr(skb)->tot_len);
> +               break;
> +       case htons(ETH_P_IPV6):
> +               nh_len = ntohs(ipv6_hdr(skb)->payload_len)
> +                       + sizeof(struct ipv6hdr);
> +               break;
> +       default:
> +               nh_len = skb->len;
> +       }
> +
> +       err = pskb_trim_rcsum(skb, nh_len);
> +       if (err)
This should is unlikely.
> +               kfree_skb(skb);
> +
> +       return err;
> +}
> +
This looks like a generic function, it probably does not belong to OVS
code base.

>  #ifdef CONFIG_NF_NAT_NEEDED
>  /* Modelled after nf_nat_ipv[46]_fn().
>   * range is only used for new, uninitialized NAT state.
> @@ -715,8 +742,12 @@ static int ovs_ct_nat_execute(struct sk_buff *skb, struct nf_conn *ct,
>  {
>         int hooknum, nh_off, err = NF_ACCEPT;
>
> +       /* The nat module expects to be working at L3. */
>         nh_off = skb_network_offset(skb);
>         skb_pull_rcsum(skb, nh_off);
> +       err = skb_trim_l3(skb);
> +       if (err)
> +               return err;
>
ct-nat is executed within ct action, so I do not see why you you call
skb-trim again from ovs_ct_nat_execute().
ovs_ct_execute() trim should take care of the skb.

>         /* See HOOK2MANIP(). */
>         if (maniptype == NF_NAT_MANIP_SRC)
> @@ -1111,6 +1142,9 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb,
>         /* The conntrack module expects to be working at L3. */
>         nh_ofs = skb_network_offset(skb);
>         skb_pull_rcsum(skb, nh_ofs);
> +       err = skb_trim_l3(skb);
> +       if (err)
> +               return err;
>
>         if (key->ip.frag != OVS_FRAG_TYPE_NONE) {
>                 err = handle_fragments(net, key, info->zone.id, skb);
> --
> 1.9.1
>

^ permalink raw reply

* Re: [PATCH v2] openvswitch: Trim off padding before L3+ netfilter processing
From: Pravin Shelar @ 2017-12-22 23:31 UTC (permalink / raw)
  To: Ed Swierk
  Cc: ovs-dev, Linux Kernel Network Developers, Benjamin Warren,
	Keith Holleman
In-Reply-To: <1513869437-20059-1-git-send-email-eswierk@skyportsystems.com>

On Thu, Dec 21, 2017 at 7:17 AM, Ed Swierk <eswierk@skyportsystems.com> wrote:
> IPv4 and IPv6 packets may arrive with lower-layer padding that is not
> included in the L3 length. For example, a short IPv4 packet may have
> up to 6 bytes of padding following the IP payload when received on an
> Ethernet device. In the normal IPv4 receive path, ip_rcv() trims the
> packet to ip_hdr->tot_len before invoking netfilter hooks (including
> conntrack and nat).
>
> In the IPv6 receive path, ip6_rcv() does the same using
> ipv6_hdr->payload_len. Similarly in the br_netfilter receive path,
> br_validate_ipv4() and br_validate_ipv6() trim the packet to the L3
> length before invoking NF_INET_PRE_ROUTING hooks.
>
> In the OVS conntrack receive path, ovs_ct_execute() pulls the skb to
> the L3 header but does not trim it to the L3 length before calling
> nf_conntrack_in(NF_INET_PRE_ROUTING). When nf_conntrack_proto_tcp
> encounters a packet with lower-layer padding, nf_checksum() fails and
> logs "nf_ct_tcp: bad TCP checksum". While extra zero bytes don't
> affect the checksum, the length in the IP pseudoheader does. That
> length is based on skb->len, and without trimming, it doesn't match
> the length the sender used when computing the checksum.
>
> The assumption throughout nf_conntrack and nf_nat is that skb->len
> reflects the length of the L3 header and payload, so there is no need
> to refer back to ip_hdr->tot_len or ipv6_hdr->payload_len.
>
> This change brings OVS into line with other netfilter users, trimming
> IPv4 and IPv6 packets prior to L3+ netfilter processing.
>
> Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
> ---
> v2:
> - Trim packet in nat receive path as well as conntrack
> - Free skb on error
> ---
>  net/openvswitch/conntrack.c | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index b27c5c6..1bdc78f 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -703,6 +703,33 @@ static bool skb_nfct_cached(struct net *net,
>         return ct_executed;
>  }
>
> +/* Trim the skb to the L3 length. Assumes the skb is already pulled to
> + * the L3 header. The skb is freed on error.
> + */
> +static int skb_trim_l3(struct sk_buff *skb)
> +{
> +       unsigned int nh_len;
> +       int err;
> +
> +       switch (skb->protocol) {
> +       case htons(ETH_P_IP):
> +               nh_len = ntohs(ip_hdr(skb)->tot_len);
> +               break;
> +       case htons(ETH_P_IPV6):
> +               nh_len = ntohs(ipv6_hdr(skb)->payload_len)
> +                       + sizeof(struct ipv6hdr);
> +               break;
> +       default:
> +               nh_len = skb->len;
> +       }
> +
> +       err = pskb_trim_rcsum(skb, nh_len);
> +       if (err)
This should is unlikely.
> +               kfree_skb(skb);
> +
> +       return err;
> +}
> +
This looks like a generic function, it probably does not belong to OVS
code base.

>  #ifdef CONFIG_NF_NAT_NEEDED
>  /* Modelled after nf_nat_ipv[46]_fn().
>   * range is only used for new, uninitialized NAT state.
> @@ -715,8 +742,12 @@ static int ovs_ct_nat_execute(struct sk_buff *skb, struct nf_conn *ct,
>  {
>         int hooknum, nh_off, err = NF_ACCEPT;
>
> +       /* The nat module expects to be working at L3. */
>         nh_off = skb_network_offset(skb);
>         skb_pull_rcsum(skb, nh_off);
> +       err = skb_trim_l3(skb);
> +       if (err)
> +               return err;
>
ct-nat is executed within ct action, so I do not see why you you call
skb-trim again from ovs_ct_nat_execute().
ovs_ct_execute() trim should take care of the skb.

>         /* See HOOK2MANIP(). */
>         if (maniptype == NF_NAT_MANIP_SRC)
> @@ -1111,6 +1142,9 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb,
>         /* The conntrack module expects to be working at L3. */
>         nh_ofs = skb_network_offset(skb);
>         skb_pull_rcsum(skb, nh_ofs);
> +       err = skb_trim_l3(skb);
> +       if (err)
> +               return err;
>
>         if (key->ip.frag != OVS_FRAG_TYPE_NONE) {
>                 err = handle_fragments(net, key, info->zone.id, skb);
> --
> 1.9.1
>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox