Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] sctp: make sctp_setsockopt_events() less strict about the option length
From: David Miller @ 2019-02-09 23:12 UTC (permalink / raw)
  To: marcelo.leitner
  Cc: julien, netdev, linux-sctp, linux-kernel, nhorman, vyasevich,
	lucien.xin
In-Reply-To: <20190206203754.GC13621@localhost.localdomain>

From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Wed, 6 Feb 2019 18:37:54 -0200

> On Wed, Feb 06, 2019 at 12:14:30PM -0800, Julien Gomes wrote:
>> Make sctp_setsockopt_events() able to accept sctp_event_subscribe
>> structures longer than the current definitions.
>> 
>> This should prevent unjustified setsockopt() failures due to struct
>> sctp_event_subscribe extensions (as in 4.11 and 4.12) when using
>> binaries that should be compatible, but were built with later kernel
>> uapi headers.
> 
> Not sure if we support backwards compatibility like this?

What a complete mess we have here.

Use new socket option numbers next time, do not change the size and/or
layout of existing socket options.

This whole thread, if you read it, is basically "if we compatability
this way, that breaks, and if we do compatability this other way oh
shit this other thing doesn't work."

I think we really need to specifically check for the difference sizes
that existed one by one, clear out the part not given by the user, and
backport this as far back as possible in a way that in the older kernels
we see if the user is actually trying to use the new features and if so
error out.

Which, btw, is terrible behavior.  Newly compiled apps should work on
older kernels if they don't try to use the new features, and if they
can the ones that want to try to use the new features should be able
to fall back when that feature isn't available in a non-ambiguous
and precisely defined way.

The fact that the use of the new feature is hidden in the new
structure elements is really rotten.

This patch, at best, needs some work and definitely a longer and more
detailed commit message.

^ permalink raw reply

* Re: [PATCH bpf] bpf: Fix narrow load on a bpf_sock returned from sk_lookup()
From: Joe Stringer @ 2019-02-10  1:47 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Joe Stringer
In-Reply-To: <20190209062554.142612-1-kafai@fb.com>

On Fri, 8 Feb 2019 at 22:27, Martin KaFai Lau <kafai@fb.com> wrote:
>
> By adding this test to test_verifier:
> {
>         "reference tracking: access sk->src_ip4 (narrow load)",
>         .insns = {
>         BPF_SK_LOOKUP,
>         BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
>         BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
>         BPF_LDX_MEM(BPF_H, BPF_REG_2, BPF_REG_0, offsetof(struct bpf_sock, src_ip4) + 2),
>         BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
>         BPF_EMIT_CALL(BPF_FUNC_sk_release),
>         BPF_EXIT_INSN(),
>         },
>         .prog_type = BPF_PROG_TYPE_SCHED_CLS,
>         .result = ACCEPT,
> },
>
> The above test loads 2 bytes from sk->src_ip4 where
> sk is obtained by bpf_sk_lookup_tcp().
>
> It hits an internal verifier error from convert_ctx_accesses():
> [root@arch-fb-vm1 bpf]# ./test_verifier 665 665
> Failed to load prog 'Invalid argument'!
> 0: (b7) r2 = 0
> 1: (63) *(u32 *)(r10 -8) = r2
> 2: (7b) *(u64 *)(r10 -16) = r2
> 3: (7b) *(u64 *)(r10 -24) = r2
> 4: (7b) *(u64 *)(r10 -32) = r2
> 5: (7b) *(u64 *)(r10 -40) = r2
> 6: (7b) *(u64 *)(r10 -48) = r2
> 7: (bf) r2 = r10
> 8: (07) r2 += -48
> 9: (b7) r3 = 36
> 10: (b7) r4 = 0
> 11: (b7) r5 = 0
> 12: (85) call bpf_sk_lookup_tcp#84
> 13: (bf) r6 = r0
> 14: (15) if r0 == 0x0 goto pc+3
>  R0=sock(id=1,off=0,imm=0) R6=sock(id=1,off=0,imm=0) R10=fp0,call_-1 fp-8=????0000 fp-16=0000mmmm fp-24=mmmmmmmm fp-32=mmmmmmmm fp-40=mmmmmmmm fp-48=mmmmmmmm refs=1
> 15: (69) r2 = *(u16 *)(r0 +26)
> 16: (bf) r1 = r6
> 17: (85) call bpf_sk_release#86
> 18: (95) exit
>
> from 14 to 18: safe
> processed 20 insns (limit 131072), stack depth 48
> bpf verifier is misconfigured
> Summary: 0 PASSED, 0 SKIPPED, 1 FAILED
>
> The bpf_sock_is_valid_access() is expecting src_ip4 can be narrowly
> loaded (meaning load any 1 or 2 bytes of the src_ip4) by
> marking info->ctx_field_size.  However, this marked
> ctx_field_size is not used.  This patch fixes it.
>
> Due to the recent refactoring in test_verifier,
> this new test will be added to the bpf-next branch
> (together with the bpf_tcp_sock patchset)
> to avoid merge conflict.
>
> Fixes: c64b7983288e ("bpf: Add PTR_TO_SOCKET verifier type")
> Cc: Joe Stringer <joe@wand.net.nz>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---

Nice find, thanks.

Acked-by: Joe Stringer <joe@wand.net.nz>

^ permalink raw reply

* Re: [PATCH net-next v2] add snmp counter document
From: David Miller @ 2019-02-10  3:00 UTC (permalink / raw)
  To: yupeng0921; +Cc: netdev, xiyou.wangcong, rdunlap
In-Reply-To: <20190209224618.6127-1-yupeng0921@gmail.com>

From: yupeng <yupeng0921@gmail.com>
Date: Sat,  9 Feb 2019 14:46:18 -0800

> add document for tcp retransmission, tcp fast open, syn cookies,
> challenge ack, prune and several general counters
> 
> Signed-off-by: yupeng <yupeng0921@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net/packet: fix 4gb buffer limit due to overflow check
From: David Miller @ 2019-02-10  3:01 UTC (permalink / raw)
  To: kal.conley
  Cc: willemb, edumazet, gregkh, jeffrey.t.kirsher, alexander.h.duyck,
	ktkhai, vincent.whitchurch, lirongqing, magnus.karlsson, netdev,
	linux-kernel
In-Reply-To: <20190209203701.27252-1-kal.conley@dectris.com>

From: Kal Conley <kal.conley@dectris.com>
Date: Sat,  9 Feb 2019 21:37:00 +0100

> When calculating rb->frames_per_block * req->tp_block_nr the result
> can overflow. Check it for overflow without limiting the total buffer
> size to UINT_MAX.
> 
> This change fixes support for packet ring buffers >= UINT_MAX.

Please resubmit with a proper signoff and also an appropriate Fixes:
tag.

^ permalink raw reply

* Re: [PATCH net-next] net: dsa: mv88e6xxx: SERDES support 2500BaseT via external PHY
From: David Miller @ 2019-02-10  3:02 UTC (permalink / raw)
  To: hkallweit1; +Cc: andrew, f.fainelli, vivien.didelot, netdev
In-Reply-To: <aed891fb-8568-2291-6050-18b6e2fd12e8@gmail.com>

From: Heiner Kallweit <hkallweit1@gmail.com>
Date: Fri, 8 Feb 2019 22:25:44 +0100

> From: Andrew Lunn <andrew@lunn.ch>
> By using an external PHY, ports 9 and 10 can support 2500BaseT.
> So set this link mode in the mask when validating.
> 
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net: phy: aquantia: add support for AQCS109
From: David Miller @ 2019-02-10  3:03 UTC (permalink / raw)
  To: hkallweit1; +Cc: andrew, f.fainelli, netdev, nikita.yoush
In-Reply-To: <23f602a2-1b35-a504-a634-2b352d8fad89@gmail.com>

From: Heiner Kallweit <hkallweit1@gmail.com>
Date: Fri, 8 Feb 2019 20:56:55 +0100

> Add support for the AQCS109. From software point of view,
> it should be almost equivalent to AQR107.
> 
> Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH bpf] bpf: Fix narrow load on a bpf_sock returned from sk_lookup()
From: Alexei Starovoitov @ 2019-02-10  4:02 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Joe Stringer
In-Reply-To: <20190209062554.142612-1-kafai@fb.com>

On Fri, Feb 08, 2019 at 10:25:54PM -0800, Martin KaFai Lau wrote:
> By adding this test to test_verifier:
> {
> 	"reference tracking: access sk->src_ip4 (narrow load)",
> 	.insns = {
> 	BPF_SK_LOOKUP,
> 	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
> 	BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
> 	BPF_LDX_MEM(BPF_H, BPF_REG_2, BPF_REG_0, offsetof(struct bpf_sock, src_ip4) + 2),
> 	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
> 	BPF_EMIT_CALL(BPF_FUNC_sk_release),
> 	BPF_EXIT_INSN(),
> 	},
> 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
> 	.result = ACCEPT,
> },
> 
> The above test loads 2 bytes from sk->src_ip4 where
> sk is obtained by bpf_sk_lookup_tcp().
> 
> It hits an internal verifier error from convert_ctx_accesses():
> [root@arch-fb-vm1 bpf]# ./test_verifier 665 665
> Failed to load prog 'Invalid argument'!
> 0: (b7) r2 = 0
> 1: (63) *(u32 *)(r10 -8) = r2
> 2: (7b) *(u64 *)(r10 -16) = r2
> 3: (7b) *(u64 *)(r10 -24) = r2
> 4: (7b) *(u64 *)(r10 -32) = r2
> 5: (7b) *(u64 *)(r10 -40) = r2
> 6: (7b) *(u64 *)(r10 -48) = r2
> 7: (bf) r2 = r10
> 8: (07) r2 += -48
> 9: (b7) r3 = 36
> 10: (b7) r4 = 0
> 11: (b7) r5 = 0
> 12: (85) call bpf_sk_lookup_tcp#84
> 13: (bf) r6 = r0
> 14: (15) if r0 == 0x0 goto pc+3
>  R0=sock(id=1,off=0,imm=0) R6=sock(id=1,off=0,imm=0) R10=fp0,call_-1 fp-8=????0000 fp-16=0000mmmm fp-24=mmmmmmmm fp-32=mmmmmmmm fp-40=mmmmmmmm fp-48=mmmmmmmm refs=1
> 15: (69) r2 = *(u16 *)(r0 +26)
> 16: (bf) r1 = r6
> 17: (85) call bpf_sk_release#86
> 18: (95) exit
> 
> from 14 to 18: safe
> processed 20 insns (limit 131072), stack depth 48
> bpf verifier is misconfigured
> Summary: 0 PASSED, 0 SKIPPED, 1 FAILED
> 
> The bpf_sock_is_valid_access() is expecting src_ip4 can be narrowly
> loaded (meaning load any 1 or 2 bytes of the src_ip4) by
> marking info->ctx_field_size.  However, this marked
> ctx_field_size is not used.  This patch fixes it.
> 
> Due to the recent refactoring in test_verifier,
> this new test will be added to the bpf-next branch
> (together with the bpf_tcp_sock patchset)
> to avoid merge conflict.
> 
> Fixes: c64b7983288e ("bpf: Add PTR_TO_SOCKET verifier type")
> Cc: Joe Stringer <joe@wand.net.nz>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Applied to bpf tree.

Martin, if your is_fullsock work depends on it, I can apply the fix
to bpf-next as well. Just let me know.


^ permalink raw reply

* Re: [PATCH net-next 10/16] rocker: Handle SWITCHDEV_PORT_ATTR_SET
From: Florian Fainelli @ 2019-02-10  4:20 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, David S. Miller, Ido Schimmel, open list,
	open list:STAGING SUBSYSTEM, moderated list:ETHERNET BRIDGE, jiri,
	andrew, vivien.didelot
In-Reply-To: <20190209182147.GQ2353@nanopsycho>

Le 2/9/19 à 10:21 AM, Jiri Pirko a écrit :
> Sat, Feb 09, 2019 at 01:32:42AM CET, f.fainelli@gmail.com wrote:
>> Following patches will change the way we communicate getting or setting
> 
> Just "setting", no "getting".
> 
> 
>> a port's attribute and use a blocking notifier to perform those tasks.
>>
>> Prepare rocker to support receiving notifier events targeting
>> SWITCHDEV_PORT_ATTR_SET and simply translate that into the existing
>> rocker_port_attr_set call.
>>
>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>> ---
>> drivers/net/ethernet/rocker/rocker_main.c | 20 ++++++++++++++++++++
>> 1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
>> index ff3f14504f4f..f10e4888ecff 100644
>> --- a/drivers/net/ethernet/rocker/rocker_main.c
>> +++ b/drivers/net/ethernet/rocker/rocker_main.c
>> @@ -2811,6 +2811,24 @@ rocker_switchdev_port_obj_event(unsigned long event, struct net_device *netdev,
>> 	return notifier_from_errno(err);
>> }
>>
>> +static int
>> +rocker_switchdev_port_attr_event(unsigned long event, struct net_device *netdev,
>> +				 struct switchdev_notifier_port_attr_info
>> +				 *port_attr_info)
>> +{
>> +	int err = -EOPNOTSUPP;
>> +
>> +	switch (event) {
>> +	case SWITCHDEV_PORT_ATTR_SET:
> 
> Do you expect some other event to be handled in
> rocker_switchdev_port_attr_event()? Because you have SWITCHDEV_PORT_ATTR_SET
> selected in case here and in rocker_switchdev_blocking_event.
> Perhaps you can rename rocker_switchdev_port_attr_event() to
> rocker_switchdev_port_attr_set_event() and avoid this switchcase.

That's a good point, I have PORT_ATTR_{GET_SET} initially which was the
reason for the switch/case statement, will change it according to your
suggestion. Thanks!
-- 
Florian

^ permalink raw reply

* [RFC] apparently bogus logics in unix_find_other() since 2002
From: Al Viro @ 2019-02-10  4:24 UTC (permalink / raw)
  To: netdev; +Cc: Solar Designer, David Miller

	In "net/unix/af_unix.c: Set ATIME on socket inode" (back in
2002) we'd grown something rather odd in unix_find_other().  In the
original patch it was
                u=unix_find_socket_byname(sunname, len, type, hash);
-               if (!u)
+               if (u) {
+                       struct dentry *dentry;
+                       dentry = u->protinfo.af_unix.dentry;
+                       if (dentry)
+                               UPDATE_ATIME(dentry->d_inode);
+               } else
                        goto fail;

These days the code is

                u = unix_find_socket_byname(net, sunname, len, type, hash);
                if (u) {
                        struct dentry *dentry;
                        dentry = unix_sk(u)->path.dentry;
                        if (dentry)
                                touch_atime(&unix_sk(u)->path);
                } else  
                        goto fail;

but the logics is the same.  It's the abstract address case - we have
'\0' in sunname->sun_path[0].  How in hell could that possibly have
non-NULL ->path.dentry and what would it be?

Note that unix_find_socket_byname() returns non-NULL u here only if
u->addr->name->sun_path[0] is equal to sunname->sun_path[0], i.e.
'\0'.  There are only two places where we ever might assign non-NULL
to ->path.dentry:
        if (sun_path[0]) {
                addr->hash = UNIX_HASH_SIZE;
                hash = d_backing_inode(path.dentry)->i_ino & (UNIX_HASH_SIZE - 1);
                spin_lock(&unix_table_lock);
                u->path = path;
                list = &unix_socket_table[hash];
        } else {
in unix_bind() (in which case u->addr->name->sun_path[0] will not be '\0') and
        /* copy address information from listening to new sock*/
        if (otheru->addr) {
                refcount_inc(&otheru->addr->refcnt);
                newu->addr = otheru->addr;
        }
        if (otheru->path.dentry) {
                path_get(&otheru->path);
                newu->path = otheru->path;
        }
in unix_stream_connect().  And once ->addr is non-NULL, it's never changed,
and ->addr->name contents is never modified afterwards.  So we would have
to had the same condition (non-NULL ->path.dentry with '\0' in
->addr->name->sun_path[0]) at earlier point on the listener socket.

Looks like that should be impossible; what am I missing here?  Incidentally,
how can the quoted fragment in in unix_stream_connect() be reached with NULL
otheru->addr?  After all, otheru is unix_sock of a listener; how could
we possibly have found it if it had NULL ->addr?

Confused...

^ permalink raw reply

* Re: [PATCH bpf] bpf: Fix narrow load on a bpf_sock returned from sk_lookup()
From: Martin Lau @ 2019-02-10  7:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev@vger.kernel.org, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Joe Stringer
In-Reply-To: <20190210040241.wtsldfavw2vk3afv@ast-mbp>

On Sat, Feb 09, 2019 at 08:02:43PM -0800, Alexei Starovoitov wrote:
> On Fri, Feb 08, 2019 at 10:25:54PM -0800, Martin KaFai Lau wrote:
> > By adding this test to test_verifier:
> > {
> > 	"reference tracking: access sk->src_ip4 (narrow load)",
> > 	.insns = {
> > 	BPF_SK_LOOKUP,
> > 	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
> > 	BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
> > 	BPF_LDX_MEM(BPF_H, BPF_REG_2, BPF_REG_0, offsetof(struct bpf_sock, src_ip4) + 2),
> > 	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
> > 	BPF_EMIT_CALL(BPF_FUNC_sk_release),
> > 	BPF_EXIT_INSN(),
> > 	},
> > 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
> > 	.result = ACCEPT,
> > },
> > 
> > The above test loads 2 bytes from sk->src_ip4 where
> > sk is obtained by bpf_sk_lookup_tcp().
> > 
> > It hits an internal verifier error from convert_ctx_accesses():
> > [root@arch-fb-vm1 bpf]# ./test_verifier 665 665
> > Failed to load prog 'Invalid argument'!
> > 0: (b7) r2 = 0
> > 1: (63) *(u32 *)(r10 -8) = r2
> > 2: (7b) *(u64 *)(r10 -16) = r2
> > 3: (7b) *(u64 *)(r10 -24) = r2
> > 4: (7b) *(u64 *)(r10 -32) = r2
> > 5: (7b) *(u64 *)(r10 -40) = r2
> > 6: (7b) *(u64 *)(r10 -48) = r2
> > 7: (bf) r2 = r10
> > 8: (07) r2 += -48
> > 9: (b7) r3 = 36
> > 10: (b7) r4 = 0
> > 11: (b7) r5 = 0
> > 12: (85) call bpf_sk_lookup_tcp#84
> > 13: (bf) r6 = r0
> > 14: (15) if r0 == 0x0 goto pc+3
> >  R0=sock(id=1,off=0,imm=0) R6=sock(id=1,off=0,imm=0) R10=fp0,call_-1 fp-8=????0000 fp-16=0000mmmm fp-24=mmmmmmmm fp-32=mmmmmmmm fp-40=mmmmmmmm fp-48=mmmmmmmm refs=1
> > 15: (69) r2 = *(u16 *)(r0 +26)
> > 16: (bf) r1 = r6
> > 17: (85) call bpf_sk_release#86
> > 18: (95) exit
> > 
> > from 14 to 18: safe
> > processed 20 insns (limit 131072), stack depth 48
> > bpf verifier is misconfigured
> > Summary: 0 PASSED, 0 SKIPPED, 1 FAILED
> > 
> > The bpf_sock_is_valid_access() is expecting src_ip4 can be narrowly
> > loaded (meaning load any 1 or 2 bytes of the src_ip4) by
> > marking info->ctx_field_size.  However, this marked
> > ctx_field_size is not used.  This patch fixes it.
> > 
> > Due to the recent refactoring in test_verifier,
> > this new test will be added to the bpf-next branch
> > (together with the bpf_tcp_sock patchset)
> > to avoid merge conflict.
> > 
> > Fixes: c64b7983288e ("bpf: Add PTR_TO_SOCKET verifier type")
> > Cc: Joe Stringer <joe@wand.net.nz>
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> 
> Applied to bpf tree.
Thanks!

> 
> Martin, if your is_fullsock work depends on it, I can apply the fix
> to bpf-next as well. Just let me know.
Yes, the is_fullsock work depends on it.
I should have mentioned it in this commit log.

^ permalink raw reply

* [PATCH v2 bpf-next 0/7] Add __sk_buff->sk, bpf_tcp_sock, BPF_FUNC_tcp_sock and BPF_FUNC_sk_fullsock
From: Martin KaFai Lau @ 2019-02-10  7:22 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo

This series adds __sk_buff->sk, "struct bpf_tcp_sock",
BPF_FUNC_sk_fullsock and BPF_FUNC_tcp_sock.  Together, they provide
a common way to expose the members of "struct tcp_sock" and
"struct bpf_sock" for the bpf_prog to access.

The patch series first adds a bpf_sock pointer to __sk_buff
and a new helper BPF_FUNC_sk_fullsock.

It then adds BPF_FUNC_tcp_sock to get a bpf_tcp_sock
pointer from a bpf_sock pointer.

The current use case is to allow a cg_skb_bpf_prog to provide
per cgroup traffic policing/shaping.

Please see individual patch for details.

v2:
- Patch 1 depends on
  commit d623876646be ("bpf: Fix narrow load on a bpf_sock returned from sk_lookup()")
  in the bpf branch.
- Add sk_to_full_sk() to bpf_sk_fullsock() and bpf_tcp_sock()
  such that there is a way to access the listener's sk and tcp_sk
  when __sk_buff->sk is a request_sock.
  The comments in the uapi bpf.h is updated accordingly.
- bpf_ctx_range_till() is used in bpf_sock_common_is_valid_access()
  in patch 1.  Saved a few lines.
- Patch 2 is new in v2 and it adds "state", "dst_ip4", "dst_ip6" and
  "dst_port" to the bpf_sock.  Narrow load is allowed on them.
  The "state" (i.e. sk_state) has already been used in
  INET_DIAG (e.g. ss -t) and getsockopt(TCP_INFO).
- While at it in the new patch 2, also allow narrow load on some
  existing fields of the bpf_sock, which are "family", "type", "protocol"
  and "src_port".  Only allow loading from first byte for now.
  i.e. does not allow narrow load starting from the 2nd byte.
- Add some narrow load tests to the test_verifier's sock.c

Martin KaFai Lau (7):
  bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
  bpf: Add state, dst_ip4, dst_ip6 and dst_port to bpf_sock
  bpf: Refactor sock_ops_convert_ctx_access
  bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock
  bpf: Sync bpf.h to tools/
  bpf: Add skb->sk, bpf_sk_fullsock and bpf_tcp_sock tests to
    test_verifer
  bpf: Add test_sock_fields for skb->sk and bpf_tcp_sock

 include/linux/bpf.h                           |  42 ++
 include/uapi/linux/bpf.h                      |  72 ++-
 kernel/bpf/verifier.c                         | 159 ++++--
 net/core/filter.c                             | 495 +++++++++++-------
 tools/include/uapi/linux/bpf.h                |  72 ++-
 tools/testing/selftests/bpf/Makefile          |   6 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   4 +
 tools/testing/selftests/bpf/bpf_util.h        |   9 +
 .../testing/selftests/bpf/test_sock_fields.c  | 327 ++++++++++++
 .../selftests/bpf/test_sock_fields_kern.c     | 152 ++++++
 .../selftests/bpf/verifier/ref_tracking.c     |   4 +-
 tools/testing/selftests/bpf/verifier/sock.c   | 384 ++++++++++++++
 tools/testing/selftests/bpf/verifier/unpriv.c |   2 +-
 13 files changed, 1493 insertions(+), 235 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sock_fields.c
 create mode 100644 tools/testing/selftests/bpf/test_sock_fields_kern.c
 create mode 100644 tools/testing/selftests/bpf/verifier/sock.c

-- 
2.17.1


^ permalink raw reply

* [PATCH v2 bpf-next 1/7] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
From: Martin KaFai Lau @ 2019-02-10  7:22 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190210072220.1530061-1-kafai@fb.com>

In kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)"
before accessing the fields in sock.  For example, in __netdev_pick_tx:

static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
			    struct net_device *sb_dev)
{
	/* ... */

	struct sock *sk = skb->sk;

		if (queue_index != new_index && sk &&
		    sk_fullsock(sk) &&
		    rcu_access_pointer(sk->sk_dst_cache))
			sk_tx_queue_set(sk, new_index);

	/* ... */

	return queue_index;
}

This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff"
where a few of the convert_ctx_access() in filter.c has already been
accessing the skb->sk sock_common's fields,
e.g. sock_ops_convert_ctx_access().

"__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
Some of the fileds in "bpf_sock" will not be directly
accessible through the "__sk_buff->sk" pointer.  It is limited
by the new "bpf_sock_common_is_valid_access()".
e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
     are not allowed.

The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
can be used to get a sk with all accessible fields in "bpf_sock".
This helper is added to both cg_skb and sched_(cls|act).

int cg_skb_foo(struct __sk_buff *skb) {
	struct bpf_sock *sk;

	sk = skb->sk;
	if (!sk)
		return 1;

	sk = bpf_sk_fullsock(sk);
	if (!sk)
		return 1;

	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
		return 1;

	/* some_traffic_shaping(); */

	return 1;
}

(1) The sk is read only

(2) There is no new "struct bpf_sock_common" introduced.

(3) Future kernel sock's members could be added to bpf_sock only
    instead of repeatedly adding at multiple places like currently
    in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.

(4) After "sk = skb->sk", the reg holding sk is in type
    PTR_TO_SOCK_COMMON_OR_NULL.

(5) After bpf_sk_fullsock(), the return type will be in type
    PTR_TO_SOCKET_OR_NULL which is the same as the return type of
    bpf_sk_lookup_xxx().

    However, bpf_sk_fullsock() does not take refcnt.  The
    acquire_reference_state() is only depending on the return type now.
    To avoid it, a new is_acquire_function() is checked before calling
    acquire_reference_state().

(6) The WARN_ON in "release_reference_state()" is no longer an
    internal verifier bug.

    When reg->id is not found in state->refs[], it means the
    bpf_prog does something wrong like
    "bpf_sk_release(bpf_sk_fullsock(skb->sk))" where reference has
    never been acquired by calling "bpf_sk_fullsock(skb->sk)".

    A -EINVAL and a verbose are done instead of WARN_ON.  A test is
    added to the test_verifier in a later patch.

    Since the WARN_ON in "release_reference_state()" is no longer
    needed, "__release_reference_state()" is folded into
    "release_reference_state()" also.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf.h      |  12 ++++
 include/uapi/linux/bpf.h |  12 +++-
 kernel/bpf/verifier.c    | 132 +++++++++++++++++++++++++++------------
 net/core/filter.c        |  42 +++++++++++++
 4 files changed, 157 insertions(+), 41 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bd169a7bcc93..a60463b45b54 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -194,6 +194,7 @@ enum bpf_arg_type {
 	ARG_ANYTHING,		/* any (initialized) argument is ok */
 	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock */
 	ARG_PTR_TO_SPIN_LOCK,	/* pointer to bpf_spin_lock */
+	ARG_PTR_TO_SOCK_COMMON,	/* pointer to sock_common */
 };
 
 /* type of values returned from helper functions */
@@ -256,6 +257,8 @@ enum bpf_reg_type {
 	PTR_TO_FLOW_KEYS,	 /* reg points to bpf_flow_keys */
 	PTR_TO_SOCKET,		 /* reg points to struct bpf_sock */
 	PTR_TO_SOCKET_OR_NULL,	 /* reg points to struct bpf_sock or NULL */
+	PTR_TO_SOCK_COMMON,	 /* reg points to sock_common */
+	PTR_TO_SOCK_COMMON_OR_NULL, /* reg points to sock_common or NULL */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -920,6 +923,9 @@ void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 #if defined(CONFIG_NET)
+bool bpf_sock_common_is_valid_access(int off, int size,
+				     enum bpf_access_type type,
+				     struct bpf_insn_access_aux *info);
 bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
 			      struct bpf_insn_access_aux *info);
 u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
@@ -928,6 +934,12 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
 				struct bpf_prog *prog,
 				u32 *target_size);
 #else
+static inline bool bpf_sock_common_is_valid_access(int off, int size,
+						   enum bpf_access_type type,
+						   struct bpf_insn_access_aux *info)
+{
+	return false;
+}
 static inline bool bpf_sock_is_valid_access(int off, int size,
 					    enum bpf_access_type type,
 					    struct bpf_insn_access_aux *info)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1777fa0c61e4..5d79cba74ddc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2329,6 +2329,14 @@ union bpf_attr {
  *		"**y**".
  *	Return
  *		0
+ *
+ * struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)
+ *	Description
+ *		This helper gets a **struct bpf_sock** pointer such
+ *		that all the fields in bpf_sock can be accessed.
+ *	Return
+ *		A **struct bpf_sock** pointer on success, or NULL in
+ *		case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2425,7 +2433,8 @@ union bpf_attr {
 	FN(msg_pop_data),		\
 	FN(rc_pointer_rel),		\
 	FN(spin_lock),			\
-	FN(spin_unlock),
+	FN(spin_unlock),		\
+	FN(sk_fullsock),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2545,6 +2554,7 @@ struct __sk_buff {
 	__u64 tstamp;
 	__u32 wire_len;
 	__u32 gso_segs;
+	__bpf_md_ptr(struct bpf_sock *, sk);
 };
 
 struct bpf_tunnel_key {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 516dfc6d78de..b755d55a3791 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -331,10 +331,17 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
 	       type == PTR_TO_PACKET_META;
 }
 
+static bool type_is_sk_pointer(enum bpf_reg_type type)
+{
+	return type == PTR_TO_SOCKET ||
+		type == PTR_TO_SOCK_COMMON;
+}
+
 static bool reg_type_may_be_null(enum bpf_reg_type type)
 {
 	return type == PTR_TO_MAP_VALUE_OR_NULL ||
-	       type == PTR_TO_SOCKET_OR_NULL;
+	       type == PTR_TO_SOCKET_OR_NULL ||
+	       type == PTR_TO_SOCK_COMMON_OR_NULL;
 }
 
 static bool type_is_refcounted(enum bpf_reg_type type)
@@ -377,6 +384,12 @@ static bool is_release_function(enum bpf_func_id func_id)
 	return func_id == BPF_FUNC_sk_release;
 }
 
+static bool is_acquire_function(enum bpf_func_id func_id)
+{
+	return func_id == BPF_FUNC_sk_lookup_tcp ||
+		func_id == BPF_FUNC_sk_lookup_udp;
+}
+
 /* string representation of 'enum bpf_reg_type' */
 static const char * const reg_type_str[] = {
 	[NOT_INIT]		= "?",
@@ -392,6 +405,8 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_FLOW_KEYS]	= "flow_keys",
 	[PTR_TO_SOCKET]		= "sock",
 	[PTR_TO_SOCKET_OR_NULL] = "sock_or_null",
+	[PTR_TO_SOCK_COMMON]	= "sock_common",
+	[PTR_TO_SOCK_COMMON_OR_NULL] = "sock_common_or_null",
 };
 
 static char slot_type_char[] = {
@@ -618,13 +633,10 @@ static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx)
 }
 
 /* release function corresponding to acquire_reference_state(). Idempotent. */
-static int __release_reference_state(struct bpf_func_state *state, int ptr_id)
+static int release_reference_state(struct bpf_func_state *state, int ptr_id)
 {
 	int i, last_idx;
 
-	if (!ptr_id)
-		return -EFAULT;
-
 	last_idx = state->acquired_refs - 1;
 	for (i = 0; i < state->acquired_refs; i++) {
 		if (state->refs[i].id == ptr_id) {
@@ -636,21 +648,7 @@ static int __release_reference_state(struct bpf_func_state *state, int ptr_id)
 			return 0;
 		}
 	}
-	return -EFAULT;
-}
-
-/* variation on the above for cases where we expect that there must be an
- * outstanding reference for the specified ptr_id.
- */
-static int release_reference_state(struct bpf_verifier_env *env, int ptr_id)
-{
-	struct bpf_func_state *state = cur_func(env);
-	int err;
-
-	err = __release_reference_state(state, ptr_id);
-	if (WARN_ON_ONCE(err != 0))
-		verbose(env, "verifier internal error: can't release reference\n");
-	return err;
+	return -EINVAL;
 }
 
 static int transfer_reference_state(struct bpf_func_state *dst,
@@ -1209,6 +1207,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case CONST_PTR_TO_MAP:
 	case PTR_TO_SOCKET:
 	case PTR_TO_SOCKET_OR_NULL:
+	case PTR_TO_SOCK_COMMON:
+	case PTR_TO_SOCK_COMMON_OR_NULL:
 		return true;
 	default:
 		return false;
@@ -1647,6 +1647,7 @@ static int check_sock_access(struct bpf_verifier_env *env, int insn_idx,
 	struct bpf_reg_state *regs = cur_regs(env);
 	struct bpf_reg_state *reg = &regs[regno];
 	struct bpf_insn_access_aux info = {};
+	bool valid;
 
 	if (reg->smin_value < 0) {
 		verbose(env, "R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n",
@@ -1654,15 +1655,28 @@ static int check_sock_access(struct bpf_verifier_env *env, int insn_idx,
 		return -EACCES;
 	}
 
-	if (!bpf_sock_is_valid_access(off, size, t, &info)) {
-		verbose(env, "invalid bpf_sock access off=%d size=%d\n",
-			off, size);
-		return -EACCES;
+	switch (reg->type) {
+	case PTR_TO_SOCK_COMMON:
+		valid = bpf_sock_common_is_valid_access(off, size, t, &info);
+		break;
+	case PTR_TO_SOCKET:
+		valid = bpf_sock_is_valid_access(off, size, t, &info);
+		break;
+	default:
+		valid = false;
 	}
 
-	env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size;
 
-	return 0;
+	if (valid) {
+		env->insn_aux_data[insn_idx].ctx_field_size =
+			info.ctx_field_size;
+		return 0;
+	}
+
+	verbose(env, "R%d invalid %s access off=%d size=%d\n",
+		regno, reg_type_str[reg->type], off, size);
+
+	return -EACCES;
 }
 
 static bool __is_pointer_value(bool allow_ptr_leaks,
@@ -1688,8 +1702,14 @@ static bool is_ctx_reg(struct bpf_verifier_env *env, int regno)
 {
 	const struct bpf_reg_state *reg = reg_state(env, regno);
 
-	return reg->type == PTR_TO_CTX ||
-	       reg->type == PTR_TO_SOCKET;
+	return reg->type == PTR_TO_CTX;
+}
+
+static bool is_sk_reg(struct bpf_verifier_env *env, int regno)
+{
+	const struct bpf_reg_state *reg = reg_state(env, regno);
+
+	return type_is_sk_pointer(reg->type);
 }
 
 static bool is_pkt_reg(struct bpf_verifier_env *env, int regno)
@@ -1800,6 +1820,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 	case PTR_TO_SOCKET:
 		pointer_desc = "sock ";
 		break;
+	case PTR_TO_SOCK_COMMON:
+		pointer_desc = "sock_common ";
+		break;
 	default:
 		break;
 	}
@@ -2003,11 +2026,14 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 			 * PTR_TO_PACKET[_META,_END]. In the latter
 			 * case, we know the offset is zero.
 			 */
-			if (reg_type == SCALAR_VALUE)
+			if (reg_type == SCALAR_VALUE) {
 				mark_reg_unknown(env, regs, value_regno);
-			else
+			} else {
 				mark_reg_known_zero(env, regs,
 						    value_regno);
+				if (reg_type_may_be_null(reg_type))
+					regs[value_regno].id = ++env->id_gen;
+			}
 			regs[value_regno].type = reg_type;
 		}
 
@@ -2053,9 +2079,10 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		err = check_flow_keys_access(env, off, size);
 		if (!err && t == BPF_READ && value_regno >= 0)
 			mark_reg_unknown(env, regs, value_regno);
-	} else if (reg->type == PTR_TO_SOCKET) {
+	} else if (type_is_sk_pointer(reg->type)) {
 		if (t == BPF_WRITE) {
-			verbose(env, "cannot write into socket\n");
+			verbose(env, "R%d cannot write into %s\n",
+				regno, reg_type_str[reg->type]);
 			return -EACCES;
 		}
 		err = check_sock_access(env, insn_idx, regno, off, size, t);
@@ -2102,7 +2129,8 @@ static int check_xadd(struct bpf_verifier_env *env, int insn_idx, struct bpf_ins
 
 	if (is_ctx_reg(env, insn->dst_reg) ||
 	    is_pkt_reg(env, insn->dst_reg) ||
-	    is_flow_key_reg(env, insn->dst_reg)) {
+	    is_flow_key_reg(env, insn->dst_reg) ||
+	    is_sk_reg(env, insn->dst_reg)) {
 		verbose(env, "BPF_XADD stores into R%d %s is not allowed\n",
 			insn->dst_reg,
 			reg_type_str[reg_state(env, insn->dst_reg)->type]);
@@ -2369,6 +2397,11 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 		err = check_ctx_reg(env, reg, regno);
 		if (err < 0)
 			return err;
+	} else if (arg_type == ARG_PTR_TO_SOCK_COMMON) {
+		expected_type = PTR_TO_SOCK_COMMON;
+		/* Any sk pointer can be ARG_PTR_TO_SOCK_COMMON */
+		if (!type_is_sk_pointer(type))
+			goto err_type;
 	} else if (arg_type == ARG_PTR_TO_SOCKET) {
 		expected_type = PTR_TO_SOCKET;
 		if (type != expected_type)
@@ -2783,7 +2816,7 @@ static int release_reference(struct bpf_verifier_env *env,
 	for (i = 0; i <= vstate->curframe; i++)
 		release_reg_references(env, vstate->frame[i], meta->ptr_id);
 
-	return release_reference_state(env, meta->ptr_id);
+	return release_reference_state(cur_func(env), meta->ptr_id);
 }
 
 static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
@@ -3049,8 +3082,11 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 		}
 	} else if (is_release_function(func_id)) {
 		err = release_reference(env, &meta);
-		if (err)
+		if (err) {
+			verbose(env, "func %s#%d reference has not been acquired before\n",
+				func_id_name(func_id), func_id);
 			return err;
+		}
 	}
 
 	regs = cur_regs(env);
@@ -3099,12 +3135,19 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 			regs[BPF_REG_0].id = ++env->id_gen;
 		}
 	} else if (fn->ret_type == RET_PTR_TO_SOCKET_OR_NULL) {
-		int id = acquire_reference_state(env, insn_idx);
-		if (id < 0)
-			return id;
 		mark_reg_known_zero(env, regs, BPF_REG_0);
 		regs[BPF_REG_0].type = PTR_TO_SOCKET_OR_NULL;
-		regs[BPF_REG_0].id = id;
+		if (is_acquire_function(func_id)) {
+			int id = acquire_reference_state(env, insn_idx);
+
+			if (id < 0)
+				return id;
+			/* For release_reference() */
+			regs[BPF_REG_0].id = id;
+		} else {
+			/* For mark_ptr_or_null_reg() */
+			regs[BPF_REG_0].id = ++env->id_gen;
+		}
 	} else {
 		verbose(env, "unknown return type %d of func %s#%d\n",
 			fn->ret_type, func_id_name(func_id), func_id);
@@ -3364,6 +3407,8 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 	case PTR_TO_PACKET_END:
 	case PTR_TO_SOCKET:
 	case PTR_TO_SOCKET_OR_NULL:
+	case PTR_TO_SOCK_COMMON:
+	case PTR_TO_SOCK_COMMON_OR_NULL:
 		verbose(env, "R%d pointer arithmetic on %s prohibited\n",
 			dst, reg_type_str[ptr_reg->type]);
 		return -EACCES;
@@ -4597,6 +4642,8 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
 			}
 		} else if (reg->type == PTR_TO_SOCKET_OR_NULL) {
 			reg->type = PTR_TO_SOCKET;
+		} else if (reg->type == PTR_TO_SOCK_COMMON_OR_NULL) {
+			reg->type = PTR_TO_SOCK_COMMON;
 		}
 		if (is_null || !(reg_is_refcounted(reg) ||
 				 reg_may_point_to_spin_lock(reg))) {
@@ -4621,7 +4668,7 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
 	int i, j;
 
 	if (reg_is_refcounted_or_null(&regs[regno]) && is_null)
-		__release_reference_state(state, id);
+		release_reference_state(state, id);
 
 	for (i = 0; i < MAX_BPF_REG; i++)
 		mark_ptr_or_null_reg(state, &regs[i], id, is_null);
@@ -5790,6 +5837,8 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
 	case PTR_TO_FLOW_KEYS:
 	case PTR_TO_SOCKET:
 	case PTR_TO_SOCKET_OR_NULL:
+	case PTR_TO_SOCK_COMMON:
+	case PTR_TO_SOCK_COMMON_OR_NULL:
 		/* Only valid matches are exact, which memcmp() above
 		 * would have accepted
 		 */
@@ -6110,6 +6159,8 @@ static bool reg_type_mismatch_ok(enum bpf_reg_type type)
 	case PTR_TO_CTX:
 	case PTR_TO_SOCKET:
 	case PTR_TO_SOCKET_OR_NULL:
+	case PTR_TO_SOCK_COMMON:
+	case PTR_TO_SOCK_COMMON_OR_NULL:
 		return false;
 	default:
 		return true;
@@ -7112,6 +7163,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 			convert_ctx_access = ops->convert_ctx_access;
 			break;
 		case PTR_TO_SOCKET:
+		case PTR_TO_SOCK_COMMON:
 			convert_ctx_access = bpf_sock_convert_ctx_access;
 			break;
 		default:
diff --git a/net/core/filter.c b/net/core/filter.c
index 3a49f68eda10..401d2e0aebf8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1793,6 +1793,20 @@ static const struct bpf_func_proto bpf_skb_pull_data_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
+{
+	sk = sk_to_full_sk(sk);
+
+	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
+}
+
+static const struct bpf_func_proto bpf_sk_fullsock_proto = {
+	.func		= bpf_sk_fullsock,
+	.gpl_only	= false,
+	.ret_type	= RET_PTR_TO_SOCKET_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
+};
+
 static inline int sk_skb_try_make_writable(struct sk_buff *skb,
 					   unsigned int write_len)
 {
@@ -5406,6 +5420,8 @@ cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	switch (func_id) {
 	case BPF_FUNC_get_local_storage:
 		return &bpf_get_local_storage_proto;
+	case BPF_FUNC_sk_fullsock:
+		return &bpf_sk_fullsock_proto;
 	default:
 		return sk_filter_func_proto(func_id, prog);
 	}
@@ -5477,6 +5493,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_uid_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_skb_fib_lookup_proto;
+	case BPF_FUNC_sk_fullsock:
+		return &bpf_sk_fullsock_proto;
 #ifdef CONFIG_XFRM
 	case BPF_FUNC_skb_get_xfrm_state:
 		return &bpf_skb_get_xfrm_state_proto;
@@ -5764,6 +5782,11 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type
 		if (size != sizeof(__u64))
 			return false;
 		break;
+	case offsetof(struct __sk_buff, sk):
+		if (type == BPF_WRITE || size != sizeof(__u64))
+			return false;
+		info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL;
+		break;
 	default:
 		/* Only narrow read access allowed for now. */
 		if (type == BPF_WRITE) {
@@ -5950,6 +5973,18 @@ static bool __sock_filter_check_size(int off, int size,
 	return size == size_default;
 }
 
+bool bpf_sock_common_is_valid_access(int off, int size,
+				     enum bpf_access_type type,
+				     struct bpf_insn_access_aux *info)
+{
+	switch (off) {
+	case bpf_ctx_range_till(struct bpf_sock, type, priority):
+		return false;
+	default:
+		return bpf_sock_is_valid_access(off, size, type, info);
+	}
+}
+
 bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
 			      struct bpf_insn_access_aux *info)
 {
@@ -6748,6 +6783,13 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 		off += offsetof(struct qdisc_skb_cb, pkt_len);
 		*target_size = 4;
 		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg, off);
+		break;
+
+	case offsetof(struct __sk_buff, sk):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct sk_buff, sk));
+		break;
 	}
 
 	return insn - insn_buf;
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 bpf-next 2/7] bpf: Add state, dst_ip4, dst_ip6 and dst_port to bpf_sock
From: Martin KaFai Lau @ 2019-02-10  7:22 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190210072220.1530061-1-kafai@fb.com>

This patch adds "state", "dst_ip4", "dst_ip6" and "dst_port" to the
bpf_sock.  The userspace has already been using "state",
e.g. inet_diag (ss -t) and getsockopt(TCP_INFO).

This patch also allows narrow load on the following existing fields:
"family", "type", "protocol" and "src_port".  Unlike IP address,
the load offset is resticted to the first byte for them but it
can be relaxed later if there is a use case.

This patch also folds __sock_filter_check_size() into
bpf_sock_is_valid_access() since it is not called
by any where else.  All bpf_sock checking is in
one place.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/uapi/linux/bpf.h | 17 ++++---
 net/core/filter.c        | 99 +++++++++++++++++++++++++++++++---------
 2 files changed, 85 insertions(+), 31 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 5d79cba74ddc..d8f91777c5b6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2606,15 +2606,14 @@ struct bpf_sock {
 	__u32 protocol;
 	__u32 mark;
 	__u32 priority;
-	__u32 src_ip4;		/* Allows 1,2,4-byte read.
-				 * Stored in network byte order.
-				 */
-	__u32 src_ip6[4];	/* Allows 1,2,4-byte read.
-				 * Stored in network byte order.
-				 */
-	__u32 src_port;		/* Allows 4-byte read.
-				 * Stored in host byte order
-				 */
+	/* IP address also allows 1 and 2 bytes access */
+	__u32 src_ip4;
+	__u32 src_ip6[4];
+	__u32 src_port;		/* host byte order */
+	__u32 dst_port;		/* network byte order */
+	__u32 dst_ip4;
+	__u32 dst_ip6[4];
+	__u32 state;
 };
 
 struct bpf_sock_tuple {
diff --git a/net/core/filter.c b/net/core/filter.c
index 401d2e0aebf8..01bb64bf2b5e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5958,21 +5958,6 @@ static bool __sock_filter_check_attach_type(int off,
 	return true;
 }
 
-static bool __sock_filter_check_size(int off, int size,
-				     struct bpf_insn_access_aux *info)
-{
-	const int size_default = sizeof(__u32);
-
-	switch (off) {
-	case bpf_ctx_range(struct bpf_sock, src_ip4):
-	case bpf_ctx_range_till(struct bpf_sock, src_ip6[0], src_ip6[3]):
-		bpf_ctx_record_field_size(info, size_default);
-		return bpf_ctx_narrow_access_ok(off, size, size_default);
-	}
-
-	return size == size_default;
-}
-
 bool bpf_sock_common_is_valid_access(int off, int size,
 				     enum bpf_access_type type,
 				     struct bpf_insn_access_aux *info)
@@ -5988,13 +5973,29 @@ bool bpf_sock_common_is_valid_access(int off, int size,
 bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
 			      struct bpf_insn_access_aux *info)
 {
+	const int size_default = sizeof(__u32);
+
 	if (off < 0 || off >= sizeof(struct bpf_sock))
 		return false;
 	if (off % size != 0)
 		return false;
-	if (!__sock_filter_check_size(off, size, info))
-		return false;
-	return true;
+
+	switch (off) {
+	case offsetof(struct bpf_sock, state):
+	case offsetof(struct bpf_sock, family):
+	case offsetof(struct bpf_sock, type):
+	case offsetof(struct bpf_sock, protocol):
+	case offsetof(struct bpf_sock, dst_port):
+	case offsetof(struct bpf_sock, src_port):
+	case bpf_ctx_range(struct bpf_sock, src_ip4):
+	case bpf_ctx_range_till(struct bpf_sock, src_ip6[0], src_ip6[3]):
+	case bpf_ctx_range(struct bpf_sock, dst_ip4):
+	case bpf_ctx_range_till(struct bpf_sock, dst_ip6[0], dst_ip6[3]):
+		bpf_ctx_record_field_size(info, size_default);
+		return bpf_ctx_narrow_access_ok(off, size, size_default);
+	}
+
+	return size == size_default;
 }
 
 static bool sock_filter_is_valid_access(int off, int size,
@@ -6838,24 +6839,32 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
 		break;
 
 	case offsetof(struct bpf_sock, family):
-		BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_family) != 2);
-
-		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
-				      offsetof(struct sock, sk_family));
+		*insn++ = BPF_LDX_MEM(
+			BPF_FIELD_SIZEOF(struct sock_common, skc_family),
+			si->dst_reg, si->src_reg,
+			bpf_target_off(struct sock_common,
+				       skc_family,
+				       FIELD_SIZEOF(struct sock_common,
+						    skc_family),
+				       target_size));
 		break;
 
 	case offsetof(struct bpf_sock, type):
+		BUILD_BUG_ON(HWEIGHT32(SK_FL_TYPE_MASK) != BITS_PER_BYTE * 2);
 		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
 				      offsetof(struct sock, __sk_flags_offset));
 		*insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_TYPE_MASK);
 		*insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg, SK_FL_TYPE_SHIFT);
+		*target_size = 2;
 		break;
 
 	case offsetof(struct bpf_sock, protocol):
+		BUILD_BUG_ON(HWEIGHT32(SK_FL_PROTO_MASK) != BITS_PER_BYTE);
 		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
 				      offsetof(struct sock, __sk_flags_offset));
 		*insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_PROTO_MASK);
 		*insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg, SK_FL_PROTO_SHIFT);
+		*target_size = 1;
 		break;
 
 	case offsetof(struct bpf_sock, src_ip4):
@@ -6867,6 +6876,15 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
 				       target_size));
 		break;
 
+	case offsetof(struct bpf_sock, dst_ip4):
+		*insn++ = BPF_LDX_MEM(
+			BPF_SIZE(si->code), si->dst_reg, si->src_reg,
+			bpf_target_off(struct sock_common, skc_daddr,
+				       FIELD_SIZEOF(struct sock_common,
+						    skc_daddr),
+				       target_size));
+		break;
+
 	case bpf_ctx_range_till(struct bpf_sock, src_ip6[0], src_ip6[3]):
 #if IS_ENABLED(CONFIG_IPV6)
 		off = si->off;
@@ -6885,6 +6903,23 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
 #endif
 		break;
 
+	case bpf_ctx_range_till(struct bpf_sock, dst_ip6[0], dst_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+		off = si->off;
+		off -= offsetof(struct bpf_sock, dst_ip6[0]);
+		*insn++ = BPF_LDX_MEM(
+			BPF_SIZE(si->code), si->dst_reg, si->src_reg,
+			bpf_target_off(struct sock_common,
+				       skc_v6_daddr.s6_addr32[0],
+				       FIELD_SIZEOF(struct sock_common,
+						    skc_v6_daddr.s6_addr32[0]),
+				       target_size) + off);
+#else
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+		*target_size = 4;
+#endif
+		break;
+
 	case offsetof(struct bpf_sock, src_port):
 		*insn++ = BPF_LDX_MEM(
 			BPF_FIELD_SIZEOF(struct sock_common, skc_num),
@@ -6894,6 +6929,26 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
 						    skc_num),
 				       target_size));
 		break;
+
+	case offsetof(struct bpf_sock, dst_port):
+		*insn++ = BPF_LDX_MEM(
+			BPF_FIELD_SIZEOF(struct sock_common, skc_dport),
+			si->dst_reg, si->src_reg,
+			bpf_target_off(struct sock_common, skc_dport,
+				       FIELD_SIZEOF(struct sock_common,
+						    skc_dport),
+				       target_size));
+		break;
+
+	case offsetof(struct bpf_sock, state):
+		*insn++ = BPF_LDX_MEM(
+			BPF_FIELD_SIZEOF(struct sock_common, skc_state),
+			si->dst_reg, si->src_reg,
+			bpf_target_off(struct sock_common, skc_state,
+				       FIELD_SIZEOF(struct sock_common,
+						    skc_state),
+				       target_size));
+		break;
 	}
 
 	return insn - insn_buf;
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 bpf-next 4/7] bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock
From: Martin KaFai Lau @ 2019-02-10  7:22 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190210072220.1530061-1-kafai@fb.com>

This patch adds a helper function BPF_FUNC_tcp_sock and it
is currently available for cg_skb and sched_(cls|act):

struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk);

int cg_skb_foo(struct __sk_buff *skb) {
	struct bpf_tcp_sock *tp;
	struct bpf_sock *sk;
	__u32 snd_cwnd;

	sk = skb->sk;
	if (!sk)
		return 1;

	tp = bpf_tcp_sock(sk);
	if (!tp)
		return 1;

	snd_cwnd = tp->snd_cwnd;
	/* ... */

	return 1;
}

A 'struct bpf_tcp_sock' is also added to the uapi bpf.h to provide
read-only access.  bpf_tcp_sock has all the existing tcp_sock's fields
that has already been exposed by the bpf_sock_ops.
i.e. no new tcp_sock's fields are exposed in bpf.h.

This helper returns a pointer to the tcp_sock.  If it is not a tcp_sock
or it cannot be traced back to a tcp_sock by sk_to_full_sk(), it
returns NULL.  Hence, the caller needs to check for NULL before
accessing it.

The current use case is to expose members from tcp_sock
to allow a cg_skb_bpf_prog to provide per cgroup traffic
policing/shaping.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf.h      | 30 +++++++++++++++
 include/uapi/linux/bpf.h | 51 +++++++++++++++++++++++++-
 kernel/bpf/verifier.c    | 31 +++++++++++++++-
 net/core/filter.c        | 79 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 188 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a60463b45b54..7f58828755fd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -204,6 +204,7 @@ enum bpf_return_type {
 	RET_PTR_TO_MAP_VALUE,		/* returns a pointer to map elem value */
 	RET_PTR_TO_MAP_VALUE_OR_NULL,	/* returns a pointer to map elem value or NULL */
 	RET_PTR_TO_SOCKET_OR_NULL,	/* returns a pointer to a socket or NULL */
+	RET_PTR_TO_TCP_SOCK_OR_NULL,	/* returns a pointer to a tcp_sock or NULL */
 };
 
 /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
@@ -259,6 +260,8 @@ enum bpf_reg_type {
 	PTR_TO_SOCKET_OR_NULL,	 /* reg points to struct bpf_sock or NULL */
 	PTR_TO_SOCK_COMMON,	 /* reg points to sock_common */
 	PTR_TO_SOCK_COMMON_OR_NULL, /* reg points to sock_common or NULL */
+	PTR_TO_TCP_SOCK,	 /* reg points to struct tcp_sock */
+	PTR_TO_TCP_SOCK_OR_NULL, /* reg points to struct tcp_sock or NULL */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -956,4 +959,31 @@ static inline u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
 }
 #endif
 
+#ifdef CONFIG_INET
+bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
+				  struct bpf_insn_access_aux *info);
+
+u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type,
+				    const struct bpf_insn *si,
+				    struct bpf_insn *insn_buf,
+				    struct bpf_prog *prog,
+				    u32 *target_size);
+#else
+static inline bool bpf_tcp_sock_is_valid_access(int off, int size,
+						enum bpf_access_type type,
+						struct bpf_insn_access_aux *info)
+{
+	return false;
+}
+
+static inline u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type,
+						  const struct bpf_insn *si,
+						  struct bpf_insn *insn_buf,
+						  struct bpf_prog *prog,
+						  u32 *target_size)
+{
+	return 0;
+}
+#endif /* CONFIG_INET */
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d8f91777c5b6..25c8c0e62ecf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2337,6 +2337,15 @@ union bpf_attr {
  *	Return
  *		A **struct bpf_sock** pointer on success, or NULL in
  *		case of failure.
+ *
+ * struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk)
+ *	Description
+ *		This helper gets a **struct bpf_tcp_sock** pointer from a
+ *		**struct bpf_sock** pointer.
+ *
+ *	Return
+ *		A **struct bpf_tcp_sock** pointer on success, or NULL in
+ *		case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2434,7 +2443,8 @@ union bpf_attr {
 	FN(rc_pointer_rel),		\
 	FN(spin_lock),			\
 	FN(spin_unlock),		\
-	FN(sk_fullsock),
+	FN(sk_fullsock),		\
+	FN(tcp_sock),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2616,6 +2626,45 @@ struct bpf_sock {
 	__u32 state;
 };
 
+struct bpf_tcp_sock {
+	__u32 snd_cwnd;		/* Sending congestion window		*/
+	__u32 srtt_us;		/* smoothed round trip time << 3 in usecs */
+	__u32 rtt_min;
+	__u32 snd_ssthresh;	/* Slow start size threshold		*/
+	__u32 rcv_nxt;		/* What we want to receive next		*/
+	__u32 snd_nxt;		/* Next sequence we send		*/
+	__u32 snd_una;		/* First byte we want an ack for	*/
+	__u32 mss_cache;	/* Cached effective mss, not including SACKS */
+	__u32 ecn_flags;	/* ECN status bits.			*/
+	__u32 rate_delivered;	/* saved rate sample: packets delivered */
+	__u32 rate_interval_us;	/* saved rate sample: time elapsed */
+	__u32 packets_out;	/* Packets which are "in flight"	*/
+	__u32 retrans_out;	/* Retransmitted packets out		*/
+	__u32 total_retrans;	/* Total retransmits for entire connection */
+	__u32 segs_in;		/* RFC4898 tcpEStatsPerfSegsIn
+				 * total number of segments in.
+				 */
+	__u32 data_segs_in;	/* RFC4898 tcpEStatsPerfDataSegsIn
+				 * total number of data segments in.
+				 */
+	__u32 segs_out;		/* RFC4898 tcpEStatsPerfSegsOut
+				 * The total number of segments sent.
+				 */
+	__u32 data_segs_out;	/* RFC4898 tcpEStatsPerfDataSegsOut
+				 * total number of data segments sent.
+				 */
+	__u32 lost_out;		/* Lost packets			*/
+	__u32 sacked_out;	/* SACK'd packets			*/
+	__u64 bytes_received;	/* RFC4898 tcpEStatsAppHCThruOctetsReceived
+				 * sum(delta(rcv_nxt)), or how many bytes
+				 * were acked.
+				 */
+	__u64 bytes_acked;	/* RFC4898 tcpEStatsAppHCThruOctetsAcked
+				 * sum(delta(snd_una)), or how many bytes
+				 * were acked.
+				 */
+};
+
 struct bpf_sock_tuple {
 	union {
 		struct {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b755d55a3791..1b9496c41383 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -334,14 +334,16 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
 static bool type_is_sk_pointer(enum bpf_reg_type type)
 {
 	return type == PTR_TO_SOCKET ||
-		type == PTR_TO_SOCK_COMMON;
+		type == PTR_TO_SOCK_COMMON ||
+		type == PTR_TO_TCP_SOCK;
 }
 
 static bool reg_type_may_be_null(enum bpf_reg_type type)
 {
 	return type == PTR_TO_MAP_VALUE_OR_NULL ||
 	       type == PTR_TO_SOCKET_OR_NULL ||
-	       type == PTR_TO_SOCK_COMMON_OR_NULL;
+	       type == PTR_TO_SOCK_COMMON_OR_NULL ||
+	       type == PTR_TO_TCP_SOCK_OR_NULL;
 }
 
 static bool type_is_refcounted(enum bpf_reg_type type)
@@ -407,6 +409,8 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_SOCKET_OR_NULL] = "sock_or_null",
 	[PTR_TO_SOCK_COMMON]	= "sock_common",
 	[PTR_TO_SOCK_COMMON_OR_NULL] = "sock_common_or_null",
+	[PTR_TO_TCP_SOCK]	= "tcp_sock",
+	[PTR_TO_TCP_SOCK_OR_NULL] = "tcp_sock_or_null",
 };
 
 static char slot_type_char[] = {
@@ -1209,6 +1213,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_SOCKET_OR_NULL:
 	case PTR_TO_SOCK_COMMON:
 	case PTR_TO_SOCK_COMMON_OR_NULL:
+	case PTR_TO_TCP_SOCK:
+	case PTR_TO_TCP_SOCK_OR_NULL:
 		return true;
 	default:
 		return false;
@@ -1662,6 +1668,9 @@ static int check_sock_access(struct bpf_verifier_env *env, int insn_idx,
 	case PTR_TO_SOCKET:
 		valid = bpf_sock_is_valid_access(off, size, t, &info);
 		break;
+	case PTR_TO_TCP_SOCK:
+		valid = bpf_tcp_sock_is_valid_access(off, size, t, &info);
+		break;
 	default:
 		valid = false;
 	}
@@ -1823,6 +1832,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 	case PTR_TO_SOCK_COMMON:
 		pointer_desc = "sock_common ";
 		break;
+	case PTR_TO_TCP_SOCK:
+		pointer_desc = "tcp_sock ";
+		break;
 	default:
 		break;
 	}
@@ -3148,6 +3160,10 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 			/* For mark_ptr_or_null_reg() */
 			regs[BPF_REG_0].id = ++env->id_gen;
 		}
+	} else if (fn->ret_type == RET_PTR_TO_TCP_SOCK_OR_NULL) {
+		mark_reg_known_zero(env, regs, BPF_REG_0);
+		regs[BPF_REG_0].type = PTR_TO_TCP_SOCK_OR_NULL;
+		regs[BPF_REG_0].id = ++env->id_gen;
 	} else {
 		verbose(env, "unknown return type %d of func %s#%d\n",
 			fn->ret_type, func_id_name(func_id), func_id);
@@ -3409,6 +3425,8 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 	case PTR_TO_SOCKET_OR_NULL:
 	case PTR_TO_SOCK_COMMON:
 	case PTR_TO_SOCK_COMMON_OR_NULL:
+	case PTR_TO_TCP_SOCK:
+	case PTR_TO_TCP_SOCK_OR_NULL:
 		verbose(env, "R%d pointer arithmetic on %s prohibited\n",
 			dst, reg_type_str[ptr_reg->type]);
 		return -EACCES;
@@ -4644,6 +4662,8 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
 			reg->type = PTR_TO_SOCKET;
 		} else if (reg->type == PTR_TO_SOCK_COMMON_OR_NULL) {
 			reg->type = PTR_TO_SOCK_COMMON;
+		} else if (reg->type == PTR_TO_TCP_SOCK_OR_NULL) {
+			reg->type = PTR_TO_TCP_SOCK;
 		}
 		if (is_null || !(reg_is_refcounted(reg) ||
 				 reg_may_point_to_spin_lock(reg))) {
@@ -5839,6 +5859,8 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
 	case PTR_TO_SOCKET_OR_NULL:
 	case PTR_TO_SOCK_COMMON:
 	case PTR_TO_SOCK_COMMON_OR_NULL:
+	case PTR_TO_TCP_SOCK:
+	case PTR_TO_TCP_SOCK_OR_NULL:
 		/* Only valid matches are exact, which memcmp() above
 		 * would have accepted
 		 */
@@ -6161,6 +6183,8 @@ static bool reg_type_mismatch_ok(enum bpf_reg_type type)
 	case PTR_TO_SOCKET_OR_NULL:
 	case PTR_TO_SOCK_COMMON:
 	case PTR_TO_SOCK_COMMON_OR_NULL:
+	case PTR_TO_TCP_SOCK:
+	case PTR_TO_TCP_SOCK_OR_NULL:
 		return false;
 	default:
 		return true;
@@ -7166,6 +7190,9 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 		case PTR_TO_SOCK_COMMON:
 			convert_ctx_access = bpf_sock_convert_ctx_access;
 			break;
+		case PTR_TO_TCP_SOCK:
+			convert_ctx_access = bpf_tcp_sock_convert_ctx_access;
+			break;
 		default:
 			continue;
 		}
diff --git a/net/core/filter.c b/net/core/filter.c
index c0d7b9ef279f..353735575204 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5315,6 +5315,79 @@ static const struct bpf_func_proto bpf_sock_addr_sk_lookup_udp_proto = {
 	.arg5_type	= ARG_ANYTHING,
 };
 
+bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
+				  struct bpf_insn_access_aux *info)
+{
+	if (off < 0 || off >= offsetofend(struct bpf_tcp_sock, bytes_acked))
+		return false;
+
+	if (off % size != 0)
+		return false;
+
+	switch (off) {
+	case offsetof(struct bpf_tcp_sock, bytes_received):
+	case offsetof(struct bpf_tcp_sock, bytes_acked):
+		return size == sizeof(__u64);
+	default:
+		return size == sizeof(__u32);
+	}
+}
+
+u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type,
+				    const struct bpf_insn *si,
+				    struct bpf_insn *insn_buf,
+				    struct bpf_prog *prog, u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+#define BPF_TCP_SOCK_GET_COMMON(FIELD)					\
+	do {								\
+		BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD) >	\
+			     FIELD_SIZEOF(struct bpf_tcp_sock, FIELD));	\
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct tcp_sock, FIELD),\
+				      si->dst_reg, si->src_reg,		\
+				      offsetof(struct tcp_sock, FIELD)); \
+	} while (0)
+
+	CONVERT_COMMON_TCP_SOCK_FIELDS(struct bpf_tcp_sock,
+				       BPF_TCP_SOCK_GET_COMMON);
+
+	if (insn > insn_buf)
+		return insn - insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct bpf_tcp_sock, rtt_min):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) !=
+			     sizeof(struct minmax));
+		BUILD_BUG_ON(sizeof(struct minmax) <
+			     sizeof(struct minmax_sample));
+
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+				      offsetof(struct tcp_sock, rtt_min) +
+				      offsetof(struct minmax_sample, v));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
+BPF_CALL_1(bpf_tcp_sock, struct sock *, sk)
+{
+	sk = sk_to_full_sk(sk);
+
+	if (sk_fullsock(sk) && sk->sk_protocol == IPPROTO_TCP)
+		return (unsigned long)sk;
+
+	return (unsigned long)NULL;
+}
+
+static const struct bpf_func_proto bpf_tcp_sock_proto = {
+	.func		= bpf_tcp_sock,
+	.gpl_only	= false,
+	.ret_type	= RET_PTR_TO_TCP_SOCK_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
+};
+
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -5470,6 +5543,10 @@ cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_local_storage_proto;
 	case BPF_FUNC_sk_fullsock:
 		return &bpf_sk_fullsock_proto;
+#ifdef CONFIG_INET
+	case BPF_FUNC_tcp_sock:
+		return &bpf_tcp_sock_proto;
+#endif
 	default:
 		return sk_filter_func_proto(func_id, prog);
 	}
@@ -5560,6 +5637,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_sk_lookup_udp_proto;
 	case BPF_FUNC_sk_release:
 		return &bpf_sk_release_proto;
+	case BPF_FUNC_tcp_sock:
+		return &bpf_tcp_sock_proto;
 #endif
 	default:
 		return bpf_base_func_proto(func_id);
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 bpf-next 5/7] bpf: Sync bpf.h to tools/
From: Martin KaFai Lau @ 2019-02-10  7:22 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190210072220.1530061-1-kafai@fb.com>

This patch sync the uapi bpf.h to tools/.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/include/uapi/linux/bpf.h | 72 ++++++++++++++++++++++++++++++----
 1 file changed, 65 insertions(+), 7 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1777fa0c61e4..25c8c0e62ecf 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2329,6 +2329,23 @@ union bpf_attr {
  *		"**y**".
  *	Return
  *		0
+ *
+ * struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)
+ *	Description
+ *		This helper gets a **struct bpf_sock** pointer such
+ *		that all the fields in bpf_sock can be accessed.
+ *	Return
+ *		A **struct bpf_sock** pointer on success, or NULL in
+ *		case of failure.
+ *
+ * struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk)
+ *	Description
+ *		This helper gets a **struct bpf_tcp_sock** pointer from a
+ *		**struct bpf_sock** pointer.
+ *
+ *	Return
+ *		A **struct bpf_tcp_sock** pointer on success, or NULL in
+ *		case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2425,7 +2442,9 @@ union bpf_attr {
 	FN(msg_pop_data),		\
 	FN(rc_pointer_rel),		\
 	FN(spin_lock),			\
-	FN(spin_unlock),
+	FN(spin_unlock),		\
+	FN(sk_fullsock),		\
+	FN(tcp_sock),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2545,6 +2564,7 @@ struct __sk_buff {
 	__u64 tstamp;
 	__u32 wire_len;
 	__u32 gso_segs;
+	__bpf_md_ptr(struct bpf_sock *, sk);
 };
 
 struct bpf_tunnel_key {
@@ -2596,14 +2616,52 @@ struct bpf_sock {
 	__u32 protocol;
 	__u32 mark;
 	__u32 priority;
-	__u32 src_ip4;		/* Allows 1,2,4-byte read.
-				 * Stored in network byte order.
+	/* IP address also allows 1 and 2 bytes access */
+	__u32 src_ip4;
+	__u32 src_ip6[4];
+	__u32 src_port;		/* host byte order */
+	__u32 dst_port;		/* network byte order */
+	__u32 dst_ip4;
+	__u32 dst_ip6[4];
+	__u32 state;
+};
+
+struct bpf_tcp_sock {
+	__u32 snd_cwnd;		/* Sending congestion window		*/
+	__u32 srtt_us;		/* smoothed round trip time << 3 in usecs */
+	__u32 rtt_min;
+	__u32 snd_ssthresh;	/* Slow start size threshold		*/
+	__u32 rcv_nxt;		/* What we want to receive next		*/
+	__u32 snd_nxt;		/* Next sequence we send		*/
+	__u32 snd_una;		/* First byte we want an ack for	*/
+	__u32 mss_cache;	/* Cached effective mss, not including SACKS */
+	__u32 ecn_flags;	/* ECN status bits.			*/
+	__u32 rate_delivered;	/* saved rate sample: packets delivered */
+	__u32 rate_interval_us;	/* saved rate sample: time elapsed */
+	__u32 packets_out;	/* Packets which are "in flight"	*/
+	__u32 retrans_out;	/* Retransmitted packets out		*/
+	__u32 total_retrans;	/* Total retransmits for entire connection */
+	__u32 segs_in;		/* RFC4898 tcpEStatsPerfSegsIn
+				 * total number of segments in.
 				 */
-	__u32 src_ip6[4];	/* Allows 1,2,4-byte read.
-				 * Stored in network byte order.
+	__u32 data_segs_in;	/* RFC4898 tcpEStatsPerfDataSegsIn
+				 * total number of data segments in.
+				 */
+	__u32 segs_out;		/* RFC4898 tcpEStatsPerfSegsOut
+				 * The total number of segments sent.
+				 */
+	__u32 data_segs_out;	/* RFC4898 tcpEStatsPerfDataSegsOut
+				 * total number of data segments sent.
+				 */
+	__u32 lost_out;		/* Lost packets			*/
+	__u32 sacked_out;	/* SACK'd packets			*/
+	__u64 bytes_received;	/* RFC4898 tcpEStatsAppHCThruOctetsReceived
+				 * sum(delta(rcv_nxt)), or how many bytes
+				 * were acked.
 				 */
-	__u32 src_port;		/* Allows 4-byte read.
-				 * Stored in host byte order
+	__u64 bytes_acked;	/* RFC4898 tcpEStatsAppHCThruOctetsAcked
+				 * sum(delta(snd_una)), or how many bytes
+				 * were acked.
 				 */
 };
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 bpf-next 3/7] bpf: Refactor sock_ops_convert_ctx_access
From: Martin KaFai Lau @ 2019-02-10  7:22 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190210072220.1530061-1-kafai@fb.com>

The next patch will introduce a new "struct bpf_tcp_sock" which
exposes the same tcp_sock's fields already exposed in
"struct bpf_sock_ops".

This patch refactor the existing convert_ctx_access() codes for
"struct bpf_sock_ops" to get them ready to be reused for
"struct bpf_tcp_sock".  The "rtt_min" is not refactored
in this patch because its handling is different from other
fields.

The SOCK_OPS_GET_TCP_SOCK_FIELD is new. All other SOCK_OPS_XXX_FIELD
changes are code move only.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/filter.c | 287 ++++++++++++++++++++--------------------------
 1 file changed, 127 insertions(+), 160 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 01bb64bf2b5e..c0d7b9ef279f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5030,6 +5030,54 @@ static const struct bpf_func_proto bpf_lwt_seg6_adjust_srh_proto = {
 };
 #endif /* CONFIG_IPV6_SEG6_BPF */
 
+#define CONVERT_COMMON_TCP_SOCK_FIELDS(md_type, CONVERT)		\
+do {									\
+	switch (si->off) {						\
+	case offsetof(md_type, snd_cwnd):				\
+		CONVERT(snd_cwnd); break;				\
+	case offsetof(md_type, srtt_us):				\
+		CONVERT(srtt_us); break;				\
+	case offsetof(md_type, snd_ssthresh):				\
+		CONVERT(snd_ssthresh); break;				\
+	case offsetof(md_type, rcv_nxt):				\
+		CONVERT(rcv_nxt); break;				\
+	case offsetof(md_type, snd_nxt):				\
+		CONVERT(snd_nxt); break;				\
+	case offsetof(md_type, snd_una):				\
+		CONVERT(snd_una); break;				\
+	case offsetof(md_type, mss_cache):				\
+		CONVERT(mss_cache); break;				\
+	case offsetof(md_type, ecn_flags):				\
+		CONVERT(ecn_flags); break;				\
+	case offsetof(md_type, rate_delivered):				\
+		CONVERT(rate_delivered); break;				\
+	case offsetof(md_type, rate_interval_us):			\
+		CONVERT(rate_interval_us); break;			\
+	case offsetof(md_type, packets_out):				\
+		CONVERT(packets_out); break;				\
+	case offsetof(md_type, retrans_out):				\
+		CONVERT(retrans_out); break;				\
+	case offsetof(md_type, total_retrans):				\
+		CONVERT(total_retrans); break;				\
+	case offsetof(md_type, segs_in):				\
+		CONVERT(segs_in); break;				\
+	case offsetof(md_type, data_segs_in):				\
+		CONVERT(data_segs_in); break;				\
+	case offsetof(md_type, segs_out):				\
+		CONVERT(segs_out); break;				\
+	case offsetof(md_type, data_segs_out):				\
+		CONVERT(data_segs_out); break;				\
+	case offsetof(md_type, lost_out):				\
+		CONVERT(lost_out); break;				\
+	case offsetof(md_type, sacked_out):				\
+		CONVERT(sacked_out); break;				\
+	case offsetof(md_type, bytes_received):				\
+		CONVERT(bytes_received); break;				\
+	case offsetof(md_type, bytes_acked):				\
+		CONVERT(bytes_acked); break;				\
+	}								\
+} while (0)
+
 #ifdef CONFIG_INET
 static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 			      int dif, int sdif, u8 family, u8 proto)
@@ -7196,6 +7244,85 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 	struct bpf_insn *insn = insn_buf;
 	int off;
 
+/* Helper macro for adding read access to tcp_sock or sock fields. */
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
+	do {								      \
+		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
+			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern,     \
+						is_fullsock),		      \
+				      si->dst_reg, si->src_reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       is_fullsock));		      \
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 2);	      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern, sk),\
+				      si->dst_reg, si->src_reg,		      \
+				      offsetof(struct bpf_sock_ops_kern, sk));\
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,		      \
+						       OBJ_FIELD),	      \
+				      si->dst_reg, si->dst_reg,		      \
+				      offsetof(OBJ, OBJ_FIELD));	      \
+	} while (0)
+
+#define SOCK_OPS_GET_TCP_SOCK_FIELD(FIELD) \
+		SOCK_OPS_GET_FIELD(FIELD, FIELD, struct tcp_sock)
+
+/* Helper macro for adding write access to tcp_sock or sock fields.
+ * The macro is called with two registers, dst_reg which contains a pointer
+ * to ctx (context) and src_reg which contains the value that should be
+ * stored. However, we need an additional register since we cannot overwrite
+ * dst_reg because it may be used later in the program.
+ * Instead we "borrow" one of the other register. We first save its value
+ * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
+ * it at the end of the macro.
+ */
+#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
+	do {								      \
+		int reg = BPF_REG_9;					      \
+		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
+			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+		if (si->dst_reg == reg || si->src_reg == reg)		      \
+			reg--;						      \
+		if (si->dst_reg == reg || si->src_reg == reg)		      \
+			reg--;						      \
+		*insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       temp));			      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern,     \
+						is_fullsock),		      \
+				      reg, si->dst_reg,			      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       is_fullsock));		      \
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);		      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern, sk),\
+				      reg, si->dst_reg,			      \
+				      offsetof(struct bpf_sock_ops_kern, sk));\
+		*insn++ = BPF_STX_MEM(BPF_FIELD_SIZEOF(OBJ, OBJ_FIELD),	      \
+				      reg, si->src_reg,			      \
+				      offsetof(OBJ, OBJ_FIELD));	      \
+		*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->dst_reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       temp));			      \
+	} while (0)
+
+#define SOCK_OPS_GET_OR_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ, TYPE)	      \
+	do {								      \
+		if (TYPE == BPF_WRITE)					      \
+			SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
+		else							      \
+			SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
+	} while (0)
+
+	CONVERT_COMMON_TCP_SOCK_FIELDS(struct bpf_sock_ops,
+				       SOCK_OPS_GET_TCP_SOCK_FIELD);
+
+	if (insn > insn_buf)
+		return insn - insn_buf;
+
 	switch (si->off) {
 	case offsetof(struct bpf_sock_ops, op) ...
 	     offsetof(struct bpf_sock_ops, replylong[3]):
@@ -7353,175 +7480,15 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 				      FIELD_SIZEOF(struct minmax_sample, t));
 		break;
 
-/* Helper macro for adding read access to tcp_sock or sock fields. */
-#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
-	do {								      \
-		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
-			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern,     \
-						is_fullsock),		      \
-				      si->dst_reg, si->src_reg,		      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       is_fullsock));		      \
-		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 2);	      \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern, sk),\
-				      si->dst_reg, si->src_reg,		      \
-				      offsetof(struct bpf_sock_ops_kern, sk));\
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,		      \
-						       OBJ_FIELD),	      \
-				      si->dst_reg, si->dst_reg,		      \
-				      offsetof(OBJ, OBJ_FIELD));	      \
-	} while (0)
-
-/* Helper macro for adding write access to tcp_sock or sock fields.
- * The macro is called with two registers, dst_reg which contains a pointer
- * to ctx (context) and src_reg which contains the value that should be
- * stored. However, we need an additional register since we cannot overwrite
- * dst_reg because it may be used later in the program.
- * Instead we "borrow" one of the other register. We first save its value
- * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
- * it at the end of the macro.
- */
-#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
-	do {								      \
-		int reg = BPF_REG_9;					      \
-		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
-			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
-		if (si->dst_reg == reg || si->src_reg == reg)		      \
-			reg--;						      \
-		if (si->dst_reg == reg || si->src_reg == reg)		      \
-			reg--;						      \
-		*insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,		      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       temp));			      \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern,     \
-						is_fullsock),		      \
-				      reg, si->dst_reg,			      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       is_fullsock));		      \
-		*insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);		      \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern, sk),\
-				      reg, si->dst_reg,			      \
-				      offsetof(struct bpf_sock_ops_kern, sk));\
-		*insn++ = BPF_STX_MEM(BPF_FIELD_SIZEOF(OBJ, OBJ_FIELD),	      \
-				      reg, si->src_reg,			      \
-				      offsetof(OBJ, OBJ_FIELD));	      \
-		*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->dst_reg,		      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       temp));			      \
-	} while (0)
-
-#define SOCK_OPS_GET_OR_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ, TYPE)	      \
-	do {								      \
-		if (TYPE == BPF_WRITE)					      \
-			SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
-		else							      \
-			SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
-	} while (0)
-
-	case offsetof(struct bpf_sock_ops, snd_cwnd):
-		SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, srtt_us):
-		SOCK_OPS_GET_FIELD(srtt_us, srtt_us, struct tcp_sock);
-		break;
-
 	case offsetof(struct bpf_sock_ops, bpf_sock_ops_cb_flags):
 		SOCK_OPS_GET_FIELD(bpf_sock_ops_cb_flags, bpf_sock_ops_cb_flags,
 				   struct tcp_sock);
 		break;
 
-	case offsetof(struct bpf_sock_ops, snd_ssthresh):
-		SOCK_OPS_GET_FIELD(snd_ssthresh, snd_ssthresh, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, rcv_nxt):
-		SOCK_OPS_GET_FIELD(rcv_nxt, rcv_nxt, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, snd_nxt):
-		SOCK_OPS_GET_FIELD(snd_nxt, snd_nxt, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, snd_una):
-		SOCK_OPS_GET_FIELD(snd_una, snd_una, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, mss_cache):
-		SOCK_OPS_GET_FIELD(mss_cache, mss_cache, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, ecn_flags):
-		SOCK_OPS_GET_FIELD(ecn_flags, ecn_flags, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, rate_delivered):
-		SOCK_OPS_GET_FIELD(rate_delivered, rate_delivered,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, rate_interval_us):
-		SOCK_OPS_GET_FIELD(rate_interval_us, rate_interval_us,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, packets_out):
-		SOCK_OPS_GET_FIELD(packets_out, packets_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, retrans_out):
-		SOCK_OPS_GET_FIELD(retrans_out, retrans_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, total_retrans):
-		SOCK_OPS_GET_FIELD(total_retrans, total_retrans,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, segs_in):
-		SOCK_OPS_GET_FIELD(segs_in, segs_in, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, data_segs_in):
-		SOCK_OPS_GET_FIELD(data_segs_in, data_segs_in, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, segs_out):
-		SOCK_OPS_GET_FIELD(segs_out, segs_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, data_segs_out):
-		SOCK_OPS_GET_FIELD(data_segs_out, data_segs_out,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, lost_out):
-		SOCK_OPS_GET_FIELD(lost_out, lost_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, sacked_out):
-		SOCK_OPS_GET_FIELD(sacked_out, sacked_out, struct tcp_sock);
-		break;
-
 	case offsetof(struct bpf_sock_ops, sk_txhash):
 		SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
 					  struct sock, type);
 		break;
-
-	case offsetof(struct bpf_sock_ops, bytes_received):
-		SOCK_OPS_GET_FIELD(bytes_received, bytes_received,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, bytes_acked):
-		SOCK_OPS_GET_FIELD(bytes_acked, bytes_acked, struct tcp_sock);
-		break;
-
 	}
 	return insn - insn_buf;
 }
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 bpf-next 7/7] bpf: Add test_sock_fields for skb->sk and bpf_tcp_sock
From: Martin KaFai Lau @ 2019-02-10  7:22 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190210072220.1530061-1-kafai@fb.com>

This patch adds a C program to show the usage on
skb->sk and bpf_tcp_sock.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/testing/selftests/bpf/Makefile          |   6 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   4 +
 .../testing/selftests/bpf/test_sock_fields.c  | 327 ++++++++++++++++++
 .../selftests/bpf/test_sock_fields_kern.c     | 152 ++++++++
 4 files changed, 487 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sock_fields.c
 create mode 100644 tools/testing/selftests/bpf/test_sock_fields_kern.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 383d2ff13fc7..c7e1e3255448 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -23,7 +23,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
 	test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \
 	test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \
-	test_netcnt test_tcpnotify_user
+	test_netcnt test_tcpnotify_user test_sock_fields
 
 BPF_OBJ_FILES = \
 	test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \
@@ -35,7 +35,8 @@ BPF_OBJ_FILES = \
 	sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
 	test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o test_xdp_vlan.o \
-	xdp_dummy.o test_map_in_map.o test_spin_lock.o test_map_lock.o
+	xdp_dummy.o test_map_in_map.o test_spin_lock.o test_map_lock.o \
+	test_sock_fields_kern.o
 
 # Objects are built with default compilation flags and with sub-register
 # code-gen enabled.
@@ -111,6 +112,7 @@ $(OUTPUT)/test_progs: trace_helpers.c
 $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c
 $(OUTPUT)/test_cgroup_storage: cgroup_helpers.c
 $(OUTPUT)/test_netcnt: cgroup_helpers.c
+$(OUTPUT)/test_sock_fields: cgroup_helpers.c
 
 .PHONY: force
 
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 6a0ce0f055c5..d9999f1ed1d2 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -176,6 +176,10 @@ static void (*bpf_spin_lock)(struct bpf_spin_lock *lock) =
 	(void *) BPF_FUNC_spin_lock;
 static void (*bpf_spin_unlock)(struct bpf_spin_lock *lock) =
 	(void *) BPF_FUNC_spin_unlock;
+static struct bpf_sock *(*bpf_sk_fullsock)(struct bpf_sock *sk) =
+	(void *) BPF_FUNC_sk_fullsock;
+static struct bpf_tcp_sock *(*bpf_tcp_sock)(struct bpf_sock *sk) =
+	(void *) BPF_FUNC_tcp_sock;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/test_sock_fields.c b/tools/testing/selftests/bpf/test_sock_fields.c
new file mode 100644
index 000000000000..9bb58369b481
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sock_fields.c
@@ -0,0 +1,327 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+
+#include <sys/socket.h>
+#include <sys/epoll.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "cgroup_helpers.h"
+
+enum bpf_array_idx {
+	SRV_IDX,
+	CLI_IDX,
+	__NR_BPF_ARRAY_IDX,
+};
+
+#define CHECK(condition, tag, format...) ({				\
+	int __ret = !!(condition);					\
+	if (__ret) {							\
+		printf("%s(%d):FAIL:%s ", __func__, __LINE__, tag);	\
+		printf(format);						\
+		printf("\n");						\
+		exit(-1);						\
+	}								\
+})
+
+#define TEST_CGROUP "/test-bpf-sock-fields"
+#define DATA "Hello BPF!"
+#define DATA_LEN sizeof(DATA)
+
+static struct sockaddr_in6 srv_sa6, cli_sa6;
+static int linum_map_fd;
+static int addr_map_fd;
+static int tp_map_fd;
+static int sk_map_fd;
+static __u32 srv_idx = SRV_IDX;
+static __u32 cli_idx = CLI_IDX;
+
+static void init_loopback6(struct sockaddr_in6 *sa6)
+{
+	memset(sa6, 0, sizeof(*sa6));
+	sa6->sin6_family = AF_INET6;
+	sa6->sin6_addr = in6addr_loopback;
+}
+
+static void print_sk(const struct bpf_sock *sk)
+{
+	char src_ip4[24], dst_ip4[24];
+	char src_ip6[64], dst_ip6[64];
+
+	inet_ntop(AF_INET, &sk->src_ip4, src_ip4, sizeof(src_ip4));
+	inet_ntop(AF_INET6, &sk->src_ip6, src_ip6, sizeof(src_ip6));
+	inet_ntop(AF_INET, &sk->dst_ip4, dst_ip4, sizeof(dst_ip4));
+	inet_ntop(AF_INET6, &sk->dst_ip6, dst_ip6, sizeof(dst_ip6));
+
+	printf("state:%u bound_dev_if:%u family:%u type:%u protocol:%u mark:%u priority:%u "
+	       "src_ip4:%x(%s) src_ip6:%x:%x:%x:%x(%s) src_port:%u "
+	       "dst_ip4:%x(%s) dst_ip6:%x:%x:%x:%x(%s) dst_port:%u\n",
+	       sk->state, sk->bound_dev_if, sk->family, sk->type, sk->protocol,
+	       sk->mark, sk->priority,
+	       sk->src_ip4, src_ip4,
+	       sk->src_ip6[0], sk->src_ip6[1], sk->src_ip6[2], sk->src_ip6[3],
+	       src_ip6, sk->src_port,
+	       sk->dst_ip4, dst_ip4,
+	       sk->dst_ip6[0], sk->dst_ip6[1], sk->dst_ip6[2], sk->dst_ip6[3],
+	       dst_ip6, ntohs(sk->dst_port));
+}
+
+static void print_tp(const struct bpf_tcp_sock *tp)
+{
+	printf("snd_cwnd:%u srtt_us:%u rtt_min:%u snd_ssthresh:%u rcv_nxt:%u "
+	       "snd_nxt:%u snd:una:%u mss_cache:%u ecn_flags:%u "
+	       "rate_delivered:%u rate_interval_us:%u packets_out:%u "
+	       "retrans_out:%u total_retrans:%u segs_in:%u data_segs_in:%u "
+	       "segs_out:%u data_segs_out:%u lost_out:%u sacked_out:%u "
+	       "bytes_received:%llu bytes_acked:%llu\n",
+	       tp->snd_cwnd, tp->srtt_us, tp->rtt_min, tp->snd_ssthresh,
+	       tp->rcv_nxt, tp->snd_nxt, tp->snd_una, tp->mss_cache,
+	       tp->ecn_flags, tp->rate_delivered, tp->rate_interval_us,
+	       tp->packets_out, tp->retrans_out, tp->total_retrans,
+	       tp->segs_in, tp->data_segs_in, tp->segs_out,
+	       tp->data_segs_out, tp->lost_out, tp->sacked_out,
+	       tp->bytes_received, tp->bytes_acked);
+}
+
+static void check_result(void)
+{
+	struct bpf_tcp_sock srv_tp, cli_tp;
+	struct bpf_sock srv_sk, cli_sk;
+	__u32 linum, idx0 = 0;
+	int err;
+
+	err = bpf_map_lookup_elem(linum_map_fd, &idx0, &linum);
+	CHECK(err == -1, "bpf_map_lookup_elem(linum_map_fd)",
+	      "err:%d errno:%d", err, errno);
+
+	err = bpf_map_lookup_elem(sk_map_fd, &srv_idx, &srv_sk);
+	CHECK(err == -1, "bpf_map_lookup_elem(sk_map_fd, &srv_idx)",
+	      "err:%d errno:%d", err, errno);
+	err = bpf_map_lookup_elem(tp_map_fd, &srv_idx, &srv_tp);
+	CHECK(err == -1, "bpf_map_lookup_elem(tp_map_fd, &srv_idx)",
+	      "err:%d errno:%d", err, errno);
+
+	err = bpf_map_lookup_elem(sk_map_fd, &cli_idx, &cli_sk);
+	CHECK(err == -1, "bpf_map_lookup_elem(sk_map_fd, &cli_idx)",
+	      "err:%d errno:%d", err, errno);
+	err = bpf_map_lookup_elem(tp_map_fd, &cli_idx, &cli_tp);
+	CHECK(err == -1, "bpf_map_lookup_elem(tp_map_fd, &cli_idx)",
+	      "err:%d errno:%d", err, errno);
+
+	printf("srv_sk: ");
+	print_sk(&srv_sk);
+	printf("\n");
+
+	printf("cli_sk: ");
+	print_sk(&cli_sk);
+	printf("\n");
+
+	printf("srv_tp: ");
+	print_tp(&srv_tp);
+	printf("\n");
+
+	printf("cli_tp: ");
+	print_tp(&cli_tp);
+	printf("\n");
+
+	CHECK(srv_sk.state == 10 ||
+	      !srv_sk.state ||
+	      srv_sk.family != AF_INET6 ||
+	      srv_sk.protocol != IPPROTO_TCP ||
+	      memcmp(srv_sk.src_ip6, &in6addr_loopback,
+		     sizeof(srv_sk.src_ip6)) ||
+	      memcmp(srv_sk.dst_ip6, &in6addr_loopback,
+		     sizeof(srv_sk.dst_ip6)) ||
+	      srv_sk.src_port != ntohs(srv_sa6.sin6_port) ||
+	      srv_sk.dst_port != cli_sa6.sin6_port,
+	      "Unexpected srv_sk", "Check srv_sk output. linum:%u", linum);
+
+	CHECK(cli_sk.state == 10 ||
+	      !cli_sk.state ||
+	      cli_sk.family != AF_INET6 ||
+	      cli_sk.protocol != IPPROTO_TCP ||
+	      memcmp(cli_sk.src_ip6, &in6addr_loopback,
+		     sizeof(cli_sk.src_ip6)) ||
+	      memcmp(cli_sk.dst_ip6, &in6addr_loopback,
+		     sizeof(cli_sk.dst_ip6)) ||
+	      cli_sk.src_port != ntohs(cli_sa6.sin6_port) ||
+	      cli_sk.dst_port != srv_sa6.sin6_port,
+	      "Unexpected cli_sk", "Check cli_sk output. linum:%u", linum);
+
+	CHECK(srv_tp.data_segs_out != 1 ||
+	      srv_tp.data_segs_in ||
+	      srv_tp.snd_cwnd != 10 ||
+	      srv_tp.total_retrans ||
+	      srv_tp.bytes_acked != DATA_LEN,
+	      "Unexpected srv_tp", "Check srv_tp output. linum:%u", linum);
+
+	CHECK(cli_tp.data_segs_out ||
+	      cli_tp.data_segs_in != 1 ||
+	      cli_tp.snd_cwnd != 10 ||
+	      cli_tp.total_retrans ||
+	      cli_tp.bytes_received != DATA_LEN,
+	      "Unexpected cli_tp", "Check cli_tp output. linum:%u", linum);
+}
+
+static void test(void)
+{
+	int listen_fd, cli_fd, accept_fd, epfd, err;
+	struct epoll_event ev;
+	socklen_t addrlen;
+
+	addrlen = sizeof(struct sockaddr_in6);
+	ev.events = EPOLLIN;
+
+	epfd = epoll_create(1);
+	CHECK(epfd == -1, "epoll_create()", "epfd:%d errno:%d", epfd, errno);
+
+	/* Prepare listen_fd */
+	listen_fd = socket(AF_INET6, SOCK_STREAM | SOCK_NONBLOCK, 0);
+	CHECK(listen_fd == -1, "socket()", "listen_fd:%d errno:%d",
+	      listen_fd, errno);
+
+	init_loopback6(&srv_sa6);
+	err = bind(listen_fd, (struct sockaddr *)&srv_sa6, sizeof(srv_sa6));
+	CHECK(err, "bind(listen_fd)", "err:%d errno:%d", err, errno);
+
+	err = getsockname(listen_fd, (struct sockaddr *)&srv_sa6, &addrlen);
+	CHECK(err, "getsockname(listen_fd)", "err:%d errno:%d", err, errno);
+
+	err = listen(listen_fd, 1);
+	CHECK(err, "listen(listen_fd)", "err:%d errno:%d", err, errno);
+
+	/* Prepare cli_fd */
+	cli_fd = socket(AF_INET6, SOCK_STREAM | SOCK_NONBLOCK, 0);
+	CHECK(cli_fd == -1, "socket()", "cli_fd:%d errno:%d", cli_fd, errno);
+
+	init_loopback6(&cli_sa6);
+	err = bind(cli_fd, (struct sockaddr *)&cli_sa6, sizeof(cli_sa6));
+	CHECK(err, "bind(cli_fd)", "err:%d errno:%d", err, errno);
+
+	err = getsockname(cli_fd, (struct sockaddr *)&cli_sa6, &addrlen);
+	CHECK(err, "getsockname(cli_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	/* Update addr_map with srv_sa6 and cli_sa6 */
+	err = bpf_map_update_elem(addr_map_fd, &srv_idx, &srv_sa6, 0);
+	CHECK(err, "map_update", "err:%d errno:%d", err, errno);
+
+	err = bpf_map_update_elem(addr_map_fd, &cli_idx, &cli_sa6, 0);
+	CHECK(err, "map_update", "err:%d errno:%d", err, errno);
+
+	/* Connect from cli_sa6 to srv_sa6 */
+	err = connect(cli_fd, (struct sockaddr *)&srv_sa6, addrlen);
+	printf("srv_sa6.sin6_port:%u cli_sa6.sin6_port:%u\n\n",
+	       ntohs(srv_sa6.sin6_port), ntohs(cli_sa6.sin6_port));
+	CHECK(err && errno != EINPROGRESS,
+	      "connect(cli_fd)", "err:%d errno:%d", err, errno);
+
+	ev.data.fd = listen_fd;
+	err = epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
+	CHECK(err, "epoll_ctl(EPOLL_CTL_ADD, listen_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	/* Accept the connection */
+	/* Have some timeout in accept(listen_fd). Just in case. */
+	err = epoll_wait(epfd, &ev, 1, 1000);
+	CHECK(err != 1 || ev.data.fd != listen_fd,
+	      "epoll_wait(listen_fd)",
+	      "err:%d errno:%d ev.data.fd:%d listen_fd:%d",
+	      err, errno, ev.data.fd, listen_fd);
+
+	accept_fd = accept(listen_fd, NULL, NULL);
+	CHECK(accept_fd == -1, "accept(listen_fd)", "accept_fd:%d errno:%d",
+	      accept_fd, errno);
+	close(listen_fd);
+
+	/* Send some data from accept_fd to cli_fd */
+	err = send(accept_fd, DATA, DATA_LEN, 0);
+	CHECK(err != DATA_LEN, "send(accept_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	/* Have some timeout in recv(cli_fd). Just in case. */
+	ev.data.fd = cli_fd;
+	err = epoll_ctl(epfd, EPOLL_CTL_ADD, cli_fd, &ev);
+	CHECK(err, "epoll_ctl(EPOLL_CTL_ADD, cli_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	err = epoll_wait(epfd, &ev, 1, 1000);
+	CHECK(err != 1 || ev.data.fd != cli_fd,
+	      "epoll_wait(cli_fd)", "err:%d errno:%d ev.data.fd:%d cli_fd:%d",
+	      err, errno, ev.data.fd, cli_fd);
+
+	err = recv(cli_fd, NULL, 0, MSG_TRUNC);
+	CHECK(err, "recv(cli_fd)", "err:%d errno:%d", err, errno);
+
+	close(epfd);
+	close(accept_fd);
+	close(cli_fd);
+
+	check_result();
+}
+
+int main(int argc, char **argv)
+{
+	struct bpf_prog_load_attr attr = {
+		.file = "test_sock_fields_kern.o",
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+		.expected_attach_type = BPF_CGROUP_INET_EGRESS,
+	};
+	int cgroup_fd, prog_fd, err;
+	struct bpf_object *obj;
+	struct bpf_map *map;
+
+	err = setup_cgroup_environment();
+	CHECK(err, "setup_cgroup_environment()", "err:%d errno:%d",
+	      err, errno);
+
+	atexit(cleanup_cgroup_environment);
+
+	/* Create a cgroup, get fd, and join it */
+	cgroup_fd = create_and_get_cgroup(TEST_CGROUP);
+	CHECK(cgroup_fd == -1, "create_and_get_cgroup()",
+	      "cgroup_fd:%d errno:%d", cgroup_fd, errno);
+
+	err = join_cgroup(TEST_CGROUP);
+	CHECK(err, "join_cgroup", "err:%d errno:%d", err, errno);
+
+	err = bpf_prog_load_xattr(&attr, &obj, &prog_fd);
+	CHECK(err, "bpf_prog_load_xattr()", "err:%d", err);
+
+	err = bpf_prog_attach(prog_fd, cgroup_fd, BPF_CGROUP_INET_EGRESS, 0);
+	CHECK(err == -1, "bpf_prog_attach(CPF_CGROUP_INET_EGRESS)",
+	      "err:%d errno%d", err, errno);
+	close(cgroup_fd);
+
+	map = bpf_object__find_map_by_name(obj, "addr_map");
+	CHECK(!map, "cannot find addr_map", "(null)");
+	addr_map_fd = bpf_map__fd(map);
+
+	map = bpf_object__find_map_by_name(obj, "sock_result_map");
+	CHECK(!map, "cannot find sock_result_map", "(null)");
+	sk_map_fd = bpf_map__fd(map);
+
+	map = bpf_object__find_map_by_name(obj, "tcp_sock_result_map");
+	CHECK(!map, "cannot find tcp_sock_result_map", "(null)");
+	tp_map_fd = bpf_map__fd(map);
+
+	map = bpf_object__find_map_by_name(obj, "linum_map");
+	CHECK(!map, "cannot find linum_map", "(null)");
+	linum_map_fd = bpf_map__fd(map);
+
+	test();
+
+	bpf_object__close(obj);
+	cleanup_cgroup_environment();
+
+	printf("PASS\n");
+
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/test_sock_fields_kern.c b/tools/testing/selftests/bpf/test_sock_fields_kern.c
new file mode 100644
index 000000000000..de1a43e8f610
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sock_fields_kern.c
@@ -0,0 +1,152 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+
+#include <linux/bpf.h>
+#include <netinet/in.h>
+#include <stdbool.h>
+
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+enum bpf_array_idx {
+	SRV_IDX,
+	CLI_IDX,
+	__NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") addr_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct sockaddr_in6),
+	.max_entries = __NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") sock_result_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct bpf_sock),
+	.max_entries = __NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") tcp_sock_result_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct bpf_tcp_sock),
+	.max_entries = __NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") linum_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
+static bool is_loopback6(__u32 *a6)
+{
+	return !a6[0] && !a6[1] && !a6[2] && a6[3] == bpf_htonl(1);
+}
+
+static void skcpy(struct bpf_sock *dst,
+		  const struct bpf_sock *src)
+{
+	dst->bound_dev_if = src->bound_dev_if;
+	dst->family = src->family;
+	dst->type = src->type;
+	dst->protocol = src->protocol;
+	dst->mark = src->mark;
+	dst->priority = src->priority;
+	dst->src_ip4 = src->src_ip4;
+	dst->src_ip6[0] = src->src_ip6[0];
+	dst->src_ip6[1] = src->src_ip6[1];
+	dst->src_ip6[2] = src->src_ip6[2];
+	dst->src_ip6[3] = src->src_ip6[3];
+	dst->src_port = src->src_port;
+	dst->dst_ip4 = src->dst_ip4;
+	dst->dst_ip6[0] = src->dst_ip6[0];
+	dst->dst_ip6[1] = src->dst_ip6[1];
+	dst->dst_ip6[2] = src->dst_ip6[2];
+	dst->dst_ip6[3] = src->dst_ip6[3];
+	dst->dst_port = src->dst_port;
+	dst->state = src->state;
+}
+
+static void tpcpy(struct bpf_tcp_sock *dst,
+		  const struct bpf_tcp_sock *src)
+{
+	dst->snd_cwnd = src->snd_cwnd;
+	dst->srtt_us = src->srtt_us;
+	dst->rtt_min = src->rtt_min;
+	dst->snd_ssthresh = src->snd_ssthresh;
+	dst->rcv_nxt = src->rcv_nxt;
+	dst->snd_nxt = src->snd_nxt;
+	dst->snd_una = src->snd_una;
+	dst->mss_cache = src->mss_cache;
+	dst->ecn_flags = src->ecn_flags;
+	dst->rate_delivered = src->rate_delivered;
+	dst->rate_interval_us = src->rate_interval_us;
+	dst->packets_out = src->packets_out;
+	dst->retrans_out = src->retrans_out;
+	dst->total_retrans = src->total_retrans;
+	dst->segs_in = src->segs_in;
+	dst->data_segs_in = src->data_segs_in;
+	dst->segs_out = src->segs_out;
+	dst->data_segs_out = src->data_segs_out;
+	dst->lost_out = src->lost_out;
+	dst->sacked_out = src->sacked_out;
+	dst->bytes_received = src->bytes_received;
+	dst->bytes_acked = src->bytes_acked;
+}
+
+#define RETURN {						\
+	linum = __LINE__;					\
+	bpf_map_update_elem(&linum_map, &idx0, &linum, 0);	\
+	return 1;						\
+}
+
+SEC("cgroup_skb/egress")
+int read_sock_fields(struct __sk_buff *skb)
+{
+	__u32 srv_idx = SRV_IDX, cli_idx = CLI_IDX, idx;
+	struct sockaddr_in6 *srv_sa6, *cli_sa6;
+	struct bpf_tcp_sock *tp, *tp_ret;
+	struct bpf_sock *sk, *sk_ret;
+	__u32 linum, idx0 = 0;
+
+	sk = skb->sk;
+	if (!sk || sk->state == 10)
+		RETURN;
+
+	sk = bpf_sk_fullsock(sk);
+	if (!sk || sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP ||
+	    !is_loopback6(sk->src_ip6))
+		RETURN;
+
+	tp = bpf_tcp_sock(sk);
+	if (!tp)
+		RETURN;
+
+	srv_sa6 = bpf_map_lookup_elem(&addr_map, &srv_idx);
+	cli_sa6 = bpf_map_lookup_elem(&addr_map, &cli_idx);
+	if (!srv_sa6 || !cli_sa6)
+		RETURN;
+
+	if (sk->src_port == bpf_ntohs(srv_sa6->sin6_port))
+		idx = srv_idx;
+	else if (sk->src_port == bpf_ntohs(cli_sa6->sin6_port))
+		idx = cli_idx;
+	else
+		RETURN;
+
+	sk_ret = bpf_map_lookup_elem(&sock_result_map, &idx);
+	tp_ret = bpf_map_lookup_elem(&tcp_sock_result_map, &idx);
+	if (!sk_ret || !tp_ret)
+		RETURN;
+
+	skcpy(sk_ret, sk);
+	tpcpy(tp_ret, tp);
+
+	RETURN;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 bpf-next 6/7] bpf: Add skb->sk, bpf_sk_fullsock and bpf_tcp_sock tests to test_verifer
From: Martin KaFai Lau @ 2019-02-10  7:22 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190210072220.1530061-1-kafai@fb.com>

This patch tests accessing the skb->sk and the new helpers,
bpf_sk_fullsock and bpf_tcp_sock.

The errstr of some existing "reference tracking" tests is changed
with s/bpf_sock/sock/ and s/socket/sock/ where "sock" is from the
verifier's reg_type_str[].

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/testing/selftests/bpf/bpf_util.h        |   9 +
 .../selftests/bpf/verifier/ref_tracking.c     |   4 +-
 tools/testing/selftests/bpf/verifier/sock.c   | 384 ++++++++++++++++++
 tools/testing/selftests/bpf/verifier/unpriv.c |   2 +-
 4 files changed, 396 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/sock.c

diff --git a/tools/testing/selftests/bpf/bpf_util.h b/tools/testing/selftests/bpf/bpf_util.h
index 315a44fa32af..197347031038 100644
--- a/tools/testing/selftests/bpf/bpf_util.h
+++ b/tools/testing/selftests/bpf/bpf_util.h
@@ -48,4 +48,13 @@ static inline unsigned int bpf_num_possible_cpus(void)
 # define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
 #endif
 
+#ifndef sizeof_field
+#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))
+#endif
+
+#ifndef offsetofend
+#define offsetofend(TYPE, MEMBER) \
+	(offsetof(TYPE, MEMBER)	+ sizeof_field(TYPE, MEMBER))
+#endif
+
 #endif /* __BPF_UTIL__ */
diff --git a/tools/testing/selftests/bpf/verifier/ref_tracking.c b/tools/testing/selftests/bpf/verifier/ref_tracking.c
index dc2cc823df2b..3ed3593bd8b6 100644
--- a/tools/testing/selftests/bpf/verifier/ref_tracking.c
+++ b/tools/testing/selftests/bpf/verifier/ref_tracking.c
@@ -547,7 +547,7 @@
 	BPF_EXIT_INSN(),
 	},
 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
-	.errstr = "cannot write into socket",
+	.errstr = "cannot write into sock",
 	.result = REJECT,
 },
 {
@@ -562,7 +562,7 @@
 	BPF_EXIT_INSN(),
 	},
 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
-	.errstr = "invalid bpf_sock access off=0 size=8",
+	.errstr = "invalid sock access off=0 size=8",
 	.result = REJECT,
 },
 {
diff --git a/tools/testing/selftests/bpf/verifier/sock.c b/tools/testing/selftests/bpf/verifier/sock.c
new file mode 100644
index 000000000000..0ddfdf76aba5
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/sock.c
@@ -0,0 +1,384 @@
+{
+	"skb->sk: no NULL check",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 0),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid mem access 'sock_common_or_null'",
+},
+{
+	"skb->sk: sk->family [non fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, offsetof(struct bpf_sock, family)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"skb->sk: sk->type [fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, offsetof(struct bpf_sock, type)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid sock_common access",
+},
+{
+	"bpf_sk_fullsock(skb->sk): no !skb->sk check",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "type=sock_common_or_null expected=sock_common",
+},
+{
+	"sk_fullsock(skb->sk): no NULL check on ret",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, type)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid mem access 'sock_or_null'",
+},
+{
+	"sk_fullsock(skb->sk): sk->type [fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, type)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): sk->family [non fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, family)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): sk->state [narrow load]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, state)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): sk->dst_port [narrow load]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, dst_port)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): sk->dst_port [load 2nd byte]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, dst_port) + 1),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid sock access",
+},
+{
+	"sk_fullsock(skb->sk): sk->dst_ip6 [load 2nd byte]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, dst_ip6[0]) + 1),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): sk->type [narrow load]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, type)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): sk->protocol [narrow load]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, protocol)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): beyond last field",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetofend(struct bpf_sock, state)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid sock access",
+},
+{
+	"bpf_tcp_sock(skb->sk): no !skb->sk check",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "type=sock_common_or_null expected=sock_common",
+},
+{
+	"bpf_tcp_sock(skb->sk): no NULL check on ret",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, snd_cwnd)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid mem access 'tcp_sock_or_null'",
+},
+{
+	"bpf_tcp_sock(skb->sk): tp->snd_cwnd",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, snd_cwnd)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"bpf_tcp_sock(skb->sk): tp->bytes_acked",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, bytes_acked)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"bpf_tcp_sock(skb->sk): beyond last field",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, offsetofend(struct bpf_tcp_sock, bytes_acked)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid tcp_sock access",
+},
+{
+	"bpf_tcp_sock(bpf_sk_fullsock(skb->sk)): tp->snd_cwnd",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, snd_cwnd)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"bpf_sk_release(skb->sk)",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 1),
+	BPF_EMIT_CALL(BPF_FUNC_sk_release),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.result = REJECT,
+	.errstr = "type=sock_common expected=sock",
+},
+{
+	"bpf_sk_release(bpf_sk_fullsock(skb->sk))",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_EMIT_CALL(BPF_FUNC_sk_release),
+	BPF_MOV64_IMM(BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.result = REJECT,
+	.errstr = "reference has not been acquired before",
+},
+{
+	"bpf_sk_release(bpf_tcp_sock(skb->sk))",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_EMIT_CALL(BPF_FUNC_sk_release),
+	BPF_MOV64_IMM(BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.result = REJECT,
+	.errstr = "type=tcp_sock expected=sock",
+},
diff --git a/tools/testing/selftests/bpf/verifier/unpriv.c b/tools/testing/selftests/bpf/verifier/unpriv.c
index 3e046695fad7..dbaf5be947b2 100644
--- a/tools/testing/selftests/bpf/verifier/unpriv.c
+++ b/tools/testing/selftests/bpf/verifier/unpriv.c
@@ -365,7 +365,7 @@
 	},
 	.result = REJECT,
 	//.errstr = "same insn cannot be used with different pointers",
-	.errstr = "cannot write into socket",
+	.errstr = "cannot write into sock",
 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
 },
 {
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2] net/packet: fix 4gb buffer limit due to overflow check
From: Kal Conley @ 2019-02-10  8:57 UTC (permalink / raw)
  To: davem
  Cc: kal.conley, Willem de Bruijn, Eric Dumazet, Alexander Duyck,
	Jeff Kirsher, Kirill Tkhai, Vincent Whitchurch, Li RongQing,
	Magnus Karlsson, netdev, linux-kernel
In-Reply-To: <20190209.190114.542890373094719579.davem@davemloft.net>

When calculating rb->frames_per_block * req->tp_block_nr the result
can overflow. Check it for overflow without limiting the total buffer
size to UINT_MAX.

This change fixes support for packet ring buffers >= UINT_MAX.

Fixes: 8f8d28e4d6d8 ("net/packet: fix overflow in check for tp_frame_nr")
Signed-off-by: Kal Conley <kal.conley@dectris.com>
---
Changes in v2:
- Add Signed-off-by and Fixes tag

 net/packet/af_packet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 3b1a78906bc0..1cd1d83a4be0 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -4292,7 +4292,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 		rb->frames_per_block = req->tp_block_size / req->tp_frame_size;
 		if (unlikely(rb->frames_per_block == 0))
 			goto out;
-		if (unlikely(req->tp_block_size > UINT_MAX / req->tp_block_nr))
+		if (unlikely(rb->frames_per_block > UINT_MAX / req->tp_block_nr))
 			goto out;
 		if (unlikely((rb->frames_per_block * req->tp_block_nr) !=
 					req->tp_frame_nr))
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH net-next 2/2] net: Change TCA_ACT_* to TCA_ID_* to match that of TCA_ID_POLICE
From: Eli Cohen @ 2019-02-10  8:19 UTC (permalink / raw)
  To: David Miller
  Cc: jhs, xiyou.wangcong, jiri, netdev, linux-kernel, simon.horman,
	jakub.kicinski, dirk.vandermerwe, francois.theron, quentin.monnet,
	john.hurley, edwin.peer
In-Reply-To: <20190207.100110.1557913033627638063.davem@davemloft.net>

On Thu, Feb 07, 2019 at 10:01:10AM -0800, David Miller wrote:
> From: Eli Cohen <eli@mellanox.com>
> Date: Thu,  7 Feb 2019 09:45:49 +0200
> 
> > diff --git a/net/sched/act_simple.c b/net/sched/act_simple.c
> > index 902957beceb3..d54cb608dbaf 100644
> > --- a/net/sched/act_simple.c
> > +++ b/net/sched/act_simple.c
> > @@ -19,8 +19,6 @@
> >  #include <net/netlink.h>
> >  #include <net/pkt_sched.h>
> >  
> > -#define TCA_ACT_SIMP 22
> > -
> >  #include <linux/tc_act/tc_defact.h>
> >  #include <net/tc_act/tc_defact.h>
> >  
> 
> I would do this in patch #1.

Sure, that was the intention. It just slipped off my fingers.

Will fix and send a new series.

> Actually, because you didn't, after patch #1 there are two #defines
> evaluated for this macro.  One in pkt_cls.h and one here.
> 
> The only reason the compiler doesn't warn and complain is because the
> definitions are identical.
> 
> Thank you.

^ permalink raw reply

* Re: Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!
From: Sander Eikelenboom @ 2019-02-10  9:16 UTC (permalink / raw)
  To: Heiner Kallweit, Eric Dumazet, Realtek linux nic maintainers,
	Eric Dumazet
  Cc: Linus Torvalds, linux-kernel, netdev
In-Reply-To: <824acd0b-920c-7554-4a63-d80c7de2a8b6@gmail.com>

On 09/02/2019 12:50, Heiner Kallweit wrote:
> On 09.02.2019 11:07, Sander Eikelenboom wrote:
>> On 09/02/2019 10:59, Heiner Kallweit wrote:
>>> On 09.02.2019 10:34, Sander Eikelenboom wrote:
>>>> On 09/02/2019 10:02, Heiner Kallweit wrote:
>>>>> On 09.02.2019 00:09, Eric Dumazet wrote:
>>>>>>
>>>>>>
>>>>>> On 02/08/2019 01:50 PM, Heiner Kallweit wrote:
>>>>>>> On 08.02.2019 22:45, Sander Eikelenboom wrote:
>>>>>>>> On 08/02/2019 22:22, Heiner Kallweit wrote:
>>>>>>>>> On 08.02.2019 21:55, Sander Eikelenboom wrote:
>>>>>>>>>> On 08/02/2019 19:52, Heiner Kallweit wrote:
>>>>>>>>>>> On 08.02.2019 19:29, Sander Eikelenboom wrote:
>>>>>>>>>>>> L.S.,
>>>>>>>>>>>>
>>>>>>>>>>>> While testing a linux 5.0-rc5 kernel (with some patches on top but they don't seem related) under Xen i the nasty splat below, 
>>>>>>>>>>>> that I haven encountered with Linux 4.20.x.
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately I haven't got a clear reproducer for this and bisecting could be nasty due to another (networking related) kernel bug.
>>>>>>>>>>>>
>>>>>>>>>>>> If you need more info, want me to run a debug patch etc., please feel free to ask.
>>>>>>>>>>>>
>>>>>>>>>>> Thanks for the report. However I see no change in the r8169 driver between
>>>>>>>>>>> 4.20 and 5.0 with regard to BQL code. Having said that the root cause could
>>>>>>>>>>> be somewhere else. Therefore I'm afraid a bisect will be needed.
>>>>>>>>>>
>>>>>>>>>> Hmm i did some diging and i think:
>>>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>>>>>> 620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add __netdev_sent_queue as variant of __netdev_tx_sent_queue
>>>>>>>>>>
>>>>>>>>> You're right. Thought this was added in 4.20 already.
>>>>>>>>> The BQL code pattern I copied from the mlx4 driver and so far I haven't heard about
>>>>>>>>> this issue from any user of physical hw. And due to the fact that a lot of mainboards
>>>>>>>>> have onboard Realtek network I have quite a few testers out there.
>>>>>>>>> Does the issue occur under specific circumstances like very high load?
>>>>>>>>
>>>>>>>> Yep, the box is already quite contented with the Xen VM's and if I remember correctly it occurred while kernel compiling
>>>>>>>> on the host.
>>>>>>>>
>>>>>>>>> If indeed the xmit_more patch causes the issue, I think we have to involve Eric Dumazet
>>>>>>>>> as author of the underlying changes.
>>>>>>>>
>>>>>>>> It could also be the barriers weren't that unneeded as assumed.
>>>>>>>
>>>>>>> The barriers were removed after adding xmit_more handling. Therefore it would be good to
>>>>>>> test also with only 
>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>> removed.
>>>>>>>
>>>>>>>> Since we are almost at RC6 i took the liberty to CC Eric now.
>>>>>>>>
>>>>>>> Sure, thanks.
>>>>>>>
>>>>>>>> BTW am i correct these patches are merely optimizations ?
>>>>>>>
>>>>>>> Yes
>>>>>>>
>>>>>>>> If so and concluding they revert cleanly, perhaps it should be considered at this point in the RC's
>>>>>>>> to revert them for 5.0 and try again for 5.1 ?
>>>>>>>>
>>>>>>> Before removing both it would be good to test with only the barrier-removal removed.
>>>>>>>
>>>>>>
>>>>>> Commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>> looks buggy to me, since the skb might have been freed already on another cpu when you call
>>>>>>
>>>>>> You could try :
>>>>>>
>>>>>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>>>>>> index 3624e67aef72c92ed6e908e2c99ac2d381210126..f907d484165d9fd775e81bf2bfb9aa4ddedb1c93 100644
>>>>>> --- a/drivers/net/ethernet/realtek/r8169.c
>>>>>> +++ b/drivers/net/ethernet/realtek/r8169.c
>>>>>> @@ -6070,6 +6070,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>         dma_addr_t mapping;
>>>>>>         u32 opts[2], len;
>>>>>>         bool stop_queue;
>>>>>> +       bool door_bell;
>>>>>>         int frags;
>>>>>>  
>>>>>>         if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
>>>>>> @@ -6116,6 +6117,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>         /* Force memory writes to complete before releasing descriptor */
>>>>>>         dma_wmb();
>>>>>>  
>>>>>> +       door_bell = __netdev_sent_queue(dev, skb->len, skb->xmit_more);
>>>>>> +
>>>>>>         txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
>>>>>>  
>>>>>>         /* Force all memory writes to complete before notifying device */
>>>>>> @@ -6127,7 +6130,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>         if (unlikely(stop_queue))
>>>>>>                 netif_stop_queue(dev);
>>>>>>  
>>>>>> -       if (__netdev_sent_queue(dev, skb->len, skb->xmit_more)) {
>>>>>> +       if (door_bell) {
>>>>>>                 RTL_W8(tp, TxPoll, NPQ);
>>>>>>                 mmiowb();
>>>>>>         }
>>>>>>
>>>>> Thanks a lot for checking and for the proposed fix.
>>>>> Sander, can you try with this patch on top of 5.0-rc5 w/o removing two two commits?
>>>>
>>>> I have done that already during the night .. the results:
>>>> - I can confirm 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 is the first commit which causes hitting the BUG_ON in lib/dynamic_queue_limits.c.
>>>>   (in other word, with only reverting bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 it still blows up).
>>>>
>>>> - The Eric's patch only applies cleanly with bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 reverted, so that's what I tested.
>>>>   The patch seems to prevent hitting the BUG_ON in lib/dynamic_queue_limits.c, it has run this night and I gave done a few kernel compiles
>>>>   this morning. How ever during these kernel compiles i'm getting a transmit queue timeout which i haven't seen with 4.20.x, although i regularly
>>>>   compile kernels in the same way as I do now. The only thing I can't say if that is due to this change, or if it's again something else.
>>>>   Which makes me somewhat inclined to go testing the complete revert some more and see if I can trigger the queue timeout on that or not.
>>>>
>>>>   If I can, it is a separate issue.
>>>>   If I can't it seems even with a patch it still seems as a regression in comparison with 4.20.x, for which
>>>>   a revert would be the right thing to do (since as you indicated these are merely optimizations), 
>>>>   which would give us more time for 5.1 to try to solve things on top of the 5.0-release-to-be.
>>>>   (especially since I seem to still have other issues which need to be sorted out and time is limited)
>>>>
>>>>   The timeout in question:
>>>>         [28336.869479] NETDEV WATCHDOG: eth1 (r8169): transmit queue 0 timed out
>>>>         [28336.881498] WARNING: CPU: 0 PID: 6925 at net/sched/sch_generic.c:461 dev_watchdog+0x20b/0x210
>>>>         [28336.893358] Modules linked in:
>>>>         [28336.904106] CPU: 0 PID: 6925 Comm: cc1 Tainted: G      D           5.0.0-rc5-20190208-thp-net-florian-rtl8169-eric-doflr+ #1
>>>>         [28336.917385] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
>>>>         [28336.928988] RIP: e030:dev_watchdog+0x20b/0x210
>>>>         [28336.940623] Code: 00 49 63 4e e0 eb 90 4c 89 e7 c6 05 ad d8 f1 00 01 e8 a9 32 fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7 50 59 89 82 e8 e5 92 4d ff <0f> 0b eb c0 90 48 c7 47 08 00 00 00 00 48 c7 07 00 00 00 00 0f b7
>>>>         [28336.965265] RSP: e02b:ffff88807d403ea0 EFLAGS: 00010286
>>>>         [28336.977465] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff82a69db8
>>>>         [28336.991265] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000200
>>>>         [28337.008865] RBP: ffff88807936e41c R08: 0000000000000000 R09: 0000000000000819
>>>>         [28337.022250] R10: 0000000000000202 R11: ffffffff8247ca80 R12: ffff88807936e000
>>>>         [28337.035204] R13: 0000000000000000 R14: ffff88807936e440 R15: 0000000000000001
>>>>         [28337.049832] FS:  00007f53e9bf3840(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
>>>>         [28337.062524] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>         [28337.075086] CR2: 00007f53e60c4000 CR3: 000000001a0be000 CR4: 0000000000000660
>>>>         [28337.090052] Call Trace:
>>>>         [28337.103615]  <IRQ>
>>>>         [28337.116587]  ? qdisc_destroy+0x120/0x120
>>>>         [28337.128905]  call_timer_fn+0x19/0x90
>>>>         [28337.141892]  expire_timers+0x8b/0xa0
>>>>         [28337.153354]  run_timer_softirq+0x7e/0x160
>>>>         [28337.165931]  ? handle_irq_event_percpu+0x4c/0x70
>>>>         [28337.176548]  ? handle_percpu_irq+0x32/0x50
>>>>         [28337.186734]  __do_softirq+0xed/0x229
>>>>         [28337.196404]  ? hypervisor_callback+0xa/0x20
>>>>         [28337.207822]  irq_exit+0xb7/0xc0
>>>>         [28337.218978]  xen_evtchn_do_upcall+0x27/0x40
>>>>         [28337.230763]  xen_do_hypervisor_callback+0x29/0x40
>>>>         [28337.241261]  </IRQ>
>>>>         [28337.253283] RIP: e033:0xff7e62
>>>>         [28337.264899] Code: 35 43 0f c7 00 4c 89 ef e8 8b 6d 67 ff 0f 1f 00 44 89 e0 44 89 e2 c1 e8 06 83 e2 3f 48 8b 0c c5 40 8d c6 01 48 0f a3 d1 72 0e <48> 8b 04 c5 50 8d c6 01 48 0f a3 d0 73 0b 44 89 e6 4c 89 ef e8 b5
>>>>         [28337.288677] RSP: e02b:00007fff0fc6a340 EFLAGS: 00000202
>>>>         [28337.299234] RAX: 0000000000000000 RBX: 00007f53e60c3580 RCX: 0000000000000000
>>>>         [28337.309577] RDX: 0000000000000034 RSI: 0000000001e71a98 RDI: 00007fff0fc6a538
>>>>         [28337.320724] RBP: 00007fff0fc6a4b0 R08: 0000000000000000 R09: 0000000000000000
>>>>         [28337.331829] R10: 0000000000000001 R11: 00000000020cb3d0 R12: 0000000000000034
>>>>         [28337.343900] R13: 00007fff0fc6a538 R14: 0000000000000000 R15: 0000000000000001
>>>>         [28337.353977] ---[ end trace 6ff49f09286816b7 ]---
>>>>
>>> Thanks for your efforts. As usual this tx timeout trace says basically nothing except
>>> "timeout" and root cause could be anything. Earlier you reported a memory allocation error,
>>> did that occur again?
>>> If we decide to revert, I'd leave removal of the memory barriers in (as it doesn't seem to
>>> contribute to the issue) and just submit a patch to effectively revert
>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356.
>>
>> I can't say if that is correct, because i haven't tested that.
>>
>> Another thing I could test is:
>>  - putting all the r8169 patches (and prerequisites) that went into 5.0 
>>    up to bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3, onto 4.20.7 and see what that does.
>>    If that would be feasible (not too many needed prerequisites out of r8169) and if 
>>    you could spare me some time and prep such a branch somewhere so i can pull and compile that,
>>    that would be great.
>>
> 
> Unfortunately there's quite a number of changes. Regarding __netdev_tx_sent_queue()
> and watchdog timeout I found the following comment in drivers/net/ethernet/sfc/tx.c,
> efx_enqueue_skb():
> 
> 	if (__netdev_tx_sent_queue(tx_queue->core_txq, skb_len, xmit_more)) {
> 		struct efx_tx_queue *txq2 = efx_tx_queue_partner(tx_queue);
> 
> 		/* There could be packets left on the partner queue if those
> 		 * SKBs had skb->xmit_more set. If we do not push those they
> 		 * could be left for a long time and cause a netdev watchdog.
> 		 */
> 		if (txq2->xmit_more_available)
> 			efx_nic_push_buffers(txq2);
> 
> But I'm not sure whether the situation in r8169 is comparable. The following patch
> implements what I mentioned earlier: It leaves all other 5.0 changes in place and
> effectively reverts 2e6eedb4813e34d8d84ac0eb3afb668966f3f356. Would be great if
> you could give it a try.

Hi Heiner,

It took some time to respond, because I had another issue with 5.0 which intervened with proper testing, 
but fortunately I could pinpoint without doing a full bisect and revert that commit for further testing.

So there is still time left and I could do a more proper run with your patch below.
Unfortunately i still get a splat (see below) with this, although i'm not sure it is related, 
just that I can't tell.

Perhaps Linus as Oops-decoding-guru has an idea ?

--
Sander

[39041.689007] dpkg-deb: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
[39041.689016] CPU: 4 PID: 14078 Comm: dpkg-deb Not tainted 5.0.0-rc5-20190209-kallweit+ #1
[39041.689017] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
[39041.689018] Call Trace:
[39041.689022]  <IRQ>
[39041.689030]  dump_stack+0x5c/0x7b
[39041.689033]  warn_alloc+0x103/0x190
[39041.689036]  __alloc_pages_nodemask+0xe3d/0xe80
[39041.689039]  ? ip_rcv+0x48/0xc0
[39041.689040]  ? ip_rcv_finish_core.isra.0+0x360/0x360
[39041.689042]  page_frag_alloc+0x117/0x150
[39041.689044]  __napi_alloc_skb+0x83/0xd0
[39041.689048]  rtl8169_poll+0x210/0x640
[39041.689051]  net_rx_action+0x23d/0x370
[39041.689054]  __do_softirq+0xed/0x229
[39041.689058]  irq_exit+0xb7/0xc0
[39041.689061]  xen_evtchn_do_upcall+0x27/0x40
[39041.689063]  xen_do_hypervisor_callback+0x29/0x40
[39041.689064]  </IRQ>
[39041.689066] RIP: e030:_atomic_dec_and_lock+0x2/0x40
[39041.689068] Code: ff 39 05 c5 c1 c9 00 89 c7 89 c6 76 0f 83 eb 01 83 fb ff 75 d9 5b 89 f8 5d 41 5c c3 0f 0b 90 90 90 90 90 90 90 90 90 90 8b 07 <83> f8 01 74 0c 8d 50 ff f0 0f b1 17 75 f2 31 c0 c3 55 53 48 89 fb
[39041.689069] RSP: e02b:ffffc9000705b990 EFLAGS: 00000246
[39041.689071] RAX: 0000000000000001 RBX: ffff888017082640 RCX: 0000000000000000
[39041.689071] RDX: 0000000000000000 RSI: ffff8880170826c0 RDI: ffff888017082788
[39041.689072] RBP: ffff8880170826c0 R08: ffffc9000705bb00 R09: ffffc9000705bb00
[39041.689073] R10: ffffc9000705bb58 R11: ffff88807fc17000 R12: ffff888017082788
[39041.689073] R13: ffff88806cc8cf58 R14: ffff888017082640 R15: ffff888009990240
[39041.689077]  iput+0x63/0x1a0
[39041.689079]  __dentry_kill+0xc5/0x170
[39041.689080]  shrink_dentry_list+0x93/0x1c0
[39041.689082]  prune_dcache_sb+0x4d/0x70
[39041.689084]  super_cache_scan+0x104/0x190
[39041.689087]  do_shrink_slab+0x12c/0x1e0
[39041.689089]  shrink_slab+0xdf/0x2b0
[39041.689091]  shrink_node+0x158/0x470
[39041.689093]  do_try_to_free_pages+0xd1/0x380
[39041.689095]  try_to_free_pages+0xb2/0xe0
[39041.689097]  __alloc_pages_nodemask+0x603/0xe80
[39041.689099]  ? __pagevec_lru_add_fn+0x1b1/0x290
[39041.689102]  alloc_pages_vma+0x7b/0x1c0
[39041.689106]  __handle_mm_fault+0xdb3/0x1060
[39041.689109]  ? xen_mc_flush+0xc0/0x190
[39041.689110]  handle_mm_fault+0xf8/0x200
[39041.689113]  __do_page_fault+0x231/0x4a0
[39041.689115]  ? page_fault+0x8/0x30
[39041.689116]  page_fault+0x1e/0x30
[39041.689118] RIP: e033:0x7fb9851d012e
[39041.689119] Code: 29 c2 48 3b 15 7b a3 31 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 41 0f 29 01 <41> 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9 40 48 83 fa
[39041.689119] RSP: e02b:00007fb958b36d38 EFLAGS: 00010202
[39041.689120] RAX: 00007fb97a617f0e RBX: 000000000000f004 RCX: 00007fb948008be3
[39041.689121] RDX: 00000000000080c2 RSI: 00007fb948000b31 RDI: 00007fb97a617f0e
[39041.689122] RBP: 00000000000ff062 R08: 0000000000000002 R09: 00007fb97a620000
[39041.689123] R10: 0000000000000004 R11: 00007fb97a626f02 R12: 000000000000f005
[39041.689123] R13: 00007fb948000b28 R14: 0000562d76b63710 R15: 0000000000000003
[39041.689125] Mem-Info:
[39041.689130] active_anon:78775 inactive_anon:49211 isolated_anon:0
                active_file:106409 inactive_file:107531 isolated_file:0
                unevictable:552 dirty:175 writeback:0 unstable:0
                slab_reclaimable:13739 slab_unreclaimable:16454
                mapped:1605 shmem:23 pagetables:2900 bounce:0
                free:3681 free_pcp:935 free_cma:0
[39041.689132] Node 0 active_anon:315100kB inactive_anon:196844kB active_file:425636kB inactive_file:430124kB unevictable:2208kB isolated(anon):0kB isolated(file):0kB mapped:6420kB dirty:700kB writeback:0kB shmem:92kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[39041.689133] Node 0 DMA free:7480kB min:44kB low:56kB high:68kB active_anon:0kB inactive_anon:7832kB active_file:472kB inactive_file:4kB unevictable:0kB writepending:0kB present:15956kB managed:15872kB mlocked:0kB kernel_stack:0kB pagetables:12kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[39041.689136] lowmem_reserve[]: 0 1865 1865 1865
[39041.689138] Node 0 DMA32 free:7244kB min:19472kB low:21380kB high:23288kB active_anon:315360kB inactive_anon:188144kB active_file:425164kB inactive_file:430120kB unevictable:2208kB writepending:700kB present:2080768kB managed:1674968kB mlocked:2208kB kernel_stack:9632kB pagetables:11588kB bounce:0kB free_pcp:3740kB local_pcp:528kB free_cma:0kB
[39041.689140] lowmem_reserve[]: 0 0 0 0
[39041.689142] Node 0 DMA: 6*4kB (UME) 6*8kB (UE) 7*16kB (UME) 6*32kB (ME) 5*64kB (UME) 3*128kB (UE) 5*256kB (UME) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (M) 0*4096kB = 7480kB
[39041.689148] Node 0 DMA32: 69*4kB (U) 315*8kB (UE) 138*16kB (UE) 70*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7244kB
[39041.689153] 214701 total pagecache pages
[39041.689155] 273 pages in swap cache
[39041.689156] Swap cache stats: add 100978, delete 100706, find 1158/1257
[39041.689156] Free swap  = 3790588kB
[39041.689157] Total swap = 4194300kB
[39041.689157] 524181 pages RAM
[39041.689158] 0 pages HighMem/MovableOnly
[39041.689158] 101471 pages reserved
[39041.689159] 0 pages cma reserved





> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index e8a112149..3cca2ffb2 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -6192,7 +6192,6 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>  	struct device *d = tp_to_dev(tp);
>  	dma_addr_t mapping;
>  	u32 opts[2], len;
> -	bool stop_queue;
>  	int frags;
>  
>  	if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
> @@ -6234,6 +6233,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>  
>  	txd->opts2 = cpu_to_le32(opts[1]);
>  
> +	netdev_sent_queue(dev, skb->len);
> +
>  	skb_tx_timestamp(skb);
>  
>  	/* Force memory writes to complete before releasing descriptor */
> @@ -6246,14 +6247,14 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>  
>  	tp->cur_tx += frags + 1;
>  
> -	stop_queue = !rtl_tx_slots_avail(tp, MAX_SKB_FRAGS);
> -	if (unlikely(stop_queue))
> -		netif_stop_queue(dev);
> -
> -	if (__netdev_sent_queue(dev, skb->len, skb->xmit_more))
> -		RTL_W8(tp, TxPoll, NPQ);
> +	RTL_W8(tp, TxPoll, NPQ);
>  
> -	if (unlikely(stop_queue)) {
> +	if (!rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) {
> +		/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
> +		 * not miss a ring update when it notices a stopped queue.
> +		 */
> +		smp_wmb();
> +		netif_stop_queue(dev);
>  		/* Sync with rtl_tx:
>  		 * - publish queue status and cur_tx ring index (write barrier)
>  		 * - refresh dirty_tx ring index (read barrier).
> 


^ permalink raw reply

* Re: Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!
From: Heiner Kallweit @ 2019-02-10  9:32 UTC (permalink / raw)
  To: Sander Eikelenboom, Eric Dumazet, Realtek linux nic maintainers,
	Eric Dumazet
  Cc: Linus Torvalds, linux-kernel, netdev
In-Reply-To: <302dd169-d981-9d8d-99a6-9d1462e913f9@eikelenboom.it>

On 10.02.2019 10:16, Sander Eikelenboom wrote:
> On 09/02/2019 12:50, Heiner Kallweit wrote:
>> On 09.02.2019 11:07, Sander Eikelenboom wrote:
>>> On 09/02/2019 10:59, Heiner Kallweit wrote:
>>>> On 09.02.2019 10:34, Sander Eikelenboom wrote:
>>>>> On 09/02/2019 10:02, Heiner Kallweit wrote:
>>>>>> On 09.02.2019 00:09, Eric Dumazet wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 02/08/2019 01:50 PM, Heiner Kallweit wrote:
>>>>>>>> On 08.02.2019 22:45, Sander Eikelenboom wrote:
>>>>>>>>> On 08/02/2019 22:22, Heiner Kallweit wrote:
>>>>>>>>>> On 08.02.2019 21:55, Sander Eikelenboom wrote:
>>>>>>>>>>> On 08/02/2019 19:52, Heiner Kallweit wrote:
>>>>>>>>>>>> On 08.02.2019 19:29, Sander Eikelenboom wrote:
>>>>>>>>>>>>> L.S.,
>>>>>>>>>>>>>
>>>>>>>>>>>>> While testing a linux 5.0-rc5 kernel (with some patches on top but they don't seem related) under Xen i the nasty splat below, 
>>>>>>>>>>>>> that I haven encountered with Linux 4.20.x.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Unfortunately I haven't got a clear reproducer for this and bisecting could be nasty due to another (networking related) kernel bug.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you need more info, want me to run a debug patch etc., please feel free to ask.
>>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the report. However I see no change in the r8169 driver between
>>>>>>>>>>>> 4.20 and 5.0 with regard to BQL code. Having said that the root cause could
>>>>>>>>>>>> be somewhere else. Therefore I'm afraid a bisect will be needed.
>>>>>>>>>>>
>>>>>>>>>>> Hmm i did some diging and i think:
>>>>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>>>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>>>>>>> 620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add __netdev_sent_queue as variant of __netdev_tx_sent_queue
>>>>>>>>>>>
>>>>>>>>>> You're right. Thought this was added in 4.20 already.
>>>>>>>>>> The BQL code pattern I copied from the mlx4 driver and so far I haven't heard about
>>>>>>>>>> this issue from any user of physical hw. And due to the fact that a lot of mainboards
>>>>>>>>>> have onboard Realtek network I have quite a few testers out there.
>>>>>>>>>> Does the issue occur under specific circumstances like very high load?
>>>>>>>>>
>>>>>>>>> Yep, the box is already quite contented with the Xen VM's and if I remember correctly it occurred while kernel compiling
>>>>>>>>> on the host.
>>>>>>>>>
>>>>>>>>>> If indeed the xmit_more patch causes the issue, I think we have to involve Eric Dumazet
>>>>>>>>>> as author of the underlying changes.
>>>>>>>>>
>>>>>>>>> It could also be the barriers weren't that unneeded as assumed.
>>>>>>>>
>>>>>>>> The barriers were removed after adding xmit_more handling. Therefore it would be good to
>>>>>>>> test also with only 
>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>>> removed.
>>>>>>>>
>>>>>>>>> Since we are almost at RC6 i took the liberty to CC Eric now.
>>>>>>>>>
>>>>>>>> Sure, thanks.
>>>>>>>>
>>>>>>>>> BTW am i correct these patches are merely optimizations ?
>>>>>>>>
>>>>>>>> Yes
>>>>>>>>
>>>>>>>>> If so and concluding they revert cleanly, perhaps it should be considered at this point in the RC's
>>>>>>>>> to revert them for 5.0 and try again for 5.1 ?
>>>>>>>>>
>>>>>>>> Before removing both it would be good to test with only the barrier-removal removed.
>>>>>>>>
>>>>>>>
>>>>>>> Commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>>> looks buggy to me, since the skb might have been freed already on another cpu when you call
>>>>>>>
>>>>>>> You could try :
>>>>>>>
>>>>>>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>>>>>>> index 3624e67aef72c92ed6e908e2c99ac2d381210126..f907d484165d9fd775e81bf2bfb9aa4ddedb1c93 100644
>>>>>>> --- a/drivers/net/ethernet/realtek/r8169.c
>>>>>>> +++ b/drivers/net/ethernet/realtek/r8169.c
>>>>>>> @@ -6070,6 +6070,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>         dma_addr_t mapping;
>>>>>>>         u32 opts[2], len;
>>>>>>>         bool stop_queue;
>>>>>>> +       bool door_bell;
>>>>>>>         int frags;
>>>>>>>  
>>>>>>>         if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
>>>>>>> @@ -6116,6 +6117,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>         /* Force memory writes to complete before releasing descriptor */
>>>>>>>         dma_wmb();
>>>>>>>  
>>>>>>> +       door_bell = __netdev_sent_queue(dev, skb->len, skb->xmit_more);
>>>>>>> +
>>>>>>>         txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
>>>>>>>  
>>>>>>>         /* Force all memory writes to complete before notifying device */
>>>>>>> @@ -6127,7 +6130,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>         if (unlikely(stop_queue))
>>>>>>>                 netif_stop_queue(dev);
>>>>>>>  
>>>>>>> -       if (__netdev_sent_queue(dev, skb->len, skb->xmit_more)) {
>>>>>>> +       if (door_bell) {
>>>>>>>                 RTL_W8(tp, TxPoll, NPQ);
>>>>>>>                 mmiowb();
>>>>>>>         }
>>>>>>>
>>>>>> Thanks a lot for checking and for the proposed fix.
>>>>>> Sander, can you try with this patch on top of 5.0-rc5 w/o removing two two commits?
>>>>>
>>>>> I have done that already during the night .. the results:
>>>>> - I can confirm 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 is the first commit which causes hitting the BUG_ON in lib/dynamic_queue_limits.c.
>>>>>   (in other word, with only reverting bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 it still blows up).
>>>>>
>>>>> - The Eric's patch only applies cleanly with bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 reverted, so that's what I tested.
>>>>>   The patch seems to prevent hitting the BUG_ON in lib/dynamic_queue_limits.c, it has run this night and I gave done a few kernel compiles
>>>>>   this morning. How ever during these kernel compiles i'm getting a transmit queue timeout which i haven't seen with 4.20.x, although i regularly
>>>>>   compile kernels in the same way as I do now. The only thing I can't say if that is due to this change, or if it's again something else.
>>>>>   Which makes me somewhat inclined to go testing the complete revert some more and see if I can trigger the queue timeout on that or not.
>>>>>
>>>>>   If I can, it is a separate issue.
>>>>>   If I can't it seems even with a patch it still seems as a regression in comparison with 4.20.x, for which
>>>>>   a revert would be the right thing to do (since as you indicated these are merely optimizations), 
>>>>>   which would give us more time for 5.1 to try to solve things on top of the 5.0-release-to-be.
>>>>>   (especially since I seem to still have other issues which need to be sorted out and time is limited)
>>>>>
>>>>>   The timeout in question:
>>>>>         [28336.869479] NETDEV WATCHDOG: eth1 (r8169): transmit queue 0 timed out
>>>>>         [28336.881498] WARNING: CPU: 0 PID: 6925 at net/sched/sch_generic.c:461 dev_watchdog+0x20b/0x210
>>>>>         [28336.893358] Modules linked in:
>>>>>         [28336.904106] CPU: 0 PID: 6925 Comm: cc1 Tainted: G      D           5.0.0-rc5-20190208-thp-net-florian-rtl8169-eric-doflr+ #1
>>>>>         [28336.917385] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
>>>>>         [28336.928988] RIP: e030:dev_watchdog+0x20b/0x210
>>>>>         [28336.940623] Code: 00 49 63 4e e0 eb 90 4c 89 e7 c6 05 ad d8 f1 00 01 e8 a9 32 fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7 50 59 89 82 e8 e5 92 4d ff <0f> 0b eb c0 90 48 c7 47 08 00 00 00 00 48 c7 07 00 00 00 00 0f b7
>>>>>         [28336.965265] RSP: e02b:ffff88807d403ea0 EFLAGS: 00010286
>>>>>         [28336.977465] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff82a69db8
>>>>>         [28336.991265] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000200
>>>>>         [28337.008865] RBP: ffff88807936e41c R08: 0000000000000000 R09: 0000000000000819
>>>>>         [28337.022250] R10: 0000000000000202 R11: ffffffff8247ca80 R12: ffff88807936e000
>>>>>         [28337.035204] R13: 0000000000000000 R14: ffff88807936e440 R15: 0000000000000001
>>>>>         [28337.049832] FS:  00007f53e9bf3840(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
>>>>>         [28337.062524] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>         [28337.075086] CR2: 00007f53e60c4000 CR3: 000000001a0be000 CR4: 0000000000000660
>>>>>         [28337.090052] Call Trace:
>>>>>         [28337.103615]  <IRQ>
>>>>>         [28337.116587]  ? qdisc_destroy+0x120/0x120
>>>>>         [28337.128905]  call_timer_fn+0x19/0x90
>>>>>         [28337.141892]  expire_timers+0x8b/0xa0
>>>>>         [28337.153354]  run_timer_softirq+0x7e/0x160
>>>>>         [28337.165931]  ? handle_irq_event_percpu+0x4c/0x70
>>>>>         [28337.176548]  ? handle_percpu_irq+0x32/0x50
>>>>>         [28337.186734]  __do_softirq+0xed/0x229
>>>>>         [28337.196404]  ? hypervisor_callback+0xa/0x20
>>>>>         [28337.207822]  irq_exit+0xb7/0xc0
>>>>>         [28337.218978]  xen_evtchn_do_upcall+0x27/0x40
>>>>>         [28337.230763]  xen_do_hypervisor_callback+0x29/0x40
>>>>>         [28337.241261]  </IRQ>
>>>>>         [28337.253283] RIP: e033:0xff7e62
>>>>>         [28337.264899] Code: 35 43 0f c7 00 4c 89 ef e8 8b 6d 67 ff 0f 1f 00 44 89 e0 44 89 e2 c1 e8 06 83 e2 3f 48 8b 0c c5 40 8d c6 01 48 0f a3 d1 72 0e <48> 8b 04 c5 50 8d c6 01 48 0f a3 d0 73 0b 44 89 e6 4c 89 ef e8 b5
>>>>>         [28337.288677] RSP: e02b:00007fff0fc6a340 EFLAGS: 00000202
>>>>>         [28337.299234] RAX: 0000000000000000 RBX: 00007f53e60c3580 RCX: 0000000000000000
>>>>>         [28337.309577] RDX: 0000000000000034 RSI: 0000000001e71a98 RDI: 00007fff0fc6a538
>>>>>         [28337.320724] RBP: 00007fff0fc6a4b0 R08: 0000000000000000 R09: 0000000000000000
>>>>>         [28337.331829] R10: 0000000000000001 R11: 00000000020cb3d0 R12: 0000000000000034
>>>>>         [28337.343900] R13: 00007fff0fc6a538 R14: 0000000000000000 R15: 0000000000000001
>>>>>         [28337.353977] ---[ end trace 6ff49f09286816b7 ]---
>>>>>
>>>> Thanks for your efforts. As usual this tx timeout trace says basically nothing except
>>>> "timeout" and root cause could be anything. Earlier you reported a memory allocation error,
>>>> did that occur again?
>>>> If we decide to revert, I'd leave removal of the memory barriers in (as it doesn't seem to
>>>> contribute to the issue) and just submit a patch to effectively revert
>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356.
>>>
>>> I can't say if that is correct, because i haven't tested that.
>>>
>>> Another thing I could test is:
>>>  - putting all the r8169 patches (and prerequisites) that went into 5.0 
>>>    up to bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3, onto 4.20.7 and see what that does.
>>>    If that would be feasible (not too many needed prerequisites out of r8169) and if 
>>>    you could spare me some time and prep such a branch somewhere so i can pull and compile that,
>>>    that would be great.
>>>
>>
>> Unfortunately there's quite a number of changes. Regarding __netdev_tx_sent_queue()
>> and watchdog timeout I found the following comment in drivers/net/ethernet/sfc/tx.c,
>> efx_enqueue_skb():
>>
>> 	if (__netdev_tx_sent_queue(tx_queue->core_txq, skb_len, xmit_more)) {
>> 		struct efx_tx_queue *txq2 = efx_tx_queue_partner(tx_queue);
>>
>> 		/* There could be packets left on the partner queue if those
>> 		 * SKBs had skb->xmit_more set. If we do not push those they
>> 		 * could be left for a long time and cause a netdev watchdog.
>> 		 */
>> 		if (txq2->xmit_more_available)
>> 			efx_nic_push_buffers(txq2);
>>
>> But I'm not sure whether the situation in r8169 is comparable. The following patch
>> implements what I mentioned earlier: It leaves all other 5.0 changes in place and
>> effectively reverts 2e6eedb4813e34d8d84ac0eb3afb668966f3f356. Would be great if
>> you could give it a try.
> 
> Hi Heiner,
> 
> It took some time to respond, because I had another issue with 5.0 which intervened with proper testing, 
> but fortunately I could pinpoint without doing a full bisect and revert that commit for further testing.
> 
> So there is still time left and I could do a more proper run with your patch below.
> Unfortunately i still get a splat (see below) with this, although i'm not sure it is related, 
> just that I can't tell.
> 
The nasty memory allocation problem in napi_alloc_skb() you've seen before ..
If you can't reproduce it with 4.20 then a potential cause may be here:
5317d5c6d47e ("r8169: use napi_consume_skb where possible")
At least at a first and second glance I don't see how usage of both calls
could be wrong. So maybe the root cause is somewhere deeper inside.
At least it would be worth trying with the mentioned commit reverted.

> Perhaps Linus as Oops-decoding-guru has an idea ?
> 
> --
> Sander
> 
Heiner

> [39041.689007] dpkg-deb: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
> [39041.689016] CPU: 4 PID: 14078 Comm: dpkg-deb Not tainted 5.0.0-rc5-20190209-kallweit+ #1
> [39041.689017] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
> [39041.689018] Call Trace:
> [39041.689022]  <IRQ>
> [39041.689030]  dump_stack+0x5c/0x7b
> [39041.689033]  warn_alloc+0x103/0x190
> [39041.689036]  __alloc_pages_nodemask+0xe3d/0xe80
> [39041.689039]  ? ip_rcv+0x48/0xc0
> [39041.689040]  ? ip_rcv_finish_core.isra.0+0x360/0x360
> [39041.689042]  page_frag_alloc+0x117/0x150
> [39041.689044]  __napi_alloc_skb+0x83/0xd0
> [39041.689048]  rtl8169_poll+0x210/0x640
> [39041.689051]  net_rx_action+0x23d/0x370
> [39041.689054]  __do_softirq+0xed/0x229
> [39041.689058]  irq_exit+0xb7/0xc0
> [39041.689061]  xen_evtchn_do_upcall+0x27/0x40
> [39041.689063]  xen_do_hypervisor_callback+0x29/0x40
> [39041.689064]  </IRQ>
> [39041.689066] RIP: e030:_atomic_dec_and_lock+0x2/0x40
> [39041.689068] Code: ff 39 05 c5 c1 c9 00 89 c7 89 c6 76 0f 83 eb 01 83 fb ff 75 d9 5b 89 f8 5d 41 5c c3 0f 0b 90 90 90 90 90 90 90 90 90 90 8b 07 <83> f8 01 74 0c 8d 50 ff f0 0f b1 17 75 f2 31 c0 c3 55 53 48 89 fb
> [39041.689069] RSP: e02b:ffffc9000705b990 EFLAGS: 00000246
> [39041.689071] RAX: 0000000000000001 RBX: ffff888017082640 RCX: 0000000000000000
> [39041.689071] RDX: 0000000000000000 RSI: ffff8880170826c0 RDI: ffff888017082788
> [39041.689072] RBP: ffff8880170826c0 R08: ffffc9000705bb00 R09: ffffc9000705bb00
> [39041.689073] R10: ffffc9000705bb58 R11: ffff88807fc17000 R12: ffff888017082788
> [39041.689073] R13: ffff88806cc8cf58 R14: ffff888017082640 R15: ffff888009990240
> [39041.689077]  iput+0x63/0x1a0
> [39041.689079]  __dentry_kill+0xc5/0x170
> [39041.689080]  shrink_dentry_list+0x93/0x1c0
> [39041.689082]  prune_dcache_sb+0x4d/0x70
> [39041.689084]  super_cache_scan+0x104/0x190
> [39041.689087]  do_shrink_slab+0x12c/0x1e0
> [39041.689089]  shrink_slab+0xdf/0x2b0
> [39041.689091]  shrink_node+0x158/0x470
> [39041.689093]  do_try_to_free_pages+0xd1/0x380
> [39041.689095]  try_to_free_pages+0xb2/0xe0
> [39041.689097]  __alloc_pages_nodemask+0x603/0xe80
> [39041.689099]  ? __pagevec_lru_add_fn+0x1b1/0x290
> [39041.689102]  alloc_pages_vma+0x7b/0x1c0
> [39041.689106]  __handle_mm_fault+0xdb3/0x1060
> [39041.689109]  ? xen_mc_flush+0xc0/0x190
> [39041.689110]  handle_mm_fault+0xf8/0x200
> [39041.689113]  __do_page_fault+0x231/0x4a0
> [39041.689115]  ? page_fault+0x8/0x30
> [39041.689116]  page_fault+0x1e/0x30
> [39041.689118] RIP: e033:0x7fb9851d012e
> [39041.689119] Code: 29 c2 48 3b 15 7b a3 31 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 41 0f 29 01 <41> 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9 40 48 83 fa
> [39041.689119] RSP: e02b:00007fb958b36d38 EFLAGS: 00010202
> [39041.689120] RAX: 00007fb97a617f0e RBX: 000000000000f004 RCX: 00007fb948008be3
> [39041.689121] RDX: 00000000000080c2 RSI: 00007fb948000b31 RDI: 00007fb97a617f0e
> [39041.689122] RBP: 00000000000ff062 R08: 0000000000000002 R09: 00007fb97a620000
> [39041.689123] R10: 0000000000000004 R11: 00007fb97a626f02 R12: 000000000000f005
> [39041.689123] R13: 00007fb948000b28 R14: 0000562d76b63710 R15: 0000000000000003
> [39041.689125] Mem-Info:
> [39041.689130] active_anon:78775 inactive_anon:49211 isolated_anon:0
>                 active_file:106409 inactive_file:107531 isolated_file:0
>                 unevictable:552 dirty:175 writeback:0 unstable:0
>                 slab_reclaimable:13739 slab_unreclaimable:16454
>                 mapped:1605 shmem:23 pagetables:2900 bounce:0
>                 free:3681 free_pcp:935 free_cma:0
> [39041.689132] Node 0 active_anon:315100kB inactive_anon:196844kB active_file:425636kB inactive_file:430124kB unevictable:2208kB isolated(anon):0kB isolated(file):0kB mapped:6420kB dirty:700kB writeback:0kB shmem:92kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [39041.689133] Node 0 DMA free:7480kB min:44kB low:56kB high:68kB active_anon:0kB inactive_anon:7832kB active_file:472kB inactive_file:4kB unevictable:0kB writepending:0kB present:15956kB managed:15872kB mlocked:0kB kernel_stack:0kB pagetables:12kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> [39041.689136] lowmem_reserve[]: 0 1865 1865 1865
> [39041.689138] Node 0 DMA32 free:7244kB min:19472kB low:21380kB high:23288kB active_anon:315360kB inactive_anon:188144kB active_file:425164kB inactive_file:430120kB unevictable:2208kB writepending:700kB present:2080768kB managed:1674968kB mlocked:2208kB kernel_stack:9632kB pagetables:11588kB bounce:0kB free_pcp:3740kB local_pcp:528kB free_cma:0kB
> [39041.689140] lowmem_reserve[]: 0 0 0 0
> [39041.689142] Node 0 DMA: 6*4kB (UME) 6*8kB (UE) 7*16kB (UME) 6*32kB (ME) 5*64kB (UME) 3*128kB (UE) 5*256kB (UME) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (M) 0*4096kB = 7480kB
> [39041.689148] Node 0 DMA32: 69*4kB (U) 315*8kB (UE) 138*16kB (UE) 70*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7244kB
> [39041.689153] 214701 total pagecache pages
> [39041.689155] 273 pages in swap cache
> [39041.689156] Swap cache stats: add 100978, delete 100706, find 1158/1257
> [39041.689156] Free swap  = 3790588kB
> [39041.689157] Total swap = 4194300kB
> [39041.689157] 524181 pages RAM
> [39041.689158] 0 pages HighMem/MovableOnly
> [39041.689158] 101471 pages reserved
> [39041.689159] 0 pages cma reserved
> 
> 
> 
> 
> 
>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>> index e8a112149..3cca2ffb2 100644
>> --- a/drivers/net/ethernet/realtek/r8169.c
>> +++ b/drivers/net/ethernet/realtek/r8169.c
>> @@ -6192,7 +6192,6 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>  	struct device *d = tp_to_dev(tp);
>>  	dma_addr_t mapping;
>>  	u32 opts[2], len;
>> -	bool stop_queue;
>>  	int frags;
>>  
>>  	if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
>> @@ -6234,6 +6233,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>  
>>  	txd->opts2 = cpu_to_le32(opts[1]);
>>  
>> +	netdev_sent_queue(dev, skb->len);
>> +
>>  	skb_tx_timestamp(skb);
>>  
>>  	/* Force memory writes to complete before releasing descriptor */
>> @@ -6246,14 +6247,14 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>  
>>  	tp->cur_tx += frags + 1;
>>  
>> -	stop_queue = !rtl_tx_slots_avail(tp, MAX_SKB_FRAGS);
>> -	if (unlikely(stop_queue))
>> -		netif_stop_queue(dev);
>> -
>> -	if (__netdev_sent_queue(dev, skb->len, skb->xmit_more))
>> -		RTL_W8(tp, TxPoll, NPQ);
>> +	RTL_W8(tp, TxPoll, NPQ);
>>  
>> -	if (unlikely(stop_queue)) {
>> +	if (!rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) {
>> +		/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
>> +		 * not miss a ring update when it notices a stopped queue.
>> +		 */
>> +		smp_wmb();
>> +		netif_stop_queue(dev);
>>  		/* Sync with rtl_tx:
>>  		 * - publish queue status and cur_tx ring index (write barrier)
>>  		 * - refresh dirty_tx ring index (read barrier).
>>
> 
> 


^ permalink raw reply

* Re: Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!
From: Heiner Kallweit @ 2019-02-10 11:44 UTC (permalink / raw)
  To: Sander Eikelenboom, Eric Dumazet, Realtek linux nic maintainers,
	Eric Dumazet
  Cc: Linus Torvalds, linux-kernel, netdev
In-Reply-To: <302dd169-d981-9d8d-99a6-9d1462e913f9@eikelenboom.it>

On 10.02.2019 10:16, Sander Eikelenboom wrote:
> On 09/02/2019 12:50, Heiner Kallweit wrote:
>> On 09.02.2019 11:07, Sander Eikelenboom wrote:
>>> On 09/02/2019 10:59, Heiner Kallweit wrote:
>>>> On 09.02.2019 10:34, Sander Eikelenboom wrote:
>>>>> On 09/02/2019 10:02, Heiner Kallweit wrote:
>>>>>> On 09.02.2019 00:09, Eric Dumazet wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 02/08/2019 01:50 PM, Heiner Kallweit wrote:
>>>>>>>> On 08.02.2019 22:45, Sander Eikelenboom wrote:
>>>>>>>>> On 08/02/2019 22:22, Heiner Kallweit wrote:
>>>>>>>>>> On 08.02.2019 21:55, Sander Eikelenboom wrote:
>>>>>>>>>>> On 08/02/2019 19:52, Heiner Kallweit wrote:
>>>>>>>>>>>> On 08.02.2019 19:29, Sander Eikelenboom wrote:
>>>>>>>>>>>>> L.S.,
>>>>>>>>>>>>>
>>>>>>>>>>>>> While testing a linux 5.0-rc5 kernel (with some patches on top but they don't seem related) under Xen i the nasty splat below, 
>>>>>>>>>>>>> that I haven encountered with Linux 4.20.x.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Unfortunately I haven't got a clear reproducer for this and bisecting could be nasty due to another (networking related) kernel bug.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you need more info, want me to run a debug patch etc., please feel free to ask.
>>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the report. However I see no change in the r8169 driver between
>>>>>>>>>>>> 4.20 and 5.0 with regard to BQL code. Having said that the root cause could
>>>>>>>>>>>> be somewhere else. Therefore I'm afraid a bisect will be needed.
>>>>>>>>>>>
>>>>>>>>>>> Hmm i did some diging and i think:
>>>>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>>>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>>>>>>> 620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add __netdev_sent_queue as variant of __netdev_tx_sent_queue
>>>>>>>>>>>
>>>>>>>>>> You're right. Thought this was added in 4.20 already.
>>>>>>>>>> The BQL code pattern I copied from the mlx4 driver and so far I haven't heard about
>>>>>>>>>> this issue from any user of physical hw. And due to the fact that a lot of mainboards
>>>>>>>>>> have onboard Realtek network I have quite a few testers out there.
>>>>>>>>>> Does the issue occur under specific circumstances like very high load?
>>>>>>>>>
>>>>>>>>> Yep, the box is already quite contented with the Xen VM's and if I remember correctly it occurred while kernel compiling
>>>>>>>>> on the host.
>>>>>>>>>
>>>>>>>>>> If indeed the xmit_more patch causes the issue, I think we have to involve Eric Dumazet
>>>>>>>>>> as author of the underlying changes.
>>>>>>>>>
>>>>>>>>> It could also be the barriers weren't that unneeded as assumed.
>>>>>>>>
>>>>>>>> The barriers were removed after adding xmit_more handling. Therefore it would be good to
>>>>>>>> test also with only 
>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>>> removed.
>>>>>>>>
>>>>>>>>> Since we are almost at RC6 i took the liberty to CC Eric now.
>>>>>>>>>
>>>>>>>> Sure, thanks.
>>>>>>>>
>>>>>>>>> BTW am i correct these patches are merely optimizations ?
>>>>>>>>
>>>>>>>> Yes
>>>>>>>>
>>>>>>>>> If so and concluding they revert cleanly, perhaps it should be considered at this point in the RC's
>>>>>>>>> to revert them for 5.0 and try again for 5.1 ?
>>>>>>>>>
>>>>>>>> Before removing both it would be good to test with only the barrier-removal removed.
>>>>>>>>
>>>>>>>
>>>>>>> Commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>>> looks buggy to me, since the skb might have been freed already on another cpu when you call
>>>>>>>
>>>>>>> You could try :
>>>>>>>
>>>>>>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>>>>>>> index 3624e67aef72c92ed6e908e2c99ac2d381210126..f907d484165d9fd775e81bf2bfb9aa4ddedb1c93 100644
>>>>>>> --- a/drivers/net/ethernet/realtek/r8169.c
>>>>>>> +++ b/drivers/net/ethernet/realtek/r8169.c
>>>>>>> @@ -6070,6 +6070,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>         dma_addr_t mapping;
>>>>>>>         u32 opts[2], len;
>>>>>>>         bool stop_queue;
>>>>>>> +       bool door_bell;
>>>>>>>         int frags;
>>>>>>>  
>>>>>>>         if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
>>>>>>> @@ -6116,6 +6117,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>         /* Force memory writes to complete before releasing descriptor */
>>>>>>>         dma_wmb();
>>>>>>>  
>>>>>>> +       door_bell = __netdev_sent_queue(dev, skb->len, skb->xmit_more);
>>>>>>> +
>>>>>>>         txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
>>>>>>>  
>>>>>>>         /* Force all memory writes to complete before notifying device */
>>>>>>> @@ -6127,7 +6130,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>         if (unlikely(stop_queue))
>>>>>>>                 netif_stop_queue(dev);
>>>>>>>  
>>>>>>> -       if (__netdev_sent_queue(dev, skb->len, skb->xmit_more)) {
>>>>>>> +       if (door_bell) {
>>>>>>>                 RTL_W8(tp, TxPoll, NPQ);
>>>>>>>                 mmiowb();
>>>>>>>         }
>>>>>>>
>>>>>> Thanks a lot for checking and for the proposed fix.
>>>>>> Sander, can you try with this patch on top of 5.0-rc5 w/o removing two two commits?
>>>>>
>>>>> I have done that already during the night .. the results:
>>>>> - I can confirm 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 is the first commit which causes hitting the BUG_ON in lib/dynamic_queue_limits.c.
>>>>>   (in other word, with only reverting bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 it still blows up).
>>>>>
>>>>> - The Eric's patch only applies cleanly with bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 reverted, so that's what I tested.
>>>>>   The patch seems to prevent hitting the BUG_ON in lib/dynamic_queue_limits.c, it has run this night and I gave done a few kernel compiles
>>>>>   this morning. How ever during these kernel compiles i'm getting a transmit queue timeout which i haven't seen with 4.20.x, although i regularly
>>>>>   compile kernels in the same way as I do now. The only thing I can't say if that is due to this change, or if it's again something else.
>>>>>   Which makes me somewhat inclined to go testing the complete revert some more and see if I can trigger the queue timeout on that or not.
>>>>>
>>>>>   If I can, it is a separate issue.
>>>>>   If I can't it seems even with a patch it still seems as a regression in comparison with 4.20.x, for which
>>>>>   a revert would be the right thing to do (since as you indicated these are merely optimizations), 
>>>>>   which would give us more time for 5.1 to try to solve things on top of the 5.0-release-to-be.
>>>>>   (especially since I seem to still have other issues which need to be sorted out and time is limited)
>>>>>
>>>>>   The timeout in question:
>>>>>         [28336.869479] NETDEV WATCHDOG: eth1 (r8169): transmit queue 0 timed out
>>>>>         [28336.881498] WARNING: CPU: 0 PID: 6925 at net/sched/sch_generic.c:461 dev_watchdog+0x20b/0x210
>>>>>         [28336.893358] Modules linked in:
>>>>>         [28336.904106] CPU: 0 PID: 6925 Comm: cc1 Tainted: G      D           5.0.0-rc5-20190208-thp-net-florian-rtl8169-eric-doflr+ #1
>>>>>         [28336.917385] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
>>>>>         [28336.928988] RIP: e030:dev_watchdog+0x20b/0x210
>>>>>         [28336.940623] Code: 00 49 63 4e e0 eb 90 4c 89 e7 c6 05 ad d8 f1 00 01 e8 a9 32 fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7 50 59 89 82 e8 e5 92 4d ff <0f> 0b eb c0 90 48 c7 47 08 00 00 00 00 48 c7 07 00 00 00 00 0f b7
>>>>>         [28336.965265] RSP: e02b:ffff88807d403ea0 EFLAGS: 00010286
>>>>>         [28336.977465] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff82a69db8
>>>>>         [28336.991265] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000200
>>>>>         [28337.008865] RBP: ffff88807936e41c R08: 0000000000000000 R09: 0000000000000819
>>>>>         [28337.022250] R10: 0000000000000202 R11: ffffffff8247ca80 R12: ffff88807936e000
>>>>>         [28337.035204] R13: 0000000000000000 R14: ffff88807936e440 R15: 0000000000000001
>>>>>         [28337.049832] FS:  00007f53e9bf3840(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
>>>>>         [28337.062524] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>         [28337.075086] CR2: 00007f53e60c4000 CR3: 000000001a0be000 CR4: 0000000000000660
>>>>>         [28337.090052] Call Trace:
>>>>>         [28337.103615]  <IRQ>
>>>>>         [28337.116587]  ? qdisc_destroy+0x120/0x120
>>>>>         [28337.128905]  call_timer_fn+0x19/0x90
>>>>>         [28337.141892]  expire_timers+0x8b/0xa0
>>>>>         [28337.153354]  run_timer_softirq+0x7e/0x160
>>>>>         [28337.165931]  ? handle_irq_event_percpu+0x4c/0x70
>>>>>         [28337.176548]  ? handle_percpu_irq+0x32/0x50
>>>>>         [28337.186734]  __do_softirq+0xed/0x229
>>>>>         [28337.196404]  ? hypervisor_callback+0xa/0x20
>>>>>         [28337.207822]  irq_exit+0xb7/0xc0
>>>>>         [28337.218978]  xen_evtchn_do_upcall+0x27/0x40
>>>>>         [28337.230763]  xen_do_hypervisor_callback+0x29/0x40
>>>>>         [28337.241261]  </IRQ>
>>>>>         [28337.253283] RIP: e033:0xff7e62
>>>>>         [28337.264899] Code: 35 43 0f c7 00 4c 89 ef e8 8b 6d 67 ff 0f 1f 00 44 89 e0 44 89 e2 c1 e8 06 83 e2 3f 48 8b 0c c5 40 8d c6 01 48 0f a3 d1 72 0e <48> 8b 04 c5 50 8d c6 01 48 0f a3 d0 73 0b 44 89 e6 4c 89 ef e8 b5
>>>>>         [28337.288677] RSP: e02b:00007fff0fc6a340 EFLAGS: 00000202
>>>>>         [28337.299234] RAX: 0000000000000000 RBX: 00007f53e60c3580 RCX: 0000000000000000
>>>>>         [28337.309577] RDX: 0000000000000034 RSI: 0000000001e71a98 RDI: 00007fff0fc6a538
>>>>>         [28337.320724] RBP: 00007fff0fc6a4b0 R08: 0000000000000000 R09: 0000000000000000
>>>>>         [28337.331829] R10: 0000000000000001 R11: 00000000020cb3d0 R12: 0000000000000034
>>>>>         [28337.343900] R13: 00007fff0fc6a538 R14: 0000000000000000 R15: 0000000000000001
>>>>>         [28337.353977] ---[ end trace 6ff49f09286816b7 ]---
>>>>>
>>>> Thanks for your efforts. As usual this tx timeout trace says basically nothing except
>>>> "timeout" and root cause could be anything. Earlier you reported a memory allocation error,
>>>> did that occur again?
>>>> If we decide to revert, I'd leave removal of the memory barriers in (as it doesn't seem to
>>>> contribute to the issue) and just submit a patch to effectively revert
>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356.
>>>
>>> I can't say if that is correct, because i haven't tested that.
>>>
>>> Another thing I could test is:
>>>  - putting all the r8169 patches (and prerequisites) that went into 5.0 
>>>    up to bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3, onto 4.20.7 and see what that does.
>>>    If that would be feasible (not too many needed prerequisites out of r8169) and if 
>>>    you could spare me some time and prep such a branch somewhere so i can pull and compile that,
>>>    that would be great.
>>>
>>
>> Unfortunately there's quite a number of changes. Regarding __netdev_tx_sent_queue()
>> and watchdog timeout I found the following comment in drivers/net/ethernet/sfc/tx.c,
>> efx_enqueue_skb():
>>
>> 	if (__netdev_tx_sent_queue(tx_queue->core_txq, skb_len, xmit_more)) {
>> 		struct efx_tx_queue *txq2 = efx_tx_queue_partner(tx_queue);
>>
>> 		/* There could be packets left on the partner queue if those
>> 		 * SKBs had skb->xmit_more set. If we do not push those they
>> 		 * could be left for a long time and cause a netdev watchdog.
>> 		 */
>> 		if (txq2->xmit_more_available)
>> 			efx_nic_push_buffers(txq2);
>>
>> But I'm not sure whether the situation in r8169 is comparable. The following patch
>> implements what I mentioned earlier: It leaves all other 5.0 changes in place and
>> effectively reverts 2e6eedb4813e34d8d84ac0eb3afb668966f3f356. Would be great if
>> you could give it a try.
> 
> Hi Heiner,
> 
> It took some time to respond, because I had another issue with 5.0 which intervened with proper testing, 
> but fortunately I could pinpoint without doing a full bisect and revert that commit for further testing.
> 
> So there is still time left and I could do a more proper run with your patch below.
> Unfortunately i still get a splat (see below) with this, although i'm not sure it is related, 
> just that I can't tell.
> 
I checked further and there's a handful of network drivers using __napi_alloc_skb() with __GFP_NOWARN,
maybe to avoid such splats. Did the splat impact functionality? When checking the code in r8169 the
affected packet would just be dropped.

> Perhaps Linus as Oops-decoding-guru has an idea ?
> 
> --
> Sander
> 
> [39041.689007] dpkg-deb: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
> [39041.689016] CPU: 4 PID: 14078 Comm: dpkg-deb Not tainted 5.0.0-rc5-20190209-kallweit+ #1
> [39041.689017] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
> [39041.689018] Call Trace:
> [39041.689022]  <IRQ>
> [39041.689030]  dump_stack+0x5c/0x7b
> [39041.689033]  warn_alloc+0x103/0x190
> [39041.689036]  __alloc_pages_nodemask+0xe3d/0xe80
> [39041.689039]  ? ip_rcv+0x48/0xc0
> [39041.689040]  ? ip_rcv_finish_core.isra.0+0x360/0x360
> [39041.689042]  page_frag_alloc+0x117/0x150
> [39041.689044]  __napi_alloc_skb+0x83/0xd0
> [39041.689048]  rtl8169_poll+0x210/0x640
> [39041.689051]  net_rx_action+0x23d/0x370
> [39041.689054]  __do_softirq+0xed/0x229
> [39041.689058]  irq_exit+0xb7/0xc0
> [39041.689061]  xen_evtchn_do_upcall+0x27/0x40
> [39041.689063]  xen_do_hypervisor_callback+0x29/0x40
> [39041.689064]  </IRQ>
> [39041.689066] RIP: e030:_atomic_dec_and_lock+0x2/0x40
> [39041.689068] Code: ff 39 05 c5 c1 c9 00 89 c7 89 c6 76 0f 83 eb 01 83 fb ff 75 d9 5b 89 f8 5d 41 5c c3 0f 0b 90 90 90 90 90 90 90 90 90 90 8b 07 <83> f8 01 74 0c 8d 50 ff f0 0f b1 17 75 f2 31 c0 c3 55 53 48 89 fb
> [39041.689069] RSP: e02b:ffffc9000705b990 EFLAGS: 00000246
> [39041.689071] RAX: 0000000000000001 RBX: ffff888017082640 RCX: 0000000000000000
> [39041.689071] RDX: 0000000000000000 RSI: ffff8880170826c0 RDI: ffff888017082788
> [39041.689072] RBP: ffff8880170826c0 R08: ffffc9000705bb00 R09: ffffc9000705bb00
> [39041.689073] R10: ffffc9000705bb58 R11: ffff88807fc17000 R12: ffff888017082788
> [39041.689073] R13: ffff88806cc8cf58 R14: ffff888017082640 R15: ffff888009990240
> [39041.689077]  iput+0x63/0x1a0
> [39041.689079]  __dentry_kill+0xc5/0x170
> [39041.689080]  shrink_dentry_list+0x93/0x1c0
> [39041.689082]  prune_dcache_sb+0x4d/0x70
> [39041.689084]  super_cache_scan+0x104/0x190
> [39041.689087]  do_shrink_slab+0x12c/0x1e0
> [39041.689089]  shrink_slab+0xdf/0x2b0
> [39041.689091]  shrink_node+0x158/0x470
> [39041.689093]  do_try_to_free_pages+0xd1/0x380
> [39041.689095]  try_to_free_pages+0xb2/0xe0
> [39041.689097]  __alloc_pages_nodemask+0x603/0xe80
> [39041.689099]  ? __pagevec_lru_add_fn+0x1b1/0x290
> [39041.689102]  alloc_pages_vma+0x7b/0x1c0
> [39041.689106]  __handle_mm_fault+0xdb3/0x1060
> [39041.689109]  ? xen_mc_flush+0xc0/0x190
> [39041.689110]  handle_mm_fault+0xf8/0x200
> [39041.689113]  __do_page_fault+0x231/0x4a0
> [39041.689115]  ? page_fault+0x8/0x30
> [39041.689116]  page_fault+0x1e/0x30
> [39041.689118] RIP: e033:0x7fb9851d012e
> [39041.689119] Code: 29 c2 48 3b 15 7b a3 31 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 41 0f 29 01 <41> 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9 40 48 83 fa
> [39041.689119] RSP: e02b:00007fb958b36d38 EFLAGS: 00010202
> [39041.689120] RAX: 00007fb97a617f0e RBX: 000000000000f004 RCX: 00007fb948008be3
> [39041.689121] RDX: 00000000000080c2 RSI: 00007fb948000b31 RDI: 00007fb97a617f0e
> [39041.689122] RBP: 00000000000ff062 R08: 0000000000000002 R09: 00007fb97a620000
> [39041.689123] R10: 0000000000000004 R11: 00007fb97a626f02 R12: 000000000000f005
> [39041.689123] R13: 00007fb948000b28 R14: 0000562d76b63710 R15: 0000000000000003
> [39041.689125] Mem-Info:
> [39041.689130] active_anon:78775 inactive_anon:49211 isolated_anon:0
>                 active_file:106409 inactive_file:107531 isolated_file:0
>                 unevictable:552 dirty:175 writeback:0 unstable:0
>                 slab_reclaimable:13739 slab_unreclaimable:16454
>                 mapped:1605 shmem:23 pagetables:2900 bounce:0
>                 free:3681 free_pcp:935 free_cma:0
> [39041.689132] Node 0 active_anon:315100kB inactive_anon:196844kB active_file:425636kB inactive_file:430124kB unevictable:2208kB isolated(anon):0kB isolated(file):0kB mapped:6420kB dirty:700kB writeback:0kB shmem:92kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [39041.689133] Node 0 DMA free:7480kB min:44kB low:56kB high:68kB active_anon:0kB inactive_anon:7832kB active_file:472kB inactive_file:4kB unevictable:0kB writepending:0kB present:15956kB managed:15872kB mlocked:0kB kernel_stack:0kB pagetables:12kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> [39041.689136] lowmem_reserve[]: 0 1865 1865 1865
> [39041.689138] Node 0 DMA32 free:7244kB min:19472kB low:21380kB high:23288kB active_anon:315360kB inactive_anon:188144kB active_file:425164kB inactive_file:430120kB unevictable:2208kB writepending:700kB present:2080768kB managed:1674968kB mlocked:2208kB kernel_stack:9632kB pagetables:11588kB bounce:0kB free_pcp:3740kB local_pcp:528kB free_cma:0kB
> [39041.689140] lowmem_reserve[]: 0 0 0 0
> [39041.689142] Node 0 DMA: 6*4kB (UME) 6*8kB (UE) 7*16kB (UME) 6*32kB (ME) 5*64kB (UME) 3*128kB (UE) 5*256kB (UME) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (M) 0*4096kB = 7480kB
> [39041.689148] Node 0 DMA32: 69*4kB (U) 315*8kB (UE) 138*16kB (UE) 70*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7244kB
> [39041.689153] 214701 total pagecache pages
> [39041.689155] 273 pages in swap cache
> [39041.689156] Swap cache stats: add 100978, delete 100706, find 1158/1257
> [39041.689156] Free swap  = 3790588kB
> [39041.689157] Total swap = 4194300kB
> [39041.689157] 524181 pages RAM
> [39041.689158] 0 pages HighMem/MovableOnly
> [39041.689158] 101471 pages reserved
> [39041.689159] 0 pages cma reserved
> 
> 
> 
> 
> 
>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>> index e8a112149..3cca2ffb2 100644
>> --- a/drivers/net/ethernet/realtek/r8169.c
>> +++ b/drivers/net/ethernet/realtek/r8169.c
>> @@ -6192,7 +6192,6 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>  	struct device *d = tp_to_dev(tp);
>>  	dma_addr_t mapping;
>>  	u32 opts[2], len;
>> -	bool stop_queue;
>>  	int frags;
>>  
>>  	if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
>> @@ -6234,6 +6233,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>  
>>  	txd->opts2 = cpu_to_le32(opts[1]);
>>  
>> +	netdev_sent_queue(dev, skb->len);
>> +
>>  	skb_tx_timestamp(skb);
>>  
>>  	/* Force memory writes to complete before releasing descriptor */
>> @@ -6246,14 +6247,14 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>  
>>  	tp->cur_tx += frags + 1;
>>  
>> -	stop_queue = !rtl_tx_slots_avail(tp, MAX_SKB_FRAGS);
>> -	if (unlikely(stop_queue))
>> -		netif_stop_queue(dev);
>> -
>> -	if (__netdev_sent_queue(dev, skb->len, skb->xmit_more))
>> -		RTL_W8(tp, TxPoll, NPQ);
>> +	RTL_W8(tp, TxPoll, NPQ);
>>  
>> -	if (unlikely(stop_queue)) {
>> +	if (!rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) {
>> +		/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
>> +		 * not miss a ring update when it notices a stopped queue.
>> +		 */
>> +		smp_wmb();
>> +		netif_stop_queue(dev);
>>  		/* Sync with rtl_tx:
>>  		 * - publish queue status and cur_tx ring index (write barrier)
>>  		 * - refresh dirty_tx ring index (read barrier).
>>
> 
> 


^ permalink raw reply

* Re: Possible bug into DSA2 code.
From: Rodolfo Giometti @ 2019-02-10 11:45 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Florian Fainelli, Vivien Didelot, David S. Miller, netdev
In-Reply-To: <20190209193409.GI30856@lunn.ch>

On 09/02/2019 20:34, Andrew Lunn wrote:
>> So we I see two possible solutions:
>>
>> 1) having both ds->slave_mii_bus and ds->ops->phy_read already defined is an
>> error, then it must be signaled to the calling code, or
> 
> I don't think we can do that. mv88e6xxx optionally instantiates the
> MDIO busses, depending on what is in device tree. If there is no mdio
> property, we need the DSA core to create an MDIO bus.

OK, but using the following check to know if DSA did such allocation is not 
correct because DSA drivers can allocate it by their own:

static void dsa_switch_teardown(struct dsa_switch *ds)
{
         if (ds->slave_mii_bus && ds->ops->phy_read)
                 mdiobus_unregister(ds->slave_mii_bus);

Maybe can we add a flag to register ds->slave_mii_bus allocation by DSA?

> Looking at the driver, ds->slave_mii_bus is assigned in
> mv88e6xxx_setup().
> 
> We have talked about adding a teardown() to the ops structure. This
> seems like another argument we should do it. The mv88e6xxx_teardown()
> can set ds->slave_mii_bus back to NULL, undoing what it did in the
> setup code.

This seems reasonable to me, but in this case you have to call teardown() 
operation before calling mdiobus_unregister() into dsa_switch_teardown() or we 
still have the problem...

Ciao,

Rodolfo

-- 
GNU/Linux Solutions                  e-mail: giometti@enneenne.com
Linux Device Driver                          giometti@linux.it
Embedded Systems                     phone:  +39 349 2432127
UNIX programming                     skype:  rodolfo.giometti

^ permalink raw reply

* [PATCH v1 net-next 0/2] Change tc action identifiers to be more consistent
From: Eli Cohen @ 2019-02-10 12:24 UTC (permalink / raw)
  To: davem, jhs, xiyou.wangcong, jiri; +Cc: netdev, simon.horman, Eli Cohen

This two patch series modifies TC actions identifiers to be more consistent and
also puts them in one place so new identifiers numbers can be chosen more
easily.

Eli Cohen (2):
  net: Move all TC actions identifiers to one place
  net: Change TCA_ACT_* to TCA_ID_* to match that of TCA_ID_POLICE

Changelog:
v0 -> v1:
   Remove definition of TCA_ACT_SIMP from net/sched/act_simple.c to avoid two
   identical definitions.


 include/net/act_api.h                     |  2 +-
 include/net/tc_act/tc_csum.h              |  2 +-
 include/net/tc_act/tc_gact.h              |  2 +-
 include/net/tc_act/tc_mirred.h            |  4 +--
 include/net/tc_act/tc_pedit.h             |  2 +-
 include/net/tc_act/tc_sample.h            |  2 +-
 include/net/tc_act/tc_skbedit.h           |  2 +-
 include/net/tc_act/tc_tunnel_key.h        |  4 +--
 include/net/tc_act/tc_vlan.h              |  2 +-
 include/uapi/linux/pkt_cls.h              | 45 ++++++++++++++++++++++++++++---
 include/uapi/linux/tc_act/tc_bpf.h        |  2 --
 include/uapi/linux/tc_act/tc_connmark.h   |  2 --
 include/uapi/linux/tc_act/tc_csum.h       |  2 --
 include/uapi/linux/tc_act/tc_gact.h       |  1 -
 include/uapi/linux/tc_act/tc_ife.h        |  1 -
 include/uapi/linux/tc_act/tc_ipt.h        |  3 ---
 include/uapi/linux/tc_act/tc_mirred.h     |  1 -
 include/uapi/linux/tc_act/tc_nat.h        |  2 --
 include/uapi/linux/tc_act/tc_pedit.h      |  2 --
 include/uapi/linux/tc_act/tc_sample.h     |  2 --
 include/uapi/linux/tc_act/tc_skbedit.h    |  2 --
 include/uapi/linux/tc_act/tc_skbmod.h     |  2 --
 include/uapi/linux/tc_act/tc_tunnel_key.h |  2 --
 include/uapi/linux/tc_act/tc_vlan.h       |  2 --
 net/sched/act_api.c                       |  2 +-
 net/sched/act_bpf.c                       |  2 +-
 net/sched/act_connmark.c                  |  2 +-
 net/sched/act_csum.c                      |  2 +-
 net/sched/act_gact.c                      |  2 +-
 net/sched/act_ife.c                       |  2 +-
 net/sched/act_ipt.c                       |  4 +--
 net/sched/act_mirred.c                    |  2 +-
 net/sched/act_nat.c                       |  2 +-
 net/sched/act_pedit.c                     |  2 +-
 net/sched/act_police.c                    |  2 +-
 net/sched/act_sample.c                    |  2 +-
 net/sched/act_simple.c                    |  4 +--
 net/sched/act_skbedit.c                   |  2 +-
 net/sched/act_skbmod.c                    |  2 +-
 net/sched/act_tunnel_key.c                |  2 +-
 net/sched/act_vlan.c                      |  2 +-
 tools/include/uapi/linux/tc_act/tc_bpf.h  |  2 --
 42 files changed, 70 insertions(+), 63 deletions(-)

-- 
2.9.5


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox