Netdev List
 help / color / mirror / Atom feed
* Re: general protection fault in smc_setsockopt
From: syzbot @ 2018-04-21  3:49 UTC (permalink / raw)
  To: davem, linux-kernel, linux-s390, netdev, syzkaller-bugs, ubraun,
	ubraun
In-Reply-To: <001a11426eae20adb5056655e0f1@google.com>

syzbot has found reproducer for the following crash on upstream commit
83beed7b2b26f232d782127792dd0cd4362fdc41 (Fri Apr 20 17:56:32 2018 +0000)
Merge branch 'fixes' of  
git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=9045fc589fcd196ef522

So far this crash happened 124 times on net-next, upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6522155797839872
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=5566093930266624
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=6661555940753408
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=1808800213120130118
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+9045fc589fcd196ef522@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed.

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4520 Comm: syzkaller696326 Not tainted 4.17.0-rc1+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
RIP: 0010:smc_setsockopt+0x8b/0x120 net/smc/af_smc.c:1287
RSP: 0018:ffff8801b433fcc8 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000020000000
RDX: 0000000000000005 RSI: ffffffff873e3bf6 RDI: 0000000000000028
RBP: ffff8801b433fcf8 R08: 0000000000000000 R09: ffffed00359a3780
R10: ffffed00359a3780 R11: ffff8801acd1bc03 R12: ffff8801b433fd40
R13: 0000000000000021 R14: 000000000000000d R15: 0000000020000000
FS:  0000000000b01880(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000562e1fa410d0 CR3: 00000001acf1e000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  __sys_setsockopt+0x1bd/0x390 net/socket.c:1903
  __do_sys_setsockopt net/socket.c:1914 [inline]
  __se_sys_setsockopt net/socket.c:1911 [inline]
  __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43fd19
RSP: 002b:00007ffccc4960f8 EFLAGS: 00000217 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fd19
RDX: 000000000000000d RSI: 0000000000000021 RDI: 0000000000000004
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000020000000 R11: 0000000000000217 R12: 0000000000401640
R13: 00000000004016d0 R14: 0000000000000000 R15: 0000000000000000
Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 93 00 00 00 48 8b 9b 50 04 00 00 48  
b8 00 00 00 00 00 fc ff df 48 8d 7b 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00  
75 62 48 b8 00 00 00 00 00 fc ff df 4c 8b 63 28 49
RIP: smc_setsockopt+0x8b/0x120 net/smc/af_smc.c:1287 RSP: ffff8801b433fcc8
---[ end trace 3858d0cd9ce5e4d4 ]---

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: Alexander Duyck @ 2018-04-21  2:08 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Miller, Sowmini Varadhan, Samudrala, Sridhar, Netdev,
	Willem de Bruijn
In-Reply-To: <CAF=yD-JX--1m8owpS+EqNAhU6rP-8KCDDXC51HNpRss4px6yig@mail.gmail.com>

On Fri, Apr 20, 2018 at 2:58 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>>>> Also any plans for HW offload support for this? I vaguely recall that
>>>> the igb and ixgbe parts had support for something like this in
>>>> hardware. I would have to double check to see what exactly is
>>>> supported.
>>>
>>> I hadn't given that much thought until the request yesterday to
>>> expose the NETIF_F_GSO_UDP_L4 flag through ethtool. By
>>> virtue of having only a single fixed segmentation length, it
>>> appears reasonably straightforward to offload.
>>
>> Actually I just got a chance to start on a review of things. Do we
>> need to have to use both GSO_UDP and and GSO_UDP_L4? It might be
>> better if we could split these up and specifically call out GSO_UDP as
>> UFO and GSO_UDP_L4 as being UDP segmentation.
>
> Thanks for taking a look, Alex.
>
> Agreed, I'll revise that. My initial thought was that both gso skbs need
> to take the same udp gso special case branches in places like act_csum
> and ovs. But on rereading that seems an unsafe approach, as some
> branches are fragmentation specific. I'll review them all and add
> separate SKB_GSO_UDP_L4 cases where needed, instead.

Sounds good. Keep me in the loop on these patches. I will see if we
can get support for ixgbe, ixgbevf, and igb to at least offload this
since it shouldn't be much of a lift to get that. Once we have that I
can also take a look and see if there would be much work needed to add
gso_partial support so we could deal with tunnel encapsulated UDP
segmentation offload support on those devices.

Thanks.

- Alex

^ permalink raw reply

* pull-request: bpf-next 2018-04-21
From: Daniel Borkmann @ 2018-04-21  1:07 UTC (permalink / raw)
  To: davem; +Cc: daniel, ast, netdev

Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Initial work on BPF Type Format (BTF) is added, which is a meta
   data format which describes the data types of BPF programs / maps.
   BTF has its roots from CTF (Compact C-Type format) with a number
   of changes to it. First use case is to provide a generic pretty
   print capability for BPF maps inspection, later work will also
   add BTF to bpftool. pahole support to convert dwarf to BTF will
   be upstreamed as well (https://github.com/iamkafai/pahole/tree/btf),
   from Martin.

2) Add a new xdp_bpf_adjust_tail() BPF helper for XDP that allows
   for changing the data_end pointer. Only shrinking is currently
   supported which helps for crafting ICMP control messages. Minor
   changes in drivers have been added where needed so they recalc
   the packet's length also when data_end was adjusted, from Nikita.

3) Improve bpftool to make it easier to feed hex bytes via cmdline
   for map operations, from Quentin.

4) Add support for various missing BPF prog types and attach types
   that have been added to kernel recently but neither to bpftool
   nor libbpf yet. Doc and bash completion updates have been added
   as well for bpftool, from Andrey.

5) Proper fix for avoiding to leak info stored in frame data on page
   reuse for the two bpf_xdp_adjust_{head,meta} helpers by disallowing
   to move the pointers into struct xdp_frame area, from Jesper.

6) Follow-up compile fix from BTF in order to include stdbool.h in
   libbpf, from Björn.

7) Few fixes in BPF sample code, that is, a typo on the netdevice
   in a comment and fixup proper dump of XDP action code in the
   tracepoint exception, from Wang and Jesper.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit a2d481b326c98b6b67eea8a378c858d57ca5ff3d:

  ipv6: send netlink notifications for manually configured addresses (2018-04-17 14:03:56 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to 878a4d328104fed86ea321c48a5245eed44fbe14:

  libbpf: fixed build error for samples/bpf/ (2018-04-20 10:12:19 +0200)

----------------------------------------------------------------
Andrey Ignatov (3):
      bpftool: Support new prog types and attach types
      libbpf: Support guessing post_bind{4,6} progs
      libbpf: Type functions for raw tracepoints

Björn Töpel (1):
      libbpf: fixed build error for samples/bpf/

Daniel Borkmann (3):
      Merge branch 'bpf-libbpf-add-types'
      Merge branch 'bpf-xdp-adjust-tail'
      Merge branch 'bpf-type-format'

Jesper Dangaard Brouer (2):
      samples/bpf: fix xdp_monitor user output for tracepoint exception
      bpf: reserve xdp_frame size in xdp headroom

Martin KaFai Lau (10):
      bpf: btf: Introduce BPF Type Format (BTF)
      bpf: btf: Validate type reference
      bpf: btf: Check members of struct/union
      bpf: btf: Add pretty print capability for data with BTF type info
      bpf: btf: Add BPF_BTF_LOAD command
      bpf: btf: Add BPF_OBJ_GET_INFO_BY_FD support to BTF fd
      bpf: btf: Add pretty print support to the basic arraymap
      bpf: btf: Sync bpf.h and btf.h to tools/
      bpf: btf: Add BTF support to libbpf
      bpf: btf: Add BTF tests

Nikita V. Shirokov (11):
      bpf: adding bpf_xdp_adjust_tail helper
      bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail
      bpf: make mlx4 compatible w/ bpf_xdp_adjust_tail
      bpf: make bnxt compatible w/ bpf_xdp_adjust_tail
      bpf: make cavium thunder compatible w/ bpf_xdp_adjust_tail
      bpf: make netronome nfp compatible w/ bpf_xdp_adjust_tail
      bpf: make tun compatible w/ bpf_xdp_adjust_tail
      bpf: make virtio compatible w/ bpf_xdp_adjust_tail
      bpf: making bpf_prog_test run aware of possible data_end ptr change
      bpf: adding tests for bpf_xdp_adjust_tail
      bpf: add bpf_xdp_adjust_tail sample prog

Quentin Monnet (1):
      tools: bpftool: make it easier to feed hex bytes to bpftool

Wang Sheng-Hui (1):
      samples/bpf: correct comment in sock_example.c

 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c      |    2 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c         |    2 +-
 .../net/ethernet/netronome/nfp/nfp_net_common.c    |    2 +-
 drivers/net/tun.c                                  |    3 +-
 drivers/net/virtio_net.c                           |    7 +-
 include/linux/bpf.h                                |   20 +-
 include/linux/btf.h                                |   48 +
 include/uapi/linux/bpf.h                           |   22 +-
 include/uapi/linux/btf.h                           |  130 ++
 kernel/bpf/Makefile                                |    1 +
 kernel/bpf/arraymap.c                              |   50 +
 kernel/bpf/btf.c                                   | 2064 ++++++++++++++++++++
 kernel/bpf/inode.c                                 |  156 +-
 kernel/bpf/syscall.c                               |   51 +-
 net/bpf/test_run.c                                 |    3 +-
 net/core/dev.c                                     |   10 +-
 net/core/filter.c                                  |   41 +-
 samples/bpf/Makefile                               |    4 +
 samples/bpf/sock_example.c                         |    4 +-
 samples/bpf/xdp_adjust_tail_kern.c                 |  152 ++
 samples/bpf/xdp_adjust_tail_user.c                 |  142 ++
 samples/bpf/xdp_monitor_user.c                     |    2 +-
 tools/bpf/bpftool/Documentation/bpftool-cgroup.rst |   11 +-
 tools/bpf/bpftool/Documentation/bpftool-map.rst    |   29 +-
 tools/bpf/bpftool/bash-completion/bpftool          |   14 +-
 tools/bpf/bpftool/cgroup.c                         |   15 +-
 tools/bpf/bpftool/map.c                            |   17 +-
 tools/bpf/bpftool/prog.c                           |    3 +
 tools/include/uapi/linux/bpf.h                     |   22 +-
 tools/include/uapi/linux/btf.h                     |  130 ++
 tools/lib/bpf/Build                                |    2 +-
 tools/lib/bpf/bpf.c                                |   92 +-
 tools/lib/bpf/bpf.h                                |   17 +
 tools/lib/bpf/btf.c                                |  374 ++++
 tools/lib/bpf/btf.h                                |   22 +
 tools/lib/bpf/libbpf.c                             |  156 +-
 tools/lib/bpf/libbpf.h                             |    5 +
 tools/testing/selftests/bpf/Makefile               |   26 +-
 tools/testing/selftests/bpf/bpf_helpers.h          |    5 +
 tools/testing/selftests/bpf/test_adjust_tail.c     |   30 +
 tools/testing/selftests/bpf/test_btf.c             | 1669 ++++++++++++++++
 tools/testing/selftests/bpf/test_btf_haskv.c       |   48 +
 tools/testing/selftests/bpf/test_btf_nokv.c        |   43 +
 tools/testing/selftests/bpf/test_progs.c           |   32 +
 45 files changed, 5591 insertions(+), 89 deletions(-)
 create mode 100644 include/linux/btf.h
 create mode 100644 include/uapi/linux/btf.h
 create mode 100644 kernel/bpf/btf.c
 create mode 100644 samples/bpf/xdp_adjust_tail_kern.c
 create mode 100644 samples/bpf/xdp_adjust_tail_user.c
 create mode 100644 tools/include/uapi/linux/btf.h
 create mode 100644 tools/lib/bpf/btf.c
 create mode 100644 tools/lib/bpf/btf.h
 create mode 100644 tools/testing/selftests/bpf/test_adjust_tail.c
 create mode 100644 tools/testing/selftests/bpf/test_btf.c
 create mode 100644 tools/testing/selftests/bpf/test_btf_haskv.c
 create mode 100644 tools/testing/selftests/bpf/test_btf_nokv.c

^ permalink raw reply

* pull-request: bpf 2018-04-21
From: Daniel Borkmann @ 2018-04-21  0:22 UTC (permalink / raw)
  To: davem; +Cc: daniel, ast, netdev

Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a deadlock between mm->mmap_sem and bpf_event_mutex when
   one task is detaching a BPF prog via perf_event_detach_bpf_prog()
   and another one dumping through bpf_prog_array_copy_info(). For
   the latter we move the copy_to_user() out of the bpf_event_mutex
   lock to fix it, from Yonghong.

2) Fix test_sock and test_sock_addr.sh failures. The former was
   hitting rlimit issues and the latter required ping to specify
   the address family, from Yonghong.

3) Remove a dead check in sockmap's sock_map_alloc(), from Jann.

4) Add generated files to BPF kselftests gitignore that were previously
   missed, from Anders.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit 1cc5954f44150bb70cac07c3cc5df7cf0dfb61ec:

  ip_gre: clear feature flags when incompatible o_flags are set (2018-04-10 11:03:32 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to 6ab690aa439803347743c0d899ac422774fdd5e7:

  bpf: sockmap remove dead check (2018-04-20 22:09:51 +0200)

----------------------------------------------------------------
Anders Roxell (1):
      selftests: bpf: update .gitignore with missing generated files

Jann Horn (1):
      bpf: sockmap remove dead check

Yonghong Song (2):
      bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
      tools/bpf: fix test_sock and test_sock_addr.sh failure

 include/linux/bpf.h                           |  4 +--
 kernel/bpf/core.c                             | 45 +++++++++++++++++----------
 kernel/bpf/sockmap.c                          |  3 --
 kernel/trace/bpf_trace.c                      | 25 ++++++++++++---
 tools/testing/selftests/bpf/.gitignore        |  3 ++
 tools/testing/selftests/bpf/test_sock.c       |  1 +
 tools/testing/selftests/bpf/test_sock_addr.c  |  1 +
 tools/testing/selftests/bpf/test_sock_addr.sh |  4 +--
 8 files changed, 59 insertions(+), 27 deletions(-)

^ permalink raw reply

* [PATCH net-next, v2] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)
From: Stephen Suryaputra @ 2018-04-20 23:52 UTC (permalink / raw)
  To: netdev, ja; +Cc: Stephen Suryaputra

This is enhanced from the proposed patch by Igor Maravic in 2011 to
support per interface IPv4 stats. The enhancement is mainly adding a
kernel configuration option CONFIG_IP_IFSTATS_TABLE.

Changes from v1:
- Count input statistics in the input device (per Julian Anastasov).
- Changes so that the existing per interface IPv6 stats aren't affected
  when the option isn't enabled.
- Restore the order of calling ipv4_proc_init().

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
---
 drivers/net/vrf.c               |   2 +-
 include/linux/inetdevice.h      |  22 ++++++
 include/net/icmp.h              |   6 +-
 include/net/ip.h                | 144 ++++++++++++++++++++++++++++++++++++++--
 include/net/ipv6.h              |  30 ++++-----
 include/net/netns/mib.h         |   3 +
 include/net/snmp.h              |  12 ++++
 net/bridge/br_netfilter_hooks.c |  10 +--
 net/dccp/ipv4.c                 |   4 +-
 net/ipv4/Kconfig                |   8 +++
 net/ipv4/datagram.c             |   2 +-
 net/ipv4/devinet.c              |  84 ++++++++++++++++++++++-
 net/ipv4/icmp.c                 |  32 ++++-----
 net/ipv4/inet_connection_sock.c |   8 ++-
 net/ipv4/ip_forward.c           |   8 +--
 net/ipv4/ip_fragment.c          |  20 +++---
 net/ipv4/ip_input.c             |  31 +++++----
 net/ipv4/ip_output.c            |  42 +++++++-----
 net/ipv4/ipmr.c                 |   6 +-
 net/ipv4/ping.c                 |   9 ++-
 net/ipv4/proc.c                 | 126 +++++++++++++++++++++++++++++++++++
 net/ipv4/raw.c                  |   4 +-
 net/ipv4/route.c                |   6 +-
 net/ipv4/tcp_ipv4.c             |   4 +-
 net/ipv4/udp.c                  |   4 +-
 net/l2tp/l2tp_ip.c              |   4 +-
 net/l2tp/l2tp_ip6.c             |   2 +-
 net/mpls/af_mpls.c              |   2 +-
 net/netfilter/ipvs/ip_vs_xmit.c |   6 +-
 net/sctp/input.c                |   2 +-
 net/sctp/output.c               |   2 +-
 31 files changed, 528 insertions(+), 117 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 90b5f39..2b17ead 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -593,7 +593,7 @@ static int vrf_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
 
-	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
+	IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index e16fe7d..3d120cb 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -22,6 +22,15 @@ struct ipv4_devconf {
 
 #define MC_HASH_SZ_LOG 9
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+struct ipv4_devstat {
+	struct proc_dir_entry *proc_dir_entry;
+	DEFINE_SNMP_STAT(struct ipstats_mib, ip);
+	DEFINE_SNMP_STAT_ATOMIC(struct icmp_mib_device, icmpdev);
+	DEFINE_SNMP_STAT_ATOMIC(struct icmpmsg_mib_device, icmpmsgdev);
+};
+#endif
+
 struct in_device {
 	struct net_device	*dev;
 	refcount_t		refcnt;
@@ -45,6 +54,9 @@ struct in_device {
 
 	struct neigh_parms	*arp_parms;
 	struct ipv4_devconf	cnf;
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	struct ipv4_devstat stats;
+#endif
 	struct rcu_head		rcu_head;
 };
 
@@ -216,6 +228,16 @@ static inline struct in_device *__in_dev_get_rcu(const struct net_device *dev)
 	return rcu_dereference(dev->ip_ptr);
 }
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static inline struct in_device *__in_dev_get_rcu_safely(const struct net_device *dev)
+{
+	if (likely(dev))
+		return rcu_dereference(dev->ip_ptr);
+	else
+		return NULL;
+}
+#endif
+
 static inline struct in_device *in_dev_get(const struct net_device *dev)
 {
 	struct in_device *in_dev;
diff --git a/include/net/icmp.h b/include/net/icmp.h
index 3ef2743..70612ca 100644
--- a/include/net/icmp.h
+++ b/include/net/icmp.h
@@ -29,10 +29,6 @@ struct icmp_err {
 };
 
 extern const struct icmp_err icmp_err_convert[];
-#define ICMP_INC_STATS(net, field)	SNMP_INC_STATS((net)->mib.icmp_statistics, field)
-#define __ICMP_INC_STATS(net, field)	__SNMP_INC_STATS((net)->mib.icmp_statistics, field)
-#define ICMPMSGOUT_INC_STATS(net, field)	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field+256)
-#define ICMPMSGIN_INC_STATS(net, field)		SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field)
 
 struct dst_entry;
 struct net_proto_family;
@@ -43,6 +39,6 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info);
 int icmp_rcv(struct sk_buff *skb);
 void icmp_err(struct sk_buff *skb, u32 info);
 int icmp_init(void);
-void icmp_out_count(struct net *net, unsigned char type);
+void icmp_out_count(struct net_device *dev, unsigned char type);
 
 #endif	/* _ICMP_H */
diff --git a/include/net/ip.h b/include/net/ip.h
index dc4a2d6..ada33da 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -27,6 +27,7 @@
 #include <linux/in.h>
 #include <linux/skbuff.h>
 #include <linux/jhash.h>
+#include <linux/inetdevice.h>
 
 #include <net/inet_sock.h>
 #include <net/route.h>
@@ -218,12 +219,134 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 			   const struct ip_reply_arg *arg,
 			   unsigned int len);
 
-#define IP_INC_STATS(net, field)	SNMP_INC_STATS64((net)->mib.ip_statistics, field)
-#define __IP_INC_STATS(net, field)	__SNMP_INC_STATS64((net)->mib.ip_statistics, field)
-#define IP_ADD_STATS(net, field, val)	SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
-#define __IP_ADD_STATS(net, field, val) __SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
-#define IP_UPD_PO_STATS(net, field, val) SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
-#define __IP_UPD_PO_STATS(net, field, val) __SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
+#ifdef CONFIG_IP_IFSTATS_TABLE
+#define _DEVINC(net, statname, mod, idev, field)			\
+({	\
+	struct in_device *_idev = (idev);	\
+	if (likely(_idev))	\
+		mod##SNMP_INC_STATS((_idev)->stats.statname, (field));	\
+	mod##SNMP_INC_STATS((net)->mib.statname##_statistics, (field));	\
+})
+
+/* per device counters are atomic_long_t */
+#define _DEVINCATOMIC(net, statname, mod, idev, field)	\
+({	\
+	struct in_device *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_INC_STATS_ATOMIC_LONG((_idev)->stats.statname##dev, (field));	\
+	mod##SNMP_INC_STATS((net)->mib.statname##_statistics, (field));	\
+})
+
+/* per device and per net counters are atomic_long_t */
+#define _DEVINC_ATOMIC_ATOMIC(net, statname, idev, field)	\
+({	\
+	struct in_device *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_INC_STATS_ATOMIC_LONG((_idev)->stats.statname##dev, (field));	\
+	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.statname##_statistics, (field));	\
+})
+
+#define _DEVADD(net, statname, mod, idev, field, val)	\
+({	\
+	struct in_device *_idev = (idev);	\
+	if (likely(_idev))	\
+		mod##SNMP_ADD_STATS((_idev)->stats.statname, (field), (val));	\
+	mod##SNMP_ADD_STATS((net)->mib.statname##_statistics, (field), (val));	\
+})
+
+#define _DEVUPD(net, statname, mod, idev, field, val)	\
+({	\
+	struct in_device *_idev = (idev);	\
+	if (likely(_idev))	\
+		mod##SNMP_UPD_PO_STATS((_idev)->stats.statname, field, (val));	\
+	mod##SNMP_UPD_PO_STATS((net)->mib.statname##_statistics, field, (val));	\
+})
+
+#define IP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINC(net, ip, , __in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+
+#define __IP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINC(net, ip, __, __in_dev_get_rcu_safely(dev), field); \
+		rcu_read_unlock();	\
+	})
+
+#define IP_ADD_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVADD(net, ip, , __in_dev_get_rcu_safely(dev), field, val); \
+		rcu_read_unlock();	\
+	})
+
+#define __IP_ADD_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVADD(net, ip, __, __in_dev_get_rcu_safely(dev), field, val); \
+		rcu_read_unlock();	\
+	})
+
+#define IP_UPD_PO_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVUPD(net, ip, , __in_dev_get_rcu_safely(dev), field, val); \
+		rcu_read_unlock();	\
+	})
+
+#define __IP_UPD_PO_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVUPD(net, ip, __, __in_dev_get_rcu_safely(dev), field, val); \
+		rcu_read_unlock();	\
+	})
+
+#define ICMP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINCATOMIC(net, icmp, , __in_dev_get_rcu_safely(dev), field); \
+		rcu_read_unlock();	\
+	})
+
+#define __ICMP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINCATOMIC(net, icmp, __,	\
+			__in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+
+#define ICMPMSGOUT_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINC_ATOMIC_ATOMIC(net, icmpmsg,	\
+			__in_dev_get_rcu_safely(dev), field+256);	\
+		rcu_read_unlock();	\
+	})
+
+#define ICMPMSGIN_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINC_ATOMIC_ATOMIC(net, icmpmsg,	\
+			__in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+#else
+#define IP_INC_STATS(net, dev, field)	SNMP_INC_STATS64((net)->mib.ip_statistics, field)
+#define __IP_INC_STATS(net, dev, field)	__SNMP_INC_STATS64((net)->mib.ip_statistics, field)
+#define IP_ADD_STATS(net, dev, field, val)	SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
+#define __IP_ADD_STATS(net, dev, field, val) __SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
+#define IP_UPD_PO_STATS(net, dev, field, val) SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
+#define __IP_UPD_PO_STATS(net, dev, field, val) __SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
+
+#define ICMP_INC_STATS(net, dev, field)	SNMP_INC_STATS((net)->mib.icmp_statistics, field)
+#define __ICMP_INC_STATS(net, dev, field)	__SNMP_INC_STATS((net)->mib.icmp_statistics, field)
+#define ICMPMSGOUT_INC_STATS(net, dev, field)	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field+256)
+#define ICMPMSGIN_INC_STATS(net, dev, field)		SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field)
+#endif
 #define NET_INC_STATS(net, field)	SNMP_INC_STATS((net)->mib.net_statistics, field)
 #define __NET_INC_STATS(net, field)	__SNMP_INC_STATS((net)->mib.net_statistics, field)
 #define NET_ADD_STATS(net, field, adnd)	SNMP_ADD_STATS((net)->mib.net_statistics, field, adnd)
@@ -663,4 +786,13 @@ extern int sysctl_icmp_msgs_burst;
 int ip_misc_proc_init(void);
 #endif
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+#ifdef CONFIG_PROC_FS
+extern int snmp_register_dev(struct in_device *idev);
+extern int snmp_unregister_dev(struct in_device *idev);
+#else
+extern int snmp_register_dev(struct in_device *idev) { return 0; }
+extern int snmp_unregister_dev(struct in_device *idev) { return 0; }
+#endif
+#endif
 #endif	/* _IP_H */
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 68b167d..a26ffc9 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -163,7 +163,7 @@ struct frag_hdr {
 extern int sysctl_mld_max_msf;
 extern int sysctl_mld_qrv;
 
-#define _DEVINC(net, statname, mod, idev, field)			\
+#define _DEVINC6(net, statname, mod, idev, field)			\
 ({									\
 	struct inet6_dev *_idev = (idev);				\
 	if (likely(_idev != NULL))					\
@@ -172,7 +172,7 @@ extern int sysctl_mld_qrv;
 })
 
 /* per device counters are atomic_long_t */
-#define _DEVINCATOMIC(net, statname, mod, idev, field)			\
+#define _DEVINCATOMIC6(net, statname, mod, idev, field)			\
 ({									\
 	struct inet6_dev *_idev = (idev);				\
 	if (likely(_idev != NULL))					\
@@ -181,7 +181,7 @@ extern int sysctl_mld_qrv;
 })
 
 /* per device and per net counters are atomic_long_t */
-#define _DEVINC_ATOMIC_ATOMIC(net, statname, idev, field)		\
+#define _DEVINC_ATOMIC_ATOMIC6(net, statname, idev, field)		\
 ({									\
 	struct inet6_dev *_idev = (idev);				\
 	if (likely(_idev != NULL))					\
@@ -189,7 +189,7 @@ extern int sysctl_mld_qrv;
 	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.statname##_statistics, (field));\
 })
 
-#define _DEVADD(net, statname, mod, idev, field, val)			\
+#define _DEVADD6(net, statname, mod, idev, field, val)			\
 ({									\
 	struct inet6_dev *_idev = (idev);				\
 	if (likely(_idev != NULL))					\
@@ -197,7 +197,7 @@ extern int sysctl_mld_qrv;
 	mod##SNMP_ADD_STATS((net)->mib.statname##_statistics, (field), (val));\
 })
 
-#define _DEVUPD(net, statname, mod, idev, field, val)			\
+#define _DEVUPD6(net, statname, mod, idev, field, val)			\
 ({									\
 	struct inet6_dev *_idev = (idev);				\
 	if (likely(_idev != NULL))					\
@@ -208,26 +208,26 @@ extern int sysctl_mld_qrv;
 /* MIBs */
 
 #define IP6_INC_STATS(net, idev,field)		\
-		_DEVINC(net, ipv6, , idev, field)
+		_DEVINC6(net, ipv6, , idev, field)
 #define __IP6_INC_STATS(net, idev,field)	\
-		_DEVINC(net, ipv6, __, idev, field)
+		_DEVINC6(net, ipv6, __, idev, field)
 #define IP6_ADD_STATS(net, idev,field,val)	\
-		_DEVADD(net, ipv6, , idev, field, val)
+		_DEVADD6(net, ipv6, , idev, field, val)
 #define __IP6_ADD_STATS(net, idev,field,val)	\
-		_DEVADD(net, ipv6, __, idev, field, val)
+		_DEVADD6(net, ipv6, __, idev, field, val)
 #define IP6_UPD_PO_STATS(net, idev,field,val)   \
-		_DEVUPD(net, ipv6, , idev, field, val)
+		_DEVUPD6(net, ipv6, , idev, field, val)
 #define __IP6_UPD_PO_STATS(net, idev,field,val)   \
-		_DEVUPD(net, ipv6, __, idev, field, val)
+		_DEVUPD6(net, ipv6, __, idev, field, val)
 #define ICMP6_INC_STATS(net, idev, field)	\
-		_DEVINCATOMIC(net, icmpv6, , idev, field)
+		_DEVINCATOMIC6(net, icmpv6, , idev, field)
 #define __ICMP6_INC_STATS(net, idev, field)	\
-		_DEVINCATOMIC(net, icmpv6, __, idev, field)
+		_DEVINCATOMIC6(net, icmpv6, __, idev, field)
 
 #define ICMP6MSGOUT_INC_STATS(net, idev, field)		\
-	_DEVINC_ATOMIC_ATOMIC(net, icmpv6msg, idev, field +256)
+	_DEVINC_ATOMIC_ATOMIC6(net, icmpv6msg, idev, field +256)
 #define ICMP6MSGIN_INC_STATS(net, idev, field)	\
-	_DEVINC_ATOMIC_ATOMIC(net, icmpv6msg, idev, field)
+	_DEVINC_ATOMIC_ATOMIC6(net, icmpv6msg, idev, field)
 
 struct ip6_ra_chain {
 	struct ip6_ra_chain	*next;
diff --git a/include/net/netns/mib.h b/include/net/netns/mib.h
index 830bdf3..798bbc2 100644
--- a/include/net/netns/mib.h
+++ b/include/net/netns/mib.h
@@ -5,6 +5,9 @@
 #include <net/snmp.h>
 
 struct netns_mib {
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	struct proc_dir_entry *proc_net_devsnmp;
+#endif
 	DEFINE_SNMP_STAT(struct tcp_mib, tcp_statistics);
 	DEFINE_SNMP_STAT(struct ipstats_mib, ip_statistics);
 	DEFINE_SNMP_STAT(struct linux_mib, net_statistics);
diff --git a/include/net/snmp.h b/include/net/snmp.h
index c9228ad..b7c99a3 100644
--- a/include/net/snmp.h
+++ b/include/net/snmp.h
@@ -61,14 +61,26 @@ struct ipstats_mib {
 
 /* ICMP */
 #define ICMP_MIB_MAX	__ICMP_MIB_MAX
+/* per network ns counters */
 struct icmp_mib {
 	unsigned long	mibs[ICMP_MIB_MAX];
 };
 
 #define ICMPMSG_MIB_MAX	__ICMPMSG_MIB_MAX
+/* per network ns counters */
 struct icmpmsg_mib {
 	atomic_long_t	mibs[ICMPMSG_MIB_MAX];
 };
+#ifdef CONFIG_IP_IFSTATS_TABLE
+/* per device counters, (shared on all cpus) */
+struct icmp_mib_device {
+	atomic_long_t	mibs[ICMP_MIB_MAX];
+};
+/* per device counters, (shared on all cpus) */
+struct icmpmsg_mib_device {
+	atomic_long_t	mibs[ICMPMSG_MIB_MAX];
+};
+#endif
 
 /* ICMP6 (IPv6-ICMP) */
 #define ICMP6_MIB_MAX	__ICMP6_MIB_MAX
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 9b16eaf..f9576b7 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -218,13 +218,13 @@ static int br_validate_ipv4(struct net *net, struct sk_buff *skb)
 
 	len = ntohs(iph->tot_len);
 	if (skb->len < len) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
+		__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INTRUNCATEDPKTS);
 		goto drop;
 	} else if (len < (iph->ihl*4))
 		goto inhdr_error;
 
 	if (pskb_trim_rcsum(skb, len)) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 
@@ -237,9 +237,9 @@ static int br_validate_ipv4(struct net *net, struct sk_buff *skb)
 	return 0;
 
 csum_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_CSUMERRORS);
+	__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_CSUMERRORS);
 inhdr_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+	__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INHDRERRORS);
 drop:
 	return -1;
 }
@@ -691,7 +691,7 @@ br_nf_ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
 		     (IPCB(skb)->frag_max_size &&
 		      IPCB(skb)->frag_max_size > mtu))) {
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_FRAGFAILS);
 		kfree_skb(skb);
 		return -EMSGSIZE;
 	}
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index b08feb2..40e4c00 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -258,7 +258,7 @@ static void dccp_v4_err(struct sk_buff *skb, u32 info)
 				       iph->saddr, ntohs(dh->dccph_sport),
 				       inet_iif(skb), 0);
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 
@@ -468,7 +468,7 @@ static struct dst_entry* dccp_v4_route_skb(struct net *net, struct sock *sk,
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
 	rt = ip_route_output_flow(net, &fl4, sk);
 	if (IS_ERR(rt)) {
-		IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+		IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 		return NULL;
 	}
 
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 80dad30..6470b95 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -52,6 +52,14 @@ config IP_ADVANCED_ROUTER
 
 	  If unsure, say N here.
 
+config IP_IFSTATS_TABLE
+	def_bool n
+	depends on IP_ADVANCED_ROUTER
+	prompt "IP: interface statistics"
+	help
+	  This option enables per interface statistics for IPv4. Refer to
+	  RFC 4293.
+
 config IP_FIB_TRIE_STATS
 	bool "FIB TRIE statistics"
 	depends on IP_ADVANCED_ROUTER
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f915abf..7acaadb 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -55,7 +55,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		if (err == -ENETUNREACH)
-			IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+			IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 		goto out;
 	}
 
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 40f0017..c2809f3 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -218,6 +218,49 @@ static void inet_free_ifa(struct in_ifaddr *ifa)
 	call_rcu(&ifa->rcu_head, inet_rcu_free_ifa);
 }
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static int snmp_alloc_dev(struct in_device *idev)
+{
+	int i;
+
+	idev->stats.ip = alloc_percpu(struct ipstats_mib);
+	if (!idev->stats.ip)
+		goto err_ip;
+
+	for_each_possible_cpu(i) {
+		struct ipstats_mib *addrconf_stats;
+
+		addrconf_stats = per_cpu_ptr(idev->stats.ip, i);
+		u64_stats_init(&addrconf_stats->syncp);
+	}
+
+	idev->stats.icmpdev = kzalloc(sizeof(*idev->stats.icmpdev),
+				      GFP_KERNEL);
+	if (!idev->stats.icmpdev)
+		goto err_icmp;
+	idev->stats.icmpmsgdev = kzalloc(sizeof(*idev->stats.icmpmsgdev),
+					 GFP_KERNEL);
+	if (!idev->stats.icmpmsgdev)
+		goto err_icmpmsg;
+
+	return 0;
+
+err_icmpmsg:
+	kfree(idev->stats.icmpdev);
+err_icmp:
+	free_percpu(idev->stats.ip);
+err_ip:
+	return -ENOMEM;
+}
+
+static void snmp_free_dev(struct in_device *idev)
+{
+	kfree(idev->stats.icmpmsgdev);
+	kfree(idev->stats.icmpdev);
+	free_percpu(idev->stats.ip);
+}
+#endif
+
 void in_dev_finish_destroy(struct in_device *idev)
 {
 	struct net_device *dev = idev->dev;
@@ -229,10 +272,14 @@ void in_dev_finish_destroy(struct in_device *idev)
 	pr_debug("%s: %p=%s\n", __func__, idev, dev ? dev->name : "NIL");
 #endif
 	dev_put(dev);
-	if (!idev->dead)
+	if (!idev->dead) {
 		pr_err("Freeing alive in_device %p\n", idev);
-	else
+	} else {
+#ifdef CONFIG_IP_IFSTATS_TABLE
+		snmp_free_dev(idev);
+#endif
 		kfree(idev);
+	}
 }
 EXPORT_SYMBOL(in_dev_finish_destroy);
 
@@ -257,6 +304,25 @@ static struct in_device *inetdev_init(struct net_device *dev)
 		dev_disable_lro(dev);
 	/* Reference in_dev->dev */
 	dev_hold(dev);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	err = snmp_alloc_dev(in_dev);
+	if (err < 0) {
+		netdev_crit(dev,
+			    "%s(): cannot allocate memory for statistics; dev=%s err=%d.\n",
+			    __func__, dev->name, err);
+		neigh_parms_release(&arp_tbl, in_dev->arp_parms);
+		dev_put(dev);
+		kfree(in_dev);
+		return NULL;
+	}
+
+	err = snmp_register_dev(in_dev);
+	if (err < 0)
+		netdev_warn(dev,
+			    "%s(): cannot create /proc/net/dev_snmp/%s err=%d\n",
+			    __func__, dev->name, err);
+#endif
+
 	/* Account for reference dev->ip_ptr (below) */
 	refcount_set(&in_dev->refcnt, 1);
 
@@ -306,6 +372,9 @@ static void inetdev_destroy(struct in_device *in_dev)
 	}
 
 	RCU_INIT_POINTER(dev->ip_ptr, NULL);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	snmp_unregister_dev(in_dev);
+#endif
 
 	devinet_sysctl_unregister(in_dev);
 	neigh_parms_release(&arp_tbl, in_dev->arp_parms);
@@ -1529,8 +1598,19 @@ static int inetdev_event(struct notifier_block *this, unsigned long event,
 		 */
 		inetdev_changename(dev, in_dev);
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+		snmp_unregister_dev(in_dev);
+#endif
 		devinet_sysctl_unregister(in_dev);
 		devinet_sysctl_register(in_dev);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+		{
+			int err = snmp_register_dev(in_dev);
+
+			if (err)
+				return notifier_from_errno(err);
+		}
+#endif
 		break;
 	}
 out:
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 1617604..4d5c092 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -338,10 +338,10 @@ static bool icmpv4_xrlim_allow(struct net *net, struct rtable *rt,
 /*
  *	Maintain the counters used in the SNMP statistics for outgoing ICMP
  */
-void icmp_out_count(struct net *net, unsigned char type)
+void icmp_out_count(struct net_device *dev, unsigned char type)
 {
-	ICMPMSGOUT_INC_STATS(net, type);
-	ICMP_INC_STATS(net, ICMP_MIB_OUTMSGS);
+	ICMPMSGOUT_INC_STATS(dev_net(dev), dev, type);
+	ICMP_INC_STATS(dev_net(dev), dev, ICMP_MIB_OUTMSGS);
 }
 
 /*
@@ -370,13 +370,14 @@ static void icmp_push_reply(struct icmp_bxm *icmp_param,
 {
 	struct sock *sk;
 	struct sk_buff *skb;
+	struct net_device *dev = (*rt)->dst.dev;
 
-	sk = icmp_sk(dev_net((*rt)->dst.dev));
+	sk = icmp_sk(dev_net(dev));
 	if (ip_append_data(sk, fl4, icmp_glue_bits, icmp_param,
 			   icmp_param->data_len+icmp_param->head_len,
 			   icmp_param->head_len,
 			   ipc, rt, MSG_DONTWAIT) < 0) {
-		__ICMP_INC_STATS(sock_net(sk), ICMP_MIB_OUTERRORS);
+		__ICMP_INC_STATS(sock_net(sk), dev, ICMP_MIB_OUTERRORS);
 		ip_flush_pending_frames(sk);
 	} else if ((skb = skb_peek(&sk->sk_write_queue)) != NULL) {
 		struct icmphdr *icmph = icmp_hdr(skb);
@@ -760,7 +761,7 @@ static void icmp_socket_deliver(struct sk_buff *skb, u32 info)
 	 * avoid additional coding at protocol handlers.
 	 */
 	if (!pskb_may_pull(skb, iph->ihl * 4 + 8)) {
-		__ICMP_INC_STATS(dev_net(skb->dev), ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(dev_net(skb->dev), skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 
@@ -792,8 +793,9 @@ static bool icmp_unreach(struct sk_buff *skb)
 	struct icmphdr *icmph;
 	struct net *net;
 	u32 info = 0;
+	struct net_device *dev = skb_dst(skb)->dev;
 
-	net = dev_net(skb_dst(skb)->dev);
+	net = dev_net(dev);
 
 	/*
 	 *	Incomplete header ?
@@ -852,7 +854,7 @@ static bool icmp_unreach(struct sk_buff *skb)
 		info = ntohl(icmph->un.gateway) >> 24;
 		break;
 	case ICMP_TIME_EXCEEDED:
-		__ICMP_INC_STATS(net, ICMP_MIB_INTIMEEXCDS);
+		__ICMP_INC_STATS(net, dev, ICMP_MIB_INTIMEEXCDS);
 		if (icmph->code == ICMP_EXC_FRAGTIME)
 			goto out;
 		break;
@@ -890,7 +892,7 @@ static bool icmp_unreach(struct sk_buff *skb)
 out:
 	return true;
 out_err:
-	__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+	__ICMP_INC_STATS(net, dev, ICMP_MIB_INERRORS);
 	return false;
 }
 
@@ -902,7 +904,7 @@ static bool icmp_unreach(struct sk_buff *skb)
 static bool icmp_redirect(struct sk_buff *skb)
 {
 	if (skb->len < sizeof(struct iphdr)) {
-		__ICMP_INC_STATS(dev_net(skb->dev), ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(dev_net(skb->dev), skb->dev, ICMP_MIB_INERRORS);
 		return false;
 	}
 
@@ -982,7 +984,7 @@ static bool icmp_timestamp(struct sk_buff *skb)
 	return true;
 
 out_err:
-	__ICMP_INC_STATS(dev_net(skb_dst(skb)->dev), ICMP_MIB_INERRORS);
+	__ICMP_INC_STATS(dev_net(skb_dst(skb)->dev), skb_dst(skb)->dev, ICMP_MIB_INERRORS);
 	return false;
 }
 
@@ -1022,7 +1024,7 @@ int icmp_rcv(struct sk_buff *skb)
 		skb_set_network_header(skb, nh);
 	}
 
-	__ICMP_INC_STATS(net, ICMP_MIB_INMSGS);
+	__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INMSGS);
 
 	if (skb_checksum_simple_validate(skb))
 		goto csum_error;
@@ -1032,7 +1034,7 @@ int icmp_rcv(struct sk_buff *skb)
 
 	icmph = icmp_hdr(skb);
 
-	ICMPMSGIN_INC_STATS(net, icmph->type);
+	ICMPMSGIN_INC_STATS(net, skb->dev, icmph->type);
 	/*
 	 *	18 is the highest 'known' ICMP type. Anything else is a mystery
 	 *
@@ -1078,9 +1080,9 @@ int icmp_rcv(struct sk_buff *skb)
 	kfree_skb(skb);
 	return NET_RX_DROP;
 csum_error:
-	__ICMP_INC_STATS(net, ICMP_MIB_CSUMERRORS);
+	__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_CSUMERRORS);
 error:
-	__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+	__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 	goto drop;
 }
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 881ac6d..30afff4 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -539,6 +539,7 @@ struct dst_entry *inet_csk_route_req(const struct sock *sk,
 	struct net *net = read_pnet(&ireq->ireq_net);
 	struct ip_options_rcu *opt;
 	struct rtable *rt;
+	struct net_device *dev = NULL;
 
 	opt = ireq_opt_deref(ireq);
 
@@ -557,9 +558,10 @@ struct dst_entry *inet_csk_route_req(const struct sock *sk,
 	return &rt->dst;
 
 route_err:
+	dev = rt->dst.dev;
 	ip_rt_put(rt);
 no_route:
-	__IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_OUTNOROUTES);
 	return NULL;
 }
 EXPORT_SYMBOL_GPL(inet_csk_route_req);
@@ -574,6 +576,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
 	struct ip_options_rcu *opt;
 	struct flowi4 *fl4;
 	struct rtable *rt;
+	struct net_device *dev = NULL;
 
 	opt = rcu_dereference(ireq->ireq_opt);
 	fl4 = &newinet->cork.fl.u.ip4;
@@ -593,9 +596,10 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
 	return &rt->dst;
 
 route_err:
+	dev = rt->dst.dev;
 	ip_rt_put(rt);
 no_route:
-	__IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_OUTNOROUTES);
 	return NULL;
 }
 EXPORT_SYMBOL_GPL(inet_csk_route_child_sock);
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index b54b948..c84e177 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -66,8 +66,8 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s
 {
 	struct ip_options *opt	= &(IPCB(skb)->opt);
 
-	__IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
-	__IP_ADD_STATS(net, IPSTATS_MIB_OUTOCTETS, skb->len);
+	__IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTFORWDATAGRAMS);
+	__IP_ADD_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTOCTETS, skb->len);
 
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
@@ -121,7 +121,7 @@ int ip_forward(struct sk_buff *skb)
 	IPCB(skb)->flags |= IPSKB_FORWARDED;
 	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
 	if (ip_exceeds_mtu(skb, mtu)) {
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 			  htonl(mtu));
 		goto drop;
@@ -158,7 +158,7 @@ int ip_forward(struct sk_buff *skb)
 
 too_many_hops:
 	/* Tell the sender its packet died... */
-	__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+	__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INHDRERRORS);
 	icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
 drop:
 	kfree_skb(skb);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 8e9528e..6324254 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -136,6 +136,7 @@ static void ip_expire(struct timer_list *t)
 {
 	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	const struct iphdr *iph;
+	struct net_device *dev;
 	struct sk_buff *head;
 	struct net *net;
 	struct ipq *qp;
@@ -151,16 +152,17 @@ static void ip_expire(struct timer_list *t)
 		goto out;
 
 	ipq_kill(qp);
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+	dev = dev_get_by_index_rcu(net, qp->iif);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMFAILS);
 
 	head = qp->q.fragments;
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMTIMEOUT);
 
 	if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
 		goto out;
 
-	head->dev = dev_get_by_index_rcu(net, qp->iif);
+	head->dev = dev;
 	if (!head->dev)
 		goto out;
 
@@ -237,7 +239,9 @@ static int ip_frag_too_far(struct ipq *qp)
 		struct net *net;
 
 		net = container_of(qp->q.net, struct net, ipv4.frags);
-		__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+		rcu_read_lock();
+		__IP_INC_STATS(net, dev_get_by_index_rcu(net, qp->iif), IPSTATS_MIB_REASMFAILS);
+		rcu_read_unlock();
 	}
 
 	return rc;
@@ -582,7 +586,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 
 	ip_send_check(iph);
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMOKS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMOKS);
 	qp->q.fragments = NULL;
 	qp->q.fragments_tail = NULL;
 	return 0;
@@ -594,7 +598,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 out_oversize:
 	net_info_ratelimited("Oversized IP packet from %pI4\n", &qp->q.key.v4.saddr);
 out_fail:
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMFAILS);
 	return err;
 }
 
@@ -605,7 +609,7 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
 	int vif = l3mdev_master_ifindex_rcu(dev);
 	struct ipq *qp;
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMREQDS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMREQDS);
 	skb_orphan(skb);
 
 	/* Lookup (or create) queue header */
@@ -622,7 +626,7 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
 		return ret;
 	}
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMFAILS);
 	kfree_skb(skb);
 	return -ENOMEM;
 }
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 7582713..2d3e073 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -197,6 +197,9 @@ static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_b
 		int protocol = ip_hdr(skb)->protocol;
 		const struct net_protocol *ipprot;
 		int raw;
+#ifdef CONFIG_IP_IFSTATS_TABLE
+		struct net_device *dev = skb->dev;
+#endif
 
 	resubmit:
 		raw = raw_local_deliver(skb, protocol);
@@ -217,17 +220,17 @@ static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_b
 				protocol = -ret;
 				goto resubmit;
 			}
-			__IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+			__IP_INC_STATS(net, dev, IPSTATS_MIB_INDELIVERS);
 		} else {
 			if (!raw) {
 				if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
-					__IP_INC_STATS(net, IPSTATS_MIB_INUNKNOWNPROTOS);
+					__IP_INC_STATS(net, dev, IPSTATS_MIB_INUNKNOWNPROTOS);
 					icmp_send(skb, ICMP_DEST_UNREACH,
 						  ICMP_PROT_UNREACH, 0);
 				}
 				kfree_skb(skb);
 			} else {
-				__IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+				__IP_INC_STATS(net, dev, IPSTATS_MIB_INDELIVERS);
 				consume_skb(skb);
 			}
 		}
@@ -272,7 +275,7 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 					      --ANK (980813)
 	*/
 	if (skb_cow(skb, skb_headroom(skb))) {
-		__IP_INC_STATS(dev_net(dev), IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(dev_net(dev), dev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 
@@ -281,7 +284,7 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 	opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
 
 	if (ip_options_compile(dev_net(dev), opt, skb)) {
-		__IP_INC_STATS(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
+		__IP_INC_STATS(dev_net(dev), dev, IPSTATS_MIB_INHDRERRORS);
 		goto drop;
 	}
 
@@ -366,9 +369,9 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 
 	rt = skb_rtable(skb);
 	if (rt->rt_type == RTN_MULTICAST) {
-		__IP_UPD_PO_STATS(net, IPSTATS_MIB_INMCAST, skb->len);
+		__IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_INMCAST, skb->len);
 	} else if (rt->rt_type == RTN_BROADCAST) {
-		__IP_UPD_PO_STATS(net, IPSTATS_MIB_INBCAST, skb->len);
+		__IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_INBCAST, skb->len);
 	} else if (skb->pkt_type == PACKET_BROADCAST ||
 		   skb->pkt_type == PACKET_MULTICAST) {
 		struct in_device *in_dev = __in_dev_get_rcu(dev);
@@ -422,11 +425,11 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 
 
 	net = dev_net(dev);
-	__IP_UPD_PO_STATS(net, IPSTATS_MIB_IN, skb->len);
+	__IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_IN, skb->len);
 
 	skb = skb_share_check(skb, GFP_ATOMIC);
 	if (!skb) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(net, dev, IPSTATS_MIB_INDISCARDS);
 		goto out;
 	}
 
@@ -452,7 +455,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	BUILD_BUG_ON(IPSTATS_MIB_ECT1PKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_ECT_1);
 	BUILD_BUG_ON(IPSTATS_MIB_ECT0PKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_ECT_0);
 	BUILD_BUG_ON(IPSTATS_MIB_CEPKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_CE);
-	__IP_ADD_STATS(net,
+	__IP_ADD_STATS(net, dev,
 		       IPSTATS_MIB_NOECTPKTS + (iph->tos & INET_ECN_MASK),
 		       max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs));
 
@@ -466,7 +469,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 
 	len = ntohs(iph->tot_len);
 	if (skb->len < len) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
+		__IP_INC_STATS(net, dev, IPSTATS_MIB_INTRUNCATEDPKTS);
 		goto drop;
 	} else if (len < (iph->ihl*4))
 		goto inhdr_error;
@@ -476,7 +479,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	 * Note this now means skb->len holds ntohs(iph->tot_len).
 	 */
 	if (pskb_trim_rcsum(skb, len)) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(net, dev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 
@@ -494,9 +497,9 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		       ip_rcv_finish);
 
 csum_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_CSUMERRORS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_CSUMERRORS);
 inhdr_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_INHDRERRORS);
 drop:
 	kfree_skb(skb);
 out:
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4c11b81..54194a9 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -191,9 +191,9 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
 	u32 nexthop;
 
 	if (rt->rt_type == RTN_MULTICAST) {
-		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTMCAST, skb->len);
+		IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUTMCAST, skb->len);
 	} else if (rt->rt_type == RTN_BROADCAST)
-		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTBCAST, skb->len);
+		IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUTBCAST, skb->len);
 
 	/* Be paranoid, rather than too clever. */
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
@@ -339,7 +339,7 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	/*
 	 *	If the indicated interface is up and running, send the packet.
 	 */
-	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
+	IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
@@ -397,7 +397,7 @@ int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
 
-	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
+	IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
@@ -507,7 +507,7 @@ int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
 
 no_route:
 	rcu_read_unlock();
-	IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+	IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 	kfree_skb(skb);
 	return -EHOSTUNREACH;
 }
@@ -548,7 +548,7 @@ static int ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	if (unlikely(!skb->ignore_df ||
 		     (IPCB(skb)->frag_max_size &&
 		      IPCB(skb)->frag_max_size > mtu))) {
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, skb->dev, IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 			  htonl(mtu));
 		kfree_skb(skb);
@@ -575,8 +575,11 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	int offset;
 	__be16 not_last_frag;
 	struct rtable *rt = skb_rtable(skb);
+	struct net_device *dev = rt->dst.dev;
 	int err = 0;
 
+	dev_hold(dev);
+
 	/* for offloaded checksums cleanup checksum before fragmentation */
 	if (skb->ip_summed == CHECKSUM_PARTIAL &&
 	    (err = skb_checksum_help(skb)))
@@ -675,7 +678,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			err = output(net, sk, skb);
 
 			if (!err)
-				IP_INC_STATS(net, IPSTATS_MIB_FRAGCREATES);
+				IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGCREATES);
 			if (err || !frag)
 				break;
 
@@ -685,7 +688,8 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		}
 
 		if (err == 0) {
-			IP_INC_STATS(net, IPSTATS_MIB_FRAGOKS);
+			IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGOKS);
+			dev_put(dev);
 			return 0;
 		}
 
@@ -694,7 +698,8 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			kfree_skb(frag);
 			frag = skb;
 		}
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGFAILS);
+		dev_put(dev);
 		return err;
 
 slow_path_clean:
@@ -811,15 +816,17 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		if (err)
 			goto fail;
 
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGCREATES);
+		IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGCREATES);
 	}
 	consume_skb(skb);
-	IP_INC_STATS(net, IPSTATS_MIB_FRAGOKS);
+	IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGOKS);
+	dev_put(dev);
 	return err;
 
 fail:
 	kfree_skb(skb);
-	IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+	IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGFAILS);
+	dev_put(dev);
 	return err;
 }
 EXPORT_SYMBOL(ip_do_fragment);
@@ -1098,7 +1105,7 @@ static int __ip_append_data(struct sock *sk,
 	err = -EFAULT;
 error:
 	cork->length -= length;
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
+	IP_INC_STATS(sock_net(sk), rt->dst.dev, IPSTATS_MIB_OUTDISCARDS);
 	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
 	return err;
 }
@@ -1306,7 +1313,7 @@ ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 
 error:
 	cork->length -= size;
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
+	IP_INC_STATS(sock_net(sk), rt->dst.dev, IPSTATS_MIB_OUTDISCARDS);
 	return err;
 }
 
@@ -1407,7 +1414,7 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
 	skb_dst_set(skb, &rt->dst);
 
 	if (iph->protocol == IPPROTO_ICMP)
-		icmp_out_count(net, ((struct icmphdr *)
+		icmp_out_count(skb_dst(skb)->dev, ((struct icmphdr *)
 			skb_transport_header(skb))->type);
 
 	ip_cork_release(cork);
@@ -1418,13 +1425,16 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
 int ip_send_skb(struct net *net, struct sk_buff *skb)
 {
 	int err;
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	struct net_device *dev = skb_dst(skb)->dev;
+#endif
 
 	err = ip_local_out(net, skb->sk, skb);
 	if (err) {
 		if (err > 0)
 			err = net_xmit_errno(err);
 		if (err)
-			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+			IP_INC_STATS(net, dev, IPSTATS_MIB_OUTDISCARDS);
 	}
 
 	return err;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 2fb4de3..67ec987 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1778,8 +1778,8 @@ static inline int ipmr_forward_finish(struct net *net, struct sock *sk,
 {
 	struct ip_options *opt = &(IPCB(skb)->opt);
 
-	IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
-	IP_ADD_STATS(net, IPSTATS_MIB_OUTOCTETS, skb->len);
+	IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTFORWDATAGRAMS);
+	IP_ADD_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTOCTETS, skb->len);
 
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
@@ -1862,7 +1862,7 @@ static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
 		 * allow to send ICMP, so that packets will disappear
 		 * to blackhole.
 		 */
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGFAILS);
 		ip_rt_put(rt);
 		goto out_free;
 	}
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 05e47d7..1b4db11 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -712,6 +712,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	__be32 saddr, daddr, faddr;
 	u8  tos;
 	int err;
+	struct net_device *dev;
 
 	pr_debug("ping_v4_sendmsg(sk=%p,sk->num=%u)\n", inet, inet->inet_num);
 
@@ -805,7 +806,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		err = PTR_ERR(rt);
 		rt = NULL;
 		if (err == -ENETUNREACH)
-			IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+			IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 		goto out;
 	}
 
@@ -841,13 +842,17 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	release_sock(sk);
 
 out:
+	dev = rt->dst.dev;
+	dev_hold(dev);
 	ip_rt_put(rt);
 	if (free)
 		kfree(ipc.opt);
 	if (!err) {
-		icmp_out_count(sock_net(sk), user_icmph.type);
+		icmp_out_count(dev, user_icmph.type);
+		dev_put(dev);
 		return len;
 	}
+	dev_put(dev);
 	return err;
 
 do_confirm:
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index a058de6..c0c822f 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -134,6 +134,17 @@ static const struct snmp_mib snmp4_ipextstats_list[] = {
 	SNMP_MIB_SENTINEL
 };
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static const struct snmp_mib snmp4_icmp_list[] = {
+	SNMP_MIB_ITEM("InMsgs", ICMP_MIB_INMSGS),
+	SNMP_MIB_ITEM("InErrors", ICMP_MIB_INERRORS),
+	SNMP_MIB_ITEM("OutMsgs", ICMP_MIB_OUTMSGS),
+	SNMP_MIB_ITEM("OutErrors", ICMP_MIB_OUTERRORS),
+	SNMP_MIB_ITEM("InCsumErrors", ICMP_MIB_CSUMERRORS),
+	SNMP_MIB_SENTINEL
+};
+#endif
+
 static const struct {
 	const char *name;
 	int index;
@@ -473,6 +484,109 @@ static const struct file_operations snmp_seq_fops = {
 };
 
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static void snmp_seq_show_item(struct seq_file *seq, void __percpu **pcpumib,
+			       atomic_long_t *smib,
+			       const struct snmp_mib *itemlist,
+			       char *prefix)
+{
+	char name[32];
+	int i;
+	unsigned long val;
+
+	for (i = 0; itemlist[i].name; i++) {
+		val = pcpumib ?
+			snmp_fold_field64(pcpumib, itemlist[i].entry,
+					  offsetof(struct ipstats_mib, syncp)) :
+			atomic_long_read(smib + itemlist[i].entry);
+		snprintf(name, sizeof(name), "%s%s",
+			 prefix, itemlist[i].name);
+		seq_printf(seq, "%-32s\t%lu\n", name, val);
+	}
+}
+
+static void snmp_seq_show_icmpmsg(struct seq_file *seq, atomic_long_t *smib)
+{
+	char name[32];
+	int i;
+	unsigned long val;
+
+	for (i = 0; i < ICMPMSG_MIB_MAX; i++) {
+		val = atomic_long_read(smib + i);
+		if (val) {
+			snprintf(name, sizeof(name), "Icmp%sType%u",
+				 i & 0x100 ? "Out" : "In", i & 0xff);
+			seq_printf(seq, "%-32s\t%lu\n", name, val);
+		}
+	}
+}
+
+static int snmp_dev_seq_show(struct seq_file *seq, void *v)
+{
+	struct in_device *idev = (struct in_device *)seq->private;
+
+	seq_printf(seq, "%-32s\t%u\n", "ifIndex", idev->dev->ifindex);
+
+	BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);
+
+	snmp_seq_show_item(seq, (void __percpu **)idev->stats.ip, NULL,
+			   snmp4_ipstats_list, "Ip");
+	snmp_seq_show_item(seq, (void __percpu **)idev->stats.ip, NULL,
+			   snmp4_ipextstats_list, "Ip");
+	snmp_seq_show_item(seq, NULL, idev->stats.icmpdev->mibs,
+			   snmp4_icmp_list, "Icmp");
+	snmp_seq_show_icmpmsg(seq, idev->stats.icmpmsgdev->mibs);
+	return 0;
+}
+
+static int snmp_dev_seq_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, snmp_dev_seq_show, PDE_DATA(inode));
+}
+
+static const struct file_operations snmp_dev_seq_fops = {
+	.owner   = THIS_MODULE,
+	.open    = snmp_dev_seq_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = single_release,
+};
+
+int snmp_register_dev(struct in_device *idev)
+{
+	struct proc_dir_entry *p;
+	struct net *net;
+
+	if (!idev || !idev->dev)
+		return -EINVAL;
+
+	net = dev_net(idev->dev);
+	if (!net->mib.proc_net_devsnmp)
+		return -ENOENT;
+
+	p = proc_create_data(idev->dev->name, 0444,
+			     net->mib.proc_net_devsnmp,
+			     &snmp_dev_seq_fops, idev);
+	if (!p)
+		return -ENOMEM;
+
+	idev->stats.proc_dir_entry = p;
+	return 0;
+}
+
+int snmp_unregister_dev(struct in_device *idev)
+{
+	struct net *net = dev_net(idev->dev);
+
+	if (!net->mib.proc_net_devsnmp)
+		return -ENOENT;
+	if (!idev->stats.proc_dir_entry)
+		return -EINVAL;
+	proc_remove(idev->stats.proc_dir_entry);
+	idev->stats.proc_dir_entry = NULL;
+	return 0;
+}
+#endif
 
 /*
  *	Output /proc/net/netstat
@@ -528,9 +642,18 @@ static __net_init int ip_proc_init_net(struct net *net)
 		goto out_netstat;
 	if (!proc_create("snmp", 0444, net->proc_net, &snmp_seq_fops))
 		goto out_snmp;
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	net->mib.proc_net_devsnmp = proc_mkdir("dev_snmp", net->proc_net);
+	if (!net->mib.proc_net_devsnmp)
+		goto out_dev_snmp;
+#endif
 
 	return 0;
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+out_dev_snmp:
+	remove_proc_entry("snmp", net->proc_net);
+#endif
 out_snmp:
 	remove_proc_entry("netstat", net->proc_net);
 out_netstat:
@@ -544,6 +667,9 @@ static __net_exit void ip_proc_exit_net(struct net *net)
 	remove_proc_entry("snmp", net->proc_net);
 	remove_proc_entry("netstat", net->proc_net);
 	remove_proc_entry("sockstat", net->proc_net);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	remove_proc_entry("dev_snmp", net->proc_net);
+#endif
 }
 
 static __net_initdata struct pernet_operations ip_proc_ops = {
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 1b4d335..f39f87a 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -425,7 +425,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 		skb->transport_header += iphlen;
 		if (iph->protocol == IPPROTO_ICMP &&
 		    length >= iphlen + sizeof(struct icmphdr))
-			icmp_out_count(net, ((struct icmphdr *)
+			icmp_out_count(rt->dst.dev, ((struct icmphdr *)
 				skb_transport_header(skb))->type);
 	}
 
@@ -442,7 +442,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 error_free:
 	kfree_skb(skb);
 error:
-	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+	IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_OUTDISCARDS);
 	if (err == -ENOBUFS && !inet->recverr)
 		err = 0;
 	return err;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index ccb25d8..8fc87b4 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -960,11 +960,11 @@ static int ip_error(struct sk_buff *skb)
 	if (!IN_DEV_FORWARD(in_dev)) {
 		switch (rt->dst.error) {
 		case EHOSTUNREACH:
-			__IP_INC_STATS(net, IPSTATS_MIB_INADDRERRORS);
+			__IP_INC_STATS(net, dev, IPSTATS_MIB_INADDRERRORS);
 			break;
 
 		case ENETUNREACH:
-			__IP_INC_STATS(net, IPSTATS_MIB_INNOROUTES);
+			__IP_INC_STATS(net, dev, IPSTATS_MIB_INNOROUTES);
 			break;
 		}
 		goto out;
@@ -979,7 +979,7 @@ static int ip_error(struct sk_buff *skb)
 		break;
 	case ENETUNREACH:
 		code = ICMP_NET_UNREACH;
-		__IP_INC_STATS(net, IPSTATS_MIB_INNOROUTES);
+		__IP_INC_STATS(net, dev, IPSTATS_MIB_INNOROUTES);
 		break;
 	case EACCES:
 		code = ICMP_PKT_FILTERED;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b..df1b989 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -194,7 +194,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		if (err == -ENETUNREACH)
-			IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+			IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 		return err;
 	}
 
@@ -402,7 +402,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 				       th->dest, iph->saddr, ntohs(th->source),
 				       inet_iif(icmp_skb), 0);
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, icmp_skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 	if (sk->sk_state == TCP_TIME_WAIT) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 24b5c59..a8b6567 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -609,7 +609,7 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
 			       iph->saddr, uh->source, skb->dev->ifindex, 0,
 			       udptable, NULL);
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 		return;	/* No socket for error */
 	}
 
@@ -1008,7 +1008,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 			err = PTR_ERR(rt);
 			rt = NULL;
 			if (err == -ENETUNREACH)
-				IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+				IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 			goto out;
 		}
 
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index a9c05b2..b52b2e3 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -381,7 +381,7 @@ static int l2tp_ip_backlog_recv(struct sock *sk, struct sk_buff *skb)
 	return 0;
 
 drop:
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_INDISCARDS);
+	IP_INC_STATS(sock_net(sk), skb->dev, IPSTATS_MIB_INDISCARDS);
 	kfree_skb(skb);
 	return 0;
 }
@@ -504,7 +504,7 @@ static int l2tp_ip_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 no_route:
 	rcu_read_unlock();
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+	IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 	kfree_skb(skb);
 	rc = -EHOSTUNREACH;
 	goto out;
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 9573691..cf2172d 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -462,7 +462,7 @@ static int l2tp_ip6_backlog_recv(struct sock *sk, struct sk_buff *skb)
 	return 0;
 
 drop:
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_INDISCARDS);
+	IP_INC_STATS(sock_net(sk), skb->dev, IPSTATS_MIB_INDISCARDS);
 	kfree_skb(skb);
 	return -1;
 }
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 7a4de6d..c629239 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -141,7 +141,7 @@ void mpls_stats_inc_outucastpkts(struct net_device *dev,
 					   tx_packets,
 					   tx_bytes);
 	} else if (skb->protocol == htons(ETH_P_IP)) {
-		IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
+		IP_UPD_PO_STATS(dev_net(dev), dev, IPSTATS_MIB_OUT, skb->len);
 #if IS_ENABLED(CONFIG_IPV6)
 	} else if (skb->protocol == htons(ETH_P_IPV6)) {
 		struct inet6_dev *in6dev = __in6_dev_get(dev);
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index ba0a0fd..9875463 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -287,7 +287,11 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
 	{
 		if (ip_hdr(skb)->ttl <= 1) {
 			/* Tell the sender its packet died... */
-			__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+			/* at LOCAL_IN the stat is incremented for the input
+			 * dev, at LOCAL_OUT the global is incremented since
+			 * skb->dev is NULL
+			 */
+			__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INHDRERRORS);
 			icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
 			return false;
 		}
diff --git a/net/sctp/input.c b/net/sctp/input.c
index ba8a6e6..fef625a 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -596,7 +596,7 @@ void sctp_v4_err(struct sk_buff *skb, __u32 info)
 	skb->network_header = saveip;
 	skb->transport_header = savesctp;
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 	/* Warning:  The sock lock is held.  Remember to call
diff --git a/net/sctp/output.c b/net/sctp/output.c
index 690d855..8ae6dc0 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -608,7 +608,7 @@ int sctp_packet_transmit(struct sctp_packet *packet, gfp_t gfp)
 	/* drop packet if no dst */
 	dst = dst_clone(tp->dst);
 	if (!dst) {
-		IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+		IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 		kfree_skb(head);
 		goto out;
 	}
-- 
2.7.4

^ permalink raw reply related

* Re: unregister_netdevice: waiting for DEV to become free
From: syzbot @ 2018-04-20 23:59 UTC (permalink / raw)
  To: ddstreet, dvyukov, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <000000000000d71bf6056a34b628@google.com>

syzbot has found reproducer for the following crash on  
https://github.com/google/kmsan.git/master commit
48c6a2b0ab1b752451cdc40b5392471ed1a2a329 (Mon Apr 16 08:42:26 2018 +0000)
mm/kmsan: fix origin calculation in kmsan_internal_check_memory
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=2dfb68e639f0621b19fb

So far this crash happened 180 times on  
https://github.com/google/kmsan.git/master, net-next, upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=4936564132020224
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=5817131010621440
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=6313498770407424
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=6627248707860932248
compiler: clang version 7.0.0 (trunk 329391)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+2dfb68e639f0621b19fb@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed.

IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
device bridge_slave_1 left promiscuous mode
bridge0: port 2(bridge_slave_1) entered disabled state
device bridge_slave_0 left promiscuous mode
bridge0: port 1(bridge_slave_0) entered disabled state
unregister_netdevice: waiting for lo to become free. Usage count = 3

^ permalink raw reply

* Query: eBPF socket filter
From: Tushar Dave @ 2018-04-21  6:18 UTC (permalink / raw)
  To: netdev

Hi,

I am dealing with Reliable Datagram Socket (RDS) protocol and exploring
ways to filter RDS traffic using eBPF.

FYI, RDS sits on top of transport layer and works with
transport such as TCP and IB/RDMA. So RDS messages arrive in form of skb
(over TCP)and scatterlist (over IB/RDMA).

For RDS traffic over TCP, I am using ebpf program of type
BPF_PROG_TYPE_SOCKET_FILTER and with that my filter program receives
__sk_buff as bpf context. This works.
However, I cannot use BPF_PROG_TYPE_SOCKET_FILTER with struct
scatterlist. I am not sure if it would be appropriate to extend BPF
socket filter so that it also work with struct scatterlist?

Ideas/suggestions or any other alternatives please?

Thanks in advance,
-Tushar

^ permalink raw reply

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
From: Martin KaFai Lau @ 2018-04-20 23:13 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: netdev, Tom Herbert, Eric Dumazet, Nikita Shirokov, kernel-team
In-Reply-To: <alpine.LFD.2.20.1804202205290.2906@ja.home.ssi.bg>

On Fri, Apr 20, 2018 at 11:43:59PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Thu, 19 Apr 2018, Martin KaFai Lau wrote:
> 
> > This patch is not a proper fix and mainly serves for discussion purpose.
> > It is based on net-next which I have been using to debug the issue.
> > 
> > The change that works around the issue is in ensure_mtu_is_adequate().
> > Other changes are the rippling effect in function arg.
> > 
> > This bug was uncovered by one of our legacy service that
> > are still using ipvs for load balancing.  In that setup,
> > the ipvs encap the ipv6-tcp packet in another ipv6 hdr
> > before tx it out to eth0.
> > 
> > The problem is the kernel stack could pass a skb (which was
> > originated from a sys_write(tcp_fd)) to the driver with skb->len
> > bigger than the device MTU.  In one NIC setup (with gso and tso off)
> > that we are using, it upset the NIC/driver and caused the tx queue
> > stalled for tens of seconds which is how it got uncovered.
> > (On the NIC side, the NIC firmware and driver have been fixed
> > to avoid this tx queue stall after seeing this skb).
> > 
> > On the kernel side, based on the commit log, this bug should have
> > been exposed after commit 815d22e55b0e ("ip6ip6: Support for GSO/GRO").
> 
> 	Before I go to deeply analyze the GSO code, here
> are some preliminary observations:
> 
> - IPVS started to use iptunnel_handle_offloads() around 2014,
> commit ea1d5d7755a3
> 
> - later (2016) the "ip6ip6: Support for GSO/GRO" commits started
> to use skb_set_inner_ipproto(). But it is missing in the IPVS code.
> I'm not sure if such call can help. At least, I don't see what
> different happens in IPVS compared to ip6ip6_tnl_xmit(), for example.
I don't see skb->inner_ipproto used in
ipv6_gso_segment() -> ipv6_gso_segment() -> tcp6_gso_segment()

One thing may worth to highlight is that this problem sort of exist
even before commit 815d22e55b0e, it just happened that the ipv6_gso_segment()
error out (-EPROTONOSUPPORT) which then stops sending it down to the driver.

> 
> > Before commit 815d22e55b0e, ipv6_gso_segment() would just error
> > out (-EPROTONOSUPPORT) because the tx-ing packet is an ip6ip6.
> > Due to this error out, it avoid passing it to the driver.  The TCP
> > stack then timeout and the TCP mtu probing eventually kicked in to
> > lower the skb->len enough to avoid gso_segment.
> > 
> > After commit 815d22e55b0e, ipv6_gso_segment() -> ipv6_gso_segment()
> > -> tcp6_gso_segment() which segment the packet based on a mss
> > that does not account for the extra IPv6 hdr.
> > 
> > Here is a stack from the WARN_ON() that we added to the driver to
> > capture the issue:
> > [ 1128.611875] WARNING: CPU: 40 PID: 31495 at drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:424 mlx5e_xmit+0x814
> > ...
> > [ 1129.016536] Call Trace:
> > [ 1129.021412]  ? skb_release_data+0xfc/0x120
> > [ 1129.029587]  ? kfree_skbmem+0x64/0x70
> > [ 1129.036905]  dev_hard_start_xmit+0xa4/0x200
> > [ 1129.045262]  sch_direct_xmit+0x10f/0x280
> > [ 1129.053111]  __qdisc_run+0x223/0x5a0
> > [ 1129.060251]  __dev_queue_xmit+0x245/0x7d0
> > [ 1129.068268]  dev_queue_xmit+0x10/0x20
> > [ 1129.075573]  ? dev_queue_xmit+0x10/0x20
> > [ 1129.083218]  ip6_finish_output2+0x2db/0x490
> > [ 1129.091573]  ip6_finish_output+0x125/0x190
> > [ 1129.099754]  ip6_output+0x5f/0x100
> > [ 1129.106548]  ? ip6_fragment+0x9f0/0x9f0
> > [ 1129.114212]  ip6_local_out+0x35/0x40
> > [ 1129.121356]  ip_vs_tunnel_xmit_v6+0x267/0x290 [ip_vs]
> > [ 1129.131443]  ip_vs_in.part.24+0x302/0x710 [ip_vs]
> > [ 1129.140837]  ? ip_vs_in.part.24+0x302/0x710 [ip_vs]
> > [ 1129.150578]  ? ip_vs_conn_out_get+0x17/0x140 [ip_vs]
> > [ 1129.160493]  ? ip_vs_conn_out_get_proto+0x25/0x30 [ip_vs]
> > [ 1129.171273]  ip_vs_in+0x43/0x130 [ip_vs]
> > [ 1129.179109]  ip_vs_local_request6+0x26/0x30 [ip_vs]
> > [ 1129.188849]  nf_hook_slow+0x3e/0xc0
> > [ 1129.195800]  ip6_xmit+0x30b/0x540
> > [ 1129.202421]  ? ac6_proc_exit+0x20/0x20
> > [ 1129.209909]  inet6_csk_xmit+0x82/0xd0
> > [ 1129.217207]  ? lock_timer_base+0x76/0xa0
> > [ 1129.225043]  tcp_transmit_skb+0x56f/0xa40
> > [ 1129.233051]  tcp_write_xmit+0x2b2/0x11b0
> > [ 1129.240885]  __tcp_push_pending_frames+0x33/0xa0
> > [ 1129.250106]  tcp_push+0xde/0x100
> > [ 1129.256554]  tcp_sendmsg_locked+0x9ca/0xca0
> > [ 1129.264910]  tcp_sendmsg+0x2c/0x50
> > [ 1129.271703]  inet_sendmsg+0x31/0xb0
> > [ 1129.278672]  sock_write_iter+0xf8/0x110
> > [ 1129.286335]  new_sync_write+0xd9/0x120
> > [ 1129.293823]  vfs_write+0x18d/0x1e0
> > [ 1129.300614]  SyS_write+0x48/0xa0
> > [ 1129.307045]  do_syscall_64+0x69/0x1e0
> > [ 1129.314361]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > ...
> > [ 1129.648183] ---[ end trace 635061c9c300799e ]---
> > [ 1129.657407] skb->len:1554 MTU:1522
> > 
> > The tcp flow is connecting from the address ending ':27:0' to the ':85'.
> > 
> > [host-a] > ip -6 r show table local
> > local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 advmss 1440 pref medium
> > 
> > [host-a] > ip -6 a
> > 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
> >     inet6 2401:db00:1011:1f01:face:b00c:0:85/128 scope global
> >        valid_lft forever preferred_lft forever
> > 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
> >     inet6 2401:db00:1011:10af:face:0:27:0/64 scope global
> >        valid_lft forever preferred_lft forever
> > 
> > [host-a] > cat /proc/net/ip_vs
> > TCP  [2401:db00:1011:1f01:face:b00c:0000:0085]:01BB rr
> >   -> [2401:db00:1011:10cc:face:0000:0091:0000]:01BB      Tunnel  6772   9          6
> >   -> [2401:db00:1011:10d8:face:0000:0091:0000]:01BB      Tunnel  6772   8          6
> >   -> [2401:db00:1011:10d2:face:0000:0091:0000]:01BB      Tunnel  6772   19         7
> > 
> > [host-a] > openssl s_client -connect [2401:db00:1011:1f01:face:b00c:0:85]:443
> > send-something-long-here-to-trigger-the-bug
> > 
> > Changing the local route mtu to 1460 to account for the extra ipv6 tunnel header
> > can also side step the issue.  Like this:
> 
> 	Yes, reducing the MTU was a solution recommended from
> long time ago in the IPVS HOWTO.
> 
> > 
> > > ip -6 r show table local
> > local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 mtu 1460 advmss 1440 pref medium
> > 
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  net/netfilter/ipvs/ip_vs_xmit.c | 49 +++++++++++++++++++++++++++--------------
> >  1 file changed, 33 insertions(+), 16 deletions(-)
> > 
> > diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> > index 11c416f3d6e3..88cc0d53ebce 100644
> > --- a/net/netfilter/ipvs/ip_vs_xmit.c
> > +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> > @@ -212,13 +212,15 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
> >  		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
> >  }
> >  
> > -static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> > +static inline bool ensure_mtu_is_adequate(struct ip_vs_conn *cp,
> >  					  int rt_mode,
> >  					  struct ip_vs_iphdr *ipvsh,
> >  					  struct sk_buff *skb, int mtu)
> >  {
> > +	struct netns_ipvs *ipvs = cp->ipvs;
> > +
> >  #ifdef CONFIG_IP_VS_IPV6
> > -	if (skb_af == AF_INET6) {
> > +	if (cp->af == AF_INET6) {
> >  		struct net *net = ipvs->net;
> >  
> >  		if (unlikely(__mtu_check_toobig_v6(skb, mtu))) {
> > @@ -251,6 +253,17 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> >  		}
> >  	}
> >  
> > +	if (skb_shinfo(skb)->gso_size && cp->protocol == IPPROTO_TCP) {
> > +		const struct tcphdr *th = (struct tcphdr *)skb_transport_header(skb);
> > +		unsigned short hdr_len = (th->doff << 2) +
> > +			skb_network_header_len(skb);
> > +
> > +		if (mtu > hdr_len && mtu - hdr_len < skb_shinfo(skb)->gso_size)
> > +			skb_decrease_gso_size(skb_shinfo(skb),
> > +					      skb_shinfo(skb)->gso_size -
> > +					      (mtu - hdr_len));
> 
> 	So, software segmentation happens and we want the
> tunnel header to be accounted immediately and not after PMTU
> probing period?
I think accounting tunnel header correctly early on is the expected
behavior.  Otherwise, if we choose to drop the big skb instead (either
gso_segment error out, driver error out or NIC drops it from tx queue),
the first few packets of a TCP connection will guarantee to
experience timeout.

> Is this a problem only for IPVS tunnels?
> Do we see such delays with other tunnels?
Based on what I understand, a tun dev (setup by iproute2) should
not experience this issue because the MTU of that tun dev has
already accounted for the extra header.

Our current work around using a smaller MTU in the local route has
similar effect (in the sense that introducing a smaller MTU betewen
the phyical eth0 and the TCP stack).

Thanks for looking into it!

Martin

> May be this should
> be solved for all protocols (not just TCP) and for all tunnels?
> Looking at ip6_xmit, on GSO we do not return -EMSGSIZE to
> local sender, so we should really alter the gso_size for proper
> segmentation?
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH] iptables: Per-net ns lock
From: Andrei Vagin @ 2018-04-20 23:06 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: fw, netdev, pablo, rstoyanov1, ptikhomirov
In-Reply-To: <152423174378.4473.8708420767261754117.stgit@localhost.localdomain>

On Fri, Apr 20, 2018 at 04:42:47PM +0300, Kirill Tkhai wrote:
> Containers want to restore their own net ns,
> while they may have no their own mnt ns.
> This case they share host's /run/xtables.lock
> file, but they may not have permission to open
> it.
> 
> Patch makes /run/xtables.lock to be per-namespace,
> i.e., to refer to the caller task's net ns.
>
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> ---
>  iptables/xshared.c |    7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/iptables/xshared.c b/iptables/xshared.c
> index 06db72d4..b6dbe4e7 100644
> --- a/iptables/xshared.c
> +++ b/iptables/xshared.c
> @@ -254,7 +254,12 @@ static int xtables_lock(int wait, struct timeval *wait_interval)
>  	time_left.tv_sec = wait;
>  	time_left.tv_usec = 0;
>  
> -	fd = open(XT_LOCK_NAME, O_CREAT, 0600);
> +	if (symlink("/proc/self/ns/net", XT_LOCK_NAME) != 0 &&

Any user can open this file and take the lock. Before this patch, the
lock file could be opened only by the root user. It means that any user
will be able to block all iptables operations. Do I miss something?

[root@fc24 ~]# ln -s /proc/self/ns/net /run/xtables.lock2
[root@fc24 ~]# ls -l /run/xtables.lock2
lrwxrwxrwx 1 root root 17 Apr 21 01:52 /run/xtables.lock2 ->
/proc/self/ns/net
[root@fc24 ~]# ls -l /proc/self/ns/net 
lrwxrwxrwx 1 root root 0 Apr 21 01:52 /proc/self/ns/net ->
net:[4026531993]

Thanks,
Andrei

> +	    errno != EEXIST) {
> +		fprintf(stderr, "Fatal: can't create lock file\n");

		fprintf(stderr, "Fatal: can't create lock file %s: %s\n",
			XT_LOCK_NAME, strerror(errno));

> +		return XT_LOCK_FAILED;
> +	}
> +	fd = open(XT_LOCK_NAME, O_RDONLY);
>  	if (fd < 0) {
>  		fprintf(stderr, "Fatal: can't open lock file %s: %s\n",
>  			XT_LOCK_NAME, strerror(errno));
> 

^ permalink raw reply

* Re: Representative Needed.
From: PPMC OFFSHORE @ 2018-04-20 22:25 UTC (permalink / raw)
  To: Recipients

Good day,

  I am seeking your concept with great gratitude to present you as a representative to carry out business transactions with a reasonable share upon your interest and cooperation to work with us in trust. If interested please get back.

Regards
Kingsley

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

^ permalink raw reply

* Re: Representative Needed.
From: PPMC OFFSHORE @ 2018-04-20 22:23 UTC (permalink / raw)
  To: Recipients

Good day,

  I am seeking your concept with great gratitude to present you as a representative to carry out business transactions with a reasonable share upon your interest and cooperation to work with us in trust. If interested please get back.

Regards
Kingsley

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

^ permalink raw reply

* [PATCH net-next 7/7] net/ipv6: Remove unncessary check on f6i in fib6_check
From: David Ahern @ 2018-04-20 22:38 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180420223803.15743-1-dsahern@gmail.com>

Dan reported an imbalance in fib6_check on use of f6i and checking
whether it is null. Since fib6_check is only called if f6i is non-null,
remove the unnecessary check.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/route.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 004d00fe2fe5..0407bbc5a028 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2130,8 +2130,7 @@ static bool fib6_check(struct fib6_info *f6i, u32 cookie)
 {
 	u32 rt_cookie = 0;
 
-	if ((f6i && !fib6_get_cookie_safe(f6i, &rt_cookie)) ||
-	     rt_cookie != cookie)
+	if (!fib6_get_cookie_safe(f6i, &rt_cookie) || rt_cookie != cookie)
 		return false;
 
 	if (fib6_check_expired(f6i))
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 6/7] net/ipv6: Make from in rt6_info rcu protected
From: David Ahern @ 2018-04-20 22:38 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180420223803.15743-1-dsahern@gmail.com>

When a dst entry is created from a fib entry, the 'from' in rt6_info
is set to the fib entry. The 'from' reference is used most notably for
cookie checking - making sure stale dst entries are updated if the
fib entry is changed.

When a fib entry is deleted, the pcpu routes on it are walked releasing
the fib6_info reference. This is needed for the fib6_info cleanup to
happen and to make sure all device references are released in a timely
manner.

There is a race window when a FIB entry is deleted and the 'from' on the
pcpu route is dropped and the pcpu route hits a cookie check. Handle
this race using rcu on from.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip6_fib.h | 10 +++---
 net/ipv6/ip6_fib.c    |  8 +++--
 net/ipv6/ip6_output.c |  9 +++--
 net/ipv6/route.c      | 96 ++++++++++++++++++++++++++++++++++++---------------
 4 files changed, 88 insertions(+), 35 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index dc3505fb62b3..1af450d4e923 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -174,7 +174,7 @@ struct fib6_info {
 
 struct rt6_info {
 	struct dst_entry		dst;
-	struct fib6_info		*from;
+	struct fib6_info __rcu		*from;
 
 	struct rt6key			rt6i_dst;
 	struct rt6key			rt6i_src;
@@ -248,13 +248,15 @@ static inline bool fib6_get_cookie_safe(const struct fib6_info *f6i,
 
 static inline u32 rt6_get_cookie(const struct rt6_info *rt)
 {
+	struct fib6_info *from;
 	u32 cookie = 0;
 
 	rcu_read_lock();
 
-	if (rt->rt6i_flags & RTF_PCPU ||
-	    (unlikely(!list_empty(&rt->rt6i_uncached)) && rt->from))
-		fib6_get_cookie_safe(rt->from, &cookie);
+	from = rcu_dereference(rt->from);
+	if (from && (rt->rt6i_flags & RTF_PCPU ||
+	    unlikely(!list_empty(&rt->rt6i_uncached))))
+		fib6_get_cookie_safe(from, &cookie);
 
 	rcu_read_unlock();
 
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 64ca02b7745f..6421c893466e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -875,8 +875,12 @@ static void fib6_drop_pcpu_from(struct fib6_info *f6i,
 		ppcpu_rt = per_cpu_ptr(f6i->rt6i_pcpu, cpu);
 		pcpu_rt = *ppcpu_rt;
 		if (pcpu_rt) {
-			fib6_info_release(pcpu_rt->from);
-			pcpu_rt->from = NULL;
+			struct fib6_info *from;
+
+			from = rcu_dereference_protected(pcpu_rt->from,
+					     lockdep_is_held(&table->tb6_lock));
+			rcu_assign_pointer(pcpu_rt->from, NULL);
+			fib6_info_release(from);
 		}
 	}
 }
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 3db47986ef38..6a477d54f8c7 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -962,16 +962,21 @@ static int ip6_dst_lookup_tail(struct net *net, const struct sock *sk,
 	 * that's why we try it again later.
 	 */
 	if (ipv6_addr_any(&fl6->saddr) && (!*dst || !(*dst)->error)) {
+		struct fib6_info *from;
 		struct rt6_info *rt;
 		bool had_dst = *dst != NULL;
 
 		if (!had_dst)
 			*dst = ip6_route_output(net, sk, fl6);
 		rt = (*dst)->error ? NULL : (struct rt6_info *)*dst;
-		err = ip6_route_get_saddr(net, rt ? rt->from : NULL,
-					  &fl6->daddr,
+
+		rcu_read_lock();
+		from = rt ? rcu_dereference(rt->from) : NULL;
+		err = ip6_route_get_saddr(net, from, &fl6->daddr,
 					  sk ? inet6_sk(sk)->srcprefs : 0,
 					  &fl6->saddr);
+		rcu_read_unlock();
+
 		if (err)
 			goto out_err_release;
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 810a37ac043b..004d00fe2fe5 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -359,7 +359,7 @@ EXPORT_SYMBOL(ip6_dst_alloc);
 static void ip6_dst_destroy(struct dst_entry *dst)
 {
 	struct rt6_info *rt = (struct rt6_info *)dst;
-	struct fib6_info *from = rt->from;
+	struct fib6_info *from;
 	struct inet6_dev *idev;
 
 	dst_destroy_metrics_generic(dst);
@@ -371,8 +371,11 @@ static void ip6_dst_destroy(struct dst_entry *dst)
 		in6_dev_put(idev);
 	}
 
-	rt->from = NULL;
+	rcu_read_lock();
+	from = rcu_dereference(rt->from);
+	rcu_assign_pointer(rt->from, NULL);
 	fib6_info_release(from);
+	rcu_read_unlock();
 }
 
 static void ip6_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
@@ -402,12 +405,16 @@ static bool __rt6_check_expired(const struct rt6_info *rt)
 
 static bool rt6_check_expired(const struct rt6_info *rt)
 {
+	struct fib6_info *from;
+
+	from = rcu_dereference(rt->from);
+
 	if (rt->rt6i_flags & RTF_EXPIRES) {
 		if (time_after(jiffies, rt->dst.expires))
 			return true;
-	} else if (rt->from) {
+	} else if (from) {
 		return rt->dst.obsolete != DST_OBSOLETE_FORCE_CHK ||
-			fib6_check_expired(rt->from);
+			fib6_check_expired(from);
 	}
 	return false;
 }
@@ -963,7 +970,7 @@ static void rt6_set_from(struct rt6_info *rt, struct fib6_info *from)
 {
 	rt->rt6i_flags &= ~RTF_EXPIRES;
 	fib6_info_hold(from);
-	rt->from = from;
+	rcu_assign_pointer(rt->from, from);
 	dst_init_metrics(&rt->dst, from->fib6_metrics->metrics, true);
 	if (from->fib6_metrics != &dst_default_metrics) {
 		rt->dst._metrics |= DST_METRICS_REFCOUNTED;
@@ -2133,11 +2140,13 @@ static bool fib6_check(struct fib6_info *f6i, u32 cookie)
 	return true;
 }
 
-static struct dst_entry *rt6_check(struct rt6_info *rt, u32 cookie)
+static struct dst_entry *rt6_check(struct rt6_info *rt,
+				   struct fib6_info *from,
+				   u32 cookie)
 {
 	u32 rt_cookie = 0;
 
-	if ((rt->from && !fib6_get_cookie_safe(rt->from, &rt_cookie)) ||
+	if ((from && !fib6_get_cookie_safe(from, &rt_cookie)) ||
 	    rt_cookie != cookie)
 		return NULL;
 
@@ -2147,11 +2156,13 @@ static struct dst_entry *rt6_check(struct rt6_info *rt, u32 cookie)
 	return &rt->dst;
 }
 
-static struct dst_entry *rt6_dst_from_check(struct rt6_info *rt, u32 cookie)
+static struct dst_entry *rt6_dst_from_check(struct rt6_info *rt,
+					    struct fib6_info *from,
+					    u32 cookie)
 {
 	if (!__rt6_check_expired(rt) &&
 	    rt->dst.obsolete == DST_OBSOLETE_FORCE_CHK &&
-	    fib6_check(rt->from, cookie))
+	    fib6_check(from, cookie))
 		return &rt->dst;
 	else
 		return NULL;
@@ -2160,6 +2171,7 @@ static struct dst_entry *rt6_dst_from_check(struct rt6_info *rt, u32 cookie)
 static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 {
 	struct dst_entry *dst_ret;
+	struct fib6_info *from;
 	struct rt6_info *rt;
 
 	rt = container_of(dst, struct rt6_info, dst);
@@ -2171,11 +2183,13 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 	 * into this function always.
 	 */
 
-	if (rt->rt6i_flags & RTF_PCPU ||
-	    (unlikely(!list_empty(&rt->rt6i_uncached)) && rt->from))
-		dst_ret = rt6_dst_from_check(rt, cookie);
+	from = rcu_dereference(rt->from);
+
+	if (from && (rt->rt6i_flags & RTF_PCPU ||
+	    unlikely(!list_empty(&rt->rt6i_uncached))))
+		dst_ret = rt6_dst_from_check(rt, from, cookie);
 	else
-		dst_ret = rt6_check(rt, cookie);
+		dst_ret = rt6_check(rt, from, cookie);
 
 	rcu_read_unlock();
 
@@ -2211,13 +2225,17 @@ static void ip6_link_failure(struct sk_buff *skb)
 		if (rt->rt6i_flags & RTF_CACHE) {
 			if (dst_hold_safe(&rt->dst))
 				rt6_remove_exception_rt(rt);
-		} else if (rt->from) {
+		} else {
+			struct fib6_info *from;
 			struct fib6_node *fn;
 
 			rcu_read_lock();
-			fn = rcu_dereference(rt->from->fib6_node);
-			if (fn && (rt->rt6i_flags & RTF_DEFAULT))
-				fn->fn_sernum = -1;
+			from = rcu_dereference(rt->from);
+			if (from) {
+				fn = rcu_dereference(from->fib6_node);
+				if (fn && (rt->rt6i_flags & RTF_DEFAULT))
+					fn->fn_sernum = -1;
+			}
 			rcu_read_unlock();
 		}
 	}
@@ -2225,8 +2243,15 @@ static void ip6_link_failure(struct sk_buff *skb)
 
 static void rt6_update_expires(struct rt6_info *rt0, int timeout)
 {
-	if (!(rt0->rt6i_flags & RTF_EXPIRES) && rt0->from)
-		rt0->dst.expires = rt0->from->expires;
+	if (!(rt0->rt6i_flags & RTF_EXPIRES)) {
+		struct fib6_info *from;
+
+		rcu_read_lock();
+		from = rcu_dereference(rt0->from);
+		if (from)
+			rt0->dst.expires = from->expires;
+		rcu_read_unlock();
+	}
 
 	dst_set_expires(&rt0->dst, timeout);
 	rt0->rt6i_flags |= RTF_EXPIRES;
@@ -2243,8 +2268,14 @@ static void rt6_do_update_pmtu(struct rt6_info *rt, u32 mtu)
 
 static bool rt6_cache_allowed_for_pmtu(const struct rt6_info *rt)
 {
+	bool from_set;
+
+	rcu_read_lock();
+	from_set = !!rcu_dereference(rt->from);
+	rcu_read_unlock();
+
 	return !(rt->rt6i_flags & RTF_CACHE) &&
-		(rt->rt6i_flags & RTF_PCPU || rt->from);
+		(rt->rt6i_flags & RTF_PCPU || from_set);
 }
 
 static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
@@ -2280,16 +2311,18 @@ static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
 		if (rt6->rt6i_flags & RTF_CACHE)
 			rt6_update_exception_stamp_rt(rt6);
 	} else if (daddr) {
+		struct fib6_info *from;
 		struct rt6_info *nrt6;
 
 		rcu_read_lock();
-		nrt6 = ip6_rt_cache_alloc(rt6->from, daddr, saddr);
-		rcu_read_unlock();
+		from = rcu_dereference(rt6->from);
+		nrt6 = ip6_rt_cache_alloc(from, daddr, saddr);
 		if (nrt6) {
 			rt6_do_update_pmtu(nrt6, mtu);
-			if (rt6_insert_exception(nrt6, rt6->from))
+			if (rt6_insert_exception(nrt6, from))
 				dst_release_immediate(&nrt6->dst);
 		}
+		rcu_read_unlock();
 	}
 }
 
@@ -3222,6 +3255,7 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu
 	struct ndisc_options ndopts;
 	struct inet6_dev *in6_dev;
 	struct neighbour *neigh;
+	struct fib6_info *from;
 	struct rd_msg *msg;
 	int optlen, on_link;
 	u8 *lladdr;
@@ -3304,7 +3338,8 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu
 		     NDISC_REDIRECT, &ndopts);
 
 	rcu_read_lock();
-	nrt = ip6_rt_cache_alloc(rt->from, &msg->dest, NULL);
+	from = rcu_dereference(rt->from);
+	nrt = ip6_rt_cache_alloc(from, &msg->dest, NULL);
 	rcu_read_unlock();
 	if (!nrt)
 		goto out;
@@ -4687,6 +4722,7 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 	struct net *net = sock_net(in_skb->sk);
 	struct nlattr *tb[RTA_MAX+1];
 	int err, iif = 0, oif = 0;
+	struct fib6_info *from;
 	struct dst_entry *dst;
 	struct rt6_info *rt;
 	struct sk_buff *skb;
@@ -4783,15 +4819,21 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 	}
 
 	skb_dst_set(skb, &rt->dst);
+
+	rcu_read_lock();
+	from = rcu_dereference(rt->from);
+
 	if (fibmatch)
-		err = rt6_fill_node(net, skb, rt->from, NULL, NULL, NULL, iif,
+		err = rt6_fill_node(net, skb, from, NULL, NULL, NULL, iif,
 				    RTM_NEWROUTE, NETLINK_CB(in_skb).portid,
 				    nlh->nlmsg_seq, 0);
 	else
-		err = rt6_fill_node(net, skb, rt->from, dst,
-				    &fl6.daddr, &fl6.saddr, iif, RTM_NEWROUTE,
+		err = rt6_fill_node(net, skb, from, dst, &fl6.daddr,
+				    &fl6.saddr, iif, RTM_NEWROUTE,
 				    NETLINK_CB(in_skb).portid, nlh->nlmsg_seq,
 				    0);
+	rcu_read_unlock();
+
 	if (err < 0) {
 		kfree_skb(skb);
 		goto errout;
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 5/7] net/ipv6: Move release of fib6_info from pcpu routes to helper
From: David Ahern @ 2018-04-20 22:38 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180420223803.15743-1-dsahern@gmail.com>

Code move only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/ip6_fib.c | 41 +++++++++++++++++++++++------------------
 1 file changed, 23 insertions(+), 18 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index f936e91a8c65..64ca02b7745f 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -860,6 +860,27 @@ static struct fib6_node *fib6_add_1(struct net *net,
 	return ln;
 }
 
+static void fib6_drop_pcpu_from(struct fib6_info *f6i,
+				const struct fib6_table *table)
+{
+	int cpu;
+
+	/* release the reference to this fib entry from
+	 * all of its cached pcpu routes
+	 */
+	for_each_possible_cpu(cpu) {
+		struct rt6_info **ppcpu_rt;
+		struct rt6_info *pcpu_rt;
+
+		ppcpu_rt = per_cpu_ptr(f6i->rt6i_pcpu, cpu);
+		pcpu_rt = *ppcpu_rt;
+		if (pcpu_rt) {
+			fib6_info_release(pcpu_rt->from);
+			pcpu_rt->from = NULL;
+		}
+	}
+}
+
 static void fib6_purge_rt(struct fib6_info *rt, struct fib6_node *fn,
 			  struct net *net)
 {
@@ -887,24 +908,8 @@ static void fib6_purge_rt(struct fib6_info *rt, struct fib6_node *fn,
 				    lockdep_is_held(&table->tb6_lock));
 		}
 
-		if (rt->rt6i_pcpu) {
-			int cpu;
-
-			/* release the reference to this fib entry from
-			 * all of its cached pcpu routes
-			 */
-			for_each_possible_cpu(cpu) {
-				struct rt6_info **ppcpu_rt;
-				struct rt6_info *pcpu_rt;
-
-				ppcpu_rt = per_cpu_ptr(rt->rt6i_pcpu, cpu);
-				pcpu_rt = *ppcpu_rt;
-				if (pcpu_rt) {
-					fib6_info_release(pcpu_rt->from);
-					pcpu_rt->from = NULL;
-				}
-			}
-		}
+		if (rt->rt6i_pcpu)
+			fib6_drop_pcpu_from(rt, table);
 	}
 }
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 4/7] net/ipv6: Move rcu locking to callers of fib6_get_cookie_safe
From: David Ahern @ 2018-04-20 22:38 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180420223803.15743-1-dsahern@gmail.com>

A later patch protects 'from' in rt6_info and this simplifies the
locking needed by it.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip6_fib.h |  6 ++++--
 net/ipv6/route.c      | 13 ++++++++++---
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index dd1481ed8bdb..dc3505fb62b3 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -234,7 +234,6 @@ static inline bool fib6_get_cookie_safe(const struct fib6_info *f6i,
 	struct fib6_node *fn;
 	bool status = false;
 
-	rcu_read_lock();
 	fn = rcu_dereference(f6i->fib6_node);
 
 	if (fn) {
@@ -244,7 +243,6 @@ static inline bool fib6_get_cookie_safe(const struct fib6_info *f6i,
 		status = true;
 	}
 
-	rcu_read_unlock();
 	return status;
 }
 
@@ -252,10 +250,14 @@ static inline u32 rt6_get_cookie(const struct rt6_info *rt)
 {
 	u32 cookie = 0;
 
+	rcu_read_lock();
+
 	if (rt->rt6i_flags & RTF_PCPU ||
 	    (unlikely(!list_empty(&rt->rt6i_uncached)) && rt->from))
 		fib6_get_cookie_safe(rt->from, &cookie);
 
+	rcu_read_unlock();
+
 	return cookie;
 }
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f8b22183b7fd..810a37ac043b 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2159,9 +2159,12 @@ static struct dst_entry *rt6_dst_from_check(struct rt6_info *rt, u32 cookie)
 
 static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 {
+	struct dst_entry *dst_ret;
 	struct rt6_info *rt;
 
-	rt = (struct rt6_info *) dst;
+	rt = container_of(dst, struct rt6_info, dst);
+
+	rcu_read_lock();
 
 	/* All IPV6 dsts are created with ->obsolete set to the value
 	 * DST_OBSOLETE_FORCE_CHK which forces validation calls down
@@ -2170,9 +2173,13 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 
 	if (rt->rt6i_flags & RTF_PCPU ||
 	    (unlikely(!list_empty(&rt->rt6i_uncached)) && rt->from))
-		return rt6_dst_from_check(rt, cookie);
+		dst_ret = rt6_dst_from_check(rt, cookie);
 	else
-		return rt6_check(rt, cookie);
+		dst_ret = rt6_check(rt, cookie);
+
+	rcu_read_unlock();
+
+	return dst_ret;
 }
 
 static struct dst_entry *ip6_negative_advice(struct dst_entry *dst)
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 3/7] net/ipv6: Move rcu_read_lock to callers of ip6_rt_cache_alloc
From: David Ahern @ 2018-04-20 22:37 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180420223803.15743-1-dsahern@gmail.com>

A later patch protects 'from' in rt6_info and this simplifies the
locking needed by it.

With the move, the fib6_info_hold for the uncached_rt is no longer
needed since the rcu_lock is still held.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/route.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 2d6fcfe11c82..f8b22183b7fd 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1164,10 +1164,8 @@ static struct rt6_info *ip6_rt_cache_alloc(struct fib6_info *ort,
 	 *	Clone the route.
 	 */
 
-	rcu_read_lock();
 	dev = ip6_rt_get_dev_rcu(ort);
 	rt = ip6_dst_alloc(dev_net(dev), dev, 0);
-	rcu_read_unlock();
 	if (!rt)
 		return NULL;
 
@@ -1855,14 +1853,11 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
 		 * the daddr in the skb during the neighbor look-up is different
 		 * from the fl6->daddr used to look-up route here.
 		 */
-
 		struct rt6_info *uncached_rt;
 
-		fib6_info_hold(f6i);
-		rcu_read_unlock();
-
 		uncached_rt = ip6_rt_cache_alloc(f6i, &fl6->daddr, NULL);
-		fib6_info_release(f6i);
+
+		rcu_read_unlock();
 
 		if (uncached_rt) {
 			/* Uncached_rt's refcnt is taken during ip6_rt_cache_alloc()
@@ -2280,7 +2275,9 @@ static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
 	} else if (daddr) {
 		struct rt6_info *nrt6;
 
+		rcu_read_lock();
 		nrt6 = ip6_rt_cache_alloc(rt6->from, daddr, saddr);
+		rcu_read_unlock();
 		if (nrt6) {
 			rt6_do_update_pmtu(nrt6, mtu);
 			if (rt6_insert_exception(nrt6, rt6->from))
@@ -3299,7 +3296,9 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu
 				     NEIGH_UPDATE_F_ISROUTER)),
 		     NDISC_REDIRECT, &ndopts);
 
+	rcu_read_lock();
 	nrt = ip6_rt_cache_alloc(rt->from, &msg->dest, NULL);
+	rcu_read_unlock();
 	if (!nrt)
 		goto out;
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 2/7] net/ipv6: Rename rt6_get_cookie_safe
From: David Ahern @ 2018-04-20 22:37 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180420223803.15743-1-dsahern@gmail.com>

rt6_get_cookie_safe takes a fib6_info and checks the sernum of
the node. Update the name to reflect its purpose.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip6_fib.h | 6 +++---
 net/ipv6/route.c      | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 327a74cd6a0d..dd1481ed8bdb 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -228,8 +228,8 @@ static inline bool fib6_check_expired(const struct fib6_info *f6i)
  * Return true if we can get cookie safely
  * Return false if not
  */
-static inline bool rt6_get_cookie_safe(const struct fib6_info *f6i,
-				       u32 *cookie)
+static inline bool fib6_get_cookie_safe(const struct fib6_info *f6i,
+					u32 *cookie)
 {
 	struct fib6_node *fn;
 	bool status = false;
@@ -254,7 +254,7 @@ static inline u32 rt6_get_cookie(const struct rt6_info *rt)
 
 	if (rt->rt6i_flags & RTF_PCPU ||
 	    (unlikely(!list_empty(&rt->rt6i_uncached)) && rt->from))
-		rt6_get_cookie_safe(rt->from, &cookie);
+		fib6_get_cookie_safe(rt->from, &cookie);
 
 	return cookie;
 }
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 062dd4d8232c..2d6fcfe11c82 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2128,7 +2128,7 @@ static bool fib6_check(struct fib6_info *f6i, u32 cookie)
 {
 	u32 rt_cookie = 0;
 
-	if ((f6i && !rt6_get_cookie_safe(f6i, &rt_cookie)) ||
+	if ((f6i && !fib6_get_cookie_safe(f6i, &rt_cookie)) ||
 	     rt_cookie != cookie)
 		return false;
 
@@ -2142,7 +2142,7 @@ static struct dst_entry *rt6_check(struct rt6_info *rt, u32 cookie)
 {
 	u32 rt_cookie = 0;
 
-	if ((rt->from && !rt6_get_cookie_safe(rt->from, &rt_cookie)) ||
+	if ((rt->from && !fib6_get_cookie_safe(rt->from, &rt_cookie)) ||
 	    rt_cookie != cookie)
 		return NULL;
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 1/7] net/ipv6: Clean up rt expires helpers
From: David Ahern @ 2018-04-20 22:37 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180420223803.15743-1-dsahern@gmail.com>

rt6_clean_expires and rt6_set_expires are no longer used. Removed them.
rt6_update_expires has 1 caller in route.c, so move it from the header.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip6_fib.h | 21 ---------------------
 net/ipv6/route.c      |  9 +++++++++
 2 files changed, 9 insertions(+), 21 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 642211174692..327a74cd6a0d 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -223,27 +223,6 @@ static inline bool fib6_check_expired(const struct fib6_info *f6i)
 	return false;
 }
 
-static inline void rt6_clean_expires(struct rt6_info *rt)
-{
-	rt->rt6i_flags &= ~RTF_EXPIRES;
-	rt->dst.expires = 0;
-}
-
-static inline void rt6_set_expires(struct rt6_info *rt, unsigned long expires)
-{
-	rt->dst.expires = expires;
-	rt->rt6i_flags |= RTF_EXPIRES;
-}
-
-static inline void rt6_update_expires(struct rt6_info *rt0, int timeout)
-{
-	if (!(rt0->rt6i_flags & RTF_EXPIRES) && rt0->from)
-		rt0->dst.expires = rt0->from->expires;
-
-	dst_set_expires(&rt0->dst, timeout);
-	rt0->rt6i_flags |= RTF_EXPIRES;
-}
-
 /* Function to safely get fn->sernum for passed in rt
  * and store result in passed in cookie.
  * Return true if we can get cookie safely
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 884d9cee790f..062dd4d8232c 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2221,6 +2221,15 @@ static void ip6_link_failure(struct sk_buff *skb)
 	}
 }
 
+static void rt6_update_expires(struct rt6_info *rt0, int timeout)
+{
+	if (!(rt0->rt6i_flags & RTF_EXPIRES) && rt0->from)
+		rt0->dst.expires = rt0->from->expires;
+
+	dst_set_expires(&rt0->dst, timeout);
+	rt0->rt6i_flags |= RTF_EXPIRES;
+}
+
 static void rt6_do_update_pmtu(struct rt6_info *rt, u32 mtu)
 {
 	struct net *net = dev_net(rt->dst.dev);
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 0/7] net/ipv6: Another followup to the fib6_info change
From: David Ahern @ 2018-04-20 22:37 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern

Last one - for this week.

Patches 1, 2 and 7 are more cleanup patches - removing dead code,
moving code from a header to near its single caller, and updating
function name.

Patches 3-5 do some refactoring leading up to patch 6 which fixes
a NULL dereference. I have only managed to trigger a panic once, so
I can not definitively confirm it addresses the problem but it seems
pretty clear that it is a race on removing a 'from' reference on
an rt6_info and another path using that 'from' value to do
cookie checking.

David Ahern (7):
  net/ipv6: Clean up rt expires helpers
  net/ipv6: Rename rt6_get_cookie_safe
  net/ipv6: Move rcu_read_lock to callers of ip6_rt_cache_alloc
  net/ipv6: Move rcu locking to callers of fib6_get_cookie_safe
  net/ipv6: Move release of fib6_info from pcpu routes to helper
  net/ipv6: Make from in rt6_info rcu protected
  net/ipv6: Remove unncessary check on f6i in fib6_check

 include/net/ip6_fib.h |  41 +++++------------
 net/ipv6/ip6_fib.c    |  45 ++++++++++--------
 net/ipv6/ip6_output.c |   9 +++-
 net/ipv6/route.c      | 124 ++++++++++++++++++++++++++++++++++++--------------
 4 files changed, 136 insertions(+), 83 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH bpf-next v3 4/9] bpf/verifier: improve register value range tracking with ARSH
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-1-yhs@fb.com>

When helpers like bpf_get_stack returns an int value
and later on used for arithmetic computation, the LSH and ARSH
operations are often required to get proper sign extension into
64-bit. For example, without this patch:
    54: R0=inv(id=0,umax_value=800)
    54: (bf) r8 = r0
    55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
    55: (67) r8 <<= 32
    56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
    56: (c7) r8 s>>= 32
    57: R8=inv(id=0)
With this patch:
    54: R0=inv(id=0,umax_value=800)
    54: (bf) r8 = r0
    55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
    55: (67) r8 <<= 32
    56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
    56: (c7) r8 s>>= 32
    57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
With better range of "R8", later on when "R8" is added to other register,
e.g., a map pointer or scalar-value register, the better register
range can be derived and verifier failure may be avoided.

In our later example,
    ......
    usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
    if (usize < 0)
        return 0;
    ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
    ......
Without improving ARSH value range tracking, the register representing
"max_len - usize" will have smin_value equal to S64_MIN and will be
rejected by verifier.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/verifier.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3c8bb92..01c215d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2975,6 +2975,32 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
 		/* We may learn something more from the var_off */
 		__update_reg_bounds(dst_reg);
 		break;
+	case BPF_ARSH:
+		if (umax_val >= insn_bitness) {
+			/* Shifts greater than 31 or 63 are undefined.
+			 * This includes shifts by a negative number.
+			 */
+			mark_reg_unknown(env, regs, insn->dst_reg);
+			break;
+		}
+		if (dst_reg->smin_value < 0)
+			dst_reg->smin_value >>= umin_val;
+		else
+			dst_reg->smin_value >>= umax_val;
+		if (dst_reg->smax_value < 0)
+			dst_reg->smax_value >>= umax_val;
+		else
+			dst_reg->smax_value >>= umin_val;
+		if (src_known)
+			dst_reg->var_off = tnum_rshift(dst_reg->var_off,
+						       umin_val);
+		else
+			dst_reg->var_off = tnum_rshift(tnum_unknown, umin_val);
+		dst_reg->umin_value >>= umax_val;
+		dst_reg->umax_value >>= umin_val;
+		/* We may learn something more from the var_off */
+		__update_reg_bounds(dst_reg);
+		break;
 	default:
 		mark_reg_unknown(env, regs, insn->dst_reg);
 		break;
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 5/9] tools/bpf: add bpf_get_stack helper to tools headers
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-1-yhs@fb.com>

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/include/uapi/linux/bpf.h            | 19 +++++++++++++++++--
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++-
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7f7fbb9..116eb5f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -529,6 +529,17 @@ union bpf_attr {
  *             other bits - reserved
  *     Return: >= 0 stackid on success or negative error
  *
+ * int bpf_get_stack(ctx, buf, size, flags)
+ *     walk user or kernel stack and store the ips in buf
+ *     @ctx: struct pt_regs*
+ *     @buf: user buffer to fill stack
+ *     @size: the buf size
+ *     @flags: bits 0-7 - numer of stack frames to skip
+ *             bit 8 - collect user stack instead of kernel
+ *             bit 11 - get build-id as well if user stack
+ *             other bits - reserved
+ *     Return: >= 0 size copied on success or negative error
+ *
  * s64 bpf_csum_diff(from, from_size, to, to_size, seed)
  *     calculate csum diff
  *     @from: raw from buffer
@@ -841,7 +852,8 @@ union bpf_attr {
 	FN(msg_cork_bytes),		\
 	FN(msg_pull_data),		\
 	FN(bind),			\
-	FN(xdp_adjust_tail),
+	FN(xdp_adjust_tail),		\
+	FN(get_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -875,11 +887,14 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
 #define BPF_F_TUNINFO_IPV6		(1ULL << 0)
 
-/* BPF_FUNC_get_stackid flags. */
+/* flags for both BPF_FUNC_get_stackid and BPF_FUNC_get_stack. */
 #define BPF_F_SKIP_FIELD_MASK		0xffULL
 #define BPF_F_USER_STACK		(1ULL << 8)
+/* flags used by BPF_FUNC_get_stackid only. */
 #define BPF_F_FAST_STACK_CMP		(1ULL << 9)
 #define BPF_F_REUSE_STACKID		(1ULL << 10)
+/* flags used by BPF_FUNC_get_stack only. */
+#define BPF_F_USER_BUILD_ID		(1ULL << 11)
 
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX		(1ULL << 1)
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 9271576..2d9d650 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -98,7 +98,8 @@ static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
 	(void *) BPF_FUNC_bind;
 static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =
 	(void *) BPF_FUNC_xdp_adjust_tail;
-
+static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
+	(void *) BPF_FUNC_get_stack;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 8/9] tools/bpf: add a test for bpf_get_stack with raw tracepoint prog
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-1-yhs@fb.com>

The test attached a raw_tracepoint program to sched/sched_switch.
It tested to get stack for user space, kernel space and user
space with build_id request. It also tested to get user
and kernel stack into the same buffer with back-to-back
bpf_get_stack helper calls.

Whenever the kernel stack is available, the user space
application will check to ensure that the kernel function
for raw_tracepoint ___bpf_prog_run is part of the stack.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/testing/selftests/bpf/Makefile               |   3 +-
 tools/testing/selftests/bpf/test_get_stack_rawtp.c | 102 ++++++++++++++++++
 tools/testing/selftests/bpf/test_progs.c           | 115 +++++++++++++++++++++
 3 files changed, 219 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_get_stack_rawtp.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 0b72cc7..54e9e74 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -32,7 +32,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
 	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
 	sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
-	test_btf_haskv.o test_btf_nokv.o
+	test_btf_haskv.o test_btf_nokv.o test_get_stack_rawtp.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -56,6 +56,7 @@ $(TEST_GEN_PROGS_EXTENDED): $(OUTPUT)/libbpf.a
 $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c
 $(OUTPUT)/test_sock: cgroup_helpers.c
 $(OUTPUT)/test_sock_addr: cgroup_helpers.c
+$(OUTPUT)/test_progs: trace_helpers.c
 
 .PHONY: force
 
diff --git a/tools/testing/selftests/bpf/test_get_stack_rawtp.c b/tools/testing/selftests/bpf/test_get_stack_rawtp.c
new file mode 100644
index 0000000..ba1dcf9
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_get_stack_rawtp.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+
+/* Permit pretty deep stack traces */
+#define MAX_STACK_RAWTP 100
+struct stack_trace_t {
+	int pid;
+	int kern_stack_size;
+	int user_stack_size;
+	int user_stack_buildid_size;
+	__u64 kern_stack[MAX_STACK_RAWTP];
+	__u64 user_stack[MAX_STACK_RAWTP];
+	struct bpf_stack_build_id user_stack_buildid[MAX_STACK_RAWTP];
+};
+
+struct bpf_map_def SEC("maps") perfmap = {
+	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(__u32),
+	.max_entries = 2,
+};
+
+struct bpf_map_def SEC("maps") stackdata_map = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct stack_trace_t),
+	.max_entries = 1,
+};
+
+/* Allocate per-cpu space twice the needed. For the code below
+ *   usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
+ *   if (usize < 0)
+ *     return 0;
+ *   ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
+ *
+ * If we have value_size = MAX_STACK_RAWTP * sizeof(__u64),
+ * verifier will complain that access "raw_data + usize"
+ * with size "max_len - usize" may be out of bound.
+ * The maximum "raw_data + usize" is "raw_data + max_len"
+ * and the maximum "max_len - usize" is "max_len", verifier
+ * concludes that the maximum buffer access range is
+ * "raw_data[0...max_len * 2 - 1]" and hence reject the program.
+ *
+ * Doubling the to-be-used max buffer size can fix this verifier
+ * issue and avoid complicated C programming massaging.
+ * This is an acceptable workaround since there is one entry here.
+ */
+struct bpf_map_def SEC("maps") rawdata_map = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = MAX_STACK_RAWTP * sizeof(__u64) * 2,
+	.max_entries = 1,
+};
+
+SEC("tracepoint/sched/sched_switch")
+int bpf_prog1(void *ctx)
+{
+	int max_len, max_buildid_len, usize, ksize, total_size;
+	struct stack_trace_t *data;
+	void *raw_data;
+	__u32 key = 0;
+
+	data = bpf_map_lookup_elem(&stackdata_map, &key);
+	if (!data)
+		return 0;
+
+	max_len = MAX_STACK_RAWTP * sizeof(__u64);
+	max_buildid_len = MAX_STACK_RAWTP * sizeof(struct bpf_stack_build_id);
+	data->pid = bpf_get_current_pid_tgid();
+	data->kern_stack_size = bpf_get_stack(ctx, data->kern_stack,
+					      max_len, 0);
+	data->user_stack_size = bpf_get_stack(ctx, data->user_stack, max_len,
+					    BPF_F_USER_STACK);
+	data->user_stack_buildid_size = bpf_get_stack(
+		ctx, data->user_stack_buildid, max_buildid_len,
+		BPF_F_USER_STACK | BPF_F_USER_BUILD_ID);
+	bpf_perf_event_output(ctx, &perfmap, 0, data, sizeof(*data));
+
+	/* write both kernel and user stacks to the same buffer */
+	raw_data = bpf_map_lookup_elem(&rawdata_map, &key);
+	if (!raw_data)
+		return 0;
+
+	usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
+	if (usize < 0)
+		return 0;
+
+	ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
+	if (ksize < 0)
+		return 0;
+
+	total_size = usize + ksize;
+	if (total_size > 0 && total_size <= max_len)
+		bpf_perf_event_output(ctx, &perfmap, 0, raw_data, total_size);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+__u32 _version SEC("version") = 1; /* ignored by tracepoints, required by libbpf.a */
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index eedda98..dad4c3f 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -38,6 +38,7 @@ typedef __u16 __sum16;
 #include "bpf_util.h"
 #include "bpf_endian.h"
 #include "bpf_rlimit.h"
+#include "trace_helpers.h"
 
 static int error_cnt, pass_cnt;
 
@@ -1204,6 +1205,119 @@ static void test_stacktrace_build_id(void)
 	return;
 }
 
+#define MAX_CNT_RAWTP	10ull
+#define MAX_STACK_RAWTP	100
+struct get_stack_trace_t {
+	int pid;
+	int kern_stack_size;
+	int user_stack_size;
+	int user_stack_buildid_size;
+	__u64 kern_stack[MAX_STACK_RAWTP];
+	__u64 user_stack[MAX_STACK_RAWTP];
+	struct bpf_stack_build_id user_stack_buildid[MAX_STACK_RAWTP];
+};
+
+static void get_stack_raw_tp_action(void)
+{
+	FILE *f;
+
+	f = popen("taskset 1 dd if=/dev/zero of=/dev/null", "r");
+	(void) f;
+}
+
+static int get_stack_print_output(void *data, int size)
+{
+	bool good_kern_stack = false, good_user_stack = false;
+	const char *expected_func = "___bpf_prog_run";
+	struct get_stack_trace_t *e = data;
+	int i, num_stack;
+	static __u64 cnt;
+	struct ksym *ks;
+
+	cnt++;
+
+	if (size < sizeof(struct get_stack_trace_t)) {
+		__u64 *raw_data = data;
+
+		num_stack = size / sizeof(__u64);
+		for (i = 0; i < num_stack; i++) {
+			ks = ksym_search(raw_data[i]);
+			if (ks && (strcmp(ks->name, expected_func) == 0)) {
+				good_kern_stack = true;
+				good_user_stack = (i > 0);
+			}
+		}
+	} else {
+		if (e->kern_stack_size > 0) {
+			num_stack = e->kern_stack_size / sizeof(__u64);
+			for (i = 0; i < num_stack; i++) {
+				ks = ksym_search(e->kern_stack[i]);
+				if (ks && (strcmp(ks->name, expected_func) == 0))
+					good_kern_stack = true;
+			}
+		}
+		if (e->user_stack_size > 0 && e->user_stack_buildid_size > 0)
+			good_user_stack = true;
+	}
+	if (!good_kern_stack || !good_user_stack)
+		return PERF_EVENT_ERROR;
+
+	if (cnt == MAX_CNT_RAWTP)
+		return PERF_EVENT_DONE;
+
+	return PERF_EVENT_CONT;
+}
+
+static void test_get_stack_raw_tp(void)
+{
+	const char *file = "./test_get_stack_rawtp.o";
+	int efd, err, prog_fd, pmu_fd, perfmap_fd;
+	struct perf_event_attr attr = {};
+	__u32 key = 0, duration = 0;
+	struct bpf_object *obj;
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_RAW_TRACEPOINT, &obj, &prog_fd);
+	if (CHECK(err, "prog_load raw tp", "err %d errno %d\n", err, errno))
+		return;
+
+	efd = bpf_raw_tracepoint_open("sched_switch", prog_fd);
+	if (CHECK(efd < 0, "raw_tp_open", "err %d errno %d\n", efd, errno))
+		goto close_prog;
+
+	perfmap_fd = bpf_find_map(__func__, obj, "perfmap");
+	if (CHECK(perfmap_fd < 0, "bpf_find_map", "err %d errno %d\n", perfmap_fd, errno))
+		goto close_prog;
+
+	err = load_kallsyms();
+	if (CHECK(err < 0, "load_kallsyms", "err %d errno %d\n", err, errno))
+		goto close_prog;
+
+	attr.sample_type = PERF_SAMPLE_RAW;
+	attr.type = PERF_TYPE_SOFTWARE;
+	attr.config = PERF_COUNT_SW_BPF_OUTPUT;
+	pmu_fd = syscall(__NR_perf_event_open, &attr, -1/*pid*/, 0/*cpu*/,
+			 -1/*group_fd*/, 0);
+	if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd, errno))
+		goto close_prog;
+
+	err = bpf_map_update_elem(perfmap_fd, &key, &pmu_fd, BPF_ANY);
+	if (CHECK(err < 0, "bpf_map_update_elem", "err %d errno %d\n", err, errno))
+		goto close_prog;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+	if (CHECK(err < 0, "ioctl PERF_EVENT_IOC_ENABLE", "err %d errno %d\n", err, errno))
+		goto close_prog;
+
+	if (perf_event_poller(pmu_fd, get_stack_raw_tp_action, get_stack_print_output))
+		goto close_prog;
+
+	goto close_prog_noerr;
+close_prog:
+	error_cnt++;
+close_prog_noerr:
+	bpf_object__close(obj);
+}
+
 int main(void)
 {
 	test_pkt_access();
@@ -1219,6 +1333,7 @@ int main(void)
 	test_stacktrace_map();
 	test_stacktrace_build_id();
 	test_stacktrace_map_raw_tp();
+	test_get_stack_raw_tp();
 
 	printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
 	return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 6/9] samples/bpf: move common-purpose trace functions to selftests
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-1-yhs@fb.com>

There is no functionality change in this patch. The common-purpose
trace functions, including perf_event polling and ksym lookup,
are moved from trace_output_user.c and bpf_load.c to
selftests/bpf/trace_helpers.c so that these function can
be reused later in selftests.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 samples/bpf/Makefile                        |  11 +-
 samples/bpf/bpf_load.c                      |  63 ----------
 samples/bpf/bpf_load.h                      |   7 --
 samples/bpf/offwaketime_user.c              |   1 +
 samples/bpf/sampleip_user.c                 |   1 +
 samples/bpf/spintest_user.c                 |   1 +
 samples/bpf/trace_event_user.c              |   1 +
 samples/bpf/trace_output_user.c             | 125 +++----------------
 tools/testing/selftests/bpf/trace_helpers.c | 186 ++++++++++++++++++++++++++++
 tools/testing/selftests/bpf/trace_helpers.h |  24 ++++
 10 files changed, 238 insertions(+), 182 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/trace_helpers.c
 create mode 100644 tools/testing/selftests/bpf/trace_helpers.h

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index aa8c392..d36444c 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -49,6 +49,7 @@ hostprogs-y += xdp_adjust_tail
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
 CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o
+TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o
 
 test_lru_dist-objs := test_lru_dist.o $(LIBBPF)
 sock_example-objs := sock_example.o $(LIBBPF)
@@ -65,10 +66,10 @@ tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
 tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
 load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
 test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
-trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
+trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o $(TRACE_HELPERS)
 lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o
-offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o
-spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o
+offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o $(TRACE_HELPERS)
+spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o $(TRACE_HELPERS)
 map_perf_test-objs := bpf_load.o $(LIBBPF) map_perf_test_user.o
 test_overhead-objs := bpf_load.o $(LIBBPF) test_overhead_user.o
 test_cgrp2_array_pin-objs := $(LIBBPF) test_cgrp2_array_pin.o
@@ -82,8 +83,8 @@ xdp2-objs := bpf_load.o $(LIBBPF) xdp1_user.o
 xdp_router_ipv4-objs := bpf_load.o $(LIBBPF) xdp_router_ipv4_user.o
 test_current_task_under_cgroup-objs := bpf_load.o $(LIBBPF) $(CGROUP_HELPERS) \
 				       test_current_task_under_cgroup_user.o
-trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o
-sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o
+trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o $(TRACE_HELPERS)
+sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o $(TRACE_HELPERS)
 tc_l2_redirect-objs := bpf_load.o $(LIBBPF) tc_l2_redirect_user.o
 lwt_len_hist-objs := bpf_load.o $(LIBBPF) lwt_len_hist_user.o
 xdp_tx_iptunnel-objs := bpf_load.o $(LIBBPF) xdp_tx_iptunnel_user.o
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index bebe418..529972e 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -650,66 +650,3 @@ void read_trace_pipe(void)
 		}
 	}
 }
-
-#define MAX_SYMS 300000
-static struct ksym syms[MAX_SYMS];
-static int sym_cnt;
-
-static int ksym_cmp(const void *p1, const void *p2)
-{
-	return ((struct ksym *)p1)->addr - ((struct ksym *)p2)->addr;
-}
-
-int load_kallsyms(void)
-{
-	FILE *f = fopen("/proc/kallsyms", "r");
-	char func[256], buf[256];
-	char symbol;
-	void *addr;
-	int i = 0;
-
-	if (!f)
-		return -ENOENT;
-
-	while (!feof(f)) {
-		if (!fgets(buf, sizeof(buf), f))
-			break;
-		if (sscanf(buf, "%p %c %s", &addr, &symbol, func) != 3)
-			break;
-		if (!addr)
-			continue;
-		syms[i].addr = (long) addr;
-		syms[i].name = strdup(func);
-		i++;
-	}
-	sym_cnt = i;
-	qsort(syms, sym_cnt, sizeof(struct ksym), ksym_cmp);
-	return 0;
-}
-
-struct ksym *ksym_search(long key)
-{
-	int start = 0, end = sym_cnt;
-	int result;
-
-	while (start < end) {
-		size_t mid = start + (end - start) / 2;
-
-		result = key - syms[mid].addr;
-		if (result < 0)
-			end = mid;
-		else if (result > 0)
-			start = mid + 1;
-		else
-			return &syms[mid];
-	}
-
-	if (start >= 1 && syms[start - 1].addr < key &&
-	    key < syms[start].addr)
-		/* valid ksym */
-		return &syms[start - 1];
-
-	/* out of range. return _stext */
-	return &syms[0];
-}
-
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 453c200..2c3d0b4 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -54,12 +54,5 @@ int load_bpf_file(char *path);
 int load_bpf_file_fixup_map(const char *path, fixup_map_cb fixup_map);
 
 void read_trace_pipe(void);
-struct ksym {
-	long addr;
-	char *name;
-};
-
-int load_kallsyms(void);
-struct ksym *ksym_search(long key);
 int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags);
 #endif
diff --git a/samples/bpf/offwaketime_user.c b/samples/bpf/offwaketime_user.c
index 512f87a..f06063a 100644
--- a/samples/bpf/offwaketime_user.c
+++ b/samples/bpf/offwaketime_user.c
@@ -17,6 +17,7 @@
 #include <sys/resource.h>
 #include "libbpf.h"
 #include "bpf_load.h"
+#include "trace_helpers.h"
 
 #define PRINT_RAW_ADDR 0
 
diff --git a/samples/bpf/sampleip_user.c b/samples/bpf/sampleip_user.c
index 4ed690b..60c2b73 100644
--- a/samples/bpf/sampleip_user.c
+++ b/samples/bpf/sampleip_user.c
@@ -22,6 +22,7 @@
 #include "libbpf.h"
 #include "bpf_load.h"
 #include "perf-sys.h"
+#include "trace_helpers.h"
 
 #define DEFAULT_FREQ	99
 #define DEFAULT_SECS	5
diff --git a/samples/bpf/spintest_user.c b/samples/bpf/spintest_user.c
index 3d73621..8d3e9cf 100644
--- a/samples/bpf/spintest_user.c
+++ b/samples/bpf/spintest_user.c
@@ -7,6 +7,7 @@
 #include <sys/resource.h>
 #include "libbpf.h"
 #include "bpf_load.h"
+#include "trace_helpers.h"
 
 int main(int ac, char **argv)
 {
diff --git a/samples/bpf/trace_event_user.c b/samples/bpf/trace_event_user.c
index 56f7a25..1fa1bec 100644
--- a/samples/bpf/trace_event_user.c
+++ b/samples/bpf/trace_event_user.c
@@ -21,6 +21,7 @@
 #include "libbpf.h"
 #include "bpf_load.h"
 #include "perf-sys.h"
+#include "trace_helpers.h"
 
 #define SAMPLE_FREQ 50
 
diff --git a/samples/bpf/trace_output_user.c b/samples/bpf/trace_output_user.c
index ccca1e3..cc4b383 100644
--- a/samples/bpf/trace_output_user.c
+++ b/samples/bpf/trace_output_user.c
@@ -21,100 +21,10 @@
 #include "libbpf.h"
 #include "bpf_load.h"
 #include "perf-sys.h"
+#include "trace_helpers.h"
 
 static int pmu_fd;
 
-int page_size;
-int page_cnt = 8;
-volatile struct perf_event_mmap_page *header;
-
-typedef void (*print_fn)(void *data, int size);
-
-static int perf_event_mmap(int fd)
-{
-	void *base;
-	int mmap_size;
-
-	page_size = getpagesize();
-	mmap_size = page_size * (page_cnt + 1);
-
-	base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
-	if (base == MAP_FAILED) {
-		printf("mmap err\n");
-		return -1;
-	}
-
-	header = base;
-	return 0;
-}
-
-static int perf_event_poll(int fd)
-{
-	struct pollfd pfd = { .fd = fd, .events = POLLIN };
-
-	return poll(&pfd, 1, 1000);
-}
-
-struct perf_event_sample {
-	struct perf_event_header header;
-	__u32 size;
-	char data[];
-};
-
-static void perf_event_read(print_fn fn)
-{
-	__u64 data_tail = header->data_tail;
-	__u64 data_head = header->data_head;
-	__u64 buffer_size = page_cnt * page_size;
-	void *base, *begin, *end;
-	char buf[256];
-
-	asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
-	if (data_head == data_tail)
-		return;
-
-	base = ((char *)header) + page_size;
-
-	begin = base + data_tail % buffer_size;
-	end = base + data_head % buffer_size;
-
-	while (begin != end) {
-		struct perf_event_sample *e;
-
-		e = begin;
-		if (begin + e->header.size > base + buffer_size) {
-			long len = base + buffer_size - begin;
-
-			assert(len < e->header.size);
-			memcpy(buf, begin, len);
-			memcpy(buf + len, base, e->header.size - len);
-			e = (void *) buf;
-			begin = base + e->header.size - len;
-		} else if (begin + e->header.size == base + buffer_size) {
-			begin = base;
-		} else {
-			begin += e->header.size;
-		}
-
-		if (e->header.type == PERF_RECORD_SAMPLE) {
-			fn(e->data, e->size);
-		} else if (e->header.type == PERF_RECORD_LOST) {
-			struct {
-				struct perf_event_header header;
-				__u64 id;
-				__u64 lost;
-			} *lost = (void *) e;
-			printf("lost %lld events\n", lost->lost);
-		} else {
-			printf("unknown event type=%d size=%d\n",
-			       e->header.type, e->header.size);
-		}
-	}
-
-	__sync_synchronize(); /* smp_mb() */
-	header->data_tail = data_head;
-}
-
 static __u64 time_get_ns(void)
 {
 	struct timespec ts;
@@ -127,7 +37,7 @@ static __u64 start_time;
 
 #define MAX_CNT 100000ll
 
-static void print_bpf_output(void *data, int size)
+static int print_bpf_output(void *data, int size)
 {
 	static __u64 cnt;
 	struct {
@@ -138,7 +48,7 @@ static void print_bpf_output(void *data, int size)
 	if (e->cookie != 0x12345678) {
 		printf("BUG pid %llx cookie %llx sized %d\n",
 		       e->pid, e->cookie, size);
-		kill(0, SIGINT);
+		return PERF_EVENT_ERROR;
 	}
 
 	cnt++;
@@ -146,8 +56,10 @@ static void print_bpf_output(void *data, int size)
 	if (cnt == MAX_CNT) {
 		printf("recv %lld events per sec\n",
 		       MAX_CNT * 1000000000ll / (time_get_ns() - start_time));
-		kill(0, SIGINT);
+		return PERF_EVENT_DONE;
 	}
+
+	return PERF_EVENT_CONT;
 }
 
 static void test_bpf_perf_event(void)
@@ -166,10 +78,18 @@ static void test_bpf_perf_event(void)
 	ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
 }
 
+static void exec_action(void)
+{
+	FILE *f;
+
+	f = popen("taskset 1 dd if=/dev/zero of=/dev/null", "r");
+	(void) f;
+}
+
 int main(int argc, char **argv)
 {
 	char filename[256];
-	FILE *f;
+	int ret;
 
 	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
 
@@ -180,17 +100,8 @@ int main(int argc, char **argv)
 
 	test_bpf_perf_event();
 
-	if (perf_event_mmap(pmu_fd) < 0)
-		return 1;
-
-	f = popen("taskset 1 dd if=/dev/zero of=/dev/null", "r");
-	(void) f;
-
 	start_time = time_get_ns();
-	for (;;) {
-		perf_event_poll(pmu_fd);
-		perf_event_read(print_bpf_output);
-	}
-
-	return 0;
+	ret = perf_event_poller(pmu_fd, exec_action, print_bpf_output);
+	kill(0, SIGINT);
+	return ret;
 }
diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
new file mode 100644
index 0000000..00954e3
--- /dev/null
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -0,0 +1,186 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <assert.h>
+#include <errno.h>
+#include <poll.h>
+#include <unistd.h>
+#include <linux/perf_event.h>
+#include <sys/mman.h>
+#include "trace_helpers.h"
+
+#define MAX_SYMS 300000
+static struct ksym syms[MAX_SYMS];
+static int sym_cnt;
+
+static int ksym_cmp(const void *p1, const void *p2)
+{
+	return ((struct ksym *)p1)->addr - ((struct ksym *)p2)->addr;
+}
+
+int load_kallsyms(void)
+{
+	FILE *f = fopen("/proc/kallsyms", "r");
+	char func[256], buf[256];
+	char symbol;
+	void *addr;
+	int i = 0;
+
+	if (!f)
+		return -ENOENT;
+
+	while (!feof(f)) {
+		if (!fgets(buf, sizeof(buf), f))
+			break;
+		if (sscanf(buf, "%p %c %s", &addr, &symbol, func) != 3)
+			break;
+		if (!addr)
+			continue;
+		syms[i].addr = (long) addr;
+		syms[i].name = strdup(func);
+		i++;
+	}
+	sym_cnt = i;
+	qsort(syms, sym_cnt, sizeof(struct ksym), ksym_cmp);
+	return 0;
+}
+
+struct ksym *ksym_search(long key)
+{
+	int start = 0, end = sym_cnt;
+	int result;
+
+	while (start < end) {
+		size_t mid = start + (end - start) / 2;
+
+		result = key - syms[mid].addr;
+		if (result < 0)
+			end = mid;
+		else if (result > 0)
+			start = mid + 1;
+		else
+			return &syms[mid];
+	}
+
+	if (start >= 1 && syms[start - 1].addr < key &&
+	    key < syms[start].addr)
+		/* valid ksym */
+		return &syms[start - 1];
+
+	/* out of range. return _stext */
+	return &syms[0];
+}
+
+static int page_size;
+static int page_cnt = 8;
+static volatile struct perf_event_mmap_page *header;
+
+static int perf_event_mmap(int fd)
+{
+	void *base;
+	int mmap_size;
+
+	page_size = getpagesize();
+	mmap_size = page_size * (page_cnt + 1);
+
+	base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (base == MAP_FAILED) {
+		printf("mmap err\n");
+		return -1;
+	}
+
+	header = base;
+	return 0;
+}
+
+static int perf_event_poll(int fd)
+{
+	struct pollfd pfd = { .fd = fd, .events = POLLIN };
+
+	return poll(&pfd, 1, 1000);
+}
+
+struct perf_event_sample {
+	struct perf_event_header header;
+	__u32 size;
+	char data[];
+};
+
+static int perf_event_read(perf_event_print_fn fn)
+{
+	__u64 data_tail = header->data_tail;
+	__u64 data_head = header->data_head;
+	__u64 buffer_size = page_cnt * page_size;
+	void *base, *begin, *end;
+	char buf[256];
+	int ret;
+
+	asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
+	if (data_head == data_tail)
+		return PERF_EVENT_CONT;
+
+	base = ((char *)header) + page_size;
+
+	begin = base + data_tail % buffer_size;
+	end = base + data_head % buffer_size;
+
+	while (begin != end) {
+		struct perf_event_sample *e;
+
+		e = begin;
+		if (begin + e->header.size > base + buffer_size) {
+			long len = base + buffer_size - begin;
+
+			assert(len < e->header.size);
+			memcpy(buf, begin, len);
+			memcpy(buf + len, base, e->header.size - len);
+			e = (void *) buf;
+			begin = base + e->header.size - len;
+		} else if (begin + e->header.size == base + buffer_size) {
+			begin = base;
+		} else {
+			begin += e->header.size;
+		}
+
+		if (e->header.type == PERF_RECORD_SAMPLE) {
+			ret = fn(e->data, e->size);
+			if (ret != PERF_EVENT_CONT)
+				return ret;
+		} else if (e->header.type == PERF_RECORD_LOST) {
+			struct {
+				struct perf_event_header header;
+				__u64 id;
+				__u64 lost;
+			} *lost = (void *) e;
+			printf("lost %lld events\n", lost->lost);
+		} else {
+			printf("unknown event type=%d size=%d\n",
+			       e->header.type, e->header.size);
+		}
+	}
+
+	__sync_synchronize(); /* smp_mb() */
+	header->data_tail = data_head;
+	return PERF_EVENT_CONT;
+}
+
+int perf_event_poller(int fd, perf_event_exec_fn exec_fn,
+		      perf_event_print_fn output_fn)
+{
+	int ret;
+
+	if (perf_event_mmap(fd) < 0)
+		return PERF_EVENT_ERROR;
+
+	exec_fn();
+
+	for (;;) {
+		perf_event_poll(fd);
+		ret = perf_event_read(output_fn);
+		if (ret != PERF_EVENT_CONT)
+			return ret;
+	}
+
+	return PERF_EVENT_DONE;
+}
diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
new file mode 100644
index 0000000..8750778
--- /dev/null
+++ b/tools/testing/selftests/bpf/trace_helpers.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __TRACE_HELPER_H
+#define __TRACE_HELPER_H
+
+struct ksym {
+	long addr;
+	char *name;
+};
+
+int load_kallsyms(void);
+struct ksym *ksym_search(long key);
+
+typedef void (*perf_event_exec_fn)(void);
+typedef int (*perf_event_print_fn)(void *data, int size);
+
+/* return code for perf_event_print_fn */
+#define PERF_EVENT_DONE		0
+#define PERF_EVENT_ERROR	1
+#define PERF_EVENT_CONT		2
+
+/* return PERF_EVENT_DONE or PERF_EVENT_ERROR */
+int perf_event_poller(int fd, perf_event_exec_fn exec_fn,
+		      perf_event_print_fn output_fn);
+#endif
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 2/9] bpf: add bpf_get_stack helper
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-1-yhs@fb.com>

Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table,
so some stack traces are missing from user perspective.

This patch implements a new helper, bpf_get_stack, will
send stack traces directly to bpf program. The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through
shared map or bpf_perf_event_output.

Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h      |  1 +
 include/linux/filter.h   |  3 ++-
 include/uapi/linux/bpf.h | 19 ++++++++++++--
 kernel/bpf/core.c        |  5 ++++
 kernel/bpf/stackmap.c    | 67 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c     | 10 ++++++++
 kernel/bpf/verifier.c    |  3 +++
 kernel/trace/bpf_trace.c | 50 +++++++++++++++++++++++++++++++++++-
 8 files changed, 154 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ee5275e..2c520b4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -690,6 +690,7 @@ extern const struct bpf_func_proto bpf_get_current_comm_proto;
 extern const struct bpf_func_proto bpf_skb_vlan_push_proto;
 extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
 extern const struct bpf_func_proto bpf_get_stackid_proto;
+extern const struct bpf_func_proto bpf_get_stack_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
 
 /* Shared helpers among cBPF and eBPF. */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4da8b23..044d30e 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -468,7 +468,8 @@ struct bpf_prog {
 				dst_needed:1,	/* Do we need dst entry? */
 				blinded:1,	/* Was blinded */
 				is_func:1,	/* program is a bpf function */
-				kprobe_override:1; /* Do we override a kprobe? */
+				kprobe_override:1, /* Do we override a kprobe? */
+				need_callchain_buf:1; /* Needs callchain buffer? */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c8383a2..470f3a2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -529,6 +529,17 @@ union bpf_attr {
  *             other bits - reserved
  *     Return: >= 0 stackid on success or negative error
  *
+ * int bpf_get_stack(ctx, buf, size, flags)
+ *     walk user or kernel stack and store the ips in buf
+ *     @ctx: struct pt_regs*
+ *     @buf: user buffer to fill stack
+ *     @size: the buf size
+ *     @flags: bits 0-7 - numer of stack frames to skip
+ *             bit 8 - collect user stack instead of kernel
+ *             bit 11 - get build-id as well if user stack
+ *             other bits - reserved
+ *     Return: >= 0 size copied on success or negative error
+ *
  * s64 bpf_csum_diff(from, from_size, to, to_size, seed)
  *     calculate csum diff
  *     @from: raw from buffer
@@ -841,7 +852,8 @@ union bpf_attr {
 	FN(msg_cork_bytes),		\
 	FN(msg_pull_data),		\
 	FN(bind),			\
-	FN(xdp_adjust_tail),
+	FN(xdp_adjust_tail),		\
+	FN(get_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -875,11 +887,14 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
 #define BPF_F_TUNINFO_IPV6		(1ULL << 0)
 
-/* BPF_FUNC_get_stackid flags. */
+/* flags for both BPF_FUNC_get_stackid and BPF_FUNC_get_stack. */
 #define BPF_F_SKIP_FIELD_MASK		0xffULL
 #define BPF_F_USER_STACK		(1ULL << 8)
+/* flags used by BPF_FUNC_get_stackid only. */
 #define BPF_F_FAST_STACK_CMP		(1ULL << 9)
 #define BPF_F_REUSE_STACKID		(1ULL << 10)
+/* flags used by BPF_FUNC_get_stack only. */
+#define BPF_F_USER_BUILD_ID		(1ULL << 11)
 
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX		(1ULL << 1)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index d315b39..bf22eca 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -31,6 +31,7 @@
 #include <linux/rbtree_latch.h>
 #include <linux/kallsyms.h>
 #include <linux/rcupdate.h>
+#include <linux/perf_event.h>
 
 #include <asm/unaligned.h>
 
@@ -1709,6 +1710,10 @@ static void bpf_prog_free_deferred(struct work_struct *work)
 	aux = container_of(work, struct bpf_prog_aux, work);
 	if (bpf_prog_is_dev_bound(aux))
 		bpf_prog_offload_destroy(aux->prog);
+#ifdef CONFIG_PERF_EVENTS
+	if (aux->prog->need_callchain_buf)
+		put_callchain_buffers();
+#endif
 	for (i = 0; i < aux->func_cnt; i++)
 		bpf_jit_free(aux->func[i]);
 	if (aux->func_cnt) {
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 04f6ec1..4477cf6 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -402,6 +402,73 @@ const struct bpf_func_proto bpf_get_stackid_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size,
+	   u64, flags)
+{
+	u32 init_nr, trace_nr, copy_len, elem_size, num_elem;
+	bool user_build_id = flags & BPF_F_USER_BUILD_ID;
+	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
+	bool user = flags & BPF_F_USER_STACK;
+	struct perf_callchain_entry *trace;
+	bool kernel = !user;
+	int err = -EINVAL;
+	u64 *ips;
+
+	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
+			       BPF_F_USER_BUILD_ID)))
+		goto clear;
+	if (kernel && user_build_id)
+		goto clear;
+
+	elem_size = (user && user_build_id) ? sizeof(struct bpf_stack_build_id)
+					    : sizeof(u64);
+	if (unlikely(size % elem_size))
+		goto clear;
+
+	num_elem = size / elem_size;
+	if (sysctl_perf_event_max_stack < num_elem)
+		init_nr = 0;
+	else
+		init_nr = sysctl_perf_event_max_stack - num_elem;
+	trace = get_perf_callchain(regs, init_nr, kernel, user,
+				   sysctl_perf_event_max_stack, false, false);
+	if (unlikely(!trace))
+		goto err_fault;
+
+	trace_nr = trace->nr - init_nr;
+	if (trace_nr <= skip)
+		goto err_fault;
+
+	trace_nr -= skip;
+	trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem;
+	copy_len = trace_nr * elem_size;
+	ips = trace->ip + skip + init_nr;
+	if (user && user_build_id)
+		stack_map_get_build_id_offset(buf, ips, trace_nr, user);
+	else
+		memcpy(buf, ips, copy_len);
+
+	if (size > copy_len)
+		memset(buf + copy_len, 0, size - copy_len);
+	return copy_len;
+
+err_fault:
+	err = -EFAULT;
+clear:
+	memset(buf, 0, size);
+	return err;
+}
+
+const struct bpf_func_proto bpf_get_stack_proto = {
+	.func		= bpf_get_stack,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 /* Called from eBPF program */
 static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
 {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index fe23dc5a..1ee71f6 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1360,6 +1360,16 @@ static int bpf_prog_load(union bpf_attr *attr)
 	if (err)
 		goto free_used_maps;
 
+	if (prog->need_callchain_buf) {
+#ifdef CONFIG_PERF_EVENTS
+		err = get_callchain_buffers(sysctl_perf_event_max_stack);
+#else
+		err = -ENOTSUPP;
+#endif
+		if (err)
+			goto free_used_maps;
+	}
+
 	err = bpf_prog_new_fd(prog);
 	if (err < 0) {
 		/* failed to allocate fd.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5dd1dcb..aba9425 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2460,6 +2460,9 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 	if (err)
 		return err;
 
+	if (func_id == BPF_FUNC_get_stack)
+		env->prog->need_callchain_buf = true;
+
 	if (changes_data)
 		clear_all_pkt_pointers(env);
 	return 0;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d88e96d..fe8476f 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -20,6 +20,7 @@
 #include "trace.h"
 
 u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+u64 bpf_get_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 /**
  * trace_call_bpf - invoke BPF program
@@ -577,6 +578,8 @@ kprobe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_perf_event_output_proto;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto;
+	case BPF_FUNC_get_stack:
+		return &bpf_get_stack_proto;
 	case BPF_FUNC_perf_event_read_value:
 		return &bpf_perf_event_read_value_proto;
 #ifdef CONFIG_BPF_KPROBE_OVERRIDE
@@ -664,6 +667,25 @@ static const struct bpf_func_proto bpf_get_stackid_proto_tp = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_get_stack_tp, void *, tp_buff, void *, buf, u32, size,
+	   u64, flags)
+{
+	struct pt_regs *regs = *(struct pt_regs **)tp_buff;
+
+	return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
+			     (unsigned long) size, flags, 0);
+}
+
+static const struct bpf_func_proto bpf_get_stack_proto_tp = {
+	.func		= bpf_get_stack_tp,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -672,6 +694,8 @@ tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_perf_event_output_proto_tp;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto_tp;
+	case BPF_FUNC_get_stack:
+		return &bpf_get_stack_proto_tp;
 	default:
 		return tracing_func_proto(func_id, prog);
 	}
@@ -734,6 +758,8 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_perf_event_output_proto_tp;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto_tp;
+	case BPF_FUNC_get_stack:
+		return &bpf_get_stack_proto_tp;
 	case BPF_FUNC_perf_prog_read_value:
 		return &bpf_perf_prog_read_value_proto;
 	default:
@@ -744,7 +770,7 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 /*
  * bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp
  * to avoid potential recursive reuse issue when/if tracepoints are added
- * inside bpf_*_event_output and/or bpf_get_stack_id
+ * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack
  */
 static DEFINE_PER_CPU(struct pt_regs, bpf_raw_tp_regs);
 BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args,
@@ -787,6 +813,26 @@ static const struct bpf_func_proto bpf_get_stackid_proto_raw_tp = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_get_stack_raw_tp, struct bpf_raw_tracepoint_args *, args,
+	   void *, buf, u32, size, u64, flags)
+{
+	struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
+
+	perf_fetch_caller_regs(regs);
+	return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
+			     (unsigned long) size, flags, 0);
+}
+
+static const struct bpf_func_proto bpf_get_stack_proto_raw_tp = {
+	.func		= bpf_get_stack_raw_tp,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -795,6 +841,8 @@ raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_perf_event_output_proto_raw_tp;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto_raw_tp;
+	case BPF_FUNC_get_stack:
+		return &bpf_get_stack_proto_raw_tp;
 	default:
 		return tracing_func_proto(func_id, prog);
 	}
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 0/9] bpf: add bpf_get_stack helper
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team

Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table regardless of
whether BPF_F_REUSE_STACKID is specified or not,
so some stack traces may be missing from user perspective.

This patch implements a new helper, bpf_get_stack, will
send stack traces directly to bpf program. The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through
shared map or bpf_perf_event_output.

Patches #1 and #2 implemented the core kernel support.
Patches #3 and #4 are two verifier improves to make
bpf programming easier. Patch #5 synced the new helper
to tools headers. Patch #6 moved perf_event polling code
and ksym lookup code from samples/bpf to
tools/testing/selftests/bpf. Patch #7 added a verifier
test in tools/bpf for new verifier change.
Patches #8 and #9 added tests for raw tracepoint prog
and tracepoint prog respectively.

Changelogs:
  v2 -> v3:
    . use meta to track helper memory size argument
    . implement range checking for ARSH in verifier
    . move perf event polling and ksym related functions
      from samples/bpf to tools/bpf
    . added test to compare build id's between bpf_get_stackid
      and bpf_get_stack
  v1 -> v2:
    . fix compilation error when CONFIG_PERF_EVENTS is not enabled

Yonghong Song (9):
  bpf: change prototype for stack_map_get_build_id_offset
  bpf: add bpf_get_stack helper
  bpf/verifier: refine retval R0 state for bpf_get_stack helper
  bpf/verifier: improve register value range tracking with ARSH
  tools/bpf: add bpf_get_stack helper to tools headers
  samples/bpf: move common-purpose trace functions to selftests
  tools/bpf: add a verifier test case for bpf_get_stack helper and ARSH
  tools/bpf: add a test for bpf_get_stack with raw tracepoint prog
  tools/bpf: add a test for bpf_get_stack with tracepoint prog

 include/linux/bpf.h                                |   1 +
 include/linux/filter.h                             |   3 +-
 include/uapi/linux/bpf.h                           |  19 ++-
 kernel/bpf/core.c                                  |   5 +
 kernel/bpf/stackmap.c                              |  80 ++++++++-
 kernel/bpf/syscall.c                               |  10 ++
 kernel/bpf/verifier.c                              |  56 +++++++
 kernel/trace/bpf_trace.c                           |  50 +++++-
 samples/bpf/Makefile                               |  11 +-
 samples/bpf/bpf_load.c                             |  63 -------
 samples/bpf/bpf_load.h                             |   7 -
 samples/bpf/offwaketime_user.c                     |   1 +
 samples/bpf/sampleip_user.c                        |   1 +
 samples/bpf/spintest_user.c                        |   1 +
 samples/bpf/trace_event_user.c                     |   1 +
 samples/bpf/trace_output_user.c                    | 125 ++------------
 tools/include/uapi/linux/bpf.h                     |  19 ++-
 tools/testing/selftests/bpf/Makefile               |   3 +-
 tools/testing/selftests/bpf/bpf_helpers.h          |   3 +-
 tools/testing/selftests/bpf/test_get_stack_rawtp.c | 102 +++++++++++
 tools/testing/selftests/bpf/test_progs.c           | 178 +++++++++++++++++++-
 .../selftests/bpf/test_stacktrace_build_id.c       |  20 ++-
 tools/testing/selftests/bpf/test_stacktrace_map.c  |  20 ++-
 tools/testing/selftests/bpf/test_verifier.c        |  45 +++++
 tools/testing/selftests/bpf/trace_helpers.c        | 186 +++++++++++++++++++++
 tools/testing/selftests/bpf/trace_helpers.h        |  24 +++
 26 files changed, 825 insertions(+), 209 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_get_stack_rawtp.c
 create mode 100644 tools/testing/selftests/bpf/trace_helpers.c
 create mode 100644 tools/testing/selftests/bpf/trace_helpers.h

-- 
2.9.5

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox