Netdev List

Netdev List
 help / color / mirror / Atom feed

* RE: [PATCH net-next v5 09/12] socket: Add SO_TIMESTAMPING_NEW
From: Ran Rozenstein @ 2019-02-10 15:43 UTC (permalink / raw)
  To: Deepa Dinamani, davem@davemloft.net, linux-kernel@vger.kernel.org
  Cc: netdev@vger.kernel.org, arnd@arndb.de, y2038@lists.linaro.org,
	chris@zankel.net, fenghua.yu@intel.com, rth@twiddle.net,
	tglx@linutronix.de, ubraun@linux.ibm.com,
	linux-alpha@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-ia64@vger.kernel.org, linux-mips@linux-mips.org,
	linux-s390@vger.kernel.org, linux-xtensa@linux-xtensa.org,
	sparclinux@vger.kernel.org
In-Reply-To: <20190202153454.7121-10-deepa.kernel@gmail.com>

> Subject: [PATCH net-next v5 09/12] socket: Add SO_TIMESTAMPING_NEW
> 
> Add SO_TIMESTAMPING_NEW variant of socket timestamp options.
> This is the y2038 safe versions of the SO_TIMESTAMPING_OLD for all
> architectures.
> 
> Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
> Acked-by: Willem de Bruijn <willemb@google.com>


Hi,

I have app that include: 
	#include <linux/errqueue.h>

It now fail with this error:
	In file included from timestamping.c:6:0:
	/usr/include/linux/errqueue.h:46:27: error: array type has incomplete element type 'struct __kernel_timespec'
	  struct __kernel_timespec ts[3];
                			           ^~
I tried to do the trivial fix, to include time.h:
In include/uapi/linux/errqueue.h
	#include <linux/time.h>
	#include <linux/types.h>

But it just add some other noises:
                In file included from /usr/include/linux/errqueue.h:5:0,
                                 from timestamping.c:6:
                /usr/include/linux/time.h:10:8: error: redefinition of ?struct timespec?
                 struct timespec {
                        ^~~~~~~~
                In file included from /usr/include/sys/select.h:39:0,
                                 from /usr/include/sys/types.h:197,
                                 from /usr/include/stdlib.h:279,
                                 from timestamping.c:2:
                /usr/include/bits/types/struct_timespec.h:8:8: note: originally defined here
                 struct timespec
                        ^~~~~~~~
                In file included from /usr/include/linux/errqueue.h:5:0,
                                 from timestamping.c:6:
                /usr/include/linux/time.h:16:8: error: redefinition of ?struct timeval?
                 struct timeval {
                        ^~~~~~~
                In file included from /usr/include/sys/select.h:37:0,
                                 from /usr/include/sys/types.h:197,
                                 from /usr/include/stdlib.h:279,
                                 from timestamping.c:2:
                /usr/include/bits/types/struct_timeval.h:8:8: note: originally defined here
                 struct timeval
                        ^~~~~~~


Can you please advise how to solve it?

Thanks,
Ran


> diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
> index c0151200f7d1..d955b9e32288 100644
> --- a/include/uapi/linux/errqueue.h
> +++ b/include/uapi/linux/errqueue.h
> @@ -41,6 +41,10 @@ struct scm_timestamping {
>  	struct timespec ts[3];
>  };
> 
> +struct scm_timestamping64 {
> +	struct __kernel_timespec ts[3];
> +};
> +
>  /* The type of scm_timestamping, passed in sock_extended_err ee_info.
>   * This defines the type of ts[0]. For SCM_TSTAMP_SND only, if ts[0]
>   * is zero, then this is a hardware timestamp and recorded in ts[2].

^ permalink raw reply

* Re: KASAN: invalid-free in sctp_stream_free
From: Xin Long @ 2019-02-10 14:36 UTC (permalink / raw)
  To: syzbot
  Cc: davem, LKML, linux-sctp, Marcelo Ricardo Leitner, network dev,
	Neil Horman, syzkaller-bugs, Vlad Yasevich
In-Reply-To: <000000000000f4f3ad0581140b3f@google.com>

On Tue, Feb 5, 2019 at 1:21 AM syzbot
<syzbot+58e480e7b28f2d890bfd@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    dc4c89997735 Add linux-next specific files for 20190201
> git tree:       linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=17eb3dc4c00000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=59aefae07c771af6
> dashboard link: https://syzkaller.appspot.com/bug?extid=58e480e7b28f2d890bfd
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=117990e4c00000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=16758df8c00000
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+58e480e7b28f2d890bfd@syzkaller.appspotmail.com
>
> ==================================================================
> BUG: KASAN: double-free or invalid-free in sctp_stream_free+0xfa/0x190
> net/sctp/stream.c:187

The issue might be caused by "return -ENOMEM" in sctp_stream_alloc_out().

sctp_stream_init() -> sctp_stream_outq_migrate():
        for (i = outcnt; i < stream->outcnt; i++)
                kfree(SCTP_SO(stream, i)->ext);

It will free the surplus streams' ext, but it doesn't set them to NULL. Then,
sctp_stream_init() -> sctp_stream_alloc_out():
        ret = sctp_stream_alloc_out(stream, outcnt, gfp);
        if (ret)
                goto out;

        stream->outcnt = outcnt;
when it returns err, stream->outcnt will not be set to the new outcnt.

With the old outcnt(a bigger value), it will try to free those surplus
streams' ext again when freeing the asoc.

So we can fix it simply by:

diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 3892e76..9e317e5 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -131,8 +131,10 @@ static void sctp_stream_outq_migrate(struct
sctp_stream *stream,
                }
        }

-       for (i = outcnt; i < stream->outcnt; i++)
+       for (i = outcnt; i < stream->outcnt; i++) {
                kfree(SCTP_SO(stream, i)->ext);
+               SCTP_SO(stream, i)->ext = NULL;
+       }
 }

>
> CPU: 0 PID: 7775 Comm: syz-executor682 Not tainted 5.0.0-rc4-next-20190201
> #25
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0x172/0x1f0 lib/dump_stack.c:113
>   print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
>   kasan_report_invalid_free+0x65/0xa0 mm/kasan/report.c:278
>   __kasan_slab_free+0x13a/0x150 mm/kasan/common.c:439
>   kasan_slab_free+0xe/0x10 mm/kasan/common.c:468
>   __cache_free mm/slab.c:3488 [inline]
>   kfree+0xcf/0x230 mm/slab.c:3807
>   sctp_stream_free+0xfa/0x190 net/sctp/stream.c:187
>   sctp_association_free+0x235/0x79a net/sctp/associola.c:373
>   sctp_cmd_delete_tcb net/sctp/sm_sideeffect.c:939 [inline]
>   sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1353 [inline]
>   sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
>   sctp_do_sm+0x3c4e/0x5380 net/sctp/sm_sideeffect.c:1191
>   sctp_assoc_bh_rcv+0x343/0x660 net/sctp/associola.c:1074
>   sctp_inq_push+0x1ea/0x290 net/sctp/inqueue.c:95
>   sctp_backlog_rcv+0x196/0xbe0 net/sctp/input.c:354
>   sk_backlog_rcv include/net/sock.h:936 [inline]
>   __release_sock+0x12e/0x3a0 net/core/sock.c:2309
>   release_sock+0x59/0x1c0 net/core/sock.c:2825
>   sctp_close+0x4a4/0x860 net/sctp/socket.c:1550
>   inet_release+0x105/0x1f0 net/ipv4/af_inet.c:428
>   __sock_release+0xd3/0x250 net/socket.c:579
>   sock_close+0x1b/0x30 net/socket.c:1139
>   __fput+0x2df/0x8d0 fs/file_table.c:278
>   ____fput+0x16/0x20 fs/file_table.c:309
>   task_work_run+0x14a/0x1c0 kernel/task_work.c:113
>   exit_task_work include/linux/task_work.h:22 [inline]
>   do_exit+0x90a/0x2fa0 kernel/exit.c:876
>   do_group_exit+0x135/0x370 kernel/exit.c:980
>   __do_sys_exit_group kernel/exit.c:991 [inline]
>   __se_sys_exit_group kernel/exit.c:989 [inline]
>   __x64_sys_exit_group+0x44/0x50 kernel/exit.c:989
>   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x43edd8
> Code: Bad RIP value.
> RSP: 002b:00007fff92021088 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043edd8
> RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
> RBP: 00000000004be688 R08: 00000000000000e7 R09: ffffffffffffffd0
> R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000001
> R13: 00000000006d0180 R14: 0000000000000000 R15: 0000000000000000
>
> Allocated by task 7775:
>   save_stack+0x45/0xd0 mm/kasan/common.c:75
>   set_track mm/kasan/common.c:87 [inline]
>   __kasan_kmalloc mm/kasan/common.c:498 [inline]
>   __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:471
>   kasan_kmalloc+0x9/0x10 mm/kasan/common.c:506
>   kmem_cache_alloc_trace+0x151/0x760 mm/slab.c:3610
>   kmalloc include/linux/slab.h:548 [inline]
>   kzalloc include/linux/slab.h:743 [inline]
>   sctp_stream_init_ext+0x51/0x110 net/sctp/stream.c:172
>   sctp_sendmsg_to_asoc+0x1273/0x17b0 net/sctp/socket.c:1896
>   sctp_sendmsg+0x81f/0x17f0 net/sctp/socket.c:2113
>   inet_sendmsg+0x147/0x5d0 net/ipv4/af_inet.c:798
>   sock_sendmsg_nosec net/socket.c:621 [inline]
>   sock_sendmsg+0xdd/0x130 net/socket.c:631
>   ___sys_sendmsg+0x806/0x930 net/socket.c:2114
>   __sys_sendmsg+0x105/0x1d0 net/socket.c:2152
>   __do_sys_sendmsg net/socket.c:2161 [inline]
>   __se_sys_sendmsg net/socket.c:2159 [inline]
>   __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2159
>   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> Freed by task 7775:
>   save_stack+0x45/0xd0 mm/kasan/common.c:75
>   set_track mm/kasan/common.c:87 [inline]
>   __kasan_slab_free+0x102/0x150 mm/kasan/common.c:460
>   kasan_slab_free+0xe/0x10 mm/kasan/common.c:468
>   __cache_free mm/slab.c:3488 [inline]
>   kfree+0xcf/0x230 mm/slab.c:3807
>   sctp_stream_outq_migrate+0x3e6/0x540 net/sctp/stream.c:88
>   sctp_stream_init+0xbc/0x410 net/sctp/stream.c:139
>   sctp_process_init+0x21c3/0x2b20 net/sctp/sm_make_chunk.c:2466
>   sctp_cmd_process_init net/sctp/sm_sideeffect.c:682 [inline]
>   sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1410 [inline]
>   sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
>   sctp_do_sm+0x3145/0x5380 net/sctp/sm_sideeffect.c:1191
>   sctp_assoc_bh_rcv+0x343/0x660 net/sctp/associola.c:1074
>   sctp_inq_push+0x1ea/0x290 net/sctp/inqueue.c:95
>   sctp_backlog_rcv+0x196/0xbe0 net/sctp/input.c:354
>   sk_backlog_rcv include/net/sock.h:936 [inline]
>   __release_sock+0x12e/0x3a0 net/core/sock.c:2309
>   release_sock+0x59/0x1c0 net/core/sock.c:2825
>   sctp_wait_for_connect+0x316/0x540 net/sctp/socket.c:8998
>   sctp_sendmsg_to_asoc+0x13e3/0x17b0 net/sctp/socket.c:1967
>   sctp_sendmsg+0x81f/0x17f0 net/sctp/socket.c:2113
>   inet_sendmsg+0x147/0x5d0 net/ipv4/af_inet.c:798
>   sock_sendmsg_nosec net/socket.c:621 [inline]
>   sock_sendmsg+0xdd/0x130 net/socket.c:631
>   ___sys_sendmsg+0x806/0x930 net/socket.c:2114
>   __sys_sendmsg+0x105/0x1d0 net/socket.c:2152
>   __do_sys_sendmsg net/socket.c:2161 [inline]
>   __se_sys_sendmsg net/socket.c:2159 [inline]
>   __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2159
>   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> The buggy address belongs to the object at ffff8880a7e2c480
>   which belongs to the cache kmalloc-96 of size 96
> The buggy address is located 0 bytes inside of
>   96-byte region [ffff8880a7e2c480, ffff8880a7e2c4e0)
> The buggy address belongs to the page:
> page:ffffea00029f8b00 count:1 mapcount:0 mapping:ffff88812c3f04c0
> index:0xffff8880a7e2c300
> flags: 0x1fffc0000000200(slab)
> raw: 01fffc0000000200 ffffea00029f9908 ffff88812c3f1438 ffff88812c3f04c0
> raw: ffff8880a7e2c300 ffff8880a7e2c000 0000000100000012 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>   ffff8880a7e2c380: 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc
>   ffff8880a7e2c400: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> > ffff8880a7e2c480: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>                     ^
>   ffff8880a7e2c500: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>   ffff8880a7e2c580: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> ==================================================================
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
> syzbot.
> syzbot can test patches for this bug, for details see:
> https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply related

* [PATCH v2 net 2/2] Revert "r8169: make use of xmit_more and __netdev_sent_queue"
From: Heiner Kallweit @ 2019-02-10 14:28 UTC (permalink / raw)
  To: Realtek linux nic maintainers, David Miller
  Cc: Sander Eikelenboom, netdev@vger.kernel.org
In-Reply-To: <63e9e767-f78b-8865-fe25-41d77e4e7a58@gmail.com>

This reverts commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356.

Sander reported a regression causing a kernel panic[1],
therefore let's revert this commit.

[1] https://marc.info/?t=154965066400001&r=1&w=2

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index bba806ce57d3..6e36b88ca7c9 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -6074,7 +6074,6 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 	struct device *d = tp_to_dev(tp);
 	dma_addr_t mapping;
 	u32 opts[2], len;
-	bool stop_queue;
 	int frags;
 
 	if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
@@ -6116,6 +6115,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 
 	txd->opts2 = cpu_to_le32(opts[1]);
 
+	netdev_sent_queue(dev, skb->len);
+
 	skb_tx_timestamp(skb);
 
 	/* Force memory writes to complete before releasing descriptor */
@@ -6128,16 +6129,16 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 
 	tp->cur_tx += frags + 1;
 
-	stop_queue = !rtl_tx_slots_avail(tp, MAX_SKB_FRAGS);
-	if (unlikely(stop_queue))
-		netif_stop_queue(dev);
+	RTL_W8(tp, TxPoll, NPQ);
 
-	if (__netdev_sent_queue(dev, skb->len, skb->xmit_more)) {
-		RTL_W8(tp, TxPoll, NPQ);
-		mmiowb();
-	}
+	mmiowb();
 
-	if (unlikely(stop_queue)) {
+	if (!rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) {
+		/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
+		 * not miss a ring update when it notices a stopped queue.
+		 */
+		smp_wmb();
+		netif_stop_queue(dev);
 		/* Sync with rtl_tx:
 		 * - publish queue status and cur_tx ring index (write barrier)
 		 * - refresh dirty_tx ring index (read barrier).
-- 
2.20.1




^ permalink raw reply related

* [PATCH v2 net 1/2] Revert "r8169: remove unneeded mmiowb barriers"
From: Heiner Kallweit @ 2019-02-10 14:26 UTC (permalink / raw)
  To: Realtek linux nic maintainers, David Miller
  Cc: Sander Eikelenboom, netdev@vger.kernel.org
In-Reply-To: <63e9e767-f78b-8865-fe25-41d77e4e7a58@gmail.com>

This reverts commit bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3.

There doesn't seem to be anything wrong with this patch,
it's just reverted to get a stable baseline again.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index abb94c543aa2..bba806ce57d3 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1286,11 +1286,13 @@ static u16 rtl_get_events(struct rtl8169_private *tp)
 static void rtl_ack_events(struct rtl8169_private *tp, u16 bits)
 {
 	RTL_W16(tp, IntrStatus, bits);
+	mmiowb();
 }
 
 static void rtl_irq_disable(struct rtl8169_private *tp)
 {
 	RTL_W16(tp, IntrMask, 0);
+	mmiowb();
 }
 
 #define RTL_EVENT_NAPI_RX	(RxOK | RxErr)
@@ -6130,8 +6132,10 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 	if (unlikely(stop_queue))
 		netif_stop_queue(dev);
 
-	if (__netdev_sent_queue(dev, skb->len, skb->xmit_more))
+	if (__netdev_sent_queue(dev, skb->len, skb->xmit_more)) {
 		RTL_W8(tp, TxPoll, NPQ);
+		mmiowb();
+	}
 
 	if (unlikely(stop_queue)) {
 		/* Sync with rtl_tx:
@@ -6483,7 +6487,9 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
 
 	if (work_done < budget) {
 		napi_complete_done(napi, work_done);
+
 		rtl_irq_enable(tp);
+		mmiowb();
 	}
 
 	return work_done;
-- 
2.20.1




^ permalink raw reply related

* [PATCH v2 net 0/2] r8169: revert two commits due to a regression
From: Heiner Kallweit @ 2019-02-10 14:24 UTC (permalink / raw)
  To: Realtek linux nic maintainers, David Miller
  Cc: Sander Eikelenboom, netdev@vger.kernel.org

Sander reported a regression (kernel panic, see[1]), therefore let's
revert these commits. Removal of the barriers doesn't seem to
contribute to the issue, the patch just overlaps with the problematic
one and only reverting both patches was tested.

[1] https://marc.info/?t=154965066400001&r=1&w=2

v2:
- improve commit message

Heiner Kallweit (2):
  Revert "r8169: remove unneeded mmiowb barriers"
  Revert "r8169: make use of xmit_more and __netdev_sent_queue"

 drivers/net/ethernet/realtek/r8169.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

-- 
2.20.1


^ permalink raw reply

* Re: [PATCH net 2/2] Revert "r8169: make use of xmit_more and __netdev_sent_queue"
From: Sander Eikelenboom @ 2019-02-10 14:18 UTC (permalink / raw)
  To: Heiner Kallweit, Realtek linux nic maintainers, David Miller
  Cc: netdev@vger.kernel.org
In-Reply-To: <573a7ab0-ebfc-1dc2-6497-95c7411d6ddb@gmail.com>

On 10/02/2019 14:50, Heiner Kallweit wrote:
> This reverts commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356.
> 
> Sander reported a regression [1], therefore let's revert these commits.
> Removal of the barriers doesn't seem to contribute to the issue, the
> patch just overlaps with the problematic one and reverting
> "r8169: make use of xmit_more and __netdev_sent_queue" only wasn't
> tested.

This commit message is incorrect, it is only correct for the other commit bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3.

Commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 caused the kernel panic on the BUG_ON() in lib/dynamic_queue_limits.c
it could well be the smp_wmb() barrier is what is needed (or the construction from the patch Eric proposed.
The splat from the BUG_ON() is below.

Same goes for the cover letter which make it seem rather benign, while the regression is actually a kernel panic.

--
Sander


[ 6466.554866] kernel BUG at lib/dynamic_queue_limits.c:27!
[ 6466.571425] invalid opcode: 0000 [#1] SMP NOPTI
[ 6466.585890] CPU: 3 PID: 7057 Comm: as Not tainted 5.0.0-rc5-20190208-thp-net-florian-doflr+ #1
[ 6466.598693] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
[ 6466.611579] RIP: e030:dql_completed+0x126/0x140
[ 6466.624339] Code: 2b 47 54 ba 00 00 00 00 c7 47 54 ff ff ff ff 0f 48 c2 48 8b 15 7b 39 4a 01 48 89 57 58 e9 48 ff ff ff 44 89 c0 e9 40 ff ff ff <0f> 0b 8b 47 50 29 e8 41 0f 48 c3 eb 9f 90 90 90 90 90 90 90 90 90
[ 6466.648130] RSP: e02b:ffff88807d4c3e78 EFLAGS: 00010297
[ 6466.659616] RAX: 0000000000000042 RBX: ffff8880049cf800 RCX: 0000000000000000
[ 6466.672835] RDX: 0000000000000001 RSI: 0000000000000042 RDI: ffff8880049cf8c0
[ 6466.684521] RBP: ffff888077df7260 R08: 0000000000000001 R09: 0000000000000000
[ 6466.696824] R10: 00000000387c2336 R11: 00000000387c2336 R12: 0000000010000000
[ 6466.709953] R13: ffff888077df6898 R14: ffff888077df75c0 R15: 0000000000454677
[ 6466.722165] FS:  00007fd869147200(0000) GS:ffff88807d4c0000(0000) knlGS:0000000000000000
[ 6466.733228] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6466.746581] CR2: 00007fd867dfd000 CR3: 0000000074884000 CR4: 0000000000000660
[ 6466.758366] Call Trace:
[ 6466.768118]  <IRQ>
[ 6466.778214]  rtl8169_poll+0x4f4/0x640
[ 6466.789198]  net_rx_action+0x23d/0x370
[ 6466.798467]  __do_softirq+0xed/0x229
[ 6466.807039]  irq_exit+0xb7/0xc0
[ 6466.815471]  xen_evtchn_do_upcall+0x27/0x40
[ 6466.826647]  xen_do_hypervisor_callback+0x29/0x40
[ 6466.835902]  </IRQ>
[ 6466.845361] RIP: e030:xen_hypercall_mmu_update+0xa/0x20
[ 6466.853390] Code: 51 41 53 b8 00 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 01 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[ 6466.874031] RSP: e02b:ffffc90003c0bdd0 EFLAGS: 00000246
[ 6466.883452] RAX: 0000000000000000 RBX: 000000041f83bfe8 RCX: ffffffff8100102a
[ 6466.891986] RDX: deadbeefdeadf00d RSI: deadbeefdeadf00d RDI: deadbeefdeadf00d
[ 6466.903402] RBP: 0000000000000fe8 R08: 000000000000000b R09: 0000000000000000
[ 6466.911201] R10: deadbeefdeadf00d R11: 0000000000000246 R12: 800000050c346067
[ 6466.918491] R13: ffff8880607c4fe8 R14: ffff888005082800 R15: 0000000000000000
[ 6466.926647]  ? xen_hypercall_mmu_update+0xa/0x20
[ 6466.938195]  ? xen_set_pte_at+0x78/0xe0
[ 6466.947046]  ? __handle_mm_fault+0xc43/0x1060
[ 6466.955772]  ? do_mmap+0x44b/0x5b0
[ 6466.964410]  ? handle_mm_fault+0xf8/0x200
[ 6466.973290]  ? __do_page_fault+0x231/0x4a0
[ 6466.981973]  ? page_fault+0x8/0x30
[ 6466.990904]  ? page_fault+0x1e/0x30
[ 6466.999585] Modules linked in:
[ 6467.007533] ---[ end trace 94bec01608fe4061 ]---
[ 6467.016751] RIP: e030:dql_completed+0x126/0x140
[ 6467.024271] Code: 2b 47 54 ba 00 00 00 00 c7 47 54 ff ff ff ff 0f 48 c2 48 8b 15 7b 39 4a 01 48 89 57 58 e9 48 ff ff ff 44 89 c0 e9 40 ff ff ff <0f> 0b 8b 47 50 29 e8 41 0f 48 c3 eb 9f 90 90 90 90 90 90 90 90 90
[ 6467.039726] RSP: e02b:ffff88807d4c3e78 EFLAGS: 00010297
[ 6467.047243] RAX: 0000000000000042 RBX: ffff8880049cf800 RCX: 0000000000000000
[ 6467.054202] RDX: 0000000000000001 RSI: 0000000000000042 RDI: ffff8880049cf8c0
[ 6467.062000] RBP: ffff888077df7260 R08: 0000000000000001 R09: 0000000000000000
[ 6467.069664] R10: 00000000387c2336 R11: 00000000387c2336 R12: 0000000010000000
[ 6467.077715] R13: ffff888077df6898 R14: ffff888077df75c0 R15: 0000000000454677
[ 6467.084916] FS:  00007fd869147200(0000) GS:ffff88807d4c0000(0000) knlGS:0000000000000000
[ 6467.093352] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6467.101492] CR2: 00007fd867dfd000 CR3: 0000000074884000 CR4: 0000000000000660
[ 6467.110542] Kernel panic - not syncing: Fatal exception in interrupt
[ 6467.118166] Kernel Offset: disabled
(XEN) [2019-02-08 18:04:48.854] Hardware Dom0 crashed: rebooting machine in 5 seconds.

> 
> [1] https://marc.info/?t=154965066400001&r=1&w=2
> 
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> ---
>  drivers/net/ethernet/realtek/r8169.c | 19 ++++++++++---------
>  1 file changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index bba806ce57d3..6e36b88ca7c9 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -6074,7 +6074,6 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>  	struct device *d = tp_to_dev(tp);
>  	dma_addr_t mapping;
>  	u32 opts[2], len;
> -	bool stop_queue;
>  	int frags;
>  
>  	if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
> @@ -6116,6 +6115,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>  
>  	txd->opts2 = cpu_to_le32(opts[1]);
>  
> +	netdev_sent_queue(dev, skb->len);
> +
>  	skb_tx_timestamp(skb);
>  
>  	/* Force memory writes to complete before releasing descriptor */
> @@ -6128,16 +6129,16 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>  
>  	tp->cur_tx += frags + 1;
>  
> -	stop_queue = !rtl_tx_slots_avail(tp, MAX_SKB_FRAGS);
> -	if (unlikely(stop_queue))
> -		netif_stop_queue(dev);
> +	RTL_W8(tp, TxPoll, NPQ);
>  
> -	if (__netdev_sent_queue(dev, skb->len, skb->xmit_more)) {
> -		RTL_W8(tp, TxPoll, NPQ);
> -		mmiowb();
> -	}
> +	mmiowb();
>  
> -	if (unlikely(stop_queue)) {
> +	if (!rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) {
> +		/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
> +		 * not miss a ring update when it notices a stopped queue.
> +		 */
> +		smp_wmb();
> +		netif_stop_queue(dev);
>  		/* Sync with rtl_tx:
>  		 * - publish queue status and cur_tx ring index (write barrier)
>  		 * - refresh dirty_tx ring index (read barrier).
> 


^ permalink raw reply

* Re: [PATCH v1 net-next 0/2] Change tc action identifiers to be more consistent
From: Jamal Hadi Salim @ 2019-02-10 14:13 UTC (permalink / raw)
  To: Eli Cohen, davem, xiyou.wangcong, jiri; +Cc: netdev, simon.horman
In-Reply-To: <20190210122500.13614-1-eli@mellanox.com>

On 2019-02-10 7:24 a.m., Eli Cohen wrote:
> This two patch series modifies TC actions identifiers to be more consistent and
> also puts them in one place so new identifiers numbers can be chosen more
> easily.

You left out my ACK. here it is:

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

^ permalink raw reply

* Re: Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!
From: Heiner Kallweit @ 2019-02-10 13:57 UTC (permalink / raw)
  To: Sander Eikelenboom, Eric Dumazet, Realtek linux nic maintainers,
	Eric Dumazet
  Cc: Linus Torvalds, linux-kernel, netdev
In-Reply-To: <b9c3afa3-7e02-83dd-94f9-c62677e00399@eikelenboom.it>

On 10.02.2019 14:05, Sander Eikelenboom wrote:
> On 10/02/2019 12:44, Heiner Kallweit wrote:
>> On 10.02.2019 10:16, Sander Eikelenboom wrote:
>>> On 09/02/2019 12:50, Heiner Kallweit wrote:
>>>> On 09.02.2019 11:07, Sander Eikelenboom wrote:
>>>>> On 09/02/2019 10:59, Heiner Kallweit wrote:
>>>>>> On 09.02.2019 10:34, Sander Eikelenboom wrote:
>>>>>>> On 09/02/2019 10:02, Heiner Kallweit wrote:
>>>>>>>> On 09.02.2019 00:09, Eric Dumazet wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 02/08/2019 01:50 PM, Heiner Kallweit wrote:
>>>>>>>>>> On 08.02.2019 22:45, Sander Eikelenboom wrote:
>>>>>>>>>>> On 08/02/2019 22:22, Heiner Kallweit wrote:
>>>>>>>>>>>> On 08.02.2019 21:55, Sander Eikelenboom wrote:
>>>>>>>>>>>>> On 08/02/2019 19:52, Heiner Kallweit wrote:
>>>>>>>>>>>>>> On 08.02.2019 19:29, Sander Eikelenboom wrote:
>>>>>>>>>>>>>>> L.S.,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> While testing a linux 5.0-rc5 kernel (with some patches on top but they don't seem related) under Xen i the nasty splat below, 
>>>>>>>>>>>>>>> that I haven encountered with Linux 4.20.x.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Unfortunately I haven't got a clear reproducer for this and bisecting could be nasty due to another (networking related) kernel bug.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If you need more info, want me to run a debug patch etc., please feel free to ask.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the report. However I see no change in the r8169 driver between
>>>>>>>>>>>>>> 4.20 and 5.0 with regard to BQL code. Having said that the root cause could
>>>>>>>>>>>>>> be somewhere else. Therefore I'm afraid a bisect will be needed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hmm i did some diging and i think:
>>>>>>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>>>>>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>>>>>>>>> 620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add __netdev_sent_queue as variant of __netdev_tx_sent_queue
>>>>>>>>>>>>>
>>>>>>>>>>>> You're right. Thought this was added in 4.20 already.
>>>>>>>>>>>> The BQL code pattern I copied from the mlx4 driver and so far I haven't heard about
>>>>>>>>>>>> this issue from any user of physical hw. And due to the fact that a lot of mainboards
>>>>>>>>>>>> have onboard Realtek network I have quite a few testers out there.
>>>>>>>>>>>> Does the issue occur under specific circumstances like very high load?
>>>>>>>>>>>
>>>>>>>>>>> Yep, the box is already quite contented with the Xen VM's and if I remember correctly it occurred while kernel compiling
>>>>>>>>>>> on the host.
>>>>>>>>>>>
>>>>>>>>>>>> If indeed the xmit_more patch causes the issue, I think we have to involve Eric Dumazet
>>>>>>>>>>>> as author of the underlying changes.
>>>>>>>>>>>
>>>>>>>>>>> It could also be the barriers weren't that unneeded as assumed.
>>>>>>>>>>
>>>>>>>>>> The barriers were removed after adding xmit_more handling. Therefore it would be good to
>>>>>>>>>> test also with only 
>>>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>>>>> removed.
>>>>>>>>>>
>>>>>>>>>>> Since we are almost at RC6 i took the liberty to CC Eric now.
>>>>>>>>>>>
>>>>>>>>>> Sure, thanks.
>>>>>>>>>>
>>>>>>>>>>> BTW am i correct these patches are merely optimizations ?
>>>>>>>>>>
>>>>>>>>>> Yes
>>>>>>>>>>
>>>>>>>>>>> If so and concluding they revert cleanly, perhaps it should be considered at this point in the RC's
>>>>>>>>>>> to revert them for 5.0 and try again for 5.1 ?
>>>>>>>>>>>
>>>>>>>>>> Before removing both it would be good to test with only the barrier-removal removed.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>>>>> looks buggy to me, since the skb might have been freed already on another cpu when you call
>>>>>>>>>
>>>>>>>>> You could try :
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>>>>>>>>> index 3624e67aef72c92ed6e908e2c99ac2d381210126..f907d484165d9fd775e81bf2bfb9aa4ddedb1c93 100644
>>>>>>>>> --- a/drivers/net/ethernet/realtek/r8169.c
>>>>>>>>> +++ b/drivers/net/ethernet/realtek/r8169.c
>>>>>>>>> @@ -6070,6 +6070,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>>>         dma_addr_t mapping;
>>>>>>>>>         u32 opts[2], len;
>>>>>>>>>         bool stop_queue;
>>>>>>>>> +       bool door_bell;
>>>>>>>>>         int frags;
>>>>>>>>>  
>>>>>>>>>         if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
>>>>>>>>> @@ -6116,6 +6117,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>>>         /* Force memory writes to complete before releasing descriptor */
>>>>>>>>>         dma_wmb();
>>>>>>>>>  
>>>>>>>>> +       door_bell = __netdev_sent_queue(dev, skb->len, skb->xmit_more);
>>>>>>>>> +
>>>>>>>>>         txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
>>>>>>>>>  
>>>>>>>>>         /* Force all memory writes to complete before notifying device */
>>>>>>>>> @@ -6127,7 +6130,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>>>         if (unlikely(stop_queue))
>>>>>>>>>                 netif_stop_queue(dev);
>>>>>>>>>  
>>>>>>>>> -       if (__netdev_sent_queue(dev, skb->len, skb->xmit_more)) {
>>>>>>>>> +       if (door_bell) {
>>>>>>>>>                 RTL_W8(tp, TxPoll, NPQ);
>>>>>>>>>                 mmiowb();
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>> Thanks a lot for checking and for the proposed fix.
>>>>>>>> Sander, can you try with this patch on top of 5.0-rc5 w/o removing two two commits?
>>>>>>>
>>>>>>> I have done that already during the night .. the results:
>>>>>>> - I can confirm 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 is the first commit which causes hitting the BUG_ON in lib/dynamic_queue_limits.c.
>>>>>>>   (in other word, with only reverting bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 it still blows up).
>>>>>>>
>>>>>>> - The Eric's patch only applies cleanly with bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 reverted, so that's what I tested.
>>>>>>>   The patch seems to prevent hitting the BUG_ON in lib/dynamic_queue_limits.c, it has run this night and I gave done a few kernel compiles
>>>>>>>   this morning. How ever during these kernel compiles i'm getting a transmit queue timeout which i haven't seen with 4.20.x, although i regularly
>>>>>>>   compile kernels in the same way as I do now. The only thing I can't say if that is due to this change, or if it's again something else.
>>>>>>>   Which makes me somewhat inclined to go testing the complete revert some more and see if I can trigger the queue timeout on that or not.
>>>>>>>
>>>>>>>   If I can, it is a separate issue.
>>>>>>>   If I can't it seems even with a patch it still seems as a regression in comparison with 4.20.x, for which
>>>>>>>   a revert would be the right thing to do (since as you indicated these are merely optimizations), 
>>>>>>>   which would give us more time for 5.1 to try to solve things on top of the 5.0-release-to-be.
>>>>>>>   (especially since I seem to still have other issues which need to be sorted out and time is limited)
>>>>>>>
>>>>>>>   The timeout in question:
>>>>>>>         [28336.869479] NETDEV WATCHDOG: eth1 (r8169): transmit queue 0 timed out
>>>>>>>         [28336.881498] WARNING: CPU: 0 PID: 6925 at net/sched/sch_generic.c:461 dev_watchdog+0x20b/0x210
>>>>>>>         [28336.893358] Modules linked in:
>>>>>>>         [28336.904106] CPU: 0 PID: 6925 Comm: cc1 Tainted: G      D           5.0.0-rc5-20190208-thp-net-florian-rtl8169-eric-doflr+ #1
>>>>>>>         [28336.917385] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
>>>>>>>         [28336.928988] RIP: e030:dev_watchdog+0x20b/0x210
>>>>>>>         [28336.940623] Code: 00 49 63 4e e0 eb 90 4c 89 e7 c6 05 ad d8 f1 00 01 e8 a9 32 fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7 50 59 89 82 e8 e5 92 4d ff <0f> 0b eb c0 90 48 c7 47 08 00 00 00 00 48 c7 07 00 00 00 00 0f b7
>>>>>>>         [28336.965265] RSP: e02b:ffff88807d403ea0 EFLAGS: 00010286
>>>>>>>         [28336.977465] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff82a69db8
>>>>>>>         [28336.991265] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000200
>>>>>>>         [28337.008865] RBP: ffff88807936e41c R08: 0000000000000000 R09: 0000000000000819
>>>>>>>         [28337.022250] R10: 0000000000000202 R11: ffffffff8247ca80 R12: ffff88807936e000
>>>>>>>         [28337.035204] R13: 0000000000000000 R14: ffff88807936e440 R15: 0000000000000001
>>>>>>>         [28337.049832] FS:  00007f53e9bf3840(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
>>>>>>>         [28337.062524] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>         [28337.075086] CR2: 00007f53e60c4000 CR3: 000000001a0be000 CR4: 0000000000000660
>>>>>>>         [28337.090052] Call Trace:
>>>>>>>         [28337.103615]  <IRQ>
>>>>>>>         [28337.116587]  ? qdisc_destroy+0x120/0x120
>>>>>>>         [28337.128905]  call_timer_fn+0x19/0x90
>>>>>>>         [28337.141892]  expire_timers+0x8b/0xa0
>>>>>>>         [28337.153354]  run_timer_softirq+0x7e/0x160
>>>>>>>         [28337.165931]  ? handle_irq_event_percpu+0x4c/0x70
>>>>>>>         [28337.176548]  ? handle_percpu_irq+0x32/0x50
>>>>>>>         [28337.186734]  __do_softirq+0xed/0x229
>>>>>>>         [28337.196404]  ? hypervisor_callback+0xa/0x20
>>>>>>>         [28337.207822]  irq_exit+0xb7/0xc0
>>>>>>>         [28337.218978]  xen_evtchn_do_upcall+0x27/0x40
>>>>>>>         [28337.230763]  xen_do_hypervisor_callback+0x29/0x40
>>>>>>>         [28337.241261]  </IRQ>
>>>>>>>         [28337.253283] RIP: e033:0xff7e62
>>>>>>>         [28337.264899] Code: 35 43 0f c7 00 4c 89 ef e8 8b 6d 67 ff 0f 1f 00 44 89 e0 44 89 e2 c1 e8 06 83 e2 3f 48 8b 0c c5 40 8d c6 01 48 0f a3 d1 72 0e <48> 8b 04 c5 50 8d c6 01 48 0f a3 d0 73 0b 44 89 e6 4c 89 ef e8 b5
>>>>>>>         [28337.288677] RSP: e02b:00007fff0fc6a340 EFLAGS: 00000202
>>>>>>>         [28337.299234] RAX: 0000000000000000 RBX: 00007f53e60c3580 RCX: 0000000000000000
>>>>>>>         [28337.309577] RDX: 0000000000000034 RSI: 0000000001e71a98 RDI: 00007fff0fc6a538
>>>>>>>         [28337.320724] RBP: 00007fff0fc6a4b0 R08: 0000000000000000 R09: 0000000000000000
>>>>>>>         [28337.331829] R10: 0000000000000001 R11: 00000000020cb3d0 R12: 0000000000000034
>>>>>>>         [28337.343900] R13: 00007fff0fc6a538 R14: 0000000000000000 R15: 0000000000000001
>>>>>>>         [28337.353977] ---[ end trace 6ff49f09286816b7 ]---
>>>>>>>
>>>>>> Thanks for your efforts. As usual this tx timeout trace says basically nothing except
>>>>>> "timeout" and root cause could be anything. Earlier you reported a memory allocation error,
>>>>>> did that occur again?
>>>>>> If we decide to revert, I'd leave removal of the memory barriers in (as it doesn't seem to
>>>>>> contribute to the issue) and just submit a patch to effectively revert
>>>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356.
>>>>>
>>>>> I can't say if that is correct, because i haven't tested that.
>>>>>
>>>>> Another thing I could test is:
>>>>>  - putting all the r8169 patches (and prerequisites) that went into 5.0 
>>>>>    up to bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3, onto 4.20.7 and see what that does.
>>>>>    If that would be feasible (not too many needed prerequisites out of r8169) and if 
>>>>>    you could spare me some time and prep such a branch somewhere so i can pull and compile that,
>>>>>    that would be great.
>>>>>
>>>>
>>>> Unfortunately there's quite a number of changes. Regarding __netdev_tx_sent_queue()
>>>> and watchdog timeout I found the following comment in drivers/net/ethernet/sfc/tx.c,
>>>> efx_enqueue_skb():
>>>>
>>>> 	if (__netdev_tx_sent_queue(tx_queue->core_txq, skb_len, xmit_more)) {
>>>> 		struct efx_tx_queue *txq2 = efx_tx_queue_partner(tx_queue);
>>>>
>>>> 		/* There could be packets left on the partner queue if those
>>>> 		 * SKBs had skb->xmit_more set. If we do not push those they
>>>> 		 * could be left for a long time and cause a netdev watchdog.
>>>> 		 */
>>>> 		if (txq2->xmit_more_available)
>>>> 			efx_nic_push_buffers(txq2);
>>>>
>>>> But I'm not sure whether the situation in r8169 is comparable. The following patch
>>>> implements what I mentioned earlier: It leaves all other 5.0 changes in place and
>>>> effectively reverts 2e6eedb4813e34d8d84ac0eb3afb668966f3f356. Would be great if
>>>> you could give it a try.
>>>
>>> Hi Heiner,
>>>
>>> It took some time to respond, because I had another issue with 5.0 which intervened with proper testing, 
>>> but fortunately I could pinpoint without doing a full bisect and revert that commit for further testing.
>>>
>>> So there is still time left and I could do a more proper run with your patch below.
>>> Unfortunately i still get a splat (see below) with this, although i'm not sure it is related, 
>>> just that I can't tell.
>>>
>> I checked further and there's a handful of network drivers using __napi_alloc_skb() with __GFP_NOWARN,
>> maybe to avoid such splats. Did the splat impact functionality? When checking the code in r8169 the
>> affected packet would just be dropped.
> 
> It doesn't permanently or noticeably impact functionality, and indeed seems to drop packets:
> 
> eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         inet 172.16.1.1  netmask 255.255.0.0  broadcast 172.16.255.255
>         ether 40:61:86:f4:67:d8  txqueuelen 1000  (Ethernet)
>         RX packets 11563913  bytes 16724445852 (15.5 GiB)
>         RX errors 0  dropped 6  overruns 0  frame 0
>         TX packets 4301515  bytes 1210966808 (1.1 GiB)
>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> 
> Reverting 5317d5c6d47e ("r8169: use napi_consume_skb where possible") doesn't suffice still gives the page allocation failure.
> 
> I think at this point in time we should at least get the
> reverts into 5.0 (probably to late for rc-6 since DaveM's pull request is already in) for:
>     bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>     2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
> 
OK, I just sent the reverts for both patches.

Heiner

> Commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 is the one which caused the BUG_ON() to be hit.
> 
> While we could use your patch to revert only 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 and leave
> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 in:
>     - I haven't been able to test it properly.
>     - It's dealing with barriers, which can be tedious and give subtle breakage.
>     - I don't see any compelling argument to keep it in (it's no fix).
>     - It's RC-6 time ...
> 
> So we then we can focus on the page allocation issue and hopefully find some stable
> baseline before 5.0-final is cut. While I appreciate having a forward looking approach,
> I think we are at the point in time, were we should revert when in doubt 
> (and it doesn't clearly fix an other issue).
> 
> After establishing a stable baseline again, we can start incrementally re-applying and test stuff.
> 
> --
> Sander
> 
> 
> 
>>> Perhaps Linus as Oops-decoding-guru has an idea ?
>>>
>>> --
>>> Sander
>>>
>>> [39041.689007] dpkg-deb: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
>>> [39041.689016] CPU: 4 PID: 14078 Comm: dpkg-deb Not tainted 5.0.0-rc5-20190209-kallweit+ #1
>>> [39041.689017] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
>>> [39041.689018] Call Trace:
>>> [39041.689022]  <IRQ>
>>> [39041.689030]  dump_stack+0x5c/0x7b
>>> [39041.689033]  warn_alloc+0x103/0x190
>>> [39041.689036]  __alloc_pages_nodemask+0xe3d/0xe80
>>> [39041.689039]  ? ip_rcv+0x48/0xc0
>>> [39041.689040]  ? ip_rcv_finish_core.isra.0+0x360/0x360
>>> [39041.689042]  page_frag_alloc+0x117/0x150
>>> [39041.689044]  __napi_alloc_skb+0x83/0xd0
>>> [39041.689048]  rtl8169_poll+0x210/0x640
>>> [39041.689051]  net_rx_action+0x23d/0x370
>>> [39041.689054]  __do_softirq+0xed/0x229
>>> [39041.689058]  irq_exit+0xb7/0xc0
>>> [39041.689061]  xen_evtchn_do_upcall+0x27/0x40
>>> [39041.689063]  xen_do_hypervisor_callback+0x29/0x40
>>> [39041.689064]  </IRQ>
>>> [39041.689066] RIP: e030:_atomic_dec_and_lock+0x2/0x40
>>> [39041.689068] Code: ff 39 05 c5 c1 c9 00 89 c7 89 c6 76 0f 83 eb 01 83 fb ff 75 d9 5b 89 f8 5d 41 5c c3 0f 0b 90 90 90 90 90 90 90 90 90 90 8b 07 <83> f8 01 74 0c 8d 50 ff f0 0f b1 17 75 f2 31 c0 c3 55 53 48 89 fb
>>> [39041.689069] RSP: e02b:ffffc9000705b990 EFLAGS: 00000246
>>> [39041.689071] RAX: 0000000000000001 RBX: ffff888017082640 RCX: 0000000000000000
>>> [39041.689071] RDX: 0000000000000000 RSI: ffff8880170826c0 RDI: ffff888017082788
>>> [39041.689072] RBP: ffff8880170826c0 R08: ffffc9000705bb00 R09: ffffc9000705bb00
>>> [39041.689073] R10: ffffc9000705bb58 R11: ffff88807fc17000 R12: ffff888017082788
>>> [39041.689073] R13: ffff88806cc8cf58 R14: ffff888017082640 R15: ffff888009990240
>>> [39041.689077]  iput+0x63/0x1a0
>>> [39041.689079]  __dentry_kill+0xc5/0x170
>>> [39041.689080]  shrink_dentry_list+0x93/0x1c0
>>> [39041.689082]  prune_dcache_sb+0x4d/0x70
>>> [39041.689084]  super_cache_scan+0x104/0x190
>>> [39041.689087]  do_shrink_slab+0x12c/0x1e0
>>> [39041.689089]  shrink_slab+0xdf/0x2b0
>>> [39041.689091]  shrink_node+0x158/0x470
>>> [39041.689093]  do_try_to_free_pages+0xd1/0x380
>>> [39041.689095]  try_to_free_pages+0xb2/0xe0
>>> [39041.689097]  __alloc_pages_nodemask+0x603/0xe80
>>> [39041.689099]  ? __pagevec_lru_add_fn+0x1b1/0x290
>>> [39041.689102]  alloc_pages_vma+0x7b/0x1c0
>>> [39041.689106]  __handle_mm_fault+0xdb3/0x1060
>>> [39041.689109]  ? xen_mc_flush+0xc0/0x190
>>> [39041.689110]  handle_mm_fault+0xf8/0x200
>>> [39041.689113]  __do_page_fault+0x231/0x4a0
>>> [39041.689115]  ? page_fault+0x8/0x30
>>> [39041.689116]  page_fault+0x1e/0x30
>>> [39041.689118] RIP: e033:0x7fb9851d012e
>>> [39041.689119] Code: 29 c2 48 3b 15 7b a3 31 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 41 0f 29 01 <41> 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9 40 48 83 fa
>>> [39041.689119] RSP: e02b:00007fb958b36d38 EFLAGS: 00010202
>>> [39041.689120] RAX: 00007fb97a617f0e RBX: 000000000000f004 RCX: 00007fb948008be3
>>> [39041.689121] RDX: 00000000000080c2 RSI: 00007fb948000b31 RDI: 00007fb97a617f0e
>>> [39041.689122] RBP: 00000000000ff062 R08: 0000000000000002 R09: 00007fb97a620000
>>> [39041.689123] R10: 0000000000000004 R11: 00007fb97a626f02 R12: 000000000000f005
>>> [39041.689123] R13: 00007fb948000b28 R14: 0000562d76b63710 R15: 0000000000000003
>>> [39041.689125] Mem-Info:
>>> [39041.689130] active_anon:78775 inactive_anon:49211 isolated_anon:0
>>>                 active_file:106409 inactive_file:107531 isolated_file:0
>>>                 unevictable:552 dirty:175 writeback:0 unstable:0
>>>                 slab_reclaimable:13739 slab_unreclaimable:16454
>>>                 mapped:1605 shmem:23 pagetables:2900 bounce:0
>>>                 free:3681 free_pcp:935 free_cma:0
>>> [39041.689132] Node 0 active_anon:315100kB inactive_anon:196844kB active_file:425636kB inactive_file:430124kB unevictable:2208kB isolated(anon):0kB isolated(file):0kB mapped:6420kB dirty:700kB writeback:0kB shmem:92kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
>>> [39041.689133] Node 0 DMA free:7480kB min:44kB low:56kB high:68kB active_anon:0kB inactive_anon:7832kB active_file:472kB inactive_file:4kB unevictable:0kB writepending:0kB present:15956kB managed:15872kB mlocked:0kB kernel_stack:0kB pagetables:12kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
>>> [39041.689136] lowmem_reserve[]: 0 1865 1865 1865
>>> [39041.689138] Node 0 DMA32 free:7244kB min:19472kB low:21380kB high:23288kB active_anon:315360kB inactive_anon:188144kB active_file:425164kB inactive_file:430120kB unevictable:2208kB writepending:700kB present:2080768kB managed:1674968kB mlocked:2208kB kernel_stack:9632kB pagetables:11588kB bounce:0kB free_pcp:3740kB local_pcp:528kB free_cma:0kB
>>> [39041.689140] lowmem_reserve[]: 0 0 0 0
>>> [39041.689142] Node 0 DMA: 6*4kB (UME) 6*8kB (UE) 7*16kB (UME) 6*32kB (ME) 5*64kB (UME) 3*128kB (UE) 5*256kB (UME) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (M) 0*4096kB = 7480kB
>>> [39041.689148] Node 0 DMA32: 69*4kB (U) 315*8kB (UE) 138*16kB (UE) 70*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7244kB
>>> [39041.689153] 214701 total pagecache pages
>>> [39041.689155] 273 pages in swap cache
>>> [39041.689156] Swap cache stats: add 100978, delete 100706, find 1158/1257
>>> [39041.689156] Free swap  = 3790588kB
>>> [39041.689157] Total swap = 4194300kB
>>> [39041.689157] 524181 pages RAM
>>> [39041.689158] 0 pages HighMem/MovableOnly
>>> [39041.689158] 101471 pages reserved
>>> [39041.689159] 0 pages cma reserved
>>>
>>>
>>>
>>>
>>>
>>>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>>>> index e8a112149..3cca2ffb2 100644
>>>> --- a/drivers/net/ethernet/realtek/r8169.c
>>>> +++ b/drivers/net/ethernet/realtek/r8169.c
>>>> @@ -6192,7 +6192,6 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>  	struct device *d = tp_to_dev(tp);
>>>>  	dma_addr_t mapping;
>>>>  	u32 opts[2], len;
>>>> -	bool stop_queue;
>>>>  	int frags;
>>>>  
>>>>  	if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
>>>> @@ -6234,6 +6233,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>  
>>>>  	txd->opts2 = cpu_to_le32(opts[1]);
>>>>  
>>>> +	netdev_sent_queue(dev, skb->len);
>>>> +
>>>>  	skb_tx_timestamp(skb);
>>>>  
>>>>  	/* Force memory writes to complete before releasing descriptor */
>>>> @@ -6246,14 +6247,14 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>  
>>>>  	tp->cur_tx += frags + 1;
>>>>  
>>>> -	stop_queue = !rtl_tx_slots_avail(tp, MAX_SKB_FRAGS);
>>>> -	if (unlikely(stop_queue))
>>>> -		netif_stop_queue(dev);
>>>> -
>>>> -	if (__netdev_sent_queue(dev, skb->len, skb->xmit_more))
>>>> -		RTL_W8(tp, TxPoll, NPQ);
>>>> +	RTL_W8(tp, TxPoll, NPQ);
>>>>  
>>>> -	if (unlikely(stop_queue)) {
>>>> +	if (!rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) {
>>>> +		/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
>>>> +		 * not miss a ring update when it notices a stopped queue.
>>>> +		 */
>>>> +		smp_wmb();
>>>> +		netif_stop_queue(dev);
>>>>  		/* Sync with rtl_tx:
>>>>  		 * - publish queue status and cur_tx ring index (write barrier)
>>>>  		 * - refresh dirty_tx ring index (read barrier).
>>>>
>>>
>>>
>>
> 
> 


^ permalink raw reply

* [PATCH net 2/2] Revert "r8169: make use of xmit_more and __netdev_sent_queue"
From: Heiner Kallweit @ 2019-02-10 13:50 UTC (permalink / raw)
  To: Realtek linux nic maintainers, David Miller
  Cc: Sander Eikelenboom, netdev@vger.kernel.org
In-Reply-To: <d17d5f20-0f85-f052-66ed-73ea9fe77377@gmail.com>

This reverts commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356.

Sander reported a regression [1], therefore let's revert these commits.
Removal of the barriers doesn't seem to contribute to the issue, the
patch just overlaps with the problematic one and reverting
"r8169: make use of xmit_more and __netdev_sent_queue" only wasn't
tested.

[1] https://marc.info/?t=154965066400001&r=1&w=2

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index bba806ce57d3..6e36b88ca7c9 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -6074,7 +6074,6 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 	struct device *d = tp_to_dev(tp);
 	dma_addr_t mapping;
 	u32 opts[2], len;
-	bool stop_queue;
 	int frags;
 
 	if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
@@ -6116,6 +6115,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 
 	txd->opts2 = cpu_to_le32(opts[1]);
 
+	netdev_sent_queue(dev, skb->len);
+
 	skb_tx_timestamp(skb);
 
 	/* Force memory writes to complete before releasing descriptor */
@@ -6128,16 +6129,16 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 
 	tp->cur_tx += frags + 1;
 
-	stop_queue = !rtl_tx_slots_avail(tp, MAX_SKB_FRAGS);
-	if (unlikely(stop_queue))
-		netif_stop_queue(dev);
+	RTL_W8(tp, TxPoll, NPQ);
 
-	if (__netdev_sent_queue(dev, skb->len, skb->xmit_more)) {
-		RTL_W8(tp, TxPoll, NPQ);
-		mmiowb();
-	}
+	mmiowb();
 
-	if (unlikely(stop_queue)) {
+	if (!rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) {
+		/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
+		 * not miss a ring update when it notices a stopped queue.
+		 */
+		smp_wmb();
+		netif_stop_queue(dev);
 		/* Sync with rtl_tx:
 		 * - publish queue status and cur_tx ring index (write barrier)
 		 * - refresh dirty_tx ring index (read barrier).
-- 
2.20.1



^ permalink raw reply related

* [PATCH net 1/2] Revert "r8169: remove unneeded mmiowb barriers"
From: Heiner Kallweit @ 2019-02-10 13:49 UTC (permalink / raw)
  To: Realtek linux nic maintainers, David Miller
  Cc: Sander Eikelenboom, netdev@vger.kernel.org
In-Reply-To: <d17d5f20-0f85-f052-66ed-73ea9fe77377@gmail.com>

This reverts commit bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3.

There doesn't seem to be anything wrong with this patch,
it's just reverted to get a stable baseline again.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index abb94c543aa2..bba806ce57d3 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1286,11 +1286,13 @@ static u16 rtl_get_events(struct rtl8169_private *tp)
 static void rtl_ack_events(struct rtl8169_private *tp, u16 bits)
 {
 	RTL_W16(tp, IntrStatus, bits);
+	mmiowb();
 }
 
 static void rtl_irq_disable(struct rtl8169_private *tp)
 {
 	RTL_W16(tp, IntrMask, 0);
+	mmiowb();
 }
 
 #define RTL_EVENT_NAPI_RX	(RxOK | RxErr)
@@ -6130,8 +6132,10 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 	if (unlikely(stop_queue))
 		netif_stop_queue(dev);
 
-	if (__netdev_sent_queue(dev, skb->len, skb->xmit_more))
+	if (__netdev_sent_queue(dev, skb->len, skb->xmit_more)) {
 		RTL_W8(tp, TxPoll, NPQ);
+		mmiowb();
+	}
 
 	if (unlikely(stop_queue)) {
 		/* Sync with rtl_tx:
@@ -6483,7 +6487,9 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
 
 	if (work_done < budget) {
 		napi_complete_done(napi, work_done);
+
 		rtl_irq_enable(tp);
+		mmiowb();
 	}
 
 	return work_done;
-- 
2.20.1



^ permalink raw reply related

* Re: Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!
From: Sander Eikelenboom @ 2019-02-10 13:05 UTC (permalink / raw)
  To: Heiner Kallweit, Eric Dumazet, Realtek linux nic maintainers,
	Eric Dumazet
  Cc: Linus Torvalds, linux-kernel, netdev
In-Reply-To: <6307e338-c2b6-ea87-ce56-2eb9606d3bfa@gmail.com>

On 10/02/2019 12:44, Heiner Kallweit wrote:
> On 10.02.2019 10:16, Sander Eikelenboom wrote:
>> On 09/02/2019 12:50, Heiner Kallweit wrote:
>>> On 09.02.2019 11:07, Sander Eikelenboom wrote:
>>>> On 09/02/2019 10:59, Heiner Kallweit wrote:
>>>>> On 09.02.2019 10:34, Sander Eikelenboom wrote:
>>>>>> On 09/02/2019 10:02, Heiner Kallweit wrote:
>>>>>>> On 09.02.2019 00:09, Eric Dumazet wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 02/08/2019 01:50 PM, Heiner Kallweit wrote:
>>>>>>>>> On 08.02.2019 22:45, Sander Eikelenboom wrote:
>>>>>>>>>> On 08/02/2019 22:22, Heiner Kallweit wrote:
>>>>>>>>>>> On 08.02.2019 21:55, Sander Eikelenboom wrote:
>>>>>>>>>>>> On 08/02/2019 19:52, Heiner Kallweit wrote:
>>>>>>>>>>>>> On 08.02.2019 19:29, Sander Eikelenboom wrote:
>>>>>>>>>>>>>> L.S.,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> While testing a linux 5.0-rc5 kernel (with some patches on top but they don't seem related) under Xen i the nasty splat below, 
>>>>>>>>>>>>>> that I haven encountered with Linux 4.20.x.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Unfortunately I haven't got a clear reproducer for this and bisecting could be nasty due to another (networking related) kernel bug.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you need more info, want me to run a debug patch etc., please feel free to ask.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the report. However I see no change in the r8169 driver between
>>>>>>>>>>>>> 4.20 and 5.0 with regard to BQL code. Having said that the root cause could
>>>>>>>>>>>>> be somewhere else. Therefore I'm afraid a bisect will be needed.
>>>>>>>>>>>>
>>>>>>>>>>>> Hmm i did some diging and i think:
>>>>>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>>>>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>>>>>>>> 620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add __netdev_sent_queue as variant of __netdev_tx_sent_queue
>>>>>>>>>>>>
>>>>>>>>>>> You're right. Thought this was added in 4.20 already.
>>>>>>>>>>> The BQL code pattern I copied from the mlx4 driver and so far I haven't heard about
>>>>>>>>>>> this issue from any user of physical hw. And due to the fact that a lot of mainboards
>>>>>>>>>>> have onboard Realtek network I have quite a few testers out there.
>>>>>>>>>>> Does the issue occur under specific circumstances like very high load?
>>>>>>>>>>
>>>>>>>>>> Yep, the box is already quite contented with the Xen VM's and if I remember correctly it occurred while kernel compiling
>>>>>>>>>> on the host.
>>>>>>>>>>
>>>>>>>>>>> If indeed the xmit_more patch causes the issue, I think we have to involve Eric Dumazet
>>>>>>>>>>> as author of the underlying changes.
>>>>>>>>>>
>>>>>>>>>> It could also be the barriers weren't that unneeded as assumed.
>>>>>>>>>
>>>>>>>>> The barriers were removed after adding xmit_more handling. Therefore it would be good to
>>>>>>>>> test also with only 
>>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>>>> removed.
>>>>>>>>>
>>>>>>>>>> Since we are almost at RC6 i took the liberty to CC Eric now.
>>>>>>>>>>
>>>>>>>>> Sure, thanks.
>>>>>>>>>
>>>>>>>>>> BTW am i correct these patches are merely optimizations ?
>>>>>>>>>
>>>>>>>>> Yes
>>>>>>>>>
>>>>>>>>>> If so and concluding they revert cleanly, perhaps it should be considered at this point in the RC's
>>>>>>>>>> to revert them for 5.0 and try again for 5.1 ?
>>>>>>>>>>
>>>>>>>>> Before removing both it would be good to test with only the barrier-removal removed.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>>>> looks buggy to me, since the skb might have been freed already on another cpu when you call
>>>>>>>>
>>>>>>>> You could try :
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>>>>>>>> index 3624e67aef72c92ed6e908e2c99ac2d381210126..f907d484165d9fd775e81bf2bfb9aa4ddedb1c93 100644
>>>>>>>> --- a/drivers/net/ethernet/realtek/r8169.c
>>>>>>>> +++ b/drivers/net/ethernet/realtek/r8169.c
>>>>>>>> @@ -6070,6 +6070,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>>         dma_addr_t mapping;
>>>>>>>>         u32 opts[2], len;
>>>>>>>>         bool stop_queue;
>>>>>>>> +       bool door_bell;
>>>>>>>>         int frags;
>>>>>>>>  
>>>>>>>>         if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
>>>>>>>> @@ -6116,6 +6117,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>>         /* Force memory writes to complete before releasing descriptor */
>>>>>>>>         dma_wmb();
>>>>>>>>  
>>>>>>>> +       door_bell = __netdev_sent_queue(dev, skb->len, skb->xmit_more);
>>>>>>>> +
>>>>>>>>         txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
>>>>>>>>  
>>>>>>>>         /* Force all memory writes to complete before notifying device */
>>>>>>>> @@ -6127,7 +6130,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>>         if (unlikely(stop_queue))
>>>>>>>>                 netif_stop_queue(dev);
>>>>>>>>  
>>>>>>>> -       if (__netdev_sent_queue(dev, skb->len, skb->xmit_more)) {
>>>>>>>> +       if (door_bell) {
>>>>>>>>                 RTL_W8(tp, TxPoll, NPQ);
>>>>>>>>                 mmiowb();
>>>>>>>>         }
>>>>>>>>
>>>>>>> Thanks a lot for checking and for the proposed fix.
>>>>>>> Sander, can you try with this patch on top of 5.0-rc5 w/o removing two two commits?
>>>>>>
>>>>>> I have done that already during the night .. the results:
>>>>>> - I can confirm 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 is the first commit which causes hitting the BUG_ON in lib/dynamic_queue_limits.c.
>>>>>>   (in other word, with only reverting bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 it still blows up).
>>>>>>
>>>>>> - The Eric's patch only applies cleanly with bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 reverted, so that's what I tested.
>>>>>>   The patch seems to prevent hitting the BUG_ON in lib/dynamic_queue_limits.c, it has run this night and I gave done a few kernel compiles
>>>>>>   this morning. How ever during these kernel compiles i'm getting a transmit queue timeout which i haven't seen with 4.20.x, although i regularly
>>>>>>   compile kernels in the same way as I do now. The only thing I can't say if that is due to this change, or if it's again something else.
>>>>>>   Which makes me somewhat inclined to go testing the complete revert some more and see if I can trigger the queue timeout on that or not.
>>>>>>
>>>>>>   If I can, it is a separate issue.
>>>>>>   If I can't it seems even with a patch it still seems as a regression in comparison with 4.20.x, for which
>>>>>>   a revert would be the right thing to do (since as you indicated these are merely optimizations), 
>>>>>>   which would give us more time for 5.1 to try to solve things on top of the 5.0-release-to-be.
>>>>>>   (especially since I seem to still have other issues which need to be sorted out and time is limited)
>>>>>>
>>>>>>   The timeout in question:
>>>>>>         [28336.869479] NETDEV WATCHDOG: eth1 (r8169): transmit queue 0 timed out
>>>>>>         [28336.881498] WARNING: CPU: 0 PID: 6925 at net/sched/sch_generic.c:461 dev_watchdog+0x20b/0x210
>>>>>>         [28336.893358] Modules linked in:
>>>>>>         [28336.904106] CPU: 0 PID: 6925 Comm: cc1 Tainted: G      D           5.0.0-rc5-20190208-thp-net-florian-rtl8169-eric-doflr+ #1
>>>>>>         [28336.917385] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
>>>>>>         [28336.928988] RIP: e030:dev_watchdog+0x20b/0x210
>>>>>>         [28336.940623] Code: 00 49 63 4e e0 eb 90 4c 89 e7 c6 05 ad d8 f1 00 01 e8 a9 32 fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7 50 59 89 82 e8 e5 92 4d ff <0f> 0b eb c0 90 48 c7 47 08 00 00 00 00 48 c7 07 00 00 00 00 0f b7
>>>>>>         [28336.965265] RSP: e02b:ffff88807d403ea0 EFLAGS: 00010286
>>>>>>         [28336.977465] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff82a69db8
>>>>>>         [28336.991265] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000200
>>>>>>         [28337.008865] RBP: ffff88807936e41c R08: 0000000000000000 R09: 0000000000000819
>>>>>>         [28337.022250] R10: 0000000000000202 R11: ffffffff8247ca80 R12: ffff88807936e000
>>>>>>         [28337.035204] R13: 0000000000000000 R14: ffff88807936e440 R15: 0000000000000001
>>>>>>         [28337.049832] FS:  00007f53e9bf3840(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
>>>>>>         [28337.062524] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>         [28337.075086] CR2: 00007f53e60c4000 CR3: 000000001a0be000 CR4: 0000000000000660
>>>>>>         [28337.090052] Call Trace:
>>>>>>         [28337.103615]  <IRQ>
>>>>>>         [28337.116587]  ? qdisc_destroy+0x120/0x120
>>>>>>         [28337.128905]  call_timer_fn+0x19/0x90
>>>>>>         [28337.141892]  expire_timers+0x8b/0xa0
>>>>>>         [28337.153354]  run_timer_softirq+0x7e/0x160
>>>>>>         [28337.165931]  ? handle_irq_event_percpu+0x4c/0x70
>>>>>>         [28337.176548]  ? handle_percpu_irq+0x32/0x50
>>>>>>         [28337.186734]  __do_softirq+0xed/0x229
>>>>>>         [28337.196404]  ? hypervisor_callback+0xa/0x20
>>>>>>         [28337.207822]  irq_exit+0xb7/0xc0
>>>>>>         [28337.218978]  xen_evtchn_do_upcall+0x27/0x40
>>>>>>         [28337.230763]  xen_do_hypervisor_callback+0x29/0x40
>>>>>>         [28337.241261]  </IRQ>
>>>>>>         [28337.253283] RIP: e033:0xff7e62
>>>>>>         [28337.264899] Code: 35 43 0f c7 00 4c 89 ef e8 8b 6d 67 ff 0f 1f 00 44 89 e0 44 89 e2 c1 e8 06 83 e2 3f 48 8b 0c c5 40 8d c6 01 48 0f a3 d1 72 0e <48> 8b 04 c5 50 8d c6 01 48 0f a3 d0 73 0b 44 89 e6 4c 89 ef e8 b5
>>>>>>         [28337.288677] RSP: e02b:00007fff0fc6a340 EFLAGS: 00000202
>>>>>>         [28337.299234] RAX: 0000000000000000 RBX: 00007f53e60c3580 RCX: 0000000000000000
>>>>>>         [28337.309577] RDX: 0000000000000034 RSI: 0000000001e71a98 RDI: 00007fff0fc6a538
>>>>>>         [28337.320724] RBP: 00007fff0fc6a4b0 R08: 0000000000000000 R09: 0000000000000000
>>>>>>         [28337.331829] R10: 0000000000000001 R11: 00000000020cb3d0 R12: 0000000000000034
>>>>>>         [28337.343900] R13: 00007fff0fc6a538 R14: 0000000000000000 R15: 0000000000000001
>>>>>>         [28337.353977] ---[ end trace 6ff49f09286816b7 ]---
>>>>>>
>>>>> Thanks for your efforts. As usual this tx timeout trace says basically nothing except
>>>>> "timeout" and root cause could be anything. Earlier you reported a memory allocation error,
>>>>> did that occur again?
>>>>> If we decide to revert, I'd leave removal of the memory barriers in (as it doesn't seem to
>>>>> contribute to the issue) and just submit a patch to effectively revert
>>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356.
>>>>
>>>> I can't say if that is correct, because i haven't tested that.
>>>>
>>>> Another thing I could test is:
>>>>  - putting all the r8169 patches (and prerequisites) that went into 5.0 
>>>>    up to bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3, onto 4.20.7 and see what that does.
>>>>    If that would be feasible (not too many needed prerequisites out of r8169) and if 
>>>>    you could spare me some time and prep such a branch somewhere so i can pull and compile that,
>>>>    that would be great.
>>>>
>>>
>>> Unfortunately there's quite a number of changes. Regarding __netdev_tx_sent_queue()
>>> and watchdog timeout I found the following comment in drivers/net/ethernet/sfc/tx.c,
>>> efx_enqueue_skb():
>>>
>>> 	if (__netdev_tx_sent_queue(tx_queue->core_txq, skb_len, xmit_more)) {
>>> 		struct efx_tx_queue *txq2 = efx_tx_queue_partner(tx_queue);
>>>
>>> 		/* There could be packets left on the partner queue if those
>>> 		 * SKBs had skb->xmit_more set. If we do not push those they
>>> 		 * could be left for a long time and cause a netdev watchdog.
>>> 		 */
>>> 		if (txq2->xmit_more_available)
>>> 			efx_nic_push_buffers(txq2);
>>>
>>> But I'm not sure whether the situation in r8169 is comparable. The following patch
>>> implements what I mentioned earlier: It leaves all other 5.0 changes in place and
>>> effectively reverts 2e6eedb4813e34d8d84ac0eb3afb668966f3f356. Would be great if
>>> you could give it a try.
>>
>> Hi Heiner,
>>
>> It took some time to respond, because I had another issue with 5.0 which intervened with proper testing, 
>> but fortunately I could pinpoint without doing a full bisect and revert that commit for further testing.
>>
>> So there is still time left and I could do a more proper run with your patch below.
>> Unfortunately i still get a splat (see below) with this, although i'm not sure it is related, 
>> just that I can't tell.
>>
> I checked further and there's a handful of network drivers using __napi_alloc_skb() with __GFP_NOWARN,
> maybe to avoid such splats. Did the splat impact functionality? When checking the code in r8169 the
> affected packet would just be dropped.

It doesn't permanently or noticeably impact functionality, and indeed seems to drop packets:

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.1.1  netmask 255.255.0.0  broadcast 172.16.255.255
        ether 40:61:86:f4:67:d8  txqueuelen 1000  (Ethernet)
        RX packets 11563913  bytes 16724445852 (15.5 GiB)
        RX errors 0  dropped 6  overruns 0  frame 0
        TX packets 4301515  bytes 1210966808 (1.1 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Reverting 5317d5c6d47e ("r8169: use napi_consume_skb where possible") doesn't suffice still gives the page allocation failure.

I think at this point in time we should at least get the
reverts into 5.0 (probably to late for rc-6 since DaveM's pull request is already in) for:
    bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
    2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue

Commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 is the one which caused the BUG_ON() to be hit.

While we could use your patch to revert only 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 and leave
bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 in:
    - I haven't been able to test it properly.
    - It's dealing with barriers, which can be tedious and give subtle breakage.
    - I don't see any compelling argument to keep it in (it's no fix).
    - It's RC-6 time ...

So we then we can focus on the page allocation issue and hopefully find some stable
baseline before 5.0-final is cut. While I appreciate having a forward looking approach,
I think we are at the point in time, were we should revert when in doubt 
(and it doesn't clearly fix an other issue).

After establishing a stable baseline again, we can start incrementally re-applying and test stuff.

--
Sander



>> Perhaps Linus as Oops-decoding-guru has an idea ?
>>
>> --
>> Sander
>>
>> [39041.689007] dpkg-deb: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
>> [39041.689016] CPU: 4 PID: 14078 Comm: dpkg-deb Not tainted 5.0.0-rc5-20190209-kallweit+ #1
>> [39041.689017] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
>> [39041.689018] Call Trace:
>> [39041.689022]  <IRQ>
>> [39041.689030]  dump_stack+0x5c/0x7b
>> [39041.689033]  warn_alloc+0x103/0x190
>> [39041.689036]  __alloc_pages_nodemask+0xe3d/0xe80
>> [39041.689039]  ? ip_rcv+0x48/0xc0
>> [39041.689040]  ? ip_rcv_finish_core.isra.0+0x360/0x360
>> [39041.689042]  page_frag_alloc+0x117/0x150
>> [39041.689044]  __napi_alloc_skb+0x83/0xd0
>> [39041.689048]  rtl8169_poll+0x210/0x640
>> [39041.689051]  net_rx_action+0x23d/0x370
>> [39041.689054]  __do_softirq+0xed/0x229
>> [39041.689058]  irq_exit+0xb7/0xc0
>> [39041.689061]  xen_evtchn_do_upcall+0x27/0x40
>> [39041.689063]  xen_do_hypervisor_callback+0x29/0x40
>> [39041.689064]  </IRQ>
>> [39041.689066] RIP: e030:_atomic_dec_and_lock+0x2/0x40
>> [39041.689068] Code: ff 39 05 c5 c1 c9 00 89 c7 89 c6 76 0f 83 eb 01 83 fb ff 75 d9 5b 89 f8 5d 41 5c c3 0f 0b 90 90 90 90 90 90 90 90 90 90 8b 07 <83> f8 01 74 0c 8d 50 ff f0 0f b1 17 75 f2 31 c0 c3 55 53 48 89 fb
>> [39041.689069] RSP: e02b:ffffc9000705b990 EFLAGS: 00000246
>> [39041.689071] RAX: 0000000000000001 RBX: ffff888017082640 RCX: 0000000000000000
>> [39041.689071] RDX: 0000000000000000 RSI: ffff8880170826c0 RDI: ffff888017082788
>> [39041.689072] RBP: ffff8880170826c0 R08: ffffc9000705bb00 R09: ffffc9000705bb00
>> [39041.689073] R10: ffffc9000705bb58 R11: ffff88807fc17000 R12: ffff888017082788
>> [39041.689073] R13: ffff88806cc8cf58 R14: ffff888017082640 R15: ffff888009990240
>> [39041.689077]  iput+0x63/0x1a0
>> [39041.689079]  __dentry_kill+0xc5/0x170
>> [39041.689080]  shrink_dentry_list+0x93/0x1c0
>> [39041.689082]  prune_dcache_sb+0x4d/0x70
>> [39041.689084]  super_cache_scan+0x104/0x190
>> [39041.689087]  do_shrink_slab+0x12c/0x1e0
>> [39041.689089]  shrink_slab+0xdf/0x2b0
>> [39041.689091]  shrink_node+0x158/0x470
>> [39041.689093]  do_try_to_free_pages+0xd1/0x380
>> [39041.689095]  try_to_free_pages+0xb2/0xe0
>> [39041.689097]  __alloc_pages_nodemask+0x603/0xe80
>> [39041.689099]  ? __pagevec_lru_add_fn+0x1b1/0x290
>> [39041.689102]  alloc_pages_vma+0x7b/0x1c0
>> [39041.689106]  __handle_mm_fault+0xdb3/0x1060
>> [39041.689109]  ? xen_mc_flush+0xc0/0x190
>> [39041.689110]  handle_mm_fault+0xf8/0x200
>> [39041.689113]  __do_page_fault+0x231/0x4a0
>> [39041.689115]  ? page_fault+0x8/0x30
>> [39041.689116]  page_fault+0x1e/0x30
>> [39041.689118] RIP: e033:0x7fb9851d012e
>> [39041.689119] Code: 29 c2 48 3b 15 7b a3 31 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 41 0f 29 01 <41> 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9 40 48 83 fa
>> [39041.689119] RSP: e02b:00007fb958b36d38 EFLAGS: 00010202
>> [39041.689120] RAX: 00007fb97a617f0e RBX: 000000000000f004 RCX: 00007fb948008be3
>> [39041.689121] RDX: 00000000000080c2 RSI: 00007fb948000b31 RDI: 00007fb97a617f0e
>> [39041.689122] RBP: 00000000000ff062 R08: 0000000000000002 R09: 00007fb97a620000
>> [39041.689123] R10: 0000000000000004 R11: 00007fb97a626f02 R12: 000000000000f005
>> [39041.689123] R13: 00007fb948000b28 R14: 0000562d76b63710 R15: 0000000000000003
>> [39041.689125] Mem-Info:
>> [39041.689130] active_anon:78775 inactive_anon:49211 isolated_anon:0
>>                 active_file:106409 inactive_file:107531 isolated_file:0
>>                 unevictable:552 dirty:175 writeback:0 unstable:0
>>                 slab_reclaimable:13739 slab_unreclaimable:16454
>>                 mapped:1605 shmem:23 pagetables:2900 bounce:0
>>                 free:3681 free_pcp:935 free_cma:0
>> [39041.689132] Node 0 active_anon:315100kB inactive_anon:196844kB active_file:425636kB inactive_file:430124kB unevictable:2208kB isolated(anon):0kB isolated(file):0kB mapped:6420kB dirty:700kB writeback:0kB shmem:92kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
>> [39041.689133] Node 0 DMA free:7480kB min:44kB low:56kB high:68kB active_anon:0kB inactive_anon:7832kB active_file:472kB inactive_file:4kB unevictable:0kB writepending:0kB present:15956kB managed:15872kB mlocked:0kB kernel_stack:0kB pagetables:12kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
>> [39041.689136] lowmem_reserve[]: 0 1865 1865 1865
>> [39041.689138] Node 0 DMA32 free:7244kB min:19472kB low:21380kB high:23288kB active_anon:315360kB inactive_anon:188144kB active_file:425164kB inactive_file:430120kB unevictable:2208kB writepending:700kB present:2080768kB managed:1674968kB mlocked:2208kB kernel_stack:9632kB pagetables:11588kB bounce:0kB free_pcp:3740kB local_pcp:528kB free_cma:0kB
>> [39041.689140] lowmem_reserve[]: 0 0 0 0
>> [39041.689142] Node 0 DMA: 6*4kB (UME) 6*8kB (UE) 7*16kB (UME) 6*32kB (ME) 5*64kB (UME) 3*128kB (UE) 5*256kB (UME) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (M) 0*4096kB = 7480kB
>> [39041.689148] Node 0 DMA32: 69*4kB (U) 315*8kB (UE) 138*16kB (UE) 70*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7244kB
>> [39041.689153] 214701 total pagecache pages
>> [39041.689155] 273 pages in swap cache
>> [39041.689156] Swap cache stats: add 100978, delete 100706, find 1158/1257
>> [39041.689156] Free swap  = 3790588kB
>> [39041.689157] Total swap = 4194300kB
>> [39041.689157] 524181 pages RAM
>> [39041.689158] 0 pages HighMem/MovableOnly
>> [39041.689158] 101471 pages reserved
>> [39041.689159] 0 pages cma reserved
>>
>>
>>
>>
>>
>>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>>> index e8a112149..3cca2ffb2 100644
>>> --- a/drivers/net/ethernet/realtek/r8169.c
>>> +++ b/drivers/net/ethernet/realtek/r8169.c
>>> @@ -6192,7 +6192,6 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>  	struct device *d = tp_to_dev(tp);
>>>  	dma_addr_t mapping;
>>>  	u32 opts[2], len;
>>> -	bool stop_queue;
>>>  	int frags;
>>>  
>>>  	if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
>>> @@ -6234,6 +6233,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>  
>>>  	txd->opts2 = cpu_to_le32(opts[1]);
>>>  
>>> +	netdev_sent_queue(dev, skb->len);
>>> +
>>>  	skb_tx_timestamp(skb);
>>>  
>>>  	/* Force memory writes to complete before releasing descriptor */
>>> @@ -6246,14 +6247,14 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>  
>>>  	tp->cur_tx += frags + 1;
>>>  
>>> -	stop_queue = !rtl_tx_slots_avail(tp, MAX_SKB_FRAGS);
>>> -	if (unlikely(stop_queue))
>>> -		netif_stop_queue(dev);
>>> -
>>> -	if (__netdev_sent_queue(dev, skb->len, skb->xmit_more))
>>> -		RTL_W8(tp, TxPoll, NPQ);
>>> +	RTL_W8(tp, TxPoll, NPQ);
>>>  
>>> -	if (unlikely(stop_queue)) {
>>> +	if (!rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) {
>>> +		/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
>>> +		 * not miss a ring update when it notices a stopped queue.
>>> +		 */
>>> +		smp_wmb();
>>> +		netif_stop_queue(dev);
>>>  		/* Sync with rtl_tx:
>>>  		 * - publish queue status and cur_tx ring index (write barrier)
>>>  		 * - refresh dirty_tx ring index (read barrier).
>>>
>>
>>
> 


^ permalink raw reply

* Re: [PATCH net] sctp: make sctp_setsockopt_events() less strict about the option length
From: Marcelo Ricardo Leitner @ 2019-02-10 12:46 UTC (permalink / raw)
  To: David Miller
  Cc: julien, netdev, linux-sctp, linux-kernel, nhorman, vyasevich,
	lucien.xin
In-Reply-To: <20190209.151217.175627323493244750.davem@davemloft.net>

On Sat, Feb 09, 2019 at 03:12:17PM -0800, David Miller wrote:
> From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> Date: Wed, 6 Feb 2019 18:37:54 -0200
> 
> > On Wed, Feb 06, 2019 at 12:14:30PM -0800, Julien Gomes wrote:
> >> Make sctp_setsockopt_events() able to accept sctp_event_subscribe
> >> structures longer than the current definitions.
> >> 
> >> This should prevent unjustified setsockopt() failures due to struct
> >> sctp_event_subscribe extensions (as in 4.11 and 4.12) when using
> >> binaries that should be compatible, but were built with later kernel
> >> uapi headers.
> > 
> > Not sure if we support backwards compatibility like this?
> 
> What a complete mess we have here.
> 
> Use new socket option numbers next time, do not change the size and/or
> layout of existing socket options.

What about reusing the same socket option, but defining a new struct?
Say, MYSOCKOPT supports struct mysockopt, struct mysockopt2, struct
mysockopt3...

That way we have a clear definition of the user's intent.

> 
> This whole thread, if you read it, is basically "if we compatability
> this way, that breaks, and if we do compatability this other way oh
> shit this other thing doesn't work."
> 
> I think we really need to specifically check for the difference sizes
> that existed one by one, clear out the part not given by the user, and
> backport this as far back as possible in a way that in the older kernels
> we see if the user is actually trying to use the new features and if so
> error out.

I'm afraid clearing out may not be enough, though seems it's the best
we can do so far. If the struct is allocated but not fully initialized
via a memset, but by setting its fields one by one, the remaining new
fields will be left uninitinialized.

> 
> Which, btw, is terrible behavior.  Newly compiled apps should work on
> older kernels if they don't try to use the new features, and if they

One use case here is: a given distro is using kernel X and app Foo is
built against it. Then upgrades to X+1, Foo is patched to fix an issue
and is rebuilt against X+1. The user upgrades Foo package but for
whatever reason, doesn't upgrade kernel or reboot the system. Here,
Foo doesn't work anymore until the new kernel is also running.

> can the ones that want to try to use the new features should be able
> to fall back when that feature isn't available in a non-ambiguous
> and precisely defined way.
> 
> The fact that the use of the new feature is hidden in the new
> structure elements is really rotten.
> 
> This patch, at best, needs some work and definitely a longer and more
> detailed commit message.

^ permalink raw reply

* [PATCH v1 net-next 2/2] net: Change TCA_ACT_* to TCA_ID_* to match that of TCA_ID_POLICE
From: Eli Cohen @ 2019-02-10 12:25 UTC (permalink / raw)
  To: davem, jhs, xiyou.wangcong, jiri; +Cc: netdev, simon.horman, Eli Cohen
In-Reply-To: <20190210122500.13614-1-eli@mellanox.com>

Modify the kernel users of the TCA_ACT_* macros to use TCA_ID_*. For
example, use TCA_ID_GACT instead of TCA_ACT_GACT. This will align with
TCA_ID_POLICE and also differentiates these identifier, used in struct
tc_action_ops type field, from other macros starting with TCA_ACT_.

To make things clearer, we name the enum defining the TCA_ID_*
identifiers and also change the "type" field of struct tc_action to
id.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/act_api.h              | 2 +-
 include/net/tc_act/tc_csum.h       | 2 +-
 include/net/tc_act/tc_gact.h       | 2 +-
 include/net/tc_act/tc_mirred.h     | 4 ++--
 include/net/tc_act/tc_pedit.h      | 2 +-
 include/net/tc_act/tc_sample.h     | 2 +-
 include/net/tc_act/tc_skbedit.h    | 2 +-
 include/net/tc_act/tc_tunnel_key.h | 4 ++--
 include/net/tc_act/tc_vlan.h       | 2 +-
 include/uapi/linux/pkt_cls.h       | 2 +-
 net/sched/act_api.c                | 2 +-
 net/sched/act_bpf.c                | 2 +-
 net/sched/act_connmark.c           | 2 +-
 net/sched/act_csum.c               | 2 +-
 net/sched/act_gact.c               | 2 +-
 net/sched/act_ife.c                | 2 +-
 net/sched/act_ipt.c                | 4 ++--
 net/sched/act_mirred.c             | 2 +-
 net/sched/act_nat.c                | 2 +-
 net/sched/act_pedit.c              | 2 +-
 net/sched/act_police.c             | 2 +-
 net/sched/act_sample.c             | 2 +-
 net/sched/act_simple.c             | 2 +-
 net/sched/act_skbedit.c            | 2 +-
 net/sched/act_skbmod.c             | 2 +-
 net/sched/act_tunnel_key.c         | 2 +-
 net/sched/act_vlan.c               | 2 +-
 27 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index dbc795ec659e..c745e9ccfab2 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -80,7 +80,7 @@ static inline void tcf_tm_dump(struct tcf_t *dtm, const struct tcf_t *stm)
 struct tc_action_ops {
 	struct list_head head;
 	char    kind[IFNAMSIZ];
-	__u32   type; /* TBD to match kind */
+	enum tca_id  id; /* identifier should match kind */
 	size_t	size;
 	struct module		*owner;
 	int     (*act)(struct sk_buff *, const struct tc_action *,
diff --git a/include/net/tc_act/tc_csum.h b/include/net/tc_act/tc_csum.h
index 32d2454c0479..68269e4581b7 100644
--- a/include/net/tc_act/tc_csum.h
+++ b/include/net/tc_act/tc_csum.h
@@ -21,7 +21,7 @@ struct tcf_csum {
 static inline bool is_tcf_csum(const struct tc_action *a)
 {
 #ifdef CONFIG_NET_CLS_ACT
-	if (a->ops && a->ops->type == TCA_ACT_CSUM)
+	if (a->ops && a->ops->id == TCA_ID_CSUM)
 		return true;
 #endif
 	return false;
diff --git a/include/net/tc_act/tc_gact.h b/include/net/tc_act/tc_gact.h
index ef8dd0db70ce..ee8d005f56fc 100644
--- a/include/net/tc_act/tc_gact.h
+++ b/include/net/tc_act/tc_gact.h
@@ -22,7 +22,7 @@ static inline bool __is_tcf_gact_act(const struct tc_action *a, int act,
 #ifdef CONFIG_NET_CLS_ACT
 	struct tcf_gact *gact;
 
-	if (a->ops && a->ops->type != TCA_ACT_GACT)
+	if (a->ops && a->ops->id != TCA_ID_GACT)
 		return false;
 
 	gact = to_gact(a);
diff --git a/include/net/tc_act/tc_mirred.h b/include/net/tc_act/tc_mirred.h
index a2e9cbca5c9e..c757585a05b0 100644
--- a/include/net/tc_act/tc_mirred.h
+++ b/include/net/tc_act/tc_mirred.h
@@ -17,7 +17,7 @@ struct tcf_mirred {
 static inline bool is_tcf_mirred_egress_redirect(const struct tc_action *a)
 {
 #ifdef CONFIG_NET_CLS_ACT
-	if (a->ops && a->ops->type == TCA_ACT_MIRRED)
+	if (a->ops && a->ops->id == TCA_ID_MIRRED)
 		return to_mirred(a)->tcfm_eaction == TCA_EGRESS_REDIR;
 #endif
 	return false;
@@ -26,7 +26,7 @@ static inline bool is_tcf_mirred_egress_redirect(const struct tc_action *a)
 static inline bool is_tcf_mirred_egress_mirror(const struct tc_action *a)
 {
 #ifdef CONFIG_NET_CLS_ACT
-	if (a->ops && a->ops->type == TCA_ACT_MIRRED)
+	if (a->ops && a->ops->id == TCA_ID_MIRRED)
 		return to_mirred(a)->tcfm_eaction == TCA_EGRESS_MIRROR;
 #endif
 	return false;
diff --git a/include/net/tc_act/tc_pedit.h b/include/net/tc_act/tc_pedit.h
index fac3ad4a86de..748cf87a4d7e 100644
--- a/include/net/tc_act/tc_pedit.h
+++ b/include/net/tc_act/tc_pedit.h
@@ -23,7 +23,7 @@ struct tcf_pedit {
 static inline bool is_tcf_pedit(const struct tc_action *a)
 {
 #ifdef CONFIG_NET_CLS_ACT
-	if (a->ops && a->ops->type == TCA_ACT_PEDIT)
+	if (a->ops && a->ops->id == TCA_ID_PEDIT)
 		return true;
 #endif
 	return false;
diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
index 01dbfea32672..0a559d4b6f0f 100644
--- a/include/net/tc_act/tc_sample.h
+++ b/include/net/tc_act/tc_sample.h
@@ -20,7 +20,7 @@ struct tcf_sample {
 static inline bool is_tcf_sample(const struct tc_action *a)
 {
 #ifdef CONFIG_NET_CLS_ACT
-	return a->ops && a->ops->type == TCA_ACT_SAMPLE;
+	return a->ops && a->ops->id == TCA_ID_SAMPLE;
 #else
 	return false;
 #endif
diff --git a/include/net/tc_act/tc_skbedit.h b/include/net/tc_act/tc_skbedit.h
index 911bbac838a2..85c5c4756d92 100644
--- a/include/net/tc_act/tc_skbedit.h
+++ b/include/net/tc_act/tc_skbedit.h
@@ -44,7 +44,7 @@ static inline bool is_tcf_skbedit_mark(const struct tc_action *a)
 #ifdef CONFIG_NET_CLS_ACT
 	u32 flags;
 
-	if (a->ops && a->ops->type == TCA_ACT_SKBEDIT) {
+	if (a->ops && a->ops->id == TCA_ID_SKBEDIT) {
 		rcu_read_lock();
 		flags = rcu_dereference(to_skbedit(a)->params)->flags;
 		rcu_read_unlock();
diff --git a/include/net/tc_act/tc_tunnel_key.h b/include/net/tc_act/tc_tunnel_key.h
index 46b8c7f1c8d5..23d5b8b19f3e 100644
--- a/include/net/tc_act/tc_tunnel_key.h
+++ b/include/net/tc_act/tc_tunnel_key.h
@@ -34,7 +34,7 @@ static inline bool is_tcf_tunnel_set(const struct tc_action *a)
 	struct tcf_tunnel_key *t = to_tunnel_key(a);
 	struct tcf_tunnel_key_params *params = rtnl_dereference(t->params);
 
-	if (a->ops && a->ops->type == TCA_ACT_TUNNEL_KEY)
+	if (a->ops && a->ops->id == TCA_ID_TUNNEL_KEY)
 		return params->tcft_action == TCA_TUNNEL_KEY_ACT_SET;
 #endif
 	return false;
@@ -46,7 +46,7 @@ static inline bool is_tcf_tunnel_release(const struct tc_action *a)
 	struct tcf_tunnel_key *t = to_tunnel_key(a);
 	struct tcf_tunnel_key_params *params = rtnl_dereference(t->params);
 
-	if (a->ops && a->ops->type == TCA_ACT_TUNNEL_KEY)
+	if (a->ops && a->ops->id == TCA_ID_TUNNEL_KEY)
 		return params->tcft_action == TCA_TUNNEL_KEY_ACT_RELEASE;
 #endif
 	return false;
diff --git a/include/net/tc_act/tc_vlan.h b/include/net/tc_act/tc_vlan.h
index 22ae260d6869..fe39ed502bef 100644
--- a/include/net/tc_act/tc_vlan.h
+++ b/include/net/tc_act/tc_vlan.h
@@ -30,7 +30,7 @@ struct tcf_vlan {
 static inline bool is_tcf_vlan(const struct tc_action *a)
 {
 #ifdef CONFIG_NET_CLS_ACT
-	if (a->ops && a->ops->type == TCA_ACT_VLAN)
+	if (a->ops && a->ops->id == TCA_ID_VLAN)
 		return true;
 #endif
 	return false;
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 7ab55f97e7c4..51a0496f78ea 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -85,7 +85,7 @@ enum {
 #define TCA_ACT_SAMPLE 26
 
 /* Action type identifiers*/
-enum {
+enum tca_id {
 	TCA_ID_UNSPEC = 0,
 	TCA_ID_POLICE = 1,
 	TCA_ID_GACT = TCA_ACT_GACT,
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index d4b8355737d8..aecf1bf233c8 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -543,7 +543,7 @@ int tcf_register_action(struct tc_action_ops *act,
 
 	write_lock(&act_mod_lock);
 	list_for_each_entry(a, &act_base, head) {
-		if (act->type == a->type || (strcmp(act->kind, a->kind) == 0)) {
+		if (act->id == a->id || (strcmp(act->kind, a->kind) == 0)) {
 			write_unlock(&act_mod_lock);
 			unregister_pernet_subsys(ops);
 			return -EEXIST;
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index c7633843e223..aa5c38d11a30 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -396,7 +396,7 @@ static int tcf_bpf_search(struct net *net, struct tc_action **a, u32 index)
 
 static struct tc_action_ops act_bpf_ops __read_mostly = {
 	.kind		=	"bpf",
-	.type		=	TCA_ACT_BPF,
+	.id		=	TCA_ID_BPF,
 	.owner		=	THIS_MODULE,
 	.act		=	tcf_bpf_act,
 	.dump		=	tcf_bpf_dump,
diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c
index 8475913f2070..5d24993cccfe 100644
--- a/net/sched/act_connmark.c
+++ b/net/sched/act_connmark.c
@@ -204,7 +204,7 @@ static int tcf_connmark_search(struct net *net, struct tc_action **a, u32 index)
 
 static struct tc_action_ops act_connmark_ops = {
 	.kind		=	"connmark",
-	.type		=	TCA_ACT_CONNMARK,
+	.id		=	TCA_ID_CONNMARK,
 	.owner		=	THIS_MODULE,
 	.act		=	tcf_connmark_act,
 	.dump		=	tcf_connmark_dump,
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index 3dc25b7806d7..945fb34ae721 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -660,7 +660,7 @@ static size_t tcf_csum_get_fill_size(const struct tc_action *act)
 
 static struct tc_action_ops act_csum_ops = {
 	.kind		= "csum",
-	.type		= TCA_ACT_CSUM,
+	.id		= TCA_ID_CSUM,
 	.owner		= THIS_MODULE,
 	.act		= tcf_csum_act,
 	.dump		= tcf_csum_dump,
diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index b61c20ebb314..93da0004e9f4 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -253,7 +253,7 @@ static size_t tcf_gact_get_fill_size(const struct tc_action *act)
 
 static struct tc_action_ops act_gact_ops = {
 	.kind		=	"gact",
-	.type		=	TCA_ACT_GACT,
+	.id		=	TCA_ID_GACT,
 	.owner		=	THIS_MODULE,
 	.act		=	tcf_gact_act,
 	.stats_update	=	tcf_gact_stats_update,
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 30b63fa23ee2..9b1f2b3990ee 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -864,7 +864,7 @@ static int tcf_ife_search(struct net *net, struct tc_action **a, u32 index)
 
 static struct tc_action_ops act_ife_ops = {
 	.kind = "ife",
-	.type = TCA_ACT_IFE,
+	.id = TCA_ID_IFE,
 	.owner = THIS_MODULE,
 	.act = tcf_ife_act,
 	.dump = tcf_ife_dump,
diff --git a/net/sched/act_ipt.c b/net/sched/act_ipt.c
index 8af6c11d2482..1bad190710ad 100644
--- a/net/sched/act_ipt.c
+++ b/net/sched/act_ipt.c
@@ -338,7 +338,7 @@ static int tcf_ipt_search(struct net *net, struct tc_action **a, u32 index)
 
 static struct tc_action_ops act_ipt_ops = {
 	.kind		=	"ipt",
-	.type		=	TCA_ACT_IPT,
+	.id		=	TCA_ID_IPT,
 	.owner		=	THIS_MODULE,
 	.act		=	tcf_ipt_act,
 	.dump		=	tcf_ipt_dump,
@@ -387,7 +387,7 @@ static int tcf_xt_search(struct net *net, struct tc_action **a, u32 index)
 
 static struct tc_action_ops act_xt_ops = {
 	.kind		=	"xt",
-	.type		=	TCA_ACT_XT,
+	.id		=	TCA_ID_XT,
 	.owner		=	THIS_MODULE,
 	.act		=	tcf_ipt_act,
 	.dump		=	tcf_ipt_dump,
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index c8cf4d10c435..6692fd054617 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -400,7 +400,7 @@ static void tcf_mirred_put_dev(struct net_device *dev)
 
 static struct tc_action_ops act_mirred_ops = {
 	.kind		=	"mirred",
-	.type		=	TCA_ACT_MIRRED,
+	.id		=	TCA_ID_MIRRED,
 	.owner		=	THIS_MODULE,
 	.act		=	tcf_mirred_act,
 	.stats_update	=	tcf_stats_update,
diff --git a/net/sched/act_nat.c b/net/sched/act_nat.c
index c5c1e23add77..543eab9193f1 100644
--- a/net/sched/act_nat.c
+++ b/net/sched/act_nat.c
@@ -304,7 +304,7 @@ static int tcf_nat_search(struct net *net, struct tc_action **a, u32 index)
 
 static struct tc_action_ops act_nat_ops = {
 	.kind		=	"nat",
-	.type		=	TCA_ACT_NAT,
+	.id		=	TCA_ID_NAT,
 	.owner		=	THIS_MODULE,
 	.act		=	tcf_nat_act,
 	.dump		=	tcf_nat_dump,
diff --git a/net/sched/act_pedit.c b/net/sched/act_pedit.c
index 3663d3b615a4..a80373878df7 100644
--- a/net/sched/act_pedit.c
+++ b/net/sched/act_pedit.c
@@ -470,7 +470,7 @@ static int tcf_pedit_search(struct net *net, struct tc_action **a, u32 index)
 
 static struct tc_action_ops act_pedit_ops = {
 	.kind		=	"pedit",
-	.type		=	TCA_ACT_PEDIT,
+	.id		=	TCA_ID_PEDIT,
 	.owner		=	THIS_MODULE,
 	.act		=	tcf_pedit_act,
 	.dump		=	tcf_pedit_dump,
diff --git a/net/sched/act_police.c b/net/sched/act_police.c
index ec8ec55e0fe8..8271a6263824 100644
--- a/net/sched/act_police.c
+++ b/net/sched/act_police.c
@@ -366,7 +366,7 @@ MODULE_LICENSE("GPL");
 
 static struct tc_action_ops act_police_ops = {
 	.kind		=	"police",
-	.type		=	TCA_ID_POLICE,
+	.id		=	TCA_ID_POLICE,
 	.owner		=	THIS_MODULE,
 	.act		=	tcf_police_act,
 	.dump		=	tcf_police_dump,
diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index 1a0c682fd734..203e399e5c85 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -233,7 +233,7 @@ static int tcf_sample_search(struct net *net, struct tc_action **a, u32 index)
 
 static struct tc_action_ops act_sample_ops = {
 	.kind	  = "sample",
-	.type	  = TCA_ACT_SAMPLE,
+	.id	  = TCA_ID_SAMPLE,
 	.owner	  = THIS_MODULE,
 	.act	  = tcf_sample_act,
 	.dump	  = tcf_sample_dump,
diff --git a/net/sched/act_simple.c b/net/sched/act_simple.c
index b2b16d440154..d54cb608dbaf 100644
--- a/net/sched/act_simple.c
+++ b/net/sched/act_simple.c
@@ -195,7 +195,7 @@ static int tcf_simp_search(struct net *net, struct tc_action **a, u32 index)
 
 static struct tc_action_ops act_simp_ops = {
 	.kind		=	"simple",
-	.type		=	TCA_ACT_SIMP,
+	.id		=	TCA_ID_SIMP,
 	.owner		=	THIS_MODULE,
 	.act		=	tcf_simp_act,
 	.dump		=	tcf_simp_dump,
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index 64dba3708fce..39f8a67ea940 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -305,7 +305,7 @@ static int tcf_skbedit_search(struct net *net, struct tc_action **a, u32 index)
 
 static struct tc_action_ops act_skbedit_ops = {
 	.kind		=	"skbedit",
-	.type		=	TCA_ACT_SKBEDIT,
+	.id		=	TCA_ID_SKBEDIT,
 	.owner		=	THIS_MODULE,
 	.act		=	tcf_skbedit_act,
 	.dump		=	tcf_skbedit_dump,
diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
index 59710a183bd3..7bac1d78e7a3 100644
--- a/net/sched/act_skbmod.c
+++ b/net/sched/act_skbmod.c
@@ -260,7 +260,7 @@ static int tcf_skbmod_search(struct net *net, struct tc_action **a, u32 index)
 
 static struct tc_action_ops act_skbmod_ops = {
 	.kind		=	"skbmod",
-	.type		=	TCA_ACT_SKBMOD,
+	.id		=	TCA_ACT_SKBMOD,
 	.owner		=	THIS_MODULE,
 	.act		=	tcf_skbmod_act,
 	.dump		=	tcf_skbmod_dump,
diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 8b43fe0130f7..9104b8e36482 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -563,7 +563,7 @@ static int tunnel_key_search(struct net *net, struct tc_action **a, u32 index)
 
 static struct tc_action_ops act_tunnel_key_ops = {
 	.kind		=	"tunnel_key",
-	.type		=	TCA_ACT_TUNNEL_KEY,
+	.id		=	TCA_ID_TUNNEL_KEY,
 	.owner		=	THIS_MODULE,
 	.act		=	tunnel_key_act,
 	.dump		=	tunnel_key_dump,
diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index 93fdaf707313..ac0061599225 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -297,7 +297,7 @@ static int tcf_vlan_search(struct net *net, struct tc_action **a, u32 index)
 
 static struct tc_action_ops act_vlan_ops = {
 	.kind		=	"vlan",
-	.type		=	TCA_ACT_VLAN,
+	.id		=	TCA_ID_VLAN,
 	.owner		=	THIS_MODULE,
 	.act		=	tcf_vlan_act,
 	.dump		=	tcf_vlan_dump,
-- 
2.9.5


^ permalink raw reply related

* [PATCH v1 net-next 1/2] net: Move all TC actions identifiers to one place
From: Eli Cohen @ 2019-02-10 12:24 UTC (permalink / raw)
  To: davem, jhs, xiyou.wangcong, jiri; +Cc: netdev, simon.horman, Eli Cohen
In-Reply-To: <20190210122500.13614-1-eli@mellanox.com>

Move all the TC identifiers to one place, to the same enum that defines
the identifier of police action. This makes it easier choose numbers for
new actions since they are now defined in one place. We preserve the
original values for binary compatibility. New IDs should be added inside
the enum.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---

Changelog:
v0 -> v1:
   Remove definition of TCA_ACT_SIMP from net/sched/act_simple.c to avoid two
   identical definitions.

 include/uapi/linux/pkt_cls.h              | 43 ++++++++++++++++++++++++++++---
 include/uapi/linux/tc_act/tc_bpf.h        |  2 --
 include/uapi/linux/tc_act/tc_connmark.h   |  2 --
 include/uapi/linux/tc_act/tc_csum.h       |  2 --
 include/uapi/linux/tc_act/tc_gact.h       |  1 -
 include/uapi/linux/tc_act/tc_ife.h        |  1 -
 include/uapi/linux/tc_act/tc_ipt.h        |  3 ---
 include/uapi/linux/tc_act/tc_mirred.h     |  1 -
 include/uapi/linux/tc_act/tc_nat.h        |  2 --
 include/uapi/linux/tc_act/tc_pedit.h      |  2 --
 include/uapi/linux/tc_act/tc_sample.h     |  2 --
 include/uapi/linux/tc_act/tc_skbedit.h    |  2 --
 include/uapi/linux/tc_act/tc_skbmod.h     |  2 --
 include/uapi/linux/tc_act/tc_tunnel_key.h |  2 --
 include/uapi/linux/tc_act/tc_vlan.h       |  2 --
 net/sched/act_simple.c                    |  2 --
 tools/include/uapi/linux/tc_act/tc_bpf.h  |  2 --
 17 files changed, 40 insertions(+), 33 deletions(-)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 02ac251be8c4..7ab55f97e7c4 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -63,12 +63,49 @@ enum {
 #define TC_ACT_GOTO_CHAIN __TC_ACT_EXT(2)
 #define TC_ACT_EXT_OPCODE_MAX	TC_ACT_GOTO_CHAIN
 
+/* These macros are put here for binary compatibility with userspace apps that
+ * make use of them. For kernel code and new userspace apps, use the TCA_ID_*
+ * versions.
+ */
+#define TCA_ACT_GACT 5
+#define TCA_ACT_IPT 6
+#define TCA_ACT_PEDIT 7
+#define TCA_ACT_MIRRED 8
+#define TCA_ACT_NAT 9
+#define TCA_ACT_XT 10
+#define TCA_ACT_SKBEDIT 11
+#define TCA_ACT_VLAN 12
+#define TCA_ACT_BPF 13
+#define TCA_ACT_CONNMARK 14
+#define TCA_ACT_SKBMOD 15
+#define TCA_ACT_CSUM 16
+#define TCA_ACT_TUNNEL_KEY 17
+#define TCA_ACT_SIMP 22
+#define TCA_ACT_IFE 25
+#define TCA_ACT_SAMPLE 26
+
 /* Action type identifiers*/
 enum {
-	TCA_ID_UNSPEC=0,
-	TCA_ID_POLICE=1,
+	TCA_ID_UNSPEC = 0,
+	TCA_ID_POLICE = 1,
+	TCA_ID_GACT = TCA_ACT_GACT,
+	TCA_ID_IPT = TCA_ACT_IPT,
+	TCA_ID_PEDIT = TCA_ACT_PEDIT,
+	TCA_ID_MIRRED = TCA_ACT_MIRRED,
+	TCA_ID_NAT = TCA_ACT_NAT,
+	TCA_ID_XT = TCA_ACT_XT,
+	TCA_ID_SKBEDIT = TCA_ACT_SKBEDIT,
+	TCA_ID_VLAN = TCA_ACT_VLAN,
+	TCA_ID_BPF = TCA_ACT_BPF,
+	TCA_ID_CONNMARK = TCA_ACT_CONNMARK,
+	TCA_ID_SKBMOD = TCA_ACT_SKBMOD,
+	TCA_ID_CSUM = TCA_ACT_CSUM,
+	TCA_ID_TUNNEL_KEY = TCA_ACT_TUNNEL_KEY,
+	TCA_ID_SIMP = TCA_ACT_SIMP,
+	TCA_ID_IFE = TCA_ACT_IFE,
+	TCA_ID_SAMPLE = TCA_ACT_SAMPLE,
 	/* other actions go here */
-	__TCA_ID_MAX=255
+	__TCA_ID_MAX = 255
 };
 
 #define TCA_ID_MAX __TCA_ID_MAX
diff --git a/include/uapi/linux/tc_act/tc_bpf.h b/include/uapi/linux/tc_act/tc_bpf.h
index 6e89a5df49a4..653c4f94f76e 100644
--- a/include/uapi/linux/tc_act/tc_bpf.h
+++ b/include/uapi/linux/tc_act/tc_bpf.h
@@ -13,8 +13,6 @@
 
 #include <linux/pkt_cls.h>
 
-#define TCA_ACT_BPF 13
-
 struct tc_act_bpf {
 	tc_gen;
 };
diff --git a/include/uapi/linux/tc_act/tc_connmark.h b/include/uapi/linux/tc_act/tc_connmark.h
index 80caa47b1933..9f8f6f709feb 100644
--- a/include/uapi/linux/tc_act/tc_connmark.h
+++ b/include/uapi/linux/tc_act/tc_connmark.h
@@ -5,8 +5,6 @@
 #include <linux/types.h>
 #include <linux/pkt_cls.h>
 
-#define TCA_ACT_CONNMARK 14
-
 struct tc_connmark {
 	tc_gen;
 	__u16 zone;
diff --git a/include/uapi/linux/tc_act/tc_csum.h b/include/uapi/linux/tc_act/tc_csum.h
index 0ecf4d29e2f3..94b2044929de 100644
--- a/include/uapi/linux/tc_act/tc_csum.h
+++ b/include/uapi/linux/tc_act/tc_csum.h
@@ -5,8 +5,6 @@
 #include <linux/types.h>
 #include <linux/pkt_cls.h>
 
-#define TCA_ACT_CSUM 16
-
 enum {
 	TCA_CSUM_UNSPEC,
 	TCA_CSUM_PARMS,
diff --git a/include/uapi/linux/tc_act/tc_gact.h b/include/uapi/linux/tc_act/tc_gact.h
index 94273c3b81b0..37e5392e02c7 100644
--- a/include/uapi/linux/tc_act/tc_gact.h
+++ b/include/uapi/linux/tc_act/tc_gact.h
@@ -5,7 +5,6 @@
 #include <linux/types.h>
 #include <linux/pkt_cls.h>
 
-#define TCA_ACT_GACT 5
 struct tc_gact {
 	tc_gen;
 
diff --git a/include/uapi/linux/tc_act/tc_ife.h b/include/uapi/linux/tc_act/tc_ife.h
index 2f48490ef386..8c401f185675 100644
--- a/include/uapi/linux/tc_act/tc_ife.h
+++ b/include/uapi/linux/tc_act/tc_ife.h
@@ -6,7 +6,6 @@
 #include <linux/pkt_cls.h>
 #include <linux/ife.h>
 
-#define TCA_ACT_IFE 25
 /* Flag bits for now just encoding/decoding; mutually exclusive */
 #define IFE_ENCODE 1
 #define IFE_DECODE 0
diff --git a/include/uapi/linux/tc_act/tc_ipt.h b/include/uapi/linux/tc_act/tc_ipt.h
index b743c8bddd13..c48d7da6750d 100644
--- a/include/uapi/linux/tc_act/tc_ipt.h
+++ b/include/uapi/linux/tc_act/tc_ipt.h
@@ -4,9 +4,6 @@
 
 #include <linux/pkt_cls.h>
 
-#define TCA_ACT_IPT 6
-#define TCA_ACT_XT 10
-
 enum {
 	TCA_IPT_UNSPEC,
 	TCA_IPT_TABLE,
diff --git a/include/uapi/linux/tc_act/tc_mirred.h b/include/uapi/linux/tc_act/tc_mirred.h
index 5dd671cf5776..2500a0005d05 100644
--- a/include/uapi/linux/tc_act/tc_mirred.h
+++ b/include/uapi/linux/tc_act/tc_mirred.h
@@ -5,7 +5,6 @@
 #include <linux/types.h>
 #include <linux/pkt_cls.h>
 
-#define TCA_ACT_MIRRED 8
 #define TCA_EGRESS_REDIR 1  /* packet redirect to EGRESS*/
 #define TCA_EGRESS_MIRROR 2 /* mirror packet to EGRESS */
 #define TCA_INGRESS_REDIR 3  /* packet redirect to INGRESS*/
diff --git a/include/uapi/linux/tc_act/tc_nat.h b/include/uapi/linux/tc_act/tc_nat.h
index 086be842587b..21399c2c6130 100644
--- a/include/uapi/linux/tc_act/tc_nat.h
+++ b/include/uapi/linux/tc_act/tc_nat.h
@@ -5,8 +5,6 @@
 #include <linux/pkt_cls.h>
 #include <linux/types.h>
 
-#define TCA_ACT_NAT 9
-
 enum {
 	TCA_NAT_UNSPEC,
 	TCA_NAT_PARMS,
diff --git a/include/uapi/linux/tc_act/tc_pedit.h b/include/uapi/linux/tc_act/tc_pedit.h
index 24ec792dacc1..f3e61b04fa01 100644
--- a/include/uapi/linux/tc_act/tc_pedit.h
+++ b/include/uapi/linux/tc_act/tc_pedit.h
@@ -5,8 +5,6 @@
 #include <linux/types.h>
 #include <linux/pkt_cls.h>
 
-#define TCA_ACT_PEDIT 7
-
 enum {
 	TCA_PEDIT_UNSPEC,
 	TCA_PEDIT_TM,
diff --git a/include/uapi/linux/tc_act/tc_sample.h b/include/uapi/linux/tc_act/tc_sample.h
index bd7e9f03abd2..fee1bcc20793 100644
--- a/include/uapi/linux/tc_act/tc_sample.h
+++ b/include/uapi/linux/tc_act/tc_sample.h
@@ -6,8 +6,6 @@
 #include <linux/pkt_cls.h>
 #include <linux/if_ether.h>
 
-#define TCA_ACT_SAMPLE 26
-
 struct tc_sample {
 	tc_gen;
 };
diff --git a/include/uapi/linux/tc_act/tc_skbedit.h b/include/uapi/linux/tc_act/tc_skbedit.h
index 6de6071ebed6..800e93377218 100644
--- a/include/uapi/linux/tc_act/tc_skbedit.h
+++ b/include/uapi/linux/tc_act/tc_skbedit.h
@@ -23,8 +23,6 @@
 
 #include <linux/pkt_cls.h>
 
-#define TCA_ACT_SKBEDIT 11
-
 #define SKBEDIT_F_PRIORITY		0x1
 #define SKBEDIT_F_QUEUE_MAPPING		0x2
 #define SKBEDIT_F_MARK			0x4
diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h
index 38c072f66f2f..c525b3503797 100644
--- a/include/uapi/linux/tc_act/tc_skbmod.h
+++ b/include/uapi/linux/tc_act/tc_skbmod.h
@@ -13,8 +13,6 @@
 
 #include <linux/pkt_cls.h>
 
-#define TCA_ACT_SKBMOD 15
-
 #define SKBMOD_F_DMAC	0x1
 #define SKBMOD_F_SMAC	0x2
 #define SKBMOD_F_ETYPE	0x4
diff --git a/include/uapi/linux/tc_act/tc_tunnel_key.h b/include/uapi/linux/tc_act/tc_tunnel_key.h
index be384d63e1b5..41c8b462c177 100644
--- a/include/uapi/linux/tc_act/tc_tunnel_key.h
+++ b/include/uapi/linux/tc_act/tc_tunnel_key.h
@@ -14,8 +14,6 @@
 
 #include <linux/pkt_cls.h>
 
-#define TCA_ACT_TUNNEL_KEY 17
-
 #define TCA_TUNNEL_KEY_ACT_SET	    1
 #define TCA_TUNNEL_KEY_ACT_RELEASE  2
 
diff --git a/include/uapi/linux/tc_act/tc_vlan.h b/include/uapi/linux/tc_act/tc_vlan.h
index 0d7b5fd6605b..168995b54a70 100644
--- a/include/uapi/linux/tc_act/tc_vlan.h
+++ b/include/uapi/linux/tc_act/tc_vlan.h
@@ -13,8 +13,6 @@
 
 #include <linux/pkt_cls.h>
 
-#define TCA_ACT_VLAN 12
-
 #define TCA_VLAN_ACT_POP	1
 #define TCA_VLAN_ACT_PUSH	2
 #define TCA_VLAN_ACT_MODIFY	3
diff --git a/net/sched/act_simple.c b/net/sched/act_simple.c
index 902957beceb3..b2b16d440154 100644
--- a/net/sched/act_simple.c
+++ b/net/sched/act_simple.c
@@ -19,8 +19,6 @@
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
 
-#define TCA_ACT_SIMP 22
-
 #include <linux/tc_act/tc_defact.h>
 #include <net/tc_act/tc_defact.h>
 
diff --git a/tools/include/uapi/linux/tc_act/tc_bpf.h b/tools/include/uapi/linux/tc_act/tc_bpf.h
index 6e89a5df49a4..653c4f94f76e 100644
--- a/tools/include/uapi/linux/tc_act/tc_bpf.h
+++ b/tools/include/uapi/linux/tc_act/tc_bpf.h
@@ -13,8 +13,6 @@
 
 #include <linux/pkt_cls.h>
 
-#define TCA_ACT_BPF 13
-
 struct tc_act_bpf {
 	tc_gen;
 };
-- 
2.9.5


^ permalink raw reply related

* [PATCH v1 net-next 0/2] Change tc action identifiers to be more consistent
From: Eli Cohen @ 2019-02-10 12:24 UTC (permalink / raw)
  To: davem, jhs, xiyou.wangcong, jiri; +Cc: netdev, simon.horman, Eli Cohen

This two patch series modifies TC actions identifiers to be more consistent and
also puts them in one place so new identifiers numbers can be chosen more
easily.

Eli Cohen (2):
  net: Move all TC actions identifiers to one place
  net: Change TCA_ACT_* to TCA_ID_* to match that of TCA_ID_POLICE

Changelog:
v0 -> v1:
   Remove definition of TCA_ACT_SIMP from net/sched/act_simple.c to avoid two
   identical definitions.


 include/net/act_api.h                     |  2 +-
 include/net/tc_act/tc_csum.h              |  2 +-
 include/net/tc_act/tc_gact.h              |  2 +-
 include/net/tc_act/tc_mirred.h            |  4 +--
 include/net/tc_act/tc_pedit.h             |  2 +-
 include/net/tc_act/tc_sample.h            |  2 +-
 include/net/tc_act/tc_skbedit.h           |  2 +-
 include/net/tc_act/tc_tunnel_key.h        |  4 +--
 include/net/tc_act/tc_vlan.h              |  2 +-
 include/uapi/linux/pkt_cls.h              | 45 ++++++++++++++++++++++++++++---
 include/uapi/linux/tc_act/tc_bpf.h        |  2 --
 include/uapi/linux/tc_act/tc_connmark.h   |  2 --
 include/uapi/linux/tc_act/tc_csum.h       |  2 --
 include/uapi/linux/tc_act/tc_gact.h       |  1 -
 include/uapi/linux/tc_act/tc_ife.h        |  1 -
 include/uapi/linux/tc_act/tc_ipt.h        |  3 ---
 include/uapi/linux/tc_act/tc_mirred.h     |  1 -
 include/uapi/linux/tc_act/tc_nat.h        |  2 --
 include/uapi/linux/tc_act/tc_pedit.h      |  2 --
 include/uapi/linux/tc_act/tc_sample.h     |  2 --
 include/uapi/linux/tc_act/tc_skbedit.h    |  2 --
 include/uapi/linux/tc_act/tc_skbmod.h     |  2 --
 include/uapi/linux/tc_act/tc_tunnel_key.h |  2 --
 include/uapi/linux/tc_act/tc_vlan.h       |  2 --
 net/sched/act_api.c                       |  2 +-
 net/sched/act_bpf.c                       |  2 +-
 net/sched/act_connmark.c                  |  2 +-
 net/sched/act_csum.c                      |  2 +-
 net/sched/act_gact.c                      |  2 +-
 net/sched/act_ife.c                       |  2 +-
 net/sched/act_ipt.c                       |  4 +--
 net/sched/act_mirred.c                    |  2 +-
 net/sched/act_nat.c                       |  2 +-
 net/sched/act_pedit.c                     |  2 +-
 net/sched/act_police.c                    |  2 +-
 net/sched/act_sample.c                    |  2 +-
 net/sched/act_simple.c                    |  4 +--
 net/sched/act_skbedit.c                   |  2 +-
 net/sched/act_skbmod.c                    |  2 +-
 net/sched/act_tunnel_key.c                |  2 +-
 net/sched/act_vlan.c                      |  2 +-
 tools/include/uapi/linux/tc_act/tc_bpf.h  |  2 --
 42 files changed, 70 insertions(+), 63 deletions(-)

-- 
2.9.5


^ permalink raw reply

* Re: Possible bug into DSA2 code.
From: Rodolfo Giometti @ 2019-02-10 11:45 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Florian Fainelli, Vivien Didelot, David S. Miller, netdev
In-Reply-To: <20190209193409.GI30856@lunn.ch>

On 09/02/2019 20:34, Andrew Lunn wrote:
>> So we I see two possible solutions:
>>
>> 1) having both ds->slave_mii_bus and ds->ops->phy_read already defined is an
>> error, then it must be signaled to the calling code, or
> 
> I don't think we can do that. mv88e6xxx optionally instantiates the
> MDIO busses, depending on what is in device tree. If there is no mdio
> property, we need the DSA core to create an MDIO bus.

OK, but using the following check to know if DSA did such allocation is not 
correct because DSA drivers can allocate it by their own:

static void dsa_switch_teardown(struct dsa_switch *ds)
{
         if (ds->slave_mii_bus && ds->ops->phy_read)
                 mdiobus_unregister(ds->slave_mii_bus);

Maybe can we add a flag to register ds->slave_mii_bus allocation by DSA?

> Looking at the driver, ds->slave_mii_bus is assigned in
> mv88e6xxx_setup().
> 
> We have talked about adding a teardown() to the ops structure. This
> seems like another argument we should do it. The mv88e6xxx_teardown()
> can set ds->slave_mii_bus back to NULL, undoing what it did in the
> setup code.

This seems reasonable to me, but in this case you have to call teardown() 
operation before calling mdiobus_unregister() into dsa_switch_teardown() or we 
still have the problem...

Ciao,

Rodolfo

-- 
GNU/Linux Solutions                  e-mail: giometti@enneenne.com
Linux Device Driver                          giometti@linux.it
Embedded Systems                     phone:  +39 349 2432127
UNIX programming                     skype:  rodolfo.giometti

^ permalink raw reply

* Re: Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!
From: Heiner Kallweit @ 2019-02-10 11:44 UTC (permalink / raw)
  To: Sander Eikelenboom, Eric Dumazet, Realtek linux nic maintainers,
	Eric Dumazet
  Cc: Linus Torvalds, linux-kernel, netdev
In-Reply-To: <302dd169-d981-9d8d-99a6-9d1462e913f9@eikelenboom.it>

On 10.02.2019 10:16, Sander Eikelenboom wrote:
> On 09/02/2019 12:50, Heiner Kallweit wrote:
>> On 09.02.2019 11:07, Sander Eikelenboom wrote:
>>> On 09/02/2019 10:59, Heiner Kallweit wrote:
>>>> On 09.02.2019 10:34, Sander Eikelenboom wrote:
>>>>> On 09/02/2019 10:02, Heiner Kallweit wrote:
>>>>>> On 09.02.2019 00:09, Eric Dumazet wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 02/08/2019 01:50 PM, Heiner Kallweit wrote:
>>>>>>>> On 08.02.2019 22:45, Sander Eikelenboom wrote:
>>>>>>>>> On 08/02/2019 22:22, Heiner Kallweit wrote:
>>>>>>>>>> On 08.02.2019 21:55, Sander Eikelenboom wrote:
>>>>>>>>>>> On 08/02/2019 19:52, Heiner Kallweit wrote:
>>>>>>>>>>>> On 08.02.2019 19:29, Sander Eikelenboom wrote:
>>>>>>>>>>>>> L.S.,
>>>>>>>>>>>>>
>>>>>>>>>>>>> While testing a linux 5.0-rc5 kernel (with some patches on top but they don't seem related) under Xen i the nasty splat below, 
>>>>>>>>>>>>> that I haven encountered with Linux 4.20.x.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Unfortunately I haven't got a clear reproducer for this and bisecting could be nasty due to another (networking related) kernel bug.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you need more info, want me to run a debug patch etc., please feel free to ask.
>>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the report. However I see no change in the r8169 driver between
>>>>>>>>>>>> 4.20 and 5.0 with regard to BQL code. Having said that the root cause could
>>>>>>>>>>>> be somewhere else. Therefore I'm afraid a bisect will be needed.
>>>>>>>>>>>
>>>>>>>>>>> Hmm i did some diging and i think:
>>>>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>>>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>>>>>>> 620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add __netdev_sent_queue as variant of __netdev_tx_sent_queue
>>>>>>>>>>>
>>>>>>>>>> You're right. Thought this was added in 4.20 already.
>>>>>>>>>> The BQL code pattern I copied from the mlx4 driver and so far I haven't heard about
>>>>>>>>>> this issue from any user of physical hw. And due to the fact that a lot of mainboards
>>>>>>>>>> have onboard Realtek network I have quite a few testers out there.
>>>>>>>>>> Does the issue occur under specific circumstances like very high load?
>>>>>>>>>
>>>>>>>>> Yep, the box is already quite contented with the Xen VM's and if I remember correctly it occurred while kernel compiling
>>>>>>>>> on the host.
>>>>>>>>>
>>>>>>>>>> If indeed the xmit_more patch causes the issue, I think we have to involve Eric Dumazet
>>>>>>>>>> as author of the underlying changes.
>>>>>>>>>
>>>>>>>>> It could also be the barriers weren't that unneeded as assumed.
>>>>>>>>
>>>>>>>> The barriers were removed after adding xmit_more handling. Therefore it would be good to
>>>>>>>> test also with only 
>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>>> removed.
>>>>>>>>
>>>>>>>>> Since we are almost at RC6 i took the liberty to CC Eric now.
>>>>>>>>>
>>>>>>>> Sure, thanks.
>>>>>>>>
>>>>>>>>> BTW am i correct these patches are merely optimizations ?
>>>>>>>>
>>>>>>>> Yes
>>>>>>>>
>>>>>>>>> If so and concluding they revert cleanly, perhaps it should be considered at this point in the RC's
>>>>>>>>> to revert them for 5.0 and try again for 5.1 ?
>>>>>>>>>
>>>>>>>> Before removing both it would be good to test with only the barrier-removal removed.
>>>>>>>>
>>>>>>>
>>>>>>> Commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>>> looks buggy to me, since the skb might have been freed already on another cpu when you call
>>>>>>>
>>>>>>> You could try :
>>>>>>>
>>>>>>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>>>>>>> index 3624e67aef72c92ed6e908e2c99ac2d381210126..f907d484165d9fd775e81bf2bfb9aa4ddedb1c93 100644
>>>>>>> --- a/drivers/net/ethernet/realtek/r8169.c
>>>>>>> +++ b/drivers/net/ethernet/realtek/r8169.c
>>>>>>> @@ -6070,6 +6070,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>         dma_addr_t mapping;
>>>>>>>         u32 opts[2], len;
>>>>>>>         bool stop_queue;
>>>>>>> +       bool door_bell;
>>>>>>>         int frags;
>>>>>>>  
>>>>>>>         if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
>>>>>>> @@ -6116,6 +6117,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>         /* Force memory writes to complete before releasing descriptor */
>>>>>>>         dma_wmb();
>>>>>>>  
>>>>>>> +       door_bell = __netdev_sent_queue(dev, skb->len, skb->xmit_more);
>>>>>>> +
>>>>>>>         txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
>>>>>>>  
>>>>>>>         /* Force all memory writes to complete before notifying device */
>>>>>>> @@ -6127,7 +6130,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>         if (unlikely(stop_queue))
>>>>>>>                 netif_stop_queue(dev);
>>>>>>>  
>>>>>>> -       if (__netdev_sent_queue(dev, skb->len, skb->xmit_more)) {
>>>>>>> +       if (door_bell) {
>>>>>>>                 RTL_W8(tp, TxPoll, NPQ);
>>>>>>>                 mmiowb();
>>>>>>>         }
>>>>>>>
>>>>>> Thanks a lot for checking and for the proposed fix.
>>>>>> Sander, can you try with this patch on top of 5.0-rc5 w/o removing two two commits?
>>>>>
>>>>> I have done that already during the night .. the results:
>>>>> - I can confirm 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 is the first commit which causes hitting the BUG_ON in lib/dynamic_queue_limits.c.
>>>>>   (in other word, with only reverting bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 it still blows up).
>>>>>
>>>>> - The Eric's patch only applies cleanly with bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 reverted, so that's what I tested.
>>>>>   The patch seems to prevent hitting the BUG_ON in lib/dynamic_queue_limits.c, it has run this night and I gave done a few kernel compiles
>>>>>   this morning. How ever during these kernel compiles i'm getting a transmit queue timeout which i haven't seen with 4.20.x, although i regularly
>>>>>   compile kernels in the same way as I do now. The only thing I can't say if that is due to this change, or if it's again something else.
>>>>>   Which makes me somewhat inclined to go testing the complete revert some more and see if I can trigger the queue timeout on that or not.
>>>>>
>>>>>   If I can, it is a separate issue.
>>>>>   If I can't it seems even with a patch it still seems as a regression in comparison with 4.20.x, for which
>>>>>   a revert would be the right thing to do (since as you indicated these are merely optimizations), 
>>>>>   which would give us more time for 5.1 to try to solve things on top of the 5.0-release-to-be.
>>>>>   (especially since I seem to still have other issues which need to be sorted out and time is limited)
>>>>>
>>>>>   The timeout in question:
>>>>>         [28336.869479] NETDEV WATCHDOG: eth1 (r8169): transmit queue 0 timed out
>>>>>         [28336.881498] WARNING: CPU: 0 PID: 6925 at net/sched/sch_generic.c:461 dev_watchdog+0x20b/0x210
>>>>>         [28336.893358] Modules linked in:
>>>>>         [28336.904106] CPU: 0 PID: 6925 Comm: cc1 Tainted: G      D           5.0.0-rc5-20190208-thp-net-florian-rtl8169-eric-doflr+ #1
>>>>>         [28336.917385] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
>>>>>         [28336.928988] RIP: e030:dev_watchdog+0x20b/0x210
>>>>>         [28336.940623] Code: 00 49 63 4e e0 eb 90 4c 89 e7 c6 05 ad d8 f1 00 01 e8 a9 32 fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7 50 59 89 82 e8 e5 92 4d ff <0f> 0b eb c0 90 48 c7 47 08 00 00 00 00 48 c7 07 00 00 00 00 0f b7
>>>>>         [28336.965265] RSP: e02b:ffff88807d403ea0 EFLAGS: 00010286
>>>>>         [28336.977465] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff82a69db8
>>>>>         [28336.991265] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000200
>>>>>         [28337.008865] RBP: ffff88807936e41c R08: 0000000000000000 R09: 0000000000000819
>>>>>         [28337.022250] R10: 0000000000000202 R11: ffffffff8247ca80 R12: ffff88807936e000
>>>>>         [28337.035204] R13: 0000000000000000 R14: ffff88807936e440 R15: 0000000000000001
>>>>>         [28337.049832] FS:  00007f53e9bf3840(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
>>>>>         [28337.062524] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>         [28337.075086] CR2: 00007f53e60c4000 CR3: 000000001a0be000 CR4: 0000000000000660
>>>>>         [28337.090052] Call Trace:
>>>>>         [28337.103615]  <IRQ>
>>>>>         [28337.116587]  ? qdisc_destroy+0x120/0x120
>>>>>         [28337.128905]  call_timer_fn+0x19/0x90
>>>>>         [28337.141892]  expire_timers+0x8b/0xa0
>>>>>         [28337.153354]  run_timer_softirq+0x7e/0x160
>>>>>         [28337.165931]  ? handle_irq_event_percpu+0x4c/0x70
>>>>>         [28337.176548]  ? handle_percpu_irq+0x32/0x50
>>>>>         [28337.186734]  __do_softirq+0xed/0x229
>>>>>         [28337.196404]  ? hypervisor_callback+0xa/0x20
>>>>>         [28337.207822]  irq_exit+0xb7/0xc0
>>>>>         [28337.218978]  xen_evtchn_do_upcall+0x27/0x40
>>>>>         [28337.230763]  xen_do_hypervisor_callback+0x29/0x40
>>>>>         [28337.241261]  </IRQ>
>>>>>         [28337.253283] RIP: e033:0xff7e62
>>>>>         [28337.264899] Code: 35 43 0f c7 00 4c 89 ef e8 8b 6d 67 ff 0f 1f 00 44 89 e0 44 89 e2 c1 e8 06 83 e2 3f 48 8b 0c c5 40 8d c6 01 48 0f a3 d1 72 0e <48> 8b 04 c5 50 8d c6 01 48 0f a3 d0 73 0b 44 89 e6 4c 89 ef e8 b5
>>>>>         [28337.288677] RSP: e02b:00007fff0fc6a340 EFLAGS: 00000202
>>>>>         [28337.299234] RAX: 0000000000000000 RBX: 00007f53e60c3580 RCX: 0000000000000000
>>>>>         [28337.309577] RDX: 0000000000000034 RSI: 0000000001e71a98 RDI: 00007fff0fc6a538
>>>>>         [28337.320724] RBP: 00007fff0fc6a4b0 R08: 0000000000000000 R09: 0000000000000000
>>>>>         [28337.331829] R10: 0000000000000001 R11: 00000000020cb3d0 R12: 0000000000000034
>>>>>         [28337.343900] R13: 00007fff0fc6a538 R14: 0000000000000000 R15: 0000000000000001
>>>>>         [28337.353977] ---[ end trace 6ff49f09286816b7 ]---
>>>>>
>>>> Thanks for your efforts. As usual this tx timeout trace says basically nothing except
>>>> "timeout" and root cause could be anything. Earlier you reported a memory allocation error,
>>>> did that occur again?
>>>> If we decide to revert, I'd leave removal of the memory barriers in (as it doesn't seem to
>>>> contribute to the issue) and just submit a patch to effectively revert
>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356.
>>>
>>> I can't say if that is correct, because i haven't tested that.
>>>
>>> Another thing I could test is:
>>>  - putting all the r8169 patches (and prerequisites) that went into 5.0 
>>>    up to bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3, onto 4.20.7 and see what that does.
>>>    If that would be feasible (not too many needed prerequisites out of r8169) and if 
>>>    you could spare me some time and prep such a branch somewhere so i can pull and compile that,
>>>    that would be great.
>>>
>>
>> Unfortunately there's quite a number of changes. Regarding __netdev_tx_sent_queue()
>> and watchdog timeout I found the following comment in drivers/net/ethernet/sfc/tx.c,
>> efx_enqueue_skb():
>>
>> 	if (__netdev_tx_sent_queue(tx_queue->core_txq, skb_len, xmit_more)) {
>> 		struct efx_tx_queue *txq2 = efx_tx_queue_partner(tx_queue);
>>
>> 		/* There could be packets left on the partner queue if those
>> 		 * SKBs had skb->xmit_more set. If we do not push those they
>> 		 * could be left for a long time and cause a netdev watchdog.
>> 		 */
>> 		if (txq2->xmit_more_available)
>> 			efx_nic_push_buffers(txq2);
>>
>> But I'm not sure whether the situation in r8169 is comparable. The following patch
>> implements what I mentioned earlier: It leaves all other 5.0 changes in place and
>> effectively reverts 2e6eedb4813e34d8d84ac0eb3afb668966f3f356. Would be great if
>> you could give it a try.
> 
> Hi Heiner,
> 
> It took some time to respond, because I had another issue with 5.0 which intervened with proper testing, 
> but fortunately I could pinpoint without doing a full bisect and revert that commit for further testing.
> 
> So there is still time left and I could do a more proper run with your patch below.
> Unfortunately i still get a splat (see below) with this, although i'm not sure it is related, 
> just that I can't tell.
> 
I checked further and there's a handful of network drivers using __napi_alloc_skb() with __GFP_NOWARN,
maybe to avoid such splats. Did the splat impact functionality? When checking the code in r8169 the
affected packet would just be dropped.

> Perhaps Linus as Oops-decoding-guru has an idea ?
> 
> --
> Sander
> 
> [39041.689007] dpkg-deb: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
> [39041.689016] CPU: 4 PID: 14078 Comm: dpkg-deb Not tainted 5.0.0-rc5-20190209-kallweit+ #1
> [39041.689017] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
> [39041.689018] Call Trace:
> [39041.689022]  <IRQ>
> [39041.689030]  dump_stack+0x5c/0x7b
> [39041.689033]  warn_alloc+0x103/0x190
> [39041.689036]  __alloc_pages_nodemask+0xe3d/0xe80
> [39041.689039]  ? ip_rcv+0x48/0xc0
> [39041.689040]  ? ip_rcv_finish_core.isra.0+0x360/0x360
> [39041.689042]  page_frag_alloc+0x117/0x150
> [39041.689044]  __napi_alloc_skb+0x83/0xd0
> [39041.689048]  rtl8169_poll+0x210/0x640
> [39041.689051]  net_rx_action+0x23d/0x370
> [39041.689054]  __do_softirq+0xed/0x229
> [39041.689058]  irq_exit+0xb7/0xc0
> [39041.689061]  xen_evtchn_do_upcall+0x27/0x40
> [39041.689063]  xen_do_hypervisor_callback+0x29/0x40
> [39041.689064]  </IRQ>
> [39041.689066] RIP: e030:_atomic_dec_and_lock+0x2/0x40
> [39041.689068] Code: ff 39 05 c5 c1 c9 00 89 c7 89 c6 76 0f 83 eb 01 83 fb ff 75 d9 5b 89 f8 5d 41 5c c3 0f 0b 90 90 90 90 90 90 90 90 90 90 8b 07 <83> f8 01 74 0c 8d 50 ff f0 0f b1 17 75 f2 31 c0 c3 55 53 48 89 fb
> [39041.689069] RSP: e02b:ffffc9000705b990 EFLAGS: 00000246
> [39041.689071] RAX: 0000000000000001 RBX: ffff888017082640 RCX: 0000000000000000
> [39041.689071] RDX: 0000000000000000 RSI: ffff8880170826c0 RDI: ffff888017082788
> [39041.689072] RBP: ffff8880170826c0 R08: ffffc9000705bb00 R09: ffffc9000705bb00
> [39041.689073] R10: ffffc9000705bb58 R11: ffff88807fc17000 R12: ffff888017082788
> [39041.689073] R13: ffff88806cc8cf58 R14: ffff888017082640 R15: ffff888009990240
> [39041.689077]  iput+0x63/0x1a0
> [39041.689079]  __dentry_kill+0xc5/0x170
> [39041.689080]  shrink_dentry_list+0x93/0x1c0
> [39041.689082]  prune_dcache_sb+0x4d/0x70
> [39041.689084]  super_cache_scan+0x104/0x190
> [39041.689087]  do_shrink_slab+0x12c/0x1e0
> [39041.689089]  shrink_slab+0xdf/0x2b0
> [39041.689091]  shrink_node+0x158/0x470
> [39041.689093]  do_try_to_free_pages+0xd1/0x380
> [39041.689095]  try_to_free_pages+0xb2/0xe0
> [39041.689097]  __alloc_pages_nodemask+0x603/0xe80
> [39041.689099]  ? __pagevec_lru_add_fn+0x1b1/0x290
> [39041.689102]  alloc_pages_vma+0x7b/0x1c0
> [39041.689106]  __handle_mm_fault+0xdb3/0x1060
> [39041.689109]  ? xen_mc_flush+0xc0/0x190
> [39041.689110]  handle_mm_fault+0xf8/0x200
> [39041.689113]  __do_page_fault+0x231/0x4a0
> [39041.689115]  ? page_fault+0x8/0x30
> [39041.689116]  page_fault+0x1e/0x30
> [39041.689118] RIP: e033:0x7fb9851d012e
> [39041.689119] Code: 29 c2 48 3b 15 7b a3 31 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 41 0f 29 01 <41> 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9 40 48 83 fa
> [39041.689119] RSP: e02b:00007fb958b36d38 EFLAGS: 00010202
> [39041.689120] RAX: 00007fb97a617f0e RBX: 000000000000f004 RCX: 00007fb948008be3
> [39041.689121] RDX: 00000000000080c2 RSI: 00007fb948000b31 RDI: 00007fb97a617f0e
> [39041.689122] RBP: 00000000000ff062 R08: 0000000000000002 R09: 00007fb97a620000
> [39041.689123] R10: 0000000000000004 R11: 00007fb97a626f02 R12: 000000000000f005
> [39041.689123] R13: 00007fb948000b28 R14: 0000562d76b63710 R15: 0000000000000003
> [39041.689125] Mem-Info:
> [39041.689130] active_anon:78775 inactive_anon:49211 isolated_anon:0
>                 active_file:106409 inactive_file:107531 isolated_file:0
>                 unevictable:552 dirty:175 writeback:0 unstable:0
>                 slab_reclaimable:13739 slab_unreclaimable:16454
>                 mapped:1605 shmem:23 pagetables:2900 bounce:0
>                 free:3681 free_pcp:935 free_cma:0
> [39041.689132] Node 0 active_anon:315100kB inactive_anon:196844kB active_file:425636kB inactive_file:430124kB unevictable:2208kB isolated(anon):0kB isolated(file):0kB mapped:6420kB dirty:700kB writeback:0kB shmem:92kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [39041.689133] Node 0 DMA free:7480kB min:44kB low:56kB high:68kB active_anon:0kB inactive_anon:7832kB active_file:472kB inactive_file:4kB unevictable:0kB writepending:0kB present:15956kB managed:15872kB mlocked:0kB kernel_stack:0kB pagetables:12kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> [39041.689136] lowmem_reserve[]: 0 1865 1865 1865
> [39041.689138] Node 0 DMA32 free:7244kB min:19472kB low:21380kB high:23288kB active_anon:315360kB inactive_anon:188144kB active_file:425164kB inactive_file:430120kB unevictable:2208kB writepending:700kB present:2080768kB managed:1674968kB mlocked:2208kB kernel_stack:9632kB pagetables:11588kB bounce:0kB free_pcp:3740kB local_pcp:528kB free_cma:0kB
> [39041.689140] lowmem_reserve[]: 0 0 0 0
> [39041.689142] Node 0 DMA: 6*4kB (UME) 6*8kB (UE) 7*16kB (UME) 6*32kB (ME) 5*64kB (UME) 3*128kB (UE) 5*256kB (UME) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (M) 0*4096kB = 7480kB
> [39041.689148] Node 0 DMA32: 69*4kB (U) 315*8kB (UE) 138*16kB (UE) 70*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7244kB
> [39041.689153] 214701 total pagecache pages
> [39041.689155] 273 pages in swap cache
> [39041.689156] Swap cache stats: add 100978, delete 100706, find 1158/1257
> [39041.689156] Free swap  = 3790588kB
> [39041.689157] Total swap = 4194300kB
> [39041.689157] 524181 pages RAM
> [39041.689158] 0 pages HighMem/MovableOnly
> [39041.689158] 101471 pages reserved
> [39041.689159] 0 pages cma reserved
> 
> 
> 
> 
> 
>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>> index e8a112149..3cca2ffb2 100644
>> --- a/drivers/net/ethernet/realtek/r8169.c
>> +++ b/drivers/net/ethernet/realtek/r8169.c
>> @@ -6192,7 +6192,6 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>  	struct device *d = tp_to_dev(tp);
>>  	dma_addr_t mapping;
>>  	u32 opts[2], len;
>> -	bool stop_queue;
>>  	int frags;
>>  
>>  	if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
>> @@ -6234,6 +6233,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>  
>>  	txd->opts2 = cpu_to_le32(opts[1]);
>>  
>> +	netdev_sent_queue(dev, skb->len);
>> +
>>  	skb_tx_timestamp(skb);
>>  
>>  	/* Force memory writes to complete before releasing descriptor */
>> @@ -6246,14 +6247,14 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>  
>>  	tp->cur_tx += frags + 1;
>>  
>> -	stop_queue = !rtl_tx_slots_avail(tp, MAX_SKB_FRAGS);
>> -	if (unlikely(stop_queue))
>> -		netif_stop_queue(dev);
>> -
>> -	if (__netdev_sent_queue(dev, skb->len, skb->xmit_more))
>> -		RTL_W8(tp, TxPoll, NPQ);
>> +	RTL_W8(tp, TxPoll, NPQ);
>>  
>> -	if (unlikely(stop_queue)) {
>> +	if (!rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) {
>> +		/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
>> +		 * not miss a ring update when it notices a stopped queue.
>> +		 */
>> +		smp_wmb();
>> +		netif_stop_queue(dev);
>>  		/* Sync with rtl_tx:
>>  		 * - publish queue status and cur_tx ring index (write barrier)
>>  		 * - refresh dirty_tx ring index (read barrier).
>>
> 
> 


^ permalink raw reply

* Re: Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!
From: Heiner Kallweit @ 2019-02-10  9:32 UTC (permalink / raw)
  To: Sander Eikelenboom, Eric Dumazet, Realtek linux nic maintainers,
	Eric Dumazet
  Cc: Linus Torvalds, linux-kernel, netdev
In-Reply-To: <302dd169-d981-9d8d-99a6-9d1462e913f9@eikelenboom.it>

On 10.02.2019 10:16, Sander Eikelenboom wrote:
> On 09/02/2019 12:50, Heiner Kallweit wrote:
>> On 09.02.2019 11:07, Sander Eikelenboom wrote:
>>> On 09/02/2019 10:59, Heiner Kallweit wrote:
>>>> On 09.02.2019 10:34, Sander Eikelenboom wrote:
>>>>> On 09/02/2019 10:02, Heiner Kallweit wrote:
>>>>>> On 09.02.2019 00:09, Eric Dumazet wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 02/08/2019 01:50 PM, Heiner Kallweit wrote:
>>>>>>>> On 08.02.2019 22:45, Sander Eikelenboom wrote:
>>>>>>>>> On 08/02/2019 22:22, Heiner Kallweit wrote:
>>>>>>>>>> On 08.02.2019 21:55, Sander Eikelenboom wrote:
>>>>>>>>>>> On 08/02/2019 19:52, Heiner Kallweit wrote:
>>>>>>>>>>>> On 08.02.2019 19:29, Sander Eikelenboom wrote:
>>>>>>>>>>>>> L.S.,
>>>>>>>>>>>>>
>>>>>>>>>>>>> While testing a linux 5.0-rc5 kernel (with some patches on top but they don't seem related) under Xen i the nasty splat below, 
>>>>>>>>>>>>> that I haven encountered with Linux 4.20.x.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Unfortunately I haven't got a clear reproducer for this and bisecting could be nasty due to another (networking related) kernel bug.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you need more info, want me to run a debug patch etc., please feel free to ask.
>>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the report. However I see no change in the r8169 driver between
>>>>>>>>>>>> 4.20 and 5.0 with regard to BQL code. Having said that the root cause could
>>>>>>>>>>>> be somewhere else. Therefore I'm afraid a bisect will be needed.
>>>>>>>>>>>
>>>>>>>>>>> Hmm i did some diging and i think:
>>>>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>>>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>>>>>>> 620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add __netdev_sent_queue as variant of __netdev_tx_sent_queue
>>>>>>>>>>>
>>>>>>>>>> You're right. Thought this was added in 4.20 already.
>>>>>>>>>> The BQL code pattern I copied from the mlx4 driver and so far I haven't heard about
>>>>>>>>>> this issue from any user of physical hw. And due to the fact that a lot of mainboards
>>>>>>>>>> have onboard Realtek network I have quite a few testers out there.
>>>>>>>>>> Does the issue occur under specific circumstances like very high load?
>>>>>>>>>
>>>>>>>>> Yep, the box is already quite contented with the Xen VM's and if I remember correctly it occurred while kernel compiling
>>>>>>>>> on the host.
>>>>>>>>>
>>>>>>>>>> If indeed the xmit_more patch causes the issue, I think we have to involve Eric Dumazet
>>>>>>>>>> as author of the underlying changes.
>>>>>>>>>
>>>>>>>>> It could also be the barriers weren't that unneeded as assumed.
>>>>>>>>
>>>>>>>> The barriers were removed after adding xmit_more handling. Therefore it would be good to
>>>>>>>> test also with only 
>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>>> removed.
>>>>>>>>
>>>>>>>>> Since we are almost at RC6 i took the liberty to CC Eric now.
>>>>>>>>>
>>>>>>>> Sure, thanks.
>>>>>>>>
>>>>>>>>> BTW am i correct these patches are merely optimizations ?
>>>>>>>>
>>>>>>>> Yes
>>>>>>>>
>>>>>>>>> If so and concluding they revert cleanly, perhaps it should be considered at this point in the RC's
>>>>>>>>> to revert them for 5.0 and try again for 5.1 ?
>>>>>>>>>
>>>>>>>> Before removing both it would be good to test with only the barrier-removal removed.
>>>>>>>>
>>>>>>>
>>>>>>> Commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>>> looks buggy to me, since the skb might have been freed already on another cpu when you call
>>>>>>>
>>>>>>> You could try :
>>>>>>>
>>>>>>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>>>>>>> index 3624e67aef72c92ed6e908e2c99ac2d381210126..f907d484165d9fd775e81bf2bfb9aa4ddedb1c93 100644
>>>>>>> --- a/drivers/net/ethernet/realtek/r8169.c
>>>>>>> +++ b/drivers/net/ethernet/realtek/r8169.c
>>>>>>> @@ -6070,6 +6070,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>         dma_addr_t mapping;
>>>>>>>         u32 opts[2], len;
>>>>>>>         bool stop_queue;
>>>>>>> +       bool door_bell;
>>>>>>>         int frags;
>>>>>>>  
>>>>>>>         if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
>>>>>>> @@ -6116,6 +6117,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>         /* Force memory writes to complete before releasing descriptor */
>>>>>>>         dma_wmb();
>>>>>>>  
>>>>>>> +       door_bell = __netdev_sent_queue(dev, skb->len, skb->xmit_more);
>>>>>>> +
>>>>>>>         txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
>>>>>>>  
>>>>>>>         /* Force all memory writes to complete before notifying device */
>>>>>>> @@ -6127,7 +6130,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>>         if (unlikely(stop_queue))
>>>>>>>                 netif_stop_queue(dev);
>>>>>>>  
>>>>>>> -       if (__netdev_sent_queue(dev, skb->len, skb->xmit_more)) {
>>>>>>> +       if (door_bell) {
>>>>>>>                 RTL_W8(tp, TxPoll, NPQ);
>>>>>>>                 mmiowb();
>>>>>>>         }
>>>>>>>
>>>>>> Thanks a lot for checking and for the proposed fix.
>>>>>> Sander, can you try with this patch on top of 5.0-rc5 w/o removing two two commits?
>>>>>
>>>>> I have done that already during the night .. the results:
>>>>> - I can confirm 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 is the first commit which causes hitting the BUG_ON in lib/dynamic_queue_limits.c.
>>>>>   (in other word, with only reverting bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 it still blows up).
>>>>>
>>>>> - The Eric's patch only applies cleanly with bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 reverted, so that's what I tested.
>>>>>   The patch seems to prevent hitting the BUG_ON in lib/dynamic_queue_limits.c, it has run this night and I gave done a few kernel compiles
>>>>>   this morning. How ever during these kernel compiles i'm getting a transmit queue timeout which i haven't seen with 4.20.x, although i regularly
>>>>>   compile kernels in the same way as I do now. The only thing I can't say if that is due to this change, or if it's again something else.
>>>>>   Which makes me somewhat inclined to go testing the complete revert some more and see if I can trigger the queue timeout on that or not.
>>>>>
>>>>>   If I can, it is a separate issue.
>>>>>   If I can't it seems even with a patch it still seems as a regression in comparison with 4.20.x, for which
>>>>>   a revert would be the right thing to do (since as you indicated these are merely optimizations), 
>>>>>   which would give us more time for 5.1 to try to solve things on top of the 5.0-release-to-be.
>>>>>   (especially since I seem to still have other issues which need to be sorted out and time is limited)
>>>>>
>>>>>   The timeout in question:
>>>>>         [28336.869479] NETDEV WATCHDOG: eth1 (r8169): transmit queue 0 timed out
>>>>>         [28336.881498] WARNING: CPU: 0 PID: 6925 at net/sched/sch_generic.c:461 dev_watchdog+0x20b/0x210
>>>>>         [28336.893358] Modules linked in:
>>>>>         [28336.904106] CPU: 0 PID: 6925 Comm: cc1 Tainted: G      D           5.0.0-rc5-20190208-thp-net-florian-rtl8169-eric-doflr+ #1
>>>>>         [28336.917385] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
>>>>>         [28336.928988] RIP: e030:dev_watchdog+0x20b/0x210
>>>>>         [28336.940623] Code: 00 49 63 4e e0 eb 90 4c 89 e7 c6 05 ad d8 f1 00 01 e8 a9 32 fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7 50 59 89 82 e8 e5 92 4d ff <0f> 0b eb c0 90 48 c7 47 08 00 00 00 00 48 c7 07 00 00 00 00 0f b7
>>>>>         [28336.965265] RSP: e02b:ffff88807d403ea0 EFLAGS: 00010286
>>>>>         [28336.977465] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff82a69db8
>>>>>         [28336.991265] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000200
>>>>>         [28337.008865] RBP: ffff88807936e41c R08: 0000000000000000 R09: 0000000000000819
>>>>>         [28337.022250] R10: 0000000000000202 R11: ffffffff8247ca80 R12: ffff88807936e000
>>>>>         [28337.035204] R13: 0000000000000000 R14: ffff88807936e440 R15: 0000000000000001
>>>>>         [28337.049832] FS:  00007f53e9bf3840(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
>>>>>         [28337.062524] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>         [28337.075086] CR2: 00007f53e60c4000 CR3: 000000001a0be000 CR4: 0000000000000660
>>>>>         [28337.090052] Call Trace:
>>>>>         [28337.103615]  <IRQ>
>>>>>         [28337.116587]  ? qdisc_destroy+0x120/0x120
>>>>>         [28337.128905]  call_timer_fn+0x19/0x90
>>>>>         [28337.141892]  expire_timers+0x8b/0xa0
>>>>>         [28337.153354]  run_timer_softirq+0x7e/0x160
>>>>>         [28337.165931]  ? handle_irq_event_percpu+0x4c/0x70
>>>>>         [28337.176548]  ? handle_percpu_irq+0x32/0x50
>>>>>         [28337.186734]  __do_softirq+0xed/0x229
>>>>>         [28337.196404]  ? hypervisor_callback+0xa/0x20
>>>>>         [28337.207822]  irq_exit+0xb7/0xc0
>>>>>         [28337.218978]  xen_evtchn_do_upcall+0x27/0x40
>>>>>         [28337.230763]  xen_do_hypervisor_callback+0x29/0x40
>>>>>         [28337.241261]  </IRQ>
>>>>>         [28337.253283] RIP: e033:0xff7e62
>>>>>         [28337.264899] Code: 35 43 0f c7 00 4c 89 ef e8 8b 6d 67 ff 0f 1f 00 44 89 e0 44 89 e2 c1 e8 06 83 e2 3f 48 8b 0c c5 40 8d c6 01 48 0f a3 d1 72 0e <48> 8b 04 c5 50 8d c6 01 48 0f a3 d0 73 0b 44 89 e6 4c 89 ef e8 b5
>>>>>         [28337.288677] RSP: e02b:00007fff0fc6a340 EFLAGS: 00000202
>>>>>         [28337.299234] RAX: 0000000000000000 RBX: 00007f53e60c3580 RCX: 0000000000000000
>>>>>         [28337.309577] RDX: 0000000000000034 RSI: 0000000001e71a98 RDI: 00007fff0fc6a538
>>>>>         [28337.320724] RBP: 00007fff0fc6a4b0 R08: 0000000000000000 R09: 0000000000000000
>>>>>         [28337.331829] R10: 0000000000000001 R11: 00000000020cb3d0 R12: 0000000000000034
>>>>>         [28337.343900] R13: 00007fff0fc6a538 R14: 0000000000000000 R15: 0000000000000001
>>>>>         [28337.353977] ---[ end trace 6ff49f09286816b7 ]---
>>>>>
>>>> Thanks for your efforts. As usual this tx timeout trace says basically nothing except
>>>> "timeout" and root cause could be anything. Earlier you reported a memory allocation error,
>>>> did that occur again?
>>>> If we decide to revert, I'd leave removal of the memory barriers in (as it doesn't seem to
>>>> contribute to the issue) and just submit a patch to effectively revert
>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356.
>>>
>>> I can't say if that is correct, because i haven't tested that.
>>>
>>> Another thing I could test is:
>>>  - putting all the r8169 patches (and prerequisites) that went into 5.0 
>>>    up to bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3, onto 4.20.7 and see what that does.
>>>    If that would be feasible (not too many needed prerequisites out of r8169) and if 
>>>    you could spare me some time and prep such a branch somewhere so i can pull and compile that,
>>>    that would be great.
>>>
>>
>> Unfortunately there's quite a number of changes. Regarding __netdev_tx_sent_queue()
>> and watchdog timeout I found the following comment in drivers/net/ethernet/sfc/tx.c,
>> efx_enqueue_skb():
>>
>> 	if (__netdev_tx_sent_queue(tx_queue->core_txq, skb_len, xmit_more)) {
>> 		struct efx_tx_queue *txq2 = efx_tx_queue_partner(tx_queue);
>>
>> 		/* There could be packets left on the partner queue if those
>> 		 * SKBs had skb->xmit_more set. If we do not push those they
>> 		 * could be left for a long time and cause a netdev watchdog.
>> 		 */
>> 		if (txq2->xmit_more_available)
>> 			efx_nic_push_buffers(txq2);
>>
>> But I'm not sure whether the situation in r8169 is comparable. The following patch
>> implements what I mentioned earlier: It leaves all other 5.0 changes in place and
>> effectively reverts 2e6eedb4813e34d8d84ac0eb3afb668966f3f356. Would be great if
>> you could give it a try.
> 
> Hi Heiner,
> 
> It took some time to respond, because I had another issue with 5.0 which intervened with proper testing, 
> but fortunately I could pinpoint without doing a full bisect and revert that commit for further testing.
> 
> So there is still time left and I could do a more proper run with your patch below.
> Unfortunately i still get a splat (see below) with this, although i'm not sure it is related, 
> just that I can't tell.
> 
The nasty memory allocation problem in napi_alloc_skb() you've seen before ..
If you can't reproduce it with 4.20 then a potential cause may be here:
5317d5c6d47e ("r8169: use napi_consume_skb where possible")
At least at a first and second glance I don't see how usage of both calls
could be wrong. So maybe the root cause is somewhere deeper inside.
At least it would be worth trying with the mentioned commit reverted.

> Perhaps Linus as Oops-decoding-guru has an idea ?
> 
> --
> Sander
> 
Heiner

> [39041.689007] dpkg-deb: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
> [39041.689016] CPU: 4 PID: 14078 Comm: dpkg-deb Not tainted 5.0.0-rc5-20190209-kallweit+ #1
> [39041.689017] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
> [39041.689018] Call Trace:
> [39041.689022]  <IRQ>
> [39041.689030]  dump_stack+0x5c/0x7b
> [39041.689033]  warn_alloc+0x103/0x190
> [39041.689036]  __alloc_pages_nodemask+0xe3d/0xe80
> [39041.689039]  ? ip_rcv+0x48/0xc0
> [39041.689040]  ? ip_rcv_finish_core.isra.0+0x360/0x360
> [39041.689042]  page_frag_alloc+0x117/0x150
> [39041.689044]  __napi_alloc_skb+0x83/0xd0
> [39041.689048]  rtl8169_poll+0x210/0x640
> [39041.689051]  net_rx_action+0x23d/0x370
> [39041.689054]  __do_softirq+0xed/0x229
> [39041.689058]  irq_exit+0xb7/0xc0
> [39041.689061]  xen_evtchn_do_upcall+0x27/0x40
> [39041.689063]  xen_do_hypervisor_callback+0x29/0x40
> [39041.689064]  </IRQ>
> [39041.689066] RIP: e030:_atomic_dec_and_lock+0x2/0x40
> [39041.689068] Code: ff 39 05 c5 c1 c9 00 89 c7 89 c6 76 0f 83 eb 01 83 fb ff 75 d9 5b 89 f8 5d 41 5c c3 0f 0b 90 90 90 90 90 90 90 90 90 90 8b 07 <83> f8 01 74 0c 8d 50 ff f0 0f b1 17 75 f2 31 c0 c3 55 53 48 89 fb
> [39041.689069] RSP: e02b:ffffc9000705b990 EFLAGS: 00000246
> [39041.689071] RAX: 0000000000000001 RBX: ffff888017082640 RCX: 0000000000000000
> [39041.689071] RDX: 0000000000000000 RSI: ffff8880170826c0 RDI: ffff888017082788
> [39041.689072] RBP: ffff8880170826c0 R08: ffffc9000705bb00 R09: ffffc9000705bb00
> [39041.689073] R10: ffffc9000705bb58 R11: ffff88807fc17000 R12: ffff888017082788
> [39041.689073] R13: ffff88806cc8cf58 R14: ffff888017082640 R15: ffff888009990240
> [39041.689077]  iput+0x63/0x1a0
> [39041.689079]  __dentry_kill+0xc5/0x170
> [39041.689080]  shrink_dentry_list+0x93/0x1c0
> [39041.689082]  prune_dcache_sb+0x4d/0x70
> [39041.689084]  super_cache_scan+0x104/0x190
> [39041.689087]  do_shrink_slab+0x12c/0x1e0
> [39041.689089]  shrink_slab+0xdf/0x2b0
> [39041.689091]  shrink_node+0x158/0x470
> [39041.689093]  do_try_to_free_pages+0xd1/0x380
> [39041.689095]  try_to_free_pages+0xb2/0xe0
> [39041.689097]  __alloc_pages_nodemask+0x603/0xe80
> [39041.689099]  ? __pagevec_lru_add_fn+0x1b1/0x290
> [39041.689102]  alloc_pages_vma+0x7b/0x1c0
> [39041.689106]  __handle_mm_fault+0xdb3/0x1060
> [39041.689109]  ? xen_mc_flush+0xc0/0x190
> [39041.689110]  handle_mm_fault+0xf8/0x200
> [39041.689113]  __do_page_fault+0x231/0x4a0
> [39041.689115]  ? page_fault+0x8/0x30
> [39041.689116]  page_fault+0x1e/0x30
> [39041.689118] RIP: e033:0x7fb9851d012e
> [39041.689119] Code: 29 c2 48 3b 15 7b a3 31 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 41 0f 29 01 <41> 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9 40 48 83 fa
> [39041.689119] RSP: e02b:00007fb958b36d38 EFLAGS: 00010202
> [39041.689120] RAX: 00007fb97a617f0e RBX: 000000000000f004 RCX: 00007fb948008be3
> [39041.689121] RDX: 00000000000080c2 RSI: 00007fb948000b31 RDI: 00007fb97a617f0e
> [39041.689122] RBP: 00000000000ff062 R08: 0000000000000002 R09: 00007fb97a620000
> [39041.689123] R10: 0000000000000004 R11: 00007fb97a626f02 R12: 000000000000f005
> [39041.689123] R13: 00007fb948000b28 R14: 0000562d76b63710 R15: 0000000000000003
> [39041.689125] Mem-Info:
> [39041.689130] active_anon:78775 inactive_anon:49211 isolated_anon:0
>                 active_file:106409 inactive_file:107531 isolated_file:0
>                 unevictable:552 dirty:175 writeback:0 unstable:0
>                 slab_reclaimable:13739 slab_unreclaimable:16454
>                 mapped:1605 shmem:23 pagetables:2900 bounce:0
>                 free:3681 free_pcp:935 free_cma:0
> [39041.689132] Node 0 active_anon:315100kB inactive_anon:196844kB active_file:425636kB inactive_file:430124kB unevictable:2208kB isolated(anon):0kB isolated(file):0kB mapped:6420kB dirty:700kB writeback:0kB shmem:92kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [39041.689133] Node 0 DMA free:7480kB min:44kB low:56kB high:68kB active_anon:0kB inactive_anon:7832kB active_file:472kB inactive_file:4kB unevictable:0kB writepending:0kB present:15956kB managed:15872kB mlocked:0kB kernel_stack:0kB pagetables:12kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> [39041.689136] lowmem_reserve[]: 0 1865 1865 1865
> [39041.689138] Node 0 DMA32 free:7244kB min:19472kB low:21380kB high:23288kB active_anon:315360kB inactive_anon:188144kB active_file:425164kB inactive_file:430120kB unevictable:2208kB writepending:700kB present:2080768kB managed:1674968kB mlocked:2208kB kernel_stack:9632kB pagetables:11588kB bounce:0kB free_pcp:3740kB local_pcp:528kB free_cma:0kB
> [39041.689140] lowmem_reserve[]: 0 0 0 0
> [39041.689142] Node 0 DMA: 6*4kB (UME) 6*8kB (UE) 7*16kB (UME) 6*32kB (ME) 5*64kB (UME) 3*128kB (UE) 5*256kB (UME) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (M) 0*4096kB = 7480kB
> [39041.689148] Node 0 DMA32: 69*4kB (U) 315*8kB (UE) 138*16kB (UE) 70*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7244kB
> [39041.689153] 214701 total pagecache pages
> [39041.689155] 273 pages in swap cache
> [39041.689156] Swap cache stats: add 100978, delete 100706, find 1158/1257
> [39041.689156] Free swap  = 3790588kB
> [39041.689157] Total swap = 4194300kB
> [39041.689157] 524181 pages RAM
> [39041.689158] 0 pages HighMem/MovableOnly
> [39041.689158] 101471 pages reserved
> [39041.689159] 0 pages cma reserved
> 
> 
> 
> 
> 
>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>> index e8a112149..3cca2ffb2 100644
>> --- a/drivers/net/ethernet/realtek/r8169.c
>> +++ b/drivers/net/ethernet/realtek/r8169.c
>> @@ -6192,7 +6192,6 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>  	struct device *d = tp_to_dev(tp);
>>  	dma_addr_t mapping;
>>  	u32 opts[2], len;
>> -	bool stop_queue;
>>  	int frags;
>>  
>>  	if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
>> @@ -6234,6 +6233,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>  
>>  	txd->opts2 = cpu_to_le32(opts[1]);
>>  
>> +	netdev_sent_queue(dev, skb->len);
>> +
>>  	skb_tx_timestamp(skb);
>>  
>>  	/* Force memory writes to complete before releasing descriptor */
>> @@ -6246,14 +6247,14 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>  
>>  	tp->cur_tx += frags + 1;
>>  
>> -	stop_queue = !rtl_tx_slots_avail(tp, MAX_SKB_FRAGS);
>> -	if (unlikely(stop_queue))
>> -		netif_stop_queue(dev);
>> -
>> -	if (__netdev_sent_queue(dev, skb->len, skb->xmit_more))
>> -		RTL_W8(tp, TxPoll, NPQ);
>> +	RTL_W8(tp, TxPoll, NPQ);
>>  
>> -	if (unlikely(stop_queue)) {
>> +	if (!rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) {
>> +		/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
>> +		 * not miss a ring update when it notices a stopped queue.
>> +		 */
>> +		smp_wmb();
>> +		netif_stop_queue(dev);
>>  		/* Sync with rtl_tx:
>>  		 * - publish queue status and cur_tx ring index (write barrier)
>>  		 * - refresh dirty_tx ring index (read barrier).
>>
> 
> 


^ permalink raw reply

* Re: Linux 5.0 regression: rtl8169 / kernel BUG at lib/dynamic_queue_limits.c:27!
From: Sander Eikelenboom @ 2019-02-10  9:16 UTC (permalink / raw)
  To: Heiner Kallweit, Eric Dumazet, Realtek linux nic maintainers,
	Eric Dumazet
  Cc: Linus Torvalds, linux-kernel, netdev
In-Reply-To: <824acd0b-920c-7554-4a63-d80c7de2a8b6@gmail.com>

On 09/02/2019 12:50, Heiner Kallweit wrote:
> On 09.02.2019 11:07, Sander Eikelenboom wrote:
>> On 09/02/2019 10:59, Heiner Kallweit wrote:
>>> On 09.02.2019 10:34, Sander Eikelenboom wrote:
>>>> On 09/02/2019 10:02, Heiner Kallweit wrote:
>>>>> On 09.02.2019 00:09, Eric Dumazet wrote:
>>>>>>
>>>>>>
>>>>>> On 02/08/2019 01:50 PM, Heiner Kallweit wrote:
>>>>>>> On 08.02.2019 22:45, Sander Eikelenboom wrote:
>>>>>>>> On 08/02/2019 22:22, Heiner Kallweit wrote:
>>>>>>>>> On 08.02.2019 21:55, Sander Eikelenboom wrote:
>>>>>>>>>> On 08/02/2019 19:52, Heiner Kallweit wrote:
>>>>>>>>>>> On 08.02.2019 19:29, Sander Eikelenboom wrote:
>>>>>>>>>>>> L.S.,
>>>>>>>>>>>>
>>>>>>>>>>>> While testing a linux 5.0-rc5 kernel (with some patches on top but they don't seem related) under Xen i the nasty splat below, 
>>>>>>>>>>>> that I haven encountered with Linux 4.20.x.
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately I haven't got a clear reproducer for this and bisecting could be nasty due to another (networking related) kernel bug.
>>>>>>>>>>>>
>>>>>>>>>>>> If you need more info, want me to run a debug patch etc., please feel free to ask.
>>>>>>>>>>>>
>>>>>>>>>>> Thanks for the report. However I see no change in the r8169 driver between
>>>>>>>>>>> 4.20 and 5.0 with regard to BQL code. Having said that the root cause could
>>>>>>>>>>> be somewhere else. Therefore I'm afraid a bisect will be needed.
>>>>>>>>>>
>>>>>>>>>> Hmm i did some diging and i think:
>>>>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>>>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>>>>>> 620344c43edfa020bbadfd81a144ebe5181fc94f net: core: add __netdev_sent_queue as variant of __netdev_tx_sent_queue
>>>>>>>>>>
>>>>>>>>> You're right. Thought this was added in 4.20 already.
>>>>>>>>> The BQL code pattern I copied from the mlx4 driver and so far I haven't heard about
>>>>>>>>> this issue from any user of physical hw. And due to the fact that a lot of mainboards
>>>>>>>>> have onboard Realtek network I have quite a few testers out there.
>>>>>>>>> Does the issue occur under specific circumstances like very high load?
>>>>>>>>
>>>>>>>> Yep, the box is already quite contented with the Xen VM's and if I remember correctly it occurred while kernel compiling
>>>>>>>> on the host.
>>>>>>>>
>>>>>>>>> If indeed the xmit_more patch causes the issue, I think we have to involve Eric Dumazet
>>>>>>>>> as author of the underlying changes.
>>>>>>>>
>>>>>>>> It could also be the barriers weren't that unneeded as assumed.
>>>>>>>
>>>>>>> The barriers were removed after adding xmit_more handling. Therefore it would be good to
>>>>>>> test also with only 
>>>>>>> bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 r8169: remove unneeded mmiowb barriers
>>>>>>> removed.
>>>>>>>
>>>>>>>> Since we are almost at RC6 i took the liberty to CC Eric now.
>>>>>>>>
>>>>>>> Sure, thanks.
>>>>>>>
>>>>>>>> BTW am i correct these patches are merely optimizations ?
>>>>>>>
>>>>>>> Yes
>>>>>>>
>>>>>>>> If so and concluding they revert cleanly, perhaps it should be considered at this point in the RC's
>>>>>>>> to revert them for 5.0 and try again for 5.1 ?
>>>>>>>>
>>>>>>> Before removing both it would be good to test with only the barrier-removal removed.
>>>>>>>
>>>>>>
>>>>>> Commit 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 r8169: make use of xmit_more and __netdev_sent_queue
>>>>>> looks buggy to me, since the skb might have been freed already on another cpu when you call
>>>>>>
>>>>>> You could try :
>>>>>>
>>>>>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>>>>>> index 3624e67aef72c92ed6e908e2c99ac2d381210126..f907d484165d9fd775e81bf2bfb9aa4ddedb1c93 100644
>>>>>> --- a/drivers/net/ethernet/realtek/r8169.c
>>>>>> +++ b/drivers/net/ethernet/realtek/r8169.c
>>>>>> @@ -6070,6 +6070,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>         dma_addr_t mapping;
>>>>>>         u32 opts[2], len;
>>>>>>         bool stop_queue;
>>>>>> +       bool door_bell;
>>>>>>         int frags;
>>>>>>  
>>>>>>         if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
>>>>>> @@ -6116,6 +6117,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>         /* Force memory writes to complete before releasing descriptor */
>>>>>>         dma_wmb();
>>>>>>  
>>>>>> +       door_bell = __netdev_sent_queue(dev, skb->len, skb->xmit_more);
>>>>>> +
>>>>>>         txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
>>>>>>  
>>>>>>         /* Force all memory writes to complete before notifying device */
>>>>>> @@ -6127,7 +6130,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>>>>         if (unlikely(stop_queue))
>>>>>>                 netif_stop_queue(dev);
>>>>>>  
>>>>>> -       if (__netdev_sent_queue(dev, skb->len, skb->xmit_more)) {
>>>>>> +       if (door_bell) {
>>>>>>                 RTL_W8(tp, TxPoll, NPQ);
>>>>>>                 mmiowb();
>>>>>>         }
>>>>>>
>>>>> Thanks a lot for checking and for the proposed fix.
>>>>> Sander, can you try with this patch on top of 5.0-rc5 w/o removing two two commits?
>>>>
>>>> I have done that already during the night .. the results:
>>>> - I can confirm 2e6eedb4813e34d8d84ac0eb3afb668966f3f356 is the first commit which causes hitting the BUG_ON in lib/dynamic_queue_limits.c.
>>>>   (in other word, with only reverting bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 it still blows up).
>>>>
>>>> - The Eric's patch only applies cleanly with bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3 reverted, so that's what I tested.
>>>>   The patch seems to prevent hitting the BUG_ON in lib/dynamic_queue_limits.c, it has run this night and I gave done a few kernel compiles
>>>>   this morning. How ever during these kernel compiles i'm getting a transmit queue timeout which i haven't seen with 4.20.x, although i regularly
>>>>   compile kernels in the same way as I do now. The only thing I can't say if that is due to this change, or if it's again something else.
>>>>   Which makes me somewhat inclined to go testing the complete revert some more and see if I can trigger the queue timeout on that or not.
>>>>
>>>>   If I can, it is a separate issue.
>>>>   If I can't it seems even with a patch it still seems as a regression in comparison with 4.20.x, for which
>>>>   a revert would be the right thing to do (since as you indicated these are merely optimizations), 
>>>>   which would give us more time for 5.1 to try to solve things on top of the 5.0-release-to-be.
>>>>   (especially since I seem to still have other issues which need to be sorted out and time is limited)
>>>>
>>>>   The timeout in question:
>>>>         [28336.869479] NETDEV WATCHDOG: eth1 (r8169): transmit queue 0 timed out
>>>>         [28336.881498] WARNING: CPU: 0 PID: 6925 at net/sched/sch_generic.c:461 dev_watchdog+0x20b/0x210
>>>>         [28336.893358] Modules linked in:
>>>>         [28336.904106] CPU: 0 PID: 6925 Comm: cc1 Tainted: G      D           5.0.0-rc5-20190208-thp-net-florian-rtl8169-eric-doflr+ #1
>>>>         [28336.917385] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
>>>>         [28336.928988] RIP: e030:dev_watchdog+0x20b/0x210
>>>>         [28336.940623] Code: 00 49 63 4e e0 eb 90 4c 89 e7 c6 05 ad d8 f1 00 01 e8 a9 32 fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7 50 59 89 82 e8 e5 92 4d ff <0f> 0b eb c0 90 48 c7 47 08 00 00 00 00 48 c7 07 00 00 00 00 0f b7
>>>>         [28336.965265] RSP: e02b:ffff88807d403ea0 EFLAGS: 00010286
>>>>         [28336.977465] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff82a69db8
>>>>         [28336.991265] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000200
>>>>         [28337.008865] RBP: ffff88807936e41c R08: 0000000000000000 R09: 0000000000000819
>>>>         [28337.022250] R10: 0000000000000202 R11: ffffffff8247ca80 R12: ffff88807936e000
>>>>         [28337.035204] R13: 0000000000000000 R14: ffff88807936e440 R15: 0000000000000001
>>>>         [28337.049832] FS:  00007f53e9bf3840(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
>>>>         [28337.062524] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>         [28337.075086] CR2: 00007f53e60c4000 CR3: 000000001a0be000 CR4: 0000000000000660
>>>>         [28337.090052] Call Trace:
>>>>         [28337.103615]  <IRQ>
>>>>         [28337.116587]  ? qdisc_destroy+0x120/0x120
>>>>         [28337.128905]  call_timer_fn+0x19/0x90
>>>>         [28337.141892]  expire_timers+0x8b/0xa0
>>>>         [28337.153354]  run_timer_softirq+0x7e/0x160
>>>>         [28337.165931]  ? handle_irq_event_percpu+0x4c/0x70
>>>>         [28337.176548]  ? handle_percpu_irq+0x32/0x50
>>>>         [28337.186734]  __do_softirq+0xed/0x229
>>>>         [28337.196404]  ? hypervisor_callback+0xa/0x20
>>>>         [28337.207822]  irq_exit+0xb7/0xc0
>>>>         [28337.218978]  xen_evtchn_do_upcall+0x27/0x40
>>>>         [28337.230763]  xen_do_hypervisor_callback+0x29/0x40
>>>>         [28337.241261]  </IRQ>
>>>>         [28337.253283] RIP: e033:0xff7e62
>>>>         [28337.264899] Code: 35 43 0f c7 00 4c 89 ef e8 8b 6d 67 ff 0f 1f 00 44 89 e0 44 89 e2 c1 e8 06 83 e2 3f 48 8b 0c c5 40 8d c6 01 48 0f a3 d1 72 0e <48> 8b 04 c5 50 8d c6 01 48 0f a3 d0 73 0b 44 89 e6 4c 89 ef e8 b5
>>>>         [28337.288677] RSP: e02b:00007fff0fc6a340 EFLAGS: 00000202
>>>>         [28337.299234] RAX: 0000000000000000 RBX: 00007f53e60c3580 RCX: 0000000000000000
>>>>         [28337.309577] RDX: 0000000000000034 RSI: 0000000001e71a98 RDI: 00007fff0fc6a538
>>>>         [28337.320724] RBP: 00007fff0fc6a4b0 R08: 0000000000000000 R09: 0000000000000000
>>>>         [28337.331829] R10: 0000000000000001 R11: 00000000020cb3d0 R12: 0000000000000034
>>>>         [28337.343900] R13: 00007fff0fc6a538 R14: 0000000000000000 R15: 0000000000000001
>>>>         [28337.353977] ---[ end trace 6ff49f09286816b7 ]---
>>>>
>>> Thanks for your efforts. As usual this tx timeout trace says basically nothing except
>>> "timeout" and root cause could be anything. Earlier you reported a memory allocation error,
>>> did that occur again?
>>> If we decide to revert, I'd leave removal of the memory barriers in (as it doesn't seem to
>>> contribute to the issue) and just submit a patch to effectively revert
>>> 2e6eedb4813e34d8d84ac0eb3afb668966f3f356.
>>
>> I can't say if that is correct, because i haven't tested that.
>>
>> Another thing I could test is:
>>  - putting all the r8169 patches (and prerequisites) that went into 5.0 
>>    up to bd7153bd83b806bfcc2e79b7a6f43aa653d06ef3, onto 4.20.7 and see what that does.
>>    If that would be feasible (not too many needed prerequisites out of r8169) and if 
>>    you could spare me some time and prep such a branch somewhere so i can pull and compile that,
>>    that would be great.
>>
> 
> Unfortunately there's quite a number of changes. Regarding __netdev_tx_sent_queue()
> and watchdog timeout I found the following comment in drivers/net/ethernet/sfc/tx.c,
> efx_enqueue_skb():
> 
> 	if (__netdev_tx_sent_queue(tx_queue->core_txq, skb_len, xmit_more)) {
> 		struct efx_tx_queue *txq2 = efx_tx_queue_partner(tx_queue);
> 
> 		/* There could be packets left on the partner queue if those
> 		 * SKBs had skb->xmit_more set. If we do not push those they
> 		 * could be left for a long time and cause a netdev watchdog.
> 		 */
> 		if (txq2->xmit_more_available)
> 			efx_nic_push_buffers(txq2);
> 
> But I'm not sure whether the situation in r8169 is comparable. The following patch
> implements what I mentioned earlier: It leaves all other 5.0 changes in place and
> effectively reverts 2e6eedb4813e34d8d84ac0eb3afb668966f3f356. Would be great if
> you could give it a try.

Hi Heiner,

It took some time to respond, because I had another issue with 5.0 which intervened with proper testing, 
but fortunately I could pinpoint without doing a full bisect and revert that commit for further testing.

So there is still time left and I could do a more proper run with your patch below.
Unfortunately i still get a splat (see below) with this, although i'm not sure it is related, 
just that I can't tell.

Perhaps Linus as Oops-decoding-guru has an idea ?

--
Sander

[39041.689007] dpkg-deb: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
[39041.689016] CPU: 4 PID: 14078 Comm: dpkg-deb Not tainted 5.0.0-rc5-20190209-kallweit+ #1
[39041.689017] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
[39041.689018] Call Trace:
[39041.689022]  <IRQ>
[39041.689030]  dump_stack+0x5c/0x7b
[39041.689033]  warn_alloc+0x103/0x190
[39041.689036]  __alloc_pages_nodemask+0xe3d/0xe80
[39041.689039]  ? ip_rcv+0x48/0xc0
[39041.689040]  ? ip_rcv_finish_core.isra.0+0x360/0x360
[39041.689042]  page_frag_alloc+0x117/0x150
[39041.689044]  __napi_alloc_skb+0x83/0xd0
[39041.689048]  rtl8169_poll+0x210/0x640
[39041.689051]  net_rx_action+0x23d/0x370
[39041.689054]  __do_softirq+0xed/0x229
[39041.689058]  irq_exit+0xb7/0xc0
[39041.689061]  xen_evtchn_do_upcall+0x27/0x40
[39041.689063]  xen_do_hypervisor_callback+0x29/0x40
[39041.689064]  </IRQ>
[39041.689066] RIP: e030:_atomic_dec_and_lock+0x2/0x40
[39041.689068] Code: ff 39 05 c5 c1 c9 00 89 c7 89 c6 76 0f 83 eb 01 83 fb ff 75 d9 5b 89 f8 5d 41 5c c3 0f 0b 90 90 90 90 90 90 90 90 90 90 8b 07 <83> f8 01 74 0c 8d 50 ff f0 0f b1 17 75 f2 31 c0 c3 55 53 48 89 fb
[39041.689069] RSP: e02b:ffffc9000705b990 EFLAGS: 00000246
[39041.689071] RAX: 0000000000000001 RBX: ffff888017082640 RCX: 0000000000000000
[39041.689071] RDX: 0000000000000000 RSI: ffff8880170826c0 RDI: ffff888017082788
[39041.689072] RBP: ffff8880170826c0 R08: ffffc9000705bb00 R09: ffffc9000705bb00
[39041.689073] R10: ffffc9000705bb58 R11: ffff88807fc17000 R12: ffff888017082788
[39041.689073] R13: ffff88806cc8cf58 R14: ffff888017082640 R15: ffff888009990240
[39041.689077]  iput+0x63/0x1a0
[39041.689079]  __dentry_kill+0xc5/0x170
[39041.689080]  shrink_dentry_list+0x93/0x1c0
[39041.689082]  prune_dcache_sb+0x4d/0x70
[39041.689084]  super_cache_scan+0x104/0x190
[39041.689087]  do_shrink_slab+0x12c/0x1e0
[39041.689089]  shrink_slab+0xdf/0x2b0
[39041.689091]  shrink_node+0x158/0x470
[39041.689093]  do_try_to_free_pages+0xd1/0x380
[39041.689095]  try_to_free_pages+0xb2/0xe0
[39041.689097]  __alloc_pages_nodemask+0x603/0xe80
[39041.689099]  ? __pagevec_lru_add_fn+0x1b1/0x290
[39041.689102]  alloc_pages_vma+0x7b/0x1c0
[39041.689106]  __handle_mm_fault+0xdb3/0x1060
[39041.689109]  ? xen_mc_flush+0xc0/0x190
[39041.689110]  handle_mm_fault+0xf8/0x200
[39041.689113]  __do_page_fault+0x231/0x4a0
[39041.689115]  ? page_fault+0x8/0x30
[39041.689116]  page_fault+0x1e/0x30
[39041.689118] RIP: e033:0x7fb9851d012e
[39041.689119] Code: 29 c2 48 3b 15 7b a3 31 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 41 0f 29 01 <41> 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9 40 48 83 fa
[39041.689119] RSP: e02b:00007fb958b36d38 EFLAGS: 00010202
[39041.689120] RAX: 00007fb97a617f0e RBX: 000000000000f004 RCX: 00007fb948008be3
[39041.689121] RDX: 00000000000080c2 RSI: 00007fb948000b31 RDI: 00007fb97a617f0e
[39041.689122] RBP: 00000000000ff062 R08: 0000000000000002 R09: 00007fb97a620000
[39041.689123] R10: 0000000000000004 R11: 00007fb97a626f02 R12: 000000000000f005
[39041.689123] R13: 00007fb948000b28 R14: 0000562d76b63710 R15: 0000000000000003
[39041.689125] Mem-Info:
[39041.689130] active_anon:78775 inactive_anon:49211 isolated_anon:0
                active_file:106409 inactive_file:107531 isolated_file:0
                unevictable:552 dirty:175 writeback:0 unstable:0
                slab_reclaimable:13739 slab_unreclaimable:16454
                mapped:1605 shmem:23 pagetables:2900 bounce:0
                free:3681 free_pcp:935 free_cma:0
[39041.689132] Node 0 active_anon:315100kB inactive_anon:196844kB active_file:425636kB inactive_file:430124kB unevictable:2208kB isolated(anon):0kB isolated(file):0kB mapped:6420kB dirty:700kB writeback:0kB shmem:92kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[39041.689133] Node 0 DMA free:7480kB min:44kB low:56kB high:68kB active_anon:0kB inactive_anon:7832kB active_file:472kB inactive_file:4kB unevictable:0kB writepending:0kB present:15956kB managed:15872kB mlocked:0kB kernel_stack:0kB pagetables:12kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[39041.689136] lowmem_reserve[]: 0 1865 1865 1865
[39041.689138] Node 0 DMA32 free:7244kB min:19472kB low:21380kB high:23288kB active_anon:315360kB inactive_anon:188144kB active_file:425164kB inactive_file:430120kB unevictable:2208kB writepending:700kB present:2080768kB managed:1674968kB mlocked:2208kB kernel_stack:9632kB pagetables:11588kB bounce:0kB free_pcp:3740kB local_pcp:528kB free_cma:0kB
[39041.689140] lowmem_reserve[]: 0 0 0 0
[39041.689142] Node 0 DMA: 6*4kB (UME) 6*8kB (UE) 7*16kB (UME) 6*32kB (ME) 5*64kB (UME) 3*128kB (UE) 5*256kB (UME) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (M) 0*4096kB = 7480kB
[39041.689148] Node 0 DMA32: 69*4kB (U) 315*8kB (UE) 138*16kB (UE) 70*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7244kB
[39041.689153] 214701 total pagecache pages
[39041.689155] 273 pages in swap cache
[39041.689156] Swap cache stats: add 100978, delete 100706, find 1158/1257
[39041.689156] Free swap  = 3790588kB
[39041.689157] Total swap = 4194300kB
[39041.689157] 524181 pages RAM
[39041.689158] 0 pages HighMem/MovableOnly
[39041.689158] 101471 pages reserved
[39041.689159] 0 pages cma reserved





> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index e8a112149..3cca2ffb2 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -6192,7 +6192,6 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>  	struct device *d = tp_to_dev(tp);
>  	dma_addr_t mapping;
>  	u32 opts[2], len;
> -	bool stop_queue;
>  	int frags;
>  
>  	if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
> @@ -6234,6 +6233,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>  
>  	txd->opts2 = cpu_to_le32(opts[1]);
>  
> +	netdev_sent_queue(dev, skb->len);
> +
>  	skb_tx_timestamp(skb);
>  
>  	/* Force memory writes to complete before releasing descriptor */
> @@ -6246,14 +6247,14 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>  
>  	tp->cur_tx += frags + 1;
>  
> -	stop_queue = !rtl_tx_slots_avail(tp, MAX_SKB_FRAGS);
> -	if (unlikely(stop_queue))
> -		netif_stop_queue(dev);
> -
> -	if (__netdev_sent_queue(dev, skb->len, skb->xmit_more))
> -		RTL_W8(tp, TxPoll, NPQ);
> +	RTL_W8(tp, TxPoll, NPQ);
>  
> -	if (unlikely(stop_queue)) {
> +	if (!rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) {
> +		/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
> +		 * not miss a ring update when it notices a stopped queue.
> +		 */
> +		smp_wmb();
> +		netif_stop_queue(dev);
>  		/* Sync with rtl_tx:
>  		 * - publish queue status and cur_tx ring index (write barrier)
>  		 * - refresh dirty_tx ring index (read barrier).
> 


^ permalink raw reply

* Re: [PATCH net-next 2/2] net: Change TCA_ACT_* to TCA_ID_* to match that of TCA_ID_POLICE
From: Eli Cohen @ 2019-02-10  8:19 UTC (permalink / raw)
  To: David Miller
  Cc: jhs, xiyou.wangcong, jiri, netdev, linux-kernel, simon.horman,
	jakub.kicinski, dirk.vandermerwe, francois.theron, quentin.monnet,
	john.hurley, edwin.peer
In-Reply-To: <20190207.100110.1557913033627638063.davem@davemloft.net>

On Thu, Feb 07, 2019 at 10:01:10AM -0800, David Miller wrote:
> From: Eli Cohen <eli@mellanox.com>
> Date: Thu,  7 Feb 2019 09:45:49 +0200
> 
> > diff --git a/net/sched/act_simple.c b/net/sched/act_simple.c
> > index 902957beceb3..d54cb608dbaf 100644
> > --- a/net/sched/act_simple.c
> > +++ b/net/sched/act_simple.c
> > @@ -19,8 +19,6 @@
> >  #include <net/netlink.h>
> >  #include <net/pkt_sched.h>
> >  
> > -#define TCA_ACT_SIMP 22
> > -
> >  #include <linux/tc_act/tc_defact.h>
> >  #include <net/tc_act/tc_defact.h>
> >  
> 
> I would do this in patch #1.

Sure, that was the intention. It just slipped off my fingers.

Will fix and send a new series.

> Actually, because you didn't, after patch #1 there are two #defines
> evaluated for this macro.  One in pkt_cls.h and one here.
> 
> The only reason the compiler doesn't warn and complain is because the
> definitions are identical.
> 
> Thank you.

^ permalink raw reply

* [PATCH v2] net/packet: fix 4gb buffer limit due to overflow check
From: Kal Conley @ 2019-02-10  8:57 UTC (permalink / raw)
  To: davem
  Cc: kal.conley, Willem de Bruijn, Eric Dumazet, Alexander Duyck,
	Jeff Kirsher, Kirill Tkhai, Vincent Whitchurch, Li RongQing,
	Magnus Karlsson, netdev, linux-kernel
In-Reply-To: <20190209.190114.542890373094719579.davem@davemloft.net>

When calculating rb->frames_per_block * req->tp_block_nr the result
can overflow. Check it for overflow without limiting the total buffer
size to UINT_MAX.

This change fixes support for packet ring buffers >= UINT_MAX.

Fixes: 8f8d28e4d6d8 ("net/packet: fix overflow in check for tp_frame_nr")
Signed-off-by: Kal Conley <kal.conley@dectris.com>
---
Changes in v2:
- Add Signed-off-by and Fixes tag

 net/packet/af_packet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 3b1a78906bc0..1cd1d83a4be0 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -4292,7 +4292,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 		rb->frames_per_block = req->tp_block_size / req->tp_frame_size;
 		if (unlikely(rb->frames_per_block == 0))
 			goto out;
-		if (unlikely(req->tp_block_size > UINT_MAX / req->tp_block_nr))
+		if (unlikely(rb->frames_per_block > UINT_MAX / req->tp_block_nr))
 			goto out;
 		if (unlikely((rb->frames_per_block * req->tp_block_nr) !=
 					req->tp_frame_nr))
-- 
2.20.1


^ permalink raw reply related

* [PATCH v2 bpf-next 6/7] bpf: Add skb->sk, bpf_sk_fullsock and bpf_tcp_sock tests to test_verifer
From: Martin KaFai Lau @ 2019-02-10  7:22 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190210072220.1530061-1-kafai@fb.com>

This patch tests accessing the skb->sk and the new helpers,
bpf_sk_fullsock and bpf_tcp_sock.

The errstr of some existing "reference tracking" tests is changed
with s/bpf_sock/sock/ and s/socket/sock/ where "sock" is from the
verifier's reg_type_str[].

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/testing/selftests/bpf/bpf_util.h        |   9 +
 .../selftests/bpf/verifier/ref_tracking.c     |   4 +-
 tools/testing/selftests/bpf/verifier/sock.c   | 384 ++++++++++++++++++
 tools/testing/selftests/bpf/verifier/unpriv.c |   2 +-
 4 files changed, 396 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/sock.c

diff --git a/tools/testing/selftests/bpf/bpf_util.h b/tools/testing/selftests/bpf/bpf_util.h
index 315a44fa32af..197347031038 100644
--- a/tools/testing/selftests/bpf/bpf_util.h
+++ b/tools/testing/selftests/bpf/bpf_util.h
@@ -48,4 +48,13 @@ static inline unsigned int bpf_num_possible_cpus(void)
 # define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
 #endif
 
+#ifndef sizeof_field
+#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))
+#endif
+
+#ifndef offsetofend
+#define offsetofend(TYPE, MEMBER) \
+	(offsetof(TYPE, MEMBER)	+ sizeof_field(TYPE, MEMBER))
+#endif
+
 #endif /* __BPF_UTIL__ */
diff --git a/tools/testing/selftests/bpf/verifier/ref_tracking.c b/tools/testing/selftests/bpf/verifier/ref_tracking.c
index dc2cc823df2b..3ed3593bd8b6 100644
--- a/tools/testing/selftests/bpf/verifier/ref_tracking.c
+++ b/tools/testing/selftests/bpf/verifier/ref_tracking.c
@@ -547,7 +547,7 @@
 	BPF_EXIT_INSN(),
 	},
 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
-	.errstr = "cannot write into socket",
+	.errstr = "cannot write into sock",
 	.result = REJECT,
 },
 {
@@ -562,7 +562,7 @@
 	BPF_EXIT_INSN(),
 	},
 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
-	.errstr = "invalid bpf_sock access off=0 size=8",
+	.errstr = "invalid sock access off=0 size=8",
 	.result = REJECT,
 },
 {
diff --git a/tools/testing/selftests/bpf/verifier/sock.c b/tools/testing/selftests/bpf/verifier/sock.c
new file mode 100644
index 000000000000..0ddfdf76aba5
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/sock.c
@@ -0,0 +1,384 @@
+{
+	"skb->sk: no NULL check",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 0),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid mem access 'sock_common_or_null'",
+},
+{
+	"skb->sk: sk->family [non fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, offsetof(struct bpf_sock, family)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"skb->sk: sk->type [fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, offsetof(struct bpf_sock, type)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid sock_common access",
+},
+{
+	"bpf_sk_fullsock(skb->sk): no !skb->sk check",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "type=sock_common_or_null expected=sock_common",
+},
+{
+	"sk_fullsock(skb->sk): no NULL check on ret",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, type)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid mem access 'sock_or_null'",
+},
+{
+	"sk_fullsock(skb->sk): sk->type [fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, type)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): sk->family [non fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, family)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): sk->state [narrow load]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, state)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): sk->dst_port [narrow load]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, dst_port)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): sk->dst_port [load 2nd byte]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, dst_port) + 1),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid sock access",
+},
+{
+	"sk_fullsock(skb->sk): sk->dst_ip6 [load 2nd byte]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, dst_ip6[0]) + 1),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): sk->type [narrow load]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, type)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): sk->protocol [narrow load]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, protocol)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): beyond last field",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetofend(struct bpf_sock, state)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid sock access",
+},
+{
+	"bpf_tcp_sock(skb->sk): no !skb->sk check",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "type=sock_common_or_null expected=sock_common",
+},
+{
+	"bpf_tcp_sock(skb->sk): no NULL check on ret",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, snd_cwnd)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid mem access 'tcp_sock_or_null'",
+},
+{
+	"bpf_tcp_sock(skb->sk): tp->snd_cwnd",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, snd_cwnd)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"bpf_tcp_sock(skb->sk): tp->bytes_acked",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, bytes_acked)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"bpf_tcp_sock(skb->sk): beyond last field",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, offsetofend(struct bpf_tcp_sock, bytes_acked)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid tcp_sock access",
+},
+{
+	"bpf_tcp_sock(bpf_sk_fullsock(skb->sk)): tp->snd_cwnd",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, snd_cwnd)),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"bpf_sk_release(skb->sk)",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 1),
+	BPF_EMIT_CALL(BPF_FUNC_sk_release),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.result = REJECT,
+	.errstr = "type=sock_common expected=sock",
+},
+{
+	"bpf_sk_release(bpf_sk_fullsock(skb->sk))",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_EMIT_CALL(BPF_FUNC_sk_release),
+	BPF_MOV64_IMM(BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.result = REJECT,
+	.errstr = "reference has not been acquired before",
+},
+{
+	"bpf_sk_release(bpf_tcp_sock(skb->sk))",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_EMIT_CALL(BPF_FUNC_sk_release),
+	BPF_MOV64_IMM(BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.result = REJECT,
+	.errstr = "type=tcp_sock expected=sock",
+},
diff --git a/tools/testing/selftests/bpf/verifier/unpriv.c b/tools/testing/selftests/bpf/verifier/unpriv.c
index 3e046695fad7..dbaf5be947b2 100644
--- a/tools/testing/selftests/bpf/verifier/unpriv.c
+++ b/tools/testing/selftests/bpf/verifier/unpriv.c
@@ -365,7 +365,7 @@
 	},
 	.result = REJECT,
 	//.errstr = "same insn cannot be used with different pointers",
-	.errstr = "cannot write into socket",
+	.errstr = "cannot write into sock",
 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
 },
 {
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 bpf-next 7/7] bpf: Add test_sock_fields for skb->sk and bpf_tcp_sock
From: Martin KaFai Lau @ 2019-02-10  7:22 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190210072220.1530061-1-kafai@fb.com>

This patch adds a C program to show the usage on
skb->sk and bpf_tcp_sock.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/testing/selftests/bpf/Makefile          |   6 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   4 +
 .../testing/selftests/bpf/test_sock_fields.c  | 327 ++++++++++++++++++
 .../selftests/bpf/test_sock_fields_kern.c     | 152 ++++++++
 4 files changed, 487 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sock_fields.c
 create mode 100644 tools/testing/selftests/bpf/test_sock_fields_kern.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 383d2ff13fc7..c7e1e3255448 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -23,7 +23,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
 	test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \
 	test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \
-	test_netcnt test_tcpnotify_user
+	test_netcnt test_tcpnotify_user test_sock_fields
 
 BPF_OBJ_FILES = \
 	test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \
@@ -35,7 +35,8 @@ BPF_OBJ_FILES = \
 	sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
 	test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o test_xdp_vlan.o \
-	xdp_dummy.o test_map_in_map.o test_spin_lock.o test_map_lock.o
+	xdp_dummy.o test_map_in_map.o test_spin_lock.o test_map_lock.o \
+	test_sock_fields_kern.o
 
 # Objects are built with default compilation flags and with sub-register
 # code-gen enabled.
@@ -111,6 +112,7 @@ $(OUTPUT)/test_progs: trace_helpers.c
 $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c
 $(OUTPUT)/test_cgroup_storage: cgroup_helpers.c
 $(OUTPUT)/test_netcnt: cgroup_helpers.c
+$(OUTPUT)/test_sock_fields: cgroup_helpers.c
 
 .PHONY: force
 
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 6a0ce0f055c5..d9999f1ed1d2 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -176,6 +176,10 @@ static void (*bpf_spin_lock)(struct bpf_spin_lock *lock) =
 	(void *) BPF_FUNC_spin_lock;
 static void (*bpf_spin_unlock)(struct bpf_spin_lock *lock) =
 	(void *) BPF_FUNC_spin_unlock;
+static struct bpf_sock *(*bpf_sk_fullsock)(struct bpf_sock *sk) =
+	(void *) BPF_FUNC_sk_fullsock;
+static struct bpf_tcp_sock *(*bpf_tcp_sock)(struct bpf_sock *sk) =
+	(void *) BPF_FUNC_tcp_sock;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/test_sock_fields.c b/tools/testing/selftests/bpf/test_sock_fields.c
new file mode 100644
index 000000000000..9bb58369b481
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sock_fields.c
@@ -0,0 +1,327 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+
+#include <sys/socket.h>
+#include <sys/epoll.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "cgroup_helpers.h"
+
+enum bpf_array_idx {
+	SRV_IDX,
+	CLI_IDX,
+	__NR_BPF_ARRAY_IDX,
+};
+
+#define CHECK(condition, tag, format...) ({				\
+	int __ret = !!(condition);					\
+	if (__ret) {							\
+		printf("%s(%d):FAIL:%s ", __func__, __LINE__, tag);	\
+		printf(format);						\
+		printf("\n");						\
+		exit(-1);						\
+	}								\
+})
+
+#define TEST_CGROUP "/test-bpf-sock-fields"
+#define DATA "Hello BPF!"
+#define DATA_LEN sizeof(DATA)
+
+static struct sockaddr_in6 srv_sa6, cli_sa6;
+static int linum_map_fd;
+static int addr_map_fd;
+static int tp_map_fd;
+static int sk_map_fd;
+static __u32 srv_idx = SRV_IDX;
+static __u32 cli_idx = CLI_IDX;
+
+static void init_loopback6(struct sockaddr_in6 *sa6)
+{
+	memset(sa6, 0, sizeof(*sa6));
+	sa6->sin6_family = AF_INET6;
+	sa6->sin6_addr = in6addr_loopback;
+}
+
+static void print_sk(const struct bpf_sock *sk)
+{
+	char src_ip4[24], dst_ip4[24];
+	char src_ip6[64], dst_ip6[64];
+
+	inet_ntop(AF_INET, &sk->src_ip4, src_ip4, sizeof(src_ip4));
+	inet_ntop(AF_INET6, &sk->src_ip6, src_ip6, sizeof(src_ip6));
+	inet_ntop(AF_INET, &sk->dst_ip4, dst_ip4, sizeof(dst_ip4));
+	inet_ntop(AF_INET6, &sk->dst_ip6, dst_ip6, sizeof(dst_ip6));
+
+	printf("state:%u bound_dev_if:%u family:%u type:%u protocol:%u mark:%u priority:%u "
+	       "src_ip4:%x(%s) src_ip6:%x:%x:%x:%x(%s) src_port:%u "
+	       "dst_ip4:%x(%s) dst_ip6:%x:%x:%x:%x(%s) dst_port:%u\n",
+	       sk->state, sk->bound_dev_if, sk->family, sk->type, sk->protocol,
+	       sk->mark, sk->priority,
+	       sk->src_ip4, src_ip4,
+	       sk->src_ip6[0], sk->src_ip6[1], sk->src_ip6[2], sk->src_ip6[3],
+	       src_ip6, sk->src_port,
+	       sk->dst_ip4, dst_ip4,
+	       sk->dst_ip6[0], sk->dst_ip6[1], sk->dst_ip6[2], sk->dst_ip6[3],
+	       dst_ip6, ntohs(sk->dst_port));
+}
+
+static void print_tp(const struct bpf_tcp_sock *tp)
+{
+	printf("snd_cwnd:%u srtt_us:%u rtt_min:%u snd_ssthresh:%u rcv_nxt:%u "
+	       "snd_nxt:%u snd:una:%u mss_cache:%u ecn_flags:%u "
+	       "rate_delivered:%u rate_interval_us:%u packets_out:%u "
+	       "retrans_out:%u total_retrans:%u segs_in:%u data_segs_in:%u "
+	       "segs_out:%u data_segs_out:%u lost_out:%u sacked_out:%u "
+	       "bytes_received:%llu bytes_acked:%llu\n",
+	       tp->snd_cwnd, tp->srtt_us, tp->rtt_min, tp->snd_ssthresh,
+	       tp->rcv_nxt, tp->snd_nxt, tp->snd_una, tp->mss_cache,
+	       tp->ecn_flags, tp->rate_delivered, tp->rate_interval_us,
+	       tp->packets_out, tp->retrans_out, tp->total_retrans,
+	       tp->segs_in, tp->data_segs_in, tp->segs_out,
+	       tp->data_segs_out, tp->lost_out, tp->sacked_out,
+	       tp->bytes_received, tp->bytes_acked);
+}
+
+static void check_result(void)
+{
+	struct bpf_tcp_sock srv_tp, cli_tp;
+	struct bpf_sock srv_sk, cli_sk;
+	__u32 linum, idx0 = 0;
+	int err;
+
+	err = bpf_map_lookup_elem(linum_map_fd, &idx0, &linum);
+	CHECK(err == -1, "bpf_map_lookup_elem(linum_map_fd)",
+	      "err:%d errno:%d", err, errno);
+
+	err = bpf_map_lookup_elem(sk_map_fd, &srv_idx, &srv_sk);
+	CHECK(err == -1, "bpf_map_lookup_elem(sk_map_fd, &srv_idx)",
+	      "err:%d errno:%d", err, errno);
+	err = bpf_map_lookup_elem(tp_map_fd, &srv_idx, &srv_tp);
+	CHECK(err == -1, "bpf_map_lookup_elem(tp_map_fd, &srv_idx)",
+	      "err:%d errno:%d", err, errno);
+
+	err = bpf_map_lookup_elem(sk_map_fd, &cli_idx, &cli_sk);
+	CHECK(err == -1, "bpf_map_lookup_elem(sk_map_fd, &cli_idx)",
+	      "err:%d errno:%d", err, errno);
+	err = bpf_map_lookup_elem(tp_map_fd, &cli_idx, &cli_tp);
+	CHECK(err == -1, "bpf_map_lookup_elem(tp_map_fd, &cli_idx)",
+	      "err:%d errno:%d", err, errno);
+
+	printf("srv_sk: ");
+	print_sk(&srv_sk);
+	printf("\n");
+
+	printf("cli_sk: ");
+	print_sk(&cli_sk);
+	printf("\n");
+
+	printf("srv_tp: ");
+	print_tp(&srv_tp);
+	printf("\n");
+
+	printf("cli_tp: ");
+	print_tp(&cli_tp);
+	printf("\n");
+
+	CHECK(srv_sk.state == 10 ||
+	      !srv_sk.state ||
+	      srv_sk.family != AF_INET6 ||
+	      srv_sk.protocol != IPPROTO_TCP ||
+	      memcmp(srv_sk.src_ip6, &in6addr_loopback,
+		     sizeof(srv_sk.src_ip6)) ||
+	      memcmp(srv_sk.dst_ip6, &in6addr_loopback,
+		     sizeof(srv_sk.dst_ip6)) ||
+	      srv_sk.src_port != ntohs(srv_sa6.sin6_port) ||
+	      srv_sk.dst_port != cli_sa6.sin6_port,
+	      "Unexpected srv_sk", "Check srv_sk output. linum:%u", linum);
+
+	CHECK(cli_sk.state == 10 ||
+	      !cli_sk.state ||
+	      cli_sk.family != AF_INET6 ||
+	      cli_sk.protocol != IPPROTO_TCP ||
+	      memcmp(cli_sk.src_ip6, &in6addr_loopback,
+		     sizeof(cli_sk.src_ip6)) ||
+	      memcmp(cli_sk.dst_ip6, &in6addr_loopback,
+		     sizeof(cli_sk.dst_ip6)) ||
+	      cli_sk.src_port != ntohs(cli_sa6.sin6_port) ||
+	      cli_sk.dst_port != srv_sa6.sin6_port,
+	      "Unexpected cli_sk", "Check cli_sk output. linum:%u", linum);
+
+	CHECK(srv_tp.data_segs_out != 1 ||
+	      srv_tp.data_segs_in ||
+	      srv_tp.snd_cwnd != 10 ||
+	      srv_tp.total_retrans ||
+	      srv_tp.bytes_acked != DATA_LEN,
+	      "Unexpected srv_tp", "Check srv_tp output. linum:%u", linum);
+
+	CHECK(cli_tp.data_segs_out ||
+	      cli_tp.data_segs_in != 1 ||
+	      cli_tp.snd_cwnd != 10 ||
+	      cli_tp.total_retrans ||
+	      cli_tp.bytes_received != DATA_LEN,
+	      "Unexpected cli_tp", "Check cli_tp output. linum:%u", linum);
+}
+
+static void test(void)
+{
+	int listen_fd, cli_fd, accept_fd, epfd, err;
+	struct epoll_event ev;
+	socklen_t addrlen;
+
+	addrlen = sizeof(struct sockaddr_in6);
+	ev.events = EPOLLIN;
+
+	epfd = epoll_create(1);
+	CHECK(epfd == -1, "epoll_create()", "epfd:%d errno:%d", epfd, errno);
+
+	/* Prepare listen_fd */
+	listen_fd = socket(AF_INET6, SOCK_STREAM | SOCK_NONBLOCK, 0);
+	CHECK(listen_fd == -1, "socket()", "listen_fd:%d errno:%d",
+	      listen_fd, errno);
+
+	init_loopback6(&srv_sa6);
+	err = bind(listen_fd, (struct sockaddr *)&srv_sa6, sizeof(srv_sa6));
+	CHECK(err, "bind(listen_fd)", "err:%d errno:%d", err, errno);
+
+	err = getsockname(listen_fd, (struct sockaddr *)&srv_sa6, &addrlen);
+	CHECK(err, "getsockname(listen_fd)", "err:%d errno:%d", err, errno);
+
+	err = listen(listen_fd, 1);
+	CHECK(err, "listen(listen_fd)", "err:%d errno:%d", err, errno);
+
+	/* Prepare cli_fd */
+	cli_fd = socket(AF_INET6, SOCK_STREAM | SOCK_NONBLOCK, 0);
+	CHECK(cli_fd == -1, "socket()", "cli_fd:%d errno:%d", cli_fd, errno);
+
+	init_loopback6(&cli_sa6);
+	err = bind(cli_fd, (struct sockaddr *)&cli_sa6, sizeof(cli_sa6));
+	CHECK(err, "bind(cli_fd)", "err:%d errno:%d", err, errno);
+
+	err = getsockname(cli_fd, (struct sockaddr *)&cli_sa6, &addrlen);
+	CHECK(err, "getsockname(cli_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	/* Update addr_map with srv_sa6 and cli_sa6 */
+	err = bpf_map_update_elem(addr_map_fd, &srv_idx, &srv_sa6, 0);
+	CHECK(err, "map_update", "err:%d errno:%d", err, errno);
+
+	err = bpf_map_update_elem(addr_map_fd, &cli_idx, &cli_sa6, 0);
+	CHECK(err, "map_update", "err:%d errno:%d", err, errno);
+
+	/* Connect from cli_sa6 to srv_sa6 */
+	err = connect(cli_fd, (struct sockaddr *)&srv_sa6, addrlen);
+	printf("srv_sa6.sin6_port:%u cli_sa6.sin6_port:%u\n\n",
+	       ntohs(srv_sa6.sin6_port), ntohs(cli_sa6.sin6_port));
+	CHECK(err && errno != EINPROGRESS,
+	      "connect(cli_fd)", "err:%d errno:%d", err, errno);
+
+	ev.data.fd = listen_fd;
+	err = epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
+	CHECK(err, "epoll_ctl(EPOLL_CTL_ADD, listen_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	/* Accept the connection */
+	/* Have some timeout in accept(listen_fd). Just in case. */
+	err = epoll_wait(epfd, &ev, 1, 1000);
+	CHECK(err != 1 || ev.data.fd != listen_fd,
+	      "epoll_wait(listen_fd)",
+	      "err:%d errno:%d ev.data.fd:%d listen_fd:%d",
+	      err, errno, ev.data.fd, listen_fd);
+
+	accept_fd = accept(listen_fd, NULL, NULL);
+	CHECK(accept_fd == -1, "accept(listen_fd)", "accept_fd:%d errno:%d",
+	      accept_fd, errno);
+	close(listen_fd);
+
+	/* Send some data from accept_fd to cli_fd */
+	err = send(accept_fd, DATA, DATA_LEN, 0);
+	CHECK(err != DATA_LEN, "send(accept_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	/* Have some timeout in recv(cli_fd). Just in case. */
+	ev.data.fd = cli_fd;
+	err = epoll_ctl(epfd, EPOLL_CTL_ADD, cli_fd, &ev);
+	CHECK(err, "epoll_ctl(EPOLL_CTL_ADD, cli_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	err = epoll_wait(epfd, &ev, 1, 1000);
+	CHECK(err != 1 || ev.data.fd != cli_fd,
+	      "epoll_wait(cli_fd)", "err:%d errno:%d ev.data.fd:%d cli_fd:%d",
+	      err, errno, ev.data.fd, cli_fd);
+
+	err = recv(cli_fd, NULL, 0, MSG_TRUNC);
+	CHECK(err, "recv(cli_fd)", "err:%d errno:%d", err, errno);
+
+	close(epfd);
+	close(accept_fd);
+	close(cli_fd);
+
+	check_result();
+}
+
+int main(int argc, char **argv)
+{
+	struct bpf_prog_load_attr attr = {
+		.file = "test_sock_fields_kern.o",
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+		.expected_attach_type = BPF_CGROUP_INET_EGRESS,
+	};
+	int cgroup_fd, prog_fd, err;
+	struct bpf_object *obj;
+	struct bpf_map *map;
+
+	err = setup_cgroup_environment();
+	CHECK(err, "setup_cgroup_environment()", "err:%d errno:%d",
+	      err, errno);
+
+	atexit(cleanup_cgroup_environment);
+
+	/* Create a cgroup, get fd, and join it */
+	cgroup_fd = create_and_get_cgroup(TEST_CGROUP);
+	CHECK(cgroup_fd == -1, "create_and_get_cgroup()",
+	      "cgroup_fd:%d errno:%d", cgroup_fd, errno);
+
+	err = join_cgroup(TEST_CGROUP);
+	CHECK(err, "join_cgroup", "err:%d errno:%d", err, errno);
+
+	err = bpf_prog_load_xattr(&attr, &obj, &prog_fd);
+	CHECK(err, "bpf_prog_load_xattr()", "err:%d", err);
+
+	err = bpf_prog_attach(prog_fd, cgroup_fd, BPF_CGROUP_INET_EGRESS, 0);
+	CHECK(err == -1, "bpf_prog_attach(CPF_CGROUP_INET_EGRESS)",
+	      "err:%d errno%d", err, errno);
+	close(cgroup_fd);
+
+	map = bpf_object__find_map_by_name(obj, "addr_map");
+	CHECK(!map, "cannot find addr_map", "(null)");
+	addr_map_fd = bpf_map__fd(map);
+
+	map = bpf_object__find_map_by_name(obj, "sock_result_map");
+	CHECK(!map, "cannot find sock_result_map", "(null)");
+	sk_map_fd = bpf_map__fd(map);
+
+	map = bpf_object__find_map_by_name(obj, "tcp_sock_result_map");
+	CHECK(!map, "cannot find tcp_sock_result_map", "(null)");
+	tp_map_fd = bpf_map__fd(map);
+
+	map = bpf_object__find_map_by_name(obj, "linum_map");
+	CHECK(!map, "cannot find linum_map", "(null)");
+	linum_map_fd = bpf_map__fd(map);
+
+	test();
+
+	bpf_object__close(obj);
+	cleanup_cgroup_environment();
+
+	printf("PASS\n");
+
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/test_sock_fields_kern.c b/tools/testing/selftests/bpf/test_sock_fields_kern.c
new file mode 100644
index 000000000000..de1a43e8f610
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sock_fields_kern.c
@@ -0,0 +1,152 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+
+#include <linux/bpf.h>
+#include <netinet/in.h>
+#include <stdbool.h>
+
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+enum bpf_array_idx {
+	SRV_IDX,
+	CLI_IDX,
+	__NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") addr_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct sockaddr_in6),
+	.max_entries = __NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") sock_result_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct bpf_sock),
+	.max_entries = __NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") tcp_sock_result_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct bpf_tcp_sock),
+	.max_entries = __NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") linum_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
+static bool is_loopback6(__u32 *a6)
+{
+	return !a6[0] && !a6[1] && !a6[2] && a6[3] == bpf_htonl(1);
+}
+
+static void skcpy(struct bpf_sock *dst,
+		  const struct bpf_sock *src)
+{
+	dst->bound_dev_if = src->bound_dev_if;
+	dst->family = src->family;
+	dst->type = src->type;
+	dst->protocol = src->protocol;
+	dst->mark = src->mark;
+	dst->priority = src->priority;
+	dst->src_ip4 = src->src_ip4;
+	dst->src_ip6[0] = src->src_ip6[0];
+	dst->src_ip6[1] = src->src_ip6[1];
+	dst->src_ip6[2] = src->src_ip6[2];
+	dst->src_ip6[3] = src->src_ip6[3];
+	dst->src_port = src->src_port;
+	dst->dst_ip4 = src->dst_ip4;
+	dst->dst_ip6[0] = src->dst_ip6[0];
+	dst->dst_ip6[1] = src->dst_ip6[1];
+	dst->dst_ip6[2] = src->dst_ip6[2];
+	dst->dst_ip6[3] = src->dst_ip6[3];
+	dst->dst_port = src->dst_port;
+	dst->state = src->state;
+}
+
+static void tpcpy(struct bpf_tcp_sock *dst,
+		  const struct bpf_tcp_sock *src)
+{
+	dst->snd_cwnd = src->snd_cwnd;
+	dst->srtt_us = src->srtt_us;
+	dst->rtt_min = src->rtt_min;
+	dst->snd_ssthresh = src->snd_ssthresh;
+	dst->rcv_nxt = src->rcv_nxt;
+	dst->snd_nxt = src->snd_nxt;
+	dst->snd_una = src->snd_una;
+	dst->mss_cache = src->mss_cache;
+	dst->ecn_flags = src->ecn_flags;
+	dst->rate_delivered = src->rate_delivered;
+	dst->rate_interval_us = src->rate_interval_us;
+	dst->packets_out = src->packets_out;
+	dst->retrans_out = src->retrans_out;
+	dst->total_retrans = src->total_retrans;
+	dst->segs_in = src->segs_in;
+	dst->data_segs_in = src->data_segs_in;
+	dst->segs_out = src->segs_out;
+	dst->data_segs_out = src->data_segs_out;
+	dst->lost_out = src->lost_out;
+	dst->sacked_out = src->sacked_out;
+	dst->bytes_received = src->bytes_received;
+	dst->bytes_acked = src->bytes_acked;
+}
+
+#define RETURN {						\
+	linum = __LINE__;					\
+	bpf_map_update_elem(&linum_map, &idx0, &linum, 0);	\
+	return 1;						\
+}
+
+SEC("cgroup_skb/egress")
+int read_sock_fields(struct __sk_buff *skb)
+{
+	__u32 srv_idx = SRV_IDX, cli_idx = CLI_IDX, idx;
+	struct sockaddr_in6 *srv_sa6, *cli_sa6;
+	struct bpf_tcp_sock *tp, *tp_ret;
+	struct bpf_sock *sk, *sk_ret;
+	__u32 linum, idx0 = 0;
+
+	sk = skb->sk;
+	if (!sk || sk->state == 10)
+		RETURN;
+
+	sk = bpf_sk_fullsock(sk);
+	if (!sk || sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP ||
+	    !is_loopback6(sk->src_ip6))
+		RETURN;
+
+	tp = bpf_tcp_sock(sk);
+	if (!tp)
+		RETURN;
+
+	srv_sa6 = bpf_map_lookup_elem(&addr_map, &srv_idx);
+	cli_sa6 = bpf_map_lookup_elem(&addr_map, &cli_idx);
+	if (!srv_sa6 || !cli_sa6)
+		RETURN;
+
+	if (sk->src_port == bpf_ntohs(srv_sa6->sin6_port))
+		idx = srv_idx;
+	else if (sk->src_port == bpf_ntohs(cli_sa6->sin6_port))
+		idx = cli_idx;
+	else
+		RETURN;
+
+	sk_ret = bpf_map_lookup_elem(&sock_result_map, &idx);
+	tp_ret = bpf_map_lookup_elem(&tcp_sock_result_map, &idx);
+	if (!sk_ret || !tp_ret)
+		RETURN;
+
+	skcpy(sk_ret, sk);
+	tpcpy(tp_ret, tp);
+
+	RETURN;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 bpf-next 3/7] bpf: Refactor sock_ops_convert_ctx_access
From: Martin KaFai Lau @ 2019-02-10  7:22 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190210072220.1530061-1-kafai@fb.com>

The next patch will introduce a new "struct bpf_tcp_sock" which
exposes the same tcp_sock's fields already exposed in
"struct bpf_sock_ops".

This patch refactor the existing convert_ctx_access() codes for
"struct bpf_sock_ops" to get them ready to be reused for
"struct bpf_tcp_sock".  The "rtt_min" is not refactored
in this patch because its handling is different from other
fields.

The SOCK_OPS_GET_TCP_SOCK_FIELD is new. All other SOCK_OPS_XXX_FIELD
changes are code move only.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/filter.c | 287 ++++++++++++++++++++--------------------------
 1 file changed, 127 insertions(+), 160 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 01bb64bf2b5e..c0d7b9ef279f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5030,6 +5030,54 @@ static const struct bpf_func_proto bpf_lwt_seg6_adjust_srh_proto = {
 };
 #endif /* CONFIG_IPV6_SEG6_BPF */
 
+#define CONVERT_COMMON_TCP_SOCK_FIELDS(md_type, CONVERT)		\
+do {									\
+	switch (si->off) {						\
+	case offsetof(md_type, snd_cwnd):				\
+		CONVERT(snd_cwnd); break;				\
+	case offsetof(md_type, srtt_us):				\
+		CONVERT(srtt_us); break;				\
+	case offsetof(md_type, snd_ssthresh):				\
+		CONVERT(snd_ssthresh); break;				\
+	case offsetof(md_type, rcv_nxt):				\
+		CONVERT(rcv_nxt); break;				\
+	case offsetof(md_type, snd_nxt):				\
+		CONVERT(snd_nxt); break;				\
+	case offsetof(md_type, snd_una):				\
+		CONVERT(snd_una); break;				\
+	case offsetof(md_type, mss_cache):				\
+		CONVERT(mss_cache); break;				\
+	case offsetof(md_type, ecn_flags):				\
+		CONVERT(ecn_flags); break;				\
+	case offsetof(md_type, rate_delivered):				\
+		CONVERT(rate_delivered); break;				\
+	case offsetof(md_type, rate_interval_us):			\
+		CONVERT(rate_interval_us); break;			\
+	case offsetof(md_type, packets_out):				\
+		CONVERT(packets_out); break;				\
+	case offsetof(md_type, retrans_out):				\
+		CONVERT(retrans_out); break;				\
+	case offsetof(md_type, total_retrans):				\
+		CONVERT(total_retrans); break;				\
+	case offsetof(md_type, segs_in):				\
+		CONVERT(segs_in); break;				\
+	case offsetof(md_type, data_segs_in):				\
+		CONVERT(data_segs_in); break;				\
+	case offsetof(md_type, segs_out):				\
+		CONVERT(segs_out); break;				\
+	case offsetof(md_type, data_segs_out):				\
+		CONVERT(data_segs_out); break;				\
+	case offsetof(md_type, lost_out):				\
+		CONVERT(lost_out); break;				\
+	case offsetof(md_type, sacked_out):				\
+		CONVERT(sacked_out); break;				\
+	case offsetof(md_type, bytes_received):				\
+		CONVERT(bytes_received); break;				\
+	case offsetof(md_type, bytes_acked):				\
+		CONVERT(bytes_acked); break;				\
+	}								\
+} while (0)
+
 #ifdef CONFIG_INET
 static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 			      int dif, int sdif, u8 family, u8 proto)
@@ -7196,6 +7244,85 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 	struct bpf_insn *insn = insn_buf;
 	int off;
 
+/* Helper macro for adding read access to tcp_sock or sock fields. */
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
+	do {								      \
+		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
+			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern,     \
+						is_fullsock),		      \
+				      si->dst_reg, si->src_reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       is_fullsock));		      \
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 2);	      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern, sk),\
+				      si->dst_reg, si->src_reg,		      \
+				      offsetof(struct bpf_sock_ops_kern, sk));\
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,		      \
+						       OBJ_FIELD),	      \
+				      si->dst_reg, si->dst_reg,		      \
+				      offsetof(OBJ, OBJ_FIELD));	      \
+	} while (0)
+
+#define SOCK_OPS_GET_TCP_SOCK_FIELD(FIELD) \
+		SOCK_OPS_GET_FIELD(FIELD, FIELD, struct tcp_sock)
+
+/* Helper macro for adding write access to tcp_sock or sock fields.
+ * The macro is called with two registers, dst_reg which contains a pointer
+ * to ctx (context) and src_reg which contains the value that should be
+ * stored. However, we need an additional register since we cannot overwrite
+ * dst_reg because it may be used later in the program.
+ * Instead we "borrow" one of the other register. We first save its value
+ * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
+ * it at the end of the macro.
+ */
+#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
+	do {								      \
+		int reg = BPF_REG_9;					      \
+		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
+			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+		if (si->dst_reg == reg || si->src_reg == reg)		      \
+			reg--;						      \
+		if (si->dst_reg == reg || si->src_reg == reg)		      \
+			reg--;						      \
+		*insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       temp));			      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern,     \
+						is_fullsock),		      \
+				      reg, si->dst_reg,			      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       is_fullsock));		      \
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);		      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern, sk),\
+				      reg, si->dst_reg,			      \
+				      offsetof(struct bpf_sock_ops_kern, sk));\
+		*insn++ = BPF_STX_MEM(BPF_FIELD_SIZEOF(OBJ, OBJ_FIELD),	      \
+				      reg, si->src_reg,			      \
+				      offsetof(OBJ, OBJ_FIELD));	      \
+		*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->dst_reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       temp));			      \
+	} while (0)
+
+#define SOCK_OPS_GET_OR_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ, TYPE)	      \
+	do {								      \
+		if (TYPE == BPF_WRITE)					      \
+			SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
+		else							      \
+			SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
+	} while (0)
+
+	CONVERT_COMMON_TCP_SOCK_FIELDS(struct bpf_sock_ops,
+				       SOCK_OPS_GET_TCP_SOCK_FIELD);
+
+	if (insn > insn_buf)
+		return insn - insn_buf;
+
 	switch (si->off) {
 	case offsetof(struct bpf_sock_ops, op) ...
 	     offsetof(struct bpf_sock_ops, replylong[3]):
@@ -7353,175 +7480,15 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 				      FIELD_SIZEOF(struct minmax_sample, t));
 		break;
 
-/* Helper macro for adding read access to tcp_sock or sock fields. */
-#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
-	do {								      \
-		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
-			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern,     \
-						is_fullsock),		      \
-				      si->dst_reg, si->src_reg,		      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       is_fullsock));		      \
-		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 2);	      \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern, sk),\
-				      si->dst_reg, si->src_reg,		      \
-				      offsetof(struct bpf_sock_ops_kern, sk));\
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,		      \
-						       OBJ_FIELD),	      \
-				      si->dst_reg, si->dst_reg,		      \
-				      offsetof(OBJ, OBJ_FIELD));	      \
-	} while (0)
-
-/* Helper macro for adding write access to tcp_sock or sock fields.
- * The macro is called with two registers, dst_reg which contains a pointer
- * to ctx (context) and src_reg which contains the value that should be
- * stored. However, we need an additional register since we cannot overwrite
- * dst_reg because it may be used later in the program.
- * Instead we "borrow" one of the other register. We first save its value
- * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
- * it at the end of the macro.
- */
-#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
-	do {								      \
-		int reg = BPF_REG_9;					      \
-		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
-			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
-		if (si->dst_reg == reg || si->src_reg == reg)		      \
-			reg--;						      \
-		if (si->dst_reg == reg || si->src_reg == reg)		      \
-			reg--;						      \
-		*insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,		      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       temp));			      \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern,     \
-						is_fullsock),		      \
-				      reg, si->dst_reg,			      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       is_fullsock));		      \
-		*insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);		      \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern, sk),\
-				      reg, si->dst_reg,			      \
-				      offsetof(struct bpf_sock_ops_kern, sk));\
-		*insn++ = BPF_STX_MEM(BPF_FIELD_SIZEOF(OBJ, OBJ_FIELD),	      \
-				      reg, si->src_reg,			      \
-				      offsetof(OBJ, OBJ_FIELD));	      \
-		*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->dst_reg,		      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       temp));			      \
-	} while (0)
-
-#define SOCK_OPS_GET_OR_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ, TYPE)	      \
-	do {								      \
-		if (TYPE == BPF_WRITE)					      \
-			SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
-		else							      \
-			SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
-	} while (0)
-
-	case offsetof(struct bpf_sock_ops, snd_cwnd):
-		SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, srtt_us):
-		SOCK_OPS_GET_FIELD(srtt_us, srtt_us, struct tcp_sock);
-		break;
-
 	case offsetof(struct bpf_sock_ops, bpf_sock_ops_cb_flags):
 		SOCK_OPS_GET_FIELD(bpf_sock_ops_cb_flags, bpf_sock_ops_cb_flags,
 				   struct tcp_sock);
 		break;
 
-	case offsetof(struct bpf_sock_ops, snd_ssthresh):
-		SOCK_OPS_GET_FIELD(snd_ssthresh, snd_ssthresh, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, rcv_nxt):
-		SOCK_OPS_GET_FIELD(rcv_nxt, rcv_nxt, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, snd_nxt):
-		SOCK_OPS_GET_FIELD(snd_nxt, snd_nxt, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, snd_una):
-		SOCK_OPS_GET_FIELD(snd_una, snd_una, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, mss_cache):
-		SOCK_OPS_GET_FIELD(mss_cache, mss_cache, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, ecn_flags):
-		SOCK_OPS_GET_FIELD(ecn_flags, ecn_flags, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, rate_delivered):
-		SOCK_OPS_GET_FIELD(rate_delivered, rate_delivered,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, rate_interval_us):
-		SOCK_OPS_GET_FIELD(rate_interval_us, rate_interval_us,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, packets_out):
-		SOCK_OPS_GET_FIELD(packets_out, packets_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, retrans_out):
-		SOCK_OPS_GET_FIELD(retrans_out, retrans_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, total_retrans):
-		SOCK_OPS_GET_FIELD(total_retrans, total_retrans,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, segs_in):
-		SOCK_OPS_GET_FIELD(segs_in, segs_in, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, data_segs_in):
-		SOCK_OPS_GET_FIELD(data_segs_in, data_segs_in, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, segs_out):
-		SOCK_OPS_GET_FIELD(segs_out, segs_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, data_segs_out):
-		SOCK_OPS_GET_FIELD(data_segs_out, data_segs_out,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, lost_out):
-		SOCK_OPS_GET_FIELD(lost_out, lost_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, sacked_out):
-		SOCK_OPS_GET_FIELD(sacked_out, sacked_out, struct tcp_sock);
-		break;
-
 	case offsetof(struct bpf_sock_ops, sk_txhash):
 		SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
 					  struct sock, type);
 		break;
-
-	case offsetof(struct bpf_sock_ops, bytes_received):
-		SOCK_OPS_GET_FIELD(bytes_received, bytes_received,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, bytes_acked):
-		SOCK_OPS_GET_FIELD(bytes_acked, bytes_acked, struct tcp_sock);
-		break;
-
 	}
 	return insn - insn_buf;
 }
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 bpf-next 5/7] bpf: Sync bpf.h to tools/
From: Martin KaFai Lau @ 2019-02-10  7:22 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo
In-Reply-To: <20190210072220.1530061-1-kafai@fb.com>

This patch sync the uapi bpf.h to tools/.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/include/uapi/linux/bpf.h | 72 ++++++++++++++++++++++++++++++----
 1 file changed, 65 insertions(+), 7 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1777fa0c61e4..25c8c0e62ecf 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2329,6 +2329,23 @@ union bpf_attr {
  *		"**y**".
  *	Return
  *		0
+ *
+ * struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)
+ *	Description
+ *		This helper gets a **struct bpf_sock** pointer such
+ *		that all the fields in bpf_sock can be accessed.
+ *	Return
+ *		A **struct bpf_sock** pointer on success, or NULL in
+ *		case of failure.
+ *
+ * struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk)
+ *	Description
+ *		This helper gets a **struct bpf_tcp_sock** pointer from a
+ *		**struct bpf_sock** pointer.
+ *
+ *	Return
+ *		A **struct bpf_tcp_sock** pointer on success, or NULL in
+ *		case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2425,7 +2442,9 @@ union bpf_attr {
 	FN(msg_pop_data),		\
 	FN(rc_pointer_rel),		\
 	FN(spin_lock),			\
-	FN(spin_unlock),
+	FN(spin_unlock),		\
+	FN(sk_fullsock),		\
+	FN(tcp_sock),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2545,6 +2564,7 @@ struct __sk_buff {
 	__u64 tstamp;
 	__u32 wire_len;
 	__u32 gso_segs;
+	__bpf_md_ptr(struct bpf_sock *, sk);
 };
 
 struct bpf_tunnel_key {
@@ -2596,14 +2616,52 @@ struct bpf_sock {
 	__u32 protocol;
 	__u32 mark;
 	__u32 priority;
-	__u32 src_ip4;		/* Allows 1,2,4-byte read.
-				 * Stored in network byte order.
+	/* IP address also allows 1 and 2 bytes access */
+	__u32 src_ip4;
+	__u32 src_ip6[4];
+	__u32 src_port;		/* host byte order */
+	__u32 dst_port;		/* network byte order */
+	__u32 dst_ip4;
+	__u32 dst_ip6[4];
+	__u32 state;
+};
+
+struct bpf_tcp_sock {
+	__u32 snd_cwnd;		/* Sending congestion window		*/
+	__u32 srtt_us;		/* smoothed round trip time << 3 in usecs */
+	__u32 rtt_min;
+	__u32 snd_ssthresh;	/* Slow start size threshold		*/
+	__u32 rcv_nxt;		/* What we want to receive next		*/
+	__u32 snd_nxt;		/* Next sequence we send		*/
+	__u32 snd_una;		/* First byte we want an ack for	*/
+	__u32 mss_cache;	/* Cached effective mss, not including SACKS */
+	__u32 ecn_flags;	/* ECN status bits.			*/
+	__u32 rate_delivered;	/* saved rate sample: packets delivered */
+	__u32 rate_interval_us;	/* saved rate sample: time elapsed */
+	__u32 packets_out;	/* Packets which are "in flight"	*/
+	__u32 retrans_out;	/* Retransmitted packets out		*/
+	__u32 total_retrans;	/* Total retransmits for entire connection */
+	__u32 segs_in;		/* RFC4898 tcpEStatsPerfSegsIn
+				 * total number of segments in.
 				 */
-	__u32 src_ip6[4];	/* Allows 1,2,4-byte read.
-				 * Stored in network byte order.
+	__u32 data_segs_in;	/* RFC4898 tcpEStatsPerfDataSegsIn
+				 * total number of data segments in.
+				 */
+	__u32 segs_out;		/* RFC4898 tcpEStatsPerfSegsOut
+				 * The total number of segments sent.
+				 */
+	__u32 data_segs_out;	/* RFC4898 tcpEStatsPerfDataSegsOut
+				 * total number of data segments sent.
+				 */
+	__u32 lost_out;		/* Lost packets			*/
+	__u32 sacked_out;	/* SACK'd packets			*/
+	__u64 bytes_received;	/* RFC4898 tcpEStatsAppHCThruOctetsReceived
+				 * sum(delta(rcv_nxt)), or how many bytes
+				 * were acked.
 				 */
-	__u32 src_port;		/* Allows 4-byte read.
-				 * Stored in host byte order
+	__u64 bytes_acked;	/* RFC4898 tcpEStatsAppHCThruOctetsAcked
+				 * sum(delta(snd_una)), or how many bytes
+				 * were acked.
 				 */
 };
 
-- 
2.17.1


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox