Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next] net: ethernet: Sort Kconfig sourcing alphabetically
From: Florian Fainelli @ 2018-05-21  3:58 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, David S. Miller, Arnd Bergmann, Andrew Lunn,
	Aviad Krawczyk, Jassi Brar, Kunihiko Hayashi, Moritz Fischer,
	Linus Walleij, Alexandre Belloni, open list

A number of entries were not alphabetically sorted, remedy that.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/ethernet/Kconfig | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
index 54d71e1c48d5..af766fd61151 100644
--- a/drivers/net/ethernet/Kconfig
+++ b/drivers/net/ethernet/Kconfig
@@ -33,9 +33,9 @@ source "drivers/net/ethernet/aquantia/Kconfig"
 source "drivers/net/ethernet/arc/Kconfig"
 source "drivers/net/ethernet/atheros/Kconfig"
 source "drivers/net/ethernet/aurora/Kconfig"
-source "drivers/net/ethernet/cadence/Kconfig"
 source "drivers/net/ethernet/broadcom/Kconfig"
 source "drivers/net/ethernet/brocade/Kconfig"
+source "drivers/net/ethernet/cadence/Kconfig"
 source "drivers/net/ethernet/calxeda/Kconfig"
 source "drivers/net/ethernet/cavium/Kconfig"
 source "drivers/net/ethernet/chelsio/Kconfig"
@@ -72,16 +72,16 @@ source "drivers/net/ethernet/dec/Kconfig"
 source "drivers/net/ethernet/dlink/Kconfig"
 source "drivers/net/ethernet/emulex/Kconfig"
 source "drivers/net/ethernet/ezchip/Kconfig"
-source "drivers/net/ethernet/neterion/Kconfig"
 source "drivers/net/ethernet/faraday/Kconfig"
 source "drivers/net/ethernet/freescale/Kconfig"
 source "drivers/net/ethernet/fujitsu/Kconfig"
 source "drivers/net/ethernet/hisilicon/Kconfig"
 source "drivers/net/ethernet/hp/Kconfig"
 source "drivers/net/ethernet/huawei/Kconfig"
+source "drivers/net/ethernet/i825xx/Kconfig"
 source "drivers/net/ethernet/ibm/Kconfig"
 source "drivers/net/ethernet/intel/Kconfig"
-source "drivers/net/ethernet/i825xx/Kconfig"
+source "drivers/net/ethernet/neterion/Kconfig"
 source "drivers/net/ethernet/xscale/Kconfig"
 
 config JME
@@ -114,8 +114,8 @@ source "drivers/net/ethernet/mediatek/Kconfig"
 source "drivers/net/ethernet/mellanox/Kconfig"
 source "drivers/net/ethernet/micrel/Kconfig"
 source "drivers/net/ethernet/microchip/Kconfig"
-source "drivers/net/ethernet/mscc/Kconfig"
 source "drivers/net/ethernet/moxa/Kconfig"
+source "drivers/net/ethernet/mscc/Kconfig"
 source "drivers/net/ethernet/myricom/Kconfig"
 
 config FEALNX
@@ -161,20 +161,21 @@ source "drivers/net/ethernet/packetengines/Kconfig"
 source "drivers/net/ethernet/pasemi/Kconfig"
 source "drivers/net/ethernet/qlogic/Kconfig"
 source "drivers/net/ethernet/qualcomm/Kconfig"
+source "drivers/net/ethernet/rdc/Kconfig"
 source "drivers/net/ethernet/realtek/Kconfig"
 source "drivers/net/ethernet/renesas/Kconfig"
-source "drivers/net/ethernet/rdc/Kconfig"
 source "drivers/net/ethernet/rocker/Kconfig"
 source "drivers/net/ethernet/samsung/Kconfig"
 source "drivers/net/ethernet/seeq/Kconfig"
-source "drivers/net/ethernet/silan/Kconfig"
-source "drivers/net/ethernet/sis/Kconfig"
 source "drivers/net/ethernet/sfc/Kconfig"
 source "drivers/net/ethernet/sgi/Kconfig"
+source "drivers/net/ethernet/silan/Kconfig"
+source "drivers/net/ethernet/sis/Kconfig"
 source "drivers/net/ethernet/smsc/Kconfig"
 source "drivers/net/ethernet/socionext/Kconfig"
 source "drivers/net/ethernet/stmicro/Kconfig"
 source "drivers/net/ethernet/sun/Kconfig"
+source "drivers/net/ethernet/synopsys/Kconfig"
 source "drivers/net/ethernet/tehuti/Kconfig"
 source "drivers/net/ethernet/ti/Kconfig"
 source "drivers/net/ethernet/toshiba/Kconfig"
@@ -183,6 +184,5 @@ source "drivers/net/ethernet/via/Kconfig"
 source "drivers/net/ethernet/wiznet/Kconfig"
 source "drivers/net/ethernet/xilinx/Kconfig"
 source "drivers/net/ethernet/xircom/Kconfig"
-source "drivers/net/ethernet/synopsys/Kconfig"
 
 endif # ETHERNET
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next] net: phy: phylink: Don't release NULL GPIO
From: Florian Fainelli @ 2018-05-21  3:49 UTC (permalink / raw)
  To: netdev
  Cc: rmk+kernel, Florian Fainelli, Andrew Lunn, David S. Miller,
	open list

If CONFIG_GPIOLIB is disabled, gpiod_put() becomes a stub that produces a
warning, this helped identify that we could be attempting to release a NULL
pl->link_gpio GPIO descriptor, so guard against that.

Fixes: daab3349ad1a ("net: phy: phylink: Release link GPIO")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/phy/phylink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index 581ce93ecaf9..af4dc4425be2 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -624,7 +624,7 @@ void phylink_destroy(struct phylink *pl)
 {
 	if (pl->sfp_bus)
 		sfp_unregister_upstream(pl->sfp_bus);
-	if (!IS_ERR(pl->link_gpio))
+	if (!IS_ERR_OR_NULL(pl->link_gpio))
 		gpiod_put(pl->link_gpio);
 
 	cancel_work_sync(&pl->resolve);
-- 
2.14.1

^ permalink raw reply related

* System hung for reg_check_changs_work()-> rtnl_lock()->mutex_lock()
From: Shawn Lin @ 2018-05-21  3:47 UTC (permalink / raw)
  To: netdev; +Cc: shawn.lin, Dmitry Vyukov, David S. Miller, syzkaller-bugs

Hi,

I found this hung for several times these days, and seems syzbot already
reported a similar problem. Is there any patch(es) for that?

Successfully initialized wpa_supplicant
[  240.091941] INFO: task kworker/u8:1:39 blocked for more than 120 seconds.
[  240.092004]       Not tainted 4.4.126 #1
[  240.092026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  240.092047] kworker/u8:1    D ffffff8008084dfc     0    39      2 
0x00000000
[  240.092116] Workqueue: events_power_efficient reg_check_chans_work
[  240.092153] Call trace:
[  240.092191] [<ffffff8008084dfc>] __switch_to+0x84/0xa0
[  240.092228] [<ffffff80086299f0>] __schedule+0x428/0x45c
[  240.092260] [<ffffff8008629a98>] schedule+0x74/0x94
[  240.092295] [<ffffff8008629e44>] schedule_preempt_disabled+0x20/0x38
[  240.092332] [<ffffff800862b13c>] __mutex_lock_slowpath+0xc0/0x138
[  240.092364] [<ffffff800862b1e0>] mutex_lock+0x2c/0x40
[  240.092399] [<ffffff80084eb820>] rtnl_lock+0x14/0x1c
[  240.092428] [<ffffff80085ce2a0>] reg_check_chans_work+0x2c/0x1f0
[  240.092463] [<ffffff80080abbac>] process_one_work+0x1b0/0x294
[  240.092494] [<ffffff80080ac904>] worker_thread+0x2d8/0x398
[  240.092524] [<ffffff80080b0f04>] kthread+0xc8/0xd8
[  240.092567] [<ffffff8008082e80>] ret_from_fork+0x10/0x50
[  240.092594] Kernel panic - not syncing: hung_task: blocked tasks
[  240.101163] CPU: 0 PID: 30 Comm: khungtaskd Not tainted 4.4.126 #1
[  240.101729] Hardware name: Rockchip RK3308 evb analog mic board (DT)
[  240.102302] Call trace:
[  240.102546] [<ffffff800808731c>] dump_backtrace+0x0/0x1c4
[  240.103044] [<ffffff80080874f4>] show_stack+0x14/0x1c
[  240.103521] [<ffffff800823f4b4>] dump_stack+0x94/0xbc
[  240.104000] [<ffffff800810a80c>] panic+0xd8/0x228
[  240.104446] [<ffffff80080fb228>] proc_dohung_task_timeout_secs+0x0/0x40
[  240.105050] [<ffffff80080b0f04>] kthread+0xc8/0xd8
[  240.105500] [<ffffff8008082e80>] ret_from_fork+0x10/0x50
[  240.106065] CPU1: stopping
[  240.106348] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.4.126 #1

^ permalink raw reply

* Re: [PATCH v2] Revert "alx: remove WoL support"
From: David Miller @ 2018-05-21  3:18 UTC (permalink / raw)
  To: acelan.kao
  Cc: jcliburn, chris.snook, rakesh, netdev, emily.chien, andrew,
	linux-kernel
In-Reply-To: <CAFv23Q=RGo5Tw4sL5GiYM69xeGUPReiBE290rnzCvOPezQinCg@mail.gmail.com>

From: AceLan Kao <acelan.kao@canonical.com>
Date: Mon, 21 May 2018 11:14:00 +0800

> We are willing to fix the issue, but we don't have a machine to
> reproduce it, and the WoL feature has been removed 5 years ago, it's
> hard to find those buggy machines.

Have you bothered to ask the person who did the revert?

> WoL is a feature that is only used by a very small group of people,
> and the wake up issue looks like only happens on some
> platforms. Which means only small part of the group of people are
> affected.

One of those people was the wireless networking stack maintainer.

> So, it's not a serious issue worth to remove it from alx driver.

I disagree.

You must fix the regression solved by the revert, before adding
WoL support back to the driver.

I'm not going to say this again.

Thank you.

^ permalink raw reply

* Re: [PATCH v2] Revert "alx: remove WoL support"
From: AceLan Kao @ 2018-05-21  3:14 UTC (permalink / raw)
  To: David Miller
  Cc: James Cliburn, Chris Snook, rakesh, netdev, Emily Chien,
	Andrew Lunn, Linux-Kernel@Vger. Kernel. Org
In-Reply-To: <20180514.093504.1396660355199793729.davem@davemloft.net>

Hi David,

We are willing to fix the issue, but we don't have a machine to reproduce it,
and the WoL feature has been removed 5 years ago, it's hard to find
those buggy machines.

WoL is a feature that is only used by a very small group of people,
and the wake up issue
looks like only happens on some platforms. Which means only small part
of the group of people are affected.
So, it's not a serious issue worth to remove it from alx driver.

As the commit describes, WoL is required to pass E-Start 6.1, and
taking secure boot into account,
we can't keep distributing "alx driver with WoL" dkms package, so we
really need this feature to be built in the kernel.

There are some solutions to fix it.
1. Add WoL feature back, and we will try our best to fix the wake up
issue if we encounter it or users report it.
2. Add WoL feature back and add an driver option to disable it by
default, so that it won't create any regression and user can enable it
by kernel cmdline.
3. Add WoL feature back and create a white list in the driver, we'll
add those platforms we tested to the list.
4. or create a blacklist to list machines which are reported buggy.
Could you let me know which solution is more feasible for you?
Thanks.

Best regards,
AceLan Kao.

2018-05-14 21:35 GMT+08:00 David Miller <davem@davemloft.net>:
> From: AceLan Kao <acelan.kao@canonical.com>
> Date: Mon, 14 May 2018 11:28:39 +0800
>
>> This reverts commit bc2bebe8de8ed4ba6482c9cc370b0dd72ffe8cd2.
>>
>> The WoL feature is a must to pass Energy Star 6.1 and above,
>> the power consumption will be measured during S3 with WoL is enabled.
>>
>> Reverting "alx: remove WoL support", and will try to fix the unintentional
>> wake up issue when WoL is enabled.
>>
>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=61651
>>
>> Signed-off-by: AceLan Kao <acelan.kao@canonical.com>
>
> First, we must fix the problem that caused WoL to be disabled.
>
> Then, and only then, can you re-enable it.
>
> Thank you.

^ permalink raw reply

* Re: [RFC v4 3/5] virtio_ring: add packed ring support
From: Tiwei Bie @ 2018-05-21  2:39 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, virtualization, linux-kernel, netdev, wexu, jfreimann
In-Reply-To: <b8c0d147-2711-cb26-a084-e7d3d32d0002@redhat.com>

On Mon, May 21, 2018 at 10:30:51AM +0800, Jason Wang wrote:
> On 2018年05月19日 10:29, Tiwei Bie wrote:
> > > I don't hope so.
> > > 
> > > > I agreed driver should track the DMA addrs or some
> > > > other necessary things from the very beginning. And
> > > > I also repeated the spec to emphasize that it does
> > > > make sense. And I'd like to do that.
> > > > 
> > > > What I was saying is that, to support OOO, we may
> > > > need to manage these context (which saves DMA addrs
> > > > etc) via a list which is similar to the desc list
> > > > maintained via `next` in split ring instead of an
> > > > array whose elements always can be indexed directly.
> > > My point is these context is a must (not only for OOO).
> > Yeah, and I have the exactly same point after you
> > pointed that I shouldn't get the addrs from descs.
> > I do think it makes sense. I'll do it in the next
> > version. I don't have any doubt about it. All my
> > questions are about the OOO, instead of whether we
> > should save context or not. It just seems that you
> > thought I don't want to do it, and were trying to
> > convince me that I should do it.
> 
> Right, but looks like I was wrong :)
> 
> > 
> > > > The desc ring in split ring is an array, but its
> > > > free entries are managed as list via next. I was
> > > > just wondering, do we want to manage such a list
> > > > because of OOO. It's just a very simple question
> > > > that I want to hear your opinion... (It doesn't
> > > > means anything, e.g. It doesn't mean I don't want
> > > > to support OOO. It's just a simple question...)
> > > So the question is yes. But I admit I don't have better idea other than what
> > > you propose here (something like split ring which is a little bit sad).
> > > Maybe Michael had.
> > Yeah, that's why I asked this question. It will
> > make the packed ring a bit similar to split ring
> > at least in the driver part. So I want to draw
> > your attention on this to make sure that we're
> > on the same page.
> 
> Yes. I think we are.

Cool. Glad to hear that! Thanks! :)

Best regards,
Tiwei Bie

> 
> Thanks
> 
> > Best regards,
> > Tiwei Bie
> > 
> 

^ permalink raw reply

* Re: KASAN: use-after-free Read in vhost_chr_write_iter
From: Jason Wang @ 2018-05-21  2:38 UTC (permalink / raw)
  To: DaeRyong Jeong, mst
  Cc: bammanag, kt0755, kvm, netdev, linux-kernel, virtualization,
	byoungyoung
In-Reply-To: <58419d62-3074-2e5a-8504-da1cdeb08280@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 1988 bytes --]



On 2018年05月18日 17:24, Jason Wang wrote:
>
>
> On 2018年05月17日 21:45, DaeRyong Jeong wrote:
>> We report the crash: KASAN: use-after-free Read in vhost_chr_write_iter
>>
>> This crash has been found in v4.17-rc1 using RaceFuzzer (a modified
>> version of Syzkaller), which we describe more at the end of this
>> report. Our analysis shows that the race occurs when invoking two
>> syscalls concurrently, write$vnet and ioctl$VHOST_RESET_OWNER.
>>
>>
>> Analysis:
>> We think the concurrent execution of vhost_process_iotlb_msg() and
>> vhost_dev_cleanup() causes the crash.
>> Both of functions can run concurrently (please see call sequence below),
>> and possibly, there is a race on dev->iotlb.
>> If the switch occurs right after vhost_dev_cleanup() frees
>> dev->iotlb, vhost_process_iotlb_msg() still sees the non-null value 
>> and it
>> keep executing without returning -EFAULT. Consequently, use-after-free
>> occures
>>
>>
>> Thread interleaving:
>> CPU0 (vhost_process_iotlb_msg)                CPU1 (vhost_dev_cleanup)
>> (In the case of both VHOST_IOTLB_UPDATE and
>> VHOST_IOTLB_INVALIDATE)
>> =====                            =====
>>                             vhost_umem_clean(dev->iotlb);
>> if (!dev->iotlb) {
>>             ret = -EFAULT;
>>                 break;
>> }
>>                             dev->iotlb = NULL;
>>
>>
>> Call Sequence:
>> CPU0
>> =====
>> vhost_net_chr_write_iter
>>     vhost_chr_write_iter
>>         vhost_process_iotlb_msg
>>
>> CPU1
>> =====
>> vhost_net_ioctl
>>     vhost_net_reset_owner
>>         vhost_dev_reset_owner
>>             vhost_dev_cleanup
>
> Thanks a lot for the analysis.
>
> This could be addressed by simply protect it with dev mutex.
>
> Will post a patch.
>

Could you please help to test the attached patch? I've done some smoking 
test.

Thanks

[-- Attachment #2: 0001-vhost-synchronize-IOTLB-message-with-dev-cleanup.patch --]
[-- Type: text/x-patch, Size: 1508 bytes --]

>From 88328386f3f652e684ee33dc4cf63dcaed871aea Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang@redhat.com>
Date: Fri, 18 May 2018 17:33:27 +0800
Subject: [PATCH] vhost: synchronize IOTLB message with dev cleanup

DaeRyong Jeong reports a race between vhost_dev_cleanup() and
vhost_process_iotlb_msg():

Thread interleaving:
CPU0 (vhost_process_iotlb_msg)			CPU1 (vhost_dev_cleanup)
(In the case of both VHOST_IOTLB_UPDATE and
VHOST_IOTLB_INVALIDATE)
=====						=====
						vhost_umem_clean(dev->iotlb);
if (!dev->iotlb) {
	        ret = -EFAULT;
		        break;
}
						dev->iotlb = NULL;

The reason is we don't synchronize between them, fixing by protecting
vhost_process_iotlb_msg() with dev mutex.

Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
---
 drivers/vhost/vhost.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f3bd8e9..f0be5f3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -981,6 +981,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 {
 	int ret = 0;
 
+	mutex_lock(&dev->mutex);
 	vhost_dev_lock_vqs(dev);
 	switch (msg->type) {
 	case VHOST_IOTLB_UPDATE:
@@ -1016,6 +1017,8 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 	}
 
 	vhost_dev_unlock_vqs(dev);
+	mutex_unlock(&dev->mutex);
+
 	return ret;
 }
 ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
-- 
2.7.4


[-- Attachment #3: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related

* Re: [RFC V4 PATCH 0/8] Packed ring layout for vhost
From: Jason Wang @ 2018-05-21  2:33 UTC (permalink / raw)
  To: Wei Xu; +Cc: mst, kvm, virtualization, netdev, linux-kernel, jfreimann,
	tiwei.bie
In-Reply-To: <20180520162514.GA21958@wei-ubt>



On 2018年05月21日 00:25, Wei Xu wrote:
> On Wed, May 16, 2018 at 08:32:13PM +0800, Jason Wang wrote:
>> Hi all:
>>
>> This RFC implement packed ring layout. The code were tested with
>> Tiwei's RFC V3 ahttps://lkml.org/lkml/2018/4/25/34. Some fixups and
>> tweaks were needed on top of Tiwei's code to make it run for event
>> index.
> Could you please show the change based on Tiwei's code to easy other's
> test?

Please try Tiwei's V4 instead of just waiting for the fixup. It should 
work unless you don't try zerocopy and vIOMMU.


>> Pktgen reports about 20% improvement on PPS (event index is off). More
>> testing is ongoing.
>>
>> Notes for tester:
>>
>> - Start from this version, vhost need qemu co-operation to work
>>    correctly. Or you can comment out the packed specific code for
>>    GET/SET_VRING_BASE.
> Do you mean the code in vhost_virtqueue_start/stop?

For qemu, probably.

> Both Tiwei's and your v3
> work fortunately correctly which should be avoided since the ring should be
> definitely different.

I don't understand this, you mean reset work?

Thanks

>
> Wei
>
>> Changes from V3:
>> - Fix math on event idx checking
>> - Sync last avail wrap counter through GET/SET_VRING_BASE
>> - remove desc_event prefix in the driver/device structure
>>
>> Changes from V2:
>> - do not use & in checking desc_event_flags
>> - off should be most significant bit
>> - remove the workaround of mergeable buffer for dpdk prototype
>> - id should be in the last descriptor in the chain
>> - keep _F_WRITE for write descriptor when adding used
>> - device flags updating should use ADDR_USED type
>> - return error on unexpected unavail descriptor in a chain
>> - return false in vhost_ve_avail_empty is descriptor is available
>> - track last seen avail_wrap_counter
>> - correctly examine available descriptor in get_indirect_packed()
>> - vhost_idx_diff should return u16 instead of bool
>>
>> Changes from V1:
>>
>> - Refactor vhost used elem code to avoid open coding on used elem
>> - Event suppression support (compile test only).
>> - Indirect descriptor support (compile test only).
>> - Zerocopy support.
>> - vIOMMU support.
>> - SCSI/VSOCK support (compile test only).
>> - Fix several bugs
>>
>> Jason Wang (8):
>>    vhost: move get_rx_bufs to vhost.c
>>    vhost: hide used ring layout from device
>>    vhost: do not use vring_used_elem
>>    vhost_net: do not explicitly manipulate vhost_used_elem
>>    vhost: vhost_put_user() can accept metadata type
>>    virtio: introduce packed ring defines
>>    vhost: packed ring support
>>    vhost: event suppression for packed ring
>>
>>   drivers/vhost/net.c                | 136 ++----
>>   drivers/vhost/scsi.c               |  62 +--
>>   drivers/vhost/vhost.c              | 861 ++++++++++++++++++++++++++++++++-----
>>   drivers/vhost/vhost.h              |  47 +-
>>   drivers/vhost/vsock.c              |  42 +-
>>   include/uapi/linux/virtio_config.h |   9 +
>>   include/uapi/linux/virtio_ring.h   |  32 ++
>>   7 files changed, 928 insertions(+), 261 deletions(-)
>>
>> -- 
>> 2.7.4
>>

^ permalink raw reply

* Re: [RFC v4 3/5] virtio_ring: add packed ring support
From: Jason Wang @ 2018-05-21  2:30 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: mst, virtualization, linux-kernel, netdev, wexu, jfreimann
In-Reply-To: <20180519022938.GA18888@debian>



On 2018年05月19日 10:29, Tiwei Bie wrote:
>> I don't hope so.
>>
>>> I agreed driver should track the DMA addrs or some
>>> other necessary things from the very beginning. And
>>> I also repeated the spec to emphasize that it does
>>> make sense. And I'd like to do that.
>>>
>>> What I was saying is that, to support OOO, we may
>>> need to manage these context (which saves DMA addrs
>>> etc) via a list which is similar to the desc list
>>> maintained via `next` in split ring instead of an
>>> array whose elements always can be indexed directly.
>> My point is these context is a must (not only for OOO).
> Yeah, and I have the exactly same point after you
> pointed that I shouldn't get the addrs from descs.
> I do think it makes sense. I'll do it in the next
> version. I don't have any doubt about it. All my
> questions are about the OOO, instead of whether we
> should save context or not. It just seems that you
> thought I don't want to do it, and were trying to
> convince me that I should do it.

Right, but looks like I was wrong :)

>
>>> The desc ring in split ring is an array, but its
>>> free entries are managed as list via next. I was
>>> just wondering, do we want to manage such a list
>>> because of OOO. It's just a very simple question
>>> that I want to hear your opinion... (It doesn't
>>> means anything, e.g. It doesn't mean I don't want
>>> to support OOO. It's just a simple question...)
>> So the question is yes. But I admit I don't have better idea other than what
>> you propose here (something like split ring which is a little bit sad).
>> Maybe Michael had.
> Yeah, that's why I asked this question. It will
> make the packed ring a bit similar to split ring
> at least in the driver part. So I want to draw
> your attention on this to make sure that we're
> on the same page.

Yes. I think we are.

Thanks

> Best regards,
> Tiwei Bie
>

^ permalink raw reply

* Re: [PATCH net-next] sctp: add support for SCTP_REUSE_PORT sockopt
From: Marcelo Ricardo Leitner @ 2018-05-21  1:54 UTC (permalink / raw)
  To: Neil Horman; +Cc: Xin Long, network dev, linux-sctp, davem
In-Reply-To: <20180521005059.GA15233@neilslaptop.think-freely.org>

On Sun, May 20, 2018 at 08:50:59PM -0400, Neil Horman wrote:
> On Sat, May 19, 2018 at 03:44:40PM +0800, Xin Long wrote:
> > This feature is actually already supported by sk->sk_reuse which can be
> > set by SO_REUSEADDR. But it's not working exactly as RFC6458 demands in
> > section 8.1.27, like:
> > 
> >   - This option only supports one-to-one style SCTP sockets
> >   - This socket option must not be used after calling bind()
> >     or sctp_bindx().
> > 
> > Besides, SCTP_REUSE_PORT sockopt should be provided for user's programs.
> > Otherwise, the programs with SCTP_REUSE_PORT from other systems will not
> > work in linux.
> > 
> > This patch reuses sk->sk_reuse and works pretty much as SO_REUSEADDR,
> > just with some extra setup limitations that are neeeded when it is being
> > enabled.
> > 
> > "It should be noted that the behavior of the socket-level socket option
> > to reuse ports and/or addresses for SCTP sockets is unspecified", so it
> > leaves SO_REUSEADDR as is for the compatibility.
> > 
> > Signed-off-by: Xin Long <lucien.xin@gmail.com>
> > ---
> >  include/uapi/linux/sctp.h |  1 +
> >  net/sctp/socket.c         | 48 +++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 49 insertions(+)
> > 
> A few things:
> 
> 1) I agree with Tom, this feature is a complete duplication of the SK_REUSEPORT
> socket option.  I understand that this is an implementation of the option in the
> RFC, but its definately a duplication of a feature, which makes several things
> really messy.
> 
> 2) The overloading of the sk_reuse opeion is a bad idea, for several reasons.
> Chief among them is the behavioral interference between this patch and the
> SO_REUSEADDR socket level option, that also sets this feature.  If you set
> sk_reuse via SO_REUSEADDR, you will set the SCTP port reuse feature regardless
> of the bind or 1:1/1:m state of the socket.  Vice versa, if you set this socket
> option via the SCTP_PORT_REUSE option you will inadvertently turn on address
> reuse for the socket.  We can't do that.

Given your comments, going a bit further here, one other big
implication is that a port would never be able to be considered to
fully meet SCTP standards regarding reuse because a rogue application
may always abuse of the socket level opt to gain access to the port.

IOW, the patch allows the application to use such restrictions against
itself and nothing else, which undermines the patch idea.

I lack the knowledge on why the SCTP option was proposed in the RFC. I
guess they had a good reason to add the restriction on 1:1/1:m style.
Does the usage of the current imply in any risk to SCTP sockets? If
yes, that would give some grounds for going forward with the SCTP
option.

> 
> Its a bit frustrating, since SO_REUSEPORT is widely available on multiple
> operating systems, but isn't standard (AFAIK).  I would say however, given the
> prevalence of the socket level option, we should likely advocate for the removal
> of the sctp specific option, or at the least implement and document it as being

Is it possible, to remove/deprecate an option once it is published on a RFC?

> an alias for SO_REUSEPORT
> 
> 
> As this stands however, its a NACK from me.
> 
> Neil
> 

^ permalink raw reply

* RE: [PATCH 1/2] net: fec: ptp: Switch to SPDX identifier
From: Andy Duan @ 2018-05-21  1:32 UTC (permalink / raw)
  To: Fabio Estevam, davem@davemloft.net; +Cc: netdev@vger.kernel.org, Fabio Estevam
In-Reply-To: <1526835319-17408-1-git-send-email-festevam@gmail.com>

From: Fabio Estevam <festevam@gmail.com> Sent: 2018年5月21日 0:55
> Adopt the SPDX license identifier headers to ease license compliance
> management.
> 
> Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>

Acked-by: Fugang Duan <fugang.duan@nxp.com>

> ---
>  drivers/net/ethernet/freescale/fec_ptp.c | 14 +-------------
>  1 file changed, 1 insertion(+), 13 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec_ptp.c
> b/drivers/net/ethernet/freescale/fec_ptp.c
> index f814397..43d9732 100644
> --- a/drivers/net/ethernet/freescale/fec_ptp.c
> +++ b/drivers/net/ethernet/freescale/fec_ptp.c
> @@ -1,20 +1,8 @@
> +// SPDX-License-Identifier: GPL-2.0
>  /*
>   * Fast Ethernet Controller (ENET) PTP driver for MX6x.
>   *
>   * Copyright (C) 2012 Freescale Semiconductor, Inc.
> - *
> - * This program is free software; you can redistribute it and/or modify it
> - * under the terms and conditions of the GNU General Public License,
> - * version 2, as published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope it will be useful, but WITHOUT
> - * ANY WARRANTY; without even the implied warranty of
> MERCHANTABILITY or
> - * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
> License for
> - * more details.
> - *
> - * You should have received a copy of the GNU General Public License
> along with
> - * this program; if not, write to the Free Software Foundation, Inc.,
> - * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
>   */
> 
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> --
> 2.7.4


^ permalink raw reply

* RE: [PATCH 2/2] net: fec: Add a SPDX identifier
From: Andy Duan @ 2018-05-21  1:32 UTC (permalink / raw)
  To: Fabio Estevam, davem@davemloft.net; +Cc: netdev@vger.kernel.org, Fabio Estevam
In-Reply-To: <1526835319-17408-2-git-send-email-festevam@gmail.com>

From: Fabio Estevam <festevam@gmail.com>  Sent: 2018年5月21日 0:55
> From: Fabio Estevam <fabio.estevam@nxp.com>
> 
> Currently there is no license information in the header of this file.
> 
> The MODULE_LICENSE field contains ("GPL"), which means GNU Public
> License v2 or later, so add a corresponding SPDX license identifier.
> 
> Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>

Acked-by: Fugang Duan <fugang.duan@nxp.com>

> ---
>  drivers/net/ethernet/freescale/fec_main.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> index f3e43db..1625b74 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -1,3 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0+
>  /*
>   * Fast Ethernet Controller (FEC) driver for Motorola MPC8xx.
>   * Copyright (c) 1997 Dan Malek (dmalek@jlc.net)
> --
> 2.7.4


^ permalink raw reply

* Re: [PATCH net-next] sctp: add support for SCTP_REUSE_PORT sockopt
From: Neil Horman @ 2018-05-21  0:50 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, linux-sctp, Marcelo Ricardo Leitner, davem
In-Reply-To: <be8e672439406e6c8b838321d63aa109b9620f4c.1526715880.git.lucien.xin@gmail.com>

On Sat, May 19, 2018 at 03:44:40PM +0800, Xin Long wrote:
> This feature is actually already supported by sk->sk_reuse which can be
> set by SO_REUSEADDR. But it's not working exactly as RFC6458 demands in
> section 8.1.27, like:
> 
>   - This option only supports one-to-one style SCTP sockets
>   - This socket option must not be used after calling bind()
>     or sctp_bindx().
> 
> Besides, SCTP_REUSE_PORT sockopt should be provided for user's programs.
> Otherwise, the programs with SCTP_REUSE_PORT from other systems will not
> work in linux.
> 
> This patch reuses sk->sk_reuse and works pretty much as SO_REUSEADDR,
> just with some extra setup limitations that are neeeded when it is being
> enabled.
> 
> "It should be noted that the behavior of the socket-level socket option
> to reuse ports and/or addresses for SCTP sockets is unspecified", so it
> leaves SO_REUSEADDR as is for the compatibility.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  include/uapi/linux/sctp.h |  1 +
>  net/sctp/socket.c         | 48 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 49 insertions(+)
> 
A few things:

1) I agree with Tom, this feature is a complete duplication of the SK_REUSEPORT
socket option.  I understand that this is an implementation of the option in the
RFC, but its definately a duplication of a feature, which makes several things
really messy.

2) The overloading of the sk_reuse opeion is a bad idea, for several reasons.
Chief among them is the behavioral interference between this patch and the
SO_REUSEADDR socket level option, that also sets this feature.  If you set
sk_reuse via SO_REUSEADDR, you will set the SCTP port reuse feature regardless
of the bind or 1:1/1:m state of the socket.  Vice versa, if you set this socket
option via the SCTP_PORT_REUSE option you will inadvertently turn on address
reuse for the socket.  We can't do that.

Its a bit frustrating, since SO_REUSEPORT is widely available on multiple
operating systems, but isn't standard (AFAIK).  I would say however, given the
prevalence of the socket level option, we should likely advocate for the removal
of the sctp specific option, or at the least implement and document it as being
an alias for SO_REUSEPORT


As this stands however, its a NACK from me.

Neil

^ permalink raw reply

* Re: [PATCH v2] isdn: eicon: fix a missing-check bug
From: kbuild test robot @ 2018-05-21  0:01 UTC (permalink / raw)
  To: Wenwen Wang
  Cc: kbuild-all, Wenwen Wang, Kangjie Lu, Armin Schindler,
	Karsten Keil, open list:ISDN SUBSYSTEM, open list
In-Reply-To: <1526679228-1596-1-git-send-email-wang6495@umn.edu>

[-- Attachment #1: Type: text/plain, Size: 3744 bytes --]

Hi Wenwen,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc5 next-20180517]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Wenwen-Wang/isdn-eicon-fix-a-missing-check-bug/20180521-034229
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   In file included from drivers/isdn/hardware/eicon/divasfunc.c:18:0:
>> drivers/isdn/hardware/eicon/diva.h:23:18: error: unknown type name 'diva_xdi_um_cfg_cmd_t'; did you mean 'diva_xdi_capi_cfg_t'?
         int length, diva_xdi_um_cfg_cmd_t *msg,
                     ^~~~~~~~~~~~~~~~~~~~~
                     diva_xdi_capi_cfg_t
   drivers/isdn/hardware/eicon/diva.h:27:20: error: unknown type name 'diva_xdi_um_cfg_cmd_t'; did you mean 'diva_xdi_capi_cfg_t'?
           int length, diva_xdi_um_cfg_cmd_t *msg,
                       ^~~~~~~~~~~~~~~~~~~~~
                       diva_xdi_capi_cfg_t
--
   In file included from drivers/isdn/hardware/eicon/divasmain.c:30:0:
>> drivers/isdn/hardware/eicon/diva.h:23:18: error: unknown type name 'diva_xdi_um_cfg_cmd_t'
         int length, diva_xdi_um_cfg_cmd_t *msg,
                     ^~~~~~~~~~~~~~~~~~~~~
   drivers/isdn/hardware/eicon/diva.h:27:20: error: unknown type name 'diva_xdi_um_cfg_cmd_t'
           int length, diva_xdi_um_cfg_cmd_t *msg,
                       ^~~~~~~~~~~~~~~~~~~~~
   drivers/isdn/hardware/eicon/divasmain.c: In function 'divas_write':
>> drivers/isdn/hardware/eicon/divasmain.c:598:24: error: implicit declaration of function 'diva_xdi_open_adapter'; did you mean 'diva_xdi_close_adapter'? [-Werror=implicit-function-declaration]
      file->private_data = diva_xdi_open_adapter(file, buf,
                           ^~~~~~~~~~~~~~~~~~~~~
                           diva_xdi_close_adapter
>> drivers/isdn/hardware/eicon/divasmain.c:598:22: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
      file->private_data = diva_xdi_open_adapter(file, buf,
                         ^
>> drivers/isdn/hardware/eicon/divasmain.c:603:9: error: implicit declaration of function 'diva_xdi_write'; did you mean 'divas_write'? [-Werror=implicit-function-declaration]
      ret = diva_xdi_write(file->private_data, file,
            ^~~~~~~~~~~~~~
            divas_write
   drivers/isdn/hardware/eicon/divasmain.c: In function 'divas_read':
   drivers/isdn/hardware/eicon/divasmain.c:632:22: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
      file->private_data = diva_xdi_open_adapter(file, buf,
                         ^
   cc1: some warnings being treated as errors

vim +23 drivers/isdn/hardware/eicon/diva.h

    12	
    13	typedef int (*divas_xdi_copy_to_user_fn_t) (void *os_handle, void __user *dst,
    14						    const void *src, int length);
    15	
    16	typedef int (*divas_xdi_copy_from_user_fn_t) (void *os_handle, void *dst,
    17						      const void __user *src, int length);
    18	
    19	int diva_xdi_read(void *adapter, void *os_handle, void __user *dst,
    20			  int max_length, divas_xdi_copy_to_user_fn_t cp_fn);
    21	
    22	int diva_xdi_write(void *adapter, void *os_handle, const void __user *src,
  > 23			   int length, diva_xdi_um_cfg_cmd_t *msg,
    24			   divas_xdi_copy_from_user_fn_t cp_fn);
    25	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 62952 bytes --]

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2018-05-20 23:30 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


Sorry, no drug fuelled pull request this time.  As for the commits
themselves, I can't say for sure.... :-)

1) Fix refcounting bug for connections in on-packet scheduling
   mode of IPVS, from Julian Anastasov.

2) Set network header properly in AF_PACKET's packet_snd, from
   Willem de Bruijn.

3) Fix regressions in 3c59x by converting to generic DMA API.
   It was relying upon the hack that the PCI DMA interfaces
   would accept NULL for EISA devices.  From Christoph
   Hellwig.

4) Remove RDMA devices before unregistering netdev in QEDE driver,
   from Michal Kalderon.

5) Use after free in TUN driver ptr_ring usage, from Jason Wang.

6) Properly check for missing netlink attributes in SMC_PNETID
   requests, from Eric Biggers.

7) Set DMA mask before performaing any DMA operations in vmxnet3
   driver, from Regis Duchesne.

8) Fix mlx5 build with SMP=n, from Saeed Mahameed.

9) Classifier fixes in bcm_sf2 driver from Florian Fainelli.

10) Tuntap use after free during release, from Jason Wang.

11) Don't use stack memory in scatterlists in tls code, from
    Matt Mullins.

12) Not fully initialized flow key object in ipv4 routing code,
    from David Ahern.

13) Various packet headroom bug fixes in ip6_gre driver, from Petr
    Machata.

14) Remove queues from XPS maps using correct index, from Amritha
    Nambiar.

15) Fix use after free in sock_diag, from Eric Dumazet.

Please pull, thanks a lot!

The following changes since commit 4bc871984f7cb5b2dec3ae64b570cb02f9ce2227:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2018-05-11 14:14:46 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to b80d0b93b991e551a32157e0d9d38fc5bc9348a7:

  net: ip6_gre: fix tunnel metadata device sharing. (2018-05-19 23:32:12 -0400)

----------------------------------------------------------------
Amritha Nambiar (1):
      net: Fix a bug in removing queues from XPS map

Christoph Hellwig (1):
      3c59x: convert to generic DMA API

Colin Ian King (1):
      netfilter: nf_tables: fix memory leak on error exit return

Daniel Borkmann (1):
      bpf: fix truncated jump targets on heavy expansions

David Ahern (1):
      net/ipv4: Initialize proto and ports in flow struct

David S. Miller (7):
      Merge git://git.kernel.org/.../pablo/nf
      Merge git://git.kernel.org/.../bpf/bpf
      Merge branch 'dsa-bcm_sf2-CFP-fixes'
      Merge branch 'qed-LL2-fixes'
      Merge branch 'ibmvnic-Fix-bugs-and-memory-leaks'
      Merge branch 'ip6_gre-Fixes-in-headroom-handling'
      Merge git://git.kernel.org/.../bpf/bpf

Davide Caratti (1):
      net/sched: fix refcnt leak in the error path of tcf_vlan_init()

Eric Biggers (1):
      net/smc: check for missing nlattrs in SMC_PNETID messages

Eric Dumazet (2):
      tcp: purge write queue in tcp_connect_init()
      sock_diag: fix use-after-free read in __sk_free

Florian Fainelli (4):
      net: dsa: bcm_sf2: Fix RX_CLS_LOC_ANY overwrite for last rule
      net: dsa: bcm_sf2: Fix IPv6 rules and chain ID
      net: dsa: bcm_sf2: Fix IPv6 rule half deletion
      net: dsa: Do not register devlink for unused ports

Florian Westphal (9):
      netfilter: x_tables: check name length in find_match/target, too
      netfilter: nf_tables: skip synchronize_rcu if transaction log is empty
      netfilter: nf_tables: nft_compat: fix refcount leak on xt module
      netfilter: core: add missing __rcu annotation
      netfilter: prefer nla_strlcpy for dealing with NLA_STRING attributes
      netfilter: x_tables: add module alias for icmp matches
      netfilter: nf_tables: don't assume chain stats are set when jumplabel is set
      netfilter: nft_compat: prepare for indirect info storage
      netfilter: nft_compat: fix handling of large matchinfo size

Geert Uytterhoeven (2):
      net: 8390: ne: Fix accidentally removed RBTX4927 support
      sh_eth: Change platform check to CONFIG_ARCH_RENESAS

Jakub Kicinski (2):
      nfp: bpf: allow zero-length capabilities
      tools: bpf: handle NULL return in bpf_prog_load_xattr()

Jason Wang (2):
      tun: fix use after free for ptr_ring
      tuntap: fix use after free during release

Jesper Dangaard Brouer (1):
      selftests/bpf: check return value of fopen in test_verifier.c

John Fastabend (2):
      bpf: sockmap update rollback on error can incorrectly dec prog refcnt
      bpf: parse and verdict prog attach may race with bpf map update

Jozsef Kadlecsik (1):
      netfilter: Fix handling simultaneous open in TCP conntrack

Julian Anastasov (2):
      ipvs: fix refcount usage for conns in ops mode
      ipvs: fix stats update from local clients

Keefe Liu (1):
      ipvlan: call netdevice notifier when master mac address changed

Kumar Sanghvi (1):
      cxgb4: Correct ntuple mask validation for hash filters

Markus Niebel (1):
      net: phy: micrel: add 125MHz reference clock workaround

Matt Mullins (1):
      tls: don't use stack memory in a scatterlist

Michal Kalderon (4):
      qede: Fix ref-cnt usage count
      qed: LL2 flush isles when connection is closed
      qed: Fix possibility of list corruption during rmmod flows
      qed: Fix LL2 race during connection terminate

Pablo Neira Ayuso (1):
      netfilter: nf_tables: bogus EBUSY in chain deletions

Paolo Abeni (1):
      net: sched: red: avoid hashing NULL child

Petr Machata (7):
      net: ip6_gre: Request headroom in __gre6_xmit()
      net: ip6_gre: Fix headroom request in ip6erspan_tunnel_xmit()
      net: ip6_gre: Split up ip6gre_tnl_link_config()
      net: ip6_gre: Split up ip6gre_tnl_change()
      net: ip6_gre: Split up ip6gre_newlink()
      net: ip6_gre: Split up ip6gre_changelink()
      net: ip6_gre: Fix ip6erspan hlen calculation

Rahul Lakkireddy (1):
      cxgb4: fix offset in collecting TX rate limit info

Saeed Mahameed (1):
      net/mlx5: Fix build break when CONFIG_SMP=n

Stephen Hemminger (1):
      netfilter: bridge: stp fix reference to uninitialized data

Tarick Bedeir (1):
      net/mlx4_core: Fix error handling in mlx4_init_port_info.

Thomas Falcon (3):
      ibmvnic: Free coherent DMA memory if FW map failed
      ibmvnic: Fix non-fatal firmware error reset
      ibmvnic: Fix statistics buffers memory leak

Willem de Bruijn (2):
      packet: in packet_snd start writing at link layer allocation
      net: test tailroom before appending to linear skb

William Tu (2):
      erspan: fix invalid erspan version.
      net: ip6_gre: fix tunnel metadata device sharing.

hpreg@vmware.com (2):
      vmxnet3: set the DMA mask before the first DMA map operation
      vmxnet3: use DMA memory barriers where required

 Documentation/devicetree/bindings/net/micrel-ksz90x1.txt |   7 ++
 drivers/net/dsa/bcm_sf2_cfp.c                            |  36 +++++----
 drivers/net/ethernet/3com/3c59x.c                        | 104 ++++++++++++------------
 drivers/net/ethernet/8390/ne.c                           |   4 +-
 drivers/net/ethernet/chelsio/cxgb4/cudbg_entity.h        |  28 +++----
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c        |  88 +++++++--------------
 drivers/net/ethernet/ibm/ibmvnic.c                       |  28 ++++---
 drivers/net/ethernet/mellanox/mlx4/main.c                |   4 +-
 drivers/net/ethernet/netronome/nfp/bpf/main.c            |   2 +-
 drivers/net/ethernet/qlogic/qed/qed_ll2.c                |  61 +++++++++++---
 drivers/net/ethernet/qlogic/qede/qede_main.c             |   3 +-
 drivers/net/ethernet/renesas/sh_eth.h                    |   2 +-
 drivers/net/ipvlan/ipvlan_main.c                         |   4 +-
 drivers/net/phy/micrel.c                                 |  31 ++++++++
 drivers/net/tun.c                                        |  27 +++----
 drivers/net/vmxnet3/vmxnet3_drv.c                        |  72 +++++++++++------
 drivers/net/vmxnet3/vmxnet3_int.h                        |   8 +-
 include/linux/mlx5/driver.h                              |  12 +--
 include/net/netfilter/nf_tables.h                        |   5 ++
 include/net/tls.h                                        |   3 +
 include/uapi/linux/netfilter/nf_conntrack_tcp.h          |   3 +
 kernel/bpf/core.c                                        | 100 +++++++++++++++++------
 kernel/bpf/sockmap.c                                     |  18 ++---
 net/bridge/netfilter/ebt_stp.c                           |   4 +-
 net/core/dev.c                                           |   2 +-
 net/core/filter.c                                        |  11 ++-
 net/core/sock.c                                          |   2 +-
 net/dsa/dsa2.c                                           |   9 ++-
 net/ipv4/fib_frontend.c                                  |   8 +-
 net/ipv4/ip_gre.c                                        |   4 +-
 net/ipv4/ip_output.c                                     |   3 +-
 net/ipv4/netfilter/ip_tables.c                           |   1 +
 net/ipv4/netfilter/ipt_rpfilter.c                        |   2 +-
 net/ipv4/route.c                                         |   7 +-
 net/ipv4/tcp_output.c                                    |   7 +-
 net/ipv6/ip6_gre.c                                       | 286 ++++++++++++++++++++++++++++++++++++++++++++++++++++--------------
 net/ipv6/ip6_output.c                                    |   3 +-
 net/ipv6/netfilter/ip6_tables.c                          |   1 +
 net/netfilter/core.c                                     |   3 +-
 net/netfilter/ipvs/ip_vs_conn.c                          |  17 ++--
 net/netfilter/ipvs/ip_vs_core.c                          |  12 +++
 net/netfilter/nf_conntrack_proto_tcp.c                   |  11 +++
 net/netfilter/nf_tables_api.c                            |  77 ++++++++++++++----
 net/netfilter/nf_tables_core.c                           |  21 +++--
 net/netfilter/nfnetlink_acct.c                           |   2 +-
 net/netfilter/nfnetlink_cthelper.c                       |   7 +-
 net/netfilter/nft_compat.c                               | 201 +++++++++++++++++++++++++++++++++++-----------
 net/netfilter/nft_immediate.c                            |  15 +++-
 net/netfilter/x_tables.c                                 |   6 ++
 net/packet/af_packet.c                                   |   4 +-
 net/sched/act_vlan.c                                     |   2 +
 net/sched/sch_red.c                                      |   5 +-
 net/sched/sch_tbf.c                                      |   5 +-
 net/smc/smc_pnet.c                                       |  71 +++++++++--------
 net/tls/tls_sw.c                                         |   9 +--
 tools/lib/bpf/libbpf.c                                   |   2 +-
 tools/testing/selftests/bpf/test_verifier.c              |   5 ++
 57 files changed, 1010 insertions(+), 465 deletions(-)

^ permalink raw reply

* Re: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun
From: Willem de Bruijn @ 2018-05-20 23:22 UTC (permalink / raw)
  To: Jon Rosen
  Cc: David S. Miller, Willem de Bruijn, Eric Dumazet, Kees Cook,
	David Windsor, Rosen, Rami, Reshetova, Elena, Mike Maloney,
	Benjamin Poirier, Thomas Gleixner, Greg Kroah-Hartman,
	open list:NETWORKING [GENERAL], open list
In-Reply-To: <CAF=yD-+kjceErK=CApiu1oY9uTjVWG2qeX=NkGVYufVVC_L2PA@mail.gmail.com>

On Sun, May 20, 2018 at 6:51 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Sat, May 19, 2018 at 8:07 AM, Jon Rosen <jrosen@cisco.com> wrote:
>> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
>> casues the ring to get corrupted by allowing multiple kernel threads
>> to claim ownership of the same ring entry. Track ownership in a shadow
>> ring structure to prevent other kernel threads from reusing the same
>> entry before it's fully filled in, passed to user space, and then
>> eventually passed back to the kernel for use with a new packet.
>>
>> Signed-off-by: Jon Rosen <jrosen@cisco.com>
>> ---
>>
>> There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
>> the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2.  This bug makes
>> it possible for multiple kernel threads to claim ownership of the same
>> ring entry, corrupting the ring and the corresponding packet(s).
>>
>> These diffs are the second proposed solution, previous proposal was described
>> in https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html
>> subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock
>> to prevent RX ring overrun
>>
>> Those diffs would have changed the binary interface and have broken certain
>> applications. Consensus was that such a change would be inappropriate.
>>
>> These new diffs use a shadow ring in kernel space for tracking intermediate
>> state of an entry and prevent more than one kernel thread from simultaneously
>> allocating a ring entry. This avoids any impact to the binary interface
>> between kernel and userspace but comes at the additional cost of requiring a
>> second spin_lock when passing ownership of a ring entry to userspace.
>>
>> Jon Rosen (1):
>>   packet: track ring entry use using a shadow ring to prevent RX ring
>>     overrun
>>
>>  net/packet/af_packet.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  net/packet/internal.h  | 14 +++++++++++
>>  2 files changed, 78 insertions(+)
>>
>
>> @@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>>  #endif
>>
>>         if (po->tp_version <= TPACKET_V2) {
>> +               spin_lock(&sk->sk_receive_queue.lock);
>>                 __packet_set_status(po, h.raw, status);
>> +               packet_rx_shadow_release(rx_shadow_ring_entry);
>> +               spin_unlock(&sk->sk_receive_queue.lock);
>> +
>>                 sk->sk_data_ready(sk);
>
> Thanks for continuing to look at this. I spent some time on it last time
> around but got stuck, too.
>
> This version takes an extra spinlock in the hot path. That will be very
> expensive. Once we need to accept that, we could opt for a simpler
> implementation akin to the one discussed in the previous thread:
>
> stash a value in tp_padding or similar while tp_status remains
> TP_STATUS_KERNEL to signal ownership to concurrent kernel
> threads. The issue previously was that that field could not atomically
> be cleared together with __packet_set_status. This is no longer
> an issue when holding the queue lock.
>
> With a field like tp_padding, unlike tp_len, it is arguably also safe to
> clear it after flipping status (userspace should treat it as undefined).
>
> With v1 tpacket_hdr, no explicit padding field is defined but due to
> TPACKET_HDRLEN alignment it exists on both 32 and 64 bit
> platforms.
>
> The danger with using padding is that a process may write to it
> and cause deadlock, of course. There is no logical reason for doing
> so.

For the ring, there is no requirement to allocate exactly the amount
specified by the user request. Safer than relying on shared memory
and simpler than the extra allocation in this patch would be to allocate
extra shadow memory at the end of the ring (and not mmap that).

That still leaves an extra cold cacheline vs using tp_padding.

^ permalink raw reply

* Re: WARNING in ip_recv_error
From: Willem de Bruijn @ 2018-05-20 23:13 UTC (permalink / raw)
  To: David Miller
  Cc: Eric Dumazet, DaeLyong Jeong, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Network Development, LKML, Byoungyoung Lee, Kyungtae Kim,
	bammanag, Willem de Bruijn
In-Reply-To: <CAF=yD-+Ai35L2=dGZzpjYYavBmBGNFXd-q9ju93WrPuewnhELg@mail.gmail.com>

On Fri, May 18, 2018 at 2:59 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Fri, May 18, 2018 at 2:46 PM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>> On Fri, May 18, 2018 at 2:44 PM, Willem de Bruijn
>> <willemdebruijn.kernel@gmail.com> wrote:
>>> On Fri, May 18, 2018 at 1:09 PM, Willem de Bruijn
>>> <willemdebruijn.kernel@gmail.com> wrote:
>>>> On Fri, May 18, 2018 at 11:44 AM, David Miller <davem@davemloft.net> wrote:
>>>>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>>>> Date: Fri, 18 May 2018 08:30:43 -0700
>>>>>
>>>>>> We probably need to revert Willem patch (7ce875e5ecb8562fd44040f69bda96c999e38bbc)
>>>>>
>>>>> Is it really valid to reach ip_recv_err with an ipv6 socket?
>>>>
>>>> I guess the issue is that setsockopt IPV6_ADDRFORM is not an
>>>> atomic operation, so that the socket is neither fully ipv4 nor fully
>>>> ipv6 by the time it reaches ip_recv_error.
>>>>
>>>>   sk->sk_socket->ops = &inet_dgram_ops;
>>>>   < HERE >
>>>>   sk->sk_family = PF_INET;
>>>>
>>>> Even calling inet_recv_error to demux would not necessarily help.
>>>>
>>>> Safest would be to look up by skb->protocol, similar to what
>>>> ipv6_recv_error does to handle v4-mapped-v6.
>>>>
>>>> Or to make that function safe with PF_INET and swap the order
>>>> of the above two operations.
>>>>
>>>> All sound needlessly complicated for this rare socket option, but
>>>> I don't have a better idea yet. Dropping on the floor is not nice,
>>>> either.
>>>
>>> Ensuring that ip_recv_error correctly handles packets from either
>>> socket and removing the warning should indeed be good.
>>>
>>> It is robust against v4-mapped packets from an AF_INET6 socket,
>>> but see caveat on reconnect below.
>>>
>>> The code between ipv6_recv_error for v4-mapped addresses and
>>> ip_recv_error is essentially the same, the main difference being
>>> whether to return network headers as sockaddr_in with SOL_IP
>>> or sockaddr_in6 with SOL_IPV6.
>>>
>>> There are very few other locations in the stack that explicitly test
>>> sk_family in this way and thus would be vulnerable to races with
>>> IPV6_ADDRFORM.
>>>
>>> I'm not sure whether it is possible for a udpv6 socket to queue a
>>> real ipv6 packet on the error queue, disconnect, connect to an
>>> ipv4 address, call IPV6_ADDRFORM and then call ip_recv_error
>>> on a true ipv6 packet. That would return buggy data, e.g., in
>>> msg_name.
>>
>> In do_ipv6_setsockopt IPV6_ADDRFORM we can test that the
>> error queue is empty, and then take its lock for the duration of the
>> operation.
>
> Actually, no reason to hold the lock. This setsockopt holds the socket
> lock, which connect would need, too. So testing that the queue
> is empty after testing that it is connected to a v4 address is
> sufficient to ensure that no ipv6 packets are queued for reception.
>
> diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
> index 4d780c7f0130..a975d6311341 100644
> --- a/net/ipv6/ipv6_sockglue.c
> +++ b/net/ipv6/ipv6_sockglue.c
> @@ -199,6 +199,11 @@ static int do_ipv6_setsockopt(struct sock *sk,
> int level, int optname,
>
>                         if (ipv6_only_sock(sk) ||
>                             !ipv6_addr_v4mapped(&sk->sk_v6_daddr)) {
>                                 retv = -EADDRNOTAVAIL;
>                                 break;
>                         }
>
> +                       if (!skb_queue_empty(&sk->sk_error_queue)) {
> +                               retv = -EBUSY;
> +                               break;
> +                       }
> +
>                         fl6_free_socklist(sk);
>                         __ipv6_sock_mc_close(sk);
>
> After this it should be safe to remove the warning in ip_recv_error.

Hmm.. nope.

This ensures that the socket cannot produce any new true v6 packets.
But it does not guarantee that they are not already in the system, e.g.
queued in tc, and will find their way to the error queue later.

We'll have to just be able to handle ipv6 packets in ip_recv_error.
Since IPV6_ADDRFORM is used to pass to legacy v4-only
processes and those likely are only confused by SOL_IPV6
error messages, it is probably best to just drop them and perhaps
WARN_ONCE.

^ permalink raw reply

* Re: WARNING in __static_key_slow_dec
From: Willem de Bruijn @ 2018-05-20 23:07 UTC (permalink / raw)
  To: DaeRyong Jeong
  Cc: David Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Network Development, LKML, Byoungyoung Lee, Kyungtae Kim,
	bammanag
In-Reply-To: <CAF=yD-+jS5BJKHEDp+K4fB41YnY1mR8ksVg98oWPJV=mG01pfg@mail.gmail.com>

>
> -static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
> +static void sock_disable_timestamp(struct sock *sk, unsigned long flag)
>  {
> -       if (sk->sk_flags & flags) {
> -               sk->sk_flags &= ~flags;
> -               if (sock_needs_netstamp(sk) &&
> -                   !(sk->sk_flags & SK_FLAGS_TIMESTAMP))
> -                       net_disable_timestamp();
> -       }
> +       unsigned long prev;
> +
> +       do {
> +               prev = READ_ONCE(sk->sk_flags);
> +
> +               if (!(prev & flag))
> +                       return;
> +
> +               if (cmpxchg(&sk->sk_flags, prev, prev & ~flag) == prev)
> +                       break;
> +       } while (1);

and this can just use set_mask_bits

^ permalink raw reply

* [PATCH net-next] mv88e6xxx: Fix uninitialized variable warning.
From: David Miller @ 2018-05-20 23:05 UTC (permalink / raw)
  To: andrew; +Cc: netdev


In mv88e6xxx_probe(), ("np" or "pdata") might be an invariant
but GCC can't see that, therefore:

drivers/net/dsa/mv88e6xxx/chip.c: In function ‘mv88e6xxx_probe’:
drivers/net/dsa/mv88e6xxx/chip.c:4420:13: warning: ‘compat_info’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  chip->info = compat_info;

Actually, it should have warned on the "if (!compat_info)" test, but
whatever.

Explicitly initialize to NULL in the variable declaration to
deal with this.

Fixes: 877b7cb0b6f2 ("net: dsa: mv88e6xxx: Add minimal platform_data support")
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 1fa1f820a437..12df00f593b7 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -4382,9 +4382,9 @@ static const void *pdata_device_get_match_data(struct device *dev)
 static int mv88e6xxx_probe(struct mdio_device *mdiodev)
 {
 	struct dsa_mv88e6xxx_pdata *pdata = mdiodev->dev.platform_data;
+	const struct mv88e6xxx_info *compat_info = NULL;
 	struct device *dev = &mdiodev->dev;
 	struct device_node *np = dev->of_node;
-	const struct mv88e6xxx_info *compat_info;
 	struct mv88e6xxx_chip *chip;
 	int port;
 	int err;
-- 
2.17.0


^ permalink raw reply related

* Re: WARNING in __static_key_slow_dec
From: Willem de Bruijn @ 2018-05-20 23:04 UTC (permalink / raw)
  To: DaeRyong Jeong
  Cc: David Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Network Development, LKML, Byoungyoung Lee, Kyungtae Kim,
	bammanag
In-Reply-To: <CAF=yD-JbiE0W6pv77DUO0pYrJCYfgszm6zwCEfkv8HedvRfujQ@mail.gmail.com>

On Fri, May 18, 2018 at 4:30 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Fri, May 18, 2018 at 4:03 AM, DaeRyong Jeong <threeearcat@gmail.com> wrote:
>> We report the crash: WARNING in __static_key_slow_dec
>>
>> This crash has been found in v4.8 using RaceFuzzer (a modified
>> version of Syzkaller), which we describe more at the end of this
>> report.
>> Even though v4.8 is the relatively old version, we did manual verification
>> and we think the bug still exists.
>> Our analysis shows that the race occurs when invoking two syscalls
>> concurrently, setsockopt() with optname SO_TIMESTAMPING and ioctl() with
>> cmd SIOCGSTAMPNS.
>>
>>
>> Diagnosis:
>> We think if timestamp was previously enabled with
>> SOCK_TIMESTAMPING_RX_SOFTWARE flag, the concurrent execution of
>> sock_disable_timestamp() and sock_enable_timestamp() causes the crash.
>>
>>
>> Thread interleaving:
>> (Assume sk->flag has the SOCK_TIMESTAMPING_RX_SOFTWARE flag by the
>> previous setsockopt() call with SO_TIMESTAMPING)
>>
>> CPU0 (sock_disable_timestamp())                 CPU1 (sock_enable_timestamp())
>> =====                                           =====
>> (flag == 1UL << SOCK_TIMESTAMPING_RX_SOFTWARE)  (flag == SOCK_TIMESTAMP)
>>
>>                                                 if (!sock_flag(sk, flag)) {
>>                                                         unsigned long previous_flags = sk->sk_flags;
>>
>> if (sk->sk_flags & flags) {
>>         sk->sk_flags &= ~flags;
>>         if (sock_needs_netstamp(sk) &&
>>             !(sk->sk_flags & SK_FLAGS_TIMESTAMP))
>>                 net_disable_timestamp();
>>                                                         sock_set_flag(sk, flag);
>>
>>                                                         if (sock_needs_netstamp(sk) &&
>>                                                             !(previous_flags & SK_FLAGS_TIMESTAMP))
>>                                                                 net_enable_timestamp();
>>                                                         /* Here, net_enable_timestamp() is not called because
>>                                                          * previous_flags has the SOCK_TIMESTAMPING_RX_SOFTWARE
>>                                                          * flag
>>                                                          */
>> /* After the race, sk->sk has the flag SOCK_TIMESTAMP, but
>>  * net_enable_timestamp() is not called one more time.
>>  * Consequently, when the socket is closed, __sk_destruct()
>>  * calls net_disable_timestamp() that leads WARNING.
>>  */
>
> Thanks for the detailed analysis.
>
> Indeed the updates to sk->sk_flags and calls to net_(dis|en)able_timestamp
> should happen atomically, but this is not the case. The setsockopt
> path holds the socket lock, but not all ioctl paths.
>
> Perhaps we can take lock_sock_fast in sock_get_timestamp and
> variants.

Some callers of sock_get_timestamp already hold the socket lock,
e.g., ax25_ioctl, so that is out.

There is some known non-determinism in this path. Callers of
sock_get_timestamp do not necessarily expect a valid sk_stamp
when they enable the timestamp, so that function can continue
to test sk_flags lockless.

net_enable_timestamp enables timestamping using a static_branch
and possibly a workqueue, so already does not complete synchronously
in the sock_enable_timestamp call.

The only requirement is that updates to sk_flags do not race. This
should be solvable with cmpxchg. The situation is slightly complicated
because sk_flags has two bits that may toggle timestamping. Only the
first bit set must trigger a call to net_enable_timestamp and only the
last bit cleared must call net_disable_timestamp.

Something like

-static bool sock_needs_netstamp(const struct sock *sk)
+static bool sock_needs_netstamp(const struct sock *sk, unsigned long flags)
 {
        switch (sk->sk_family) {
        case AF_UNSPEC:
        case AF_UNIX:
                return false;
        default:
-               return true;
+               return (flags & SK_FLAGS_TIMESTAMP);
        }
 }

-static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
+static void sock_disable_timestamp(struct sock *sk, unsigned long flag)
 {
-       if (sk->sk_flags & flags) {
-               sk->sk_flags &= ~flags;
-               if (sock_needs_netstamp(sk) &&
-                   !(sk->sk_flags & SK_FLAGS_TIMESTAMP))
-                       net_disable_timestamp();
-       }
+       unsigned long prev;
+
+       do {
+               prev = READ_ONCE(sk->sk_flags);
+
+               if (!(prev & flag))
+                       return;
+
+               if (cmpxchg(&sk->sk_flags, prev, prev & ~flag) == prev)
+                       break;
+       } while (1);
+
+       /* disable only if this operation removed the last tstamp flag */
+       if (!sock_needs_netstamp(sk, prev & ~flag))
+               net_disable_timestamp();
 }

and analogous for enable.

^ permalink raw reply

* Re: [PATCH net-next v2] net: dsa: b53: Extend platform data to include DSA ports
From: David Miller @ 2018-05-20 22:59 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev, andrew, vivien.didelot, linux-kernel
In-Reply-To: <20180520155630.23432-1-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Sun, 20 May 2018 08:56:30 -0700

> The b53 driver already defines and internally uses platform data to let the
> glue drivers specify parameters such as the chip id.  What we were missing was
> a way to tell the core DSA layer about the ports and their type.
> 
> Place a dsa_chip_data structure at the beginning of b53_platform_data for
> dsa_register_switch() to access it. This does not require modifications to
> b53_common.c which will pass platform_data trough.
> 
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next 0/3] Platform data support for mv88exxx
From: David Miller @ 2018-05-20 22:58 UTC (permalink / raw)
  To: andrew; +Cc: vivien.didelot, f.fainelli, netdev
In-Reply-To: <1526761895-15839-1-git-send-email-andrew@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Sat, 19 May 2018 22:31:32 +0200

> There are a few Intel based platforms making use of the mv88exxx.
> These don't easily have access to device tree in order to instantiate
> the switch driver. These patches allow the use of platform data to
> hold the configuration.

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH 0/3] sh_eth: fix typos/grammar
From: David Miller @ 2018-05-20 22:56 UTC (permalink / raw)
  To: sergei.shtylyov; +Cc: netdev, linux-renesas-soc
In-Reply-To: <d47afee4-6fd5-af6d-885c-5b07996749fc@cogentembedded.com>

From: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Date: Sat, 19 May 2018 23:57:50 +0300

> Here's a set of 3 patches against DaveM's 'net-next.git' repo plus the R8A77980
> support patches posted earlier. They fix the comments typos/grammar and another
> typo in the EESR bit...

Series applied.

^ permalink raw reply

* Re: [PATCH net-next 0/9] Misc. bug fixes and cleanup for HNS3 driver
From: David Miller @ 2018-05-20 22:56 UTC (permalink / raw)
  To: salil.mehta
  Cc: yisen.zhuang, lipeng321, mehta.salil, netdev, linux-kernel,
	linuxarm
In-Reply-To: <20180519155323.68960-1-salil.mehta@huawei.com>

From: Salil Mehta <salil.mehta@huawei.com>
Date: Sat, 19 May 2018 16:53:14 +0100

> This patch-set presents miscellaneous bug fixes and cleanups found
> during internal review, system testing and cleanup.

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun
From: Willem de Bruijn @ 2018-05-20 22:51 UTC (permalink / raw)
  To: Jon Rosen
  Cc: David S. Miller, Willem de Bruijn, Eric Dumazet, Kees Cook,
	David Windsor, Rosen, Rami, Reshetova, Elena, Mike Maloney,
	Benjamin Poirier, Thomas Gleixner, Greg Kroah-Hartman,
	open list:NETWORKING [GENERAL], open list
In-Reply-To: <1526731655-10087-1-git-send-email-jrosen@cisco.com>

On Sat, May 19, 2018 at 8:07 AM, Jon Rosen <jrosen@cisco.com> wrote:
> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
> casues the ring to get corrupted by allowing multiple kernel threads
> to claim ownership of the same ring entry. Track ownership in a shadow
> ring structure to prevent other kernel threads from reusing the same
> entry before it's fully filled in, passed to user space, and then
> eventually passed back to the kernel for use with a new packet.
>
> Signed-off-by: Jon Rosen <jrosen@cisco.com>
> ---
>
> There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
> the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2.  This bug makes
> it possible for multiple kernel threads to claim ownership of the same
> ring entry, corrupting the ring and the corresponding packet(s).
>
> These diffs are the second proposed solution, previous proposal was described
> in https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html
> subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock
> to prevent RX ring overrun
>
> Those diffs would have changed the binary interface and have broken certain
> applications. Consensus was that such a change would be inappropriate.
>
> These new diffs use a shadow ring in kernel space for tracking intermediate
> state of an entry and prevent more than one kernel thread from simultaneously
> allocating a ring entry. This avoids any impact to the binary interface
> between kernel and userspace but comes at the additional cost of requiring a
> second spin_lock when passing ownership of a ring entry to userspace.
>
> Jon Rosen (1):
>   packet: track ring entry use using a shadow ring to prevent RX ring
>     overrun
>
>  net/packet/af_packet.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  net/packet/internal.h  | 14 +++++++++++
>  2 files changed, 78 insertions(+)
>

> @@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>  #endif
>
>         if (po->tp_version <= TPACKET_V2) {
> +               spin_lock(&sk->sk_receive_queue.lock);
>                 __packet_set_status(po, h.raw, status);
> +               packet_rx_shadow_release(rx_shadow_ring_entry);
> +               spin_unlock(&sk->sk_receive_queue.lock);
> +
>                 sk->sk_data_ready(sk);

Thanks for continuing to look at this. I spent some time on it last time
around but got stuck, too.

This version takes an extra spinlock in the hot path. That will be very
expensive. Once we need to accept that, we could opt for a simpler
implementation akin to the one discussed in the previous thread:

stash a value in tp_padding or similar while tp_status remains
TP_STATUS_KERNEL to signal ownership to concurrent kernel
threads. The issue previously was that that field could not atomically
be cleared together with __packet_set_status. This is no longer
an issue when holding the queue lock.

With a field like tp_padding, unlike tp_len, it is arguably also safe to
clear it after flipping status (userspace should treat it as undefined).

With v1 tpacket_hdr, no explicit padding field is defined but due to
TPACKET_HDRLEN alignment it exists on both 32 and 64 bit
platforms.

The danger with using padding is that a process may write to it
and cause deadlock, of course. There is no logical reason for doing
so.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox