Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH v1 2/3] net: hisilicon: fix hip04-xmit never return TX_BUSY
From: Jiangfeng Xiao @ 2019-08-03 12:31 UTC (permalink / raw)
  To: davem, yisen.zhuang, salil.mehta, xiaojiangfeng
  Cc: netdev, linux-kernel, leeyou.li, xiaowei774, nixiaoming
In-Reply-To: <1564835501-90257-1-git-send-email-xiaojiangfeng@huawei.com>

TX_DESC_NUM is 256, in tx_count, the maximum value of
mod(TX_DESC_NUM - 1) is 254, the variable "count" in
the hip04_mac_start_xmit function is never equal to
(TX_DESC_NUM - 1), so hip04_mac_start_xmit never
return NETDEV_TX_BUSY.

tx_count is modified to mod(TX_DESC_NUM) so that
the maximum value of tx_count can reach
(TX_DESC_NUM - 1), then hip04_mac_start_xmit can reurn
NETDEV_TX_BUSY.

Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com>
---
 drivers/net/ethernet/hisilicon/hip04_eth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c
index 1e1b154..d775b98 100644
--- a/drivers/net/ethernet/hisilicon/hip04_eth.c
+++ b/drivers/net/ethernet/hisilicon/hip04_eth.c
@@ -248,7 +248,7 @@ struct hip04_priv {
 
 static inline unsigned int tx_count(unsigned int head, unsigned int tail)
 {
-	return (head - tail) % (TX_DESC_NUM - 1);
+	return (head - tail) % TX_DESC_NUM;
 }
 
 static void hip04_config_port(struct net_device *ndev, u32 speed, u32 duplex)
-- 
1.8.5.6


^ permalink raw reply related

* [PATCH v1 1/3] net: hisilicon: make hip04_tx_reclaim non-reentrant
From: Jiangfeng Xiao @ 2019-08-03 12:31 UTC (permalink / raw)
  To: davem, yisen.zhuang, salil.mehta, xiaojiangfeng
  Cc: netdev, linux-kernel, leeyou.li, xiaowei774, nixiaoming
In-Reply-To: <1564835501-90257-1-git-send-email-xiaojiangfeng@huawei.com>

If hip04_tx_reclaim is interrupted while it is running
and then __napi_schedule continues to execute
hip04_rx_poll->hip04_tx_reclaim, reentrancy occurs
and oops is generated. So you need to mask the interrupt
during the hip04_tx_reclaim run.

The kernel oops exception stack is as follows:

Unable to handle kernel NULL pointer dereference
at virtual address 00000050
pgd = c0003000
[00000050] *pgd=80000000a04003, *pmd=00000000
Internal error: Oops: 206 [#1] SMP ARM
Modules linked in: hip04_eth mtdblock mtd_blkdevs mtd
ohci_platform ehci_platform ohci_hcd ehci_hcd
vfat fat sd_mod usb_storage scsi_mod usbcore usb_common
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O    4.4.185 #1
Hardware name: Hisilicon A15
task: c0a250e0 task.stack: c0a00000
PC is at hip04_tx_reclaim+0xe0/0x17c [hip04_eth]
LR is at hip04_tx_reclaim+0x30/0x17c [hip04_eth]
pc : [<bf30c3a4>]    lr : [<bf30c2f4>]    psr: 600e0313
sp : c0a01d88  ip : 00000000  fp : c0601f9c
r10: 00000000  r9 : c3482380  r8 : 00000001
r7 : 00000000  r6 : 000000e1  r5 : c3482000  r4 : 0000000c
r3 : f2209800  r2 : 00000000  r1 : 00000000  r0 : 00000000
Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
Control: 32c5387d  Table: 03d28c80  DAC: 55555555
Process swapper/0 (pid: 0, stack limit = 0xc0a00190)
Stack: (0xc0a01d88 to 0xc0a02000)
[<bf30c3a4>] (hip04_tx_reclaim [hip04_eth]) from [<bf30d2e0>]
                                                (hip04_rx_poll+0x88/0x368 [hip04_eth])
[<bf30d2e0>] (hip04_rx_poll [hip04_eth]) from [<c04c2d9c>] (net_rx_action+0x114/0x34c)
[<c04c2d9c>] (net_rx_action) from [<c021eed8>] (__do_softirq+0x218/0x318)
[<c021eed8>] (__do_softirq) from [<c021f284>] (irq_exit+0x88/0xac)
[<c021f284>] (irq_exit) from [<c0240090>] (msa_irq_exit+0x11c/0x1d4)
[<c0240090>] (msa_irq_exit) from [<c02677e0>] (__handle_domain_irq+0x110/0x148)
[<c02677e0>] (__handle_domain_irq) from [<c0201588>] (gic_handle_irq+0xd4/0x118)
[<c0201588>] (gic_handle_irq) from [<c0551700>] (__irq_svc+0x40/0x58)
Exception stack(0xc0a01f30 to 0xc0a01f78)
1f20:                                     c0ae8b40 00000000 00000000 00000000
1f40: 00000002 ffffe000 c0601f9c 00000000 ffffffff c0a2257c c0a22440 c0831a38
1f60: c0a01ec4 c0a01f80 c0203714 c0203718 600e0213 ffffffff
[<c0551700>] (__irq_svc) from [<c0203718>] (arch_cpu_idle+0x20/0x3c)
[<c0203718>] (arch_cpu_idle) from [<c025bfd8>] (cpu_startup_entry+0x244/0x29c)
[<c025bfd8>] (cpu_startup_entry) from [<c054b0d8>] (rest_init+0xc8/0x10c)
[<c054b0d8>] (rest_init) from [<c0800c58>] (start_kernel+0x468/0x514)
Code: a40599e5 016086e2 018088e2 7660efe6 (503090e5)
---[ end trace 1db21d6d09c49d74 ]---
Kernel panic - not syncing: Fatal exception in interrupt
CPU3: stopping
CPU: 3 PID: 0 Comm: swapper/3 Tainted: G      D    O    4.4.185 #1

Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com>
---
 drivers/net/ethernet/hisilicon/hip04_eth.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c
index d604528..1e1b154 100644
--- a/drivers/net/ethernet/hisilicon/hip04_eth.c
+++ b/drivers/net/ethernet/hisilicon/hip04_eth.c
@@ -585,6 +585,9 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget)
 	u16 len;
 	u32 err;
 
+	/* clean up tx descriptors */
+	tx_remaining = hip04_tx_reclaim(ndev, false);
+
 	while (cnt && !last) {
 		buf = priv->rx_buf[priv->rx_head];
 		skb = build_skb(buf, priv->rx_buf_size);
@@ -645,8 +648,7 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget)
 	}
 	napi_complete_done(napi, rx);
 done:
-	/* clean up tx descriptors and start a new timer if necessary */
-	tx_remaining = hip04_tx_reclaim(ndev, false);
+	/* start a new timer if necessary */
 	if (rx < budget && tx_remaining)
 		hip04_start_tx_timer(priv);
 
-- 
1.8.5.6


^ permalink raw reply related

* [PATCH v1 3/3] net: hisilicon: Fix dma_map_single failed on arm64
From: Jiangfeng Xiao @ 2019-08-03 12:31 UTC (permalink / raw)
  To: davem, yisen.zhuang, salil.mehta, xiaojiangfeng
  Cc: netdev, linux-kernel, leeyou.li, xiaowei774, nixiaoming
In-Reply-To: <1564835501-90257-1-git-send-email-xiaojiangfeng@huawei.com>

On the arm64 platform, executing "ifconfig eth0 up" will fail,
returning "ifconfig: SIOCSIFFLAGS: Input/output error."

ndev->dev is not initialized, dma_map_single->get_dma_ops->
dummy_dma_ops->__dummy_map_page will return DMA_ERROR_CODE
directly, so when we use dma_map_single, the first parameter
is to use the device of platform_device.

Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com>
---
 drivers/net/ethernet/hisilicon/hip04_eth.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c
index d775b98..c841674 100644
--- a/drivers/net/ethernet/hisilicon/hip04_eth.c
+++ b/drivers/net/ethernet/hisilicon/hip04_eth.c
@@ -220,6 +220,7 @@ struct hip04_priv {
 	unsigned int reg_inten;
 
 	struct napi_struct napi;
+	struct device *dev;
 	struct net_device *ndev;
 
 	struct tx_desc *tx_desc;
@@ -465,7 +466,7 @@ static int hip04_tx_reclaim(struct net_device *ndev, bool force)
 		}
 
 		if (priv->tx_phys[tx_tail]) {
-			dma_unmap_single(&ndev->dev, priv->tx_phys[tx_tail],
+			dma_unmap_single(priv->dev, priv->tx_phys[tx_tail],
 					 priv->tx_skb[tx_tail]->len,
 					 DMA_TO_DEVICE);
 			priv->tx_phys[tx_tail] = 0;
@@ -516,8 +517,8 @@ static void hip04_start_tx_timer(struct hip04_priv *priv)
 		return NETDEV_TX_BUSY;
 	}
 
-	phys = dma_map_single(&ndev->dev, skb->data, skb->len, DMA_TO_DEVICE);
-	if (dma_mapping_error(&ndev->dev, phys)) {
+	phys = dma_map_single(priv->dev, skb->data, skb->len, DMA_TO_DEVICE);
+	if (dma_mapping_error(priv->dev, phys)) {
 		dev_kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
@@ -596,7 +597,7 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget)
 			goto refill;
 		}
 
-		dma_unmap_single(&ndev->dev, priv->rx_phys[priv->rx_head],
+		dma_unmap_single(priv->dev, priv->rx_phys[priv->rx_head],
 				 RX_BUF_SIZE, DMA_FROM_DEVICE);
 		priv->rx_phys[priv->rx_head] = 0;
 
@@ -625,9 +626,9 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget)
 		buf = netdev_alloc_frag(priv->rx_buf_size);
 		if (!buf)
 			goto done;
-		phys = dma_map_single(&ndev->dev, buf,
+		phys = dma_map_single(priv->dev, buf,
 				      RX_BUF_SIZE, DMA_FROM_DEVICE);
-		if (dma_mapping_error(&ndev->dev, phys))
+		if (dma_mapping_error(priv->dev, phys))
 			goto done;
 		priv->rx_buf[priv->rx_head] = buf;
 		priv->rx_phys[priv->rx_head] = phys;
@@ -730,9 +731,9 @@ static int hip04_mac_open(struct net_device *ndev)
 	for (i = 0; i < RX_DESC_NUM; i++) {
 		dma_addr_t phys;
 
-		phys = dma_map_single(&ndev->dev, priv->rx_buf[i],
+		phys = dma_map_single(priv->dev, priv->rx_buf[i],
 				      RX_BUF_SIZE, DMA_FROM_DEVICE);
-		if (dma_mapping_error(&ndev->dev, phys))
+		if (dma_mapping_error(priv->dev, phys))
 			return -EIO;
 
 		priv->rx_phys[i] = phys;
@@ -766,7 +767,7 @@ static int hip04_mac_stop(struct net_device *ndev)
 
 	for (i = 0; i < RX_DESC_NUM; i++) {
 		if (priv->rx_phys[i]) {
-			dma_unmap_single(&ndev->dev, priv->rx_phys[i],
+			dma_unmap_single(priv->dev, priv->rx_phys[i],
 					 RX_BUF_SIZE, DMA_FROM_DEVICE);
 			priv->rx_phys[i] = 0;
 		}
@@ -909,6 +910,7 @@ static int hip04_mac_probe(struct platform_device *pdev)
 		return -ENOMEM;
 
 	priv = netdev_priv(ndev);
+	priv->dev = d;
 	priv->ndev = ndev;
 	platform_set_drvdata(pdev, ndev);
 	SET_NETDEV_DEV(ndev, &pdev->dev);
-- 
1.8.5.6


^ permalink raw reply related

* [PATCH v1 0/3] net: hisilicon: Fix a few problems with hip04_eth
From: Jiangfeng Xiao @ 2019-08-03 12:31 UTC (permalink / raw)
  To: davem, yisen.zhuang, salil.mehta, xiaojiangfeng
  Cc: netdev, linux-kernel, leeyou.li, xiaowei774, nixiaoming

During the use of the hip04_eth driver,
several problems were found,
which solved the hip04_tx_reclaim reentry problem,
fixed the problem that hip04_mac_start_xmit never
returns NETDEV_TX_BUSY
and the dma_map_single failed on the arm64 platform.

Jiangfeng Xiao (3):
  net: hisilicon: make hip04_tx_reclaim non-reentrant
  net: hisilicon: fix hip04-xmit never return TX_BUSY
  net: hisilicon: Fix dma_map_single failed on arm64

 drivers/net/ethernet/hisilicon/hip04_eth.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

-- 
1.8.5.6


^ permalink raw reply

* [PATCH net 0/2] action fixes for flow_offload infra compatibility
From: Vlad Buslov @ 2019-08-03 13:36 UTC (permalink / raw)
  To: netdev
  Cc: pieter.jansenvanvuuren, jhs, xiyou.wangcong, jiri, davem,
	Vlad Buslov

Fix rcu warnings due to usage of action helpers that expect rcu read lock
protection from rtnl-protected context of flow_offload infra.

Vlad Buslov (2):
  net: sched: police: allow accessing police->params with rtnl
  net: sched: sample: allow accessing psample_group with rtnl

 include/net/tc_act/tc_police.h | 4 ++--
 include/net/tc_act/tc_sample.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

-- 
2.21.0


^ permalink raw reply

* [PATCH net 2/2] net: sched: sample: allow accessing psample_group with rtnl
From: Vlad Buslov @ 2019-08-03 13:36 UTC (permalink / raw)
  To: netdev
  Cc: pieter.jansenvanvuuren, jhs, xiyou.wangcong, jiri, davem,
	Vlad Buslov
In-Reply-To: <20190803133619.10574-1-vladbu@mellanox.com>

Recently implemented support for sample action in flow_offload infra leads
to following rcu usage warning:

[ 1938.234856] =============================
[ 1938.234858] WARNING: suspicious RCU usage
[ 1938.234863] 5.3.0-rc1+ #574 Not tainted
[ 1938.234866] -----------------------------
[ 1938.234869] include/net/tc_act/tc_sample.h:47 suspicious rcu_dereference_check() usage!
[ 1938.234872]
               other info that might help us debug this:

[ 1938.234875]
               rcu_scheduler_active = 2, debug_locks = 1
[ 1938.234879] 1 lock held by tc/19540:
[ 1938.234881]  #0: 00000000b03cb918 (rtnl_mutex){+.+.}, at: tc_new_tfilter+0x47c/0x970
[ 1938.234900]
               stack backtrace:
[ 1938.234905] CPU: 2 PID: 19540 Comm: tc Not tainted 5.3.0-rc1+ #574
[ 1938.234908] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[ 1938.234911] Call Trace:
[ 1938.234922]  dump_stack+0x85/0xc0
[ 1938.234930]  tc_setup_flow_action+0xed5/0x2040
[ 1938.234944]  fl_hw_replace_filter+0x11f/0x2e0 [cls_flower]
[ 1938.234965]  fl_change+0xd24/0x1b30 [cls_flower]
[ 1938.234990]  tc_new_tfilter+0x3e0/0x970
[ 1938.235021]  ? tc_del_tfilter+0x720/0x720
[ 1938.235028]  rtnetlink_rcv_msg+0x389/0x4b0
[ 1938.235038]  ? netlink_deliver_tap+0x95/0x400
[ 1938.235044]  ? rtnl_dellink+0x2d0/0x2d0
[ 1938.235053]  netlink_rcv_skb+0x49/0x110
[ 1938.235063]  netlink_unicast+0x171/0x200
[ 1938.235073]  netlink_sendmsg+0x224/0x3f0
[ 1938.235091]  sock_sendmsg+0x5e/0x60
[ 1938.235097]  ___sys_sendmsg+0x2ae/0x330
[ 1938.235111]  ? __handle_mm_fault+0x12cd/0x19e0
[ 1938.235125]  ? __handle_mm_fault+0x12cd/0x19e0
[ 1938.235138]  ? find_held_lock+0x2b/0x80
[ 1938.235147]  ? do_user_addr_fault+0x22d/0x490
[ 1938.235160]  __sys_sendmsg+0x59/0xa0
[ 1938.235178]  do_syscall_64+0x5c/0xb0
[ 1938.235187]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1938.235192] RIP: 0033:0x7ff9a4d597b8
[ 1938.235197] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83
 ec 28 89 54
[ 1938.235200] RSP: 002b:00007ffcfe381c48 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 1938.235205] RAX: ffffffffffffffda RBX: 000000005d4497f9 RCX: 00007ff9a4d597b8
[ 1938.235208] RDX: 0000000000000000 RSI: 00007ffcfe381cb0 RDI: 0000000000000003
[ 1938.235211] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006
[ 1938.235214] R10: 0000000000404ec2 R11: 0000000000000246 R12: 0000000000000001
[ 1938.235217] R13: 0000000000480640 R14: 0000000000000012 R15: 0000000000000001

Change tcf_sample_psample_group() helper to allow using it from both rtnl
and rcu protected contexts.

Fixes: a7a7be6087b0 ("net/sched: add sample action to the hardware intermediate representation")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
---
 include/net/tc_act/tc_sample.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
index 0a559d4b6f0f..b4fce0fae645 100644
--- a/include/net/tc_act/tc_sample.h
+++ b/include/net/tc_act/tc_sample.h
@@ -44,7 +44,7 @@ static inline int tcf_sample_trunc_size(const struct tc_action *a)
 static inline struct psample_group *
 tcf_sample_psample_group(const struct tc_action *a)
 {
-	return rcu_dereference(to_sample(a)->psample_group);
+	return rcu_dereference_rtnl(to_sample(a)->psample_group);
 }
 
 #endif /* __NET_TC_SAMPLE_H */
-- 
2.21.0


^ permalink raw reply related

* [PATCH net 1/2] net: sched: police: allow accessing police->params with rtnl
From: Vlad Buslov @ 2019-08-03 13:36 UTC (permalink / raw)
  To: netdev
  Cc: pieter.jansenvanvuuren, jhs, xiyou.wangcong, jiri, davem,
	Vlad Buslov
In-Reply-To: <20190803133619.10574-1-vladbu@mellanox.com>

Recently implemented support for police action in flow_offload infra leads
to following rcu usage warning:

[ 1925.881092] =============================
[ 1925.881094] WARNING: suspicious RCU usage
[ 1925.881098] 5.3.0-rc1+ #574 Not tainted
[ 1925.881100] -----------------------------
[ 1925.881104] include/net/tc_act/tc_police.h:57 suspicious rcu_dereference_check() usage!
[ 1925.881106]
               other info that might help us debug this:

[ 1925.881109]
               rcu_scheduler_active = 2, debug_locks = 1
[ 1925.881112] 1 lock held by tc/18591:
[ 1925.881115]  #0: 00000000b03cb918 (rtnl_mutex){+.+.}, at: tc_new_tfilter+0x47c/0x970
[ 1925.881124]
               stack backtrace:
[ 1925.881127] CPU: 2 PID: 18591 Comm: tc Not tainted 5.3.0-rc1+ #574
[ 1925.881130] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[ 1925.881132] Call Trace:
[ 1925.881138]  dump_stack+0x85/0xc0
[ 1925.881145]  tc_setup_flow_action+0x1771/0x2040
[ 1925.881155]  fl_hw_replace_filter+0x11f/0x2e0 [cls_flower]
[ 1925.881175]  fl_change+0xd24/0x1b30 [cls_flower]
[ 1925.881200]  tc_new_tfilter+0x3e0/0x970
[ 1925.881231]  ? tc_del_tfilter+0x720/0x720
[ 1925.881243]  rtnetlink_rcv_msg+0x389/0x4b0
[ 1925.881250]  ? netlink_deliver_tap+0x95/0x400
[ 1925.881257]  ? rtnl_dellink+0x2d0/0x2d0
[ 1925.881264]  netlink_rcv_skb+0x49/0x110
[ 1925.881275]  netlink_unicast+0x171/0x200
[ 1925.881284]  netlink_sendmsg+0x224/0x3f0
[ 1925.881299]  sock_sendmsg+0x5e/0x60
[ 1925.881305]  ___sys_sendmsg+0x2ae/0x330
[ 1925.881309]  ? task_work_add+0x43/0x50
[ 1925.881314]  ? fput_many+0x45/0x80
[ 1925.881329]  ? __lock_acquire+0x248/0x1930
[ 1925.881342]  ? find_held_lock+0x2b/0x80
[ 1925.881347]  ? task_work_run+0x7b/0xd0
[ 1925.881359]  __sys_sendmsg+0x59/0xa0
[ 1925.881375]  do_syscall_64+0x5c/0xb0
[ 1925.881381]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1925.881384] RIP: 0033:0x7feb245047b8
[ 1925.881388] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83
 ec 28 89 54
[ 1925.881391] RSP: 002b:00007ffc2d2a5788 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 1925.881395] RAX: ffffffffffffffda RBX: 000000005d4497ed RCX: 00007feb245047b8
[ 1925.881398] RDX: 0000000000000000 RSI: 00007ffc2d2a57f0 RDI: 0000000000000003
[ 1925.881400] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006
[ 1925.881403] R10: 0000000000404ec2 R11: 0000000000000246 R12: 0000000000000001
[ 1925.881406] R13: 0000000000480640 R14: 0000000000000012 R15: 0000000000000001

Change tcf_police_rate_bytes_ps() and tcf_police_tcfp_burst() helpers to
allow using them from both rtnl and rcu protected contexts.

Fixes: 8c8cfc6ed274 ("net/sched: add police action to the hardware intermediate representation")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
---
 include/net/tc_act/tc_police.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/tc_act/tc_police.h b/include/net/tc_act/tc_police.h
index 8b9ef3664262..cfdc7cb82cad 100644
--- a/include/net/tc_act/tc_police.h
+++ b/include/net/tc_act/tc_police.h
@@ -54,7 +54,7 @@ static inline u64 tcf_police_rate_bytes_ps(const struct tc_action *act)
 	struct tcf_police *police = to_police(act);
 	struct tcf_police_params *params;
 
-	params = rcu_dereference_bh(police->params);
+	params = rcu_dereference_bh_rtnl(police->params);
 	return params->rate.rate_bytes_ps;
 }
 
@@ -63,7 +63,7 @@ static inline s64 tcf_police_tcfp_burst(const struct tc_action *act)
 	struct tcf_police *police = to_police(act);
 	struct tcf_police_params *params;
 
-	params = rcu_dereference_bh(police->params);
+	params = rcu_dereference_bh_rtnl(police->params);
 	return params->tcfp_burst;
 }
 
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH net-next v5 5/6] flow_offload: support get flow_block immediately
From: wenxu @ 2019-08-03 13:42 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: jiri, pablo, fw, netfilter-devel, netdev, John Hurley
In-Reply-To: <20190802172155.7a36713d@cakuba.netronome.com>


在 2019/8/3 8:21, Jakub Kicinski 写道:
> On Sat, 3 Aug 2019 07:19:31 +0800, wenxu wrote:
>>> Or:
>>>
>>> device unregister:
>>>   - nft block destroy
>>>     - UNBIND cb
>>>       - free driver's block state
>>>   - driver notifier callback
>>>     - free driver's state
>>>
>>> No?  
>> For the second case maybe can't unbind cb? because the nft block is
>> destroied. There is no way to find the block(chain) in nft.
> But before the block is destroyed doesn't nft send an UNBIND event to
> the drivers, always?
you are correct, it will be send an UBIND event when the block is destroyed

^ permalink raw reply

* Re: [PATCH net-next v3] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Vladimir Oltean @ 2019-08-03 13:49 UTC (permalink / raw)
  To: Tao Ren
  Cc: Andrew Lunn, Florian Fainelli, Heiner Kallweit, David S . Miller,
	Arun Parameswaran, Justin Chen, netdev, lkml, openbmc
In-Reply-To: <20190802215419.313512-1-taoren@fb.com>

Hi Tao,

On Sat, 3 Aug 2019 at 00:56, Tao Ren <taoren@fb.com> wrote:
>
> genphy_read_status() cannot report correct link speed when BCM54616S PHY
> is configured in RGMII->1000Base-KX mode (for example, on Facebook CMM
> BMC platform), and it is because speed-related fields in MII registers
> are assigned different meanings in 1000X register set. Actually there
> is no speed field in 1000X register set because link speed is always
> 1000 Mb/s.
>
> The patch adds "probe" callback to detect PHY's operation mode based on
> INTERF_SEL[1:0] pins and 1000X/100FX selection bit in SerDES 100-FX
> Control register. Besides, link speed is manually set to 1000 Mb/s in
> "read_status" callback if PHY-switch link is 1000Base-X.
>
> Signed-off-by: Tao Ren <taoren@fb.com>
> ---
>  Changes in v3:
>   - rename bcm5482_read_status to bcm54xx_read_status so the callback can
>     be shared by BCM5482 and BCM54616S.
>  Changes in v2:
>   - Auto-detect PHY operation mode instead of passing DT node.
>   - move PHY mode auto-detect logic from config_init to probe callback.
>   - only set speed (not including duplex) in read_status callback.
>   - update patch description with more background to avoid confusion.
>   - patch #1 in the series ("net: phy: broadcom: set features explicitly
>     for BCM54616") is dropped: the fix should go to get_features callback
>     which may potentially depend on this patch.
>
>  drivers/net/phy/broadcom.c | 41 +++++++++++++++++++++++++++++++++-----
>  include/linux/brcmphy.h    | 10 ++++++++--
>  2 files changed, 44 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
> index 937d0059e8ac..ecad8a201a09 100644
> --- a/drivers/net/phy/broadcom.c
> +++ b/drivers/net/phy/broadcom.c
> @@ -383,9 +383,9 @@ static int bcm5482_config_init(struct phy_device *phydev)
>                 /*
>                  * Select 1000BASE-X register set (primary SerDes)
>                  */
> -               reg = bcm_phy_read_shadow(phydev, BCM5482_SHD_MODE);
> -               bcm_phy_write_shadow(phydev, BCM5482_SHD_MODE,
> -                                    reg | BCM5482_SHD_MODE_1000BX);
> +               reg = bcm_phy_read_shadow(phydev, BCM54XX_SHD_MODE);
> +               bcm_phy_write_shadow(phydev, BCM54XX_SHD_MODE,
> +                                    reg | BCM54XX_SHD_MODE_1000BX);
>
>                 /*
>                  * LED1=ACTIVITYLED, LED3=LINKSPD[2]
> @@ -409,7 +409,7 @@ static int bcm5482_config_init(struct phy_device *phydev)
>         return err;
>  }
>
> -static int bcm5482_read_status(struct phy_device *phydev)
> +static int bcm54xx_read_status(struct phy_device *phydev)
>  {
>         int err;
>
> @@ -464,6 +464,35 @@ static int bcm54616s_config_aneg(struct phy_device *phydev)
>         return ret;
>  }
>
> +static int bcm54616s_probe(struct phy_device *phydev)
> +{
> +       int val, intf_sel;
> +
> +       val = bcm_phy_read_shadow(phydev, BCM54XX_SHD_MODE);
> +       if (val < 0)
> +               return val;
> +
> +       /* The PHY is strapped in RGMII to fiber mode when INTERF_SEL[1:0]
> +        * is 01b.
> +        */
> +       intf_sel = (val & BCM54XX_SHD_INTF_SEL_MASK) >> 1;
> +       if (intf_sel == 1) {
> +               val = bcm_phy_read_shadow(phydev, BCM54616S_SHD_100FX_CTRL);
> +               if (val < 0)
> +                       return val;
> +
> +               /* Bit 0 of the SerDes 100-FX Control register, when set
> +                * to 1, sets the MII/RGMII -> 100BASE-FX configuration.
> +                * When this bit is set to 0, it sets the GMII/RGMII ->
> +                * 1000BASE-X configuration.
> +                */
> +               if (!(val & BCM54616S_100FX_MODE))
> +                       phydev->dev_flags |= PHY_BCM_FLAGS_MODE_1000BX;
> +       }
> +
> +       return 0;
> +}
> +
>  static int brcm_phy_setbits(struct phy_device *phydev, int reg, int set)
>  {
>         int val;
> @@ -655,6 +684,8 @@ static struct phy_driver broadcom_drivers[] = {
>         .config_aneg    = bcm54616s_config_aneg,
>         .ack_interrupt  = bcm_phy_ack_intr,
>         .config_intr    = bcm_phy_config_intr,
> +       .read_status    = bcm54xx_read_status,
> +       .probe          = bcm54616s_probe,
>  }, {
>         .phy_id         = PHY_ID_BCM5464,
>         .phy_id_mask    = 0xfffffff0,
> @@ -689,7 +720,7 @@ static struct phy_driver broadcom_drivers[] = {
>         .name           = "Broadcom BCM5482",
>         /* PHY_GBIT_FEATURES */
>         .config_init    = bcm5482_config_init,
> -       .read_status    = bcm5482_read_status,
> +       .read_status    = bcm54xx_read_status,
>         .ack_interrupt  = bcm_phy_ack_intr,
>         .config_intr    = bcm_phy_config_intr,
>  }, {
> diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
> index 6db2d9a6e503..b475e7f20d28 100644
> --- a/include/linux/brcmphy.h
> +++ b/include/linux/brcmphy.h
> @@ -200,9 +200,15 @@
>  #define BCM5482_SHD_SSD                0x14    /* 10100: Secondary SerDes control */
>  #define BCM5482_SHD_SSD_LEDM   0x0008  /* SSD LED Mode enable */
>  #define BCM5482_SHD_SSD_EN     0x0001  /* SSD enable */
> -#define BCM5482_SHD_MODE       0x1f    /* 11111: Mode Control Register */
> -#define BCM5482_SHD_MODE_1000BX        0x0001  /* Enable 1000BASE-X registers */
>
> +/* 10011: SerDes 100-FX Control Register */
> +#define BCM54616S_SHD_100FX_CTRL       0x13
> +#define        BCM54616S_100FX_MODE            BIT(0)  /* 100-FX SerDes Enable */
> +
> +/* 11111: Mode Control Register */
> +#define BCM54XX_SHD_MODE               0x1f
> +#define BCM54XX_SHD_INTF_SEL_MASK      GENMASK(2, 1)   /* INTERF_SEL[1:0] */
> +#define BCM54XX_SHD_MODE_1000BX                BIT(0)  /* Enable 1000-X registers */
>
>  /*
>   * EXPANSION SHADOW ACCESS REGISTERS.  (PHY REG 0x15, 0x16, and 0x17)
> --
> 2.17.1
>

The patchset looks better now. But is it ok, I wonder, to keep
PHY_BCM_FLAGS_MODE_1000BX in phydev->dev_flags, considering that
phy_attach_direct is overwriting it?

Regards,
-Vladimir

^ permalink raw reply

* Re: [PATCH][net-next][V2] net/mlx5: remove self-assignment on esw->dev
From: Parav Pandit @ 2019-08-03 15:02 UTC (permalink / raw)
  To: Colin King
  Cc: Saeed Mahameed, Leon Romanovsky, David S . Miller, netdev,
	linux-rdma, kernel-janitors, Linux Kernel Mailing List
In-Reply-To: <20190802151316.16011-1-colin.king@canonical.com>

On Sat, Aug 3, 2019 at 7:54 PM Colin King <colin.king@canonical.com> wrote:
>
> From: Colin Ian King <colin.king@canonical.com>
>
> There is a self assignment of esw->dev to itself, clean this up by
> removing it. Also make dev a const pointer.
>
> Addresses-Coverity: ("Self assignment")
> Fixes: 6cedde451399 ("net/mlx5: E-Switch, Verify support QoS element type")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>
> V2: make dev const
>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> index f4ace5f8e884..de0894b695e3 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> @@ -1413,7 +1413,7 @@ static int esw_vport_egress_config(struct mlx5_eswitch *esw,
>
>  static bool element_type_supported(struct mlx5_eswitch *esw, int type)
>  {
> -       struct mlx5_core_dev *dev = esw->dev = esw->dev;
> +       const struct mlx5_core_dev *dev = esw->dev;
>
>         switch (type) {
>         case SCHEDULING_CONTEXT_ELEMENT_TYPE_TSAR:
> --
> 2.20.1
>
Reviewed-by: Parav Pandit <parav@mellanox.com>

^ permalink raw reply

* Re: [PATCH v3 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Alexei Starovoitov @ 2019-08-03 16:25 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Yonghong Song, Song Liu, Kernel Team
In-Reply-To: <CAEf4BzY46=Vosd+kha+_Yh_iXNXhgfSW3ihePApb4GfuzoUU6w@mail.gmail.com>

On Fri, Aug 02, 2019 at 11:30:21PM -0700, Andrii Nakryiko wrote:
> 
> No, not anonymous.
> 
> struct my_struct___local {
>     int a;
> };
> 
> struct my_struct___target {
>     long long a;
> };
> 
> my_struct___local->a will not match my_struct___target->a, but it's
> not a reason to stop relocation process due to error.

why? It feels that this is exactly the reason to fail relocation.
struct names matched. field names matched.
but the types are different. Why proceed further?

Also what about str->a followed by str->b.
Is it possible that str->a will pick up one flavor when str->b different one?
That will likely lead to wrong behavior?

> 
> All the tests I added use non-numeric flavors. While technically I can
> use just ___1, ___2 and so on, it will greatly reduce readability,
> while not really solving any problem (nothing prevents someone to add
> something like lmc___1 eventually).
> 
> I think it's not worth it to complicate this logic just for
> lmc___{softc,media,ctl}, but we can do 2) - try to match any struct as
> is. If that fails, see if it's a "flavor" and match flavors.

Could you please share benchmarking results of largish bpf prog
with couple thousands relocations against typical vmlinux.h ?
I'm concerned that this double check will be noticeable.
May be llvm should recognize "flavor" in the type name and
encode them differently in BTF ?
Or add a pre-pass to libbpf to sort out all types into flavored and not.
If flavored search is expensive may be all flavors could be a linked list
from the base type. The typical case is one or two flavors, right?

> > > > > +     for (i = 0, j = 0; i < cand_ids->len; i++) {
> > > > > +             cand_id = cand_ids->data[i];
> > > > > +             cand_type = btf__type_by_id(targ_btf, cand_id);
> > > > > +             cand_name = btf__name_by_offset(targ_btf, cand_type->name_off);
> > > > > +
> > > > > +             err = bpf_core_spec_match(&local_spec, targ_btf,
> > > > > +                                       cand_id, &cand_spec);
> > > > > +             if (err < 0) {
> > > > > +                     pr_warning("prog '%s': relo #%d: failed to match spec ",
> > > > > +                                prog_name, relo_idx);
> > > > > +                     bpf_core_dump_spec(LIBBPF_WARN, &local_spec);
> > > > > +                     libbpf_print(LIBBPF_WARN,
> > > > > +                                  " to candidate #%d [%d] (%s): %d\n",
> > > > > +                                  i, cand_id, cand_name, err);
> > > > > +                     return err;
> > > > > +             }
> > > > > +             if (err == 0) {
> > > > > +                     pr_debug("prog '%s': relo #%d: candidate #%d [%d] (%s) doesn't match spec ",
> > > > > +                              prog_name, relo_idx, i, cand_id, cand_name);
> > > > > +                     bpf_core_dump_spec(LIBBPF_DEBUG, &local_spec);
> > > > > +                     libbpf_print(LIBBPF_DEBUG, "\n");
> > > > > +                     continue;
> > > > > +             }
> > > > > +
> > > > > +             pr_debug("prog '%s': relo #%d: candidate #%d matched as spec ",
> > > > > +                      prog_name, relo_idx, i);
> > > >
> > > > did you mention that you're going to make a helper for this debug dumps?
> > >
> > > yeah, I added bpf_core_dump_spec(), but I don't know how to shorten
> > > this further... This output is extremely useful to understand what's
> > > happening and will be invaluable when users will inevitably report
> > > confusing behavior in some cases, so I still want to keep it.
> >
> > not sure yet. Just pointing out that this function has more debug printfs
> > than actual code which doesn't look right.
> > We have complex algorithms in the kernel (like verifier).
> > Yet we don't sprinkle printfs in there to this degree.
> >
> 
> We do have a verbose verifier logging, though, exactly to help users
> to debug issues, which is extremely helpful and is greatly appreciated
> by users.
> There is nothing worse for developer experience than getting -EINVAL
> without any useful log message. Been there, banged my head against the
> wall wishing for a bit more verbose log. What are we trying to
> optimize for here?

All I'm saying that three printfs in a row that essentially convey the same info
look like clowntown. Some level of verbosity is certainly useful.


^ permalink raw reply

* Re: [PATCH net-next] netfilter: conntrack: use shared sysctl constants
From: Matteo Croce @ 2019-08-03 16:34 UTC (permalink / raw)
  To: netdev, netfilter-devel
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal, LKML,
	Andrew Morton, Tonghao Zhang, Kees Cook
In-Reply-To: <20190723012303.2221-1-mcroce@redhat.com>

On Tue, Jul 23, 2019 at 3:23 AM Matteo Croce <mcroce@redhat.com> wrote:
>
> Use shared sysctl variables for zero and one constants, as in commit
> eec4844fae7c ("proc/sysctl: add shared variables for range check")
>
> Fixes: 8f14c99c7eda ("netfilter: conntrack: limit sysctl setting for boolean options")
> Signed-off-by: Matteo Croce <mcroce@redhat.com>
>

followup, can anyone review it?

Thanks,
-- 
Matteo Croce
per aspera ad upstream

^ permalink raw reply

* Re: [PATCH net v3] ipvs: Improve robustness to the ipvs sysctl
From: Pablo Neira Ayuso @ 2019-08-03 16:40 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: hujunwei, wensong, horms, kadlec, Florian Westphal, davem,
	netdev@vger.kernel.org, lvs-devel, netfilter-devel, coreteam,
	Mingfangsen, wangxiaogang3, xuhanbing
In-Reply-To: <alpine.LFD.2.21.1907312052310.3631@ja.home.ssi.bg>

On Wed, Jul 31, 2019 at 08:53:47PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Thu, 1 Aug 2019, hujunwei wrote:
> 
> > From: Junwei Hu <hujunwei4@huawei.com>
> > 
> > The ipvs module parse the user buffer and save it to sysctl,
> > then check if the value is valid. invalid value occurs
> > over a period of time.
> > Here, I add a variable, struct ctl_table tmp, used to read
> > the value from the user buffer, and save only when it is valid.
> > I delete proc_do_sync_mode and use extra1/2 in table for the
> > proc_dointvec_minmax call.
> > 
> > Fixes: f73181c8288f ("ipvs: add support for sync threads")
> > Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
> > Acked-by: Julian Anastasov <ja@ssi.bg>
> 
> 	Yep, Acked-by: Julian Anastasov <ja@ssi.bg>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] netfilter: conntrack: use shared sysctl constants
From: Pablo Neira Ayuso @ 2019-08-03 16:40 UTC (permalink / raw)
  To: Matteo Croce
  Cc: netdev, netfilter-devel, Jozsef Kadlecsik, Florian Westphal,
	linux-kernel, akpm, Tonghao Zhang
In-Reply-To: <20190723012303.2221-1-mcroce@redhat.com>

On Tue, Jul 23, 2019 at 03:23:03AM +0200, Matteo Croce wrote:
> Use shared sysctl variables for zero and one constants, as in commit
> eec4844fae7c ("proc/sysctl: add shared variables for range check")

Applied, thanks.

^ permalink raw reply

* Re: [PATCH v3 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Andrii Nakryiko @ 2019-08-03 16:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Yonghong Song, Song Liu, Kernel Team
In-Reply-To: <20190803162556.pdbckv7yta4wigjk@ast-mbp.dhcp.thefacebook.com>

On Sat, Aug 3, 2019 at 9:26 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Aug 02, 2019 at 11:30:21PM -0700, Andrii Nakryiko wrote:
> >
> > No, not anonymous.
> >
> > struct my_struct___local {
> >     int a;
> > };
> >
> > struct my_struct___target {
> >     long long a;
> > };
> >
> > my_struct___local->a will not match my_struct___target->a, but it's
> > not a reason to stop relocation process due to error.
>
> why? It feels that this is exactly the reason to fail relocation.
> struct names matched. field names matched.
> but the types are different. Why proceed further?

Because it can be accidental name collision. Think about two different
ring_buffer structs, let's say they have some field with the same
generic name, but different types, e.g., id (I'm making everything up,
except the fact that there are two different ring_buffers in kernel):

struct ring_buffer {
    int id;
    int extra_field_a;
    /* tons of other fields */
};

struct ring_buffer {
    char id[16];
    int extra_field_b;
    /* other stuff */
};

If BPF app expects to read first ring_buffer's id (of int type), that
relocation will still be tried against both, because there is no way
to know which one was originally intended. But because type matched
for only one of them, then that one is the chosen candidate.

Now, if both ids were int, then we would have two matching candidates,
and then two situations are possible:

1. Offsets of that field match. Then it doesn't matter, we still
successfully relocate.
2. Offsets don't match - that's an error, because we can't figure out
correct relocation. Program loading will fail in that case.

But in situation 2, if there was another relocation for ring_buffer
(for another field), and that field was matched against only one of
the structs, then that one struct would be chosen only.

So in general, I try to postpone failure as much as possible, because
there could be "accidental" name matches. This is not a failure, until
we have ambiguity or failed to find any matching candidate, that
satisfies **all** field relocations.

>
> Also what about str->a followed by str->b.
> Is it possible that str->a will pick up one flavor when str->b different one?

No, if str->a picked one flavor and discarded another one, then we are
left with that chosen flavor as single possible candidate for all
subsequent relocations against structs with the same name. That's main
reason or this cache of candidates that I maintain (not just a
performance improvement).

> That will likely lead to wrong behavior?
>
> >
> > All the tests I added use non-numeric flavors. While technically I can
> > use just ___1, ___2 and so on, it will greatly reduce readability,
> > while not really solving any problem (nothing prevents someone to add
> > something like lmc___1 eventually).
> >
> > I think it's not worth it to complicate this logic just for
> > lmc___{softc,media,ctl}, but we can do 2) - try to match any struct as
> > is. If that fails, see if it's a "flavor" and match flavors.
>
> Could you please share benchmarking results of largish bpf prog
> with couple thousands relocations against typical vmlinux.h ?

I can once I have such a program converted.

> I'm concerned that this double check will be noticeable.

Well yes, it's an extra pass, essentially, which is why I wouldn't
like to do it.

> May be llvm should recognize "flavor" in the type name and
> encode them differently in BTF ?
> Or add a pre-pass to libbpf to sort out all types into flavored and not.

This seems like extreme overcomplication for extreme corner case. And
there is very simple work-around from user space.

E.g., let's take that lmc___media struct. If for user really needs to
relocate such struct, we can just define (locally) typedef:

typedef struct lmc___media lmc___media___for_real;

And now "non-flavored" name that we will try to match will be desired
"lmc___media". Given there are whole 3 structs like that right now,
I'm fine with this work around in exchange of not overcomplicating
libbpf's algorithm. WDYT?

> If flavored search is expensive may be all flavors could be a linked list
> from the base type. The typical case is one or two flavors, right?

Flavors search is no more expensive than non-flavored one. And we have
a list of candidates.

>
> > > > > > +     for (i = 0, j = 0; i < cand_ids->len; i++) {
> > > > > > +             cand_id = cand_ids->data[i];
> > > > > > +             cand_type = btf__type_by_id(targ_btf, cand_id);
> > > > > > +             cand_name = btf__name_by_offset(targ_btf, cand_type->name_off);
> > > > > > +
> > > > > > +             err = bpf_core_spec_match(&local_spec, targ_btf,
> > > > > > +                                       cand_id, &cand_spec);
> > > > > > +             if (err < 0) {
> > > > > > +                     pr_warning("prog '%s': relo #%d: failed to match spec ",
> > > > > > +                                prog_name, relo_idx);
> > > > > > +                     bpf_core_dump_spec(LIBBPF_WARN, &local_spec);
> > > > > > +                     libbpf_print(LIBBPF_WARN,
> > > > > > +                                  " to candidate #%d [%d] (%s): %d\n",
> > > > > > +                                  i, cand_id, cand_name, err);
> > > > > > +                     return err;
> > > > > > +             }
> > > > > > +             if (err == 0) {
> > > > > > +                     pr_debug("prog '%s': relo #%d: candidate #%d [%d] (%s) doesn't match spec ",
> > > > > > +                              prog_name, relo_idx, i, cand_id, cand_name);
> > > > > > +                     bpf_core_dump_spec(LIBBPF_DEBUG, &local_spec);
> > > > > > +                     libbpf_print(LIBBPF_DEBUG, "\n");
> > > > > > +                     continue;
> > > > > > +             }
> > > > > > +
> > > > > > +             pr_debug("prog '%s': relo #%d: candidate #%d matched as spec ",
> > > > > > +                      prog_name, relo_idx, i);
> > > > >
> > > > > did you mention that you're going to make a helper for this debug dumps?
> > > >
> > > > yeah, I added bpf_core_dump_spec(), but I don't know how to shorten
> > > > this further... This output is extremely useful to understand what's
> > > > happening and will be invaluable when users will inevitably report
> > > > confusing behavior in some cases, so I still want to keep it.
> > >
> > > not sure yet. Just pointing out that this function has more debug printfs
> > > than actual code which doesn't look right.
> > > We have complex algorithms in the kernel (like verifier).
> > > Yet we don't sprinkle printfs in there to this degree.
> > >
> >
> > We do have a verbose verifier logging, though, exactly to help users
> > to debug issues, which is extremely helpful and is greatly appreciated
> > by users.
> > There is nothing worse for developer experience than getting -EINVAL
> > without any useful log message. Been there, banged my head against the
> > wall wishing for a bit more verbose log. What are we trying to
> > optimize for here?
>
> All I'm saying that three printfs in a row that essentially convey the same info
> look like clowntown. Some level of verbosity is certainly useful.

But they don't convey the same information! One is error clearly
stating "failed to match with error". Another one is informational
debug-only "we tried this candidate, it didn't match, but it's not an
error". The other one "yay, we successfully found this candidate".
Three extremely different outcomes.

Given I find it hard to understand what's the exact reason you don't
like this, it's hard to guess the approach that you'll find
acceptable, but I'll try one more time. How about this:

- emit "We are trying this spec/this candidate" before we try (at
DEBUG log level, though);
- then depending on the outcome, emit corresponding "cand #%d:
failed", "cand #%d: not a match", or "cand #%d: a match".

This is very subpar, especially given that on error we might not have
spec/candidate details, because they are DEBUG-only, but still better
than nothing.

But, honestly, at this point I'm ready to give up and remove all that,
I'll just keep adding it back locally when reproducing or debugging
some issue in the future, which is PITA, but nothing I can't live
with.
>

^ permalink raw reply

* Re: [net-next 00/11][pull request] 100GbE Intel Wired LAN Driver Updates 2019-08-01
From: David Miller @ 2019-08-03 17:40 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann
In-Reply-To: <20190801222548.15975-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Thu,  1 Aug 2019 15:25:37 -0700

> This series for fm10k, by Jake Keller, reduces the scope of local variables
> where possible.
> 
> The following are changes since commit a8e600e2184f45c40025cbe4d7e8893b69378a9f:
>   Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 100GbE

Pulled, thanks Jeff.

^ permalink raw reply

* Re: [PATCH net-next 00/15] net: Add functional tests for L3 and L4
From: David Miller @ 2019-08-03 17:42 UTC (permalink / raw)
  To: dsahern; +Cc: netdev, dsahern
In-Reply-To: <20190801185648.27653-1-dsahern@kernel.org>

From: David Ahern <dsahern@kernel.org>
Date: Thu,  1 Aug 2019 11:56:33 -0700

> From: David Ahern <dsahern@gmail.com>
> 
> This is a port the functional test cases created during the development
> of the VRF feature. It covers various permutations of icmp, tcp and udp
> for IPv4 and IPv6 including negative tests.

Series applied, thanks David.

Please work with Alexei to make the server/client startup work better in
a CI environment since frankly that's the most important situation where
all of these tests are going to run now that these are in the tree.

Thanks.

^ permalink raw reply

* Re: [net-next 0/9][pull request] 40GbE Intel Wired LAN Driver Updates 2019-08-01
From: David Miller @ 2019-08-03 17:46 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann
In-Reply-To: <20190801205149.4114-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Thu,  1 Aug 2019 13:51:40 -0700

> This series contains updates to i40e driver only.

Patch #2 seems to need some changes, so I'll wait for the next respin
of this pull request.

^ permalink raw reply

* Re: [PATCH] net: sctp: Rename fallthrough label to unhandled
From: Joe Perches @ 2019-08-03 18:01 UTC (permalink / raw)
  To: David Miller
  Cc: nhorman, vyasevich, marcelo.leitner, linux-sctp, netdev,
	linux-kernel
In-Reply-To: <20190802.161932.1776993765494484851.davem@davemloft.net>

On Fri, 2019-08-02 at 16:19 -0700, David Miller wrote:
> From: Joe Perches <joe@perches.com>
> Date: Fri, 02 Aug 2019 10:47:34 -0700
> 
> > On Wed, 2019-07-31 at 08:16 -0400, Neil Horman wrote:
> >> On Wed, Jul 31, 2019 at 04:32:43AM -0700, Joe Perches wrote:
> >> > On Wed, 2019-07-31 at 07:19 -0400, Neil Horman wrote:
> >> > > On Tue, Jul 30, 2019 at 10:04:37PM -0700, Joe Perches wrote:
> >> > > > fallthrough may become a pseudo reserved keyword so this only use of
> >> > > > fallthrough is better renamed to allow it.
> > 
> > Can you or any other maintainer apply this patch
> > or ack it so David Miller can apply it?
> 
> I, like others, don't like the lack of __ in the keyword.  It's kind of
> rediculous the problems it creates to pollute the global namespace like
> that and yes also inconsistent with other shorthands for builtins.

Rejected?

I think that's inappropriate.

As coded, it's nothing like a fallthrough and
the rename to unhandled is more descriptive.



^ permalink raw reply

* Re: [PATCH v3] net/socket: fix GCC8+ Wpacked-not-aligned warnings
From: David Miller @ 2019-08-03 18:03 UTC (permalink / raw)
  To: cai
  Cc: vyasevich, nhorman, marcelo.leitner, David.Laight, linux-sctp,
	netdev, linux-kernel
In-Reply-To: <1564500633-7419-1-git-send-email-cai@lca.pw>

From: Qian Cai <cai@lca.pw>
Date: Tue, 30 Jul 2019 11:30:33 -0400

> There are a lot of those warnings with GCC8+ 64-bit,
 ...
> This is because the commit 20c9c825b12f ("[SCTP] Fix SCTP socket options
> to work with 32-bit apps on 64-bit kernels.") added "packed, aligned(4)"
> GCC attributes to some structures but one of the members, i.e, "struct
> sockaddr_storage" in those structures has the attribute,
> "aligned(__alignof__ (struct sockaddr *)" which is 8-byte on 64-bit
> systems, so the commit overwrites the designed alignments for
> "sockaddr_storage".
> 
> To fix this, "struct sockaddr_storage" needs to be aligned to 4-byte as
> it is only used in those packed sctp structure which is part of UAPI,
> and "struct __kernel_sockaddr_storage" is used in some other
> places of UAPI that need not to change alignments in order to not
> breaking userspace.
> 
> Use an implicit alignment for "struct __kernel_sockaddr_storage" so it
> can keep the same alignments as a member in both packed and un-packed
> structures without breaking UAPI.
> 
> Suggested-by: David Laight <David.Laight@ACULAB.COM>
> Signed-off-by: Qian Cai <cai@lca.pw>

Ok, this just changes how the alignment is declared from using the
aligned attribute to using a union member which requires the same
alignment.

Applied.

^ permalink raw reply

* Re: [PATCH v2] net: mdio-octeon: Fix Kconfig warnings and build errors
From: Randy Dunlap @ 2019-08-03 18:34 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: andrew, broonie, davem, devel, f.fainelli, gregkh, hkallweit1,
	kernel-build-reports, linux-arm-kernel, linux-next, lkp, netdev,
	willy
In-Reply-To: <20190803060155.89548-1-natechancellor@gmail.com>

On 8/2/19 11:01 PM, Nathan Chancellor wrote:
> After commit 171a9bae68c7 ("staging/octeon: Allow test build on
> !MIPS"), the following combination of configs cause a few Kconfig
> warnings and build errors (distilled from arm allyesconfig and Randy's
> randconfig builds):
> 
>     CONFIG_NETDEVICES=y
>     CONFIG_STAGING=y
>     CONFIG_COMPILE_TEST=y
> 
> and CONFIG_OCTEON_ETHERNET as either a module or built-in.
> 
> WARNING: unmet direct dependencies detected for MDIO_OCTEON
>   Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=y] && MDIO_BUS [=y]
> && 64BIT [=n] && HAS_IOMEM [=y] && OF_MDIO [=n]
>   Selected by [y]:
>   - OCTEON_ETHERNET [=y] && STAGING [=y] && (CAVIUM_OCTEON_SOC ||
> COMPILE_TEST [=y]) && NETDEVICES [=y]
> 
> In file included from ../drivers/net/phy/mdio-octeon.c:14:
> ../drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of
> function ‘writeq’; did you mean ‘writel’?
> [-Werror=implicit-function-declaration]
>   111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
>       |                                    ^~~~~~
> 
> CONFIG_64BIT is not strictly necessary if the proper readq/writeq
> definitions are included from io-64-nonatomic-lo-hi.h.
> 
> CONFIG_OF_MDIO is not needed when CONFIG_COMPILE_TEST is enabled because
> of commit f9dc9ac51610 ("of/mdio: Add dummy functions in of_mdio.h.").
> 
> Fixes: 171a9bae68c7 ("staging/octeon: Allow test build on !MIPS")
> Reported-by: kbuild test robot <lkp@intel.com>
> Reported-by: Mark Brown <broonie@kernel.org>
> Reported-by: Randy Dunlap <rdunlap@infradead.org>
> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>

Works for me. Fixes the reported build errors.  Thanks.

Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested

> ---
> 
> v1 -> v2:
> 
> * Address Randy's reported failure here: https://lore.kernel.org/netdev/b3444283-7a77-ece8-7ac6-41756aa7dc60@infradead.org/
>   by not requiring CONFIG_OF_MDIO when CONFIG_COMPILE_TEST is set.
> 
> * Improve commit message
> 
>  drivers/net/phy/Kconfig       | 4 ++--
>  drivers/net/phy/mdio-cavium.h | 2 ++
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
> index 20f14c5fbb7e..0e3d9e3d3533 100644
> --- a/drivers/net/phy/Kconfig
> +++ b/drivers/net/phy/Kconfig
> @@ -159,8 +159,8 @@ config MDIO_MSCC_MIIM
>  
>  config MDIO_OCTEON
>  	tristate "Octeon and some ThunderX SOCs MDIO buses"
> -	depends on 64BIT
> -	depends on HAS_IOMEM && OF_MDIO
> +	depends on (64BIT && OF_MDIO) || COMPILE_TEST
> +	depends on HAS_IOMEM
>  	select MDIO_CAVIUM
>  	help
>  	  This module provides a driver for the Octeon and ThunderX MDIO
> diff --git a/drivers/net/phy/mdio-cavium.h b/drivers/net/phy/mdio-cavium.h
> index ed5f9bb5448d..b7f89ad27465 100644
> --- a/drivers/net/phy/mdio-cavium.h
> +++ b/drivers/net/phy/mdio-cavium.h
> @@ -108,6 +108,8 @@ static inline u64 oct_mdio_readq(u64 addr)
>  	return cvmx_read_csr(addr);
>  }
>  #else
> +#include <linux/io-64-nonatomic-lo-hi.h>
> +
>  #define oct_mdio_writeq(val, addr)	writeq(val, (void *)addr)
>  #define oct_mdio_readq(addr)		readq((void *)addr)
>  #endif
> 


-- 
~Randy

^ permalink raw reply

* Re: [PATCH 06/34] drm/i915: convert put_page() to put_user_page*()
From: John Hubbard @ 2019-08-03 20:03 UTC (permalink / raw)
  To: Joonas Lahtinen, Andrew Morton, john.hubbard
  Cc: Christoph Hellwig, Dan Williams, Dave Chinner, Dave Hansen,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jérôme Glisse,
	LKML, amd-gfx, ceph-devel, devel, devel, dri-devel, intel-gfx,
	kvm, linux-arm-kernel, linux-block, linux-crypto, linux-fbdev,
	linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
	linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
	xen-devel, Jani Nikula, Rodrigo Vivi, David Airlie
In-Reply-To: <7d9a9c57-4322-270b-b636-7214019f87e9@nvidia.com>

On 8/2/19 11:48 AM, John Hubbard wrote:
> On 8/2/19 2:19 AM, Joonas Lahtinen wrote:
>> Quoting john.hubbard@gmail.com (2019-08-02 05:19:37)
>>> From: John Hubbard <jhubbard@nvidia.com>
...
> In order to deal with the merge problem, I'll drop this patch from my series,
> and I'd recommend that the drm-intel-next take the following approach:

Actually, I just pulled the latest linux.git, and there are a few changes:

> 
> 1) For now, s/put_page/put_user_page/ in i915_gem_userptr_put_pages(),
> and fix up the set_page_dirty() --> set_page_dirty_lock() issue, like this
> (based against linux.git):
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> index 528b61678334..94721cc0093b 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> @@ -664,10 +664,10 @@ i915_gem_userptr_put_pages(struct drm_i915_gem_object *obj,
> 
>         for_each_sgt_page(page, sgt_iter, pages) {
>                 if (obj->mm.dirty)
> -                       set_page_dirty(page);
> +                       set_page_dirty_lock(page);

I see you've already applied this fix to your tree, in linux.git already.

> 
>                 mark_page_accessed(page);
> -               put_page(page);
> +               put_user_page(page);

But this conversion still needs doing. So I'll repost a patch that only does 
this (plus the other call sites). 

That can go in via either your tree, or Andrew's -mm tree, without generating
any conflicts.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-08-03 21:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <20190802172418.GB11245@ziepe.ca>

On Fri, Aug 02, 2019 at 02:24:18PM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 02, 2019 at 10:27:21AM -0400, Michael S. Tsirkin wrote:
> > On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> > > > > This must be a proper barrier, like a spinlock, mutex, or
> > > > > synchronize_rcu.
> > > > 
> > > > 
> > > > I start with synchronize_rcu() but both you and Michael raise some
> > > > concern.
> > > 
> > > I've also idly wondered if calling synchronize_rcu() under the various
> > > mm locks is a deadlock situation.
> > > 
> > > > Then I try spinlock and mutex:
> > > > 
> > > > 1) spinlock: add lots of overhead on datapath, this leads 0 performance
> > > > improvement.
> > > 
> > > I think the topic here is correctness not performance improvement
> > 
> > The topic is whether we should revert
> > commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address")
> > 
> > or keep it in. The only reason to keep it is performance.
> 
> Yikes, I'm not sure you can ever win against copy_from_user using
> mmu_notifiers?

Ever since copy_from_user started playing with flags (for SMAP) and
added speculation barriers there's a chance we can win by accessing
memory through the kernel address.


Another reason would be to access it from e.g. softirq
context. copy_from_user will only work if the
correct mmu is active.


> The synchronization requirements are likely always
> more expensive unless large and scattered copies are being done..
> 
> The rcu is about the only simple approach that could be less
> expensive, and that gets back to the question if you can block an
> invalidate_start_range in synchronize_rcu or not..
> 
> So, frankly, I'd revert it until someone could prove the rcu solution is
> OK..

I have it all disabled at compile time, so reverting isn't urgent
anymore. I'll wait a couple more days to decide what's cleanest.

> BTW, how do you get copy_from_user to work outside a syscall?

By switching to the correct mm.

> 
> Also, why can't this just permanently GUP the pages? In fact, where
> does it put_page them anyhow? Worrying that 7f466 adds a get_user page
> but does not add a put_page??
> 
> Jason

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-08-03 21:54 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, virtualization, netdev, linux-kernel, linux-mm, jgg,
	Paul E. McKenney
In-Reply-To: <130386548.6222676.1564646773879.JavaMail.zimbra@redhat.com>

On Thu, Aug 01, 2019 at 04:06:13AM -0400, Jason Wang wrote:
> On 2019/8/1 上午2:29, Michael S. Tsirkin wrote:
> > On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
> >> We used to use RCU to synchronize MMU notifier with worker. This leads
> >> calling synchronize_rcu() in invalidate_range_start(). But on a busy
> >> system, there would be many factors that may slow down the
> >> synchronize_rcu() which makes it unsuitable to be called in MMU
> >> notifier.
> >>
> >> A solution is SRCU but its overhead is obvious with the expensive full
> >> memory barrier. Another choice is to use seqlock, but it doesn't
> >> provide a synchronization method between readers and writers. The last
> >> choice is to use vq mutex, but it need to deal with the worst case
> >> that MMU notifier must be blocked and wait for the finish of swap in.
> >>
> >> So this patch switches use a counter to track whether or not the map
> >> was used. The counter was increased when vq try to start or finish
> >> uses the map. This means, when it was even, we're sure there's no
> >> readers and MMU notifier is synchronized. When it was odd, it means
> >> there's a reader we need to wait it to be even again then we are
> >> synchronized. To avoid full memory barrier, store_release +
> >> load_acquire on the counter is used.
> >
> > Unfortunately this needs a lot of review and testing, so this can't make
> > rc2, and I don't think this is the kind of patch I can merge after rc3.
> > Subtle memory barrier tricks like this can introduce new bugs while they
> > are fixing old ones.
> 
> I admit the patch is tricky. Some questions:
> 
> - Do we must address the case of e.g swap in? If not, a simple
>   vhost_work_flush() instead of synchronize_rcu() may work.
> - Having some hard thought, I think we can use seqlock, it looks
>   to me smp_wmb() is in write_segcount_begin() is sufficient, we don't
>   care vq->map read before smp_wmb(), and for the other we all have
>   good data devendency so smp_wmb() in the write_seqbegin_end() is
>   sufficient.

If we need an mb in the begin() we can switch to
dependent_ptr_mb. if you need me to fix it up
and repost, let me know.

Why isn't it a problem if the map is
accessed outside the lock?



> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index db2c81cb1e90..6d9501303258 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -363,39 +363,29 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>  
>  static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
>  {
> -	int ref = READ_ONCE(vq->ref);
> -
> -	smp_store_release(&vq->ref, ref + 1);
> -	/* Make sure ref counter is visible before accessing the map */
> -	smp_load_acquire(&vq->ref);
> +	write_seqcount_begin(&vq->seq);
>  }
>  
>  static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
>  {
> -	int ref = READ_ONCE(vq->ref);
> -
> -	/* Make sure vq access is done before increasing ref counter */
> -	smp_store_release(&vq->ref, ref + 1);
> +	write_seqcount_end(&vq->seq);
>  }
>  
>  static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
>  {
> -	int ref;
> +	unsigned int ret;
>  
>  	/* Make sure map change was done before checking ref counter */
>  	smp_mb();
> -
> -	ref = READ_ONCE(vq->ref);
> -	if (ref & 0x1) {
> -		/* When ref change, we are sure no reader can see
> +	ret = raw_read_seqcount(&vq->seq);
> +	if (ret & 0x1) {
> +		/* When seq changes, we are sure no reader can see
>  		 * previous map */
> -		while (READ_ONCE(vq->ref) == ref) {
> -			set_current_state(TASK_RUNNING);
> +		while (raw_read_seqcount(&vq->seq) == ret)
>  			schedule();


So why do we set state here? And should not we
check need_sched?


> -		}
>  	}
> -	/* Make sure ref counter was checked before any other
> -	 * operations that was dene on map. */
> +	/* Make sure seq was checked before any other operations that
> +	 * was dene on map. */
>  	smp_mb();
>  }
>  
> @@ -691,7 +681,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  		vq->indirect = NULL;
>  		vq->heads = NULL;
>  		vq->dev = dev;
> -		vq->ref = 0;
> +		seqcount_init(&vq->seq);
>  		mutex_init(&vq->mutex);
>  		spin_lock_init(&vq->mmu_lock);
>  		vhost_vq_reset(dev, vq);
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 3d10da0ae511..1a705e181a84 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -125,7 +125,7 @@ struct vhost_virtqueue {
>  	 */
>  	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
>  #endif
> -	int ref;
> +	seqcount_t seq;
>  	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>  
>  	struct file *kick;
> -- 
> 2.18.1
> 
> >
> >
> >
> >
> >
> >>
> >> Consider the read critical section is pretty small the synchronization
> >> should be done very fast.
> >>
> >> Note the patch lead about 3% PPS dropping.
> >
> > Sorry what do you mean by this last sentence? This degrades performance
> > compared to what?
> 
> Compare to without this patch.

OK is the feature still a performance win? or should we drop it for now?

> >
> >>
> >> Reported-by: Michael S. Tsirkin <mst@redhat.com>
> >> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
> >> Signed-off-by: Jason Wang <jasowang@redhat.com>
> >> ---
> >>  drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++----------------
> >>  drivers/vhost/vhost.h |   7 +-
> >>  2 files changed, 94 insertions(+), 58 deletions(-)
> >>
> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >> index cfc11f9ed9c9..db2c81cb1e90 100644
> >> --- a/drivers/vhost/vhost.c
> >> +++ b/drivers/vhost/vhost.c
> >> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
> >>  
> >>  	spin_lock(&vq->mmu_lock);
> >>  	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> >> -		map[i] = rcu_dereference_protected(vq->maps[i],
> >> -				  lockdep_is_held(&vq->mmu_lock));
> >> +		map[i] = vq->maps[i];
> >>  		if (map[i]) {
> >>  			vhost_set_map_dirty(vq, map[i], i);
> >> -			rcu_assign_pointer(vq->maps[i], NULL);
> >> +			vq->maps[i] = NULL;
> >>  		}
> >>  	}
> >>  	spin_unlock(&vq->mmu_lock);
> >>  
> >> -	/* No need for synchronize_rcu() or kfree_rcu() since we are
> >> -	 * serialized with memory accessors (e.g vq mutex held).
> >> +	/* No need for synchronization since we are serialized with
> >> +	 * memory accessors (e.g vq mutex held).
> >>  	 */
> >>  
> >>  	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> >> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
> >>  	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
> >>  }
> >>  
> >> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> >> +{
> >> +	int ref = READ_ONCE(vq->ref);
> >> +
> >> +	smp_store_release(&vq->ref, ref + 1);
> >> +	/* Make sure ref counter is visible before accessing the map */
> >> +	smp_load_acquire(&vq->ref);
> >
> > The map access is after this sequence, correct?
> 
> Yes.
> 
> >
> > Just going by the rules in Documentation/memory-barriers.txt,
> > I think that this pair will not order following accesses with ref store.
> >
> > Documentation/memory-barriers.txt says:
> >
> >
> > +     In addition, a RELEASE+ACQUIRE
> > +     pair is -not- guaranteed to act as a full memory barrier.
> >
> >
> >
> > The guarantee that is made is this:
> > 	after
> >      an ACQUIRE on a given variable, all memory accesses preceding any prior
> >      RELEASE on that same variable are guaranteed to be visible.
> 
> Yes, but it's not clear about the order of ACQUIRE the same location
> of previous RELEASE. And it only has a example like:
> 
> "
> 	*A = a;
> 	RELEASE M
> 	ACQUIRE N
> 	*B = b;
> 
> could occur as:
> 
> 	ACQUIRE N, STORE *B, STORE *A, RELEASE M
> "
> 
> But it doesn't explain what happen when
> 
> *A = a
> RELEASE M
> ACQUIRE M
> *B = b;
> 
> And tools/memory-model/Documentation said
> 
> "
> First, when a lock-acquire reads from a lock-release, the LKMM
> requires that every instruction po-before the lock-release must
> execute before any instruction po-after the lock-acquire.
> "
> 
> Is this a hint that I was correct?

I don't think it's correct since by this logic
memory barriers can be nops on x86.

> >
> >
> > And if we also had the reverse rule we'd end up with a full barrier,
> > won't we?
> >
> > Cc Paul in case I missed something here. And if I'm right,
> > maybe we should call this out, adding
> >
> > 	"The opposite is not true: a prior RELEASE is not
> > 	 guaranteed to be visible before memory accesses following
> > 	 the subsequent ACQUIRE".
> 
> That kinds of violates the RELEASE?
> 
> "
>      This also acts as a one-way permeable barrier.  It guarantees that all
>      memory operations before the RELEASE operation will appear to happen
>      before the RELEASE operation with respect to the other components of the
> "


yes but we are talking about RELEASE itself versus stuff
that comes after it.

> >
> >
> >
> >> +}
> >> +
> >> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> >> +{
> >> +	int ref = READ_ONCE(vq->ref);
> >> +
> >> +	/* Make sure vq access is done before increasing ref counter */
> >> +	smp_store_release(&vq->ref, ref + 1);
> >> +}
> >> +
> >> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
> >> +{
> >> +	int ref;
> >> +
> >> +	/* Make sure map change was done before checking ref counter */
> >> +	smp_mb();
> >> +
> >> +	ref = READ_ONCE(vq->ref);
> >> +	if (ref & 0x1) {
> >
> > Please document the even/odd trick here too, not just in the commit log.
> >
> 
> Ok.
> 
> >> +		/* When ref change,
> >
> > changes
> >
> >> we are sure no reader can see
> >> +		 * previous map */
> >> +		while (READ_ONCE(vq->ref) == ref) {
> >
> >
> > what is the below line in aid of?
> >
> >> +			set_current_state(TASK_RUNNING);

any answers here?

> >> +			schedule();
> >
> >                         if (need_resched())
> >                                 schedule();
> >
> > ?
> 
> Yes, better.
> 
> >
> >> +		}
> >
> > On an interruptible kernel, there's a risk here is that
> > a task got preempted with an odd ref.
> > So I suspect we'll have to disable preemption when we
> > make ref odd.
> 
> I'm not sure I get, if the odd is not the original value we read,
> we're sure it won't read the new map here I believe.

But we will spin for a very long time in this case.

> >
> >
> >> +	}
> >> +	/* Make sure ref counter was checked before any other
> >> +	 * operations that was dene on map. */
> >
> > was dene -> were done?
> >
> 
> Yes.
> 
> >> +	smp_mb();
> >> +}
> >> +
> >>  static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> >>  				      int index,
> >>  				      unsigned long start,
> >> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> >>  	spin_lock(&vq->mmu_lock);
> >>  	++vq->invalidate_count;
> >>  
> >> -	map = rcu_dereference_protected(vq->maps[index],
> >> -					lockdep_is_held(&vq->mmu_lock));
> >> +	map = vq->maps[index];
> >>  	if (map) {
> >>  		vhost_set_map_dirty(vq, map, index);
> >> -		rcu_assign_pointer(vq->maps[index], NULL);
> >> +		vq->maps[index] = NULL;
> >>  	}
> >>  	spin_unlock(&vq->mmu_lock);
> >>  
> >>  	if (map) {
> >> -		synchronize_rcu();
> >> +		vhost_vq_sync_access(vq);
> >>  		vhost_map_unprefetch(map);
> >>  	}
> >>  }
> >> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev)
> >>  	for (i = 0; i < dev->nvqs; ++i) {
> >>  		vq = dev->vqs[i];
> >>  		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> >> -			RCU_INIT_POINTER(vq->maps[j], NULL);
> >> +			vq->maps[j] = NULL;
> >>  	}
> >>  }
> >>  #endif
> >> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> >>  		vq->indirect = NULL;
> >>  		vq->heads = NULL;
> >>  		vq->dev = dev;
> >> +		vq->ref = 0;
> >>  		mutex_init(&vq->mutex);
> >>  		spin_lock_init(&vq->mmu_lock);
> >>  		vhost_vq_reset(dev, vq);
> >> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
> >>  	map->npages = npages;
> >>  	map->pages = pages;
> >>  
> >> -	rcu_assign_pointer(vq->maps[index], map);
> >> +	vq->maps[index] = map;
> >>  	/* No need for a synchronize_rcu(). This function should be
> >>  	 * called by dev->worker so we are serialized with all
> >>  	 * readers.
> >> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> >>  	struct vring_used *used;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> >> +		map = vq->maps[VHOST_ADDR_USED];
> >>  		if (likely(map)) {
> >>  			used = map->addr;
> >>  			*((__virtio16 *)&used->ring[vq->num]) =
> >>  				cpu_to_vhost16(vq, vq->avail_idx);
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> >>  	size_t size;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> >> +		map = vq->maps[VHOST_ADDR_USED];
> >>  		if (likely(map)) {
> >>  			used = map->addr;
> >>  			size = count * sizeof(*head);
> >>  			memcpy(used->ring + idx, head, size);
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> >>  	struct vring_used *used;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> >> +		map = vq->maps[VHOST_ADDR_USED];
> >>  		if (likely(map)) {
> >>  			used = map->addr;
> >>  			used->flags = cpu_to_vhost16(vq, vq->used_flags);
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
> >>  	struct vring_used *used;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> >> +		map = vq->maps[VHOST_ADDR_USED];
> >>  		if (likely(map)) {
> >>  			used = map->addr;
> >>  			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> >>  	struct vring_avail *avail;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> >> +		map = vq->maps[VHOST_ADDR_AVAIL];
> >>  		if (likely(map)) {
> >>  			avail = map->addr;
> >>  			*idx = avail->idx;
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> >>  	struct vring_avail *avail;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> >> +		map = vq->maps[VHOST_ADDR_AVAIL];
> >>  		if (likely(map)) {
> >>  			avail = map->addr;
> >>  			*head = avail->ring[idx & (vq->num - 1)];
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> >>  	struct vring_avail *avail;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> >> +		map = vq->maps[VHOST_ADDR_AVAIL];
> >>  		if (likely(map)) {
> >>  			avail = map->addr;
> >>  			*flags = avail->flags;
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
> >>  	struct vring_avail *avail;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> >> +		vhost_vq_access_map_begin(vq);
> >> +		map = vq->maps[VHOST_ADDR_AVAIL];
> >>  		if (likely(map)) {
> >>  			avail = map->addr;
> >>  			*event = (__virtio16)avail->ring[vq->num];
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
> >>  	struct vring_used *used;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> >> +		map = vq->maps[VHOST_ADDR_USED];
> >>  		if (likely(map)) {
> >>  			used = map->addr;
> >>  			*idx = used->idx;
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq,
> >>  	struct vring_desc *d;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]);
> >> +		map = vq->maps[VHOST_ADDR_DESC];
> >>  		if (likely(map)) {
> >>  			d = map->addr;
> >>  			*desc = *(d + idx);
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
> >>  #if VHOST_ARCH_CAN_ACCEL_UACCESS
> >>  static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
> >>  {
> >> -	struct vhost_map __rcu *map;
> >> +	struct vhost_map *map;
> >>  	int i;
> >>  
> >>  	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> >> -		rcu_read_lock();
> >> -		map = rcu_dereference(vq->maps[i]);
> >> -		rcu_read_unlock();
> >> +		map = vq->maps[i];
> >>  		if (unlikely(!map))
> >>  			vhost_map_prefetch(vq, i);
> >>  	}
> >> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> >> index a9a2a93857d2..f9e9558a529d 100644
> >> --- a/drivers/vhost/vhost.h
> >> +++ b/drivers/vhost/vhost.h
> >> @@ -115,16 +115,17 @@ struct vhost_virtqueue {
> >>  #if VHOST_ARCH_CAN_ACCEL_UACCESS
> >>  	/* Read by memory accessors, modified by meta data
> >>  	 * prefetching, MMU notifier and vring ioctl().
> >> -	 * Synchonrized through mmu_lock (writers) and RCU (writers
> >> -	 * and readers).
> >> +	 * Synchonrized through mmu_lock (writers) and ref counters,
> >> +	 * see vhost_vq_access_map_begin()/vhost_vq_access_map_end().
> >>  	 */
> >> -	struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
> >> +	struct vhost_map *maps[VHOST_NUM_ADDRS];
> >>  	/* Read by MMU notifier, modified by vring ioctl(),
> >>  	 * synchronized through MMU notifier
> >>  	 * registering/unregistering.
> >>  	 */
> >>  	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
> >>  #endif
> >> +	int ref;
> >
> > Is it important that this is signed? If not I'd do unsigned here:
> > even though kernel does compile with 2s complement sign overflow,
> > it seems cleaner not to depend on that.
> 
> Not a must, let me fix.
> 
> Thanks
> 
> >
> >>  	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
> >>  
> >>  	struct file *kick;
> >> -- 
> >> 2.18.1

^ permalink raw reply

* Re: [patch 1/1] drivers/net/ethernet/marvell/mvmdio.c: Fix non OF case
From: Andrew Lunn @ 2019-08-03 22:22 UTC (permalink / raw)
  To: Arnaud Patard; +Cc: netdev
In-Reply-To: <20190802083310.772136040@rtp-net.org>

On Fri, Aug 02, 2019 at 10:32:40AM +0200, Arnaud Patard wrote:
> Orion5.x systems are still using machine files and not device-tree.
> Commit 96cb4342382290c9 ("net: mvmdio: allow up to three clocks to be
> specified for orion-mdio") has replaced devm_clk_get() with of_clk_get(),
> leading to a oops at boot and not working network, as reported in 
> https://lists.debian.org/debian-arm/2019/07/msg00088.html and possibly in
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=908712.
> 
> Link: https://lists.debian.org/debian-arm/2019/07/msg00088.html
> Fixes: 96cb4342382290c9 ("net: mvmdio: allow up to three clocks to be specified for orion-mdio")
> Signed-off-by: Arnaud Patard <arnaud.patard@rtp-net.org>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox