Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] mlxsw: spectrum: Extend to support Spectrum-3 ASIC
From: David Miller @ 2019-08-09  5:27 UTC (permalink / raw)
  To: idosch; +Cc: netdev, jiri, petrm, mlxsw, idosch
In-Reply-To: <20190807104231.16085-1-idosch@idosch.org>

From: Ido Schimmel <idosch@idosch.org>
Date: Wed,  7 Aug 2019 13:42:31 +0300

> From: Jiri Pirko <jiri@mellanox.com>
> 
> Extend existing driver for Spectrum and Spectrum-2 ASICs
> to support Spectrum-3 ASIC as well.
> 
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> Reviewed-by: Petr Machata <petrm@mellanox.com>
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net: dsa: sja1105: remove set but not used variables 'tx_vid' and 'rx_vid'
From: David Miller @ 2019-08-09  5:29 UTC (permalink / raw)
  To: yuehaibing
  Cc: olteanv, andrew, vivien.didelot, f.fainelli, linux-kernel, netdev
In-Reply-To: <20190807130856.60792-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Wed, 7 Aug 2019 21:08:56 +0800

> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> drivers/net/dsa/sja1105/sja1105_main.c: In function sja1105_fdb_dump:
> drivers/net/dsa/sja1105/sja1105_main.c:1226:14: warning:
>  variable tx_vid set but not used [-Wunused-but-set-variable]
> drivers/net/dsa/sja1105/sja1105_main.c:1226:6: warning:
>  variable rx_vid set but not used [-Wunused-but-set-variable]
> 
> They are not used since commit 6d7c7d948a2e ("net: dsa:
> sja1105: Fix broken learning with vlan_filtering disabled")
> 
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Applied to 'net'.

^ permalink raw reply

* Re: [PATCH net-next] net/ncsi: allow to customize BMC MAC Address offset
From: Tao Ren @ 2019-08-09  5:29 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jakub Kicinski, netdev@vger.kernel.org, openbmc@lists.ozlabs.org,
	linux-kernel@vger.kernel.org, Samuel Mendoza-Jonas,
	David S . Miller, William Kennington
In-Reply-To: <20190808230312.GS27917@lunn.ch>

On 8/8/19 4:03 PM, Andrew Lunn wrote:
>> After giving it more thought, I'm thinking about adding ncsi dt node
>> with following structure (mac/ncsi similar to mac/mdio/phy):
>>
>> &mac0 {
>>     /* MAC properties... */
>>
>>     use-ncsi;
> 
> This property seems to be specific to Faraday FTGMAC100. Are you going
> to make it more generic? 

I'm also using ftgmac100 on my platform, and I don't have plan to change this property.

>>     ncsi {
>>         /* ncsi level properties if any */
>>
>>         package@0 {
> 
> You should get Rob Herring involved. This is not really describing
> hardware, so it might get rejected by the device tree maintainer.

Got it. Thank you for the sharing, and let me think it over :-)

>> 1) mac driver doesn't need to parse "mac-offset" stuff: these
>> ncsi-network-controller specific settings should be parsed in ncsi
>> stack.
> 
>> 2) get_bmc_mac_address command is a channel specific command, and
>> technically people can configure different offset/formula for
>> different channels.
> 
> Does that mean the NCSA code puts the interface into promiscuous mode?
> Or at least adds these unicast MAC addresses to the MAC receive
> filter? Humm, ftgmac100 only seems to support multicast address
> filtering, not unicast filters, so it must be using promisc mode, if
> you expect to receive frames using this MAC address.

Uhh, I actually didn't think too much about this: basically it's how to configure frame filtering when there are multiple packages/channels active: single BMC MAC or multiple BMC MAC is also allowed?
I don't have the answer yet, but will talk to NCSI expert and figure it out.


Thanks,

Tao

^ permalink raw reply

* Re: [PATCH net-next] fq_codel: remove set but not used variables 'prev_ecn_mark' and 'prev_drop_count'
From: David Miller @ 2019-08-09  5:31 UTC (permalink / raw)
  To: yuehaibing; +Cc: jhs, xiyou.wangcong, jiri, dave.taht, linux-kernel, netdev
In-Reply-To: <20190807131055.66668-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Wed, 7 Aug 2019 21:10:55 +0800

> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> net/sched/sch_fq_codel.c: In function fq_codel_dequeue:
> net/sched/sch_fq_codel.c:288:23: warning: variable prev_ecn_mark set but not used [-Wunused-but-set-variable]
> net/sched/sch_fq_codel.c:288:6: warning: variable prev_drop_count set but not used [-Wunused-but-set-variable]
> 
> They are not used since commit 77ddaff218fc ("fq_codel: Kill
> useless per-flow dropped statistic")
> 
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Do you even compile test this stuff?

  CC [M]  net/sched/sch_fq_codel.o
net/sched/sch_fq_codel.c: In function ‘fq_codel_dequeue’:
net/sched/sch_fq_codel.c:309:42: error: ‘prev_drop_count’ undeclared (first use in this function); did you mean ‘page_ref_count’?
  flow->dropped += q->cstats.drop_count - prev_drop_count;
                                          ^~~~~~~~~~~~~~~
                                          page_ref_count
net/sched/sch_fq_codel.c:309:42: note: each undeclared identifier is reported only once for each function it appears in
net/sched/sch_fq_codel.c:310:40: error: ‘prev_ecn_mark’ undeclared (first use in this function); did you mean ‘pmd_pfn_mask’?
  flow->dropped += q->cstats.ecn_mark - prev_ecn_mark;
                                        ^~~~~~~~~~~~~
                                        pmd_pfn_mask
make[1]: *** [scripts/Makefile.build:274: net/sched/sch_fq_codel.o] Error 1
make: *** [Makefile:1769: net/sched/sch_fq_codel.o] Error 2

^ permalink raw reply

* Re: [PATCH net-next] fq_codel: remove set but not used variables 'prev_ecn_mark' and 'prev_drop_count'
From: David Miller @ 2019-08-09  5:32 UTC (permalink / raw)
  To: yuehaibing; +Cc: jhs, xiyou.wangcong, jiri, dave.taht, linux-kernel, netdev
In-Reply-To: <20190808.223136.1507513183278607177.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Thu, 08 Aug 2019 22:31:36 -0700 (PDT)

> From: YueHaibing <yuehaibing@huawei.com>
> Date: Wed, 7 Aug 2019 21:10:55 +0800
> 
>> Fixes gcc '-Wunused-but-set-variable' warning:
>> 
>> net/sched/sch_fq_codel.c: In function fq_codel_dequeue:
>> net/sched/sch_fq_codel.c:288:23: warning: variable prev_ecn_mark set but not used [-Wunused-but-set-variable]
>> net/sched/sch_fq_codel.c:288:6: warning: variable prev_drop_count set but not used [-Wunused-but-set-variable]
>> 
>> They are not used since commit 77ddaff218fc ("fq_codel: Kill
>> useless per-flow dropped statistic")
>> 
>> Reported-by: Hulk Robot <hulkci@huawei.com>
>> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
> 
> Do you even compile test this stuff?
> 
>   CC [M]  net/sched/sch_fq_codel.o
> net/sched/sch_fq_codel.c: In function ‘fq_codel_dequeue’:
> net/sched/sch_fq_codel.c:309:42: error: ‘prev_drop_count’ undeclared (first use in this function); did you mean ‘page_ref_count’?

Never mind, this is my fault.

I was build testing the patch on the wrong tree, I'm very sorry.

^ permalink raw reply

* Re: [PATCH net-next] r8169: allocate rx buffers using alloc_pages_node
From: David Miller @ 2019-08-09  5:35 UTC (permalink / raw)
  To: hkallweit1; +Cc: nic_swsd, netdev
In-Reply-To: <7d79d794-b41c-101f-0720-59eea88bf9ab@gmail.com>

From: Heiner Kallweit <hkallweit1@gmail.com>
Date: Wed, 7 Aug 2019 21:38:22 +0200

> We allocate 16kb per rx buffer, so we can avoid some overhead by using
> alloc_pages_node directly instead of bothering kmalloc_node. Due to
> this change buffers are page-aligned now, therefore the alignment check
> can be removed.
> 
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH V4 0/9] Fixes for metadata accelreation
From: Jason Wang @ 2019-08-09  5:35 UTC (permalink / raw)
  To: David Miller
  Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm, jgg
In-Reply-To: <20190808.221543.450194346419371363.davem@davemloft.net>


On 2019/8/9 下午1:15, David Miller wrote:
> From: Jason Wang <jasowang@redhat.com>
> Date: Wed,  7 Aug 2019 03:06:08 -0400
>
>> This series try to fix several issues introduced by meta data
>> accelreation series. Please review.
>   ...
>
> My impression is that patch #7 will be changed to use spinlocks so there
> will be a v5.
>

Yes. V5 is on the way.

Thanks


^ permalink raw reply

* [PATCH v2] net: tundra: tsi108: use spin_lock_irqsave instead of spin_lock_irq in IRQ context
From: Fuqian Huang @ 2019-08-09  5:35 UTC (permalink / raw)
  Cc: David S . Miller, netdev, linux-kernel, Fuqian Huang

As spin_unlock_irq will enable interrupts.
Function tsi108_stat_carry is called from interrupt handler tsi108_irq.
Interrupts are enabled in interrupt handler.
Use spin_lock_irqsave/spin_unlock_irqrestore instead of spin_(un)lock_irq
in IRQ context to avoid this.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
---
Changes in v2:
  - Preserve reverse christmas tree ordering of local variables.

 drivers/net/ethernet/tundra/tsi108_eth.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/tundra/tsi108_eth.c b/drivers/net/ethernet/tundra/tsi108_eth.c
index 78a7de3fb622..c62f474b6d08 100644
--- a/drivers/net/ethernet/tundra/tsi108_eth.c
+++ b/drivers/net/ethernet/tundra/tsi108_eth.c
@@ -371,9 +371,10 @@ tsi108_stat_carry_one(int carry, int carry_bit, int carry_shift,
 static void tsi108_stat_carry(struct net_device *dev)
 {
 	struct tsi108_prv_data *data = netdev_priv(dev);
+	unsigned long flags;
 	u32 carry1, carry2;
 
-	spin_lock_irq(&data->misclock);
+	spin_lock_irqsave(&data->misclock, flags);
 
 	carry1 = TSI_READ(TSI108_STAT_CARRY1);
 	carry2 = TSI_READ(TSI108_STAT_CARRY2);
@@ -441,7 +442,7 @@ static void tsi108_stat_carry(struct net_device *dev)
 			      TSI108_STAT_TXPAUSEDROP_CARRY,
 			      &data->tx_pause_drop);
 
-	spin_unlock_irq(&data->misclock);
+	spin_unlock_irqrestore(&data->misclock, flags);
 }
 
 /* Read a stat counter atomically with respect to carries.
-- 
2.11.0


^ permalink raw reply related

* Re: [PATCH net 0/2] Fix batched event generation for skbedit action
From: David Miller @ 2019-08-09  5:37 UTC (permalink / raw)
  To: mrv; +Cc: netdev, kernel, jhs, xiyou.wangcong, jiri
In-Reply-To: <1565207849-11442-1-git-send-email-mrv@mojatatu.com>

From: Roman Mashak <mrv@mojatatu.com>
Date: Wed,  7 Aug 2019 15:57:27 -0400

> When adding or deleting a batch of entries, the kernel sends up to
> TCA_ACT_MAX_PRIO (defined to 32 in kernel) entries in an event to user
> space. However it does not consider that the action sizes may vary and
> require different skb sizes.
> 
> For example, consider the following script adding 32 entries with all
> supported skbedit parameters and cookie (in order to maximize netlink
> messages size):
 ...
> patch 1 adds callback in tc_action_ops of skbedit action, which calculates
> the action size, and passes size to tcf_add_notify()/tcf_del_notify().
> 
> patch 2 updates the TDC test suite with relevant skbedit test cases.

Series applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net v3] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: David Miller @ 2019-08-09  5:40 UTC (permalink / raw)
  To: jakub.kicinski
  Cc: netdev, davejwatson, borisp, aviadye, john.fastabend, daniel,
	willemb, edumazet, alexei.starovoitov, oss-drivers
In-Reply-To: <20190808000359.20785-1-jakub.kicinski@netronome.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Wed,  7 Aug 2019 17:03:59 -0700

> sk_validate_xmit_skb() and drivers depend on the sk member of
> struct sk_buff to identify segments requiring encryption.
> Any operation which removes or does not preserve the original TLS
> socket such as skb_orphan() or skb_clone() will cause clear text
> leaks.
> 
> Make the TCP socket underlying an offloaded TLS connection
> mark all skbs as decrypted, if TLS TX is in offload mode.
> Then in sk_validate_xmit_skb() catch skbs which have no socket
> (or a socket with no validation) and decrypted flag set.
> 
> Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
> sk->sk_validate_xmit_skb are slightly interchangeable right now,
> they all imply TLS offload. The new checks are guarded by
> CONFIG_TLS_DEVICE because that's the option guarding the
> sk_buff->decrypted member.
> 
> Second, smaller issue with orphaning is that it breaks
> the guarantee that packets will be delivered to device
> queues in-order. All TLS offload drivers depend on that
> scheduling property. This means skb_orphan_partial()'s
> trick of preserving partial socket references will cause
> issues in the drivers. We need a full orphan, and as a
> result netem delay/throttling will cause all TLS offload
> skbs to be dropped.
> 
> Reusing the sk_buff->decrypted flag also protects from
> leaking clear text when incoming, decrypted skb is redirected
> (e.g. by TC).
> 
> See commit 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect
> through ULP") for justification why the internal flag is safe.
> The only location which could leak the flag in is tcp_bpf_sendmsg(),
> which is taken care of by clearing the previously unused bit.
> 
> v2:
>  - remove superfluous decrypted mark copy (Willem);
>  - remove the stale doc entry (Boris);
>  - rely entirely on EOR marking to prevent coalescing (Boris);
>  - use an internal sendpages flag instead of marking the socket
>    (Boris).
> v3 (Willem):
>  - reorganize the can_skb_orphan_partial() condition;
>  - fix the flag leak-in through tcp_bpf_sendmsg.
> 
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Applied, thanks Jakub.

^ permalink raw reply

* Re: [PATCH] liquidio: Use pcie_flr() instead of reimplementing it
From: David Miller @ 2019-08-09  5:41 UTC (permalink / raw)
  To: efremov
  Cc: bjorn.helgaas, dchickles, sburla, fmanlunas, netdev, linux-pci,
	linux-kernel
In-Reply-To: <20190808045753.5474-1-efremov@linux.com>

From: Denis Efremov <efremov@linux.com>
Date: Thu,  8 Aug 2019 07:57:53 +0300

> octeon_mbox_process_cmd() directly writes the PCI_EXP_DEVCTL_BCR_FLR
> bit, which bypasses timing requirements imposed by the PCIe spec.
> This patch fixes the function to use the pcie_flr() interface instead.
> 
> Signed-off-by: Denis Efremov <efremov@linux.com>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH v2] team: Add vlan tx offload to hw_enc_features
From: David Miller @ 2019-08-09  5:42 UTC (permalink / raw)
  To: yuehaibing
  Cc: j.vosburgh, vfalico, andy, jiri, jay.vosburgh, linux-kernel,
	netdev
In-Reply-To: <20190808062247.38352-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Thu, 8 Aug 2019 14:22:47 +0800

> We should also enable team's vlan tx offload in hw_enc_features,
> pass the vlan packets to the slave devices with vlan tci, let the
> slave handle vlan tunneling offload implementation.
> 
> Fixes: 3268e5cb494d ("team: Advertise tunneling offload features")
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
> ---
> v2: fix commit log typo

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH v2] net: tundra: tsi108: use spin_lock_irqsave instead of spin_lock_irq in IRQ context
From: David Miller @ 2019-08-09  5:43 UTC (permalink / raw)
  To: huangfq.daxian; +Cc: netdev, linux-kernel
In-Reply-To: <20190809053539.8341-1-huangfq.daxian@gmail.com>

From: Fuqian Huang <huangfq.daxian@gmail.com>
Date: Fri,  9 Aug 2019 13:35:39 +0800

> As spin_unlock_irq will enable interrupts.
> Function tsi108_stat_carry is called from interrupt handler tsi108_irq.
> Interrupts are enabled in interrupt handler.
> Use spin_lock_irqsave/spin_unlock_irqrestore instead of spin_(un)lock_irq
> in IRQ context to avoid this.
> 
> Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
> ---
> Changes in v2:
>   - Preserve reverse christmas tree ordering of local variables.

Applied, thanks.

^ permalink raw reply

* [PATCH V5 1/9] vhost: don't set uaddr for invalid address
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

We should not setup uaddr for the invalid address, otherwise we may
try to pin or prefetch mapping of wrong pages.

Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 0536f8526359..488380a581dc 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2082,7 +2082,8 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 	}
 
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
-	vhost_setup_vq_uaddr(vq);
+	if (r == 0)
+		vhost_setup_vq_uaddr(vq);
 
 	if (d->mm)
 		mmu_notifier_register(&d->mmu_notifier, d->mm);
-- 
2.18.1


^ permalink raw reply related

* [PATCH V5 2/9] vhost: validate MMU notifier registration
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

The return value of mmu_notifier_register() is not checked in
vhost_vring_set_num_addr(). This will cause an out of sync between mm
and MMU notifier thus a double free. To solve this, introduce a
boolean flag to track whether MMU notifier is registered and only do
unregistering when it was true.

Reported-and-tested-by:
syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 19 +++++++++++++++----
 drivers/vhost/vhost.h |  1 +
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 488380a581dc..17f6abea192e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -629,6 +629,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->iov_limit = iov_limit;
 	dev->weight = weight;
 	dev->byte_weight = byte_weight;
+	dev->has_notifier = false;
 	init_llist_head(&dev->work_list);
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
@@ -730,6 +731,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 	if (err)
 		goto err_mmu_notifier;
 #endif
+	dev->has_notifier = true;
 
 	return 0;
 
@@ -959,7 +961,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	}
 	if (dev->mm) {
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
-		mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
+		if (dev->has_notifier) {
+			mmu_notifier_unregister(&dev->mmu_notifier,
+						dev->mm);
+			dev->has_notifier = false;
+		}
 #endif
 		mmput(dev->mm);
 	}
@@ -2064,8 +2070,10 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 	/* Unregister MMU notifer to allow invalidation callback
 	 * can access vq->uaddrs[] without holding a lock.
 	 */
-	if (d->mm)
+	if (d->has_notifier) {
 		mmu_notifier_unregister(&d->mmu_notifier, d->mm);
+		d->has_notifier = false;
+	}
 
 	vhost_uninit_vq_maps(vq);
 #endif
@@ -2085,8 +2093,11 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 	if (r == 0)
 		vhost_setup_vq_uaddr(vq);
 
-	if (d->mm)
-		mmu_notifier_register(&d->mmu_notifier, d->mm);
+	if (d->mm) {
+		r = mmu_notifier_register(&d->mmu_notifier, d->mm);
+		if (!r)
+			d->has_notifier = true;
+	}
 #endif
 
 	mutex_unlock(&vq->mutex);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 42a8c2a13ab1..a9a2a93857d2 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -214,6 +214,7 @@ struct vhost_dev {
 	int iov_limit;
 	int weight;
 	int byte_weight;
+	bool has_notifier;
 };
 
 bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int pkts, int total_len);
-- 
2.18.1


^ permalink raw reply related

* [PATCH V5 3/9] vhost: fix vhost map leak
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

We don't free map during vhost_map_unprefetch(). This means it could
be leaked. Fixing by free the map.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 17f6abea192e..2a3154976277 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -302,9 +302,7 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
 static void vhost_map_unprefetch(struct vhost_map *map)
 {
 	kfree(map->pages);
-	map->pages = NULL;
-	map->npages = 0;
-	map->addr = NULL;
+	kfree(map);
 }
 
 static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
-- 
2.18.1


^ permalink raw reply related

* [PATCH V5 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr()
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

The vhost_set_vring_num_addr() could be called in the middle of
invalidate_range_start() and invalidate_range_end(). If we don't reset
invalidate_count after the un-registering of MMU notifier, the
invalidate_cont will run out of sync (e.g never reach zero). This will
in fact disable the fast accessor path. Fixing by reset the count to
zero.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2a3154976277..2a7217c33668 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2073,6 +2073,10 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 		d->has_notifier = false;
 	}
 
+	/* reset invalidate_count in case we are in the middle of
+	 * invalidate_start() and invalidate_end().
+	 */
+	vq->invalidate_count = 0;
 	vhost_uninit_vq_maps(vq);
 #endif
 
-- 
2.18.1


^ permalink raw reply related

* [PATCH V5 5/9] vhost: mark dirty pages during map uninit
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

We don't mark dirty pages if the map was teared down outside MMU
notifier. This will lead untracked dirty pages. Fixing by marking
dirty pages during map uninit.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2a7217c33668..c12cdadb0855 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -305,6 +305,18 @@ static void vhost_map_unprefetch(struct vhost_map *map)
 	kfree(map);
 }
 
+static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
+				struct vhost_map *map, int index)
+{
+	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
+	int i;
+
+	if (uaddr->write) {
+		for (i = 0; i < map->npages; i++)
+			set_page_dirty(map->pages[i]);
+	}
+}
+
 static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
 {
 	struct vhost_map *map[VHOST_NUM_ADDRS];
@@ -314,8 +326,10 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
 	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
 		map[i] = rcu_dereference_protected(vq->maps[i],
 				  lockdep_is_held(&vq->mmu_lock));
-		if (map[i])
+		if (map[i]) {
+			vhost_set_map_dirty(vq, map[i], i);
 			rcu_assign_pointer(vq->maps[i], NULL);
+		}
 	}
 	spin_unlock(&vq->mmu_lock);
 
@@ -353,7 +367,6 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 {
 	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
 	struct vhost_map *map;
-	int i;
 
 	if (!vhost_map_range_overlap(uaddr, start, end))
 		return;
@@ -364,10 +377,7 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 	map = rcu_dereference_protected(vq->maps[index],
 					lockdep_is_held(&vq->mmu_lock));
 	if (map) {
-		if (uaddr->write) {
-			for (i = 0; i < map->npages; i++)
-				set_page_dirty(map->pages[i]);
-		}
+		vhost_set_map_dirty(vq, map, index);
 		rcu_assign_pointer(vq->maps[index], NULL);
 	}
 	spin_unlock(&vq->mmu_lock);
-- 
2.18.1


^ permalink raw reply related

* [PATCH V5 6/9] vhost: don't do synchronize_rcu() in vhost_uninit_vq_maps()
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

There's no need for RCU synchronization in vhost_uninit_vq_maps()
since we've already serialized with readers (memory accessors). This
also avoid the possible userspace DOS through ioctl() because of the
possible high latency caused by synchronize_rcu().

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c12cdadb0855..cfc11f9ed9c9 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -333,7 +333,9 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
 	}
 	spin_unlock(&vq->mmu_lock);
 
-	synchronize_rcu();
+	/* No need for synchronize_rcu() or kfree_rcu() since we are
+	 * serialized with memory accessors (e.g vq mutex held).
+	 */
 
 	for (i = 0; i < VHOST_NUM_ADDRS; i++)
 		if (map[i])
-- 
2.18.1


^ permalink raw reply related

* [PATCH V5 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

We used to use RCU to synchronize MMU notifier with worker. This leads
calling synchronize_rcu() in invalidate_range_start(). But on a busy
system, there would be many factors that may slow down the
synchronize_rcu() which makes it unsuitable to be called in MMU
notifier. This path switch to use a simple spinlock to do the
synchronization.

Benchmark was done through testpmd + vhost_net + XDP_DROP on
tap. Compare to copy_{to|from}_user() path, on Sandy Bridge (without
SMAP support), 1.5% PPS improvement was measured; on Broadwell (with
SMAP and enabled), 14% PPS improvement was measured.

This means we are not as fast as what 7f466032dc9e did because the
spinlock overhead in the datapath. This needs to be addressed in the
future.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 115 ++++++++++++++++++++++--------------------
 drivers/vhost/vhost.h |   5 +-
 2 files changed, 62 insertions(+), 58 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index cfc11f9ed9c9..29e8abe694f7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
 
 	spin_lock(&vq->mmu_lock);
 	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
-		map[i] = rcu_dereference_protected(vq->maps[i],
-				  lockdep_is_held(&vq->mmu_lock));
+		map[i] = vq->maps[i];
 		if (map[i]) {
 			vhost_set_map_dirty(vq, map[i], i);
-			rcu_assign_pointer(vq->maps[i], NULL);
+			vq->maps[i] = NULL;
 		}
 	}
 	spin_unlock(&vq->mmu_lock);
 
-	/* No need for synchronize_rcu() or kfree_rcu() since we are
-	 * serialized with memory accessors (e.g vq mutex held).
+	/* No need for synchronization since we are serialized with
+	 * memory accessors (e.g vq mutex held).
 	 */
 
 	for (i = 0; i < VHOST_NUM_ADDRS; i++)
@@ -362,6 +361,16 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
 	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
 }
 
+static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
+{
+	spin_lock(&vq->mmu_lock);
+}
+
+static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
+{
+	spin_unlock(&vq->mmu_lock);
+}
+
 static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 				      int index,
 				      unsigned long start,
@@ -376,16 +385,14 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 	spin_lock(&vq->mmu_lock);
 	++vq->invalidate_count;
 
-	map = rcu_dereference_protected(vq->maps[index],
-					lockdep_is_held(&vq->mmu_lock));
+	map = vq->maps[index];
 	if (map) {
+		vq->maps[index] = NULL;
 		vhost_set_map_dirty(vq, map, index);
-		rcu_assign_pointer(vq->maps[index], NULL);
 	}
 	spin_unlock(&vq->mmu_lock);
 
 	if (map) {
-		synchronize_rcu();
 		vhost_map_unprefetch(map);
 	}
 }
@@ -457,7 +464,7 @@ static void vhost_init_maps(struct vhost_dev *dev)
 	for (i = 0; i < dev->nvqs; ++i) {
 		vq = dev->vqs[i];
 		for (j = 0; j < VHOST_NUM_ADDRS; j++)
-			RCU_INIT_POINTER(vq->maps[j], NULL);
+			vq->maps[j] = NULL;
 	}
 }
 #endif
@@ -921,7 +928,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
 	map->npages = npages;
 	map->pages = pages;
 
-	rcu_assign_pointer(vq->maps[index], map);
+	vq->maps[index] = map;
 	/* No need for a synchronize_rcu(). This function should be
 	 * called by dev->worker so we are serialized with all
 	 * readers.
@@ -1216,18 +1223,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
 	struct vring_used *used;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			*((__virtio16 *)&used->ring[vq->num]) =
 				cpu_to_vhost16(vq, vq->avail_idx);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1245,18 +1252,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
 	size_t size;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			size = count * sizeof(*head);
 			memcpy(used->ring + idx, head, size);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1272,17 +1279,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
 	struct vring_used *used;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			used->flags = cpu_to_vhost16(vq, vq->used_flags);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1298,17 +1305,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
 	struct vring_used *used;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1362,17 +1369,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*idx = avail->idx;
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1387,17 +1394,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*head = avail->ring[idx & (vq->num - 1)];
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1413,17 +1420,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*flags = avail->flags;
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1438,15 +1445,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		vhost_vq_access_map_begin(vq);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*event = (__virtio16)avail->ring[vq->num];
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1461,17 +1468,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
 	struct vring_used *used;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			*idx = used->idx;
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1486,17 +1493,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq,
 	struct vring_desc *d;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]);
+		map = vq->maps[VHOST_ADDR_DESC];
 		if (likely(map)) {
 			d = map->addr;
 			*desc = *(d + idx);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1843,13 +1850,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
 static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
 {
-	struct vhost_map __rcu *map;
+	struct vhost_map *map;
 	int i;
 
 	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
-		rcu_read_lock();
-		map = rcu_dereference(vq->maps[i]);
-		rcu_read_unlock();
+		map = vq->maps[i];
 		if (unlikely(!map))
 			vhost_map_prefetch(vq, i);
 	}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index a9a2a93857d2..983d06e62f12 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -115,10 +115,9 @@ struct vhost_virtqueue {
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
 	/* Read by memory accessors, modified by meta data
 	 * prefetching, MMU notifier and vring ioctl().
-	 * Synchonrized through mmu_lock (writers) and RCU (writers
-	 * and readers).
+	 * Synchonrized through mmu_lock.
 	 */
-	struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
+	struct vhost_map *maps[VHOST_NUM_ADDRS];
 	/* Read by MMU notifier, modified by vring ioctl(),
 	 * synchronized through MMU notifier
 	 * registering/unregistering.
-- 
2.18.1


^ permalink raw reply related

* [PATCH V5 8/9] vhost: correctly set dirty pages in MMU notifiers callback
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

We need make sure there's no reference on the map before trying to
mark set dirty pages.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 29e8abe694f7..d8863aaaf0f6 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -386,13 +386,12 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 	++vq->invalidate_count;
 
 	map = vq->maps[index];
-	if (map) {
+	if (map)
 		vq->maps[index] = NULL;
-		vhost_set_map_dirty(vq, map, index);
-	}
 	spin_unlock(&vq->mmu_lock);
 
 	if (map) {
+		vhost_set_map_dirty(vq, map, index);
 		vhost_map_unprefetch(map);
 	}
 }
-- 
2.18.1


^ permalink raw reply related

* [PATCH V5 9/9] vhost: do not return -EAGAIN for non blocking invalidation too early
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang
In-Reply-To: <20190809054851.20118-1-jasowang@redhat.com>

Instead of returning -EAGAIN unconditionally, we'd better do that only
we're sure the range is overlapped with the metadata area.

Reported-by: Jason Gunthorpe <jgg@ziepe.ca>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d8863aaaf0f6..f98155f28f02 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -371,16 +371,19 @@ static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
 	spin_unlock(&vq->mmu_lock);
 }
 
-static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
-				      int index,
-				      unsigned long start,
-				      unsigned long end)
+static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
+				     int index,
+				     unsigned long start,
+				     unsigned long end,
+				     bool blockable)
 {
 	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
 	struct vhost_map *map;
 
 	if (!vhost_map_range_overlap(uaddr, start, end))
-		return;
+		return 0;
+	else if (!blockable)
+		return -EAGAIN;
 
 	spin_lock(&vq->mmu_lock);
 	++vq->invalidate_count;
@@ -394,6 +397,8 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 		vhost_set_map_dirty(vq, map, index);
 		vhost_map_unprefetch(map);
 	}
+
+	return 0;
 }
 
 static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
@@ -414,18 +419,19 @@ static int vhost_invalidate_range_start(struct mmu_notifier *mn,
 {
 	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
 					     mmu_notifier);
-	int i, j;
-
-	if (!mmu_notifier_range_blockable(range))
-		return -EAGAIN;
+	bool blockable = mmu_notifier_range_blockable(range);
+	int i, j, ret;
 
 	for (i = 0; i < dev->nvqs; i++) {
 		struct vhost_virtqueue *vq = dev->vqs[i];
 
-		for (j = 0; j < VHOST_NUM_ADDRS; j++)
-			vhost_invalidate_vq_start(vq, j,
-						  range->start,
-						  range->end);
+		for (j = 0; j < VHOST_NUM_ADDRS; j++) {
+			ret = vhost_invalidate_vq_start(vq, j,
+							range->start,
+							range->end, blockable);
+			if (ret)
+				return ret;
+		}
 	}
 
 	return 0;
-- 
2.18.1


^ permalink raw reply related

* [PATCH V5 0/9] Fixes for vhost metadata acceleration
From: Jason Wang @ 2019-08-09  5:48 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: linux-mm, jgg, Jason Wang

Hi all:

This series try to fix several issues introduced by meta data
accelreation series. Please review.

Changes from V4:
- switch to use spinlock synchronize MMU notifier with accessors

Changes from V3:
- remove the unnecessary patch

Changes from V2:
- use seqlck helper to synchronize MMU notifier with vhost worker

Changes from V1:
- try not use RCU to syncrhonize MMU notifier with vhost worker
- set dirty pages after no readers
- return -EAGAIN only when we find the range is overlapped with
  metadata

Jason Wang (9):
  vhost: don't set uaddr for invalid address
  vhost: validate MMU notifier registration
  vhost: fix vhost map leak
  vhost: reset invalidate_count in vhost_set_vring_num_addr()
  vhost: mark dirty pages during map uninit
  vhost: don't do synchronize_rcu() in vhost_uninit_vq_maps()
  vhost: do not use RCU to synchronize MMU notifier with worker
  vhost: correctly set dirty pages in MMU notifiers callback
  vhost: do not return -EAGAIN for non blocking invalidation too early

 drivers/vhost/vhost.c | 202 +++++++++++++++++++++++++-----------------
 drivers/vhost/vhost.h |   6 +-
 2 files changed, 122 insertions(+), 86 deletions(-)

-- 
2.18.1


^ permalink raw reply

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: Jiri Pirko @ 2019-08-09  6:25 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: netdev, David Miller, Jakub Kicinski, Stephen Hemminger,
	David Ahern, dcbw, Michal Kubecek, Andrew Lunn, parav,
	Saeed Mahameed, mlxsw
In-Reply-To: <CAJieiUi+gKKc94bKfC-N5LBc=FdzGGo_8+x2oTstihFaUpkKSA@mail.gmail.com>

Fri, Aug 09, 2019 at 06:11:30AM CEST, roopa@cumulusnetworks.com wrote:
>On Fri, Jul 19, 2019 at 4:00 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> From: Jiri Pirko <jiri@mellanox.com>
>>
>> Add two commands to add and delete alternative ifnames for net device.
>> Each net device can have multiple alternative names.
>>
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>> ---
>>  include/linux/netdevice.h      |   4 ++
>>  include/uapi/linux/if.h        |   1 +
>>  include/uapi/linux/if_link.h   |   1 +
>>  include/uapi/linux/rtnetlink.h |   7 +++
>>  net/core/dev.c                 |  58 ++++++++++++++++++-
>>  net/core/rtnetlink.c           | 102 +++++++++++++++++++++++++++++++++
>>  security/selinux/nlmsgtab.c    |   4 +-
>>  7 files changed, 175 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 74f99f127b0e..6922fdb483ca 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -920,10 +920,14 @@ struct tlsdev_ops;
>>
>>  struct netdev_name_node {
>>         struct hlist_node hlist;
>> +       struct list_head list;
>>         struct net_device *dev;
>>         char *name;
>>  };
>>
>> +int netdev_name_node_alt_create(struct net_device *dev, char *name);
>> +int netdev_name_node_alt_destroy(struct net_device *dev, char *name);
>> +
>>  /*
>>   * This structure defines the management hooks for network devices.
>>   * The following hooks can be defined; unless noted otherwise, they are
>> diff --git a/include/uapi/linux/if.h b/include/uapi/linux/if.h
>> index 7fea0fd7d6f5..4bf33344aab1 100644
>> --- a/include/uapi/linux/if.h
>> +++ b/include/uapi/linux/if.h
>> @@ -33,6 +33,7 @@
>>  #define        IFNAMSIZ        16
>>  #endif /* __UAPI_DEF_IF_IFNAMSIZ */
>>  #define        IFALIASZ        256
>> +#define        ALTIFNAMSIZ     128
>>  #include <linux/hdlc/ioctl.h>
>>
>>  /* For glibc compatibility. An empty enum does not compile. */
>> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>> index 4a8c02cafa9a..92268946e04a 100644
>> --- a/include/uapi/linux/if_link.h
>> +++ b/include/uapi/linux/if_link.h
>> @@ -167,6 +167,7 @@ enum {
>>         IFLA_NEW_IFINDEX,
>>         IFLA_MIN_MTU,
>>         IFLA_MAX_MTU,
>> +       IFLA_ALT_IFNAME_MOD, /* Alternative ifname to add/delete */
>>         __IFLA_MAX
>>  };
>>
>> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
>> index ce2a623abb75..b36cfd83eb76 100644
>> --- a/include/uapi/linux/rtnetlink.h
>> +++ b/include/uapi/linux/rtnetlink.h
>> @@ -164,6 +164,13 @@ enum {
>>         RTM_GETNEXTHOP,
>>  #define RTM_GETNEXTHOP RTM_GETNEXTHOP
>>
>> +       RTM_NEWALTIFNAME = 108,
>> +#define RTM_NEWALTIFNAME       RTM_NEWALTIFNAME
>> +       RTM_DELALTIFNAME,
>> +#define RTM_DELALTIFNAME       RTM_DELALTIFNAME
>> +       RTM_GETALTIFNAME,
>> +#define RTM_GETALTIFNAME       RTM_GETALTIFNAME
>> +
>
>I might have missed the prior discussion, why do we need new commands
>?. can't this simply be part of RTM_*LINK and we use RTM_SETLINK to
>set alternate names ?

How? This is to add/remove. How do you suggest to to add/remove by
setlink?


>
>
>
>>         __RTM_MAX,
>>  #define RTM_MAX                (((__RTM_MAX + 3) & ~3) - 1)
>>  };
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index ad0d42fbdeee..2a3be2b279d3 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -244,7 +244,13 @@ static struct netdev_name_node *netdev_name_node_alloc(struct net_device *dev,
>>  static struct netdev_name_node *
>>  netdev_name_node_head_alloc(struct net_device *dev)
>>  {
>> -       return netdev_name_node_alloc(dev, dev->name);
>> +       struct netdev_name_node *name_node;
>> +
>> +       name_node = netdev_name_node_alloc(dev, dev->name);
>> +       if (!name_node)
>> +               return NULL;
>> +       INIT_LIST_HEAD(&name_node->list);
>> +       return name_node;
>>  }
>>
>>  static void netdev_name_node_free(struct netdev_name_node *name_node)
>> @@ -288,6 +294,55 @@ static struct netdev_name_node *netdev_name_node_lookup_rcu(struct net *net,
>>         return NULL;
>>  }
>>
>> +int netdev_name_node_alt_create(struct net_device *dev, char *name)
>> +{
>> +       struct netdev_name_node *name_node;
>> +       struct net *net = dev_net(dev);
>> +
>> +       name_node = netdev_name_node_lookup(net, name);
>> +       if (name_node)
>> +               return -EEXIST;
>> +       name_node = netdev_name_node_alloc(dev, name);
>> +       if (!name_node)
>> +               return -ENOMEM;
>> +       netdev_name_node_add(net, name_node);
>> +       /* The node that holds dev->name acts as a head of per-device list. */
>> +       list_add_tail(&name_node->list, &dev->name_node->list);
>> +
>> +       return 0;
>> +}
>> +EXPORT_SYMBOL(netdev_name_node_alt_create);
>> +
>> +static void __netdev_name_node_alt_destroy(struct netdev_name_node *name_node)
>> +{
>> +       list_del(&name_node->list);
>> +       netdev_name_node_del(name_node);
>> +       kfree(name_node->name);
>> +       netdev_name_node_free(name_node);
>> +}
>> +
>> +int netdev_name_node_alt_destroy(struct net_device *dev, char *name)
>> +{
>> +       struct netdev_name_node *name_node;
>> +       struct net *net = dev_net(dev);
>> +
>> +       name_node = netdev_name_node_lookup(net, name);
>> +       if (!name_node)
>> +               return -ENOENT;
>> +       __netdev_name_node_alt_destroy(name_node);
>> +
>> +       return 0;
>> +}
>> +EXPORT_SYMBOL(netdev_name_node_alt_destroy);
>> +
>> +static void netdev_name_node_alt_flush(struct net_device *dev)
>> +{
>> +       struct netdev_name_node *name_node, *tmp;
>> +
>> +       list_for_each_entry_safe(name_node, tmp, &dev->name_node->list, list)
>> +               __netdev_name_node_alt_destroy(name_node);
>> +}
>> +
>>  /* Device list insertion */
>>  static void list_netdevice(struct net_device *dev)
>>  {
>> @@ -8258,6 +8313,7 @@ static void rollback_registered_many(struct list_head *head)
>>                 dev_uc_flush(dev);
>>                 dev_mc_flush(dev);
>>
>> +               netdev_name_node_alt_flush(dev);
>>                 netdev_name_node_free(dev->name_node);
>>
>>                 if (dev->netdev_ops->ndo_uninit)
>> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
>> index 1ee6460f8275..7a2010b16e10 100644
>> --- a/net/core/rtnetlink.c
>> +++ b/net/core/rtnetlink.c
>> @@ -1750,6 +1750,8 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
>>         [IFLA_CARRIER_DOWN_COUNT] = { .type = NLA_U32 },
>>         [IFLA_MIN_MTU]          = { .type = NLA_U32 },
>>         [IFLA_MAX_MTU]          = { .type = NLA_U32 },
>> +       [IFLA_ALT_IFNAME_MOD]   = { .type = NLA_STRING,
>> +                                   .len = ALTIFNAMSIZ - 1 },
>>  };
>>
>>  static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
>> @@ -3373,6 +3375,103 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr *nlh,
>>         return err;
>>  }
>>
>> +static int rtnl_newaltifname(struct sk_buff *skb, struct nlmsghdr *nlh,
>> +                            struct netlink_ext_ack *extack)
>> +{
>> +       struct net *net = sock_net(skb->sk);
>> +       struct nlattr *tb[IFLA_MAX + 1];
>> +       struct net_device *dev;
>> +       struct ifinfomsg *ifm;
>> +       char *new_alt_ifname;
>> +       int err;
>> +
>> +       err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy, extack);
>> +       if (err)
>> +               return err;
>> +
>> +       err = rtnl_ensure_unique_netns(tb, extack, true);
>> +       if (err)
>> +               return err;
>> +
>> +       ifm = nlmsg_data(nlh);
>> +       if (ifm->ifi_index > 0) {
>> +               dev = __dev_get_by_index(net, ifm->ifi_index);
>> +       } else if (tb[IFLA_IFNAME]) {
>> +               char ifname[IFNAMSIZ];
>> +
>> +               nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
>> +               dev = __dev_get_by_name(net, ifname);
>> +       } else {
>> +               return -EINVAL;
>> +       }
>> +
>> +       if (!dev)
>> +               return -ENODEV;
>> +
>> +       if (!tb[IFLA_ALT_IFNAME_MOD])
>> +               return -EINVAL;
>> +
>> +       new_alt_ifname = nla_strdup(tb[IFLA_ALT_IFNAME_MOD], GFP_KERNEL);
>> +       if (!new_alt_ifname)
>> +               return -ENOMEM;
>> +
>> +       err = netdev_name_node_alt_create(dev, new_alt_ifname);
>> +       if (err)
>> +               goto out_free_new_alt_ifname;
>> +
>> +       return 0;
>> +
>> +out_free_new_alt_ifname:
>> +       kfree(new_alt_ifname);
>> +       return err;
>> +}
>> +
>> +static int rtnl_delaltifname(struct sk_buff *skb, struct nlmsghdr *nlh,
>> +                            struct netlink_ext_ack *extack)
>> +{
>> +       struct net *net = sock_net(skb->sk);
>> +       struct nlattr *tb[IFLA_MAX + 1];
>> +       struct net_device *dev;
>> +       struct ifinfomsg *ifm;
>> +       char *del_alt_ifname;
>> +       int err;
>> +
>> +       err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy, extack);
>> +       if (err)
>> +               return err;
>> +
>> +       err = rtnl_ensure_unique_netns(tb, extack, true);
>> +       if (err)
>> +               return err;
>> +
>> +       ifm = nlmsg_data(nlh);
>> +       if (ifm->ifi_index > 0) {
>> +               dev = __dev_get_by_index(net, ifm->ifi_index);
>> +       } else if (tb[IFLA_IFNAME]) {
>> +               char ifname[IFNAMSIZ];
>> +
>> +               nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
>> +               dev = __dev_get_by_name(net, ifname);
>> +       } else {
>> +               return -EINVAL;
>> +       }
>> +
>> +       if (!dev)
>> +               return -ENODEV;
>> +
>> +       if (!tb[IFLA_ALT_IFNAME_MOD])
>> +               return -EINVAL;
>> +
>> +       del_alt_ifname = nla_strdup(tb[IFLA_ALT_IFNAME_MOD], GFP_KERNEL);
>> +       if (!del_alt_ifname)
>> +               return -ENOMEM;
>> +
>> +       err = netdev_name_node_alt_destroy(dev, del_alt_ifname);
>> +       kfree(del_alt_ifname);
>> +
>> +       return err;
>> +}
>> +
>>  static u16 rtnl_calcit(struct sk_buff *skb, struct nlmsghdr *nlh)
>>  {
>>         struct net *net = sock_net(skb->sk);
>> @@ -5331,6 +5430,9 @@ void __init rtnetlink_init(void)
>>         rtnl_register(PF_UNSPEC, RTM_GETROUTE, NULL, rtnl_dump_all, 0);
>>         rtnl_register(PF_UNSPEC, RTM_GETNETCONF, NULL, rtnl_dump_all, 0);
>>
>> +       rtnl_register(PF_UNSPEC, RTM_NEWALTIFNAME, rtnl_newaltifname, NULL, 0);
>> +       rtnl_register(PF_UNSPEC, RTM_DELALTIFNAME, rtnl_delaltifname, NULL, 0);
>> +
>>         rtnl_register(PF_BRIDGE, RTM_NEWNEIGH, rtnl_fdb_add, NULL, 0);
>>         rtnl_register(PF_BRIDGE, RTM_DELNEIGH, rtnl_fdb_del, NULL, 0);
>>         rtnl_register(PF_BRIDGE, RTM_GETNEIGH, rtnl_fdb_get, rtnl_fdb_dump, 0);
>> diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
>> index 58345ba0528e..a712b54c666c 100644
>> --- a/security/selinux/nlmsgtab.c
>> +++ b/security/selinux/nlmsgtab.c
>> @@ -83,6 +83,8 @@ static const struct nlmsg_perm nlmsg_route_perms[] =
>>         { RTM_NEWNEXTHOP,       NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
>>         { RTM_DELNEXTHOP,       NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
>>         { RTM_GETNEXTHOP,       NETLINK_ROUTE_SOCKET__NLMSG_READ  },
>> +       { RTM_NEWALTIFNAME,     NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
>> +       { RTM_DELALTIFNAME,     NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
>>  };
>>
>>  static const struct nlmsg_perm nlmsg_tcpdiag_perms[] =
>> @@ -166,7 +168,7 @@ int selinux_nlmsg_lookup(u16 sclass, u16 nlmsg_type, u32 *perm)
>>                  * structures at the top of this file with the new mappings
>>                  * before updating the BUILD_BUG_ON() macro!
>>                  */
>> -               BUILD_BUG_ON(RTM_MAX != (RTM_NEWNEXTHOP + 3));
>> +               BUILD_BUG_ON(RTM_MAX != (RTM_NEWALTIFNAME + 3));
>>                 err = nlmsg_perm(nlmsg_type, perm, nlmsg_route_perms,
>>                                  sizeof(nlmsg_route_perms));
>>                 break;
>> --
>> 2.21.0
>>

^ permalink raw reply

* [PATCH 3/3] tipc: fix issue of calling smp_processor_id() in preemptible
From: Ying Xue @ 2019-08-09  7:16 UTC (permalink / raw)
  To: davem, netdev; +Cc: jon.maloy, hdanton, tipc-discussion, syzkaller-bugs
In-Reply-To: <1565335017-21302-1-git-send-email-ying.xue@windriver.com>

syzbot found the following issue:

[   81.119772][ T8612] BUG: using smp_processor_id() in preemptible [00000000] code: syz-executor834/8612
[   81.136212][ T8612] caller is dst_cache_get+0x3d/0xb0
[   81.141450][ T8612] CPU: 0 PID: 8612 Comm: syz-executor834 Not tainted 5.2.0-rc6+ #48
[   81.149435][ T8612] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[   81.159480][ T8612] Call Trace:
[   81.162789][ T8612]  dump_stack+0x172/0x1f0
[   81.167123][ T8612]  debug_smp_processor_id+0x251/0x280
[   81.172479][ T8612]  dst_cache_get+0x3d/0xb0
[   81.176928][ T8612]  tipc_udp_xmit.isra.0+0xc4/0xb80
[   81.182046][ T8612]  ? kasan_kmalloc+0x9/0x10
[   81.186531][ T8612]  ? tipc_udp_addr2str+0x170/0x170
[   81.191641][ T8612]  ? __copy_skb_header+0x2e8/0x560
[   81.196750][ T8612]  ? __skb_checksum_complete+0x3f0/0x3f0
[   81.202364][ T8612]  ? netdev_alloc_frag+0x1b0/0x1b0
[   81.207452][ T8612]  ? skb_copy_header+0x21/0x2b0
[   81.212282][ T8612]  ? __pskb_copy_fclone+0x516/0xc90
[   81.217470][ T8612]  tipc_udp_send_msg+0x29a/0x4b0
[   81.222400][ T8612]  tipc_bearer_xmit_skb+0x16c/0x360
[   81.227585][ T8612]  tipc_enable_bearer+0xabe/0xd20
[   81.232606][ T8612]  ? __nla_validate_parse+0x2d0/0x1ee0
[   81.238048][ T8612]  ? tipc_bearer_xmit_skb+0x360/0x360
[   81.243401][ T8612]  ? nla_memcpy+0xb0/0xb0
[   81.247710][ T8612]  ? nla_memcpy+0xb0/0xb0
[   81.252020][ T8612]  ? __nla_parse+0x43/0x60
[   81.256417][ T8612]  __tipc_nl_bearer_enable+0x2de/0x3a0
[   81.261856][ T8612]  ? __tipc_nl_bearer_enable+0x2de/0x3a0
[   81.267467][ T8612]  ? tipc_nl_bearer_disable+0x40/0x40
[   81.272848][ T8612]  ? unwind_get_return_address+0x58/0xa0
[   81.278501][ T8612]  ? lock_acquire+0x16f/0x3f0
[   81.283190][ T8612]  tipc_nl_bearer_enable+0x23/0x40
[   81.288300][ T8612]  genl_family_rcv_msg+0x74b/0xf90
[   81.293404][ T8612]  ? genl_unregister_family+0x790/0x790
[   81.298935][ T8612]  ? __lock_acquire+0x54f/0x5490
[   81.303852][ T8612]  ? __netlink_lookup+0x3fa/0x7b0
[   81.308865][ T8612]  genl_rcv_msg+0xca/0x16c
[   81.313266][ T8612]  netlink_rcv_skb+0x177/0x450
[   81.318043][ T8612]  ? genl_family_rcv_msg+0xf90/0xf90
[   81.323311][ T8612]  ? netlink_ack+0xb50/0xb50
[   81.327906][ T8612]  ? lock_acquire+0x16f/0x3f0
[   81.332589][ T8612]  ? kasan_check_write+0x14/0x20
[   81.337511][ T8612]  genl_rcv+0x29/0x40
[   81.341485][ T8612]  netlink_unicast+0x531/0x710
[   81.346268][ T8612]  ? netlink_attachskb+0x770/0x770
[   81.351374][ T8612]  ? _copy_from_iter_full+0x25d/0x8c0
[   81.356765][ T8612]  ? __sanitizer_cov_trace_cmp8+0x18/0x20
[   81.362479][ T8612]  ? __check_object_size+0x3d/0x42f
[   81.367667][ T8612]  netlink_sendmsg+0x8ae/0xd70
[   81.372415][ T8612]  ? netlink_unicast+0x710/0x710
[   81.377520][ T8612]  ? aa_sock_msg_perm.isra.0+0xba/0x170
[   81.383051][ T8612]  ? apparmor_socket_sendmsg+0x2a/0x30
[   81.388530][ T8612]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
[   81.394775][ T8612]  ? security_socket_sendmsg+0x8d/0xc0
[   81.400240][ T8612]  ? netlink_unicast+0x710/0x710
[   81.405161][ T8612]  sock_sendmsg+0xd7/0x130
[   81.409561][ T8612]  ___sys_sendmsg+0x803/0x920
[   81.414220][ T8612]  ? copy_msghdr_from_user+0x430/0x430
[   81.419667][ T8612]  ? _raw_spin_unlock_irqrestore+0x6b/0xe0
[   81.425461][ T8612]  ? debug_object_active_state+0x25d/0x380
[   81.431255][ T8612]  ? __lock_acquire+0x54f/0x5490
[   81.436174][ T8612]  ? kasan_check_read+0x11/0x20
[   81.441208][ T8612]  ? _raw_spin_unlock_irqrestore+0xa4/0xe0
[   81.447008][ T8612]  ? mark_held_locks+0xf0/0xf0
[   81.451768][ T8612]  ? __call_rcu.constprop.0+0x28b/0x720
[   81.457298][ T8612]  ? call_rcu+0xb/0x10
[   81.461353][ T8612]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
[   81.467589][ T8612]  ? __fget_light+0x1a9/0x230
[   81.472249][ T8612]  ? __fdget+0x1b/0x20
[   81.476301][ T8612]  ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
[   81.482545][ T8612]  __sys_sendmsg+0x105/0x1d0
[   81.487115][ T8612]  ? __ia32_sys_shutdown+0x80/0x80
[   81.492208][ T8612]  ? blkcg_maybe_throttle_current+0x5e2/0xfb0
[   81.498272][ T8612]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[   81.503726][ T8612]  ? do_syscall_64+0x26/0x680
[   81.508385][ T8612]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   81.514444][ T8612]  ? do_syscall_64+0x26/0x680
[   81.519110][ T8612]  __x64_sys_sendmsg+0x78/0xb0
[   81.523862][ T8612]  do_syscall_64+0xfd/0x680
[   81.528352][ T8612]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   81.534234][ T8612] RIP: 0033:0x444679
[   81.538114][ T8612] Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b d8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
[   81.557709][ T8612] RSP: 002b:00007fff0201a8b8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[   81.566147][ T8612] RAX: ffffffffffffffda RBX: 00000000004002e0 RCX: 0000000000444679
[   81.574108][ T8612] RDX: 0000000000000000 RSI: 0000000020000580 RDI: 0000000000000003
[   81.582152][ T8612] RBP: 00000000006cf018 R08: 0000000000000001 R09: 00000000004002e0
[   81.590113][ T8612] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000402320
[   81.598089][ T8612] R13: 00000000004023b0 R14: 0000000000000000 R15: 0000000000

In commit e9c1a793210f ("tipc: add dst_cache support for udp media")
dst_cache_get() was introduced to be called in tipc_udp_xmit(). But
smp_processor_id() called by dst_cache_get() cannot be invoked in
preemptible context, as a result, the complaint above was reported.

Fixes: e9c1a793210f ("tipc: add dst_cache support for udp media")
syzbot+1a68504d96cd17b33a05@syzkaller.appspotmail.com
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 net/tipc/udp_media.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
index 287df687..ca3ae2e 100644
--- a/net/tipc/udp_media.c
+++ b/net/tipc/udp_media.c
@@ -224,6 +224,8 @@ static int tipc_udp_send_msg(struct net *net, struct sk_buff *skb,
 	struct udp_bearer *ub;
 	int err = 0;
 
+	local_bh_disable();
+
 	if (skb_headroom(skb) < UDP_MIN_HEADROOM) {
 		err = pskb_expand_head(skb, UDP_MIN_HEADROOM, 0, GFP_ATOMIC);
 		if (err)
@@ -237,9 +239,12 @@ static int tipc_udp_send_msg(struct net *net, struct sk_buff *skb,
 		goto out;
 	}
 
-	if (addr->broadcast != TIPC_REPLICAST_SUPPORT)
-		return tipc_udp_xmit(net, skb, ub, src, dst,
-				     &ub->rcast.dst_cache);
+	if (addr->broadcast != TIPC_REPLICAST_SUPPORT) {
+		err = tipc_udp_xmit(net, skb, ub, src, dst,
+				    &ub->rcast.dst_cache);
+		local_bh_enable();
+		return err;
+	}
 
 	/* Replicast, send an skb to each configured IP address */
 	list_for_each_entry_rcu(rcast, &ub->rcast.list, list) {
@@ -259,6 +264,7 @@ static int tipc_udp_send_msg(struct net *net, struct sk_buff *skb,
 	err = 0;
 out:
 	kfree_skb(skb);
+	local_bh_enable();
 	return err;
 }
 
-- 
2.7.4


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox