* Re: [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Jakub Kicinski @ 2019-07-31 18:43 UTC (permalink / raw)
To: Boris Pismenny
Cc: davem@davemloft.net, netdev@vger.kernel.org,
oss-drivers@netronome.com, edumazet@google.com, Aviad Yehezkel,
davejwatson@fb.com, john.fastabend@gmail.com,
daniel@iogearbox.net, willemb@google.com,
xiyou.wangcong@gmail.com, fw@strlen.de,
alexei.starovoitov@gmail.com
In-Reply-To: <c86943db-f098-3ce8-f0d1-4f04ba676d40@mellanox.com>
On Wed, 31 Jul 2019 13:57:26 +0000, Boris Pismenny wrote:
> > diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
> > index 048e5ca44824..2bc3ab5515d8 100644
> > --- a/Documentation/networking/tls-offload.rst
> > +++ b/Documentation/networking/tls-offload.rst
> > @@ -499,15 +499,6 @@ offloads, old connections will remain active after flags are cleared.
> > Known bugs
> > ==========
> >
> > -skb_orphan() leaks clear text
> > ------------------------------
> > -
> > -Currently drivers depend on the :c:member:`sk` member of
> > -:c:type:`struct sk_buff <sk_buff>` to identify segments requiring
> > -encryption. Any operation which removes or does not preserve the socket
> > -association such as :c:func:`skb_orphan` or :c:func:`skb_clone`
> > -will cause the driver to miss the packets and lead to clear text leaks.
> > -
> > Redirects leak clear text
> > -------------------------
> Doesn't this patch cover the redirect case as well?
Ah, good catch! I thought this entry was for socket redirect, which
will be a separate fix, but it seems we don't have an entry for that
one.
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 228db3998e46..90f3f552c789 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -814,6 +814,7 @@ enum sock_flags {
> > SOCK_TXTIME,
> > SOCK_XDP, /* XDP is attached */
> > SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
> > + SOCK_CRYPTO_TX_PLAIN_TEXT, /* Generate skbs with decrypted flag set */
> > };
> >
> > #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> > @@ -2480,8 +2481,26 @@ static inline bool sk_fullsock(const struct sock *sk)
> > return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> > }
> >
> > +static inline void sk_set_tx_crypto(const struct sock *sk, struct sk_buff *skb)
> > +{
> > +#ifdef CONFIG_TLS_DEVICE
> > + skb->decrypted = sock_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT);
> > +#endif
> > +}
>
> Have you considered the following alternative to calling
> sk_set_tx_crypto() - Add a new MSG_DECRYPTE flag that will set the
> skb->decrypted bit in the do_tcp_sendpages function.
>
> Then, the rest of the TCP code can propagate this bit from the existing skb.
That'd also work marking the socket as crypto seemed easy enough. I was
planning on using sk_rx_crypto_match() for socket redirect also, so
it'd be symmetrical. Is there a preference for using the internal flags?
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index f62f0e7e3cdd..dee93eab02fe 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -984,6 +984,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> > if (!skb)
> > goto wait_for_memory;
> >
> > + sk_set_tx_crypto(sk, skb);
> > skb_entail(sk, skb);
> > copy = size_goal;
> > }
> > @@ -993,7 +994,8 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> >
> > i = skb_shinfo(skb)->nr_frags;
> > can_coalesce = skb_can_coalesce(skb, i, page, offset);
> > - if (!can_coalesce && i >= sysctl_max_skb_frags) {
> > + if ((!can_coalesce && i >= sysctl_max_skb_frags) ||
> > + !sk_tx_crypto_match(sk, skb)) {
>
> I see that sk_tx_crypto_match is called only here to handle cases where
> the socket expected crypto offload, while the skb is not marked
> decrypted. AFAIU, this should not be possible, because we set the
> skb->eor bit for the last plaintext skb before sending any traffic that
> expects crypto offload.
Ack, I missed the tcp_skb_can_collapse_to() above.
^ permalink raw reply
* [PATCH net] net: bridge: move vlan init/deinit to NETDEV_REGISTER/UNREGISTER
From: Nikolay Aleksandrov @ 2019-07-31 18:36 UTC (permalink / raw)
To: netdev; +Cc: roopa, davem, bridge, Nikolay Aleksandrov, michael-dev
Most of the bridge device's vlan init bugs come from the fact that it's
done in the wrong place, way too early in ndo_init() before the device is
even assigned an ifindex. That makes error handling harder, especially for
older kernels which don't have bridge ndo_uninit callback. It also
introduces another bug when the bridge's dev_addr is added as fdb in the
the initial default pvid on vlan initialization, the fdb notification has
ifindex/NDA_MASTER both equal to 0 (see example below) which really
makes no sense for user-space[0]. Usually user-space software would ignore
such entries, but they are actually valid and will eventually have all
necessary attributes. I chose to change the order because this can be
backported to all kernels even pre-ndo_uninit ones without many changes
and it keeps init/deinit symmetric. As a bonus this allows us to keep
the vlan init/deinit entirely in br_vlan.c and remove those exports.
It makes much more sense to send a notification *after* the device has
registered and has a proper ifindex allocated rather than before when
there's a chance that the registration might still fail.
For the demonstration below a small change to iproute2 for printing all fdb
notifications is added, because it contained a workaround not to show
entries with ifindex == 0.
Command executed while monitoring: $ ip l add br0 type bridge
Before (both ifindex and master == 0):
$ bridge monitor fdb
36:7e:8a:b3:56:ba dev * vlan 1 master * permanent
After (proper br0 ifindex):
$ bridge monitor fdb
e6:2a:ae:7a:b7:48 dev br0 vlan 1 master br0 permanent
[0] https://bugzilla.kernel.org/show_bug.cgi?id=204389
Reported-by: michael-dev <michael-dev@fami-braun.de>
Fixes: 5be5a2df40f0 ("bridge: Add filtering support for default_pvid")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
I tried a few different approaches to resolve this but they were all
unsuitable for some kernels, this approach can go to stables easily
and IMO is the way this had to be done from the start. Alternatively
we could move only the br_vlan_add and pair it with a br_vlan_del of
default_pvid on the same events, but I don't think it hurts to move
the whole init/deinit there as it'd help older stable releases as well.
I also tested the br_vlan_init error handling after the move by always
returning errors from all over it. Since errors at NETDEV_REGISTER cause
NETDEV_UNREGISTER we can deinit vlans properly for all cases regardless
why it happened (e.g. device destruction or init error).
net/bridge/br.c | 5 ++++-
net/bridge/br_device.c | 10 ----------
net/bridge/br_private.h | 19 ++++---------------
net/bridge/br_vlan.c | 23 ++++++++++++++++-------
4 files changed, 24 insertions(+), 33 deletions(-)
diff --git a/net/bridge/br.c b/net/bridge/br.c
index d164f63a4345..8a8f9e5f264f 100644
--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -37,12 +37,15 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
int err;
if (dev->priv_flags & IFF_EBRIDGE) {
+ err = br_vlan_bridge_event(dev, event, ptr);
+ if (err)
+ return notifier_from_errno(err);
+
if (event == NETDEV_REGISTER) {
/* register of bridge completed, add sysfs entries */
br_sysfs_addbr(dev);
return NOTIFY_DONE;
}
- br_vlan_bridge_event(dev, event, ptr);
}
/* not a port of a bridge */
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 681b72862c16..b3e3de2ecf95 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -135,18 +135,9 @@ static int br_dev_init(struct net_device *dev)
return err;
}
- err = br_vlan_init(br);
- if (err) {
- free_percpu(br->stats);
- br_mdb_hash_fini(br);
- br_fdb_hash_fini(br);
- return err;
- }
-
err = br_multicast_init_stats(br);
if (err) {
free_percpu(br->stats);
- br_vlan_flush(br);
br_mdb_hash_fini(br);
br_fdb_hash_fini(br);
}
@@ -161,7 +152,6 @@ static void br_dev_uninit(struct net_device *dev)
br_multicast_dev_del(br);
br_multicast_uninit_stats(br);
- br_vlan_flush(br);
br_mdb_hash_fini(br);
br_fdb_hash_fini(br);
free_percpu(br->stats);
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index e8cf03b43b7d..96dd1c68d73f 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -872,7 +872,6 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br,
int br_vlan_add(struct net_bridge *br, u16 vid, u16 flags,
bool *changed, struct netlink_ext_ack *extack);
int br_vlan_delete(struct net_bridge *br, u16 vid);
-void br_vlan_flush(struct net_bridge *br);
struct net_bridge_vlan *br_vlan_find(struct net_bridge_vlan_group *vg, u16 vid);
void br_recalculate_fwd_mask(struct net_bridge *br);
int __br_vlan_filter_toggle(struct net_bridge *br, unsigned long val);
@@ -881,7 +880,6 @@ int __br_vlan_set_proto(struct net_bridge *br, __be16 proto);
int br_vlan_set_proto(struct net_bridge *br, unsigned long val);
int br_vlan_set_stats(struct net_bridge *br, unsigned long val);
int br_vlan_set_stats_per_port(struct net_bridge *br, unsigned long val);
-int br_vlan_init(struct net_bridge *br);
int br_vlan_set_default_pvid(struct net_bridge *br, unsigned long val);
int __br_vlan_set_default_pvid(struct net_bridge *br, u16 pvid,
struct netlink_ext_ack *extack);
@@ -894,8 +892,8 @@ int nbp_get_num_vlan_infos(struct net_bridge_port *p, u32 filter_mask);
void br_vlan_get_stats(const struct net_bridge_vlan *v,
struct br_vlan_stats *stats);
void br_vlan_port_event(struct net_bridge_port *p, unsigned long event);
-void br_vlan_bridge_event(struct net_device *dev, unsigned long event,
- void *ptr);
+int br_vlan_bridge_event(struct net_device *dev, unsigned long event,
+ void *ptr);
static inline struct net_bridge_vlan_group *br_vlan_group(
const struct net_bridge *br)
@@ -988,19 +986,10 @@ static inline int br_vlan_delete(struct net_bridge *br, u16 vid)
return -EOPNOTSUPP;
}
-static inline void br_vlan_flush(struct net_bridge *br)
-{
-}
-
static inline void br_recalculate_fwd_mask(struct net_bridge *br)
{
}
-static inline int br_vlan_init(struct net_bridge *br)
-{
- return 0;
-}
-
static inline int nbp_vlan_add(struct net_bridge_port *port, u16 vid, u16 flags,
bool *changed, struct netlink_ext_ack *extack)
{
@@ -1085,8 +1074,8 @@ static inline void br_vlan_port_event(struct net_bridge_port *p,
{
}
-static inline void br_vlan_bridge_event(struct net_device *dev,
- unsigned long event, void *ptr)
+static inline int br_vlan_bridge_event(struct net_device *dev,
+ unsigned long event, void *ptr)
{
}
#endif
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index a544e161c7fa..266c1214b9f9 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -709,7 +709,7 @@ int br_vlan_delete(struct net_bridge *br, u16 vid)
return __vlan_del(v);
}
-void br_vlan_flush(struct net_bridge *br)
+static void br_vlan_flush(struct net_bridge *br)
{
struct net_bridge_vlan_group *vg;
@@ -721,6 +721,8 @@ void br_vlan_flush(struct net_bridge *br)
br_fdb_delete_by_port(br, NULL, 0, 1);
vg = br_vlan_group(br);
+ if (!vg)
+ return;
__vlan_flush(vg);
RCU_INIT_POINTER(br->vlgrp, NULL);
synchronize_rcu();
@@ -1054,7 +1056,7 @@ int br_vlan_set_default_pvid(struct net_bridge *br, unsigned long val)
return err;
}
-int br_vlan_init(struct net_bridge *br)
+static int br_vlan_init(struct net_bridge *br)
{
struct net_bridge_vlan_group *vg;
int ret = -ENOMEM;
@@ -1469,13 +1471,19 @@ static void nbp_vlan_set_vlan_dev_state(struct net_bridge_port *p, u16 vid)
}
/* Must be protected by RTNL. */
-void br_vlan_bridge_event(struct net_device *dev, unsigned long event,
- void *ptr)
+int br_vlan_bridge_event(struct net_device *dev, unsigned long event, void *ptr)
{
struct netdev_notifier_changeupper_info *info;
- struct net_bridge *br;
+ struct net_bridge *br = netdev_priv(dev);
+ int ret = 0;
switch (event) {
+ case NETDEV_REGISTER:
+ ret = br_vlan_init(br);
+ break;
+ case NETDEV_UNREGISTER:
+ br_vlan_flush(br);
+ break;
case NETDEV_CHANGEUPPER:
info = ptr;
br_vlan_upper_change(dev, info->upper_dev, info->linking);
@@ -1483,12 +1491,13 @@ void br_vlan_bridge_event(struct net_device *dev, unsigned long event,
case NETDEV_CHANGE:
case NETDEV_UP:
- br = netdev_priv(dev);
if (!br_opt_get(br, BROPT_VLAN_BRIDGE_BINDING))
- return;
+ break;
br_vlan_link_state_change(dev, br);
break;
}
+
+ return ret;
}
/* Must be protected by RTNL. */
--
2.21.0
^ permalink raw reply related
* [PATCH net] mvpp2: fix panic on module removal
From: Matteo Croce @ 2019-07-31 18:31 UTC (permalink / raw)
To: netdev
Cc: Miquel Raynal, linux-kernel, Lorenzo Bianconi, Antoine Tenart,
Maxime Chevallier, David S. Miller
mvpp2 uses a delayed workqueue to gather traffic statistics.
On module removal the workqueue can be destroyed before calling
cancel_delayed_work_sync() on its works.
Fix it by moving the destroy_workqueue() call after mvpp2_port_remove().
# rmmod mvpp2
[ 2743.311722] mvpp2 f4000000.ethernet eth1: phy link down 10gbase-kr/10Gbps/Full
[ 2743.320063] mvpp2 f4000000.ethernet eth1: Link is Down
[ 2743.572263] mvpp2 f4000000.ethernet eth2: phy link down sgmii/1Gbps/Full
[ 2743.580076] mvpp2 f4000000.ethernet eth2: Link is Down
[ 2744.102169] mvpp2 f2000000.ethernet eth0: phy link down 10gbase-kr/10Gbps/Full
[ 2744.110441] mvpp2 f2000000.ethernet eth0: Link is Down
[ 2744.115614] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 2744.115615] Mem abort info:
[ 2744.115616] ESR = 0x96000005
[ 2744.115617] Exception class = DABT (current EL), IL = 32 bits
[ 2744.115618] SET = 0, FnV = 0
[ 2744.115619] EA = 0, S1PTW = 0
[ 2744.115620] Data abort info:
[ 2744.115621] ISV = 0, ISS = 0x00000005
[ 2744.115622] CM = 0, WnR = 0
[ 2744.115624] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000422681000
[ 2744.115626] [0000000000000000] pgd=0000000000000000, pud=0000000000000000
[ 2744.115630] Internal error: Oops: 96000005 [#1] SMP
[ 2744.115632] Modules linked in: mvpp2(-) algif_hash af_alg nls_iso8859_1 nls_cp437 vfat fat xhci_plat_hcd m25p80 spi_nor xhci_hcd mtd usbcore i2c_mv64xxx sfp usb_common marvell10g phy_generic spi_orion mdio_i2c i2c_core mvmdio phylink sbsa_gwdt ip_tables x_tables autofs4 [last unloaded: mvpp2]
[ 2744.115654] CPU: 3 PID: 8357 Comm: kworker/3:2 Not tainted 5.3.0-rc2 #1
[ 2744.115655] Hardware name: Marvell 8040 MACCHIATOBin Double-shot (DT)
[ 2744.115665] Workqueue: events_power_efficient phylink_resolve [phylink]
[ 2744.115669] pstate: a0000085 (NzCv daIf -PAN -UAO)
[ 2744.115675] pc : __queue_work+0x9c/0x4d8
[ 2744.115677] lr : __queue_work+0x170/0x4d8
[ 2744.115678] sp : ffffff801001bd50
[ 2744.115680] x29: ffffff801001bd50 x28: ffffffc422597600
[ 2744.115684] x27: ffffff80109ae6f0 x26: ffffff80108e4018
[ 2744.115688] x25: 0000000000000003 x24: 0000000000000004
[ 2744.115691] x23: ffffff80109ae6e0 x22: 0000000000000017
[ 2744.115694] x21: ffffffc42c030000 x20: ffffffc42209e8f8
[ 2744.115697] x19: 0000000000000000 x18: 0000000000000000
[ 2744.115699] x17: 0000000000000000 x16: 0000000000000000
[ 2744.115701] x15: 0000000000000010 x14: ffffffffffffffff
[ 2744.115702] x13: ffffff8090e2b95f x12: ffffff8010e2b967
[ 2744.115704] x11: ffffff8010906000 x10: 0000000000000040
[ 2744.115706] x9 : ffffff80109223b8 x8 : ffffff80109223b0
[ 2744.115707] x7 : ffffffc42bc00068 x6 : 0000000000000000
[ 2744.115709] x5 : ffffffc42bc00000 x4 : 0000000000000000
[ 2744.115710] x3 : 0000000000000000 x2 : 0000000000000000
[ 2744.115712] x1 : 0000000000000008 x0 : ffffffc42c030000
[ 2744.115714] Call trace:
[ 2744.115716] __queue_work+0x9c/0x4d8
[ 2744.115718] delayed_work_timer_fn+0x28/0x38
[ 2744.115722] call_timer_fn+0x3c/0x180
[ 2744.115723] expire_timers+0x60/0x168
[ 2744.115724] run_timer_softirq+0xbc/0x1e8
[ 2744.115727] __do_softirq+0x128/0x320
[ 2744.115731] irq_exit+0xa4/0xc0
[ 2744.115734] __handle_domain_irq+0x70/0xc0
[ 2744.115735] gic_handle_irq+0x58/0xa8
[ 2744.115737] el1_irq+0xb8/0x140
[ 2744.115738] console_unlock+0x3a0/0x568
[ 2744.115740] vprintk_emit+0x200/0x2a0
[ 2744.115744] dev_vprintk_emit+0x1c8/0x1e4
[ 2744.115747] dev_printk_emit+0x6c/0x7c
[ 2744.115751] __netdev_printk+0x104/0x1d8
[ 2744.115752] netdev_printk+0x60/0x70
[ 2744.115756] phylink_resolve+0x38c/0x3c8 [phylink]
[ 2744.115758] process_one_work+0x1f8/0x448
[ 2744.115760] worker_thread+0x54/0x500
[ 2744.115762] kthread+0x12c/0x130
[ 2744.115764] ret_from_fork+0x10/0x1c
[ 2744.115768] Code: aa1403e0 97fffbbe aa0003f5 b4000700 (f9400261)
Fixes: 118d6298f6f0 ("net: mvpp2: add ethtool GOP statistics")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
---
drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index c51f1d5b550b..5002d51fc9d6 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -5760,7 +5760,6 @@ static int mvpp2_remove(struct platform_device *pdev)
mvpp2_dbgfs_cleanup(priv);
flush_workqueue(priv->stats_queue);
- destroy_workqueue(priv->stats_queue);
fwnode_for_each_available_child_node(fwnode, port_fwnode) {
if (priv->port_list[i]) {
@@ -5770,6 +5769,8 @@ static int mvpp2_remove(struct platform_device *pdev)
i++;
}
+ destroy_workqueue(priv->stats_queue);
+
for (i = 0; i < MVPP2_BM_POOLS_NUM; i++) {
struct mvpp2_bm_pool *bm_pool = &priv->bm_pools[i];
--
2.21.0
^ permalink raw reply related
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-07-31 18:29 UTC (permalink / raw)
To: Jason Wang
Cc: kvm, virtualization, netdev, linux-kernel, linux-mm, jgg,
Paul E. McKenney
In-Reply-To: <20190731084655.7024-8-jasowang@redhat.com>
On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
> We used to use RCU to synchronize MMU notifier with worker. This leads
> calling synchronize_rcu() in invalidate_range_start(). But on a busy
> system, there would be many factors that may slow down the
> synchronize_rcu() which makes it unsuitable to be called in MMU
> notifier.
>
> A solution is SRCU but its overhead is obvious with the expensive full
> memory barrier. Another choice is to use seqlock, but it doesn't
> provide a synchronization method between readers and writers. The last
> choice is to use vq mutex, but it need to deal with the worst case
> that MMU notifier must be blocked and wait for the finish of swap in.
>
> So this patch switches use a counter to track whether or not the map
> was used. The counter was increased when vq try to start or finish
> uses the map. This means, when it was even, we're sure there's no
> readers and MMU notifier is synchronized. When it was odd, it means
> there's a reader we need to wait it to be even again then we are
> synchronized. To avoid full memory barrier, store_release +
> load_acquire on the counter is used.
Unfortunately this needs a lot of review and testing, so this can't make
rc2, and I don't think this is the kind of patch I can merge after rc3.
Subtle memory barrier tricks like this can introduce new bugs while they
are fixing old ones.
>
> Consider the read critical section is pretty small the synchronization
> should be done very fast.
>
> Note the patch lead about 3% PPS dropping.
Sorry what do you mean by this last sentence? This degrades performance
compared to what?
>
> Reported-by: Michael S. Tsirkin <mst@redhat.com>
> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++----------------
> drivers/vhost/vhost.h | 7 +-
> 2 files changed, 94 insertions(+), 58 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index cfc11f9ed9c9..db2c81cb1e90 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
>
> spin_lock(&vq->mmu_lock);
> for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> - map[i] = rcu_dereference_protected(vq->maps[i],
> - lockdep_is_held(&vq->mmu_lock));
> + map[i] = vq->maps[i];
> if (map[i]) {
> vhost_set_map_dirty(vq, map[i], i);
> - rcu_assign_pointer(vq->maps[i], NULL);
> + vq->maps[i] = NULL;
> }
> }
> spin_unlock(&vq->mmu_lock);
>
> - /* No need for synchronize_rcu() or kfree_rcu() since we are
> - * serialized with memory accessors (e.g vq mutex held).
> + /* No need for synchronization since we are serialized with
> + * memory accessors (e.g vq mutex held).
> */
>
> for (i = 0; i < VHOST_NUM_ADDRS; i++)
> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
> return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
> }
>
> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> +{
> + int ref = READ_ONCE(vq->ref);
> +
> + smp_store_release(&vq->ref, ref + 1);
> + /* Make sure ref counter is visible before accessing the map */
> + smp_load_acquire(&vq->ref);
The map access is after this sequence, correct?
Just going by the rules in Documentation/memory-barriers.txt,
I think that this pair will not order following accesses with ref store.
Documentation/memory-barriers.txt says:
+ In addition, a RELEASE+ACQUIRE
+ pair is -not- guaranteed to act as a full memory barrier.
The guarantee that is made is this:
after
an ACQUIRE on a given variable, all memory accesses preceding any prior
RELEASE on that same variable are guaranteed to be visible.
And if we also had the reverse rule we'd end up with a full barrier,
won't we?
Cc Paul in case I missed something here. And if I'm right,
maybe we should call this out, adding
"The opposite is not true: a prior RELEASE is not
guaranteed to be visible before memory accesses following
the subsequent ACQUIRE".
> +}
> +
> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> +{
> + int ref = READ_ONCE(vq->ref);
> +
> + /* Make sure vq access is done before increasing ref counter */
> + smp_store_release(&vq->ref, ref + 1);
> +}
> +
> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
> +{
> + int ref;
> +
> + /* Make sure map change was done before checking ref counter */
> + smp_mb();
> +
> + ref = READ_ONCE(vq->ref);
> + if (ref & 0x1) {
Please document the even/odd trick here too, not just in the commit log.
> + /* When ref change,
changes
> we are sure no reader can see
> + * previous map */
> + while (READ_ONCE(vq->ref) == ref) {
what is the below line in aid of?
> + set_current_state(TASK_RUNNING);
> + schedule();
if (need_resched())
schedule();
?
> + }
On an interruptible kernel, there's a risk here is that
a task got preempted with an odd ref.
So I suspect we'll have to disable preemption when we
make ref odd.
> + }
> + /* Make sure ref counter was checked before any other
> + * operations that was dene on map. */
was dene -> were done?
> + smp_mb();
> +}
> +
> static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> int index,
> unsigned long start,
> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> spin_lock(&vq->mmu_lock);
> ++vq->invalidate_count;
>
> - map = rcu_dereference_protected(vq->maps[index],
> - lockdep_is_held(&vq->mmu_lock));
> + map = vq->maps[index];
> if (map) {
> vhost_set_map_dirty(vq, map, index);
> - rcu_assign_pointer(vq->maps[index], NULL);
> + vq->maps[index] = NULL;
> }
> spin_unlock(&vq->mmu_lock);
>
> if (map) {
> - synchronize_rcu();
> + vhost_vq_sync_access(vq);
> vhost_map_unprefetch(map);
> }
> }
> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev)
> for (i = 0; i < dev->nvqs; ++i) {
> vq = dev->vqs[i];
> for (j = 0; j < VHOST_NUM_ADDRS; j++)
> - RCU_INIT_POINTER(vq->maps[j], NULL);
> + vq->maps[j] = NULL;
> }
> }
> #endif
> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> vq->indirect = NULL;
> vq->heads = NULL;
> vq->dev = dev;
> + vq->ref = 0;
> mutex_init(&vq->mutex);
> spin_lock_init(&vq->mmu_lock);
> vhost_vq_reset(dev, vq);
> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
> map->npages = npages;
> map->pages = pages;
>
> - rcu_assign_pointer(vq->maps[index], map);
> + vq->maps[index] = map;
> /* No need for a synchronize_rcu(). This function should be
> * called by dev->worker so we are serialized with all
> * readers.
> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> struct vring_used *used;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> + map = vq->maps[VHOST_ADDR_USED];
> if (likely(map)) {
> used = map->addr;
> *((__virtio16 *)&used->ring[vq->num]) =
> cpu_to_vhost16(vq, vq->avail_idx);
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> size_t size;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> + map = vq->maps[VHOST_ADDR_USED];
> if (likely(map)) {
> used = map->addr;
> size = count * sizeof(*head);
> memcpy(used->ring + idx, head, size);
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> struct vring_used *used;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> + map = vq->maps[VHOST_ADDR_USED];
> if (likely(map)) {
> used = map->addr;
> used->flags = cpu_to_vhost16(vq, vq->used_flags);
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
> struct vring_used *used;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> + map = vq->maps[VHOST_ADDR_USED];
> if (likely(map)) {
> used = map->addr;
> used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> struct vring_avail *avail;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> + map = vq->maps[VHOST_ADDR_AVAIL];
> if (likely(map)) {
> avail = map->addr;
> *idx = avail->idx;
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> struct vring_avail *avail;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> + map = vq->maps[VHOST_ADDR_AVAIL];
> if (likely(map)) {
> avail = map->addr;
> *head = avail->ring[idx & (vq->num - 1)];
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> struct vring_avail *avail;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> + map = vq->maps[VHOST_ADDR_AVAIL];
> if (likely(map)) {
> avail = map->addr;
> *flags = avail->flags;
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
> struct vring_avail *avail;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> + vhost_vq_access_map_begin(vq);
> + map = vq->maps[VHOST_ADDR_AVAIL];
> if (likely(map)) {
> avail = map->addr;
> *event = (__virtio16)avail->ring[vq->num];
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
> struct vring_used *used;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> + map = vq->maps[VHOST_ADDR_USED];
> if (likely(map)) {
> used = map->addr;
> *idx = used->idx;
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq,
> struct vring_desc *d;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]);
> + map = vq->maps[VHOST_ADDR_DESC];
> if (likely(map)) {
> d = map->addr;
> *desc = *(d + idx);
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
> #if VHOST_ARCH_CAN_ACCEL_UACCESS
> static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
> {
> - struct vhost_map __rcu *map;
> + struct vhost_map *map;
> int i;
>
> for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> - rcu_read_lock();
> - map = rcu_dereference(vq->maps[i]);
> - rcu_read_unlock();
> + map = vq->maps[i];
> if (unlikely(!map))
> vhost_map_prefetch(vq, i);
> }
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index a9a2a93857d2..f9e9558a529d 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -115,16 +115,17 @@ struct vhost_virtqueue {
> #if VHOST_ARCH_CAN_ACCEL_UACCESS
> /* Read by memory accessors, modified by meta data
> * prefetching, MMU notifier and vring ioctl().
> - * Synchonrized through mmu_lock (writers) and RCU (writers
> - * and readers).
> + * Synchonrized through mmu_lock (writers) and ref counters,
> + * see vhost_vq_access_map_begin()/vhost_vq_access_map_end().
> */
> - struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
> + struct vhost_map *maps[VHOST_NUM_ADDRS];
> /* Read by MMU notifier, modified by vring ioctl(),
> * synchronized through MMU notifier
> * registering/unregistering.
> */
> struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
> #endif
> + int ref;
Is it important that this is signed? If not I'd do unsigned here:
even though kernel does compile with 2s complement sign overflow,
it seems cleaner not to depend on that.
> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>
> struct file *kick;
> --
> 2.18.1
^ permalink raw reply
* Re: [PATCH bpf-next v4 09/11] samples/bpf: add buffer recycling for unaligned chunks to xdpsock
From: Jonathan Lemon @ 2019-07-31 18:26 UTC (permalink / raw)
To: Kevin Laatz
Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
intel-wired-lan
In-Reply-To: <20190730085400.10376-10-kevin.laatz@intel.com>
On 30 Jul 2019, at 1:53, Kevin Laatz wrote:
> This patch adds buffer recycling support for unaligned buffers. Since
> we
> don't mask the addr to 2k at umem_reg in unaligned mode, we need to
> make
> sure we give back the correct (original) addr to the fill queue. We
> achieve
> this using the new descriptor format and associated masks. The new
> format
> uses the upper 16-bits for the offset and the lower 48-bits for the
> addr.
> Since we have a field for the offset, we no longer need to modify the
> actual address. As such, all we have to do to get back the original
> address
> is mask for the lower 48 bits (i.e. strip the offset and we get the
> address
> on it's own).
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
>
> ---
> v2:
> - Removed unused defines
> - Fix buffer recycling for unaligned case
> - Remove --buf-size (--frame-size merged before this)
> - Modifications to use the new descriptor format for buffer
> recycling
> ---
> samples/bpf/xdpsock_user.c | 24 +++++++++++++++---------
> 1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> index 756b00eb1afe..62b2059cd0e3 100644
> --- a/samples/bpf/xdpsock_user.c
> +++ b/samples/bpf/xdpsock_user.c
> @@ -475,6 +475,7 @@ static void kick_tx(struct xsk_socket_info *xsk)
>
> static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
> {
> + struct xsk_umem_info *umem = xsk->umem;
> u32 idx_cq = 0, idx_fq = 0;
> unsigned int rcvd;
> size_t ndescs;
> @@ -487,22 +488,21 @@ static inline void complete_tx_l2fwd(struct
> xsk_socket_info *xsk)
> xsk->outstanding_tx;
>
> /* re-add completed Tx buffers */
> - rcvd = xsk_ring_cons__peek(&xsk->umem->cq, ndescs, &idx_cq);
> + rcvd = xsk_ring_cons__peek(&umem->cq, ndescs, &idx_cq);
> if (rcvd > 0) {
> unsigned int i;
> int ret;
>
> - ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
> + ret = xsk_ring_prod__reserve(&umem->fq, rcvd, &idx_fq);
> while (ret != rcvd) {
> if (ret < 0)
> exit_with_error(-ret);
> - ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd,
> - &idx_fq);
> + ret = xsk_ring_prod__reserve(&umem->fq, rcvd, &idx_fq);
> }
> +
> for (i = 0; i < rcvd; i++)
> - *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) =
> - *xsk_ring_cons__comp_addr(&xsk->umem->cq,
> - idx_cq++);
> + *xsk_ring_prod__fill_addr(&umem->fq, idx_fq++) =
> + *xsk_ring_cons__comp_addr(&umem->cq, idx_cq++);
>
> xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
> xsk_ring_cons__release(&xsk->umem->cq, rcvd);
> @@ -549,7 +549,11 @@ static void rx_drop(struct xsk_socket_info *xsk)
> for (i = 0; i < rcvd; i++) {
> u64 addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
> u32 len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++)->len;
> - char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> + u64 offset = addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
> +
> + addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
> + char *pkt = xsk_umem__get_data(xsk->umem->buffer,
> + addr + offset);
The mask constants should not be part of the api - this should be
hidden behind an accessor.
Something like:
u64 addr = xsk_umem__get_addr(xsk->umem, handle);
> hex_dump(pkt, len, addr);
> *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = addr;
> @@ -655,7 +659,9 @@ static void l2fwd(struct xsk_socket_info *xsk)
> idx_rx)->addr;
> u32 len = xsk_ring_cons__rx_desc(&xsk->rx,
> idx_rx++)->len;
> - char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> + u64 offset = addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
> + char *pkt = xsk_umem__get_data(xsk->umem->buffer,
> + (addr & XSK_UNALIGNED_BUF_ADDR_MASK) + offset);
>
> swap_mac_addresses(pkt);
>
> --
> 2.17.1
^ permalink raw reply
* Re: [PATCH 1/2] tools: bpftool: add net load command to load XDP on interface
From: Daniel T. Lee @ 2019-07-31 18:23 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Daniel Borkmann, Alexei Starovoitov, netdev, Quentin Monnet
In-Reply-To: <20190731120805.1091d5c9@carbon>
On Wed, Jul 31, 2019 at 7:08 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Wed, 31 Jul 2019 03:48:20 +0900
> "Daniel T. Lee" <danieltimlee@gmail.com> wrote:
>
> > By this commit, using `bpftool net load`, user can load XDP prog on
> > interface. New type of enum 'net_load_type' has been made, as stated at
> > cover-letter, the meaning of 'load' is, prog will be loaded on interface.
>
> Why the keyword "load" ?
> Why not "attach" (and "detach")?
>
> For BPF there is a clear distinction between the "load" and "attach"
> steps. I know this is under subcommand "net", but to follow the
> conversion of other subcommands e.g. "prog" there are both "load" and
> "attach" commands.
>
>
> > BPF prog will be loaded through libbpf 'bpf_set_link_xdp_fd'.
>
> Again this is a "set" operation, not a "load" operation.
From earlier at cover-letter, I thought using the same word 'load' might give
confusion since XDP program is not considered as 'bpf_attach_type' and can't
be attached with 'BPF_PROG_ATTACH'.
But, according to the feedback from you and Andrii Nakryiko, replacing
the word 'load' as 'attach' would be more clear and more consistent.
> > Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
>
> [...]
> > static int do_show(int argc, char **argv)
> > {
> > struct bpf_attach_info attach_info = {};
> > @@ -305,13 +405,17 @@ static int do_help(int argc, char **argv)
> >
> > fprintf(stderr,
> > "Usage: %s %s { show | list } [dev <devname>]\n"
> > + " %s %s load PROG LOAD_TYPE <devname>\n"
>
> The "PROG" here does it correspond to the 'bpftool prog' syntax?:
>
> PROG := { id PROG_ID | pinned FILE | tag PROG_TAG }
>
Yes. By using the same 'prog_parse_fd' from 'bpftool prog',
user can 'attach' XDP prog with id, pinned file or tag.
> > " %s %s help\n"
> > + "\n"
> > + " " HELP_SPEC_PROGRAM "\n"
> > + " LOAD_TYPE := { xdp | xdpgeneric | xdpdrv | xdpoffload }\n"
> > "Note: Only xdp and tc attachments are supported now.\n"
> > " For progs attached to cgroups, use \"bpftool cgroup\"\n"
> > " to dump program attachments. For program types\n"
> > " sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n"
> > " consult iproute2.\n",
>
>
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer
And about the enum 'NET_LOAD_TYPE_XDP_DRIVE',
'DRIVER' looks more clear to understand.
Will change to it right away.
Thanks for the review.
^ permalink raw reply
* Re: [oss-drivers] Re: [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Jakub Kicinski @ 2019-07-31 18:12 UTC (permalink / raw)
To: Willem de Bruijn
Cc: David Miller, Network Development, oss-drivers, Eric Dumazet,
borisp, aviadye, davejwatson, John Fastabend, Daniel Borkmann,
Cong Wang, Florian Westphal, Alexei Starovoitov
In-Reply-To: <CA+FuTSdN41z5PVfyT5Z-ApnKQ9CYcDSnr4VGZnsgA-iAEK12Ow@mail.gmail.com>
On Wed, 31 Jul 2019 11:57:10 -0400, Willem de Bruijn wrote:
> On Tue, Jul 30, 2019 at 5:13 PM Jakub Kicinski wrote:
> > sk_validate_xmit_skb() and drivers depend on the sk member of
> > struct sk_buff to identify segments requiring encryption.
> > Any operation which removes or does not preserve the original TLS
> > socket such as skb_orphan() or skb_clone() will cause clear text
> > leaks.
> >
> > Make the TCP socket underlying an offloaded TLS connection
> > mark all skbs as decrypted, if TLS TX is in offload mode.
> > Then in sk_validate_xmit_skb() catch skbs which have no socket
> > (or a socket with no validation) and decrypted flag set.
> >
> > Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
> > sk->sk_validate_xmit_skb are slightly interchangeable right now,
> > they all imply TLS offload. The new checks are guarded by
> > CONFIG_TLS_DEVICE because that's the option guarding the
> > sk_buff->decrypted member.
> >
> > Second, smaller issue with orphaning is that it breaks
> > the guarantee that packets will be delivered to device
> > queues in-order. All TLS offload drivers depend on that
> > scheduling property. This means skb_orphan_partial()'s
> > trick of preserving partial socket references will cause
> > issues in the drivers. We need a full orphan, and as a
> > result netem delay/throttling will cause all TLS offload
> > skbs to be dropped.
> >
> > Reusing the sk_buff->decrypted flag also protects from
> > leaking clear text when incoming, decrypted skb is redirected
> > (e.g. by TC).
> >
> > Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index d57b0cc995a0..b0c10b518e65 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -1992,6 +1992,22 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
> > }
> > EXPORT_SYMBOL(skb_set_owner_w);
> >
> > +static bool can_skb_orphan_partial(const struct sk_buff *skb)
> > +{
> > +#ifdef CONFIG_TLS_DEVICE
> > + /* Drivers depend on in-order delivery for crypto offload,
> > + * partial orphan breaks out-of-order-OK logic.
> > + */
> > + if (skb->decrypted)
> > + return false;
> > +#endif
> > +#ifdef CONFIG_INET
> > + if (skb->destructor == tcp_wfree)
> > + return true;
> > +#endif
> > + return skb->destructor == sock_wfree;
> > +}
> > +
>
> Just insert the skb->decrypted check into skb_orphan_partial for less
> code churn?
Okie.. skb_orphan_partial() is a little ugly but will do.
> I also think that this is an independent concern from leaking plain
> text, so perhaps could be a separate patch.
Do you mean the out-of-order stuff is a separate concern?
It is, I had them separate at the first try, but GSO code looks at
the destructor and IIRC only copies the socket if its still tcp_wfree.
If we let partial orphan be we have to do temporary hairy stuff in
tcp_gso_segment(). It's easier to just deal with partial orphan here.
^ permalink raw reply
* Re: [PATCH bpf-next v4 04/11] xsk: add support to allow unaligned chunk placement
From: Jonathan Lemon @ 2019-07-31 18:11 UTC (permalink / raw)
To: Kevin Laatz
Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
intel-wired-lan
In-Reply-To: <20190730085400.10376-5-kevin.laatz@intel.com>
On 30 Jul 2019, at 1:53, Kevin Laatz wrote:
> Currently, addresses are chunk size aligned. This means, we are very
> restricted in terms of where we can place chunk within the umem. For
> example, if we have a chunk size of 2k, then our chunks can only be
> placed
> at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).
>
> This patch introduces the ability to use unaligned chunks. With these
> changes, we are no longer bound to having to place chunks at a 2k (or
> whatever your chunk size is) interval. Since we are no longer dealing
> with
> aligned chunks, they can now cross page boundaries. Checks for page
> contiguity have been added in order to keep track of which pages are
> followed by a physically contiguous page.
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
>
> ---
> v2:
> - Add checks for the flags coming from userspace
> - Fix how we get chunk_size in xsk_diag.c
> - Add defines for masking the new descriptor format
> - Modified the rx functions to use new descriptor format
> - Modified the tx functions to use new descriptor format
>
> v3:
> - Add helper function to do address/offset masking/addition
>
> v4:
> - fixed page_start calculation in __xsk_rcv_memcpy().
> - move offset handling to the xdp_umem_get_* functions
> - modified the len field in xdp_umem_reg struct. We now use 16 bits
> from
> this for the flags field.
> - removed next_pg_contig field from xdp_umem_page struct. Using low
> 12
> bits of addr to store flags instead.
> - other minor changes based on review comments
> ---
> include/net/xdp_sock.h | 40 ++++++++++++++++++-
> include/uapi/linux/if_xdp.h | 14 ++++++-
> net/xdp/xdp_umem.c | 18 ++++++---
> net/xdp/xsk.c | 79
> +++++++++++++++++++++++++++++--------
> net/xdp/xsk_diag.c | 2 +-
> net/xdp/xsk_queue.h | 69 ++++++++++++++++++++++++++++----
> 6 files changed, 188 insertions(+), 34 deletions(-)
>
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index 69796d264f06..a755e8ab6cac 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -16,6 +16,13 @@
> struct net_device;
> struct xsk_queue;
>
> +/* Masks for xdp_umem_page flags.
> + * The low 12-bits of the addr will be 0 since this is the page
> address, so we
> + * can use them for flags.
> + */
> +#define XSK_NEXT_PG_CONTIG_SHIFT 0
> +#define XSK_NEXT_PG_CONTIG_MASK (1ULL << XSK_NEXT_PG_CONTIG_SHIFT)
> +
> struct xdp_umem_page {
> void *addr;
> dma_addr_t dma;
> @@ -48,6 +55,7 @@ struct xdp_umem {
> bool zc;
> spinlock_t xsk_list_lock;
> struct list_head xsk_list;
> + u16 flags;
> };
>
> struct xdp_sock {
> @@ -98,12 +106,21 @@ struct xdp_umem *xdp_get_umem_from_qid(struct
> net_device *dev, u16 queue_id);
>
> static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64
> addr)
> {
> - return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE -
> 1));
> + unsigned long page_addr;
> +
> + addr += addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
> + addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
> + page_addr = (unsigned long)umem->pages[addr >> PAGE_SHIFT].addr;
> +
> + return (char *)(page_addr & PAGE_MASK) + (addr & ~PAGE_MASK);
> }
>
> static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64
> addr)
> {
> - return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE -
> 1));
> + addr += addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
> + addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
> +
> + return umem->pages[addr >> PAGE_SHIFT].dma + (addr & ~PAGE_MASK);
> }
>
> /* Reuse-queue aware version of FILL queue helpers */
> @@ -144,6 +161,19 @@ static inline void xsk_umem_fq_reuse(struct
> xdp_umem *umem, u64 addr)
>
> rq->handles[rq->length++] = addr;
> }
> +
> +/* Handle the offset appropriately depending on aligned or unaligned
> mode.
> + * For unaligned mode, we store the offset in the upper 16-bits of
> the address.
> + * For aligned mode, we simply add the offset to the address.
> + */
> +static inline u64 xsk_umem_adjust_offset(struct xdp_umem *umem, u64
> address,
> + u64 offset)
> +{
> + if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG)
> + return address + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
> + else
> + return address + offset;
> +}
> #else
> static inline int xsk_generic_rcv(struct xdp_sock *xs, struct
> xdp_buff *xdp)
> {
> @@ -241,6 +271,12 @@ static inline void xsk_umem_fq_reuse(struct
> xdp_umem *umem, u64 addr)
> {
> }
>
> +static inline u64 xsk_umem_handle_offset(struct xdp_umem *umem, u64
> handle,
> + u64 offset)
> +{
> + return 0;
> +}
> +
> #endif /* CONFIG_XDP_SOCKETS */
>
> #endif /* _LINUX_XDP_SOCK_H */
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index faaa5ca2a117..4a5490651b22 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -17,6 +17,10 @@
> #define XDP_COPY (1 << 1) /* Force copy-mode */
> #define XDP_ZEROCOPY (1 << 2) /* Force zero-copy mode */
>
> +/* Flags for xsk_umem_config flags */
> +#define XDP_UMEM_UNALIGNED_CHUNK_FLAG_SHIFT 15
> +#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 <<
> XDP_UMEM_UNALIGNED_CHUNK_FLAG_SHIFT)
> +
> struct sockaddr_xdp {
> __u16 sxdp_family;
> __u16 sxdp_flags;
> @@ -49,8 +53,9 @@ struct xdp_mmap_offsets {
> #define XDP_OPTIONS 8
>
> struct xdp_umem_reg {
> - __u64 addr; /* Start of packet data area */
> - __u64 len; /* Length of packet data area */
> + __u64 addr; /* Start of packet data area */
> + __u64 len:48; /* Length of packet data area */
> + __u64 flags:16; /*Flags for umem */
> __u32 chunk_size;
> __u32 headroom;
> };
> @@ -74,6 +79,11 @@ struct xdp_options {
> #define XDP_UMEM_PGOFF_FILL_RING 0x100000000ULL
> #define XDP_UMEM_PGOFF_COMPLETION_RING 0x180000000ULL
>
> +/* Masks for unaligned chunks mode */
> +#define XSK_UNALIGNED_BUF_OFFSET_SHIFT 48
> +#define XSK_UNALIGNED_BUF_ADDR_MASK \
> + ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
> +
> /* Rx/Tx descriptor */
> struct xdp_desc {
> __u64 addr;
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index 83de74ca729a..5590ca7bbe15 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -299,6 +299,7 @@ static int xdp_umem_account_pages(struct xdp_umem
> *umem)
>
> static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg
> *mr)
> {
> + bool unaligned_chunks = mr->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG;
> u32 chunk_size = mr->chunk_size, headroom = mr->headroom;
> unsigned int chunks, chunks_per_page;
> u64 addr = mr->addr, size = mr->len;
> @@ -314,7 +315,10 @@ static int xdp_umem_reg(struct xdp_umem *umem,
> struct xdp_umem_reg *mr)
> return -EINVAL;
> }
>
> - if (!is_power_of_2(chunk_size))
> + if (mr->flags & ~XDP_UMEM_UNALIGNED_CHUNK_FLAG)
> + return -EINVAL;
> +
> + if (!unaligned_chunks && !is_power_of_2(chunk_size))
> return -EINVAL;
>
> if (!PAGE_ALIGNED(addr)) {
> @@ -331,9 +335,11 @@ static int xdp_umem_reg(struct xdp_umem *umem,
> struct xdp_umem_reg *mr)
> if (chunks == 0)
> return -EINVAL;
>
> - chunks_per_page = PAGE_SIZE / chunk_size;
> - if (chunks < chunks_per_page || chunks % chunks_per_page)
> - return -EINVAL;
> + if (!unaligned_chunks) {
> + chunks_per_page = PAGE_SIZE / chunk_size;
> + if (chunks < chunks_per_page || chunks % chunks_per_page)
> + return -EINVAL;
> + }
>
> headroom = ALIGN(headroom, 64);
>
> @@ -342,13 +348,15 @@ static int xdp_umem_reg(struct xdp_umem *umem,
> struct xdp_umem_reg *mr)
> return -EINVAL;
>
> umem->address = (unsigned long)addr;
> - umem->chunk_mask = ~((u64)chunk_size - 1);
> + umem->chunk_mask = unaligned_chunks ? XSK_UNALIGNED_BUF_ADDR_MASK
> + : ~((u64)chunk_size - 1);
> umem->size = size;
> umem->headroom = headroom;
> umem->chunk_size_nohr = chunk_size - headroom;
> umem->npgs = size / PAGE_SIZE;
> umem->pgs = NULL;
> umem->user = NULL;
> + umem->flags = mr->flags;
> INIT_LIST_HEAD(&umem->xsk_list);
> spin_lock_init(&umem->xsk_list_lock);
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 59b57d708697..9b834d54549e 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -45,7 +45,7 @@ EXPORT_SYMBOL(xsk_umem_has_addrs);
>
> u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
> {
> - return xskq_peek_addr(umem->fq, addr);
> + return xskq_peek_addr(umem->fq, addr, umem);
> }
> EXPORT_SYMBOL(xsk_umem_peek_addr);
>
> @@ -55,21 +55,42 @@ void xsk_umem_discard_addr(struct xdp_umem *umem)
> }
> EXPORT_SYMBOL(xsk_umem_discard_addr);
>
> +/* If a buffer crosses a page boundary, we need to do 2 memcpy's, one
> for
> + * each page. This is only required in copy mode.
> + */
> +static void __xsk_rcv_memcpy(struct xdp_umem *umem, u64 addr, void
> *from_buf,
> + u32 len, u32 metalen)
> +{
> + void *to_buf = xdp_umem_get_data(umem, addr);
> +
> + if (xskq_crosses_non_contig_pg(umem, addr, len + metalen)) {
> + void *next_pg_addr = umem->pages[(addr >> PAGE_SHIFT) + 1].addr;
If 'addr' (really handle) is v2 format and has an offset, this will
give incorrect results. Please use accessors which correctly extract
page and offset from a handle.
> + u64 page_start = addr & ~(PAGE_SIZE - 1);
> + u64 first_len = PAGE_SIZE - (addr - page_start);
> +
> + memcpy(to_buf, from_buf, first_len + metalen);
> + memcpy(next_pg_addr, from_buf + first_len, len - first_len);
> +
> + return;
> + }
> +
> + memcpy(to_buf, from_buf, len + metalen);
> +}
> +
> static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32
> len)
> {
> - void *to_buf, *from_buf;
> + u64 offset = xs->umem->headroom;
> + void *from_buf;
> u32 metalen;
> u64 addr;
> int err;
>
> - if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> + if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
> len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
> xs->rx_dropped++;
> return -ENOSPC;
> }
>
> - addr += xs->umem->headroom;
> -
> if (unlikely(xdp_data_meta_unsupported(xdp))) {
> from_buf = xdp->data;
> metalen = 0;
> @@ -78,9 +99,10 @@ static int __xsk_rcv(struct xdp_sock *xs, struct
> xdp_buff *xdp, u32 len)
> metalen = xdp->data - xdp->data_meta;
> }
>
> - to_buf = xdp_umem_get_data(xs->umem, addr);
> - memcpy(to_buf, from_buf, len + metalen);
> - addr += metalen;
> + __xsk_rcv_memcpy(xs->umem, addr + offset, from_buf, len, metalen);
'addr' is actually a handle here. Can't add offset, this needs 'adjust
offset'.
> +
> + offset += metalen;
> + addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
> err = xskq_produce_batch_desc(xs->rx, addr, len);
> if (!err) {
> xskq_discard_addr(xs->umem->fq);
> @@ -125,6 +147,7 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct
> xdp_buff *xdp)
> {
> u32 metalen = xdp->data - xdp->data_meta;
> u32 len = xdp->data_end - xdp->data;
> + u64 offset = xs->umem->headroom;
> void *buffer;
> u64 addr;
> int err;
> @@ -136,17 +159,17 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct
> xdp_buff *xdp)
> goto out_unlock;
> }
>
> - if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> + if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
> len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
> err = -ENOSPC;
> goto out_drop;
> }
>
> - addr += xs->umem->headroom;
> -
> - buffer = xdp_umem_get_data(xs->umem, addr);
> + buffer = xdp_umem_get_data(xs->umem, addr + offset);
> memcpy(buffer, xdp->data_meta, len + metalen);
> - addr += metalen;
> + offset += metalen;
> +
> + addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
> err = xskq_produce_batch_desc(xs->rx, addr, len);
> if (err)
> goto out_drop;
> @@ -190,7 +213,7 @@ bool xsk_umem_consume_tx(struct xdp_umem *umem,
> struct xdp_desc *desc)
>
> rcu_read_lock();
> list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
> - if (!xskq_peek_desc(xs->tx, desc))
> + if (!xskq_peek_desc(xs->tx, desc, umem))
> continue;
>
> if (xskq_produce_addr_lazy(umem->cq, desc->addr))
> @@ -243,7 +266,7 @@ static int xsk_generic_xmit(struct sock *sk,
> struct msghdr *m,
> if (xs->queue_id >= xs->dev->real_num_tx_queues)
> goto out;
>
> - while (xskq_peek_desc(xs->tx, &desc)) {
> + while (xskq_peek_desc(xs->tx, &desc, xs->umem)) {
> char *buffer;
> u64 addr;
> u32 len;
> @@ -262,6 +285,10 @@ static int xsk_generic_xmit(struct sock *sk,
> struct msghdr *m,
>
> skb_put(skb, len);
> addr = desc.addr;
> + if (xs->umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG)
> + addr = (addr & XSK_UNALIGNED_BUF_ADDR_MASK) +
> + (addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT);
This is repeating the computations done in xdp_umem_get_data(). Remove.
> buffer = xdp_umem_get_data(xs->umem, addr);
> err = skb_store_bits(skb, 0, buffer, len);
> if (unlikely(err) || xskq_reserve_addr(xs->umem->cq)) {
> @@ -272,7 +299,7 @@ static int xsk_generic_xmit(struct sock *sk,
> struct msghdr *m,
> skb->dev = xs->dev;
> skb->priority = sk->sk_priority;
> skb->mark = sk->sk_mark;
> - skb_shinfo(skb)->destructor_arg = (void *)(long)addr;
> + skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
> skb->destructor = xsk_destruct_skb;
>
> err = dev_direct_xmit(skb, xs->queue_id);
> @@ -412,6 +439,24 @@ static struct socket *xsk_lookup_xsk_from_fd(int
> fd)
> return sock;
> }
>
> +/* Check if umem pages are contiguous.
> + * If zero-copy mode, use the DMA address to do the page contiguity
> check
> + * For all other modes we use addr (kernel virtual address)
> + * Store the result in the low bits of addr.
> + */
> +static void xsk_check_page_contiguity(struct xdp_umem *umem, u32
> flags)
> +{
> + struct xdp_umem_page *pgs = umem->pages;
> + int i, is_contig;
> +
> + for (i = 0; i < umem->npgs - 1; i++) {
> + is_contig = (flags & XDP_ZEROCOPY) ?
> + (pgs[i].dma + PAGE_SIZE == pgs[i + 1].dma) :
> + (pgs[i].addr + PAGE_SIZE == pgs[i + 1].addr);
> + pgs[i].addr += is_contig << XSK_NEXT_PG_CONTIG_SHIFT;
> + }
> +}
> +
> static int xsk_bind(struct socket *sock, struct sockaddr *addr, int
> addr_len)
> {
> struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
> @@ -500,6 +545,8 @@ static int xsk_bind(struct socket *sock, struct
> sockaddr *addr, int addr_len)
> err = xdp_umem_assign_dev(xs->umem, dev, qid, flags);
> if (err)
> goto out_unlock;
> +
> + xsk_check_page_contiguity(xs->umem, flags);
> }
>
> xs->dev = dev;
> diff --git a/net/xdp/xsk_diag.c b/net/xdp/xsk_diag.c
> index d5e06c8e0cbf..9986a759fe06 100644
> --- a/net/xdp/xsk_diag.c
> +++ b/net/xdp/xsk_diag.c
> @@ -56,7 +56,7 @@ static int xsk_diag_put_umem(const struct xdp_sock
> *xs, struct sk_buff *nlskb)
> du.id = umem->id;
> du.size = umem->size;
> du.num_pages = umem->npgs;
> - du.chunk_size = (__u32)(~umem->chunk_mask + 1);
> + du.chunk_size = umem->chunk_size_nohr + umem->headroom;
> du.headroom = umem->headroom;
> du.ifindex = umem->dev ? umem->dev->ifindex : 0;
> du.queue_id = umem->queue_id;
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index 909c5168ed0f..3d045c1c94b1 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -133,6 +133,17 @@ static inline bool xskq_has_addrs(struct
> xsk_queue *q, u32 cnt)
>
> /* UMEM queue */
>
> +static inline bool xskq_crosses_non_contig_pg(struct xdp_umem *umem,
> u64 addr,
> + u64 length)
> +{
> + bool cross_pg = (addr & (PAGE_SIZE - 1)) + length > PAGE_SIZE;
> + bool next_pg_contig =
> + (unsigned long)umem->pages[(addr >> PAGE_SHIFT)].addr &
> + XSK_NEXT_PG_CONTIG_MASK;
> +
> + return cross_pg && !next_pg_contig;
> +}
> +
> static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
> {
> if (addr >= q->size) {
> @@ -143,23 +154,50 @@ static inline bool xskq_is_valid_addr(struct
> xsk_queue *q, u64 addr)
> return true;
> }
>
> -static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr)
> +static inline bool xskq_is_valid_addr_unaligned(struct xsk_queue *q,
> u64 addr,
> + u64 length,
> + struct xdp_umem *umem)
> +{
> + addr += addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
> + addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
> + if (addr >= q->size ||
> + xskq_crosses_non_contig_pg(umem, addr, length)) {
> + q->invalid_descs++;
> + return false;
> + }
> +
> + return true;
> +}
> +
> +static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr,
> + struct xdp_umem *umem)
> {
> while (q->cons_tail != q->cons_head) {
> struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
> unsigned int idx = q->cons_tail & q->ring_mask;
>
> *addr = READ_ONCE(ring->desc[idx]) & q->chunk_mask;
> +
> + if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG) {
> + if (xskq_is_valid_addr_unaligned(q, *addr,
> + umem->chunk_size_nohr,
> + umem))
> + return addr;
> + goto out;
> + }
> +
> if (xskq_is_valid_addr(q, *addr))
> return addr;
>
> +out:
> q->cons_tail++;
> }
>
> return NULL;
> }
>
> -static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr)
> +static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr,
> + struct xdp_umem *umem)
> {
> if (q->cons_tail == q->cons_head) {
> smp_mb(); /* D, matches A */
> @@ -170,7 +208,7 @@ static inline u64 *xskq_peek_addr(struct xsk_queue
> *q, u64 *addr)
> smp_rmb();
> }
>
> - return xskq_validate_addr(q, addr);
> + return xskq_validate_addr(q, addr, umem);
> }
>
> static inline void xskq_discard_addr(struct xsk_queue *q)
> @@ -229,8 +267,21 @@ static inline int xskq_reserve_addr(struct
> xsk_queue *q)
>
> /* Rx/Tx queue */
>
> -static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct
> xdp_desc *d)
> +static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct
> xdp_desc *d,
> + struct xdp_umem *umem)
> {
> + if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG) {
> + if (!xskq_is_valid_addr_unaligned(q, d->addr, d->len, umem))
> + return false;
> +
> + if (d->len > umem->chunk_size_nohr || d->options) {
> + q->invalid_descs++;
> + return false;
> + }
> +
> + return true;
> + }
> +
> if (!xskq_is_valid_addr(q, d->addr))
> return false;
>
> @@ -244,14 +295,15 @@ static inline bool xskq_is_valid_desc(struct
> xsk_queue *q, struct xdp_desc *d)
> }
>
> static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue
> *q,
> - struct xdp_desc *desc)
> + struct xdp_desc *desc,
> + struct xdp_umem *umem)
> {
> while (q->cons_tail != q->cons_head) {
> struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
> unsigned int idx = q->cons_tail & q->ring_mask;
>
> *desc = READ_ONCE(ring->desc[idx]);
> - if (xskq_is_valid_desc(q, desc))
> + if (xskq_is_valid_desc(q, desc, umem))
> return desc;
>
> q->cons_tail++;
> @@ -261,7 +313,8 @@ static inline struct xdp_desc
> *xskq_validate_desc(struct xsk_queue *q,
> }
>
> static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
> - struct xdp_desc *desc)
> + struct xdp_desc *desc,
> + struct xdp_umem *umem)
> {
> if (q->cons_tail == q->cons_head) {
> smp_mb(); /* D, matches A */
> @@ -272,7 +325,7 @@ static inline struct xdp_desc
> *xskq_peek_desc(struct xsk_queue *q,
> smp_rmb(); /* C, matches B */
> }
>
> - return xskq_validate_desc(q, desc);
> + return xskq_validate_desc(q, desc, umem);
> }
>
> static inline void xskq_discard_desc(struct xsk_queue *q)
> --
> 2.17.1
^ permalink raw reply
* Re: [PATCH bpf-next v4 07/11] mlx5e: modify driver for handling offsets
From: Jonathan Lemon @ 2019-07-31 18:10 UTC (permalink / raw)
To: Kevin Laatz
Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
intel-wired-lan
In-Reply-To: <20190730085400.10376-8-kevin.laatz@intel.com>
On 30 Jul 2019, at 1:53, Kevin Laatz wrote:
> With the addition of the unaligned chunks option, we need to make sure
> we
> handle the offsets accordingly based on the mode we are currently
> running
> in. This patch modifies the driver to appropriately mask the address
> for
> each case.
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
>
> ---
> v3:
> - Use new helper function to handle offset
>
> v4:
> - fixed headroom addition to handle. Using
> xsk_umem_adjust_headroom()
> now.
> ---
> drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 8 ++++++--
> drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c | 3 ++-
> 2 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> index b0b982cf69bb..d5245893d2c8 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> @@ -122,6 +122,7 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct
> mlx5e_dma_info *di,
> void *va, u16 *rx_headroom, u32 *len, bool xsk)
> {
> struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
> + struct xdp_umem *umem = rq->umem;
> struct xdp_buff xdp;
> u32 act;
> int err;
> @@ -138,8 +139,11 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct
> mlx5e_dma_info *di,
> xdp.rxq = &rq->xdp_rxq;
>
> act = bpf_prog_run_xdp(prog, &xdp);
> - if (xsk)
> - xdp.handle += xdp.data - xdp.data_hard_start;
> + if (xsk) {
> + u64 off = xdp.data - xdp.data_hard_start;
> +
> + xdp.handle = xsk_umem_handle_offset(umem, xdp.handle, off);
Shouldn't this be xdp_umem_adjust_offset()?
> + }
> switch (act) {
> case XDP_PASS:
> *rx_headroom = xdp.data - xdp.data_hard_start;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> index 6a55573ec8f2..7c49a66d28c9 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> @@ -24,7 +24,8 @@ int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq,
> if (!xsk_umem_peek_addr_rq(umem, &handle))
> return -ENOMEM;
>
> - dma_info->xsk.handle = handle + rq->buff.umem_headroom;
> + dma_info->xsk.handle = xsk_umem_adjust_offset(umem, handle,
> + rq->buff.umem_headroom);
> dma_info->xsk.data = xdp_umem_get_data(umem, dma_info->xsk.handle);
>
> /* No need to add headroom to the DMA address. In striding RQ case,
> we
> --
> 2.17.1
^ permalink raw reply
* Re: [PATCH net-next v2 1/4] dt-bindings: net: Add aspeed,ast2600-mdio binding
From: Rob Herring @ 2019-07-31 18:00 UTC (permalink / raw)
To: Andrew Jeffery
Cc: netdev, David Miller, Mark Rutland, Joel Stanley, Andrew Lunn,
Florian Fainelli, Heiner Kallweit, devicetree,
moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE,
linux-aspeed, linux-kernel@vger.kernel.org
In-Reply-To: <20190731053959.16293-2-andrew@aj.id.au>
On Tue, Jul 30, 2019 at 11:39 PM Andrew Jeffery <andrew@aj.id.au> wrote:
>
> The AST2600 splits out the MDIO bus controller from the MAC into its own
> IP block and rearranges the register layout. Add a new binding to
> describe the new hardware.
>
> Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
>
> ---
> v2:
> * aspeed: Utilise mdio.yaml
> * aspeed: Drop status from example
> ---
> .../bindings/net/aspeed,ast2600-mdio.yaml | 45 +++++++++++++++++++
> 1 file changed, 45 insertions(+)
> create mode 100644 Documentation/devicetree/bindings/net/aspeed,ast2600-mdio.yaml
Reviewed-by: Rob Herring <robh@kernel.org>
^ permalink raw reply
* Re: [PATCH 2/2] cnic: Use refcount_t for refcount
From: Michael Chan @ 2019-07-31 17:58 UTC (permalink / raw)
To: Chuhong Yuan; +Cc: David S . Miller, Netdev, open list
In-Reply-To: <20190731122224.1003-1-hslester96@gmail.com>
On Wed, Jul 31, 2019 at 5:22 AM Chuhong Yuan <hslester96@gmail.com> wrote:
> static void cnic_ctx_wr(struct cnic_dev *dev, u32 cid_addr, u32 off, u32 val)
> @@ -494,7 +494,7 @@ int cnic_register_driver(int ulp_type, struct cnic_ulp_ops *ulp_ops)
> }
> read_unlock(&cnic_dev_lock);
>
> - atomic_set(&ulp_ops->ref_count, 0);
> + refcount_set(&ulp_ops->ref_count, 0);
> rcu_assign_pointer(cnic_ulp_tbl[ulp_type], ulp_ops);
> mutex_unlock(&cnic_lock);
>
Willem's comment applies here too. The driver needs to be modified to
count from 1 to use refcount_t.
Thanks.
^ permalink raw reply
* Re: [PATCH net v3] ipvs: Improve robustness to the ipvs sysctl
From: Julian Anastasov @ 2019-07-31 17:53 UTC (permalink / raw)
To: hujunwei
Cc: wensong, horms, pablo, kadlec, Florian Westphal, davem,
Florian Westphal, netdev@vger.kernel.org, lvs-devel,
netfilter-devel, coreteam, Mingfangsen, wangxiaogang3, xuhanbing
In-Reply-To: <5fd55d18-f4e2-a6b4-5c54-db76c05be5df@huawei.com>
Hello,
On Thu, 1 Aug 2019, hujunwei wrote:
> From: Junwei Hu <hujunwei4@huawei.com>
>
> The ipvs module parse the user buffer and save it to sysctl,
> then check if the value is valid. invalid value occurs
> over a period of time.
> Here, I add a variable, struct ctl_table tmp, used to read
> the value from the user buffer, and save only when it is valid.
> I delete proc_do_sync_mode and use extra1/2 in table for the
> proc_dointvec_minmax call.
>
> Fixes: f73181c8288f ("ipvs: add support for sync threads")
> Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
> Acked-by: Julian Anastasov <ja@ssi.bg>
Yep, Acked-by: Julian Anastasov <ja@ssi.bg>
> ---
> V1->V2:
> - delete proc_do_sync_mode and use proc_dointvec_minmax call.
> V2->V3:
> - update git version
> ---
> net/netfilter/ipvs/ip_vs_ctl.c | 69 +++++++++++++++++-----------------
> 1 file changed, 35 insertions(+), 34 deletions(-)
>
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 060565e7d227..72189559a1cd 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -1737,12 +1737,18 @@ proc_do_defense_mode(struct ctl_table *table, int write,
> int val = *valp;
> int rc;
>
> - rc = proc_dointvec(table, write, buffer, lenp, ppos);
> + struct ctl_table tmp = {
> + .data = &val,
> + .maxlen = sizeof(int),
> + .mode = table->mode,
> + };
> +
> + rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
> if (write && (*valp != val)) {
> - if ((*valp < 0) || (*valp > 3)) {
> - /* Restore the correct value */
> - *valp = val;
> + if (val < 0 || val > 3) {
> + rc = -EINVAL;
> } else {
> + *valp = val;
> update_defense_level(ipvs);
> }
> }
> @@ -1756,33 +1762,20 @@ proc_do_sync_threshold(struct ctl_table *table, int write,
> int *valp = table->data;
> int val[2];
> int rc;
> + struct ctl_table tmp = {
> + .data = &val,
> + .maxlen = table->maxlen,
> + .mode = table->mode,
> + };
>
> - /* backup the value first */
> memcpy(val, valp, sizeof(val));
> -
> - rc = proc_dointvec(table, write, buffer, lenp, ppos);
> - if (write && (valp[0] < 0 || valp[1] < 0 ||
> - (valp[0] >= valp[1] && valp[1]))) {
> - /* Restore the correct value */
> - memcpy(valp, val, sizeof(val));
> - }
> - return rc;
> -}
> -
> -static int
> -proc_do_sync_mode(struct ctl_table *table, int write,
> - void __user *buffer, size_t *lenp, loff_t *ppos)
> -{
> - int *valp = table->data;
> - int val = *valp;
> - int rc;
> -
> - rc = proc_dointvec(table, write, buffer, lenp, ppos);
> - if (write && (*valp != val)) {
> - if ((*valp < 0) || (*valp > 1)) {
> - /* Restore the correct value */
> - *valp = val;
> - }
> + rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
> + if (write) {
> + if (val[0] < 0 || val[1] < 0 ||
> + (val[0] >= val[1] && val[1]))
> + rc = -EINVAL;
> + else
> + memcpy(valp, val, sizeof(val));
> }
> return rc;
> }
> @@ -1795,12 +1788,18 @@ proc_do_sync_ports(struct ctl_table *table, int write,
> int val = *valp;
> int rc;
>
> - rc = proc_dointvec(table, write, buffer, lenp, ppos);
> + struct ctl_table tmp = {
> + .data = &val,
> + .maxlen = sizeof(int),
> + .mode = table->mode,
> + };
> +
> + rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
> if (write && (*valp != val)) {
> - if (*valp < 1 || !is_power_of_2(*valp)) {
> - /* Restore the correct value */
> + if (val < 1 || !is_power_of_2(val))
> + rc = -EINVAL;
> + else
> *valp = val;
> - }
> }
> return rc;
> }
> @@ -1860,7 +1859,9 @@ static struct ctl_table vs_vars[] = {
> .procname = "sync_version",
> .maxlen = sizeof(int),
> .mode = 0644,
> - .proc_handler = proc_do_sync_mode,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = SYSCTL_ONE,
> },
> {
> .procname = "sync_ports",
> --
> 2.21.GIT
Regards
--
Julian Anastasov <ja@ssi.bg>
^ permalink raw reply
* Re: [PATCH 1/2] bnxt_en: Use refcount_t for refcount
From: Michael Chan @ 2019-07-31 17:48 UTC (permalink / raw)
To: Willem de Bruijn
Cc: Chuhong Yuan, David S . Miller, Network Development, linux-kernel
In-Reply-To: <CA+FuTScqD4bMpm6n13ETFVEvSKnk_rRUzspzs9HB6B5Un101Dw@mail.gmail.com>
On Wed, Jul 31, 2019 at 9:06 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Wed, Jul 31, 2019 at 8:22 AM Chuhong Yuan <hslester96@gmail.com> wrote:
> >
> > refcount_t is better for reference counters since its
> > implementation can prevent overflows.
> > So convert atomic_t ref counters to refcount_t.
> >
> > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > ---
> > drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c | 8 ++++----
> > drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.h | 2 +-
> > 2 files changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
> > index fc77caf0a076..eb7ed34639e2 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
> > @@ -49,7 +49,7 @@ static int bnxt_register_dev(struct bnxt_en_dev *edev, int ulp_id,
> > return -ENOMEM;
> > }
> >
> > - atomic_set(&ulp->ref_count, 0);
> > + refcount_set(&ulp->ref_count, 0);
>
> One feature of refcount_t is that it warns on refcount_inc from 0 to
> detect possible use-after_free. It appears that that can trigger here?
>
I think that's right. We need to change the driver to start counting
from 1 instead of 0 if we convert to refcount.
^ permalink raw reply
* Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Andrii Nakryiko @ 2019-07-31 17:18 UTC (permalink / raw)
To: Song Liu
Cc: Andrii Nakryiko, bpf, netdev@vger.kernel.org, Alexei Starovoitov,
daniel@iogearbox.net, Yonghong Song, Kernel Team
In-Reply-To: <4D2E1082-5013-4A50-B75D-AB88FDCAAC52@fb.com>
On Wed, Jul 31, 2019 at 1:30 AM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jul 30, 2019, at 11:52 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > On Tue, Jul 30, 2019 at 10:19 PM Song Liu <songliubraving@fb.com> wrote:
> >>
> >>
> >>
> >>> On Jul 30, 2019, at 6:00 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >>>
> >>> On Tue, Jul 30, 2019 at 5:39 PM Song Liu <songliubraving@fb.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
> >>>>>
> >>>>> This patch implements the core logic for BPF CO-RE offsets relocations.
> >>>>> Every instruction that needs to be relocated has corresponding
> >>>>> bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
> >>>>> to match recorded "local" relocation spec against potentially many
> >>>>> compatible "target" types, creating corresponding spec. Details of the
> >>>>> algorithm are noted in corresponding comments in the code.
> >>>>>
> >>>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>
> [...]
>
> >>>>
> >>>
> >>> I just picked the most succinct and non-repetitive form. It's
> >>> immediately apparent which type it's implicitly converted to, so I
> >>> felt there is no need to repeat it. Also, just (void *) is much
> >>> shorter. :)
> >>
> >> _All_ other code in btf.c converts the pointer to the target type.
> >
> > Most in libbpf.c doesn't, though. Also, I try to preserve pointer
> > constness for uses that don't modify BTF types (pretty much all of
> > them in libbpf), so it becomes really verbose, despite extremely short
> > variable names:
> >
> > const struct btf_member *m = (const struct btf_member *)(t + 1);
>
> I don't think being verbose is a big problem here. Overusing
Problem is too big and strong word to describe this :). It hurts
readability and will often quite artificially force either wrapping
the line or unnecessarily splitting declaration and assignment. Void *
on the other hand is short and usually is in the same line as target
type declaration, if not, you'll have to find local variable
declaration to double-check type, if you are unsure.
Using (void *) + implicit cast to target pointer type is not
unprecedented in libbpf:
$ rg ' = \((const )?struct \w+ \*\)' tools/lib/bpf/ | wc -l
52
$ rg ' = \((const )?void \*\)' tools/lib/bpf/ | wc -l
35
52 vs 35 is majority overall, but not by a landslide.
> (void *) feels like a bigger problem.
Why do you feel it's a problem? void * conveys that we have a piece of
memory that we will need to reinterpret as some concrete pointer type.
That's what we are doing, skipping btf_type and then interpreting
memory after common btf_type prefix is some other type, depending on
actual BTF kind. I don't think void * is misleading in any way.
In any case, if you still feel strongly about this after all my
arguments, please let me know and I will convert them in this patch
set. It's not like I'm opposed to use duplicate type names (though it
does feel sort of Java-like before it got limited type inference),
it's just in practice it leads to unnecessarily verbose code which
doesn't really improve anything.
>
> >
> > Add one or two levels of nestedness and you are wrapping this line.
> >
> >> In some cases, it is not apparent which type it is converted to,
> >> for example:
> >>
> >> + m = (void *)(targ_type + 1);
> >>
> >> I would suggest we do implicit conversion whenever possible.
> >
> > Implicit conversion (`m = targ_type + 1;`) is a compilation error,
> > that won't work.
>
> I misused "implicit" here. I actually meant to say
>
> m = ((const struct btf_member *)(t + 1);
Ah, so you meant explicit, yep. It's either `void *` or `const struct
something *` then.
>
>
>
^ permalink raw reply
* Re: [PATCH bpf-next 1/9] selftests/bpf: prevent headers to be compiled as C code
From: Andrii Nakryiko @ 2019-07-31 17:04 UTC (permalink / raw)
To: Ilya Leoshkevich
Cc: Stanislav Fomichev, Andrii Nakryiko, bpf, Networking,
Alexei Starovoitov, Daniel Borkmann, Kernel Team
In-Reply-To: <3D822EE0-C033-4192-B505-A30E5EC23BC3@linux.ibm.com>
On Wed, Jul 31, 2019 at 6:21 AM Ilya Leoshkevich <iii@linux.ibm.com> wrote:
>
> > Am 27.07.2019 um 20:53 schrieb Andrii Nakryiko <andrii.nakryiko@gmail.com>:
> >
> > On Fri, Jul 26, 2019 at 3:01 PM Stanislav Fomichev <sdf@fomichev.me> wrote:
> >>
> >> On 07/26, Andrii Nakryiko wrote:
> >>> On Fri, Jul 26, 2019 at 2:21 PM Stanislav Fomichev <sdf@fomichev.me> wrote:
> >>>>
> >>>> On 07/26, Andrii Nakryiko wrote:
> >>>>> Apprently listing header as a normal dependency for a binary output
> >>>>> makes it go through compilation as if it was C code. This currently
> >>>>> works without a problem, but in subsequent commits causes problems for
> >>>>> differently generated test.h for test_progs. Marking those headers as
> >>>>> order-only dependency solves the issue.
> >>>> Are you sure it will not result in a situation where
> >>>> test_progs/test_maps is not regenerated if tests.h is updated.
> >>>>
> >>>> If I read the following doc correctly, order deps make sense for
> >>>> directories only:
> >>>> https://www.gnu.org/software/make/manual/html_node/Prerequisite-Types.html
> >>>>
> >>>> Can you maybe double check it with:
> >>>> * make
> >>>> * add new prog_tests/test_something.c
> >>>> * make
> >>>> to see if the binary is regenerated with test_something.c?
> >>>
> >>> Yeah, tested that, it triggers test_progs rebuild.
> >>>
> >>> Ordering is still preserved, because test.h is dependency of
> >>> test_progs.c, which is dependency of test_progs binary, so that's why
> >>> it works.
> >>>
> >>> As to why .h file is compiled as C file, I have no idea and ideally
> >>> that should be fixed somehow.
> >> I guess that's because it's a prerequisite and we have a target that
> >> puts all prerequisites when calling CC:
> >>
> >> test_progs: a.c b.c tests.h
> >> gcc a.c b.c tests.h -o test_progs
> >>
> >> So gcc compiles each input file.
> >
> > If that's really a definition of the rule, then it seems not exactly
> > correct. What if some of the prerequisites are some object files,
> > directories, etc. I'd say the rule should only include .c files. I'll
> > add it to my TODO list to go and check how all this is defined
> > somewhere, but for now I'm leaving everything as is in this patch.
> >
>
> I believe it’s an implicit built-in rule, which is defined by make
> itself here:
>
> https://git.savannah.gnu.org/cgit/make.git/tree/default.c?h=4.2.1#n131
>
> The observed behavior arises because these rules use $^ all over the
> place. My 2c is that it would be better to make the rule explicit,
> because it would cost just an extra line, but it would be immediately
> obvious why the code is correct w.r.t. rebuilds.
Thanks for pointing this out, Ilya! I agree, I'd rather have a simple
explicit rule in Makefile that having to guess where this is coming
from and why it doesn't work as you'd expect it to. If no one else
adds that first, I'll eventually get to this.
>
> Best regards,
> Ilya
^ permalink raw reply
* Re: [PATCH 1/2] arm64: Add support for function error injection
From: Leo Yan @ 2019-07-31 16:58 UTC (permalink / raw)
To: Will Deacon
Cc: Russell King, Oleg Nesterov, Catalin Marinas, Alexei Starovoitov,
Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
Arnd Bergmann, linux-arm-kernel, linux-kernel, netdev, bpf,
Masami Hiramatsu, Justin He
In-Reply-To: <20190731160836.qmzlk3ndbahwhfmu@willie-the-truck>
Hi Will,
Thanks for reviewing.
On Wed, Jul 31, 2019 at 05:08:37PM +0100, Will Deacon wrote:
> On Tue, Jul 16, 2019 at 07:13:00PM +0800, Leo Yan wrote:
> > This patch implement regs_set_return_value() and
> > override_function_with_return() to support function error injection
> > for arm64.
> >
> > In the exception flow, arm64's general register x30 contains the value
> > for the link register; so we can just update pt_regs::pc with it rather
> > than redirecting execution to a dummy function that returns.
> >
> > This patch is heavily inspired by the commit 7cd01b08d35f ("powerpc:
> > Add support for function error injection").
> >
> > Signed-off-by: Leo Yan <leo.yan@linaro.org>
> > ---
> > arch/arm64/Kconfig | 1 +
> > arch/arm64/include/asm/error-injection.h | 13 +++++++++++++
> > arch/arm64/include/asm/ptrace.h | 5 +++++
> > arch/arm64/lib/Makefile | 2 ++
> > arch/arm64/lib/error-inject.c | 19 +++++++++++++++++++
> > 5 files changed, 40 insertions(+)
> > create mode 100644 arch/arm64/include/asm/error-injection.h
> > create mode 100644 arch/arm64/lib/error-inject.c
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 697ea0510729..a6d9e622977d 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -142,6 +142,7 @@ config ARM64
> > select HAVE_EFFICIENT_UNALIGNED_ACCESS
> > select HAVE_FTRACE_MCOUNT_RECORD
> > select HAVE_FUNCTION_TRACER
> > + select HAVE_FUNCTION_ERROR_INJECTION
> > select HAVE_FUNCTION_GRAPH_TRACER
> > select HAVE_GCC_PLUGINS
> > select HAVE_HW_BREAKPOINT if PERF_EVENTS
> > diff --git a/arch/arm64/include/asm/error-injection.h b/arch/arm64/include/asm/error-injection.h
> > new file mode 100644
> > index 000000000000..da057e8ed224
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/error-injection.h
> > @@ -0,0 +1,13 @@
> > +/* SPDX-License-Identifier: GPL-2.0+ */
> > +
> > +#ifndef __ASM_ERROR_INJECTION_H_
> > +#define __ASM_ERROR_INJECTION_H_
> > +
> > +#include <linux/compiler.h>
> > +#include <linux/linkage.h>
> > +#include <asm/ptrace.h>
> > +#include <asm-generic/error-injection.h>
> > +
> > +void override_function_with_return(struct pt_regs *regs);
> > +
> > +#endif /* __ASM_ERROR_INJECTION_H_ */
>
> Why isn't this prototype in the asm-generic header? Seems weird to have to
> duplicate it for each architecture.
Yeah. When I spin for new version patches, will try to refactor in
the asm-generic header.
> > diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
> > index dad858b6adc6..3aafbbe218a2 100644
> > --- a/arch/arm64/include/asm/ptrace.h
> > +++ b/arch/arm64/include/asm/ptrace.h
> > @@ -294,6 +294,11 @@ static inline unsigned long regs_return_value(struct pt_regs *regs)
> > return regs->regs[0];
> > }
> >
> > +static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
> > +{
> > + regs->regs[0] = rc;
> > +}
> > +
> > /**
> > * regs_get_kernel_argument() - get Nth function argument in kernel
> > * @regs: pt_regs of that context
> > diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
> > index 33c2a4abda04..f182ccb0438e 100644
> > --- a/arch/arm64/lib/Makefile
> > +++ b/arch/arm64/lib/Makefile
> > @@ -33,3 +33,5 @@ UBSAN_SANITIZE_atomic_ll_sc.o := n
> > lib-$(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) += uaccess_flushcache.o
> >
> > obj-$(CONFIG_CRC32) += crc32.o
> > +
> > +obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
> > diff --git a/arch/arm64/lib/error-inject.c b/arch/arm64/lib/error-inject.c
> > new file mode 100644
> > index 000000000000..35661c2de4b0
> > --- /dev/null
> > +++ b/arch/arm64/lib/error-inject.c
> > @@ -0,0 +1,19 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include <linux/error-injection.h>
> > +#include <linux/kprobes.h>
> > +
> > +void override_function_with_return(struct pt_regs *regs)
> > +{
> > + /*
> > + * 'regs' represents the state on entry of a predefined function in
> > + * the kernel/module and which is captured on a kprobe.
> > + *
> > + * 'regs->regs[30]' contains the the link register for the probed
>
> extra "the"
Will fix.
> > + * function and assign it to 'regs->pc', so when kprobe returns
> > + * back from exception it will override the end of probed function
> > + * and drirectly return to the predefined function's caller.
>
> directly
Will fix.
> > + */
> > + regs->pc = regs->regs[30];
>
> I suppose we could be all fancy and do:
>
> instruction_pointer_set(regs, procedure_link_pointer(regs));
>
> How about that?
Ah, good point. Will change to use the common APIs.
Thanks,
Leo Yan
^ permalink raw reply
* Re: next/master build: 221 builds: 11 failed, 210 passed, 13 errors, 1174 warnings (next-20190731)
From: David Miller @ 2019-07-31 16:41 UTC (permalink / raw)
To: natechancellor
Cc: gregkh, devel, andrew, f.fainelli, kernel-build-reports, netdev,
willy, broonie, linux-next, linux-arm-kernel, hkallweit1
In-Reply-To: <20190731163509.GA90028@archlinux-threadripper>
From: Nathan Chancellor <natechancellor@gmail.com>
Date: Wed, 31 Jul 2019 09:35:09 -0700
> In file included from ../drivers/net/phy/mdio-octeon.c:14:
> ../drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of function 'writeq'; did you mean 'writeb'? [-Werror=implicit-function-declaration]
> 111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
One of the hi-lo, lo-hi writeq/readq headers has to be included.
^ permalink raw reply
* Re: [PATCH] net: sctp: Rename fallthrough label to unhandled
From: Joe Perches @ 2019-07-31 16:35 UTC (permalink / raw)
To: Neil Horman
Cc: Vlad Yasevich, Marcelo Ricardo Leitner, David S. Miller,
linux-sctp, netdev, linux-kernel
In-Reply-To: <20190731121646.GD9823@hmswarspite.think-freely.org>
On Wed, 2019-07-31 at 08:16 -0400, Neil Horman wrote:
> On Wed, Jul 31, 2019 at 04:32:43AM -0700, Joe Perches wrote:
> > On Wed, 2019-07-31 at 07:19 -0400, Neil Horman wrote:
> > > On Tue, Jul 30, 2019 at 10:04:37PM -0700, Joe Perches wrote:
> > > > fallthrough may become a pseudo reserved keyword so this only use of
> > > > fallthrough is better renamed to allow it.
> > > >
> > > > Signed-off-by: Joe Perches <joe@perches.com>
> > > Are you referring to the __attribute__((fallthrough)) statement that gcc
> > > supports? If so the compiler should by all rights be able to differentiate
> > > between a null statement attribute and a explicit goto and label without the
> > > need for renaming here. Or are you referring to something else?
> >
> > Hi.
> >
> > I sent after this a patch that adds
> >
> > # define fallthrough __attribute__((__fallthrough__))
> >
> > https://lore.kernel.org/patchwork/patch/1108577/
> >
> > So this rename is a prerequisite to adding this #define.
> >
> why not just define __fallthrough instead, like we do for all the other
> attributes we alias (i.e. __read_mostly, __protected_by, __unused, __exception,
> etc)
Because it's not as intelligible when used as a statement.
^ permalink raw reply
* Re: next/master build: 221 builds: 11 failed, 210 passed, 13 errors, 1174 warnings (next-20190731)
From: Nathan Chancellor @ 2019-07-31 16:35 UTC (permalink / raw)
To: Greg KH
Cc: David Miller, devel, andrew, f.fainelli, kernel-build-reports,
netdev, willy, broonie, linux-next, linux-arm-kernel, hkallweit1
In-Reply-To: <20190731160043.GA15520@kroah.com>
On Wed, Jul 31, 2019 at 06:00:43PM +0200, Greg KH wrote:
> On Wed, Jul 31, 2019 at 08:48:24AM -0700, David Miller wrote:
> > From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Date: Wed, 31 Jul 2019 13:35:22 +0200
> >
> > > On Wed, Jul 31, 2019 at 12:24:41PM +0100, Mark Brown wrote:
> > >> On Wed, Jul 31, 2019 at 04:07:41AM -0700, kernelci.org bot wrote:
> > >>
> > >> Today's -next fails to build an ARM allmodconfig due to:
> > >>
> > >> > allmodconfig (arm, gcc-8) ― FAIL, 1 error, 40 warnings, 0 section mismatches
> > >> >
> > >> > Errors:
> > >> > drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of function 'writeq'; did you mean 'writel'? [-Werror=implicit-function-declaration]
> > >>
> > >> as a result of the changes that introduced:
> > >>
> > >> WARNING: unmet direct dependencies detected for MDIO_OCTEON
> > >> Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=m] && MDIO_BUS [=m] && 64BIT && HAS_IOMEM [=y] && OF_MDIO [=m]
> > >> Selected by [m]:
> > >> - OCTEON_ETHERNET [=m] && STAGING [=y] && (CAVIUM_OCTEON_SOC && NETDEVICES [=y] || COMPILE_TEST [=y])
> > >>
> > >> which is triggered by the staging OCTEON_ETHERNET driver which misses a
> > >> 64BIT dependency but added COMPILE_TEST in 171a9bae68c72f2
> > >> (staging/octeon: Allow test build on !MIPS).
> > >
> > > A patch was posted for this, but it needs to go through the netdev tree
> > > as that's where the offending patches are coming from.
> >
> > I didn't catch that, was netdev CC:'d?
>
> Nope, just you :(
>
> I'll resend it now and cc: netdev.
>
> thanks,
>
> greg k-h
If it is this patch:
https://lore.kernel.org/netdev/20190731160219.GA2114@kroah.com/
It doesn't resolve that issue. I applied it and tested on next-20190731.
$ make -j$(nproc) -s ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- O=out distclean allyesconfig drivers/net/phy/
WARNING: unmet direct dependencies detected for MDIO_OCTEON
Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=y] && MDIO_BUS [=y] && 64BIT && HAS_IOMEM [=y] && OF_MDIO [=y]
Selected by [y]:
- OCTEON_ETHERNET [=y] && STAGING [=y] && (CAVIUM_OCTEON_SOC || COMPILE_TEST [=y]) && NETDEVICES [=y]
../drivers/net/phy/mdio-octeon.c: In function 'octeon_mdiobus_probe':
../drivers/net/phy/mdio-octeon.c:48:3: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
48 | (u64)devm_ioremap(&pdev->dev, mdio_phys, regsize);
| ^
In file included from ../drivers/net/phy/mdio-octeon.c:14:
../drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of function 'writeq'; did you mean 'writeb'? [-Werror=implicit-function-declaration]
111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
| ^~~~~~
../drivers/net/phy/mdio-octeon.c:56:2: note: in expansion of macro 'oct_mdio_writeq'
56 | oct_mdio_writeq(smi_en.u64, bus->register_base + SMI_EN);
| ^~~~~~~~~~~~~~~
../drivers/net/phy/mdio-cavium.h:111:48: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
| ^
../drivers/net/phy/mdio-octeon.c:56:2: note: in expansion of macro 'oct_mdio_writeq'
56 | oct_mdio_writeq(smi_en.u64, bus->register_base + SMI_EN);
| ^~~~~~~~~~~~~~~
../drivers/net/phy/mdio-cavium.h:111:48: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
| ^
../drivers/net/phy/mdio-octeon.c:77:2: note: in expansion of macro 'oct_mdio_writeq'
77 | oct_mdio_writeq(smi_en.u64, bus->register_base + SMI_EN);
| ^~~~~~~~~~~~~~~
../drivers/net/phy/mdio-octeon.c: In function 'octeon_mdiobus_remove':
../drivers/net/phy/mdio-cavium.h:111:48: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
| ^
../drivers/net/phy/mdio-octeon.c:91:2: note: in expansion of macro 'oct_mdio_writeq'
91 | oct_mdio_writeq(smi_en.u64, bus->register_base + SMI_EN);
| ^~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
make[3]: *** [../scripts/Makefile.build:274: drivers/net/phy/mdio-octeon.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [../Makefile:1780: drivers/net/phy/] Error 2
make[1]: *** [/home/nathan/cbl/linux-next/Makefile:330: __build_one_by_one] Error 2
make: *** [Makefile:179: sub-make] Error 2
This is the diff that I came up with to solve the errors plus the casting
warnings but it doesn't feel proper to me. If you all feel otherwise, I
can draft up a formal commit message.
diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 20f14c5fbb7e..ed2edf4b5b0e 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -159,7 +159,7 @@ config MDIO_MSCC_MIIM
config MDIO_OCTEON
tristate "Octeon and some ThunderX SOCs MDIO buses"
- depends on 64BIT
+ depends on 64BIT || COMPILE_TEST
depends on HAS_IOMEM && OF_MDIO
select MDIO_CAVIUM
help
diff --git a/drivers/net/phy/mdio-cavium.h b/drivers/net/phy/mdio-cavium.h
index ed5f9bb5448d..4b71b733edb4 100644
--- a/drivers/net/phy/mdio-cavium.h
+++ b/drivers/net/phy/mdio-cavium.h
@@ -108,8 +108,10 @@ static inline u64 oct_mdio_readq(u64 addr)
return cvmx_read_csr(addr);
}
#else
-#define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
-#define oct_mdio_readq(addr) readq((void *)addr)
+#include <linux/io-64-nonatomic-lo-hi.h>
+
+#define oct_mdio_writeq(val, addr) writeq(val, (void *)(uintptr_t)addr)
+#define oct_mdio_readq(addr) readq((void *)(uintptr_t)addr)
#endif
int cavium_mdiobus_read(struct mii_bus *bus, int phy_id, int regnum);
diff --git a/drivers/net/phy/mdio-octeon.c b/drivers/net/phy/mdio-octeon.c
index 8327382aa568..ab0d8ab588e4 100644
--- a/drivers/net/phy/mdio-octeon.c
+++ b/drivers/net/phy/mdio-octeon.c
@@ -45,7 +45,7 @@ static int octeon_mdiobus_probe(struct platform_device *pdev)
}
bus->register_base =
- (u64)devm_ioremap(&pdev->dev, mdio_phys, regsize);
+ (u64)(uintptr_t)devm_ioremap(&pdev->dev, mdio_phys, regsize);
if (!bus->register_base) {
dev_err(&pdev->dev, "dev_ioremap failed\n");
return -ENOMEM;
^ permalink raw reply related
* Re: [PATCH 1/2] arm64: Add support for function error injection
From: Will Deacon @ 2019-07-31 16:08 UTC (permalink / raw)
To: Leo Yan
Cc: Russell King, Oleg Nesterov, Catalin Marinas, Alexei Starovoitov,
Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
Arnd Bergmann, linux-arm-kernel, linux-kernel, netdev, bpf,
Masami Hiramatsu, Justin He
In-Reply-To: <20190716111301.1855-2-leo.yan@linaro.org>
On Tue, Jul 16, 2019 at 07:13:00PM +0800, Leo Yan wrote:
> This patch implement regs_set_return_value() and
> override_function_with_return() to support function error injection
> for arm64.
>
> In the exception flow, arm64's general register x30 contains the value
> for the link register; so we can just update pt_regs::pc with it rather
> than redirecting execution to a dummy function that returns.
>
> This patch is heavily inspired by the commit 7cd01b08d35f ("powerpc:
> Add support for function error injection").
>
> Signed-off-by: Leo Yan <leo.yan@linaro.org>
> ---
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/error-injection.h | 13 +++++++++++++
> arch/arm64/include/asm/ptrace.h | 5 +++++
> arch/arm64/lib/Makefile | 2 ++
> arch/arm64/lib/error-inject.c | 19 +++++++++++++++++++
> 5 files changed, 40 insertions(+)
> create mode 100644 arch/arm64/include/asm/error-injection.h
> create mode 100644 arch/arm64/lib/error-inject.c
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 697ea0510729..a6d9e622977d 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -142,6 +142,7 @@ config ARM64
> select HAVE_EFFICIENT_UNALIGNED_ACCESS
> select HAVE_FTRACE_MCOUNT_RECORD
> select HAVE_FUNCTION_TRACER
> + select HAVE_FUNCTION_ERROR_INJECTION
> select HAVE_FUNCTION_GRAPH_TRACER
> select HAVE_GCC_PLUGINS
> select HAVE_HW_BREAKPOINT if PERF_EVENTS
> diff --git a/arch/arm64/include/asm/error-injection.h b/arch/arm64/include/asm/error-injection.h
> new file mode 100644
> index 000000000000..da057e8ed224
> --- /dev/null
> +++ b/arch/arm64/include/asm/error-injection.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +
> +#ifndef __ASM_ERROR_INJECTION_H_
> +#define __ASM_ERROR_INJECTION_H_
> +
> +#include <linux/compiler.h>
> +#include <linux/linkage.h>
> +#include <asm/ptrace.h>
> +#include <asm-generic/error-injection.h>
> +
> +void override_function_with_return(struct pt_regs *regs);
> +
> +#endif /* __ASM_ERROR_INJECTION_H_ */
Why isn't this prototype in the asm-generic header? Seems weird to have to
duplicate it for each architecture.
> diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
> index dad858b6adc6..3aafbbe218a2 100644
> --- a/arch/arm64/include/asm/ptrace.h
> +++ b/arch/arm64/include/asm/ptrace.h
> @@ -294,6 +294,11 @@ static inline unsigned long regs_return_value(struct pt_regs *regs)
> return regs->regs[0];
> }
>
> +static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
> +{
> + regs->regs[0] = rc;
> +}
> +
> /**
> * regs_get_kernel_argument() - get Nth function argument in kernel
> * @regs: pt_regs of that context
> diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
> index 33c2a4abda04..f182ccb0438e 100644
> --- a/arch/arm64/lib/Makefile
> +++ b/arch/arm64/lib/Makefile
> @@ -33,3 +33,5 @@ UBSAN_SANITIZE_atomic_ll_sc.o := n
> lib-$(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) += uaccess_flushcache.o
>
> obj-$(CONFIG_CRC32) += crc32.o
> +
> +obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
> diff --git a/arch/arm64/lib/error-inject.c b/arch/arm64/lib/error-inject.c
> new file mode 100644
> index 000000000000..35661c2de4b0
> --- /dev/null
> +++ b/arch/arm64/lib/error-inject.c
> @@ -0,0 +1,19 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/error-injection.h>
> +#include <linux/kprobes.h>
> +
> +void override_function_with_return(struct pt_regs *regs)
> +{
> + /*
> + * 'regs' represents the state on entry of a predefined function in
> + * the kernel/module and which is captured on a kprobe.
> + *
> + * 'regs->regs[30]' contains the the link register for the probed
extra "the"
> + * function and assign it to 'regs->pc', so when kprobe returns
> + * back from exception it will override the end of probed function
> + * and drirectly return to the predefined function's caller.
directly
> + */
> + regs->pc = regs->regs[30];
I suppose we could be all fancy and do:
instruction_pointer_set(regs, procedure_link_pointer(regs));
How about that?
Will
^ permalink raw reply
* Re: [PATCH v2] staging/octeon: Fix build error without CONFIG_NETDEVICES
From: David Miller @ 2019-07-31 16:07 UTC (permalink / raw)
To: gregkh; +Cc: willy, netdev, devel, yuehaibing, linux-kernel
In-Reply-To: <20190731160219.GA2114@kroah.com>
From: Greg KH <gregkh@linuxfoundation.org>
Date: Wed, 31 Jul 2019 18:02:19 +0200
> From: YueHaibing <yuehaibing@huawei.com>
>
> While do COMPILE_TEST build without CONFIG_NETDEVICES,
> we get Kconfig warning:
>
> WARNING: unmet direct dependencies detected for PHYLIB
> Depends on [n]: NETDEVICES [=n]
> Selected by [y]:
> - OCTEON_ETHERNET [=y] && STAGING [=y] && (CAVIUM_OCTEON_SOC && NETDEVICES [=n] || COMPILE_TEST [=y])
>
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Reported-by: Mark Brown <broonie@kernel.org>
> Fixes: 171a9bae68c7 ("staging/octeon: Allow test build on !MIPS")
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Applied to net-next.
^ permalink raw reply
* Re: [PATCH 1/2] bnxt_en: Use refcount_t for refcount
From: Willem de Bruijn @ 2019-07-31 16:05 UTC (permalink / raw)
To: Chuhong Yuan
Cc: Michael Chan, David S . Miller, Network Development, linux-kernel
In-Reply-To: <20190731122203.948-1-hslester96@gmail.com>
On Wed, Jul 31, 2019 at 8:22 AM Chuhong Yuan <hslester96@gmail.com> wrote:
>
> refcount_t is better for reference counters since its
> implementation can prevent overflows.
> So convert atomic_t ref counters to refcount_t.
>
> Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> ---
> drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c | 8 ++++----
> drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.h | 2 +-
> 2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
> index fc77caf0a076..eb7ed34639e2 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
> @@ -49,7 +49,7 @@ static int bnxt_register_dev(struct bnxt_en_dev *edev, int ulp_id,
> return -ENOMEM;
> }
>
> - atomic_set(&ulp->ref_count, 0);
> + refcount_set(&ulp->ref_count, 0);
One feature of refcount_t is that it warns on refcount_inc from 0 to
detect possible use-after_free. It appears that that can trigger here?
> ulp->handle = handle;
> rcu_assign_pointer(ulp->ulp_ops, ulp_ops);
>
> @@ -246,12 +246,12 @@ static int bnxt_send_msg(struct bnxt_en_dev *edev, int ulp_id,
>
> static void bnxt_ulp_get(struct bnxt_ulp *ulp)
> {
> - atomic_inc(&ulp->ref_count);
> + refcount_inc(&ulp->ref_count);
> }
^ permalink raw reply
* [PATCH net v3] ipvs: Improve robustness to the ipvs sysctl
From: hujunwei @ 2019-07-31 16:03 UTC (permalink / raw)
To: wensong, horms, pablo, kadlec, fw, davem, Julian Anastasov,
Florian Westphal
Cc: netdev@vger.kernel.org, lvs-devel, netfilter-devel, coreteam,
Mingfangsen, wangxiaogang3, xuhanbing
In-Reply-To: <4a0476d3-57a4-50e0-cae8-9dffc4f4d556@huawei.com>
From: Junwei Hu <hujunwei4@huawei.com>
The ipvs module parse the user buffer and save it to sysctl,
then check if the value is valid. invalid value occurs
over a period of time.
Here, I add a variable, struct ctl_table tmp, used to read
the value from the user buffer, and save only when it is valid.
I delete proc_do_sync_mode and use extra1/2 in table for the
proc_dointvec_minmax call.
Fixes: f73181c8288f ("ipvs: add support for sync threads")
Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
---
V1->V2:
- delete proc_do_sync_mode and use proc_dointvec_minmax call.
V2->V3:
- update git version
---
net/netfilter/ipvs/ip_vs_ctl.c | 69 +++++++++++++++++-----------------
1 file changed, 35 insertions(+), 34 deletions(-)
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 060565e7d227..72189559a1cd 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1737,12 +1737,18 @@ proc_do_defense_mode(struct ctl_table *table, int write,
int val = *valp;
int rc;
- rc = proc_dointvec(table, write, buffer, lenp, ppos);
+ struct ctl_table tmp = {
+ .data = &val,
+ .maxlen = sizeof(int),
+ .mode = table->mode,
+ };
+
+ rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
if (write && (*valp != val)) {
- if ((*valp < 0) || (*valp > 3)) {
- /* Restore the correct value */
- *valp = val;
+ if (val < 0 || val > 3) {
+ rc = -EINVAL;
} else {
+ *valp = val;
update_defense_level(ipvs);
}
}
@@ -1756,33 +1762,20 @@ proc_do_sync_threshold(struct ctl_table *table, int write,
int *valp = table->data;
int val[2];
int rc;
+ struct ctl_table tmp = {
+ .data = &val,
+ .maxlen = table->maxlen,
+ .mode = table->mode,
+ };
- /* backup the value first */
memcpy(val, valp, sizeof(val));
-
- rc = proc_dointvec(table, write, buffer, lenp, ppos);
- if (write && (valp[0] < 0 || valp[1] < 0 ||
- (valp[0] >= valp[1] && valp[1]))) {
- /* Restore the correct value */
- memcpy(valp, val, sizeof(val));
- }
- return rc;
-}
-
-static int
-proc_do_sync_mode(struct ctl_table *table, int write,
- void __user *buffer, size_t *lenp, loff_t *ppos)
-{
- int *valp = table->data;
- int val = *valp;
- int rc;
-
- rc = proc_dointvec(table, write, buffer, lenp, ppos);
- if (write && (*valp != val)) {
- if ((*valp < 0) || (*valp > 1)) {
- /* Restore the correct value */
- *valp = val;
- }
+ rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
+ if (write) {
+ if (val[0] < 0 || val[1] < 0 ||
+ (val[0] >= val[1] && val[1]))
+ rc = -EINVAL;
+ else
+ memcpy(valp, val, sizeof(val));
}
return rc;
}
@@ -1795,12 +1788,18 @@ proc_do_sync_ports(struct ctl_table *table, int write,
int val = *valp;
int rc;
- rc = proc_dointvec(table, write, buffer, lenp, ppos);
+ struct ctl_table tmp = {
+ .data = &val,
+ .maxlen = sizeof(int),
+ .mode = table->mode,
+ };
+
+ rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
if (write && (*valp != val)) {
- if (*valp < 1 || !is_power_of_2(*valp)) {
- /* Restore the correct value */
+ if (val < 1 || !is_power_of_2(val))
+ rc = -EINVAL;
+ else
*valp = val;
- }
}
return rc;
}
@@ -1860,7 +1859,9 @@ static struct ctl_table vs_vars[] = {
.procname = "sync_version",
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = proc_do_sync_mode,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
},
{
.procname = "sync_ports",
--
2.21.GIT
^ permalink raw reply related
* [PATCH v2] staging/octeon: Fix build error without CONFIG_NETDEVICES
From: Greg KH @ 2019-07-31 16:02 UTC (permalink / raw)
To: willy, davem, netdev; +Cc: devel, YueHaibing, linux-kernel
From: YueHaibing <yuehaibing@huawei.com>
While do COMPILE_TEST build without CONFIG_NETDEVICES,
we get Kconfig warning:
WARNING: unmet direct dependencies detected for PHYLIB
Depends on [n]: NETDEVICES [=n]
Selected by [y]:
- OCTEON_ETHERNET [=y] && STAGING [=y] && (CAVIUM_OCTEON_SOC && NETDEVICES [=n] || COMPILE_TEST [=y])
Reported-by: Hulk Robot <hulkci@huawei.com>
Reported-by: Mark Brown <broonie@kernel.org>
Fixes: 171a9bae68c7 ("staging/octeon: Allow test build on !MIPS")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
v2: cc: netdev and add another reported-by line.
drivers/staging/octeon/Kconfig | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/staging/octeon/Kconfig b/drivers/staging/octeon/Kconfig
index 5b39946..5319909 100644
--- a/drivers/staging/octeon/Kconfig
+++ b/drivers/staging/octeon/Kconfig
@@ -1,7 +1,8 @@
# SPDX-License-Identifier: GPL-2.0
config OCTEON_ETHERNET
tristate "Cavium Networks Octeon Ethernet support"
- depends on CAVIUM_OCTEON_SOC && NETDEVICES || COMPILE_TEST
+ depends on CAVIUM_OCTEON_SOC || COMPILE_TEST
+ depends on NETDEVICES
select PHYLIB
select MDIO_OCTEON
help
--
2.7.4
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
^ permalink raw reply related
* Re: next/master build: 221 builds: 11 failed, 210 passed, 13 errors, 1174 warnings (next-20190731)
From: Greg KH @ 2019-07-31 16:00 UTC (permalink / raw)
To: David Miller
Cc: devel, andrew, f.fainelli, kernel-build-reports, netdev, willy,
broonie, linux-next, linux-arm-kernel, hkallweit1
In-Reply-To: <20190731.084824.2244928058443049.davem@davemloft.net>
On Wed, Jul 31, 2019 at 08:48:24AM -0700, David Miller wrote:
> From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Date: Wed, 31 Jul 2019 13:35:22 +0200
>
> > On Wed, Jul 31, 2019 at 12:24:41PM +0100, Mark Brown wrote:
> >> On Wed, Jul 31, 2019 at 04:07:41AM -0700, kernelci.org bot wrote:
> >>
> >> Today's -next fails to build an ARM allmodconfig due to:
> >>
> >> > allmodconfig (arm, gcc-8) ― FAIL, 1 error, 40 warnings, 0 section mismatches
> >> >
> >> > Errors:
> >> > drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of function 'writeq'; did you mean 'writel'? [-Werror=implicit-function-declaration]
> >>
> >> as a result of the changes that introduced:
> >>
> >> WARNING: unmet direct dependencies detected for MDIO_OCTEON
> >> Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=m] && MDIO_BUS [=m] && 64BIT && HAS_IOMEM [=y] && OF_MDIO [=m]
> >> Selected by [m]:
> >> - OCTEON_ETHERNET [=m] && STAGING [=y] && (CAVIUM_OCTEON_SOC && NETDEVICES [=y] || COMPILE_TEST [=y])
> >>
> >> which is triggered by the staging OCTEON_ETHERNET driver which misses a
> >> 64BIT dependency but added COMPILE_TEST in 171a9bae68c72f2
> >> (staging/octeon: Allow test build on !MIPS).
> >
> > A patch was posted for this, but it needs to go through the netdev tree
> > as that's where the offending patches are coming from.
>
> I didn't catch that, was netdev CC:'d?
Nope, just you :(
I'll resend it now and cc: netdev.
thanks,
greg k-h
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox