Subject: Re: [PATCH v6 net-next 14/15] net: bonding: ensure .ndo_get_stats64 can sleep
From: Saeed Mahameed
To: Vladimir Oltean, "David S. Miller", Jakub Kicinski
Cc: netdev@vger.kernel.org, Andrew Lunn, Florian Fainelli, Cong Wang, Stephen Hemminger, Eric Dumazet, George McCollister, Oleksij Rempel, Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, Arnd Bergmann, Taehee Yoo, Jiri Pirko, Florian Westphal, Nikolay Aleksandrov, Pravin B Shelar, Sridhar Samudrala
Date: Mon, 11 Jan 2021 15:38:49 -0800
In-Reply-To: <20210109172624.2028156-15-olteanv@gmail.com>
References: <20210109172624.2028156-1-olteanv@gmail.com> <20210109172624.2028156-15-olteanv@gmail.com>

On Sat, 2021-01-09 at 19:26 +0200, Vladimir Oltean wrote:
> From: Vladimir Oltean
>
> There is an effort to convert .ndo_get_stats64 to sleepable context, and
> for that to work, we need to prevent callers of dev_get_stats from using
> atomic locking.
>
> The bonding driver retrieves its statistics recursively from its lower
> interfaces, with additional care to only count packets sent/received
> while those lowers were actually enslaved to the bond - see commit
> 5f0c5f73e5ef ("bonding: make global bonding stats more reliable").
>
> Since commit 87163ef9cda7 ("bonding: remove last users of bond->lock and
> bond->lock itself"), the bonding driver uses the following protection
> for its array of slaves: RCU for readers and rtnl_mutex for updaters.
>
> The aforementioned commit removed an interesting comment:
>
>	/* [...] we can't hold bond->lock [...] because we'll
>	 * deadlock. The only solution is to rely on the fact
>	 * that we're under rtnl_lock here, and the slaves
>	 * list won't change. This doesn't solve the problem
>	 * of setting the slave's MTU while it is
>	 * transmitting, but the assumption is that the base
>	 * driver can handle that.
>	 *
>	 * TODO: figure out a way to safely iterate the slaves
>	 * list, but without holding a lock around the actual
>	 * call to the base driver.
>	 */
>
> The above summarizes pretty well the challenges we have with nested
> bonding interfaces (bond over bond over bond over...) and locking for
> their slaves.
>
> To solve the nesting problem, the simple way is to not hold any locks
> when recursing into the slave netdev operation. We can "cheat" and use
> dev_hold to take a reference on the slave net_device, which is enough to
> ensure that netdev_wait_allrefs() waits until we finish, and the kernel
> won't fault.
>
> However, the slave structure might no longer be valid, just its
> associated net_device. So we need to do some more work to ensure that
> the slave exists after we took the statistics, and if it still does,
> reapply the logic from Andy's commit 5f0c5f73e5ef.
>
> Tested using the following two scripts running in parallel:
>
>	#!/bin/bash
>
>	while :; do
>		ip link del bond0
>		ip link del bond1
>		ip link add bond0 type bond mode 802.3ad
>		ip link add bond1 type bond mode 802.3ad
>		ip link set sw0p1 down && ip link set sw0p1 master bond0 && ip link set sw0p1 up
>		ip link set sw0p2 down && ip link set sw0p2 master bond0 && ip link set sw0p2 up
>		ip link set sw0p3 down && ip link set sw0p3 master bond0 && ip link set sw0p3 up
>		ip link set bond0 down && ip link set bond0 master bond1 && ip link set bond0 up
>		ip link set sw1p1 down && ip link set sw1p1 master bond1 && ip link set sw1p1 up
>		ip link set bond1 up
>		ip -s -s link show
>		cat /sys/class/net/bond1/statistics/*
>	done
>
>	#!/bin/bash
>
>	while :; do
>		echo spi2.0 > /sys/bus/spi/drivers/sja1105/unbind
>		echo spi2.0 > /sys/bus/spi/drivers/sja1105/bind
>		sleep 30
>	done
>
> where the sja1105 driver was explicitly modified for the purpose of this
> test to have a msleep(500) in its .ndo_get_stats64 method, to catch some
> more potential races.
>
> Signed-off-by: Vladimir Oltean
> ---
> [...]
>
> +/* Helpers for reference counting the struct net_device behind the bond slaves.
> + * These can be used to propagate the net_device_ops from the bond to the
> + * slaves while not holding rcu_read_lock() or the rtnl_mutex.
> + */
> +struct bonding_slave_dev {
> +	struct net_device *ndev;
> +	struct list_head list;
> +};
> +
> +static inline void bond_put_slaves(struct list_head *slaves)
> +{
> +	struct bonding_slave_dev *s, *tmp;
> +
> +	list_for_each_entry_safe(s, tmp, slaves, list) {
> +		dev_put(s->ndev);
> +		list_del(&s->list);
> +		kfree(s);
> +	}
> +}
> +
> +static inline int bond_get_slaves(struct bonding *bond,
> +				  struct list_head *slaves,
> +				  int *num_slaves)
> +{
> +	struct list_head *iter;
> +	struct slave *slave;
> +
> +	INIT_LIST_HEAD(slaves);
> +	*num_slaves = 0;
> +
> +	rcu_read_lock();
> +
> +	bond_for_each_slave_rcu(bond, slave, iter) {
> +		struct bonding_slave_dev *s;
> +
> +		s = kzalloc(sizeof(*s), GFP_ATOMIC);

GFP_ATOMIC is a little bit aggressive, especially when user daemons are
periodically reading stats. This can be avoided.

You can pre-allocate, with GFP_KERNEL, an array of an "approximate" size,
then fill the array with whatever slaves the bond has at that moment.
num_of_slaves can be less than, equal to, or more than the size of the
array you just allocated, but we shouldn't care ..
something like:

	rcu_read_lock();
	nslaves = bond_get_num_slaves();
	rcu_read_unlock();

	sarray = kcalloc(nslaves, sizeof(struct bonding_slave_dev), GFP_KERNEL);

	rcu_read_lock();
	bond_fill_slaves_array(bond, sarray); // also do: dev_hold()
	rcu_read_unlock();

	bond_get_slaves_array_stats(sarray);

	bond_put_slaves_array(sarray);

> +		if (!s) {
> +			rcu_read_unlock();
> +			bond_put_slaves(slaves);
> +			return -ENOMEM;
> +		}
> +
> +		s->ndev = slave->dev;
> +		dev_hold(s->ndev);
> +		list_add_tail(&s->list, slaves);
> +		(*num_slaves)++;
> +	}
> +
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +
>  #define BOND_PRI_RESELECT_ALWAYS	0
>  #define BOND_PRI_RESELECT_BETTER	1
>  #define BOND_PRI_RESELECT_FAILURE	2
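
To make the count-then-fill suggestion above a bit more concrete, here is a
minimal userspace sketch of the same shape. It is not kernel code and not
the bonding driver's API: a pthread mutex stands in for rcu_read_lock(), a
C11 atomic counter stands in for dev_hold()/dev_put(), calloc() stands in
for kcalloc(GFP_KERNEL), and every type and function name in it (struct
dev, struct bond, struct snapshot, bond_snapshot(), snapshot_sum_and_put())
is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_SLAVES 8

/* Hypothetical stand-in for a net_device with a reference count. */
struct dev {
	atomic_int refcnt;
	long rx_packets;
};

/* Hypothetical stand-in for the bond; the mutex plays the role of RCU. */
struct bond {
	pthread_mutex_t lock;
	struct dev *slaves[MAX_SLAVES];
	size_t nslaves;
};

/* Array of referenced devices that can be walked with no lock held. */
struct snapshot {
	struct dev **devs;
	size_t count;
};

static void dev_hold_sketch(struct dev *d)
{
	atomic_fetch_add(&d->refcnt, 1);
}

static void dev_put_sketch(struct dev *d)
{
	int old = atomic_fetch_sub(&d->refcnt, 1);

	assert(old > 0);	/* a put must always pair with an earlier hold */
	(void)old;
}

/* Returns 0 on success, -1 if the allocation fails. */
static int bond_snapshot(struct bond *b, struct snapshot *snap)
{
	size_t cap, i;

	/* Pass 1: sample the current slave count, then drop the lock. */
	pthread_mutex_lock(&b->lock);
	cap = b->nslaves;
	pthread_mutex_unlock(&b->lock);

	/* Allocate with no lock held, so the allocation is free to block. */
	snap->devs = calloc(cap ? cap : 1, sizeof(*snap->devs));
	if (!snap->devs)
		return -1;

	/* Pass 2: the list may have changed; fill at most cap entries. */
	pthread_mutex_lock(&b->lock);
	snap->count = b->nslaves < cap ? b->nslaves : cap;
	for (i = 0; i < snap->count; i++) {
		snap->devs[i] = b->slaves[i];
		dev_hold_sketch(snap->devs[i]);
	}
	pthread_mutex_unlock(&b->lock);

	return 0;
}

/* Reads the stats with no lock held, then drops every reference. */
static long snapshot_sum_and_put(struct snapshot *snap)
{
	long total = 0;
	size_t i;

	for (i = 0; i < snap->count; i++) {
		total += snap->devs[i]->rx_packets;
		dev_put_sketch(snap->devs[i]);
	}
	free(snap->devs);
	snap->devs = NULL;
	snap->count = 0;

	return total;
}
```

If slaves are added between the two passes, this sketch simply leaves the
extra ones out of that particular reading, which is one possible
interpretation of "we shouldn't care"; whether that is acceptable for the
real driver is a separate question.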