Date: Sun, 5 Dec 2021 15:49:59 +0200
From: Ido Schimmel
To: Lahav Schlesinger
Cc: netdev@vger.kernel.org, kuba@kernel.org, dsahern@gmail.com
Subject: Re: [PATCH net-next v4] rtnetlink: Support fine-grained netdevice bulk deletion
References: <20211202174502.28903-1-lschlesinger@drivenets.com>
 <20211205121059.btshgxt7s7hfnmtr@kgollan-pc>
In-Reply-To: <20211205121059.btshgxt7s7hfnmtr@kgollan-pc>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
X-Mailing-List: netdev@vger.kernel.org

On Sun, Dec 05, 2021 at 02:11:00PM +0200, Lahav Schlesinger wrote:
> On Sun, Dec 05, 2021 at 11:53:03AM +0200, Ido Schimmel wrote:
> > On Thu, Dec 02, 2021 at 07:45:02PM +0200, Lahav Schlesinger wrote:
> > > At large scale, some routers are required to support tens of thousands
> > > of devices at once, both physical and virtual (e.g. loopbacks, tunnels,
> > > VRFs, etc.).
> > > At times such routers are required to delete massive amounts of devices
> > > at once, such as when a factory reset is performed on the router (causing
> > > a deletion of all devices), or when a configuration is restored after an
> > > upgrade, or as a request from an operator.
> > >
> > > Currently there are 2 means of deleting devices using Netlink:
> > > 1. Deleting a single device (either by ifindex using ifinfomsg::ifi_index,
> > > or by name using IFLA_IFNAME)
> > > 2. Deleting all devices that belong to a group (using IFLA_GROUP)
> > >
> > > Deletion of devices one-by-one has poor performance at large scale
> > > compared to "group deletion":
> > > After all devices are handled, netdev_run_todo() is called, which
> > > calls rcu_barrier() to finish any outstanding RCU callbacks that were
> > > registered during the deletion of the device, then waits until the
> > > refcount of all the devices is 0, then performs the final cleanups.
> > >
> > > However, calling rcu_barrier() is a very costly operation, each call
> > > taking on the order of tens of milliseconds.
> > >
> > > When deleting a large number of devices one-by-one, rcu_barrier()
> > > will be called for each device being deleted.
> > > As an example, the following benchmark deletes 10K loopback devices,
> > > all of which are UP and have only an IPv6 LLA configured:
> > >
> > > 1. Deleting one-by-one using 1 thread  : 243 seconds
> > > 2. Deleting one-by-one using 10 threads: 70 seconds
> > > 3. Deleting one-by-one using 50 threads: 54 seconds
> > > 4. Deleting all using "group deletion" : 30 seconds
> > >
> > > Note that even though the deletion logic takes place under the rtnl
> > > lock, since the call to rcu_barrier() is outside the lock we gain
> > > some improvement.
> > >
> > > But, while "group deletion" is the fastest, it is not suited for
> > > deleting a large number of arbitrary devices which are unknown ahead of
> > > time. Furthermore, moving a large number of devices to a group is also a
> > > costly operation.
> >
> > These are the numbers I get in a VM running on my laptop.
> >
> > Moving 16k dummy netdevs to a group:
> >
> > # time -p ip -b group.batch
> > real 1.91
> > user 0.04
> > sys 0.27
> >
> > Deleting the group:
> >
> > # time -p ip link del group 10
> > real 6.15
> > user 0.00
> > sys 3.02
>
> Hi Ido, in your tests, in which state are the dummy devices before
> deleting/changing the group?
> When they are DOWN I get similar numbers to yours (16k devices):
>
> # time ip -b group_16000_batch
> real 0m0.640s
> user 0m0.152s
> sys 0m0.478s
>
> # time ip link delete group 100
> real 0m5.324s
> user 0m0.017s
> sys 0m4.991s
>
> But when the devices are in state UP, I get:
>
> # time ip -b group_16000_batch
> real 0m48.605s
> user 0m0.218s
> sys 0m48.244s
>
> # time ip link delete group 100
> real 1m13.219s
> user 0m0.010s
> sys 1m9.117s
>
> And for completeness, setting the devices to DOWN prior to deleting them
> takes as long as just deleting them while they're UP.
>
> Also, while this is probably a minuscule issue, changing the group of
> tens of thousands of interfaces will result in a storm of netlink events
> that will make any userspace program listening for link events spend time
> handling them. This will result in twice as many events compared to
> directly deleting the devices.

Yes, in my setup the netdevs were down. Looking at the code, I think the
reason for the 75x increase in latency is the fact that netlink
notifications are not generated when the netdev is down. See
netdev_state_change().
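
For reference, paraphrased from the 5.15-era net/core/dev.c, that function
looks roughly like this; both the notifier and the netlink notification are
generated only for devices that are UP:

void netdev_state_change(struct net_device *dev)
{
	/* The NETDEV_CHANGE notifier and the RTM_NEWLINK netlink
	 * notification are emitted only for devices that are UP, so
	 * group changes on DOWN netdevs produce no events at all.
	 */
	if (dev->flags & IFF_UP) {
		struct netdev_notifier_change_info change_info = {
			.info.dev = dev,
		};

		call_netdevice_notifiers_info(NETDEV_CHANGE,
					      &change_info.info);
		rtmsg_ifinfo(RTM_NEWLINK, dev, 0, GFP_KERNEL);
	}
}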
In your specific case, it is quite useless for the kernel to generate 16k
notifications when moving the netdevs to a group, since the entire reason
they are moved to a group is so that they can be deleted in a batch. I
assume there are other use cases where having the kernel suppress
notifications can be useful. Did you consider adding such a flag to the
request? I think such a mechanism is more generic/useful than an ad-hoc
API to delete a list of netdevs, and it should allow you to utilize the
existing group deletion mechanism.

> > IMO, these numbers do not justify a new API. Also, your user space can
> > be taught to create all the netdevs in the same group to begin with:
> >
> > # ip link add name dummy1 group 10 type dummy
> > # ip link show dev dummy1
> > 10: dummy1: mtu 1500 qdisc noop state DOWN mode DEFAULT group 10 qlen 1000
> >     link/ether 12:b6:7d:ff:48:99 brd ff:ff:ff:ff:ff:ff
> >
> > Moreover, unlike the list API that is specific to deletion, the group
> > API also lets you batch set operations:
> >
> > # ip link set group 10 mtu 2000
> > # ip link show dev dummy1
> > 10: dummy1: mtu 2000 qdisc noop state DOWN mode DEFAULT group 10 qlen 1000
> >     link/ether 12:b6:7d:ff:48:99 brd ff:ff:ff:ff:ff:ff
>
> The list API can be extended to support other operations as well
> (similar to group set operations, we can call do_setlink() for each
> device specified in an IFLA_IFINDEX attribute).
> I didn't implement it in this patch because we don't have a use for it
> currently.
>
> > If you are using namespaces, then during "factory reset" you can delete
> > the namespace, which should trigger batch deletion of the netdevs inside
> > it.
>
> In some scenarios we are required to delete only a subset of devices
> (e.g. when a physical link goes DOWN, we need to delete all the VLANs
> and any tunnels configured on that device). Furthermore, a user is
> allowed to load a new configuration in which only some of the devices
> are deleted (e.g. all of the loopbacks in the system), while the other
> devices are left untouched.
>
> > > This patch adds support for passing an arbitrary list of ifindexes of
> > > devices to delete with a new IFLA_IFINDEX attribute. A single message
> > > may contain multiple instances of this attribute.
> > > This gives more fine-grained control over which devices to delete,
> > > while still resulting in rcu_barrier() being called only once.
> > > Indeed, the timings of using this new API to delete 10K devices are
> > > the same as using the existing "group" deletion.
> > >
> > > Signed-off-by: Lahav Schlesinger
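
For concreteness, a minimal (hypothetical) userspace sketch of such a bulk
deletion request follows. IFLA_IFINDEX is the attribute proposed by this
patch and is not in the uapi headers, so the numeric value below is only a
placeholder; the ifindexes are arbitrary examples and ACK parsing is
omitted:

/* Hypothetical sketch only: one RTM_DELLINK request carrying several
 * ifindex attributes, so rcu_barrier() runs once for the whole batch.
 * IFLA_IFINDEX_PLACEHOLDER is an assumed value, not a real uapi constant.
 */
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

#define IFLA_IFINDEX_PLACEHOLDER 61	/* replace with the merged value */

int main(void)
{
	int ifindexes[] = { 10, 11, 12 };	/* example devices to delete */
	struct {
		struct nlmsghdr nlh;
		struct ifinfomsg ifm;
		char attrs[128];
	} req;
	unsigned int i;
	int sk;

	memset(&req, 0, sizeof(req));
	req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req.ifm));
	req.nlh.nlmsg_type = RTM_DELLINK;
	req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	req.ifm.ifi_family = AF_UNSPEC;

	/* Append one attribute per device; a single message may carry many. */
	for (i = 0; i < sizeof(ifindexes) / sizeof(ifindexes[0]); i++) {
		struct rtattr *rta = (struct rtattr *)((char *)&req +
					NLMSG_ALIGN(req.nlh.nlmsg_len));

		rta->rta_type = IFLA_IFINDEX_PLACEHOLDER;
		rta->rta_len = RTA_LENGTH(sizeof(int));
		memcpy(RTA_DATA(rta), &ifindexes[i], sizeof(int));
		req.nlh.nlmsg_len = NLMSG_ALIGN(req.nlh.nlmsg_len) +
				    RTA_ALIGN(rta->rta_len);
	}

	sk = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	if (sk < 0) {
		perror("socket");
		return 1;
	}
	if (send(sk, &req, req.nlh.nlmsg_len, 0) < 0)
		perror("send");
	/* A real program would also read back the netlink ACK here. */
	close(sk);
	return 0;
}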