From: Eric Dumazet <edumazet@google.com>
To: Stephen Hemminger <stephen@networkplumber.org>
Cc: Vladimir Oltean <olteanv@gmail.com>,
netdev <netdev@vger.kernel.org>, Jakub Kicinski <kuba@kernel.org>,
Paul Gortmaker <paul.gortmaker@windriver.com>,
Jiri Benc <jbenc@redhat.com>, Or Gerlitz <ogerlitz@mellanox.com>,
Cong Wang <xiyou.wangcong@gmail.com>,
Jamal Hadi Salim <jhs@mojatatu.com>, Andrew Lunn <andrew@lunn.ch>,
Florian Fainelli <f.fainelli@gmail.com>
Subject: Re: Correct usage of dev_base_lock in 2020
Date: Mon, 30 Nov 2020 11:41:10 +0100
Message-ID: <CANn89iKyyCwiKHFvQMqmeAbaR9SzwsCsko49FP+4NBW6+ZXN4w@mail.gmail.com>
In-Reply-To: <20201129211230.4d704931@hermes.local>

On Mon, Nov 30, 2020 at 6:12 AM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Sun, 29 Nov 2020 22:58:17 +0200
> Vladimir Oltean <olteanv@gmail.com> wrote:
>
> > [ resent, had forgotten to copy the list ]
> >
> > Hi,
> >
> > net/core/dev.c has this to say about the locking rules around the network
> > interface lists (dev_base_head, and I can only assume that it also applies to
> > the per-ifindex hash table dev_index_head and the per-name hash table
> > dev_name_head):
> >
> > /*
> >  * The @dev_base_head list is protected by @dev_base_lock and the rtnl
> >  * semaphore.
> >  *
> >  * Pure readers hold dev_base_lock for reading, or rcu_read_lock()
> >  *
> >  * Writers must hold the rtnl semaphore while they loop through the
> >  * dev_base_head list, and hold dev_base_lock for writing when they do the
> >  * actual updates. This allows pure readers to access the list even
> >  * while a writer is preparing to update it.
> >  *
> >  * To put it another way, dev_base_lock is held for writing only to
> >  * protect against pure readers; the rtnl semaphore provides the
> >  * protection against other writers.
> >  *
> >  * See, for example usages, register_netdevice() and
> >  * unregister_netdevice(), which must be called with the rtnl
> >  * semaphore held.
> >  */
> >
> > However, as of today, most if not all the read-side accessors of the network
> > interface lists have been converted to run under rcu_read_lock. As Eric explains,
> >
> > commit fb699dfd426a189fe33b91586c15176a75c8aed0
> > Author: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Mon Oct 19 19:18:49 2009 +0000
> >
> > net: Introduce dev_get_by_index_rcu()
> >
> > Some workloads hit dev_base_lock rwlock pretty hard.
> > We can use RCU lookups to avoid touching this rwlock.
> >
> > netdevices are already freed after a RCU grace period, so this patch
> > adds no penalty at device dismantle time.
> >
> > dev_ifname() converted to dev_get_by_index_rcu()
> >
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > Signed-off-by: David S. Miller <davem@davemloft.net>
> >
> > A lot of work has been put into eliminating the dev_base_lock rwlock
> > completely, as Stephen explained here:
> >
> > [PATCH 00/10] netdev: get rid of read_lock(&dev_base_lock) usages
> > https://www.spinics.net/lists/netdev/msg112264.html
> >
> > However, its use has not been completely eliminated. It is still there, and
> > even more confusingly, that comment in net/core/dev.c is still there. What I
> > see the dev_base_lock being used for now are complete oddballs.
> >
> > - The debugfs for mac80211, in net/mac80211/debugfs_netdev.c, holds the read
> > side when printing some interface properties (good luck disentangling the
> > code and figuring out which ones, though). What is that read-side actually
> > protecting against?
> >
> > - HSR, in net/hsr/hsr_device.c (called from hsr_netdev_notify on NETDEV_UP,
> > NETDEV_DOWN and NETDEV_CHANGE), takes the write-side of the lock when
> > modifying the RFC 2863 operstate of the interface. Why?
> > Actually the use of dev_base_lock is the most widespread in the kernel today
> > when accessing the RFC 2863 operstate. I could only find this truncated
> > discussion in the archives:
> > Re: Issue 0 WAS (Re: Oustanding issues WAS(IRe: Consensus? WAS(RFC 2863)
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg03632.html
> > and it said:
> >
> > > be transitioned to up/dormant etc. So an ethernet driver doesnt know it
> > > needs to go from detecting peer link is up to next being authenticated
> > > in the case of 802.1x. It just calls netif_carrier_on which checks
> > > link_mode to decide on transition.
> >
> > we could protect operstate with a spinlock_irqsave() and then change it either
> > from netif_[carrier|dormant]_on/off() or userspace-supplicant. However, I'm
> > not feeling good about it. Look at rtnetlink_fill_ifinfo(), it is able to
> > query a consistent snapshot of all interface settings as long as locking with
> > dev_base_lock and rtnl is obeyed. __LINK_STATE flags are already an
> > exemption, and I don't want operstate to be another. That's why I chose
> > setting it from linkwatch in process context, and I really think this is the
> > correct approach.
> >
> > - rfc2863_policy() in net/core/link_watch.c seems to be the major writer that
> > holds this lock in 2020, together with do_setlink() and set_operstate() from
> > net/core/rtnetlink.c. Has the lock been repurposed over the years and we
> > should update its name appropriately?
> >
> > - This usage from netdev_show() in net/core/net-sysfs.c just looks random to
> > me, maybe somebody can explain:
> >
> >         read_lock(&dev_base_lock);
> >         if (dev_isalive(ndev))
> >                 ret = (*format)(ndev, buf);
> >         read_unlock(&dev_base_lock);
>
>
> So dev_base_lock dates back to the Big Kernel Lock breakup back in Linux 2.4
> (ie before my time). The time has come to get rid of it.
>
> The use in sysfs could be changed to RCU. There have been issues
> in the past with sysfs causing lock inversions with the rtnl mutex, that
> is why you will see some trylock code there.
>
> My guess is that dev_base_lock readers exist only because no one bothered to do
> the RCU conversion.

I think we did, a long time ago.
We took care of all 'fast paths' already.
Not sure what is needed; the current situation does not bother me at all ;)

>
> Complex locking rules lead to mistakes and often don't get much performance
> gain. There are really two different domains being covered by locks here.
>
> The first area is change of state of network devices. This has traditionally
> been covered by RTNL because there are places that depend on coordinating
> state between multiple devices. RTNL is too big and held too long but getting
> rid of it is hard because there are corner cases (like state changes from userspace
> for VPN devices).
>
> The other area is code that wants to do read access to look at the list of devices.
> These pure readers can/should be converted to RCU by now. Writers should hold RTNL.

Yes, and sometimes this is unfortunate.
dev_change_name() for example is an issue, because of the
synchronize_rcu() it contains.

>
> You could change the readers of operstate to use some form of RCU and atomic
> operation (seqlock?). The state of the device has several components (flags, operstate,
> etc.), and there is no well-defined way to read a consistent set of them.
>
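As a rough illustration of the seqlock idea floated in the quoted paragraph above, the following userspace sketch uses C11 atomics to let a reader take a consistent (flags, operstate) snapshot without a lock. It is a model under stated assumptions, not the kernel's seqlock API: `toy_state`, `toy_snapshot`, `toy_state_write()` and `toy_state_read()` are hypothetical names, writers are assumed to be serialised by some outer lock (RTNL, in the kernel's terms), and default sequentially consistent atomics are used for simplicity rather than speed.

```c
#include <stdatomic.h>

/* Hypothetical model; field and function names are illustrative only. */
struct toy_state {
	atomic_uint seq;	/* even: stable, odd: update in progress */
	atomic_uint flags;
	atomic_uint operstate;
};

struct toy_snapshot {
	unsigned int flags;
	unsigned int operstate;
};

/* Writer side. Assumes the caller already excludes other writers. */
static void toy_state_write(struct toy_state *s, unsigned int flags,
			    unsigned int operstate)
{
	unsigned int seq = atomic_load(&s->seq);

	atomic_store(&s->seq, seq + 1);		/* odd: readers must retry */
	atomic_store(&s->flags, flags);
	atomic_store(&s->operstate, operstate);
	atomic_store(&s->seq, seq + 2);		/* even again: update complete */
}

/* Reader side: takes no lock, and retries until it observes the same even
 * sequence number before and after reading both fields. */
static struct toy_snapshot toy_state_read(struct toy_state *s)
{
	struct toy_snapshot snap;
	unsigned int begin, end;

	do {
		begin = atomic_load(&s->seq);
		snap.flags = atomic_load(&s->flags);
		snap.operstate = atomic_load(&s->operstate);
		end = atomic_load(&s->seq);
	} while ((begin & 1) || begin != end);

	return snap;
}
```

The trade-off in this pattern is that readers never block the writer, but may have to loop if an update races with the read.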
> Good Luck on your quest.
>
>