From mboxrd@z Thu Jan 1 00:00:00 1970
From: Paweł Sikora
Subject: Re: [2.6.38-3.x] [BUG] soft lockup - CPU#X stuck for 23s! (vfs, autofs, vserver)
Date: Thu, 15 Nov 2012 19:48:10 +0100
Message-ID: <3506450.k3Q223DJQc@localhost>
References: <5092540.GORQ1kUuNX@localhost> <87sja7uvy1.fsf@xmission.com> <20120925050558.GA14685@MAIL.13thfloor.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: "Eric W. Biederman", Linus Torvalds, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, arekm@pld-linux.org, baggins@pld-linux.org, Daniel Hokka Zakrisson
To: Herbert Poetzl
Return-path:
In-Reply-To: <20120925050558.GA14685@MAIL.13thfloor.at>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Tuesday 25 of September 2012 07:05:59 Herbert Poetzl wrote:
> On Mon, Sep 24, 2012 at 11:17:42AM -0700, Eric W. Biederman wrote:
> > Herbert Poetzl writes:
>
> >> On Mon, Sep 24, 2012 at 07:23:55AM +0200, Paweł Sikora wrote:
> >>> On Sunday 23 of September 2012 18:10:30 Linus Torvalds wrote:
> >>>> On Sat, Sep 22, 2012 at 11:09 PM, Paweł Sikora wrote:
>
> >>>>> br_read_lock(vfsmount_lock);
>
> >>>> The vfsmount_lock is a "local-global" lock, where a read-lock
> >>>> is rather cheap and takes just a per-cpu lock, but the
> >>>> downside is that a write-lock is *very* expensive, and can
> >>>> cause serious trouble.
>
> >>>> And the write lock is taken by the [un]mount() paths. Do *not*
> >>>> do crazy things. If you do some insane "unmount and remount
> >>>> autofs" on a 1s granularity, you're doing insane things.
>
> >>>> Why do you have that 1s timeout? Insane.
>
> >>> 1s unmount timeout is *only* for fast bug reproduction (in few
> >>> seconds after opteron startup) and testing potential patches.
> >>> normally with 60s timeout it happens in few minutes..hours
> >>> (depends on machine i/o+cpu load) and makes server unusable
> >>> (permanent soft-lockup).
>
> >>> can we redesign vserver's mnt_is_reachable() for better locking
> >>> to avoid total soft-lockup?
>
> >> currently we do:
>
> >>         br_read_lock(&vfsmount_lock);
> >>         root = current->fs->root;
> >>         root_mnt = real_mount(root.mnt);
> >>         point = root.dentry;
>
> >>         while ((mnt != mnt->mnt_parent) && (mnt != root_mnt)) {
> >>                 point = mnt->mnt_mountpoint;
> >>                 mnt = mnt->mnt_parent;
> >>         }
>
> >>         ret = (mnt == root_mnt) && is_subdir(point, root.dentry);
> >>         br_read_unlock(&vfsmount_lock);
>
> >> and we have been considering to move the br_read_unlock()
> >> right before the is_subdir() call
>
> >> if there are any suggestions how to achieve the same
> >> with less locking I'm all ears ...
>
> > Herbert, why do you need to filter the mounts that show up in a
> > mount namespace at all?
>
> that is actually a really good question!
>
> > I would think a far more performant and simpler solution would
> > be to just use mount namespaces without unwanted mounts.
>
> we had this mechanism for many years, long before the
> mount namespaces existed, and I vaguely remember that
> early versions didn't get the proc entries right either
>
> I took a quick look at the code and I think we can drop
> the mnt_is_reachable() check and/or make it conditional
> on setups without a mount namespace in place in the near
> future (thanks for the input, really appreciated!)

Hi, Herbert,

can I just drop this mnt_is_reachable() method from the vserver patch?
this issue hasn't been solved for several months now. I can live without
this problematic security-through-obscurity feature on my heavily loaded
machines.
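
For anyone following the thread without the vserver patch at hand, the
walk quoted above can be modelled in userspace roughly like this. It is
only a sketch under assumptions: struct dir_node, struct mount_node,
dir_is_below() and mount_is_reachable() are hypothetical stand-ins for
the kernel's dentry, struct mount, is_subdir() and mnt_is_reachable(),
and no locking is modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry; only the parent link matters here. */
struct dir_node {
    const struct dir_node *parent;      /* NULL (or self) at the top of a tree */
};

/* Hypothetical stand-in for struct mount. */
struct mount_node {
    const struct mount_node *parent;    /* points to itself for a root mount */
    const struct dir_node *mountpoint;  /* directory in the parent that is covered */
};

/* Model of is_subdir(): true if dir is anchor itself or lies below it. */
static bool dir_is_below(const struct dir_node *dir,
                         const struct dir_node *anchor)
{
    while (dir != NULL) {
        if (dir == anchor)
            return true;
        if (dir->parent == dir)         /* self-parented top, stop here */
            break;
        dir = dir->parent;
    }
    return false;
}

/*
 * Model of the quoted loop: climb from mnt towards root_mnt, remembering
 * the last mountpoint crossed, then check that this mountpoint sits at or
 * below the caller's root directory.
 */
static bool mount_is_reachable(const struct mount_node *mnt,
                               const struct mount_node *root_mnt,
                               const struct dir_node *root_dir)
{
    const struct dir_node *point = root_dir;

    assert(mnt != NULL && root_mnt != NULL && root_dir != NULL);
    while (mnt != mnt->parent && mnt != root_mnt) {
        point = mnt->mountpoint;
        mnt = mnt->parent;
    }
    return mnt == root_mnt && dir_is_below(point, root_dir);
}
```

In this model every call climbs the whole parent chain of one mount, so
running it once per mount while listing them costs the sum of the chain
lengths. In the kernel the same walk additionally happens under
br_read_lock(&vfsmount_lock), which the sketch leaves out.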
> > I'd like to blame this on the silly rcu_barrier in
> > deactivate_locked_super that should really be in the module
> > remove path, but that happens after we drop the br_write_lock.
>
> > The kernel takes br_read_lock(&vfsmount_lock) during every rcu
> > path lookup so mnt_is_reachable isn't particularly crazy just for
> > taking the lock.
>
> > I am with Linus on this one. Paweł even 60s for your mount
> > timeout looks too short for your workload. All of the readers
> > that take br_read_lock(&vfsmount_lock) seem to be showing up in
> > your oops. The only thing that seems to make sense is you have
> > a lot of unmount activity running back to back, keeping the
> > lock write held.
>
> > The only other possible culprit I can see is that it looks like
> > mnt_is_reachable changes reading /proc/mounts to be something
> > worse than linear in the number of mounts and reading /proc/mounts
> > starts taking the vfsmount_lock. All minor things but when you
> > are pushing things hard they look like things that would add up.
>
> > Eric
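
To make the "local-global" lock cost discussed in this thread concrete,
here is a minimal userspace sketch of the idea, assuming pthread mutexes
in place of per-cpu spinlocks. The names lg_lock, lg_slot, NR_SLOTS and
the lg_*() helpers are hypothetical and are not the kernel's brlock
implementation; the caller passes a slot number where the kernel would
use the current CPU.

```c
#include <assert.h>
#include <pthread.h>

#define NR_SLOTS 4  /* hypothetical stand-in for the number of CPUs */

struct lg_slot {
    pthread_mutex_t mutex;
    unsigned long acquisitions;  /* updated only while mutex is held */
};

struct lg_lock {
    struct lg_slot slot[NR_SLOTS];
};

static void lg_init(struct lg_lock *lg)
{
    for (int i = 0; i < NR_SLOTS; i++) {
        pthread_mutex_init(&lg->slot[i].mutex, NULL);
        lg->slot[i].acquisitions = 0;
    }
}

/* Reader: takes only the lock of its own slot, which is cheap. */
static void lg_read_lock(struct lg_lock *lg, int slot)
{
    assert(slot >= 0 && slot < NR_SLOTS);
    pthread_mutex_lock(&lg->slot[slot].mutex);
    lg->slot[slot].acquisitions++;
}

static void lg_read_unlock(struct lg_lock *lg, int slot)
{
    assert(slot >= 0 && slot < NR_SLOTS);
    pthread_mutex_unlock(&lg->slot[slot].mutex);
}

/*
 * Writer: takes every slot lock, always in the same order so that two
 * writers cannot deadlock. While it holds them, every reader on every
 * slot has to wait.
 */
static void lg_write_lock(struct lg_lock *lg)
{
    for (int i = 0; i < NR_SLOTS; i++) {
        pthread_mutex_lock(&lg->slot[i].mutex);
        lg->slot[i].acquisitions++;
    }
}

static void lg_write_unlock(struct lg_lock *lg)
{
    for (int i = NR_SLOTS - 1; i >= 0; i--)
        pthread_mutex_unlock(&lg->slot[i].mutex);
}

/* Sum of all slot acquisitions; call only while no one holds the lock. */
static unsigned long lg_total_acquisitions(struct lg_lock *lg)
{
    unsigned long total = 0;

    for (int i = 0; i < NR_SLOTS; i++)
        total += lg->slot[i].acquisitions;
    return total;
}
```

One read lock costs a single acquisition while one write lock costs
NR_SLOTS of them and stalls all readers for its duration. That is the
asymmetry behind the advice above: back-to-back unmounts keep the write
side busy, and every path lookup on every CPU queues up behind it.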