From mboxrd@z Thu Jan  1 00:00:00 1970
From: Paweł Sikora <pluto@pld-linux.org>
Subject: Re: [2.6.38-3.x] [BUG] soft lockup - CPU#X stuck for 23s! (vfs, autofs, vserver)
Date: Thu, 15 Nov 2012 19:48:10 +0100
Message-ID: <3506450.k3Q223DJQc@localhost>
References: <5092540.GORQ1kUuNX@localhost> <87sja7uvy1.fsf@xmission.com> <20120925050558.GA14685@MAIL.13thfloor.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	arekm@pld-linux.org, baggins@pld-linux.org,
	Daniel Hokka Zakrisson <daniel@hozac.com>
To: Herbert Poetzl <herbert@13thfloor.at>
Return-path: <linux-kernel-owner@vger.kernel.org>
In-Reply-To: <20120925050558.GA14685@MAIL.13thfloor.at>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Tuesday 25 of September 2012 07:05:59 Herbert Poetzl wrote:
> On Mon, Sep 24, 2012 at 11:17:42AM -0700, Eric W. Biederman wrote:
> > Herbert Poetzl <herbert@13thfloor.at> writes:
>
> >> On Mon, Sep 24, 2012 at 07:23:55AM +0200, Paweł Sikora wrote:
> >>> On Sunday 23 of September 2012 18:10:30 Linus Torvalds wrote:
> >>>> On Sat, Sep 22, 2012 at 11:09 PM, Paweł Sikora <pluto@pld-linux.org> wrote:
>
> >>>>>         br_read_lock(vfsmount_lock);
>
> >>>> The vfsmount_lock is a "local-global" lock, where a read-lock
> >>>> is rather cheap and takes just a per-cpu lock, but the
> >>>> downside is that a write-lock is *very* expensive, and can
> >>>> cause serious trouble.
>
> >>>> And the write lock is taken by the [un]mount() paths. Do *not*
> >>>> do crazy things. If you do some insane "unmount and remount
> >>>> autofs" on a 1s granularity, you're doing insane things.
>
> >>>> Why do you have that 1s timeout? Insane.
>
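To make the cost asymmetry described above concrete, here is a minimal
user-space sketch of a "local-global" lock. It is not the kernel's lglock
code: NR_SLOTS, struct brlock and the brlock_*() helpers are names made up
for this illustration, and the per-CPU locks are modelled with C11
atomic_flag spinlocks. A reader touches one slot, a writer has to take all
of them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_SLOTS 8  /* stand-in for the number of CPUs (assumption) */

/* One spinlock per slot, aligned so slots do not share a cache line. */
struct brlock_slot {
    _Alignas(64) atomic_flag held;
};

struct brlock {
    struct brlock_slot slot[NR_SLOTS];
};

void brlock_init(struct brlock *l)
{
    for (int i = 0; i < NR_SLOTS; i++)
        atomic_flag_clear(&l->slot[i].held);
}

static void slot_lock(struct brlock_slot *s)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&s->held, memory_order_acquire))
        ;
}

static void slot_unlock(struct brlock_slot *s)
{
    atomic_flag_clear_explicit(&s->held, memory_order_release);
}

/* Reader: takes only the slot of the CPU it runs on. Returns slots taken. */
int brlock_read_lock(struct brlock *l, int cpu)
{
    assert(cpu >= 0 && cpu < NR_SLOTS);
    slot_lock(&l->slot[cpu]);
    return 1;
}

/* Non-blocking reader attempt: true if the slot was free and is now held. */
bool brlock_read_trylock(struct brlock *l, int cpu)
{
    assert(cpu >= 0 && cpu < NR_SLOTS);
    return !atomic_flag_test_and_set_explicit(&l->slot[cpu].held,
                                              memory_order_acquire);
}

void brlock_read_unlock(struct brlock *l, int cpu)
{
    assert(cpu >= 0 && cpu < NR_SLOTS);
    slot_unlock(&l->slot[cpu]);
}

/* Writer: takes every slot, always in ascending order. Returns slots taken. */
int brlock_write_lock(struct brlock *l)
{
    int taken = 0;

    for (int i = 0; i < NR_SLOTS; i++) {
        slot_lock(&l->slot[i]);
        taken++;
    }
    return taken;
}

void brlock_write_unlock(struct brlock *l)
{
    for (int i = 0; i < NR_SLOTS; i++)
        slot_unlock(&l->slot[i]);
}
```

With this shape, a writer that shows up back to back (for example frequent
unmounts) blocks readers on every slot at once, which would be consistent
with all the br_read_lock() callers piling up.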
> >>> 1s unmount timeout is *only* for fast bug reproduction (in few
> >>> seconds after opteron startup) and testing potential patches.
> >>> normally with 60s timeout it happens in few minutes..hours
> >>> (depends on machine i/o+cpu load) and makes server unusable
> >>> (permanent soft-lockup).
>
> >>> can we redesign vserver's mnt_is_reachable() for better locking
> >>> to avoid total soft-lockup?
>
> >> currently we do:
>
> >>         br_read_lock(&vfsmount_lock);
> >>         root = current->fs->root;
> >>         root_mnt = real_mount(root.mnt);
> >>         point = root.dentry;
>
> >>         while ((mnt != mnt->mnt_parent) && (mnt != root_mnt)) {
> >>                 point = mnt->mnt_mountpoint;
> >>                 mnt = mnt->mnt_parent;
> >>         }
>
> >>         ret = (mnt == root_mnt) && is_subdir(point, root.dentry);
> >>         br_read_unlock(&vfsmount_lock);
>
> >> and we have been considering moving the br_read_unlock()
> >> right before the is_subdir() call
>
> >> if there are any suggestions how to achieve the same
> >> with less locking I'm all ears ...
>
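The variant mentioned above, with the unlock moved in front of is_subdir(),
can be modelled in user space roughly as follows. The structures are
stripped-down stand-ins, not the kernel layouts, and model_read_lock(),
model_read_unlock(), is_subdir_model() and mnt_is_reachable_model() are
hypothetical names. The sketch only shows the control flow; whether the
mountpoint dentry can be used safely after dropping vfsmount_lock is not
something it answers.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel structures (assumption). */
struct dentry {
    struct dentry *d_parent;        /* a root dentry points to itself */
};

struct mount {
    struct mount  *mnt_parent;      /* a top-level mount points to itself */
    struct dentry *mnt_mountpoint;  /* dentry in the parent it is mounted on */
};

/* Stand-in for br_read_lock()/br_read_unlock(): only tracks nesting. */
int lock_held;

void model_read_lock(void)
{
    lock_held++;
}

void model_read_unlock(void)
{
    assert(lock_held > 0);
    lock_held--;
}

/* Model of is_subdir(): true if child is ancestor or lies below it. */
bool is_subdir_model(const struct dentry *child, const struct dentry *ancestor)
{
    for (;;) {
        if (child == ancestor)
            return true;
        if (child == child->d_parent)
            return false;           /* reached the top without a match */
        child = child->d_parent;
    }
}

bool mnt_is_reachable_model(struct mount *mnt, struct mount *root_mnt,
                            struct dentry *root_dentry)
{
    struct dentry *point = root_dentry;
    bool same_tree;

    model_read_lock();
    /* Walk up the mount tree until we hit our root mount or the very top. */
    while (mnt != mnt->mnt_parent && mnt != root_mnt) {
        point = mnt->mnt_mountpoint;
        mnt = mnt->mnt_parent;
    }
    same_tree = (mnt == root_mnt);
    model_read_unlock();            /* released before the dentry walk */

    return same_tree && is_subdir_model(point, root_dentry);
}
```

The only difference from the current code is that the read lock covers the
mount walk alone and the dentry comparison runs outside of it.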
> > Herbert, why do you need to filter the mounts that show up in a
> > mount namespace at all?
>
> that is actually a really good question!
>
> > I would think a far more performant and simpler solution would
> > be to just use mount namespaces without unwanted mounts.
>
> we had this mechanism for many years, long before the
> mount namespaces existed, and I vaguely remember that
> early versions didn't get the proc entries right either
>
> I took a quick look at the code and I think we can drop
> the mnt_is_reachable() check and/or make it conditional
> on setups without a mount namespace in place in the near
> future (thanks for the input, really appreciated!)

Hi,

Herbert, can I just drop the mnt_is_reachable() check from the vserver patch?
This issue hasn't been solved for several months now. I can live without this
problematic security-through-obscurity feature on my heavily loaded machines.


> > I'd like to blame this on the silly rcu_barrier in
> > deactivate_locked_super that should really be in the module
> > remove path, but that happens after we drop the br_write_lock.
>
> > The kernel takes br_read_lock(&vfsmount_lock) during every rcu
> > path lookup so mnt_is_reachable isn't particularly crazy just for
> > taking the lock.
>
> > I am with Linus on this one. Paweł, even 60s for your mount
> > timeout looks too short for your workload. All of the readers
> > that take br_read_lock(&vfsmount_lock) seem to be showing up in
> > your oops. The only thing that seems to make sense is you have
> > a lot of unmount activity running back to back, keeping the
> > lock write held.
>
> > The only other possible culprit I can see is that it looks like
> > mnt_is_reachable changes reading /proc/mounts to be something
> > worse than linear in the number of mounts and reading /proc/mounts
> > starts taking the vfsmount_lock.  All minor things but when you
> > are pushing things hard they look like things that would add up.
>
> > Eric
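On the "worse than linear" point: assuming the check runs once per mount
shown in /proc/mounts, the total work depends on how deeply the mounts are
nested. A small user-space model (struct mnt_node, walk_steps() and
listing_cost() are made-up names, not kernel code) counts the parent-walk
steps for a flat layout and for a fully nested chain:

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal stand-in for a mount: only the parent link matters here. */
struct mnt_node {
    struct mnt_node *mnt_parent;    /* the top-level mount points to itself */
};

/* Number of parent hops needed to get from mnt up to root_mnt. */
size_t walk_steps(const struct mnt_node *mnt, const struct mnt_node *root_mnt)
{
    size_t steps = 0;

    while (mnt != mnt->mnt_parent && mnt != root_mnt) {
        mnt = mnt->mnt_parent;
        steps++;
    }
    return steps;
}

/*
 * Total hops when the walk is done once for each of n mounts.
 * nested != 0: mount i is mounted on mount i - 1 (one long chain).
 * nested == 0: every mount hangs directly off mount 0.
 */
size_t listing_cost(size_t n, int nested)
{
    struct mnt_node *m;
    size_t total = 0;

    if (n == 0)
        return 0;
    m = calloc(n, sizeof(*m));
    if (!m)
        return 0;

    m[0].mnt_parent = &m[0];
    for (size_t i = 1; i < n; i++)
        m[i].mnt_parent = nested ? &m[i - 1] : &m[0];

    for (size_t i = 0; i < n; i++)
        total += walk_steps(&m[i], &m[0]);

    free(m);
    return total;
}
```

For 1000 mounts this gives 999 hops when everything hangs directly off the
root mount and 499500 hops for a single chain, so in this model the
quadratic case needs deep nesting.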