* [2.6.38-3.x] [BUG] soft lockup - CPU#X stuck for 23s! (vfs, autofs, vserver)
  From: Paweł Sikora @ 2012-09-23  6:09 UTC
  To: linux-kernel; +Cc: linux-fsdevel, torvalds, arekm, baggins

Hi,

a long time ago I reported an ugly soft lockup on a heavily loaded Opteron
system, without usable backtraces :( Recently, on 3.4.6, I logged backtraces
from all 16 cores via serial console; they show some kind of vfs lock
(http://pluto.agmk.net/kernel/oops.txt).

The lockup occurs on a heavily loaded system with frequent autofs unmounts
(timeout=1s) and the vserver patch applied. I've isolated the vserver function
that exposes the problem, mnt_is_reachable()
(http://vserver.13thfloor.at/Experimental/patch-3.4.6-vs2.3.3.6.diff):

static int mnt_is_reachable(struct vfsmount *vfsmnt)
{
	struct path root;
	struct dentry *point;
	struct mount *mnt = real_mount(vfsmnt);
	struct mount *root_mnt;
	int ret;

	if (mnt == mnt->mnt_ns->root)
		return 1;

	br_read_lock(vfsmount_lock);
	root = current->fs->root;
	root_mnt = real_mount(root.mnt);
	point = root.dentry;

	while ((mnt != mnt->mnt_parent) && (mnt != root_mnt)) {
		point = mnt->mnt_mountpoint;
		mnt = mnt->mnt_parent;
	}

	ret = (mnt == root_mnt) && is_subdir(point, root.dentry);

	br_read_unlock(vfsmount_lock);

	return ret;
}

The vserver developer (Herbert Poetzl) said that the locking scheme used in
this function looks correct, so it might be a hidden vfs bug accidentally
exposed by the vserver patch. I'm not a kernel developer, so I'd like to ask
you for help in solving this problem. AFAICS the problem starts with the
2.6.38 release, which introduced some major vfs changes.

Please CC me on reply.

BR,
Paweł.
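For context, the timeout=1s above is the automount expiry timeout: every
expiry that actually unmounts something goes through the umount path discussed
below. A hypothetical map entry producing that behaviour (the reporter's
actual autofs configuration is not shown in the thread) would look like:

	# /etc/auto.master -- hypothetical entry: expire idle mounts after 1s
	/misc	/etc/auto.misc	--timeout=1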
* Re: [2.6.38-3.x] [BUG] soft lockup - CPU#X stuck for 23s! (vfs, autofs, vserver)
  From: Linus Torvalds @ 2012-09-24  1:10 UTC
  To: Paweł Sikora; +Cc: linux-kernel, linux-fsdevel, arekm, baggins

On Sat, Sep 22, 2012 at 11:09 PM, Paweł Sikora <pluto@pld-linux.org> wrote:
>
> br_read_lock(vfsmount_lock);

The vfsmount_lock is a "local-global" lock, where a read-lock is
rather cheap and takes just a per-cpu lock, but the downside is that a
write-lock is *very* expensive, and can cause serious trouble.

And the write lock is taken by the [un]mount() paths. Do *not* do
crazy things. If you do some insane "unmount and remount autofs" on a
1s granularity, you're doing insane things.

Why do you have that 1s timeout? Insane.

               Linus
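A "local-global" lock keeps one spinlock per CPU: a reader takes only its own
CPU's lock, while a writer must sweep and hold all of them. The following is a
minimal userspace sketch of the idea only, not the kernel's actual
implementation (see include/linux/lglock.h in kernels of this era); NR_CPUS
and the pthread mapping are chosen purely for illustration:

	#include <pthread.h>

	#define NR_CPUS 16	/* illustrative; the kernel sizes this per machine */

	static pthread_spinlock_t cpu_lock[NR_CPUS];

	static void lg_init(void)
	{
		for (int i = 0; i < NR_CPUS; i++)
			pthread_spin_init(&cpu_lock[i], PTHREAD_PROCESS_PRIVATE);
	}

	/* Read side: one (usually uncontended) per-CPU lock -- cheap. */
	static void lg_read_lock(int cpu)   { pthread_spin_lock(&cpu_lock[cpu]); }
	static void lg_read_unlock(int cpu) { pthread_spin_unlock(&cpu_lock[cpu]); }

	/* Write side: acquire every CPU's lock -- expensive, and it stalls
	 * all new readers on all CPUs until the matching unlock. */
	static void lg_write_lock(void)
	{
		for (int i = 0; i < NR_CPUS; i++)
			pthread_spin_lock(&cpu_lock[i]);
	}

	static void lg_write_unlock(void)
	{
		for (int i = NR_CPUS - 1; i >= 0; i--)
			pthread_spin_unlock(&cpu_lock[i]);
	}

This asymmetry is why back-to-back unmounts, each taking the write lock, can
keep every read-locker on the machine spinning.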
* Re: [2.6.38-3.x] [BUG] soft lockup - CPU#X stuck for 23s! (vfs, autofs, vserver)
  From: Paweł Sikora @ 2012-09-24  5:23 UTC
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, arekm, baggins, herbert

On Sunday 23 of September 2012 18:10:30, Linus Torvalds wrote:
> On Sat, Sep 22, 2012 at 11:09 PM, Paweł Sikora <pluto@pld-linux.org> wrote:
> >
> > br_read_lock(vfsmount_lock);
>
> The vfsmount_lock is a "local-global" lock, where a read-lock is
> rather cheap and takes just a per-cpu lock, but the downside is that a
> write-lock is *very* expensive, and can cause serious trouble.
>
> And the write lock is taken by the [un]mount() paths. Do *not* do
> crazy things. If you do some insane "unmount and remount autofs" on a
> 1s granularity, you're doing insane things.
>
> Why do you have that 1s timeout? Insane.

The 1s unmount timeout is *only* for fast bug reproduction (within a few
seconds of Opteron startup) and for testing potential patches. Normally, with
a 60s timeout, it happens within minutes to hours (depending on machine
I/O + CPU load) and makes the server unusable (permanent soft lockup).

Can we redesign vserver's mnt_is_reachable() to use better locking and avoid
the total soft lockup?

BR,
Paweł.

PS: I'm adding Herbert to CC.
* Re: [2.6.38-3.x] [BUG] soft lockup - CPU#X stuck for 23s! (vfs, autofs, vserver)
  From: Herbert Poetzl @ 2012-09-24 11:23 UTC
  To: Paweł Sikora; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, arekm, baggins

On Mon, Sep 24, 2012 at 07:23:55AM +0200, Paweł Sikora wrote:
> On Sunday 23 of September 2012 18:10:30, Linus Torvalds wrote:
>> On Sat, Sep 22, 2012 at 11:09 PM, Paweł Sikora <pluto@pld-linux.org> wrote:
>>> br_read_lock(vfsmount_lock);

>> The vfsmount_lock is a "local-global" lock, where a read-lock
>> is rather cheap and takes just a per-cpu lock, but the
>> downside is that a write-lock is *very* expensive, and can
>> cause serious trouble.

>> And the write lock is taken by the [un]mount() paths. Do *not*
>> do crazy things. If you do some insane "unmount and remount
>> autofs" on a 1s granularity, you're doing insane things.

>> Why do you have that 1s timeout? Insane.

> The 1s unmount timeout is *only* for fast bug reproduction (within
> a few seconds of Opteron startup) and for testing potential patches.
> Normally, with a 60s timeout, it happens within minutes to hours
> (depending on machine I/O + CPU load) and makes the server unusable
> (permanent soft lockup).

> Can we redesign vserver's mnt_is_reachable() to use better locking
> and avoid the total soft lockup?

Currently we do:

	br_read_lock(&vfsmount_lock);
	root = current->fs->root;
	root_mnt = real_mount(root.mnt);
	point = root.dentry;

	while ((mnt != mnt->mnt_parent) && (mnt != root_mnt)) {
		point = mnt->mnt_mountpoint;
		mnt = mnt->mnt_parent;
	}

	ret = (mnt == root_mnt) && is_subdir(point, root.dentry);
	br_read_unlock(&vfsmount_lock);

and we have been considering moving the br_read_unlock() to right
before the is_subdir() call.

If there are any suggestions on how to achieve the same with less
locking, I'm all ears ...

best,
Herbert

> BR,
> Paweł.
> PS: I'm adding Herbert to CC.
* Re: [2.6.38-3.x] [BUG] soft lockup - CPU#X stuck for 23s! (vfs, autofs, vserver)
  From: Linus Torvalds @ 2012-09-24 17:22 UTC
  To: Paweł Sikora, linux-kernel, linux-fsdevel, arekm, baggins

On Mon, Sep 24, 2012 at 4:23 AM, Herbert Poetzl <herbert@13thfloor.at> wrote:
>
> currently we do:
>
>         br_read_lock(&vfsmount_lock);
>         root = current->fs->root;
>         root_mnt = real_mount(root.mnt);
>         point = root.dentry;
>
>         while ((mnt != mnt->mnt_parent) && (mnt != root_mnt)) {
>                 point = mnt->mnt_mountpoint;
>                 mnt = mnt->mnt_parent;
>         }
>
>         ret = (mnt == root_mnt) && is_subdir(point, root.dentry);
>         br_read_unlock(&vfsmount_lock);
>
> and we have been considering moving the br_read_unlock()
> to right before the is_subdir() call

So the read lock itself should not cause problems. We have tons of
high-frequency read-lockers, over quite big areas. But exactly because the
read-lockers are so high-frequency, I'd expect any problems to be *found* by
read-lockers, while the *cause* would likely be:

 - missing unlocks
 - write-locks
 - recursion on the locks (including read-locks: they are actually
   per-cpu spinlocks)

but I'd have expected lockdep to find anything obvious like that.

If the locking itself is fine, maybe the loop above (or the is_subdir()) is
infinite due to mnt->mnt_parent somehow becoming a circular loop. Maybe due to
corrupt data structures traversed inside the lock, causing infinite loops...

I really don't know the vserver patches; it would be much more interesting if
you could recreate the problem using a standard kernel.

               Linus
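One cheap way to test the circular-chain theory above (a debugging sketch
only, not a proposed fix; MNT_WALK_MAX is an arbitrary illustrative bound) is
to cap the parent walk in mnt_is_reachable() and warn when the cap is hit:

	#define MNT_WALK_MAX 4096	/* arbitrary debug bound */

	int depth = 0;

	while ((mnt != mnt->mnt_parent) && (mnt != root_mnt)) {
		/* A corrupted, circular mnt_parent chain now triggers a
		 * one-time warning instead of spinning forever under the
		 * read lock and tripping the soft-lockup detector. */
		if (WARN_ON_ONCE(++depth > MNT_WALK_MAX))
			break;
		point = mnt->mnt_mountpoint;
		mnt = mnt->mnt_parent;
	}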
* Re: [2.6.38-3.x] [BUG] soft lockup - CPU#X stuck for 23s! (vfs, autofs, vserver)
  From: Eric W. Biederman @ 2012-09-24 18:17 UTC
  To: Paweł Sikora; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, arekm, baggins, Herbert Poetzl

Herbert Poetzl <herbert@13thfloor.at> writes:

> On Mon, Sep 24, 2012 at 07:23:55AM +0200, Paweł Sikora wrote:
>> On Sunday 23 of September 2012 18:10:30, Linus Torvalds wrote:
>>> On Sat, Sep 22, 2012 at 11:09 PM, Paweł Sikora <pluto@pld-linux.org> wrote:
>
>>>> br_read_lock(vfsmount_lock);
>
>>> The vfsmount_lock is a "local-global" lock, where a read-lock
>>> is rather cheap and takes just a per-cpu lock, but the
>>> downside is that a write-lock is *very* expensive, and can
>>> cause serious trouble.
>
>>> And the write lock is taken by the [un]mount() paths. Do *not*
>>> do crazy things. If you do some insane "unmount and remount
>>> autofs" on a 1s granularity, you're doing insane things.
>
>>> Why do you have that 1s timeout? Insane.
>
>> The 1s unmount timeout is *only* for fast bug reproduction (within
>> a few seconds of Opteron startup) and for testing potential patches.
>> Normally, with a 60s timeout, it happens within minutes to hours
>> (depending on machine I/O + CPU load) and makes the server unusable
>> (permanent soft lockup).
>
>> Can we redesign vserver's mnt_is_reachable() to use better locking
>> and avoid the total soft lockup?
>
> Currently we do:
>
> 	br_read_lock(&vfsmount_lock);
> 	root = current->fs->root;
> 	root_mnt = real_mount(root.mnt);
> 	point = root.dentry;
>
> 	while ((mnt != mnt->mnt_parent) && (mnt != root_mnt)) {
> 		point = mnt->mnt_mountpoint;
> 		mnt = mnt->mnt_parent;
> 	}
>
> 	ret = (mnt == root_mnt) && is_subdir(point, root.dentry);
> 	br_read_unlock(&vfsmount_lock);
>
> and we have been considering moving the br_read_unlock()
> to right before the is_subdir() call
>
> If there are any suggestions on how to achieve the same
> with less locking, I'm all ears ...

Herbert, why do you need to filter the mounts that show up in a mount
namespace at all? I would think a far more performant and simpler solution
would be to just use mount namespaces without the unwanted mounts.

I'd like to blame this on the silly rcu_barrier in deactivate_locked_super
that should really be in the module remove path, but that happens after we
drop the br_write_lock. The kernel takes br_read_lock(&vfsmount_lock) during
every rcu path lookup, so mnt_is_reachable isn't particularly crazy just for
taking the lock.

I am with Linus on this one. Paweł, even 60s looks too short a mount timeout
for your workload. All of the readers that take br_read_lock(&vfsmount_lock)
seem to be showing up in your oops. The only thing that seems to make sense is
that you have a lot of unmount activity running back to back, keeping the lock
write-held.

The only other possible culprit I can see is that it looks like
mnt_is_reachable makes reading /proc/mounts something worse than linear in the
number of mounts, and reading /proc/mounts starts taking the vfsmount_lock.
All minor things, but when you are pushing things hard they look like things
that would add up.

Eric
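Eric's suggestion amounts to giving each guest a mount namespace that simply
never contains the other guests' mounts, so nothing needs to be filtered per
read. A minimal userspace sketch of that direction (the mount point path is
hypothetical, and error handling is trimmed):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/* Give this process (and its children) a private copy of
		 * the mount tree. */
		if (unshare(CLONE_NEWNS) < 0) {
			perror("unshare(CLONE_NEWNS)");
			return 1;
		}

		/* Stop the umount below from propagating back to the host
		 * (matters on systems with shared mount propagation). */
		if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0) {
			perror("mount(MS_PRIVATE)");
			return 1;
		}

		/* "/srv/other-guest" is a hypothetical mount point the guest
		 * should not see; detach it in this namespace only. */
		if (umount2("/srv/other-guest", MNT_DETACH) < 0)
			perror("umount2");

		return 0;
	}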
* Re: [2.6.38-3.x] [BUG] soft lockup - CPU#X stuck for 23s! (vfs, autofs, vserver)
  From: Herbert Poetzl @ 2012-09-25  5:05 UTC
  To: Eric W. Biederman; +Cc: Paweł Sikora, Linus Torvalds, linux-kernel, linux-fsdevel, arekm, baggins, Daniel Hokka Zakrisson

On Mon, Sep 24, 2012 at 11:17:42AM -0700, Eric W. Biederman wrote:
> Herbert Poetzl <herbert@13thfloor.at> writes:
>> On Mon, Sep 24, 2012 at 07:23:55AM +0200, Paweł Sikora wrote:
>>> On Sunday 23 of September 2012 18:10:30, Linus Torvalds wrote:
>>>> On Sat, Sep 22, 2012 at 11:09 PM, Paweł Sikora <pluto@pld-linux.org> wrote:
>>>>> br_read_lock(vfsmount_lock);

>>>> The vfsmount_lock is a "local-global" lock, where a read-lock
>>>> is rather cheap and takes just a per-cpu lock, but the
>>>> downside is that a write-lock is *very* expensive, and can
>>>> cause serious trouble.

>>>> And the write lock is taken by the [un]mount() paths. Do *not*
>>>> do crazy things. If you do some insane "unmount and remount
>>>> autofs" on a 1s granularity, you're doing insane things.

>>>> Why do you have that 1s timeout? Insane.

>>> The 1s unmount timeout is *only* for fast bug reproduction (within
>>> a few seconds of Opteron startup) and for testing potential patches.
>>> Normally, with a 60s timeout, it happens within minutes to hours
>>> (depending on machine I/O + CPU load) and makes the server unusable
>>> (permanent soft lockup).

>>> Can we redesign vserver's mnt_is_reachable() to use better locking
>>> and avoid the total soft lockup?

>> Currently we do:

>> 	br_read_lock(&vfsmount_lock);
>> 	root = current->fs->root;
>> 	root_mnt = real_mount(root.mnt);
>> 	point = root.dentry;

>> 	while ((mnt != mnt->mnt_parent) && (mnt != root_mnt)) {
>> 		point = mnt->mnt_mountpoint;
>> 		mnt = mnt->mnt_parent;
>> 	}

>> 	ret = (mnt == root_mnt) && is_subdir(point, root.dentry);
>> 	br_read_unlock(&vfsmount_lock);

>> and we have been considering moving the br_read_unlock()
>> to right before the is_subdir() call

>> If there are any suggestions on how to achieve the same
>> with less locking, I'm all ears ...

> Herbert, why do you need to filter the mounts that show up in a
> mount namespace at all?

That is actually a really good question!

> I would think a far more performant and simpler solution would
> be to just use mount namespaces without the unwanted mounts.

We had this mechanism for many years, long before mount namespaces
existed, and I vaguely remember that early versions didn't get the
proc entries right either.

I took a quick look at the code, and I think we can drop the
mnt_is_reachable() check and/or make it conditional on setups without
a mount namespace in place in the near future (thanks for the input,
really appreciated!).

> I'd like to blame this on the silly rcu_barrier in
> deactivate_locked_super that should really be in the module
> remove path, but that happens after we drop the br_write_lock.

> The kernel takes br_read_lock(&vfsmount_lock) during every rcu
> path lookup, so mnt_is_reachable isn't particularly crazy just for
> taking the lock.

> I am with Linus on this one. Paweł, even 60s looks too short a
> mount timeout for your workload. All of the readers that take
> br_read_lock(&vfsmount_lock) seem to be showing up in your oops.
> The only thing that seems to make sense is that you have a lot of
> unmount activity running back to back, keeping the lock write-held.

> The only other possible culprit I can see is that it looks like
> mnt_is_reachable makes reading /proc/mounts something worse than
> linear in the number of mounts, and reading /proc/mounts starts
> taking the vfsmount_lock. All minor things, but when you are pushing
> things hard they look like things that would add up.

> Eric
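One conceivable shape for the conditional Herbert describes (a hypothetical,
untested sketch against ~3.4-era internals; __mnt_walk_is_reachable stands in
for the existing parent-walk code):

	#include <linux/nsproxy.h>

	static int mnt_is_reachable(struct vfsmount *vfsmnt)
	{
		/* A guest confined to its own mount namespace only ever
		 * sees its own mounts, so the reachability walk (and its
		 * br_read_lock traffic) would be redundant there. */
		if (current->nsproxy->mnt_ns != init_nsproxy.mnt_ns)
			return 1;

		return __mnt_walk_is_reachable(vfsmnt);
	}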
* Re: [2.6.38-3.x] [BUG] soft lockup - CPU#X stuck for 23s! (vfs, autofs, vserver)
  From: Paweł Sikora @ 2012-11-15 18:48 UTC
  To: Herbert Poetzl; +Cc: Eric W. Biederman, Linus Torvalds, linux-kernel, linux-fsdevel, arekm, baggins, Daniel Hokka Zakrisson

On Tuesday 25 of September 2012 07:05:59, Herbert Poetzl wrote:
> On Mon, Sep 24, 2012 at 11:17:42AM -0700, Eric W. Biederman wrote:
> > Herbert Poetzl <herbert@13thfloor.at> writes:
> >> On Mon, Sep 24, 2012 at 07:23:55AM +0200, Paweł Sikora wrote:
> >>> On Sunday 23 of September 2012 18:10:30, Linus Torvalds wrote:
> >>>> On Sat, Sep 22, 2012 at 11:09 PM, Paweł Sikora <pluto@pld-linux.org> wrote:
>
> >>>>> br_read_lock(vfsmount_lock);
>
> >>>> The vfsmount_lock is a "local-global" lock, where a read-lock
> >>>> is rather cheap and takes just a per-cpu lock, but the
> >>>> downside is that a write-lock is *very* expensive, and can
> >>>> cause serious trouble.
>
> >>>> And the write lock is taken by the [un]mount() paths. Do *not*
> >>>> do crazy things. If you do some insane "unmount and remount
> >>>> autofs" on a 1s granularity, you're doing insane things.
>
> >>>> Why do you have that 1s timeout? Insane.
>
> >>> The 1s unmount timeout is *only* for fast bug reproduction (within
> >>> a few seconds of Opteron startup) and for testing potential patches.
> >>> Normally, with a 60s timeout, it happens within minutes to hours
> >>> (depending on machine I/O + CPU load) and makes the server unusable
> >>> (permanent soft lockup).
>
> >>> Can we redesign vserver's mnt_is_reachable() to use better locking
> >>> and avoid the total soft lockup?
>
> >> Currently we do:
>
> >> 	br_read_lock(&vfsmount_lock);
> >> 	root = current->fs->root;
> >> 	root_mnt = real_mount(root.mnt);
> >> 	point = root.dentry;
>
> >> 	while ((mnt != mnt->mnt_parent) && (mnt != root_mnt)) {
> >> 		point = mnt->mnt_mountpoint;
> >> 		mnt = mnt->mnt_parent;
> >> 	}
>
> >> 	ret = (mnt == root_mnt) && is_subdir(point, root.dentry);
> >> 	br_read_unlock(&vfsmount_lock);
>
> >> and we have been considering moving the br_read_unlock()
> >> to right before the is_subdir() call
>
> >> If there are any suggestions on how to achieve the same
> >> with less locking, I'm all ears ...
>
> > Herbert, why do you need to filter the mounts that show up in a
> > mount namespace at all?
>
> That is actually a really good question!
>
> > I would think a far more performant and simpler solution would
> > be to just use mount namespaces without the unwanted mounts.
>
> We had this mechanism for many years, long before mount namespaces
> existed, and I vaguely remember that early versions didn't get the
> proc entries right either.
>
> I took a quick look at the code, and I think we can drop the
> mnt_is_reachable() check and/or make it conditional on setups without
> a mount namespace in place in the near future (thanks for the input,
> really appreciated!).

Hi,

Herbert, can I just drop this mnt_is_reachable() method from the vserver
patch? This issue hasn't been solved for several months now, and I can live
without this problematic security-through-obscurity feature on my heavily
loaded machines.

> > I'd like to blame this on the silly rcu_barrier in
> > deactivate_locked_super that should really be in the module
> > remove path, but that happens after we drop the br_write_lock.

> > The kernel takes br_read_lock(&vfsmount_lock) during every rcu
> > path lookup, so mnt_is_reachable isn't particularly crazy just for
> > taking the lock.

> > I am with Linus on this one. Paweł, even 60s looks too short a
> > mount timeout for your workload. All of the readers that take
> > br_read_lock(&vfsmount_lock) seem to be showing up in your oops.
> > The only thing that seems to make sense is that you have a lot of
> > unmount activity running back to back, keeping the lock write-held.

> > The only other possible culprit I can see is that it looks like
> > mnt_is_reachable makes reading /proc/mounts something worse than
> > linear in the number of mounts, and reading /proc/mounts starts
> > taking the vfsmount_lock. All minor things, but when you are pushing
> > things hard they look like things that would add up.

> > Eric
* Re: [2.6.38-3.x] [BUG] soft lockup - CPU#X stuck for 23s! (vfs, autofs, vserver)
  From: Herbert Poetzl @ 2012-11-15 19:22 UTC
  To: Paweł Sikora; +Cc: Eric W. Biederman, Linus Torvalds, linux-kernel, linux-fsdevel, arekm, baggins, Daniel Hokka Zakrisson

On Thu, Nov 15, 2012 at 07:48:10PM +0100, Paweł Sikora wrote:
> On Tuesday 25 of September 2012 07:05:59, Herbert Poetzl wrote:
>> On Mon, Sep 24, 2012 at 11:17:42AM -0700, Eric W. Biederman wrote:
>>> Herbert Poetzl <herbert@13thfloor.at> writes:
>>>> On Mon, Sep 24, 2012 at 07:23:55AM +0200, Paweł Sikora wrote:
>>>>> On Sunday 23 of September 2012 18:10:30, Linus Torvalds wrote:
>>>>>> On Sat, Sep 22, 2012 at 11:09 PM, Paweł Sikora <pluto@pld-linux.org> wrote:
>>>>>>> br_read_lock(vfsmount_lock);

>>>>>> The vfsmount_lock is a "local-global" lock, where a read-lock
>>>>>> is rather cheap and takes just a per-cpu lock, but the
>>>>>> downside is that a write-lock is *very* expensive, and can
>>>>>> cause serious trouble.

>>>>>> And the write lock is taken by the [un]mount() paths. Do *not*
>>>>>> do crazy things. If you do some insane "unmount and remount
>>>>>> autofs" on a 1s granularity, you're doing insane things.

>>>>>> Why do you have that 1s timeout? Insane.

>>>>> The 1s unmount timeout is *only* for fast bug reproduction (within
>>>>> a few seconds of Opteron startup) and for testing potential patches.
>>>>> Normally, with a 60s timeout, it happens within minutes to hours
>>>>> (depending on machine I/O + CPU load) and makes the server unusable
>>>>> (permanent soft lockup).

>>>>> Can we redesign vserver's mnt_is_reachable() to use better locking
>>>>> and avoid the total soft lockup?

>>>> Currently we do:

>>>> 	br_read_lock(&vfsmount_lock);
>>>> 	root = current->fs->root;
>>>> 	root_mnt = real_mount(root.mnt);
>>>> 	point = root.dentry;

>>>> 	while ((mnt != mnt->mnt_parent) && (mnt != root_mnt)) {
>>>> 		point = mnt->mnt_mountpoint;
>>>> 		mnt = mnt->mnt_parent;
>>>> 	}

>>>> 	ret = (mnt == root_mnt) && is_subdir(point, root.dentry);
>>>> 	br_read_unlock(&vfsmount_lock);

>>>> and we have been considering moving the br_read_unlock()
>>>> to right before the is_subdir() call

>>>> If there are any suggestions on how to achieve the same
>>>> with less locking, I'm all ears ...

>>> Herbert, why do you need to filter the mounts that show up in a
>>> mount namespace at all?

>> That is actually a really good question!

>>> I would think a far more performant and simpler solution would
>>> be to just use mount namespaces without the unwanted mounts.

>> We had this mechanism for many years, long before mount namespaces
>> existed, and I vaguely remember that early versions didn't get the
>> proc entries right either.

>> I took a quick look at the code, and I think we can drop the
>> mnt_is_reachable() check and/or make it conditional on setups without
>> a mount namespace in place in the near future (thanks for the input,
>> really appreciated!).

> Hi,

> Herbert, can I just drop this mnt_is_reachable() method from the
> vserver patch? This issue hasn't been solved for several months now,
> and I can live without this problematic security-through-obscurity
> feature on my heavily loaded machines.

Sure, if you are aware of the implications, you can simply remove
the check ...

best,
Herbert

>>> I'd like to blame this on the silly rcu_barrier in
>>> deactivate_locked_super that should really be in the module
>>> remove path, but that happens after we drop the br_write_lock.

>>> The kernel takes br_read_lock(&vfsmount_lock) during every rcu
>>> path lookup, so mnt_is_reachable isn't particularly crazy just for
>>> taking the lock.

>>> I am with Linus on this one. Paweł, even 60s looks too short a
>>> mount timeout for your workload. All of the readers that take
>>> br_read_lock(&vfsmount_lock) seem to be showing up in your oops.
>>> The only thing that seems to make sense is that you have a lot of
>>> unmount activity running back to back, keeping the lock write-held.

>>> The only other possible culprit I can see is that it looks like
>>> mnt_is_reachable makes reading /proc/mounts something worse than
>>> linear in the number of mounts, and reading /proc/mounts starts
>>> taking the vfsmount_lock. All minor things, but when you are pushing
>>> things hard they look like things that would add up.

>>> Eric