public inbox for linux-kernel@vger.kernel.org
* [cgroup or VFS ?]  WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
@ 2009-01-05  3:23 Li Zefan
  2009-02-09  8:40 ` Andrew Morton
  0 siblings, 1 reply; 26+ messages in thread
From: Li Zefan @ 2009-01-05  3:23 UTC (permalink / raw)
  To: LKML; +Cc: Andrew Morton, Paul Menage, Al Viro

Thread 1:
  for ((; ;))
  {
      mount -t cgroup -o cpuset xxx /mnt > /dev/null 2>&1
      mkdir /mnt/0 > /dev/null 2>&1
      rmdir /mnt/0 > /dev/null 2>&1
      umount /mnt > /dev/null 2>&1
  }

Thread 2:
  for ((; ;))
  {
      mount -t cpuset xxx /mnt > /dev/null 2>&1
      umount /mnt > /dev/null 2>&1
  }

(Note: Again it is irrelevant which cgroup subsys is used.)

After a while this showed up:

------------[ cut here ]------------
WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
Hardware name: Aspire SA85
Modules linked in: bridge stp llc autofs4 dm_mirror dm_region_hash dm_log dm_mod r8169 parport_pc mii parport sg button sata_sis pata_sis ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan]
Pid: 4745, comm: umount Not tainted 2.6.28 #479
Call Trace:
 [<c042bbe3>] warn_slowpath+0x79/0x8f
 [<c044babf>] ? __lock_acquire+0x69a/0x700
 [<c04ae44e>] ? mntput_no_expire+0x79/0xf2
 [<c04ae481>] mntput_no_expire+0xac/0xf2
 [<c04ae968>] sys_umount+0x26a/0x2b1
 [<c04ae9c1>] sys_oldumount+0x12/0x14
 [<c0403251>] sysenter_do_call+0x12/0x31
---[ end trace 79d0ab4bef01333f ]---

The WARNING is: WARN_ON(atomic_read(&mnt->__mnt_writers));

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [cgroup or VFS ?]  WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-01-05  3:23 [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2() Li Zefan
@ 2009-02-09  8:40 ` Andrew Morton
  2009-02-09  8:49   ` Li Zefan
  2009-02-09  9:34   ` Al Viro
  0 siblings, 2 replies; 26+ messages in thread
From: Andrew Morton @ 2009-02-09  8:40 UTC (permalink / raw)
  To: Li Zefan; +Cc: LKML, Paul Menage, Al Viro, containers, Arjan van de Ven

(cc's added)

On Mon, 05 Jan 2009 11:23:33 +0800 Li Zefan <lizf@cn.fujitsu.com> wrote:

> Thread 1:
>   for ((; ;))
>   {
>       mount -t cgroup -o cpuset xxx /mnt > /dev/null 2>&1
>       mkdir /mnt/0 > /dev/null 2>&1
>       rmdir /mnt/0 > /dev/null 2>&1
>       umount /mnt > /dev/null 2>&1
>   }
> 
> Thread 2:
>   for ((; ;))
>   {
>       mount -t cpuset xxx /mnt > /dev/null 2>&1
>       umount /mnt > /dev/null 2>&1
>   }
> 
> (Note: Again it is irrelevant which cgroup subsys is used.)
> 
> After a while this showed up:
> 
> ------------[ cut here ]------------
> WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
> Hardware name: Aspire SA85
> Modules linked in: bridge stp llc autofs4 dm_mirror dm_region_hash dm_log dm_mod r8169 parport_pc mii parport sg button sata_sis pata_sis ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan]
> Pid: 4745, comm: umount Not tainted 2.6.28 #479
> Call Trace:
>  [<c042bbe3>] warn_slowpath+0x79/0x8f
>  [<c044babf>] ? __lock_acquire+0x69a/0x700
>  [<c04ae44e>] ? mntput_no_expire+0x79/0xf2
>  [<c04ae481>] mntput_no_expire+0xac/0xf2
>  [<c04ae968>] sys_umount+0x26a/0x2b1
>  [<c04ae9c1>] sys_oldumount+0x12/0x14
>  [<c0403251>] sysenter_do_call+0x12/0x31
> ---[ end trace 79d0ab4bef01333f ]---
> 
> The WARNING is: WARN_ON(atomic_read(&mnt->__mnt_writers));

OK, I'm all confused.  Here we see a WARN_ON triggered, but in
http://lkml.org/lkml/2009/1/4/352 with the same testcase we're seeing a
lockdep warning.

You refer to Arjan's "lockdep: annotate sb ->s_umount" patch - but
that's over two years old.

And you say "The changelog said s_umount needs to be classified as
per-sb, but actually it made it as per-filesystem." But what is the
difference between per-sb and per-fs?

More info here: http://bugzilla.kernel.org/show_bug.cgi?id=12673

This bug report seems to be all over the place.

Is it a post-2.6.28 regression, btw?


* Re: [cgroup or VFS ?]  WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-09  8:40 ` Andrew Morton
@ 2009-02-09  8:49   ` Li Zefan
  2009-02-09 11:03     ` Al Viro
  2009-02-09  9:34   ` Al Viro
  1 sibling, 1 reply; 26+ messages in thread
From: Li Zefan @ 2009-02-09  8:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Paul Menage, Al Viro, containers, Arjan van de Ven

Andrew Morton wrote:
> (cc's added)
> 
> On Mon, 05 Jan 2009 11:23:33 +0800 Li Zefan <lizf@cn.fujitsu.com> wrote:
> 
>> Thread 1:
>>   for ((; ;))
>>   {
>>       mount -t cgroup -o cpuset xxx /mnt > /dev/null 2>&1
>>       mkdir /mnt/0 > /dev/null 2>&1
>>       rmdir /mnt/0 > /dev/null 2>&1
>>       umount /mnt > /dev/null 2>&1
>>   }
>>
>> Thread 2:
>>   for ((; ;))
>>   {
>>       mount -t cpuset xxx /mnt > /dev/null 2>&1
>>       umount /mnt > /dev/null 2>&1
>>   }
>>
>> (Note: Again it is irrelevant which cgroup subsys is used.)
>>
>> After a while this showed up:
>>
>> ------------[ cut here ]------------
>> WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
>> Hardware name: Aspire SA85
>> Modules linked in: bridge stp llc autofs4 dm_mirror dm_region_hash dm_log dm_mod r8169 parport_pc mii parport sg button sata_sis pata_sis ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan]
>> Pid: 4745, comm: umount Not tainted 2.6.28 #479
>> Call Trace:
>>  [<c042bbe3>] warn_slowpath+0x79/0x8f
>>  [<c044babf>] ? __lock_acquire+0x69a/0x700
>>  [<c04ae44e>] ? mntput_no_expire+0x79/0xf2
>>  [<c04ae481>] mntput_no_expire+0xac/0xf2
>>  [<c04ae968>] sys_umount+0x26a/0x2b1
>>  [<c04ae9c1>] sys_oldumount+0x12/0x14
>>  [<c0403251>] sysenter_do_call+0x12/0x31
>> ---[ end trace 79d0ab4bef01333f ]---
>>
>> The WARNING is: WARN_ON(atomic_read(&mnt->__mnt_writers));
> 
> OK, I'm all confused.  Here we see a WARN_ON triggered, but in
> http://lkml.org/lkml/2009/1/4/352 with the same testcase we're seeing a
> lockdep warning.
> 

They are two testcases with a small difference ;)

case 1:
mount
cat whichever control file
umount

case 2:
mount
mkdir /cgroup/0
rmdir /cgroup/0
umount

> You refer to Arjan's "lockdep: annotate sb ->s_umount" patch - but
> that's over two years old.
> 
> And you say "The changelog said s_umount needs to be classified as
> per-sb, but actually it made it as per-filesystem." But what is the
> difference between per-sb and per-fs?
> 

A filesystem type can have a single sb or multiple sbs, can't it? That's
struct super_block vs. struct file_system_type. I may be wrong here, since
I don't know much about VFS...

> More info here: http://bugzilla.kernel.org/show_bug.cgi?id=12673
> 
> This bug report seems to be all over the place.
> 
> Is it a post-2.6.28 regression, btw?
> 

I think the bug has existed since cgroup was introduced. But it's hard to trigger
in real life, though it's easy with this test case.




* Re: [cgroup or VFS ?]  WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-09  8:40 ` Andrew Morton
  2009-02-09  8:49   ` Li Zefan
@ 2009-02-09  9:34   ` Al Viro
  2009-02-09 11:30     ` Li Zefan
  2009-02-09 17:48     ` Dave Hansen
  1 sibling, 2 replies; 26+ messages in thread
From: Al Viro @ 2009-02-09  9:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Li Zefan, LKML, Paul Menage, containers, Arjan van de Ven

On Mon, Feb 09, 2009 at 12:40:46AM -0800, Andrew Morton wrote:
> > Thread 1:
> >   for ((; ;))
> >   {
> >       mount -t cgroup -o cpuset xxx /mnt > /dev/null 2>&1
> >       mkdir /mnt/0 > /dev/null 2>&1
> >       rmdir /mnt/0 > /dev/null 2>&1
> >       umount /mnt > /dev/null 2>&1
> >   }
> > 
> > Thread 2:
> >   {
> >       mount -t cpuset xxx /mnt > /dev/null 2>&1
> >       umount /mnt > /dev/null 2>&1
> >   }

How cute...  Same mountpoint in both, so these mount(2) will sometimes
fail (cgroup picks the same sb on the same options, AFAICS) and fail
silently due to these redirects...

That's a lovely way to stress-test a large part of ro-bind stuff *and*
umount()-related code.  Could you do C equivalent of the above (just
the same syscalls in loop, nothing fancier) and do time-stamped strace?


* Re: [cgroup or VFS ?]  WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-09  8:49   ` Li Zefan
@ 2009-02-09 11:03     ` Al Viro
  2009-02-09 11:58       ` Al Viro
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2009-02-09 11:03 UTC (permalink / raw)
  To: Li Zefan; +Cc: Andrew Morton, LKML, Paul Menage, containers, Arjan van de Ven

	BTW, a trivial note - kfree(root) in your ->kill_sb() is done
earlier than it's nice to do.  Shouldn't affect the problem, though.


* Re: [cgroup or VFS ?]  WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-09  9:34   ` Al Viro
@ 2009-02-09 11:30     ` Li Zefan
  2009-02-12  6:10       ` Li Zefan
  2009-02-09 17:48     ` Dave Hansen
  1 sibling, 1 reply; 26+ messages in thread
From: Li Zefan @ 2009-02-09 11:30 UTC (permalink / raw)
  To: Al Viro; +Cc: Andrew Morton, LKML, Paul Menage, containers, Arjan van de Ven

Al Viro wrote:
> On Mon, Feb 09, 2009 at 12:40:46AM -0800, Andrew Morton wrote:
>>> Thread 1:
>>>   for ((; ;))
>>>   {
>>>       mount -t cgroup -o cpuset xxx /mnt > /dev/null 2>&1
>>>       mkdir /mnt/0 > /dev/null 2>&1
>>>       rmdir /mnt/0 > /dev/null 2>&1
>>>       umount /mnt > /dev/null 2>&1
>>>   }
>>>
>>> Thread 2:
>>>   {
>>>       mount -t cpuset xxx /mnt > /dev/null 2>&1
>>>       umount /mnt > /dev/null 2>&1
>>>   }
> 
> How cute...  Same mountpoint in both, so these mount(2) will sometimes
> fail (cgroup picks the same sb on the same options, AFAICS) and fail
> silently due to these redirects...
> 
> That's a lovely way to stress-test a large part of ro-bind stuff *and*
> umount()-related code.  Could you do C equivalent of the above (just
> the same syscalls in loop, nothing fancier) and do time-stamped strace?
> 

Sure, I'll write a C version and try to reproduce the warning.




* Re: [cgroup or VFS ?]  WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-09 11:03     ` Al Viro
@ 2009-02-09 11:58       ` Al Viro
  2009-02-10  5:47         ` Li Zefan
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2009-02-09 11:58 UTC (permalink / raw)
  To: Li Zefan; +Cc: Andrew Morton, LKML, Paul Menage, containers, Arjan van de Ven

On Mon, Feb 09, 2009 at 11:03:48AM +0000, Al Viro wrote:
> 	BTW, a trivial note - kfree(root) in your ->kill_sb() is done
> earlier than it's nice to do.  Shouldn't affect the problem, though.

	Other probably irrelevant notes:

                memcpy(start, cgrp->dentry->d_name.name, len);
                cgrp = cgrp->parent;
                if (!cgrp)
                        break;
                dentry = rcu_dereference(cgrp->dentry);

in cgroup_path().  Why don't we need rcu_dereference on both?
Moreover, shouldn't that be
                memcpy(start, dentry->d_name.name, len);
anyway, seeing that we'd just looked at dentry->d_name.len?

In cgroup_rmdir():
        spin_lock(&cgrp->dentry->d_lock);
        d = dget(cgrp->dentry);
        spin_unlock(&d->d_lock);

        cgroup_d_remove_dir(d);
        dput(d);
Er?  Comments, please...  Unless something very unusual is going on,
either that d_lock is pointless or dget() is rather unsafe.

cgroups_clone()
        /* Now do the VFS work to create a cgroup */
        inode = parent->dentry->d_inode;

        /* Hold the parent directory mutex across this operation to
         * stop anyone else deleting the new cgroup */
        mutex_lock(&inode->i_mutex);
Can the parent be in process of getting deleted by somebody else?  If yes,
we are in trouble here.

BTW, that thing in cgroup_path()...  What guarantees that cgroup_rename()
won't hit between getting len and doing memcpy()?

That said, cgroup seems to be completely agnostic wrt anything happening
on vfsmount level, so I really don't see how it gets to that WARN_ON().
Hell knows; I really want to see the sequence of events - it might be
something like fscking up ->s_active handling with interesting results
(cgroup code is certainly hitting it in not quite usual ways), it may be
genuine VFS-only race.  Need more data...


* Re: [cgroup or VFS ?]  WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-09  9:34   ` Al Viro
  2009-02-09 11:30     ` Li Zefan
@ 2009-02-09 17:48     ` Dave Hansen
  2009-02-09 18:11       ` Arjan van de Ven
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2009-02-09 17:48 UTC (permalink / raw)
  To: Al Viro; +Cc: Andrew Morton, containers, Paul Menage, LKML, Arjan van de Ven

On Mon, 2009-02-09 at 09:34 +0000, Al Viro wrote:
> On Mon, Feb 09, 2009 at 12:40:46AM -0800, Andrew Morton wrote:
> > > Thread 1:
> > >   for ((; ;))
> > >   {
> > >       mount -t cgroup -o cpuset xxx /mnt > /dev/null 2>&1
> > >       mkdir /mnt/0 > /dev/null 2>&1
> > >       rmdir /mnt/0 > /dev/null 2>&1
> > >       umount /mnt > /dev/null 2>&1
> > >   }
> > > 
> > > Thread 2:
> > >   {
> > >       mount -t cpuset xxx /mnt > /dev/null 2>&1
> > >       umount /mnt > /dev/null 2>&1
> > >   }
> 
> How cute...  Same mountpoint in both, so these mount(2) will sometimes
> fail (cgroup picks the same sb on the same options, AFAICS) and fail
> silently due to these redirects...
> 
> That's a lovely way to stress-test a large part of ro-bind stuff *and*
> umount()-related code.  Could you do C equivalent of the above (just
> the same syscalls in loop, nothing fancier) and do time-stamped
> strace?

Could you also add a printk of what ->__mnt_writers was at the time of
the WARN_ON()?  That will hopefully at least tell us whether we're
looking at a real leak or just a single missed mnt_want/drop_write().
Also hopefully in which direction the thing is biased.  With the mount
not being around long I'm not horribly hopeful, but it can't hurt.

-- Dave



* Re: [cgroup or VFS ?]  WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-09 17:48     ` Dave Hansen
@ 2009-02-09 18:11       ` Arjan van de Ven
  0 siblings, 0 replies; 26+ messages in thread
From: Arjan van de Ven @ 2009-02-09 18:11 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Al Viro, Andrew Morton, containers, Paul Menage, LKML

On Mon, 09 Feb 2009 09:48:59 -0800
Dave Hansen <dave@linux.vnet.ibm.com> wrote:

> On Mon, 2009-02-09 at 09:34 +0000, Al Viro wrote:
> > On Mon, Feb 09, 2009 at 12:40:46AM -0800, Andrew Morton wrote:
> > > > Thread 1:
> > > >   for ((; ;))
> > > >   {
> > > >       mount -t cgroup -o cpuset xxx /mnt > /dev/null 2>&1
> > > >       mkdir /mnt/0 > /dev/null 2>&1
> > > >       rmdir /mnt/0 > /dev/null 2>&1
> > > >       umount /mnt > /dev/null 2>&1
> > > >   }
> > > > 
> > > > Thread 2:
> > > >   {
> > > >       mount -t cpuset xxx /mnt > /dev/null 2>&1
> > > >       umount /mnt > /dev/null 2>&1
> > > >   }
> > 
> > How cute...  Same mountpoint in both, so these mount(2) will
> > sometimes fail (cgroup picks the same sb on the same options,
> > AFAICS) and fail silently due to these redirects...
> > 
> > That's a lovely way to stress-test a large part of ro-bind stuff
> > *and* umount()-related code.  Could you do C equivalent of the
> > above (just the same syscalls in loop, nothing fancier) and do
> > time-stamped strace?
> 
> Could you also add a printk of what ->__mnt_writers was at the time of
> the WARN_ON()?  That will hopefully at least tell us whether we're
> looking at a real leak or just a single missed mnt_want/drop_write().
> Also hopefully in which direction the thing is biased.  With the mount
> not being around long I'm not horribly hopeful, but it can't hurt.

... we could just change the WARN_ON to a WARN()... which has printk
semantics ;)
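As a sketch of that change in mntput_no_expire() (the message text here is hypothetical):

```c
WARN(atomic_read(&mnt->__mnt_writers),
     "mntput_no_expire: __mnt_writers = %d\n",
     atomic_read(&mnt->__mnt_writers));
```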


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: [cgroup or VFS ?]  WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-09 11:58       ` Al Viro
@ 2009-02-10  5:47         ` Li Zefan
  0 siblings, 0 replies; 26+ messages in thread
From: Li Zefan @ 2009-02-10  5:47 UTC (permalink / raw)
  To: Al Viro; +Cc: Andrew Morton, LKML, Paul Menage, containers, Arjan van de Ven

Al Viro wrote:
> On Mon, Feb 09, 2009 at 11:03:48AM +0000, Al Viro wrote:
>> 	BTW, a trivial note - kfree(root) in your ->kill_sb() is done
>> earlier than it's nice to do.  Shouldn't affect the problem, though.
> 

Do you mean kfree(root) should be called after kill_litter_super()?
I don't see the point here...

> 	Other probably irrelevant notes:
> 
>                 memcpy(start, cgrp->dentry->d_name.name, len);
>                 cgrp = cgrp->parent;
>                 if (!cgrp)
>                         break;
>                 dentry = rcu_dereference(cgrp->dentry);
> 
> in cgroup_path().  Why don't we need rcu_dereference on both?
> Moreover, shouldn't that be
>                 memcpy(start, dentry->d_name.name, len);
> anyway, seeing that we'd just looked at dentry->d_name.len?

You are right, dentry->d_name.name rather than cgrp->dentry->d_name.name should be used.

> 
> In cgroup_rmdir():
>         spin_lock(&cgrp->dentry->d_lock);
>         d = dget(cgrp->dentry);
>         spin_unlock(&d->d_lock);
> 
>         cgroup_d_remove_dir(d);
>         dput(d);
> Er?  Comments, please...  Unless something very unusual is going on,
> either that d_lock is pointless or dget() is rather unsafe.
> 

The code was inherited from cpuset. I suspected it was redundant, but
I was not confident enough to remove it.

> cgroups_clone()
>         /* Now do the VFS work to create a cgroup */
>         inode = parent->dentry->d_inode;
> 
>         /* Hold the parent directory mutex across this operation to
>          * stop anyone else deleting the new cgroup */
>         mutex_lock(&inode->i_mutex);
> Can the parent be in process of getting deleted by somebody else?  If yes,
> we are in trouble here.
> 
> BTW, that thing in cgroup_path()...  What guarantees that cgroup_rename()
> won't hit between getting len and doing memcpy()?
> 

cgroup_path() was inherited from cpuset's cpuset_path(), and I think it
does indeed race with rename.

> That said, cgroup seems to be completely agnostic wrt anything happening
> on vfsmount level, so I really don't see how it gets to that WARN_ON().
> Hell knows; I really want to see the sequence of events - it might be
> something like fscking up ->s_active handling with interesting results
> (cgroup code is certainly hitting it in not quite usual ways), it may be
> genuine VFS-only race.  Need more data...
> 




* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-09 11:30     ` Li Zefan
@ 2009-02-12  6:10       ` Li Zefan
  2009-02-12  6:24         ` Al Viro
  0 siblings, 1 reply; 26+ messages in thread
From: Li Zefan @ 2009-02-12  6:10 UTC (permalink / raw)
  To: Al Viro; +Cc: containers, Paul Menage, Andrew Morton, LKML, Arjan van de Ven

Li Zefan wrote:
> Al Viro wrote:
>> On Mon, Feb 09, 2009 at 12:40:46AM -0800, Andrew Morton wrote:
>>>> Thread 1:
>>>>   for ((; ;))
>>>>   {
>>>>       mount -t cgroup -o cpuset xxx /mnt > /dev/null 2>&1
>>>>       mkdir /mnt/0 > /dev/null 2>&1
>>>>       rmdir /mnt/0 > /dev/null 2>&1
>>>>       umount /mnt > /dev/null 2>&1
>>>>   }
>>>>
>>>> Thread 2:
>>>>   {
>>>>       mount -t cpuset xxx /mnt > /dev/null 2>&1
>>>>       umount /mnt > /dev/null 2>&1
>>>>   }
>> How cute...  Same mountpoint in both, so these mount(2) will sometimes
>> fail (cgroup picks the same sb on the same options, AFAICS) and fail
>> silently due to these redirects...
>>
>> That's a lovely way to stress-test a large part of ro-bind stuff *and*
>> umount()-related code.  Could you do C equivalent of the above (just
>> the same syscalls in loop, nothing fancier) and do time-stamped strace?
>>
> 
> Sure, I'll write a C version and try to reproduce the warning.
> 

Unfortunately, the C equivalent can't reproduce the warning; I ran the
test for a whole night. :( With the script, I can often trigger the
warning within several minutes.



* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-12  6:10       ` Li Zefan
@ 2009-02-12  6:24         ` Al Viro
  2009-02-12  6:33           ` Li Zefan
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2009-02-12  6:24 UTC (permalink / raw)
  To: Li Zefan; +Cc: containers, Paul Menage, Andrew Morton, LKML, Arjan van de Ven

On Thu, Feb 12, 2009 at 02:10:37PM +0800, Li Zefan wrote:
> Li Zefan wrote:
> > Al Viro wrote:
> >> On Mon, Feb 09, 2009 at 12:40:46AM -0800, Andrew Morton wrote:
> >>>> Thread 1:
> >>>>   for ((; ;))
> >>>>   {
> >>>>       mount -t cgroup -o cpuset xxx /mnt > /dev/null 2>&1
> >>>>       mkdir /mnt/0 > /dev/null 2>&1
> >>>>       rmdir /mnt/0 > /dev/null 2>&1
> >>>>       umount /mnt > /dev/null 2>&1
> >>>>   }
> >>>>
> >>>> Thread 2:
> >>>>   {
> >>>>       mount -t cpuset xxx /mnt > /dev/null 2>&1
> >>>>       umount /mnt > /dev/null 2>&1
> >>>>   }
> >> How cute...  Same mountpoint in both, so these mount(2) will sometimes
> >> fail (cgroup picks the same sb on the same options, AFAICS) and fail
> >> silently due to these redirects...
> >>
> >> That's a lovely way to stress-test a large part of ro-bind stuff *and*
> >> umount()-related code.  Could you do C equivalent of the above (just
> >> the same syscalls in loop, nothing fancier) and do time-stamped strace?
> >>
> > 
> > Sure, I'll write a C version and try to reproduce the warning.
> > 
> 
> Unfortunately, the C equivalent can't reproduce the warning, I've run the
> test for the whole night. :( While using the script, often I can trigger
> the warning in several mins.

Ho-hum...  I wonder if we are hitting cgroup_clone() in all that fun...
Could you
	a) add a printk to that sucker
	b) independently from (a), see if wrapping these syscalls into
	pid = fork();
	if (!pid) {
		[make a syscall, print something]
		exit(0);
	} else if (pid > 0) {
		waitpid(pid, NULL, 0);
	}
and see what happens...


* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-12  6:24         ` Al Viro
@ 2009-02-12  6:33           ` Li Zefan
  2009-02-12  6:54             ` Li Zefan
  0 siblings, 1 reply; 26+ messages in thread
From: Li Zefan @ 2009-02-12  6:33 UTC (permalink / raw)
  To: Al Viro; +Cc: containers, Paul Menage, Andrew Morton, LKML, Arjan van de Ven

>>>> How cute...  Same mountpoint in both, so these mount(2) will sometimes
>>>> fail (cgroup picks the same sb on the same options, AFAICS) and fail
>>>> silently due to these redirects...
>>>>
>>>> That's a lovely way to stress-test a large part of ro-bind stuff *and*
>>>> umount()-related code.  Could you do C equivalent of the above (just
>>>> the same syscalls in loop, nothing fancier) and do time-stamped strace?
>>>>
>>> Sure, I'll write a C version and try to reproduce the warning.
>>>
>> Unfortunately, the C equivalent can't reproduce the warning, I've run the
>> test for the whole night. :( While using the script, often I can trigger
>> the warning in several mins.
> 
> Ho-hum...  I wonder if we are hitting cgroup_clone() in all that fun...

I don't think so. I think cgroup_clone() will be called only when a new
namespace is created, as with clone(CLONE_NEWNS). Even if cgroup_clone() gets
called, it will return before doing any vfs work unless the ns_cgroup
subsystem is mounted.

int cgroup_clone(struct task_struct *tsk, struct cgroup_subsys *subsys,
							char *nodename)
{
	...
	mutex_lock(&cgroup_mutex);
 again:
	root = subsys->root;
	if (root == &rootnode) {	<--- here
		mutex_unlock(&cgroup_mutex);
		return 0;
	}

> Could you
> 	a) add a printk to that sucker
> 	b) independently from (a), see if wrapping these syscalls into
> 	pid = fork();
> 	if (!pid) {
> 		[make a syscall, print something]
> 		exit(0);
> 	} else if (pid > 0) {
> 		waitpid(pid, NULL, 0);
> 	}
> and see what happens...
> 
> 


* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-12  6:33           ` Li Zefan
@ 2009-02-12  6:54             ` Li Zefan
  2009-02-12  7:07               ` Al Viro
  0 siblings, 1 reply; 26+ messages in thread
From: Li Zefan @ 2009-02-12  6:54 UTC (permalink / raw)
  To: Al Viro; +Cc: containers, Paul Menage, Arjan van de Ven, Andrew Morton, LKML

Li Zefan wrote:
>>>>> How cute...  Same mountpoint in both, so these mount(2) will sometimes
>>>>> fail (cgroup picks the same sb on the same options, AFAICS) and fail
>>>>> silently due to these redirects...
>>>>>
>>>>> That's a lovely way to stress-test a large part of ro-bind stuff *and*
>>>>> umount()-related code.  Could you do C equivalent of the above (just
>>>>> the same syscalls in loop, nothing fancier) and do time-stamped strace?
>>>>>
>>>> Sure, I'll write a C version and try to reproduce the warning.
>>>>
>>> Unfortunately, the C equivalent can't reproduce the warning, I've run the
>>> test for the whole night. :( While using the script, often I can trigger
>>> the warning in several mins.
>> Ho-hum...  I wonder if we are hitting cgroup_clone() in all that fun...
> 
> I don't think so, I think cgroup_clone() will be called only if namespace is
> used, like clone(CLONE_NEWNS). Even if cgroup_clone() gets called, it will
> return before doing any vfs work unless the ns_cgroup subsystem is mounted.
> 

But the following testcase can also trigger the warning:

thread 1:
for ((; ;))
{
	mount -t cgroup -o ns xxx cgroup/ > /dev/null 2>&1
	# remove the dirs generated by cgroup_clone()
	rmdir cgroup/[1-9]* > /dev/null 2>&1
	umount cgroup/ > /dev/null 2>&1
}


thread 2:

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

int foo(void *arg)
{ return 0; }

char *stack[4096];

int main(int argc, char **argv)
{
        int usec = DEFAULT_USEC;
        while (1) {
                usleep(usec);
                /* cgroup_clone() will be called */
                clone(foo, stack + 4096, CLONE_NEWNS, NULL);
        }

        return 0;
}



* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-12  6:54             ` Li Zefan
@ 2009-02-12  7:07               ` Al Viro
  2009-02-13  5:09                 ` Li Zefan
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2009-02-12  7:07 UTC (permalink / raw)
  To: Li Zefan; +Cc: containers, Paul Menage, Arjan van de Ven, Andrew Morton, LKML

On Thu, Feb 12, 2009 at 02:54:58PM +0800, Li Zefan wrote:
> But the following testcase can also trigger the warning:
> 
> thread 1:
> for ((; ;))
> {
> 	mount -t cgroup -o ns xxx cgroup/ > /dev/null 2>&1
> 	# remove the dirs generated by cgroup_clone()
> 	rmdir cgroup/[1-9]* > /dev/null 2>&1
> 	umount cgroup/ > /dev/null 2>&1
> }
> 
> 
> thread 2:
> 
> int foo(void *arg)
> { return 0; }
> 
> char *stack[4096];
> 
> int main(int argc, char **argv)
> {
>         int usec = DEFAULT_USEC;
>         while (1) {
>                 usleep(usec);
> 		# cgroup_clone() will be called
>                 clone(foo, stack+4096, CLONE_NEWNS, NULL);
>         }
> 
>         return 0;
> }

Uh-oh...  That clone() will do more, actually - it will clone a bunch
of vfsmounts.  What happens if you create a separate namespace for the
first thread, so that the second one would not have our vfsmount to
play with?

Alternatively, what if the second thread is doing
	mount --bind cgroup foo
	umount foo
in a loop?

Another one: does turning the umount in the first thread into umount -l
affect anything?


* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-12  7:07               ` Al Viro
@ 2009-02-13  5:09                 ` Li Zefan
  2009-02-13  5:47                   ` Al Viro
  0 siblings, 1 reply; 26+ messages in thread
From: Li Zefan @ 2009-02-13  5:09 UTC (permalink / raw)
  To: Al Viro; +Cc: containers, Paul Menage, Arjan van de Ven, Andrew Morton, LKML

>> thread 1:
>> for ((; ;))
>> {
>> 	mount -t cgroup -o ns xxx cgroup/ > /dev/null 2>&1
>> 	# remove the dirs generated by cgroup_clone()
>> 	rmdir cgroup/[1-9]* > /dev/null 2>&1
>> 	umount cgroup/ > /dev/null 2>&1
>> }
>>
>>
>> thread 2:
>>
>> int foo(void *arg)
>> { return 0; }
>>
>> char *stack[4096];
>>
>> int main(int argc, char **argv)
>> {
>>         int usec = DEFAULT_USEC;
>>         while (1) {
>>                 usleep(usec);
>> 		# cgroup_clone() will be called
>>                 clone(foo, stack+4096, CLONE_NEWNS, NULL);
>>         }
>>
>>         return 0;
>> }
> 
> Uh-oh...  That clone() will do more, actually - it will clone a bunch
> of vfsmounts.  What happens if you create a separate namespace for the
> first thread, so that the second one would not have our vfsmount to
> play with?
> 

The warning can still be triggered, but it seems harder to hit (it cost me 1 hour)

> Alternatively, what if the second thread is doing
> 	mount --bind cgroup foo
> 	umount foo
> in a loop?
> 

I ran the following testcase, and triggered the warning in 1 hour:

thread 1:
for ((; ;))
{
        mount --bind /cgroup /mnt > /dev/null 2>&1
        umount /mnt > /dev/null 2>&1
}

thread 2:
for ((; ;))
{
        mount -t cgroup -o cpu xxx /cgroup > /dev/null 2>&1
        mkdir /cgroup/0 > /dev/null 2>&1
        rmdir /cgroup/0 > /dev/null 2>&1
        umount -l /cgroup > /dev/null 2>&1
}

> Another one: does turning the umount in the first thread into umount -l
> affect anything?
> 

For this one, I ran the test for a whole night, but failed to hit the warning.


* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-13  5:09                 ` Li Zefan
@ 2009-02-13  5:47                   ` Al Viro
  2009-02-13  6:12                     ` Li Zefan
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2009-02-13  5:47 UTC (permalink / raw)
  To: Li Zefan; +Cc: containers, Paul Menage, Arjan van de Ven, Andrew Morton, LKML

On Fri, Feb 13, 2009 at 01:09:17PM +0800, Li Zefan wrote:

> I ran following testcase, and triggered the warning in 1 hour:
> 
> thread 1:
> for ((; ;))
> {
>         mount --bind /cgroup /mnt > /dev/null 2>&1
>         umount /mnt > /dev/null 2>&1
> }
> 
> tread 2:
> for ((; ;))
> {
>         mount -t cgroup -o cpu xxx /cgroup > /dev/null 2>&1
>         mkdir /cgroup/0 > /dev/null 2>&1
>         rmdir /cgroup/0 > /dev/null 2>&1
>         umount -l /cgroup > /dev/null 2>&1
> }

Wow.  You know, at that point these redirects could probably be removed.
If anything in there ends up producing an output, we very much want to
see that.  Actually, I'd even make that
	mount --bind /cgroup /mnt || (echo mount1: ; date)
etc., so we'd see when they fail and which one fails (if any)...
 
Which umount has failed in the above, BTW?


* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-13  5:47                   ` Al Viro
@ 2009-02-13  6:12                     ` Li Zefan
  2009-02-13  6:31                       ` Li Zefan
  2009-02-13  6:41                       ` Al Viro
  0 siblings, 2 replies; 26+ messages in thread
From: Li Zefan @ 2009-02-13  6:12 UTC (permalink / raw)
  To: Al Viro; +Cc: containers, Paul Menage, Arjan van de Ven, Andrew Morton, LKML

Al Viro wrote:
> On Fri, Feb 13, 2009 at 01:09:17PM +0800, Li Zefan wrote:
> 
>> I ran the following testcase, and triggered the warning in 1 hour:
>>
>> thread 1:
>> for ((; ;))
>> {
>>         mount --bind /cgroup /mnt > /dev/null 2>&1
>>         umount /mnt > /dev/null 2>&1
>> }
>>
>> thread 2:
>> for ((; ;))
>> {
>>         mount -t cgroup -o cpu xxx /cgroup > /dev/null 2>&1
>>         mkdir /cgroup/0 > /dev/null 2>&1
>>         rmdir /cgroup/0 > /dev/null 2>&1
>>         umount -l /cgroup > /dev/null 2>&1
>> }
> 
> Wow.  You know, at that point these redirects could probably be removed.

Ah, yes.

> If anything in there ends up producing an output, we very much want to
> see that.  Actually, I'd even make that
> 	mount --bind /cgroup /mnt || (echo mount1: ; date)
> etc., so we'd see when they fail and which one fails (if any)...
>  
> Which umount has failed in the above, BTW?
> 
> 

the first one sometimes failed, while the second one never failed:

mount: wrong fs type, bad option, bad superblock on /cgroup,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

mount1
Fri Feb 13 14:05:37 CST 2009
umount: /mnt: not mounted
mount: wrong fs type, bad option, bad superblock on /cgroup,
...

mount1
Fri Feb 13 14:08:34 CST 2009
umount: /mnt: not mounted
mount: wrong fs type, bad option, bad superblock on /cgroup,
...

mount1
Fri Feb 13 14:08:43 CST 2009
umount: /mnt: not mounted
mount: wrong fs type, bad option, bad superblock on /cgroup,
...




* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-13  6:12                     ` Li Zefan
@ 2009-02-13  6:31                       ` Li Zefan
  2009-02-13  6:41                       ` Al Viro
  1 sibling, 0 replies; 26+ messages in thread
From: Li Zefan @ 2009-02-13  6:31 UTC (permalink / raw)
  To: Al Viro; +Cc: containers, LKML, Paul Menage, Andrew Morton, Arjan van de Ven

Li Zefan wrote:
> Al Viro wrote:
>> On Fri, Feb 13, 2009 at 01:09:17PM +0800, Li Zefan wrote:
>>
>>> I ran the following testcase, and triggered the warning in 1 hour:
>>>
>>> thread 1:
>>> for ((; ;))
>>> {
>>>         mount --bind /cgroup /mnt > /dev/null 2>&1
>>>         umount /mnt > /dev/null 2>&1
>>> }
>>>
>>> thread 2:
>>> for ((; ;))
>>> {
>>>         mount -t cgroup -o cpu xxx /cgroup > /dev/null 2>&1
>>>         mkdir /cgroup/0 > /dev/null 2>&1
>>>         rmdir /cgroup/0 > /dev/null 2>&1
>>>         umount -l /cgroup > /dev/null 2>&1
>>> }
>> Wow.  You know, at that point these redirects could probably be removed.
> 
> Ah, yes.
> 
>> If anything in there ends up producing an output, we very much want to
>> see that.  Actually, I'd even make that
>> 	mount --bind /cgroup /mnt || (echo mount1: ; date)
>> etc., so we'd see when they fail and which one fails (if any)...
>>  
>> Which umount has failed in the above, BTW?
>>
>>
> 
> the first one sometimes failed, while the second one never failed:
> 

Just triggered the warning at about:
	Fri Feb 13 14:26:03 CST 2009

But neither thread was failing at that time:

mount: wrong fs type, bad option, bad superblock on /cgroup,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

mount1
Fri Feb 13 14:25:21 CST 2009
umount: /mnt: not mounted
mount: wrong fs type, bad option, bad superblock on /cgroup,
...

mount1
Fri Feb 13 14:26:32 CST 2009
umount: /mnt: not mounted
mount: wrong fs type, bad option, bad superblock on /cgroup,
...




* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-13  6:12                     ` Li Zefan
  2009-02-13  6:31                       ` Li Zefan
@ 2009-02-13  6:41                       ` Al Viro
  2009-02-13  7:18                         ` Al Viro
  1 sibling, 1 reply; 26+ messages in thread
From: Al Viro @ 2009-02-13  6:41 UTC (permalink / raw)
  To: Li Zefan; +Cc: containers, Paul Menage, Arjan van de Ven, Andrew Morton, LKML

On Fri, Feb 13, 2009 at 02:12:13PM +0800, Li Zefan wrote:
> Al Viro wrote:
> > On Fri, Feb 13, 2009 at 01:09:17PM +0800, Li Zefan wrote:
> > 
> >> I ran the following testcase, and triggered the warning in 1 hour:
> >>
> >> thread 1:
> >> for ((; ;))
> >> {
> >>         mount --bind /cgroup /mnt > /dev/null 2>&1
> >>         umount /mnt > /dev/null 2>&1
> >> }
> >>
> >> thread 2:
> >> for ((; ;))
> >> {
> >>         mount -t cgroup -o cpu xxx /cgroup > /dev/null 2>&1
> >>         mkdir /cgroup/0 > /dev/null 2>&1
> >>         rmdir /cgroup/0 > /dev/null 2>&1
> >>         umount -l /cgroup > /dev/null 2>&1
> >> }
> > 
> > Wow.  You know, at that point these redirects could probably be removed.
> 
> Ah, yes.
> 
> > If anything in there ends up producing an output, we very much want to
> > see that.  Actually, I'd even make that
> > 	mount --bind /cgroup /mnt || (echo mount1: ; date)
> > etc., so we'd see when they fail and which one fails (if any)...
> >  
> > Which umount has failed in the above, BTW?
> > 
> > 
> 
> the first one sometimes failed, while the second one never failed:

> mount: wrong fs type, bad option, bad superblock on /cgroup,
>        missing codepage or helper program, or other error
>        In some cases useful info is found in syslog - try
>        dmesg | tail  or so
> 
> mount1

Hold on.  In your last example the first one was doing mount --bind;
has _that_ failed?  Oh, wait...  It can fail, all right, if lookup on
/cgroup gives you your filesystem with the second thread managing to
detach it before we get the namespace_sem.  Then we'll fail that way -
and clean up properly.

Oh, well...  The original question still stands: with those two
scripts, which umount produces that WARN_ON?  The trivial way
to check would be to have a copy of /sbin/umount under a different
name and use _that_ in one of the threads instead of umount.
Then reproduce the WARN_ON and look at the process name in dmesg...


* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-13  6:41                       ` Al Viro
@ 2009-02-13  7:18                         ` Al Viro
  2009-02-13  7:26                           ` Li Zefan
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2009-02-13  7:18 UTC (permalink / raw)
  To: Li Zefan; +Cc: containers, Paul Menage, Arjan van de Ven, Andrew Morton, LKML

On Fri, Feb 13, 2009 at 06:41:35AM +0000, Al Viro wrote:

Aaaargh...

        /*
         * We don't have to hold all of the locks at the
         * same time here because we know that we're the
         * last reference to mnt and that no new writers
         * can come in.
         */
        for_each_possible_cpu(cpu) {
                struct mnt_writer *cpu_writer = &per_cpu(mnt_writers, cpu);
                if (cpu_writer->mnt != mnt)
                        continue;
                spin_lock(&cpu_writer->lock);

is *almost* OK.  Modulo SMP cache coherency.  We know that nothing should
be setting ->mnt to ours anymore, that's fine.  But we do not know if
we'd seen an *earlier* change done on the CPU in question (not the one we
are running __mntput() on).

I probably would still like to use a milder solution in the long run, but for
now let's check if turning that into

                struct mnt_writer *cpu_writer = &per_cpu(mnt_writers, cpu);
                spin_lock(&cpu_writer->lock);
                if (cpu_writer->mnt != mnt) {
			spin_unlock(&cpu_writer->lock);
                        continue;
		}
prevents the problem, OK?


* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-13  7:18                         ` Al Viro
@ 2009-02-13  7:26                           ` Li Zefan
  2009-02-16  1:29                             ` Li Zefan
  0 siblings, 1 reply; 26+ messages in thread
From: Li Zefan @ 2009-02-13  7:26 UTC (permalink / raw)
  To: Al Viro; +Cc: containers, Paul Menage, Arjan van de Ven, Andrew Morton, LKML

Al Viro wrote:
> On Fri, Feb 13, 2009 at 06:41:35AM +0000, Al Viro wrote:
> 
> Aaaargh...
> 
>         /*
>          * We don't have to hold all of the locks at the
>          * same time here because we know that we're the
>          * last reference to mnt and that no new writers
>          * can come in.
>          */
>         for_each_possible_cpu(cpu) {
>                 struct mnt_writer *cpu_writer = &per_cpu(mnt_writers, cpu);
>                 if (cpu_writer->mnt != mnt)
>                         continue;
>                 spin_lock(&cpu_writer->lock);
> 
> is *almost* OK.  Modulo SMP cache coherency.  We know that nothing should
> be setting ->mnt to ours anymore, that's fine.  But we do not know if
> we'd seen an *earlier* change done on the CPU in question (not the one we
> are running __mntput() on).
> 
> I probably would still like to use a milder solution in the long run, but for
> now let's check if turning that into
> 
>                 struct mnt_writer *cpu_writer = &per_cpu(mnt_writers, cpu);
>                 spin_lock(&cpu_writer->lock);
>                 if (cpu_writer->mnt != mnt) {
> 			spin_unlock(&cpu_writer->lock);
>                         continue;
> 		}
> prevents the problem, OK?
> 

Sure, I'll try. :)

BTW, thread2's rmdir failed:

rmdir: /cgroup/0: No such file or directory



* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-13  7:26                           ` Li Zefan
@ 2009-02-16  1:29                             ` Li Zefan
  2009-02-16  2:38                               ` Al Viro
  0 siblings, 1 reply; 26+ messages in thread
From: Li Zefan @ 2009-02-16  1:29 UTC (permalink / raw)
  To: Al Viro; +Cc: containers, LKML, Paul Menage, Andrew Morton, Arjan van de Ven

Li Zefan wrote:
> Al Viro wrote:
>> On Fri, Feb 13, 2009 at 06:41:35AM +0000, Al Viro wrote:
>>
>> Aaaargh...
>>
>>         /*
>>          * We don't have to hold all of the locks at the
>>          * same time here because we know that we're the
>>          * last reference to mnt and that no new writers
>>          * can come in.
>>          */
>>         for_each_possible_cpu(cpu) {
>>                 struct mnt_writer *cpu_writer = &per_cpu(mnt_writers, cpu);
>>                 if (cpu_writer->mnt != mnt)
>>                         continue;
>>                 spin_lock(&cpu_writer->lock);
>>
>> is *almost* OK.  Modulo SMP cache coherency.  We know that nothing should
>> be setting ->mnt to ours anymore, that's fine.  But we do not know if
>> we'd seen an *earlier* change done on the CPU in question (not the one we
>> are running __mntput() on).
>>
>> I probably would still like to use a milder solution in the long run, but for
>> now let's check if turning that into
>>
>>                 struct mnt_writer *cpu_writer = &per_cpu(mnt_writers, cpu);
>>                 spin_lock(&cpu_writer->lock);
>>                 if (cpu_writer->mnt != mnt) {
>> 			spin_unlock(&cpu_writer->lock);
>>                         continue;
>> 		}
>> prevents the problem, OK?
>>
> 
> Sure, I'll try. :)
> 

Not a single warning for the whole weekend, so I think the above change works.



* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-16  1:29                             ` Li Zefan
@ 2009-02-16  2:38                               ` Al Viro
  2009-02-16  2:47                                 ` Li Zefan
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2009-02-16  2:38 UTC (permalink / raw)
  To: Li Zefan
  Cc: containers, LKML, Paul Menage, Andrew Morton, Arjan van de Ven,
	gregkh, Linus Torvalds

On Mon, Feb 16, 2009 at 09:29:00AM +0800, Li Zefan wrote:
> >>                 struct mnt_writer *cpu_writer = &per_cpu(mnt_writers, cpu);
> >>                 spin_lock(&cpu_writer->lock);
> >>                 if (cpu_writer->mnt != mnt) {
> >> 			spin_unlock(&cpu_writer->lock);
> >>                         continue;
> >> 		}
> >> prevents the problem, OK?
> >>
> > 
> > Sure, I'll try. :)
> > 
> 
> Not a single warning for the whole weekend, so I think above change works.

OK...  So here's what we really want:
	* we know that nobody will set cpu_writer->mnt to mnt from now on
	* all changes to that sucker are done under cpu_writer->lock
	* we want the laziest equivalent of
		spin_lock(&cpu_writer->lock);
		if (likely(cpu_writer->mnt != mnt)) {
			spin_unlock(&cpu_writer->lock);
			continue;
		}
		/* do stuff */
that would make sure we won't miss earlier setting of ->mnt done by another
CPU.

Anyway, for now (HEAD and all -stable starting with 2.6.26) we want this:

--- fs/namespace.c	2009-01-25 21:45:31.000000000 -0500
+++ fs/namespace.c	2009-02-15 21:31:14.000000000 -0500
@@ -614,9 +614,11 @@
 	 */
 	for_each_possible_cpu(cpu) {
 		struct mnt_writer *cpu_writer = &per_cpu(mnt_writers, cpu);
-		if (cpu_writer->mnt != mnt)
-			continue;
 		spin_lock(&cpu_writer->lock);
+		if (cpu_writer->mnt != mnt) {
+			spin_unlock(&cpu_writer->lock);
+			continue;
+		}
 		atomic_add(cpu_writer->count, &mnt->__mnt_writers);
 		cpu_writer->count = 0;
 		/*


* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-16  2:38                               ` Al Viro
@ 2009-02-16  2:47                                 ` Li Zefan
  2009-02-16  2:57                                   ` Al Viro
  0 siblings, 1 reply; 26+ messages in thread
From: Li Zefan @ 2009-02-16  2:47 UTC (permalink / raw)
  To: Al Viro
  Cc: containers, LKML, Paul Menage, Andrew Morton, Arjan van de Ven,
	gregkh, Linus Torvalds

> OK...  So here's what we really want:
> 	* we know that nobody will set cpu_writer->mnt to mnt from now on
> 	* all changes to that sucker are done under cpu_writer->lock
> 	* we want the laziest equivalent of
> 		spin_lock(&cpu_writer->lock);
> 		if (likely(cpu_writer->mnt != mnt)) {
> 			spin_unlock(&cpu_writer->lock);
> 			continue;
> 		}
> 		/* do stuff */
> that would make sure we won't miss earlier setting of ->mnt done by another
> CPU.
> 

If this is done, I'd be glad to test it.

> Anyway, for now (HEAD and all -stable starting with 2.6.26) we want this:
> 

And here is my:

Tested-by: Li Zefan <lizf@cn.fujitsu.com>

> --- fs/namespace.c	2009-01-25 21:45:31.000000000 -0500
> +++ fs/namespace.c	2009-02-15 21:31:14.000000000 -0500
> @@ -614,9 +614,11 @@
>  	 */
>  	for_each_possible_cpu(cpu) {
>  		struct mnt_writer *cpu_writer = &per_cpu(mnt_writers, cpu);
> -		if (cpu_writer->mnt != mnt)
> -			continue;
>  		spin_lock(&cpu_writer->lock);
> +		if (cpu_writer->mnt != mnt) {
> +			spin_unlock(&cpu_writer->lock);
> +			continue;
> +		}
>  		atomic_add(cpu_writer->count, &mnt->__mnt_writers);
>  		cpu_writer->count = 0;
>  		/*



* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
  2009-02-16  2:47                                 ` Li Zefan
@ 2009-02-16  2:57                                   ` Al Viro
  0 siblings, 0 replies; 26+ messages in thread
From: Al Viro @ 2009-02-16  2:57 UTC (permalink / raw)
  To: Li Zefan
  Cc: containers, LKML, Paul Menage, Andrew Morton, Arjan van de Ven,
	gregkh, Linus Torvalds

On Mon, Feb 16, 2009 at 10:47:02AM +0800, Li Zefan wrote:
> And here is my:
> 
> Tested-by: Li Zefan <lizf@cn.fujitsu.com>

Right...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


end of thread, other threads:[~2009-02-16  2:57 UTC | newest]

Thread overview: 26+ messages
2009-01-05  3:23 [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2() Li Zefan
2009-02-09  8:40 ` Andrew Morton
2009-02-09  8:49   ` Li Zefan
2009-02-09 11:03     ` Al Viro
2009-02-09 11:58       ` Al Viro
2009-02-10  5:47         ` Li Zefan
2009-02-09  9:34   ` Al Viro
2009-02-09 11:30     ` Li Zefan
2009-02-12  6:10       ` Li Zefan
2009-02-12  6:24         ` Al Viro
2009-02-12  6:33           ` Li Zefan
2009-02-12  6:54             ` Li Zefan
2009-02-12  7:07               ` Al Viro
2009-02-13  5:09                 ` Li Zefan
2009-02-13  5:47                   ` Al Viro
2009-02-13  6:12                     ` Li Zefan
2009-02-13  6:31                       ` Li Zefan
2009-02-13  6:41                       ` Al Viro
2009-02-13  7:18                         ` Al Viro
2009-02-13  7:26                           ` Li Zefan
2009-02-16  1:29                             ` Li Zefan
2009-02-16  2:38                               ` Al Viro
2009-02-16  2:47                                 ` Li Zefan
2009-02-16  2:57                                   ` Al Viro
2009-02-09 17:48     ` Dave Hansen
2009-02-09 18:11       ` Arjan van de Ven
