public inbox for linux-kernel@vger.kernel.org
* 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
@ 2006-06-08  4:27 Miles Lane
  2006-06-08  7:32 ` Anton Altaparmakov
  0 siblings, 1 reply; 21+ messages in thread
From: Miles Lane @ 2006-06-08  4:27 UTC (permalink / raw)
  To: LKML, Andrew Morton

=====================================================
[ BUG: possible circular locking deadlock detected! ]
-----------------------------------------------------
mount/1219 is trying to acquire lock:
 (&ni->mrec_lock){--..}, at: [<c031e4b8>] mutex_lock+0x21/0x24

but task is already holding lock:
 (&rl->lock){----}, at: [<f8a98e4d>] ntfs_map_runlist+0x2a/0xb5 [ntfs]

which lock already depends on the new lock,
which could lead to circular deadlocks!

the existing dependency chain (in reverse order) is:

-> #1 (&rl->lock){----}:
       [<c012ec64>] lock_acquire+0x58/0x74
       [<f8a97619>] ntfs_readpage+0x362/0x8fd [ntfs]
       [<c01415b0>] read_cache_page+0x8c/0x137
       [<f8a9eeb8>] map_mft_record+0xd7/0x1d2 [ntfs]
       [<f8a9d661>] ntfs_read_locked_inode+0x74/0xea9 [ntfs]
       [<f8a9eabb>] ntfs_read_inode_mount+0x625/0x846 [ntfs]
       [<f8aa2ff8>] ntfs_fill_super+0x8ca/0xd14 [ntfs]
       [<c01647c4>] get_sb_bdev+0xed/0x14e
       [<f8aa15ef>] ntfs_get_sb+0x10/0x12 [ntfs]
       [<c0163c11>] vfs_kern_mount+0x76/0x143
       [<c0163d17>] do_kern_mount+0x29/0x3d
       [<c0177f9f>] do_mount+0x78a/0x7e4
       [<c0178058>] sys_mount+0x5f/0x91
       [<c031fb5d>] sysenter_past_esp+0x56/0x8d

-> #0 (&ni->mrec_lock){--..}:
       [<c012ec64>] lock_acquire+0x58/0x74
       [<c031e321>] __mutex_lock_slowpath+0xa7/0x21d
       [<c031e4b8>] mutex_lock+0x21/0x24
       [<f8a9edfa>] map_mft_record+0x19/0x1d2 [ntfs]
       [<f8a98630>] ntfs_map_runlist_nolock+0x48/0x437 [ntfs]
       [<f8a98eaa>] ntfs_map_runlist+0x87/0xb5 [ntfs]
       [<f8a97739>] ntfs_readpage+0x482/0x8fd [ntfs]
       [<c01415b0>] read_cache_page+0x8c/0x137
       [<f8aa20bc>] load_system_files+0x155/0x7c7 [ntfs]
       [<f8aa30a7>] ntfs_fill_super+0x979/0xd14 [ntfs]
       [<c01647c4>] get_sb_bdev+0xed/0x14e
       [<f8aa15ef>] ntfs_get_sb+0x10/0x12 [ntfs]
       [<c0163c11>] vfs_kern_mount+0x76/0x143
       [<c0163d17>] do_kern_mount+0x29/0x3d
       [<c0177f9f>] do_mount+0x78a/0x7e4
       [<c0178058>] sys_mount+0x5f/0x91
       [<c031fb5d>] sysenter_past_esp+0x56/0x8d

other info that might help us debug this:

2 locks held by mount/1219:
 #0:  (&s->s_umount#18){--..}, at: [<c0164443>] sget+0x223/0x3a1
 #1:  (&rl->lock){----}, at: [<f8a98e4d>] ntfs_map_runlist+0x2a/0xb5 [ntfs]

stack backtrace:
 [<c0103924>] show_trace_log_lvl+0x54/0xfd
 [<c01049a9>] show_trace+0xd/0x10
 [<c01049c3>] dump_stack+0x17/0x1c
 [<c012cefb>] print_circular_bug_tail+0x59/0x64
 [<c012e766>] __lock_acquire+0x7c2/0x97a
 [<c012ec64>] lock_acquire+0x58/0x74
 [<c031e321>] __mutex_lock_slowpath+0xa7/0x21d
 [<c031e4b8>] mutex_lock+0x21/0x24
 [<f8a9edfa>] map_mft_record+0x19/0x1d2 [ntfs]
 [<f8a98630>] ntfs_map_runlist_nolock+0x48/0x437 [ntfs]
 [<f8a98eaa>] ntfs_map_runlist+0x87/0xb5 [ntfs]
 [<f8a97739>] ntfs_readpage+0x482/0x8fd [ntfs]
 [<c01415b0>] read_cache_page+0x8c/0x137
 [<f8aa20bc>] load_system_files+0x155/0x7c7 [ntfs]
 [<f8aa30a7>] ntfs_fill_super+0x979/0xd14 [ntfs]
 [<c01647c4>] get_sb_bdev+0xed/0x14e
 [<f8aa15ef>] ntfs_get_sb+0x10/0x12 [ntfs]
 [<c0163c11>] vfs_kern_mount+0x76/0x143
 [<c0163d17>] do_kern_mount+0x29/0x3d
 [<c0177f9f>] do_mount+0x78a/0x7e4
 [<c0178058>] sys_mount+0x5f/0x91
 [<c031fb5d>] sysenter_past_esp+0x56/0x8d
NTFS volume version 3.1.


* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-08  4:27 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected! Miles Lane
@ 2006-06-08  7:32 ` Anton Altaparmakov
  2006-06-08  9:55   ` Ingo Molnar
  0 siblings, 1 reply; 21+ messages in thread
From: Anton Altaparmakov @ 2006-06-08  7:32 UTC (permalink / raw)
  To: Miles Lane; +Cc: LKML, Andrew Morton

Hi,

Thanks for the report.

On Wed, 2006-06-07 at 21:27 -0700, Miles Lane wrote:
> =====================================================
> [ BUG: possible circular locking deadlock detected! ]
> -----------------------------------------------------
> mount/1219 is trying to acquire lock:
>  (&ni->mrec_lock){--..}, at: [<c031e4b8>] mutex_lock+0x21/0x24
> 
> but task is already holding lock:
>  (&rl->lock){----}, at: [<f8a98e4d>] ntfs_map_runlist+0x2a/0xb5 [ntfs]
> 
> which lock already depends on the new lock,
> which could lead to circular deadlocks!

That's rubbish.  The lock analyser is not intelligent enough for
ntfs...  )-:

FWIW, it appears to be picking up that locks depend on each other even
when they relate to different inodes, in which case they do not in fact
depend on each other.  It is perhaps getting confused by the special
case for the table of inodes ($MFT), which has the lock dependency
reversed with respect to all other inodes.  The $MFT is special because
it can never take the lock recursively (and hence deadlock): we always
keep the whole runlist for $MFT in memory, and should a bug or memory
corruption cause this not to be the case, ntfs will detect it and go
BUG(), so it still will not deadlock...

The people responsible for the lock analyser will need to fix this in
their code or by adding some magic to ntfs to make the errors go away...

Best regards,

	Anton

> the existing dependency chain (in reverse order) is:
> 
> -> #1 (&rl->lock){----}:
>        [<c012ec64>] lock_acquire+0x58/0x74
>        [<f8a97619>] ntfs_readpage+0x362/0x8fd [ntfs]
>        [<c01415b0>] read_cache_page+0x8c/0x137
>        [<f8a9eeb8>] map_mft_record+0xd7/0x1d2 [ntfs]
>        [<f8a9d661>] ntfs_read_locked_inode+0x74/0xea9 [ntfs]
>        [<f8a9eabb>] ntfs_read_inode_mount+0x625/0x846 [ntfs]
>        [<f8aa2ff8>] ntfs_fill_super+0x8ca/0xd14 [ntfs]
>        [<c01647c4>] get_sb_bdev+0xed/0x14e
>        [<f8aa15ef>] ntfs_get_sb+0x10/0x12 [ntfs]
>        [<c0163c11>] vfs_kern_mount+0x76/0x143
>        [<c0163d17>] do_kern_mount+0x29/0x3d
>        [<c0177f9f>] do_mount+0x78a/0x7e4
>        [<c0178058>] sys_mount+0x5f/0x91
>        [<c031fb5d>] sysenter_past_esp+0x56/0x8d
> 
> -> #0 (&ni->mrec_lock){--..}:
>        [<c012ec64>] lock_acquire+0x58/0x74
>        [<c031e321>] __mutex_lock_slowpath+0xa7/0x21d
>        [<c031e4b8>] mutex_lock+0x21/0x24
>        [<f8a9edfa>] map_mft_record+0x19/0x1d2 [ntfs]
>        [<f8a98630>] ntfs_map_runlist_nolock+0x48/0x437 [ntfs]
>        [<f8a98eaa>] ntfs_map_runlist+0x87/0xb5 [ntfs]
>        [<f8a97739>] ntfs_readpage+0x482/0x8fd [ntfs]
>        [<c01415b0>] read_cache_page+0x8c/0x137
>        [<f8aa20bc>] load_system_files+0x155/0x7c7 [ntfs]
>        [<f8aa30a7>] ntfs_fill_super+0x979/0xd14 [ntfs]
>        [<c01647c4>] get_sb_bdev+0xed/0x14e
>        [<f8aa15ef>] ntfs_get_sb+0x10/0x12 [ntfs]
>        [<c0163c11>] vfs_kern_mount+0x76/0x143
>        [<c0163d17>] do_kern_mount+0x29/0x3d
>        [<c0177f9f>] do_mount+0x78a/0x7e4
>        [<c0178058>] sys_mount+0x5f/0x91
>        [<c031fb5d>] sysenter_past_esp+0x56/0x8d
> 
> other info that might help us debug this:
> 
> 2 locks held by mount/1219:
>  #0:  (&s->s_umount#18){--..}, at: [<c0164443>] sget+0x223/0x3a1
>  #1:  (&rl->lock){----}, at: [<f8a98e4d>] ntfs_map_runlist+0x2a/0xb5 [ntfs]
> 
> stack backtrace:
>  [<c0103924>] show_trace_log_lvl+0x54/0xfd
>  [<c01049a9>] show_trace+0xd/0x10
>  [<c01049c3>] dump_stack+0x17/0x1c
>  [<c012cefb>] print_circular_bug_tail+0x59/0x64
>  [<c012e766>] __lock_acquire+0x7c2/0x97a
>  [<c012ec64>] lock_acquire+0x58/0x74
>  [<c031e321>] __mutex_lock_slowpath+0xa7/0x21d
>  [<c031e4b8>] mutex_lock+0x21/0x24
>  [<f8a9edfa>] map_mft_record+0x19/0x1d2 [ntfs]
>  [<f8a98630>] ntfs_map_runlist_nolock+0x48/0x437 [ntfs]
>  [<f8a98eaa>] ntfs_map_runlist+0x87/0xb5 [ntfs]
>  [<f8a97739>] ntfs_readpage+0x482/0x8fd [ntfs]
>  [<c01415b0>] read_cache_page+0x8c/0x137
>  [<f8aa20bc>] load_system_files+0x155/0x7c7 [ntfs]
>  [<f8aa30a7>] ntfs_fill_super+0x979/0xd14 [ntfs]
>  [<c01647c4>] get_sb_bdev+0xed/0x14e
>  [<f8aa15ef>] ntfs_get_sb+0x10/0x12 [ntfs]
>  [<c0163c11>] vfs_kern_mount+0x76/0x143
>  [<c0163d17>] do_kern_mount+0x29/0x3d
>  [<c0177f9f>] do_mount+0x78a/0x7e4
>  [<c0178058>] sys_mount+0x5f/0x91
>  [<c031fb5d>] sysenter_past_esp+0x56/0x8d
> NTFS volume version 3.1.

-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/



* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-08  7:32 ` Anton Altaparmakov
@ 2006-06-08  9:55   ` Ingo Molnar
  2006-06-08 10:53     ` Anton Altaparmakov
  0 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2006-06-08  9:55 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven


hi Anton,

* Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> [...] It perhaps is getting confused by the special case for the table 
> of inodes ($MFT) which has the lock dependency reverse to all other 
> inodes but it is special because it can never take the lock 
> recursively (and hence deadlock) because we always keep the whole 
> runlist for $MFT in memory and should a bug or memory corruption cause 
> this not to be the case then ntfs will detect this and go BUG() so it 
> still will not deadlock...

Please help me understand NTFS locking a bit better. As far as i can see 
at the moment, the NTFS locking scenario that the lock validator flagged 
does not involve two inodes - it only involves the MFT inode itself, and 
two of its locks.

Firstly, here is a list of the NTFS terms, locks in question:

  ni              - NTFS inode structure

  mft_ni          - special "Master File Table" inode - one per fs. 
                    Consists of "MFT records", which describe an inode 
                    each. (mft_ni is also called the "big inode")

  &ni->mrec_lock  - a spinlock protecting a particular inode's MFT data. 
                    (finegrained lock for the MFT record) It is 
                    typically taken by map_mft_record() and released by 
                    unmap_mft_record().

  ni->runlist     - maps logical addresses to on-disk addresses. (There
                    are two runlists: one for normal inode data,
                    another for attribute space data.)

  &rl->lock       - ni->runlist.lock, a rw semaphore that protects the 
                    mapping data. Read-locked on access, write-locked on 
                    modification (extension of an inode, etc.).

The MFT is loaded in-memory permanently at mount time, and its runlist 
gives us access to NTFS inodes. Is its runlist loaded into memory 
permanently too? An NTFS inode's runlist gives access to the actual file 
data.

What the validator flagged is the following locking construct:

we first acquired the MFT's &ni->mrec_lock in map_mft_record(), at:

       [<c0340508>] mutex_lock+0x8/0x10
       [<c01d4d61>] map_mft_record+0x51/0x2c0
       [<c01c51e8>] ntfs_map_runlist_nolock+0x3d8/0x530
       [<c01c58b1>] ntfs_map_runlist+0x41/0x70
       [<c01c1621>] ntfs_readpage+0x8c1/0x9a0
       [<c0142e1c>] read_cache_page+0xac/0x150
       [<c01e23f2>] load_system_files+0x472/0x2250
       [<c01e4e26>] ntfs_fill_super+0xc56/0x1a50
       [<c016bdee>] get_sb_bdev+0xde/0x120
       [<c01e028b>] ntfs_get_sb+0x1b/0x30
       [<c016b413>] vfs_kern_mount+0x33/0xa0
       [<c016b4d6>] do_kern_mount+0x36/0x50
       [<c01818de>] do_mount+0x28e/0x640
       [<c0181cff>] sys_mount+0x6f/0xb0

then we read-locked &rl->lock [the MFT's runlist semaphore] later in 
map_mft_record() -> ntfs_readpage(), while still holding &ni->mrec_lock:

       [<c0134c4e>] down_read+0x2e/0x40
       [<c01c159c>] ntfs_readpage+0x83c/0x9a0
       [<c0142e1c>] read_cache_page+0xac/0x150
       [<c01d4e22>] map_mft_record+0x112/0x2c0
       [<c01d229d>] ntfs_read_locked_inode+0x8d/0x15d0
       [<c01d3c6b>] ntfs_read_inode_mount+0x48b/0xba0
       [<c01e4dcb>] ntfs_fill_super+0xbfb/0x1a50
       [<c016bdee>] get_sb_bdev+0xde/0x120
       [<c01e028b>] ntfs_get_sb+0x1b/0x30
       [<c016b413>] vfs_kern_mount+0x33/0xa0
       [<c016b4d6>] do_kern_mount+0x36/0x50
       [<c01818de>] do_mount+0x28e/0x640
       [<c0181cff>] sys_mount+0x6f/0xb0

so this is a "&ni->mrec_lock => &rl->lock" dependency for the MFT, which 
the validator recorded.

Then the validator also observed the reverse order. We first 
write-locked &rl->lock (of the MFT inode):

 [<c0134c8e>] down_write+0x2e/0x50
 [<c01c5910>] ntfs_map_runlist+0x20/0x70
 [<c01c16a1>] ntfs_readpage+0x8c1/0x9a0
 [<c0142e9c>] read_cache_page+0xac/0x150
 [<c01e2472>] load_system_files+0x472/0x2250
 [<c01e4ea6>] ntfs_fill_super+0xc56/0x1a50
 [<c016be6e>] get_sb_bdev+0xde/0x120
 [<c01e030b>] ntfs_get_sb+0x1b/0x30
 [<c016b493>] vfs_kern_mount+0x33/0xa0
 [<c016b556>] do_kern_mount+0x36/0x50
 [<c018195e>] do_mount+0x28e/0x640
 [<c0181d7f>] sys_mount+0x6f/0xb0

then we took &ni->mrec_lock [this is still the MFT inode's mrec_lock, 
and we have the &rl->lock still held]:

 [<c0340588>] mutex_lock+0x8/0x10
 [<c01d4de1>] map_mft_record+0x51/0x2c0
 [<c01c5268>] ntfs_map_runlist_nolock+0x3d8/0x530
 [<c01c5931>] ntfs_map_runlist+0x41/0x70
 [<c01c16a1>] ntfs_readpage+0x8c1/0x9a0
 [<c0142e9c>] read_cache_page+0xac/0x150
 [<c01e2472>] load_system_files+0x472/0x2250
 [<c01e4ea6>] ntfs_fill_super+0xc56/0x1a50
 [<c016be6e>] get_sb_bdev+0xde/0x120
 [<c01e030b>] ntfs_get_sb+0x1b/0x30
 [<c016b493>] vfs_kern_mount+0x33/0xa0
 [<c016b556>] do_kern_mount+0x36/0x50
 [<c018195e>] do_mount+0x28e/0x640
 [<c0181d7f>] sys_mount+0x6f/0xb0

this means a "&rl->lock => &ni->mrec_lock" dependency, which stands in 
contrast with the already observed "&ni->mrec_lock => &rl->lock" 
dependency.

The dependencies were observed for the same locks (the MFT's runlist 
lock and mrec_lock), i.e. this is not a confusion of normal inodes vs. 
the MFT inode.

First and foremost, are my observations and interpretations correct? 
Assuming that i made no mistake that invalidates my analysis, why are 
the two MFT inode locks apparently taken in opposite order?

	Ingo


* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-08  9:55   ` Ingo Molnar
@ 2006-06-08 10:53     ` Anton Altaparmakov
  2006-06-08 11:23       ` Ingo Molnar
                         ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Anton Altaparmakov @ 2006-06-08 10:53 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven

Hi Ingo,

On Thu, 2006-06-08 at 11:55 +0200, Ingo Molnar wrote:
> * Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > [...] It perhaps is getting confused by the special case for the table 
> > of inodes ($MFT) which has the lock dependency reverse to all other 
> > inodes but it is special because it can never take the lock 
> > recursively (and hence deadlock) because we always keep the whole 
> > runlist for $MFT in memory and should a bug or memory corruption cause 
> > this not to be the case then ntfs will detect this and go BUG() so it 
> > still will not deadlock...
> 
> Please help me understand NTFS locking a bit better. As far as i can see 

Sure.

> at the moment, the NTFS locking scenario that the lock validator flagged 
> does not involve two inodes - it only involves the MFT inode itself, and 
> two of its locks.
> 
> Firstly, here is a list of the NTFS terms, locks in question:
> 
>   ni              - NTFS inode structure
> 
>   mft_ni          - special "Master File Table" inode - one per fs. 
>                     Consists of "MFT records", which describe an inode 
>                     each. (mft_ni is also called the "big inode")

Correct, except that a big inode is just any ntfs inode that has a vfs
inode (struct inode) to go with it, i.e. a real inode visible to the
vfs.  There is nothing special about the MFT inode with respect to big
inodes; all files have big inodes.

The non-big inodes are not real inodes; they are inode extents.
Basically, one inode is typically 1024 bytes, so if it fills up (e.g.
due to lots of hard links and/or a long run list caused by
fragmentation and/or a huge file) then ntfs allocates more mft records
and attaches them logically to the real inode.  Thus you can have
effectively unlimited fragmentation and/or hard links even though an
inode is just 1024 bytes in size; you simply end up with lots of what I
call "extent inodes", which in the driver means "ntfs_inode" structures
that do not have a VFS struct inode attached to them.

>   &ni->mrec_lock  - a spinlock protecting a particular inode's MFT data. 
>                     (finegrained lock for the MFT record) It is 
>                     typically taken by map_mft_record() and released by 
>                     unmap_mft_record().

Correct, except s/spinlock/semaphore/ (that really should become a mutex
one day).

>   ni->runlist     - maps logical addresses to on-disk addresses. (There 
>                     are (two) runlists, one for normal inode data, 
>                     another for attribute space data.)

Correct.

>   &rl->lock       - ni->runlist.lock, a rw semaphore that protects the 
>                     mapping data. Read-locked on access, write-locked on 
>                     modification (extension of an inode, etc.).

Correct, except it is also write-locked when the runlist is modified in
memory, which is not necessarily due to a write to the inode.  This
occurs on first read/write to the inode: at iget() time we do not read
the runlist into memory, thus the first read/write causes &rl->lock to
be taken for writing whilst the runlist is read in.

This also happens when the file is fragmented.  Then there are multiple
runlist "fragments", each describing a portion of the data, thus you
have fragment 0 describing file offsets 0 to X, then fragment 1
describing offsets X+1 to Y, etc...  We only load into memory each
fragment as it is used, thus on a read from a file we do:

down_read(&rl->lock);
look up the physical location in the runlist;
if (physical location is unknown because this runlist fragment was
    never needed before) {
	up_read(&rl->lock);
	down_write(&rl->lock);
	redo the lookup to ensure no-one else mapped this fragment
	under our feet; if they did not, read it in and merge it with
	the existing runlist, then redo the physical location lookup,
	which will now succeed;
}
up_read/up_write(&rl->lock);
Basically we are on-demand loading the runlist fragments.

> The MFT is loaded in-memory permanently at mount time, and its runlist 
> gives us access to NTFS inodes. Is its runlist loaded into memory 
> permanently too? An NTFS inode's runlist gives access to the actual file 
> data.

Clarification: The MFT inode is loaded at mount time and ALL its runlist
fragments are loaded and merged into one big runlist available at
mft_ni->runlist.  The mft data, i.e. the inodes themselves, are not
loaded all the time.  That is what map_mft_record() does, and it needs
to look up in mft_ni->runlist where the inode to be loaded is on disk.

This last lookup of the physical location in map_mft_record() [actually
in readpage, as map_mft_record() reads the page cache page containing
the record] cannot require us to load an extent mft record, because the
runlist is complete, so we cannot deadlock here.

Your last statement is correct.  The runlist is a mapping of "file
offset" to "physical device offset".

> What the validator flagged is the following locking construct:
> 
> we first acquired the MFT's &ni->mrec_lock in map_mft_record(), at:

No, this shows acquisition of the MFTMirr's &ni->mrec_lock (the mirror
of the MFT, which only contains the first few MFT records, stored in
the middle of the disk for recovery purposes).  If you look at
"fs/ntfs/super.c::load_system_files()" you will see that the first
thing it calls is load_and_init_mft_mirror(), which in turn begins
with: ntfs_iget(vol->sb, FILE_MFTMirr); thus it is locking the mft
record for FILE_MFTMirr, which is inode 1 (whilst the MFT itself is
FILE_MFT, inode 0).

>        [<c0340508>] mutex_lock+0x8/0x10
>        [<c01d4d61>] map_mft_record+0x51/0x2c0
>        [<c01c51e8>] ntfs_map_runlist_nolock+0x3d8/0x530
>        [<c01c58b1>] ntfs_map_runlist+0x41/0x70
>        [<c01c1621>] ntfs_readpage+0x8c1/0x9a0
>        [<c0142e1c>] read_cache_page+0xac/0x150
>        [<c01e23f2>] load_system_files+0x472/0x2250
>        [<c01e4e26>] ntfs_fill_super+0xc56/0x1a50
>        [<c016bdee>] get_sb_bdev+0xde/0x120
>        [<c01e028b>] ntfs_get_sb+0x1b/0x30
>        [<c016b413>] vfs_kern_mount+0x33/0xa0
>        [<c016b4d6>] do_kern_mount+0x36/0x50
>        [<c01818de>] do_mount+0x28e/0x640
>        [<c0181cff>] sys_mount+0x6f/0xb0
> 
> then we read-locked &rl->lock [the MFT's runlist semaphore] later in 
> map_mft_record() -> ntfs_readpage(),

But the below trace precedes the above trace in time!  Below we see
ntfs_read_inode_mount() being called, which happens _before_
load_system_files() is called... (See ntfs_fill_super(), which is the
only place that calls both of those functions.)

Apart from that, you are correct that we are taking the &rl->lock of
the MFT's runlist semaphore here.

> while still holding &ni->mrec_lock:

Wrong: as explained above, it is the MFTMirr's ni->mrec_lock being held
in the above trace, and it is only taken later on, not earlier.

>        [<c0134c4e>] down_read+0x2e/0x40
>        [<c01c159c>] ntfs_readpage+0x83c/0x9a0
>        [<c0142e1c>] read_cache_page+0xac/0x150
>        [<c01d4e22>] map_mft_record+0x112/0x2c0
>        [<c01d229d>] ntfs_read_locked_inode+0x8d/0x15d0
>        [<c01d3c6b>] ntfs_read_inode_mount+0x48b/0xba0
>        [<c01e4dcb>] ntfs_fill_super+0xbfb/0x1a50
>        [<c016bdee>] get_sb_bdev+0xde/0x120
>        [<c01e028b>] ntfs_get_sb+0x1b/0x30
>        [<c016b413>] vfs_kern_mount+0x33/0xa0
>        [<c016b4d6>] do_kern_mount+0x36/0x50
>        [<c01818de>] do_mount+0x28e/0x640
>        [<c0181cff>] sys_mount+0x6f/0xb0
> 
> so this is a "&ni->mrec_lock => &rl->lock" dependency for the MFT, which 
> the validator recorded.

No, it is not, as explained above.  Something has gotten confused
somewhere, because the order of events is the wrong way round...

> Then the validator also observed the reverse order. We first 
> write-locked &rl->lock (of the MFT inode):
> 
>  [<c0134c8e>] down_write+0x2e/0x50
>  [<c01c5910>] ntfs_map_runlist+0x20/0x70
>  [<c01c16a1>] ntfs_readpage+0x8c1/0x9a0
>  [<c0142e9c>] read_cache_page+0xac/0x150
>  [<c01e2472>] load_system_files+0x472/0x2250
>  [<c01e4ea6>] ntfs_fill_super+0xc56/0x1a50
>  [<c016be6e>] get_sb_bdev+0xde/0x120
>  [<c01e030b>] ntfs_get_sb+0x1b/0x30
>  [<c016b493>] vfs_kern_mount+0x33/0xa0
>  [<c016b556>] do_kern_mount+0x36/0x50
>  [<c018195e>] do_mount+0x28e/0x640
>  [<c0181d7f>] sys_mount+0x6f/0xb0

No, this will be some other inode.  The first time you see the above
trace should be when the MFTMirr is being read in from
load_system_files() when it calls check_mft_mirror().  At that stage
the runlist of the MFTMirr is not yet read in, thus it will take the
runlist lock of the MFTMirr's inode for writing and read the runlist
into memory.

> then we took &ni->mrec_lock [this is still the MFT inode's mrec_lock, 
> and we have the &rl->lock still held]:
> 
>  [<c0340588>] mutex_lock+0x8/0x10
>  [<c01d4de1>] map_mft_record+0x51/0x2c0
>  [<c01c5268>] ntfs_map_runlist_nolock+0x3d8/0x530
>  [<c01c5931>] ntfs_map_runlist+0x41/0x70
>  [<c01c16a1>] ntfs_readpage+0x8c1/0x9a0
>  [<c0142e9c>] read_cache_page+0xac/0x150
>  [<c01e2472>] load_system_files+0x472/0x2250
>  [<c01e4ea6>] ntfs_fill_super+0xc56/0x1a50
>  [<c016be6e>] get_sb_bdev+0xde/0x120
>  [<c01e030b>] ntfs_get_sb+0x1b/0x30
>  [<c016b493>] vfs_kern_mount+0x33/0xa0
>  [<c016b556>] do_kern_mount+0x36/0x50
>  [<c018195e>] do_mount+0x28e/0x640
>  [<c0181d7f>] sys_mount+0x6f/0xb0

Again, this is the mrec_lock of the MFTMirr ntfs_inode, not the MFT
ntfs_inode (mft_ni).

> this means a "&rl->lock => &ni->mrec_lock" dependency, which stands in 
> contrast with the already observed "&ni->mrec_lock => &rl->lock" 
> dependency.

Could you re-evaluate that conclusion given the above feedback?  I think
it is wrong given the inversion in time and the difference in inodes as
explained above.

> The dependencies were observed for the same locks (the MFT's runlist 
> lock and mrec_lock), i.e. this is not a confusion of normal inodes vs. 
> the MFT inode.
> 
> First and foremost, are my observations and interpretations correct? 
> Assuming that i made no mistake that invalidates my analysis, why are 
> the two MFT inode locks apparently taken in opposite order?

I think you got it wrong, I am afraid.  Does what I explained above
make more sense?  In particular, I am worried that you showed two
events happening in reverse chronological order, which is impossible
(unless time flows backwards when you run your tests... (-;)

Please keep the questions coming...

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/



* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-08 10:53     ` Anton Altaparmakov
@ 2006-06-08 11:23       ` Ingo Molnar
  2006-06-09  8:09         ` Anton Altaparmakov
  2006-06-08 16:11       ` Ingo Molnar
  2006-06-09  7:19       ` Ingo Molnar
  2 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2006-06-08 11:23 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven


* Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> No it is not as explained above.  Something has gotten confused 
> somewhere because the order of events is the wrong way round...

ok, next try. Find below a chronological trace of the lock acquire and 
new-dependency events leading up to the circular dependency message.

The first (mrec_lock -> runlist.lock) dependency is:

 new dependency:  (&ni->mrec_lock){--..} =>  (&rl->lock){..--}
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138bc9>] __lock_acquire+0x9a9/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0134c4c>] down_read+0x2c/0x40
 [<c01c170c>] ntfs_readpage+0x83c/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01d4f92>] map_mft_record+0x112/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

and the opposite (runlist.lock -> mrec_lock) dependency wants to get 
created at:

 [<c0340678>] mutex_lock+0x8/0x10
 [<c01d4ed1>] map_mft_record+0x51/0x2c0
 [<c01c5358>] ntfs_map_runlist_nolock+0x3d8/0x530
 [<c01c5a21>] ntfs_map_runlist+0x41/0x70
 [<c01c1791>] ntfs_readpage+0x8c1/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01e2562>] load_system_files+0x472/0x2250
 [<c01e4f96>] ntfs_fill_super+0xc56/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

does this trace make more sense?

	Ingo

----{ NTFS lock events trace }------------->

new type c05bfcc4: &ni->mrec_lock
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138f24>] __lock_acquire+0xd04/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c034043f>] __mutex_lock_slowpath+0x6f/0x2a0
 [<c0340678>] mutex_lock+0x8/0x10
 [<c01d4ed1>] map_mft_record+0x51/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

acquire type [c05bfcc4] &ni->mrec_lock
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138759>] __lock_acquire+0x539/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c034043f>] __mutex_lock_slowpath+0x6f/0x2a0
 [<c0340678>] mutex_lock+0x8/0x10
 [<c01d4ed1>] map_mft_record+0x51/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

 new dependency:  (&s->s_umount#15){--..} =>  (&ni->mrec_lock){....}
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138bc9>] __lock_acquire+0x9a9/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c034043f>] __mutex_lock_slowpath+0x6f/0x2a0
 [<c0340678>] mutex_lock+0x8/0x10
 [<c01d4ed1>] map_mft_record+0x51/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

marked lock as {hardirq-on-W}:
 (&ni->mrec_lock){-...}, at: [<c0340678>] mutex_lock+0x8/0x10
irq event stamp: 2309
hardirqs last  enabled at (2307): [<c03420a7>] _spin_unlock_irqrestore+0x47/0x60
hardirqs last disabled at (2309): [<c011cad6>] vprintk+0x26/0x360
softirqs last  enabled at (1988): [<c0121e0e>] __do_softirq+0xfe/0x110
softirqs last disabled at (1981): [<c0105049>] do_softirq+0x69/0x100
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0137d37>] mark_lock+0x1c7/0x6b0
 [<c01394db>] mark_held_locks+0x4b/0xa0
 [<c0139660>] trace_hardirqs_on+0x70/0x160
 [<c034052f>] __mutex_lock_slowpath+0x15f/0x2a0
 [<c0340678>] mutex_lock+0x8/0x10
 [<c01d4ed1>] map_mft_record+0x51/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

marked lock as {softirq-on-W}:
 (&ni->mrec_lock){--..}, at: [<c0340678>] mutex_lock+0x8/0x10
irq event stamp: 2309
hardirqs last  enabled at (2307): [<c03420a7>] _spin_unlock_irqrestore+0x47/0x60
hardirqs last disabled at (2309): [<c011cad6>] vprintk+0x26/0x360
softirqs last  enabled at (1988): [<c0121e0e>] __do_softirq+0xfe/0x110
softirqs last disabled at (1981): [<c0105049>] do_softirq+0x69/0x100
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0137d37>] mark_lock+0x1c7/0x6b0
 [<c0139512>] mark_held_locks+0x82/0xa0
 [<c01396a0>] trace_hardirqs_on+0xb0/0x160
 [<c034052f>] __mutex_lock_slowpath+0x15f/0x2a0
 [<c0340678>] mutex_lock+0x8/0x10
 [<c01d4ed1>] map_mft_record+0x51/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

 new dependency:  (&ni->mrec_lock){--..} =>  (cpa_lock){++..}
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138bc9>] __lock_acquire+0x9a9/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0342159>] _spin_lock_irqsave+0x39/0x50
 [<c01135fc>] change_page_attr+0x1c/0x2a0
 [<c01138aa>] kernel_map_pages+0x2a/0xa0
 [<c0147f53>] get_page_from_freelist+0x253/0x3c0
 [<c0148110>] __alloc_pages+0x50/0x2f0
 [<c0142f65>] read_cache_page+0x85/0x150
 [<c01d4f92>] map_mft_record+0x112/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

 new dependency:  (&ni->mrec_lock){--..} =>  (&inode->i_data.tree_lock){++..}
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138bc9>] __lock_acquire+0x9a9/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0341c82>] _write_lock_irq+0x32/0x40
 [<c0142181>] add_to_page_cache+0x41/0xd0
 [<c014222b>] add_to_page_cache_lru+0x1b/0x40
 [<c0142f17>] read_cache_page+0x37/0x150
 [<c01d4f92>] map_mft_record+0x112/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

 new dependency:  (&ni->mrec_lock){--..} =>  (&inode->i_data.private_lock){--..}
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138bc9>] __lock_acquire+0x9a9/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0341bbc>] _spin_lock+0x2c/0x40
 [<c0168053>] create_empty_buffers+0x33/0xa0
 [<c01c11e7>] ntfs_readpage+0x317/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01d4f92>] map_mft_record+0x112/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

new type c05bfcbc: &rl->lock
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138f24>] __lock_acquire+0xd04/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0134c4c>] down_read+0x2c/0x40
 [<c01c170c>] ntfs_readpage+0x83c/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01d4f92>] map_mft_record+0x112/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

acquire type [c05bfcbc] &rl->lock
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138759>] __lock_acquire+0x539/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0134c4c>] down_read+0x2c/0x40
 [<c01c170c>] ntfs_readpage+0x83c/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01d4f92>] map_mft_record+0x112/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

marked lock as {hardirq-on-R}:
 (&rl->lock){..-.}, at: [<c01c170c>] ntfs_readpage+0x83c/0x9a0
irq event stamp: 2440
hardirqs last  enabled at (2439): [<c0342107>] _read_unlock_irqrestore+0x47/0x60
hardirqs last disabled at (2440): [<c011cad6>] vprintk+0x26/0x360
softirqs last  enabled at (2436): [<c0121e0e>] __do_softirq+0xfe/0x110
softirqs last disabled at (2419): [<c0105049>] do_softirq+0x69/0x100
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0137d37>] mark_lock+0x1c7/0x6b0
 [<c0138776>] __lock_acquire+0x556/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0134c4c>] down_read+0x2c/0x40
 [<c01c170c>] ntfs_readpage+0x83c/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01d4f92>] map_mft_record+0x112/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

marked lock as {softirq-on-R}:
 (&rl->lock){..--}, at: [<c01c170c>] ntfs_readpage+0x83c/0x9a0
irq event stamp: 2440
hardirqs last  enabled at (2439): [<c0342107>] _read_unlock_irqrestore+0x47/0x60
hardirqs last disabled at (2440): [<c011cad6>] vprintk+0x26/0x360
softirqs last  enabled at (2436): [<c0121e0e>] __do_softirq+0xfe/0x110
softirqs last disabled at (2419): [<c0105049>] do_softirq+0x69/0x100
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0137d37>] mark_lock+0x1c7/0x6b0
 [<c0138882>] __lock_acquire+0x662/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0134c4c>] down_read+0x2c/0x40
 [<c01c170c>] ntfs_readpage+0x83c/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01d4f92>] map_mft_record+0x112/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

 new dependency:  (&ni->mrec_lock){--..} =>  (&rl->lock){..--}
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138bc9>] __lock_acquire+0x9a9/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0134c4c>] down_read+0x2c/0x40
 [<c01c170c>] ntfs_readpage+0x83c/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01d4f92>] map_mft_record+0x112/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

 new dependency:  (&ni->mrec_lock){--..} =>  (&lo->lo_lock){....}
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138bc9>] __lock_acquire+0x9a9/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0341f62>] _spin_lock_irq+0x32/0x40
 [<c0276e69>] loop_make_request+0x49/0x100
 [<c01f3f51>] generic_make_request+0x1c1/0x270
 [<c01f404e>] submit_bio+0x4e/0xe0
 [<c016731d>] submit_bh+0xcd/0x130
 [<c01c173e>] ntfs_readpage+0x86e/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01d4f92>] map_mft_record+0x112/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

 new dependency:  (&ni->mrec_lock){--..} =>  (&q->lock){++..}
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138bc9>] __lock_acquire+0x9a9/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0342159>] _spin_lock_irqsave+0x39/0x50
 [<c01140cb>] complete+0x1b/0x60
 [<c0276ed7>] loop_make_request+0xb7/0x100
 [<c01f3f51>] generic_make_request+0x1c1/0x270
 [<c01f404e>] submit_bio+0x4e/0xe0
 [<c016731d>] submit_bh+0xcd/0x130
 [<c01c173e>] ntfs_readpage+0x86e/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01d4f92>] map_mft_record+0x112/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

 new dependency:  (&ni->mrec_lock){--..} =>  (ide_lock){++..}
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138bc9>] __lock_acquire+0x9a9/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0341f62>] _spin_lock_irq+0x32/0x40
 [<c01f30d1>] generic_unplug_device+0x11/0x30
 [<c01f25e2>] blk_backing_dev_unplug+0x12/0x20
 [<c0276f57>] loop_unplug+0x37/0x40
 [<c01f25e2>] blk_backing_dev_unplug+0x12/0x20
 [<c0166269>] block_sync_page+0x39/0x50
 [<c0141cc8>] sync_page+0x38/0x50
 [<c033fc69>] __wait_on_bit_lock+0x49/0x70
 [<c0141dcb>] __lock_page+0x6b/0x80
 [<c0142fe5>] read_cache_page+0x105/0x150
 [<c01d4f92>] map_mft_record+0x112/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

 new dependency:  (&ni->mrec_lock){--..} =>  (&rq->lock){++..}
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138bc9>] __lock_acquire+0x9a9/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0341f62>] _spin_lock_irq+0x32/0x40
 [<c033ebba>] schedule+0xea/0xb80
 [<c033f678>] io_schedule+0x28/0x40
 [<c0141ccd>] sync_page+0x3d/0x50
 [<c033fc69>] __wait_on_bit_lock+0x49/0x70
 [<c0141dcb>] __lock_page+0x6b/0x80
 [<c0142fe5>] read_cache_page+0x105/0x150
 [<c01d4f92>] map_mft_record+0x112/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
 [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

acquire type [c05bfcc4] &ni->mrec_lock
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138759>] __lock_acquire+0x539/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c034043f>] __mutex_lock_slowpath+0x6f/0x2a0
 [<c0340678>] mutex_lock+0x8/0x10
 [<c01d4ed1>] map_mft_record+0x51/0x2c0
 [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
 [<c01d4545>] ntfs_iget+0x55/0x80
 [<c01e216a>] load_system_files+0x7a/0x2250
 [<c01e4f96>] ntfs_fill_super+0xc56/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

acquire type [c05bfcbc] &rl->lock
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138759>] __lock_acquire+0x539/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0134c4c>] down_read+0x2c/0x40
 [<c01c170c>] ntfs_readpage+0x83c/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01e2562>] load_system_files+0x472/0x2250
 [<c01e4f96>] ntfs_fill_super+0xc56/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

 new dependency:  (&s->s_umount#15){--..} =>  (&rl->lock){..--}
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138bc9>] __lock_acquire+0x9a9/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0134c4c>] down_read+0x2c/0x40
 [<c01c170c>] ntfs_readpage+0x83c/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01e2562>] load_system_files+0x472/0x2250
 [<c01e4f96>] ntfs_fill_super+0xc56/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

acquire type [c05bfcbc] &rl->lock
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138759>] __lock_acquire+0x539/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0134c8c>] down_write+0x2c/0x50
 [<c01c5a00>] ntfs_map_runlist+0x20/0x70
 [<c01c1791>] ntfs_readpage+0x8c1/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01e2562>] load_system_files+0x472/0x2250
 [<c01e4f96>] ntfs_fill_super+0xc56/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

marked lock as {hardirq-on-W}:
 (&rl->lock){-.--}, at: [<c01c5a00>] ntfs_map_runlist+0x20/0x70
irq event stamp: 2940
hardirqs last  enabled at (2939): [<c0342403>] restore_nocheck+0x12/0x15
hardirqs last disabled at (2940): [<c011cad6>] vprintk+0x26/0x360
softirqs last  enabled at (2938): [<c0121e0e>] __do_softirq+0xfe/0x110
softirqs last disabled at (2923): [<c0105049>] do_softirq+0x69/0x100
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0137d37>] mark_lock+0x1c7/0x6b0
 [<c0138588>] __lock_acquire+0x368/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0134c8c>] down_write+0x2c/0x50
 [<c01c5a00>] ntfs_map_runlist+0x20/0x70
 [<c01c1791>] ntfs_readpage+0x8c1/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01e2562>] load_system_files+0x472/0x2250
 [<c01e4f96>] ntfs_fill_super+0xc56/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

marked lock as {softirq-on-W}:
 (&rl->lock){----}, at: [<c01c5a00>] ntfs_map_runlist+0x20/0x70
irq event stamp: 2940
hardirqs last  enabled at (2939): [<c0342403>] restore_nocheck+0x12/0x15
hardirqs last disabled at (2940): [<c011cad6>] vprintk+0x26/0x360
softirqs last  enabled at (2938): [<c0121e0e>] __do_softirq+0xfe/0x110
softirqs last disabled at (2923): [<c0105049>] do_softirq+0x69/0x100
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0137d37>] mark_lock+0x1c7/0x6b0
 [<c0138882>] __lock_acquire+0x662/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c0134c8c>] down_write+0x2c/0x50
 [<c01c5a00>] ntfs_map_runlist+0x20/0x70
 [<c01c1791>] ntfs_readpage+0x8c1/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01e2562>] load_system_files+0x472/0x2250
 [<c01e4f96>] ntfs_fill_super+0xc56/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

acquire type [c05bfcc4] &ni->mrec_lock
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0138759>] __lock_acquire+0x539/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c034043f>] __mutex_lock_slowpath+0x6f/0x2a0
 [<c0340678>] mutex_lock+0x8/0x10
 [<c01d4ed1>] map_mft_record+0x51/0x2c0
 [<c01c5358>] ntfs_map_runlist_nolock+0x3d8/0x530
 [<c01c5a21>] ntfs_map_runlist+0x41/0x70
 [<c01c1791>] ntfs_readpage+0x8c1/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01e2562>] load_system_files+0x472/0x2250
 [<c01e4f96>] ntfs_fill_super+0xc56/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d

=====================================================
[ BUG: possible circular locking deadlock detected! ]
-----------------------------------------------------
mount/2406 is trying to acquire lock:
 (&ni->mrec_lock){--..}, at: [<c0340678>] mutex_lock+0x8/0x10

but task is already holding lock:
 (&rl->lock){----}, at: [<c01c5a00>] ntfs_map_runlist+0x20/0x70

which lock already depends on the new lock,
which could lead to circular deadlocks!

the existing dependency chain (in reverse order) is:

-> #1 (&rl->lock){----}:
       [<c013946b>] lock_acquire+0x7b/0xa0
       [<c0134c4c>] down_read+0x2c/0x40
       [<c01c170c>] ntfs_readpage+0x83c/0x9a0
       [<c0142f8c>] read_cache_page+0xac/0x150
       [<c01d4f92>] map_mft_record+0x112/0x2c0
       [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
       [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
       [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
       [<c016bf5e>] get_sb_bdev+0xde/0x120
       [<c01e03fb>] ntfs_get_sb+0x1b/0x30
       [<c016b583>] vfs_kern_mount+0x33/0xa0
       [<c016b646>] do_kern_mount+0x36/0x50
       [<c0181a4e>] do_mount+0x28e/0x640
       [<c0181e6f>] sys_mount+0x6f/0xb0
       [<c034233d>] sysenter_past_esp+0x56/0x8d

-> #0 (&ni->mrec_lock){--..}:
       [<c013946b>] lock_acquire+0x7b/0xa0
       [<c034043f>] __mutex_lock_slowpath+0x6f/0x2a0
       [<c0340678>] mutex_lock+0x8/0x10
       [<c01d4ed1>] map_mft_record+0x51/0x2c0
       [<c01c5358>] ntfs_map_runlist_nolock+0x3d8/0x530
       [<c01c5a21>] ntfs_map_runlist+0x41/0x70
       [<c01c1791>] ntfs_readpage+0x8c1/0x9a0
       [<c0142f8c>] read_cache_page+0xac/0x150
       [<c01e2562>] load_system_files+0x472/0x2250
       [<c01e4f96>] ntfs_fill_super+0xc56/0x1a50
       [<c016bf5e>] get_sb_bdev+0xde/0x120
       [<c01e03fb>] ntfs_get_sb+0x1b/0x30
       [<c016b583>] vfs_kern_mount+0x33/0xa0
       [<c016b646>] do_kern_mount+0x36/0x50
       [<c0181a4e>] do_mount+0x28e/0x640
       [<c0181e6f>] sys_mount+0x6f/0xb0
       [<c034233d>] sysenter_past_esp+0x56/0x8d

other info that might help us debug this:

2 locks held by mount/2406:
 #0:  (&s->s_umount#15){--..}, at: [<c016bb8d>] sget+0x11d/0x2f0
 #1:  (&rl->lock){----}, at: [<c01c5a00>] ntfs_map_runlist+0x20/0x70

stack backtrace:
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0136e61>] print_circular_bug_tail+0x61/0x70
 [<c013896f>] __lock_acquire+0x74f/0xde0
 [<c013946b>] lock_acquire+0x7b/0xa0
 [<c034043f>] __mutex_lock_slowpath+0x6f/0x2a0
 [<c0340678>] mutex_lock+0x8/0x10
 [<c01d4ed1>] map_mft_record+0x51/0x2c0
 [<c01c5358>] ntfs_map_runlist_nolock+0x3d8/0x530
 [<c01c5a21>] ntfs_map_runlist+0x41/0x70
 [<c01c1791>] ntfs_readpage+0x8c1/0x9a0
 [<c0142f8c>] read_cache_page+0xac/0x150
 [<c01e2562>] load_system_files+0x472/0x2250
 [<c01e4f96>] ntfs_fill_super+0xc56/0x1a50
 [<c016bf5e>] get_sb_bdev+0xde/0x120
 [<c01e03fb>] ntfs_get_sb+0x1b/0x30
 [<c016b583>] vfs_kern_mount+0x33/0xa0
 [<c016b646>] do_kern_mount+0x36/0x50
 [<c0181a4e>] do_mount+0x28e/0x640
 [<c0181e6f>] sys_mount+0x6f/0xb0
 [<c034233d>] sysenter_past_esp+0x56/0x8d
NTFS volume version 1.2.
NTFS-fs warning (device loop0): load_system_files(): Disabling sparse support due to NTFS volume version 1.2 (need at least version 3.0).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-08 10:53     ` Anton Altaparmakov
  2006-06-08 11:23       ` Ingo Molnar
@ 2006-06-08 16:11       ` Ingo Molnar
  2006-06-08 21:17         ` Anton Altaparmakov
  2006-06-09  7:19       ` Ingo Molnar
  2 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2006-06-08 16:11 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven


* Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> >   &ni->mrec_lock  - a spinlock protecting a particular inode's MFT data. 
> >                     (finegrained lock for the MFT record) It is 
> >                     typically taken by map_mft_record() and released by 
> >                     unmap_mft_record().
> 
> Correct, except s/spinlock/semaphore/ (that really should become a 
> mutex one day).

yeah - it's in fact a mutex already.

> No it is not as explained above.  Something has gotten confused 
> somewhere because the order of events is the wrong way round...

did my second trace make more sense? The dependency that the validator 
recorded can pretty much be taken for granted - it only stores 
dependencies that truly trigger at runtime. What shouldn't be taken for 
granted is my explanation of the events :-)

there is a wide array of methods and APIs available to express locking 
semantics to the validator in a natural and non-intrusive way [for cases 
where the validator gets it wrong or simply has no way of auto-learning 
them] - but for that i'll first have to understand the locking semantics
:-)

	Ingo


* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-08 16:11       ` Ingo Molnar
@ 2006-06-08 21:17         ` Anton Altaparmakov
  0 siblings, 0 replies; 21+ messages in thread
From: Anton Altaparmakov @ 2006-06-08 21:17 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven

On Thu, 8 Jun 2006, Ingo Molnar wrote:
> * Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > >   &ni->mrec_lock  - a spinlock protecting a particular inode's MFT data. 
> > >                     (finegrained lock for the MFT record) It is 
> > >                     typically taken by map_mft_record() and released by 
> > >                     unmap_mft_record().
> > 
> > Correct, except s/spinlock/semaphore/ (that really should become a 
> > mutex one day).
> 
> yeah - it's in fact a mutex already.
> 
> > No it is not as explained above.  Something has gotten confused 
> > somewhere because the order of events is the wrong way round...
> 
> did my second trace make more sense? The dependency that the validator 

Which one was that?  Could you please quote it again so we both know we 
are talking about the same thing?

> recorded can pretty much be taken for granted - it only stores 
> dependencies that truly trigger at runtime. What shouldn't be taken for 
> granted is my explanation of the events :-)

Ok. (-:

> there is a wide array of methods and APIs available to express locking 
> semantics to the validator in a natural and non-intrusive way [for cases 
> where the validator gets it wrong or simply has no way of auto-learning 
> them] - but for that i'll first have to understand the locking semantics
> :-)

Indeed. (-:

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-08 10:53     ` Anton Altaparmakov
  2006-06-08 11:23       ` Ingo Molnar
  2006-06-08 16:11       ` Ingo Molnar
@ 2006-06-09  7:19       ` Ingo Molnar
  2006-06-09  8:31         ` Anton Altaparmakov
  2 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2006-06-09  7:19 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven


* Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> This last lookup of physical location in map_mft_record() [actually in 
> readpage as map_mft_record() reads the page cache page containing the 
> record] cannot require us to load an extent mft record because the 
> runlist is complete so we cannot deadlock here.

ah! So basically, the locking sequences the validator observed during 
load_system_files() are one-time events that can only occur during 
ntfs_fill_super().

if we ignore those dependencies (of the initial reading of the MFT inode 
generating recursive map_mft_record() calls) and take into account that 
the MFT inode will never again trigger map_mft_record() calls, then the 
locking is conflict-free after mounting has finished, right?

so the validator is technically correct that load_system_files() 
generates locking patterns that have dependency conflicts - but that 
code is guaranteed to be single-threaded, because it is a one-time event 
and because the VFS has no access to the filesystem at that point, so 
there is zero parallelism.

can you think of any simple and clean solution that would avoid those 
conflicting dependencies within the NTFS code? Perhaps some way to read 
the MFT inode [and perhaps other, always-cached inodes that are later 
used as metadata] without locking? I can code it up and test it, but I 
don't think I can find the best method myself :-)

	Ingo


* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-08 11:23       ` Ingo Molnar
@ 2006-06-09  8:09         ` Anton Altaparmakov
  2006-06-10  7:59           ` Ingo Molnar
  0 siblings, 1 reply; 21+ messages in thread
From: Anton Altaparmakov @ 2006-06-09  8:09 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven

On Thu, 2006-06-08 at 13:23 +0200, Ingo Molnar wrote:
> * Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> 
> > No it is not as explained above.  Something has gotten confused 
> > somewhere because the order of events is the wrong way round...
> 
> ok, next try. Find below a chronological trace of the lock acquire and 
> new dependency events leading up to the circular dependency message.
> 
> The first (mrec_lock -> runlist.lock) dependency is:
> 
>  new dependency:  (&ni->mrec_lock){--..} =>  (&rl->lock){..--}
>  [<c0104bf2>] show_trace+0x12/0x20
>  [<c0104c19>] dump_stack+0x19/0x20
>  [<c0138bc9>] __lock_acquire+0x9a9/0xde0
>  [<c013946b>] lock_acquire+0x7b/0xa0
>  [<c0134c4c>] down_read+0x2c/0x40

This locks the runlist of the MFT (i.e. mft_ni->runlist.lock).

>  [<c01c170c>] ntfs_readpage+0x83c/0x9a0
>  [<c0142f8c>] read_cache_page+0xac/0x150
>  [<c01d4f92>] map_mft_record+0x112/0x2c0

This locks the mft record of MFT (i.e. mft_ni->mrec_lock).

>  [<c01d240d>] ntfs_read_locked_inode+0x8d/0x15d0
>  [<c01d3ddb>] ntfs_read_inode_mount+0x48b/0xba0
>  [<c01e4f3b>] ntfs_fill_super+0xbfb/0x1a50
>  [<c016bf5e>] get_sb_bdev+0xde/0x120
>  [<c01e03fb>] ntfs_get_sb+0x1b/0x30
>  [<c016b583>] vfs_kern_mount+0x33/0xa0
>  [<c016b646>] do_kern_mount+0x36/0x50
>  [<c0181a4e>] do_mount+0x28e/0x640
>  [<c0181e6f>] sys_mount+0x6f/0xb0
>  [<c034233d>] sysenter_past_esp+0x56/0x8d

Let us call the above case A; its description is: mapping the mft
record of the MFT itself.  This is a special case because it is
circular, i.e. we are mapping into memory a part of a file using the
contents of the file itself.  Sound weird?  It is.  It is a catch-22
situation.  How can you read a file if you need to have read the file
already in order to be able to read it?  (-:

The solution is ntfs_read_inode_mount().  It loads by hand, using
sb_bread(), the first mft record in the MFT file, i.e. mft record 0,
which is the mft record for the MFT itself.  Its on-disk location is
written down in the ntfs boot sector, which has already been read and
parsed at this point.

Armed with that information the volume is bootstrapped, so that the
mft record for the MFT itself can be read by bringing a page into
memory, which is what we see above with the
map_mft_record()->read_cache_page()->ntfs_readpage() chain.  The
bootstrapping consists of setting up the struct inode and the
ntfs_inode mft_ni appropriately and of loading the runlist for the
first fragment of the MFT into memory (mft_ni->runlist).  Doing this
means that ntfs_readpage() locks the runlist (mft_ni->runlist.lock)
for reading, looks up the physical location of the first mft record,
unlocks the runlist and then reads the data from disk.  This always
succeeds because we have just initialized the runlist by hand before
calling map_mft_record().

ntfs_read_inode_mount() then proceeds to map all extent mft records
for the MFT in sequence (if there are any) and to load their runlist
data into mft_ni->runlist as well.

Thus by the time ntfs_read_inode_mount() is finished mft_ni->runlist is
completely mapped into memory and it is guaranteed to stay there until
umount time.  There are various places in the ntfs driver where we go
BUG() if we ever find an unmapped region in mft_ni->runlist.

Make sense so far?  I hope so.  (-:

> and the opposite (runlist.lock -> mrec_lock) dependency wants to get 
> created at:
> 
>  [<c0340678>] mutex_lock+0x8/0x10
>  [<c01d4ed1>] map_mft_record+0x51/0x2c0
>  [<c01c5358>] ntfs_map_runlist_nolock+0x3d8/0x530
>  [<c01c5a21>] ntfs_map_runlist+0x41/0x70
>  [<c01c1791>] ntfs_readpage+0x8c1/0x9a0
>  [<c0142f8c>] read_cache_page+0xac/0x150
>  [<c01e2562>] load_system_files+0x472/0x2250
>  [<c01e4f96>] ntfs_fill_super+0xc56/0x1a50
>  [<c016bf5e>] get_sb_bdev+0xde/0x120
>  [<c01e03fb>] ntfs_get_sb+0x1b/0x30
>  [<c016b583>] vfs_kern_mount+0x33/0xa0
>  [<c016b646>] do_kern_mount+0x36/0x50
>  [<c0181a4e>] do_mount+0x28e/0x640
>  [<c0181e6f>] sys_mount+0x6f/0xb0
>  [<c034233d>] sysenter_past_esp+0x56/0x8d

Ok, by now we are in the "normal" code paths with mft_ni->runlist fully
loaded in.  The above trace I think must be coming from
check_mft_mirror()->ntfs_map_page() which tries to read the MFT and then
the MFT mirror to compare their contents.

The driver loads the MFT first, but page index 0 of the MFT is already
present, so read_cache_page() finds the page in the page cache (you can
see it happen in the long trace you sent me).  I am therefore assuming
the above trace comes from the next ntfs_map_page(), on page index 0 of
the mft mirror.

Let me describe that one:

ntfs_map_page(mft mirror, page index 0);
-> read_cache_page();
  -> ntfs_readpage();

ntfs_readpage() takes the runlist lock of the mft mirror inode whose
data is being read into the page (NTFS_I(vol->mftmirr_ino)->runlist.lock)
for reading.

It then looks up the physical location for the blocks in the first page
in the runlist.  But given we have never accessed the data in the mft
mirror inode, the runlist is not mapped into memory.  Thus
ntfs_readpage() drops the runlist lock and calls ntfs_map_runlist() to
map the runlist into memory.

ntfs_map_runlist() retakes the runlist lock for writing, as it is about
to modify the in-memory runlist.  It then checks that someone else did
not map the runlist into memory whilst we waited on the lock and,
assuming we still need to map it, goes and maps the mft record for the
mft mirror inode, which is the map_mft_record() we see above.

This takes the mft record lock for the mft mirror
(NTFS_I(vol->mftmirr_ino)->mrec_lock), which is what the lock validator
complains about.  Let us call this "normal" code path B; its
description is: reading data from a file that is not the MFT itself.

To read the file data we need to know where it is on disk, for which we
need the runlist, so we must lock it so that a file modification (or a
concurrent file read causing another fragment of the runlist to be read
into memory) cannot change it under our feet.  We then look up the
physical location in the runlist and, in the common case, can read the
data.  In the uncommon case (first access) the in-memory runlist does
not contain the physical mapping data, so we need to map the mft record
(i.e. the on-disk inode) into memory and read the runlist from it into
memory, after which we can look up the physical location of the data in
the runlist as in the common case.

Thus to summarise, there are two cases:

1)

The normal access pattern (B) where the runlist lock is always taken
first.  And the mft record lock is taken second and only if the runlist
is incomplete in-memory.

Of course on file modification, this is also the case, the runlist lock
is taken first, then the mft record lock is taken and thus both the
runlist and the inode can be updated with the new data (e.g. on a file
extend).

2)

The special access pattern (A) where the MFT record 0 itself is read for
the first time which is where the mft record lock is taken first and
then the runlist lock is taken to determine its on-disk location.

If this were a normal file, we would then need to take the mft record
lock _again_ (thus deadlock!) to load the runlist data into memory.  To
that extent the lock validator is correct: there would be a 100% certain
deadlock.

BUT we bootstrapped the runlist of the mft inode mft_ni in
ntfs_read_inode_mount() so that it always finds the physical mapping in
the runlist, thus it _never_ has to take the mft record lock that second
time, and thus the deadlock _never_ happens.

> does this trace make more sense?

Yes, makes perfect sense to me.  I hope with my descriptions above it
makes sense to you, too.  Please let me know if not.  (-:

It is not the simplest of things because of the catch-22 situation,
which is why such elaborate bootstrapping is required.  But once
bootstrapped it is IMHO very elegant, because it means that the exact
same code is used for accessing both metadata and normal file data.

And we get to confuse the lock validator which has got to be a bonus.
(-;

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-09  7:19       ` Ingo Molnar
@ 2006-06-09  8:31         ` Anton Altaparmakov
  0 siblings, 0 replies; 21+ messages in thread
From: Anton Altaparmakov @ 2006-06-09  8:31 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven

On Fri, 2006-06-09 at 09:19 +0200, Ingo Molnar wrote: 
> * Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > This last lookup of physical location in map_mft_record() [actually in 
> > readpage as map_mft_record() reads the page cache page containing the 
> > record] cannot require us to load an extent mft record because the 
> > runlist is complete so we cannot deadlock here.
> 
> ah! So basically, the locking sequences the validator observed during 
> load_system_files() are a one-time event that can only occur during 
> ntfs_fill_super().
> 
> if we ignore those dependencies (of the initial reading of the MFT inode 
> generating recursive map_mft_record() calls) and take into account that 
> the MFT inode will never again trigger map_mft_record() calls, then the 
> locking is conflict-free after mounting has finished, right?

Correct!  (-:

> so the validator is technically correct that load_system_files() 
> generates locking patterns that have dependency conflicts - but that 
> code is guaranteed to be single-threaded because it's a one time event 
> and because the VFS has no access to the filesystem at that time yet so 
> there is zero parallelism.

Correct.

I just sent you a long description of the process before reading your
email above...  I need not have bothered had I known you already
understood the case!

> can you think of any simple and clean solution that would avoid those 
> conflicting dependencies within the NTFS code? Perhaps some way to read 
> the MFT inode [and perhaps other, always-cached inodes that are later 
> used as metadata] without locking? I can code it up and test it, but i 
> dont think i can find the best method myself :-)

I really do not want to do that.  It would penalise ntfs performance in
the general case by introducing "if in mount don't lock, otherwise lock"
checks pretty much everywhere in the driver, or you would need to
duplicate a ton of code, which I avoided duplicating precisely by using
my bootstrapping mechanism, which is why it exists in the first place!

And given that the only reason for the change you want to make is to
make the lock validator happy, I would much rather you did something
with the lock validator instead...  Can't you just insert a call to the
validator in ntfs_fill_super() just before the call to
ntfs_read_inode_mount(), to the effect of
"turn_off_lock_validator_data_gathering()", and then just after
ntfs_read_inode_mount() returns insert a call to the validator to the
effect of "turn_on_lock_validator_data_gathering()"?

Does that sound reasonable to you?

I really cannot see why code needs to be penalised for a validator...  I
would much rather see a complicated and slow lock validator than a
complicated and slow driver, given that a lock validator is only
compiled in for debugging purposes whilst a driver is compiled into all
kernels, including release ones...

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-09  8:09         ` Anton Altaparmakov
@ 2006-06-10  7:59           ` Ingo Molnar
  2006-06-10  8:48             ` Anton Altaparmakov
  0 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2006-06-10  7:59 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven


* Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> The normal access pattern (B) where the runlist lock is always taken 
> first.  And the mft record lock is taken second and only if the 
> runlist is incomplete in-memory.
> 
> Of course on file modification, this is also the case, the runlist 
> lock is taken first, then the mft record lock is taken and thus both 
> the runlist and the inode can be updated with the new data (e.g. on a 
> file extend).

thanks for the detailed explanation!

I have annotated the code for the lock validator as much as i could, by:

- excluding ntfs_fill_super() from the locking rules,

- 'splitting' the MFT's mrec_lock and runlist->lock locking rules from 
>   the other inodes' locking rules,

- splitting the mrec_lock rules of extent inodes. (We map them
  recursively while having the main inode mft record mapped. The nesting
  is safe because inode->extent_inode is a noncircular relation.)

Still there seems to be a case that the validator does not grok: 
load_attribute_list() seems to take the lock in the opposite order from 
what you described above. What locking detail am i missing? [let me know 
if you need all dependency events leading up to this message from the 
validator]

	Ingo

=======================================================
[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
ls/2581 is trying to acquire lock:
 (&rl->lock){----}, at: [<c01c1f5b>] load_attribute_list+0xfb/0x3c0

but task is already holding lock:
 (&ni->mrec_lock){--..}, at: [<c01d50c5>] map_mft_record_type+0x55/0x2d0

which lock already depends on the new lock,
which could lead to circular locking dependencies.

the existing dependency chain (in reverse order) is:

-> #1 (&ni->mrec_lock){--..}:
       [<c01394df>] lock_acquire+0x6f/0x90
       [<c0346183>] mutex_lock_nested+0x73/0x2a0
       [<c01d5e43>] map_mft_record+0x53/0x2c0
       [<c01c54f8>] ntfs_map_runlist_nolock+0x3d8/0x530
       [<c01c5bc1>] ntfs_map_runlist+0x41/0x70
       [<c01c1929>] ntfs_readpage+0x8c9/0x9b0
       [<c0142ffc>] read_cache_page+0xac/0x150
       [<c01e212d>] ntfs_statfs+0x41d/0x660
       [<c0163254>] vfs_statfs+0x54/0x70
       [<c0163288>] vfs_statfs64+0x18/0x30
       [<c0163384>] sys_statfs64+0x64/0xa0
       [<c0347dcd>] sysenter_past_esp+0x56/0x8d

-> #0 (&rl->lock){----}:
       [<c01394df>] lock_acquire+0x6f/0x90
       [<c0134c8a>] down_read_nested+0x2a/0x40
       [<c01c1f5b>] load_attribute_list+0xfb/0x3c0
       [<c01d323e>] ntfs_read_locked_inode+0xcee/0x15d0
       [<c01d4735>] ntfs_iget+0x55/0x80
       [<c01db3da>] ntfs_lookup+0x14a/0x740
       [<c01736b6>] do_lookup+0x126/0x150
       [<c0173ef3>] __link_path_walk+0x813/0xe50
       [<c017457c>] link_path_walk+0x4c/0xf0
       [<c0174a2d>] do_path_lookup+0xad/0x260
       [<c0175228>] __user_walk_fd+0x38/0x60
       [<c016e3be>] vfs_lstat_fd+0x1e/0x50
       [<c016e401>] vfs_lstat+0x11/0x20
       [<c016ec04>] sys_lstat64+0x14/0x30
       [<c0347dcd>] sysenter_past_esp+0x56/0x8d

other info that might help us debug this:

2 locks held by ls/2581:
 #0:  (&inode->i_mutex){--..}, at: [<c0346108>] mutex_lock+0x8/0x10
 #1:  (&ni->mrec_lock){--..}, at: [<c01d50c5>] map_mft_record_type+0x55/0x2d0

stack backtrace:
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0136ef1>] print_circular_bug_tail+0x61/0x70
 [<c01389ff>] __lock_acquire+0x74f/0xde0
 [<c01394df>] lock_acquire+0x6f/0x90
 [<c0134c8a>] down_read_nested+0x2a/0x40
 [<c01c1f5b>] load_attribute_list+0xfb/0x3c0
 [<c01d323e>] ntfs_read_locked_inode+0xcee/0x15d0
 [<c01d4735>] ntfs_iget+0x55/0x80
 [<c01db3da>] ntfs_lookup+0x14a/0x740
 [<c01736b6>] do_lookup+0x126/0x150
 [<c0173ef3>] __link_path_walk+0x813/0xe50
 [<c017457c>] link_path_walk+0x4c/0xf0
 [<c0174a2d>] do_path_lookup+0xad/0x260
 [<c0175228>] __user_walk_fd+0x38/0x60
 [<c016e3be>] vfs_lstat_fd+0x1e/0x50
 [<c016e401>] vfs_lstat+0x11/0x20
 [<c016ec04>] sys_lstat64+0x14/0x30
 [<c0347dcd>] sysenter_past_esp+0x56/0x8d

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-10  7:59           ` Ingo Molnar
@ 2006-06-10  8:48             ` Anton Altaparmakov
  2006-06-11  5:31               ` Ingo Molnar
  0 siblings, 1 reply; 21+ messages in thread
From: Anton Altaparmakov @ 2006-06-10  8:48 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven

On Sat, 10 Jun 2006, Ingo Molnar wrote:
> * Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > The normal access pattern (B) where the runlist lock is always taken 
> > first.  And the mft record lock is taken second and only if the 
> > runlist is incomplete in-memory.
> > 
> > Of course on file modification, this is also the case, the runlist 
> > lock is taken first, then the mft record lock is taken and thus both 
> > the runlist and the inode can be updated with the new data (e.g. on a 
> > file extend).
> 
> thanks for the detailed explanation!

You are welcome.  NTFS is not exactly simple code and its locking is 
certainly not simple at all...

> I have annotated the code for the lock validator as much as i could, by:
> 
> - excluding ntfs_fill_super() from the locking rules,
> 
> - 'splitting' the MFT's mrec_lock and runlist->lock locking rules from 
>   the other inodes' locking rules,
> 
> - splitting the mrec_lock rules of extent inodes. (We map them
>   recursively while having the main inode mft record mapped. The nesting
>   is safe because inode->extent_inode is a noncircular relation.)
 
Sounds good!

> Still there seems to be a case that the validator does not grok: 
> load_attribute_list() seems to take the lock in the opposite order from 
> what you described above. What locking detail am i missing? [let me know 
> if you need all dependency events leading up to this message from the 
> validator]

Ah, yes, I had completely forgotten about that one, sorry.  This is 
another special case I am afraid.  Note how there are two runlists in an 
ntfs_inode.  We have:

ntfs_inode->runlist - This is the runlist for the inode data (for files 
this is the actual file data, for directories this is the directory 
B-tree, for index inodes this is the view index B-tree, for attribute 
inodes this is the attribute data).  This is what we have been dealing 
with before and what you have already solved by the sounds of things.

ntfs_inode->attr_list_rl - This is the runlist for the attribute list 
attribute of a base inode, i.e. it does not exist for index inodes or 
attribute inodes, it only exists for regular files and directories (and 
once we implement symlinks, sockets, fifos and device special files for 
those, too).  Further it only exists for inodes which have extent inodes 
and those are definitely in the minority but that is irrelevant as we have 
to support it nonetheless.  The attribute list attribute is a linear
list, or perhaps better described as an array of variable-size records;
each record describes an attribute belonging to the inode, which part of
the attribute this attribute extent covers, and where (i.e. in which mft
record) this attribute extent belongs.  As such the attribute list
attribute is only ever used when the mft record is mapped: when an
attribute is being looked up, and also when an attribute is created,
deleted or truncated, in which case the attribute list must be updated.
In all those cases the mft record is mapped.  So yes, in this case the
locking is inverted, but that is because ntfs_inode->attr_list_rl->lock
will always be taken with the mft record lock held.

I think the lock validator has the problem of not knowing that there are 
two different types of runlist which is why it complains about it.

Another thing to note is that load_attribute_list() is only called at 
iget() time thus we have exclusive access to the inode so it is a special 
case in any case.

And at present the driver in the kernel does not ever modify the attribute 
list attribute (and hence does not modify its runlist either) thus the 
only time the attr_list_rl is used is at load_attribute_list() time.

Something I am noticing now is that given that the attribute list runlist 
is only ever accessed under the mft record lock it is in fact pointless to 
lock it at all as the mft record lock is an exclusive lock.

The reason I never noticed this before is that I started off with the
mft record lock being a read-write lock, so multiple processes could be
read-accessing the mft record, and hence I wanted to keep the attribute
list locked as well.

I have to look through the code to make sure but I think we could simply 
not take the runlist lock when accessing the attr_list_rl and then the 
problem would go away.  The only thing is that when we start modifying the 
runlist attr_list_rl we will use functions that are common for the two 
runlist code paths and they may take the lock.  Not a huge problem as they 
can always use the _nolock version.

So I suggest, if it is easy, teaching the validator not to complain
about this, as that is the order in which the locking is meant to happen
and it is not circular, because two different types of runlist are being
locked.  That gives me the possibility of later turning the mft record
lock back into a read-write lock to improve parallelism when reading
data on ntfs...

If that is hard then you or I (or both) need to investigate whether the 
current driver can really simply not take the runlist lock when accessing 
attr_list_rl i.e. we need to make sure it always happens under the 
protection of the mft record lock and then the problem will go away.

Best regards,

	Anton

> =======================================================
> [ INFO: possible circular locking dependency detected ]
> -------------------------------------------------------
> ls/2581 is trying to acquire lock:
>  (&rl->lock){----}, at: [<c01c1f5b>] load_attribute_list+0xfb/0x3c0
> 
> but task is already holding lock:
>  (&ni->mrec_lock){--..}, at: [<c01d50c5>] map_mft_record_type+0x55/0x2d0
> 
> which lock already depends on the new lock,
> which could lead to circular locking dependencies.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (&ni->mrec_lock){--..}:
>        [<c01394df>] lock_acquire+0x6f/0x90
>        [<c0346183>] mutex_lock_nested+0x73/0x2a0
>        [<c01d5e43>] map_mft_record+0x53/0x2c0
>        [<c01c54f8>] ntfs_map_runlist_nolock+0x3d8/0x530
>        [<c01c5bc1>] ntfs_map_runlist+0x41/0x70
>        [<c01c1929>] ntfs_readpage+0x8c9/0x9b0
>        [<c0142ffc>] read_cache_page+0xac/0x150
>        [<c01e212d>] ntfs_statfs+0x41d/0x660
>        [<c0163254>] vfs_statfs+0x54/0x70
>        [<c0163288>] vfs_statfs64+0x18/0x30
>        [<c0163384>] sys_statfs64+0x64/0xa0
>        [<c0347dcd>] sysenter_past_esp+0x56/0x8d
> 
> -> #0 (&rl->lock){----}:
>        [<c01394df>] lock_acquire+0x6f/0x90
>        [<c0134c8a>] down_read_nested+0x2a/0x40
>        [<c01c1f5b>] load_attribute_list+0xfb/0x3c0
>        [<c01d323e>] ntfs_read_locked_inode+0xcee/0x15d0
>        [<c01d4735>] ntfs_iget+0x55/0x80
>        [<c01db3da>] ntfs_lookup+0x14a/0x740
>        [<c01736b6>] do_lookup+0x126/0x150
>        [<c0173ef3>] __link_path_walk+0x813/0xe50
>        [<c017457c>] link_path_walk+0x4c/0xf0
>        [<c0174a2d>] do_path_lookup+0xad/0x260
>        [<c0175228>] __user_walk_fd+0x38/0x60
>        [<c016e3be>] vfs_lstat_fd+0x1e/0x50
>        [<c016e401>] vfs_lstat+0x11/0x20
>        [<c016ec04>] sys_lstat64+0x14/0x30
>        [<c0347dcd>] sysenter_past_esp+0x56/0x8d
> 
> other info that might help us debug this:
> 
> 2 locks held by ls/2581:
>  #0:  (&inode->i_mutex){--..}, at: [<c0346108>] mutex_lock+0x8/0x10
>  #1:  (&ni->mrec_lock){--..}, at: [<c01d50c5>] map_mft_record_type+0x55/0x2d0
> 
> stack backtrace:
>  [<c0104bf2>] show_trace+0x12/0x20
>  [<c0104c19>] dump_stack+0x19/0x20
>  [<c0136ef1>] print_circular_bug_tail+0x61/0x70
>  [<c01389ff>] __lock_acquire+0x74f/0xde0
>  [<c01394df>] lock_acquire+0x6f/0x90
>  [<c0134c8a>] down_read_nested+0x2a/0x40
>  [<c01c1f5b>] load_attribute_list+0xfb/0x3c0
>  [<c01d323e>] ntfs_read_locked_inode+0xcee/0x15d0
>  [<c01d4735>] ntfs_iget+0x55/0x80
>  [<c01db3da>] ntfs_lookup+0x14a/0x740
>  [<c01736b6>] do_lookup+0x126/0x150
>  [<c0173ef3>] __link_path_walk+0x813/0xe50
>  [<c017457c>] link_path_walk+0x4c/0xf0
>  [<c0174a2d>] do_path_lookup+0xad/0x260
>  [<c0175228>] __user_walk_fd+0x38/0x60
>  [<c016e3be>] vfs_lstat_fd+0x1e/0x50
>  [<c016e401>] vfs_lstat+0x11/0x20
>  [<c016ec04>] sys_lstat64+0x14/0x30
>  [<c0347dcd>] sysenter_past_esp+0x56/0x8d

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-10  8:48             ` Anton Altaparmakov
@ 2006-06-11  5:31               ` Ingo Molnar
  2006-06-11  7:10                 ` Anton Altaparmakov
  0 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2006-06-11  5:31 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven


* Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> I think the lock validator has the problem of not knowing that there 
> are two different types of runlist which is why it complains about it.

ah, ok! What happened is that the rwlock_init() 'lock type keys' 
(inlined via ntfs_init_runlist()) for the two runlists were 'merged':

        ntfs_init_runlist(&ni->runlist);
        ntfs_init_runlist(&ni->attr_list_rl);

i have annotated things by initializing the two locks separately (via a 
simple oneliner change), and this has solved the problem.

The two types are now properly 'split', and the validator tracks them 
separately and understands their separate roles. So there's no need to 
touch attribute runlist locking in the NTFS code.

Some background: the validator uses lock initialization as a hint about 
which locks share the same 'type' (or locking domain). Currently this is 
done via:

 #define init_rwsem(sem)                                         \
 do {                                                            \
         static struct lockdep_type_key __key;                   \
                                                                 \
         __init_rwsem((sem), #sem, &__key);                      \
 } while (0)

But since ntfs_init_runlist() got inlined to within the same function, 
the __key there got shared. A better method might be to use the return 
address in __init_rwsem() [and i used that in earlier versions of the 
validator] - but even then there's no guarantee that this code will 
always be inlined. In any case, this is known to be a heuristic: it is 
totally valid to initialize locks in an arbitrary manner, and the 
validator only tries to guess right in 99.9% of the cases. In cases 
where the validator incorrectly merged (or split) lock types [such as in 
this case], the problem can be found easily - and the annotation is easy 
as well.

The good news is that after this fix things went pretty well for 
readonly stuff and i got no new complaints from the validator. Phew! :-) 

It does not fully cover read-write mode yet. When extending an existing 
file the validator did not understand the following locking construct:

=======================================================
[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
cat/2802 is trying to acquire lock:
 (&vol->lcnbmp_lock){----}, at: [<c01e80dd>] ntfs_cluster_alloc+0x10d/0x23a0

but task is already holding lock:
 (&ni->mrec_lock){--..}, at: [<c01d5e53>] map_mft_record+0x53/0x2c0

which lock already depends on the new lock,
which could lead to circular dependencies.

the existing dependency chain (in reverse order) is:

-> #2 (&ni->mrec_lock){--..}:
       [<c01394df>] lock_acquire+0x6f/0x90
       [<c0346193>] mutex_lock_nested+0x73/0x2a0
       [<c01d5e53>] map_mft_record+0x53/0x2c0
       [<c01c54f8>] ntfs_map_runlist_nolock+0x3d8/0x530
       [<c01c5bc1>] ntfs_map_runlist+0x41/0x70
       [<c01c1929>] ntfs_readpage+0x8c9/0x9b0
       [<c0142ffc>] read_cache_page+0xac/0x150
       [<c01e213d>] ntfs_statfs+0x41d/0x660
       [<c0163254>] vfs_statfs+0x54/0x70
       [<c0163288>] vfs_statfs64+0x18/0x30
       [<c0163384>] sys_statfs64+0x64/0xa0
       [<c0347ddd>] sysenter_past_esp+0x56/0x8d

-> #1 (&rl->lock){----}:
       [<c01394df>] lock_acquire+0x6f/0x90
       [<c0134c8a>] down_read_nested+0x2a/0x40
       [<c01c18a4>] ntfs_readpage+0x844/0x9b0
       [<c0142ffc>] read_cache_page+0xac/0x150
       [<c01e213d>] ntfs_statfs+0x41d/0x660
       [<c0163254>] vfs_statfs+0x54/0x70
       [<c0163288>] vfs_statfs64+0x18/0x30
       [<c0163384>] sys_statfs64+0x64/0xa0
       [<c0347ddd>] sysenter_past_esp+0x56/0x8d

-> #0 (&vol->lcnbmp_lock){----}:
       [<c01394df>] lock_acquire+0x6f/0x90
       [<c0134ccc>] down_write+0x2c/0x50
       [<c01e80dd>] ntfs_cluster_alloc+0x10d/0x23a0
       [<c01c427d>] ntfs_attr_extend_allocation+0x5fd/0x14a0
       [<c01caa38>] ntfs_file_buffered_write+0x188/0x3880
       [<c01ce2a8>] ntfs_file_aio_write_nolock+0x178/0x210
       [<c01ce3f1>] ntfs_file_writev+0xb1/0x150
       [<c01ce4af>] ntfs_file_write+0x1f/0x30
       [<c0164f09>] vfs_write+0x99/0x160
       [<c016589d>] sys_write+0x3d/0x70
       [<c0347ddd>] sysenter_past_esp+0x56/0x8d

other info that might help us debug this:

3 locks held by cat/2802:
 #0:  (&inode->i_mutex){--..}, at: [<c0346118>] mutex_lock+0x8/0x10
 #1:  (&rl->lock){----}, at: [<c01c3dbe>] ntfs_attr_extend_allocation+0x13e/0x14a0
 #2:  (&ni->mrec_lock){--..}, at: [<c01d5e53>] map_mft_record+0x53/0x2c0

stack backtrace:
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0136ef1>] print_circular_bug_tail+0x61/0x70
 [<c01389ff>] __lock_acquire+0x74f/0xde0
 [<c01394df>] lock_acquire+0x6f/0x90
 [<c0134ccc>] down_write+0x2c/0x50
 [<c01e80dd>] ntfs_cluster_alloc+0x10d/0x23a0
 [<c01c427d>] ntfs_attr_extend_allocation+0x5fd/0x14a0
 [<c01caa38>] ntfs_file_buffered_write+0x188/0x3880
 [<c01ce2a8>] ntfs_file_aio_write_nolock+0x178/0x210
 [<c01ce3f1>] ntfs_file_writev+0xb1/0x150
 [<c01ce4af>] ntfs_file_write+0x1f/0x30
 [<c0164f09>] vfs_write+0x99/0x160
 [<c016589d>] sys_write+0x3d/0x70
 [<c0347ddd>] sysenter_past_esp+0x56/0x8d

this seems to be a pretty complex 3-way dependency related to 
&vol->lcnbmp_lock and &ni->mrec_lock. Should i send a full dependency 
events trace perhaps?

	Ingo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-11  5:31               ` Ingo Molnar
@ 2006-06-11  7:10                 ` Anton Altaparmakov
  2006-06-12  8:31                   ` Ingo Molnar
  0 siblings, 1 reply; 21+ messages in thread
From: Anton Altaparmakov @ 2006-06-11  7:10 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven

On Sun, 11 Jun 2006, Ingo Molnar wrote:
> * Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> 
> > I think the lock validator has the problem of not knowing that there 
> > are two different types of runlist which is why it complains about it.
> 
> ah, ok! What happened is that the rwlock_init() 'lock type keys' 
> (inlined via ntfs_init_runlist()) for the two runlists were 'merged':
> 
>         ntfs_init_runlist(&ni->runlist);
>         ntfs_init_runlist(&ni->attr_list_rl);
> 
> i have annotated things by initializing the two locks separately (via a 
> simple oneliner change), and this has solved the problem.
> 
> The two types are now properly 'split', and the validator tracks them 
> separately and understands their separate roles. So there's no need to 
> touch attribute runlist locking in the NTFS code.

Great!

> Some background: the validator uses lock initialization as a hint about 

Thanks for explaining.

> The good news is that after this fix things went pretty well for 
> readonly stuff and i got no new complaints from the validator. Phew! :-) 

Great!

> It does not fully cover read-write mode yet. When extending an existing 
> file the validator did not understand the following locking construct:
> 
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> -------------------------------------------------------
> cat/2802 is trying to acquire lock:
>  (&vol->lcnbmp_lock){----}, at: [<c01e80dd>] ntfs_cluster_alloc+0x10d/0x23a0
> 
> but task is already holding lock:
>  (&ni->mrec_lock){--..}, at: [<c01d5e53>] map_mft_record+0x53/0x2c0
> 
> which lock already depends on the new lock,
> which could lead to circular dependencies.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #2 (&ni->mrec_lock){--..}:
>        [<c01394df>] lock_acquire+0x6f/0x90
>        [<c0346193>] mutex_lock_nested+0x73/0x2a0
>        [<c01d5e53>] map_mft_record+0x53/0x2c0
>        [<c01c54f8>] ntfs_map_runlist_nolock+0x3d8/0x530
>        [<c01c5bc1>] ntfs_map_runlist+0x41/0x70
>        [<c01c1929>] ntfs_readpage+0x8c9/0x9b0
>        [<c0142ffc>] read_cache_page+0xac/0x150
>        [<c01e213d>] ntfs_statfs+0x41d/0x660
>        [<c0163254>] vfs_statfs+0x54/0x70
>        [<c0163288>] vfs_statfs64+0x18/0x30
>        [<c0163384>] sys_statfs64+0x64/0xa0
>        [<c0347ddd>] sysenter_past_esp+0x56/0x8d
> 
> -> #1 (&rl->lock){----}:
>        [<c01394df>] lock_acquire+0x6f/0x90
>        [<c0134c8a>] down_read_nested+0x2a/0x40
>        [<c01c18a4>] ntfs_readpage+0x844/0x9b0
>        [<c0142ffc>] read_cache_page+0xac/0x150
>        [<c01e213d>] ntfs_statfs+0x41d/0x660
>        [<c0163254>] vfs_statfs+0x54/0x70
>        [<c0163288>] vfs_statfs64+0x18/0x30
>        [<c0163384>] sys_statfs64+0x64/0xa0
>        [<c0347ddd>] sysenter_past_esp+0x56/0x8d
> 
> -> #0 (&vol->lcnbmp_lock){----}:
>        [<c01394df>] lock_acquire+0x6f/0x90
>        [<c0134ccc>] down_write+0x2c/0x50
>        [<c01e80dd>] ntfs_cluster_alloc+0x10d/0x23a0
>        [<c01c427d>] ntfs_attr_extend_allocation+0x5fd/0x14a0
>        [<c01caa38>] ntfs_file_buffered_write+0x188/0x3880
>        [<c01ce2a8>] ntfs_file_aio_write_nolock+0x178/0x210
>        [<c01ce3f1>] ntfs_file_writev+0xb1/0x150
>        [<c01ce4af>] ntfs_file_write+0x1f/0x30
>        [<c0164f09>] vfs_write+0x99/0x160
>        [<c016589d>] sys_write+0x3d/0x70
>        [<c0347ddd>] sysenter_past_esp+0x56/0x8d
> 
> other info that might help us debug this:
> 
> 3 locks held by cat/2802:
>  #0:  (&inode->i_mutex){--..}, at: [<c0346118>] mutex_lock+0x8/0x10
>  #1:  (&rl->lock){----}, at: [<c01c3dbe>] ntfs_attr_extend_allocation+0x13e/0x14a0
>  #2:  (&ni->mrec_lock){--..}, at: [<c01d5e53>] map_mft_record+0x53/0x2c0
> 
> stack backtrace:
>  [<c0104bf2>] show_trace+0x12/0x20
>  [<c0104c19>] dump_stack+0x19/0x20
>  [<c0136ef1>] print_circular_bug_tail+0x61/0x70
>  [<c01389ff>] __lock_acquire+0x74f/0xde0
>  [<c01394df>] lock_acquire+0x6f/0x90
>  [<c0134ccc>] down_write+0x2c/0x50
>  [<c01e80dd>] ntfs_cluster_alloc+0x10d/0x23a0
>  [<c01c427d>] ntfs_attr_extend_allocation+0x5fd/0x14a0
>  [<c01caa38>] ntfs_file_buffered_write+0x188/0x3880
>  [<c01ce2a8>] ntfs_file_aio_write_nolock+0x178/0x210
>  [<c01ce3f1>] ntfs_file_writev+0xb1/0x150
>  [<c01ce4af>] ntfs_file_write+0x1f/0x30
>  [<c0164f09>] vfs_write+0x99/0x160
>  [<c016589d>] sys_write+0x3d/0x70
>  [<c0347ddd>] sysenter_past_esp+0x56/0x8d
> 
> this seems to be a pretty complex 3-way dependency related to 
> &vol->lcnbmp_lock and &ni->mrec_lock. Should i send a full dependency 
> events trace perhaps?

First an explanation of two relevant locks that are both going to upset 
the validator.  I half expected this to happen given what has happened so 
far.  The two locks are lcnbmp_lock and mftbmp_lock (both are r/w 
semaphores).

Both those locks lock the two big bitmaps on an NTFS volume:

lcnbmp_lock locks accesses to the cluster bitmap (i.e. the physical blocks 
bitmap where each 0 bit means an unallocated block and a 1 bit means an 
allocated block).

mftbmp_lock locks accesses to the mft bitmap (i.e. the inodes bitmap where 
each 0 bit means an unused inode and a 1 bit means the inode is in use).

The locks are taken when a cluster is being allocated / an mft record is 
being allocated.

The reason the validator will likely get very confused is that they 
introduce reverse semantics depending on what you are locking, i.e.

When extending a file the locking order is:

Lock runlist (data one) of file to be extended for writing (as it will be 
modified by the file extension).

Lock lcnbmp_lock for writing (as the bitmap will be modified by the 
allocated clusters).

Lock runlist of $Bitmap system file (the contents of which contain the 
bitmap) for reading so the on-disk physical location can be determined.

If the above fails, then re-lock the runlist of $Bitmap for writing,
lock the mft record of the $Bitmap system file, load in the runlist of
the bitmap, then unlock the mft record and unlock the runlist of
$Bitmap.

Load the $Bitmap data and allocate clusters in the bitmap.

Unlock lcnbmp_lock.

Update the runlist of the file being extended with the newly allocated 
clusters.

Lock the mft record of the file being extended and update the mft record 
with the new allocations.

Unlock the mft record and the runlist of the file that was extended.

So yes, the lock dependencies are inverted and the validator complains.  
But they are inverted in a particular way that is safe.

Is the above description sufficient for you to fix it?

The mftbmp_lock works similarly to the above, except that instead of 
allocating clusters we are allocating an mft record.  The locking 
sequence is the same; just substitute s/lcnbmp_lock/mftbmp_lock/ and 
s/cluster allocation/mft record allocation/.

Allocating an mft record can in turn require the mft itself to be 
extended, which will take the lcnbmp_lock; thus mftbmp_lock nests outside 
the lcnbmp_lock.  That ordering is never violated, so the validator 
should at least be happy with that bit.  (-;

Sorry the ntfs locking is that evil!

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-11  7:10                 ` Anton Altaparmakov
@ 2006-06-12  8:31                   ` Ingo Molnar
  2006-06-12  8:47                     ` Anton Altaparmakov
  0 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2006-06-12  8:31 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven


* Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> First an explanation of two relevant locks that are both going to 
> upset the validator.  I half expected this to happen given what has 
> happened so far.  The two locks are lcnbmp_lock and mftbmp_lock (both 
> are r/w semaphores).

thanks!

> Is the above description sufficient for you to fix it?

yeah. I have split off vol->lcnbmp_ino's locking rules (for mrec_lock 
and runlist.lock) from normal inode locking rules, and this fixed the 
file-writing dependency.

but i can still trigger a warning, and i think this time it's a real 
bug: if i mount NTFS with -o show_system_files, and i append data to the 
$Bitmap, then i get the dependency conflict attached further below.

While extending the $Bitmap manually is extremely evil, the filesystem 
should nevertheless not break - for example a script could do it by 
accident. I believe NTFS should either disallow writing to the $Bitmap 
(by forcing it to be readonly under all circumstances), or the writing 
should be made safe - right now if that happens in parallel to some 
other process extending an NTFS file then i think we could deadlock, 
right?

	Ingo

=======================================================
[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
cat/2532 is trying to acquire lock:
 (&vol->lcnbmp_lock){----}, at: [<c01e809d>] ntfs_cluster_alloc+0x10d/0x23a0

but task is already holding lock:
 (lcnbmp_mrec_lock){--..}, at: [<c01d5dc3>] map_mft_record+0x53/0x2c0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (lcnbmp_mrec_lock){--..}:
       [<c013948f>] lock_acquire+0x6f/0x90
       [<c0346163>] mutex_lock_nested+0x73/0x2a0
       [<c01d5dc3>] map_mft_record+0x53/0x2c0
       [<c01c5498>] ntfs_map_runlist_nolock+0x3d8/0x530
       [<c01c5b61>] ntfs_map_runlist+0x41/0x70
       [<c01c18d1>] ntfs_readpage+0x8c1/0x9a0
       [<c0142fac>] read_cache_page+0xac/0x150
       [<c01e20ad>] ntfs_statfs+0x41d/0x660
       [<c0163204>] vfs_statfs+0x54/0x70
       [<c0163238>] vfs_statfs64+0x18/0x30
       [<c0163334>] sys_statfs64+0x64/0xa0
       [<c0347dad>] sysenter_past_esp+0x56/0x8d

-> #1 (lcnbmp_runlist_lock){----}:
       [<c013948f>] lock_acquire+0x6f/0x90
       [<c0134c4c>] down_read+0x2c/0x40
       [<c01c184c>] ntfs_readpage+0x83c/0x9a0
       [<c0142fac>] read_cache_page+0xac/0x150
       [<c01e20ad>] ntfs_statfs+0x41d/0x660
       [<c0163204>] vfs_statfs+0x54/0x70
       [<c0163238>] vfs_statfs64+0x18/0x30
       [<c0163334>] sys_statfs64+0x64/0xa0
       [<c0347dad>] sysenter_past_esp+0x56/0x8d

-> #0 (&vol->lcnbmp_lock){----}:
       [<c013948f>] lock_acquire+0x6f/0x90
       [<c0134ccc>] down_write+0x2c/0x50
       [<c01e809d>] ntfs_cluster_alloc+0x10d/0x23a0
       [<c01c421d>] ntfs_attr_extend_allocation+0x5fd/0x14a0
       [<c01ca9d8>] ntfs_file_buffered_write+0x188/0x3880
       [<c01ce248>] ntfs_file_aio_write_nolock+0x178/0x210
       [<c01ce391>] ntfs_file_writev+0xb1/0x150
       [<c01ce44f>] ntfs_file_write+0x1f/0x30
       [<c0164eb9>] vfs_write+0x99/0x160
       [<c016584d>] sys_write+0x3d/0x70
       [<c0347dad>] sysenter_past_esp+0x56/0x8d

other info that might help us debug this:

3 locks held by cat/2532:
 #0:  (&inode->i_mutex){--..}, at: [<c03460e8>] mutex_lock+0x8/0x10
 #1:  (lcnbmp_runlist_lock){----}, at: [<c01c3d5e>] ntfs_attr_extend_allocation+0x13e/0x14a0
 #2:  (lcnbmp_mrec_lock){--..}, at: [<c01d5dc3>] map_mft_record+0x53/0x2c0

stack backtrace:
 [<c0104bf2>] show_trace+0x12/0x20
 [<c0104c19>] dump_stack+0x19/0x20
 [<c0136f11>] print_circular_bug_tail+0x61/0x70
 [<c01389af>] __lock_acquire+0x6df/0xd70
 [<c013948f>] lock_acquire+0x6f/0x90
 [<c0134ccc>] down_write+0x2c/0x50
 [<c01e809d>] ntfs_cluster_alloc+0x10d/0x23a0
 [<c01c421d>] ntfs_attr_extend_allocation+0x5fd/0x14a0
 [<c01ca9d8>] ntfs_file_buffered_write+0x188/0x3880
 [<c01ce248>] ntfs_file_aio_write_nolock+0x178/0x210
 [<c01ce391>] ntfs_file_writev+0xb1/0x150
 [<c01ce44f>] ntfs_file_write+0x1f/0x30
 [<c0164eb9>] vfs_write+0x99/0x160
 [<c016584d>] sys_write+0x3d/0x70
 [<c0347dad>] sysenter_past_esp+0x56/0x8d


* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-12  8:31                   ` Ingo Molnar
@ 2006-06-12  8:47                     ` Anton Altaparmakov
  2006-06-12  9:40                       ` Ingo Molnar
  0 siblings, 1 reply; 21+ messages in thread
From: Anton Altaparmakov @ 2006-06-12  8:47 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven

On Mon, 2006-06-12 at 10:31 +0200, Ingo Molnar wrote:
> * Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> 
> > First an explanation of two relevant locks that are both going to 
> > upset the validator.  I half expected this to happen given what has 
> > happened so far.  The two locks are lcnbmp_lock and mftbmp_lock (both 
> > are r/w semaphores).
> 
> thanks!
> 
> > Is the above description sufficient for you to fix it?
> 
> yeah. I have split off vol->lcnbmp_ino's locking rules (for mrec_lock 
> and runlist.lock) from normal inode locking rules, and this fixed the 
> file-writing dependency.
> 
> but i can still trigger a warning, and i think this time it's a real 
> bug: if i mount NTFS with -o show_system_files, and i append data to the 
> $Bitmap, then i get the dependency conflict attached further below.

Not surprising.  (-:  Writing to metadata from user space is definitely
not allowed...

> While extending the $Bitmap manually is extremely evil, the filesystem 
> should nevertheless not break - for example a script could do it by 
> accident. I believe NTFS should either disallow writing to the $Bitmap 
> (by forcing it to be readonly under all circumstances), or the writing 
> should be made safe - right now if that happens in parallel to some 
> other process extending an NTFS file then i think we could deadlock, 
> right?

Absolutely correct.  Writing to metadata files from userspace is
guaranteed to break things.  I don't think a script could do anything
bad because it would never see that a file called $Bitmap exists (-o
show_sys_files is only for debugging/recovery purposes, not for general
use).

I never cared about fixing this case because writing to any of those files
is almost guaranteed to corrupt the file system to the extent of Windows
dying with a blue screen of death on boot, and the only option at that
point will be to reformat and reinstall (at least until we write an
ntfsck utility)...

But yes, we could just make all system files read-only.  That would be a
simple enough fix, especially for the ones we keep open all the time
during a mount as they do not even have to affect the iget() code path.
For other system inodes we would have to stick the read-only force in
iget() (or open them at mount time and keep them open).

I will do that when I have some more spare time...  I don't think it is
particularly urgent given it has always been like this and if anyone is
trying to kill their file system even fixing this will not stop them...
They would just need to run the ntfsprogs provided utilities to achieve
anything they like...  I have always followed the principle of giving
people enough rope to hang themselves if they really want to...  (-;
But I agree we should probably make the system files read-only.

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/



* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-12  8:47                     ` Anton Altaparmakov
@ 2006-06-12  9:40                       ` Ingo Molnar
  2006-06-12 10:24                         ` Anton Altaparmakov
  0 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2006-06-12  9:40 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven


* Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> I will do that when I have some more spare time...  I don't think it 
> is particularly urgent given it has always been like this and if 
> anyone is trying to kill their file system even fixing this will not 
> stop them... They would just need to run the ntfsprogs provided 
> utilities to achieve anything they like... [...]

i agree that it's not urgent, but i think we have a small disagreement 
wrt. the philosophy behind that conclusion :-)

manipulating a device file (i.e. running ntfsprogs) is a known "hang 
yourself" thing that people are and must be aware of - but messing 
around within the VFS namespace of a filesystem itself ought to be safe, 
even if the file can only be accessed via a special mount option and if 
it's called $Bitmap. The mount option doesn't really say that it's 
unsafe:

       show_sys_files
              If  show_sys_files is specified, show the system files in direc-
              tory listings.  Otherwise the default behaviour is to  hide  the
              system  files.  Note that even when show_sys_files is specified,
              "$MFT" may will not  be  visible  due  to  bugs/mis-features  in
              glibc.   Further,  note that irrespective of show_sys_files, all
              files are accessible by name, i.e. you  can  always  do  "ls  -l
              '$UpCase'" for example to specifically show the system file con-
              taining the Unicode upcase table.

and the dangers of direct namespace access are apparently anticipated in 
the permissions of the $Mft itself:

  [root@neptune mnt2]# ls -l \$Mft
  ----------  1 root root 69632 Jan  1  1970 $Mft

and finally, let's call this a real bug in NTFS: we both worked hard to 
cover it via the lock validator =B-)

in any case, the message will only trigger if someone abuses one of the 
system files, so there is no urgency, and that will also serve as a 
reminder.

below i've attached a cleaned up version of my NTFS annotations so far. 
I think they are pretty straightforward and nonintrusive - for 
!CONFIG_LOCKDEP they add zero changes to ntfs.ko. Most of the linecount 
of the patch comes from comments - so i think we might even consider 
this a "document locking rules" patch :-)

	Ingo

---------------------------
Subject: lock validator: annotate NTFS locking rules
From: Ingo Molnar <mingo@elte.hu>

NTFS uses lots of type-opaque objects which acquire their true
identity runtime - so the lock validator needs to be helped in
a couple of places to figure out object types.

Many thanks to Anton Altaparmakov for giving lots of explanations
about NTFS locking rules.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 fs/ntfs/inode.c |   33 +++++++++++++++++++++++++++++++++
 fs/ntfs/mft.c   |    2 +-
 fs/ntfs/super.c |   31 +++++++++++++++++++++++++++++++
 3 files changed, 65 insertions(+), 1 deletion(-)

Index: linux/fs/ntfs/inode.c
===================================================================
--- linux.orig/fs/ntfs/inode.c
+++ linux/fs/ntfs/inode.c
@@ -367,6 +367,12 @@ static void ntfs_destroy_extent_inode(nt
 	kmem_cache_free(ntfs_inode_cache, ni);
 }
 
+/*
+ * The attribute runlist lock has separate locking rules from the
+ * normal runlist lock, so split the two lock-types:
+ */
+static struct lockdep_type_key attr_list_rl_lock_type;
+
 /**
  * __ntfs_init_inode - initialize ntfs specific part of an inode
  * @sb:		super block of mounted volume
@@ -394,6 +400,8 @@ void __ntfs_init_inode(struct super_bloc
 	ni->attr_list_size = 0;
 	ni->attr_list = NULL;
 	ntfs_init_runlist(&ni->attr_list_rl);
+	lockdep_reinit_key(&ni->attr_list_rl.lock,
+				&attr_list_rl_lock_type);
 	ni->itype.index.bmp_ino = NULL;
 	ni->itype.index.block_size = 0;
 	ni->itype.index.vcn_size = 0;
@@ -405,6 +413,13 @@ void __ntfs_init_inode(struct super_bloc
 	ni->ext.base_ntfs_ino = NULL;
 }
 
+/*
+ * Extent inodes get MFT-mapped in a nested way, while the base inode
+ * is still mapped. Teach this nesting to the lock validator by creating
+ * a separate type for nested inode's mrec_lock's:
+ */
+static struct lockdep_type_key extent_inode_mrec_lock_key;
+
 inline ntfs_inode *ntfs_new_extent_inode(struct super_block *sb,
 		unsigned long mft_no)
 {
@@ -413,6 +428,7 @@ inline ntfs_inode *ntfs_new_extent_inode
 	ntfs_debug("Entering.");
 	if (likely(ni != NULL)) {
 		__ntfs_init_inode(sb, ni);
+		lockdep_reinit_key(&ni->mrec_lock, &extent_inode_mrec_lock_key);
 		ni->mft_no = mft_no;
 		ni->type = AT_UNUSED;
 		ni->name = NULL;
@@ -1722,6 +1738,15 @@ err_out:
 	return err;
 }
 
+/*
+ * The MFT inode has special locking, so teach the lock validator
+ * about this by splitting off the locking rules of the MFT from
+ * the locking rules of other inodes. The MFT inode can never be
+ * accessed from the VFS side (or even internally), only by the
+ * map_mft functions.
+ */
+static struct lockdep_type_key mft_ni_runlist_lock_key, mft_ni_mrec_lock_key;
+
 /**
  * ntfs_read_inode_mount - special read_inode for mount time use only
  * @vi:		inode to read
@@ -2148,6 +2173,14 @@ int ntfs_read_inode_mount(struct inode *
 	ntfs_attr_put_search_ctx(ctx);
 	ntfs_debug("Done.");
 	ntfs_free(m);
+
+	/*
+	 * Split the locking rules of the MFT inode from the
+	 * locking rules of other inodes:
+	 */
+	lockdep_reinit_key(&ni->runlist.lock, &mft_ni_runlist_lock_key);
+	lockdep_reinit_key(&ni->mrec_lock, &mft_ni_mrec_lock_key);
+
 	return 0;
 
 em_put_err_out:
Index: linux/fs/ntfs/mft.c
===================================================================
--- linux.orig/fs/ntfs/mft.c
+++ linux/fs/ntfs/mft.c
@@ -348,7 +348,7 @@ map_err_out:
 		base_ni->ext.extent_ntfs_inos = tmp;
 	}
 	base_ni->ext.extent_ntfs_inos[base_ni->nr_extents++] = ni;
-	mutex_unlock(&base_ni->extent_lock);
+	mutex_unlock_non_nested(&base_ni->extent_lock);
 	atomic_dec(&base_ni->count);
 	ntfs_debug("Done 2.");
 	*ntfs_ino = ni;
Index: linux/fs/ntfs/super.c
===================================================================
--- linux.orig/fs/ntfs/super.c
+++ linux/fs/ntfs/super.c
@@ -1724,6 +1724,14 @@ upcase_failed:
 	return FALSE;
 }
 
+/*
+ * The lcn and mft bitmap inodes are NTFS-internal inodes with
+ * their own special locking rules:
+ */
+static struct lockdep_type_key
+	lcnbmp_runlist_lock_key, lcnbmp_mrec_lock_key,
+	mftbmp_runlist_lock_key, mftbmp_mrec_lock_key;
+
 /**
  * load_system_files - open the system files using normal functions
  * @vol:	ntfs super block describing device whose system files to load
@@ -1780,6 +1788,10 @@ static BOOL load_system_files(ntfs_volum
 		ntfs_error(sb, "Failed to load $MFT/$BITMAP attribute.");
 		goto iput_mirr_err_out;
 	}
+	lockdep_reinit_key(&NTFS_I(vol->mftbmp_ino)->runlist.lock,
+			   &mftbmp_runlist_lock_key);
+	lockdep_reinit_key(&NTFS_I(vol->mftbmp_ino)->mrec_lock,
+			   &mftbmp_mrec_lock_key);
 	/* Read upcase table and setup @vol->upcase and @vol->upcase_len. */
 	if (!load_and_init_upcase(vol))
 		goto iput_mftbmp_err_out;
@@ -1802,6 +1814,11 @@ static BOOL load_system_files(ntfs_volum
 			iput(vol->lcnbmp_ino);
 		goto bitmap_failed;
 	}
+	lockdep_reinit_key(&NTFS_I(vol->lcnbmp_ino)->runlist.lock,
+			   &lcnbmp_runlist_lock_key);
+	lockdep_reinit_key(&NTFS_I(vol->lcnbmp_ino)->mrec_lock,
+			   &lcnbmp_mrec_lock_key);
+
 	NInoSetSparseDisabled(NTFS_I(vol->lcnbmp_ino));
 	if ((vol->nr_clusters + 7) >> 3 > i_size_read(vol->lcnbmp_ino)) {
 		iput(vol->lcnbmp_ino);
@@ -2742,6 +2759,17 @@ static int ntfs_fill_super(struct super_
 	struct inode *tmp_ino;
 	int blocksize, result;
 
+	/*
+	 * We do a pretty difficult piece of bootstrap by reading the
+	 * MFT (and other metadata) from disk into memory. We'll only
+	 * release this metadata during umount, so the locking patterns
+	 * observed during bootstrap do not count. So turn off the
+	 * observation of locking patterns (strictly for this context
+	 * only) while mounting NTFS. [The validator is still active
+	 * otherwise, even for this context: it will for example record
+	 * lock type registrations.]
+	 */
+	lockdep_off();
 	ntfs_debug("Entering.");
 #ifndef NTFS_RW
 	sb->s_flags |= MS_RDONLY;
@@ -2753,6 +2781,7 @@ static int ntfs_fill_super(struct super_
 		if (!silent)
 			ntfs_error(sb, "Allocation of NTFS volume structure "
 					"failed. Aborting mount...");
+		lockdep_on();
 		return -ENOMEM;
 	}
 	/* Initialize ntfs_volume structure. */
@@ -2939,6 +2968,7 @@ static int ntfs_fill_super(struct super_
 		mutex_unlock(&ntfs_lock);
 		sb->s_export_op = &ntfs_export_ops;
 		lock_kernel();
+		lockdep_on();
 		return 0;
 	}
 	ntfs_error(sb, "Failed to allocate root directory.");
@@ -3058,6 +3088,7 @@ err_out_now:
 	sb->s_fs_info = NULL;
 	kfree(vol);
 	ntfs_debug("Failed, returning -EINVAL.");
+	lockdep_on();
 	return -EINVAL;
 }
 


* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-12  9:40                       ` Ingo Molnar
@ 2006-06-12 10:24                         ` Anton Altaparmakov
  2006-06-12 10:56                           ` Ingo Molnar
  0 siblings, 1 reply; 21+ messages in thread
From: Anton Altaparmakov @ 2006-06-12 10:24 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven

On Mon, 2006-06-12 at 11:40 +0200, Ingo Molnar wrote:
> * Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> 
> > I will do that when I have some more spare time...  I don't think it 
> > is particularly urgent given it has always been like this and if 
> > anyone is trying to kill their file system even fixing this will not 
> > stop them... They would just need to run the ntfsprogs provided 
> > utilities to achieve anything they like... [...]
> 
> i agree that it's not urgent, but i think we have a small disagreement 
> wrt. the philosophy behind that conclusion :-)
> 
> manipulating a device file (i.e. running ntfsprogs) is a known "hang 
> yourself" thing that people are and must be aware of - but messing 
> around within the VFS namespace of a filesystem itself ought to be safe, 
> even if the file can only be accessed via a special mount option and if 
> it's called $Bitmap. The mount option doesn't really say that it's 
> unsafe:
> 
>        show_sys_files
>               If  show_sys_files is specified, show the system files in direc-
>               tory listings.  Otherwise the default behaviour is to  hide  the
>               system  files.  Note that even when show_sys_files is specified,
>               "$MFT" may will not  be  visible  due  to  bugs/mis-features  in
>               glibc.   Further,  note that irrespective of show_sys_files, all
>               files are accessible by name, i.e. you  can  always  do  "ls  -l
>               '$UpCase'" for example to specifically show the system file con-
>               taining the Unicode upcase table.
> 
> and the dangers of direct namespace access are apparently anticipated in 
> the permissions of the $Mft itself:

That one is even more special due to the mft record locks and the fact
that the mft records on disk are not the same as the ones in memory
(because of the multi sector transfer protection fixups), so access
through a file read or file write would almost certainly crash the ntfs
driver in weird and strange ways.  That is why, for the mft and mftmirr,
not only do I force the permissions to blank but I also leave the file
operations blank, i.e. the only operations the mft and mftmirr have are
the address space operations...

>   [root@neptune mnt2]# ls -l \$Mft
>   ----------  1 root root 69632 Jan  1  1970 $Mft
> 
> and finally, let's call this a real bug in NTFS: we both worked hard to 
> cover it via the lock validator =B-)

Oh definitely!  It has nothing to do with the lock validator.  There is
definitely nothing I expect you to do about the lock validator or those
reports.  They only happen if the user does something stupid.  Serves
them right.  And if it happens by accident, it is a good warning that
this is wrong/bad...

> in any case, the message will only trigger if someone abuses one of the 
> system files, so there is no urgency, and that will also serve as a 
> reminder.

Exactly.

> below i've attached a cleaned up version of my NTFS annotations so far. 
> I think they are pretty straightforward and nonintrusive - for 
> !CONFIG_LOCKDEP they add zero changes to ntfs.ko. Most of the linecount 
> of the patch comes from comments - so i think we might even consider 
> this a "document locking rules" patch :-)

Thanks, looks great!  Can I leave it to you to keep it as part of the
lock validator patches so it gets into -mm with them and later makes it
into the stock kernel with them?

If you want, feel free to add:

Signed-off-by: Anton Altaparmakov <aia21@cantab.net>

To the patch...

Thanks a lot for working all of this out!

Best regards,

	Anton

> 
> 	Ingo
> 
> ---------------------------
> Subject: lock validator: annotate NTFS locking rules
> From: Ingo Molnar <mingo@elte.hu>
> 
> NTFS uses lots of type-opaque objects which acquire their true
> identity runtime - so the lock validator needs to be helped in
> a couple of places to figure out object types.
> 
> Many thanks to Anton Altaparmakov for giving lots of explanations
> about NTFS locking rules.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  fs/ntfs/inode.c |   33 +++++++++++++++++++++++++++++++++
>  fs/ntfs/mft.c   |    2 +-
>  fs/ntfs/super.c |   31 +++++++++++++++++++++++++++++++
>  3 files changed, 65 insertions(+), 1 deletion(-)
> 
> Index: linux/fs/ntfs/inode.c
> ===================================================================
> --- linux.orig/fs/ntfs/inode.c
> +++ linux/fs/ntfs/inode.c
> @@ -367,6 +367,12 @@ static void ntfs_destroy_extent_inode(nt
>  	kmem_cache_free(ntfs_inode_cache, ni);
>  }
>  
> +/*
> + * The attribute runlist lock has separate locking rules from the
> + * normal runlist lock, so split the two lock-types:
> + */
> +static struct lockdep_type_key attr_list_rl_lock_type;
> +
>  /**
>   * __ntfs_init_inode - initialize ntfs specific part of an inode
>   * @sb:		super block of mounted volume
> @@ -394,6 +400,8 @@ void __ntfs_init_inode(struct super_bloc
>  	ni->attr_list_size = 0;
>  	ni->attr_list = NULL;
>  	ntfs_init_runlist(&ni->attr_list_rl);
> +	lockdep_reinit_key(&ni->attr_list_rl.lock,
> +				&attr_list_rl_lock_type);
>  	ni->itype.index.bmp_ino = NULL;
>  	ni->itype.index.block_size = 0;
>  	ni->itype.index.vcn_size = 0;
> @@ -405,6 +413,13 @@ void __ntfs_init_inode(struct super_bloc
>  	ni->ext.base_ntfs_ino = NULL;
>  }
>  
> +/*
> + * Extent inodes get MFT-mapped in a nested way, while the base inode
> + * is still mapped. Teach this nesting to the lock validator by creating
> + * a separate type for nested inode's mrec_lock's:
> + */
> +static struct lockdep_type_key extent_inode_mrec_lock_key;
> +
>  inline ntfs_inode *ntfs_new_extent_inode(struct super_block *sb,
>  		unsigned long mft_no)
>  {
> @@ -413,6 +428,7 @@ inline ntfs_inode *ntfs_new_extent_inode
>  	ntfs_debug("Entering.");
>  	if (likely(ni != NULL)) {
>  		__ntfs_init_inode(sb, ni);
> +		lockdep_reinit_key(&ni->mrec_lock, &extent_inode_mrec_lock_key);
>  		ni->mft_no = mft_no;
>  		ni->type = AT_UNUSED;
>  		ni->name = NULL;
> @@ -1722,6 +1738,15 @@ err_out:
>  	return err;
>  }
>  
> +/*
> + * The MFT inode has special locking, so teach the lock validator
> + * about this by splitting off the locking rules of the MFT from
> + * the locking rules of other inodes. The MFT inode can never be
> + * accessed from the VFS side (or even internally), only by the
> + * map_mft functions.
> + */
> +static struct lockdep_type_key mft_ni_runlist_lock_key, mft_ni_mrec_lock_key;
> +
>  /**
>   * ntfs_read_inode_mount - special read_inode for mount time use only
>   * @vi:		inode to read
> @@ -2148,6 +2173,14 @@ int ntfs_read_inode_mount(struct inode *
>  	ntfs_attr_put_search_ctx(ctx);
>  	ntfs_debug("Done.");
>  	ntfs_free(m);
> +
> +	/*
> +	 * Split the locking rules of the MFT inode from the
> +	 * locking rules of other inodes:
> +	 */
> +	lockdep_reinit_key(&ni->runlist.lock, &mft_ni_runlist_lock_key);
> +	lockdep_reinit_key(&ni->mrec_lock, &mft_ni_mrec_lock_key);
> +
>  	return 0;
>  
>  em_put_err_out:
> Index: linux/fs/ntfs/mft.c
> ===================================================================
> --- linux.orig/fs/ntfs/mft.c
> +++ linux/fs/ntfs/mft.c
> @@ -348,7 +348,7 @@ map_err_out:
>  		base_ni->ext.extent_ntfs_inos = tmp;
>  	}
>  	base_ni->ext.extent_ntfs_inos[base_ni->nr_extents++] = ni;
> -	mutex_unlock(&base_ni->extent_lock);
> +	mutex_unlock_non_nested(&base_ni->extent_lock);
>  	atomic_dec(&base_ni->count);
>  	ntfs_debug("Done 2.");
>  	*ntfs_ino = ni;
> Index: linux/fs/ntfs/super.c
> ===================================================================
> --- linux.orig/fs/ntfs/super.c
> +++ linux/fs/ntfs/super.c
> @@ -1724,6 +1724,14 @@ upcase_failed:
>  	return FALSE;
>  }
>  
> +/*
> + * The lcn and mft bitmap inodes are NTFS-internal inodes with
> + * their own special locking rules:
> + */
> +static struct lockdep_type_key
> +	lcnbmp_runlist_lock_key, lcnbmp_mrec_lock_key,
> +	mftbmp_runlist_lock_key, mftbmp_mrec_lock_key;
> +
>  /**
>   * load_system_files - open the system files using normal functions
>   * @vol:	ntfs super block describing device whose system files to load
> @@ -1780,6 +1788,10 @@ static BOOL load_system_files(ntfs_volum
>  		ntfs_error(sb, "Failed to load $MFT/$BITMAP attribute.");
>  		goto iput_mirr_err_out;
>  	}
> +	lockdep_reinit_key(&NTFS_I(vol->mftbmp_ino)->runlist.lock,
> +			   &mftbmp_runlist_lock_key);
> +	lockdep_reinit_key(&NTFS_I(vol->mftbmp_ino)->mrec_lock,
> +			   &mftbmp_mrec_lock_key);
>  	/* Read upcase table and setup @vol->upcase and @vol->upcase_len. */
>  	if (!load_and_init_upcase(vol))
>  		goto iput_mftbmp_err_out;
> @@ -1802,6 +1814,11 @@ static BOOL load_system_files(ntfs_volum
>  			iput(vol->lcnbmp_ino);
>  		goto bitmap_failed;
>  	}
> +	lockdep_reinit_key(&NTFS_I(vol->lcnbmp_ino)->runlist.lock,
> +			   &lcnbmp_runlist_lock_key);
> +	lockdep_reinit_key(&NTFS_I(vol->lcnbmp_ino)->mrec_lock,
> +			   &lcnbmp_mrec_lock_key);
> +
>  	NInoSetSparseDisabled(NTFS_I(vol->lcnbmp_ino));
>  	if ((vol->nr_clusters + 7) >> 3 > i_size_read(vol->lcnbmp_ino)) {
>  		iput(vol->lcnbmp_ino);
> @@ -2742,6 +2759,17 @@ static int ntfs_fill_super(struct super_
>  	struct inode *tmp_ino;
>  	int blocksize, result;
>  
> +	/*
> +	 * We do a pretty difficult piece of bootstrap by reading the
> +	 * MFT (and other metadata) from disk into memory. We'll only
> +	 * release this metadata during umount, so the locking patterns
> +	 * observed during bootstrap do not count. So turn off the
> +	 * observation of locking patterns (strictly for this context
> +	 * only) while mounting NTFS. [The validator is still active
> +	 * otherwise, even for this context: it will for example record
> +	 * lock type registrations.]
> +	 */
> +	lockdep_off();
>  	ntfs_debug("Entering.");
>  #ifndef NTFS_RW
>  	sb->s_flags |= MS_RDONLY;
> @@ -2753,6 +2781,7 @@ static int ntfs_fill_super(struct super_
>  		if (!silent)
>  			ntfs_error(sb, "Allocation of NTFS volume structure "
>  					"failed. Aborting mount...");
> +		lockdep_on();
>  		return -ENOMEM;
>  	}
>  	/* Initialize ntfs_volume structure. */
> @@ -2939,6 +2968,7 @@ static int ntfs_fill_super(struct super_
>  		mutex_unlock(&ntfs_lock);
>  		sb->s_export_op = &ntfs_export_ops;
>  		lock_kernel();
> +		lockdep_on();
>  		return 0;
>  	}
>  	ntfs_error(sb, "Failed to allocate root directory.");
> @@ -3058,6 +3088,7 @@ err_out_now:
>  	sb->s_fs_info = NULL;
>  	kfree(vol);
>  	ntfs_debug("Failed, returning -EINVAL.");
> +	lockdep_on();
>  	return -EINVAL;
>  }
>  

-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/



* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-12 10:24                         ` Anton Altaparmakov
@ 2006-06-12 10:56                           ` Ingo Molnar
  2006-06-13  5:41                             ` Miles Lane
  0 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2006-06-12 10:56 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Miles Lane, LKML, Andrew Morton, Arjan van de Ven


* Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> > below i've attached a cleaned up version of my NTFS annotations so far. 
> > I think they are pretty straightforward and nonintrusive - for 
> > !CONFIG_LOCKDEP they add zero changes to ntfs.ko. Most of the linecount 
> > of the patch comes from comments - so i think we might even consider 
> > this a "document locking rules" patch :-)
> 
> Thanks, looks great!  Can I leave it to you to keep it as part of the 
> lock validator patches so it gets into -mm with them and later makes 
> it into the stock kernel with them?

sure!

> If you want, feel free to add:
> 
> Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
> 
> To the patch...

thanks - added.

If anyone wants to try the NTFS related lock-validator fixes, they are 
included in the combo patch:

  http://redhat.com/~mingo/lockdep-patches/lockdep-combo-2.6.17-rc6-mm2.patch

which goes ontop of 2.6.17-rc6-mm2.

	Ingo


* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-12 10:56                           ` Ingo Molnar
@ 2006-06-13  5:41                             ` Miles Lane
  2006-06-13 21:18                               ` Ingo Molnar
  0 siblings, 1 reply; 21+ messages in thread
From: Miles Lane @ 2006-06-13  5:41 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Anton Altaparmakov, LKML, Andrew Morton, Arjan van de Ven

if [ -r System.map -a -x /sbin/depmod ]; then /sbin/depmod -ae -F
System.map  2.6.17-rc6-mm2-lockdep; fi
WARNING: /lib/modules/2.6.17-rc6-mm2-lockdep/kernel/fs/ntfs/ntfs.ko
needs unknown symbol lockdep_on
WARNING: /lib/modules/2.6.17-rc6-mm2-lockdep/kernel/fs/ntfs/ntfs.ko
needs unknown symbol lockdep_off

Am I doing something wrong?


* Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!
  2006-06-13  5:41                             ` Miles Lane
@ 2006-06-13 21:18                               ` Ingo Molnar
  0 siblings, 0 replies; 21+ messages in thread
From: Ingo Molnar @ 2006-06-13 21:18 UTC (permalink / raw)
  To: Miles Lane; +Cc: Anton Altaparmakov, LKML, Andrew Morton, Arjan van de Ven

* Miles Lane <miles.lane@gmail.com> wrote:

> if [ -r System.map -a -x /sbin/depmod ]; then /sbin/depmod -ae -F
> System.map  2.6.17-rc6-mm2-lockdep; fi
> WARNING: /lib/modules/2.6.17-rc6-mm2-lockdep/kernel/fs/ntfs/ntfs.ko
> needs unknown symbol lockdep_on
> WARNING: /lib/modules/2.6.17-rc6-mm2-lockdep/kernel/fs/ntfs/ntfs.ko
> needs unknown symbol lockdep_off

oops - please re-download the combo patch, i fixed this one.

	Ingo
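
For readers hitting similar "needs unknown symbol" warnings at depmod time: they usually mean a module references a kernel symbol that was never exported with EXPORT_SYMBOL(). A minimal sketch of the kind of export the fixed combo patch would need (the function bodies and their location in kernel/lockdep.c are assumptions based on the lockdep patches of that era, not the actual patch hunk):

```c
/* kernel/lockdep.c -- hedged sketch, not the real patch */
#include <linux/module.h>
#include <linux/sched.h>

/* Temporarily disable lock-dependency tracking for the current task. */
void lockdep_off(void)
{
	current->lockdep_recursion++;
}
EXPORT_SYMBOL(lockdep_off);

/* Re-enable lock-dependency tracking for the current task. */
void lockdep_on(void)
{
	current->lockdep_recursion--;
}
EXPORT_SYMBOL(lockdep_on);
```

With the EXPORT_SYMBOL() lines in place, the symbols appear in the kernel's export table and depmod can resolve them for ntfs.ko.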


end of thread, other threads:[~2006-06-13 21:20 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-06-08  4:27 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected! Miles Lane
2006-06-08  7:32 ` Anton Altaparmakov
2006-06-08  9:55   ` Ingo Molnar
2006-06-08 10:53     ` Anton Altaparmakov
2006-06-08 11:23       ` Ingo Molnar
2006-06-09  8:09         ` Anton Altaparmakov
2006-06-10  7:59           ` Ingo Molnar
2006-06-10  8:48             ` Anton Altaparmakov
2006-06-11  5:31               ` Ingo Molnar
2006-06-11  7:10                 ` Anton Altaparmakov
2006-06-12  8:31                   ` Ingo Molnar
2006-06-12  8:47                     ` Anton Altaparmakov
2006-06-12  9:40                       ` Ingo Molnar
2006-06-12 10:24                         ` Anton Altaparmakov
2006-06-12 10:56                           ` Ingo Molnar
2006-06-13  5:41                             ` Miles Lane
2006-06-13 21:18                               ` Ingo Molnar
2006-06-08 16:11       ` Ingo Molnar
2006-06-08 21:17         ` Anton Altaparmakov
2006-06-09  7:19       ` Ingo Molnar
2006-06-09  8:31         ` Anton Altaparmakov
