2.6.2-rc2 nfsd+xfs spins in i_size

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* 2.6.2-rc2 nfsd+xfs spins in i_size_read()
@ 2004-01-28 17:17 Miquel van Smoorenburg
  2004-01-29  6:25 ` Andrew Morton
  2004-01-29  6:30 ` Nathan Scott
  0 siblings, 2 replies; 25+ messages in thread
From: Miquel van Smoorenburg @ 2004-01-28 17:17 UTC (permalink / raw)
  To: linux-kernel

I have a Linux 2.6.2-rc2 NFS file server and another similar
box as client. Kernel is compiled for SMP (hyperthreading).

On the NFS server I'm exporting an XFS filesystem to the client
over Gigabit ethernet. The client mounts using
mount -o nfsvers=3,intr,rsize=32768,wsize=32768 server:/export/hwr /mnt

On the client I then run a large dd to a file on the server,
like dd if=/dev/zero of=/mnt/file bs=4096 count=800000

In a few seconds, the server locks up. It spins in
generic_fillattr(), apparently in the i_size_read() inline function.
Server responds to pings and sysrq, but nothing else.

2.6.1 doesn't show this behaviour. I tested several kernels:

2.6.1           	OK!
2.6.1-bk4		OK!
2.6.1-bk5		doesn't boot
2.6.1-bk6		hangs
2.6.2-rc1		hangs
2.6.2-rc2		hangs
2.6.2-rc2-gcc-2.95	hangs
2.6.2-rc1-mm3		OK!
2.6.2-rc2-mm1		OK!

I can't reproduce it on the local filesystem on the NFS server
directly, and I can't reproduce it on other filesystems than XFS.
But NFSD+XFS locks up every time.

Unfortunately the diff between 2.6.1-bk4 and bk6 is too large
for me to see what the difference is, likewise 2.6.2-rc2-mm1,
but perhaps someone has an idea what could be causing this ?

Here's the sysrq output:

Pid: 531, comm:                 nfsd
EIP: 0060:[<c01577c2>] CPU: 0
EIP is at generic_fillattr+0x82/0xa0
 EFLAGS: 00000246    Not tainted
EAX: 00000000 EBX: 07ae7200 ECX: f650a0a0 EDX: 000113eb
ESI: 00000000 EDI: f66cfed4 EBP: f66e5804 DS: 007b ES: 007b
CR0: 8005003b CR2: 40019000 CR3: 37245000 CR4: 000006d0
Call Trace:
 [<f8ab1f6b>] linvfs_getattr+0x2b/0x34 [xfs]
 [<c0157805>] vfs_getattr+0x25/0x84
 [<c01c19a3>] encode_post_op_attr+0x53/0x238
 [<c01c1e27>] encode_wcc_data+0x29f/0x2a8
 [<c01c4521>] nfs3svc_encode_commitres+0x19/0x5c
 [<c01b709d>] nfsd_dispatch+0x14d/0x1a3
 [<c02fb79b>] svc_process+0x3b3/0x640
 [<c01b6ddc>] nfsd+0x1e4/0x358
 [<c01b6bf8>] nfsd+0x0/0x358
 [<c0107251>] kernel_thread_helper+0x5/0xc

(By the way, on 2.6.2-rc1-mm3 and 2.6.2-rc2-mm1 mounting an XFS
 filesystem results in lots of:

 Badness in interruptible_sleep_on at kernel/sched.c:2239
 Call Trace:
  [<c011f5a3>] interruptible_sleep_on+0xf6/0xfb
  [<c011f209>] default_wake_function+0x0/0x12
  [<f8ac0fa2>] pagebuf_daemon+0x231/0x24c [xfs]
  [<c0339ed6>] ret_from_fork+0x6/0x14
  [<f8ac0d47>] pagebuf_daemon_wakeup+0x0/0x2a [xfs]
  [<f8ac0d71>] pagebuf_daemon+0x0/0x24c [xfs]
  [<c0109269>] kernel_thread_helper+0x5/0xb       )

Mike.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-28 17:17 2.6.2-rc2 nfsd+xfs spins in i_size_read() Miquel van Smoorenburg
@ 2004-01-29  6:25 ` Andrew Morton
  2004-01-29 23:20   ` Miquel van Smoorenburg
                     ` (2 more replies)
  2004-01-29  6:30 ` Nathan Scott
  1 sibling, 3 replies; 25+ messages in thread
From: Andrew Morton @ 2004-01-29  6:25 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: linux-kernel, linux-xfs

"Miquel van Smoorenburg" <miquels@cistron.nl> wrote:
>
> I have a Linux 2.6.2-rc2 NFS file server and another similar
> box as client. Kernel is compiled for SMP (hyperthreading).
> 
> On the NFS server I'm exporting an XFS filesystem to the client
> over Gigabit ethernet. The client mounts using
> mount -o nfsvers=3,intr,rsize=32768,wsize=32768 server:/export/hwr /mnt
> 
> On the client I then run a large dd to a file on the server,
> like dd if=/dev/zero of=/mnt/file bs=4096 count=800000
> 
> In a few seconds, the server locks up. It spins in
> generic_fillattr(), apparently in the i_size_read() inline function.
> Server responds to pings and sysrq, but nothing else.
> 
> 2.6.1 doesn't show this behaviour. I tested several kernels:
> 
> 2.6.1           	OK!
> 2.6.1-bk4		OK!
> 2.6.1-bk5		doesn't boot
> 2.6.1-bk6		hangs
> 2.6.2-rc1		hangs
> 2.6.2-rc2		hangs
> 2.6.2-rc2-gcc-2.95	hangs
> 2.6.2-rc1-mm3		OK!
> 2.6.2-rc2-mm1		OK!
> 
> I can't reproduce it on the local filesystem on the NFS server
> directly, and I can't reproduce it on other filesystems than XFS.
> But NFSD+XFS locks up every time.
> 
> Unfortunately the diff between 2.6.1-bk4 and bk6 is too large
> for me to see what the difference is, likewise 2.6.2-rc2-mm1,
> but perhaps someone has an idea what could be causing this ?
> 
> Here's the sysrq output:
> 
> Pid: 531, comm:                 nfsd
> EIP: 0060:[<c01577c2>] CPU: 0
> EIP is at generic_fillattr+0x82/0xa0
>  EFLAGS: 00000246    Not tainted
> EAX: 00000000 EBX: 07ae7200 ECX: f650a0a0 EDX: 000113eb
> ESI: 00000000 EDI: f66cfed4 EBP: f66e5804 DS: 007b ES: 007b
> CR0: 8005003b CR2: 40019000 CR3: 37245000 CR4: 000006d0
> Call Trace:
>  [<f8ab1f6b>] linvfs_getattr+0x2b/0x34 [xfs]
>  [<c0157805>] vfs_getattr+0x25/0x84
>  [<c01c19a3>] encode_post_op_attr+0x53/0x238
>  [<c01c1e27>] encode_wcc_data+0x29f/0x2a8
>  [<c01c4521>] nfs3svc_encode_commitres+0x19/0x5c
>  [<c01b709d>] nfsd_dispatch+0x14d/0x1a3
>  [<c02fb79b>] svc_process+0x3b3/0x640
>  [<c01b6ddc>] nfsd+0x1e4/0x358
>  [<c01b6bf8>] nfsd+0x0/0x358
>  [<c0107251>] kernel_thread_helper+0x5/0xc
> 

Is the EIP _always_ inside generic_fillattr()?

If so then yes, your analysis look right.  I'd say that the inode has been
corrupted and the seqcount counter has assumed an non-even value.  That
will cause i_size_read() to lock up.

Are you using slab debugging?  Is so, does the lockup go away if you change
mm/slab.c:POISON_FREE to an even value, say 0x6a?  That would tend to
confirm a use-after-free problem.  Or a totally wild pointer.





^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-28 17:17 2.6.2-rc2 nfsd+xfs spins in i_size_read() Miquel van Smoorenburg
  2004-01-29  6:25 ` Andrew Morton
@ 2004-01-29  6:30 ` Nathan Scott
  1 sibling, 0 replies; 25+ messages in thread
From: Nathan Scott @ 2004-01-29  6:30 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: linux-kernel

On Wed, Jan 28, 2004 at 05:17:27PM +0000, Miquel van Smoorenburg wrote:
> I have a Linux 2.6.2-rc2 NFS file server and another similar
> box as client. Kernel is compiled for SMP (hyperthreading).
> 
> On the NFS server I'm exporting an XFS filesystem to the client
> over Gigabit ethernet. The client mounts using
> mount -o nfsvers=3,intr,rsize=32768,wsize=32768 server:/export/hwr /mnt
> 
> On the client I then run a large dd to a file on the server,
> like dd if=/dev/zero of=/mnt/file bs=4096 count=800000
> 
> In a few seconds, the server locks up. It spins in
> generic_fillattr(), apparently in the i_size_read() inline function.
> Server responds to pings and sysrq, but nothing else.
> 
> 2.6.1 doesn't show this behaviour. I tested several kernels:
> 
> 2.6.1           	OK!
> 2.6.1-bk4		OK!
> 2.6.1-bk5		doesn't boot
> 2.6.1-bk6		hangs
> 2.6.2-rc1		hangs
> 2.6.2-rc2		hangs
> 2.6.2-rc2-gcc-2.95	hangs
> 2.6.2-rc1-mm3		OK!
> 2.6.2-rc2-mm1		OK!

> I can't reproduce it on the local filesystem on the NFS server
> directly, and I can't reproduce it on other filesystems than XFS.
> But NFSD+XFS locks up every time.

Hmmmm... I don't think Andrew has any XFS fixes in his tree that
might help there; and I can't think of any XFS change in -rc1 that
might have caused this (does the fs/xfs subdir from 2.6.1 plonked
down in place of the 2.6.2-rc1/2 version still have the problem?)

i_size_read seems to have a fair bit of config dependency - are you
CONFIG_SMP / CONFIG_PREEMPT / neither?  and is your BITS_PER_LONG
32 or 64?  thanks.

> (By the way, on 2.6.2-rc1-mm3 and 2.6.2-rc2-mm1 mounting an XFS
>  filesystem results in lots of:
> 
>  Badness in interruptible_sleep_on at kernel/sched.c:2239
>  Call Trace:
>   [<c011f5a3>] interruptible_sleep_on+0xf6/0xfb
>   [<c011f209>] default_wake_function+0x0/0x12
>   [<f8ac0fa2>] pagebuf_daemon+0x231/0x24c [xfs]
>   [<c0339ed6>] ret_from_fork+0x6/0x14
>   [<f8ac0d47>] pagebuf_daemon_wakeup+0x0/0x2a [xfs]
>   [<f8ac0d71>] pagebuf_daemon+0x0/0x24c [xfs]
>   [<c0109269>] kernel_thread_helper+0x5/0xb       )
> 

There's fixes floating around for this, I'll get it merged soon.

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-29  6:25 ` Andrew Morton
@ 2004-01-29 23:20   ` Miquel van Smoorenburg
  2004-02-04  0:03     ` Christoph Hellwig
  2004-01-30 16:01   ` Miquel van Smoorenburg
  2004-01-30 20:21   ` Miquel van Smoorenburg
  2 siblings, 1 reply; 25+ messages in thread
From: Miquel van Smoorenburg @ 2004-01-29 23:20 UTC (permalink / raw)
  To: Andrew Morton, Nathan Scott; +Cc: linux-kernel, linux-xfs

According to Andrew Morton:
> "Miquel van Smoorenburg" <miquels@cistron.nl> wrote:
> >
> > On the NFS server I'm exporting an XFS filesystem to the client
> > over Gigabit ethernet. The client mounts using
> > mount -o nfsvers=3,intr,rsize=32768,wsize=32768 server:/export/hwr /mnt
> > 
> > On the client I then run a large dd to a file on the server,
> > like dd if=/dev/zero of=/mnt/file bs=4096 count=800000
> > 
> > In a few seconds, the server locks up. It spins in
> > generic_fillattr(), apparently in the i_size_read() inline function.
> > Server responds to pings and sysrq, but nothing else.
> > 
> > 2.6.1 doesn't show this behaviour. I tested several kernels:

Okay so I did some more tests today. As it turns out, all stock
2.6 kernels lock up - it's just that 2.6.2-rc* locks up faster
than earlier versions. I tested 2.6.1-rc1-mm3 and 2.6.2-rc2-mm1,
these do not lock up (test on -mm has been running for 8 hours
straight now, 2.6.* vanilla locks up in a few minutes).

> > Here's the sysrq output:
> > 
> > Pid: 531, comm:                 nfsd
> > EIP: 0060:[<c01577c2>] CPU: 0
> > EIP is at generic_fillattr+0x82/0xa0
> >  EFLAGS: 00000246    Not tainted
> > EAX: 00000000 EBX: 07ae7200 ECX: f650a0a0 EDX: 000113eb
> > ESI: 00000000 EDI: f66cfed4 EBP: f66e5804 DS: 007b ES: 007b
> > CR0: 8005003b CR2: 40019000 CR3: 37245000 CR4: 000006d0
> > Call Trace:
> >  [<f8ab1f6b>] linvfs_getattr+0x2b/0x34 [xfs]
> >  [<c0157805>] vfs_getattr+0x25/0x84
> >  [<c01c19a3>] encode_post_op_attr+0x53/0x238
> >  [<c01c1e27>] encode_wcc_data+0x29f/0x2a8
> >  [<c01c4521>] nfs3svc_encode_commitres+0x19/0x5c
> >  [<c01b709d>] nfsd_dispatch+0x14d/0x1a3
> >  [<c02fb79b>] svc_process+0x3b3/0x640
> >  [<c01b6ddc>] nfsd+0x1e4/0x358
> >  [<c01b6bf8>] nfsd+0x0/0x358
> >  [<c0107251>] kernel_thread_helper+0x5/0xc
> > 
> 
> Is the EIP _always_ inside generic_fillattr()?

Yes, if a keep on doing BREAK-p the output is alway the same.
EIP varies a few bytes, ofcourse, not much.

> If so then yes, your analysis look right.  I'd say that the inode has been
> corrupted and the seqcount counter has assumed an non-even value.  That
> will cause i_size_read() to lock up.
> 
> Are you using slab debugging?  Is so, does the lockup go away if you change
> mm/slab.c:POISON_FREE to an even value, say 0x6a?  That would tend to
> confirm a use-after-free problem.  Or a totally wild pointer.

I turned on slab debugging but it didn't help, and it didn't make
any difference. Well actually it made it take longer to show the
effect - at first I thought that the problem dissapeared when I turned
on slab debugging, but when I removed the "count=bignumber" from the
dd command, it locked up eventually (in a few minutes).

I also changed the debug constanst from odd to even and the
other way around, but that had no effect - no messages logged,
and the kernel still locked up.

Note that at this heavy load, there are up to 8 nfsd's running
on the server concurrently. There must be some lock or race ...

According to Nathan Scott:
> Hmmmm... I don't think Andrew has any XFS fixes in his tree that
> might help there; and I can't think of any XFS change in -rc1 that
> might have caused this (does the fs/xfs subdir from 2.6.1 plonked
> down in place of the 2.6.2-rc1/2 version still have the problem?)

Yes, sorry for the confusion - all 2.6 (non-mm) shows this behaviour.
I tried -mm1 XFS in 2.6.1-rc2, but that did't help, it still
locks up (the diff is just a one-liner, so that was to be expected).
So it doesn't look like an XFS problem, perse.

> i_size_read seems to have a fair bit of config dependency - are you
> CONFIG_SMP / CONFIG_PREEMPT / neither?  and is your BITS_PER_LONG
> 32 or 64?  thanks.

CONFIG_SMP (for P IV hyperthreading), 32 bits.

Mike.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-29  6:25 ` Andrew Morton
  2004-01-29 23:20   ` Miquel van Smoorenburg
@ 2004-01-30 16:01   ` Miquel van Smoorenburg
  2004-01-30 20:21   ` Miquel van Smoorenburg
  2 siblings, 0 replies; 25+ messages in thread
From: Miquel van Smoorenburg @ 2004-01-30 16:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Miquel van Smoorenburg, linux-kernel

On Thu, 29 Jan 2004 07:25:21, Andrew Morton wrote:
> "Miquel van Smoorenburg" <miquels@cistron.nl> wrote:
> >
> > I have a Linux 2.6.2-rc2 NFS file server and another similar
> > box as client. Kernel is compiled for SMP (hyperthreading).
> > 
> > In a few seconds, the server locks up. It spins in
> > generic_fillattr(), apparently in the i_size_read() inline function.
> > Server responds to pings and sysrq, but nothing else.
> > Here's the sysrq output:
> > 
> > Pid: 531, comm:                 nfsd
> > EIP: 0060:[<c01577c2>] CPU: 0
> > EIP is at generic_fillattr+0x82/0xa0
> >  EFLAGS: 00000246    Not tainted
> > EAX: 00000000 EBX: 07ae7200 ECX: f650a0a0 EDX: 000113eb
> > ESI: 00000000 EDI: f66cfed4 EBP: f66e5804 DS: 007b ES: 007b
> > CR0: 8005003b CR2: 40019000 CR3: 37245000 CR4: 000006d0
> > Call Trace:
> >  [<f8ab1f6b>] linvfs_getattr+0x2b/0x34 [xfs]
> >  [<c0157805>] vfs_getattr+0x25/0x84
> >  [<c01c19a3>] encode_post_op_attr+0x53/0x238
> >  [<c01c1e27>] encode_wcc_data+0x29f/0x2a8
> >  [<c01c4521>] nfs3svc_encode_commitres+0x19/0x5c
> >  [<c01b709d>] nfsd_dispatch+0x14d/0x1a3
> >  [<c02fb79b>] svc_process+0x3b3/0x640
> >  [<c01b6ddc>] nfsd+0x1e4/0x358
> >  [<c01b6bf8>] nfsd+0x0/0x358
> >  [<c0107251>] kernel_thread_helper+0x5/0xc
> > 
> 
> Is the EIP _always_ inside generic_fillattr()?
> 
> If so then yes, your analysis look right.  I'd say that the inode has been
> corrupted and the seqcount counter has assumed an non-even value.  That
> will cause i_size_read() to lock up.

Yes, but why does it lock up the whole machine? I'd say just that one
process should go into 'R', but other things should get scheduled right?
Or is some very common lock being held? The machine responds to pings,
so interrupts are working.

In fact as this is a hyperthreaded machine, why don't I see anything
running on the "other virtual cpu" ? I only see nfsd spinning on CPU#0

Could it be a CPU problem, the other CPU-thread is in i_size_write() but
that CPU-thread is simply never run ? To test this, I rebooted the exact
same 2.6.2-rc2 kernel with "maxcpus=1" .. and it doesn't lock up ..

(Still weird that -mm doesn't lock up)

Mike.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-29  6:25 ` Andrew Morton
  2004-01-29 23:20   ` Miquel van Smoorenburg
  2004-01-30 16:01   ` Miquel van Smoorenburg
@ 2004-01-30 20:21   ` Miquel van Smoorenburg
  2004-01-30 22:13     ` Miquel van Smoorenburg
  2 siblings, 1 reply; 25+ messages in thread
From: Miquel van Smoorenburg @ 2004-01-30 20:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Miquel van Smoorenburg, linux-kernel

On Thu, 29 Jan 2004 07:25:21, Andrew Morton wrote:
> "Miquel van Smoorenburg" <miquels@cistron.nl> wrote:
> >
> > I have a Linux 2.6.2-rc2 NFS file server and another similar
> > box as client. Kernel is compiled for SMP (hyperthreading).
> > In a few seconds, the server locks up. It spins in
> > generic_fillattr(), apparently in the i_size_read() inline function.
> > Server responds to pings and sysrq, but nothing else.
> 
> Is the EIP _always_ inside generic_fillattr()?
> 
> If so then yes, your analysis look right.  I'd say that the inode has been
> corrupted and the seqcount counter has assumed an non-even value.  That
> will cause i_size_read() to lock up.

I added some extra code to i_size_read() and i_size_write(). First,
some debugging code:

--- fs.h.orig	2004-01-30 21:10:28.000000000 +0100
+++ fs.h.v1	2004-01-30 21:11:19.000000000 +0100
@@ -425,6 +425,7 @@
 	} u;
 #ifdef __NEED_I_SIZE_ORDERED
 	seqcount_t		i_size_seqcount;
+	pid_t			seq_pid; /* XXX */
 #endif
 };
 
@@ -450,6 +451,12 @@
 	do {
 		seq = read_seqcount_begin(&inode->i_size_seqcount);
 		i_size = inode->i_size;
+#if 1 /* XXX HACK */
+		if ((++count & 65535) == 0) {
+			printk("i_size_read() seems to be looping - pid %d\n", inode->seq_pid);
+			mdelay(100);
+		}
+#endif
 	} while (read_seqcount_retry(&inode->i_size_seqcount, seq));
 	return i_size;
 #elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT)
@@ -467,9 +474,20 @@
 static inline void i_size_write(struct inode *inode, loff_t i_size)
 {
 #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+#if 1 /* XXX */
+	inode->seq_pid = current->tgid;
+	write_seqcount_begin(&inode->i_size_seqcount);
+	inode->i_size = i_size;
+	write_seqcount_end(&inode->i_size_seqcount);
+	if (inode->i_size_seqcount.sequence & 1)
+		printk("i_size_write: pid %d: sequence is odd!\n",
+			current->tgid);
+	inode->seq_pid = 0;
+#else
 	write_seqcount_begin(&inode->i_size_seqcount);
 	inode->i_size = i_size;
 	write_seqcount_end(&inode->i_size_seqcount);
+#endif
 #elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT)
 	preempt_disable();
 	inode->i_size = i_size;

I then started the test that locks up the kernel, and it printed this:

i_size_write: pid 542: sequence is odd!
i_size_write: pid 543: sequence is odd!
i_size_write: pid 542: sequence is odd!

i_size_read() seems to be looping - pid 0
i_size_read() seems to be looping - pid 0
[this keeps on being printed and the kernel is locked up]

It took some time for the i_size_write messages to show up, and they were
spaced 10-30 seconds apart, and during that time the server was still
up - right until the first i_size_read message.

Then I added this patch:

--- fs.h.v1	2004-01-30 21:11:19.000000000 +0100
+++ fs.h	2004-01-30 20:16:35.000000000 +0100
@@ -426,6 +426,7 @@
 #ifdef __NEED_I_SIZE_ORDERED
 	seqcount_t		i_size_seqcount;
 	pid_t			seq_pid; /* XXX */
+	spinlock_t		i_size_lock;
 #endif
 };
 
@@ -475,6 +476,7 @@
 {
 #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
 #if 1 /* XXX */
+	spin_lock(&inode->i_size_lock);
 	inode->seq_pid = current->tgid;
 	write_seqcount_begin(&inode->i_size_seqcount);
 	inode->i_size = i_size;
@@ -483,6 +485,7 @@
 		printk("i_size_write: pid %d: sequence is odd!\n",
 			current->tgid);
 	inode->seq_pid = 0;
+	spin_unlock(&inode->i_size_lock);
 #else
 	write_seqcount_begin(&inode->i_size_seqcount);
 	inode->i_size = i_size;

(and some code in fs/inode.c to initialize i_size_lock)

Guess what. No more debug output, no more lockups ... is there
anything else I can do to debug this ? Because I'm not really
sure what I'm doing, you see :)

Mike.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-30 20:21   ` Miquel van Smoorenburg
@ 2004-01-30 22:13     ` Miquel van Smoorenburg
  2004-01-30 22:34       ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Miquel van Smoorenburg @ 2004-01-30 22:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Miquel van Smoorenburg, linux-kernel, Nathan Scott

On Fri, 30 Jan 2004 21:21:55, Miquel van Smoorenburg wrote:
> I added some extra code to i_size_read() and i_size_write(). First,
> some debugging code:
> I then started the test that locks up the kernel, and it printed this:
> 
> i_size_write: pid 542: sequence is odd!
> i_size_write: pid 543: sequence is odd!
> i_size_write: pid 542: sequence is odd!
> 
> i_size_read() seems to be looping - pid 0
> i_size_read() seems to be looping - pid 0
> [this keeps on being printed and the kernel is locked up]
> 
> It took some time for the i_size_write messages to show up, and they were
> spaced 10-30 seconds apart, and during that time the server was still
> up - right until the first i_size_read message.
> 

Okay, I added a patch to make the sequence increments atomic. Now
i_size_write() still sometimes ends up with an odd sequence, but
i_size_read() doesn't lock up anymore.

What lock exactly is supposed to protect i_size_write, since it
appears that i_size_write is being called without proper locking ?
(Am I right?)


--- fs.h.v1	2004-01-30 21:11:19.000000000 +0100
+++ fs.h	2004-01-30 21:55:17.000000000 +0100
@@ -426,6 +426,7 @@
 #ifdef __NEED_I_SIZE_ORDERED
 	seqcount_t		i_size_seqcount;
 	pid_t			seq_pid; /* XXX */
+	spinlock_t		i_size_lock;
 #endif
 };
 
@@ -441,6 +442,7 @@
  */
 #include <linux/delay.h>
 #include <linux/sched.h>
+#include <asm/atomic.h>
 static inline loff_t i_size_read(struct inode *inode)
 {
 #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
@@ -476,9 +478,11 @@
 #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
 #if 1 /* XXX */
 	inode->seq_pid = current->tgid;
-	write_seqcount_begin(&inode->i_size_seqcount);
+	atomic_inc((atomic_t *)&inode->i_size_seqcount.sequence);
+	smp_wmb();
 	inode->i_size = i_size;
-	write_seqcount_end(&inode->i_size_seqcount);
+	smp_wmb();
+	atomic_inc((atomic_t *)&inode->i_size_seqcount.sequence);
 	if (inode->i_size_seqcount.sequence & 1)
 		printk("i_size_write: pid %d: sequence is odd!\n",
 			current->tgid);

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-30 22:13     ` Miquel van Smoorenburg
@ 2004-01-30 22:34       ` Andrew Morton
  2004-01-30 22:53         ` Christoph Hellwig
  2004-01-30 23:07         ` Nathan Scott
  0 siblings, 2 replies; 25+ messages in thread
From: Andrew Morton @ 2004-01-30 22:34 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: miquels, linux-kernel, nathans

Miquel van Smoorenburg <miquels@cistron.nl> wrote:
>
> On Fri, 30 Jan 2004 21:21:55, Miquel van Smoorenburg wrote:
> > I added some extra code to i_size_read() and i_size_write(). First,
> > some debugging code:
> > I then started the test that locks up the kernel, and it printed this:
> > 
> > i_size_write: pid 542: sequence is odd!
> > i_size_write: pid 543: sequence is odd!
> > i_size_write: pid 542: sequence is odd!
> > 
> > i_size_read() seems to be looping - pid 0
> > i_size_read() seems to be looping - pid 0
> > [this keeps on being printed and the kernel is locked up]
> > 
> > It took some time for the i_size_write messages to show up, and they were
> > spaced 10-30 seconds apart, and during that time the server was still
> > up - right until the first i_size_read message.
> > 
> 
> Okay, I added a patch to make the sequence increments atomic. Now
> i_size_write() still sometimes ends up with an odd sequence, but
> i_size_read() doesn't lock up anymore.

Go, Miquel!

> What lock exactly is supposed to protect i_size_write, since it
> appears that i_size_write is being called without proper locking ?
> (Am I right?)

If two CPUs hit i_size_write() at the same time we have a bug.  That
function requires that the caller provide external serialisation, via i_sem.

Try adding this to the start of i_size_write():

	if (down_trylock(&inode->i_sem) == 0) {
		printk("I am buggy\n");
		dump_stack();
		up(&inode->i_sem);
	}


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-30 22:34       ` Andrew Morton
@ 2004-01-30 22:53         ` Christoph Hellwig
  2004-01-30 23:13           ` Andrew Morton
  2004-01-30 23:07         ` Nathan Scott
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2004-01-30 22:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Miquel van Smoorenburg, linux-kernel, nathans

On Fri, Jan 30, 2004 at 02:34:59PM -0800, Andrew Morton wrote:
> If two CPUs hit i_size_write() at the same time we have a bug.  That
> function requires that the caller provide external serialisation, via i_sem.

O_APPEND|O_DIRECT writes could do that under XFS..


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-30 22:34       ` Andrew Morton
  2004-01-30 22:53         ` Christoph Hellwig
@ 2004-01-30 23:07         ` Nathan Scott
  1 sibling, 0 replies; 25+ messages in thread
From: Nathan Scott @ 2004-01-30 23:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Miquel van Smoorenburg, linux-kernel

On Fri, Jan 30, 2004 at 02:34:59PM -0800, Andrew Morton wrote:
> Miquel van Smoorenburg <miquels@cistron.nl> wrote:
> > What lock exactly is supposed to protect i_size_write, since it
> > appears that i_size_write is being called without proper locking ?
> > (Am I right?)
> 
> If two CPUs hit i_size_write() at the same time we have a bug.  That
> function requires that the caller provide external serialisation, via i_sem.

Hmm... I suspect we may not always be providing that in XFS -
I'll go audit our calls.

> Try adding this to the start of i_size_write():
> 
> 	if (down_trylock(&inode->i_sem) == 0) {
> 		printk("I am buggy\n");
> 		dump_stack();
> 		up(&inode->i_sem);
> 	}

Let me know what you hit Miquel :) - I'll run the XFS tests
next week with this too and see what I can find.

thanks.

-- 
Nathan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-30 22:53         ` Christoph Hellwig
@ 2004-01-30 23:13           ` Andrew Morton
  2004-01-31  1:25             ` Miquel van Smoorenburg
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2004-01-30 23:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: miquels, linux-kernel, nathans

Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Jan 30, 2004 at 02:34:59PM -0800, Andrew Morton wrote:
> > If two CPUs hit i_size_write() at the same time we have a bug.  That
> > function requires that the caller provide external serialisation, via i_sem.
> 
> O_APPEND|O_DIRECT writes could do that under XFS..

Sigh.


diff -puN include/linux/fs.h~i_size_write-check include/linux/fs.h
--- 25/include/linux/fs.h~i_size_write-check	Fri Jan 30 15:09:47 2004
+++ 25-akpm/include/linux/fs.h	Fri Jan 30 15:10:28 2004
@@ -464,9 +464,11 @@ static inline loff_t i_size_read(struct 
 #endif
 }
 
+void i_size_write_check(struct inode *inode);
 
 static inline void i_size_write(struct inode *inode, loff_t i_size)
 {
+	i_size_write_check(inode);
 #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
 	write_seqcount_begin(&inode->i_size_seqcount);
 	inode->i_size = i_size;
diff -puN mm/filemap.c~i_size_write-check mm/filemap.c
--- 25/mm/filemap.c~i_size_write-check	Fri Jan 30 15:10:23 2004
+++ 25-akpm/mm/filemap.c	Fri Jan 30 15:11:41 2004
@@ -2010,3 +2010,18 @@ out:
 }
 
 EXPORT_SYMBOL_GPL(generic_file_direct_IO);
+
+void i_size_write_check(struct inode *inode)
+{
+	static int count = 0;
+
+	if (down_trylock(&inode->i_sem) == 0) {
+		if (count < 10) {
+			count++;
+			printk("i_size_write() called without i_sem\n");
+			dump_stack();
+		}
+		up(&inode->i_sem);
+	}
+}
+EXPORT_SYMBOL(i_size_write_check);

_


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-30 23:13           ` Andrew Morton
@ 2004-01-31  1:25             ` Miquel van Smoorenburg
  2004-01-31  1:38               ` Andrew Morton
                                 ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Miquel van Smoorenburg @ 2004-01-31  1:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Hellwig, miquels, linux-kernel, nathans

On Sat, 31 Jan 2004 00:13:16, Andrew Morton wrote:
> Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Fri, Jan 30, 2004 at 02:34:59PM -0800, Andrew Morton wrote:
> > > If two CPUs hit i_size_write() at the same time we have a bug.  That
> > > function requires that the caller provide external serialisation, via i_sem.
> > 
> > O_APPEND|O_DIRECT writes could do that under XFS..
> 
> Sigh.
>
> diff -puN mm/filemap.c~i_size_write-check mm/filemap.c
> --- 25/mm/filemap.c~i_size_write-check	Fri Jan 30 15:10:23 2004
> +++ 25-akpm/mm/filemap.c	Fri Jan 30 15:11:41 2004
> @@ -2010,3 +2010,18 @@ out:
>  }
>  
>  EXPORT_SYMBOL_GPL(generic_file_direct_IO);
> +
> +void i_size_write_check(struct inode *inode)
> +{
> +	static int count = 0;
> +
> +	if (down_trylock(&inode->i_sem) == 0) {
> +		if (count < 10) {

You want to set this to 100 at least, since at boot time the message
happens _often_ even without XFS.

It's caused by sysfs:

Jan 31 00:48:33 meghan kernel: i_size_write() called without i_sem
Jan 31 00:48:33 meghan kernel: Call Trace:
Jan 31 00:48:33 meghan kernel:  [i_size_write_check+95/97] i_size_write_check+0x5f/0x61
Jan 31 00:48:33 meghan kernel:  [simple_commit_write+74/148] simple_commit_write+0x4a/0x94
Jan 31 00:48:33 meghan kernel:  [page_symlink+210/414] page_symlink+0xd2/0x19e
Jan 31 00:48:33 meghan kernel:  [sysfs_symlink+107/126] sysfs_symlink+0x6b/0x7e
Jan 31 00:48:33 meghan kernel:  [sysfs_create_link+313/317] sysfs_create_link+0x139/0x13d
Jan 31 00:48:33 meghan kernel:  [bus_add_device+143/161] bus_add_device+0x8f/0xa1
[etc etc]

Also, bd_set_size runs unlocked:

Jan 31 02:02:39 meghan kernel: I am buggy
Jan 31 02:02:39 meghan kernel: Call Trace:
Jan 31 02:02:39 meghan kernel:  [bd_set_size+201/218] bd_set_size+0xc9/0xda
Jan 31 02:02:39 meghan kernel:  [do_open+282/1014] do_open+0x11a/0x3f6
Jan 31 02:02:39 meghan kernel:  [blkdev_get+114/128] blkdev_get+0x72/0x80
Jan 31 02:02:39 meghan kernel:  [do_open+603/1014] do_open+0x25b/0x3f6
Jan 31 02:02:39 meghan kernel:  [blkdev_get+114/128] blkdev_get+0x72/0x80
Jan 31 02:02:39 meghan kernel:  [open_bdev_excl+87/172] open_bdev_excl+0x57/0xacJan 31 02:02:39 meghan kernel:  [get_sb_bdev+55/344] get_sb_bdev+0x37/0x158
Jan 31 02:02:39 meghan kernel:  [ext3_get_sb+47/51] ext3_get_sb+0x2f/0x33
Jan 31 02:02:39 meghan kernel:  [ext3_fill_super+0/2949] ext3_fill_super+0x0/0xb85
Jan 31 02:02:39 meghan kernel:  [do_kern_mount+95/213] do_kern_mount+0x5f/0xd5
Jan 31 02:02:39 meghan kernel:  [do_add_mount+120/334] do_add_mount+0x78/0x14e
Jan 31 02:02:39 meghan kernel:  [do_mount+289/359] do_mount+0x121/0x167
Jan 31 02:02:39 meghan kernel:  [__copy_from_user_ll+104/108] __copy_from_user_ll+0x68/0x6c
Jan 31 02:02:39 meghan kernel:  [copy_mount_options+128/234] copy_mount_options+0x80/0xea
Jan 31 02:02:39 meghan kernel:  [sys_mount+203/309] sys_mount+0xcb/0x135
Jan 31 02:02:39 meghan kernel:  [do_mount_root+47/156] do_mount_root+0x2f/0x9c
Jan 31 02:02:39 meghan kernel:  [mount_block_root+84/285] mount_block_root+0x54/0x11d
Jan 31 02:02:39 meghan kernel:  [mount_root+94/102] mount_root+0x5e/0x66
Jan 31 02:02:39 meghan kernel:  [prepare_namespace+38/195] prepare_namespace+0x26/0xc3
Jan 31 02:02:39 meghan kernel:  [init+80/334] init+0x50/0x14e
Jan 31 02:02:39 meghan kernel:  [init+0/334] init+0x0/0x14e
Jan 31 02:02:39 meghan kernel:  [kernel_thread_helper+5/11] kernel_thread_helper+0x5/0xb

But the XFS problem appears to be vn_revalidate which calls i_size_write()
without holding i_sem:

Jan 31 02:04:00 meghan kernel: I am buggy
Jan 31 02:04:00 meghan kernel: Call Trace:
Jan 31 02:04:00 meghan kernel:  [__crc_pm_idle+1813107/4617778] vn_revalidate+0x18e/0x1a9 [xfs]
Jan 31 02:04:00 meghan kernel:  [nfsd_commit+205/219] nfsd_commit+0xcd/0xdb
Jan 31 02:04:00 meghan kernel:  [__crc_pm_idle+1798087/4617778] linvfs_getattr+0x39/0x3f [xfs]
Jan 31 02:04:00 meghan kernel:  [vfs_getattr+57/154] vfs_getattr+0x39/0x9a
Jan 31 02:04:00 meghan kernel:  [encode_post_op_attr+104/577] encode_post_op_attr+0x68/0x241
Jan 31 02:04:00 meghan kernel:  [nfs3svc_encode_commitres+42/115] nfs3svc_encode_commitres+0x2a/0x73
Jan 31 02:04:00 meghan kernel:  [nfsd_dispatch+297/485] nfsd_dispatch+0x129/0x1e5
Jan 31 02:04:00 meghan kernel:  [svc_process+1170/1621] svc_process+0x492/0x655
Jan 31 02:04:00 meghan kernel:  [apic_timer_interrupt+26/32] apic_timer_interrupt+0x1a/0x20
Jan 31 02:04:00 meghan kernel:  [nfsd+490/915] nfsd+0x1ea/0x393
Jan 31 02:04:00 meghan kernel:  [nfsd+0/915] nfsd+0x0/0x393
Jan 31 02:04:00 meghan kernel:  [kernel_thread_helper+5/11] kernel_thread_helper+0x5/0xb

Mike.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-31  1:25             ` Miquel van Smoorenburg
@ 2004-01-31  1:38               ` Andrew Morton
  2004-01-31 11:46                 ` Miquel van Smoorenburg
  2004-01-31 17:07               ` Christoph Hellwig
  2004-02-01  1:46               ` Anton Blanchard
  2 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2004-01-31  1:38 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: hch, miquels, linux-kernel, nathans

Miquel van Smoorenburg <miquels@cistron.nl> wrote:
>
> On Sat, 31 Jan 2004 00:13:16, Andrew Morton wrote:
> > Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Fri, Jan 30, 2004 at 02:34:59PM -0800, Andrew Morton wrote:
> > > > If two CPUs hit i_size_write() at the same time we have a bug.  That
> > > > function requires that the caller provide external serialisation, via i_sem.
> > > 
> > > O_APPEND|O_DIRECT writes could do that under XFS..
> > 
> > Sigh.
> >
> > diff -puN mm/filemap.c~i_size_write-check mm/filemap.c
> > --- 25/mm/filemap.c~i_size_write-check	Fri Jan 30 15:10:23 2004
> > +++ 25-akpm/mm/filemap.c	Fri Jan 30 15:11:41 2004
> > @@ -2010,3 +2010,18 @@ out:
> >  }
> >  
> >  EXPORT_SYMBOL_GPL(generic_file_direct_IO);
> > +
> > +void i_size_write_check(struct inode *inode)
> > +{
> > +	static int count = 0;
> > +
> > +	if (down_trylock(&inode->i_sem) == 0) {
> > +		if (count < 10) {
> 
> You want to set this to 100 at least, since at boot time the message
> happens _often_ even without XFS.

fun.

> It's caused by sysfs:

OK, sysfs_symlink() needs i_sem.

> Also, bd_set_size runs unlocked:

I don't expect we'll ever see a race over the blockdev's i_size in there. 
It might be simplest to just do a straight assignment.  I'll work that with
viro.

> But the XFS problem appears to be vn_revalidate which calls i_size_write()
> without holding i_sem:

There's your bug.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-31  1:38               ` Andrew Morton
@ 2004-01-31 11:46                 ` Miquel van Smoorenburg
  2004-01-31 15:59                   ` Steve Lord
  0 siblings, 1 reply; 25+ messages in thread
From: Miquel van Smoorenburg @ 2004-01-31 11:46 UTC (permalink / raw)
  To: nathans; +Cc: hch, miquels, linux-kernel

On Sat, 31 Jan 2004 02:38:51, Andrew Morton wrote:
> Miquel van Smoorenburg <miquels@cistron.nl> wrote:
> > But the XFS problem appears to be vn_revalidate which calls i_size_write()
> > without holding i_sem:
> 
> There's your bug.

Okay. It seems that XFS uses its own locking with xfs_ilock() etc,
so I am not sure if this should be fixed by using down(&inode->i_sem)
or by using xfs_ilock().

Perhaps xfs_ilock() should also get the inode->i_sem semaphore
in the XFS_ILOCK_EXCL case ?

Also, there are one or two more places that call i_size_write()
that should be looked at I guess.

Ofcourse I'll test any patches you send me.

Mike.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-31 11:46                 ` Miquel van Smoorenburg
@ 2004-01-31 15:59                   ` Steve Lord
  2004-02-01 16:15                     ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Steve Lord @ 2004-01-31 15:59 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: nathans, hch, linux-kernel

Miquel van Smoorenburg wrote:
> On Sat, 31 Jan 2004 02:38:51, Andrew Morton wrote:
> 
>>Miquel van Smoorenburg <miquels@cistron.nl> wrote:
>>
>>>But the XFS problem appears to be vn_revalidate which calls i_size_write()
>>>without holding i_sem:
>>
>>There's your bug.
> 
> 
> Okay. It seems that XFS uses its own locking with xfs_ilock() etc,
> so I am not sure if this should be fixed by using down(&inode->i_sem)
> or by using xfs_ilock().
> 
> Perhaps xfs_ilock() should also get the inode->i_sem semaphore
> in the XFS_ILOCK_EXCL case ?
> 
> Also, there are one or two more places that call i_size_write()
> that should be looked at I guess.
> 

Actually it gets more messy than this. i_size_write is called by
different spots by xfs without the i_sem held. The call in
generic_file_aio_write_nolock() is made without the lock for O_DIRECT 
writes, XFS allows multiple parallel writes for O_DIRECT
writes.

The vn_revalidate call is called out of linvfs_setattr,
which is called with the i_sem held, it is also potentially called out
of linvfs_getattr, although since the i_size is always maintained
as it is changed, this call should not actually be updating the size.
Possibly changing the code in vn_revalidate to do this:

	if (i_size_read(inode) < va.va_size))
		i_size_write(inode, va.va_size);

Would be a good starting point, I suspect those calls from the nfs
revalidate call are not really going to change the inode size. My
guess is this will make your problem go away.

Probably some larger code restructure is needed so that revalidate
knows if the i_sem is held or not at this point.

The O_DIRECT write case is the hard one. In XFS's internal view of
the world, the inode size is maintained via the XFS_ILOCK, but we
only hold that across metadata manipulation within the fs code,
not across I/O such as a call to generic_file_aio_write_nolock.
Right now the only way I see of dealing with that is to make
writes which we know will extent the file hold the i_sem for
the duration in the O_DIRECT case.

Steve

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-02-01 16:15                     ` Christoph Hellwig
@ 2004-01-31 16:41                       ` Steve Lord
  0 siblings, 0 replies; 25+ messages in thread
From: Steve Lord @ 2004-01-31 16:41 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Miquel van Smoorenburg, nathans, linux-kernel

Christoph Hellwig wrote:
> On Sat, Jan 31, 2004 at 09:59:06AM -0600, Steve Lord wrote:
> 
>>The vn_revalidate call is called out of linvfs_setattr,
>>which is called with the i_sem held, it is also potentially called out
>>of linvfs_getattr, although since the i_size is always maintained
>>as it is changed, this call should not actually be updating the size.
>>Possibly changing the code in vn_revalidate to do this:
>>
>>	if (i_size_read(inode) < va.va_size))
>>		i_size_write(inode, va.va_size);
>>
>>Would be a good starting point, I suspect those calls from the nfs
>>revalidate call are not really going to change the inode size. My
>>guess is this will make your problem go away.
>>
>>Probably some larger code restructure is needed so that revalidate
>>knows if the i_sem is held or not at this point.
> 
> 
> I think the right fix is to update the Linux i_size always shortly after
> di_size is updated.  There's a lot of updates in the directory code while
> should be handled by an i_size_write in the matching linvfs routines, and
> there's a few more but we should be able to handle those without
> vn_revalidate aswell.
> 

Changing the validate_fields call to use

	if (i_size_read(inode) != va.va_size)
		i_size_write(inode, va.va_size);

should take care of directories, certainly better than the
direct assignment to i_size which is in there now....
This is called from the directory ops which modify the
contents of a directory and should already be under the
i_sem. The vn_revalidate code should use a != test
as well of course...


> 
>>The O_DIRECT write case is the hard one. In XFS's internal view of
>>the world, the inode size is maintained via the XFS_ILOCK, but we
>>only hold that across metadata manipulation within the fs code,
>>not across I/O such as a call to generic_file_aio_write_nolock.
>>Right now the only way I see of dealing with that is to make
>>writes which we know will extent the file hold the i_sem for
>>the duration in the O_DIRECT case.
> 
> 
> That's the more difficult case.  Any reson why you'd hold i_sem
> for the whole O_DIRECT I/O instead of just for updating i_sem?
> 

We worked hard not to hold it, but that i_size_write in
the middle of the async_io code is tough to work around.

> Note that the EOF zeroing code also calls i_size_write.
> 

But that is called out of an extending setattr or a buffered write which
hold the semaphore. Hmm, actually O_DIRECT write can be in there as
well, but that is the same problem as the generic I/O path call.

Steve


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-31  1:25             ` Miquel van Smoorenburg
  2004-01-31  1:38               ` Andrew Morton
@ 2004-01-31 17:07               ` Christoph Hellwig
  2004-02-01  1:46               ` Anton Blanchard
  2 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2004-01-31 17:07 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: Andrew Morton, linux-kernel, nathans

Here's an (untested) quickhack to take i_sem in linvfs_getattr.
I'll try to come up with a real fix next week.

--- 1.39/fs/xfs/linux/xfs_iops.c	Fri Jan  9 01:07:07 2004
+++ edited/fs/xfs/linux/xfs_iops.c	Sat Jan 31 18:41:44 2004
@@ -478,8 +478,11 @@
 	vnode_t		*vp = LINVFS_GET_VP(inode);
 	int		error = 0;
 
-	if (unlikely(vp->v_flag & VMODIFIED))
+	if (unlikely(vp->v_flag & VMODIFIED)) {
+		down(&inode->i_sem);
 		error = vn_revalidate(vp);
+		up(&inode->i_sem);
+	}
 	if (!error)
 		generic_fillattr(inode, stat);
 	return 0;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-31  1:25             ` Miquel van Smoorenburg
  2004-01-31  1:38               ` Andrew Morton
  2004-01-31 17:07               ` Christoph Hellwig
@ 2004-02-01  1:46               ` Anton Blanchard
  2 siblings, 0 replies; 25+ messages in thread
From: Anton Blanchard @ 2004-02-01  1:46 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Andrew Morton, Christoph Hellwig, linux-kernel, nathans


> > +	if (down_trylock(&inode->i_sem) == 0) {
> > +		if (count < 10) {
> 
> You want to set this to 100 at least, since at boot time the message
> happens _often_ even without XFS.

You could also just printk_ratelimit it :)

Anton

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-31 15:59                   ` Steve Lord
@ 2004-02-01 16:15                     ` Christoph Hellwig
  2004-01-31 16:41                       ` Steve Lord
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2004-02-01 16:15 UTC (permalink / raw)
  To: Steve Lord; +Cc: Miquel van Smoorenburg, nathans, linux-kernel

On Sat, Jan 31, 2004 at 09:59:06AM -0600, Steve Lord wrote:
> The vn_revalidate call is called out of linvfs_setattr,
> which is called with the i_sem held, it is also potentially called out
> of linvfs_getattr, although since the i_size is always maintained
> as it is changed, this call should not actually be updating the size.
> Possibly changing the code in vn_revalidate to do this:
> 
> 	if (i_size_read(inode) < va.va_size))
> 		i_size_write(inode, va.va_size);
> 
> Would be a good starting point, I suspect those calls from the nfs
> revalidate call are not really going to change the inode size. My
> guess is this will make your problem go away.
> 
> Probably some larger code restructure is needed so that revalidate
> knows if the i_sem is held or not at this point.

I think the right fix is to update the Linux i_size always shortly after
di_size is updated.  There's a lot of updates in the directory code while
should be handled by an i_size_write in the matching linvfs routines, and
there's a few more but we should be able to handle those without
vn_revalidate aswell.

> The O_DIRECT write case is the hard one. In XFS's internal view of
> the world, the inode size is maintained via the XFS_ILOCK, but we
> only hold that across metadata manipulation within the fs code,
> not across I/O such as a call to generic_file_aio_write_nolock.
> Right now the only way I see of dealing with that is to make
> writes which we know will extent the file hold the i_sem for
> the duration in the O_DIRECT case.

That's the more difficult case.  Any reson why you'd hold i_sem
for the whole O_DIRECT I/O instead of just for updating i_sem?

Note that the EOF zeroing code also calls i_size_write.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-02-04  0:03     ` Christoph Hellwig
@ 2004-02-03 14:13       ` Steve Lord
  2004-02-04 15:16         ` Christoph Hellwig
  2004-02-04 18:17         ` Christoph Hellwig
  2004-02-04  0:06       ` David Weinehall
  1 sibling, 2 replies; 25+ messages in thread
From: Steve Lord @ 2004-02-03 14:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Miquel van Smoorenburg, Andrew Morton, Nathan Scott, linux-kernel,
	linux-xfs

Christoph Hellwig wrote:
> Okay, what about this little patch?:
> 
> 
> Index: fs/xfs/linux-2.6/xfs_iops.c
> ===================================================================
> RCS file: /cvs/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_iops.c,v
> retrieving revision 1.212
> diff -u -p -r1.212 xfs_iops.c
> --- fs/xfs/linux-2.6/xfs_iops.c	12 Dec 2003 04:17:52 -0000	1.212
> +++ fs/xfs/linux-2.6/xfs_iops.c	3 Feb 2004 23:56:17 -0000
> @@ -80,11 +80,15 @@ validate_fields(
>  	vattr_t		va;
>  	int		error;
>  
> -	va.va_mask = XFS_AT_NLINK|XFS_AT_SIZE|XFS_AT_NBLOCKS;
> +	va.va_mask = XFS_AT_NLINK|XFS_AT_SIZE|XFS_AT_NBLOCKS|XFS_AT_SIZE;
>  	VOP_GETATTR(vp, &va, ATTR_LAZY, NULL, error);
>  	ip->i_nlink = va.va_nlink;
>  	ip->i_size = va.va_size;
>  	ip->i_blocks = va.va_nblocks;
> +
> +	/* we're under i_sem so i_size can't change under us */
> +	if (i_size_read(ip) != va.va_size)
> +		i_size_write(ip, va.va_size);
>  }
>  
>  /*
> @@ -186,8 +190,7 @@ linvfs_mknod(
>  
>  		if (S_ISCHR(mode) || S_ISBLK(mode))
>  			ip->i_rdev = rdev;
> -		else if (S_ISDIR(mode))
> -			validate_fields(ip);
> +		validate_fields(ip);

There was some reason this was only necessary on directories, but I
cannot remember why just now.

>  		d_instantiate(dentry, ip);
>  		validate_fields(dir);
>  	}
> @@ -536,6 +539,7 @@ linvfs_setattr(
>  	if (error)
>  		return(-error);	/* Positive error up from XFS */
>  	if (ia_valid & ATTR_SIZE) {
> +		i_size_write(inode, vattr.va_size);
>  		error = vmtruncate(inode, attr->ia_size);
>  	}
>  
> Index: fs/xfs/linux-2.6/xfs_vnode.c
> ===================================================================
> RCS file: /cvs/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_vnode.c,v
> retrieving revision 1.120
> diff -u -p -r1.120 xfs_vnode.c
> --- fs/xfs/linux-2.6/xfs_vnode.c	20 Oct 2003 02:08:58 -0000	1.120
> +++ fs/xfs/linux-2.6/xfs_vnode.c	3 Feb 2004 23:56:17 -0000
> @@ -213,7 +213,6 @@ vn_revalidate(
>  		inode->i_mtime	    = va.va_mtime;
>  		inode->i_ctime	    = va.va_ctime;
>  		inode->i_atime	    = va.va_atime;
> -		i_size_write(inode, va.va_size);
>  		if (va.va_xflags & XFS_XFLAG_IMMUTABLE)
>  			inode->i_flags |= S_IMMUTABLE;
>  		else
> 

I think this should work, it just leaves the extending O_DIRECT write
case. Keeping the revalidate call out of the path for creating regular
files would be nice though, why did you deem that necessary?

Steve

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-01-29 23:20   ` Miquel van Smoorenburg
@ 2004-02-04  0:03     ` Christoph Hellwig
  2004-02-03 14:13       ` Steve Lord
  2004-02-04  0:06       ` David Weinehall
  0 siblings, 2 replies; 25+ messages in thread
From: Christoph Hellwig @ 2004-02-04  0:03 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Andrew Morton, Nathan Scott, linux-kernel, linux-xfs

Okay, what about this little patch?:


Index: fs/xfs/linux-2.6/xfs_iops.c
===================================================================
RCS file: /cvs/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_iops.c,v
retrieving revision 1.212
diff -u -p -r1.212 xfs_iops.c
--- fs/xfs/linux-2.6/xfs_iops.c	12 Dec 2003 04:17:52 -0000	1.212
+++ fs/xfs/linux-2.6/xfs_iops.c	3 Feb 2004 23:56:17 -0000
@@ -80,11 +80,15 @@ validate_fields(
 	vattr_t		va;
 	int		error;
 
-	va.va_mask = XFS_AT_NLINK|XFS_AT_SIZE|XFS_AT_NBLOCKS;
+	va.va_mask = XFS_AT_NLINK|XFS_AT_SIZE|XFS_AT_NBLOCKS|XFS_AT_SIZE;
 	VOP_GETATTR(vp, &va, ATTR_LAZY, NULL, error);
 	ip->i_nlink = va.va_nlink;
 	ip->i_size = va.va_size;
 	ip->i_blocks = va.va_nblocks;
+
+	/* we're under i_sem so i_size can't change under us */
+	if (i_size_read(ip) != va.va_size)
+		i_size_write(ip, va.va_size);
 }
 
 /*
@@ -186,8 +190,7 @@ linvfs_mknod(
 
 		if (S_ISCHR(mode) || S_ISBLK(mode))
 			ip->i_rdev = rdev;
-		else if (S_ISDIR(mode))
-			validate_fields(ip);
+		validate_fields(ip);
 		d_instantiate(dentry, ip);
 		validate_fields(dir);
 	}
@@ -536,6 +539,7 @@ linvfs_setattr(
 	if (error)
 		return(-error);	/* Positive error up from XFS */
 	if (ia_valid & ATTR_SIZE) {
+		i_size_write(inode, vattr.va_size);
 		error = vmtruncate(inode, attr->ia_size);
 	}
 
Index: fs/xfs/linux-2.6/xfs_vnode.c
===================================================================
RCS file: /cvs/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_vnode.c,v
retrieving revision 1.120
diff -u -p -r1.120 xfs_vnode.c
--- fs/xfs/linux-2.6/xfs_vnode.c	20 Oct 2003 02:08:58 -0000	1.120
+++ fs/xfs/linux-2.6/xfs_vnode.c	3 Feb 2004 23:56:17 -0000
@@ -213,7 +213,6 @@ vn_revalidate(
 		inode->i_mtime	    = va.va_mtime;
 		inode->i_ctime	    = va.va_ctime;
 		inode->i_atime	    = va.va_atime;
-		i_size_write(inode, va.va_size);
 		if (va.va_xflags & XFS_XFLAG_IMMUTABLE)
 			inode->i_flags |= S_IMMUTABLE;
 		else

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-02-04  0:03     ` Christoph Hellwig
  2004-02-03 14:13       ` Steve Lord
@ 2004-02-04  0:06       ` David Weinehall
  2004-02-04  0:07         ` Christoph Hellwig
  1 sibling, 1 reply; 25+ messages in thread
From: David Weinehall @ 2004-02-04  0:06 UTC (permalink / raw)
  To: Christoph Hellwig, Miquel van Smoorenburg, Andrew Morton,
	Nathan Scott, linux-kernel, linux-xfs

On Wed, Feb 04, 2004 at 12:03:15AM +0000, Christoph Hellwig wrote:
> Okay, what about this little patch?:
> 
> 
> Index: fs/xfs/linux-2.6/xfs_iops.c
> ===================================================================
> RCS file: /cvs/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_iops.c,v
> retrieving revision 1.212
> diff -u -p -r1.212 xfs_iops.c
> --- fs/xfs/linux-2.6/xfs_iops.c	12 Dec 2003 04:17:52 -0000	1.212
> +++ fs/xfs/linux-2.6/xfs_iops.c	3 Feb 2004 23:56:17 -0000
> @@ -80,11 +80,15 @@ validate_fields(
>  	vattr_t		va;
>  	int		error;
>  
> -	va.va_mask = XFS_AT_NLINK|XFS_AT_SIZE|XFS_AT_NBLOCKS;
> +	va.va_mask = XFS_AT_NLINK|XFS_AT_SIZE|XFS_AT_NBLOCKS|XFS_AT_SIZE;

Huh? XFS_AT_SIZE twice?!

[snip]


Regards: David Weinehall
-- 
 /) David Weinehall <tao@acc.umu.se> /) Northern lights wander      (\
//  Maintainer of the v2.0 kernel   //  Dance across the winter sky //
\)  http://www.acc.umu.se/~tao/    (/   Full colour fire           (/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-02-04  0:06       ` David Weinehall
@ 2004-02-04  0:07         ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2004-02-04  0:07 UTC (permalink / raw)
  To: Christoph Hellwig, Miquel van Smoorenburg, Andrew Morton,
	Nathan Scott, linux-kernel, linux-xfs

On Wed, Feb 04, 2004 at 01:06:23AM +0100, David Weinehall wrote:
> > -	va.va_mask = XFS_AT_NLINK|XFS_AT_SIZE|XFS_AT_NBLOCKS;
> > +	va.va_mask = XFS_AT_NLINK|XFS_AT_SIZE|XFS_AT_NBLOCKS|XFS_AT_SIZE;
> 
> Huh? XFS_AT_SIZE twice?!

Ah thanks.  Stupid braino but fortunately harmless..


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-02-03 14:13       ` Steve Lord
@ 2004-02-04 15:16         ` Christoph Hellwig
  2004-02-04 18:17         ` Christoph Hellwig
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2004-02-04 15:16 UTC (permalink / raw)
  To: Steve Lord
  Cc: Christoph Hellwig, Miquel van Smoorenburg, Andrew Morton,
	Nathan Scott, linux-kernel, linux-xfs

On Tue, Feb 03, 2004 at 08:13:04AM -0600, Steve Lord wrote:
> >  			ip->i_rdev = rdev;
> > -		else if (S_ISDIR(mode))
> > -			validate_fields(ip);
> > +		validate_fields(ip);
> 
> There was some reason this was only necessary on directories, but I
> cannot remember why just now.

Well, it is nessecary now to update i_size.  Or rather it was, I think
I can get rid of it again after taking care of initialize_vnode.

> I think this should work, it just leaves the extending O_DIRECT write
> case.

And initialize_vnode.  I have a working patch for the latter, but I still
need to take a look at O_DIRECT.

> Keeping the revalidate call out of the path for creating regular
> files would be nice though, why did you deem that necessary?

I thought I need it for i_size udates, but we should be able to take
care of it in initialize_vnode.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: 2.6.2-rc2 nfsd+xfs spins in i_size_read()
  2004-02-03 14:13       ` Steve Lord
  2004-02-04 15:16         ` Christoph Hellwig
@ 2004-02-04 18:17         ` Christoph Hellwig
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2004-02-04 18:17 UTC (permalink / raw)
  To: Steve Lord
  Cc: Christoph Hellwig, Miquel van Smoorenburg, Andrew Morton,
	Nathan Scott, linux-kernel, linux-xfs

Here's a new diff, that should get rid of all the warnings except for
O_DIRECT.  I've also addressed the comments from Dave and Steve.

Index: fs/xfs/linux-2.6/xfs_iops.c
===================================================================
RCS file: /cvs/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_iops.c,v
retrieving revision 1.212
diff -u -p -r1.212 xfs_iops.c
--- fs/xfs/linux-2.6/xfs_iops.c	12 Dec 2003 04:17:52 -0000	1.212
+++ fs/xfs/linux-2.6/xfs_iops.c	4 Feb 2004 17:56:34 -0000
@@ -82,9 +82,14 @@ validate_fields(
 
 	va.va_mask = XFS_AT_NLINK|XFS_AT_SIZE|XFS_AT_NBLOCKS;
 	VOP_GETATTR(vp, &va, ATTR_LAZY, NULL, error);
-	ip->i_nlink = va.va_nlink;
-	ip->i_size = va.va_size;
-	ip->i_blocks = va.va_nblocks;
+	if (likely(!error)) {
+		ip->i_nlink = va.va_nlink;
+		ip->i_blocks = va.va_nblocks;
+
+		/* we're under i_sem so i_size can't change under us */
+		if (i_size_read(ip) != va.va_size)
+			i_size_write(ip, va.va_size);
+	}
 }
 
 /*
@@ -536,6 +541,7 @@ linvfs_setattr(
 	if (error)
 		return(-error);	/* Positive error up from XFS */
 	if (ia_valid & ATTR_SIZE) {
+		i_size_write(inode, vattr.va_size);
 		error = vmtruncate(inode, attr->ia_size);
 	}
 
Index: fs/xfs/linux-2.6/xfs_super.c
===================================================================
RCS file: /cvs/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_super.c,v
retrieving revision 1.294
diff -u -p -r1.294 xfs_super.c
--- fs/xfs/linux-2.6/xfs_super.c	21 Jan 2004 16:46:06 -0000	1.294
+++ fs/xfs/linux-2.6/xfs_super.c	4 Feb 2004 17:56:55 -0000
@@ -178,7 +178,7 @@ xfs_revalidate_inode(
 	}
 	inode->i_blksize = PAGE_CACHE_SIZE;
 	inode->i_generation = ip->i_d.di_gen;
-	i_size_write(inode, ip->i_d.di_size);
+	inode->i_size = ip->i_d.di_size;
 	inode->i_blocks =
 		XFS_FSB_TO_BB(mp, ip->i_d.di_nblocks + ip->i_delayed_blks);
 	inode->i_atime.tv_sec	= ip->i_d.di_atime.t_sec;
Index: fs/xfs/linux-2.6/xfs_vnode.c
===================================================================
RCS file: /cvs/linux-2.6-xfs/fs/xfs/linux-2.6/xfs_vnode.c,v
retrieving revision 1.120
diff -u -p -r1.120 xfs_vnode.c
--- fs/xfs/linux-2.6/xfs_vnode.c	20 Oct 2003 02:08:58 -0000	1.120
+++ fs/xfs/linux-2.6/xfs_vnode.c	4 Feb 2004 17:56:55 -0000
@@ -213,7 +213,6 @@ vn_revalidate(
 		inode->i_mtime	    = va.va_mtime;
 		inode->i_ctime	    = va.va_ctime;
 		inode->i_atime	    = va.va_atime;
-		i_size_write(inode, va.va_size);
 		if (va.va_xflags & XFS_XFLAG_IMMUTABLE)
 			inode->i_flags |= S_IMMUTABLE;
 		else

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2004-02-04 18:17 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-01-28 17:17 2.6.2-rc2 nfsd+xfs spins in i_size_read() Miquel van Smoorenburg
2004-01-29  6:25 ` Andrew Morton
2004-01-29 23:20   ` Miquel van Smoorenburg
2004-02-04  0:03     ` Christoph Hellwig
2004-02-03 14:13       ` Steve Lord
2004-02-04 15:16         ` Christoph Hellwig
2004-02-04 18:17         ` Christoph Hellwig
2004-02-04  0:06       ` David Weinehall
2004-02-04  0:07         ` Christoph Hellwig
2004-01-30 16:01   ` Miquel van Smoorenburg
2004-01-30 20:21   ` Miquel van Smoorenburg
2004-01-30 22:13     ` Miquel van Smoorenburg
2004-01-30 22:34       ` Andrew Morton
2004-01-30 22:53         ` Christoph Hellwig
2004-01-30 23:13           ` Andrew Morton
2004-01-31  1:25             ` Miquel van Smoorenburg
2004-01-31  1:38               ` Andrew Morton
2004-01-31 11:46                 ` Miquel van Smoorenburg
2004-01-31 15:59                   ` Steve Lord
2004-02-01 16:15                     ` Christoph Hellwig
2004-01-31 16:41                       ` Steve Lord
2004-01-31 17:07               ` Christoph Hellwig
2004-02-01  1:46               ` Anton Blanchard
2004-01-30 23:07         ` Nathan Scott
2004-01-29  6:30 ` Nathan Scott

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox