Re: [PATCH] btrfs file write debugging patch

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH] btrfs file write debugging patch
@ 2011-03-01 16:36 Xin Zhong
  2011-03-01 21:09 ` Mitch Harder
  0 siblings, 1 reply; 40+ messages in thread
From: Xin Zhong @ 2011-03-01 16:36 UTC (permalink / raw)
  To: linux-btrfs

Hi, Mitch
I think you can config ftrace to just trace function calls of btrfs.ko which will save a lot of trace buffer space. See below command:
#echo ':mod:btrfs' > /sys/kernel/debug/tracing/set_ftrace_filterAnd please send out the full ftrace log again.

Another helpful information might be the strace log of the wmldbcreate process. It will show us the io pattern of this command.
Thanks a lot for your help!

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-03-01 16:36 [PATCH] btrfs file write debugging patch Xin Zhong
@ 2011-03-01 21:09 ` Mitch Harder
  2011-03-02 10:58   ` Zhong, Xin
  0 siblings, 1 reply; 40+ messages in thread
From: Mitch Harder @ 2011-03-01 21:09 UTC (permalink / raw)
  To: Xin Zhong; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1720 bytes --]

2011/3/1 Xin Zhong <thierryzhong@hotmail.com>:
>
> Hi, Mitch
> I think you can config ftrace to just trace function calls of btrfs.ko which will save a lot of trace buffer space. See below command:
> #echo ':mod:btrfs' > /sys/kernel/debug/tracing/set_ftrace_filterAnd please send out the full ftrace log again.
>
> Another helpful information might be the strace log of the wmldbcreate process. It will show us the io pattern of this command.
> Thanks a lot for your help!
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

I manually ran an strace around the build command (wmldbcreate) that
is causing my problem, and I am attaching the strace for that.

Please note that wmldbcreate does not seem to care when an error is
returned, and continues on.  So the error is occurring somewhat
silently in the middle, and isn't the last item.  The error is
probably associated with one of the 12288 byte writes.

I have re-run an ftrace following the conditions above, and have
hosted that file (~1.1MB compressed) on my local server at:

http://dontpanic.dyndns.org/trace-openmotif-btrfs-v15.gz

Please note I am still using some debugging modifications of my own to file.c.

They server the purpose of:
(1) Avoiding an infinite loop by identifying when the problem is
occuring, and exiting with error after 256 loops.
(2) Stopping the trace after exiting to keep from flooding the ftrace buffer.
(3) Provide debugging comments (all prefaced with "TPK:" in the trace).

Let me know if you want me to change any of the conditions.

[-- Attachment #2: wmldbcreate-strace.gz --]
[-- Type: application/x-gzip, Size: 4406 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH] btrfs file write debugging patch
  2011-03-01 21:09 ` Mitch Harder
@ 2011-03-02 10:58   ` Zhong, Xin
  2011-03-02 14:00     ` Xin Zhong
  2011-03-04  1:51     ` Chris Mason
  0 siblings, 2 replies; 40+ messages in thread
From: Zhong, Xin @ 2011-03-02 10:58 UTC (permalink / raw)
  To: Mitch Harder, Xin Zhong; +Cc: linux-btrfs@vger.kernel.org

I downloaded openmotif and run the command as Mitch mentioned and was a=
ble to recreate the problem locally. And I managed to simplify the comm=
and into a very simple program which can capture the problem easily. Se=
e below code:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
static char a[4096*3];
int main()
{
    int fd =3D open("out", O_WRONLY|O_CREAT|O_TRUNC, 0666);
    write(fd,a+1, 4096*2);
    exit(0);
}

It seems that if we give an unaligned address to btrfs write and the bu=
ffer reside on more than 2 pages. It will trigger this bug.
If we give an aligned address to btrfs write, it works well no matter h=
ow many pages are given.=20

I use ftrace to observe it. It seems iov_iter_fault_in_readable do not =
trigger pagefault handling when the address is not aligned. I do not qu=
ite understand the reason behind it. But the solution should be to proc=
ess the page one by one. And that's also what generic file write routin=
e does.=20

Any suggestion are welcomed. Thanks!

-----Original Message-----
=46rom: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-owner@vge=
r.kernel.org] On Behalf Of Mitch Harder
Sent: Wednesday, March 02, 2011 5:09 AM
To: Xin Zhong
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] btrfs file write debugging patch

2011/3/1 Xin Zhong <thierryzhong@hotmail.com>:
>
> Hi, Mitch
> I think you can config ftrace to just trace function calls of btrfs.k=
o which will save a lot of trace buffer space. See below command:
> #echo ':mod:btrfs' > /sys/kernel/debug/tracing/set_ftrace_filterAnd p=
lease send out the full ftrace log again.
>
> Another helpful information might be the strace log of the wmldbcreat=
e process. It will show us the io pattern of this command.
> Thanks a lot for your help!
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs=
"=20
> in the body of a message to majordomo@vger.kernel.org More majordomo=20
> info at =A0http://vger.kernel.org/majordomo-info.html
>

I manually ran an strace around the build command (wmldbcreate) that is=
 causing my problem, and I am attaching the strace for that.

Please note that wmldbcreate does not seem to care when an error is ret=
urned, and continues on.  So the error is occurring somewhat silently i=
n the middle, and isn't the last item.  The error is probably associate=
d with one of the 12288 byte writes.

I have re-run an ftrace following the conditions above, and have hosted=
 that file (~1.1MB compressed) on my local server at:

http://dontpanic.dyndns.org/trace-openmotif-btrfs-v15.gz

Please note I am still using some debugging modifications of my own to =
file.c.

They server the purpose of:
(1) Avoiding an infinite loop by identifying when the problem is occuri=
ng, and exiting with error after 256 loops.
(2) Stopping the trace after exiting to keep from flooding the ftrace b=
uffer.
(3) Provide debugging comments (all prefaced with "TPK:" in the trace).

Let me know if you want me to change any of the conditions.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH] btrfs file write debugging patch
  2011-03-02 10:58   ` Zhong, Xin
@ 2011-03-02 14:00     ` Xin Zhong
  2011-03-04  1:51     ` Chris Mason
  1 sibling, 0 replies; 40+ messages in thread
From: Xin Zhong @ 2011-03-02 14:00 UTC (permalink / raw)
  To: xin.zhong, Mitch Harder; +Cc: linux-btrfs


Sorry, I forgot to mention that you need to undo below commit in btrfs-unstable to recreate the problem:
Btrfs: fix fiemap bugs with delalloc (+224/-42)
Otherwise, it will run into enospc error. I am not sure if it's the same problem.

----------------------------------------
> From: xin.zhong@intel.com
> To: mitch.harder@sabayonlinux.org; thierryzhong@hotmail.com
> CC: linux-btrfs@vger.kernel.org
> Date: Wed, 2 Mar 2011 18:58:49 +0800
> Subject: RE: [PATCH] btrfs file write debugging patch
>
> I downloaded openmotif and run the command as Mitch mentioned and was able to recreate the problem locally. And I managed to simplify the command into a very simple program which can capture the problem easily. See below code:
>
> #include 
> #include 
> #include 
> static char a[4096*3];
> int main()
> {
> int fd = open("out", O_WRONLY|O_CREAT|O_TRUNC, 0666);
> write(fd,a+1, 4096*2);
> exit(0);
> }
>
> It seems that if we give an unaligned address to btrfs write and the buffer reside on more than 2 pages. It will trigger this bug.
> If we give an aligned address to btrfs write, it works well no matter how many pages are given.
>
> I use ftrace to observe it. It seems iov_iter_fault_in_readable do not trigger pagefault handling when the address is not aligned. I do not quite understand the reason behind it. But the solution should be to process the page one by one. And that's also what generic file write routine does.
>
> Any suggestion are welcomed. Thanks!
>
> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Mitch Harder
> Sent: Wednesday, March 02, 2011 5:09 AM
> To: Xin Zhong
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH] btrfs file write debugging patch
>
> 2011/3/1 Xin Zhong :
> >
> > Hi, Mitch
> > I think you can config ftrace to just trace function calls of btrfs.ko which will save a lot of trace buffer space. See below command:
> > #echo ':mod:btrfs' > /sys/kernel/debug/tracing/set_ftrace_filterAnd please send out the full ftrace log again.
> >
> > Another helpful information might be the strace log of the wmldbcreate process. It will show us the io pattern of this command.
> > Thanks a lot for your help!
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
> > in the body of a message to majordomo@vger.kernel.org More majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> >
>
> I manually ran an strace around the build command (wmldbcreate) that is causing my problem, and I am attaching the strace for that.
>
> Please note that wmldbcreate does not seem to care when an error is returned, and continues on. So the error is occurring somewhat silently in the middle, and isn't the last item. The error is probably associated with one of the 12288 byte writes.
>
> I have re-run an ftrace following the conditions above, and have hosted that file (~1.1MB compressed) on my local server at:
>
> http://dontpanic.dyndns.org/trace-openmotif-btrfs-v15.gz
>
> Please note I am still using some debugging modifications of my own to file.c.
>
> They server the purpose of:
> (1) Avoiding an infinite loop by identifying when the problem is occuring, and exiting with error after 256 loops.
> (2) Stopping the trace after exiting to keep from flooding the ftrace buffer.
> (3) Provide debugging comments (all prefaced with "TPK:" in the trace).
>
> Let me know if you want me to change any of the conditions.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
 		 	   		  

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH] btrfs file write debugging patch
  2011-03-02 10:58   ` Zhong, Xin
  2011-03-02 14:00     ` Xin Zhong
@ 2011-03-04  1:51     ` Chris Mason
  2011-03-04  2:32       ` Josef Bacik
  2011-03-04 12:19       ` Chris Mason
  1 sibling, 2 replies; 40+ messages in thread
From: Chris Mason @ 2011-03-04  1:51 UTC (permalink / raw)
  To: Zhong, Xin; +Cc: Mitch Harder, Xin Zhong, linux-btrfs@vger.kernel.org

Excerpts from Zhong, Xin's message of 2011-03-02 05:58:49 -0500:
> I downloaded openmotif and run the command as Mitch mentioned and was able to recreate the problem locally. And I managed to simplify the command into a very simple program which can capture the problem easily. See below code:
> 
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> static char a[4096*3];
> int main()
> {
>     int fd = open("out", O_WRONLY|O_CREAT|O_TRUNC, 0666);
>     write(fd,a+1, 4096*2);
>     exit(0);
> }
> 
> It seems that if we give an unaligned address to btrfs write and the buffer reside on more than 2 pages. It will trigger this bug.
> If we give an aligned address to btrfs write, it works well no matter how many pages are given. 
> 
> I use ftrace to observe it. It seems iov_iter_fault_in_readable do not trigger pagefault handling when the address is not aligned. I do not quite understand the reason behind it. But the solution should be to process the page one by one. And that's also what generic file write routine does. 
> 
> Any suggestion are welcomed. Thanks!

Great job guys.  I'm using this on top of my debugging patch.  It passes
the unaligned test but I'll give it a real run tonight and look for
other problems.

(This is almost entirely untested, please don't use it quite yet)

-chris

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 89a6a26..6a44add 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1039,6 +1038,14 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 
 		copied = btrfs_copy_from_user(pos, num_pages,
 					   write_bytes, pages, &i);
+
+		/*
+		 * if we have trouble faulting in the pages, fall
+		 * back to one page at a time
+		 */
+		if (copied < write_bytes)
+			nrptrs = 1;
+
 		if (copied == 0)
 			dirty_pages = 0;
 		else

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-03-04  1:51     ` Chris Mason
@ 2011-03-04  2:32       ` Josef Bacik
  2011-03-04  2:42         ` Zhong, Xin
  2011-03-04 12:19       ` Chris Mason
  1 sibling, 1 reply; 40+ messages in thread
From: Josef Bacik @ 2011-03-04  2:32 UTC (permalink / raw)
  To: Chris Mason
  Cc: Zhong, Xin, Mitch Harder, Xin Zhong, linux-btrfs@vger.kernel.org

On Thu, Mar 03, 2011 at 08:51:55PM -0500, Chris Mason wrote:
> Excerpts from Zhong, Xin's message of 2011-03-02 05:58:49 -0500:
> > I downloaded openmotif and run the command as Mitch mentioned and was able to recreate the problem locally. And I managed to simplify the command into a very simple program which can capture the problem easily. See below code:
> > 
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > static char a[4096*3];
> > int main()
> > {
> >     int fd = open("out", O_WRONLY|O_CREAT|O_TRUNC, 0666);
> >     write(fd,a+1, 4096*2);
> >     exit(0);
> > }
> > 
> > It seems that if we give an unaligned address to btrfs write and the buffer reside on more than 2 pages. It will trigger this bug.
> > If we give an aligned address to btrfs write, it works well no matter how many pages are given. 
> > 
> > I use ftrace to observe it. It seems iov_iter_fault_in_readable do not trigger pagefault handling when the address is not aligned. I do not quite understand the reason behind it. But the solution should be to process the page one by one. And that's also what generic file write routine does. 
> > 
> > Any suggestion are welcomed. Thanks!
> 
> Great job guys.  I'm using this on top of my debugging patch.  It passes
> the unaligned test but I'll give it a real run tonight and look for
> other problems.
> 
> (This is almost entirely untested, please don't use it quite yet)
> 
> -chris
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 89a6a26..6a44add 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1039,6 +1038,14 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
>  
>  		copied = btrfs_copy_from_user(pos, num_pages,
>  					   write_bytes, pages, &i);
> +
> +		/*
> +		 * if we have trouble faulting in the pages, fall
> +		 * back to one page at a time
> +		 */
> +		if (copied < write_bytes)
> +			nrptrs = 1;
> +
>  		if (copied == 0)
>  			dirty_pages = 0;
>  		else

Btw this situation is taken care of in my write path rewrite patch, if copied ==
0 we switch to one segment at a time.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH] btrfs file write debugging patch
  2011-03-04  2:32       ` Josef Bacik
@ 2011-03-04  2:42         ` Zhong, Xin
  2011-03-04  2:41           ` Josef Bacik
  0 siblings, 1 reply; 40+ messages in thread
From: Zhong, Xin @ 2011-03-04  2:42 UTC (permalink / raw)
  To: Josef Bacik, Chris Mason
  Cc: Mitch Harder, Xin Zhong, linux-btrfs@vger.kernel.org

Where can I find your patch? Thanks!

-----Original Message-----
From: Josef Bacik [mailto:josef@redhat.com] 
Sent: Friday, March 04, 2011 10:32 AM
To: Chris Mason
Cc: Zhong, Xin; Mitch Harder; Xin Zhong; linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] btrfs file write debugging patch

On Thu, Mar 03, 2011 at 08:51:55PM -0500, Chris Mason wrote:
> Excerpts from Zhong, Xin's message of 2011-03-02 05:58:49 -0500:
> > I downloaded openmotif and run the command as Mitch mentioned and was able to recreate the problem locally. And I managed to simplify the command into a very simple program which can capture the problem easily. See below code:
> > 
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > static char a[4096*3];
> > int main()
> > {
> >     int fd = open("out", O_WRONLY|O_CREAT|O_TRUNC, 0666);
> >     write(fd,a+1, 4096*2);
> >     exit(0);
> > }
> > 
> > It seems that if we give an unaligned address to btrfs write and the buffer reside on more than 2 pages. It will trigger this bug.
> > If we give an aligned address to btrfs write, it works well no matter how many pages are given. 
> > 
> > I use ftrace to observe it. It seems iov_iter_fault_in_readable do not trigger pagefault handling when the address is not aligned. I do not quite understand the reason behind it. But the solution should be to process the page one by one. And that's also what generic file write routine does. 
> > 
> > Any suggestion are welcomed. Thanks!
> 
> Great job guys.  I'm using this on top of my debugging patch.  It passes
> the unaligned test but I'll give it a real run tonight and look for
> other problems.
> 
> (This is almost entirely untested, please don't use it quite yet)
> 
> -chris
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 89a6a26..6a44add 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1039,6 +1038,14 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
>  
>  		copied = btrfs_copy_from_user(pos, num_pages,
>  					   write_bytes, pages, &i);
> +
> +		/*
> +		 * if we have trouble faulting in the pages, fall
> +		 * back to one page at a time
> +		 */
> +		if (copied < write_bytes)
> +			nrptrs = 1;
> +
>  		if (copied == 0)
>  			dirty_pages = 0;
>  		else

Btw this situation is taken care of in my write path rewrite patch, if copied ==
0 we switch to one segment at a time.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-03-04  2:42         ` Zhong, Xin
@ 2011-03-04  2:41           ` Josef Bacik
  2011-03-04  8:41             ` Zhong, Xin
  2011-03-05 16:56             ` Mitch Harder
  0 siblings, 2 replies; 40+ messages in thread
From: Josef Bacik @ 2011-03-04  2:41 UTC (permalink / raw)
  To: Zhong, Xin
  Cc: Josef Bacik, Chris Mason, Mitch Harder, Xin Zhong,
	linux-btrfs@vger.kernel.org

On Fri, Mar 04, 2011 at 10:42:46AM +0800, Zhong, Xin wrote:
> Where can I find your patch? Thanks!
>

It's in my btrfs-work git tree, it's based on the latest git pull from linus so
you can just pull it onto a linus tree and you should be good to go.  The
specific patch is

Btrfs: simplify our write path

Thanks,

Josef

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH] btrfs file write debugging patch
  2011-03-04  2:41           ` Josef Bacik
@ 2011-03-04  8:41             ` Zhong, Xin
  2011-03-05 16:56             ` Mitch Harder
  1 sibling, 0 replies; 40+ messages in thread
From: Zhong, Xin @ 2011-03-04  8:41 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Chris Mason, Mitch Harder, Xin Zhong, linux-btrfs@vger.kernel.org

Looks good to me. Thanks! 

Another mysterious thing is that this problem can only be recreated on x86 32bit system. I can not recreate it on x86_64 system using my test case. Any one have any idea about it?

-----Original Message-----
From: Josef Bacik [mailto:josef@redhat.com] 
Sent: Friday, March 04, 2011 10:41 AM
To: Zhong, Xin
Cc: Josef Bacik; Chris Mason; Mitch Harder; Xin Zhong; linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] btrfs file write debugging patch

On Fri, Mar 04, 2011 at 10:42:46AM +0800, Zhong, Xin wrote:
> Where can I find your patch? Thanks!
>

It's in my btrfs-work git tree, it's based on the latest git pull from linus so
you can just pull it onto a linus tree and you should be good to go.  The
specific patch is

Btrfs: simplify our write path

Thanks,

Josef

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-03-04  2:41           ` Josef Bacik
  2011-03-04  8:41             ` Zhong, Xin
@ 2011-03-05 16:56             ` Mitch Harder
  2011-03-05 17:28               ` Xin Zhong
  1 sibling, 1 reply; 40+ messages in thread
From: Mitch Harder @ 2011-03-05 16:56 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Zhong, Xin, Chris Mason, Xin Zhong, linux-btrfs@vger.kernel.org

On Thu, Mar 3, 2011 at 8:41 PM, Josef Bacik <josef@redhat.com> wrote:
> On Fri, Mar 04, 2011 at 10:42:46AM +0800, Zhong, Xin wrote:
>> Where can I find your patch? Thanks!
>>
>
> It's in my btrfs-work git tree, it's based on the latest git pull fro=
m linus so
> you can just pull it onto a linus tree and you should be good to go. =
=A0The
> specific patch is
>
> Btrfs: simplify our write path
>
> Thanks,
>
> Josef
>

Josef:

I've been testing the kernel from you git tree (as of commit 5555f192
Btrfs: add a comment explaining what btrfs_cont_expand does).

I am still running into issues when running the portion of Openmotif
that has been generating the problem with page faults when given an
unaligned address.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH] btrfs file write debugging patch
  2011-03-05 16:56             ` Mitch Harder
@ 2011-03-05 17:28               ` Xin Zhong
  0 siblings, 0 replies; 40+ messages in thread
From: Xin Zhong @ 2011-03-05 17:28 UTC (permalink / raw)
  To: Mitch Harder, josef; +Cc: xin.zhong, chris.mason, linux-btrfs


I think Josef's patch only address the one-by-one page processing issue. 
But do not address the issue that dirty_pages should be set to 0 when copied is 0. 

----------------------------------------
> Date: Sat, 5 Mar 2011 10:56:52 -0600
> Subject: Re: [PATCH] btrfs file write debugging patch
> From: mitch.harder@sabayonlinux.org
> To: josef@redhat.com
> CC: xin.zhong@intel.com; chris.mason@oracle.com; thierryzhong@hotmail.com; linux-btrfs@vger.kernel.org
>
> On Thu, Mar 3, 2011 at 8:41 PM, Josef Bacik  wrote:
> > On Fri, Mar 04, 2011 at 10:42:46AM +0800, Zhong, Xin wrote:
> >> Where can I find your patch? Thanks!
> >>
> >
> > It's in my btrfs-work git tree, it's based on the latest git pull from linus so
> > you can just pull it onto a linus tree and you should be good to go.  The
> > specific patch is
> >
> > Btrfs: simplify our write path
> >
> > Thanks,
> >
> > Josef
> >
>
> Josef:
>
> I've been testing the kernel from you git tree (as of commit 5555f192
> Btrfs: add a comment explaining what btrfs_cont_expand does).
>
> I am still running into issues when running the portion of Openmotif
> that has been generating the problem with page faults when given an
> unaligned address.
 		 	   		  

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH] btrfs file write debugging patch
  2011-03-04  1:51     ` Chris Mason
  2011-03-04  2:32       ` Josef Bacik
@ 2011-03-04 12:19       ` Chris Mason
  2011-03-04 14:25         ` Xin Zhong
  1 sibling, 1 reply; 40+ messages in thread
From: Chris Mason @ 2011-03-04 12:19 UTC (permalink / raw)
  To: Chris Mason
  Cc: Zhong, Xin, Mitch Harder, Xin Zhong, linux-btrfs@vger.kernel.org

Excerpts from Chris Mason's message of 2011-03-03 20:51:55 -0500:
> Excerpts from Zhong, Xin's message of 2011-03-02 05:58:49 -0500:
> > It seems that if we give an unaligned address to btrfs write and the buffer reside on more than 2 pages. It will trigger this bug.
> > If we give an aligned address to btrfs write, it works well no matter how many pages are given. 
> > 
> > I use ftrace to observe it. It seems iov_iter_fault_in_readable do not trigger pagefault handling when the address is not aligned. I do not quite understand the reason behind it. But the solution should be to process the page one by one. And that's also what generic file write routine does. 
> > 
> > Any suggestion are welcomed. Thanks!
> 
> Great job guys.  I'm using this on top of my debugging patch.  It passes
> the unaligned test but I'll give it a real run tonight and look for
> other problems.
> 
> (This is almost entirely untested, please don't use it quite yet)

> 
> -chris
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 89a6a26..6a44add 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1039,6 +1038,14 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
>  
>          copied = btrfs_copy_from_user(pos, num_pages,
>                         write_bytes, pages, &i);
> +
> +        /*
> +         * if we have trouble faulting in the pages, fall
> +         * back to one page at a time
> +         */
> +        if (copied < write_bytes)
> +            nrptrs = 1;
> +
>          if (copied == 0)
>              dirty_pages = 0;
>          else

Ok, this is working well for me.  Anyone see any problems with it?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH] btrfs file write debugging patch
  2011-03-04 12:19       ` Chris Mason
@ 2011-03-04 14:25         ` Xin Zhong
  2011-03-04 15:33           ` Mitch Harder
  0 siblings, 1 reply; 40+ messages in thread
From: Xin Zhong @ 2011-03-04 14:25 UTC (permalink / raw)
  To: chris.mason; +Cc: xin.zhong, Mitch Harder, linux-btrfs


It works well for me too. 

----------------------------------------
> From: chris.mason@oracle.com
> To: chris.mason@oracle.com
> CC: xin.zhong@intel.com; mitch.harder@sabayonlinux.org; thierryzhong@hotmail.com; linux-btrfs@vger.kernel.org
> Subject: RE: [PATCH] btrfs file write debugging patch
> Date: Fri, 4 Mar 2011 07:19:39 -0500
>
> Excerpts from Chris Mason's message of 2011-03-03 20:51:55 -0500:
> > Excerpts from Zhong, Xin's message of 2011-03-02 05:58:49 -0500:
> > > It seems that if we give an unaligned address to btrfs write and the buffer reside on more than 2 pages. It will trigger this bug.
> > > If we give an aligned address to btrfs write, it works well no matter how many pages are given.
> > >
> > > I use ftrace to observe it. It seems iov_iter_fault_in_readable do not trigger pagefault handling when the address is not aligned. I do not quite understand the reason behind it. But the solution should be to process the page one by one. And that's also what generic file write routine does.
> > >
> > > Any suggestion are welcomed. Thanks!
> >
> > Great job guys. I'm using this on top of my debugging patch. It passes
> > the unaligned test but I'll give it a real run tonight and look for
> > other problems.
> >
> > (This is almost entirely untested, please don't use it quite yet)
>
> >
> > -chris
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 89a6a26..6a44add 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1039,6 +1038,14 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
> >
> > copied = btrfs_copy_from_user(pos, num_pages,
> > write_bytes, pages, &i);
> > +
> > + /*
> > + * if we have trouble faulting in the pages, fall
> > + * back to one page at a time
> > + */
> > + if (copied < write_bytes)
> > + nrptrs = 1;
> > +
> > if (copied == 0)
> > dirty_pages = 0;
> > else
>
> Ok, this is working well for me. Anyone see any problems with it?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
 		 	   		  

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-03-04 14:25         ` Xin Zhong
@ 2011-03-04 15:33           ` Mitch Harder
  2011-03-04 17:21             ` Mitch Harder
  0 siblings, 1 reply; 40+ messages in thread
From: Mitch Harder @ 2011-03-04 15:33 UTC (permalink / raw)
  To: Xin Zhong; +Cc: chris.mason, xin.zhong, linux-btrfs

2011/3/4 Xin Zhong <thierryzhong@hotmail.com>:
>
> It works well for me too.
>
> ----------------------------------------
>> From: chris.mason@oracle.com
>> To: chris.mason@oracle.com
>> CC: xin.zhong@intel.com; mitch.harder@sabayonlinux.org; thierryzhong@hotmail.com; linux-btrfs@vger.kernel.org
>> Subject: RE: [PATCH] btrfs file write debugging patch
>> Date: Fri, 4 Mar 2011 07:19:39 -0500
>>
>> Excerpts from Chris Mason's message of 2011-03-03 20:51:55 -0500:
>> > Excerpts from Zhong, Xin's message of 2011-03-02 05:58:49 -0500:
>> > > It seems that if we give an unaligned address to btrfs write and the buffer reside on more than 2 pages. It will trigger this bug.
>> > > If we give an aligned address to btrfs write, it works well no matter how many pages are given.
>> > >
>> > > I use ftrace to observe it. It seems iov_iter_fault_in_readable do not trigger pagefault handling when the address is not aligned. I do not quite understand the reason behind it. But the solution should be to process the page one by one. And that's also what generic file write routine does.
>> > >
>> > > Any suggestion are welcomed. Thanks!
>> >
>> > Great job guys. I'm using this on top of my debugging patch. It passes
>> > the unaligned test but I'll give it a real run tonight and look for
>> > other problems.
>> >
>> > (This is almost entirely untested, please don't use it quite yet)
>>
>> >
>> > -chris
>> >
>> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> > index 89a6a26..6a44add 100644
>> > --- a/fs/btrfs/file.c
>> > +++ b/fs/btrfs/file.c
>> > @@ -1039,6 +1038,14 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
>> >
>> > copied = btrfs_copy_from_user(pos, num_pages,
>> > write_bytes, pages, &i);
>> > +
>> > + /*
>> > + * if we have trouble faulting in the pages, fall
>> > + * back to one page at a time
>> > + */
>> > + if (copied < write_bytes)
>> > + nrptrs = 1;
>> > +
>> > if (copied == 0)
>> > dirty_pages = 0;
>> > else
>>
>> Ok, this is working well for me. Anyone see any problems with it?
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

I've applied this patch on top of the debugging patch at the head of
the thread, and I'm having trouble building gcc now.

When building gcc-4.4.5, I get errors like the following:

Comparing stages 2 and 3
Bootstrap comparison failure!
./cp/call.o differs
./cp/decl.o differs
./cp/pt.o differs
./cp/class.o differs
./cp/decl2.o differs
<....snip.....>
./matrix-reorg.o differs
./tree-inline.o differs
./gcc.o differs
./gcc-options.o differs
make[2]: *** [compare] Error 1
make[1]: *** [stage3-bubble] Error 2
make: *** [bootstrap-lean] Error 2
emake failed

I've went back and rebuilt my kernel without these two debugging
patches, and gcc-4.4.5 builds without error on that kernel.

I haven't yet tested building gcc-4.4.5 with just the debugging patch
at the head of the thread, so I'll test that, and report back.

But I was wondering if anybody else can replicate this issue.

BTW, I've been doing most of my testing on an x86 system.  My x86_64
systems haven't had as much trouble, but I haven't been robustingly
checking my x86_64 systems for these issues.

I noticed that page fault handling is different by architecture.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-03-04 15:33           ` Mitch Harder
@ 2011-03-04 17:21             ` Mitch Harder
  2011-03-05  1:00               ` Xin Zhong
  0 siblings, 1 reply; 40+ messages in thread
From: Mitch Harder @ 2011-03-04 17:21 UTC (permalink / raw)
  To: Xin Zhong; +Cc: chris.mason, xin.zhong, linux-btrfs

On Fri, Mar 4, 2011 at 9:33 AM, Mitch Harder
<mitch.harder@sabayonlinux.org> wrote:
> 2011/3/4 Xin Zhong <thierryzhong@hotmail.com>:
>>
>> It works well for me too.
>>
>> ----------------------------------------
>>> From: chris.mason@oracle.com
>>> To: chris.mason@oracle.com
>>> CC: xin.zhong@intel.com; mitch.harder@sabayonlinux.org; thierryzhon=
g@hotmail.com; linux-btrfs@vger.kernel.org
>>> Subject: RE: [PATCH] btrfs file write debugging patch
>>> Date: Fri, 4 Mar 2011 07:19:39 -0500
>>>
>>> Excerpts from Chris Mason's message of 2011-03-03 20:51:55 -0500:
>>> > Excerpts from Zhong, Xin's message of 2011-03-02 05:58:49 -0500:
>>> > > It seems that if we give an unaligned address to btrfs write an=
d the buffer reside on more than 2 pages. It will trigger this bug.
>>> > > If we give an aligned address to btrfs write, it works well no =
matter how many pages are given.
>>> > >
>>> > > I use ftrace to observe it. It seems iov_iter_fault_in_readable=
 do not trigger pagefault handling when the address is not aligned. I d=
o not quite understand the reason behind it. But the solution should be=
 to process the page one by one. And that's also what generic file writ=
e routine does.
>>> > >
>>> > > Any suggestion are welcomed. Thanks!
>>> >
>>> > Great job guys. I'm using this on top of my debugging patch. It p=
asses
>>> > the unaligned test but I'll give it a real run tonight and look f=
or
>>> > other problems.
>>> >
>>> > (This is almost entirely untested, please don't use it quite yet)
>>>
>>> >
>>> > -chris
>>> >
>>> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>> > index 89a6a26..6a44add 100644
>>> > --- a/fs/btrfs/file.c
>>> > +++ b/fs/btrfs/file.c
>>> > @@ -1039,6 +1038,14 @@ static ssize_t btrfs_file_aio_write(struct=
 kiocb *iocb,
>>> >
>>> > copied =3D btrfs_copy_from_user(pos, num_pages,
>>> > write_bytes, pages, &i);
>>> > +
>>> > + /*
>>> > + * if we have trouble faulting in the pages, fall
>>> > + * back to one page at a time
>>> > + */
>>> > + if (copied < write_bytes)
>>> > + nrptrs =3D 1;
>>> > +
>>> > if (copied =3D=3D 0)
>>> > dirty_pages =3D 0;
>>> > else
>>>
>>> Ok, this is working well for me. Anyone see any problems with it?
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btr=
fs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>
> I've applied this patch on top of the debugging patch at the head of
> the thread, and I'm having trouble building gcc now.
>
> When building gcc-4.4.5, I get errors like the following:
>
> Comparing stages 2 and 3
> Bootstrap comparison failure!
> ./cp/call.o differs
> ./cp/decl.o differs
> ./cp/pt.o differs
> ./cp/class.o differs
> ./cp/decl2.o differs
> <....snip.....>
> ./matrix-reorg.o differs
> ./tree-inline.o differs
> ./gcc.o differs
> ./gcc-options.o differs
> make[2]: *** [compare] Error 1
> make[1]: *** [stage3-bubble] Error 2
> make: *** [bootstrap-lean] Error 2
> emake failed
>
> I've went back and rebuilt my kernel without these two debugging
> patches, and gcc-4.4.5 builds without error on that kernel.
>
> I haven't yet tested building gcc-4.4.5 with just the debugging patch
> at the head of the thread, so I'll test that, and report back.
>
> But I was wondering if anybody else can replicate this issue.
>
> BTW, I've been doing most of my testing on an x86 system. =A0My x86_6=
4
> systems haven't had as much trouble, but I haven't been robustingly
> checking my x86_64 systems for these issues.
>
> I noticed that page fault handling is different by architecture.
>

Some followup...

I'm encountering this issue with "Bootstrap comparison failure!" in a
gcc-4.4.5 build when only the patch at the head of the thread is
applied (leaving the recent patch to limit pages to one-by-one on page
fault out).

I just hadn't run across this issue until I started playing with
patches to limit the pages to one-by-one on page fault errors.

So it may not be associated with the last patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH] btrfs file write debugging patch
  2011-03-04 17:21             ` Mitch Harder
@ 2011-03-05  1:00               ` Xin Zhong
  2011-03-05 13:14                 ` Mitch Harder
  0 siblings, 1 reply; 40+ messages in thread
From: Xin Zhong @ 2011-03-05  1:00 UTC (permalink / raw)
  To: Mitch Harder; +Cc: chris.mason, xin.zhong, linux-btrfs


So it works well now with the two patches from Chris on your system. Am I right?

----------------------------------------
> Date: Fri, 4 Mar 2011 11:21:36 -0600
> Subject: Re: [PATCH] btrfs file write debugging patch
> From: mitch.harder@sabayonlinux.org
> To: thierryzhong@hotmail.com
> CC: chris.mason@oracle.com; xin.zhong@intel.com; linux-btrfs@vger.kernel.org
>
> On Fri, Mar 4, 2011 at 9:33 AM, Mitch Harder
>  wrote:
> > 2011/3/4 Xin Zhong :
> >>
> >> It works well for me too.
> >>
> >> ----------------------------------------
> >>> From: chris.mason@oracle.com
> >>> To: chris.mason@oracle.com
> >>> CC: xin.zhong@intel.com; mitch.harder@sabayonlinux.org; thierryzhong@hotmail.com; linux-btrfs@vger.kernel.org
> >>> Subject: RE: [PATCH] btrfs file write debugging patch
> >>> Date: Fri, 4 Mar 2011 07:19:39 -0500
> >>>
> >>> Excerpts from Chris Mason's message of 2011-03-03 20:51:55 -0500:
> >>> > Excerpts from Zhong, Xin's message of 2011-03-02 05:58:49 -0500:
> >>> > > It seems that if we give an unaligned address to btrfs write and the buffer reside on more than 2 pages. It will trigger this bug.
> >>> > > If we give an aligned address to btrfs write, it works well no matter how many pages are given.
> >>> > >
> >>> > > I use ftrace to observe it. It seems iov_iter_fault_in_readable do not trigger pagefault handling when the address is not aligned. I do not quite understand the reason behind it. But the solution should be to process the page one by one. And that's also what generic file write routine does.
> >>> > >
> >>> > > Any suggestion are welcomed. Thanks!
> >>> >
> >>> > Great job guys. I'm using this on top of my debugging patch. It passes
> >>> > the unaligned test but I'll give it a real run tonight and look for
> >>> > other problems.
> >>> >
> >>> > (This is almost entirely untested, please don't use it quite yet)
> >>>
> >>> >
> >>> > -chris
> >>> >
> >>> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> >>> > index 89a6a26..6a44add 100644
> >>> > --- a/fs/btrfs/file.c
> >>> > +++ b/fs/btrfs/file.c
> >>> > @@ -1039,6 +1038,14 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
> >>> >
> >>> > copied = btrfs_copy_from_user(pos, num_pages,
> >>> > write_bytes, pages, &i);
> >>> > +
> >>> > + /*
> >>> > + * if we have trouble faulting in the pages, fall
> >>> > + * back to one page at a time
> >>> > + */
> >>> > + if (copied < write_bytes)
> >>> > + nrptrs = 1;
> >>> > +
> >>> > if (copied == 0)
> >>> > dirty_pages = 0;
> >>> > else
> >>>
> >>> Ok, this is working well for me. Anyone see any problems with it?
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >>> the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>
> >>
> >
> > I've applied this patch on top of the debugging patch at the head of
> > the thread, and I'm having trouble building gcc now.
> >
> > When building gcc-4.4.5, I get errors like the following:
> >
> > Comparing stages 2 and 3
> > Bootstrap comparison failure!
> > ./cp/call.o differs
> > ./cp/decl.o differs
> > ./cp/pt.o differs
> > ./cp/class.o differs
> > ./cp/decl2.o differs
> > <....snip.....>
> > ./matrix-reorg.o differs
> > ./tree-inline.o differs
> > ./gcc.o differs
> > ./gcc-options.o differs
> > make[2]: *** [compare] Error 1
> > make[1]: *** [stage3-bubble] Error 2
> > make: *** [bootstrap-lean] Error 2
> > emake failed
> >
> > I've went back and rebuilt my kernel without these two debugging
> > patches, and gcc-4.4.5 builds without error on that kernel.
> >
> > I haven't yet tested building gcc-4.4.5 with just the debugging patch
> > at the head of the thread, so I'll test that, and report back.
> >
> > But I was wondering if anybody else can replicate this issue.
> >
> > BTW, I've been doing most of my testing on an x86 system.  My x86_64
> > systems haven't had as much trouble, but I haven't been robustingly
> > checking my x86_64 systems for these issues.
> >
> > I noticed that page fault handling is different by architecture.
> >
>
> Some followup...
>
> I'm encountering this issue with "Bootstrap comparison failure!" in a
> gcc-4.4.5 build when only the patch at the head of the thread is
> applied (leaving the recent patch to limit pages to one-by-one on page
> fault out).
>
> I just hadn't run across this issue until I started playing with
> patches to limit the pages to one-by-one on page fault errors.
>
> So it may not be associated with the last patch.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
 		 	   		  

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-03-05  1:00               ` Xin Zhong
@ 2011-03-05 13:14                 ` Mitch Harder
  2011-03-05 16:50                   ` Mitch Harder
  0 siblings, 1 reply; 40+ messages in thread
From: Mitch Harder @ 2011-03-05 13:14 UTC (permalink / raw)
  To: Xin Zhong; +Cc: chris.mason, xin.zhong, linux-btrfs

2011/3/4 Xin Zhong <thierryzhong@hotmail.com>:
>
> So it works well now with the two patches from Chris on your system. Am I right?
>

No.

I am getting errors building gcc-4.4.5 with the two patches from Chris.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-03-05 13:14                 ` Mitch Harder
@ 2011-03-05 16:50                   ` Mitch Harder
  2011-03-06 18:00                     ` Chris Mason
  0 siblings, 1 reply; 40+ messages in thread
From: Mitch Harder @ 2011-03-05 16:50 UTC (permalink / raw)
  To: Xin Zhong; +Cc: chris.mason, xin.zhong, linux-btrfs

I've constructed a test patch that is currently addressing all the
issues on my system.

The portion of Openmotif that was having issues with page faults works
correctly with this patch, and gcc-4.4.5 builds without issue.

I extracted only the portion of the first patch that corrects the
handling of dirty_pages when copied==0, and incorporated the second
patch that falls back to one-page-at-a-time if there are troubles with
page faults.

--- fs/btrfs/file.c	2011-03-05 07:34:43.025131607 -0600
+++ /usr/src/linux/fs/btrfs/file.c	2011-03-05 07:41:45.001260294 -0600
@@ -1023,8 +1023,20 @@

 		copied = btrfs_copy_from_user(pos, num_pages,
 					   write_bytes, pages, &i);
-		dirty_pages = (copied + offset + PAGE_CACHE_SIZE - 1) >>
-				PAGE_CACHE_SHIFT;
+
+		/*
+		 * if we have trouble faulting in the pages, fall
+		 * back to one page at a time
+		 */
+		if (copied < write_bytes)
+			nrptrs = 1;
+
+		if (copied == 0)
+			dirty_pages = 0;
+		else
+			dirty_pages = (copied + offset +
+				       PAGE_CACHE_SIZE - 1) >>
+				       PAGE_CACHE_SHIFT;

 		if (num_pages > dirty_pages) {
 			if (copied > 0)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-03-05 16:50                   ` Mitch Harder
@ 2011-03-06 18:00                     ` Chris Mason
  2011-03-07  0:58                       ` Chris Mason
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Mason @ 2011-03-06 18:00 UTC (permalink / raw)
  To: Mitch Harder; +Cc: Xin Zhong, xin.zhong, linux-btrfs

Excerpts from Mitch Harder's message of 2011-03-05 11:50:14 -0500:
> I've constructed a test patch that is currently addressing all the
> issues on my system.
> 
> The portion of Openmotif that was having issues with page faults works
> correctly with this patch, and gcc-4.4.5 builds without issue.
> 
> I extracted only the portion of the first patch that corrects the
> handling of dirty_pages when copied==0, and incorporated the second
> patch that falls back to one-page-at-a-time if there are troubles with
> page faults.

Just to make sure I understand, could you please post the full combined
path that was giving you trouble with gcc?  We do need to make sure the
pages are properly up to date if we fall back to partial writes.

-chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-03-06 18:00                     ` Chris Mason
@ 2011-03-07  0:58                       ` Chris Mason
  2011-03-07  6:07                         ` Mitch Harder
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Mason @ 2011-03-07  0:58 UTC (permalink / raw)
  To: Chris Mason; +Cc: Mitch Harder, Xin Zhong, xin.zhong, linux-btrfs

Excerpts from Chris Mason's message of 2011-03-06 13:00:27 -0500:
> Excerpts from Mitch Harder's message of 2011-03-05 11:50:14 -0500:
> > I've constructed a test patch that is currently addressing all the
> > issues on my system.
> > 
> > The portion of Openmotif that was having issues with page faults works
> > correctly with this patch, and gcc-4.4.5 builds without issue.
> > 
> > I extracted only the portion of the first patch that corrects the
> > handling of dirty_pages when copied==0, and incorporated the second
> > patch that falls back to one-page-at-a-time if there are troubles with
> > page faults.
> 
> Just to make sure I understand, could you please post the full combined
> path that was giving you trouble with gcc?  We do need to make sure the
> pages are properly up to date if we fall back to partial writes.

Ok, I was able to reproduce this easily with fsx.  The problem is that I
wasn't making sure the last partial page in the write was up to date
when it was also the first page in the write.

Here is the updated patch, it has all the fixes we've found so far:

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 7084140..5986ac7 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -763,6 +763,27 @@ out:
 }
 
 /*
+ * on error we return an unlocked page and the error value
+ * on success we return a locked page and 0
+ */
+static int prepare_uptodate_page(struct page *page, u64 pos)
+{
+	int ret = 0;
+
+	if ((pos & (PAGE_CACHE_SIZE - 1)) && !PageUptodate(page)) {
+		ret = btrfs_readpage(NULL, page);
+		if (ret)
+			return ret;
+		lock_page(page);
+		if (!PageUptodate(page)) {
+			unlock_page(page);
+			return -EIO;
+		}
+	}
+	return 0;
+}
+
+/*
  * this gets pages into the page cache and locks them down, it also properly
  * waits for data=ordered extents to finish before allowing the pages to be
  * modified.
@@ -777,6 +798,7 @@ static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
 	unsigned long index = pos >> PAGE_CACHE_SHIFT;
 	struct inode *inode = fdentry(file)->d_inode;
 	int err = 0;
+	int faili = 0;
 	u64 start_pos;
 	u64 last_pos;
 
@@ -794,15 +816,24 @@ again:
 	for (i = 0; i < num_pages; i++) {
 		pages[i] = grab_cache_page(inode->i_mapping, index + i);
 		if (!pages[i]) {
-			int c;
-			for (c = i - 1; c >= 0; c--) {
-				unlock_page(pages[c]);
-				page_cache_release(pages[c]);
-			}
-			return -ENOMEM;
+			faili = i - 1;
+			err = -ENOMEM;
+			goto fail;
+		}
+
+		if (i == 0)
+			err = prepare_uptodate_page(pages[i], pos);
+		if (i == num_pages - 1)
+			err = prepare_uptodate_page(pages[i],
+						    pos + write_bytes);
+		if (err) {
+			page_cache_release(pages[i]);
+			faili = i - 1;
+			goto fail;
 		}
 		wait_on_page_writeback(pages[i]);
 	}
+	err = 0;
 	if (start_pos < inode->i_size) {
 		struct btrfs_ordered_extent *ordered;
 		lock_extent_bits(&BTRFS_I(inode)->io_tree,
@@ -842,6 +873,14 @@ again:
 		WARN_ON(!PageLocked(pages[i]));
 	}
 	return 0;
+fail:
+	while (faili >= 0) {
+		unlock_page(pages[faili]);
+		page_cache_release(pages[faili]);
+		faili--;
+	}
+	return err;
+
 }
 
 static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
@@ -851,7 +890,6 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = fdentry(file)->d_inode;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct page *pinned[2];
 	struct page **pages = NULL;
 	struct iov_iter i;
 	loff_t *ppos = &iocb->ki_pos;
@@ -872,9 +910,6 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 	will_write = ((file->f_flags & O_DSYNC) || IS_SYNC(inode) ||
 		      (file->f_flags & O_DIRECT));
 
-	pinned[0] = NULL;
-	pinned[1] = NULL;
-
 	start_pos = pos;
 
 	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
@@ -962,32 +997,6 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 	first_index = pos >> PAGE_CACHE_SHIFT;
 	last_index = (pos + iov_iter_count(&i)) >> PAGE_CACHE_SHIFT;
 
-	/*
-	 * there are lots of better ways to do this, but this code
-	 * makes sure the first and last page in the file range are
-	 * up to date and ready for cow
-	 */
-	if ((pos & (PAGE_CACHE_SIZE - 1))) {
-		pinned[0] = grab_cache_page(inode->i_mapping, first_index);
-		if (!PageUptodate(pinned[0])) {
-			ret = btrfs_readpage(NULL, pinned[0]);
-			BUG_ON(ret);
-			wait_on_page_locked(pinned[0]);
-		} else {
-			unlock_page(pinned[0]);
-		}
-	}
-	if ((pos + iov_iter_count(&i)) & (PAGE_CACHE_SIZE - 1)) {
-		pinned[1] = grab_cache_page(inode->i_mapping, last_index);
-		if (!PageUptodate(pinned[1])) {
-			ret = btrfs_readpage(NULL, pinned[1]);
-			BUG_ON(ret);
-			wait_on_page_locked(pinned[1]);
-		} else {
-			unlock_page(pinned[1]);
-		}
-	}
-
 	while (iov_iter_count(&i) > 0) {
 		size_t offset = pos & (PAGE_CACHE_SIZE - 1);
 		size_t write_bytes = min(iov_iter_count(&i),
@@ -1024,8 +1033,20 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 
 		copied = btrfs_copy_from_user(pos, num_pages,
 					   write_bytes, pages, &i);
-		dirty_pages = (copied + offset + PAGE_CACHE_SIZE - 1) >>
-				PAGE_CACHE_SHIFT;
+
+		/*
+		 * if we have trouble faulting in the pages, fall
+		 * back to one page at a time
+		 */
+		if (copied < write_bytes)
+			nrptrs = 1;
+
+		if (copied == 0)
+			dirty_pages = 0;
+		else
+			dirty_pages = (copied + offset +
+				       PAGE_CACHE_SIZE - 1) >>
+				       PAGE_CACHE_SHIFT;
 
 		if (num_pages > dirty_pages) {
 			if (copied > 0)
@@ -1069,10 +1090,6 @@ out:
 		err = ret;
 
 	kfree(pages);
-	if (pinned[0])
-		page_cache_release(pinned[0]);
-	if (pinned[1])
-		page_cache_release(pinned[1]);
 	*ppos = pos;
 
 	/*

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-03-07  0:58                       ` Chris Mason
@ 2011-03-07  6:07                         ` Mitch Harder
  2011-03-07  6:37                           ` Zhong, Xin
  2011-03-07 19:56                           ` Maria Wikström
  0 siblings, 2 replies; 40+ messages in thread
From: Mitch Harder @ 2011-03-07  6:07 UTC (permalink / raw)
  To: Chris Mason; +Cc: Xin Zhong, xin.zhong, linux-btrfs

On Sun, Mar 6, 2011 at 6:58 PM, Chris Mason <chris.mason@oracle.com> wr=
ote:
> Excerpts from Chris Mason's message of 2011-03-06 13:00:27 -0500:
>> Excerpts from Mitch Harder's message of 2011-03-05 11:50:14 -0500:
>> > I've constructed a test patch that is currently addressing all the
>> > issues on my system.
>> >
>> > The portion of Openmotif that was having issues with page faults w=
orks
>> > correctly with this patch, and gcc-4.4.5 builds without issue.
>> >
>> > I extracted only the portion of the first patch that corrects the
>> > handling of dirty_pages when copied=3D=3D0, and incorporated the s=
econd
>> > patch that falls back to one-page-at-a-time if there are troubles =
with
>> > page faults.
>>
>> Just to make sure I understand, could you please post the full combi=
ned
>> path that was giving you trouble with gcc? =A0We do need to make sur=
e the
>> pages are properly up to date if we fall back to partial writes.
>
> Ok, I was able to reproduce this easily with fsx. =A0The problem is t=
hat I
> wasn't making sure the last partial page in the write was up to date
> when it was also the first page in the write.
>
> Here is the updated patch, it has all the fixes we've found so far:
>

This latest patch that Chris has sent out fixes the issues I've been
encountering.

I can build gcc-4.4.5 without problems.

Also, the portion of Openmotif that was having issues with page faults
is working correctly.

Let me know if you still would like to see the path names for the
portions of the gcc-4.4.5 build that were giving me issues.  I didn't
save that information, but I can regenerate it.  But it sounds like
it's irrelevant now.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH] btrfs file write debugging patch
  2011-03-07  6:07                         ` Mitch Harder
@ 2011-03-07  6:37                           ` Zhong, Xin
  2011-03-07 19:56                           ` Maria Wikström
  1 sibling, 0 replies; 40+ messages in thread
From: Zhong, Xin @ 2011-03-07  6:37 UTC (permalink / raw)
  To: Mitch Harder, Chris Mason; +Cc: Xin Zhong, linux-btrfs

That's great! :-)

-----Original Message-----
=46rom: Mitch Harder [mailto:mitch.harder@sabayonlinux.org]=20
Sent: Monday, March 07, 2011 2:08 PM
To: Chris Mason
Cc: Xin Zhong; Zhong, Xin; linux-btrfs
Subject: Re: [PATCH] btrfs file write debugging patch

On Sun, Mar 6, 2011 at 6:58 PM, Chris Mason <chris.mason@oracle.com> wr=
ote:
> Excerpts from Chris Mason's message of 2011-03-06 13:00:27 -0500:
>> Excerpts from Mitch Harder's message of 2011-03-05 11:50:14 -0500:
>> > I've constructed a test patch that is currently addressing all the
>> > issues on my system.
>> >
>> > The portion of Openmotif that was having issues with page faults w=
orks
>> > correctly with this patch, and gcc-4.4.5 builds without issue.
>> >
>> > I extracted only the portion of the first patch that corrects the
>> > handling of dirty_pages when copied=3D=3D0, and incorporated the s=
econd
>> > patch that falls back to one-page-at-a-time if there are troubles =
with
>> > page faults.
>>
>> Just to make sure I understand, could you please post the full combi=
ned
>> path that was giving you trouble with gcc? =A0We do need to make sur=
e the
>> pages are properly up to date if we fall back to partial writes.
>
> Ok, I was able to reproduce this easily with fsx. =A0The problem is t=
hat I
> wasn't making sure the last partial page in the write was up to date
> when it was also the first page in the write.
>
> Here is the updated patch, it has all the fixes we've found so far:
>

This latest patch that Chris has sent out fixes the issues I've been
encountering.

I can build gcc-4.4.5 without problems.

Also, the portion of Openmotif that was having issues with page faults
is working correctly.

Let me know if you still would like to see the path names for the
portions of the gcc-4.4.5 build that were giving me issues.  I didn't
save that information, but I can regenerate it.  But it sounds like
it's irrelevant now.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-03-07  6:07                         ` Mitch Harder
  2011-03-07  6:37                           ` Zhong, Xin
@ 2011-03-07 19:56                           ` Maria Wikström
  2011-03-07 22:12                             ` Johannes Hirte
  1 sibling, 1 reply; 40+ messages in thread
From: Maria Wikström @ 2011-03-07 19:56 UTC (permalink / raw)
  To: Mitch Harder; +Cc: Chris Mason, Xin Zhong, xin.zhong, linux-btrfs

m=C3=A5n 2011-03-07 klockan 00:07 -0600 skrev Mitch Harder:
> On Sun, Mar 6, 2011 at 6:58 PM, Chris Mason <chris.mason@oracle.com> =
wrote:
> > Excerpts from Chris Mason's message of 2011-03-06 13:00:27 -0500:
> >> Excerpts from Mitch Harder's message of 2011-03-05 11:50:14 -0500:
> >> > I've constructed a test patch that is currently addressing all t=
he
> >> > issues on my system.
> >> >
> >> > The portion of Openmotif that was having issues with page faults=
 works
> >> > correctly with this patch, and gcc-4.4.5 builds without issue.
> >> >
> >> > I extracted only the portion of the first patch that corrects th=
e
> >> > handling of dirty_pages when copied=3D=3D0, and incorporated the=
 second
> >> > patch that falls back to one-page-at-a-time if there are trouble=
s with
> >> > page faults.
> >>
> >> Just to make sure I understand, could you please post the full com=
bined
> >> path that was giving you trouble with gcc?  We do need to make sur=
e the
> >> pages are properly up to date if we fall back to partial writes.
> >
> > Ok, I was able to reproduce this easily with fsx.  The problem is t=
hat I
> > wasn't making sure the last partial page in the write was up to dat=
e
> > when it was also the first page in the write.
> >
> > Here is the updated patch, it has all the fixes we've found so far:
> >
>=20
> This latest patch that Chris has sent out fixes the issues I've been
> encountering.
>=20
> I can build gcc-4.4.5 without problems.
>=20
> Also, the portion of Openmotif that was having issues with page fault=
s
> is working correctly.
>=20
> Let me know if you still would like to see the path names for the
> portions of the gcc-4.4.5 build that were giving me issues.  I didn't
> save that information, but I can regenerate it.  But it sounds like
> it's irrelevant now.

With the patch I can compile libgcrypt without any problem, so it solve=
s
my problems to.

// Maria



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-03-07 19:56                           ` Maria Wikström
@ 2011-03-07 22:12                             ` Johannes Hirte
  2011-03-08  2:51                               ` Zhong, Xin
  0 siblings, 1 reply; 40+ messages in thread
From: Johannes Hirte @ 2011-03-07 22:12 UTC (permalink / raw)
  To: Maria Wikström
  Cc: Mitch Harder, Chris Mason, Xin Zhong, xin.zhong, linux-btrfs

On Monday 07 March 2011 20:56:50 Maria Wikstr=C3=B6m wrote:
> m=C3=A5n 2011-03-07 klockan 00:07 -0600 skrev Mitch Harder:
> > On Sun, Mar 6, 2011 at 6:58 PM, Chris Mason <chris.mason@oracle.com=
>=20
wrote:
> > > Excerpts from Chris Mason's message of 2011-03-06 13:00:27 -0500:
> > >> Excerpts from Mitch Harder's message of 2011-03-05 11:50:14 -050=
0:
> > >> > I've constructed a test patch that is currently addressing all=
 the
> > >> > issues on my system.
> > >> >=20
> > >> > The portion of Openmotif that was having issues with page faul=
ts
> > >> > works correctly with this patch, and gcc-4.4.5 builds without
> > >> > issue.
> > >> >=20
> > >> > I extracted only the portion of the first patch that corrects =
the
> > >> > handling of dirty_pages when copied=3D=3D0, and incorporated t=
he second
> > >> > patch that falls back to one-page-at-a-time if there are troub=
les
> > >> > with page faults.
> > >>=20
> > >> Just to make sure I understand, could you please post the full
> > >> combined path that was giving you trouble with gcc?  We do need =
to
> > >> make sure the pages are properly up to date if we fall back to
> > >> partial writes.
> > >=20
> > > Ok, I was able to reproduce this easily with fsx.  The problem is=
 that
> > > I wasn't making sure the last partial page in the write was up to=
 date
> > > when it was also the first page in the write.
> >=20
> > > Here is the updated patch, it has all the fixes we've found so fa=
r:
> > This latest patch that Chris has sent out fixes the issues I've bee=
n
> > encountering.
> >=20
> > I can build gcc-4.4.5 without problems.
> >=20
> > Also, the portion of Openmotif that was having issues with page fau=
lts
> > is working correctly.
> >=20
> > Let me know if you still would like to see the path names for the
> > portions of the gcc-4.4.5 build that were giving me issues.  I didn=
't
> > save that information, but I can regenerate it.  But it sounds like
> > it's irrelevant now.
>=20
> With the patch I can compile libgcrypt without any problem, so it sol=
ves
> my problems to.

Can confirm this. And the bug seems to be hardware-related. On my Penti=
um4=20
system it was 100% reproducible, on my Atom-based system I couldn't tri=
gger=20
it.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH] btrfs file write debugging patch
  2011-03-07 22:12                             ` Johannes Hirte
@ 2011-03-08  2:51                               ` Zhong, Xin
  0 siblings, 0 replies; 40+ messages in thread
From: Zhong, Xin @ 2011-03-08  2:51 UTC (permalink / raw)
  To: Johannes Hirte, Maria Wikström
  Cc: Mitch Harder, Chris Mason, Xin Zhong, linux-btrfs

RnJvbSBteSBleHBlcmllbmNlLCBvbmx5IHg4NiAzMmJpdCBrZXJuZWwgaGFzIHRoaXMgcHJvYmxl
bSBhbmQgNjRiaXQga2VybmVsIGRvIG5vdCBoYXZlIGl0LiANCg0KSG93ZXZlciwgYXRvbSBiYXNl
ZCBzeXN0ZW0gZG8gbm90IGhhdmUgaXQgYWx0aG91Z2ggaXQncyBpbnN0YWxsZWQgd2l0aCBhIDMy
Yml0IGtlcm5lbC4gDQoNCi0tLS0tT3JpZ2luYWwgTWVzc2FnZS0tLS0tDQpGcm9tOiBKb2hhbm5l
cyBIaXJ0ZSBbbWFpbHRvOmpvaGFubmVzLmhpcnRlQGZlbS50dS1pbG1lbmF1LmRlXSANClNlbnQ6
IFR1ZXNkYXksIE1hcmNoIDA4LCAyMDExIDY6MTIgQU0NClRvOiBNYXJpYSBXaWtzdHLDtm0NCkNj
OiBNaXRjaCBIYXJkZXI7IENocmlzIE1hc29uOyBYaW4gWmhvbmc7IFpob25nLCBYaW47IGxpbnV4
LWJ0cmZzDQpTdWJqZWN0OiBSZTogW1BBVENIXSBidHJmcyBmaWxlIHdyaXRlIGRlYnVnZ2luZyBw
YXRjaA0KDQpPbiBNb25kYXkgMDcgTWFyY2ggMjAxMSAyMDo1Njo1MCBNYXJpYSBXaWtzdHLDtm0g
d3JvdGU6DQo+IG3DpW4gMjAxMS0wMy0wNyBrbG9ja2FuIDAwOjA3IC0wNjAwIHNrcmV2IE1pdGNo
IEhhcmRlcjoNCj4gPiBPbiBTdW4sIE1hciA2LCAyMDExIGF0IDY6NTggUE0sIENocmlzIE1hc29u
IDxjaHJpcy5tYXNvbkBvcmFjbGUuY29tPiANCndyb3RlOg0KPiA+ID4gRXhjZXJwdHMgZnJvbSBD
aHJpcyBNYXNvbidzIG1lc3NhZ2Ugb2YgMjAxMS0wMy0wNiAxMzowMDoyNyAtMDUwMDoNCj4gPiA+
PiBFeGNlcnB0cyBmcm9tIE1pdGNoIEhhcmRlcidzIG1lc3NhZ2Ugb2YgMjAxMS0wMy0wNSAxMTo1
MDoxNCAtMDUwMDoNCj4gPiA+PiA+IEkndmUgY29uc3RydWN0ZWQgYSB0ZXN0IHBhdGNoIHRoYXQg
aXMgY3VycmVudGx5IGFkZHJlc3NpbmcgYWxsIHRoZQ0KPiA+ID4+ID4gaXNzdWVzIG9uIG15IHN5
c3RlbS4NCj4gPiA+PiA+IA0KPiA+ID4+ID4gVGhlIHBvcnRpb24gb2YgT3Blbm1vdGlmIHRoYXQg
d2FzIGhhdmluZyBpc3N1ZXMgd2l0aCBwYWdlIGZhdWx0cw0KPiA+ID4+ID4gd29ya3MgY29ycmVj
dGx5IHdpdGggdGhpcyBwYXRjaCwgYW5kIGdjYy00LjQuNSBidWlsZHMgd2l0aG91dA0KPiA+ID4+
ID4gaXNzdWUuDQo+ID4gPj4gPiANCj4gPiA+PiA+IEkgZXh0cmFjdGVkIG9ubHkgdGhlIHBvcnRp
b24gb2YgdGhlIGZpcnN0IHBhdGNoIHRoYXQgY29ycmVjdHMgdGhlDQo+ID4gPj4gPiBoYW5kbGlu
ZyBvZiBkaXJ0eV9wYWdlcyB3aGVuIGNvcGllZD09MCwgYW5kIGluY29ycG9yYXRlZCB0aGUgc2Vj
b25kDQo+ID4gPj4gPiBwYXRjaCB0aGF0IGZhbGxzIGJhY2sgdG8gb25lLXBhZ2UtYXQtYS10aW1l
IGlmIHRoZXJlIGFyZSB0cm91Ymxlcw0KPiA+ID4+ID4gd2l0aCBwYWdlIGZhdWx0cy4NCj4gPiA+
PiANCj4gPiA+PiBKdXN0IHRvIG1ha2Ugc3VyZSBJIHVuZGVyc3RhbmQsIGNvdWxkIHlvdSBwbGVh
c2UgcG9zdCB0aGUgZnVsbA0KPiA+ID4+IGNvbWJpbmVkIHBhdGggdGhhdCB3YXMgZ2l2aW5nIHlv
dSB0cm91YmxlIHdpdGggZ2NjPyAgV2UgZG8gbmVlZCB0bw0KPiA+ID4+IG1ha2Ugc3VyZSB0aGUg
cGFnZXMgYXJlIHByb3Blcmx5IHVwIHRvIGRhdGUgaWYgd2UgZmFsbCBiYWNrIHRvDQo+ID4gPj4g
cGFydGlhbCB3cml0ZXMuDQo+ID4gPiANCj4gPiA+IE9rLCBJIHdhcyBhYmxlIHRvIHJlcHJvZHVj
ZSB0aGlzIGVhc2lseSB3aXRoIGZzeC4gIFRoZSBwcm9ibGVtIGlzIHRoYXQNCj4gPiA+IEkgd2Fz
bid0IG1ha2luZyBzdXJlIHRoZSBsYXN0IHBhcnRpYWwgcGFnZSBpbiB0aGUgd3JpdGUgd2FzIHVw
IHRvIGRhdGUNCj4gPiA+IHdoZW4gaXQgd2FzIGFsc28gdGhlIGZpcnN0IHBhZ2UgaW4gdGhlIHdy
aXRlLg0KPiA+IA0KPiA+ID4gSGVyZSBpcyB0aGUgdXBkYXRlZCBwYXRjaCwgaXQgaGFzIGFsbCB0
aGUgZml4ZXMgd2UndmUgZm91bmQgc28gZmFyOg0KPiA+IFRoaXMgbGF0ZXN0IHBhdGNoIHRoYXQg
Q2hyaXMgaGFzIHNlbnQgb3V0IGZpeGVzIHRoZSBpc3N1ZXMgSSd2ZSBiZWVuDQo+ID4gZW5jb3Vu
dGVyaW5nLg0KPiA+IA0KPiA+IEkgY2FuIGJ1aWxkIGdjYy00LjQuNSB3aXRob3V0IHByb2JsZW1z
Lg0KPiA+IA0KPiA+IEFsc28sIHRoZSBwb3J0aW9uIG9mIE9wZW5tb3RpZiB0aGF0IHdhcyBoYXZp
bmcgaXNzdWVzIHdpdGggcGFnZSBmYXVsdHMNCj4gPiBpcyB3b3JraW5nIGNvcnJlY3RseS4NCj4g
PiANCj4gPiBMZXQgbWUga25vdyBpZiB5b3Ugc3RpbGwgd291bGQgbGlrZSB0byBzZWUgdGhlIHBh
dGggbmFtZXMgZm9yIHRoZQ0KPiA+IHBvcnRpb25zIG9mIHRoZSBnY2MtNC40LjUgYnVpbGQgdGhh
dCB3ZXJlIGdpdmluZyBtZSBpc3N1ZXMuICBJIGRpZG4ndA0KPiA+IHNhdmUgdGhhdCBpbmZvcm1h
dGlvbiwgYnV0IEkgY2FuIHJlZ2VuZXJhdGUgaXQuICBCdXQgaXQgc291bmRzIGxpa2UNCj4gPiBp
dCdzIGlycmVsZXZhbnQgbm93Lg0KPiANCj4gV2l0aCB0aGUgcGF0Y2ggSSBjYW4gY29tcGlsZSBs
aWJnY3J5cHQgd2l0aG91dCBhbnkgcHJvYmxlbSwgc28gaXQgc29sdmVzDQo+IG15IHByb2JsZW1z
IHRvLg0KDQpDYW4gY29uZmlybSB0aGlzLiBBbmQgdGhlIGJ1ZyBzZWVtcyB0byBiZSBoYXJkd2Fy
ZS1yZWxhdGVkLiBPbiBteSBQZW50aXVtNCANCnN5c3RlbSBpdCB3YXMgMTAwJSByZXByb2R1Y2li
bGUsIG9uIG15IEF0b20tYmFzZWQgc3lzdGVtIEkgY291bGRuJ3QgdHJpZ2dlciANCml0Lg0K

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2]Btrfs: pwrite blocked when writing from the mmaped buffer of the same page
@ 2011-02-25 18:43 Mitch Harder
  2011-02-28  1:46 ` [PATCH] btrfs file write debugging patch Chris Mason
  0 siblings, 1 reply; 40+ messages in thread
From: Mitch Harder @ 2011-02-25 18:43 UTC (permalink / raw)
  To: Chris Mason
  Cc: Maria Wikström, Zhong, Xin, Johannes Hirte,
	linux-btrfs@vger.kernel.org

On Fri, Feb 25, 2011 at 11:11 AM, Mitch Harder
<mitch.harder@sabayonlinux.org> wrote:
> On Thu, Feb 24, 2011 at 5:14 PM, Mitch Harder
> <mitch.harder@sabayonlinux.org> wrote:
>> On Thu, Feb 24, 2011 at 10:32 AM, Mitch Harder
>> <mitch.harder@sabayonlinux.org> wrote:
>>> On Thu, Feb 24, 2011 at 10:19 AM, Chris Mason <chris.mason@oracle.c=
om> wrote:
>>>> Excerpts from Mitch Harder's message of 2011-02-24 11:03:07 -0500:
>>>>> On Thu, Feb 24, 2011 at 10:00 AM, Chris Mason <chris.mason@oracle=
=2Ecom> wrote:
>>>>> > Excerpts from Mitch Harder's message of 2011-02-24 10:55:15 -05=
00:
>>>>> >> 2011/2/24 Maria Wikstr=F6m <maria@ponstudios.se>:
>>>>> >> > m=E5n 2011-02-21 klockan 09:51 +0800 skrev Zhong, Xin:
>>>>> >> >> The backtrace in your attachment looks like a known bug of =
2.6.37 which have already been fixed in 2.6.38. I have no idea why late=
st btrfs still hang in your environment if there's no debug info...
>>>>> >> >>
>>>>> >> >
>>>>> >> > Haha, yes that's very hard :)
>>>>> >> >
>>>>> >> > 2.6.38-rc6 and btrfs-unstable behaves the same way. I can cl=
ose the
>>>>> >> > process with ctrl+c and it disappear a few seconds later. Th=
ere is no
>>>>> >> > CPU usage. Reading works because I can start htop and watch =
"svn info"
>>>>> >> > disappear, but everything writing to btrfs slows down to a c=
rawl. It
>>>>> >> > takes about 1 minute to log in. So I had to put the logs on =
an other
>>>>> >> > partition using ext3 to get the output from sysrq+t.
>>>>> >> >
>>>>> >>
>>>>> >> I believe I've been experiencing this issue also. =A0However, =
my problem
>>>>> >> usually results in a "No space left on device" error rather th=
an a
>>>>> >> lock-up or crash. =A0But I've bisected my issue to this patch,=
 and my
>>>>> >> "btrfs fi show" and "btrfs fi df" looks similar to others who'=
ve
>>>>> >> posted to this tread with all my space being allocated, but no=
t used.
>>>>> >>
>>>>> >
>>>>> > Sorry, which patch did you bisect the problem down to?
>>>>> >
>>>>>
>>>>> The patch at the head of this thread:
>>>>>
>>>>> Btrfs: pwrite blocked when writing from the mmaped buffer of the =
same page
>>>>
>>>> Hmmm, that patch shouldn't be changing our performance under delal=
loc
>>>> pressure, and it really shouldn't impact early enospc.
>>>>
>>>
>>> I've bisected this issue around where this patch went into git, and
>>> I've also constructed a testing patch that reverts this patch, and
>>> placed it on top of the current Btrfs git sources (I understand thi=
s
>>> patch addresses a real issue, this was just for testing).
>>>
>>> It could be that this patch just "uncovers" another problem, but al=
l
>>> my tests seem to point to this patch triggering this issue.
>>>
>>
>> I don't belief the previous ftrace I supplied had a large enough sco=
pe
>> to capture the issue.
>>
>> I've expanded my ftrace buffer, and filtered out everything but btrf=
s*
>> function calls ("# echo btrfs* >
>> /sys/kernel/debug/tracing/set_ftrace_filter").
>>
>> In this trace, I see btrfs spending a great deal of time in a while
>> loop (while (iov_iter_count(&i) > 0) {)) in the btrfs_file_aio_write=
()
>> function in file.c without exiting the function.
>>
>> I'm going to try to inject some debugging trace_printk() statements =
to
>> find if that portion of code is proceeding normally with my test cas=
e.
>>
>> I've put my expanded trace up on my local server, but my upload
>> bandwidth is pretty sad, and it may take a few minutes to transfer
>> even though it's only a 6MB file.
>>
>> http://dontpanic.dyndns.org/trace-openmotif-btrfs-v3.gz
>>
>
> Apologies for only hitting "Reply" instead of "Reply-All" on my last =
message.
>
> I've inserted additional trace_printk() to the btrfs_file_aio_write()
> and btrfs_copy_from_user() function in file.c in order to characteriz=
e
> the problem I've been encountering.
>
> I can see btrfs getting stuck in a loop in the "while
> (iov_iter_count(&i) > 0) {}" portion of the btrfs_file_aio_write()
> function.
>
> The loop is more-or-less following this process (from within the
> "while (iov_iter_count(&i) > 0) {}" loop):
>
> (1) Reserve some space with btrfs_delalloc_reserve_space()
> (2) Prepare the reserved space with prepare_pages()
> (3) Call btrfs_copy_from_user() to copy to the prepared space.
> -------------> From btrfs_copy_from_user()
> (4) ........Try to copy with copied =3D iov_iter_copy_from_user_atomi=
c()
> (5) ........The above operation results with copied =3D=3D 0. Break a=
nd
> return with a return value of 0 bytes copied.
> (6) There is no special handling for copied =3D=3D 0 in the "while
> (iov_iter_count(&i) > 0) {}" loop, so it loops back around, reserves
> some more space, and tries again.
>
> If I look back at how the code was set up before the patch at the hea=
d
> of this thread was applied (Btrfs: pwrite blocked when writing from
> the mmaped buffer of the same page), the btrfs_copy_from_user()
> function had some handling for "copied =3D=3D 0" that would change th=
e
> scope of the amount to write, and loop back to try the write again.
>
> I attempted to construct a patch that just reverted the handling for
> "copied =3D=3D 0" in btrfs_copy_from_user(), however, that just resul=
ted
> in my computer locking up when it reached the point where it was
> previously beginning to allocate disk space.
>
> So, I apologize for not having a patch to address the issue I'm
> seeing, but I hope I've added some insight.
>

Some clarification on my previous message...

After looking at my ftrace log more closely, I can see where Btrfs is
trying to release the allocated pages.  However, the calculation for
the number of dirty_pages is equal to 1 when "copied =3D=3D 0".

So I'm seeing at least two problems:
(1)  It keeps looping when "copied =3D=3D 0".
(2)  One dirty page is not being released on every loop even though
"copied =3D=3D 0" (at least this problem keeps it from being an infinit=
e
loop by eventually exhausting reserveable space on the disk).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH] btrfs file write debugging patch
  2011-02-25 18:43 [PATCH v2]Btrfs: pwrite blocked when writing from the mmaped buffer of the same page Mitch Harder
@ 2011-02-28  1:46 ` Chris Mason
  2011-02-28  8:56   ` Zhong, Xin
  2011-02-28 10:13   ` Johannes Hirte
  0 siblings, 2 replies; 40+ messages in thread
From: Chris Mason @ 2011-02-28  1:46 UTC (permalink / raw)
  To: Mitch Harder
  Cc: Maria Wikström, Zhong, Xin, Johannes Hirte,
	linux-btrfs@vger.kernel.org

Excerpts from Mitch Harder's message of 2011-02-25 13:43:37 -0500:
> Some clarification on my previous message...
> 
> After looking at my ftrace log more closely, I can see where Btrfs is
> trying to release the allocated pages.  However, the calculation for
> the number of dirty_pages is equal to 1 when "copied == 0".
> 
> So I'm seeing at least two problems:
> (1)  It keeps looping when "copied == 0".
> (2)  One dirty page is not being released on every loop even though
> "copied == 0" (at least this problem keeps it from being an infinite
> loop by eventually exhausting reserveable space on the disk).

Hi everyone,

There are actually tow bugs here.  First the one that Mitch hit, and a
second one that still results in bad file_write results with my
debugging hunks (the first two hunks below) in place.

My patch fixes Mitch's bug by checking for copied == 0 after
btrfs_copy_from_user and going the correct delalloc accounting.  This
one looks solved, but you'll notice the patch is bigger.

First, I add some random failures to btrfs_copy_from_user() by failing
everyone once and a while.  This was much more reliable than trying to
use memory pressure than making copy_from_user fail.

If copy_from_user fails and we partially update a page, we end up with a
page that may go away due to memory pressure.  But, btrfs_file_write
assumes that only the first and last page may have good data that needs
to be read off the disk.

This patch ditches that code and puts it into prepare_pages instead.
But I'm still having some errors during long stress.sh runs.  Ideas are
more than welcome, hopefully some other timezones will kick in ideas
while I sleep.

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 7084140..89a6a26 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -54,6 +54,13 @@ static noinline int btrfs_copy_from_user(loff_t pos, int num_pages,
 	int offset = pos & (PAGE_CACHE_SIZE - 1);
 	int total_copied = 0;
 
+	if ((jiffies % 10) == 0)
+		return 0;
+
+	if ((jiffies % 25) == 0) {
+		write_bytes /= 2;
+	}
+
 	while (write_bytes > 0) {
 		size_t count = min_t(size_t,
 				     PAGE_CACHE_SIZE - offset, write_bytes);
@@ -763,6 +770,26 @@ out:
 }
 
 /*
+ * on error we return an unlocked page and the error value
+ * on success we return a locked page and 0
+ */
+static int prepare_uptodate_page(struct page *page, u64 pos)
+{
+	int ret = 0;
+	if ((pos & (PAGE_CACHE_SIZE - 1)) && !PageUptodate(page)) {
+		ret = btrfs_readpage(NULL, page);
+		if (ret)
+			return ret;
+		lock_page(page);
+		if (!PageUptodate(page)) {
+			unlock_page(page);
+			return -EIO;
+		}
+	}
+	return 0;
+}
+
+/*
  * this gets pages into the page cache and locks them down, it also properly
  * waits for data=ordered extents to finish before allowing the pages to be
  * modified.
@@ -777,6 +804,7 @@ static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
 	unsigned long index = pos >> PAGE_CACHE_SHIFT;
 	struct inode *inode = fdentry(file)->d_inode;
 	int err = 0;
+	int faili = 0;
 	u64 start_pos;
 	u64 last_pos;
 
@@ -794,15 +822,24 @@ again:
 	for (i = 0; i < num_pages; i++) {
 		pages[i] = grab_cache_page(inode->i_mapping, index + i);
 		if (!pages[i]) {
-			int c;
-			for (c = i - 1; c >= 0; c--) {
-				unlock_page(pages[c]);
-				page_cache_release(pages[c]);
-			}
-			return -ENOMEM;
+			faili = i - 1;
+			err = -ENOMEM;
+			goto fail;
+		}
+
+		if (i == 0)
+			err = prepare_uptodate_page(pages[i], pos);
+		else if (i == num_pages - 1)
+			err = prepare_uptodate_page(pages[i],
+						    pos + write_bytes);
+		if (err) {
+			page_cache_release(pages[i]);
+			faili = i - 1;
+			goto fail;
 		}
 		wait_on_page_writeback(pages[i]);
 	}
+	err = 0;
 	if (start_pos < inode->i_size) {
 		struct btrfs_ordered_extent *ordered;
 		lock_extent_bits(&BTRFS_I(inode)->io_tree,
@@ -842,6 +879,14 @@ again:
 		WARN_ON(!PageLocked(pages[i]));
 	}
 	return 0;
+fail:
+	while (faili >= 0) {
+		unlock_page(pages[faili]);
+		page_cache_release(pages[faili]);
+		faili--;
+	}
+	return err;
+
 }
 
 static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
@@ -851,7 +896,6 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = fdentry(file)->d_inode;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct page *pinned[2];
 	struct page **pages = NULL;
 	struct iov_iter i;
 	loff_t *ppos = &iocb->ki_pos;
@@ -872,9 +916,6 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 	will_write = ((file->f_flags & O_DSYNC) || IS_SYNC(inode) ||
 		      (file->f_flags & O_DIRECT));
 
-	pinned[0] = NULL;
-	pinned[1] = NULL;
-
 	start_pos = pos;
 
 	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
@@ -962,32 +1003,6 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 	first_index = pos >> PAGE_CACHE_SHIFT;
 	last_index = (pos + iov_iter_count(&i)) >> PAGE_CACHE_SHIFT;
 
-	/*
-	 * there are lots of better ways to do this, but this code
-	 * makes sure the first and last page in the file range are
-	 * up to date and ready for cow
-	 */
-	if ((pos & (PAGE_CACHE_SIZE - 1))) {
-		pinned[0] = grab_cache_page(inode->i_mapping, first_index);
-		if (!PageUptodate(pinned[0])) {
-			ret = btrfs_readpage(NULL, pinned[0]);
-			BUG_ON(ret);
-			wait_on_page_locked(pinned[0]);
-		} else {
-			unlock_page(pinned[0]);
-		}
-	}
-	if ((pos + iov_iter_count(&i)) & (PAGE_CACHE_SIZE - 1)) {
-		pinned[1] = grab_cache_page(inode->i_mapping, last_index);
-		if (!PageUptodate(pinned[1])) {
-			ret = btrfs_readpage(NULL, pinned[1]);
-			BUG_ON(ret);
-			wait_on_page_locked(pinned[1]);
-		} else {
-			unlock_page(pinned[1]);
-		}
-	}
-
 	while (iov_iter_count(&i) > 0) {
 		size_t offset = pos & (PAGE_CACHE_SIZE - 1);
 		size_t write_bytes = min(iov_iter_count(&i),
@@ -1024,8 +1039,12 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 
 		copied = btrfs_copy_from_user(pos, num_pages,
 					   write_bytes, pages, &i);
-		dirty_pages = (copied + offset + PAGE_CACHE_SIZE - 1) >>
-				PAGE_CACHE_SHIFT;
+		if (copied == 0)
+			dirty_pages = 0;
+		else
+			dirty_pages = (copied + offset +
+				       PAGE_CACHE_SIZE - 1) >>
+				       PAGE_CACHE_SHIFT;
 
 		if (num_pages > dirty_pages) {
 			if (copied > 0)
@@ -1069,10 +1088,6 @@ out:
 		err = ret;
 
 	kfree(pages);
-	if (pinned[0])
-		page_cache_release(pinned[0]);
-	if (pinned[1])
-		page_cache_release(pinned[1]);
 	*ppos = pos;
 
 	/*

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* RE: [PATCH] btrfs file write debugging patch
  2011-02-28  1:46 ` [PATCH] btrfs file write debugging patch Chris Mason
@ 2011-02-28  8:56   ` Zhong, Xin
  2011-02-28 14:02     ` Chris Mason
  2011-02-28 10:13   ` Johannes Hirte
  1 sibling, 1 reply; 40+ messages in thread
From: Zhong, Xin @ 2011-02-28  8:56 UTC (permalink / raw)
  To: Chris Mason, Mitch Harder
  Cc: Maria Wikström, Johannes Hirte, linux-btrfs@vger.kernel.org

T25lIHBvc3NpYmxlIGlzc3VlIEkgY2FuIHNlZSBpcyBpbiB0aGUgcmFuZG9tIGZhaWx1cmUgY2Fz
ZSAjMiB0aGF0IGNvcHlfZnJvbV91c2VyIG9ubHkgcHJvY2VzcyBoYWxmIG9mIHRoZSBkYXRhLiAN
Cg0KRm9yIGV4YW1wbGUsIGlmIGl0IHdyaXRlIGEgNGsgYWxpZ25lZCBwYWdlIGFuZCBjb3B5X2Zy
b21fdXNlciBvbmx5IHdyaXRlIDJrLiBUaGVuIGl0IHdpbGwgbm90IGNhbGwgYnRyZnNfZGVsYWxs
b2NfcmVsZWFzZV9zcGFjZSBzaW5jZSBudW1fcGFnZXMgYW5kIGRpcnR5X3BhZ2VzIGFyZSBib3Ro
IDEuIA0KSW4gdGhlIG5leHQgcm91bmQsIGl0IHdyaXRlIGFub3RoZXIgMmsgYW5kIGJ0cmZzX2Rl
bGFsbG9jX3Jlc2VydmVfc3BhY2UgaXMgY2FsbGVkIHR3aWNlIGZvciB0aGUgc2FtZSBwYWdlLiAN
Cg0KSXMgaXQgYSBwcm9ibGVtPyBUaGFua3MhDQoNCi0tLS0tT3JpZ2luYWwgTWVzc2FnZS0tLS0t
DQpGcm9tOiBDaHJpcyBNYXNvbiBbbWFpbHRvOmNocmlzLm1hc29uQG9yYWNsZS5jb21dIA0KU2Vu
dDogTW9uZGF5LCBGZWJydWFyeSAyOCwgMjAxMSA5OjQ2IEFNDQpUbzogTWl0Y2ggSGFyZGVyDQpD
YzogTWFyaWEgV2lrc3Ryw7ZtOyBaaG9uZywgWGluOyBKb2hhbm5lcyBIaXJ0ZTsgbGludXgtYnRy
ZnNAdmdlci5rZXJuZWwub3JnDQpTdWJqZWN0OiBbUEFUQ0hdIGJ0cmZzIGZpbGUgd3JpdGUgZGVi
dWdnaW5nIHBhdGNoDQoNCkV4Y2VycHRzIGZyb20gTWl0Y2ggSGFyZGVyJ3MgbWVzc2FnZSBvZiAy
MDExLTAyLTI1IDEzOjQzOjM3IC0wNTAwOg0KPiBTb21lIGNsYXJpZmljYXRpb24gb24gbXkgcHJl
dmlvdXMgbWVzc2FnZS4uLg0KPiANCj4gQWZ0ZXIgbG9va2luZyBhdCBteSBmdHJhY2UgbG9nIG1v
cmUgY2xvc2VseSwgSSBjYW4gc2VlIHdoZXJlIEJ0cmZzIGlzDQo+IHRyeWluZyB0byByZWxlYXNl
IHRoZSBhbGxvY2F0ZWQgcGFnZXMuICBIb3dldmVyLCB0aGUgY2FsY3VsYXRpb24gZm9yDQo+IHRo
ZSBudW1iZXIgb2YgZGlydHlfcGFnZXMgaXMgZXF1YWwgdG8gMSB3aGVuICJjb3BpZWQgPT0gMCIu
DQo+IA0KPiBTbyBJJ20gc2VlaW5nIGF0IGxlYXN0IHR3byBwcm9ibGVtczoNCj4gKDEpICBJdCBr
ZWVwcyBsb29waW5nIHdoZW4gImNvcGllZCA9PSAwIi4NCj4gKDIpICBPbmUgZGlydHkgcGFnZSBp
cyBub3QgYmVpbmcgcmVsZWFzZWQgb24gZXZlcnkgbG9vcCBldmVuIHRob3VnaA0KPiAiY29waWVk
ID09IDAiIChhdCBsZWFzdCB0aGlzIHByb2JsZW0ga2VlcHMgaXQgZnJvbSBiZWluZyBhbiBpbmZp
bml0ZQ0KPiBsb29wIGJ5IGV2ZW50dWFsbHkgZXhoYXVzdGluZyByZXNlcnZlYWJsZSBzcGFjZSBv
biB0aGUgZGlzaykuDQoNCkhpIGV2ZXJ5b25lLA0KDQpUaGVyZSBhcmUgYWN0dWFsbHkgdG93IGJ1
Z3MgaGVyZS4gIEZpcnN0IHRoZSBvbmUgdGhhdCBNaXRjaCBoaXQsIGFuZCBhDQpzZWNvbmQgb25l
IHRoYXQgc3RpbGwgcmVzdWx0cyBpbiBiYWQgZmlsZV93cml0ZSByZXN1bHRzIHdpdGggbXkNCmRl
YnVnZ2luZyBodW5rcyAodGhlIGZpcnN0IHR3byBodW5rcyBiZWxvdykgaW4gcGxhY2UuDQoNCk15
IHBhdGNoIGZpeGVzIE1pdGNoJ3MgYnVnIGJ5IGNoZWNraW5nIGZvciBjb3BpZWQgPT0gMCBhZnRl
cg0KYnRyZnNfY29weV9mcm9tX3VzZXIgYW5kIGdvaW5nIHRoZSBjb3JyZWN0IGRlbGFsbG9jIGFj
Y291bnRpbmcuICBUaGlzDQpvbmUgbG9va3Mgc29sdmVkLCBidXQgeW91J2xsIG5vdGljZSB0aGUg
cGF0Y2ggaXMgYmlnZ2VyLg0KDQpGaXJzdCwgSSBhZGQgc29tZSByYW5kb20gZmFpbHVyZXMgdG8g
YnRyZnNfY29weV9mcm9tX3VzZXIoKSBieSBmYWlsaW5nDQpldmVyeW9uZSBvbmNlIGFuZCBhIHdo
aWxlLiAgVGhpcyB3YXMgbXVjaCBtb3JlIHJlbGlhYmxlIHRoYW4gdHJ5aW5nIHRvDQp1c2UgbWVt
b3J5IHByZXNzdXJlIHRoYW4gbWFraW5nIGNvcHlfZnJvbV91c2VyIGZhaWwuDQoNCklmIGNvcHlf
ZnJvbV91c2VyIGZhaWxzIGFuZCB3ZSBwYXJ0aWFsbHkgdXBkYXRlIGEgcGFnZSwgd2UgZW5kIHVw
IHdpdGggYQ0KcGFnZSB0aGF0IG1heSBnbyBhd2F5IGR1ZSB0byBtZW1vcnkgcHJlc3N1cmUuICBC
dXQsIGJ0cmZzX2ZpbGVfd3JpdGUNCmFzc3VtZXMgdGhhdCBvbmx5IHRoZSBmaXJzdCBhbmQgbGFz
dCBwYWdlIG1heSBoYXZlIGdvb2QgZGF0YSB0aGF0IG5lZWRzDQp0byBiZSByZWFkIG9mZiB0aGUg
ZGlzay4NCg0KVGhpcyBwYXRjaCBkaXRjaGVzIHRoYXQgY29kZSBhbmQgcHV0cyBpdCBpbnRvIHBy
ZXBhcmVfcGFnZXMgaW5zdGVhZC4NCkJ1dCBJJ20gc3RpbGwgaGF2aW5nIHNvbWUgZXJyb3JzIGR1
cmluZyBsb25nIHN0cmVzcy5zaCBydW5zLiAgSWRlYXMgYXJlDQptb3JlIHRoYW4gd2VsY29tZSwg
aG9wZWZ1bGx5IHNvbWUgb3RoZXIgdGltZXpvbmVzIHdpbGwga2ljayBpbiBpZGVhcw0Kd2hpbGUg
SSBzbGVlcC4NCg0KZGlmZiAtLWdpdCBhL2ZzL2J0cmZzL2ZpbGUuYyBiL2ZzL2J0cmZzL2ZpbGUu
Yw0KaW5kZXggNzA4NDE0MC4uODlhNmEyNiAxMDA2NDQNCi0tLSBhL2ZzL2J0cmZzL2ZpbGUuYw0K
KysrIGIvZnMvYnRyZnMvZmlsZS5jDQpAQCAtNTQsNiArNTQsMTMgQEAgc3RhdGljIG5vaW5saW5l
IGludCBidHJmc19jb3B5X2Zyb21fdXNlcihsb2ZmX3QgcG9zLCBpbnQgbnVtX3BhZ2VzLA0KIAlp
bnQgb2Zmc2V0ID0gcG9zICYgKFBBR0VfQ0FDSEVfU0laRSAtIDEpOw0KIAlpbnQgdG90YWxfY29w
aWVkID0gMDsNCiANCisJaWYgKChqaWZmaWVzICUgMTApID09IDApDQorCQlyZXR1cm4gMDsNCisN
CisJaWYgKChqaWZmaWVzICUgMjUpID09IDApIHsNCisJCXdyaXRlX2J5dGVzIC89IDI7DQorCX0N
CisNCiAJd2hpbGUgKHdyaXRlX2J5dGVzID4gMCkgew0KIAkJc2l6ZV90IGNvdW50ID0gbWluX3Qo
c2l6ZV90LA0KIAkJCQkgICAgIFBBR0VfQ0FDSEVfU0laRSAtIG9mZnNldCwgd3JpdGVfYnl0ZXMp
Ow0KQEAgLTc2Myw2ICs3NzAsMjYgQEAgb3V0Og0KIH0NCiANCiAvKg0KKyAqIG9uIGVycm9yIHdl
IHJldHVybiBhbiB1bmxvY2tlZCBwYWdlIGFuZCB0aGUgZXJyb3IgdmFsdWUNCisgKiBvbiBzdWNj
ZXNzIHdlIHJldHVybiBhIGxvY2tlZCBwYWdlIGFuZCAwDQorICovDQorc3RhdGljIGludCBwcmVw
YXJlX3VwdG9kYXRlX3BhZ2Uoc3RydWN0IHBhZ2UgKnBhZ2UsIHU2NCBwb3MpDQorew0KKwlpbnQg
cmV0ID0gMDsNCisJaWYgKChwb3MgJiAoUEFHRV9DQUNIRV9TSVpFIC0gMSkpICYmICFQYWdlVXB0
b2RhdGUocGFnZSkpIHsNCisJCXJldCA9IGJ0cmZzX3JlYWRwYWdlKE5VTEwsIHBhZ2UpOw0KKwkJ
aWYgKHJldCkNCisJCQlyZXR1cm4gcmV0Ow0KKwkJbG9ja19wYWdlKHBhZ2UpOw0KKwkJaWYgKCFQ
YWdlVXB0b2RhdGUocGFnZSkpIHsNCisJCQl1bmxvY2tfcGFnZShwYWdlKTsNCisJCQlyZXR1cm4g
LUVJTzsNCisJCX0NCisJfQ0KKwlyZXR1cm4gMDsNCit9DQorDQorLyoNCiAgKiB0aGlzIGdldHMg
cGFnZXMgaW50byB0aGUgcGFnZSBjYWNoZSBhbmQgbG9ja3MgdGhlbSBkb3duLCBpdCBhbHNvIHBy
b3Blcmx5DQogICogd2FpdHMgZm9yIGRhdGE9b3JkZXJlZCBleHRlbnRzIHRvIGZpbmlzaCBiZWZv
cmUgYWxsb3dpbmcgdGhlIHBhZ2VzIHRvIGJlDQogICogbW9kaWZpZWQuDQpAQCAtNzc3LDYgKzgw
NCw3IEBAIHN0YXRpYyBub2lubGluZSBpbnQgcHJlcGFyZV9wYWdlcyhzdHJ1Y3QgYnRyZnNfcm9v
dCAqcm9vdCwgc3RydWN0IGZpbGUgKmZpbGUsDQogCXVuc2lnbmVkIGxvbmcgaW5kZXggPSBwb3Mg
Pj4gUEFHRV9DQUNIRV9TSElGVDsNCiAJc3RydWN0IGlub2RlICppbm9kZSA9IGZkZW50cnkoZmls
ZSktPmRfaW5vZGU7DQogCWludCBlcnIgPSAwOw0KKwlpbnQgZmFpbGkgPSAwOw0KIAl1NjQgc3Rh
cnRfcG9zOw0KIAl1NjQgbGFzdF9wb3M7DQogDQpAQCAtNzk0LDE1ICs4MjIsMjQgQEAgYWdhaW46
DQogCWZvciAoaSA9IDA7IGkgPCBudW1fcGFnZXM7IGkrKykgew0KIAkJcGFnZXNbaV0gPSBncmFi
X2NhY2hlX3BhZ2UoaW5vZGUtPmlfbWFwcGluZywgaW5kZXggKyBpKTsNCiAJCWlmICghcGFnZXNb
aV0pIHsNCi0JCQlpbnQgYzsNCi0JCQlmb3IgKGMgPSBpIC0gMTsgYyA+PSAwOyBjLS0pIHsNCi0J
CQkJdW5sb2NrX3BhZ2UocGFnZXNbY10pOw0KLQkJCQlwYWdlX2NhY2hlX3JlbGVhc2UocGFnZXNb
Y10pOw0KLQkJCX0NCi0JCQlyZXR1cm4gLUVOT01FTTsNCisJCQlmYWlsaSA9IGkgLSAxOw0KKwkJ
CWVyciA9IC1FTk9NRU07DQorCQkJZ290byBmYWlsOw0KKwkJfQ0KKw0KKwkJaWYgKGkgPT0gMCkN
CisJCQllcnIgPSBwcmVwYXJlX3VwdG9kYXRlX3BhZ2UocGFnZXNbaV0sIHBvcyk7DQorCQllbHNl
IGlmIChpID09IG51bV9wYWdlcyAtIDEpDQorCQkJZXJyID0gcHJlcGFyZV91cHRvZGF0ZV9wYWdl
KHBhZ2VzW2ldLA0KKwkJCQkJCSAgICBwb3MgKyB3cml0ZV9ieXRlcyk7DQorCQlpZiAoZXJyKSB7
DQorCQkJcGFnZV9jYWNoZV9yZWxlYXNlKHBhZ2VzW2ldKTsNCisJCQlmYWlsaSA9IGkgLSAxOw0K
KwkJCWdvdG8gZmFpbDsNCiAJCX0NCiAJCXdhaXRfb25fcGFnZV93cml0ZWJhY2socGFnZXNbaV0p
Ow0KIAl9DQorCWVyciA9IDA7DQogCWlmIChzdGFydF9wb3MgPCBpbm9kZS0+aV9zaXplKSB7DQog
CQlzdHJ1Y3QgYnRyZnNfb3JkZXJlZF9leHRlbnQgKm9yZGVyZWQ7DQogCQlsb2NrX2V4dGVudF9i
aXRzKCZCVFJGU19JKGlub2RlKS0+aW9fdHJlZSwNCkBAIC04NDIsNiArODc5LDE0IEBAIGFnYWlu
Og0KIAkJV0FSTl9PTighUGFnZUxvY2tlZChwYWdlc1tpXSkpOw0KIAl9DQogCXJldHVybiAwOw0K
K2ZhaWw6DQorCXdoaWxlIChmYWlsaSA+PSAwKSB7DQorCQl1bmxvY2tfcGFnZShwYWdlc1tmYWls
aV0pOw0KKwkJcGFnZV9jYWNoZV9yZWxlYXNlKHBhZ2VzW2ZhaWxpXSk7DQorCQlmYWlsaS0tOw0K
Kwl9DQorCXJldHVybiBlcnI7DQorDQogfQ0KIA0KIHN0YXRpYyBzc2l6ZV90IGJ0cmZzX2ZpbGVf
YWlvX3dyaXRlKHN0cnVjdCBraW9jYiAqaW9jYiwNCkBAIC04NTEsNyArODk2LDYgQEAgc3RhdGlj
IHNzaXplX3QgYnRyZnNfZmlsZV9haW9fd3JpdGUoc3RydWN0IGtpb2NiICppb2NiLA0KIAlzdHJ1
Y3QgZmlsZSAqZmlsZSA9IGlvY2ItPmtpX2ZpbHA7DQogCXN0cnVjdCBpbm9kZSAqaW5vZGUgPSBm
ZGVudHJ5KGZpbGUpLT5kX2lub2RlOw0KIAlzdHJ1Y3QgYnRyZnNfcm9vdCAqcm9vdCA9IEJUUkZT
X0koaW5vZGUpLT5yb290Ow0KLQlzdHJ1Y3QgcGFnZSAqcGlubmVkWzJdOw0KIAlzdHJ1Y3QgcGFn
ZSAqKnBhZ2VzID0gTlVMTDsNCiAJc3RydWN0IGlvdl9pdGVyIGk7DQogCWxvZmZfdCAqcHBvcyA9
ICZpb2NiLT5raV9wb3M7DQpAQCAtODcyLDkgKzkxNiw2IEBAIHN0YXRpYyBzc2l6ZV90IGJ0cmZz
X2ZpbGVfYWlvX3dyaXRlKHN0cnVjdCBraW9jYiAqaW9jYiwNCiAJd2lsbF93cml0ZSA9ICgoZmls
ZS0+Zl9mbGFncyAmIE9fRFNZTkMpIHx8IElTX1NZTkMoaW5vZGUpIHx8DQogCQkgICAgICAoZmls
ZS0+Zl9mbGFncyAmIE9fRElSRUNUKSk7DQogDQotCXBpbm5lZFswXSA9IE5VTEw7DQotCXBpbm5l
ZFsxXSA9IE5VTEw7DQotDQogCXN0YXJ0X3BvcyA9IHBvczsNCiANCiAJdmZzX2NoZWNrX2Zyb3pl
bihpbm9kZS0+aV9zYiwgU0JfRlJFRVpFX1dSSVRFKTsNCkBAIC05NjIsMzIgKzEwMDMsNiBAQCBz
dGF0aWMgc3NpemVfdCBidHJmc19maWxlX2Fpb193cml0ZShzdHJ1Y3Qga2lvY2IgKmlvY2IsDQog
CWZpcnN0X2luZGV4ID0gcG9zID4+IFBBR0VfQ0FDSEVfU0hJRlQ7DQogCWxhc3RfaW5kZXggPSAo
cG9zICsgaW92X2l0ZXJfY291bnQoJmkpKSA+PiBQQUdFX0NBQ0hFX1NISUZUOw0KIA0KLQkvKg0K
LQkgKiB0aGVyZSBhcmUgbG90cyBvZiBiZXR0ZXIgd2F5cyB0byBkbyB0aGlzLCBidXQgdGhpcyBj
b2RlDQotCSAqIG1ha2VzIHN1cmUgdGhlIGZpcnN0IGFuZCBsYXN0IHBhZ2UgaW4gdGhlIGZpbGUg
cmFuZ2UgYXJlDQotCSAqIHVwIHRvIGRhdGUgYW5kIHJlYWR5IGZvciBjb3cNCi0JICovDQotCWlm
ICgocG9zICYgKFBBR0VfQ0FDSEVfU0laRSAtIDEpKSkgew0KLQkJcGlubmVkWzBdID0gZ3JhYl9j
YWNoZV9wYWdlKGlub2RlLT5pX21hcHBpbmcsIGZpcnN0X2luZGV4KTsNCi0JCWlmICghUGFnZVVw
dG9kYXRlKHBpbm5lZFswXSkpIHsNCi0JCQlyZXQgPSBidHJmc19yZWFkcGFnZShOVUxMLCBwaW5u
ZWRbMF0pOw0KLQkJCUJVR19PTihyZXQpOw0KLQkJCXdhaXRfb25fcGFnZV9sb2NrZWQocGlubmVk
WzBdKTsNCi0JCX0gZWxzZSB7DQotCQkJdW5sb2NrX3BhZ2UocGlubmVkWzBdKTsNCi0JCX0NCi0J
fQ0KLQlpZiAoKHBvcyArIGlvdl9pdGVyX2NvdW50KCZpKSkgJiAoUEFHRV9DQUNIRV9TSVpFIC0g
MSkpIHsNCi0JCXBpbm5lZFsxXSA9IGdyYWJfY2FjaGVfcGFnZShpbm9kZS0+aV9tYXBwaW5nLCBs
YXN0X2luZGV4KTsNCi0JCWlmICghUGFnZVVwdG9kYXRlKHBpbm5lZFsxXSkpIHsNCi0JCQlyZXQg
PSBidHJmc19yZWFkcGFnZShOVUxMLCBwaW5uZWRbMV0pOw0KLQkJCUJVR19PTihyZXQpOw0KLQkJ
CXdhaXRfb25fcGFnZV9sb2NrZWQocGlubmVkWzFdKTsNCi0JCX0gZWxzZSB7DQotCQkJdW5sb2Nr
X3BhZ2UocGlubmVkWzFdKTsNCi0JCX0NCi0JfQ0KLQ0KIAl3aGlsZSAoaW92X2l0ZXJfY291bnQo
JmkpID4gMCkgew0KIAkJc2l6ZV90IG9mZnNldCA9IHBvcyAmIChQQUdFX0NBQ0hFX1NJWkUgLSAx
KTsNCiAJCXNpemVfdCB3cml0ZV9ieXRlcyA9IG1pbihpb3ZfaXRlcl9jb3VudCgmaSksDQpAQCAt
MTAyNCw4ICsxMDM5LDEyIEBAIHN0YXRpYyBzc2l6ZV90IGJ0cmZzX2ZpbGVfYWlvX3dyaXRlKHN0
cnVjdCBraW9jYiAqaW9jYiwNCiANCiAJCWNvcGllZCA9IGJ0cmZzX2NvcHlfZnJvbV91c2VyKHBv
cywgbnVtX3BhZ2VzLA0KIAkJCQkJICAgd3JpdGVfYnl0ZXMsIHBhZ2VzLCAmaSk7DQotCQlkaXJ0
eV9wYWdlcyA9IChjb3BpZWQgKyBvZmZzZXQgKyBQQUdFX0NBQ0hFX1NJWkUgLSAxKSA+Pg0KLQkJ
CQlQQUdFX0NBQ0hFX1NISUZUOw0KKwkJaWYgKGNvcGllZCA9PSAwKQ0KKwkJCWRpcnR5X3BhZ2Vz
ID0gMDsNCisJCWVsc2UNCisJCQlkaXJ0eV9wYWdlcyA9IChjb3BpZWQgKyBvZmZzZXQgKw0KKwkJ
CQkgICAgICAgUEFHRV9DQUNIRV9TSVpFIC0gMSkgPj4NCisJCQkJICAgICAgIFBBR0VfQ0FDSEVf
U0hJRlQ7DQogDQogCQlpZiAobnVtX3BhZ2VzID4gZGlydHlfcGFnZXMpIHsNCiAJCQlpZiAoY29w
aWVkID4gMCkNCkBAIC0xMDY5LDEwICsxMDg4LDYgQEAgb3V0Og0KIAkJZXJyID0gcmV0Ow0KIA0K
IAlrZnJlZShwYWdlcyk7DQotCWlmIChwaW5uZWRbMF0pDQotCQlwYWdlX2NhY2hlX3JlbGVhc2Uo
cGlubmVkWzBdKTsNCi0JaWYgKHBpbm5lZFsxXSkNCi0JCXBhZ2VfY2FjaGVfcmVsZWFzZShwaW5u
ZWRbMV0pOw0KIAkqcHBvcyA9IHBvczsNCiANCiAJLyoNCg==

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH] btrfs file write debugging patch
  2011-02-28  8:56   ` Zhong, Xin
@ 2011-02-28 14:02     ` Chris Mason
  0 siblings, 0 replies; 40+ messages in thread
From: Chris Mason @ 2011-02-28 14:02 UTC (permalink / raw)
  To: Zhong, Xin
  Cc: Mitch Harder, Maria Wikström, Johannes Hirte,
	linux-btrfs@vger.kernel.org

Excerpts from Zhong, Xin's message of 2011-02-28 03:56:40 -0500:
> One possible issue I can see is in the random failure case #2 that co=
py_from_user only process half of the data.=20
>=20
> For example, if it write a 4k aligned page and copy_from_user only wr=
ite 2k. Then it will not call btrfs_delalloc_release_space since num_pa=
ges and dirty_pages are both 1.=20
> In the next round, it write another 2k and btrfs_delalloc_reserve_spa=
ce is called twice for the same page.=20
>=20
> Is it a problem? Thanks!

It should be the correct answer.  The result of the short copy_from_use=
r
should be exactly the same as two write calls where one does 2K and the
other does another 2K.

Either way, it shouldn't result in incorrect bytes in the file, which i=
s
still happening for me with the debugging hunks in place.

-chris

>=20
> -----Original Message-----
> From: Chris Mason [mailto:chris.mason@oracle.com]=20
> Sent: Monday, February 28, 2011 9:46 AM
> To: Mitch Harder
> Cc: Maria Wikstr=C3=B6m; Zhong, Xin; Johannes Hirte; linux-btrfs@vger=
=2Ekernel.org
> Subject: [PATCH] btrfs file write debugging patch
>=20
> Excerpts from Mitch Harder's message of 2011-02-25 13:43:37 -0500:
> > Some clarification on my previous message...
> >=20
> > After looking at my ftrace log more closely, I can see where Btrfs =
is
> > trying to release the allocated pages.  However, the calculation fo=
r
> > the number of dirty_pages is equal to 1 when "copied =3D=3D 0".
> >=20
> > So I'm seeing at least two problems:
> > (1)  It keeps looping when "copied =3D=3D 0".
> > (2)  One dirty page is not being released on every loop even though
> > "copied =3D=3D 0" (at least this problem keeps it from being an inf=
inite
> > loop by eventually exhausting reserveable space on the disk).
>=20
> Hi everyone,
>=20
> There are actually tow bugs here.  First the one that Mitch hit, and =
a
> second one that still results in bad file_write results with my
> debugging hunks (the first two hunks below) in place.
>=20
> My patch fixes Mitch's bug by checking for copied =3D=3D 0 after
> btrfs_copy_from_user and going the correct delalloc accounting.  This
> one looks solved, but you'll notice the patch is bigger.
>=20
> First, I add some random failures to btrfs_copy_from_user() by failin=
g
> everyone once and a while.  This was much more reliable than trying t=
o
> use memory pressure than making copy_from_user fail.
>=20
> If copy_from_user fails and we partially update a page, we end up wit=
h a
> page that may go away due to memory pressure.  But, btrfs_file_write
> assumes that only the first and last page may have good data that nee=
ds
> to be read off the disk.
>=20
> This patch ditches that code and puts it into prepare_pages instead.
> But I'm still having some errors during long stress.sh runs.  Ideas a=
re
> more than welcome, hopefully some other timezones will kick in ideas
> while I sleep.
>=20
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 7084140..89a6a26 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -54,6 +54,13 @@ static noinline int btrfs_copy_from_user(loff_t po=
s, int num_pages,
>      int offset =3D pos & (PAGE_CACHE_SIZE - 1);
>      int total_copied =3D 0;
> =20
> +    if ((jiffies % 10) =3D=3D 0)
> +        return 0;
> +
> +    if ((jiffies % 25) =3D=3D 0) {
> +        write_bytes /=3D 2;
> +    }
> +
>      while (write_bytes > 0) {
>          size_t count =3D min_t(size_t,
>                       PAGE_CACHE_SIZE - offset, write_bytes);
> @@ -763,6 +770,26 @@ out:
>  }
> =20
>  /*
> + * on error we return an unlocked page and the error value
> + * on success we return a locked page and 0
> + */
> +static int prepare_uptodate_page(struct page *page, u64 pos)
> +{
> +    int ret =3D 0;
> +    if ((pos & (PAGE_CACHE_SIZE - 1)) && !PageUptodate(page)) {
> +        ret =3D btrfs_readpage(NULL, page);
> +        if (ret)
> +            return ret;
> +        lock_page(page);
> +        if (!PageUptodate(page)) {
> +            unlock_page(page);
> +            return -EIO;
> +        }
> +    }
> +    return 0;
> +}
> +
> +/*
>   * this gets pages into the page cache and locks them down, it also =
properly
>   * waits for data=3Dordered extents to finish before allowing the pa=
ges to be
>   * modified.
> @@ -777,6 +804,7 @@ static noinline int prepare_pages(struct btrfs_ro=
ot *root, struct file *file,
>      unsigned long index =3D pos >> PAGE_CACHE_SHIFT;
>      struct inode *inode =3D fdentry(file)->d_inode;
>      int err =3D 0;
> +    int faili =3D 0;
>      u64 start_pos;
>      u64 last_pos;
> =20
> @@ -794,15 +822,24 @@ again:
>      for (i =3D 0; i < num_pages; i++) {
>          pages[i] =3D grab_cache_page(inode->i_mapping, index + i);
>          if (!pages[i]) {
> -            int c;
> -            for (c =3D i - 1; c >=3D 0; c--) {
> -                unlock_page(pages[c]);
> -                page_cache_release(pages[c]);
> -            }
> -            return -ENOMEM;
> +            faili =3D i - 1;
> +            err =3D -ENOMEM;
> +            goto fail;
> +        }
> +
> +        if (i =3D=3D 0)
> +            err =3D prepare_uptodate_page(pages[i], pos);
> +        else if (i =3D=3D num_pages - 1)
> +            err =3D prepare_uptodate_page(pages[i],
> +                            pos + write_bytes);
> +        if (err) {
> +            page_cache_release(pages[i]);
> +            faili =3D i - 1;
> +            goto fail;
>          }
>          wait_on_page_writeback(pages[i]);
>      }
> +    err =3D 0;
>      if (start_pos < inode->i_size) {
>          struct btrfs_ordered_extent *ordered;
>          lock_extent_bits(&BTRFS_I(inode)->io_tree,
> @@ -842,6 +879,14 @@ again:
>          WARN_ON(!PageLocked(pages[i]));
>      }
>      return 0;
> +fail:
> +    while (faili >=3D 0) {
> +        unlock_page(pages[faili]);
> +        page_cache_release(pages[faili]);
> +        faili--;
> +    }
> +    return err;
> +
>  }
> =20
>  static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
> @@ -851,7 +896,6 @@ static ssize_t btrfs_file_aio_write(struct kiocb =
*iocb,
>      struct file *file =3D iocb->ki_filp;
>      struct inode *inode =3D fdentry(file)->d_inode;
>      struct btrfs_root *root =3D BTRFS_I(inode)->root;
> -    struct page *pinned[2];
>      struct page **pages =3D NULL;
>      struct iov_iter i;
>      loff_t *ppos =3D &iocb->ki_pos;
> @@ -872,9 +916,6 @@ static ssize_t btrfs_file_aio_write(struct kiocb =
*iocb,
>      will_write =3D ((file->f_flags & O_DSYNC) || IS_SYNC(inode) ||
>                (file->f_flags & O_DIRECT));
> =20
> -    pinned[0] =3D NULL;
> -    pinned[1] =3D NULL;
> -
>      start_pos =3D pos;
> =20
>      vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
> @@ -962,32 +1003,6 @@ static ssize_t btrfs_file_aio_write(struct kioc=
b *iocb,
>      first_index =3D pos >> PAGE_CACHE_SHIFT;
>      last_index =3D (pos + iov_iter_count(&i)) >> PAGE_CACHE_SHIFT;
> =20
> -    /*
> -     * there are lots of better ways to do this, but this code
> -     * makes sure the first and last page in the file range are
> -     * up to date and ready for cow
> -     */
> -    if ((pos & (PAGE_CACHE_SIZE - 1))) {
> -        pinned[0] =3D grab_cache_page(inode->i_mapping, first_index)=
;
> -        if (!PageUptodate(pinned[0])) {
> -            ret =3D btrfs_readpage(NULL, pinned[0]);
> -            BUG_ON(ret);
> -            wait_on_page_locked(pinned[0]);
> -        } else {
> -            unlock_page(pinned[0]);
> -        }
> -    }
> -    if ((pos + iov_iter_count(&i)) & (PAGE_CACHE_SIZE - 1)) {
> -        pinned[1] =3D grab_cache_page(inode->i_mapping, last_index);
> -        if (!PageUptodate(pinned[1])) {
> -            ret =3D btrfs_readpage(NULL, pinned[1]);
> -            BUG_ON(ret);
> -            wait_on_page_locked(pinned[1]);
> -        } else {
> -            unlock_page(pinned[1]);
> -        }
> -    }
> -
>      while (iov_iter_count(&i) > 0) {
>          size_t offset =3D pos & (PAGE_CACHE_SIZE - 1);
>          size_t write_bytes =3D min(iov_iter_count(&i),
> @@ -1024,8 +1039,12 @@ static ssize_t btrfs_file_aio_write(struct kio=
cb *iocb,
> =20
>          copied =3D btrfs_copy_from_user(pos, num_pages,
>                         write_bytes, pages, &i);
> -        dirty_pages =3D (copied + offset + PAGE_CACHE_SIZE - 1) >>
> -                PAGE_CACHE_SHIFT;
> +        if (copied =3D=3D 0)
> +            dirty_pages =3D 0;
> +        else
> +            dirty_pages =3D (copied + offset +
> +                       PAGE_CACHE_SIZE - 1) >>
> +                       PAGE_CACHE_SHIFT;
> =20
>          if (num_pages > dirty_pages) {
>              if (copied > 0)
> @@ -1069,10 +1088,6 @@ out:
>          err =3D ret;
> =20
>      kfree(pages);
> -    if (pinned[0])
> -        page_cache_release(pinned[0]);
> -    if (pinned[1])
> -        page_cache_release(pinned[1]);
>      *ppos =3D pos;
> =20
>      /*
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-02-28  1:46 ` [PATCH] btrfs file write debugging patch Chris Mason
  2011-02-28  8:56   ` Zhong, Xin
@ 2011-02-28 10:13   ` Johannes Hirte
  2011-02-28 14:00     ` Chris Mason
  2011-02-28 16:10     ` Josef Bacik
  1 sibling, 2 replies; 40+ messages in thread
From: Johannes Hirte @ 2011-02-28 10:13 UTC (permalink / raw)
  To: Chris Mason
  Cc: Mitch Harder, Maria Wikström, Zhong, Xin,
	linux-btrfs@vger.kernel.org

On Monday 28 February 2011 02:46:05 Chris Mason wrote:
> Excerpts from Mitch Harder's message of 2011-02-25 13:43:37 -0500:
> > Some clarification on my previous message...
> > 
> > After looking at my ftrace log more closely, I can see where Btrfs is
> > trying to release the allocated pages.  However, the calculation for
> > the number of dirty_pages is equal to 1 when "copied == 0".
> > 
> > So I'm seeing at least two problems:
> > (1)  It keeps looping when "copied == 0".
> > (2)  One dirty page is not being released on every loop even though
> > "copied == 0" (at least this problem keeps it from being an infinite
> > loop by eventually exhausting reserveable space on the disk).
> 
> Hi everyone,
> 
> There are actually tow bugs here.  First the one that Mitch hit, and a
> second one that still results in bad file_write results with my
> debugging hunks (the first two hunks below) in place.
> 
> My patch fixes Mitch's bug by checking for copied == 0 after
> btrfs_copy_from_user and going the correct delalloc accounting.  This
> one looks solved, but you'll notice the patch is bigger.
> 
> First, I add some random failures to btrfs_copy_from_user() by failing
> everyone once and a while.  This was much more reliable than trying to
> use memory pressure than making copy_from_user fail.
> 
> If copy_from_user fails and we partially update a page, we end up with a
> page that may go away due to memory pressure.  But, btrfs_file_write
> assumes that only the first and last page may have good data that needs
> to be read off the disk.
> 
> This patch ditches that code and puts it into prepare_pages instead.
> But I'm still having some errors during long stress.sh runs.  Ideas are
> more than welcome, hopefully some other timezones will kick in ideas
> while I sleep.

At least it doesn't fix the emerge-problem for me. The behavior is now the same 
as with 2.6.38-rc3. It needs a 'emerge --oneshot dev-libs/libgcrypt' with no 
further interaction to get the emerge-process hang with a svn-process 
consuming 100% CPU. I can cancel the emerge-process with ctrl-c but the 
spawned svn-process stays and it needs a reboot to get rid of it. 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-02-28 10:13   ` Johannes Hirte
@ 2011-02-28 14:00     ` Chris Mason
  2011-02-28 16:10     ` Josef Bacik
  1 sibling, 0 replies; 40+ messages in thread
From: Chris Mason @ 2011-02-28 14:00 UTC (permalink / raw)
  To: Johannes Hirte
  Cc: Mitch Harder, Maria Wikström, Zhong, Xin,
	linux-btrfs@vger.kernel.org

Excerpts from Johannes Hirte's message of 2011-02-28 05:13:59 -0500:
> On Monday 28 February 2011 02:46:05 Chris Mason wrote:
> > Excerpts from Mitch Harder's message of 2011-02-25 13:43:37 -0500:
> > > Some clarification on my previous message...
> > > 
> > > After looking at my ftrace log more closely, I can see where Btrfs is
> > > trying to release the allocated pages.  However, the calculation for
> > > the number of dirty_pages is equal to 1 when "copied == 0".
> > > 
> > > So I'm seeing at least two problems:
> > > (1)  It keeps looping when "copied == 0".
> > > (2)  One dirty page is not being released on every loop even though
> > > "copied == 0" (at least this problem keeps it from being an infinite
> > > loop by eventually exhausting reserveable space on the disk).
> > 
> > Hi everyone,
> > 
> > There are actually tow bugs here.  First the one that Mitch hit, and a
> > second one that still results in bad file_write results with my
> > debugging hunks (the first two hunks below) in place.
> > 
> > My patch fixes Mitch's bug by checking for copied == 0 after
> > btrfs_copy_from_user and going the correct delalloc accounting.  This
> > one looks solved, but you'll notice the patch is bigger.
> > 
> > First, I add some random failures to btrfs_copy_from_user() by failing
> > everyone once and a while.  This was much more reliable than trying to
> > use memory pressure than making copy_from_user fail.
> > 
> > If copy_from_user fails and we partially update a page, we end up with a
> > page that may go away due to memory pressure.  But, btrfs_file_write
> > assumes that only the first and last page may have good data that needs
> > to be read off the disk.
> > 
> > This patch ditches that code and puts it into prepare_pages instead.
> > But I'm still having some errors during long stress.sh runs.  Ideas are
> > more than welcome, hopefully some other timezones will kick in ideas
> > while I sleep.
> 
> At least it doesn't fix the emerge-problem for me. The behavior is now the same 
> as with 2.6.38-rc3. It needs a 'emerge --oneshot dev-libs/libgcrypt' with no 
> further interaction to get the emerge-process hang with a svn-process 
> consuming 100% CPU. I can cancel the emerge-process with ctrl-c but the 
> spawned svn-process stays and it needs a reboot to get rid of it. 

I think your problem really is more enospc related.  Still working on
that as well.  But please don't try the patch without removing the
debugging hunk at the top (anything that mentions jiffies).

-chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-02-28 10:13   ` Johannes Hirte
  2011-02-28 14:00     ` Chris Mason
@ 2011-02-28 16:10     ` Josef Bacik
  2011-02-28 16:45       ` Maria Wikström
  1 sibling, 1 reply; 40+ messages in thread
From: Josef Bacik @ 2011-02-28 16:10 UTC (permalink / raw)
  To: Johannes Hirte
  Cc: Chris Mason, Mitch Harder, Maria Wikström, Zhong, Xin,
	linux-btrfs@vger.kernel.org

On Mon, Feb 28, 2011 at 11:13:59AM +0100, Johannes Hirte wrote:
> On Monday 28 February 2011 02:46:05 Chris Mason wrote:
> > Excerpts from Mitch Harder's message of 2011-02-25 13:43:37 -0500:
> > > Some clarification on my previous message...
> > > 
> > > After looking at my ftrace log more closely, I can see where Btrfs is
> > > trying to release the allocated pages.  However, the calculation for
> > > the number of dirty_pages is equal to 1 when "copied == 0".
> > > 
> > > So I'm seeing at least two problems:
> > > (1)  It keeps looping when "copied == 0".
> > > (2)  One dirty page is not being released on every loop even though
> > > "copied == 0" (at least this problem keeps it from being an infinite
> > > loop by eventually exhausting reserveable space on the disk).
> > 
> > Hi everyone,
> > 
> > There are actually tow bugs here.  First the one that Mitch hit, and a
> > second one that still results in bad file_write results with my
> > debugging hunks (the first two hunks below) in place.
> > 
> > My patch fixes Mitch's bug by checking for copied == 0 after
> > btrfs_copy_from_user and going the correct delalloc accounting.  This
> > one looks solved, but you'll notice the patch is bigger.
> > 
> > First, I add some random failures to btrfs_copy_from_user() by failing
> > everyone once and a while.  This was much more reliable than trying to
> > use memory pressure than making copy_from_user fail.
> > 
> > If copy_from_user fails and we partially update a page, we end up with a
> > page that may go away due to memory pressure.  But, btrfs_file_write
> > assumes that only the first and last page may have good data that needs
> > to be read off the disk.
> > 
> > This patch ditches that code and puts it into prepare_pages instead.
> > But I'm still having some errors during long stress.sh runs.  Ideas are
> > more than welcome, hopefully some other timezones will kick in ideas
> > while I sleep.
> 
> At least it doesn't fix the emerge-problem for me. The behavior is now the same 
> as with 2.6.38-rc3. It needs a 'emerge --oneshot dev-libs/libgcrypt' with no 
> further interaction to get the emerge-process hang with a svn-process 
> consuming 100% CPU. I can cancel the emerge-process with ctrl-c but the 
> spawned svn-process stays and it needs a reboot to get rid of it. 

Can you cat /proc/$pid/wchan a few times so we can get an idea of where it's
looping?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-02-28 16:10     ` Josef Bacik
@ 2011-02-28 16:45       ` Maria Wikström
  2011-02-28 17:47         ` Mitch Harder
  0 siblings, 1 reply; 40+ messages in thread
From: Maria Wikström @ 2011-02-28 16:45 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Johannes Hirte, Chris Mason, Mitch Harder, Zhong, Xin,
	linux-btrfs@vger.kernel.org

m=C3=A5n 2011-02-28 klockan 11:10 -0500 skrev Josef Bacik:
> On Mon, Feb 28, 2011 at 11:13:59AM +0100, Johannes Hirte wrote:
> > On Monday 28 February 2011 02:46:05 Chris Mason wrote:
> > > Excerpts from Mitch Harder's message of 2011-02-25 13:43:37 -0500=
:
> > > > Some clarification on my previous message...
> > > >=20
> > > > After looking at my ftrace log more closely, I can see where Bt=
rfs is
> > > > trying to release the allocated pages.  However, the calculatio=
n for
> > > > the number of dirty_pages is equal to 1 when "copied =3D=3D 0".
> > > >=20
> > > > So I'm seeing at least two problems:
> > > > (1)  It keeps looping when "copied =3D=3D 0".
> > > > (2)  One dirty page is not being released on every loop even th=
ough
> > > > "copied =3D=3D 0" (at least this problem keeps it from being an=
 infinite
> > > > loop by eventually exhausting reserveable space on the disk).
> > >=20
> > > Hi everyone,
> > >=20
> > > There are actually tow bugs here.  First the one that Mitch hit, =
and a
> > > second one that still results in bad file_write results with my
> > > debugging hunks (the first two hunks below) in place.
> > >=20
> > > My patch fixes Mitch's bug by checking for copied =3D=3D 0 after
> > > btrfs_copy_from_user and going the correct delalloc accounting.  =
This
> > > one looks solved, but you'll notice the patch is bigger.
> > >=20
> > > First, I add some random failures to btrfs_copy_from_user() by fa=
iling
> > > everyone once and a while.  This was much more reliable than tryi=
ng to
> > > use memory pressure than making copy_from_user fail.
> > >=20
> > > If copy_from_user fails and we partially update a page, we end up=
 with a
> > > page that may go away due to memory pressure.  But, btrfs_file_wr=
ite
> > > assumes that only the first and last page may have good data that=
 needs
> > > to be read off the disk.
> > >=20
> > > This patch ditches that code and puts it into prepare_pages inste=
ad.
> > > But I'm still having some errors during long stress.sh runs.  Ide=
as are
> > > more than welcome, hopefully some other timezones will kick in id=
eas
> > > while I sleep.
> >=20
> > At least it doesn't fix the emerge-problem for me. The behavior is =
now the same=20
> > as with 2.6.38-rc3. It needs a 'emerge --oneshot dev-libs/libgcrypt=
' with no=20
> > further interaction to get the emerge-process hang with a svn-proce=
ss=20
> > consuming 100% CPU. I can cancel the emerge-process with ctrl-c but=
 the=20
> > spawned svn-process stays and it needs a reboot to get rid of it.=20
>=20
> Can you cat /proc/$pid/wchan a few times so we can get an idea of whe=
re it's
> looping?  Thanks,
>=20
> Josef

It behaves the same way here with btrfs-unstable.
The output of "cat /proc/$pid/wchan" is 0.=20

// Maria

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs=
" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>=20


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-02-28 16:45       ` Maria Wikström
@ 2011-02-28 17:47         ` Mitch Harder
  2011-02-28 20:20           ` Mitch Harder
  0 siblings, 1 reply; 40+ messages in thread
From: Mitch Harder @ 2011-02-28 17:47 UTC (permalink / raw)
  To: Maria Wikström
  Cc: Josef Bacik, Johannes Hirte, Chris Mason, Zhong, Xin,
	linux-btrfs@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 3729 bytes --]

2011/2/28 Maria Wikström <maria@ponstudios.se>:
> mån 2011-02-28 klockan 11:10 -0500 skrev Josef Bacik:
>> On Mon, Feb 28, 2011 at 11:13:59AM +0100, Johannes Hirte wrote:
>> > On Monday 28 February 2011 02:46:05 Chris Mason wrote:
>> > > Excerpts from Mitch Harder's message of 2011-02-25 13:43:37 -0500:
>> > > > Some clarification on my previous message...
>> > > >
>> > > > After looking at my ftrace log more closely, I can see where Btrfs is
>> > > > trying to release the allocated pages.  However, the calculation for
>> > > > the number of dirty_pages is equal to 1 when "copied == 0".
>> > > >
>> > > > So I'm seeing at least two problems:
>> > > > (1)  It keeps looping when "copied == 0".
>> > > > (2)  One dirty page is not being released on every loop even though
>> > > > "copied == 0" (at least this problem keeps it from being an infinite
>> > > > loop by eventually exhausting reserveable space on the disk).
>> > >
>> > > Hi everyone,
>> > >
>> > > There are actually tow bugs here.  First the one that Mitch hit, and a
>> > > second one that still results in bad file_write results with my
>> > > debugging hunks (the first two hunks below) in place.
>> > >
>> > > My patch fixes Mitch's bug by checking for copied == 0 after
>> > > btrfs_copy_from_user and going the correct delalloc accounting.  This
>> > > one looks solved, but you'll notice the patch is bigger.
>> > >
>> > > First, I add some random failures to btrfs_copy_from_user() by failing
>> > > everyone once and a while.  This was much more reliable than trying to
>> > > use memory pressure than making copy_from_user fail.
>> > >
>> > > If copy_from_user fails and we partially update a page, we end up with a
>> > > page that may go away due to memory pressure.  But, btrfs_file_write
>> > > assumes that only the first and last page may have good data that needs
>> > > to be read off the disk.
>> > >
>> > > This patch ditches that code and puts it into prepare_pages instead.
>> > > But I'm still having some errors during long stress.sh runs.  Ideas are
>> > > more than welcome, hopefully some other timezones will kick in ideas
>> > > while I sleep.
>> >
>> > At least it doesn't fix the emerge-problem for me. The behavior is now the same
>> > as with 2.6.38-rc3. It needs a 'emerge --oneshot dev-libs/libgcrypt' with no
>> > further interaction to get the emerge-process hang with a svn-process
>> > consuming 100% CPU. I can cancel the emerge-process with ctrl-c but the
>> > spawned svn-process stays and it needs a reboot to get rid of it.
>>
>> Can you cat /proc/$pid/wchan a few times so we can get an idea of where it's
>> looping?  Thanks,
>>
>> Josef
>
> It behaves the same way here with btrfs-unstable.
> The output of "cat /proc/$pid/wchan" is 0.
>
> // Maria
>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
>
>

I've applied the patch at the head of this thread (with the jiffies
debugging commented out) and I'm attaching a ftrace using the
function_graph tracer when I'm stuck in the loop.  I've just snipped
out a couple of the loops (the full trace file is quite large, and
mostly repititious).

I'm going to try to modify file.c with some trace_printk debugging to
show the values of several of the relevant variables at various
stages.

I'm going to try to exit the loop after 256 tries with an EFAULT so I
can stop the tracing at that point and capture a trace of the entry
into the problem (the ftrace ring buffer fills up too fast for me to
capture the entry point).

[-- Attachment #2: ftrace-btrfs-file-write-debugging.gz --]
[-- Type: application/x-gzip, Size: 2590 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-02-28 17:47         ` Mitch Harder
@ 2011-02-28 20:20           ` Mitch Harder
  2011-03-01  5:09             ` Mitch Harder
                               ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Mitch Harder @ 2011-02-28 20:20 UTC (permalink / raw)
  To: Maria Wikström
  Cc: Josef Bacik, Johannes Hirte, Chris Mason, Zhong, Xin,
	linux-btrfs@vger.kernel.org

2011/2/28 Mitch Harder <mitch.harder@sabayonlinux.org>:
> 2011/2/28 Maria Wikstr=F6m <maria@ponstudios.se>:
>> m=E5n 2011-02-28 klockan 11:10 -0500 skrev Josef Bacik:
>>> On Mon, Feb 28, 2011 at 11:13:59AM +0100, Johannes Hirte wrote:
>>> > On Monday 28 February 2011 02:46:05 Chris Mason wrote:
>>> > > Excerpts from Mitch Harder's message of 2011-02-25 13:43:37 -05=
00:
>>> > > > Some clarification on my previous message...
>>> > > >
>>> > > > After looking at my ftrace log more closely, I can see where =
Btrfs is
>>> > > > trying to release the allocated pages. =A0However, the calcul=
ation for
>>> > > > the number of dirty_pages is equal to 1 when "copied =3D=3D 0=
".
>>> > > >
>>> > > > So I'm seeing at least two problems:
>>> > > > (1) =A0It keeps looping when "copied =3D=3D 0".
>>> > > > (2) =A0One dirty page is not being released on every loop eve=
n though
>>> > > > "copied =3D=3D 0" (at least this problem keeps it from being =
an infinite
>>> > > > loop by eventually exhausting reserveable space on the disk).
>>> > >
>>> > > Hi everyone,
>>> > >
>>> > > There are actually tow bugs here. =A0First the one that Mitch h=
it, and a
>>> > > second one that still results in bad file_write results with my
>>> > > debugging hunks (the first two hunks below) in place.
>>> > >
>>> > > My patch fixes Mitch's bug by checking for copied =3D=3D 0 afte=
r
>>> > > btrfs_copy_from_user and going the correct delalloc accounting.=
 =A0This
>>> > > one looks solved, but you'll notice the patch is bigger.
>>> > >
>>> > > First, I add some random failures to btrfs_copy_from_user() by =
failing
>>> > > everyone once and a while. =A0This was much more reliable than =
trying to
>>> > > use memory pressure than making copy_from_user fail.
>>> > >
>>> > > If copy_from_user fails and we partially update a page, we end =
up with a
>>> > > page that may go away due to memory pressure. =A0But, btrfs_fil=
e_write
>>> > > assumes that only the first and last page may have good data th=
at needs
>>> > > to be read off the disk.
>>> > >
>>> > > This patch ditches that code and puts it into prepare_pages ins=
tead.
>>> > > But I'm still having some errors during long stress.sh runs. =A0=
Ideas are
>>> > > more than welcome, hopefully some other timezones will kick in =
ideas
>>> > > while I sleep.
>>> >
>>> > At least it doesn't fix the emerge-problem for me. The behavior i=
s now the same
>>> > as with 2.6.38-rc3. It needs a 'emerge --oneshot dev-libs/libgcry=
pt' with no
>>> > further interaction to get the emerge-process hang with a svn-pro=
cess
>>> > consuming 100% CPU. I can cancel the emerge-process with ctrl-c b=
ut the
>>> > spawned svn-process stays and it needs a reboot to get rid of it.
>>>
>>> Can you cat /proc/$pid/wchan a few times so we can get an idea of w=
here it's
>>> looping? =A0Thanks,
>>>
>>> Josef
>>
>> It behaves the same way here with btrfs-unstable.
>> The output of "cat /proc/$pid/wchan" is 0.
>>
>> // Maria
>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btr=
fs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at =A0http://vger.kernel.org/majordomo-info.htm=
l
>>>
>>
>>
>>
>
> I've applied the patch at the head of this thread (with the jiffies
> debugging commented out) and I'm attaching a ftrace using the
> function_graph tracer when I'm stuck in the loop. =A0I've just snippe=
d
> out a couple of the loops (the full trace file is quite large, and
> mostly repititious).
>
> I'm going to try to modify file.c with some trace_printk debugging to
> show the values of several of the relevant variables at various
> stages.
>
> I'm going to try to exit the loop after 256 tries with an EFAULT so I
> can stop the tracing at that point and capture a trace of the entry
> into the problem (the ftrace ring buffer fills up too fast for me to
> capture the entry point).
>

As promised, I'm put together a modified file.c with many trace_printk
debugging statements to augment the ftrace.

The trace is ~128K compressed (about 31,600 lines or 2.6MB
uncompressed), so I'm putting it up on my local server instead of
attaching.  Let me know if it would be more appropriate to send to the
list as an attachment.

http://dontpanic.dyndns.org/ftrace-btrfs-file-write-debug-v2.gz

I preface all my trace_printk comments with "TPK:" to make skipping
through the trace easier.

The trace contains the trace of about 3 or 4 successful passes through
the btrfs_file_aio_write() function to show what a successful trace
looks like.

The pass through the btrfs_file_aio_write() that breaks begins on line =
1088.

I let it loop through the while (iov_iter_count(&i) > 0) {} loop for
256 times when copied=3D=3D0 (otherwise it would loop infinitely).  The=
n
exit out and stop the trace.

=46or reference, here's a diff file for the debugging statements I've
added to file.c:

Let me know if you would like me to re-run this trial with different
debugging lines.

 fs/btrfs/file.c /usr/src/linux/fs/btrfs/file.c
--- fs/btrfs/file.c	2011-02-28 10:13:45.000000000 -0600
+++ /usr/src/linux/fs/btrfs/file.c	2011-02-28 13:07:11.000000000 -0600
@@ -53,12 +53,14 @@
 	int offset =3D pos & (PAGE_CACHE_SIZE - 1);
 	int total_copied =3D 0;

+	/***************************
 	if ((jiffies % 10) =3D=3D 0)
 		return 0;

 	if ((jiffies % 25) =3D=3D 0) {
 		write_bytes /=3D 2;
 	}
+	**************************/

 	while (write_bytes > 0) {
 		size_t count =3D min_t(size_t,
@@ -82,10 +84,13 @@

 		/* Return to btrfs_file_aio_write to fault page */
 		if (unlikely(copied =3D=3D 0)) {
+			trace_printk("TPK: unlikely copied =3D=3D 0 in btrfs_copy_from_user
(total_copied=3D%i)\n",
+				     total_copied);
 			break;
 		}

 		if (unlikely(copied < PAGE_CACHE_SIZE - offset)) {
+			trace_printk("TPK: unlikely copied < PAGE_CACHE_SIZE - offset in
btrfs_copy_from_user\n");
 			offset +=3D copied;
 		} else {
 			pg++;
@@ -910,8 +915,13 @@
 	int will_write;
 	int buffered =3D 0;
 	int copied =3D 0;
+	int copied_problem =3D 0;
+	int copied_loop_count =3D 0;
+	int stop_ftrace =3D 0;
 	int dirty_pages =3D 0;

+	trace_printk("TPK: Entering btrfs_file_aio_write()\n");
+
 	will_write =3D ((file->f_flags & O_DSYNC) || IS_SYNC(inode) ||
 		      (file->f_flags & O_DIRECT));

@@ -953,6 +963,7 @@
 	BTRFS_I(inode)->sequence++;

 	if (unlikely(file->f_flags & O_DIRECT)) {
+		trace_printk("TPK: transferred to unlikely(file->f_flags \& O_DIRECT=
)\n");
 		num_written =3D generic_file_direct_write(iocb, iov, &nr_segs,
 							pos, ppos, count,
 							ocount);
@@ -984,6 +995,8 @@
 		 */
 		buffered =3D 1;
 		pos +=3D num_written;
+		trace_printk("TPK: end unlikely(file->f_flags \& O_DIRECT) with
num_written=3D%i\n",
+			     num_written);
 	}

 	iov_iter_init(&i, iov, nr_segs, count, num_written);
@@ -1003,6 +1016,8 @@
 	last_index =3D (pos + iov_iter_count(&i)) >> PAGE_CACHE_SHIFT;

 	while (iov_iter_count(&i) > 0) {
+		trace_printk("TPK: start section while (iov_iter_count() > 0)\n");
+
 		size_t offset =3D pos & (PAGE_CACHE_SIZE - 1);
 		size_t write_bytes =3D min(iov_iter_count(&i),
 					 nrptrs * (size_t)PAGE_CACHE_SIZE -
@@ -1010,6 +1025,9 @@
 		size_t num_pages =3D (write_bytes + offset +
 				    PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;

+		trace_printk("TPK: iov_iter_count()=3D%i || nr_segs=3D%lu || nrptrs=3D=
%i
|| offset=3D%lu || write_bytes=3D%lu || num_pages=3D%lu\n",
+						iov_iter_count(&i), nr_segs, nrptrs, offset, write_bytes, num_pa=
ges);
+
 		WARN_ON(num_pages > nrptrs);
 		memset(pages, 0, sizeof(struct page *) * nrptrs);

@@ -1022,6 +1040,19 @@
 			goto out;
 		}

+		if (unlikely(copied_problem)) {
+			copied_loop_count++;
+			trace_printk("TPK: copied problem(%i)\n",
+				     copied_loop_count);
+			/* Give up if we've already tried 256 times */
+			if (copied_loop_count > 256) {
+				ret =3D -EFAULT;
+				stop_ftrace =3D 1;
+				trace_printk("TPK: copied loop count exceeded, returning EFAULT...=
=2E\n");
+				goto out;
+			}
+		}
+
 		ret =3D btrfs_delalloc_reserve_space(inode,
 					num_pages << PAGE_CACHE_SHIFT);
 		if (ret)
@@ -1045,6 +1076,10 @@
 				       PAGE_CACHE_SIZE - 1) >>
 				       PAGE_CACHE_SHIFT;

+		if (copied =3D=3D 0) {
+			copied_problem =3D 1;
+		}
+
 		if (num_pages > dirty_pages) {
 			if (copied > 0)
 				atomic_inc(
@@ -1080,11 +1115,19 @@
 		num_written +=3D copied;

 		cond_resched();
+		trace_printk("TPK: end section while (iov_iter_count() > 0)\n");
+		trace_printk(" copied=3D%i || dirty_pages=3D%i || num_written=3D%i |=
|
ending iov_iter_count=3D%i\n",
+						copied, dirty_pages, num_written, iov_iter_count(&i) );
 	}
 out:
+	trace_printk("TPK: Reached: out:\n");
+
 	mutex_unlock(&inode->i_mutex);
-	if (ret)
+	if (ret) {
 		err =3D ret;
+		trace_printk("TPK: ret,err had value %i when out: was reached
(num_written: %i)\n",
+			     ret, num_written);
+	}

 	kfree(pages);
 	*ppos =3D pos;
@@ -1140,6 +1183,19 @@
 	}
 done:
 	current->backing_dev_info =3D NULL;
+	if (ret) {
+		trace_printk("TPK: btrfs_file_aio_write exiting with non-zero ret
value (%i)\n", ret);
+		trace_printk("TPK: Returning num_written (%i) ? num_written (%i) :
err (%i) =3D %i\n",
+				num_written, num_written, err, num_written ? num_written : err);
+	} else {
+		trace_printk("TPK: btrfs_file_aio_write exiting normally with (%i)",
+			     num_written ? num_written : err);
+	}
+=09
+	if (unlikely(stop_ftrace)) {
+		tracing_off();
+	}
+
 	return num_written ? num_written : err;
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-02-28 20:20           ` Mitch Harder
@ 2011-03-01  5:09             ` Mitch Harder
  2011-03-01 10:14             ` Zhong, Xin
  2011-03-01 21:56             ` Piotr Szymaniak
  2 siblings, 0 replies; 40+ messages in thread
From: Mitch Harder @ 2011-03-01  5:09 UTC (permalink / raw)
  To: Maria Wikström
  Cc: Josef Bacik, Johannes Hirte, Chris Mason, Zhong, Xin,
	linux-btrfs@vger.kernel.org

2011/2/28 Mitch Harder <mitch.harder@sabayonlinux.org>:
> 2011/2/28 Mitch Harder <mitch.harder@sabayonlinux.org>:
>> 2011/2/28 Maria Wikstr=F6m <maria@ponstudios.se>:
>>> m=E5n 2011-02-28 klockan 11:10 -0500 skrev Josef Bacik:
>>>> On Mon, Feb 28, 2011 at 11:13:59AM +0100, Johannes Hirte wrote:
>>>> > On Monday 28 February 2011 02:46:05 Chris Mason wrote:
>>>> > > Excerpts from Mitch Harder's message of 2011-02-25 13:43:37 -0=
500:
>>>> > > > Some clarification on my previous message...
>>>> > > >
>>>> > > > After looking at my ftrace log more closely, I can see where=
 Btrfs is
>>>> > > > trying to release the allocated pages. =A0However, the calcu=
lation for
>>>> > > > the number of dirty_pages is equal to 1 when "copied =3D=3D =
0".
>>>> > > >
>>>> > > > So I'm seeing at least two problems:
>>>> > > > (1) =A0It keeps looping when "copied =3D=3D 0".
>>>> > > > (2) =A0One dirty page is not being released on every loop ev=
en though
>>>> > > > "copied =3D=3D 0" (at least this problem keeps it from being=
 an infinite
>>>> > > > loop by eventually exhausting reserveable space on the disk)=
=2E
>>>> > >
>>>> > > Hi everyone,
>>>> > >
>>>> > > There are actually tow bugs here. =A0First the one that Mitch =
hit, and a
>>>> > > second one that still results in bad file_write results with m=
y
>>>> > > debugging hunks (the first two hunks below) in place.
>>>> > >
>>>> > > My patch fixes Mitch's bug by checking for copied =3D=3D 0 aft=
er
>>>> > > btrfs_copy_from_user and going the correct delalloc accounting=
=2E =A0This
>>>> > > one looks solved, but you'll notice the patch is bigger.
>>>> > >
>>>> > > First, I add some random failures to btrfs_copy_from_user() by=
 failing
>>>> > > everyone once and a while. =A0This was much more reliable than=
 trying to
>>>> > > use memory pressure than making copy_from_user fail.
>>>> > >
>>>> > > If copy_from_user fails and we partially update a page, we end=
 up with a
>>>> > > page that may go away due to memory pressure. =A0But, btrfs_fi=
le_write
>>>> > > assumes that only the first and last page may have good data t=
hat needs
>>>> > > to be read off the disk.
>>>> > >
>>>> > > This patch ditches that code and puts it into prepare_pages in=
stead.
>>>> > > But I'm still having some errors during long stress.sh runs. =A0=
Ideas are
>>>> > > more than welcome, hopefully some other timezones will kick in=
 ideas
>>>> > > while I sleep.
>>>> >
>>>> > At least it doesn't fix the emerge-problem for me. The behavior =
is now the same
>>>> > as with 2.6.38-rc3. It needs a 'emerge --oneshot dev-libs/libgcr=
ypt' with no
>>>> > further interaction to get the emerge-process hang with a svn-pr=
ocess
>>>> > consuming 100% CPU. I can cancel the emerge-process with ctrl-c =
but the
>>>> > spawned svn-process stays and it needs a reboot to get rid of it=
=2E
>>>>
>>>> Can you cat /proc/$pid/wchan a few times so we can get an idea of =
where it's
>>>> looping? =A0Thanks,
>>>>
>>>> Josef
>>>
>>> It behaves the same way here with btrfs-unstable.
>>> The output of "cat /proc/$pid/wchan" is 0.
>>>
>>> // Maria
>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-bt=
rfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at =A0http://vger.kernel.org/majordomo-info.ht=
ml
>>>>
>>>
>>>
>>>
>>
>> I've applied the patch at the head of this thread (with the jiffies
>> debugging commented out) and I'm attaching a ftrace using the
>> function_graph tracer when I'm stuck in the loop. =A0I've just snipp=
ed
>> out a couple of the loops (the full trace file is quite large, and
>> mostly repititious).
>>
>> I'm going to try to modify file.c with some trace_printk debugging t=
o
>> show the values of several of the relevant variables at various
>> stages.
>>
>> I'm going to try to exit the loop after 256 tries with an EFAULT so =
I
>> can stop the tracing at that point and capture a trace of the entry
>> into the problem (the ftrace ring buffer fills up too fast for me to
>> capture the entry point).
>>
>
> As promised, I'm put together a modified file.c with many trace_print=
k
> debugging statements to augment the ftrace.
>
> The trace is ~128K compressed (about 31,600 lines or 2.6MB
> uncompressed), so I'm putting it up on my local server instead of
> attaching. =A0Let me know if it would be more appropriate to send to =
the
> list as an attachment.
>
> http://dontpanic.dyndns.org/ftrace-btrfs-file-write-debug-v2.gz
>
> I preface all my trace_printk comments with "TPK:" to make skipping
> through the trace easier.
>
> The trace contains the trace of about 3 or 4 successful passes throug=
h
> the btrfs_file_aio_write() function to show what a successful trace
> looks like.
>
> The pass through the btrfs_file_aio_write() that breaks begins on lin=
e 1088.
>
> I let it loop through the while (iov_iter_count(&i) > 0) {} loop for
> 256 times when copied=3D=3D0 (otherwise it would loop infinitely). =A0=
Then
> exit out and stop the trace.
>
> For reference, here's a diff file for the debugging statements I've
> added to file.c:
>
> Let me know if you would like me to re-run this trial with different
> debugging lines.
>
> =A0fs/btrfs/file.c /usr/src/linux/fs/btrfs/file.c
> --- fs/btrfs/file.c =A0 =A0 2011-02-28 10:13:45.000000000 -0600
> +++ /usr/src/linux/fs/btrfs/file.c =A0 =A0 =A02011-02-28 13:07:11.000=
000000 -0600
> @@ -53,12 +53,14 @@
> =A0 =A0 =A0 =A0int offset =3D pos & (PAGE_CACHE_SIZE - 1);
> =A0 =A0 =A0 =A0int total_copied =3D 0;
>
> + =A0 =A0 =A0 /***************************
> =A0 =A0 =A0 =A0if ((jiffies % 10) =3D=3D 0)
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0return 0;
>
> =A0 =A0 =A0 =A0if ((jiffies % 25) =3D=3D 0) {
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0write_bytes /=3D 2;
> =A0 =A0 =A0 =A0}
> + =A0 =A0 =A0 **************************/
>
> =A0 =A0 =A0 =A0while (write_bytes > 0) {
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0size_t count =3D min_t(size_t,
> @@ -82,10 +84,13 @@
>
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0/* Return to btrfs_file_aio_write to f=
ault page */
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0if (unlikely(copied =3D=3D 0)) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 trace_printk("TPK: unli=
kely copied =3D=3D 0 in btrfs_copy_from_user
> (total_copied=3D%i)\n",
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0total_copied);
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0break;
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0}
>
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0if (unlikely(copied < PAGE_CACHE_SIZE =
- offset)) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 trace_printk("TPK: unli=
kely copied < PAGE_CACHE_SIZE - offset in
> btrfs_copy_from_user\n");
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0offset +=3D copied;
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0} else {
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pg++;
> @@ -910,8 +915,13 @@
> =A0 =A0 =A0 =A0int will_write;
> =A0 =A0 =A0 =A0int buffered =3D 0;
> =A0 =A0 =A0 =A0int copied =3D 0;
> + =A0 =A0 =A0 int copied_problem =3D 0;
> + =A0 =A0 =A0 int copied_loop_count =3D 0;
> + =A0 =A0 =A0 int stop_ftrace =3D 0;
> =A0 =A0 =A0 =A0int dirty_pages =3D 0;
>
> + =A0 =A0 =A0 trace_printk("TPK: Entering btrfs_file_aio_write()\n");
> +
> =A0 =A0 =A0 =A0will_write =3D ((file->f_flags & O_DSYNC) || IS_SYNC(i=
node) ||
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0(file->f_flags & O_DIRECT)=
);
>
> @@ -953,6 +963,7 @@
> =A0 =A0 =A0 =A0BTRFS_I(inode)->sequence++;
>
> =A0 =A0 =A0 =A0if (unlikely(file->f_flags & O_DIRECT)) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 trace_printk("TPK: transferred to unlik=
ely(file->f_flags \& O_DIRECT)\n");
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0num_written =3D generic_file_direct_wr=
ite(iocb, iov, &nr_segs,
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pos, ppos, count,
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0ocount);
> @@ -984,6 +995,8 @@
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 */
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0buffered =3D 1;
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pos +=3D num_written;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 trace_printk("TPK: end unlikely(file->f=
_flags \& O_DIRECT) with
> num_written=3D%i\n",
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0num_written)=
;
> =A0 =A0 =A0 =A0}
>
> =A0 =A0 =A0 =A0iov_iter_init(&i, iov, nr_segs, count, num_written);
> @@ -1003,6 +1016,8 @@
> =A0 =A0 =A0 =A0last_index =3D (pos + iov_iter_count(&i)) >> PAGE_CACH=
E_SHIFT;
>
> =A0 =A0 =A0 =A0while (iov_iter_count(&i) > 0) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 trace_printk("TPK: start section while =
(iov_iter_count() > 0)\n");
> +
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0size_t offset =3D pos & (PAGE_CACHE_SI=
ZE - 1);
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0size_t write_bytes =3D min(iov_iter_co=
unt(&i),
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 nrptrs * (size_t)PAGE_CACHE_SIZE -
> @@ -1010,6 +1025,9 @@
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0size_t num_pages =3D (write_bytes + of=
fset +
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 trace_printk("TPK: iov_iter_count()=3D%=
i || nr_segs=3D%lu || nrptrs=3D%i
> || offset=3D%lu || write_bytes=3D%lu || num_pages=3D%lu\n",
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0 =A0 =A0 iov_iter_count(&i), nr_segs, nrptrs, offset, w=
rite_bytes, num_pages);
> +
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0WARN_ON(num_pages > nrptrs);
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0memset(pages, 0, sizeof(struct page *)=
 * nrptrs);
>
> @@ -1022,6 +1040,19 @@
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0goto out;
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0}
>
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (unlikely(copied_problem)) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 copied_loop_count++;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 trace_printk("TPK: copi=
ed problem(%i)\n",
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0copied_loop_count);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 /* Give up if we've alr=
eady tried 256 times */
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (copied_loop_count >=
 256) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D=
 -EFAULT;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 stop_ft=
race =3D 1;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 trace_p=
rintk("TPK: copied loop count exceeded, returning EFAULT....\n");
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto ou=
t;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 }
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 }
> +
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0ret =3D btrfs_delalloc_reserve_space(i=
node,
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0num_pages << PAGE_CACHE_SHIFT);
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0if (ret)
> @@ -1045,6 +1076,10 @@
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 PAGE_CACHE_SIZE - 1) >>
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 PAGE_CACHE_SHIFT;
>
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (copied =3D=3D 0) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 copied_problem =3D 1;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 }
> +
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0if (num_pages > dirty_pages) {
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0if (copied > 0)
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0atomic=
_inc(
> @@ -1080,11 +1115,19 @@
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0num_written +=3D copied;
>
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cond_resched();
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 trace_printk("TPK: end section while (i=
ov_iter_count() > 0)\n");
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 trace_printk(" copied=3D%i || dirty_pag=
es=3D%i || num_written=3D%i ||
> ending iov_iter_count=3D%i\n",
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0 =A0 =A0 copied, dirty_pages, num_written, iov_iter_cou=
nt(&i) );
> =A0 =A0 =A0 =A0}
> =A0out:
> + =A0 =A0 =A0 trace_printk("TPK: Reached: out:\n");
> +
> =A0 =A0 =A0 =A0mutex_unlock(&inode->i_mutex);
> - =A0 =A0 =A0 if (ret)
> + =A0 =A0 =A0 if (ret) {
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0err =3D ret;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 trace_printk("TPK: ret,err had value %i=
 when out: was reached
> (num_written: %i)\n",
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0ret, num_wri=
tten);
> + =A0 =A0 =A0 }
>
> =A0 =A0 =A0 =A0kfree(pages);
> =A0 =A0 =A0 =A0*ppos =3D pos;
> @@ -1140,6 +1183,19 @@
> =A0 =A0 =A0 =A0}
> =A0done:
> =A0 =A0 =A0 =A0current->backing_dev_info =3D NULL;
> + =A0 =A0 =A0 if (ret) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 trace_printk("TPK: btrfs_file_aio_write=
 exiting with non-zero ret
> value (%i)\n", ret);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 trace_printk("TPK: Returning num_writte=
n (%i) ? num_written (%i) :
> err (%i) =3D %i\n",
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 num_wri=
tten, num_written, err, num_written ? num_written : err);
> + =A0 =A0 =A0 } else {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 trace_printk("TPK: btrfs_file_aio_write=
 exiting normally with (%i)",
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0num_written =
? num_written : err);
> + =A0 =A0 =A0 }
> +
> + =A0 =A0 =A0 if (unlikely(stop_ftrace)) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 tracing_off();
> + =A0 =A0 =A0 }
> +
> =A0 =A0 =A0 =A0return num_written ? num_written : err;
> =A0}
>

I'm developing a hypothesis that something is going wrong when Btrfs
is trying to lock multiple pages.

Page faults are being encountered at the same spot, over and over
(BTW, I ran a memcheck to rule that possibility out).

If I scan through my traces, it looks like most calls to write are
submitted 1 block (4096 bytes) at a time, at the most (also many are <
4096 bytes).  The portion that is causing a problem is trying to write
12288 bytes (3k).  For some reason, the first 24 bytes are written,
and the remainder of the loop is spent on the 12264 that are
remaining.  But page faults are encountered on each loop, and no more
bytes are copied.

I modified file.c to cut back on the scope of the attempted write to
smaller chunks.

The following patch "fixes" my problem, and this build completes
without error.  I'm not submitting this patch as a solution.  It's
clearly papering over a deeper issue.  However, I think it gives
insight into the problem.

I wrote the patch to allow for sequentially smaller blocks if failures
continue.  But the block cleared once I limited the scope to 4096
bytes.  It never needed to try smaller data segments.

As hoped, limiting the scope of the write allowed the 12264 bytes to
be written in the next three subsequent loops after lowering the scope
of the write.

It was interesting to note that cutting the scope to 4096 didn't
guarantee that you were limiting the write to one block. There was
usually some overlap, and 2 dirty block needed to be written.  But I'm
still suspicious that there is a mismatch somewhere when trying to
lock multiple blocks.

Here's the debugging patch I constructed which actually allowed the
build to complete without error.  This was applied (for testing
purposes) on top of the previous debugging patch.  As noted earlier,
it never went lower than a 4096 byte window for write_bytes.

--- file.c.file-write-patch-v1	2011-02-28 13:06:37.000000000 -0600
+++ file.c	2011-02-28 19:23:21.000000000 -0600
@@ -908,6 +908,7 @@
 	ssize_t err =3D 0;
 	size_t count;
 	size_t ocount;
+	size_t write_bytes =3D 0;
 	int ret =3D 0;
 	int nrptrs;
 	unsigned long first_index;
@@ -963,7 +964,7 @@
 	BTRFS_I(inode)->sequence++;

 	if (unlikely(file->f_flags & O_DIRECT)) {
-		trace_printk("TPK: transferred to unlikely(file->f_flags \& O_DIRECT=
)\n");
+		trace_printk("TPK: transferred to unlikely(file->f_flags & O_DIRECT)=
\n");
 		num_written =3D generic_file_direct_write(iocb, iov, &nr_segs,
 							pos, ppos, count,
 							ocount);
@@ -995,7 +996,7 @@
 		 */
 		buffered =3D 1;
 		pos +=3D num_written;
-		trace_printk("TPK: end unlikely(file->f_flags \& O_DIRECT) with
num_written=3D%i\n",
+		trace_printk("TPK: end unlikely(file->f_flags & O_DIRECT) with
num_written=3D%i\n",
 			     num_written);
 	}

@@ -1019,9 +1020,20 @@
 		trace_printk("TPK: start section while (iov_iter_count() > 0)\n");

 		size_t offset =3D pos & (PAGE_CACHE_SIZE - 1);
-		size_t write_bytes =3D min(iov_iter_count(&i),
-					 nrptrs * (size_t)PAGE_CACHE_SIZE -
-					 offset);
+		write_bytes =3D min(iov_iter_count(&i),
+				  nrptrs * (size_t)PAGE_CACHE_SIZE -
+				  offset);
+		if (unlikely(copied_problem)) {
+			if (copied_loop_count > 128) {
+				write_bytes =3D min(write_bytes, 4096);
+			} else if (copied_loop_count > 160) {
+				write_bytes =3D min(write_bytes, 2048);
+			} else if (copied_loop_count > 192) {
+				write_bytes =3D min(write_bytes, 1024);
+			} else if (copied_loop_count > 224) {
+				write_bytes =3D min(write_bytes, 512);
+			}
+		}
 		size_t num_pages =3D (write_bytes + offset +
 				    PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH] btrfs file write debugging patch
  2011-02-28 20:20           ` Mitch Harder
  2011-03-01  5:09             ` Mitch Harder
@ 2011-03-01 10:14             ` Zhong, Xin
  2011-03-01 11:56               ` Zhong, Xin
  2011-03-01 14:51               ` Mitch Harder
  2011-03-01 21:56             ` Piotr Szymaniak
  2 siblings, 2 replies; 40+ messages in thread
From: Zhong, Xin @ 2011-03-01 10:14 UTC (permalink / raw)
  To: Mitch Harder, Maria Wikström
  Cc: Josef Bacik, Johannes Hirte, Chris Mason,
	linux-btrfs@vger.kernel.org

Is your system running out of memory or is there any other thread like =
flush-btrfs competing for the same page?

I can only see one process in your ftrace log. You may need to trace al=
l btrfs.ko function calls instead of a single process. Thanks!

-----Original Message-----
=46rom: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-owner@vge=
r.kernel.org] On Behalf Of Mitch Harder
Sent: Tuesday, March 01, 2011 4:20 AM
To: Maria Wikstr=F6m
Cc: Josef Bacik; Johannes Hirte; Chris Mason; Zhong, Xin; linux-btrfs@v=
ger.kernel.org
Subject: Re: [PATCH] btrfs file write debugging patch

2011/2/28 Mitch Harder <mitch.harder@sabayonlinux.org>:
> 2011/2/28 Maria Wikstr=F6m <maria@ponstudios.se>:
>> m=E5n 2011-02-28 klockan 11:10 -0500 skrev Josef Bacik:
>>> On Mon, Feb 28, 2011 at 11:13:59AM +0100, Johannes Hirte wrote:
>>> > On Monday 28 February 2011 02:46:05 Chris Mason wrote:
>>> > > Excerpts from Mitch Harder's message of 2011-02-25 13:43:37 -05=
00:
>>> > > > Some clarification on my previous message...
>>> > > >
>>> > > > After looking at my ftrace log more closely, I can see where =
Btrfs is
>>> > > > trying to release the allocated pages. =A0However, the calcul=
ation for
>>> > > > the number of dirty_pages is equal to 1 when "copied =3D=3D 0=
".
>>> > > >
>>> > > > So I'm seeing at least two problems:
>>> > > > (1) =A0It keeps looping when "copied =3D=3D 0".
>>> > > > (2) =A0One dirty page is not being released on every loop eve=
n though
>>> > > > "copied =3D=3D 0" (at least this problem keeps it from being =
an infinite
>>> > > > loop by eventually exhausting reserveable space on the disk).
>>> > >
>>> > > Hi everyone,
>>> > >
>>> > > There are actually tow bugs here. =A0First the one that Mitch h=
it, and a
>>> > > second one that still results in bad file_write results with my
>>> > > debugging hunks (the first two hunks below) in place.
>>> > >
>>> > > My patch fixes Mitch's bug by checking for copied =3D=3D 0 afte=
r
>>> > > btrfs_copy_from_user and going the correct delalloc accounting.=
 =A0This
>>> > > one looks solved, but you'll notice the patch is bigger.
>>> > >
>>> > > First, I add some random failures to btrfs_copy_from_user() by =
failing
>>> > > everyone once and a while. =A0This was much more reliable than =
trying to
>>> > > use memory pressure than making copy_from_user fail.
>>> > >
>>> > > If copy_from_user fails and we partially update a page, we end =
up with a
>>> > > page that may go away due to memory pressure. =A0But, btrfs_fil=
e_write
>>> > > assumes that only the first and last page may have good data th=
at needs
>>> > > to be read off the disk.
>>> > >
>>> > > This patch ditches that code and puts it into prepare_pages ins=
tead.
>>> > > But I'm still having some errors during long stress.sh runs. =A0=
Ideas are
>>> > > more than welcome, hopefully some other timezones will kick in =
ideas
>>> > > while I sleep.
>>> >
>>> > At least it doesn't fix the emerge-problem for me. The behavior i=
s now the same
>>> > as with 2.6.38-rc3. It needs a 'emerge --oneshot dev-libs/libgcry=
pt' with no
>>> > further interaction to get the emerge-process hang with a svn-pro=
cess
>>> > consuming 100% CPU. I can cancel the emerge-process with ctrl-c b=
ut the
>>> > spawned svn-process stays and it needs a reboot to get rid of it.
>>>
>>> Can you cat /proc/$pid/wchan a few times so we can get an idea of w=
here it's
>>> looping? =A0Thanks,
>>>
>>> Josef
>>
>> It behaves the same way here with btrfs-unstable.
>> The output of "cat /proc/$pid/wchan" is 0.
>>
>> // Maria
>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btr=
fs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at =A0http://vger.kernel.org/majordomo-info.htm=
l
>>>
>>
>>
>>
>
> I've applied the patch at the head of this thread (with the jiffies
> debugging commented out) and I'm attaching a ftrace using the
> function_graph tracer when I'm stuck in the loop. =A0I've just snippe=
d
> out a couple of the loops (the full trace file is quite large, and
> mostly repititious).
>
> I'm going to try to modify file.c with some trace_printk debugging to
> show the values of several of the relevant variables at various
> stages.
>
> I'm going to try to exit the loop after 256 tries with an EFAULT so I
> can stop the tracing at that point and capture a trace of the entry
> into the problem (the ftrace ring buffer fills up too fast for me to
> capture the entry point).
>

As promised, I'm put together a modified file.c with many trace_printk
debugging statements to augment the ftrace.

The trace is ~128K compressed (about 31,600 lines or 2.6MB
uncompressed), so I'm putting it up on my local server instead of
attaching.  Let me know if it would be more appropriate to send to the
list as an attachment.

http://dontpanic.dyndns.org/ftrace-btrfs-file-write-debug-v2.gz

I preface all my trace_printk comments with "TPK:" to make skipping
through the trace easier.

The trace contains the trace of about 3 or 4 successful passes through
the btrfs_file_aio_write() function to show what a successful trace
looks like.

The pass through the btrfs_file_aio_write() that breaks begins on line =
1088.

I let it loop through the while (iov_iter_count(&i) > 0) {} loop for
256 times when copied=3D=3D0 (otherwise it would loop infinitely).  The=
n
exit out and stop the trace.

=46or reference, here's a diff file for the debugging statements I've
added to file.c:

Let me know if you would like me to re-run this trial with different
debugging lines.

 fs/btrfs/file.c /usr/src/linux/fs/btrfs/file.c
--- fs/btrfs/file.c	2011-02-28 10:13:45.000000000 -0600
+++ /usr/src/linux/fs/btrfs/file.c	2011-02-28 13:07:11.000000000 -0600
@@ -53,12 +53,14 @@
 	int offset =3D pos & (PAGE_CACHE_SIZE - 1);
 	int total_copied =3D 0;

+	/***************************
 	if ((jiffies % 10) =3D=3D 0)
 		return 0;

 	if ((jiffies % 25) =3D=3D 0) {
 		write_bytes /=3D 2;
 	}
+	**************************/

 	while (write_bytes > 0) {
 		size_t count =3D min_t(size_t,
@@ -82,10 +84,13 @@

 		/* Return to btrfs_file_aio_write to fault page */
 		if (unlikely(copied =3D=3D 0)) {
+			trace_printk("TPK: unlikely copied =3D=3D 0 in btrfs_copy_from_user
(total_copied=3D%i)\n",
+				     total_copied);
 			break;
 		}

 		if (unlikely(copied < PAGE_CACHE_SIZE - offset)) {
+			trace_printk("TPK: unlikely copied < PAGE_CACHE_SIZE - offset in
btrfs_copy_from_user\n");
 			offset +=3D copied;
 		} else {
 			pg++;
@@ -910,8 +915,13 @@
 	int will_write;
 	int buffered =3D 0;
 	int copied =3D 0;
+	int copied_problem =3D 0;
+	int copied_loop_count =3D 0;
+	int stop_ftrace =3D 0;
 	int dirty_pages =3D 0;

+	trace_printk("TPK: Entering btrfs_file_aio_write()\n");
+
 	will_write =3D ((file->f_flags & O_DSYNC) || IS_SYNC(inode) ||
 		      (file->f_flags & O_DIRECT));

@@ -953,6 +963,7 @@
 	BTRFS_I(inode)->sequence++;

 	if (unlikely(file->f_flags & O_DIRECT)) {
+		trace_printk("TPK: transferred to unlikely(file->f_flags \& O_DIRECT=
)\n");
 		num_written =3D generic_file_direct_write(iocb, iov, &nr_segs,
 							pos, ppos, count,
 							ocount);
@@ -984,6 +995,8 @@
 		 */
 		buffered =3D 1;
 		pos +=3D num_written;
+		trace_printk("TPK: end unlikely(file->f_flags \& O_DIRECT) with
num_written=3D%i\n",
+			     num_written);
 	}

 	iov_iter_init(&i, iov, nr_segs, count, num_written);
@@ -1003,6 +1016,8 @@
 	last_index =3D (pos + iov_iter_count(&i)) >> PAGE_CACHE_SHIFT;

 	while (iov_iter_count(&i) > 0) {
+		trace_printk("TPK: start section while (iov_iter_count() > 0)\n");
+
 		size_t offset =3D pos & (PAGE_CACHE_SIZE - 1);
 		size_t write_bytes =3D min(iov_iter_count(&i),
 					 nrptrs * (size_t)PAGE_CACHE_SIZE -
@@ -1010,6 +1025,9 @@
 		size_t num_pages =3D (write_bytes + offset +
 				    PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;

+		trace_printk("TPK: iov_iter_count()=3D%i || nr_segs=3D%lu || nrptrs=3D=
%i
|| offset=3D%lu || write_bytes=3D%lu || num_pages=3D%lu\n",
+						iov_iter_count(&i), nr_segs, nrptrs, offset, write_bytes, num_pa=
ges);
+
 		WARN_ON(num_pages > nrptrs);
 		memset(pages, 0, sizeof(struct page *) * nrptrs);

@@ -1022,6 +1040,19 @@
 			goto out;
 		}

+		if (unlikely(copied_problem)) {
+			copied_loop_count++;
+			trace_printk("TPK: copied problem(%i)\n",
+				     copied_loop_count);
+			/* Give up if we've already tried 256 times */
+			if (copied_loop_count > 256) {
+				ret =3D -EFAULT;
+				stop_ftrace =3D 1;
+				trace_printk("TPK: copied loop count exceeded, returning EFAULT...=
=2E\n");
+				goto out;
+			}
+		}
+
 		ret =3D btrfs_delalloc_reserve_space(inode,
 					num_pages << PAGE_CACHE_SHIFT);
 		if (ret)
@@ -1045,6 +1076,10 @@
 				       PAGE_CACHE_SIZE - 1) >>
 				       PAGE_CACHE_SHIFT;

+		if (copied =3D=3D 0) {
+			copied_problem =3D 1;
+		}
+
 		if (num_pages > dirty_pages) {
 			if (copied > 0)
 				atomic_inc(
@@ -1080,11 +1115,19 @@
 		num_written +=3D copied;

 		cond_resched();
+		trace_printk("TPK: end section while (iov_iter_count() > 0)\n");
+		trace_printk(" copied=3D%i || dirty_pages=3D%i || num_written=3D%i |=
|
ending iov_iter_count=3D%i\n",
+						copied, dirty_pages, num_written, iov_iter_count(&i) );
 	}
 out:
+	trace_printk("TPK: Reached: out:\n");
+
 	mutex_unlock(&inode->i_mutex);
-	if (ret)
+	if (ret) {
 		err =3D ret;
+		trace_printk("TPK: ret,err had value %i when out: was reached
(num_written: %i)\n",
+			     ret, num_written);
+	}

 	kfree(pages);
 	*ppos =3D pos;
@@ -1140,6 +1183,19 @@
 	}
 done:
 	current->backing_dev_info =3D NULL;
+	if (ret) {
+		trace_printk("TPK: btrfs_file_aio_write exiting with non-zero ret
value (%i)\n", ret);
+		trace_printk("TPK: Returning num_written (%i) ? num_written (%i) :
err (%i) =3D %i\n",
+				num_written, num_written, err, num_written ? num_written : err);
+	} else {
+		trace_printk("TPK: btrfs_file_aio_write exiting normally with (%i)",
+			     num_written ? num_written : err);
+	}
+=09
+	if (unlikely(stop_ftrace)) {
+		tracing_off();
+	}
+
 	return num_written ? num_written : err;
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH] btrfs file write debugging patch
  2011-03-01 10:14             ` Zhong, Xin
@ 2011-03-01 11:56               ` Zhong, Xin
  2011-03-01 14:54                 ` Mitch Harder
  2011-03-01 14:51               ` Mitch Harder
  1 sibling, 1 reply; 40+ messages in thread
From: Zhong, Xin @ 2011-03-01 11:56 UTC (permalink / raw)
  To: Zhong, Xin, Mitch Harder, Maria Wikström
  Cc: Josef Bacik, Johannes Hirte, Chris Mason,
	linux-btrfs@vger.kernel.org

Hi Mitch,

I suspect there's a lock contention between flush-btrfs (lock_dellalloc=
_pages) and btrfs_file_aio_write. However I can not recreate it locally=
=2E Could you please try below patch? Thanks!

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 65338a1..b9d0929 1=
00644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1007,17 +1007,16 @@ static ssize_t btrfs_file_aio_write(struct kioc=
b *iocb,
 			goto out;
 		}
=20
-		ret =3D btrfs_delalloc_reserve_space(inode,
-					num_pages << PAGE_CACHE_SHIFT);
-		if (ret)
-			goto out;
-
 		ret =3D prepare_pages(root, file, pages, num_pages,
 				    pos, first_index, last_index,
 				    write_bytes);
-		if (ret) {
-			btrfs_delalloc_release_space(inode,
+		if (ret)
+			goto out;
+	=09
+		ret =3D btrfs_delalloc_reserve_space(inode,
 					num_pages << PAGE_CACHE_SHIFT);
+		if (ret) {
+			btrfs_drop_pages(pages, num_pages);
 			goto out;
 		}


-----Original Message-----
=46rom: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-owner@vge=
r.kernel.org] On Behalf Of Zhong, Xin
Sent: Tuesday, March 01, 2011 6:15 PM
To: Mitch Harder; Maria Wikstr=F6m
Cc: Josef Bacik; Johannes Hirte; Chris Mason; linux-btrfs@vger.kernel.o=
rg
Subject: RE: [PATCH] btrfs file write debugging patch

Is your system running out of memory or is there any other thread like =
flush-btrfs competing for the same page?

I can only see one process in your ftrace log. You may need to trace al=
l btrfs.ko function calls instead of a single process. Thanks!

-----Original Message-----
=46rom: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-owner@vge=
r.kernel.org] On Behalf Of Mitch Harder
Sent: Tuesday, March 01, 2011 4:20 AM
To: Maria Wikstr=F6m
Cc: Josef Bacik; Johannes Hirte; Chris Mason; Zhong, Xin; linux-btrfs@v=
ger.kernel.org
Subject: Re: [PATCH] btrfs file write debugging patch

2011/2/28 Mitch Harder <mitch.harder@sabayonlinux.org>:
> 2011/2/28 Maria Wikstr=F6m <maria@ponstudios.se>:
>> m=E5n 2011-02-28 klockan 11:10 -0500 skrev Josef Bacik:
>>> On Mon, Feb 28, 2011 at 11:13:59AM +0100, Johannes Hirte wrote:
>>> > On Monday 28 February 2011 02:46:05 Chris Mason wrote:
>>> > > Excerpts from Mitch Harder's message of 2011-02-25 13:43:37 -05=
00:
>>> > > > Some clarification on my previous message...
>>> > > >
>>> > > > After looking at my ftrace log more closely, I can see where =
Btrfs is
>>> > > > trying to release the allocated pages. =A0However, the calcul=
ation for
>>> > > > the number of dirty_pages is equal to 1 when "copied =3D=3D 0=
".
>>> > > >
>>> > > > So I'm seeing at least two problems:
>>> > > > (1) =A0It keeps looping when "copied =3D=3D 0".
>>> > > > (2) =A0One dirty page is not being released on every loop eve=
n though
>>> > > > "copied =3D=3D 0" (at least this problem keeps it from being =
an infinite
>>> > > > loop by eventually exhausting reserveable space on the disk).
>>> > >
>>> > > Hi everyone,
>>> > >
>>> > > There are actually tow bugs here. =A0First the one that Mitch h=
it, and a
>>> > > second one that still results in bad file_write results with my
>>> > > debugging hunks (the first two hunks below) in place.
>>> > >
>>> > > My patch fixes Mitch's bug by checking for copied =3D=3D 0 afte=
r
>>> > > btrfs_copy_from_user and going the correct delalloc accounting.=
 =A0This
>>> > > one looks solved, but you'll notice the patch is bigger.
>>> > >
>>> > > First, I add some random failures to btrfs_copy_from_user() by =
failing
>>> > > everyone once and a while. =A0This was much more reliable than =
trying to
>>> > > use memory pressure than making copy_from_user fail.
>>> > >
>>> > > If copy_from_user fails and we partially update a page, we end =
up with a
>>> > > page that may go away due to memory pressure. =A0But, btrfs_fil=
e_write
>>> > > assumes that only the first and last page may have good data th=
at needs
>>> > > to be read off the disk.
>>> > >
>>> > > This patch ditches that code and puts it into prepare_pages ins=
tead.
>>> > > But I'm still having some errors during long stress.sh runs. =A0=
Ideas are
>>> > > more than welcome, hopefully some other timezones will kick in =
ideas
>>> > > while I sleep.
>>> >
>>> > At least it doesn't fix the emerge-problem for me. The behavior i=
s now the same
>>> > as with 2.6.38-rc3. It needs a 'emerge --oneshot dev-libs/libgcry=
pt' with no
>>> > further interaction to get the emerge-process hang with a svn-pro=
cess
>>> > consuming 100% CPU. I can cancel the emerge-process with ctrl-c b=
ut the
>>> > spawned svn-process stays and it needs a reboot to get rid of it.
>>>
>>> Can you cat /proc/$pid/wchan a few times so we can get an idea of w=
here it's
>>> looping? =A0Thanks,
>>>
>>> Josef
>>
>> It behaves the same way here with btrfs-unstable.
>> The output of "cat /proc/$pid/wchan" is 0.
>>
>> // Maria
>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btr=
fs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at =A0http://vger.kernel.org/majordomo-info.htm=
l
>>>
>>
>>
>>
>
> I've applied the patch at the head of this thread (with the jiffies
> debugging commented out) and I'm attaching a ftrace using the
> function_graph tracer when I'm stuck in the loop. =A0I've just snippe=
d
> out a couple of the loops (the full trace file is quite large, and
> mostly repititious).
>
> I'm going to try to modify file.c with some trace_printk debugging to
> show the values of several of the relevant variables at various
> stages.
>
> I'm going to try to exit the loop after 256 tries with an EFAULT so I
> can stop the tracing at that point and capture a trace of the entry
> into the problem (the ftrace ring buffer fills up too fast for me to
> capture the entry point).
>

As promised, I'm put together a modified file.c with many trace_printk
debugging statements to augment the ftrace.

The trace is ~128K compressed (about 31,600 lines or 2.6MB
uncompressed), so I'm putting it up on my local server instead of
attaching.  Let me know if it would be more appropriate to send to the
list as an attachment.

http://dontpanic.dyndns.org/ftrace-btrfs-file-write-debug-v2.gz

I preface all my trace_printk comments with "TPK:" to make skipping
through the trace easier.

The trace contains the trace of about 3 or 4 successful passes through
the btrfs_file_aio_write() function to show what a successful trace
looks like.

The pass through the btrfs_file_aio_write() that breaks begins on line =
1088.

I let it loop through the while (iov_iter_count(&i) > 0) {} loop for
256 times when copied=3D=3D0 (otherwise it would loop infinitely).  The=
n
exit out and stop the trace.

=46or reference, here's a diff file for the debugging statements I've
added to file.c:

Let me know if you would like me to re-run this trial with different
debugging lines.

 fs/btrfs/file.c /usr/src/linux/fs/btrfs/file.c
--- fs/btrfs/file.c	2011-02-28 10:13:45.000000000 -0600
+++ /usr/src/linux/fs/btrfs/file.c	2011-02-28 13:07:11.000000000 -0600
@@ -53,12 +53,14 @@
 	int offset =3D pos & (PAGE_CACHE_SIZE - 1);
 	int total_copied =3D 0;

+	/***************************
 	if ((jiffies % 10) =3D=3D 0)
 		return 0;

 	if ((jiffies % 25) =3D=3D 0) {
 		write_bytes /=3D 2;
 	}
+	**************************/

 	while (write_bytes > 0) {
 		size_t count =3D min_t(size_t,
@@ -82,10 +84,13 @@

 		/* Return to btrfs_file_aio_write to fault page */
 		if (unlikely(copied =3D=3D 0)) {
+			trace_printk("TPK: unlikely copied =3D=3D 0 in btrfs_copy_from_user
(total_copied=3D%i)\n",
+				     total_copied);
 			break;
 		}

 		if (unlikely(copied < PAGE_CACHE_SIZE - offset)) {
+			trace_printk("TPK: unlikely copied < PAGE_CACHE_SIZE - offset in
btrfs_copy_from_user\n");
 			offset +=3D copied;
 		} else {
 			pg++;
@@ -910,8 +915,13 @@
 	int will_write;
 	int buffered =3D 0;
 	int copied =3D 0;
+	int copied_problem =3D 0;
+	int copied_loop_count =3D 0;
+	int stop_ftrace =3D 0;
 	int dirty_pages =3D 0;

+	trace_printk("TPK: Entering btrfs_file_aio_write()\n");
+
 	will_write =3D ((file->f_flags & O_DSYNC) || IS_SYNC(inode) ||
 		      (file->f_flags & O_DIRECT));

@@ -953,6 +963,7 @@
 	BTRFS_I(inode)->sequence++;

 	if (unlikely(file->f_flags & O_DIRECT)) {
+		trace_printk("TPK: transferred to unlikely(file->f_flags \& O_DIRECT=
)\n");
 		num_written =3D generic_file_direct_write(iocb, iov, &nr_segs,
 							pos, ppos, count,
 							ocount);
@@ -984,6 +995,8 @@
 		 */
 		buffered =3D 1;
 		pos +=3D num_written;
+		trace_printk("TPK: end unlikely(file->f_flags \& O_DIRECT) with
num_written=3D%i\n",
+			     num_written);
 	}

 	iov_iter_init(&i, iov, nr_segs, count, num_written);
@@ -1003,6 +1016,8 @@
 	last_index =3D (pos + iov_iter_count(&i)) >> PAGE_CACHE_SHIFT;

 	while (iov_iter_count(&i) > 0) {
+		trace_printk("TPK: start section while (iov_iter_count() > 0)\n");
+
 		size_t offset =3D pos & (PAGE_CACHE_SIZE - 1);
 		size_t write_bytes =3D min(iov_iter_count(&i),
 					 nrptrs * (size_t)PAGE_CACHE_SIZE -
@@ -1010,6 +1025,9 @@
 		size_t num_pages =3D (write_bytes + offset +
 				    PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;

+		trace_printk("TPK: iov_iter_count()=3D%i || nr_segs=3D%lu || nrptrs=3D=
%i
|| offset=3D%lu || write_bytes=3D%lu || num_pages=3D%lu\n",
+						iov_iter_count(&i), nr_segs, nrptrs, offset, write_bytes, num_pa=
ges);
+
 		WARN_ON(num_pages > nrptrs);
 		memset(pages, 0, sizeof(struct page *) * nrptrs);

@@ -1022,6 +1040,19 @@
 			goto out;
 		}

+		if (unlikely(copied_problem)) {
+			copied_loop_count++;
+			trace_printk("TPK: copied problem(%i)\n",
+				     copied_loop_count);
+			/* Give up if we've already tried 256 times */
+			if (copied_loop_count > 256) {
+				ret =3D -EFAULT;
+				stop_ftrace =3D 1;
+				trace_printk("TPK: copied loop count exceeded, returning EFAULT...=
=2E\n");
+				goto out;
+			}
+		}
+
 		ret =3D btrfs_delalloc_reserve_space(inode,
 					num_pages << PAGE_CACHE_SHIFT);
 		if (ret)
@@ -1045,6 +1076,10 @@
 				       PAGE_CACHE_SIZE - 1) >>
 				       PAGE_CACHE_SHIFT;

+		if (copied =3D=3D 0) {
+			copied_problem =3D 1;
+		}
+
 		if (num_pages > dirty_pages) {
 			if (copied > 0)
 				atomic_inc(
@@ -1080,11 +1115,19 @@
 		num_written +=3D copied;

 		cond_resched();
+		trace_printk("TPK: end section while (iov_iter_count() > 0)\n");
+		trace_printk(" copied=3D%i || dirty_pages=3D%i || num_written=3D%i |=
|
ending iov_iter_count=3D%i\n",
+						copied, dirty_pages, num_written, iov_iter_count(&i) );
 	}
 out:
+	trace_printk("TPK: Reached: out:\n");
+
 	mutex_unlock(&inode->i_mutex);
-	if (ret)
+	if (ret) {
 		err =3D ret;
+		trace_printk("TPK: ret,err had value %i when out: was reached
(num_written: %i)\n",
+			     ret, num_written);
+	}

 	kfree(pages);
 	*ppos =3D pos;
@@ -1140,6 +1183,19 @@
 	}
 done:
 	current->backing_dev_info =3D NULL;
+	if (ret) {
+		trace_printk("TPK: btrfs_file_aio_write exiting with non-zero ret
value (%i)\n", ret);
+		trace_printk("TPK: Returning num_written (%i) ? num_written (%i) :
err (%i) =3D %i\n",
+				num_written, num_written, err, num_written ? num_written : err);
+	} else {
+		trace_printk("TPK: btrfs_file_aio_write exiting normally with (%i)",
+			     num_written ? num_written : err);
+	}
+=09
+	if (unlikely(stop_ftrace)) {
+		tracing_off();
+	}
+
 	return num_written ? num_written : err;
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-03-01 11:56               ` Zhong, Xin
@ 2011-03-01 14:54                 ` Mitch Harder
  0 siblings, 0 replies; 40+ messages in thread
From: Mitch Harder @ 2011-03-01 14:54 UTC (permalink / raw)
  To: Zhong, Xin
  Cc: Maria Wikström, Josef Bacik, Johannes Hirte, Chris Mason,
	linux-btrfs@vger.kernel.org

On Tue, Mar 1, 2011 at 5:56 AM, Zhong, Xin <xin.zhong@intel.com> wrote:
> Hi Mitch,
>
> I suspect there's a lock contention between flush-btrfs (lock_dellall=
oc_pages) and btrfs_file_aio_write. However I can not recreate it local=
ly. Could you please try below patch? Thanks!
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 65338a1..b9d0929=
 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1007,17 +1007,16 @@ static ssize_t btrfs_file_aio_write(struct ki=
ocb *iocb,
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0goto out;
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0}
>
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D btrfs_delalloc_reserve_space(in=
ode,
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 num_pages << PAGE_CACHE_SHIFT);
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (ret)
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out;
> -
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0ret =3D prepare_pages(root, file, page=
s, num_pages,
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
pos, first_index, last_index,
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
write_bytes);
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (ret) {
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 btrfs_delalloc_release_=
space(inode,
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (ret)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out;
> +
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D btrfs_delalloc_reserve_space(in=
ode,
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0num_pages << PAGE_CACHE_SHIFT);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (ret) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 btrfs_drop_pages(pages,=
 num_pages);
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0goto out;
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0}
>
>

Thanks.

I've tested this patch, but the build is still failing at the same
point as before.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-03-01 10:14             ` Zhong, Xin
  2011-03-01 11:56               ` Zhong, Xin
@ 2011-03-01 14:51               ` Mitch Harder
  1 sibling, 0 replies; 40+ messages in thread
From: Mitch Harder @ 2011-03-01 14:51 UTC (permalink / raw)
  To: Zhong, Xin
  Cc: Maria Wikström, Josef Bacik, Johannes Hirte, Chris Mason,
	linux-btrfs@vger.kernel.org

On Tue, Mar 1, 2011 at 4:14 AM, Zhong, Xin <xin.zhong@intel.com> wrote:
> Is your system running out of memory or is there any other thread like flush-btrfs competing for the same page?
>

There's no sign of memory pressure.  Although I only have 1 GB in this
box, I'm still show ~1/2 GB RAM free during this build.  There's no
swap space allocated, and nothing in dmesg that indicates there's a
transient spike of RAM pressure.

> I can only see one process in your ftrace log. You may need to trace all btrfs.ko function calls instead of a single process. Thanks!
>

That ftrace.log was run with ftrace defaults for a function trace.  It
should collect calls from the whole system.

For the sake of consistency, I am intentionally trying to insure that
very few other things are going on at the same time as this build.
And I'm building with "-j1" so things will happen the same way each
time.

Also, I supplied just the tail end of the trace log.  The full log
shows a few of the other build processes leading up to the problem,
but the ftrace ring buffer fills up surprisingly fast.  Even with a
50MB ring buffer for ftrace, I usually collect less than 1 second of
information when something busy like a build is going on.

Let me know if you'd like to see the full log.  It's bigger, but I can
find someplace to put it.

But I'm pretty sure that wmldbcreate is the only thing that is going
on when the breakage occurs.  Otherwise I wouldn't get such consistent
breakage in the same spot every time.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] btrfs file write debugging patch
  2011-02-28 20:20           ` Mitch Harder
  2011-03-01  5:09             ` Mitch Harder
  2011-03-01 10:14             ` Zhong, Xin
@ 2011-03-01 21:56             ` Piotr Szymaniak
  2 siblings, 0 replies; 40+ messages in thread
From: Piotr Szymaniak @ 2011-03-01 21:56 UTC (permalink / raw)
  To: Mitch Harder
  Cc: Josef Bacik, Johannes Hirte, Chris Mason, Zhong, Xin, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 836 bytes --]

On Mon, Feb 28, 2011 at 02:20:22PM -0600, Mitch Harder wrote:
> As promised, I'm put together a modified file.c with many trace_printk
> debugging statements to augment the ftrace.
> *snip*

Just my few cents. I've applied the patch from Chris Mason (Sun, 27 Feb
2011 20:46:05 -0500) and this one from Mitch (Mon, 28 Feb 2011 14:20:22
-0600) on top of vanilla 2.6.38-rc6 and it seems that it resolves my
issues with hanging `svn info' during libgcrypt emerge.

Piotr Szymaniak.
-- 
 - (...) Nie wyobrazam sobie, co ta gora miesa moglaby ci dac, czego ja
nie   moglbym   ofiarowac.  Oczywiscie  poza  piecdziesiecioma  funtami
rozrosnietych miesni.
 - Moze mnie wlasnie pociagaja rozrosniete miesnie. (...) W koncu wielu
mezczyzn pociaga rozrosnieta tkanka tluszczowa piersi.
  -- Graham Masterton, "The Wells of Hell"

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2011-03-08  2:51 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-01 16:36 [PATCH] btrfs file write debugging patch Xin Zhong
2011-03-01 21:09 ` Mitch Harder
2011-03-02 10:58   ` Zhong, Xin
2011-03-02 14:00     ` Xin Zhong
2011-03-04  1:51     ` Chris Mason
2011-03-04  2:32       ` Josef Bacik
2011-03-04  2:42         ` Zhong, Xin
2011-03-04  2:41           ` Josef Bacik
2011-03-04  8:41             ` Zhong, Xin
2011-03-05 16:56             ` Mitch Harder
2011-03-05 17:28               ` Xin Zhong
2011-03-04 12:19       ` Chris Mason
2011-03-04 14:25         ` Xin Zhong
2011-03-04 15:33           ` Mitch Harder
2011-03-04 17:21             ` Mitch Harder
2011-03-05  1:00               ` Xin Zhong
2011-03-05 13:14                 ` Mitch Harder
2011-03-05 16:50                   ` Mitch Harder
2011-03-06 18:00                     ` Chris Mason
2011-03-07  0:58                       ` Chris Mason
2011-03-07  6:07                         ` Mitch Harder
2011-03-07  6:37                           ` Zhong, Xin
2011-03-07 19:56                           ` Maria Wikström
2011-03-07 22:12                             ` Johannes Hirte
2011-03-08  2:51                               ` Zhong, Xin
  -- strict thread matches above, loose matches on Subject: below --
2011-02-25 18:43 [PATCH v2]Btrfs: pwrite blocked when writing from the mmaped buffer of the same page Mitch Harder
2011-02-28  1:46 ` [PATCH] btrfs file write debugging patch Chris Mason
2011-02-28  8:56   ` Zhong, Xin
2011-02-28 14:02     ` Chris Mason
2011-02-28 10:13   ` Johannes Hirte
2011-02-28 14:00     ` Chris Mason
2011-02-28 16:10     ` Josef Bacik
2011-02-28 16:45       ` Maria Wikström
2011-02-28 17:47         ` Mitch Harder
2011-02-28 20:20           ` Mitch Harder
2011-03-01  5:09             ` Mitch Harder
2011-03-01 10:14             ` Zhong, Xin
2011-03-01 11:56               ` Zhong, Xin
2011-03-01 14:54                 ` Mitch Harder
2011-03-01 14:51               ` Mitch Harder
2011-03-01 21:56             ` Piotr Szymaniak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).