* Btrfs slowdown with ceph (how to reproduce)
@ 2012-01-20 12:13 Christian Brunner
2012-01-23 18:19 ` Josef Bacik
0 siblings, 1 reply; 7+ messages in thread
From: Christian Brunner @ 2012-01-20 12:13 UTC (permalink / raw)
To: linux-btrfs; +Cc: ceph-devel
[-- Attachment #1: Type: text/plain, Size: 903 bytes --]
As you might know, I have been seeing btrfs slowdowns in our ceph
cluster for quite some time. Even with the latest btrfs code for 3.3
I'm still seeing these problems. To make things reproducible, I've now
written a small test that imitates ceph's behavior:
On a freshly created btrfs filesystem (2 TB size, mounted with
"noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
100 files. After that I'm doing random writes on these files with a
sync_file_range after each write (each write has a size of 100 bytes)
and ioctl(BTRFS_IOC_SYNC) after every 100 writes.
After approximately 20 minutes, write activity suddenly increases
fourfold and the average request size decreases (see chart in the
attachment).
You can find IOstat output here: http://pastebin.com/Smbfg1aG
I hope that you are able to track down the problem with the test
program in the attachment.
Thanks,
Christian
[-- Attachment #2: btrfstest.c --]
[-- Type: text/x-csrc, Size: 1230 bytes --]
#define _GNU_SOURCE
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#define FILE_COUNT 100
#define FILE_SIZE 4194304
#define STRING "0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789"
#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_IOC_SYNC _IO(BTRFS_IOCTL_MAGIC, 8)
int main(int argc, char *argv[]) {
	char *imgname;
	char *tempname;
	int fd[FILE_COUNT];
	int ilen, i;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <basename>\n", argv[0]);
		return 1;
	}
	imgname = argv[1];
	ilen = strlen(imgname);
	tempname = malloc(ilen + 8);

	/* Create and open FILE_COUNT files named <basename>.0 .. <basename>.99 */
	for (i = 0; i < FILE_COUNT; i++) {
		snprintf(tempname, ilen + 8, "%s.%i", imgname, i);
		fd[i] = open(tempname, O_CREAT|O_RDWR, 0644);
		if (fd[i] < 0) {
			perror("open");
			return 1;
		}
	}

	i = 0;
	while (1) {
		/* 100-byte write at a random offset in a random file */
		int start = rand() % FILE_SIZE;
		int file = rand() % FILE_COUNT;

		putc('.', stderr);
		lseek(fd[file], start, SEEK_SET);
		write(fd[file], STRING, 100);
		sync_file_range(fd[file], start, 100, SYNC_FILE_RANGE_WRITE);
		usleep(25000);

		/* Full filesystem sync after every 100 writes, like ceph */
		i++;
		if (i == 100) {
			i = 0;
			ioctl(fd[file], BTRFS_IOC_SYNC);
		}
	}
}
[-- Attachment #3: btrfstest.png --]
[-- Type: image/png, Size: 6698 bytes --]
* Re: Btrfs slowdown with ceph (how to reproduce)
2012-01-20 12:13 Btrfs slowdown with ceph (how to reproduce) Christian Brunner
@ 2012-01-23 18:19 ` Josef Bacik
2012-01-23 18:50 ` Chris Mason
0 siblings, 1 reply; 7+ messages in thread
From: Josef Bacik @ 2012-01-23 18:19 UTC (permalink / raw)
To: Christian Brunner; +Cc: linux-btrfs, ceph-devel
On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
> As you might know, I have been seeing btrfs slowdowns in our ceph
> cluster for quite some time. Even with the latest btrfs code for 3.3
> I'm still seeing these problems. To make things reproducible, I've now
> written a small test, that imitates ceph's behavior:
>
> On a freshly created btrfs filesystem (2 TB size, mounted with
> "noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
> 100 files. After that I'm doing random writes on these files with a
> sync_file_range after each write (each write has a size of 100 bytes)
> and ioctl(BTRFS_IOC_SYNC) after every 100 writes.
>
> After approximately 20 minutes, write activity suddenly increases
> fourfold and the average request size decreases (see chart in the
> attachment).
>
> You can find IOstat output here: http://pastebin.com/Smbfg1aG
>
> I hope that you are able to trace down the problem with the test
> program in the attachment.
Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's tree and
formatted the fs with 64k node and leaf sizes and the problem appeared to go
away. So surprise surprise, fragmentation is biting us in the ass. If you can,
try running that branch with 64k node and leaf sizes on your ceph cluster and
see how that works out. 'Course you should only do that if you don't mind
losing everything :). Thanks,
Josef
* Re: Btrfs slowdown with ceph (how to reproduce)
2012-01-23 18:19 ` Josef Bacik
@ 2012-01-23 18:50 ` Chris Mason
2012-01-23 20:53 ` Christian Brunner
2012-01-24 19:15 ` Martin Mailand
0 siblings, 2 replies; 7+ messages in thread
From: Chris Mason @ 2012-01-23 18:50 UTC (permalink / raw)
To: Josef Bacik; +Cc: Christian Brunner, linux-btrfs, ceph-devel
On Mon, Jan 23, 2012 at 01:19:29PM -0500, Josef Bacik wrote:
> On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
> > As you might know, I have been seeing btrfs slowdowns in our ceph
> > cluster for quite some time. Even with the latest btrfs code for 3.3
> > I'm still seeing these problems. To make things reproducible, I've now
> > written a small test, that imitates ceph's behavior:
> >
> > On a freshly created btrfs filesystem (2 TB size, mounted with
> > "noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
> > 100 files. After that I'm doing random writes on these files with a
> > sync_file_range after each write (each write has a size of 100 bytes)
> > and ioctl(BTRFS_IOC_SYNC) after every 100 writes.
> >
> > After approximately 20 minutes, write activity suddenly increases
> > fourfold and the average request size decreases (see chart in the
> > attachment).
> >
> > You can find IOstat output here: http://pastebin.com/Smbfg1aG
> >
> > I hope that you are able to trace down the problem with the test
> > program in the attachment.
>
> Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's tree and
> formatted the fs with 64k node and leaf sizes and the problem appeared to go
> away. So surprise surprise fragmentation is biting us in the ass. If you can
> try running that branch with 64k node and leaf sizes with your ceph cluster and
> see how that works out. Course you should only do that if you dont mind if you
> lose everything :). Thanks,
>
Please keep in mind this branch is only out there for development, and
it really might have huge flaws. scrub doesn't work with it correctly
right now, and the IO error recovery code is probably broken too.
Long term though, I think the bigger block sizes are going to make a
huge difference in these workloads.
If you use the very dangerous code:
mkfs.btrfs -l 64k -n 64k /dev/xxx
(-l is leaf size, -n is node size).
64K is the max right now, 32K may help just as much at a lower CPU cost.
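Put together with the mount options from the original report, a test run might look like the session below. The device path and mount point are placeholders, and this should only ever be pointed at a scratch disk, since the branch in question is explicitly dangerous:

```shell
# Format with 64k leaf and node sizes (development branch only -- may eat data)
mkfs.btrfs -l 64k -n 64k /dev/sdX

# Mount with the options used in the original reproducer
mount -o noatime,nodiratime,compress=lzo,space_cache,inode_cache \
    /dev/sdX /mnt/btrfs-test
```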
-chris
* Re: Btrfs slowdown with ceph (how to reproduce)
2012-01-23 18:50 ` Chris Mason
@ 2012-01-23 20:53 ` Christian Brunner
2012-01-24 19:15 ` Martin Mailand
1 sibling, 0 replies; 7+ messages in thread
From: Christian Brunner @ 2012-01-23 20:53 UTC (permalink / raw)
To: linux-btrfs, ceph-devel
2012/1/23 Chris Mason <chris.mason@oracle.com>:
> On Mon, Jan 23, 2012 at 01:19:29PM -0500, Josef Bacik wrote:
>> On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
>> > As you might know, I have been seeing btrfs slowdowns in our ceph
>> > cluster for quite some time. Even with the latest btrfs code for 3.3
>> > I'm still seeing these problems. To make things reproducible, I've now
>> > written a small test, that imitates ceph's behavior:
>> >
>> > On a freshly created btrfs filesystem (2 TB size, mounted with
>> > "noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
>> > 100 files. After that I'm doing random writes on these files with a
>> > sync_file_range after each write (each write has a size of 100 bytes)
>> > and ioctl(BTRFS_IOC_SYNC) after every 100 writes.
>> >
>> > After approximately 20 minutes, write activity suddenly increases
>> > fourfold and the average request size decreases (see chart in the
>> > attachment).
>> >
>> > You can find IOstat output here: http://pastebin.com/Smbfg1aG
>> >
>> > I hope that you are able to trace down the problem with the test
>> > program in the attachment.
>>
>> Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's tree and
>> formatted the fs with 64k node and leaf sizes and the problem appeared to go
>> away. So surprise surprise fragmentation is biting us in the ass. If you can
>> try running that branch with 64k node and leaf sizes with your ceph cluster and
>> see how that works out. Course you should only do that if you dont mind if you
>> lose everything :). Thanks,
>>
>
> Please keep in mind this branch is only out there for development, and
> it really might have huge flaws. scrub doesn't work with it correctly
> right now, and the IO error recovery code is probably broken too.
>
> Long term though, I think the bigger block sizes are going to make a
> huge difference in these workloads.
>
> If you use the very dangerous code:
>
> mkfs.btrfs -l 64k -n 64k /dev/xxx
>
> (-l is leaf size, -n is node size).
>
> 64K is the max right now, 32K may help just as much at a lower CPU cost.
Thanks for taking a look. I'm glad to hear that there is a solution
on the horizon, but I'm not brave enough to try this on our ceph
cluster. I'll try it when the code has stabilized a bit.
Regards,
Christian
* Re: Btrfs slowdown with ceph (how to reproduce)
2012-01-23 18:50 ` Chris Mason
2012-01-23 20:53 ` Christian Brunner
@ 2012-01-24 19:15 ` Martin Mailand
2012-01-24 19:40 ` Chris Mason
1 sibling, 1 reply; 7+ messages in thread
From: Martin Mailand @ 2012-01-24 19:15 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, Christian Brunner, linux-btrfs,
ceph-devel
Hi
I tried the branch on one of my ceph OSDs, and there is a big difference
in performance. The average request size stayed high, but after around an
hour the kernel crashed.
IOstat
http://pastebin.com/xjuriJ6J
Kernel trace
http://pastebin.com/SYE95GgH
-martin
Am 23.01.2012 19:50, schrieb Chris Mason:
> On Mon, Jan 23, 2012 at 01:19:29PM -0500, Josef Bacik wrote:
>> On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
>>> As you might know, I have been seeing btrfs slowdowns in our ceph
>>> cluster for quite some time. Even with the latest btrfs code for 3.3
>>> I'm still seeing these problems. To make things reproducible, I've now
>>> written a small test, that imitates ceph's behavior:
>>>
>>> On a freshly created btrfs filesystem (2 TB size, mounted with
>>> "noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
>>> 100 files. After that I'm doing random writes on these files with a
>>> sync_file_range after each write (each write has a size of 100 bytes)
>>> and ioctl(BTRFS_IOC_SYNC) after every 100 writes.
>>>
>>> After approximately 20 minutes, write activity suddenly increases
>>> fourfold and the average request size decreases (see chart in the
>>> attachment).
>>>
>>> You can find IOstat output here: http://pastebin.com/Smbfg1aG
>>>
>>> I hope that you are able to trace down the problem with the test
>>> program in the attachment.
>>
>> Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's tree and
>> formatted the fs with 64k node and leaf sizes and the problem appeared to go
>> away. So surprise surprise fragmentation is biting us in the ass. If you can
>> try running that branch with 64k node and leaf sizes with your ceph cluster and
>> see how that works out. Course you should only do that if you dont mind if you
>> lose everything :). Thanks,
>>
>
> Please keep in mind this branch is only out there for development, and
> it really might have huge flaws. scrub doesn't work with it correctly
> right now, and the IO error recovery code is probably broken too.
>
> Long term though, I think the bigger block sizes are going to make a
> huge difference in these workloads.
>
> If you use the very dangerous code:
>
> mkfs.btrfs -l 64k -n 64k /dev/xxx
>
> (-l is leaf size, -n is node size).
>
> 64K is the max right now, 32K may help just as much at a lower CPU cost.
>
> -chris
>
* Re: Btrfs slowdown with ceph (how to reproduce)
2012-01-24 19:15 ` Martin Mailand
@ 2012-01-24 19:40 ` Chris Mason
2012-01-24 20:55 ` Martin Mailand
0 siblings, 1 reply; 7+ messages in thread
From: Chris Mason @ 2012-01-24 19:40 UTC (permalink / raw)
To: Martin Mailand; +Cc: Josef Bacik, Christian Brunner, linux-btrfs, ceph-devel
On Tue, Jan 24, 2012 at 08:15:58PM +0100, Martin Mailand wrote:
> Hi
> I tried the branch on one of my ceph osd, and there is a big
> difference in the performance.
> The average request size stayed high, but after around a hour the
> kernel crashed.
>
> IOstat
> http://pastebin.com/xjuriJ6J
>
> Kernel trace
> http://pastebin.com/SYE95GgH
Aha, this I know how to fix. Thanks for trying it out.
-chris
* Re: Btrfs slowdown with ceph (how to reproduce)
2012-01-24 19:40 ` Chris Mason
@ 2012-01-24 20:55 ` Martin Mailand
0 siblings, 0 replies; 7+ messages in thread
From: Martin Mailand @ 2012-01-24 20:55 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, Christian Brunner, linux-btrfs,
ceph-devel
Hi Chris,
great to hear that. Could you give me a ping once you have fixed it, so
that I can retry it?
-martin
Am 24.01.2012 20:40, schrieb Chris Mason:
> On Tue, Jan 24, 2012 at 08:15:58PM +0100, Martin Mailand wrote:
>> Hi
>> I tried the branch on one of my ceph osd, and there is a big
>> difference in the performance.
>> The average request size stayed high, but after around a hour the
>> kernel crashed.
>>
>> IOstat
>> http://pastebin.com/xjuriJ6J
>>
>> Kernel trace
>> http://pastebin.com/SYE95GgH
>
> Aha, this I know how to fix. Thanks for trying it out.
>
> -chris
Thread overview: 7+ messages
2012-01-20 12:13 Btrfs slowdown with ceph (how to reproduce) Christian Brunner
2012-01-23 18:19 ` Josef Bacik
2012-01-23 18:50 ` Chris Mason
2012-01-23 20:53 ` Christian Brunner
2012-01-24 19:15 ` Martin Mailand
2012-01-24 19:40 ` Chris Mason
2012-01-24 20:55 ` Martin Mailand