* Ceph performance
@ 2012-10-29 14:01 Roman Alekseev
2012-10-29 18:57 ` Sam Lang
0 siblings, 1 reply; 10+ messages in thread
From: Roman Alekseev @ 2012-10-29 14:01 UTC (permalink / raw)
To: ceph-devel
Hi,
Please advise how to improve performance on a cluster that consists
of 5 dedicated servers:
- ceph.conf: http://pastebin.com/hT3qEhUF
- file system on all drives is ext4
- mount options "user_xattr"
- each server has :
CPU:Intel® Xeon® Processor E5335(8M Cache, 2.00 GHz, 1333 MHz FSB) x2
MEM: 4Gb DDR2
- 1Gb network
Simple test:
mounted as ceph
root@client1:/mnt/mycephfs# time find . | wc -l
83932
real 17m55.399s
user 0m0.152s
sys 0m1.528s
on 1 HDD:
root@client1:/home# time find . | wc -l
83932
real 0m55.399s
user 0m0.152s
sys 0m1.528s
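For a rough sense of scale, the two timings above can be turned into an average cost per directory entry. This is only a sketch: the helper name ms_per_entry is made up for illustration, and it assumes the reported figures are accurate.

```shell
# Hypothetical helper: average milliseconds per entry for a timed walk.
# $1 = elapsed wall-clock seconds, $2 = number of entries reported by wc -l.
ms_per_entry() {
    LC_ALL=C awk -v secs="$1" -v n="$2" 'BEGIN { printf "%.2f\n", secs / n * 1000 }'
}

ms_per_entry 1075.399 83932   # CephFS run: 17m55.399s
ms_per_entry 55.399 83932     # local disk run: 0m55.399s
```

With the numbers as posted, this works out to roughly 12.8 ms per entry on CephFS versus about 0.66 ms on the local disk.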
Please help me find the issue. Thanks.
--
Kind regards,
R. Alekseev
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Ceph performance
2012-10-29 14:01 Ceph performance Roman Alekseev
@ 2012-10-29 18:57 ` Sam Lang
2012-10-30 8:27 ` Roman Alekseev
0 siblings, 1 reply; 10+ messages in thread
From: Sam Lang @ 2012-10-29 18:57 UTC (permalink / raw)
To: Roman Alekseev; +Cc: ceph-devel
Hi Roman,
Is this with the ceph fuse client or the ceph kernel module?
It's not surprising that the local file system (/home) is so much faster
than a mounted ceph volume, especially the first time the directory tree
is traversed (metadata results are cached at the client to improve
performance). Try running the same find command on the ceph volume and
see if the cached results at the client improve performance at all.
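The cold-versus-warm comparison suggested here could be scripted roughly as follows. This is a sketch only: walk_twice is a hypothetical helper, and the first pass is "cold" only if the tree has not already been walked on this client.

```shell
# Hypothetical helper: walk a directory twice and print the entry count and
# elapsed whole seconds for each pass (pass 1 = cold cache, pass 2 = warm).
walk_twice() {
    dir=$1
    for pass in 1 2; do
        start=$(date +%s)
        count=$(find "$dir" | wc -l)
        end=$(date +%s)
        echo "pass=$pass entries=$((count)) seconds=$((end - start))"
    done
}
# Usage (mount point taken from the original report):
#   walk_twice /mnt/mycephfs
```

If the second pass is much faster than the first, the client-side metadata cache is doing its job and the cost lies in the initial lookups.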
To understand what performance Ceph should be capable of with your
deployment for this specific workload, you should run iperf between two
nodes to get an idea of your network limits (iperf reports throughput;
a tool such as ping shows round-trip latency).
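A note on measuring this: iperf reports throughput, while round-trip latency is usually measured separately, for example with ping. The sketch below lists the commands as comments (node-a and node-b are placeholder host names) and adds a hypothetical helper, avg_rtt_ms, that extracts the average round-trip time from ping's summary line.

```shell
# Throughput between two nodes (start the server side first):
#   node-a$ iperf -s
#   node-b$ iperf -c node-a -i 1
# Round-trip latency:
#   node-b$ ping -c 20 node-a

# Hypothetical helper: pull the average RTT (ms) out of ping's summary line,
# e.g. "rtt min/avg/max/mdev = 0.045/0.052/0.061/0.006 ms".
avg_rtt_ms() {
    awk -F'/' '/^(rtt|round-trip) / { print $5 }'
}
# Usage: ping -c 20 node-a | avg_rtt_ms
```

A metadata-heavy walk issues many small requests in sequence, so the average round-trip time is likely to matter more here than raw bandwidth.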
Also, I noticed that the real timings you listed for ceph and /home are
offset by exactly 17 minutes (user and sys are identical). Was that a
copy/paste error, by chance?
-sam
* Re: Ceph performance
2012-10-29 18:57 ` Sam Lang
@ 2012-10-30 8:27 ` Roman Alekseev
2012-10-30 9:10 ` Gregory Farnum
0 siblings, 1 reply; 10+ messages in thread
From: Roman Alekseev @ 2012-10-30 8:27 UTC (permalink / raw)
To: Sam Lang; +Cc: ceph-devel
Hi Sam,
I use CephFS only through the kernel module, because we need its
performance, but as far as I can see it is slower than a FUSE-based
distributed file system. MooseFS, for example, completed the same test
in 3 min.
Here is the result of the iperf test between the client and an OSD server:
root@asrv151:~# iperf -c client -i 1
------------------------------------------------------------
Client connecting to clientIP, TCP port 5001
TCP window size: 96.1 KByte (default)
------------------------------------------------------------
[ 3] local osd_server port 50106 connected with clientIP port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 112 MBytes 941 Mbits/sec
[ 3] 1.0- 2.0 sec 110 MBytes 924 Mbits/sec
[ 3] 2.0- 3.0 sec 108 MBytes 905 Mbits/sec
[ 3] 3.0- 4.0 sec 109 MBytes 917 Mbits/sec
[ 3] 4.0- 5.0 sec 110 MBytes 926 Mbits/sec
[ 3] 5.0- 6.0 sec 109 MBytes 915 Mbits/sec
[ 3] 6.0- 7.0 sec 110 MBytes 926 Mbits/sec
[ 3] 7.0- 8.0 sec 108 MBytes 908 Mbits/sec
[ 3] 8.0- 9.0 sec 107 MBytes 897 Mbits/sec
[ 3] 9.0-10.0 sec 106 MBytes 886 Mbits/sec
[ 3] 0.0-10.0 sec 1.06 GBytes 914 Mbits/sec
ceph -w results:
health HEALTH_OK
monmap e3: 3 mons at {a=mon.a:6789/0,b=mon.b:6789/0,c=mon.c:6789/0},
election epoch 10, quorum 0,1,2 a,b,c
osdmap e132: 5 osds: 5 up, 5 in
pgmap v11720: 384 pgs: 384 active+clean; 1880 MB data, 10679 MB
used, 5185 GB / 5473 GB avail
mdsmap e4: 1/1/1 up {0=a=up:active}
2012-10-30 12:23:09.830677 osd.2 [WRN] slow request 30.135787 seconds
old, received at 2012-10-30 12:22:39.694780: osd_op(mds.0.1:309216
10000017163.00000000 [setxattr path (69),setxattr parent (196),tmapput
0~596] 1.724c80f7) v4 currently waiting for sub ops
2012-10-30 12:23:10.109637 mon.0 [INF] pgmap v11720: 384 pgs: 384
active+clean; 1880 MB data, 10679 MB used, 5185 GB / 5473 GB avail
2012-10-30 12:23:12.918038 mon.0 [INF] pgmap v11721: 384 pgs: 384
active+clean; 1880 MB data, 10680 MB used, 5185 GB / 5473 GB avail
2012-10-30 12:23:13.977044 mon.0 [INF] pgmap v11722: 384 pgs: 384
active+clean; 1880 MB data, 10681 MB used, 5185 GB / 5473 GB avail
2012-10-30 12:23:10.587391 osd.3 [WRN] 6 slow requests, 6 included
below; oldest blocked for > 30.808352 secs
2012-10-30 12:23:10.587398 osd.3 [WRN] slow request 30.808352 seconds
old, received at 2012-10-30 12:22:39.778971: osd_op(mds.0.1:308701
200.000002e5 [write 976010~5402] 1.adbeb1a) v4 currently waiting for sub ops
2012-10-30 12:23:10.587403 osd.3 [WRN] slow request 30.796417 seconds
old, received at 2012-10-30 12:22:39.790906: osd_op(mds.0.1:308702
200.000002e5 [write 981412~6019] 1.adbeb1a) v4 currently waiting for sub ops
2012-10-30 12:23:10.587408 osd.3 [WRN] slow request 30.796347 seconds
old, received at 2012-10-30 12:22:39.790976: osd_op(mds.0.1:308703
200.000002e5 [write 987431~61892] 1.adbeb1a) v4 currently waiting for
sub ops
2012-10-30 12:23:10.587413 osd.3 [WRN] slow request 30.530228 seconds
old, received at 2012-10-30 12:22:40.057095: osd_op(mds.0.1:308704
200.000002e5 [write 1049323~6630] 1.adbeb1a) v4 currently waiting for
sub ops
2012-10-30 12:23:10.587417 osd.3 [WRN] slow request 30.530027 seconds
old, received at 2012-10-30 12:22:40.057296: osd_op(mds.0.1:308705
200.000002e5 [write 1055953~20679] 1.adbeb1a) v4 currently waiting for
sub ops
At the same time I am copying data to the mounted Ceph storage.
I don't know what I can do to resolve this problem :(
Any advice will be greatly appreciated.
--
Kind regards,
R. Alekseev
* Re: Ceph performance
2012-10-30 8:27 ` Roman Alekseev
@ 2012-10-30 9:10 ` Gregory Farnum
2012-10-30 9:54 ` Maciej Gałkiewicz
2012-10-30 10:04 ` Roman Alekseev
0 siblings, 2 replies; 10+ messages in thread
From: Gregory Farnum @ 2012-10-30 9:10 UTC (permalink / raw)
To: Roman Alekseev; +Cc: Sam Lang, ceph-devel
On Tue, Oct 30, 2012 at 9:27 AM, Roman Alekseev <rs.alekseev@gmail.com> wrote:
> At the same time I am copying data to the mounted Ceph storage.
>
> I don't know what I can do to resolve this problem :(
> Any advice will be greatly appreciated.
Is it the same client copying data into cephfs or a different one?
I see here that you have several slow requests; it looks like maybe
you're overloading your disks. That could impact metadata lookups if
the MDS doesn't have everything cached; have you tried running this
test without data ingest? (Obviously we'd like it to be faster even
so, but if it's disk contention there's not a lot we can do.)
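To see which OSDs the slow requests concentrate on, the warnings from `ceph -w` can be tallied per OSD. A minimal sketch, assuming each log entry sits on a single line in the format shown earlier in the thread; slow_by_osd is a hypothetical helper name.

```shell
# Hypothetical helper: count "slow request" warnings per OSD.
# Assumes one log entry per line, in the form:
#   <date> <time> osd.N [WRN] slow request <age> seconds old, ...
slow_by_osd() {
    awk '$4 == "[WRN]" && $5 == "slow" && $6 == "request" { n[$3]++ }
         END { for (osd in n) print osd, n[osd] }' | sort
}
# Usage: ceph -w | tee ceph-w.log   (stop with Ctrl-C), then
#        slow_by_osd < ceph-w.log
```

Summary lines such as "6 slow requests, 6 included below" are skipped on purpose, so each request is counted once.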
-Greg
* Re: Ceph performance
2012-10-30 9:10 ` Gregory Farnum
@ 2012-10-30 9:54 ` Maciej Gałkiewicz
[not found] ` <508FAB9A.20307@gmail.com>
2012-10-30 10:04 ` Roman Alekseev
1 sibling, 1 reply; 10+ messages in thread
From: Maciej Gałkiewicz @ 2012-10-30 9:54 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Roman Alekseev, Sam Lang, ceph-devel
I have been experiencing the same problem. Tell us more about your
disks. Are they shared with the OS, the MDS, or the journal? Please paste
your ceph.conf.
regards
Maciej Galkiewicz
* Re: Ceph performance
2012-10-30 9:10 ` Gregory Farnum
2012-10-30 9:54 ` Maciej Gałkiewicz
@ 2012-10-30 10:04 ` Roman Alekseev
2012-10-30 10:14 ` Gregory Farnum
1 sibling, 1 reply; 10+ messages in thread
From: Roman Alekseev @ 2012-10-30 10:04 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Sam Lang, ceph-devel
On 30.10.2012 13:10, Gregory Farnum wrote:
> Is it the same client copying data into cephfs or a different one?
> I see here that you have several slow requests; it looks like maybe
> you're overloading your disks. That could impact metadata lookups if
> the MDS doesn't have everything cached; have you tried running this
> test without data ingest? (Obviously we'd like it to be faster even
> so, but if it's disk contention there's not a lot we can do.)
> -Greg
Dear Greg,
Yes, this was the same client. Sorry, could you please explain in more
detail how I can "test without data ingest"?
I can also rebuild my cluster from scratch and run all the tests again.
I have 5 dedicated servers, and I think that a Ceph cluster built from
them shouldn't be slower than the same cluster based on FUSE
technology. Am I right?
Thanks.
--
Kind regards,
R. Alekseev
* Re: Ceph performance
2012-10-30 10:04 ` Roman Alekseev
@ 2012-10-30 10:14 ` Gregory Farnum
0 siblings, 0 replies; 10+ messages in thread
From: Gregory Farnum @ 2012-10-30 10:14 UTC (permalink / raw)
To: Roman Alekseev; +Cc: Sam Lang, ceph-devel
On Tue, Oct 30, 2012 at 11:04 AM, Roman Alekseev <rs.alekseev@gmail.com> wrote:
>
> Dear Greg,
>
> Yes, this was the same client. Sorry, could you please explain in more
> detail how I can "test without data ingest"?
You said you were copying data into the cluster while doing this test.
It really shouldn't matter, but if you had enough activity going
on...it might. :/
> I can also rebuild my cluster from scratch and run all the tests again.
>
> I have 5 dedicated servers, and I think that a Ceph cluster built from them
> shouldn't be slower than the same cluster based on FUSE technology. Am I
> right?
Well, the cluster isn't really based on FUSE technology, that's just
the client. You're certainly seeing slower performance than I expect,
but Ceph is doing two copies of every write and I think MooseFS
defaults to one, for instance.
I'm concerned in particular about those "slow request" warnings, which
tend to indicate your data disks are overloaded. Can you share your
ceph.conf and your hardware configuration?
When you're done copying data into the cluster, can you also run
"rados -p data bench 60 write" and report the results back?
-Greg
* Re: Ceph performance
[not found] ` <508FAB9A.20307@gmail.com>
@ 2012-10-30 10:47 ` Maciej Gałkiewicz
2012-10-30 10:53 ` Roman Alekseev
0 siblings, 1 reply; 10+ messages in thread
From: Maciej Gałkiewicz @ 2012-10-30 10:47 UTC (permalink / raw)
To: Roman Alekseev; +Cc: Gregory Farnum, Sam Lang, ceph-devel
> ServerA(mon+osd):
> Filesystem  Size  Used  Avail  Use%  Mounted on
> /dev/sda1   9.2G  2.4G  6.4G   27%   /
> tmpfs       5.9G  0     5.9G   0%    /lib/init/rw
> udev        5.9G  148K  5.9G   1%    /dev
> tmpfs       5.9G  0     5.9G   0%    /dev/shm
> /dev/sda7   674G  483M  629G   1%    /var/lib/ceph/mon
> /dev/sda6   917G  2.4G  868G   1%    /var/lib/ceph/osd/ceph-4
I see that you have separate partitions for osd and mon. Some of them
are on the same disk as the OS. I would strongly recommend having a
separate disk (not just a partition) for each osd.
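Whether two mount points share a physical disk can be checked from the device names that df prints. A rough sketch, assuming sdX-style names only; base_disk and same_disk are hypothetical helpers.

```shell
# Hypothetical helper: strip the partition number from an sdX-style device
# name (/dev/sda6 -> /dev/sda). Does not handle names such as nvme0n1p1.
base_disk() {
    printf '%s\n' "$1" | sed 's/[0-9]*$//'
}

# Hypothetical helper: succeed if two partitions live on the same disk.
same_disk() {
    [ "$(base_disk "$1")" = "$(base_disk "$2")" ]
}

# Devices taken from the df output quoted above:
same_disk /dev/sda1 /dev/sda6 && echo "OS and OSD data share a disk"
```

On the server shown, the root file system, the mon store and the OSD data all sit on /dev/sda, so they compete for the same spindle.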
There is another problem, probably a more important one. By default the
journal is stored in /var/lib/ceph/osd/$cluster-$id/journal, which means
you have data and journal on the same disk. This is what slows you down
the most. I was amazed how much faster the cluster can be just by putting
the journal somewhere else (another disk, or tmpfs for tests only). In my
experience, there is no need for a separate disk for the mon. It can share
one with the OS.
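Moving the journal is done with the `osd journal` option in ceph.conf. The sketch below only prints a config fragment; journal_conf is a hypothetical helper and /srv/journal-disk is a placeholder for wherever the second disk is mounted, not a path from this thread. $cluster and $id are left unexpanded for Ceph to substitute.

```shell
# Hypothetical helper: print a ceph.conf fragment that places each OSD
# journal under another directory. $1 = mount point of the journal disk
# (a placeholder, not a path from the thread).
journal_conf() {
    cat <<EOF
[osd]
    osd journal = $1/\$cluster-\$id/journal
EOF
}

journal_conf /srv/journal-disk
```

An OSD normally has to be stopped and its journal flushed and recreated before the location changes, so treat this as a starting point rather than a complete procedure.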
Run rados bench every time you make an improvement in your cluster.
--
Regards
Maciej Galkiewicz
* Re: Ceph performance
2012-10-30 10:47 ` Maciej Gałkiewicz
@ 2012-10-30 10:53 ` Roman Alekseev
2012-10-30 10:57 ` Maciej Gałkiewicz
0 siblings, 1 reply; 10+ messages in thread
From: Roman Alekseev @ 2012-10-30 10:53 UTC (permalink / raw)
To: Maciej Gałkiewicz; +Cc: Gregory Farnum, Sam Lang, ceph-devel
Oops, that's actually true!!!
Give me a moment to check this out, and thank you for opening my eyes.
I will let you know the results of implementing your recommendations.
I'm going to completely rebuild my cluster. Thanks so much.
--
Kind regards,
R. Alekseev
* Re: Ceph performance
2012-10-30 10:53 ` Roman Alekseev
@ 2012-10-30 10:57 ` Maciej Gałkiewicz
0 siblings, 0 replies; 10+ messages in thread
From: Maciej Gałkiewicz @ 2012-10-30 10:57 UTC (permalink / raw)
To: Roman Alekseev; +Cc: Gregory Farnum, Sam Lang, ceph-devel
> Give me a moment to check this out and thank you for opening my eyes.
> I will let you know the results of implementing your recommendations.
I am looking forward to hearing from you. One more thing: run rados
bench right now to see what performance level you are starting from. I
suggest 2 tests, one with the default block size and one with, let's say, 4 KB.
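The two baseline runs might look like the following. This sketch only prints the commands rather than running them; bench_cmds is a hypothetical helper, the pool name "data" and the 60-second duration come from Greg's earlier message, and -b sets the write size in bytes.

```shell
# Hypothetical helper: print (not run) the two benchmark commands.
# $1 = pool name, $2 = duration in seconds.
bench_cmds() {
    echo "rados -p $1 bench $2 write"            # default block size
    echo "rados -p $1 bench $2 write -b 4096"    # 4 KB blocks
}

bench_cmds data 60
```

Keeping the output of both runs makes it possible to compare against the same two runs after each change to the cluster.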
Here is a very useful article which helped me a lot:
http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
--
Regards
Maciej Galkiewicz