* RGW Blocking on 1-2 PG's - argonaut
@ 2013-03-04 11:02 Sławomir Skowron
  2013-03-04 14:16 ` Yehuda Sadeh
  0 siblings, 1 reply; 11+ messages in thread
From: Sławomir Skowron @ 2013-03-04 11:02 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1758 bytes --]

Hi,

We have a big problem with RGW. I don't know what the initial trigger
is, but I have a theory.

2-3 OSDs out of 78 in the cluster (6480 PGs on the RGW pool) use 3x
more RAM than the others, have many more operations in the journal,
and show much higher latency.

When we PUT some objects, in some cases there are so many operations
in triple replication on this OSD (one PG) that the triple can't
handle the load and goes down. The drives behind these OSDs are
overloaded, with high I/O wait and long response times. RGW waits for
this PG and eventually blocks all other operations once 1024
operations are blocked in the queue.
Then the whole cluster has problems, and we have an outage.

When RGW blocks operations, there is only one PG that has >1000
operations in its queue:
ceph pg map 3.9447554d
osdmap e11404 pg 3.9447554d (3.54d) -> up [53,45,23] acting [53,45,23]

That is the mapping now, after migration with a ratio of 0.5 applied; before, it was:

ceph pg map 3.9447554d
osdmap e11404 pg 3.9447554d (3.54d) -> up [71,45,23] acting [71,45,23]

and these three OSDs are the ones with the problems. Behind these OSDs
there are only 3 drives, one drive per OSD, which is why the impact is
so big.

What I did: I gave these OSDs a 50% smaller ratio in CRUSH, but the
data moved to other OSDs, and these OSDs now hold only half of their
possible capacity. I don't think it will help in the long term, and
it's not a solution.

I have a second cluster, used only for replication, and the same thing
happens there. The attachments explain everything. Every parameter on
this bad OSD is much higher than on the others. There are 2-3 OSDs with
such high counters.

Is this a bug? Maybe there is no such problem in bobtail? I can't
switch to bobtail quickly, which is why I need some answers about which
way to go.

Best Regards

Slawomir Skowron

[-- Attachment #2: bad_osd.png --]
[-- Type: image/png, Size: 14886 bytes --]

[-- Attachment #3: good_osds.png --]
[-- Type: image/png, Size: 50297 bytes --]


* Re: RGW Blocking on 1-2 PG's - argonaut
  2013-03-04 11:02 RGW Blocking on 1-2 PG's - argonaut Sławomir Skowron
@ 2013-03-04 14:16 ` Yehuda Sadeh
  2013-03-04 16:13   ` Sławomir Skowron
  0 siblings, 1 reply; 11+ messages in thread
From: Yehuda Sadeh @ 2013-03-04 14:16 UTC (permalink / raw)
  To: Sławomir Skowron; +Cc: ceph-devel@vger.kernel.org

Not sure if bobtail is going to help much here, although there were a
few performance fixes that went in. If your cluster is unbalanced (in
terms of performance) then requests are going to be accumulated on the
weakest link. Reweighting the osd like what you did is a valid way to
go. You need to make sure that on the steady state, there's no one osd
that starts holding all the traffic.
Also, make sure that your pools have enough pgs so that the placement
distribution is uniform.

Yehuda
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: RGW Blocking on 1-2 PG's - argonaut
  2013-03-04 14:16 ` Yehuda Sadeh
@ 2013-03-04 16:13   ` Sławomir Skowron
  2013-03-04 17:02     ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Sławomir Skowron @ 2013-03-04 16:13 UTC (permalink / raw)
  To: Yehuda Sadeh; +Cc: ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 3132 bytes --]

OK, thanks for the response. But if I have a CRUSH map like the one in
the attachment, all data should be balanced equally, not including
hosts with 0.5 weight.

How can I make the data balance automatically when I know that some
PGs have too much data? I have 4800 PGs on the RGW pool alone, with 78
OSDs, which should be quite enough.

pool 3 '.rgw.buckets' rep size 3 crush_ruleset 0 object_hash rjenkins
pg_num 4800 pgp_num 4800 last_change 908 owner 0

When will it be possible to expand the number of PGs?
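
For reference, the two PG ids printed by "ceph pg map" in the first message (3.9447554d and 3.54d) can be related with a short sketch. It assumes the raw hash is folded into the PG range the way Ceph's ceph_stable_mod() does, and that pg_num is 4800 as in the pool line above; the helper names are illustrative only.

```python
def pg_mask(pg_num: int) -> int:
    """Smallest (2**k - 1) mask that covers the range 0 .. pg_num - 1."""
    return (1 << (pg_num - 1).bit_length()) - 1


def stable_mod(x: int, b: int, bmask: int) -> int:
    """Fold hash x into [0, b), following the logic of ceph_stable_mod()."""
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


raw_hash = 0x9447554D  # from "ceph pg map 3.9447554d"
pg_num = 4800          # from the pool 3 '.rgw.buckets' line

mask = pg_mask(pg_num)  # 0x1fff for pg_num 4800
print(hex(stable_mod(raw_hash, pg_num, mask)))
```

Under these assumptions the result is 0x54d, which agrees with the "(3.54d)" shown in the osdmap output. With pg_num 6480 the same hash would fold to 0x154d instead, so the output looks consistent with 4800 PGs rather than 6480.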

Best Regards

Slawomir Skowron




--
-----
Pozdrawiam

Sławek "sZiBis" Skowron

[-- Attachment #2: crush.txt --]
[-- Type: text/plain, Size: 4807 bytes --]

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
device 28 osd.28
device 29 osd.29
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 33 osd.33
device 34 osd.34
device 35 osd.35
device 36 osd.36
device 37 osd.37
device 38 osd.38
device 39 osd.39
device 40 osd.40
device 41 osd.41
device 42 osd.42
device 43 osd.43
device 44 osd.44
device 45 osd.45
device 46 osd.46
device 47 osd.47
device 48 osd.48
device 49 osd.49
device 50 osd.50
device 51 osd.51
device 52 osd.52
device 53 osd.53
device 54 osd.54
device 55 osd.55
device 56 osd.56
device 57 osd.57
device 58 osd.58
device 59 osd.59
device 60 osd.60
device 61 osd.61
device 62 osd.62
device 63 osd.63
device 64 osd.64
device 65 osd.65
device 66 osd.66
device 67 osd.67
device 68 osd.68
device 69 osd.69
device 70 osd.70
device 71 osd.71
device 72 osd.72
device 73 osd.73
device 74 osd.74
device 75 osd.75
device 76 osd.76
device 77 osd.77

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host s3-10-177-64-4 {
	id -2		# do not change unnecessarily
	# weight 25.500
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 1.000
	item osd.2 weight 1.000
	item osd.3 weight 1.000
	item osd.4 weight 1.000
	item osd.5 weight 1.000
	item osd.6 weight 1.000
	item osd.7 weight 1.000
	item osd.8 weight 1.000
	item osd.9 weight 1.000
	item osd.10 weight 1.000
	item osd.11 weight 1.000
	item osd.13 weight 1.000
	item osd.14 weight 1.000
	item osd.15 weight 1.000
	item osd.16 weight 1.000
	item osd.17 weight 1.000
	item osd.18 weight 1.000
	item osd.19 weight 1.000
	item osd.20 weight 1.000
	item osd.21 weight 1.000
	item osd.22 weight 1.000
	item osd.23 weight 0.500
	item osd.24 weight 1.000
	item osd.25 weight 1.000
	item osd.12 weight 1.000
	item osd.0 weight 1.000
}
rack rack1 {
	id -3		# do not change unnecessarily
	# weight 25.500
	alg straw
	hash 0	# rjenkins1
	item s3-10-177-64-4 weight 25.500
}
host s3-10-177-64-6 {
	id -4		# do not change unnecessarily
	# weight 25.500
	alg straw
	hash 0	# rjenkins1
	item osd.26 weight 1.000
	item osd.27 weight 1.000
	item osd.28 weight 1.000
	item osd.31 weight 1.000
	item osd.32 weight 1.000
	item osd.33 weight 1.000
	item osd.34 weight 1.000
	item osd.35 weight 1.000
	item osd.37 weight 1.000
	item osd.38 weight 1.000
	item osd.39 weight 1.000
	item osd.40 weight 1.000
	item osd.41 weight 1.000
	item osd.42 weight 1.000
	item osd.43 weight 1.000
	item osd.45 weight 0.500
	item osd.46 weight 1.000
	item osd.47 weight 1.000
	item osd.48 weight 1.000
	item osd.49 weight 1.000
	item osd.50 weight 1.000
	item osd.51 weight 1.000
	item osd.44 weight 1.000
	item osd.29 weight 1.000
	item osd.36 weight 1.000
	item osd.30 weight 1.000
}
rack rack2 {
	id -5		# do not change unnecessarily
	# weight 25.500
	alg straw
	hash 0	# rjenkins1
	item s3-10-177-64-6 weight 25.500
}
host s3-10-177-64-8 {
	id -6		# do not change unnecessarily
	# weight 25.500
	alg straw
	hash 0	# rjenkins1
	item osd.52 weight 1.000
	item osd.53 weight 1.000
	item osd.54 weight 1.000
	item osd.55 weight 1.000
	item osd.56 weight 1.000
	item osd.57 weight 1.000
	item osd.58 weight 1.000
	item osd.59 weight 1.000
	item osd.60 weight 1.000
	item osd.61 weight 1.000
	item osd.62 weight 1.000
	item osd.63 weight 1.000
	item osd.64 weight 1.000
	item osd.65 weight 1.000
	item osd.66 weight 1.000
	item osd.67 weight 1.000
	item osd.68 weight 1.000
	item osd.69 weight 1.000
	item osd.70 weight 1.000
	item osd.71 weight 0.500
	item osd.72 weight 1.000
	item osd.73 weight 1.000
	item osd.74 weight 1.000
	item osd.75 weight 1.000
	item osd.76 weight 1.000
	item osd.77 weight 1.000
}
rack rack3 {
	id -7		# do not change unnecessarily
	# weight 25.500
	alg straw
	hash 0	# rjenkins1
	item s3-10-177-64-8 weight 25.500
}
pool default {
	id -1		# do not change unnecessarily
	# weight 76.500
	alg straw
	hash 0	# rjenkins1
	item rack1 weight 25.500
	item rack2 weight 25.500
	item rack3 weight 25.500
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type rack
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type rack
	step emit
}
rule rbd {
	ruleset 2
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type rack
	step emit
}

# end crush map
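
As a rough back-of-the-envelope check on this map, the sketch below estimates how many PG replicas each OSD should hold for pool 3. It assumes that with rep size 3 and "chooseleaf firstn 0 type rack" over three racks, every rack holds one replica of each PG, and it models placement inside a rack as independent weighted random choices. That is only an approximation of what straw buckets do, and it ignores the other pools.

```python
import math

pg_num = 4800        # pool 3 '.rgw.buckets'
rack_weight = 25.5   # 25 OSDs at weight 1.000 plus one at 0.500, per rack


def expected_pgs(weight: float) -> float:
    """Expected PG replicas on one OSD, proportional to its CRUSH weight."""
    return pg_num * weight / rack_weight


def spread(weight: float) -> float:
    """Standard deviation under a simple binomial placement model."""
    p = weight / rack_weight
    return math.sqrt(pg_num * p * (1 - p))


for w in (1.0, 0.5):
    print(f"weight {w}: ~{expected_pgs(w):.1f} PGs, +/- {spread(w):.1f}")
```

Under these assumptions a weight-1.000 OSD ends up with roughly 188 PG replicas and a weight-0.500 OSD with roughly 94, with a statistical spread of around 13 PGs for the former. A spread of that size would not by itself explain a 3x difference in load on a few OSDs.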


* Re: RGW Blocking on 1-2 PG's - argonaut
  2013-03-04 16:13   ` Sławomir Skowron
@ 2013-03-04 17:02     ` Sage Weil
  2013-03-04 17:23       ` Sławomir Skowron
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2013-03-04 17:02 UTC (permalink / raw)
  To: Sławomir Skowron; +Cc: Yehuda Sadeh, ceph-devel@vger.kernel.org

On Mon, 4 Mar 2013, Sławomir Skowron wrote:
> OK, thanks for the response. But if I have a CRUSH map like the one in
> the attachment, all data should be balanced equally, not including
> hosts with 0.5 weight.
> 
> How can I make the data balance automatically when I know that some
> PGs have too much data? I have 4800 PGs on the RGW pool alone, with 78
> OSDs, which should be quite enough.
> 
> pool 3 '.rgw.buckets' rep size 3 crush_ruleset 0 object_hash rjenkins
> pg_num 4800 pgp_num 4800 last_change 908 owner 0
> 
> When will it be possible to expand the number of PGs?

Soon.  :)

The bigger question for me is why there is one PG that is getting pounded 
while the others are not.  Is there a large skew in the workload toward a 
small number of very hot objects?  I expect it should be obvious if you go 
to the loaded osd and do

 ceph --admin-daemon /var/run/ceph/ceph-osd.NN.asok dump_ops_in_flight

and look at the request queue.
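
One way to look for a skew toward a few hot objects is to count the queued ops per object. The following is only a sketch: it assumes the admin socket output is JSON with an "ops" list whose entries carry a "description" string of the form "osd_op(<client>:<tid> <object> [<ops>] <pg> ...)". Both the field names and that layout are assumptions to verify against real output, and the sample data is made up.

```python
import json
import re
from collections import Counter

# Assumed layout of each description: "osd_op(<client>:<tid> <object> [<ops>] <pg>)"
OSD_OP = re.compile(r"osd_op\(\S+\s+(?P<obj>\S+)\s+\[")


def hot_objects(dump_text: str, top: int = 5):
    """Return the most frequent object names among the in-flight ops."""
    dump = json.loads(dump_text)
    counts = Counter()
    for op in dump.get("ops", []):
        match = OSD_OP.search(op.get("description", ""))
        counts[match.group("obj") if match else "<unparsed>"] += 1
    return counts.most_common(top)


# Hypothetical sample; real input would be the saved dump_ops_in_flight output.
sample = json.dumps({
    "num_ops": 3,
    "ops": [
        {"description": "osd_op(client.4100.0:1 obj_a [write 0~4096] 3.9447554d)"},
        {"description": "osd_op(client.4100.0:2 obj_a [write 0~4096] 3.9447554d)"},
        {"description": "osd_op(client.4101.0:7 obj_b [read 0~4096] 3.1a2b3c4d)"},
    ],
})
print(hot_objects(sample))
```

If one or two object names dominate the counts while the queue is backed up, that would point to a hot object rather than to uneven placement.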

sage




* Re: RGW Blocking on 1-2 PG's - argonaut
  2013-03-04 17:02     ` Sage Weil
@ 2013-03-04 17:23       ` Sławomir Skowron
  2013-03-04 17:25         ` Gregory Farnum
  0 siblings, 1 reply; 11+ messages in thread
From: Sławomir Skowron @ 2013-03-04 17:23 UTC (permalink / raw)
  To: Sage Weil; +Cc: Yehuda Sadeh, ceph-devel@vger.kernel.org

On Mon, Mar 4, 2013 at 6:02 PM, Sage Weil <sage@inktank.com> wrote:
> On Mon, 4 Mar 2013, Sławomir Skowron wrote:
>> OK, thanks for the response. But if I have a CRUSH map like the one in
>> the attachment, all data should be balanced equally, not including
>> hosts with 0.5 weight.
>>
>> How can I make the data balance automatically when I know that some
>> PGs have too much data? I have 4800 PGs on the RGW pool alone, with 78
>> OSDs, which should be quite enough.
>>
>> pool 3 '.rgw.buckets' rep size 3 crush_ruleset 0 object_hash rjenkins
>> pg_num 4800 pgp_num 4800 last_change 908 owner 0
>>
>> When will it be possible to expand the number of PGs?
>
> Soon.  :)
>
> The bigger question for me is why there is one PG that is getting pounded
> while the others are not.  Is there a large skew in the workload toward a
> small number of very hot objects?

Yes, there are constantly about 100-200 operations per second, all
going into the RGW backend. But when problems come, there are more
requests, more GETs and PUTs, because applications with short
timeouts reconnect. Statistically, though, new PUTs normally go to
many PGs, so this should not overload a single master OSD. Maybe
balanced reads from all replicas could help a little?

>  I expect it should be obvious if you go
> to the loaded osd and do
>
>  ceph --admin-daemon /var/run/ceph/ceph-osd.NN.asok dump_ops_in_flight
>

Yes, I did that, but such long operations appear only when the cluster
becomes unstable. Normally there are no ops in the queue; they show up
only when the cluster is rebalancing, remapping, or something similar.

> and look at the request queue.
>
> sage
>



--
-----
Pozdrawiam

Sławek "sZiBis" Skowron


* Re: RGW Blocking on 1-2 PG's - argonaut
  2013-03-04 17:23       ` Sławomir Skowron
@ 2013-03-04 17:25         ` Gregory Farnum
  2013-03-04 17:42           ` Sławomir Skowron
  0 siblings, 1 reply; 11+ messages in thread
From: Gregory Farnum @ 2013-03-04 17:25 UTC (permalink / raw)
  To: Sławomir Skowron; +Cc: Sage Weil, Yehuda Sadeh, ceph-devel@vger.kernel.org

Have you checked the baseline disk performance of the OSDs? Perhaps
it's not that the PG is bad but that the OSDs are slow.


* Re: RGW Blocking on 1-2 PG's - argonaut
  2013-03-04 17:25         ` Gregory Farnum
@ 2013-03-04 17:42           ` Sławomir Skowron
  2013-03-04 18:34             ` Sławomir Skowron
  0 siblings, 1 reply; 11+ messages in thread
From: Sławomir Skowron @ 2013-03-04 17:42 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, Yehuda Sadeh, ceph-devel@vger.kernel.org

Alone (one of the slow OSDs in the triple mentioned earlier):

2013-03-04 18:39:27.683035 osd.23 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 15.241943 sec at 68795 KB/sec

In a for loop (some slow requests appear):

for x in `seq 0 25`; do ceph osd tell $x bench;done
2013-03-04 18:41:08.259454 osd.12 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 37.658448 sec at 27844 KB/sec
2013-03-04 18:41:07.850213 osd.5 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 37.402402 sec at 28034 KB/sec
2013-03-04 18:41:07.850231 osd.11 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 37.201831 sec at 28186 KB/sec
2013-03-04 18:41:08.100186 osd.10 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 37.540605 sec at 27931 KB/sec
2013-03-04 18:41:08.319766 osd.21 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 37.532806 sec at 27937 KB/sec
2013-03-04 18:41:08.415835 osd.14 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 37.772730 sec at 27760 KB/sec
2013-03-04 18:41:08.775264 osd.9 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 38.195523 sec at 27452 KB/sec
2013-03-04 18:41:08.808824 osd.6 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 38.338387 sec at 27350 KB/sec
2013-03-04 18:41:08.923809 osd.19 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 38.177933 sec at 27465 KB/sec
2013-03-04 18:41:08.925848 osd.18 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 38.201476 sec at 27448 KB/sec
2013-03-04 18:41:08.936961 osd.15 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 38.273058 sec at 27397 KB/sec
2013-03-04 18:41:08.619022 osd.20 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 37.713017 sec at 27804 KB/sec
2013-03-04 18:41:08.764705 osd.22 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 37.954886 sec at 27626 KB/sec
2013-03-04 18:41:08.499156 osd.0 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 38.035553 sec at 27568 KB/sec
2013-03-04 18:41:07.873457 osd.2 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 37.489969 sec at 27969 KB/sec
2013-03-04 18:41:08.134530 osd.13 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 37.513056 sec at 27952 KB/sec
2013-03-04 18:41:08.219142 osd.1 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 37.856368 sec at 27698 KB/sec
2013-03-04 18:41:08.485806 osd.4 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 38.060621 sec at 27550 KB/sec
2013-03-04 18:41:08.612236 osd.7 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 38.122105 sec at 27505 KB/sec
2013-03-04 18:41:08.647494 osd.8 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 38.134885 sec at 27496 KB/sec
2013-03-04 18:41:08.649267 osd.3 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 37.961966 sec at 27621 KB/sec
2013-03-04 18:41:08.943610 osd.24 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 38.091272 sec at 27527 KB/sec
2013-03-04 18:41:08.975838 osd.17 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 38.270884 sec at 27398 KB/sec
2013-03-04 18:41:09.544561 osd.23 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 38.715030 sec at 27084 KB/sec
2013-03-04 18:41:08.969981 osd.16 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 38.287596 sec at 27386 KB/sec
2013-03-04 18:41:09.533789 osd.25 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 37.954333 sec at 27627 KB/sec

My XFS is a little fragmented, but performance is still good.
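
To compare the two runs, the bench lines can be reduced to per-OSD throughput. The sketch below is a hypothetical helper, not a Ceph tool; it recomputes KB/sec from the reported size and duration and tolerates the log line being wrapped after "blocks".

```python
import re

# Matches e.g. "osd.23 [INF] bench: wrote 1024 MB in blocks of 4096 KB in 15.241943 sec"
BENCH = re.compile(
    r"osd\.(\d+) \[INF\] bench: wrote (\d+) MB in blocks\s+of \d+ KB in ([\d.]+) sec"
)


def throughput_kb_s(log_text: str) -> dict:
    """Map OSD id -> throughput in KB/sec, recomputed from size and duration."""
    return {
        int(osd): int(mb) * 1024 / float(sec)
        for osd, mb, sec in BENCH.findall(log_text)
    }


alone = throughput_kb_s(
    "2013-03-04 18:39:27.683035 osd.23 [INF] bench: wrote 1024 MB in blocks\n"
    "of 4096 KB in 15.241943 sec at 68795 KB/sec"
)
loaded = throughput_kb_s(
    "2013-03-04 18:41:09.544561 osd.23 [INF] bench: wrote 1024 MB in blocks\n"
    "of 4096 KB in 38.715030 sec at 27084 KB/sec"
)
print(f"osd.23 keeps {loaded[23] / alone[23]:.0%} of its solo throughput")
```

For osd.23 this gives about 39% of the solo throughput when all 26 OSDs on the host are benchmarked at once. Since every OSD in the loop lands near 27-28 MB/sec, the slowdown may reflect a limit shared by the whole host rather than one slow disk, though the bench output alone cannot confirm that.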




--
-----
Pozdrawiam

Sławek "sZiBis" Skowron


* Re: RGW Blocking on 1-2 PG's - argonaut
  2013-03-04 17:42           ` Sławomir Skowron
@ 2013-03-04 18:34             ` Sławomir Skowron
  2013-03-06 13:06               ` Sławomir Skowron
  0 siblings, 1 reply; 11+ messages in thread
From: Sławomir Skowron @ 2013-03-04 18:34 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, Yehuda Sadeh, ceph-devel@vger.kernel.org

And some output from rest-bench:

2013-03-04 19:31:41.503865 min lat: 0.166207 max lat: 3.44611 avg lat: 0.911577
2013-03-04 19:31:41.503865   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
2013-03-04 19:31:41.503865    40      16       715       699   69.7985        64   1.54288  0.911577
2013-03-04 19:31:42.504218    41      16       721       705   68.6825        24  0.949049  0.909889
2013-03-04 19:31:43.504528    42      16       742       726   69.0462        84  0.566944    0.9164
2013-03-04 19:31:44.504857    43      16       761       745   69.2071        76   1.17317  0.919921
2013-03-04 19:31:45.505099    44      16       766       750   68.0899        20   1.23423  0.918905
2013-03-04 19:31:46.506975    45      16       785       769   68.2626        76  0.711296   0.92321
2013-03-04 19:31:47.507964    46      16       794       778   67.5607        36   1.79786  0.926638
2013-03-04 19:31:48.508148    47      16       812       796   67.6548        72  0.847533  0.930029
2013-03-04 19:31:49.508347    48      16       829       813   67.6617        68  0.807918  0.940498
2013-03-04 19:31:50.508547    49      16       840       824   67.1792        44   0.95126  0.938767
2013-03-04 19:31:51.508753    50      16       858       842   67.2752        72  0.711993  0.937664
2013-03-04 19:31:52.509076    51      13       859       846   66.2706        16   1.49896  0.939526
2013-03-04 19:31:53.509662 Total time run:         51.235707
Total writes made:      859
Write size:             4194304
Bandwidth (MB/sec):     67.063

Stddev Bandwidth:       22.35
Max bandwidth (MB/sec): 100
Min bandwidth (MB/sec): 0
Average Latency:        0.951978
Stddev Latency:         0.456654
Max latency:            3.44611
Min latency:            0.166207
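
As a sanity check on the summary above, the reported bandwidth can be
re-derived from the other figures. This is only a sketch: it assumes
rest-bench counts 1 MB as 1024 * 1024 bytes, and the function name is
made up for illustration.

```python
# Cross-check of the rest-bench summary: bandwidth = bytes written / elapsed time.
# The three inputs are copied from the summary; nothing else is measured here.
total_writes = 859        # "Total writes made"
write_size = 4194304      # "Write size" in bytes (4 MiB per object)
total_time = 51.235707    # "Total time run" in seconds


def bandwidth_mb_per_sec(writes: int, size_bytes: int, seconds: float) -> float:
    """Average bandwidth in MB/s, assuming 1 MB = 1024 * 1024 bytes."""
    return writes * size_bytes / (1024 * 1024) / seconds


bw = bandwidth_mb_per_sec(total_writes, write_size, total_time)
print(f"{bw:.3f} MB/sec")  # 67.063 MB/sec, matching "Bandwidth (MB/sec)"
```

The figure agrees with the reported 67.063 MB/sec, so the summary is
internally consistent.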

On Mon, Mar 4, 2013 at 6:42 PM, Sławomir Skowron <szibis@gmail.com> wrote:
> Alone (one of the slow OSDs in the triple mentioned above):
>
> 2013-03-04 18:39:27.683035 osd.23 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 15.241943 sec at 68795 KB/sec
>
> In a for loop (some slow requests appear):
>
> for x in `seq 0 25`; do ceph osd tell $x bench;done
> 2013-03-04 18:41:08.259454 osd.12 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 37.658448 sec at 27844 KB/sec
> 2013-03-04 18:41:07.850213 osd.5 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 37.402402 sec at 28034 KB/sec
> 2013-03-04 18:41:07.850231 osd.11 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 37.201831 sec at 28186 KB/sec
> 2013-03-04 18:41:08.100186 osd.10 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 37.540605 sec at 27931 KB/sec
> 2013-03-04 18:41:08.319766 osd.21 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 37.532806 sec at 27937 KB/sec
> 2013-03-04 18:41:08.415835 osd.14 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 37.772730 sec at 27760 KB/sec
> 2013-03-04 18:41:08.775264 osd.9 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 38.195523 sec at 27452 KB/sec
> 2013-03-04 18:41:08.808824 osd.6 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 38.338387 sec at 27350 KB/sec
> 2013-03-04 18:41:08.923809 osd.19 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 38.177933 sec at 27465 KB/sec
> 2013-03-04 18:41:08.925848 osd.18 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 38.201476 sec at 27448 KB/sec
> 2013-03-04 18:41:08.936961 osd.15 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 38.273058 sec at 27397 KB/sec
> 2013-03-04 18:41:08.619022 osd.20 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 37.713017 sec at 27804 KB/sec
> 2013-03-04 18:41:08.764705 osd.22 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 37.954886 sec at 27626 KB/sec
> 2013-03-04 18:41:08.499156 osd.0 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 38.035553 sec at 27568 KB/sec
> 2013-03-04 18:41:07.873457 osd.2 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 37.489969 sec at 27969 KB/sec
> 2013-03-04 18:41:08.134530 osd.13 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 37.513056 sec at 27952 KB/sec
> 2013-03-04 18:41:08.219142 osd.1 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 37.856368 sec at 27698 KB/sec
> 2013-03-04 18:41:08.485806 osd.4 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 38.060621 sec at 27550 KB/sec
> 2013-03-04 18:41:08.612236 osd.7 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 38.122105 sec at 27505 KB/sec
> 2013-03-04 18:41:08.647494 osd.8 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 38.134885 sec at 27496 KB/sec
> 2013-03-04 18:41:08.649267 osd.3 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 37.961966 sec at 27621 KB/sec
> 2013-03-04 18:41:08.943610 osd.24 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 38.091272 sec at 27527 KB/sec
> 2013-03-04 18:41:08.975838 osd.17 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 38.270884 sec at 27398 KB/sec
> 2013-03-04 18:41:09.544561 osd.23 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 38.715030 sec at 27084 KB/sec
> 2013-03-04 18:41:08.969981 osd.16 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 38.287596 sec at 27386 KB/sec
> 2013-03-04 18:41:09.533789 osd.25 [INF] bench: wrote 1024 MB in blocks
> of 4096 KB in 37.954333 sec at 27627 KB/sec
>
> My XFS is a little fragmented, but performance is still good.
>
> On Mon, Mar 4, 2013 at 6:25 PM, Gregory Farnum <greg@inktank.com> wrote:
>> On Mon, Mar 4, 2013 at 9:23 AM, Sławomir Skowron <szibis@gmail.com> wrote:
>>> On Mon, Mar 4, 2013 at 6:02 PM, Sage Weil <sage@inktank.com> wrote:
>>>> On Mon, 4 Mar 2013, Sławomir Skowron wrote:
>>>>> OK, thanks for the response. But I have a CRUSH map like the one in
>>>>> the attachment.
>>>>>
>>>>> All data should be balanced equally, not counting the hosts with 0.5
>>>>> weight.
>>>>>
>>>>> How can I make the data balance automatically when I know that some
>>>>> PGs have too much data? I have 4800 PGs for RGW alone across 78 OSDs,
>>>>> which should be quite enough.
>>>>>
>>>>> pool 3 '.rgw.buckets' rep size 3 crush_ruleset 0 object_hash rjenkins
>>>>> pg_num 4800 pgp_num 4800 last_change 908 owner 0
>>>>>
>>>>> When will it be possible to expand the number of PGs?
>>>>
>>>> Soon.  :)
>>>>
>>>> The bigger question for me is why there is one PG that is getting pounded
>>>> while the others are not.  Is there a large skew in the workload toward a
>>>> small number of very hot objects?
>>>
>>> Yes, there are constantly about 100-200 operations per second, all
>>> going to the RGW backend. But when problems occur, there are more
>>> requests, more GETs and PUTs, because applications reconnect with
>>> short timeouts. Statistically, though, new PUTs normally go to many
>>> PGs, so this should not overload a single master OSD. Maybe balanced
>>> reads from all replicas could help a little?
>>>
>>>>  I expect it should be obvious if you go
>>>> to the loaded osd and do
>>>>
>>>>  ceph --admin-daemon /var/run/ceph/ceph-osd.NN.asok dump_ops_in_flight
>>>>
>>>
>>> Yes, I did that, but such long operations appear only when the cluster
>>> becomes unstable. Normally there are no ops in the queue, only when the
>>> cluster is rebalancing, remapping, or doing something similar.
>>
>> Have you checked the baseline disk performance of the OSDs? Perhaps
>> it's not that the PG is bad but that the OSDs are slow.
>
>
>
> --
> -----
> Regards
>
> Sławek "sZiBis" Skowron



--
-----
Regards

Sławek "sZiBis" Skowron

* Re: RGW Blocking on 1-2 PG's - argonaut
  2013-03-04 18:34             ` Sławomir Skowron
@ 2013-03-06 13:06               ` Sławomir Skowron
  2013-03-06 14:04                 ` Yehuda Sadeh
  0 siblings, 1 reply; 11+ messages in thread
From: Sławomir Skowron @ 2013-03-06 13:06 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, Yehuda Sadeh, ceph-devel@vger.kernel.org

Hi, I ran some tests to reproduce this problem.

As you can see, only one drive (all of these drives are in the same PG)
is much more heavily utilized than the others, and there are some ops
queued on this slow OSD. The test issues HEAD requests for S3 objects in
alphabetical order. This is strange: why are these requests served
mostly by this one triple of OSDs?

Checking which OSDs are in this PG:

 ceph pg map 7.35b
osdmap e117008 pg 7.35b (7.35b) -> up [18,61,133] acting [18,61,133]
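
For readers wondering how an object ends up in one particular PG: the
PG id is the object-name hash folded into the pool's pg_num, so a
single object always lands in the same PG and therefore on the same OSD
triple. The sketch below is modelled on Ceph's ceph_stable_mod()
helper; the Python function names are mine, and the pg_num of pool 7 is
not stated anywhere in this thread, so the 1024 used for it is purely
an assumption.

```python
def pg_mask(pg_num: int) -> int:
    """Smallest (2**n - 1) bit mask that covers pg_num - 1."""
    return (1 << (pg_num - 1).bit_length()) - 1


def stable_mod(x: int, b: int, bmask: int) -> int:
    """Fold hash x into the range [0, b), as ceph_stable_mod() does."""
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def pg_for_hash(obj_hash: int, pg_num: int) -> int:
    """PG number within a pool for a given object-name hash."""
    return stable_mod(obj_hash, pg_num, pg_mask(pg_num))


# Pool 3 from earlier in the thread: pg_num 4800, and "ceph pg map"
# reported pg 3.9447554d as (3.54d).
print(hex(pg_for_hash(0x9447554D, 4800)))  # 0x54d

# Pool 7: object hash 2b11a75b, reported PG 7.35b. pg_num = 1024 is an
# ASSUMPTION made only so the example runs.
print(hex(pg_for_hash(0x2B11A75B, 1024)))  # 0x35b
```

The point of the sketch is only that the mapping depends on the object
name, not on load: many operations against one object can never spread
across PGs.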

On osd.61

{ "num_ops": 13,
  "ops": [
        { "description": "osd_sub_op(client.10376104.0:961532 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370134
snapset=0=[]:[] snapc=0=[])",
          "received_at": "2013-03-06 13:59:18.448543",
          "age": "0.032431",
          "flag_point": "started"},
        { "description": "osd_sub_op(client.10376110.0:972570 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370135
snapset=0=[]:[] snapc=0=[])",
          "received_at": "2013-03-06 13:59:18.453829",
          "age": "0.027145",
          "flag_point": "started"},
        { "description": "osd_sub_op(client.10376104.0:961534 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370136
snapset=0=[]:[] snapc=0=[])",
          "received_at": "2013-03-06 13:59:18.454012",
          "age": "0.026962",
          "flag_point": "started"},
        { "description": "osd_sub_op(client.10376107.0:952760 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370137
snapset=0=[]:[] snapc=0=[])",
          "received_at": "2013-03-06 13:59:18.458980",
          "age": "0.021994",
          "flag_point": "started"},
        { "description": "osd_sub_op(client.10376110.0:972572 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370138
snapset=0=[]:[] snapc=0=[])",
          "received_at": "2013-03-06 13:59:18.459546",
          "age": "0.021428",
          "flag_point": "started"},
        { "description": "osd_sub_op(client.10376110.0:972574 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370139
snapset=0=[]:[] snapc=0=[])",
          "received_at": "2013-03-06 13:59:18.463680",
          "age": "0.017294",
          "flag_point": "started"},
        { "description": "osd_sub_op(client.10376107.0:952762 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370140
snapset=0=[]:[] snapc=0=[])",
          "received_at": "2013-03-06 13:59:18.464660",
          "age": "0.016314",
          "flag_point": "started"},
        { "description": "osd_sub_op(client.10376104.0:961536 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370141
snapset=0=[]:[] snapc=0=[])",
          "received_at": "2013-03-06 13:59:18.468076",
          "age": "0.012898",
          "flag_point": "started"},
        { "description": "osd_sub_op(client.10376110.0:972576 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370142
snapset=0=[]:[] snapc=0=[])",
          "received_at": "2013-03-06 13:59:18.468332",
          "age": "0.012642",
          "flag_point": "started"},
        { "description": "osd_sub_op(client.10376107.0:952764 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370143
snapset=0=[]:[] snapc=0=[])",
          "received_at": "2013-03-06 13:59:18.470480",
          "age": "0.010494",
          "flag_point": "started"},
        { "description": "osd_sub_op(client.10376107.0:952766 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370144
snapset=0=[]:[] snapc=0=[])",
          "received_at": "2013-03-06 13:59:18.475372",
          "age": "0.005602",
          "flag_point": "started"},
        { "description": "osd_sub_op(client.10376104.0:961538 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370145
snapset=0=[]:[] snapc=0=[])",
          "received_at": "2013-03-06 13:59:18.479391",
          "age": "0.001583",
          "flag_point": "started"},
        { "description": "osd_sub_op(client.10376107.0:952768 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370146
snapset=0=[]:[] snapc=0=[])",
          "received_at": "2013-03-06 13:59:18.480276",
          "age": "0.000698",
          "flag_point": "started"}]}

On osd.18

{ "num_ops": 9,
  "ops": [
        { "description": "osd_op(client.10391092.0:718883
2013-03-06-13-8700.1-ocdn [append 0~299] 7.2b11a75b)",
          "received_at": "2013-03-06 13:57:52.929677",
          "age": "0.025480",
          "flag_point": "waiting for sub ops",
          "client_info": { "client": "client.10391092",
              "tid": 718883}},
        { "description": "osd_op(client.10373691.0:956595
2013-03-06-13-8700.1-ocdn [append 0~299] 7.2b11a75b)",
          "received_at": "2013-03-06 13:57:52.934533",
          "age": "0.020624",
          "flag_point": "waiting for sub ops",
          "client_info": { "client": "client.10373691",
              "tid": 956595}},
        { "description": "osd_op(client.10391092.0:718885
2013-03-06-13-8700.1-ocdn [append 0~299] 7.2b11a75b)",
          "received_at": "2013-03-06 13:57:52.937101",
          "age": "0.018056",
          "flag_point": "waiting for sub ops",
          "client_info": { "client": "client.10391092",
              "tid": 718885}},
        { "description": "osd_op(client.10373691.0:956597
2013-03-06-13-8700.1-ocdn [append 0~299] 7.2b11a75b)",
          "received_at": "2013-03-06 13:57:52.940284",
          "age": "0.014873",
          "flag_point": "waiting for sub ops",
          "client_info": { "client": "client.10373691",
              "tid": 956597}},
        { "description": "osd_op(client.10373691.0:956598
2013-03-06-13-8700.1-ocdn [append 0~275] 7.2b11a75b)",
          "received_at": "2013-03-06 13:57:52.941170",
          "age": "0.013987",
          "flag_point": "waiting for sub ops",
          "client_info": { "client": "client.10373691",
              "tid": 956598}},
        { "description": "osd_op(client.10373691.0:956601
2013-03-06-13-8700.1-ocdn [append 0~299] 7.2b11a75b)",
          "received_at": "2013-03-06 13:57:52.946009",
          "age": "0.009148",
          "flag_point": "waiting for sub ops",
          "client_info": { "client": "client.10373691",
              "tid": 956601}},
        { "description": "osd_op(client.10391092.0:718887
2013-03-06-13-8700.1-ocdn [append 0~299] 7.2b11a75b)",
          "received_at": "2013-03-06 13:57:52.950400",
          "age": "0.004757",
          "flag_point": "waiting for sub ops",
          "client_info": { "client": "client.10391092",
              "tid": 718887}},
        { "description": "osd_op(client.10373691.0:956603
2013-03-06-13-8700.1-ocdn [append 0~275] 7.2b11a75b)",
          "received_at": "2013-03-06 13:57:52.951217",
          "age": "0.003940",
          "flag_point": "waiting for sub ops",
          "client_info": { "client": "client.10373691",
              "tid": 956603}},
        { "description": "osd_op(client.10373691.0:956604
2013-03-06-13-8700.1-ocdn [append 0~299] 7.2b11a75b)",
          "received_at": "2013-03-06 13:57:52.951491",
          "age": "0.003666",
          "flag_point": "waiting for sub ops",
          "client_info": { "client": "client.10373691",
              "tid": 956604}}]}

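The two dumps above can be summarized mechanically. The sketch below
counts in-flight client ops per object and operation type; the sample
is a trimmed copy of two entries from the osd.18 dump with each
description joined onto one line, and the helper name and regular
expression are my own, not part of any Ceph tool.

```python
import json
import re
from collections import Counter

# Two of the nine ops from the osd.18 dump, descriptions un-wrapped.
SAMPLE = r'''
{ "num_ops": 2,
  "ops": [
        { "description": "osd_op(client.10391092.0:718883 2013-03-06-13-8700.1-ocdn [append 0~299] 7.2b11a75b)",
          "received_at": "2013-03-06 13:57:52.929677",
          "age": "0.025480",
          "flag_point": "waiting for sub ops"},
        { "description": "osd_op(client.10373691.0:956595 2013-03-06-13-8700.1-ocdn [append 0~299] 7.2b11a75b)",
          "received_at": "2013-03-06 13:57:52.934533",
          "age": "0.020624",
          "flag_point": "waiting for sub ops"}]}
'''

# Shape seen in the dump: osd_op(<client> <object> [<ops>] <pool>.<hash>)
OSD_OP = re.compile(r"^osd_op\(\S+ (?P<obj>\S+) \[(?P<ops>[^\]]*)\]")


def summarize(dump_text):
    """Count in-flight client ops per (object name, first operation)."""
    counts = Counter()
    for op in json.loads(dump_text)["ops"]:
        match = OSD_OP.match(op["description"])
        if match:
            kind = (match.group("ops").split() or ["?"])[0]
            counts[(match.group("obj"), kind)] += 1
    return counts


for (obj, kind), n in summarize(SAMPLE).most_common():
    print(f"{n} x {kind} on {obj}")
```

Run against the full dumps, every entry names the same object,
2013-03-06-13-8700.1-ocdn, and every client op is a small append. The
osd_op entries on osd.18 are waiting for sub ops while osd.61 holds the
matching osd_sub_op entries, which suggests that osd.18 (first in the
acting set) is the one coordinating these writes.
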
iostat for these OSDs' drives at the same time. I think osd.61 is the master.

OSD      Device  rrqm/s  wrqm/s   r/s     w/s  rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
osd.133  sde       0.00    0.00  1.00  816.67   4.00  29925.50     73.21      0.24   0.28     6.67     0.27   0.19  15.33
osd.61   sdk       0.00   60.33  0.67  685.33   2.67  27458.83     80.06      1.48   2.16    54.00     2.11   1.45  99.47
osd.18   sdt       0.00    0.00  2.00  809.67   8.00  27608.00     68.05      0.19   0.23    12.00     0.20   0.14  11.33
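
To make the contrast in the iostat figures explicit: all three drives
take a similar write volume, yet only one is saturated. The sketch
below just re-reads the numbers above; the tuple layout and variable
names are illustrative.

```python
# (osd, device, w/s, wkB/s, w_await in ms, %util), copied from the iostat output.
rows = [
    ("osd.133", "sde", 816.67, 29925.50, 0.27, 15.33),
    ("osd.61", "sdk", 685.33, 27458.83, 2.11, 99.47),
    ("osd.18", "sdt", 809.67, 27608.00, 0.20, 11.33),
]

for osd, dev, w_s, wkb_s, w_await, util in rows:
    avg_write_kb = wkb_s / w_s  # average size of a single write request
    print(f"{osd:8} {dev}  {avg_write_kb:5.1f} kB/write  "
          f"w_await {w_await:4.2f} ms  util {util:5.2f}%")

# Compare the most utilized drive with the average of the other two.
busiest = max(rows, key=lambda r: r[5])
others = [r for r in rows if r is not busiest]
ratio = busiest[4] / (sum(r[4] for r in others) / len(others))
print(f"{busiest[0]}: write wait is about {ratio:.0f}x that of the other two drives")
```

With these inputs the busiest drive is the one behind osd.61, and its
write wait is roughly nine times that of the other two, even though
its write rate is slightly lower.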

Number of files per date, sorted by count (only a small sample of all files):

     57 21 Nov 2012
     58 11 Dec 2012
     59 02 Jan 2013
     59 17 Feb 2013
     64 16 Feb 2013
     65 27 Nov 2012
     66 14 Dec 2012
     69 01 Mar 2013
     71 07 Feb 2013
     71 20 Dec 2012
     71 30 Nov 2012
     72 22 Nov 2012
     74 23 Nov 2012
     81 13 Dec 2012
     88 01 Dec 2012
     90 21 Feb 2013
    113 16 Nov 2012
    118 10 Feb 2013
    120 13 Feb 2013
    142 15 Feb 2013
    158 19 Feb 2013
    195 29 Nov 2012
    200 14 Feb 2013
    606 18 Feb 2013
    766 20 Feb 2013
   1347 05 Dec 2012
   2439 09 Dec 2012
   2603 08 Dec 2012

The other OSDs show only a very small number of IOPS.


Best Regards
SS


* Re: RGW Blocking on 1-2 PG's - argonaut
  2013-03-06 13:06               ` Sławomir Skowron
@ 2013-03-06 14:04                 ` Yehuda Sadeh
  2013-03-06 21:32                   ` Sławomir Skowron
  0 siblings, 1 reply; 11+ messages in thread
From: Yehuda Sadeh @ 2013-03-06 14:04 UTC (permalink / raw)
  To: Sławomir Skowron
  Cc: Gregory Farnum, Sage Weil, ceph-devel@vger.kernel.org

On Wed, Mar 6, 2013 at 5:06 AM, Sławomir Skowron <szibis@gmail.com> wrote:
> Hi, I ran some tests to reproduce this problem.
>
> As you can see, only one drive (all of these drives are in the same PG)
> is much more heavily utilized than the others, and there are some ops
> queued on this slow OSD. The test issues HEAD requests for S3 objects in
> alphabetical order. This is strange: why are these requests served
> mostly by this one triple of OSDs?
>
> Checking which OSDs are in this PG:
>
>  ceph pg map 7.35b
> osdmap e117008 pg 7.35b (7.35b) -> up [18,61,133] acting [18,61,133]
>
> On osd.61
>
> { "num_ops": 13,
>   "ops": [
>         { "description": "osd_sub_op(client.10376104.0:961532 7.35b
> 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370134

The ops log is slowing you down. Unless you really need it, set 'rgw
enable ops log = false'. This is off by default in bobtail.
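
For anyone applying this, a minimal sketch of what the setting might
look like in ceph.conf, plus a quick way to check a config text for it.
The section name [client.radosgw.gateway] is an assumption (use
whatever section your gateway instance runs under), and the helper is
hypothetical. The fallback of True reflects this thread, where the ops
log was evidently enabled on argonaut without being set explicitly.

```python
import configparser

# Hypothetical ceph.conf fragment; the section name is an ASSUMPTION.
CEPH_CONF = """
[client.radosgw.gateway]
rgw enable ops log = false
"""


def ops_log_enabled(conf_text: str, section: str) -> bool:
    """Return True if the RGW ops log is enabled in the given section."""
    parser = configparser.ConfigParser()
    # Treat "rgw_enable_ops_log" and "rgw enable ops log" as the same key.
    parser.optionxform = lambda name: name.strip().lower().replace("_", " ")
    parser.read_string(conf_text)
    # Fallback True: assumed default for argonaut when the option is unset.
    return parser.getboolean(section, "rgw enable ops log", fallback=True)


print(ops_log_enabled(CEPH_CONF, "client.radosgw.gateway"))  # False
```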


Yehuda

* Re: RGW Blocking on 1-2 PG's - argonaut
  2013-03-06 14:04                 ` Yehuda Sadeh
@ 2013-03-06 21:32                   ` Sławomir Skowron
  0 siblings, 0 replies; 11+ messages in thread
From: Sławomir Skowron @ 2013-03-06 21:32 UTC (permalink / raw)
  To: Yehuda Sadeh; +Cc: Gregory Farnum, Sage Weil, ceph-devel@vger.kernel.org

Great, thanks. Now I understand everything.

Best Regards
SS

On 6 Mar 2013, at 15:04, Yehuda Sadeh <yehuda@inktank.com> wrote:

> On Wed, Mar 6, 2013 at 5:06 AM, Sławomir Skowron <szibis@gmail.com> wrote:
>> Hi, I ran some tests to reproduce this problem.
>>
>> As you can see, only one drive (all of these drives are in the same PG)
>> is much more heavily utilized than the others, and there are some ops
>> queued on this slow OSD. The test issues HEAD requests for S3 objects in
>> alphabetical order. This is strange: why are these requests served
>> mostly by this one triple of OSDs?
>>
>> Checking which OSDs are in this PG:
>>
>> ceph pg map 7.35b
>> osdmap e117008 pg 7.35b (7.35b) -> up [18,61,133] acting [18,61,133]
>>
>> On osd.61
>>
>> { "num_ops": 13,
>>  "ops": [
>>        { "description": "osd_sub_op(client.10376104.0:961532 7.35b
>> 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370134
>
> The ops log is slowing you down. Unless you really need it, set 'rgw
> enable ops log = false'. This is off by default in bobtail.
>
>
> Yehuda

Thread overview: 11+ messages
2013-03-04 11:02 RGW Blocking on 1-2 PG's - argonaut Sławomir Skowron
2013-03-04 14:16 ` Yehuda Sadeh
2013-03-04 16:13   ` Sławomir Skowron
2013-03-04 17:02     ` Sage Weil
2013-03-04 17:23       ` Sławomir Skowron
2013-03-04 17:25         ` Gregory Farnum
2013-03-04 17:42           ` Sławomir Skowron
2013-03-04 18:34             ` Sławomir Skowron
2013-03-06 13:06               ` Sławomir Skowron
2013-03-06 14:04                 ` Yehuda Sadeh
2013-03-06 21:32                   ` Sławomir Skowron
