Flapping osd / continuously reported as failed


* Flapping osd / continuously reported as failed
@ 2013-07-23 21:36 Studziński Krzysztof
  0 siblings, 0 replies; 15+ messages in thread
From: Studziński Krzysztof @ 2013-07-23 21:36 UTC (permalink / raw)
  To: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
  Cc: Mostowiec Dominik

[-- Attachment #1: Type: text/plain, Size: 3801 bytes --]

Hi,
We have a problem with our cluster: it continuously reports one OSD as failed, and after that OSD restarts automatically everything seems to work fine for a short time (a few minutes). CPU utilization of this OSD peaks at 8%, and iostat shows very low activity. We tried marking the flapping OSD out with "ceph osd out", but after recovery the same behavior appeared on a different OSD. The affected OSD also serves many more read operations than the others (see the attached osd_reads.png file; at about 16:00 we switched off osd.57 and osd.72 started to misbehave. osd.108 works while recovering).

Extract from ceph.log:

2013-07-23 22:43:57.425839 mon.0 10.177.64.4:6789/0 24690 : [INF] osd.72 10.177.64.8:6803/22584 boot
2013-07-23 22:43:56.298467 osd.72 10.177.64.8:6803/22584 415 : [WRN] map e41730 wrongly marked me down
2013-07-23 22:50:27.572110 mon.0 10.177.64.4:6789/0 25081 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.9 10.177.64.4:6946/5124
2013-07-23 22:50:27.595044 mon.0 10.177.64.4:6789/0 25082 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.78 10.177.64.5:6854/5604
2013-07-23 22:50:27.611964 mon.0 10.177.64.4:6789/0 25083 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.10 10.177.64.4:6814/26192
2013-07-23 22:50:27.612009 mon.0 10.177.64.4:6789/0 25084 : [INF] osd.72 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 2013-07-23 22:50:43.611939 >= grace 20.000000)
2013-07-23 22:50:30.367398 7f8adb837700  0 log [WRN] : 3 slow requests, 3 included below; oldest blocked for > 30.688891 secs
2013-07-23 22:50:30.367408 7f8adb837700  0 log [WRN] : slow request 30.688891 seconds old, received at 2013-07-23 22:49:59.678453: sd_op(client.44290048.0:125899 .dir.4168.2 [call rgw.bucket_prepare_op] 3.9447554d) v4 currently no flag points reached
2013-07-23 22:50:30.367412 7f8adb837700  0 log [WRN] : slow request 30.179044 seconds old, received at 2013-07-23 22:50:00.188300: sd_op(client.44205530.0:189270 .dir.4168.2 [call rgw.bucket_list] 3.9447554d) v4 currently no flag points reached
2013-07-23 22:50:30.367415 7f8adb837700  0 log [WRN] : slow request 30.171968 seconds old, received at 2013-07-23 22:50:00.195376: sd_op(client.44203484.0:192902 .dir.4168.2 [call rgw.bucket_list] 3.9447554d) v4 currently no flag points reached
2013-07-23 22:51:36.082303 mon.0 10.177.64.4:6789/0 25159 : [INF] osd.72 10.177.64.8:6803/22584 boot
2013-07-23 22:51:35.238164 osd.72 10.177.64.8:6803/22584 420 : [WRN] map e41738 wrongly marked me down
2013-07-23 22:52:05.582969 mon.0 10.177.64.4:6789/0 25191 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.20 10.177.64.4:6913/4101
2013-07-23 22:52:05.587388 mon.0 10.177.64.4:6789/0 25192 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.9 10.177.64.4:6946/5124
2013-07-23 22:52:05.610925 mon.0 10.177.64.4:6789/0 25193 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.78 10.177.64.5:6854/5604
2013-07-23 22:52:05.610951 mon.0 10.177.64.4:6789/0 25194 : [INF] osd.72 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 2013-07-23 22:52:20.610895 >= grace 20.000000)
2013-07-23 22:52:05.630821 mon.0 10.177.64.4:6789/0 25195 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.10 10.177.64.4:6814/26192
2013-07-23 22:53:47.203352 mon.0 10.177.64.4:6789/0 25300 : [INF] osd.72 10.177.64.8:6803/22584 boot
2013-07-23 22:53:46.417106 osd.72 10.177.64.8:6803/22584 474 : [WRN] map e41742 wrongly marked me down
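
The flapping cycle in an extract like the one above can be tallied with a short script. What follows is a minimal Python sketch, not part of the original report: the regular expressions are assumptions modelled only on the lines quoted here, and the function name summarize_flapping is hypothetical.

```python
import re
from collections import Counter

# Patterns are assumptions based only on the log lines quoted in this message.
PATTERNS = {
    "reported_failed": re.compile(r"\[DBG\] (osd\.\d+) \S+ reported failed by osd\.\d+"),
    "marked_failed": re.compile(r"\[INF\] (osd\.\d+) \S+ failed \("),
    "boot": re.compile(r"\[INF\] (osd\.\d+) \S+ boot$"),
    "wrongly_marked_down": re.compile(
        r" (osd\.\d+) \S+ \d+ : \[WRN\] map e\d+ wrongly marked me down"
    ),
}


def summarize_flapping(lines):
    """Count, per OSD, how often each kind of event appears in the log lines."""
    counts = {name: Counter() for name in PATTERNS}
    for line in lines:
        line = line.rstrip()
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[name][match.group(1)] += 1
                break  # each line describes at most one event
    return counts


# A few lines copied from the extract above.
SAMPLE = """\
2013-07-23 22:50:27.572110 mon.0 10.177.64.4:6789/0 25081 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.9 10.177.64.4:6946/5124
2013-07-23 22:50:27.595044 mon.0 10.177.64.4:6789/0 25082 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.78 10.177.64.5:6854/5604
2013-07-23 22:50:27.611964 mon.0 10.177.64.4:6789/0 25083 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.10 10.177.64.4:6814/26192
2013-07-23 22:50:27.612009 mon.0 10.177.64.4:6789/0 25084 : [INF] osd.72 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 2013-07-23 22:50:43.611939 >= grace 20.000000)
2013-07-23 22:51:36.082303 mon.0 10.177.64.4:6789/0 25159 : [INF] osd.72 10.177.64.8:6803/22584 boot
2013-07-23 22:51:35.238164 osd.72 10.177.64.8:6803/22584 420 : [WRN] map e41738 wrongly marked me down
""".splitlines()

counts = summarize_flapping(SAMPLE)
for name, per_osd in counts.items():
    print(name, dict(per_osd))
```

Run against a full ceph.log, a tally like this would show whether a single OSD dominates the failure reports or whether the reports are spread across the cluster.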

Could you please take a look at our config and suggest some improvements?
See the attached "ceph pg <pg_id> query" output for two placement groups during recovery, and parts of our config file.
Cluster size: 6 hosts with 26 HDDs each, 156 OSDs, 6488 PGs; the data is mostly in one bucket holding 9M objects; 3342 GB data, 11173 GB used, 31690 GB / 42864 GB avail.

Best regards.
--
Krzysztof Studzinski



[-- Attachment #2: query_3.54d.txt --]
[-- Type: text/plain, Size: 6154 bytes --]


{ "state": "active+recovery_wait",
  "up": [
        72,
        108,
        23],
  "acting": [
        72,
        108,
        23],
  "info": { "pgid": "3.54d",
      "last_update": "41706'114846129",
      "last_complete": "41706'114846128",
      "log_tail": "41698'114844636",
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 4,
          "last_epoch_started": 41708,
          "last_epoch_clean": 41706,
          "last_epoch_split": 11537,
          "same_up_since": 41707,
          "same_interval_since": 41707,
          "same_primary_since": 41707,
          "last_scrub": "41169'114613696",
          "last_scrub_stamp": "2013-07-22 10:05:06.306320",
          "last_deep_scrub": "41056'113772144",
          "last_deep_scrub_stamp": "2013-07-16 04:26:25.172957",
          "last_clean_scrub_stamp": "2013-07-22 10:05:06.306320"},
      "stats": { "version": "41706'114846129",
          "reported": "41707'342090941",
          "state": "active+recovery_wait",
          "last_fresh": "2013-07-23 21:51:28.434880",
          "last_change": "2013-07-23 21:50:39.772290",
          "last_active": "2013-07-23 21:51:28.434880",
          "last_clean": "2013-07-23 21:46:10.770116",
          "last_unstale": "2013-07-23 21:51:28.434880",
          "mapping_epoch": 41705,
          "log_start": "41698'114844636",
          "ondisk_log_start": "41698'114844636",
          "created": 4,
          "last_epoch_clean": 4,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "41169'114613696",
          "last_scrub_stamp": "2013-07-22 10:05:06.306320",
          "last_deep_scrub": "41056'113772144",
          "last_deep_scrub_stamp": "2013-07-16 04:26:25.172957",
          "last_clean_scrub_stamp": "2013-07-22 10:05:06.306320",
          "log_size": 195583,
          "ondisk_log_size": 195583,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 976942253,
              "num_objects": 3223,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 480,
              "num_read_kb": 0,
              "num_write": 45030,
              "num_write_kb": 679345,
              "num_scrub_errors": 0,
              "num_objects_recovered": 12520,
              "num_bytes_recovered": 4594566891,
              "num_keys_recovered": 983418521},
          "stat_cat_sum": {},
          "up": [
                72,
                108,
                23],
          "acting": [
                72,
                108,
                23]},
      "empty": 0,
      "dne": 0,
      "incomplete": 0,
      "last_epoch_started": 41708},
  "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2013-07-23 21:50:38.886428",
          "might_have_unfound": [
                { "osd": 23,
                  "status": "already probed"},
                { "osd": 108,
                  "status": "already probed"}],
          "recovery_progress": { "backfill_target": -1,
              "waiting_on_backfill": 0,
              "backfill_pos": "0\/\/0\/\/-1",
              "backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "backfills_in_flight": [],
              "pull_from_peer": [
                    { "pull_from": 23,
                      "pulls": [
                            { "recovery_progress": { "first?": 0,
                                  "data_complete?": 0,
                                  "data_recovered_to": 0,
                                  "omap_complete?": 0,
                                  "omap_recovered_to": "images\/pulscms\/ZmU7MDMsMWUwLDAsMSwx\/a00be21418eac804845b79b2480385b4.jpg"},
                              "recovery_info": { "object": "9447554d\/.dir.4168.2\/head\/\/3",
                                  "at_version": "41706'114846129",
                                  "size": "0",
                                  "object_info": { "oid": { "oid": ".dir.4168.2",
                                          "key": "",
                                          "snapid": -2,
                                          "hash": 2487702861,
                                          "max": 0},
                                      "locator": { "pool": 3,
                                          "key": ""},
                                      "category": "",
                                      "version": "41706'114846129",
                                      "prior_version": "41706'114846128",
                                      "last_reqid": "client.44205536.0:178674",
                                      "size": 0,
                                      "mtime": "2013-07-23 21:50:37.591130",
                                      "lost": 0,
                                      "wrlock_by": "unknown.0.0:0",
                                      "snaps": [],
                                      "truncate_seq": 0,
                                      "truncate_size": 0,
                                      "watchers": {}},
                                  "snapset": { "snap_context": { "seq": 0,
                                          "snaps": []},
                                      "head_exists": 1,
                                      "clones": []},
                                  "copy_subset": "[]",
                                  "clone_subset": "{}"}}]}],
              "pushing": []},
          "scrub": { "scrubber.epoch_start": "41697",
              "scrubber.active": 0,
              "scrubber.block_writes": 0,
              "scrubber.finalizing": 0,
              "scrubber.waiting_on": 0,
              "scrubber.waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2013-07-23 21:50:37.822603"}]}

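For readers comparing several dumps like the one above, the fields that matter most here (state, acting set, and any objects still being pulled) can be extracted programmatically. This is a minimal Python sketch under the assumption that the dump has the same shape as this attachment; the function name summarize_pg is hypothetical and the sample is a trimmed copy of the attachment.

```python
import json


def summarize_pg(query):
    """Return the PG id, state, acting set, and objects still being pulled."""
    pending_pulls = []
    for state in query.get("recovery_state", []):
        progress = state.get("recovery_progress", {})
        for peer in progress.get("pull_from_peer", []):
            for pull in peer.get("pulls", []):
                pending_pulls.append(
                    (peer["pull_from"], pull["recovery_info"]["object"])
                )
    return {
        "pgid": query["info"]["pgid"],
        "state": query["state"],
        "acting": query["acting"],
        "pending_pulls": pending_pulls,
    }


# Trimmed copy of the dump above; the raw string keeps the "\/" JSON escapes.
SAMPLE = r'''
{ "state": "active+recovery_wait",
  "acting": [72, 108, 23],
  "info": { "pgid": "3.54d" },
  "recovery_state": [
    { "name": "Started\/Primary\/Active",
      "recovery_progress": {
        "pull_from_peer": [
          { "pull_from": 23,
            "pulls": [
              { "recovery_info": {
                  "object": "9447554d\/.dir.4168.2\/head\/\/3" } } ] } ] } },
    { "name": "Started" } ] }
'''

summary = summarize_pg(json.loads(SAMPLE))
print(summary)
```

In this dump the only pending pull is the object .dir.4168.2, the same object named in the slow-request warnings in ceph.log.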
[-- Attachment #3: query_3.f49.txt --]
[-- Type: text/plain, Size: 3919 bytes --]


{ "state": "active+clean",
  "up": [
        72,
        10,
        122],
  "acting": [
        72,
        10,
        122],
  "info": { "pgid": "3.f49",
      "last_update": "41706'32487",
      "last_complete": "41706'32487",
      "log_tail": "40804'31484",
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 4,
          "last_epoch_started": 41708,
          "last_epoch_clean": 41708,
          "last_epoch_split": 11544,
          "same_up_since": 41707,
          "same_interval_since": 41707,
          "same_primary_since": 41707,
          "last_scrub": "41169'32426",
          "last_scrub_stamp": "2013-07-23 00:54:26.387902",
          "last_deep_scrub": "41072'32358",
          "last_deep_scrub_stamp": "2013-07-20 23:15:21.902015",
          "last_clean_scrub_stamp": "2013-07-23 00:54:26.387902"},
      "stats": { "version": "41706'32487",
          "reported": "41707'99611",
          "state": "active+clean",
          "last_fresh": "2013-07-23 21:53:02.723704",
          "last_change": "2013-07-23 21:52:25.644065",
          "last_active": "2013-07-23 21:53:02.723704",
          "last_clean": "2013-07-23 21:53:02.723704",
          "last_unstale": "2013-07-23 21:53:02.723704",
          "mapping_epoch": 41705,
          "log_start": "40804'31484",
          "ondisk_log_start": "40804'31484",
          "created": 4,
          "last_epoch_clean": 4,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "41169'32426",
          "last_scrub_stamp": "2013-07-23 00:54:26.387902",
          "last_deep_scrub": "41072'32358",
          "last_deep_scrub_stamp": "2013-07-20 23:15:21.902015",
          "last_clean_scrub_stamp": "2013-07-23 00:54:26.387902",
          "log_size": 193920,
          "ondisk_log_size": 193920,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 992647339,
              "num_objects": 3251,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 1264,
              "num_read_kb": 0,
              "num_write": 117518,
              "num_write_kb": 4066168,
              "num_scrub_errors": 0,
              "num_objects_recovered": 11193,
              "num_bytes_recovered": 5027141369,
              "num_keys_recovered": 0},
          "stat_cat_sum": {},
          "up": [
                72,
                10,
                122],
          "acting": [
                72,
                10,
                122]},
      "empty": 0,
      "dne": 0,
      "incomplete": 0,
      "last_epoch_started": 41708},
  "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2013-07-23 21:50:38.914289",
          "might_have_unfound": [
                { "osd": 10,
                  "status": "already probed"},
                { "osd": 122,
                  "status": "already probed"}],
          "recovery_progress": { "backfill_target": -1,
              "waiting_on_backfill": 0,
              "backfill_pos": "0\/\/0\/\/-1",
              "backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "backfills_in_flight": [],
              "pull_from_peer": [],
              "pushing": []},
          "scrub": { "scrubber.epoch_start": "0",
              "scrubber.active": 0,
              "scrubber.block_writes": 0,
              "scrubber.finalizing": 0,
              "scrubber.waiting_on": 0,
              "scrubber.waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2013-07-23 21:50:37.812159"}]}

[-- Attachment #4: ceph.conf --]
[-- Type: application/octet-stream, Size: 4016 bytes --]

[global]
        ; enable secure authentication
        auth supported = cephx
        keyring = /etc/ceph/$cluster.keyring
 
	max_open_files = 8192

	rgw_cache_enabled = true ;rgw cache enabled
	rgw_cache_lru_size = 10000 ;num of entries in rgw cache
	rgw_thread_pool_size = 2048
	rgw op thread timeout = 6000 ; in ms
	rgw print continue = false ;enable if 100-Continue works
	rgw_enable_ops_log = false ;enable logging every rgw operation - no stripe only one pg writing
	
	debug rgw = 10

	rgw dns name = ocdn.eu

        rbd cache = true
        rbd cache max dirty = 0

	admin_socket = /var/run/ceph/$cluster-$name.asok
	
	mon_osd_nearfull_ratio = .90
	mon_osd_full_ratio = .96

	mon osd down out interval = 0

        debug_lockdep = 0/0
        debug_context = 0/0
        debug_crush = 0/0
        debug_mds = 0/0
        debug_mds_balancer = 0/0
        debug_mds_locker = 0/0
        debug_mds_log = 0/0
        debug_mds_log_expire = 0/0
        debug_mds_migrator = 0/0
        debug_buffer = 0/0
        debug_timer = 0/0
        debug_filer = 0/0
        debug_objecter = 0/0
        debug_rados = 0/0
        debug_rbd = 0/0
        debug_journaler = 0/0
        debug_objectcacher = 0/0
        debug_client = 0/0
        ;debug_osd = 0/0
        debug_optracker = 0/0
        debug_objclass = 0/0
        ;debug_filestore = 0/0
        debug_journal = 0/0
        debug_ms = 0/0
        debug_mon = 0/0
        debug_monc = 0/0
        debug_paxos = 0/0
        debug_tp = 0/0
        debug_auth = 0/0
        debug_finisher = 0/0
        debug_heartbeatmap = 0/0
        debug_perfcounter = 0/0
        ;debug_rgw = 0/0
        debug_hadoop = 0/0
        debug_asok = 0/0
        debug_throttle = 0/0

; radosgw client list
; ...

; monitors
;  You need at least one.  You need at least three if you want to
;  tolerate any node failures.  Always create an odd number.
[mon]
        mon data = /vol0/data/mon.$id
 
        ; some minimal logging (just message traffic) to aid debugging
 
        ;debug ms = 0     ; see message traffic
        ;debug mon = 0   ; monitor
        ;debug paxos = 0 ; monitor replication
        ;debug auth = 0  ;
 
        mon allowed clock drift = 2
 
; ...

; osd
;  You need at least one.  Two if you want data to be replicated.
;  Define as many as you like.
 
[osd]
        ; This is where the btrfs volume will be mounted.

        osd data = /vol0/data/osd.$id

        ; Ideally, make this a separate disk or partition.  A few GB
        ; is usually enough; more if you have fast disks.  You can use
        ; a file under the osd data dir if need be
        ; (e.g. /data/osd$id/journal), but it will be slower than a
        ; separate disk or partition.

        osd journal = /vol0/data/osd.$id/journal

        ; If the OSD journal is a file, you need to specify the size. This is specified in MB.

        keyring = /vol0/data/osd.$id/keyring

        #filestore journal writeahead = 1

        journal aio = true

        osd heartbeat grace = 15

        filestore flush min = 0
        filestore flusher = false
        filestore fiemap = false
        filestore op threads = 8
        filestore queue max ops = 4096
        filestore queue max bytes = 10485760
        filestore queue committing max bytes = 10485760

        osd op threads = 12
        osd disk threads = 8
        osd recovery threads = 1
        osd recovery max active = 1
        osd recovery op priority = 100
        osd client op priority = 100
        osd max backfills = 1

        journal max write bytes = 10485760
        journal queue max bytes = 10485760
        ms dispatch throttle bytes = 10485760
        objecter inflight op bytes = 10485760

        ;debug ms = 0         ; message traffic
        #debug osd = 1
        debug osd = 0
        #debug filestore = 1 ; local object storage
        debug filestore = 0 ; local object storage
        ;debug journal = 0   ; local journaling
        ;debug monc = 0
        ;debug rados = 0

[osd.0]
; ...
[osd.155]

[-- Attachment #5: osd_reads.png --]
[-- Type: image/png, Size: 107287 bytes --]

[-- Attachment #6: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


* Flapping osd / continuously reported as failed
@ 2013-07-23 21:50 Studziński Krzysztof
       [not found] ` <0D057B737C42FC4AB3F22773A5C9425F259DBDEDD0-K9pFWFEelezFe27LHpJFGNHuzzzSOjJt@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Studziński Krzysztof @ 2013-07-23 21:50 UTC (permalink / raw)
  To: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
  Cc: Mostowiec Dominik

Hi,
We have a problem with our cluster: it continuously reports one OSD as failed, and after that OSD restarts automatically everything seems to work fine for a short time (a few minutes). CPU utilization of this OSD peaks at 8%, and iostat shows very low activity. We tried marking the flapping OSD out with "ceph osd out", but after recovery the same behavior appeared on a different OSD. The affected OSD also serves many more read operations than the others (see the file osd_reads.png linked at the bottom of this email; at about 16:00 we switched off osd.57 and osd.72 started to misbehave. osd.108 works while recovering).

Extract from ceph.log:

2013-07-23 22:43:57.425839 mon.0 10.177.64.4:6789/0 24690 : [INF] osd.72 10.177.64.8:6803/22584 boot
2013-07-23 22:43:56.298467 osd.72 10.177.64.8:6803/22584 415 : [WRN] map e41730 wrongly marked me down
2013-07-23 22:50:27.572110 mon.0 10.177.64.4:6789/0 25081 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.9 10.177.64.4:6946/5124
2013-07-23 22:50:27.595044 mon.0 10.177.64.4:6789/0 25082 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.78 10.177.64.5:6854/5604
2013-07-23 22:50:27.611964 mon.0 10.177.64.4:6789/0 25083 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.10 10.177.64.4:6814/26192
2013-07-23 22:50:27.612009 mon.0 10.177.64.4:6789/0 25084 : [INF] osd.72 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 2013-07-23 22:50:43.611939 >= grace 20.000000)
2013-07-23 22:50:30.367398 7f8adb837700  0 log [WRN] : 3 slow requests, 3 included below; oldest blocked for > 30.688891 secs
2013-07-23 22:50:30.367408 7f8adb837700  0 log [WRN] : slow request 30.688891 seconds old, received at 2013-07-23 22:49:59.678453: sd_op(client.44290048.0:125899 .dir.4168.2 [call rgw.bucket_prepare_op] 3.9447554d) v4 currently no flag points reached
2013-07-23 22:50:30.367412 7f8adb837700  0 log [WRN] : slow request 30.179044 seconds old, received at 2013-07-23 22:50:00.188300: sd_op(client.44205530.0:189270 .dir.4168.2 [call rgw.bucket_list] 3.9447554d) v4 currently no flag points reached
2013-07-23 22:50:30.367415 7f8adb837700  0 log [WRN] : slow request 30.171968 seconds old, received at 2013-07-23 22:50:00.195376: sd_op(client.44203484.0:192902 .dir.4168.2 [call rgw.bucket_list] 3.9447554d) v4 currently no flag points reached
2013-07-23 22:51:36.082303 mon.0 10.177.64.4:6789/0 25159 : [INF] osd.72 10.177.64.8:6803/22584 boot
2013-07-23 22:51:35.238164 osd.72 10.177.64.8:6803/22584 420 : [WRN] map e41738 wrongly marked me down
2013-07-23 22:52:05.582969 mon.0 10.177.64.4:6789/0 25191 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.20 10.177.64.4:6913/4101
2013-07-23 22:52:05.587388 mon.0 10.177.64.4:6789/0 25192 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.9 10.177.64.4:6946/5124
2013-07-23 22:52:05.610925 mon.0 10.177.64.4:6789/0 25193 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.78 10.177.64.5:6854/5604
2013-07-23 22:52:05.610951 mon.0 10.177.64.4:6789/0 25194 : [INF] osd.72 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 2013-07-23 22:52:20.610895 >= grace 20.000000)
2013-07-23 22:52:05.630821 mon.0 10.177.64.4:6789/0 25195 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.10 10.177.64.4:6814/26192
2013-07-23 22:53:47.203352 mon.0 10.177.64.4:6789/0 25300 : [INF] osd.72 10.177.64.8:6803/22584 boot
2013-07-23 22:53:46.417106 osd.72 10.177.64.8:6803/22584 474 : [WRN] map e41742 wrongly marked me down

Could you please take a look at our config and suggest some improvements?
See the attached "ceph pg <pg_id> query" output for two placement groups during recovery, and parts of our config file.
Cluster size: 6 hosts with 26 HDDs each, 156 OSDs, 6488 PGs; the data is mostly in one bucket holding 9M objects; 3342 GB data, 11173 GB used, 31690 GB / 42864 GB avail.

Files:
ceph.conf: https://docs.google.com/file/d/0B_Pxd89e6fWvZ1NtYmZYZFBtZHc/edit?usp=sharing
osd_reads.png: https://docs.google.com/file/d/0B_Pxd89e6fWvQW5XaXZFdUkxcEE/edit?usp=sharing
pg query #1: https://docs.google.com/file/d/0B_Pxd89e6fWvdXhpRk5LT25nNTQ/edit?usp=sharing
pg query #2: https://docs.google.com/file/d/0B_Pxd89e6fWvR1ZsdlIzcmxWYWc/edit?usp=sharing

Best regards.
--
Krzysztof Studzinski


* Re: Flapping osd / continuously reported as failed
       [not found] ` <0D057B737C42FC4AB3F22773A5C9425F259DBDEDD0-K9pFWFEelezFe27LHpJFGNHuzzzSOjJt@public.gmane.org>
@ 2013-07-23 22:12   ` Gregory Farnum
       [not found]     ` <CAPYLRzjGDep1ny6K-Ctz_7VG4THV6nAx9odOdjr=WNNesV4cVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2013-07-23 22:12 UTC (permalink / raw)
  To: Studziński Krzysztof
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org,
	Mostowiec Dominik

On Tue, Jul 23, 2013 at 2:50 PM, Studziński Krzysztof
<krzysztof.studzinski-Yw1TE0hTT7dz6jiHbVrK7g@public.gmane.org> wrote:
> Hi,
> We have a problem with our cluster: it continuously reports one OSD as failed, and after that OSD restarts automatically everything seems to work fine for a short time (a few minutes). CPU utilization of this OSD peaks at 8%, and iostat shows very low activity. We tried marking the flapping OSD out with "ceph osd out", but after recovery the same behavior appeared on a different OSD. The affected OSD also serves many more read operations than the others (see the file osd_reads.png linked at the bottom of this email; at about 16:00 we switched off osd.57 and osd.72 started to misbehave. osd.108 works while recovering).
>
> Extract from ceph.log:
>
> 2013-07-23 22:43:57.425839 mon.0 10.177.64.4:6789/0 24690 : [INF] osd.72 10.177.64.8:6803/22584 boot
> 2013-07-23 22:43:56.298467 osd.72 10.177.64.8:6803/22584 415 : [WRN] map e41730 wrongly marked me down
> 2013-07-23 22:50:27.572110 mon.0 10.177.64.4:6789/0 25081 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.9 10.177.64.4:6946/5124
> 2013-07-23 22:50:27.595044 mon.0 10.177.64.4:6789/0 25082 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.78 10.177.64.5:6854/5604
> 2013-07-23 22:50:27.611964 mon.0 10.177.64.4:6789/0 25083 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.10 10.177.64.4:6814/26192
> 2013-07-23 22:50:27.612009 mon.0 10.177.64.4:6789/0 25084 : [INF] osd.72 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 2013-07-23 22:50:43.611939 >= grace 20.000000)
> 2013-07-23 22:50:30.367398 7f8adb837700  0 log [WRN] : 3 slow requests, 3 included below; oldest blocked for > 30.688891 secs
> 2013-07-23 22:50:30.367408 7f8adb837700  0 log [WRN] : slow request 30.688891 seconds old, received at 2013-07-23 22:49:59.678453: sd_op(client.44290048.0:125899 .dir.4168.2 [call rgw.bucket_prepare_op] 3.9447554d) v4 currently no flag points reached
> 2013-07-23 22:50:30.367412 7f8adb837700  0 log [WRN] : slow request 30.179044 seconds old, received at 2013-07-23 22:50:00.188300: sd_op(client.44205530.0:189270 .dir.4168.2 [call rgw.bucket_list] 3.9447554d) v4 currently no flag points reached
> 2013-07-23 22:50:30.367415 7f8adb837700  0 log [WRN] : slow request 30.171968 seconds old, received at 2013-07-23 22:50:00.195376: sd_op(client.44203484.0:192902 .dir.4168.2 [call rgw.bucket_list] 3.9447554d) v4 currently no flag points reached
> 2013-07-23 22:51:36.082303 mon.0 10.177.64.4:6789/0 25159 : [INF] osd.72 10.177.64.8:6803/22584 boot
> 2013-07-23 22:51:35.238164 osd.72 10.177.64.8:6803/22584 420 : [WRN] map e41738 wrongly marked me down
> 2013-07-23 22:52:05.582969 mon.0 10.177.64.4:6789/0 25191 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.20 10.177.64.4:6913/4101
> 2013-07-23 22:52:05.587388 mon.0 10.177.64.4:6789/0 25192 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.9 10.177.64.4:6946/5124
> 2013-07-23 22:52:05.610925 mon.0 10.177.64.4:6789/0 25193 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.78 10.177.64.5:6854/5604
> 2013-07-23 22:52:05.610951 mon.0 10.177.64.4:6789/0 25194 : [INF] osd.72 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 2013-07-23 22:52:20.610895 >= grace 20.000000)
> 2013-07-23 22:52:05.630821 mon.0 10.177.64.4:6789/0 25195 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.10 10.177.64.4:6814/26192
> 2013-07-23 22:53:47.203352 mon.0 10.177.64.4:6789/0 25300 : [INF] osd.72 10.177.64.8:6803/22584 boot
> 2013-07-23 22:53:46.417106 osd.72 10.177.64.8:6803/22584 474 : [WRN] map e41742 wrongly marked me down
>
> Could you please take a look at our config and suggest some improvements?
> See the attached "ceph pg <pg_id> query" output for two placement groups during recovery, and parts of our config file.
> Cluster size: 6 hosts with 26 HDDs each, 156 OSDs, 6488 PGs; the data is mostly in one bucket holding 9M objects; 3342 GB data, 11173 GB used, 31690 GB / 42864 GB avail.

I'm surprised you're running into it at 9m objects but this is almost
certainly the problem. Right now the index for each RGW bucket lives
on a single OSD; you're probably having issues with whichever OSD is
receiving the bucket index reads. Is it feasible for you to shard the
contents into multiple buckets and see if things calm down?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
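
One way to spread objects across several buckets, as suggested above, is to derive the bucket name from a stable hash of the object key. This is a minimal Python sketch, not something proposed in the thread: the function name bucket_for_key, the base name "mybucket", and the shard count of 16 are all hypothetical.

```python
import hashlib


def bucket_for_key(key, base_name="mybucket", shard_count=16):
    """Map an object key to one of shard_count bucket names, deterministically."""
    # A cryptographic hash is used only because it is stable across runs and
    # machines; Python's built-in hash() is randomized per process for strings.
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shard_count
    return f"{base_name}-{shard:02d}"


if __name__ == "__main__":
    # Illustrative keys only; real keys would come from the application.
    for key in ("images/a.jpg", "images/b.jpg", "images/c.jpg"):
        print(key, "->", bucket_for_key(key))
```

Because the mapping depends on shard_count, changing the count later remaps most keys, so under this scheme the number of buckets would need to be chosen up front.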


* Re: Flapping osd / continuously reported as failed
       [not found]     ` <CAPYLRzjGDep1ny6K-Ctz_7VG4THV6nAx9odOdjr=WNNesV4cVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-07-23 22:20       ` Studziński Krzysztof
       [not found]         ` <0D057B737C42FC4AB3F22773A5C9425F259DBDEDD1-K9pFWFEelezFe27LHpJFGNHuzzzSOjJt@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Studziński Krzysztof @ 2013-07-23 22:20 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org,
	Mostowiec Dominik

> On Tue, Jul 23, 2013 at 2:50 PM, Studziński Krzysztof
> <krzysztof.studzinski-Yw1TE0hTT7dz6jiHbVrK7g@public.gmane.org> wrote:
> > Hi,
> > We have a problem with our cluster: it continuously reports one OSD as failed, and after that OSD restarts automatically everything seems to work fine for a short time (a few minutes). CPU utilization of this OSD peaks at 8%, and iostat shows very low activity. We tried marking the flapping OSD out with "ceph osd out", but after recovery the same behavior appeared on a different OSD. The affected OSD also serves many more read operations than the others (see the file osd_reads.png linked at the bottom of this email; at about 16:00 we switched off osd.57 and osd.72 started to misbehave. osd.108 works while recovering).
> >
> > Extract from ceph.log:
> >
> > 2013-07-23 22:43:57.425839 mon.0 10.177.64.4:6789/0 24690 : [INF] osd.72 10.177.64.8:6803/22584 boot
> > 2013-07-23 22:43:56.298467 osd.72 10.177.64.8:6803/22584 415 : [WRN] map e41730 wrongly marked me down
> > 2013-07-23 22:50:27.572110 mon.0 10.177.64.4:6789/0 25081 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.9 10.177.64.4:6946/5124
> > 2013-07-23 22:50:27.595044 mon.0 10.177.64.4:6789/0 25082 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.78 10.177.64.5:6854/5604
> > 2013-07-23 22:50:27.611964 mon.0 10.177.64.4:6789/0 25083 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.10 10.177.64.4:6814/26192
> > 2013-07-23 22:50:27.612009 mon.0 10.177.64.4:6789/0 25084 : [INF] osd.72 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 2013-07-23 22:50:43.611939 >= grace 20.000000)
> > 2013-07-23 22:50:30.367398 7f8adb837700  0 log [WRN] : 3 slow requests, 3 included below; oldest blocked for > 30.688891 secs
> > 2013-07-23 22:50:30.367408 7f8adb837700  0 log [WRN] : slow request 30.688891 seconds old, received at 2013-07-23 22:49:59.678453: sd_op(client.44290048.0:125899 .dir.4168.2 [call rgw.bucket_prepare_op] 3.9447554d) v4 currently no flag points reached
> > 2013-07-23 22:50:30.367412 7f8adb837700  0 log [WRN] : slow request 30.179044 seconds old, received at 2013-07-23 22:50:00.188300: sd_op(client.44205530.0:189270 .dir.4168.2 [call rgw.bucket_list] 3.9447554d) v4 currently no flag points reached
> > 2013-07-23 22:50:30.367415 7f8adb837700  0 log [WRN] : slow request 30.171968 seconds old, received at 2013-07-23 22:50:00.195376: sd_op(client.44203484.0:192902 .dir.4168.2 [call rgw.bucket_list] 3.9447554d) v4 currently no flag points reached
> > 2013-07-23 22:51:36.082303 mon.0 10.177.64.4:6789/0 25159 : [INF] osd.72 10.177.64.8:6803/22584 boot
> > 2013-07-23 22:51:35.238164 osd.72 10.177.64.8:6803/22584 420 : [WRN] map e41738 wrongly marked me down
> > 2013-07-23 22:52:05.582969 mon.0 10.177.64.4:6789/0 25191 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.20 10.177.64.4:6913/4101
> > 2013-07-23 22:52:05.587388 mon.0 10.177.64.4:6789/0 25192 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.9 10.177.64.4:6946/5124
> > 2013-07-23 22:52:05.610925 mon.0 10.177.64.4:6789/0 25193 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.78 10.177.64.5:6854/5604
> > 2013-07-23 22:52:05.610951 mon.0 10.177.64.4:6789/0 25194 : [INF] osd.72 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 2013-07-23 22:52:20.610895 >= grace 20.000000)
> > 2013-07-23 22:52:05.630821 mon.0 10.177.64.4:6789/0 25195 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.10 10.177.64.4:6814/26192
> > 2013-07-23 22:53:47.203352 mon.0 10.177.64.4:6789/0 25300 : [INF] osd.72 10.177.64.8:6803/22584 boot
> > 2013-07-23 22:53:46.417106 osd.72 10.177.64.8:6803/22584 474 : [WRN] map e41742 wrongly marked me down
> >
> > Could you please take a look at our config and suggest some improvements?
> > See the attached "ceph pg <pg_id> query" output for two placement groups during recovery, and parts of our config file.
> > Cluster size: 6 hosts with 26 HDDs each, 156 OSDs, 6488 PGs; the data is mostly in one bucket holding 9M objects; 3342 GB data, 11173 GB used, 31690 GB / 42864 GB avail.
> 
> I'm surprised you're running into it at 9m objects but this is almost
> certainly the problem. Right now the index for each RGW bucket lives
> on a single OSD; you're probably having issues with whichever OSD is
> receiving the bucket index reads. Is it feasible for you to shard the
> contents into multiple buckets and see if things calm down?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com

I was afraid of that answer. :-/
We'll start by deleting some objects (they are transformations of other objects and can be re-created when needed) and then try to shard the bucket as you suggest.
What is your opinion on the maximum number of objects in one bucket?
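
One way the suggested split into multiple buckets could look on the client side is sketched below. This is only an illustration under stated assumptions: the bucket base name "images", the shard count of 16, and the function name shard_bucket are all hypothetical and do not come from this thread or from RGW itself. The idea is that each object key is hashed to a fixed bucket name, so every bucket (and therefore every bucket index) receives only a fraction of the traffic.

```python
import hashlib


def shard_bucket(object_key: str, base_name: str = "images", num_shards: int = 16) -> str:
    """Deterministically map an object key to one of num_shards bucket names.

    base_name and num_shards are hypothetical values chosen for illustration.
    """
    # Hash the key so that keys spread roughly evenly over the shards.
    digest = hashlib.sha256(object_key.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % num_shards
    return f"{base_name}-{shard:02d}"


if __name__ == "__main__":
    for key in ("originals/cat.jpg", "crops/cat_200x200.jpg", "notes/readme.txt"):
        print(key, "->", shard_bucket(key))
```

Because the mapping depends only on the key, readers and writers can compute the bucket independently. Changing num_shards later would move most keys to a different bucket, so the count would have to be chosen with some headroom.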

Best regards,
Krzysztof Studzinski

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Flapping osd / continuously reported as failed
       [not found]         ` <0D057B737C42FC4AB3F22773A5C9425F259DBDEDD1-K9pFWFEelezFe27LHpJFGNHuzzzSOjJt@public.gmane.org>
@ 2013-07-23 22:28           ` Gregory Farnum
       [not found]             ` <CAPYLRzhVtMCY+-d-y5F5M5hMVDwRh343+bB7An4Xcw4DT3n82w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-07-24  7:48             ` [ceph-users] " Studziński Krzysztof
  0 siblings, 2 replies; 15+ messages in thread
From: Gregory Farnum @ 2013-07-23 22:28 UTC (permalink / raw)
  To: Studziński Krzysztof, Yehuda Sadeh
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org,
	Mostowiec Dominik

On Tue, Jul 23, 2013 at 3:20 PM, Studziński Krzysztof
<krzysztof.studzinski@grupaonet.pl> wrote:
>
> I was afraid of that answer. :-/
> We'll try to delete some objects for the beginning (as they are transformed other objects and can be re-created when needed) and then try to shard it as you suggest.
> What is your opinion about max number of objects in one bucket ?

It depends more on how much activity you're throwing at the bucket, I
think. We haven't done any large-scale tests and perhaps we need to.
Leveldb (which we're using under the covers to store this stuff)
should not have any trouble with the amount of data that's in there,
but if you're trying to do frequent enough object lookups or puts then
you might just saturate the disk/node's ability to keep up. I should
mention that Yehuda just started a discussion on handling this in the
thread "[ceph-users] rgw bucket index".

When did you start noticing this trouble?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Flapping osd / continuously reported as failed
       [not found]             ` <CAPYLRzhVtMCY+-d-y5F5M5hMVDwRh343+bB7An4Xcw4DT3n82w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-07-23 23:18               ` Studziński Krzysztof
  0 siblings, 0 replies; 15+ messages in thread
From: Studziński Krzysztof @ 2013-07-23 23:18 UTC (permalink / raw)
  To: Gregory Farnum, Yehuda Sadeh
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org,
	Mostowiec Dominik

> -----Original Message-----
> From: Gregory Farnum [mailto:greg@inktank.com]
> Sent: Wednesday, July 24, 2013 12:28 AM
> To: Studziński Krzysztof; Yehuda Sadeh
> Cc: ceph-devel@vger.kernel.org; ceph-users@lists.ceph.com; Mostowiec
> Dominik
> Subject: Re: [ceph-users] Flapping osd / continuously reported as failed
> 
> It depends more on how much activity you're throwing at the bucket, I
> think. We haven't done any large-scale tests and perhaps we need to.
> Leveldb (which we're using under the covers to store this stuff)
> should not have any trouble with the amount of data that's in there,
> but if you're trying to do frequent enough object lookups or puts then
> you might just saturate the disk/node's ability to keep up. I should
> mention that Yehuda just started a discussion on handling this in the
> thread "[ceph-users] rgw bucket index".
> 
> When did you start noticing this trouble?

Most of our objects are images: originals (mostly 5-10MB) and their transformations (compressed versions, crops, filters, etc., usually <<1MB). We also have plenty of small text files. We mostly read them (objects are cached by nginx outside the cluster), but yesterday (15h ago) we had a pretty large deployment of a new service that could create plenty of transformed images of different sizes. Unfortunately I cannot say right now how much the bucket size has changed since yesterday; I will try to check.

From my stats I can see that at peak we had about 1.5K PUT and 8K GET operations per minute. The most common use case when PUTting an object in our apps is this sequence:
- HEAD (checking whether the transformed object already exists),
- GET the original,
- PUT the transformed object and its metadata.
We also have some PUTs of new objects, but I think there won't be many right now (it is 1am in Poland).
Right now there are 80 PUT and 1.5K GET operations per minute and the osd is still flapping.
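
To put the per-minute figures above on a per-second scale, here is a small worked calculation. It is a sketch only: the constant and function names are made up for illustration, and it assumes, as an upper bound, that every PUT belongs to the HEAD/GET/PUT sequence described above. Which of these requests actually touch the bucket index is not stated in this thread.

```python
# Figures quoted above, in operations per minute.
PEAK = {"PUT": 1500, "GET": 8000}
NOW = {"PUT": 80, "GET": 1500}

# HEAD + GET + PUT, the sequence described for storing a transformed object.
REQUESTS_PER_TRANSFORM = 3


def per_second(per_minute: float) -> float:
    """Convert a per-minute rate to a per-second rate."""
    return per_minute / 60.0


def transform_requests_per_minute(puts_per_minute: float) -> float:
    """Upper bound on requests from the HEAD/GET/PUT sequence.

    Assumes every PUT is part of that sequence, which the thread does not claim.
    """
    return puts_per_minute * REQUESTS_PER_TRANSFORM


if __name__ == "__main__":
    for label, rates in (("peak", PEAK), ("now", NOW)):
        for op, rate in rates.items():
            print(f"{label:>4} {op}: {per_second(rate):7.2f} ops/s")
        bound = transform_requests_per_minute(rates["PUT"])
        print(f"{label:>4} sequence requests (upper bound): {bound:.0f}/min")
```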

--
Best regards,
Krzysztof Studzinski


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [ceph-users] Flapping osd / continuously reported as failed
  2013-07-23 22:28           ` Gregory Farnum
       [not found]             ` <CAPYLRzhVtMCY+-d-y5F5M5hMVDwRh343+bB7An4Xcw4DT3n82w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-07-24  7:48             ` Studziński Krzysztof
       [not found]               ` <0D057B737C42FC4AB3F22773A5C9425F259DBDF026-K9pFWFEelezFe27LHpJFGNHuzzzSOjJt@public.gmane.org>
  1 sibling, 1 reply; 15+ messages in thread
From: Studziński Krzysztof @ 2013-07-24  7:48 UTC (permalink / raw)
  To: Gregory Farnum, Yehuda Sadeh
  Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com,
	Mostowiec Dominik

> -----Original Message-----
> From: Studziński Krzysztof
> Sent: Wednesday, July 24, 2013 1:18 AM
> To: 'Gregory Farnum'; Yehuda Sadeh
> Cc: ceph-devel@vger.kernel.org; ceph-users@lists.ceph.com; Mostowiec
> Dominik
> Subject: RE: [ceph-users] Flapping osd / continuously reported as failed
> 
> Most of our objects are images: originals (mostly 5-10MB) and its
> transformations (compressed, crops, filters, etc, usually <<1MB). We also
> have plenty of small text files. We mostly read them (objects are cached by
> nginx outside the cluster), but yesterday (15h ago) we had pretty large
> deployment of new service that could create plenty of transformed images
> with different sizes. Unfortunately I cannot say right now what is the
> difference in bucket size between now and yesterday, I will try to check it
> out.
> 
> From my stats I can see that in peaks we had about 1.5K PUT and 8K GET
> operations per minute. Most common use-case while PUTting an object in
> our apps is this sequence:
> - HEAD (checking if transformed object already exist),
> - GET  the original
> - PUT transformed object and its meta-data
> We also have some amount of PUTs of new objects, but I think there won't
> be many by now  (it is 1am in Poland).
> Right now there are 80 PUT and 1.5K GET operations per minute and the osd
> is still flapping.

Additional info about our index growth: the deployment I mentioned in my previous mail didn't affect the rate of index growth. Currently we get 60K new objects daily, and that rate has been almost constant for the last 7 days.
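
As a rough sense of scale, the numbers above can be combined as follows. This is a back-of-the-envelope sketch, not a measurement: it assumes the 60K/day rate holds over the whole period, whereas the thread only states it for the last 7 days, and the function names are made up for illustration.

```python
def days_of_growth(total_objects: int, new_per_day: int) -> float:
    """Days needed to accumulate total_objects at a constant daily rate."""
    return total_objects / new_per_day


def projected_objects(current: int, new_per_day: int, days: int) -> int:
    """Object count after a number of days at a constant daily rate."""
    return current + new_per_day * days


if __name__ == "__main__":
    # 9M objects in the bucket and 60K new objects per day, as quoted in this thread.
    print(days_of_growth(9_000_000, 60_000))
    print(projected_objects(9_000_000, 60_000, 30))
```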
 --
 Best regards,
 Krzysztof Studzinski


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Flapping osd / continuously reported as failed
       [not found]               ` <0D057B737C42FC4AB3F22773A5C9425F259DBDF026-K9pFWFEelezFe27LHpJFGNHuzzzSOjJt@public.gmane.org>
@ 2013-07-25  7:47                 ` Mostowiec Dominik
  2013-07-25 17:32                   ` [ceph-users] " Gregory Farnum
  0 siblings, 1 reply; 15+ messages in thread
From: Mostowiec Dominik @ 2013-07-25  7:47 UTC (permalink / raw)
  To: Gregory Farnum, Yehuda Sadeh
  Cc: Studziński Krzysztof,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org

Hi,
We found something else.
After osd.72 flapped, one PG, '3.54d', spent a long time recovering.

--
ceph health detail
HEALTH_WARN 1 pgs recovering; recovery 1/39821745 degraded (0.000%)
pg 3.54d is active+recovering, acting [72,108,23]
recovery 1/39821745 degraded (0.000%)
--

The last down/up flap of osd.72 was at 00:45.
In the logs we found:
2013-07-24 00:45:02.736740 7f8ac1e04700  0 log [INF] : 3.54d deep-scrub ok
Since then everything has been OK.

Is it possible that scrubbing was the reason this osd was flapping?

We have the default scrubbing settings (ceph version 0.56.6).
If scrubbing is the troublemaker, can we make it a little lighter by changing the config?
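
One way to check this kind of correlation is to pull the relevant events out of the logs and put them on a single timeline. The sketch below does that for four line types seen in this thread (monitor "failed", monitor "boot", "wrongly marked me down", and "deep-scrub ok"). The regular expressions and the names parse_events and downtime_seconds are assumptions written against the sample lines only, with each entry joined onto one line; they are not a general parser for ceph logs.

```python
import re
from datetime import datetime

TS = r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"

# One pattern per event type; written only against the lines shown in this thread.
PATTERNS = [
    ("failed", re.compile(TS + r" .*\[INF\] (?P<who>osd\.\d+) \S+ failed \(")),
    ("boot", re.compile(TS + r" .*\[INF\] (?P<who>osd\.\d+) \S+ boot$")),
    ("wrongly_down", re.compile(TS + r" (?P<who>osd\.\d+) .*wrongly marked me down$")),
    ("deep_scrub_ok", re.compile(TS + r" .* (?P<who>\d+\.[0-9a-f]+) deep-scrub ok$")),
]

# Sample lines copied from this thread, each joined onto a single line.
SAMPLE = [
    "2013-07-23 22:50:27.612009 mon.0 10.177.64.4:6789/0 25084 : [INF] osd.72 "
    "10.177.64.8:6803/22584 failed (3 reports from 3 peers after 2013-07-23 "
    "22:50:43.611939 >= grace 20.000000)",
    "2013-07-23 22:51:36.082303 mon.0 10.177.64.4:6789/0 25159 : [INF] osd.72 "
    "10.177.64.8:6803/22584 boot",
    "2013-07-23 22:51:35.238164 osd.72 10.177.64.8:6803/22584 420 : [WRN] map "
    "e41738 wrongly marked me down",
    "2013-07-24 00:45:02.736740 7f8ac1e04700  0 log [INF] : 3.54d deep-scrub ok",
]


def parse_events(lines):
    """Return a time-sorted list of (timestamp, kind, subject) tuples."""
    events = []
    for line in lines:
        line = line.strip()
        for kind, pattern in PATTERNS:
            match = pattern.match(line)
            if match:
                when = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
                events.append((when, kind, match.group("who")))
                break
    return sorted(events)


def downtime_seconds(events, osd):
    """Seconds between each 'failed' event and the next 'boot' of the same osd."""
    durations, failed_at = [], None
    for when, kind, who in events:
        if who != osd:
            continue
        if kind == "failed":
            failed_at = when
        elif kind == "boot" and failed_at is not None:
            durations.append((when - failed_at).total_seconds())
            failed_at = None
    return durations


if __name__ == "__main__":
    timeline = parse_events(SAMPLE)
    for when, kind, who in timeline:
        print(when, kind, who)
    print(downtime_seconds(timeline, "osd.72"))
```

Fed with the monitor log and the OSD log together, a timeline like this would show whether the flaps stop at the same moment the deep scrub of the PG finishes.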

--
Regards
Dominik

-----Original Message-----
From: Studziński Krzysztof 
Sent: Wednesday, July 24, 2013 9:48 AM
To: Gregory Farnum; Yehuda Sadeh
Cc: ceph-devel@vger.kernel.org; ceph-users@lists.ceph.com; Mostowiec Dominik
Subject: RE: [ceph-users] Flapping osd / continuously reported as failed

> -----Original Message-----
> From: Studziński Krzysztof
> Sent: Wednesday, July 24, 2013 1:18 AM
> To: 'Gregory Farnum'; Yehuda Sadeh
> Cc: ceph-devel@vger.kernel.org; ceph-users@lists.ceph.com; Mostowiec 
> Dominik
> Subject: RE: [ceph-users] Flapping osd / continuously reported as 
> failed
> 
> > -----Original Message-----
> > From: Gregory Farnum [mailto:greg@inktank.com]
> > Sent: Wednesday, July 24, 2013 12:28 AM
> > To: Studziński Krzysztof; Yehuda Sadeh
> > Cc: ceph-devel@vger.kernel.org; ceph-users@lists.ceph.com; Mostowiec 
> > Dominik
> > Subject: Re: [ceph-users] Flapping osd / continuously reported as 
> > failed
> >
> > On Tue, Jul 23, 2013 at 3:20 PM, Studziński Krzysztof 
> > <krzysztof.studzinski@grupaonet.pl> wrote:
> > >> On Tue, Jul 23, 2013 at 2:50 PM, Studziński Krzysztof 
> > >> <krzysztof.studzinski@grupaonet.pl> wrote:
> > >> > Hi,
> > >> > We've got some problem with our cluster - it continuously 
> > >> > reports
> failed
> > >> one osd and after auto-rebooting everything seems to work fine 
> > >> for
> some
> > >> time (few minutes). CPU util of this osd is max 8%, iostat is 
> > >> very low. We
> > tried
> > >> to "ceph osd out" such flapping osd, but after recovering this 
> > >> behavior returned on different osd. This osd has also much more 
> > >> read operations
> > than
> > >> others (see file osd_reads.png linked at the bottom of the email; 
> > >> at
> about
> > >> 16:00 we switched off osd.57 and osd.72 started to misbehave. 
> > >> Osd.108 works while recovering).
> > >> >
> > >> > Extract from ceph.log:
> > >> >
> > >> > 2013-07-23 22:43:57.425839 mon.0 10.177.64.4:6789/0 24690 : 
> > >> > [INF]
> > osd.72
> > >> 10.177.64.8:6803/22584 boot
> > >> > 2013-07-23 22:43:56.298467 osd.72 10.177.64.8:6803/22584 415 : 
> > >> > [WRN]
> > map
> > >> e41730 wrongly marked me down
> > >> > 2013-07-23 22:50:27.572110 mon.0 10.177.64.4:6789/0 25081 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.9 
> > >> 10.177.64.4:6946/5124
> > >> > 2013-07-23 22:50:27.595044 mon.0 10.177.64.4:6789/0 25082 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.78 
> > >> 10.177.64.5:6854/5604
> > >> > 2013-07-23 22:50:27.611964 mon.0 10.177.64.4:6789/0 25083 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.10
> 10.177.64.4:6814/26192
> > >> > 2013-07-23 22:50:27.612009 mon.0 10.177.64.4:6789/0 25084 : 
> > >> > [INF]
> > osd.72
> > >> 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 
> > >> 2013-07-23
> > >> 22:50:43.611939 >= grace 20.000000)
> > >> > 2013-07-23 22:50:30.367398 7f8adb837700  0 log [WRN] : 3 slow
> requests,
> > 3
> > >> included below; oldest blocked for > 30.688891 secs
> > >> > 2013-07-23 22:50:30.367408 7f8adb837700  0 log [WRN] : slow 
> > >> > request
> > >> 30.688891 seconds old, received at 2013-07-23 22:49:59.678453:
> > >> sd_op(client.44290048.0:125899 .dir.4168.2 [call
> rgw.bucket_prepare_op]
> > >> 3.9447554d) v4 currently no flag points reached
> > >> > 2013-07-23 22:50:30.367412 7f8adb837700  0 log [WRN] : slow 
> > >> > request
> > >> 30.179044 seconds old, received at 2013-07-23 22:50:00.188300:
> > >> sd_op(client.44205530.0:189270 .dir.4168.2 [call rgw.bucket_list]
> > 3.9447554d)
> > >> v4 currently no flag points reached
> > >> > 2013-07-23 22:50:30.367415 7f8adb837700  0 log [WRN] : slow 
> > >> > request
> > >> 30.171968 seconds old, received at 2013-07-23 22:50:00.195376:
> > >> sd_op(client.44203484.0:192902 .dir.4168.2 [call rgw.bucket_list]
> > 3.9447554d)
> > >> v4 currently no flag points reached
> > >> > 2013-07-23 22:51:36.082303 mon.0 10.177.64.4:6789/0 25159 : 
> > >> > [INF]
> > osd.72
> > >> 10.177.64.8:6803/22584 boot
> > >> > 2013-07-23 22:51:35.238164 osd.72 10.177.64.8:6803/22584 420 : 
> > >> > [WRN]
> > map
> > >> e41738 wrongly marked me down
> > >> > 2013-07-23 22:52:05.582969 mon.0 10.177.64.4:6789/0 25191 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.20 
> > >> 10.177.64.4:6913/4101
> > >> > 2013-07-23 22:52:05.587388 mon.0 10.177.64.4:6789/0 25192 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.9 
> > >> 10.177.64.4:6946/5124
> > >> > 2013-07-23 22:52:05.610925 mon.0 10.177.64.4:6789/0 25193 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.78 
> > >> 10.177.64.5:6854/5604
> > >> > 2013-07-23 22:52:05.610951 mon.0 10.177.64.4:6789/0 25194 : 
> > >> > [INF]
> > osd.72
> > >> 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 
> > >> 2013-07-23
> > >> 22:52:20.610895 >= grace 20.000000)
> > >> > 2013-07-23 22:52:05.630821 mon.0 10.177.64.4:6789/0 25195 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.10
> 10.177.64.4:6814/26192
> > >> > 2013-07-23 22:53:47.203352 mon.0 10.177.64.4:6789/0 25300 : 
> > >> > [INF]
> > osd.72
> > >> 10.177.64.8:6803/22584 boot
> > >> > 2013-07-23 22:53:46.417106 osd.72 10.177.64.8:6803/22584 474 : 
> > >> > [WRN]
> > map
> > >> e41742 wrongly marked me down
> > >> >
> > >> > Could you please take a look at our config and suggest some
> > >> improvements?
> > >> > See attached "ceph pg <pg_id> query" for two groups during 
> > >> > recovery
> > and
> > >> parts of our config file.
> > >> > Our cluster's size: 6 hosts, 26 HDD each, 156 osds, 6488 pgs, 
> > >> > mostly in
> > one
> > >> bucket having 9M objects, 3342 GB data, 11173 GB used, 31690 GB /
> 42864
> > GB
> > >> avail.
> > >>
> > >> I'm surprised you're running into it at 9m objects but this is 
> > >> almost certainly the problem. Right now the index for each RGW 
> > >> bucket lives on a single OSD; you're probably having issues with 
> > >> whichever OSD is receiving the bucket index reads. Is it feasible 
> > >> for you to shard the contents into multiple buckets and see if things calm down?
> > >> -Greg
> > >> Software Engineer #42 @ http://inktank.com | http://ceph.com
> > >
> > > I was afraid of that answer. :-/
> > > We'll start by deleting some objects (they are transformations of
> > > other objects and can be re-created when needed) and then try to
> > > shard the bucket as you suggest.
> > > What is your opinion on the maximum number of objects in one bucket?
> >
> > It depends more on how much activity you're throwing at the bucket, 
> > I think. We haven't done any large-scale tests and perhaps we need to.
> > Leveldb (which we're using under the covers to store this stuff) 
> > should not have any trouble with the amount of data that's in there, 
> > but if you're trying to do frequent enough object lookups or puts 
> > then you might just saturate the disk/node's ability to keep up. I 
> > should mention that Yehuda just started a discussion on handling 
> > this in the thread "[ceph-users] rgw bucket index".
> >
> > When did you start noticing this trouble?
> 
> Most of our objects are images: originals (mostly 5-10MB) and their
> transformations (compressed versions, crops, filters, etc., usually <<1MB).
> We also have plenty of small text files. We mostly read them (objects are
> cached by nginx outside the cluster), but yesterday (15h ago) we had a
> fairly large deployment of a new service that could create plenty of
> transformed images of different sizes. Unfortunately I cannot say
> right now how much the bucket size has changed since yesterday;
> I will try to check.
> 
> From my stats I can see that at peak we had about 1.5K PUT and 8K GET
> operations per minute. The most common use case when PUTting an object in
> our apps is this sequence:
> - HEAD (checking whether the transformed object already exists),
> - GET the original,
> - PUT the transformed object and its metadata.
> We also have some PUTs of new objects, but I think there won't be many
> right now (it is 1am in Poland).
> Right now there are 80 PUT and 1.5K GET operations per minute and the
> osd is still flapping.

Additional info about our index growth: the deployment I mentioned in my previous mail didn't affect the rate of index growth. We currently add about 60K new objects daily, and that rate has been almost constant for the last 7 days.
 --
 Best regards,
 Krzysztof Studzinski
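
Editor's note: the multi-bucket sharding Greg suggests in the quoted text could be done on the application side by deriving the bucket name from the object key. What follows is only a minimal sketch under stated assumptions: the bucket prefix "images", the shard count of 16, and the helper name bucket_for_key are all hypothetical, and nothing here is an RGW API.

```python
import zlib

NUM_SHARDS = 16            # hypothetical; size it to the expected object count
BUCKET_PREFIX = "images"   # hypothetical bucket name prefix


def bucket_for_key(key: str, num_shards: int = NUM_SHARDS,
                   prefix: str = BUCKET_PREFIX) -> str:
    """Deterministically map an object key to one of num_shards buckets."""
    # CRC32 is stable across runs and machines, unlike Python's built-in hash().
    shard = zlib.crc32(key.encode("utf-8")) % num_shards
    return f"{prefix}-{shard:02d}"


if __name__ == "__main__":
    for key in ("originals/cat.jpg", "crops/cat_200x200.jpg"):
        print(key, "->", bucket_for_key(key))
```

Because the mapping depends only on the key, the HEAD / GET / PUT sequence described above can find the right bucket without any lookup table. Changing the shard count later would move existing keys, so it has to be chosen up front or handled with a migration.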



* Re: [ceph-users] Flapping osd / continuously reported as failed
  2013-07-25  7:47                 ` Mostowiec Dominik
@ 2013-07-25 17:32                   ` Gregory Farnum
       [not found]                     ` <CAPYLRzghUwEvu_f0aV2Q37JqnyCJ=46cTWiteTwN4=Tmqxd3HA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2013-07-25 17:32 UTC (permalink / raw)
  To: Mostowiec Dominik
  Cc: Yehuda Sadeh, ceph-devel@vger.kernel.org,
	ceph-users@lists.ceph.com, Studziński Krzysztof

On Thu, Jul 25, 2013 at 12:47 AM, Mostowiec Dominik
<Dominik.Mostowiec@grupaonet.pl> wrote:
> Hi,
> We found something else.
> After osd.72 flapped, one PG, '3.54d', spent a long time recovering.
>
> --
> ceph health details
> HEALTH_WARN 1 pgs recovering; recovery 1/39821745 degraded (0.000%)
> pg 3.54d is active+recovering, acting [72,108,23]
> recovery 1/39821745 degraded (0.000%)
> --
>
> The last down/up flap of osd.72 was at 00:45.
> In the logs we found:
> 2013-07-24 00:45:02.736740 7f8ac1e04700  0 log [INF] : 3.54d deep-scrub ok
> Since then everything has been OK.
>
> Is it possible that scrubbing was the reason this osd was flapping?
>
> We have default scrubbing settings (ceph version 0.56.6).
> If scrubbing is the troublemaker, can we make it a little lighter by changing the config?

It's possible, as deep scrub in particular will add a bit of load (it
goes through and compares the object contents). Are you not having any
flapping issues any more, and did you try and find when it started the
scrub to see if it matched up with your troubles?

I'd be hesitant to turn it off as scrubbing can uncover corrupt
objects etc, but you can configure it with the settings at
http://ceph.com/docs/master/rados/configuration/osd-config-ref/#scrubbing.
(Always check the surprisingly-helpful docs when you need to do some
config or operations work!)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
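
Editor's note: one way to check whether the flapping lines up with the scrub is to scan the logs for both kinds of event. The sketch below assumes log lines shaped like the ones quoted in this thread. The thread only shows the "deep-scrub ok" completion line, so the sketch looks backwards from completion; the 3-hour window and the function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
# e.g. "2013-07-24 00:45:02.736740 7f8ac1e04700  0 log [INF] : 3.54d deep-scrub ok"
SCRUB_OK = re.compile(TS + r" .*\[INF\] : (\S+) deep-scrub ok")
# e.g. "2013-07-23 22:53:46.417106 osd.72 ... [WRN] map e41742 wrongly marked me down"
MARKED_DOWN = re.compile(TS + r" (osd\.\d+) .*wrongly marked me down")


def parse_ts(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def flaps_before_scrub_end(lines, window=timedelta(hours=3)):
    """Return {pg: [(flap_time, osd), ...]} for flaps seen within
    `window` before each deep-scrub of that PG finished."""
    flaps, scrubs = [], []
    for line in lines:
        m = MARKED_DOWN.search(line)
        if m:
            flaps.append((parse_ts(m.group(1)), m.group(2)))
            continue
        m = SCRUB_OK.search(line)
        if m:
            scrubs.append((parse_ts(m.group(1)), m.group(2)))
    result = {}
    for end, pg in scrubs:
        hits = [(t, osd) for t, osd in flaps if end - window <= t <= end]
        result.setdefault(pg, []).extend(hits)
    return result
```

Feeding it the lines from both ceph.log and the OSD log shows, per PG, which flaps fell shortly before a deep-scrub completed. It does not prove causation; it only narrows down where to look.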


* Re: Flapping osd / continuously reported as failed
       [not found]                     ` <CAPYLRzghUwEvu_f0aV2Q37JqnyCJ=46cTWiteTwN4=Tmqxd3HA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-08-16 12:47                       ` Mostowiec Dominik
  2013-08-19 19:55                         ` [ceph-users] " Gregory Farnum
  0 siblings, 1 reply; 15+ messages in thread
From: Mostowiec Dominik @ 2013-08-16 12:47 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: Studziński Krzysztof,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sydor Bohdan,
	ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org

Hi,
Thanks for your response.

> It's possible, as deep scrub in particular will add a bit of load (it
> goes through and compares the object contents). 

Is it possible that scrubbing blocks access (RW or only W) to the bucket index while it checks the .dir... file?
When the rgw index is very large, I guess that takes some time.

> Are you not having any
> flapping issues any more, and did you try and find when it started the
> scrub to see if it matched up with your troubles?

No, I didn't.
But on our second cluster, which has the same problem, disabling scrubbing also helps.

> I'd be hesitant to turn it off as scrubbing can uncover corrupt
> objects etc, but you can configure it with the settings at
> http://ceph.com/docs/master/rados/configuration/osd-config-ref/#scrubbing.
> (Always check the surprisingly-helpful docs when you need to do some
> config or operations work!)

I think changing the scrub timeout or interval settings won't fully remove the issue.
Would changing "osd deep scrub stride" to a small value make scrubbing lighter?

--
Regards
Dominik


* Re: [ceph-users] Flapping osd / continuously reported as failed
  2013-08-16 12:47                       ` Mostowiec Dominik
@ 2013-08-19 19:55                         ` Gregory Farnum
  2013-08-19 22:09                           ` Mostowiec Dominik
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2013-08-19 19:55 UTC (permalink / raw)
  To: Mostowiec Dominik
  Cc: Yehuda Sadeh, ceph-devel@vger.kernel.org,
	ceph-users@lists.ceph.com, Studziński Krzysztof,
	Sydor Bohdan

On Fri, Aug 16, 2013 at 5:47 AM, Mostowiec Dominik
<Dominik.Mostowiec@grupaonet.pl> wrote:
> Hi,
> Thanks for your response.
>
>> It's possible, as deep scrub in particular will add a bit of load (it
>> goes through and compares the object contents).
>
> Is it possible that scrubbing blocks access (RW or only W) to the bucket index while it checks the .dir... file?
> When the rgw index is very large, I guess that takes some time.

Yes, it definitely can as scrubbing takes locks on the PG, which will
prevent reads or writes while the message is being processed (which
will involve the rgw index being scanned).

>> Are you not having any
>> flapping issues any more, and did you try and find when it started the
>> scrub to see if it matched up with your troubles?
>
> No, I didn't.
> But on our second cluster, which has the same problem, disabling scrubbing also helps.
>
>> I'd be hesitant to turn it off as scrubbing can uncover corrupt
>> objects etc, but you can configure it with the settings at
>> http://ceph.com/docs/master/rados/configuration/osd-config-ref/#scrubbing.
>> (Always check the surprisingly-helpful docs when you need to do some
>> config or operations work!)
>
> I think changing the scrub timeout or interval settings won't fully remove the issue.
> Would changing "osd deep scrub stride" to a small value make scrubbing lighter?
You probably don't want to change the scrub stride; that is used to
keep reads at an appropriate size for the internal control threads but
won't relate to the object read/write locking.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


* RE: [ceph-users] Flapping osd / continuously reported as failed
  2013-08-19 19:55                         ` [ceph-users] " Gregory Farnum
@ 2013-08-19 22:09                           ` Mostowiec Dominik
       [not found]                             ` <ADBDB4FFB0814748AF32D0A1EE6E10AF228322C9F0-K9pFWFEelezFe27LHpJFGNHuzzzSOjJt@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Mostowiec Dominik @ 2013-08-19 22:09 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: Yehuda Sadeh, ceph-devel@vger.kernel.org,
	ceph-users@lists.ceph.com, Studziński Krzysztof,
	Sydor Bohdan

Hi,
> Yes, it definitely can as scrubbing takes locks on the PG, which will prevent reads or writes while the message is being processed (which will involve the rgw index being scanned).
Is it possible to tune the scrubbing config to eliminate slow requests and the osd being marked down while a large rgw bucket index is being scrubbed?

--
Regards
Dominik



* Re: Flapping osd / continuously reported as failed
       [not found]                             ` <ADBDB4FFB0814748AF32D0A1EE6E10AF228322C9F0-K9pFWFEelezFe27LHpJFGNHuzzzSOjJt@public.gmane.org>
@ 2013-08-19 22:19                               ` Gregory Farnum
  2014-01-24 12:29                                 ` Maciej Bonin
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2013-08-19 22:19 UTC (permalink / raw)
  To: Mostowiec Dominik
  Cc: Studziński Krzysztof,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sydor Bohdan,
	ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org

On Mon, Aug 19, 2013 at 3:09 PM, Mostowiec Dominik
<Dominik.Mostowiec-Yw1TE0hTT7dz6jiHbVrK7g@public.gmane.org> wrote:
> Hi,
>> Yes, it definitely can as scrubbing takes locks on the PG, which will prevent reads or writes while the message is being processed (which will involve the rgw index being scanned).
> Is it possible to tune the scrubbing config to eliminate slow requests and the osd being marked down while a large rgw bucket index is being scrubbed?

Unfortunately not, or we would have mentioned it before. :/ There are
some proposals for sharding bucket indexes that would ameliorate this
problem, and on Cuttlefish or Dumpling the OSD won't get marked down,
but it will still block incoming requests on that object (ie, requests
to access the bucket) while the scrubbing is in place.
That said, that improvement might be sufficient since you haven't
actually shown us how long the object scrub takes.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


* Re: Flapping osd / continuously reported as failed
  2013-08-19 22:19                               ` Gregory Farnum
@ 2014-01-24 12:29                                 ` Maciej Bonin
  2014-01-24 13:36                                   ` Mark Nelson
  0 siblings, 1 reply; 15+ messages in thread
From: Maciej Bonin @ 2014-01-24 12:29 UTC (permalink / raw)
  To: ceph-devel

Gregory Farnum <greg@...> writes:

> 
> On Mon, Aug 19, 2013 at 3:09 PM, Mostowiec Dominik
> <Dominik.Mostowiec@...> wrote:
> > Hi,
> >> Yes, it definitely can as scrubbing takes locks on the PG, which will prevent reads or writes while the message is being processed (which will involve the rgw index being scanned).
> > Is it possible to tune the scrubbing config to eliminate slow requests and the osd being marked down while a large rgw bucket index is being scrubbed?
> 
> Unfortunately not, or we would have mentioned it before. :/ There are
> some proposals for sharding bucket indexes that would ameliorate this
> problem, and on Cuttlefish or Dumpling the OSD won't get marked down,
> but it will still block incoming requests on that object (ie, requests
> to access the bucket) while the scrubbing is in place.
> That said, that improvement might be sufficient since you haven't
> actually shown us how long the object scrub takes.
> -Greg
> Software Engineer #42  <at>  http://inktank.com | http://ceph.com
> 


Hello Guys,

I just wanted to share that we had a similar problem and solved it
by borrowing sensible kernel option defaults from a radosgw patch, iirc.
net.ipv4.ip_local_port_range = 1024 65535
net.core.netdev_max_backlog = 30000
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 252144
net.ipv4.tcp_max_tw_buckets = 360000
net.ipv4.tcp_fin_timeout = 3
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2


Regards,
Maciej Bonin
Systems Engineer
m247.com
ISO 27001 Data Protection Classification: A - Public
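
Editor's note: before applying a list like the one above, it can help to see which values actually differ from what the kernel is running with. This is a minimal sketch, assuming the settings are given as "key = value" lines and that each key maps to a file under /proc/sys with dots replaced by slashes. That mapping holds for the keys listed here but not for keys containing interface names with dots. The helper names are hypothetical.

```python
from pathlib import Path


def sysctl_path(key, root="/proc/sys"):
    """Map e.g. 'net.core.somaxconn' to /proc/sys/net/core/somaxconn."""
    return Path(root) / key.replace(".", "/")


def parse_settings(text):
    """Parse 'key = value' lines; blank lines and # comments are skipped."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Normalise whitespace so '1024 65535' matches the kernel's '1024\t65535'.
        settings[key.strip()] = " ".join(value.split())
    return settings


def diff_settings(wanted, root="/proc/sys"):
    """Return {key: (current, wanted)} for keys whose running value differs.
    current is None when the file is missing or unreadable."""
    out = {}
    for key, want in wanted.items():
        try:
            current = " ".join(sysctl_path(key, root).read_text().split())
        except OSError:
            current = None
        if current != want:
            out[key] = (current, want)
    return out
```

Calling diff_settings(parse_settings(text)) on a host prints nothing to change when the result is empty; a None in the first slot means the kernel does not expose that key.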



* Re: Flapping osd / continuously reported as failed
  2014-01-24 12:29                                 ` Maciej Bonin
@ 2014-01-24 13:36                                   ` Mark Nelson
  0 siblings, 0 replies; 15+ messages in thread
From: Mark Nelson @ 2014-01-24 13:36 UTC (permalink / raw)
  To: Maciej Bonin; +Cc: ceph-devel

On 01/24/2014 06:29 AM, Maciej Bonin wrote:
> Gregory Farnum <greg@...> writes:
>
>>
>> On Mon, Aug 19, 2013 at 3:09 PM, Mostowiec Dominik
>> <Dominik.Mostowiec@...> wrote:
>>> Hi,
>>>> Yes, it definitely can as scrubbing takes locks on the PG, which will prevent reads or writes while the message is being processed (which will involve the rgw index being scanned).
>>> Is it possible to tune the scrubbing config to eliminate slow requests and the osd being marked down while a large rgw bucket index is being scrubbed?
>>
>> Unfortunately not, or we would have mentioned it before. :/ There are
>> some proposals for sharding bucket indexes that would ameliorate this
>> problem, and on Cuttlefish or Dumpling the OSD won't get marked down,
>> but it will still block incoming requests on that object (ie, requests
>> to access the bucket) while the scrubbing is in place.
>> That said, that improvement might be sufficient since you haven't
>> actually shown us how long the object scrub takes.
>> -Greg
>> Software Engineer #42  <at>  http://inktank.com | http://ceph.com
>>
>
>
> Hello Guys,
>
> I just wanted to share that we had a similar problem and solved it
> by borrowing sensible kernel option defaults from a radosgw patch, iirc.
> net.ipv4.ip_local_port_range = 1024 65535
> net.core.netdev_max_backlog = 30000
> net.core.somaxconn = 4096
> net.ipv4.tcp_max_syn_backlog = 252144
> net.ipv4.tcp_max_tw_buckets = 360000
> net.ipv4.tcp_fin_timeout = 3
> net.ipv4.tcp_max_orphans = 262144
> net.ipv4.tcp_synack_retries = 2
> net.ipv4.tcp_syn_retries = 2

FWIW, these may not strictly help with the situation you described, but
at least on our test cluster they helped improve RGW performance in
general on 10GbE+:

echo 33554432 | sudo tee /proc/sys/net/core/rmem_default
echo 33554432 | sudo tee /proc/sys/net/core/wmem_default
echo 33554432 | sudo tee /proc/sys/net/core/rmem_max
echo 33554432 | sudo tee /proc/sys/net/core/wmem_max
echo "10240 87380 33554432" | sudo tee /proc/sys/net/ipv4/tcp_rmem
echo "10240 87380 33554432" | sudo tee /proc/sys/net/ipv4/tcp_wmem
echo 250000 | sudo tee /proc/sys/net/core/netdev_max_backlog
echo 524288 | sudo tee /proc/sys/net/nf_conntrack_max
echo 1 | sudo tee /proc/sys/net/ipv4/tcp_tw_recycle
echo 1 | sudo tee /proc/sys/net/ipv4/tcp_tw_reuse
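
Editor's note: values written under /proc/sys with tee do not survive a reboot. The sketch below, assuming you want to keep settings like the ones above, turns each "echo ... | sudo tee /proc/sys/..." command into a "key = value" line in sysctl configuration file syntax. The helper name to_sysctl_line is hypothetical, and the regular expression only covers the command shape used in this message.

```python
import re

# Matches: echo VALUE | sudo tee /proc/sys/some/path   (VALUE may be double-quoted)
ECHO_TEE = re.compile(
    r'^echo\s+"?([^"|]+?)"?\s*\|\s*sudo\s+tee\s+/proc/sys/(\S+)\s*$'
)


def to_sysctl_line(command):
    """Convert one echo|tee command to 'key = value', or None if it does not match."""
    m = ECHO_TEE.match(command.strip())
    if m is None:
        return None
    value, path = m.groups()
    return f"{path.replace('/', '.')} = {value.strip()}"


if __name__ == "__main__":
    commands = [
        "echo 33554432 | sudo tee /proc/sys/net/core/rmem_max",
        'echo "10240 87380 33554432" | sudo tee /proc/sys/net/ipv4/tcp_rmem',
    ]
    for cmd in commands:
        print(to_sysctl_line(cmd))
```

The resulting lines could go into a file under /etc/sysctl.d/ and be loaded with "sysctl --system". It is worth checking that each key exists on the target kernel first; tcp_tw_recycle, for example, was removed in Linux 4.12.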

>
>
> Regards,
> Maciej Bonin
> Systems Engineer
> m247.com
> ISO 27001 Data Protection Classification: A - Public
>



end of thread, other threads:[~2014-01-24 13:36 UTC | newest]

Thread overview: 15+ messages
2013-07-23 21:50 Flapping osd / continuously reported as failed Studziński Krzysztof
     [not found] ` <0D057B737C42FC4AB3F22773A5C9425F259DBDEDD0-K9pFWFEelezFe27LHpJFGNHuzzzSOjJt@public.gmane.org>
2013-07-23 22:12   ` Gregory Farnum
     [not found]     ` <CAPYLRzjGDep1ny6K-Ctz_7VG4THV6nAx9odOdjr=WNNesV4cVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-07-23 22:20       ` Studziński Krzysztof
     [not found]         ` <0D057B737C42FC4AB3F22773A5C9425F259DBDEDD1-K9pFWFEelezFe27LHpJFGNHuzzzSOjJt@public.gmane.org>
2013-07-23 22:28           ` Gregory Farnum
     [not found]             ` <CAPYLRzhVtMCY+-d-y5F5M5hMVDwRh343+bB7An4Xcw4DT3n82w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-07-23 23:18               ` Studziński Krzysztof
2013-07-24  7:48             ` [ceph-users] " Studziński Krzysztof
     [not found]               ` <0D057B737C42FC4AB3F22773A5C9425F259DBDF026-K9pFWFEelezFe27LHpJFGNHuzzzSOjJt@public.gmane.org>
2013-07-25  7:47                 ` Mostowiec Dominik
2013-07-25 17:32                   ` [ceph-users] " Gregory Farnum
     [not found]                     ` <CAPYLRzghUwEvu_f0aV2Q37JqnyCJ=46cTWiteTwN4=Tmqxd3HA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-08-16 12:47                       ` Mostowiec Dominik
2013-08-19 19:55                         ` [ceph-users] " Gregory Farnum
2013-08-19 22:09                           ` Mostowiec Dominik
     [not found]                             ` <ADBDB4FFB0814748AF32D0A1EE6E10AF228322C9F0-K9pFWFEelezFe27LHpJFGNHuzzzSOjJt@public.gmane.org>
2013-08-19 22:19                               ` Gregory Farnum
2014-01-24 12:29                                 ` Maciej Bonin
2014-01-24 13:36                                   ` Mark Nelson
  -- strict thread matches above, loose matches on Subject: below --
2013-07-23 21:36 Studziński Krzysztof
