* Re: [ceph-users] mon IO usage
From: Mike Dawson @ 2013-05-21 12:57 UTC
To: Sylvain Munaut; +Cc: ceph-users, ceph-devel@vger.kernel.org

Sylvain,

I can confirm I see a similar traffic pattern.

Any time I have lots of writes going to my cluster (like heavy writes
from RBD or remapping/backfilling after losing an OSD), I see all sorts
of monitor issues.

If my monitor leveldb store.db directories grow past some unknown point
(maybe ~1GB or so), 'compact on trim' is too slow to keep up. The
store.db grows faster than compact can trim the garbage. After that
point, the only hope to rein in the store.db size is to stop the OSDs
and get leveldb to compact without any ongoing writes.

I sent Sage and Joao a transaction dump of the growth yesterday. Sage
looked, but the files are so large it is tough to get useful info.

http://tracker.ceph.com/issues/4895

I believe this issue has existed since 0.48.

- Mike

On 5/21/2013 8:16 AM, Sylvain Munaut wrote:
> Hi,
>
>
> I've just added some monitoring to the IO usage of mon (trying to
> track down that growing mon issue), and I'm kind of surprised by the
> amount of IO generated by the monitor process.
>
> I get continuous 4 MB/s / 75 iops with added big spikes at each
> compaction every 3 min or so.
>
> Is there a description somewhere of what the monitor does exactly? I
> mean the monmap / pgmap / osdmap / mdsmap / election epoch don't
> change that often (pgmap is like 1 per second and that's the fastest
> change by several orders of magnitude). So what exactly does the
> monitor do with all that IO???
>
>
> Cheers,
>
> Sylvain
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

* Re: mon IO usage
From: Mike Dawson @ 2013-05-21 13:25 UTC
To: Sylvain Munaut
Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Thanks for the correction on IRC. I should have written that this issue
started with 0.59 (when the monitor changes hit).

http://ceph.com/dev-notes/cephs-new-monitor-changes/

The writeup and release notes sometimes say they went in for 0.58, but I
believe they were actually released in 0.59.

Thanks for the correction, Sylvain.

- Mike

On 5/21/2013 8:57 AM, Mike Dawson wrote:
> Sylvain,
>
> I can confirm I see a similar traffic pattern.
>
> Any time I have lots of writes going to my cluster (like heavy writes
> from RBD or remapping/backfilling after losing an OSD), I see all sorts
> of monitor issues.
>
> If my monitor leveldb store.db directories grow past some unknown point
> (maybe ~1GB or so), 'compact on trim' is too slow to keep up. The
> store.db grows faster than compact can trim the garbage. After that
> point, the only hope to rein in the store.db size is to stop the OSDs
> and get leveldb to compact without any ongoing writes.
>
> I sent Sage and Joao a transaction dump of the growth yesterday. Sage
> looked, but the files are so large it is tough to get useful info.
>
> http://tracker.ceph.com/issues/4895
>
> I believe this issue has existed since 0.48.
>
> - Mike
>
> On 5/21/2013 8:16 AM, Sylvain Munaut wrote:
>> Hi,
>>
>>
>> I've just added some monitoring to the IO usage of mon (trying to
>> track down that growing mon issue), and I'm kind of surprised by the
>> amount of IO generated by the monitor process.
>>
>> I get continuous 4 MB/s / 75 iops with added big spikes at each
>> compaction every 3 min or so.
>>
>> Is there a description somewhere of what the monitor does exactly? I
>> mean the monmap / pgmap / osdmap / mdsmap / election epoch don't
>> change that often (pgmap is like 1 per second and that's the fastest
>> change by several orders of magnitude). So what exactly does the
>> monitor do with all that IO???
>>
>>
>> Cheers,
>>
>> Sylvain

* Re: [ceph-users] mon IO usage
From: Sylvain Munaut @ 2013-05-21 15:52 UTC
To: Mike Dawson; +Cc: ceph-users, ceph-devel@vger.kernel.org

So, AFAICT, the bulk of the write would be writing out the pgmap to
disk every second or so.

Is it really needed to write it in full? It doesn't change all that
much AFAICT, so writing incremental changes with only periodic flush
might be a better option?

Cheers,

Sylvain

* Re: [ceph-users] mon IO usage
From: Gregory Farnum @ 2013-05-21 15:56 UTC
To: Sylvain Munaut
Cc: Mike Dawson, ceph-users@lists.ceph.com, ceph-devel@vger.kernel.org

On Tue, May 21, 2013 at 8:52 AM, Sylvain Munaut
<s.munaut@whatever-company.com> wrote:
> So, AFAICT, the bulk of the write would be writing out the pgmap to
> disk every second or so.
>
> Is it really needed to write it in full? It doesn't change all that
> much AFAICT, so writing incremental changes with only periodic flush
> might be a better option?

Yeah; this is definitely in our heads as work we'd like to get done.
LevelDB is costing us more throughput than we were expecting so people
are running into trouble much earlier than we expected.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

* Re: [ceph-users] mon IO usage
From: Sage Weil @ 2013-05-21 15:57 UTC
To: Sylvain Munaut; +Cc: Mike Dawson, ceph-users, ceph-devel@vger.kernel.org

On Tue, 21 May 2013, Sylvain Munaut wrote:
> So, AFAICT, the bulk of the write would be writing out the pgmap to
> disk every second or so.

It should be writing out the full map only every N commits... see 'paxos
stash full interval', which defaults to 25.

> Is it really needed to write it in full? It doesn't change all that
> much AFAICT, so writing incremental changes with only periodic flush
> might be a better option?

Right. It works this way now only because we haven't fully transitioned
from the old scheme. The next step is to store the PGMap over lots of
leveldb keys (one per pg) so that there is no big encode/decode of the
entire PGMap structure...

sage

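To make the effect of the stash interval concrete, here is a minimal Python sketch that estimates the bytes written when a full map is stored only every N commits and a small incremental is stored otherwise. The function name, the byte sizes, and the cost model are assumptions made for illustration and are not taken from the Ceph source; only the option name 'paxos stash full interval' and its default of 25 come from the message above.

```python
def bytes_written(num_commits, full_size, inc_size, stash_interval=25):
    """Estimate bytes written over num_commits commits.

    Assumed model (not Ceph's actual code): every commit writes one
    incremental of inc_size bytes, and every stash_interval-th commit
    additionally writes a full map of full_size bytes.
    """
    if stash_interval <= 0:
        raise ValueError("stash_interval must be positive")
    full_writes = num_commits // stash_interval
    return num_commits * inc_size + full_writes * full_size


if __name__ == "__main__":
    # Hypothetical sizes: a 4 MB full map and 20 kB incrementals.
    full, inc = 4_000_000, 20_000
    every_commit = bytes_written(100, full, inc, stash_interval=1)
    every_25 = bytes_written(100, full, inc, stash_interval=25)
    print("full map on every commit:", every_commit, "bytes")
    print("full map every 25 commits:", every_25, "bytes")
```

Under this model the full-map writes dominate whenever the interval is small, so an IO rate that tracks "map size times commit rate" would suggest the interval is not being honoured.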
* Re: [ceph-users] mon IO usage
From: Sylvain Munaut @ 2013-05-21 16:05 UTC
To: Sage Weil; +Cc: Mike Dawson, ceph-users, ceph-devel@vger.kernel.org

Hi,

>> So, AFAICT, the bulk of the write would be writing out the pgmap to
>> disk every second or so.
>
> It should be writing out the full map only every N commits... see 'paxos
> stash full interval', which defaults to 25.

But doesn't it also write it in full when there is a new pgmap?

I have a new one about every second and its size * period seemed to
match the IO rate pretty well, which is why I thought it was the reason
for the IO.

>> Is it really needed to write it in full? It doesn't change all that
>> much AFAICT, so writing incremental changes with only periodic flush
>> might be a better option?
>
> Right. It works this way now only because we haven't fully transitioned
> from the old scheme. The next step is to store the PGMap over lots of
> leveldb keys (one per pg) so that there is no big encode/decode of the
> entire PGMap structure...

Makes sense. I'm not sure of the "per-key" overhead of leveldb though,
in the case where there are lots (> 10k) of PGs.

Cheers,

Sylvain

* Re: mon IO usage
From: Sage Weil @ 2013-05-21 16:10 UTC
To: Sylvain Munaut
Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, Mike Dawson, ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Tue, 21 May 2013, Sylvain Munaut wrote:
> Hi,
>
> >> So, AFAICT, the bulk of the write would be writing out the pgmap to
> >> disk every second or so.
> >
> > It should be writing out the full map only every N commits... see 'paxos
> > stash full interval', which defaults to 25.
>
> But doesn't it also write it in full when there is a new pgmap?
>
> I have a new one about every second and its size * period seemed to
> match the IO rate pretty well, which is why I thought it was the reason
> for the IO.

Hmm. Can you generate a log with 'debug mon = 20', 'debug paxos = 20',
'debug ms = 1' for a few minutes over which you see a high data rate and
send it my way? It sounds like there is something wrong with the
stash_full logic. Thanks!

> >> Is it really needed to write it in full? It doesn't change all that
> >> much AFAICT, so writing incremental changes with only periodic flush
> >> might be a better option?
> >
> > Right. It works this way now only because we haven't fully transitioned
> > from the old scheme. The next step is to store the PGMap over lots of
> > leveldb keys (one per pg) so that there is no big encode/decode of the
> > entire PGMap structure...
>
> Makes sense. I'm not sure of the "per-key" overhead of leveldb though,
> in the case where there are lots (> 10k) of PGs.

Yeah, it will be larger on-disk, but the io rate will at least be
proportional to the update rate. :)

sage

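The trade-off discussed above (one large encoded map versus one key per PG) can be sketched in Python as follows. This is only an illustration under stated assumptions: a plain dict stands in for leveldb, JSON stands in for Ceph's own encoding, and the key layout "pgmap/<pgid>", the function names, and the PG records are all hypothetical.

```python
import json


def write_monolithic(store, pgmap):
    """Re-encode and store the whole map under one key; return bytes written."""
    blob = json.dumps(pgmap, sort_keys=True).encode()
    store["pgmap"] = blob
    return len(blob)


def write_per_pg(store, pgmap, changed_pgids):
    """Store only the changed PGs, one key per PG; return bytes written."""
    written = 0
    for pgid in changed_pgids:
        key = f"pgmap/{pgid}"  # hypothetical key layout
        value = json.dumps(pgmap[pgid], sort_keys=True).encode()
        store[key] = value
        written += len(key) + len(value)
    return written


if __name__ == "__main__":
    # Hypothetical map of 10,000 PGs with a tiny stats record each.
    pgmap = {
        f"1.{i:x}": {"state": "active+clean", "objects": i}
        for i in range(10_000)
    }
    pgmap["1.a"]["objects"] += 1  # a single PG changes
    print("monolithic write:", write_monolithic({}, pgmap), "bytes")
    print("per-PG write:", write_per_pg({}, pgmap, ["1.a"]), "bytes")
```

In this toy model the per-PG layout pays a key per entry, so the total stored size is larger, but each update writes an amount proportional to the number of PGs that changed rather than to the size of the whole map.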
* Re: [ceph-users] mon IO usage
From: Sylvain Munaut @ 2013-05-21 18:51 UTC
To: Sage Weil; +Cc: Mike Dawson, ceph-users, ceph-devel@vger.kernel.org

Hi,

> Hmm. Can you generate a log with 'debug mon = 20', 'debug paxos = 20',
> 'debug ms = 1' for a few minutes over which you see a high data rate and
> send it my way? It sounds like there is something wrong with the
> stash_full logic.

Mm, actually I may have been fooled by the instrumentation ... it does
a 30 sec average, so when looking closer I don't have 4 MB/s constantly,
it's more like 50 MB every 15/20 sec as a burst.

In any case, that seems like a lot of data being written.

Logs can be downloaded from http://ge.tt/9MOeKHh/v/0

Cheers,

Sylvain

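The way a 30-second average can hide bursts can be reproduced with a short Python sketch. The burst size and periods are the rough figures from the message above (about 50 megabytes every 15 to 20 seconds); the function names and the one-sample-per-second trace are assumptions made for illustration.

```python
def windowed_average(samples, window):
    """Average per-second samples over consecutive windows of `window` seconds."""
    averages = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages


def bursty_trace(duration, burst_mb, period):
    """Per-second megabytes written: one burst of burst_mb every `period` seconds."""
    return [burst_mb if t % period == 0 else 0 for t in range(duration)]


if __name__ == "__main__":
    # Roughly 50 MB bursts every 15 or 20 seconds, viewed through a
    # 30-second averaging window over one minute of samples.
    for period in (15, 20):
        trace = bursty_trace(60, 50, period)
        print("period", period, "s ->", windowed_average(trace, 30), "MB/s")
```

A 50 MB burst every 15 seconds averages to about 3.3 MB/s and every 20 seconds to 2.5 MB/s, so a smoothed graph showing a few MB/s of apparently continuous writes is consistent with the bursty pattern.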