* When ceph synchronizes journal to disk?
@ 2013-03-03 12:36 Xing Lin
2013-03-04 16:32 ` Sage Weil
2013-03-04 16:55 ` Gregory Farnum
0 siblings, 2 replies; 12+ messages in thread
From: Xing Lin @ 2013-03-03 12:36 UTC (permalink / raw)
To: ceph-devel@vger.kernel.org
Hi,
There were some discussions about this before on the mailing list, but I
am still confused about it. I thought Ceph would flush data from the
journal to disk either when the journal is full or when the
synchronization interval expires. In my test setup, I used 24 OSDs (one
OSD for each disk) and a 10 GB tmpfs file as the journal for each OSD.
For testing, I deliberately delayed the synchronization between the
journal and the disk: I increased 'filestore min sync interval' to
60 s and 'filestore max sync interval' to 300 s. I then created an RBD
image and ran a 4M sequential write workload with fio for 30 seconds.
I expected no I/O to reach the disks until we had filled
240 GB of journal space (10 GB * 24). However, 'iostat' showed data
being written to the disks (at about 20 MB/s per disk) right
after I started the sequential workload. Could someone help explain
this? Thanks,
I am running 0.48.2. The related configuration is as follows.
-----------------
[osd]
osd journal size = 10000
osd journal = /dev/shm/journal/$name-journal
journal dio = false
filestore xattr use omap = true
# The minimum and maximum intervals in seconds for synchronizing the filestore.
filestore min sync interval = 60
filestore max sync interval = 300
-------------
Xing
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: When ceph synchronizes journal to disk?
2013-03-03 12:36 When ceph synchronizes journal to disk? Xing Lin
@ 2013-03-04 16:32 ` Sage Weil
2013-03-05 4:08 ` Xing Lin
2013-03-04 16:55 ` Gregory Farnum
1 sibling, 1 reply; 12+ messages in thread
From: Sage Weil @ 2013-03-04 16:32 UTC (permalink / raw)
To: Xing Lin; +Cc: ceph-devel@vger.kernel.org
On Sun, 3 Mar 2013, Xing Lin wrote:
> Hi,
>
> There were some discussions about this before on the mailing list, but I am
> still confused about it. I thought Ceph would flush data from the journal to
> disk either when the journal is full or when the synchronization interval
> expires. In my test setup, I used 24 OSDs (one OSD for each disk) and a 10 GB
> tmpfs file as the journal for each OSD. For testing, I deliberately delayed
> the synchronization between the journal and the disk: I increased
> 'filestore min sync interval' to 60 s and 'filestore max sync interval' to
> 300 s. I then created an RBD image and ran a 4M sequential write workload
> with fio for 30 seconds. I expected no I/O to reach the disks until we had
> filled 240 GB of journal space (10 GB * 24). However, 'iostat' showed data
> being written to the disks (at about 20 MB/s per disk) right after I started
> the sequential workload. Could someone help explain this?

Are you using btrfs? In that case, the journaling is parallel to the fs
workload because the btrfs snapshots provide us with a stable checkpoint
we can replay from. In contrast, for non-btrfs file systems we need to do
writeahead journaling.

sage

> Thanks,
>
> I am running 0.48.2. The related configuration is as follows.
> -----------------
> [osd]
> osd journal size = 10000
> osd journal = /dev/shm/journal/$name-journal
> journal dio = false
> filestore xattr use omap = true
>
> # The minimum and maximum intervals in seconds for synchronizing the filestore.
> filestore min sync interval = 60
> filestore max sync interval = 300
> -------------
>
> Xing
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: When ceph synchronizes journal to disk?
2013-03-04 16:32 ` Sage Weil
@ 2013-03-05 4:08 ` Xing Lin
0 siblings, 0 replies; 12+ messages in thread
From: Xing Lin @ 2013-03-05 4:08 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel@vger.kernel.org
No, I am using xfs. The same thing happens even if I specify the journal
mode explicitly as follows.

filestore journal parallel = false
filestore journal writeahead = true

Xing

On 03/04/2013 09:32 AM, Sage Weil wrote:
> Are you using btrfs? In that case, the journaling is parallel to the fs
> workload because the btrfs snapshots provide us with a stable checkpoint
> we can replay from. In contrast, for non-btrfs file systems we need to do
> writeahead journaling.

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: When ceph synchronizes journal to disk?
2013-03-03 12:36 When ceph synchronizes journal to disk? Xing Lin
2013-03-04 16:32 ` Sage Weil
@ 2013-03-04 16:55 ` Gregory Farnum
2013-03-05 4:33 ` Xing Lin
2013-03-05 4:47 ` Xing Lin
1 sibling, 2 replies; 12+ messages in thread
From: Gregory Farnum @ 2013-03-04 16:55 UTC (permalink / raw)
To: Xing Lin; +Cc: ceph-devel@vger.kernel.org
On Sun, Mar 3, 2013 at 4:36 AM, Xing Lin <xinglin@cs.utah.edu> wrote:
> Hi,
>
> There were some discussions about this before on the mailing list, but I am
> still confused about it. I thought Ceph would flush data from the journal to
> disk either when the journal is full or when the synchronization interval
> expires. In my test setup, I used 24 OSDs (one OSD for each disk) and a 10 GB
> tmpfs file as the journal for each OSD. For testing, I deliberately delayed
> the synchronization between the journal and the disk: I increased
> 'filestore min sync interval' to 60 s and 'filestore max sync interval' to
> 300 s. I then created an RBD image and ran a 4M sequential write workload
> with fio for 30 seconds. I expected no I/O to reach the disks until we had
> filled 240 GB of journal space (10 GB * 24). However, 'iostat' showed data
> being written to the disks (at about 20 MB/s per disk) right after I started
> the sequential workload. Could someone help explain this?

The "filestore [min|max] sync interval" values specify how frequently
the OSD's "FileStore" sends a sync to the disk. However, data is still
written into the normal filesystem as it comes in, and the normal
filesystem continues to schedule normal dirty data writeouts. This is
good: it means that when we do send a sync down you don't need to
wait for all (30 seconds * 100MB/s) 3GB or whatever of data to go to
disk before it's completed.

> I am running 0.48.2. The related configuration is as follows.

If you're starting up a new cluster I recommend upgrading to the
bobtail series (0.56.3) instead of using Argonaut; it's got a number
of enhancements you'll appreciate!
-Greg

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: When ceph synchronizes journal to disk?
2013-03-04 16:55 ` Gregory Farnum
@ 2013-03-05 4:33 ` Xing Lin
2013-03-05 8:37 ` When ceph synchronizes journal to disk? / read request Dieter Kasper
` (2 more replies)
2013-03-05 4:47 ` Xing Lin
1 sibling, 3 replies; 12+ messages in thread
From: Xing Lin @ 2013-03-05 4:33 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org
Hi Gregory,

Thanks for your reply.

On 03/04/2013 09:55 AM, Gregory Farnum wrote:
> The "filestore [min|max] sync interval" values specify how frequently
> the OSD's "FileStore" sends a sync to the disk. However, data is still
> written into the normal filesystem as it comes in, and the normal
> filesystem continues to schedule normal dirty data writeouts. This is
> good: it means that when we do send a sync down you don't need to
> wait for all (30 seconds * 100MB/s) 3GB or whatever of data to go to
> disk before it's completed.

I do not think I understand this well. When the writeahead journal mode
is in use, would you please explain what happens to a single 4M write
request? I assume that an entry in the journal is created for this
write request, and after this entry is flushed to the journal disk, Ceph
returns success. There should be no I/O to the OSD's disk; all I/O is
supposed to go to the journal disk. At a later time, Ceph starts to
apply these changes to the normal filesystem, reading from the first
entry at which its previous synchronization stopped. Eventually, it reads
this entry and applies the write to the normal filesystem. Could
you please point out where my understanding is wrong? Thanks,

>> I am running 0.48.2. The related configuration is as follows.
> If you're starting up a new cluster I recommend upgrading to the
> bobtail series (0.56.3) instead of using Argonaut; it's got a number
> of enhancements you'll appreciate!

Yeah, I would like to use the bobtail series. However, I started making
small changes to Argonaut (0.48) and ported my changes once to
0.48.2 when it was released. I think I am fine continuing with it for
the moment. I may consider porting my changes to the bobtail series at a
later time. Thanks,

Xing

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: When ceph synchronizes journal to disk? / read request
2013-03-05 4:33 ` Xing Lin
@ 2013-03-05 8:37 ` Dieter Kasper
2013-03-05 20:13 ` Greg Farnum
2013-03-05 13:54 ` When ceph synchronizes journal to disk? Wido den Hollander
2013-03-05 14:27 ` Ugis
2 siblings, 1 reply; 12+ messages in thread
From: Dieter Kasper @ 2013-03-05 8:37 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Xing Lin, ceph-devel@vger.kernel.org, Dieter Kasper (KD)
Hi Gregory,

another interesting aspect for me is:
How will a read request for this block/sub-block (pending between journal and OSD)
be satisfied (assuming the client will not cache)?
Will this read go to the journal or to the OSD?

Best Regards,
-Dieter

On Tue, Mar 05, 2013 at 05:33:13AM +0100, Xing Lin wrote:
> Hi Gregory,
>
> Thanks for your reply.
>
> On 03/04/2013 09:55 AM, Gregory Farnum wrote:
> > The "filestore [min|max] sync interval" values specify how frequently
> > the OSD's "FileStore" sends a sync to the disk. However, data is still
> > written into the normal filesystem as it comes in, and the normal
> > filesystem continues to schedule normal dirty data writeouts. This is
> > good: it means that when we do send a sync down you don't need to
> > wait for all (30 seconds * 100MB/s) 3GB or whatever of data to go to
> > disk before it's completed.
>
> I do not think I understand this well. When the writeahead journal mode
> is in use, would you please explain what happens to a single 4M write
> request? I assume that an entry in the journal is created for this
> write request, and after this entry is flushed to the journal disk, Ceph
> returns success. There should be no I/O to the OSD's disk; all I/O is
> supposed to go to the journal disk. At a later time, Ceph starts to
> apply these changes to the normal filesystem, reading from the first
> entry at which its previous synchronization stopped. Eventually, it reads
> this entry and applies the write to the normal filesystem. Could
> you please point out where my understanding is wrong? Thanks,
>
> >> I am running 0.48.2. The related configuration is as follows.
> > If you're starting up a new cluster I recommend upgrading to the
> > bobtail series (0.56.3) instead of using Argonaut; it's got a number
> > of enhancements you'll appreciate!
>
> Yeah, I would like to use the bobtail series. However, I started making
> small changes to Argonaut (0.48) and ported my changes once to
> 0.48.2 when it was released. I think I am fine continuing with it for
> the moment. I may consider porting my changes to the bobtail series at a
> later time. Thanks,
>
> Xing

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: When ceph synchronizes journal to disk? / read request
2013-03-05 8:37 ` When ceph synchronizes journal to disk? / read request Dieter Kasper
@ 2013-03-05 20:13 ` Greg Farnum
0 siblings, 0 replies; 12+ messages in thread
From: Greg Farnum @ 2013-03-05 20:13 UTC (permalink / raw)
To: Dieter Kasper; +Cc: Xing Lin, ceph-devel@vger.kernel.org
On Tuesday, March 5, 2013 at 12:37 AM, Dieter Kasper wrote:
> Hi Gregory,
>
> another interesting aspect for me is:
> How will a read request for this block/sub-block (pending between journal and OSD)
> be satisfied (assuming the client will not cache)?
> Will this read go to the journal or to the OSD?

All read requests are satisfied from the main OSD store filesystem.
Satisfying reads from the journal would be extraordinarily complicated
and not buy us anything that I can think of. (In fact the journal is
only read during recovery).
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com

^ permalink raw reply [flat|nested] 12+ messages in thread
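The point that the journal is only read during recovery can be illustrated with a generic write-ahead-log replay. This is a minimal sketch and not Ceph's implementation: `JournalEntry`, `replay`, `read`, and the `committed_seq` marker are hypothetical names, and the in-memory `bytearray` merely stands in for the OSD's backing filesystem.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class JournalEntry:
    seq: int      # monotonically increasing sequence number (assumed)
    offset: int   # byte offset the write applies to
    data: bytes   # payload of the write


def replay(journal: Iterable[JournalEntry], committed_seq: int,
           store: bytearray) -> int:
    """Re-apply every entry newer than the last sequence known to be synced.

    Only recovery calls this; normal reads never consult the journal.
    Returns the number of entries that were replayed.
    """
    replayed = 0
    for entry in journal:
        if entry.seq <= committed_seq:
            continue  # already durable in the main store
        end = entry.offset + len(entry.data)
        if len(store) < end:
            store.extend(b"\0" * (end - len(store)))  # grow the store
        store[entry.offset:end] = entry.data
        replayed += 1
    return replayed


def read(store: bytearray, offset: int, length: int) -> bytes:
    """Reads are served from the main store only."""
    return bytes(store[offset:offset + length])
```

In this model a crash loses at most the writes that the filesystem had not yet flushed, and replaying the entries after `committed_seq` restores them.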
* Re: When ceph synchronizes journal to disk?
2013-03-05 4:33 ` Xing Lin
2013-03-05 8:37 ` When ceph synchronizes journal to disk? / read request Dieter Kasper
@ 2013-03-05 13:54 ` Wido den Hollander
2013-03-05 20:12 ` Greg Farnum
2013-03-05 14:27 ` Ugis
2 siblings, 1 reply; 12+ messages in thread
From: Wido den Hollander @ 2013-03-05 13:54 UTC (permalink / raw)
To: Xing Lin; +Cc: Gregory Farnum, ceph-devel@vger.kernel.org
On 03/05/2013 05:33 AM, Xing Lin wrote:
> Hi Gregory,
>
> Thanks for your reply.
>
> On 03/04/2013 09:55 AM, Gregory Farnum wrote:
>> The "filestore [min|max] sync interval" values specify how frequently
>> the OSD's "FileStore" sends a sync to the disk. However, data is still
>> written into the normal filesystem as it comes in, and the normal
>> filesystem continues to schedule normal dirty data writeouts. This is
>> good: it means that when we do send a sync down you don't need to
>> wait for all (30 seconds * 100MB/s) 3GB or whatever of data to go to
>> disk before it's completed.
>
> I do not think I understand this well. When the writeahead journal mode
> is in use, would you please explain what happens to a single 4M write
> request? I assume that an entry in the journal is created for this
> write request, and after this entry is flushed to the journal disk, Ceph
> returns success. There should be no I/O to the OSD's disk; all I/O is
> supposed to go to the journal disk. At a later time, Ceph starts to
> apply these changes to the normal filesystem, reading from the first
> entry at which its previous synchronization stopped. Eventually, it reads
> this entry and applies the write to the normal filesystem. Could
> you please point out where my understanding is wrong? Thanks,
>

All the data goes to the disk in write-back mode, so it isn't safe
until the flush is called. That's why it goes into the journal first: to
be consistent at all times.

If you buffered everything in the journal and flushed it all at once, you
would overload the disk for that time.

Let's say you have 300MB in the journal after 10 seconds and you want to
flush that at once. That would mean that specific disk is unable to do
any other operations than writing at 60MB/sec for 5 seconds.

It's better to always write in write-back mode to the disk and flush at
a certain point.

In the meantime the scheduler can do its job to balance between the
reads and the writes.

Wido

>>> I am running 0.48.2. The related configuration is as follows.
>> If you're starting up a new cluster I recommend upgrading to the
>> bobtail series (0.56.3) instead of using Argonaut; it's got a number
>> of enhancements you'll appreciate!
>
> Yeah, I would like to use the bobtail series. However, I started making
> small changes to Argonaut (0.48) and ported my changes once to
> 0.48.2 when it was released. I think I am fine continuing with it for
> the moment. I may consider porting my changes to the bobtail series at a
> later time. Thanks,
>
> Xing

--
Wido den Hollander
42on B.V.
Phone: +31 (0)20 700 9902
Skype: contact42on

^ permalink raw reply [flat|nested] 12+ messages in thread
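The arithmetic behind the two figures quoted in this thread (a 5 second stall for a 300MB backlog, and roughly 3GB accumulating over 30 seconds) can be checked with a short calculation. This is only a sketch; the function names `buffered_mb` and `flush_stall_s` are made up for illustration, and the throughput numbers are the ones given in the messages, not measurements.

```python
def buffered_mb(ingest_mb_per_s: float, interval_s: float) -> float:
    """Data that piles up if nothing reaches the disk during the interval."""
    return ingest_mb_per_s * interval_s


def flush_stall_s(backlog_mb: float, disk_mb_per_s: float) -> float:
    """Time the disk spends writing the backlog and doing nothing else."""
    return backlog_mb / disk_mb_per_s


# Wido's example: 300 MB buffered, disk sustains 60 MB/s.
print(flush_stall_s(300, 60))        # 5.0 (seconds)

# Greg's example: 30 seconds of writes arriving at 100 MB/s.
print(buffered_mb(100, 30) / 1000)   # 3.0 (GB)
```

The longer the data is held back, the larger the burst the disk has to absorb at sync time, which is why continuous write-back plus a periodic flush is preferred here.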
* Re: When ceph synchronizes journal to disk?
2013-03-05 13:54 ` When ceph synchronizes journal to disk? Wido den Hollander
@ 2013-03-05 20:12 ` Greg Farnum
2013-03-06 1:50 ` Xing Lin
0 siblings, 1 reply; 12+ messages in thread
From: Greg Farnum @ 2013-03-05 20:12 UTC (permalink / raw)
To: Wido den Hollander; +Cc: Xing Lin, ceph-devel@vger.kernel.org
On Tuesday, March 5, 2013 at 5:54 AM, Wido den Hollander wrote:
> On 03/05/2013 05:33 AM, Xing Lin wrote:
> > Hi Gregory,
> >
> > Thanks for your reply.
> >
> > On 03/04/2013 09:55 AM, Gregory Farnum wrote:
> > > The "filestore [min|max] sync interval" values specify how frequently
> > > the OSD's "FileStore" sends a sync to the disk. However, data is still
> > > written into the normal filesystem as it comes in, and the normal
> > > filesystem continues to schedule normal dirty data writeouts. This is
> > > good: it means that when we do send a sync down you don't need to
> > > wait for all (30 seconds * 100MB/s) 3GB or whatever of data to go to
> > > disk before it's completed.
> >
> > I do not think I understand this well. When the writeahead journal mode
> > is in use, would you please explain what happens to a single 4M write
> > request? I assume that an entry in the journal is created for this
> > write request, and after this entry is flushed to the journal disk, Ceph
> > returns success. There should be no I/O to the OSD's disk; all I/O is
> > supposed to go to the journal disk. At a later time, Ceph starts to
> > apply these changes to the normal filesystem, reading from the first
> > entry at which its previous synchronization stopped. Eventually, it reads
> > this entry and applies the write to the normal filesystem. Could
> > you please point out where my understanding is wrong? Thanks,
>
> All the data goes to the disk in write-back mode, so it isn't safe
> until the flush is called. That's why it goes into the journal first: to
> be consistent at all times.
>
> If you buffered everything in the journal and flushed it all at once, you
> would overload the disk for that time.
>
> Let's say you have 300MB in the journal after 10 seconds and you want to
> flush that at once. That would mean that specific disk is unable to do
> any other operations than writing at 60MB/sec for 5 seconds.
>
> It's better to always write in write-back mode to the disk and flush at
> a certain point.
>
> In the meantime the scheduler can do its job to balance between the
> reads and the writes.
>
> Wido

Yep, what Wido said. Specifically, we do force the data to the journal
with an fsync or equivalent before responding to the client, but once
it's stable on the journal we give it to the filesystem (without doing
any sort of forced sync). This is necessary: all reads are served from
the filesystem.
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com

^ permalink raw reply [flat|nested] 12+ messages in thread
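The write path described in this message (journal entry forced to stable storage, then a buffered write to the filesystem with no forced sync, then a periodic sync) can be modelled in a few lines. This is a toy sketch under stated assumptions, not Ceph code: `TinyWriteAheadStore`, its record layout, and `demo` are hypothetical, and it relies on the POSIX-only calls `os.pwrite` and `os.pread`.

```python
import os
import tempfile


class TinyWriteAheadStore:
    """Toy model of a writeahead journal in front of a data file."""

    def __init__(self, directory: str) -> None:
        self.journal_fd = os.open(
            os.path.join(directory, "journal"),
            os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.data_fd = os.open(
            os.path.join(directory, "data"), os.O_RDWR | os.O_CREAT, 0o600)
        self.unsynced_writes = 0

    def write(self, offset: int, data: bytes) -> None:
        # Step 1: append the record to the journal and force it to stable
        # storage. In the scheme described above, the client can be
        # acknowledged once this fsync returns.
        record = (offset.to_bytes(8, "little")
                  + len(data).to_bytes(8, "little") + data)
        os.write(self.journal_fd, record)
        os.fsync(self.journal_fd)
        # Step 2: hand the data to the filesystem right away as a buffered
        # write. There is no fsync here, so the kernel is free to schedule
        # the writeout, which is why disk I/O shows up immediately.
        os.pwrite(self.data_fd, data, offset)
        self.unsynced_writes += 1

    def read(self, offset: int, length: int) -> bytes:
        # Reads are served from the data file, never from the journal.
        return os.pread(self.data_fd, length, offset)

    def sync(self) -> int:
        # Periodic sync: only whatever is still dirty has to be flushed.
        os.fsync(self.data_fd)
        flushed, self.unsynced_writes = self.unsynced_writes, 0
        return flushed

    def close(self) -> None:
        os.close(self.journal_fd)
        os.close(self.data_fd)


def demo() -> bytes:
    """Write once, read it back from the data file, then sync."""
    with tempfile.TemporaryDirectory() as directory:
        store = TinyWriteAheadStore(directory)
        store.write(0, b"hello")
        result = store.read(0, 5)
        store.sync()
        store.close()
        return result
```

Because step 2 happens immediately, a longer sync interval in this model only delays the explicit `fsync` of the data file; it does not hold data back in the journal.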
* Re: When ceph synchronizes journal to disk?
2013-03-05 20:12 ` Greg Farnum
@ 2013-03-06 1:50 ` Xing Lin
0 siblings, 0 replies; 12+ messages in thread
From: Xing Lin @ 2013-03-06 1:50 UTC (permalink / raw)
To: Greg Farnum; +Cc: Wido den Hollander, ceph-devel@vger.kernel.org
Thanks very much for all your explanations. I am now much clearer about
it. Have a great day!

Xing

On 03/05/2013 01:12 PM, Greg Farnum wrote:
>> All the data goes to the disk in write-back mode, so it isn't safe
>> until the flush is called. That's why it goes into the journal first: to
>> be consistent at all times.
>>
>> If you buffered everything in the journal and flushed it all at once, you
>> would overload the disk for that time.
>>
>> Let's say you have 300MB in the journal after 10 seconds and you want to
>> flush that at once. That would mean that specific disk is unable to do
>> any other operations than writing at 60MB/sec for 5 seconds.
>>
>> It's better to always write in write-back mode to the disk and flush at
>> a certain point.
>>
>> In the meantime the scheduler can do its job to balance between the
>> reads and the writes.
>>
>> Wido
> Yep, what Wido said. Specifically, we do force the data to the journal
> with an fsync or equivalent before responding to the client, but once
> it's stable on the journal we give it to the filesystem (without doing
> any sort of forced sync). This is necessary: all reads are served from
> the filesystem.
> -Greg

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: When ceph synchronizes journal to disk?
2013-03-05 4:33 ` Xing Lin
2013-03-05 8:37 ` When ceph synchronizes journal to disk? / read request Dieter Kasper
2013-03-05 13:54 ` When ceph synchronizes journal to disk? Wido den Hollander
@ 2013-03-05 14:27 ` Ugis
2 siblings, 0 replies; 12+ messages in thread
From: Ugis @ 2013-03-05 14:27 UTC (permalink / raw)
To: Xing Lin; +Cc: ceph-devel@vger.kernel.org
> I do not think I understand this well. When the writeahead journal mode is
> in use, would you please explain what happens to a single 4M write request?
> I assume that an entry in the journal is created for this write request,
> and after this entry is flushed to the journal disk, Ceph returns success.
> There should be no I/O to the OSD's disk; all I/O is supposed to go to the
> journal disk. At a later time, Ceph starts to apply these changes to the
> normal filesystem, reading from the first entry at which its previous
> synchronization stopped. Eventually, it reads this entry and applies the
> write to the normal filesystem. Could you please point out where my
> understanding is wrong? Thanks,
>

If I understood correctly, you are probably expecting the journal to
behave like a cache. Journals are for integrity; caches are for I/O
speed. For block device caching, you could look at flashcache or bcache.

BR,
Ugis

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: When ceph synchronizes journal to disk?
2013-03-04 16:55 ` Gregory Farnum
2013-03-05 4:33 ` Xing Lin
@ 2013-03-05 4:47 ` Xing Lin
1 sibling, 0 replies; 12+ messages in thread
From: Xing Lin @ 2013-03-05 4:47 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org
Maybe it is easier to put it this way. What we want is for newly
written data to stay on the journal disk for as long as possible, so
that write workloads do not compete with read workloads for the disk
heads. Is there any way to achieve that in Ceph? Thanks,

Xing

On 03/04/2013 09:55 AM, Gregory Farnum wrote:
> The "filestore [min|max] sync interval" values specify how frequently
> the OSD's "FileStore" sends a sync to the disk. However, data is still
> written into the normal filesystem as it comes in, and the normal
> filesystem continues to schedule normal dirty data writeouts. This is
> good: it means that when we do send a sync down you don't need to
> wait for all (30 seconds * 100MB/s) 3GB or whatever of data to go to
> disk before it's completed.

^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2013-03-06 1:51 UTC | newest]
Thread overview: 12+ messages
2013-03-03 12:36 When ceph synchronizes journal to disk? Xing Lin
2013-03-04 16:32 ` Sage Weil
2013-03-05 4:08 ` Xing Lin
2013-03-04 16:55 ` Gregory Farnum
2013-03-05 4:33 ` Xing Lin
2013-03-05 8:37 ` When ceph synchronizes journal to disk? / read request Dieter Kasper
2013-03-05 20:13 ` Greg Farnum
2013-03-05 13:54 ` When ceph synchronizes journal to disk? Wido den Hollander
2013-03-05 20:12 ` Greg Farnum
2013-03-06 1:50 ` Xing Lin
2013-03-05 14:27 ` Ugis
2013-03-05 4:47 ` Xing Lin