* [Ocfs2-devel] [RFC] ocfs2: Remove j_trans_barrier @ 2010-10-20 6:08 Tao Ma 2010-11-25 10:08 ` Joel Becker 0 siblings, 1 reply; 7+ messages in thread From: Tao Ma @ 2010-10-20 6:08 UTC (permalink / raw) To: ocfs2-devel Hi all, j_trans_barrier in ocfs2 is used to protect some journal operations in ocfs2. So normally, it is used as belows: 1. In journal transaction. When we start a transaction, We will down_read it and j_num_trans will be increased accordingly(in case of a cluster environment). It will be up_read when we do ocfs2_commit_trans. 2. In ocfs2_commit_cache, we will down_write it and then call jbd2_journal_flush, increase j_trans_id, reset j_num_trans and finally call up_write. This function is used by thread ocfs2cmt. So in general, when we do journal flush, no new transaction will be started because of it. But it did hold off the system and caused a long delay for some file operations. I have met with a bug. http://oss.oracle.com/bugzilla/show_bug.cgi?id=1281 After 30 days of usage of ocfs2, the system becomes slower and slower(why the journal commit becomes so slower is still unknown and may be related to file system fragmentation) and a tiny open/truncate of a file will cause around 10-30 secs. I don't think it is endurable for a user. After putting some debug codes in the kernel(great thanks to the user), I find that it is the blocked by ocfs2_start_trans. The strace log shows: 22955 open("/usr/home/test_io_file_ow", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 1 <10.329676> And from the system log: Sep 24 17:28:30 192.168.0.4 kernel: (dd,22955,5):ocfs2_orphan_for_truncate:354 start transcation for inode 105572512 Sep 24 17:28:41 192.168.0.4 kernel: (dd,22955,5):ocfs2_orphan_for_truncate:362 journal access for inode 105572512 The code is like this: mlog(0, "start transcation for inode %llu\n", OCFS2_I(inode)->ip_blkno); handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS); if (IS_ERR(handle)) { status = PTR_ERR(handle); mlog_errno(status); goto out; } mlog(0, "journal access for inode %llu\n", OCFS2_I(inode)->ip_blkno); So we spent 11 secs in ocfs2_start_trans! From what I have investigated, j_trans_barrier is only used in a cluster environment(for a local mounted volume, ocfs2cmt isn't started and we depends on jbd2 to flush the journal). And it works with j_trans_id to make sure all the modifications to the specified ocfs2_caching_info are flushed(see ocfs2_ci_fully_checkpointed) when we downconvert a cluster lock. And we also call ocfs2_set_ci_lock_trans in journal_access so that we know the last trans_id for a specified ocfs2_caching_info. My solution is that: 1. remove j_trans_barrier 2. Add a flag ci_checkpointing in ocfs2_caching_info: 1) When we find this caching_info needs checkpoint, set this flag and start the checkpointing(in ocfs2_ci_checkpointed). And the downconvert request will be requeued so that we can check and clear this flag next time it is handled. 2) Clear the flag when there is no need for checkpointing this ci(also in ocfs2_ci_checkpointed) during check_downconvert. 3. make sure when we journal_access some blocks, the caching_info can't be in the state of checkpointing. I think if we are checkpointing an caching_info, we shouldn't be able to journal_access it since it is just required to downconvert and we shouldn't have the lock now? So perhaps a BUG_ON should work? So above is the scenario and my solution. Any comments are welcomed. Regards, Tao ^ permalink raw reply [flat|nested] 7+ messages in thread
* [Ocfs2-devel] [RFC] ocfs2: Remove j_trans_barrier 2010-10-20 6:08 [Ocfs2-devel] [RFC] ocfs2: Remove j_trans_barrier Tao Ma @ 2010-11-25 10:08 ` Joel Becker 2010-11-25 10:19 ` Joel Becker 2010-11-26 6:35 ` Tao Ma 0 siblings, 2 replies; 7+ messages in thread From: Joel Becker @ 2010-11-25 10:08 UTC (permalink / raw) To: ocfs2-devel On Wed, Oct 20, 2010 at 02:08:17PM +0800, Tao Ma wrote: > j_trans_barrier in ocfs2 is used to protect some journal operations > in ocfs2. So normally, it is used as belows: > 1. In journal transaction. When we start a transaction, We will > down_read it and j_num_trans will be increased accordingly(in case > of a cluster environment). It will be up_read when we do > ocfs2_commit_trans. > 2. In ocfs2_commit_cache, we will down_write it and then call > jbd2_journal_flush, increase j_trans_id, reset j_num_trans and > finally call up_write. This function is used by thread ocfs2cmt. <snip> slow filesystem... </snip> > My solution is that: > 1. remove j_trans_barrier > 2. Add a flag ci_checkpointing in ocfs2_caching_info: > 1) When we find this caching_info needs checkpoint, set this flag > and start the checkpointing(in ocfs2_ci_checkpointed). And the > downconvert request will be requeued so that we can check and clear > this flag next time it is handled. > 2) Clear the flag when there is no need for checkpointing this > ci(also in ocfs2_ci_checkpointed) during check_downconvert. > 3. make sure when we journal_access some blocks, the caching_info > can't be in the state of checkpointing. I think if we are > checkpointing an caching_info, we shouldn't be able to > journal_access it since it is just required to downconvert and we > shouldn't have the lock now? So perhaps a BUG_ON should work? Tao, I'm sorry I haven't responded sooner. This proposal didn't strike me as quite right, and I didn't have time to think about it. I have a couple of concerns. First, we don't always checkpoint from a downconvert. We do it in clear_inode() as well, when we are flushing an inode from cache. This may not have anything to do with the lock we're caring about, eg on other inodes. What I mean is, the caching info for the inode we care about may not be checkpointing, but the journal as a whole is. We need to stop all action while that is happening. Second, there is the flip side. How do we wait until all open transactions are complete before checkpointing? The down_write() in ocfs2_commit_cache() blocks until all open transactions up_read(). In your scheme, there is no care taken for open transactions against the journal. Remember, the journal is global to the node. Joel -- Life's Little Instruction Book #464 "Don't miss the magic of the moment by focusing on what's to come." Joel Becker Senior Development Manager Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 7+ messages in thread
* [Ocfs2-devel] [RFC] ocfs2: Remove j_trans_barrier 2010-11-25 10:08 ` Joel Becker @ 2010-11-25 10:19 ` Joel Becker 2010-11-26 6:35 ` Tao Ma 1 sibling, 0 replies; 7+ messages in thread From: Joel Becker @ 2010-11-25 10:19 UTC (permalink / raw) To: ocfs2-devel On Thu, Nov 25, 2010 at 02:08:22AM -0800, Joel Becker wrote: > Second, there is the flip side. How do we wait until all open > transactions are complete before checkpointing? The down_write() in > ocfs2_commit_cache() blocks until all open transactions up_read(). In > your scheme, there is no care taken for open transactions against the > journal. Remember, the journal is global to the node. Hmm. I wonder if we can allow transactions as soon as we kick off the journal? Basically, right now, we do the following: 1) down_write(trans_barrier) - Wait for all open transactions - Block all new transactions 2) jbd2_journal_flush() - Write out the journal - Wait on the journal flush 3) up_write(trans_barrier) - Unblock new transactions We absolutely need to wait for open transactions before starting the flush. Otherwise, we may not have the transaction we need for a downconvert closed. But do we need to block new transactions once the journal flush is going? Like, we could up_write() our transaction barrier after calling journal_lock_updates(). Would that work? Would it help? Joel -- "The cynics are right nine times out of ten." - H. L. Mencken Joel Becker Senior Development Manager Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 7+ messages in thread
* [Ocfs2-devel] [RFC] ocfs2: Remove j_trans_barrier 2010-11-25 10:08 ` Joel Becker 2010-11-25 10:19 ` Joel Becker @ 2010-11-26 6:35 ` Tao Ma 2010-12-07 0:45 ` Joel Becker 2010-12-07 1:13 ` Mark Fasheh 1 sibling, 2 replies; 7+ messages in thread From: Tao Ma @ 2010-11-26 6:35 UTC (permalink / raw) To: ocfs2-devel Hi Joel, On 11/25/2010 06:08 PM, Joel Becker wrote: > On Wed, Oct 20, 2010 at 02:08:17PM +0800, Tao Ma wrote: >> j_trans_barrier in ocfs2 is used to protect some journal operations >> in ocfs2. So normally, it is used as belows: >> 1. In journal transaction. When we start a transaction, We will >> down_read it and j_num_trans will be increased accordingly(in case >> of a cluster environment). It will be up_read when we do >> ocfs2_commit_trans. >> 2. In ocfs2_commit_cache, we will down_write it and then call >> jbd2_journal_flush, increase j_trans_id, reset j_num_trans and >> finally call up_write. This function is used by thread ocfs2cmt. > > <snip> slow filesystem...</snip> > >> My solution is that: >> 1. remove j_trans_barrier >> 2. Add a flag ci_checkpointing in ocfs2_caching_info: >> 1) When we find this caching_info needs checkpoint, set this flag >> and start the checkpointing(in ocfs2_ci_checkpointed). And the >> downconvert request will be requeued so that we can check and clear >> this flag next time it is handled. >> 2) Clear the flag when there is no need for checkpointing this >> ci(also in ocfs2_ci_checkpointed) during check_downconvert. >> 3. make sure when we journal_access some blocks, the caching_info >> can't be in the state of checkpointing. I think if we are >> checkpointing an caching_info, we shouldn't be able to >> journal_access it since it is just required to downconvert and we >> shouldn't have the lock now? So perhaps a BUG_ON should work? > > Tao, > I'm sorry I haven't responded sooner. This proposal didn't > strike me as quite right, and I didn't have time to think about it. > I have a couple of concerns. Never mind. I knew you had a lot of stuff to handle with these days. > First, we don't always checkpoint from a downconvert. We do it > in clear_inode() as well, when we are flushing an inode from cache. > This may not have anything to do with the lock we're caring about, eg on > other inodes. What I mean is, the caching info for the inode we care > about may not be checkpointing, but the journal as a whole is. We need > to stop all action while that is happening. Sorry I don't get your last sentense. Could you please describe it in detail? Yes, clear_inode does do checkpointing, but actually the whole thing is self-contained. In ocfs2_checkpoint_inode, it can checkpoint the inode by itself and has no relationship with downconvert. > Second, there is the flip side. How do we wait until all open > transactions are complete before checkpointing? The down_write() in > ocfs2_commit_cache() blocks until all open transactions up_read(). In > your scheme, there is no care taken for open transactions against the > journal. Remember, the journal is global to the node. yes, I was thinking of that too. But finally I got that we don't need to care for it. As we have agreed above, there are 2 places we do checkpoint for an inode. As for clear_inode, we don't care since it is going to be cleared and no transaction could be opened against that. Another is downconvert, in which case we shouldn't be able to open a transaction and access that caching_info(we should always get the cluster lock before we do access it). We can add a BUG_ON to journal_access which can facilitate us to find the case that we don't have the lock while accessing it. btw, I have some draft patch for it, I haven't tested it much these days. But if you are interested, I can send it to the mail list for more review. Regards, Tao ^ permalink raw reply [flat|nested] 7+ messages in thread
* [Ocfs2-devel] [RFC] ocfs2: Remove j_trans_barrier 2010-11-26 6:35 ` Tao Ma @ 2010-12-07 0:45 ` Joel Becker 2010-12-07 1:13 ` Mark Fasheh 1 sibling, 0 replies; 7+ messages in thread From: Joel Becker @ 2010-12-07 0:45 UTC (permalink / raw) To: ocfs2-devel On Fri, Nov 26, 2010 at 02:35:58PM +0800, Tao Ma wrote: > Hi Joel, > > On 11/25/2010 06:08 PM, Joel Becker wrote: > > First, we don't always checkpoint from a downconvert. We do it > >in clear_inode() as well, when we are flushing an inode from cache. > >This may not have anything to do with the lock we're caring about, eg on > >other inodes. What I mean is, the caching info for the inode we care > >about may not be checkpointing, but the journal as a whole is. We need > >to stop all action while that is happening. > Sorry I don't get your last sentense. Could you please describe it > in detail? Yes, clear_inode does do checkpointing, but actually the > whole thing is self-contained. In ocfs2_checkpoint_inode, it can > checkpoint the inode by itself and has no relationship with > downconvert. I mean that our transaction on the inode might affect other things (like system files) that are in flux, and they could have open stuff against them. Essentially, with JBD, I'm not sure we can trust the state unless we freeze it. I really wish Mark could comment here, but I think he's pretty busy. Joel -- "Always give your best, never get discouraged, never be petty; always remember, others may hate you. Those who hate you don't win unless you hate them. And then you destroy yourself." - Richard M. Nixon Joel Becker Senior Development Manager Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 7+ messages in thread
* [Ocfs2-devel] [RFC] ocfs2: Remove j_trans_barrier 2010-11-26 6:35 ` Tao Ma 2010-12-07 0:45 ` Joel Becker @ 2010-12-07 1:13 ` Mark Fasheh 2010-12-07 1:36 ` Tao Ma 1 sibling, 1 reply; 7+ messages in thread From: Mark Fasheh @ 2010-12-07 1:13 UTC (permalink / raw) To: ocfs2-devel On Fri, Nov 26, 2010 at 02:35:58PM +0800, Tao Ma wrote: > Hi Joel, > > On 11/25/2010 06:08 PM, Joel Becker wrote: > > On Wed, Oct 20, 2010 at 02:08:17PM +0800, Tao Ma wrote: > >> j_trans_barrier in ocfs2 is used to protect some journal operations > >> in ocfs2. So normally, it is used as belows: > >> 1. In journal transaction. When we start a transaction, We will > >> down_read it and j_num_trans will be increased accordingly(in case > >> of a cluster environment). It will be up_read when we do > >> ocfs2_commit_trans. > >> 2. In ocfs2_commit_cache, we will down_write it and then call > >> jbd2_journal_flush, increase j_trans_id, reset j_num_trans and > >> finally call up_write. This function is used by thread ocfs2cmt. > > > > <snip> slow filesystem...</snip> > > > >> My solution is that: > >> 1. remove j_trans_barrier > >> 2. Add a flag ci_checkpointing in ocfs2_caching_info: > >> 1) When we find this caching_info needs checkpoint, set this flag > >> and start the checkpointing(in ocfs2_ci_checkpointed). And the > >> downconvert request will be requeued so that we can check and clear > >> this flag next time it is handled. > >> 2) Clear the flag when there is no need for checkpointing this > >> ci(also in ocfs2_ci_checkpointed) during check_downconvert. > >> 3. make sure when we journal_access some blocks, the caching_info > >> can't be in the state of checkpointing. I think if we are > >> checkpointing an caching_info, we shouldn't be able to > >> journal_access it since it is just required to downconvert and we > >> shouldn't have the lock now? So perhaps a BUG_ON should work? A couple thoughts. - Journal-wise, the following code provides any barrier we need to ensure that a transaction can't be around when we're checkpointing: jbd2_journal_lock_updates(journal->j_journal); status = jbd2_journal_flush(journal->j_journal); jbd2_journal_unlock_updates(journal->j_journal); - Please be sure that we don't lose any improvement the check for zero transactions gives us. - Joel mentioned to me that you actually saw a performance improvement? That's interesting, I would have assumed that we wouldn't get any such improvement as the journal code would be blocking us anyway in all the places we use j_trans_barrier. > > > > Tao, > > I'm sorry I haven't responded sooner. This proposal didn't > > strike me as quite right, and I didn't have time to think about it. > > I have a couple of concerns. > Never mind. I knew you had a lot of stuff to handle with these days. > > First, we don't always checkpoint from a downconvert. We do it > > in clear_inode() as well, when we are flushing an inode from cache. > > This may not have anything to do with the lock we're caring about, eg on > > other inodes. What I mean is, the caching info for the inode we care > > about may not be checkpointing, but the journal as a whole is. We need > > to stop all action while that is happening. > Sorry I don't get your last sentense. Could you please describe it in > detail? Yes, clear_inode does do checkpointing, but actually the whole > thing is self-contained. In ocfs2_checkpoint_inode, it can checkpoint > the inode by itself and has no relationship with downconvert. > > Second, there is the flip side. How do we wait until all open > > transactions are complete before checkpointing? The down_write() in > > ocfs2_commit_cache() blocks until all open transactions up_read(). In > > your scheme, there is no care taken for open transactions against the > > journal. Remember, the journal is global to the node. > yes, I was thinking of that too. But finally I got that we don't need to > care for it. As we have agreed above, there are 2 places we do > checkpoint for an inode. As for clear_inode, we don't care since it is > going to be cleared and no transaction could be opened against that. > Another is downconvert, in which case we shouldn't be able to open a > transaction and access that caching_info(we should always get the > cluster lock before we do access it). We can add a BUG_ON to > journal_access which can facilitate us to find the case that we don't > have the lock while accessing it. > btw, I have some draft patch for it, I haven't tested it much these > days. But if you are interested, I can send it to the mail list for more > review. Post it, with some numbers showing the performance difference :) --Mark -- Mark Fasheh ^ permalink raw reply [flat|nested] 7+ messages in thread
* [Ocfs2-devel] [RFC] ocfs2: Remove j_trans_barrier 2010-12-07 1:13 ` Mark Fasheh @ 2010-12-07 1:36 ` Tao Ma 0 siblings, 0 replies; 7+ messages in thread From: Tao Ma @ 2010-12-07 1:36 UTC (permalink / raw) To: ocfs2-devel Hi Mark, On 12/07/2010 09:13 AM, Mark Fasheh wrote: > On Fri, Nov 26, 2010 at 02:35:58PM +0800, Tao Ma wrote: >> Hi Joel, >> >> On 11/25/2010 06:08 PM, Joel Becker wrote: >>> On Wed, Oct 20, 2010 at 02:08:17PM +0800, Tao Ma wrote: >>>> j_trans_barrier in ocfs2 is used to protect some journal operations >>>> in ocfs2. So normally, it is used as belows: >>>> 1. In journal transaction. When we start a transaction, We will >>>> down_read it and j_num_trans will be increased accordingly(in case >>>> of a cluster environment). It will be up_read when we do >>>> ocfs2_commit_trans. >>>> 2. In ocfs2_commit_cache, we will down_write it and then call >>>> jbd2_journal_flush, increase j_trans_id, reset j_num_trans and >>>> finally call up_write. This function is used by thread ocfs2cmt. >>> >>> <snip> slow filesystem...</snip> >>> >>>> My solution is that: >>>> 1. remove j_trans_barrier >>>> 2. Add a flag ci_checkpointing in ocfs2_caching_info: >>>> 1) When we find this caching_info needs checkpoint, set this flag >>>> and start the checkpointing(in ocfs2_ci_checkpointed). And the >>>> downconvert request will be requeued so that we can check and clear >>>> this flag next time it is handled. >>>> 2) Clear the flag when there is no need for checkpointing this >>>> ci(also in ocfs2_ci_checkpointed) during check_downconvert. >>>> 3. make sure when we journal_access some blocks, the caching_info >>>> can't be in the state of checkpointing. I think if we are >>>> checkpointing an caching_info, we shouldn't be able to >>>> journal_access it since it is just required to downconvert and we >>>> shouldn't have the lock now? So perhaps a BUG_ON should work? > > A couple thoughts. > > - Journal-wise, the following code provides any barrier > we need to ensure that a transaction can't be around when we're > checkpointing: > > jbd2_journal_lock_updates(journal->j_journal); > status = jbd2_journal_flush(journal->j_journal); > jbd2_journal_unlock_updates(journal->j_journal); > > - Please be sure that we don't lose any improvement the check > fo zero transactions gives us. > > - Joel mentioned to me that you actually saw a performance improvement? > That's interesting, I would have assumed that we wouldn't get any such > improvement as the journal code would be blocking us anyway in all the > places we use j_trans_barrier. The reason is that with j_trans_barrier, we have to wait for all the transaction to be flushed and then do ocfs2_start_trans even it has no relationship with this inode. So if we remove the j_trans_barrier, ocfs2_start_trans won't be blocked and it is fast. > >>> >>> Tao, >>> I'm sorry I haven't responded sooner. This proposal didn't >>> strike me as quite right, and I didn't have time to think about it. >>> I have a couple of concerns. >> Never mind. I knew you had a lot of stuff to handle with these days. >>> First, we don't always checkpoint from a downconvert. We do it >>> in clear_inode() as well, when we are flushing an inode from cache. >>> This may not have anything to do with the lock we're caring about, eg on >>> other inodes. What I mean is, the caching info for the inode we care >>> about may not be checkpointing, but the journal as a whole is. We need >>> to stop all action while that is happening. >> Sorry I don't get your last sentense. Could you please describe it in >> detail? Yes, clear_inode does do checkpointing, but actually the whole >> thing is self-contained. In ocfs2_checkpoint_inode, it can checkpoint >> the inode by itself and has no relationship with downconvert. >>> Second, there is the flip side. How do we wait until all open >>> transactions are complete before checkpointing? The down_write() in >>> ocfs2_commit_cache() blocks until all open transactions up_read(). In >>> your scheme, there is no care taken for open transactions against the >>> journal. Remember, the journal is global to the node. >> yes, I was thinking of that too. But finally I got that we don't need to >> care for it. As we have agreed above, there are 2 places we do >> checkpoint for an inode. As for clear_inode, we don't care since it is >> going to be cleared and no transaction could be opened against that. >> Another is downconvert, in which case we shouldn't be able to open a >> transaction and access that caching_info(we should always get the >> cluster lock before we do access it). We can add a BUG_ON to >> journal_access which can facilitate us to find the case that we don't >> have the lock while accessing it. > > > >> btw, I have some draft patch for it, I haven't tested it much these >> days. But if you are interested, I can send it to the mail list for more >> review. > > Post it, with some numbers showing the performance difference :) ok, but you know, I am still relocating... So maybe one or 2 weeks later, after all are settled down and the env is established, I can give you some number. ;) Regards, Tao ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2010-12-07 1:36 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-10-20 6:08 [Ocfs2-devel] [RFC] ocfs2: Remove j_trans_barrier Tao Ma 2010-11-25 10:08 ` Joel Becker 2010-11-25 10:19 ` Joel Becker 2010-11-26 6:35 ` Tao Ma 2010-12-07 0:45 ` Joel Becker 2010-12-07 1:13 ` Mark Fasheh 2010-12-07 1:36 ` Tao Ma
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).