* [Ocfs2-devel] OCFS2 1.4: Proposed fix for mount/recovery race (Attempt 3) @ 2008-07-12 0:25 Sunil Mushran 2008-07-12 0:25 ` [Ocfs2-devel] [PATCH 1/2] ocfs2: Adds counter in struct ocfs2_dinode to track journal replays Sunil Mushran 2008-07-12 0:25 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery Sunil Mushran 0 siblings, 2 replies; 11+ messages in thread From: Sunil Mushran @ 2008-07-12 0:25 UTC (permalink / raw) To: ocfs2-devel All suggestions implemented. ^ permalink raw reply [flat|nested] 11+ messages in thread
* [Ocfs2-devel] [PATCH 1/2] ocfs2: Adds counter in struct ocfs2_dinode to track journal replays 2008-07-12 0:25 [Ocfs2-devel] OCFS2 1.4: Proposed fix for mount/recovery race (Attempt 3) Sunil Mushran @ 2008-07-12 0:25 ` Sunil Mushran 2008-07-12 0:25 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery Sunil Mushran 1 sibling, 0 replies; 11+ messages in thread From: Sunil Mushran @ 2008-07-12 0:25 UTC (permalink / raw) To: ocfs2-devel This patch renames the ij_pad to ij_recovery_generation in struct ocfs2_dinode. This will be used to keep count of journal replays after an unclean shutdown. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> --- fs/ocfs2/ocfs2_fs.h | 5 ++++- 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h index 3633edd..928e97f 100644 --- a/fs/ocfs2/ocfs2_fs.h +++ b/fs/ocfs2/ocfs2_fs.h @@ -585,7 +585,10 @@ struct ocfs2_dinode { struct { /* Info for journal system inodes */ __le32 ij_flags; /* Mounted, version, etc. */ - __le32 ij_pad; + __le32 ij_recovery_generation; /* Incremented when the + journal is recovered + after an unclean + shutdown */ } journal1; } id1; /* Inode type dependant 1 */ /*C0*/ union { -- 1.5.4.5 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery 2008-07-12 0:25 [Ocfs2-devel] OCFS2 1.4: Proposed fix for mount/recovery race (Attempt 3) Sunil Mushran 2008-07-12 0:25 ` [Ocfs2-devel] [PATCH 1/2] ocfs2: Adds counter in struct ocfs2_dinode to track journal replays Sunil Mushran @ 2008-07-12 0:25 ` Sunil Mushran 2008-07-12 1:24 ` Joel Becker 1 sibling, 1 reply; 11+ messages in thread From: Sunil Mushran @ 2008-07-12 0:25 UTC (permalink / raw) To: ocfs2-devel As the fs recovery is asynchronous, there is a small chance that another node can mount (and thus recover) the slot before the recovery thread gets to it. If this happens, the recovery thread will block indefinitely on the journal/slot lock as that lock will be held for the duration of the mount (by design) by the node assigned to that slot. The solution implemented is to keep track of the journal replays using a recovery generation in the journal inode, which will be incremented by the thread replaying that journal. The recovery thread, before attempting the blocking lock on the journal/slot lock, will compare the generation on disk with what it has cached and skip recovery if it does not match. This bug appears to have been inadvertently introduced during the mount/umount vote removal by mainline commit 34d024f84345807bf44163fac84e921513dde323. In the mount voting scheme, the messaging would indirectly indicate that the slot was being recovered. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> --- fs/ocfs2/journal.c | 162 +++++++++++++++++++++++++++++++++++++++------------- fs/ocfs2/journal.h | 13 ++++- fs/ocfs2/ocfs2.h | 2 + fs/ocfs2/super.c | 12 ++++- 4 files changed, 146 insertions(+), 43 deletions(-) diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index 741a4e2..6eb8b13 100644 --- a/fs/ocfs2/journal.c +++ b/fs/ocfs2/journal.c @@ -57,7 +57,7 @@ static int __ocfs2_recovery_thread(void *arg); static int ocfs2_commit_cache(struct ocfs2_super *osb); static int ocfs2_wait_on_mount(struct ocfs2_super *osb); static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb, - int dirty); + int dirty, int replayed); static int ocfs2_trylock_journal(struct ocfs2_super *osb, int slot_num); static int ocfs2_recover_orphans(struct ocfs2_super *osb, @@ -437,7 +437,7 @@ done: } static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb, - int dirty) + int dirty, int replayed) { int status; unsigned int flags; @@ -467,6 +467,9 @@ static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb, flags &= ~OCFS2_JOURNAL_DIRTY_FL; fe->id1.journal1.ij_flags = cpu_to_le32(flags); + if (replayed) + ocfs2_bump_recovery_generation(fe); + status = ocfs2_write_block(osb, bh, journal->j_inode); if (status < 0) mlog_errno(status); @@ -541,7 +544,7 @@ void ocfs2_journal_shutdown(struct ocfs2_super *osb) * Do not toggle if flush was unsuccessful otherwise * will leave dirty metadata in a "clean" journal */ - status = ocfs2_journal_toggle_dirty(osb, 0); + status = ocfs2_journal_toggle_dirty(osb, 0, 0); if (status < 0) mlog_errno(status); } @@ -584,7 +587,7 @@ static void ocfs2_clear_journal_error(struct super_block *sb, } } -int ocfs2_journal_load(struct ocfs2_journal *journal, int local) +int ocfs2_journal_load(struct ocfs2_journal *journal, int local, int replayed) { int status = 0; struct ocfs2_super *osb; @@ -603,7 +606,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local) ocfs2_clear_journal_error(osb->sb, journal->j_journal, osb->slot_num); - status = ocfs2_journal_toggle_dirty(osb, 1); + status = ocfs2_journal_toggle_dirty(osb, 1, replayed); if (status < 0) { mlog_errno(status); goto done; @@ -645,7 +648,7 @@ int ocfs2_journal_wipe(struct ocfs2_journal *journal, int full) goto bail; } - status = ocfs2_journal_toggle_dirty(journal->j_osb, 0); + status = ocfs2_journal_toggle_dirty(journal->j_osb, 0, 0); if (status < 0) mlog_errno(status); @@ -950,6 +953,42 @@ out: mlog_exit_void(); } +static int ocfs2_read_journal_inode(struct ocfs2_super *osb, + int slot_num, + struct buffer_head **bh, + struct inode **ret_inode) +{ + int status = -EACCES; + struct inode *inode = NULL; + + BUG_ON(slot_num >= osb->max_slots); + + inode = ocfs2_get_system_file_inode(osb, JOURNAL_SYSTEM_INODE, + slot_num); + if (!inode || is_bad_inode(inode)) { + mlog_errno(status); + goto bail; + } + SET_INODE_JOURNAL(inode); + + status = ocfs2_read_block(osb, OCFS2_I(inode)->ip_blkno, bh, 0, inode); + if (status < 0) { + mlog_errno(status); + goto bail; + } + + status = 0; + +bail: + if (inode) { + if (status || !ret_inode) + iput(inode); + else + *ret_inode = inode; + } + return status; +} + /* Does the actual journal replay and marks the journal inode as * clean. Will only replay if the journal inode is marked dirty. */ static int ocfs2_replay_journal(struct ocfs2_super *osb, @@ -963,22 +1002,35 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, struct ocfs2_dinode *fe; journal_t *journal = NULL; struct buffer_head *bh = NULL; + u32 slot_reco_gen; - inode = ocfs2_get_system_file_inode(osb, JOURNAL_SYSTEM_INODE, - slot_num); - if (inode == NULL) { - status = -EACCES; + status = ocfs2_read_journal_inode(osb, slot_num, &bh, &inode); + if (status) { mlog_errno(status); goto done; } - if (is_bad_inode(inode)) { - status = -EACCES; - iput(inode); - inode = NULL; - mlog_errno(status); + + fe = (struct ocfs2_dinode *)bh->b_data; + slot_reco_gen = ocfs2_get_recovery_generation(fe); + brelse(bh); + bh = NULL; + + /* + * As the fs recovery is asynchronous, there is a small chance that another + * node mounted (and recovered) the slot before the recovery thread could + * get the lock. To handle that, we dirty read the journal inode for that + * slot to get the recovery generation. If it is different than what we + * expected, the slot has been recovered. If not, it needs recovery. + */ + if (osb->slot_recovery_generation[slot_num] != slot_reco_gen) { + mlog(0, "Slot %u already recovered (old/new=%u/%u)\n", slot_num, + osb->slot_recovery_generation[slot_num], slot_reco_gen); + osb->slot_recovery_generation[slot_num] = slot_reco_gen; + status = -EBUSY; goto done; } - SET_INODE_JOURNAL(inode); + + /* Continue with recovery as the journal has not yet been recovered */ status = ocfs2_inode_lock_full(inode, &bh, 1, OCFS2_META_LOCK_RECOVERY); if (status < 0) { @@ -992,9 +1044,12 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, fe = (struct ocfs2_dinode *) bh->b_data; flags = le32_to_cpu(fe->id1.journal1.ij_flags); + slot_reco_gen = ocfs2_get_recovery_generation(fe); if (!(flags & OCFS2_JOURNAL_DIRTY_FL)) { mlog(0, "No recovery required for node %d\n", node_num); + /* Refresh recovery generation for the slot */ + osb->slot_recovery_generation[slot_num] = slot_reco_gen; goto done; } @@ -1042,6 +1097,11 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, flags &= ~OCFS2_JOURNAL_DIRTY_FL; fe->id1.journal1.ij_flags = cpu_to_le32(flags); + /* Increment recovery generation to indicate successful recovery */ + ocfs2_bump_recovery_generation(fe); + osb->slot_recovery_generation[slot_num] = + ocfs2_get_recovery_generation(fe); + status = ocfs2_write_block(osb, bh, inode); if (status < 0) mlog_errno(status); @@ -1098,8 +1158,12 @@ static int ocfs2_recover_node(struct ocfs2_super *osb, slot_num = ocfs2_node_num_to_slot(si, node_num); if (slot_num == OCFS2_INVALID_SLOT) { - status = 0; mlog(0, "no slot for this node, so no recovery required.\n"); + /* Refresh all journal recovery generations from disk */ + status = ocfs2_check_journals_nolocks(osb); + if (status && status != -EROFS) + mlog_errno(status); + status = 0; goto done; } @@ -1107,6 +1171,13 @@ static int ocfs2_recover_node(struct ocfs2_super *osb, status = ocfs2_replay_journal(osb, node_num, slot_num); if (status < 0) { + if (status == -EBUSY) { + mlog(0, "Skipping recovery for slot %u (node %u) " + "as another node has recovered it\n", slot_num, + node_num); + status = 0; + goto done; + } mlog_errno(status); goto done; } @@ -1190,12 +1261,29 @@ int ocfs2_mark_dead_nodes(struct ocfs2_super *osb) { int status, i, node_num; struct ocfs2_slot_info *si = osb->slot_info; + struct buffer_head *bh = NULL; + struct ocfs2_dinode *di; /* This is called with the super block cluster lock, so we * know that the slot map can't change underneath us. */ spin_lock(&si->si_lock); for(i = 0; i < si->si_num_slots; i++) { + /* Read journal inode to get the recovery generation */ + status = ocfs2_read_journal_inode(osb, i, &bh, NULL); + if (status) { + mlog_errno(status); + goto bail; + } + di = (struct ocfs2_dinode *)bh->b_data; + osb->slot_recovery_generation[i] = + ocfs2_get_recovery_generation(di); + brelse(bh); + bh = NULL; + + mlog(0, "Slot %u recovery generation is %u\n", i, + osb->slot_recovery_generation[i]); + if (i == osb->slot_num) continue; if (ocfs2_is_empty_slot(si, i)) @@ -1464,49 +1552,41 @@ static int ocfs2_commit_thread(void *arg) return 0; } -/* Look for a dirty journal without taking any cluster locks. Used for - * hard readonly access to determine whether the file system journals - * require recovery. */ +/* Reads all the journal inodes without taking any cluster locks. Used + * for hard readonly access to determine whether any journal requires + * recovery. Also used to refresh the recovery generation numbers after + * a journal has been recovered by another node. + */ int ocfs2_check_journals_nolocks(struct ocfs2_super *osb) { int ret = 0; unsigned int slot; - struct buffer_head *di_bh; + struct buffer_head *di_bh = NULL; struct ocfs2_dinode *di; - struct inode *journal = NULL; + int journal_dirty = 0; for(slot = 0; slot < osb->max_slots; slot++) { - journal = ocfs2_get_system_file_inode(osb, - JOURNAL_SYSTEM_INODE, - slot); - if (!journal || is_bad_inode(journal)) { - ret = -EACCES; - mlog_errno(ret); - goto out; - } - - di_bh = NULL; - ret = ocfs2_read_block(osb, OCFS2_I(journal)->ip_blkno, &di_bh, - 0, journal); - if (ret < 0) { + ret = ocfs2_read_journal_inode(osb, slot, &di_bh, NULL); + if (ret) { mlog_errno(ret); goto out; } di = (struct ocfs2_dinode *) di_bh->b_data; + osb->slot_recovery_generation[slot] = + ocfs2_get_recovery_generation(di); + if (le32_to_cpu(di->id1.journal1.ij_flags) & OCFS2_JOURNAL_DIRTY_FL) - ret = -EROFS; + journal_dirty = 1; brelse(di_bh); - if (ret) - break; + di_bh = NULL; } out: - if (journal) - iput(journal); - + if (journal_dirty) + ret = -EROFS; return ret; } diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h index 52f02fe..7faab59 100644 --- a/fs/ocfs2/journal.h +++ b/fs/ocfs2/journal.h @@ -132,6 +132,16 @@ static inline void ocfs2_inode_set_new(struct ocfs2_super *osb, spin_unlock(&trans_inc_lock); } +static inline void ocfs2_bump_recovery_generation(struct ocfs2_dinode *di) +{ + le32_add_cpu(&(di->id1.journal1.ij_recovery_generation), 1); +} + +static inline u32 ocfs2_get_recovery_generation(struct ocfs2_dinode *di) +{ + return le32_to_cpu(di->id1.journal1.ij_recovery_generation); +} + /* Exported only for the journal struct init code in super.c. Do not call. */ void ocfs2_complete_recovery(kapi_work_struct_t *work); @@ -157,7 +167,8 @@ int ocfs2_journal_init(struct ocfs2_journal *journal, void ocfs2_journal_shutdown(struct ocfs2_super *osb); int ocfs2_journal_wipe(struct ocfs2_journal *journal, int full); -int ocfs2_journal_load(struct ocfs2_journal *journal, int local); +int ocfs2_journal_load(struct ocfs2_journal *journal, int local, + int replayed); int ocfs2_check_journals_nolocks(struct ocfs2_super *osb); void ocfs2_recovery_thread(struct ocfs2_super *osb, int node_num); diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h index 9546470..db52558 100644 --- a/fs/ocfs2/ocfs2.h +++ b/fs/ocfs2/ocfs2.h @@ -193,6 +193,8 @@ struct ocfs2_super struct ocfs2_slot_info *slot_info; + u32 *slot_recovery_generation; + spinlock_t node_map_lock; struct ocfs2_node_map recovery_map; diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index ff31722..da92242 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -1497,6 +1497,15 @@ static int ocfs2_initialize_super(struct super_block *sb, } mlog(0, "max_slots for this device: %u\n", osb->max_slots); + osb->slot_recovery_generation = kcalloc(osb->max_slots, + sizeof(*osb->slot_recovery_generation), + GFP_KERNEL); + if (!osb->slot_recovery_generation) { + status = -ENOMEM; + mlog_errno(status); + goto bail; + } + init_waitqueue_head(&osb->osb_wipe_event); osb->osb_orphan_wipes = kcalloc(osb->max_slots, sizeof(*osb->osb_orphan_wipes), @@ -1739,7 +1748,7 @@ static int ocfs2_check_volume(struct ocfs2_super *osb) local = ocfs2_mount_local(osb); /* will play back anything left in the journal. */ - ocfs2_journal_load(osb->journal, local); + ocfs2_journal_load(osb->journal, local, dirty); if (dirty) { /* recover my local alloc if we didn't unmount cleanly. */ @@ -1801,6 +1810,7 @@ static void ocfs2_delete_osb(struct ocfs2_super *osb) ocfs2_free_slot_info(osb->slot_info); kfree(osb->osb_orphan_wipes); + kfree(osb->slot_recovery_generation); /* FIXME * This belongs in journal shutdown, but because we have to * allocate osb->journal at the start of ocfs2_initalize_osb(), -- 1.5.4.5 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery 2008-07-12 0:25 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery Sunil Mushran @ 2008-07-12 1:24 ` Joel Becker 0 siblings, 0 replies; 11+ messages in thread From: Joel Becker @ 2008-07-12 1:24 UTC (permalink / raw) To: ocfs2-devel On Fri, Jul 11, 2008 at 05:25:18PM -0700, Sunil Mushran wrote: > diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h > index 52f02fe..7faab59 100644 > --- a/fs/ocfs2/journal.h > +++ b/fs/ocfs2/journal.h > @@ -132,6 +132,16 @@ static inline void ocfs2_inode_set_new(struct ocfs2_super *osb, > spin_unlock(&trans_inc_lock); > } > > +static inline void ocfs2_bump_recovery_generation(struct ocfs2_dinode *di) > +{ > + le32_add_cpu(&(di->id1.journal1.ij_recovery_generation), 1); > +} > + > +static inline u32 ocfs2_get_recovery_generation(struct ocfs2_dinode *di) > +{ > + return le32_to_cpu(di->id1.journal1.ij_recovery_generation); > +} There's no reason for these to be inline or even outside of journal.c. Joel -- "Practice random acts of kindness and senseless acts of beauty." Oh, and don't forget where your towel is. Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 11+ messages in thread
* [Ocfs2-devel] Recovery/mount fixes for mainline @ 2008-07-15 0:31 Sunil Mushran 2008-07-15 0:31 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery Sunil Mushran 0 siblings, 1 reply; 11+ messages in thread From: Sunil Mushran @ 2008-07-15 0:31 UTC (permalink / raw) To: ocfs2-devel Ok... same as the one I have been emailing for ocfs2 1.4. This was built and tested with 2.6.26. Sunil ^ permalink raw reply [flat|nested] 11+ messages in thread
* [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery 2008-07-15 0:31 [Ocfs2-devel] Recovery/mount fixes for mainline Sunil Mushran @ 2008-07-15 0:31 ` Sunil Mushran 0 siblings, 0 replies; 11+ messages in thread From: Sunil Mushran @ 2008-07-15 0:31 UTC (permalink / raw) To: ocfs2-devel As the fs recovery is asynchronous, there is a small chance that another node can mount (and thus recover) the slot before the recovery thread gets to it. If this happens, the recovery thread will block indefinitely on the journal/slot lock as that lock will be held for the duration of the mount (by design) by the node assigned to that slot. The solution implemented is to keep track of the journal replays using a recovery generation in the journal inode, which will be incremented by the thread replaying that journal. The recovery thread, before attempting the blocking lock on the journal/slot lock, will compare the generation on disk with what it has cached and skip recovery if it does not match. This bug appears to have been inadvertently introduced during the mount/umount vote removal by mainline commit 34d024f84345807bf44163fac84e921513dde323. In the mount voting scheme, the messaging would indirectly indicate that the slot was being recovered. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> --- fs/ocfs2/journal.c | 173 ++++++++++++++++++++++++++++++++++++++++------------ fs/ocfs2/journal.h | 3 +- fs/ocfs2/ocfs2.h | 2 + fs/ocfs2/super.c | 12 +++- 4 files changed, 148 insertions(+), 42 deletions(-) diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index 9698338..5140338 100644 --- a/fs/ocfs2/journal.c +++ b/fs/ocfs2/journal.c @@ -57,7 +57,7 @@ static int __ocfs2_recovery_thread(void *arg); static int ocfs2_commit_cache(struct ocfs2_super *osb); static int ocfs2_wait_on_mount(struct ocfs2_super *osb); static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb, - int dirty); + int dirty, int replayed); static int ocfs2_trylock_journal(struct ocfs2_super *osb, int slot_num); static int ocfs2_recover_orphans(struct ocfs2_super *osb, @@ -562,8 +562,18 @@ done: return status; } +static void ocfs2_bump_recovery_generation(struct ocfs2_dinode *di) +{ + le32_add_cpu(&(di->id1.journal1.ij_recovery_generation), 1); +} + +static u32 ocfs2_get_recovery_generation(struct ocfs2_dinode *di) +{ + return le32_to_cpu(di->id1.journal1.ij_recovery_generation); +} + static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb, - int dirty) + int dirty, int replayed) { int status; unsigned int flags; @@ -593,6 +603,9 @@ static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb, flags &= ~OCFS2_JOURNAL_DIRTY_FL; fe->id1.journal1.ij_flags = cpu_to_le32(flags); + if (replayed) + ocfs2_bump_recovery_generation(fe); + status = ocfs2_write_block(osb, bh, journal->j_inode); if (status < 0) mlog_errno(status); @@ -667,7 +680,7 @@ void ocfs2_journal_shutdown(struct ocfs2_super *osb) * Do not toggle if flush was unsuccessful otherwise * will leave dirty metadata in a "clean" journal */ - status = ocfs2_journal_toggle_dirty(osb, 0); + status = ocfs2_journal_toggle_dirty(osb, 0, 0); if (status < 0) mlog_errno(status); } @@ -710,7 +723,7 @@ static void ocfs2_clear_journal_error(struct super_block *sb, } } -int ocfs2_journal_load(struct ocfs2_journal *journal, int local) +int ocfs2_journal_load(struct ocfs2_journal *journal, int local, int replayed) { int status = 0; struct ocfs2_super *osb; @@ -729,7 +742,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local) ocfs2_clear_journal_error(osb->sb, journal->j_journal, osb->slot_num); - status = ocfs2_journal_toggle_dirty(osb, 1); + status = ocfs2_journal_toggle_dirty(osb, 1, replayed); if (status < 0) { mlog_errno(status); goto done; @@ -771,7 +784,7 @@ int ocfs2_journal_wipe(struct ocfs2_journal *journal, int full) goto bail; } - status = ocfs2_journal_toggle_dirty(journal->j_osb, 0); + status = ocfs2_journal_toggle_dirty(journal->j_osb, 0, 0); if (status < 0) mlog_errno(status); @@ -1034,6 +1047,12 @@ restart: spin_unlock(&osb->osb_lock); mlog(0, "All nodes recovered\n"); + /* Refresh all journal recovery generations from disk */ + status = ocfs2_check_journals_nolocks(osb); + status = (status == -EROFS) ? 0 : status; + if (status < 0) + mlog_errno(status); + ocfs2_super_unlock(osb, 1); /* We always run recovery on our own orphan dir - the dead @@ -1096,6 +1115,42 @@ out: mlog_exit_void(); } +static int ocfs2_read_journal_inode(struct ocfs2_super *osb, + int slot_num, + struct buffer_head **bh, + struct inode **ret_inode) +{ + int status = -EACCES; + struct inode *inode = NULL; + + BUG_ON(slot_num >= osb->max_slots); + + inode = ocfs2_get_system_file_inode(osb, JOURNAL_SYSTEM_INODE, + slot_num); + if (!inode || is_bad_inode(inode)) { + mlog_errno(status); + goto bail; + } + SET_INODE_JOURNAL(inode); + + status = ocfs2_read_block(osb, OCFS2_I(inode)->ip_blkno, bh, 0, inode); + if (status < 0) { + mlog_errno(status); + goto bail; + } + + status = 0; + +bail: + if (inode) { + if (status || !ret_inode) + iput(inode); + else + *ret_inode = inode; + } + return status; +} + /* Does the actual journal replay and marks the journal inode as * clean. Will only replay if the journal inode is marked dirty. */ static int ocfs2_replay_journal(struct ocfs2_super *osb, @@ -1109,22 +1164,36 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, struct ocfs2_dinode *fe; journal_t *journal = NULL; struct buffer_head *bh = NULL; + u32 slot_reco_gen; - inode = ocfs2_get_system_file_inode(osb, JOURNAL_SYSTEM_INODE, - slot_num); - if (inode == NULL) { - status = -EACCES; + status = ocfs2_read_journal_inode(osb, slot_num, &bh, &inode); + if (status) { mlog_errno(status); goto done; } - if (is_bad_inode(inode)) { - status = -EACCES; - iput(inode); - inode = NULL; - mlog_errno(status); + + fe = (struct ocfs2_dinode *)bh->b_data; + slot_reco_gen = ocfs2_get_recovery_generation(fe); + brelse(bh); + bh = NULL; + + /* + * As the fs recovery is asynchronous, there is a small chance that + * another node mounted (and recovered) the slot before the recovery + * thread could get the lock. To handle that, we dirty read the journal + * inode for that slot to get the recovery generation. If it is + * different than what we expected, the slot has been recovered. + * If not, it needs recovery. + */ + if (osb->slot_recovery_generations[slot_num] != slot_reco_gen) { + mlog(0, "Slot %u already recovered (old/new=%u/%u)\n", slot_num, + osb->slot_recovery_generations[slot_num], slot_reco_gen); + osb->slot_recovery_generations[slot_num] = slot_reco_gen; + status = -EBUSY; goto done; } - SET_INODE_JOURNAL(inode); + + /* Continue with recovery as the journal has not yet been recovered */ status = ocfs2_inode_lock_full(inode, &bh, 1, OCFS2_META_LOCK_RECOVERY); if (status < 0) { @@ -1138,9 +1207,12 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, fe = (struct ocfs2_dinode *) bh->b_data; flags = le32_to_cpu(fe->id1.journal1.ij_flags); + slot_reco_gen = ocfs2_get_recovery_generation(fe); if (!(flags & OCFS2_JOURNAL_DIRTY_FL)) { mlog(0, "No recovery required for node %d\n", node_num); + /* Refresh recovery generation for the slot */ + osb->slot_recovery_generations[slot_num] = slot_reco_gen; goto done; } @@ -1188,6 +1260,11 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, flags &= ~OCFS2_JOURNAL_DIRTY_FL; fe->id1.journal1.ij_flags = cpu_to_le32(flags); + /* Increment recovery generation to indicate successful recovery */ + ocfs2_bump_recovery_generation(fe); + osb->slot_recovery_generations[slot_num] = + ocfs2_get_recovery_generation(fe); + status = ocfs2_write_block(osb, bh, inode); if (status < 0) mlog_errno(status); @@ -1252,6 +1329,13 @@ static int ocfs2_recover_node(struct ocfs2_super *osb, status = ocfs2_replay_journal(osb, node_num, slot_num); if (status < 0) { + if (status == -EBUSY) { + mlog(0, "Skipping recovery for slot %u (node %u) " + "as another node has recovered it\n", slot_num, + node_num); + status = 0; + goto done; + } mlog_errno(status); goto done; } @@ -1334,12 +1418,29 @@ int ocfs2_mark_dead_nodes(struct ocfs2_super *osb) { unsigned int node_num; int status, i; + struct buffer_head *bh = NULL; + struct ocfs2_dinode *di; /* This is called with the super block cluster lock, so we * know that the slot map can't change underneath us. */ spin_lock(&osb->osb_lock); for (i = 0; i < osb->max_slots; i++) { + /* Read journal inode to get the recovery generation */ + status = ocfs2_read_journal_inode(osb, i, &bh, NULL); + if (status) { + mlog_errno(status); + goto bail; + } + di = (struct ocfs2_dinode *)bh->b_data; + osb->slot_recovery_generations[i] = + ocfs2_get_recovery_generation(di); + brelse(bh); + bh = NULL; + + mlog(0, "Slot %u recovery generation is %u\n", i, + osb->slot_recovery_generations[i]); + if (i == osb->slot_num) continue; @@ -1603,49 +1704,41 @@ static int ocfs2_commit_thread(void *arg) return 0; } -/* Look for a dirty journal without taking any cluster locks. Used for - * hard readonly access to determine whether the file system journals - * require recovery. */ +/* Reads all the journal inodes without taking any cluster locks. Used + * for hard readonly access to determine whether any journal requires + * recovery. Also used to refresh the recovery generation numbers after + * a journal has been recovered by another node. + */ int ocfs2_check_journals_nolocks(struct ocfs2_super *osb) { int ret = 0; unsigned int slot; - struct buffer_head *di_bh; + struct buffer_head *di_bh = NULL; struct ocfs2_dinode *di; - struct inode *journal = NULL; + int journal_dirty = 0; for(slot = 0; slot < osb->max_slots; slot++) { - journal = ocfs2_get_system_file_inode(osb, - JOURNAL_SYSTEM_INODE, - slot); - if (!journal || is_bad_inode(journal)) { - ret = -EACCES; - mlog_errno(ret); - goto out; - } - - di_bh = NULL; - ret = ocfs2_read_block(osb, OCFS2_I(journal)->ip_blkno, &di_bh, - 0, journal); - if (ret < 0) { + ret = ocfs2_read_journal_inode(osb, slot, &di_bh, NULL); + if (ret) { mlog_errno(ret); goto out; } di = (struct ocfs2_dinode *) di_bh->b_data; + osb->slot_recovery_generations[slot] = + ocfs2_get_recovery_generation(di); + if (le32_to_cpu(di->id1.journal1.ij_flags) & OCFS2_JOURNAL_DIRTY_FL) - ret = -EROFS; + journal_dirty = 1; brelse(di_bh); - if (ret) - break; + di_bh = NULL; } out: - if (journal) - iput(journal); - + if (journal_dirty) + ret = -EROFS; return ret; } diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h index db82be2..2178ebf 100644 --- a/fs/ocfs2/journal.h +++ b/fs/ocfs2/journal.h @@ -161,7 +161,8 @@ int ocfs2_journal_init(struct ocfs2_journal *journal, void ocfs2_journal_shutdown(struct ocfs2_super *osb); int ocfs2_journal_wipe(struct ocfs2_journal *journal, int full); -int ocfs2_journal_load(struct ocfs2_journal *journal, int local); +int ocfs2_journal_load(struct ocfs2_journal *journal, int local, + int replayed); int ocfs2_check_journals_nolocks(struct ocfs2_super *osb); void ocfs2_recovery_thread(struct ocfs2_super *osb, int node_num); diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h index 3169237..fe69266 100644 --- a/fs/ocfs2/ocfs2.h +++ b/fs/ocfs2/ocfs2.h @@ -192,6 +192,8 @@ struct ocfs2_super struct ocfs2_slot_info *slot_info; + u32 *slot_recovery_generations; + spinlock_t node_map_lock; u64 root_blkno; diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index df63ba2..1ef765e 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -1442,6 +1442,15 @@ static int ocfs2_initialize_super(struct super_block *sb, } mlog(0, "max_slots for this device: %u\n", osb->max_slots); + osb->slot_recovery_generations = + kcalloc(osb->max_slots, sizeof(*osb->slot_recovery_generations), + GFP_KERNEL); + if (!osb->slot_recovery_generations) { + status = -ENOMEM; + mlog_errno(status); + goto bail; + } + init_waitqueue_head(&osb->osb_wipe_event); osb->osb_orphan_wipes = kcalloc(osb->max_slots, sizeof(*osb->osb_orphan_wipes), @@ -1703,7 +1712,7 @@ static int ocfs2_check_volume(struct ocfs2_super *osb) local = ocfs2_mount_local(osb); /* will play back anything left in the journal. */ - ocfs2_journal_load(osb->journal, local); + ocfs2_journal_load(osb->journal, local, dirty); if (dirty) { /* recover my local alloc if we didn't unmount cleanly. */ @@ -1764,6 +1773,7 @@ static void ocfs2_delete_osb(struct ocfs2_super *osb) ocfs2_free_slot_info(osb); kfree(osb->osb_orphan_wipes); + kfree(osb->slot_recovery_generations); /* FIXME * This belongs in journal shutdown, but because we have to * allocate osb->journal at the start of ocfs2_initalize_osb(), -- 1.5.4.5 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [Ocfs2-devel] OCFS2 1.4: Proposed fix for mount/recovery race (Attempt 4) @ 2008-07-14 20:12 Sunil Mushran 2008-07-14 20:12 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery Sunil Mushran 0 siblings, 1 reply; 11+ messages in thread From: Sunil Mushran @ 2008-07-14 20:12 UTC (permalink / raw) To: ocfs2-devel Two changes. One was implementing Joel suggestion of moving bump and get recogens into journal.c. Second was concerning the recogen refresh on nodes that did not replay the journal. We now refresh at the end of the recovery process just before we release the EX on the superblock lock. Earlier we were refreshing earlier which was adversely affecting the logic making it do a blocking lock on a slot that was mounted. Sunil ^ permalink raw reply [flat|nested] 11+ messages in thread
* [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery 2008-07-14 20:12 [Ocfs2-devel] OCFS2 1.4: Proposed fix for mount/recovery race (Attempt 4) Sunil Mushran @ 2008-07-14 20:12 ` Sunil Mushran 2008-07-14 20:25 ` Joel Becker 0 siblings, 1 reply; 11+ messages in thread From: Sunil Mushran @ 2008-07-14 20:12 UTC (permalink / raw) To: ocfs2-devel As the fs recovery is asynchronous, there is a small chance that another node can mount (and thus recover) the slot before the recovery thread gets to it. If this happens, the recovery thread will block indefinitely on the journal/slot lock as that lock will be held for the duration of the mount (by design) by the node assigned to that slot. The solution implemented is to keep track of the journal replays using a recovery generation in the journal inode, which will be incremented by the thread replaying that journal. The recovery thread, before attempting the blocking lock on the journal/slot lock, will compare the generation on disk with what it has cached and skip recovery if it does not match. This bug appears to have been inadvertently introduced during the mount/umount vote removal by mainline commit 34d024f84345807bf44163fac84e921513dde323. In the mount voting scheme, the messaging would indirectly indicate that the slot was being recovered. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> --- fs/ocfs2/journal.c | 173 ++++++++++++++++++++++++++++++++++++++++------------ fs/ocfs2/journal.h | 3 +- fs/ocfs2/ocfs2.h | 2 + fs/ocfs2/super.c | 12 +++- 4 files changed, 148 insertions(+), 42 deletions(-) diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index 741a4e2..1e078d8 100644 --- a/fs/ocfs2/journal.c +++ b/fs/ocfs2/journal.c @@ -57,7 +57,7 @@ static int __ocfs2_recovery_thread(void *arg); static int ocfs2_commit_cache(struct ocfs2_super *osb); static int ocfs2_wait_on_mount(struct ocfs2_super *osb); static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb, - int dirty); + int dirty, int replayed); static int ocfs2_trylock_journal(struct ocfs2_super *osb, int slot_num); static int ocfs2_recover_orphans(struct ocfs2_super *osb, @@ -436,8 +436,18 @@ done: return status; } +static void ocfs2_bump_recovery_generation(struct ocfs2_dinode *di) +{ + le32_add_cpu(&(di->id1.journal1.ij_recovery_generation), 1); +} + +static u32 ocfs2_get_recovery_generation(struct ocfs2_dinode *di) +{ + return le32_to_cpu(di->id1.journal1.ij_recovery_generation); +} + static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb, - int dirty) + int dirty, int replayed) { int status; unsigned int flags; @@ -467,6 +477,9 @@ static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb, flags &= ~OCFS2_JOURNAL_DIRTY_FL; fe->id1.journal1.ij_flags = cpu_to_le32(flags); + if (replayed) + ocfs2_bump_recovery_generation(fe); + status = ocfs2_write_block(osb, bh, journal->j_inode); if (status < 0) mlog_errno(status); @@ -541,7 +554,7 @@ void ocfs2_journal_shutdown(struct ocfs2_super *osb) * Do not toggle if flush was unsuccessful otherwise * will leave dirty metadata in a "clean" journal */ - status = ocfs2_journal_toggle_dirty(osb, 0); + status = ocfs2_journal_toggle_dirty(osb, 0, 0); if (status < 0) mlog_errno(status); } @@ -584,7 +597,7 @@ static void ocfs2_clear_journal_error(struct super_block *sb, } } -int ocfs2_journal_load(struct ocfs2_journal *journal, int local) +int ocfs2_journal_load(struct ocfs2_journal *journal, int local, int replayed) { int status = 0; struct ocfs2_super *osb; @@ -603,7 +616,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local) ocfs2_clear_journal_error(osb->sb, journal->j_journal, osb->slot_num); - status = ocfs2_journal_toggle_dirty(osb, 1); + status = ocfs2_journal_toggle_dirty(osb, 1, replayed); if (status < 0) { mlog_errno(status); goto done; @@ -645,7 +658,7 @@ int ocfs2_journal_wipe(struct ocfs2_journal *journal, int full) goto bail; } - status = ocfs2_journal_toggle_dirty(journal->j_osb, 0); + status = ocfs2_journal_toggle_dirty(journal->j_osb, 0, 0); if (status < 0) mlog_errno(status); @@ -887,6 +900,13 @@ restart: ocfs2_recovery_map_clear(osb, node_num); } + + /* Refresh all journal recovery generations from disk */ + status = ocfs2_check_journals_nolocks(osb); + status = (status == -EROFS) ? 0 : status; + if (status < 0) + mlog_errno(status); + ocfs2_super_unlock(osb, 1); /* We always run recovery on our own orphan dir - the dead @@ -950,6 +970,42 @@ out: mlog_exit_void(); } +static int ocfs2_read_journal_inode(struct ocfs2_super *osb, + int slot_num, + struct buffer_head **bh, + struct inode **ret_inode) +{ + int status = -EACCES; + struct inode *inode = NULL; + + BUG_ON(slot_num >= osb->max_slots); + + inode = ocfs2_get_system_file_inode(osb, JOURNAL_SYSTEM_INODE, + slot_num); + if (!inode || is_bad_inode(inode)) { + mlog_errno(status); + goto bail; + } + SET_INODE_JOURNAL(inode); + + status = ocfs2_read_block(osb, OCFS2_I(inode)->ip_blkno, bh, 0, inode); + if (status < 0) { + mlog_errno(status); + goto bail; + } + + status = 0; + +bail: + if (inode) { + if (status || !ret_inode) + iput(inode); + else + *ret_inode = inode; + } + return status; +} + /* Does the actual journal replay and marks the journal inode as * clean. Will only replay if the journal inode is marked dirty. */ static int ocfs2_replay_journal(struct ocfs2_super *osb, @@ -963,22 +1019,35 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, struct ocfs2_dinode *fe; journal_t *journal = NULL; struct buffer_head *bh = NULL; + u32 slot_reco_gen; - inode = ocfs2_get_system_file_inode(osb, JOURNAL_SYSTEM_INODE, - slot_num); - if (inode == NULL) { - status = -EACCES; + status = ocfs2_read_journal_inode(osb, slot_num, &bh, &inode); + if (status) { mlog_errno(status); goto done; } - if (is_bad_inode(inode)) { - status = -EACCES; - iput(inode); - inode = NULL; - mlog_errno(status); + + fe = (struct ocfs2_dinode *)bh->b_data; + slot_reco_gen = ocfs2_get_recovery_generation(fe); + brelse(bh); + bh = NULL; + + /* + * As the fs recovery is asynchronous, there is a small chance that another + * node mounted (and recovered) the slot before the recovery thread could + * get the lock. To handle that, we dirty read the journal inode for that + * slot to get the recovery generation. If it is different than what we + * expected, the slot has been recovered. If not, it needs recovery. + */ + if (osb->slot_recovery_generation[slot_num] != slot_reco_gen) { + mlog(0, "Slot %u already recovered (old/new=%u/%u)\n", slot_num, + osb->slot_recovery_generation[slot_num], slot_reco_gen); + osb->slot_recovery_generation[slot_num] = slot_reco_gen; + status = -EBUSY; goto done; } - SET_INODE_JOURNAL(inode); + + /* Continue with recovery as the journal has not yet been recovered */ status = ocfs2_inode_lock_full(inode, &bh, 1, OCFS2_META_LOCK_RECOVERY); if (status < 0) { @@ -992,9 +1061,12 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, fe = (struct ocfs2_dinode *) bh->b_data; flags = le32_to_cpu(fe->id1.journal1.ij_flags); + slot_reco_gen = ocfs2_get_recovery_generation(fe); if (!(flags & OCFS2_JOURNAL_DIRTY_FL)) { mlog(0, "No recovery required for node %d\n", node_num); + /* Refresh recovery generation for the slot */ + osb->slot_recovery_generation[slot_num] = slot_reco_gen; goto done; } @@ -1042,6 +1114,11 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, flags &= ~OCFS2_JOURNAL_DIRTY_FL; fe->id1.journal1.ij_flags = cpu_to_le32(flags); + /* Increment recovery generation to indicate successful recovery */ + ocfs2_bump_recovery_generation(fe); + osb->slot_recovery_generation[slot_num] = + ocfs2_get_recovery_generation(fe); + status = ocfs2_write_block(osb, bh, inode); if (status < 0) mlog_errno(status); @@ -1107,6 +1184,13 @@ static int ocfs2_recover_node(struct ocfs2_super *osb, status = ocfs2_replay_journal(osb, node_num, slot_num); if (status < 0) { + if (status == -EBUSY) { + mlog(0, "Skipping recovery for slot %u (node %u) " + "as another node has recovered it\n", slot_num, + node_num); + status = 0; + goto done; + } mlog_errno(status); goto done; } @@ -1190,12 +1274,29 @@ int ocfs2_mark_dead_nodes(struct ocfs2_super *osb) { int status, i, node_num; struct ocfs2_slot_info *si = osb->slot_info; + struct buffer_head *bh = NULL; + struct ocfs2_dinode *di; /* This is called with the super block cluster lock, so we * know that the slot map can't change underneath us. */ spin_lock(&si->si_lock); for(i = 0; i < si->si_num_slots; i++) { + /* Read journal inode to get the recovery generation */ + status = ocfs2_read_journal_inode(osb, i, &bh, NULL); + if (status) { + mlog_errno(status); + goto bail; + } + di = (struct ocfs2_dinode *)bh->b_data; + osb->slot_recovery_generation[i] = + ocfs2_get_recovery_generation(di); + brelse(bh); + bh = NULL; + + mlog(0, "Slot %u recovery generation is %u\n", i, + osb->slot_recovery_generation[i]); + if (i == osb->slot_num) continue; if (ocfs2_is_empty_slot(si, i)) @@ -1464,49 +1565,41 @@ static int ocfs2_commit_thread(void *arg) return 0; } -/* Look for a dirty journal without taking any cluster locks. Used for - * hard readonly access to determine whether the file system journals - * require recovery. */ +/* Reads all the journal inodes without taking any cluster locks. Used + * for hard readonly access to determine whether any journal requires + * recovery. Also used to refresh the recovery generation numbers after + * a journal has been recovered by another node. + */ int ocfs2_check_journals_nolocks(struct ocfs2_super *osb) { int ret = 0; unsigned int slot; - struct buffer_head *di_bh; + struct buffer_head *di_bh = NULL; struct ocfs2_dinode *di; - struct inode *journal = NULL; + int journal_dirty = 0; for(slot = 0; slot < osb->max_slots; slot++) { - journal = ocfs2_get_system_file_inode(osb, - JOURNAL_SYSTEM_INODE, - slot); - if (!journal || is_bad_inode(journal)) { - ret = -EACCES; - mlog_errno(ret); - goto out; - } - - di_bh = NULL; - ret = ocfs2_read_block(osb, OCFS2_I(journal)->ip_blkno, &di_bh, - 0, journal); - if (ret < 0) { + ret = ocfs2_read_journal_inode(osb, slot, &di_bh, NULL); + if (ret) { mlog_errno(ret); goto out; } di = (struct ocfs2_dinode *) di_bh->b_data; + osb->slot_recovery_generation[slot] = + ocfs2_get_recovery_generation(di); + if (le32_to_cpu(di->id1.journal1.ij_flags) & OCFS2_JOURNAL_DIRTY_FL) - ret = -EROFS; + journal_dirty = 1; brelse(di_bh); - if (ret) - break; + di_bh = NULL; } out: - if (journal) - iput(journal); - + if (journal_dirty) + ret = -EROFS; return ret; } diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h index 52f02fe..dc27100 100644 --- a/fs/ocfs2/journal.h +++ b/fs/ocfs2/journal.h @@ -157,7 +157,8 @@ int ocfs2_journal_init(struct ocfs2_journal *journal, void ocfs2_journal_shutdown(struct ocfs2_super *osb); int ocfs2_journal_wipe(struct ocfs2_journal *journal, int full); -int ocfs2_journal_load(struct ocfs2_journal *journal, int local); +int ocfs2_journal_load(struct ocfs2_journal *journal, int local, + int replayed); int ocfs2_check_journals_nolocks(struct ocfs2_super *osb); void ocfs2_recovery_thread(struct ocfs2_super *osb, int node_num); diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h index 9546470..db52558 100644 --- a/fs/ocfs2/ocfs2.h +++ b/fs/ocfs2/ocfs2.h @@ -193,6 +193,8 @@ struct ocfs2_super struct ocfs2_slot_info *slot_info; + u32 *slot_recovery_generation; + spinlock_t node_map_lock; struct ocfs2_node_map recovery_map; diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index ff31722..da92242 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -1497,6 +1497,15 @@ static int ocfs2_initialize_super(struct super_block *sb, } mlog(0, "max_slots for this device: %u\n", osb->max_slots); + osb->slot_recovery_generation = kcalloc(osb->max_slots, + sizeof(*osb->slot_recovery_generation), + GFP_KERNEL); + if (!osb->slot_recovery_generation) { + status = -ENOMEM; + mlog_errno(status); + goto bail; + } + init_waitqueue_head(&osb->osb_wipe_event); osb->osb_orphan_wipes = kcalloc(osb->max_slots, sizeof(*osb->osb_orphan_wipes), @@ -1739,7 +1748,7 @@ static int ocfs2_check_volume(struct ocfs2_super *osb) local = ocfs2_mount_local(osb); /* will play back anything left in the journal. */ - ocfs2_journal_load(osb->journal, local); + ocfs2_journal_load(osb->journal, local, dirty); if (dirty) { /* recover my local alloc if we didn't unmount cleanly. */ @@ -1801,6 +1810,7 @@ static void ocfs2_delete_osb(struct ocfs2_super *osb) ocfs2_free_slot_info(osb->slot_info); kfree(osb->osb_orphan_wipes); + kfree(osb->slot_recovery_generation); /* FIXME * This belongs in journal shutdown, but because we have to * allocate osb->journal at the start of ocfs2_initalize_osb(), -- 1.5.4.3 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery 2008-07-14 20:12 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery Sunil Mushran @ 2008-07-14 20:25 ` Joel Becker 0 siblings, 0 replies; 11+ messages in thread From: Joel Becker @ 2008-07-14 20:25 UTC (permalink / raw) To: ocfs2-devel On Mon, Jul 14, 2008 at 01:12:38PM -0700, Sunil Mushran wrote: > + u32 *slot_recovery_generation; > + It's a tiny thing, but maybe call it slot_recovery_generations, because it refers to more than one (it's not "our slot's recover generation" it's "recovery generations for all slots"). Otherwise, lookin good. Joel -- "In the arms of the angel, fly away from here, From this dark, cold hotel room and the endlessness that you fear. You are pulled from the wreckage of your silent reverie. In the arms of the angel, may you find some comfort here." Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 11+ messages in thread
* [Ocfs2-devel] OCFS2 1.4: Proposed fix for mount/recovery race (Attempt 2) @ 2008-07-11 23:27 Sunil Mushran 2008-07-11 23:27 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery Sunil Mushran 0 siblings, 1 reply; 11+ messages in thread From: Sunil Mushran @ 2008-07-11 23:27 UTC (permalink / raw) To: ocfs2-devel All suggestions incorporated. Please review. ^ permalink raw reply [flat|nested] 11+ messages in thread
* [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery 2008-07-11 23:27 [Ocfs2-devel] OCFS2 1.4: Proposed fix for mount/recovery race (Attempt 2) Sunil Mushran @ 2008-07-11 23:27 ` Sunil Mushran 2008-07-11 23:52 ` Joel Becker 0 siblings, 1 reply; 11+ messages in thread From: Sunil Mushran @ 2008-07-11 23:27 UTC (permalink / raw) To: ocfs2-devel As the fs recovery is asynchronous, there is a small chance that another node can mount (and thus recover) the slot before the recovery thread gets to it. If this happens, the recovery thread will block indefinitely on the journal/slot lock as that lock will be held for the duration of the mount (by design) by the node assigned to that slot. The solution implemented is to keep track of the journal replays using a recovery generation in the journal inode, which will be incremented by the thread replaying that journal. The recovery thread, before attempting the blocking lock on the journal/slot lock, will compare the generation on disk with what it has cached and skip recovery if it does not match. This bug appears to have been inadvertently introduced during the mount/umount vote removal by mainline commit 34d024f84345807bf44163fac84e921513dde323. In the mount voting scheme, the messaging would indirectly indicate that the slot was being recovered. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> --- fs/ocfs2/journal.c | 164 ++++++++++++++++++++++++++++++++++++++++------------ fs/ocfs2/journal.h | 3 +- fs/ocfs2/ocfs2.h | 2 + fs/ocfs2/super.c | 14 ++++- 4 files changed, 142 insertions(+), 41 deletions(-) diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index 741a4e2..2cd91cf 100644 --- a/fs/ocfs2/journal.c +++ b/fs/ocfs2/journal.c @@ -57,7 +57,7 @@ static int __ocfs2_recovery_thread(void *arg); static int ocfs2_commit_cache(struct ocfs2_super *osb); static int ocfs2_wait_on_mount(struct ocfs2_super *osb); static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb, - int dirty); + int dirty, int replayed); static int ocfs2_trylock_journal(struct ocfs2_super *osb, int slot_num); static int ocfs2_recover_orphans(struct ocfs2_super *osb, @@ -437,7 +437,7 @@ done: } static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb, - int dirty) + int dirty, int replayed) { int status; unsigned int flags; @@ -467,6 +467,9 @@ static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb, flags &= ~OCFS2_JOURNAL_DIRTY_FL; fe->id1.journal1.ij_flags = cpu_to_le32(flags); + if (replayed) + le32_add_cpu(&(fe->id1.journal1.ij_recovery_generation), 1); + status = ocfs2_write_block(osb, bh, journal->j_inode); if (status < 0) mlog_errno(status); @@ -541,7 +544,7 @@ void ocfs2_journal_shutdown(struct ocfs2_super *osb) * Do not toggle if flush was unsuccessful otherwise * will leave dirty metadata in a "clean" journal */ - status = ocfs2_journal_toggle_dirty(osb, 0); + status = ocfs2_journal_toggle_dirty(osb, 0, 0); if (status < 0) mlog_errno(status); } @@ -584,7 +587,7 @@ static void ocfs2_clear_journal_error(struct super_block *sb, } } -int ocfs2_journal_load(struct ocfs2_journal *journal, int local) +int ocfs2_journal_load(struct ocfs2_journal *journal, int local, int replayed) { int status = 0; struct ocfs2_super *osb; @@ -603,7 +606,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local) ocfs2_clear_journal_error(osb->sb, journal->j_journal, osb->slot_num); - status = ocfs2_journal_toggle_dirty(osb, 1); + status = ocfs2_journal_toggle_dirty(osb, 1, replayed); if (status < 0) { mlog_errno(status); goto done; @@ -645,7 +648,7 @@ int ocfs2_journal_wipe(struct ocfs2_journal *journal, int full) goto bail; } - status = ocfs2_journal_toggle_dirty(journal->j_osb, 0); + status = ocfs2_journal_toggle_dirty(journal->j_osb, 0, 0); if (status < 0) mlog_errno(status); @@ -950,6 +953,44 @@ out: mlog_exit_void(); } +static int ocfs2_read_journal_inode(struct ocfs2_super *osb, + int slot_num, + struct buffer_head **bh, + struct inode **ret_inode) +{ + int status = -EACCES; + struct inode *inode = NULL; + + BUG_ON(slot_num >= osb->max_slots); + + *ret_inode = NULL; + + inode = ocfs2_get_system_file_inode(osb, JOURNAL_SYSTEM_INODE, + slot_num); + if (!inode || is_bad_inode(inode)) { + mlog_errno(status); + goto bail; + } + SET_INODE_JOURNAL(inode); + + status = ocfs2_read_block(osb, OCFS2_I(inode)->ip_blkno, bh, 0, inode); + if (status < 0) { + mlog_errno(status); + goto bail; + } + + status = 0; + +bail: + if (inode) { + if (status) + iput(inode); + else + *ret_inode = inode; + } + return status; +} + /* Does the actual journal replay and marks the journal inode as * clean. Will only replay if the journal inode is marked dirty. */ static int ocfs2_replay_journal(struct ocfs2_super *osb, @@ -963,22 +1004,35 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, struct ocfs2_dinode *fe; journal_t *journal = NULL; struct buffer_head *bh = NULL; + u32 slot_reco_gen; - inode = ocfs2_get_system_file_inode(osb, JOURNAL_SYSTEM_INODE, - slot_num); - if (inode == NULL) { - status = -EACCES; + status = ocfs2_read_journal_inode(osb, slot_num, &bh, &inode); + if (status) { mlog_errno(status); goto done; } - if (is_bad_inode(inode)) { - status = -EACCES; - iput(inode); - inode = NULL; - mlog_errno(status); + + fe = (struct ocfs2_dinode *)bh->b_data; + slot_reco_gen = le32_to_cpu(fe->id1.journal1.ij_recovery_generation); + brelse(bh); + bh = NULL; + + /* + * As the fs recovery is asynchronous, there is a small chance that another + * node mounted (and recovered) the slot before the recovery thread could + * get the lock. To handle that, we dirty read the journal inode for that + * slot to get the recovery generation. If it is different than what we + * expected, the slot has been recovered. If not, it needs recovery. + */ + if (osb->slot_recovery_generation[slot_num] != slot_reco_gen) { + mlog(0, "Slot %u already recovered (old/new=%u/%u)\n", slot_num, + osb->slot_recovery_generation[slot_num], slot_reco_gen); + osb->slot_recovery_generation[slot_num] = slot_reco_gen; + status = -EBUSY; goto done; } - SET_INODE_JOURNAL(inode); + + /* Continue with recovery as the journal has not yet been recovered */ status = ocfs2_inode_lock_full(inode, &bh, 1, OCFS2_META_LOCK_RECOVERY); if (status < 0) { @@ -992,9 +1046,12 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, fe = (struct ocfs2_dinode *) bh->b_data; flags = le32_to_cpu(fe->id1.journal1.ij_flags); + slot_reco_gen = le32_to_cpu(fe->id1.journal1.ij_recovery_generation); if (!(flags & OCFS2_JOURNAL_DIRTY_FL)) { mlog(0, "No recovery required for node %d\n", node_num); + /* Refresh recovery generation for the slot */ + osb->slot_recovery_generation[slot_num] = slot_reco_gen; goto done; } @@ -1042,6 +1099,11 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, flags &= ~OCFS2_JOURNAL_DIRTY_FL; fe->id1.journal1.ij_flags = cpu_to_le32(flags); + /* Increment recovery generation to indicate successful recovery */ + le32_add_cpu(&(fe->id1.journal1.ij_recovery_generation), 1); + osb->slot_recovery_generation[slot_num] = + le32_to_cpu(fe->id1.journal1.ij_recovery_generation); + status = ocfs2_write_block(osb, bh, inode); if (status < 0) mlog_errno(status); @@ -1098,8 +1160,12 @@ static int ocfs2_recover_node(struct ocfs2_super *osb, slot_num = ocfs2_node_num_to_slot(si, node_num); if (slot_num == OCFS2_INVALID_SLOT) { - status = 0; mlog(0, "no slot for this node, so no recovery required.\n"); + /* Refresh all journal recovery generations from disk */ + status = ocfs2_check_journals_nolocks(osb); + if (status && status != -EROFS) + mlog_errno(status); + status = 0; goto done; } @@ -1107,6 +1173,13 @@ static int ocfs2_recover_node(struct ocfs2_super *osb, status = ocfs2_replay_journal(osb, node_num, slot_num); if (status < 0) { + if (status == -EBUSY) { + mlog(0, "Skipping recovery for slot %u (node %u) " + "as another node has recovered it\n", slot_num, + node_num); + status = 0; + goto done; + } mlog_errno(status); goto done; } @@ -1190,14 +1263,35 @@ int ocfs2_mark_dead_nodes(struct ocfs2_super *osb) { int status, i, node_num; struct ocfs2_slot_info *si = osb->slot_info; + struct buffer_head *bh = NULL; + struct ocfs2_dinode *fe; + struct inode *inode = NULL; /* This is called with the super block cluster lock, so we * know that the slot map can't change underneath us. */ spin_lock(&si->si_lock); for(i = 0; i < si->si_num_slots; i++) { + /* Read journal inode to get the recovery generation */ + status = ocfs2_read_journal_inode(osb, i, &bh, &inode); + if (status) { + mlog_errno(status); + goto bail; + } + fe = (struct ocfs2_dinode *)bh->b_data; + osb->slot_recovery_generation[i] = + le32_to_cpu(fe->id1.journal1.ij_recovery_generation); + brelse(bh); + iput(inode); + bh = NULL; + inode = NULL; + + mlog(0, "Slot %u recovery generation is %u\n", i, + osb->slot_recovery_generation[i]); + if (i == osb->slot_num) continue; + if (ocfs2_is_empty_slot(si, i)) continue; @@ -1464,49 +1558,45 @@ static int ocfs2_commit_thread(void *arg) return 0; } -/* Look for a dirty journal without taking any cluster locks. Used for - * hard readonly access to determine whether the file system journals - * require recovery. */ +/* Reads all the journal inodes without taking any cluster locks. Used + * for hard readonly access to determine whether any journal requires + * recovery. Also used to refresh the recovery generation numbers after + * a journal has been recovered by another node. + */ int ocfs2_check_journals_nolocks(struct ocfs2_super *osb) { int ret = 0; unsigned int slot; - struct buffer_head *di_bh; + struct buffer_head *di_bh = NULL; struct ocfs2_dinode *di; struct inode *journal = NULL; + int journal_dirty = 0; for(slot = 0; slot < osb->max_slots; slot++) { - journal = ocfs2_get_system_file_inode(osb, - JOURNAL_SYSTEM_INODE, - slot); - if (!journal || is_bad_inode(journal)) { - ret = -EACCES; - mlog_errno(ret); - goto out; - } - - di_bh = NULL; - ret = ocfs2_read_block(osb, OCFS2_I(journal)->ip_blkno, &di_bh, - 0, journal); - if (ret < 0) { + ret = ocfs2_read_journal_inode(osb, slot, &di_bh, &journal); + if (ret) { mlog_errno(ret); goto out; } di = (struct ocfs2_dinode *) di_bh->b_data; + osb->slot_recovery_generation[slot] = + le32_to_cpu(di->id1.journal1.ij_recovery_generation); + if (le32_to_cpu(di->id1.journal1.ij_flags) & OCFS2_JOURNAL_DIRTY_FL) - ret = -EROFS; + journal_dirty = 1; brelse(di_bh); - if (ret) - break; + di_bh = NULL; } out: if (journal) iput(journal); + if (journal_dirty) + ret = -EROFS; return ret; } diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h index 52f02fe..dc27100 100644 --- a/fs/ocfs2/journal.h +++ b/fs/ocfs2/journal.h @@ -157,7 +157,8 @@ int ocfs2_journal_init(struct ocfs2_journal *journal, void ocfs2_journal_shutdown(struct ocfs2_super *osb); int ocfs2_journal_wipe(struct ocfs2_journal *journal, int full); -int ocfs2_journal_load(struct ocfs2_journal *journal, int local); +int ocfs2_journal_load(struct ocfs2_journal *journal, int local, + int replayed); int ocfs2_check_journals_nolocks(struct ocfs2_super *osb); void ocfs2_recovery_thread(struct ocfs2_super *osb, int node_num); diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h index 9546470..db52558 100644 --- a/fs/ocfs2/ocfs2.h +++ b/fs/ocfs2/ocfs2.h @@ -193,6 +193,8 @@ struct ocfs2_super struct ocfs2_slot_info *slot_info; + u32 *slot_recovery_generation; + spinlock_t node_map_lock; struct ocfs2_node_map recovery_map; diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index ff31722..94567ce 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -1497,6 +1497,15 @@ static int ocfs2_initialize_super(struct super_block *sb, } mlog(0, "max_slots for this device: %u\n", osb->max_slots); + osb->slot_recovery_generation = kcalloc(osb->max_slots, + sizeof(*osb->slot_recovery_generation), + GFP_KERNEL); + if (!osb->slot_recovery_generation) { + status = -ENOMEM; + mlog_errno(status); + goto bail; + } + init_waitqueue_head(&osb->osb_wipe_event); osb->osb_orphan_wipes = kcalloc(osb->max_slots, sizeof(*osb->osb_orphan_wipes), @@ -1739,7 +1748,7 @@ static int ocfs2_check_volume(struct ocfs2_super *osb) local = ocfs2_mount_local(osb); /* will play back anything left in the journal. */ - ocfs2_journal_load(osb->journal, local); + ocfs2_journal_load(osb->journal, local, dirty); if (dirty) { /* recover my local alloc if we didn't unmount cleanly. */ @@ -1754,8 +1763,6 @@ static int ocfs2_check_volume(struct ocfs2_super *osb) * ourselves as mounted. */ } - mlog(0, "Journal loaded.\n"); - status = ocfs2_load_local_alloc(osb); if (status < 0) { mlog_errno(status); @@ -1801,6 +1808,7 @@ static void ocfs2_delete_osb(struct ocfs2_super *osb) ocfs2_free_slot_info(osb->slot_info); kfree(osb->osb_orphan_wipes); + kfree(osb->slot_recovery_generation); /* FIXME * This belongs in journal shutdown, but because we have to * allocate osb->journal at the start of ocfs2_initalize_osb(), -- 1.5.4.5 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery 2008-07-11 23:27 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery Sunil Mushran @ 2008-07-11 23:52 ` Joel Becker 0 siblings, 0 replies; 11+ messages in thread From: Joel Becker @ 2008-07-11 23:52 UTC (permalink / raw) To: ocfs2-devel On Fri, Jul 11, 2008 at 04:27:21PM -0700, Sunil Mushran wrote: > + if (replayed) > + le32_add_cpu(&(fe->id1.journal1.ij_recovery_generation), 1); You're doing this a lot, and it's a bunch to read. Can we have: static void ocfs2_bump_recovery_generation(struct ocfs2_dinode *di) { le32_add_cpu(&(di->id1.journal1.ij_recovery_generation), 1); } static u32 ocfs2_get_recovery_generation(struct ocfs2_dinode *di) { return le32_to_cpu(di->id1.journal1.ij_recovery_generation); } I just think it'll read better. > +static int ocfs2_read_journal_inode(struct ocfs2_super *osb, > + int slot_num, > + struct buffer_head **bh, > + struct inode **ret_inode) Perhaps you want to make ret_inode optional? It seems to be there are a couple paths that are just iput()ing it. So, you'd do: > +{ > + int status = -EACCES; > + struct inode *inode = NULL; > + > + BUG_ON(slot_num >= osb->max_slots); > + - + *ret_inode = NULL; - + > + inode = ocfs2_get_system_file_inode(osb, JOURNAL_SYSTEM_INODE, > + slot_num); > + if (!inode || is_bad_inode(inode)) { > + mlog_errno(status); > + goto bail; > + } > + SET_INODE_JOURNAL(inode); > + > + status = ocfs2_read_block(osb, OCFS2_I(inode)->ip_blkno, bh, 0, inode); > + if (status < 0) { > + mlog_errno(status); > + goto bail; > + } > + > + status = 0; > + > +bail: > + if (inode) { - + if (status) + + if (status || !ret_inode) > + iput(inode); > + else > + *ret_inode = inode; > + } > + return status; > +} <snip> > flags = le32_to_cpu(fe->id1.journal1.ij_flags); > + slot_reco_gen = le32_to_cpu(fe->id1.journal1.ij_recovery_generation); Wouldn't this be so much nicer? + slot_reco_gen = ocfs2_get_recovery_generation(fe); > @@ -1190,14 +1263,35 @@ int ocfs2_mark_dead_nodes(struct ocfs2_super *osb) > { > int status, i, node_num; > struct ocfs2_slot_info *si = osb->slot_info; > + struct buffer_head *bh = NULL; > + struct ocfs2_dinode *fe; If you are adding new dinodes, can we call them 'di' instead of 'fe'? > spin_lock(&si->si_lock); > for(i = 0; i < si->si_num_slots; i++) { > + /* Read journal inode to get the recovery generation */ > + status = ocfs2_read_journal_inode(osb, i, &bh, &inode); > + if (status) { > + mlog_errno(status); > + goto bail; > + } > + fe = (struct ocfs2_dinode *)bh->b_data; > + osb->slot_recovery_generation[i] = > + le32_to_cpu(fe->id1.journal1.ij_recovery_generation); > + brelse(bh); > + iput(inode); Here's a place you'd want to pass NULL instead of &inode to ocfs2_read_journal_inode(); Joel -- Life's Little Instruction Book #197 "Don't forget, a person's greatest emotional need is to feel appreciated." Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 11+ messages in thread
* [Ocfs2-devel] OCFS2 1.4: Proposed fix for mount/recovery race @ 2008-07-10 23:17 Sunil Mushran 2008-07-10 23:17 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery Sunil Mushran 0 siblings, 1 reply; 11+ messages in thread From: Sunil Mushran @ 2008-07-10 23:17 UTC (permalink / raw) To: ocfs2-devel Please review the patches as discussed in the last concall. In particular, please review the journal inode read code. I am not sure whether that is the most efficient. As in, I may be reading the blocks twice. Sunil ^ permalink raw reply [flat|nested] 11+ messages in thread
* [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery 2008-07-10 23:17 [Ocfs2-devel] OCFS2 1.4: Proposed fix for mount/recovery race Sunil Mushran @ 2008-07-10 23:17 ` Sunil Mushran 2008-07-11 0:36 ` Joel Becker 0 siblings, 1 reply; 11+ messages in thread From: Sunil Mushran @ 2008-07-10 23:17 UTC (permalink / raw) To: ocfs2-devel As the fs recovery is asynchronous, there is a small chance that another node count mount (and thus recover) the slot before the recovery thread gets to it. If this happens, the recovery thread will block indefinitely on the journal/slot lock as that lock will be held for the duration of the mount (by design) by the node assigned to that slot. The solution implemented is to keep track of the journal replays using a recovery generation in the journal inode, which will be incremented by the thread replaying that journal. The recovery thread, before attempting the blocking lock on the journal/slot lock, will compare the generation on disk with what it has cached and skip recovery if it does not match. This bug appears to have been inadvertently introduced during the mount/umount vote removal by mainline commit 34d024f84345807bf44163fac84e921513dde323. In the mount voting scheme, the messaging would indirectly indicate that the slot was being recovered. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> --- fs/ocfs2/journal.c | 156 +++++++++++++++++++++++++++++++++++++++++++++------- fs/ocfs2/journal.h | 2 +- fs/ocfs2/ocfs2.h | 2 + fs/ocfs2/super.c | 14 ++++- 4 files changed, 150 insertions(+), 24 deletions(-) diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index 741a4e2..2344f8b 100644 --- a/fs/ocfs2/journal.c +++ b/fs/ocfs2/journal.c @@ -57,7 +57,7 @@ static int __ocfs2_recovery_thread(void *arg); static int ocfs2_commit_cache(struct ocfs2_super *osb); static int ocfs2_wait_on_mount(struct ocfs2_super *osb); static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb, - int dirty); + int dirty, int replayed); static int ocfs2_trylock_journal(struct ocfs2_super *osb, int slot_num); static int ocfs2_recover_orphans(struct ocfs2_super *osb, @@ -437,7 +437,7 @@ done: } static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb, - int dirty) + int dirty, int replayed) { int status; unsigned int flags; @@ -467,6 +467,9 @@ static int ocfs2_journal_toggle_dirty(struct ocfs2_super *osb, flags &= ~OCFS2_JOURNAL_DIRTY_FL; fe->id1.journal1.ij_flags = cpu_to_le32(flags); + if (replayed) + le32_add_cpu(&(fe->id1.journal1.ij_reco_generation), 1); + status = ocfs2_write_block(osb, bh, journal->j_inode); if (status < 0) mlog_errno(status); @@ -541,7 +544,7 @@ void ocfs2_journal_shutdown(struct ocfs2_super *osb) * Do not toggle if flush was unsuccessful otherwise * will leave dirty metadata in a "clean" journal */ - status = ocfs2_journal_toggle_dirty(osb, 0); + status = ocfs2_journal_toggle_dirty(osb, 0, 0); if (status < 0) mlog_errno(status); } @@ -584,7 +587,7 @@ static void ocfs2_clear_journal_error(struct super_block *sb, } } -int ocfs2_journal_load(struct ocfs2_journal *journal, int local) +int ocfs2_journal_load(struct ocfs2_journal *journal, int local, int replay) { int status = 0; struct ocfs2_super *osb; @@ -603,7 +606,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local) ocfs2_clear_journal_error(osb->sb, journal->j_journal, osb->slot_num); - status = ocfs2_journal_toggle_dirty(osb, 1); + status = ocfs2_journal_toggle_dirty(osb, 1, replay); if (status < 0) { mlog_errno(status); goto done; @@ -645,7 +648,7 @@ int ocfs2_journal_wipe(struct ocfs2_journal *journal, int full) goto bail; } - status = ocfs2_journal_toggle_dirty(journal->j_osb, 0); + status = ocfs2_journal_toggle_dirty(journal->j_osb, 0, 0); if (status < 0) mlog_errno(status); @@ -950,11 +953,57 @@ out: mlog_exit_void(); } +static int ocfs2_read_journal_recogen(struct ocfs2_super *osb, + int slot_num, + u32 *slot_reco_gen, + struct inode **ret_in) +{ + int status; + struct ocfs2_dinode *fe; + struct buffer_head *bh = NULL; + struct inode *inode = NULL; + + inode = ocfs2_get_system_file_inode(osb, JOURNAL_SYSTEM_INODE, + slot_num); + if (inode == NULL) { + status = -EACCES; + mlog_errno(status); + goto bail; + } + if (is_bad_inode(inode)) { + status = -EACCES; + iput(inode); + mlog_errno(status); + goto bail; + } + SET_INODE_JOURNAL(inode); + + status = ocfs2_read_block(osb, OCFS2_I(inode)->ip_blkno, &bh, 0, inode); + if (status < 0) { + iput(inode); + mlog_errno(status); + goto bail; + } + + fe = (struct ocfs2_dinode *)bh->b_data; + + *slot_reco_gen = le32_to_cpu(fe->id1.journal1.ij_reco_generation); + if (ret_in) + *ret_in = inode; + else + iput(inode); + + brelse(bh); +bail: + return status; +} + /* Does the actual journal replay and marks the journal inode as * clean. Will only replay if the journal inode is marked dirty. */ static int ocfs2_replay_journal(struct ocfs2_super *osb, int node_num, - int slot_num) + int slot_num, + int *busy) { int status; int got_lock = 0; @@ -963,22 +1012,33 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, struct ocfs2_dinode *fe; journal_t *journal = NULL; struct buffer_head *bh = NULL; + u32 recogen; - inode = ocfs2_get_system_file_inode(osb, JOURNAL_SYSTEM_INODE, - slot_num); - if (inode == NULL) { - status = -EACCES; + *busy = 0; + + status = ocfs2_read_journal_recogen(osb, slot_num, &recogen, &inode); + if (status) { mlog_errno(status); goto done; } - if (is_bad_inode(inode)) { - status = -EACCES; - iput(inode); - inode = NULL; - mlog_errno(status); + + /* + * As the fs recovery is asynchronous, there is a small chance that another + * node mounted (and recovered) the slot before the recovery thread could + * get the lock. To handle that, we dirty read the journal inode for that + * slot to get the recovery generation. If it is different than what we + * expected, the slot has been recovered. If not, it needs recovery. + */ + if (osb->slot_reco_generation[slot_num] != recogen) { + mlog(0, "Slot %u already recovered (old/new=%u/%u)\n", slot_num, + osb->slot_reco_generation[slot_num], recogen); + osb->slot_reco_generation[slot_num] = recogen; + *busy = 1; + status = 0; goto done; } - SET_INODE_JOURNAL(inode); + + /* Continue with recovery as the journal has not yet been recovered */ status = ocfs2_inode_lock_full(inode, &bh, 1, OCFS2_META_LOCK_RECOVERY); if (status < 0) { @@ -992,9 +1052,12 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, fe = (struct ocfs2_dinode *) bh->b_data; flags = le32_to_cpu(fe->id1.journal1.ij_flags); + recogen = le32_to_cpu(fe->id1.journal1.ij_reco_generation); if (!(flags & OCFS2_JOURNAL_DIRTY_FL)) { mlog(0, "No recovery required for node %d\n", node_num); + /* Refresh recovery generation for the slot */ + osb->slot_reco_generation[slot_num] = recogen; goto done; } @@ -1042,6 +1105,11 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, flags &= ~OCFS2_JOURNAL_DIRTY_FL; fe->id1.journal1.ij_flags = cpu_to_le32(flags); + /* Increment recovery generation to indicate successful recovery */ + le32_add_cpu(&(fe->id1.journal1.ij_reco_generation), 1); + osb->slot_reco_generation[slot_num] = + le32_to_cpu(fe->id1.journal1.ij_reco_generation); + status = ocfs2_write_block(osb, bh, inode); if (status < 0) mlog_errno(status); @@ -1066,6 +1134,30 @@ done: return status; } +/* Refresh recovery generations of all slots */ +static int ocfs2_refresh_all_journal_recogens(struct ocfs2_super *osb) +{ + int i; + int status = 0; + + for (i = 0; i < osb->max_slots; ++i) { + if (i == osb->slot_num) + continue; + status = ocfs2_read_journal_recogen(osb, i, + &(osb->slot_reco_generation[i]), + NULL); + if (status) { + mlog_errno(status); + goto bail; + } + mlog(0, "Slot %u recovery generation is %u\n", i, + osb->slot_reco_generation[i]); + } + +bail: + return status; +} + /* * Do the most important parts of node recovery: * - Replay it's journal @@ -1081,7 +1173,7 @@ done: static int ocfs2_recover_node(struct ocfs2_super *osb, int node_num) { - int status = 0; + int status = 0, busy = 0; int slot_num; struct ocfs2_slot_info *si = osb->slot_info; struct ocfs2_dinode *la_copy = NULL; @@ -1098,19 +1190,28 @@ static int ocfs2_recover_node(struct ocfs2_super *osb, slot_num = ocfs2_node_num_to_slot(si, node_num); if (slot_num == OCFS2_INVALID_SLOT) { - status = 0; mlog(0, "no slot for this node, so no recovery required.\n"); + /* Refresh all journal recoevery generations from disk */ + status = ocfs2_refresh_all_journal_recogens(osb); + if (status) + mlog_errno(status); goto done; } mlog(0, "node %d was using slot %d\n", node_num, slot_num); - status = ocfs2_replay_journal(osb, node_num, slot_num); + status = ocfs2_replay_journal(osb, node_num, slot_num, &busy); if (status < 0) { mlog_errno(status); goto done; } + if (busy) { + mlog(0, "Skipping recovery for slot %u (node %u) " + "as another node has recovered it\n", slot_num, node_num); + goto done; + } + /* Stamp a clean local alloc file AFTER recovering the journal... */ status = ocfs2_begin_local_alloc_recovery(osb, slot_num, &la_copy); if (status < 0) { @@ -1190,14 +1291,26 @@ int ocfs2_mark_dead_nodes(struct ocfs2_super *osb) { int status, i, node_num; struct ocfs2_slot_info *si = osb->slot_info; + u32 recogen; /* This is called with the super block cluster lock, so we * know that the slot map can't change underneath us. */ spin_lock(&si->si_lock); for(i = 0; i < si->si_num_slots; i++) { + /* Read the recovery generation */ + status = ocfs2_read_journal_recogen(osb, i, &recogen, NULL); + if (status) { + mlog_errno(status); + goto bail; + } + osb->slot_reco_generation[i] = recogen; + + mlog(0, "Slot %u recovery generation is %u\n", i, recogen); + if (i == osb->slot_num) continue; + if (ocfs2_is_empty_slot(si, i)) continue; @@ -1495,6 +1608,9 @@ int ocfs2_check_journals_nolocks(struct ocfs2_super *osb) di = (struct ocfs2_dinode *) di_bh->b_data; + osb->slot_reco_generation[slot] = + le32_to_cpu(di->id1.journal1.ij_reco_generation); + if (le32_to_cpu(di->id1.journal1.ij_flags) & OCFS2_JOURNAL_DIRTY_FL) ret = -EROFS; diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h index 52f02fe..9066bf0 100644 --- a/fs/ocfs2/journal.h +++ b/fs/ocfs2/journal.h @@ -157,7 +157,7 @@ int ocfs2_journal_init(struct ocfs2_journal *journal, void ocfs2_journal_shutdown(struct ocfs2_super *osb); int ocfs2_journal_wipe(struct ocfs2_journal *journal, int full); -int ocfs2_journal_load(struct ocfs2_journal *journal, int local); +int ocfs2_journal_load(struct ocfs2_journal *journal, int local, int replay); int ocfs2_check_journals_nolocks(struct ocfs2_super *osb); void ocfs2_recovery_thread(struct ocfs2_super *osb, int node_num); diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h index 9546470..4af81ee 100644 --- a/fs/ocfs2/ocfs2.h +++ b/fs/ocfs2/ocfs2.h @@ -193,6 +193,8 @@ struct ocfs2_super struct ocfs2_slot_info *slot_info; + u32 *slot_reco_generation; + spinlock_t node_map_lock; struct ocfs2_node_map recovery_map; diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index ff31722..8540ab6 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -1497,6 +1497,15 @@ static int ocfs2_initialize_super(struct super_block *sb, } mlog(0, "max_slots for this device: %u\n", osb->max_slots); + osb->slot_reco_generation = kcalloc(osb->max_slots, + sizeof(*osb->slot_reco_generation), + GFP_KERNEL); + if (!osb->slot_reco_generation) { + status = -ENOMEM; + mlog_errno(status); + goto bail; + } + init_waitqueue_head(&osb->osb_wipe_event); osb->osb_orphan_wipes = kcalloc(osb->max_slots, sizeof(*osb->osb_orphan_wipes), @@ -1739,7 +1748,7 @@ static int ocfs2_check_volume(struct ocfs2_super *osb) local = ocfs2_mount_local(osb); /* will play back anything left in the journal. */ - ocfs2_journal_load(osb->journal, local); + ocfs2_journal_load(osb->journal, local, dirty); if (dirty) { /* recover my local alloc if we didn't unmount cleanly. */ @@ -1754,8 +1763,6 @@ static int ocfs2_check_volume(struct ocfs2_super *osb) * ourselves as mounted. */ } - mlog(0, "Journal loaded.\n"); - status = ocfs2_load_local_alloc(osb); if (status < 0) { mlog_errno(status); @@ -1801,6 +1808,7 @@ static void ocfs2_delete_osb(struct ocfs2_super *osb) ocfs2_free_slot_info(osb->slot_info); kfree(osb->osb_orphan_wipes); + kfree(osb->slot_reco_generation); /* FIXME * This belongs in journal shutdown, but because we have to * allocate osb->journal at the start of ocfs2_initalize_osb(), -- 1.5.4.3 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery 2008-07-10 23:17 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery Sunil Mushran @ 2008-07-11 0:36 ` Joel Becker 0 siblings, 0 replies; 11+ messages in thread From: Joel Becker @ 2008-07-11 0:36 UTC (permalink / raw) To: ocfs2-devel On Thu, Jul 10, 2008 at 04:17:51PM -0700, Sunil Mushran wrote: > As the fs recovery is asynchronous, there is a small chance that another > node count mount (and thus recover) the slot before the recovery thread node can mount > -int ocfs2_journal_load(struct ocfs2_journal *journal, int local) > +int ocfs2_journal_load(struct ocfs2_journal *journal, int local, int replay) I'd call it 'replayed' here, just like in mark_dirty, because it's a status (I already replayed it). > +static int ocfs2_read_journal_recogen(struct ocfs2_super *osb, > + int slot_num, > + u32 *slot_reco_gen, > + struct inode **ret_in) I don't like this name. I keep wanting: static int ocfs2_read_journal_inode(*osb, slot_num, **ret_in, *reco_gen); Which could also be used in check_journal_nolocks() in place of the get_system_file_inode() call. Heck, you could use it as the common call for loading the journal inode, passing NULL for the reco_gen when unwarranted. Basically, you are creating two ways to load the journal inode here, and I wonder if we could just make it a new single method. > /* Does the actual journal replay and marks the journal inode as > * clean. Will only replay if the journal inode is marked dirty. */ > static int ocfs2_replay_journal(struct ocfs2_super *osb, > int node_num, > - int slot_num) > + int slot_num, > + int *busy) I think we want to use -EBUSY instead of *busy - it's in line with the rest of the code, and it's perfectly logical. Joel > { > int status; > int got_lock = 0; > @@ -963,22 +1012,33 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, > struct ocfs2_dinode *fe; > journal_t *journal = NULL; > struct buffer_head *bh = NULL; > + u32 recogen; > > - inode = ocfs2_get_system_file_inode(osb, JOURNAL_SYSTEM_INODE, > - slot_num); > - if (inode == NULL) { > - status = -EACCES; > + *busy = 0; > + > + status = ocfs2_read_journal_recogen(osb, slot_num, &recogen, &inode); > + if (status) { > mlog_errno(status); > goto done; > } > - if (is_bad_inode(inode)) { > - status = -EACCES; > - iput(inode); > - inode = NULL; > - mlog_errno(status); > + > + /* > + * As the fs recovery is asynchronous, there is a small chance that another > + * node mounted (and recovered) the slot before the recovery thread could > + * get the lock. To handle that, we dirty read the journal inode for that > + * slot to get the recovery generation. If it is different than what we > + * expected, the slot has been recovered. If not, it needs recovery. > + */ > + if (osb->slot_reco_generation[slot_num] != recogen) { > + mlog(0, "Slot %u already recovered (old/new=%u/%u)\n", slot_num, > + osb->slot_reco_generation[slot_num], recogen); > + osb->slot_reco_generation[slot_num] = recogen; > + *busy = 1; > + status = 0; > goto done; > } > - SET_INODE_JOURNAL(inode); > + > + /* Continue with recovery as the journal has not yet been recovered */ > > status = ocfs2_inode_lock_full(inode, &bh, 1, OCFS2_META_LOCK_RECOVERY); > if (status < 0) { > @@ -992,9 +1052,12 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, > fe = (struct ocfs2_dinode *) bh->b_data; > > flags = le32_to_cpu(fe->id1.journal1.ij_flags); > + recogen = le32_to_cpu(fe->id1.journal1.ij_reco_generation); > > if (!(flags & OCFS2_JOURNAL_DIRTY_FL)) { > mlog(0, "No recovery required for node %d\n", node_num); > + /* Refresh recovery generation for the slot */ > + osb->slot_reco_generation[slot_num] = recogen; > goto done; > } > > @@ -1042,6 +1105,11 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, > flags &= ~OCFS2_JOURNAL_DIRTY_FL; > fe->id1.journal1.ij_flags = cpu_to_le32(flags); > > + /* Increment recovery generation to indicate successful recovery */ > + le32_add_cpu(&(fe->id1.journal1.ij_reco_generation), 1); > + osb->slot_reco_generation[slot_num] = > + le32_to_cpu(fe->id1.journal1.ij_reco_generation); > + > status = ocfs2_write_block(osb, bh, inode); > if (status < 0) > mlog_errno(status); > @@ -1066,6 +1134,30 @@ done: > return status; > } > > +/* Refresh recovery generations of all slots */ > +static int ocfs2_refresh_all_journal_recogens(struct ocfs2_super *osb) > +{ > + int i; > + int status = 0; > + > + for (i = 0; i < osb->max_slots; ++i) { > + if (i == osb->slot_num) > + continue; > + status = ocfs2_read_journal_recogen(osb, i, > + &(osb->slot_reco_generation[i]), > + NULL); > + if (status) { > + mlog_errno(status); > + goto bail; > + } > + mlog(0, "Slot %u recovery generation is %u\n", i, > + osb->slot_reco_generation[i]); > + } > + > +bail: > + return status; > +} > + > /* > * Do the most important parts of node recovery: > * - Replay it's journal > @@ -1081,7 +1173,7 @@ done: > static int ocfs2_recover_node(struct ocfs2_super *osb, > int node_num) > { > - int status = 0; > + int status = 0, busy = 0; > int slot_num; > struct ocfs2_slot_info *si = osb->slot_info; > struct ocfs2_dinode *la_copy = NULL; > @@ -1098,19 +1190,28 @@ static int ocfs2_recover_node(struct ocfs2_super *osb, > > slot_num = ocfs2_node_num_to_slot(si, node_num); > if (slot_num == OCFS2_INVALID_SLOT) { What version is this patch against? This should read -ENOENT, not INVALID_SLOT. Regarding reading the journals twice - who cares? This is recovery - known to be slow. It's only a few extra reads, part of an operation that already reads and possibly writes. Joel -- "Win95 file and print sharing are for relatively friendly nets." - Paul Leach, Microsoft Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2008-07-15 0:31 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-07-12 0:25 [Ocfs2-devel] OCFS2 1.4: Proposed fix for mount/recovery race (Attempt 3) Sunil Mushran 2008-07-12 0:25 ` [Ocfs2-devel] [PATCH 1/2] ocfs2: Adds counter in struct ocfs2_dinode to track journal replays Sunil Mushran 2008-07-12 0:25 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery Sunil Mushran 2008-07-12 1:24 ` Joel Becker -- strict thread matches above, loose matches on Subject: below -- 2008-07-15 0:31 [Ocfs2-devel] Recovery/mount fixes for mainline Sunil Mushran 2008-07-15 0:31 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery Sunil Mushran 2008-07-14 20:12 [Ocfs2-devel] OCFS2 1.4: Proposed fix for mount/recovery race (Attempt 4) Sunil Mushran 2008-07-14 20:12 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery Sunil Mushran 2008-07-14 20:25 ` Joel Becker 2008-07-11 23:27 [Ocfs2-devel] OCFS2 1.4: Proposed fix for mount/recovery race (Attempt 2) Sunil Mushran 2008-07-11 23:27 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery Sunil Mushran 2008-07-11 23:52 ` Joel Becker 2008-07-10 23:17 [Ocfs2-devel] OCFS2 1.4: Proposed fix for mount/recovery race Sunil Mushran 2008-07-10 23:17 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Fix race between mount and recovery Sunil Mushran 2008-07-11 0:36 ` Joel Becker
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.