Linux EXT4 FS development
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-ext4@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>,
	Ritesh Harjani <riteshh@linux.ibm.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	rookxu <brookxu.cn@gmail.com>,
	Ritesh Harjani <ritesh.list@gmail.com>
Subject: Re: [PATCH v3 7/8] ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list
Date: Fri, 27 Jan 2023 15:43:12 +0100	[thread overview]
Message-ID: <20230127144312.3m3hmcufcvxxp6f4@quack3> (raw)
In-Reply-To: <Y8jizbGg6l2WxJPF@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com>

Hi Ojaswin!

I'm sorry for a bit delayed reply...

On Thu 19-01-23 11:57:25, Ojaswin Mujoo wrote:
> On Tue, Jan 17, 2023 at 12:03:35PM +0100, Jan Kara wrote:
> > On Tue 17-01-23 16:00:47, Ojaswin Mujoo wrote:
> > > On Mon, Jan 16, 2023 at 01:23:34PM +0100, Jan Kara wrote:
> > > > > Since this covers the special case we discussed above, we will always
> > > > > un-delete the PA when we encounter the special case and we can then
> > > > > adjust for overlap and traverse the PA rbtree without any issues.
> > > > > 
> > > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > > > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> > > 
> > > Hi Jan,
> > > Thanks for the review, sharing some of my thoughts below.
> > > 
> > > > 
> > > > So I find this putting back of already deleted inode PA very fragile. For
> > > > example in current code I suspect you've missed a case in ext4_mb_put_pa()
> > > > which can mark inode PA (so it can then be spotted by
> > > > ext4_mb_pa_adjust_overlap() and marked as in use again) but
> > > > ext4_mb_put_pa() still goes on and destroys the PA.
> > > 
> > > The 2 code paths that clash here are:
> > > 
> > > ext4_mb_new_blocks() -> ext4_mb_release_context() -> ext4_mb_put_pa()
> > > ext4_mb_new_blocks() -> ext4_mb_normalize_request() -> ext4_mb_pa_adjust_overlap()
> > > 
> > > Since these are the only code paths from which these 2 functions are
> > > called, for a given inode, access will always be serialized by the upper
> > > level ei->i_data_sem, which is always taken when writing data blocks
> > > using ext4_mb_new_block(). 
> > 
> > Indeed, inode->i_data_sem prevents the race I was afraid of.
> >  
> > > From my understanding of the code, I feel only
> > > ext4_mb_discard_group_preallocations() can race against other functions
> > > that are modifying the PA rbtree since it does not take any inode locks.
> > > 
> > > That being said, I do understand your concerns regarding the solution,
> > > however I'm willing to work with the community to ensure our
> > > implementation of this undelete feature is as robust as possible. Along
> > > with fixing the bug reported here [1], I believe that it is also a good
> > > optimization to have especially when the disk is near full and we are
> > > seeing a lot of group discards going on. 
> > > 
> > > Also, in case the deleted PA completely lies inside our new range, it is
> > > much better to just undelete and use it rather than deleting the
> > > existing PA and reallocating the range again. I think the advantage
> > > would be even bigger in ext4_mb_use_preallocated() function where we can
> > > just undelete and use the PA and skip the entire allocation, incase original
> > > range lies in a deleted PA.
> > 
> > Thanks for explantion. However I think you're optimizing the wrong thing.
> > We are running out of space (to run ext4_mb_discard_group_preallocations()
> > at all) and we allocate from an area covered by PA that we've just decided
> > to discard - if anything relies on performance of the filesystem in ENOSPC
> > conditions it has serious problems no matter what. Sure, we should deliver
> > the result (either ENOSPC or some block allocation) in a reasonable time
> > but the performance does not really matter much because all the scanning
> > and flushing is going to slow down everything a lot anyway. One additional
> > scan of the rbtree is really negligible in this case. So what we should
> > rather optimize for in this case is the code simplicity and maintainability
> > of this rare corner-case that will also likely get only a small amount of
> > testing. And in terms of code simplicity the delete & restart solution
> > seems to be much better (at least as far as I'm imagining it - maybe the
> > code will prove me wrong ;)).
> Hi Jan,
> 
> So I did try out the 'rb_erase from ext4_mb_adjust_overlap() and retry' method,
> with ane extra pa_removed flag, but the locking is getting pretty messy. I'm
> not sure if such a design is possible is the lock we currently have. 
> 
> Basically, the issue I'm facing is that we are having to drop the
> locks read locks and accquire the write locks in
> ext4_mb_adjust_overlap(), which looks something like this:
> 
> 				spin_unlock(&tmp_pa->pa_lock);
> 				read_unlock(&ei->i_prealloc_lock);
> 
> 				write_lock(&ei->i_prealloc_lock);
> 				spin_lock(&tmp_pa->pa_lock);
> 
> We have to preserve the order and drop both tree and PA locks to avoid
> deadlocks.  With this approach, the issue is that in between dropping and
> accquiring this lock, the group discard path can actually go ahead and free the
> PA memory after calling rb erase on it, which can result in use after free in
> the adjust overlap path.  This is because the PA is freed without any locks in
> discard path, as it assumes no other thread will have a reference to it. This
> assumption was true earlier since our allocation path never gave up the rbtree
> lock however it is not possible with this approach now.  Essentially, the
> concept of having two different areas where a PA can be deleted is bringing in
> additional challenges and complexity, which might make things worse from a
> maintainers/reviewers point of view.

Right, I didn't realize that. That is nasty.

> After brainstorming a bit, I think there might be a few alternatives here:
> 
> 1. Instead of deleting PA in the adjust overlap thread, make it sleep till group
> discard path goes ahead and deletes/frees it. At this point we can wake it up and retry
> allocation. 
> 
> * Pros: We can be sure that PA would have been removed at the time of retry so
> we don't waste extra retries. C
> * Cons: Extra complexity in code. 
> 
> 2. Just go for a retry in adjust overlap without doing anything. In ideal case,
> by the time we start retrying the PA might be already removed. Worse case: We
> keep looping again and again since discard path has not deleted it yet.
> 
> * Pros: Simplest approach, code remains straightforward.
> * Cons: We can end up uselessly retrying if the discard path doesn't delete the PA fast enough.

Well, I think cond_resched() + goto retry would be OK here. We could also
cycle the corresponding group lock which would wait for
ext4_mb_discard_group_preallocations() to finish but that is going to burn
the CPU even more than the cond_resched() + retry as we'll be just spinning
on the spinlock. Sleeping is IMHO not warranted as the whole
ext4_mb_discard_group_preallocations() is running under a spinlock anyway
so it should better be a very short sleep.

Or actually I have one more possible solution: What the adjusting function
is doing that it looks up PA before and after ac->ac_o_ex.fe_logical and
trims start & end to not overlap these PAs. So we could just lookup these
two PAs (ignoring the deleted state) and then just iterate from these with
rb_prev() & rb_next() until we find not-deleted ones. What do you think? 
 
> 3. The approach of undeleting the PA (proposed in this patchset) that
> we've already discussed.
> 
> Now, to be honest, I still prefer the undelete PA approach as it makes more
> sense to me and I think the code is simple enough as there are not many paths
> that might race. Mostly just adjust_overlap and group discard or
> use_preallocated and group discard.

Yeah, I'm still not too keen on this but I'm willing to reconsider if 
above approach proves to be too expensive under ENOSPC conditions...

									Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  reply	other threads:[~2023-01-27 14:43 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-16  8:02 [PATCH v3 0/8] ext4: Convert inode preallocation list to an rbtree Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 1/8] ext4: Stop searching if PA doesn't satisfy non-extent file Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 2/8] ext4: Refactor code related to freeing PAs Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 3/8] ext4: Refactor code in ext4_mb_normalize_request() and ext4_mb_use_preallocated() Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 4/8] ext4: Move overlap assert logic into a separate function Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 5/8] ext4: Abstract out overlap fix/check logic in ext4_mb_normalize_request() Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 6/8] ext4: Convert pa->pa_inode_list and pa->pa_obj_lock into a union Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 7/8] ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list Ojaswin Mujoo
2023-01-16 12:23   ` Jan Kara
2023-01-17 10:30     ` Ojaswin Mujoo
2023-01-17 11:03       ` Jan Kara
2023-01-19  6:27         ` Ojaswin Mujoo
2023-01-27 14:43           ` Jan Kara [this message]
2023-02-03  8:36             ` Ojaswin Mujoo
2023-02-08 11:25               ` Ojaswin Mujoo
2023-02-09 10:54                 ` Jan Kara
2023-02-09 17:54                   ` Ojaswin Mujoo
2023-02-10 14:37                     ` Jan Kara
2023-02-13 17:58                       ` Ojaswin Mujoo
2023-02-14  8:50                         ` Jan Kara
2023-02-16 17:07                       ` Andreas Dilger
2023-03-17 12:40                         ` Ojaswin Mujoo
2023-01-16  8:02 ` [PATCH v3 8/8] ext4: Remove the logic to trim inode PAs Ojaswin Mujoo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230127144312.3m3hmcufcvxxp6f4@quack3 \
    --to=jack@suse.cz \
    --cc=brookxu.cn@gmail.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=ritesh.list@gmail.com \
    --cc=riteshh@linux.ibm.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox