From: Jan Kara <jack@suse.cz>
To: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: Jan Kara <jack@suse.cz>,
linux-ext4@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>,
Ritesh Harjani <riteshh@linux.ibm.com>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
rookxu <brookxu.cn@gmail.com>,
Ritesh Harjani <ritesh.list@gmail.com>
Subject: Re: [PATCH v3 7/8] ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list
Date: Fri, 27 Jan 2023 15:43:12 +0100
Message-ID: <20230127144312.3m3hmcufcvxxp6f4@quack3>
In-Reply-To: <Y8jizbGg6l2WxJPF@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com>
Hi Ojaswin!
I'm sorry for the somewhat delayed reply...

On Thu 19-01-23 11:57:25, Ojaswin Mujoo wrote:
> On Tue, Jan 17, 2023 at 12:03:35PM +0100, Jan Kara wrote:
> > On Tue 17-01-23 16:00:47, Ojaswin Mujoo wrote:
> > > On Mon, Jan 16, 2023 at 01:23:34PM +0100, Jan Kara wrote:
> > > > > Since this covers the special case we discussed above, we will always
> > > > > un-delete the PA when we encounter the special case and we can then
> > > > > adjust for overlap and traverse the PA rbtree without any issues.
> > > > >
> > > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > > > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> > >
> > > Hi Jan,
> > > Thanks for the review, sharing some of my thoughts below.
> > >
> > > >
> > > > So I find this putting back of already deleted inode PA very fragile. For
> > > > example in current code I suspect you've missed a case in ext4_mb_put_pa()
> > > > which can mark inode PA (so it can then be spotted by
> > > > ext4_mb_pa_adjust_overlap() and marked as in use again) but
> > > > ext4_mb_put_pa() still goes on and destroys the PA.
> > >
> > > The 2 code paths that clash here are:
> > >
> > > ext4_mb_new_blocks() -> ext4_mb_release_context() -> ext4_mb_put_pa()
> > > ext4_mb_new_blocks() -> ext4_mb_normalize_request() -> ext4_mb_pa_adjust_overlap()
> > >
> > > Since these are the only code paths from which these 2 functions are
> > > called, for a given inode, access will always be serialized by the upper
> > > level ei->i_data_sem, which is always taken when writing data blocks
> > > using ext4_mb_new_blocks().
> >
> > Indeed, inode->i_data_sem prevents the race I was afraid of.
> >
> > > From my understanding of the code, I feel only
> > > ext4_mb_discard_group_preallocations() can race against other functions
> > > that are modifying the PA rbtree since it does not take any inode locks.
> > >
> > > That being said, I do understand your concerns regarding the solution,
> > > however I'm willing to work with the community to ensure our
> > > implementation of this undelete feature is as robust as possible. Along
> > > with fixing the bug reported here [1], I believe that it is also a good
> > > optimization to have especially when the disk is near full and we are
> > > seeing a lot of group discards going on.
> > >
> > > Also, in case the deleted PA completely lies inside our new range, it is
> > > much better to just undelete and use it rather than deleting the
> > > existing PA and reallocating the range again. I think the advantage
> > > would be even bigger in the ext4_mb_use_preallocated() function, where we can
> > > just undelete and use the PA and skip the entire allocation in case the
> > > original range lies in a deleted PA.
> >
> > Thanks for the explanation. However I think you're optimizing the wrong thing.
> > We are running out of space (to run ext4_mb_discard_group_preallocations()
> > at all) and we allocate from an area covered by PA that we've just decided
> > to discard - if anything relies on performance of the filesystem in ENOSPC
> > conditions it has serious problems no matter what. Sure, we should deliver
> > the result (either ENOSPC or some block allocation) in a reasonable time
> > but the performance does not really matter much because all the scanning
> > and flushing is going to slow down everything a lot anyway. One additional
> > scan of the rbtree is really negligible in this case. So what we should
> > rather optimize for in this case is the code simplicity and maintainability
> > of this rare corner-case that will also likely get only a small amount of
> > testing. And in terms of code simplicity the delete & restart solution
> > seems to be much better (at least as far as I'm imagining it - maybe the
> > code will prove me wrong ;)).
> Hi Jan,
>
> So I did try out the 'rb_erase from ext4_mb_adjust_overlap() and retry' method,
> with an extra pa_removed flag, but the locking is getting pretty messy. I'm
> not sure if such a design is possible with the locks we currently have.
>
> Basically, the issue I'm facing is that we have to drop the read locks
> and acquire the write locks in ext4_mb_adjust_overlap(), which looks
> something like this:
>
> spin_unlock(&tmp_pa->pa_lock);
> read_unlock(&ei->i_prealloc_lock);
>
> write_lock(&ei->i_prealloc_lock);
> spin_lock(&tmp_pa->pa_lock);
>
> We have to preserve the order and drop both tree and PA locks to avoid
> deadlocks. With this approach, the issue is that in between dropping and
> reacquiring these locks, the group discard path can actually go ahead and free
> the PA memory after calling rb_erase() on it, which can result in a
> use-after-free in the adjust overlap path. This is because the PA is freed
> without any locks in the discard path, as it assumes no other thread will have
> a reference to it. This assumption was true earlier, since our allocation path
> never gave up the rbtree lock; however, it no longer holds with this approach.
> Essentially, the concept of having two different areas where a PA can be
> deleted is bringing in additional challenges and complexity, which might make
> things worse from a maintainer's/reviewer's point of view.

Right, I didn't realize that. That is nasty.

> After brainstorming a bit, I think there might be a few alternatives here:
>
> 1. Instead of deleting PA in the adjust overlap thread, make it sleep till group
> discard path goes ahead and deletes/frees it. At this point we can wake it up and retry
> allocation.
>
> * Pros: We can be sure that the PA would have been removed at the time of
> retry so we don't waste extra retries.
> * Cons: Extra complexity in code.
>
> 2. Just go for a retry in adjust overlap without doing anything. In the ideal
> case, by the time we start retrying the PA might be already removed. Worst
> case: We keep looping again and again since the discard path has not deleted
> it yet.
>
> * Pros: Simplest approach, code remains straightforward.
> * Cons: We can end up uselessly retrying if the discard path doesn't delete the PA fast enough.

Well, I think cond_resched() + goto retry would be OK here. We could also
cycle the corresponding group lock, which would wait for
ext4_mb_discard_group_preallocations() to finish, but that is going to burn
the CPU even more than the cond_resched() + retry as we'll be just spinning
on the spinlock. Sleeping is IMHO not warranted as the whole
ext4_mb_discard_group_preallocations() is running under a spinlock anyway,
so any sleep would be a very short one.

Or actually I have one more possible solution: What the adjusting function
does is look up the PAs before and after ac->ac_o_ex.fe_logical and trim
start & end so that they do not overlap these PAs. So we could just look up
these two PAs (ignoring the deleted state) and then iterate from them with
rb_prev() & rb_next() until we find non-deleted ones. What do you think?
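
To make the idea a bit more concrete, here is a rough userspace sketch of the
neighbour walk. It is not kernel code: the struct and function names are made
up for illustration, a sorted array stands in for the inode PA rbtree (so
index - 1 / index + 1 play the role of rb_prev() / rb_next()), and all locking
is left out. It also assumes no non-deleted PA contains the goal block.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct ext4_prealloc_space. */
struct pa_sketch {
	unsigned long lstart;	/* first logical block covered by the PA */
	unsigned long len;	/* length in blocks */
	bool deleted;		/* models pa_deleted */
};

/*
 * Trim [*start, *end) so that it does not overlap the nearest non-deleted
 * PAs around 'goal'. 'pas' must be sorted by lstart and non-overlapping.
 * Assumption: no non-deleted PA contains 'goal'.
 */
static void adjust_overlap_sketch(const struct pa_sketch *pas, size_t n,
				  unsigned long goal,
				  unsigned long *start, unsigned long *end)
{
	long left = -1, right = -1;
	size_t i;

	/* Look up the neighbours of 'goal', ignoring the deleted state. */
	for (i = 0; i < n; i++) {
		if (pas[i].lstart <= goal) {
			left = (long)i;
		} else {
			right = (long)i;
			break;
		}
	}

	/* Walk outwards (rb_prev()/rb_next() in the real tree) past deleted PAs. */
	while (left >= 0 && pas[left].deleted)
		left--;
	while (right >= 0 && pas[right].deleted)
		right = (right + 1 < (long)n) ? right + 1 : -1;

	/* Trim only against the live neighbours we ended up with. */
	if (left >= 0 && pas[left].lstart + pas[left].len > *start)
		*start = pas[left].lstart + pas[left].len;
	if (right >= 0 && pas[right].lstart < *end)
		*end = pas[right].lstart;
}
```

The point is that deleted PAs are simply skipped rather than undeleted or
waited for, so the adjusting path never needs to drop the tree lock.
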
> 3. The approach of undeleting the PA (proposed in this patchset) that
> we've already discussed.
>
> Now, to be honest, I still prefer the undelete PA approach as it makes more
> sense to me and I think the code is simple enough as there are not many paths
> that might race. Mostly just adjust_overlap and group discard or
> use_preallocated and group discard.

Yeah, I'm still not too keen on this but I'm willing to reconsider if
the above approach proves to be too expensive under ENOSPC conditions...

Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR