From: Dmitry Monakhov <dmonakhov@openvz.org>
To: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org, tytso@mit.edu, jack@suse.cz,
lczerner@redhat.com
Subject: Re: [PATCH 04/11] ext4: completed_io locking cleanup V4
Date: Tue, 02 Oct 2012 11:16:38 +0400 [thread overview]
Message-ID: <87vcetbay1.fsf@openvz.org> (raw)
In-Reply-To: <20121001183833.GF32092@quack.suse.cz>
On Mon, 1 Oct 2012 20:38:33 +0200, Jan Kara <jack@suse.cz> wrote:
> On Fri 28-09-12 19:44:04, Dmitry Monakhov wrote:
> Couple of language fixes:
> > Current unwritten extent conversion state-machine is very fuzzy.
> > - By unknown reason it want perform conversion under i_mutex. What for?
> ^^ For ^^^^^^^^ I'd just make it "performs"
> > My diagnosis:
> > We already protect extent tree with i_data_sem, truncate and punch_hole
> > should wait for DIO, so the only data we have to protect is end_io->flags
> > modification, but only flush_completed_IO and end_io_work modified this
> ^^ these
> > flags and we can serialize them via i_completed_io_lock.
> >
> > Currently all this games with mutex_trylock result in following deadlock
> ^^^ these ^ the
> > truncate: kworker:
> > ext4_setattr ext4_end_io_work
> > mutex_lock(i_mutex)
> > inode_dio_wait(inode) ->BLOCK
> > DEADLOCK<- mutex_trylock()
> > inode_dio_done()
> > #TEST_CASE1_BEGIN
> > MNT=/mnt_scrach
> > unlink $MNT/file
> > fallocate -l $((1024*1024*1024)) $MNT/file
> > aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
> > sleep 2
> > truncate -s 0 $MNT/file
> > #TEST_CASE1_END
> >
> > Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286
> >
> > This patch makes state machine simple and clean:
> >
> > (1) xxx_end_io schedule final extent conversion simply by calling
> > ext4_add_complete_io(), which append it to ei->i_completed_io_list
> > NOTE1: because of (2A) work should be queued only if
> > ->i_completed_io_list was empty, otherwise it work is scheduled already.
> ^^ remove this
> >
> > (2) ext4_flush_completed_IO is responsible for handling all pending
> > end_io from ei->i_completed_io_list
> > Flushing sequange consist of following stages:
> ^^ sequence ^ consists
> > A) LOCKED: Atomically drain completed_io_list to local_list
> > B) Perform extents conversion
> > C) LOCKED: move conveted io's to to_free list for final deletion
> ^^ converted
> > This logic depends on context which we was called from.
> ^^ were
> > D) Final end_io context destruction
> > NOTE1: i_mutex is no longer required because end_io->flags modification
> > protected by ei->ext4_complete_io_lock
> ^^ is protected
> >
> > Full list of changes:
> > - Move all completion end_io related routines to page-io.c in order to improve
> > logic locality
> > - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
> > - remove EXT4_IO_END_FSYNC
> > - Improve SMP scalability by removing useless i_mutex which does not
> > protect io->flags anymore.
> > - Reduce lock contention on i_completed_io_lock by optimizing list walk.
> > - Rename ext4_end_io_nolock to end4_end_io and make it static
> > - Check flush completion status to ext4_ext_punch_hole(). Because it is
> > not good idea to punch blocks from corrupted inode.
> >
> > Changes since V3 (in request to Jan's comments):
> > Fall back to active flush_completed_IO() approach in order to prevent
> > performance issues with nolocked DIO reads.
> > Changes since V2:
> > Fix use-after-free caused by race truncate vs end_io_work
> >
> > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> > ---
> > fs/ext4/ext4.h | 3 +-
> > fs/ext4/extents.c | 4 +-
> > fs/ext4/fsync.c | 81 -------------------------
> > fs/ext4/indirect.c | 6 +-
> > fs/ext4/inode.c | 25 +-------
> > fs/ext4/page-io.c | 171 ++++++++++++++++++++++++++++++++++------------------
> > 6 files changed, 121 insertions(+), 169 deletions(-)
> >
> ...
> > +static int ext4_do_flush_completed_IO(struct inode *inode,
> > + ext4_io_end_t *work_io)
> > +{
> > + ext4_io_end_t *io;
> > + struct list_head unwritten, complete, to_free;
> > + unsigned long flags;
> > + struct ext4_inode_info *ei = EXT4_I(inode);
> > + int err, ret = 0;
> > +
> > + INIT_LIST_HEAD(&complete);
> > + INIT_LIST_HEAD(&to_free);
> > +
> > + spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > + dump_completed_IO(inode);
> > + list_replace_init(&ei->i_completed_io_list, &unwritten);
> > + spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > +
> > + while (!list_empty(&unwritten)) {
> > + io = list_entry(unwritten.next, ext4_io_end_t, list);
> > + BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
> > + list_del_init(&io->list);
> > +
> > + err = ext4_end_io(io);
> > + if (unlikely(!ret && err))
> > + ret = err;
> > +
> > + list_add_tail(&io->list, &complete);
> > + }
> > + /* It is important to update all flags for all end_io in one shot w/o
> > + * dropping the lock.*/
> Why? It would seem that once io structures are moved from
> i_completed_io_list, they are unreachable by anyone else?
>
> > + spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > + while (!list_empty(&complete)) {
> > + io = list_entry(complete.next, ext4_io_end_t, list);
> > + io->flag &= ~EXT4_IO_END_UNWRITTEN;
> > + /* end_io context can not be destroyed now because it still
> > + * used by queued worker. Worker thread will destroy it later */
> > + if (io->flag & EXT4_IO_END_QUEUED)
> > + list_del_init(&io->list);
> > + else
> > + list_move(&io->list, &to_free);
> > + }
> > + /* If we are called from worker context, it is time to clear queued
> > + * flag, and destroy it's end_io if it was converted already */
> > + if (work_io) {
> > + work_io->flag &= ~EXT4_IO_END_QUEUED;
> > + if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
> > + list_add_tail(&work_io->list, &to_free);
> > }
> > - list_del_init(&io->list);
> > spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > - (void) ext4_end_io_nolock(io);
> > - mutex_unlock(&inode->i_mutex);
> > -free:
> > - ext4_free_io_end(io);
> > +
> > + while (!list_empty(&to_free)) {
> > + io = list_entry(to_free.next, ext4_io_end_t, list);
> > + list_del_init(&io->list);
> > + ext4_free_io_end(io);
> > + }
> > + return ret;
> > +}
> > +
> > +/*
> > + * work on completed aio dio IO, to convert unwritten extents to extents
> > + */
> > +static void ext4_end_io_work(struct work_struct *work)
> > +{
> > + ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
> > + ext4_do_flush_completed_IO(io->inode, io);
> > +}
> > +
> > +int ext4_flush_completed_IO(struct inode *inode)
> > +{
> > + return ext4_do_flush_completed_IO(inode, NULL);
> > }
> Also it seems that when ext4_flush_completed_IO() is called, workqueue
> can have several IO structures queued in its local lists thus we miss them
> here and don't properly wait for all conversions?
No it is not. Because list drained atomically, and
add_complete_io will queue work only if list is empty.
Race between conversion and dequeue-process is not possible because
we hold lock during entire walk of complete_list, so from external
point of view we mark list as conversed(clear unwritten flag)
happens atomically. I've drawn all possible situations and race not
happen. If you know any please let me know.
>
> Honza
next prev parent reply other threads:[~2012-10-02 7:16 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-09-28 15:44 [PATCH 00/11] ext4: Bunch of DIO/AIO fixes V4 Dmitry Monakhov
2012-09-28 15:44 ` [PATCH 01/11] ext4: ext4_inode_info diet Dmitry Monakhov
2012-10-01 16:28 ` Jan Kara
2012-09-28 15:44 ` [PATCH 02/11] ext4: give i_aiodio_unwritten more appropriate name Dmitry Monakhov
2012-09-28 15:44 ` [PATCH 03/11] ext4: fix unwritten counter leakage Dmitry Monakhov
2012-10-01 16:37 ` Jan Kara
2012-09-28 15:44 ` [PATCH 04/11] ext4: completed_io locking cleanup V4 Dmitry Monakhov
2012-10-01 18:38 ` Jan Kara
2012-10-02 7:16 ` Dmitry Monakhov [this message]
2012-10-02 10:31 ` Jan Kara
2012-10-02 10:57 ` Dmitry Monakhov
2012-10-02 11:11 ` Jan Kara
2012-10-02 12:42 ` Dmitry Monakhov
2012-10-02 13:30 ` Jan Kara
2012-10-03 11:21 ` Dmitry Monakhov
2012-10-04 10:22 ` Jan Kara
2012-09-28 15:44 ` [PATCH 05/11] ext4: remove ext4_end_io() Dmitry Monakhov
2012-10-04 22:57 ` Anatol Pomozov
2012-10-05 4:28 ` Theodore Ts'o
2012-09-28 15:44 ` [PATCH 06/11] ext4: serialize dio nonlocked reads with defrag workers V3 Dmitry Monakhov
2012-10-01 16:39 ` Jan Kara
2012-09-28 15:44 ` [PATCH 07/11] ext4: serialize unlocked dio reads with truncate Dmitry Monakhov
2012-09-29 4:49 ` Theodore Ts'o
2012-09-29 11:43 ` Dmitry Monakhov
2012-09-28 15:44 ` [PATCH 08/11] ext4: endless truncate due to nonlocked dio readers V2 Dmitry Monakhov
2012-10-01 16:41 ` Jan Kara
2012-09-28 15:44 ` [PATCH 09/11] ext4: serialize truncate with owerwrite DIO workers V2 Dmitry Monakhov
2012-09-28 15:44 ` [PATCH 10/11] ext4: punch_hole should wait for DIO writers V2 Dmitry Monakhov
2012-10-01 16:46 ` Jan Kara
2012-09-28 15:44 ` [PATCH 11/11] ext4: fix ext_remove_space for punch_hole case Dmitry Monakhov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87vcetbay1.fsf@openvz.org \
--to=dmonakhov@openvz.org \
--cc=jack@suse.cz \
--cc=lczerner@redhat.com \
--cc=linux-ext4@vger.kernel.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.