All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: Jan Kara <jack@suse.cz>,
	linux-ext4@vger.kernel.org, tytso@mit.edu, lczerner@redhat.com
Subject: Re: [PATCH 04/11] ext4: completed_io locking cleanup V4
Date: Thu, 4 Oct 2012 12:22:10 +0200	[thread overview]
Message-ID: <20121004102210.GD4641@quack.suse.cz> (raw)
In-Reply-To: <87txubsswa.fsf@openvz.org>

On Wed 03-10-12 15:21:25, Dmitry Monakhov wrote:
> On Tue, 2 Oct 2012 15:30:19 +0200, Jan Kara <jack@suse.cz> wrote:
> > On Tue 02-10-12 16:42:39, Dmitry Monakhov wrote:
> > > On Tue, 2 Oct 2012 13:11:06 +0200, Jan Kara <jack@suse.cz> wrote:
> > > > On Tue 02-10-12 14:57:22, Dmitry Monakhov wrote:
> > > > > On Tue, 2 Oct 2012 12:31:41 +0200, Jan Kara <jack@suse.cz> wrote:
> > > > > > On Tue 02-10-12 11:16:38, Dmitry Monakhov wrote:
> > > > > > > > > +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > > > > > > > +	while (!list_empty(&complete)) {
> > > > > > > > > +		io = list_entry(complete.next, ext4_io_end_t, list);
> > > > > > > > > +		io->flag &= ~EXT4_IO_END_UNWRITTEN;
> > > > > > > > > +		/* end_io context can not be destroyed now because it still
> > > > > > > > > +		 * used by queued worker. Worker thread will destroy it later */
> > > > > > > > > +		if (io->flag & EXT4_IO_END_QUEUED)
> > > > > > > > > +			list_del_init(&io->list);
> > > > > > > > > +		else
> > > > > > > > > +			list_move(&io->list, &to_free);
> > > > > > > > > +	}
> > > > > > > > > +	/* If we are called from worker context, it is time to clear queued
> > > > > > > > > +	 * flag, and destroy it's end_io if it was converted already */
> > > > > > > > > +	if (work_io) {
> > > > > > > > > +		work_io->flag &= ~EXT4_IO_END_QUEUED;
> > > > > > > > > +		if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
> > > > > > > > > +			list_add_tail(&work_io->list, &to_free);
> > > > > > > > >  	}
> > > > > > > > > -	list_del_init(&io->list);
> > > > > > > > >  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > > > > > > > -	(void) ext4_end_io_nolock(io);
> > > > > > > > > -	mutex_unlock(&inode->i_mutex);
> > > > > > > > > -free:
> > > > > > > > > -	ext4_free_io_end(io);
> > > > > > > > > +
> > > > > > > > > +	while (!list_empty(&to_free)) {
> > > > > > > > > +		io = list_entry(to_free.next, ext4_io_end_t, list);
> > > > > > > > > +		list_del_init(&io->list);
> > > > > > > > > +		ext4_free_io_end(io);
> > > > > > > > > +	}
> > > > > > > > > +	return ret;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +/*
> > > > > > > > > + * work on completed aio dio IO, to convert unwritten extents to extents
> > > > > > > > > + */
> > > > > > > > > +static void ext4_end_io_work(struct work_struct *work)
> > > > > > > > > +{
> > > > > > > > > +	ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
> > > > > > > > > +	ext4_do_flush_completed_IO(io->inode, io);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +int ext4_flush_completed_IO(struct inode *inode)
> > > > > > > > > +{
> > > > > > > > > +	return ext4_do_flush_completed_IO(inode, NULL);
> > > > > > > > >  }
> > > > > > > >   Also it seems that when ext4_flush_completed_IO() is called, workqueue
> > > > > > > > can have several IO structures queued in its local lists thus we miss them
> > > > > > > > here and don't properly wait for all conversions?
> > > > > > > No it is not. Because list drained atomically, and 
> > > > > > > add_complete_io will queue work only if list is empty.
> > > > > > > 
> > > > > > > Race between conversion and dequeue-process is not possible because
> > > > > > > we hold lock during entire walk of complete_list, so from external
> > > > > > > point of view we mark list as conversed(clear unwritten flag)
> > > > > > > happens atomically. I've drawn all possible situations and race not
> > > > > > > happen. If you know any please let me know.
> > > > > >   I guess I'm missing something obvious. So let's go step by step:
> > > > > > 1) ext4_flush_completed_IO() must make sure there is no outstanding
> > > > > >    conversion for the inode.
> > > > > > 2) Now assume we have non-empty i_completed_io_list - thus work is queued.
> > > > > > 3) The following situation seems to be possible:
> > > > > > 
> > > > > > CPU1					CPU2
> > > > > > (worker thread)				(truncate)
> > > > > > ext4_end_io_work()
> > > > > >   ext4_do_flush_completed_IO()
> > > > > >     spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > > > >     dump_completed_IO(inode);
> > > > > >     list_replace_init(&ei->i_completed_io_list, &unwritten);
> > > > > >     spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > > > > 
> > > > > > 					ext4_flush_completed_IO()
> > > > > > 					  ext4_do_flush_completed_IO()
> > > > > > 					    - sees empty i_completed_io_list
> > > > > > 					     => exits
> > > > > > 
> > > > > >   But we still have some conversions pending in 'unwritten' list. What am
> > > > > > I missing?
> > > > > Indeed, I've simply missed that case. The case which result silently
> > > > > broke integrity sync ;(
> > > > > Thank you for spotting this. I'll be back with updated version.
> > > >   Umm, actually, I was thinking about it and ext4_flush_completed_IO()
> > > > seems to be unnecessary in fsync these days. We don't call aio_complete()
> > > > until we perform the conversion so what fsync does to such IO is undefined.
> > > > Such optimization is a separate matter though.
> > > Yes aio is ok, but integrity fsync after buffered write to unwritten
> > > extent is broken.
> > > 
> > > fsync()                             blkdev_completion          kwork
> > > ->filemap_write_and_wait_range
> > >                                     ->ext4_end_bio
> > >                                      ->end_page_writeback
> > >  <-- filemap_write_and_wait_range return  
> > >                                       ->ext4_add_complete_io
> > > 
> > >                                                               ->ext4_do_flush_completed_IO
> > >                                                                  ->list_replace_init
> > > ->ext4_flush_completed_IO
> > >   sees empty i_comleted_io_list but pended
> > >   conversion still exist
> > >                                                                 ->ext4_end_io
> > > 
> >   Correct. Thanks for pointing that out.
> In my deference I should say that integrity fsync was broken before my patch in
> case of buffered writes because end_page_writeback called before
> end_io added to ei->i_comlete_io_list
>  fsync()                                    blkdev_completion
>  ->filemap_write_and_wait_range
>                                             ->ext4_end_bio
>                                              ->end_page_writeback
>   <-- filemap_write_and_wait_range return  
>  ->ext4_flush_completed_IO
>    sees empty i_comleted_io_list but pended
>    conversion still exist
>                                                ->ext4_add_complete_io
  Right. Actually calling end_page_writeback() before we are sure page can
be correctly reloaded from disk (i.e. before all extent manipulations are
done) is asking for trouble - see e.g. mail
http://lists.openwall.net/linux-ext4/2011/06/08/12 and further discussion.
The discussion was somewhat open-ended but at that time calling
end_page_writeback() after extent conversion was problematic because of
i_mutex. Now we don't need i_mutex for extent conversion, so it is save to
call end_page_writeback() after we convert the extents. So moving
end_page_writeback() there would be good and it would simplify come logic
as well I believe - in particular fsync() would be simpler.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

  reply	other threads:[~2012-10-04 10:22 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-09-28 15:44 [PATCH 00/11] ext4: Bunch of DIO/AIO fixes V4 Dmitry Monakhov
2012-09-28 15:44 ` [PATCH 01/11] ext4: ext4_inode_info diet Dmitry Monakhov
2012-10-01 16:28   ` Jan Kara
2012-09-28 15:44 ` [PATCH 02/11] ext4: give i_aiodio_unwritten more appropriate name Dmitry Monakhov
2012-09-28 15:44 ` [PATCH 03/11] ext4: fix unwritten counter leakage Dmitry Monakhov
2012-10-01 16:37   ` Jan Kara
2012-09-28 15:44 ` [PATCH 04/11] ext4: completed_io locking cleanup V4 Dmitry Monakhov
2012-10-01 18:38   ` Jan Kara
2012-10-02  7:16     ` Dmitry Monakhov
2012-10-02 10:31       ` Jan Kara
2012-10-02 10:57         ` Dmitry Monakhov
2012-10-02 11:11           ` Jan Kara
2012-10-02 12:42             ` Dmitry Monakhov
2012-10-02 13:30               ` Jan Kara
2012-10-03 11:21                 ` Dmitry Monakhov
2012-10-04 10:22                   ` Jan Kara [this message]
2012-09-28 15:44 ` [PATCH 05/11] ext4: remove ext4_end_io() Dmitry Monakhov
2012-10-04 22:57   ` Anatol Pomozov
2012-10-05  4:28     ` Theodore Ts'o
2012-09-28 15:44 ` [PATCH 06/11] ext4: serialize dio nonlocked reads with defrag workers V3 Dmitry Monakhov
2012-10-01 16:39   ` Jan Kara
2012-09-28 15:44 ` [PATCH 07/11] ext4: serialize unlocked dio reads with truncate Dmitry Monakhov
2012-09-29  4:49   ` Theodore Ts'o
2012-09-29 11:43     ` Dmitry Monakhov
2012-09-28 15:44 ` [PATCH 08/11] ext4: endless truncate due to nonlocked dio readers V2 Dmitry Monakhov
2012-10-01 16:41   ` Jan Kara
2012-09-28 15:44 ` [PATCH 09/11] ext4: serialize truncate with owerwrite DIO workers V2 Dmitry Monakhov
2012-09-28 15:44 ` [PATCH 10/11] ext4: punch_hole should wait for DIO writers V2 Dmitry Monakhov
2012-10-01 16:46   ` Jan Kara
2012-09-28 15:44 ` [PATCH 11/11] ext4: fix ext_remove_space for punch_hole case Dmitry Monakhov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121004102210.GD4641@quack.suse.cz \
    --to=jack@suse.cz \
    --cc=dmonakhov@openvz.org \
    --cc=lczerner@redhat.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.