linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Zheng Liu <gnehzuil.liu@gmail.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Christoph Hellwig <hch@infradead.org>,
	Zheng Liu <wenqing.lz@taobao.com>
Subject: Re: [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion
Date: Fri, 4 Jan 2013 12:26:01 +0800	[thread overview]
Message-ID: <20130104042601.GB27833@gmail.com> (raw)
In-Reply-To: <20130103105613.GA30695@quack.suse.cz>

On Thu, Jan 03, 2013 at 11:56:13AM +0100, Jan Kara wrote:
> On Tue 01-01-13 13:24:45, Zheng Liu wrote:
> > On Mon, Dec 31, 2012 at 10:58:15PM +0100, Jan Kara wrote:
> > > On Mon 24-12-12 15:55:41, Zheng Liu wrote:
> > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > 
> > > > Currently all unwritten extent conversion work is pushed into a workqueue to be
> > > > done because DIO end_io is in a irq context and this conversion needs to take
> > > > i_data_sem locking.  But we couldn't take a semaphore in a irq context.  After
> > > > tracking all extent status, we can first convert this unwritten extent in extent
> > > > status tree, and call aio_complete() and inode_dio_done() to notify upper level
> > > > that this dio has done.  We don't need to be worried about exposing a stale data
> > > > because we first try to lookup in extent status tree.  So it makes us to see the
> > > > latest extent status.  Meanwhile we queue a work to convert this unwritten
> > > > extent in extent tree.  After this improvement, reader also needn't wait this
> > > > conversion to be done when dioread_nolock is enabled.
> > > > 
> > > > CC: Jan Kara <jack@suse.cz>
> > > > CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> > > > CC: Christoph Hellwig <hch@infradead.org>
> > > > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > >   OK, so after reading other patches this should work fine. Just I think we
> > > should somehow mark in the extent status tree that the extent tree is
> > > inconsistent with what's on disk - something like extent dirty bit. It will
> > > be set for UNWRITTEN extents where conversion is pending logically it would
> > > also make sence to have it set for DELAYED extents. Then if we need to
> > > reclaim some extents due to memory pressure we know we have to keep dirty
> > > extents because those cache irreplacible information. What do you think?
> > 
> > Dirty bit is a good idea for UNWRITTEN extent because we can feel free
> > to reclaim all WRITTEN extents and all UNWRITTEN extents that are
> > without dirty bit.  But we can not reclaim DEALYED extents no matter
> > whether they are dirty or not because they are used to lookup an delayed
> > extent in fiemap, seek_data/hole, and bigalloc.  So at least DEALYED
> > extent must be kept in status tree.  That is why in step 1 status tree
> > only tracks all DELAYED extents in the tree.
>   So I was thinking about this some more and also testing some code and I
> realized that using extent status tree won't be enough (sadly). In case of
> AIO DIO with O_SYNC set, we have to perform extent conversion on *disk*
> before we can complete the AIO. So extent status tree won't help us in any
> way.

Ah, I see.  Conversion must be done in disk before aio_complete() is
called if AIO DIO with O_SYNC set.  So now we must call aio_complete()
after ext4_convert_unwritten_extents() in ext4_end_io().

> 
> Furthermore O_SYNC writes end up calling ->fsync() after IO is finished
> which currently waits for all unwritten extents to convert and that
> effectively deadlocks the conversion thread if there are more DIO
> conversions pending. To fix this we would have to hack around
> ext4_file_fsync() to avoid waiting in case of O_SYNC AIO writes and that
> gets nasty quickly. So I'm currently back to my original plan of completing
> IO only after extent conversion happens... I'll see how that works out.
> 
> Eventually we could *optimize* that by doing the extent conversion only in
> the extent status tree if possible but I wouldn't bother with it right now.
> For one thing, I also realized we would probably have to somehow throttle
> writers so that there are not too many outstanding conversions (when we
> complete AIO only after the conversion is finished, writer is naturally
> limited by the amount of AIOs allowed).

Sorry, currently no any idea is in my mind.  I will think about it. :-(

Thanks,
                                                - Zheng

  reply	other threads:[~2013-01-04  4:12 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-12-24  7:55 [RFC][PATCH 0/9 v1] ext4: extent status tree implementation (step2) Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 1/9 v1] ext4: fixup metadata reserve block warning when bigalloc and delalloc are enabled Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 2/9 v1] ext4: refine extent status tree Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 3/9 v1] ext4: add physical block and status member into " Zheng Liu
2012-12-31 21:49   ` Jan Kara
2013-01-01  5:16     ` Zheng Liu
2013-01-02 11:22       ` Jan Kara
2013-01-05  2:44         ` Zheng Liu
2013-01-08  1:27           ` Dave Chinner
2013-01-08  2:25             ` Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 4/9 v1] ext4: adjust interfaces of " Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 5/9 v1] ext4: track all extent status in " Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 6/9 v1] ext4: lookup block mapping " Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 7/9 v1] ext4: add a new convert function to convert an unwritten extent " Zheng Liu
2012-12-24  7:55 ` [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion Zheng Liu
2012-12-31 16:36   ` Jan Kara
2012-12-31 17:04     ` Jan Kara
2012-12-31 21:58   ` Jan Kara
2013-01-01  5:24     ` Zheng Liu
2013-01-03 10:56       ` Jan Kara
2013-01-04  4:26         ` Zheng Liu [this message]
2012-12-24  7:55 ` [RFC][PATCH 9/9 v1] ext4: set dioread_nolock by default for extent-based files Zheng Liu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130104042601.GB27833@gmail.com \
    --to=gnehzuil.liu@gmail.com \
    --cc=darrick.wong@oracle.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=wenqing.lz@taobao.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).