From: Zheng Liu <gnehzuil.liu@gmail.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org,
"Darrick J. Wong" <darrick.wong@oracle.com>,
Christoph Hellwig <hch@infradead.org>,
Zheng Liu <wenqing.lz@taobao.com>
Subject: Re: [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion
Date: Fri, 4 Jan 2013 12:26:01 +0800 [thread overview]
Message-ID: <20130104042601.GB27833@gmail.com> (raw)
In-Reply-To: <20130103105613.GA30695@quack.suse.cz>
On Thu, Jan 03, 2013 at 11:56:13AM +0100, Jan Kara wrote:
> On Tue 01-01-13 13:24:45, Zheng Liu wrote:
> > On Mon, Dec 31, 2012 at 10:58:15PM +0100, Jan Kara wrote:
> > > On Mon 24-12-12 15:55:41, Zheng Liu wrote:
> > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > >
> > > > Currently all unwritten extent conversion work is pushed into a workqueue to be
> > > > done because DIO end_io is in a irq context and this conversion needs to take
> > > > i_data_sem locking. But we couldn't take a semaphore in a irq context. After
> > > > tracking all extent status, we can first convert this unwritten extent in extent
> > > > status tree, and call aio_complete() and inode_dio_done() to notify upper level
> > > > that this dio has done. We don't need to be worried about exposing a stale data
> > > > because we first try to lookup in extent status tree. So it makes us to see the
> > > > latest extent status. Meanwhile we queue a work to convert this unwritten
> > > > extent in extent tree. After this improvement, reader also needn't wait this
> > > > conversion to be done when dioread_nolock is enabled.
> > > >
> > > > CC: Jan Kara <jack@suse.cz>
> > > > CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> > > > CC: Christoph Hellwig <hch@infradead.org>
> > > > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > > OK, so after reading other patches this should work fine. Just I think we
> > > should somehow mark in the extent status tree that the extent tree is
> > > inconsistent with what's on disk - something like extent dirty bit. It will
> > > be set for UNWRITTEN extents where conversion is pending logically it would
> > > also make sence to have it set for DELAYED extents. Then if we need to
> > > reclaim some extents due to memory pressure we know we have to keep dirty
> > > extents because those cache irreplacible information. What do you think?
> >
> > Dirty bit is a good idea for UNWRITTEN extent because we can feel free
> > to reclaim all WRITTEN extents and all UNWRITTEN extents that are
> > without dirty bit. But we can not reclaim DEALYED extents no matter
> > whether they are dirty or not because they are used to lookup an delayed
> > extent in fiemap, seek_data/hole, and bigalloc. So at least DEALYED
> > extent must be kept in status tree. That is why in step 1 status tree
> > only tracks all DELAYED extents in the tree.
> So I was thinking about this some more and also testing some code and I
> realized that using extent status tree won't be enough (sadly). In case of
> AIO DIO with O_SYNC set, we have to perform extent conversion on *disk*
> before we can complete the AIO. So extent status tree won't help us in any
> way.
Ah, I see. Conversion must be done in disk before aio_complete() is
called if AIO DIO with O_SYNC set. So now we must call aio_complete()
after ext4_convert_unwritten_extents() in ext4_end_io().
>
> Furthermore O_SYNC writes end up calling ->fsync() after IO is finished
> which currently waits for all unwritten extents to convert and that
> effectively deadlocks the conversion thread if there are more DIO
> conversions pending. To fix this we would have to hack around
> ext4_file_fsync() to avoid waiting in case of O_SYNC AIO writes and that
> gets nasty quickly. So I'm currently back to my original plan of completing
> IO only after extent conversion happens... I'll see how that works out.
>
> Eventually we could *optimize* that by doing the extent conversion only in
> the extent status tree if possible but I wouldn't bother with it right now.
> For one thing, I also realized we would probably have to somehow throttle
> writers so that there are not too many outstanding conversions (when we
> complete AIO only after the conversion is finished, writer is naturally
> limited by the amount of AIOs allowed).
Sorry, currently no any idea is in my mind. I will think about it. :-(
Thanks,
- Zheng
next prev parent reply other threads:[~2013-01-04 4:12 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-12-24 7:55 [RFC][PATCH 0/9 v1] ext4: extent status tree implementation (step2) Zheng Liu
2012-12-24 7:55 ` [RFC][PATCH 1/9 v1] ext4: fixup metadata reserve block warning when bigalloc and delalloc are enabled Zheng Liu
2012-12-24 7:55 ` [RFC][PATCH 2/9 v1] ext4: refine extent status tree Zheng Liu
2012-12-24 7:55 ` [RFC][PATCH 3/9 v1] ext4: add physical block and status member into " Zheng Liu
2012-12-31 21:49 ` Jan Kara
2013-01-01 5:16 ` Zheng Liu
2013-01-02 11:22 ` Jan Kara
2013-01-05 2:44 ` Zheng Liu
2013-01-08 1:27 ` Dave Chinner
2013-01-08 2:25 ` Zheng Liu
2012-12-24 7:55 ` [RFC][PATCH 4/9 v1] ext4: adjust interfaces of " Zheng Liu
2012-12-24 7:55 ` [RFC][PATCH 5/9 v1] ext4: track all extent status in " Zheng Liu
2012-12-24 7:55 ` [RFC][PATCH 6/9 v1] ext4: lookup block mapping " Zheng Liu
2012-12-24 7:55 ` [RFC][PATCH 7/9 v1] ext4: add a new convert function to convert an unwritten extent " Zheng Liu
2012-12-24 7:55 ` [RFC][PATCH 8/9 v1] ext4: refine unwritten extent conversion Zheng Liu
2012-12-31 16:36 ` Jan Kara
2012-12-31 17:04 ` Jan Kara
2012-12-31 21:58 ` Jan Kara
2013-01-01 5:24 ` Zheng Liu
2013-01-03 10:56 ` Jan Kara
2013-01-04 4:26 ` Zheng Liu [this message]
2012-12-24 7:55 ` [RFC][PATCH 9/9 v1] ext4: set dioread_nolock by default for extent-based files Zheng Liu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130104042601.GB27833@gmail.com \
--to=gnehzuil.liu@gmail.com \
--cc=darrick.wong@oracle.com \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=linux-ext4@vger.kernel.org \
--cc=wenqing.lz@taobao.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).