public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed
From: "Theodore Y. Ts'o" <tytso@mit.edu>
To: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Cc: <linux-ext4@vger.kernel.org>, Liu Bo <bo.liu@linux.alibaba.com>
Subject: Re: [PATCH v2 2/2] ext4: fix slow writeback under dioread_nolock and nodelalloc
Date: Wed, 23 Jan 2019 01:27:38 -0500	[thread overview]
Message-ID: <20190123062737.GC7597@mit.edu> (raw)
In-Reply-To: <20181215054840.5960-2-xiaoguang.wang@linux.alibaba.com>

On Sat, Dec 15, 2018 at 01:48:40PM +0800, Xiaoguang Wang wrote:
> With "nodelalloc", blocks are allocated at the time of writing, and with
> "dioread_nolock", these allocated blocks are marked as unwritten as well,
> so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped.

I've been looking at your patches, and it seems that a simpler way,
perhaps more maintainable approach in the long term is to change how
we write to newly allocated blocks.  Today, we have two ways of doing
this:

1) In the dioread_nolock case, we allocate blocks, insert an entry in
the extent tree with the blocks marked uninitialized, write the
blocks, and then mark the blocks initialized.

2) In the !dioread_nolock case, we allocate blocks, insert an entry to
the extent tree, write the blocks --- and if we start a commit, we
write out all dirty pages associated with that inode (in the default
data=writeback case) to avoid stale writes.

So what if we change the dioread_nolock case to do write the blocks
first, and *then* insert the entry into the extent tree?  This avoids
stale data getting exposed, either by a direct I/O read, or after a
crash (which means we avoid needing to do the force write-out).

So what we would need to do is to pass a flag to ext4_map_blocks()
which causes it to *not* make any on-disk changes.  Instead, it would
track the fact that blocks have be reserved in the buddy bitmap (this
is how we prevent blocks from being preallocated after they are
deleted, but before the transaction has been committed), and the
location of the assigned blocks in the extent_status tree.  Since no
on-disk changes are being made, we wouldn't need to hold the
transaction open.

Then in the callback after the blocks are written, using the starting
logical block number stored in the io_end structure, we either convert
the unwritten extents or actually insert the newly allocated blocks in
the extent tree and update the on-disk bitmap allocation bitmaps.

Once we get this working, it should be easy to make dioread_nolock for
1k block sizes; it keeps the time that the handle open very short; and
it completely obviates the need for data=writeback.

What do folks think?

						 - Ted

  reply	other threads:[~2019-01-23  6:27 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-15  5:48 [PATCH v2 1/2] ext4: try to merge unwritten extents who are also not under io Xiaoguang Wang
2018-12-15  5:48 ` [PATCH v2 2/2] ext4: fix slow writeback under dioread_nolock and nodelalloc Xiaoguang Wang
2019-01-23  6:27   ` Theodore Y. Ts'o [this message]
2019-01-23 12:48     ` Jan Kara
2019-01-25  2:02       ` Xiaoguang Wang
2019-01-24  1:34     ` Liu Bo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190123062737.GC7597@mit.edu \
    --to=tytso@mit.edu \
    --cc=bo.liu@linux.alibaba.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=xiaoguang.wang@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox