Re: [PATCH v3 09/13] ext4: fast-commit commit path changes

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH v3 09/13] ext4: fast-commit commit path changes
       [not found] <1571900042725.99617@xiaomi.com>
@ 2019-10-24 20:18 ` Theodore Y. Ts'o
       [not found]   ` <1572349386604.43878@xiaomi.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-24 20:18 UTC (permalink / raw)
  To: Xiaohui1 Li 李晓辉
  Cc: lixiaohui1@xiaomi.corp-partner.google.com,
	linux-ext4@vger.kernel.org

On Thu, Oct 24, 2019 at 06:54:44AM +0000, Xiaohui1 Li 李晓辉 wrote:
> 
> But i also have an idea which can simplify the fast commit patch.
> because we want to fix fsync cost too much time problems on our
> mobile phone without format the whole ext4 partition , and i found
> current fast commit patch can't do this job as it need to
> readjustment the layout of journal area and will destroy phone
> user's data from my opinion .

That's not correct.  The fast commit feature can be added to an
existing ext4 file system.  That's because when the ext4 file system
is mounted (or when e2fsck is run) the contents of the file system
journal (if any) are replayed and then discard.  On a clean shutdown,
the journal is empty to begin with.

Hence, restructuring the journal so that a portion of the space can be
used for fast commits can be done without modifying or otherwise
destroying the data on the pre-existing file system.

> so my simplify idea is that:
> when jbd2 thread begin to commit the current transaction , why not
> divide the commiting work into two sub work ? firstly flush metadata
> generated by fsynced handles to disk, and then append a commit end
> block. and then tell the fsync threads that no need to wait, as
> their metadata has already been flush to disk journal area, the
> fsync work is finished.  and then the second sub work is to
> committing metadata and data generated by left handles in current
> transaction.

The problem, as I stated in my earlier message, is that the handles
that were not involved in the fsync in many cases will have been
started and completed before the changes reflected by the handles
involving the inode to be fsync'ed.  We can't just "separate out the
handles" and commit the ones that are necessary, and then do the rest
in a separate transaction.  The problem is entagled dependencies.  For
example, one of the handles not involved with the fsync may have
modified the inode table or the allocation bitmap that is involved
with the update to the inode to be fsync'ed.  We can't just flush the
metadata blocks involved with the "fsync handles", since they will
include the modifications made by other file systems via "the rest of
the handles."

So no, we can't do what you are suggesting.  If it were that easy, we
would have done it a long time ago.

The reason why you can't separate out some of the handles from others
is referenced in the LWN article, "Soft Updates, Hard Problems"[1].
What you are suggesting is not exactly soft updates, but it suffers
from the same problem, namely that of entangled updates, where the
same block is modified by multiple handles.  If you track all of the
logical dependencies, you could potentially "roll back" in memory
those changes which are not yet committed, and then after commit of
the "fsync hanldes", roll them forward again.  But this is hopelessly
complicated to get right.

[1] https://lwn.net/Articles/339337/

So if you implemented your suggestion, and the system were to crash
between the first and second commit, the file system would be
corrupted, and in the worst case, e2fsck might not be able to recover
the file system, and all of the user's data would be lost.  Of course,
if you are sure that your system will never crash, because the kernel
is bug-free(tm), then you could skip using the journalling altogher.....

						- Ted

P.S.  It's actually a little bit more complicated than that; you also
need to worry about power drops, so the battery needs to be embedded,
so there is no chance the battery will come flying out when the phone
is dropped.  The EC also has to be able to give a low-pattery warning
so that the system can be shut down cleanly before the battery power
goes to zero, and you can't allow the emergency poweroff where the
user pushes and holds the power buton for eight seconds.  The last,
after all, won't be needed because we are making the hopelessly
unrealistic assumption that the kernel is completely, 100%,
bug-free(tm).   :-)

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <1572349386604.43878@xiaomi.com>]

* Re: [PATCH v3 09/13] ext4: fast-commit commit path changes
       [not found]   ` <1572349386604.43878@xiaomi.com>
@ 2019-10-29 21:35     ` Theodore Y. Ts'o
  2019-10-30  4:28       ` 答复: [External Mail]Re: " Xiaohui1 Li 李晓辉
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-29 21:35 UTC (permalink / raw)
  To: Xiaohui1 Li 李晓辉
  Cc: linux-ext4@vger.kernel.org, harshadshirwadkar@gmail.com

On Tue, Oct 29, 2019 at 11:43:54AM +0000, Xiaohui1 Li 李晓辉 wrote:
> > We don't actually have to do this.  Strictly speaking, we only have to
> > write out the specific inode being fsync'ed, or the specific inode for
> > which ext4_nfs_commit_metdata() has been called.  For an fsync()
> > workload, especially one where for example, we might have hundreds of
> > modified inodes, all of which are fc-eligible --- for example, because
> > a kernel build is happening in the background, and a single file which
> > is being fsync'ed --- for example because the programmer has just
> > saved a source file in emacs ---- we only need to include that single
> > inode in the fast commit.  Including *all* of the inodes in the
> > i_fc_list in the fast commit, is wasted effort, especially since the
> > inodes in question will be committed within the next 5 seconds.
> 
> you said only need to include that single inode in the fast commit.
> do you mean that create a fast-commit transaction which only need to
> commit data and metadata of the specific inode ?  but in your last
> email, you says "we can't just separate out some of the handles from
> others in one transation".
> 
> so if we just only include that single inode(ie: being fsync'ed) in
> the fast commit, is it means that in the ext4 traditional way，
> metadata of this single inode being fsync'ed need to be mixed with
> other inodes not being fsync'ed (may doing buffer write) together in
> one transaction to be flushed to disk both together because of
> entagled dependencies you says in your last reply email.
> 
> but when fast-commit patches applied, how the metadata and data of
> this single inode being fsync'ed can be extracted from all files
> metadata changes during one time range ？

Did you read the iJournaling Usenix paper[1] which I referenced
earlier?  It's described in there.

[1] https://www.usenix.org/conference/atc17/technical-sessions/presentation/park

The trick is that we track whether the inode has changes which we
can't represent in the fast commit "logical journal".  In the logical
journal, we record changes since the last full commit, not as the full
physical metadata block, but just bits of the logical metadata that
have changed.  If that inode has changed in ways that we can't
represent in the fast commit journal, then we do a normal full commit.

So we avoid entangled dependencies in two ways .  First of all, we
only journal the logical change.  Hence, if there is a change in
another part of the metadata block (say, another inode in the inode
table) there won't be an issue, since we only update that one inode.
Secondly, if the inode has some entangelements either with other
inodes, or (b) changes in the inode which we can't reflect in the fast
commit log, then fall back to doing a full commit.

So basically, we only deal with the simple, common cases, where it's
easy to log changes to the fast commit log.  Now, those changes are
also logged in the normal physical commit, so once we do a full
commit, all of the entries in the fast commit log are no longer needed
--- the fast commit just contains the small, simple changes since the
last full commit.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 10+ messages in thread

* 答复: [External Mail]Re: [PATCH v3 09/13] ext4: fast-commit commit path changes
  2019-10-29 21:35     ` Theodore Y. Ts'o
@ 2019-10-30  4:28       ` Xiaohui1 Li 李晓辉
  2019-10-30 14:26         ` Theodore Y. Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: Xiaohui1 Li 李晓辉 @ 2019-10-30  4:28 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: linux-ext4@vger.kernel.org, harshadshirwadkar@gmail.com

thanks to this  iJournaling Usenix paper,

fsync latency-too-long problem because of entangled dependencies and inode' data has to be waited in jbd2 order mode
can be fixed.

entangled dependencies problem is known to us by your kind reply email.
the  problem of file' data wating in jbd2 order mode is also a serious problem which case a long-latency fsync call.

as pointed out in this iJournaling paper, when three conditions turn up at the same time,
1: order mode must be applied, not the writeback mode.
2: The delayed block allocation technique of ext4 must be  applied.
3: backgroud buffer writes are too many.

because the periodic flush disk time caused by delayed block allocation is 30s(a bit too long) in android system,
so when begin to flush data and metadata to disk, the amount of inode data flushed can be so large.
and so because of the default ext4 data mode is order(not the writeback mode), so when fsync is called,
we have to be faced with such a difficult condition which is that have to be waited for so many inode data(not the metadata) flushed to
disk completely in jbd2 thread.

we have no choice as the order mode need to do this work, so the waiting inode-data-flushed-disk time is too long in some extreme conditions.
so it cause the appearance of long-latency fsync call.

thank you for your reply, i will try to fix this problem in my free time.

append some words in ijournal paper which may be help for someone(may be include me) which don't be familiar with why delayed block allocation
will cause long-latency fsync call :

The delayed block allocation technique of ext4 ag-
gravates the CTX problem(appeared in fsync call).

However, if an fsync is called just after
the flush kernel thread invocation, as shown in the ex-
ample in Figure 1(a), the flush thread will allocate data
blocks for dirty pages, and register several modified in-
odes in the running transaction during the delayed block
allocation. Then, the commit operation of the journal
transaction will generate many write requests into stor-
age.

 Shall someone can tell the reason why delayed block allocation technique of ext4 cause  long-latency fsync call with more detail ?
many thanks.

________________________________________
发件人: Theodore Y. Ts'o <tytso@mit.edu>
发送时间: 2019年10月30日 5:35
收件人: Xiaohui1 Li 李晓辉
抄送: linux-ext4@vger.kernel.org; harshadshirwadkar@gmail.com
主题: [External Mail]Re: [PATCH v3 09/13] ext4: fast-commit commit path changes

On Tue, Oct 29, 2019 at 11:43:54AM +0000, Xiaohui1 Li 李晓辉 wrote:
> > We don't actually have to do this.  Strictly speaking, we only have to
> > write out the specific inode being fsync'ed, or the specific inode for
> > which ext4_nfs_commit_metdata() has been called.  For an fsync()
> > workload, especially one where for example, we might have hundreds of
> > modified inodes, all of which are fc-eligible --- for example, because
> > a kernel build is happening in the background, and a single file which
> > is being fsync'ed --- for example because the programmer has just
> > saved a source file in emacs ---- we only need to include that single
> > inode in the fast commit.  Including *all* of the inodes in the
> > i_fc_list in the fast commit, is wasted effort, especially since the
> > inodes in question will be committed within the next 5 seconds.
>
> you said only need to include that single inode in the fast commit.
> do you mean that create a fast-commit transaction which only need to
> commit data and metadata of the specific inode ?  but in your last
> email, you says "we can't just separate out some of the handles from
> others in one transation".
>
> so if we just only include that single inode(ie: being fsync'ed) in
> the fast commit, is it means that in the ext4 traditional way，
> metadata of this single inode being fsync'ed need to be mixed with
> other inodes not being fsync'ed (may doing buffer write) together in
> one transaction to be flushed to disk both together because of
> entagled dependencies you says in your last reply email.
>
> but when fast-commit patches applied, how the metadata and data of
> this single inode being fsync'ed can be extracted from all files
> metadata changes during one time range ？

Did you read the iJournaling Usenix paper[1] which I referenced
earlier?  It's described in there.

[1] https://www.usenix.org/conference/atc17/technical-sessions/presentation/park

The trick is that we track whether the inode has changes which we
can't represent in the fast commit "logical journal".  In the logical
journal, we record changes since the last full commit, not as the full
physical metadata block, but just bits of the logical metadata that
have changed.  If that inode has changed in ways that we can't
represent in the fast commit journal, then we do a normal full commit.

So we avoid entangled dependencies in two ways .  First of all, we
only journal the logical change.  Hence, if there is a change in
another part of the metadata block (say, another inode in the inode
table) there won't be an issue, since we only update that one inode.
Secondly, if the inode has some entangelements either with other
inodes, or (b) changes in the inode which we can't reflect in the fast
commit log, then fall back to doing a full commit.

So basically, we only deal with the simple, common cases, where it's
easy to log changes to the fast commit log.  Now, those changes are
also logged in the normal physical commit, so once we do a full
commit, all of the entries in the fast commit log are no longer needed
--- the fast commit just contains the small, simple changes since the
last full commit.

Cheers,

                                                - Ted
#/******本邮件及其附件含有小米公司的保密信息，仅限于发送给上面地址中列出的个人或群组。禁止任何其他人以任何形式使用（包括但不限于全部或部分地泄露、复制、或散发）本邮件中的信息。如果您错收了本邮件，请您立即电话或邮件通知发件人并删除本邮件！ This e-mail and its attachments contain confidential information from XIAOMI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!******/#

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 答复: [External Mail]Re: [PATCH v3 09/13] ext4: fast-commit commit path changes
  2019-10-30  4:28       ` 答复: [External Mail]Re: " Xiaohui1 Li 李晓辉
@ 2019-10-30 14:26         ` Theodore Y. Ts'o
  2019-11-04  1:01           ` xiaohui li
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-30 14:26 UTC (permalink / raw)
  To: Xiaohui1 Li 李晓辉
  Cc: linux-ext4@vger.kernel.org, harshadshirwadkar@gmail.com

On Wed, Oct 30, 2019 at 04:28:42AM +0000, Xiaohui1 Li 李晓辉 wrote:
> the problem of file' data wating in jbd2 order mode is also a
> serious problem which case a long-latency fsync call.

Yes, this is a separate problem, although note that if the file with a
large amount of data is the file which is being fsync'ed, you have to
write it out at fsync time no matter what.

You could try to write out dirty data earlier (e.g., by decreasing the
30 second writeback window), but there are tradeoffs.  For one thing,
if the file ends up being deleted anyway, it's better not to write out
the data at all.  For another, if we know how big the file is at the
time when we do the writeout, we can do a better job allocating space
for the file, and it improves the file layout by making it more likely
it will be contiguous, or at least mostly contiguous.

Also, files that tend to be fsync'ed a lot tend to be database files
(e.g., SQLite files), and they tend to write small amounts of data and
then fsync them.  So the problem described below happens when there
are unrelated files that happen to be downloaded in parallel.  An
example of this in the Android case mgiht be when the user is
downloading a large video file, such as a movie, to be watched offline
later (such as when they are on a plane).

> as pointed out in this iJournaling paper, when three conditions turn up at the same time,
> 1: order mode must be applied, not the writeback mode.
> 2: The delayed block allocation technique of ext4 must be  applied.
> 3: backgroud buffer writes are too many.

(1) and (2) are the default.  (3) may or may not be a frequent
occurrence, depending on the workload.  In practice though, users
aren't downloading large files all *that* often.

> we have no choice as the order mode need to do this work, so the
> waiting inode-data-flushed-disk time is too long in some extreme
> conditions.  so it cause the appearance of long-latency fsync call.
> 
> thank you for your reply, i will try to fix this problem in my free time.

So there is a solution; it's just a bit tricky to do, and it's not
been a huge enough deal that anyone has allocated time to fix it.

The idea is to allocate space, but not actually update the metadata
blocks at the time when the data blocks are allocated.  Instead, we
reserve them so they won't get allocated for use by another file, and
we note where they are in the extent status cache.  We then issue the
writes of the data block, and only after they are complete, only
*then* do we update the metadata blocks (which then gets updated via
the journal, using either a commit or a fast commit).

This is similar to the dioread_nolock case, where we update the
metadata blocks first, but mark them as unwritten, then we let the
data blocks get written, and only finally do we update the metadata
blocks so they are marked as written (e.g., initialized).  This avoids
the stale data problem as well, but we end up modifying the metadata
blocks twice, and it has resulted other performance problems since in
increases overhead on the i_data_sem lock.  See for example some of
the posts by Liu Bo from Alibaba last year:

If we can allocate space, write the data blocks, and only *then*
update the extent tree metadata blocks, it solves a lot of problems.
We can get rid of the dioread_nolock option; we can get rid of the
data=ordered vs data=writeback distinction; and we can avoid the need
to force data blocks to be written out at commit time.  So it improves
performance, and it will reduce code complexity, making it a win-win
approach.

The problem is that this means significantly changing how we do block
allocation and block reservation, so it's a fairly large and invasive
set of changes.  But it's the right long-term direction, and we'll get
there eventually.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 答复: [External Mail]Re: [PATCH v3 09/13] ext4: fast-commit commit path changes
  2019-10-30 14:26         ` Theodore Y. Ts'o
@ 2019-11-04  1:01           ` xiaohui li
  2019-11-04  3:22             ` Theodore Y. Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: xiaohui li @ 2019-11-04  1:01 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Xiaohui1 Li 李晓辉, linux-ext4@vger.kernel.org,
	harshadshirwadkar@gmail.com

another way which i think can fix this fsync time cost problem may be
that changing ext4 data mode from order to writeback.

when in writeback mode, inode' data has not to be waited in jbd2
thread, so the fsync time cost is also reduced.
meawhile, writeback mode also can guarantee filesystem consistency in
os crash-reboot conditions,
with only one drawback is that it will cause security problems such as
stale data will be seen.

but in android system with file encryption enabled, there is no
security problem as files are all encryped.
but user will see wrong file data in system crash-reboot conditions
with writeback mode enabled.

for example:
-------------
file A has allocate 50 new blocks, and already dirtys page cache
corresponding to these 50 blocks,
and after the medata represent new 50 blocks of this file have been
flushed to journal area, the system crash.
file A's data according to above 50 new blocks has not been to flushed to disk.
after system reboot and finish file system recovery work, file A's
size has bee enlarged with new 50 blocks added.
but data in this file's new 50 blocks is not correct. so it will cheat
user if it is difficult to see this data-not-correct problem.
-------------

and this problem can't fixed by e2fsck full check ,it is not belong to
file system consistency,
so we will insist on using order mode ,not writeback mode.

i will also share my view about that ijournal paper in my next time.

On Wed, Oct 30, 2019 at 10:26 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Wed, Oct 30, 2019 at 04:28:42AM +0000, Xiaohui1 Li 李晓辉 wrote:
> > the problem of file' data wating in jbd2 order mode is also a
> > serious problem which case a long-latency fsync call.
>
> Yes, this is a separate problem, although note that if the file with a
> large amount of data is the file which is being fsync'ed, you have to
> write it out at fsync time no matter what.
>
> You could try to write out dirty data earlier (e.g., by decreasing the
> 30 second writeback window), but there are tradeoffs.  For one thing,
> if the file ends up being deleted anyway, it's better not to write out
> the data at all.  For another, if we know how big the file is at the
> time when we do the writeout, we can do a better job allocating space
> for the file, and it improves the file layout by making it more likely
> it will be contiguous, or at least mostly contiguous.
>
> Also, files that tend to be fsync'ed a lot tend to be database files
> (e.g., SQLite files), and they tend to write small amounts of data and
> then fsync them.  So the problem described below happens when there
> are unrelated files that happen to be downloaded in parallel.  An
> example of this in the Android case mgiht be when the user is
> downloading a large video file, such as a movie, to be watched offline
> later (such as when they are on a plane).
>
> > as pointed out in this iJournaling paper, when three conditions turn up at the same time,
> > 1: order mode must be applied, not the writeback mode.
> > 2: The delayed block allocation technique of ext4 must be  applied.
> > 3: backgroud buffer writes are too many.
>
> (1) and (2) are the default.  (3) may or may not be a frequent
> occurrence, depending on the workload.  In practice though, users
> aren't downloading large files all *that* often.
>
> > we have no choice as the order mode need to do this work, so the
> > waiting inode-data-flushed-disk time is too long in some extreme
> > conditions.  so it cause the appearance of long-latency fsync call.
> >
> > thank you for your reply, i will try to fix this problem in my free time.
>
> So there is a solution; it's just a bit tricky to do, and it's not
> been a huge enough deal that anyone has allocated time to fix it.
>
> The idea is to allocate space, but not actually update the metadata
> blocks at the time when the data blocks are allocated.  Instead, we
> reserve them so they won't get allocated for use by another file, and
> we note where they are in the extent status cache.  We then issue the
> writes of the data block, and only after they are complete, only
> *then* do we update the metadata blocks (which then gets updated via
> the journal, using either a commit or a fast commit).
>
> This is similar to the dioread_nolock case, where we update the
> metadata blocks first, but mark them as unwritten, then we let the
> data blocks get written, and only finally do we update the metadata
> blocks so they are marked as written (e.g., initialized).  This avoids
> the stale data problem as well, but we end up modifying the metadata
> blocks twice, and it has resulted other performance problems since in
> increases overhead on the i_data_sem lock.  See for example some of
> the posts by Liu Bo from Alibaba last year:
>
> If we can allocate space, write the data blocks, and only *then*
> update the extent tree metadata blocks, it solves a lot of problems.
> We can get rid of the dioread_nolock option; we can get rid of the
> data=ordered vs data=writeback distinction; and we can avoid the need
> to force data blocks to be written out at commit time.  So it improves
> performance, and it will reduce code complexity, making it a win-win
> approach.
>
> The problem is that this means significantly changing how we do block
> allocation and block reservation, so it's a fairly large and invasive
> set of changes.  But it's the right long-term direction, and we'll get
> there eventually.
>
> Cheers,
>
>                                                 - Ted

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 答复: [External Mail]Re: [PATCH v3 09/13] ext4: fast-commit commit path changes
  2019-11-04  1:01           ` xiaohui li
@ 2019-11-04  3:22             ` Theodore Y. Ts'o
  2019-11-08  7:58               ` xiaohui li
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Y. Ts'o @ 2019-11-04  3:22 UTC (permalink / raw)
  To: xiaohui li
  Cc: Xiaohui1 Li 李晓辉, linux-ext4@vger.kernel.org,
	harshadshirwadkar@gmail.com

On Mon, Nov 04, 2019 at 09:01:28AM +0800, xiaohui li wrote:
> 
> when in writeback mode, inode' data has not to be waited in jbd2
> thread, so the fsync time cost is also reduced.
> meawhile, writeback mode also can guarantee filesystem consistency in
> os crash-reboot conditions,
> with only one drawback is that it will cause security problems such as
> stale data will be seen.

It's not just stale data; in data=writeback, today if a file gets
deleted, its blocks are immediately eligible to be reused.  If there
is a crash before the transaction is committed, there could be a file
that would have deleted (and perhaps replaced) that doesn't in fact
get deleted, but its data blocks will have been corrupted.

I'm not fond of that particular behavior, and I may look to fix it,
but in general, data=writeback means that data blocks may be corrupted
or contain stale data after a crash --- for blocks that were freshly
created, or for a file that might have been deleted, but except for
the crash which means that the file deletion doesn't actually get
corrupted.

> but in android system with file encryption enabled, there is no
> security problem as files are all encryped.
> but user will see wrong file data in system crash-reboot conditions
> with writeback mode enabled.

If all files are encrypted, then yes, the chances of stale data
causing security issues is significantly reduced.

But see also the case of a file which is deleted immediately before a
crash.  Things are more complex in terms of the data gauarantees after
a crash, which is why data=ordered is the default.

Regards,

					- Ted

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 答复: [External Mail]Re: [PATCH v3 09/13] ext4: fast-commit commit path changes
  2019-11-04  3:22             ` Theodore Y. Ts'o
@ 2019-11-08  7:58               ` xiaohui li
  0 siblings, 0 replies; 10+ messages in thread
From: xiaohui li @ 2019-11-08  7:58 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Xiaohui1 Li 李晓辉, linux-ext4@vger.kernel.org,
	harshadshirwadkar@gmail.com

thank you, ted.

I have understood  the whole design and implementation of the ijournal paper.
and i think the fast commit for ext4 may be designed and implemented
according to idea of that ijournal paper,
as that ijournal thought is the best way for resolve the problem of
file's data has to been waited in jbd2 thread with
order mode from my opinion.

according to that paper, ijournal only record and commit the changes
of the fsync'ed file to its own ijournal area,
the changes of whole ext4 filesystem are left to normal journal to process.
and ijournal only happen at the end of the thread which is doing fsync
work. it need not be embedded to jbd2 thread.
and the changes of the fsync'ed file which have been committed by
ijournal will also be committed to normal journal area subsequently.
ijournal won't have side effect on normal journal , these two journal
runs independently.

all of these above designments of ijournal from my viewpoint will
simply the fast commit function developed recently, meanwhile it can
help fast commit function to
achieve its goals.
one of its most important goals which i have to highlight should be
fix ext4 fsync time cost problems because of file's data has to been
waited in jbd2 thread(same as CTX problems pointed in ijournal paper).
what do you think of it ?

I like this ijournal thought. may be i want to do some work on coding
of this fast commit function in my free time.

On Mon, Nov 4, 2019 at 11:22 AM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Mon, Nov 04, 2019 at 09:01:28AM +0800, xiaohui li wrote:
> >
> > when in writeback mode, inode' data has not to be waited in jbd2
> > thread, so the fsync time cost is also reduced.
> > meawhile, writeback mode also can guarantee filesystem consistency in
> > os crash-reboot conditions,
> > with only one drawback is that it will cause security problems such as
> > stale data will be seen.
>
> It's not just stale data; in data=writeback, today if a file gets
> deleted, its blocks are immediately eligible to be reused.  If there
> is a crash before the transaction is committed, there could be a file
> that would have deleted (and perhaps replaced) that doesn't in fact
> get deleted, but its data blocks will have been corrupted.
>
> I'm not fond of that particular behavior, and I may look to fix it,
> but in general, data=writeback means that data blocks may be corrupted
> or contain stale data after a crash --- for blocks that were freshly
> created, or for a file that might have been deleted, but except for
> the crash which means that the file deletion doesn't actually get
> corrupted.
>
> > but in android system with file encryption enabled, there is no
> > security problem as files are all encryped.
> > but user will see wrong file data in system crash-reboot conditions
> > with writeback mode enabled.
>
> If all files are encrypted, then yes, the chances of stale data
> causing security issues is significantly reduced.
>
> But see also the case of a file which is deleted immediately before a
> crash.  Things are more complex in terms of the data gauarantees after
> a crash, which is why data=ordered is the default.
>
> Regards,
>
>                                         - Ted

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v3 00/13] ext4: add fast commit support
@ 2019-10-01  7:40 Harshad Shirwadkar
  2019-10-01  7:40 ` [PATCH v3 09/13] ext4: fast-commit commit path changes Harshad Shirwadkar
  0 siblings, 1 reply; 10+ messages in thread
From: Harshad Shirwadkar @ 2019-10-01  7:40 UTC (permalink / raw)
  To: linux-ext4; +Cc: Harshad Shirwadkar

This patch series adds support for fast commits which is a simplified
version of the scheme proposed by Park and Shin, in their paper,
"iJournaling: Fine-Grained Journaling for Improving the Latency of
Fsync System Call"[1]. The basic idea of fast commits is to make JBD2
give the client file system an opportunity to perform a faster
commit. Only if the file system cannot perform such a commit
operation, then JBD2 should fall back to traditional commits.

Because JBD2 operates at block granularity, for every file system
metadata update it commits all the changed blocks are written to the
journal at commit time. This is inefficient because updates to some
blocks that JBD2 commits are derivable from some other blocks. For
example, if a new extent is added to an inode, then corresponding
updates to the inode table, the block bitmap, the group descriptor and
the superblock can be derived based on just the extent information and
the corresponding inode information. So, if we take this relationship
between blocks into account and replay the journalled blocks smartly,
we could increase performance of file system commits significantly.

Fast commits introduced in this patch have two main contributions:

(1) Making JBD2 fast commit aware, so that clients of JBD2 can
    implement fast commits

(2) Add support in ext4 to use JBD2's new interfaces and implement
    fast commits

Testing
-------

e2fsprogs was updated to set fast commit feature flag and to ignore
fast commit blocks during e2fsck.

https://github.com/harshadjs/e2fsprogs.git

After applying all the patches in this series, following runs of
xfstests were performed:

- kvm-xfstest.sh -g log -c 4k
- kvm-xfstests.sh smoke

All the log tests were successful and smoke tests didn't introduce any
additional failures.

Performance Evaluation
----------------------

Ext4 file system performance was tested with and without fast commit
using fs_mark benchmark. Following was the command used:

Command: ./fs_mark -t 8 -n 1024 -s 65536 -w 4096 -d /mnt

Results:
Without Fast Commit: 1501.2 files/sec
With Fast commits: 3055 files/sec
~103% write performance improvement

Changes since V2:

- Added ability to support new file creation in fast commits. This
  allows us to use fs_mark benchmark for performance testing

- Added support for asynchronous fast commits

- Many cleanups and bug fixes

- Re-organized the patch set, moved most of the new code to
  ext4_jbd2.c instead of super.c

- Handling of review comments on previous patchset

Harshad Shirwadkar(13):
 docs: Add fast commit documentation
 ext4: add support for asynchronous fast commits
 ext4: fast-commit recovery path changes
 ext4: fast-commit commit path changes
 ext4: fast-commit commit range tracking
 ext4: track changed files for fast commit
 ext4: add fields that are needed to track changed files
 jbd2: fast-commit recovery path changes
 jbd2: fast-commit commit path new APIs
 jbd2: fast-commit commit path changes
 jbd2: fast commit setup and enable
 ext4: add handling for extended mount options
 ext4: add fast commit support

 Documentation/filesystems/ext4/journal.rst |   98 +-
 Documentation/filesystems/journalling.rst  |   22
 fs/ext4/acl.c                              |    1
 fs/ext4/balloc.c                           |    7
 fs/ext4/ext4.h                             |   86 +
 fs/ext4/ext4_jbd2.c                        |  902 +++++++++++++++++++
 fs/ext4/ext4_jbd2.h                        |   98 ++
 fs/ext4/extents.c                          |   43
 fs/ext4/fsync.c                            |    7
 fs/ext4/ialloc.c                           |   60 -
 fs/ext4/inline.c                           |   14
 fs/ext4/inode.c                            |   77 +
 fs/ext4/ioctl.c                            |    9
 fs/ext4/mballoc.c                          |   83 +
 fs/ext4/mballoc.h                          |    2
 fs/ext4/migrate.c                          |    1
 fs/ext4/namei.c                            |   16
 fs/ext4/super.c                            |   55 +
 fs/ext4/xattr.c                            |    1
 fs/jbd2/commit.c                           |   98 ++
 fs/jbd2/journal.c                          |  343 ++++++-
 fs/jbd2/recovery.c                         |   63 +
 fs/jbd2/transaction.c                      |    3
 include/linux/jbd2.h                       |  117 ++
 include/trace/events/ext4.h                |   61 +
 include/trace/events/jbd2.h                |    9
 26 files changed, 2170 insertions(+), 106 deletions(-)
-- 
2.23.0.444.g18eeb5a265-goog

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v3 09/13] ext4: fast-commit commit path changes
  2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
@ 2019-10-01  7:40 ` Harshad Shirwadkar
  2019-10-16 22:45   ` Theodore Y. Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: Harshad Shirwadkar @ 2019-10-01  7:40 UTC (permalink / raw)
  To: linux-ext4; +Cc: Harshad Shirwadkar

This patch implements the actual commit path for fast commit. Based on
inodes tracked and their respective changes remembered, this
patch adds code to create a fast commit block that stores extents
added as well as dentrys created for the inode. We use new JBD2
interfaces added in previous patches in this series. The fast commit
blocks that are created have extents that _should_ be present in the
file. It doesn't yet support removing of extents, making operations
such as truncate, delete fast commit incompatible.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4_jbd2.c         | 309 ++++++++++++++++++++++++++++++++++++
 fs/ext4/ext4_jbd2.h         |  50 +++++-
 fs/ext4/extents.c           |   8 +-
 fs/ext4/inode.c             |  22 ++-
 fs/ext4/super.c             |  11 ++
 include/trace/events/ext4.h |  39 +++++
 6 files changed, 429 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 0bb8de2139a5..fd7740372438 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -4,6 +4,7 @@
  */
 
 #include "ext4_jbd2.h"
+#include "ext4_extents.h"
 
 #include <trace/events/ext4.h>
 
@@ -480,3 +481,311 @@ bool ext4_is_inode_fc_new(struct inode *inode)
 
 	return ret;
 }
+
+static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
+{
+	struct buffer_head *orig_bh = bh->b_private;
+
+	BUFFER_TRACE(bh, "");
+	if (uptodate) {
+		ext4_debug("%s: Block %lld up-to-date",
+			   __func__, bh->b_blocknr);
+		set_buffer_uptodate(bh);
+	} else {
+		ext4_debug("%s: Block %lld not up-to-date",
+			   __func__, bh->b_blocknr);
+		clear_buffer_uptodate(bh);
+	}
+	if (orig_bh) {
+		clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
+		/* Protect BH_Shadow bit in b_state */
+		smp_mb__after_atomic();
+		wake_up_bit(&orig_bh->b_state, BH_Shadow);
+	}
+	unlock_buffer(bh);
+}
+
+static inline u8 *fc_add_tag(u8 *dst, u16 tag, u16 len, u8 *val)
+{
+	struct ext4_fc_tl tl;
+
+	tl.fc_tag = cpu_to_le16(tag);
+	tl.fc_len = cpu_to_le16(len);
+	memcpy(dst, &tl, sizeof(tl));
+	memcpy(dst + sizeof(tl), val, len);
+
+	return dst + sizeof(tl) + len;
+}
+
+int ext4_fc_write_inode(journal_t *journal, struct buffer_head *bh,
+			struct inode *inode, tid_t tid, tid_t subtid,
+			int is_last, struct dentry *dentry)
+{
+	ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
+	struct super_block *sb = journal->j_private;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_commit_hdr *fc_hdr;
+	struct ext4_map_blocks map;
+	struct ext4_iloc iloc;
+	struct ext4_extent extent;
+	struct inode *parent;
+	__u32 dummy_csum = 0, csum;
+	__u8 *start, *cur, *end;
+	__u16 num_tlvs = 0;
+	int ret;
+
+	read_lock(&ei->i_fc.fc_lock);
+	if (tid != ei->i_fc.fc_tid) {
+		jbd_debug(3,
+			  "File not modified. Modified %d, expected %d",
+			  ei->i_fc.fc_tid, tid);
+		read_unlock(&ei->i_fc.fc_lock);
+		return 0;
+	}
+	read_unlock(&ei->i_fc.fc_lock);
+
+	if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+		return -ECANCELED;
+
+	if (ext4_is_inode_fc_new(inode)) {
+		parent = d_inode(dentry->d_parent);
+		if (parent && ext4_is_inode_fc_ineligible(parent))
+			return -ECANCELED;
+	}
+
+	ret = ext4_get_inode_loc(inode, &iloc);
+	if (ret)
+		return ret;
+
+	end = (__u8 *)bh->b_data + journal->j_blocksize;
+
+	write_lock(&ei->i_fc.fc_lock);
+	old_blk_size = (ei->i_fc.fc_lblk_start + sb->s_blocksize - 1) >>
+		       inode->i_blkbits;
+	new_blk_size = ei->i_fc.fc_lblk_end >> inode->i_blkbits;
+	ei->i_fc.fc_lblk_start = ei->i_fc.fc_lblk_end;
+	write_unlock(&ei->i_fc.fc_lock);
+
+	jbd_debug(3, "Committing as tid = %d, subtid = %d on buffer %lld\n",
+		  tid, subtid, bh->b_blocknr);
+
+	fc_hdr = (struct ext4_fc_commit_hdr *)
+			((__u8 *)bh->b_data + sizeof(journal_header_t));
+	fc_hdr->fc_magic = cpu_to_le32(EXT4_FC_MAGIC);
+	fc_hdr->fc_subtid = cpu_to_le32(subtid);
+	fc_hdr->fc_ino = cpu_to_le32(inode->i_ino);
+	fc_hdr->fc_features = 0;
+	fc_hdr->fc_flags = 0;
+
+	if (is_last)
+		ext4_fc_mark_last(fc_hdr);
+
+	memcpy(&fc_hdr->inode, ext4_raw_inode(&iloc), EXT4_INODE_SIZE(sb));
+	cur = (__u8 *)(fc_hdr + 1);
+	start = cur;
+	if (ext4_is_inode_fc_new(inode)) {
+		__le32 parent_ino;
+
+		read_lock(&ei->i_fc.fc_lock);
+		parent_ino = cpu_to_le32(ei->i_fc.fc_parent_ino);
+		read_unlock(&ei->i_fc.fc_lock);
+
+		if (!dentry)
+			return -ECANCELED;
+
+		cur = fc_add_tag(cur, EXT4_FC_TAG_PARENT_INO,
+				      sizeof(parent_ino), (u8 *)&parent_ino);
+		cur = fc_add_tag(cur, EXT4_FC_TAG_DNAME,
+				 dentry->d_name.len,
+				 (u8 *)dentry->d_name.name);
+		num_tlvs = 2;
+	}
+	csum = 0;
+	cur_lblk_off = old_blk_size;
+	while (cur_lblk_off <= new_blk_size) {
+		map.m_lblk = cur_lblk_off;
+		map.m_len = new_blk_size - cur_lblk_off + 1;
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+		if (!ret) {
+			cur_lblk_off += map.m_len;
+			continue;
+		}
+
+		if (map.m_flags & EXT4_MAP_UNWRITTEN)
+			return -ECANCELED;
+		extent.ee_block = cpu_to_le32(map.m_lblk);
+		cur_lblk_off += map.m_len;
+		if (cur + sizeof(struct ext4_extent) +
+		    sizeof(struct ext4_fc_tl) >= end)
+			return -ENOSPC;
+
+		extent.ee_len = cpu_to_le16(map.m_len);
+		ext4_ext_store_pblock(&extent, map.m_pblk);
+		ext4_ext_mark_initialized(&extent);
+		cur = fc_add_tag(cur, EXT4_FC_TAG_EXT,
+				 sizeof(struct ext4_extent),
+				 (u8 *)&extent);
+		num_tlvs++;
+	}
+
+	fc_hdr->fc_num_tlvs = cpu_to_le16(num_tlvs);
+	csum = ext4_chksum(sbi, csum, (__u8 *)fc_hdr,
+			   offsetof(struct ext4_fc_commit_hdr, fc_csum));
+	csum = ext4_chksum(sbi, csum, &dummy_csum, sizeof(dummy_csum));
+	csum = ext4_chksum(sbi, csum, start, cur - start);
+	fc_hdr->fc_csum = cpu_to_le32(csum);
+
+	jbd_debug(3, "Created FC block for inode %ld with [%d, %d]",
+		  inode->i_ino, tid, subtid);
+
+	return 1;
+}
+
+static void ext4_journal_fc_cleanup_cb(journal_t *journal)
+{
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct inode *inode;
+
+	spin_lock(&sbi->s_fc_lock);
+	while (!list_empty(&sbi->s_fc_q)) {
+		iter = list_first_entry(&sbi->s_fc_q,
+				  struct ext4_inode_info, i_fc_list);
+		list_del_init(&iter->i_fc_list);
+		inode = &iter->vfs_inode;
+	}
+	INIT_LIST_HEAD(&sbi->s_fc_q);
+	sbi->s_fc_q_cnt = 0;
+	spin_unlock(&sbi->s_fc_lock);
+	sbi->s_fc_eligible = true;
+}
+
+/*
+ * Fast-commit commit callback. There is contention between sbi->s_fc_lock and
+ * i_data_sem. Locking order is - i_data_sem then s_fc_lock
+ */
+static int ext4_journal_fc_commit_cb(journal_t *journal, tid_t tid,
+			      tid_t subtid,
+			      struct transaction_run_stats_s *stats)
+{
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct list_head *pos, *tmp;
+	struct ext4_inode_info *iter;
+	int num_bufs = 0, ret;
+
+	memset(stats, 0, sizeof(*stats));
+
+	trace_ext4_journal_fc_commit_cb_start(sb);
+	sbi = sbi;
+	spin_lock(&sbi->s_fc_lock);
+	if (!sbi->s_fc_eligible) {
+		sbi->s_fc_eligible = true;
+		spin_unlock(&sbi->s_fc_lock);
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "ineligible");
+		return -ECANCELED;
+	}
+
+	if (unlikely(ext4_forced_shutdown(EXT4_SB(sb)))) {
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "shutdown");
+		return -EIO;
+	}
+
+	stats->rs_flushing = jiffies;
+	/* Submit data buffers first */
+	list_for_each(pos, &sbi->s_fc_q) {
+		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
+		ret = jbd2_submit_inode_data(journal, iter->jinode);
+		if (ret) {
+			spin_unlock(&sbi->s_fc_lock);
+			trace_ext4_journal_fc_commit_cb_stop(sb, 0,
+							     "data_commit");
+			return ret;
+		}
+	}
+	stats->rs_logging = jiffies;
+	stats->rs_flushing = jbd2_time_diff(stats->rs_flushing,
+					    stats->rs_logging);
+
+	list_for_each_safe(pos, tmp, &sbi->s_fc_q) {
+		struct inode *inode;
+		struct buffer_head *bh;
+		int is_last;
+
+		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
+		inode = &iter->vfs_inode;
+
+		is_last = list_is_last(pos, &sbi->s_fc_q);
+		spin_unlock(&sbi->s_fc_lock);
+
+		ret = jbd2_map_fc_buf(journal, &bh);
+		if (ret) {
+			trace_ext4_journal_fc_commit_cb_stop(sb, 0,
+							     "map_fc_buf");
+			return -ENOMEM;
+		}
+
+		/*
+		 * Release s_fc_lock here since fc_write_inode calls
+		 * ext4_map_blocks which needs i_data_sem.
+		 */
+		ret = ext4_fc_write_inode(journal, bh, inode, tid, subtid,
+					  is_last, NULL);
+		if (ret < 0) {
+			trace_ext4_journal_fc_commit_cb_stop(sb, 0,
+							     "fc_write_inode");
+			return ret;
+		}
+		lock_buffer(bh);
+		clear_buffer_dirty(bh);
+		set_buffer_uptodate(bh);
+		bh->b_end_io = ext4_end_buffer_io_sync;
+		submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
+		spin_lock(&sbi->s_fc_lock);
+
+		num_bufs++;
+	}
+
+	stats->rs_logging = jbd2_time_diff(stats->rs_logging, jiffies);
+	if (num_bufs == 0) {
+		spin_unlock(&sbi->s_fc_lock);
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "no_data");
+		stats->rs_blocks_logged = num_bufs;
+		return 0;
+	}
+
+	/*
+	 * Before returning, check if s_fc_eligible was modified since we
+	 * started.
+	 */
+	if (!sbi->s_fc_eligible) {
+		spin_unlock(&sbi->s_fc_lock);
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "ineligible2");
+		return -ECANCELED;
+	}
+
+	if (unlikely(ext4_forced_shutdown(EXT4_SB(sb)))) {
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "shutdown2");
+		return -EIO;
+	}
+
+	spin_unlock(&sbi->s_fc_lock);
+
+	jbd_debug(3, "%s: Journal blocks ready for fast commit\n", __func__);
+
+	stats->rs_blocks_logged = num_bufs;
+
+	trace_ext4_journal_fc_commit_cb_stop(sb, num_bufs, "success");
+
+	return jbd2_wait_on_fc_bufs(journal, num_bufs);
+}
+
+void ext4_init_fast_commit(struct super_block *sb, journal_t *journal)
+{
+	if (ext4_should_fast_commit(sb)) {
+		journal->j_fc_commit_callback = ext4_journal_fc_commit_cb;
+		journal->j_fc_cleanup_callback = ext4_journal_fc_cleanup_cb;
+	}
+}
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 2cb7e7e1f025..acb9533068c4 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -397,8 +397,14 @@ static inline void ext4_update_inode_fsync_trans(handle_t *handle,
 
 	if (ext4_handle_valid(handle) && !is_handle_aborted(handle)) {
 		ei->i_sync_tid = handle->h_transaction->t_tid;
-		if (datasync)
+		if (ext4_should_fast_commit(inode->i_sb))
+			ei->i_sync_subtid = handle->h_transaction->t_subtid;
+		if (datasync) {
 			ei->i_datasync_tid = handle->h_transaction->t_tid;
+			if (ext4_should_fast_commit(inode->i_sb))
+				ei->i_datasync_subtid =
+						handle->h_transaction->t_subtid;
+		}
 	}
 }
 
@@ -470,6 +476,47 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
 	return 1;
 }
 
+/* Ext4 fast commit related info */
+
+/* Magic of fast commit header */
+#define EXT4_FC_MAGIC			0xE2540090
+
+#define EXT4_FC_FL_LAST			0x00000001
+
+#define ext4_fc_is_last(__fc_hdr)	(((__fc_hdr)->fc_flags) &	\
+					 EXT4_FC_FL_LAST)
+
+#define ext4_fc_mark_last(__fc_hdr)	(((__fc_hdr)->fc_flags) |=	\
+					 EXT4_FC_FL_LAST)
+
+struct ext4_fc_commit_hdr {
+	/* Fast commit magic, should be EXT4_FC_MAGIC */
+	__le32 fc_magic;
+	/* Sub transaction ID */
+	__le32 fc_subtid;
+	/* Features used by this fast commit block */
+	__u8 fc_features;
+	/* Flags for this block. */
+	__u8 fc_flags;
+	/* Number of TLVs in this fast commmit block */
+	__le16 fc_num_tlvs;
+	/* Inode number */
+	__le32 fc_ino;
+	/* ext4 inode on disk copy */
+	struct ext4_inode inode;
+	/* Csum(hdr+contents) */
+	__le32 fc_csum;
+};
+
+#define EXT4_FC_TAG_EXT		0x1	/* Extent */
+#define EXT4_FC_TAG_DNAME	0x2
+#define EXT4_FC_TAG_PARENT_INO	0x3
+
+struct ext4_fc_tl {
+	__le16 fc_tag;
+	__le16 fc_len;
+};
+
 void ext4_init_inode_fc_info(struct inode *inode);
 extern void ext4_fc_enqueue_inode(handle_t *handle, struct inode *inode);
 extern void ext4_fc_del(struct inode *inode);
@@ -507,4 +554,5 @@ void ext4_fc_update_commit_range(struct inode *inode, ext4_lblk_t start,
 void ext4_fc_mark_new(struct inode *inode);
 bool ext4_is_inode_fc_ineligible(struct inode *inode);
 bool ext4_is_inode_fc_new(struct inode *inode);
+void ext4_init_fast_commit(struct super_block *sb, journal_t *journal);
 #endif	/* _EXT4_JBD2_H */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index b30f6175eb71..dea4c2632272 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4898,10 +4898,10 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (ret)
 		goto out;
 
-	if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal) {
-		ret = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
-						EXT4_I(inode)->i_sync_tid);
-	}
+	if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal)
+		ret = jbd2_fc_complete_commit(
+		    EXT4_SB(inode->i_sb)->s_journal, EXT4_I(inode)->i_sync_tid,
+		    EXT4_I(inode)->i_sync_subtid);
 out:
 	inode_unlock(inode);
 	trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ea039e3e1a4d..cbfa1ec858a1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5039,20 +5039,25 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 	 */
 	if (journal) {
 		transaction_t *transaction;
-		tid_t tid;
+		tid_t tid, subtid;
 
 		read_lock(&journal->j_state_lock);
 		if (journal->j_running_transaction)
 			transaction = journal->j_running_transaction;
 		else
 			transaction = journal->j_committing_transaction;
-		if (transaction)
+		if (transaction) {
 			tid = transaction->t_tid;
-		else
+			subtid = transaction->t_subtid;
+		} else {
 			tid = journal->j_commit_sequence;
+			subtid = journal->j_fc_sequence;
+		}
 		read_unlock(&journal->j_state_lock);
 		ei->i_sync_tid = tid;
 		ei->i_datasync_tid = tid;
+		ei->i_sync_subtid = subtid;
+		ei->i_datasync_subtid = subtid;
 	}
 
 	if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
@@ -5475,8 +5480,9 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
 		if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync)
 			return 0;
 
-		err = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
-						EXT4_I(inode)->i_sync_tid);
+		err = jbd2_fc_complete_commit(
+		    EXT4_SB(inode->i_sb)->s_journal, EXT4_I(inode)->i_sync_tid,
+		    EXT4_I(inode)->i_sync_subtid);
 	} else {
 		struct ext4_iloc iloc;
 
@@ -5628,6 +5634,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		if (attr->ia_valid & ATTR_GID)
 			inode->i_gid = attr->ia_gid;
 		error = ext4_mark_inode_dirty(handle, inode);
+		ext4_fc_enqueue_inode(handle, inode);
 		ext4_journal_stop(handle);
 	}
 
@@ -5688,6 +5695,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 				inode->i_mtime = current_time(inode);
 				inode->i_ctime = inode->i_mtime;
 			}
+			ext4_fc_enqueue_inode(handle, inode);
 			down_write(&EXT4_I(inode)->i_data_sem);
 			EXT4_I(inode)->i_disksize = attr->ia_size;
 			rc = ext4_mark_inode_dirty(handle, inode);
@@ -5732,6 +5740,8 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 
 	if (!error) {
 		setattr_copy(inode, attr);
+		ext4_fc_enqueue_inode(ext4_journal_current_handle(),
+						   inode);
 		mark_inode_dirty(inode);
 	}
 
@@ -6144,6 +6154,7 @@ void ext4_dirty_inode(struct inode *inode, int flags)
 		goto out;
 
 	ext4_mark_inode_dirty(handle, inode);
+	ext4_fc_enqueue_inode(handle, inode);
 
 	ext4_journal_stop(handle);
 out:
@@ -6229,6 +6240,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 
+	ext4_fc_mark_ineligible(inode);
 	err = ext4_mark_inode_dirty(handle, inode);
 	ext4_handle_sync(handle);
 	ext4_journal_stop(handle);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3e9570ea9748..208c57b5ac80 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1129,6 +1129,16 @@ static void ext4_destroy_inode(struct inode *inode)
 				true);
 		dump_stack();
 	}
+	if (!list_empty(&(EXT4_I(inode)->i_fc_list))) {
+#ifdef EXT4FS_DEBUG
+		if (EXT4_SB(inode->i_sb)->s_fc_eligible) {
+			pr_warn("%s: INODE %ld in FC List with FC allowd",
+				__func__, inode->i_ino);
+			dump_stack();
+		}
+#endif
+		ext4_fc_del(inode);
+	}
 }
 
 static void init_once(void *foo)
@@ -4713,6 +4723,7 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
 	journal->j_commit_interval = sbi->s_commit_interval;
 	journal->j_min_batch_time = sbi->s_min_batch_time;
 	journal->j_max_batch_time = sbi->s_max_batch_time;
+	ext4_init_fast_commit(sb, journal);
 
 	write_lock(&journal->j_state_lock);
 	if (test_opt(sb, BARRIER))
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index d68e9e536814..9c24b1c5239f 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2703,6 +2703,45 @@ TRACE_EVENT(ext4_error,
 		  __entry->function, __entry->line)
 );
 
+TRACE_EVENT(ext4_journal_fc_commit_cb_start,
+	TP_PROTO(struct super_block *sb),
+
+	TP_ARGS(sb),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+	),
+
+	TP_fast_assign(
+		__entry->dev = sb->s_dev;
+	),
+
+	TP_printk("fast_commit started on dev %d,%d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev))
+);
+
+TRACE_EVENT(ext4_journal_fc_commit_cb_stop,
+	    TP_PROTO(struct super_block *sb, int nblks, const char *reason),
+
+	TP_ARGS(sb, nblks, reason),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, nblks)
+		__field(const char *, reason)
+	),
+
+	TP_fast_assign(
+		__entry->dev = sb->s_dev;
+		__entry->nblks = nblks;
+		__entry->reason = reason;
+	),
+
+	TP_printk("fast_commit done on dev %d,%d, nblks %d, reason %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->nblks, __entry->reason)
+);
+
 #endif /* _TRACE_EXT4_H */
 
 /* This part must be outside protection */
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 09/13] ext4: fast-commit commit path changes
  2019-10-01  7:40 ` [PATCH v3 09/13] ext4: fast-commit commit path changes Harshad Shirwadkar
@ 2019-10-16 22:45   ` Theodore Y. Ts'o
       [not found]     ` <CAAJeciXQiE022GqcsTr35jSqjA6eH+zBS2KNvDPj5PovButdYA@mail.gmail.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-16 22:45 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Tue, Oct 01, 2019 at 12:40:58AM -0700, Harshad Shirwadkar wrote:
> This patch implements the actual commit path for fast commit. Based on
> inodes tracked and their respective changes remembered, this
> patch adds code to create a fast commit block that stores extents
> added as well as dentrys created for the inode. We use new JBD2
> interfaces added in previous patches in this series. The fast commit
> blocks that are created have extents that _should_ be present in the
> file. It doesn't yet support removing of extents, making operations
> such as truncate, delete fast commit incompatible.

This affects some of the earlier patches, but I didn't realize this
until now.  Right now, what we're doing is when initiate an fast
commit, we are writing out all fast-commit eligible inodes (and
flushing out any associated data blocks needed to maintain
data=ordered guarantees).

We don't actually have to do this.  Strictly speaking, we only have to
write out the specific inode being fsync'ed, or the specific inode for
which ext4_nfs_commit_metdata() has been called.  For an fsync()
workload, especially one where for example, we might have hundreds of
modified inodes, all of which are fc-eligible --- for example, because
a kernel build is happening in the background, and a single file which
is being fsync'ed --- for example because the programmer has just
saved a source file in emacs ---- we only need to include that single
inode in the fast commit.  Including *all* of the inodes in the
i_fc_list in the fast commit, is wasted effort, especially since the
inodes in question will be committed within the next 5 seconds.

Now, in the case of ext4_nfs_commit_metadata(), we know that NFS is
*very* aggressive at calling commit_metadata, and so writing out all
of the FC-eligible commit is probably a good thing to do.  So we might
want to do different things depending on whether the FC is being
initiated via fsync() or fdatasync() versus commit_metadata().

The other reason why it's better to only do this for
ext4_nfs_commit_metadata() is because if we only write out the inode
which is being fsync'ed, we don't have worry about fairness concerned,
since the I/O will be charged to the process/cgroup who requested the
fsync.  If we write out *all* the fc-eligible inodes in the FC commit,
then they will get charged to the process doing the fsync(2).  Whereas
for an NFS server, we don't care about cgroups, since they can all be
charged to the NFS server.

> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index 0bb8de2139a5..fd7740372438 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -4,6 +4,7 @@
> +static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
> +{
> +	struct buffer_head *orig_bh = bh->b_private;
> +
> +	BUFFER_TRACE(bh, "");
> +	if (uptodate) {
> +		ext4_debug("%s: Block %lld up-to-date",
> +			   __func__, bh->b_blocknr);
> +		set_buffer_uptodate(bh);
> +	} else {
> +		ext4_debug("%s: Block %lld not up-to-date",
> +			   __func__, bh->b_blocknr);
> +		clear_buffer_uptodate(bh);
> +	}
> +	if (orig_bh) {
> +		clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
> +		/* Protect BH_Shadow bit in b_state */
> +		smp_mb__after_atomic();
> +		wake_up_bit(&orig_bh->b_state, BH_Shadow);
> +	}

We don't need to deal with BH_Shadow handling here.  This is needed
when we are writing out buffer heads correspond to ext4 metadata
(e.g., an inode table block, a block group descriptor block).  We're
only writing out bh's corresponding to the journal, so the BH_Shadow
bit should never be set on such bh's.

> +static inline u8 *fc_add_tag(u8 *dst, u16 tag, u16 len, u8 *val)

Can you add some documentation for this function?  In particular, what
does it return?  I also tend to prefer to pass in the pointer to the
buffer (val) first, followed then by the length (len), but that's more
of a personal preference.

> +int ext4_fc_write_inode(journal_t *journal, struct buffer_head *bh,
> +			struct inode *inode, tid_t tid, tid_t subtid,
> +			int is_last, struct dentry *dentry)
> +{

  ...
> +
> +	memcpy(&fc_hdr->inode, ext4_raw_inode(&iloc), EXT4_INODE_SIZE(sb));

So this is a bit problematic.  In the structure definition,
fc_hdr->inode is not at the end of the structure

struct ext4_fc_commit_hdr {
	/* Fast commit magic, should be EXT4_FC_MAGIC */
	__le32 fc_magic;
	...	
	/* ext4 inode on disk copy */
	struct ext4_inode inode;
	/* Csum(hdr+contents) */
	__le32 fc_csum;
};

... and the size of struct ext4_inode is just the fixed portion of the
inode, and is almost always smaller than EXT4_INODE_SIZE(sb) ---
except in the case of 128 byte inodes, in which case the fields
i_extra_isize and beyond going to be beyond the 128 byte limit.

So this isn't going to work.  I'm guessing you didn't test with
extended attributes, because the checksum would have overwritten the
beginning of the in-inode xattrs?

Also, note that EXT4_INODE_SIZE(sb) can be set to the block size.
It's super-rare, but that is legal.  Which means we need to test for
that case somewhere, and either (a) disable fast commits when the
inode size == blocksize, or (b) support a fast commit log which is
larger than a single block.  (This is doable, since there is a
checksum field to protect against partial writes.)

> +struct ext4_fc_commit_hdr {
> +	/* Fast commit magic, should be EXT4_FC_MAGIC */
> +	__le32 fc_magic;
> +	/* Sub transaction ID */
> +	__le32 fc_subtid;
> +	/* Features used by this fast commit block */
> +	__u8 fc_features;
> +	/* Flags for this block. */
> +	__u8 fc_flags;

What fs_features and fc_flags are you thinking we would need?  I can't
think of a good reasons to have per-fc block features.  But I can
think of reasons why we might want to support a small number of blocks
in an fc entry.  So maybe repurpose fc_features with some limit, such
as say, 4 blocks, and on the replay side we can just kmalloc 4 *
blocksize worth of space to read in that number of blocks if
necessary?

> +	/* Number of TLVs in this fast commmit block */
> +	__le16 fc_num_tlvs;
> +	/* Inode number */
> +	__le32 fc_ino;
> +	/* ext4 inode on disk copy */
> +	struct ext4_inode inode;
> +	/* Csum(hdr+contents) */
> +	__le32 fc_csum;

I'd suggest putting the checksum at the very end of the fc entry.
e.g., at offset 4092 if there is only a single block in the fc commit
entry.  Also, I'd make sure that we explicitly zero all of the bytes
at the end of the TLV section and the checksum, and specify that the
checksum is calculated including the must-be-zero padding, just to
keep things simple.

						- Ted

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <CAAJeciXQiE022GqcsTr35jSqjA6eH+zBS2KNvDPj5PovButdYA@mail.gmail.com>]

* Re: [PATCH v3 09/13] ext4: fast-commit commit path changes
       [not found]     ` <CAAJeciXQiE022GqcsTr35jSqjA6eH+zBS2KNvDPj5PovButdYA@mail.gmail.com>
@ 2019-10-23 12:44       ` Theodore Y. Ts'o
  0 siblings, 0 replies; 10+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-23 12:44 UTC (permalink / raw)
  To: xiaohui li; +Cc: Harshad Shirwadkar, linux-ext4

On Wed, Oct 23, 2019 at 04:58:47PM +0800, xiaohui li wrote:
> why not let fsync handle enjoy one transaction exclusively ?
> that is to say, in this transaction, there is only one handle which is
> generated in one file's fsync path .

There is only one handle which is generated in one file's fsync path.
That isn't the problem.  (If it were that simple, we would have done
it a long time ago.)

The problem is that there may have been other handles that have been
started before the fsync transaction, and these handles will have
already made changes to the file system.  Worse, some of those handles
may have made changes in the same metadata blocks which the fsync
operation needs to modify.

For example, suppose we are three seconds into the current
transaction, with potentially hundreds of handles that have already
been started and finished --- but not yet committed, because the
current transaction hasn't closed.  All of those handles have already
been attached to the current transaction, and they can't be ignored.

The fast commit patch set deals with this by using part of the journal
for a "fast commit journal" where we essentially are doing a very
simplified logical journal.  It doesn't handle all cases, and there
will be situations where we will need to fall back to the physical
journalling techniques used in ext4 today.  For example, if the file
has been truncated, and then a single 4k block is written, and then
the file gets fsync'ed, we won't be able to use the fast commit
logical journal.  Fortunately, the common case which compromises well
over 99% of most workloads are much simpler to handle, and these can
be handled via the fast commit patch.

The fast commit approach is a simplified version of the idea proposed
by Daejun Park and Dungkun Shih from the Sungkyunkwan University in
Korea, and which were presented in the paper "iJournaling:
Fine-Grained Journaling for Improving the Latency of Fsync System
Call[1]", presented at the Usenix Annual Technical Conference in 2017.

[1] https://www.usenix.org/conference/atc17/technical-sessions/presentation/park

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-11-08  7:59 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1571900042725.99617@xiaomi.com>
2019-10-24 20:18 ` [PATCH v3 09/13] ext4: fast-commit commit path changes Theodore Y. Ts'o
     [not found]   ` <1572349386604.43878@xiaomi.com>
2019-10-29 21:35     ` Theodore Y. Ts'o
2019-10-30  4:28       ` 答复: [External Mail]Re: " Xiaohui1 Li 李晓辉
2019-10-30 14:26         ` Theodore Y. Ts'o
2019-11-04  1:01           ` xiaohui li
2019-11-04  3:22             ` Theodore Y. Ts'o
2019-11-08  7:58               ` xiaohui li
2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
2019-10-01  7:40 ` [PATCH v3 09/13] ext4: fast-commit commit path changes Harshad Shirwadkar
2019-10-16 22:45   ` Theodore Y. Ts'o
     [not found]     ` <CAAJeciXQiE022GqcsTr35jSqjA6eH+zBS2KNvDPj5PovButdYA@mail.gmail.com>
2019-10-23 12:44       ` Theodore Y. Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).