* Selective Data Journaling in ext4
@ 2019-02-12  0:14 Vijay Chidambaram
  2019-02-12  1:25 ` Andreas Dilger
  0 siblings, 1 reply; 5+ messages in thread
From: Vijay Chidambaram @ 2019-02-12  0:14 UTC
  To: linux-ext4, jesus.palos, Theodore Tso

Hi all,

We would like to present an idea to improve the performance of data
journaling in ext4. Data journaling is expensive because data is
written twice: once to the journal and once to the actual file system.
Passing data through the journal provides consistency guarantees that
ordered journaling mode cannot provide (for example, data journaling
prevents a data block from being partially written).

The idea behind Selective Data Journaling is simple: create a new
journaling mode by modifying ordered journaling mode to journal data
blocks which are already part of a file. Data blocks which are newly
allocated are not part of the journal, and are written out before the
journal blocks in accordance with ordered mode's ordering guarantees.
If there is a crash before transaction commit, the only side effect is
un-allocated data blocks getting written with new data.

Selective Data Journaling provides a lot of the benefits of data
journaling, at significantly lower cost. For workloads which mostly
deal with new data blocks (any applications which update files via
atomic rename), Selective Data Journaling can increase performance
significantly.

We came up with Selective Data Journaling during my PhD at the
University of Wisconsin Madison [1]. I haven't inspected the ext4
codebase deeply since then, so this optimization may already exist.
There may also be problems with this approach that we have not
considered -- we are open to discussion. It may also be the case that
nobody uses data journaling so the extra complexity is not worth it.

If this is something you would like to see implemented, my student
Jesus Palos (cced) is interested in doing so. We would like to discuss
how best to implement this if you are interested.

[1] http://research.cs.wisc.edu/adsl/Publications/optfs-sosp13.pdf

Thanks,
Vijay Chidambaram, UT Austin
http://www.cs.utexas.edu/~vijay/
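
The block-selection rule in the proposal above can be sketched as a toy model. This is a minimal Python illustration under stated assumptions, not ext4 or jbd2 code: `ToyFile`, `plan_write`, and `apply_write` are hypothetical names, and a file is reduced to a set of already-allocated logical block numbers.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFile:
    # Logical block numbers already allocated to this (hypothetical) file.
    allocated: set = field(default_factory=set)


def plan_write(f, blocks):
    """Split a write into (journaled, direct) block lists.

    Overwrites of already-allocated blocks go through the journal;
    newly allocated blocks are written in place, as in ordered mode.
    """
    journaled = sorted(b for b in blocks if b in f.allocated)
    direct = sorted(b for b in blocks if b not in f.allocated)
    return journaled, direct


def apply_write(f, blocks):
    """Return the ordered I/O steps for one toy transaction."""
    journaled, direct = plan_write(f, blocks)
    # Ordered-mode rule: new data blocks are written before journal blocks.
    steps = [("write_data", b) for b in direct]
    steps += [("journal_data", b) for b in journaled]
    steps.append(("commit", None))
    # The new blocks become part of the file once the transaction commits.
    f.allocated.update(direct)
    return steps
```

In this model, a crash before the `commit` step leaves only blocks that were not yet part of the file holding new data, which matches the side effect described in the message.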
* Re: Selective Data Journaling in ext4
  2019-02-12  0:14 Selective Data Journaling in ext4 Vijay Chidambaram
@ 2019-02-12  1:25 ` Andreas Dilger
  2019-02-13 16:30   ` Vijay Chidambaram
  0 siblings, 1 reply; 5+ messages in thread
From: Andreas Dilger @ 2019-02-12  1:25 UTC
  To: Vijay Chidambaram; +Cc: linux-ext4, jesus.palos, Theodore Tso

On Feb 11, 2019, at 5:14 PM, Vijay Chidambaram <vijayc@utexas.edu> wrote:
>
> Hi all,
>
> We would like to present an idea to improve the performance of data
> journaling in ext4. Data journaling is expensive because data is
> written twice: once to the journal and once to the actual file system.
> Passing data through the journal provides consistency guarantees that
> ordered journaling mode cannot provide (for example, data journaling
> prevents a data block from being partially written).
>
> The idea behind Selective Data Journaling is simple: create a new
> journaling mode by modifying ordered journaling mode to journal data
> blocks which are already part of a file. Data blocks which are newly
> allocated are not part of the journal, and are written out before the
> journal blocks in accordance with ordered mode's ordering guarantees.
> If there is a crash before transaction commit, the only side effect is
> un-allocated data blocks getting written with new data.
>
> Selective Data Journaling provides a lot of the benefits of data
> journaling, at significantly lower cost. For workloads which mostly
> deal with new data blocks (any applications which update files via
> atomic rename), Selective Data Journaling can increase performance
> significantly.

One major caveat here is that files are *very rarely* overwritten in
place. This is mostly useful for database-type workloads, and most
databases already have their own transaction journal independent of
the filesystem journal, so AFAIK this would not be a very widely-used
feature.

That said, a related, but IMHO much more useful form of selective data
journaling would be for "random IOPS" workloads, where there may be
many small writes either to a single file or to many small files, and
this IO could be aggregated and optimized with fast linear writes to
the journal, possibly on a separate flash device. That avoids a lot
of seeks for the main filesystem device for small IO (which would
otherwise be IOPS limited and not bandwidth limited, so the double data
writes are not a limiting factor), while allowing large writes to go
directly to the filesystem device and avoid the double writes (which
would otherwise reduce IO bandwidth by half).

Since we already have delalloc to pre-stage the dirty pages before the
write, we can make a good decision about whether the file data should
be written to the journal or directly to the filesystem.

This could likely leverage the work that was already done for SMR journal
mode (Ted has patches, and I think they are available online as well),
and hopefully integrate both those patches and this new work into mainline
ext4.

I'm happy to discuss this further if you are interested.

> We came up with Selective Data Journaling during my PhD at the
> University of Wisconsin Madison [1]. I haven't inspected the ext4
> codebase deeply since then, so this optimization may already exist.
> There may also be problems with this approach that we have not
> considered -- we are open to discussion. It may also be the case that
> nobody uses data journaling so the extra complexity is not worth it.
>
> If this is something you would like to see implemented, my student
> Jesus Palos (cced) is interested in doing so. We would like to discuss
> how best to implement this if you are interested.
>
> [1] http://research.cs.wisc.edu/adsl/Publications/optfs-sosp13.pdf
>
> Thanks,
> Vijay Chidambaram, UT Austin
> http://www.cs.utexas.edu/~vijay/

Cheers, Andreas
* Re: Selective Data Journaling in ext4
  2019-02-12  1:25 ` Andreas Dilger
@ 2019-02-13 16:30   ` Vijay Chidambaram
  2019-02-13 18:53     ` Theodore Y. Ts'o
  0 siblings, 1 reply; 5+ messages in thread
From: Vijay Chidambaram @ 2019-02-13 16:30 UTC
  To: Andreas Dilger; +Cc: linux-ext4, jesus.palos, Theodore Tso

On Mon, Feb 11, 2019 at 7:25 PM Andreas Dilger <adilger@dilger.ca> wrote:
>
> On Feb 11, 2019, at 5:14 PM, Vijay Chidambaram <vijayc@utexas.edu> wrote:
> >
> > Hi all,
> >
> > We would like to present an idea to improve the performance of data
> > journaling in ext4. Data journaling is expensive because data is
> > written twice: once to the journal and once to the actual file system.
> > Passing data through the journal provides consistency guarantees that
> > ordered journaling mode cannot provide (for example, data journaling
> > prevents a data block from being partially written).
> >
> > The idea behind Selective Data Journaling is simple: create a new
> > journaling mode by modifying ordered journaling mode to journal data
> > blocks which are already part of a file. Data blocks which are newly
> > allocated are not part of the journal, and are written out before the
> > journal blocks in accordance with ordered mode's ordering guarantees.
> > If there is a crash before transaction commit, the only side effect is
> > un-allocated data blocks getting written with new data.
> >
> > Selective Data Journaling provides a lot of the benefits of data
> > journaling, at significantly lower cost. For workloads which mostly
> > deal with new data blocks (any applications which update files via
> > atomic rename), Selective Data Journaling can increase performance
> > significantly.
>
> One major caveat here is that files are *very rarely* overwritten in
> place. This is mostly useful for database-type workloads, and most
> databases already have their own transaction journal independent of
> the filesystem journal, so AFAIK this would not be a very widely-used
> feature.

Agreed, but another way to view this feature is that it is dynamic
switching between ordered mode and data journaling mode. We switch to
data journaling mode exactly when it is required, so you are right
that most applications would never see a difference. But when it is
required, this scheme would ensure stronger semantics are provided.
Overall, it provides data-journaling guarantees all the time, and I
was thinking some applications would like that peace of mind.

> That said, a related, but IMHO much more useful form of selective data
> journaling would be for "random IOPS" workloads, where there may be
> many small writes either to a single file or to many small files, and
> this IO could be aggregated and optimized with fast linear writes to
> the journal, possibly on a separate flash device. That avoids a lot
> of seeks for the main filesystem device for small IO (which would
> otherwise be IOPS limited and not bandwidth limited, so the double data
> writes are not a limiting factor), while allowing large writes to go
> directly to the filesystem device and avoid the double writes (which
> would otherwise reduce IO bandwidth by half).
>
> Since we already have delalloc to pre-stage the dirty pages before the
> write, we can make a good decision about whether the file data should
> be written to the journal or directly to the filesystem.
>
> This could likely leverage the work that was already done for SMR journal
> mode (Ted has patches, and I think they are available online as well),
> and hopefully integrate both those patches and this new work into mainline
> ext4.
>
> I'm happy to discuss this further if you are interested.

We like this idea as well, and would be happy to work on it! To make
sure we are on the same page, the proposal is to:
- identify whether writes are sequential or random (1)
- Send random writes to the journal if Selective Data Journaling is
  enabled (2)

How should we do (1)? Also, would it make sense to do this per-file
instead of as a mode for the entire file system? I am thinking of
opening a file with O_SDJ which will convert random writes to
sequential and increase performance.
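
Step (2) of the proposal above, together with the double-write cost discussed earlier in the thread, can be sketched as a small accounting model. This is a toy Python illustration and not ext4 code: `route_write` and `bytes_to_storage` are hypothetical names, the `sdj_enabled` flag stands in for the proposed per-file opt-in, and each write is assumed to be already labelled "random" or "sequential" by step (1).

```python
def route_write(kind, sdj_enabled):
    """Return 'journal' or 'direct' for one write.

    kind is 'random' or 'sequential'; sdj_enabled models the proposed
    per-file (or per-filesystem) opt-in, which is an assumption here.
    """
    if sdj_enabled and kind == "random":
        return "journal"
    return "direct"


def bytes_to_storage(writes, sdj_enabled):
    """Total bytes written to storage for a list of (kind, nbytes) writes.

    Journaled data is counted twice: once in the journal and once at
    its final location when it is checkpointed.
    """
    total = 0
    for kind, nbytes in writes:
        copies = 2 if route_write(kind, sdj_enabled) == "journal" else 1
        total += copies * nbytes
    return total
```

For one 4 KiB random write plus one 1 MiB sequential write, the model counts 1,056,768 bytes with selective routing, against 2,105,344 bytes if every write were journaled. The large write dominates the byte count, which is why sending only small random I/O through the journal costs little bandwidth.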
* Re: Selective Data Journaling in ext4
  2019-02-13 16:30   ` Vijay Chidambaram
@ 2019-02-13 18:53     ` Theodore Y. Ts'o
  2019-02-13 21:08       ` Andreas Dilger
  0 siblings, 1 reply; 5+ messages in thread
From: Theodore Y. Ts'o @ 2019-02-13 18:53 UTC
  To: Vijay Chidambaram; +Cc: Andreas Dilger, linux-ext4, jesus.palos

On Wed, Feb 13, 2019 at 10:30:47AM -0600, Vijay Chidambaram wrote:
> Agreed, but another way to view this feature is that it is dynamic
> switching between ordered mode and data journaling mode. We switch to
> data journaling mode exactly when it is required, so you are right
> that most applications would never see a difference. But when it is
> required, this scheme would ensure stronger semantics are provided.
> Overall, it provides data-journaling guarantees all the time, and I
> was thinking some applications would like that peace of mind.

Switching back and forth between ordered and data journalling mode is
a bit tricky. (Insert "one does not simply walk into Mordor" meme
here).

See the comment in ext4_change_journal_flag() in fs/ext4/inode.c:

        /*
         * We have to be very careful here: changing a data block's
         * journaling status dynamically is dangerous.  If we write a
         * data block to the journal, change the status and then delete
         * that block, we risk forgetting to revoke the old log record
         * from the journal and so a subsequent replay can corrupt data.
         * So, first we make sure that the journal is empty and that
         * nobody is changing anything.
         */

What this means is that you have to track a list of blocks that have
ever been data journalled, because before we delete the file, we have
to write revoke records for all blocks on the list that belong to that
file. Similarly, if you switch from ordered to data journalling mode,
all of those blocks must be revoked.

This should also be done in a way that avoids serializing parallel
writes to the inode. That's not something we support today (yet),
but there are some plans to allow parallel direct I/O writes to the
file. Speaking of Direct I/O writes, as above, if a block was
previously written via data journalling, the revoke block must be
submitted --- and committed --- before Direct I/O writes to that block
can be allowed.

> > Since we already have delalloc to pre-stage the dirty pages before the
> > write, we can make a good decision about whether the file data should
> > be written to the journal or directly to the filesystem.

Note that delalloc and data journalling are not compatible. That being
said, if we are writing to a not-yet-allocated block, the approach from
recent discussions, changing ext4 so that we only insert the block into
the extent tree in a workqueue triggered by the I/O callback for the
data block write, is probably the better way of removing the
data=ordered overhead.

Finally, this optimization only makes sense for HDDs, right? For
SSDs, random writes are mostly free, and the cost of the double
write, not to mention the write amplification effect, probably makes
this not worthwhile.

Cheers,

					- Ted
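
The revoke hazard described in this message can be made concrete with a toy replay model. This is a minimal Python sketch under simplifying assumptions and is not how jbd2 is implemented: `ToyJournal`, `release_block`, and `replay` are hypothetical names, the journal is a flat list of records with no transactions or checkpointing, and a revoke record simply cancels earlier data records for the same block.

```python
def replay(journal, disk):
    """Replay a toy journal onto disk, honouring revoke records.

    A data record is skipped if a revoke for the same block appears
    later in the journal; otherwise its payload overwrites the block.
    """
    last_revoke = {}
    for i, rec in enumerate(journal):
        if rec[0] == "revoke":
            last_revoke[rec[1]] = i
    for i, rec in enumerate(journal):
        if rec[0] == "data":
            _, block, payload = rec
            if last_revoke.get(block, -1) > i:
                continue  # revoked: do not resurrect stale contents
            disk[block] = payload
    return disk


class ToyJournal:
    def __init__(self):
        self.records = []
        # Blocks that have ever been data journalled and not yet revoked.
        self.ever_journaled = set()

    def journal_data(self, block, payload):
        self.records.append(("data", block, payload))
        self.ever_journaled.add(block)

    def release_block(self, block):
        """Call before a block is freed or written by direct I/O."""
        if block in self.ever_journaled:
            self.records.append(("revoke", block))
            self.ever_journaled.discard(block)
```

If block 7 is journaled with old contents and later rewritten in place without a revoke, replay in this model overwrites the new data with the stale journal copy. Calling `release_block(7)` first prevents that, which is the bookkeeping the message says a dynamic mode switch would need.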
* Re: Selective Data Journaling in ext4
  2019-02-13 18:53     ` Theodore Y. Ts'o
@ 2019-02-13 21:08       ` Andreas Dilger
  0 siblings, 0 replies; 5+ messages in thread
From: Andreas Dilger @ 2019-02-13 21:08 UTC
  To: Theodore Y. Ts'o; +Cc: Vijay Chidambaram, linux-ext4, jesus.palos

On Feb 13, 2019, at 11:53 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Wed, Feb 13, 2019 at 10:30:47AM -0600, Vijay Chidambaram wrote:
>> Agreed, but another way to view this feature is that it is dynamic
>> switching between ordered mode and data journaling mode. We switch to
>> data journaling mode exactly when it is required, so you are right
>> that most applications would never see a difference. But when it is
>> required, this scheme would ensure stronger semantics are provided.
>> Overall, it provides data-journaling guarantees all the time, and I
>> was thinking some applications would like that peace of mind.
>
> Switching back and forth between ordered and data journalling mode is
> a bit tricky. (Insert "one does not simply walk into Mordor" meme
> here).
>
> See the comment in ext4_change_journal_flag() in fs/ext4/inode.c:
>
>         /*
>          * We have to be very careful here: changing a data block's
>          * journaling status dynamically is dangerous.  If we write a
>          * data block to the journal, change the status and then delete
>          * that block, we risk forgetting to revoke the old log record
>          * from the journal and so a subsequent replay can corrupt data.
>          * So, first we make sure that the journal is empty and that
>          * nobody is changing anything.
>          */
>
> What this means is that you have to track a list of blocks that have
> ever been data journalled, because before we delete the file, we have
> to write revoke records for all blocks on the list that belong to that
> file. Similarly, if you switch from ordered to data journalling mode,
> all of those blocks must be revoked.

To avoid the issue of enabling data journaling on a file, and the more
difficult process of disabling data journaling, I think we can be lazy
and defer disabling data journaling on a file until after the last
journal tid that contains data blocks from the file has been
checkpointed out of the journal. It isn't like the case where the user
requests data journal be enabled or disabled *now*, so we just need to
e.g. put those files into the orphan list with a journal commit
(checkpoint?) callback to track when the data journal can be removed.

Alternately, just leave the data-journal mode enabled on such files,
since they are likely to be used in the same way in the future (or more
likely never modified again), and never disable data journal.

> This should also be done in a way that avoids serializing parallel
> writes to the inode. That's not something we support today (yet),
> but there are some plans to allow parallel direct I/O writes to the
> file. Speaking of Direct I/O writes, as above, if a block was
> previously written via data journalling, the revoke block must be
> submitted --- and committed --- before Direct I/O writes to that block
> can be allowed.
>
>>> Since we already have delalloc to pre-stage the dirty pages before the
>>> write, we can make a good decision about whether the file data should
>>> be written to the journal or directly to the filesystem.
>
> Note that delalloc and data journalling are not compatible. That being
> said, if we are writing to a not-yet-allocated block, the approach from
> recent discussions, changing ext4 so that we only insert the block into
> the extent tree in a workqueue triggered by the I/O callback for the
> data block write, is probably the better way of removing the
> data=ordered overhead.
>
> Finally, this optimization only makes sense for HDDs, right? For
> SSDs, random writes are mostly free, and the cost of the double
> write, not to mention the write amplification effect, probably makes
> this not worthwhile.

Sure, HDDs or hybrid HDDs with SSDs for the journal. Using the SMR
ext4 patches to enable log-structured write mode for ext4 would allow
using a good-sized journal device (32-64GB Optane M.2 devices are cheap
and very fast, and are the smallest devices available today; larger
SSDs are definitely practical to use). That allows sinking all of the
IOPS into the journal automatically without overwhelming the SSD
bandwidth with large writes that can efficiently be made directly to
HDDs, and then the checkpoint can do a better job of ordering the
writes to HDD later. With a RAID system the aggregate HDD bandwidth for
large read/write exceeds the SSD bandwidth.

This is definitely a workload that is of real-life interest (mixed large
and small file writes), so being able to optimize this at the ext4 level
would be great.

>> We like this idea as well, and would be happy to work on it! To make
>> sure we are on the same page, the proposal is to:
>> - identify whether writes are sequential or random (1)
>> - Send random writes to journal if Selective Data Journaling is enabled (2)
>>
>> How should we do (1)? Also, would it make sense to do this per-file
>> instead of as a mode for the entire file system? I am thinking of
>> opening a file with O_SDJ which will convert random writes to
>> sequential and increase performance.

There are really two things to (1) - small random/sync/unaligned writes
into a large file, and small writes to individual files. The VM already
does similar random/sequential read request detection for large files,
so the same could be used easily for write requests, and the latter can
be done by checking the file size.

Cheers, Andreas
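
The two cases for (1) named in this message, small files and non-sequential writes into large files, can be sketched as a simple per-inode classifier. This is a toy Python heuristic under stated assumptions, not the kernel's readahead logic and not ext4 code: `WriteClassifier`, the 64 KiB small-file cutoff, and the "write starts where the previous one ended" test are all hypothetical choices made for illustration.

```python
class WriteClassifier:
    """Label each write as 'small-file', 'sequential', or 'random'."""

    def __init__(self, small_file_bytes=64 * 1024):
        # Hypothetical cutoff below which a file counts as "small".
        self.small_file_bytes = small_file_bytes
        # inode -> offset at which a sequential follow-up write would start
        self.next_offset = {}

    def classify(self, inode, offset, length, file_size):
        expected = self.next_offset.get(inode)
        self.next_offset[inode] = offset + length
        if file_size <= self.small_file_bytes:
            return "small-file"
        # With no history for this inode there is no evidence of random
        # access yet, so the first write is treated as sequential.
        if expected is None or offset == expected:
            return "sequential"
        return "random"
```

Under this sketch, "small-file" and "random" writes would be candidates for the journal, while "sequential" writes would go directly to the filesystem device.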