* Directory unremovable on ext4 no_journal mode
From: Jayashree Mohan @ 2018-04-10  0:08 UTC
To: linux-ext4, fstests; +Cc: Vijaychidambaram Velayudhan Pillai

Hi,

We stumbled upon what seems to be a bug that makes a “directory unremovable”, on ext4 when mounted with no_journal option.

A sequence of operations described below led to the following state:
“A directory that was renamed, was persisted in both parent and target directories, with the same inode number. This also means the rename was non-atomic on storage. In addition, the renamed directory becomes unremovable on the target with FS-error logged in dmesg.”

Here are more details of the workload and the corresponding failure.

Workload:

mkdir /mnt/test/X and /mnt/test/Y
mkdir X/Z
sync()
rename X/Z Y/Z
fsync Y
---Crash now---
Remount
ls X and Y (You will see Z is present in both directories X and Y, and has same inode)
rmdir test_dir/X/Z (This succeeds)
rmdir test_dir/Y/Z (This fails with a FS error logged in dmesg)

Results:

rmdir: failed to remove '/mnt/test/Y/Z': Structure needs cleaning

The corresponding dmesg log has the following error message:
[66799.504124] EXT4-fs error (device cow_ram_snapshot1_0): ext4_lookup:1576: inode #12: comm rmdir: deleted inode referenced: 14
[66799.504131] EXT4-fs (cow_ram_snapshot1_0): Remounting filesystem read-only

The sequence of operations listed above is making dir Z unremovable from dir Y, which seems like unexpected behavior. Could you provide more details on the reason for such behavior? We understand we run this on no_journal mode of ext4, but would like you to verify if this behavior is acceptable.

Do let us know if we are missing any detail here.

Thanks,
Jayashree Mohan
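For readers who want to replay the reported steps, the workload can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the CrashMonkey test itself: the helper names (`fsync_dir`, `run_workload`, `same_inode`) are hypothetical, the root directory is whatever the caller passes in, and the crash after `fsync Y` has to be induced by other means. The sketch only marks where it would occur.

```python
import os
import tempfile


def fsync_dir(path):
    """Open a directory read-only and fsync it (the 'fsync Y' step)."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def run_workload(root):
    """Replay the reported steps under `root`, up to the crash point."""
    x = os.path.join(root, "X")
    y = os.path.join(root, "Y")
    os.mkdir(x)
    os.mkdir(y)
    os.mkdir(os.path.join(x, "Z"))
    os.sync()                                   # sync()
    os.rename(os.path.join(x, "Z"), os.path.join(y, "Z"))
    fsync_dir(y)                                # fsync Y
    # The report crashes the machine here; this sketch does not simulate it.
    return x, y


def same_inode(a, b):
    """True if both paths exist and refer to the same device and inode."""
    try:
        sa, sb = os.stat(a), os.stat(b)
    except FileNotFoundError:
        return False
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)


if __name__ == "__main__":
    # Hypothetical usage on a scratch directory instead of /mnt/test.
    with tempfile.TemporaryDirectory() as root:
        x, y = run_workload(root)
        print(same_inode(os.path.join(x, "Z"), os.path.join(y, "Z")))
```

Without a crash, `same_inode` reports no match for X/Z and Y/Z, because the rename removes the old entry. The report describes a post-crash state in which both entries exist and the check would match.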
* Re: Directory unremovable on ext4 no_journal mode
From: Darrick J. Wong @ 2018-04-10  0:38 UTC
To: Jayashree Mohan; +Cc: linux-ext4, fstests, Vijaychidambaram Velayudhan Pillai

On Mon, Apr 09, 2018 at 07:08:13PM -0500, Jayashree Mohan wrote:
> Hi,
>
> We stumbled upon what seems to be a bug that makes a “directory
> unremovable”, on ext4 when mounted with no_journal option.
>
> A sequence of operations described below led to the following state :
> “A directory that was renamed, was persisted in both parent and target
> directories, with the same inode number. This also means the rename
> was non-atomic on storage. In addition, the renamed directory becomes
> unremovable on the target with FS-error logged in dmesg.”
>
> Here are more details of the workload and the corresponding failure.
>
> Workload :
>
> mkdir /mnt/test/X and /mnt/test/Y
> mkdir X/Z
> sync()
> rename X/Z Y/Z
> fsync Y
> ---Crash now---
> Remount

You're supposed to run e2fsck after a crash to clean up the metadata.
nojournal disables the piece that takes care of that.

--D

> ls X and Y (You will see Z is present in both directories X and Y, and
> has same inode)
> rmdir test_dir/X/Z (This succeeds)
> rmdir test_dir/Y/Z (This fails with a FS error logged in dmesg)
>
>
> Results:
>
> rmdir: failed to remove '/mnt/test/Y/Z': Structure needs cleaning
>
> The corresponding dmesg log has the following error message :
> [66799.504124] EXT4-fs error (device cow_ram_snapshot1_0):
> ext4_lookup:1576: inode #12: comm rmdir: deleted inode referenced: 14
> [66799.504131] EXT4-fs (cow_ram_snapshot1_0): Remounting filesystem read-only
>
> The sequence of operations listed above is making dir Z unremovable
> from dir Y, which seems like unexpected behavior. Could you provide
> more details on the reason for such behavior? We understand we run
> this on no_journal mode of ext4, but would like you to verify if this
> behavior is acceptable.
>
> Do let us know if we are missing any detail here.
>
> Thanks,
> Jayashree Mohan
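The advice to run e2fsck after a crash could be scripted along these lines. This is a hedged sketch, not a recommended recovery procedure: the function names are hypothetical, the device path in the usage note is a placeholder, and the file system is assumed to be unmounted before the check runs. The only e2fsck options used are `-f` (force a check), `-n` (open read-only, answer "no" to all questions) and `-y` (answer "yes" to all questions).

```python
import subprocess


def build_e2fsck_cmd(device, repair=False):
    """Build the e2fsck argument list for a forced check.

    -n reports problems without changing anything; -y lets e2fsck
    apply every fix it proposes. The two are mutually exclusive.
    """
    mode = "-y" if repair else "-n"
    return ["e2fsck", "-f", mode, device]


def check_filesystem(device, repair=False):
    """Run e2fsck on an unmounted device and return its exit status."""
    result = subprocess.run(build_e2fsck_cmd(device, repair))
    return result.returncode
```

A caller might first run `check_filesystem("/dev/example")` to see what e2fsck reports, then rerun with `repair=True` before remounting. According to the e2fsck manual page, an exit status of 0 means no errors were found, and the value 4 in the status bitmask means errors were left uncorrected.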
* Re: Directory unremovable on ext4 no_journal mode
From: Theodore Y. Ts'o @ 2018-04-10  3:12 UTC
To: Jayashree Mohan; +Cc: linux-ext4, fstests, Vijaychidambaram Velayudhan Pillai

On Mon, Apr 09, 2018 at 07:08:13PM -0500, Jayashree Mohan wrote:
> Hi,
>
> We stumbled upon what seems to be a bug that makes a “directory
> unremovable”, on ext4 when mounted with no_journal option.

Hi Jayashree,

If you use no_journal mode, you **must** run e2fsck after a crash. And you do have to potentially be ready for data loss after a crash. So no, this isn't a bug. The guarantees that you have when you use no_journal are essentially limited to what Posix specifies when you crash uncleanly --- "the results are undefined".

> The sequence of operations listed above is making dir Z unremovable
> from dir Y, which seems like unexpected behavior. Could you provide
> more details on the reason for such behavior? We understand we run
> this on no_journal mode of ext4, but would like you to verify if this
> behavior is acceptable.

We use no_journal mode at Google, but we are prepared to effectively reinstall the root partition, and we are prepared to lose data on our data disks, after a crash. We are OK with this because all persistent data stored on machines is data we are prepared to lose (e.g., cached data or easily reinstalled system software) or part of our cluster file system, where we use erasure codes to assure that data in the cluster file system can remain accessible even if (a) a disk dies completely, or (b) the entry router on the rack dies, denying access to all of the disks in a rack from the cluster file system until the router can be repaired. So losing a file or a directory after running e2fsck after a crash is actually small beer compared to any number of other things that can happen to a disk.

The goal for no_journal mode is performance at all costs, and we are prepared to sacrifice file system robustness after a crash. This means we aren't doing any kind of FUA writes or CACHE FLUSH operations, because those would compromise performance. (As a thought experiment, I would encourage you to try to design a file system that would provide better guarantees without using FUA writes, CACHE FLUSH operations, and with the HDD's write-back cache enabled.)

To understand why this is so important, I would recommend that you read the "Disks for Data Centers" paper[1]. There is also a lot of good stuff in the FAST 2016 keynote[2] that isn't in the paper or the slides. So listening to the audio recording is also something I strongly commend for people who want to understand Google's approach to storage. (Before 2016, we had always considered this part of our "secret sauce" that we had never disclosed for the past decade, since it is what gave us a huge storage TCO advantage over other companies.)

[1] https://research.google.com/pubs/pub44830.html
[2] https://www.usenix.org/node/194391

Essentially, we are trying to use all of the two baskets of value provided by each HDD. That is, we want to use nearly all of the byte capacity and all of the IOPS that an HDD can provide --- and FUA writes or CACHE FLUSHES significantly compromise the number of I/O operations the HDD can provide. (More details about how we do this at the cluster level can be found in the PDSW 2017 keynote[3], but it goes well beyond the scope of what gets done on a single file system on a single HDD.)

[3] http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Keynote.pdf

Regards,

- Ted

P.S. This is not to say that the work you are doing with CrashMonkey et al. is useless; it's just not applicable for a cluster file system in a hyper-scale cloud environment. Local disk file systems and robustness after a crash are still important in applications such as Android and Chrome OS, for example. Note that we do *not* use no_journal mode in those environments. :-)
* Re: Directory unremovable on ext4 no_journal mode
From: Vijay Chidambaram @ 2018-04-10  3:21 UTC
To: Theodore Y. Ts'o; +Cc: Jayashree Mohan, Ext4, fstests

Thanks Ted! This information is very useful. We won't pursue testing ext4-no-journal further, as there is no problem e2fsck cannot fix if data loss is tolerated.

I wanted to point you to an old paper of mine that has a similar goal of performance at all costs: the No Order File System (http://research.cs.wisc.edu/adsl/Publications/nofs-fast12.pdf). It doesn't use any FLUSH or FUA instructions, and instead obtains consistency from mutual agreement between file-system objects. It requires that we be able to atomically write a "backpointer" with each disk block (perhaps in an out-of-band area). I thought you might find it interesting!

Thanks,
Vijay Chidambaram
http://www.cs.utexas.edu/~vijay/

On Mon, Apr 9, 2018 at 10:12 PM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> On Mon, Apr 09, 2018 at 07:08:13PM -0500, Jayashree Mohan wrote:
>> Hi,
>>
>> We stumbled upon what seems to be a bug that makes a “directory
>> unremovable”, on ext4 when mounted with no_journal option.
>
> Hi Jayashree,
>
> If you use no_journal mode, you **must** run e2fsck after a crash.
> And you do have to potentially be ready for data loss after a crash.
> So no, this isn't a bug. The guarantees that you have when you use
> no_journal are essentially limited to what Posix specifies when you
> crash uncleanly --- "the results are undefined".
>
>> The sequence of operations listed above is making dir Z unremovable
>> from dir Y, which seems like unexpected behavior. Could you provide
>> more details on the reason for such behavior? We understand we run
>> this on no_journal mode of ext4, but would like you to verify if this
>> behavior is acceptable.
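The "mutual agreement" idea mentioned for the No Order File System can be illustrated with a toy in-memory model. This is only a sketch of the general backpointer check as described in the message above, not the NoFS implementation: the `Block` and `Inode` types, the field names, and the `valid_blocks` helper are all hypothetical, and the atomic write of a backpointer together with its block is simply assumed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    data: bytes
    owner_inode: int   # backpointer: inode that owns this block (assumed
                       # to be written atomically with the data)
    offset: int        # backpointer: position of the block within the file


@dataclass
class Inode:
    number: int
    block_ptrs: List[int]   # forward pointers: block numbers, in file order


def valid_blocks(inode: Inode, disk: Dict[int, Block]) -> List[int]:
    """Return the block numbers whose backpointer agrees with the inode.

    A forward pointer is trusted only if the block it names points back
    to the same inode at the same file offset.
    """
    agreed = []
    for index, blkno in enumerate(inode.block_ptrs):
        block = disk.get(blkno)
        if (block is not None
                and block.owner_inode == inode.number
                and block.offset == index):
            agreed.append(blkno)
    return agreed


if __name__ == "__main__":
    # Block 11 still carries a backpointer to a different inode (9),
    # as could happen if a crash reordered or dropped a write.
    disk = {
        10: Block(b"a", owner_inode=7, offset=0),
        11: Block(b"b", owner_inode=9, offset=0),
    }
    inode = Inode(number=7, block_ptrs=[10, 11])
    print(valid_blocks(inode, disk))
```

In this model a stale or misdirected forward pointer is detected at read time by the disagreement, rather than being prevented by ordering writes with FLUSH or FUA.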
* Re: Directory unremovable on ext4 no_journal mode
From: Jayashree Mohan @ 2018-04-10 12:07 UTC
To: Vijaychidambaram Velayudhan Pillai; +Cc: Theodore Y. Ts'o, Ext4, fstests

Hi Ted,

Thank you for the detailed response! It makes things much clearer now. I understand why no journal mode is used and what guarantees to expect while using it. Will keep this in mind for future CrashMonkey testing.

Thanks,
Jayashree