Mikulas Patocka wrote:
>>> Possibilities how to fix it:
>>>
>>> 1. Lock the buffers and pages while they are being written --- this
>>> would cause performance degradation (the most severe degradation would
>>> occur when one process repeatedly calls sync() and another, unrelated
>>> process repeatedly writes to some file).
>>>
>>> Locking the buffers and pages only for RAID would create many special
>>> cases and possible bugs.
>>>
>>> 2. Never turn the region dirty bit off until the filesystem is
>>> unmounted --- this is the simplest fix. If the computer crashes after a
>>> long time, it resynchronizes the whole device, but it won't cause
>>> application-visible or filesystem-visible data corruption.
>>>
>>> 3. Turn off the region bit if the region wasn't written to in one
>>> pdflush period --- this requires interaction with pdflush and is rather
>>> complex. The problem here is that pdflush makes its best effort to
>>> write data within the dirty_writeback_centisecs interval, but it is not
>>> guaranteed to do so.
>>>
>>> 4. Add more region states: a region has the in-memory states CLEAN,
>>> DIRTY, MAYBE_DIRTY and CLEAN_CANDIDATE.
>>>
>>> When you start writing to the region, it is always moved to the DIRTY
>>> state (and the on-disk bit is turned on).
>>>
>>> When you finish all writes to the region, move it to the MAYBE_DIRTY
>>> state, but leave the bit on disk on. We now don't know whether the
>>> region is dirty or not.
>>>
>>> Run a helper thread that periodically does the following:
>>> Change MAYBE_DIRTY regions to CLEAN_CANDIDATE.
>>> Issue sync().
>>> Change CLEAN_CANDIDATE regions to the CLEAN state and clear their
>>> on-disk bit.
>>>
>>> The rationale is that if the above modify-while-write scenario happens,
>>> the page is always dirty. Thus, sync() will write the page, kick the
>>> region back from the CLEAN_CANDIDATE to the MAYBE_DIRTY state, and we
>>> won't mark the region as clean on disk.
>>>
>>>
>>> I'd like to know your ideas on this before we start coding a solution.
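The four-state scheme in option 4 above can be sketched in user space roughly as follows. This is a minimal illustration with hypothetical names, not actual md/bitmap code; locking, the real sync() call, and bitmap I/O are all omitted:

```c
/* Sketch of the CLEAN / DIRTY / MAYBE_DIRTY / CLEAN_CANDIDATE scheme.
 * All names are hypothetical illustration, not real kernel code. */
#include <assert.h>

enum region_state { CLEAN, DIRTY, MAYBE_DIRTY, CLEAN_CANDIDATE };

struct region {
	enum region_state state;
	int on_disk_bit;	/* 1 = region marked dirty in on-disk bitmap */
	int writes_in_flight;
};

/* A write starts: the region always becomes DIRTY and the on-disk bit
 * is set before the data goes to the array. */
static void region_write_start(struct region *r)
{
	r->state = DIRTY;
	r->on_disk_bit = 1;
	r->writes_in_flight++;
}

/* The last write finishes: we only know the region *may* be clean, so
 * record that without touching the on-disk bit. */
static void region_write_end(struct region *r)
{
	if (--r->writes_in_flight == 0)
		r->state = MAYBE_DIRTY;
}

/* Helper thread, phase 1: nominate quiescent regions. */
static void cleaner_mark_candidates(struct region *r, int n)
{
	for (int i = 0; i < n; i++)
		if (r[i].state == MAYBE_DIRTY)
			r[i].state = CLEAN_CANDIDATE;
}

/* Helper thread, phase 2, run after sync() completes: a region that is
 * still CLEAN_CANDIDATE was not redirtied during the sync, so its
 * on-disk bit can be cleared.  A page rewritten during the sync went
 * back through region_write_start() and left the candidate state. */
static void cleaner_finish(struct region *r, int n)
{
	for (int i = 0; i < n; i++)
		if (r[i].state == CLEAN_CANDIDATE) {
			r[i].state = CLEAN;
			r[i].on_disk_bit = 0;
		}
}
```

The key property is visible in the two cleaner phases: a write racing with the sync bounces the region out of CLEAN_CANDIDATE, so its on-disk bit survives and the region is still resynced after a crash.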
>>>
>>
>> I looked at just this problem a while ago, and came to the conclusion
>> that what was needed was a COW bit, to show that there was I/O in
>> flight and that the chunk needed to be copied before modification.
>> Since you don't want to let that recurse, you don't start writing the
>> copy until the original is written and freed. Ideally you wouldn't
>> bother to finish writing the original, but that doesn't seem possible.
>> That allows at most two copies of a chunk to take up memory space at
>> once, although it's still ugly and can be a bottleneck.
>>
>
> Copying the data would be performance overkill. You really can write
> different data to different disks; you just must not forget to resync
> them after a crash. The filesystem/application will recover with either
> old or new data --- it just won't recover when it's reading old and new
> data from the same location.
>

Currently you can go for hours without ever reaching a clean state on
active files. By not deliberately allowing the buffer to change during a
write, the chances of getting consistent data on the disk should be
significantly improved.

> From my point of view, that trick with a thread doing sync() and
> turning off region bits looks best. I'd like to know whether that
> solution has any other flaw.
>
>> For reliable operation I would want all copies (and/or CRCs) to be
>> written on an fsync; by the time I bother to fsync, I really, really
>> want the data on the disk.
>>
>
> fsync already works this way.
>

The point I was making is that after you change the code I would still
want that to happen. And your comment above seems to indicate a goal of
getting consistent data after a crash, with less concern that it be the
most recent data written. Sorry in advance if that's a misreading of
"you just must not forget to resync them after a crash."

-- 
Bill Davidsen
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck
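The COW-bit idea quoted above can be sketched in user space roughly as follows. All names are hypothetical illustration, not md code; real code would need locking and asynchronous I/O completion handling. A modification of a chunk with I/O in flight goes to a private copy, and the copy replaces the original only after the in-flight write completes, so at most two copies of a chunk exist at once:

```c
/* Sketch of the COW-while-I/O-in-flight scheme (hypothetical names). */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 4096

struct chunk {
	unsigned char data[CHUNK_SIZE];
	unsigned char *shadow;	/* pending copy, NULL if none */
	int io_in_flight;	/* original is being written out */
};

/* Called before modifying the chunk: if a write is in flight, redirect
 * the modification into a shadow copy so the in-flight buffer stays
 * stable.  At most one shadow exists, so memory use is bounded. */
static unsigned char *chunk_begin_modify(struct chunk *c)
{
	if (!c->io_in_flight)
		return c->data;
	if (!c->shadow) {
		c->shadow = malloc(CHUNK_SIZE);
		memcpy(c->shadow, c->data, CHUNK_SIZE);
	}
	return c->shadow;
}

/* Completion of the original write: fold the shadow back in.  It would
 * then be submitted as the next write rather than recursing. */
static void chunk_write_done(struct chunk *c)
{
	c->io_in_flight = 0;
	if (c->shadow) {
		memcpy(c->data, c->shadow, CHUNK_SIZE);
		free(c->shadow);
		c->shadow = NULL;
	}
}
```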