linux-fsdevel.vger.kernel.org archive mirror
* Ext4 without journal write cache problem...
@ 2009-12-06 14:42 Andrea Gelmini
From: Andrea Gelmini @ 2009-12-06 14:42 UTC
  To: linux-fsdevel; +Cc: Theodore Ts'o

[-- Attachment #1: Type: text/plain, Size: 1057 bytes --]

Hi all,
   I need some advice about this regression...

   Short version:
   With 2.6.32 my / partition (ext4 without journal) seems to always
work in synchronous mode. There is no write caching, so the HD is
always busy. It works as usual with 2.6.31.

   Long version:
   To reproduce the problem I used the attached bash script (test.sh).
   With 2.6.31.6 there is no disk activity at all and dd reports
incredible speeds (~400 MB/s), of course.
   With 2.6.32 the HD is always busy, and dd reports real speeds
(~30 MB/s).

   I bisected between 2.6.31.6 and 2.6.32. The log is attached
(bisect_log.txt).
   Reverting the last commit
(5534fb5bb35a62a94e0bd1fa2421f7fb6e894f10) gives the same behavior for
both kernel versions.

   I investigated this because since I started using 2.6.32 my laptop
temperature has been a lot higher, so much that my keyboard keys were
hot.
   Also, if I issue a sync it always returns immediately, as if I
were using "mount -o sync", regardless of what I've done before.

Thanks a lot for your time and work,
Andrea

[-- Attachment #2: bisect_log.txt --]
[-- Type: text/plain, Size: 2802 bytes --]

git bisect start
# bad: [745402c939529c0e11e59e7987a97becfbabff82] Merge branch 'gelma'
git bisect bad 745402c939529c0e11e59e7987a97becfbabff82
# good: [d8f879c0cc0c55cc42e9176dee5131e19a721854] Merge branch 'gelma' into t31
git bisect good d8f879c0cc0c55cc42e9176dee5131e19a721854
# good: [4d5d3932e8089ffabb1e4f9bd4d9746c8c250457] mio
git bisect good 4d5d3932e8089ffabb1e4f9bd4d9746c8c250457
# good: [6f128fa344833bf8bf076a51d14401661c146470] Merge branch 'davinci-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-davinci
git bisect good 6f128fa344833bf8bf076a51d14401661c146470
# bad: [5f8fe4270e53d38421ba34c428c3b58933b48e50] Merge branch 'for-linus' of git://repo.or.cz/cris-mirror
git bisect bad 5f8fe4270e53d38421ba34c428c3b58933b48e50
# bad: [85e08ca54c5c203cd2638f0fc8fa899a539f6254] USB: Move endpoint sync type definitions from usb/audio.h to usb/ch9.h
git bisect bad 85e08ca54c5c203cd2638f0fc8fa899a539f6254
# bad: [84d6ae431f315e8973aac3c3fe1d550fc9240ef3] V4L/DVB (13033): pt1: Don't use a deprecated DMA_BIT_MASK macro
git bisect bad 84d6ae431f315e8973aac3c3fe1d550fc9240ef3
# good: [515b696b282f856c3ad1679ccd658120faa387d0] Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6
git bisect good 515b696b282f856c3ad1679ccd658120faa387d0
# bad: [ecda427340b7bb5c61fbf18857645286c2bfec6c] V4L/DVB (12869): tda18271: fix comments and make tda18271_agc debug less verbose
git bisect bad ecda427340b7bb5c61fbf18857645286c2bfec6c
# bad: [3530c1886291df061e3972c55590777ef1cb67f8] Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
git bisect bad 3530c1886291df061e3972c55590777ef1cb67f8
# good: [1f7bebb9e911d870fa8f997ddff838e82b5715ea] ext4: Always set dx_node's fake_dirent explicitly.
git bisect good 1f7bebb9e911d870fa8f997ddff838e82b5715ea
# good: [81ce31b773226332475f89501b1072bec0c0e241] Merge branch 'for-linus' of git://gitserver.sunplusct.com/linux-2.6-score
git bisect good 81ce31b773226332475f89501b1072bec0c0e241
# good: [27f5de7963f46388932472b660f2f9a86ab58454] mm: Fix problem of parameter in note
git bisect good 27f5de7963f46388932472b660f2f9a86ab58454
# good: [fb0a387dcdcd21aab1b09ee7fd80b7c979bdbbfd] ext4: limit block allocations for indirect-block files to < 2^32
git bisect good fb0a387dcdcd21aab1b09ee7fd80b7c979bdbbfd
# bad: [0a80e9867db154966b2a771042e10452ac110e1e] ext4: replace MAX_DEFRAG_SIZE with EXT_MAX_BLOCK
git bisect bad 0a80e9867db154966b2a771042e10452ac110e1e
# good: [fb40ba0d98968bc3454731360363d725b4f1064c] ext4: Add a tracepoint for ext4_alloc_da_blocks()
git bisect good fb40ba0d98968bc3454731360363d725b4f1064c
# bad: [5534fb5bb35a62a94e0bd1fa2421f7fb6e894f10] ext4: Fix the alloc on close after a truncate hueristic
git bisect bad 5534fb5bb35a62a94e0bd1fa2421f7fb6e894f10

[-- Attachment #3: test.sh --]
[-- Type: application/x-sh, Size: 310 bytes --]

[-- Attachment #4: dumpe2fs.txt --]
[-- Type: text/plain, Size: 1651 bytes --]

Filesystem volume name:   /
Last mounted on:          /
Filesystem UUID:          a7d58ef8-6698-427c-bc0e-08fd79f0e847
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    journal_data_writeback
Filesystem state:         not clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              2970064
Block count:              104871292
Reserved block count:     0
Free blocks:              30976496
Free inodes:              2268721
First block:              1
Block size:               1024
Fragment size:            1024
Reserved GDT blocks:      39
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         232
Inode blocks per group:   58
Flex block group size:    16
Filesystem created:       Sat Nov 21 20:44:04 2009
Last mount time:          Sun Dec  6 12:49:47 2009
Last write time:          Sun Dec  6 12:51:05 2009
Mount count:              27
Maximum mount count:      365
Last checked:             Sat Nov 28 12:53:41 2009
Check interval:           31104000 (12 months)
Next check after:         Tue Nov 23 12:53:41 2010
Lifetime writes:          208 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Default directory hash:   half_md4
Directory Hash Seed:      42e8029a-d8d9-4f6b-8ddb-26d88b3765b4

* Re: Ext4 without journal write cache problem...
@ 2009-12-06 22:02 ` tytso
From: tytso @ 2009-12-06 22:02 UTC
  To: Andrea Gelmini; +Cc: linux-fsdevel

On Sun, Dec 06, 2009 at 03:42:59PM +0100, Andrea Gelmini wrote:
> Hi all,
>    I need some advice about this regression...
> 
>    Short version:
>    With 2.6.32 my / partition (ext4 without journal) seems to always
> work in synchronous mode. There is no write caching, so the HD is
> always busy. It works as usual with 2.6.31.
> 
>    Long version:
>    To reproduce the problem I used the attached bash script (test.sh).
>    With 2.6.31.6 there is no disk activity at all and dd reports
> incredible speeds (~400 MB/s), of course.
>    With 2.6.32 the HD is always busy, and dd reports real speeds
> (~30 MB/s).

> # bad: [5534fb5bb35a62a94e0bd1fa2421f7fb6e894f10] ext4: Fix the alloc on close after a truncate hueristic
> git bisect bad 5534fb5bb35a62a94e0bd1fa2421f7fb6e894f10

Yeah, this was actually deliberate.  The problem is that there are
badly written application programs that update files in place via the
following pattern:

1.  fd = open("file", O_RDONLY);
2.  read(fd, buf, bufsize);	// read in the file
3.  close(fd);
			// Let the user edit the file
			// Now the user requests the file be saved out to disk
4.  fd = open("file", O_WRONLY | O_TRUNC);
5.  write(fd, buf, bufsize);
6.  close(fd);

The problem is what happens if the system crashes between step 4 and
step 5?  Especially if "file" is the user's research for his
Ph.D. thesis, for which he has spent 10 years collecting, but never
bothered to make a backup?  (Well, one could argue that the grad
student doesn't *deserve* a Ph.D., but maybe it's a Ph.D. in English
Literature.  :-)

So the correct way for an editor to write precious files is as
follows:

4.  fd = open("file.new", O_WRONLY | O_CREAT | O_TRUNC, 0666);
5.  err = write(fd, buf, bufsize); // ... and check error return from write()
6.  err = fsync(fd);		   // ... and check error return from fsync()
7.  err = close(fd);		   // ... and check error return from close()
8.  rename("file.new", "file");
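
The sequence above could be fleshed out roughly as follows.  This is
only a sketch under a few assumptions: the helper names (write_all,
save_file), the ".new" suffix handling, the 4096-byte name buffer and
the 0644 mode are illustrative choices, not something any particular
editor is known to use.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Remove the temporary file while preserving errno for the caller. */
static int fail_and_cleanup(const char *tmp)
{
	int saved = errno;

	unlink(tmp);
	errno = saved;
	return -1;
}

/*
 * Replace "path" with the given contents by writing "path.new" first.
 * Returns 0 on success, -1 on failure with errno set by the failing call.
 * (Hypothetical helper, for illustration only.)
 */
int save_file(const char *path, const char *buf, size_t len)
{
	char tmp[4096];
	int fd;

	if (snprintf(tmp, sizeof(tmp), "%s.new", path) >= (int)sizeof(tmp)) {
		errno = ENAMETOOLONG;
		return -1;
	}

	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	/* The data must be on disk before the new name becomes visible. */
	if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
		int saved = errno;

		close(fd);
		errno = saved;
		return fail_and_cleanup(tmp);
	}
	if (close(fd) < 0)
		return fail_and_cleanup(tmp);

	/* rename() replaces the old file with the new one atomically. */
	if (rename(tmp, path) < 0)
		return fail_and_cleanup(tmp);
	return 0;
}
```

A caller would use something like save_file("thesis.txt", buf, len) and
report errno on failure.  At every point either the old contents or the
fully written new contents exist under the real name.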

The problem is made much worse by delayed allocation: the new data
blocks may not get written for potentially 1-2 minutes, but the
truncation from opening the file with O_TRUNC will get written to the
file system sooner than that.

So because there are a lot of sucky applications out there, and the
application writers tend to massively outnumber file system
developers, we have placed these heuristics to force an implied
fsync() on close() if the file descriptor resulted in data blocks
getting truncated, either from O_TRUNC or from an explicit call to
the ftruncate(2) system call.

So your test script exercises this heuristic:


for f in $(seq 5)
do
        dd if=/dev/zero of=test.dd bs=100M count=1
done

If you change it to be as follows:

for f in $(seq 5)
do
        rm -f test.dd
        dd if=/dev/zero of=test.dd bs=100M count=1
done

This will avoid triggering the heuristic.

Or, you can suppress the heuristic via the mount option
"noauto_da_alloc".  Note that if you do this, and you edit a file
using a buggy application that doesn't use fsync(), you may end up
losing data on a crash.  
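
To check whether a given mount currently has the option set, something
along these lines might do.  It is a sketch that assumes a glibc system
and that the kernel lists the option in /proc/mounts; the two function
names are invented for this example.

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if the comma-separated list "opts" contains exactly "opt".
 * Whole tokens are compared, since a plain substring search for
 * "auto_da_alloc" would also match "noauto_da_alloc".
 */
int options_contain(const char *opts, const char *opt)
{
	size_t want = strlen(opt);

	while (*opts) {
		size_t n = strcspn(opts, ",");

		if (n == want && strncmp(opts, opt, n) == 0)
			return 1;
		opts += n;
		if (*opts == ',')
			opts++;
	}
	return 0;
}

/*
 * Return 1 if the file system mounted on "dir" lists "opt" in
 * /proc/mounts, 0 if it does not, and -1 if /proc/mounts cannot be
 * read or no entry for "dir" is found.
 */
int mount_has_option(const char *dir, const char *opt)
{
	FILE *f = setmntent("/proc/mounts", "r");
	struct mntent *m;
	int result = -1;

	if (!f)
		return -1;
	while ((m = getmntent(f)) != NULL) {
		/* A later entry for the same directory wins. */
		if (strcmp(m->mnt_dir, dir) == 0)
			result = options_contain(m->mnt_opts, opt);
	}
	endmntent(f);
	return result;
}
```

For example, mount_has_option("/", "noauto_da_alloc") would return 1 on
a root file system mounted with the option, under the assumptions above.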

I'm surprised that you are seeing this situation in actual practice
(as opposed to a test script).  Are you regularly overwriting huge
files via truncate(2) or open with O_TRUNC?  And are you doing this
assuming that you really don't care about the previous contents of the
file after a crash?  Most of the time the files that get edited this
way tend to be small files.  (For example KDE had a bug where *every*
*single* *KDE* *dot* *file* was getting rewritten all the time, and
users were getting cranky that after their buggy Nvidia proprietary
binary drivers crashed their system, all of the windows positions that
they had spent hours and hours setting up had vanished.)

					- Ted
