linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* EXT4 nodelalloc => back to stone age.
@ 2013-04-01 11:06 Dmitry Monakhov
  2013-04-01 15:18 ` Eric Sandeen
  2013-04-02 13:46 ` Jan Kara
  0 siblings, 2 replies; 8+ messages in thread
From: Dmitry Monakhov @ 2013-04-01 11:06 UTC (permalink / raw)
  To: ext4 development; +Cc: linux-fsdevel, axboe, Jan Kara

[-- Attachment #1: Type: text/plain, Size: 496 bytes --]


I've mounted ext4 with -onodelalloc on my SSD (INTEL SSDSA2CW120G3,4PC10362)
It shows numbers which are slower than HDD which was produced 15 years ago
#mount  $SCRATCH_DEV $SCRATCH_MNT -onodelalloc
# dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
  1073741824 bytes (1.1 GB) copied, 46.7948 s, 22.9 MB/s
# dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
  1073741824 bytes (1.1 GB) copied, 41.2717 s, 26.0 MB/s
blktrace shows horrible traces:

[-- Attachment #2: trace.log --]
[-- Type: text/plain, Size: 1644 bytes --]

253,1    0       11     0.004965203 13618  Q  WS 1219360 + 8 [jbd2/dm-1-8]
253,1    0       11     0.004965203 13618  Q  WS 1219360 + 8 [jbd2/dm-1-8]
253,1    0       11     0.004965203 13618  Q  WS 1219360 + 8 [jbd2/dm-1-8]
253,1    0       11     0.004965203 13618  Q  WS 1219360 + 8 [jbd2/dm-1-8]
253,1    1       39     0.004983642     0  C  WS 1219344 + 8 [0]
253,1    1       39     0.004983642     0  C  WS 1219344 + 8 [0]
253,1    1       39     0.004983642     0  C  WS 1219344 + 8 [0]
253,1    1       39     0.004983642     0  C  WS 1219344 + 8 [0]
253,1    1       40     0.005082898     0  C  WS 1219352 + 8 [0]
253,1    1       40     0.005082898     0  C  WS 1219352 + 8 [0]
253,1    1       40     0.005082898     0  C  WS 1219352 + 8 [0]
253,1    1       40     0.005082898     0  C  WS 1219352 + 8 [0]
253,1    3       12     0.005106049  2580  Q   W 1219368 + 8 [flush-253:1]
253,1    3       12     0.005106049  2580  Q   W 1219368 + 8 [flush-253:1]
253,1    3       12     0.005106049  2580  Q   W 1219368 + 8 [flush-253:1]
253,1    3       12     0.005106049  2580  Q   W 1219368 + 8 [flush-253:1]
253,1    2       17     0.005197143 13750  Q  WS 1219376 + 8 [dd]
253,1    2       17     0.005197143 13750  Q  WS 1219376 + 8 [dd]
253,1    2       17     0.005197143 13750  Q  WS 1219376 + 8 [dd]
253,1    2       17     0.005197143 13750  Q  WS 1219376 + 8 [dd]
253,1    1       41     0.005199871     0  C  WS 1219360 + 8 [0]
253,1    1       41     0.005199871     0  C  WS 1219360 + 8 [0]
253,1    1       41     0.005199871     0  C  WS 1219360 + 8 [0]
253,1    1       41     0.005199871     0  C  WS 1219360 + 8 [0]

[-- Attachment #3: Type: text/plain, Size: 1220 bytes --]


As one can see data written from two threads dd and jbd2 on per-page basis and
jbd2 submit pages with WRITE_SYNC  i.e. we write page-by-page
synchronously :)

Exact calltrace:
journal_submit_inode_data_buffers
 wbc.sync_mode =  WB_SYNC_ALL
 ->generic_writepages
   ->write_cache_pages
     ->ext4_writepage
       ->ext4_bio_write_page
         ->io_submit_add_bh
           ->io_submit_init
             io->io_op = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC :
             WRITE);
       ->ext4_io_submit(io);

1)Do we really have to use WRITE_SYNC in case of WB_SYNC_ALL ?
  Why blk_finish_plug(&plug) which is called from generic_writepages() is
  not enough? As far as I can see this code was copy-pasted from XFS,
  also DIO also tag bio-s with WRITE_SYNC, but what happen if file
  is highly fragmented (or block device is RAID0) we will endup doing
  synchronous io.

2) Why don't we have writepages for non delalloc case ?

I want to fix (2) by implementing writepages() for non delalloc case
Once this will be done we may add new flag WB_SYNC_NOALLOC so
journal_submit_inode_data_buffers will use
__filemap_fdatawrite_range(, , , WB_SYNC_ALL| WB_SYNC_NOALLC)
which will call optimized ->ext4_writepages() 


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2013-04-03  9:32 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-04-01 11:06 EXT4 nodelalloc => back to stone age Dmitry Monakhov
2013-04-01 15:18 ` Eric Sandeen
2013-04-01 15:39   ` Theodore Ts'o
2013-04-01 16:00     ` Eric Sandeen
2013-04-01 16:34       ` Zheng Liu
2013-04-01 15:45   ` Chris Mason
2013-04-01 15:57     ` Chris Mason
2013-04-02 13:46 ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).