linux-fsdevel.vger.kernel.org archive mirror
From: Dmitry Monakhov <dmonakhov@openvz.org>
To: ext4 development <linux-ext4@vger.kernel.org>
Cc: linux-fsdevel@vger.kernel.org, axboe@kernel.dk, Jan Kara <jack@suse.cz>
Subject: EXT4 nodelalloc => back to stone age.
Date: Mon, 01 Apr 2013 15:06:18 +0400	[thread overview]
Message-ID: <87d2uese6t.fsf@openvz.org> (raw)

[-- Attachment #1: Type: text/plain, Size: 496 bytes --]


I've mounted ext4 with -o nodelalloc on my SSD (INTEL SSDSA2CW120G3, 4PC10362).
It shows numbers slower than an HDD produced 15 years ago:
# mount $SCRATCH_DEV $SCRATCH_MNT -o nodelalloc
# dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
  1073741824 bytes (1.1 GB) copied, 46.7948 s, 22.9 MB/s
# dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
  1073741824 bytes (1.1 GB) copied, 41.2717 s, 26.0 MB/s
blktrace shows horrible traces:

[-- Attachment #2: trace.log --]
[-- Type: text/plain, Size: 1644 bytes --]

253,1    0       11     0.004965203 13618  Q  WS 1219360 + 8 [jbd2/dm-1-8]
253,1    1       39     0.004983642     0  C  WS 1219344 + 8 [0]
253,1    1       40     0.005082898     0  C  WS 1219352 + 8 [0]
253,1    3       12     0.005106049  2580  Q   W 1219368 + 8 [flush-253:1]
253,1    2       17     0.005197143 13750  Q  WS 1219376 + 8 [dd]
253,1    1       41     0.005199871     0  C  WS 1219360 + 8 [0]

[-- Attachment #3: Type: text/plain, Size: 1220 bytes --]


As one can see, data is written from two threads (dd and jbd2) on a
per-page basis, and jbd2 submits its pages with WRITE_SYNC, i.e. we
write page-by-page synchronously :)

Exact calltrace:
journal_submit_inode_data_buffers
 wbc.sync_mode =  WB_SYNC_ALL
 ->generic_writepages
   ->write_cache_pages
     ->ext4_writepage
       ->ext4_bio_write_page
         ->io_submit_add_bh
           ->io_submit_init
             io->io_op = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC :
             WRITE);
       ->ext4_io_submit(io);

1) Do we really have to use WRITE_SYNC in the WB_SYNC_ALL case?
   Why is the blk_finish_plug(&plug) called from generic_writepages()
   not enough? As far as I can see this code was copy-pasted from XFS;
   DIO also tags its bios with WRITE_SYNC. But what happens if the file
   is highly fragmented (or the block device is RAID0)? We end up doing
   synchronous IO.

2) Why don't we have writepages() for the non-delalloc case?

I want to fix (2) by implementing writepages() for the non-delalloc case.
Once that is done we can add a new flag, WB_SYNC_NOALLOC, so that
journal_submit_inode_data_buffers() uses
__filemap_fdatawrite_range(, , , WB_SYNC_ALL | WB_SYNC_NOALLOC),
which will call the optimized ext4_writepages().
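
A rough sketch of the shape the jbd2 side of that proposal might take. This is pseudocode for the idea, not a working patch: WB_SYNC_NOALLOC does not exist yet, and the current __filemap_fdatawrite_range() takes a plain sync_mode rather than a flag mask, so passing an OR'd mask assumes the writeback API is extended accordingly:

/* Pseudocode sketch -- WB_SYNC_NOALLOC is the proposed new flag. */
static int journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
{
	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;

	/*
	 * Instead of per-page WRITE_SYNC submission through
	 * generic_writepages(), hint that no block allocation is needed
	 * so the filesystem can use its batched ->writepages() path
	 * (the optimized ext4_writepages() for the nodelalloc case).
	 */
	return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX,
					  WB_SYNC_ALL | WB_SYNC_NOALLOC);
}

The point is that the sync-vs-async decision would then be made once per writeback pass inside ->writepages(), rather than once per page in ext4_bio_write_page().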



Thread overview: 8+ messages
2013-04-01 11:06 Dmitry Monakhov [this message]
2013-04-01 15:18 ` EXT4 nodelalloc => back to stone age Eric Sandeen
2013-04-01 15:39   ` Theodore Ts'o
2013-04-01 16:00     ` Eric Sandeen
2013-04-01 16:34       ` Zheng Liu
2013-04-01 15:45   ` Chris Mason
2013-04-01 15:57     ` Chris Mason
2013-04-02 13:46 ` Jan Kara
