From: Dmitry Monakhov <dmonakhov@openvz.org>
To: ext4 development <linux-ext4@vger.kernel.org>
Cc: linux-fsdevel@vger.kernel.org, axboe@kernel.dk, Jan Kara <jack@suse.cz>
Subject: EXT4 nodelalloc => back to stone age.
Date: Mon, 01 Apr 2013 15:06:18 +0400
Message-ID: <87d2uese6t.fsf@openvz.org>

[-- Attachment #1: Type: text/plain, Size: 496 bytes --]


I've mounted ext4 with -o nodelalloc on my SSD (INTEL SSDSA2CW120G3, 4PC10362).
It shows numbers slower than an HDD produced 15 years ago:
# mount $SCRATCH_DEV $SCRATCH_MNT -o nodelalloc
# dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
  1073741824 bytes (1.1 GB) copied, 46.7948 s, 22.9 MB/s
# dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
  1073741824 bytes (1.1 GB) copied, 41.2717 s, 26.0 MB/s
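As a quick sanity check of the figures above, here is a minimal sketch that recomputes the throughput from the byte count and elapsed times dd printed. It assumes dd reports decimal megabytes (10**6 bytes); the function name is only illustrative.

```python
# Recompute dd's reported throughput from the byte count and elapsed time.
BYTES = 1073741824  # 1024 blocks of 1 MiB, as reported by dd


def throughput_mb_s(nbytes, seconds):
    """Return throughput in MB/s, assuming decimal megabytes (10**6 bytes)."""
    return nbytes / seconds / 1e6


for seconds in (46.7948, 41.2717):
    print(f"{seconds:8.4f} s -> {throughput_mb_s(BYTES, seconds):.1f} MB/s")
```

Both runs land in the low-to-mid 20s of MB/s, which matches what dd printed.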
blktrace shows horrible traces:

[-- Attachment #2: trace.log --]
[-- Type: text/plain, Size: 1644 bytes --]

253,1    0       11     0.004965203 13618  Q  WS 1219360 + 8 [jbd2/dm-1-8]
253,1    0       11     0.004965203 13618  Q  WS 1219360 + 8 [jbd2/dm-1-8]
253,1    0       11     0.004965203 13618  Q  WS 1219360 + 8 [jbd2/dm-1-8]
253,1    0       11     0.004965203 13618  Q  WS 1219360 + 8 [jbd2/dm-1-8]
253,1    1       39     0.004983642     0  C  WS 1219344 + 8 [0]
253,1    1       39     0.004983642     0  C  WS 1219344 + 8 [0]
253,1    1       39     0.004983642     0  C  WS 1219344 + 8 [0]
253,1    1       39     0.004983642     0  C  WS 1219344 + 8 [0]
253,1    1       40     0.005082898     0  C  WS 1219352 + 8 [0]
253,1    1       40     0.005082898     0  C  WS 1219352 + 8 [0]
253,1    1       40     0.005082898     0  C  WS 1219352 + 8 [0]
253,1    1       40     0.005082898     0  C  WS 1219352 + 8 [0]
253,1    3       12     0.005106049  2580  Q   W 1219368 + 8 [flush-253:1]
253,1    3       12     0.005106049  2580  Q   W 1219368 + 8 [flush-253:1]
253,1    3       12     0.005106049  2580  Q   W 1219368 + 8 [flush-253:1]
253,1    3       12     0.005106049  2580  Q   W 1219368 + 8 [flush-253:1]
253,1    2       17     0.005197143 13750  Q  WS 1219376 + 8 [dd]
253,1    2       17     0.005197143 13750  Q  WS 1219376 + 8 [dd]
253,1    2       17     0.005197143 13750  Q  WS 1219376 + 8 [dd]
253,1    2       17     0.005197143 13750  Q  WS 1219376 + 8 [dd]
253,1    1       41     0.005199871     0  C  WS 1219360 + 8 [0]
253,1    1       41     0.005199871     0  C  WS 1219360 + 8 [0]
253,1    1       41     0.005199871     0  C  WS 1219360 + 8 [0]
253,1    1       41     0.005199871     0  C  WS 1219360 + 8 [0]

[-- Attachment #3: Type: text/plain, Size: 1220 bytes --]


As one can see, data is written from two threads, dd and jbd2, on a per-page
basis, and jbd2 submits pages with WRITE_SYNC, i.e. we write page-by-page
synchronously :)
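To make the "per-page" reading of the trace concrete, here is a rough sketch that parses a few of the lines above. It assumes the default blkparse layout (device, cpu, sequence, time, pid, action, RWBS flags, sector, "+", sector count, bracketed tag), 512-byte sectors and 4 KiB pages; the helper name is made up for illustration.

```python
SECTOR_SIZE = 512  # assumption: blktrace counts 512-byte sectors

# One copy of each distinct line kind from the trace above.
SAMPLE = """\
253,1    0       11     0.004965203 13618  Q  WS 1219360 + 8 [jbd2/dm-1-8]
253,1    1       39     0.004983642     0  C  WS 1219344 + 8 [0]
253,1    3       12     0.005106049  2580  Q   W 1219368 + 8 [flush-253:1]
253,1    2       17     0.005197143 13750  Q  WS 1219376 + 8 [dd]
"""


def parse_line(line):
    """Split one blkparse line into the fields used below."""
    f = line.split()
    # f: dev cpu seq time pid action rwbs sector '+' nsectors [tag]
    return {
        "action": f[5],
        "rwbs": f[6],
        "sector": int(f[7]),
        "nsectors": int(f[9]),
        "tag": f[10].strip("[]"),
    }


events = [parse_line(l) for l in SAMPLE.splitlines() if l.strip()]
for e in events:
    if e["action"] == "Q":  # queued requests only
        size = e["nsectors"] * SECTOR_SIZE
        sync = "sync" if "S" in e["rwbs"] else "async"
        print(f'{e["tag"]:>12}: sector {e["sector"]}, {size} bytes ({sync})')
```

Every request is 8 sectors, i.e. 4096 bytes, which is a single page under the 4 KiB assumption, and the queued sectors are adjacent even though they come from different submitters.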

Exact calltrace:
journal_submit_inode_data_buffers
 wbc.sync_mode =  WB_SYNC_ALL
 ->generic_writepages
   ->write_cache_pages
     ->ext4_writepage
       ->ext4_bio_write_page
         ->io_submit_add_bh
           ->io_submit_init
             io->io_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
       ->ext4_io_submit(io);

1) Do we really have to use WRITE_SYNC in the WB_SYNC_ALL case?
   Why is blk_finish_plug(&plug), which is called from generic_writepages(),
   not enough? As far as I can see this code was copy-pasted from XFS, and
   DIO also tags bios with WRITE_SYNC, but what happens if the file is
   highly fragmented (or the block device is RAID0)? We will end up doing
   synchronous IO.

2) Why don't we have writepages for the non-delalloc case?

I want to fix (2) by implementing writepages() for the non-delalloc case.
Once this is done we may add a new flag, WB_SYNC_NOALLOC, so that
journal_submit_inode_data_buffers will use
__filemap_fdatawrite_range(, , , WB_SYNC_ALL | WB_SYNC_NOALLOC),
which will call the optimized ->ext4_writepages().
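For illustration only, here is a toy model of why a batched writepages path could help. It is not the kernel implementation: it just counts how many requests result when dirty pages are submitted one by one versus merged into contiguous runs. The block numbers and the coalesce() helper are hypothetical.

```python
def coalesce(blocks):
    """Merge block numbers into (start, length) runs of contiguous blocks."""
    runs = []
    for b in sorted(blocks):
        if runs and runs[-1][0] + runs[-1][1] == b:
            # Block directly follows the previous run: extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            # Gap on disk: start a new run.
            runs.append((b, 1))
    return runs


# Hypothetical dirty pages of one file, mapped to on-disk block numbers.
dirty = [100, 101, 102, 103, 200, 201, 300]
per_page = [(b, 1) for b in dirty]  # one request per page
batched = coalesce(dirty)           # one request per contiguous run

print(len(per_page), "requests page-by-page")
print(len(batched), "requests when batched:", batched)
```

In this model a more fragmented file produces more runs, so batching helps less there, which is the same concern raised in (1).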


