[RFC 0/1] Optimize ext4 DAX overwrites

From: Ritesh Harjani <riteshh@linux.ibm.com>
To: linux-ext4@vger.kernel.org
Cc: jack@suse.cz, tytso@mit.edu, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Ritesh Harjani <riteshh@linux.ibm.com>
Subject: [RFC 0/1] Optimize ext4 DAX overwrites
Date: Thu, 20 Aug 2020 17:06:27 +0530
Message-ID: <cover.1597855360.git.riteshh@linux.ibm.com>

For DAX writes, we currently start a journal txn regardless of whether the
write is an overwrite. For an overwrite we don't need to start a jbd2 txn,
since the blocks are already allocated.
So this patch optimizes away the txn start for DAX overwrites.
This could significantly boost performance for multi-threaded random writes
(overwrites). The fio script used to collect the perf numbers is given below.
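
As a rough illustration of the idea only: the following is a user-space toy
model, not the patch itself. Every toy_* name is made up for the example, and
the "journal" is just a counter. It shows a write path that starts a
transaction only when some block in the range still has to be allocated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NBLOCKS 16

/* Hypothetical stand-in for an inode: one "allocated" flag per block. */
struct toy_inode {
	bool mapped[TOY_NBLOCKS];
	unsigned long txn_starts;	/* number of "transactions" started */
};

/* True if every block in [first, first + count) is already allocated. */
static bool toy_is_overwrite(const struct toy_inode *inode,
			     size_t first, size_t count)
{
	assert(count > 0 && first + count <= TOY_NBLOCKS);
	for (size_t i = first; i < first + count; i++)
		if (!inode->mapped[i])
			return false;
	return true;
}

/* Stand-in for the cost we want to avoid on the overwrite path. */
static void toy_journal_start(struct toy_inode *inode)
{
	inode->txn_starts++;
}

/*
 * Toy write path: start a transaction only if the write has to allocate
 * blocks. Returns true if a transaction was started.
 */
static bool toy_dax_write(struct toy_inode *inode, size_t first, size_t count)
{
	if (toy_is_overwrite(inode, first, count))
		return false;	/* blocks already allocated: skip the txn */

	toy_journal_start(inode);
	for (size_t i = first; i < first + count; i++)
		inode->mapped[i] = true;
	return true;
}
```

In this model, the first write to a range bumps txn_starts, while a repeated
write to the same range leaves it untouched.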

The numbers below were measured on a QEMU setup on a ppc64 box with a
simulated pmem device.

Performance numbers with different thread counts (~10x improvement)
===================================================================

vanilla_kernel(kIOPS)
  60 +-+---------------+-------+--------+--------+--------+-------+------+-+
     |                 +       +        +        +**      +       +        |   
  55 +-+                                          **                     +-+   
     |                                   **       **                       |   
     |                                   **       **                       |   
  50 +-+                                 **       **                     +-+   
     |                                   **       **                       |   
  45 +-+                                 **       **                     +-+   
     |                                   **       **                       |   
     |                                   **       **                       |   
  40 +-+                                 **       **                     +-+   
     |                                   **       **                       |   
  35 +-+                        **       **       **                     +-+   
     |                          **       **       **               **      |   
     |                          **       **       **      **       **      |   
  30 +-+               **       **       **       **      **       **    +-+   
     |                 **      +**      +**      +**      **      +**      |   
  25 +-+---------------**------+**------+**------+**------**------+**----+-+   
                       1       2        4        8       12      16            
                                     Threads                                   
patched_kernel(kIOPS)
  600 +-+--------------+--------+--------+-------+--------+-------+------+-+   
      |                +        +        +       +        +       +**      |   
      |                                                            **      |   
  500 +-+                                                          **    +-+   
      |                                                            **      |   
      |                                                    **      **      |   
  400 +-+                                                  **      **    +-+   
      |                                                    **      **      |   
  300 +-+                                         **       **      **    +-+   
      |                                           **       **      **      |   
      |                                           **       **      **      |   
  200 +-+                                         **       **      **    +-+   
      |                                  **       **       **      **      |   
      |                                  **       **       **      **      |   
  100 +-+                        **      **       **       **      **    +-+   
      |                          **      **       **       **      **      |   
      |                +**      +**      **      +**      +**     +**      |   
    0 +-+--------------+**------+**------**------+**------+**-----+**----+-+   
                       1        2        4       8       12      16            
                                     Threads                                   
fio script
==========
[global]
rw=randwrite
norandommap=1
invalidate=0
bs=4k
; numjobs was changed per run for the different thread counts (1, 2, 4, 8, 12, 16)
numjobs=16
time_based=1
ramp_time=30
runtime=60
group_reporting=1
ioengine=psync
direct=1
size=16G
filename=file1.0.0:file1.0.1:file1.0.2:file1.0.3:file1.0.4:file1.0.5:file1.0.6:file1.0.7:file1.0.8:file1.0.9:file1.0.10:file1.0.11:file1.0.12:file1.0.13:file1.0.14:file1.0.15:file1.0.16:file1.0.17:file1.0.18:file1.0.19:file1.0.20:file1.0.21:file1.0.22:file1.0.23:file1.0.24:file1.0.25:file1.0.26:file1.0.27:file1.0.28:file1.0.29:file1.0.30:file1.0.31
file_service_type=random
nrfiles=32
directory=/mnt/

[name]
directory=/mnt/
direct=1

NOTE:
======
1. Given the ~10x perf delta, I probed a bit deeper to understand what is
causing this scalability problem. It seems that when we start a jbd2 txn, the
slab alloc code sees serious contention on a spinlock.

The spinlock contention could be related to some other issue (looking into it
internally). Even so, I could still see a perf improvement of close to 2x on a
QEMU setup on x86 with a simulated pmem device, comparing patched_kernel
with vanilla_kernel under the same fio workload.

perf report from vanilla_kernel (this is not seen with patched kernel) (ppc64)
=======================================================================

  47.86%  fio              [kernel.vmlinux]            [k] do_raw_spin_lock
             |
             ---do_raw_spin_lock
                |
                |--19.43%--_raw_spin_lock
                |          |
                |           --19.31%--0
                |                     |
                |                     |--9.77%--deactivate_slab.isra.61
                |                     |          ___slab_alloc
                |                     |          __slab_alloc
                |                     |          kmem_cache_alloc
                |                     |          jbd2__journal_start
                |                     |          __ext4_journal_start_sb
<...>

2. Kept this as an RFC, since using ext4_iomap_overwrite_ops may be a better
fit here. We could check for an overwrite in ext4_dax_write_iter(), the way
we do for DIO writes. Thoughts?

3. This problem was reported by Dan Williams at [1]
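
To make the alternative in note 2 a little more concrete, here is a user-space
sketch of "decide once at write entry, then use a dedicated ops table". It is
only a model under assumptions: the toy_* types and functions are hypothetical
and do not mirror the real iomap or ext4 interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NBLOCKS 8

struct toy_file {
	bool mapped[TOY_NBLOCKS];
	unsigned long txn_starts;	/* number of "transactions" started */
};

/* Hypothetical ops table: one callback that maps a block range. */
struct toy_ops {
	void (*map_range)(struct toy_file *f, size_t first, size_t count);
};

/* Allocating path: pays for a "transaction", then allocates the blocks. */
static void toy_map_alloc(struct toy_file *f, size_t first, size_t count)
{
	f->txn_starts++;
	for (size_t i = first; i < first + count; i++)
		f->mapped[i] = true;
}

/* Overwrite path: lookup only, the blocks must already exist. */
static void toy_map_overwrite(struct toy_file *f, size_t first, size_t count)
{
	for (size_t i = first; i < first + count; i++)
		assert(f->mapped[i]);
}

static const struct toy_ops toy_alloc_ops = { .map_range = toy_map_alloc };
static const struct toy_ops toy_overwrite_ops = { .map_range = toy_map_overwrite };

/* Pick the ops table once, before any mapping is done. */
static const struct toy_ops *toy_pick_ops(const struct toy_file *f,
					  size_t first, size_t count)
{
	assert(count > 0 && first + count <= TOY_NBLOCKS);
	for (size_t i = first; i < first + count; i++)
		if (!f->mapped[i])
			return &toy_alloc_ops;
	return &toy_overwrite_ops;
}

/* Toy write entry point: returns the ops table that was used. */
static const struct toy_ops *toy_write_iter(struct toy_file *f,
					    size_t first, size_t count)
{
	const struct toy_ops *ops = toy_pick_ops(f, first, count);

	ops->map_range(f, first, count);
	return ops;
}
```

Compared with checking inside the mapping callback, this keeps the overwrite
path free of any transaction logic, at the price of a separate up-front scan
of the range.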

Links
======
[1]: https://lore.kernel.org/linux-ext4/20190802144304.GP25064@quack2.suse.cz/T/

Ritesh Harjani (1):
  ext4: Optimize ext4 DAX overwrites

 fs/ext4/ext4.h  | 1 +
 fs/ext4/file.c  | 2 +-
 fs/ext4/inode.c | 8 +++++++-
 3 files changed, 9 insertions(+), 2 deletions(-)

-- 
2.25.4
