linux-ext4.vger.kernel.org archive mirror
* fsync stuck at jbd2_log_wait_commit on NVMe devices
@ 2015-08-12 16:57 Roy Yang
  2015-08-12 18:37 ` Theodore Ts'o
  0 siblings, 1 reply; 7+ messages in thread
From: Roy Yang @ 2015-08-12 16:57 UTC (permalink / raw)
  To: linux-ext4@vger.kernel.org

Hi EXT4 Experts, 

We came across an ext4 journaling problem. I did some googling on high I/O wait and
found a lot of links, but none of them describes the same issue.

The backtrace when the system is hung is:

Aug 7 11:31:33 2f4526ab kernel: "echo 0> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
...
Aug 7 11:31:33 2f4526ab kernel: Call Trace:
Aug 7 11:31:33 2f4526ab kernel: [<ffffffff816096a9>] schedule+0x29/0x70
Aug 7 11:31:33 2f4526ab kernel: [<ffffffffa031d785>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
Aug 7 11:31:33 2f4526ab kernel: [<ffffffff81098230>] ? wake_up_bit+0x30/0x30
Aug 7 11:31:33 2f4526ab kernel: [<ffffffffa031fe32>] jbd2_complete_transaction+0x52/0xa0 [jbd2]
Aug 7 11:31:33 2f4526ab kernel: [<ffffffffa03339f2>] ext4_sync_file+0x292/0x320 [ext4]
Aug 7 11:31:33 2f4526ab kernel: [<ffffffff811f6ba5>] do_fsync+0x65/0xa0
Aug 7 11:31:33 2f4526ab kernel: [<ffffffff811f6e70>] SyS_fsync+0x10/0x20
Aug 7 11:31:33 2f4526ab kernel: [<ffffffff81614209>] system_call_fastpath+0x16/0x1b

At the application layer, we use fsync() a lot. When the I/O load is heavy, we see a lot of I/Os that can't finish within 120 seconds
because JBD2 can't finish the journaling. When this happens, we also see that the JBD2 info shows very high values for the following two items:

$ cat /proc/fs/jbd2/nvme0n1/info
85162438 transactions (85162436 requested), each up to 8192 blocks
average: 
  0ms waiting for transaction
  0ms request delay
  1ms running transaction                 <==== this will be very high, 3 to 6 seconds 
  0ms transaction was being locked
  0ms flushing data (in ordered mode)
  0ms logging transaction
  153us average transaction commit time     <=== this will be very high, half a second to one second
  35 handles per transaction
  2 blocks per transaction
  3 logged blocks per transaction
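
For reference, a simple way to watch these counters while the load is running
(just a sketch; adjust the path under /proc/fs/jbd2 to match your journal device) is:

$ watch -d -n 1 cat /proc/fs/jbd2/nvme0n1/info     # -d highlights the fields that change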

Our environment:
$ mount 
/dev/nvme0n1 on /logs type ext4 (rw,noatime,data=ordered)

$ uname -a
Linux a69ac961 3.10.0-229.7.2.el7.x86_64 #1 SMP Tue Jun 23 22:06:11 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/redhat-release 
CentOS Linux release 7.1.1503 (Core) 


$ sudo tune2fs  -l /dev/nvme0n1
tune2fs 1.42.9 (28-Dec-2013)
Filesystem volume name:   cty-data
Last mounted on:          <not available>
Filesystem UUID:          227fde0d-a3ae-4236-9d32-46ef0091b3b3
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              91062272
Block count:              364226560
Reserved block count:     3642265
Free blocks:              126192987
Free inodes:              90916973
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      1024
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Wed Aug  5 13:00:41 2015
Last mount time:          Mon Aug 10 18:32:59 2015
Last write time:          Mon Aug 10 18:32:59 2015
Mount count:              6
Maximum mount count:      -1
Last checked:             Wed Aug  5 13:00:41 2015
Check interval:           0 (<none>)
Lifetime writes:          135 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      c8f8aa7c-583b-4251

Any suggestions or hints on how to debug this problem would be appreciated!

Thank you very much, 

Roy

* Re: fsync stuck at jbd2_log_wait_commit on NVMe devices
  2015-08-12 16:57 fsync stuck at jbd2_log_wait_commit on NVMe devices Roy Yang
@ 2015-08-12 18:37 ` Theodore Ts'o
  2015-08-12 19:28   ` Roy Yang
  0 siblings, 1 reply; 7+ messages in thread
From: Theodore Ts'o @ 2015-08-12 18:37 UTC (permalink / raw)
  To: Roy Yang; +Cc: linux-ext4@vger.kernel.org

Hi Roy,

My suggestion is to collect information from the jbd2_run_stats and
jbd2_checkpoint_stats tracepoints.

cd /sys/kernel/debug/tracing
echo 1 > events/jbd2/jbd2_run_stats/enable
echo 1 > events/jbd2/jbd2_checkpoint_stats/enable
cat trace_pipe > /tmp/traces &
tail -f /tmp/traces
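
When you are done collecting, the capture can be stopped with something like
this (a sketch; assumes the backgrounded cat is shell job %1):

echo 0 > events/jbd2/jbd2_run_stats/enable
echo 0 > events/jbd2/jbd2_checkpoint_stats/enable
kill %1          # stop the backgrounded 'cat trace_pipe > /tmp/traces'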

The jbd2_handle_stats tracepoint can be informative, but it's also far
more voluminous.

That will give us a hint where things are getting bottlenecked.

What sort of workload is your application doing?  Is it just primarily
doing random writes into a preallocated file?  Is it creating or
deleting files?  Extending or truncating files?   etc.

						- Ted


* RE: fsync stuck at jbd2_log_wait_commit on NVMe devices
  2015-08-12 18:37 ` Theodore Ts'o
@ 2015-08-12 19:28   ` Roy Yang
  2015-08-12 21:43     ` Roy Yang
  0 siblings, 1 reply; 7+ messages in thread
From: Roy Yang @ 2015-08-12 19:28 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4@vger.kernel.org

Hi Ted, 

     I will collect some info about jbd2 and get back to this list.
     
     To answer your questions, our workload is mostly sequential writes from a couple of daemons.
     Every second, we create 6 to 10 files and write around 8 MB into them. We only
     extend or truncate files when needed, and I am sure we rarely do this on the NVMe devices.

     Reading the kernel documentation, I also see that the ext4 data=ordered option couples data
     writeback with the metadata journaling. If the data blocks are big and the workload is heavy,
     will JBD2 be saturated?

     Thanks,
     
     Roy


* RE: fsync stuck at jbd2_log_wait_commit on NVMe devices
  2015-08-12 19:28   ` Roy Yang
@ 2015-08-12 21:43     ` Roy Yang
  2015-08-12 22:34       ` Theodore Ts'o
  0 siblings, 1 reply; 7+ messages in thread
From: Roy Yang @ 2015-08-12 21:43 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4@vger.kernel.org

Hi Ted, 

    I got some JBD2 tracing logs by running our test program; on this node most runs take 10 seconds or so, while on a normal system the same test takes only 2 seconds.
    We saw the spike in the "running" time, but even after checking the kernel source I don't get what it means.  Any insights?

           <...>-220601 [015] .... 181555.139321: jbd2_run_stats: dev 259,3 tid 1064541 wait 0 request_delay 0 running 0 locked 0 flushing 0 logging 0 handle_count 4 blocks 1 blocks_logged 2
           <...>-220601 [015] .... 181555.141161: jbd2_run_stats: dev 259,3 tid 1064542 wait 0 request_delay 0 running 0 locked 0 flushing 0 logging 0 handle_count 4 blocks 1 blocks_logged 2
           <...>-220601 [015] .... 181555.143399: jbd2_run_stats: dev 259,3 tid 1064543 wait 0 request_delay 0 running 1 locked 0 flushing 0 logging 0 handle_count 4 blocks 3 blocks_logged 4
           <...>-220601 [015] .... 181555.143710: jbd2_run_stats: dev 259,3 tid 1064544 wait 0 request_delay 0 running 0 locked 0 flushing 0 logging 0 handle_count 4 blocks 1 blocks_logged 2
           <...>-220601 [015] .... 181555.143840: jbd2_run_stats: dev 259,3 tid 1064545 wait 0 request_delay 0 running 0 locked 0 flushing 0 logging 0 handle_count 1 blocks 4 blocks_logged 5
           <...>-220601 [015] .... 181555.176298: jbd2_run_stats: dev 259,3 tid 1064546 wait 0 request_delay 0 running 32 locked 0 flushing 0 logging 0 handle_count 31 blocks 5 blocks_logged 6
 jbd2/nvme0n1p1--964   [014] .... 181555.189997: jbd2_run_stats: dev 259,1 tid 293294 wait 0 request_delay 0 running 5000 locked 0 flushing 0 logging 0 handle_count 15613 blocks 4 blocks_logged 5
           <...>-220601 [015] .... 181556.126407: jbd2_run_stats: dev 259,3 tid 1064547 wait 0 request_delay 0 running 929 locked 0 flushing 0 logging 0 handle_count 30 blocks 3 blocks_logged 4
           <...>-220601 [014] .... 181556.159874: jbd2_run_stats: dev 259,3 tid 1064548 wait 0 request_delay 0 running 10 locked 0 flushing 0 logging 0 handle_count 29 blocks 1 blocks_logged 2
           <...>-220601 [000] .... 181557.123872: jbd2_run_stats: dev 259,3 tid 1064549 wait 0 request_delay 0 running 941 locked 0 flushing 0 logging 0 handle_count 30 blocks 3 blocks_logged 4
           <...>-220601 [015] .... 181558.135168: jbd2_run_stats: dev 259,3 tid 1064550 wait 0 request_delay 1 running 1 locked 0 flushing 0 logging 0 handle_count 8 blocks 6 blocks_logged 7
           <...>-220601 [015] .... 181558.138895: jbd2_run_stats: dev 259,3 tid 1064551 wait 0 request_delay 0 running 0 locked 0 flushing 0 logging 0 handle_count 1 blocks 4 blocks_logged 5
           <...>-220601 [015] .... 181558.139091: jbd2_run_stats: dev 259,3 tid 1064552 wait 0 request_delay 0 running 0 locked 0 flushing 0 logging 1 handle_count 2 blocks 4 blocks_logged 5
           <...>-220601 [015] .... 181558.139886: jbd2_run_stats: dev 259,3 tid 1064553 wait 0 request_delay 0 running 0 locked 0 flushing 0 logging 0 handle_count 12 blocks 6 blocks_logged 7
           <...>-220601 [015] .... 181558.147189: jbd2_run_stats: dev 259,3 tid 1064554 wait 0 request_delay 0 running 0 locked 0 flushing 0 logging 0 handle_count 1 blocks 4 blocks_logged 5

   Thanks,
    Roy


* Re: fsync stuck at jbd2_log_wait_commit on NVMe devices
  2015-08-12 21:43     ` Roy Yang
@ 2015-08-12 22:34       ` Theodore Ts'o
  2015-08-12 23:34         ` Roy Yang
  0 siblings, 1 reply; 7+ messages in thread
From: Theodore Ts'o @ 2015-08-12 22:34 UTC (permalink / raw)
  To: Roy Yang; +Cc: linux-ext4@vger.kernel.org

On Wed, Aug 12, 2015 at 02:43:22PM -0700, Roy Yang wrote:
> Hi Ted, 
> 
>     I got some JBD2 tracing logs by running our test program; on this node most runs take 10 seconds or so, while on a normal system the same test takes only 2 seconds.
>     We saw the spike in the "running" time, but even after checking the kernel source I don't get what it means.  Any insights?

You need to get the trace on your normal system --- the system where,
under load, you are seeing the long times for a journal commit to
complete.  The jbd2_run_stats tracepoint gives us critical insight into what
might be going wrong --- but just as it doesn't help much to use a
diagnostic tool on a pet dog when you are trying to determine why the
child is running a fever, it doesn't help much to take traces on a
test program/system when we're trying to actually debug why your
production server is having trouble....


>            <...>-220601 [015] .... 181555.139321: jbd2_run_stats: dev 259,3 tid 1064541 wait 0 request_delay 0 running 0 locked 0 flushing 0 logging 0 handle_count 4 blocks 1 blocks_logged 2

There will be one of these trace points emitted for each jbd2
transaction commit.  Most of the fields, with the exception of the
last three (handle_count, blocks, and blocks_logged), are in units of
1000 HZ "jiffies".  So for example, if you don't have any fsync()'s,
transactions by default will run for 5 seconds before being committed,
so the "running" time will be a bit larger than 5000 jiffies (the
timer goes off at 5 seconds, but there may be some delay before the
jbd2 kernel thread gets scheduled):

     jbd2/sda3-8-275   [003] ...2 70449.637938: jbd2_run_stats: dev 8,3 tid 789663 wait 0 request_delay 0 running 5030 locked 0 flushing 0 logging 0 handle_count 18 blocks 5 blocks_logged 6

The "wait" field indicates the maximum time that a handle associated
with a transaction had to wait before it could get started.  Typically
this is because we couldn't start a new transaction because the old
transaction was still in the process of being committed.

The "request_delay" field indicates delay between when transaction
commit was requested (typically due to an fsync), and when the jbd2
kernel thread actually started working on the commit.  (Again, all time
fields are in units of jiffies, so you can read the integer as
milliseconds: 5030 jiffies == 5030 milliseconds == 5.03 seconds.)

The "running" field indicates how long transaction was running; as
mentioned above, this will typically be around 5000 jiffies unless
your workload was doing fsync()'s, or if the transaction was big
enough that it was forced to close because journal space reasons.

The "locked" field indicates how long it takes for all of the pending
handles to be completed.  When the jbd2 layer starts working on the
commit, the first thing it does is lock down the transaction so no new
handles can be started.  It then waits for all currently started
handles to complete.  While we are waiting for handles to close, no
other handles can be started (see discussion of the maximum wait time
above).  We try to keep handle processing time as short as possible
(for example, by prefetching blocks before starting a handle), but
there are cases where this is not possible, or at least not easy.
Analysis of the jbd2_handle_stats traces can help identify
longer-running handles.  If these long-running handles happen right
before journal commit (and for a workload that is overly fsync-happy,
that may be most handles), they can prolong how long a journal commit
stays in the "locked" phase.

The "flushing" field indicates how long the transaction commit spent
writing out data blocks.  We only need to write out data blocks in
data=ordered mode associated with inodes which have new blocks
allocated during the last transaction.

The "logging" field indicates how long it took to write out the blocks
that are actually written to the journal.  The number of blocks
actually written is reflected in the "blocks" and "blocks_logged"
fields, where the "blocks_logged" field includes the overhead blocks
(jbd2 descriptor blocks, jbd2 revoke blocks, etc.)

The "handle_count" field indicates how many handles are associated
with a particular transaction.  In general each simple system call
(chmod, creat, chown, symlink, read, etc.) that modifies the file
system requires at most a single handle.  The big exception to this
rule of thumb is the truncate(2) and unlink(2) system calls, which may
require multiple handles in order to complete.  The truncate() and
unlink system calls are also the ones which tend to hold a handle open
for longer-than-average amounts of time.
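
To pick the outliers out of a long capture, something along these lines (only
a rough sketch; the field names match the jbd2_run_stats lines shown above and
the 1000-jiffy threshold is arbitrary) will print just the commits that spent
a long time in "running" or "locked":

awk '/jbd2_run_stats/ {
        run = lck = 0
        for (i = 1; i <= NF; i++) {
                if ($i == "running") run = $(i+1) + 0
                if ($i == "locked")  lck = $(i+1) + 0
        }
        if (run > 1000 || lck > 1000) print
}' /tmp/traces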

By the way, this is why calling truncate() and doing block allocations
and doing fsync() a lot is *not* free.  If you are purchasing $$$ NVME
storage, in general it's a wise idea for your application to simply do
non-allocating writes (i.e., overwriting previous blocks in a
preallocated file), and tracking which blocks are used for what in
your application, instead of trying to ask general purpose file system
code to do this for you.  If you can use a specialized data structure
(say, a circular buffer) and manage the head and tail pointers
yourself, it will almost always require much less overhead than what
the file system will have to do since the file system doesn't have the
advantage of specialized knowledge of exactly what your requirements
might be for your workload.
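
As a rough illustration of that pattern (the file names and sizes here are
made up, not from this thread): preallocate once, then only overwrite in
place:

# allocate the space up front (note: this leaves unwritten extents, so the
# very first write to each region still updates extent metadata)
fallocate -l 1G /logs/ringbuffer.dat
# later, overwrite an existing 1M region at offset 37M instead of appending;
# conv=notrunc avoids truncating the file, conv=fdatasync flushes the data
dd if=chunk.bin of=/logs/ringbuffer.dat bs=1M seek=37 count=1 \
   conv=notrunc,fdatasync status=none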

Regards,

						- Ted


* RE: fsync stuck at jbd2_log_wait_commit on NVMe devices
  2015-08-12 22:34       ` Theodore Ts'o
@ 2015-08-12 23:34         ` Roy Yang
  2015-08-13 13:09           ` Theodore Ts'o
  0 siblings, 1 reply; 7+ messages in thread
From: Roy Yang @ 2015-08-12 23:34 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4@vger.kernel.org

Hi Ted, 

   Appreciate your time; I will get the log from the normal workload.

   BTW, the previous JBD2 log is from the node which is in a bad state.
   A normal node finishes within 3 seconds, while on this node it takes 10 seconds.

   Thank you very much!

   Roy


* Re: fsync stuck at jbd2_log_wait_commit on NVMe devices
  2015-08-12 23:34         ` Roy Yang
@ 2015-08-13 13:09           ` Theodore Ts'o
  0 siblings, 0 replies; 7+ messages in thread
From: Theodore Ts'o @ 2015-08-13 13:09 UTC (permalink / raw)
  To: Roy Yang; +Cc: linux-ext4@vger.kernel.org

On Wed, Aug 12, 2015 at 04:34:17PM -0700, Roy Yang wrote:
> Hi Ted, 
> 
>    Appreciate your time; I will get the log from the normal workload.
> 
>    BTW, the previous JBD2 log is from the node which is in a bad state.
>   A normal node finishes within 3 seconds, while on this node it takes 10 seconds.

Ah, you didn't say that.

If this is a node in a bad state, and we're not seeing any real delays
from the jbd2 logs, then it sounds very much like the problem is a
hardware issue (or maybe an NVMe device driver issue).  You might want
to use blktrace to see how long the underlying I/O operations are
taking to complete.
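
Something like the following (a sketch; adjust the device and capture window)
will show the per-I/O device latency, where the D2C column in the btt output
is the time the drive itself took to complete each request:

blktrace -d /dev/nvme0n1 -o nvme0n1 -w 30      # capture 30 seconds of block I/O
blkparse -i nvme0n1 -d nvme0n1.bin > /dev/null
btt -i nvme0n1.bin | less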

						- Ted