linux-ext4.vger.kernel.org archive mirror
* Fsync Performance
@ 2015-08-04 19:06 David Muchene
  2015-08-05  1:59 ` Theodore Ts'o
  0 siblings, 1 reply; 4+ messages in thread
From: David Muchene @ 2015-08-04 19:06 UTC (permalink / raw)
  To: linux-ext4@vger.kernel.org


Hi,
I'm not sure if this is the right place to ask; if it isn't, I apologize. We are occasionally seeing fsync take a very long time (sometimes upwards of 3s). We ran some fio tests and used systemtap to determine whether the disks were the cause of the problem. One result from these tests is that there is occasionally a significant difference between the time spent doing I/O and the total time to complete the fsync. Is there an explanation for this difference, or is the systemtap script bogus? If it is in fact the driver/disks taking a long time, does anyone have any suggestions as to how I'd debug that? I appreciate any help you can provide (even if it's pointing me to the relevant documents).

System Config:
  Debian Wheezy
  Kernel 3.2.6
  Seagate ST200FM0073 SSDs
  Mount options: (rw,relatime,user_xattr,barrier=1,data=ordered)


fio  --time_based --thinktime=125000 --rw=randwrite --write_lat_log=./throwaway/sm_sd --name=sm_sd --fsync=20 --buffered=1 --thinktime_blocks=20 --rate_iops=20 --direct=0 --filename=/avid/mnt/storagemanager0/sm_sd_test --per_job_logs=1 --bs=4k --output=./throwaway/sm_sd --ioengine=libaio --runtime=36000 --iodepth=20 --log_avg_msec=1000 --size=1024m

fio  --time_based --thinktime=16000000 --rw=randwrite --write_lat_log=./throwaway/sd_sm --name=sd_sm --fsync=25 --buffered=1 --thinktime_blocks=25 --rate_iops=25 --direct=0 --filename=/avid/mnt/systemdirector/sm_sd_test --per_job_logs=1 --bs=128k --output=./throwaway/sd_sm --ioengine=libaio --runtime=36000 --iodepth=20 --log_avg_msec=1000 --size=1024m


Results
---------------------------------------------------------------------------
Sync Time (us)   Max IO Time (us)   Total IO Time (us)
         13801               4402                4522
         13923                  0                   0
        348697              88261               88375
        254846             170582              170723
        318862             317069              317593

//////////////////////////////////// test_sync.stp ////////////////////////////////////
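// Usage note: the embedded-C helper below requires guru mode, e.g.:
//     stap -g test_sync.stp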
global sync_start = 0          // 1 while an fsync from PROCNAME is in flight
global sync_blocks_start       // per-bio dispatch timestamps (us)
global sync_blocks_end         // per-bio completion timestamps (us)
global PROCNAME = "fio"
global sync_beg = 0
global sync_end = 0
global wrt_beg  = 0
global wrt_end  = 0
global fm_beg   = 0
global fm_end   = 0
global thresh = 10000          // only report fsyncs slower than this (us)
global max_iotime = 0
global first_io = 0
global last_io = 0


// Follow bio->bi_next; embedded C, hence the need for guru mode.
function get_next:long(bio:long) %{
    THIS->__retvalue = (long)((struct bio *)((long)THIS->bio))->bi_next;
%}

// Debug helper (currently unused): print the bi_next chain of a bio.
function iter_bio(bio){
    bio_itr = bio;
    printf("bio list: [");
    while(bio_itr != 0){
        printf("%p, ", bio_itr);
        bio_itr = get_next(bio_itr);
    }
    printf("]\n");
}

probe kernel.function("blk_dequeue_request"){
    if(sync_start == 1 && execname() == PROCNAME){    
        sync_blocks_start[$rq->bio] = gettimeofday_us();
        if(first_io == 0){ 
            first_io = sync_blocks_start[$rq->bio];
        }
    }
}


probe kernel.function("blk_account_io_completion"){
    if(sync_start == 1 && ($req->bio in sync_blocks_start)){
        sync_blocks_end[$req->bio] = gettimeofday_us();    
        iotime = sync_blocks_end[$req->bio] - sync_blocks_start[$req->bio];
        if(iotime > max_iotime){
            max_iotime = iotime;
        }
        last_io = sync_blocks_end[$req->bio];
    }
}

probe syscall.fsync{
    if(execname() == PROCNAME){
        sync_start = 1;
        sync_beg = gettimeofday_us();
        //printf("sync start: %d\n",gettimeofday_us());
    }
}
 
probe syscall.fsync.return{
    if(execname() == PROCNAME){
        sync_start = 0;
        sync_end = gettimeofday_us();
        // Wall-clock span from first dispatch to last completion,
        // not a sum of per-bio service times.
        total_iotime = last_io - first_io;

        delete sync_blocks_start;
        delete sync_blocks_end;
        if((sync_end - sync_beg) > thresh){
            printf("%d, %d, %d, %d, %d, %d, %d\n", sync_beg, sync_end,
                   first_io, last_io, sync_end - sync_beg, max_iotime,
                   total_iotime);
        }
        last_io = first_io = 0;
        sync_end = 0;
        max_iotime = 0;
    }
}

probe kernel.function("do_writepages") {
    if(execname() == PROCNAME){
        wrt_beg = gettimeofday_us();
    }
}

probe kernel.function("filemap_write_and_wait_range"){
    if(execname() == PROCNAME){
        fm_beg = gettimeofday_us();
    }
}

probe kernel.function("filemap_write_and_wait_range").return{
    if(execname() == PROCNAME){
        fm_end = gettimeofday_us();
    }
}

probe kernel.function("do_writepages").return{
    if(execname() == PROCNAME){
        wrt_end = gettimeofday_us();
    }
}



* Re: Fsync Performance
  2015-08-04 19:06 Fsync Performance David Muchene
@ 2015-08-05  1:59 ` Theodore Ts'o
  0 siblings, 0 replies; 4+ messages in thread
From: Theodore Ts'o @ 2015-08-05  1:59 UTC (permalink / raw)
  To: David Muchene; +Cc: linux-ext4@vger.kernel.org

On Tue, Aug 04, 2015 at 07:06:13PM +0000, David Muchene wrote:
> 
> I'm not sure if this is the right place to ask; if it isn't, I
> apologize. We are occasionally seeing fsync take a very long time
> (sometimes upwards of 3s). We ran some fio tests and used systemtap
> to determine whether the disks were the cause of the problem. One
> result from these tests is that there is occasionally a significant
> difference between the time spent doing I/O and the total time to
> complete the fsync. Is there an explanation for this difference, or
> is the systemtap script bogus? If it is in fact the driver/disks
> taking a long time, does anyone have any suggestions as to how I'd
> debug that? I appreciate any help you can provide (even if it's
> pointing me to the relevant documents).

You haven't specified which functions you are including as meaning
"time spent doing I/O", but I suspect what you are seeing is the
difference between the time to send the data blocks to the disk, and
(a) the time to complete the journal commit and (b) the time for the
SSD to confirm that the data and metadata blocks sent to the device
have been written to stable store (so they will survive a power
failure)[1].

[1] Note that not all SSDs, especially non-enterprise SSDs, are rated
to be safe against power failures.

You may be able to avoid the need to complete the journal commit if
all of the writes to the file are non-allocating writes (i.e., the
blocks were already allocated, and initialized by prewriting them if
they were allocated using fallocate), and you use fdatasync(2)
instead of fsync(2).  (If there is no need to update the file system
metadata blocks in order to guarantee that the blocks can be read
after a power failure, fdatasync will omit flushing the inode
mtime/ctime updates to the device.)
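
For example, a minimal sketch of that pattern (the path and sizes are
illustrative, and error handling is omitted for brevity):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define FILE_SZ (1024 * 1024)   /* illustrative file size */

    int main(void)
    {
        char buf[4096];
        /* hypothetical path; open or create the test file */
        int fd = open("/avid/mnt/prealloc_test", O_RDWR | O_CREAT, 0644);

        /* Setup (once): allocate and initialize every block, then do
           one full fsync, so the block-allocation metadata is already
           on stable store. */
        memset(buf, 0, sizeof(buf));
        for (off_t off = 0; off < FILE_SZ; off += sizeof(buf))
            pwrite(fd, buf, sizeof(buf), off);
        fsync(fd);

        /* Steady state: overwrites of already-written blocks require
           no new allocation, so fdatasync only has to flush the data,
           not commit a metadata update through the journal. */
        memset(buf, 0xab, sizeof(buf));
        pwrite(fd, buf, sizeof(buf), 0);
        fdatasync(fd);

        close(fd);
        return 0;
    }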

						- Ted


* Fsync Performance
@ 2015-08-05 14:37 David Muchene
  2015-08-05 15:14 ` Albino B Neto
  0 siblings, 1 reply; 4+ messages in thread
From: David Muchene @ 2015-08-05 14:37 UTC (permalink / raw)
  To: linux-ext4@vger.kernel.org


Hi,
Thanks for responding. The time I'm calling "time spent doing I/O" is the difference between when blk_dequeue_request is called and when blk_account_io_completion is called for the same bio. I thought this time accounted for "the time for the SSD to confirm that the data and metadata blocks sent to the device have been written to stable store". We've tried using fdatasync and it didn't seem to make a difference. Perhaps there are mount options we don't understand?

Thanks,
David Muchene


-----Original Message-----
From: Theodore Ts'o [mailto:tytso@mit.edu]
Sent: Tuesday, August 04, 2015 10:00 PM
To: David Muchene
Cc: linux-ext4@vger.kernel.org
Subject: Re: Fsync Performance

On Tue, Aug 04, 2015 at 07:06:13PM +0000, David Muchene wrote:
> 
> I'm not sure if this is the right place to ask; if it isn't, I
> apologize. We are occasionally seeing fsync take a very long time
> (sometimes upwards of 3s). We ran some fio tests and used systemtap
> to determine whether the disks were the cause of the problem. One
> result from these tests is that there is occasionally a significant
> difference between the time spent doing I/O and the total time to
> complete the fsync. Is there an explanation for this difference, or
> is the systemtap script bogus? If it is in fact the driver/disks
> taking a long time, does anyone have any suggestions as to how I'd
> debug that? I appreciate any help you can provide (even if it's
> pointing me to the relevant documents).

You haven't specified which functions you are including as meaning "time spent doing I/O", but I suspect what you are seeing is the difference between the time to send the data blocks to the disk, and
(a) the time to complete the journal commit and (b) the time for the SSD to confirm that the data and metadata blocks sent to the device have been written to stable store (so they will survive a power failure)[1].

[1] Note that not all SSDs, especially non-enterprise SSDs, are rated to be safe against power failures.

You may be able to avoid the need to complete the journal commit if all of the writes to the file are non-allocating writes (i.e., the blocks were already allocated, and initialized by prewriting them if they were allocated using fallocate), and you use fdatasync(2) instead of fsync(2).  (If there is no need to update the file system metadata blocks in order to guarantee that the blocks can be read after a power failure, fdatasync will omit flushing the inode mtime/ctime updates to the device.)

						- Ted


* Re: Fsync Performance
  2015-08-05 14:37 David Muchene
@ 2015-08-05 15:14 ` Albino B Neto
  0 siblings, 0 replies; 4+ messages in thread
From: Albino B Neto @ 2015-08-05 15:14 UTC (permalink / raw)
  To: David Muchene; +Cc: linux-ext4@vger.kernel.org

2015-08-05 11:37 GMT-03:00 David Muchene <david.muchene@avid.com>:
> Hi,
> Thanks for responding.

You broke the thread. [0]

0 - http://www.spinics.net/lists/linux-ext4/msg48670.html

-- 
Albino B Neto
twitter.com/b1n0anb
"Debian. Freedom to code. Code to freedom!" faw

