public inbox for linux-kernel@vger.kernel.org
* Re: Trying to measure performance with splice/vmsplice ....
@ 2010-04-23 16:07 Rick Sherm
  2010-04-23 16:54 ` Steven J. Magnani
  0 siblings, 1 reply; 6+ messages in thread
From: Rick Sherm @ 2010-04-23 16:07 UTC (permalink / raw)
  To: steve, axboe; +Cc: linux-kernel

Hello Jens - any assistance/pointers on 1) and 2) below
would be great. I'm willing to test out any sample patch.

Steve,

--- On Wed, 4/21/10, Steven J. Magnani <steve@digidescorp.com> wrote:
> Hi Rick,
> 
> On Fri, 2010-04-16 at 10:02 -0700, Rick Sherm wrote:
> > Q3) When using splice, even though the destination
> file is opened in O_DIRECT mode, the data gets cached. I
> verified it using vmstat.
> > 
> > r b   swpd   free   buff cache   
> > 1 0     0 9358820 116576 2100904
> > 
> > ./splice_to_splice
> > 
> > r b swpd   free   buff cache
> > 2 0  0 7228908 116576  4198164
> > 
> > I see the same caching issue even if I vmsplice
> buffers(simple malloc'd iov) to a pipe and then splice the
> pipe to a file. The speed is still an issue with vmsplice
> too.
> > 
> 
> One thing is that O_DIRECT is a hint; not all filesystems
> bypass the cache. I'm pretty sure ext2 does, and I know fat doesn't. 
> 
> Another variable is whether (and how) your filesystem
> implements the splice_write file operation. The generic one (pipe_to_file)
> in fs/splice.c copies data to pagecache. The default one goes
> out to vfs_write() and might stand more of a chance of honoring
> O_DIRECT.
> 

True. I guess I should have looked harder. It's XFS, and XFS's file_ops point to generic_file_splice_read/generic_file_splice_write. Last time I had to fdatasync and then fadvise to mimic O_DIRECT.

> > Q4) Also, using splice, you can only transfer 64K
> worth of data(PIPE_BUFFERS*PAGE_SIZE) at a time,correct?.But
> using stock read/write, I can go upto 1MB buffer. After that
> I don't see any gain. But still the reduction in system/cpu
> time is significant.
> 
> I'm not a splicing expert but I did spend some time recently trying to
> improve FTP reception by splicing from a TCP socket to a file. I found
> that while splicing avoids copying packets to userland, that gain is
> more than offset by a large increase in calls into the storage stack.
> It's especially bad with TCP sockets because a typical packet has, say,
> 1460 bytes of data. Since splicing works on PIPE_BUFFERS pages at a
> time, and packet pages are only about 35% utilized, each cycle to
> userland I could only move 23 KiB of data at most. Some similar effect
> may be in play in your case.
> 

Agreed, increasing the number of calls will offset the benefit.
But what if:
1) We were to increase PIPE_BUFFERS from '16' to '64' or some other value? What are the implications for other parts of the kernel?
2) There were a way to find out when the DMA-out/in from the initial buffers that were passed is complete, so that we are free to recycle them? A callback would be helpful. Obviously, the user-space app will have to manage its buffers, but at least we are guaranteed that the buffers can be recycled (in other words, no worrying about modifying in-flight data that is being DMA'd).

> Regards,
>  Steven J. Magnani           

regards
++Rick



      


* Trying to measure performance with splice/vmsplice ....
@ 2010-04-16 17:02 Rick Sherm
  2010-04-21 18:17 ` Steven J. Magnani
  0 siblings, 1 reply; 6+ messages in thread
From: Rick Sherm @ 2010-04-16 17:02 UTC (permalink / raw)
  To: linux-kernel, axboe

Hello,

I'm trying to measure the performance gain from using splice. For now I'm trying to copy a 1G file using splice. (In the real scenario, the driver will DMA the data to some buffer, which is mmap'd. The app will then write the newly-DMA'd data to disk while some other thread is crunching the same buffer. The buffer is guaranteed not to be modified. To avoid copying, I was thinking of: splice mmap'd-buffer -> pipe, then splice pipe -> file.)

PS - I've inlined some sloppy code that I cooked up.

Case 1) read from input_file and write (O_DIRECT, so no buffer cache should be involved, but it doesn't work) to dest_file. We can talk about the buffer cache later.

(csh#)time ./splice_to_splice

0.004u 1.451s 0:02.16 67.1%     0+0k 2097152+2097152io 0pf+0w

#define KILO_BYTE    (1024)
#define PIPE_SIZE    (64 * KILO_BYTE)
int filedes[2];

pipe(filedes);

fd_from = open(filename_from, (O_RDWR|O_LARGEFILE|O_DIRECT), 0777);
fd_to   = open(filename_to, (O_WRONLY|O_CREAT|O_LARGEFILE|O_DIRECT), 0777);

/* 1G file == 2048 * 512K blocks */
to_write = 2048 * 512 * KILO_BYTE;

while (to_write) {
    ret = splice(fd_from, &from_offset, filedes[1], NULL, PIPE_SIZE,
                 SPLICE_F_MORE | SPLICE_F_MOVE);
    if (ret < 0) {
        printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
        goto error;
    } else {
        ret = splice(filedes[0], NULL, fd_to,
                     &to_offset, PIPE_SIZE /* should be ret, but ... */,
                     SPLICE_F_MORE | SPLICE_F_MOVE);
        if (ret < 0) {
            printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
            goto error;
        }
        to_write -= ret;
    }
}

Case 2) directly reading and writing:

Case2.1) copy 64K blocks

(csh#)time ./file_to_file 64
0.015u 1.066s 0:04.04 26.4%     0+0k 2097152+2097152io 0pf+0w

#define KILO_BYTE    (1024)
#define MEGA_BYTE    (1024 * (KILO_BYTE))
#define BUFF_SIZE    (64 * MEGA_BYTE)

posix_memalign((void **)&buff, 4096, BUFF_SIZE);

fd_from = open(filename_from, (O_RDWR|O_LARGEFILE|O_DIRECT), 0777);
fd_to   = open(filename_to, (O_WRONLY|O_CREAT|O_LARGEFILE|O_DIRECT), 0777);

/* 1G file == 2048 * 512K blocks */
to_write = 2048 * 512 * KILO_BYTE;
copy_size = cmd_line_input * KILO_BYTE; /* control from cmd_line */
while (to_write) {
    ret = read(fd_from, buff, copy_size);
    if (ret != copy_size) {
        printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
        goto error;
    } else {
        ret = write(fd_to, buff, copy_size);
        if (ret != copy_size) {
            printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
            goto error;
        }
        to_write -= ret;
    }
}

Case2.2) copy 512K blocks

(csh#)time ./file_to_file 512
0.004u 0.306s 0:01.86 16.1%     0+0k 2097152+2097152io 0pf+0w


Case 2.3) copy 1M blocks
time ./file_to_file 1024
0.000u 0.240s 0:01.88 12.7%     0+0k 2097152+2097152io 0pf+0w


Questions:
Q1) When using splice, why is the CPU consumption greater than with read/write (case 2.1)? What does this mean?

Q2) How do I confirm that memory bandwidth consumption does not spike when using splice in this case? By this I mean the (node) cpu<->mem path. The DMA-in/DMA-out will happen; you can't escape that, and the IOH bus will be utilized. But I want to keep the cpu(node)-mem path free (well, minimize unnecessary copies).

Q3) When using splice, even though the destination file is opened in O_DIRECT mode, the data gets cached. I verified it using vmstat.

r  b   swpd   free   buff    cache   
1  0      0 9358820 116576 2100904

./splice_to_splice

r  b   swpd   free   buff  cache
2  0      0 7228908 116576 4198164

I see the same caching issue even if I vmsplice buffers (a simple malloc'd iov) to a pipe and then splice the pipe to a file. Speed is still an issue with vmsplice too.

Q4) Also, using splice, you can only transfer 64K worth of data (PIPE_BUFFERS*PAGE_SIZE) at a time, correct? But using stock read/write, I can go up to a 1MB buffer. Beyond that I don't see any gain. But still, the reduction in system/CPU time is significant.

I would appreciate any pointers.


thanks
Rick



      



end of thread, other threads:[~2010-04-23 19:52 UTC | newest]

Thread overview: 6+ messages
2010-04-23 16:07 Trying to measure performance with splice/vmsplice Rick Sherm
2010-04-23 16:54 ` Steven J. Magnani
2010-04-23 17:05   ` Jens Axboe
2010-04-23 19:52     ` Rick Sherm
2010-04-16 17:02 Rick Sherm
2010-04-21 18:17 ` Steven J. Magnani
