public inbox for linux-kernel@vger.kernel.org
* Re: Trying to measure performance with splice/vmsplice ....
@ 2010-04-23 16:07 Rick Sherm
  2010-04-23 16:54 ` Steven J. Magnani
  0 siblings, 1 reply; 6+ messages in thread
From: Rick Sherm @ 2010-04-23 16:07 UTC (permalink / raw)
  To: steve, axboe; +Cc: linux-kernel

Hello Jens - any assistance/pointers on 1) and 2) below
would be great. I'm willing to test out any sample patch.

Steve,

--- On Wed, 4/21/10, Steven J. Magnani <steve@digidescorp.com> wrote:
> Hi Rick,
> 
> On Fri, 2010-04-16 at 10:02 -0700, Rick Sherm wrote:
> > Q3) When using splice, even though the destination
> file is opened in O_DIRECT mode, the data gets cached. I
> verified it using vmstat.
> > 
> > r b   swpd   free   buff cache   
> > 1 0     0 9358820 116576 2100904
> > 
> > ./splice_to_splice
> > 
> > r b swpd   free   buff cache
> > 2 0  0 7228908 116576  4198164
> > 
> > I see the same caching issue even if I vmsplice
> buffers(simple malloc'd iov) to a pipe and then splice the
> pipe to a file. The speed is still an issue with vmsplice
> too.
> > 
> 
> One thing is that O_DIRECT is a hint; not all filesystems
> bypass the cache. I'm pretty sure ext2 does, and I know fat doesn't. 
> 
> Another variable is whether (and how) your filesystem
> implements the splice_write file operation. The generic one (pipe_to_file)
> in fs/splice.c copies data to pagecache. The default one goes
> out to vfs_write() and might stand more of a chance of honoring
> O_DIRECT.
> 

True. I guess I should have looked harder. It's XFS, and XFS's file_ops point to generic_file_splice_read/generic_file_splice_write. Last time I had to fdatasync and then fadvise to mimic O_DIRECT.

> > Q4) Also, using splice, you can only transfer 64K
> worth of data(PIPE_BUFFERS*PAGE_SIZE) at a time,correct?.But
> using stock read/write, I can go upto 1MB buffer. After that
> I don't see any gain. But still the reduction in system/cpu
> time is significant.
> 
> I'm not a splicing expert but I did spend some time recently trying to
> improve FTP reception by splicing from a TCP socket to a file. I found
> that while splicing avoids copying packets to userland, that gain is
> more than offset by a large increase in calls into the storage stack.
> It's especially bad with TCP sockets because a typical packet has, say,
> 1460 bytes of data. Since splicing works on PIPE_BUFFERS pages at a
> time, and packet pages are only about 35% utilized, each cycle to
> userland I could only move 23 KiB of data at most. Some similar effect
> may be in play in your case.
> 

Agreed, increasing the number of calls will offset the benefit.
But what if:
1) We were to increase PIPE_BUFFERS from '16' to '64' or some other value? What are the implications for other parts of the kernel?
2) There were a way to find out when the DMA-out/in from the initial buffers that were passed is complete, so that we are free to recycle them? A callback would be helpful. Obviously, the user-space app will have to manage its buffers, but at least we are guaranteed that the buffers can be recycled (in other words, no worrying about modifying in-flight data that is being DMA'd).

> Regards,
>  Steven J. Magnani           

regards
++Rick



      


* Trying to measure performance with splice/vmsplice ....
@ 2010-04-16 17:02 Rick Sherm
  2010-04-21 18:17 ` Steven J. Magnani
  0 siblings, 1 reply; 6+ messages in thread
From: Rick Sherm @ 2010-04-16 17:02 UTC (permalink / raw)
  To: linux-kernel, axboe

Hello,

I'm trying to measure the performance gain from using splice. For now I'm trying to copy a 1G file using splice. (In the real scenario, the driver will DMA the data to some buffer, which is mmap'd. The app will then write the newly-DMA'd data to disk while some other thread is crunching the same buffer. The buffer is guaranteed not to be modified. To avoid copying, I was thinking of: splice mmap'd-buffer -> pipe, then splice pipe -> file.)

PS - I've inlined some sloppy code that I cooked up.

Case 1) read from input_file and write (O_DIRECT, so no buffer cache should be involved, but it doesn't work) to dest_file. We can talk about the buffer cache later.

(csh#)time ./splice_to_splice

0.004u 1.451s 0:02.16 67.1%     0+0k 2097152+2097152io 0pf+0w

#define KILO_BYTE    (1024)
#define PIPE_SIZE    (64 * KILO_BYTE)
int filedes[2];

pipe(filedes);

fd_from = open(filename_from, (O_RDWR|O_LARGEFILE|O_DIRECT), 0777);
fd_to   = open(filename_to, (O_WRONLY|O_CREAT|O_LARGEFILE|O_DIRECT), 0777);

/* 1G file == 2048 * 512K blocks */
to_write = 2048 * 512 * KILO_BYTE;

while (to_write) {
    ret = splice(fd_from, &from_offset, filedes[1], NULL, PIPE_SIZE,
                 SPLICE_F_MORE | SPLICE_F_MOVE);
    if (ret < 0) {
        printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
        goto error;
    } else {
        ret = splice(filedes[0], NULL, fd_to,
                     &to_offset, PIPE_SIZE /* should be ret, but ... */,
                     SPLICE_F_MORE | SPLICE_F_MOVE);
        if (ret < 0) {
            printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
            goto error;
        }
        to_write -= ret;
    }
}

Case 2) directly reading and writing:

Case2.1) copy 64K blocks

(csh#)time ./file_to_file 64
0.015u 1.066s 0:04.04 26.4%     0+0k 2097152+2097152io 0pf+0w

#define KILO_BYTE    (1024)
#define MEGA_BYTE    (1024 * (KILO_BYTE))
#define BUFF_SIZE    (64 * MEGA_BYTE)

posix_memalign((void **)&buff, 4096, BUFF_SIZE);

fd_from = open(filename_from, (O_RDWR|O_LARGEFILE|O_DIRECT), 0777);
fd_to   = open(filename_to, (O_WRONLY|O_CREAT|O_LARGEFILE|O_DIRECT), 0777);

/* 1G file == 2048 * 512K blocks */
to_write = 2048 * 512 * KILO_BYTE;
copy_size = cmd_line_input * KILO_BYTE; /* control from cmd_line */
while (to_write) {
    ret = read(fd_from, buff, copy_size);
    if (ret != copy_size) {
        printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
        goto error;
    } else {
        ret = write(fd_to, buff, copy_size);
        if (ret != copy_size) {
            printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
            goto error;
        }
        to_write -= ret;
    }
}

Case2.2) copy 512K blocks

(csh#)time ./file_to_file 512
0.004u 0.306s 0:01.86 16.1%     0+0k 2097152+2097152io 0pf+0w


Case 2.3) copy 1M blocks
time ./file_to_file 1024
0.000u 0.240s 0:01.88 12.7%     0+0k 2097152+2097152io 0pf+0w


Questions:
Q1) When using splice, why is the CPU consumption greater than with read/write (case 2.1)? What does this mean?

Q2) How do I confirm that memory bandwidth consumption does not spike when using splice in this case? By this I mean the (node) cpu<->mem path. The DMA-in/DMA-out will happen; you can't escape that, and the IOH bus will be utilized. But I want to keep the cpu(node)-mem path free (well, minimize unnecessary copies).

Q3) When using splice, even though the destination file is opened in O_DIRECT mode, the data gets cached. I verified it using vmstat.

r  b   swpd   free   buff    cache   
1  0      0 9358820 116576 2100904

./splice_to_splice

r  b   swpd   free   buff  cache
2  0      0 7228908 116576 4198164

I see the same caching issue even if I vmsplice buffers (a simple malloc'd iov) to a pipe and then splice the pipe to a file. Speed is still an issue with vmsplice too.

Q4) Also, using splice, you can only transfer 64K worth of data (PIPE_BUFFERS*PAGE_SIZE) at a time, correct? But using stock read/write, I can go up to a 1MB buffer. Beyond that I don't see any gain. But still, the reduction in system/CPU time is significant.

I would appreciate any pointers.


thanks
Rick



      



end of thread, other threads:[~2010-04-23 19:52 UTC | newest]

Thread overview: 6+ messages
2010-04-23 16:07 Trying to measure performance with splice/vmsplice Rick Sherm
2010-04-23 16:54 ` Steven J. Magnani
2010-04-23 17:05   ` Jens Axboe
2010-04-23 19:52     ` Rick Sherm
2010-04-16 17:02 Rick Sherm
2010-04-21 18:17 ` Steven J. Magnani
