linux-fsdevel.vger.kernel.org archive mirror
* [RFC-V2] [PATCH 0/7] Zero Copy
@ 2011-02-07  7:21 Venkateswararao Jujjuri (JV)
From: Venkateswararao Jujjuri (JV) @ 2011-02-07  7:21 UTC
  To: v9fs-developer; +Cc: linux-fsdevel, Venkateswararao Jujjuri

In this patch series I am trying to take another stab at zero copy. 
Please review and provide your feedback.

Goal:

The 9P Linux client makes an additional copy of the read/write buffer into a
kernel buffer.  Some transports (especially in virtualized environments) can
avoid this additional copy by sending the user buffer directly to the server.

Design Goals.

- Have minimal changes to the net layer so that common code is not polluted by 
  the transport specifics.
- Create a common transport library which can be used by other transports.
- Avoid additional optimizations in the initial attempt (more details below) 
  and focus on achieving basic functionality. 

Design

Send the payload buffers separately to the transport layer if it asks for
them. The transport layer specifies its preference through a newly introduced
field in the transport module (clnt->trans_mod->pref).
This method has a few advantages.
  - It keeps the net layer clean and lets the transport layer deal with the
    specifics.
  - Mapping user addresses into kernel pages pins the memory. Without flow
    control this makes the system vulnerable to denial-of-service attacks.
    This change gives the transport layer more control to implement
    effective flow control.
  - If a transport layer doesn't see the need to handle the payload
    separately, it can set the preference accordingly so that the current
    code works with no changes. This is very useful for transports which
    have no plans of converting/pinning user pages. Things become more
    complex for such transports because reads (RREAD) are handled by the
    transport layer in interrupt context, where copy_to_user() is not
    possible.
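
To make the preference mechanism concrete, here is a minimal user-space
sketch of how the net layer might consult clnt->trans_mod->pref. Only the
pref field and P9_TRANS_PREF_PAYLOAD_SEP are named in this mail; the
P9_TRANS_PREF_PAYLOAD_DEF name, the numeric values, the reduced structures
and the helper p9_payload_is_separate() are assumptions for illustration and
not the code from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* P9_TRANS_PREF_PAYLOAD_SEP is named in the cover letter. The _DEF name
 * and both numeric values are assumed for this sketch. */
enum {
	P9_TRANS_PREF_PAYLOAD_DEF = 0x0, /* payload is copied into the PDU */
	P9_TRANS_PREF_PAYLOAD_SEP = 0x1, /* payload is handed over separately */
};

/* Reduced stand-ins for the real transport module and client. */
struct p9_trans_module {
	const char *name;
	int pref;
};

struct p9_client {
	struct p9_trans_module *trans_mod;
};

/* Hypothetical helper: true if the transport asked for the payload to be
 * kept out of the PDU buffer. */
static bool p9_payload_is_separate(const struct p9_client *clnt)
{
	assert(clnt != NULL && clnt->trans_mod != NULL);
	return (clnt->trans_mod->pref & P9_TRANS_PREF_PAYLOAD_SEP) ==
	       P9_TRANS_PREF_PAYLOAD_SEP;
}
```

A transport that leaves pref at the default value keeps the existing
copy-into-PDU path, which is the "no changes" case from the last bullet.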

TREAD/RERROR scenario.
This is a rather sticky issue to deal with for the !dotl protocols. It is not
a problem in 9P2000.L, where the error has a known size (errno), but in the
other protocols it is a string of size up to ERRMAX.  To take care of the
TREAD/RERROR scenario in !dotl, we make sure that the read buffer is big
enough to accommodate an ERRMAX string. If the read size is small, the payload
buffer is not sent separately to the transport layer, even if the transport
has set its preference the other way (P9_TRANS_PREF_PAYLOAD_SEP).

For bigger reads, RERROR is handled by copying the user buffer back into the
kernel buffer in the case of error. As this is done only in the error path,
it should not affect regular performance.
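
The rule described above can be sketched as a small decision function. This
is only an illustration under stated assumptions: struct read_req, its fields,
and send_payload_separately() are hypothetical names, and ERRMAX_ASSUMED uses
the value 256 quoted in the later posting of this series rather than the
definition from the protocol headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed value; the real ERRMAX comes from the protocol headers. */
#define ERRMAX_ASSUMED 256

/* Hypothetical description of one read request. */
struct read_req {
	bool transport_prefers_sep; /* transport set P9_TRANS_PREF_PAYLOAD_SEP */
	bool is_dotl;               /* session speaks 9P2000.L */
	size_t read_size;           /* bytes requested by the caller */
};

/* Decide whether the read payload buffer goes to the transport separately. */
static bool send_payload_separately(const struct read_req *req)
{
	assert(req != NULL);
	if (!req->transport_prefers_sep)
		return false;   /* transport opted out */
	if (req->is_dotl)
		return true;    /* error reply is a fixed-size errno */
	/* !dotl: the buffer must be able to hold an ERRMAX-sized string */
	return req->read_size >= ERRMAX_ASSUMED;
}
```

With these assumptions a 16-byte read on a 9P2000.u session stays on the
copy path, while a 4096-byte read on the same session is sent separately.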
 
Created trans_common.[ch] to house common functions so that other transport 
layers can take advantage of them.

msize: One of the major advantages of this patch series is that it allows a
bigger msize, to pull off bigger reads/writes from the server. Just increasing
the msize is not really a solution, as the majority of other transactions are
extremely small, which could result in wasted kernel heap.  To address this
problem we need to have two sizes of PDUs.
Given that this is an additional optimization/use case of zero copy, and not
a NEED to implement zero copy itself, I am deferring it to the next round of
changes.

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>



* [net/9p] ZeroCopy patch series
@ 2011-02-14  2:21 Venkateswararao Jujjuri (JV)
From: Venkateswararao Jujjuri (JV) @ 2011-02-14  2:21 UTC
  To: v9fs-developer; +Cc: linux-fsdevel, Venkateswararao Jujjuri

This patch series introduces zero copy capability to the 9p transport layer.
The 9P Linux client makes an additional copy of the read/write buffer into
the kernel before sending it down to the transport layer. There is no
functional need for this additional copy, hence it is eliminated by sending
the payload buffer directly to the transport layer. While this is
advantageous to all transports, it can be further exploited by virtualized
transport layers like VirtIO, by sending the user buffer directly to the
server and thereby achieving real zero copy.

Design Goals.

- Have minimal changes to the net layer so that common code is not polluted by
 the transport specifics.
- Create a common transport library which can be used by other transports.
- Avoid additional optimizations in the initial attempt (more details below)
 and focus on achieving basic functionality.

Design

This series adds infrastructure to send the payload buffers directly to the
transport layer if the latter prefers it. To accomplish this, a preference
property is added to the transport layer and additional elements are added
to the PDU structure (struct p9_fcall).

The transport layer specifies its preference through a newly introduced field
in the transport module (clnt->trans_mod->pref), and the net layer sends the
payload through the pubuf/pkbuf elements of struct p9_fcall.
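
As a rough picture of what "sending the payload through pubuf/pkbuf" could
look like, the user-space sketch below models a PDU whose header lives in one
buffer and whose payload is referenced separately. Only the names pubuf and
pkbuf come from this mail; struct p9_fcall_sketch, its other fields, struct
seg and fcall_to_segs() are hypothetical, and a real transport would still
have to pin or map the user pages before handing them to the device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced model of the PDU. pubuf and pkbuf are named in the cover letter;
 * every other field and all types are assumptions. */
struct p9_fcall_sketch {
	uint32_t size;     /* bytes of 9P header held in sdata */
	uint8_t *sdata;    /* kernel buffer with the marshalled header */
	char *pubuf;       /* payload in a user-space buffer, or NULL */
	char *pkbuf;       /* payload in a kernel buffer, or NULL */
	size_t pbuf_size;  /* payload length in bytes */
};

/* One contiguous piece of the message as the transport would see it. */
struct seg {
	const void *base;
	size_t len;
};

/* Describe the PDU as a header segment plus an optional payload segment.
 * Returns the number of segments written to out (1 or 2). */
static int fcall_to_segs(const struct p9_fcall_sketch *fc, struct seg out[2])
{
	const void *payload;
	int n = 0;

	assert(fc != NULL && out != NULL);
	payload = fc->pubuf ? (const void *)fc->pubuf : (const void *)fc->pkbuf;

	out[n].base = fc->sdata;
	out[n].len = fc->size;
	n++;

	if (payload != NULL && fc->pbuf_size > 0) {
		out[n].base = payload;  /* referenced in place, not copied */
		out[n].len = fc->pbuf_size;
		n++;
	}
	return n;
}
```

The point of the split is that the payload is never copied into sdata; the
transport receives a pointer to it and decides how to deliver it.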

This method has a few advantages.
 - It keeps the net layer clean and lets the transport layer deal with the
   specifics.
 - Mapping user addresses into kernel pages pins the memory, which could
   make the system vulnerable to denial-of-service attacks. This change
   gives the transport layer more control to implement effective flow
   control. Expect flow control patches shortly.
 - If a transport layer doesn't see the need to handle the payload
   separately, it can set the preference accordingly so that the current
   code works with no changes. This is very useful for transports which
   have no plans of converting/pinning user pages.

There is a rather sticky issue with the TREAD/RERROR scenario in
non-9P2000.L protocols (legacy, 9P2000.u).

If the server has to fail the READ request, it can send an error string of
up to ERRMAX (256). As this is not a fixed size, it is hard to allocate a
fixed amount of buffer from the transport layer's perspective.

In 9P2000.L, the error is a fixed size (errno), hence not an issue.
On success the received packet will be PDU header + read size + payload.
On error it is PDU header + errno. Hence the non-payload size is constant (11)
irrespective of success or failure.

But this is not the case in non-9P2000.L protocols: the header size is
different in the failure (TREAD/RERROR) case. To take care of this, the patch
makes sure that the read buffer is big enough to accommodate an ERRMAX string.
It also means that there is a chance of scribbling on the payload/user
buffer in the error case for those non-POSIX-compliant protocols.
The added trans_mod->pref gives a transport the option of not participating
in zero copy.
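
The size arithmetic above can be checked with a few lines of C. The field
widths follow the 9P wire format (size[4] type[1] tag[2], then count[4] for
Rread, ecode[4] for Rlerror, and ename[s] plus errno[4] in 9P2000.u for
Rerror). The enum and function names are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Field widths in bytes, per the 9P wire format. */
enum {
	P9_HDR    = 4 + 1 + 2, /* size[4] type[1] tag[2] */
	P9_COUNT  = 4,         /* Rread: count[4] */
	P9_ERRNO  = 4,         /* Rlerror ecode[4], 9P2000.u errno[4] */
	P9_STRLEN = 2,         /* length prefix of a 9P string */
};

static_assert(P9_HDR == 7, "common 9P header is 7 bytes");

/* Non-payload bytes of a successful read reply. */
static size_t rread_overhead(void)
{
	return P9_HDR + P9_COUNT;
}

/* Total size of a 9P2000.L error reply. */
static size_t rlerror_size(void)
{
	return P9_HDR + P9_ERRNO;
}

/* Total size of a legacy / 9P2000.u error reply carrying an error string
 * of ename_len bytes. */
static size_t rerror_size(size_t ename_len, int is_dotu)
{
	return P9_HDR + P9_STRLEN + ename_len + (is_dotu ? P9_ERRNO : 0);
}
```

Both 9P2000.L cases come out at 11 bytes, while the legacy reply grows with
the error string, which is why the read buffer has to leave room for it.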

This series also created trans_common.[ch] to house common functions so 
that other transport layers can take advantage of them.

Testing/Performance:

Setup: HS21 blade, a two-socket quad-core Xeon with 4 GB memory, IO to the
local disk.

WRITE
dd if=/dev/zero of=/pmnt/file1 bs=4096 count=1MB (variable bs = IO SIZE)

IO SIZE (bytes)   TOTAL SIZE   No ZC         ZC
1                 1MB          22.4 kB/s     19.8 kB/s
32                32MB         711 kB/s      633 kB/s
64                64MB         1.4 MB/s      1.3 MB/s
128               128MB        2.8 MB/s      2.6 MB/s
256               256MB        5.6 MB/s      5.1 MB/s
512               512MB        10.4 MB/s     10.2 MB/s
1024              1GB          19.7 MB/s     20.4 MB/s
2048              2GB          40.1 MB/s     43.7 MB/s
4096              4GB          71.4 MB/s     73.1 MB/s

READ
dd of=/dev/null if=/pmnt/file1 bs=4096 count=1MB (variable bs = IO SIZE)

IO SIZE (bytes)   TOTAL SIZE   No ZC         ZC
1                 1MB          26.6 kB/s     23.1 kB/s
32                32MB         783 kB/s      734 kB/s
64                64MB         1.7 MB/s      1.5 MB/s
128               128MB        3.4 MB/s      3.0 MB/s
256               256MB        4.2 MB/s      5.9 MB/s
512               512MB        6.9 MB/s      11.6 MB/s
1024              1GB          23.3 MB/s     23.4 MB/s
2048              2GB          42.5 MB/s     45.4 MB/s
4096              4GB          67.4 MB/s     73.9 MB/s

ZC benefits are seen beyond a 1k buffer. Hence the patch makes sure that
zero copy is not enforced for smaller IO (< 1024).
My setup/box could be a bottleneck, as it gave similar numbers even on the
host, but I observed better numbers with zero copy on a bigger setup.
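
To show where the 1k cut-off comes from, the sketch below takes the numbers
from the two tables and finds the smallest IO size from which zero copy is at
least as fast as the copy path for every larger size measured. Each row keeps
the unit printed in its table row, which is enough for a per-row comparison.
struct row and zc_crossover() are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>

/* One table row; both rates are in the unit printed on that row. */
struct row {
	unsigned io_size; /* bytes */
	double no_zc;
	double zc;
};

static const struct row write_rows[] = {
	{ 1, 22.4, 19.8 }, { 32, 711, 633 }, { 64, 1.4, 1.3 },
	{ 128, 2.8, 2.6 }, { 256, 5.6, 5.1 }, { 512, 10.4, 10.2 },
	{ 1024, 19.7, 20.4 }, { 2048, 40.1, 43.7 }, { 4096, 71.4, 73.1 },
};

static const struct row read_rows[] = {
	{ 1, 26.6, 23.1 }, { 32, 783, 734 }, { 64, 1.7, 1.5 },
	{ 128, 3.4, 3.0 }, { 256, 4.2, 5.9 }, { 512, 6.9, 11.6 },
	{ 1024, 23.3, 23.4 }, { 2048, 42.5, 45.4 }, { 4096, 67.4, 73.9 },
};

/* Walk from the largest IO size down and stop at the first row where zero
 * copy loses. Returns the smallest winning size, or 0 if it never wins. */
static unsigned zc_crossover(const struct row *rows, size_t n)
{
	unsigned from = 0;

	assert(rows != NULL || n == 0);
	for (size_t i = n; i-- > 0; ) {
		if (rows[i].zc < rows[i].no_zc)
			break;
		from = rows[i].io_size;
	}
	return from;
}
```

For the write table the crossover is 1024 bytes and for the read table it is
256 bytes, so a single 1024-byte threshold is the conservative choice that
covers both directions.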


What is following this patch series (Future work)

1.
One of the major advantages of this patch series is that it allows a bigger
msize, to pull off bigger reads/writes from the server. Just increasing the
msize is not really a solution, as the majority of other transactions are
extremely small, which could result in wasted kernel heap.  To address this
problem we need to have two sizes of PDUs.

2.
Add flow-control capability to the transport layer.

3.
Add a mount option to disable the zero copy even if the user prefers to.

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>




Thread overview: 27+ messages
2011-02-07  7:21 [RFC-V2] [PATCH 0/7] Zero Copy Venkateswararao Jujjuri (JV)
2011-02-07  6:55 ` Venkateswararao Jujjuri (JV)
2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
2011-02-07  7:21 ` Venkateswararao Jujjuri (JV)
2011-02-07  6:56   ` [RFC] [PATCH 2/7] [net/9p] Adds supporting functions for " Venkateswararao Jujjuri (JV)
2011-02-08 15:20     ` [V9fs-developer] " Latchesar Ionkov
2011-02-08 17:21       ` Venkateswararao Jujjuri (JV)
2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate " Venkateswararao Jujjuri (JV)
2011-02-07  6:56   ` [RFC] [PATCH 3/7] [net/9p] Assign type of transaction to tc->pdu->id which is otherwise unsed Venkateswararao Jujjuri (JV)
2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
2011-02-07  6:56   ` [RFC] [PATCH 4/7] [net/9p] Add gup/zero_copy support to VirtIO transport layer Venkateswararao Jujjuri (JV)
2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
2011-02-07  6:57   ` [RFC] [PATCH 5/7] [net/9p] Add preferences to transport layer Venkateswararao Jujjuri (JV)
2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
2011-02-07  6:57   ` :[RFC] [PATCH 6/7] [net/9p] Read and Write side zerocopy changes for 9P2000.L protocol Venkateswararao Jujjuri (JV)
2011-02-08 21:09     ` [V9fs-developer] " Eric Van Hensbergen
2011-02-08 21:16       ` Eric Van Hensbergen
2011-02-09 21:09         ` Venkateswararao Jujjuri (JV)
2011-02-09 21:12           ` Venkateswararao Jujjuri (JV)
2011-02-09 21:18           ` Eric Van Hensbergen
2011-02-09 21:39             ` Venkateswararao Jujjuri (JV)
2011-02-08 23:50       ` Venkateswararao Jujjuri (JV)
2011-02-09  1:59         ` Venkateswararao Jujjuri (JV)
2011-02-09 14:28           ` Eric Van Hensbergen
2011-02-07  7:21 ` [RFC] [PATCH 1/7] [net/9p] Additional elements to p9_fcall to accomodate zero copy Venkateswararao Jujjuri (JV)
2011-02-07  6:58   ` [PATCH 7/7] [net/9p] Handle TREAD/RERROR case in !dotl case Venkateswararao Jujjuri (JV)
  -- strict thread matches above, loose matches on Subject: below --
2011-02-14  2:21 [net/9p] ZeroCopy patch series Venkateswararao Jujjuri (JV)
2011-02-14  2:21 ` [PATCH 7/7] [net/9p] Handle TREAD/RERROR case in !dotl case Venkateswararao Jujjuri (JV)
