* Packet mmap: TX RING and zero copy
@ 2008-09-02 18:27 Johann Baudy
2008-09-02 19:46 ` Evgeniy Polyakov
2008-09-05 10:28 ` Robert Iakobashvili
0 siblings, 2 replies; 39+ messages in thread
From: Johann Baudy @ 2008-09-02 18:27 UTC (permalink / raw)
To: netdev; +Cc: Ulisses Alonso Camaró
Hi All,
I'm currently working on an embedded project (based on the Linux kernel)
that needs high throughput from a gigabit Ethernet controller driven by a
"small" CPU.
I've run a lot of tests, playing with jumbo frames, raw sockets, etc., but
I've never exceeded ~25 Mbytes/s. So I decided to analyze the packet
socket transmission process in depth.
The main blocking point was the memcpy_fromiovec() function
located in packet_sendmsg() in af_packet.c.
It was consuming all my CPU resources copying data from user space to
the socket buffer.
Then I started working on a hack that makes this transfer possible
without any memcpys.
Mainly, the hack is the implementation of two "features":
* Sending packets through a circular buffer shared between user and
kernel space, which minimizes the number of system calls. (This feature
is already implemented for the capture process, e.g. libpcap.)
To sum up, the user process:
- initializes a raw socket,
- allocates N buffers in kernel space through a setsockopt() (TX ring),
- mmap()s the allocated memory,
- fills M buffers with custom data, and updates the status of the filled
buffers to ready (the buffer header, struct tpacket_hdr, contains a
status field: TP_STATUS_KERNEL means free, TP_STATUS_USER means ready
to be sent, TP_STATUS_COPY means transmission ongoing),
- calls send(). The kernel will then send all buffers
marked TP_STATUS_USER. The status is set to TP_STATUS_COPY during
transfer and back to TP_STATUS_KERNEL when done.
* Zero-copy mode. The CONFIG_PACKET_MMAP_ZERO_COPY feature flag
skips the CPU copy between the circular buffer and the socket buffer
allocated during send.
To send a packet without zero copy, if my understanding is
correct, we first allocate a socket buffer with sock_alloc_send_skb(),
then copy the data into the socket buffer, and finally hand
this sk_buff to the network card. With zero copy, the trick is to
bypass the data copy by substituting the data pointers of the allocated
sk_buff with pointers into our circular buffer.
This way the network device reads its data from our circular buffer
instead of the socket buffer.
And to prevent the kernel from crashing during skb data release
(shinfo + data release), we restore the whole previous content of the
sk_buff inside the destructor callback.
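To illustrate, here is a minimal user-space sketch of this TX ring flow
(error handling omitted; M, payload and payload_len are placeholders, and
the buffer geometry is arbitrary):

	struct tpacket_req req = { .tp_block_size = 8192, .tp_frame_size = 8192,
				   .tp_block_nr = 64, .tp_frame_nr = 64 };
	int fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &req, sizeof(req));
	/* ... bind(fd, ...) to the interface, then ask for the header size ... */
	int hdrlen = 0;
	socklen_t optlen = sizeof(hdrlen);
	getsockopt(fd, SOL_PACKET, PACKET_TX_RING_HEADER_SIZE, &hdrlen, &optlen);
	char *ring = mmap(NULL, req.tp_block_size * req.tp_block_nr,
			  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* fill M frames, mark them ready, then flush them all with one send() */
	for (unsigned int i = 0; i < M; i++) {
		struct tpacket_hdr *hdr = (void *)(ring + i * req.tp_frame_size);
		if (hdr->tp_status != TP_STATUS_KERNEL)
			continue;		/* frame still owned by the kernel */
		memcpy((char *)hdr + hdrlen, payload, payload_len);
		hdr->tp_len = payload_len;
		hdr->tp_status = TP_STATUS_USER;	/* ready to be sent */
	}
	send(fd, NULL, 0, 0);		/* the kernel sends every ready frame */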
I'm aware that this suggestion is really far from a real solution,
mainly due to this hard substitution.
But, I would like to get as much criticism as possible in order to
start a discussion with experts about a conceivable way to mix
zero-copy, sk_buff management and packet socket.
Which is perhaps impossible with current network kernel flow ...
PS: I've reached 85 Mbytes/s with TX RING and zero copy.
Thanks in advance for your advice,
Johann Baudy
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index db0cd51..0cfb835 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -4,16 +4,17 @@
This file documents the CONFIG_PACKET_MMAP option available with the PACKET
socket interface on 2.4 and 2.6 kernels. This type of sockets is used for
-capture network traffic with utilities like tcpdump or any other that uses
-the libpcap library.
+capture network traffic with utilities like tcpdump or any other that needs
+raw access to the network interface.
You can find the latest version of this document at
- http://pusa.uv.es/~ulisses/packet_mmap/
+ http://pusa.uv.es/~ulisses/packet_mmap/ (down ?)
Please send me your comments to
Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
+ Johann Baudy <johann.baudy@gnu-log.net> (TX RING - Zero Copy)
-------------------------------------------------------------------------------
+ Why use PACKET_MMAP
@@ -25,19 +26,25 @@ to capture each packet, it requires two if you want to get packet's
timestamp (like libpcap always does).
In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
-configurable circular buffer mapped in user space. This way reading packets just
-needs to wait for them, most of the time there is no need to issue a single
-system call. By using a shared buffer between the kernel and the user
-also has the benefit of minimizing packet copies.
-
-It's fine to use PACKET_MMAP to improve the performance of the capture process,
-but it isn't everything. At least, if you are capturing at high speeds (this
-is relative to the cpu speed), you should check if the device driver of your
-network interface card supports some sort of interrupt load mitigation or
-(even better) if it supports NAPI, also make sure it is enabled.
+configurable circular buffer mapped in user space that can be used to either
+send or receive packets. This way reading packets just needs to wait for them,
+most of the time there is no need to issue a single system call. For transmission,
+multiple packets can be sent in one system call and outgoing data buffers can be
+zero-copied to get the highest bandwidth (with PACKET_MMAP_ZERO_COPY).
+Using a shared buffer between the kernel and the user also has the benefit
+of minimizing packet copies.
+
+It's fine to use PACKET_MMAP to improve the performance of the capture and
+transmission process, but it isn't everything. At least, if you are capturing
+at high speeds (this is relative to the cpu speed), you should check if the
+device driver of your network interface card supports some sort of interrupt
+load mitigation or (even better) if it supports NAPI, also make sure it is
+enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
+supported by the devices of your network, especially if you are using DMA
+(cf. jumbo frames).
--------------------------------------------------------------------------------
-+ How to use CONFIG_PACKET_MMAP
++ How to use CONFIG_PACKET_MMAP to improve capture process
--------------------------------------------------------------------------------
From the user standpoint, you should use the higher level libpcap library, which
@@ -56,8 +63,9 @@ The rest of this document is intended for people who want to understand
the low level details or want to improve libpcap by including PACKET_MMAP
support.
+
--------------------------------------------------------------------------------
-+ How to use CONFIG_PACKET_MMAP directly
++ How to use CONFIG_PACKET_MMAP directly to improve capture process
--------------------------------------------------------------------------------
From the system calls stand point, the use of PACKET_MMAP involves
@@ -66,6 +74,7 @@ the following process:
[setup] socket() -------> creation of the capture socket
setsockopt() ---> allocation of the circular buffer (ring)
+ option: PACKET_RX_RING
mmap() ---------> mapping of the allocated buffer to the
user process
@@ -97,14 +106,95 @@ also the mapping of the circular buffer in the user process and
the use of this buffer.
--------------------------------------------------------------------------------
++ How to use CONFIG_PACKET_MMAP directly to improve transmission process
+--------------------------------------------------------------------------------
+The transmission process is similar to the capture process, as shown below.
+
+[setup] socket() -------> creation of the transmission socket
+ setsockopt() ---> allocation of the circular buffer (ring)
+ option: PACKET_TX_RING
+ bind() ---------> bind transmission socket with a
network interface
+ getsockopt() ---> get the circular buffer header size
+ option: PACKET_TX_RING_HEADER_SIZE
+ mmap() ---------> mapping of the allocated buffer to the
+ user process
+
+[transmission] poll() ---------> wait for free packets (optional)
+ send() ---------> send all packets that are set as ready in
+ the ring
+
+[shutdown] close() --------> destruction of the transmission socket and
+ deallocation of all associated resources.
+
+Binding the socket to your network interface is mandatory (with zero copy) to
+know the header size of frames used in the circular buffer.
+
+Each frame contains five parts:
+
+ -------------------
+| struct tpacket_hdr | Header. It contains the status
+| | of this frame
+|-------------------|
+| struct skbuff | (Zero copy only) Save of allocated socket buffer
+| | descriptor.
+|-------------------|
+| network interface | (Zero copy only) size = LL_RESERVED_SPACE(dev)
+| reserved space |
+|-------------------|
+| data buffer |
+. . Data that will be sent over the network interface.
+. .
+|-------------------|
+| network interface | (Zero copy only) size = LL_ALLOCATED_SPACE(dev)
+| reserved space | - LL_RESERVED_SPACE(dev)
+ -------------------
+
+ Network interface reserved spaces may differ between devices, which is why
+ the user must ask the kernel for the header size after the bind() call.
+
+ bind() associates the socket to your network interface thanks to
+ the sll_ifindex parameter of struct sockaddr_ll.
+
+ getsockopt(PACKET_TX_RING_HEADER_SIZE) returns an offset that must be
+ added to each frame pointer to get the start pointer of the data buffer.
+
+ int i_header_size;
+ struct sockaddr_ll my_addr;
+ struct ifreq s_ifr;
+ ...
+
+ strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
+
+ /* get interface index of eth0 */
+ ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
+
+ /* fill sockaddr_ll struct to prepare binding */
+ my_addr.sll_family = AF_PACKET;
+ my_addr.sll_protocol = htons(ETH_P_ALL);
+ my_addr.sll_ifindex = s_ifr.ifr_ifindex;
+
+ /* bind socket to eth0 */
+ bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
+
+ /* get header size */
+ getsockopt(this->socket, SOL_PACKET, PACKET_TX_RING_HEADER_SIZE,
+ (void*)&i_header_size,&opt_len);
+
+ A complete tutorial is available at: http://wiki.gnu-log.net/
+
+--------------------------------------------------------------------------------
+ PACKET_MMAP settings
--------------------------------------------------------------------------------
To setup PACKET_MMAP from user level code is done with a call like
+ - Capture process
setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
+ - Transmission process
+ setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
+
The most significant argument in the previous call is the req parameter,
this parameter must to have the following structure:
@@ -117,11 +207,11 @@ this parameter must to have the following structure:
};
This structure is defined in /usr/include/linux/if_packet.h and establishes a
-circular buffer (ring) of unswappable memory mapped in the capture process.
+circular buffer (ring) of unswappable memory.
Being mapped in the capture process allows reading the captured frames and
related meta-information like timestamps without requiring a system call.
-Captured frames are grouped in blocks. Each block is a physically contiguous
+Frames are grouped in blocks. Each block is a physically contiguous
region of memory and holds tp_block_size/tp_frame_size frames. The total number
of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because
@@ -336,13 +426,13 @@ struct tpacket_hdr). If this field is 0 means that the frame is ready
to be used for the kernel, If not, there is a frame the user can read
and the following flags apply:
- from include/linux/if_packet.h
+++ Capture process:
+from include/linux/if_packet.h
#define TP_STATUS_COPY 2
#define TP_STATUS_LOSING 4
#define TP_STATUS_CSUMNOTREADY 8
-
TP_STATUS_COPY : This flag indicates that the frame (and associated
meta information) has been truncated because it's
larger than tp_frame_size. This packet can be
@@ -388,8 +478,38 @@ packets are in the ring:
if (status == TP_STATUS_KERNEL)
retval = poll(&pfd, 1, timeout);
-It doesn't incur in a race condition to first check the status value and
-then poll for frames.
+
+++ Transmission process
+These defines are also used for transmission:
+
+ #define TP_STATUS_KERNEL 0 // Frame is available
+ #define TP_STATUS_USER 1 // Frame will be sent on next send()
+ #define TP_STATUS_COPY 2 // Frame is currently in transmission
+ #define TP_STATUS_LOSING 4 // Indicate a transmission error
+
+First, the kernel initializes all frames to TP_STATUS_KERNEL. To send a packet,
+the user fills a data buffer of an available frame, sets tp_len to the current
+data buffer size and sets its status field to TP_STATUS_USER. This can be done
+on multiple frames. Once the user is ready to transmit, it calls send().
+Then all buffers with status equal to TP_STATUS_USER are forwarded to the
+network device. The kernel marks each frame being sent with
+TP_STATUS_COPY until the end of the transfer (the end of the DMA transfer
+with zero copy, the end of the socket buffer copy otherwise).
+At the end, all statuses return to TP_STATUS_KERNEL.
+
+ header->tp_len = in_i_size;
+ header->tp_status = TP_STATUS_USER;
+ retval = send(this->socket, NULL, 0, 0);
+
+The user can also use poll() to check if a buffer is available:
+(status == TP_STATUS_KERNEL)
+
+ struct pollfd pfd;
+ pfd.fd = fd;
+ pfd.revents = 0;
+ pfd.events = POLLOUT;
+ retval = poll(&pfd, 1, timeout);
+
--------------------------------------------------------------------------------
+ THANKS
diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
index ad09609..a79cd89 100644
--- a/include/linux/if_packet.h
+++ b/include/linux/if_packet.h
@@ -43,6 +43,8 @@ struct sockaddr_ll
#define PACKET_COPY_THRESH 7
#define PACKET_AUXDATA 8
#define PACKET_ORIGDEV 9
+#define PACKET_TX_RING 10
+#define PACKET_TX_RING_HEADER_SIZE 11
struct tpacket_stats
{
@@ -79,6 +81,11 @@ struct tpacket_hdr
#define TPACKET_ALIGN(x) (((x)+TPACKET_ALIGNMENT-1)&~(TPACKET_ALIGNMENT-1))
#define TPACKET_HDRLEN (TPACKET_ALIGN(sizeof(struct tpacket_hdr)) + sizeof(struct sockaddr_ll))
+/* packet ring modes */
+#define TPACKET_MODE_NONE 0
+#define TPACKET_MODE_RX 1
+#define TPACKET_MODE_TX 2
+
/*
Frame structure:
diff --git a/net/packet/Kconfig b/net/packet/Kconfig
index 34ff93f..2c74568 100644
--- a/net/packet/Kconfig
+++ b/net/packet/Kconfig
@@ -16,7 +16,7 @@ config PACKET
If unsure, say Y.
config PACKET_MMAP
- bool "Packet socket: mmapped IO"
+ bool "mmapped IO"
depends on PACKET
help
If you say Y here, the Packet protocol driver will use an IO
@@ -24,3 +24,12 @@ config PACKET_MMAP
If unsure, say N.
+config PACKET_MMAP_ZERO_COPY
+ bool "zero-copy TX"
+ depends on PACKET_MMAP
+ help
+ If you say Y here, the Packet protocol driver will fill socket buffer
+ descriptors with TX ring buffer addresses. This mechanism results
+ in faster transmission.
+
+ If unsure, say N.
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 2cee87d..45367dc 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -158,7 +158,9 @@ struct packet_mreq_max
};
#ifdef CONFIG_PACKET_MMAP
-static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing);
+static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing, int mode);
+static int tpacket_snd(struct socket *sock,
+ struct msghdr *msg, size_t len);
#endif
static void packet_flush_mclist(struct sock *sk);
@@ -173,7 +175,9 @@ struct packet_sock {
unsigned int frames_per_block;
unsigned int frame_size;
unsigned int frame_max;
+ unsigned int header_size;
int copy_thresh;
+ int mode;
#endif
struct packet_type prot_hook;
spinlock_t bind_lock;
@@ -692,10 +696,209 @@ ring_is_full:
goto drop_n_restore;
}
+/*
+ * TX ring skb destructor.
+ * This function is called when skb is freed.
+ * */
+#ifdef CONFIG_PACKET_MMAP_ZERO_COPY
+void tpacket_skb_destructor (struct sk_buff *skb)
+{
+ struct tpacket_hdr *header = (struct tpacket_hdr*) skb->head;
+ struct sk_buff * skb_copy;
+
+ /* calculate old skb pointer */
+ skb_copy = ((void*) header + sizeof(struct tpacket_hdr));
+
+ /* restore previous skb header (before substitution) */
+ memcpy(skb, skb_copy, sizeof(struct sk_buff));
+
+ /* execute previous destructor */
+ if(skb->destructor)
+ skb->destructor(skb);
+
+ /* check status of buffer */
+ BUG_ON(header->tp_status != TP_STATUS_COPY);
+ header->tp_status = TP_STATUS_KERNEL;
+
+ return;
+}
#endif
+/*
+ * TX Ring packet send function
+ * */
+static int tpacket_snd(struct socket *sock,
+ struct msghdr *msg, size_t len)
+{
+ struct sock *sk = sock->sk;
+ struct sockaddr_ll *saddr=(struct sockaddr_ll *)msg->msg_name;
+ struct packet_sock *po = pkt_sk(sk);
+ struct net_device *dev;
+ int err, reserve=0, len_sum=0, ifindex, i;
+ struct sk_buff * skb, * skb_copy;
+ unsigned char *addr;
+ __be16 proto;
+
+ /*
+ * Get and verify the address.
+ */
+ if (saddr == NULL) {
+ ifindex = po->ifindex;
+ proto = po->num;
+ addr = NULL;
+ } else {
+ err = -EINVAL;
+ if (msg->msg_namelen < sizeof(struct sockaddr_ll))
+ goto out;
+ if (msg->msg_namelen < (saddr->sll_halen + offsetof(struct sockaddr_ll, sll_addr)))
+ goto out;
+ ifindex = saddr->sll_ifindex;
+ proto = saddr->sll_protocol;
+ addr = saddr->sll_addr;
+ }
+ /* get device by index */
+ dev = dev_get_by_index(sock_net(sk), ifindex);
+ err = -ENXIO;
+ if (dev == NULL)
+ goto out_put;
+ if (sock->type == SOCK_RAW)
+ reserve = dev->hard_header_len;
+
+ /* check if header size of device has changed since bind */
+ /* bind() call is mandatory as user must know where data must be written.
+ * it fills header_size setting of current socket
+ * and allows getsockopt(PACKET_TX_RING_HEADER_SIZE) call */
+ err = -EINVAL;
+#ifdef CONFIG_PACKET_MMAP_ZERO_COPY
+ if(po->header_size != LL_RESERVED_SPACE(dev) + sizeof(struct tpacket_hdr) + sizeof(struct sk_buff))
+#else
+ if(po->header_size != sizeof(struct tpacket_hdr))
+#endif
+ goto out_put;
+
+ /* check interface up */
+ err = -ENETDOWN;
+ if (!(dev->flags & IFF_UP))
+ goto out_put;
+
+ /* loop on all frames */
+ for (i = 0; i <= po->frame_max; i++) {
+ struct tpacket_hdr *header = packet_lookup_frame(po, i);
+ int size_max = po->frame_size - sizeof(struct skb_shared_info) - sizeof(struct tpacket_hdr) - LL_ALLOCATED_SPACE(dev);
+
+ if(header->tp_status == TP_STATUS_USER) {
+ /* mark header as tx ongoing */
+ header->tp_status = TP_STATUS_COPY;
+
+ /* check packet size */
+ err = -EMSGSIZE;
+ if (header->tp_len > dev->mtu+reserve)
+ goto out_put;
+ if(header->tp_len > size_max)
+ goto out_put;
+
+#ifdef CONFIG_PACKET_MMAP_ZERO_COPY
+ err = -ENOMEM;
+ /* allocate skb header */
+ skb = sock_alloc_send_skb(sk,
+ 0,
+ msg->msg_flags & MSG_DONTWAIT,
+ &err);
+ if (skb==NULL)
+ goto out_put;
+
+ err = -EINVAL;
+ if (sock->type == SOCK_DGRAM &&
+ dev_hard_header(skb, dev, ntohs(proto), addr, NULL, len) < 0)
+ goto out_free;
+
+ /* clone current skb */
+ skb_copy = ((void*) header + sizeof(struct tpacket_hdr));
+ memcpy(skb_copy, skb, sizeof(struct sk_buff));
+
+ /* substitute skb data with Tx ring pointers */
+ skb->head = (void*)header;
+ skb->data = (void*)skb->head;
+ skb->end = (void*)header + po->frame_size - sizeof(struct skb_shared_info);
+ skb->truesize = po->frame_size;
+ skb_reset_tail_pointer(skb);
+
+ /* make sure we've copied shinfo properly into ring buffer */
+ memcpy(skb_shinfo(skb), skb_shinfo(skb_copy), sizeof(struct skb_shared_info));
+
+ err = -ENOSPC;
+ /* check buffer size */
+ if(skb_tailroom(skb) < header->tp_len)
+ goto out_free;
+
+ /* put data into skb */
+ skb_reserve(skb, po->header_size);
+ skb_put(skb, header->tp_len);
+ skb_reset_network_header(skb);
+ skb_reset_transport_header(skb);
+
+ /* store destructor call back to update tpacket header status */
+ skb->destructor = tpacket_skb_destructor;
+#else
+ err = -ENOMEM;
+ /* allocate skb header */
+ skb = sock_alloc_send_skb(sk,
+ header->tp_len + LL_ALLOCATED_SPACE(dev),
+ msg->msg_flags & MSG_DONTWAIT,
+ &err);
+ if (skb==NULL)
+ goto out_put;
+
+ /* reserve device header */
+ skb_reserve(skb, LL_RESERVED_SPACE(dev));
+ skb_put(skb,header->tp_len);
+ skb_shinfo(skb)->frag_list=0;
+ skb_shinfo(skb)->nr_frags=0;
+
+ /* copy all data from TX ring buffer to skb */
+ err = skb_store_bits(skb, 0, (void*)header + po->header_size, header->tp_len);
+ if( err )
+ goto out_free;
+
+#endif
+
+ /* fill skb with proto, device and priority */
+ skb->protocol = proto;
+ skb->dev = dev;
+ skb->priority = sk->sk_priority;
+
+
+ /* now send it */
+ err = dev_queue_xmit(skb);
+ if (err > 0 && (err = net_xmit_errno(err)) != 0)
+ goto out_free;
+
+#ifndef CONFIG_PACKET_MMAP_ZERO_COPY
+ /* reset flag of buffer as data has been copied into skb */
+ header->tp_status = TP_STATUS_KERNEL;
+#endif
+ len_sum += skb->len;
+ }
+ }
+ dev_put(dev);
+
+ return(len_sum);
+
+out_free:
+ kfree_skb(skb);
+out_put:
+ if (dev)
+ dev_put(dev);
+out:
+ return err;
+}
+#endif
+
+/*
+ * Normal packet send function
+ * */
-static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
+static int packet_snd(struct socket *sock,
struct msghdr *msg, size_t len)
{
struct sock *sk = sock->sk;
@@ -705,14 +908,13 @@ static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
__be16 proto;
unsigned char *addr;
int ifindex, err, reserve = 0;
+ struct packet_sock *po = pkt_sk(sk);
/*
* Get and verify the address.
*/
if (saddr == NULL) {
- struct packet_sock *po = pkt_sk(sk);
-
ifindex = po->ifindex;
proto = po->num;
addr = NULL;
@@ -786,6 +988,23 @@ out:
return err;
}
+static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
+ struct msghdr *msg, size_t len)
+{
+ struct sock *sk = sock->sk;
+ struct packet_sock *po = pkt_sk(sk);
+ //printk("tpacket TX sendmsg\n");
+
+ /* check if tx ring mode enabled */
+#ifdef CONFIG_PACKET_MMAP
+ if (po->mode == TPACKET_MODE_TX)
+ return tpacket_snd(sock, msg, len);
+ else
+#endif
+ return packet_snd(sock, msg, len);
+
+}
+
/*
* Close a PACKET socket. This is fairly simple. We immediately go
* to 'closed' state and remove our protocol entry in the device list.
@@ -827,7 +1046,7 @@ static int packet_release(struct socket *sock)
if (po->pg_vec) {
struct tpacket_req req;
memset(&req, 0, sizeof(req));
- packet_set_ring(sk, &req, 1);
+ packet_set_ring(sk, &req, 1, TPACKET_MODE_NONE);
}
#endif
@@ -875,7 +1094,11 @@ static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 protoc
po->prot_hook.dev = dev;
po->ifindex = dev ? dev->ifindex : 0;
-
+#ifdef CONFIG_PACKET_MMAP_ZERO_COPY
+ po->header_size = dev ? (LL_RESERVED_SPACE(dev) + sizeof(struct tpacket_hdr) + sizeof(struct sk_buff)) : 0;
+#else
+ po->header_size = sizeof(struct tpacket_hdr);
+#endif
if (protocol == 0)
goto out_unlock;
@@ -1015,6 +1238,12 @@ static int packet_create(struct net *net, struct socket *sock, int protocol)
po->running = 1;
}
+#ifdef CONFIG_PACKET_MMAP
+ po->mode = TPACKET_MODE_NONE;
+ po->header_size = 0;
+#endif
+
+
write_lock_bh(&net->packet.sklist_lock);
sk_add_node(sk, &net->packet.sklist);
write_unlock_bh(&net->packet.sklist_lock);
@@ -1344,7 +1573,19 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
return -EINVAL;
if (copy_from_user(&req,optval,sizeof(req)))
return -EFAULT;
- return packet_set_ring(sk, &req, 0);
+ /* store packet mode */
+ return packet_set_ring(sk, &req, 0, TPACKET_MODE_RX);
+ }
+ case PACKET_TX_RING:
+ {
+ struct tpacket_req req;
+
+ if (optlen<sizeof(req))
+ return -EINVAL;
+ if (copy_from_user(&req,optval,sizeof(req)))
+ return -EFAULT;
+ /* store packet mode */
+ return packet_set_ring(sk, &req, 0, TPACKET_MODE_TX);
}
case PACKET_COPY_THRESH:
{
@@ -1408,6 +1649,17 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
return -EINVAL;
switch(optname) {
+#ifdef CONFIG_PACKET_MMAP
+ case PACKET_TX_RING_HEADER_SIZE:
+ if (len > sizeof(int))
+ len = sizeof(int);
+ val = po->header_size;
+ /* header_size should differ from 0 if the device has been bound */
+ if (unlikely(val == 0))
+ return -EACCES;
+ data = &val;
+ break;
+#endif
case PACKET_STATISTICS:
if (len > sizeof(struct tpacket_stats))
len = sizeof(struct tpacket_stats);
@@ -1562,7 +1814,10 @@ static unsigned int packet_poll(struct file * file, struct socket *sock,
struct sock *sk = sock->sk;
struct packet_sock *po = pkt_sk(sk);
unsigned int mask = datagram_poll(file, sock, wait);
+ int i;
+ /* RX RING - waiting for packet */
+ if(po->mode == TPACKET_MODE_RX) {
spin_lock_bh(&sk->sk_receive_queue.lock);
if (po->pg_vec) {
unsigned last = po->head ? po->head-1 : po->frame_max;
@@ -1574,6 +1829,21 @@ static unsigned int packet_poll(struct file * file, struct socket *sock,
mask |= POLLIN | POLLRDNORM;
}
spin_unlock_bh(&sk->sk_receive_queue.lock);
+ }
+ /* TX RING - waiting for free buffer */
+ else if(po->mode == TPACKET_MODE_TX) {
+ if(mask & POLLOUT) {
+ mask &= ~POLLOUT;
+ for (i = 0; i < po->frame_max; i++) {
+ struct tpacket_hdr *header = packet_lookup_frame(po, i);
+ if(header->tp_status == TP_STATUS_KERNEL)
+ {
+ mask |= POLLOUT;
+ break;
+ }
+ }
+ }
+ }
return mask;
}
@@ -1649,7 +1919,7 @@ out_free_pgvec:
goto out;
}
-static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing)
+static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing, int mode)
{
char **pg_vec = NULL;
struct packet_sock *po = pkt_sk(sk);
@@ -1657,6 +1927,9 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing
__be16 num;
int err = 0;
+ /* saving ring mode */
+ po->mode = mode;
+
if (req->tp_block_nr) {
int i;
@@ -1736,7 +2009,7 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing
req->tp_block_nr = XC(po->pg_vec_len, req->tp_block_nr);
po->pg_vec_pages = req->tp_block_size/PAGE_SIZE;
- po->prot_hook.func = po->pg_vec ? tpacket_rcv : packet_rcv;
+ po->prot_hook.func = (po->pg_vec && (po->mode == TPACKET_MODE_RX)) ? tpacket_rcv : packet_rcv;
skb_queue_purge(&sk->sk_receive_queue);
#undef XC
if (atomic_read(&po->mapped))
* Re: Packet mmap: TX RING and zero copy
2008-09-02 18:27 Packet mmap: TX RING and zero copy Johann Baudy
@ 2008-09-02 19:46 ` Evgeniy Polyakov
2008-09-03 7:56 ` Johann Baudy
2008-09-05 10:28 ` Robert Iakobashvili
1 sibling, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-02 19:46 UTC (permalink / raw)
To: Johann Baudy; +Cc: netdev, Ulisses Alonso Camaró
Hi Johann.
On Tue, Sep 02, 2008 at 08:27:36PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> I've run a lot of tests, playing with jumbo frames, raw sockets, etc., but
> I've never exceeded ~25 Mbytes/s. So I decided to analyze the packet
> socket transmission process in depth.
>
> The main blocking point was the memcpy_fromiovec() function
> located in packet_sendmsg() in af_packet.c.
Can you saturate the link with a usual tcp/udp socket?
> But, I would like to get as much criticism as possible in order to
> start a discussion with experts about a conceivable way to mix
> zero-copy, sk_buff management and packet socket.
> Which is perhaps impossible with current network kernel flow ...
Did you try vmsplice and splice?
It is the preferred way to do a zero-copy.
--
Evgeniy Polyakov
* Re: Packet mmap: TX RING and zero copy
2008-09-02 19:46 ` Evgeniy Polyakov
@ 2008-09-03 7:56 ` Johann Baudy
2008-09-03 10:38 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-03 7:56 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev
Hi Evgeniy,
>> I've run a lot of tests, playing with jumbo frames, raw sockets, etc., but
>> I've never exceeded ~25 Mbytes/s. So I decided to analyze the packet
>> socket transmission process in depth.
>>
>> The main blocking point was the memcpy_fromiovec() function
>> located in packet_sendmsg() in af_packet.c.
>
> Can you saturate the link with a usual tcp/udp socket?
No, only ~15-20 Mbytes/s with a standard tcp/udp socket.
>
>> But, I would like to get as much criticism as possible in order to
>> start a discussion with experts about a conceivable way to mix
>> zero-copy, sk_buff management and packet socket.
>> Which is perhaps impossible with current network kernel flow ...
>
> Did you try vmsplice and splice?
> It is the preferred way to do a zero-copy.
Not yet, I will perform some tests using splice and let you know the results.
Many thanks,
Johann
--
Johann Baudy
johaahn@gmail.com
* Re: Packet mmap: TX RING and zero copy
2008-09-03 7:56 ` Johann Baudy
@ 2008-09-03 10:38 ` Johann Baudy
2008-09-03 11:06 ` David Miller
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-03 10:38 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev
Hi Evgeniy,
I'm not able to exceed 15 Mbytes/s even with the vmsplice/splice duo,
due to some issues:
- I didn't manage to adjust the size of packets sent over the network (it
seems to be aligned with the page size). And the maximum packet size seems
to be the page size (4096).
- I need approximately two system calls (vmsplice and splice) per
~4096*8 bytes at most, which is maybe a limit of the pipe.
- I'm still going through packet_sendmsg() (packet socket), which
allocates an sk_buff and copies all the data into it.
For reference, with my "patch": to reach 85 Mbytes/s, I need to send more
than 32 packets of 7200 bytes (the PC network card's limit) in one system
call (send()) and without any sk_buff data copy.
Please find below my test program for vmsplice/splice:
Best regards,
Johann
#define _GNU_SOURCE	/* must precede the includes for vmsplice()/splice() */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
int main (void)
{
struct tpacket_req s_packet_req;
uint32_t size, opt_len;
int fd, i, ec, i_sz_packet = 7150;
struct pollfd s_pfd;
struct sockaddr_ll my_addr, peer_addr;
struct ifreq s_ifr; /* points to one interface returned from ioctl */
int len;
int fd_socket;
int i_nb_buffer = 64;
int i_buffer_size = 8192;
int i_index;
int i_updated_cnt;
int i_ifindex;
int i_header_size;
struct tpacket_hdr * ps_header_start;
struct tpacket_hdr * ps_header;
char buffer[8000];
/* reset index */
i_index = 0;
fd_socket = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if(fd_socket == -1)
{
perror("socket");
return EXIT_FAILURE;
}
/* start socket config: device and mtu */
/* clear structure */
memset(&my_addr, 0, sizeof(struct sockaddr_ll));
my_addr.sll_family = PF_PACKET;
my_addr.sll_protocol = htons(ETH_P_ALL);
/* initialize interface struct */
strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
/* get the interface index */
ec = ioctl(fd_socket, SIOCGIFINDEX, &s_ifr);
if(ec == -1)
{
perror("iotcl");
return EXIT_FAILURE;
}
/* update with interface index */
i_ifindex = s_ifr.ifr_ifindex;
/* new mtu value */
s_ifr.ifr_mtu = 7200;
/* update the mtu through ioctl */
ec = ioctl(fd_socket, SIOCSIFMTU, &s_ifr);
if(ec == -1)
{
perror("iotcl");
return EXIT_FAILURE;
}
/* set sockaddr info */
memset(&my_addr, 0, sizeof(struct sockaddr_ll));
my_addr.sll_family = AF_PACKET;
my_addr.sll_protocol = htons(ETH_P_ALL);
my_addr.sll_ifindex = i_ifindex;
/* bind port */
if (bind(fd_socket, (struct sockaddr *)&my_addr, sizeof(struct
sockaddr_ll)) == -1)
{
perror("bind");
return EXIT_FAILURE;
}
/* prepare Tx ring request */
s_packet_req.tp_block_size = i_buffer_size;
s_packet_req.tp_frame_size = i_buffer_size;
s_packet_req.tp_block_nr = i_nb_buffer;
s_packet_req.tp_frame_nr = i_nb_buffer;
/* calculate memory to mmap in the kernel */
size = s_packet_req.tp_block_size * s_packet_req.tp_block_nr;
{
/* Splice flags (not laid down in stone yet). */
#ifndef SPLICE_F_MOVE
#define SPLICE_F_MOVE 0x01
#endif
#ifndef SPLICE_F_NONBLOCK
#define SPLICE_F_NONBLOCK 0x02
#endif
#ifndef SPLICE_F_MORE
#define SPLICE_F_MORE 0x04
#endif
#ifndef SPLICE_F_GIFT
#define SPLICE_F_GIFT 0x08
#endif
#ifndef __NR_splice
#define __NR_splice 313
#endif
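/* strategy: vmsplice() the user buffer into a pipe, then splice() the
 * pipe contents into the packet socket */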
int filedes [2];
int ret;
int to_write;
struct iovec iov;
iov.iov_base = &buffer;
iov.iov_len = 4096;
ret = pipe (filedes);
printf("fd = %d %d %d %p\n", fd, filedes[0], filedes[1], iov.iov_base);
for(i=0; i< sizeof buffer; i++)
{
buffer[i] = (char) i;
}
for(i=0; i< 500000; i++)
{
to_write = 0;
while (to_write < iov.iov_len*7) {
ret = vmsplice (filedes [1],&iov, 1, SPLICE_F_MOVE | SPLICE_F_MORE);
if (ret < 0)
{
perror("splice");
return EXIT_FAILURE;
}
else
to_write += ret;
}
while (to_write > 0) {
ret = splice (filedes [0], NULL, fd_socket,
NULL, to_write,
SPLICE_F_MOVE | SPLICE_F_MORE);
if (ret < 0)
{
perror("write splice");
return EXIT_FAILURE;
}
else
to_write -= ret;
}
}
}
return EXIT_SUCCESS;
}
--
Johann Baudy
johaahn@gmail.com
* Re: Packet mmap: TX RING and zero copy
2008-09-03 10:38 ` Johann Baudy
@ 2008-09-03 11:06 ` David Miller
2008-09-03 13:05 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: David Miller @ 2008-09-03 11:06 UTC (permalink / raw)
To: johaahn; +Cc: johnpol, netdev
From: "Johann Baudy" <johaahn@gmail.com>
Date: Wed, 3 Sep 2008 12:38:53 +0200
> I'm not able to exceed 15 Mbytes/s even with the vmsplice/splice duo.
I think you misunderstood what Evgeniy was asking of you.
He was asking how fast you can transfer data over this
interface using a normal TCP socket to a remote host,
via sendfile() or splice().
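For illustration, such a test could be as simple as the sketch below (the
host, port and file name are assumptions, error handling omitted; it needs
<sys/sendfile.h> in addition to the usual socket headers):

	int sk = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in dst = { .sin_family = AF_INET,
				   .sin_port = htons(5001) };	/* assumed port */
	inet_pton(AF_INET, "192.168.0.1", &dst.sin_addr);	/* assumed host */
	connect(sk, (struct sockaddr *)&dst, sizeof(dst));

	int fd = open("/testfile", O_RDONLY);			/* assumed file */
	struct stat st;
	fstat(fd, &st);
	for (;;) {			/* transfer the file in a loop */
		off_t off = 0;
		while (off < st.st_size)
			if (sendfile(sk, fd, &off, st.st_size - off) < 0)
				break;	/* check errno */
	}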
* Re: Packet mmap: TX RING and zero copy
2008-09-03 11:06 ` David Miller
@ 2008-09-03 13:05 ` Johann Baudy
2008-09-03 13:27 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-03 13:05 UTC (permalink / raw)
To: David Miller, Evgeniy Polyakov; +Cc: netdev
Sorry for the misunderstanding,
TCP socket, transferring a 20 Mbytes file (located in initramfs) in a loop
with sendfile(): 5.7 Mbytes/s
Best regards,
Johann
--
Johann Baudy
johaahn@gmail.com
* Re: Packet mmap: TX RING and zero copy
2008-09-03 13:05 ` Johann Baudy
@ 2008-09-03 13:27 ` Evgeniy Polyakov
2008-09-03 14:57 ` Christoph Lameter
2008-09-03 15:00 ` Johann Baudy
0 siblings, 2 replies; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-03 13:27 UTC (permalink / raw)
To: Johann Baudy; +Cc: David Miller, netdev
Hi Johann.
On Wed, Sep 03, 2008 at 03:05:07PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> TCP socket, transferring a 20 Mbytes file (located in initramfs) in a loop
> with sendfile(): 5.7 Mbytes/s
And _THIS_ is a serious problem. Let's assume that sendfile is broken or
the driver/hardware does not support scatter/gather and checksumming (does it?).
Can you saturate the link with pktgen (1) and a usual TCP socket (2)?
Assuming the second case fails, is it also broken because of the very
poor performance of the copy from userspace?
--
Evgeniy Polyakov
* Re: Packet mmap: TX RING and zero copy
2008-09-03 13:27 ` Evgeniy Polyakov
@ 2008-09-03 14:57 ` Christoph Lameter
2008-09-03 15:00 ` Johann Baudy
1 sibling, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2008-09-03 14:57 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Johann Baudy, David Miller, netdev
Evgeniy Polyakov wrote:
> Hi Johann.
>
> On Wed, Sep 03, 2008 at 03:05:07PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
>> TCP socket, transferring a 20 Mbytes file (located in initramfs) in a loop
>> with sendfile(): 5.7 Mbytes/s
>
> And _THIS_ is a serious problem. Let's assume that sendfile is broken or
> the driver/hardware does not support scatter/gather and checksumming (does it?).
> Can you saturate the link with pktgen (1) and a usual TCP socket (2)?
> Assuming the second case fails, is it also broken because of the very
> poor performance of the copy from userspace?
Could we see the code that was used to get these numbers? The problem may just
be in the way that the calls to sendfile() have been coded.
The TX code looks intriguing. Seems that some vendors are tinkering with VNIC
ideas in order to bypass context switches and data copies. Maybe this is a
cheap way to attain the same goals?
* Re: Packet mmap: TX RING and zero copy
2008-09-03 13:27 ` Evgeniy Polyakov
2008-09-03 14:57 ` Christoph Lameter
@ 2008-09-03 15:00 ` Johann Baudy
2008-09-03 15:13 ` Evgeniy Polyakov
1 sibling, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-03 15:00 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: David Miller, netdev
Hi Evgeniy,
The driver and the hardware support DMA scatter/gather and checksum offloading.
With pktgen and the config below, I reached 85 Mbytes/s, ~link
saturation (I've reached the same bitrate with a raw socket + the TX RING
zero-copy patch):
#!/bin/sh
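# pktgen configuration used for the 85 Mbytes/s measurement
# (eth0, 7200-byte frames, broadcast destination)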
echo rem_device_all > /proc/net/pktgen/kpktgend_0
echo add_device eth0 > /proc/net/pktgen/kpktgend_0
echo max_before_softirq 10000 > /proc/net/pktgen/kpktgend_0
sleep 1
echo count 10000000 > /proc/net/pktgen/eth0
echo clone_skb 0 > /proc/net/pktgen/eth0
echo pkt_size 7200 > /proc/net/pktgen/eth0
echo delay 0 > /proc/net/pktgen/eth0
echo dst 192.168.0.1 > /proc/net/pktgen/eth0
echo dst_mac ff:ff:ff:ff:ff:ff > /proc/net/pktgen/eth0
echo start > /proc/net/pktgen/pgctrl
I can't saturate the link from user space with either UDP, TCP or RAW
socket due to copies and multiple system calls.
If the system is just doing one copy of the packet, it falls under
25 Mbytes/s. This is a simple memory bus running at only 100 MHz
for both data and instructions.
I think I understand well why my bitrate is so bad from userspace
using a normal TCP, UDP or RAW socket.
That's why I'm working on this zero-copy solution (no copy
between user and kernel space or between the kernel buffer and the socket
buffer, and a minimum of system calls).
A kind of full zero-copy sending capability, where the HW accesses the same
buffers as the user.
In fact, I'm just suggesting the symmetric counterpart of the packet mmap
I/O used for the capture process, with zero-copy capability, and I'd like
to know what you think about it.
Thanks in advance,
Johann
--
Johann Baudy
johaahn@gmail.com
* Re: Packet mmap: TX RING and zero copy
2008-09-03 15:00 ` Johann Baudy
@ 2008-09-03 15:13 ` Evgeniy Polyakov
2008-09-03 15:58 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-03 15:13 UTC (permalink / raw)
To: Johann Baudy; +Cc: David Miller, netdev
Hi Johann.
On Wed, Sep 03, 2008 at 05:00:47PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> The driver and the hardware support DMA scatter/gather and checksum offloading.
>
> With pktgen and the config below, I reached 85 Mbytes/s, ~link
> saturation (I've reached the same bitrate with a raw socket + the TX RING
> zero-copy patch):
> I can't saturate the link from user space with either UDP, TCP or RAW
> socket due to copies and multiple system calls.
>
> If the system is just doing one copy of the packet, it falls under
> 25 Mbytes/s. This is a simple memory bus running at only 100 MHz
> for both data and instructions.
What is the bus width and is there burst mode support?
Not to point to the error in the speed calculation, just out of curiosity :)
Always liked such tiny systems...
> I think I understand well why my bitrate is so bad from userspace
> using a normal TCP, UDP or RAW socket.
> That's why I'm working on this zero-copy solution (no copy
> between user and kernel space or between the kernel buffer and the socket
> buffer, and a minimum of system calls).
> A kind of full zero-copy sending capability, where the HW accesses the same
> buffers as the user.
But why doesn't sendfile/splice work the same?
It is (supposed to be) a zero-copy sending interface, which should be even
more optimal than your ring buffer approach, since it uses just a single
syscall and no initialization of the data (well, there is page
population and so on, but if the file is in the ramdisk, it is effectively
zero overhead). Can you run oprofile during a sendfile() data transfer or
describe the behaviour (i.e. CPU usage and tcpdump)?
> In fact, I'm just suggesting the symmetric counterpart of the packet mmap
> I/O used for the capture process, with zero-copy capability, and I'd like
> to know what you think about it.
Well, I'm not against this patch, but you pointed to a bug (either in
sendfile or in your code's initialization), which has higher priority
imho :)
Actually, if it is indeed a bug in the splice code then (once fixed) it
could allow a simpler zero-copy solution for your problem.
--
Evgeniy Polyakov
* Re: Packet mmap: TX RING and zero copy
2008-09-03 15:13 ` Evgeniy Polyakov
@ 2008-09-03 15:58 ` Johann Baudy
2008-09-03 16:43 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-03 15:58 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: David Miller, netdev
Hi Evgeniy,
> What is the bus width and is there burst mode support?
> Not to point to the error in the speed calculation, just out of curiosity :)
> Always liked such tiny systems...
32 bits with burst support. This is a PPC 405 embedded into Xilinx V4
FPGA . (PLB bus)
>
> But why doesn't sendfile/splice work the same?
> It is (supposed to be) a zero-copy sending interface, which should be even
> more optimal than your ring buffer approach, since it uses just a single
> syscall and no initialization of the data (well, there is page
> population and so on, but if the file is in the ramdisk, it is effectively
> zero overhead). Can you run oprofile during a sendfile() data transfer or
> describe the behaviour (i.e. CPU usage and tcpdump)?
I've never used oprofile before. I will get more logs and let you know.
Just a question: I don't want to use TCP for the final application.
Is it expected that the kernel executes packet_sendmsg() when using
a packet socket with splice()? (because this function does a memcpy
from a buffer to a socket buffer)
Or is there a dedicated path for splicing? Or maybe only in the TCP read
path (I can see that the splice_read operator is redefined as
tcp_splice_read())?
And I've also faced some issues with the size of packets (it seems to
be limited to the page size). It is really important for me to send large
packets. I've just decreased the packet size in the pktgen script from 7200
to 4096 and the bitrate has fallen from 85 Mbytes/s to 50 Mbytes/s.
I understand that this is not a problem with TCP when sending a file;
we don't really care about the accuracy of the packet size.
Do you know if there is a way to adjust the size?
And again, many thanks for your fast replies ;)
Johann Baudy
--
Johann Baudy
johaahn@gmail.com
* Re: Packet mmap: TX RING and zero copy
2008-09-03 15:58 ` Johann Baudy
@ 2008-09-03 16:43 ` Evgeniy Polyakov
2008-09-03 20:30 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-03 16:43 UTC (permalink / raw)
To: Johann Baudy; +Cc: David Miller, netdev
Hi Johann.
On Wed, Sep 03, 2008 at 05:58:50PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> > What is the bus width and is there burst mode support?
> > Not to point to the error in the speed calculation, just out of curiosity :)
> > Always liked such tiny systems...
>
> 32 bits with burst support. This is a PPC 405 embedded into Xilinx V4
> FPGA . (PLB bus)
So small PLB? Not OPB? Weird hardware :)
But nevertheless at most 400 MB/s at 100 MHz, so it looks like either
there is no burst mode or weird NIC hardware (or something else :)
I used to easily saturate a 100mbit channel with the 405gp(r) and emac driver,
which are better numbers than what you have with gige and sockets...
Actually even the 405gp had a much wider plb, so this could be an issue.
Likely your project will just dma data from some sensor to the
preallocated buffer, you will add headers and send the data, so the very
small memory bus speed will not allow using sockets and thus TCP.
Having a splice-friendly setup is possible, but I think the raw socket
approach is simpler for you.
> > But why doesn't sendfile/splice work the same?
> > It is (supposed to be) a zero-copy sending interface, which should be even
> > more optimal than your ring buffer approach, since it uses just a single
> > syscall and no initialization of the data (well, there is page
> > population and so on, but if the file is in the ramdisk, it is effectively
> > zero overhead). Can you run oprofile during a sendfile() data transfer or
> > describe the behaviour (i.e. CPU usage and tcpdump)?
>
> I've never used oprofile before. I will get more logs and let you know.
> Just a question: I don't want to use TCP for the final application.
> Is it expected that the kernel executes packet_sendmsg() when using
> a packet socket with splice()? (because this function does a memcpy
> from a buffer to a socket buffer)
No, it will use sendpage() if hardware and driver support scatter/gather
and checksum offloading. Since you say they do, then there should be no
copies at all.
> Or is there a dedicated path for splicing? Or maybe only in the TCP read
> path (I can see that the splice_read operator is redefined as
> tcp_splice_read())?
It will end up with generic_splice_sendpage() and pipe_to_sendpage().
> And I've also faced some issues with the size of packets (it seems to
> be limited to the page size). It is really important for me to send large
> packets. I've just decreased the packet size in the pktgen script from 7200
> to 4096 and the bitrate has fallen from 85 Mbytes/s to 50 Mbytes/s.
> I understand that this is not a problem with TCP when sending a file;
> we don't really care about the accuracy of the packet size.
> Do you know if there is a way to adjust the size?
What do you mean by packet size? MTU/MSS? In pktgen it means the size of the
allocated skb, so it will eventually be split into smaller chunks, and the
bigger the size, the fewer allocations will be performed. Actually
the fact that 7200 works at all is a bit surprising: your small
machine has lots of ram and it is effectively unused during tests (i.e. no
other allocations). Changing it to 4k should not decrease performance at
all... Do you have jumbo frames enabled?
--
Evgeniy Polyakov
* Re: Packet mmap: TX RING and zero copy
2008-09-03 16:43 ` Evgeniy Polyakov
@ 2008-09-03 20:30 ` Johann Baudy
2008-09-03 22:03 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-03 20:30 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: David Miller, netdev
Hi Evgeniy,
> So small PLB? Not OPB? Weird hardware :)
> But nevertheless at most 400 MB/s at 100 MHz, so it looks like either
> there is no burst mode or weird NIC hardware (or something else :)
> I used to easily saturate a 100mbit channel with the 405gp(r) and emac driver,
> which are better numbers than what you have with gige and sockets...
> Actually even the 405gp had a much wider plb, so this could be an issue.
>
> Likely your project will just dma data from some sensor to the
> preallocated buffer, you will add headers and send the data, so the very
> small memory bus speed will not allow using sockets and thus TCP.
> Having a splice-friendly setup is possible, but I think the raw socket
> approach is simpler for you.
Yes, this is custom hardware (FPGA :)). There is no IPLB/DPLB combo,
only one small PLB bus at 100 MHz.
> No, it will use sendpage() if hardware and driver support scatter/gather
> and checksum offloading. Since you say they do, then there should be no
> copies at all.
>
> It will end up with generic_splice_sendpage() and pipe_to_sendpage().
>
Indeed, I've double checked, but pipe_to_sendpage() will end up with
packet_sendmsg()
.splice_write = generic_splice_sendpage,
generic_splice_sendpage()
splice_from_pipe();
pipe_to_sendpage() from err = actor(pipe, buf, sd);
sock_sendpage() from file->f_op->sendpage()
sock_no_sendpage() from sock->ops->sendpage()
kernel_sendmsg()
sock_sendmsg();
packet_sendmsg() from sock->ops->sendmsg();
memcpy() :'(
I think a non-generic splice_write function should do the job.
What do you think?
>
> What do you mean by packet size? MTU/MSS? In pktgen it means the size of the
> allocated skb, so it will eventually be split into smaller chunks, and the
> bigger the size, the fewer allocations will be performed. Actually
> the fact that 7200 works at all is a bit surprising: your small
> machine has lots of ram and it is effectively unused during tests (i.e. no
> other allocations). Changing it to 4k should not decrease performance at
> all... Do you have jumbo frames enabled?
>
I mean the transfer unit size (ethernet frame length), which must be <= MTU.
Jumbo frames are enabled in the driver and the MTU is set to 7200.
I'm currently using wireshark on a remote PC to check the bitrate and format.
I think performance can decrease because the CPU will spend the same time
to send 7200 or 4096 bytes, but the DMA will not (~50µs for 7200, ~30µs for
4096).
Thanks,
Johann
--
Johann Baudy
johaahn@gmail.com
* Re: Packet mmap: TX RING and zero copy
2008-09-03 20:30 ` Johann Baudy
@ 2008-09-03 22:03 ` Evgeniy Polyakov
2008-09-04 14:44 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-03 22:03 UTC (permalink / raw)
To: Johann Baudy; +Cc: David Miller, netdev
On Wed, Sep 03, 2008 at 10:30:14PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> > It will end up with generic_splice_sendpage() and pipe_to_sendpage().
> >
> Indeed, I've double checked, but pipe_to_sendpage() will end up with
> packet_sendmsg()
>
> .splice_write = generic_splice_sendpage,
> generic_splice_sendpage()
> splice_from_pipe();
> pipe_to_sendpage() from err = actor(pipe, buf, sd);
> sock_sendpage() from file->f_op->sendpage()
> sock_no_sendpage() from sock->ops->sendpage()
> kernel_sendmsg()
> sock_sendmsg();
> packet_sendmsg() from sock->ops->sendmsg();
> memcpy() :'(
>
> I think a non-generic splice_write function should do the job.
> What do you think?
Looks like you are trying to sendfile() over a packet socket.
Both tcp and udp sockets have a sendpage method.
Or your hardware or driver does not support the needed functionality, so
tcp_sendpage() falls back to sock_no_sendpage(). From your dump I think
it is the first case above. Well, after I read it again, I found the word
packet_sendmsg(), which explains everything. Please use a tcp or udp
socket for the splice/sendfile test.
> I mean the transfer unit size (ethernet frame length), which must be <= MTU.
> Jumbo frames are enabled in the driver and the MTU is set to 7200.
> I'm currently using wireshark on a remote PC to check the bitrate and format.
> I think performance can decrease because the CPU will spend the same time
> to send 7200 or 4096 bytes, but the DMA will not (~50µs for 7200, ~30µs for
> 4096).
If you use jumbo frames, then yes: the bigger the allocation unit
(assuming the allocation succeeds), the higher the speed, so this result
is expected.
--
Evgeniy Polyakov
* Re: Packet mmap: TX RING and zero copy
2008-09-03 22:03 ` Evgeniy Polyakov
@ 2008-09-04 14:44 ` Johann Baudy
2008-09-05 7:17 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-04 14:44 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: David Miller, netdev
Hi Evgeniy,
> Looks like you are trying to sendfile() over a packet socket.
> Both tcp and udp sockets have a sendpage method.
>
> Or your hardware or driver does not support the needed functionality, so
> tcp_sendpage() falls back to sock_no_sendpage(). From your dump I think
> it is the first case above. Well, after I read it again, I found the word
> packet_sendmsg(), which explains everything. Please use a tcp or udp
> socket for the splice/sendfile test.
>
I'm finally able to run a full zero-copy mechanism with a UDP socket, as you said.
Unfortunately, I need at least one vmsplice() system call per UDP
packet.
A mere vmsplice() (mem to pipe) costs a lot (80µs of CPU), and the splice()
(pipe to socket) call is worse...
80µs is approximately the duration of 12 Kbytes sent at 1 Gbps, and I
need to send packets of 7200 bytes (with no frag)...
I can't use this mechanism, unfortunately. I've only reached 20 Mbytes/s.
You can find below an FTRACE of vmsplice(), in case you spot something
abnormal ... :) :
(the 80µs result is an average of the vmsplice() duration measured with
gettimeofday(): WITHOUT FTRACE IN THE KERNEL CONFIG)
main-849 [00] .. 1 4154502892.139088: sys_gettimeofday <-ret_from_syscall
main-849 [00] .. 1 4154502892.139090: do_gettimeofday <-sys_gettimeofday
main-849 [00] .. 1 4154502892.139092: getnstimeofday <-do_gettimeofday
main-849 [00] .. 1 4154502892.139100: sys_vmsplice <-ret_from_syscall
main-849 [00] .. 1 4154502892.139107: fget_light <-sys_vmsplice
main-849 [00] .. 1 4154502892.139118: rt_down_read <-sys_vmsplice
main-849 [00] .. 1 4154502892.139120: __rt_down_read <-rt_down_read
main-849 [00] .. 1 4154502892.139124: rt_mutex_down_read <-__rt_down_read
main-849 [00] .. 1 4154502892.139132: pagefault_disable <-sys_vmsplice
main-849 [00] .. 1 4154502892.139136: pagefault_enable <-sys_vmsplice
main-849 [00] .. 1 4154502892.139141: get_user_pages <-sys_vmsplice
main-849 [00] .. 1 4154502892.139147: find_extend_vma <-get_user_pages
main-849 [00] .. 1 4154502892.139150: find_vma <-find_extend_vma
main-849 [00] .. 1 4154502892.139158: _cond_resched <-get_user_pages
main-849 [00] .. 1 4154502892.139161: follow_page <-get_user_pages
main-849 [00] .. 1 4154502892.139165: rt_spin_lock <-follow_page
main-849 [00] .. 1 4154502892.139167: __rt_spin_lock <-rt_spin_lock
main-849 [00] .. 1 4154502892.139171: vm_normal_page <-follow_page
main-849 [00] .. 1 4154502892.139176: mark_page_accessed <-follow_page
main-849 [00] .. 1 4154502892.139180: rt_spin_unlock <-follow_page
main-849 [00] .. 1 4154502892.139185: flush_dcache_page <-get_user_pages
main-849 [00] .. 1 4154502892.139192: rt_up_read <-sys_vmsplice
main-849 [00] .. 1 4154502892.139194: rt_mutex_up_read <-rt_up_read
main-849 [00] .. 1 4154502892.139203: splice_to_pipe <-sys_vmsplice
main-849 [00] .. 1 4154502892.139206: _mutex_lock <-splice_to_pipe
main-849 [00] .. 1 4154502892.139209: rt_mutex_lock <-_mutex_lock
main-849 [00] .. 1 4154502892.139217: _mutex_unlock <-splice_to_pipe
main-849 [00] .. 1 4154502892.139221: rt_mutex_unlock <-_mutex_unlock
main-849 [00] .. 1 4154502892.139224: kill_fasync <-splice_to_pipe
main-849 [00] .. 1 4154502892.139235: sys_gettimeofday <-ret_from_syscall
main-849 [00] .. 1 4154502892.139237: do_gettimeofday <-sys_gettimeofday
main-849 [00] .. 1 4154502892.139239: getnstimeofday <-do_gettimeofday
So, I will return to working on my circular buffer.
This way I can control the (ethernet frame length)*(number of frames)/
(number of system calls) ratio; see the worked example below.
Thanks to analyses of the splice kernel code and of pktgen, I've also
found a clean way to perform zero copy between my circular buffer and the
socket buffer. I will test it and let you know the changes and results.
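To make the ratio concrete (numbers purely illustrative): with 7200-byte
frames and a single send() flushing 1000 ready frames from the ring, the
ratio is 7200 * 1000 / 1 = 7.2 Mbytes moved per system call, against only
7.2 Kbytes per call when every packet needs its own vmsplice().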
Many thanks for your help,
Johann Baudy
--
Johann Baudy
johaahn@gmail.com
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Packet mmap: TX RING and zero copy
2008-09-04 14:44 ` Johann Baudy
@ 2008-09-05 7:17 ` Evgeniy Polyakov
[not found] ` <7e0dd21a0809050216r65b8f08fm1ad0630790a13a54@mail.gmail.com>
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-05 7:17 UTC (permalink / raw)
To: Johann Baudy; +Cc: David Miller, netdev
Hi Johann.
On Thu, Sep 04, 2008 at 04:44:15PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> I'm finally able to run a full zero-copy mechanism with a UDP socket, as you said.
> Unfortunately, I need at least one vmsplice() system call per UDP packet.
> A mere vmsplice(memory to pipe) costs a lot (80µs of CPU), and the
> splice(pipe to socket) call is even worse...
> 80µs is approximately the duration of 12 Kbytes sent at 1 Gbps, and I need
> to send packets of 7200 bytes (with no fragmentation), so unfortunately I
> can't use this mechanism. I've only reached 20 Mbytes/s.
vmsplice() can be slow; try to inject the header via a usual send() call,
or better, do not use vmsplice() at all for testing.
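Just to be concrete, a vmsplice()-free test can be as small as the sketch
below; it assumes a connected TCP socket (so that sendfile() has a real
sendpage path underneath) and an ordinary file as the data source, both
names invented for illustration:

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Stream a whole file over the socket with sendfile(): the kernel
 * hands the page cache pages to the device, no user-space copy. */
static void blast_file(int sockfd)
{
        struct stat st;
        off_t off = 0;
        int fd = open("data.bin", O_RDONLY);    /* illustrative name */

        if (fd < 0)
                return;
        if (fstat(fd, &st) == 0)
                while (off < st.st_size)
                        if (sendfile(sockfd, fd, &off,
                                     st.st_size - off) <= 0)
                                break;
        close(fd);
}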
> You can find below an FTRACE of vmsplice(), in case you spot something
> abnormal ... :)
> (The 80µs figure is an average of the vmsplice() duration measured with
> gettimeofday(): WITHOUT FTRACE IN THE KERNEL CONFIG.)
The amount of gettimeofday() and friends is excessive, but that may be the
tracing tool itself. kill_fasync() also took too much time (the top CPU
user is at the bottom, I suppose?); do you use SIGIO? Also, the vma
traversal and page checking are not what will be done in the network code
and in your project, so they add overhead as well. Please try without
vmsplice() at all; a usual splice()/sendfile() _has_ to saturate the link,
otherwise we have a serious problem.
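About SIGIO: kill_fasync() on the pipe only has real work to do when
somebody has registered for async notification on it, which from user
space looks roughly like the sketch below (plain fcntl() calls; the helper
name is invented). If your test does this anywhere, dropping it should
remove that cost from the trace:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Ask the kernel to deliver SIGIO to this process when the fd becomes
 * ready; this is what arms the fasync list that kill_fasync() walks. */
static void enable_sigio(int fd)
{
        fcntl(fd, F_SETOWN, getpid());
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC);
}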
> So, I will return to working on my circular buffer.
> This way I can control the (ethernet frame length)*(number of frames)/
> (number of system calls) ratio.
Not to distract you from the project, but you can still achieve the same
with existing methods and a smaller amount of work. Then again, I should
be the last person to say that tricky hacks implementing an idea should
be abandoned in favour of the standard (even slow) methods :)
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Packet mmap: TX RING and zero copy
2008-09-02 18:27 Packet mmap: TX RING and zero copy Johann Baudy
2008-09-02 19:46 ` Evgeniy Polyakov
@ 2008-09-05 10:28 ` Robert Iakobashvili
2008-09-05 13:06 ` Johann Baudy
1 sibling, 1 reply; 39+ messages in thread
From: Robert Iakobashvili @ 2008-09-05 10:28 UTC (permalink / raw)
To: Johann Baudy; +Cc: netdev, Ulisses Alonso Camaró
Hi Johann,
On Tue, Sep 2, 2008 at 9:27 PM, Johann Baudy <johaahn@gmail.com> wrote:
> I've made lot of tests, playing with jumbo frames, raw sockets, ...
> I've never exceeded ~25Mbytes/s. So I've decided to analyze deeply the
> packet socket transmission process.
>
> The main blocking point was the memcpy_fromiovec() function that is
> located in the packet_sendmsg() of af_packet.c.
> It was consuming all my CPU resources to copy data from user space to
> socket buffer.
> Then I've started to work on a hack that makes this transfer possible
> without any memcpys.
>
> Mainly, the hack is the implementation of two "features":
>
> * Sending packet through a circular buffer between user and
> kernel space that minimizes the number of system calls. (Feature
> actually implemented for capture process, libpcap ..).
Something like this has been done in the PF_RING socket,
which is part of the ntop project infrastructure.
Take care.
Truly,
Robert Iakobashvili
......................................................................
www.ghotit.com
Assistive technology that understands you
......................................................................
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Packet mmap: TX RING and zero copy
2008-09-05 10:28 ` Robert Iakobashvili
@ 2008-09-05 13:06 ` Johann Baudy
0 siblings, 0 replies; 39+ messages in thread
From: Johann Baudy @ 2008-09-05 13:06 UTC (permalink / raw)
To: Robert Iakobashvili; +Cc: netdev, Ulisses Alonso Camaró
Thanks Robert,
The architecture of PF_RING seems really similar to the packet mmap I/O
that optimizes the capture process.
Is PF_RING planned to replace it?
I'll try it and check the performance.
Best regards,
Johann
On Fri, Sep 5, 2008 at 12:28 PM, Robert Iakobashvili
<coroberti@gmail.com> wrote:
> Hi Johann,
>
> On Tue, Sep 2, 2008 at 9:27 PM, Johann Baudy <johaahn@gmail.com> wrote:
>> I've made lot of tests, playing with jumbo frames, raw sockets, ...
>> I've never exceeded ~25Mbytes/s. So I've decided to analyze deeply the
>> packet socket transmission process.
>>
>> The main blocking point was the memcpy_fromiovec() function that is
>> located in the packet_sendmsg() of af_packet.c.
>> It was consuming all my CPU resources to copy data from user space to
>> socket buffer.
>> Then I've started to work on a hack that makes this transfer possible
>> without any memcpys.
>>
>> Mainly, the hack is the implementation of two "features":
>>
>> * Sending packet through a circular buffer between user and
>> kernel space that minimizes the number of system calls. (Feature
>> actually implemented for capture process, libpcap ..).
>
> Something like this has been done in the PF_RING socket,
> which is part of the ntop project infrastructure.
>
> Take care.
>
> Truly,
> Robert Iakobashvili
> ......................................................................
> www.ghotit.com
> Assistive technology that understands you
> ......................................................................
>
--
Johann Baudy
johaahn@gmail.com
^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread
Thread overview: 39+ messages
2008-09-02 18:27 Packet mmap: TX RING and zero copy Johann Baudy
2008-09-02 19:46 ` Evgeniy Polyakov
2008-09-03 7:56 ` Johann Baudy
2008-09-03 10:38 ` Johann Baudy
2008-09-03 11:06 ` David Miller
2008-09-03 13:05 ` Johann Baudy
2008-09-03 13:27 ` Evgeniy Polyakov
2008-09-03 14:57 ` Christoph Lameter
2008-09-03 15:00 ` Johann Baudy
2008-09-03 15:13 ` Evgeniy Polyakov
2008-09-03 15:58 ` Johann Baudy
2008-09-03 16:43 ` Evgeniy Polyakov
2008-09-03 20:30 ` Johann Baudy
2008-09-03 22:03 ` Evgeniy Polyakov
2008-09-04 14:44 ` Johann Baudy
2008-09-05 7:17 ` Evgeniy Polyakov
[not found] ` <7e0dd21a0809050216r65b8f08fm1ad0630790a13a54@mail.gmail.com>
2008-09-05 9:17 ` Fwd: " Johann Baudy
2008-09-05 11:31 ` Evgeniy Polyakov
2008-09-05 12:44 ` Johann Baudy
2008-09-05 13:16 ` Evgeniy Polyakov
2008-09-05 13:29 ` Johann Baudy
2008-09-05 13:37 ` Evgeniy Polyakov
2008-09-05 13:55 ` Johann Baudy
2008-09-05 14:19 ` Evgeniy Polyakov
2008-09-05 14:45 ` Johann Baudy
2008-09-05 14:59 ` Evgeniy Polyakov
2008-09-05 15:30 ` Johann Baudy
2008-09-05 15:38 ` Evgeniy Polyakov
2008-09-05 16:01 ` Johann Baudy
2008-09-05 16:34 ` Evgeniy Polyakov
2008-09-08 10:21 ` Johann Baudy
2008-09-08 11:26 ` Evgeniy Polyakov
2008-09-08 13:01 ` Johann Baudy
2008-09-08 15:28 ` Evgeniy Polyakov
2008-09-08 15:38 ` Evgeniy Polyakov
2008-09-09 23:11 ` Johann Baudy
2008-09-10 6:09 ` Evgeniy Polyakov
2008-09-05 10:28 ` Robert Iakobashvili
2008-09-05 13:06 ` Johann Baudy