* Packet mmap: TX RING and zero copy
@ 2008-09-02 18:27 Johann Baudy
2008-09-02 19:46 ` Evgeniy Polyakov
2008-09-05 10:28 ` Robert Iakobashvili
0 siblings, 2 replies; 39+ messages in thread
From: Johann Baudy @ 2008-09-02 18:27 UTC (permalink / raw)
To: netdev; +Cc: Ulisses Alonso Camaró
Hi All,
I'm currently working on an embedded project (based on the Linux kernel)
that needs high throughput from a gigabit Ethernet controller with a
"small" CPU.
I've made a lot of tests, playing with jumbo frames, raw sockets, ...
I've never exceeded ~25 Mbytes/s, so I decided to analyze the packet
socket transmission process in depth.
The main blocking point was the memcpy_fromiovec() function located in
packet_sendmsg() in af_packet.c.
It was consuming all my CPU resources copying data from user space to
the socket buffer.
So I started working on a hack that makes this transfer possible
without any memcpy.
The hack mainly implements two "features":
* Sending packets through a circular buffer shared between user and
kernel space, which minimizes the number of system calls (the same
feature already implemented for the capture process and used by libpcap).
To sum up, the user process:
- initializes a raw socket,
- allocates N buffers in kernel space through a setsockopt() call (TX ring),
- mmap()s the allocated memory,
- fills M buffers with custom data and updates the status of the filled
buffers to ready (the buffer header, struct tpacket_hdr, contains a
status field: TP_STATUS_KERNEL means free, TP_STATUS_USER means ready
to be sent, TP_STATUS_COPY means transmission ongoing),
- calls send(). The kernel then sends all buffers set to
TP_STATUS_USER. The status is set to TP_STATUS_COPY during transfer and
back to TP_STATUS_KERNEL when done.
* Zero copy mode. The CONFIG_PACKET_MMAP_ZERO_COPY feature flag
skips the CPU copy between the circular buffer and the socket buffer
allocated during send.
To send a packet without zero copy, if my understanding is correct, we
first allocate a socket buffer with sock_alloc_send_skb(), then copy the
data into the socket buffer, and finally hand this sk_buff to the network
card. With zero copy, the trick is to bypass the data copy by
substituting the data pointers of the allocated sk_buff with pointers
into our circular buffer.
This way the network device reads its data from our circular buffer
instead of the socket buffer.
And to prevent the kernel from crashing during skb data release
(shinfo + data release), we restore the whole previous content of the
sk_buff inside the destructor callback.
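
To make the user-space flow summarized above concrete, here is a minimal
sketch of a sender using the proposed TX ring. It assumes the
PACKET_TX_RING and PACKET_TX_RING_HEADER_SIZE socket options and the
TP_STATUS_* semantics introduced by the patch below (they are not in
stock kernel headers), and it omits most error handling:

/* Sketch only: PACKET_TX_RING and PACKET_TX_RING_HEADER_SIZE come from
 * the proposed patch, not from mainline headers. */
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <netinet/in.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

int main(void)
{
	struct tpacket_req req = {
		.tp_block_size = 8192, .tp_frame_size = 8192,
		.tp_block_nr   = 64,   .tp_frame_nr   = 64,
	};
	struct sockaddr_ll ll = { .sll_family = AF_PACKET,
				  .sll_protocol = htons(ETH_P_ALL) };
	struct ifreq ifr;
	int fd, hdr_size;
	socklen_t olen = sizeof(hdr_size);
	unsigned int i;
	char *ring;

	fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

	/* allocate the TX ring in the kernel */
	setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &req, sizeof(req));

	/* bind to eth0 so the kernel knows the device header sizes */
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", sizeof(ifr.ifr_name));
	ioctl(fd, SIOCGIFINDEX, &ifr);
	ll.sll_ifindex = ifr.ifr_ifindex;
	bind(fd, (struct sockaddr *)&ll, sizeof(ll));

	/* offset from the start of a frame to its data buffer */
	getsockopt(fd, SOL_PACKET, PACKET_TX_RING_HEADER_SIZE,
		   &hdr_size, &olen);

	/* map the ring into the process */
	ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
		    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* fill every free frame with a dummy 7200-byte packet, mark it ready */
	for (i = 0; i < req.tp_frame_nr; i++) {
		struct tpacket_hdr *h =
			(struct tpacket_hdr *)(ring + i * req.tp_frame_size);
		if (h->tp_status != TP_STATUS_KERNEL)
			continue;
		memset((char *)h + hdr_size, 0xab, 7200);
		h->tp_len = 7200;
		h->tp_status = TP_STATUS_USER;
	}

	/* a single send() pushes every TP_STATUS_USER frame to the driver */
	send(fd, NULL, 0, 0);

	munmap(ring, (size_t)req.tp_block_size * req.tp_block_nr);
	close(fd);
	return 0;
}

Compared with one sendto() per packet, a whole ring of frames goes out
here with a single system call, which is the point of the exercise.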
I'm aware that this suggestion is really far from a real solution,
mainly due to this hard substitution.
But I would like to get as much criticism as possible in order to
start a discussion with experts about a conceivable way to mix
zero copy, sk_buff management and the packet socket,
which is perhaps impossible with the current network kernel flow ...
PS: I've reached 85 Mbytes/s with TX RING and zero copy.
Thanks in advance for your advice,
Johann Baudy
diff --git a/Documentation/networking/packet_mmap.txt
b/Documentation/networking/packet_mmap.txt
index db0cd51..0cfb835 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -4,16 +4,17 @@
This file documents the CONFIG_PACKET_MMAP option available with the PACKET
socket interface on 2.4 and 2.6 kernels. This type of sockets is used for
-capture network traffic with utilities like tcpdump or any other that uses
-the libpcap library.
+capture network traffic with utilities like tcpdump or any other that needs
+raw access to the network interface.
You can find the latest version of this document at
- http://pusa.uv.es/~ulisses/packet_mmap/
+ http://pusa.uv.es/~ulisses/packet_mmap/ (down ?)
Please send me your comments to
Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
+ Johann Baudy <johann.baudy@gnu-log.net> (TX RING - Zero Copy)
-------------------------------------------------------------------------------
+ Why use PACKET_MMAP
@@ -25,19 +26,25 @@ to capture each packet, it requires two if you
want to get packet's
timestamp (like libpcap always does).
In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
-configurable circular buffer mapped in user space. This way reading
packets just
-needs to wait for them, most of the time there is no need to issue a single
-system call. By using a shared buffer between the kernel and the user
-also has the benefit of minimizing packet copies.
-
-It's fine to use PACKET_MMAP to improve the performance of the
capture process,
-but it isn't everything. At least, if you are capturing at high speeds (this
-is relative to the cpu speed), you should check if the device driver of your
-network interface card supports some sort of interrupt load mitigation or
-(even better) if it supports NAPI, also make sure it is enabled.
+configurable circular buffer mapped in user space that can be used to either
+send or receive packets. This way reading packets just needs to wait for them,
+most of the time there is no need to issue a single system call. For transmission,
+multiple packets can be sent in one system call and outgoing data buffers can be
+zero-copied to get the highest bandwidth (with PACKET_MMAP_ZERO_COPY).
+By using a shared buffer between the kernel and the user also has the benefit
+of minimizing packet copies.
+
+It's fine to use PACKET_MMAP to improve the performance of the capture and
+transmission process, but it isn't everything. At least, if you are capturing
+at high speeds (this is relative to the cpu speed), you should check if the
+device driver of your network interface card supports some sort of interrupt
+load mitigation or (even better) if it supports NAPI, also make sure it is
+enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
+supported by the devices of your network, especially if you are using DMA
+(cf. jumbo frames).
--------------------------------------------------------------------------------
-+ How to use CONFIG_PACKET_MMAP
++ How to use CONFIG_PACKET_MMAP to improve capture process
--------------------------------------------------------------------------------
From the user standpoint, you should use the higher level libpcap
library, which
@@ -56,8 +63,9 @@ The rest of this document is intended for people who
want to understand
the low level details or want to improve libpcap by including PACKET_MMAP
support.
+
--------------------------------------------------------------------------------
-+ How to use CONFIG_PACKET_MMAP directly
++ How to use CONFIG_PACKET_MMAP directly to improve the capture process
--------------------------------------------------------------------------------
From the system calls stand point, the use of PACKET_MMAP involves
@@ -66,6 +74,7 @@ the following process:
[setup] socket() -------> creation of the capture socket
setsockopt() ---> allocation of the circular buffer (ring)
+ option: PACKET_RX_RING
mmap() ---------> mapping of the allocated buffer to the
user process
@@ -97,14 +106,95 @@ also the mapping of the circular buffer in the
user process and
the use of this buffer.
--------------------------------------------------------------------------------
++ How to use CONFIG_PACKET_MMAP directly to improve transmission process
+--------------------------------------------------------------------------------
+Transmission process is similar to capture as shown below.
+
+[setup] socket() -------> creation of the transmission socket
+ setsockopt() ---> allocation of the circular buffer (ring)
+ option: PACKET_TX_RING
+ bind() ---------> bind transmission socket with a
network interface
+ getsockopt() ---> get the circular buffer header size
+ option: PACKET_TX_RING_HEADER_SIZE
+ mmap() ---------> mapping of the allocated buffer to the
+ user process
+
+[transmission] poll() ---------> wait for free packets (optional)
+ send() ---------> send all packets that are set as ready in
+ the ring
+
+[shutdown] close() --------> destruction of the transmission socket and
+ deallocation of all associated resources.
+
+Binding the socket to your network interface is mandatory (with zero copy) to
+know the header size of frames used in the circular buffer.
+
+Each frame contains five parts:
+
+ -------------------
+| struct tpacket_hdr| Header. It contains the status
+| | of this frame
+|-------------------|
+| struct sk_buff | (Zero copy only) Saved copy of the allocated socket
+| | buffer descriptor.
+|-------------------|
+| network interface | (Zero copy only) size = LL_RESERVED_SPACE(dev)
+| reserved space |
+|-------------------|
+| data buffer |
+. . Data that will be sent over the network interface.
+. .
+|-------------------|
+| network interface | (Zero copy only) size = LL_ALLOCATED_SPACE(dev)
+| reserved space | - LL_RESERVED_SPACE(dev)
+ -------------------
+
+ Network interface reserved spaces may differ between devices, which is why
+ the user must ask the kernel for the header size after the bind() call.
+
+ bind() associates the socket to your network interface thanks to
+ sll_ifindex parameter of struct sockaddr_ll.
+
+ getsockopt(PACKET_TX_RING_HEADER_SIZE) returns an offset that must be
+ added to each frame pointer to get the start pointer of the data buffer.
+
+ int i_header_size;
+ struct sockaddr_ll my_addr;
+ struct ifreq s_ifr;
+ ...
+
+ strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
+
+ /* get interface index of eth0 */
+ ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
+
+ /* fill sockaddr_ll struct to prepare binding */
+ my_addr.sll_family = AF_PACKET;
+ my_addr.sll_protocol = htons(ETH_P_ALL);
+ my_addr.sll_ifindex = s_ifr.ifr_ifindex;
+
+ /* bind socket to eth0 */
+ bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
+
+ /* get header size */
+ getsockopt(this->socket, SOL_PACKET, PACKET_TX_RING_HEADER_SIZE,
+ (void*)&i_header_size,&opt_len);
+
+ A complete tutorial is available at: http://wiki.gnu-log.net/
+
+--------------------------------------------------------------------------------
+ PACKET_MMAP settings
--------------------------------------------------------------------------------
To setup PACKET_MMAP from user level code is done with a call like
+ - Capture process
setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
+ - Transmission process
+ setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
+
The most significant argument in the previous call is the req parameter,
this parameter must to have the following structure:
@@ -117,11 +207,11 @@ this parameter must to have the following structure:
};
This structure is defined in /usr/include/linux/if_packet.h and establishes a
-circular buffer (ring) of unswappable memory mapped in the capture process.
+circular buffer (ring) of unswappable memory.
Being mapped in the capture process allows reading the captured frames and
related meta-information like timestamps without requiring a system call.
-Captured frames are grouped in blocks. Each block is a physically contiguous
+Frames are grouped in blocks. Each block is a physically contiguous
region of memory and holds tp_block_size/tp_frame_size frames. The
total number
of blocks is tp_block_nr. Note that tp_frame_nr is a redundant
parameter because
@@ -336,13 +426,13 @@ struct tpacket_hdr). If this field is 0 means
that the frame is ready
to be used for the kernel, If not, there is a frame the user can read
and the following flags apply:
- from include/linux/if_packet.h
+++ Capture process:
+from include/linux/if_packet.h
#define TP_STATUS_COPY 2
#define TP_STATUS_LOSING 4
#define TP_STATUS_CSUMNOTREADY 8
-
TP_STATUS_COPY : This flag indicates that the frame (and associated
meta information) has been truncated because it's
larger than tp_frame_size. This packet can be
@@ -388,8 +478,38 @@ packets are in the ring:
if (status == TP_STATUS_KERNEL)
retval = poll(&pfd, 1, timeout);
-It doesn't incur in a race condition to first check the status value and
-then poll for frames.
+
+++ Transmission process
+Those defines are also used for transmission:
+
+ #define TP_STATUS_KERNEL 0 // Frame is available
+ #define TP_STATUS_USER 1 // Frame will be sent on next send()
+ #define TP_STATUS_COPY 2 // Frame is currently in transmission
+ #define TP_STATUS_LOSING 4 // Indicate a transmission error
+
+First, the kernel initializes all frames to TP_STATUS_KERNEL. To send a packet,
+the user fills a data buffer of an available frame, sets tp_len to current
+data buffer size and sets its status field to TP_STATUS_USER. This can be done
+on multiple frames. Once the user is ready to transmit, it calls send().
+Then all buffers with status equal to TP_STATUS_USER are forwarded to the
+network device. The kernel marks each frame being sent with TP_STATUS_COPY
+until its transfer completes (until the socket buffer is released if zero copy
+is used, otherwise until the data has been copied into the socket buffer).
+At the end, all statuses return to TP_STATUS_KERNEL.
+
+ header->tp_len = in_i_size;
+ header->tp_status = TP_STATUS_USER;
+ retval = send(this->socket, NULL, 0, 0);
+
+The user can also use poll() to check if a buffer is available:
+(status == TP_STATUS_KERNEL)
+
+ struct pollfd pfd;
+ pfd.fd = fd;
+ pfd.revents = 0;
+ pfd.events = POLLOUT;
+ retval = poll(&pfd, 1, timeout);
+
--------------------------------------------------------------------------------
+ THANKS
diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
index ad09609..a79cd89 100644
--- a/include/linux/if_packet.h
+++ b/include/linux/if_packet.h
@@ -43,6 +43,8 @@ struct sockaddr_ll
#define PACKET_COPY_THRESH 7
#define PACKET_AUXDATA 8
#define PACKET_ORIGDEV 9
+#define PACKET_TX_RING 10
+#define PACKET_TX_RING_HEADER_SIZE 11
struct tpacket_stats
{
@@ -79,6 +81,11 @@ struct tpacket_hdr
#define TPACKET_ALIGN(x) (((x)+TPACKET_ALIGNMENT-1)&~(TPACKET_ALIGNMENT-1))
#define TPACKET_HDRLEN (TPACKET_ALIGN(sizeof(struct tpacket_hdr)) +
sizeof(struct sockaddr_ll))
+/* packet ring modes */
+#define TPACKET_MODE_NONE 0
+#define TPACKET_MODE_RX 1
+#define TPACKET_MODE_TX 2
+
/*
Frame structure:
diff --git a/net/packet/Kconfig b/net/packet/Kconfig
index 34ff93f..2c74568 100644
--- a/net/packet/Kconfig
+++ b/net/packet/Kconfig
@@ -16,7 +16,7 @@ config PACKET
If unsure, say Y.
config PACKET_MMAP
- bool "Packet socket: mmapped IO"
+ bool "mmapped IO"
depends on PACKET
help
If you say Y here, the Packet protocol driver will use an IO
@@ -24,3 +24,12 @@ config PACKET_MMAP
If unsure, say N.
+config PACKET_MMAP_ZERO_COPY
+ bool "zero-copy TX"
+ depends on PACKET_MMAP
+ help
+ If you say Y here, the Packet protocol driver will fill socket buffer
+ descriptors with TX ring buffer addresses. This mechanism results
+ in faster communication.
+
+ If unsure, say N.
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 2cee87d..45367dc 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -158,7 +158,9 @@ struct packet_mreq_max
};
#ifdef CONFIG_PACKET_MMAP
-static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
int closing);
+static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
int closing, int mode);
+static int tpacket_snd(struct socket *sock,
+ struct msghdr *msg, size_t len);
#endif
static void packet_flush_mclist(struct sock *sk);
@@ -173,7 +175,9 @@ struct packet_sock {
unsigned int frames_per_block;
unsigned int frame_size;
unsigned int frame_max;
+ unsigned int header_size;
int copy_thresh;
+ int mode;
#endif
struct packet_type prot_hook;
spinlock_t bind_lock;
@@ -692,10 +696,209 @@ ring_is_full:
goto drop_n_restore;
}
+/*
+ * TX ring skb destructor.
+ * This function is called when skb is freed.
+ * */
+#ifdef CONFIG_PACKET_MMAP_ZERO_COPY
+void tpacket_skb_destructor (struct sk_buff *skb)
+{
+ struct tpacket_hdr *header = (struct tpacket_hdr*) skb->head;
+ struct sk_buff * skb_copy;
+
+ /* calculate old skb pointer */
+ skb_copy = ((void*) header + sizeof(struct tpacket_hdr));
+
+ /* restore previous skb header (before substitution) */
+ memcpy(skb, skb_copy, sizeof(struct sk_buff));
+
+ /* execute previous destructor */
+ if(skb->destructor)
+ skb->destructor(skb);
+
+ /* check status of buffer */
+ BUG_ON(header->tp_status != TP_STATUS_COPY);
+ header->tp_status = TP_STATUS_KERNEL;
+
+ return;
+}
#endif
+/*
+ * TX Ring packet send function
+ * */
+static int tpacket_snd(struct socket *sock,
+ struct msghdr *msg, size_t len)
+{
+ struct sock *sk = sock->sk;
+ struct sockaddr_ll *saddr=(struct sockaddr_ll *)msg->msg_name;
+ struct packet_sock *po = pkt_sk(sk);
+ struct net_device *dev;
+ int err, reserve=0, len_sum=0, ifindex, i;
+ struct sk_buff * skb, * skb_copy;
+ unsigned char *addr;
+ __be16 proto;
+
+ /*
+ * Get and verify the address.
+ */
+ if (saddr == NULL) {
+ ifindex = po->ifindex;
+ proto = po->num;
+ addr = NULL;
+ } else {
+ err = -EINVAL;
+ if (msg->msg_namelen < sizeof(struct sockaddr_ll))
+ goto out;
+ if (msg->msg_namelen < (saddr->sll_halen + offsetof(struct
sockaddr_ll, sll_addr)))
+ goto out;
+ ifindex = saddr->sll_ifindex;
+ proto = saddr->sll_protocol;
+ addr = saddr->sll_addr;
+ }
-static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
+ /* get device by index */
+ dev = dev_get_by_index(sock_net(sk), ifindex);
+ err = -ENXIO;
+ if (dev == NULL)
+ goto out_put;
+ if (sock->type == SOCK_RAW)
+ reserve = dev->hard_header_len;
+
+ /* check if header size of device has changed since bind */
+ /* bind() call is mandatory as user must know where data must be written.
+ * it fills header_size setting of current socket
+ * and allows getsockopt(PACKET_TX_RING_HEADER_SIZE) call */
+ err = -EINVAL;
+#ifdef CONFIG_PACKET_MMAP_ZERO_COPY
+ if(po->header_size != LL_RESERVED_SPACE(dev) + sizeof(struct
tpacket_hdr) + sizeof(struct sk_buff))
+#else
+ if(po->header_size != sizeof(struct tpacket_hdr))
+#endif
+ goto out_put;
+
+ /* check interface up */
+ err = -ENETDOWN;
+ if (!(dev->flags & IFF_UP))
+ goto out_put;
+
+ /* loop on all frames */
+ for (i = 0; i <= po->frame_max; i++) {
+ struct tpacket_hdr *header = packet_lookup_frame(po, i);
+ int size_max = po->frame_size - sizeof(struct skb_shared_info) -
sizeof(struct tpacket_hdr) - LL_ALLOCATED_SPACE(dev);
+
+ if(header->tp_status == TP_STATUS_USER) {
+ /* mark header as tx ongoing */
+ header->tp_status = TP_STATUS_COPY;
+
+ /* check packet size */
+ err = -EMSGSIZE;
+ if (header->tp_len > dev->mtu+reserve)
+ goto out_put;
+ if(header->tp_len > size_max)
+ goto out_put;
+
+#ifdef CONFIG_PACKET_MMAP_ZERO_COPY
+ err = -ENOMEM;
+ /* allocate skb header */
+ skb = sock_alloc_send_skb(sk,
+ 0,
+ msg->msg_flags & MSG_DONTWAIT,
+ &err);
+ if (skb==NULL)
+ goto out_put;
+
+ err = -EINVAL;
+ if (sock->type == SOCK_DGRAM &&
+ dev_hard_header(skb, dev, ntohs(proto), addr, NULL, len) < 0)
+ goto out_free;
+
+ /* clone current skb */
+ skb_copy = ((void*) header + sizeof(struct tpacket_hdr));
+ memcpy(skb_copy, skb, sizeof(struct sk_buff));
+
+ /* substitute skb data with Tx ring pointers */
+ skb->head = (void*)header;
+ skb->data = (void*)skb->head;
+ skb->end = (void*)header + po->frame_size - sizeof(struct skb_shared_info);
+ skb->truesize = po->frame_size;
+ skb_reset_tail_pointer(skb);
+
+ /* make sure we've copied shinfo properly into ring buffer */
+ memcpy(skb_shinfo(skb), skb_shinfo(skb_copy), sizeof(struct
skb_shared_info));
+
+ err = -ENOSPC;
+ /* check buffer size */
+ if(skb_tailroom(skb) < header->tp_len)
+ goto out_free;
+
+ /* put data into skb */
+ skb_reserve(skb, po->header_size);
+ skb_put(skb, header->tp_len);
+ skb_reset_network_header(skb);
+ skb_reset_transport_header(skb);
+
+ /* store destructor call back to update tpacket header status */
+ skb->destructor = tpacket_skb_destructor;
+#else
+ err = -ENOMEM;
+ /* allocate skb header */
+ skb = sock_alloc_send_skb(sk,
+ header->tp_len + LL_ALLOCATED_SPACE(dev),
+ msg->msg_flags & MSG_DONTWAIT,
+ &err);
+ if (skb==NULL)
+ goto out_put;
+
+ /* reserve device header */
+ skb_reserve(skb, LL_RESERVED_SPACE(dev));
+ skb_put(skb,header->tp_len);
+ skb_shinfo(skb)->frag_list=0;
+ skb_shinfo(skb)->nr_frags=0;
+
+ /* copy all data from TX ring buffer to skb */
+ err = skb_store_bits(skb, 0, (void*)header + po->header_size,
header->tp_len);
+ if( err )
+ goto out_free;
+
+#endif
+
+ /* fill skb with proto, device and priority */
+ skb->protocol = proto;
+ skb->dev = dev;
+ skb->priority = sk->sk_priority;
+
+
+ /* now send it */
+ err = dev_queue_xmit(skb);
+ if (err > 0 && (err = net_xmit_errno(err)) != 0)
+ goto out_free;
+
+#ifndef CONFIG_PACKET_MMAP_ZERO_COPY
+ /* reset flag of buffer as data has been copied into skb */
+ header->tp_status = TP_STATUS_KERNEL;
+#endif
+ len_sum += skb->len;
+ }
+ }
+ dev_put(dev);
+
+ return(len_sum);
+
+out_free:
+ kfree_skb(skb);
+out_put:
+ if (dev)
+ dev_put(dev);
+out:
+ return err;
+}
+#endif
+
+/*
+ * Normal packet send function
+ * */
+static int packet_snd(struct socket *sock,
struct msghdr *msg, size_t len)
{
struct sock *sk = sock->sk;
@@ -705,14 +908,13 @@ static int packet_sendmsg(struct kiocb *iocb,
struct socket *sock,
__be16 proto;
unsigned char *addr;
int ifindex, err, reserve = 0;
+ struct packet_sock *po = pkt_sk(sk);
/*
* Get and verify the address.
*/
if (saddr == NULL) {
- struct packet_sock *po = pkt_sk(sk);
-
ifindex = po->ifindex;
proto = po->num;
addr = NULL;
@@ -786,6 +988,23 @@ out:
return err;
}
+static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
+ struct msghdr *msg, size_t len)
+{
+ struct sock *sk = sock->sk;
+ struct packet_sock *po = pkt_sk(sk);
+ //printk("tpacket TX sendmsg\n");
+
+ /* check if tx ring mode enabled */
+#ifdef CONFIG_PACKET_MMAP
+ if (po->mode == TPACKET_MODE_TX)
+ return tpacket_snd(sock, msg, len);
+ else
+#endif
+ return packet_snd(sock, msg, len);
+
+}
+
/*
* Close a PACKET socket. This is fairly simple. We immediately go
* to 'closed' state and remove our protocol entry in the device list.
@@ -827,7 +1046,7 @@ static int packet_release(struct socket *sock)
if (po->pg_vec) {
struct tpacket_req req;
memset(&req, 0, sizeof(req));
- packet_set_ring(sk, &req, 1);
+ packet_set_ring(sk, &req, 1, TPACKET_MODE_NONE);
}
#endif
@@ -875,7 +1094,11 @@ static int packet_do_bind(struct sock *sk,
struct net_device *dev, __be16 protoc
po->prot_hook.dev = dev;
po->ifindex = dev ? dev->ifindex : 0;
-
+#ifdef CONFIG_PACKET_MMAP_ZERO_COPY
+ po->header_size = dev ? (LL_RESERVED_SPACE(dev) + sizeof(struct
tpacket_hdr) + sizeof(struct sk_buff)) : 0;
+#else
+ po->header_size = sizeof(struct tpacket_hdr);
+#endif
if (protocol == 0)
goto out_unlock;
@@ -1015,6 +1238,12 @@ static int packet_create(struct net *net,
struct socket *sock, int protocol)
po->running = 1;
}
+#ifdef CONFIG_PACKET_MMAP
+ po->mode = TPACKET_MODE_NONE;
+ po->header_size = 0;
+#endif
+
+
write_lock_bh(&net->packet.sklist_lock);
sk_add_node(sk, &net->packet.sklist);
write_unlock_bh(&net->packet.sklist_lock);
@@ -1344,7 +1573,19 @@ packet_setsockopt(struct socket *sock, int
level, int optname, char __user *optv
return -EINVAL;
if (copy_from_user(&req,optval,sizeof(req)))
return -EFAULT;
- return packet_set_ring(sk, &req, 0);
+ /* store packet mode */
+ return packet_set_ring(sk, &req, 0, TPACKET_MODE_RX);
+ }
+ case PACKET_TX_RING:
+ {
+ struct tpacket_req req;
+
+ if (optlen<sizeof(req))
+ return -EINVAL;
+ if (copy_from_user(&req,optval,sizeof(req)))
+ return -EFAULT;
+ /* store packet mode */
+ return packet_set_ring(sk, &req, 0, TPACKET_MODE_TX);
}
case PACKET_COPY_THRESH:
{
@@ -1408,6 +1649,17 @@ static int packet_getsockopt(struct socket
*sock, int level, int optname,
return -EINVAL;
switch(optname) {
+#ifdef CONFIG_PACKET_MMAP
+ case PACKET_TX_RING_HEADER_SIZE:
+ if (len > sizeof(int))
+ len = sizeof(int);
+ val = po->header_size;
+ /* header_size should differ from 0 if device has been bind */
+ if (unlikely(val == 0))
+ return -EACCES;
+ data = &val;
+ break;
+#endif
case PACKET_STATISTICS:
if (len > sizeof(struct tpacket_stats))
len = sizeof(struct tpacket_stats);
@@ -1562,7 +1814,10 @@ static unsigned int packet_poll(struct file *
file, struct socket *sock,
struct sock *sk = sock->sk;
struct packet_sock *po = pkt_sk(sk);
unsigned int mask = datagram_poll(file, sock, wait);
+ int i;
+ /* RX RING - waiting for packet */
+ if(po->mode == TPACKET_MODE_RX) {
spin_lock_bh(&sk->sk_receive_queue.lock);
if (po->pg_vec) {
unsigned last = po->head ? po->head-1 : po->frame_max;
@@ -1574,6 +1829,21 @@ static unsigned int packet_poll(struct file *
file, struct socket *sock,
mask |= POLLIN | POLLRDNORM;
}
spin_unlock_bh(&sk->sk_receive_queue.lock);
+ }
+ /* TX RING - waiting for free buffer */
+ else if(po->mode == TPACKET_MODE_TX) {
+ if(mask & POLLOUT) {
+ mask &= ~POLLOUT;
+ for (i = 0; i < po->frame_max; i++) {
+ struct tpacket_hdr *header = packet_lookup_frame(po, i);
+ if(header->tp_status == TP_STATUS_KERNEL)
+ {
+ mask |= POLLOUT;
+ break;
+ }
+ }
+ }
+ }
return mask;
}
@@ -1649,7 +1919,7 @@ out_free_pgvec:
goto out;
}
-static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
int closing)
+ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
int closing ,int mode)
{
char **pg_vec = NULL;
struct packet_sock *po = pkt_sk(sk);
@@ -1657,6 +1927,9 @@ static int packet_set_ring(struct sock *sk,
struct tpacket_req *req, int closing
__be16 num;
int err = 0;
+ /* saving ring mode */
+ po->mode = mode;
+
if (req->tp_block_nr) {
int i;
@@ -1736,7 +2009,7 @@ static int packet_set_ring(struct sock *sk,
struct tpacket_req *req, int closing
req->tp_block_nr = XC(po->pg_vec_len, req->tp_block_nr);
po->pg_vec_pages = req->tp_block_size/PAGE_SIZE;
- po->prot_hook.func = po->pg_vec ? tpacket_rcv : packet_rcv;
+ po->prot_hook.func = (po->pg_vec && (po->mode == TPACKET_MODE_RX))
? tpacket_rcv : packet_rcv;
skb_queue_purge(&sk->sk_receive_queue);
#undef XC
if (atomic_read(&po->mapped))
* Re: Packet mmap: TX RING and zero copy
2008-09-02 18:27 Packet mmap: TX RING and zero copy Johann Baudy
@ 2008-09-02 19:46 ` Evgeniy Polyakov
2008-09-03 7:56 ` Johann Baudy
2008-09-05 10:28 ` Robert Iakobashvili
1 sibling, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-02 19:46 UTC (permalink / raw)
To: Johann Baudy; +Cc: netdev, Ulisses Alonso Camaró
Hi Johann.
On Tue, Sep 02, 2008 at 08:27:36PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> I've made lot of tests, playing with jumbo frames, raw sockets, ...
> I've never exceeded ~25Mbytes/s. So I've decided to analyze deeply the
> packet socket transmission process.
>
> The main blocking point was the memcpy_fromiovec() function that is
> located in the packet_sendmsg() of af_packet.c.
Can you saturate the link with usual tcp/udp socket?
> But, I would like to get as much criticism as possible in order to
> start a discussion with experts about a conceivable way to mix
> zero-copy, sk_buff management and packet socket.
> Which is perhaps impossible with current network kernel flow ...
Did you try vmsplice and splice?
It is the preferred way to do a zero-copy.
--
Evgeniy Polyakov
* Re: Packet mmap: TX RING and zero copy
2008-09-02 19:46 ` Evgeniy Polyakov
@ 2008-09-03 7:56 ` Johann Baudy
2008-09-03 10:38 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-03 7:56 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev
Hi Evgeniy,
>> I've made lot of tests, playing with jumbo frames, raw sockets, ...
>> I've never exceeded ~25Mbytes/s. So I've decided to analyze deeply the
>> packet socket transmission process.
>>
>> The main blocking point was the memcpy_fromiovec() function that is
>> located in the packet_sendmsg() of af_packet.c.
>
> Can you saturate the link with usual tcp/udp socket?
No, only ~15-20 Mbytes/s with a standard tcp/udp socket.
>
>> But, I would like to get as much criticism as possible in order to
>> start a discussion with experts about a conceivable way to mix
>> zero-copy, sk_buff management and packet socket.
>> Which is perhaps impossible with current network kernel flow ...
>
> Did you try vmsplice and splice?
> It is the preferred way to do a zero-copy.
Not yet, I will perform some tests using splice and let you know performances.
Many thanks,
Johann
--
Johann Baudy
johaahn@gmail.com
* Re: Packet mmap: TX RING and zero copy
2008-09-03 7:56 ` Johann Baudy
@ 2008-09-03 10:38 ` Johann Baudy
2008-09-03 11:06 ` David Miller
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-03 10:38 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev
Hi Evgeniy,
I'm not able to exceed 15 Mbytes/s even with the vmsplice/splice duo,
due to some issues:
- I didn't manage to adjust the size of packets sent over the network (it
seems to be aligned to a page), and the maximum packet size seems to be
the page size (4096).
- I need approximately two system calls (vmsplice and splice) per
~4096*8 bytes maximum, which is maybe a limit of the pipe.
- I'm still going through packet_sendmsg() (packet socket), which
allocates an sk_buff and copies all data into it.
For reference, with my "patch" I need to send more than 32 packets of
7200 bytes (PC network card limit) in one system call (send()) and
without sk_buff data copy to reach 85 Mbytes/s.
Please find below my test program for vmsplice/splice:
Best regards,
Johann
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <poll.h>
int main (void)
{
struct tpacket_req s_packet_req;
uint32_t size, opt_len;
int fd, i, ec, i_sz_packet = 7150;
struct pollfd s_pfd;
struct sockaddr_ll my_addr, peer_addr;
struct ifreq s_ifr; /* points to one interface returned from ioctl */
int len;
int fd_socket;
int i_nb_buffer = 64;
int i_buffer_size = 8192;
int i_index;
int i_updated_cnt;
int i_ifindex;
int i_header_size;
struct tpacket_hdr * ps_header_start;
struct tpacket_hdr * ps_header;
char buffer[8000];
/* reset indes */
i_index = 0;
fd_socket = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if(fd_socket == -1)
{
perror("socket");
return EXIT_FAILURE;
}
/* start socket config: device and mtu */
/* clear structure */
memset(&my_addr, 0, sizeof(struct sockaddr_ll));
my_addr.sll_family = PF_PACKET;
my_addr.sll_protocol = htons(ETH_P_ALL);
/* initialize interface struct */
strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
/* Get the broad cast address */
ec = ioctl(fd_socket, SIOCGIFINDEX, &s_ifr);
if(ec == -1)
{
perror("iotcl");
return EXIT_FAILURE;
}
/* update with interface index */
i_ifindex = s_ifr.ifr_ifindex;
/* new mtu value */
s_ifr.ifr_mtu = 7200;
/* update the mtu through ioctl */
ec = ioctl(fd_socket, SIOCSIFMTU, &s_ifr);
if(ec == -1)
{
perror("iotcl");
return EXIT_FAILURE;
}
/* set sockaddr info */
memset(&my_addr, 0, sizeof(struct sockaddr_ll));
my_addr.sll_family = AF_PACKET;
my_addr.sll_protocol = htons(ETH_P_ALL);
my_addr.sll_ifindex = i_ifindex;
/* bind port */
if (bind(fd_socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll)) == -1)
{
perror("bind");
return EXIT_FAILURE;
}
/* prepare Tx ring request */
s_packet_req.tp_block_size = i_buffer_size;
s_packet_req.tp_frame_size = i_buffer_size;
s_packet_req.tp_block_nr = i_nb_buffer;
s_packet_req.tp_frame_nr = i_nb_buffer;
/* calculate memory to mmap in the kernel */
size = s_packet_req.tp_block_size * s_packet_req.tp_block_nr;
{
/* Splice flags (not laid down in stone yet). */
#ifndef SPLICE_F_MOVE
#define SPLICE_F_MOVE 0x01
#endif
#ifndef SPLICE_F_NONBLOCK
#define SPLICE_F_NONBLOCK 0x02
#endif
#ifndef SPLICE_F_MORE
#define SPLICE_F_MORE 0x04
#endif
#ifndef SPLICE_F_GIFT
#define SPLICE_F_GIFT 0x08
#endif
#ifndef __NR_splice
#define __NR_splice 313
#endif
int filedes [2];
int ret;
int to_write;
struct iovec iov;
iov.iov_base = &buffer;
iov.iov_len = 4096;
ret = pipe (filedes);
printf("fd = %d %d %d %p\n", fd, filedes[0], filedes[1], iov.iov_base);
for(i=0; i< sizeof buffer; i++)
{
buffer[i] = (char) i;
}
for(i=0; i< 500000; i++)
{
to_write = 0;
while (to_write < iov.iov_len*7) {
ret = vmsplice (filedes [1],&iov, 1, SPLICE_F_MOVE | SPLICE_F_MORE);
if (ret < 0)
{
perror("splice");
return EXIT_FAILURE;
}
else
to_write += ret;
}
while (to_write > 0) {
ret = splice (filedes [0], NULL, fd_socket,
NULL, to_write,
SPLICE_F_MOVE | SPLICE_F_MORE);
if (ret < 0)
{
perror("write splice");
return EXIT_FAILURE;
}
else
to_write -= ret;
}
}
}
return EXIT_SUCCESS;
}
--
Johann Baudy
johaahn@gmail.com
* Re: Packet mmap: TX RING and zero copy
2008-09-03 10:38 ` Johann Baudy
@ 2008-09-03 11:06 ` David Miller
2008-09-03 13:05 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: David Miller @ 2008-09-03 11:06 UTC (permalink / raw)
To: johaahn; +Cc: johnpol, netdev
From: "Johann Baudy" <johaahn@gmail.com>
Date: Wed, 3 Sep 2008 12:38:53 +0200
> I'm not able to exceed 15 Mbytes/s even with the vmsplice/splice duo.
I think you misunderstood what Evgeniy was asking of you.
He was asking how fast you can transfer data over this
interface using a normal TCP socket to a remote host,
via sendfile() or splice().
* Re: Packet mmap: TX RING and zero copy
2008-09-03 11:06 ` David Miller
@ 2008-09-03 13:05 ` Johann Baudy
2008-09-03 13:27 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-03 13:05 UTC (permalink / raw)
To: David Miller, Evgeniy Polyakov; +Cc: netdev
Sorry for the misunderstanding.
TCP socket, transferring a 20 Mbyte file (located in initramfs) in a loop
with sendfile(): 5.7 Mbytes/s.
Best regards,
Johann
--
Johann Baudy
johaahn@gmail.com
* Re: Packet mmap: TX RING and zero copy
2008-09-03 13:05 ` Johann Baudy
@ 2008-09-03 13:27 ` Evgeniy Polyakov
2008-09-03 14:57 ` Christoph Lameter
2008-09-03 15:00 ` Johann Baudy
0 siblings, 2 replies; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-03 13:27 UTC (permalink / raw)
To: Johann Baudy; +Cc: David Miller, netdev
Hi Johann.
On Wed, Sep 03, 2008 at 03:05:07PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> TCP socket, transferring 20Mbytes file (located in initramfs) in loop
> with sendfile() : 5.7Mbytes/s
And _THIS_ is a serious problem. Let's assume that sendfile is broken or
driver/hardware does not support scatter/gather and checksumming (does it?).
Can you saturate the link with pktgen (1) and a usual tcp socket (2)?
Assuming the second case fails, is it also broken because of the very
low performance of the copy from userspace?
--
Evgeniy Polyakov
* Re: Packet mmap: TX RING and zero copy
2008-09-03 13:27 ` Evgeniy Polyakov
@ 2008-09-03 14:57 ` Christoph Lameter
2008-09-03 15:00 ` Johann Baudy
1 sibling, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2008-09-03 14:57 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Johann Baudy, David Miller, netdev
Evgeniy Polyakov wrote:
> Hi Johann.
>
> On Wed, Sep 03, 2008 at 03:05:07PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
>> TCP socket, transferring 20Mbytes file (located in initramfs) in loop
>> with sendfile() : 5.7Mbytes/s
>
> And _THIS_ is a serious problem. Let's assume that sendfile is broken or
> driver/hardware does not support scatter/gather and checksumming (does it?).
> Can you saturate the link with pktgen (1) and usual tcp socket (2).
> Assuming second case will fail, does it also broken because of very
> small performance of the copy from the userspace?
Could we see the code that was used to get these numbers? The problem may just
be in the way that the calls to sendfile() have been coded.
The TX code looks intriguing. Seems that some vendors are tinkering with VNIC
ideas in order to bypass context switches and data copies. Maybe this is a
cheap way to attain the same goals?
* Re: Packet mmap: TX RING and zero copy
2008-09-03 13:27 ` Evgeniy Polyakov
2008-09-03 14:57 ` Christoph Lameter
@ 2008-09-03 15:00 ` Johann Baudy
2008-09-03 15:13 ` Evgeniy Polyakov
1 sibling, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-03 15:00 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: David Miller, netdev
Hi Evgeniy,
The driver and the hardware support DMA scatter/gather and checksum offloading.
With pktgen and the config below, I reached 85 Mbytes/s, ~ link
saturation (I've reached the same bitrate with a raw socket + the TX RING
zero copy patch):
#!/bin/sh
echo rem_device_all > /proc/net/pktgen/kpktgend_0
echo add_device eth0 > /proc/net/pktgen/kpktgend_0
echo max_before_softirq 10000 > /proc/net/pktgen/kpktgend_0
sleep 1
echo count 10000000 > /proc/net/pktgen/eth0
echo clone_skb 0 > /proc/net/pktgen/eth0
echo pkt_size 7200 > /proc/net/pktgen/eth0
echo delay 0 > /proc/net/pktgen/eth0
echo dst 192.168.0.1 > /proc/net/pktgen/eth0
echo dst_mac ff:ff:ff:ff:ff:ff > /proc/net/pktgen/eth0
echo start > /proc/net/pktgen/pgctrl
I can't saturate the link from user space with either a UDP, TCP or RAW
socket, due to copies and multiple system calls.
If the system does just one copy of the packet, it falls under
25 Mbytes/s. This is a simple memory bus which only runs at 100 MHz
for data and instructions.
I think I've understood well why my bitrate is so bad from userspace
using a normal TCP, UDP or RAW socket.
That's why I'm working on this zero copy solution (without a copy
between user and kernel or between the kernel buffer and the socket
buffer, and with a minimum of system calls).
A kind of full zero-copy sending capability, where the HW accesses the
same buffers as the user.
In fact, I'm just suggesting the symmetric counterpart of the packet mmap
I/O used for the capture process, with zero copy capability, and I need
to know what you think about it.
Thanks in advance,
Johann
--
Johann Baudy
johaahn@gmail.com
* Re: Packet mmap: TX RING and zero copy
2008-09-03 15:00 ` Johann Baudy
@ 2008-09-03 15:13 ` Evgeniy Polyakov
2008-09-03 15:58 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-03 15:13 UTC (permalink / raw)
To: Johann Baudy; +Cc: David Miller, netdev
Hi Johann.
On Wed, Sep 03, 2008 at 05:00:47PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> The driver and the hardware support DMA scater/gather and checksum offloading.
>
> with pktgen and this below config, i reached 85MBytes/s ~ link
> saturation (I've reached the same bitrate with raw socket + TX RING
> ZeroCopy patch):
> I can't saturate the link from user space with either UDP, TCP or RAW
> socket due to copies and multiple system calls.
>
> If the system is just doing one copy of the packet, it falls under
> 25Mbytes/s. This a simple memory bus which is only running at 100Mhz
> for data and instruction.
What is the bus width and is there burst mode support?
Not to point to the error in the speed calculation, just out of curiosity :)
Always liked such tiny systems...
> I think I've well understood why my bitrate is so bad from userspace
> using normal TCP,UDP or RAW socket.
> That's why I'm working on this zero copy solution (without copy
> between user and kernel or between kernel buffer and socket buffer;
> and with a minimum of system call).
> A kind of full zero-copy sending capability, HW accesses same buffers
> as the user.
But why sendfile/splice does not work the same?
It is (supposed to be) a zero-copy sending interface, which should be even
more optimal, than your ring buffer approach, since uses just single
syscall and no initialization of the data (well, there is page
population and so on, but if file is in the ramdisk, it is effectively
zero overhead). Can you run oprofile during sendfile() data transfer or
describe behaviour (i.e. CPU usage and tcpdump).
> In fact, I'm just suggesting the symmetric of packet mmap IO used for
> capture process with zero copy capability and I need to know what do
> you think about it.
Well, I'm not against this patch, but you pointed to a bug (or a wrong
initialization in your code) in sendfile, which has higher priority
imho :)
Actually, if it is indeed a bug in the splice code then (once fixed) it
could allow a simpler zero-copy solution for your problem.
--
Evgeniy Polyakov
* Re: Packet mmap: TX RING and zero copy
2008-09-03 15:13 ` Evgeniy Polyakov
@ 2008-09-03 15:58 ` Johann Baudy
2008-09-03 16:43 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-03 15:58 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: David Miller, netdev
Hi Evgeniy,
> What is the bus width and is there burst mode support?
> Not to point to the error in the speed calculation, just out of curiosity :)
> Always liked such tiny systems...
32 bits with burst support. This is a PPC 405 embedded in a Xilinx V4
FPGA (PLB bus).
>
> But why sendfile/splice does not work the same?
> It is (supposed to be) a zero-copy sending interface, which should be even
> more optimal, than your ring buffer approach, since uses just single
> syscall and no initialization of the data (well, there is page
> population and so on, but if file is in the ramdisk, it is effectively
> zero overhead). Can you run oprofile during sendfile() data transfer or
> describe behaviour (i.e. CPU usage and tcpdump).
I've never used oprofile before. I will get more logs and let you know.
Just a question: I don't want to use TCP for the final application.
Is it expected that the kernel executes packet_sendmsg() when using a
packet socket with splice()? (Because this function does a memcpy from
a buffer into a socket buffer.)
Or is there a dedicated path for splicing, or maybe only for TCP reads
(I can see that the splice_read operator is redefined as
tcp_splice_read())?
I've also faced some issues with the size of packets (it seems to be
limited to the page size). It is really important for me to send large
packets. I've just decreased the packet size in the pktgen script from
7200 to 4096 and the bitrate has fallen from 85 Mbytes/s to 50 Mbytes/s.
I understand that this is not a problem with TCP when sending a file,
since we don't really care about the exact packet size.
Do you know if there is a way to adjust the size?
And again, many thanks for your fast replies ;)
Johann Baudy
--
Johann Baudy
johaahn@gmail.com
* Re: Packet mmap: TX RING and zero copy
2008-09-03 15:58 ` Johann Baudy
@ 2008-09-03 16:43 ` Evgeniy Polyakov
2008-09-03 20:30 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-03 16:43 UTC (permalink / raw)
To: Johann Baudy; +Cc: David Miller, netdev
Hi Johann.
On Wed, Sep 03, 2008 at 05:58:50PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> > What is the bus width and is there burst mode support?
> > Not to point to the error in the speed calculation, just out of curiosity :)
> > Always liked such tiny systems...
>
> 32 bits with burst support. This is a PPC 405 embedded into Xilinx V4
> FPGA . (PLB bus)
So small PLB? Not OPB? Weird hardware :)
But nevertheless at most 400 MB/s with 100mhz, so looks like either
there is no burst mode or weird NIC hardware (or something else :)
I used to easily saturate 100mbit channel with 405gp(r) and emac driver,
which are better numbers than what you have with gige and sockets...
Actually even 405gp had much wider plb, so this could be an issue.
Likely your project will just DMA data from some sensor to a
preallocated buffer, you will add headers and send the data, so the very
small memory bus speed will not allow using sockets and thus TCP.
Having a splice-friendly setup is possible, but I think the raw socket
approach is simpler for you.
> > But why sendfile/splice does not work the same?
> > It is (supposed to be) a zero-copy sending interface, which should be even
> > more optimal, than your ring buffer approach, since uses just single
> > syscall and no initialization of the data (well, there is page
> > population and so on, but if file is in the ramdisk, it is effectively
> > zero overhead). Can you run oprofile during sendfile() data transfer or
> > describe behaviour (i.e. CPU usage and tcpdump).
>
> I've never used oprofile before. I will get more logs and let you know.
> Just a question: I don't want to use TCP for final application.
> Is it expected that the kernel execute packet_sendmsg() when using
> packet socket with splice()? (because this function is doing a memcpy
> from a buffer to a socket buffer).
No, it will use sendpage() if the hardware and driver support
scatter/gather and checksum offloading. Since you say they do, there
should be no copies at all.
> Or is there a dedicated path for splicing? or maybe only in TCP read
> (I can see that splice_read operator is redefined with
> tcp_splice_read())?
It will end up with generic_splice_sendpage() and pipe_to_sendpage().
> And I've also faced some issues with the size of packet (it seems to
> be limited to page size). It is really important for me to send large
> packet. I've just decreased the packet size of pktgen script from 7200
> to 4096 and the bitrate has fallen from 85Mbytes/s to 50Mbytes/s.
> I understand that this is not a problem with TCP when sending a file,
> we don't really care about accuracy of the packet size.
> Do you know if there is way to adjust the size ?
What do you mean by packet size? MTU/MSS? In pktgen it means the size of
the allocated skb, so it will eventually be split into smaller chunks,
and the bigger the size, the fewer allocations will be performed.
Actually, the fact that 7200 works at all is a bit surprising: your small
machine has lots of ram and is effectively unused during tests (i.e. no
other allocations). Changing it to 4k should not decrease performance at
all... Do you have jumbo frames enabled?
--
Evgeniy Polyakov
* Re: Packet mmap: TX RING and zero copy
2008-09-03 16:43 ` Evgeniy Polyakov
@ 2008-09-03 20:30 ` Johann Baudy
2008-09-03 22:03 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-03 20:30 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: David Miller, netdev
Hi Evgeniy,
> So small PLB? Not OPB? Weird hardware :)
> But nevertheless at most 400 MB/s with 100mhz, so looks like either
> there is no burst mode or weird NIC hardware (or something else :)
> I used to easily saturate 100mbit channel with 405gp(r) and emac driver,
> which are better numbers than what you have with gige and sockets...
> Actually even 405gp had much wider plb, so this could be an issue.
>
> Likley your project will just dma data from some sensor to the
> preallocated buffer, you will add headers and send the data, so very
> small memory bus speed will not allow to use sockets and thus TCP.
> Having splice-friendly setup is possible, but I think raw socket
> approach is simpler for you.
Yes, this is custom hardware (FPGA :)). There is no combined IPLB/DPLB,
only one small PLB bus at 100 MHz.
> No, it will use sendpage() if hardware and driver support scatter/gather
> and checksumm ofloading. Since you say they do, then there should be no
> copies at all.
>
> It will endup with generic_splice_sendpage() and pipe_to_sendpage().
>
Indeed, I've double checked, but pipe_to_sendpage() will end up in
packet_sendmsg():

.splice_write = generic_splice_sendpage,
  generic_splice_sendpage()
    splice_from_pipe()
      pipe_to_sendpage()       from err = actor(pipe, buf, sd);
        sock_sendpage()        from file->f_op->sendpage()
          sock_no_sendpage()   from sock->ops->sendpage()
            kernel_sendmsg()
              sock_sendmsg()
                packet_sendmsg()  from sock->ops->sendmsg();
                  memcpy() :'(
I think a non-generic splice_write function should do the job.
What do you think?
>
> What do you mean by packet size? MTU/MSS? In pktgen it means size of the
> allocated skb, so it will be eventually split into smaller chunks and the
> bigger size you have, the less allocations will be performed. Actually
> the fact, that 7200 works at all, is a bit surprising: your small
> machine has lots of ram and is effectively unused during tests (i.e. no
> other allocations). Changing it do 4k should not decrease performance at
> all... Do you have jumbo frames enabled?
>
I mean the transfer unit size (Ethernet frame length), which must be <= MTU.
Jumbo frames are enabled in the driver and the MTU is set to 7200.
I'm currently using wireshark on a remote PC to check the bitrate and format.
I think performance can decrease because the CPU will spend the same time
to send 7200 or 4096 bytes, but the DMA will not (~50 µs for 7200, ~30 µs
for 4096).
Thanks,
Johann
--
Johann Baudy
johaahn@gmail.com
* Re: Packet mmap: TX RING and zero copy
2008-09-03 20:30 ` Johann Baudy
@ 2008-09-03 22:03 ` Evgeniy Polyakov
2008-09-04 14:44 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-03 22:03 UTC (permalink / raw)
To: Johann Baudy; +Cc: David Miller, netdev
On Wed, Sep 03, 2008 at 10:30:14PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> > It will endup with generic_splice_sendpage() and pipe_to_sendpage().
> >
> Indeed, I've double checked, but pipe_to_sendpage() will end up with
> packet_sendmsg()
>
> .splice_write = generic_splice_sendpage,
> generic_splice_sendpage()
> splice_from_pipe();
> pipe_to_sendpage() from err = actor(pipe, buf, sd);
> sock_sendpage() from ile->f_op->sendpage()
> sock_no_sendpage() from sock->ops->sendpage()
> kernel_sendmsg()
> sock_sendmsg();
> packet_sendmsg() from sock->ops->sendmsg();
> memcpy() :'(
>
> I think a non-generic splice_write function should do the job.
> What do you think?
Looks like you are trying to sendfile() over a packet socket.
Both tcp and udp sockets have a sendpage method.
Or your hardware or driver does not support the needed functionality, so
tcp_sendpage() falls back to sock_no_sendpage(). From your dump I thought
it was the first case above; well, after I read it again, I found the
word packet_sendmsg(), which explains everything. Please use a tcp or udp
socket for the splice/sendfile test.
> I mean the transfer unit size (ethernet frame length) that must be <= MTU.
> Jumbo frames are enabled in the driver and mtu size is set to 7200.
> I'm currently using wireshark on a remote pc to check bitrate and format.
> I think performance can decrease because CPU will spend the same time
> to send 7200 or 4096 bytes but not the DMA.(~50µs for 7200, ~30µs for
> 4096)
If you use jumbo frames, then yes: the bigger the allocation unit is
(assuming the allocation succeeds), the higher the speed will be, so this
result is expected.
--
Evgeniy Polyakov
* Re: Packet mmap: TX RING and zero copy
2008-09-03 22:03 ` Evgeniy Polyakov
@ 2008-09-04 14:44 ` Johann Baudy
2008-09-05 7:17 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-04 14:44 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: David Miller, netdev
Hi Evgeniy,
> Looks like you try to sendfile() over packet socket.
> Both tcp and udp sockets have sendpage method.
>
> Or your hardware or driver do not support needed fucntionality, so
> tcp_sendpage() falls back to sock_no_sendpage(). From your dump I think
> it is the first case above. Well, after I read it again, I found word
> packet_sendmsg(), which explains everything. Please use tcp or udp
> socket for splice/sendfile test.
>
I'm finally able to run a full zero copy mechanism with a UDP socket, as
you said. Unfortunately, I need at least one vmsplice() system call per
UDP packet.
A mere vmsplice() (memory to pipe) costs a lot (80 µs of CPU), and the
splice() (pipe to socket) call is worse...
80 µs is approximately the duration of 12 Kbytes sent at 1 Gbps, and I
need to send packets of 7200 bytes (with no fragmentation)...
Unfortunately I can't use this mechanism; I've only reached 20 Mbytes/s.
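
For reference, the per-packet loop being measured is essentially the
following sketch, a condensed variant of the test program posted earlier
in this thread, adapted to a UDP socket. Here udp_fd is assumed to be a
connected SOCK_DGRAM socket and pipefd a pipe(2) pair set up by the
caller; each 7200-byte packet costs at least the two system calls
discussed above:

/* Sketch only: the caller provides udp_fd (connected UDP socket) and
 * pipefd (a pipe(2) pair). */
#define _GNU_SOURCE
#include <fcntl.h>      /* vmsplice(), splice(), SPLICE_F_* */
#include <sys/types.h>
#include <sys/uio.h>

static int send_one_packet(int udp_fd, int pipefd[2], void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t n;

	/* move the user pages into the pipe: one syscall per packet */
	n = vmsplice(pipefd[1], &iov, 1, SPLICE_F_MOVE);
	if (n != (ssize_t)len)
		return -1;

	/* push the pipe content to the UDP socket: a second syscall */
	while (len > 0) {
		n = splice(pipefd[0], NULL, udp_fd, NULL, len, SPLICE_F_MOVE);
		if (n <= 0)
			return -1;
		len -= (size_t)n;
	}
	return 0;
}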
You can find below an FTRACE of vmsplice(), in case you find something
abnormal ... :)
(The 80 µs result is an average of the vmsplice() duration measured with
gettimeofday(), WITHOUT FTRACE IN THE KERNEL CONFIG.)
main-849 [00] .. 1 4154502892.139088: sys_gettimeofday <-ret_from_syscall
main-849 [00] .. 1 4154502892.139090: do_gettimeofday <-sys_gettimeofday
main-849 [00] .. 1 4154502892.139092: getnstimeofday <-do_gettimeofday
main-849 [00] .. 1 4154502892.139100: sys_vmsplice <-ret_from_syscall
main-849 [00] .. 1 4154502892.139107: fget_light <-sys_vmsplice
main-849 [00] .. 1 4154502892.139118: rt_down_read <-sys_vmsplice
main-849 [00] .. 1 4154502892.139120: __rt_down_read <-rt_down_read
main-849 [00] .. 1 4154502892.139124: rt_mutex_down_read <-__rt_down_read
main-849 [00] .. 1 4154502892.139132: pagefault_disable <-sys_vmsplice
main-849 [00] .. 1 4154502892.139136: pagefault_enable <-sys_vmsplice
main-849 [00] .. 1 4154502892.139141: get_user_pages <-sys_vmsplice
main-849 [00] .. 1 4154502892.139147: find_extend_vma <-get_user_pages
main-849 [00] .. 1 4154502892.139150: find_vma <-find_extend_vma
main-849 [00] .. 1 4154502892.139158: _cond_resched <-get_user_pages
main-849 [00] .. 1 4154502892.139161: follow_page <-get_user_pages
main-849 [00] .. 1 4154502892.139165: rt_spin_lock <-follow_page
main-849 [00] .. 1 4154502892.139167: __rt_spin_lock <-rt_spin_lock
main-849 [00] .. 1 4154502892.139171: vm_normal_page <-follow_page
main-849 [00] .. 1 4154502892.139176: mark_page_accessed <-follow_page
main-849 [00] .. 1 4154502892.139180: rt_spin_unlock <-follow_page
main-849 [00] .. 1 4154502892.139185: flush_dcache_page <-get_user_pages
main-849 [00] .. 1 4154502892.139192: rt_up_read <-sys_vmsplice
main-849 [00] .. 1 4154502892.139194: rt_mutex_up_read <-rt_up_read
main-849 [00] .. 1 4154502892.139203: splice_to_pipe <-sys_vmsplice
main-849 [00] .. 1 4154502892.139206: _mutex_lock <-splice_to_pipe
main-849 [00] .. 1 4154502892.139209: rt_mutex_lock <-_mutex_lock
main-849 [00] .. 1 4154502892.139217: _mutex_unlock <-splice_to_pipe
main-849 [00] .. 1 4154502892.139221: rt_mutex_unlock <-_mutex_unlock
main-849 [00] .. 1 4154502892.139224: kill_fasync <-splice_to_pipe
main-849 [00] .. 1 4154502892.139235: sys_gettimeofday <-ret_from_syscall
main-849 [00] .. 1 4154502892.139237: do_gettimeofday <-sys_gettimeofday
main-849 [00] .. 1 4154502892.139239: getnstimeofday <-do_gettimeofday
So, I will return to work on my circular buffer.
That way I can control the (ethernet frame length) * (number of frames) /
(number of system calls) ratio.
Thanks to the splice kernel code and pktgen code analyses, I've also found a
clean way to perform zero copy between my circular buffer and the socket
buffer. I will test it and let you know the changes and results.
Many thanks for your help,
Johann Baudy
--
Johann Baudy
johaahn@gmail.com
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Packet mmap: TX RING and zero copy
2008-09-04 14:44 ` Johann Baudy
@ 2008-09-05 7:17 ` Evgeniy Polyakov
[not found] ` <7e0dd21a0809050216r65b8f08fm1ad0630790a13a54@mail.gmail.com>
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-05 7:17 UTC (permalink / raw)
To: Johann Baudy; +Cc: David Miller, netdev
Hi Johann.
On Thu, Sep 04, 2008 at 04:44:15PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> I'm finally able to run a full zero copy mechanism with UDP socket as you said.
> Unfortunately, I need at least one vmsplice() system call per UDP
> packet (vmsplice call()).
> mere vmsplice(mem to pipe) cost much (80µs of CPU). And splice(pipe to
> socket) call is worst...
> 80us is approximately the duration of 12Kbytes sent at 1Gbps. As I
> need to send packet of 7200bytes (with no frag)...
> I can't use this mechanism unfortunaltely. I've only reached 20Mbytes/s.
vmsplice() can be slow; try injecting the header via a usual send() call, or
better, do not use it at all for testing.
> You can find below a FTRACE of vmsplice(), if you find something
> abnormal ... :) :
> (80µs result is an average of vmsplice() duration thanks to
> gettimeofday(): WITHOUT FTRACE IN KERNEL CONFIG)
The amount of gettimeofday() and friends is excessive, but that can be the trace
tool itself. kill_fasync() also took too much time (the top CPU user
is at the bottom, I suppose?), do you use SIGIO? Also, the VMA traversal and page
checking are not what will be done in the network code and your project, so
that also adds overhead. Please try without vmsplice() at all; plain
splice()/sendfile() _has_ to saturate the link, otherwise we have a
serious problem.
> So, I will return to work on my circular buffer.
> This way I can control (ethernet frame length)*(number of frame)/
> (number of system call) ratio.
Not to distract you from the project, but you can still do the same with
existing methods and a smaller amount of work. But I should be the last one saying
that creating tricky hacks to implement the idea should be abandoned in
favour of the standard (even if slower) methods :)
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Fwd: Packet mmap: TX RING and zero copy
[not found] ` <7e0dd21a0809050216r65b8f08fm1ad0630790a13a54@mail.gmail.com>
@ 2008-09-05 9:17 ` Johann Baudy
2008-09-05 11:31 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-05 9:17 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev
Hi Evgeniy,
> vmsplice() can be slow, try to inject header via usual send() call, or
> better do not use it at all for testing.
>
vmsplice() is short in comparison to splice() ~ 200 µs!
This was just to show you that even this vmsplice() duration of 80 µs,
needed for each packet, is too long when it sends only 1 packet.
I really need a mechanism that allows sending ~40 packets of 7200 bytes
in one system call, to keep some CPU resources free to do other things
(not spending all the time in kernel layers :)).
>
> Amount of gettimofday() and friends is excessive, but it can be a trace
> tool itself.
I've only observed a small performance difference between runs with and without
gettimeofday() (< 1 MB/s). I've used it to do a light FTRACE and to get the
duration of vmsplice().
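(The measurement itself is nothing fancy -- roughly this, as a sketch, with the
pipe and iovec assumed to be set up elsewhere:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/time.h>
#include <sys/uio.h>

/* Return the duration of a single vmsplice() call in microseconds. */
static long time_one_vmsplice(int pipe_wr, const struct iovec *iov)
{
	struct timeval t0, t1;

	gettimeofday(&t0, NULL);
	vmsplice(pipe_wr, iov, 1, 0);
	gettimeofday(&t1, NULL);
	return (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
}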
> kill_fasync() also took too much time (top CPU user
> is at bottom I suppose?), do you use SIGIO? Also vma traveling and page
> checking is not what will be done in network code and your project, so
> it also adds an overhead.
Between kill_fasync() and sys_gettimeofday(), I thought we had returned
to user space.
No SIGIO. But FYI, I use the PREEMPT_RT patch.
>Please try without vmsplice() at all, usual
> splice()/sendfile() _has_ to saturate the link, otherwise we have a
> serious problem.
I've already tried sendfile() alone with a standard TCP/UDP socket. I did
not saturate the link; the bitrate was around the same.
>
> Not to distract you from the project, but you still can do the same with
> existing methods and smaller amount of work. But I should be last saying
> that creating tricky hacks to implement the idea should be abandoned in
> favour of the standards (even slow) methods :)
>
I understand your point that common solutions are always better than
multiple hacks.
But I think I have the same motivation as the packet mmap I/O developers.
That feature was introduced to make the raw socket capture process
efficient. I just want to reach the same goal for transmission using the
same mechanism.
We use those features only when we need performance at the driver level.
Thanks,
Johann
--
Johann Baudy
johaahn@gmail.com
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Packet mmap: TX RING and zero copy
2008-09-02 18:27 Packet mmap: TX RING and zero copy Johann Baudy
2008-09-02 19:46 ` Evgeniy Polyakov
@ 2008-09-05 10:28 ` Robert Iakobashvili
2008-09-05 13:06 ` Johann Baudy
1 sibling, 1 reply; 39+ messages in thread
From: Robert Iakobashvili @ 2008-09-05 10:28 UTC (permalink / raw)
To: Johann Baudy; +Cc: netdev, Ulisses Alonso Camaró
Hi Johann,
On Tue, Sep 2, 2008 at 9:27 PM, Johann Baudy <johaahn@gmail.com> wrote:
> I've made lot of tests, playing with jumbo frames, raw sockets, ...
> I've never exceeded ~25Mbytes/s. So I've decided to analyze deeply the
> packet socket transmission process.
>
> The main blocking point was the memcpy_fromiovec() function that is
> located in the packet_sendmsg() of af_packet.c.
> It was consuming all my CPU resources to copy data from user space to
> socket buffer.
> Then I've started to work on a hack that makes this transfer possible
> without any memcpys.
>
> Mainly, the hack is the implementation of two "features":
>
> * Sending packet through a circular buffer between user and
> kernel space that minimizes the number of system calls. (Feature
> actually implemented for capture process, libpcap ..).
Something like this has been done in the PF_RING socket,
which is part of the ntop project infrastructure.
Take care.
Truly,
Robert Iakobashvili
......................................................................
www.ghotit.com
Assistive technology that understands you
......................................................................
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-05 9:17 ` Fwd: " Johann Baudy
@ 2008-09-05 11:31 ` Evgeniy Polyakov
2008-09-05 12:44 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-05 11:31 UTC (permalink / raw)
To: Johann Baudy; +Cc: netdev
Hi Johann.
On Fri, Sep 05, 2008 at 11:17:07AM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> > vmsplice() can be slow, try to inject header via usual send() call, or
> > better do not use it at all for testing.
> >
> vmsplice() is short in comparison to splice() ~ 200us !
> This was just to show you that even this vmpslice duration of 80us
> that is needed for each packet is too long to send only 1 packet.
> I really need a mechanism that allow sending of ~ 40 packets of 7200K
> in one system call to keep some cpu ressources to do other things.
> (Not spending time in kernel layers :))
Hmmm... splice()/sendfile() should be able to send the whole file in a
single syscall. This looks like a problem in userspace.
> > kill_fasync() also took too much time (top CPU user
> > is at bottom I suppose?), do you use SIGIO? Also vma traveling and page
> > checking is not what will be done in network code and your project, so
> > it also adds an overhead.
>
> Between kill_fasync() sys_gettimeofday() , I thought that we returned
> to user space.
> No SIGIO. But FYI, I use PREEMPT_RT patch.
Does it also push softirq processing into threads?
> >Please try without vmsplice() at all, usual
> > splice()/sendfile() _has_ to saturate the link, otherwise we have a
> > serious problem.
>
> I've already tried sendfile only with standard TCP/UDP socket. I've
> not saturated the link.
> Around same bitrate.
This worries me a lot: sendfile() should be a single syscall which creates
network packets very optimally, taking into account the MTU and hardware
capabilities. I do believe it is a problem with the userspace code.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-05 11:31 ` Evgeniy Polyakov
@ 2008-09-05 12:44 ` Johann Baudy
2008-09-05 13:16 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-05 12:44 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev
Hi Evgeniy,
>> vmsplice() is short in comparison to splice() ~ 200us !
>> This was just to show you that even this vmpslice duration of 80us
>> that is needed for each packet is too long to send only 1 packet.
>> I really need a mechanism that allow sending of ~ 40 packets of 7200K
>> in one system call to keep some cpu ressources to do other things.
>> (Not spending time in kernel layers :))
>
> Hmmm... splice()/sendfile() shuold be able to send the whole file in
> single syscall. This looks like a problem in the userspace.
>
I was talking about vmsplice()/splice(), which seems to me the only way
to send a user buffer (pages) to a socket without a copy.
>> > kill_fasync() also took too much time (top CPU user
>> > is at bottom I suppose?), do you use SIGIO? Also vma traveling and page
>> > checking is not what will be done in network code and your project, so
>> > it also adds an overhead.
>>
>> Between kill_fasync() sys_gettimeofday() , I thought that we returned
>> to user space.
>> No SIGIO. But FYI, I use PREEMPT_RT patch.
>
> Does it also push softirq processing into threads?
>
I don't understand your point; how can I check?
I'm not handling any IRQs in this test software.
>> >Please try without vmsplice() at all, usual
>> > splice()/sendfile() _has_ to saturate the link, otherwise we have a
>> > serious problem.
>>
>> I've already tried sendfile only with standard TCP/UDP socket. I've
>> not saturated the link.
>> Around same bitrate.
>
> This worries me a lot: sendfile should be a single syscall which very
> optimally creates network packets getting into account MTU and hardware
> capabilities. I do belive it is a problem with userspace code.
>
Yes, and this is what it does (only one single syscall):
no printf, only one sendfile() of a 10MB file over a TCP socket.
To summarize the ongoing test status:
with vmsplice()/splice(), I need to do multiple calls of vmsplice() and
one call of splice() - the ratio seems to be limited by the pipe capacity
(16 pages: 64K)
- each vmsplice() call specifies the size of the UDP packet,
which means 1 syscall per packet :(
- in UDP, the bitrate is < 20MB/s
with sendfile(), only one system call of 10MB in TCP (in UDP I have to
split into 61440-byte chunks).
- in TCP, the bitrate is limited by the remote end
- in UDP, the 61440-byte limit is far below my 40 packets of
7200 bytes, which would let me saturate the link (for 2ms only...)
with my circular buffer
- the bitrate is < 20MB/s
Thanks,
Johann
--
Johann Baudy
johaahn@gmail.com
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Packet mmap: TX RING and zero copy
2008-09-05 10:28 ` Robert Iakobashvili
@ 2008-09-05 13:06 ` Johann Baudy
0 siblings, 0 replies; 39+ messages in thread
From: Johann Baudy @ 2008-09-05 13:06 UTC (permalink / raw)
To: Robert Iakobashvili; +Cc: netdev, Ulisses Alonso Camaró
Thanks Robert,
The architecture of PF_RING seems really similar to packet mmap
I/O for optimizing the capture process.
Is it planned to replace it?
I'll try it to check the performance.
Best regards,
Johann
On Fri, Sep 5, 2008 at 12:28 PM, Robert Iakobashvili
<coroberti@gmail.com> wrote:
> Hi Johann,
>
> On Tue, Sep 2, 2008 at 9:27 PM, Johann Baudy <johaahn@gmail.com> wrote:
>> I've made lot of tests, playing with jumbo frames, raw sockets, ...
>> I've never exceeded ~25Mbytes/s. So I've decided to analyze deeply the
>> packet socket transmission process.
>>
>> The main blocking point was the memcpy_fromiovec() function that is
>> located in the packet_sendmsg() of af_packet.c.
>> It was consuming all my CPU resources to copy data from user space to
>> socket buffer.
>> Then I've started to work on a hack that makes this transfer possible
>> without any memcpys.
>>
>> Mainly, the hack is the implementation of two "features":
>>
>> * Sending packet through a circular buffer between user and
>> kernel space that minimizes the number of system calls. (Feature
>> actually implemented for capture process, libpcap ..).
>
> Something like this has been done in PF_RING socket,
> which is a part of ntop project infra.
>
> Take care.
>
> Truly,
> Robert Iakobashvili
> ......................................................................
> www.ghotit.com
> Assistive technology that understands you
> ......................................................................
>
--
Johann Baudy
johaahn@gmail.com
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-05 12:44 ` Johann Baudy
@ 2008-09-05 13:16 ` Evgeniy Polyakov
2008-09-05 13:29 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-05 13:16 UTC (permalink / raw)
To: Johann Baudy; +Cc: netdev
Hi Johann.
On Fri, Sep 05, 2008 at 02:44:47PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> I was talking about vmsplice()/splice() which seems to me the only way
> to send user buffer (pages) to socket without copy.
Yes, and we found that vmsplice() is slow, so it is unacceptable.
What about a single, big enough file being sent via sendfile()?
Can it saturate the link in your setup?
> >> No SIGIO. But FYI, I use PREEMPT_RT patch.
> >
> > Does it also push softirq processing into threads?
> >
> I don't understand your point, how can I check?
> I'm not handling any IRQ in this test software.
You do - network receive processing happens in softirq context.
As far as I have heard, the RT patchset pushes processing into threads, and thus
softirqs will be pushed there too. In your particular case it should not
be an issue because of the UDP usage and the likely lack of replies, and even if it
were, it would affect both packet socket ring processing and the higher-level
paths. I pointed it out not to show a slowdown.
> > This worries me a lot: sendfile should be a single syscall which very
> > optimally creates network packets getting into account MTU and hardware
> > capabilities. I do belive it is a problem with userspace code.
>
> Yes, and this is what it does (only one single syscall).
> No printf, only one sendfile of 10MB file over TCP socket
>
> To resume ongoing test status:
> with vmsplice()/splice() I need to do multiple call of vmsplice() and
> one call of splice() - ratio seems to be limited to the pipe capacity
> (16 pages: 64K)
> - vmsplice call specify the size
> of the udp packet which means 1 syscall per packet :(
> - In UDP, Bitrate is < 20MB/s
>
>
> with sendfile() only one system call of 10MB in TCP (in UDP I have to
> split in 61440 bytes).
> - In TCP bitrate is limited due to remote
> - In UDP, this 61440 bytes
> limit which is really inferior to my 7200*40 packets that allows me
> to saturate the link (during 2ms only...) with my circular buffer.
> - Bitrate is < 20MB/s
It looks like you are sending data in chunks, which you should not do.
Why do you splice the input file into 60k chunks for UDP?
sendfile() should iterate over the whole file in page chunks, pack them
into the splice buffer (16 pages), send them, then get the next set of pages
and so on... udp_sendpage() will properly allocate skbs for the given
MTU, or append data to the end of an skb if there is room.
You should _not_ manually interrupt this process and call sendfile()
multiple times with different offsets and small sizes.
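In user space that boils down to something like this (a sketch only; 'sock' and
'fd' are assumed to be a connected socket and an open input file):

#include <sys/sendfile.h>
#include <sys/types.h>

/* Let the kernel walk the whole file itself; only loop to cope with
 * partial returns, never to impose our own chunk size. */
static int send_whole_file(int sock, int fd, off_t size)
{
	off_t off = 0;

	while (off < size) {
		ssize_t n = sendfile(sock, fd, &off, size - off);
		if (n <= 0)
			return -1;	/* error or unexpected end of file */
	}
	return 0;
}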
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-05 13:16 ` Evgeniy Polyakov
@ 2008-09-05 13:29 ` Johann Baudy
2008-09-05 13:37 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-05 13:29 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev
Hi Evgeniy,
> Yes, and we found, that vmsplice() is slow, so unacceptible.
> What about single big enough file being sent via sendfile()?
> Can it saturate the link in your setup?
I'm trying to, but so far I have not been able to saturate the link.
>
> It looks like you are sending data in chunks, which you should not do.
> Why do you splice input file into 60k chunks for UDP?
> sendfile() should iterate over whole file in page chunks, pack them
> into splice buffer (16 pages), send them, then get next set of pages
> and so on... udp_sendpage() will properly allocate skbs for the given
> MTU, or append data to the end of the skb if there is a place.
>
> You should _not_ manually interrupt this process and call sendfile()
> multiple times with different offsets and small sizes.
>
So it seems that there is something wrong in UDP here, because TCP
works properly with the same code.
If the size argument of sendfile() exceeds 61440, sendfile() returns 61440
and sends no data to the device ...
I will try to investigate it, but for my app sendfile() is not
workable. Only vmsplice() could be, but it is too slow.
Thanks,
Johann
--
Johann Baudy
johaahn@gmail.com
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-05 13:29 ` Johann Baudy
@ 2008-09-05 13:37 ` Evgeniy Polyakov
2008-09-05 13:55 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-05 13:37 UTC (permalink / raw)
To: Johann Baudy; +Cc: netdev
On Fri, Sep 05, 2008 at 03:29:25PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> So, it seems that there is something wrong in UDP here, because TCP
> works properly with same code.
> If size argument of sendfile() exceed 61440, sendfile() returns 61440
> and send no data to the device ...
That's a bug. Likely no one sends file content via UDP, so it was not
detected...
> I will try to investigate it but for my app sendfile() is not
> conceivable. Only vmsplice could, but too slow.
Well, you can mmap an empty file into RAM, lock it there, DMA data from
the sensor into the mmapped area, add headers (just in the mapped area)
and then sendfile() that file. Single syscall, zero copy, standard
interfaces :)
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-05 13:37 ` Evgeniy Polyakov
@ 2008-09-05 13:55 ` Johann Baudy
2008-09-05 14:19 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-05 13:55 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev
Hi Evgeniy,
> That's a bug. Likely no one sends file content via udp, so it was not
> detected...
OK, I will run an FTRACE of this issue on Monday.
>> I will try to investigate it but for my app sendfile() is not
>> conceivable. Only vmsplice could, but too slow.
>
> Well, you can mmap empty file into the RAM, lock it there, DMA data from
> the sensor to the mmapped area, add headers (just in the mapped area)
> and then sendfile that file.. Single syscall, zero-copy, standard
> interfaces :)
>
OK, but it seems there is no way to control the packet format
(beginning of packet and size of packet).
Each packet must start with a specific header (in my app). This is a
kind of streaming.
Thanks,
Johann
--
Johann Baudy
johaahn@gmail.com
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-05 13:55 ` Johann Baudy
@ 2008-09-05 14:19 ` Evgeniy Polyakov
2008-09-05 14:45 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-05 14:19 UTC (permalink / raw)
To: Johann Baudy; +Cc: netdev
Hi Johann.
On Fri, Sep 05, 2008 at 03:55:14PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> > That's a bug. Likely no one sends file content via udp, so it was not
> > detected...
>
> Ok, I'll will run a FTRACE of this issue on monday;
No need to run FTRACE; the code should be audited and probably some debug
prints added to determine why sendfile() decides to exit early with
UDP. I will try to do it this weekend if time permits, although I'm
quite surprised sendfile() does not work with UDP...
> > Well, you can mmap empty file into the RAM, lock it there, DMA data from
> > the sensor to the mmapped area, add headers (just in the mapped area)
> > and then sendfile that file.. Single syscall, zero-copy, standard
> > interfaces :)
>
> OK but it seems that there is no way to control packet format.
> (beginning of packet and size of packet).
> Each packet must start with a specific header (on my app). This is a
> kind of streaming.
You can always keep a global offset for where to put the next packet.
You can adjust it to put a header before each data frame, and then DMA the
frame content according to that offset.
A transmitting packet socket is needed for those who want to implement
their own low-level protocol unsupported by the kernel; to transfer data
over UDP or TCP over IP at the highest speeds, one should use the
existing methods. This of course does not mean that anyone _has_ to do
it that way; it is always great fun to find new ways, like your patch.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-05 14:19 ` Evgeniy Polyakov
@ 2008-09-05 14:45 ` Johann Baudy
2008-09-05 14:59 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-05 14:45 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev
Hi Evgeniy,
>
> No need to run FTRACE, code shuld be audited and probably some debug
> prints added to determine, why sendfile() decides to exit early wiht
> UDP. I will try to do it if time permits this weekend, although I'm
> quite surprised sendfile() does not work with UDP...
>
I've finally made the test.
The packet does not go through the device because of this check in
ip_append_page():
if (inet->cork.length + size > 0xFFFF - fragheaderlen) {
ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu);
return -EMSGSIZE;
}
inet->cork.length has reached 61448 when this failure occurs, with
size = 4096
fragheaderlen = 20
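(For reference, plugging in those values: 0xFFFF - fragheaderlen = 65535 - 20 =
65515, and inet->cork.length + size = 61448 + 4096 = 65544, which is greater than
65515; so appending one more page would exceed the maximum IP datagram size and
ip_append_page() bails out with -EMSGSIZE.)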
>
> You can always provide a global offset where to put next packet.
> You can ajust it to put header before each data frame, and then DMA
> frame content according to that offset.
>
> Transmitting packet socket is needed for those, who wants to implement
> own low-level protocol unsupported by the kernel, so to transfer data
> over UDP or TCP over IP with the highests speeds, one should use
> existing methods. This does not of course mean, that anyone _has_ to do
> it, it is always very fun to find new ways like your patch.
>
What do you mean by a global offset?
Thanks,
Johann
--
Johann Baudy
johaahn@gmail.com
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-05 14:45 ` Johann Baudy
@ 2008-09-05 14:59 ` Evgeniy Polyakov
2008-09-05 15:30 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-05 14:59 UTC (permalink / raw)
To: Johann Baudy; +Cc: netdev
Hi.
On Fri, Sep 05, 2008 at 04:45:13PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> I've finally made the test:
> Packet is not going through device due to this test:
> if (inet->cork.length + size > 0xFFFF - fragheaderlen) {
> ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu);
> return -EMSGSIZE;
> }
> in ip_append_page()
>
> inet->cork.length reach 61448 then this failure occurs
> size = 4096
> fragheaderlen = 20
Well, udp_sendpage() needs to be extended to only append the page when there
is enough free space left, and otherwise push the given frame and create the next
packet.
> > You can always provide a global offset where to put next packet.
> > You can ajust it to put header before each data frame, and then DMA
> > frame content according to that offset.
> >
> > Transmitting packet socket is needed for those, who wants to implement
> > own low-level protocol unsupported by the kernel, so to transfer data
> > over UDP or TCP over IP with the highests speeds, one should use
> > existing methods. This does not of course mean, that anyone _has_ to do
> > it, it is always very fun to find new ways like your patch.
> >
>
> What do you mean with global offset ?
I meant that you get a pointer by mapping some file in tmpfs (for example)
and then use an offset variable to track where you put your last data
(either a packet header, or the data itself), so that any subsequent write to
that area (either a new packet header or DMA data placement) puts the
data just after the previous chunk. Then, after you have put a number of
headers and the corresponding data chunks, you can call sendfile() and reset the
offset to the beginning of the mapped area.
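(Something along these lines, as a rough user-space sketch; the file path, sizes
and the memcpy() standing in for the DMA transfer are all illustrative:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch: map a tmpfs-backed file, pack "header + frame" records back to
 * back at a running offset, then push the whole batch with one sendfile(). */
static int send_batch(int sock, const char *path, size_t area_size,
		      const void *hdr, size_t hdr_len,
		      const void *frame, size_t frame_len, int nframes)
{
	off_t off = 0, send_off = 0;
	void *area;
	int i, fd;

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0 || ftruncate(fd, area_size) < 0)
		return -1;
	area = mmap(NULL, area_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (area == MAP_FAILED)
		return -1;
	mlock(area, area_size);		/* keep it resident for DMA */

	for (i = 0; i < nframes; i++) {
		memcpy((char *)area + off, hdr, hdr_len);	/* per-packet header */
		off += hdr_len;
		memcpy((char *)area + off, frame, frame_len);	/* stand-in for the DMA'd frame */
		off += frame_len;
	}

	/* one syscall for the whole batch, then the offset can be reset */
	while (send_off < off)
		if (sendfile(sock, fd, &send_off, off - send_off) <= 0)
			return -1;
	return 0;
}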
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-05 14:59 ` Evgeniy Polyakov
@ 2008-09-05 15:30 ` Johann Baudy
2008-09-05 15:38 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-05 15:30 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev
Hi Evgeniy,
> Well, udp_sendpage() needs to be extended to only append page when there
> is anough free space there, otherwise push given frame and create next
> packet.
>
OK, I'll do a patch and let you know the result.
>
> I meant you get a pointer by mapping some file in tmpfs (for example)
> and then use some offset variable to store where you put your last data
> (either packet header, or data itself), so that any subsequent write to
> that area (either new packet header or dma data placement) would put
> data just after the previous chunk. Thus after you have put number of
> headers and appropriate data chunks, you could call sendfile() and reset
> offset to the beginning of the mapped area.
If I understand well, there is then no link between the start of the Ethernet
frame and the packet header?
The app protocol must support packet loss ^^
Thanks,
Johann
--
Johann Baudy
johaahn@gmail.com
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-05 15:30 ` Johann Baudy
@ 2008-09-05 15:38 ` Evgeniy Polyakov
2008-09-05 16:01 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-05 15:38 UTC (permalink / raw)
To: Johann Baudy; +Cc: netdev
Hi.
On Fri, Sep 05, 2008 at 05:30:39PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> > Well, udp_sendpage() needs to be extended to only append page when there
> > is anough free space there, otherwise push given frame and create next
> > packet.
> >
> Ok, I'll do a patch and let you know result.
Great, thank you. But it should take into account the nature of UDP: data is
not allowed to be split across IP packets as it is with TCP.
> > I meant you get a pointer by mapping some file in tmpfs (for example)
> > and then use some offset variable to store where you put your last data
> > (either packet header, or data itself), so that any subsequent write to
> > that area (either new packet header or dma data placement) would put
> > data just after the previous chunk. Thus after you have put number of
> > headers and appropriate data chunks, you could call sendfile() and reset
> > offset to the beginning of the mapped area.
>
> If I understand well, there is no link between start of ethernet frame
> and packet header ?
The Ethernet header is appended by the network core itself; the core will likely
just allocate an skb with a small data area, put the Ethernet and UDP/IP
headers there and attach the pages from the file. If the hardware does not
support checksumming and scatter/gather, things will be different.
> App protocol must support packet loss ^^
I think the matter of packet loss is just the same here as with
any other sending method.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-05 15:38 ` Evgeniy Polyakov
@ 2008-09-05 16:01 ` Johann Baudy
2008-09-05 16:34 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-05 16:01 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev
Hi Evgeniy,
> Great, thank you. But it should take into account UDP nature: data is
> not allowed to be split between ip packets like with TCP.
I'm leaving the office right now.
I'll get back to you on this topic on Monday.
> Ethernet header is appended by the network core itself, likely core will
> just allocate skb with small data area, put there an ethernet and udp/ip
> headers and attach pages from the file. If hardware does not support
> checksumming and scatter/gather, things will be different.
>
>> App protocol must support packet loss ^^
>
> I think matter of packet loss relevance here is just the same like with
> any other sending method.
>
OK, I see. I just need to check whether UDP is fine for my application or
whether I need to do my own L3/L4.
Many thanks again,
Have a nice weekend,
Johann
--
Johann Baudy
johaahn@gmail.com
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-05 16:01 ` Johann Baudy
@ 2008-09-05 16:34 ` Evgeniy Polyakov
2008-09-08 10:21 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-05 16:34 UTC (permalink / raw)
To: Johann Baudy; +Cc: netdev
Hi Johann.
On Fri, Sep 05, 2008 at 06:01:46PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> I'm leaving office right now.
> I'll come back toward you regarding this topic on Monday.
No problem.
> Ok, I see. I just need to check if UDP is fine for my application or
> if i need to do my own L3/L4.
This may be a bad sign. Or extremely needed step like in satellite links
with its huge rtts. Likely in case of ethernet usage tcp/udp over IP is
the way to go.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-05 16:34 ` Evgeniy Polyakov
@ 2008-09-08 10:21 ` Johann Baudy
2008-09-08 11:26 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-08 10:21 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev
Hi Evgeniy,
I've made a test with the patch below (with and without UDP fragmentation):
without UDP fragmentation, the packet size is almost always equal to
PAGE_SIZE due to my MTU limit (2*PACKET_SIZE > MTU);
with UDP fragmentation, the kernel sends multiple fragmented packets
of 61448 bytes.
Unfortunately, in both cases the bitrate is still 15-20 MB/s :(
According to wireshark, the kernel sends 60KB over 9 packets, nothing
for ~5ms, 60KB again, and so on. Strange... the kernel seems to spend its
time during push(). Is there a blocking call somewhere?
Thanks in advance,
Johann
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -743,7 +743,28 @@ int udp_sendpage(struct sock *sk, struct page
*page, int offset,
size_t size, int flags)
{
struct udp_sock *up = udp_sk(sk);
+ struct inet_sock *inet = inet_sk(sk);
int ret;
+ int mtu = inet->cork.fragsize;
+ int fragheaderlen;
+ struct ip_options *opt = NULL;
+
+ if (inet->cork.flags & IPCORK_OPT)
+ opt = inet->cork.opt;
+
+ fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
+
+ // With UDP fragmentation
+ if (inet->cork.length + size >= 0xFFFF - fragheaderlen) {
+ // Without UDP fragmentation
+ // if( (inet->cork.length + size) > mtu) {
+ lock_sock(sk);
+ ret = udp_push_pending_frames(sk);
+ release_sock(sk);
+ if (ret) {
+ return 0;
+ }
+ }
--
Johann Baudy
johaahn@gmail.com
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-08 10:21 ` Johann Baudy
@ 2008-09-08 11:26 ` Evgeniy Polyakov
2008-09-08 13:01 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-08 11:26 UTC (permalink / raw)
To: Johann Baudy; +Cc: netdev
Hi Johann.
On Mon, Sep 08, 2008 at 12:21:16PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> I've made a test with below patch (with and without UDP fragmentation):
>
> without UDP fragmentation, packet size are almost always equal to
> PAGE_SIZE due to my mtu limit (2*PACKET_SIZE > mtu).
> with UDP fragmentation, kernel is sending multiple fragmented packets
> of 61448Kbytes.
>
> Unfortunately, in both case, bitrate is still 15-20 MB/s :(
> According to wireshark, kernel sends 60KB over 9 packets, nothing
> during ~5ms, 60KB and so on. strange ... kernel seems to spend its
> time during push(). Is there a blocking call somewhere ?
Are you sure that it is udp_push_pending_frames() and not some splice
waiting?
> Thanks in advance,
> Johann
>
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
>
> @@ -743,7 +743,28 @@ int udp_sendpage(struct sock *sk, struct page
> *page, int offset,
> size_t size, int flags)
> {
> struct udp_sock *up = udp_sk(sk);
> + struct inet_sock *inet = inet_sk(sk);
> int ret;
> + int mtu = inet->cork.fragsize;
> + int fragheaderlen;
> + struct ip_options *opt = NULL;
> +
> + if (inet->cork.flags & IPCORK_OPT)
> + opt = inet->cork.opt;
This has to be checked under socket lock.
> + fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
> +
> + // With UDP fragmentation
> + if (inet->cork.length + size >= 0xFFFF - fragheaderlen) {
> + // Without UDP fragmentation
> + // if( (inet->cork.length + size) > mtu) {
This also should be protected. Two threads can simultaneously check
inet->cork.length and both succeed.
> + lock_sock(sk);
> + ret = udp_push_pending_frames(sk);
> + release_sock(sk);
> + if (ret) {
> + return 0;
> + }
> + }
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-08 11:26 ` Evgeniy Polyakov
@ 2008-09-08 13:01 ` Johann Baudy
2008-09-08 15:28 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-08 13:01 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev
Hi Evgeniy,
> Are you sure that it is udp_push_pending_frames() and not some splice
> waiting?
>
No, I'm not sure.
Are there any queue or allocator limits that could slow down the bitrate
through this function?
I mean something that needs one transfer to finish before a new one can start.
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -743,7 +743,30 @@ int udp_sendpage(struct sock *sk, struct page
*page, int offset,
size_t size, int flags)
{
struct udp_sock *up = udp_sk(sk);
+ struct inet_sock *inet = inet_sk(sk);
int ret;
+ int mtu, fragheaderlen;
+ struct ip_options *opt = NULL;
+
+ lock_sock(sk);
+ mtu = inet->cork.fragsize;
+
+ if (inet->cork.flags & IPCORK_OPT)
+ opt = inet->cork.opt;
+
+ fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
+
+ // With UDP fragmentation
+ if (inet->cork.length + size >= 0xFFFF - fragheaderlen) {
+ // Without UDP fragmentation
+ // if( (inet->cork.length + size) > mtu) {
+ ret = udp_push_pending_frames(sk);
+ if (ret) {
+ release_sock(sk);
+ return 0;
+ }
+ }
+ release_sock(sk);
if (!up->pending) {
struct msghdr msg = { .msg_flags = flags|MSG_MORE };
Please find above the patch with your corrections.
Must we use the MTU limit instead of the offset limit, since you said that
splitting UDP data across IP packets must be avoided?
If yes, can I force the size value forwarded to ip_append_page() in order
to fill the whole packet? Or will this not be handled properly by all
callers?
Thanks in advance,
Johann
--
Johann Baudy
johaahn@gmail.com
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-08 13:01 ` Johann Baudy
@ 2008-09-08 15:28 ` Evgeniy Polyakov
2008-09-08 15:38 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-08 15:28 UTC (permalink / raw)
To: Johann Baudy; +Cc: netdev
Hi Johann.
On Mon, Sep 08, 2008 at 03:01:04PM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> > Are you sure that it is udp_push_pending_frames() and not some splice
> > waiting?
> >
> No, I'm not sure.
> Are there any queue or allocator limits that can slow the bitrate
> through this function?
No.
> I mean something that will need end of transfer to start a new one.
No, there should not be any such code path.
What is CPU usage on sender when it sends data via UDP sendfile()?
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-08 15:28 ` Evgeniy Polyakov
@ 2008-09-08 15:38 ` Evgeniy Polyakov
2008-09-09 23:11 ` Johann Baudy
0 siblings, 1 reply; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-08 15:38 UTC (permalink / raw)
To: Johann Baudy; +Cc: netdev
On Mon, Sep 08, 2008 at 07:28:35PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> What is CPU usage on sender when it sends data via UDP sendfile()?
Actually, we can determine the culprit by putting a loop into
udp_sendpage() which sends the same data. If the receiver sees the same
delays, the problem is in the UDP sending path, otherwise it is in the splice code.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-08 15:38 ` Evgeniy Polyakov
@ 2008-09-09 23:11 ` Johann Baudy
2008-09-10 6:09 ` Evgeniy Polyakov
0 siblings, 1 reply; 39+ messages in thread
From: Johann Baudy @ 2008-09-09 23:11 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev
Hi Evgeniy,
We've performed more tests with the additional, mandatory processes of
our system running. Bitrate and CPU performance were very, very low. This
result leads us to rethink the system design so that it meets the
specification targets. I'll continue these tests once the new design is
ready.
Concerning the udp_sendpage() patch: if you agree, I will propose it
to the community in a new thread.
Many thanks for your help,
Johann
On Mon, Sep 8, 2008 at 5:38 PM, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> On Mon, Sep 08, 2008 at 07:28:35PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
>> What is CPU usage on sender when it sends data via UDP sendfile()?
>
> Actually we can determine the culprit via putting a loop into
> udp_sendpage(), which will send the same data. if receiver will see the
> same delayes, problem in the udp sending path, otherwise in splice code.
>
> --
> Evgeniy Polyakov
>
--
Johann Baudy
johaahn@gmail.com
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Fwd: Packet mmap: TX RING and zero copy
2008-09-09 23:11 ` Johann Baudy
@ 2008-09-10 6:09 ` Evgeniy Polyakov
0 siblings, 0 replies; 39+ messages in thread
From: Evgeniy Polyakov @ 2008-09-10 6:09 UTC (permalink / raw)
To: Johann Baudy; +Cc: netdev
Hi Johann.
On Wed, Sep 10, 2008 at 01:11:31AM +0200, Johann Baudy (johaahn@gmail.com) wrote:
> We've performed more tests with additional and mandatory processes of
> our system. Bitrate and CPU performance were very very low. This
> result leads us to brainstorm on a new system design that meets
> specifications target. I'll continue these tests once new design
> ready.
If the CPU usage was small, something slept somewhere, but so far we do not
know what or where. It would be great to understand why sendfile() is
so slow with UDP, especially when the memory bandwidth is very small, and in
particular whether it is the UDP sending path itself or the splice code.
> Concerning the udp_sendpage() patch: if you agree, I will propose it
> to the community in a new thread.
Sure, but please remove the commented-out code lines instead of leaving them commented.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread
Thread overview: 39+ messages
-- links below jump to the message on this page --
2008-09-02 18:27 Packet mmap: TX RING and zero copy Johann Baudy
2008-09-02 19:46 ` Evgeniy Polyakov
2008-09-03 7:56 ` Johann Baudy
2008-09-03 10:38 ` Johann Baudy
2008-09-03 11:06 ` David Miller
2008-09-03 13:05 ` Johann Baudy
2008-09-03 13:27 ` Evgeniy Polyakov
2008-09-03 14:57 ` Christoph Lameter
2008-09-03 15:00 ` Johann Baudy
2008-09-03 15:13 ` Evgeniy Polyakov
2008-09-03 15:58 ` Johann Baudy
2008-09-03 16:43 ` Evgeniy Polyakov
2008-09-03 20:30 ` Johann Baudy
2008-09-03 22:03 ` Evgeniy Polyakov
2008-09-04 14:44 ` Johann Baudy
2008-09-05 7:17 ` Evgeniy Polyakov
[not found] ` <7e0dd21a0809050216r65b8f08fm1ad0630790a13a54@mail.gmail.com>
2008-09-05 9:17 ` Fwd: " Johann Baudy
2008-09-05 11:31 ` Evgeniy Polyakov
2008-09-05 12:44 ` Johann Baudy
2008-09-05 13:16 ` Evgeniy Polyakov
2008-09-05 13:29 ` Johann Baudy
2008-09-05 13:37 ` Evgeniy Polyakov
2008-09-05 13:55 ` Johann Baudy
2008-09-05 14:19 ` Evgeniy Polyakov
2008-09-05 14:45 ` Johann Baudy
2008-09-05 14:59 ` Evgeniy Polyakov
2008-09-05 15:30 ` Johann Baudy
2008-09-05 15:38 ` Evgeniy Polyakov
2008-09-05 16:01 ` Johann Baudy
2008-09-05 16:34 ` Evgeniy Polyakov
2008-09-08 10:21 ` Johann Baudy
2008-09-08 11:26 ` Evgeniy Polyakov
2008-09-08 13:01 ` Johann Baudy
2008-09-08 15:28 ` Evgeniy Polyakov
2008-09-08 15:38 ` Evgeniy Polyakov
2008-09-09 23:11 ` Johann Baudy
2008-09-10 6:09 ` Evgeniy Polyakov
2008-09-05 10:28 ` Robert Iakobashvili
2008-09-05 13:06 ` Johann Baudy