netdev.vger.kernel.org archive mirror
* [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
@ 2008-10-27  9:33 Johann Baudy
  2008-10-28 22:44 ` David Miller
  0 siblings, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-10-27  9:33 UTC (permalink / raw)
  To: netdev

New packet socket feature that makes the packet socket more efficient for transmission.
- It reduces the number of system calls through a PACKET_TX_RING mechanism, based on PACKET_RX_RING (a circular buffer allocated in kernel space and mmapped from user space).
- It minimizes CPU copies by using fragmented SKBs.

Signed-off-by: Johann Baudy <johann.baudy@gnu-log.net>

--
Hi All,

Before going on with the submission process, I would like to get your opinion on the points below:

1#:
To implement this feature, each packet header of the circular buffer can be set to one of these three statuses (a minimal user-space view of the cycle is sketched after the list):
- TP_STATUS_USER: the packet buffer needs to be transmitted.
In this state, the kernel:
 - Allocates a new skb
 - Attaches the packet buffer pages
 - Sets the skb destructor
 - Stores the packet header pointer or index in the skb
 - Changes the packet header status to TP_STATUS_COPY
 - Sends it to the device

- TP_STATUS_COPY: packet transmission is ongoing.
In this state, the skb destructor (called during skb release):
 - Gets the packet header pointer linked to the skb pointer
 - Changes the packet header status to TP_STATUS_KERNEL

- TP_STATUS_KERNEL: the packet buffer has been transmitted and is ready for the user.
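
For illustration, here is a minimal user-space view of this cycle. It is only a sketch: it assumes a socket already set up with PACKET_TX_RING and mmapped as described in the documentation patch below, a single-block ring, and TPACKET_V1 with the data buffer starting at TPACKET_HDRLEN (the data offset is an assumption of this sketch):

#include <linux/if_packet.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>

/* Push one frame through TX ring slot 'idx' and hand it to the kernel.
 * 'ring' is the mmapped buffer, 'fd' the AF_PACKET socket, 'frame_size'
 * matches tp_frame_size of the tpacket_req used at setup time. */
static int send_one(int fd, char *ring, unsigned int idx,
		    unsigned int frame_size,
		    const void *pkt, unsigned int pkt_len)
{
	struct tpacket_hdr *hdr = (struct tpacket_hdr *)(ring + idx * frame_size);
	struct pollfd pfd = { .fd = fd, .events = POLLOUT };

	/* TP_STATUS_KERNEL: the slot has been transmitted (or never used)
	 * and belongs to user space again. */
	while (hdr->tp_status != TP_STATUS_KERNEL)
		poll(&pfd, 1, -1);

	/* Fill the data buffer, then flag the frame TP_STATUS_USER so the
	 * next send() picks it up. */
	memcpy((char *)hdr + TPACKET_HDRLEN, pkt, pkt_len);
	hdr->tp_len = pkt_len;
	hdr->tp_status = TP_STATUS_USER;

	/* Kernel side: USER -> COPY while the skb is in flight, then the
	 * skb destructor sets the slot back to TP_STATUS_KERNEL. */
	return send(fd, NULL, 0, 0);
}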

As you can see, this skb destructor needs the packet header pointer associated with the sk_buff in order to change the header status.
I first used skb->cb (control buffer) to forward this parameter to the destructor.
Unfortunately, it seems to be overwritten by the packet queueing mechanism, so I finally chose skb->mark to store the buffer index.

Can I use skb->mark in such conditions?
If not, what is the best solution:
- a new field in the sk_buff struct?
- an sk_buff pointer array [NB_FRAME] (one index matching one skb pointer)? (parsing it is not really good for cpu load)
- something else?

2#:
Do I need to protect the send() procedure with some locks?
Especially when changing the status from TP_STATUS_USER to TP_STATUS_COPY, to prevent the kernel from sending a packet buffer twice
(if two send() calls are made from different threads on SMP, for example)?
Or is it implicit?
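
To make the concern concrete, claiming a frame inside the send() loop is a two-step check-then-mark sequence; here is a condensed view of the relevant lines from tpacket_snd() in the patch below (the interleaving comment is mine):

	ph = packet_current_tx_frame(po, TP_STATUS_USER);  /* find a frame flagged by the user */
	if (unlikely(ph == NULL))
		continue;
	/* without a lock, a second thread running send() can still see this
	 * frame as TP_STATUS_USER here and pick it up as well */
	__packet_set_status(po, ph, TP_STATUS_COPY);        /* claim it for transmission */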

Thanks in advance,
Johann Baudy

--
  Documentation/networking/packet_mmap.txt |  132 ++++++++--
 include/linux/if_packet.h                |    3 +-
 net/packet/af_packet.c                   |  422 +++++++++++++++++++++++++-----
 3 files changed, 470 insertions(+), 87 deletions(-)

diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 07c53d5..32f5c33 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -4,8 +4,8 @@
 
 This file documents the CONFIG_PACKET_MMAP option available with the PACKET
 socket interface on 2.4 and 2.6 kernels. This type of sockets is used for 
-capture network traffic with utilities like tcpdump or any other that uses 
-the libpcap library. 
+capture network traffic with utilities like tcpdump or any other that needs
+raw access to network interface.
 
 You can find the latest version of this document at
 
@@ -14,6 +14,7 @@ You can find the latest version of this document at
 Please send me your comments to
 
     Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
+    Johann Baudy <johann.baudy@gnu-log.net> (TX RING)
 
 -------------------------------------------------------------------------------
 + Why use PACKET_MMAP
@@ -25,19 +26,24 @@ to capture each packet, it requires two if you want to get packet's
 timestamp (like libpcap always does).
 
 In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size 
-configurable circular buffer mapped in user space. This way reading packets just 
-needs to wait for them, most of the time there is no need to issue a single 
-system call. By using a shared buffer between the kernel and the user 
-also has the benefit of minimizing packet copies.
-
-It's fine to use PACKET_MMAP to improve the performance of the capture process, 
-but it isn't everything. At least, if you are capturing at high speeds (this 
-is relative to the cpu speed), you should check if the device driver of your 
-network interface card supports some sort of interrupt load mitigation or 
-(even better) if it supports NAPI, also make sure it is enabled.
+configurable circular buffer mapped in user space that can be used to either
+send or receive packets. This way reading packets just needs to wait for them,
+most of the time there is no need to issue a single system call. Concerning
+transmission, multiple packets can be sent through one system call to get the
+highest bandwidth.
+By using a shared buffer between the kernel and the user also has the benefit
+of minimizing packet copies.
+
+It's fine to use PACKET_MMAP to improve the performance of the capture and
+transmission process, but it isn't everything. At least, if you are capturing
+at high speeds (this is relative to the cpu speed), you should check if the
+device driver of your network interface card supports some sort of interrupt
+load mitigation or (even better) if it supports NAPI, also make sure it is
+enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
+supported by devices of your network.
 
 --------------------------------------------------------------------------------
-+ How to use CONFIG_PACKET_MMAP
++ How to use CONFIG_PACKET_MMAP to improve capture process
 --------------------------------------------------------------------------------
 
 From the user standpoint, you should use the higher level libpcap library, which
@@ -57,7 +63,7 @@ the low level details or want to improve libpcap by including PACKET_MMAP
 support.
 
 --------------------------------------------------------------------------------
-+ How to use CONFIG_PACKET_MMAP directly
++ How to use CONFIG_PACKET_MMAP directly to improve capture process
 --------------------------------------------------------------------------------
 
 From the system calls stand point, the use of PACKET_MMAP involves
@@ -66,6 +72,7 @@ the following process:
 
 [setup]     socket() -------> creation of the capture socket
             setsockopt() ---> allocation of the circular buffer (ring)
+                              option: PACKET_RX_RING
             mmap() ---------> mapping of the allocated buffer to the
                               user process
 
@@ -97,13 +104,75 @@ also the mapping of the circular buffer in the user process and
 the use of this buffer.
 
 --------------------------------------------------------------------------------
++ How to use CONFIG_PACKET_MMAP directly to improve transmission process
+--------------------------------------------------------------------------------
+Transmission process is similar to capture as shown below.
+
+[setup]          socket() -------> creation of the transmission socket
+                 setsockopt() ---> allocation of the circular buffer (ring)
+                                   option: PACKET_TX_RING
+                 bind() ---------> bind transmission socket with a network interface
+                 mmap() ---------> mapping of the allocated buffer to the
+                                   user process
+
+[transmission]   poll() ---------> wait for free packets (optional)
+                 send() ---------> send all packets that are set as ready in
+                                   the ring
+                                   The flag MSG_DONTWAIT can be used to return
+                                   before end of transfer.
+
+[shutdown]  close() --------> destruction of the transmission socket and
+                              deallocation of all associated resources.
+
+Binding the socket to your network interface is mandatory (with zero copy) to
+know the header size of frames used in the circular buffer.
+
+As capture, each frame contains two parts:
+
+ --------------------
+| struct tpacket_hdr | Header. It contains the status
+|                    | of this frame
+|--------------------|
+| data buffer        |
+.                    .  Data that will be sent over the network interface.
+.                    .
+ --------------------
+
+ bind() associates the socket to your network interface thanks to
+ sll_ifindex parameter of struct sockaddr_ll.
+
+ Initialization example:
+
+ struct sockaddr_ll my_addr;
+ struct ifreq s_ifr;
+ ...
+
+ strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
+
+ /* get interface index of eth0 */
+ ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
+
+ /* fill sockaddr_ll struct to prepare binding */
+ my_addr.sll_family = AF_PACKET;
+ my_addr.sll_protocol = ETH_P_ALL;
+ my_addr.sll_ifindex =  s_ifr.ifr_ifindex;
+
+ /* bind socket to eth0 */
+ bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
+
+ A complete tutorial is available at: http://wiki.gnu-log.net/
+
+--------------------------------------------------------------------------------
 + PACKET_MMAP settings
 --------------------------------------------------------------------------------
 
 
 To setup PACKET_MMAP from user level code is done with a call like
 
+ - Capture process
      setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
+ - Transmission process
+     setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
 
 The most significant argument in the previous call is the req parameter, 
 this parameter must to have the following structure:
@@ -117,11 +186,11 @@ this parameter must to have the following structure:
     };
 
 This structure is defined in /usr/include/linux/if_packet.h and establishes a 
-circular buffer (ring) of unswappable memory mapped in the capture process. 
+circular buffer (ring) of unswappable memory.
 Being mapped in the capture process allows reading the captured frames and 
 related meta-information like timestamps without requiring a system call.
 
-Captured frames are grouped in blocks. Each block is a physically contiguous 
+Frames are grouped in blocks. Each block is a physically contiguous
 region of memory and holds tp_block_size/tp_frame_size frames. The total number 
 of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because
 
@@ -336,6 +405,7 @@ struct tpacket_hdr). If this field is 0 means that the frame is ready
 to be used for the kernel, If not, there is a frame the user can read 
 and the following flags apply:
 
++++ Capture process:
      from include/linux/if_packet.h
 
      #define TP_STATUS_COPY          2 
@@ -391,6 +461,36 @@ packets are in the ring:
 It doesn't incur in a race condition to first check the status value and 
 then poll for frames.
 
+
+++ Transmission process
+Those defines are also used for transmission:
+
+     #define TP_STATUS_KERNEL        0 // Frame is available
+     #define TP_STATUS_USER          1 // Frame will be sent on next send()
+     #define TP_STATUS_COPY          2 // Frame is currently in transmission
+
+First, the kernel initializes all frames to TP_STATUS_KERNEL. To send a packet,
+the user fills a data buffer of an available frame, sets tp_len to current
+data buffer size and sets its status field to TP_STATUS_USER. This can be done
+on multiple frames. Once the user is ready to transmit, it calls send().
+Then all buffers with status equal to TP_STATUS_USER are forwarded to the
+network device. The kernel updates each status of sent frames with
+TP_STATUS_COPY until the end of transfer.
+At the end of each transfer, buffer status returns to TP_STATUS_KERNEL.
+
+    header->tp_len = in_i_size;
+    header->tp_status = TP_STATUS_USER;
+    retval = send(this->socket, NULL, 0, 0);
+
+The user can also use poll() to check if a buffer is available:
+(status == TP_STATUS_KERNEL)
+
+    struct pollfd pfd;
+    pfd.fd = fd;
+    pfd.revents = 0;
+    pfd.events = POLLOUT;
+    retval = poll(&pfd, 1, timeout);
+
 --------------------------------------------------------------------------------
 + THANKS
 --------------------------------------------------------------------------------
diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
index 18db066..f6a247a 100644
--- a/include/linux/if_packet.h
+++ b/include/linux/if_packet.h
@@ -46,6 +46,7 @@ struct sockaddr_ll
 #define PACKET_VERSION			10
 #define PACKET_HDRLEN			11
 #define PACKET_RESERVE			12
+#define PACKET_TX_RING			13
 
 struct tpacket_stats
 {
@@ -100,7 +101,7 @@ struct tpacket2_hdr
 enum tpacket_versions
 {
 	TPACKET_V1,
-	TPACKET_V2,
+	TPACKET_V2
 };
 
 /*
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index c718e7e..c52462a 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -156,7 +156,23 @@ struct packet_mreq_max
 };
 
 #ifdef CONFIG_PACKET_MMAP
-static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing);
+static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
+		int closing, int tx_ring);
+
+struct packet_ring_buffer {
+	char *			*pg_vec;
+	unsigned int		head;
+	unsigned int		frames_per_block;
+	unsigned int		frame_size;
+	unsigned int		frame_max;
+
+	unsigned int		pg_vec_order;
+	unsigned int		pg_vec_pages;
+	unsigned int		pg_vec_len;
+};
+
+struct packet_sock;
+static int tpacket_snd(struct packet_sock *po, struct msghdr *msg);
 #endif
 
 static void packet_flush_mclist(struct sock *sk);
@@ -166,12 +182,10 @@ struct packet_sock {
 	struct sock		sk;
 	struct tpacket_stats	stats;
 #ifdef CONFIG_PACKET_MMAP
-	char *			*pg_vec;
-	unsigned int		head;
-	unsigned int            frames_per_block;
-	unsigned int		frame_size;
-	unsigned int		frame_max;
+	struct packet_ring_buffer	rx_ring;
+	struct packet_ring_buffer	tx_ring;
 	int			copy_thresh;
+	atomic_t		tx_pending_skb;
 #endif
 	struct packet_type	prot_hook;
 	spinlock_t		bind_lock;
@@ -183,9 +197,6 @@ struct packet_sock {
 	struct packet_mclist	*mclist;
 #ifdef CONFIG_PACKET_MMAP
 	atomic_t		mapped;
-	unsigned int            pg_vec_order;
-	unsigned int		pg_vec_pages;
-	unsigned int		pg_vec_len;
 	enum tpacket_versions	tp_version;
 	unsigned int		tp_hdrlen;
 	unsigned int		tp_reserve;
@@ -204,8 +215,10 @@ struct packet_skb_cb {
 
 #ifdef CONFIG_PACKET_MMAP
 
-static void *packet_lookup_frame(struct packet_sock *po, unsigned int position,
-				 int status)
+static void *packet_lookup_frame(struct packet_sock *po,
+		struct packet_ring_buffer *buff,
+		unsigned int position,
+		int status)
 {
 	unsigned int pg_vec_pos, frame_offset;
 	union {
@@ -214,25 +227,50 @@ static void *packet_lookup_frame(struct packet_sock *po, unsigned int position,
 		void *raw;
 	} h;
 
-	pg_vec_pos = position / po->frames_per_block;
-	frame_offset = position % po->frames_per_block;
+	pg_vec_pos = position / buff->frames_per_block;
+	frame_offset = position % buff->frames_per_block;
 
-	h.raw = po->pg_vec[pg_vec_pos] + (frame_offset * po->frame_size);
+	h.raw = buff->pg_vec[pg_vec_pos] + (frame_offset * buff->frame_size);
 	switch (po->tp_version) {
 	case TPACKET_V1:
-		if (status != h.h1->tp_status ? TP_STATUS_USER :
-						TP_STATUS_KERNEL)
+		if (status != h.h1->tp_status)
 			return NULL;
 		break;
 	case TPACKET_V2:
-		if (status != h.h2->tp_status ? TP_STATUS_USER :
-						TP_STATUS_KERNEL)
+		if (status != h.h2->tp_status)
 			return NULL;
 		break;
 	}
 	return h.raw;
 }
 
+static inline void *packet_current_rx_frame(struct packet_sock *po, int status)
+{
+	return packet_lookup_frame(po, &po->rx_ring, po->rx_ring.head, status);
+}
+
+static inline void *packet_current_tx_frame(struct packet_sock *po, int status)
+{
+	return packet_lookup_frame(po, &po->tx_ring, po->tx_ring.head, status);
+}
+
+static inline void *packet_previous_rx_frame(struct packet_sock *po, int status)
+{
+	unsigned int previous = po->rx_ring.head ? po->rx_ring.head - 1 : po->rx_ring.frame_max;
+	return packet_lookup_frame(po, &po->rx_ring, previous, status);
+}
+
+static inline void *packet_previous_tx_frame(struct packet_sock *po, int status)
+{
+	unsigned int previous = po->tx_ring.head ? po->tx_ring.head - 1 : po->tx_ring.frame_max;
+	return packet_lookup_frame(po, &po->tx_ring, previous, status);
+}
+
+static inline void packet_increment_head(struct packet_ring_buffer *buff)
+{
+	buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
+}
+
 static void __packet_set_status(struct packet_sock *po, void *frame, int status)
 {
 	union {
@@ -646,7 +684,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe
 		macoff = netoff - maclen;
 	}
 
-	if (macoff + snaplen > po->frame_size) {
+	if (macoff + snaplen > po->rx_ring.frame_size) {
 		if (po->copy_thresh &&
 		    atomic_read(&sk->sk_rmem_alloc) + skb->truesize <
 		    (unsigned)sk->sk_rcvbuf) {
@@ -659,16 +697,16 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe
 			if (copy_skb)
 				skb_set_owner_r(copy_skb, sk);
 		}
-		snaplen = po->frame_size - macoff;
+		snaplen = po->rx_ring.frame_size - macoff;
 		if ((int)snaplen < 0)
 			snaplen = 0;
 	}
 
 	spin_lock(&sk->sk_receive_queue.lock);
-	h.raw = packet_lookup_frame(po, po->head, TP_STATUS_KERNEL);
+	h.raw = packet_current_rx_frame(po, TP_STATUS_KERNEL);
 	if (!h.raw)
 		goto ring_is_full;
-	po->head = po->head != po->frame_max ? po->head+1 : 0;
+	packet_increment_head(&po->rx_ring);
 	po->stats.tp_packets++;
 	if (copy_skb) {
 		status |= TP_STATUS_COPY;
@@ -759,10 +797,212 @@ ring_is_full:
 	goto drop_n_restore;
 }
 
-#endif
+static void tpacket_destruct_skb(struct sk_buff *skb)
+{
+	struct packet_sock *po = pkt_sk(skb->sk);
+	void * ph;
 
+	BUG_ON(skb == NULL);
+	ph = packet_lookup_frame( po, &po->tx_ring, skb->mark, TP_STATUS_COPY);
 
-static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
+	BUG_ON(ph == NULL);
+	BUG_ON(atomic_read(&po->tx_pending_skb) == 0);
+
+	atomic_dec(&po->tx_pending_skb);
+	__packet_set_status(po, ph, TP_STATUS_KERNEL);
+
+	sock_wfree(skb);
+}
+
+static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff * skb, void * frame,
+		struct net_device *dev, int size_max, __be16 proto,
+		unsigned char * addr)
+{
+	union {
+		struct tpacket_hdr *h1;
+		struct tpacket2_hdr *h2;
+		void *raw;
+	} ph;
+	int to_write, offset, len, tp_len;
+	struct socket *sock = po->sk.sk_socket;
+	struct page *page;
+	void *data;
+	int err;
+
+
+	ph.raw = frame;
+
+	skb->protocol = proto;
+	skb->dev = dev;
+	skb->priority = po->sk.sk_priority;
+	skb->destructor = tpacket_destruct_skb;
+	skb->mark = po->tx_ring.head;
+
+	switch(po->tp_version) {
+	case TPACKET_V2:
+		tp_len = ph.h1->tp_len;
+		break;
+	default:
+		tp_len = ph.h1->tp_len;
+		break;
+	}
+
+	if (unlikely(tp_len > size_max)) {
+		printk(KERN_ERR "packet size is too long (%d > %d)\n",
+				tp_len, size_max);
+		return -EMSGSIZE;
+	}
+
+	skb_reserve(skb, LL_RESERVED_SPACE(dev));
+	data = ph.raw + po->tp_hdrlen;
+
+	if (sock->type == SOCK_DGRAM) {
+		err = dev_hard_header(skb, dev, ntohs(proto), addr,
+				NULL, tp_len);
+		if (unlikely(err < 0))
+			return -EINVAL;
+	} else if (dev->hard_header_len ) {
+		/* net device doesn't like empty head */
+		if(unlikely(tp_len <= dev->hard_header_len)) {
+			printk(KERN_ERR "packet size is too short "
+					"(%d < %d)\n", tp_len,
+					dev->hard_header_len);
+			return -EINVAL;
+		}
+
+		skb_push(skb, dev->hard_header_len);
+		err = skb_store_bits(skb, 0, data,
+				dev->hard_header_len);
+		if (unlikely(err))
+			return err;
+	}
+
+	err = -EFAULT;
+	to_write = tp_len - dev->hard_header_len;
+	data += dev->hard_header_len;
+	page = virt_to_page(data);
+	len = ((to_write > PAGE_SIZE) ? PAGE_SIZE : to_write);
+
+	offset = (int)((long)data & (~PAGE_MASK));
+	len -= offset;
+
+	skb->data_len = to_write;
+	skb->len += to_write;
+
+	while ( likely(to_write) ) {
+		get_page(page);
+		skb_fill_page_desc(skb,
+				skb_shinfo(skb)->nr_frags,
+				page++, offset, len);
+		to_write -= len;
+		len = (to_write > PAGE_SIZE) ? PAGE_SIZE : to_write;
+		offset = 0;
+	}
+
+
+	return tp_len;
+}
+
+static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
+{
+	struct socket *sock;
+	struct sk_buff *skb;
+	struct net_device *dev;
+	__be16 proto;
+	int ifindex, err, reserve = 0;
+	void * ph;
+	struct sockaddr_ll *saddr=(struct sockaddr_ll *)msg->msg_name;
+	int tp_len, size_max;
+	unsigned char *addr;
+	int len_sum = 0;
+
+	BUG_ON(po == NULL);
+	sock = po->sk.sk_socket;
+
+	if (saddr == NULL) {
+		ifindex	= po->ifindex;
+		proto	= po->num;
+		addr	= NULL;
+	} else {
+		err = -EINVAL;
+		if (msg->msg_namelen < sizeof(struct sockaddr_ll))
+			goto out;
+		if (msg->msg_namelen < (saddr->sll_halen + offsetof(struct sockaddr_ll, sll_addr)))
+			goto out;
+		ifindex	= saddr->sll_ifindex;
+		proto	= saddr->sll_protocol;
+		addr	= saddr->sll_addr;
+	}
+
+	dev = dev_get_by_index(sock_net(&po->sk), ifindex);
+	err = -ENXIO;
+	if (unlikely(dev == NULL))
+		goto out;
+
+	err = -EINVAL;
+	if (unlikely(sock->type != SOCK_RAW))
+		goto out_unlock;
+
+	reserve = dev->hard_header_len;
+
+	err = -ENETDOWN;
+	if (unlikely(!(dev->flags & IFF_UP)))
+		goto out_unlock;
+
+	size_max = po->tx_ring.frame_size - sizeof(struct skb_shared_info)
+		- po->tp_hdrlen - LL_ALLOCATED_SPACE(dev);
+
+	if (size_max > dev->mtu + reserve)
+		size_max = dev->mtu + reserve;
+
+
+	do
+	{
+		ph = packet_current_tx_frame(po, TP_STATUS_USER);
+		if(unlikely(ph == NULL))
+			continue;
+
+		skb = sock_alloc_send_skb(&po->sk, LL_ALLOCATED_SPACE(dev),
+				msg->msg_flags & MSG_DONTWAIT, &err);
+		if (unlikely(skb == NULL))
+			goto out_unlock;
+
+		__packet_set_status(po, ph, TP_STATUS_COPY);
+		atomic_inc(&po->tx_pending_skb);
+
+		tp_len = tpacket_fill_skb(po, skb, ph, dev, size_max, proto,
+				addr);
+		if(unlikely(tp_len < 0)) {
+			err = tp_len;
+			goto out_free;
+		}
+
+		err = dev_queue_xmit(skb);
+		if (unlikely(err > 0 && (err = net_xmit_errno(err)) != 0))
+			goto out_free;
+
+		packet_increment_head(&po->tx_ring);
+		len_sum += tp_len;
+	}
+	while(likely((ph != NULL)
+				|| ((!(msg->msg_flags & MSG_DONTWAIT))
+					&& atomic_read(&po->tx_pending_skb))));
+
+	err = len_sum;
+	goto out_unlock;
+
+out_free:
+	__packet_set_status(po, ph, TP_STATUS_USER);
+	atomic_dec(&po->tx_pending_skb);
+	kfree_skb(skb);
+out_unlock:
+	dev_put(dev);
+out:
+	return err;
+}
+#endif
+
+static int packet_snd(struct socket *sock,
 			  struct msghdr *msg, size_t len)
 {
 	struct sock *sk = sock->sk;
@@ -853,6 +1093,19 @@ out:
 	return err;
 }
 
+static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *msg, size_t len)
+{
+#ifdef CONFIG_PACKET_MMAP
+	struct sock *sk = sock->sk;
+	struct packet_sock *po = pkt_sk(sk);
+	if (po->tx_ring.pg_vec)
+		return tpacket_snd(po, msg);
+	else
+#endif
+		return packet_snd(sock, msg, len);
+}
+
 /*
  *	Close a PACKET socket. This is fairly simple. We immediately go
  *	to 'closed' state and remove our protocol entry in the device list.
@@ -891,10 +1144,13 @@ static int packet_release(struct socket *sock)
 	packet_flush_mclist(sk);
 
 #ifdef CONFIG_PACKET_MMAP
-	if (po->pg_vec) {
+	{
 		struct tpacket_req req;
 		memset(&req, 0, sizeof(req));
-		packet_set_ring(sk, &req, 1);
+		if (po->rx_ring.pg_vec)
+			packet_set_ring(sk, &req, 1, 0);
+		if (po->tx_ring.pg_vec)
+			packet_set_ring(sk, &req, 1, 1);
 	}
 #endif
 
@@ -1411,6 +1667,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 
 #ifdef CONFIG_PACKET_MMAP
 	case PACKET_RX_RING:
+	case PACKET_TX_RING:
 	{
 		struct tpacket_req req;
 
@@ -1418,7 +1675,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 			return -EINVAL;
 		if (copy_from_user(&req,optval,sizeof(req)))
 			return -EFAULT;
-		return packet_set_ring(sk, &req, 0);
+		return packet_set_ring(sk, &req, 0, optname == PACKET_TX_RING);
 	}
 	case PACKET_COPY_THRESH:
 	{
@@ -1438,7 +1695,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 
 		if (optlen != sizeof(val))
 			return -EINVAL;
-		if (po->pg_vec)
+		if (po->rx_ring.pg_vec || po->tx_ring.pg_vec)
 			return -EBUSY;
 		if (copy_from_user(&val, optval, sizeof(val)))
 			return -EFAULT;
@@ -1457,7 +1714,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 
 		if (optlen != sizeof(val))
 			return -EINVAL;
-		if (po->pg_vec)
+		if (po->rx_ring.pg_vec || po->tx_ring.pg_vec)
 			return -EBUSY;
 		if (copy_from_user(&val, optval, sizeof(val)))
 			return -EFAULT;
@@ -1701,13 +1958,17 @@ static unsigned int packet_poll(struct file * file, struct socket *sock,
 	unsigned int mask = datagram_poll(file, sock, wait);
 
 	spin_lock_bh(&sk->sk_receive_queue.lock);
-	if (po->pg_vec) {
-		unsigned last = po->head ? po->head-1 : po->frame_max;
-
-		if (packet_lookup_frame(po, last, TP_STATUS_USER))
+	if (po->rx_ring.pg_vec) {
+		if (packet_previous_rx_frame(po, TP_STATUS_USER))
 			mask |= POLLIN | POLLRDNORM;
 	}
 	spin_unlock_bh(&sk->sk_receive_queue.lock);
+	spin_lock_bh(&sk->sk_write_queue.lock);
+	if (po->tx_ring.pg_vec) {
+		if (packet_current_tx_frame(po, TP_STATUS_KERNEL))
+			mask |= POLLOUT | POLLWRNORM;
+	}
+	spin_unlock_bh(&sk->sk_write_queue.lock);
 	return mask;
 }
 
@@ -1783,20 +2044,24 @@ out_free_pgvec:
 	goto out;
 }
 
-static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing)
+static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing, int tx_ring)
 {
 	char **pg_vec = NULL;
 	struct packet_sock *po = pkt_sk(sk);
 	int was_running, order = 0;
+	struct packet_ring_buffer *rb;
+	struct sk_buff_head *rb_queue;
 	__be16 num;
 	int err = 0;
 
+	rb = tx_ring ? &po->tx_ring : &po->rx_ring;
+	rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
+
 	if (req->tp_block_nr) {
 		int i;
 
 		/* Sanity tests and some calculations */
-
-		if (unlikely(po->pg_vec))
+		if (unlikely(rb->pg_vec))
 			return -EBUSY;
 
 		switch (po->tp_version) {
@@ -1813,16 +2078,16 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing
 		if (unlikely(req->tp_block_size & (PAGE_SIZE - 1)))
 			return -EINVAL;
 		if (unlikely(req->tp_frame_size < po->tp_hdrlen +
-						  po->tp_reserve))
+					po->tp_reserve))
 			return -EINVAL;
 		if (unlikely(req->tp_frame_size & (TPACKET_ALIGNMENT - 1)))
 			return -EINVAL;
 
-		po->frames_per_block = req->tp_block_size/req->tp_frame_size;
-		if (unlikely(po->frames_per_block <= 0))
+		rb->frames_per_block = req->tp_block_size/req->tp_frame_size;
+		if (unlikely(rb->frames_per_block <= 0))
 			return -EINVAL;
-		if (unlikely((po->frames_per_block * req->tp_block_nr) !=
-			     req->tp_frame_nr))
+		if (unlikely((rb->frames_per_block * req->tp_block_nr) !=
+					req->tp_frame_nr))
 			return -EINVAL;
 
 		err = -ENOMEM;
@@ -1835,17 +2100,19 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing
 			void *ptr = pg_vec[i];
 			int k;
 
-			for (k = 0; k < po->frames_per_block; k++) {
+			for (k = 0; k < rb->frames_per_block; k++) {
 				__packet_set_status(po, ptr, TP_STATUS_KERNEL);
 				ptr += req->tp_frame_size;
 			}
 		}
-		/* Done */
-	} else {
+	}
+	/* Done */
+	else {
 		if (unlikely(req->tp_frame_nr))
 			return -EINVAL;
 	}
 
+
 	lock_sock(sk);
 
 	/* Detach socket from network */
@@ -1866,20 +2133,19 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing
 	if (closing || atomic_read(&po->mapped) == 0) {
 		err = 0;
 #define XC(a, b) ({ __typeof__ ((a)) __t; __t = (a); (a) = (b); __t; })
-
-		spin_lock_bh(&sk->sk_receive_queue.lock);
-		pg_vec = XC(po->pg_vec, pg_vec);
-		po->frame_max = (req->tp_frame_nr - 1);
-		po->head = 0;
-		po->frame_size = req->tp_frame_size;
-		spin_unlock_bh(&sk->sk_receive_queue.lock);
-
-		order = XC(po->pg_vec_order, order);
-		req->tp_block_nr = XC(po->pg_vec_len, req->tp_block_nr);
-
-		po->pg_vec_pages = req->tp_block_size/PAGE_SIZE;
-		po->prot_hook.func = po->pg_vec ? tpacket_rcv : packet_rcv;
-		skb_queue_purge(&sk->sk_receive_queue);
+		spin_lock_bh(&rb_queue->lock);
+		pg_vec = XC(rb->pg_vec, pg_vec);
+		rb->frame_max = (req->tp_frame_nr - 1);
+		rb->head = 0;
+		rb->frame_size = req->tp_frame_size;
+		spin_unlock_bh(&rb_queue->lock);
+
+		order = XC(rb->pg_vec_order, order);
+		req->tp_block_nr = XC(rb->pg_vec_len, req->tp_block_nr);
+
+		rb->pg_vec_pages = req->tp_block_size/PAGE_SIZE;
+		po->prot_hook.func = (po->rx_ring.pg_vec) ? tpacket_rcv : packet_rcv;
+		skb_queue_purge(rb_queue);
 #undef XC
 		if (atomic_read(&po->mapped))
 			printk(KERN_DEBUG "packet_mmap: vma is busy: %d\n", atomic_read(&po->mapped));
@@ -1906,7 +2172,8 @@ static int packet_mmap(struct file *file, struct socket *sock, struct vm_area_st
 {
 	struct sock *sk = sock->sk;
 	struct packet_sock *po = pkt_sk(sk);
-	unsigned long size;
+	unsigned long size, expected_size;
+	struct packet_ring_buffer *rb;
 	unsigned long start;
 	int err = -EINVAL;
 	int i;
@@ -1917,23 +2184,38 @@ static int packet_mmap(struct file *file, struct socket *sock, struct vm_area_st
 	size = vma->vm_end - vma->vm_start;
 
 	lock_sock(sk);
-	if (po->pg_vec == NULL)
+
+	expected_size = 0;
+	if (po->rx_ring.pg_vec)
+		expected_size += po->rx_ring.pg_vec_len * po->rx_ring.pg_vec_pages * PAGE_SIZE;
+	if (po->tx_ring.pg_vec)
+		expected_size += po->tx_ring.pg_vec_len * po->tx_ring.pg_vec_pages * PAGE_SIZE;
+
+	if (expected_size == 0)
 		goto out;
-	if (size != po->pg_vec_len*po->pg_vec_pages*PAGE_SIZE)
+
+	if (size != expected_size)
 		goto out;
 
 	start = vma->vm_start;
-	for (i = 0; i < po->pg_vec_len; i++) {
-		struct page *page = virt_to_page(po->pg_vec[i]);
-		int pg_num;
-
-		for (pg_num = 0; pg_num < po->pg_vec_pages; pg_num++, page++) {
-			err = vm_insert_page(vma, start, page);
-			if (unlikely(err))
-				goto out;
-			start += PAGE_SIZE;
+
+	for(rb = &po->rx_ring; rb <= &po->tx_ring; rb++) {
+		if (rb->pg_vec == NULL)
+			continue;
+
+		for (i = 0; i < rb->pg_vec_len; i++) {
+			struct page *page = virt_to_page(rb->pg_vec[i]);
+			int pg_num;
+
+			for (pg_num = 0; pg_num < rb->pg_vec_pages; pg_num++, page++) {
+				err = vm_insert_page(vma, start, page);
+				if (unlikely(err))
+					goto out;
+				start += PAGE_SIZE;
+			}
 		}
 	}
+
 	atomic_inc(&po->mapped);
 	vma->vm_ops = &packet_mmap_ops;
 	err = 0;




* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-10-27  9:33 Johann Baudy
@ 2008-10-28 22:44 ` David Miller
  0 siblings, 0 replies; 59+ messages in thread
From: David Miller @ 2008-10-28 22:44 UTC (permalink / raw)
  To: johaahn; +Cc: netdev

From: Johann Baudy <johaahn@gmail.com>
Date: Mon, 27 Oct 2008 10:33:25 +0100

> @@ -100,7 +101,7 @@ struct tpacket2_hdr
>  enum tpacket_versions
>  {
>  	TPACKET_V1,
> -	TPACKET_V2,
> +	TPACKET_V2
>  };
>  
>  /*

Please don't do this, the comma there is intentional.

If a future patch adds a new version, the diff will be the
addition of one new line, instead of one changed line and one
new line.


* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
@ 2008-10-30 13:00 Johann Baudy
  2008-10-30 18:21 ` Lovich, Vitali
  0 siblings, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-10-30 13:00 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Hi David,

> 
> Please don't do this, the comma there is intentional.
> 
> If a future patch adds a new version, the diff will be the
> addition of one new line, instead of one changed line and one
> new line.


Thanks for your comment.
Please find below a new patch with the following changes:
- a new skb->packet_index field is now used instead of skb->mark (to forward the frame index between send() and the skb destructor)
- send() spin locks (a condensed excerpt is shown below)
- trailing comma restored in enum tpacket_versions
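
For reference, the frame claim in tpacket_snd() is now done under the new per-ring spin lock; condensed from the patch below:

	spin_lock(&po->tx_ring.lock);
	ph = packet_current_tx_frame(po, TP_STATUS_USER);
	if (unlikely(ph == NULL)) {
		spin_unlock(&po->tx_ring.lock);
		continue;
	}

	__packet_set_status(po, ph, TP_STATUS_COPY);
	atomic_inc(&po->tx_pending_skb);
	spin_unlock(&po->tx_ring.lock);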


 Documentation/networking/packet_mmap.txt |  132 ++++++++--
 include/linux/if_packet.h                |    1 +
 include/linux/skbuff.h                   |    4 +
 net/packet/af_packet.c                   |  426 +++++++++++++++++++++++++-----
 4 files changed, 477 insertions(+), 86 deletions(-)

diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 07c53d5..f07d39a 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -4,8 +4,8 @@
 
 This file documents the CONFIG_PACKET_MMAP option available with the PACKET
 socket interface on 2.4 and 2.6 kernels. This type of sockets is used for 
-capture network traffic with utilities like tcpdump or any other that uses 
-the libpcap library. 
+capture network traffic with utilities like tcpdump or any other that needs
+raw access to network interface.
 
 You can find the latest version of this document at
 
@@ -14,6 +14,7 @@ You can find the latest version of this document at
 Please send me your comments to
 
     Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
+    Johann Baudy <johann.baudy@gnu-log.net> (TX RING)
 
 -------------------------------------------------------------------------------
 + Why use PACKET_MMAP
@@ -25,19 +26,24 @@ to capture each packet, it requires two if you want to get packet's
 timestamp (like libpcap always does).
 
 In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size 
-configurable circular buffer mapped in user space. This way reading packets just 
-needs to wait for them, most of the time there is no need to issue a single 
-system call. By using a shared buffer between the kernel and the user 
-also has the benefit of minimizing packet copies.
-
-It's fine to use PACKET_MMAP to improve the performance of the capture process, 
-but it isn't everything. At least, if you are capturing at high speeds (this 
-is relative to the cpu speed), you should check if the device driver of your 
-network interface card supports some sort of interrupt load mitigation or 
-(even better) if it supports NAPI, also make sure it is enabled.
+configurable circular buffer mapped in user space that can be used to either
+send or receive packets. This way reading packets just needs to wait for them,
+most of the time there is no need to issue a single system call. Concerning
+transmission, multiple packets can be sent through one system call to get the
+highest bandwidth.
+By using a shared buffer between the kernel and the user also has the benefit
+of minimizing packet copies.
+
+It's fine to use PACKET_MMAP to improve the performance of the capture and
+transmission process, but it isn't everything. At least, if you are capturing
+at high speeds (this is relative to the cpu speed), you should check if the
+device driver of your network interface card supports some sort of interrupt
+load mitigation or (even better) if it supports NAPI, also make sure it is
+enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
+supported by devices of your network.
 
 --------------------------------------------------------------------------------
-+ How to use CONFIG_PACKET_MMAP
++ How to use CONFIG_PACKET_MMAP to improve capture process
 --------------------------------------------------------------------------------
 
 From the user standpoint, you should use the higher level libpcap library, which
@@ -57,7 +63,7 @@ the low level details or want to improve libpcap by including PACKET_MMAP
 support.
 
 --------------------------------------------------------------------------------
-+ How to use CONFIG_PACKET_MMAP directly
++ How to use CONFIG_PACKET_MMAP directly to improve capture process
 --------------------------------------------------------------------------------
 
 From the system calls stand point, the use of PACKET_MMAP involves
@@ -66,6 +72,7 @@ the following process:
 
 [setup]     socket() -------> creation of the capture socket
             setsockopt() ---> allocation of the circular buffer (ring)
+                              option: PACKET_RX_RING
             mmap() ---------> mapping of the allocated buffer to the
                               user process
 
@@ -97,13 +104,75 @@ also the mapping of the circular buffer in the user process and
 the use of this buffer.
 
 --------------------------------------------------------------------------------
++ How to use CONFIG_PACKET_MMAP directly to improve transmission process
+--------------------------------------------------------------------------------
+Transmission process is similar to capture as shown below.
+
+[setup]          socket() -------> creation of the transmission socket
+                 setsockopt() ---> allocation of the circular buffer (ring)
+                                   option: PACKET_TX_RING
+                 bind() ---------> bind transmission socket with a network interface
+                 mmap() ---------> mapping of the allocated buffer to the
+                                   user process
+
+[transmission]   poll() ---------> wait for free packets (optional)
+                 send() ---------> send all packets that are set as ready in
+                                   the ring
+                                   The flag MSG_DONTWAIT can be used to return
+                                   before end of transfer.
+
+[shutdown]  close() --------> destruction of the transmission socket and
+                              deallocation of all associated resources.
+
+Binding the socket to your network interface is mandatory (with zero copy) to
+know the header size of frames used in the circular buffer.
+
+As capture, each frame contains two parts:
+
+ --------------------
+| struct tpacket_hdr | Header. It contains the status
+|                    | of this frame
+|--------------------|
+| data buffer        |
+.                    .  Data that will be sent over the network interface.
+.                    .
+ --------------------
+
+ bind() associates the socket to your network interface thanks to
+ sll_ifindex parameter of struct sockaddr_ll.
+
+ Initialization example:
+
+ struct sockaddr_ll my_addr;
+ struct ifreq s_ifr;
+ ...
+
+ strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
+
+ /* get interface index of eth0 */
+ ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
+
+ /* fill sockaddr_ll struct to prepare binding */
+ my_addr.sll_family = AF_PACKET;
+ my_addr.sll_protocol = ETH_P_ALL;
+ my_addr.sll_ifindex =  s_ifr.ifr_ifindex;
+
+ /* bind socket to eth0 */
+ bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
+
+ A complete tutorial is available at: http://wiki.gnu-log.net/
+
+--------------------------------------------------------------------------------
 + PACKET_MMAP settings
 --------------------------------------------------------------------------------
 
 
 To setup PACKET_MMAP from user level code is done with a call like
 
+ - Capture process
      setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
+ - Transmission process
+     setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
 
 The most significant argument in the previous call is the req parameter, 
 this parameter must to have the following structure:
@@ -117,11 +186,11 @@ this parameter must to have the following structure:
     };
 
 This structure is defined in /usr/include/linux/if_packet.h and establishes a 
-circular buffer (ring) of unswappable memory mapped in the capture process. 
+circular buffer (ring) of unswappable memory.
 Being mapped in the capture process allows reading the captured frames and 
 related meta-information like timestamps without requiring a system call.
 
-Captured frames are grouped in blocks. Each block is a physically contiguous 
+Frames are grouped in blocks. Each block is a physically contiguous
 region of memory and holds tp_block_size/tp_frame_size frames. The total number 
 of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because
 
@@ -336,6 +405,7 @@ struct tpacket_hdr). If this field is 0 means that the frame is ready
 to be used for the kernel, If not, there is a frame the user can read 
 and the following flags apply:
 
++++ Capture process:
      from include/linux/if_packet.h
 
      #define TP_STATUS_COPY          2 
@@ -391,6 +461,36 @@ packets are in the ring:
 It doesn't incur in a race condition to first check the status value and 
 then poll for frames.
 
+
+++ Transmission process
+Those defines are also used for transmission:
+
+     #define TP_STATUS_KERNEL        0 // Frame is available
+     #define TP_STATUS_USER          1 // Frame will be sent on next send()
+     #define TP_STATUS_COPY          2 // Frame is currently in transmission
+
+First, the kernel initializes all frames to TP_STATUS_KERNEL. To send a packet,
+the user fills a data buffer of an available frame, sets tp_len to current
+data buffer size and sets its status field to TP_STATUS_USER. This can be done
+on multiple frames. Once the user is ready to transmit, it calls send().
+Then all buffers with status equal to TP_STATUS_USER are forwarded to the
+network device. The kernel updates each status of sent frames with
+TP_STATUS_COPY until the end of transfer.
+At the end of each transfer, buffer status returns to TP_STATUS_KERNEL.
+
+    header->tp_len = in_i_size;
+    header->tp_status = TP_STATUS_USER;
+    retval = send(this->socket, NULL, 0, 0);
+
+The user can also use poll() to check if a buffer is available:
+(status == TP_STATUS_KERNEL)
+
+    struct pollfd pfd;
+    pfd.fd = fd;
+    pfd.revents = 0;
+    pfd.events = POLLOUT;
+    retval = poll(&pfd, 1, timeout);
+
 --------------------------------------------------------------------------------
 + THANKS
 --------------------------------------------------------------------------------
diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
index 18db066..6682963 100644
--- a/include/linux/if_packet.h
+++ b/include/linux/if_packet.h
@@ -46,6 +46,7 @@ struct sockaddr_ll
 #define PACKET_VERSION			10
 #define PACKET_HDRLEN			11
 #define PACKET_RESERVE			12
+#define PACKET_TX_RING			13
 
 struct tpacket_stats
 {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9099237..f52a75d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -248,6 +248,7 @@ typedef unsigned char *sk_buff_data_t;
  *		done by skb DMA functions
  *	@secmark: security marking
  *	@vlan_tci: vlan tag control information
+ *	@packet_index: index of packet mmap frame
  */
 
 struct sk_buff {
@@ -328,6 +329,9 @@ struct sk_buff {
 #ifdef CONFIG_NETWORK_SECMARK
 	__u32			secmark;
 #endif
+#ifdef CONFIG_PACKET_MMAP
+	__u32			packet_index;
+#endif
 
 	__u32			mark;
 
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index c718e7e..87d6c12 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -156,7 +156,25 @@ struct packet_mreq_max
 };
 
 #ifdef CONFIG_PACKET_MMAP
-static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing);
+static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
+		int closing, int tx_ring);
+
+struct packet_ring_buffer {
+	char *			*pg_vec;
+	unsigned int		head;
+	unsigned int		frames_per_block;
+	unsigned int		frame_size;
+	unsigned int		frame_max;
+
+	unsigned int		pg_vec_order;
+	unsigned int		pg_vec_pages;
+	unsigned int		pg_vec_len;
+
+	spinlock_t		lock;
+};
+
+struct packet_sock;
+static int tpacket_snd(struct packet_sock *po, struct msghdr *msg);
 #endif
 
 static void packet_flush_mclist(struct sock *sk);
@@ -166,12 +184,10 @@ struct packet_sock {
 	struct sock		sk;
 	struct tpacket_stats	stats;
 #ifdef CONFIG_PACKET_MMAP
-	char *			*pg_vec;
-	unsigned int		head;
-	unsigned int            frames_per_block;
-	unsigned int		frame_size;
-	unsigned int		frame_max;
+	struct packet_ring_buffer	rx_ring;
+	struct packet_ring_buffer	tx_ring;
 	int			copy_thresh;
+	atomic_t		tx_pending_skb;
 #endif
 	struct packet_type	prot_hook;
 	spinlock_t		bind_lock;
@@ -183,9 +199,6 @@ struct packet_sock {
 	struct packet_mclist	*mclist;
 #ifdef CONFIG_PACKET_MMAP
 	atomic_t		mapped;
-	unsigned int            pg_vec_order;
-	unsigned int		pg_vec_pages;
-	unsigned int		pg_vec_len;
 	enum tpacket_versions	tp_version;
 	unsigned int		tp_hdrlen;
 	unsigned int		tp_reserve;
@@ -204,8 +217,10 @@ struct packet_skb_cb {
 
 #ifdef CONFIG_PACKET_MMAP
 
-static void *packet_lookup_frame(struct packet_sock *po, unsigned int position,
-				 int status)
+static void *packet_lookup_frame(struct packet_sock *po,
+		struct packet_ring_buffer *buff,
+		unsigned int position,
+		int status)
 {
 	unsigned int pg_vec_pos, frame_offset;
 	union {
@@ -214,25 +229,50 @@ static void *packet_lookup_frame(struct packet_sock *po, unsigned int position,
 		void *raw;
 	} h;
 
-	pg_vec_pos = position / po->frames_per_block;
-	frame_offset = position % po->frames_per_block;
+	pg_vec_pos = position / buff->frames_per_block;
+	frame_offset = position % buff->frames_per_block;
 
-	h.raw = po->pg_vec[pg_vec_pos] + (frame_offset * po->frame_size);
+	h.raw = buff->pg_vec[pg_vec_pos] + (frame_offset * buff->frame_size);
 	switch (po->tp_version) {
 	case TPACKET_V1:
-		if (status != h.h1->tp_status ? TP_STATUS_USER :
-						TP_STATUS_KERNEL)
+		if (status != h.h1->tp_status)
 			return NULL;
 		break;
 	case TPACKET_V2:
-		if (status != h.h2->tp_status ? TP_STATUS_USER :
-						TP_STATUS_KERNEL)
+		if (status != h.h2->tp_status)
 			return NULL;
 		break;
 	}
 	return h.raw;
 }
 
+static inline void *packet_current_rx_frame(struct packet_sock *po, int status)
+{
+	return packet_lookup_frame(po, &po->rx_ring, po->rx_ring.head, status);
+}
+
+static inline void *packet_current_tx_frame(struct packet_sock *po, int status)
+{
+	return packet_lookup_frame(po, &po->tx_ring, po->tx_ring.head, status);
+}
+
+static inline void *packet_previous_rx_frame(struct packet_sock *po, int status)
+{
+	unsigned int previous = po->rx_ring.head ? po->rx_ring.head - 1 : po->rx_ring.frame_max;
+	return packet_lookup_frame(po, &po->rx_ring, previous, status);
+}
+
+static inline void *packet_previous_tx_frame(struct packet_sock *po, int status)
+{
+	unsigned int previous = po->tx_ring.head ? po->tx_ring.head - 1 : po->tx_ring.frame_max;
+	return packet_lookup_frame(po, &po->tx_ring, previous, status);
+}
+
+static inline void packet_increment_head(struct packet_ring_buffer *buff)
+{
+	buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
+}
+
 static void __packet_set_status(struct packet_sock *po, void *frame, int status)
 {
 	union {
@@ -646,7 +686,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe
 		macoff = netoff - maclen;
 	}
 
-	if (macoff + snaplen > po->frame_size) {
+	if (macoff + snaplen > po->rx_ring.frame_size) {
 		if (po->copy_thresh &&
 		    atomic_read(&sk->sk_rmem_alloc) + skb->truesize <
 		    (unsigned)sk->sk_rcvbuf) {
@@ -659,16 +699,16 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe
 			if (copy_skb)
 				skb_set_owner_r(copy_skb, sk);
 		}
-		snaplen = po->frame_size - macoff;
+		snaplen = po->rx_ring.frame_size - macoff;
 		if ((int)snaplen < 0)
 			snaplen = 0;
 	}
 
 	spin_lock(&sk->sk_receive_queue.lock);
-	h.raw = packet_lookup_frame(po, po->head, TP_STATUS_KERNEL);
+	h.raw = packet_current_rx_frame(po, TP_STATUS_KERNEL);
 	if (!h.raw)
 		goto ring_is_full;
-	po->head = po->head != po->frame_max ? po->head+1 : 0;
+	packet_increment_head(&po->rx_ring);
 	po->stats.tp_packets++;
 	if (copy_skb) {
 		status |= TP_STATUS_COPY;
@@ -759,10 +799,214 @@ ring_is_full:
 	goto drop_n_restore;
 }
 
-#endif
+static void tpacket_destruct_skb(struct sk_buff *skb)
+{
+	struct packet_sock *po = pkt_sk(skb->sk);
+	void * ph;
 
+	BUG_ON(skb == NULL);
+	ph = packet_lookup_frame( po, &po->tx_ring, skb->packet_index, TP_STATUS_COPY);
 
-static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
+	BUG_ON(ph == NULL);
+	BUG_ON(atomic_read(&po->tx_pending_skb) == 0);
+
+	atomic_dec(&po->tx_pending_skb);
+	__packet_set_status(po, ph, TP_STATUS_KERNEL);
+
+	sock_wfree(skb);
+}
+
+static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff * skb, void * frame,
+		struct net_device *dev, int size_max, __be16 proto,
+		unsigned char * addr)
+{
+	union {
+		struct tpacket_hdr *h1;
+		struct tpacket2_hdr *h2;
+		void *raw;
+	} ph;
+	int to_write, offset, len, tp_len;
+	struct socket *sock = po->sk.sk_socket;
+	struct page *page;
+	void *data;
+	int err;
+
+	ph.raw = frame;
+
+	skb->protocol = proto;
+	skb->dev = dev;
+	skb->priority = po->sk.sk_priority;
+	skb->destructor = tpacket_destruct_skb;
+	skb->packet_index = po->tx_ring.head;
+
+	switch(po->tp_version) {
+	case TPACKET_V2:
+		tp_len = ph.h1->tp_len;
+		break;
+	default:
+		tp_len = ph.h1->tp_len;
+		break;
+	}
+
+	if (unlikely(tp_len > size_max)) {
+		printk(KERN_ERR "packet size is too long (%d > %d)\n",
+				tp_len, size_max);
+		return -EMSGSIZE;
+	}
+
+	skb_reserve(skb, LL_RESERVED_SPACE(dev));
+	data = ph.raw + po->tp_hdrlen;
+
+	if (sock->type == SOCK_DGRAM) {
+		err = dev_hard_header(skb, dev, ntohs(proto), addr,
+				NULL, tp_len);
+		if (unlikely(err < 0))
+			return -EINVAL;
+	} else if (dev->hard_header_len ) {
+		/* net device doesn't like empty head */
+		if(unlikely(tp_len <= dev->hard_header_len)) {
+			printk(KERN_ERR "packet size is too short "
+					"(%d < %d)\n", tp_len,
+					dev->hard_header_len);
+			return -EINVAL;
+		}
+
+		skb_push(skb, dev->hard_header_len);
+		err = skb_store_bits(skb, 0, data,
+				dev->hard_header_len);
+		if (unlikely(err))
+			return err;
+	}
+
+	err = -EFAULT;
+	to_write = tp_len - dev->hard_header_len;
+	data += dev->hard_header_len;
+	page = virt_to_page(data);
+	len = ((to_write > PAGE_SIZE) ? PAGE_SIZE : to_write);
+
+	offset = (int)((long)data & (~PAGE_MASK));
+	len -= offset;
+
+	skb->data_len = to_write;
+	skb->len += to_write;
+
+	while ( likely(to_write) ) {
+		get_page(page);
+		skb_fill_page_desc(skb,
+				skb_shinfo(skb)->nr_frags,
+				page++, offset, len);
+		to_write -= len;
+		len = (to_write > PAGE_SIZE) ? PAGE_SIZE : to_write;
+		offset = 0;
+	}
+
+	return tp_len;
+}
+
+static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
+{
+	struct socket *sock;
+	struct sk_buff *skb;
+	struct net_device *dev;
+	__be16 proto;
+	int ifindex, err, reserve = 0;
+	void * ph;
+	struct sockaddr_ll *saddr=(struct sockaddr_ll *)msg->msg_name;
+	int tp_len, size_max;
+	unsigned char *addr;
+	int len_sum = 0;
+
+	BUG_ON(po == NULL);
+	sock = po->sk.sk_socket;
+
+	if (saddr == NULL) {
+		ifindex	= po->ifindex;
+		proto	= po->num;
+		addr	= NULL;
+	} else {
+		err = -EINVAL;
+		if (msg->msg_namelen < sizeof(struct sockaddr_ll))
+			goto out;
+		if (msg->msg_namelen < (saddr->sll_halen + offsetof(struct sockaddr_ll, sll_addr)))
+			goto out;
+		ifindex	= saddr->sll_ifindex;
+		proto	= saddr->sll_protocol;
+		addr	= saddr->sll_addr;
+	}
+
+	dev = dev_get_by_index(sock_net(&po->sk), ifindex);
+	err = -ENXIO;
+	if (unlikely(dev == NULL))
+		goto out;
+
+	err = -EINVAL;
+	if (unlikely(sock->type != SOCK_RAW))
+		goto out_unlock;
+
+	reserve = dev->hard_header_len;
+
+	err = -ENETDOWN;
+	if (unlikely(!(dev->flags & IFF_UP)))
+		goto out_unlock;
+
+	size_max = po->tx_ring.frame_size - sizeof(struct skb_shared_info)
+		- po->tp_hdrlen - LL_ALLOCATED_SPACE(dev);
+
+	if (size_max > dev->mtu + reserve)
+		size_max = dev->mtu + reserve;
+
+	do
+	{
+		spin_lock(&po->tx_ring.lock);
+		ph = packet_current_tx_frame(po, TP_STATUS_USER);
+		if(unlikely(ph == NULL)) {
+			spin_unlock(&po->tx_ring.lock);
+			continue;
+		}
+
+		__packet_set_status(po, ph, TP_STATUS_COPY);
+		atomic_inc(&po->tx_pending_skb);
+		spin_unlock(&po->tx_ring.lock);
+
+		skb = sock_alloc_send_skb(&po->sk, LL_ALLOCATED_SPACE(dev),
+				msg->msg_flags & MSG_DONTWAIT, &err);
+		if (unlikely(skb == NULL))
+			goto out_status;
+
+		tp_len = tpacket_fill_skb(po, skb, ph, dev, size_max, proto,
+				addr);
+		if(unlikely(tp_len < 0)) {
+			err = tp_len;
+			goto out_free;
+		}
+
+		err = dev_queue_xmit(skb);
+		if (unlikely(err > 0 && (err = net_xmit_errno(err)) != 0))
+			goto out_free;
+
+		packet_increment_head(&po->tx_ring);
+		len_sum += tp_len;
+	}
+	while(likely((ph != NULL)
+				|| ((!(msg->msg_flags & MSG_DONTWAIT))
+					&& atomic_read(&po->tx_pending_skb))));
+
+	err = len_sum;
+	goto out_unlock;
+
+out_free:
+	kfree_skb(skb);
+out_status:
+	__packet_set_status(po, ph, TP_STATUS_USER);
+	atomic_dec(&po->tx_pending_skb);
+out_unlock:
+	dev_put(dev);
+out:
+	return err;
+}
+#endif
+
+static int packet_snd(struct socket *sock,
 			  struct msghdr *msg, size_t len)
 {
 	struct sock *sk = sock->sk;
@@ -853,6 +1097,19 @@ out:
 	return err;
 }
 
+static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *msg, size_t len)
+{
+#ifdef CONFIG_PACKET_MMAP
+	struct sock *sk = sock->sk;
+	struct packet_sock *po = pkt_sk(sk);
+	if (po->tx_ring.pg_vec)
+		return tpacket_snd(po, msg);
+	else
+#endif
+		return packet_snd(sock, msg, len);
+}
+
 /*
  *	Close a PACKET socket. This is fairly simple. We immediately go
  *	to 'closed' state and remove our protocol entry in the device list.
@@ -891,10 +1148,13 @@ static int packet_release(struct socket *sock)
 	packet_flush_mclist(sk);
 
 #ifdef CONFIG_PACKET_MMAP
-	if (po->pg_vec) {
+	{
 		struct tpacket_req req;
 		memset(&req, 0, sizeof(req));
-		packet_set_ring(sk, &req, 1);
+		if (po->rx_ring.pg_vec)
+			packet_set_ring(sk, &req, 1, 0);
+		if (po->tx_ring.pg_vec)
+			packet_set_ring(sk, &req, 1, 1);
 	}
 #endif
 
@@ -1411,6 +1671,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 
 #ifdef CONFIG_PACKET_MMAP
 	case PACKET_RX_RING:
+	case PACKET_TX_RING:
 	{
 		struct tpacket_req req;
 
@@ -1418,7 +1679,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 			return -EINVAL;
 		if (copy_from_user(&req,optval,sizeof(req)))
 			return -EFAULT;
-		return packet_set_ring(sk, &req, 0);
+		return packet_set_ring(sk, &req, 0, optname == PACKET_TX_RING);
 	}
 	case PACKET_COPY_THRESH:
 	{
@@ -1438,7 +1699,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 
 		if (optlen != sizeof(val))
 			return -EINVAL;
-		if (po->pg_vec)
+		if (po->rx_ring.pg_vec || po->tx_ring.pg_vec)
 			return -EBUSY;
 		if (copy_from_user(&val, optval, sizeof(val)))
 			return -EFAULT;
@@ -1457,7 +1718,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 
 		if (optlen != sizeof(val))
 			return -EINVAL;
-		if (po->pg_vec)
+		if (po->rx_ring.pg_vec || po->tx_ring.pg_vec)
 			return -EBUSY;
 		if (copy_from_user(&val, optval, sizeof(val)))
 			return -EFAULT;
@@ -1701,13 +1962,17 @@ static unsigned int packet_poll(struct file * file, struct socket *sock,
 	unsigned int mask = datagram_poll(file, sock, wait);
 
 	spin_lock_bh(&sk->sk_receive_queue.lock);
-	if (po->pg_vec) {
-		unsigned last = po->head ? po->head-1 : po->frame_max;
-
-		if (packet_lookup_frame(po, last, TP_STATUS_USER))
+	if (po->rx_ring.pg_vec) {
+		if (packet_previous_rx_frame(po, TP_STATUS_USER))
 			mask |= POLLIN | POLLRDNORM;
 	}
 	spin_unlock_bh(&sk->sk_receive_queue.lock);
+	spin_lock_bh(&sk->sk_write_queue.lock);
+	if (po->tx_ring.pg_vec) {
+		if (packet_current_tx_frame(po, TP_STATUS_KERNEL))
+			mask |= POLLOUT | POLLWRNORM;
+	}
+	spin_unlock_bh(&sk->sk_write_queue.lock);
 	return mask;
 }
 
@@ -1783,20 +2048,24 @@ out_free_pgvec:
 	goto out;
 }
 
-static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing)
+static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing, int tx_ring)
 {
 	char **pg_vec = NULL;
 	struct packet_sock *po = pkt_sk(sk);
 	int was_running, order = 0;
+	struct packet_ring_buffer *rb;
+	struct sk_buff_head *rb_queue;
 	__be16 num;
 	int err = 0;
 
+	rb = tx_ring ? &po->tx_ring : &po->rx_ring;
+	rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
+
 	if (req->tp_block_nr) {
 		int i;
 
 		/* Sanity tests and some calculations */
-
-		if (unlikely(po->pg_vec))
+		if (unlikely(rb->pg_vec))
 			return -EBUSY;
 
 		switch (po->tp_version) {
@@ -1813,16 +2082,16 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing
 		if (unlikely(req->tp_block_size & (PAGE_SIZE - 1)))
 			return -EINVAL;
 		if (unlikely(req->tp_frame_size < po->tp_hdrlen +
-						  po->tp_reserve))
+					po->tp_reserve))
 			return -EINVAL;
 		if (unlikely(req->tp_frame_size & (TPACKET_ALIGNMENT - 1)))
 			return -EINVAL;
 
-		po->frames_per_block = req->tp_block_size/req->tp_frame_size;
-		if (unlikely(po->frames_per_block <= 0))
+		rb->frames_per_block = req->tp_block_size/req->tp_frame_size;
+		if (unlikely(rb->frames_per_block <= 0))
 			return -EINVAL;
-		if (unlikely((po->frames_per_block * req->tp_block_nr) !=
-			     req->tp_frame_nr))
+		if (unlikely((rb->frames_per_block * req->tp_block_nr) !=
+					req->tp_frame_nr))
 			return -EINVAL;
 
 		err = -ENOMEM;
@@ -1835,17 +2104,19 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing
 			void *ptr = pg_vec[i];
 			int k;
 
-			for (k = 0; k < po->frames_per_block; k++) {
+			for (k = 0; k < rb->frames_per_block; k++) {
 				__packet_set_status(po, ptr, TP_STATUS_KERNEL);
 				ptr += req->tp_frame_size;
 			}
 		}
-		/* Done */
-	} else {
+	}
+	/* Done */
+	else {
 		if (unlikely(req->tp_frame_nr))
 			return -EINVAL;
 	}
 
+
 	lock_sock(sk);
 
 	/* Detach socket from network */
@@ -1866,20 +2137,19 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing
 	if (closing || atomic_read(&po->mapped) == 0) {
 		err = 0;
 #define XC(a, b) ({ __typeof__ ((a)) __t; __t = (a); (a) = (b); __t; })
-
-		spin_lock_bh(&sk->sk_receive_queue.lock);
-		pg_vec = XC(po->pg_vec, pg_vec);
-		po->frame_max = (req->tp_frame_nr - 1);
-		po->head = 0;
-		po->frame_size = req->tp_frame_size;
-		spin_unlock_bh(&sk->sk_receive_queue.lock);
-
-		order = XC(po->pg_vec_order, order);
-		req->tp_block_nr = XC(po->pg_vec_len, req->tp_block_nr);
-
-		po->pg_vec_pages = req->tp_block_size/PAGE_SIZE;
-		po->prot_hook.func = po->pg_vec ? tpacket_rcv : packet_rcv;
-		skb_queue_purge(&sk->sk_receive_queue);
+		spin_lock_bh(&rb_queue->lock);
+		pg_vec = XC(rb->pg_vec, pg_vec);
+		rb->frame_max = (req->tp_frame_nr - 1);
+		rb->head = 0;
+		rb->frame_size = req->tp_frame_size;
+		spin_unlock_bh(&rb_queue->lock);
+
+		order = XC(rb->pg_vec_order, order);
+		req->tp_block_nr = XC(rb->pg_vec_len, req->tp_block_nr);
+
+		rb->pg_vec_pages = req->tp_block_size/PAGE_SIZE;
+		po->prot_hook.func = (po->rx_ring.pg_vec) ? tpacket_rcv : packet_rcv;
+		skb_queue_purge(rb_queue);
 #undef XC
 		if (atomic_read(&po->mapped))
 			printk(KERN_DEBUG "packet_mmap: vma is busy: %d\n", atomic_read(&po->mapped));
@@ -1906,7 +2176,8 @@ static int packet_mmap(struct file *file, struct socket *sock, struct vm_area_st
 {
 	struct sock *sk = sock->sk;
 	struct packet_sock *po = pkt_sk(sk);
-	unsigned long size;
+	unsigned long size, expected_size;
+	struct packet_ring_buffer *rb;
 	unsigned long start;
 	int err = -EINVAL;
 	int i;
@@ -1917,23 +2188,38 @@ static int packet_mmap(struct file *file, struct socket *sock, struct vm_area_st
 	size = vma->vm_end - vma->vm_start;
 
 	lock_sock(sk);
-	if (po->pg_vec == NULL)
+
+	expected_size = 0;
+	if (po->rx_ring.pg_vec)
+		expected_size += po->rx_ring.pg_vec_len * po->rx_ring.pg_vec_pages * PAGE_SIZE;
+	if (po->tx_ring.pg_vec)
+		expected_size += po->tx_ring.pg_vec_len * po->tx_ring.pg_vec_pages * PAGE_SIZE;
+
+	if (expected_size == 0)
 		goto out;
-	if (size != po->pg_vec_len*po->pg_vec_pages*PAGE_SIZE)
+
+	if (size != expected_size)
 		goto out;
 
 	start = vma->vm_start;
-	for (i = 0; i < po->pg_vec_len; i++) {
-		struct page *page = virt_to_page(po->pg_vec[i]);
-		int pg_num;
-
-		for (pg_num = 0; pg_num < po->pg_vec_pages; pg_num++, page++) {
-			err = vm_insert_page(vma, start, page);
-			if (unlikely(err))
-				goto out;
-			start += PAGE_SIZE;
+
+	for(rb = &po->rx_ring; rb <= &po->tx_ring; rb++) {
+		if (rb->pg_vec == NULL)
+			continue;
+
+		for (i = 0; i < rb->pg_vec_len; i++) {
+			struct page *page = virt_to_page(rb->pg_vec[i]);
+			int pg_num;
+
+			for (pg_num = 0; pg_num < rb->pg_vec_pages; pg_num++, page++) {
+				err = vm_insert_page(vma, start, page);
+				if (unlikely(err))
+					goto out;
+				start += PAGE_SIZE;
+			}
 		}
 	}
+
 	atomic_inc(&po->mapped);
 	vma->vm_ops = &packet_mmap_ops;
 	err = 0;






^ permalink raw reply related	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-10-30 13:00 Johann Baudy
@ 2008-10-30 18:21 ` Lovich, Vitali
  0 siblings, 0 replies; 59+ messages in thread
From: Lovich, Vitali @ 2008-10-30 18:21 UTC (permalink / raw)
  To: Johann Baudy, David Miller; +Cc: netdev@vger.kernel.org

Hey Johann,

There's no need to keep the index (packet_index).  Just store the pointer directly (change it to void *) - saves an extra lookup.  Also, I don't think setting TP_STATUS_COPY is necessary since the user can't really do anything with that information.  Simply leave it at TP_STATUS_USER & TP_STATUS_KERNEL.

Also, when you set the status back to TP_STATUS_KERNEL in the destructor, you
need to add the following barriers:

__packet_set_status(po, ph, TP_STATUS_KERNEL);
smp_mb();   // make sure the TP_STATUS_KERNEL was actually written to memory before this - couldn't this actually be just a smp_wmb?
flush_dcache_page(virt_to_page(ph));  // on non-x86 architectures like ARM that have a moronic cache (i.e cache by virtual rather than physical address). on x86 this is a noop.

Also, I looked at your code while working on my version, and there may be some logical problems with the way you're building packets.  I'll compare the two within the next few days as I start cleaning up my code.

The atomic_inc of pending_skb should happen after the skb was allocated - the atomic_dec in out_status can also be removed.  out_status can be removed completely if you get rid of TP_STATUS_COPY.  If you leave it, you still need the barriers as above after changing tp_status, or the user may not see the change.  Also, you've got a potential threading issue - you're not protecting the frame index behind the spinlock.

As a benchmark on a 10G card (and not performing any optimizations like using syspart & dedicating a cpu for tx), I was able to hit 8.6 GBits/s using a dedicated kernel thread for the transmission.  With a dedicated CPU, I'm confident the line rate will go up significantly.

I'll try to test your changes within the next few days.  TCPdump maxes out at around 1.5 GBits/s.

As for CPU usage, there's a noticeable advantage over traditional send.  Using tcpreplay, there's about 80% CPU utilization when sending.  Using the tx ring, there's maybe 10-15%.

I'm going to have latency numbers soon as well (i.e. how much jitter is introduced by the kernel).

Vitali

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
@ 2008-10-31 10:58 Johann Baudy
  2008-10-31 17:07 ` Lovich, Vitali
  0 siblings, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-10-31 10:58 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: David Miller, netdev@vger.kernel.org

Hi Vitali,

> There's no need to keep the index (packet_index).  Just store the
> pointer directly (change it to void *) - saves an extra lookup.

Indeed, it will be faster. I'll do the change.

> Also, I don't think setting TP_STATUS_COPY is necessary since the user
> can't really do anything with that information. Simply leave it at
> TP_STATUS_USER & TP_STATUS_KERNEL.

This information is not useful for the user. It is there to prevent the kernel
from sending a packet twice or more. Inside the tx ring lock, queued packets
must be tagged as "the kernel has already handled this packet" so that they
are not sent again on the next turn of the tx ring.
(That case can happen if the device/queue is very slow or if you have only a
few frames.)

> The atomic_inc of pending_skb should happen after the skb was
> allocated - the atomic_dec in out_status can also be removed. 
> out_status can be removed completely if you get rid of
> TP_STATUS_COPY.  If you leave it, you still need the barriers as above
> after changing tp_status, or the user may not see the change.  

I don't understand why the "atomic_inc of pending_skb should happen after
the skb was allocated". This counter is used to monitor the number of TX
packets queued, so as a requirement we have to increment it before
dev_queue_xmit().

An atomic_dec() will be needed anyway if tpacket_fill_skb() or
dev_queue_xmit() fails (if the increment is performed after the skb allocation).

> Also, you've got a potential threading issue - you're not protecting
> the frame index behind the spinlock.

You are right, I think I will spin-lock outside the do_while loop.

> Also, when you set the status back to TP_STATUS_KERNEL in the
> destructor, you need
>  to add the following barriers:
>
> __packet_set_status(po, ph, TP_STATUS_KERNEL);
> smp_mb();   // make sure the TP_STATUS_KERNEL was actually written to
> memory before this - couldn't this actually be just a smp_wmb?
> flush_dcache_page(virt_to_page(ph));  // on non-x86 architectures like
> ARM that have a moronic cache (i.e cache by virtual rather than
> physical address). on x86 this is a noop.
>

So, if my understanding of those memory barriers is correct, we should
have an smp_rmb() before reading the status and an smp_wmb() after writing
the status, in both the skb destructor and the send procedure.
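Applied to tp_status, one way the pairing could look is (sketch only;
__packet_get_status() is assumed here as the read-side counterpart of
__packet_set_status()):

	/* skb destructor: kernel writes the status word */
	__packet_set_status(po, ph, TP_STATUS_KERNEL);
	smp_wmb();
	flush_dcache_page(virt_to_page(ph));

	/* send procedure: kernel reads the status word, then the frame */
	status = __packet_get_status(po, ph);
	smp_rmb();
	/* ... only now look at tp_len and the frame payload ... */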

> Also, I think that I looked at your code while working on my version
> and there may have been some logical problems with the way you're
> building packets.  I'll compare it within the next few days as I start
> cleaning up my code.

I've noticed "data += dev->hard_header_len; to_write -=
dev->hard_header_len;" that must be in (sock->type != SOCK_DGRAM)
condition. 

> As a benchmark on a 10G card (and not performing any optimizations
> like using syspart & dedicating a cpu for tx), I was able to hit 8.6
> GBits/s using a dedicated kernel thread for the transmission.  With
> the dedicated CPU, I'm confident the line-rate will go up
> significantly
>
> .I'll try to test your changes within the next few days.  TCPdump
> maxes out at around 1.5 GBits/s
>
> As for CPU usage, there's a noticeable advantage to traditional send. 
> Using tcpreplay  - there's about 80% CPU utilization when sending. 
> Using the tx ring, there's maybe 10-15%.
>

On my side, I'm using a 1G device with a PPC405.
I've reached 107 MBytes/s with the TX RING, against 25 MBytes/s with a
standard raw packet socket, and 107 MBytes/s with pktgen.

> I'm going to have latency numbers soon as well (i.e. how much jitter
> is introduced by the kernel).
>

Many thanks Vitali for your comments and help :)


-- 
Johann Baudy
johaahn@gmail.com




^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-10-31 10:58 [PATCH] Packet socket: mmapped IO: PACKET_TX_RING Johann Baudy
@ 2008-10-31 17:07 ` Lovich, Vitali
  2008-10-31 18:24   ` Lovich, Vitali
                     ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Lovich, Vitali @ 2008-10-31 17:07 UTC (permalink / raw)
  To: Johann Baudy; +Cc: netdev@vger.kernel.org



> -----Original Message-----
> From: Johann Baudy [mailto:johaahn@gmail.com]
> Sent: October-31-08 3:58 AM
> To: Lovich, Vitali
> Cc: David Miller; netdev@vger.kernel.org
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> Hi Vitali,
> 
> > There's no need to keep the index (packet_index).  Just store the
> pointer directly (change it to void *) - saves an extra lookup.
> 
> Indeed, it will be faster. I'll do the change.
Actually, I just realized that we don't need to modify the sk_buff structure.  We're already filling in the fragments of the skb - just take the first fragment and reverse the process to determine the starting address of the frame.
> 
> > Also, I don't think setting TP_STATUS_COPY is necessary since the
> user
> > can't really do anything with that information. Simply leave it at
> > TP_STATUS_USER & TP_STATUS_KERNEL.
> 
> This information is not useful for user. It is to prevent kernel from
> sending a packet twice or more. Inside the tx ring lock, queued packets
> must be tagged as "Kernel has already handled this packet" to not send
> it again at next turn of tx ring.
> (That case can happen if device/queue is very slow or if you have only
> few frames)
Sorry - you are correct.

> 
> > The atomic_inc of pending_skb should happen after the skb was
> > allocated - the atomic_dec in out_status can also be removed.
> > out_status can be removed completely if you get rid of
> > TP_STATUS_COPY.  If you leave it, you still need the barriers as
> above
> > after changing tp_status, or the user may not see the change.
> 
> I don't understand why "atomic_inc of pending_skb should happen after
> the skb was allocated". This counter is used to monitor the number of
> TX
> packets queued. So as requirement, we have to increment it before
> dev_queue_xmit().
Actually, it represents the number of skbs allocated, which is what the number of pending skbs really is.  Consider the following scenario:

atomic_inc(pending_skb) 		(pending_skb = 1)
skb is allocated
for some reason the filling in of the skb fails after the destructor is set (i.e. tp_len is too big)
we go to out_free & free the skb
our destructor gets called
atomic_dec(pending_skb)             (pending_skb = 0)
atomic_dec(pending_skb)             (pending_skb = -1)

That is why it needs to be set after the skb is allocated (so that you don't need that atomic_dec).
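In other words, roughly this ordering (tpacket_fill_skb(), out_status/out_free
and tx_pending_skb are names from your patch; the allocation call and its
arguments are only approximate):

	skb = sock_alloc_send_skb(sk, size, 0, &err);
	if (skb == NULL)
		goto out_status;		/* nothing allocated, nothing to undo */
	atomic_inc(&po->tx_pending_skb);	/* count only skbs that really exist */
	err = tpacket_fill_skb(po, skb, ph, ...);	/* sets skb->destructor first */
	if (unlikely(err < 0))
		goto out_free;	/* kfree_skb() runs the destructor, which does the atomic_dec() */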

> 
> atomic_dec() will be needed anyway if tpacket_fill_skb() or
> dev_queue_xmit() are failing (If performed after skb alloc).
> 
No, because when you call kfree_skb() it'll call the destructor (which tpacket_fill_skb should set as the very first thing, before any error conditions can happen).

> > Also, you've got a potential threading issue - you're not protecting
> > the frame index behind the spinlock.
> 
> You are right, I think I will spin-lock outside the do_while loop.
I don't think you can do that - dev_queue_xmit() sleeps AFAIK, and you can't hold a spinlock while you sleep.  Can you explain why you need a spinlock, though?  I think you're trying to make it thread-safe when multiple threads call send, correct?  In that case, just change it to a semaphore and protect the entire send call (well - at least in a safe enough way to make sure you're properly unlocking upon exit).  Also, a better question is - is there any particular reason we need to support multiple threads calling send?  Just stipulate that the behaviour is undefined when send is called from different threads.
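Something along these lines, for example (pg_vec_lock is just an assumed name
for a per-socket mutex; the point is only that a mutex can be held across a
sleeping dev_queue_xmit() where a spinlock cannot):

	mutex_lock(&po->pg_vec_lock);	/* serialize concurrent senders */
	/* walk the tx ring: claim TP_STATUS_USER frames, build skbs,
	 * call dev_queue_xmit() - all of which may sleep */
	mutex_unlock(&po->pg_vec_lock);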

> 
> > Also, when you set the status back to TP_STATUS_KERNEL in the
> destructor, you need
> >  to add the following barriers:
> >
> > __packet_set_status(po, ph, TP_STATUS_KERNEL);
> > smp_mb();   // make sure the TP_STATUS_KERNEL was actually written to
> > memory before this - couldn't this actually be just a smp_wmb?
> > flush_dcache_page(virt_to_page(ph));  // on non-x86 architectures
> like
> > ARM that have a moronic cache (i.e cache by virtual rather than
> > physical address). on x86 this is a noop.
> >
> 
> So, If my understanding of those memory barriers is correct, we should
> have a smp_rmb() before status reading and smp_wmb() after status
> writing in skb destructor and send procedure.
I'm not 100% sure about this - I was just referring to how the RX ring buffer does it.  However, I don't think so.  My understanding of the smp_wmb is that it's there so that the data cache is flushed properly.  In other words, when you change the tp_status field, you need to make sure that it has actually been written before you flush.  However, for reading, this is not a problem I think.  I also think that since TP_STATUS_COPY is used only within the kernel, when you set TP_STATUS_COPY you don't need to do the smp_wmb or the data cache flush.

Perhaps someone who's more comfortable with memory barriers could respond on this issue?


^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-10-31 17:07 ` Lovich, Vitali
@ 2008-10-31 18:24   ` Lovich, Vitali
  2008-11-04 22:45     ` Johann Baudy
  2008-10-31 20:28   ` Evgeniy Polyakov
  2008-11-04 22:33   ` Johann Baudy
  2 siblings, 1 reply; 59+ messages in thread
From: Lovich, Vitali @ 2008-10-31 18:24 UTC (permalink / raw)
  To: Johann Baudy; +Cc: netdev@vger.kernel.org

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Lovich, Vitali
> Sent: October-31-08 10:08 AM
> To: Johann Baudy
> Cc: netdev@vger.kernel.org
> Subject: RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> 
> 
> > -----Original Message-----
> > From: Johann Baudy [mailto:johaahn@gmail.com]
> > Sent: October-31-08 3:58 AM
> > To: Lovich, Vitali
> > Cc: David Miller; netdev@vger.kernel.org
> > Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> >
> > Hi Vitali,
> >
> > > There's no need to keep the index (packet_index).  Just store the
> > pointer directly (change it to void *) - saves an extra lookup.
> >
> > Indeed, it will be faster. I'll do the change.
> Actually, I just realized that we don't need to modify the sock_buff
> structure.  We're already filling in the fragments of the skb - just
> take the first fragment and reverse the process to determine the
> starting address of the frame.

Here's the code I came up with to determine the offset - anyone see any problems?

union {
   struct tpacket2_hdr *t2;
   u8 *raw;
} frame;

struct page *first_frag;
s32 page_offset;	/* signed, so the < 0 check below works */

first_frag = skb_shinfo(skb)->frags[0].page;
page_offset = skb_shinfo(skb)->frags[0].page_offset;

page_offset -= (skb_headlen(skb) + sizeof(struct tpacket2_hdr));
if (unlikely(page_offset < 0)) {
	first_frag--;
	page_offset += PAGE_SIZE;
}
frame.raw = page_address(first_frag) + page_offset;

// rest of destructor

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-10-31 17:07 ` Lovich, Vitali
  2008-10-31 18:24   ` Lovich, Vitali
@ 2008-10-31 20:28   ` Evgeniy Polyakov
  2008-11-04 22:33   ` Johann Baudy
  2 siblings, 0 replies; 59+ messages in thread
From: Evgeniy Polyakov @ 2008-10-31 20:28 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Johann Baudy, netdev@vger.kernel.org

On Fri, Oct 31, 2008 at 10:07:58AM -0700, Lovich, Vitali (vlovich@qualcomm.com) wrote:
> > > Also, you've got a potential threading issue - you're not protecting
> > > the frame index behind the spinlock.
> > 
> > You are right, I think I will spin-lock outside the do_while loop.
> Can't do that I think - dev_queue_xmit sleeps AFAIK and you can't hold a spinlock when you sleep.

It does not sleep and can be called from interrupt context, but interrupts
have to be turned on, since BHs will be re-enabled there.

>Can you explain why you need a spinlock though?  I think you're trying to make it thread-safe if multiple threads make calls to send, correct?  In this case, just change it to a semaphore, and protect the entire send call (well - at least in a safe enough way to make sure you're properly unlocking upon exit).  Also, a better question is - is there any particular reason we need to support multiple threads calling send?  Just stipulate that the behaviour is undefined when there are calls to send from different threads.

If there is no synchronization, you have to be very careful that calls into
the underlying functions do not race against each other. For example, the
device queue should be locked.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-10-31 17:07 ` Lovich, Vitali
  2008-10-31 18:24   ` Lovich, Vitali
  2008-10-31 20:28   ` Evgeniy Polyakov
@ 2008-11-04 22:33   ` Johann Baudy
  2008-11-05  1:50     ` Lovich, Vitali
  2 siblings, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-11-04 22:33 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: netdev@vger.kernel.org, Evgeniy Polyakov

Hi Vitali,

> Actually, it represents the number of skbs allocated, which represents the number of pending skbs.  Consider the following scenario:
> 
> atomic_inc(pending_skb) 		(pending_skb = 1)
> skb is allocated
> for some reason the filling in of the skb fails after the destructor is set (i.e. tp_len is too big)
> we go to out_free & free the skb
> our destructor gets called
> atomic_dec(pending_skb)             (pending_skb = 0)
> atomic_dec(pending_skb)             (pending_skb = -1)
> 
> That is why it needs to be set after the skb is allocated (so that you don't need that atomic_dec).
> 
Oh ! Yes thanks! That's an obvious mistake. I'll fix it. 

> > 
> > atomic_dec() will be needed anyway if tpacket_fill_skb() or
> > dev_queue_xmit() are failing (If performed after skb alloc).
> > 
> No, because when you call kfree_skb it'll call the destructor (which tpacket_fill_skb should fill as the first thing, before any error conditions can happen).
> 
> > > Also, you've got a potential threading issue - you're not protecting
> > > the frame index behind the spinlock.
> > 
> > You are right, I think I will spin-lock outside the do_while loop.
> Can't do that I think - dev_queue_xmit sleeps AFAIK and you can't hold a spinlock when you sleep.  
> Can you explain why you need a spinlock though?  I think you're trying to make it thread-safe if multiple threads make calls to send, correct?
>   In this case, just change it to a semaphore, and protect the entire send call (well - at least in a safe enough way to make sure you're properly unlocking upon exit).  Also, a better question is - is there any particular reason we need to support multiple threads calling send?  Just stipulate that the behaviour is undefined when there are calls to send from different threads.

Yes, this is to protect the mechanism against simultaneous send() calls.
Indeed, using a mutex seems more appropriate, as it lets the CPU do other
things while waiting.
 
> > So, If my understanding of those memory barriers is correct, we should
> > have a smp_rmb() before status reading and smp_wmb() after status
> > writing in skb destructor and send procedure.
> I'm not 100% sure about this - for this I was just referring to how the RX ring buffer does it.  However, I don't think so.  My understanding of the smp_wmb is that it's there so that the data cache is flushed properly.  In other words, when you change the tp_status field, you need to make sure that it's actually been written before you flush.  However, for reading, this is not a problem I think.  I also think that since TP_STATUS_COPY is only used only within the kernel, when you set TP_STATUS_COPY, you don't need to do the smp_wmb or the data cache flush.
> 
> Perhaps someone who's more comfortable with memory barriers could respond on this issue?

According to the memory-barriers documentation, "A write barrier should always be paired with a data dependency barrier or read
barrier" (cf. below).

SMP BARRIER PAIRING
-------------------

When dealing with CPU-CPU interactions, certain types of memory barrier should
always be paired.  A lack of appropriate pairing is almost certainly an error.

A write barrier should always be paired with a data dependency barrier or read
barrier, though a general barrier would also be viable.  Similarly a read
barrier or a data dependency barrier should always be paired with at least an
write barrier, though, again, a general barrier is viable:

	CPU 1		CPU 2
	===============	===============
	a = 1;
	<write barrier>
	b = 2;		x = b;
			<read barrier>
			y = a;

Best regards,
Johann





^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-10-31 18:24   ` Lovich, Vitali
@ 2008-11-04 22:45     ` Johann Baudy
  2008-11-06  0:47       ` Lovich, Vitali
  0 siblings, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-11-04 22:45 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: netdev@vger.kernel.org, Evgeniy Polyakov

Hi Vitali,

> Here's the code I came up with to determine the offset - anyone see any problems?
> 
> union {
>    tpacket2_hdr *t2;
>    u8 *raw;
> }
> 
> struct page *first_frag;
> u32 page_offset;
> 
> first_frag = skb_shinfo(skb)->frags[0].page;
> page_offset = skb_shinfo(skb)->frags[0].page_offset;
> 
> page_offset -= (skb_headlen(skb) + sizeof(struct tpacket2_hdr));
> if (unlikely(page_offset < 0)) {
> 	first_frag--;
> 	page_offset += PAGE_SIZE;
> }
> frame.raw = page_address(first_frag) + page_offset;
> 

This won't work if your network card doesn't support scatter/gather I/O.
When SG is not supported, fragmented packets are linearized inside the
dev_queue_xmit() path, so the addresses of the pages are then lost.

Best regards,
Johann


^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-04 22:33   ` Johann Baudy
@ 2008-11-05  1:50     ` Lovich, Vitali
  0 siblings, 0 replies; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-05  1:50 UTC (permalink / raw)
  To: Johann Baudy; +Cc: netdev@vger.kernel.org, Evgeniy Polyakov



> -----Original Message-----
> From: Johann Baudy [mailto:johaahn@gmail.com]
> Sent: November-04-08 2:34 PM
> To: Lovich, Vitali
> Cc: netdev@vger.kernel.org; Evgeniy Polyakov
> Subject: RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> 
> > > So, If my understanding of those memory barriers is correct, we
> should
> > > have a smp_rmb() before status reading and smp_wmb() after status
> > > writing in skb destructor and send procedure.
> > I'm not 100% sure about this - for this I was just referring to how
> the RX ring buffer does it.  However, I don't think so.  My
> understanding of the smp_wmb is that it's there so that the data cache
> is flushed properly.  In other words, when you change the tp_status
> field, you need to make sure that it's actually been written before you
> flush.  However, for reading, this is not a problem I think.  I also
> think that since TP_STATUS_COPY is only used only within the kernel,
> when you set TP_STATUS_COPY, you don't need to do the smp_wmb or the
> data cache flush.
> >
> > Perhaps someone who's more comfortable with memory barriers could
> respond on this issue?
> 
> According to memory-barrier documentation, "A write barrier should
> always be paired with a data dependency barrier or read
> barrier" (cf. below).
> 
> SMP BARRIER PAIRING
> -------------------
> 
> When dealing with CPU-CPU interactions, certain types of memory barrier
> should
> always be paired.  A lack of appropriate pairing is almost certainly an
> error.
> 
> A write barrier should always be paired with a data dependency barrier
> or read
> barrier, though a general barrier would also be viable.  Similarly a
> read
> barrier or a data dependency barrier should always be paired with at
> least an
> write barrier, though, again, a general barrier is viable:
> 
> 	CPU 1		CPU 2
> 	===============	===============
> 	a = 1;
> 	<write barrier>
> 	b = 2;		x = b;
> 			<read barrier>
> 			y = a;
> 
> Best regards,
> Johann
> 

Right, but check out http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt - namely the implicit nature of ULOCK operations, IMPLICIT KERNEL MEMORY BARRIERS, & ATOMIC OPERATIONS.

However, I'm not sure why currently in the code it's a full memory barrier after __packet_set_status - it could be downgraded to a write barrier.

Also, there is a bug in the current code I believe:

Around line 726:

@@ -723,8 +723,9 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe
 	else
 		sll->sll_ifindex = dev->ifindex;

+ 	smp_wmb();
 	__packet_set_status(po, h.raw, status);
- 	smp_mb();
+ 	smp_wmb();

 	{
 		struct page *p_start, *p_end;

We only really need 1 memory barrier: for architectures where flush_dcache_page is a no-op and writes may be re-ordered, we need the smp_wmb before (if writes may not be re-ordered, ala x86, then a simple barrier() will suffice).  With architectures that actually need the data flushed, we put it after the __packet_set_status (and this can't be a simple barrier).  However, to do it generically, we do both (I couldn't find a pre-processor define that indicates whether flush_dcache_page is a no-op or actually does something).  If we don't do it this way it's possible (however unlikely) that the status is set before the data is filled in correctly.  This'll come up if the compiler moves the status update around or the CPU re-orders writes and flush_dcache_page is a no-op.
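So, together with the dcache flush block that already follows in tpacket_rcv(),
the resulting sequence would look roughly like:

	smp_wmb();				/* frame data before the status word */
	__packet_set_status(po, h.raw, status);
	smp_wmb();				/* status word before the dcache flush */
	/* existing block: flush_dcache_page() over the pages spanned by the frame */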

Should I submit this as a patch, or am I missing something?

"I'm afraid that once a barrier discussion comes up and we insert them, then I become dazedly paranoid and it's very hard to shake me from seeing a need for barriers everywhere, including a barrier before and after every barrier ad infinitum to make sure they're really barriers."
-- Hugh Dickins

Vitali

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
@ 2008-11-05 10:55 Johann Baudy
  2008-11-05 11:02 ` Patrick McHardy
  2008-11-05 17:32 ` Lovich, Vitali
  0 siblings, 2 replies; 59+ messages in thread
From: Johann Baudy @ 2008-11-05 10:55 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: netdev@vger.kernel.org, Evgeniy Polyakov

Hi Vitali,

> Right, but check out http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt - namely the implicit nature of ULOCK operations, IMPLICIT KERNEL MEMORY BARRIERS, & ATOMIC OPERATIONS.
>
> However, I'm not sure why currently in the code it's a full memory barrier after __packet_set_status - it could be downgraded to a write barrier.

So you assume an implicit read or data dep barrier before packet_current_rx_frame()?

>
> Also, there is a bug in the current code I believe:
>
> Around line 726:
>
> @@ -723,8 +723,9 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe
>        else
>                sll->sll_ifindex = dev->ifindex;
>
> +       smp_wmb();
>        __packet_set_status(po, h.raw, status);
> -       smp_mb();
> +       smp_wmb();
>
>        {
>                struct page *p_start, *p_end;
>
> We only really need 1 memory barrier: for architectures where flush_dcache_page is a no-op and writes may be re-ordered, we need the smp_wmb before (if writes may not be re-ordered, ala x86, then a simple barrier() will suffice).  With architectures that actually need the data flushed, we put it after the __packet_set_status (and this can't be a simple barrier).  However, to generically do it, we do both (I couldn't find a pre-processor that gets defined to indicate if flush_dcache_page is no-op or actually does something).  If we don't do it this way it's possible (however unlikely), that the status is set before the data is filled in correctly.  This'll come up if the compiler moves the set status around or the CPU re-orders writes and flush_dcache_page is a no-op.
>
> Should I submit this as a patch, or am I missing something?
>
> "I'm afraid that once a barrier discussion comes up and we insert them, then I become dazedly paranoid and it's very hard to shake me from seeing a need for barriers everywhere, including a barrier before and after every barrier ad infinitum to make sure they're really barriers."
> -- Hugh Dickins
>
> Vitali
>
If my understanding is correct, smp_wmb should ensure that "all the
STORE operations specified before the barrier will appear to happen
before all the STORE operations specified after the barrier with respect
to the other components of the system.". 
I think that those barriers address reordering effects at the hardware
level and depend on the architecture. But again, I'm not an expert on this.

Moreover, if we look into the kernel code, most of the time smp_wmb() is
used immediately after a shared data update, not before
(e.g. kernel/marker.c, kernel/trace/ftrace.c, ...).

I would like to integrate this barrier mechanism into the
__packet_[set|get]_status() functions. Thoughts?

Best regards,
Johann











-- 
Johann Baudy
johaahn@gmail.com




^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-05 10:55 Johann Baudy
@ 2008-11-05 11:02 ` Patrick McHardy
  2008-11-05 17:32 ` Lovich, Vitali
  1 sibling, 0 replies; 59+ messages in thread
From: Patrick McHardy @ 2008-11-05 11:02 UTC (permalink / raw)
  To: Johann Baudy; +Cc: Lovich, Vitali, netdev@vger.kernel.org, Evgeniy Polyakov

Johann Baudy wrote:
> I would like to integrate this barrier mechanism in __packet_[set|
> get]_status() function. Thoughts?

The reason why it's not in there is to avoid the unnecessary
barrier when setting up the ring in packet_set_ring().

The code is fishy though, I agree. Have a look at this discussion:

http://marc.info/?l=linux-netdev&m=119383985207984&w=2

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
@ 2008-11-05 15:16 Johann Baudy
  2008-11-05 17:49 ` Lovich, Vitali
  0 siblings, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-11-05 15:16 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Lovich, Vitali, netdev@vger.kernel.org, Evgeniy Polyakov

Hi Patrick,


> The reason why its not in there is to avoid the unnecessary
> barrier when setting up the ring in packet_set_ring().

Is that a serious problem? packet_set_status() and packet_get_status()
will be called from multiple places, and we would get a write/read
barrier at each call. I imagine it is cleaner to put those barriers
inside the status functions, even if we add one useless barrier when
setting up the ring. No?

>
> The code is fishy though, I agree. Have a look at this discussion:
>
> http://marc.info/?l=linux-netdev&m=119383985207984&w=2
>
My approach was to reuse the RX_RING mechanism, where the kernel allocates
the buffer, in order to get something similar from both the kernel and the
user point of view. Moving the buffer allocation to user space could be
studied as a global enhancement for both RX_RING and TX_RING, but that would
require user application changes.

I'm wondering if I also need to use flush_dcache_page() before reading the
status from the buffer and before attaching the circular-buffer pages to the
skb (both are written by the user).
According to this quote in [cachetlb.txt] of flush_dcache_page part:
	Any time the kernel writes to a page cache page, _OR_
	the kernel is about to read from a page cache page and
	user space shared/writable mappings of this page potentially
	exist, this routine is called.
 	...
	The phrase "kernel writes to a page cache page" means,
	...
	The corollary case is just as important, if there are users
	which have shared+writable mappings of this file, we must make
	sure that kernel reads of these pages will see the most recent
	stores done by the user. 

Can someone confirm?
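Concretely, in the send procedure I have in mind something like this before
touching a frame the user owns (a sketch; __packet_get_status() is assumed
here as the counterpart of __packet_set_status()):

	flush_dcache_page(virt_to_page(ph));	/* see the user's latest stores on
						 * virtually-indexed caches */
	if (__packet_get_status(po, ph) != TP_STATUS_USER)
		break;				/* frame not handed over yet */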

Thanks in advance,
Johann











^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-05 10:55 Johann Baudy
  2008-11-05 11:02 ` Patrick McHardy
@ 2008-11-05 17:32 ` Lovich, Vitali
  1 sibling, 0 replies; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-05 17:32 UTC (permalink / raw)
  To: Johann Baudy; +Cc: netdev@vger.kernel.org, Evgeniy Polyakov



> -----Original Message-----
> From: Johann Baudy [mailto:johaahn@gmail.com]
> Sent: November-05-08 2:55 AM
> To: Lovich, Vitali
> Cc: netdev@vger.kernel.org; Evgeniy Polyakov
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> Hi Vitali,
> 
> > Right, but check out
> http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt -
> namely the implicit nature of ULOCK operations, IMPLICIT KERNEL MEMORY
> BARRIERS, & ATOMIC OPERATIONS.
> >
> > However, I'm not sure why currently in the code it's a full memory
> barrier after __packet_set_status - it could be downgraded to a write
> barrier.
> 
> So you assume an implicit read or data dep barrier before
> packet_current_rx_frame()?
> 
I think you're misunderstanding the data dependency - there's a dependency between the data in the rest of the buffer and the status flag, so we need a read barrier between reading the flag and reading the data.  Everywhere I looked in the code, there's an atomic operation that provides a barrier.  Again, barriers are only there to enforce some kind of indirect data dependency, not to make sure that we're reading the data "consistently" (i.e. seeing the change as soon as it happens in user-space).
> >
> > Also, there is a bug in the current code I believe:
> >
> > Around line 726:
> >
> > @@ -723,8 +723,9 @@ static int tpacket_rcv(struct sk_buff *skb,
> struct net_device *dev, struct packe
> >        else
> >                sll->sll_ifindex = dev->ifindex;
> >
> > +       smp_wmb();
> >        __packet_set_status(po, h.raw, status);
> > -       smp_mb();
> > +       smp_wmb();
> >
> >        {
> >                struct page *p_start, *p_end;
> >
> > We only really need 1 memory barrier: for architectures where
> flush_dcache_page is a no-op and writes may be re-ordered, we need the
> smp_wmb before (if writes may not be re-ordered, ala x86, then a simple
> barrier() will suffice).  With architectures that actually need the
> data flushed, we put it after the __packet_set_status (and this can't
> be a simple barrier).  However, to generically do it, we do both (I
> couldn't find a pre-processor that gets defined to indicate if
> flush_dcache_page is no-op or actually does something).  If we don't do
> it this way it's possible (however unlikely), that the status is set
> before the data is filled in correctly.  This'll come up if the
> compiler moves the set status around or the CPU re-orders writes and
> flush_dcache_page is a no-op.
> >
> > Should I submit this as a patch, or am I missing something?
> >
> > "I'm afraid that once a barrier discussion comes up and we insert
> them, then I become dazedly paranoid and it's very hard to shake me
> from seeing a need for barriers everywhere, including a barrier before
> and after every barrier ad infinitum to make sure they're really
> barriers."
> > -- Hugh Dickins
> >
> > Vitali
> >
> If my understanding is correct, smp_wmb should ensure that "all the
> STORE operations specified before the barrier will appear to happen
> before all the STORE operations specified after the barrier with
> respect
> to the other components of the system.".
> I think that those barriers address reordering effects at the hardware
> level and depend on architecture. But again, I'm not an expert of this.
That's exactly right, with a caveat: there are different kinds of barriers.  barrier() is a compiler load & store barrier (i.e. the compiler won't reorder accesses across it).  The wmb() & rmb() variants also enforce the ordering at the CPU level if the CPU can re-order these kinds of things. The hardware barriers obviously also imply a compiler barrier.

> Moreover if we look into Kernel code, most of the time, smp_wmb() is
> used immediately after shared data update . (not before)
> (ex:kernel/marker.c, kernel/trace/ftrace.c, ...).
> 
> I would like to integrate this barrier mechanism in __packet_[set|
> get]_status() function. Thoughts?
I haven't looked at the code you mentioned (I'll do that later), but I'm sure it's not either before or after.  It's always between (otherwise the barriers are useless).  Again, the barriers are there to enforce in what order data is actually written to or read from memory - thus it must happen between memory accesses that have an indirect dependence.  So again, above we only need 1 write barrier.  On architectures where flush_dcache_page is not a noop, presumably the user-space won't see our writes until after we flush.  But we need to ensure that we flush after the data has been written (or else it might flush the status update without flushing the data change).  Similarly where it's a noop, we need the barrier before the status update - otherwise the user-space can see the status change but the data is actually put into memory later.

Cheers,
Vitali

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-05 15:16 Johann Baudy
@ 2008-11-05 17:49 ` Lovich, Vitali
  0 siblings, 0 replies; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-05 17:49 UTC (permalink / raw)
  To: Johann Baudy, Patrick McHardy; +Cc: netdev@vger.kernel.org, Evgeniy Polyakov



> -----Original Message-----
> From: Johann Baudy [mailto:johaahn@gmail.com]
> Sent: November-05-08 7:17 AM
> To: Patrick McHardy
> Cc: Lovich, Vitali; netdev@vger.kernel.org; Evgeniy Polyakov
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> Hi Patrick,
> 
> 
> > The reason why its not in there is to avoid the unnecessary
> > barrier when setting up the ring in packet_set_ring().
> 
> Is it a serious problem? packet_set_status() and packet_get_status()
> will be called from multiple places and we will obtain a write/read
> barrier at each call. I imagine that's more clean to insert those
> barriers inside status functions even if we add only one useless flush
> at the beginning. No?
You can put a smp_rmb after reading the status in packet_get_status, and a smp_wmb before setting the status in packet_set_status.  Avoiding it because packet_set_ring doesn't need it is an unnecessary optimization in my opinion; it's not the performance critical path so it's not going to cost anything (allocating the memory is far more expensive anyways than a memory barrier). 
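i.e. roughly this (a sketch based on the existing TPACKET_V1/V2 union in
af_packet.c; __packet_get_status() is assumed as the read-side counterpart):

	static void __packet_set_status(struct packet_sock *po, void *frame,
					int status)
	{
		union {
			struct tpacket_hdr *h1;
			struct tpacket2_hdr *h2;
			void *raw;
		} h;

		h.raw = frame;
		smp_wmb();		/* frame contents before the status word */
		switch (po->tp_version) {
		case TPACKET_V1:
			h.h1->tp_status = status;
			break;
		case TPACKET_V2:
			h.h2->tp_status = status;
			break;
		}
	}

	static int __packet_get_status(struct packet_sock *po, void *frame)
	{
		union {
			struct tpacket_hdr *h1;
			struct tpacket2_hdr *h2;
			void *raw;
		} h;
		int status = 0;

		h.raw = frame;
		switch (po->tp_version) {
		case TPACKET_V1:
			status = h.h1->tp_status;
			break;
		case TPACKET_V2:
			status = h.h2->tp_status;
			break;
		}
		smp_rmb();		/* status word before the frame contents */
		return status;
	}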

> 
> >
> > The code is fishy though, I agree. Have a look at this discussion:
> >
> > http://marc.info/?l=linux-netdev&m=119383985207984&w=2
> >
> My approach was to use the mechanism of RX_RING where the kernel is
> allocating the buffer. In order to get something similar from kernel
> and
> user point of view. Moving kernel buffer to user space could be studied
> as global enhancement for RX_RING and TX_RING but that will need user
> application changes.
I'm not seeing a resolution in the above thread - is af_packet wrong in the order it flushes the pages?  It certainly seems like a valid question (I'm on x86, so this is an academic issue for me).
> 
> I'm wondering if I also need to use flush_dcache_page() before getting
> status buffer and attaching circular buffer pages to skb (both written
> by user).
> According to this quote in [cachetlb.txt] of flush_dcache_page part:
> 	Any time the kernel writes to a page cache page, _OR_
> 	the kernel is about to read from a page cache page and
> 	user space shared/writable mappings of this page potentially
> 	exist, this routine is called.
>  	...
> 	The phrase "kernel writes to a page cache page" means,
> 	...
> 	The corollary case is just as important, if there are users
> 	which have shared+writable mappings of this file, we must make
> 	sure that kernel reads of these pages will see the most recent
> 	stores done by the user.
> 
> Can someone confirm?
It sounds like you are correct - you need to call flush_dcache_page.  Presumably this can happen after you read the status flag, unless there's a possibility that you never actually see that update, in which case you should play it safe and do it before, although I'm not sure how significant the performance impact is then.  More importantly, I think, is the comment in the documentation that the deferred flush must happen on the same CPU where the stores occurred; what I come away with is that I don't think I understand the flush_dcache_page situation well enough.
> 
> Thanks in advance,
> Johann
> 

Thanks,
Vitali

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-04 22:45     ` Johann Baudy
@ 2008-11-06  0:47       ` Lovich, Vitali
  2008-11-06  8:03         ` Evgeniy Polyakov
  0 siblings, 1 reply; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-06  0:47 UTC (permalink / raw)
  To: Johann Baudy; +Cc: netdev@vger.kernel.org, Evgeniy Polyakov



> -----Original Message-----
> From: Johann Baudy [mailto:johaahn@gmail.com]
> Sent: November-04-08 2:46 PM
> To: Lovich, Vitali
> Cc: netdev@vger.kernel.org; Evgeniy Polyakov
> Subject: RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> This won't work if your network card doesn't support Scatter/Gather IO.
> Indeed, when SG is not supported, fragmented packets are linearized
> after dev_queue_xmit() call. Thus addresses of pages are then lost.
> 
> Best regards,
> Johann
Hi Johann, 
Actually, I believe it still will.  There are only two cases that can occur if the device doesn't support scatter/gather: dev_queue_xmit() linearizes the skb, or it fails (e.g. too much payload - it can't reallocate a contiguous portion of the skb).

The relevant code is in __pskb_pull_tail of net/core/skbuff.c (note how delta = skb->data_len when called from skb_linearize).

If it linearizes, then great - the fragments themselves aren't modified (note how eat will be 0, so it won't update the frag list in __pskb_pull_tail).
If it doesn't linearize, then dev_queue_xmit() fails and it won't have touched the fragments either.

In either case, the skb given to the destructor will still have the correct values for fragments we specified.  Of course, this is based on 2 assumptions:

1.  Nothing further down the line will add fragments, thereby overwriting frags[0]
2.  No-one writes to frags[0].page & frags[0].page_offset

1 is reasonable because the only reason we would be linearizing in the first place is that the device doesn't support scatter/gather, so it would be strange for something down the line to add more fragments that would have to be linearized anyway.

2 is reasonable since it would only make sense if something down the line used this as a temporary variable storage, which again should be unlikely.

Another approach may be to store it in the cb as we had originally done, except with skb_clone to ensure no other layers overwrite it, although I'm not 100% sure of the implications skb_clone has for things like the cb & users fields and kfree_skb.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-06  0:47       ` Lovich, Vitali
@ 2008-11-06  8:03         ` Evgeniy Polyakov
  2008-11-06 18:49           ` Lovich, Vitali
  0 siblings, 1 reply; 59+ messages in thread
From: Evgeniy Polyakov @ 2008-11-06  8:03 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Johann Baudy, netdev@vger.kernel.org

Hi Vitali.

On Wed, Nov 05, 2008 at 04:47:03PM -0800, Lovich, Vitali (vlovich@qualcomm.com) wrote:
> In either case, the skb given to the destructor will still have the correct values for fragments we specified.  Of course, this is based on 2 assumptions:
> 
> 1.  Nothing further down the line won't add fragments, thereby overwriting frags[0]
> 2.  No-one writes to frags[0].page & frags[0].page_offset
> 
> 1 is reasonable because since the only reason we would be linearizing in the first place is if the device doesn't support scatter/gather, so it would be strange for something down the line to add more fragments that would have to be linearized anyways.
> 
> 2 is reasonable since it would only make sense if something down the line used this as a temporary variable storage, which again should be unlikely.

What if the skb is queued in hardware or in the qdisc and userspace rewrites
the mapped page placed in the frag list?

> Another approach may be to store it in the cb as we had done so originally, except with skb_clone to ensure no other layers overwrite it, although I'm not 100% sure of the implications of skb_clone has for things like the cb & users fields and kfree_skb.

It is not allowed to store something in the cb block that is intended to
stay alive while the skb is processed by different layers. In this particular
case the qdisc engine will overwrite the skb's cb.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-06  8:03         ` Evgeniy Polyakov
@ 2008-11-06 18:49           ` Lovich, Vitali
  2008-11-06 19:40             ` Evgeniy Polyakov
  0 siblings, 1 reply; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-06 18:49 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Johann Baudy, netdev@vger.kernel.org

Hi Evgeniy,

> -----Original Message-----
> From: Evgeniy Polyakov [mailto:zbr@ioremap.net]
> Sent: November-06-08 12:03 AM
> To: Lovich, Vitali
> Cc: Johann Baudy; netdev@vger.kernel.org
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> Hi Vitali.
> 
> On Wed, Nov 05, 2008 at 04:47:03PM -0800, Lovich, Vitali
> (vlovich@qualcomm.com) wrote:
> > In either case, the skb given to the destructor will still have the
> correct values for fragments we specified.  Of course, this is based on
> 2 assumptions:
> >
> > 1.  Nothing further down the line won't add fragments, thereby
> overwriting frags[0]
> > 2.  No-one writes to frags[0].page & frags[0].page_offset
> >
> > 1 is reasonable because since the only reason we would be linearizing
> in the first place is if the device doesn't support scatter/gather, so
> it would be strange for something down the line to add more fragments
> that would have to be linearized anyways.
> >
> > 2 is reasonable since it would only make sense if something down the
> line used this as a temporary variable storage, which again should be
> unlikely.
> 
> What if skb was queued in hardware or qdisk and userspace rewrites
> mapped page placed in fraglist?
Can you please clarify what you mean?  Are you talking about the user changing the status field after the skb was queued?  Or are you saying that the user may ask the kernel to resize the ring buffer while the skb is queued?

In the first case, I don't think we really care - the user is breaking their side of the contract, but we'll still have a valid address since the page pointer should still be valid.

In the second case, the behaviour is undefined because the hardware would use those page pointers for its scatter-gather which would, best case, cause the skb to fail (kernel oops is more likely I think).  We can add this to the documentation, but in the end there's nothing we can really do outside of copying data from the ring buffer into the skb, which defeats the point of doing this.

Or am I completely missing your point?
> 
> > Another approach may be to store it in the cb as we had done so
> originally, except with skb_clone to ensure no other layers overwrite
> it, although I'm not 100% sure of the implications of skb_clone has for
> things like the cb & users fields and kfree_skb.
> 
> It is not allowed to store something in cb block which is intended to
> live while skb is processed on different layers. In this particular
> case
> qdisk engine will overwrite skb's cb.
> 
That's what I figured - however, I was wondering if we even need to go through the qdisc engine.  Couldn't we just bypass it and send out packets directly through the card (i.e. dev_hard_start_xmit) similar to how pktgen does this?  I haven't looked at the documentation (I'm not actually sure with which to start), but are there any caveats that need to be considered (e.g. must be called outside of an interrupt, can't be interruptible, etc.)

The reason I'm thinking we'd want to bypass the qdisc engine is that the idea is that you would use this method only if you want the most performance possible, so why is packet scheduling even needed?  The approach I'm thinking is that if the tx-ring buffer is implemented, we can even get rid of the in-kernel pktgen and port it over to user-space as an example of how to utilize the tx-ring buffer.
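For reference, the pktgen-style direct transmit I'm referring to is roughly
(from memory of pktgen, so treat the details as approximate):

	netif_tx_lock_bh(dev);
	if (!netif_queue_stopped(dev))
		ret = dev->hard_start_xmit(skb, dev);	/* bypasses qdisc entirely */
	netif_tx_unlock_bh(dev);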

Thanks,
Vitali

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-06 18:49           ` Lovich, Vitali
@ 2008-11-06 19:40             ` Evgeniy Polyakov
  2008-11-06 19:53               ` Lovich, Vitali
  0 siblings, 1 reply; 59+ messages in thread
From: Evgeniy Polyakov @ 2008-11-06 19:40 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Johann Baudy, netdev@vger.kernel.org

Hi Vitali.

On Thu, Nov 06, 2008 at 10:49:36AM -0800, Lovich, Vitali (vlovich@qualcomm.com) wrote:
> > What if skb was queued in hardware or qdisk and userspace rewrites
> > mapped page placed in fraglist?
> Can you please clarify what you mean?  Are you talking about the user changing the status field after the skb was queued?  Or are you saying that the user may ask the kernel to resize the ring buffer while the skb is queued?
> 
> In the first case, I don't think we don't really care - the user is breaking their side of the contract, but we'll still have a valid address since the page pointer should still be valid.
> 
> In the second case, the behaviour is undefined because the hardware would use those page pointers for its scatter-gather which would, best case, cause the skb to fail (kernel oops is more likely I think).  We can add this to the documentation, but in the end there's nothing we can really do outside of copying data from the ring buffer into the skb, which defeats the point of doing this.

I'm not sure that it is allowed to fail even for root. The system could
sleep and not permit the resize until all skbs are freed, though.

> > It is not allowed to store something in cb block which is intended to
> > live while skb is processed on different layers. In this particular
> > case
> > qdisk engine will overwrite skb's cb.
> > 
> That's what I figured - however, I was wondering if we even need to go through the qdisc engine.  Couldn't we just bypass it and send out packets directly through the card (i.e. dev_hard_start_xmit) similar to how pktgen does this?  I haven't looked at the documentation (I'm not actually sure with which to start), but are there any caveats that need to be considered (e.g. must be called outside of an interrupt, can't be interruptible, etc.)
> 
> The reason I'm thinking we'd want to bypass the qdisc engine is that the idea is that you would use this method only if you want the most performance possible, so why is packet scheduling even needed?  The approach I'm thinking is that if the tx-ring buffer is implemented, we can even get rid of the in-kernel pktgen and port it over to user-space as an example of how to utilize the tx-ring buffer.

I can not say if it is a good or bad idea, it is not my project :)
But the existing packet socket does not bypass qdisc, so users may expect
the same from the new interface. The overall idea of storing something in cb
and expecting it to live there between layers is very subtle: what if
tomorrow we get a new layer between the driver and qdisc? Or if we start
updating cb in the driver itself while it owns the skb?

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-06 19:40             ` Evgeniy Polyakov
@ 2008-11-06 19:53               ` Lovich, Vitali
  2008-11-07 16:36                 ` Johann Baudy
  2008-11-07 17:28                 ` Evgeniy Polyakov
  0 siblings, 2 replies; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-06 19:53 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Johann Baudy, netdev@vger.kernel.org



> -----Original Message-----
> From: Evgeniy Polyakov [mailto:zbr@ioremap.net]
> Sent: November-06-08 11:41 AM
> To: Lovich, Vitali
> Cc: Johann Baudy; netdev@vger.kernel.org
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> Hi Vitali.
> 
> On Thu, Nov 06, 2008 at 10:49:36AM -0800, Lovich, Vitali
> (vlovich@qualcomm.com) wrote:
> > > What if skb was queued in hardware or qdisk and userspace rewrites
> > > mapped page placed in fraglist?
> > Can you please clarify what you mean?  Are you talking about the user
> changing the status field after the skb was queued?  Or are you saying
> that the user may ask the kernel to resize the ring buffer while the
> skb is queued?
> >
> > In the first case, I don't think we don't really care - the user is
> breaking their side of the contract, but we'll still have a valid
> address since the page pointer should still be valid.
> >
> > In the second case, the behaviour is undefined because the hardware
> would use those page pointers for its scatter-gather which would, best
> case, cause the skb to fail (kernel oops is more likely I think).  We
> can add this to the documentation, but in the end there's nothing we
> can really do outside of copying data from the ring buffer into the
> skb, which defeats the point of doing this.
> 
> I'm not sure that it is allowed to fail even for root. System could
> sleep and not permit the resize until all skbs are freed though.
You're right - I think the code needs an additional check for atomic_read(&po->tx_pending_skb) around line 1867 where it's checking the number of mmaps to the ring buffer.
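i.e. in packet_set_ring(), something like this (a sketch against the hunk in
your patch; the existing condition only checks po->mapped):

	if (closing || (atomic_read(&po->mapped) == 0 &&
			atomic_read(&po->tx_pending_skb) == 0)) {
		/* only now is it safe to swap out the ring */
	}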

Which brings up a possible bug in the proposed patch - the 'mapped' atomic int field should be part of the struct describing a ring buffer, since each ring buffer has a different number of mappings.  Otherwise, there will be issues when trying to create both a tx & rx ring.
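i.e. something like this (member names as in your packet_ring_buffer; the
per-ring 'mapped' count is the suggested addition):

	struct packet_ring_buffer {
		char			**pg_vec;
		unsigned int		head;
		unsigned int		frames_per_block;
		unsigned int		frame_size;
		unsigned int		frame_max;

		unsigned int		pg_vec_order;
		unsigned int		pg_vec_pages;
		unsigned int		pg_vec_len;

		atomic_t		mapped;		/* mappings of this ring only */
	};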
> 
> > > It is not allowed to store something in cb block which is intended
> to
> > > live while skb is processed on different layers. In this particular
> > > case
> > > qdisk engine will overwrite skb's cb.
> > >
> > That's what I figured - however, I was wondering if we even need to
> go through the qdisc engine.  Couldn't we just bypass it and send out
> packets directly through the card (i.e. dev_hard_start_xmit) similar to
> how pktgen does this?  I haven't looked at the documentation (I'm not
> actually sure with which to start), but are there any caveats that need
> to be considered (e.g. must be called outside of an interrupt, can't be
> interruptible, etc.)
> >
> > The reason I'm thinking we'd want to bypass the qdisc engine is that
> the idea is that you would use this method only if you want the most
> performance possible, so why is packet scheduling even needed?  The
> approach I'm thinking is that if the tx-ring buffer is implemented, we
> can even get rid of the in-kernel pktgen and port it over to user-space
> as an example of how to utilize the tx-ring buffer.
> 
> I can not say if it is good or bad idea, it is not my project :)
> But existing packet socket does not bypass qdisc, so users may expect
> the same from the new interface. Overall idea of storing something in
> cb
> and expect it to live there between layers is very subtle: what if
> tomorrow we will gen new layer between driver and qdisc? Or if we will
> start updating cb in the driver itself when it owns skb?
Are there any consequences to bypassing qdisc (other than not participating in any load balancing)?  Also, since any user of the tx ring approach would have to write new code, it would be part of the documentation that the send behaviour is optimized for the lowest latency and highest throughput, so no guarantees or assumptions can be made about the path the data takes through the stack.

I agree with you about not using the cb - I was just looking at all possible alternatives of how this could be solved.  I'm leaning towards my approach of using the frag list because it's the least intrusive (it doesn't require modifying the sk_buff structure) and it seems to be the most resilient (as opposed to hiding data in some unused skb field and hoping no one overwrites it).
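
For illustration, here's a minimal sketch of what the frag-list idea could look like (not the submitted patch: the helper names are made up, it assumes one frame per lowmem page with the frame at the start of the page, and that the skb was charged to the socket with skb_set_owner_w()):

#include <linux/skbuff.h>
#include <linux/if_packet.h>
#include <linux/mm.h>
#include <net/sock.h>

static void tx_ring_skb_destruct(struct sk_buff *skb)
{
	/* frags[0] still points at the ring frame attached below */
	struct page *page = skb_shinfo(skb)->frags[0].page;
	struct tpacket_hdr *ph = page_address(page);

	ph->tp_status = TP_STATUS_KERNEL;	/* hand the slot back to user space */
	sock_wfree(skb);			/* usual send-buffer accounting */
}

static void tx_ring_attach_frame(struct sk_buff *skb, void *frame, int len)
{
	struct page *page = virt_to_page(frame);

	get_page(page);
	/* packet data is assumed to start right after the aligned tpacket header */
	skb_fill_page_desc(skb, 0, page,
			   offset_in_page(frame) + TPACKET_HDRLEN, len);
	skb->len += len;
	skb->data_len += len;
	skb->truesize += len;
	skb->destructor = tx_ring_skb_destruct;
}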

Thanks,
Vitali

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-06 19:53               ` Lovich, Vitali
@ 2008-11-07 16:36                 ` Johann Baudy
  2008-11-07 17:19                   ` Lovich, Vitali
  2008-11-07 17:28                 ` Evgeniy Polyakov
  1 sibling, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-11-07 16:36 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

Hi Vitali.

> Which brings up a possible bug in the proposed patch - the mapped atomic int field should be part of the struct containing a ring buffer - each ring buffer has a different number of mappings.  Otherwise, there will be issues trying to create both a tx & rx ring.

Could you please clarify?
tx_pending_skb is only used for TX process.

> Are there any consequences to bypassing qdisc (other than not participating in any load balancing)?  Also, since any users using the tx ring approach would have to write new code, it would be part of the documentation that the send behaviour is optimized for the least latency and highest throughput, thus no guarantees and assumptions can be made about the path the data takes through the stack.
>
> I agree with you about not using the cb - I was just looking at all possible alternatives of how this can be solved.  I'm liking my approach of using the frag list because it's the least intrusive (doesn't require modification of the skb_buff structure) and the seems to be the most resilient (trying to hide data in some unused skb field and hope no one overwrites it)
>

My feeling is that all socket families have to use the same interface.
We must not bypass it, otherwise we're going to make a mess.

I think we should not rely on skb frags/device management to get the
tpacket_hdr pointer back from the skb. If that changes tomorrow, we will
be in trouble.

A generic solution could be to forward a void * argument along with the skb
destructor as it is forwarded through the layers. Something like
skb->destructor(skb->destructor_arg) when executing the callback. A
backup copy of the destructor/destructor_arg pair should be made if
someone needs to replace it.
This way all entities/layers could use this field to pass an argument
to their destructor without adding a new field in the sk_buff struct.
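
As a rough sketch of this proposal (neither the destructor_arg member nor these functions exist in the current tree; the names are only illustrative):

/* Sketch only: assumes a new "void *destructor_arg" member in struct
 * sk_buff (e.g. under a feature flag). */
static void tpacket_destruct_skb(struct sk_buff *skb)
{
	struct tpacket_hdr *ph = skb->destructor_arg;	/* set at send time */

	ph->tp_status = TP_STATUS_KERNEL;	/* frame slot is free again */
	sock_wfree(skb);			/* skb was charged with skb_set_owner_w() */
}

/* in tpacket_snd(), after allocating the skb for a frame: */
static void tpacket_arm_destructor(struct sk_buff *skb, struct tpacket_hdr *ph)
{
	skb->destructor = tpacket_destruct_skb;
	skb->destructor_arg = ph;	/* forwarded untouched through the layers */
}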

Thanks,
Johann
-- 
Johann Baudy
johaahn@gmail.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-07 16:36                 ` Johann Baudy
@ 2008-11-07 17:19                   ` Lovich, Vitali
  2008-11-10 20:29                     ` Lovich, Vitali
  2008-11-11 11:43                     ` Johann Baudy
  0 siblings, 2 replies; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-07 17:19 UTC (permalink / raw)
  To: Johann Baudy; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

Hi Johann,

> -----Original Message-----
> From: Johann Baudy [mailto:johaahn@gmail.com]
> Sent: November-07-08 8:36 AM
> To: Lovich, Vitali
> Cc: Evgeniy Polyakov; netdev@vger.kernel.org
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> > Which brings up a possible bug in the proposed patch - the mapped
> atomic int field should be part of the struct containing a ring buffer
> - each ring buffer has a different number of mappings.  Otherwise,
> there will be issues trying to create both a tx & rx ring.
> 
> Could you please clarify?
> tx_pending_skb is only used for TX process.
Right, but I'm referring to the mapped variable in struct packet_sock.  It should actually be part of packet_ring_buffer - otherwise, after mapping a tx ring, mapped will be non-zero, so packet_set_ring won't work; the relevant condition is:
if (closing || atomic_read(&po->mapped) == 0)
 // create mapping

Similarly, what Evgeniy was saying is that packet_set_ring shouldn't succeed if there are pending packets.  This would also mean we have to acquire the same lock that sending does.
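
A small sketch of that extra guard (tx_pending_skb is the atomic counter from the proposed patch, not an existing field; packet_ring_busy() is just an illustrative helper):

/* refuse to touch the ring while user space still has it mapped or
 * while TX skbs built from it are still in flight */
static int packet_ring_busy(struct packet_sock *po)
{
	return atomic_read(&po->mapped) != 0 ||
	       atomic_read(&po->tx_pending_skb) != 0;
}

/* then, in packet_set_ring():
	if (!closing && packet_ring_busy(po))
		return -EBUSY;
 */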
> 
> > Are there any consequences to bypassing qdisc (other than not
> participating in any load balancing)?  Also, since any users using the
> tx ring approach would have to write new code, it would be part of the
> documentation that the send behaviour is optimized for the least
> latency and highest throughput, thus no guarantees and assumptions can
> be made about the path the data takes through the stack.
> >
> > I agree with you about not using the cb - I was just looking at all
> possible alternatives of how this can be solved.  I'm liking my
> approach of using the frag list because it's the least intrusive
> (doesn't require modification of the skb_buff structure) and the seems
> to be the most resilient (trying to hide data in some unused skb field
> and hope no one overwrites it)
> >
> 
> My feeling is that all socket families have to use the same interface.
> We must not bypass it, otherwise we're gonna make a mess.
It's not going to be a mess - qdisc is for load balancing only as far as I'm aware.  Either qdisc will pass it on directly to dev_hard_start_xmit or it'll schedule a softirq to do it.  Either way, it's just an intermediary layer to control the flow of traffic.  Particularly since the MMAP path is meant to be as high-performing as possible, I don't see why we should participate in flow control, which will only increase latency and CPU usage while decreasing throughput - if we state in the documentation that using the mmap approach to sending bypasses flow control, I think it's acceptable.

Am I missing the point of qdisc?
 
> I think we must not care about skb frags/device management to get
> tpacket_hdr pointer back from skb. If it is changing tomorrow, we will
> be in trouble. 
I'm a bit confused about the point you are trying to make here.  Could you please clarify?

> 
> A generic solution could be to forward a void* argument as the skb
> destructor is forwarded through layers. Kind of:
> skb->destructor(skb->desctructor_arg) when executing the callback. A
> backup copy of destructor/destruct_arg pair should be performed if
> someone needs to replace it.
> This way all entities/layers could use this field to give an argument
> to their destructor without adding a new field in sk_buff struct.
This would require a massive change in the kernel to add a feature that until now no one has needed.  Certainly, I don't think it's wise to introduce such a widespread change when we could just as well accomplish it through other means.  Here are my problems with this approach:

1.  There's no need to actually pass the argument through the destructor - it could just be part of the skb (after all, since the callback only happens in one place, it's not like we could customize that argument - it would always be skb->destructor_arg).  That way whoever needs to use it would use it, but there's no need to change all existing destructor code.

2.  Backing up this field isn't really possible - where would you place that backup?

3.  You're increasing the size of the structure for everyone, even if they have disabled PACKET_MMAP (and since no-one else uses this, it doesn't really make sense).  So if it is decided that the correct approach is to store the pointer in the skb, then we should continue doing what other protocols do in the situation where they need custom data in the skb and just add our own custom pointer surrounded by an #ifdef/#endif for PACKET_MMAP. 

Thanks,
Vitali

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-06 19:53               ` Lovich, Vitali
  2008-11-07 16:36                 ` Johann Baudy
@ 2008-11-07 17:28                 ` Evgeniy Polyakov
  2008-11-07 20:22                   ` David Miller
  1 sibling, 1 reply; 59+ messages in thread
From: Evgeniy Polyakov @ 2008-11-07 17:28 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Johann Baudy, netdev@vger.kernel.org

On Thu, Nov 06, 2008 at 11:53:35AM -0800, Lovich, Vitali (vlovich@qualcomm.com) wrote:
> Are there any consequences to bypassing qdisc (other than not participating in any load balancing)?  Also, since any users using the tx ring approach would have to write new code, it would be part of the documentation that the send behaviour is optimized for the least latency and highest throughput, thus no guarantees and assumptions can be made about the path the data takes through the stack.

qdisc is the base for multiqueue NICs (and all other NICs too now); you cannot bypass it there.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-07 17:28                 ` Evgeniy Polyakov
@ 2008-11-07 20:22                   ` David Miller
  0 siblings, 0 replies; 59+ messages in thread
From: David Miller @ 2008-11-07 20:22 UTC (permalink / raw)
  To: zbr; +Cc: vlovich, johaahn, netdev

From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Fri, 7 Nov 2008 20:28:10 +0300

> On Thu, Nov 06, 2008 at 11:53:35AM -0800, Lovich, Vitali (vlovich@qualcomm.com) wrote:
> > Are there any consequences to bypassing qdisc (other than not participating in any load balancing)?  Also, since any users using the tx ring approach would have to write new code, it would be part of the documentation that the send behaviour is optimized for the least latency and highest throughput, thus no guarantees and assumptions can be made about the path the data takes through the stack.
> 
> qdisc is a base for multiqueue NICs (and all others too now), you can not bypass it there.

Right.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-07 17:19                   ` Lovich, Vitali
@ 2008-11-10 20:29                     ` Lovich, Vitali
  2008-11-11  0:29                       ` Lovich, Vitali
  2008-11-11 12:10                       ` Johann Baudy
  2008-11-11 11:43                     ` Johann Baudy
  1 sibling, 2 replies; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-10 20:29 UTC (permalink / raw)
  To: Lovich, Vitali, Johann Baudy; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

Hi Johann,

Found another issue.  I'm not sure how your code managed to work, but in my patch I had to have skb_reset_network_header(skb) after a skb_reserve - otherwise, it would be rejected when I passed it on to the device.

Also, there's no need to support both header versions in the tx path - since v2 should be used anyway and there's no legacy (v1 headers are deprecated, no?), we can use v2 only in the tx path.

Additionally, if the packet is too long, the kernel has a responsibility to clear that flag (otherwise that slot will never free up).

I'm going to review the code a little bit more later on today as I try to integrate our versions.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-10 20:29                     ` Lovich, Vitali
@ 2008-11-11  0:29                       ` Lovich, Vitali
       [not found]                         ` <7e0dd21a0811110656yff651afp8ff0f9928b79f545@mail.gmail.com>
  2008-11-11 12:10                       ` Johann Baudy
  1 sibling, 1 reply; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-11  0:29 UTC (permalink / raw)
  To: Lovich, Vitali, Johann Baudy; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

> Also, there's no need to support both header versions in the tx path -
> since v2 should be used anyways and there's no legacy (v1 headers are
> deprecated no?), we can use v2 only in the tx path.
There's also a minor typo in tpacket_fill_skb where you read tp_len.  You check the header version, but then always load tp_len from tpacket_hdr (this code should be removed anyway).

The assignment of -EFAULT to err right before filling the fragments is redundant, since err is never used after that.

You don't reset the tp_status flag when you reject a message for being too short.

You incorrectly calculate to_write when sock->type == SOCK_DGRAM.

You don't check for the MAX_SKB_FRAGS condition.

Not your code, but in packet_set_ring, it's redundant to clear the status for all the frames to TP_STATUS_KERNEL (0) since the pages are allocated with kzalloc (pages are already zeroed).  I surrounded this with a CONFIG_DEBUG_KERNEL block.

Earlier, I think I had said that the mapped variable should be placed in the packet_ring_buffer struct - I was wrong.  It's correct where it is.  I had forgotten that rx & tx are mapped contiguously and simultaneously.  There needs to be a note in the doc that the rings should be set up before mapping (i.e. you cannot set up the rx or tx ring buffer first, mmap, and then try to change it).  We do still need a check for tx_pending_skb when sizing the ring buffer (for the situation where there are unfreed skbs created by a call to send from Thread B while Thread A unmaps the ring buffer and tries a resize).  Also, in tpacket_snd, we need to increment mapped before allocating the skb and test whether the original value was 0 (we decrement after incrementing tx_pending_skb).  If it was, then that indicates that the user had unmapped the ring buffer, which means we should stop sending.  No need for additional locking because of the atomic relationship this sets up between tx_pending_skb and mapped.

I'll be posting my suggested patch tomorrow.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-07 17:19                   ` Lovich, Vitali
  2008-11-10 20:29                     ` Lovich, Vitali
@ 2008-11-11 11:43                     ` Johann Baudy
  2008-11-11 17:38                       ` Lovich, Vitali
  1 sibling, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-11-11 11:43 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

Hi Vitali,

>
>> I think we must not care about skb frags/device management to get
>> tpacket_hdr pointer back from skb. If it is changing tomorrow, we will
>> be in trouble.
> I'm a bit confused about the point you are trying to make here.  Could you please clarify?

I believe that a solution with an additional field in the sk_buff struct
is preferable to one that depends on a sub-layer mechanism (frags
management).

> This would require a massive change in the kernel to add a feature that up till now no one has needed.  Certainly, I don't think it's wise to introduce such a wide-spread change when we could just as well accomplish it through other means.  Here are my problems with this approach:
>
> 1.  There's no need to actually pass the argument through the destructor - it could just be part of the skb (after all, since the callback only happens in one place, it's not like we could customize that argument - it would always be skb->destructor_arg).  That way whoever needs to use would use it,  but there's no need to change all existing destructor code.

Yes! Passing it as a callback argument is not mandatory.

> 2.  Backing up this field isn't really possible since where would you place that backup?

Ex: the destructor arg points to a struct that contains the previous
destructor function and destructor argument.
Those are restored and executed by the new destructor.
The layer that replaces them is in charge of the backup.
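
Roughly, and assuming the proposed destructor_arg member existed (all names here are illustrative, not existing code):

#include <linux/skbuff.h>
#include <linux/slab.h>

struct destructor_backup {
	void (*old_destructor)(struct sk_buff *skb);
	void *old_arg;
	/* ... data of the layer installing the new destructor ... */
};

static void chained_destructor(struct sk_buff *skb)
{
	struct destructor_backup *b = skb->destructor_arg;

	/* do this layer's cleanup here, then restore and run the old pair */
	skb->destructor = b->old_destructor;
	skb->destructor_arg = b->old_arg;
	if (skb->destructor)
		skb->destructor(skb);
	kfree(b);
}

static int install_chained_destructor(struct sk_buff *skb,
				      void (*destructor)(struct sk_buff *))
{
	struct destructor_backup *b = kmalloc(sizeof(*b), GFP_ATOMIC);

	if (!b)
		return -ENOMEM;
	b->old_destructor = skb->destructor;	/* the replacing layer keeps the backup */
	b->old_arg = skb->destructor_arg;
	skb->destructor_arg = b;
	skb->destructor = destructor;
	return 0;
}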

>
> 3.  You're increasing the size of the structure for everyone, even if they have disabled PACKET_MMAP (and since no-one else uses this, it doesn't really make sense).  So if it is decided that the correct approach is to store the pointer in the skb, then we should continue doing what other protocols do in the situation where they need custom data in the skb and just add our own custom pointer surrounded by an #ifdef/#endif for PACKET_MMAP.

That could be put behind another "feature flag" which would be
enabled only when a feature such as PACKET_MMAP needs it.

I'm suggesting a variable that sub-layers will guarantee in the same way
as skb->destructor; a variable that will at least be restored before the
destructor callback is called.

This could also be used for the sock_wfree() destructor. The current code
assumes that the socket destructor is always sock_wfree(); if
tomorrow this callback is changed or removed, we will have to change
the packet_mmap code (called in packet_skb_destruct()). Such a mechanism
would make replacement of the destructor/destructor arg more independent
of the other layers.

Best regards,
Johann

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-10 20:29                     ` Lovich, Vitali
  2008-11-11  0:29                       ` Lovich, Vitali
@ 2008-11-11 12:10                       ` Johann Baudy
  2008-11-11 17:44                         ` Lovich, Vitali
  1 sibling, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-11-11 12:10 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

Hi Vitali,

> Found another issue.  I'm not sure how your code managed to work, but in my patch I had to have skb_reset_network_header(skb) after a skb_reserve - otherwise, it would be rejected when I passed it on to the device.

That's not an issue with my device.
It is called in the original packet_sendmsg(); I had missed it :)

>
> Also, there's no need to support both header versions in the tx path - since v2 should be used anyways and there's no legacy (v1 headers are deprecated no?), we can use v2 only in the tx path.

I think we either have to use our own TX header (we are only using the
status field of V2) or keep compatibility with all available RX/TX
headers.
Being compatible with V1/V2 makes the TX code more flexible if there is a V3
in the future.

> Additionally, if the packet size is too long, the kernel has a responsibility to clear that flag (otherwise that slot will never free up).
I'm not sure I understand, could you please clarify?

Thanks,
Johann

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
       [not found]                         ` <7e0dd21a0811110656yff651afp8ff0f9928b79f545@mail.gmail.com>
@ 2008-11-11 14:59                           ` Johann Baudy
  2008-11-11 19:05                             ` Lovich, Vitali
  0 siblings, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-11-11 14:59 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

Hi Vitali,

>> You don't reset the tp_status flag when you reject a message for being too short.
>> You incorrectly calculate to_write in the situation of sock->type == SOCK_DGRAM.

I've already changed this code, I will share my changes.

>>
>> You don't check for the MAX_SKB_FRAGS condition.

Thanks, I will add this check.

>>
>> Not your code, but in packet_set_ring, it's redundant to clear the status for all the frames to TP_STATUS_KERNEL (0) since the pages are allocated with kzalloc (pages are already zeroed).  I surrounded this with a CONFIG_DEBUG_KERNEL block.

OK, so we can skip the block below according to this assumption:
                for (i = 0; i < req->tp_block_nr; i++) {
                        void *ptr = pg_vec[i];
                        int k;

                        for (k = 0; k < rb->frames_per_block; k++) {
                                __packet_set_status(po, ptr, TP_STATUS_KERNEL);
                                ptr += req->tp_frame_size;
                        }
                }

>>
>> Earlier, I think I had said that the mapped variable should be placed in packet_ring_buffer struct - I was wrong.  It's correct where it is.  I had forgotten that rx & tx are mapped contiguously and simultaneously.  There needs to be a note in the doc that the rings should be set up before mapping (i.e. cannot set up the rx or tx ring buffer first, mmap, and then try to change it).

 Yes, we can also add a check for po->mapped in the
 PACKET_TX_RING/PACKET_RX_RING cases of the packet_setsockopt() switch.


>> We do need still need a check for tx_pending_skb when sizing the ring buffer (in the situation where we unfreed skbs created by a call to send from Thread B, while Thread A unmaps the ring buffer and trys a resize). Also, in tpacket_snd, we need to increment mapped before allocating the skb and test if the original value was 0 (we decrement after incrementing tx_pending_skb).  If it was, then that indicates that the user had unmapped the ring buffer, which means we should stop sending.  No need for additional locking because of the atomic relationship this sets up between tx_pending_skb and mapped.

 I'm not sure I understand:

 Let me sum up:
 * when setting or resizing the ring: we have to return an error if:
 - someone has mapped the socket (1.a)
 - sending is ongoing - tx_pending_skb != 0 (1.b)
 - a send() procedure is ongoing (1.c)

 * when closing the socket, we have to return an error if:
 - sending is ongoing - tx_pending_skb != 0 (2.a)
 - a send() procedure is ongoing (2.b)

 * during the do-while loop of send():
 - protect ring against simultaneous send() (3.a)
 - ensure that ring is not resized (3.b) (linked with 1.c)
 Our main issue is when tx_pending_skb = 0 and Thread A starts to send a
 packet while Thread B unmaps and resizes the ring:
 Thread B can potentially resize the ring up until the increment of
 tx_pending_skb in Thread A.
 - the loop can continue even if the ring is not mapped (3.c); this makes no
 sense but it is not critical

 * when executing the destructor callback:
 - we must ensure that ring structure has not changed since send() (4.a)

 (1.a) can be done with a po->mapped check in packet_setsockopt()
 (1.b) and (2.a) can be done as you said: a tx_pending_skb != 0 check
 when sizing the buffer in packet_set_ring()
 (3.a) by a mutex M outside of the do-while loop
 (1.c), (2.b) and (3.b) can be performed thanks to the same mutex M and
 mutex_trylock() in packet_set_ring(). An error is returned if we can't
 take the mutex. The mutex is released at the end of the packet_set_ring()
 call.
 (4.a) is covered by the tx_pending_skb increment, (3.b), (1.b) and (2.a)
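
 A rough sketch of this locking, assuming a mutex (here called pg_lock) added to packet_sock; error handling and the actual ring work are trimmed:

#include <linux/mutex.h>

static int tpacket_snd(struct packet_sock *po)
{
	int err = 0;

	mutex_lock(&po->pg_lock);	/* (3.a): one sender at a time */
	/* ... do-while loop filling frames and sending skbs ... */
	mutex_unlock(&po->pg_lock);
	return err;
}

static int packet_set_ring(struct packet_sock *po, int closing)
{
	if (!mutex_trylock(&po->pg_lock))
		return -EBUSY;		/* (1.c)/(2.b)/(3.b): a send() is running */

	/* (1.a)/(1.b): also refuse if mapped or tx_pending_skb != 0 */
	/* ... actual setup/teardown work ... */

	mutex_unlock(&po->pg_lock);
	return 0;
}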

 Do I miss something?

 Many thanks,
 Johann

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-11 11:43                     ` Johann Baudy
@ 2008-11-11 17:38                       ` Lovich, Vitali
  2008-11-11 17:50                         ` Johann Baudy
  0 siblings, 1 reply; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-11 17:38 UTC (permalink / raw)
  To: Johann Baudy; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

Hi Johann,

> > 2.  Backing up this field isn't really possible since where would you
> place that backup?
> 
> Ex: Destructor arg points to a struct that contains previous
> destructor function and destructor argument.
> Those are restored and executed during new destructor.
> Layer that replaces it, is in charge of the backup
> >
> > 3.  You're increasing the size of the structure for everyone, even if
> they have disabled PACKET_MMAP (and since no-one else uses this, it
> doesn't really make sense).  So if it is decided that the correct
> approach is to store the pointer in the skb, then we should continue
> doing what other protocols do in the situation where they need custom
> data in the skb and just add our own custom pointer surrounded by an
> #ifdef/#endif for PACKET_MMAP.
> 
> That's could be included into another "feature flag" which will be
> enabled only when a feature as PACKET_MMAP needs it.
> 
> I'm suggesting a variable that sub-layers will guarantee as
> skb->destructor. A variable that will be at least restored before
> destructor callback call.
Again, I'm not convinced this is 100% feasible.  I understand, I think, what you're getting at.  If you want to use the field, you define a structure like struct my_arg { void *old, ... my data ... }.  However, this isn't like C++ where the destructors for all layers are called.  There is only 1 destructor, for the layer that created the skb (AFAIK) - all the other layers simply increment their reference to it (which is decremented by a free).  Thus there's really no way to do what you want with minimal impact on existing code - the ideal behaviour would be to define a common struct (i.e. { void *olddata, void *thisdata }) that is filled with whatever data that layer needs when it increments the refcount, and then have kfree_skb unwind the pointers as needed.  This would require a massive change to the code though, increase the amount of memory usage unnecessarily, and be prone to memory leaks if it's not done carefully.
> 
> This can be used also for sock_w_free() destructor. Current code is
> assuming that the socket destructor is always sock_w_free(), if
> tomorrow, this callback is changed or removed, we will have to change
> packet_mmap code (called in packet_skb_destruct()). Such mechanism
> should make replacement of destructor/destructor arg more independent
> toward layers.
We're talking about skb destructors, not socket destructors.  sock_wfree is not a destructor - it releases the resources allocated to the skb back to the socket that the resources are from.

Also, from what I can tell, this is how the Linux kernel works - if someone's change breaks the code, they're responsible for fixing it.  Changing how sock_wfree works is significant enough that we don't need to worry about making it easier - it'll probably be so significant that there will be a lot of other changes anyway.  It's also highly unlikely to happen, I think.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-11 12:10                       ` Johann Baudy
@ 2008-11-11 17:44                         ` Lovich, Vitali
  2008-11-11 18:08                           ` Johann Baudy
  0 siblings, 1 reply; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-11 17:44 UTC (permalink / raw)
  To: Johann Baudy; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

Hi Johann,

> > Additionally, if the packet size is too long, the kernel has a
> responsibility to clear that flag (otherwise that slot will never free
> up).
> I'm not sure to understand, could you please clarify ?
If the user places a 1 KB packet into the buffer, but when we try to send it, it turns out the MTU is too small (or something like that), and we don't clear the flag back to TP_STATUS_KERNEL and just return an error, the user will never know.

      --------------------------------------
0    | 1 KB frame - status = TP_STATUS_USER |
      --------------------------------------
1    | 1 KB frame - status = TP_STATUS_USER |
      --------------------------------------
Etc

So when the user fills up the ring buffer and goes back to frame 0, he won't know that that packet was actually fatally rejected (i.e. the kernel moved on) and will just block waiting for the status to clear (which it never will).  Similarly in the case where the packet gets rejected for being too small.

In other words, when the kernel doesn't send a packet but moves on and doesn't keep trying (as would be the case for the packet size being too small or too big because that is considered an irrecoverable error), it has to notify the user by clearing that status flag for that frame.
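
To make that concrete, here's a user-space sketch of the sender's loop (a sketch only: it follows the draft convention in this thread that a user-filled frame is marked TP_STATUS_USER, and it assumes the frame data sits right after the aligned header).  If the kernel never touches tp_status for a rejected frame, the wait below never ends:

#include <linux/if_packet.h>
#include <string.h>

static void fill_and_send_frame(void *frame, const void *data, size_t len)
{
	volatile struct tpacket_hdr *ph = frame;

	/* wait until the kernel has released this slot */
	while (ph->tp_status != TP_STATUS_KERNEL)
		;	/* spins forever if the kernel silently dropped the frame */

	memcpy((char *)frame + TPACKET_HDRLEN, data, len);
	ph->tp_len = len;
	ph->tp_status = TP_STATUS_USER;	/* hand the slot to the kernel */
	/* a later send(fd, NULL, 0, 0) kicks the kernel to walk the ring */
}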

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-11 17:38                       ` Lovich, Vitali
@ 2008-11-11 17:50                         ` Johann Baudy
  2008-11-11 18:14                           ` Lovich, Vitali
  0 siblings, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-11-11 17:50 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

Hi Vitali,

> Again, I'm not convinced this is 100% feasible.  I understand, I think, what you're getting at.  If you want to use the field, you defined a structure like struct my_arg { void *old, ... my data ... }.  However, this isn't like C++ where the destructors for all layers are called.  There is only 1 destructor for the layer that created the skb (AFAIK) - all the layers simply increment their reference to it (which is decremented by a free).  Thus there's really no way to do what you want with minimal impact on existing code - the ideal behaviour would be to define a common struct (i.e. { void * olddata, void *thisdata }, that is filled with whatever data that layer needs when it increments the refcount and then have kfree_skb unwind the pointer as needed.  This would require a massive change to the code though, increase the amount of memory usage unnecessarily, and be prone to memory leaks if it's not done carefully.

You say "There is only 1 destructor for the layer that created the skb
(AFAIK)", why not a "There is only 1 destructor argument for the layer
that created the skb (AFAIK)"?
Where is the impact? only a void * destructor_arg that only the
creator of skb is allowed to use. Moreover it can be put under feature
flag.

>>
>> This can be used also for sock_w_free() destructor. Current code is
>> assuming that the socket destructor is always sock_w_free(), if
>> tomorrow, this callback is changed or removed, we will have to change
>> packet_mmap code (called in packet_skb_destruct()). Such mechanism
>> should make replacement of destructor/destructor arg more independent
>> toward layers.
> We're talking about skb destructors, not socket destructors.  sock_wfree is not a destructor - it releases the resources allocated to the skb back to the socket that the resources are from.
I agree, I'm just talking about the destructor/destructor_arg pair.

>
> Also, from what I can tell, this is how the Linux kernel works - if someone's change breaks the code, they're responsible for fixing it.  Changing how sock_wfree is fairly significant enough that we don't need to worry about making it easier - it'll probably be so significant that there will be a lot of changes anyways.  It's also highly unlikely to happen I think.

I don't want to break this code. I just want to add a variable in sk_buff
that can be used the same way the destructor is: a kind of cb[] that is not
overwritten by sub-layers, or at least is backed up.

Thanks,
Johann




-- 
Johann Baudy
johaahn@gmail.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-11 17:44                         ` Lovich, Vitali
@ 2008-11-11 18:08                           ` Johann Baudy
  2008-11-11 18:19                             ` Lovich, Vitali
  0 siblings, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-11-11 18:08 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

Hi Vitali,

> If the user places a 1 KB packet into the buffer, but then when we try to send it, it turns out the mtu is too small (or something like that), if we don't clear the flag back to TP_STATUS_KERNEL and just return an error, the user will never know.
>
>      --------------------------------------
> 0    | 1 KB frame - status = TP_STATUS_USER |
>      --------------------------------------
> 1    | 1 KB frame - status = TP_STATUS_USER |
>      --------------------------------------
> Etc
>
> So when the user fills up the ring buffer and goes back to frame 0, he won't know that that packet was actually fatally rejected (i.e. the kernel moved on) and will just block waiting for the status to clear (which it never will).  Similarly in the case where the packet gets rejected for being too small.
>
> In other words, when the kernel doesn't send a packet but moves on and doesn't keep trying (as would be the case for the packet size being too small or too big because that is considered an irrecoverable error), it has to notify the user by clearing that status flag for that frame.
>
If there is an error in packet_skb_fill, it returns a negative value
as the error code. If a negative value is returned from packet_skb_fill we
exit the send() procedure and set the status of the packet to TP_STATUS_USER.
(This is how the error is handled in my latest code, not necessarily the
committed one.)
Do you want to introduce a new packet status that marks the packet as
in error (TP_STATUS_LOSING for example)? This way the user would be
able to easily determine which frame is failing.

PS: I need to test my last code at office and I'll send it.

Best regards,
Johann

-- 
Johann Baudy
johaahn@gmail.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-11 17:50                         ` Johann Baudy
@ 2008-11-11 18:14                           ` Lovich, Vitali
  2008-11-11 18:50                             ` Evgeniy Polyakov
  0 siblings, 1 reply; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-11 18:14 UTC (permalink / raw)
  To: Johann Baudy; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org



> -----Original Message-----
> From: Johann Baudy [mailto:johaahn@gmail.com]
> Sent: November-11-08 9:50 AM
> To: Lovich, Vitali
> Cc: Evgeniy Polyakov; netdev@vger.kernel.org
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> Hi Vitali,
> 
> You say "There is only 1 destructor for the layer that created the skb
> (AFAIK)", why not a "There is only 1 destructor argument for the layer
> that created the skb (AFAIK)"?
> Where is the impact? only a void * destructor_arg that only the
> creator of skb is allowed to use. Moreover it can be put under feature
> flag.
Right, we're saying the exact same thing, I think.  I had wanted just a void * that is specific to PACKET_MMAP, so that no one else pays the price if they don't need it.  You instead want to make it a feature flag - that's fine I think, but we need to make a note of that in the Kconfig file (that enabling PACKET_MMAP increases all skbs by the size of a long).  I'd still prefer using the fragments though, because it seems like a reasonable solution that has zero impact on any other code.

> 
> >>
> >> This can be used also for sock_w_free() destructor. Current code is
> >> assuming that the socket destructor is always sock_w_free(), if
> >> tomorrow, this callback is changed or removed, we will have to
> change
> >> packet_mmap code (called in packet_skb_destruct()). Such mechanism
> >> should make replacement of destructor/destructor arg more
> independent
> >> toward layers.
> > We're talking about skb destructors, not socket destructors.
> sock_wfree is not a destructor - it releases the resources allocated to
> the skb back to the socket that the resources are from.
> I agree, I'm just talking about destructor/destructor_arg pair.
> 
> >
> > Also, from what I can tell, this is how the Linux kernel works - if
> someone's change breaks the code, they're responsible for fixing it.
> Changing how sock_wfree is fairly significant enough that we don't need
> to worry about making it easier - it'll probably be so significant that
> there will be a lot of changes anyways.  It's also highly unlikely to
> happen I think.
> 
> I don't want to break this code. Just add a variable in sk_buff that
> can be used as the destructor is ! kind of CB[] that is not
> overwritten by sub layer ! or at least backuped.	
I wasn't implying you will break the code.  I was implying that if someone was changing how sock_wfree was used they would be breaking the code.  We don't need to make it easier for them.  Again, I'm not really convinced of the utility of such a feature when it seems like this is the only code that would benefit, and even then there's a way around it.

Vitali

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-11 18:08                           ` Johann Baudy
@ 2008-11-11 18:19                             ` Lovich, Vitali
  2008-11-11 18:59                               ` Johann Baudy
  0 siblings, 1 reply; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-11 18:19 UTC (permalink / raw)
  To: Johann Baudy; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org



> -----Original Message-----
> From: Johann Baudy [mailto:johaahn@gmail.com]
> Sent: November-11-08 10:08 AM
> To: Lovich, Vitali
> Cc: Evgeniy Polyakov; netdev@vger.kernel.org
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> If there is an error in packet_skb_fill, it returns a negative value
> as error code. If a negative value is returned from packet_skb_fill we
> exit send() procedure and set status of packet to TP_STATUS_USER.
> (this is how error is handled in my last code (not sure committed
> one)).
> Do you want to introduce a new packet status that marks packet as
> ERROR (with TP_STATUS_LOOSING for example?). This way user will be
> able to determine easily the one which is failing?
> 
> PS: I need to test my last code at office and I'll send it.
I'm saying that that frame's status should just be set to TP_STATUS_KERNEL, not user.  The kernel wants nothing to do with this frame any more and the user needs to know that.  I was going to add statistics tracking to keep track of how many packets were sent vs rejected.  This was more of an issue with my approach where I had a dedicated kernel thread doing all of this.

Thanks,
Vitali

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-11 18:14                           ` Lovich, Vitali
@ 2008-11-11 18:50                             ` Evgeniy Polyakov
  2008-11-11 19:19                               ` Johann Baudy
  0 siblings, 1 reply; 59+ messages in thread
From: Evgeniy Polyakov @ 2008-11-11 18:50 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Johann Baudy, netdev@vger.kernel.org

Hi.

On Tue, Nov 11, 2008 at 10:14:54AM -0800, Lovich, Vitali (vlovich@qualcomm.com) wrote:
> Right, we're saying the exact same thing I think.  I had wanted just a void * that is specific to PACKET_MMAP, so that no one else pays the price if they don't need to.  You instead want to make it a feature flag - that's fine I think, but we need to make a note of that in the Kconfig file (that enabling PACKET_MMAP increases all skbs by the size of a long).  I'd still rather prefer using the fragments instead though, because it seems like a reasonable solution that has 0 impact on any other code.

I think you should be almost 200% sure that skb is not allowed to grow up :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-11 18:19                             ` Lovich, Vitali
@ 2008-11-11 18:59                               ` Johann Baudy
  2008-11-11 19:10                                 ` Lovich, Vitali
  0 siblings, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-11-11 18:59 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

Hi Vitali,

> I'm saying that that frame's status should just be set to TP_STATUS_KERNEL, not user.  The kernel wants nothing to do with this frame any more and the user needs to know that.  I was going to add statistics tracking to keep track of how many packets were sent vs rejected.  This was more of an issue with my approach where I had a dedicated kernel thread doing all of this.

In all cases of error, the kernel keeps blocking on the same frame until
it is updated; indeed packet_increment_head() is only called on success.
I've done this in order to not skip a packet in case of a
sock_alloc_send_skb() error.

I believe we have 3 options on failure, depending on what we set the packet to:
- TP_STATUS_KERNEL: the user will consider that the buffer is free and will
fill it again.
If we choose TP_STATUS_KERNEL by default, that means the content is lost
even for a sock_alloc_send_skb() error. Such behavior is not acceptable:
the user is not able to tell the difference between packet success and
packet failure on a specific frame (same status).
- TP_STATUS_USER: the kernel will try to send it again and again,
especially if send() is called in a loop (even if sending is impossible).
This is the current way of working; it is a way of blocking on the issue.
- TP_STATUS_LOSING (or another status): like TP_STATUS_USER but with more info for the user.
Unlike TP_STATUS_USER, this one tells the user which buffer the
error occurred in.

Thanks,
Johann
-- 
Johann Baudy
johaahn@gmail.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-11 14:59                           ` Johann Baudy
@ 2008-11-11 19:05                             ` Lovich, Vitali
  0 siblings, 0 replies; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-11 19:05 UTC (permalink / raw)
  To: Johann Baudy; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

Hi Johann,

> Ok, so we can skip below block according to this assumption:
> 		for (i = 0; i < req->tp_block_nr; i++) {
> 			void *ptr = pg_vec[i];
> 			int k;
> 
> 			for (k = 0; k < rb->frames_per_block; k++) {
> 				__packet_set_status(po, ptr, TP_STATUS_KERNEL);
> 				ptr += req->tp_frame_size;
> 			}
> 		}
>
Yes.

> >
> > Earlier, I think I had said that the mapped variable should be placed
> in packet_ring_buffer struct - I was wrong.  It's correct where it is.
> I had forgotten that rx & tx are mapped contiguously and
> simultaneously.  There needs to be a note in the doc that the rings
> should be set up before mapping (i.e. cannot set up the rx or tx ring
> buffer first, mmap, and then try to change it).
> 
> Yes, we can also add a check for po->mapped in
> PACKET_TX_RING/PACKET_RX_RING cases of packet_setsockopt() switch.
Since those call in to packet_set_ring, we can just fix the logic there - there's no need to complicate packet_setsockopt unnecessarily.

> 
> 
> > We do need still need a check for tx_pending_skb when sizing the ring
> buffer (in the situation where we unfreed skbs created by a call to
> send from Thread B, while Thread A unmaps the ring buffer and trys a
> resize). Also, in tpacket_snd, we need to increment mapped before
> allocating the skb and test if the original value was 0 (we decrement
> after incrementing tx_pending_skb).  If it was, then that indicates
> that the user had unmapped the ring buffer, which means we should stop
> sending.  No need for additional locking because of the atomic
> relationship this sets up between tx_pending_skb and mapped.
> 
> I'm not sure to understand:
Sorry - I do realize I rambled a bit.  I'll try to clarify.

> 
> Let me, sum up :
> * when setting or resizing the ring: we have to return an error if:
> - someone has mapped the socket (1.a)
> - sending is ongoing - tx_pending_skb != 0 (1.b)
> - a send() procedure is ongoing (1.c)
Yes.

> 
> * when closing the socket, we have to return an error if:
> - sending is ongoing - tx_pending_skb != 0 (2.a)
> - a send() procedure is ongoing (2.b)
Only 2.a should be a reason to fail to close.  With 2.b, send should abort, recognizing the socket is closing, without sending anything out.  Realistically, however, it's probably too hard to coordinate since 2.b causes 2.a.  So a more realistic approach is that closing never fails.  Instead, we add a flag field to packet_sock.  In send, we atomically set bit 0.  When closing, we atomically set bit 1.  If, when closing, bit 0 is also set, then we defer the actual closing of the socket to send.  If send sees bit 1 and bit 0 set and tx_pending_skb == 0, then it frees the actual socket.  Otherwise, the skb destructor is responsible for closing the socket once tx_pending_skb drops to 0 and bit 1 is set.  Calls to send fail if bit 1 is set.
> 	
> * during do-while loop of send(),:
> - protect ring against simultaneous send() (3.a)
> - ensure that ring is not resized (3.b) (linked with 1.c)
> Our main issue is when: tx_pending_skb = 0, Thread A starts to send a
> packet while Thread B unmaps and resizes the ring.
>  Thread B can potentially resizes the ring until increment of
> tx_pending_skb in Thread A.
> - loop can continue even if ring is not mapped (3.c), this has no
> sense but it is not critical
3.a can be protected through the flag field as above - atomically set the 0th bit if it's not set (test_and_set_bit); otherwise, return -EBUSY or a similar error code.  After this protection, atomically increment mapped (atomic_inc_return).  If the result is 1, then abort the send because the user has unmapped the memory.  Don't forget to atomically decrement it again (and clear the 0th flag bit) before exiting from send, regardless of the outcome.
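
Putting that together, a sketch of the send-side protection (the flags word and tx_pending_skb are assumed new members of packet_sock, and the ring-walking body is elided; this is the atomic scheme, not the mutex one):

#include <linux/bitops.h>

#define PO_SENDING	0	/* bit 0: a send() is in progress */
#define PO_CLOSING	1	/* bit 1: the socket is being closed */

static int tpacket_snd(struct packet_sock *po)
{
	int err = 0;

	if (test_and_set_bit(PO_SENDING, &po->flags))
		return -EBUSY;		/* another send() is already running (3.a) */

	if (atomic_inc_return(&po->mapped) == 1) {
		/* mapped was 0: the ring has been unmapped, stop sending */
		err = -EINVAL;
		goto out;
	}
	/* ... walk the ring, allocate skbs, bump tx_pending_skb ... */
out:
	atomic_dec(&po->mapped);
	clear_bit(PO_SENDING, &po->flags);
	return err;
}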

> 
> * when executing the destructor callback:
> - we must ensure that ring structure has not changed since send() (4.a)
This is a consequence of the above logic. 
> 
> (1.a) can be done with po->mmaped check in packet_setsockopt()
> (1.b) and (2.a) can be done as you said :  tx_pending_skb != 0 check
> when sizing buffer in packet_set_ring()
> (3.a) by a mutex M outside of the do-while loop
> (1.c), (2.b) and (3.b) can be performed thanks to the same mutex M and
> mutex_trylock() in packet_set_ring(). An error is returned if we can't
> take the mutex . The mutex is released at the end of packet_set_ring()
> call.
> (4.a) is validated with tx_pending_skb increment, (3.b),(1.b) and (2.a)
> 
> Do I miss something?
I hope I covered everything above.  I'm not sure if send, setsockopt, etc are in an interrupt context or not.  My guess is they are and thus mutexes are not allowed - spinlocks only.  However, there's also no need for mutexes or spinlocks because I believe the above atomic algorithm should work.  I should be able to post a patch by the end of the day.

Thanks,
Vitali

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-11 18:59                               ` Johann Baudy
@ 2008-11-11 19:10                                 ` Lovich, Vitali
  2008-11-12 12:09                                   ` Johann Baudy
  0 siblings, 1 reply; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-11 19:10 UTC (permalink / raw)
  To: Johann Baudy; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org



> -----Original Message-----
> From: Johann Baudy [mailto:johaahn@gmail.com]
> Sent: November-11-08 10:59 AM
> To: Lovich, Vitali
> Cc: Evgeniy Polyakov; netdev@vger.kernel.org
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> Hi Vitali,
> 
> > I'm saying that that frame's status should just be set to
> TP_STATUS_KERNEL, not user.  The kernel wants nothing to do with this
> frame any more and the user needs to know that.  I was going to add
> statistics tracking to keep track of how many packets were sent vs
> rejected.  This was more of an issue with my approach where I had a
> dedicated kernel thread doing all of this.
> 
> In all cases of error, Kernel is blocking on the same frame until
> update, indeed packet_increment_head() is only called on success.
> I've done this in order to not skip a packet in case of
> skb_alloc_send_skb error().
> 
> I believe, we have 3 options on all failure, if we set packet to:
> - TP_STATUS_KERNEL: User can consider that buffer is free and it will
> fill it again.
> If we choose TP_STATUS_KERNEL by default, that's means content is lost
> even for sock_alloc_send_skb() error. Such behavior is not acceptable,
> User is not able to show the difference between packet success and
> packet failure on a specific frame (same status).
> - TP_STATUS_USER: Kernel will try to send it again and again,
> especially if send() is called in a loop (even it is impossible).
> This is current way of working, It looks like a manner to block on
> issue.
> - TP_STATUS_LOOSING(or other): as TP_STATUS_USER with more info for
> user.
> Whereas TP_STATUS_USER, this one provides user with the buffer where
> error has occurred.
> 
> Thanks,
> Johann
> --
> Johann Baudy
> johaahn@gmail.com

So the way I had done it is that non-recoverable errors are those caused by malformed data coming from the user (i.e. the packet size is invalid).  In this case, we free the frame and return.  In all other cases, it is assumed that the cause was some kind of bad kernel state (i.e. no memory, device busy, etc) that will eventually be recovered from, so we keep retrying or return to the user without changing the status code.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-11 18:50                             ` Evgeniy Polyakov
@ 2008-11-11 19:19                               ` Johann Baudy
  2008-11-11 19:29                                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-11-11 19:19 UTC (permalink / raw)
  To: Evgeniy Polyakov, David Miller; +Cc: Lovich, Vitali, netdev@vger.kernel.org

Hi Evgeniy, David,

> I think you should be almost 200% sure that skb is not allowed to grow up :)
Well! I'm lost :)

How can we forward a pointer from skb allocation to skb destructor
without adding a kind of destructor argument in sk_buff struct?

Thanks in advance,
Johann

On Tue, Nov 11, 2008 at 7:50 PM, Evgeniy Polyakov <zbr@ioremap.net> wrote:
> Hi.
>
> On Tue, Nov 11, 2008 at 10:14:54AM -0800, Lovich, Vitali (vlovich@qualcomm.com) wrote:
>> Right, we're saying the exact same thing I think.  I had wanted just a void * that is specific to PACKET_MMAP, so that no one else pays the price if they don't need to.  You instead want to make it a feature flag - that's fine I think, but we need to make a note of that in the Kconfig file (that enabling PACKET_MMAP increases all skbs by the size of a long).  I'd still rather prefer using the fragments instead though, because it seems like a reasonable solution that has 0 impact on any other code.
>
> I think you should be almost 200% sure that skb is not allowed to grow up :)
>
> --
>        Evgeniy Polyakov
>



-- 
Johann Baudy
johaahn@gmail.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-11 19:19                               ` Johann Baudy
@ 2008-11-11 19:29                                 ` Evgeniy Polyakov
  2008-11-12 13:43                                   ` Johann Baudy
  0 siblings, 1 reply; 59+ messages in thread
From: Evgeniy Polyakov @ 2008-11-11 19:29 UTC (permalink / raw)
  To: Johann Baudy; +Cc: David Miller, Lovich, Vitali, netdev@vger.kernel.org

On Tue, Nov 11, 2008 at 08:19:09PM +0100, Johann Baudy (johaahn@gmail.com) wrote:
> > I think you should be almost 200% sure that skb is not allowed to grow up :)
> Well! I'm lost :)
> 
> How can we forward a pointer from skb allocation to skb destructor
> without adding a kind of destructor argument in sk_buff struct?

There are several possibilities:
* update the destructor with your own data and put the old one behind the data
	in the skb
* update the destructor, but put your own data into some storage which can
	be accessed from your callback; a storage entry may contain
	whatever data you want and of course the old destructor
* put your data into the socket pointer (there still may be issues with
	the destructor)
* if you fully own the skb, reuse whatever members you like
	(except those which may be used by a lower layer)
* find out why you need a callback at destruction time and
	eliminate this problem.
	Btw, netchannels work this way, since they are very similar to the
	packet socket, but compared to your approach they have a copy.

In the first place, I do not understand why you want to change the
skb, since you can mmap whatever area you need and provide those pages
to the skb, which at free time will drop the reference counter. The skb lives
in the socket queue, so until it is empty you can sleep and not allow the mapped
buffer to shrink.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-11 19:10                                 ` Lovich, Vitali
@ 2008-11-12 12:09                                   ` Johann Baudy
  2008-11-12 17:12                                     ` Lovich, Vitali
  0 siblings, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-11-12 12:09 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

Hi Vitali,

> So the way I had done it is that non-recoverable errors are those caused by malformed data coming from the user (i.e. packet size is invalid).  In this case, we free the frame and return.  In all other cases, it assumed that the cause was some kind of bad kernel state (i.e. no memory, device busy, etc) that will eventually be recovered from, so we keep retrying or return to the user without changing the status code.

I still believe that we should stop on non-recoverable errors. The user
must be informed that the packet has not been sent.
If you skip the frame and set the status to TP_STATUS_KERNEL, the frame will
be lost and the user will think that the malformed frame has been transmitted.

I'm using TP_STATUS_LOSING in case of a non-recoverable error. We can
maybe introduce a socket option to choose the behavior for
non-recoverable errors.

Best regards,
Johann
-- 
Johann Baudy
johaahn@gmail.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
@ 2008-11-12 13:19 Johann Baudy
  0 siblings, 0 replies; 59+ messages in thread
From: Johann Baudy @ 2008-11-12 13:19 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

Hi Vitali,

Please find below my draft.

Following our discussion and your proposal to replace the mutex, I've made this draft.
A busy counter: used to protect the ring against simultaneous send() calls.
	Initialized to 1, decremented at the beginning of send() (with atomic_dec_and_test()); send() returns EBUSY if the resulting value is != 0.
	This counter is used in packet_set_ring() and tpacket_snd().

A shutdown flag: to notify send() that a shutdown is ongoing.
	When a release occurs, packet_release() raises the shutdown flag and then spins in a while (atomic_read(busy) != 1); loop in order to wait for the end of the send() procedure.
	In tpacket_snd(), if shutdown is seen, we exit the loop.
	In the skb destructor, if shutdown is seen, we call sock_wfree() directly.
	I think such a loop is better than delaying the close procedure: we are sure that once close() has returned, everything is clean inside the kernel.

And resize is not allowed if (as we said):
	> - someone has mapped the socket (1.a)
	> - sending is ongoing - tx_pending_skb != 0 (1.b)
	> - a send() procedure is ongoing (1.c)


I've also put the pending and busy counters in packet_ring_buffer. They are not useful for RX, but I think it's cleaner.
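
To illustrate the busy/shutdown scheme described above (a sketch only: field names follow the draft, the ring-walking body is elided, and the shutdown flag is assumed to be a plain int in packet_sock):

static int tpacket_snd(struct packet_sock *po)
{
	struct packet_ring_buffer *rb = &po->tx_ring;
	int err = 0;

	/* busy starts at 1; only the caller that brings it to 0 may proceed */
	if (!atomic_dec_and_test(&rb->busy)) {
		atomic_inc(&rb->busy);
		return -EBUSY;
	}

	/* do-while loop over the ring: for each frame marked by the user,
	 * build an skb, atomic_inc(&rb->pending), hand it to the device;
	 * bail out early if po->shutdown has been raised by packet_release() */

	atomic_inc(&rb->busy);		/* back to 1: no send() running */
	return err;
}

static void packet_release_wait(struct packet_sock *po)
{
	struct packet_ring_buffer *rb = &po->tx_ring;

	po->shutdown = 1;		/* tell a concurrent send() to stop */
	while (atomic_read(&rb->busy) != 1)
		cpu_relax();		/* wait until no send() is running */
}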

PS: For the moment I'm being paranoid, so I've added smp_rmb()/smp_wmb() and a dcache flush for get/set status and also when filling the skb!!
(I've lost 10 MBytes/s :( on my bench)
- I'm using skb->mark again to forward the buffer index to the destructor.
- I haven't tested SOCK_DGRAM yet, but I've updated to_write and the data contents properly (I hope)

Thanks in advance,
Johann

  Documentation/networking/packet_mmap.txt |  133 +++++++-
 include/linux/if_packet.h                |    1 +
 net/packet/af_packet.c                   |  539 ++++++++++++++++++++++++------
 3 files changed, 557 insertions(+), 116 deletions(-)

diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 07c53d5..38e5fa3 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -4,8 +4,8 @@
 
 This file documents the CONFIG_PACKET_MMAP option available with the PACKET
 socket interface on 2.4 and 2.6 kernels. This type of sockets is used for 
-capture network traffic with utilities like tcpdump or any other that uses 
-the libpcap library. 
+capture network traffic with utilities like tcpdump or any other that needs
+raw access to network interface.
 
 You can find the latest version of this document at
 
@@ -14,6 +14,8 @@ You can find the latest version of this document at
 Please send me your comments to
 
     Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
+    Johann Baudy <johann.baudy@gnu-log.net> (TX RING)
+    Vitali Lovich <vlovich@qualcomm.com> (TX RING)
 
 -------------------------------------------------------------------------------
 + Why use PACKET_MMAP
@@ -25,19 +27,24 @@ to capture each packet, it requires two if you want to get packet's
 timestamp (like libpcap always does).
 
 In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size 
-configurable circular buffer mapped in user space. This way reading packets just 
-needs to wait for them, most of the time there is no need to issue a single 
-system call. By using a shared buffer between the kernel and the user 
-also has the benefit of minimizing packet copies.
-
-It's fine to use PACKET_MMAP to improve the performance of the capture process, 
-but it isn't everything. At least, if you are capturing at high speeds (this 
-is relative to the cpu speed), you should check if the device driver of your 
-network interface card supports some sort of interrupt load mitigation or 
-(even better) if it supports NAPI, also make sure it is enabled.
+configurable circular buffer mapped in user space that can be used to either
+send or receive packets. This way reading packets just needs to wait for them,
+most of the time there is no need to issue a single system call. Concerning
+transmission, multiple packets can be sent through one system call to get the
+highest bandwidth.
+By using a shared buffer between the kernel and the user also has the benefit
+of minimizing packet copies.
+
+It's fine to use PACKET_MMAP to improve the performance of the capture and
+transmission process, but it isn't everything. At least, if you are capturing
+at high speeds (this is relative to the cpu speed), you should check if the
+device driver of your network interface card supports some sort of interrupt
+load mitigation or (even better) if it supports NAPI, also make sure it is
+enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
+supported by devices of your network.
 
 --------------------------------------------------------------------------------
-+ How to use CONFIG_PACKET_MMAP
++ How to use CONFIG_PACKET_MMAP to improve capture process
 --------------------------------------------------------------------------------
 
 From the user standpoint, you should use the higher level libpcap library, which
@@ -57,7 +64,7 @@ the low level details or want to improve libpcap by including PACKET_MMAP
 support.
 
 --------------------------------------------------------------------------------
-+ How to use CONFIG_PACKET_MMAP directly
++ How to use CONFIG_PACKET_MMAP directly to improve capture process
 --------------------------------------------------------------------------------
 
 From the system calls stand point, the use of PACKET_MMAP involves
@@ -66,6 +73,7 @@ the following process:
 
 [setup]     socket() -------> creation of the capture socket
             setsockopt() ---> allocation of the circular buffer (ring)
+                              option: PACKET_RX_RING
             mmap() ---------> mapping of the allocated buffer to the
                               user process
 
@@ -97,13 +105,75 @@ also the mapping of the circular buffer in the user process and
 the use of this buffer.
 
 --------------------------------------------------------------------------------
++ How to use CONFIG_PACKET_MMAP directly to improve transmission process
+--------------------------------------------------------------------------------
+Transmission process is similar to capture as shown below.
+
+[setup]          socket() -------> creation of the transmission socket
+                 setsockopt() ---> allocation of the circular buffer (ring)
+                                   option: PACKET_TX_RING
+                 bind() ---------> bind transmission socket with a network interface
+                 mmap() ---------> mapping of the allocated buffer to the
+                                   user process
+
+[transmission]   poll() ---------> wait for free packets (optional)
+                 send() ---------> send all packets that are set as ready in
+                                   the ring
+                                   The flag MSG_DONTWAIT can be used to return
+                                   before end of transfer.
+
+[shutdown]  close() --------> destruction of the transmission socket and
+                              deallocation of all associated resources.
+
+Binding the socket to your network interface is mandatory (with zero copy) to
+know the header size of frames used in the circular buffer.
+
+As capture, each frame contains two parts:
+
+ --------------------
+| struct tpacket_hdr | Header. It contains the status
+|                    | of this frame
+|--------------------|
+| data buffer        |
+.                    .  Data that will be sent over the network interface.
+.                    .
+ --------------------
+
+ bind() associates the socket to your network interface thanks to
+ sll_ifindex parameter of struct sockaddr_ll.
+
+ Initialization example:
+
+ struct sockaddr_ll my_addr;
+ struct ifreq s_ifr;
+ ...
+
+ strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
+
+ /* get interface index of eth0 */
+ ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
+
+ /* fill sockaddr_ll struct to prepare binding */
+ my_addr.sll_family = AF_PACKET;
+ my_addr.sll_protocol = ETH_P_ALL;
+ my_addr.sll_ifindex =  s_ifr.ifr_ifindex;
+
+ /* bind socket to eth0 */
+ bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
+
+ A complete tutorial is available at: http://wiki.gnu-log.net/
+
+--------------------------------------------------------------------------------
 + PACKET_MMAP settings
 --------------------------------------------------------------------------------
 
 
 To setup PACKET_MMAP from user level code is done with a call like
 
+ - Capture process
      setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
+ - Transmission process
+     setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
 
 The most significant argument in the previous call is the req parameter, 
 this parameter must to have the following structure:
@@ -117,11 +187,11 @@ this parameter must to have the following structure:
     };
 
 This structure is defined in /usr/include/linux/if_packet.h and establishes a 
-circular buffer (ring) of unswappable memory mapped in the capture process. 
+circular buffer (ring) of unswappable memory.
 Being mapped in the capture process allows reading the captured frames and 
 related meta-information like timestamps without requiring a system call.
 
-Captured frames are grouped in blocks. Each block is a physically contiguous 
+Frames are grouped in blocks. Each block is a physically contiguous
 region of memory and holds tp_block_size/tp_frame_size frames. The total number 
 of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because
 
@@ -336,6 +406,7 @@ struct tpacket_hdr). If this field is 0 means that the frame is ready
 to be used for the kernel, If not, there is a frame the user can read 
 and the following flags apply:
 
++++ Capture process:
      from include/linux/if_packet.h
 
      #define TP_STATUS_COPY          2 
@@ -391,6 +462,36 @@ packets are in the ring:
 It doesn't incur in a race condition to first check the status value and 
 then poll for frames.
 
+
+++ Transmission process
+Those defines are also used for transmission:
+
+     #define TP_STATUS_KERNEL        0 // Frame is available
+     #define TP_STATUS_USER          1 // Frame will be sent on next send()
+     #define TP_STATUS_COPY          2 // Frame is currently in transmission
+
+First, the kernel initializes all frames to TP_STATUS_KERNEL. To send a packet,
+the user fills a data buffer of an available frame, sets tp_len to current
+data buffer size and sets its status field to TP_STATUS_USER. This can be done
+on multiple frames. Once the user is ready to transmit, it calls send().
+Then all buffers with status equal to TP_STATUS_USER are forwarded to the
+network device. The kernel updates each status of sent frames with
+TP_STATUS_COPY until the end of transfer.
+At the end of each transfer, buffer status returns to TP_STATUS_KERNEL.
+
+    header->tp_len = in_i_size;
+    header->tp_status = TP_STATUS_USER;
+    retval = send(this->socket, NULL, 0, 0);
+
+The user can also use poll() to check if a buffer is available:
+(status == TP_STATUS_KERNEL)
+
+    struct pollfd pfd;
+    pfd.fd = fd;
+    pfd.revents = 0;
+    pfd.events = POLLOUT;
+    retval = poll(&pfd, 1, timeout);
+
 --------------------------------------------------------------------------------
 + THANKS
 --------------------------------------------------------------------------------
diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
index 18db066..6682963 100644
--- a/include/linux/if_packet.h
+++ b/include/linux/if_packet.h
@@ -46,6 +46,7 @@ struct sockaddr_ll
 #define PACKET_VERSION			10
 #define PACKET_HDRLEN			11
 #define PACKET_RESERVE			12
+#define PACKET_TX_RING			13
 
 struct tpacket_stats
 {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9099237..2d67a39 100644
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
old mode 100644
new mode 100755
index c718e7e..fb776a7
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -156,7 +156,26 @@ struct packet_mreq_max
 };
 
 #ifdef CONFIG_PACKET_MMAP
-static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing);
+static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
+		int closing, int tx_ring);
+
+struct packet_ring_buffer {
+	char *			*pg_vec;
+	unsigned int		head;
+	unsigned int		frames_per_block;
+	unsigned int		frame_size;
+	unsigned int		frame_max;
+
+	unsigned int		pg_vec_order;
+	unsigned int		pg_vec_pages;
+	unsigned int		pg_vec_len;
+
+	atomic_t		busy;
+	atomic_t		pending;
+};
+
+struct packet_sock;
+static int tpacket_snd(struct packet_sock *po, struct msghdr *msg);
 #endif
 
 static void packet_flush_mclist(struct sock *sk);
@@ -166,11 +185,9 @@ struct packet_sock {
 	struct sock		sk;
 	struct tpacket_stats	stats;
 #ifdef CONFIG_PACKET_MMAP
-	char *			*pg_vec;
-	unsigned int		head;
-	unsigned int            frames_per_block;
-	unsigned int		frame_size;
-	unsigned int		frame_max;
+	struct packet_ring_buffer	rx_ring;
+	struct packet_ring_buffer	tx_ring;
+	atomic_t	shutdown;
 	int			copy_thresh;
 #endif
 	struct packet_type	prot_hook;
@@ -183,9 +200,6 @@ struct packet_sock {
 	struct packet_mclist	*mclist;
 #ifdef CONFIG_PACKET_MMAP
 	atomic_t		mapped;
-	unsigned int            pg_vec_order;
-	unsigned int		pg_vec_pages;
-	unsigned int		pg_vec_len;
 	enum tpacket_versions	tp_version;
 	unsigned int		tp_hdrlen;
 	unsigned int		tp_reserve;
@@ -204,36 +218,29 @@ struct packet_skb_cb {
 
 #ifdef CONFIG_PACKET_MMAP
 
-static void *packet_lookup_frame(struct packet_sock *po, unsigned int position,
-				 int status)
+static void __packet_set_status(struct packet_sock *po, void *frame, int status)
 {
-	unsigned int pg_vec_pos, frame_offset;
 	union {
 		struct tpacket_hdr *h1;
 		struct tpacket2_hdr *h2;
 		void *raw;
 	} h;
 
-	pg_vec_pos = position / po->frames_per_block;
-	frame_offset = position % po->frames_per_block;
-
-	h.raw = po->pg_vec[pg_vec_pos] + (frame_offset * po->frame_size);
+	h.raw = frame;
 	switch (po->tp_version) {
 	case TPACKET_V1:
-		if (status != h.h1->tp_status ? TP_STATUS_USER :
-						TP_STATUS_KERNEL)
-			return NULL;
+		h.h1->tp_status = status;
 		break;
 	case TPACKET_V2:
-		if (status != h.h2->tp_status ? TP_STATUS_USER :
-						TP_STATUS_KERNEL)
-			return NULL;
+		h.h2->tp_status = status;
 		break;
 	}
-	return h.raw;
+
+	smp_wmb();
+	flush_dcache_page(virt_to_page(h.raw));
 }
 
-static void __packet_set_status(struct packet_sock *po, void *frame, int status)
+static int __packet_get_status(struct packet_sock *po, void *frame)
 {
 	union {
 		struct tpacket_hdr *h1;
@@ -241,16 +248,71 @@ static void __packet_set_status(struct packet_sock *po, void *frame, int status)
 		void *raw;
 	} h;
 
+	smp_rmb();
+	flush_dcache_page(virt_to_page(frame));
+
 	h.raw = frame;
 	switch (po->tp_version) {
 	case TPACKET_V1:
-		h.h1->tp_status = status;
+		return h.h1->tp_status;
 		break;
 	case TPACKET_V2:
-		h.h2->tp_status = status;
+		return h.h2->tp_status;
 		break;
 	}
+	return 0;
+}
+
+static void *packet_lookup_frame(struct packet_sock *po,
+		struct packet_ring_buffer *buff,
+		unsigned int position,
+		int status)
+{
+	unsigned int pg_vec_pos, frame_offset;
+	union {
+		struct tpacket_hdr *h1;
+		struct tpacket2_hdr *h2;
+		void *raw;
+	} h;
+
+	pg_vec_pos = position / buff->frames_per_block;
+	frame_offset = position % buff->frames_per_block;
+
+	h.raw = buff->pg_vec[pg_vec_pos] + (frame_offset * buff->frame_size);
+
+	if( status != __packet_get_status(po, h.raw) )
+		return NULL;
+
+	return h.raw;
+}
+
+static inline void *packet_current_rx_frame(struct packet_sock *po, int status)
+{
+	return packet_lookup_frame(po, &po->rx_ring, po->rx_ring.head, status);
+}
+
+static inline void *packet_current_tx_frame(struct packet_sock *po, int status)
+{
+	return packet_lookup_frame(po, &po->tx_ring, po->tx_ring.head, status);
+}
+
+static inline void *packet_previous_rx_frame(struct packet_sock *po, int status)
+{
+	unsigned int previous = po->rx_ring.head ? po->rx_ring.head - 1 : po->rx_ring.frame_max;
+	return packet_lookup_frame(po, &po->rx_ring, previous, status);
+}
+
+static inline void *packet_previous_tx_frame(struct packet_sock *po, int status)
+{
+	unsigned int previous = po->tx_ring.head ? po->tx_ring.head - 1 : po->tx_ring.frame_max;
+	return packet_lookup_frame(po, &po->tx_ring, previous, status);
+}
+
+static inline void packet_increment_head(struct packet_ring_buffer *buff)
+{
+	buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
 }
+
 #endif
 
 static inline struct packet_sock *pkt_sk(struct sock *sk)
@@ -646,7 +708,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe
 		macoff = netoff - maclen;
 	}
 
-	if (macoff + snaplen > po->frame_size) {
+	if (macoff + snaplen > po->rx_ring.frame_size) {
 		if (po->copy_thresh &&
 		    atomic_read(&sk->sk_rmem_alloc) + skb->truesize <
 		    (unsigned)sk->sk_rcvbuf) {
@@ -659,16 +721,16 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe
 			if (copy_skb)
 				skb_set_owner_r(copy_skb, sk);
 		}
-		snaplen = po->frame_size - macoff;
+		snaplen = po->rx_ring.frame_size - macoff;
 		if ((int)snaplen < 0)
 			snaplen = 0;
 	}
 
 	spin_lock(&sk->sk_receive_queue.lock);
-	h.raw = packet_lookup_frame(po, po->head, TP_STATUS_KERNEL);
+	h.raw = packet_current_rx_frame(po, TP_STATUS_KERNEL);
 	if (!h.raw)
 		goto ring_is_full;
-	po->head = po->head != po->frame_max ? po->head+1 : 0;
+	packet_increment_head(&po->rx_ring);
 	po->stats.tp_packets++;
 	if (copy_skb) {
 		status |= TP_STATUS_COPY;
@@ -725,7 +787,6 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe
 
 	__packet_set_status(po, h.raw, status);
 	smp_mb();
-
 	{
 		struct page *p_start, *p_end;
 		u8 *h_end = h.raw + macoff + snaplen - 1;
@@ -759,10 +820,237 @@ ring_is_full:
 	goto drop_n_restore;
 }
 
-#endif
+static void tpacket_destruct_skb(struct sk_buff *skb)
+{
+	struct packet_sock *po = pkt_sk(skb->sk);
+	void * ph;
 
+	BUG_ON(skb == NULL);
 
-static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
+	if (unlikely(atomic_read(&po->shutdown)))
+		goto out_wfree;
+
+	ph = packet_lookup_frame(po, &po->tx_ring, skb->mark, TP_STATUS_COPY);
+	BUG_ON(atomic_read(&po->tx_ring.pending) == 0);
+	atomic_dec(&po->tx_ring.pending);
+	__packet_set_status(po, ph, TP_STATUS_KERNEL);
+
+out_wfree:
+	sock_wfree(skb);
+}
+
+static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff * skb, void * frame,
+		struct net_device *dev, int size_max, __be16 proto,
+		unsigned char * addr)
+{
+	union {
+		struct tpacket_hdr *h1;
+		struct tpacket2_hdr *h2;
+		void *raw;
+	} ph;
+	int to_write, offset, len, tp_len, nr_frags;
+	struct socket *sock = po->sk.sk_socket;
+	struct page *page;
+	void *data;
+	int err;
+
+	ph.raw = frame;
+
+	skb->protocol = proto;
+	skb->dev = dev;
+	skb->priority = po->sk.sk_priority;
+	skb->mark = po->tx_ring.head;
+
+	switch(po->tp_version) {
+	case TPACKET_V2:
+		tp_len = ph.h2->tp_len;
+		break;
+	default:
+		tp_len = ph.h1->tp_len;
+		break;
+	}
+	if (unlikely(tp_len > size_max)) {
+		printk(KERN_ERR "packet size is too long (%d > %d)\n",
+				tp_len, size_max);
+		return -EMSGSIZE;
+	}
+
+	skb_reserve(skb, LL_RESERVED_SPACE(dev));
+	skb_reset_network_header(skb);
+
+	data = ph.raw + po->tp_hdrlen;
+	to_write = tp_len;
+
+	if (sock->type == SOCK_DGRAM) {
+		err = dev_hard_header(skb, dev, ntohs(proto), addr,
+				NULL, tp_len);
+		if (unlikely(err < 0))
+			return -EINVAL;
+	} else if (dev->hard_header_len ) {
+		/* net device doesn't like empty head */
+		if(unlikely(tp_len <= dev->hard_header_len)) {
+			printk(KERN_ERR "packet size is too short "
+					"(%d < %d)\n", tp_len,
+					dev->hard_header_len);
+			return -EINVAL;
+		}
+
+		skb_push(skb, dev->hard_header_len);
+		err = skb_store_bits(skb, 0, data,
+				dev->hard_header_len);
+		if (unlikely(err))
+			return err;
+
+		data += dev->hard_header_len;
+		to_write -= dev->hard_header_len;
+	}
+
+	err = -EFAULT;
+	page = virt_to_page(data);
+	len = ((to_write > PAGE_SIZE) ? PAGE_SIZE : to_write);
+
+	offset = (int)((long)data & (~PAGE_MASK));
+	len -= offset;
+
+	skb->data_len = to_write;
+	skb->len += to_write;
+
+	while ( likely(to_write) ) {
+		nr_frags = skb_shinfo(skb)->nr_frags;
+
+		if(unlikely(nr_frags >= MAX_SKB_FRAGS)) {
+			printk(KERN_ERR "Packet exceed the number "
+					"of skb frags(%lu)\n",
+					MAX_SKB_FRAGS);
+			return -EFAULT;
+		}
+
+		flush_dcache_page(page);
+		get_page(page);
+		skb_fill_page_desc(skb,
+				nr_frags,
+				page++, offset, len);
+		to_write -= len;
+		len = (to_write > PAGE_SIZE) ? PAGE_SIZE : to_write;
+		offset = 0;
+	}
+
+	return tp_len;
+}
+
+static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
+{
+	struct socket *sock;
+	struct sk_buff *skb;
+	struct net_device *dev;
+	__be16 proto;
+	int ifindex, err, reserve = 0;
+	void * ph;
+	struct sockaddr_ll *saddr=(struct sockaddr_ll *)msg->msg_name;
+	int tp_len, size_max;
+	unsigned char *addr;
+	int len_sum = 0;
+	int status = 0;
+
+	BUG_ON(po == NULL);
+	sock = po->sk.sk_socket;
+
+	err = -EBUSY;
+	if(!atomic_dec_and_test(&po->tx_ring.busy))
+		goto out;
+
+	if (saddr == NULL) {
+		ifindex	= po->ifindex;
+		proto	= po->num;
+		addr	= NULL;
+	} else {
+		err = -EINVAL;
+		if (msg->msg_namelen < sizeof(struct sockaddr_ll))
+			goto out;
+		if (msg->msg_namelen < (saddr->sll_halen + offsetof(struct sockaddr_ll, sll_addr)))
+			goto out;
+		ifindex	= saddr->sll_ifindex;
+		proto	= saddr->sll_protocol;
+		addr	= saddr->sll_addr;
+	}
+
+	dev = dev_get_by_index(sock_net(&po->sk), ifindex);
+	err = -ENXIO;
+	if (unlikely(dev == NULL))
+		goto out;
+
+	err = -EINVAL;
+	if (unlikely(sock->type != SOCK_RAW))
+		goto out_put;
+
+	reserve = dev->hard_header_len;
+
+	err = -ENETDOWN;
+	if (unlikely(!(dev->flags & IFF_UP)))
+		goto out_put;
+
+	size_max = po->tx_ring.frame_size - sizeof(struct skb_shared_info)
+		- po->tp_hdrlen - LL_ALLOCATED_SPACE(dev);
+
+	if (size_max > dev->mtu + reserve)
+		size_max = dev->mtu + reserve;
+
+	do
+	{
+		ph = packet_current_tx_frame(po, TP_STATUS_USER);
+		if(unlikely(ph == NULL)) {
+			continue;
+		}
+
+		status = TP_STATUS_USER;
+		skb = sock_alloc_send_skb(&po->sk, LL_ALLOCATED_SPACE(dev),
+				msg->msg_flags & MSG_DONTWAIT, &err);
+		if (unlikely(skb == NULL))
+			goto out_status;
+
+		status = TP_STATUS_LOSING;
+		tp_len = tpacket_fill_skb(po, skb, ph, dev, size_max, proto,
+				addr);
+		if(unlikely(tp_len < 0)) {
+			err = tp_len;
+			goto out_status;
+		}
+
+		skb->destructor = tpacket_destruct_skb;
+		__packet_set_status(po, ph, TP_STATUS_COPY);
+		atomic_inc(&po->tx_ring.pending);
+
+		status = TP_STATUS_USER;
+		err = dev_queue_xmit(skb);
+		if (unlikely(err > 0 && (err = net_xmit_errno(err)) != 0))
+			goto out_xmit;
+		packet_increment_head(&po->tx_ring);
+		len_sum += tp_len;
+	}
+	while(likely((!atomic_read(&po->shutdown)) &&
+				((ph != NULL)
+				 || ((!(msg->msg_flags & MSG_DONTWAIT))
+					 && (atomic_read(&po->tx_ring.pending))))
+		    ));
+
+	err = len_sum;
+	goto out_put;
+
+out_xmit:
+	skb->destructor = sock_wfree;
+	atomic_dec(&po->tx_ring.pending);
+out_status:
+	__packet_set_status(po, ph, status);
+	kfree_skb(skb);
+out_put:
+	dev_put(dev);
+out:
+	atomic_inc(&po->tx_ring.busy);
+	return err;
+}
+#endif
+
+static int packet_snd(struct socket *sock,
 			  struct msghdr *msg, size_t len)
 {
 	struct sock *sk = sock->sk;
@@ -853,6 +1141,19 @@ out:
 	return err;
 }
 
+static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *msg, size_t len)
+{
+#ifdef CONFIG_PACKET_MMAP
+	struct sock *sk = sock->sk;
+	struct packet_sock *po = pkt_sk(sk);
+	if (po->tx_ring.pg_vec)
+		return tpacket_snd(po, msg);
+	else
+#endif
+		return packet_snd(sock, msg, len);
+}
+
 /*
  *	Close a PACKET socket. This is fairly simple. We immediately go
  *	to 'closed' state and remove our protocol entry in the device list.
@@ -891,10 +1192,17 @@ static int packet_release(struct socket *sock)
 	packet_flush_mclist(sk);
 
 #ifdef CONFIG_PACKET_MMAP
-	if (po->pg_vec) {
+	{
 		struct tpacket_req req;
 		memset(&req, 0, sizeof(req));
-		packet_set_ring(sk, &req, 1);
+		atomic_set(&po->shutdown, 1);
+
+		if (po->rx_ring.pg_vec)
+			packet_set_ring(sk, &req, 1, 0);
+
+		while(unlikely(atomic_read(&po->tx_ring.busy) != 1));
+		if (po->tx_ring.pg_vec)
+			packet_set_ring(sk, &req, 1, 1);
 	}
 #endif
 
@@ -1082,6 +1390,8 @@ static int packet_create(struct net *net, struct socket *sock, int protocol)
 		po->running = 1;
 	}
 
+	atomic_set(&po->tx_ring.busy, 1);
+
 	write_lock_bh(&net->packet.sklist_lock);
 	sk_add_node(sk, &net->packet.sklist);
 	write_unlock_bh(&net->packet.sklist_lock);
@@ -1411,6 +1721,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 
 #ifdef CONFIG_PACKET_MMAP
 	case PACKET_RX_RING:
+	case PACKET_TX_RING:
 	{
 		struct tpacket_req req;
 
@@ -1418,7 +1729,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 			return -EINVAL;
 		if (copy_from_user(&req,optval,sizeof(req)))
 			return -EFAULT;
-		return packet_set_ring(sk, &req, 0);
+		return packet_set_ring(sk, &req, 0, optname == PACKET_TX_RING);
 	}
 	case PACKET_COPY_THRESH:
 	{
@@ -1438,7 +1749,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 
 		if (optlen != sizeof(val))
 			return -EINVAL;
-		if (po->pg_vec)
+		if (po->rx_ring.pg_vec || po->tx_ring.pg_vec)
 			return -EBUSY;
 		if (copy_from_user(&val, optval, sizeof(val)))
 			return -EFAULT;
@@ -1457,7 +1768,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 
 		if (optlen != sizeof(val))
 			return -EINVAL;
-		if (po->pg_vec)
+		if (po->rx_ring.pg_vec || po->tx_ring.pg_vec)
 			return -EBUSY;
 		if (copy_from_user(&val, optval, sizeof(val)))
 			return -EFAULT;
@@ -1701,13 +2012,17 @@ static unsigned int packet_poll(struct file * file, struct socket *sock,
 	unsigned int mask = datagram_poll(file, sock, wait);
 
 	spin_lock_bh(&sk->sk_receive_queue.lock);
-	if (po->pg_vec) {
-		unsigned last = po->head ? po->head-1 : po->frame_max;
-
-		if (packet_lookup_frame(po, last, TP_STATUS_USER))
+	if (po->rx_ring.pg_vec) {
+		if (packet_previous_rx_frame(po, TP_STATUS_USER))
 			mask |= POLLIN | POLLRDNORM;
 	}
 	spin_unlock_bh(&sk->sk_receive_queue.lock);
+	spin_lock_bh(&sk->sk_write_queue.lock);
+	if (po->tx_ring.pg_vec) {
+		if (packet_current_tx_frame(po, TP_STATUS_KERNEL))
+			mask |= POLLOUT | POLLWRNORM;
+	}
+	spin_unlock_bh(&sk->sk_write_queue.lock);
 	return mask;
 }
 
@@ -1783,21 +2098,34 @@ out_free_pgvec:
 	goto out;
 }
 
-static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing)
+static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing, int tx_ring)
 {
 	char **pg_vec = NULL;
 	struct packet_sock *po = pkt_sk(sk);
-	int was_running, order = 0;
+	int was_running, busy, order = 0;
+	struct packet_ring_buffer *rb;
+	struct sk_buff_head *rb_queue;
 	__be16 num;
-	int err = 0;
+	int err;
 
-	if (req->tp_block_nr) {
-		int i;
+	rb = tx_ring ? &po->tx_ring : &po->rx_ring;
+	rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
 
-		/* Sanity tests and some calculations */
+	err = -EBUSY;
+	busy = !atomic_dec_and_test(&rb->busy);
 
-		if (unlikely(po->pg_vec))
-			return -EBUSY;
+	if(!closing) {
+		if (atomic_read(&po->mapped))
+			goto out;
+		if (busy || atomic_read(&rb->pending))
+			goto out;
+	}
+
+	if (req->tp_block_nr) {
+		/* Sanity tests and some calculations */
+		err = -EBUSY;
+		if (unlikely(rb->pg_vec))
+			goto out;
 
 		switch (po->tp_version) {
 		case TPACKET_V1:
@@ -1808,42 +2136,35 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing
 			break;
 		}
 
+		err = -EINVAL;
 		if (unlikely((int)req->tp_block_size <= 0))
-			return -EINVAL;
+			goto out;
 		if (unlikely(req->tp_block_size & (PAGE_SIZE - 1)))
-			return -EINVAL;
+			goto out;
 		if (unlikely(req->tp_frame_size < po->tp_hdrlen +
-						  po->tp_reserve))
-			return -EINVAL;
+					po->tp_reserve))
+			goto out;
 		if (unlikely(req->tp_frame_size & (TPACKET_ALIGNMENT - 1)))
-			return -EINVAL;
+			goto out;
 
-		po->frames_per_block = req->tp_block_size/req->tp_frame_size;
-		if (unlikely(po->frames_per_block <= 0))
-			return -EINVAL;
-		if (unlikely((po->frames_per_block * req->tp_block_nr) !=
-			     req->tp_frame_nr))
-			return -EINVAL;
+		rb->frames_per_block = req->tp_block_size/req->tp_frame_size;
+		if (unlikely(rb->frames_per_block <= 0))
+			goto out;
+		if (unlikely((rb->frames_per_block * req->tp_block_nr) !=
+					req->tp_frame_nr))
+			goto out;
 
 		err = -ENOMEM;
 		order = get_order(req->tp_block_size);
 		pg_vec = alloc_pg_vec(req, order);
 		if (unlikely(!pg_vec))
 			goto out;
-
-		for (i = 0; i < req->tp_block_nr; i++) {
-			void *ptr = pg_vec[i];
-			int k;
-
-			for (k = 0; k < po->frames_per_block; k++) {
-				__packet_set_status(po, ptr, TP_STATUS_KERNEL);
-				ptr += req->tp_frame_size;
-			}
-		}
-		/* Done */
-	} else {
+	}
+	/* Done */
+	else {
+		err = -EINVAL;
 		if (unlikely(req->tp_frame_nr))
-			return -EINVAL;
+			goto out;
 	}
 
 	lock_sock(sk);
@@ -1866,20 +2187,19 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing
 	if (closing || atomic_read(&po->mapped) == 0) {
 		err = 0;
 #define XC(a, b) ({ __typeof__ ((a)) __t; __t = (a); (a) = (b); __t; })
-
-		spin_lock_bh(&sk->sk_receive_queue.lock);
-		pg_vec = XC(po->pg_vec, pg_vec);
-		po->frame_max = (req->tp_frame_nr - 1);
-		po->head = 0;
-		po->frame_size = req->tp_frame_size;
-		spin_unlock_bh(&sk->sk_receive_queue.lock);
-
-		order = XC(po->pg_vec_order, order);
-		req->tp_block_nr = XC(po->pg_vec_len, req->tp_block_nr);
-
-		po->pg_vec_pages = req->tp_block_size/PAGE_SIZE;
-		po->prot_hook.func = po->pg_vec ? tpacket_rcv : packet_rcv;
-		skb_queue_purge(&sk->sk_receive_queue);
+		spin_lock_bh(&rb_queue->lock);
+		pg_vec = XC(rb->pg_vec, pg_vec);
+		rb->frame_max = (req->tp_frame_nr - 1);
+		rb->head = 0;
+		rb->frame_size = req->tp_frame_size;
+		spin_unlock_bh(&rb_queue->lock);
+
+		order = XC(rb->pg_vec_order, order);
+		req->tp_block_nr = XC(rb->pg_vec_len, req->tp_block_nr);
+
+		rb->pg_vec_pages = req->tp_block_size/PAGE_SIZE;
+		po->prot_hook.func = (po->rx_ring.pg_vec) ? tpacket_rcv : packet_rcv;
+		skb_queue_purge(rb_queue);
 #undef XC
 		if (atomic_read(&po->mapped))
 			printk(KERN_DEBUG "packet_mmap: vma is busy: %d\n", atomic_read(&po->mapped));
@@ -1898,7 +2218,10 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing
 
 	if (pg_vec)
 		free_pg_vec(pg_vec, order, req->tp_block_nr);
+
+	err = 0;
 out:
+	atomic_inc(&rb->busy);
 	return err;
 }
 
@@ -1906,7 +2229,8 @@ static int packet_mmap(struct file *file, struct socket *sock, struct vm_area_st
 {
 	struct sock *sk = sock->sk;
 	struct packet_sock *po = pkt_sk(sk);
-	unsigned long size;
+	unsigned long size, expected_size;
+	struct packet_ring_buffer *rb;
 	unsigned long start;
 	int err = -EINVAL;
 	int i;
@@ -1917,23 +2241,38 @@ static int packet_mmap(struct file *file, struct socket *sock, struct vm_area_st
 	size = vma->vm_end - vma->vm_start;
 
 	lock_sock(sk);
-	if (po->pg_vec == NULL)
+
+	expected_size = 0;
+	if (po->rx_ring.pg_vec)
+		expected_size += po->rx_ring.pg_vec_len * po->rx_ring.pg_vec_pages * PAGE_SIZE;
+	if (po->tx_ring.pg_vec)
+		expected_size += po->tx_ring.pg_vec_len * po->tx_ring.pg_vec_pages * PAGE_SIZE;
+
+	if (expected_size == 0)
 		goto out;
-	if (size != po->pg_vec_len*po->pg_vec_pages*PAGE_SIZE)
+
+	if (size != expected_size)
 		goto out;
 
 	start = vma->vm_start;
-	for (i = 0; i < po->pg_vec_len; i++) {
-		struct page *page = virt_to_page(po->pg_vec[i]);
-		int pg_num;
-
-		for (pg_num = 0; pg_num < po->pg_vec_pages; pg_num++, page++) {
-			err = vm_insert_page(vma, start, page);
-			if (unlikely(err))
-				goto out;
-			start += PAGE_SIZE;
+
+	for(rb = &po->rx_ring; rb <= &po->tx_ring; rb++) {
+		if (rb->pg_vec == NULL)
+			continue;
+
+		for (i = 0; i < rb->pg_vec_len; i++) {
+			struct page *page = virt_to_page(rb->pg_vec[i]);
+			int pg_num;
+
+			for (pg_num = 0; pg_num < rb->pg_vec_pages; pg_num++, page++) {
+				err = vm_insert_page(vma, start, page);
+				if (unlikely(err))
+					goto out;
+				start += PAGE_SIZE;
+			}
 		}
 	}
+
 	atomic_inc(&po->mapped);
 	vma->vm_ops = &packet_mmap_ops;
 	err = 0;


Thanks,
Johann
-- 
Johann Baudy
johaahn@gmail.com


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-11 19:29                                 ` Evgeniy Polyakov
@ 2008-11-12 13:43                                   ` Johann Baudy
  2008-11-12 13:58                                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 59+ messages in thread
From: Johann Baudy @ 2008-11-12 13:43 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Lovich, Vitali, netdev@vger.kernel.org

Hi Evgeniy,

> There are several possibilities:
> * update destructor with your own data and put old one behind the data
>        in skb
> * update destructor, but put your own data into some storage which can
>        be accessed from your callback, storage entry may contain
>        whatever data you want and of course old destructor
> * put your data into socket pointer (there still may be issues with
>        destructor)
> * if you fully own the skb, reuse whatever members there you like
>        (except those which may be used by lower layer)
> * find why do you need to have callback at destruction time and
>        eliminate this problem.
>        Btw, netchannels do this way, since they are very similar to
>        packet socket, but compared to your approach they have a copy.
>
Thanks a lot for this list :)
So for example, is using skb->mark to store a buffer index a good solution?
Or do you prefer another solution?

> At the first place I do not understand, why do you want to change the
> skb, since you can mmap whatever area you need and provide those pages
> into skb, which at free time will drop the reference counter. Skb lives
> in socket queue, so until it is empty you can sleep not allowing mapped
> buffer to shrink.
We need to get the tpacket_hdr associated with the data in the skb in order to
inform the user that the buffer is now available (this is done through the status
field, which is not sent over the network). What do you mean by
"provide those pages", through frags? If yes, is it OK when the frags are
linearized?

Best regards,
Johann


-- 
Johann Baudy
johaahn@gmail.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-12 13:43                                   ` Johann Baudy
@ 2008-11-12 13:58                                     ` Evgeniy Polyakov
  2008-11-12 17:07                                       ` Lovich, Vitali
  0 siblings, 1 reply; 59+ messages in thread
From: Evgeniy Polyakov @ 2008-11-12 13:58 UTC (permalink / raw)
  To: Johann Baudy; +Cc: Lovich, Vitali, netdev@vger.kernel.org

Hi.

On Wed, Nov 12, 2008 at 02:43:07PM +0100, Johann Baudy (johaahn@gmail.com) wrote:
> Thanks a lot for this list :)
> So for example, is using skb->mark to store a buffer index a good solution?
> Or do you prefer another solution?

The qdisc layer uses various classification models, and for example dsmark uses
skb->mark in particular. Most of the others use the priority field. tc_verd is
used to store the classification verdict.

> > At the first place I do not understand, why do you want to change the
> > skb, since you can mmap whatever area you need and provide those pages
> > into skb, which at free time will drop the reference counter. Skb lives
> > in socket queue, so until it is empty you can sleep not allowing mapped
> > buffer to shrink.
> We need to get the tpacket_hdr associated with the data in the skb in order to
> inform the user that the buffer is now available (this is done through the status
> field, which is not sent over the network). What do you mean by
> "provide those pages", through frags? If yes, is it OK when the frags are
> linearized?

Hmm, for example splice() is supposed to do the same what you have, but
because of udp_send*() limitations it can't. Splice does not care if
data was overwritten during transfer, i.e. when ->sendpage() returns all
code assumes that data has been successfully transmitted, so userspace
can overwrite mapped area or file content, and this new data will
actually be transmitted. I believe you can use the same model and do not
care about userspace notifications after execution returned to the
userspace from the 'sending call'.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-12 13:58                                     ` Evgeniy Polyakov
@ 2008-11-12 17:07                                       ` Lovich, Vitali
  2008-11-12 17:41                                         ` Evgeniy Polyakov
  0 siblings, 1 reply; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-12 17:07 UTC (permalink / raw)
  To: Evgeniy Polyakov, Johann Baudy; +Cc: netdev@vger.kernel.org

Hi,

> -----Original Message-----
> From: Evgeniy Polyakov [mailto:zbr@ioremap.net]
> Sent: November-12-08 5:58 AM
> To: Johann Baudy
> Cc: Lovich, Vitali; netdev@vger.kernel.org
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> Hi.
> Hmm, for example splice() is supposed to do the same what you have, but
> because of udp_send*() limitations it can't. Splice does not care if
> data was overwritten during transfer, i.e. when ->sendpage() returns
> all
> code assumes that data has been successfully transmitted, so userspace
> can overwrite mapped area or file content, and this new data will
> actually be transmitted. I believe you can use the same model and do
> not
> care about userspace notifications after execution returned to the
> userspace from the 'sending call'.

I think you misunderstand how the API works.  Yes the user could call send from the thread he's filling in data from.  But the idea is that he would ideally do this from another thread so that it's possible to control latency between packet sends.  Additionally, even without trying to control latency, there would still have to be complicated logic in userspace to determine when enough packets have been placed in the buffer to overcome the cost of a system call.

This is why we need to know which frame in the ring buffer the skb is associated with.
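
To make that concrete, here is a rough user-space sketch of this two-thread
pattern. It is not part of the patch: the socket()/setsockopt(PACKET_TX_RING)/
mmap()/bind() setup is assumed to have filled in the tx_ring structure below,
and the data offset follows the draft documentation (TPACKET_HDRLEN for
TPACKET_V1), so treat the details as placeholders.

    #include <pthread.h>
    #include <sys/socket.h>
    #include <linux/if_packet.h>

    struct tx_ring {
            int fd;                  /* AF_PACKET socket bound to the interface */
            char *ring;              /* mmap()ed PACKET_TX_RING area */
            unsigned int frame_size;
            unsigned int frame_nr;
    };

    /* Thread B: fill free frames and hand them to the kernel. */
    static void *filler_thread(void *arg)
    {
            struct tx_ring *r = arg;
            unsigned int i = 0;

            for (;;) {
                    struct tpacket_hdr *hdr = (struct tpacket_hdr *)
                            (r->ring + i * r->frame_size);

                    if (hdr->tp_status != TP_STATUS_KERNEL)
                            continue;        /* busy-wait; a real program would poll() */

                    char *data = (char *)hdr + TPACKET_HDRLEN;
                    /* ... build the next packet in 'data' ... */
                    hdr->tp_len = 60;                 /* example length */
                    hdr->tp_status = TP_STATUS_USER;  /* ready for the kernel */

                    i = (i + 1) % r->frame_nr;
            }
            return NULL;
    }

    /* Thread A: flush every TP_STATUS_USER frame to the device. */
    static void *sender_thread(void *arg)
    {
            struct tx_ring *r = arg;

            for (;;)
                    send(r->fd, NULL, 0, 0);
            return NULL;
    }

The filler thread only reuses a frame once its status has gone back to
TP_STATUS_KERNEL, which is exactly why the status has to be reset from the skb
destructor rather than right after dev_queue_xmit().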

Vitali

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-12 12:09                                   ` Johann Baudy
@ 2008-11-12 17:12                                     ` Lovich, Vitali
  0 siblings, 0 replies; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-12 17:12 UTC (permalink / raw)
  To: Johann Baudy; +Cc: Evgeniy Polyakov, netdev@vger.kernel.org

Hi Johann,

> -----Original Message-----
> From: Johann Baudy [mailto:johaahn@gmail.com]
> Sent: November-12-08 4:10 AM
> To: Lovich, Vitali
> Cc: Evgeniy Polyakov; netdev@vger.kernel.org
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> 
> I still believe that we should stop on non-recoverable errors. The user
> must be informed that the packet has not been sent.
> If you skip the frame and set its status to TP_STATUS_KERNEL, the frame will
> be lost and the user will think that the malformed frame has been transmitted.
Hence the reason I suggested we keep track of the statistics.  If you think the user will actually care which frame it was that was rejected, then sure set it to TP_STATUS_LOSING.  However, you should still continue with all remaining packets - returning from send won't matter since you'll have filled in the frame.
 
> I'm using TP_STATUS_LOSING in case of a non-recoverable error. We can
> maybe introduce a socket option to choose the behavior for
> non-recoverable errors.
There's no point in adding socket options to determine the communication protocol with userspace - it's completely redundant.  If someone needs a custom way, they can write their own module or patch this one.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-12 17:07                                       ` Lovich, Vitali
@ 2008-11-12 17:41                                         ` Evgeniy Polyakov
  2008-11-12 17:59                                           ` Lovich, Vitali
  0 siblings, 1 reply; 59+ messages in thread
From: Evgeniy Polyakov @ 2008-11-12 17:41 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Johann Baudy, netdev@vger.kernel.org

Hi Vitali.

On Wed, Nov 12, 2008 at 09:07:43AM -0800, Lovich, Vitali (vlovich@qualcomm.com) wrote:
> I think you misunderstand how the API works.  Yes the user could call send from the thread he's filling in data from.  But the idea is that he would ideally do this from another thread so that it's possible to control latency between packet sends.  Additionally, even without trying to control latency, there would still have to be complicated logic in userspace to determine when enough packets have been placed in the buffer to overcome the cost of a system call.
> 
> This is why we need to know which frame in the ring buffer the skb is associated with.

What's the problem of invoking send machinery from different thread?
You can wait in appropriate syscall until multiple packets have been
sent and then return, or update some shared flag in the mapped area
which says that something is being sent. When dev_queue_xmit() for
selected set of packets completes, you can update that variable, and
based on its value userspace can overwrite the area used by already
sent packets.

I see that the only reason to have a notification about skb completion is an
absolute need to send synchronous data, i.e. the packet data cannot be overwritten
until the packet reaches the media. But given that the existing Linux
splice/sendfile and any other ->sendpage() users have been racy in this regard
for centuries, this does not look like a strong requirement to me.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-12 17:41                                         ` Evgeniy Polyakov
@ 2008-11-12 17:59                                           ` Lovich, Vitali
  2008-11-12 18:11                                             ` Evgeniy Polyakov
  0 siblings, 1 reply; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-12 17:59 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Johann Baudy, netdev@vger.kernel.org



> -----Original Message-----
> From: Evgeniy Polyakov [mailto:zbr@ioremap.net]
> Sent: November-12-08 9:41 AM
> To: Lovich, Vitali
> Cc: Johann Baudy; netdev@vger.kernel.org
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> Hi Vitali.
> 
> On Wed, Nov 12, 2008 at 09:07:43AM -0800, Lovich, Vitali
> (vlovich@qualcomm.com) wrote:
> > I think you misunderstand how the API works.  Yes the user could call
> send from the thread he's filling in data from.  But the idea is that
> he would ideally do this from another thread so that it's possible to
> control latency between packet sends.  Additionally, even without
> trying to control latency, there would still have to be complicated
> logic in userspace to determine when enough packets have been placed in
> the buffer to overcome the cost of a system call.
> >
> > This is why we need to know which frame in the ring buffer the skb is
> associated with.
> 
> What's the problem of invoking send machinery from different thread?
> You can wait in appropriate syscall until multiple packets have been
> sent and then return, or update some shared flag in the mapped area
> which says that something is being sent. When dev_queue_xmit() for
> selected set of packets completes, you can update that variable, and
> based on its value userspace can overwrite the area used by already
> sent packets.
> 
> I see that the only reason to have a notification about skb completion is an
> absolute need to send synchronous data, i.e. the packet data cannot be
> overwritten until the packet reaches the media. But given that the existing
> Linux splice/sendfile and any other ->sendpage() users have been racy in this
> regard for centuries, this does not look like a strong requirement to me.
> 
They aren't racy - not in the sense that your suggestion would make it.  With current splice & sendfile, when it returns from the syscall, the user knows that it has been transmitted and thus can continue using the file descriptors & memory (in the case of vmsplice).  However, with your suggestion, the user could actually never know when it's safe to write into memory (even in a single-threaded situation).  Thus it is racy in a single-threaded situation (even on a UP it's potentially racy) which is a pretty amazing feat.  Consider:

User fills up the 100 frames available in the ring buffer.  Calls send().  Send, after calling dev_queue_xmit() on each frame, marks it as free for use.  Send returns.  User continues filling up the buffer.

In this situation, the user now has no clue whether or not the frames were actually sent out.  This is going to cause huge problems because the user won't understand why their data is being sent out corrupted (and has no way to prevent it).  I think this becomes more and more of a problem if the card can't keep up with the packet rate, since it will still be processing its tx queue while user space is overwriting the ring buffer.

This state-machine approach is the exact parallel to how the RX code works (on receive it tries to place the skb data in a free frame and relies on the user to free that frame).

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-12 17:59                                           ` Lovich, Vitali
@ 2008-11-12 18:11                                             ` Evgeniy Polyakov
  2008-11-12 19:05                                               ` Lovich, Vitali
  0 siblings, 1 reply; 59+ messages in thread
From: Evgeniy Polyakov @ 2008-11-12 18:11 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Johann Baudy, netdev@vger.kernel.org

On Wed, Nov 12, 2008 at 09:59:20AM -0800, Lovich, Vitali (vlovich@qualcomm.com) wrote:
> They aren't racy - not in the sense that your suggestion would make it.  With current splice & sendfile, when it returns from the syscall, the user knows that it has been transmitted and thus can continue using the file descriptors & memory (in the case of vmsplice).  However, with your suggestion, the user could actually never know when it's safe to write into memory (even in a single-threaded situation).  Thus it is racy in a single-threaded situation (even on a UP it's potentially racy) which is a pretty amazing feat.  Consider:

Problem is, that when sendfile() returns nothing has been transferred.
And in some NICs it will take a while to actually make a transfer.
And what you wrote below is exactly the same problem as exists with
splice/sendfile in particular and ->sendpage() in general.

Please also update your mailer to wrap strings into 80-or-so lines, it
is hard to answer into the middle of the paragraph.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-12 18:11                                             ` Evgeniy Polyakov
@ 2008-11-12 19:05                                               ` Lovich, Vitali
  2008-11-12 19:14                                                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-12 19:05 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Johann Baudy, netdev@vger.kernel.org



> -----Original Message-----
> From: Evgeniy Polyakov [mailto:zbr@ioremap.net]
> Sent: November-12-08 10:12 AM
> To: Lovich, Vitali
> Cc: Johann Baudy; netdev@vger.kernel.org
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> On Wed, Nov 12, 2008 at 09:59:20AM -0800, Lovich, Vitali
> (vlovich@qualcomm.com) wrote:
> > They aren't racy - not in the sense that your suggestion would make
> > it.  With current splice & sendfile, when it returns from the syscall,
> > the user knows that it has been transmitted and thus can continue using
> > the file descriptors & memory (in the case of vmsplice).  However, with
> > your suggestion, the user could actually never know when it's safe to
> > write into memory (even in a single-threaded situation).  Thus it is
> > racy in a single-threaded situation (even on a UP it's potentially
> > racy) which is a pretty amazing feat.  Consider:
> 
> Problem is, that when sendfile() returns nothing has been transferred.
> And in some NICs it will take a while to actually make a transfer.
> And what you wrote below is exactly the same problem as exists with
> splice/sendfile in particular and ->sendpage() in general.
I still don't see it.  This can only be a problem for vmsplice, since I believe
sendpage & splice copy the data from the source pipe if necessary.
vmsplice solves this through the SPLICE_F_GIFT flag (if not specified, 
I'm assuming it copies the data into a temporary buffer).  So I don't 
believe that these are actually racy functions if used properly.

However, your suggestion makes non-racy usage of the tx ring impossible
unless you know ahead of time how many frames you will need (in which case, resetting
the status flag is pointless).  But for proper ring buffer behaviour, it needs to
clear the flag in the skb destructor, once we know the data will no longer be used by
the driver.

> Please also update your mailer to wrap strings into 80-or-so lines, it
> is hard to answer into the middle of the paragraph.
Sorry - I hate using Outlook because it doesn't seem to honour my settings.
I'll split up the lines manually instead of trusting Outlook.

Vitali


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-12 19:05                                               ` Lovich, Vitali
@ 2008-11-12 19:14                                                 ` Evgeniy Polyakov
  2008-11-12 21:23                                                   ` Lovich, Vitali
  0 siblings, 1 reply; 59+ messages in thread
From: Evgeniy Polyakov @ 2008-11-12 19:14 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Johann Baudy, netdev@vger.kernel.org

On Wed, Nov 12, 2008 at 11:05:03AM -0800, Lovich, Vitali (vlovich@qualcomm.com) wrote:
> I still don't see it.  This can only be a problem for vmsplice, since I believe
> sendpage & splice copy the data from the source pipe if necessary.
> vmsplice solves this through the SPLICE_F_GIFT flag (if not specified, 
> I'm assuming it copies the data into a temporary buffer).  So I don't 
> believe that these are actually racy functions if used properly.

Sendpage only copies data if the underlying device does not support
scatter-gather and hardware checksum capabilities. Effectively what is
being done is a page reference counter increase (no matter whether it is an
anonymous mapping or VFS page cache) and an skb submit, which in the best case
results in dev_queue_xmit() just like in your approach. Then the syscall
returns and userspace will never know that the page has actually been transmitted.
It can actually be dropped right there, without even seeing the wire, if the
hardware decides so; that is why hardware checksumming is needed: the
hardware will calculate the appropriate checksums over whatever data is in the
given pages at real send time, and not when userspace called sendpage().

> However, your suggestion makes non-racy usage of the tx ring impossible
> unless you know ahead of time how many frames you will need (in which case, resetting
> the status flag is pointless).  But for proper ring buffer behaviour, it needs to
> clear the flag in the skb destructor, once we know the data will no longer be used by
> the driver.

Here is the main point: why do you care at all about data that was or was
not transmitted, and why update something at destruction time and not
when dev_queue_xmit() returns? As pointed out above, destruction time does
not guarantee that the skb was sent any more than a return from dev_queue_xmit() does.

So you can update whatever flags you have to after the return of
dev_queue_xmit() and you will get the same behaviour as sendfile:
an immediate write into the same memory area results in sending the new content
(on some NICs).

> > Please also update your mailer to wrap strings into 80-or-so lines, it
> > is hard to answer into the middle of the paragraph.
> Sorry - I hate using Outlook because it doesn't seem to honour my settings.
> I'll split up the lines manually instead of trusting Outlook.

Non-trivial solution for long mails :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-12 19:14                                                 ` Evgeniy Polyakov
@ 2008-11-12 21:23                                                   ` Lovich, Vitali
  2008-11-12 21:46                                                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-12 21:23 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Johann Baudy, netdev@vger.kernel.org



> -----Original Message-----
> From: Evgeniy Polyakov [mailto:zbr@ioremap.net]
> Sent: November-12-08 11:14 AM
> To: Lovich, Vitali
> Cc: Johann Baudy; netdev@vger.kernel.org
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> Here is the main point: why do you ever care about data that was or was
> not transmitted and want to update something at destruction time and
> not
> when dev_qeueue_xmit() returns. As pointed above, destruction time does
> not guarantee that skb was sent as long as return from
> dev_qeueue_xmit().
I don't care whether or not the data was sent - I care whether or not the driver
might still use the data in the frame the skb is referring to.  In the destructor, clearly the
driver can't since it gave up its reference.  After dev_queue_xmit, we don't know because
the driver (or the skb queue layer) may have decided to delay packet transmission.

Potentially the user might even have written half the payload of a packet when the device decides to
send out the skb for that frame and thus send out half the payload from one 
packet and half the payload from another.

> 
> So you can update whatever flags you have to after return of the
> dev_qeueue_xmit() and will get the same behaviour as sendfile:
> immediate write into the same memory area results in sending new
> content
> (on some NICs).
But using your approach, how can a user ever know whether or not he actually sent
a packet?

Am I missing something fundamental in my understanding?  I don't see any way, outside
of using the skb destructor, to notify the user when he can safely write to a frame
without interfering with any pending skbs.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-12 21:23                                                   ` Lovich, Vitali
@ 2008-11-12 21:46                                                     ` Evgeniy Polyakov
  2008-11-12 22:33                                                       ` Lovich, Vitali
  2008-11-18 18:49                                                       ` Johann Baudy
  0 siblings, 2 replies; 59+ messages in thread
From: Evgeniy Polyakov @ 2008-11-12 21:46 UTC (permalink / raw)
  To: Lovich, Vitali; +Cc: Johann Baudy, netdev@vger.kernel.org

On Wed, Nov 12, 2008 at 01:23:33PM -0800, Lovich, Vitali (vlovich@qualcomm.com) wrote:
> I don't care whether or not the data was sent - I care whether or not the driver
> might still use the data in the frame the skb is referring to.  In the destructor, clearly the
> driver can't since it gave up its reference.  After dev_queue_xmit, we don't know because
> the driver (or the skb queue layer) may have decided to delay packet transmission.
> 
> Potentially the user might even have written half the payload of a packet when the device decides to
> send out the skb for that frame and thus send out half the payload from one 
> packet and half the payload from another.

And what's the point in waiting for data to be unused?
You want to implement a system, which will behave more consistent than
existing zero-copy approach, but yet not 100% correctly...

> > So you can update whatever flags you have to after return of the
> > dev_qeueue_xmit() and will get the same behaviour as sendfile:
> > immediate write into the same memory area results in sending new
> > content
> > (on some NICs).
> But using your approach, how can a user ever know whether or not he actually sent
> a packet?

There is no way to know that. At all. skb can be dropped by zillions of
reasons and after it was submitted to the qdisk layer, there is no way
to know how its life will continue. Well, in some cases it is possible
to know (when qdisk just frees skb), but it is far from 100% of the
cases.

> Am I missing something fundamental in my understanding?  I don't see any way, outside
> of using the skb destructor, to notify the user when he can safely write to a frame
> without interfering with any pending skbs.

Having a callback at destruction time does mean that noone uses skb, but
are you sure this is needed? With existing zero-copy (splice/sendfile)
this is not true, but you want to extend this approach...

If you _do_ want to make it that way, you can remove destructor at all
and implement own packet-socket-only allocation policy and thus have own
private destructor without extending skb.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-12 21:46                                                     ` Evgeniy Polyakov
@ 2008-11-12 22:33                                                       ` Lovich, Vitali
  2008-11-18 18:49                                                       ` Johann Baudy
  1 sibling, 0 replies; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-12 22:33 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Johann Baudy, netdev@vger.kernel.org



> -----Original Message-----
> From: Evgeniy Polyakov [mailto:zbr@ioremap.net]
> Sent: November-12-08 1:46 PM
> To: Lovich, Vitali
> Cc: Johann Baudy; netdev@vger.kernel.org
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> On Wed, Nov 12, 2008 at 01:23:33PM -0800, Lovich, Vitali
> (vlovich@qualcomm.com) wrote:
> > I don't care whether or not the data was sent - I care whether or not
> the driver
> > might still use the data in the frame the skb is referring to.  In
> the destructor, clearly the
> > driver can't since it gave up its reference.  After dev_queue_xmit,
> we don't know because
> > the driver (or the skb queue layer) may have decided to delay packet
> transmission.
> >
> > Potentially the user might even have written half the payload of a
> packet when the device decides to
> > send out the skb for that frame and thus send out half the payload
> from one
> > packet and half the payload from another.
> 
> And what's the point in waiting for data to be unused?
So that the application in user space can actually send uncorrupted packets.
Think, for instance, of a 1GB pcap dump that I want to replay - without
waiting for the data to be unused, I could potentially send out corrupt packets
or skip some.
> You want to implement a system which behaves more consistently than the
> existing zero-copy approach, yet still not 100% correctly...
I'm trying to implement a system which is zero-copy with traditional socket send()
semantics (or at least as close as possible).  I don't see why it wouldn't work
100% correctly.

> > > So you can update whatever flags you have to after return of the
> > > dev_queue_xmit() and will get the same behaviour as sendfile:
> > > immediate write into the same memory area results in sending new
> > > content
> > > (on some NICs).
> > But using your approach, how can a user ever know whether or not he
> actually sent
> > a packet?
> 
> There is no way to know that. At all. The skb can be dropped for zillions of
> reasons, and after it has been submitted to the qdisc layer there is no way
> to know how its life will continue. Well, in some cases it is possible
> to know (when the qdisc just frees the skb), but that is far from 100% of the
> cases.
Right - so I don't really care if the skb gets silently dropped by lower
layers.  My only concern is that there is some kind of protection the user
can rely on to ensure that he doesn't overwrite data that is in the middle of
being transmitted.  I think that's the communication disconnect we're having.
You think I'm concerned about dropped packets - but that's just a side issue.
I recognize that some lower layer may abandon the skb for whatever reason.  The
main issue is data integrity - the pages in the scatter/gather list must not be
touched from user space until the device lets go of the skb.  With your approach,
a race is actually extremely likely (guaranteed if we ever wrap around the ring
buffer).  Consider:

Thread A:
  Calls send() in a loop to flush the buffer.

Thread B:
  Fills the buffer: as soon as the next frame is free (according to the status
  flag), fills it in and updates the status flag.

Clearly, Thread B would see the status flag cleared before the device had
actually tried to send the skb (unless the skb managed to get sent out before
dev_queue_xmit() returned).
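
(To make the race concrete, here is a rough user-space sketch of that pattern,
using the frame status values from the proposed patch - TP_STATUS_KERNEL for a
free frame, TP_STATUS_USER to request transmission. The ring variables, the
data offset and the busy-wait are simplifications, and memory barriers are
omitted.)

/* Sketch of the Thread A / Thread B pattern above (simplified). */
#include <linux/if_packet.h>
#include <string.h>
#include <sys/socket.h>

extern char *tx_ring;			/* mmap()ed PACKET_TX_RING area */
extern unsigned int frame_size;
extern int fd;				/* the PF_PACKET socket */

/* Thread B: claim the next frame as soon as its status says it is free. */
static void fill_frame(unsigned int i, const void *payload, unsigned int len)
{
	struct tpacket_hdr *hdr =
		(struct tpacket_hdr *)(tx_ring + i * frame_size);

	while (hdr->tp_status != TP_STATUS_KERNEL)
		;			/* frame still owned by the kernel */

	/* data offset approximated; the exact layout depends on the patch */
	memcpy((char *)hdr + TPACKET_HDRLEN, payload, len);
	hdr->tp_len = len;
	hdr->tp_status = TP_STATUS_USER;	/* hand the frame to the kernel */
}

/* Thread A: kick the kernel to walk the ring and transmit pending frames. */
static void flush_ring(void)
{
	send(fd, NULL, 0, 0);
}

If the status is flipped back to TP_STATUS_KERNEL as soon as dev_queue_xmit()
returns, fill_frame() can start overwriting the frame while the NIC still
references its pages - which is exactly the corruption described above.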

> 
> > Am I missing something fundamental in my understanding?  I don't see
> any way, outside
> > of using the skb destructor, to notify the user when he can safely
> write to a frame
> > without interfering with any pending skbs.
> 
> Having a callback at destruction time does mean that no one uses the skb, but
> are you sure this is needed? With the existing zero-copy paths (splice/sendfile)
> this is not true, yet you want to extend this approach...
> 
> If you _do_ want to make it that way, you can drop the destructor approach
> entirely and implement your own packet-socket-only allocation policy, and thus
> have your own private destructor without extending the skb.
Can you please elaborate on this further?  What do you mean by a custom allocation
policy and a private destructor?  Isn't that exactly what we're doing now?  We're
just trying to figure out, in our destructor, which frame the skb was built from.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-12 21:46                                                     ` Evgeniy Polyakov
  2008-11-12 22:33                                                       ` Lovich, Vitali
@ 2008-11-18 18:49                                                       ` Johann Baudy
  2008-11-18 19:10                                                         ` Evgeniy Polyakov
  2008-11-18 19:46                                                         ` Lovich, Vitali
  1 sibling, 2 replies; 59+ messages in thread
From: Johann Baudy @ 2008-11-18 18:49 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Lovich, Vitali, netdev@vger.kernel.org

Hi Evgeniy,

> If you _do_ want to make it that way, you can drop the destructor approach
> entirely and implement your own packet-socket-only allocation policy, and thus
> have your own private destructor without extending the skb.

Currently, we are calling sock_alloc_send_skb() to allocate a new
skb from the socket.
Then we replace the destructor sock_wfree() with our own destructor,
packet_skb_destruct(), which executes sock_wfree() once the status of the
packet frame (associated with the skb data) has been given back to the user
(status changed).
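
(Roughly, the arrangement described above - a minimal sketch, not the actual
patch; packet_frame_of() is a made-up name for the skb-to-frame lookup that is
still being debated in this thread:)

static void packet_skb_destruct(struct sk_buff *skb)
{
	struct packet_sock *po = pkt_sk(skb->sk);
	struct tpacket_hdr *ph = packet_frame_of(po, skb);	/* skb -> ring frame */

	ph->tp_status = TP_STATUS_KERNEL;	/* give the frame back to user space */
	sock_wfree(skb);			/* keep the normal sndbuf accounting */
}

static struct sk_buff *packet_tx_alloc(struct sock *sk, int size, int *err)
{
	/* sock_alloc_send_skb() charges the skb to the socket and installs
	 * sock_wfree() as the destructor; we then swap in our own. */
	struct sk_buff *skb = sock_alloc_send_skb(sk, size, 0, err);

	if (skb)
		skb->destructor = packet_skb_destruct;
	return skb;
}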

Is this approach ok?
Or shall we implement our own sock_alloc_send_skb()?

Thanks in advance,
Johann


-- 
Johann Baudy
johaahn@gmail.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-18 18:49                                                       ` Johann Baudy
@ 2008-11-18 19:10                                                         ` Evgeniy Polyakov
  2008-11-18 19:46                                                         ` Lovich, Vitali
  1 sibling, 0 replies; 59+ messages in thread
From: Evgeniy Polyakov @ 2008-11-18 19:10 UTC (permalink / raw)
  To: Johann Baudy; +Cc: Lovich, Vitali, netdev@vger.kernel.org

On Tue, Nov 18, 2008 at 07:49:00PM +0100, Johann Baudy (johaahn@gmail.com) wrote:
> Currently, we are calling sock_alloc_send_skb() to allocate a new
> skb from the socket.
> Then we replace the destructor sock_wfree() with our own destructor,
> packet_skb_destruct(), which executes sock_wfree() once the status of the
> packet frame (associated with the skb data) has been given back to the user
> (status changed).
> 
> Is this approach ok?
> Or shall we implement our own sock_alloc_send_skb()?

If it meets your needs it is of course ok, but sock_alloc_send_skb() does
additional memory accounting checks and other socket management work. If that
does not matter, then everything is ok.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
  2008-11-18 18:49                                                       ` Johann Baudy
  2008-11-18 19:10                                                         ` Evgeniy Polyakov
@ 2008-11-18 19:46                                                         ` Lovich, Vitali
  1 sibling, 0 replies; 59+ messages in thread
From: Lovich, Vitali @ 2008-11-18 19:46 UTC (permalink / raw)
  To: Johann Baudy, Evgeniy Polyakov; +Cc: netdev@vger.kernel.org



> -----Original Message-----
> From: Johann Baudy [mailto:johaahn@gmail.com]
> Sent: November-18-08 10:49 AM
> To: Evgeniy Polyakov
> Cc: Lovich, Vitali; netdev@vger.kernel.org
> Subject: Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING
> 
> Hi Evgeniy,
> 
> > If you _do_ want to make it that way, you can drop the destructor approach
> > entirely and implement your own packet-socket-only allocation policy, and
> > thus have your own private destructor without extending the skb.
> 
> Currently, we are calling sock_alloc_send_skb() to allocate a new
> skb from the socket.
> Then we replace the destructor sock_wfree() with our own destructor,
> packet_skb_destruct(), which executes sock_wfree() once the status of the
> packet frame (associated with the skb data) has been given back to the user
> (status changed).
Just a clarification - sock_wfree() is called !before! we set the status back.
Not extremely important, but it's cleaner that way, and the constant
sock_wfree()/sock_alloc_send_skb() cycle is more likely to keep reallocating
the same memory.

Oh, and Johann, here are the perf #s we were talking about off the list.

Frame size (bytes) / Packets per second / Line rate (Gbit/s)
8942 / 144270 / 9.61
1500 / 387544 / 4.32
 500 / 591648 / 2.19
  43 / 925328 / 0.30   (43 bytes is about 1 byte of payload for a UDP packet)

I should have the latency numbers in a little while.

Vitali

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2008-11-18 19:46 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-31 10:58 [PATCH] Packet socket: mmapped IO: PACKET_TX_RING Johann Baudy
2008-10-31 17:07 ` Lovich, Vitali
2008-10-31 18:24   ` Lovich, Vitali
2008-11-04 22:45     ` Johann Baudy
2008-11-06  0:47       ` Lovich, Vitali
2008-11-06  8:03         ` Evgeniy Polyakov
2008-11-06 18:49           ` Lovich, Vitali
2008-11-06 19:40             ` Evgeniy Polyakov
2008-11-06 19:53               ` Lovich, Vitali
2008-11-07 16:36                 ` Johann Baudy
2008-11-07 17:19                   ` Lovich, Vitali
2008-11-10 20:29                     ` Lovich, Vitali
2008-11-11  0:29                       ` Lovich, Vitali
     [not found]                         ` <7e0dd21a0811110656yff651afp8ff0f9928b79f545@mail.gmail.com>
2008-11-11 14:59                           ` Johann Baudy
2008-11-11 19:05                             ` Lovich, Vitali
2008-11-11 12:10                       ` Johann Baudy
2008-11-11 17:44                         ` Lovich, Vitali
2008-11-11 18:08                           ` Johann Baudy
2008-11-11 18:19                             ` Lovich, Vitali
2008-11-11 18:59                               ` Johann Baudy
2008-11-11 19:10                                 ` Lovich, Vitali
2008-11-12 12:09                                   ` Johann Baudy
2008-11-12 17:12                                     ` Lovich, Vitali
2008-11-11 11:43                     ` Johann Baudy
2008-11-11 17:38                       ` Lovich, Vitali
2008-11-11 17:50                         ` Johann Baudy
2008-11-11 18:14                           ` Lovich, Vitali
2008-11-11 18:50                             ` Evgeniy Polyakov
2008-11-11 19:19                               ` Johann Baudy
2008-11-11 19:29                                 ` Evgeniy Polyakov
2008-11-12 13:43                                   ` Johann Baudy
2008-11-12 13:58                                     ` Evgeniy Polyakov
2008-11-12 17:07                                       ` Lovich, Vitali
2008-11-12 17:41                                         ` Evgeniy Polyakov
2008-11-12 17:59                                           ` Lovich, Vitali
2008-11-12 18:11                                             ` Evgeniy Polyakov
2008-11-12 19:05                                               ` Lovich, Vitali
2008-11-12 19:14                                                 ` Evgeniy Polyakov
2008-11-12 21:23                                                   ` Lovich, Vitali
2008-11-12 21:46                                                     ` Evgeniy Polyakov
2008-11-12 22:33                                                       ` Lovich, Vitali
2008-11-18 18:49                                                       ` Johann Baudy
2008-11-18 19:10                                                         ` Evgeniy Polyakov
2008-11-18 19:46                                                         ` Lovich, Vitali
2008-11-07 17:28                 ` Evgeniy Polyakov
2008-11-07 20:22                   ` David Miller
2008-10-31 20:28   ` Evgeniy Polyakov
2008-11-04 22:33   ` Johann Baudy
2008-11-05  1:50     ` Lovich, Vitali
  -- strict thread matches above, loose matches on Subject: below --
2008-11-12 13:19 Johann Baudy
2008-11-05 15:16 Johann Baudy
2008-11-05 17:49 ` Lovich, Vitali
2008-11-05 10:55 Johann Baudy
2008-11-05 11:02 ` Patrick McHardy
2008-11-05 17:32 ` Lovich, Vitali
2008-10-30 13:00 Johann Baudy
2008-10-30 18:21 ` Lovich, Vitali
2008-10-27  9:33 Johann Baudy
2008-10-28 22:44 ` David Miller
