* [PATCH] packet: support for TX time stamps on RAW sockets
From: Richard Cochran @ 2010-04-06 14:30 UTC
To: netdev; +Cc: patrick.ohly
Enable the SO_TIMESTAMPING socket infrastructure for raw packet sockets.
For lack of a better idea, we have elected to use PACKET_RECV_OUTPUT for
the control message cmsg_type. This macro currently is not used anywhere
within the kernel.
Similar support for UDP and CAN sockets was added in commit
51f31cabe3ce5345b51e4a4f82138b38c4d5dc91
Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
---
net/packet/af_packet.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 59 insertions(+), 1 deletions(-)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index b0f037c..4513222 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -81,6 +81,7 @@
#include <linux/mutex.h>
#include <linux/if_vlan.h>
#include <linux/virtio_net.h>
+#include <linux/errqueue.h>
#ifdef CONFIG_INET
#include <net/inet_common.h>
@@ -314,6 +315,8 @@ static inline struct packet_sock *pkt_sk(struct sock *sk)
static void packet_sock_destruct(struct sock *sk)
{
+ skb_queue_purge(&sk->sk_error_queue);
+
WARN_ON(atomic_read(&sk->sk_rmem_alloc));
WARN_ON(atomic_read(&sk->sk_wmem_alloc));
@@ -482,6 +485,9 @@ retry:
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+ err = sock_tx_timestamp(msg, sk, skb_tx(skb));
+ if (err < 0)
+ goto out_unlock;
dev_queue_xmit(skb);
rcu_read_unlock();
@@ -1187,6 +1193,9 @@ static int packet_snd(struct socket *sock,
err = skb_copy_datagram_from_iovec(skb, offset, msg->msg_iov, 0, len);
if (err)
goto out_free;
+ err = sock_tx_timestamp(msg, sk, skb_tx(skb));
+ if (err < 0)
+ goto out_free;
skb->protocol = proto;
skb->dev = dev;
@@ -1486,6 +1495,50 @@ out:
return err;
}
+static int packet_recv_error(struct sock *sk, struct msghdr *msg, int len)
+{
+ struct sock_exterr_skb *serr;
+ struct sk_buff *skb, *skb2;
+ int copied, err;
+
+ err = -EAGAIN;
+ skb = skb_dequeue(&sk->sk_error_queue);
+ if (skb == NULL)
+ goto out;
+
+ copied = skb->len;
+ if (copied > len) {
+ msg->msg_flags |= MSG_TRUNC;
+ copied = len;
+ }
+ err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ if (err)
+ goto out_free_skb;
+
+ sock_recv_timestamp(msg, sk, skb);
+
+ serr = SKB_EXT_ERR(skb);
+ put_cmsg(msg, SOL_PACKET, PACKET_RECV_OUTPUT, sizeof(serr->ee), &serr->ee);
+
+ msg->msg_flags |= MSG_ERRQUEUE;
+ err = copied;
+
+ /* Reset and regenerate socket error */
+ spin_lock_bh(&sk->sk_error_queue.lock);
+ sk->sk_err = 0;
+ if ((skb2 = skb_peek(&sk->sk_error_queue)) != NULL) {
+ sk->sk_err = SKB_EXT_ERR(skb2)->ee.ee_errno;
+ spin_unlock_bh(&sk->sk_error_queue.lock);
+ sk->sk_error_report(sk);
+ } else
+ spin_unlock_bh(&sk->sk_error_queue.lock);
+
+out_free_skb:
+ kfree_skb(skb);
+out:
+ return err;
+}
+
/*
* Pull a packet from our receive queue and hand it to the user.
* If necessary we block.
@@ -1501,7 +1554,7 @@ static int packet_recvmsg(struct kiocb *iocb, struct socket *sock,
int vnet_hdr_len = 0;
err = -EINVAL;
- if (flags & ~(MSG_PEEK|MSG_DONTWAIT|MSG_TRUNC|MSG_CMSG_COMPAT))
+ if (flags & ~(MSG_PEEK|MSG_DONTWAIT|MSG_TRUNC|MSG_CMSG_COMPAT|MSG_ERRQUEUE))
goto out;
#if 0
@@ -1510,6 +1563,11 @@ static int packet_recvmsg(struct kiocb *iocb, struct socket *sock,
return -ENODEV;
#endif
+ if (flags & MSG_ERRQUEUE) {
+ err = packet_recv_error(sk, msg, len);
+ goto out;
+ }
+
/*
* Call the generic datagram receiver. This handles all sorts
* of horrible races and re-entrancy so we can forget about it
--
1.6.0.4
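For readers who want to consume these time stamps from user space, the following is a minimal sketch of how the control messages returned by recvmsg() with MSG_ERRQUEUE could be walked to copy out the SCM_TIMESTAMPING payload that sock_recv_timestamp() attaches. It is an illustration only, not part of the patch: struct tx_stamps and both helper names are hypothetical, the fallback value for SCM_TIMESTAMPING is an assumption for headers that lack the macro, and opening the packet socket and enabling SO_TIMESTAMPING are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING 37 /* assumption: value used on most architectures */
#endif

/* Hypothetical mirror of the three-timespec payload of SCM_TIMESTAMPING. */
struct tx_stamps {
	struct timespec ts[3]; /* [0] software, [1] transformed HW, [2] raw HW */
};

/* Return the payload of the first control message matching level/type
 * that carries at least min_len bytes, or NULL if there is none. */
static const void *find_cmsg(struct msghdr *msg, int level, int type,
			     size_t min_len)
{
	struct cmsghdr *c;

	assert(msg != NULL);
	for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
		if (c->cmsg_level == level && c->cmsg_type == type &&
		    c->cmsg_len >= CMSG_LEN(min_len))
			return CMSG_DATA(c);
	}
	return NULL;
}

/* Copy the time stamps out of msg; returns 0 on success, -1 if absent. */
static int extract_tx_stamps(struct msghdr *msg, struct tx_stamps *out)
{
	const void *p = find_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPING,
				  sizeof(*out));

	if (p == NULL)
		return -1;
	memcpy(out, p, sizeof(*out)); /* payload may be unaligned, so copy */
	return 0;
}
```

A caller would typically run extract_tx_stamps() on the msghdr filled in by recvmsg(fd, &msg, MSG_ERRQUEUE); the same find_cmsg() helper could be pointed at SOL_PACKET / PACKET_RECV_OUTPUT to fetch the struct sock_extended_err this patch adds.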
* Re: [PATCH v2] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-06 14:25 UTC
To: Eric Dumazet; +Cc: davem, netdev
In-Reply-To: <1270559096.2081.35.camel@edumazet-laptop>
>
> Running on a preprod machine here, seems fine.
>
> Some questions :
>
> 1) The need to add "rps_flow_entries=xxx" at boot time is problematic.
> Maybe we can allow it being dynamic (and use vmalloc() instead of
> alloc_large_system_hash())
>
Okay, could be a sysctl with vmalloc.
> 2) inet_rps_save_rxhash(sk, skb->rxhash);
>
> It should have a check to make sure some part of the stack doesn't feed
> many different rxhash values for a given socket (make sure we don't pollute
> the flow table with pseudo-random values)
>
If packets for a connection are always received on the same device, is
it reasonable to assume the rxhash is constant for that connection?
I suppose it's possible that packets for the same socket are being
constantly received on two different devices that are giving different
rxhashes. This would already be bad in that out-of-order (OOO) delivery
is probably happening anyway. I don't know if thrashing the
sock_flow_table is going to aggravate this scenario much.
Are there any other degenerate cases you're worried about?
> 3) UDP connected sockets don't benefit from RFS currently
> (Not sure many apps use connected UDP sockets, I do have some of them
> in house)
>
Makes sense to support that.
> I am trying following code for IPV4 only :
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 7af756d..5c2d37a 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1216,6 +1216,7 @@ int udp_disconnect(struct sock *sk, int flags)
> sk->sk_state = TCP_CLOSE;
> inet->inet_daddr = 0;
> inet->inet_dport = 0;
> + inet_rps_save_rxhash(sk, 0);
> sk->sk_bound_dev_if = 0;
> if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> inet_reset_saddr(sk);
> @@ -1257,8 +1258,12 @@ EXPORT_SYMBOL(udp_lib_unhash);
>
> static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> {
> - int rc = sock_queue_rcv_skb(sk, skb);
> + int rc;
> +
> + if (inet_sk(sk)->inet_daddr)
> + inet_rps_save_rxhash(sk, skb->rxhash);
>
> + rc = sock_queue_rcv_skb(sk, skb);
> if (rc < 0) {
> int is_udplite = IS_UDPLITE(sk);
>
>
>
>
* Re: Increased Latencies when upgrading kernel version
From: Xianghua Xiao @ 2010-04-06 14:10 UTC
To: Taylor Lewick; +Cc: Eric Dumazet, netdev, linux-kernel
In-Reply-To: <m2jd585dc4f1004051034ka36301d6xdce95defe6388836@mail.gmail.com>
On Mon, Apr 5, 2010 at 12:34 PM, Taylor Lewick <taylor.lewick@gmail.com> wrote:
> Okay, don't know what to officially file this under, as a regression
> with regards to performance or what, but here is the data. Again,
> I've noticed system and network latency appear to have worsened with
> later kernel versions.
>
> I was turned onto this problem via the following links:
> http://www.kernel.org/pub/linux/kernel/people/christoph/ols2009/ols-2009-paper.pdf
> and http://kerneltrap.org/mailarchive/linux-netdev/2009/4/16/5491284
>
> So I set up a test on two servers with Identical hardware, servers,
> nics, etc, and used hackbench, udpping, and an internally written app
> to compare latency.
>
> Here are just the hackbench results, averaged across 5
> runs for two different hackbench tests. The 2.6.16 and 2.6.27 kernels
> as set up were configured with voluntary preemption and 250 HZ, so I
> just repeated that initially for the 2.6.33.1 test. I also tested no
> preemption at the same HZ setting of 250.
>
> I ran 2.6.16.60 on one server, and the other kernel versions on
> another server. These tests are repeatable across different servers,
> as in I verified I don't have a bad server.
>
> Kernel Version        HB1 (25 process 300)    HB2 (100 process 300)
> 2.6.16.60             .5402                   1.8946
> 2.6.27.19             .619                    2.6268
> 2.6.32.3-voluntary    .5636                   2.3484
> 2.6.33.1-voluntary    .5404                   2.2872
> 2.6.33.1-nopreempt    .5606                   2.3466
>
> So 2.6.16.60 is fast, 2.6.27.19 is slow, and 2.6.33.1 with voluntary
> preemption is the next best, but its results didn't hold up well as the
> hackbench tests used larger numbers of groups. For example, 2.6.16.60
> and 2.6.33.1-voluntary were basically the same for HB1, but that
> didn't hold when the hackbench tests used more groups.
>
> At this point, I'm looking for ideas in kernel build to tweak, but I'm
> not a developer. So SLAB vs SLUB, sparse vs dense IRQ numbering, etc.
> Running a -rt kernel isn't an option at this time. I did test that as
> well, and latencies were quite a bit worse, but I wasn't adjusting
> code to take advantage of a real time OS.
>
> I can make some changes or repeat tests.
>
> Below are some hardware comparisons between the two machines.
> The differences I noticed were more interrupts and CPU flags on the
> later kernel version.
>
> HostA 2.6.16.60
> cat /proc/interrupts
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>   0:  108509762          0          0          0          0          0          0          0  IO-APIC-edge   timer
>   8:          1          0          0          0          0          0          0          0  IO-APIC-edge   rtc
>   9:          0          0          0          0          0          0          0          0  IO-APIC-level  acpi
>  58:        305          0    5157735        220    2980100       5927       1187          0  IO-APIC-level  libata
> 162:          0          0          0          0          0          0          0          0  IO-APIC-level  uhci_hcd:usb1
> 170:          0          0          0          0          0          0          0          0  IO-APIC-level  uhci_hcd:usb2
> 177:       6326          0     229018          0     283720      35597        367          0  IO-APIC-level  megasas
> 178:        122          0       1784       1103       3531         20       1457          0  IO-APIC-level  uhci_hcd:usb3, ehci_hcd:usb6
> 186:          0          0          0          0          0          0          0          0  IO-APIC-level  uhci_hcd:usb4
> 194:         22          0          0          0          0          0          0          0  IO-APIC-level  ehci_hcd:usb5
> 210:    1790109        577          0          0          0          0          0          0  PCI-MSI-X      eth4-0
> 218:     233811         93          0          0          0          0          0          0  PCI-MSI-X      eth4-1
> NMI:          0          0          0          0          0          0          0          0
> LOC:  108509683  108509662  108509637  108509614  108509588  108509566  108509541  108509516
> ERR:          7
> MIS:          0
>
> lspci
> 00:00.0 Host bridge: Intel Corporation QuickPath Architecture I/O Hub
> to ESI Port (rev 13)
> 00:01.0 PCI bridge: Intel Corporation QuickPath Architecture I/O Hub
> PCI Express Root Port 1 (rev 13)
> 00:03.0 PCI bridge: Intel Corporation QuickPath Architecture I/O Hub
> PCI Express Root Port 3 (rev 13)
> 00:07.0 PCI bridge: Intel Corporation QuickPath Architecture I/O Hub
> PCI Express Root Port 7 (rev 13)
> 00:09.0 PCI bridge: Intel Corporation QuickPath Architecture I/O Hub
> PCI Express Root Port 9 (rev 13)
> 00:14.0 PIC: Intel Corporation QuickPath Architecture I/O Hub System
> Management Registers (rev 13)
> 00:14.1 PIC: Intel Corporation QuickPath Architecture I/O Hub GPIO and
> Scratch Pad Registers (rev 13)
> 00:14.2 PIC: Intel Corporation QuickPath Architecture I/O Hub Control
> Status and RAS Registers (rev 13)
> 00:16.0 System peripheral: Intel Corporation DMA Engine (rev 13)
> 00:16.1 System peripheral: Intel Corporation DMA Engine (rev 13)
> 00:16.2 System peripheral: Intel Corporation DMA Engine (rev 13)
> 00:16.3 System peripheral: Intel Corporation DMA Engine (rev 13)
> 00:16.4 System peripheral: Intel Corporation DMA Engine (rev 13)
> 00:16.5 System peripheral: Intel Corporation DMA Engine (rev 13)
> 00:16.6 System peripheral: Intel Corporation DMA Engine (rev 13)
> 00:16.7 System peripheral: Intel Corporation DMA Engine (rev 13)
> 00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB
> UHCI Controller #4 (rev 02)
> 00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB
> UHCI Controller #5 (rev 02)
> 00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2
> EHCI Controller #2 (rev 02)
> 00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express
> Port 1 (rev 02)
> 00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB
> UHCI Controller #1 (rev 02)
> 00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB
> UHCI Controller #2 (rev 02)
> 00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2
> EHCI Controller #1 (rev 02)
> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
> 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface
> Controller (rev 02)
> 00:1f.2 IDE interface: Intel Corporation 82801IB (ICH9) 2 port SATA
> IDE Controller (rev 02)
> 03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS
> 1078 (rev 04)
> 04:00.0 PCI bridge: Integrated Device Technology, Inc. Unknown device
> 8018 (rev 0e)
> 05:02.0 PCI bridge: Integrated Device Technology, Inc. Unknown device
> 8018 (rev 0e)
> 05:04.0 PCI bridge: Integrated Device Technology, Inc. Unknown device
> 8018 (rev 0e)
> 06:00.0 Ethernet controller: Intel Corporation 82575GB Gigabit Network
> Connection (rev 02)
> 06:00.1 Ethernet controller: Intel Corporation 82575GB Gigabit Network
> Connection (rev 02)
> 07:00.0 Ethernet controller: Intel Corporation 82575GB Gigabit Network
> Connection (rev 02)
> 07:00.1 Ethernet controller: Intel Corporation 82575GB Gigabit Network
> Connection (rev 02)
> 08:00.0 Ethernet controller: Solarflare Communications Unknown device
> 0710 (rev 02)
> 09:03.0 VGA compatible controller: Matrox Graphics, Inc. Unknown
> device 0532 (rev 0a)
>
> cat /proc/cpuinfo (just showing first CPU for brevity)
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 26
> model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
> stepping : 5
> cpu MHz : 2926.090
> cache size : 8192 KB
> physical id : 1
> siblings : 4
> core id : 0
> cpu cores : 4
> fpu : yes
> fpu_exception : yes
> cpuid level : 11
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall
> nx rdtscp lm constant_tsc pni monitor ds_cpl
> vmx est tm2 cx16 xtpr dca popcnt lahf_lm
> bogomips : 5857.34
> clflush size : 64
> cache_alignment : 64
> address sizes : 40 bits physical, 48 bits virtual
> power management:
>
> ethtool -c eth4
> Coalesce parameters for eth4:
> Adaptive RX: on TX: off
> stats-block-usecs: 0
> sample-interval: 0
> pkt-rate-low: 0
> pkt-rate-high: 0
>
> rx-usecs: 0
> rx-frames: 0
> rx-usecs-irq: 60
> rx-frames-irq: 0
>
> tx-usecs: 0
> tx-frames: 0
> tx-usecs-irq: 0
> tx-frames-irq: 0
>
> rx-usecs-low: 0
> rx-frame-low: 0
> tx-usecs-low: 0
> tx-frame-low: 0
>
> rx-usecs-high: 0
> rx-frame-high: 0
> tx-usecs-high: 0
> tx-frame-high: 0
>
>
> HostB 2.6.33.1
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>   0:       8637          0          0          0          0          0          0          0  IO-APIC-edge     timer
>   1:          2          0          0          0          0          0          0          0  IO-APIC-edge     i8042
>   3:          2          0          0          0          0          0          0          0  IO-APIC-edge
>   4:          2          0          0          0          0          0          0          0  IO-APIC-edge
>   8:          1          0          0          0          0          0          0          0  IO-APIC-edge     rtc0
>   9:          0          0          0          0          0          0          0          0  IO-APIC-fasteoi  acpi
>  12:          4          0          0          0          0          0          0          0  IO-APIC-edge     i8042
>  16:       7434        683          0          0          0          0          0          0  IO-APIC-fasteoi  megasas
>  17:          0          0          0          0          0          0          0          0  IO-APIC-fasteoi  uhci_hcd:usb3
>  18:          0          0          0          0          0          0          0          0  IO-APIC-fasteoi  uhci_hcd:usb4
>  19:         23          0          0          0          0          0          0          0  IO-APIC-fasteoi  ehci_hcd:usb1
>  20:          0          0          0          0          0          0          0          0  IO-APIC-fasteoi  uhci_hcd:usb6
>  21:        129          0         15          0          0          0          0          0  IO-APIC-fasteoi  ehci_hcd:usb2, uhci_hcd:usb5
>  23:        369          0          0          0          0          0          0          0  IO-APIC-fasteoi  ata_piix
>  67:       2346        731          0          0          0          0          0          0  PCI-MSI-edge     eth4-0
>  68:       1809        404          0          0          0          0          0          0  PCI-MSI-edge     eth4-1
> NMI:          0          0          0          0          0          0          0          0  Non-maskable interrupts
> LOC:      33071      38348      47397      23246      15715      11065       9004      10391  Local timer interrupts
> SPU:          0          0          0          0          0          0          0          0  Spurious interrupts
> PMI:          0          0          0          0          0          0          0          0  Performance monitoring interrupts
> PND:          0          0          0          0          0          0          0          0  Performance pending work
> RES:       2490       2124       4187       4974       1724       5548       1892       2871  Rescheduling interrupts
> CAL:        497       2166        141        115        133        144        140        144  Function call interrupts
> TLB:        243        244        928        945        289        187        134         93  TLB shootdowns
> TRM:          0          0          0          0          0          0          0          0  Thermal event interrupts
> THR:          0          0          0          0          0          0          0          0  Threshold APIC interrupts
> MCE:          0          0          0          0          0          0          0          0  Machine check exceptions
> MCP:          2          2          2          2          2          2          2          2  Machine check polls
> ERR:          7
> MIS:          0
>
> lspci
> 00:00.0 Host bridge: Intel Corporation X58 I/O Hub to ESI Port (rev 13)
> 00:01.0 PCI bridge: Intel Corporation X58 I/O Hub PCI Express Root
> Port 1 (rev 13)
> 00:03.0 PCI bridge: Intel Corporation X58 I/O Hub PCI Express Root
> Port 3 (rev 13)
> 00:07.0 PCI bridge: Intel Corporation X58 I/O Hub PCI Express Root
> Port 7 (rev 13)
> 00:09.0 PCI bridge: Intel Corporation X58 I/O Hub PCI Express Root
> Port 9 (rev 13)
> 00:14.0 PIC: Intel Corporation X58 I/O Hub System Management Registers (rev 13)
> 00:14.1 PIC: Intel Corporation X58 I/O Hub GPIO and Scratch Pad
> Registers (rev 13)
> 00:14.2 PIC: Intel Corporation X58 I/O Hub Control Status and RAS
> Registers (rev 13)
> 00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB
> UHCI Controller #4 (rev 02)
> 00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB
> UHCI Controller #5 (rev 02)
> 00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2
> EHCI Controller #2 (rev 02)
> 00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express
> Port 1 (rev 02)
> 00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB
> UHCI Controller #1 (rev 02)
> 00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB
> UHCI Controller #2 (rev 02)
> 00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2
> EHCI Controller #1 (rev 02)
> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
> 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface
> Controller (rev 02)
> 00:1f.2 IDE interface: Intel Corporation 82801IB (ICH9) 2 port SATA
> IDE Controller (rev 02)
> 03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS
> 1078 (rev 04)
> 04:00.0 PCI bridge: Integrated Device Technology, Inc. PES12N3A PCI
> Express Switch (rev 0e)
> 05:02.0 PCI bridge: Integrated Device Technology, Inc. PES12N3A PCI
> Express Switch (rev 0e)
> 05:04.0 PCI bridge: Integrated Device Technology, Inc. PES12N3A PCI
> Express Switch (rev 0e)
> 06:00.0 Ethernet controller: Intel Corporation 82575GB Gigabit Network
> Connection (rev 02)
> 06:00.1 Ethernet controller: Intel Corporation 82575GB Gigabit Network
> Connection (rev 02)
> 07:00.0 Ethernet controller: Intel Corporation 82575GB Gigabit Network
> Connection (rev 02)
> 07:00.1 Ethernet controller: Intel Corporation 82575GB Gigabit Network
> Connection (rev 02)
> 08:00.0 Ethernet controller: Solarflare Communications SFC4000 rev B
> [Solarstorm] (rev 02)
> 09:03.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200eW
> WPCM450 (rev 0a)
>
> cat /proc/cpuinfo (just showing first CPU for brevity)
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 26
> model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
> stepping : 5
> cpu MHz : 2925.888
> cache size : 8192 KB
> physical id : 1
> siblings : 4
> core id : 0
> cpu cores : 4
> apicid : 16
> initial apicid : 16
> fpu : yes
> fpu_exception : yes
> cpuid level : 11
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
> syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts
> rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl
> vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm ida
> tpr_shadow vnmi flexpriority ept vpid
> bogomips : 5851.77
> clflush size : 64
> cache_alignment : 64
> address sizes : 40 bits physical, 48 bits virtual
> power management:
>
> ethtool -c eth4
> Coalesce parameters for eth4:
> Adaptive RX: on TX: off
> stats-block-usecs: 0
> sample-interval: 0
> pkt-rate-low: 0
> pkt-rate-high: 0
>
> rx-usecs: 0
> rx-frames: 0
> rx-usecs-irq: 60
> rx-frames-irq: 0
>
> tx-usecs: 0
> tx-frames: 0
> tx-usecs-irq: 0
> tx-frames-irq: 0
>
> rx-usecs-low: 0
> rx-frame-low: 0
> tx-usecs-low: 0
> tx-frame-low: 0
>
> rx-usecs-high: 0
> rx-frame-high: 0
> tx-usecs-high: 0
> tx-frame-high: 0
>
>
>
> On Thu, Apr 1, 2010 at 8:53 PM, Taylor Lewick <taylor.lewick@gmail.com> wrote:
>> Okay. I will get this info out to the list Monday. Briefly, I'm
>> using identical hardware (server), identical NICs, same drivers,
>> connected to same switch, and using udpping, hackbench, and an
>> internally written app to test latency. Without exception the
>> evolution has looked like the following.
>>
>> 2.6.16.60 latencies for system and network are fast. Meaning
>> hackbench and udpping win, and win by quite a bit.
>>
>> 2.6.27.19 was awful. 2.6.32.1 and 2.6.33.1 were better for networking
>> (with some tweaks, i.e. disable netfilter, etc), and I was able to get
>> networking latencies to within 1-3 microseconds of 2.6.16.60
>> latencies, but the hackbench results are still pretty bad.
>>
>> Again, I'll post numbers and more detailed hardware info on Monday
>> when I'm back at office...
>>
>> On Thu, Apr 1, 2010 at 4:19 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> Le jeudi 01 avril 2010 à 14:12 -0500, Taylor Lewick a écrit :
>>>> For some time now we've been running an older kernel, 2.6.16.60. When
>>>> we tried to upgrade, first going to 2.6.27.19 and then to 2.6.32.1 and
>>>> 2.6.33.1 we noticed that latencies increased. At first we noticed it
>>>> by doing network tests via udpping, netperf, etc. We made some
>>>> tweaks, and were able to get network latency to within 1 to 2
>>>> microseconds of where we were previously on 2.6.16.60. Then we did
>>>> some more testing, and noticed that system latency also seems higher.
>>>>
>>>> We've done our tests on identical hardware servers, same NICs,
>>>> connected through same network gear. Basically, we've tried to keep
>>>> everything identical except the kernel versions, and we are unable to
>>>> achieve the same performance for system latency on the newer kernels,
>>>> despite adjusting various kernel settings and recompiling.
>>>>
>>>> The latency differences are about 15 microseconds per transaction.
>>>>
>>>> At this point, I don't know what else to try. I haven't played around
>>>> with the /proc/sys/kernel/sched_* parameters under the newer kernels
>>>> yet. Have tried changing preemption modes with little effect; in
>>>> fact, voluntary preemption seems to be performing the best for us.
>>>>
>>>> At this time the realtime patch isn't really an option for us to
>>>> consider, at least not yet.
>>>>
>>>> Any suggestions? Is this a known issue when upgrading to more recent
>>>> kernel versions?
>>>>
>>>
>>> Hi Taylor
>>>
>>> Well, it is a bit difficult to give a generic answer to your generic
>>> question. 15 us more latency per transaction seems pretty bad.
>>>
>>> Some inputs would be nice, describing your workload and
>>> software/hardware architecture.
>>>
>>> lspci
>>> cat /proc/cpuinfo
>>> cat /proc/interrupts
>>> dmesg
>>> ethtool -S eth0
>>> ethtool -c eth0
>>>
>>>
>>>
>>>
>>
>
Just want to ack you here: I upgraded a 2.6.18 kernel to 2.6.33.1 on a
shipping product and the performance (hackbench, latency, CPU
usage, etc.) is a lot worse on the same hardware platform. We tried
2.6.27 before and it was also bad. I'm trying various CONFIG options and
so far nothing has really helped. I'm using the RT patch.
Xianghua
* Re: [PATCH v2] rfs: Receive Flow Steering
From: Changli Gao @ 2010-04-06 13:42 UTC
To: Eric Dumazet; +Cc: Tom Herbert, davem, netdev
In-Reply-To: <1270559096.2081.35.camel@edumazet-laptop>
On Tue, Apr 6, 2010 at 9:04 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> 1) The need to add "rps_flow_entries=xxx" at boot time is problematic.
> Maybe we can allow it being dynamic (and use vmalloc() instead of
> alloc_large_system_hash())
Is flex_array better than vmalloc()?
--
Regards,
Changli Gao(xiaosuo@gmail.com)
* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: Eilon Greenstein @ 2010-04-06 13:41 UTC
To: FUJITA Tomonori, davem@davemloft.net
Cc: netdev@vger.kernel.org, Vladislav Zolotarov
In-Reply-To: <8628FE4E7912BF47A96AE7DD7BAC0AADDDC525AF62@SJEXCHCCR02.corp.ad.broadcom.com>
On Tue, 2010-04-06 at 00:39 -0700, Vladislav Zolotarov wrote:
> Thanks, Fujita.
>
> The patch looks fine. I'll run some regression tests on the patched driver to check that things still work and if it's ok we will ack it shortly.
>
> vlad
>
>
> > =
> > From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> > Subject: [PATCH] bnx2x: use the DMA API instead of the pci equivalents
> >
> > The DMA API is preferred.
> >
> > Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Vlad's tests with this patch finished successfully.
Thanks Fujita!
Acked-by: Vladislav Zolotarov <vladz@broadcom.com>
Acked-by: Eilon Greenstein <eilong@broadcom.com>
* Re: [Bugme-new] [Bug 15682] New: XFRM is not updating RTAX_ADVMSS metric
From: jamal @ 2010-04-06 13:40 UTC
To: Andrew Morton, Herbert Xu
Cc: netdev, bugzilla-daemon, bugme-daemon, eduardo.panisset,
David S. Miller
In-Reply-To: <20100405125055.cdc1e279.akpm@linux-foundation.org>
Herbert would give better answers. I don't think what Eduardo is
doing is correct. You can't just start factoring in TCP headers
at the xfrm level, and besides, the MTU calculation
already takes care of tunnel headers, so TCP should be able to
compute the correct MSS.
cheers,
jamal
On Mon, 2010-04-05 at 12:50 -0700, Andrew Morton wrote:
> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Fri, 2 Apr 2010 17:34:35 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
>
> > https://bugzilla.kernel.org/show_bug.cgi?id=15682
> >
> > Summary: XFRM is not updating RTAX_ADVMSS metric
> > Product: Networking
> > Version: 2.5
> > Kernel Version: 2.6.28-2
> > Platform: All
> > OS/Version: Linux
> > Tree: Mainline
> > Status: NEW
> > Severity: normal
> > Priority: P1
> > Component: Other
> > AssignedTo: acme@ghostprotocols.net
> > ReportedBy: eduardo.panisset@gmail.com
> > Regression: No
> >
> >
> > I have been testing DSMIPv6 code, which uses all kinds of advanced
> > features of the XFRM framework, and I believe I have found a bug related to
> > updating the RTAX_ADVMSS route metric.
> > The XFRM code in net/xfrm/xfrm_policy.c, in its functions
> > xfrm_init_pmtu and xfrm_bundle_ok, updates the RTAX_MTU route caching
> > metric; however, I believe it must also update RTAX_ADVMSS, as the latter is
> > used by the TCP connect function for advertising the MSS value in SYN
> > messages.
> >
> > As the MSS is not being updated by XFRM, TCP SYN messages (e.g.
> > originating from an internet browser) advertise the wrong MSS
> > (without taking into account the overhead added to the IP packet size by
> > XFRM transformations). One result of that is that the browser gets
> > "frozen" after it starts a TCP connection, because TCP messages sent by
> > the TCP server will never get to it (the TCP server is sending segments
> > that are too large for the browser).
> >
> > Below I describe the changes I have done (on xfrm_init_pmtu and
> > xfrm_bundle_ok) and that seem to fix this problem:
> >
> > xfrm_init_pmtu:
> > .
> > .
> > .
> >
> > dst->metrics[RTAX_MTU-1] = pmtu; // original code, below my changes
> >
> > if (dst->xfrm->props.mode == XFRM_MODE_TUNNEL)
> >         switch (dst->xfrm->props.family) {
> >         case AF_INET:
> >                 dst->metrics[RTAX_ADVMSS-1] = max_t(unsigned int,
> >                         pmtu - sizeof(struct iphdr) - sizeof(struct tcphdr), 256);
> >                 break;
> >
> >         case AF_INET6:
> >                 dst->metrics[RTAX_ADVMSS-1] = max_t(unsigned int,
> >                         pmtu - sizeof(struct ipv6hdr) - sizeof(struct tcphdr),
> >                         dev_net(dst->dev)->ipv6.sysctl.ip6_rt_min_advmss);
> >                 break;
> >         }
> >
> > xfrm_bundle_ok:
> >
> > .
> > .
> > .
> >
> > dst->metrics[RTAX_MTU-1] = mtu; // original code, below my changes
> >
> > if (dst->xfrm->props.mode == XFRM_MODE_TUNNEL)
> >         switch (dst->xfrm->props.family) {
> >         case AF_INET:
> >                 dst->metrics[RTAX_ADVMSS-1] = max_t(unsigned int,
> >                         mtu - sizeof(struct iphdr) - sizeof(struct tcphdr), 256);
> >                 break;
> >
> >         case AF_INET6:
> >                 dst->metrics[RTAX_ADVMSS-1] = max_t(unsigned int,
> >                         mtu - sizeof(struct ipv6hdr) - sizeof(struct tcphdr),
> >                         dev_net(dst->dev)->ipv6.sysctl.ip6_rt_min_advmss);
> >                 break;
> >         }
> >
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
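To make the arithmetic in the quoted report concrete, here is a small stand-alone sketch of the MSS computation it proposes: subtract the IP and TCP header sizes from the path MTU and clamp the result to a floor. It is an illustration under stated assumptions, not kernel code: advmss_from_pmtu and the header-length constants are hypothetical names, option-less headers are assumed (20 bytes for IPv4 and TCP, 40 bytes for IPv6), and the floor is passed in rather than read from a sysctl. Unlike the unsigned subtraction in the report, the sketch also guards against a path MTU smaller than the headers. Whether this belongs at the xfrm level at all is the open question in this thread.

```c
#include <assert.h>

/* Assumed fixed header sizes, without IP or TCP options. */
enum { IPV4_HDR_LEN = 20, IPV6_HDR_LEN = 40, TCP_HDR_LEN = 20 };

/*
 * Largest TCP payload that fits in pmtu for the given IP header length,
 * never smaller than min_mss. (Hypothetical helper for illustration.)
 */
static unsigned int advmss_from_pmtu(unsigned int pmtu,
				     unsigned int ip_hdr_len,
				     unsigned int min_mss)
{
	unsigned int overhead = ip_hdr_len + TCP_HDR_LEN;
	unsigned int mss;

	assert(ip_hdr_len == IPV4_HDR_LEN || ip_hdr_len == IPV6_HDR_LEN);

	/* Avoid unsigned wrap-around when the MTU is tiny. */
	if (pmtu <= overhead)
		return min_mss;

	mss = pmtu - overhead;
	return mss > min_mss ? mss : min_mss;
}
```

For example, with a tunnel path MTU of 1400 bytes and a floor of 256, the IPv4 case gives 1400 - 20 - 20 = 1360, while the IPv6 case with the same floor gives 1400 - 40 - 20 = 1340.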
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teräs @ 2010-04-06 13:26 UTC
To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100406123404.GA24294@gondor.apana.org.au>
Herbert Xu wrote:
> On Mon, Apr 05, 2010 at 08:01:24PM +0300, Timo Teras wrote:
>> This allows validating the cached object before returning it.
>> It also allows destructing the object properly if the last reference
>> was held in the flow cache. This is also a preparation for caching
>> bundles in the flow cache.
>>
>> In return for virtualizing the methods, we save on:
>> - not having to regenerate the whole flow cache on policy removal:
>> each flow matching a killed policy gets refreshed as the getter
>> function notices it smartly.
>> - we do not have to call flow_cache_flush from policy gc, since the
>> flow cache now properly deletes the object if it had any references
>>
>> Signed-off-by: Timo Teras <timo.teras@iki.fi>
>
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
>
> Thanks a lot for the patch!
As noticed in review of 2/4, this needs to be fixed by calling
flow object delete() unconditionally if genid is outdated. I'll
repost with this fixed for next iteration.
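The behaviour described here, deleting the cached object unconditionally once the entry's genid is stale, could be sketched roughly as follows. This is a simplified user-space model under stated assumptions, not the patch itself: every structure and function name is hypothetical, the delete method is spelled del, locking and reference counting are omitted, and the demo callbacks only count how often they run.

```c
#include <assert.h>
#include <stddef.h>

struct flow_obj;

/* Hypothetical per-object methods, modelled on the virtualized entries. */
struct flow_obj_ops {
	struct flow_obj *(*get)(struct flow_obj *obj); /* validate, take a ref */
	void (*del)(struct flow_obj *obj);             /* drop the cache's ref */
};

struct flow_obj {
	const struct flow_obj_ops *ops;
};

struct flow_entry {
	unsigned int genid;   /* generation in which the entry was filled */
	struct flow_obj *obj; /* cached object, or NULL */
};

/*
 * Return a validated object, or NULL when the caller must run the
 * resolver. A stale entry has its object deleted without consulting get().
 */
static struct flow_obj *cache_lookup(struct flow_entry *e,
				     unsigned int cur_genid)
{
	assert(e != NULL);
	if (e->obj == NULL)
		return NULL;
	if (e->genid != cur_genid) {
		e->obj->ops->del(e->obj);
		e->obj = NULL;
		return NULL;
	}
	return e->obj->ops->get(e->obj);
}

/* Demo callbacks that only record how often they were invoked. */
static int gets_seen, dels_seen;

static struct flow_obj *demo_get(struct flow_obj *obj)
{
	gets_seen++;
	return obj;
}

static void demo_del(struct flow_obj *obj)
{
	(void)obj;
	dels_seen++;
}

static const struct flow_obj_ops demo_ops = { demo_get, demo_del };
```

In this model a lookup with a matching genid goes through get(), while the first lookup after the genid changes calls del() exactly once and then reports a miss, so the resolver always starts from a clean entry.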
* Re: [PATCH 2/4] xfrm: cache bundles instead of policies for outgoing flows
From: Herbert Xu @ 2010-04-06 13:11 UTC
To: Timo Teräs; +Cc: netdev
In-Reply-To: <4BBB2F31.7090806@iki.fi>
On Tue, Apr 06, 2010 at 03:55:13PM +0300, Timo Teräs wrote:
>
> Which also makes me think of another issue. The resolver does
> not get notified if the genid was outdated. So it might end up using
> the old policies from the bundle after xfrm_policy_insert(). I think
> we should explicitly call ops->delete() in flow_cache_lookup if
> the flow genid was outdated. (I remember actually doing this,
> but also removing it when I was hunting the one hlist-related
> corruption bug.)
Right, that makes sense.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH v2] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-06 13:04 UTC
To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.1.00.1004052248390.29212@pokey.mtv.corp.google.com>
Le lundi 05 avril 2010 à 22:56 -0700, Tom Herbert a écrit :
> Version 2:
> - added a u16 filler to pad rps_dev_flow structure
> - define RPS_NO_CPU as 0xffff
> - add inet_rps_save_rxhash helper function to copy skb's rxhash into inet_sk
> - add a "voidflow" which can be used when get_rps_cpu does not return a flow (avoids some conditionals)
> - use raw_smp_processor_id in rps_record_sock_flow; there is no requirement to
>   prevent preemption
> ---
> This patch implements software receive side packet steering (RPS). RPS
> distributes the load of received packet processing across multiple CPUs.
>
> Problem statement: Protocol processing done in the NAPI context for received
> packets is serialized per device queue and becomes a bottleneck under high
> packet load. This substantially limits pps that can be achieved on a single
> queue NIC and provides no scaling with multiple cores.
>
> This solution queues packets early on in the receive path on the backlog queues
> of other CPUs. This allows protocol processing (e.g. IP and TCP) to be
> performed on packets in parallel. For each device (or each receive queue in
> a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
> process packets. A CPU is selected on a per packet basis by hashing contents
> of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
> into the CPU mask. The IPI mechanism is used to raise networking receive
> softirqs between CPUs. This effectively emulates in software what a multi-queue
> NIC can provide, but is generic requiring no device support.
>
> Many devices now provide a hash over the 4-tuple on a per packet basis
> (e.g. the Toeplitz hash). This patch allows drivers to set the HW reported hash
> in an skb field, and that value in turn is used to index into the RPS maps.
> Using the HW generated hash can avoid cache misses on the packet when
> steering it to a remote CPU.
>
> The CPU mask is set on a per device and per queue basis in the sysfs variable
> /sys/class/net/<device>/queues/rx-<n>/rps_cpus. This is a set of canonical
> bit maps for receive queues in the device (numbered by <n>). If a device
> does not support multi-queue, a single variable is used for the device (rx-0).
>
> Generally, we have found this technique increases pps capabilities of a single
> queue device with good CPU utilization. Optimal settings for the CPU mask
> seem to depend on architectures and cache hierarchy. Below are some results
> running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
> Results show cumulative transaction rate and system CPU utilization.
>
> e1000e on 8 core Intel
> Without RPS: 108K tps at 33% CPU
> With RPS: 311K tps at 64% CPU
>
> forcedeth on 16 core AMD
> Without RPS: 156K tps at 15% CPU
> With RPS: 404K tps at 49% CPU
>
> bnx2x on 16 core AMD
> Without RPS 567K tps at 61% CPU (4 HW RX queues)
> Without RPS 738K tps at 96% CPU (8 HW RX queues)
> With RPS: 854K tps at 76% CPU (4 HW RX queues)
>
> Caveats:
> - The benefits of this patch are dependent on architecture and cache hierarchy.
> Tuning the masks to get best performance is probably necessary.
> - This patch adds overhead in the path for processing a single packet. In
> a lightly loaded server this overhead may eliminate the advantages of
> increased parallelism, and possibly cause some relative performance degradation.
> We have found that masks that are cache aware (share same caches with
> the interrupting CPU) mitigate much of this.
> - The RPS masks can be changed dynamically, however whenever the mask is changed
> this introduces the possibility of generating out of order packets. It's
> probably best not to change the masks too frequently.
>
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
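As an aside on the quoted description, the hash-to-CPU step can be sketched in plain C as below. This is a minimal illustration under stated assumptions, not the patch's code: struct rps_map_sketch, rps_pick_cpu() and RPS_NO_CPU_SKETCH are hypothetical names, and the multiply-and-shift scaling is one assumed way to turn a 32-bit hash into an index into the CPU mask.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-queue map built from rps_cpus. */
struct rps_map_sketch {
    size_t len;         /* number of CPUs enabled in the mask */
    uint16_t cpus[8];   /* CPU ids, in mask order */
};

#define RPS_NO_CPU_SKETCH 0xffff /* mirrors RPS_NO_CPU from the changelog */

/*
 * Scale a 32-bit flow hash into [0, len) with a multiply and a shift
 * (no modulo), then return the CPU stored at that index.
 */
static uint16_t rps_pick_cpu(const struct rps_map_sketch *map, uint32_t rxhash)
{
    if (map->len == 0)
        return RPS_NO_CPU_SKETCH;
    return map->cpus[((uint64_t)rxhash * map->len) >> 32];
}
```

Because the index depends only on the hash, all packets of one flow land on the same CPU as long as the mask is unchanged, which is why changing the mask can reorder packets.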
Running on a preprod machine here, seems fine.
Some questions :
1) The need to add "rps_flow_entries=xxx" at boot time is problematic.
Maybe we can allow it being dynamic (and use vmalloc() instead of
alloc_large_system_hash())
2) inet_rps_save_rxhash(sk, skb->rxhash);
It should have a check to make sure some part of the stack doesn't feed
many different rxhash values for a given socket (make sure we don't pollute
the flow table with pseudo-random values)
3) UDP connected sockets don't benefit from RFS currently
(Not sure many apps use connected UDP sockets, but I do have some of them
in house)
I am trying the following code for IPv4 only:
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 7af756d..5c2d37a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1216,6 +1216,7 @@ int udp_disconnect(struct sock *sk, int flags)
sk->sk_state = TCP_CLOSE;
inet->inet_daddr = 0;
inet->inet_dport = 0;
+ inet_rps_save_rxhash(sk, 0);
sk->sk_bound_dev_if = 0;
if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
inet_reset_saddr(sk);
@@ -1257,8 +1258,12 @@ EXPORT_SYMBOL(udp_lib_unhash);
static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
- int rc = sock_queue_rcv_skb(sk, skb);
+ int rc;
+
+ if (inet_sk(sk)->inet_daddr)
+ inet_rps_save_rxhash(sk, skb->rxhash);
+ rc = sock_queue_rcv_skb(sk, skb);
if (rc < 0) {
int is_udplite = IS_UDPLITE(sk);
^ permalink raw reply related
* Re: [PATCH 2/4] xfrm: cache bundles instead of policies for outgoing flows
From: Timo Teräs @ 2010-04-06 12:55 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100406124014.GA24412@gondor.apana.org.au>
Herbert Xu wrote:
> On Mon, Apr 05, 2010 at 10:00:22AM +0300, Timo Teras wrote:
>> @@ -623,33 +618,11 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
>> + hlist_for_each_entry_continue(policy, entry, bydst)
>> + atomic_inc(&policy->genid);
>
> Do we still need this since we're invalidating the whole flow
> cache?
>
> The current code is necessary since otherwise the bundles won't
> get freed. But with your new code, this is essentially doing
> nothing, no?
You are right. I completely missed the flushing there. It was
just systematic conversion of deleting the bundles to incrementing
the genid.
Which also makes me think of another issue. The resolver does
not get notified if the genid was outdated. So it might end up
using the old policies from the bundle after xfrm_policy_insert(). I think
we should explicitly call ops->delete() in flow_cache_lookup if
the flow genid was outdated. (I remember actually doing this,
but also removing it when I was hunting that one hlist-related
corruption bug.)
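To make the suggestion concrete, here is a minimal sketch of a lookup that drops a stale entry before the resolver runs. It is only an illustration: the struct and function names are hypothetical, delete_obj stands in for the ops->delete() hook mentioned above, and the real flow cache has locking and per-CPU state that are left out.

```c
#include <stddef.h>

struct flow_entry_sketch;

/* Hypothetical ops table; delete_obj plays the role of ops->delete(). */
struct flow_entry_ops_sketch {
    void (*delete_obj)(struct flow_entry_sketch *fle);
};

struct flow_entry_sketch {
    unsigned int genid;   /* generation at which the object was cached */
    void *object;         /* cached policy or bundle, NULL if none */
    const struct flow_entry_ops_sketch *ops;
};

/* Example callback that only counts how often an object was dropped. */
static int deleted_count;

static void count_delete(struct flow_entry_sketch *fle)
{
    (void)fle;
    deleted_count++;
}

static const struct flow_entry_ops_sketch counting_ops = { count_delete };

/*
 * Return the cached object if its genid is current. Otherwise release it
 * through the ops hook and return NULL, so the caller falls through to the
 * resolver instead of reusing stale policies.
 */
static void *flow_lookup_sketch(struct flow_entry_sketch *fle,
                                unsigned int current_genid)
{
    if (fle->object && fle->genid != current_genid) {
        if (fle->ops && fle->ops->delete_obj)
            fle->ops->delete_obj(fle);
        fle->object = NULL;
    }
    return fle->object;
}
```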
^ permalink raw reply
* Re: [PATCH 11/11] drivers/uwb: Rename dev_info to wdi
From: David Vrabel @ 2010-04-06 12:42 UTC (permalink / raw)
To: David Miller; +Cc: joe, akpm, netdev, linux-kernel
In-Reply-To: <20100405.145144.207421561.davem@davemloft.net>
David Miller wrote:
> From: Joe Perches <joe@perches.com>
> Date: Mon, 05 Apr 2010 14:44:18 -0700
>
>> On Mon, 2010-04-05 at 12:05 -0700, Joe Perches wrote:
>>> There is a macro called dev_info that prints struct device specific
>>> information. Having variables with the same name can be confusing and
>>> prevents conversion of the macro to a function.
>>>
>>> Rename the existing dev_info variables to something else in preparation
>>> to converting the dev_info macro to a function.
>> http://patchwork.ozlabs.org/patch/49421/
>>
>> This is marked as RFC in patchwork.
>> It's not intended to be.
>
> Because I can't apply the entire set, I'd like someone else
> to take this in since it's not really a networking specific
> patch.
I've taken it.
David
--
David Vrabel, Senior Software Engineer, Drivers
CSR, Churchill House, Cambridge Business Park, Tel: +44 (0)1223 692562
Cowley Road, Cambridge, CB4 0WZ http://www.csr.com/
Member of the CSR plc group of companies. CSR plc registered in England and Wales, registered number 4187346, registered office Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, United Kingdom
^ permalink raw reply
* Re: [PATCH 2/4] xfrm: cache bundles instead of policies for outgoing flows
From: Herbert Xu @ 2010-04-06 12:40 UTC (permalink / raw)
To: Timo Teras; +Cc: netdev
In-Reply-To: <1270450824-2928-3-git-send-email-timo.teras@iki.fi>
On Mon, Apr 05, 2010 at 10:00:22AM +0300, Timo Teras wrote:
>
> @@ -623,33 +618,11 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
> schedule_work(&net->xfrm.policy_hash_work);
>
> read_lock_bh(&xfrm_policy_lock);
> - gc_list = NULL;
> entry = &policy->bydst;
> - hlist_for_each_entry_continue(policy, entry, bydst) {
> - struct dst_entry *dst;
> -
> - write_lock(&policy->lock);
> - dst = policy->bundles;
> - if (dst) {
> - struct dst_entry *tail = dst;
> - while (tail->next)
> - tail = tail->next;
> - tail->next = gc_list;
> - gc_list = dst;
> -
> - policy->bundles = NULL;
> - }
> - write_unlock(&policy->lock);
> - }
> + hlist_for_each_entry_continue(policy, entry, bydst)
> + atomic_inc(&policy->genid);
Do we still need this since we're invalidating the whole flow
cache?
The current code is necessary since otherwise the bundles won't
get freed. But with your new code, this is essentially doing
nothing, no?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-06 12:34 UTC (permalink / raw)
To: Timo Teras; +Cc: netdev
In-Reply-To: <1270486884-10905-1-git-send-email-timo.teras@iki.fi>
On Mon, Apr 05, 2010 at 08:01:24PM +0300, Timo Teras wrote:
> This allows to validate the cached object before returning it.
> It also allows to destruct object properly, if the last reference
> was held in flow cache. This is also a preparation for caching
> bundles in the flow cache.
>
> In return for virtualizing the methods, we save on:
> - not having to regenerate the whole flow cache on policy removal:
> each flow matching a killed policy gets refreshed as the getter
> function notices it smartly.
> - we do not have to call flow_cache_flush from policy gc, since the
> flow cache now properly deletes the object if it had any references
>
> Signed-off-by: Timo Teras <timo.teras@iki.fi>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Thanks a lot for the patch!
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* Re: [PATCH] sky2: rx hash offload
From: Eric Dumazet @ 2010-04-06 12:33 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: David Miller, netdev, Tom Herbert
In-Reply-To: <20100405084800.3bcec66a@nehalam>
Le lundi 05 avril 2010 à 08:48 -0700, Stephen Hemminger a écrit :
> Marvell Yukon 2 hardware supports hardware receive hash calculation.
> Now that Receive Packet Steering is available, add support
> to enable it.
>
> Note: still experimental, tested on only a few variants.
> No performance testing has been done.
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>
> ---
> drivers/net/sky2.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++--
> drivers/net/sky2.h | 23 ++++++++++++++++
> 2 files changed, 96 insertions(+), 2 deletions(-)
I am wondering if introducing a hardware-computed rxhash wouldn't force us
to clear rxhash in several paths (tunneling...), so that we perform a
software recompute after decapsulation, to enable RFS.
Not mandatory but recommended I would say...
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 2f302d3..3f0aba4 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -379,6 +379,7 @@ static int ipip_rcv(struct sk_buff *skb)
skb_dst_drop(skb);
nf_reset(skb);
ipip_ecn_decapsulate(iph, skb);
+ skb->rxhash = 0;
netif_rx(skb);
rcu_read_unlock();
return 0;
^ permalink raw reply related
* [RFC][PATCH] ipmr: Fix struct mfcctl to be independent of MAXVIFS.
From: Eric W. Biederman @ 2010-04-06 12:16 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Eric Dumazet, Patrick McHardy, Ilia K, Tom Goff
Right now if you recompile the kernel to support more VIFS, users of
MRT_ADD_MFC and MRT_DEL_MFC will break because the ABI changes.
Correct this by forcing the number of VIFS handled in mfcctl to 32.
Allow for larger MAXVIFS by placing a second array of ttls at the end
of struct mfcctl.
struct mfcctl is insane. The last 4 fields are dead, and the mfc_ttls
array is only 2 byte aligned, with a 2 byte hole right after it.
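The layout described above (a fixed 32-entry ttl array, optionally followed by extra ttls) can be sketched as a standalone unpacking routine. This is an illustration only: mfcctl_sketch is a hypothetical stand-in that holds just the ttl array, MAXVIFS_SKETCH = 64 is an assumed build-time limit, and the length check is done before the subtraction so an unsigned length cannot wrap.

```c
#include <stddef.h>
#include <string.h>

#define MFCCTL_VIFS_SKETCH 32 /* fixed ABI size, as in the patch */
#define MAXVIFS_SKETCH     64 /* assumed larger kernel build limit */

/* Hypothetical fixed-size part; the real struct mfcctl has more fields. */
struct mfcctl_sketch {
    unsigned char ttls[MFCCTL_VIFS_SKETCH];
};

/*
 * Build the full ttl table from an option buffer of optlen bytes.
 * Entries the caller did not supply default to 255.
 * Returns 0 on success, -1 if the buffer is shorter than the fixed part.
 */
static int mfc_unpack_ttls(const unsigned char *optval, size_t optlen,
                           unsigned char ttls[MAXVIFS_SKETCH])
{
    size_t extra;

    if (optlen < sizeof(struct mfcctl_sketch))
        return -1; /* reject before subtracting: size_t would wrap */

    extra = optlen - sizeof(struct mfcctl_sketch);
    if (extra > MAXVIFS_SKETCH - MFCCTL_VIFS_SKETCH)
        extra = MAXVIFS_SKETCH - MFCCTL_VIFS_SKETCH;

    memcpy(ttls, optval, MFCCTL_VIFS_SKETCH);
    memset(ttls + MFCCTL_VIFS_SKETCH, 255,
           MAXVIFS_SKETCH - MFCCTL_VIFS_SKETCH);
    memcpy(ttls + MFCCTL_VIFS_SKETCH,
           optval + sizeof(struct mfcctl_sketch), extra);
    return 0;
}
```

Old binaries that pass exactly the fixed-size structure keep working, since the extra entries are simply filled with the default.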
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
include/linux/mroute.h | 4 +++-
net/ipv4/ipmr.c | 29 +++++++++++++++++++++++------
2 files changed, 26 insertions(+), 7 deletions(-)
diff --git a/include/linux/mroute.h b/include/linux/mroute.h
index c5f3d53..c5e066c 100644
--- a/include/linux/mroute.h
+++ b/include/linux/mroute.h
@@ -76,15 +76,17 @@ struct vifctl {
* Cache manipulation structures for mrouted and PIMd
*/
+#define MFCCTL_VIFS 32
struct mfcctl {
struct in_addr mfcc_origin; /* Origin of mcast */
struct in_addr mfcc_mcastgrp; /* Group in question */
vifi_t mfcc_parent; /* Where it arrived */
- unsigned char mfcc_ttls[MAXVIFS]; /* Where it is going */
+ unsigned char mfcc_ttls[MFCCTL_VIFS]; /* Where it is going */
unsigned int mfcc_pkt_cnt; /* pkt count for src-grp */
unsigned int mfcc_byte_cnt;
unsigned int mfcc_wrong_if;
int mfcc_expire;
+ unsigned char mfcc_ttls_extra[]; /* The rest of where it is going */
};
/*
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 0b9d03c..2120668 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -797,7 +797,8 @@ static int ipmr_mfc_delete(struct net *net, struct mfcctl *mfc)
return -ENOENT;
}
-static int ipmr_mfc_add(struct net *net, struct mfcctl *mfc, int mrtsock)
+static int ipmr_mfc_add(struct net *net, struct mfcctl *mfc,
+ unsigned char *ttls, int mrtsock)
{
int line;
struct mfc_cache *uc, *c, **cp;
@@ -817,7 +818,7 @@ static int ipmr_mfc_add(struct net *net, struct mfcctl *mfc, int mrtsock)
if (c != NULL) {
write_lock_bh(&mrt_lock);
c->mfc_parent = mfc->mfcc_parent;
- ipmr_update_thresholds(c, mfc->mfcc_ttls);
+ ipmr_update_thresholds(c, ttls);
if (!mrtsock)
c->mfc_flags |= MFC_STATIC;
write_unlock_bh(&mrt_lock);
@@ -834,7 +835,7 @@ static int ipmr_mfc_add(struct net *net, struct mfcctl *mfc, int mrtsock)
c->mfc_origin = mfc->mfcc_origin.s_addr;
c->mfc_mcastgrp = mfc->mfcc_mcastgrp.s_addr;
c->mfc_parent = mfc->mfcc_parent;
- ipmr_update_thresholds(c, mfc->mfcc_ttls);
+ ipmr_update_thresholds(c, ttls);
if (!mrtsock)
c->mfc_flags |= MFC_STATIC;
@@ -954,6 +955,8 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
int ret;
struct vifctl vif;
struct mfcctl mfc;
+ unsigned char ttls[MAXVIFS];
+ unsigned extra_oifs;
struct net *net = sock_net(sk);
if (optname != MRT_INIT) {
@@ -961,7 +964,7 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
return -EACCES;
}
- switch (optname) {
+ switch (optname){
case MRT_INIT:
if (sk->sk_type != SOCK_RAW ||
inet_sk(sk)->inet_num != IPPROTO_IGMP)
@@ -1012,15 +1015,29 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
*/
case MRT_ADD_MFC:
case MRT_DEL_MFC:
- if (optlen != sizeof(mfc))
+ /* How many extra interfaces do we have information for? */
+ extra_oifs = optlen - sizeof(mfc);
+ if (extra_oifs > (MAXVIFS - MFCCTL_VIFS))
+ extra_oifs = MAXVIFS - MFCCTL_VIFS;
+
+ if (optlen < sizeof(mfc))
return -EINVAL;
if (copy_from_user(&mfc, optval, sizeof(mfc)))
return -EFAULT;
+
+ memcpy(ttls, mfc.mfcc_ttls, sizeof(mfc.mfcc_ttls));
+ memset(ttls + MFCCTL_VIFS, 255, MAXVIFS - MFCCTL_VIFS);
+ if (copy_from_user(ttls + MFCCTL_VIFS,optval + sizeof(mfc),
+ extra_oifs))
+ return -EFAULT;
+
+
rtnl_lock();
if (optname == MRT_DEL_MFC)
ret = ipmr_mfc_delete(net, &mfc);
else
- ret = ipmr_mfc_add(net, &mfc, sk == net->ipv4.mroute_sk);
+ ret = ipmr_mfc_add(net, &mfc, ttls,
+ sk == net->ipv4.mroute_sk);
rtnl_unlock();
return ret;
/*
--
1.6.5.2.143.g8cc62
^ permalink raw reply related
* Re: patch to improve x.25 throughput negotiation
From: andrew hendry @ 2010-04-06 12:09 UTC (permalink / raw)
To: John Hughes; +Cc: netdev
In-Reply-To: <4BB8C2CA.6040102@Calva.COM>
I have reproduced it in a few ways.
1. Set X25_MASK_THROUGHPUT in the x25_subscrip_struct, then call
SIOCX25SSUBSCRIP, then call SIOCX25FACILITIES without setting the
throughput field. Call connect.
2. No subscrip setting, call SIOCX25FACILITIES without setting the
throughput field. Call connect.
3. No subscrip, no facilities ioctl, call connect.
The patch removes the bad facility and makes the router accept the
call for the above cases.
I don't currently have a setup to test both direction throughput negotiation.
Tested-by: Andrew Hendry <andrew.hendry@gmail.com>
On Mon, Apr 5, 2010 at 2:48 AM, John Hughes <john@calva.com> wrote:
> The current X.25 code has some bugs in throughput negotiation:
>
> 1. It does negotiation in all cases, usually there is no need
> 2. It incorrectly attempts to negotiate the throughput class in one
> direction only. There are separate throughput classes for input
> and output and if either is negotiated both must be negotiated.
>
> This is bug https://bugzilla.kernel.org/show_bug.cgi?id=15681
>
> This bug was first reported by Daniel Ferenci to the linux-x25 mailing list
> on 6/8/2004, but is still present.
>
>
^ permalink raw reply
* [GIT] Networking
From: David Miller @ 2010-04-06 11:33 UTC (permalink / raw)
To: torvalds; +Cc: akpm, netdev, linux-kernel
The tcp splice oops is pretty nasty... anyways.
1) Fixup rcu_deref calls done outside RCU read lock in netlabel,
from Paul Moore.
2) gianfar fixes (memory leak on close, message alignment) from
Andy Fleming and Kim Phillips.
3) MAC address probing fix in smc91c92_cs from Ken Kawasaki.
4) Some small wireless fixes via John Linville and co. including
a few device ID additions.
a) iwlwifi bool conversion to flags broke regulatory handling
b) iwlwifi tfd counting on 4965 chips fix
c) mac80211's reg_regdb_search_lock needs to be a mutex
d) off-by-one test fix in wireless mesh metric handling
5) New cxgb4 driver.
6) TCP doesn't maintain queue consumed state properly across socket
lock dropping (and thus backlog processing) during splice so this
confuses tcp_collapse() and we crash. Fix from Steven J. Magnani
7) bond_uninit() deadlock fix from Amerigo Wang.
8) be2net fixes (redboot flashing, big endian flashing and VLAN rx
issues) from Ajit Khaparde.
9) stmmac needs crc32, from Carmelo AMOROSO
10) round-robin bonding does htons() on a u8 :-) Fix from Eric
Dumazet.
11) Missing lock release in sgiseeq driver, from Julia Lawall
12) Need to validate socket address length before derefing in
socket ->connect() handlers. From Changli Gao.
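For readers wondering why item 10 matters, here is a small sketch of what goes wrong when htons() is applied to a value that is compared with an 8-bit field. It assumes the comparison was against the IP header's one-byte protocol field; the function names and IPPROTO_IGMP_SKETCH are hypothetical and this is not the bonding code itself.

```c
#include <stdint.h>
#include <arpa/inet.h> /* htons() */

#define IPPROTO_IGMP_SKETCH 2 /* IGMP protocol number */

/*
 * Buggy form: on a little-endian host htons(2) is 0x0200, which an
 * 8-bit field can never equal, so the test is always false there.
 */
static int is_igmp_buggy(uint8_t protocol)
{
    return protocol == htons(IPPROTO_IGMP_SKETCH);
}

/* Fixed form: a single byte has no byte order, compare it directly. */
static int is_igmp_fixed(uint8_t protocol)
{
    return protocol == IPPROTO_IGMP_SKETCH;
}
```

On a big-endian host htons() is a no-op, so the buggy form happens to work there, which is how this kind of mistake can survive testing.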
Please pull, thanks a lot!
The following changes since commit db217dece3003df0841bacf9556b5c06aa097dae:
Linus Torvalds (1):
Merge git://git.kernel.org/.../davem/sparc-2.6
are available in the git repository at:
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.git master
Ajit Khaparde (3):
be2net: fix a bug in flashing the redboot section
be2net: fix flashing on big endian architectures
be2net: fix bug in vlan rx path for big endian architecture
Amerigo Wang (1):
bonding: fix potential deadlock in bond_uninit()
Andy Fleming (1):
gianfar: Fix a memory leak in gianfar close code
Ben Konrath (1):
ar9170: add support for NEC WL300NU-G USB dongle
Benjamin Larsson (1):
Add a pci-id to the mwl8k driver
Carmelo AMOROSO (1):
stmmac: fix kconfig for crc32 build error
Changli Gao (1):
net: check the length of the socket address passed to connect(2)
Dan Carpenter (1):
iwlwifi: range checking issue
Daniel Mack (1):
net/wireless/libertas: do not call wiphy_unregister() w/o wiphy_register()
David S. Miller (1):
Merge branch 'master' of git://git.kernel.org/.../linville/wireless-2.6
Dimitris Michailidis (6):
cxgb4: Add register, message, and FW definitions
cxgb4: Add HW and FW support code
cxgb4: Add packet queues and packet DMA code
cxgb4: Add remaining driver headers and L2T management
cxgb4: Add main driver file and driver Makefile
net: Hook up cxgb4 to Kconfig and Makefile
Eric Dumazet (1):
bonding: bond_xmit_roundrobin() fix
Gertjan van Wingerde (2):
rt2x00: Fix typo in RF register programming of rt2800.
rt2x00: Disable powersaving by default in rt2500usb.
Giuseppe CAVALLARO (1):
stmmac: add documentation for the driver.
Hans de Goede (1):
Add USB ID for Thomson SpeedTouch 120g to p54usb id table
Johannes Berg (1):
mac80211: move netdev queue enabling to correct spot
John W. Linville (2):
wireless: convert reg_regdb_search_lock to mutex
mac80211: correct typos in "unavailable upon resume" warning
Julia Lawall (1):
drivers/net: Add missing unlock
Ken Kawasaki (1):
smc91c92_cs: fix the problem of "Unable to find hardware address"
Kim Phillips (2):
net: gianfar - initialize per-queue statistics
net: gianfar - align BD ring size console messages
Neil Horman (1):
r8169: clean up my printk uglyness
Paul Moore (1):
netlabel: Fix several rcu_dereference() calls used without RCU read locks
Porsch, Marco (1):
mac80211: fix PREQ processing and one small bug
Reinette Chatre (1):
iwlwifi: fix regulatory
Shanyu Zhao (1):
iwlwifi: clear unattended interrupts in tasklet
Steven J. Magnani (1):
net: Fix oops from tcp_collapse() when using splice()
Valentin Longchamp (1):
setup correct int pipe type in ar9170_usb_exec_cmd
Wey-Yi Guy (1):
iwlwifi: counting number of tfds can be free for 4965
Documentation/networking/stmmac.txt | 143 ++
drivers/net/Kconfig | 25 +
drivers/net/Makefile | 1 +
drivers/net/benet/be_cmds.c | 4 +-
drivers/net/benet/be_main.c | 21 +-
drivers/net/bonding/bond_main.c | 28 +-
drivers/net/cxgb4/Makefile | 7 +
drivers/net/cxgb4/cxgb4.h | 741 ++++++
drivers/net/cxgb4/cxgb4_main.c | 3388 +++++++++++++++++++++++++++
drivers/net/cxgb4/cxgb4_uld.h | 239 ++
drivers/net/cxgb4/l2t.c | 624 +++++
drivers/net/cxgb4/l2t.h | 110 +
drivers/net/cxgb4/sge.c | 2431 +++++++++++++++++++
drivers/net/cxgb4/t4_hw.c | 3131 +++++++++++++++++++++++++
drivers/net/cxgb4/t4_hw.h | 100 +
drivers/net/cxgb4/t4_msg.h | 664 ++++++
drivers/net/cxgb4/t4_regs.h | 878 +++++++
drivers/net/cxgb4/t4fw_api.h | 1580 +++++++++++++
drivers/net/gianfar.c | 12 +-
drivers/net/pcmcia/smc91c92_cs.c | 12 +-
drivers/net/r8169.c | 4 +-
drivers/net/sgiseeq.c | 4 +-
drivers/net/stmmac/Kconfig | 1 +
drivers/net/wireless/ath/ar9170/usb.c | 4 +-
drivers/net/wireless/iwlwifi/iwl-4965.c | 6 +-
drivers/net/wireless/iwlwifi/iwl-agn.c | 12 +-
drivers/net/wireless/iwlwifi/iwl3945-base.c | 4 +-
drivers/net/wireless/libertas/cfg.c | 8 +-
drivers/net/wireless/libertas/dev.h | 1 +
drivers/net/wireless/mwl8k.c | 1 +
drivers/net/wireless/p54/p54usb.c | 1 +
drivers/net/wireless/rt2x00/rt2500usb.c | 5 +
drivers/net/wireless/rt2x00/rt2800lib.c | 4 +-
net/bluetooth/l2cap.c | 3 +-
net/bluetooth/rfcomm/sock.c | 3 +-
net/bluetooth/sco.c | 3 +-
net/can/bcm.c | 3 +
net/ieee802154/af_ieee802154.c | 3 +
net/ipv4/af_inet.c | 5 +
net/ipv4/tcp.c | 1 +
net/mac80211/mesh_hwmp.c | 4 +-
net/mac80211/tx.c | 6 +
net/mac80211/util.c | 18 +-
net/netlabel/netlabel_domainhash.c | 28 +-
net/netlabel/netlabel_unlabeled.c | 66 +-
net/netlink/af_netlink.c | 3 +
net/wireless/reg.c | 12 +-
47 files changed, 14221 insertions(+), 131 deletions(-)
create mode 100644 Documentation/networking/stmmac.txt
create mode 100644 drivers/net/cxgb4/Makefile
create mode 100644 drivers/net/cxgb4/cxgb4.h
create mode 100644 drivers/net/cxgb4/cxgb4_main.c
create mode 100644 drivers/net/cxgb4/cxgb4_uld.h
create mode 100644 drivers/net/cxgb4/l2t.c
create mode 100644 drivers/net/cxgb4/l2t.h
create mode 100644 drivers/net/cxgb4/sge.c
create mode 100644 drivers/net/cxgb4/t4_hw.c
create mode 100644 drivers/net/cxgb4/t4_hw.h
create mode 100644 drivers/net/cxgb4/t4_msg.h
create mode 100644 drivers/net/cxgb4/t4_regs.h
create mode 100644 drivers/net/cxgb4/t4fw_api.h
^ permalink raw reply
* Re: NET: sb1250: Fix compile warning in driver
From: David Miller @ 2010-04-06 11:03 UTC (permalink / raw)
To: ralf; +Cc: netdev
In-Reply-To: <20100406093320.GA31967@linux-mips.org>
From: Ralf Baechle <ralf@linux-mips.org>
Date: Tue, 6 Apr 2010 10:33:20 +0100
> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Applied to net-next-2.6, thanks Ralf!
^ permalink raw reply
* Re: [PATCH net-next 00/12] tg3: Bugfix, msg fixups, and checkpatch cleanups
From: David Miller @ 2010-04-06 10:59 UTC (permalink / raw)
To: mcarlson; +Cc: netdev, andy
In-Reply-To: <1270498770-23765-1-git-send-email-mcarlson@broadcom.com>
From: "Matt Carlson" <mcarlson@broadcom.com>
Date: Mon, 5 Apr 2010 13:19:18 -0700
> This patchset fixes a minor APD bug, elaborates on the recent messaging
> improvements, and implements some checkpatch cleanups.
These all look fine, applied to net-next-2.6, thanks!
^ permalink raw reply
* NET: sb1250: Fix compile warning in driver
From: Ralf Baechle @ 2010-04-06 9:33 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
drivers/net/sb1250-mac.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)
diff --git a/drivers/net/sb1250-mac.c b/drivers/net/sb1250-mac.c
index 9944e5d..142261b 100644
--- a/drivers/net/sb1250-mac.c
+++ b/drivers/net/sb1250-mac.c
@@ -2664,7 +2664,6 @@ static int sbmac_close(struct net_device *dev)
static int sbmac_poll(struct napi_struct *napi, int budget)
{
struct sbmac_softc *sc = container_of(napi, struct sbmac_softc, napi);
- struct net_device *dev = sc->sbm_dev;
int work_done;
work_done = sbdma_rx_process(sc, &(sc->sbm_rxdma), budget, 1);
^ permalink raw reply related
* Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
From: Michael S. Tsirkin @ 2010-04-06 7:51 UTC (permalink / raw)
To: Xin, Xiaohui
Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, mingo@elte.hu, jdike@addtoit.com
In-Reply-To: <97F6D3BD476C464182C1B7BABF0B0AF5C17B5BC1@shzsmsx502.ccr.corp.intel.com>
On Tue, Apr 06, 2010 at 01:46:56PM +0800, Xin, Xiaohui wrote:
> Michael,
> > >>> For the write logging, do you have a function in hand that we can
> > >>> recompute the log? If that, I think I can use it to recompute the
> > >>>log info when the logging is suddenly enabled.
> > >>> For the outstanding requests, do you mean all the user buffers have
> > >>>submitted before the logging ioctl changed? That may be a lot, and
> > >> >some of them are still in NIC ring descriptors. Waiting them to be
> > >>>finished may be need some time. I think when logging ioctl changed,
> > >> >then the logging is changed just after that is also reasonable.
>
> > >>The key point is that after the logging ioctl returns, any
> > >>subsequent change to memory must be logged. It does not
> > >>matter when was the request submitted, otherwise we will
> > >>get memory corruption on migration.
>
> > >The change to memory happens when vhost_add_used_and_signal(), right?
> > >So after ioctl returns, just recompute the log info to the events in the async queue,
> > >is ok. Since the ioctl and write log operations are all protected by vq->mutex.
>
> >> Thanks
> >> Xiaohui
>
> >Yes, I think this will work.
>
> Thanks, so do you have the function to recompute the log info at hand that I can
> use? I vaguely remember that you mentioned it some time ago.
Doesn't just rerunning vhost_get_vq_desc work?
> > > Thanks
> > > Xiaohui
> > >
> > > drivers/vhost/net.c | 189 +++++++++++++++++++++++++++++++++++++++++++++++--
> > > drivers/vhost/vhost.h | 10 +++
> > > 2 files changed, 192 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index 22d5fef..2aafd90 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -17,11 +17,13 @@
> > > #include <linux/workqueue.h>
> > > #include <linux/rcupdate.h>
> > > #include <linux/file.h>
> > > +#include <linux/aio.h>
> > >
> > > #include <linux/net.h>
> > > #include <linux/if_packet.h>
> > > #include <linux/if_arp.h>
> > > #include <linux/if_tun.h>
> > > +#include <linux/mpassthru.h>
> > >
> > > #include <net/sock.h>
> > >
> > > @@ -47,6 +49,7 @@ struct vhost_net {
> > > struct vhost_dev dev;
> > > struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
> > > struct vhost_poll poll[VHOST_NET_VQ_MAX];
> > > + struct kmem_cache *cache;
> > > /* Tells us whether we are polling a socket for TX.
> > > * We only do this when socket buffer fills up.
> > > * Protected by tx vq lock. */
> > > @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
> > > net->tx_poll_state = VHOST_NET_POLL_STARTED;
> > > }
> > >
> > > +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> > > +{
> > > + struct kiocb *iocb = NULL;
> > > + unsigned long flags;
> > > +
> > > + spin_lock_irqsave(&vq->notify_lock, flags);
> > > + if (!list_empty(&vq->notifier)) {
> > > + iocb = list_first_entry(&vq->notifier,
> > > + struct kiocb, ki_list);
> > > + list_del(&iocb->ki_list);
> > > + }
> > > + spin_unlock_irqrestore(&vq->notify_lock, flags);
> > > + return iocb;
> > > +}
> > > +
> > > +static void handle_async_rx_events_notify(struct vhost_net *net,
> > > + struct vhost_virtqueue *vq)
> > > +{
> > > + struct kiocb *iocb = NULL;
> > > + struct vhost_log *vq_log = NULL;
> > > + int rx_total_len = 0;
> > > + int log, size;
> > > +
> > > + if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > > + return;
> > > +
> > > + if (vq->receiver)
> > > + vq->receiver(vq);
> > > +
> > > + vq_log = unlikely(vhost_has_feature(
> > > + &net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> > > + while ((iocb = notify_dequeue(vq)) != NULL) {
> > > + vhost_add_used_and_signal(&net->dev, vq,
> > > + iocb->ki_pos, iocb->ki_nbytes);
> > > + log = (int)iocb->ki_user_data;
> > > + size = iocb->ki_nbytes;
> > > + rx_total_len += iocb->ki_nbytes;
> > > +
> > > + if (iocb->ki_dtor)
> > > + iocb->ki_dtor(iocb);
> > > + kmem_cache_free(net->cache, iocb);
> > > +
> > > + if (unlikely(vq_log))
> > > + vhost_log_write(vq, vq_log, log, size);
> > > + if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> > > + vhost_poll_queue(&vq->poll);
> > > + break;
> > > + }
> > > + }
> > > +}
> > > +
> > > +static void handle_async_tx_events_notify(struct vhost_net *net,
> > > + struct vhost_virtqueue *vq)
> > > +{
> > > + struct kiocb *iocb = NULL;
> > > + int tx_total_len = 0;
> > > +
> > > + if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > > + return;
> > > +
> > > + while ((iocb = notify_dequeue(vq)) != NULL) {
> > > + vhost_add_used_and_signal(&net->dev, vq,
> > > + iocb->ki_pos, 0);
> > > + tx_total_len += iocb->ki_nbytes;
> > > +
> > > + if (iocb->ki_dtor)
> > > + iocb->ki_dtor(iocb);
> > > +
> > > + kmem_cache_free(net->cache, iocb);
> > > + if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> > > + vhost_poll_queue(&vq->poll);
> > > + break;
> > > + }
> > > + }
> > > +}
> > > +
> > > /* Expects to be always run from workqueue - which acts as
> > > * read-size critical section for our kind of RCU. */
> > > static void handle_tx(struct vhost_net *net)
> > > {
> > > struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> > > + struct kiocb *iocb = NULL;
> > > unsigned head, out, in, s;
> > > struct msghdr msg = {
> > > .msg_name = NULL,
> > > @@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
> > > tx_poll_stop(net);
> > > hdr_size = vq->hdr_size;
> > >
> > > + handle_async_tx_events_notify(net, vq);
> > > +
> > > for (;;) {
> > > head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > > ARRAY_SIZE(vq->iov),
> > > @@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
> > > /* Skip header. TODO: support TSO. */
> > > s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
> > > msg.msg_iovlen = out;
> > > +
> > > + if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > + iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > > + if (!iocb)
> > > + break;
> > > + iocb->ki_pos = head;
> > > + iocb->private = (void *)vq;
> > > + }
> > > +
> > > len = iov_length(vq->iov, out);
> > > /* Sanity check */
> > > if (!len) {
> > > @@ -160,12 +251,16 @@ static void handle_tx(struct vhost_net *net)
> > > break;
> > > }
> > > /* TODO: Check specific error and bomb out unless ENOBUFS? */
> > > - err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > > + err = sock->ops->sendmsg(iocb, sock, &msg, len);
> > > if (unlikely(err < 0)) {
> > > vhost_discard_vq_desc(vq);
> > > tx_poll_start(net, sock);
> > > break;
> > > }
> > > +
> > > + if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > > + continue;
> > > +
> > > if (err != len)
> > > pr_err("Truncated TX packet: "
> > > " len %d != %zd\n", err, len);
> > > @@ -177,6 +272,8 @@ static void handle_tx(struct vhost_net *net)
> > > }
> > > }
> > >
> > > + handle_async_tx_events_notify(net, vq);
> > > +
> > > mutex_unlock(&vq->mutex);
> > > unuse_mm(net->dev.mm);
> > > }
> > > @@ -186,6 +283,7 @@ static void handle_tx(struct vhost_net *net)
> > > static void handle_rx(struct vhost_net *net)
> > > {
> > > struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> > > + struct kiocb *iocb = NULL;
> > > unsigned head, out, in, log, s;
> > > struct vhost_log *vq_log;
> > > struct msghdr msg = {
> > > @@ -206,7 +304,8 @@ static void handle_rx(struct vhost_net *net)
> > > int err;
> > > size_t hdr_size;
> > > struct socket *sock = rcu_dereference(vq->private_data);
> > > - if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> > > + if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> > > + vq->link_state == VHOST_VQ_LINK_SYNC))
> > > return;
> > >
> > > use_mm(net->dev.mm);
> > > @@ -214,9 +313,18 @@ static void handle_rx(struct vhost_net *net)
> > > vhost_disable_notify(vq);
> > > hdr_size = vq->hdr_size;
> > >
> > > - vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> > > + /* In async cases, for write logging, the simple way is to get
> > > + * the log info always, and really logging is decided later.
> > > + * Thus, when logging enabled, we can get log, and when logging
> > > + * disabled, we can get log disabled accordingly.
> > > + */
> > > +
> > > + vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
> > > + (vq->link_state == VHOST_VQ_LINK_ASYNC) ?
> > > vq->log : NULL;
> > >
> > > + handle_async_rx_events_notify(net, vq);
> > > +
> > > for (;;) {
> > > head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > > ARRAY_SIZE(vq->iov),
> > > @@ -245,6 +353,14 @@ static void handle_rx(struct vhost_net *net)
> > > s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
> > > msg.msg_iovlen = in;
> > > len = iov_length(vq->iov, in);
> > > + if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > + iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > > + if (!iocb)
> > > + break;
> > > + iocb->private = vq;
> > > + iocb->ki_pos = head;
> > > + iocb->ki_user_data = log;
> > > + }
> > > /* Sanity check */
> > > if (!len) {
> > > vq_err(vq, "Unexpected header len for RX: "
> > > @@ -252,13 +368,18 @@ static void handle_rx(struct vhost_net *net)
> > > iov_length(vq->hdr, s), hdr_size);
> > > break;
> > > }
> > > - err = sock->ops->recvmsg(NULL, sock, &msg,
> > > +
> > > + err = sock->ops->recvmsg(iocb, sock, &msg,
> > > len, MSG_DONTWAIT | MSG_TRUNC);
> > > /* TODO: Check specific error and bomb out unless EAGAIN? */
> > > if (err < 0) {
> > > vhost_discard_vq_desc(vq);
> > > break;
> > > }
> > > +
> > > + if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > > + continue;
> > > +
> > > /* TODO: Should check and handle checksum. */
> > > if (err > len) {
> > > pr_err("Discarded truncated rx packet: "
> > > @@ -284,10 +405,13 @@ static void handle_rx(struct vhost_net *net)
> > > }
> > > }
> > >
> > > + handle_async_rx_events_notify(net, vq);
> > > +
> > > mutex_unlock(&vq->mutex);
> > > unuse_mm(net->dev.mm);
> > > }
> > >
> > > +
> > > static void handle_tx_kick(struct work_struct *work)
> > > {
> > > struct vhost_virtqueue *vq;
> > > @@ -338,6 +462,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> > > vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
> > > vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
> > > n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> > > + n->cache = NULL;
> > > return 0;
> > > }
> > >
> > > @@ -398,6 +523,17 @@ static void vhost_net_flush(struct vhost_net *n)
> > > vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
> > > }
> > >
> > > +static void vhost_notifier_cleanup(struct vhost_net *n)
> > > +{
> > > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
> > > + struct kiocb *iocb = NULL;
> > > + if (n->cache) {
> > > + while ((iocb = notify_dequeue(vq)) != NULL)
> > > + kmem_cache_free(n->cache, iocb);
> > > + kmem_cache_destroy(n->cache);
> > > + }
> > > +}
> > > +
> > > static int vhost_net_release(struct inode *inode, struct file *f)
> > > {
> > > struct vhost_net *n = f->private_data;
> > > @@ -414,6 +550,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
> > > /* We do an extra flush before freeing memory,
> > > * since jobs can re-queue themselves. */
> > > vhost_net_flush(n);
> > > + vhost_notifier_cleanup(n);
> > > kfree(n);
> > > return 0;
> > > }
> > > @@ -462,7 +599,19 @@ static struct socket *get_tun_socket(int fd)
> > > return sock;
> > > }
> > >
> > > -static struct socket *get_socket(int fd)
> > > +static struct socket *get_mp_socket(int fd)
> > > +{
> > > + struct file *file = fget(fd);
> > > + struct socket *sock;
> > > + if (!file)
> > > + return ERR_PTR(-EBADF);
> > > + sock = mp_get_socket(file);
> > > + if (IS_ERR(sock))
> > > + fput(file);
> > > + return sock;
> > > +}
> > > +
> > > +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
> > > {
> > > struct socket *sock;
> > > if (fd == -1)
> > > @@ -473,9 +622,31 @@ static struct socket *get_socket(int fd)
> > > sock = get_tun_socket(fd);
> > > if (!IS_ERR(sock))
> > > return sock;
> > > + sock = get_mp_socket(fd);
> > > + if (!IS_ERR(sock)) {
> > > + vq->link_state = VHOST_VQ_LINK_ASYNC;
> > > + return sock;
> > > + }
> > > return ERR_PTR(-ENOTSOCK);
> > > }
> > >
> > > +static void vhost_init_link_state(struct vhost_net *n, int index)
> > > +{
> > > + struct vhost_virtqueue *vq = n->vqs + index;
> > > +
> > > + WARN_ON(!mutex_is_locked(&vq->mutex));
> > > + if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > + vq->receiver = NULL;
> > > + INIT_LIST_HEAD(&vq->notifier);
> > > + spin_lock_init(&vq->notify_lock);
> > > + if (!n->cache) {
> > > + n->cache = kmem_cache_create("vhost_kiocb",
> > > + sizeof(struct kiocb), 0,
> > > + SLAB_HWCACHE_ALIGN, NULL);
> > > + }
> > > + }
> > > +}
> > > +
> > > static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > > {
> > > struct socket *sock, *oldsock;
> > > @@ -493,12 +664,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > > }
> > > vq = n->vqs + index;
> > > mutex_lock(&vq->mutex);
> > > - sock = get_socket(fd);
> > > + vq->link_state = VHOST_VQ_LINK_SYNC;
> > > + sock = get_socket(vq, fd);
> > > if (IS_ERR(sock)) {
> > > r = PTR_ERR(sock);
> > > goto err;
> > > }
> > >
> > > + vhost_init_link_state(n, index);
> > > +
> > > /* start polling new socket */
> > > oldsock = vq->private_data;
> > > if (sock == oldsock)
> > > @@ -507,8 +681,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > > vhost_net_disable_vq(n, vq);
> > > rcu_assign_pointer(vq->private_data, sock);
> > > vhost_net_enable_vq(n, vq);
> > > - mutex_unlock(&vq->mutex);
> > > done:
> > > + mutex_unlock(&vq->mutex);
> > > mutex_unlock(&n->dev.mutex);
> > > if (oldsock) {
> > > vhost_net_flush_vq(n, index);
> > > @@ -516,6 +690,7 @@ done:
> > > }
> > > return r;
> > > err:
> > > + mutex_unlock(&vq->mutex);
> > > mutex_unlock(&n->dev.mutex);
> > > return r;
> > > }
> > > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > > index d1f0453..cffe39a 100644
> > > --- a/drivers/vhost/vhost.h
> > > +++ b/drivers/vhost/vhost.h
> > > @@ -43,6 +43,11 @@ struct vhost_log {
> > > u64 len;
> > > };
> > >
> > > +enum vhost_vq_link_state {
> > > + VHOST_VQ_LINK_SYNC = 0,
> > > + VHOST_VQ_LINK_ASYNC = 1,
> > > +};
> > > +
> > > /* The virtqueue structure describes a queue attached to a device. */
> > > struct vhost_virtqueue {
> > > struct vhost_dev *dev;
> > > @@ -96,6 +101,11 @@ struct vhost_virtqueue {
> > > /* Log write descriptors */
> > > void __user *log_base;
> > > struct vhost_log log[VHOST_NET_MAX_SG];
> > > + /*Differiate async socket for 0-copy from normal*/
> > > + enum vhost_vq_link_state link_state;
> > > + struct list_head notifier;
> > > + spinlock_t notify_lock;
> > > + void (*receiver)(struct vhost_virtqueue *);
> > > };
> > >
> > > struct vhost_dev {
> > > --
> > > 1.5.4.4
* Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-06 7:49 UTC (permalink / raw)
To: Xin, Xiaohui
Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, mingo@elte.hu,
jdike@c2.user-mode-linux.org, yzhao81@gmail.com
In-Reply-To: <97F6D3BD476C464182C1B7BABF0B0AF5C17B5BB9@shzsmsx502.ccr.corp.intel.com>
On Tue, Apr 06, 2010 at 01:41:37PM +0800, Xin, Xiaohui wrote:
> Michael,
> >>
> >>For the DoS issue, I'm not sure what limit on the pages get_user_pages()
> >> can pin is reasonable; should we compute it from the bandwidth?
>
> >There's a ulimit for locked memory. Can we use this, decreasing
> >the value for rlimit array? We can do this when backend is
> >enabled and re-increment when backend is disabled.
>
> I have tried it with rlim[RLIMIT_MEMLOCK].rlim_cur, but I found
> the initial value for it is 0x10000; after a right shift by PAGE_SHIFT,
> that is only 16 pages we can lock, which seems too small, since the
> guest virtio-net driver may submit a lot of requests at one time.
>
>
> Thanks
> Xiaohui
Yes, that's the default, but the system administrator can always increase
this value with ulimit if necessary.
--
MST
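The numbers in this exchange can be checked from user space. The sketch below is not from the thread; it is a minimal illustration, assuming a POSIX system that provides getrlimit(RLIMIT_MEMLOCK) and sysconf(_SC_PAGESIZE). The helper names limit_to_pages() and memlock_limit_pages() are made up for this example. With the 0x10000-byte default mentioned above and 4 KiB pages, the limit works out to 16 pages.

```c
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: whole pages that fit under a byte limit.
 * Equivalent to limit >> PAGE_SHIFT when page_size is a power of two. */
static unsigned long long limit_to_pages(unsigned long long limit_bytes,
                                         unsigned long long page_size)
{
    if (page_size == 0)
        return 0;
    return limit_bytes / page_size;
}

/* Hypothetical helper: current soft RLIMIT_MEMLOCK expressed in pages.
 * Returns -1 on error and -2 if the limit is unlimited. */
static long long memlock_limit_pages(void)
{
    struct rlimit rl;
    long page_size = sysconf(_SC_PAGESIZE);

    if (page_size <= 0 || getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return -2;
    return (long long)limit_to_pages(rl.rlim_cur,
                                     (unsigned long long)page_size);
}
```

Calling limit_to_pages(0x10000, 4096) gives 16, matching the figure Xiaohui reports. An administrator can raise the soft limit, up to the hard limit, with "ulimit -l", which takes its value in KiB.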
* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: Vladislav Zolotarov @ 2010-04-06 7:39 UTC (permalink / raw)
To: FUJITA Tomonori
Cc: davem@davemloft.net, netdev@vger.kernel.org, Eilon Greenstein
In-Reply-To: <20100404205028H.fujita.tomonori@lab.ntt.co.jp>
Thanks, Fujita.
The patch looks fine. I'll run some regression tests on the patched driver to check that things still work and if it's ok we will ack it shortly.
vlad
> -----Original Message-----
> From: netdev-owner@vger.kernel.org
> [mailto:netdev-owner@vger.kernel.org] On Behalf Of FUJITA Tomonori
> Sent: Sunday, April 04, 2010 2:51 PM
> To: Vladislav Zolotarov
> Cc: fujita.tomonori@lab.ntt.co.jp; davem@davemloft.net;
> netdev@vger.kernel.org; Eilon Greenstein
> Subject: RE: [PATCH] bnx2x: use the dma state API instead of
> the pci equivalents
>
> On Sun, 4 Apr 2010 03:24:46 -0700
> "Vladislav Zolotarov" <vladz@broadcom.com> wrote:
>
> > Ok. Got it now. Thanks, Fujita. I think we should patch the bnx2x to
> > use the generic model (not just the mapping macros).
>
> I've attached the patch.
>
> There is one functional change: pci_alloc_consistent ->
> dma_alloc_coherent
>
> pci_alloc_consistent is a wrapper around dma_alloc_coherent that passes
> the GFP_ATOMIC flag (see include/asm-generic/pci-dma-compat.h).
>
> pci_alloc_consistent uses the GFP_ATOMIC flag for compatibility with
> some broken drivers that call the function in interrupt context. But
> GFP_ATOMIC should be avoided if possible. It looks like bnx2x doesn't
> call pci_alloc_consistent in interrupt context, so I replaced the calls
> with dma_alloc_coherent using GFP_KERNEL.
>
> Please check if that change works for bnx2x.
>
> > One last question: since which kernel version may the generic DMA
> > layer be used instead of the PCI DMA layer?
>
> After 2.6.34-rc2.
>
> Well, on the majority of architectures, you have long been able to use
> the generic DMA API instead of the PCI DMA API. The PCI DMA API is just
> a wrapper around the generic DMA API. But on some architectures, the two
> APIs worked slightly differently. Since 2.6.34-rc2, the two APIs work in
> exactly the same way on all architectures.
>
>
> =
> From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> Subject: [PATCH] bnx2x: use the DMA API instead of the pci equivalents
>
> The DMA API is preferred.
>
> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> ---
> drivers/net/bnx2x.h | 4 +-
> drivers/net/bnx2x_main.c | 110 +++++++++++++++++++++++----------------------
> 2 files changed, 58 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/net/bnx2x.h b/drivers/net/bnx2x.h
> index 3c48a7a..ae9c89e 100644
> --- a/drivers/net/bnx2x.h
> +++ b/drivers/net/bnx2x.h
> @@ -163,7 +163,7 @@ do { \
>
> struct sw_rx_bd {
> struct sk_buff *skb;
> - DECLARE_PCI_UNMAP_ADDR(mapping)
> + DEFINE_DMA_UNMAP_ADDR(mapping);
> };
>
> struct sw_tx_bd {
> @@ -176,7 +176,7 @@ struct sw_tx_bd {
>
> struct sw_rx_page {
> struct page *page;
> - DECLARE_PCI_UNMAP_ADDR(mapping)
> + DEFINE_DMA_UNMAP_ADDR(mapping);
> };
>
> union db_prod {
> diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
> index fa9275c..63a17d6 100644
> --- a/drivers/net/bnx2x_main.c
> +++ b/drivers/net/bnx2x_main.c
> @@ -842,7 +842,7 @@ static u16 bnx2x_free_tx_pkt(struct bnx2x *bp, struct bnx2x_fastpath *fp,
> /* unmap first bd */
> DP(BNX2X_MSG_OFF, "free bd_idx %d\n", bd_idx);
> tx_start_bd = &fp->tx_desc_ring[bd_idx].start_bd;
> - pci_unmap_single(bp->pdev, BD_UNMAP_ADDR(tx_start_bd),
> + dma_unmap_single(&bp->pdev->dev, BD_UNMAP_ADDR(tx_start_bd),
> BD_UNMAP_LEN(tx_start_bd), PCI_DMA_TODEVICE);
>
> nbd = le16_to_cpu(tx_start_bd->nbd) - 1;
> @@ -872,8 +872,8 @@ static u16 bnx2x_free_tx_pkt(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>
> DP(BNX2X_MSG_OFF, "free frag bd_idx %d\n", bd_idx);
> tx_data_bd = &fp->tx_desc_ring[bd_idx].reg_bd;
> - pci_unmap_page(bp->pdev, BD_UNMAP_ADDR(tx_data_bd),
> - BD_UNMAP_LEN(tx_data_bd), PCI_DMA_TODEVICE);
> + dma_unmap_page(&bp->pdev->dev, BD_UNMAP_ADDR(tx_data_bd),
> + BD_UNMAP_LEN(tx_data_bd), DMA_TO_DEVICE);
> if (--nbd)
> bd_idx = TX_BD(NEXT_TX_IDX(bd_idx));
> }
> @@ -1086,7 +1086,7 @@ static inline void bnx2x_free_rx_sge(struct bnx2x *bp,
> if (!page)
> return;
>
> - pci_unmap_page(bp->pdev, pci_unmap_addr(sw_buf, mapping),
> + dma_unmap_page(&bp->pdev->dev, dma_unmap_addr(sw_buf, mapping),
> SGE_PAGE_SIZE*PAGES_PER_SGE, PCI_DMA_FROMDEVICE);
> __free_pages(page, PAGES_PER_SGE_SHIFT);
>
> @@ -1115,15 +1115,15 @@ static inline int bnx2x_alloc_rx_sge(struct bnx2x *bp,
> if (unlikely(page == NULL))
> return -ENOMEM;
>
> - mapping = pci_map_page(bp->pdev, page, 0, SGE_PAGE_SIZE*PAGES_PER_SGE,
> - PCI_DMA_FROMDEVICE);
> + mapping = dma_map_page(&bp->pdev->dev, page, 0,
> + SGE_PAGE_SIZE*PAGES_PER_SGE, DMA_FROM_DEVICE);
> if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
> __free_pages(page, PAGES_PER_SGE_SHIFT);
> return -ENOMEM;
> }
>
> sw_buf->page = page;
> - pci_unmap_addr_set(sw_buf, mapping, mapping);
> + dma_unmap_addr_set(sw_buf, mapping, mapping);
>
> sge->addr_hi = cpu_to_le32(U64_HI(mapping));
> sge->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -1143,15 +1143,15 @@ static inline int bnx2x_alloc_rx_skb(struct bnx2x *bp,
> if (unlikely(skb == NULL))
> return -ENOMEM;
>
> - mapping = pci_map_single(bp->pdev, skb->data, bp->rx_buf_size,
> - PCI_DMA_FROMDEVICE);
> + mapping = dma_map_single(&bp->pdev->dev, skb->data, bp->rx_buf_size,
> + DMA_FROM_DEVICE);
> if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
> dev_kfree_skb(skb);
> return -ENOMEM;
> }
>
> rx_buf->skb = skb;
> - pci_unmap_addr_set(rx_buf, mapping, mapping);
> + dma_unmap_addr_set(rx_buf, mapping, mapping);
>
> rx_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
> rx_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -1173,13 +1173,13 @@ static void bnx2x_reuse_rx_skb(struct bnx2x_fastpath *fp,
> struct eth_rx_bd *cons_bd = &fp->rx_desc_ring[cons];
> struct eth_rx_bd *prod_bd = &fp->rx_desc_ring[prod];
>
> - pci_dma_sync_single_for_device(bp->pdev,
> - pci_unmap_addr(cons_rx_buf, mapping),
> - RX_COPY_THRESH, PCI_DMA_FROMDEVICE);
> + dma_sync_single_for_device(&bp->pdev->dev,
> + dma_unmap_addr(cons_rx_buf, mapping),
> + RX_COPY_THRESH, DMA_FROM_DEVICE);
>
> prod_rx_buf->skb = cons_rx_buf->skb;
> - pci_unmap_addr_set(prod_rx_buf, mapping,
> - pci_unmap_addr(cons_rx_buf, mapping));
> + dma_unmap_addr_set(prod_rx_buf, mapping,
> + dma_unmap_addr(cons_rx_buf, mapping));
> *prod_bd = *cons_bd;
> }
>
> @@ -1283,9 +1283,9 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
>
> /* move empty skb from pool to prod and map it */
> prod_rx_buf->skb = fp->tpa_pool[queue].skb;
> - mapping = pci_map_single(bp->pdev, fp->tpa_pool[queue].skb->data,
> - bp->rx_buf_size, PCI_DMA_FROMDEVICE);
> - pci_unmap_addr_set(prod_rx_buf, mapping, mapping);
> + mapping = dma_map_single(&bp->pdev->dev, fp->tpa_pool[queue].skb->data,
> + bp->rx_buf_size, DMA_FROM_DEVICE);
> + dma_unmap_addr_set(prod_rx_buf, mapping, mapping);
>
> /* move partial skb from cons to pool (don't unmap yet) */
> fp->tpa_pool[queue] = *cons_rx_buf;
> @@ -1361,8 +1361,9 @@ static int bnx2x_fill_frag_skb(struct bnx2x *bp, struct bnx2x_fastpath *fp,
> }
>
> /* Unmap the page as we r going to pass it to the stack */
> - pci_unmap_page(bp->pdev, pci_unmap_addr(&old_rx_pg, mapping),
> - SGE_PAGE_SIZE*PAGES_PER_SGE, PCI_DMA_FROMDEVICE);
> + dma_unmap_page(&bp->pdev->dev,
> + dma_unmap_addr(&old_rx_pg, mapping),
> + SGE_PAGE_SIZE*PAGES_PER_SGE, DMA_FROM_DEVICE);
>
> /* Add one frag and update the appropriate fields in the skb */
> skb_fill_page_desc(skb, j, old_rx_pg.page, 0, frag_len);
> @@ -1389,8 +1390,8 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
> /* Unmap skb in the pool anyway, as we are going to change
> pool entry status to BNX2X_TPA_STOP even if new skb allocation
> fails. */
> - pci_unmap_single(bp->pdev, pci_unmap_addr(rx_buf, mapping),
> - bp->rx_buf_size, PCI_DMA_FROMDEVICE);
> + dma_unmap_single(&bp->pdev->dev, dma_unmap_addr(rx_buf, mapping),
> + bp->rx_buf_size, DMA_FROM_DEVICE);
>
> if (likely(new_skb)) {
> /* fix ip xsum and give it to the stack */
> @@ -1620,10 +1621,10 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
> }
> }
>
> - pci_dma_sync_single_for_device(bp->pdev,
> - pci_unmap_addr(rx_buf, mapping),
> - pad + RX_COPY_THRESH,
> - PCI_DMA_FROMDEVICE);
> + dma_sync_single_for_device(&bp->pdev->dev,
> + dma_unmap_addr(rx_buf, mapping),
> + pad + RX_COPY_THRESH,
> + DMA_FROM_DEVICE);
> prefetch(skb);
> prefetch(((char *)(skb)) + 128);
>
> @@ -1665,10 +1666,10 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>
> } else
> if (likely(bnx2x_alloc_rx_skb(bp, fp, bd_prod) == 0)) {
> - pci_unmap_single(bp->pdev,
> - pci_unmap_addr(rx_buf, mapping),
> + dma_unmap_single(&bp->pdev->dev,
> + dma_unmap_addr(rx_buf, mapping),
> bp->rx_buf_size,
> - PCI_DMA_FROMDEVICE);
> + DMA_FROM_DEVICE);
> skb_reserve(skb, pad);
> skb_put(skb, len);
>
> @@ -4940,9 +4941,9 @@ static inline void bnx2x_free_tpa_pool(struct bnx2x *bp,
> }
>
> if (fp->tpa_state[i] == BNX2X_TPA_START)
> - pci_unmap_single(bp->pdev,
> - pci_unmap_addr(rx_buf, mapping),
> - bp->rx_buf_size, PCI_DMA_FROMDEVICE);
> + dma_unmap_single(&bp->pdev->dev,
> + dma_unmap_addr(rx_buf, mapping),
> + bp->rx_buf_size, DMA_FROM_DEVICE);
>
> dev_kfree_skb(skb);
> rx_buf->skb = NULL;
> @@ -4978,7 +4979,7 @@ static void bnx2x_init_rx_rings(struct bnx2x *bp)
> fp->disable_tpa = 1;
> break;
> }
> - pci_unmap_addr_set((struct sw_rx_bd *)
> + dma_unmap_addr_set((struct sw_rx_bd *)
> &bp->fp->tpa_pool[i],
> mapping, 0);
> fp->tpa_state[i] = BNX2X_TPA_STOP;
> @@ -5658,8 +5659,8 @@ static void bnx2x_nic_init(struct bnx2x *bp, u32 load_code)
>
> static int bnx2x_gunzip_init(struct bnx2x *bp)
> {
> - bp->gunzip_buf = pci_alloc_consistent(bp->pdev, FW_BUF_SIZE,
> - &bp->gunzip_mapping);
> + bp->gunzip_buf = dma_alloc_coherent(&bp->pdev->dev, FW_BUF_SIZE,
> + &bp->gunzip_mapping, GFP_KERNEL);
> if (bp->gunzip_buf == NULL)
> goto gunzip_nomem1;
>
> @@ -5679,8 +5680,8 @@ gunzip_nomem3:
> bp->strm = NULL;
>
> gunzip_nomem2:
> - pci_free_consistent(bp->pdev, FW_BUF_SIZE, bp->gunzip_buf,
> - bp->gunzip_mapping);
> + dma_free_coherent(&bp->pdev->dev, FW_BUF_SIZE, bp->gunzip_buf,
> + bp->gunzip_mapping);
> bp->gunzip_buf = NULL;
>
> gunzip_nomem1:
> @@ -5696,8 +5697,8 @@ static void bnx2x_gunzip_end(struct bnx2x *bp)
> bp->strm = NULL;
>
> if (bp->gunzip_buf) {
> - pci_free_consistent(bp->pdev, FW_BUF_SIZE, bp->gunzip_buf,
> - bp->gunzip_mapping);
> + dma_free_coherent(&bp->pdev->dev, FW_BUF_SIZE, bp->gunzip_buf,
> + bp->gunzip_mapping);
> bp->gunzip_buf = NULL;
> }
> }
> @@ -6692,7 +6693,7 @@ static void bnx2x_free_mem(struct bnx2x *bp)
> #define BNX2X_PCI_FREE(x, y, size) \
> do { \
> if (x) { \
> - pci_free_consistent(bp->pdev, size, x, y); \
> + dma_free_coherent(&bp->pdev->dev, size, x, y); \
> x = NULL; \
> y = 0; \
> } \
> @@ -6773,7 +6774,7 @@ static int bnx2x_alloc_mem(struct bnx2x *bp)
>
> #define BNX2X_PCI_ALLOC(x, y, size) \
> do { \
> - x = pci_alloc_consistent(bp->pdev, size, y); \
> + x = dma_alloc_coherent(&bp->pdev->dev, size, y, GFP_KERNEL); \
> if (x == NULL) \
> goto alloc_mem_err; \
> memset(x, 0, size); \
> @@ -6906,9 +6907,9 @@ static void bnx2x_free_rx_skbs(struct bnx2x *bp)
> if (skb == NULL)
> continue;
>
> - pci_unmap_single(bp->pdev,
> - pci_unmap_addr(rx_buf, mapping),
> - bp->rx_buf_size, PCI_DMA_FROMDEVICE);
> + dma_unmap_single(&bp->pdev->dev,
> + dma_unmap_addr(rx_buf, mapping),
> + bp->rx_buf_size, DMA_FROM_DEVICE);
>
> rx_buf->skb = NULL;
> dev_kfree_skb(skb);
> @@ -10269,8 +10270,8 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode, u8 link_up)
>
> bd_prod = TX_BD(fp_tx->tx_bd_prod);
> tx_start_bd = &fp_tx->tx_desc_ring[bd_prod].start_bd;
> - mapping = pci_map_single(bp->pdev, skb->data,
> - skb_headlen(skb), PCI_DMA_TODEVICE);
> + mapping = dma_map_single(&bp->pdev->dev, skb->data,
> + skb_headlen(skb), DMA_TO_DEVICE);
> tx_start_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
> tx_start_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> tx_start_bd->nbd = cpu_to_le16(2); /* start + pbd */
> @@ -11316,8 +11317,8 @@ static netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev)
> }
> }
>
> - mapping = pci_map_single(bp->pdev, skb->data,
> - skb_headlen(skb), PCI_DMA_TODEVICE);
> + mapping = dma_map_single(&bp->pdev->dev, skb->data,
> + skb_headlen(skb), DMA_TO_DEVICE);
>
> tx_start_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
> tx_start_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -11374,8 +11375,9 @@ static netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev)
> if (total_pkt_bd == NULL)
> total_pkt_bd = &fp->tx_desc_ring[bd_prod].reg_bd;
>
> - mapping = pci_map_page(bp->pdev, frag->page, frag->page_offset,
> - frag->size, PCI_DMA_TODEVICE);
> + mapping = dma_map_page(&bp->pdev->dev, frag->page,
> + frag->page_offset,
> + frag->size, DMA_TO_DEVICE);
>
> tx_data_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
> tx_data_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -11832,15 +11834,15 @@ static int __devinit bnx2x_init_dev(struct pci_dev *pdev,
> goto err_out_release;
> }
>
> - if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) == 0) {
> + if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(64)) == 0) {
> bp->flags |= USING_DAC_FLAG;
> - if (pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64)) != 0) {
> - pr_err("pci_set_consistent_dma_mask failed, aborting\n");
> + if (dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(64)) != 0) {
> + pr_err("dma_set_coherent_mask failed, aborting\n");
> rc = -EIO;
> goto err_out_release;
> }
>
> - } else if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) != 0) {
> + } else if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(32)) != 0) {
> pr_err("System does not support DMA, aborting\n");
> rc = -EIO;
> goto err_out_release;
> --
> 1.7.0
>
* Re: [PATCH] mac80211: Ensure initializing private mc_list in prepare_multicast().
From: Jiri Pirko @ 2010-04-06 7:17 UTC (permalink / raw)
To: David Miller; +Cc: yoshfuji, netdev
In-Reply-To: <20100406.001259.15002237.davem@davemloft.net>
Tue, Apr 06, 2010 at 09:12:59AM CEST, davem@davemloft.net wrote:
>From: Jiri Pirko <jpirko@redhat.com>
>Date: Tue, 6 Apr 2010 09:09:23 +0200
>
>> Whoups, missed this bit. Thanks a lot.
>>
>> Rewieved-by: Jiri Pirko <jpirko@redhat.com>
>>
>
>Applied, and patchwork doesn't know what "Rewieved-by" is so
>I fixed the typo and added it to the changelog :-)
Oh my :) Looks like I'm still sleeping...
* Re: [PATCH] mac80211: Ensure initializing private mc_list in prepare_multicast().
From: David Miller @ 2010-04-06 7:12 UTC (permalink / raw)
To: jpirko; +Cc: yoshfuji, netdev
In-Reply-To: <20100406070922.GE2869@psychotron.redhat.com>
From: Jiri Pirko <jpirko@redhat.com>
Date: Tue, 6 Apr 2010 09:09:23 +0200
> Whoups, missed this bit. Thanks a lot.
>
> Rewieved-by: Jiri Pirko <jpirko@redhat.com>
>
Applied, and patchwork doesn't know what "Rewieved-by" is so
I fixed the typo and added it to the changelog :-)