* Re: [Patch 2/3] sysctl: add proc_do_large_bitmap
From: Octavian Purdila @ 2010-04-09 12:35 UTC (permalink / raw)
To: Changli Gao
Cc: Amerigo Wang, linux-kernel, ebiederm, Eric Dumazet, netdev,
Neil Horman, David Miller
In-Reply-To: <s2w412e6f7f1004090333g3b23eb94udb1e6cc3939a07e5@mail.gmail.com>
On Friday 09 April 2010 13:33:29 you wrote:
> On Fri, Apr 9, 2010 at 6:11 PM, Amerigo Wang <amwang@redhat.com> wrote:
> > From: Octavian Purdila <opurdila@ixiacom.com>
> >
> > The new function can be used to read/write large bitmaps via /proc. A
> > comma separated range format is used for compact output and input
> > (e.g. 1,3-4,10-10).
> >
> > Writing into the file will first reset the bitmap then update it
> > based on the given input.
>
> We have bitmap_scnprintf() and bitmap_parse_user(), why invent a new suite?
>
A decimal, comma-separated range format seems the best option for this
feature, and unfortunately both of the above functions support only
hexadecimal and no ranges.
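
To make the format concrete, here is a minimal user-space sketch of parsing a list such as 1,3-4,10-10 into a bitmap. It is not the code from the patch: parse_ranges and set_bit_ul are hypothetical names, and the kernel version additionally has to deal with user-space buffers and partial writes.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper, not the kernel API: set bit n in a word array. */
static void set_bit_ul(unsigned long *map, unsigned long n)
{
	map[n / BITS_PER_WORD] |= 1UL << (n % BITS_PER_WORD);
}

/*
 * Parse a list such as "1,3-4,10-10" into map, which holds nbits bits.
 * The map is cleared first, mirroring the "reset, then update" write
 * semantics described in the patch. Returns 0 on success and -1 on bad
 * input; on failure the map may be partially updated.
 */
static int parse_ranges(const char *s, unsigned long *map, unsigned long nbits)
{
	size_t words = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;

	assert(s && map);
	memset(map, 0, words * sizeof(*map));

	while (*s) {
		char *end;
		unsigned long lo, hi, i;

		if (*s < '0' || *s > '9')
			return -1;
		lo = hi = strtoul(s, &end, 10);
		if (*end == '-') {
			/* A range: parse the upper bound as well. */
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -1;
			hi = strtoul(s, &end, 10);
		}
		if (lo > hi || hi >= nbits)
			return -1;
		for (i = lo; i <= hi; i++)
			set_bit_ul(map, i);
		if (*end == ',')
			end++;
		else if (*end != '\0')
			return -1;
		s = end;
	}
	return 0;
}
```

A single value is treated as a range with equal bounds, which is why 10-10 and 10 set the same bit.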
* Re: Strange packet drops with heavy firewalling
From: Benny Amorsen @ 2010-04-09 12:33 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1270813662.2623.85.camel@edumazet-laptop>
Eric Dumazet <eric.dumazet@gmail.com> writes:
> might be micro bursts, check 'ethtool -g eth0' RX parameters (increase
> RX ring from 200 to 511 if you want more buffers ?)
I tried that already actually. (I didn't expect it to cause traffic
interruption, but it did. Oh well.)
It didn't make a difference, at least not one I could detect from the
number of packet drops and the CPU utilization.
> cat /proc/net/softnet_stat
000002d9 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
42bc8143 00000000 0000024c 00000000 00000000 00000000 00000000 00000000 00000000
0000031b 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
1c5a35e9 00000000 000005f7 00000000 00000000 00000000 00000000 00000000 00000000
I am not quite sure how to interpret that...
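
For reference, a minimal sketch of decoding one row, assuming the layout used by 2.6.32-era kernels: one line per CPU, all values hexadecimal, with the first three columns being packets processed, packets dropped because the backlog queue was full, and the number of times the receive softirq ran out of budget or time ("time squeeze"). The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical struct for this sketch; the kernel exports only the text. */
struct softnet_row {
	unsigned int processed;    /* column 1: packets processed on this CPU */
	unsigned int dropped;      /* column 2: dropped, backlog queue full */
	unsigned int time_squeeze; /* column 3: softirq ran out of budget/time */
};

/* Parse one line of /proc/net/softnet_stat. Returns 0 on success. */
static int parse_softnet_line(const char *line, struct softnet_row *row)
{
	assert(line && row);
	/* The remaining columns are ignored in this sketch. */
	if (sscanf(line, "%x %x %x", &row->processed, &row->dropped,
		   &row->time_squeeze) != 3)
		return -1;
	return 0;
}
```

Read this way, the second row above would mean no backlog drops on that CPU and 0x24c (588) time squeezes.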
> cat /proc/interrupts
79: 1240 4050590849 1253 1263 PCI-MSI-edge eth0
80: 12 9 14 3613521843 PCI-MSI-edge eth1
> (check eth0 IRQS are delivered to one cpu)
Yes CPU1 handles eth0 and CPU3 handles eth1.
> grep . /proc/sys/net/ipv4/netfilter/ip_conntrack_*
nf_conntrack_acct:1
nf_conntrack_buckets:8192
nf_conntrack_checksum:1
nf_conntrack_count:49311
nf_conntrack_events:1
nf_conntrack_events_retry_timeout:15
nf_conntrack_expect_max:2048
nf_conntrack_generic_timeout:600
nf_conntrack_icmp_timeout:30
nf_conntrack_log_invalid:1
nf_conntrack_max:1048576
nf_conntrack_tcp_be_liberal:0
nf_conntrack_tcp_loose:1
nf_conntrack_tcp_max_retrans:3
nf_conntrack_tcp_timeout_close:10
nf_conntrack_tcp_timeout_close_wait:60
nf_conntrack_tcp_timeout_established:432000
nf_conntrack_tcp_timeout_fin_wait:120
nf_conntrack_tcp_timeout_last_ack:30
nf_conntrack_tcp_timeout_max_retrans:300
nf_conntrack_tcp_timeout_syn_recv:60
nf_conntrack_tcp_timeout_syn_sent:120
nf_conntrack_tcp_timeout_time_wait:120
nf_conntrack_tcp_timeout_unacknowledged:300
nf_conntrack_udp_timeout:30
nf_conntrack_udp_timeout_stream:180
> (might need to increase ip_conntrack_buckets)
You got me there. I had forgotten nf_conntrack.hashsize=1048576
and nf_conntrack.expect_hashsize=32768 on the kernel command line. It
was on the hot standby firewall, but not on the primary one. I will do a
failover to the hot standby sometime during the weekend.
It still isn't possible to change without a reboot, is it?
> ethtool -c eth0
> (might change coalesce params to reduce number of irqs)
Coalesce parameters for eth0:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 20
rx-frames: 5
rx-usecs-irq: 0
rx-frames-irq: 5
tx-usecs: 72
tx-frames: 53
tx-usecs-irq: 0
tx-frames-irq: 5
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
I played quite a lot with the parameters but it did not seem to make any
difference. I didn't try adaptive though, but the load is fairly static
so it didn't seem appropriate.
> ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 511
RX Mini: 0
RX Jumbo: 0
TX: 511
Current hardware settings:
RX: 200
RX Mini: 0
RX Jumbo: 0
TX: 511
Right now RX is 200, but when it was 511 it didn't seem to make a
difference.
Thank you very much for the help! I will report back whether it was the
hash buckets.
/Benny
* Re: Strange packet drops with heavy firewalling
From: Eric Dumazet @ 2010-04-09 11:47 UTC (permalink / raw)
To: Benny Amorsen; +Cc: netdev
In-Reply-To: <m339z50x1l.fsf@ursa.amorsen.dk>
On Friday 09 April 2010 at 11:56 +0200, Benny Amorsen wrote:
> I have a netfilter-box which is dropping packets. ethtool -S counts
> 10-20 rx_discards per second on the interface.
>
> The switch does not have flow control enabled; with flow control enabled
> the rx_discards turn into tx_on_sent which ultimately cause the same
> problem (the load is pretty constant so the switch has to drop the
> packets instead).
>
> perf top shows something like:
> 5201.00 - 6.7% : _spin_unlock_irqrestore
> 4232.00 - 5.5% : finish_task_switch
> 3597.00 - 4.6% : tg3_poll [tg3]
> 3257.00 - 4.2% : handle_IRQ_event
> 2515.00 - 3.2% : tick_nohz_restart_sched_tick
> 1947.00 - 2.5% : nf_ct_tuple_equal
> 1927.00 - 2.5% : tg3_start_xmit [tg3]
> 1879.00 - 2.4% : kmem_cache_alloc_node
> 1625.00 - 2.1% : tick_nohz_stop_sched_tick
> 1619.00 - 2.1% : ipt_do_table
> 1595.00 - 2.1% : ip_route_input
> 1547.00 - 2.0% : kmem_cache_free
> 1474.00 - 1.9% : __alloc_skb
> 1424.00 - 1.8% : fget_light
> 1391.00 - 1.8% : nf_iterate
>
> The rule set is quite large (more than 4000 rules), but organized so
> that each packet only has to traverse a few rules before getting
> accepted or rejected.
>
> When the problem started we were using a different server, an old
> two-socket 32-bit Xeon with hyperthreading. CPU usage often hit 100% on
> one CPU with that server. After replacing the server with a ProLiant
> DL160 G5 with a quad-core Xeon (without hyperthreading) the CPU usage
> rarely exceeds 10% on any CPU, but the packet loss persists.
>
might be micro bursts, check 'ethtool -g eth0' RX parameters (increase
RX ring from 200 to 511 if you want more buffers ?)
> We're using the built-in dual Broadcom Corporation NetXtreme BCM5722 Gigabit
> Ethernet PCI Express nics, and the kernel is
> kernel-2.6.32.9-70.fc12.x86_64 from Fedora. Next step is probably
> installing a better ethernet card, perhaps an Intel 82576-based one, so
> that we can get multiqueue support.
>
Sure, but before this, could you check
cat /proc/net/softnet_stat
cat /proc/interrupts
(check eth0 IRQS are delivered to one cpu)
grep . /proc/sys/net/ipv4/netfilter/ip_conntrack_*
(might need to increase ip_conntrack_buckets)
ethtool -c eth0
(might change coalesce params to reduce number of irqs)
ethtool -g eth0
* [PATCH 1/2] PHY: fix typo in bcm63xx PHY driver table
From: Florian Fainelli @ 2010-04-09 11:04 UTC (permalink / raw)
To: netdev; +Cc: David Miller
Signed-off-by: Florian Fainelli <ffainelli@freebox.fr>
---
diff --git a/drivers/net/phy/bcm63xx.c b/drivers/net/phy/bcm63xx.c
index ac5e498..c128156 100644
--- a/drivers/net/phy/bcm63xx.c
+++ b/drivers/net/phy/bcm63xx.c
@@ -137,4 +137,4 @@ static struct mdio_device_id bcm63xx_tbl[] = {
{ }
};
-MODULE_DEVICE_TABLE(mdio, bcm64xx_tbl);
+MODULE_DEVICE_TABLE(mdio, bcm63xx_tbl);
* [PATCH 2/2] bcm63xx_enet: do not overwrite ENET_CTL_REG value
From: Florian Fainelli @ 2010-04-09 11:04 UTC (permalink / raw)
To: netdev; +Cc: David Miller, Maxime Bizon
bcm_enet_hw_preinit correctly sets values in ENET_CTL_REG for internal
or external MII operation. However, bcm_enet_open blindly overwrites the
ENET_CTL_REG register value, so we lose any changes made to it in
bcm_enet_hw_preinit, which breaks external MII operation.
As a result, the driver could not check for link availability on
external PHY setups, and we would never get to sending packets because
the link looked down from the driver's side.
This went completely unnoticed because all boards out there except
BCM6338-based ones use the internal PHY on their enet0 interface.
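
The difference between the two behaviours can be sketched in plain C. This is an illustration only: the register is modelled as an ordinary variable, and the bit positions and function names are hypothetical, not the real ENET_CTL_REG layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout for illustration; not the real ENET_CTL_REG. */
#define CTL_ENABLE_MASK  (1u << 0)
#define CTL_EXT_MII_MASK (1u << 6)

/* Old behaviour: a plain write drops every bit that was set earlier. */
static void enable_overwrite(volatile uint32_t *ctl)
{
	assert(ctl);
	*ctl = CTL_ENABLE_MASK;
}

/* Fixed behaviour: read, OR in the enable bit, write back. */
static void enable_rmw(volatile uint32_t *ctl)
{
	uint32_t val;

	assert(ctl);
	val = *ctl;
	val |= CTL_ENABLE_MASK;
	*ctl = val;
}
```

Starting from a register that already has the external-MII bit set, enable_overwrite leaves only the enable bit, while enable_rmw keeps both.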
Signed-off-by: Florian Fainelli <ffainelli@freebox.fr>
---
diff --git a/drivers/net/bcm63xx_enet.c b/drivers/net/bcm63xx_enet.c
index 5173340..14ab4dc 100644
--- a/drivers/net/bcm63xx_enet.c
+++ b/drivers/net/bcm63xx_enet.c
@@ -958,7 +958,9 @@ static int bcm_enet_open(struct net_device *dev)
/* all set, enable mac and interrupts, start dma engine and
* kick rx dma channel */
wmb();
- enet_writel(priv, ENET_CTL_ENABLE_MASK, ENET_CTL_REG);
+ val = enet_readl(priv, ENET_CTL_REG);
+ val |= ENET_CTL_ENABLE_MASK;
+ enet_writel(priv, val, ENET_CTL_REG);
enet_dma_writel(priv, ENETDMA_CFG_EN_MASK, ENETDMA_CFG_REG);
enet_dma_writel(priv, ENETDMA_CHANCFG_EN_MASK,
ENETDMA_CHANCFG_REG(priv->rx_chan));
* Re: [RFC PATCH 0/2] netdev: Add tracepoint to network/driver interface
From: Neil Horman @ 2010-04-09 11:04 UTC (permalink / raw)
To: Koki Sanagi; +Cc: netdev, izumi.taku, kaneshige.kenji, davem
In-Reply-To: <4BBED951.8040406@jp.fujitsu.com>
On Fri, Apr 09, 2010 at 04:37:53PM +0900, Koki Sanagi wrote:
> These patches add tracepoints to network/driver interface.
>
> These tracepoints are helpful to investigate whether a packet passes or not.
> For example, when Heart Beat is disconnected, that information is helpful
> to investigate the cause is whether driver/device side or not.
>
> An output is below.
>
> sshd-2443 [001] 68238.415621: netdev_start_xmit: dev=eth3 skbaddr=f3db5138 len=114
> <idle>-0 [001] 68238.417058: netdev_receive_skb: dev=eth3 skbaddr=f3c81540 len=52
> <idle>-0 [001] 68238.704363: netdev_receive_skb: dev=eth3 skbaddr=f3c81540 len=100
> sshd-2443 [001] 68238.705459: netdev_start_xmit: dev=eth3 skbaddr=f3db5138 len=114
> <idle>-0 [001] 68238.706891: netdev_receive_skb: dev=eth3 skbaddr=f3c81540 len=52
> <idle>-0 [001] 68238.878736: netdev_receive_skb: dev=eth3 skbaddr=f3c81540 len=100
> sshd-2443 [001] 68238.880361: netdev_start_xmit: dev=eth3 skbaddr=f3db5138 len=114
>
> As other use case I have, we can get throughput per interface with some sort of
> perf scripts. I plan to create it.
>
> Thanks
> Koki Sanagi
>
You can get a reasonable estimate of per-interface throughput using ethtool or
even ifconfig in a script. What are the tracepoints needed for that? Don't get
me wrong, I think these tracepoints could have some potential use that's not
covered by other tools, I just don't see the above as a conclusive reason to add
them.
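
As a rough sketch of the counter-based estimate, assuming the per-interface byte counters exposed under /sys/class/net/<ifname>/statistics/ (for example rx_bytes) are sampled twice, some seconds apart; read_counter and rate_bps are hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Read a counter such as /sys/class/net/eth0/statistics/rx_bytes. */
static int read_counter(const char *path, uint64_t *out)
{
	unsigned long long v;
	FILE *f;
	int ok;

	assert(path && out);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", &v) == 1);
	fclose(f);
	if (!ok)
		return -1;
	*out = v;
	return 0;
}

/* Average bits per second between two byte-counter samples. */
static double rate_bps(uint64_t before, uint64_t after, double seconds)
{
	assert(seconds > 0.0);
	/* Unsigned subtraction stays correct across one wrap of a
	 * 64-bit counter; narrower counters would need masking. */
	return (double)(after - before) * 8.0 / seconds;
}
```

This only yields an average over the sampling interval, which is the kind of estimate meant above; it says nothing about individual packets.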
Regards
Neil
* Re: [Patch 1/3] sysctl: refactor integer handling proc code
From: Changli Gao @ 2010-04-09 10:49 UTC (permalink / raw)
To: Amerigo Wang
Cc: linux-kernel, Octavian Purdila, Eric Dumazet, netdev, Neil Horman,
David Miller, ebiederm
In-Reply-To: <20100409101452.5051.74050.sendpatchset@localhost.localdomain>
On Fri, Apr 9, 2010 at 6:11 PM, Amerigo Wang <amwang@redhat.com> wrote:
>
> From: Octavian Purdila <opurdila@ixiacom.com>
>
> As we are about to add another integer handling proc function a little
> bit of cleanup is in order: add a few helper functions to improve code
> readability and decrease code duplication.
>
> In the process a bug is also fixed: if the user specifies a number
> with more then 20 digits it will be interpreted as two integers
> (e.g. 10000...13 will be interpreted as 100.... and 13).
>
> Behavior for EFAULT handling was changed as well. Previous to this
> patch, when an EFAULT error occurred in the middle of a write
> operation, although some of the elements were set, that was not
> acknowledged to the user (by shorting the write and returning the
> number of bytes accepted). EFAULT is now treated just like any other
> errors by acknowledging the amount of bytes accepted.
>
> Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
> Signed-off-by: WANG Cong <amwang@redhat.com>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> ---
>
> Index: linux-2.6/kernel/sysctl.c
> ===================================================================
> --- linux-2.6.orig/kernel/sysctl.c
> +++ linux-2.6/kernel/sysctl.c
> @@ -2040,8 +2040,148 @@ int proc_dostring(struct ctl_table *tabl
> buffer, lenp, ppos);
> }
>
> +static int proc_skip_wspace(char __user **buf, size_t *size)
> +{
> + char c;
> +
> + while (*size) {
> + if (get_user(c, *buf))
> + return -EFAULT;
> + if (!isspace(c))
> + break;
> + (*size)--;
> + (*buf)++;
> + }
> +
> + return 0;
> +}
> +
> +static bool isanyof(char c, const char *v, unsigned len)
> +{
> + int i;
> +
> + if (!len)
> + return false;
> +
> + for (i = 0; i < len; i++)
> + if (c == v[i])
> + break;
> + if (i == len)
> + return false;
> +
> + return true;
> +}
> +
> +#define TMPBUFLEN 22
> +/**
> + * proc_get_ulong - reads an ASCII formated integer from a user buffer
> + *
> + * @buf - user buffer
> + * @size - size of the user buffer
> + * @val - this is where the number will be stored
> + * @neg - set to %TRUE if number is negative
> + * @perm_tr - a vector which contains the allowed trailers
> + * @perm_tr_len - size of the perm_tr vector
> + * @tr - pointer to store the trailer character
> + *
> + * In case of success 0 is returned and buf and size are updated with
> + * the amount of bytes read. If tr is non NULL and a trailing
> + * character exist (size is non zero after returning from this
> + * function) tr is updated with the trailing character.
> + */
> +static int proc_get_ulong(char __user **buf, size_t *size,
> + unsigned long *val, bool *neg,
> + const char *perm_tr, unsigned perm_tr_len, char *tr)
> +{
> + int len;
> + char *p, tmp[TMPBUFLEN];
> +
> + if (!*size)
> + return -EINVAL;
> +
> + len = *size;
> + if (len > TMPBUFLEN-1)
> + len = TMPBUFLEN-1;
> +
> + if (copy_from_user(tmp, *buf, len))
> + return -EFAULT;
> +
> + tmp[len] = 0;
> + p = tmp;
> + if (*p == '-' && *size > 1) {
> + *neg = 1;
> + p++;
> + } else
> + *neg = 0;
The function name implies that it parses an unsigned long, so
negative values should not be supported.
> + if (!isdigit(*p))
> + return -EINVAL;
It seems that leading whitespace should be allowed, so this check
isn't needed; simple_strtoul can handle it.
> +
> + *val = simple_strtoul(p, &p, 0);
> +
> + len = p - tmp;
> +
> + /* We don't know if the next char is whitespace thus we may accept
> + * invalid integers (e.g. 1234...a) or two integers instead of one
> + * (e.g. 123...1). So lets not allow such large numbers. */
> + if (len == TMPBUFLEN - 1)
> + return -EINVAL;
> +
> + if (len < *size && perm_tr_len && !isanyof(*p, perm_tr, perm_tr_len))
> + return -EINVAL;
Would strspn() be better here?
> +
> + if (tr && (len < *size))
> + *tr = *p;
> +
> + *buf += len;
> + *size -= len;
> +
> + return 0;
> +}
> +
> +/**
> + * proc_put_ulong - coverts an integer to a decimal ASCII formated string
> + *
> + * @buf - the user buffer
> + * @size - the size of the user buffer
> + * @val - the integer to be converted
> + * @neg - sign of the number, %TRUE for negative
> + * @first - if %FALSE will insert a separator character before the number
> + * @separator - the separator character
> + *
> + * In case of success 0 is returned and buf and size are updated with
> + * the amount of bytes read.
> + */
> +static int proc_put_ulong(char __user **buf, size_t *size, unsigned long val,
> + bool neg, bool first, char separator)
> +{
> + int len;
> + char tmp[TMPBUFLEN], *p = tmp;
> +
> + if (!first)
> + *p++ = separator;
> + sprintf(p, "%s%lu", neg ? "-" : "", val);
Negative values should not be supported here either.
> + len = strlen(tmp);
> + if (len > *size)
> + len = *size;
> + if (copy_to_user(*buf, tmp, len))
> + return -EFAULT;
> + *size -= len;
> + *buf += len;
> + return 0;
> +}
> +#undef TMPBUFLEN
> +
> +static int proc_put_char(char __user **buf, size_t *size, char c)
> +{
> + if (*size) {
> + if (put_user(c, *buf))
> + return -EFAULT;
> + (*size)--, (*buf)++;
> + }
> + return 0;
> +}
>
> -static int do_proc_dointvec_conv(int *negp, unsigned long *lvalp,
> +static int do_proc_dointvec_conv(bool *negp, unsigned long *lvalp,
> int *valp,
> int write, void *data)
> {
> @@ -2050,7 +2190,7 @@ static int do_proc_dointvec_conv(int *ne
> } else {
> int val = *valp;
> if (val < 0) {
> - *negp = -1;
> + *negp = 1;
> *lvalp = (unsigned long)-val;
> } else {
> *negp = 0;
> @@ -2060,20 +2200,18 @@ static int do_proc_dointvec_conv(int *ne
> return 0;
> }
>
> +static const char proc_wspace_sep[] = { ' ', '\t', '\n', 0 };
> +
> static int __do_proc_dointvec(void *tbl_data, struct ctl_table *table,
> - int write, void __user *buffer,
> + int write, void __user *_buffer,
> size_t *lenp, loff_t *ppos,
> - int (*conv)(int *negp, unsigned long *lvalp, int *valp,
> + int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
> int write, void *data),
> void *data)
> {
> -#define TMPBUFLEN 21
> - int *i, vleft, first = 1, neg;
> - unsigned long lval;
> - size_t left, len;
> -
> - char buf[TMPBUFLEN], *p;
> - char __user *s = buffer;
> + int *i, vleft, first = 1, err = 0;
> + size_t left;
> + char __user *buffer = (char __user *) _buffer;
>
> if (!tbl_data || !table->maxlen || !*lenp ||
> (*ppos && !write)) {
> @@ -2089,88 +2227,48 @@ static int __do_proc_dointvec(void *tbl_
> conv = do_proc_dointvec_conv;
>
> for (; left && vleft--; i++, first=0) {
> - if (write) {
> - while (left) {
> - char c;
> - if (get_user(c, s))
> - return -EFAULT;
> - if (!isspace(c))
> - break;
> - left--;
> - s++;
> - }
> - if (!left)
> - break;
> - neg = 0;
> - len = left;
> - if (len > sizeof(buf) - 1)
> - len = sizeof(buf) - 1;
> - if (copy_from_user(buf, s, len))
> - return -EFAULT;
> - buf[len] = 0;
> - p = buf;
> - if (*p == '-' && left > 1) {
> - neg = 1;
> - p++;
> - }
> - if (*p < '0' || *p > '9')
> - break;
> -
> - lval = simple_strtoul(p, &p, 0);
> + unsigned long lval;
> + bool neg;
>
> - len = p-buf;
> - if ((len < left) && *p && !isspace(*p))
> + if (write) {
> + err = proc_skip_wspace(&buffer, &left);
> + if (err)
> + return err;
> + err = proc_get_ulong(&buffer, &left, &lval, &neg,
> + proc_wspace_sep,
> + sizeof(proc_wspace_sep), NULL);
> + if (err)
> break;
> - s += len;
> - left -= len;
> -
> - if (conv(&neg, &lval, i, 1, data))
> + if (conv(&neg, &lval, i, 1, data)) {
> + err = -EINVAL;
> break;
> + }
> } else {
> - p = buf;
> - if (!first)
> - *p++ = '\t';
> -
> - if (conv(&neg, &lval, i, 0, data))
> + if (conv(&neg, &lval, i, 0, data)) {
> + err = -EINVAL;
> break;
> -
> - sprintf(p, "%s%lu", neg ? "-" : "", lval);
> - len = strlen(buf);
> - if (len > left)
> - len = left;
> - if(copy_to_user(s, buf, len))
> - return -EFAULT;
> - left -= len;
> - s += len;
> - }
> - }
> -
> - if (!write && !first && left) {
> - if(put_user('\n', s))
> - return -EFAULT;
> - left--, s++;
> - }
> - if (write) {
> - while (left) {
> - char c;
> - if (get_user(c, s++))
> - return -EFAULT;
> - if (!isspace(c))
> + }
> + err = proc_put_ulong(&buffer, &left, lval, neg, first,
> + '\t');
> + if (err)
> break;
> - left--;
> }
> }
> +
> + if (!write && !first && left && !err)
> + err = proc_put_char(&buffer, &left, '\n');
> + if (write && !err)
> + err = proc_skip_wspace(&buffer, &left);
> if (write && first)
> - return -EINVAL;
> + return err ? : -EINVAL;
> *lenp -= left;
> *ppos += *lenp;
> return 0;
> -#undef TMPBUFLEN
> }
>
> static int do_proc_dointvec(struct ctl_table *table, int write,
> void __user *buffer, size_t *lenp, loff_t *ppos,
> - int (*conv)(int *negp, unsigned long *lvalp, int *valp,
> + int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
> int write, void *data),
> void *data)
> {
> @@ -2238,8 +2336,8 @@ struct do_proc_dointvec_minmax_conv_para
> int *max;
> };
>
> -static int do_proc_dointvec_minmax_conv(int *negp, unsigned long *lvalp,
> - int *valp,
> +static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
> + int *valp,
> int write, void *data)
> {
> struct do_proc_dointvec_minmax_conv_param *param = data;
> @@ -2252,7 +2350,7 @@ static int do_proc_dointvec_minmax_conv(
> } else {
> int val = *valp;
> if (val < 0) {
> - *negp = -1;
> + *negp = 1;
> *lvalp = (unsigned long)-val;
> } else {
> *negp = 0;
> @@ -2290,17 +2388,15 @@ int proc_dointvec_minmax(struct ctl_tabl
> }
>
> static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int write,
> - void __user *buffer,
> + void __user *_buffer,
> size_t *lenp, loff_t *ppos,
> unsigned long convmul,
> unsigned long convdiv)
> {
> -#define TMPBUFLEN 21
> - unsigned long *i, *min, *max, val;
> - int vleft, first=1, neg;
> - size_t len, left;
> - char buf[TMPBUFLEN], *p;
> - char __user *s = buffer;
> + unsigned long *i, *min, *max;
> + int vleft, first = 1, err = 0;
> + size_t left;
> + char __user *buffer = (char __user *) _buffer;
>
> if (!data || !table->maxlen || !*lenp ||
> (*ppos && !write)) {
> @@ -2315,82 +2411,42 @@ static int __do_proc_doulongvec_minmax(v
> left = *lenp;
>
> for (; left && vleft--; i++, min++, max++, first=0) {
> + unsigned long val;
> +
> if (write) {
> - while (left) {
> - char c;
> - if (get_user(c, s))
> - return -EFAULT;
> - if (!isspace(c))
> - break;
> - left--;
> - s++;
> - }
> - if (!left)
> - break;
> - neg = 0;
> - len = left;
> - if (len > TMPBUFLEN-1)
> - len = TMPBUFLEN-1;
> - if (copy_from_user(buf, s, len))
> - return -EFAULT;
> - buf[len] = 0;
> - p = buf;
> - if (*p == '-' && left > 1) {
> - neg = 1;
> - p++;
> - }
> - if (*p < '0' || *p > '9')
> - break;
> - val = simple_strtoul(p, &p, 0) * convmul / convdiv ;
> - len = p-buf;
> - if ((len < left) && *p && !isspace(*p))
> + bool neg;
> +
> + err = proc_skip_wspace(&buffer, &left);
> + if (err)
> + return err;
> + err = proc_get_ulong(&buffer, &left, &val, &neg,
> + proc_wspace_sep,
> + sizeof(proc_wspace_sep), NULL);
> + if (err)
> break;
> if (neg)
> - val = -val;
> - s += len;
> - left -= len;
> -
> - if(neg)
> continue;
> if ((min && val < *min) || (max && val > *max))
> continue;
> *i = val;
> } else {
> - p = buf;
> - if (!first)
> - *p++ = '\t';
> - sprintf(p, "%lu", convdiv * (*i) / convmul);
> - len = strlen(buf);
> - if (len > left)
> - len = left;
> - if(copy_to_user(s, buf, len))
> - return -EFAULT;
> - left -= len;
> - s += len;
> - }
> - }
> -
> - if (!write && !first && left) {
> - if(put_user('\n', s))
> - return -EFAULT;
> - left--, s++;
> - }
> - if (write) {
> - while (left) {
> - char c;
> - if (get_user(c, s++))
> - return -EFAULT;
> - if (!isspace(c))
> + val = convdiv * (*i) / convmul;
> + err = proc_put_ulong(&buffer, &left, val, 0, first,
> + '\t');
> + if (err)
> break;
> - left--;
> }
> }
> +
> + if (!write && !first && left && !err)
> + err = proc_put_char(&buffer, &left, '\n');
> + if (write && !err)
> + err = proc_skip_wspace(&buffer, &left);
> if (write && first)
> - return -EINVAL;
> + return err ? : -EINVAL;
> *lenp -= left;
> *ppos += *lenp;
> return 0;
> -#undef TMPBUFLEN
> }
>
> static int do_proc_doulongvec_minmax(struct ctl_table *table, int write,
> @@ -2451,7 +2507,7 @@ int proc_doulongvec_ms_jiffies_minmax(st
> }
>
>
> -static int do_proc_dointvec_jiffies_conv(int *negp, unsigned long *lvalp,
> +static int do_proc_dointvec_jiffies_conv(bool *negp, unsigned long *lvalp,
> int *valp,
> int write, void *data)
> {
> @@ -2463,7 +2519,7 @@ static int do_proc_dointvec_jiffies_conv
> int val = *valp;
> unsigned long lval;
> if (val < 0) {
> - *negp = -1;
> + *negp = 1;
> lval = (unsigned long)-val;
> } else {
> *negp = 0;
> @@ -2474,7 +2530,7 @@ static int do_proc_dointvec_jiffies_conv
> return 0;
> }
>
> -static int do_proc_dointvec_userhz_jiffies_conv(int *negp, unsigned long *lvalp,
> +static int do_proc_dointvec_userhz_jiffies_conv(bool *negp, unsigned long *lvalp,
> int *valp,
> int write, void *data)
> {
> @@ -2486,7 +2542,7 @@ static int do_proc_dointvec_userhz_jiffi
> int val = *valp;
> unsigned long lval;
> if (val < 0) {
> - *negp = -1;
> + *negp = 1;
> lval = (unsigned long)-val;
> } else {
> *negp = 0;
> @@ -2497,7 +2553,7 @@ static int do_proc_dointvec_userhz_jiffi
> return 0;
> }
>
> -static int do_proc_dointvec_ms_jiffies_conv(int *negp, unsigned long *lvalp,
> +static int do_proc_dointvec_ms_jiffies_conv(bool *negp, unsigned long *lvalp,
> int *valp,
> int write, void *data)
> {
> @@ -2507,7 +2563,7 @@ static int do_proc_dointvec_ms_jiffies_c
> int val = *valp;
> unsigned long lval;
> if (val < 0) {
> - *negp = -1;
> + *negp = 1;
> lval = (unsigned long)-val;
> } else {
> *negp = 0;
These functions have a lot of lines of code. I think you can make them
shorter. Please take a look at strsep().
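
A user-space sketch of what a strsep()-based tokenizer could look like, for illustration only. It assumes the input has already been copied into a writable, NUL-terminated buffer (in the kernel that would mean copying from user space first, which is not shown), and parse_longs is a hypothetical name.

```c
#define _DEFAULT_SOURCE /* for strsep() on glibc */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split buf (modified in place) on blanks and convert up to max tokens
 * to long. Returns the number of values stored, or -1 on a bad token.
 */
static int parse_longs(char *buf, long *vals, int max)
{
	char *tok;
	int n = 0;

	assert(buf && vals);
	while (n < max && (tok = strsep(&buf, " \t\n")) != NULL) {
		char *end;

		if (*tok == '\0') /* empty token between adjacent separators */
			continue;
		errno = 0;
		vals[n] = strtol(tok, &end, 0);
		if (errno || *end != '\0')
			return -1;
		n++;
	}
	return n;
}
```

Because strsep() NUL-terminates each token, a token such as 12a is rejected as a whole rather than being read as 12 followed by garbage.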
--
Regards,
Changli Gao(xiaosuo@gmail.com)
* Re: [Patch 2/3] sysctl: add proc_do_large_bitmap
From: Changli Gao @ 2010-04-09 10:33 UTC (permalink / raw)
To: Amerigo Wang
Cc: linux-kernel, Octavian Purdila, ebiederm, Eric Dumazet, netdev,
Neil Horman, David Miller
In-Reply-To: <20100409101503.5051.3805.sendpatchset@localhost.localdomain>
On Fri, Apr 9, 2010 at 6:11 PM, Amerigo Wang <amwang@redhat.com> wrote:
> From: Octavian Purdila <opurdila@ixiacom.com>
>
> The new function can be used to read/write large bitmaps via /proc. A
> comma separated range format is used for compact output and input
> (e.g. 1,3-4,10-10).
>
> Writing into the file will first reset the bitmap then update it
> based on the given input.
>
We have bitmap_scnprintf() and bitmap_parse_user(), why invent a new suite?
--
Regards,
Changli Gao(xiaosuo@gmail.com)
* [PATCH] stmmac: updated the drv module version
From: Giuseppe CAVALLARO @ 2010-04-09 10:24 UTC (permalink / raw)
To: netdev; +Cc: Giuseppe Cavallaro
In-Reply-To: <1270808662-7115-7-git-send-email-peppe.cavallaro@st.com>
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
drivers/net/stmmac/stmmac.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/net/stmmac/stmmac.h b/drivers/net/stmmac/stmmac.h
index 1a6eb7b..ebebc64 100644
--- a/drivers/net/stmmac/stmmac.h
+++ b/drivers/net/stmmac/stmmac.h
@@ -20,7 +20,7 @@
Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
*******************************************************************************/
-#define DRV_MODULE_VERSION "Jan_2010"
+#define DRV_MODULE_VERSION "Apr_2010"
#include <linux/stmmac.h>
#include "common.h"
--
1.6.0.4
* [PATCH] stmmac: fix vlan support setup
From: Giuseppe CAVALLARO @ 2010-04-09 10:24 UTC (permalink / raw)
To: netdev; +Cc: Giuseppe Cavallaro
In-Reply-To: <1270808662-7115-6-git-send-email-peppe.cavallaro@st.com>
Moved STMMAC_VLAN_TAG_USED from stmmac.h to common.h header
because it is used within the device and descriptor cores.
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
drivers/net/stmmac/common.h | 5 +++++
drivers/net/stmmac/stmmac.h | 5 -----
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/net/stmmac/common.h b/drivers/net/stmmac/common.h
index 27a05b4..144f76f 100644
--- a/drivers/net/stmmac/common.h
+++ b/drivers/net/stmmac/common.h
@@ -23,6 +23,11 @@
*******************************************************************************/
#include <linux/netdevice.h>
+#if defined(CONFIG_VLAN_8021Q) || defined(CONFIG_VLAN_8021Q_MODULE)
+#define STMMAC_VLAN_TAG_USED
+#include <linux/if_vlan.h>
+#endif
+
#include "descs.h"
#undef CHIP_DEBUG_PRINT
diff --git a/drivers/net/stmmac/stmmac.h b/drivers/net/stmmac/stmmac.h
index 0d776bc..1a6eb7b 100644
--- a/drivers/net/stmmac/stmmac.h
+++ b/drivers/net/stmmac/stmmac.h
@@ -23,11 +23,6 @@
#define DRV_MODULE_VERSION "Jan_2010"
#include <linux/stmmac.h>
-#if defined(CONFIG_VLAN_8021Q) || defined(CONFIG_VLAN_8021Q_MODULE)
-#define STMMAC_VLAN_TAG_USED
-#include <linux/if_vlan.h>
-#endif
-
#include "common.h"
#ifdef CONFIG_STMMAC_TIMER
#include "stmmac_timer.h"
--
1.6.0.4
* [PATCH] stmmac: get the descriptor structure from platform
From: Giuseppe CAVALLARO @ 2010-04-09 10:24 UTC (permalink / raw)
To: netdev; +Cc: Giuseppe Cavallaro
In-Reply-To: <1270808662-7115-5-git-send-email-peppe.cavallaro@st.com>
Output for chip that uses the Enhanced descriptors:
[snip]
STMMAC driver:
platform registration... done!
DWMAC1000 - user ID: 0x10, Synopsys ID: 0x33
Enhanced descriptor structure
no valid MAC address;please, use ifconfig or nwhwconfig!
eth0 - (dev. name: stmmaceth - id: 0, IRQ #134
IO base addr: 0xfd110000)
STMMAC MII Bus: probed
[snip]
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
drivers/net/stmmac/stmmac.h | 1 +
drivers/net/stmmac/stmmac_main.c | 12 ++++++++----
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/net/stmmac/stmmac.h b/drivers/net/stmmac/stmmac.h
index 55b9aca..0d776bc 100644
--- a/drivers/net/stmmac/stmmac.h
+++ b/drivers/net/stmmac/stmmac.h
@@ -93,6 +93,7 @@ struct stmmac_priv {
#ifdef STMMAC_VLAN_TAG_USED
struct vlan_group *vlgrp;
#endif
+ int enh_desc;
};
#ifdef CONFIG_STM_DRIVERS
diff --git a/drivers/net/stmmac/stmmac_main.c b/drivers/net/stmmac/stmmac_main.c
index b95fa84..b3d3f7f 100644
--- a/drivers/net/stmmac/stmmac_main.c
+++ b/drivers/net/stmmac/stmmac_main.c
@@ -1581,13 +1581,16 @@ static int stmmac_mac_device_setup(struct net_device *dev)
struct mac_device_info *device;
- if (priv->is_gmac) {
+ if (priv->is_gmac)
device = dwmac1000_setup(ioaddr);
- device->desc = &enh_desc_ops;
- } else {
+ else
device = dwmac100_setup(ioaddr);
+
+ if (priv->enh_desc) {
+ device->desc = &enh_desc_ops;
+ pr_info("\tEnhanced descriptor structure\n");
+ } else
device->desc = &ndesc_ops;
- }
if (!device)
return -ENOMEM;
@@ -1728,6 +1731,7 @@ static int stmmac_dvr_probe(struct platform_device *pdev)
priv->bus_id = plat_dat->bus_id;
priv->pbl = plat_dat->pbl; /* TLI */
priv->is_gmac = plat_dat->has_gmac; /* GMAC is on board */
+ priv->enh_desc = plat_dat->enh_desc;
platform_set_drvdata(pdev, ndev);
--
1.6.0.4
* [PATCH] stmmac: new descriptor field for the driver's platform
From: Giuseppe CAVALLARO @ 2010-04-09 10:24 UTC (permalink / raw)
To: netdev; +Cc: Giuseppe Cavallaro
In-Reply-To: <1270808662-7115-4-git-send-email-peppe.cavallaro@st.com>
The new enh_desc is used for selecting the enhanced descriptors
structure. There are several scenarios; some chips (mac10/100
or gmac) want to use the enhanced descriptors; others want the normal
ones.
For example, on ST platforms: MAC10/100 uses the normal desc structure
and the GMAC uses the enhanced one.
It can be useful to get this information from the platform.
This could also be decided at run time by looking at the chip's ID number,
but chips with the same ID may want to use different descriptor
structures.
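
The selection described above can be sketched as follows. The types and names here are simplified, hypothetical stand-ins for the driver's structures; the point is only that the descriptor layout follows the platform flag rather than the MAC type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's types; names are illustrative. */
struct desc_ops { const char *name; };
struct plat_data { int has_gmac; int enh_desc; };

static const struct desc_ops enh_ops = { "enhanced" };
static const struct desc_ops norm_ops = { "normal" };

/* Pick the descriptor layout from the platform flag, not the MAC type. */
static const struct desc_ops *select_desc_ops(const struct plat_data *plat)
{
	assert(plat);
	return plat->enh_desc ? &enh_ops : &norm_ops;
}
```

With this split, a MAC10/100 platform that wants enhanced descriptors would set enh_desc without also claiming to have a GMAC.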
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
include/linux/stmmac.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h
index 32bfd1a..632ff7c 100644
--- a/include/linux/stmmac.h
+++ b/include/linux/stmmac.h
@@ -33,6 +33,7 @@ struct plat_stmmacenet_data {
int bus_id;
int pbl;
int has_gmac;
+ int enh_desc;
void (*fix_mac_speed)(void *priv, unsigned int speed);
void (*bus_setup)(unsigned long ioaddr);
#ifdef CONFIG_STM_DRIVERS
--
1.6.0.4
* [PATCH] stmmac: fix Transmit FIFO flush operation
From: Giuseppe CAVALLARO @ 2010-04-09 10:24 UTC (permalink / raw)
To: netdev; +Cc: Giuseppe Cavallaro
In-Reply-To: <1270808662-7115-3-git-send-email-peppe.cavallaro@st.com>
Fix the Transmit FIFO flush operation; it was
disabled while reworking the descriptor structures.
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
drivers/net/stmmac/common.h | 1 +
drivers/net/stmmac/dwmac1000.h | 1 -
drivers/net/stmmac/dwmac1000_dma.c | 9 ---------
drivers/net/stmmac/dwmac_dma.h | 1 +
drivers/net/stmmac/dwmac_lib.c | 7 +++++++
drivers/net/stmmac/enh_desc.c | 6 +++---
6 files changed, 12 insertions(+), 13 deletions(-)
diff --git a/drivers/net/stmmac/common.h b/drivers/net/stmmac/common.h
index bd3b785..27a05b4 100644
--- a/drivers/net/stmmac/common.h
+++ b/drivers/net/stmmac/common.h
@@ -244,3 +244,4 @@ extern void stmmac_set_mac_addr(unsigned long ioaddr, u8 addr[6],
unsigned int high, unsigned int low);
extern void stmmac_get_mac_addr(unsigned long ioaddr, unsigned char *addr,
unsigned int high, unsigned int low);
+extern void dwmac_dma_flush_tx_fifo(unsigned long ioaddr);
diff --git a/drivers/net/stmmac/dwmac1000.h b/drivers/net/stmmac/dwmac1000.h
index 3b784fc..d8d0f35 100644
--- a/drivers/net/stmmac/dwmac1000.h
+++ b/drivers/net/stmmac/dwmac1000.h
@@ -172,7 +172,6 @@ enum rfd {
deac_full_minus_4 = 0x00401800,
};
#define DMA_CONTROL_TSF 0x00200000 /* Transmit Store and Forward */
-#define DMA_CONTROL_FTF 0x00100000 /* Flush transmit FIFO */
enum ttc_control {
DMA_CONTROL_TTC_64 = 0x00000000,
diff --git a/drivers/net/stmmac/dwmac1000_dma.c b/drivers/net/stmmac/dwmac1000_dma.c
index 8d3ea99..a547aa9 100644
--- a/drivers/net/stmmac/dwmac1000_dma.c
+++ b/drivers/net/stmmac/dwmac1000_dma.c
@@ -58,15 +58,6 @@ static int dwmac1000_dma_init(unsigned long ioaddr, int pbl, u32 dma_tx,
return 0;
}
-/* Transmit FIFO flush operation */
-static void dwmac1000_flush_tx_fifo(unsigned long ioaddr)
-{
- u32 csr6 = readl(ioaddr + DMA_CONTROL);
- writel((csr6 | DMA_CONTROL_FTF), ioaddr + DMA_CONTROL);
-
- do {} while ((readl(ioaddr + DMA_CONTROL) & DMA_CONTROL_FTF));
-}
-
static void dwmac1000_dma_operation_mode(unsigned long ioaddr, int txmode,
int rxmode)
{
diff --git a/drivers/net/stmmac/dwmac_dma.h b/drivers/net/stmmac/dwmac_dma.h
index de848d9..7b815a1 100644
--- a/drivers/net/stmmac/dwmac_dma.h
+++ b/drivers/net/stmmac/dwmac_dma.h
@@ -95,6 +95,7 @@
#define DMA_STATUS_TU 0x00000004 /* Transmit Buffer Unavailable */
#define DMA_STATUS_TPS 0x00000002 /* Transmit Process Stopped */
#define DMA_STATUS_TI 0x00000001 /* Transmit Interrupt */
+#define DMA_CONTROL_FTF 0x00100000 /* Flush transmit FIFO */
extern void dwmac_enable_dma_transmission(unsigned long ioaddr);
extern void dwmac_enable_dma_irq(unsigned long ioaddr);
diff --git a/drivers/net/stmmac/dwmac_lib.c b/drivers/net/stmmac/dwmac_lib.c
index d4adb1e..0a504ad 100644
--- a/drivers/net/stmmac/dwmac_lib.c
+++ b/drivers/net/stmmac/dwmac_lib.c
@@ -227,6 +227,13 @@ int dwmac_dma_interrupt(unsigned long ioaddr,
return ret;
}
+void dwmac_dma_flush_tx_fifo(unsigned long ioaddr)
+{
+ u32 csr6 = readl(ioaddr + DMA_CONTROL);
+ writel((csr6 | DMA_CONTROL_FTF), ioaddr + DMA_CONTROL);
+
+ do {} while ((readl(ioaddr + DMA_CONTROL) & DMA_CONTROL_FTF));
+}
void stmmac_set_mac_addr(unsigned long ioaddr, u8 addr[6],
unsigned int high, unsigned int low)
diff --git a/drivers/net/stmmac/enh_desc.c b/drivers/net/stmmac/enh_desc.c
index e5ac259..eb5684a 100644
--- a/drivers/net/stmmac/enh_desc.c
+++ b/drivers/net/stmmac/enh_desc.c
@@ -40,7 +40,7 @@ static int enh_desc_get_tx_status(void *data, struct stmmac_extra_stats *x,
if (unlikely(p->des01.etx.frame_flushed)) {
CHIP_DBG(KERN_ERR "\tframe_flushed error\n");
x->tx_frame_flushed++;
- /*enh_desc_flush_tx_fifo(ioaddr);*/
+ dwmac_dma_flush_tx_fifo(ioaddr);
}
if (unlikely(p->des01.etx.loss_carrier)) {
@@ -68,7 +68,7 @@ static int enh_desc_get_tx_status(void *data, struct stmmac_extra_stats *x,
if (unlikely(p->des01.etx.underflow_error)) {
CHIP_DBG(KERN_ERR "\tunderflow error\n");
- /*enh_desc_flush_tx_fifo(ioaddr);*/
+ dwmac_dma_flush_tx_fifo(ioaddr);
x->tx_underflow++;
}
@@ -80,7 +80,7 @@ static int enh_desc_get_tx_status(void *data, struct stmmac_extra_stats *x,
if (unlikely(p->des01.etx.payload_error)) {
CHIP_DBG(KERN_ERR "\tAddr/Payload csum error\n");
x->tx_payload_error++;
- /*enh_desc_flush_tx_fifo(ioaddr);*/
+ dwmac_dma_flush_tx_fifo(ioaddr);
}
ret = -1;
--
1.6.0.4
* [PATCH] stmmac: rework normal and enhanced descriptors
From: Giuseppe CAVALLARO @ 2010-04-09 10:24 UTC (permalink / raw)
To: netdev; +Cc: Giuseppe Cavallaro
In-Reply-To: <1270808662-7115-2-git-send-email-peppe.cavallaro@st.com>
Currently the driver assumes that the mac10/100 can only use the
normal descriptor structure and the gmac can only use the
enhanced structures.
This patch removes the descriptor code from the DMA files
and adds two new files just for handling the normal and enhanced
descriptors.
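To make the split concrete, here is a minimal sketch of two descriptor
ops tables that differ only in the RX buffer size they program, plus a
selector keyed on a flag. The buffer sizes mirror the BUF_SIZE_2KiB - 1
and BUF_SIZE_8KiB - 1 values used by the normal and enhanced init_rx_desc
code; everything else (struct desc_sketch, struct desc_ops_sketch,
select_desc_ops(), and the plain integer fields in place of the real
bitfields) is hypothetical and only meant to show the shape of the idea.

```c
/* Assumed values for the driver's BUF_SIZE_* constants. */
#define SKETCH_BUF_SIZE_2KiB 2048
#define SKETCH_BUF_SIZE_8KiB 8192

/* Simplified descriptor; the real layouts are bitfields in descs.h. */
struct desc_sketch {
	unsigned int own;
	unsigned int buffer1_size;
	unsigned int end_ring;
};

/* Cut-down ops table with a single callback. */
struct desc_ops_sketch {
	void (*init_rx_desc)(struct desc_sketch *p, unsigned int ring_size);
};

/* Shared ring setup: hand every descriptor to the DMA and mark the last. */
static void init_ring(struct desc_sketch *p, unsigned int ring_size,
		      unsigned int buf_size)
{
	unsigned int i;

	for (i = 0; i < ring_size; i++) {
		p[i].own = 1;
		p[i].buffer1_size = buf_size;
		p[i].end_ring = (i == ring_size - 1);
	}
}

static void norm_init_rx_desc(struct desc_sketch *p, unsigned int ring_size)
{
	init_ring(p, ring_size, SKETCH_BUF_SIZE_2KiB - 1);
}

static void enh_init_rx_desc(struct desc_sketch *p, unsigned int ring_size)
{
	init_ring(p, ring_size, SKETCH_BUF_SIZE_8KiB - 1);
}

static const struct desc_ops_sketch norm_ops = {
	.init_rx_desc = norm_init_rx_desc,
};

static const struct desc_ops_sketch enh_ops = {
	.init_rx_desc = enh_init_rx_desc,
};

/* Hypothetical selector: the table no longer depends on the MAC core. */
static const struct desc_ops_sketch *select_desc_ops(int use_enhanced)
{
	return use_enhanced ? &enh_ops : &norm_ops;
}
```

The caller only ever sees the ops pointer, so the core and DMA code do not
need to know which layout is in use.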
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
drivers/net/stmmac/Makefile | 2 +-
drivers/net/stmmac/common.h | 15 ++-
drivers/net/stmmac/dwmac100.h | 12 --
drivers/net/stmmac/dwmac1000.h | 11 -
drivers/net/stmmac/dwmac1000_core.c | 27 ++--
drivers/net/stmmac/dwmac1000_dma.c | 327 +---------------------------------
drivers/net/stmmac/dwmac100_core.c | 3 +-
drivers/net/stmmac/dwmac100_dma.c | 223 +----------------------
drivers/net/stmmac/enh_desc.c | 342 +++++++++++++++++++++++++++++++++++
drivers/net/stmmac/norm_desc.c | 240 ++++++++++++++++++++++++
drivers/net/stmmac/stmmac.h | 2 +
drivers/net/stmmac/stmmac_main.c | 7 +-
12 files changed, 627 insertions(+), 584 deletions(-)
create mode 100644 drivers/net/stmmac/enh_desc.c
create mode 100644 drivers/net/stmmac/norm_desc.c
diff --git a/drivers/net/stmmac/Makefile b/drivers/net/stmmac/Makefile
index b14bd56..9691733 100644
--- a/drivers/net/stmmac/Makefile
+++ b/drivers/net/stmmac/Makefile
@@ -2,4 +2,4 @@ obj-$(CONFIG_STMMAC_ETH) += stmmac.o
stmmac-$(CONFIG_STMMAC_TIMER) += stmmac_timer.o
stmmac-objs:= stmmac_main.o stmmac_ethtool.o stmmac_mdio.o \
dwmac_lib.o dwmac1000_core.o dwmac1000_dma.o \
- dwmac100_core.o dwmac100_dma.o $(stmmac-y)
+ dwmac100_core.o dwmac100_dma.o enh_desc.o norm_desc.o $(stmmac-y)
diff --git a/drivers/net/stmmac/common.h b/drivers/net/stmmac/common.h
index 2a58172..bd3b785 100644
--- a/drivers/net/stmmac/common.h
+++ b/drivers/net/stmmac/common.h
@@ -22,8 +22,21 @@
Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
*******************************************************************************/
-#include "descs.h"
#include <linux/netdevice.h>
+#include "descs.h"
+
+#undef CHIP_DEBUG_PRINT
+/* Turn-on extra printk debug for MAC core, dma and descriptors */
+/* #define CHIP_DEBUG_PRINT */
+
+#ifdef CHIP_DEBUG_PRINT
+#define CHIP_DBG(fmt, args...) printk(fmt, ## args)
+#else
+#define CHIP_DBG(fmt, args...) do { } while (0)
+#endif
+
+#undef FRAME_FILTER_DEBUG
+/* #define FRAME_FILTER_DEBUG */
struct stmmac_extra_stats {
/* Transmit errors */
diff --git a/drivers/net/stmmac/dwmac100.h b/drivers/net/stmmac/dwmac100.h
index 9f4ba2e..97956cb 100644
--- a/drivers/net/stmmac/dwmac100.h
+++ b/drivers/net/stmmac/dwmac100.h
@@ -118,16 +118,4 @@ enum ttc_control {
#define DMA_MISSED_FRAME_OVE_M 0x00010000 /* Missed Frame Overflow */
#define DMA_MISSED_FRAME_M_CNTR 0x0000ffff /* Missed Frame Couinter */
-#undef DWMAC100_DEBUG
-/* #define DWMAC100__DEBUG */
-#undef FRAME_FILTER_DEBUG
-/* #define FRAME_FILTER_DEBUG */
-#ifdef DWMAC100__DEBUG
-#define DBG(fmt, args...) printk(fmt, ## args)
-#else
-#define DBG(fmt, args...) do { } while (0)
-#endif
-
extern struct stmmac_dma_ops dwmac100_dma_ops;
-extern struct stmmac_desc_ops dwmac100_desc_ops;
-
diff --git a/drivers/net/stmmac/dwmac1000.h b/drivers/net/stmmac/dwmac1000.h
index 62dca0e..3b784fc 100644
--- a/drivers/net/stmmac/dwmac1000.h
+++ b/drivers/net/stmmac/dwmac1000.h
@@ -206,15 +206,4 @@ enum rtc_control {
#define GMAC_MMC_TX_INTR 0x108
#define GMAC_MMC_RX_CSUM_OFFLOAD 0x208
-#undef DWMAC1000_DEBUG
-/* #define DWMAC1000__DEBUG */
-#undef FRAME_FILTER_DEBUG
-/* #define FRAME_FILTER_DEBUG */
-#ifdef DWMAC1000__DEBUG
-#define DBG(fmt, args...) printk(fmt, ## args)
-#else
-#define DBG(fmt, args...) do { } while (0)
-#endif
-
extern struct stmmac_dma_ops dwmac1000_dma_ops;
-extern struct stmmac_desc_ops dwmac1000_desc_ops;
diff --git a/drivers/net/stmmac/dwmac1000_core.c b/drivers/net/stmmac/dwmac1000_core.c
index a6538ae..bfcad4a 100644
--- a/drivers/net/stmmac/dwmac1000_core.c
+++ b/drivers/net/stmmac/dwmac1000_core.c
@@ -82,8 +82,8 @@ static void dwmac1000_set_filter(struct net_device *dev)
unsigned long ioaddr = dev->base_addr;
unsigned int value = 0;
- DBG(KERN_INFO "%s: # mcasts %d, # unicast %d\n",
- __func__, netdev_mc_count(dev), netdev_uc_count(dev));
+ CHIP_DBG(KERN_INFO "%s: # mcasts %d, # unicast %d\n",
+ __func__, netdev_mc_count(dev), netdev_uc_count(dev));
if (dev->flags & IFF_PROMISC)
value = GMAC_FRAME_FILTER_PR;
@@ -135,7 +135,7 @@ static void dwmac1000_set_filter(struct net_device *dev)
#endif
writel(value, ioaddr + GMAC_FRAME_FILTER);
- DBG(KERN_INFO "\tFrame Filter reg: 0x%08x\n\tHash regs: "
+ CHIP_DBG(KERN_INFO "\tFrame Filter reg: 0x%08x\n\tHash regs: "
"HI 0x%08x, LO 0x%08x\n", readl(ioaddr + GMAC_FRAME_FILTER),
readl(ioaddr + GMAC_HASH_HIGH), readl(ioaddr + GMAC_HASH_LOW));
@@ -147,18 +147,18 @@ static void dwmac1000_flow_ctrl(unsigned long ioaddr, unsigned int duplex,
{
unsigned int flow = 0;
- DBG(KERN_DEBUG "GMAC Flow-Control:\n");
+ CHIP_DBG(KERN_DEBUG "GMAC Flow-Control:\n");
if (fc & FLOW_RX) {
- DBG(KERN_DEBUG "\tReceive Flow-Control ON\n");
+ CHIP_DBG(KERN_DEBUG "\tReceive Flow-Control ON\n");
flow |= GMAC_FLOW_CTRL_RFE;
}
if (fc & FLOW_TX) {
- DBG(KERN_DEBUG "\tTransmit Flow-Control ON\n");
+ CHIP_DBG(KERN_DEBUG "\tTransmit Flow-Control ON\n");
flow |= GMAC_FLOW_CTRL_TFE;
}
if (duplex) {
- DBG(KERN_DEBUG "\tduplex mode: pause time: %d\n", pause_time);
+ CHIP_DBG(KERN_DEBUG "\tduplex mode: PAUSE %d\n", pause_time);
flow |= (pause_time << GMAC_FLOW_CTRL_PT_SHIFT);
}
@@ -171,10 +171,10 @@ static void dwmac1000_pmt(unsigned long ioaddr, unsigned long mode)
unsigned int pmt = 0;
if (mode == WAKE_MAGIC) {
- DBG(KERN_DEBUG "GMAC: WOL Magic frame\n");
+ CHIP_DBG(KERN_DEBUG "GMAC: WOL Magic frame\n");
pmt |= power_down | magic_pkt_en;
} else if (mode == WAKE_UCAST) {
- DBG(KERN_DEBUG "GMAC: WOL on global unicast\n");
+ CHIP_DBG(KERN_DEBUG "GMAC: WOL on global unicast\n");
pmt |= global_unicast;
}
@@ -189,16 +189,16 @@ static void dwmac1000_irq_status(unsigned long ioaddr)
/* Not used events (e.g. MMC interrupts) are not handled. */
if ((intr_status & mmc_tx_irq))
- DBG(KERN_DEBUG "GMAC: MMC tx interrupt: 0x%08x\n",
+ CHIP_DBG(KERN_DEBUG "GMAC: MMC tx interrupt: 0x%08x\n",
readl(ioaddr + GMAC_MMC_TX_INTR));
if (unlikely(intr_status & mmc_rx_irq))
- DBG(KERN_DEBUG "GMAC: MMC rx interrupt: 0x%08x\n",
+ CHIP_DBG(KERN_DEBUG "GMAC: MMC rx interrupt: 0x%08x\n",
readl(ioaddr + GMAC_MMC_RX_INTR));
if (unlikely(intr_status & mmc_rx_csum_offload_irq))
- DBG(KERN_DEBUG "GMAC: MMC rx csum offload: 0x%08x\n",
+ CHIP_DBG(KERN_DEBUG "GMAC: MMC rx csum offload: 0x%08x\n",
readl(ioaddr + GMAC_MMC_RX_CSUM_OFFLOAD));
if (unlikely(intr_status & pmt_irq)) {
- DBG(KERN_DEBUG "GMAC: received Magic frame\n");
+ CHIP_DBG(KERN_DEBUG "GMAC: received Magic frame\n");
/* clear the PMT bits 5 and 6 by reading the PMT
* status register. */
readl(ioaddr + GMAC_PMT);
@@ -229,7 +229,6 @@ struct mac_device_info *dwmac1000_setup(unsigned long ioaddr)
mac = kzalloc(sizeof(const struct mac_device_info), GFP_KERNEL);
mac->mac = &dwmac1000_ops;
- mac->desc = &dwmac1000_desc_ops;
mac->dma = &dwmac1000_dma_ops;
mac->pmt = PMT_SUPPORTED;
diff --git a/drivers/net/stmmac/dwmac1000_dma.c b/drivers/net/stmmac/dwmac1000_dma.c
index 39d436a..8d3ea99 100644
--- a/drivers/net/stmmac/dwmac1000_dma.c
+++ b/drivers/net/stmmac/dwmac1000_dma.c
@@ -3,7 +3,7 @@
DWC Ether MAC 10/100/1000 Universal version 3.41a has been used for
developing this code.
- This contains the functions to handle the dma and descriptors.
+ This contains the functions to handle the dma.
Copyright (C) 2007-2009 STMicroelectronics Ltd
@@ -73,14 +73,14 @@ static void dwmac1000_dma_operation_mode(unsigned long ioaddr, int txmode,
u32 csr6 = readl(ioaddr + DMA_CONTROL);
if (txmode == SF_DMA_MODE) {
- DBG(KERN_DEBUG "GMAC: enabling TX store and forward mode\n");
+ CHIP_DBG(KERN_DEBUG "GMAC: enable TX store and forward mode\n");
/* Transmit COE type 2 cannot be done in cut-through mode. */
csr6 |= DMA_CONTROL_TSF;
/* Operating on second frame increase the performance
* especially when transmit store-and-forward is used.*/
csr6 |= DMA_CONTROL_OSF;
} else {
- DBG(KERN_DEBUG "GMAC: disabling TX store and forward mode"
+ CHIP_DBG(KERN_DEBUG "GMAC: disabling TX store and forward mode"
" (threshold = %d)\n", txmode);
csr6 &= ~DMA_CONTROL_TSF;
csr6 &= DMA_CONTROL_TC_TX_MASK;
@@ -98,10 +98,10 @@ static void dwmac1000_dma_operation_mode(unsigned long ioaddr, int txmode,
}
if (rxmode == SF_DMA_MODE) {
- DBG(KERN_DEBUG "GMAC: enabling RX store and forward mode\n");
+ CHIP_DBG(KERN_DEBUG "GMAC: enable RX store and forward mode\n");
csr6 |= DMA_CONTROL_RSF;
} else {
- DBG(KERN_DEBUG "GMAC: disabling RX store and forward mode"
+ CHIP_DBG(KERN_DEBUG "GMAC: disabling RX store and forward mode"
" (threshold = %d)\n", rxmode);
csr6 &= ~DMA_CONTROL_RSF;
csr6 &= DMA_CONTROL_TC_RX_MASK;
@@ -141,305 +141,6 @@ static void dwmac1000_dump_dma_regs(unsigned long ioaddr)
return;
}
-static int dwmac1000_get_tx_frame_status(void *data,
- struct stmmac_extra_stats *x,
- struct dma_desc *p, unsigned long ioaddr)
-{
- int ret = 0;
- struct net_device_stats *stats = (struct net_device_stats *)data;
-
- if (unlikely(p->des01.etx.error_summary)) {
- DBG(KERN_ERR "GMAC TX error... 0x%08x\n", p->des01.etx);
- if (unlikely(p->des01.etx.jabber_timeout)) {
- DBG(KERN_ERR "\tjabber_timeout error\n");
- x->tx_jabber++;
- }
-
- if (unlikely(p->des01.etx.frame_flushed)) {
- DBG(KERN_ERR "\tframe_flushed error\n");
- x->tx_frame_flushed++;
- dwmac1000_flush_tx_fifo(ioaddr);
- }
-
- if (unlikely(p->des01.etx.loss_carrier)) {
- DBG(KERN_ERR "\tloss_carrier error\n");
- x->tx_losscarrier++;
- stats->tx_carrier_errors++;
- }
- if (unlikely(p->des01.etx.no_carrier)) {
- DBG(KERN_ERR "\tno_carrier error\n");
- x->tx_carrier++;
- stats->tx_carrier_errors++;
- }
- if (unlikely(p->des01.etx.late_collision)) {
- DBG(KERN_ERR "\tlate_collision error\n");
- stats->collisions += p->des01.etx.collision_count;
- }
- if (unlikely(p->des01.etx.excessive_collisions)) {
- DBG(KERN_ERR "\texcessive_collisions\n");
- stats->collisions += p->des01.etx.collision_count;
- }
- if (unlikely(p->des01.etx.excessive_deferral)) {
- DBG(KERN_INFO "\texcessive tx_deferral\n");
- x->tx_deferred++;
- }
-
- if (unlikely(p->des01.etx.underflow_error)) {
- DBG(KERN_ERR "\tunderflow error\n");
- dwmac1000_flush_tx_fifo(ioaddr);
- x->tx_underflow++;
- }
-
- if (unlikely(p->des01.etx.ip_header_error)) {
- DBG(KERN_ERR "\tTX IP header csum error\n");
- x->tx_ip_header_error++;
- }
-
- if (unlikely(p->des01.etx.payload_error)) {
- DBG(KERN_ERR "\tAddr/Payload csum error\n");
- x->tx_payload_error++;
- dwmac1000_flush_tx_fifo(ioaddr);
- }
-
- ret = -1;
- }
-
- if (unlikely(p->des01.etx.deferred)) {
- DBG(KERN_INFO "GMAC TX status: tx deferred\n");
- x->tx_deferred++;
- }
-#ifdef STMMAC_VLAN_TAG_USED
- if (p->des01.etx.vlan_frame) {
- DBG(KERN_INFO "GMAC TX status: VLAN frame\n");
- x->tx_vlan++;
- }
-#endif
-
- return ret;
-}
-
-static int dwmac1000_get_tx_len(struct dma_desc *p)
-{
- return p->des01.etx.buffer1_size;
-}
-
-static int dwmac1000_coe_rdes0(int ipc_err, int type, int payload_err)
-{
- int ret = good_frame;
- u32 status = (type << 2 | ipc_err << 1 | payload_err) & 0x7;
-
- /* bits 5 7 0 | Frame status
- * ----------------------------------------------------------
- * 0 0 0 | IEEE 802.3 Type frame (length < 1536 octects)
- * 1 0 0 | IPv4/6 No CSUM errorS.
- * 1 0 1 | IPv4/6 CSUM PAYLOAD error
- * 1 1 0 | IPv4/6 CSUM IP HR error
- * 1 1 1 | IPv4/6 IP PAYLOAD AND HEADER errorS
- * 0 0 1 | IPv4/6 unsupported IP PAYLOAD
- * 0 1 1 | COE bypassed.. no IPv4/6 frame
- * 0 1 0 | Reserved.
- */
- if (status == 0x0) {
- DBG(KERN_INFO "RX Des0 status: IEEE 802.3 Type frame.\n");
- ret = good_frame;
- } else if (status == 0x4) {
- DBG(KERN_INFO "RX Des0 status: IPv4/6 No CSUM errorS.\n");
- ret = good_frame;
- } else if (status == 0x5) {
- DBG(KERN_ERR "RX Des0 status: IPv4/6 Payload Error.\n");
- ret = csum_none;
- } else if (status == 0x6) {
- DBG(KERN_ERR "RX Des0 status: IPv4/6 Header Error.\n");
- ret = csum_none;
- } else if (status == 0x7) {
- DBG(KERN_ERR
- "RX Des0 status: IPv4/6 Header and Payload Error.\n");
- ret = csum_none;
- } else if (status == 0x1) {
- DBG(KERN_ERR
- "RX Des0 status: IPv4/6 unsupported IP PAYLOAD.\n");
- ret = discard_frame;
- } else if (status == 0x3) {
- DBG(KERN_ERR "RX Des0 status: No IPv4, IPv6 frame.\n");
- ret = discard_frame;
- }
- return ret;
-}
-
-static int dwmac1000_get_rx_frame_status(void *data,
- struct stmmac_extra_stats *x, struct dma_desc *p)
-{
- int ret = good_frame;
- struct net_device_stats *stats = (struct net_device_stats *)data;
-
- if (unlikely(p->des01.erx.error_summary)) {
- DBG(KERN_ERR "GMAC RX Error Summary... 0x%08x\n", p->des01.erx);
- if (unlikely(p->des01.erx.descriptor_error)) {
- DBG(KERN_ERR "\tdescriptor error\n");
- x->rx_desc++;
- stats->rx_length_errors++;
- }
- if (unlikely(p->des01.erx.overflow_error)) {
- DBG(KERN_ERR "\toverflow error\n");
- x->rx_gmac_overflow++;
- }
-
- if (unlikely(p->des01.erx.ipc_csum_error))
- DBG(KERN_ERR "\tIPC Csum Error/Giant frame\n");
-
- if (unlikely(p->des01.erx.late_collision)) {
- DBG(KERN_ERR "\tlate_collision error\n");
- stats->collisions++;
- stats->collisions++;
- }
- if (unlikely(p->des01.erx.receive_watchdog)) {
- DBG(KERN_ERR "\treceive_watchdog error\n");
- x->rx_watchdog++;
- }
- if (unlikely(p->des01.erx.error_gmii)) {
- DBG(KERN_ERR "\tReceive Error\n");
- x->rx_mii++;
- }
- if (unlikely(p->des01.erx.crc_error)) {
- DBG(KERN_ERR "\tCRC error\n");
- x->rx_crc++;
- stats->rx_crc_errors++;
- }
- ret = discard_frame;
- }
-
- /* After a payload csum error, the ES bit is set.
- * It doesn't match with the information reported into the databook.
- * At any rate, we need to understand if the CSUM hw computation is ok
- * and report this info to the upper layers. */
- ret = dwmac1000_coe_rdes0(p->des01.erx.ipc_csum_error,
- p->des01.erx.frame_type, p->des01.erx.payload_csum_error);
-
- if (unlikely(p->des01.erx.dribbling)) {
- DBG(KERN_ERR "GMAC RX: dribbling error\n");
- ret = discard_frame;
- }
- if (unlikely(p->des01.erx.sa_filter_fail)) {
- DBG(KERN_ERR "GMAC RX : Source Address filter fail\n");
- x->sa_rx_filter_fail++;
- ret = discard_frame;
- }
- if (unlikely(p->des01.erx.da_filter_fail)) {
- DBG(KERN_ERR "GMAC RX : Destination Address filter fail\n");
- x->da_rx_filter_fail++;
- ret = discard_frame;
- }
- if (unlikely(p->des01.erx.length_error)) {
- DBG(KERN_ERR "GMAC RX: length_error error\n");
- x->rx_length++;
- ret = discard_frame;
- }
-#ifdef STMMAC_VLAN_TAG_USED
- if (p->des01.erx.vlan_tag) {
- DBG(KERN_INFO "GMAC RX: VLAN frame tagged\n");
- x->rx_vlan++;
- }
-#endif
- return ret;
-}
-
-static void dwmac1000_init_rx_desc(struct dma_desc *p, unsigned int ring_size,
- int disable_rx_ic)
-{
- int i;
- for (i = 0; i < ring_size; i++) {
- p->des01.erx.own = 1;
- p->des01.erx.buffer1_size = BUF_SIZE_8KiB - 1;
- /* To support jumbo frames */
- p->des01.erx.buffer2_size = BUF_SIZE_8KiB - 1;
- if (i == ring_size - 1)
- p->des01.erx.end_ring = 1;
- if (disable_rx_ic)
- p->des01.erx.disable_ic = 1;
- p++;
- }
- return;
-}
-
-static void dwmac1000_init_tx_desc(struct dma_desc *p, unsigned int ring_size)
-{
- int i;
-
- for (i = 0; i < ring_size; i++) {
- p->des01.etx.own = 0;
- if (i == ring_size - 1)
- p->des01.etx.end_ring = 1;
- p++;
- }
-
- return;
-}
-
-static int dwmac1000_get_tx_owner(struct dma_desc *p)
-{
- return p->des01.etx.own;
-}
-
-static int dwmac1000_get_rx_owner(struct dma_desc *p)
-{
- return p->des01.erx.own;
-}
-
-static void dwmac1000_set_tx_owner(struct dma_desc *p)
-{
- p->des01.etx.own = 1;
-}
-
-static void dwmac1000_set_rx_owner(struct dma_desc *p)
-{
- p->des01.erx.own = 1;
-}
-
-static int dwmac1000_get_tx_ls(struct dma_desc *p)
-{
- return p->des01.etx.last_segment;
-}
-
-static void dwmac1000_release_tx_desc(struct dma_desc *p)
-{
- int ter = p->des01.etx.end_ring;
-
- memset(p, 0, sizeof(struct dma_desc));
- p->des01.etx.end_ring = ter;
-
- return;
-}
-
-static void dwmac1000_prepare_tx_desc(struct dma_desc *p, int is_fs, int len,
- int csum_flag)
-{
- p->des01.etx.first_segment = is_fs;
- if (unlikely(len > BUF_SIZE_4KiB)) {
- p->des01.etx.buffer1_size = BUF_SIZE_4KiB;
- p->des01.etx.buffer2_size = len - BUF_SIZE_4KiB;
- } else {
- p->des01.etx.buffer1_size = len;
- }
- if (likely(csum_flag))
- p->des01.etx.checksum_insertion = cic_full;
-}
-
-static void dwmac1000_clear_tx_ic(struct dma_desc *p)
-{
- p->des01.etx.interrupt = 0;
-}
-
-static void dwmac1000_close_tx_desc(struct dma_desc *p)
-{
- p->des01.etx.last_segment = 1;
- p->des01.etx.interrupt = 1;
-}
-
-static int dwmac1000_get_rx_frame_len(struct dma_desc *p)
-{
- return p->des01.erx.frame_length;
-}
-
struct stmmac_dma_ops dwmac1000_dma_ops = {
.init = dwmac1000_dma_init,
.dump_regs = dwmac1000_dump_dma_regs,
@@ -454,21 +155,3 @@ struct stmmac_dma_ops dwmac1000_dma_ops = {
.stop_rx = dwmac_dma_stop_rx,
.dma_interrupt = dwmac_dma_interrupt,
};
-
-struct stmmac_desc_ops dwmac1000_desc_ops = {
- .tx_status = dwmac1000_get_tx_frame_status,
- .rx_status = dwmac1000_get_rx_frame_status,
- .get_tx_len = dwmac1000_get_tx_len,
- .init_rx_desc = dwmac1000_init_rx_desc,
- .init_tx_desc = dwmac1000_init_tx_desc,
- .get_tx_owner = dwmac1000_get_tx_owner,
- .get_rx_owner = dwmac1000_get_rx_owner,
- .release_tx_desc = dwmac1000_release_tx_desc,
- .prepare_tx_desc = dwmac1000_prepare_tx_desc,
- .clear_tx_ic = dwmac1000_clear_tx_ic,
- .close_tx_desc = dwmac1000_close_tx_desc,
- .get_tx_ls = dwmac1000_get_tx_ls,
- .set_tx_owner = dwmac1000_set_tx_owner,
- .set_rx_owner = dwmac1000_set_rx_owner,
- .get_rx_frame_len = dwmac1000_get_rx_frame_len,
-};
diff --git a/drivers/net/stmmac/dwmac100_core.c b/drivers/net/stmmac/dwmac100_core.c
index 7455a0c..e31d0d7 100644
--- a/drivers/net/stmmac/dwmac100_core.c
+++ b/drivers/net/stmmac/dwmac100_core.c
@@ -141,7 +141,7 @@ static void dwmac100_set_filter(struct net_device *dev)
writel(value, ioaddr + MAC_CONTROL);
- DBG(KERN_INFO "%s: CTRL reg: 0x%08x Hash regs: "
+ CHIP_DBG(KERN_INFO "%s: CTRL reg: 0x%08x Hash regs: "
"HI 0x%08x, LO 0x%08x\n",
__func__, readl(ioaddr + MAC_CONTROL),
readl(ioaddr + MAC_HASH_HIGH), readl(ioaddr + MAC_HASH_LOW));
@@ -188,7 +188,6 @@ struct mac_device_info *dwmac100_setup(unsigned long ioaddr)
pr_info("\tDWMAC100\n");
mac->mac = &dwmac100_ops;
- mac->desc = &dwmac100_desc_ops;
mac->dma = &dwmac100_dma_ops;
mac->pmt = PMT_NOT_SUPPORTED;
diff --git a/drivers/net/stmmac/dwmac100_dma.c b/drivers/net/stmmac/dwmac100_dma.c
index 7fcc526..96d098d 100644
--- a/drivers/net/stmmac/dwmac100_dma.c
+++ b/drivers/net/stmmac/dwmac100_dma.c
@@ -5,7 +5,7 @@
DWC Ether MAC 10/100 Universal version 4.0 has been used for developing
this code.
- This contains the functions to handle the dma and descriptors.
+ This contains the functions to handle the dma.
Copyright (C) 2007-2009 STMicroelectronics Ltd
@@ -79,14 +79,14 @@ static void dwmac100_dump_dma_regs(unsigned long ioaddr)
{
int i;
- DBG(KERN_DEBUG "DWMAC 100 DMA CSR\n");
+ CHIP_DBG(KERN_DEBUG "DWMAC 100 DMA CSR\n");
for (i = 0; i < 9; i++)
pr_debug("\t CSR%d (offset 0x%x): 0x%08x\n", i,
(DMA_BUS_MODE + i * 4),
readl(ioaddr + DMA_BUS_MODE + i * 4));
- DBG(KERN_DEBUG "\t CSR20 (offset 0x%x): 0x%08x\n",
+ CHIP_DBG(KERN_DEBUG "\t CSR20 (offset 0x%x): 0x%08x\n",
DMA_CUR_TX_BUF_ADDR, readl(ioaddr + DMA_CUR_TX_BUF_ADDR));
- DBG(KERN_DEBUG "\t CSR21 (offset 0x%x): 0x%08x\n",
+ CHIP_DBG(KERN_DEBUG "\t CSR21 (offset 0x%x): 0x%08x\n",
DMA_CUR_RX_BUF_ADDR, readl(ioaddr + DMA_CUR_RX_BUF_ADDR));
return;
}
@@ -122,203 +122,6 @@ static void dwmac100_dma_diagnostic_fr(void *data, struct stmmac_extra_stats *x,
return;
}
-static int dwmac100_get_tx_status(void *data, struct stmmac_extra_stats *x,
- struct dma_desc *p, unsigned long ioaddr)
-{
- int ret = 0;
- struct net_device_stats *stats = (struct net_device_stats *)data;
-
- if (unlikely(p->des01.tx.error_summary)) {
- if (unlikely(p->des01.tx.underflow_error)) {
- x->tx_underflow++;
- stats->tx_fifo_errors++;
- }
- if (unlikely(p->des01.tx.no_carrier)) {
- x->tx_carrier++;
- stats->tx_carrier_errors++;
- }
- if (unlikely(p->des01.tx.loss_carrier)) {
- x->tx_losscarrier++;
- stats->tx_carrier_errors++;
- }
- if (unlikely((p->des01.tx.excessive_deferral) ||
- (p->des01.tx.excessive_collisions) ||
- (p->des01.tx.late_collision)))
- stats->collisions += p->des01.tx.collision_count;
- ret = -1;
- }
- if (unlikely(p->des01.tx.heartbeat_fail)) {
- x->tx_heartbeat++;
- stats->tx_heartbeat_errors++;
- ret = -1;
- }
- if (unlikely(p->des01.tx.deferred))
- x->tx_deferred++;
-
- return ret;
-}
-
-static int dwmac100_get_tx_len(struct dma_desc *p)
-{
- return p->des01.tx.buffer1_size;
-}
-
-/* This function verifies if each incoming frame has some errors
- * and, if required, updates the multicast statistics.
- * In case of success, it returns csum_none becasue the device
- * is not able to compute the csum in HW. */
-static int dwmac100_get_rx_status(void *data, struct stmmac_extra_stats *x,
- struct dma_desc *p)
-{
- int ret = csum_none;
- struct net_device_stats *stats = (struct net_device_stats *)data;
-
- if (unlikely(p->des01.rx.last_descriptor == 0)) {
- pr_warning("dwmac100 Error: Oversized Ethernet "
- "frame spanned multiple buffers\n");
- stats->rx_length_errors++;
- return discard_frame;
- }
-
- if (unlikely(p->des01.rx.error_summary)) {
- if (unlikely(p->des01.rx.descriptor_error))
- x->rx_desc++;
- if (unlikely(p->des01.rx.partial_frame_error))
- x->rx_partial++;
- if (unlikely(p->des01.rx.run_frame))
- x->rx_runt++;
- if (unlikely(p->des01.rx.frame_too_long))
- x->rx_toolong++;
- if (unlikely(p->des01.rx.collision)) {
- x->rx_collision++;
- stats->collisions++;
- }
- if (unlikely(p->des01.rx.crc_error)) {
- x->rx_crc++;
- stats->rx_crc_errors++;
- }
- ret = discard_frame;
- }
- if (unlikely(p->des01.rx.dribbling))
- ret = discard_frame;
-
- if (unlikely(p->des01.rx.length_error)) {
- x->rx_length++;
- ret = discard_frame;
- }
- if (unlikely(p->des01.rx.mii_error)) {
- x->rx_mii++;
- ret = discard_frame;
- }
- if (p->des01.rx.multicast_frame) {
- x->rx_multicast++;
- stats->multicast++;
- }
- return ret;
-}
-
-static void dwmac100_init_rx_desc(struct dma_desc *p, unsigned int ring_size,
- int disable_rx_ic)
-{
- int i;
- for (i = 0; i < ring_size; i++) {
- p->des01.rx.own = 1;
- p->des01.rx.buffer1_size = BUF_SIZE_2KiB - 1;
- if (i == ring_size - 1)
- p->des01.rx.end_ring = 1;
- if (disable_rx_ic)
- p->des01.rx.disable_ic = 1;
- p++;
- }
- return;
-}
-
-static void dwmac100_init_tx_desc(struct dma_desc *p, unsigned int ring_size)
-{
- int i;
- for (i = 0; i < ring_size; i++) {
- p->des01.tx.own = 0;
- if (i == ring_size - 1)
- p->des01.tx.end_ring = 1;
- p++;
- }
- return;
-}
-
-static int dwmac100_get_tx_owner(struct dma_desc *p)
-{
- return p->des01.tx.own;
-}
-
-static int dwmac100_get_rx_owner(struct dma_desc *p)
-{
- return p->des01.rx.own;
-}
-
-static void dwmac100_set_tx_owner(struct dma_desc *p)
-{
- p->des01.tx.own = 1;
-}
-
-static void dwmac100_set_rx_owner(struct dma_desc *p)
-{
- p->des01.rx.own = 1;
-}
-
-static int dwmac100_get_tx_ls(struct dma_desc *p)
-{
- return p->des01.tx.last_segment;
-}
-
-static void dwmac100_release_tx_desc(struct dma_desc *p)
-{
- int ter = p->des01.tx.end_ring;
-
- /* clean field used within the xmit */
- p->des01.tx.first_segment = 0;
- p->des01.tx.last_segment = 0;
- p->des01.tx.buffer1_size = 0;
-
- /* clean status reported */
- p->des01.tx.error_summary = 0;
- p->des01.tx.underflow_error = 0;
- p->des01.tx.no_carrier = 0;
- p->des01.tx.loss_carrier = 0;
- p->des01.tx.excessive_deferral = 0;
- p->des01.tx.excessive_collisions = 0;
- p->des01.tx.late_collision = 0;
- p->des01.tx.heartbeat_fail = 0;
- p->des01.tx.deferred = 0;
-
- /* set termination field */
- p->des01.tx.end_ring = ter;
-
- return;
-}
-
-static void dwmac100_prepare_tx_desc(struct dma_desc *p, int is_fs, int len,
- int csum_flag)
-{
- p->des01.tx.first_segment = is_fs;
- p->des01.tx.buffer1_size = len;
-}
-
-static void dwmac100_clear_tx_ic(struct dma_desc *p)
-{
- p->des01.tx.interrupt = 0;
-}
-
-static void dwmac100_close_tx_desc(struct dma_desc *p)
-{
- p->des01.tx.last_segment = 1;
- p->des01.tx.interrupt = 1;
-}
-
-static int dwmac100_get_rx_frame_len(struct dma_desc *p)
-{
- return p->des01.rx.frame_length;
-}
-
struct stmmac_dma_ops dwmac100_dma_ops = {
.init = dwmac100_dma_init,
.dump_regs = dwmac100_dump_dma_regs,
@@ -333,21 +136,3 @@ struct stmmac_dma_ops dwmac100_dma_ops = {
.stop_rx = dwmac_dma_stop_rx,
.dma_interrupt = dwmac_dma_interrupt,
};
-
-struct stmmac_desc_ops dwmac100_desc_ops = {
- .tx_status = dwmac100_get_tx_status,
- .rx_status = dwmac100_get_rx_status,
- .get_tx_len = dwmac100_get_tx_len,
- .init_rx_desc = dwmac100_init_rx_desc,
- .init_tx_desc = dwmac100_init_tx_desc,
- .get_tx_owner = dwmac100_get_tx_owner,
- .get_rx_owner = dwmac100_get_rx_owner,
- .release_tx_desc = dwmac100_release_tx_desc,
- .prepare_tx_desc = dwmac100_prepare_tx_desc,
- .clear_tx_ic = dwmac100_clear_tx_ic,
- .close_tx_desc = dwmac100_close_tx_desc,
- .get_tx_ls = dwmac100_get_tx_ls,
- .set_tx_owner = dwmac100_set_tx_owner,
- .set_rx_owner = dwmac100_set_rx_owner,
- .get_rx_frame_len = dwmac100_get_rx_frame_len,
-};
diff --git a/drivers/net/stmmac/enh_desc.c b/drivers/net/stmmac/enh_desc.c
new file mode 100644
index 0000000..e5ac259
--- /dev/null
+++ b/drivers/net/stmmac/enh_desc.c
@@ -0,0 +1,342 @@
+/*******************************************************************************
+ This contains the functions to handle the enhanced descriptors.
+
+ Copyright (C) 2007-2009 STMicroelectronics Ltd
+
+ This program is free software; you can redistribute it and/or modify it
+ under the terms and conditions of the GNU General Public License,
+ version 2, as published by the Free Software Foundation.
+
+ This program is distributed in the hope it will be useful, but WITHOUT
+ ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ more details.
+
+ You should have received a copy of the GNU General Public License along with
+ this program; if not, write to the Free Software Foundation, Inc.,
+ 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+
+ The full GNU General Public License is included in this distribution in
+ the file called "COPYING".
+
+ Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
+*******************************************************************************/
+
+#include "common.h"
+
+static int enh_desc_get_tx_status(void *data, struct stmmac_extra_stats *x,
+ struct dma_desc *p, unsigned long ioaddr)
+{
+ int ret = 0;
+ struct net_device_stats *stats = (struct net_device_stats *)data;
+
+ if (unlikely(p->des01.etx.error_summary)) {
+ CHIP_DBG(KERN_ERR "GMAC TX error... 0x%08x\n", p->des01.etx);
+ if (unlikely(p->des01.etx.jabber_timeout)) {
+ CHIP_DBG(KERN_ERR "\tjabber_timeout error\n");
+ x->tx_jabber++;
+ }
+
+ if (unlikely(p->des01.etx.frame_flushed)) {
+ CHIP_DBG(KERN_ERR "\tframe_flushed error\n");
+ x->tx_frame_flushed++;
+ /*enh_desc_flush_tx_fifo(ioaddr);*/
+ }
+
+ if (unlikely(p->des01.etx.loss_carrier)) {
+ CHIP_DBG(KERN_ERR "\tloss_carrier error\n");
+ x->tx_losscarrier++;
+ stats->tx_carrier_errors++;
+ }
+ if (unlikely(p->des01.etx.no_carrier)) {
+ CHIP_DBG(KERN_ERR "\tno_carrier error\n");
+ x->tx_carrier++;
+ stats->tx_carrier_errors++;
+ }
+ if (unlikely(p->des01.etx.late_collision)) {
+ CHIP_DBG(KERN_ERR "\tlate_collision error\n");
+ stats->collisions += p->des01.etx.collision_count;
+ }
+ if (unlikely(p->des01.etx.excessive_collisions)) {
+ CHIP_DBG(KERN_ERR "\texcessive_collisions\n");
+ stats->collisions += p->des01.etx.collision_count;
+ }
+ if (unlikely(p->des01.etx.excessive_deferral)) {
+ CHIP_DBG(KERN_INFO "\texcessive tx_deferral\n");
+ x->tx_deferred++;
+ }
+
+ if (unlikely(p->des01.etx.underflow_error)) {
+ CHIP_DBG(KERN_ERR "\tunderflow error\n");
+ /*enh_desc_flush_tx_fifo(ioaddr);*/
+ x->tx_underflow++;
+ }
+
+ if (unlikely(p->des01.etx.ip_header_error)) {
+ CHIP_DBG(KERN_ERR "\tTX IP header csum error\n");
+ x->tx_ip_header_error++;
+ }
+
+ if (unlikely(p->des01.etx.payload_error)) {
+ CHIP_DBG(KERN_ERR "\tAddr/Payload csum error\n");
+ x->tx_payload_error++;
+ /*enh_desc_flush_tx_fifo(ioaddr);*/
+ }
+
+ ret = -1;
+ }
+
+ if (unlikely(p->des01.etx.deferred)) {
+ CHIP_DBG(KERN_INFO "GMAC TX status: tx deferred\n");
+ x->tx_deferred++;
+ }
+#ifdef STMMAC_VLAN_TAG_USED
+ if (p->des01.etx.vlan_frame) {
+ CHIP_DBG(KERN_INFO "GMAC TX status: VLAN frame\n");
+ x->tx_vlan++;
+ }
+#endif
+
+ return ret;
+}
+
+static int enh_desc_get_tx_len(struct dma_desc *p)
+{
+ return p->des01.etx.buffer1_size;
+}
+
+static int enh_desc_coe_rdes0(int ipc_err, int type, int payload_err)
+{
+ int ret = good_frame;
+ u32 status = (type << 2 | ipc_err << 1 | payload_err) & 0x7;
+
+ /* bits 5 7 0 | Frame status
+ * ----------------------------------------------------------
+ * 0 0 0 | IEEE 802.3 Type frame (length < 1536 octets)
+ * 1 0 0 | IPv4/6 No CSUM errorS.
+ * 1 0 1 | IPv4/6 CSUM PAYLOAD error
+ * 1 1 0 | IPv4/6 CSUM IP HR error
+ * 1 1 1 | IPv4/6 IP PAYLOAD AND HEADER errorS
+ * 0 0 1 | IPv4/6 unsupported IP PAYLOAD
+ * 0 1 1 | COE bypassed.. no IPv4/6 frame
+ * 0 1 0 | Reserved.
+ */
+ if (status == 0x0) {
+ CHIP_DBG(KERN_INFO "RX Des0 status: IEEE 802.3 Type frame.\n");
+ ret = good_frame;
+ } else if (status == 0x4) {
+ CHIP_DBG(KERN_INFO "RX Des0 status: IPv4/6 No CSUM errorS.\n");
+ ret = good_frame;
+ } else if (status == 0x5) {
+ CHIP_DBG(KERN_ERR "RX Des0 status: IPv4/6 Payload Error.\n");
+ ret = csum_none;
+ } else if (status == 0x6) {
+ CHIP_DBG(KERN_ERR "RX Des0 status: IPv4/6 Header Error.\n");
+ ret = csum_none;
+ } else if (status == 0x7) {
+ CHIP_DBG(KERN_ERR
+ "RX Des0 status: IPv4/6 Header and Payload Error.\n");
+ ret = csum_none;
+ } else if (status == 0x1) {
+ CHIP_DBG(KERN_ERR
+ "RX Des0 status: IPv4/6 unsupported IP PAYLOAD.\n");
+ ret = discard_frame;
+ } else if (status == 0x3) {
+ CHIP_DBG(KERN_ERR "RX Des0 status: No IPv4, IPv6 frame.\n");
+ ret = discard_frame;
+ }
+ return ret;
+}
+
+static int enh_desc_get_rx_status(void *data, struct stmmac_extra_stats *x,
+ struct dma_desc *p)
+{
+ int ret = good_frame;
+ struct net_device_stats *stats = (struct net_device_stats *)data;
+
+ if (unlikely(p->des01.erx.error_summary)) {
+ CHIP_DBG(KERN_ERR "GMAC RX Error Summary 0x%08x\n",
+ p->des01.erx);
+ if (unlikely(p->des01.erx.descriptor_error)) {
+ CHIP_DBG(KERN_ERR "\tdescriptor error\n");
+ x->rx_desc++;
+ stats->rx_length_errors++;
+ }
+ if (unlikely(p->des01.erx.overflow_error)) {
+ CHIP_DBG(KERN_ERR "\toverflow error\n");
+ x->rx_gmac_overflow++;
+ }
+
+ if (unlikely(p->des01.erx.ipc_csum_error))
+ CHIP_DBG(KERN_ERR "\tIPC Csum Error/Giant frame\n");
+
+ if (unlikely(p->des01.erx.late_collision)) {
+ CHIP_DBG(KERN_ERR "\tlate_collision error\n");
+ stats->collisions++;
+ }
+ if (unlikely(p->des01.erx.receive_watchdog)) {
+ CHIP_DBG(KERN_ERR "\treceive_watchdog error\n");
+ x->rx_watchdog++;
+ }
+ if (unlikely(p->des01.erx.error_gmii)) {
+ CHIP_DBG(KERN_ERR "\tReceive Error\n");
+ x->rx_mii++;
+ }
+ if (unlikely(p->des01.erx.crc_error)) {
+ CHIP_DBG(KERN_ERR "\tCRC error\n");
+ x->rx_crc++;
+ stats->rx_crc_errors++;
+ }
+ ret = discard_frame;
+ }
+
+ /* After a payload csum error, the ES bit is set.
+ * This does not match the information reported in the databook.
+ * In any case, we need to know whether the HW csum computation is OK
+ * and report this to the upper layers. */
+ ret = enh_desc_coe_rdes0(p->des01.erx.ipc_csum_error,
+ p->des01.erx.frame_type, p->des01.erx.payload_csum_error);
+
+ if (unlikely(p->des01.erx.dribbling)) {
+ CHIP_DBG(KERN_ERR "GMAC RX: dribbling error\n");
+ ret = discard_frame;
+ }
+ if (unlikely(p->des01.erx.sa_filter_fail)) {
+ CHIP_DBG(KERN_ERR "GMAC RX : Source Address filter fail\n");
+ x->sa_rx_filter_fail++;
+ ret = discard_frame;
+ }
+ if (unlikely(p->des01.erx.da_filter_fail)) {
+ CHIP_DBG(KERN_ERR "GMAC RX : Dest Address filter fail\n");
+ x->da_rx_filter_fail++;
+ ret = discard_frame;
+ }
+ if (unlikely(p->des01.erx.length_error)) {
+ CHIP_DBG(KERN_ERR "GMAC RX: length_error error\n");
+ x->rx_length++;
+ ret = discard_frame;
+ }
+#ifdef STMMAC_VLAN_TAG_USED
+ if (p->des01.erx.vlan_tag) {
+ CHIP_DBG(KERN_INFO "GMAC RX: VLAN frame tagged\n");
+ x->rx_vlan++;
+ }
+#endif
+ return ret;
+}
+
+static void enh_desc_init_rx_desc(struct dma_desc *p, unsigned int ring_size,
+ int disable_rx_ic)
+{
+ int i;
+ for (i = 0; i < ring_size; i++) {
+ p->des01.erx.own = 1;
+ p->des01.erx.buffer1_size = BUF_SIZE_8KiB - 1;
+ /* To support jumbo frames */
+ p->des01.erx.buffer2_size = BUF_SIZE_8KiB - 1;
+ if (i == ring_size - 1)
+ p->des01.erx.end_ring = 1;
+ if (disable_rx_ic)
+ p->des01.erx.disable_ic = 1;
+ p++;
+ }
+ return;
+}
+
+static void enh_desc_init_tx_desc(struct dma_desc *p, unsigned int ring_size)
+{
+ int i;
+
+ for (i = 0; i < ring_size; i++) {
+ p->des01.etx.own = 0;
+ if (i == ring_size - 1)
+ p->des01.etx.end_ring = 1;
+ p++;
+ }
+
+ return;
+}
+
+static int enh_desc_get_tx_owner(struct dma_desc *p)
+{
+ return p->des01.etx.own;
+}
+
+static int enh_desc_get_rx_owner(struct dma_desc *p)
+{
+ return p->des01.erx.own;
+}
+
+static void enh_desc_set_tx_owner(struct dma_desc *p)
+{
+ p->des01.etx.own = 1;
+}
+
+static void enh_desc_set_rx_owner(struct dma_desc *p)
+{
+ p->des01.erx.own = 1;
+}
+
+static int enh_desc_get_tx_ls(struct dma_desc *p)
+{
+ return p->des01.etx.last_segment;
+}
+
+static void enh_desc_release_tx_desc(struct dma_desc *p)
+{
+ int ter = p->des01.etx.end_ring;
+
+ memset(p, 0, sizeof(struct dma_desc));
+ p->des01.etx.end_ring = ter;
+
+ return;
+}
+
+static void enh_desc_prepare_tx_desc(struct dma_desc *p, int is_fs, int len,
+ int csum_flag)
+{
+ p->des01.etx.first_segment = is_fs;
+ if (unlikely(len > BUF_SIZE_4KiB)) {
+ p->des01.etx.buffer1_size = BUF_SIZE_4KiB;
+ p->des01.etx.buffer2_size = len - BUF_SIZE_4KiB;
+ } else {
+ p->des01.etx.buffer1_size = len;
+ }
+ if (likely(csum_flag))
+ p->des01.etx.checksum_insertion = cic_full;
+}
+
+static void enh_desc_clear_tx_ic(struct dma_desc *p)
+{
+ p->des01.etx.interrupt = 0;
+}
+
+static void enh_desc_close_tx_desc(struct dma_desc *p)
+{
+ p->des01.etx.last_segment = 1;
+ p->des01.etx.interrupt = 1;
+}
+
+static int enh_desc_get_rx_frame_len(struct dma_desc *p)
+{
+ return p->des01.erx.frame_length;
+}
+
+struct stmmac_desc_ops enh_desc_ops = {
+ .tx_status = enh_desc_get_tx_status,
+ .rx_status = enh_desc_get_rx_status,
+ .get_tx_len = enh_desc_get_tx_len,
+ .init_rx_desc = enh_desc_init_rx_desc,
+ .init_tx_desc = enh_desc_init_tx_desc,
+ .get_tx_owner = enh_desc_get_tx_owner,
+ .get_rx_owner = enh_desc_get_rx_owner,
+ .release_tx_desc = enh_desc_release_tx_desc,
+ .prepare_tx_desc = enh_desc_prepare_tx_desc,
+ .clear_tx_ic = enh_desc_clear_tx_ic,
+ .close_tx_desc = enh_desc_close_tx_desc,
+ .get_tx_ls = enh_desc_get_tx_ls,
+ .set_tx_owner = enh_desc_set_tx_owner,
+ .set_rx_owner = enh_desc_set_rx_owner,
+ .get_rx_frame_len = enh_desc_get_rx_frame_len,
+};
diff --git a/drivers/net/stmmac/norm_desc.c b/drivers/net/stmmac/norm_desc.c
new file mode 100644
index 0000000..ecfcc00
--- /dev/null
+++ b/drivers/net/stmmac/norm_desc.c
@@ -0,0 +1,240 @@
+/*******************************************************************************
+ This contains the functions to handle the normal descriptors.
+
+ Copyright (C) 2007-2009 STMicroelectronics Ltd
+
+ This program is free software; you can redistribute it and/or modify it
+ under the terms and conditions of the GNU General Public License,
+ version 2, as published by the Free Software Foundation.
+
+ This program is distributed in the hope it will be useful, but WITHOUT
+ ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ more details.
+
+ You should have received a copy of the GNU General Public License along with
+ this program; if not, write to the Free Software Foundation, Inc.,
+ 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+
+ The full GNU General Public License is included in this distribution in
+ the file called "COPYING".
+
+ Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
+*******************************************************************************/
+
+#include "common.h"
+
+static int ndesc_get_tx_status(void *data, struct stmmac_extra_stats *x,
+ struct dma_desc *p, unsigned long ioaddr)
+{
+ int ret = 0;
+ struct net_device_stats *stats = (struct net_device_stats *)data;
+
+ if (unlikely(p->des01.tx.error_summary)) {
+ if (unlikely(p->des01.tx.underflow_error)) {
+ x->tx_underflow++;
+ stats->tx_fifo_errors++;
+ }
+ if (unlikely(p->des01.tx.no_carrier)) {
+ x->tx_carrier++;
+ stats->tx_carrier_errors++;
+ }
+ if (unlikely(p->des01.tx.loss_carrier)) {
+ x->tx_losscarrier++;
+ stats->tx_carrier_errors++;
+ }
+ if (unlikely((p->des01.tx.excessive_deferral) ||
+ (p->des01.tx.excessive_collisions) ||
+ (p->des01.tx.late_collision)))
+ stats->collisions += p->des01.tx.collision_count;
+ ret = -1;
+ }
+ if (unlikely(p->des01.tx.heartbeat_fail)) {
+ x->tx_heartbeat++;
+ stats->tx_heartbeat_errors++;
+ ret = -1;
+ }
+ if (unlikely(p->des01.tx.deferred))
+ x->tx_deferred++;
+
+ return ret;
+}
+
+static int ndesc_get_tx_len(struct dma_desc *p)
+{
+ return p->des01.tx.buffer1_size;
+}
+
+/* This function verifies if each incoming frame has some errors
+ * and, if required, updates the multicast statistics.
+ * In case of success, it returns csum_none because the device
+ * is not able to compute the csum in HW. */
+static int ndesc_get_rx_status(void *data, struct stmmac_extra_stats *x,
+ struct dma_desc *p)
+{
+ int ret = csum_none;
+ struct net_device_stats *stats = (struct net_device_stats *)data;
+
+ if (unlikely(p->des01.rx.last_descriptor == 0)) {
+ pr_warning("ndesc Error: Oversized Ethernet "
+ "frame spanned multiple buffers\n");
+ stats->rx_length_errors++;
+ return discard_frame;
+ }
+
+ if (unlikely(p->des01.rx.error_summary)) {
+ if (unlikely(p->des01.rx.descriptor_error))
+ x->rx_desc++;
+ if (unlikely(p->des01.rx.partial_frame_error))
+ x->rx_partial++;
+ if (unlikely(p->des01.rx.run_frame))
+ x->rx_runt++;
+ if (unlikely(p->des01.rx.frame_too_long))
+ x->rx_toolong++;
+ if (unlikely(p->des01.rx.collision)) {
+ x->rx_collision++;
+ stats->collisions++;
+ }
+ if (unlikely(p->des01.rx.crc_error)) {
+ x->rx_crc++;
+ stats->rx_crc_errors++;
+ }
+ ret = discard_frame;
+ }
+ if (unlikely(p->des01.rx.dribbling))
+ ret = discard_frame;
+
+ if (unlikely(p->des01.rx.length_error)) {
+ x->rx_length++;
+ ret = discard_frame;
+ }
+ if (unlikely(p->des01.rx.mii_error)) {
+ x->rx_mii++;
+ ret = discard_frame;
+ }
+ if (p->des01.rx.multicast_frame) {
+ x->rx_multicast++;
+ stats->multicast++;
+ }
+ return ret;
+}
+
+static void ndesc_init_rx_desc(struct dma_desc *p, unsigned int ring_size,
+ int disable_rx_ic)
+{
+ int i;
+ for (i = 0; i < ring_size; i++) {
+ p->des01.rx.own = 1;
+ p->des01.rx.buffer1_size = BUF_SIZE_2KiB - 1;
+ if (i == ring_size - 1)
+ p->des01.rx.end_ring = 1;
+ if (disable_rx_ic)
+ p->des01.rx.disable_ic = 1;
+ p++;
+ }
+ return;
+}
+
+static void ndesc_init_tx_desc(struct dma_desc *p, unsigned int ring_size)
+{
+ int i;
+ for (i = 0; i < ring_size; i++) {
+ p->des01.tx.own = 0;
+ if (i == ring_size - 1)
+ p->des01.tx.end_ring = 1;
+ p++;
+ }
+ return;
+}
+
+static int ndesc_get_tx_owner(struct dma_desc *p)
+{
+ return p->des01.tx.own;
+}
+
+static int ndesc_get_rx_owner(struct dma_desc *p)
+{
+ return p->des01.rx.own;
+}
+
+static void ndesc_set_tx_owner(struct dma_desc *p)
+{
+ p->des01.tx.own = 1;
+}
+
+static void ndesc_set_rx_owner(struct dma_desc *p)
+{
+ p->des01.rx.own = 1;
+}
+
+static int ndesc_get_tx_ls(struct dma_desc *p)
+{
+ return p->des01.tx.last_segment;
+}
+
+static void ndesc_release_tx_desc(struct dma_desc *p)
+{
+ int ter = p->des01.tx.end_ring;
+
+ /* clean field used within the xmit */
+ p->des01.tx.first_segment = 0;
+ p->des01.tx.last_segment = 0;
+ p->des01.tx.buffer1_size = 0;
+
+ /* clean status reported */
+ p->des01.tx.error_summary = 0;
+ p->des01.tx.underflow_error = 0;
+ p->des01.tx.no_carrier = 0;
+ p->des01.tx.loss_carrier = 0;
+ p->des01.tx.excessive_deferral = 0;
+ p->des01.tx.excessive_collisions = 0;
+ p->des01.tx.late_collision = 0;
+ p->des01.tx.heartbeat_fail = 0;
+ p->des01.tx.deferred = 0;
+
+ /* set termination field */
+ p->des01.tx.end_ring = ter;
+
+ return;
+}
+
+static void ndesc_prepare_tx_desc(struct dma_desc *p, int is_fs, int len,
+ int csum_flag)
+{
+ p->des01.tx.first_segment = is_fs;
+ p->des01.tx.buffer1_size = len;
+}
+
+static void ndesc_clear_tx_ic(struct dma_desc *p)
+{
+ p->des01.tx.interrupt = 0;
+}
+
+static void ndesc_close_tx_desc(struct dma_desc *p)
+{
+ p->des01.tx.last_segment = 1;
+ p->des01.tx.interrupt = 1;
+}
+
+static int ndesc_get_rx_frame_len(struct dma_desc *p)
+{
+ return p->des01.rx.frame_length;
+}
+
+struct stmmac_desc_ops ndesc_ops = {
+ .tx_status = ndesc_get_tx_status,
+ .rx_status = ndesc_get_rx_status,
+ .get_tx_len = ndesc_get_tx_len,
+ .init_rx_desc = ndesc_init_rx_desc,
+ .init_tx_desc = ndesc_init_tx_desc,
+ .get_tx_owner = ndesc_get_tx_owner,
+ .get_rx_owner = ndesc_get_rx_owner,
+ .release_tx_desc = ndesc_release_tx_desc,
+ .prepare_tx_desc = ndesc_prepare_tx_desc,
+ .clear_tx_ic = ndesc_clear_tx_ic,
+ .close_tx_desc = ndesc_close_tx_desc,
+ .get_tx_ls = ndesc_get_tx_ls,
+ .set_tx_owner = ndesc_set_tx_owner,
+ .set_rx_owner = ndesc_set_rx_owner,
+ .get_rx_frame_len = ndesc_get_rx_frame_len,
+};
diff --git a/drivers/net/stmmac/stmmac.h b/drivers/net/stmmac/stmmac.h
index ba35e69..55b9aca 100644
--- a/drivers/net/stmmac/stmmac.h
+++ b/drivers/net/stmmac/stmmac.h
@@ -120,3 +120,5 @@ static inline int stmmac_claim_resource(struct platform_device *pdev)
extern int stmmac_mdio_unregister(struct net_device *ndev);
extern int stmmac_mdio_register(struct net_device *ndev);
extern void stmmac_set_ethtool_ops(struct net_device *netdev);
+extern struct stmmac_desc_ops enh_desc_ops;
+extern struct stmmac_desc_ops ndesc_ops;
diff --git a/drivers/net/stmmac/stmmac_main.c b/drivers/net/stmmac/stmmac_main.c
index 92bef30..b95fa84 100644
--- a/drivers/net/stmmac/stmmac_main.c
+++ b/drivers/net/stmmac/stmmac_main.c
@@ -1581,10 +1581,15 @@ static int stmmac_mac_device_setup(struct net_device *dev)
 struct mac_device_info *device;
- if (priv->is_gmac)
+ if (priv->is_gmac) {
 device = dwmac1000_setup(ioaddr);
- else
+ if (device)
+ device->desc = &enh_desc_ops;
+ } else {
 device = dwmac100_setup(ioaddr);
+ if (device)
+ device->desc = &ndesc_ops;
+ }
 if (!device)
 return -ENOMEM;
--
1.6.0.4
^ permalink raw reply related
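Note on the checksum offload decoding in enh_desc_coe_rdes0() above: the
three RDES0 flags are packed into a 3-bit value (frame type in bit 2, IPC
csum error in bit 1, payload csum error in bit 0) and then mapped to a frame
status. A minimal standalone sketch of that mapping follows. The names
coe_pack() and coe_decode() are hypothetical, and the enum values are
assumed stand-ins for the ones the driver defines in common.h.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-ins for the driver's rx frame status values. */
enum rx_frame_status { good_frame = 0, discard_frame = 1, csum_none = 2 };

/* Pack the flags as the patch does:
 * frame type -> bit 2, IPC csum error -> bit 1, payload error -> bit 0. */
static uint32_t coe_pack(int ipc_err, int type, int payload_err)
{
	return ((uint32_t)type << 2 | (uint32_t)ipc_err << 1 |
		(uint32_t)payload_err) & 0x7;
}

/* Map the packed value to a frame status, following the table in the patch. */
static enum rx_frame_status coe_decode(uint32_t status)
{
	switch (status) {
	case 0x0:	/* IEEE 802.3 type frame */
	case 0x4:	/* IPv4/6, no csum errors */
		return good_frame;
	case 0x5:	/* IPv4/6 payload csum error */
	case 0x6:	/* IPv4/6 header csum error */
	case 0x7:	/* IPv4/6 header and payload csum errors */
		return csum_none;
	case 0x1:	/* IPv4/6 unsupported IP payload */
	case 0x3:	/* COE bypassed, not an IPv4/6 frame */
		return discard_frame;
	default:	/* 0x2 is reserved; the patch leaves ret at good_frame */
		return good_frame;
	}
}
```

For example, an IPv4/6 frame (type = 1) with only the payload error flag set
packs to 0x5, which is passed up as csum_none so that the stack verifies the
checksum in software.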
* [PATCH] stmmac: split core and dma for the mac10/100
From: Giuseppe CAVALLARO @ 2010-04-09 10:24 UTC (permalink / raw)
To: netdev; +Cc: Giuseppe Cavallaro
In-Reply-To: <1270808662-7115-1-git-send-email-peppe.cavallaro@st.com>
This patch splits the core and DMA parts of the mac10/100 device,
as was already done for the GMAC device.
This should make the driver more flexible when adding support for other chips.
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
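Note (not part of the commit): the split relies on each layer exposing its
own table of function pointers, so that core, DMA and descriptor handling
can be combined per chip. The sketch below only illustrates that idea with
trimmed-down, hypothetical structures and names (mac_info, mac100_setup,
device_setup and the *_ops tables are all assumptions made for the example,
not the driver's real definitions). It also checks the allocation before
the descriptor ops are assigned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down ops tables; the real ones have more hooks. */
struct mac_ops  { const char *(*name)(void); };
struct dma_ops  { int (*init)(int pbl); };
struct desc_ops { int (*max_rx_buf)(void); };

struct mac_info {
	const struct mac_ops *mac;
	const struct dma_ops *dma;
	const struct desc_ops *desc;
};

static const char *mac100_name(void) { return "mac10/100"; }
static int mac100_dma_init(int pbl) { return pbl > 0 ? 0 : -1; }
static int ndesc_max_rx_buf(void) { return 2048 - 1; }	/* assumed 2KiB - 1 */
static int enh_max_rx_buf(void) { return 8192 - 1; }	/* assumed 8KiB - 1 */

static const struct mac_ops mac100_core_ops = { .name = mac100_name };
static const struct dma_ops mac100_dma_ops = { .init = mac100_dma_init };
static const struct desc_ops normal_desc_ops = { .max_rx_buf = ndesc_max_rx_buf };
static const struct desc_ops enhanced_desc_ops = { .max_rx_buf = enh_max_rx_buf };

/* Allocate the device info and wire up the core and DMA tables.
 * Returns NULL if the allocation fails. */
static struct mac_info *mac100_setup(void)
{
	struct mac_info *m = calloc(1, sizeof(*m));

	if (!m)
		return NULL;
	m->mac = &mac100_core_ops;
	m->dma = &mac100_dma_ops;
	return m;
}

/* Select the descriptor ops only after the allocation has been checked. */
static struct mac_info *device_setup(int use_enhanced_desc)
{
	struct mac_info *m = mac100_setup();

	if (!m)
		return NULL;
	m->desc = use_enhanced_desc ? &enhanced_desc_ops : &normal_desc_ops;
	return m;
}
```

A caller would then go through the tables only, for example
m->dma->init(pbl) or m->desc->max_rx_buf(), without knowing which chip
variant provided them.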
drivers/net/stmmac/Makefile | 2 +-
drivers/net/stmmac/dwmac100.c | 537 ------------------------------------
drivers/net/stmmac/dwmac100.h | 17 ++
drivers/net/stmmac/dwmac100_core.c | 202 ++++++++++++++
drivers/net/stmmac/dwmac100_dma.c | 353 +++++++++++++++++++++++
5 files changed, 573 insertions(+), 538 deletions(-)
delete mode 100644 drivers/net/stmmac/dwmac100.c
create mode 100644 drivers/net/stmmac/dwmac100_core.c
create mode 100644 drivers/net/stmmac/dwmac100_dma.c
diff --git a/drivers/net/stmmac/Makefile b/drivers/net/stmmac/Makefile
index c776af1..b14bd56 100644
--- a/drivers/net/stmmac/Makefile
+++ b/drivers/net/stmmac/Makefile
@@ -2,4 +2,4 @@ obj-$(CONFIG_STMMAC_ETH) += stmmac.o
stmmac-$(CONFIG_STMMAC_TIMER) += stmmac_timer.o
stmmac-objs:= stmmac_main.o stmmac_ethtool.o stmmac_mdio.o \
dwmac_lib.o dwmac1000_core.o dwmac1000_dma.o \
- dwmac100.o $(stmmac-y)
+ dwmac100_core.o dwmac100_dma.o $(stmmac-y)
diff --git a/drivers/net/stmmac/dwmac100.c b/drivers/net/stmmac/dwmac100.c
deleted file mode 100644
index 803b037..0000000
--- a/drivers/net/stmmac/dwmac100.c
+++ /dev/null
@@ -1,537 +0,0 @@
-/*******************************************************************************
- This is the driver for the MAC 10/100 on-chip Ethernet controller
- currently tested on all the ST boards based on STb7109 and stx7200 SoCs.
-
- DWC Ether MAC 10/100 Universal version 4.0 has been used for developing
- this code.
-
- Copyright (C) 2007-2009 STMicroelectronics Ltd
-
- This program is free software; you can redistribute it and/or modify it
- under the terms and conditions of the GNU General Public License,
- version 2, as published by the Free Software Foundation.
-
- This program is distributed in the hope it will be useful, but WITHOUT
- ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
- more details.
-
- You should have received a copy of the GNU General Public License along with
- this program; if not, write to the Free Software Foundation, Inc.,
- 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
-
- The full GNU General Public License is included in this distribution in
- the file called "COPYING".
-
- Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
-*******************************************************************************/
-
-#include <linux/crc32.h>
-#include <linux/mii.h>
-#include <linux/phy.h>
-
-#include "common.h"
-#include "dwmac100.h"
-#include "dwmac_dma.h"
-
-#undef DWMAC100_DEBUG
-/*#define DWMAC100_DEBUG*/
-#ifdef DWMAC100_DEBUG
-#define DBG(fmt, args...) printk(fmt, ## args)
-#else
-#define DBG(fmt, args...) do { } while (0)
-#endif
-
-static void dwmac100_core_init(unsigned long ioaddr)
-{
- u32 value = readl(ioaddr + MAC_CONTROL);
-
- writel((value | MAC_CORE_INIT), ioaddr + MAC_CONTROL);
-
-#ifdef STMMAC_VLAN_TAG_USED
- writel(ETH_P_8021Q, ioaddr + MAC_VLAN1);
-#endif
- return;
-}
-
-static void dwmac100_dump_mac_regs(unsigned long ioaddr)
-{
- pr_info("\t----------------------------------------------\n"
- "\t DWMAC 100 CSR (base addr = 0x%8x)\n"
- "\t----------------------------------------------\n",
- (unsigned int)ioaddr);
- pr_info("\tcontrol reg (offset 0x%x): 0x%08x\n", MAC_CONTROL,
- readl(ioaddr + MAC_CONTROL));
- pr_info("\taddr HI (offset 0x%x): 0x%08x\n ", MAC_ADDR_HIGH,
- readl(ioaddr + MAC_ADDR_HIGH));
- pr_info("\taddr LO (offset 0x%x): 0x%08x\n", MAC_ADDR_LOW,
- readl(ioaddr + MAC_ADDR_LOW));
- pr_info("\tmulticast hash HI (offset 0x%x): 0x%08x\n",
- MAC_HASH_HIGH, readl(ioaddr + MAC_HASH_HIGH));
- pr_info("\tmulticast hash LO (offset 0x%x): 0x%08x\n",
- MAC_HASH_LOW, readl(ioaddr + MAC_HASH_LOW));
- pr_info("\tflow control (offset 0x%x): 0x%08x\n",
- MAC_FLOW_CTRL, readl(ioaddr + MAC_FLOW_CTRL));
- pr_info("\tVLAN1 tag (offset 0x%x): 0x%08x\n", MAC_VLAN1,
- readl(ioaddr + MAC_VLAN1));
- pr_info("\tVLAN2 tag (offset 0x%x): 0x%08x\n", MAC_VLAN2,
- readl(ioaddr + MAC_VLAN2));
- pr_info("\n\tMAC management counter registers\n");
- pr_info("\t MMC crtl (offset 0x%x): 0x%08x\n",
- MMC_CONTROL, readl(ioaddr + MMC_CONTROL));
- pr_info("\t MMC High Interrupt (offset 0x%x): 0x%08x\n",
- MMC_HIGH_INTR, readl(ioaddr + MMC_HIGH_INTR));
- pr_info("\t MMC Low Interrupt (offset 0x%x): 0x%08x\n",
- MMC_LOW_INTR, readl(ioaddr + MMC_LOW_INTR));
- pr_info("\t MMC High Interrupt Mask (offset 0x%x): 0x%08x\n",
- MMC_HIGH_INTR_MASK, readl(ioaddr + MMC_HIGH_INTR_MASK));
- pr_info("\t MMC Low Interrupt Mask (offset 0x%x): 0x%08x\n",
- MMC_LOW_INTR_MASK, readl(ioaddr + MMC_LOW_INTR_MASK));
- return;
-}
-
-static int dwmac100_dma_init(unsigned long ioaddr, int pbl, u32 dma_tx,
- u32 dma_rx)
-{
- u32 value = readl(ioaddr + DMA_BUS_MODE);
- /* DMA SW reset */
- value |= DMA_BUS_MODE_SFT_RESET;
- writel(value, ioaddr + DMA_BUS_MODE);
- do {} while ((readl(ioaddr + DMA_BUS_MODE) & DMA_BUS_MODE_SFT_RESET));
-
- /* Enable Application Access by writing to DMA CSR0 */
- writel(DMA_BUS_MODE_DEFAULT | (pbl << DMA_BUS_MODE_PBL_SHIFT),
- ioaddr + DMA_BUS_MODE);
-
- /* Mask interrupts by writing to CSR7 */
- writel(DMA_INTR_DEFAULT_MASK, ioaddr + DMA_INTR_ENA);
-
- /* The base address of the RX/TX descriptor lists must be written into
- * DMA CSR3 and CSR4, respectively. */
- writel(dma_tx, ioaddr + DMA_TX_BASE_ADDR);
- writel(dma_rx, ioaddr + DMA_RCV_BASE_ADDR);
-
- return 0;
-}
-
-/* Store and Forward capability is not used at all..
- * The transmit threshold can be programmed by
- * setting the TTC bits in the DMA control register.*/
-static void dwmac100_dma_operation_mode(unsigned long ioaddr, int txmode,
- int rxmode)
-{
- u32 csr6 = readl(ioaddr + DMA_CONTROL);
-
- if (txmode <= 32)
- csr6 |= DMA_CONTROL_TTC_32;
- else if (txmode <= 64)
- csr6 |= DMA_CONTROL_TTC_64;
- else
- csr6 |= DMA_CONTROL_TTC_128;
-
- writel(csr6, ioaddr + DMA_CONTROL);
-
- return;
-}
-
-static void dwmac100_dump_dma_regs(unsigned long ioaddr)
-{
- int i;
-
- DBG(KERN_DEBUG "DWMAC 100 DMA CSR \n");
- for (i = 0; i < 9; i++)
- pr_debug("\t CSR%d (offset 0x%x): 0x%08x\n", i,
- (DMA_BUS_MODE + i * 4),
- readl(ioaddr + DMA_BUS_MODE + i * 4));
- DBG(KERN_DEBUG "\t CSR20 (offset 0x%x): 0x%08x\n",
- DMA_CUR_TX_BUF_ADDR, readl(ioaddr + DMA_CUR_TX_BUF_ADDR));
- DBG(KERN_DEBUG "\t CSR21 (offset 0x%x): 0x%08x\n",
- DMA_CUR_RX_BUF_ADDR, readl(ioaddr + DMA_CUR_RX_BUF_ADDR));
- return;
-}
-
-/* DMA controller has two counters to track the number of
- * the receive missed frames. */
-static void dwmac100_dma_diagnostic_fr(void *data,
- struct stmmac_extra_stats *x,
- unsigned long ioaddr)
-{
- struct net_device_stats *stats = (struct net_device_stats *)data;
- u32 csr8 = readl(ioaddr + DMA_MISSED_FRAME_CTR);
-
- if (unlikely(csr8)) {
- if (csr8 & DMA_MISSED_FRAME_OVE) {
- stats->rx_over_errors += 0x800;
- x->rx_overflow_cntr += 0x800;
- } else {
- unsigned int ove_cntr;
- ove_cntr = ((csr8 & DMA_MISSED_FRAME_OVE_CNTR) >> 17);
- stats->rx_over_errors += ove_cntr;
- x->rx_overflow_cntr += ove_cntr;
- }
-
- if (csr8 & DMA_MISSED_FRAME_OVE_M) {
- stats->rx_missed_errors += 0xffff;
- x->rx_missed_cntr += 0xffff;
- } else {
- unsigned int miss_f = (csr8 & DMA_MISSED_FRAME_M_CNTR);
- stats->rx_missed_errors += miss_f;
- x->rx_missed_cntr += miss_f;
- }
- }
- return;
-}
-
-static int dwmac100_get_tx_frame_status(void *data,
- struct stmmac_extra_stats *x,
- struct dma_desc *p, unsigned long ioaddr)
-{
- int ret = 0;
- struct net_device_stats *stats = (struct net_device_stats *)data;
-
- if (unlikely(p->des01.tx.error_summary)) {
- if (unlikely(p->des01.tx.underflow_error)) {
- x->tx_underflow++;
- stats->tx_fifo_errors++;
- }
- if (unlikely(p->des01.tx.no_carrier)) {
- x->tx_carrier++;
- stats->tx_carrier_errors++;
- }
- if (unlikely(p->des01.tx.loss_carrier)) {
- x->tx_losscarrier++;
- stats->tx_carrier_errors++;
- }
- if (unlikely((p->des01.tx.excessive_deferral) ||
- (p->des01.tx.excessive_collisions) ||
- (p->des01.tx.late_collision)))
- stats->collisions += p->des01.tx.collision_count;
- ret = -1;
- }
- if (unlikely(p->des01.tx.heartbeat_fail)) {
- x->tx_heartbeat++;
- stats->tx_heartbeat_errors++;
- ret = -1;
- }
- if (unlikely(p->des01.tx.deferred))
- x->tx_deferred++;
-
- return ret;
-}
-
-static int dwmac100_get_tx_len(struct dma_desc *p)
-{
- return p->des01.tx.buffer1_size;
-}
-
-/* This function verifies if each incoming frame has some errors
- * and, if required, updates the multicast statistics.
- * In case of success, it returns csum_none becasue the device
- * is not able to compute the csum in HW. */
-static int dwmac100_get_rx_frame_status(void *data,
- struct stmmac_extra_stats *x,
- struct dma_desc *p)
-{
- int ret = csum_none;
- struct net_device_stats *stats = (struct net_device_stats *)data;
-
- if (unlikely(p->des01.rx.last_descriptor == 0)) {
- pr_warning("dwmac100 Error: Oversized Ethernet "
- "frame spanned multiple buffers\n");
- stats->rx_length_errors++;
- return discard_frame;
- }
-
- if (unlikely(p->des01.rx.error_summary)) {
- if (unlikely(p->des01.rx.descriptor_error))
- x->rx_desc++;
- if (unlikely(p->des01.rx.partial_frame_error))
- x->rx_partial++;
- if (unlikely(p->des01.rx.run_frame))
- x->rx_runt++;
- if (unlikely(p->des01.rx.frame_too_long))
- x->rx_toolong++;
- if (unlikely(p->des01.rx.collision)) {
- x->rx_collision++;
- stats->collisions++;
- }
- if (unlikely(p->des01.rx.crc_error)) {
- x->rx_crc++;
- stats->rx_crc_errors++;
- }
- ret = discard_frame;
- }
- if (unlikely(p->des01.rx.dribbling))
- ret = discard_frame;
-
- if (unlikely(p->des01.rx.length_error)) {
- x->rx_length++;
- ret = discard_frame;
- }
- if (unlikely(p->des01.rx.mii_error)) {
- x->rx_mii++;
- ret = discard_frame;
- }
- if (p->des01.rx.multicast_frame) {
- x->rx_multicast++;
- stats->multicast++;
- }
- return ret;
-}
-
-static void dwmac100_irq_status(unsigned long ioaddr)
-{
- return;
-}
-
-static void dwmac100_set_umac_addr(unsigned long ioaddr, unsigned char *addr,
- unsigned int reg_n)
-{
- stmmac_set_mac_addr(ioaddr, addr, MAC_ADDR_HIGH, MAC_ADDR_LOW);
-}
-
-static void dwmac100_get_umac_addr(unsigned long ioaddr, unsigned char *addr,
- unsigned int reg_n)
-{
- stmmac_get_mac_addr(ioaddr, addr, MAC_ADDR_HIGH, MAC_ADDR_LOW);
-}
-
-static void dwmac100_set_filter(struct net_device *dev)
-{
- unsigned long ioaddr = dev->base_addr;
- u32 value = readl(ioaddr + MAC_CONTROL);
-
- if (dev->flags & IFF_PROMISC) {
- value |= MAC_CONTROL_PR;
- value &= ~(MAC_CONTROL_PM | MAC_CONTROL_IF | MAC_CONTROL_HO |
- MAC_CONTROL_HP);
- } else if ((netdev_mc_count(dev) > HASH_TABLE_SIZE)
- || (dev->flags & IFF_ALLMULTI)) {
- value |= MAC_CONTROL_PM;
- value &= ~(MAC_CONTROL_PR | MAC_CONTROL_IF | MAC_CONTROL_HO);
- writel(0xffffffff, ioaddr + MAC_HASH_HIGH);
- writel(0xffffffff, ioaddr + MAC_HASH_LOW);
- } else if (netdev_mc_empty(dev)) { /* no multicast */
- value &= ~(MAC_CONTROL_PM | MAC_CONTROL_PR | MAC_CONTROL_IF |
- MAC_CONTROL_HO | MAC_CONTROL_HP);
- } else {
- u32 mc_filter[2];
- struct dev_mc_list *mclist;
-
- /* Perfect filter mode for physical address and Hash
- filter for multicast */
- value |= MAC_CONTROL_HP;
- value &= ~(MAC_CONTROL_PM | MAC_CONTROL_PR |
- MAC_CONTROL_IF | MAC_CONTROL_HO);
-
- memset(mc_filter, 0, sizeof(mc_filter));
- netdev_for_each_mc_addr(mclist, dev) {
- /* The upper 6 bits of the calculated CRC are used to
- * index the contens of the hash table */
- int bit_nr =
- ether_crc(ETH_ALEN, mclist->dmi_addr) >> 26;
- /* The most significant bit determines the register to
- * use (H/L) while the other 5 bits determine the bit
- * within the register. */
- mc_filter[bit_nr >> 5] |= 1 << (bit_nr & 31);
- }
- writel(mc_filter[0], ioaddr + MAC_HASH_LOW);
- writel(mc_filter[1], ioaddr + MAC_HASH_HIGH);
- }
-
- writel(value, ioaddr + MAC_CONTROL);
-
- DBG(KERN_INFO "%s: CTRL reg: 0x%08x Hash regs: "
- "HI 0x%08x, LO 0x%08x\n",
- __func__, readl(ioaddr + MAC_CONTROL),
- readl(ioaddr + MAC_HASH_HIGH), readl(ioaddr + MAC_HASH_LOW));
- return;
-}
-
-static void dwmac100_flow_ctrl(unsigned long ioaddr, unsigned int duplex,
- unsigned int fc, unsigned int pause_time)
-{
- unsigned int flow = MAC_FLOW_CTRL_ENABLE;
-
- if (duplex)
- flow |= (pause_time << MAC_FLOW_CTRL_PT_SHIFT);
- writel(flow, ioaddr + MAC_FLOW_CTRL);
-
- return;
-}
-
-/* No PMT module supported for this Ethernet Controller.
- * Tested on ST platforms only.
- */
-static void dwmac100_pmt(unsigned long ioaddr, unsigned long mode)
-{
- return;
-}
-
-static void dwmac100_init_rx_desc(struct dma_desc *p, unsigned int ring_size,
- int disable_rx_ic)
-{
- int i;
- for (i = 0; i < ring_size; i++) {
- p->des01.rx.own = 1;
- p->des01.rx.buffer1_size = BUF_SIZE_2KiB - 1;
- if (i == ring_size - 1)
- p->des01.rx.end_ring = 1;
- if (disable_rx_ic)
- p->des01.rx.disable_ic = 1;
- p++;
- }
- return;
-}
-
-static void dwmac100_init_tx_desc(struct dma_desc *p, unsigned int ring_size)
-{
- int i;
- for (i = 0; i < ring_size; i++) {
- p->des01.tx.own = 0;
- if (i == ring_size - 1)
- p->des01.tx.end_ring = 1;
- p++;
- }
- return;
-}
-
-static int dwmac100_get_tx_owner(struct dma_desc *p)
-{
- return p->des01.tx.own;
-}
-
-static int dwmac100_get_rx_owner(struct dma_desc *p)
-{
- return p->des01.rx.own;
-}
-
-static void dwmac100_set_tx_owner(struct dma_desc *p)
-{
- p->des01.tx.own = 1;
-}
-
-static void dwmac100_set_rx_owner(struct dma_desc *p)
-{
- p->des01.rx.own = 1;
-}
-
-static int dwmac100_get_tx_ls(struct dma_desc *p)
-{
- return p->des01.tx.last_segment;
-}
-
-static void dwmac100_release_tx_desc(struct dma_desc *p)
-{
- int ter = p->des01.tx.end_ring;
-
- /* clean field used within the xmit */
- p->des01.tx.first_segment = 0;
- p->des01.tx.last_segment = 0;
- p->des01.tx.buffer1_size = 0;
-
- /* clean status reported */
- p->des01.tx.error_summary = 0;
- p->des01.tx.underflow_error = 0;
- p->des01.tx.no_carrier = 0;
- p->des01.tx.loss_carrier = 0;
- p->des01.tx.excessive_deferral = 0;
- p->des01.tx.excessive_collisions = 0;
- p->des01.tx.late_collision = 0;
- p->des01.tx.heartbeat_fail = 0;
- p->des01.tx.deferred = 0;
-
- /* set termination field */
- p->des01.tx.end_ring = ter;
-
- return;
-}
-
-static void dwmac100_prepare_tx_desc(struct dma_desc *p, int is_fs, int len,
- int csum_flag)
-{
- p->des01.tx.first_segment = is_fs;
- p->des01.tx.buffer1_size = len;
-}
-
-static void dwmac100_clear_tx_ic(struct dma_desc *p)
-{
- p->des01.tx.interrupt = 0;
-}
-
-static void dwmac100_close_tx_desc(struct dma_desc *p)
-{
- p->des01.tx.last_segment = 1;
- p->des01.tx.interrupt = 1;
-}
-
-static int dwmac100_get_rx_frame_len(struct dma_desc *p)
-{
- return p->des01.rx.frame_length;
-}
-
-struct stmmac_ops dwmac100_ops = {
- .core_init = dwmac100_core_init,
- .dump_regs = dwmac100_dump_mac_regs,
- .host_irq_status = dwmac100_irq_status,
- .set_filter = dwmac100_set_filter,
- .flow_ctrl = dwmac100_flow_ctrl,
- .pmt = dwmac100_pmt,
- .set_umac_addr = dwmac100_set_umac_addr,
- .get_umac_addr = dwmac100_get_umac_addr,
-};
-
-struct stmmac_dma_ops dwmac100_dma_ops = {
- .init = dwmac100_dma_init,
- .dump_regs = dwmac100_dump_dma_regs,
- .dma_mode = dwmac100_dma_operation_mode,
- .dma_diagnostic_fr = dwmac100_dma_diagnostic_fr,
- .enable_dma_transmission = dwmac_enable_dma_transmission,
- .enable_dma_irq = dwmac_enable_dma_irq,
- .disable_dma_irq = dwmac_disable_dma_irq,
- .start_tx = dwmac_dma_start_tx,
- .stop_tx = dwmac_dma_stop_tx,
- .start_rx = dwmac_dma_start_rx,
- .stop_rx = dwmac_dma_stop_rx,
- .dma_interrupt = dwmac_dma_interrupt,
-};
-
-struct stmmac_desc_ops dwmac100_desc_ops = {
- .tx_status = dwmac100_get_tx_frame_status,
- .rx_status = dwmac100_get_rx_frame_status,
- .get_tx_len = dwmac100_get_tx_len,
- .init_rx_desc = dwmac100_init_rx_desc,
- .init_tx_desc = dwmac100_init_tx_desc,
- .get_tx_owner = dwmac100_get_tx_owner,
- .get_rx_owner = dwmac100_get_rx_owner,
- .release_tx_desc = dwmac100_release_tx_desc,
- .prepare_tx_desc = dwmac100_prepare_tx_desc,
- .clear_tx_ic = dwmac100_clear_tx_ic,
- .close_tx_desc = dwmac100_close_tx_desc,
- .get_tx_ls = dwmac100_get_tx_ls,
- .set_tx_owner = dwmac100_set_tx_owner,
- .set_rx_owner = dwmac100_set_rx_owner,
- .get_rx_frame_len = dwmac100_get_rx_frame_len,
-};
-
-struct mac_device_info *dwmac100_setup(unsigned long ioaddr)
-{
- struct mac_device_info *mac;
-
- mac = kzalloc(sizeof(const struct mac_device_info), GFP_KERNEL);
-
- pr_info("\tDWMAC100\n");
-
- mac->mac = &dwmac100_ops;
- mac->desc = &dwmac100_desc_ops;
- mac->dma = &dwmac100_dma_ops;
-
- mac->pmt = PMT_NOT_SUPPORTED;
- mac->link.port = MAC_CONTROL_PS;
- mac->link.duplex = MAC_CONTROL_F;
- mac->link.speed = 0;
- mac->mii.addr = MAC_MII_ADDR;
- mac->mii.data = MAC_MII_DATA;
-
- return mac;
-}
diff --git a/drivers/net/stmmac/dwmac100.h b/drivers/net/stmmac/dwmac100.h
index 0f8f110..9f4ba2e 100644
--- a/drivers/net/stmmac/dwmac100.h
+++ b/drivers/net/stmmac/dwmac100.h
@@ -22,6 +22,9 @@
Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
*******************************************************************************/
+#include <linux/phy.h>
+#include "common.h"
+
/*----------------------------------------------------------------------------
* MAC BLOCK defines
*---------------------------------------------------------------------------*/
@@ -114,3 +117,17 @@ enum ttc_control {
#define DMA_MISSED_FRAME_OVE_CNTR 0x0ffe0000 /* Overflow Frame Counter */
#define DMA_MISSED_FRAME_OVE_M 0x00010000 /* Missed Frame Overflow */
#define DMA_MISSED_FRAME_M_CNTR 0x0000ffff /* Missed Frame Couinter */
+
+#undef DWMAC100_DEBUG
+/* #define DWMAC100_DEBUG */
+#undef FRAME_FILTER_DEBUG
+/* #define FRAME_FILTER_DEBUG */
+#ifdef DWMAC100_DEBUG
+#define DBG(fmt, args...) printk(fmt, ## args)
+#else
+#define DBG(fmt, args...) do { } while (0)
+#endif
+
+extern struct stmmac_dma_ops dwmac100_dma_ops;
+extern struct stmmac_desc_ops dwmac100_desc_ops;
+
diff --git a/drivers/net/stmmac/dwmac100_core.c b/drivers/net/stmmac/dwmac100_core.c
new file mode 100644
index 0000000..7455a0c
--- /dev/null
+++ b/drivers/net/stmmac/dwmac100_core.c
@@ -0,0 +1,202 @@
+/*******************************************************************************
+ This is the driver for the MAC 10/100 on-chip Ethernet controller
+ currently tested on all the ST boards based on STb7109 and stx7200 SoCs.
+
+ DWC Ether MAC 10/100 Universal version 4.0 has been used for developing
+ this code.
+
+ This only implements the mac core functions for this chip.
+
+ Copyright (C) 2007-2009 STMicroelectronics Ltd
+
+ This program is free software; you can redistribute it and/or modify it
+ under the terms and conditions of the GNU General Public License,
+ version 2, as published by the Free Software Foundation.
+
+ This program is distributed in the hope it will be useful, but WITHOUT
+ ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ more details.
+
+ You should have received a copy of the GNU General Public License along with
+ this program; if not, write to the Free Software Foundation, Inc.,
+ 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+
+ The full GNU General Public License is included in this distribution in
+ the file called "COPYING".
+
+ Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
+*******************************************************************************/
+
+#include <linux/crc32.h>
+#include "dwmac100.h"
+
+static void dwmac100_core_init(unsigned long ioaddr)
+{
+ u32 value = readl(ioaddr + MAC_CONTROL);
+
+ writel((value | MAC_CORE_INIT), ioaddr + MAC_CONTROL);
+
+#ifdef STMMAC_VLAN_TAG_USED
+ writel(ETH_P_8021Q, ioaddr + MAC_VLAN1);
+#endif
+ return;
+}
+
+static void dwmac100_dump_mac_regs(unsigned long ioaddr)
+{
+ pr_info("\t----------------------------------------------\n"
+ "\t DWMAC 100 CSR (base addr = 0x%8x)\n"
+ "\t----------------------------------------------\n",
+ (unsigned int)ioaddr);
+ pr_info("\tcontrol reg (offset 0x%x): 0x%08x\n", MAC_CONTROL,
+ readl(ioaddr + MAC_CONTROL));
+ pr_info("\taddr HI (offset 0x%x): 0x%08x\n", MAC_ADDR_HIGH,
+ readl(ioaddr + MAC_ADDR_HIGH));
+ pr_info("\taddr LO (offset 0x%x): 0x%08x\n", MAC_ADDR_LOW,
+ readl(ioaddr + MAC_ADDR_LOW));
+ pr_info("\tmulticast hash HI (offset 0x%x): 0x%08x\n",
+ MAC_HASH_HIGH, readl(ioaddr + MAC_HASH_HIGH));
+ pr_info("\tmulticast hash LO (offset 0x%x): 0x%08x\n",
+ MAC_HASH_LOW, readl(ioaddr + MAC_HASH_LOW));
+ pr_info("\tflow control (offset 0x%x): 0x%08x\n",
+ MAC_FLOW_CTRL, readl(ioaddr + MAC_FLOW_CTRL));
+ pr_info("\tVLAN1 tag (offset 0x%x): 0x%08x\n", MAC_VLAN1,
+ readl(ioaddr + MAC_VLAN1));
+ pr_info("\tVLAN2 tag (offset 0x%x): 0x%08x\n", MAC_VLAN2,
+ readl(ioaddr + MAC_VLAN2));
+ pr_info("\n\tMAC management counter registers\n");
+ pr_info("\t MMC ctrl (offset 0x%x): 0x%08x\n",
+ MMC_CONTROL, readl(ioaddr + MMC_CONTROL));
+ pr_info("\t MMC High Interrupt (offset 0x%x): 0x%08x\n",
+ MMC_HIGH_INTR, readl(ioaddr + MMC_HIGH_INTR));
+ pr_info("\t MMC Low Interrupt (offset 0x%x): 0x%08x\n",
+ MMC_LOW_INTR, readl(ioaddr + MMC_LOW_INTR));
+ pr_info("\t MMC High Interrupt Mask (offset 0x%x): 0x%08x\n",
+ MMC_HIGH_INTR_MASK, readl(ioaddr + MMC_HIGH_INTR_MASK));
+ pr_info("\t MMC Low Interrupt Mask (offset 0x%x): 0x%08x\n",
+ MMC_LOW_INTR_MASK, readl(ioaddr + MMC_LOW_INTR_MASK));
+ return;
+}
+
+static void dwmac100_irq_status(unsigned long ioaddr)
+{
+ return;
+}
+
+static void dwmac100_set_umac_addr(unsigned long ioaddr, unsigned char *addr,
+ unsigned int reg_n)
+{
+ stmmac_set_mac_addr(ioaddr, addr, MAC_ADDR_HIGH, MAC_ADDR_LOW);
+}
+
+static void dwmac100_get_umac_addr(unsigned long ioaddr, unsigned char *addr,
+ unsigned int reg_n)
+{
+ stmmac_get_mac_addr(ioaddr, addr, MAC_ADDR_HIGH, MAC_ADDR_LOW);
+}
+
+static void dwmac100_set_filter(struct net_device *dev)
+{
+ unsigned long ioaddr = dev->base_addr;
+ u32 value = readl(ioaddr + MAC_CONTROL);
+
+ if (dev->flags & IFF_PROMISC) {
+ value |= MAC_CONTROL_PR;
+ value &= ~(MAC_CONTROL_PM | MAC_CONTROL_IF | MAC_CONTROL_HO |
+ MAC_CONTROL_HP);
+ } else if ((netdev_mc_count(dev) > HASH_TABLE_SIZE)
+ || (dev->flags & IFF_ALLMULTI)) {
+ value |= MAC_CONTROL_PM;
+ value &= ~(MAC_CONTROL_PR | MAC_CONTROL_IF | MAC_CONTROL_HO);
+ writel(0xffffffff, ioaddr + MAC_HASH_HIGH);
+ writel(0xffffffff, ioaddr + MAC_HASH_LOW);
+ } else if (netdev_mc_empty(dev)) { /* no multicast */
+ value &= ~(MAC_CONTROL_PM | MAC_CONTROL_PR | MAC_CONTROL_IF |
+ MAC_CONTROL_HO | MAC_CONTROL_HP);
+ } else {
+ u32 mc_filter[2];
+ struct dev_mc_list *mclist;
+
+ /* Perfect filter mode for physical address and Hash
+ filter for multicast */
+ value |= MAC_CONTROL_HP;
+ value &= ~(MAC_CONTROL_PM | MAC_CONTROL_PR |
+ MAC_CONTROL_IF | MAC_CONTROL_HO);
+
+ memset(mc_filter, 0, sizeof(mc_filter));
+ netdev_for_each_mc_addr(mclist, dev) {
+ /* The upper 6 bits of the calculated CRC are used to
+ * index the contents of the hash table */
+ int bit_nr =
+ ether_crc(ETH_ALEN, mclist->dmi_addr) >> 26;
+ /* The most significant bit determines the register to
+ * use (H/L) while the other 5 bits determine the bit
+ * within the register. */
+ mc_filter[bit_nr >> 5] |= 1 << (bit_nr & 31);
+ }
+ writel(mc_filter[0], ioaddr + MAC_HASH_LOW);
+ writel(mc_filter[1], ioaddr + MAC_HASH_HIGH);
+ }
+
+ writel(value, ioaddr + MAC_CONTROL);
+
+ DBG(KERN_INFO "%s: CTRL reg: 0x%08x Hash regs: "
+ "HI 0x%08x, LO 0x%08x\n",
+ __func__, readl(ioaddr + MAC_CONTROL),
+ readl(ioaddr + MAC_HASH_HIGH), readl(ioaddr + MAC_HASH_LOW));
+ return;
+}
+
+static void dwmac100_flow_ctrl(unsigned long ioaddr, unsigned int duplex,
+ unsigned int fc, unsigned int pause_time)
+{
+ unsigned int flow = MAC_FLOW_CTRL_ENABLE;
+
+ if (duplex)
+ flow |= (pause_time << MAC_FLOW_CTRL_PT_SHIFT);
+ writel(flow, ioaddr + MAC_FLOW_CTRL);
+
+ return;
+}
+
+/* No PMT module supported for this Ethernet Controller.
+ * Tested on ST platforms only.
+ */
+static void dwmac100_pmt(unsigned long ioaddr, unsigned long mode)
+{
+ return;
+}
+
+struct stmmac_ops dwmac100_ops = {
+ .core_init = dwmac100_core_init,
+ .dump_regs = dwmac100_dump_mac_regs,
+ .host_irq_status = dwmac100_irq_status,
+ .set_filter = dwmac100_set_filter,
+ .flow_ctrl = dwmac100_flow_ctrl,
+ .pmt = dwmac100_pmt,
+ .set_umac_addr = dwmac100_set_umac_addr,
+ .get_umac_addr = dwmac100_get_umac_addr,
+};
+
+struct mac_device_info *dwmac100_setup(unsigned long ioaddr)
+{
+ struct mac_device_info *mac;
+
+ mac = kzalloc(sizeof(const struct mac_device_info), GFP_KERNEL);
+
+ pr_info("\tDWMAC100\n");
+
+ mac->mac = &dwmac100_ops;
+ mac->desc = &dwmac100_desc_ops;
+ mac->dma = &dwmac100_dma_ops;
+
+ mac->pmt = PMT_NOT_SUPPORTED;
+ mac->link.port = MAC_CONTROL_PS;
+ mac->link.duplex = MAC_CONTROL_F;
+ mac->link.speed = 0;
+ mac->mii.addr = MAC_MII_ADDR;
+ mac->mii.data = MAC_MII_DATA;
+
+ return mac;
+}
diff --git a/drivers/net/stmmac/dwmac100_dma.c b/drivers/net/stmmac/dwmac100_dma.c
new file mode 100644
index 0000000..7fcc526
--- /dev/null
+++ b/drivers/net/stmmac/dwmac100_dma.c
@@ -0,0 +1,353 @@
+/*******************************************************************************
+ This is the driver for the MAC 10/100 on-chip Ethernet controller
+ currently tested on all the ST boards based on STb7109 and stx7200 SoCs.
+
+ DWC Ether MAC 10/100 Universal version 4.0 has been used for developing
+ this code.
+
+ This contains the functions to handle the dma and descriptors.
+
+ Copyright (C) 2007-2009 STMicroelectronics Ltd
+
+ This program is free software; you can redistribute it and/or modify it
+ under the terms and conditions of the GNU General Public License,
+ version 2, as published by the Free Software Foundation.
+
+ This program is distributed in the hope it will be useful, but WITHOUT
+ ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ more details.
+
+ You should have received a copy of the GNU General Public License along with
+ this program; if not, write to the Free Software Foundation, Inc.,
+ 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+
+ The full GNU General Public License is included in this distribution in
+ the file called "COPYING".
+
+ Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
+*******************************************************************************/
+
+#include "dwmac100.h"
+#include "dwmac_dma.h"
+
+static int dwmac100_dma_init(unsigned long ioaddr, int pbl, u32 dma_tx,
+ u32 dma_rx)
+{
+ u32 value = readl(ioaddr + DMA_BUS_MODE);
+ /* DMA SW reset */
+ value |= DMA_BUS_MODE_SFT_RESET;
+ writel(value, ioaddr + DMA_BUS_MODE);
+ do {} while ((readl(ioaddr + DMA_BUS_MODE) & DMA_BUS_MODE_SFT_RESET));
+
+ /* Enable Application Access by writing to DMA CSR0 */
+ writel(DMA_BUS_MODE_DEFAULT | (pbl << DMA_BUS_MODE_PBL_SHIFT),
+ ioaddr + DMA_BUS_MODE);
+
+ /* Mask interrupts by writing to CSR7 */
+ writel(DMA_INTR_DEFAULT_MASK, ioaddr + DMA_INTR_ENA);
+
+ /* The base address of the RX/TX descriptor lists must be written into
+ * DMA CSR3 and CSR4, respectively. */
+ writel(dma_tx, ioaddr + DMA_TX_BASE_ADDR);
+ writel(dma_rx, ioaddr + DMA_RCV_BASE_ADDR);
+
+ return 0;
+}
+
+/* Store and Forward capability is not used at all.
+ * The transmit threshold can be programmed by
+ * setting the TTC bits in the DMA control register. */
+static void dwmac100_dma_operation_mode(unsigned long ioaddr, int txmode,
+ int rxmode)
+{
+ u32 csr6 = readl(ioaddr + DMA_CONTROL);
+
+ if (txmode <= 32)
+ csr6 |= DMA_CONTROL_TTC_32;
+ else if (txmode <= 64)
+ csr6 |= DMA_CONTROL_TTC_64;
+ else
+ csr6 |= DMA_CONTROL_TTC_128;
+
+ writel(csr6, ioaddr + DMA_CONTROL);
+
+ return;
+}
+
+static void dwmac100_dump_dma_regs(unsigned long ioaddr)
+{
+ int i;
+
+ DBG(KERN_DEBUG "DWMAC 100 DMA CSR\n");
+ for (i = 0; i < 9; i++)
+ pr_debug("\t CSR%d (offset 0x%x): 0x%08x\n", i,
+ (DMA_BUS_MODE + i * 4),
+ readl(ioaddr + DMA_BUS_MODE + i * 4));
+ DBG(KERN_DEBUG "\t CSR20 (offset 0x%x): 0x%08x\n",
+ DMA_CUR_TX_BUF_ADDR, readl(ioaddr + DMA_CUR_TX_BUF_ADDR));
+ DBG(KERN_DEBUG "\t CSR21 (offset 0x%x): 0x%08x\n",
+ DMA_CUR_RX_BUF_ADDR, readl(ioaddr + DMA_CUR_RX_BUF_ADDR));
+ return;
+}
+
+/* DMA controller has two counters to track the number of
+ * the receive missed frames. */
+static void dwmac100_dma_diagnostic_fr(void *data, struct stmmac_extra_stats *x,
+ unsigned long ioaddr)
+{
+ struct net_device_stats *stats = (struct net_device_stats *)data;
+ u32 csr8 = readl(ioaddr + DMA_MISSED_FRAME_CTR);
+
+ if (unlikely(csr8)) {
+ if (csr8 & DMA_MISSED_FRAME_OVE) {
+ stats->rx_over_errors += 0x800;
+ x->rx_overflow_cntr += 0x800;
+ } else {
+ unsigned int ove_cntr;
+ ove_cntr = ((csr8 & DMA_MISSED_FRAME_OVE_CNTR) >> 17);
+ stats->rx_over_errors += ove_cntr;
+ x->rx_overflow_cntr += ove_cntr;
+ }
+
+ if (csr8 & DMA_MISSED_FRAME_OVE_M) {
+ stats->rx_missed_errors += 0xffff;
+ x->rx_missed_cntr += 0xffff;
+ } else {
+ unsigned int miss_f = (csr8 & DMA_MISSED_FRAME_M_CNTR);
+ stats->rx_missed_errors += miss_f;
+ x->rx_missed_cntr += miss_f;
+ }
+ }
+ return;
+}
+
+static int dwmac100_get_tx_status(void *data, struct stmmac_extra_stats *x,
+ struct dma_desc *p, unsigned long ioaddr)
+{
+ int ret = 0;
+ struct net_device_stats *stats = (struct net_device_stats *)data;
+
+ if (unlikely(p->des01.tx.error_summary)) {
+ if (unlikely(p->des01.tx.underflow_error)) {
+ x->tx_underflow++;
+ stats->tx_fifo_errors++;
+ }
+ if (unlikely(p->des01.tx.no_carrier)) {
+ x->tx_carrier++;
+ stats->tx_carrier_errors++;
+ }
+ if (unlikely(p->des01.tx.loss_carrier)) {
+ x->tx_losscarrier++;
+ stats->tx_carrier_errors++;
+ }
+ if (unlikely((p->des01.tx.excessive_deferral) ||
+ (p->des01.tx.excessive_collisions) ||
+ (p->des01.tx.late_collision)))
+ stats->collisions += p->des01.tx.collision_count;
+ ret = -1;
+ }
+ if (unlikely(p->des01.tx.heartbeat_fail)) {
+ x->tx_heartbeat++;
+ stats->tx_heartbeat_errors++;
+ ret = -1;
+ }
+ if (unlikely(p->des01.tx.deferred))
+ x->tx_deferred++;
+
+ return ret;
+}
+
+static int dwmac100_get_tx_len(struct dma_desc *p)
+{
+ return p->des01.tx.buffer1_size;
+}
+
+/* This function verifies if each incoming frame has some errors
+ * and, if required, updates the multicast statistics.
+ * In case of success, it returns csum_none because the device
+ * is not able to compute the csum in HW. */
+static int dwmac100_get_rx_status(void *data, struct stmmac_extra_stats *x,
+ struct dma_desc *p)
+{
+ int ret = csum_none;
+ struct net_device_stats *stats = (struct net_device_stats *)data;
+
+ if (unlikely(p->des01.rx.last_descriptor == 0)) {
+ pr_warning("dwmac100 Error: Oversized Ethernet "
+ "frame spanned multiple buffers\n");
+ stats->rx_length_errors++;
+ return discard_frame;
+ }
+
+ if (unlikely(p->des01.rx.error_summary)) {
+ if (unlikely(p->des01.rx.descriptor_error))
+ x->rx_desc++;
+ if (unlikely(p->des01.rx.partial_frame_error))
+ x->rx_partial++;
+ if (unlikely(p->des01.rx.run_frame))
+ x->rx_runt++;
+ if (unlikely(p->des01.rx.frame_too_long))
+ x->rx_toolong++;
+ if (unlikely(p->des01.rx.collision)) {
+ x->rx_collision++;
+ stats->collisions++;
+ }
+ if (unlikely(p->des01.rx.crc_error)) {
+ x->rx_crc++;
+ stats->rx_crc_errors++;
+ }
+ ret = discard_frame;
+ }
+ if (unlikely(p->des01.rx.dribbling))
+ ret = discard_frame;
+
+ if (unlikely(p->des01.rx.length_error)) {
+ x->rx_length++;
+ ret = discard_frame;
+ }
+ if (unlikely(p->des01.rx.mii_error)) {
+ x->rx_mii++;
+ ret = discard_frame;
+ }
+ if (p->des01.rx.multicast_frame) {
+ x->rx_multicast++;
+ stats->multicast++;
+ }
+ return ret;
+}
+
+static void dwmac100_init_rx_desc(struct dma_desc *p, unsigned int ring_size,
+ int disable_rx_ic)
+{
+ int i;
+ for (i = 0; i < ring_size; i++) {
+ p->des01.rx.own = 1;
+ p->des01.rx.buffer1_size = BUF_SIZE_2KiB - 1;
+ if (i == ring_size - 1)
+ p->des01.rx.end_ring = 1;
+ if (disable_rx_ic)
+ p->des01.rx.disable_ic = 1;
+ p++;
+ }
+ return;
+}
+
+static void dwmac100_init_tx_desc(struct dma_desc *p, unsigned int ring_size)
+{
+ int i;
+ for (i = 0; i < ring_size; i++) {
+ p->des01.tx.own = 0;
+ if (i == ring_size - 1)
+ p->des01.tx.end_ring = 1;
+ p++;
+ }
+ return;
+}
+
+static int dwmac100_get_tx_owner(struct dma_desc *p)
+{
+ return p->des01.tx.own;
+}
+
+static int dwmac100_get_rx_owner(struct dma_desc *p)
+{
+ return p->des01.rx.own;
+}
+
+static void dwmac100_set_tx_owner(struct dma_desc *p)
+{
+ p->des01.tx.own = 1;
+}
+
+static void dwmac100_set_rx_owner(struct dma_desc *p)
+{
+ p->des01.rx.own = 1;
+}
+
+static int dwmac100_get_tx_ls(struct dma_desc *p)
+{
+ return p->des01.tx.last_segment;
+}
+
+static void dwmac100_release_tx_desc(struct dma_desc *p)
+{
+ int ter = p->des01.tx.end_ring;
+
+ /* clean field used within the xmit */
+ p->des01.tx.first_segment = 0;
+ p->des01.tx.last_segment = 0;
+ p->des01.tx.buffer1_size = 0;
+
+ /* clean status reported */
+ p->des01.tx.error_summary = 0;
+ p->des01.tx.underflow_error = 0;
+ p->des01.tx.no_carrier = 0;
+ p->des01.tx.loss_carrier = 0;
+ p->des01.tx.excessive_deferral = 0;
+ p->des01.tx.excessive_collisions = 0;
+ p->des01.tx.late_collision = 0;
+ p->des01.tx.heartbeat_fail = 0;
+ p->des01.tx.deferred = 0;
+
+ /* set termination field */
+ p->des01.tx.end_ring = ter;
+
+ return;
+}
+
+static void dwmac100_prepare_tx_desc(struct dma_desc *p, int is_fs, int len,
+ int csum_flag)
+{
+ p->des01.tx.first_segment = is_fs;
+ p->des01.tx.buffer1_size = len;
+}
+
+static void dwmac100_clear_tx_ic(struct dma_desc *p)
+{
+ p->des01.tx.interrupt = 0;
+}
+
+static void dwmac100_close_tx_desc(struct dma_desc *p)
+{
+ p->des01.tx.last_segment = 1;
+ p->des01.tx.interrupt = 1;
+}
+
+static int dwmac100_get_rx_frame_len(struct dma_desc *p)
+{
+ return p->des01.rx.frame_length;
+}
+
+struct stmmac_dma_ops dwmac100_dma_ops = {
+ .init = dwmac100_dma_init,
+ .dump_regs = dwmac100_dump_dma_regs,
+ .dma_mode = dwmac100_dma_operation_mode,
+ .dma_diagnostic_fr = dwmac100_dma_diagnostic_fr,
+ .enable_dma_transmission = dwmac_enable_dma_transmission,
+ .enable_dma_irq = dwmac_enable_dma_irq,
+ .disable_dma_irq = dwmac_disable_dma_irq,
+ .start_tx = dwmac_dma_start_tx,
+ .stop_tx = dwmac_dma_stop_tx,
+ .start_rx = dwmac_dma_start_rx,
+ .stop_rx = dwmac_dma_stop_rx,
+ .dma_interrupt = dwmac_dma_interrupt,
+};
+
+struct stmmac_desc_ops dwmac100_desc_ops = {
+ .tx_status = dwmac100_get_tx_status,
+ .rx_status = dwmac100_get_rx_status,
+ .get_tx_len = dwmac100_get_tx_len,
+ .init_rx_desc = dwmac100_init_rx_desc,
+ .init_tx_desc = dwmac100_init_tx_desc,
+ .get_tx_owner = dwmac100_get_tx_owner,
+ .get_rx_owner = dwmac100_get_rx_owner,
+ .release_tx_desc = dwmac100_release_tx_desc,
+ .prepare_tx_desc = dwmac100_prepare_tx_desc,
+ .clear_tx_ic = dwmac100_clear_tx_ic,
+ .close_tx_desc = dwmac100_close_tx_desc,
+ .get_tx_ls = dwmac100_get_tx_ls,
+ .set_tx_owner = dwmac100_set_tx_owner,
+ .set_rx_owner = dwmac100_set_rx_owner,
+ .get_rx_frame_len = dwmac100_get_rx_frame_len,
+};
--
1.6.0.4
^ permalink raw reply related
* [PATCH] (net-2.6) stmmac update - Apr 2010
From: Giuseppe CAVALLARO @ 2010-04-09 10:24 UTC (permalink / raw)
To: netdev; +Cc: Giuseppe Cavallaro
Hello,
this is another subset of patches to make the driver more generic.
This patch set splits the dma and core code for the mac 10/100 device
(as already done for the gmac) and reorganizes the descriptor
structures.
In the first version of the driver, the mac10/100 could only use
normal descriptors and the gmac could only use the enhanced ones.
This limit has been removed and this kind of information comes
from the platform.
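To illustrate the idea only, here is a minimal userspace sketch of choosing
the descriptor operations from a platform-supplied flag. The names desc_ops,
plat_data, enh_desc and select_desc_ops are assumptions made for this
example and are not claimed to match the driver's real structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for a descriptor ops table. */
struct desc_ops {
	const char *name;
};

static const struct desc_ops norm_desc_ops = { .name = "normal" };
static const struct desc_ops enh_desc_ops = { .name = "enhanced" };

/* Hypothetical platform data: non-zero enh_desc asks for enhanced
 * descriptors, independently of which MAC core is present. */
struct plat_data {
	int enh_desc;
};

/* Pick the ops table from what the platform reports, not from the
 * MAC type. */
static const struct desc_ops *select_desc_ops(const struct plat_data *plat)
{
	assert(plat != NULL);
	return plat->enh_desc ? &enh_desc_ops : &norm_desc_ops;
}
```

With such a layout the core code calls through the returned table and never
needs to know which descriptor layout is in use.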
Best Regards,
Giuseppe
Giuseppe Cavallaro (7):
stmmac: split core and dma for the mac10/100
stmmac: rework normal and enhanced descriptors
stmmac: fix Transmit FIFO flush operation
stmmac: new descriptor field for the driver's platform
stmmac: get the descriptor structure from platform
stmmac: fix vlan support setup
stmmac: updated the drv module version
drivers/net/stmmac/Makefile | 2 +-
drivers/net/stmmac/common.h | 21 ++-
drivers/net/stmmac/dwmac100.c | 537 -----------------------------------
drivers/net/stmmac/dwmac100.h | 5 +
drivers/net/stmmac/dwmac1000.h | 12 -
drivers/net/stmmac/dwmac1000_core.c | 27 +-
drivers/net/stmmac/dwmac1000_dma.c | 336 +---------------------
drivers/net/stmmac/dwmac100_core.c | 201 +++++++++++++
drivers/net/stmmac/dwmac100_dma.c | 138 +++++++++
drivers/net/stmmac/dwmac_dma.h | 1 +
drivers/net/stmmac/dwmac_lib.c | 7 +
drivers/net/stmmac/enh_desc.c | 342 ++++++++++++++++++++++
drivers/net/stmmac/norm_desc.c | 240 ++++++++++++++++
drivers/net/stmmac/stmmac.h | 10 +-
drivers/net/stmmac/stmmac_main.c | 7 +
include/linux/stmmac.h | 1 +
16 files changed, 985 insertions(+), 902 deletions(-)
delete mode 100644 drivers/net/stmmac/dwmac100.c
create mode 100644 drivers/net/stmmac/dwmac100_core.c
create mode 100644 drivers/net/stmmac/dwmac100_dma.c
create mode 100644 drivers/net/stmmac/enh_desc.c
create mode 100644 drivers/net/stmmac/norm_desc.c
^ permalink raw reply
* [Patch 3/3] net: reserve ports for applications using fixed port numbers
From: Amerigo Wang @ 2010-04-09 10:11 UTC (permalink / raw)
To: linux-kernel
Cc: Octavian Purdila, Eric Dumazet, netdev, Neil Horman, Amerigo Wang,
David Miller, ebiederm
In-Reply-To: <20100409101442.5051.99812.sendpatchset@localhost.localdomain>
From: Octavian Purdila <opurdila@ixiacom.com>
This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
allows users to reserve ports for third-party applications.
The reserved ports will not be used by automatic port assignments
(e.g. when calling connect() or bind() with port number 0). Explicit
port allocation behavior is unchanged.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
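For readers who want the gist before the diff: a minimal userspace sketch of
how automatic port selection can skip reserved ports. It is not the kernel
code; reserve_port, is_reserved and pick_port are hypothetical names, and the
sketch leaves out the bind hash table lookups that the real allocators also
perform.

```c
#include <assert.h>
#include <limits.h>

#define PORT_COUNT	65536
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define WORDS		((PORT_COUNT + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per port; a set bit means "reserved". */
static unsigned long reserved[WORDS];

static void reserve_port(int port)
{
	assert(port >= 0 && port < PORT_COUNT);
	reserved[port / BITS_PER_WORD] |= 1UL << (port % BITS_PER_WORD);
}

static int is_reserved(int port)
{
	return (reserved[port / BITS_PER_WORD] >> (port % BITS_PER_WORD)) & 1UL;
}

/* Walk [low, high] once, starting at 'start' and wrapping around.
 * Returns the first port that is not reserved, or -1 if every port
 * in the range is reserved. */
static int pick_port(int low, int high, int start)
{
	int remaining = high - low + 1;
	int rover = start;

	assert(low <= start && start <= high);
	while (remaining-- > 0) {
		if (!is_reserved(rover))
			return rover;
		if (++rover > high)
			rover = low;
	}
	return -1;
}
```

An explicit bind() to a reserved port never goes through pick_port() in this
model, which mirrors the "explicit allocation is unchanged" rule above.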
Index: linux-2.6/Documentation/networking/ip-sysctl.txt
===================================================================
--- linux-2.6.orig/Documentation/networking/ip-sysctl.txt
+++ linux-2.6/Documentation/networking/ip-sysctl.txt
@@ -588,6 +588,37 @@ ip_local_port_range - 2 INTEGERS
(i.e. by default) range 1024-4999 is enough to issue up to
2000 connections per second to systems supporting timestamps.
+ip_local_reserved_ports - list of comma separated ranges
+ Specify the ports which are reserved for known third-party
+ applications. These ports will not be used by automatic port
+ assignments (e.g. when calling connect() or bind() with port
+ number 0). Explicit port allocation behavior is unchanged.
+
+ The format used for both input and output is a comma separated
+ list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and
+ 10). Writing to the file will clear all previously reserved
+ ports and update the current list with the one given in the
+ input.
+
+ Note that ip_local_port_range and ip_local_reserved_ports
+ settings are independent and both are considered by the kernel
+ when determining which ports are available for automatic port
+ assignments.
+
+ You can reserve ports which are not in the current
+ ip_local_port_range, e.g.:
+
+ $ cat /proc/sys/net/ipv4/ip_local_port_range
+ 32000 61000
+ $ cat /proc/sys/net/ipv4/ip_local_reserved_ports
+ 8080,9148
+
+ although this is redundant. However, such a setting is useful
+ if later the port range is changed to a value that will
+ include the reserved ports.
+
+ Default: Empty
+
ip_nonlocal_bind - BOOLEAN
If set, allows processes to bind() to non-local IP addresses,
which can be quite useful - but may break some applications.
Index: linux-2.6/drivers/infiniband/core/cma.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/core/cma.c
+++ linux-2.6/drivers/infiniband/core/cma.c
@@ -1980,6 +1980,8 @@ retry:
/* FIXME: add proper port randomization per like inet_csk_get_port */
do {
ret = idr_get_new_above(ps, bind_list, next_port, &port);
+ if (inet_is_reserved_local_port(port))
+ ret = -EAGAIN;
} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
if (ret)
@@ -2996,10 +2998,13 @@ static int __init cma_init(void)
{
int ret, low, high, remaining;
- get_random_bytes(&next_port, sizeof next_port);
inet_get_local_port_range(&low, &high);
+again:
+ get_random_bytes(&next_port, sizeof next_port);
remaining = (high - low) + 1;
next_port = ((unsigned int) next_port % remaining) + low;
+ if (inet_is_reserved_local_port(next_port))
+ goto again;
cma_wq = create_singlethread_workqueue("rdma_cm");
if (!cma_wq)
Index: linux-2.6/include/net/ip.h
===================================================================
--- linux-2.6.orig/include/net/ip.h
+++ linux-2.6/include/net/ip.h
@@ -184,6 +184,12 @@ extern struct local_ports {
} sysctl_local_ports;
extern void inet_get_local_port_range(int *low, int *high);
+extern unsigned long *sysctl_local_reserved_ports;
+static inline int inet_is_reserved_local_port(int port)
+{
+ return test_bit(port, sysctl_local_reserved_ports);
+}
+
extern int sysctl_ip_default_ttl;
extern int sysctl_ip_nonlocal_bind;
Index: linux-2.6/net/ipv4/af_inet.c
===================================================================
--- linux-2.6.orig/net/ipv4/af_inet.c
+++ linux-2.6/net/ipv4/af_inet.c
@@ -1552,9 +1552,13 @@ static int __init inet_init(void)
BUILD_BUG_ON(sizeof(struct inet_skb_parm) > sizeof(dummy_skb->cb));
+ sysctl_local_reserved_ports = kzalloc(65536 / 8, GFP_KERNEL);
+ if (!sysctl_local_reserved_ports)
+ goto out;
+
rc = proto_register(&tcp_prot, 1);
if (rc)
- goto out;
+ goto out_free_reserved_ports;
rc = proto_register(&udp_prot, 1);
if (rc)
@@ -1653,6 +1657,8 @@ out_unregister_udp_proto:
proto_unregister(&udp_prot);
out_unregister_tcp_proto:
proto_unregister(&tcp_prot);
+out_free_reserved_ports:
+ kfree(sysctl_local_reserved_ports);
goto out;
}
Index: linux-2.6/net/ipv4/inet_connection_sock.c
===================================================================
--- linux-2.6.orig/net/ipv4/inet_connection_sock.c
+++ linux-2.6/net/ipv4/inet_connection_sock.c
@@ -37,6 +37,9 @@ struct local_ports sysctl_local_ports __
.range = { 32768, 61000 },
};
+unsigned long *sysctl_local_reserved_ports;
+EXPORT_SYMBOL(sysctl_local_reserved_ports);
+
void inet_get_local_port_range(int *low, int *high)
{
unsigned seq;
@@ -108,6 +111,8 @@ again:
smallest_size = -1;
do {
+ if (inet_is_reserved_local_port(rover))
+ goto next_nolock;
head = &hashinfo->bhash[inet_bhashfn(net, rover,
hashinfo->bhash_size)];
spin_lock(&head->lock);
@@ -130,6 +135,7 @@ again:
break;
next:
spin_unlock(&head->lock);
+ next_nolock:
if (++rover > high)
rover = low;
} while (--remaining > 0);
Index: linux-2.6/net/ipv4/inet_hashtables.c
===================================================================
--- linux-2.6.orig/net/ipv4/inet_hashtables.c
+++ linux-2.6/net/ipv4/inet_hashtables.c
@@ -456,6 +456,8 @@ int __inet_hash_connect(struct inet_time
local_bh_disable();
for (i = 1; i <= remaining; i++) {
port = low + (i + offset) % remaining;
+ if (inet_is_reserved_local_port(port))
+ continue;
head = &hinfo->bhash[inet_bhashfn(net, port,
hinfo->bhash_size)];
spin_lock(&head->lock);
Index: linux-2.6/net/ipv4/sysctl_net_ipv4.c
===================================================================
--- linux-2.6.orig/net/ipv4/sysctl_net_ipv4.c
+++ linux-2.6/net/ipv4/sysctl_net_ipv4.c
@@ -299,6 +299,13 @@ static struct ctl_table ipv4_table[] = {
.mode = 0644,
.proc_handler = ipv4_local_port_range,
},
+ {
+ .procname = "ip_local_reserved_ports",
+ .data = NULL, /* initialized in sysctl_ipv4_init */
+ .maxlen = 65536,
+ .mode = 0644,
+ .proc_handler = proc_do_large_bitmap,
+ },
#ifdef CONFIG_IP_MULTICAST
{
.procname = "igmp_max_memberships",
@@ -736,6 +743,16 @@ static __net_initdata struct pernet_oper
static __init int sysctl_ipv4_init(void)
{
struct ctl_table_header *hdr;
+ struct ctl_table *i;
+
+ for (i = ipv4_table; i->procname; i++) {
+ if (strcmp(i->procname, "ip_local_reserved_ports") == 0) {
+ i->data = sysctl_local_reserved_ports;
+ break;
+ }
+ }
+ if (!i->procname)
+ return -EINVAL;
hdr = register_sysctl_paths(net_ipv4_ctl_path, ipv4_table);
if (hdr == NULL)
Index: linux-2.6/net/ipv4/udp.c
===================================================================
--- linux-2.6.orig/net/ipv4/udp.c
+++ linux-2.6/net/ipv4/udp.c
@@ -233,7 +233,8 @@ int udp_lib_get_port(struct sock *sk, un
*/
do {
if (low <= snum && snum <= high &&
- !test_bit(snum >> udptable->log, bitmap))
+ !test_bit(snum >> udptable->log, bitmap) &&
+ !inet_is_reserved_local_port(snum))
goto found;
snum += rand;
} while (snum != first);
Index: linux-2.6/net/sctp/socket.c
===================================================================
--- linux-2.6.orig/net/sctp/socket.c
+++ linux-2.6/net/sctp/socket.c
@@ -5436,6 +5436,8 @@ static long sctp_get_port_local(struct s
rover++;
if ((rover < low) || (rover > high))
rover = low;
+ if (inet_is_reserved_local_port(rover))
+ continue;
index = sctp_phashfn(rover);
head = &sctp_port_hashtable[index];
sctp_spin_lock(&head->lock);
^ permalink raw reply
* [Patch 2/3] sysctl: add proc_do_large_bitmap
From: Amerigo Wang @ 2010-04-09 10:11 UTC (permalink / raw)
To: linux-kernel
Cc: Octavian Purdila, ebiederm, Eric Dumazet, netdev, Neil Horman,
Amerigo Wang, David Miller
In-Reply-To: <20100409101442.5051.99812.sendpatchset@localhost.localdomain>
From: Octavian Purdila <opurdila@ixiacom.com>
The new function can be used to read/write large bitmaps via /proc. A
comma separated range format is used for compact output and input
(e.g. 1,3-4,10-10).
Writing into the file will first reset the bitmap then update it
based on the given input.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
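As a rough illustration of the input format, here is a userspace sketch that
parses a comma separated range list into a bitmap. It is an assumption-laden
approximation, not the kernel implementation: parse_ranges is a hypothetical
name, it uses strtoul() instead of the proc_get_ulong() helper, and its
handling of corner cases may differ from the patch below.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Parse a list such as "1,3-4,10-10" into 'bitmap' (nbits bits wide).
 * The bitmap is fully replaced on success and left untouched on error.
 * Returns 0, -EINVAL on malformed or out-of-range input, or -ENOMEM. */
static int parse_ranges(const char *s, unsigned long *bitmap, size_t nbits)
{
	size_t words = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;
	unsigned long *tmp;
	int err = 0;

	assert(nbits > 0);
	tmp = calloc(words, sizeof(*tmp));
	if (!tmp)
		return -ENOMEM;

	while (*s && *s != '\n') {
		unsigned long a, b;
		char *end;

		if (*s < '0' || *s > '9') {
			err = -EINVAL;
			break;
		}
		a = strtoul(s, &end, 10);
		b = a;
		if (*end == '-') {
			/* Second half of an "a-b" range. */
			s = end + 1;
			if (*s < '0' || *s > '9') {
				err = -EINVAL;
				break;
			}
			b = strtoul(s, &end, 10);
		}
		if (a > b || b >= nbits) {
			err = -EINVAL;
			break;
		}
		for (; a <= b; a++)
			tmp[a / BITS_PER_WORD] |= 1UL << (a % BITS_PER_WORD);

		if (*end == ',') {
			end++;
		} else if (*end && *end != '\n') {
			err = -EINVAL;
			break;
		}
		s = end;
	}

	/* Build in a scratch copy, then commit in one step. */
	if (!err)
		memcpy(bitmap, tmp, words * sizeof(*tmp));
	free(tmp);
	return err;
}
```

Building into a temporary bitmap first means a rejected write leaves the
previous contents in place, which is the same choice the patch makes with
tmp_bitmap.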
Index: linux-2.6/include/linux/sysctl.h
===================================================================
--- linux-2.6.orig/include/linux/sysctl.h
+++ linux-2.6/include/linux/sysctl.h
@@ -980,6 +980,8 @@ extern int proc_doulongvec_minmax(struct
void __user *, size_t *, loff_t *);
extern int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int,
void __user *, size_t *, loff_t *);
+extern int proc_do_large_bitmap(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
/*
* Register a set of sysctl names by calling register_sysctl_table
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -2072,6 +2072,23 @@ static bool isanyof(char c, const char *
return true;
}
+static int proc_skip_anyof(char __user **buf, size_t *size,
+ const char *v, unsigned len)
+{
+ char c;
+
+ while (*size) {
+ if (get_user(c, *buf))
+ return -EFAULT;
+ if (!isanyof(c, v, len))
+ break;
+ (*size)--;
+ (*buf)++;
+ }
+
+ return 0;
+}
+
#define TMPBUFLEN 22
/**
* proc_get_ulong - reads an ASCII formated integer from a user buffer
@@ -2663,6 +2680,135 @@ static int proc_do_cad_pid(struct ctl_ta
return 0;
}
+/**
+ * proc_do_large_bitmap - read/write from/to a large bitmap
+ * @table: the sysctl table
+ * @write: %TRUE if this is a write to the sysctl file
+ * @buffer: the user buffer
+ * @lenp: the size of the user buffer
+ * @ppos: file position
+ *
+ * The bitmap is stored at table->data and the bitmap length (in bits)
+ * in table->maxlen.
+ *
+ * We use a comma separated range format (e.g. 1,3-4,10-10) so that
+ * large bitmaps may be represented in a compact manner. Writing into
+ * the file will clear the bitmap then update it with the given input.
+ *
+ * Returns 0 on success.
+ */
+int proc_do_large_bitmap(struct ctl_table *table, int write,
+ void __user *_buffer, size_t *lenp, loff_t *ppos)
+{
+ int err = 0;
+ bool first = 1;
+ size_t left = *lenp;
+ unsigned long bitmap_len = table->maxlen;
+ char __user *buffer = (char __user *) _buffer;
+ unsigned long *bitmap = (unsigned long *) table->data;
+ unsigned long *tmp_bitmap = NULL;
+ char tr_a[] = { '-', ',', '\n', 0 }, tr_b[] = { ',', '\n', 0 }, c;
+ char tr_end[] = { '\n', 0 };
+
+
+ if (!bitmap_len || !left || (*ppos && !write)) {
+ *lenp = 0;
+ return 0;
+ }
+
+ if (write) {
+ tmp_bitmap = kzalloc(BITS_TO_LONGS(bitmap_len) * sizeof(unsigned long),
+ GFP_KERNEL);
+ if (!tmp_bitmap)
+ return -ENOMEM;
+ err = proc_skip_anyof(&buffer, &left, tr_end, sizeof(tr_end));
+ while (!err && left) {
+ unsigned long val_a, val_b;
+ bool neg;
+
+ err = proc_get_ulong(&buffer, &left, &val_a, &neg, tr_a,
+ sizeof(tr_a), &c);
+ if (err)
+ break;
+ if (val_a >= bitmap_len || neg) {
+ err = -EINVAL;
+ break;
+ }
+
+ val_b = val_a;
+ if (left) {
+ buffer++;
+ left--;
+ }
+
+ if (c == '-') {
+ err = proc_get_ulong(&buffer, &left, &val_b,
+ &neg, tr_b, sizeof(tr_b),
+ &c);
+ if (err)
+ break;
+ if (val_b >= bitmap_len || neg ||
+ val_a > val_b) {
+ err = -EINVAL;
+ break;
+ }
+ if (left) {
+ buffer++;
+ left--;
+ }
+ }
+
+ while (val_a <= val_b)
+ set_bit(val_a++, tmp_bitmap);
+
+ first = 0;
+ err = proc_skip_anyof(&buffer, &left, tr_end,
+ sizeof(tr_end));
+ }
+ } else {
+ unsigned long bit_a, bit_b = 0;
+
+ while (left) {
+ bit_a = find_next_bit(bitmap, bitmap_len, bit_b);
+ if (bit_a >= bitmap_len)
+ break;
+ bit_b = find_next_zero_bit(bitmap, bitmap_len,
+ bit_a + 1) - 1;
+
+ err = proc_put_ulong(&buffer, &left, bit_a, 0, first,
+ ',');
+ if (err)
+ break;
+ if (bit_a != bit_b) {
+ err = proc_put_char(&buffer, &left, '-');
+ if (err)
+ break;
+ err = proc_put_ulong(&buffer, &left, bit_b, 0,
+ 1, 0);
+ if (err)
+ break;
+ }
+
+ first = 0; bit_b++;
+ }
+ if (!err)
+ err = proc_put_char(&buffer, &left, '\n');
+ }
+
+ if (!err) {
+ if (write)
+ memcpy(bitmap, tmp_bitmap,
+ BITS_TO_LONGS(bitmap_len) * sizeof(unsigned long));
+ kfree(tmp_bitmap);
+ *lenp -= left;
+ *ppos += *lenp;
+ return 0;
+ } else {
+ kfree(tmp_bitmap);
+ return err;
+ }
+}
+
#else /* CONFIG_PROC_FS */
int proc_dostring(struct ctl_table *table, int write,
^ permalink raw reply
* [Patch 1/3] sysctl: refactor integer handling proc code
From: Amerigo Wang @ 2010-04-09 10:11 UTC (permalink / raw)
To: linux-kernel
Cc: Octavian Purdila, Eric Dumazet, netdev, Neil Horman, Amerigo Wang,
David Miller, ebiederm
In-Reply-To: <20100409101442.5051.99812.sendpatchset@localhost.localdomain>
From: Octavian Purdila <opurdila@ixiacom.com>
As we are about to add another integer handling proc function a little
bit of cleanup is in order: add a few helper functions to improve code
readability and decrease code duplication.
In the process a bug is also fixed: if the user specifies a number
with more than 20 digits it will be interpreted as two integers
(e.g. 10000...13 will be interpreted as 100.... and 13).
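The splitting effect described above can be reproduced with a small user-space model. This is only a sketch, not the kernel code: parse_windowed() and its reject_full_window flag are hypothetical names, strtoul() with base 10 stands in for simple_strtoul() with base 0, and the 22-byte buffer mirrors the TMPBUFLEN value used by the patch. With the flag clear, a number longer than the window is consumed in two pieces; with the flag set, the model refuses any number whose digits fill the whole window, which is roughly what the new length check in proc_get_ulong() does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TMPBUFLEN 22	/* same buffer size as in the patch */

/*
 * Illustrative user-space model (hypothetical helper, not kernel code).
 * Parses one unsigned decimal number from at most TMPBUFLEN-1 bytes of s.
 * Returns the number of bytes consumed, or -1 if there are no digits or,
 * when reject_full_window is set, if the digits fill the whole window.
 */
static int parse_windowed(const char *s, size_t size, unsigned long *val,
			  int reject_full_window)
{
	char tmp[TMPBUFLEN];
	char *end;
	size_t len = size < TMPBUFLEN - 1 ? size : TMPBUFLEN - 1;

	assert(s != NULL && val != NULL);

	/* Only the first TMPBUFLEN-1 bytes are ever looked at. */
	memcpy(tmp, s, len);
	tmp[len] = '\0';

	*val = strtoul(tmp, &end, 10);
	if (end == tmp)
		return -1;	/* no digits at all */

	/* The number may continue past the window, so give up. */
	if (reject_full_window && (size_t)(end - tmp) == TMPBUFLEN - 1)
		return -1;

	return (int)(end - tmp);
}
```

Calling the model twice in a row with the flag clear shows the second call picking up the tail of the same number as if it were a separate integer.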
Behavior for EFAULT handling was changed as well. Before this
patch, when an EFAULT error occurred in the middle of a write
operation, some of the elements had already been set, but that was not
acknowledged to the user (by shortening the write and returning the
number of bytes accepted). EFAULT is now treated just like any other
error: the number of bytes accepted is acknowledged.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -2040,8 +2040,148 @@ int proc_dostring(struct ctl_table *tabl
buffer, lenp, ppos);
}
+static int proc_skip_wspace(char __user **buf, size_t *size)
+{
+ char c;
+
+ while (*size) {
+ if (get_user(c, *buf))
+ return -EFAULT;
+ if (!isspace(c))
+ break;
+ (*size)--;
+ (*buf)++;
+ }
+
+ return 0;
+}
+
+static bool isanyof(char c, const char *v, unsigned len)
+{
+ int i;
+
+ if (!len)
+ return false;
+
+ for (i = 0; i < len; i++)
+ if (c == v[i])
+ break;
+ if (i == len)
+ return false;
+
+ return true;
+}
+
+#define TMPBUFLEN 22
+/**
+ * proc_get_ulong - reads an ASCII formated integer from a user buffer
+ *
+ * @buf - user buffer
+ * @size - size of the user buffer
+ * @val - this is where the number will be stored
+ * @neg - set to %TRUE if number is negative
+ * @perm_tr - a vector which contains the allowed trailers
+ * @perm_tr_len - size of the perm_tr vector
+ * @tr - pointer to store the trailer character
+ *
+ * In case of success 0 is returned and buf and size are updated with
+ * the amount of bytes read. If tr is non NULL and a trailing
+ * character exist (size is non zero after returning from this
+ * function) tr is updated with the trailing character.
+ */
+static int proc_get_ulong(char __user **buf, size_t *size,
+ unsigned long *val, bool *neg,
+ const char *perm_tr, unsigned perm_tr_len, char *tr)
+{
+ int len;
+ char *p, tmp[TMPBUFLEN];
+
+ if (!*size)
+ return -EINVAL;
+
+ len = *size;
+ if (len > TMPBUFLEN-1)
+ len = TMPBUFLEN-1;
+
+ if (copy_from_user(tmp, *buf, len))
+ return -EFAULT;
+
+ tmp[len] = 0;
+ p = tmp;
+ if (*p == '-' && *size > 1) {
+ *neg = 1;
+ p++;
+ } else
+ *neg = 0;
+ if (!isdigit(*p))
+ return -EINVAL;
+
+ *val = simple_strtoul(p, &p, 0);
+
+ len = p - tmp;
+
+ /* We don't know if the next char is whitespace thus we may accept
+ * invalid integers (e.g. 1234...a) or two integers instead of one
+ * (e.g. 123...1). So lets not allow such large numbers. */
+ if (len == TMPBUFLEN - 1)
+ return -EINVAL;
+
+ if (len < *size && perm_tr_len && !isanyof(*p, perm_tr, perm_tr_len))
+ return -EINVAL;
+
+ if (tr && (len < *size))
+ *tr = *p;
+
+ *buf += len;
+ *size -= len;
+
+ return 0;
+}
+
+/**
+ * proc_put_ulong - converts an integer to a decimal ASCII formatted string
+ *
+ * @buf - the user buffer
+ * @size - the size of the user buffer
+ * @val - the integer to be converted
+ * @neg - sign of the number, %TRUE for negative
+ * @first - if %FALSE will insert a separator character before the number
+ * @separator - the separator character
+ *
+ * In case of success 0 is returned and buf and size are updated with
+ * the amount of bytes read.
+ */
+static int proc_put_ulong(char __user **buf, size_t *size, unsigned long val,
+ bool neg, bool first, char separator)
+{
+ int len;
+ char tmp[TMPBUFLEN], *p = tmp;
+
+ if (!first)
+ *p++ = separator;
+ sprintf(p, "%s%lu", neg ? "-" : "", val);
+ len = strlen(tmp);
+ if (len > *size)
+ len = *size;
+ if (copy_to_user(*buf, tmp, len))
+ return -EFAULT;
+ *size -= len;
+ *buf += len;
+ return 0;
+}
+#undef TMPBUFLEN
+
+static int proc_put_char(char __user **buf, size_t *size, char c)
+{
+ if (*size) {
+ if (put_user(c, *buf))
+ return -EFAULT;
+ (*size)--, (*buf)++;
+ }
+ return 0;
+}
-static int do_proc_dointvec_conv(int *negp, unsigned long *lvalp,
+static int do_proc_dointvec_conv(bool *negp, unsigned long *lvalp,
int *valp,
int write, void *data)
{
@@ -2050,7 +2190,7 @@ static int do_proc_dointvec_conv(int *ne
} else {
int val = *valp;
if (val < 0) {
- *negp = -1;
+ *negp = 1;
*lvalp = (unsigned long)-val;
} else {
*negp = 0;
@@ -2060,20 +2200,18 @@ static int do_proc_dointvec_conv(int *ne
return 0;
}
+static const char proc_wspace_sep[] = { ' ', '\t', '\n', 0 };
+
static int __do_proc_dointvec(void *tbl_data, struct ctl_table *table,
- int write, void __user *buffer,
+ int write, void __user *_buffer,
size_t *lenp, loff_t *ppos,
- int (*conv)(int *negp, unsigned long *lvalp, int *valp,
+ int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
int write, void *data),
void *data)
{
-#define TMPBUFLEN 21
- int *i, vleft, first = 1, neg;
- unsigned long lval;
- size_t left, len;
-
- char buf[TMPBUFLEN], *p;
- char __user *s = buffer;
+ int *i, vleft, first = 1, err = 0;
+ size_t left;
+ char __user *buffer = (char __user *) _buffer;
if (!tbl_data || !table->maxlen || !*lenp ||
(*ppos && !write)) {
@@ -2089,88 +2227,48 @@ static int __do_proc_dointvec(void *tbl_
conv = do_proc_dointvec_conv;
for (; left && vleft--; i++, first=0) {
- if (write) {
- while (left) {
- char c;
- if (get_user(c, s))
- return -EFAULT;
- if (!isspace(c))
- break;
- left--;
- s++;
- }
- if (!left)
- break;
- neg = 0;
- len = left;
- if (len > sizeof(buf) - 1)
- len = sizeof(buf) - 1;
- if (copy_from_user(buf, s, len))
- return -EFAULT;
- buf[len] = 0;
- p = buf;
- if (*p == '-' && left > 1) {
- neg = 1;
- p++;
- }
- if (*p < '0' || *p > '9')
- break;
-
- lval = simple_strtoul(p, &p, 0);
+ unsigned long lval;
+ bool neg;
- len = p-buf;
- if ((len < left) && *p && !isspace(*p))
+ if (write) {
+ err = proc_skip_wspace(&buffer, &left);
+ if (err)
+ return err;
+ err = proc_get_ulong(&buffer, &left, &lval, &neg,
+ proc_wspace_sep,
+ sizeof(proc_wspace_sep), NULL);
+ if (err)
break;
- s += len;
- left -= len;
-
- if (conv(&neg, &lval, i, 1, data))
+ if (conv(&neg, &lval, i, 1, data)) {
+ err = -EINVAL;
break;
+ }
} else {
- p = buf;
- if (!first)
- *p++ = '\t';
-
- if (conv(&neg, &lval, i, 0, data))
+ if (conv(&neg, &lval, i, 0, data)) {
+ err = -EINVAL;
break;
-
- sprintf(p, "%s%lu", neg ? "-" : "", lval);
- len = strlen(buf);
- if (len > left)
- len = left;
- if(copy_to_user(s, buf, len))
- return -EFAULT;
- left -= len;
- s += len;
- }
- }
-
- if (!write && !first && left) {
- if(put_user('\n', s))
- return -EFAULT;
- left--, s++;
- }
- if (write) {
- while (left) {
- char c;
- if (get_user(c, s++))
- return -EFAULT;
- if (!isspace(c))
+ }
+ err = proc_put_ulong(&buffer, &left, lval, neg, first,
+ '\t');
+ if (err)
break;
- left--;
}
}
+
+ if (!write && !first && left && !err)
+ err = proc_put_char(&buffer, &left, '\n');
+ if (write && !err)
+ err = proc_skip_wspace(&buffer, &left);
if (write && first)
- return -EINVAL;
+ return err ? : -EINVAL;
*lenp -= left;
*ppos += *lenp;
return 0;
-#undef TMPBUFLEN
}
static int do_proc_dointvec(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos,
- int (*conv)(int *negp, unsigned long *lvalp, int *valp,
+ int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
int write, void *data),
void *data)
{
@@ -2238,8 +2336,8 @@ struct do_proc_dointvec_minmax_conv_para
int *max;
};
-static int do_proc_dointvec_minmax_conv(int *negp, unsigned long *lvalp,
- int *valp,
+static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
+ int *valp,
int write, void *data)
{
struct do_proc_dointvec_minmax_conv_param *param = data;
@@ -2252,7 +2350,7 @@ static int do_proc_dointvec_minmax_conv(
} else {
int val = *valp;
if (val < 0) {
- *negp = -1;
+ *negp = 1;
*lvalp = (unsigned long)-val;
} else {
*negp = 0;
@@ -2290,17 +2388,15 @@ int proc_dointvec_minmax(struct ctl_tabl
}
static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int write,
- void __user *buffer,
+ void __user *_buffer,
size_t *lenp, loff_t *ppos,
unsigned long convmul,
unsigned long convdiv)
{
-#define TMPBUFLEN 21
- unsigned long *i, *min, *max, val;
- int vleft, first=1, neg;
- size_t len, left;
- char buf[TMPBUFLEN], *p;
- char __user *s = buffer;
+ unsigned long *i, *min, *max;
+ int vleft, first = 1, err = 0;
+ size_t left;
+ char __user *buffer = (char __user *) _buffer;
if (!data || !table->maxlen || !*lenp ||
(*ppos && !write)) {
@@ -2315,82 +2411,42 @@ static int __do_proc_doulongvec_minmax(v
left = *lenp;
for (; left && vleft--; i++, min++, max++, first=0) {
+ unsigned long val;
+
if (write) {
- while (left) {
- char c;
- if (get_user(c, s))
- return -EFAULT;
- if (!isspace(c))
- break;
- left--;
- s++;
- }
- if (!left)
- break;
- neg = 0;
- len = left;
- if (len > TMPBUFLEN-1)
- len = TMPBUFLEN-1;
- if (copy_from_user(buf, s, len))
- return -EFAULT;
- buf[len] = 0;
- p = buf;
- if (*p == '-' && left > 1) {
- neg = 1;
- p++;
- }
- if (*p < '0' || *p > '9')
- break;
- val = simple_strtoul(p, &p, 0) * convmul / convdiv ;
- len = p-buf;
- if ((len < left) && *p && !isspace(*p))
+ bool neg;
+
+ err = proc_skip_wspace(&buffer, &left);
+ if (err)
+ return err;
+ err = proc_get_ulong(&buffer, &left, &val, &neg,
+ proc_wspace_sep,
+ sizeof(proc_wspace_sep), NULL);
+ if (err)
break;
if (neg)
- val = -val;
- s += len;
- left -= len;
-
- if(neg)
continue;
if ((min && val < *min) || (max && val > *max))
continue;
*i = val;
} else {
- p = buf;
- if (!first)
- *p++ = '\t';
- sprintf(p, "%lu", convdiv * (*i) / convmul);
- len = strlen(buf);
- if (len > left)
- len = left;
- if(copy_to_user(s, buf, len))
- return -EFAULT;
- left -= len;
- s += len;
- }
- }
-
- if (!write && !first && left) {
- if(put_user('\n', s))
- return -EFAULT;
- left--, s++;
- }
- if (write) {
- while (left) {
- char c;
- if (get_user(c, s++))
- return -EFAULT;
- if (!isspace(c))
+ val = convdiv * (*i) / convmul;
+ err = proc_put_ulong(&buffer, &left, val, 0, first,
+ '\t');
+ if (err)
break;
- left--;
}
}
+
+ if (!write && !first && left && !err)
+ err = proc_put_char(&buffer, &left, '\n');
+ if (write && !err)
+ err = proc_skip_wspace(&buffer, &left);
if (write && first)
- return -EINVAL;
+ return err ? : -EINVAL;
*lenp -= left;
*ppos += *lenp;
return 0;
-#undef TMPBUFLEN
}
static int do_proc_doulongvec_minmax(struct ctl_table *table, int write,
@@ -2451,7 +2507,7 @@ int proc_doulongvec_ms_jiffies_minmax(st
}
-static int do_proc_dointvec_jiffies_conv(int *negp, unsigned long *lvalp,
+static int do_proc_dointvec_jiffies_conv(bool *negp, unsigned long *lvalp,
int *valp,
int write, void *data)
{
@@ -2463,7 +2519,7 @@ static int do_proc_dointvec_jiffies_conv
int val = *valp;
unsigned long lval;
if (val < 0) {
- *negp = -1;
+ *negp = 1;
lval = (unsigned long)-val;
} else {
*negp = 0;
@@ -2474,7 +2530,7 @@ static int do_proc_dointvec_jiffies_conv
return 0;
}
-static int do_proc_dointvec_userhz_jiffies_conv(int *negp, unsigned long *lvalp,
+static int do_proc_dointvec_userhz_jiffies_conv(bool *negp, unsigned long *lvalp,
int *valp,
int write, void *data)
{
@@ -2486,7 +2542,7 @@ static int do_proc_dointvec_userhz_jiffi
int val = *valp;
unsigned long lval;
if (val < 0) {
- *negp = -1;
+ *negp = 1;
lval = (unsigned long)-val;
} else {
*negp = 0;
@@ -2497,7 +2553,7 @@ static int do_proc_dointvec_userhz_jiffi
return 0;
}
-static int do_proc_dointvec_ms_jiffies_conv(int *negp, unsigned long *lvalp,
+static int do_proc_dointvec_ms_jiffies_conv(bool *negp, unsigned long *lvalp,
int *valp,
int write, void *data)
{
@@ -2507,7 +2563,7 @@ static int do_proc_dointvec_ms_jiffies_c
int val = *valp;
unsigned long lval;
if (val < 0) {
- *negp = -1;
+ *negp = 1;
lval = (unsigned long)-val;
} else {
*negp = 0;
^ permalink raw reply
* [Patch v7 0/3] net: reserve ports for applications using fixed port numbers
From: Amerigo Wang @ 2010-04-09 10:10 UTC (permalink / raw)
To: linux-kernel
Cc: Octavian Purdila, ebiederm, Eric Dumazet, netdev, Neil Horman,
Amerigo Wang, David Miller
Changes from the previous version:
- Update to latest Linus' tree;
- Address Eric B.'s concern: copy the whole bitmap instead of setting bits one by one.
------------->
This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
allows users to reserve ports for third-party applications.
The reserved ports will not be used by automatic port assignments
(e.g. when calling connect() or bind() with port number 0). Explicit
port allocation behavior is unchanged.
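The comma separated range format that patch 2/3 uses for this file (e.g. 1,3-4,10-10) can be modelled in user space as follows. This is a sketch under assumptions, not the kernel implementation: parse_port_list(), bit_set() and bit_test() are hypothetical helpers, the bitmap size is fixed at 65536 ports, and a trailing comma is tolerated for simplicity. Like the kernel handler, it builds the result in a temporary bitmap and copies it over the target only if the whole input parsed cleanly.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define NPORTS		65536
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define NWORDS		((NPORTS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical helpers: set and test one bit in a word array. */
static void bit_set(unsigned long *map, unsigned long n)
{
	map[n / BITS_PER_WORD] |= 1UL << (n % BITS_PER_WORD);
}

static int bit_test(const unsigned long *map, unsigned long n)
{
	return (int)((map[n / BITS_PER_WORD] >> (n % BITS_PER_WORD)) & 1UL);
}

/*
 * Parse "a,b-c,..." (optionally ended by a newline) into map, which must
 * hold NWORDS words. Returns 0 on success and -1 on malformed input.
 * map is replaced as a whole on success and left untouched on failure.
 */
static int parse_port_list(const char *s, unsigned long *map)
{
	unsigned long tmp[NWORDS];

	memset(tmp, 0, sizeof(tmp));
	while (*s && *s != '\n') {
		unsigned long a, b;
		char *end;

		if (*s < '0' || *s > '9')
			return -1;
		a = strtoul(s, &end, 10);
		b = a;
		s = end;

		if (*s == '-') {	/* optional upper bound of a range */
			s++;
			if (*s < '0' || *s > '9')
				return -1;
			b = strtoul(s, &end, 10);
			s = end;
		}
		if (a > b || b >= NPORTS)
			return -1;
		for (; a <= b; a++)
			bit_set(tmp, a);

		if (*s == ',')
			s++;
		else if (*s && *s != '\n')
			return -1;
	}
	memcpy(map, tmp, sizeof(tmp));
	return 0;
}
```

A caller would zero a bitmap of NWORDS words, pass it together with a string such as "40000,40010-40012", and then query individual ports with bit_test().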
There are still some misbehaviors in proc parsing for odd invalid
input (for "40000\0-40001" everything is acknowledged but only 40000
is accepted), but they are not easy to fix without changing the current
"acknowledge how much we accepted" behavior.
Because of that, and because the same issues are present in the
existing proc_dointvec code as well, I don't think it's worth holding
up the actual feature (port reservation) over such petty error recovery
issues.
^ permalink raw reply
* Strange packet drops with heavy firewalling
From: Benny Amorsen @ 2010-04-09 9:56 UTC (permalink / raw)
To: netdev
I have a netfilter-box which is dropping packets. ethtool -S counts
10-20 rx_discards per second on the interface.
The switch does not have flow control enabled; with flow control enabled
the rx_discards turn into tx_on_sent which ultimately cause the same
problem (the load is pretty constant so the switch has to drop the
packets instead).
perf top shows something like:
5201.00 - 6.7% : _spin_unlock_irqrestore
4232.00 - 5.5% : finish_task_switch
3597.00 - 4.6% : tg3_poll [tg3]
3257.00 - 4.2% : handle_IRQ_event
2515.00 - 3.2% : tick_nohz_restart_sched_tick
1947.00 - 2.5% : nf_ct_tuple_equal
1927.00 - 2.5% : tg3_start_xmit [tg3]
1879.00 - 2.4% : kmem_cache_alloc_node
1625.00 - 2.1% : tick_nohz_stop_sched_tick
1619.00 - 2.1% : ipt_do_table
1595.00 - 2.1% : ip_route_input
1547.00 - 2.0% : kmem_cache_free
1474.00 - 1.9% : __alloc_skb
1424.00 - 1.8% : fget_light
1391.00 - 1.8% : nf_iterate
The rule set is quite large (more than 4000 rules), but organized so
that each packet only has to traverse a few rules before getting
accepted or rejected.
When the problem started we were using a different server, an old
two-socket 32-bit Xeon with hyperthreading. CPU usage often hit 100% on
one CPU with that server. After replacing the server with a ProLiant
DL160 G5 with a quad-core Xeon (without hyperthreading) the CPU usage
rarely exceeds 10% on any CPU, but the packet loss persists.
We're using the built-in dual Broadcom Corporation NetXtreme BCM5722 Gigabit
Ethernet PCI Express nics, and the kernel is
kernel-2.6.32.9-70.fc12.x86_64 from Fedora. Next step is probably
installing a better ethernet card, perhaps an Intel 82576-based one, so
that we can get multiqueue support.
The traffic is about 300Mbps (twice that if you count both in and out,
like Cisco).
/Benny
^ permalink raw reply
* [RFC][PATCH v3 3/3] Let host NIC driver to DMA to guest user space.
From: xiaohui.xin @ 2010-04-09 9:37 UTC (permalink / raw)
To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1270805865-16901-3-git-send-email-xiaohui.xin@intel.com>
From: Xin Xiaohui <xiaohui.xin@intel.com>
The patch lets the host NIC driver receive user space skbs,
so that the driver has a chance to DMA directly into guest user
space buffers through a single ethX interface.
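The netdev_mp_port_prep() helper added in netdevice.h below sanity-checks the header length, payload length and page count a driver reports before a port is used. The same check can be written as a stand-alone user-space sketch. The names mp_params and mp_params_valid() are hypothetical, and the page size and fragment limit are assumed values for illustration, not taken from any particular kernel configuration.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE		4096	/* assumption: 4 KiB pages */
#define MODEL_MAX_SKB_FRAGS	18	/* assumption: illustrative limit */

/* Hypothetical mirror of the three fields the patch validates. */
struct mp_params {
	int hdr_len;
	int data_len;
	int npages;
};

/*
 * Returns 1 if the parameters are consistent, 0 otherwise.
 * The payload must fit in npages pages and must need more than
 * npages - 1 pages' worth of space, matching the checks in the patch.
 */
static int mp_params_valid(const struct mp_params *p)
{
	assert(p != 0);

	if (p->hdr_len <= 0)
		return 0;
	if (p->npages <= 0 || p->npages > MODEL_MAX_SKB_FRAGS)
		return 0;
	if (p->data_len < MODEL_PAGE_SIZE * (p->npages - 1) ||
	    p->data_len > MODEL_PAGE_SIZE * p->npages)
		return 0;
	return 1;
}
```

With these assumed constants, the fallback values the patch uses when a driver reports nothing (128-byte header, 2048-byte payload, one page) pass the check.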
Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
alloc_skb() cleaned up by Jeff Dike <jdike@linux.intel.com>
include/linux/netdevice.h | 69 ++++++++++++++++++++++++++++++++++++++++-
include/linux/skbuff.h | 30 ++++++++++++++++--
net/core/dev.c | 63 ++++++++++++++++++++++++++++++++++++++
net/core/skbuff.c | 74 ++++++++++++++++++++++++++++++++++++++++----
4 files changed, 224 insertions(+), 12 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 94958c1..ba48eb0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -485,6 +485,17 @@ struct netdev_queue {
unsigned long tx_dropped;
} ____cacheline_aligned_in_smp;
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+struct mpassthru_port {
+ int hdr_len;
+ int data_len;
+ int npages;
+ unsigned flags;
+ struct socket *sock;
+ struct skb_user_page *(*ctor)(struct mpassthru_port *,
+ struct sk_buff *, int);
+};
+#endif
/*
* This structure defines the management hooks for network devices.
@@ -636,6 +647,10 @@ struct net_device_ops {
int (*ndo_fcoe_ddp_done)(struct net_device *dev,
u16 xid);
#endif
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+ int (*ndo_mp_port_prep)(struct net_device *dev,
+ struct mpassthru_port *port);
+#endif
};
/*
@@ -891,7 +906,8 @@ struct net_device
struct macvlan_port *macvlan_port;
/* GARP */
struct garp_port *garp_port;
-
+ /* mpassthru */
+ struct mpassthru_port *mp_port;
/* class/net/name entry */
struct device dev;
/* space for optional statistics and wireless sysfs groups */
@@ -2013,6 +2029,55 @@ static inline u32 dev_ethtool_get_flags(struct net_device *dev)
return 0;
return dev->ethtool_ops->get_flags(dev);
}
-#endif /* __KERNEL__ */
+/* To support zero-copy between user space application and NIC driver,
+ * we'd better ask NIC driver for the capability it can provide, especially
+ * for packet split mode, now we only ask for the header size, and the
+ * payload once a descriptor may carry.
+ */
+
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+static inline int netdev_mp_port_prep(struct net_device *dev,
+ struct mpassthru_port *port)
+{
+ int rc;
+ int npages, data_len;
+ const struct net_device_ops *ops = dev->netdev_ops;
+
+ /* needed by packet split */
+ if (ops->ndo_mp_port_prep) {
+ rc = ops->ndo_mp_port_prep(dev, port);
+ if (rc)
+ return rc;
+ } else {
+ /* If the NIC driver did not report this,
+ * then we try to use it as igb driver.
+ */
+ port->hdr_len = 128;
+ port->data_len = 2048;
+ port->npages = 1;
+ }
+
+ if (port->hdr_len <= 0)
+ goto err;
+
+ npages = port->npages;
+ data_len = port->data_len;
+ if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+ (data_len < PAGE_SIZE * (npages - 1) ||
+ data_len > PAGE_SIZE * npages))
+ goto err;
+
+ return 0;
+err:
+ dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+ return -EINVAL;
+}
+
+extern int netdev_mp_port_attach(struct net_device *dev,
+ struct mpassthru_port *port);
+extern void netdev_mp_port_detach(struct net_device *dev);
+#endif /* CONFIG_VHOST_PASSTHRU */
+#endif /* __KERNEL__ */
#endif /* _LINUX_NETDEVICE_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index df7b23a..e59fa57 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -209,6 +209,13 @@ struct skb_shared_info {
void * destructor_arg;
};
+struct skb_user_page {
+ u8 *start;
+ int size;
+ struct skb_frag_struct *frags;
+ struct skb_shared_info *ushinfo;
+ void (*dtor)(struct skb_user_page *);
+};
/* We divide dataref into two halves. The higher 16 bits hold references
* to the payload part of skb->data. The lower 16 bits hold references to
* the entire skb->data. A clone of a headerless skb holds the length of
@@ -441,17 +448,18 @@ extern void kfree_skb(struct sk_buff *skb);
extern void consume_skb(struct sk_buff *skb);
extern void __kfree_skb(struct sk_buff *skb);
extern struct sk_buff *__alloc_skb(unsigned int size,
- gfp_t priority, int fclone, int node);
+ gfp_t priority, int fclone,
+ int node, struct net_device *dev);
static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
{
- return __alloc_skb(size, priority, 0, -1);
+ return __alloc_skb(size, priority, 0, -1, NULL);
}
static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
gfp_t priority)
{
- return __alloc_skb(size, priority, 1, -1);
+ return __alloc_skb(size, priority, 1, -1, NULL);
}
extern int skb_recycle_check(struct sk_buff *skb, int skb_size);
@@ -1509,6 +1517,22 @@ static inline void netdev_free_page(struct net_device *dev, struct page *page)
__free_page(page);
}
+extern struct skb_user_page *netdev_alloc_user_pages(struct net_device *dev,
+ struct sk_buff *skb, int npages);
+
+static inline struct skb_user_page *netdev_alloc_user_page(
+ struct net_device *dev,
+ struct sk_buff *skb, unsigned int size)
+{
+ struct skb_user_page *user;
+ int npages = (size < PAGE_SIZE) ? 1 : (size / PAGE_SIZE);
+
+ user = netdev_alloc_user_pages(dev, skb, npages);
+ if (likely(user))
+ return user;
+ return NULL;
+}
+
/**
* skb_clone_writable - is the header of a clone writable
* @skb: buffer to check
diff --git a/net/core/dev.c b/net/core/dev.c
index b8f74cf..b50bdcb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2265,6 +2265,61 @@ void netif_nit_deliver(struct sk_buff *skb)
rcu_read_unlock();
}
+/* Add a hook to intercept zero-copy packets, and insert it
+ * to the socket queue specially.
+ */
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+int netdev_mp_port_attach(struct net_device *dev,
+ struct mpassthru_port *port)
+{
+ /* locked by mp_mutex */
+ if (rcu_dereference(dev->mp_port))
+ return -EBUSY;
+
+ rcu_assign_pointer(dev->mp_port, port);
+
+ return 0;
+}
+EXPORT_SYMBOL(netdev_mp_port_attach);
+
+void netdev_mp_port_detach(struct net_device *dev)
+{
+ /* locked by mp_mutex */
+ if (!rcu_dereference(dev->mp_port))
+ return;
+
+ rcu_assign_pointer(dev->mp_port, NULL);
+ synchronize_rcu();
+}
+EXPORT_SYMBOL(netdev_mp_port_detach);
+
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+ struct packet_type **pt_prev,
+ int *ret, struct net_device *orig_dev)
+{
+ struct mpassthru_port *mp_port = NULL;
+ struct sock *sk = NULL;
+
+ if (skb->dev)
+ mp_port = skb->dev->mp_port;
+ if (!mp_port)
+ return skb;
+
+ if (*pt_prev) {
+ *ret = deliver_skb(skb, *pt_prev, orig_dev);
+ *pt_prev = NULL;
+ }
+
+ sk = mp_port->sock->sk;
+ skb_queue_tail(&sk->sk_receive_queue, skb);
+ sk->sk_data_ready(sk, skb->len);
+
+ return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
+#endif
+
/**
* netif_receive_skb - process receive buffer from network
* @skb: buffer to process
@@ -2342,6 +2397,9 @@ int netif_receive_skb(struct sk_buff *skb)
goto out;
ncls:
#endif
+ skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+ if (!skb)
+ goto out;
skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
if (!skb)
@@ -2455,6 +2513,11 @@ int dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
if (skb_is_gso(skb) || skb_has_frags(skb))
goto normal;
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+ if (skb->dev && skb->dev->mp_port)
+ goto normal;
+#endif
+
rcu_read_lock();
list_for_each_entry_rcu(ptype, head, list) {
if (ptype->type != type || ptype->dev || !ptype->gro_receive)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 80a9616..e684898 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -170,13 +170,15 @@ EXPORT_SYMBOL(skb_under_panic);
* %GFP_ATOMIC.
*/
struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
- int fclone, int node)
+ int fclone, int node, struct net_device *dev)
{
struct kmem_cache *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
-
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+ struct skb_user_page *user = NULL;
+#endif
cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
/* Get the HEAD */
@@ -185,8 +187,26 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
goto out;
size = SKB_DATA_ALIGN(size);
- data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
- gfp_mask, node);
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+ if (!dev || !dev->mp_port) { /* Legacy alloc func */
+#endif
+ data = kmalloc_node_track_caller(
+ size + sizeof(struct skb_shared_info),
+ gfp_mask, node);
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+ } else { /* Allocation may from page constructor of device */
+ user = netdev_alloc_user_page(dev, skb, size);
+ if (!user) {
+ data = kmalloc_node_track_caller(
+ size + sizeof(struct skb_shared_info),
+ gfp_mask, node);
+ printk(KERN_INFO "can't alloc user buffer.\n");
+ } else {
+ data = user->start;
+ size = SKB_DATA_ALIGN(user->size);
+ }
+ }
+#endif
if (!data)
goto nodata;
@@ -208,6 +228,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
skb->mac_header = ~0U;
#endif
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+ if (user)
+ memcpy(user->ushinfo, skb_shinfo(skb),
+ sizeof(struct skb_shared_info));
+#endif
/* make sure we initialize shinfo sequentially */
shinfo = skb_shinfo(skb);
atomic_set(&shinfo->dataref, 1);
@@ -231,6 +256,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
child->fclone = SKB_FCLONE_UNAVAILABLE;
}
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+ shinfo->destructor_arg = user;
+#endif
+
out:
return skb;
nodata:
@@ -259,7 +288,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
struct sk_buff *skb;
- skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+ skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node, dev);
if (likely(skb)) {
skb_reserve(skb, NET_SKB_PAD);
skb->dev = dev;
@@ -278,6 +307,24 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
}
EXPORT_SYMBOL(__netdev_alloc_page);
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+struct skb_user_page *netdev_alloc_user_pages(struct net_device *dev,
+ struct sk_buff *skb, int npages)
+{
+ struct mpassthru_port *ctor;
+ struct skb_user_page *user = NULL;
+
+ ctor = rcu_dereference(dev->mp_port);
+ if (!ctor)
+ goto out;
+ BUG_ON(npages > ctor->npages);
+ user = ctor->ctor(ctor, skb, npages);
+out:
+ return user;
+}
+EXPORT_SYMBOL(netdev_alloc_user_pages);
+#endif
+
void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
int size)
{
@@ -338,6 +385,10 @@ static void skb_clone_fraglist(struct sk_buff *skb)
static void skb_release_data(struct sk_buff *skb)
{
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+ struct skb_user_page *user = skb_shinfo(skb)->destructor_arg;
+#endif
+
if (!skb->cloned ||
!atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
&skb_shinfo(skb)->dataref)) {
@@ -349,7 +400,10 @@ static void skb_release_data(struct sk_buff *skb)
if (skb_has_frags(skb))
skb_drop_fraglist(skb);
-
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+ if (skb->dev && skb->dev->mp_port && user && user->dtor)
+ user->dtor(user);
+#endif
kfree(skb->head);
}
}
@@ -503,8 +557,14 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
if (skb_shared(skb) || skb_cloned(skb))
return 0;
- skb_release_head_state(skb);
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+ if (skb->dev && skb->dev->mp_port)
+ return 0;
+#endif
+
shinfo = skb_shinfo(skb);
+
+ skb_release_head_state(skb);
atomic_set(&shinfo->dataref, 1);
shinfo->nr_frags = 0;
shinfo->gso_size = 0;
--
1.5.4.4
^ permalink raw reply related
* [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: xiaohui.xin @ 2010-04-09 9:37 UTC (permalink / raw)
To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1270805865-16901-1-git-send-email-xiaohui.xin@intel.com>
From: Xin Xiaohui <xiaohui.xin@intel.com>
Add a device that uses the vhost-net backend driver for
copy-less data transfer between the guest front end (FE) and the host NIC.
It pins the guest user space buffers in host memory and
provides sendmsg/recvmsg proto_ops to vhost-net.
Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
memory leak fixed,
kconfig made,
do_unbind() made,
mp_chr_ioctl() cleaned up and
some other cleanups made
by Jeff Dike <jdike@linux.intel.com>
drivers/vhost/Kconfig | 5 +
drivers/vhost/Makefile | 2 +
drivers/vhost/mpassthru.c | 1264 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/mpassthru.h | 29 +
4 files changed, 1300 insertions(+), 0 deletions(-)
create mode 100644 drivers/vhost/mpassthru.c
create mode 100644 include/linux/mpassthru.h
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9f409f4..ee32a3b 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,8 @@ config VHOST_NET
To compile this driver as a module, choose M here: the module will
be called vhost_net.
+config VHOST_PASSTHRU
+ tristate "Zerocopy network driver (EXPERIMENTAL)"
+ depends on VHOST_NET
+ ---help---
+ zerocopy network I/O support
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..3f79c79 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
obj-$(CONFIG_VHOST_NET) += vhost_net.o
vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_VHOST_PASSTHRU) += mpassthru.o
diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 0000000..86d2525
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1264 @@
+/*
+ * MPASSTHRU - Mediate passthrough device.
+ * Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+#include "vhost.h"
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+ u16 offset;
+ u16 size;
+};
+
+struct page_ctor {
+ struct list_head readq;
+ int w_len;
+ int r_len;
+ spinlock_t read_lock;
+ struct kmem_cache *cache;
+ /* record the locked pages */
+ int lock_pages;
+ struct rlimit o_rlim;
+ struct net_device *dev;
+ struct mpassthru_port port;
+};
+
+struct page_info {
+ void *ctrl;
+ struct list_head list;
+ int header;
+ /* indicate the actual length of bytes
+ * send/recv in the user space buffers
+ */
+ int total;
+ int offset;
+ struct page *pages[MAX_SKB_FRAGS+1];
+ struct skb_frag_struct frag[MAX_SKB_FRAGS+1];
+ struct sk_buff *skb;
+ struct page_ctor *ctor;
+
+ /* The pointer relayed to the skb, to indicate whether
+ * it is a user space allocated skb or a kernel one
+ */
+ struct skb_user_page user;
+ struct skb_shared_info ushinfo;
+
+#define INFO_READ 0
+#define INFO_WRITE 1
+ unsigned flags;
+ unsigned pnum;
+
+ /* Only meaningful for receive:
+ * the maximum length allowed
+ */
+ size_t len;
+
+ /* The fields after this point are for the backend
+ * driver, currently vhost-net.
+ */
+
+ struct kiocb *iocb;
+ unsigned int desc_pos;
+ unsigned int log;
+ struct iovec hdr[VHOST_NET_MAX_SG];
+ struct iovec iov[VHOST_NET_MAX_SG];
+};
+
+struct mp_struct {
+ struct mp_file *mfile;
+ struct net_device *dev;
+ struct page_ctor *ctor;
+ struct socket socket;
+
+#ifdef MPASSTHRU_DEBUG
+ int debug;
+#endif
+};
+
+struct mp_file {
+ atomic_t count;
+ struct mp_struct *mp;
+ struct net *net;
+};
+
+struct mp_sock {
+ struct sock sk;
+ struct mp_struct *mp;
+};
+
+static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
+{
+ int ret = 0;
+
+ rtnl_lock();
+ ret = dev_change_flags(dev, flags);
+ rtnl_unlock();
+
+ if (ret < 0)
+ printk(KERN_ERR "failed to change dev state of %s\n", dev->name);
+
+ return ret;
+}
+
+/* The main function to allocate user space buffers */
+static struct skb_user_page *page_ctor(struct mpassthru_port *port,
+ struct sk_buff *skb, int npages)
+{
+ int i;
+ unsigned long flags;
+ struct page_ctor *ctor;
+ struct page_info *info = NULL;
+
+ ctor = container_of(port, struct page_ctor, port);
+
+ spin_lock_irqsave(&ctor->read_lock, flags);
+ if (!list_empty(&ctor->readq)) {
+ info = list_first_entry(&ctor->readq, struct page_info, list);
+ list_del(&info->list);
+ }
+ spin_unlock_irqrestore(&ctor->read_lock, flags);
+ if (!info)
+ return NULL;
+
+ for (i = 0; i < info->pnum; i++) {
+ get_page(info->pages[i]);
+ info->frag[i].page = info->pages[i];
+ info->frag[i].page_offset = i ? 0 : info->offset;
+ info->frag[i].size = port->npages > 1 ? PAGE_SIZE :
+ port->data_len;
+ }
+ info->skb = skb;
+ info->user.frags = info->frag;
+ info->user.ushinfo = &info->ushinfo;
+ return &info->user;
+}
+
+static void mp_ki_dtor(struct kiocb *iocb)
+{
+ struct page_info *info = (struct page_info *)(iocb->private);
+ int i;
+
+ if (info->flags == INFO_READ) {
+ for (i = 0; i < info->pnum; i++) {
+ if (info->pages[i]) {
+ set_page_dirty_lock(info->pages[i]);
+ put_page(info->pages[i]);
+ }
+ }
+ skb_shinfo(info->skb)->destructor_arg = &info->user;
+ info->skb->destructor = NULL;
+ kfree_skb(info->skb);
+ }
+ /* Decrement the number of locked pages */
+ info->ctor->lock_pages -= info->pnum;
+ kmem_cache_free(info->ctor->cache, info);
+
+ return;
+}
+
+static struct kiocb *create_iocb(struct page_info *info, int size)
+{
+ struct kiocb *iocb = NULL;
+
+ iocb = info->iocb;
+ if (!iocb)
+ return iocb;
+ iocb->ki_flags = 0;
+ iocb->ki_users = 1;
+ iocb->ki_key = 0;
+ iocb->ki_ctx = NULL;
+ iocb->ki_cancel = NULL;
+ iocb->ki_retry = NULL;
+ iocb->ki_iovec = NULL;
+ iocb->ki_eventfd = NULL;
+ iocb->private = (void *)info;
+ iocb->ki_pos = info->desc_pos;
+ iocb->ki_nbytes = size;
+ iocb->ki_user_data = info->log;
+ iocb->ki_dtor = mp_ki_dtor;
+ return iocb;
+}
+
+/* A helper to clean the skb before the kfree_skb() */
+
+static void page_dtor_prepare(struct page_info *info)
+{
+ if (info->flags == INFO_READ)
+ if (info->skb)
+ info->skb->head = NULL;
+}
+
+/* The callback to destruct the user space buffers or skb */
+static void page_dtor(struct skb_user_page *user)
+{
+ struct page_info *info;
+ struct page_ctor *ctor;
+ struct sock *sk;
+ struct sk_buff *skb;
+ struct kiocb *iocb = NULL;
+ struct vhost_virtqueue *vq = NULL;
+ unsigned long flags;
+ int i;
+
+ if (!user)
+ return;
+ info = container_of(user, struct page_info, user);
+ if (!info)
+ return;
+ ctor = info->ctor;
+ skb = info->skb;
+
+ page_dtor_prepare(info);
+
+ /* If info->total is 0 the buffer was not used: requeue it for reuse */
+ if (!info->total) {
+ spin_lock_irqsave(&ctor->read_lock, flags);
+ list_add(&info->list, &ctor->readq);
+ spin_unlock_irqrestore(&ctor->read_lock, flags);
+ return;
+ }
+
+ if (info->flags == INFO_READ)
+ return;
+
+ /* For transmit, we should wait for the DMA finish by hardware.
+ * Queue the notifier to wake up the backend driver
+ */
+ vq = (struct vhost_virtqueue *)info->ctrl;
+ iocb = create_iocb(info, info->total);
+
+ spin_lock_irqsave(&vq->notify_lock, flags);
+ list_add_tail(&iocb->ki_list, &vq->notifier);
+ spin_unlock_irqrestore(&vq->notify_lock, flags);
+
+ sk = ctor->port.sock->sk;
+ sk->sk_write_space(sk);
+
+ return;
+}
+
+static int page_ctor_attach(struct mp_struct *mp)
+{
+ int rc;
+ struct page_ctor *ctor;
+ struct net_device *dev = mp->dev;
+
+ /* locked by mp_mutex */
+ if (rcu_dereference(mp->ctor))
+ return -EBUSY;
+
+ ctor = kzalloc(sizeof(*ctor), GFP_KERNEL);
+ if (!ctor)
+ return -ENOMEM;
+ rc = netdev_mp_port_prep(dev, &ctor->port);
+ if (rc)
+ goto cache_fail;
+
+ ctor->cache = kmem_cache_create("skb_page_info",
+ sizeof(struct page_info), 0,
+ SLAB_HWCACHE_ALIGN, NULL);
+ rc = -ENOMEM;
+ if (!ctor->cache)
+ goto cache_fail;
+
+ INIT_LIST_HEAD(&ctor->readq);
+ spin_lock_init(&ctor->read_lock);
+
+ ctor->w_len = 0;
+ ctor->r_len = 0;
+
+ dev_hold(dev);
+ ctor->dev = dev;
+ ctor->port.ctor = page_ctor;
+ ctor->port.sock = &mp->socket;
+ ctor->lock_pages = 0;
+ rc = netdev_mp_port_attach(dev, &ctor->port);
+ if (rc)
+ goto fail;
+
+ /* locked by mp_mutex */
+ rcu_assign_pointer(mp->ctor, ctor);
+
+ /* XXX:Need we do set_offload here ? */
+
+ return 0;
+
+fail:
+ dev_put(dev);
+ kmem_cache_destroy(ctor->cache);
+cache_fail:
+ kfree(ctor);
+
+ return rc;
+}
+
+static struct page_info *info_dequeue(struct page_ctor *ctor)
+{
+ unsigned long flags;
+ struct page_info *info = NULL;
+ spin_lock_irqsave(&ctor->read_lock, flags);
+ if (!list_empty(&ctor->readq)) {
+ info = list_first_entry(&ctor->readq,
+ struct page_info, list);
+ list_del(&info->list);
+ }
+ spin_unlock_irqrestore(&ctor->read_lock, flags);
+ return info;
+}
+
+static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
+ unsigned long cur, unsigned long max)
+{
+ struct rlimit new_rlim, *old_rlim;
+ int retval;
+
+ if (resource != RLIMIT_MEMLOCK)
+ return -EINVAL;
+ new_rlim.rlim_cur = cur;
+ new_rlim.rlim_max = max;
+
+ old_rlim = current->signal->rlim + resource;
+
+ /* remember the old rlimit value when backend enabled */
+ ctor->o_rlim.rlim_cur = old_rlim->rlim_cur;
+ ctor->o_rlim.rlim_max = old_rlim->rlim_max;
+
+ if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
+ !capable(CAP_SYS_RESOURCE))
+ return -EPERM;
+
+ retval = security_task_setrlimit(resource, &new_rlim);
+ if (retval)
+ return retval;
+
+ task_lock(current->group_leader);
+ *old_rlim = new_rlim;
+ task_unlock(current->group_leader);
+ return 0;
+}
+
+static int page_ctor_detach(struct mp_struct *mp)
+{
+ struct page_ctor *ctor;
+ struct page_info *info;
+ struct vhost_virtqueue *vq = NULL;
+ struct kiocb *iocb = NULL;
+ int i;
+ unsigned long flags;
+
+ /* locked by mp_mutex */
+ ctor = rcu_dereference(mp->ctor);
+ if (!ctor)
+ return -ENODEV;
+
+ while ((info = info_dequeue(ctor))) {
+ for (i = 0; i < info->pnum; i++)
+ if (info->pages[i])
+ put_page(info->pages[i]);
+ vq = (struct vhost_virtqueue *)(info->ctrl);
+ iocb = create_iocb(info, 0);
+
+ spin_lock_irqsave(&vq->notify_lock, flags);
+ list_add_tail(&iocb->ki_list, &vq->notifier);
+ spin_unlock_irqrestore(&vq->notify_lock, flags);
+
+ kmem_cache_free(ctor->cache, info);
+ }
+ set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
+ ctor->o_rlim.rlim_cur,
+ ctor->o_rlim.rlim_max);
+ kmem_cache_destroy(ctor->cache);
+ netdev_mp_port_detach(ctor->dev);
+ dev_put(ctor->dev);
+
+ /* locked by mp_mutex */
+ rcu_assign_pointer(mp->ctor, NULL);
+ synchronize_rcu();
+
+ kfree(ctor);
+ return 0;
+}
+
+/* For small user space buffers transmit, we don't need to call
+ * get_user_pages().
+ */
+static struct page_info *alloc_small_page_info(struct page_ctor *ctor,
+ struct kiocb *iocb, int total)
+{
+ struct page_info *info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);
+
+ if (!info)
+ return NULL;
+ info->total = total;
+ info->user.dtor = page_dtor;
+ info->ctor = ctor;
+ info->flags = INFO_WRITE;
+ info->iocb = iocb;
+ return info;
+}
+
+/* The main function to transform the guest user space address
+ * to host kernel address via get_user_pages(). Thus the hardware
+ * can do DMA directly to the user space address.
+ */
+static struct page_info *alloc_page_info(struct page_ctor *ctor,
+ struct kiocb *iocb, struct iovec *iov,
+ int count, struct frag *frags,
+ int npages, int total)
+{
+ int rc;
+ int i, j, n = 0;
+ int len;
+ unsigned long base, lock_limit;
+ struct page_info *info = NULL;
+
+ lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
+ lock_limit >>= PAGE_SHIFT;
+
+ if (ctor->lock_pages + count > lock_limit) {
+ printk(KERN_INFO "exceed the locked memory rlimit %lu!\n",
+ lock_limit);
+ return NULL;
+ }
+
+ info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);
+
+ if (!info)
+ return NULL;
+
+ for (i = j = 0; i < count; i++) {
+ base = (unsigned long)iov[i].iov_base;
+ len = iov[i].iov_len;
+
+ if (!len)
+ continue;
+ n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+
+ rc = get_user_pages_fast(base, n, npages ? 1 : 0,
+ &info->pages[j]);
+ if (rc != n)
+ goto failed;
+
+ while (n--) {
+ frags[j].offset = base & ~PAGE_MASK;
+ frags[j].size = min_t(int, len,
+ PAGE_SIZE - frags[j].offset);
+ len -= frags[j].size;
+ base += frags[j].size;
+ j++;
+ }
+ }
+
+#ifdef CONFIG_HIGHMEM
+ if (npages && !(ctor->dev->features & NETIF_F_HIGHDMA)) {
+ for (i = 0; i < j; i++) {
+ if (PageHighMem(info->pages[i]))
+ goto failed;
+ }
+ }
+#endif
+
+ info->total = total;
+ info->user.dtor = page_dtor;
+ info->ctor = ctor;
+ info->pnum = j;
+ info->iocb = iocb;
+ if (!npages)
+ info->flags = INFO_WRITE;
+ if (info->flags == INFO_READ) {
+ info->user.start = (u8 *)(((unsigned long)
+ (pfn_to_kaddr(page_to_pfn(info->pages[0]))) +
+ frags[0].offset) - NET_IP_ALIGN - NET_SKB_PAD);
+ info->user.size = iov[0].iov_len + NET_IP_ALIGN + NET_SKB_PAD;
+ }
+ /* increment the number of locked pages */
+ ctor->lock_pages += j;
+ return info;
+
+failed:
+ for (i = 0; i < j; i++)
+ put_page(info->pages[i]);
+
+ kmem_cache_free(ctor->cache, info);
+
+ return NULL;
+}
+
+static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
+ struct msghdr *m, size_t total_len)
+{
+ struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+ struct page_ctor *ctor;
+ struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);
+ struct iovec *iov = m->msg_iov;
+ struct page_info *info = NULL;
+ struct frag frags[MAX_SKB_FRAGS];
+ struct sk_buff *skb;
+ int count = m->msg_iovlen;
+ int total = 0, header, n, i, len, rc;
+ unsigned long base;
+
+ ctor = rcu_dereference(mp->ctor);
+ if (!ctor)
+ return -ENODEV;
+
+ total = iov_length(iov, count);
+
+ if (total < ETH_HLEN)
+ return -EINVAL;
+
+ if (total <= COPY_THRESHOLD)
+ goto copy;
+
+ n = 0;
+ for (i = 0; i < count; i++) {
+ base = (unsigned long)iov[i].iov_base;
+ len = iov[i].iov_len;
+ if (!len)
+ continue;
+ n += ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+ if (n > MAX_SKB_FRAGS)
+ return -EINVAL;
+ }
+
+copy:
+ header = total > COPY_THRESHOLD ? COPY_HDR_LEN : total;
+
+ skb = alloc_skb(header + NET_IP_ALIGN, GFP_ATOMIC);
+ if (!skb)
+ goto drop;
+
+ skb_reserve(skb, NET_IP_ALIGN);
+
+ skb_set_network_header(skb, ETH_HLEN);
+
+ memcpy_fromiovec(skb->data, iov, header);
+ skb_put(skb, header);
+ skb->protocol = *((__be16 *)(skb->data) + ETH_ALEN);
+
+ if (header == total) {
+ rc = total;
+ info = alloc_small_page_info(ctor, iocb, total);
+ } else {
+ info = alloc_page_info(ctor, iocb, iov, count, frags, 0, total);
+ if (info)
+ for (i = 0; info->pages[i]; i++) {
+ skb_add_rx_frag(skb, i, info->pages[i],
+ frags[i].offset, frags[i].size);
+ info->pages[i] = NULL;
+ }
+ }
+ if (info != NULL) {
+ info->desc_pos = iocb->ki_pos;
+ info->ctrl = vq;
+ info->total = total;
+ info->skb = skb;
+ skb_shinfo(skb)->destructor_arg = &info->user;
+ skb->dev = mp->dev;
+ dev_queue_xmit(skb);
+ return 0;
+ }
+drop:
+ kfree_skb(skb);
+ if (info) {
+ for (i = 0; info->pages[i]; i++)
+ put_page(info->pages[i]);
+ kmem_cache_free(info->ctor->cache, info);
+ }
+ mp->dev->stats.tx_dropped++;
+ return -ENOMEM;
+}
+
+
+static void mp_recvmsg_notify(struct vhost_virtqueue *vq)
+{
+ struct socket *sock = vq->private_data;
+ struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+ struct page_ctor *ctor = NULL;
+ struct sk_buff *skb = NULL;
+ struct page_info *info = NULL;
+ struct ethhdr *eth;
+ struct kiocb *iocb = NULL;
+ int len, i;
+ unsigned long flags;
+
+ struct virtio_net_hdr hdr = {
+ .flags = 0,
+ .gso_type = VIRTIO_NET_HDR_GSO_NONE
+ };
+
+ ctor = rcu_dereference(mp->ctor);
+ if (!ctor)
+ return;
+
+ while ((skb = skb_dequeue(&sock->sk->sk_receive_queue)) != NULL) {
+ if (skb_shinfo(skb)->destructor_arg) {
+ info = container_of(skb_shinfo(skb)->destructor_arg,
+ struct page_info, user);
+ info->skb = skb;
+ if (skb->len > info->len) {
+ mp->dev->stats.rx_dropped++;
+ DBG(KERN_INFO "Discarded truncated rx packet: "
+ " len %d > %zd\n", skb->len, info->len);
+ info->total = skb->len;
+ goto clean;
+ } else {
+ int i;
+ struct skb_shared_info *gshinfo =
+ (struct skb_shared_info *)(&info->ushinfo);
+ struct skb_shared_info *hshinfo =
+ skb_shinfo(skb);
+
+ if (gshinfo->nr_frags < hshinfo->nr_frags)
+ goto clean;
+ eth = eth_hdr(skb);
+ skb_push(skb, ETH_HLEN);
+
+ hdr.hdr_len = skb_headlen(skb);
+ info->total = skb->len;
+
+ for (i = 0; i < gshinfo->nr_frags; i++)
+ gshinfo->frags[i].size = 0;
+ for (i = 0; i < hshinfo->nr_frags; i++)
+ gshinfo->frags[i].size =
+ hshinfo->frags[i].size;
+ memcpy(skb_shinfo(skb), &info->ushinfo,
+ sizeof(struct skb_shared_info));
+ }
+ } else {
+ /* The skb composed with kernel buffers
+ * in case user space buffers are not sufficient.
+ * The case should be rare.
+ */
+ unsigned long flags;
+ int i;
+ struct skb_shared_info *gshinfo = NULL;
+
+ info = NULL;
+
+ spin_lock_irqsave(&ctor->read_lock, flags);
+ if (!list_empty(&ctor->readq)) {
+ info = list_first_entry(&ctor->readq,
+ struct page_info, list);
+ list_del(&info->list);
+ }
+ spin_unlock_irqrestore(&ctor->read_lock, flags);
+ if (!info) {
+ DBG(KERN_INFO "No user buffer available %p\n",
+ skb);
+ skb_queue_head(&sock->sk->sk_receive_queue,
+ skb);
+ break;
+ }
+ info->skb = skb;
+ /* compute the guest skb frags info */
+ gshinfo = (struct skb_shared_info *)(info->user.start +
+ SKB_DATA_ALIGN(info->user.size));
+
+ if (gshinfo->nr_frags < skb_shinfo(skb)->nr_frags)
+ goto clean;
+
+ eth = eth_hdr(skb);
+ skb_push(skb, ETH_HLEN);
+ info->total = skb->len;
+
+ for (i = 0; i < gshinfo->nr_frags; i++)
+ gshinfo->frags[i].size = 0;
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+ gshinfo->frags[i].size =
+ skb_shinfo(skb)->frags[i].size;
+ hdr.hdr_len = min_t(int, skb->len,
+ info->iov[1].iov_len);
+ skb_copy_datagram_iovec(skb, 0, info->iov, skb->len);
+ }
+
+ len = memcpy_toiovec(info->hdr, (unsigned char *)&hdr,
+ sizeof hdr);
+ if (len) {
+ DBG(KERN_INFO
+ "Unable to write vnet_hdr at addr %p: %d\n",
+ info->hdr->iov_base, len);
+ goto clean;
+ }
+ iocb = create_iocb(info, skb->len + sizeof(hdr));
+
+ spin_lock_irqsave(&vq->notify_lock, flags);
+ list_add_tail(&iocb->ki_list, &vq->notifier);
+ spin_unlock_irqrestore(&vq->notify_lock, flags);
+ continue;
+
+clean:
+ kfree_skb(skb);
+ for (i = 0; info->pages[i]; i++)
+ put_page(info->pages[i]);
+ kmem_cache_free(ctor->cache, info);
+ }
+ return;
+}
+
+static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
+ struct msghdr *m, size_t total_len,
+ int flags)
+{
+ struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+ struct page_ctor *ctor;
+ struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);
+ struct iovec *iov = m->msg_iov;
+ int count = m->msg_iovlen;
+ int npages, payload;
+ struct page_info *info;
+ struct frag frags[MAX_SKB_FRAGS];
+ unsigned long base;
+ int i, len;
+ unsigned long flag;
+
+ if (!(flags & MSG_DONTWAIT))
+ return -EINVAL;
+
+ ctor = rcu_dereference(mp->ctor);
+ if (!ctor)
+ return -EINVAL;
+
+ /* Reject invalid user space buffers */
+ if (count > 2 && iov[1].iov_len < ctor->port.hdr_len &&
+ mp->dev->features & NETIF_F_SG) {
+ return -EINVAL;
+ }
+
+ npages = ctor->port.npages;
+ payload = ctor->port.data_len;
+
+ /* If the KVM guest virtio-net frontend driver uses the SG feature */
+ if (count > 2) {
+ for (i = 2; i < count; i++) {
+ base = (unsigned long)iov[i].iov_base & ~PAGE_MASK;
+ len = iov[i].iov_len;
+ if (npages == 1)
+ len = min_t(int, len, PAGE_SIZE - base);
+ else if (base)
+ break;
+ payload -= len;
+ if (payload <= 0)
+ goto proceed;
+ if (npages == 1 || (len & ~PAGE_MASK))
+ break;
+ }
+ }
+
+ if (((unsigned long)iov[1].iov_base & ~PAGE_MASK)
+ >= NET_SKB_PAD + NET_IP_ALIGN)
+ goto proceed;
+
+ return -EINVAL;
+
+proceed:
+ /* skip the virtnet head */
+ iov++;
+ count--;
+
+ /* Translate address to kernel */
+ info = alloc_page_info(ctor, iocb, iov, count, frags, npages, 0);
+ if (!info)
+ return -ENOMEM;
+ info->len = total_len;
+ info->hdr[0].iov_base = vq->hdr[0].iov_base;
+ info->hdr[0].iov_len = vq->hdr[0].iov_len;
+ info->offset = frags[0].offset;
+ info->desc_pos = iocb->ki_pos;
+ info->log = iocb->ki_user_data;
+ info->ctrl = vq;
+
+ iov--;
+ count++;
+
+ memcpy(info->iov, vq->iov, sizeof(struct iovec) * count);
+
+ spin_lock_irqsave(&ctor->read_lock, flag);
+ list_add_tail(&info->list, &ctor->readq);
+ spin_unlock_irqrestore(&ctor->read_lock, flag);
+
+ if (!vq->receiver) {
+ vq->receiver = mp_recvmsg_notify;
+ set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
+ vq->num * 4096,
+ vq->num * 4096);
+ }
+
+ return 0;
+}
+
+static void __mp_detach(struct mp_struct *mp)
+{
+ mp->mfile = NULL;
+
+ mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
+ page_ctor_detach(mp);
+ mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+
+ /* Drop the extra count on the net device */
+ dev_put(mp->dev);
+}
+
+static DEFINE_MUTEX(mp_mutex);
+
+static void mp_detach(struct mp_struct *mp)
+{
+ mutex_lock(&mp_mutex);
+ __mp_detach(mp);
+ mutex_unlock(&mp_mutex);
+}
+
+static void mp_put(struct mp_file *mfile)
+{
+ if (atomic_dec_and_test(&mfile->count))
+ mp_detach(mfile->mp);
+}
+
+static int mp_release(struct socket *sock)
+{
+ struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+ struct mp_file *mfile = mp->mfile;
+
+ mp_put(mfile);
+ sock_put(mp->socket.sk);
+ put_net(mfile->net);
+
+ return 0;
+}
+
+/* Ops structure to mimic raw sockets with mp device */
+static const struct proto_ops mp_socket_ops = {
+ .sendmsg = mp_sendmsg,
+ .recvmsg = mp_recvmsg,
+ .release = mp_release,
+};
+
+static struct proto mp_proto = {
+ .name = "mp",
+ .owner = THIS_MODULE,
+ .obj_size = sizeof(struct mp_sock),
+};
+
+static int mp_chr_open(struct inode *inode, struct file * file)
+{
+ struct mp_file *mfile;
+ cycle_kernel_lock();
+ DBG1(KERN_INFO "mp: mp_chr_open\n");
+
+ mfile = kzalloc(sizeof(*mfile), GFP_KERNEL);
+ if (!mfile)
+ return -ENOMEM;
+ atomic_set(&mfile->count, 0);
+ mfile->mp = NULL;
+ mfile->net = get_net(current->nsproxy->net_ns);
+ file->private_data = mfile;
+ return 0;
+}
+
+
+static struct mp_struct *mp_get(struct mp_file *mfile)
+{
+ struct mp_struct *mp = NULL;
+ if (atomic_inc_not_zero(&mfile->count))
+ mp = mfile->mp;
+
+ return mp;
+}
+
+
+static int mp_attach(struct mp_struct *mp, struct file *file)
+{
+ struct mp_file *mfile = file->private_data;
+ int err;
+
+ netif_tx_lock_bh(mp->dev);
+
+ err = -EINVAL;
+
+ if (mfile->mp)
+ goto out;
+
+ err = -EBUSY;
+ if (mp->mfile)
+ goto out;
+
+ err = 0;
+ mfile->mp = mp;
+ mp->mfile = mfile;
+ mp->socket.file = file;
+ dev_hold(mp->dev);
+ sock_hold(mp->socket.sk);
+ atomic_inc(&mfile->count);
+
+out:
+ netif_tx_unlock_bh(mp->dev);
+ return err;
+}
+
+static void mp_sock_destruct(struct sock *sk)
+{
+ struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+ kfree(mp);
+}
+
+static int do_unbind(struct mp_file *mfile)
+{
+ struct mp_struct *mp = mp_get(mfile);
+
+ if (!mp)
+ return -EINVAL;
+
+ mp_detach(mp);
+ sock_put(mp->socket.sk);
+ mp_put(mfile);
+ return 0;
+}
+
+static void mp_sock_data_ready(struct sock *sk, int len)
+{
+ if (sk_has_sleeper(sk))
+ wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN);
+}
+
+static void mp_sock_write_space(struct sock *sk)
+{
+ if (sk_has_sleeper(sk))
+ wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT);
+}
+
+static long mp_chr_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ struct mp_file *mfile = file->private_data;
+ struct mp_struct *mp;
+ struct net_device *dev;
+ void __user* argp = (void __user *)arg;
+ struct ifreq ifr;
+ struct sock *sk;
+ int ret;
+
+ ret = -EINVAL;
+
+ switch (cmd) {
+ case MPASSTHRU_BINDDEV:
+ ret = -EFAULT;
+ if (copy_from_user(&ifr, argp, sizeof ifr))
+ break;
+
+ ifr.ifr_name[IFNAMSIZ-1] = '\0';
+
+ ret = -EBUSY;
+
+ if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL)
+ break;
+
+ ret = -ENODEV;
+ dev = dev_get_by_name(mfile->net, ifr.ifr_name);
+ if (!dev)
+ break;
+
+ mutex_lock(&mp_mutex);
+
+ ret = -EBUSY;
+ mp = mfile->mp;
+ if (mp)
+ goto err_dev_put;
+
+ mp = kzalloc(sizeof(*mp), GFP_KERNEL);
+ if (!mp) {
+ ret = -ENOMEM;
+ goto err_dev_put;
+ }
+ mp->dev = dev;
+ ret = -ENOMEM;
+
+ sk = sk_alloc(mfile->net, AF_UNSPEC, GFP_KERNEL, &mp_proto);
+ if (!sk)
+ goto err_free_mp;
+
+ init_waitqueue_head(&mp->socket.wait);
+ mp->socket.ops = &mp_socket_ops;
+ sock_init_data(&mp->socket, sk);
+ sk->sk_sndbuf = INT_MAX;
+ container_of(sk, struct mp_sock, sk)->mp = mp;
+
+ sk->sk_destruct = mp_sock_destruct;
+ sk->sk_data_ready = mp_sock_data_ready;
+ sk->sk_write_space = mp_sock_write_space;
+
+ ret = mp_attach(mp, file);
+ if (ret < 0)
+ goto err_free_sk;
+
+ ret = page_ctor_attach(mp);
+ if (ret < 0)
+ goto err_free_sk;
+
+ ifr.ifr_flags |= IFF_MPASSTHRU_EXCL;
+ mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+out:
+ mutex_unlock(&mp_mutex);
+ break;
+err_free_sk:
+ sk_free(sk);
+err_free_mp:
+ kfree(mp);
+err_dev_put:
+ dev_put(dev);
+ goto out;
+
+ case MPASSTHRU_UNBINDDEV:
+ ret = do_unbind(mfile);
+ break;
+
+ default:
+ break;
+ }
+ return ret;
+}
+
+static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
+{
+ struct mp_file *mfile = file->private_data;
+ struct mp_struct *mp = mp_get(mfile);
+ struct sock *sk;
+ unsigned int mask = 0;
+
+ if (!mp)
+ return POLLERR;
+
+ sk = mp->socket.sk;
+
+ poll_wait(file, &mp->socket.wait, wait);
+
+ if (!skb_queue_empty(&sk->sk_receive_queue))
+ mask |= POLLIN | POLLRDNORM;
+
+ if (sock_writeable(sk) ||
+ (!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
+ sock_writeable(sk)))
+ mask |= POLLOUT | POLLWRNORM;
+
+ if (mp->dev->reg_state != NETREG_REGISTERED)
+ mask = POLLERR;
+
+ mp_put(mfile);
+ return mask;
+}
+
+static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long count, loff_t pos)
+{
+ struct file *file = iocb->ki_filp;
+ struct mp_struct *mp = mp_get(file->private_data);
+ struct sock *sk;
+ struct sk_buff *skb;
+ int len, err;
+ ssize_t result;
+
+ if (!mp)
+ return -EBADFD;
+ sk = mp->socket.sk;
+ /* currently, async is not supported.
+ * but we may support real async aio from user application,
+ * maybe qemu virtio-net backend.
+ */
+ result = -EFAULT;
+ if (!is_sync_kiocb(iocb))
+ goto out;
+ len = iov_length(iov, count);
+ result = -EINVAL;
+ if (unlikely(len < ETH_HLEN))
+ goto out;
+
+ skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
+ file->f_flags & O_NONBLOCK, &err);
+ result = -EFAULT;
+ if (!skb)
+ goto out;
+
+ skb_reserve(skb, NET_IP_ALIGN);
+ skb_put(skb, len);
+
+ result = -EAGAIN;
+ if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
+ kfree_skb(skb);
+ goto out;
+ }
+ skb->protocol = eth_type_trans(skb, mp->dev);
+ skb->dev = mp->dev;
+ dev_queue_xmit(skb);
+ result = len;
+out:
+ mp_put(file->private_data);
+ return result;
+}
+
+static int mp_chr_close(struct inode *inode, struct file *file)
+{
+ struct mp_file *mfile = file->private_data;
+
+ /*
+ * Ignore return value since an error only means there was nothing to
+ * do
+ */
+ do_unbind(mfile);
+
+ put_net(mfile->net);
+ kfree(mfile);
+
+ return 0;
+}
+
+static const struct file_operations mp_fops = {
+ .owner = THIS_MODULE,
+ .llseek = no_llseek,
+ .write = do_sync_write,
+ .aio_write = mp_chr_aio_write,
+ .poll = mp_chr_poll,
+ .unlocked_ioctl = mp_chr_ioctl,
+ .open = mp_chr_open,
+ .release = mp_chr_close,
+};
+
+static struct miscdevice mp_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "mp",
+ .nodename = "net/mp",
+ .fops = &mp_fops,
+};
+
+static int mp_device_event(struct notifier_block *unused,
+ unsigned long event, void *ptr)
+{
+ struct net_device *dev = ptr;
+ struct mpassthru_port *port;
+ struct mp_struct *mp = NULL;
+ struct socket *sock = NULL;
+
+ port = dev->mp_port;
+ if (port == NULL)
+ return NOTIFY_DONE;
+
+ switch (event) {
+ case NETDEV_UNREGISTER:
+ sock = dev->mp_port->sock;
+ mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+ do_unbind(mp->mfile);
+ break;
+ }
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block mp_notifier_block __read_mostly = {
+ .notifier_call = mp_device_event,
+};
+
+static int mp_init(void)
+{
+ int ret = 0;
+
+ ret = misc_register(&mp_miscdev);
+ if (ret)
+ printk(KERN_ERR "mp: Can't register misc device\n");
+ else {
+ printk(KERN_INFO "Registering mp misc device - minor = %d\n",
+ mp_miscdev.minor);
+ register_netdevice_notifier(&mp_notifier_block);
+ }
+ return ret;
+}
+
+static void mp_cleanup(void)
+{
+ unregister_netdevice_notifier(&mp_notifier_block);
+ misc_deregister(&mp_miscdev);
+}
+
+/* Get an underlying socket object from mp file. Returns error unless file is
+ * attached to a device. The returned object works like a packet socket, it
+ * can be used for sock_sendmsg/sock_recvmsg. The caller is responsible for
+ * holding a reference to the file for as long as the socket is in use. */
+struct socket *mp_get_socket(struct file *file)
+{
+ struct mp_file *mfile = file->private_data;
+ struct mp_struct *mp;
+
+ if (file->f_op != &mp_fops)
+ return ERR_PTR(-EINVAL);
+ mp = mp_get(mfile);
+ if (!mp)
+ return ERR_PTR(-EBADFD);
+ mp_put(mfile);
+ return &mp->socket;
+}
+EXPORT_SYMBOL_GPL(mp_get_socket);
+
+module_init(mp_init);
+module_exit(mp_cleanup);
+MODULE_AUTHOR(DRV_COPYRIGHT);
+MODULE_DESCRIPTION(DRV_DESCRIPTION);
+MODULE_LICENSE("GPL v2");
diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 0000000..2be21c5
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,29 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV _IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV _IOW('M', 214, int)
+
+/* MPASSTHRU ifc flags */
+#define IFF_MPASSTHRU 0x0001
+#define IFF_MPASSTHRU_EXCL 0x0002
+
+#ifdef __KERNEL__
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+ return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_VHOST_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
--
1.5.4.4
^ permalink raw reply related
* [RFC][PATCH v3 0/3] Provide a zero-copy method on KVM virtio-net.
From: xiaohui.xin @ 2010-04-09 9:37 UTC (permalink / raw)
To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike
The idea is simple: pin the guest VM's user space buffers and then
let the host NIC driver DMA directly to and from them.
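To make the pinning step concrete, here is a minimal user-space sketch of the page arithmetic the patch applies to each iovec entry before calling get_user_pages_fast(). The 4 KiB page size and all SKETCH_ and sketch_ names are assumptions for illustration; the kernel code uses PAGE_SIZE, PAGE_MASK and struct frag instead.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. The kernel uses PAGE_SHIFT/PAGE_SIZE/PAGE_MASK. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))

/* Hypothetical stand-in for the patch's struct frag. */
struct sketch_frag {
	unsigned long offset;	/* offset of the piece inside its page */
	unsigned long size;	/* bytes of the buffer that fall in that page */
};

/* Number of pages touched by the buffer [base, base + len), for len > 0. */
static unsigned long pages_spanned(unsigned long base, unsigned long len)
{
	return ((base & ~SKETCH_PAGE_MASK) + len + ~SKETCH_PAGE_MASK)
		>> SKETCH_PAGE_SHIFT;
}

/* Split the buffer into per-page (offset, size) pieces; returns the count. */
static size_t split_into_frags(unsigned long base, unsigned long len,
			       struct sketch_frag *frags, size_t max)
{
	size_t n = 0;

	assert(frags != NULL);
	while (len && n < max) {
		unsigned long off = base & ~SKETCH_PAGE_MASK;
		unsigned long room = SKETCH_PAGE_SIZE - off;
		unsigned long size = len < room ? len : room;

		frags[n].offset = off;
		frags[n].size = size;
		base += size;
		len -= size;
		n++;
	}
	return n;
}
```

Only the first piece can start at a non-zero page offset; every later piece starts at offset 0, which is why a buffer that is not page aligned may need one more page than its length alone suggests.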
The patches are based on the vhost-net backend driver. We add a device
which provides proto_ops (sendmsg/recvmsg) to vhost-net so that it can
send/recv directly to/from the NIC driver. A KVM guest that uses the
vhost-net backend may bind any ethX interface on the host side to
get copyless data transfer through the guest virtio-net frontend.
The scenario is like this:
The guest virtio-net driver submits multiple requests through the vhost-net
backend driver to the kernel. The requests are queued and then
completed after the corresponding actions in h/w are done.
For read, user space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked. This means NICs can allocate user buffers
from a page constructor. We add a hook in the netif_receive_skb() function
to intercept the incoming packets and notify the zero-copy device.
For write, the zero-copy device allocates a new host skb, puts the
payload on skb_shinfo(skb)->frags, and copies the header to skb->data.
The request remains pending until the skb is transmitted by h/w.
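The write-side split between copied header and zero-copy payload can be sketched as below. This mirrors the COPY_THRESHOLD / COPY_HDR_LEN logic in mp_sendmsg(); the 64-byte cache line and the names tx_split and split_tx() are assumptions made for the sketch, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines. The kernel uses L1_CACHE_BYTES. */
#define SKETCH_L1_CACHE_BYTES 64
#define COPY_THRESHOLD (SKETCH_L1_CACHE_BYTES * 4)
#define COPY_HDR_LEN   (SKETCH_L1_CACHE_BYTES < 64 ? 64 : SKETCH_L1_CACHE_BYTES)

/* Hypothetical result type: how a packet of a given size is handled. */
struct tx_split {
	size_t copied;		/* bytes copied into the skb linear area */
	size_t zero_copy;	/* bytes left in pinned user pages (frags) */
};

/* Small packets are copied whole; large ones only have a header copied. */
static struct tx_split split_tx(size_t total)
{
	struct tx_split s;

	s.copied = total > COPY_THRESHOLD ? COPY_HDR_LEN : total;
	assert(s.copied <= total);
	s.zero_copy = total - s.copied;
	return s;
}
```

With these assumed values a 1500-byte frame copies 64 bytes and maps the remaining 1436 bytes, while anything up to 256 bytes is copied entirely and pins no pages.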
Here, we have considered 2 ways to utilize the page constructor
API to dispense the user buffers.
One: Modify the __alloc_skb() function a bit so that it allocates only the
sk_buff structure, with the data pointer pointing to a
user buffer that comes from the page constructor API.
The shinfo of the skb then also comes from the guest.
When a packet is received from hardware, skb->data is filled
directly by h/w. This is the approach we have implemented.
Pros: We can avoid any copy here.
Cons: The guest virtio-net driver needs to allocate skbs in almost
the same way as the host NIC drivers, i.e. with the size
used by netdev_alloc_skb() and the same reserved space at the
head of the skb. Many NIC drivers match the guest here and
are fine. But some of the latest NIC drivers reserve special
room in the skb head. To deal with this, we suggest providing
a method in the guest virtio-net driver to ask the NIC driver for
the parameters we are interested in once we know which device
we have bound for zero-copy. Then we ask the guest to follow them.
Is that reasonable?
Two: Modify the driver to get user buffers allocated from the page constructor
API (as a substitute for alloc_page()). The user buffers are used as payload
buffers and filled by h/w directly when a packet is received. The driver
should associate the pages with the skb (skb_shinfo(skb)->frags). For
the head buffer, let the host allocate the skb and h/w fill it.
After that, the data filled into the host skb header is copied into the
guest header buffer, which is submitted together with the payload buffer.
Pros: We need to care less about how the guest or host allocates their
buffers.
Cons: We still need a small copy here for the skb header.
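For option One, the "same reserved space in the head of skb" requirement boils down to the guest buffer starting far enough into its page that the host can place NET_SKB_PAD + NET_IP_ALIGN bytes in front of it. A minimal sketch of that check follows. The constant values are assumptions (32 and 2 are common, but both are architecture and kernel dependent), and has_skb_headroom() is a hypothetical name.

```c
#include <assert.h>

/* Assumptions: 4 KiB pages, NET_SKB_PAD = 32, NET_IP_ALIGN = 2. */
#define SKETCH_PAGE_SIZE    4096UL
#define SKETCH_NET_SKB_PAD  32UL
#define SKETCH_NET_IP_ALIGN 2UL

/*
 * Returns 1 if a receive buffer starting at `base` leaves enough room
 * inside its own page for the skb headroom in front of it, 0 otherwise.
 */
static int has_skb_headroom(unsigned long base)
{
	unsigned long off = base & (SKETCH_PAGE_SIZE - 1);

	/* Compare instead of subtracting: an unsigned subtraction would
	 * wrap around and can never be negative. */
	return off >= SKETCH_NET_SKB_PAD + SKETCH_NET_IP_ALIGN;
}
```

A page-aligned guest buffer fails this check, which is one reason the guest has to know the host driver's headroom parameters when it allocates its receive buffers.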
We are not sure which way is better. This is the first thing we want
comments on from the community. We would like the modifications to the network
part to be generic, usable not only by the vhost-net backend but also by a user
application once the zero-copy device provides async
read/write operations later.
Please comment especially on the network part modifications.
We provide multiple submits and asynchronous notification to
vhost-net too.
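
As a rough illustration of "multiple submits and asynchronous notification",
here is a minimal userspace sketch, not the patch code. Several requests are
submitted without waiting; each one is appended to a notifier list when it
completes, and a worker later drains that list up to a byte budget. The names
struct req, struct notifier, notify_enqueue() and drain_completions() are
hypothetical; notify_dequeue() only borrows its name from the vhost patch.
Locking is left out, whereas the kernel version protects the list with a
spinlock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kiocb: one in-flight request. */
struct req {
	unsigned head;		/* descriptor index the request was built from */
	size_t nbytes;		/* bytes transferred, set on completion */
	struct req *next;	/* link in the notifier list */
};

/* Completed requests wait here until the worker reports them. */
struct notifier {
	struct req *first;
	struct req *last;
};

void notifier_init(struct notifier *n)
{
	n->first = n->last = NULL;
}

/* Completion side: append a finished request (FIFO order). */
void notify_enqueue(struct notifier *n, struct req *r, size_t nbytes)
{
	assert(r != NULL);
	r->nbytes = nbytes;
	r->next = NULL;
	if (n->last)
		n->last->next = r;
	else
		n->first = r;
	n->last = r;
}

/* Worker side: take the oldest finished request, or NULL if none. */
struct req *notify_dequeue(struct notifier *n)
{
	struct req *r = n->first;

	if (r) {
		n->first = r->next;
		if (!n->first)
			n->last = NULL;
		r->next = NULL;
	}
	return r;
}

/* Report completions until the list is empty or `weight` bytes have been
 * reported. Returns the number of requests reported and stores the byte
 * count in *total. Anything left stays queued for the next pass. */
unsigned drain_completions(struct notifier *n, size_t weight, size_t *total)
{
	struct req *r;
	unsigned count = 0;

	*total = 0;
	while ((r = notify_dequeue(n)) != NULL) {
		*total += r->nbytes;
		count++;
		if (*total >= weight)
			break;
	}
	return count;
}
```

The byte budget plays the role that VHOST_NET_WEIGHT plays in the patch: one
pass does a bounded amount of work and leaves the remainder queued.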
Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later. But in a simple
test with netperf, we found that bandwidth went up and CPU % went up too,
but the bandwidth increase ratio is much larger than the CPU % increase ratio.
What we have not done yet:
packet split support
GRO support
performance tuning
what we have done in v1:
polish the RCU usage
deal with write logging in asynchronous mode in vhost
add notifier block for mp device
rename page_ctor to mp_port in netdevice.h to make it look generic
add mp_dev_change_flags() for mp device to change NIC state
add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
a small fix for a missing dev_put on failure
use a dynamic minor instead of a static minor number
a __KERNEL__ guard around mp_get_sock()
what we have done in v2:
remove most of the RCU usage, since the ctor pointer is only
changed by the BIND/UNBIND ioctls, and during that time the NIC is
stopped to get a clean cleanup (all outstanding requests are finished),
so changes to the ctor pointer cannot race.
Replace struct vhost_notifier with struct kiocb.
Let the vhost-net backend alloc/free the kiocbs and transfer them
via sendmsg/recvmsg.
use get_user_pages_fast() and set_page_dirty_lock() when reading.
Add some comments for netdev_mp_port_prep() and handle_mpassthru().
what we have done in v3:
the async write logging is rewritten
a draft synchronous write function for qemu live migration
a limit on locked pages from get_user_pages_fast(), using
RLIMIT_MEMLOCK, to prevent DoS
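
To make the RLIMIT_MEMLOCK item concrete, here is a hedged userspace sketch
of the accounting idea only. This message does not show the kernel-side check,
so the helper names memlock_limit_pages(), may_lock_pages() and
current_memlock_limit_pages() are hypothetical. The sketch converts the byte
limit into pages and refuses a pin request that would exceed it.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/resource.h>
#include <unistd.h>

/* Convert a byte limit into whole pages; RLIM_INFINITY means "no cap". */
unsigned long memlock_limit_pages(rlim_t limit_bytes, unsigned long page_size)
{
	assert(page_size > 0);
	if (limit_bytes == RLIM_INFINITY)
		return ~0UL;
	return (unsigned long)(limit_bytes / page_size);
}

/* True if pinning `want` more pages keeps the total within the limit.
 * The subtraction form avoids overflow in locked + want. */
bool may_lock_pages(unsigned long locked, unsigned long want,
		    unsigned long limit_pages)
{
	if (locked > limit_pages)
		return false;
	return want <= limit_pages - locked;
}

/* Soft RLIMIT_MEMLOCK of the calling process, in pages; 0 on failure. */
unsigned long current_memlock_limit_pages(void)
{
	struct rlimit rl;
	long page_size = sysconf(_SC_PAGESIZE);

	if (page_size <= 0 || getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return 0;
	return memlock_limit_pages(rl.rlim_cur, (unsigned long)page_size);
}
```

With a 64 KiB limit and 4 KiB pages this allows 16 pinned pages, so a guest
that keeps posting buffers cannot pin an unbounded amount of host memory.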
performance:
using netperf with GSO/TSO disabled, 10G NIC,
packet split mode disabled, with the raw socket case compared to vhost.
bandwidth goes from 1.1Gbps to 1.7Gbps
CPU % from 120%-140% to 140%-160%
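
The relative increases behind these figures can be checked with a small
helper. This is only a worked calculation; pct_increase() is a hypothetical
name, and pairing the low ends (120% to 140%) and the high ends (140% to 160%)
of the CPU ranges is an assumption about how the ranges correspond.

```c
#include <assert.h>

/* Relative increase from `before` to `after`, in percent. */
double pct_increase(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/*
 * bandwidth: pct_increase(1.1, 1.7)     is about 54.5
 * CPU, low:  pct_increase(120.0, 140.0) is about 16.7
 * CPU, high: pct_increase(140.0, 160.0) is about 14.3
 */
```

Under that pairing, bandwidth rises by roughly 55% while CPU usage rises by
roughly 14% to 17%.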
^ permalink raw reply
* [RFC][PATCH v3 2/3] Provides multiple submits and asynchronous notifications.
From: xiaohui.xin @ 2010-04-09 9:37 UTC (permalink / raw)
To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1270805865-16901-2-git-send-email-xiaohui.xin@intel.com>
From: Xin Xiaohui <xiaohui.xin@intel.com>
The vhost-net backend currently supports only synchronous send/recv
operations. This patch provides multiple submits and asynchronous
notifications, which are needed for the zero-copy case.
Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
---
drivers/vhost/net.c | 203 +++++++++++++++++++++++++++++++++++++++++++++++--
drivers/vhost/vhost.c | 115 ++++++++++++++++------------
drivers/vhost/vhost.h | 15 ++++
3 files changed, 278 insertions(+), 55 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..d3fb3fc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
#include <linux/workqueue.h>
#include <linux/rcupdate.h>
#include <linux/file.h>
+#include <linux/aio.h>
#include <linux/net.h>
#include <linux/if_packet.h>
#include <linux/if_arp.h>
#include <linux/if_tun.h>
+#include <linux/mpassthru.h>
#include <net/sock.h>
@@ -47,6 +49,7 @@ struct vhost_net {
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
struct vhost_poll poll[VHOST_NET_VQ_MAX];
+ struct kmem_cache *cache;
/* Tells us whether we are polling a socket for TX.
* We only do this when socket buffer fills up.
* Protected by tx vq lock. */
@@ -91,11 +94,100 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
net->tx_poll_state = VHOST_NET_POLL_STARTED;
}
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+ struct kiocb *iocb = NULL;
+ unsigned long flags;
+
+ spin_lock_irqsave(&vq->notify_lock, flags);
+ if (!list_empty(&vq->notifier)) {
+ iocb = list_first_entry(&vq->notifier,
+ struct kiocb, ki_list);
+ list_del(&iocb->ki_list);
+ }
+ spin_unlock_irqrestore(&vq->notify_lock, flags);
+ return iocb;
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+ struct vhost_virtqueue *vq)
+{
+ struct kiocb *iocb = NULL;
+ struct vhost_log *vq_log = NULL;
+ int rx_total_len = 0;
+ unsigned int head, log, in, out;
+ int size;
+
+ if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+ return;
+
+ if (vq->receiver)
+ vq->receiver(vq);
+
+ vq_log = unlikely(vhost_has_feature(
+ &net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+ while ((iocb = notify_dequeue(vq)) != NULL) {
+ vhost_add_used_and_signal(&net->dev, vq,
+ iocb->ki_pos, iocb->ki_nbytes);
+ log = (int)iocb->ki_user_data;
+ size = iocb->ki_nbytes;
+ head = iocb->ki_pos;
+ rx_total_len += iocb->ki_nbytes;
+
+ if (iocb->ki_dtor)
+ iocb->ki_dtor(iocb);
+ kmem_cache_free(net->cache, iocb);
+
+ /* when log is enabled, recomputing the log info is needed,
+ * since these buffers are in async queue, and may not get
+ * the log info before.
+ */
+ if (unlikely(vq_log)) {
+ if (!log)
+ __vhost_get_vq_desc(&net->dev, vq, vq->iov,
+ ARRAY_SIZE(vq->iov),
+ &out, &in, vq_log,
+ &log, head);
+ vhost_log_write(vq, vq_log, log, size);
+ }
+ if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+ vhost_poll_queue(&vq->poll);
+ break;
+ }
+ }
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+ struct vhost_virtqueue *vq)
+{
+ struct kiocb *iocb = NULL;
+ int tx_total_len = 0;
+
+ if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+ return;
+
+ while ((iocb = notify_dequeue(vq)) != NULL) {
+ vhost_add_used_and_signal(&net->dev, vq,
+ iocb->ki_pos, 0);
+ tx_total_len += iocb->ki_nbytes;
+
+ if (iocb->ki_dtor)
+ iocb->ki_dtor(iocb);
+
+ kmem_cache_free(net->cache, iocb);
+ if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+ vhost_poll_queue(&vq->poll);
+ break;
+ }
+ }
+}
+
/* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
static void handle_tx(struct vhost_net *net)
{
struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+ struct kiocb *iocb = NULL;
unsigned head, out, in, s;
struct msghdr msg = {
.msg_name = NULL,
@@ -124,6 +216,8 @@ static void handle_tx(struct vhost_net *net)
tx_poll_stop(net);
hdr_size = vq->hdr_size;
+ handle_async_tx_events_notify(net, vq);
+
for (;;) {
head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
ARRAY_SIZE(vq->iov),
@@ -151,6 +245,15 @@ static void handle_tx(struct vhost_net *net)
/* Skip header. TODO: support TSO. */
s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
msg.msg_iovlen = out;
+
+ if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+ iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
+ if (!iocb)
+ break;
+ iocb->ki_pos = head;
+ iocb->private = (void *)vq;
+ }
+
len = iov_length(vq->iov, out);
/* Sanity check */
if (!len) {
@@ -160,12 +263,18 @@ static void handle_tx(struct vhost_net *net)
break;
}
/* TODO: Check specific error and bomb out unless ENOBUFS? */
- err = sock->ops->sendmsg(NULL, sock, &msg, len);
+ err = sock->ops->sendmsg(iocb, sock, &msg, len);
if (unlikely(err < 0)) {
+ if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+ kmem_cache_free(net->cache, iocb);
vhost_discard_vq_desc(vq);
tx_poll_start(net, sock);
break;
}
+
+ if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+ continue;
+
if (err != len)
pr_err("Truncated TX packet: "
" len %d != %zd\n", err, len);
@@ -177,6 +286,8 @@ static void handle_tx(struct vhost_net *net)
}
}
+ handle_async_tx_events_notify(net, vq);
+
mutex_unlock(&vq->mutex);
unuse_mm(net->dev.mm);
}
@@ -186,6 +297,7 @@ static void handle_tx(struct vhost_net *net)
static void handle_rx(struct vhost_net *net)
{
struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
+ struct kiocb *iocb = NULL;
unsigned head, out, in, log, s;
struct vhost_log *vq_log;
struct msghdr msg = {
@@ -206,7 +318,8 @@ static void handle_rx(struct vhost_net *net)
int err;
size_t hdr_size;
struct socket *sock = rcu_dereference(vq->private_data);
- if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+ if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
+ vq->link_state == VHOST_VQ_LINK_SYNC))
return;
use_mm(net->dev.mm);
@@ -214,9 +327,17 @@ static void handle_rx(struct vhost_net *net)
vhost_disable_notify(vq);
hdr_size = vq->hdr_size;
+ /* In async cases, when write log is enabled, in case the submitted
+ * buffers did not get log info before the log enabling, so we'd
+ * better recompute the log info when needed. We do this in
+ * handle_async_rx_events_notify().
+ */
+
vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
vq->log : NULL;
+ handle_async_rx_events_notify(net, vq);
+
for (;;) {
head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
ARRAY_SIZE(vq->iov),
@@ -245,6 +366,14 @@ static void handle_rx(struct vhost_net *net)
s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
msg.msg_iovlen = in;
len = iov_length(vq->iov, in);
+ if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+ iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
+ if (!iocb)
+ break;
+ iocb->private = vq;
+ iocb->ki_pos = head;
+ iocb->ki_user_data = log;
+ }
/* Sanity check */
if (!len) {
vq_err(vq, "Unexpected header len for RX: "
@@ -252,13 +381,20 @@ static void handle_rx(struct vhost_net *net)
iov_length(vq->hdr, s), hdr_size);
break;
}
- err = sock->ops->recvmsg(NULL, sock, &msg,
+
+ err = sock->ops->recvmsg(iocb, sock, &msg,
len, MSG_DONTWAIT | MSG_TRUNC);
/* TODO: Check specific error and bomb out unless EAGAIN? */
if (err < 0) {
+ if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+ kmem_cache_free(net->cache, iocb);
vhost_discard_vq_desc(vq);
break;
}
+
+ if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+ continue;
+
/* TODO: Should check and handle checksum. */
if (err > len) {
pr_err("Discarded truncated rx packet: "
@@ -284,10 +420,13 @@ static void handle_rx(struct vhost_net *net)
}
}
+ handle_async_rx_events_notify(net, vq);
+
mutex_unlock(&vq->mutex);
unuse_mm(net->dev.mm);
}
+
static void handle_tx_kick(struct work_struct *work)
{
struct vhost_virtqueue *vq;
@@ -338,6 +477,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+ n->cache = NULL;
return 0;
}
@@ -398,6 +538,18 @@ static void vhost_net_flush(struct vhost_net *n)
vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
}
+static void vhost_async_cleanup(struct vhost_net *n)
+{
+ /* clean the notifier */
+ struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
+ struct kiocb *iocb = NULL;
+ if (n->cache) {
+ while ((iocb = notify_dequeue(vq)) != NULL)
+ kmem_cache_free(n->cache, iocb);
+ kmem_cache_destroy(n->cache);
+ }
+}
+
static int vhost_net_release(struct inode *inode, struct file *f)
{
struct vhost_net *n = f->private_data;
@@ -414,6 +566,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
/* We do an extra flush before freeing memory,
* since jobs can re-queue themselves. */
vhost_net_flush(n);
+ vhost_async_cleanup(n);
kfree(n);
return 0;
}
@@ -462,7 +615,19 @@ static struct socket *get_tun_socket(int fd)
return sock;
}
-static struct socket *get_socket(int fd)
+static struct socket *get_mp_socket(int fd)
+{
+ struct file *file = fget(fd);
+ struct socket *sock;
+ if (!file)
+ return ERR_PTR(-EBADF);
+ sock = mp_get_socket(file);
+ if (IS_ERR(sock))
+ fput(file);
+ return sock;
+}
+
+static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
{
struct socket *sock;
if (fd == -1)
@@ -473,9 +638,31 @@ static struct socket *get_socket(int fd)
sock = get_tun_socket(fd);
if (!IS_ERR(sock))
return sock;
+ sock = get_mp_socket(fd);
+ if (!IS_ERR(sock)) {
+ vq->link_state = VHOST_VQ_LINK_ASYNC;
+ return sock;
+ }
return ERR_PTR(-ENOTSOCK);
}
+static void vhost_init_link_state(struct vhost_net *n, int index)
+{
+ struct vhost_virtqueue *vq = n->vqs + index;
+
+ WARN_ON(!mutex_is_locked(&vq->mutex));
+ if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+ vq->receiver = NULL;
+ INIT_LIST_HEAD(&vq->notifier);
+ spin_lock_init(&vq->notify_lock);
+ if (!n->cache) {
+ n->cache = kmem_cache_create("vhost_kiocb",
+ sizeof(struct kiocb), 0,
+ SLAB_HWCACHE_ALIGN, NULL);
+ }
+ }
+}
+
static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
{
struct socket *sock, *oldsock;
@@ -493,12 +680,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
}
vq = n->vqs + index;
mutex_lock(&vq->mutex);
- sock = get_socket(fd);
+ vq->link_state = VHOST_VQ_LINK_SYNC;
+ sock = get_socket(vq, fd);
if (IS_ERR(sock)) {
r = PTR_ERR(sock);
goto err;
}
+ vhost_init_link_state(n, index);
+
/* start polling new socket */
oldsock = vq->private_data;
if (sock == oldsock)
@@ -507,8 +697,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
vhost_net_disable_vq(n, vq);
rcu_assign_pointer(vq->private_data, sock);
vhost_net_enable_vq(n, vq);
- mutex_unlock(&vq->mutex);
done:
+ mutex_unlock(&vq->mutex);
mutex_unlock(&n->dev.mutex);
if (oldsock) {
vhost_net_flush_vq(n, index);
@@ -516,6 +706,7 @@ done:
}
return r;
err:
+ mutex_unlock(&vq->mutex);
mutex_unlock(&n->dev.mutex);
return r;
}
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 97233d5..53dab80 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -715,66 +715,21 @@ static unsigned get_indirect(struct vhost_dev *dev, struct vhost_virtqueue *vq,
return 0;
}
-/* This looks in the virtqueue and for the first available buffer, and converts
- * it to an iovec for convenient access. Since descriptors consist of some
- * number of output then some number of input descriptors, it's actually two
- * iovecs, but we pack them into one and note how many of each there were.
- *
- * This function returns the descriptor number found, or vq->num (which
- * is never a valid descriptor number) if none was found. */
-unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+unsigned __vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num,
- struct vhost_log *log, unsigned int *log_num)
+ struct vhost_log *log, unsigned int *log_num,
+ unsigned int head)
{
struct vring_desc desc;
- unsigned int i, head, found = 0;
- u16 last_avail_idx;
+ unsigned int i = head, found = 0;
int ret;
- /* Check it isn't doing very strange things with descriptor numbers. */
- last_avail_idx = vq->last_avail_idx;
- if (get_user(vq->avail_idx, &vq->avail->idx)) {
- vq_err(vq, "Failed to access avail idx at %p\n",
- &vq->avail->idx);
- return vq->num;
- }
-
- if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
- vq_err(vq, "Guest moved used index from %u to %u",
- last_avail_idx, vq->avail_idx);
- return vq->num;
- }
-
- /* If there's nothing new since last we looked, return invalid. */
- if (vq->avail_idx == last_avail_idx)
- return vq->num;
-
- /* Only get avail ring entries after they have been exposed by guest. */
- rmb();
-
- /* Grab the next descriptor number they're advertising, and increment
- * the index we've seen. */
- if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
- vq_err(vq, "Failed to read head: idx %d address %p\n",
- last_avail_idx,
- &vq->avail->ring[last_avail_idx % vq->num]);
- return vq->num;
- }
-
- /* If their number is silly, that's an error. */
- if (head >= vq->num) {
- vq_err(vq, "Guest says index %u > %u is available",
- head, vq->num);
- return vq->num;
- }
-
/* When we start there are none of either input nor output. */
*out_num = *in_num = 0;
if (unlikely(log))
*log_num = 0;
- i = head;
do {
unsigned iov_count = *in_num + *out_num;
if (i >= vq->num) {
@@ -833,8 +788,70 @@ unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
*out_num += ret;
}
} while ((i = next_desc(&desc)) != -1);
+ return head;
+}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access. Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which
+ * is never a valid descriptor number) if none was found. */
+unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num)
+{
+ struct vring_desc desc;
+ unsigned int i, head, found = 0;
+ u16 last_avail_idx;
+ unsigned int ret;
+
+ /* Check it isn't doing very strange things with descriptor numbers. */
+ last_avail_idx = vq->last_avail_idx;
+ if (get_user(vq->avail_idx, &vq->avail->idx)) {
+ vq_err(vq, "Failed to access avail idx at %p\n",
+ &vq->avail->idx);
+ return vq->num;
+ }
+
+ if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
+ vq_err(vq, "Guest moved used index from %u to %u",
+ last_avail_idx, vq->avail_idx);
+ return vq->num;
+ }
+
+ /* If there's nothing new since last we looked, return invalid. */
+ if (vq->avail_idx == last_avail_idx)
+ return vq->num;
+
+ /* Only get avail ring entries after they have been exposed by guest. */
+ rmb();
+
+ /* Grab the next descriptor number they're advertising, and increment
+ * the index we've seen. */
+ if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
+ vq_err(vq, "Failed to read head: idx %d address %p\n",
+ last_avail_idx,
+ &vq->avail->ring[last_avail_idx % vq->num]);
+ return vq->num;
+ }
+
+ /* If their number is silly, that's an error. */
+ if (head >= vq->num) {
+ vq_err(vq, "Guest says index %u > %u is available",
+ head, vq->num);
+ return vq->num;
+ }
+
+ ret = __vhost_get_vq_desc(dev, vq, iov, iov_size,
+ out_num, in_num,
+ log, log_num, head);
/* On success, increment avail index. */
+ if (ret == vq->num)
+ return ret;
vq->last_avail_idx++;
return head;
}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d1f0453..a74a6d4 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -43,6 +43,11 @@ struct vhost_log {
u64 len;
};
+enum vhost_vq_link_state {
+ VHOST_VQ_LINK_SYNC = 0,
+ VHOST_VQ_LINK_ASYNC = 1,
+};
+
/* The virtqueue structure describes a queue attached to a device. */
struct vhost_virtqueue {
struct vhost_dev *dev;
@@ -96,6 +101,11 @@ struct vhost_virtqueue {
/* Log write descriptors */
void __user *log_base;
struct vhost_log log[VHOST_NET_MAX_SG];
+ /* Differentiate async socket for 0-copy from normal */
+ enum vhost_vq_link_state link_state;
+ struct list_head notifier;
+ spinlock_t notify_lock;
+ void (*receiver)(struct vhost_virtqueue *);
};
struct vhost_dev {
@@ -122,6 +132,11 @@ unsigned vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
struct iovec iov[], unsigned int iov_count,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num);
+unsigned __vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
+ struct iovec iov[], unsigned int iov_count,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num,
+ unsigned int head);
void vhost_discard_vq_desc(struct vhost_virtqueue *);
int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
--
1.5.4.4
^ permalink raw reply related