Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] net: ethernet: stmmac: properly set PS bit in MII configurations during reset
From: Giuseppe CAVALLARO @ 2017-05-10  7:03 UTC (permalink / raw)
  To: Thomas Petazzoni; +Cc: Corentin Labbe, Alexandre Torgue, netdev, stable
In-Reply-To: <20170508211230.58aeead9@free-electrons.com>

Hi Thomas

On 5/8/2017 9:12 PM, Thomas Petazzoni wrote:
> Hello,
>
> On Mon, 8 May 2017 16:28:21 +0200, Giuseppe CAVALLARO wrote:
>
>>> I just see that GMAC_CONTROL and MAC_CTRL_REG are the same, so why not create a custom adjust_link for each dwmac type ?
>>> This will permit to call it instead of set_ps() and remove lots of if (has_gmac) and co in stmmac_adjust_link()
>>> Basicly replace all between "ctrl = readl()... and writel(ctrl)" by a sot of priv->hw->mac->adjust_link()
>>>
>>> It will also help a lot for my dwmac-sun8i inclusion (since I add some if has_sun8i:))
>> Corentin, I think this is a good idea and maybe necessary now that the
>> driver is supporting a lot of chips.
>> In the past it was sufficient to have a adjust link function and a
>> stmmac_hw_fix_mac_speed
>> to invoke dedicated hook shared between MAC10/100 and GMAC inside STM
>> platforms.
>>
>> Thomas, I wonder if you could take a look at the
>> priv->plat->fix_mac_speed. This can be used
>> for setting internal registers too.
> Once again, this is not called at the right time to fix the issue I'm
> seeing with a MII PHY. I need to adjust the PS bit between asserting the
> reset and polling for the reset bit to clear.
>
> ->fix_mac_speed() is called in the adjust_link() call-back, which is
> called way too late.
>
> Please, read again my patch and the description of the problem that I
> have sent. But basically, any solution that does not allow to set the
> PS bit between asserting the DMA reset bit and polling for it to clear
> will not work for MII PHYs.

yes your point was clear to me, I was just wondering if we could find an 
easier way
to solve it w/o changing the API, adding  the set_ps and propagating the 
"interface"
inside the DMA reset.

Maybe this could be fixed in the glue-logic in some way. Let me know 
what do you think.

peppe

>
> Best regards,
>
> Thomas Petazzoni

^ permalink raw reply

* Re: [PATCH 0/4] net: stmmac: Fine-tuning for four function implementations
From: Giuseppe CAVALLARO @ 2017-05-10  6:56 UTC (permalink / raw)
  To: SF Markus Elfring, netdev, Alexandre Torgue; +Cc: LKML, kernel-janitors
In-Reply-To: <aa58027a-a39c-963e-4376-a4d5312ee118@users.sourceforge.net>

Hello Markus

Thanks a lot for your effort on stmmac

On 5/9/2017 1:51 PM, SF Markus Elfring wrote:
> From: Markus Elfring <elfring@users.sourceforge.net>
> Date: Tue, 9 May 2017 13:48:03 +0200
>
> A few update suggestions were taken into account
> from static source code analysis.
>
> Markus Elfring (4):
>    Combine three seq_printf() calls into a seq_puts() in stmmac_sysfs_dma_cap_read()
>    Replace five seq_printf() calls by seq_puts()
>    Use seq_putc() in sysfs_display_ring()
>    Delete an unnecessary return statement in stmmac_get_tx_hwtstamp()
>
>   drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 22 ++++++++++------------
>   1 file changed, 10 insertions(+), 12 deletions(-)

Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>

^ permalink raw reply

* Re: bpf pointer alignment validation
From: Alexei Starovoitov @ 2017-05-10  5:57 UTC (permalink / raw)
  To: David Miller; +Cc: daniel, ast, netdev
In-Reply-To: <20170509.143234.1785658771452710730.davem@davemloft.net>

On Tue, May 09, 2017 at 02:32:34PM -0400, David Miller wrote:
>  
> +static u32 calc_align(u32 imm)
> +{
> +	u32 align = 1;
> +
> +	if (!imm)
> +		return 1U << 31;
> +
> +	while (!(imm & 1)) {
> +		imm >>= 1;
> +		align <<= 1;
> +	}
> +	return align;
> +}

same question as in previous reply.
Why not to use something like:
static u32 calc_align(u32 n)
{
        if (!n)
                return 1U << 31;
        return n - ((n - 1) & n);
}

> -		if (log_level && do_print_state) {
> +		if (log_level > 1 || (log_level && do_print_state)) {
>  			verbose("\nfrom %d to %d:", prev_insn_idx, insn_idx);
>  			print_verifier_state(&env->cur_state);
>  			do_print_state = false;

this needs to be tweaked like
if (log_level > 1)
        verbose("%d:", insn_idx);
else
	verbose("\nfrom %d to %d:", prev_insn_idx, insn_idx);

otherwise it prints prev_insn_idx which is meaningful
only with processing exit and search pruning.
That's why below ...

> +		.descr = "unknown shift",
> +		.insns = {
> +			LOAD_UNKNOWN(BPF_REG_3),
> +			BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
> +			BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
> +			BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
> +			BPF_ALU64_IMM(BPF_LSH, BPF_REG_3, 1),
> +			LOAD_UNKNOWN(BPF_REG_4),
> +			BPF_ALU64_IMM(BPF_LSH, BPF_REG_4, 5),
> +			BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
> +			BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
> +			BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
> +			BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 1),
> +			BPF_MOV64_IMM(BPF_REG_0, 0),
> +			BPF_EXIT_INSN(),
> +		},
> +		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
> +		.matches = {
> +			"from 4 to 7: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv56 R10=fp",
> +			"from 4 to 8: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv55,min_align=2 R10=fp",
> +			"from 4 to 9: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv54,min_align=4 R10=fp",
> +			"from 4 to 10: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv53,min_align=8 R10=fp",
> +			"from 4 to 11: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=inv52,min_align=16 R10=fp",
> +			"from 15 to 18: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv56 R10=fp",
> +			"from 15 to 19: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv51,min_align=32 R10=fp",
> +			"from 15 to 20: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv52,min_align=16 R10=fp",

... looks crazy here, since program is linear and 'from 4' and 'from 15' are
completely non-obvious and will change in the future for sure.
Since it just happened that search pruning heuristic kicked in at those points.
Hence doing verbose("%d:", insn_idx); is necessary to avoid noise.

> +		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
> +		.matches = {
> +			/* Calculated offset in R4 has unknown value, but known
> +			 * alignment of 4.
> +			 */
> +			"from 4 to 8: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R6=inv54,min_align=4 R10=fp",
> +
> +			/* Offset is added to packet pointer, resulting in known
> +			 * auxiliary alignment and offset.
> +			 */
> +			"from 4 to 11: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R5=pkt(id=1,off=0,r=0),aux_off=14,aux_off_align=4 R6=inv54,min_align=4 R10=fp",
> +
> +			/* At the time the word size load is performed from R5,
> +			 * it's total offset is NET_IP_ALIGN + reg->off (0) +
> +			 * reg->aux_off (14) which is 16.  Then the variable
> +			 * offset is considered using reg->aux_off_align which
> +			 * is 4 and meets the load's requirements.
> +			 */
> +			"from 13 to 15: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=pkt(id=1,off=4,r=4),aux_off=14,aux_off_align=4 R5=pkt(id=1,off=0,r=4),aux_off=14,aux_off_align=4 R6=inv54,min_align=4 R10=fp",
> +
> +
> +			/* Variable offset is added to R5 packet pointer,
> +			 * resulting in auxiliary alignment of 4.
> +			 */
> +			"from 13 to 18: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv,aux_off=14,aux_off_align=4 R5=pkt(id=2,off=0,r=0),aux_off_align=4 R6=inv54,min_align=4 R10=fp",
> +
> +			/* Constant offset is added to R5, resulting in
> +			 * reg->off of 14.
> +			 */
> +			"from 13 to 19: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=inv,aux_off=14,aux_off_align=4 R5=pkt(id=2,off=14,r=0),aux_off_align=4 R6=inv54,min_align=4 R10=fp",
> +
> +			/* At the time the word size load is performed from R5,
> +			 * it's total offset is NET_IP_ALIGN + reg->off (14) which
> +			 * is 16.  Then the variable offset is considered using
> +			 * reg->aux_off_align which is 4 and meets the load's
> +			 * requirements.
> +			 */
> +			"from 21 to 23: R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8) R3=pkt_end R4=pkt(id=2,off=18,r=18),aux_off_align=4 R5=pkt(id=2,off=14,r=18),aux_off_align=4 R6=inv54,min_align=4 R10=fp",

Nice to see all these comments.
I wonder how we can make them automatic in the verifier.
Like if verifier can somehow hint the user in such human friendly way
about what is happening with the program.
Today that's the #1 problem. Most folks complaining
that verifier error messages are too hard to understand.
CTF should help. Proper data-flow analysis in the verifier should help too.

> +static int do_test_single(struct bpf_align_test *test)
> +{
> +	struct bpf_insn *prog = test->insns;
> +	int prog_type = test->prog_type;
> +	int prog_len, i;
> +	int fd_prog;
> +	int ret;
> +
> +	prog_len = probe_filter_length(prog);
> +	fd_prog = bpf_verify_program(prog_type ? : BPF_PROG_TYPE_SOCKET_FILTER,
> +				     prog, prog_len, "GPL", 0,
> +				     bpf_vlog, sizeof(bpf_vlog));
> +	if (fd_prog < 0) {
> +		printf("Failed to load program.\n");
> +		printf("%s", bpf_vlog);
> +		ret = 1;
> +	} else {
> +		ret = 0;
> +		for (i = 0; i < MAX_MATCHES; i++) {
> +			const char *t, *m = test->matches[i];
> +
> +			if (!m)
> +				break;
> +			t = strstr(bpf_vlog, m);
> +			if (!t) {
> +				printf("Failed to find match: %s\n", m);
> +				ret = 1;
> +				printf("%s", bpf_vlog);
> +				break;
> +			}
> +		}
> +		/* printf("%s", bpf_vlog); */
> +		close(fd_prog);

would it make sense to bpf_prog_test_run() it here as well?
On x86 not much value, but on sparc we can somehow look for traps?
Is there some counter of unaligned traps that we can read and report
as error to user space after prog_test_run ?
These tests we cannot really run, since they don't do any load/store.
I mean more for some future tests. Or some sort of debug warn
that there were traps while bpf prog was executed, so the user
is alarmed and reports to us, since that would be a bug in verifier
align logic?

^ permalink raw reply

* Re: DQL and TCQ_F_CAN_BYPASS destroy performance under virtualizaiton (Was: "Re: net_sched strange in 4.11")
From: Anton Ivanov @ 2017-05-10  5:35 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi; +Cc: David S. Miller, netdev, Michael S. Tsirkin
In-Reply-To: <52801b17-d8af-eaba-3ecf-fa4495c352f5@redhat.com>

[snip]

> Virtio-net net does not support BQL. Before commit ea7735d97ba9
> ("virtio-net: move free_old_xmit_skbs"), it's even impossible to
> support that since we don't have tx interrupt for each packet.  I
> haven't measured the impact of xmit_more, maybe I was wrong but I
> think it may help in some cases since it may improve the batching on
> host more or less.

Sorry, hit send too soon.

Impact of xmit more depends on your transport.

If, for example, you are using sendmmsg on the outer side which can
consume the bulked data "as is", the impact is quite significant. If
your transport does not support bulking, the fact there was bulking
earlier in the chain has little impact.

There is some, but not a lot.

[snip]

-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661

^ permalink raw reply

* Re: DQL and TCQ_F_CAN_BYPASS destroy performance under virtualizaiton (Was: "Re: net_sched strange in 4.11")
From: Anton Ivanov @ 2017-05-10  5:28 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi; +Cc: David S. Miller, netdev, Michael S. Tsirkin
In-Reply-To: <52801b17-d8af-eaba-3ecf-fa4495c352f5@redhat.com>

On 10/05/17 03:18, Jason Wang wrote:
>
>
> On 2017年05月09日 23:11, Stefan Hajnoczi wrote:
>> On Tue, May 09, 2017 at 08:46:46AM +0100, Anton Ivanov wrote:
>>> I have figured it out. Two issues.
>>>
>>> 1) skb->xmit_more is hardly ever set under virtualization because
>>> the qdisc
>>> is usually bypassed because of TCQ_F_CAN_BYPASS. Once
>>> TCQ_F_CAN_BYPASS is
>>> set a virtual NIC driver is not likely see skb->xmit_more (this
>>> answers my
>>> "how does this work at all" question).
>>>
>>> 2) If that flag is turned off (I patched sched_generic to turn it
>>> off in
>>> pfifo_fast while testing), DQL keeps xmit_more from being set. If
>>> the driver
>>> is not DQL enabled xmit_more is never ever set. If the driver is DQL
>>> enabled
>>> the queue is adjusted to ensure xmit_more stops happening within
>>> 10-15 xmit
>>> cycles.
>>>
>>> That is plain *wrong* for virtual NICs - virtio, emulated NICs, etc.
>>> There,
>>> the BIG cost is telling the hypervisor that it needs to "kick" the
>>> packets.
>>> The cost of putting them into the vNIC buffers is negligible. You want
>>> xmit_more to happen - it makes between 50% and 300% (depending on vNIC
>>> design) difference. If there is no xmit_more the vNIC will immediately
>>> "kick" the hypervisor and try to signal that  the packet needs to move
>>> straight away (as for example in virtio_net).
>
> How do you measure the performance? TCP or just measure pps?

In this particular case - tcp from guest. I have a couple of other
benchmarks (forwarding, etc).

>
>>>
>>> In addition to that, the perceived line rate is proportional to this
>>> cost,
>>> so I am not sure that the current dql math holds. In fact, I think
>>> it does
>>> not - it is trying to adjust something which influences the
>>> perceived line
>>> rate.
>>>
>>> So - how do we turn BOTH bypass and DQL adjustment while under
>>> virtualization and set them to be "always qdisc" + "always xmit_more
>>> allowed"
>
> Virtio-net net does not support BQL. Before commit ea7735d97ba9
> ("virtio-net: move free_old_xmit_skbs"), it's even impossible to
> support that since we don't have tx interrupt for each packet.  I
> haven't measured the impact of xmit_more, maybe I was wrong but I
> think it may help in some cases since it may improve the batching on
> host more or less.

If you do not support BQL, you might as well look the xmit_more part
kick code path. Line 1127.

bool kick = !skb->xmit_more; effectively means kick = true;

It will never be triggered. You will be kicking each packet and per
packet. xmit_more is now set only out of BQL. If BQL is not enabled you
never get it. Now, will the current dql code work correctly if you do
not have a defined line rate and completion interrupts - no idea.
Probably not. IMHO instead of trying to fix it there should be a way for
a device or architecture to turn it off.

To be clear - I ran into this working on my own drivers for UML, you are
cc-ed because you are likely to be one of the most affected.

A.

>
> Thanks
>
>>>
>>> A.
>>>
>>> P.S. Cc-ing virtio maintainer
>> CCing Michael Tsirkin and Jason Wang, who are the core virtio and
>> virtio-net maintainers.  (I maintain the vsock driver - it's unrelated
>> to this discussion.)
>>
>>> A.
>>>
>>>
>>> On 08/05/17 08:15, Anton Ivanov wrote:
>>>> Hi all,
>>>>
>>>> I was revising some of my old work for UML to prepare it for
>>>> submission
>>>> and I noticed that skb->xmit_more does not seem to be set any more.
>>>>
>>>> I traced the issue as far as net/sched/sched_generic.c
>>>>
>>>> try_bulk_dequeue_skb() is never invoked (the drivers I am working
>>>> on are
>>>> dql enabled so that is not the problem).
>>>>
>>>> More interestingly, if I put a breakpoint and debug output into
>>>> dequeue_skb() around line 147 - right before the bulk: tag that skb
>>>> there is always NULL. ???
>>>>
>>>> Similarly, debug in pfifo_fast_dequeue shows only NULLs being
>>>> dequeued.
>>>> Again - ???
>>>>
>>>> First and foremost, I apologize for the silly question, but how can
>>>> this
>>>> work at all? I see the skbs showing up at the driver level, why are
>>>> NULLs being returned at qdisc dequeue and where do the skbs at the
>>>> driver level come from?
>>>>
>>>> Second, where should I look to fix it?
>>>>
>>>> A.
>>>>
>>>
>>> -- 
>>> Anton R. Ivanov
>>>
>>> Cambridge Greys Limited, England company No 10273661
>>> http://www.cambridgegreys.com/
>>>
>
>


-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661

^ permalink raw reply

* Re: [PATCH] net: fec: select queue depending on VLAN priority
From: Stefan Agner @ 2017-05-10  4:51 UTC (permalink / raw)
  To: David Miller; +Cc: fugang.duan, andrew, festevam, netdev, linux-kernel
In-Reply-To: <20170509.093913.1803457630634565212.davem@davemloft.net>

On 2017-05-09 06:39, David Miller wrote:
> From: Stefan Agner <stefan@agner.ch>
> Date: Mon,  8 May 2017 22:37:08 -0700
> 
>> Since the addition of the multi queue code with commit 59d0f7465644
>> ("net: fec: init multi queue date structure") the queue selection
>> has been handelt by the default transmit queue selection
>> implementation which tries to evenly distribute the traffic across
>> all available queues. This selection presumes that the queues are
>> using an equal priority, however, the queues 1 and 2 are actually
>> of higher priority (the classification of the queues is enabled in
>> fec_enet_enable_ring).
>>
>> This can lead to net scheduler warnings and continuous TX ring
>> dumps when exercising the system with iperf.
>>
>> Use only queue 0 for all common traffic (no VLAN and P802.1p
>> priority 0 and 1) and route level 2-7 through queue 1 and 2.
>>
>> Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
>> Fixes: 59d0f7465644 ("net: fec: init multi queue date structure")
> 
> If the queues are used for prioritization, and it does not have
> multiple normal priority level queues, multiqueue is not what the
> driver should have implemented.

As Andy mentioned, there is also a round-robin mode. I'll try that.

What would be the proper way to use the prioritized queues?

--
Stefan

^ permalink raw reply

* Re: [PATCH] libertas: Avoid reading past end of buffer
From: Joe Perches @ 2017-05-10  4:33 UTC (permalink / raw)
  To: Kees Cook, netdev
  Cc: Kalle Valo, libertas-dev, linux-wireless, linux-kernel,
	Daniel Micay
In-Reply-To: <20170509232334.GA55070@beast>

On Tue, 2017-05-09 at 16:23 -0700, Kees Cook wrote:
> Using memcpy() from a string that is shorter than the length copied means
> the destination buffer is being filled with arbitrary data from the kernel
> rodata segment. Instead, use strncpy() which will fill the trailing bytes
> with zeros. Additionally adjust indentation to keep checkpatch.pl happy.
> 
> This was found with the future CONFIG_FORTIFY_SOURCE feature.
[]
> diff --git a/drivers/net/wireless/marvell/libertas/mesh.c b/drivers/net/wireless/marvell/libertas/mesh.c
[]
> @@ -1177,9 +1177,9 @@ void lbs_mesh_ethtool_get_strings(struct net_device *dev,
>  	switch (stringset) {
>  	case ETH_SS_STATS:
>  		for (i = 0; i < MESH_STATS_NUM; i++) {
> -			memcpy(s + i * ETH_GSTRING_LEN,
> -					mesh_stat_strings[i],
> -					ETH_GSTRING_LEN);
> +			strncpy(s + i * ETH_GSTRING_LEN,
> +				mesh_stat_strings[i],
> +				ETH_GSTRING_LEN);
>  		}

The better solution is to declare
mesh_stat_strings in in the normal way

---
 drivers/net/wireless/marvell/libertas/mesh.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/wireless/marvell/libertas/mesh.c b/drivers/net/wireless/marvell/libertas/mesh.c
index d0c881dd5846..a535e7f48d2d 100644
--- a/drivers/net/wireless/marvell/libertas/mesh.c
+++ b/drivers/net/wireless/marvell/libertas/mesh.c
@@ -1108,15 +1108,15 @@ void lbs_mesh_set_txpd(struct lbs_private *priv,
  * Ethtool related
  */
 
-static const char * const mesh_stat_strings[] = {
-			"drop_duplicate_bcast",
-			"drop_ttl_zero",
-			"drop_no_fwd_route",
-			"drop_no_buffers",
-			"fwded_unicast_cnt",
-			"fwded_bcast_cnt",
-			"drop_blind_table",
-			"tx_failed_cnt"
+static const char mesh_stat_strings[][ETH_GSTRING_LEN] = {
+	"drop_duplicate_bcast",
+	"drop_ttl_zero",
+	"drop_no_fwd_route",
+	"drop_no_buffers",
+	"fwded_unicast_cnt",
+	"fwded_bcast_cnt",
+	"drop_blind_table",
+	"tx_failed_cnt",
 };
 
 void lbs_mesh_ethtool_get_stats(struct net_device *dev,

^ permalink raw reply related

* Re: [PATCH] net: dsa: loop: Free resources if initialization is deferred
From: Julia Lawall @ 2017-05-10  4:30 UTC (permalink / raw)
  To: Christophe JAILLET
  Cc: andrew, vivien.didelot, f.fainelli, netdev, linux-kernel,
	kernel-janitors
In-Reply-To: <20170510041503.29363-1-christophe.jaillet@wanadoo.fr>



On Wed, 10 May 2017, Christophe JAILLET wrote:

> Free some devm'allocated memory in case of deferred driver initialization.
> This avoid to waste some memory in such a case.

I really think it would be helpful to mention the special behavior of
-EPROBE_DEFER.  It doesn't take much space, and it coud be helpful to
someone in the future.

julia

>
> Suggested-by: Joe Perches <joe@perches.com>
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
> ---
>  drivers/net/dsa/dsa_loop.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/dsa/dsa_loop.c b/drivers/net/dsa/dsa_loop.c
> index a19e1781e9bb..557afb418320 100644
> --- a/drivers/net/dsa/dsa_loop.c
> +++ b/drivers/net/dsa/dsa_loop.c
> @@ -260,8 +260,11 @@ static int dsa_loop_drv_probe(struct mdio_device *mdiodev)
>  		return -ENOMEM;
>
>  	ps->netdev = dev_get_by_name(&init_net, pdata->netdev);
> -	if (!ps->netdev)
> +	if (!ps->netdev) {
> +		devm_kfree(&mdiodev->dev, ps);
> +		devm_kfree(&mdiodev->dev, ds);
>  		return -EPROBE_DEFER;
> +	}
>
>  	pdata->cd.netdev[DSA_LOOP_CPU_PORT] = &ps->netdev->dev;
>
> --
> 2.11.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe kernel-janitors" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* [PATCH] net: dsa: loop: Free resources if initialization is deferred
From: Christophe JAILLET @ 2017-05-10  4:15 UTC (permalink / raw)
  To: andrew, vivien.didelot, f.fainelli
  Cc: netdev, linux-kernel, kernel-janitors, Christophe JAILLET

Free some devm'allocated memory in case of deferred driver initialization.
This avoid to waste some memory in such a case.

Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
---
 drivers/net/dsa/dsa_loop.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/dsa/dsa_loop.c b/drivers/net/dsa/dsa_loop.c
index a19e1781e9bb..557afb418320 100644
--- a/drivers/net/dsa/dsa_loop.c
+++ b/drivers/net/dsa/dsa_loop.c
@@ -260,8 +260,11 @@ static int dsa_loop_drv_probe(struct mdio_device *mdiodev)
 		return -ENOMEM;
 
 	ps->netdev = dev_get_by_name(&init_net, pdata->netdev);
-	if (!ps->netdev)
+	if (!ps->netdev) {
+		devm_kfree(&mdiodev->dev, ps);
+		devm_kfree(&mdiodev->dev, ds);
 		return -EPROBE_DEFER;
+	}
 
 	pdata->cd.netdev[DSA_LOOP_CPU_PORT] = &ps->netdev->dev;
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next V4 05/10] skb_array: introduce batch dequeuing
From: Jason Wang @ 2017-05-10  3:36 UTC (permalink / raw)
  To: netdev, linux-kernel, mst; +Cc: Jason Wang
In-Reply-To: <1494387382-19916-1-git-send-email-jasowang@redhat.com>

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/linux/skb_array.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index 79850b6..35226cd 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -97,21 +97,46 @@ static inline struct sk_buff *skb_array_consume(struct skb_array *a)
 	return ptr_ring_consume(&a->ring);
 }
 
+static inline int skb_array_consume_batched(struct skb_array *a,
+					    struct sk_buff **array, int n)
+{
+	return ptr_ring_consume_batched(&a->ring, (void **)array, n);
+}
+
 static inline struct sk_buff *skb_array_consume_irq(struct skb_array *a)
 {
 	return ptr_ring_consume_irq(&a->ring);
 }
 
+static inline int skb_array_consume_batched_irq(struct skb_array *a,
+						struct sk_buff **array, int n)
+{
+	return ptr_ring_consume_batched_irq(&a->ring, (void **)array, n);
+}
+
 static inline struct sk_buff *skb_array_consume_any(struct skb_array *a)
 {
 	return ptr_ring_consume_any(&a->ring);
 }
 
+static inline int skb_array_consume_batched_any(struct skb_array *a,
+						struct sk_buff **array, int n)
+{
+	return ptr_ring_consume_batched_any(&a->ring, (void **)array, n);
+}
+
+
 static inline struct sk_buff *skb_array_consume_bh(struct skb_array *a)
 {
 	return ptr_ring_consume_bh(&a->ring);
 }
 
+static inline int skb_array_consume_batched_bh(struct skb_array *a,
+					       struct sk_buff **array, int n)
+{
+	return ptr_ring_consume_batched_bh(&a->ring, (void **)array, n);
+}
+
 static inline int __skb_array_len_with_tag(struct sk_buff *skb)
 {
 	if (likely(skb)) {
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next V4 10/10] vhost_net: try batch dequing from skb array
From: Jason Wang @ 2017-05-10  3:36 UTC (permalink / raw)
  To: netdev, linux-kernel, mst; +Cc: Jason Wang
In-Reply-To: <1494387382-19916-1-git-send-email-jasowang@redhat.com>

We used to dequeue one skb during recvmsg() from skb_array, this could
be inefficient because of the bad cache utilization and spinlock
touching for each packet. This patch tries to batch them by calling
batch dequeuing helpers explicitly on the exported skb array and pass
the skb back through msg_control for underlayer socket to finish the
userspace copying.

Batch dequeuing is also the requirement for more batching improvement
on rx.

Tests were done by pktgen on tap with XDP1 in guest on top of batch
zeroing:

rx batch | pps

256        2.41Mpps (+6.16%)
128        2.48Mpps (+8.80%)
64         2.38Mpps (+3.96%) <- Default
16         2.31Mpps (+1.76%)
4          2.31Mpps (+1.76%)
1          2.30Mpps (+1.32%)
0          2.27Mpps (+7.48%)

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 111 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b51989..fbaecf3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -28,6 +28,8 @@
 #include <linux/if_macvlan.h>
 #include <linux/if_tap.h>
 #include <linux/if_vlan.h>
+#include <linux/skb_array.h>
+#include <linux/skbuff.h>
 
 #include <net/sock.h>
 
@@ -85,6 +87,13 @@ struct vhost_net_ubuf_ref {
 	struct vhost_virtqueue *vq;
 };
 
+#define VHOST_RX_BATCH 64
+struct vhost_net_buf {
+	struct sk_buff *queue[VHOST_RX_BATCH];
+	int tail;
+	int head;
+};
+
 struct vhost_net_virtqueue {
 	struct vhost_virtqueue vq;
 	size_t vhost_hlen;
@@ -99,6 +108,8 @@ struct vhost_net_virtqueue {
 	/* Reference counting for outstanding ubufs.
 	 * Protected by vq mutex. Writers must also take device mutex. */
 	struct vhost_net_ubuf_ref *ubufs;
+	struct skb_array *rx_array;
+	struct vhost_net_buf rxq;
 };
 
 struct vhost_net {
@@ -117,6 +128,71 @@ struct vhost_net {
 
 static unsigned vhost_net_zcopy_mask __read_mostly;
 
+static void *vhost_net_buf_get_ptr(struct vhost_net_buf *rxq)
+{
+	if (rxq->tail != rxq->head)
+		return rxq->queue[rxq->head];
+	else
+		return NULL;
+}
+
+static int vhost_net_buf_get_size(struct vhost_net_buf *rxq)
+{
+	return rxq->tail - rxq->head;
+}
+
+static int vhost_net_buf_is_empty(struct vhost_net_buf *rxq)
+{
+	return rxq->tail == rxq->head;
+}
+
+static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
+{
+	void *ret = vhost_net_buf_get_ptr(rxq);
+	++rxq->head;
+	return ret;
+}
+
+static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
+{
+	struct vhost_net_buf *rxq = &nvq->rxq;
+
+	rxq->head = 0;
+	rxq->tail = skb_array_consume_batched(nvq->rx_array, rxq->queue,
+					      VHOST_RX_BATCH);
+	return rxq->tail;
+}
+
+static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq)
+{
+	struct vhost_net_buf *rxq = &nvq->rxq;
+
+	if (nvq->rx_array && !vhost_net_buf_is_empty(rxq)) {
+		skb_array_unconsume(nvq->rx_array, rxq->queue + rxq->head,
+				    vhost_net_buf_get_size(rxq));
+		rxq->head = rxq->tail = 0;
+	}
+}
+
+static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
+{
+	struct vhost_net_buf *rxq = &nvq->rxq;
+
+	if (!vhost_net_buf_is_empty(rxq))
+		goto out;
+
+	if (!vhost_net_buf_produce(nvq))
+		return 0;
+
+out:
+	return __skb_array_len_with_tag(vhost_net_buf_get_ptr(rxq));
+}
+
+static void vhost_net_buf_init(struct vhost_net_buf *rxq)
+{
+	rxq->head = rxq->tail = 0;
+}
+
 static void vhost_net_enable_zcopy(int vq)
 {
 	vhost_net_zcopy_mask |= 0x1 << vq;
@@ -201,6 +277,7 @@ static void vhost_net_vq_reset(struct vhost_net *n)
 		n->vqs[i].ubufs = NULL;
 		n->vqs[i].vhost_hlen = 0;
 		n->vqs[i].sock_hlen = 0;
+		vhost_net_buf_init(&n->vqs[i].rxq);
 	}
 
 }
@@ -503,15 +580,14 @@ static void handle_tx(struct vhost_net *net)
 	mutex_unlock(&vq->mutex);
 }
 
-static int peek_head_len(struct sock *sk)
+static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 {
-	struct socket *sock = sk->sk_socket;
 	struct sk_buff *head;
 	int len = 0;
 	unsigned long flags;
 
-	if (sock->ops->peek_len)
-		return sock->ops->peek_len(sock);
+	if (rvq->rx_array)
+		return vhost_net_buf_peek(rvq);
 
 	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
 	head = skb_peek(&sk->sk_receive_queue);
@@ -537,10 +613,11 @@ static int sk_has_rx_data(struct sock *sk)
 
 static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
 {
+	struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
 	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
 	struct vhost_virtqueue *vq = &nvq->vq;
 	unsigned long uninitialized_var(endtime);
-	int len = peek_head_len(sk);
+	int len = peek_head_len(rvq, sk);
 
 	if (!len && vq->busyloop_timeout) {
 		/* Both tx vq and rx socket were polled here */
@@ -561,7 +638,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
 			vhost_poll_queue(&vq->poll);
 		mutex_unlock(&vq->mutex);
 
-		len = peek_head_len(sk);
+		len = peek_head_len(rvq, sk);
 	}
 
 	return len;
@@ -699,6 +776,8 @@ static void handle_rx(struct vhost_net *net)
 		/* On error, stop handling until the next kick. */
 		if (unlikely(headcount < 0))
 			goto out;
+		if (nvq->rx_array)
+			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
 		/* On overrun, truncate and discard */
 		if (unlikely(headcount > UIO_MAXIOV)) {
 			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
@@ -841,6 +920,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		n->vqs[i].done_idx = 0;
 		n->vqs[i].vhost_hlen = 0;
 		n->vqs[i].sock_hlen = 0;
+		vhost_net_buf_init(&n->vqs[i].rxq);
 	}
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
 
@@ -856,11 +936,14 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
 					struct vhost_virtqueue *vq)
 {
 	struct socket *sock;
+	struct vhost_net_virtqueue *nvq =
+		container_of(vq, struct vhost_net_virtqueue, vq);
 
 	mutex_lock(&vq->mutex);
 	sock = vq->private_data;
 	vhost_net_disable_vq(n, vq);
 	vq->private_data = NULL;
+	vhost_net_buf_unproduce(nvq);
 	mutex_unlock(&vq->mutex);
 	return sock;
 }
@@ -953,6 +1036,25 @@ static struct socket *get_raw_socket(int fd)
 	return ERR_PTR(r);
 }
 
+static struct skb_array *get_tap_skb_array(int fd)
+{
+	struct skb_array *array;
+	struct file *file = fget(fd);
+
+	if (!file)
+		return NULL;
+	array = tun_get_skb_array(file);
+	if (!IS_ERR(array))
+		goto out;
+	array = tap_get_skb_array(file);
+	if (!IS_ERR(array))
+		goto out;
+	array = NULL;
+out:
+	fput(file);
+	return array;
+}
+
 static struct socket *get_tap_socket(int fd)
 {
 	struct file *file = fget(fd);
@@ -1029,6 +1131,9 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 
 		vhost_net_disable_vq(n, vq);
 		vq->private_data = sock;
+		vhost_net_buf_unproduce(nvq);
+		if (index == VHOST_NET_VQ_RX)
+			nvq->rx_array = get_tap_skb_array(fd);
 		r = vhost_vq_init_access(vq);
 		if (r)
 			goto err_used;
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next V4 09/10] tap: support receiving skb from msg_control
From: Jason Wang @ 2017-05-10  3:36 UTC (permalink / raw)
  To: netdev, linux-kernel, mst; +Cc: Jason Wang
In-Reply-To: <1494387382-19916-1-git-send-email-jasowang@redhat.com>

This patch makes tap_recvmsg() can receive from skb from its caller
through msg_control. Vhost_net will be the first user.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tap.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index abdaf86..9af3239 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -824,15 +824,17 @@ static ssize_t tap_put_user(struct tap_queue *q,
 
 static ssize_t tap_do_read(struct tap_queue *q,
 			   struct iov_iter *to,
-			   int noblock)
+			   int noblock, struct sk_buff *skb)
 {
 	DEFINE_WAIT(wait);
-	struct sk_buff *skb;
 	ssize_t ret = 0;
 
 	if (!iov_iter_count(to))
 		return 0;
 
+	if (skb)
+		goto put;
+
 	while (1) {
 		if (!noblock)
 			prepare_to_wait(sk_sleep(&q->sk), &wait,
@@ -856,6 +858,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
 	if (!noblock)
 		finish_wait(sk_sleep(&q->sk), &wait);
 
+put:
 	if (skb) {
 		ret = tap_put_user(q, skb, to);
 		if (unlikely(ret < 0))
@@ -872,7 +875,7 @@ static ssize_t tap_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	struct tap_queue *q = file->private_data;
 	ssize_t len = iov_iter_count(to), ret;
 
-	ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK);
+	ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK, NULL);
 	ret = min_t(ssize_t, ret, len);
 	if (ret > 0)
 		iocb->ki_pos = ret;
@@ -1155,7 +1158,8 @@ static int tap_recvmsg(struct socket *sock, struct msghdr *m,
 	int ret;
 	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
 		return -EINVAL;
-	ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT);
+	ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT,
+			  m->msg_control);
 	if (ret > total_len) {
 		m->msg_flags |= MSG_TRUNC;
 		ret = flags & MSG_TRUNC ? ret : total_len;
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next V4 08/10] tun: support receiving skb through msg_control
From: Jason Wang @ 2017-05-10  3:36 UTC (permalink / raw)
  To: netdev, linux-kernel, mst; +Cc: Jason Wang
In-Reply-To: <1494387382-19916-1-git-send-email-jasowang@redhat.com>

This patch makes tun_recvmsg() can receive from skb from its caller
through msg_control. Vhost_net will be the first user.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3cbfc5c..f8041f9c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1510,9 +1510,8 @@ static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
 
 static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 			   struct iov_iter *to,
-			   int noblock)
+			   int noblock, struct sk_buff *skb)
 {
-	struct sk_buff *skb;
 	ssize_t ret;
 	int err;
 
@@ -1521,10 +1520,12 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 	if (!iov_iter_count(to))
 		return 0;
 
-	/* Read frames from ring */
-	skb = tun_ring_recv(tfile, noblock, &err);
-	if (!skb)
-		return err;
+	if (!skb) {
+		/* Read frames from ring */
+		skb = tun_ring_recv(tfile, noblock, &err);
+		if (!skb)
+			return err;
+	}
 
 	ret = tun_put_user(tun, tfile, skb, to);
 	if (unlikely(ret < 0))
@@ -1544,7 +1545,7 @@ static ssize_t tun_chr_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 	if (!tun)
 		return -EBADFD;
-	ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK);
+	ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK, NULL);
 	ret = min_t(ssize_t, ret, len);
 	if (ret > 0)
 		iocb->ki_pos = ret;
@@ -1646,7 +1647,8 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 					 SOL_PACKET, TUN_TX_TIMESTAMP);
 		goto out;
 	}
-	ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT);
+	ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT,
+			  m->msg_control);
 	if (ret > (ssize_t)total_len) {
 		m->msg_flags |= MSG_TRUNC;
 		ret = flags & MSG_TRUNC ? ret : total_len;
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next V4 07/10] tap: export skb_array
From: Jason Wang @ 2017-05-10  3:36 UTC (permalink / raw)
  To: netdev, linux-kernel, mst; +Cc: Jason Wang
In-Reply-To: <1494387382-19916-1-git-send-email-jasowang@redhat.com>

This patch exports skb_array through tap_get_skb_array(). Caller can
then manipulate skb array directly.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tap.c      | 13 +++++++++++++
 include/linux/if_tap.h |  5 +++++
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 4d4173d..abdaf86 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1193,6 +1193,19 @@ struct socket *tap_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tap_get_socket);
 
+struct skb_array *tap_get_skb_array(struct file *file)
+{
+	struct tap_queue *q;
+
+	if (file->f_op != &tap_fops)
+		return ERR_PTR(-EINVAL);
+	q = file->private_data;
+	if (!q)
+		return ERR_PTR(-EBADFD);
+	return &q->skb_array;
+}
+EXPORT_SYMBOL_GPL(tap_get_skb_array);
+
 int tap_queue_resize(struct tap_dev *tap)
 {
 	struct net_device *dev = tap->dev;
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 3482c3c..4837157 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -3,6 +3,7 @@
 
 #if IS_ENABLED(CONFIG_TAP)
 struct socket *tap_get_socket(struct file *);
+struct skb_array *tap_get_skb_array(struct file *file);
 #else
 #include <linux/err.h>
 #include <linux/errno.h>
@@ -12,6 +13,10 @@ static inline struct socket *tap_get_socket(struct file *f)
 {
 	return ERR_PTR(-EINVAL);
 }
+static inline struct skb_array *tap_get_skb_array(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
 #endif /* CONFIG_TAP */
 
 #include <net/sock.h>
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next V4 06/10] tun: export skb_array
From: Jason Wang @ 2017-05-10  3:36 UTC (permalink / raw)
  To: netdev, linux-kernel, mst; +Cc: Jason Wang
In-Reply-To: <1494387382-19916-1-git-send-email-jasowang@redhat.com>

This patch exports skb_array through tun_get_skb_array(). Caller can
then manipulate skb array directly.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c      | 13 +++++++++++++
 include/linux/if_tun.h |  5 +++++
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index bbd707b..3cbfc5c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2626,6 +2626,19 @@ struct socket *tun_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tun_get_socket);
 
+struct skb_array *tun_get_skb_array(struct file *file)
+{
+	struct tun_file *tfile;
+
+	if (file->f_op != &tun_fops)
+		return ERR_PTR(-EINVAL);
+	tfile = file->private_data;
+	if (!tfile)
+		return ERR_PTR(-EBADFD);
+	return &tfile->tx_array;
+}
+EXPORT_SYMBOL_GPL(tun_get_skb_array);
+
 module_init(tun_init);
 module_exit(tun_cleanup);
 MODULE_DESCRIPTION(DRV_DESCRIPTION);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index ed6da2e..bf9bdf4 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -19,6 +19,7 @@
 
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
+struct skb_array *tun_get_skb_array(struct file *file);
 #else
 #include <linux/err.h>
 #include <linux/errno.h>
@@ -28,5 +29,9 @@ static inline struct socket *tun_get_socket(struct file *f)
 {
 	return ERR_PTR(-EINVAL);
 }
+static inline struct skb_array *tun_get_skb_array(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
 #endif /* CONFIG_TUN */
 #endif /* __IF_TUN_H */
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next V4 04/10] ptr_ring: introduce batch dequeuing
From: Jason Wang @ 2017-05-10  3:36 UTC (permalink / raw)
  To: netdev, linux-kernel, mst; +Cc: Jason Wang
In-Reply-To: <1494387382-19916-1-git-send-email-jasowang@redhat.com>

This patch introduce a batched version of consuming, consumer can
dequeue more than one pointers from the ring at a time. We don't care
about the reorder of reading here so no need for compiler barrier.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/linux/ptr_ring.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 796b90f..d8c97ec 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -278,6 +278,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r)
 	return ptr;
 }
 
+static inline int __ptr_ring_consume_batched(struct ptr_ring *r,
+					     void **array, int n)
+{
+	void *ptr;
+	int i;
+
+	for (i = 0; i < n; i++) {
+		ptr = __ptr_ring_consume(r);
+		if (!ptr)
+			break;
+		array[i] = ptr;
+	}
+
+	return i;
+}
+
 /*
  * Note: resize (below) nests producer lock within consumer lock, so if you
  * call this in interrupt or BH context, you must disable interrupts/BH when
@@ -328,6 +344,55 @@ static inline void *ptr_ring_consume_bh(struct ptr_ring *r)
 	return ptr;
 }
 
+static inline int ptr_ring_consume_batched(struct ptr_ring *r,
+					   void **array, int n)
+{
+	int ret;
+
+	spin_lock(&r->consumer_lock);
+	ret = __ptr_ring_consume_batched(r, array, n);
+	spin_unlock(&r->consumer_lock);
+
+	return ret;
+}
+
+static inline int ptr_ring_consume_batched_irq(struct ptr_ring *r,
+					       void **array, int n)
+{
+	int ret;
+
+	spin_lock_irq(&r->consumer_lock);
+	ret = __ptr_ring_consume_batched(r, array, n);
+	spin_unlock_irq(&r->consumer_lock);
+
+	return ret;
+}
+
+static inline int ptr_ring_consume_batched_any(struct ptr_ring *r,
+					       void **array, int n)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&r->consumer_lock, flags);
+	ret = __ptr_ring_consume_batched(r, array, n);
+	spin_unlock_irqrestore(&r->consumer_lock, flags);
+
+	return ret;
+}
+
+static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
+					      void **array, int n)
+{
+	int ret;
+
+	spin_lock_bh(&r->consumer_lock);
+	ret = __ptr_ring_consume_batched(r, array, n);
+	spin_unlock_bh(&r->consumer_lock);
+
+	return ret;
+}
+
 /* Cast to structure type and call a function without discarding from FIFO.
  * Function must return a value.
  * Callers must take consumer_lock.
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next V4 03/10] skb_array: introduce skb_array_unconsume
From: Jason Wang @ 2017-05-10  3:36 UTC (permalink / raw)
  To: netdev, linux-kernel, mst; +Cc: Jason Wang
In-Reply-To: <1494387382-19916-1-git-send-email-jasowang@redhat.com>

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/linux/skb_array.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index f4dfade..79850b6 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -156,6 +156,12 @@ static void __skb_array_destroy_skb(void *ptr)
 	kfree_skb(ptr);
 }
 
+static inline void skb_array_unconsume(struct skb_array *a,
+				       struct sk_buff **skbs, int n)
+{
+	ptr_ring_unconsume(&a->ring, (void **)skbs, n, __skb_array_destroy_skb);
+}
+
 static inline int skb_array_resize(struct skb_array *a, int size, gfp_t gfp)
 {
 	return ptr_ring_resize(&a->ring, size, gfp, __skb_array_destroy_skb);
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next V4 02/10] ptr_ring: add ptr_ring_unconsume
From: Jason Wang @ 2017-05-10  3:36 UTC (permalink / raw)
  To: netdev, linux-kernel, mst; +Cc: Jason Wang
In-Reply-To: <1494387382-19916-1-git-send-email-jasowang@redhat.com>

From: "Michael S. Tsirkin" <mst@redhat.com>

Applications that consume a batch of entries in one go
can benefit from ability to return some of them back
into the ring.

Add an API for that - assuming there's space. If there's no space
naturally can't do this and have to drop entries, but this implies ring
is full so we'd likely drop some anyway.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/linux/ptr_ring.h | 55 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 6b2e0dd..796b90f 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -403,6 +403,61 @@ static inline int ptr_ring_init(struct ptr_ring *r, int size, gfp_t gfp)
 	return 0;
 }
 
+/*
+ * Return entries into ring. Destroy entries that don't fit.
+ *
+ * Note: this is expected to be a rare slow path operation.
+ *
+ * Note: producer lock is nested within consumer lock, so if you
+ * resize you must make sure all uses nest correctly.
+ * In particular if you consume ring in interrupt or BH context, you must
+ * disable interrupts/BH when doing so.
+ */
+static inline void ptr_ring_unconsume(struct ptr_ring *r, void **batch, int n,
+				      void (*destroy)(void *))
+{
+	unsigned long flags;
+	int head;
+
+	spin_lock_irqsave(&r->consumer_lock, flags);
+	spin_lock(&r->producer_lock);
+
+	if (!r->size)
+		goto done;
+
+	/*
+	 * Clean out buffered entries (for simplicity). This way following code
+	 * can test entries for NULL and if not assume they are valid.
+	 */
+	head = r->consumer_head - 1;
+	while (likely(head >= r->consumer_tail))
+		r->queue[head--] = NULL;
+	r->consumer_tail = r->consumer_head;
+
+	/*
+	 * Go over entries in batch, start moving head back and copy entries.
+	 * Stop when we run into previously unconsumed entries.
+	 */
+	while (n) {
+		head = r->consumer_head - 1;
+		if (head < 0)
+			head = r->size - 1;
+		if (r->queue[head]) {
+			/* This batch entry will have to be destroyed. */
+			goto done;
+		}
+		r->queue[head] = batch[--n];
+		r->consumer_tail = r->consumer_head = head;
+	}
+
+done:
+	/* Destroy all entries left in the batch. */
+	while (n)
+		destroy(batch[--n]);
+	spin_unlock(&r->producer_lock);
+	spin_unlock_irqrestore(&r->consumer_lock, flags);
+}
+
 static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue,
 					   int size, gfp_t gfp,
 					   void (*destroy)(void *))
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next V4 01/10] ptr_ring: batch ring zeroing
From: Jason Wang @ 2017-05-10  3:36 UTC (permalink / raw)
  To: netdev, linux-kernel, mst; +Cc: Jason Wang
In-Reply-To: <1494387382-19916-1-git-send-email-jasowang@redhat.com>

From: "Michael S. Tsirkin" <mst@redhat.com>

A known weakness in ptr_ring design is that it does not handle well the
situation when ring is almost full: as entries are consumed they are
immediately used again by the producer, so consumer and producer are
writing to a shared cache line.

To fix this, add batching to consume calls: as entries are
consumed do not write NULL into the ring until we get
a multiple (in current implementation 2x) of cache lines
away from the producer. At that point, write them all out.

We do the write out in the reverse order to keep
producer from sharing cache with consumer for as long
as possible.

Writeout also triggers when ring wraps around - there's
no special reason to do this but it helps keep the code
a bit simpler.

What should we do if getting away from producer by 2 cache lines
would mean we are keeping the ring moe than half empty?
Maybe we should reduce the batching in this case,
current patch simply reduces the batching.

Notes:
- it is no longer true that a call to consume guarantees
  that the following call to produce will succeed.
  No users seem to assume that.
- batching can also in theory reduce the signalling rate:
  users that would previously send interrups to the producer
  to wake it up after consuming each entry would now only
  need to do this once in a batch.
  Doing this would be easy by returning a flag to the caller.
  No users seem to do signalling on consume yet so this was not
  implemented yet.

Tested with pktgen on tap with xdp1 in guest:

Before 2.10 Mpps
After  2.27 Mpps (+7.48%)

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/linux/ptr_ring.h | 63 +++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 54 insertions(+), 9 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 6c70444..6b2e0dd 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -34,11 +34,13 @@
 struct ptr_ring {
 	int producer ____cacheline_aligned_in_smp;
 	spinlock_t producer_lock;
-	int consumer ____cacheline_aligned_in_smp;
+	int consumer_head ____cacheline_aligned_in_smp; /* next valid entry */
+	int consumer_tail; /* next entry to invalidate */
 	spinlock_t consumer_lock;
 	/* Shared consumer/producer data */
 	/* Read-only by both the producer and the consumer */
 	int size ____cacheline_aligned_in_smp; /* max entries in queue */
+	int batch; /* number of entries to consume in a batch */
 	void **queue;
 };
 
@@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, void *ptr)
 static inline void *__ptr_ring_peek(struct ptr_ring *r)
 {
 	if (likely(r->size))
-		return r->queue[r->consumer];
+		return r->queue[r->consumer_head];
 	return NULL;
 }
 
@@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
 /* Must only be called after __ptr_ring_peek returned !NULL */
 static inline void __ptr_ring_discard_one(struct ptr_ring *r)
 {
-	r->queue[r->consumer++] = NULL;
-	if (unlikely(r->consumer >= r->size))
-		r->consumer = 0;
+	/* Fundamentally, what we want to do is update consumer
+	 * index and zero out the entry so producer can reuse it.
+	 * Doing it naively at each consume would be as simple as:
+	 *       r->queue[r->consumer++] = NULL;
+	 *       if (unlikely(r->consumer >= r->size))
+	 *               r->consumer = 0;
+	 * but that is suboptimal when the ring is full as producer is writing
+	 * out new entries in the same cache line.  Defer these updates until a
+	 * batch of entries has been consumed.
+	 */
+	int head = r->consumer_head++;
+
+	/* Once we have processed enough entries invalidate them in
+	 * the ring all at once so producer can reuse their space in the ring.
+	 * We also do this when we reach end of the ring - not mandatory
+	 * but helps keep the implementation simple.
+	 */
+	if (unlikely(r->consumer_head - r->consumer_tail >= r->batch ||
+		     r->consumer_head >= r->size)) {
+		/* Zero out entries in the reverse order: this way we touch the
+		 * cache line that producer might currently be reading the last;
+		 * producer won't make progress and touch other cache lines
+		 * besides the first one until we write out all entries.
+		 */
+		while (likely(head >= r->consumer_tail))
+			r->queue[head--] = NULL;
+		r->consumer_tail = r->consumer_head;
+	}
+	if (unlikely(r->consumer_head >= r->size)) {
+		r->consumer_head = 0;
+		r->consumer_tail = 0;
+	}
 }
 
 static inline void *__ptr_ring_consume(struct ptr_ring *r)
@@ -345,14 +376,27 @@ static inline void **__ptr_ring_init_queue_alloc(int size, gfp_t gfp)
 	return kzalloc(ALIGN(size * sizeof(void *), SMP_CACHE_BYTES), gfp);
 }
 
+static inline void __ptr_ring_set_size(struct ptr_ring *r, int size)
+{
+	r->size = size;
+	r->batch = SMP_CACHE_BYTES * 2 / sizeof(*(r->queue));
+	/* We need to set batch at least to 1 to make logic
+	 * in __ptr_ring_discard_one work correctly.
+	 * Batching too much (because ring is small) would cause a lot of
+	 * burstiness. Needs tuning, for now disable batching.
+	 */
+	if (r->batch > r->size / 2 || !r->batch)
+		r->batch = 1;
+}
+
 static inline int ptr_ring_init(struct ptr_ring *r, int size, gfp_t gfp)
 {
 	r->queue = __ptr_ring_init_queue_alloc(size, gfp);
 	if (!r->queue)
 		return -ENOMEM;
 
-	r->size = size;
-	r->producer = r->consumer = 0;
+	__ptr_ring_set_size(r, size);
+	r->producer = r->consumer_head = r->consumer_tail = 0;
 	spin_lock_init(&r->producer_lock);
 	spin_lock_init(&r->consumer_lock);
 
@@ -373,9 +417,10 @@ static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue,
 		else if (destroy)
 			destroy(ptr);
 
-	r->size = size;
+	__ptr_ring_set_size(r, size);
 	r->producer = producer;
-	r->consumer = 0;
+	r->consumer_head = 0;
+	r->consumer_tail = 0;
 	old = r->queue;
 	r->queue = queue;
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next V4 00/10] vhost_net batch dequeuing
From: Jason Wang @ 2017-05-10  3:36 UTC (permalink / raw)
  To: netdev, linux-kernel, mst; +Cc: Jason Wang

This series tries to implement rx batching for vhost-net. This is done
by batching the dequeuing from skb_array which was exported by
underlayer socket and pass the sbk back through msg_control to finish
userspace copying. This is also the requirement for more batching
implemention on rx path.

Tests shows at most 8.8% improvment bon rx pps on top of batch zeroing.

Please review.

Thanks

Changes from V3:
- add batch zeroing patch to fix the build warnings

Changes from V2:
- rebase to net-next HEAD
- use unconsume helpers to put skb back on releasing
- introduce and use vhost_net internal buffer helpers
- renew performance numbers on top of batch zeroing

Changes from V1:
- switch to use for() in __ptr_ring_consume_batched()
- rename peek_head_len_batched() to fetch_skbs()
- use skb_array_consume_batched() instead of
  skb_array_consume_batched_bh() since no consumer run in bh
- drop the lockless peeking patch since skb_array could be resized, so
  it's not safe to call lockless one

Jason Wang (8):
  skb_array: introduce skb_array_unconsume
  ptr_ring: introduce batch dequeuing
  skb_array: introduce batch dequeuing
  tun: export skb_array
  tap: export skb_array
  tun: support receiving skb through msg_control
  tap: support receiving skb from msg_control
  vhost_net: try batch dequing from skb array

Michael S. Tsirkin (2):
  ptr_ring: batch ring zeroing
  ptr_ring: add ptr_ring_unconsume

 drivers/net/tap.c         |  25 ++++++-
 drivers/net/tun.c         |  31 ++++++--
 drivers/vhost/net.c       | 117 +++++++++++++++++++++++++++--
 include/linux/if_tap.h    |   5 ++
 include/linux/if_tun.h    |   5 ++
 include/linux/ptr_ring.h  | 183 +++++++++++++++++++++++++++++++++++++++++++---
 include/linux/skb_array.h |  31 ++++++++
 7 files changed, 370 insertions(+), 27 deletions(-)

-- 
2.7.4

^ permalink raw reply

* Re: [PATCH net 2/2] xdp: disallow use of native and generic hook at once
From: Jakub Kicinski @ 2017-05-10  3:18 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: davem, alexei.starovoitov, john.fastabend, netdev
In-Reply-To: <f8405a8769dca09a0f21d11eab3793e30f608531.1494379046.git.daniel@iogearbox.net>

On Wed, 10 May 2017 03:31:31 +0200, Daniel Borkmann wrote:
> While working on the iproute2 generic XDP frontend, I noticed that
> as of right now it's possible to have native *and* generic XDP
> programs loaded both at the same time for the case when a driver
> supports native XDP.

Nice improvement!  A couple of absolute nitpicks below..

> The intended model for generic XDP from b5cdae3291f7 ("net: Generic
> XDP") is, however, that only one out of the two can be present at
> once which is also indicated as such in the XPD netlink dump part.
                                              ^^^
                                                 XDP

> @@ -6851,6 +6851,32 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
>  }
>  EXPORT_SYMBOL(dev_change_proto_down);
>  
> +bool __dev_xdp_attached(struct net_device *dev, xdp_op_t xdp_op)

Out of curiosity - the leading underscores refer to caller having to
hold rtnl?  I assume they are not needed in the function below because
it's static?

> +{
> +	struct netdev_xdp xdp;
> +
> +	memset(&xdp, 0, sizeof(xdp));
> +	xdp.command = XDP_QUERY_PROG;

Probably personal preference, but seems like designated struct
initializer would do quite nicely here and save the memset :)

> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index dda9f16..99320f0 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -1251,24 +1251,20 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
>  {
>  	struct nlattr *xdp;
>  	u32 xdp_flags = 0;
> -	u8 val = 0;
>  	int err;
> +	u8 val;
>  
>  	xdp = nla_nest_start(skb, IFLA_XDP);
>  	if (!xdp)
>  		return -EMSGSIZE;
> +
>  	if (rcu_access_pointer(dev->xdp_prog)) {
>  		xdp_flags = XDP_FLAGS_SKB_MODE;
>  		val = 1;
> -	} else if (dev->netdev_ops->ndo_xdp) {
> -		struct netdev_xdp xdp_op = {};
> -
> -		xdp_op.command = XDP_QUERY_PROG;
> -		err = dev->netdev_ops->ndo_xdp(dev, &xdp_op);
> -		if (err)
> -			goto err_cancel;
> -		val = xdp_op.prog_attached;
> +	} else {
> +		val = dev_xdp_attached(dev);
> 	}

Would it make sense to set xdp_flags to XDP_FLAGS_DRV_MODE here to keep
things symmetrical?  I know you are just preserving existing behaviour
but it may seem slightly arbitrary to a new comer to report one of the
very similarly named flags in the dump but not the other.

^ permalink raw reply

* RE: [PATCH] net: fec: select queue depending on VLAN priority
From: Andy Duan @ 2017-05-10  2:42 UTC (permalink / raw)
  To: David Miller, stefan@agner.ch
  Cc: andrew@lunn.ch, festevam@gmail.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20170509.093913.1803457630634565212.davem@davemloft.net>

From: David Miller <davem@davemloft.net> Sent: Tuesday, May 09, 2017 9:39 PM
>To: stefan@agner.ch
>Cc: Andy Duan <fugang.duan@nxp.com>; andrew@lunn.ch;
>festevam@gmail.com; netdev@vger.kernel.org; linux-
>kernel@vger.kernel.org
>Subject: Re: [PATCH] net: fec: select queue depending on VLAN priority
>
>From: Stefan Agner <stefan@agner.ch>
>Date: Mon,  8 May 2017 22:37:08 -0700
>
>> Since the addition of the multi queue code with commit 59d0f7465644
>> ("net: fec: init multi queue date structure") the queue selection has
>> been handelt by the default transmit queue selection implementation
>> which tries to evenly distribute the traffic across all available
>> queues. This selection presumes that the queues are using an equal
>> priority, however, the queues 1 and 2 are actually of higher priority
>> (the classification of the queues is enabled in fec_enet_enable_ring).
>>
>> This can lead to net scheduler warnings and continuous TX ring dumps
>> when exercising the system with iperf.
>>
>> Use only queue 0 for all common traffic (no VLAN and P802.1p priority
>> 0 and 1) and route level 2-7 through queue 1 and 2.
>>
>> Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
>> Fixes: 59d0f7465644 ("net: fec: init multi queue date structure")
>
>If the queues are used for prioritization, and it does not have multiple normal
>priority level queues, multiqueue is not what the driver should have
>implemented.
Firstly, HW multiple queues support:
	- Traffic-shaping bandwidth distribution supports credit-based and round-robin-based policies. Either policy can be combined with time-based shaping.
	- AVB (Audio Video Bridging, IEEE 802.1Qav) features:
		* Credit-based bandwidth distribution policy can be combined with time-based shaping
		* AVB endpoint talker and listener support
		* Support for arbitration between different priority traffic (for example, AVB class A, AVB class B, and non-AVB)
Round-robin-based policies:
	It has the same priority for three queues: In the round-robin QoS scheme, each queue is given an equal opportunity to transmit one frame. For example, if queue n has a frame to transmit, the queue transmits its frame. After queue n has transmitted its frame, or if queue n does not have a frame to transmit, queue n+1 is then allowed to transmit its frame, and so on.

Credit-based policies:
	The AVB credit based shaper acts independently, per class, to control the bandwidth distribution between normal traffic and time-sensitive traffic with respect to the total link bandwidth available.
	Credit based shaper conbined with time-based shaping:  
		- priority: ClassA queue > ClassB queue > best effort
		- ensure the queue bandwidth as user set based on time-based shaping algorithms (transmitter transmit frame from three queue in turn based on time-based shaping algorithms)
	And in real AVB case,  each streaming can be independent, and are fixed on related queue. Then driver level should implement .ndo_select_queue() to put the streaming into related queue. That is what the patch did.

The current driver config the three queue to credit-based policies (AVB), the patch seems no problem for the implementation. Do you have any suggestion ?

Andy

^ permalink raw reply

* [PATCH net-next V3 8/9] tap: support receiving skb from msg_control
From: Jason Wang @ 2017-05-10  2:38 UTC (permalink / raw)
  To: netdev, linux-kernel, mst; +Cc: Jason Wang
In-Reply-To: <1494383881-6811-1-git-send-email-jasowang@redhat.com>

This patch makes tap_recvmsg() can receive from skb from its caller
through msg_control. Vhost_net will be the first user.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tap.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index abdaf86..9af3239 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -824,15 +824,17 @@ static ssize_t tap_put_user(struct tap_queue *q,
 
 static ssize_t tap_do_read(struct tap_queue *q,
 			   struct iov_iter *to,
-			   int noblock)
+			   int noblock, struct sk_buff *skb)
 {
 	DEFINE_WAIT(wait);
-	struct sk_buff *skb;
 	ssize_t ret = 0;
 
 	if (!iov_iter_count(to))
 		return 0;
 
+	if (skb)
+		goto put;
+
 	while (1) {
 		if (!noblock)
 			prepare_to_wait(sk_sleep(&q->sk), &wait,
@@ -856,6 +858,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
 	if (!noblock)
 		finish_wait(sk_sleep(&q->sk), &wait);
 
+put:
 	if (skb) {
 		ret = tap_put_user(q, skb, to);
 		if (unlikely(ret < 0))
@@ -872,7 +875,7 @@ static ssize_t tap_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	struct tap_queue *q = file->private_data;
 	ssize_t len = iov_iter_count(to), ret;
 
-	ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK);
+	ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK, NULL);
 	ret = min_t(ssize_t, ret, len);
 	if (ret > 0)
 		iocb->ki_pos = ret;
@@ -1155,7 +1158,8 @@ static int tap_recvmsg(struct socket *sock, struct msghdr *m,
 	int ret;
 	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
 		return -EINVAL;
-	ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT);
+	ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT,
+			  m->msg_control);
 	if (ret > total_len) {
 		m->msg_flags |= MSG_TRUNC;
 		ret = flags & MSG_TRUNC ? ret : total_len;
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next V3 9/9] vhost_net: try batch dequing from skb array
From: Jason Wang @ 2017-05-10  2:38 UTC (permalink / raw)
  To: netdev, linux-kernel, mst; +Cc: Jason Wang
In-Reply-To: <1494383881-6811-1-git-send-email-jasowang@redhat.com>

We used to dequeue one skb during recvmsg() from skb_array, this could
be inefficient because of the bad cache utilization and spinlock
touching for each packet. This patch tries to batch them by calling
batch dequeuing helpers explicitly on the exported skb array and pass
the skb back through msg_control for underlayer socket to finish the
userspace copying.

Batch dequeuing is also the requirement for more batching improvement
on rx.

Tests were done by pktgen on tap with XDP1 in guest on top of batch
zeroing:

rx batch | pps

256        2.41Mpps (+6.16%)
128        2.48Mpps (+8.80%)
64         2.38Mpps (+3.96%) <- Default
16         2.31Mpps (+1.76%)
4          2.31Mpps (+1.76%)
1          2.30Mpps (+1.32%)
0          2.27Mpps (+7.48%)

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 111 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b51989..fbaecf3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -28,6 +28,8 @@
 #include <linux/if_macvlan.h>
 #include <linux/if_tap.h>
 #include <linux/if_vlan.h>
+#include <linux/skb_array.h>
+#include <linux/skbuff.h>
 
 #include <net/sock.h>
 
@@ -85,6 +87,13 @@ struct vhost_net_ubuf_ref {
 	struct vhost_virtqueue *vq;
 };
 
+#define VHOST_RX_BATCH 64
+struct vhost_net_buf {
+	struct sk_buff *queue[VHOST_RX_BATCH];
+	int tail;
+	int head;
+};
+
 struct vhost_net_virtqueue {
 	struct vhost_virtqueue vq;
 	size_t vhost_hlen;
@@ -99,6 +108,8 @@ struct vhost_net_virtqueue {
 	/* Reference counting for outstanding ubufs.
 	 * Protected by vq mutex. Writers must also take device mutex. */
 	struct vhost_net_ubuf_ref *ubufs;
+	struct skb_array *rx_array;
+	struct vhost_net_buf rxq;
 };
 
 struct vhost_net {
@@ -117,6 +128,71 @@ struct vhost_net {
 
 static unsigned vhost_net_zcopy_mask __read_mostly;
 
+static void *vhost_net_buf_get_ptr(struct vhost_net_buf *rxq)
+{
+	if (rxq->tail != rxq->head)
+		return rxq->queue[rxq->head];
+	else
+		return NULL;
+}
+
+static int vhost_net_buf_get_size(struct vhost_net_buf *rxq)
+{
+	return rxq->tail - rxq->head;
+}
+
+static int vhost_net_buf_is_empty(struct vhost_net_buf *rxq)
+{
+	return rxq->tail == rxq->head;
+}
+
+static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
+{
+	void *ret = vhost_net_buf_get_ptr(rxq);
+	++rxq->head;
+	return ret;
+}
+
+static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
+{
+	struct vhost_net_buf *rxq = &nvq->rxq;
+
+	rxq->head = 0;
+	rxq->tail = skb_array_consume_batched(nvq->rx_array, rxq->queue,
+					      VHOST_RX_BATCH);
+	return rxq->tail;
+}
+
+static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq)
+{
+	struct vhost_net_buf *rxq = &nvq->rxq;
+
+	if (nvq->rx_array && !vhost_net_buf_is_empty(rxq)) {
+		skb_array_unconsume(nvq->rx_array, rxq->queue + rxq->head,
+				    vhost_net_buf_get_size(rxq));
+		rxq->head = rxq->tail = 0;
+	}
+}
+
+static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
+{
+	struct vhost_net_buf *rxq = &nvq->rxq;
+
+	if (!vhost_net_buf_is_empty(rxq))
+		goto out;
+
+	if (!vhost_net_buf_produce(nvq))
+		return 0;
+
+out:
+	return __skb_array_len_with_tag(vhost_net_buf_get_ptr(rxq));
+}
+
+static void vhost_net_buf_init(struct vhost_net_buf *rxq)
+{
+	rxq->head = rxq->tail = 0;
+}
+
 static void vhost_net_enable_zcopy(int vq)
 {
 	vhost_net_zcopy_mask |= 0x1 << vq;
@@ -201,6 +277,7 @@ static void vhost_net_vq_reset(struct vhost_net *n)
 		n->vqs[i].ubufs = NULL;
 		n->vqs[i].vhost_hlen = 0;
 		n->vqs[i].sock_hlen = 0;
+		vhost_net_buf_init(&n->vqs[i].rxq);
 	}
 
 }
@@ -503,15 +580,14 @@ static void handle_tx(struct vhost_net *net)
 	mutex_unlock(&vq->mutex);
 }
 
-static int peek_head_len(struct sock *sk)
+static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 {
-	struct socket *sock = sk->sk_socket;
 	struct sk_buff *head;
 	int len = 0;
 	unsigned long flags;
 
-	if (sock->ops->peek_len)
-		return sock->ops->peek_len(sock);
+	if (rvq->rx_array)
+		return vhost_net_buf_peek(rvq);
 
 	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
 	head = skb_peek(&sk->sk_receive_queue);
@@ -537,10 +613,11 @@ static int sk_has_rx_data(struct sock *sk)
 
 static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
 {
+	struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
 	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
 	struct vhost_virtqueue *vq = &nvq->vq;
 	unsigned long uninitialized_var(endtime);
-	int len = peek_head_len(sk);
+	int len = peek_head_len(rvq, sk);
 
 	if (!len && vq->busyloop_timeout) {
 		/* Both tx vq and rx socket were polled here */
@@ -561,7 +638,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
 			vhost_poll_queue(&vq->poll);
 		mutex_unlock(&vq->mutex);
 
-		len = peek_head_len(sk);
+		len = peek_head_len(rvq, sk);
 	}
 
 	return len;
@@ -699,6 +776,8 @@ static void handle_rx(struct vhost_net *net)
 		/* On error, stop handling until the next kick. */
 		if (unlikely(headcount < 0))
 			goto out;
+		if (nvq->rx_array)
+			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
 		/* On overrun, truncate and discard */
 		if (unlikely(headcount > UIO_MAXIOV)) {
 			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
@@ -841,6 +920,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		n->vqs[i].done_idx = 0;
 		n->vqs[i].vhost_hlen = 0;
 		n->vqs[i].sock_hlen = 0;
+		vhost_net_buf_init(&n->vqs[i].rxq);
 	}
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
 
@@ -856,11 +936,14 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
 					struct vhost_virtqueue *vq)
 {
 	struct socket *sock;
+	struct vhost_net_virtqueue *nvq =
+		container_of(vq, struct vhost_net_virtqueue, vq);
 
 	mutex_lock(&vq->mutex);
 	sock = vq->private_data;
 	vhost_net_disable_vq(n, vq);
 	vq->private_data = NULL;
+	vhost_net_buf_unproduce(nvq);
 	mutex_unlock(&vq->mutex);
 	return sock;
 }
@@ -953,6 +1036,25 @@ static struct socket *get_raw_socket(int fd)
 	return ERR_PTR(r);
 }
 
+static struct skb_array *get_tap_skb_array(int fd)
+{
+	struct skb_array *array;
+	struct file *file = fget(fd);
+
+	if (!file)
+		return NULL;
+	array = tun_get_skb_array(file);
+	if (!IS_ERR(array))
+		goto out;
+	array = tap_get_skb_array(file);
+	if (!IS_ERR(array))
+		goto out;
+	array = NULL;
+out:
+	fput(file);
+	return array;
+}
+
 static struct socket *get_tap_socket(int fd)
 {
 	struct file *file = fget(fd);
@@ -1029,6 +1131,9 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 
 		vhost_net_disable_vq(n, vq);
 		vq->private_data = sock;
+		vhost_net_buf_unproduce(nvq);
+		if (index == VHOST_NET_VQ_RX)
+			nvq->rx_array = get_tap_skb_array(fd);
 		r = vhost_vq_init_access(vq);
 		if (r)
 			goto err_used;
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next V3 7/9] tun: support receiving skb through msg_control
From: Jason Wang @ 2017-05-10  2:37 UTC (permalink / raw)
  To: netdev, linux-kernel, mst; +Cc: Jason Wang
In-Reply-To: <1494383881-6811-1-git-send-email-jasowang@redhat.com>

This patch makes tun_recvmsg() can receive from skb from its caller
through msg_control. Vhost_net will be the first user.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3cbfc5c..f8041f9c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1510,9 +1510,8 @@ static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
 
 static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 			   struct iov_iter *to,
-			   int noblock)
+			   int noblock, struct sk_buff *skb)
 {
-	struct sk_buff *skb;
 	ssize_t ret;
 	int err;
 
@@ -1521,10 +1520,12 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 	if (!iov_iter_count(to))
 		return 0;
 
-	/* Read frames from ring */
-	skb = tun_ring_recv(tfile, noblock, &err);
-	if (!skb)
-		return err;
+	if (!skb) {
+		/* Read frames from ring */
+		skb = tun_ring_recv(tfile, noblock, &err);
+		if (!skb)
+			return err;
+	}
 
 	ret = tun_put_user(tun, tfile, skb, to);
 	if (unlikely(ret < 0))
@@ -1544,7 +1545,7 @@ static ssize_t tun_chr_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 	if (!tun)
 		return -EBADFD;
-	ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK);
+	ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK, NULL);
 	ret = min_t(ssize_t, ret, len);
 	if (ret > 0)
 		iocb->ki_pos = ret;
@@ -1646,7 +1647,8 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 					 SOL_PACKET, TUN_TX_TIMESTAMP);
 		goto out;
 	}
-	ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT);
+	ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT,
+			  m->msg_control);
 	if (ret > (ssize_t)total_len) {
 		m->msg_flags |= MSG_TRUNC;
 		ret = flags & MSG_TRUNC ? ret : total_len;
-- 
2.7.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox