* Re: [PATCH net-next 2/2] net/mlx4: Revert "mlx4: set maximal number of default RSS queues"
From: Ido Shamai @ 2014-01-15 12:49 UTC (permalink / raw)
To: Sathya Perla, Yuval Mintz, Or Gerlitz, Or Gerlitz
Cc: Amir Vadai, David S. Miller, netdev@vger.kernel.org,
Eugenia Emantayev, Ido Shamay
In-Reply-To: <CF9D1877D81D214CB0CA0669EFAE020C26B83E12@CMEXMB1.ad.emulex.com>
On 1/15/2014 2:46 PM, Sathya Perla wrote:
>> -----Original Message-----
>> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf
>> Of Ido Shamai
>>
>> On 1/2/2014 12:27 PM, Yuval Mintz wrote:
>>>>>> Going back to your original commit 16917b87a "net-next: Add
>>>>>> netif_get_num_default_rss_queues" I am still not clear why we want
>>>>>>
>>>>>> 1. why we want a common default to all MQ devices?
>>>>> Although networking benefits from multiple Interrupt vectors
>>>>> (enabling more rings, better performance, etc.), bounding this
>>>>> number only to the number of cpus is unreasonable as it strains
>>>>> system resources; e.g., consider a 40-cpu server - we might wish
>>>>> to have 40 vectors per device, but that means that connecting
>>>>> several devices to the same server might cause other functions
>>>>> to fail probe as they will no longer be able to acquire interrupt
>>>>> vectors of their own.
>>>>
>>>> Modern servers which have tens of CPUs typically have thousands of MSI-X
>>>> vectors which means you should be easily able to plug four cards into a
>>>> server with 64 cores which will consume 256 out of the 1-4K vectors out
>>>> there. Anyway, let me continue your approach - how about raising the
>>>> default hard limit to 16 or having it as the number of cores @ the numa
>>>> node where the card is plugged?
>>>
>>> I think an additional issue was memory consumption -
>>> additional interrupts --> additional allocated memory (for Rx rings).
>>> And I do know the issues were real - we've had complains about devices
>>> failing to load due to lack of resources (not all servers in the world are
>>> top of the art).
>>>
>>> Anyway, I believe 8/16 are simply strict limitations without any true meaning;
>>> To judge what's more important, default `slimness' or default performance
>>> is beyond me.
>>> Perhaps the numa approach will prove beneficial (and will make some sense).
>>
>> After reviewing all that was said, I feel there is no need to enforce
>> vendors with this strict limitation without any true meaning.
>>
>> The reverted commit you applied forces the driver to use 8 rings at max
>> at all time, without the possibility to change in flight using ethtool,
>> as it's enforced on the PCI driver at module init (restarting the en
>> driver with different of requested rings will not affect).
>> So it's crucial for performance oriented applications using mlx4_en.
>
> The number of RSS/RX rings used by a driver can be increased (up to the HW supported value)
> at runtime using set-channels ethtool interface.
Not in this case. See my comment above: the limit is enforced on the PCI
driver at module init.
In our case the set-channels interface cannot raise this limit; it can
only change the number of rings up to it.
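A minimal sketch of the behavior described above, assuming a cap that is computed once at module init and a set-channels handler that can only pick a value up to that cap. All names here (ring_cap, init_ring_cap, set_rx_rings) are hypothetical and are not taken from mlx4.

```c
#include <errno.h>

/* Hypothetical model: the cap is fixed once, at module init. */
struct ring_cap {
	unsigned int max_rings;	/* decided at init, never raised later */
	unsigned int cur_rings;	/* what a set-channels request may adjust */
};

/* Assumed init policy: min(hardware limit, default RSS limit). */
static void init_ring_cap(struct ring_cap *rc, unsigned int hw_max,
			  unsigned int default_limit)
{
	rc->max_rings = hw_max < default_limit ? hw_max : default_limit;
	rc->cur_rings = rc->max_rings;
}

/* Models a set-channels request: values above the init-time cap fail. */
static int set_rx_rings(struct ring_cap *rc, unsigned int requested)
{
	if (requested == 0 || requested > rc->max_rings)
		return -EINVAL;
	rc->cur_rings = requested;
	return 0;
}
```

With a hardware limit of 64 and a default limit of 8, a later request for 16 rings is rejected while a request for 4 succeeds.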
* TI CPSW Ethernet Tx performance regression
From: Mugunthan V N @ 2014-01-15 12:48 UTC (permalink / raw)
To: netdev; +Cc: Mugunthan V N
Hi
I am seeing a performance regression with the CPSW driver on the AM335x EVM.
The AM335x EVM CPSW has 3.2 kernel support [1] and mainline support since 3.7.
Comparing 3.2 with 3.13-rc4, TCP receive performance of CPSW is the same
(~180Mbps), but TCP transmit performance is much worse than on the 3.2
kernel: *256Mbps* on 3.2 versus *70Mbps* on 3.13-rc4.
The iperf version is *iperf version 2.0.5 (08 Jul 2010) pthreads* on both the PC and the EVM.
UDP transmit performance is also down compared to the 3.2 kernel: 196Mbps on 3.2
for a 200Mbps bandwidth, versus 92Mbps on 3.13-rc4.
Can someone point out where I should look to improve Tx performance? I also
checked for Tx descriptor overflow and found none. I have
tried 3.11 and some older kernels; all give only ~75Mbps transmit
performance.
[1] - http://arago-project.org/git/projects/?p=linux-am33x.git;a=summary
Regards
Mugunthan V N
* RE: [PATCH net-next 2/2] net/mlx4: Revert "mlx4: set maximal number of default RSS queues"
From: Sathya Perla @ 2014-01-15 12:46 UTC (permalink / raw)
To: Ido Shamai, Yuval Mintz, Or Gerlitz, Or Gerlitz
Cc: Amir Vadai, David S. Miller, netdev@vger.kernel.org,
Eugenia Emantayev, Ido Shamay
In-Reply-To: <52D67BFF.6070102@dev.mellanox.co.il>
> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf
> Of Ido Shamai
>
> On 1/2/2014 12:27 PM, Yuval Mintz wrote:
> >>>> Going back to your original commit 16917b87a "net-next: Add
> >>>> netif_get_num_default_rss_queues" I am still not clear why we want
> >>>>
> >>>> 1. why we want a common default to all MQ devices?
> >>> Although networking benefits from multiple Interrupt vectors
> >>> (enabling more rings, better performance, etc.), bounding this
> >>> number only to the number of cpus is unreasonable as it strains
> >>> system resources; e.g., consider a 40-cpu server - we might wish
> >>> to have 40 vectors per device, but that means that connecting
> >>> several devices to the same server might cause other functions
> >>> to fail probe as they will no longer be able to acquire interrupt
> >>> vectors of their own.
> >>
> >> Modern servers which have tens of CPUs typically have thousands of MSI-X
> >> vectors which means you should be easily able to plug four cards into a
> >> server with 64 cores which will consume 256 out of the 1-4K vectors out
> >> there. Anyway, let me continue your approach - how about raising the
> >> default hard limit to 16 or having it as the number of cores @ the numa
> >> node where the card is plugged?
> >
> > I think an additional issue was memory consumption -
> > additional interrupts --> additional allocated memory (for Rx rings).
> > And I do know the issues were real - we've had complains about devices
> > failing to load due to lack of resources (not all servers in the world are
> > top of the art).
> >
> > Anyway, I believe 8/16 are simply strict limitations without any true meaning;
> > To judge what's more important, default `slimness' or default performance
> > is beyond me.
> > Perhaps the numa approach will prove beneficial (and will make some sense).
>
> After reviewing all that was said, I feel there is no need to enforce
> vendors with this strict limitation without any true meaning.
>
> The reverted commit you applied forces the driver to use 8 rings at max
> at all time, without the possibility to change in flight using ethtool,
> as it's enforced on the PCI driver at module init (restarting the en
> driver with different of requested rings will not affect).
> So it's crucial for performance oriented applications using mlx4_en.
The number of RSS/RX rings used by a driver can be increased (up to the HW supported value)
at runtime using set-channels ethtool interface.
* Re: [Patch net-next] net_sched: act: fix a bug in tcf_register_action()
From: Jamal Hadi Salim @ 2014-01-15 12:34 UTC (permalink / raw)
To: Cong Wang, netdev; +Cc: David S. Miller
In-Reply-To: <1389739694-9251-1-git-send-email-xiyou.wangcong@gmail.com>
On 01/14/14 17:48, Cong Wang wrote:
> In tcf_register_action() we check ->type and ->kind to see if there
> is an existing action registered, but ipt action registers two
> actions with same type but different kinds. This should be a valid
> case, otherwise only xt can be registered.
>
We can't allow conflicts by name or ID - we want to catch them.
So just introduce TCA_ACT_XT instead (ID 7)
[
Note: iptables used to be a constantly moving API target
and this is supposed to be the latest "backward compat mode".
New kernel/iproute ==> We want to love "xt" more than "ipt".
In fact, we want to eventually kill "ipt",
but this preference is hard to achieve, as you may have found.
I would be curious how you tested and ran into this.
].
cheers,
jamal
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Cc: David S. Miller <davem@davemloft.net>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>
> ---
> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> index 35f89e9..2070ee3 100644
> --- a/net/sched/act_api.c
> +++ b/net/sched/act_api.c
> @@ -273,7 +273,7 @@ int tcf_register_action(struct tc_action_ops *act)
>
> write_lock(&act_mod_lock);
> list_for_each_entry(a, &act_base, head) {
> - if (act->type == a->type || (strcmp(act->kind, a->kind) == 0)) {
> + if (act->type == a->type && (strcmp(act->kind, a->kind) == 0)) {
> write_unlock(&act_mod_lock);
> return -EEXIST;
> }
>
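The point made in the reply above, that a new action must be rejected when it collides with a registered one on either type or kind, can be sketched in isolation. This is a simplified model, not the act_api.c code: struct act, conflicts_any and conflicts_both are hypothetical names, and any IDs used with it are arbitrary.

```c
#include <stdbool.h>
#include <string.h>

/* Simplified stand-in for a registered action (hypothetical). */
struct act {
	int type;
	const char *kind;
};

/* Existing check: a match on type OR on kind counts as a conflict. */
static bool conflicts_any(const struct act *a, const struct act *b)
{
	return a->type == b->type || strcmp(a->kind, b->kind) == 0;
}

/* Check from the patch: only a match on type AND kind is a conflict. */
static bool conflicts_both(const struct act *a, const struct act *b)
{
	return a->type == b->type && strcmp(a->kind, b->kind) == 0;
}
```

Two entries that share a type but differ in kind pass the second check, which is what the patch wants. So do two entries that share a kind but differ in type, and that is the sort of collision the reply wants to keep catching.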
* Aw: Re: Route exceptions for IPv6 routes?
From: Simon Schneider @ 2014-01-15 12:25 UTC (permalink / raw)
To: Hannes Frederic Sowa; +Cc: netdev
In-Reply-To: <20140115115115.GD19945@order.stressinduktion.org>
Hello Hannes,
thanks for the answer.
Sorry, I'm not so familiar with the implementation details.
What are nh-exceptions?
I understand the current IPv6 implementation has some performance drawbacks. Correct?
I'm especially interested in
a) support for path MTUs
b) IPv6 policy-based routing (can I expect the same results from "ip -6 rule" commands that I get from "ip rule" commands?)
Are these basically supported (maybe not with optimal performance) ?
best regards, Simon
Sent: Wednesday, 15 January 2014 at 12:51
From: "Hannes Frederic Sowa" <hannes@stressinduktion.org>
To: "Simon Schneider" <simon-schneider@gmx.net>
Cc: netdev@vger.kernel.org
Subject: Re: Route exceptions for IPv6 routes?
Hi!
On Wed, Jan 15, 2014 at 09:01:22AM +0100, Simon Schneider wrote:
> I learned that the routing cache was removed from the kernel for several reasons.
>
> Some functions have been replaced with the route exceptions, e.g. storing the path MTU.
>
> My question: is this valid for both IPv4 as well as IPv6 routing, i.e. do the route exceptions work in the same way for IPv6 routes as they work for IPv4 routes?
No, situation in IPv6 land is not so good.
Currently as soon as a destination (or destination + source in case of
subtrees are in use) is resolved the routing entry is cloned and stored back
into the same trie with RTF_CACHE flag.
There are no nh-exceptions and there is no aggressive sharing taking place to
try to reduce the number of exceptions. This is especially bad for forwarding
setups, but seems to work fine for most people currently. ;)
Actually, implementing this is part of the work I am currently doing
but this needs still time until I can manage to propose this for upstream.
Greetings,
Hannes
* Re: [PATCH net-next 2/2] net/mlx4: Revert "mlx4: set maximal number of default RSS queues"
From: Ido Shamai @ 2014-01-15 12:15 UTC (permalink / raw)
To: Yuval Mintz, Or Gerlitz, Or Gerlitz
Cc: Amir Vadai, David S. Miller, netdev@vger.kernel.org,
Eugenia Emantayev, Ido Shamay
In-Reply-To: <979A8436335E3744ADCD3A9F2A2B68A52AF21E1F@SJEXCHMB10.corp.ad.broadcom.com>
On 1/2/2014 12:27 PM, Yuval Mintz wrote:
>>>> Going back to your original commit 16917b87a "net-next: Add
>>>> netif_get_num_default_rss_queues" I am still not clear why we want
>>>>
>>>> 1. why we want a common default to all MQ devices?
>>> Although networking benefits from multiple Interrupt vectors
>>> (enabling more rings, better performance, etc.), bounding this
>>> number only to the number of cpus is unreasonable as it strains
>>> system resources; e.g., consider a 40-cpu server - we might wish
>>> to have 40 vectors per device, but that means that connecting
>>> several devices to the same server might cause other functions
>>> to fail probe as they will no longer be able to acquire interrupt
>>> vectors of their own.
>>
>> Modern servers which have tens of CPUs typically have thousands of MSI-X
>> vectors which means you should be easily able to plug four cards into a
>> server with 64 cores which will consume 256 out of the 1-4K vectors out
>> there. Anyway, let me continue your approach - how about raising the
>> default hard limit to 16 or having it as the number of cores @ the numa
>> node where the card is plugged?
>
> I think an additional issue was memory consumption -
> additional interrupts --> additional allocated memory (for Rx rings).
> And I do know the issues were real - we've had complains about devices
> failing to load due to lack of resources (not all servers in the world are
> top of the art).
>
> Anyway, I believe 8/16 are simply strict limitations without any true meaning;
> To judge what's more important, default `slimness' or default performance
> is beyond me.
> Perhaps the numa approach will prove beneficial (and will make some sense).
After reviewing all that was said, I feel there is no need to force this
strict limitation, which has no true meaning, on vendors.
The commit being reverted, which you applied, forces the driver to use at
most 8 rings at all times, with no possibility of changing this on the fly
using ethtool, as it's enforced on the PCI driver at module init (restarting
the en driver with a different number of requested rings has no effect).
So the revert is crucial for performance-oriented applications using mlx4_en.
Going through all the Ethernet vendors, I don't see this limitation enforced,
so it has no true meaning (no fairness).
I think this patch should go in as is.
Ethernet vendors should use this limitation when they desire.
Ido
> Thanks,
> Yuval
* Re: throughput problems with realtek
From: Dmitry Kasatkin @ 2014-01-15 12:04 UTC (permalink / raw)
To: nic_swsd, romieu, netdev; +Cc: l.moiseichuk
In-Reply-To: <CACE9dm_3jw08_dfXRJRMQ=r4X1NZ1kHF6TZopFSNy3k+DCKgTA@mail.gmail.com>
Forgot to mention: I am running Ubuntu 13.10 with the 3.11.0-15 kernel.
On Wed, Jan 15, 2014 at 1:56 PM, Dmitry Kasatkin
<dmitry.kasatkin@gmail.com> wrote:
> Hi,
>
> We have several devices with such adapter..
>
> Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411
> PCI Express Gigabit Ethernet Controller (rev 06)
> See output of the lspci -vvv bellow...
>
> And I suddenly investigated throughput issues..
>
> After couple minutes of running 'iperf -c server' transmission speed
> drops substantially...
>
> [ 4] 0.0-10.0 sec 1.10 GBytes 948 Mbits/sec
> [ 5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60508
> [ 5] 0.0-10.0 sec 1.10 GBytes 948 Mbits/sec
> [ 4] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60509
> [ 4] 0.0-10.0 sec 1.10 GBytes 949 Mbits/sec
> [ 5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60510
> [ 5] 0.0-10.0 sec 1.10 GBytes 948 Mbits/sec
> [ 4] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60511
> [ 4] 0.0-10.0 sec 626 MBytes 525 Mbits/sec
> [ 5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60512
> [ 5] 0.0-10.0 sec 84.4 MBytes 70.5 Mbits/sec
> [ 4] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60513
> [ 4] 0.0-10.0 sec 87.4 MBytes 73.0 Mbits/sec
> [ 5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60514
>
>
> But it seems after certain time of inactivity (low load) speed will be
> up again...
>
> It happens almost the same way on desktop machines and also on Samsung
> Series 7 laptop NP770Z5E...
>
> Does anyone have any ideas about it?
>
> --
> Thanks,
> Dmitry
>
> ---------------------------
>
> 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
> Subsystem: Samsung Electronics Co Ltd Device c0e6
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 45
> Region 0: I/O ports at d000 [size=256]
> Region 2: Memory at f0004000 (64-bit, prefetchable) [size=4K]
> Region 4: Memory at f0000000 (64-bit, prefetchable) [size=16K]
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
> Address: 00000000fee00338 Data: 0000
> Capabilities: [70] Express (v2) Endpoint, MSI 01
> DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> MaxPayload 128 bytes, MaxReadReq 4096 bytes
> DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
> LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0
> unlimited, L1 <64us
> ClockPM+ Surprise- LLActRep- BwNot-
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive-
> BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
> LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-,
> Selectable De-emphasis: -6dB
> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-,
> EqualizationPhase1-
> EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
> Vector table: BAR=4 offset=00000000
> PBA: BAR=4 offset=00000800
> Capabilities: [d0] Vital Product Data
> No end tag found
> Capabilities: [100 v1] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
> MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
> MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+
> MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> Capabilities: [140 v1] Virtual Channel
> Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
> Arb: Fixed- WRR32- WRR64- WRR128-
> Ctrl: ArbSelect=Fixed
> Status: InProgress-
> VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
> Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
> Status: NegoPending- InProgress-
> Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
> Kernel driver in use: r8169
--
Thanks,
Dmitry
* throughput problems with realtek
From: Dmitry Kasatkin @ 2014-01-15 11:56 UTC (permalink / raw)
To: nic_swsd, romieu, netdev; +Cc: l.moiseichuk
Hi,
We have several devices with this adapter:
Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411
PCI Express Gigabit Ethernet Controller (rev 06)
See the output of lspci -vvv below.
I have been investigating throughput issues.
After a couple of minutes of running 'iperf -c server', transmission speed
drops substantially:
[ 4] 0.0-10.0 sec 1.10 GBytes 948 Mbits/sec
[ 5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60508
[ 5] 0.0-10.0 sec 1.10 GBytes 948 Mbits/sec
[ 4] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60509
[ 4] 0.0-10.0 sec 1.10 GBytes 949 Mbits/sec
[ 5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60510
[ 5] 0.0-10.0 sec 1.10 GBytes 948 Mbits/sec
[ 4] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60511
[ 4] 0.0-10.0 sec 626 MBytes 525 Mbits/sec
[ 5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60512
[ 5] 0.0-10.0 sec 84.4 MBytes 70.5 Mbits/sec
[ 4] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60513
[ 4] 0.0-10.0 sec 87.4 MBytes 73.0 Mbits/sec
[ 5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60514
But it seems that after a certain time of inactivity (low load) the speed
goes back up again.
It happens in almost the same way on desktop machines and also on a Samsung
Series 7 laptop (NP770Z5E).
Does anyone have any ideas about it?
--
Thanks,
Dmitry
---------------------------
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
Subsystem: Samsung Electronics Co Ltd Device c0e6
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 45
Region 0: I/O ports at d000 [size=256]
Region 2: Memory at f0004000 (64-bit, prefetchable) [size=4K]
Region 4: Memory at f0000000 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00338 Data: 0000
Capabilities: [70] Express (v2) Endpoint, MSI 01
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0
unlimited, L1 <64us
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive-
BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-,
Selectable De-emphasis: -6dB
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-,
EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
Vector table: BAR=4 offset=00000000
PBA: BAR=4 offset=00000800
Capabilities: [d0] Vital Product Data
No end tag found
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+
MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [140 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
Kernel driver in use: r8169
* Re: [Xen-devel] [PATCH net-next] xen-netfront: clean up code in xennet_release_rx_bufs
From: David Vrabel @ 2014-01-15 11:52 UTC (permalink / raw)
To: Wei Liu; +Cc: Annie Li, xen-devel, netdev, ian.campbell
In-Reply-To: <20140115114208.GK5698@zion.uk.xensource.com>
On 15/01/14 11:42, Wei Liu wrote:
> On Wed, Jan 15, 2014 at 11:20:49AM +0000, David Vrabel wrote:
>> On 09/01/14 22:48, Annie Li wrote:
>>> Current netfront only grants pages for grant copy, not for grant transfer, so
>>> remove corresponding transfer code and add receiving copy code in
>>> xennet_release_rx_bufs.
>>
>> While netfront only supports a copying backend, I don't see anything
>> preventing the backend from retaining mappings to netfront's Rx buffers...
>>
>
> Correct.
>
>>> Signed-off-by: Annie Li <Annie.li@oracle.com>
>>> ---
>>> drivers/net/xen-netfront.c | 60 ++-----------------------------------------
>>> 1 files changed, 3 insertions(+), 57 deletions(-)
>>>
>>> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
>>> index e59acb1..692589e 100644
>>> --- a/drivers/net/xen-netfront.c
>>> +++ b/drivers/net/xen-netfront.c
>>> @@ -1134,78 +1134,24 @@ static void xennet_release_tx_bufs(struct netfront_info *np)
>>>
>>> static void xennet_release_rx_bufs(struct netfront_info *np)
>>> {
>> [...]
>>> - mfn = gnttab_end_foreign_transfer_ref(ref);
>>> + gnttab_end_foreign_access_ref(ref, 0);
>>
>> ... the gnttab_end_foreign_access_ref() may then fail and...
>>
>
> Oh, I see. Andrew was actually referencing this function. Yes, it can
> fail. Since he omitted "_ref" I looked at the other function when I
> replied to him...
>
>>> gnttab_release_grant_reference(&np->gref_rx_head, ref);
>>> np->grant_rx_ref[id] = GRANT_INVALID_REF;
>> [...]
>>> + kfree_skb(skb);
>>
>> ... this could then potentially free pages that the backend still has
>> mapped. If the pages are then reused, this would leak information to
>> the backend.
>>
>> Since only a buggy backend would result in this, leaking the skbs and
>> grant refs would be acceptable here. I would also print an error.
>>
>
> How about using gnttab_end_foreign_access. The deferred queue looks like
> a right solution -- pending page won't get freed until gref is
> quiescent.
This is closer to the correct approach, but I don't think it is quite
right yet. The skb owns the pages, so we don't want
gnttab_end_foreign_access() to free them, as freeing the skb will attempt
to free them again.
Having gnttab_end_foreign_access() do a free just looks odd to me; the
free isn't paired with any alloc in the grant table code.
It seems more logical to me that granting access takes an additional
page ref, and then ending access releases that ref.
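The pairing suggested here can be illustrated with a toy reference count. This is only a sketch under assumptions: page_model, grant_access and end_access are hypothetical names and do not correspond to the grant table API; the point is just that the grant holds its own reference, so freeing the skb and ending access can happen in either order without a double free.

```c
#include <stdbool.h>

/* Hypothetical page with a plain reference count. */
struct page_model {
	int refcount;
	bool freed;
};

static void page_get(struct page_model *p)
{
	p->refcount++;
}

static void page_put(struct page_model *p)
{
	if (--p->refcount == 0)
		p->freed = true;	/* stands in for the real free */
}

/* Granting access takes its own reference on the page. */
static void grant_access(struct page_model *p)
{
	page_get(p);
}

/* Ending access drops only the reference the grant took. */
static void end_access(struct page_model *p)
{
	page_put(p);
}
```

If the owner starts with one reference, grants access, and then drops its own reference, the page survives until end_access() releases the last one.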
David
* Re: Route exceptions for IPv6 routes?
From: Hannes Frederic Sowa @ 2014-01-15 11:51 UTC (permalink / raw)
To: Simon Schneider; +Cc: netdev
In-Reply-To: <trinity-d0560a1c-2418-449e-b478-ebc707cea8e5-1389772882296@3capp-gmx-bs10>
Hi!
On Wed, Jan 15, 2014 at 09:01:22AM +0100, Simon Schneider wrote:
> I learned that the routing cache was removed from the kernel for several reasons.
>
> Some functions have been replaced with the route exceptions, e.g. storing the path MTU.
>
> My question: is this valid for both IPv4 as well as IPv6 routing, i.e. do the route exceptions work in the same way for IPv6 routes as they work for IPv4 routes?
No, the situation in IPv6 land is not so good.
Currently, as soon as a destination (or destination + source if
subtrees are in use) is resolved, the routing entry is cloned and stored back
into the same trie with the RTF_CACHE flag.
There are no nh-exceptions and there is no aggressive sharing taking place to
try to reduce the number of exceptions. This is especially bad for forwarding
setups, but seems to work fine for most people currently. ;)
Actually, implementing this is part of the work I am currently doing,
but it still needs time before I can propose it for upstream.
Greetings,
Hannes
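The clone-on-resolve behavior described in this message can be modelled roughly as follows. This is a sketch under assumptions, not the kernel's route code: route_model, clone_for_dst and DEMO_RTF_CACHE are hypothetical, the flag value is illustrative, and treating the clone as a /128 host entry is an assumption of the sketch.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative flag value only; not the kernel's definition. */
#define DEMO_RTF_CACHE 0x1u

/* Hypothetical, heavily reduced routing entry. */
struct route_model {
	uint8_t dst[16];	/* destination prefix or address */
	unsigned int plen;	/* prefix length */
	uint32_t flags;
};

/*
 * Clone the matched route into a per-destination entry marked as a
 * cache entry. The caller would store the result back into the same
 * table the original came from; the original is left untouched.
 */
static struct route_model clone_for_dst(const struct route_model *rt,
					const uint8_t dst[16])
{
	struct route_model c = *rt;

	memcpy(c.dst, dst, sizeof(c.dst));
	c.plen = 128;			/* assumed: one entry per destination */
	c.flags |= DEMO_RTF_CACHE;
	return c;
}
```

Every resolved destination produces one such clone, which is why the number of entries grows quickly on forwarding setups.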
* Re: unable to send TCP SYNs from ports below 1024
From: Hannes Frederic Sowa @ 2014-01-15 11:47 UTC (permalink / raw)
To: Stuart Kendrick; +Cc: netdev
In-Reply-To: <CAACXELm1Yfv+=oC-N+nnR7apHc+gtZSVpqSbX2Rtkiq6-Y1QtA@mail.gmail.com>
On Tue, Jan 14, 2014 at 05:37:03PM -0800, Stuart Kendrick wrote:
> VALIDATION
> I verify the problem by using netcat plus two instances of tcpdump,
> one running on the box itself, the other running on a second box
> plugged into a SPAN port on the local Ethernet switch
This sounds strange. Could you try skipping the switch and plugging the
boxes together directly? Maybe it is a switch policy? Switches can filter
such things nowadays.
Greetings,
Hannes
* Re: [PATCH net-next] xen-netback: Rework rx_work_todo
From: Zoltan Kiss @ 2014-01-15 11:47 UTC (permalink / raw)
To: Wei Liu; +Cc: ian.campbell, xen-devel, netdev, linux-kernel, jonathan.davies
In-Reply-To: <20140115103707.GI5698@zion.uk.xensource.com>
On 15/01/14 10:37, Wei Liu wrote:
> On Tue, Jan 14, 2014 at 07:28:39PM +0000, Zoltan Kiss wrote:
>> The recent patch to fix receive side flow control (11b57f) solved the spinning
>> thread problem, however caused an another one. The receive side can stall, if:
>> - xenvif_rx_action sets rx_queue_stopped to false
>> - interrupt happens, and sets rx_event to true
>> - then xenvif_kthread sets rx_event to false
>>
>
> If you mean "rx_work_todo" returns false.
>
> In this case
>
> (!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) || vif->rx_event;
>
> can still be true, can't it?
Sorry, I should have written that it sets rx_queue_stopped to true.
>
>> Also, through rx_event a malicious guest can force the RX thread to spin. This
>> patch ditch that two variable, and rework rx_work_todo. If the thread finds it
>
> This seems to be a bigger problem. Can you elaborate?
My mistake too. I forgot that rx_action sets it to false, so it's not
really spinning. However, the thread still has to run xenvif_rx_action
to figure out that there is no space in the ring before it sets rx_event to
false. With my patch we can quit earlier.
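The stall discussed in this message can be checked against the condition quoted above. Below is a minimal sketch, assuming the vif state can be reduced to three booleans; struct vif_model is hypothetical, and the real xen-netback code tests an skb queue rather than a flag.

```c
#include <stdbool.h>

/* Hypothetical reduction of the vif state to three booleans. */
struct vif_model {
	bool rx_queue_empty;
	bool rx_queue_stopped;
	bool rx_event;
};

/* Mirrors the condition quoted earlier in this message. */
static bool rx_work_todo(const struct vif_model *vif)
{
	return (!vif->rx_queue_empty && !vif->rx_queue_stopped) ||
	       vif->rx_event;
}
```

With packets queued, rx_queue_stopped set to true and rx_event cleared, the predicate returns false, so nothing wakes the thread even though there is work pending.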
Zoli
* Re: [PATCH net-next] IPv6: add option to use anycast addresses as source addresses in icmp error messages
From: Hannes Frederic Sowa @ 2014-01-15 11:44 UTC (permalink / raw)
To: François-Xavier Le Bail
Cc: netdev, Bill Fink, David S. Miller, Alexey Kuznetsov,
James Morris, Hideaki Yoshifuji, Patrick McHardy
In-Reply-To: <1389779163.69122.YahooMailBasic@web125504.mail.ne1.yahoo.com>
Hi!
On Wed, Jan 15, 2014 at 01:46:03AM -0800, François-Xavier Le Bail wrote:
> On Tue, Jan 14, 2014 at 02:13:44PM +0100, Hannes Frederic Sowa wrote:
> > > On Mon, Jan 13, 2014 at 06:22:44PM +0100, Francois-Xavier Le Bail wrote:
> > > > - Add "anycast_src_icmp_error" sysctl to control the use of anycast addresses
> > > > as source addresses for ICMPv6 error messages. This sysctl is false by
> > > > default to preserve existing behavior.
> > > > - Use it in icmp6_send().
> > > >
> > > > Suggested-by: Bill Fink <billfink@mindspring.com>
> > > > Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
> > >
> > > Regarding the anycast patches, I contacted someone from IETF.
> > >
> > > The number of sysctls needed to get introduced to have all the flexibility
> > > regarding source address selection and don't break backward compatibility
> > > concerns me a bit.
> > >
> > > Especially on end hosts, where those switches will be important, I think we
> > > really have to think about sensible defaults without breaking current
> > > software.
> > >
> > > I currently consider a per-address flag, if those anycast addresses
> > > should be available in source address selection (also with an enhancement to
> > > current IPV6_JOIN_ANYCAST logic).
> >
> > Francois, we should really think about this. Also if we should just
> > make the pre-defined subnet address just a normal anycast address in the
> > long-term (which just happens to get automatically added to an interface
> > if forwarding is enabled) and bundle all the source address selection
> > logic on the per-address state.
>
> Please submit patches with your solution, so that we can have a basis
> for discussion.
I won't have time for that in the next few weeks, and this is not at the top
of my TODO list, I fear :/ (I'll see what I can do).
Basically one would have to first start with address configuration support for
IPv6 and then add a flag to ifa_flags (damn is IPv6 getting complex) so one
could say
ip -6 a c fe80:: dev eth0 anycast anycast_pref
One easy thing would be to add this flag to the routing entries, but we may
run into problems with limited flag-store-space there, too:
So something like this would be possible:
bool ipv6_use_anycast_addr(struct rt6_info *rt)
{
if ((rt->rt6i_flags & (RTF_ANYCAST|RTF_ANYCAST_PREF)) == (RTF_ANYCAST|RTF_ANYCAST_PREF))
return true;
return false;
}
It seems you may eat a bit in the bit space of the generic RTF_ flags and use
flags to 32k (so adding in front of RTF_DEFAULT).
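To illustrate the both-bits-set test from the snippet above, here is a minimal
userspace sketch. RTF_ANYCAST_PREF does not exist in the kernel; the DEMO_
names, the stand-in struct and the bit values are assumptions made for
illustration only (0x8000 follows the "32k" remark). Note that == binds
tighter than & in C, so the masked value needs its own parentheses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; DEMO_RTF_ANYCAST_PREF is a hypothetical flag. */
#define DEMO_RTF_ANYCAST      0x00100000u
#define DEMO_RTF_ANYCAST_PREF 0x00008000u /* hypothetical bit at 32k */

/* Stand-in for the single field of struct rt6_info used here. */
struct demo_rt6_info {
	uint32_t rt6i_flags;
};

/* True only when both the anycast bit and the preference bit are set. */
bool demo_use_anycast_addr(const struct demo_rt6_info *rt)
{
	const uint32_t want = DEMO_RTF_ANYCAST | DEMO_RTF_ANYCAST_PREF;

	return (rt->rt6i_flags & want) == want;
}

/* Small self-check covering the three interesting flag combinations. */
void demo_anycast_check(void)
{
	struct demo_rt6_info both = {
		.rt6i_flags = DEMO_RTF_ANYCAST | DEMO_RTF_ANYCAST_PREF
	};
	struct demo_rt6_info anycast_only = { .rt6i_flags = DEMO_RTF_ANYCAST };
	struct demo_rt6_info none = { .rt6i_flags = 0 };

	assert(demo_use_anycast_addr(&both));
	assert(!demo_use_anycast_addr(&anycast_only));
	assert(!demo_use_anycast_addr(&none));
}
```

Comparing against the full mask (rather than testing for non-zero) is what
keeps a plain anycast route without the preference bit from matching.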
We could also link this flag to conditionally emit TCP-RSTs and ICMP error
messages with help of this flag.
Actually I don't like the solution with the rt6i_flags that much; I
would rather have this in ifacaddr6 only. But lookup times will be
slower then. Don't know yet.
So we would have to tackle this problem from the other direction and
implement proper anycast management via iproute first, and then alter the
source address selection policies if we would go with something like that.
Maybe anycast_pref is a bad name, anycast_reply or anycast_use_src would be
better.
> > If that would be the case, we could revert
> > 509aba3b0d366b7f16a9a2eebac1156b25f5f622 ("IPv6: add the option to use
> > anycast addresses as source addresses in echo reply") and thus would
> > eliminate one sysctl.
>
> If your solution achieves the same goal without this sysctl, I agree with you.
I think it does, what do you think?
> > It would be fine if we can make this decision before David merges with
> > Linus. I guess we can still make this decision while in -rc phase. But
> > as soon as the knob is in a released version of linux we can never take
> > it back (I really don't like sysctls).
>
> Sure.
Greetings,
Hannes
* Re: [Xen-devel] [PATCH net-next] xen-netfront: clean up code in xennet_release_rx_bufs
From: Wei Liu @ 2014-01-15 11:42 UTC (permalink / raw)
To: David Vrabel; +Cc: Annie Li, xen-devel, netdev, wei.liu2, ian.campbell
In-Reply-To: <52D66F11.204@citrix.com>
On Wed, Jan 15, 2014 at 11:20:49AM +0000, David Vrabel wrote:
> On 09/01/14 22:48, Annie Li wrote:
> > Current netfront only grants pages for grant copy, not for grant transfer, so
> > remove corresponding transfer code and add receiving copy code in
> > xennet_release_rx_bufs.
>
> While netfront only supports a copying backend, I don't see anything
> preventing the backend from retaining mappings to netfront's Rx buffers...
>
Correct.
> > Signed-off-by: Annie Li <Annie.li@oracle.com>
> > ---
> > drivers/net/xen-netfront.c | 60 ++-----------------------------------------
> > 1 files changed, 3 insertions(+), 57 deletions(-)
> >
> > diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> > index e59acb1..692589e 100644
> > --- a/drivers/net/xen-netfront.c
> > +++ b/drivers/net/xen-netfront.c
> > @@ -1134,78 +1134,24 @@ static void xennet_release_tx_bufs(struct netfront_info *np)
> >
> > static void xennet_release_rx_bufs(struct netfront_info *np)
> > {
> [...]
> > - mfn = gnttab_end_foreign_transfer_ref(ref);
> > + gnttab_end_foreign_access_ref(ref, 0);
>
> ... the gnttab_end_foreign_access_ref() may then fail and...
>
Oh, I see. Andrew was actually referencing this function. Yes, it can
fail. Since he omitted "_ref" I looked at the other function when I
replied to him...
> > gnttab_release_grant_reference(&np->gref_rx_head, ref);
> > np->grant_rx_ref[id] = GRANT_INVALID_REF;
> [...]
> > + kfree_skb(skb);
>
> ... this could then potentially free pages that the backend still has
> mapped. If the pages are then reused, this would leak information to
> the backend.
>
> Since only a buggy backend would result in this, leaking the skbs and
> grant refs would be acceptable here. I would also print an error.
>
How about using gnttab_end_foreign_access()? The deferred queue looks like
the right solution -- a pending page won't get freed until the gref is
quiescent.
Wei.
> While checking blkfront for how it handles this, it also doesn't appear
> to do the right thing either.
>
> David
* [PATCH V2 net-next 3/3] ipv6: add ip6_flowlabel_consistency sysctl
From: Florent Fourcot @ 2014-01-15 11:30 UTC (permalink / raw)
To: netdev; +Cc: Florent Fourcot
In-Reply-To: <1389785403-6401-1-git-send-email-florent.fourcot@enst-bretagne.fr>
With the introduction of IPV6_FL_F_REFLECT, there is no guarantee of
flow label unicity. This patch introduces a new sysctl to protect the old
behaviour, enabled by default.
Changelog of the V2:
* Remove useless hunk in sysctl_binary.c
* Rebase on net-next
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
---
Documentation/networking/ip-sysctl.txt | 8 ++++++++
include/net/netns/ipv6.h | 1 +
net/ipv6/af_inet6.c | 1 +
net/ipv6/ip6_flowlabel.c | 7 +++++++
net/ipv6/sysctl_net_ipv6.c | 8 ++++++++
5 files changed, 25 insertions(+)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index c97932c..7453640 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1118,6 +1118,14 @@ bindv6only - BOOLEAN
Default: FALSE (as specified in RFC3493)
+ip6_flowlabel_consistency - BOOLEAN
+ Protect the consistency (and unicity) of flow label.
+ You have to disable it to use IPV6_FL_F_REFLECT flag on the
+ flow label manager.
+ TRUE: enabled
+ FALSE: disabled
+ Default: TRUE
+
anycast_src_echo_reply - BOOLEAN
Controls the use of anycast addresses as source addresses for ICMPv6
echo reply
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 76fc7d1..3cc291b 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -27,6 +27,7 @@ struct netns_sysctl_ipv6 {
int ip6_rt_gc_elasticity;
int ip6_rt_mtu_expires;
int ip6_rt_min_advmss;
+ int ip6_flowlabel_consistency;
int icmpv6_time;
};
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index c921d5d..943c796 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -775,6 +775,7 @@ static int __net_init inet6_net_init(struct net *net)
net->ipv6.sysctl.bindv6only = 0;
net->ipv6.sysctl.icmpv6_time = 1*HZ;
+ net->ipv6.sysctl.ip6_flowlabel_consistency = 1;
atomic_set(&net->ipv6.rt_genid, 0);
err = ipv6_init_mibs(net);
diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
index 2c0f9dc..85f0453 100644
--- a/net/ipv6/ip6_flowlabel.c
+++ b/net/ipv6/ip6_flowlabel.c
@@ -587,8 +587,15 @@ int ipv6_flowlabel_opt(struct sock *sk, char __user *optval, int optlen)
case IPV6_FL_A_GET:
if (freq.flr_flags & IPV6_FL_F_REFLECT) {
+ struct net *net = sock_net(sk);
+ if (net->ipv6.sysctl.ip6_flowlabel_consistency) {
+ pr_info("Can not set IPV6_FL_F_REFLECT if ip6_flowlabel_consistency sysctl is enabled\n");
+ return -EPERM;
+ }
+
if (sk->sk_protocol != IPPROTO_TCP)
return -ENOPROTOOPT;
+
np->repflow = 1;
return 0;
}
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index 6b6a2c8..8c99cf0 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -31,6 +31,13 @@ static struct ctl_table ipv6_table_template[] = {
.mode = 0644,
.proc_handler = proc_dointvec
},
+ {
+ .procname = "ip6_flowlabel_consistency",
+ .data = &init_net.ipv6.sysctl.ip6_flowlabel_consistency,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
{ }
};
@@ -59,6 +66,7 @@ static int __net_init ipv6_sysctl_net_init(struct net *net)
goto out;
ipv6_table[0].data = &net->ipv6.sysctl.bindv6only;
ipv6_table[1].data = &net->ipv6.anycast_src_echo_reply;
+ ipv6_table[2].data = &net->ipv6.sysctl.ip6_flowlabel_consistency;
ipv6_route_table = ipv6_route_sysctl_init(net);
if (!ipv6_route_table)
--
1.8.5.2
* [PATCH V2 net-next 1/3] ipv6: add the IPV6_FL_F_REFLECT flag to IPV6_FL_A_GET
From: Florent Fourcot @ 2014-01-15 11:30 UTC (permalink / raw)
To: netdev; +Cc: Florent Fourcot
With this option, the socket will reply with the flow label value read
on received packets.
The goal is to have a connection with the same flow label in both
directions of the communication.
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
---
include/linux/ipv6.h | 1 +
include/uapi/linux/in6.h | 1 +
net/ipv6/ip6_flowlabel.c | 21 +++++++++++++++++++++
net/ipv6/tcp_ipv6.c | 10 ++++++++++
4 files changed, 33 insertions(+)
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 7e1ded0..1084304 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -191,6 +191,7 @@ struct ipv6_pinfo {
/* sockopt flags */
__u16 recverr:1,
sndflow:1,
+ repflow:1,
pmtudisc:3,
ipv6only:1,
srcprefs:3, /* 001: prefer temporary address
diff --git a/include/uapi/linux/in6.h b/include/uapi/linux/in6.h
index f94f1d0..a4359b1 100644
--- a/include/uapi/linux/in6.h
+++ b/include/uapi/linux/in6.h
@@ -85,6 +85,7 @@ struct in6_flowlabel_req {
#define IPV6_FL_F_CREATE 1
#define IPV6_FL_F_EXCL 2
+#define IPV6_FL_F_REFLECT 4
#define IPV6_FL_S_NONE 0
#define IPV6_FL_S_EXCL 1
diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
index e7fb710..ba23643 100644
--- a/net/ipv6/ip6_flowlabel.c
+++ b/net/ipv6/ip6_flowlabel.c
@@ -486,6 +486,11 @@ int ipv6_flowlabel_opt_get(struct sock *sk, struct in6_flowlabel_req *freq)
struct ipv6_pinfo *np = inet6_sk(sk);
struct ipv6_fl_socklist *sfl;
+ if (np->repflow) {
+ freq->flr_label = np->flow_label;
+ return 0;
+ }
+
rcu_read_lock_bh();
for_each_sk_fl_rcu(np, sfl) {
@@ -527,6 +532,15 @@ int ipv6_flowlabel_opt(struct sock *sk, char __user *optval, int optlen)
switch (freq.flr_action) {
case IPV6_FL_A_PUT:
+ if (freq.flr_flags & IPV6_FL_F_REFLECT) {
+ if (sk->sk_protocol != IPPROTO_TCP)
+ return -ENOPROTOOPT;
+ if (!np->repflow)
+ return -ESRCH;
+ np->flow_label = 0;
+ np->repflow = 0;
+ return 0;
+ }
spin_lock_bh(&ip6_sk_fl_lock);
for (sflp = &np->ipv6_fl_list;
(sfl = rcu_dereference(*sflp))!=NULL;
@@ -567,6 +581,13 @@ int ipv6_flowlabel_opt(struct sock *sk, char __user *optval, int optlen)
return -ESRCH;
case IPV6_FL_A_GET:
+ if (freq.flr_flags & IPV6_FL_F_REFLECT) {
+ if (sk->sk_protocol != IPPROTO_TCP)
+ return -ENOPROTOOPT;
+ np->repflow = 1;
+ return 0;
+ }
+
if (freq.flr_label & ~IPV6_FLOWLABEL_MASK)
return -EINVAL;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index ffd5fa8..f61bedc 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -483,6 +483,8 @@ static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
&ireq->ir_v6_rmt_addr);
fl6->daddr = ireq->ir_v6_rmt_addr;
+ if (np->repflow)
+ fl6->flowlabel = np->flow_label;
skb_set_queue_mapping(skb, queue_mapping);
err = ip6_xmit(sk, skb, fl6, np->opt, np->tclass);
err = net_xmit_eval(err);
@@ -1000,6 +1002,8 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
ireq = inet_rsk(req);
ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
+ if (np->repflow)
+ np->flow_label = ip6_flowlabel(ipv6_hdr(skb));
if (!want_cookie || tmp_opt.tstamp_ok)
TCP_ECN_create_request(req, skb, sock_net(sk));
@@ -1138,6 +1142,8 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
newnp->mcast_oif = inet6_iif(skb);
newnp->mcast_hops = ipv6_hdr(skb)->hop_limit;
newnp->rcv_flowinfo = ip6_flowinfo(ipv6_hdr(skb));
+ if (np->repflow)
+ newnp->flow_label = ip6_flowlabel(ipv6_hdr(skb));
/*
* No need to charge this sock to the relevant IPv6 refcnt debug socks count
@@ -1218,6 +1224,8 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
newnp->mcast_oif = inet6_iif(skb);
newnp->mcast_hops = ipv6_hdr(skb)->hop_limit;
newnp->rcv_flowinfo = ip6_flowinfo(ipv6_hdr(skb));
+ if (np->repflow)
+ newnp->flow_label = ip6_flowlabel(ipv6_hdr(skb));
/* Clone native IPv6 options from listening socket (if any)
@@ -1429,6 +1437,8 @@ ipv6_pktoptions:
np->mcast_hops = ipv6_hdr(opt_skb)->hop_limit;
if (np->rxopt.bits.rxflow || np->rxopt.bits.rxtclass)
np->rcv_flowinfo = ip6_flowinfo(ipv6_hdr(opt_skb));
+ if (np->repflow)
+ np->flow_label = ip6_flowlabel(ipv6_hdr(opt_skb));
if (ipv6_opt_accepted(sk, opt_skb)) {
skb_set_owner_r(opt_skb, sk);
opt_skb = xchg(&np->pktoptions, opt_skb);
--
1.8.5.2
* [PATCH V2 net-next 2/3] ipv6: add a flag to get the flow label used remotly
From: Florent Fourcot @ 2014-01-15 11:30 UTC (permalink / raw)
To: netdev; +Cc: Florent Fourcot
In-Reply-To: <1389785403-6401-1-git-send-email-florent.fourcot@enst-bretagne.fr>
This information is already available via IPV6_FLOWINFO
of IPV6_2292PKTOPTIONS, followed by filtering to extract the flow label
information. But it is probably logical and easier for users to add this
here, and to control both sent/received flow label values with the
IPV6_FLOWLABEL_MGR option.
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
---
include/net/ipv6.h | 2 +-
include/uapi/linux/in6.h | 1 +
net/ipv6/ip6_flowlabel.c | 7 ++++++-
net/ipv6/ipv6_sockglue.c | 5 ++++-
4 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 12079c6..54cb251 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -252,7 +252,7 @@ struct ipv6_txoptions *fl6_merge_options(struct ipv6_txoptions *opt_space,
struct ipv6_txoptions *fopt);
void fl6_free_socklist(struct sock *sk);
int ipv6_flowlabel_opt(struct sock *sk, char __user *optval, int optlen);
-int ipv6_flowlabel_opt_get(struct sock *sk, struct in6_flowlabel_req *freq);
+int ipv6_flowlabel_opt_get(struct sock *sk, struct in6_flowlabel_req *freq, int flags);
int ip6_flowlabel_init(void);
void ip6_flowlabel_cleanup(void);
diff --git a/include/uapi/linux/in6.h b/include/uapi/linux/in6.h
index a4359b1..2428b80 100644
--- a/include/uapi/linux/in6.h
+++ b/include/uapi/linux/in6.h
@@ -86,6 +86,7 @@ struct in6_flowlabel_req {
#define IPV6_FL_F_CREATE 1
#define IPV6_FL_F_EXCL 2
#define IPV6_FL_F_REFLECT 4
+#define IPV6_FL_F_REMOTE 8
#define IPV6_FL_S_NONE 0
#define IPV6_FL_S_EXCL 1
diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
index ba23643..2c0f9dc 100644
--- a/net/ipv6/ip6_flowlabel.c
+++ b/net/ipv6/ip6_flowlabel.c
@@ -481,11 +481,16 @@ static inline void fl_link(struct ipv6_pinfo *np, struct ipv6_fl_socklist *sfl,
spin_unlock_bh(&ip6_sk_fl_lock);
}
-int ipv6_flowlabel_opt_get(struct sock *sk, struct in6_flowlabel_req *freq)
+int ipv6_flowlabel_opt_get(struct sock *sk, struct in6_flowlabel_req *freq, int flags)
{
struct ipv6_pinfo *np = inet6_sk(sk);
struct ipv6_fl_socklist *sfl;
+ if (flags & IPV6_FL_F_REMOTE) {
+ freq->flr_label = np->rcv_flowinfo & IPV6_FLOWLABEL_MASK;
+ return 0;
+ }
+
if (np->repflow) {
freq->flr_label = np->flow_label;
return 0;
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index af0ecb9..a47653a 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -1220,6 +1220,7 @@ static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
case IPV6_FLOWLABEL_MGR:
{
struct in6_flowlabel_req freq;
+ int flags;
if (len < sizeof(freq))
return -EINVAL;
@@ -1231,9 +1232,11 @@ static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
return -EINVAL;
len = sizeof(freq);
+ flags = freq.flr_flags;
+
memset(&freq, 0, sizeof(freq));
- val = ipv6_flowlabel_opt_get(sk, &freq);
+ val = ipv6_flowlabel_opt_get(sk, &freq, flags);
if (val < 0)
return val;
--
1.8.5.2
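The lookup order that this patch gives ipv6_flowlabel_opt_get() can be
sketched in userspace as follows. This is only a model under stated
assumptions: the DEMO_ names and the stand-in struct are hypothetical, the
flag values are copied from the hunks above, the mask is assumed to be the low
20 bits in network byte order, and the walk over the socket's flow label list
is replaced by a plain -1 return.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

#define DEMO_FL_F_REFLECT   4 /* value from patch 1/3 */
#define DEMO_FL_F_REMOTE    8 /* value from this patch */
#define DEMO_FLOWLABEL_MASK htonl(0x000FFFFFu) /* assumed: low 20 bits */

/* Stand-in for the few ipv6_pinfo fields touched by the lookup. */
struct demo_pinfo {
	uint32_t rcv_flowinfo; /* last received flow info, network order */
	uint32_t flow_label;   /* label being reflected, network order */
	unsigned int repflow : 1;
};

/*
 * Returns 0 and stores a label, or -1 where the real function would go on
 * to walk the per-socket flow label list (omitted in this sketch).
 */
int demo_flowlabel_get(const struct demo_pinfo *np, int flags, uint32_t *label)
{
	if (flags & DEMO_FL_F_REMOTE) {
		/* Strip the traffic class bits, keep only the label. */
		*label = np->rcv_flowinfo & DEMO_FLOWLABEL_MASK;
		return 0;
	}
	if (np->repflow) {
		*label = np->flow_label;
		return 0;
	}
	return -1;
}

/* Shows that the REMOTE flag wins over an active reflect setting. */
void demo_flowlabel_check(void)
{
	struct demo_pinfo np = { .repflow = 1 };
	uint32_t label = 0;
	int rc;

	np.rcv_flowinfo = htonl(0x0ab12345u);
	np.flow_label = htonl(0x000abcdeu);

	rc = demo_flowlabel_get(&np, DEMO_FL_F_REMOTE, &label);
	assert(rc == 0 && label == htonl(0x00012345u));

	rc = demo_flowlabel_get(&np, 0, &label);
	assert(rc == 0 && label == htonl(0x000abcdeu));
}
```

Checking the REMOTE flag first means a caller can always read the peer's
label, whether or not reflection is enabled on the socket.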
* Re: [Xen-devel] [PATCH net-next] xen-netfront: clean up code in xennet_release_rx_bufs
From: David Vrabel @ 2014-01-15 11:20 UTC (permalink / raw)
To: Annie Li; +Cc: xen-devel, netdev, wei.liu2, ian.campbell
In-Reply-To: <1389307718-2845-1-git-send-email-Annie.li@oracle.com>
On 09/01/14 22:48, Annie Li wrote:
> Current netfront only grants pages for grant copy, not for grant transfer, so
> remove corresponding transfer code and add receiving copy code in
> xennet_release_rx_bufs.
While netfront only supports a copying backend, I don't see anything
preventing the backend from retaining mappings to netfront's Rx buffers...
> Signed-off-by: Annie Li <Annie.li@oracle.com>
> ---
> drivers/net/xen-netfront.c | 60 ++-----------------------------------------
> 1 files changed, 3 insertions(+), 57 deletions(-)
>
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index e59acb1..692589e 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -1134,78 +1134,24 @@ static void xennet_release_tx_bufs(struct netfront_info *np)
>
> static void xennet_release_rx_bufs(struct netfront_info *np)
> {
[...]
> - mfn = gnttab_end_foreign_transfer_ref(ref);
> + gnttab_end_foreign_access_ref(ref, 0);
... the gnttab_end_foreign_access_ref() may then fail and...
> gnttab_release_grant_reference(&np->gref_rx_head, ref);
> np->grant_rx_ref[id] = GRANT_INVALID_REF;
[...]
> + kfree_skb(skb);
... this could then potentially free pages that the backend still has
mapped. If the pages are then reused, this would leak information to
the backend.
Since only a buggy backend would result in this, leaking the skbs and
grant refs would be acceptable here. I would also print an error.
While checking blkfront for how it handles this, it also doesn't appear
to do the right thing either.
David
* Re: [Xen-devel] [PATCH net-next] xen-netfront: clean up code in xennet_release_rx_bufs
From: Wei Liu @ 2014-01-15 11:14 UTC (permalink / raw)
To: Andrew Bennieston; +Cc: Wei Liu, Annie Li, netdev, ian.campbell, xen-devel
In-Reply-To: <52D66ADF.9070401@citrix.com>
On Wed, Jan 15, 2014 at 11:02:55AM +0000, Andrew Bennieston wrote:
> On 15/01/14 10:07, Wei Liu wrote:
> >On Fri, Jan 10, 2014 at 06:48:38AM +0800, Annie Li wrote:
> >>Current netfront only grants pages for grant copy, not for grant transfer, so
> >>remove corresponding transfer code and add receiving copy code in
> >>xennet_release_rx_bufs.
> >>
> >
> >This path seldom gets called -- not that many people unload xen-netfront
> >driver. If Annie has tested this patch and it works as expected I think
> >it's fine.
> >
> In XenServer we have seen a number of cases where unplugging and
> replugging VIFs results in leakage of grant references, eventually
> leading to a case where you cannot plug a VIF (after ~ 400 such
> cycles)...
>
OK, this makes sense.
> It's worth pointing out, as far as this patch is concerned, that
> gnttab_end_foreign_access() can fail, which is not taken into
> account here.
>
How? gnttab_end_foreign_access doesn't return any error. The gref which
cannot be freed right away will be added to a deferred list and handled
later.
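The deferred-list behaviour described here can be modelled in userspace
roughly as below. This is a toy sketch, not the grant table code: every
demo_ name is hypothetical, a boolean stands in for "the backend still has
the page mapped", and malloc/free stand in for page handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* One granted buffer in this toy model. */
struct demo_gref {
	bool backend_mapped;             /* backend still holds a mapping */
	void *page;                      /* buffer shared through the grant */
	struct demo_gref *next_deferred; /* link in the deferred list */
};

static struct demo_gref *demo_deferred_head;

/* Free at once if quiescent, otherwise defer. Returns true if freed now. */
bool demo_end_foreign_access(struct demo_gref *g)
{
	if (!g->backend_mapped) {
		free(g->page);
		g->page = NULL;
		return true;
	}
	g->next_deferred = demo_deferred_head;
	demo_deferred_head = g;
	return false;
}

/* Retry deferred entries; returns how many pages were freed this pass. */
size_t demo_process_deferred(void)
{
	struct demo_gref **link = &demo_deferred_head;
	size_t freed = 0;

	while (*link) {
		struct demo_gref *g = *link;

		if (g->backend_mapped) {
			link = &g->next_deferred; /* still busy, keep it queued */
			continue;
		}
		*link = g->next_deferred;         /* unlink, then free */
		g->next_deferred = NULL;
		free(g->page);
		g->page = NULL;
		freed++;
	}
	return freed;
}

/* Walks one buffer through defer, retry while busy, retry once quiescent. */
void demo_deferred_check(void)
{
	struct demo_gref g = { .backend_mapped = true, .page = malloc(64) };
	bool freed_now;
	size_t freed_later;

	if (!g.page)
		return; /* allocation failed, nothing to show */

	freed_now = demo_end_foreign_access(&g);
	assert(!freed_now && g.page != NULL); /* still mapped: deferred */

	freed_later = demo_process_deferred();
	assert(freed_later == 0);             /* backend has not let go yet */

	g.backend_mapped = false;             /* backend unmaps */
	freed_later = demo_process_deferred();
	assert(freed_later == 1 && g.page == NULL);
}
```

The point of the model is that the page is never freed while the other side
can still reach it; the cost is that a misbehaving backend keeps entries on
the list indefinitely.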
Wei.
* Re: [Xen-devel] [PATCH net-next] xen-netfront: clean up code in xennet_release_rx_bufs
From: Andrew Bennieston @ 2014-01-15 11:02 UTC (permalink / raw)
To: Wei Liu, Annie Li; +Cc: netdev, ian.campbell, xen-devel
In-Reply-To: <20140115100743.GG5698@zion.uk.xensource.com>
On 15/01/14 10:07, Wei Liu wrote:
> On Fri, Jan 10, 2014 at 06:48:38AM +0800, Annie Li wrote:
>> Current netfront only grants pages for grant copy, not for grant transfer, so
>> remove corresponding transfer code and add receiving copy code in
>> xennet_release_rx_bufs.
>>
>
> This path seldom gets called -- not that many people unload xen-netfront
> driver. If Annie has tested this patch and it works as expected I think
> it's fine.
>
In XenServer we have seen a number of cases where unplugging and
replugging VIFs results in leakage of grant references, eventually
leading to a case where you cannot plug a VIF (after ~ 400 such cycles)...
It's worth pointing out, as far as this patch is concerned, that
gnttab_end_foreign_access() can fail, which is not taken into account here.
Andrew.
> I'm not netfront maintainer but I'm happy to add
> Acked-by: Wei Liu <wei.liu2@citrix.com>
> if Annie confirms she's tested this patch.
>
> Wei.
>
* Re: [PATCH net] bpf: do not use reciprocal divide
From: Heiko Carstens @ 2014-01-15 10:51 UTC (permalink / raw)
To: Martin Schwidefsky
Cc: Eric Dumazet, Hannes Frederic Sowa, netdev, dborkman,
darkjames-ws, Mircea Gherzan, Russell King, Matt Evans
In-Reply-To: <20140115091322.3a7740a7@mschwide>
On Wed, Jan 15, 2014 at 09:13:22AM +0100, Martin Schwidefsky wrote:
> On Wed, 15 Jan 2014 09:00:07 +0100
> Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
>
> > On Tue, Jan 14, 2014 at 11:02:41PM -0800, Eric Dumazet wrote:
> > > diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
> > > index 16871da37371..e349dc7d0992 100644
> > > --- a/arch/s390/net/bpf_jit_comp.c
> > > +++ b/arch/s390/net/bpf_jit_comp.c
> > > @@ -371,11 +371,11 @@ static int bpf_jit_insn(struct bpf_jit *jit, struct sock_filter *filter,
> > > /* dr %r4,%r12 */
> > > EMIT2(0x1d4c);
> > > break;
> > > - case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K) */
> > > - /* m %r4,<d(K)>(%r13) */
> > > - EMIT4_DISP(0x5c40d000, EMIT_CONST(K));
> > > - /* lr %r5,%r4 */
> > > - EMIT2(0x1854);
> > > + case BPF_S_ALU_DIV_K: /* A /= K */
> > > + /* lhi %r4,0 */
> > > + EMIT4(0xa7480000);
> > > + /* d %r4,<d(K)>(%r13) */
> > > + EMIT4_DISP(0x5d40d000, EMIT_CONST(K));
> > > break;
> >
> > The s390 part looks good.
>
> Does it? The divide instruction is signed, for the special
> case of K==1 this can now cause an exception if the quotient
> gets too large. We should add a check for K==1 and do nothing
> in this case. With a divisor of at least 2 the result will
> stay in the limit.
Indeed. That's quite subtle.
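The arithmetic behind this can be checked with a small userspace model. It
assumes, as described above, that the divide is signed, that the dividend is
the 64-bit value formed by a zeroed high word and A in the low word, and that
the quotient has to fit in a signed 32-bit register. The demo_ names are
hypothetical, and K is assumed to be in the range 1..INT32_MAX.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when A / K fits in a signed 32-bit quotient.
 * Callers must pass 1 <= k <= INT32_MAX (assumption of this sketch).
 */
bool demo_quotient_fits(uint32_t a, uint32_t k)
{
	int64_t dividend = (int64_t)a; /* high word zero, low word A */
	int64_t quotient = dividend / (int64_t)k;

	return quotient <= INT32_MAX;
}

/* K == 1 can overflow the quotient; K >= 2 halves it into range. */
void demo_divide_check(void)
{
	assert(!demo_quotient_fits(0x80000000u, 1)); /* 2^31 does not fit */
	assert(demo_quotient_fits(0x7fffffffu, 1));  /* largest safe A */
	assert(demo_quotient_fits(0xffffffffu, 2));  /* 0x7fffffff, fits */
}
```

With K == 1 the quotient equals A, so any A with the top bit set overflows;
with K >= 2 the largest possible quotient is 0xffffffff / 2 = 0x7fffffff.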
* Re: [PATCH net-next] xen-netback: Rework rx_work_todo
From: Wei Liu @ 2014-01-15 10:37 UTC (permalink / raw)
To: Zoltan Kiss
Cc: ian.campbell, wei.liu2, xen-devel, netdev, linux-kernel,
jonathan.davies
In-Reply-To: <1389727719-21439-1-git-send-email-zoltan.kiss@citrix.com>
On Tue, Jan 14, 2014 at 07:28:39PM +0000, Zoltan Kiss wrote:
> The recent patch to fix receive side flow control (11b57f) solved the spinning
> thread problem, but caused another one. The receive side can stall if:
> - xenvif_rx_action sets rx_queue_stopped to false
> - interrupt happens, and sets rx_event to true
> - then xenvif_kthread sets rx_event to false
>
If you mean "rx_work_todo" returns false.
In this case
(!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) || vif->rx_event;
can still be true, can't it?
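To make the question concrete, the predicate quoted above can be modelled in
userspace as below. The struct and the demo_ names are hypothetical stand-ins;
only the boolean expression is taken from the line above, with the queue
emptiness test reduced to a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the three pieces of vif state used by the predicate. */
struct demo_vif {
	bool rx_queue_empty;
	bool rx_queue_stopped;
	bool rx_event;
};

/* Same shape as the expression quoted in the mail. */
bool demo_rx_work_todo(const struct demo_vif *vif)
{
	return (!vif->rx_queue_empty && !vif->rx_queue_stopped) ||
	       vif->rx_event;
}

/* rx_event cleared, but queued and unstopped work still reports true. */
void demo_rx_check(void)
{
	struct demo_vif pending = {
		.rx_queue_empty = false,
		.rx_queue_stopped = false,
		.rx_event = false
	};
	struct demo_vif idle = {
		.rx_queue_empty = true,
		.rx_queue_stopped = false,
		.rx_event = false
	};

	assert(demo_rx_work_todo(&pending));
	assert(!demo_rx_work_todo(&idle));
}
```

In this model, clearing rx_event alone does not make the predicate false as
long as the queue is non-empty and not stopped.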
> Also, through rx_event a malicious guest can force the RX thread to spin. This
> patch ditches those two variables and reworks rx_work_todo. If the thread finds it
This seems to be a bigger problem. Can you elaborate?
Wei.
* RE: [PATCH net v2] be2net: add dma_mapping_error() check for dma_map_page()
From: Sathya Perla @ 2014-01-15 10:28 UTC (permalink / raw)
To: Ivan Vecera, netdev@vger.kernel.org
Cc: Subramanian Seetharaman, Ajit Khaparde
In-Reply-To: <1389780694-6299-1-git-send-email-ivecera@redhat.com>
> -----Original Message-----
> From: Ivan Vecera [mailto:ivecera@redhat.com]
>
> The driver does not check value returned by dma_map_page. The patch
> fixes this.
>
> v2: Removed the bugfix for non-bug ;-) (thanks Sathya)
>
> Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Acked-by: Sathya Perla <Sathya.perla@emulex.com>
> ---
> drivers/net/ethernet/emulex/benet/be_main.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/emulex/benet/be_main.c
> b/drivers/net/ethernet/emulex/benet/be_main.c
> index bf40fda..a37039d 100644
> --- a/drivers/net/ethernet/emulex/benet/be_main.c
> +++ b/drivers/net/ethernet/emulex/benet/be_main.c
> @@ -1776,6 +1776,7 @@ static void be_post_rx_frags(struct be_rx_obj *rxo, gfp_t gfp)
> struct be_rx_page_info *page_info = NULL, *prev_page_info = NULL;
> struct be_queue_info *rxq = &rxo->q;
> struct page *pagep = NULL;
> + struct device *dev = &adapter->pdev->dev;
> struct be_eth_rx_d *rxd;
> u64 page_dmaaddr = 0, frag_dmaaddr;
> u32 posted, page_offset = 0;
> @@ -1788,9 +1789,15 @@ static void be_post_rx_frags(struct be_rx_obj *rxo, gfp_t gfp)
> rx_stats(rxo)->rx_post_fail++;
> break;
> }
> - page_dmaaddr = dma_map_page(&adapter->pdev->dev, pagep,
> - 0, adapter->big_page_size,
> + page_dmaaddr = dma_map_page(dev, pagep, 0,
> + adapter->big_page_size,
> DMA_FROM_DEVICE);
> + if (dma_mapping_error(dev, page_dmaaddr)) {
> + put_page(pagep);
> + pagep = NULL;
> + rx_stats(rxo)->rx_post_fail++;
> + break;
> + }
> page_info->page_offset = 0;
> } else {
> get_page(pagep);
> --
> 1.8.3.2
* [PATCH net v2] be2net: add dma_mapping_error() check for dma_map_page()
From: Ivan Vecera @ 2014-01-15 10:11 UTC (permalink / raw)
To: netdev; +Cc: sathya.perla, subbu.seetharaman, ajit.khaparde
The driver does not check value returned by dma_map_page. The patch
fixes this.
v2: Removed the bugfix for non-bug ;-) (thanks Sathya)
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
---
drivers/net/ethernet/emulex/benet/be_main.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index bf40fda..a37039d 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -1776,6 +1776,7 @@ static void be_post_rx_frags(struct be_rx_obj *rxo, gfp_t gfp)
struct be_rx_page_info *page_info = NULL, *prev_page_info = NULL;
struct be_queue_info *rxq = &rxo->q;
struct page *pagep = NULL;
+ struct device *dev = &adapter->pdev->dev;
struct be_eth_rx_d *rxd;
u64 page_dmaaddr = 0, frag_dmaaddr;
u32 posted, page_offset = 0;
@@ -1788,9 +1789,15 @@ static void be_post_rx_frags(struct be_rx_obj *rxo, gfp_t gfp)
rx_stats(rxo)->rx_post_fail++;
break;
}
- page_dmaaddr = dma_map_page(&adapter->pdev->dev, pagep,
- 0, adapter->big_page_size,
+ page_dmaaddr = dma_map_page(dev, pagep, 0,
+ adapter->big_page_size,
DMA_FROM_DEVICE);
+ if (dma_mapping_error(dev, page_dmaaddr)) {
+ put_page(pagep);
+ pagep = NULL;
+ rx_stats(rxo)->rx_post_fail++;
+ break;
+ }
page_info->page_offset = 0;
} else {
get_page(pagep);
--
1.8.3.2
* Re: [PATCH net] be2net: add dma_mapping_error() check for dma_map_page()
From: Ivan Vecera @ 2014-01-15 10:08 UTC (permalink / raw)
To: Sathya Perla, netdev@vger.kernel.org
Cc: Subramanian Seetharaman, Ajit Khaparde
In-Reply-To: <89af1a0a-785c-4dfd-93c5-b1be112d5f60@CMEXHTCAS1.ad.emulex.com>
On 01/15/2014 08:36 AM, Sathya Perla wrote:
>> -----Original Message-----
>> From: Ivan Vecera [mailto:ivecera@redhat.com]
>>
>> The driver does not check value returned by dma_map_page. The patch
>> fixes this as well as one additional bug. The prev_page_info is
>> dereferenced after 'for' loop but if the 1st be_alloc_pages fails its
>> value is NULL.
>>
>> Signed-off-by: Ivan Vecera <ivecera@redhat.com>
>> ---
>> drivers/net/ethernet/emulex/benet/be_main.c | 13 ++++++++++---
>> 1 file changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/emulex/benet/be_main.c
>> b/drivers/net/ethernet/emulex/benet/be_main.c
>> index bf40fda..f2811b5 100644
>> --- a/drivers/net/ethernet/emulex/benet/be_main.c
>> +++ b/drivers/net/ethernet/emulex/benet/be_main.c
>> @@ -1776,6 +1776,7 @@ static void be_post_rx_frags(struct be_rx_obj *rxo, gfp_t gfp)
>> struct be_rx_page_info *page_info = NULL, *prev_page_info = NULL;
>> struct be_queue_info *rxq = &rxo->q;
>> struct page *pagep = NULL;
>> + struct device *dev = &adapter->pdev->dev;
>> struct be_eth_rx_d *rxd;
>> u64 page_dmaaddr = 0, frag_dmaaddr;
>> u32 posted, page_offset = 0;
>> @@ -1788,9 +1789,15 @@ static void be_post_rx_frags(struct be_rx_obj *rxo, gfp_t gfp)
>> rx_stats(rxo)->rx_post_fail++;
>> break;
>> }
>> - page_dmaaddr = dma_map_page(&adapter->pdev->dev, pagep,
>> - 0, adapter->big_page_size,
>> + page_dmaaddr = dma_map_page(dev, pagep, 0,
>> + adapter->big_page_size,
>> DMA_FROM_DEVICE);
>> + if (dma_mapping_error(dev, page_dmaaddr)) {
>> + put_page(pagep);
>> + pagep = NULL;
>> + rx_stats(rxo)->rx_post_fail++;
>> + break;
>> + }
>> page_info->page_offset = 0;
>> } else {
>> get_page(pagep);
>> @@ -1816,7 +1823,7 @@ static void be_post_rx_frags(struct be_rx_obj *rxo, gfp_t gfp)
>> queue_head_inc(rxq);
>> page_info = &rxo->page_info_tbl[rxq->head];
>> }
>> - if (pagep)
>> + if (pagep && prev_page_info)
>> prev_page_info->last_page_user = true;
>
> Ivan, if the 1st be_alloc_pages() fails, won't "pagep" be NULL as well?
> In that case, "prev_page_info" will not be dereferenced.
>
Sure Sathya, sorry... my bad eyes... Will post 2nd version.
Ivan