netdev.vger.kernel.org archive mirror
* GRO aggregation
@ 2012-09-11 13:45 Shlomo Pongartz
  2012-09-11 18:20 ` Marcelo Ricardo Leitner
  2012-09-11 18:33 ` Eric Dumazet
  0 siblings, 2 replies; 24+ messages in thread
From: Shlomo Pongartz @ 2012-09-11 13:45 UTC (permalink / raw)
  To: netdev

Hi,

I'm checking GRO aggregation with kernel 3.6.0-rc1+ using the Intel ixgbe 
driver.
The MTU is 1500, and GRO is on, as are SG and RX checksum offload.
I ran iperf with default settings and monitored the receiver with tcpdump.
tcpdump shows that the maximal aggregation is 32120 bytes, i.e. about 21-22 
full-sized segments (22 * 1460 = 32120).
On the transmitter side tcpdump shows that TSO does better (~64K).
I did a capture without GRO enabled to see whether some flag difference 
between two consecutive packets forced a flush, but didn't find anything.
Can the GRO aggregation be tuned?



Shlomo Pongratz


* Re: GRO aggregation
  2012-09-11 13:45 GRO aggregation Shlomo Pongartz
@ 2012-09-11 18:20 ` Marcelo Ricardo Leitner
  2012-09-11 18:41   ` Shlomo Pongratz
  2012-09-11 18:33 ` Eric Dumazet
  1 sibling, 1 reply; 24+ messages in thread
From: Marcelo Ricardo Leitner @ 2012-09-11 18:20 UTC (permalink / raw)
  To: Shlomo Pongartz; +Cc: netdev

On 09/11/2012 10:45 AM, Shlomo Pongartz wrote:
> Hi,
>
> I'm checking GRO aggregation with kernel 3.6.0-rc1+ using the Intel ixgbe
> driver.
> The MTU is 1500, and GRO is on, as are SG and RX checksum offload.
> I ran iperf with default settings and monitored the receiver with tcpdump.
> tcpdump shows that the maximal aggregation is 32120 bytes, i.e. about 21-22
> full-sized segments (22 * 1460 = 32120).
> On the transmitter side tcpdump shows that TSO does better (~64K).
> I did a capture without GRO enabled to see whether some flag difference
> between two consecutive packets forced a flush, but didn't find anything.
> Can the GRO aggregation be tuned?

Hi Shlomo,

Have you tried tuning coalescing parameters?

Marcelo


* Re: GRO aggregation
  2012-09-11 13:45 GRO aggregation Shlomo Pongartz
  2012-09-11 18:20 ` Marcelo Ricardo Leitner
@ 2012-09-11 18:33 ` Eric Dumazet
  2012-09-11 18:49   ` Shlomo Pongratz
  1 sibling, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2012-09-11 18:33 UTC (permalink / raw)
  To: Shlomo Pongartz; +Cc: netdev

On Tue, 2012-09-11 at 16:45 +0300, Shlomo Pongartz wrote:
> Hi,
> 
> I'm checking GRO aggregation with kernel 3.6.0-rc1+ using the Intel ixgbe
> driver.
> The MTU is 1500, and GRO is on, as are SG and RX checksum offload.
> I ran iperf with default settings and monitored the receiver with tcpdump.
> tcpdump shows that the maximal aggregation is 32120 bytes, i.e. about 21-22
> full-sized segments (22 * 1460 = 32120).
> On the transmitter side tcpdump shows that TSO does better (~64K).
> I did a capture without GRO enabled to see whether some flag difference
> between two consecutive packets forced a flush, but didn't find anything.
> Can the GRO aggregation be tuned?

It might mean NAPI runs while about 21 frames can be fetched at once
from the NIC.

If the receiver CPU is fast enough, it has no need to aggregate more
segments per skb.

Is LRO off or on?

GRO itself has a 64KB limit.


* RE: GRO aggregation
  2012-09-11 18:20 ` Marcelo Ricardo Leitner
@ 2012-09-11 18:41   ` Shlomo Pongratz
  2012-09-11 18:48     ` Marcelo Ricardo Leitner
  0 siblings, 1 reply; 24+ messages in thread
From: Shlomo Pongratz @ 2012-09-11 18:41 UTC (permalink / raw)
  To: mleitner@redhat.com; +Cc: netdev@vger.kernel.org

From: Marcelo Ricardo Leitner [mleitner@redhat.com]
Sent: Tuesday, September 11, 2012 9:20 PM
To: Shlomo Pongratz
Cc: netdev@vger.kernel.org
Subject: Re: GRO aggregation

On 09/11/2012 10:45 AM, Shlomo Pongartz wrote:
> Hi,
>
> I'm checking GRO aggregation with kernel 3.6.0-rc1+ using the Intel ixgbe
> driver.
> The MTU is 1500, and GRO is on, as are SG and RX checksum offload.
> I ran iperf with default settings and monitored the receiver with tcpdump.
> tcpdump shows that the maximal aggregation is 32120 bytes, i.e. about 21-22
> full-sized segments (22 * 1460 = 32120).
> On the transmitter side tcpdump shows that TSO does better (~64K).
> I did a capture without GRO enabled to see whether some flag difference
> between two consecutive packets forced a flush, but didn't find anything.
> Can the GRO aggregation be tuned?

Hi Shlomo,

Have you tried tuning coalescing parameters?

Marcelo


Hi Marcelo

I didn't play with interrupt coalescing.
Do you suggest increasing the value?

Shlomo


* Re: GRO aggregation
  2012-09-11 18:41   ` Shlomo Pongratz
@ 2012-09-11 18:48     ` Marcelo Ricardo Leitner
  2012-09-11 18:51       ` Shlomo Pongratz
  0 siblings, 1 reply; 24+ messages in thread
From: Marcelo Ricardo Leitner @ 2012-09-11 18:48 UTC (permalink / raw)
  To: Shlomo Pongratz; +Cc: netdev@vger.kernel.org

On 09/11/2012 03:41 PM, Shlomo Pongratz wrote:
> From: Marcelo Ricardo Leitner [mleitner@redhat.com]
> Sent: Tuesday, September 11, 2012 9:20 PM
> To: Shlomo Pongratz
> Cc: netdev@vger.kernel.org
> Subject: Re: GRO aggregation
>
> On 09/11/2012 10:45 AM, Shlomo Pongartz wrote:
>> Hi,
>>
>> I'm checking GRO aggregation with kernel 3.6.0-rc1+ using the Intel ixgbe
>> driver.
>> The MTU is 1500, and GRO is on, as are SG and RX checksum offload.
>> I ran iperf with default settings and monitored the receiver with tcpdump.
>> tcpdump shows that the maximal aggregation is 32120 bytes, i.e. about 21-22
>> full-sized segments (22 * 1460 = 32120).
>> On the transmitter side tcpdump shows that TSO does better (~64K).
>> I did a capture without GRO enabled to see whether some flag difference
>> between two consecutive packets forced a flush, but didn't find anything.
>> Can the GRO aggregation be tuned?
>
> Hi Shlomo,
>
> Have you tried tuning coalescing parameters?
>
> Marcelo
>
>
> Hi Marcelo
>
> I didn't play with interrupt coalescing.
> Do you suggest increasing the value?

Actually it was just an idea off the top of my head; I don't know how it 
applies to ixgbe, sorry. But making the NIC hold packets a bit longer should 
make it hand larger ones to the kernel. It's a trade-off between latency and 
throughput.

I was thinking about the ethtool -c options, like rx-usecs*

Marcelo
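
For reference, a minimal userspace sketch (not from this thread; the device
name "eth2" and the 100 usec value are purely illustrative) of how an
"ethtool -C <dev> rx-usecs N" style adjustment maps onto the
ETHTOOL_GCOALESCE/ETHTOOL_SCOALESCE ioctls:

/* Hedged sketch: illustrative device name and value, error handling trimmed. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth2", IFNAMSIZ - 1);	/* illustrative NIC name */
	ifr.ifr_data = (char *)&ec;

	if (ioctl(fd, SIOCETHTOOL, &ifr) == 0) {
		printf("current rx-usecs: %u\n", ec.rx_coalesce_usecs);

		ec.cmd = ETHTOOL_SCOALESCE;
		ec.rx_coalesce_usecs = 100;	/* hold interrupts a bit longer */
		if (ioctl(fd, SIOCETHTOOL, &ifr) != 0)
			perror("ETHTOOL_SCOALESCE");
	} else {
		perror("ETHTOOL_GCOALESCE");
	}

	close(fd);
	return 0;
}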


* RE: GRO aggregation
  2012-09-11 18:33 ` Eric Dumazet
@ 2012-09-11 18:49   ` Shlomo Pongratz
  2012-09-11 19:01     ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Shlomo Pongratz @ 2012-09-11 18:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev@vger.kernel.org

From: Eric Dumazet [eric.dumazet@gmail.com]
Sent: Tuesday, September 11, 2012 9:33 PM
To: Shlomo Pongratz
Cc: netdev@vger.kernel.org
Subject: Re: GRO aggregation

On Tue, 2012-09-11 at 16:45 +0300, Shlomo Pongartz wrote:
> Hi,
>
> I'm checking GRO aggregation with kernel 3.6.0-rc1+ using the Intel ixgbe
> driver.
> The MTU is 1500, and GRO is on, as are SG and RX checksum offload.
> I ran iperf with default settings and monitored the receiver with tcpdump.
> tcpdump shows that the maximal aggregation is 32120 bytes, i.e. about 21-22
> full-sized segments (22 * 1460 = 32120).
> On the transmitter side tcpdump shows that TSO does better (~64K).
> I did a capture without GRO enabled to see whether some flag difference
> between two consecutive packets forced a flush, but didn't find anything.
> Can the GRO aggregation be tuned?

It might mean NAPI runs while about 21 frames can be fetched at once
from the NIC.

If the receiver CPU is fast enough, it has no need to aggregate more
segments per skb.

Is LRO off or on?

GRO itself has a 64KB limit.

Hi Eric.

I disabled LRO. I actually tried all four combinations and found that LRO, GRO and LRO+GRO give the same results for ixgbe w.r.t. aggregation size (I didn't check throughput or latency).
Is there a timeout that flushes the aggregated skbs before 64K has been aggregated?

Shlomo


* RE: GRO aggregation
  2012-09-11 18:48     ` Marcelo Ricardo Leitner
@ 2012-09-11 18:51       ` Shlomo Pongratz
  0 siblings, 0 replies; 24+ messages in thread
From: Shlomo Pongratz @ 2012-09-11 18:51 UTC (permalink / raw)
  To: mleitner@redhat.com; +Cc: netdev@vger.kernel.org

From: Marcelo Ricardo Leitner [mleitner@redhat.com]
Sent: Tuesday, September 11, 2012 9:48 PM
To: Shlomo Pongratz
Cc: netdev@vger.kernel.org
Subject: Re: GRO aggregation

On 09/11/2012 03:41 PM, Shlomo Pongratz wrote:
> From: Marcelo Ricardo Leitner [mleitner@redhat.com]
> Sent: Tuesday, September 11, 2012 9:20 PM
> To: Shlomo Pongratz
> Cc: netdev@vger.kernel.org
> Subject: Re: GRO aggregation
>
> On 09/11/2012 10:45 AM, Shlomo Pongartz wrote:
>> Hi,
>>
>> I'm checking GRO aggregation with kernel 3.6.0-rc1+ using the Intel ixgbe
>> driver.
>> The MTU is 1500, and GRO is on, as are SG and RX checksum offload.
>> I ran iperf with default settings and monitored the receiver with tcpdump.
>> tcpdump shows that the maximal aggregation is 32120 bytes, i.e. about 21-22
>> full-sized segments (22 * 1460 = 32120).
>> On the transmitter side tcpdump shows that TSO does better (~64K).
>> I did a capture without GRO enabled to see whether some flag difference
>> between two consecutive packets forced a flush, but didn't find anything.
>> Can the GRO aggregation be tuned?
>
> Hi Shlomo,
>
> Have you tried tuning coalescing parameters?
>
> Marcelo
>
>
> Hi Marcelo
>
> I didn't play with interrupt coalescing.
> Do you suggest increasing the value?

Actually it was just an idea off the top of my head; I don't know how it
applies to ixgbe, sorry. But making the NIC hold packets a bit longer should
make it hand larger ones to the kernel. It's a trade-off between latency and
throughput.

I was thinking about the ethtool -c options, like rx-usecs*

Marcelo

I'll try to play with it a little.
Thanks.

Shlomo


* RE: GRO aggregation
  2012-09-11 18:49   ` Shlomo Pongratz
@ 2012-09-11 19:01     ` Eric Dumazet
  2012-09-11 19:24       ` Shlomo Pongratz
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2012-09-11 19:01 UTC (permalink / raw)
  To: Shlomo Pongratz; +Cc: netdev@vger.kernel.org

On Tue, 2012-09-11 at 18:49 +0000, Shlomo Pongratz wrote:

> I disabled LRO. I actually tried all four combinations and found that LRO, GRO and LRO+GRO give the same results for ixgbe w.r.t. aggregation size (I didn't check throughput or latency).
> Is there a timeout that flushes the aggregated skbs before 64K has been aggregated?

At the end of a NAPI run, we flush the GRO state.

It basically means that an interrupt came, and we fetched 21 frames from
the NIC.

To get more packets per interrupt, you might try to slow down your
cpu ;)

But I don't get the point.
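
For illustration, a generic sketch (not any particular driver;
example_fetch_from_rx_ring() is a hypothetical helper) of the poll pattern
being described: GRO state only lives for the duration of one NAPI poll, so
if the ring held ~21 frames when the interrupt fired, at most ~21 segments
can be merged before the flush.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical helper standing in for the driver's real RX-ring dequeue. */
struct sk_buff *example_fetch_from_rx_ring(void);

/* Hedged sketch, not any real driver's poll routine. */
static int example_napi_poll(struct napi_struct *napi, int budget)
{
	struct sk_buff *skb;
	int work = 0;

	/* Drain whatever the NIC had queued, up to the NAPI weight. */
	while (work < budget && (skb = example_fetch_from_rx_ring()) != NULL) {
		napi_gro_receive(napi, skb);	/* may merge into a pending GRO packet */
		work++;
	}

	/* Ring drained before the budget was spent: complete the poll.
	 * napi_complete() flushes any partially built GRO packets. */
	if (work < budget)
		napi_complete(napi);

	return work;
}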


* RE: GRO aggregation
  2012-09-11 19:01     ` Eric Dumazet
@ 2012-09-11 19:24       ` Shlomo Pongratz
  2012-09-11 19:35         ` David Miller
  2012-09-11 19:35         ` Eric Dumazet
  0 siblings, 2 replies; 24+ messages in thread
From: Shlomo Pongratz @ 2012-09-11 19:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev@vger.kernel.org

From: Eric Dumazet [eric.dumazet@gmail.com]
Sent: Tuesday, September 11, 2012 10:02 PM
To: Shlomo Pongratz
Cc: netdev@vger.kernel.org
Subject: RE: GRO aggregation

On Tue, 2012-09-11 at 18:49 +0000, Shlomo Pongratz wrote:

> I disabled LRO. I actually tried all four combinations and found that LRO, GRO and LRO+GRO give the same results for ixgbe w.r.t. aggregation size (I didn't check throughput or latency).
> Is there a timeout that flushes the aggregated skbs before 64K has been aggregated?

At the end of a NAPI run, we flush the GRO state.

It basically means that an interrupt came, and we fetched 21 frames from
the NIC.

To get more packets per interrupt, you might try to slow down your
cpu ;)

But I don't get the point.


I see that in ixgbe the NAPI weight is 64 (netif_napi_add). So if packets are arriving at a high rate and the CPU is fast enough to collect them as they arrive, assuming packets keep arriving while NAPI runs, then it should have aggregated more, and we would have fewer passes through the stack.

Shlomo


* Re: GRO aggregation
  2012-09-11 19:24       ` Shlomo Pongratz
@ 2012-09-11 19:35         ` David Miller
  2012-09-11 19:35         ` Eric Dumazet
  1 sibling, 0 replies; 24+ messages in thread
From: David Miller @ 2012-09-11 19:35 UTC (permalink / raw)
  To: shlomop; +Cc: eric.dumazet, netdev

From: Shlomo Pongratz <shlomop@mellanox.com>
Date: Tue, 11 Sep 2012 19:24:26 +0000

> I see that in ixgbe the NAPI weight is 64 (netif_napi_add). So if
> packets are arriving at a high rate and the CPU is fast enough to
> collect them as they arrive, assuming packets keep arriving while NAPI
> runs, then it should have aggregated more, and we would have fewer
> passes through the stack.

Eric is trying to say that the CPU is fast enough that it completely
depletes the pending RX packets, the RX queue is empty, and there is
only 32K worth of GRO to accumulate.

BTW, your email quoting is non-standard and very confusing.  There
is absolutely no delineation between the text that you are writing
and the text of the people you are responding to.  Please learn how
to write email replies properly.

Thank you.


* RE: GRO aggregation
  2012-09-11 19:24       ` Shlomo Pongratz
  2012-09-11 19:35         ` David Miller
@ 2012-09-11 19:35         ` Eric Dumazet
  2012-09-12  9:23           ` Shlomo Pongartz
  1 sibling, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2012-09-11 19:35 UTC (permalink / raw)
  To: Shlomo Pongratz; +Cc: netdev@vger.kernel.org

On Tue, 2012-09-11 at 19:24 +0000, Shlomo Pongratz wrote:

> 
> I see that in ixgbe the NAPI weight is 64 (netif_napi_add). So if
> packets are arriving at a high rate and the CPU is fast enough to
> collect them as they arrive, assuming packets keep arriving while
> NAPI runs, then it should have aggregated more, and we would have
> fewer passes through the stack.
> 

As I said, _if_ your cpu was loaded by other stuff, then you would see
bigger GRO packets.

GRO is not: "We want to kill latency and have big packets just because
it's better."

It's more like: if the load is big enough, try to aggregate TCP frames
into fewer skbs.


* Re: GRO aggregation
  2012-09-11 19:35         ` Eric Dumazet
@ 2012-09-12  9:23           ` Shlomo Pongartz
  2012-09-12  9:33             ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Shlomo Pongartz @ 2012-09-12  9:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev@vger.kernel.org

On 9/11/2012 10:35 PM, Eric Dumazet wrote:
> On Tue, 2012-09-11 at 19:24 +0000, Shlomo Pongratz wrote:
>
>> I see that in ixgbe the NAPI weight is 64 (netif_napi_add). So if
>> packets are arriving at a high rate and the CPU is fast enough to
>> collect them as they arrive, assuming packets keep arriving while
>> NAPI runs, then it should have aggregated more, and we would have
>> fewer passes through the stack.
>>
> As I said, _if_ your cpu was loaded by other stuff, then you would see
> bigger GRO packets.
>
> GRO is not: "We want to kill latency and have big packets just because
> it's better."
>
> It's more like: if the load is big enough, try to aggregate TCP frames
> into fewer skbs.
>
>
>
>
First I want to apologize for breaking the mailing thread; I wasn't at 
work and used webmail.

I agree with you, but I think something is still strange.
On the transmitter side all the offloads are enabled, e.g. TSO and GSO.
tcpdump on the sender side shows a size of 64240, which is 44 packets 
of 1460 bytes each.
Since the offloads are enabled, the HW should transmit the 44 frames 
back to back, i.e. a burst of 44 * 1500 bytes, which by my calculation 
should take 52.8 microseconds on 10G Ethernet.
Using ethtool I've set rx-usecs to 1022, which I think is the maximal 
value for ixgbe.
Note that there is no way to set rx-frames on ixgbe.
Since the ixgbe NAPI weight is 64, I expected NAPI to be able to poll 
more than 21 packets, given that 44 packets came in one burst.
However, the results remain the same.

Shlomo.
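
As a sanity check on the 52.8 usec figure above, a small sketch of the
arithmetic (the first number is the payload-only view used in the mail; the
second adds the usual per-frame Ethernet overhead of preamble, FCS and
inter-frame gap, ~1538 bytes per 1500-byte frame, which is a textbook
estimate and not something measured in this thread):

#include <stdio.h>

int main(void)
{
	const double line_rate = 10e9;			/* 10G Ethernet, bits/s */

	/* 44 MTU-sized frames, payload-only view */
	double t_payload = 44 * 1500 * 8 / line_rate;	/* = 52.8e-6 s */

	/* Same burst counting preamble + FCS + inter-frame gap (~1538 B) */
	double t_wire = 44 * 1538 * 8 / line_rate;	/* ~= 54.1e-6 s */

	printf("burst time: %.1f us (payload), %.1f us (on wire)\n",
	       t_payload * 1e6, t_wire * 1e6);
	return 0;
}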


* Re: GRO aggregation
  2012-09-12  9:23           ` Shlomo Pongartz
@ 2012-09-12  9:33             ` Eric Dumazet
  2012-09-12 14:41               ` Shlomo Pongartz
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2012-09-12  9:33 UTC (permalink / raw)
  To: Shlomo Pongartz; +Cc: netdev@vger.kernel.org

On Wed, 2012-09-12 at 12:23 +0300, Shlomo Pongartz wrote:
> On 9/11/2012 10:35 PM, Eric Dumazet wrote:
> > On Tue, 2012-09-11 at 19:24 +0000, Shlomo Pongratz wrote:
> >
> >> I see that in ixgbe the NAPI weight is 64 (netif_napi_add). So if
> >> packets are arriving at a high rate and the CPU is fast enough to
> >> collect them as they arrive, assuming packets keep arriving while
> >> NAPI runs, then it should have aggregated more, and we would have
> >> fewer passes through the stack.
> >>
> > As I said, _if_ your cpu was loaded by other stuff, then you would see
> > bigger GRO packets.
> >
> > GRO is not: "We want to kill latency and have big packets just because
> > it's better."
> >
> > It's more like: if the load is big enough, try to aggregate TCP frames
> > into fewer skbs.
> >
> >
> >
> >
> First I want to apologize for breaking the mailing thread; I wasn't at
> work and used webmail.
> 
> I agree with you, but I think something is still strange.
> On the transmitter side all the offloads are enabled, e.g. TSO and GSO.
> tcpdump on the sender side shows a size of 64240, which is 44 packets
> of 1460 bytes each.
> Since the offloads are enabled, the HW should transmit the 44 frames
> back to back, i.e. a burst of 44 * 1500 bytes, which by my calculation
> should take 52.8 microseconds on 10G Ethernet.
> Using ethtool I've set rx-usecs to 1022, which I think is the maximal
> value for ixgbe.
> Note that there is no way to set rx-frames on ixgbe.
> Since the ixgbe NAPI weight is 64, I expected NAPI to be able to poll
> more than 21 packets, given that 44 packets came in one burst.
> However, the results remain the same.

TSO uses page frags, so 64KB needs about 16 pages.

tcp_sendmsg() could even use order-3 pages, so that only 2 pages would
be needed to fill 64KB of data.

GRO uses whatever fragment size the NIC provides, depending on MTU.

One skb has a limit on the number of frags.

Handling a huge array of frags would actually be slower in some helper
functions.

Since you don't say exactly why you are asking all these questions, it's
hard to guess what problem you are trying to solve.


* Re: GRO aggregation
  2012-09-12  9:33             ` Eric Dumazet
@ 2012-09-12 14:41               ` Shlomo Pongartz
  2012-09-12 16:23                 ` Rick Jones
  0 siblings, 1 reply; 24+ messages in thread
From: Shlomo Pongartz @ 2012-09-12 14:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev@vger.kernel.org

On 9/12/2012 12:33 PM, Eric Dumazet wrote:
> On Wed, 2012-09-12 at 12:23 +0300, Shlomo Pongartz wrote:
>> On 9/11/2012 10:35 PM, Eric Dumazet wrote:
>>> On Tue, 2012-09-11 at 19:24 +0000, Shlomo Pongratz wrote:
>>>
>>>> I see that in ixgbe the NAPI weight is 64 (netif_napi_add). So if
>>>> packets are arriving at a high rate and the CPU is fast enough to
>>>> collect them as they arrive, assuming packets keep arriving while
>>>> NAPI runs, then it should have aggregated more, and we would have
>>>> fewer passes through the stack.
>>>>
>>> As I said, _if_ your cpu was loaded by other stuff, then you would see
>>> bigger GRO packets.
>>>
>>> GRO is not: "We want to kill latency and have big packets just because
>>> it's better."
>>>
>>> It's more like: if the load is big enough, try to aggregate TCP frames
>>> into fewer skbs.
>>>
>>>
>>>
>>>
>> First I want to apologize for breaking the mailing thread; I wasn't at
>> work and used webmail.
>>
>> I agree with you, but I think something is still strange.
>> On the transmitter side all the offloads are enabled, e.g. TSO and GSO.
>> tcpdump on the sender side shows a size of 64240, which is 44 packets
>> of 1460 bytes each.
>> Since the offloads are enabled, the HW should transmit the 44 frames
>> back to back, i.e. a burst of 44 * 1500 bytes, which by my calculation
>> should take 52.8 microseconds on 10G Ethernet.
>> Using ethtool I've set rx-usecs to 1022, which I think is the maximal
>> value for ixgbe.
>> Note that there is no way to set rx-frames on ixgbe.
>> Since the ixgbe NAPI weight is 64, I expected NAPI to be able to poll
>> more than 21 packets, given that 44 packets came in one burst.
>> However, the results remain the same.
> TSO uses page frags, so 64KB needs about 16 pages.
>
> tcp_sendmsg() could even use order-3 pages, so that only 2 pages would
> be needed to fill 64KB of data.
>
> GRO uses whatever fragment size the NIC provides, depending on MTU.
>
> One skb has a limit on the number of frags.
>
> Handling a huge array of frags would actually be slower in some helper
> functions.
>
> Since you don't say exactly why you are asking all these questions, it's
> hard to guess what problem you are trying to solve.
>
>
>
> .
>
Hi Eric

TSO is just a means to create a burst of frames on the wire so that 
NAPI on the receiver can poll as much as possible.
I'm looking at the aggregation done by GRO on behalf of IPoIB. With 
IPoIB I added a counter that counts how many packets are received 
before napi_complete is called (either directly or by net_rx_action) 
and found that although NAPI consumes 64 packets on average before 
napi_complete is called, tcpdump shows that no more than 16-17 packets 
were aggregated. BTW, when I increased the MTU to 4K I did reach 64K 
aggregation, which again is 16-17 packets.
So, to see whether 17 packets is the aggregation limit, I wanted to see 
how ixgbe does and found that it aggregates 21 packets.
So I wanted to know whether there is another factor that governs the 
aggregation, one that I can tune.

Shlomo.


* Re: GRO aggregation
  2012-09-12 14:41               ` Shlomo Pongartz
@ 2012-09-12 16:23                 ` Rick Jones
  2012-09-12 16:34                   ` Shlomo Pongartz
  0 siblings, 1 reply; 24+ messages in thread
From: Rick Jones @ 2012-09-12 16:23 UTC (permalink / raw)
  To: Shlomo Pongartz; +Cc: Eric Dumazet, netdev@vger.kernel.org

On 09/12/2012 07:41 AM, Shlomo Pongartz wrote:
> Hi Eric
>
> TSO is just a means to create a burst of frames on the wire so that
> NAPI on the receiver can poll as much as possible.

Is it?  If I recall correctly, TSO was in place well before all drivers 
were using NAPI, and NAPI was being proposed independently of TSO.  TSO is 
there to save CPU cycles on the transmit side.  What it sends on the wire 
is intended to be identical to what a host with greater CPU performance 
could accomplish.

rick jones


* Re: GRO aggregation
  2012-09-12 16:23                 ` Rick Jones
@ 2012-09-12 16:34                   ` Shlomo Pongartz
  2012-09-12 16:52                     ` Rick Jones
  0 siblings, 1 reply; 24+ messages in thread
From: Shlomo Pongartz @ 2012-09-12 16:34 UTC (permalink / raw)
  To: Rick Jones; +Cc: Eric Dumazet, netdev@vger.kernel.org

On 9/12/2012 7:23 PM, Rick Jones wrote:
> On 09/12/2012 07:41 AM, Shlomo Pongartz wrote:
>> Hi Eric
>>
>> TSO is just a means to create a burst of frames on the wire so that
>> NAPI on the receiver can poll as much as possible.
>
> Is it?  If I recall correctly, TSO was in place well before all 
> drivers were using NAPI, and NAPI was being proposed independently of 
> TSO.  TSO is there to save CPU cycles on the transmit side.  What it 
> sends on the wire is intended to be identical to what a host with 
> greater CPU performance could accomplish.
>
> rick jones
>
Hi Rick.

What I'm saying is that I use TSO on the transmitting machine so that 
I'll have a burst of frames on the wire for NAPI on the receiving machine.
The best thing for my purpose is for the HW to do the segmentation, 
and unless I'm mistaken the Intel card is capable of doing so.

Shlomo.


* Re: GRO aggregation
  2012-09-12 16:34                   ` Shlomo Pongartz
@ 2012-09-12 16:52                     ` Rick Jones
  2012-09-13  6:36                       ` Shlomo Pongartz
  0 siblings, 1 reply; 24+ messages in thread
From: Rick Jones @ 2012-09-12 16:52 UTC (permalink / raw)
  To: Shlomo Pongartz; +Cc: Eric Dumazet, netdev@vger.kernel.org

On 09/12/2012 09:34 AM, Shlomo Pongartz wrote:
> On 9/12/2012 7:23 PM, Rick Jones wrote:
>> On 09/12/2012 07:41 AM, Shlomo Pongartz wrote:
>>> Hi Eric
>>>
>>> TSO is just a means to create a burst of frames on the wire so that
>>> NAPI on the receiver can poll as much as possible.
>>
>> Is it?  If I recall correctly, TSO was in place well before all
>> drivers were using NAPI, and NAPI was being proposed independently of
>> TSO.  TSO is there to save CPU cycles on the transmit side.  What it
>> sends on the wire is intended to be identical to what a host with
>> greater CPU performance could accomplish.
>>
>> rick jones
>>
> Hi Rick.
>
> What I'm saying is that I use TSO on the transmitting machine so that
> I'll have a burst of frames on the wire for NAPI on the receiving machine.

Also, NAPI was in place before GRO.  IIRC, the napi code was simply a 
convenient/correct/natural place to have the GRO functionality.

rick jones


* Re: GRO aggregation
  2012-09-12 16:52                     ` Rick Jones
@ 2012-09-13  6:36                       ` Shlomo Pongartz
  2012-09-13  8:11                         ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Shlomo Pongartz @ 2012-09-13  6:36 UTC (permalink / raw)
  To: Rick Jones; +Cc: Eric Dumazet, netdev@vger.kernel.org

On 9/12/2012 7:52 PM, Rick Jones wrote:
> On 09/12/2012 09:34 AM, Shlomo Pongartz wrote:
>> On 9/12/2012 7:23 PM, Rick Jones wrote:
>>> On 09/12/2012 07:41 AM, Shlomo Pongartz wrote:
>>>> Hi Eric
>>>>
>>>> TSO is just a means to create a burst of frames on the wire so that
>>>> NAPI on the receiver can poll as much as possible.
>>>
>>> Is it?  If I recall correctly, TSO was in place well before all
>>> drivers were using NAPI, and NAPI was being proposed independently of
>>> TSO.  TSO is there to save CPU cycles on the transmit side.  What it
>>> sends on the wire is intended to be identical to what a host with
>>> greater CPU performance could accomplish.
>>>
>>> rick jones
>>>
>> Hi Rick.
>>
>> What I'm saying is that I use TSO on the transmitting machine so that
>> I'll have a burst of frames on the wire for NAPI on the receiving machine.
>
> Also, NAPI was in place before GRO.  IIRC, the napi code was simply a 
> convenient/correct/natural place to have the GRO functionality.
>
> rick jones
>
Hi Rick

The thing is that napi_complete calls napi_gro_flush, so this poses a 
limit on the aggregation.
However, when I count the number of packets received before this routine 
is called, I get a number which is bigger than what I see with tcpdump, 
and this number is less than what is expected if the limit is 64K.
So I want to know what I can do to improve things, e.g. allocate the 
skb differently.

Shlomo


* Re: GRO aggregation
  2012-09-13  6:36                       ` Shlomo Pongartz
@ 2012-09-13  8:11                         ` Eric Dumazet
  2012-09-13  9:59                           ` Or Gerlitz
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2012-09-13  8:11 UTC (permalink / raw)
  To: Shlomo Pongartz; +Cc: Rick Jones, netdev@vger.kernel.org

On Thu, 2012-09-13 at 09:36 +0300, Shlomo Pongartz wrote:

> The thing is that napi_complete calls napi_gro_flush, so this poses a
> limit on the aggregation.
> However, when I count the number of packets received before this routine
> is called, I get a number which is bigger than what I see with tcpdump,
> and this number is less than what is expected if the limit is 64K.
> So I want to know what I can do to improve things, e.g. allocate the
> skb differently.
> 

I already answered this question, more or less.

MAX_SKB_FRAGS is 16.

skb_gro_receive() will return -E2BIG once this limit is hit.

If you use an MSS of 100 (instead of MSS = 1460), then a GRO skb will
contain at most about 1700 bytes, but TSO packets can still be 64KB, if
the sender NIC can afford it (some NICs won't work quite well).

We are not going to change skb allocations to get bigger GRO packets,
for several reasons.

One of the reasons is inherent to the TCP protocol: only one ACK is
sent in response to a GRO packet.

In your LAN, feel free to use a bigger MTU to reach this 64KB magical
value you seem to desperately seek.
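
For illustration only, a simplified sketch of the fragment-budget condition
being described; this is not the actual skb_gro_receive() code, just the
shape of the -E2BIG check (gro_frag_budget_ok() is a made-up name):

#include <linux/skbuff.h>
#include <linux/errno.h>

/* Simplified sketch, not the real skb_gro_receive(): once the skb heading
 * a GRO packet already carries MAX_SKB_FRAGS page fragments, a further
 * MTU-sized frame cannot be merged and the GRO packet is flushed up the
 * stack as-is. */
static int gro_frag_budget_ok(const struct sk_buff *head, unsigned int new_frags)
{
	const struct skb_shared_info *pinfo = skb_shinfo(head);

	if (pinfo->nr_frags + new_frags > MAX_SKB_FRAGS)
		return -E2BIG;	/* caller flushes and starts a new GRO packet */

	return 0;
}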


* Re: GRO aggregation
  2012-09-13  8:11                         ` Eric Dumazet
@ 2012-09-13  9:59                           ` Or Gerlitz
  2012-09-13 12:05                             ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Or Gerlitz @ 2012-09-13  9:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Shlomo Pongartz, Rick Jones, netdev@vger.kernel.org

On Thu, Sep 13, 2012 at 11:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> MAX_SKB_FRAGS is 16.
> skb_gro_receive() will return -E2BIG once this limit is hit.
> If you use an MSS of 100 (instead of MSS = 1460), then a GRO skb will
> contain at most about 1700 bytes, but TSO packets can still be 64KB, if
> the sender NIC can afford it (some NICs won't work quite well).

Hi Eric,

Addressing this assertion of yours: Shlomo showed that with ixgbe he managed
to see GRO aggregating 32KB, which means 20-21 packets, i.e. more than 16
fragments in this notation. Can it be related to the way ixgbe actually
allocates skbs?

Or.


* Re: GRO aggregation
  2012-09-13  9:59                           ` Or Gerlitz
@ 2012-09-13 12:05                             ` Eric Dumazet
  2012-09-13 12:34                               ` Eric Dumazet
  2012-09-13 12:47                               ` Or Gerlitz
  0 siblings, 2 replies; 24+ messages in thread
From: Eric Dumazet @ 2012-09-13 12:05 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Shlomo Pongartz, Rick Jones, netdev@vger.kernel.org, Tom Herbert

On Thu, 2012-09-13 at 12:59 +0300, Or Gerlitz wrote:
> On Thu, Sep 13, 2012 at 11:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > MAX_SKB_FRAGS is 16.
> > skb_gro_receive() will return -E2BIG once this limit is hit.
> > If you use an MSS of 100 (instead of MSS = 1460), then a GRO skb will
> > contain at most about 1700 bytes, but TSO packets can still be 64KB, if
> > the sender NIC can afford it (some NICs won't work quite well).
> 
> Hi Eric,
> 
> Addressing this assertion of yours: Shlomo showed that with ixgbe he managed
> to see GRO aggregating 32KB, which means 20-21 packets, i.e. more than 16
> fragments in this notation. Can it be related to the way ixgbe actually
> allocates skbs?
> 

Hard to say without knowing the exact kernel version, as things change a lot
in this area.

You have several kinds of GRO: one fast and one slow.

The slow one uses a linked list of skbs (pinfo->frag_list), while the
fast one uses fragments (pinfo->nr_frags).

For example, some drivers (the mellanox one is in this lot) pull too many
bytes into skb->head and this defeats the fast GRO:
part of the payload is in skb->head, the remaining part in pinfo->frags[0].

skb_gro_receive() then has to allocate a new head skb to link skbs into
head->frag_list. The total skb->truesize is not reduced at all; it's
increased.

So you might think GRO is working, but it's only a hack, as one skb has a
list of skbs, and this makes TCP read() slower, and defeats TCP
coalescing as well. What's the point of delivering fat skbs to the TCP stack
if it slows down the consumer because of increased cache line misses?

I am not _very_ interested in the slow GRO behavior; I try to improve
the fast path.

ixgbe uses the fast GRO, at least on recent kernels.

In my tests on mellanox, it only aggregates 8 frames per skb, and still
we reach 10Gbps...

03:41:40.128074 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1563841:1575425(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128080 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1575425:1587009(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128085 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1587009:1598593(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128089 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1598593:1610177(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128093 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1610177:1621761(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128103 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1633345:1644929(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128116 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1668097:1679681(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128121 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1679681:1691265(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128134 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1714433:1726017(11584) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128146 IP 7.7.7.84.38079 > 7.7.7.83.52113: . 1749185:1759321(10136) ack 0 win 229 <nop,nop,timestamp 137349733 152427711>
03:41:40.128163 IP 7.7.7.83.52113 > 7.7.7.84.38079: . ack 1575425 win 4147 <nop,nop,timestamp 152427711 137349733>
03:41:40.128193 IP 7.7.7.83.52113 > 7.7.7.84.38079: . ack 1759321 win 3339 <nop,nop,timestamp 152427711 137349733>

And it aggregates 8 frames per skb because each individual frame uses 2 fragments:

one of 512 bytes and one of 1024 bytes, a total of 1536 bytes,
instead of the typical 2048 bytes used by other NICs.

To get better performance, mellanox could use only one frag
per MTU (if MTU <= 1500), using 1536-byte frags.

I tried this, and it now gives:

05:00:12.507398 IP 7.7.7.84.63422 > 7.7.7.83.37622: . 2064384:2089000(24616) ack 1 win 229 <nop,nop,timestamp 142062123 4294793380>
05:00:12.507419 IP 7.7.7.84.63422 > 7.7.7.83.37622: . 2138232:2161400(23168) ack 1 win 229 <nop,nop,timestamp 142062123 4294793380>
05:00:12.507489 IP 7.7.7.84.63422 > 7.7.7.83.37622: . 2244664:2269280(24616) ack 1 win 229 <nop,nop,timestamp 142062123 4294793380>
05:00:12.507509 IP 7.7.7.83.37622 > 7.7.7.84.63422: . ack 2244664 win 16384 <nop,nop,timestamp 4294793380 142062123>

But there is no real difference in throughput.

diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 6c4f935..435c35e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -96,8 +96,8 @@
 /* Receive fragment sizes; we use at most 4 fragments (for 9600 byte MTU
  * and 4K allocations) */
 enum {
-	FRAG_SZ0 = 512 - NET_IP_ALIGN,
-	FRAG_SZ1 = 1024,
+	FRAG_SZ0 = 1536 - NET_IP_ALIGN,
+	FRAG_SZ1 = 2048,
        FRAG_SZ2 = 4096,
        FRAG_SZ3 = MLX4_EN_ALLOC_SIZE
 };


* Re: GRO aggregation
  2012-09-13 12:05                             ` Eric Dumazet
@ 2012-09-13 12:34                               ` Eric Dumazet
  2012-09-13 12:47                               ` Or Gerlitz
  1 sibling, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2012-09-13 12:34 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Shlomo Pongartz, Rick Jones, netdev@vger.kernel.org, Tom Herbert

On Thu, 2012-09-13 at 14:05 +0200, Eric Dumazet wrote:

> But there is no real difference in throughput.
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 6c4f935..435c35e 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -96,8 +96,8 @@
>  /* Receive fragment sizes; we use at most 4 fragments (for 9600 byte MTU
>   * and 4K allocations) */
>  enum {
> -	FRAG_SZ0 = 512 - NET_IP_ALIGN,
> -	FRAG_SZ1 = 1024,
> +	FRAG_SZ0 = 1536 - NET_IP_ALIGN,
> +	FRAG_SZ1 = 2048,
>         FRAG_SZ2 = 4096,
>         FRAG_SZ3 = MLX4_EN_ALLOC_SIZE
>  };
> 

Oh well, adding one prefetch() is giving ~10% more throughput.

I guess this mlx4 driver needs some care.

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 5aba5ec..547eec8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -38,6 +38,7 @@
 #include <linux/if_ether.h>
 #include <linux/if_vlan.h>
 #include <linux/vmalloc.h>
+#include <linux/prefetch.h>
 
 #include "mlx4_en.h"
 
@@ -617,7 +618,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		    !((dev->features & NETIF_F_LOOPBACK) ||
 		      priv->validate_loopback))
 			goto next;
-
+		/* avoid cache miss in tcp_gro_receive() */
+		prefetch((char *)ethh + 64);
 		/*
 		 * Packet is OK - process it.
 		 */


* Re: GRO aggregation
  2012-09-13 12:05                             ` Eric Dumazet
  2012-09-13 12:34                               ` Eric Dumazet
@ 2012-09-13 12:47                               ` Or Gerlitz
  2012-09-13 13:22                                 ` Eric Dumazet
  1 sibling, 1 reply; 24+ messages in thread
From: Or Gerlitz @ 2012-09-13 12:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Shlomo Pongartz, Rick Jones, netdev@vger.kernel.org, Tom Herbert,
	Yevgeny Petrilin

On Thu, Sep 13, 2012 at 3:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-09-13 at 12:59 +0300, Or Gerlitz wrote:
>> On Thu, Sep 13, 2012 at 11:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > MAX_SKB_FRAGS is 16.
>> > skb_gro_receive() will return -E2BIG once this limit is hit.
>> > If you use an MSS of 100 (instead of MSS = 1460), then a GRO skb will
>> > contain at most about 1700 bytes, but TSO packets can still be 64KB, if
>> > the sender NIC can afford it (some NICs won't work quite well).

>> Addressing this assertion of yours: Shlomo showed that with ixgbe he managed
>> to see GRO aggregating 32KB, which means 20-21 packets, i.e. more than 16
>> fragments in this notation. Can it be related to the way ixgbe actually
>> allocates skbs?

> Hard to say without knowing the exact kernel version, as things change a lot in this area.

As Shlomo wrote earlier in this thread, his testbed is 3.6-rc1.


> You have several kinds of GRO: one fast and one slow.
> The slow one uses a linked list of skbs (pinfo->frag_list), while the
> fast one uses fragments (pinfo->nr_frags).
>
> For example, some drivers (the mellanox one is in this lot) pull too many
> bytes into skb->head and this defeats the fast GRO:
> part of the payload is in skb->head, the remaining part in pinfo->frags[0].
>
> skb_gro_receive() then has to allocate a new head skb to link skbs into
> head->frag_list. The total skb->truesize is not reduced at all; it's
> increased.
>
> So you might think GRO is working, but it's only a hack, as one skb has a
> list of skbs, and this makes TCP read() slower, and defeats TCP
> coalescing as well. What's the point of delivering fat skbs to the TCP stack
> if it slows down the consumer because of increased cache line misses?

Shlomo is dealing with making the IPoIB driver work well with GRO; thanks
for the comments on the Mellanox Ethernet driver, we will look there too
(added Yevgeny)...

As for IPoIB, it has two modes: connected, which is irrelevant for this
discussion, and datagram, which is the one in scope here. Its MTU is
typically 2044 but can be 4092 as well; the allocation of skbs for this
mode is done in ipoib_alloc_rx_skb() -- which you've patched recently...

Following your comment we noted that with the lower/typical MTU of 2044,
which is below the ipoib_ud_need_sg() threshold, skbs are allocated in one
"form", and with the 4092 MTU in another "form". Do you see each of these
forms falling into a different GRO flow, e.g. 2044 into the "slow" one and
4092 into the "fast" one?

Or.


* Re: GRO aggregation
  2012-09-13 12:47                               ` Or Gerlitz
@ 2012-09-13 13:22                                 ` Eric Dumazet
  0 siblings, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2012-09-13 13:22 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Shlomo Pongartz, Rick Jones, netdev@vger.kernel.org, Tom Herbert,
	Yevgeny Petrilin

On Thu, 2012-09-13 at 15:47 +0300, Or Gerlitz wrote:
> Shlomo is dealing with making the IPoIB driver work well with GRO; thanks
> for the comments on the Mellanox Ethernet driver, we will look there too
> (added Yevgeny)...
>
> As for IPoIB, it has two modes: connected, which is irrelevant for this
> discussion, and datagram, which is the one in scope here. Its MTU is
> typically 2044 but can be 4092 as well; the allocation of skbs for this
> mode is done in ipoib_alloc_rx_skb() -- which you've patched recently...
>
> Following your comment we noted that with the lower/typical MTU of 2044,
> which is below the ipoib_ud_need_sg() threshold, skbs are allocated in one
> "form", and with the 4092 MTU in another "form". Do you see each of these
> forms falling into a different GRO flow, e.g. 2044 into the "slow" one and
> 4092 into the "fast" one?

Seems fine to me both ways, because you use dev_alloc_skb() and you
don't pull TCP payload into skb->head.

You might try adding prefetch() as well, to bring the IP/TCP headers
into the CPU cache before they are needed in the GRO layers.
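
A minimal sketch of what such a prefetch could look like in a driver RX
path (generic and hypothetical, not the actual IPoIB code; it assumes an
skb whose network headers start at skb->data, similar to the mlx4 patch
shown earlier in the thread):

#include <linux/prefetch.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>

/* Hypothetical helper, not taken from ipoib: warm the cache line(s)
 * holding the IP/TCP headers before handing the skb to the GRO layer,
 * so the GRO code does not stall on a cache miss. */
static void rx_deliver_with_prefetch(struct napi_struct *napi,
				     struct sk_buff *skb)
{
	prefetch(skb->data);
	prefetch(skb->data + 64);	/* headers may straddle a cache line */

	napi_gro_receive(napi, skb);
}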

