Netdev List


* Re: followup: what's responsible for setting netdev->operstate to IF_OPER_DOWN?
From: Robert P. J. Day @ 2018-08-27  6:22 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Linux kernel netdev mailing list
In-Reply-To: <20180826135006.157d1bc2@xeon-e3>

On Sun, 26 Aug 2018, Stephen Hemminger wrote:

> On Sun, 26 Aug 2018 11:14:33 -0400 (EDT)
> "Robert P. J. Day" <rpjday@crashcourse.ca> wrote:
>
> >   apologies for the constant pleas for assistance, but i think i'm
> > zeroing in on the problem that started all this. recap: custom
> > FPGA-based linux box with multiple ports, where the current symptom is
> > that there is no userspace notification when someone simply unplugs
> > one of the ports ("ifconfig" shows that interface still RUNNING).
> >
> >   as i read it, an active ethernet interface should be both UP (the
> > administrative state) and RUNNING (the RFC 2863-defined operational
> > state). if i unplug, i've verified on a standard net port on my laptop
> > that the interface is still UP, but no longer RUNNING, which makes
> > perfect sense. i plug back in, interface starts RUNNING again. so
> > where's the problem?
> >
> >   i can see that whether ifconfig shows an interface RUNNING is
> > defined in net/core/dev.c:
> >
> >   unsigned int dev_get_flags(const struct net_device *dev)
> >   {
> >         unsigned int flags;
> >
> >         flags = (dev->flags & ~(IFF_PROMISC |
> >                                 IFF_ALLMULTI |
> >                                 IFF_RUNNING |
> >                                 IFF_LOWER_UP |
> >                                 IFF_DORMANT)) |
> >                 (dev->gflags & (IFF_PROMISC |
> >                                 IFF_ALLMULTI));
> >
> >         if (netif_running(dev)) {
> >                 if (netif_oper_up(dev))
> >                         flags |= IFF_RUNNING;  <---- THERE
> >                 if (netif_carrier_ok(dev))
> >                         flags |= IFF_LOWER_UP;
> >                 if (netif_dormant(dev))
> >                         flags |= IFF_DORMANT;
> >         }
> >
> >         return flags;
> >   }
> >
> > where netif_oper_up() is defined as:
> >
> >   static inline bool netif_oper_up(const struct net_device *dev)
> >   {
> >         return (dev->operstate == IF_OPER_UP ||
> >                 dev->operstate == IF_OPER_UNKNOWN /* backward compat */);
> >   }
> >
> > so i am simply assuming that the underlying problem is that,
> > somewhere down below, the unplugging of a port is somehow not setting
> > dev->operstate to its proper value of IF_OPER_DOWN.
> >
> >   that would clearly explain everything, and i'm about to dig even
> > further to see where the event of unplugging a port *should* be
> > recognized, but does this sound like a reasonable diagnosis? there
> > have been other problems with the programming of the FPGA, so it would
> > surprise absolutely no one to learn that this aspect was
> > misprogrammed.
> >
> > rday
> >
>
> There is no reason drivers should ever muck with flags directly.
> You probably are looking for netif_detach

  i assume you mean netif_device_detach; i'll check into that.
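
  for reference, the flag logic quoted above can be modelled in plain
userspace C. this is only a sketch under stated assumptions: every
model_*/MODEL_* name is made up for illustration, admin_up stands in for
netif_running(), and the numeric values follow the RFC 2863 operstate
numbering and the usual IFF_UP/IFF_RUNNING bits. it shows why an
interface whose operstate never leaves "unknown" keeps reporting RUNNING
after an unplug:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative userspace model; all MODEL_/model_ names are hypothetical. */
enum {
	MODEL_OPER_UNKNOWN = 0,	/* nothing ever reported a state */
	MODEL_OPER_DOWN    = 2,
	MODEL_OPER_UP      = 6,
};

#define MODEL_IFF_UP      0x1u
#define MODEL_IFF_RUNNING 0x40u

/* Mirrors netif_oper_up(): UNKNOWN counts as up for backward compat. */
static bool model_oper_up(int operstate)
{
	/* RFC 2863 operational states are numbered 0..6 here. */
	assert(operstate >= MODEL_OPER_UNKNOWN && operstate <= MODEL_OPER_UP);
	return operstate == MODEL_OPER_UP || operstate == MODEL_OPER_UNKNOWN;
}

/* Mirrors only the IFF_RUNNING part of dev_get_flags(). */
static unsigned int model_get_flags(bool admin_up, int operstate)
{
	unsigned int flags = admin_up ? MODEL_IFF_UP : 0;

	if (admin_up && model_oper_up(operstate))
		flags |= MODEL_IFF_RUNNING;
	return flags;
}
```

  with this model, an administratively up interface reports RUNNING for
both MODEL_OPER_UP and MODEL_OPER_UNKNOWN, and only loses it once
something moves operstate to MODEL_OPER_DOWN.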

rday

-- 

========================================================================
Robert P. J. Day                                 Ottawa, Ontario, CANADA
                  http://crashcourse.ca/dokuwiki

Twitter:                                       http://twitter.com/rpjday
LinkedIn:                               http://ca.linkedin.com/in/rpjday
========================================================================


* Re: confusing comment, explanation of @IFF_RUNNING in if.h
From: Robert P. J. Day @ 2018-08-27  6:20 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Andrew Lunn, Linux kernel netdev mailing list
In-Reply-To: <20180826135144.11fd9a5f@xeon-e3>

On Sun, 26 Aug 2018, Stephen Hemminger wrote:

> On Sun, 26 Aug 2018 15:20:24 -0400 (EDT)
> "Robert P. J. Day" <rpjday@crashcourse.ca> wrote:
>
> > On Sun, 26 Aug 2018, Andrew Lunn wrote:
> >
> > > >   i ask since, in my testing, when the interface should have been
> > > > up, the attribute file "operstate" for that interface showed
> > > > "unknown", and i wondered how worried i should be about that.
> > >
> > > Hi Robert
> > >
> > > You should probably post the driver for review. A well written
> > > driver should not even need to care about any of this. phylib and
> > > the netdev driver code does all the work. It only gets interesting
> > > when you don't have a PHY, e.g. a stacked device, like bonding, or a
> > > virtual device like tun/tap.
> >
> >   i wish, but i'm on contract, and proprietary, and NDA and all that.
> > so i am reduced to crawling through the code, trying to figure out
> > what is misconfigured that is causing all this grief.
> >
> > rday
> >
>
> So you expect FOSS developers to help you with a proprietary licensed
> driver. Good Luck with that.

  sorry, i'm sure this will all be released upon production, just not
while it's in the midst of development.

rday

-- 

========================================================================
Robert P. J. Day                                 Ottawa, Ontario, CANADA
                  http://crashcourse.ca/dokuwiki

Twitter:                                       http://twitter.com/rpjday
LinkedIn:                               http://ca.linkedin.com/in/rpjday
========================================================================


* Re: [PATCH v2 2/2] can: rcar: use SPDX identifier for Renesas drivers
From: Marc Kleine-Budde @ 2018-08-27 10:00 UTC (permalink / raw)
  To: Wolfram Sang, linux-renesas-soc
  Cc: Kuninori Morimoto, Wolfgang Grandegger, David S. Miller,
	linux-can, netdev, linux-kernel
In-Reply-To: <20180823133456.4748-3-wsa+renesas@sang-engineering.com>



On 08/23/2018 03:34 PM, Wolfram Sang wrote:
> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>

Applied to linux-can-next. Please add a patch description to the patch.
My $UPSTREAM doesn't like empty patch descriptions :) I've shamelessly
used Fabio Estevam's patch description from his flexcan SPDX patch.

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |




* Re: [PATCH v2 01/29] nvmem: add support for cell lookups
From: Boris Brezillon @ 2018-08-27  9:00 UTC (permalink / raw)
  To: Bartosz Golaszewski
  Cc: Andrew Lunn, linux-doc, Sekhar Nori, Bartosz Golaszewski,
	Srinivas Kandagatla, linux-i2c, Mauro Carvalho Chehab,
	Rob Herring, Florian Fainelli, Kevin Hilman, Richard Weinberger,
	Russell King, Marek Vasut, Paolo Abeni, Dan Carpenter,
	Grygorii Strashko, David Lechner, Arnd Bergmann,
	Sven Van Asbroeck, "ope
In-Reply-To: <CAMRc=Men-MPk5DGshWcVEc0v=gH2WSpx1j-CawOeydwp59tejw@mail.gmail.com>

On Mon, 27 Aug 2018 10:56:29 +0200
Bartosz Golaszewski <brgl@bgdev.pl> wrote:

> 2018-08-25 8:27 GMT+02:00 Boris Brezillon <boris.brezillon@bootlin.com>:
> > On Fri, 24 Aug 2018 17:27:40 +0200
> > Andrew Lunn <andrew@lunn.ch> wrote:
> >  
> >> On Fri, Aug 24, 2018 at 05:08:48PM +0200, Boris Brezillon wrote:  
> >> > Hi Bartosz,
> >> >
> >> > On Fri, 10 Aug 2018 10:04:58 +0200
> >> > Bartosz Golaszewski <brgl@bgdev.pl> wrote:
> >> >  
> >> > > +struct nvmem_cell_lookup {
> >> > > + struct nvmem_cell_info  info;
> >> > > + struct list_head        list;
> >> > > + const char              *nvmem_name;
> >> > > +};  
> >> >
> >> > Hm, maybe I don't get it right, but this looks suspicious. Usually the
> >> > consumer lookup table is here to attach device specific names to
> >> > external resources.
> >> >
> >> > So what I'd expect here is:
> >> >
> >> > struct nvmem_cell_lookup {
> >> >     /* The nvmem device name. */
> >> >     const char *nvmem_name;
> >> >
> >> >     /* The nvmem cell name */
> >> >     const char *nvmem_cell_name;
> >> >
> >> >     /*
> >> >      * The local resource name. Basically what you have in the
> >> >      * nvmem-cell-names prop.
> >> >      */
> >> >     const char *conid;
> >> > };
> >> >
> >> > struct nvmem_cell_lookup_table {
> >> >     struct list_head list;
> >> >
> >> >     /* ID of the consumer device. */
> >> >     const char *devid;
> >> >
> >> >     /* Array of cell lookup entries. */
> >> >     unsigned int ncells;
> >> >     const struct nvmem_cell_lookup *cells;
> >> > };
> >> >
> >> > Looks like your nvmem_cell_lookup is more something used to attach cells
> >> > to an nvmem device, which is NVMEM provider's responsibility not the
> >> > consumer one.  
> >>
> >> Hi Boris
> >>
> >> There are cases where there is not a clear provider/consumer split. I
> >> have an x86 platform, with a few at24 EEPROMs on it. It uses an off
> >> the shelf Komtron module, placed on a custom carrier board. One of the
> >> EEPROMs contains the hardware variant information. Once i know the
> >> variant, i need to instantiate other I2C, SPI, MDIO devices, all using
> >> platform devices, since this is x86, no DT available.
> >>
> >> So the first thing my x86 platform device does is instantiate the
> >> first i2c device for the AT24. Once the EEPROM pops into existence, i
> >> need to add nvmem cells onto it. So at that point, the x86 platform
> >> driver is playing the provider role. Once the cells are added, i can
> >> then use nvmem consumer interfaces to get the contents of the cell,
> >> run a checksum, and instantiate the other devices.
> >>
> >> I wish the embedded world was all DT, but the reality is that it is
> >> not :-(  
> >
> > Actually, I'm not questioning the need for this feature (being able to
> > attach NVMEM cells to an NVMEM device on a platform that does not use
> > DT). What I'm saying is that this functionality is provider related,
> > not consumer related. Also, I wonder if defining such NVMEM cells
> > shouldn't go through the provider driver instead of being passed
> > directly to the NVMEM layer, because nvmem_config already has fields
> > to pass cells at registration time, plus, the name of the NVMEM cell
> > device is sometimes created dynamically and can be hard to guess at
> > platform_device registration time.
> >  
> 
> In my use case the provider is at24 EEPROM driver. This is where the
> nvmem_config lives but I can't imagine a correct and clean way of
> passing this cell config to the driver from board files without using
> new ugly fields in platform_data which this very series is trying to
> remove. This is why this cell config should live in machine code.

Okay.

> 
> > I also think non-DT consumers will need a way to reference existing
> > NVMEM cells, but this consumer-oriented nvmem cell lookup table should
> > look like the gpio or pwm lookup table (basically what I proposed in my
> > previous email).  
> 
> How about introducing two new interfaces to nvmem: one for defining
> nvmem cells from machine code and the second for connecting these
> cells with devices?

Yes, that's basically what I was suggesting: move what you've done in
nvmem-provider.h (maybe rename some of the structs to make it clear
that this is about defining cells not referencing existing ones), and
add a new consumer interface (based on what other subsystems do) in
nvmem-consumer.h.

This way you have both things clearly separated, and if a driver is
both a consumer and a provider you'll just have to include both headers.
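
To make the consumer side concrete, here is a rough userspace sketch of
such a lookup, modelled on the structs proposed earlier in this thread.
It is not an existing nvmem API: the sketch_* names, the device IDs and
the cell names are all assumptions made for illustration, and the list
handling a real implementation would need is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical consumer-side entry, following the proposed layout. */
struct sketch_cell_lookup {
	const char *nvmem_name;      /* provider (nvmem device) name */
	const char *nvmem_cell_name; /* cell name inside that device */
	const char *conid;           /* consumer-local resource name */
};

struct sketch_cell_lookup_table {
	const char *devid;           /* ID of the consumer device */
	unsigned int ncells;
	const struct sketch_cell_lookup *cells;
};

/* Return the entry matching (devid, conid), or NULL if there is none. */
static const struct sketch_cell_lookup *
sketch_cell_find(const struct sketch_cell_lookup_table *tables,
		 size_t ntables, const char *devid, const char *conid)
{
	assert(devid != NULL && conid != NULL);

	for (size_t t = 0; t < ntables; t++) {
		if (strcmp(tables[t].devid, devid) != 0)
			continue;
		for (unsigned int c = 0; c < tables[t].ncells; c++) {
			if (strcmp(tables[t].cells[c].conid, conid) == 0)
				return &tables[t].cells[c];
		}
	}
	return NULL;
}
```

The point of keying on (devid, conid) is that the consumer only ever
uses its own local name; which nvmem device and cell back that name is
decided in machine code, as with the gpio and pwm lookup tables.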

Regards,

Boris


* Re: [PATCH v2 01/29] nvmem: add support for cell lookups
From: Bartosz Golaszewski @ 2018-08-27  8:56 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Andrew Lunn, linux-doc, Sekhar Nori, Bartosz Golaszewski,
	Srinivas Kandagatla, linux-i2c, Mauro Carvalho Chehab,
	Rob Herring, Florian Fainelli, Kevin Hilman, Richard Weinberger,
	Russell King, Marek Vasut, Paolo Abeni, Dan Carpenter,
	Grygorii Strashko, David Lechner, Arnd Bergmann,
	Sven Van Asbroeck, "ope
In-Reply-To: <20180825082722.567e8c9a@bbrezillon>

2018-08-25 8:27 GMT+02:00 Boris Brezillon <boris.brezillon@bootlin.com>:
> On Fri, 24 Aug 2018 17:27:40 +0200
> Andrew Lunn <andrew@lunn.ch> wrote:
>
>> On Fri, Aug 24, 2018 at 05:08:48PM +0200, Boris Brezillon wrote:
>> > Hi Bartosz,
>> >
>> > On Fri, 10 Aug 2018 10:04:58 +0200
>> > Bartosz Golaszewski <brgl@bgdev.pl> wrote:
>> >
>> > > +struct nvmem_cell_lookup {
>> > > + struct nvmem_cell_info  info;
>> > > + struct list_head        list;
>> > > + const char              *nvmem_name;
>> > > +};
>> >
>> > Hm, maybe I don't get it right, but this looks suspicious. Usually the
>> > consumer lookup table is here to attach device specific names to
>> > external resources.
>> >
>> > So what I'd expect here is:
>> >
>> > struct nvmem_cell_lookup {
>> >     /* The nvmem device name. */
>> >     const char *nvmem_name;
>> >
>> >     /* The nvmem cell name */
>> >     const char *nvmem_cell_name;
>> >
>> >     /*
>> >      * The local resource name. Basically what you have in the
>> >      * nvmem-cell-names prop.
>> >      */
>> >     const char *conid;
>> > };
>> >
>> > struct nvmem_cell_lookup_table {
>> >     struct list_head list;
>> >
>> >     /* ID of the consumer device. */
>> >     const char *devid;
>> >
>> >     /* Array of cell lookup entries. */
>> >     unsigned int ncells;
>> >     const struct nvmem_cell_lookup *cells;
>> > };
>> >
>> > Looks like your nvmem_cell_lookup is more something used to attach cells
>> > to an nvmem device, which is NVMEM provider's responsibility not the
>> > consumer one.
>>
>> Hi Boris
>>
>> There are cases where there is not a clear provider/consumer split. I
>> have an x86 platform, with a few at24 EEPROMs on it. It uses an off
>> the shelf Komtron module, placed on a custom carrier board. One of the
>> EEPROMs contains the hardware variant information. Once i know the
>> variant, i need to instantiate other I2C, SPI, MDIO devices, all using
>> platform devices, since this is x86, no DT available.
>>
>> So the first thing my x86 platform device does is instantiate the
>> first i2c device for the AT24. Once the EEPROM pops into existence, i
>> need to add nvmem cells onto it. So at that point, the x86 platform
>> driver is playing the provider role. Once the cells are added, i can
>> then use nvmem consumer interfaces to get the contents of the cell,
>> run a checksum, and instantiate the other devices.
>>
>> I wish the embedded world was all DT, but the reality is that it is
>> not :-(
>
> Actually, I'm not questioning the need for this feature (being able to
> attach NVMEM cells to an NVMEM device on a platform that does not use
> DT). What I'm saying is that this functionality is provider related,
> not consumer related. Also, I wonder if defining such NVMEM cells
> shouldn't go through the provider driver instead of being passed
> directly to the NVMEM layer, because nvmem_config already has fields
> to pass cells at registration time, plus, the name of the NVMEM cell
> device is sometimes created dynamically and can be hard to guess at
> platform_device registration time.
>

In my use case the provider is at24 EEPROM driver. This is where the
nvmem_config lives but I can't imagine a correct and clean way of
passing this cell config to the driver from board files without using
new ugly fields in platform_data which this very series is trying to
remove. This is why this cell config should live in machine code.

> I also think non-DT consumers will need a way to reference existing
> NVMEM cells, but this consumer-oriented nvmem cell lookup table should
> look like the gpio or pwm lookup table (basically what I proposed in my
> previous email).

How about introducing two new interfaces to nvmem: one for defining
nvmem cells from machine code and the second for connecting these
cells with devices?

Best regards,
Bart


* Re: oops with ip6_rt_cache_alloc
From: Yonghong Song @ 2018-08-27  4:57 UTC (permalink / raw)
  To: David Ahern, netdev, Alexei Starovoitov, Martin Lau, Dave Jones
In-Reply-To: <2314c9c2-27ab-c470-5e8a-4e28e53810b2@gmail.com>



On 8/24/18 4:04 PM, David Ahern wrote:
> On 8/24/18 4:26 PM, Yonghong Song wrote:
>> Hi,
>>
>> We got a kernel oops with the following stack trace:
>>
>> CPU: 24 PID: 0 Comm: swapper/24 Not tainted
>> 4.16.0-10_fbk1_1183_g7e4ee4c8171c #10
>> "Hardware name: Quanta Leopard-DDR3/Leopard-DDR3, BIOS F06_3A16.DDR3
>> 11/19/2015"
>> RIP: 0010:ip6_rt_get_dev_rcu+0x6/0x60
>> RSP: 0018:ffff88046fb03c78 EFLAGS: 00010286
>> RAX: 0000000040000003 RBX: ffff88035a6c1500 RCX: ffffffff81ec5dc0
>> RDX: ffff88033192a090 RSI: ffff88033192a0a0 RDI: 0000000000000000
> 
> RDI = 0 means the rt passed to ip6_rt_get_dev_rcu is NULL. I believe
> that can't happen prior to the fib6_info changes. After the fib6_info
> changes, it means the 'from' is NULL and that is not expected.
> 
> ...
> 
>> Our internal experiments showed that an early version of 4.16 works fine,
>> and the above problem showed up after backporting some ipv6 route
>> related changes.
> 
> Can you run the test on 4.18?

We will give it a try with 4.18. Thanks.


* Re: [RFC v3 net-next 3/5] ebpf: fix bpf_msg_pull_data
From: Tushar Dave @ 2018-08-27  4:45 UTC (permalink / raw)
  To: John Fastabend, ast, daniel, davem, sowmini.varadhan,
	santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
	sandipan, kafai, rdna, yhs, netdev
In-Reply-To: <e3e03edf-9771-7660-b27c-fc28be55c644@gmail.com>



On 08/24/2018 06:02 PM, John Fastabend wrote:
> On 08/17/2018 04:08 PM, Tushar Dave wrote:
>> Like sockmap (sk_msg), socksg also deals with struct scatterlist
>> therefore socksg programs can use existing bpf helper bpf_msg_pull_data
>> to access packet data contained in struct scatterlist. While doing some
>> preliminary testing, a couple of issues were found with
>> bpf_msg_pull_data that are fixed in this patch.
>>
>> Also, there cannot be more than MAX_SKB_FRAGS entries in sg_data
>> therefore any checks for sg entry more than MAX_SKB_FRAGS in
>> bpf_msg_pull_data() are removed.
> 
> In sockmap the scatterlist is used as a ring so the MAX_SKB_FRAGS
> check is needed to keep searching through the ring when sg_start
> is non-zero.

Okay.

> 
>>
>> Besides that, I also ran into issues while put_page() is invoked.
>> e.g.
>> [ 450.568723] BUG: Bad page state in process swapper/10 pfn:2021540
>> [ 450.575632] page:ffffea0080855000 count:0 mapcount:0
>> mapping:ffff88103d006840 index:0xffff882021540000 compound_mapcount: 0
>> [ 450.588069] flags: 0x6fffff80008100(slab|head)
>> [ 450.593033] raw: 006fffff80008100 dead000000000100 dead000000000200
>> ffff88103d006840
>> [ 450.601683] raw: ffff882021540000 0000000080080007 00000000ffffffff
>> 0000000000000000
>> [ 450.610337] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
>> [ 450.617530] bad because of flags: 0x100(slab)
>>
>> To avoid above issue, currently put_page() is disabled in this patch
>> temporarily. I am working on alternatives so that page allocated via
>> slab (in this case) can be freed without any issue.
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>>   net/core/filter.c | 61 +++++++++++++++++++++++++++++--------------------------
>>   1 file changed, 32 insertions(+), 29 deletions(-)
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index e427c8e..cc52baa 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2316,7 +2316,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>   BPF_CALL_4(bpf_msg_pull_data,
>>   	   struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
>>   {
>> -	unsigned int len = 0, offset = 0, copy = 0;
>> +	unsigned int len = 0, offset = 0, copy = 0, off = 0;
>>   	struct scatterlist *sg = msg->sg_data;
>>   	int first_sg, last_sg, i, shift;
>>   	unsigned char *p, *to, *from;
>> @@ -2330,22 +2330,28 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>   	i = msg->sg_start;
>>   	do {
>>   		len = sg[i].length;
>> -		offset += len;
>>   		if (start < offset + len)
>>   			break;
>> +		offset += len;
> 
> This looks like a generic fix unrelated to this series.
> Can you send that as a bugfix?

Okay.

> 
>>   		i++;
>> -		if (i == MAX_SKB_FRAGS)
>> -			i = 0;
>> -	} while (i != msg->sg_end);
>> +	} while (i <= msg->sg_end);
>>   
> 
> As noted above the MAX_SKB_FRAGS check is needed because
> sg_start can be non-zero and sg_end < sg_start. In these
> cases we need to search the entries at the start of the
> array (being used as a ring).

Yup!
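
For reference, the wrap-around search can be modelled in userspace as
below. This is only a sketch, not the code in net/core/filter.c:
sketch_ring_find and SKETCH_RING_SLOTS are made-up names, the slot count
is a stand-in for MAX_SKB_FRAGS, and a plain array of lengths replaces
struct scatterlist. It keeps the ring wrap and also tests the range
before advancing the running offset, which is the ordering fix discussed
above.

```c
#include <assert.h>

/* Stand-in for MAX_SKB_FRAGS; the real value depends on the kernel. */
#define SKETCH_RING_SLOTS 17

/*
 * Find the ring slot holding byte 'start'. Slots in use run from
 * sg_start up to, but not including, sg_end, wrapping at the array end.
 * On success return the slot index and store the number of bytes that
 * precede that slot in *offset; return -1 if 'start' is out of range.
 */
static int sketch_ring_find(const unsigned int *len, unsigned int sg_start,
			    unsigned int sg_end, unsigned int start,
			    unsigned int *offset)
{
	unsigned int i = sg_start, off = 0;

	assert(sg_start < SKETCH_RING_SLOTS && sg_end < SKETCH_RING_SLOTS);
	do {
		if (start < off + len[i]) {	/* test before advancing */
			*offset = off;
			return (int)i;
		}
		off += len[i];
		if (++i == SKETCH_RING_SLOTS)	/* the array is used as a ring */
			i = 0;
	} while (i != sg_end);

	return -1;
}
```

With data in slots 15, 16 and 0 (so sg_start = 15 and sg_end = 1), a
search that stops at the end of the array instead of wrapping would
never reach slot 0.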

> 
>> +	/* return error if start is out of range */
>>   	if (unlikely(start >= offset + len))
>>   		return -EINVAL;
>>   
>> -	if (!msg->sg_copy[i] && bytes <= len)
>> -		goto out;
>> +	/* return error if i is last entry in sglist and end is out of range */
>> +	if (msg->sg_copy[i] && end > offset + len)
>> +		return -EINVAL;
>>   
>>   	first_sg = i;
>>   
>> +	/* if i is not last entry in sg list and end (i.e start + bytes) is
>> +	 * within this sg[i] then goto out and calculate data and data_end
>> +	 */
>> +	if (!msg->sg_copy[i] && end <= offset + len)
>> +		goto out;
>> +
>>   	/* At this point we need to linearize multiple scatterlist
>>   	 * elements or a single shared page. Either way we need to
>>   	 * copy into a linear buffer exclusively owned by BPF. Then
>> @@ -2359,11 +2365,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>   	do {
>>   		copy += sg[i].length;
>>   		i++;
>> -		if (i == MAX_SKB_FRAGS)
>> -			i = 0;
> 
> same as above, need to keep.

Yup!

> 
>> -		if (bytes < copy)
>> +		if (end < copy)
>>   			break;
>> -	} while (i != msg->sg_end);
>> +	} while (i <= msg->sg_end);
>> +
>> +	/* return error if i is last entry in sglist and end is out of range */
>> +	if (i > msg->sg_end && end > offset + copy)
>> +		return -EINVAL;
>> +
>>   	last_sg = i;
>>   
>>   	if (unlikely(copy < end - start))
>> @@ -2373,23 +2382,25 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>   	if (unlikely(!page))
>>   		return -ENOMEM;
>>   	p = page_address(page);
>> -	offset = 0;
>>   
>>   	i = first_sg;
>>   	do {
>>   		from = sg_virt(&sg[i]);
>>   		len = sg[i].length;
>> -		to = p + offset;
>> +		to = p + off;
> 
> Not really sure if the change from offset->off is needed. Looks
> like it just makes a bigger diff.

We need both offset and off because they both are used for different
calculations!

'offset' is used to calculate 'msg->data',
i.e. msg->data = sg_virt(&sg[first_sg]) + start - offset

'off', on the other hand, is used when we linearize the sg.

> 
>>   
>>   		memcpy(to, from, len);
>> -		offset += len;
>> +		off += len;
>>   		sg[i].length = 0;
>> -		put_page(sg_page(&sg[i]));
>> +		/* if original page is allocated via slab then put_page
>> +		 * causes error BUG: Bad page state in process. So temporarily
>> +		 * disabled put_page.
>> +		 * Todo: fix it
>> +		 */
>> +		//put_page(sg_page(&sg[i]));

As I said in the commit message, put_page() causes the error "BUG: Bad
page state in process ..." when used for RDS.
Any clue? Have you seen something like this with sockmap?


>>   
>>   		i++;
>> -		if (i == MAX_SKB_FRAGS)
>> -			i = 0;
>> -	} while (i != last_sg);
>> +	} while (i < last_sg);
>>   
>>   	sg[first_sg].length = copy;
>>   	sg_set_page(&sg[first_sg], page, copy, 0);
>> @@ -2406,12 +2417,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>   	do {
>>   		int move_from;
>>   
>> -		if (i + shift >= MAX_SKB_FRAGS)
>> -			move_from = i + shift - MAX_SKB_FRAGS;
>> -		else
>> -			move_from = i + shift;
>> -
> 
> Need to keep same as above.
yup!

> 
>> -		if (move_from == msg->sg_end)
>> +		move_from = i + shift;
>> +		if (move_from > msg->sg_end)
>>   			break;
>>   
>>   		sg[i] = sg[move_from];
>> @@ -2420,14 +2427,10 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>   		sg[move_from].offset = 0;
>>   
>>   		i++;
>> -		if (i == MAX_SKB_FRAGS)
>> -			i = 0;
>>   	} while (1);
>>   	msg->sg_end -= shift;
>> -	if (msg->sg_end < 0)
>> -		msg->sg_end += MAX_SKB_FRAGS;
>>   out:
>> -	msg->data = sg_virt(&sg[i]) + start - offset;
>> +	msg->data = sg_virt(&sg[first_sg]) + start - offset;
>>   	msg->data_end = msg->data + bytes;
>>   
>>   	return 0;
>>
> 
> Thanks,
> John
> 

Thanks.
-Tushar


* Re: [PATCH] net: sched: Fix memory exposure from short TCA_U32_SEL
From: Julia Lawall @ 2018-08-27  4:41 UTC (permalink / raw)
  To: Al Viro
  Cc: Joe Perches, Kees Cook, LKML, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, David S. Miller, Network Development
In-Reply-To: <20180827040423.GB6515@ZenIV.linux.org.uk>



On Mon, 27 Aug 2018, Al Viro wrote:

> On Sun, Aug 26, 2018 at 11:35:17PM -0400, Julia Lawall wrote:
>
> > * x = \(kmalloc\|kzalloc\|devm_kmalloc\|devm_kzalloc\)(...)
>
> I can name several you've missed right off the top of my head -
> vmalloc, kvmalloc, kmem_cache_alloc, kmem_cache_zalloc, variants
> with _trace slapped on, and that is not to mention the things like
> get_free_page or

OK, maybe for a given type the set of functions would be smaller.

>
> void *my_k3wl_alloc(u64 n) // 'cause all artificial limits suck, that's why
> {
> 	lots and lots of home-grown stats collection
> 	some tracepoints thrown in just for fun
> 	return kmalloc(n);
> }
>
> (and no, I'm not implying that net/sched folks had done anything of that
> sort; I have seen that and worse in drivers, though)
>
> > The * at the beginning of the line means to highlight what you are looking
> > for, which is done by making a diff in which the highlighted line
> > appears to be removed.
>
> Umm...  Does that cover return, BTW?  Or something like
> 	T *barf;
> 	extern void foo(T *p);
> 	foo(kmalloc(sizeof(*barf)));

It only covers the pattern that is shown, ie an assignment.  For this,
another pattern would be needed.  It would be necessary to match first the
call that one is concerned with and then go find the function definition
or prototype to find the type of the associated parameter.  It is possible
to count the offset of the kmalloc call in the argument list and then get
the type at the corresponding offset in the parameter list of the function
declaration or prototype.

>
>
> > The limitation is the ability to figure out the type of x.  If it is a
> > local variable, Coccinelle should have no problem.  If it is a structure
> > field, it may be necessary to provide command line arguments like
> >
> > --all-includes --include-headers-for-types
> >
> > --all-includes means to try to find all include files that are mentioned
> > in the .c file.  The next stronger option is --recursive-includes, which
> > means include what all of the mentioned files include as well,
> > recursively.  This tends to cause a major performance hit, because a lot
> > of code is being parsed.  --include-headers-for-types helps a bit with
> > that, as it only considers the header files when computing type
> > information, and not when applying the rules.
> >
> > With respect to ifdefs around variable declarations and structure field
> > declaration, in these cases Coccinelle considers that it cannot make the
> > ifdef have an if-like control flow, and so it considers the #ifdef, #else
> > and #endif to be comments.  Thus it takes into account only the last type
> > provided for a given variable.
>
> [snip]
>
> What about several variants of structure definition?  Because ifdefs around
> includes do occur in the wild...

Such ifdefs would be ignored completely.  I suspect that only the last
definition of the structure would be taken into account.

julia


* [PATCH v2 iproute2-next 3/3] q_netem: slotting with non-uniform distribution
From: Yousuk Seung @ 2018-08-27  2:42 UTC (permalink / raw)
  To: netdev
  Cc: Stephen Hemminger, David Ahern, Michael McLennan, Priyaranjan Jha,
	Yousuk Seung, Neal Cardwell, Dave Taht
In-Reply-To: <20180827024230.246445-1-ysseung@google.com>

Extend slotting with support for non-uniform distributions. This is
similar to netem's non-uniform distribution delay feature.

Syntax:
   slot distribution DISTRIBUTION DELAY JITTER [packets MAX_PACKETS] \
      [bytes MAX_BYTES]

The syntax and use of the distribution table is the same as in the
non-uniform distribution delay feature. A file DISTRIBUTION must be
present in TC_LIB_DIR (e.g. /usr/lib/tc) containing numbers scaled by
NETEM_DIST_SCALE. A random value x is selected from the table and it
takes DELAY + ( x * JITTER ) as delay. Correlation between values is not
supported.
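
As a rough illustration of the arithmetic only (not the code in this
patch or in the kernel), the delay for one table entry can be sketched
as below. sketch_slot_delay and SKETCH_DIST_SCALE are made-up names; the
scale value is assumed to match NETEM_DIST_SCALE (8192), and dividing by
it is how "scaled by NETEM_DIST_SCALE" is read here. Times are taken to
be in nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match NETEM_DIST_SCALE in linux/pkt_sched.h. */
#define SKETCH_DIST_SCALE 8192

/*
 * Turn one distribution table entry into a slot delay. 'x' is a signed
 * 16-bit table value scaled by SKETCH_DIST_SCALE, so x equal to
 * SKETCH_DIST_SCALE means "one JITTER above DELAY". No overflow
 * protection is attempted for very large jitter values.
 */
static int64_t sketch_slot_delay(int64_t delay_ns, int64_t jitter_ns,
				 int16_t x)
{
	assert(jitter_ns > 0);	/* the parser rejects non-positive jitter */
	return delay_ns + (jitter_ns * x) / SKETCH_DIST_SCALE;
}
```

For the first example below (800us delay, 100us jitter), a table value
of 0 gives exactly the base delay, and values of plus or minus
SKETCH_DIST_SCALE move it by one jitter in either direction.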

Examples:
  Normal distribution delay with mean = 800us and stdev = 100us.
  > tc qdisc add dev eth0 root netem slot distribution normal \
    800us 100us

  Optionally set the max slot size in bytes and/or packets.
  > tc qdisc add dev eth0 root netem slot distribution normal \
    800us 100us bytes 64k packets 42

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Dave Taht <dave.taht@gmail.com>
---
 man/man8/tc-netem.8 | 20 ++++++++----
 tc/q_netem.c        | 77 +++++++++++++++++++++++++++++++++++++--------
 2 files changed, 78 insertions(+), 19 deletions(-)

diff --git a/man/man8/tc-netem.8 b/man/man8/tc-netem.8
index 8d485b026751..111109cf042f 100644
--- a/man/man8/tc-netem.8
+++ b/man/man8/tc-netem.8
@@ -53,9 +53,13 @@ NetEm \- Network Emulator
 .IR RATE " [ " PACKETOVERHEAD " [ " CELLSIZE " [ " CELLOVERHEAD " ]]]]"
 
 .IR SLOT " := "
-.BR slot
-.IR MIN_DELAY " [ " MAX_DELAY " ] ["
-.BR packets
+.BR slot " { "
+.IR MIN_DELAY " [ " MAX_DELAY " ] |"
+.br
+.RB "               " distribution " { "uniform " | " normal " | " pareto " | " paretonormal " | "
+.IR FILE " } " DELAY " " JITTER " } "
+.br
+.RB "             [ " packets
 .IR PACKETS " ] [ "
 .BR bytes
 .IR BYTES " ]"
@@ -172,9 +176,13 @@ an artificial packet compression (bursts). Another influence factor are network
 adapter buffers which can also add artificial delay.
 
 .SS slot
-defer delivering accumulated packets to within a slot, with each available slot
-configured with a minimum delay to acquire, and an optional maximum delay.  Slot
-delays can be specified in nanoseconds, microseconds, milliseconds or seconds
+defer delivering accumulated packets to within a slot. Each available slot can be
+configured with a minimum delay to acquire, and an optional maximum delay.
+Alternatively it can be configured with the distribution similar to
+.BR distribution
+for
+.BR delay
+option. Slot delays can be specified in nanoseconds, microseconds, milliseconds or seconds
 (e.g. 800us). Values for the optional parameters
 .I BYTES
 will limit the number of bytes delivered per slot, and/or
diff --git a/tc/q_netem.c b/tc/q_netem.c
index 53a7a1056f5d..5bfdfcd5478c 100644
--- a/tc/q_netem.c
+++ b/tc/q_netem.c
@@ -43,7 +43,9 @@ static void explain(void)
 "                 [ rate RATE [PACKETOVERHEAD] [CELLSIZE] [CELLOVERHEAD]]\n" \
 "                 [ slot MIN_DELAY [MAX_DELAY] [packets MAX_PACKETS]" \
 " [bytes MAX_BYTES]]\n" \
-		);
+"                 [ slot distribution" \
+" {uniform|normal|pareto|paretonormal|custom} DELAY JITTER" \
+" [packets MAX_PACKETS] [bytes MAX_BYTES]]\n");
 }
 
 static void explain1(const char *arg)
@@ -159,6 +161,7 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
 			   struct nlmsghdr *n, const char *dev)
 {
 	int dist_size = 0;
+	int slot_dist_size = 0;
 	struct rtattr *tail;
 	struct tc_netem_qopt opt = { .limit = 1000 };
 	struct tc_netem_corr cor = {};
@@ -169,6 +172,7 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
 	struct tc_netem_rate rate = {};
 	struct tc_netem_slot slot = {};
 	__s16 *dist_data = NULL;
+	__s16 *slot_dist_data = NULL;
 	__u16 loss_type = NETEM_LOSS_UNSPEC;
 	int present[__TCA_NETEM_MAX] = {};
 	__u64 rate64 = 0;
@@ -417,21 +421,55 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
 				}
 			}
 		} else if (matches(*argv, "slot") == 0) {
-			NEXT_ARG();
-			present[TCA_NETEM_SLOT] = 1;
-			if (get_time64(&slot.min_delay, *argv)) {
-				explain1("slot min_delay");
-				return -1;
-			}
 			if (NEXT_IS_NUMBER()) {
 				NEXT_ARG();
-				if (get_time64(&slot.max_delay, *argv) ||
-				    slot.max_delay < slot.min_delay) {
-					explain1("slot max_delay");
+				present[TCA_NETEM_SLOT] = 1;
+				if (get_time64(&slot.min_delay, *argv)) {
+					explain1("slot min_delay");
 					return -1;
 				}
+				if (NEXT_IS_NUMBER()) {
+					NEXT_ARG();
+					if (get_time64(&slot.max_delay, *argv) ||
+					    slot.max_delay < slot.min_delay) {
+						explain1("slot max_delay");
+						return -1;
+					}
+				} else {
+					slot.max_delay = slot.min_delay;
+				}
 			} else {
-				slot.max_delay = slot.min_delay;
+				NEXT_ARG();
+				if (strcmp(*argv, "distribution") == 0) {
+					present[TCA_NETEM_SLOT] = 1;
+					NEXT_ARG();
+					slot_dist_data = calloc(sizeof(slot_dist_data[0]), MAX_DIST);
+					if (!slot_dist_data)
+						return -1;
+					slot_dist_size = get_distribution(*argv, slot_dist_data, MAX_DIST);
+					if (slot_dist_size <= 0) {
+						free(slot_dist_data);
+						return -1;
+					}
+					NEXT_ARG();
+					if (get_time64(&slot.dist_delay, *argv)) {
+						explain1("slot delay");
+						return -1;
+					}
+					NEXT_ARG();
+					if (get_time64(&slot.dist_jitter, *argv)) {
+						explain1("slot jitter");
+						return -1;
+					}
+					if (slot.dist_jitter <= 0) {
+						fprintf(stderr, "Non-positive jitter\n");
+						return -1;
+					}
+				} else {
+					fprintf(stderr, "Unknown slot parameter: %s\n",
+						*argv);
+					return -1;
+				}
 			}
 			if (NEXT_ARG_OK() &&
 			    matches(*(argv+1), "packets") == 0) {
@@ -559,6 +597,14 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
 			return -1;
 		free(dist_data);
 	}
+
+	if (slot_dist_data) {
+		if (addattr_l(n, MAX_DIST * sizeof(slot_dist_data[0]),
+			      TCA_NETEM_SLOT_DIST,
+			      slot_dist_data, slot_dist_size * sizeof(slot_dist_data[0])) < 0)
+			return -1;
+		free(slot_dist_data);
+	}
 	tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail;
 	return 0;
 }
@@ -713,8 +759,13 @@ static int netem_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	}
 
 	if (slot) {
-		fprintf(f, " slot %s", sprint_time64(slot->min_delay, b1));
-		fprintf(f, " %s", sprint_time64(slot->max_delay, b1));
+		if (slot->dist_jitter > 0) {
+		    fprintf(f, " slot distribution %s", sprint_time64(slot->dist_delay, b1));
+		    fprintf(f, " %s", sprint_time64(slot->dist_jitter, b1));
+		} else {
+		    fprintf(f, " slot %s", sprint_time64(slot->min_delay, b1));
+		    fprintf(f, " %s", sprint_time64(slot->max_delay, b1));
+		}
 		if(slot->max_packets)
 			fprintf(f, " packets %d", slot->max_packets);
 		if(slot->max_bytes)
-- 
2.19.0.rc0.228.g281dcd1b4d0-goog

* [PATCH v2 iproute2-next 2/3] q_netem: support delivering packets in delayed time slots
From: Yousuk Seung @ 2018-08-27  2:42 UTC (permalink / raw)
  To: netdev
  Cc: Stephen Hemminger, David Ahern, Michael McLennan, Priyaranjan Jha,
	Dave Taht, Yousuk Seung, Neal Cardwell
In-Reply-To: <20180827024230.246445-1-ysseung@google.com>

From: Dave Taht <dave.taht@gmail.com>

Slotting is a crude approximation of the behaviors of shared media such
as cable, wifi, and LTE, which gather up a bunch of packets within a
varying delay window and deliver them, relative to that, nearly all at
once.

It works within the existing loss, duplication, jitter and delay
parameters of netem. Some amount of inherent latency must be specified,
regardless.

The new "slot" parameter specifies a minimum and maximum delay between
transmission attempts.

The "bytes" and "packets" parameters can be used to limit the amount of
information transferred per slot.

Examples of use:

tc qdisc add dev eth0 root netem delay 200us \
        slot 800us 10ms bytes 64k packets 42

A more correct example, using stacked netem instances and a packet limit
to emulate a tail drop wifi queue with slots and variable packet
delivery, with a 200Mbit isochronous underlying rate, and 20ms path
delay:

tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \
         limit 10000
tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \
         slot 800us 10ms bytes 64k packets 42 limit 512
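
The slot behavior described above can be sketched in user space. This is
only a simplified model under stated assumptions, not the kernel's netem
code: struct slot_model, slot_open() and slot_deliver() are hypothetical
names, the random value is passed in by the caller, and a budget of 0 is
treated as "unlimited".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of netem slotting; field names are illustrative. */
struct slot_model {
	int64_t min_delay_ns;   /* shortest wait before the next slot opens */
	int64_t max_delay_ns;   /* longest wait before the next slot opens */
	int32_t max_packets;    /* per-slot packet budget, 0 = unlimited */
	int32_t max_bytes;      /* per-slot byte budget, 0 = unlimited */
	int64_t slot_next_ns;   /* time at which the current slot opens */
	int32_t packets_left;
	int32_t bytes_left;
};

/* Open a new slot after now_ns; rnd is a caller-supplied value in [0, 1). */
static void slot_open(struct slot_model *s, int64_t now_ns, double rnd)
{
	int64_t range = s->max_delay_ns - s->min_delay_ns;

	assert(range >= 0 && rnd >= 0.0 && rnd < 1.0);
	s->slot_next_ns = now_ns + s->min_delay_ns + (int64_t)(rnd * (double)range);
	s->packets_left = s->max_packets ? s->max_packets : INT32_MAX;
	s->bytes_left = s->max_bytes ? s->max_bytes : INT32_MAX;
}

/* Return the delivery time of a len-byte packet that is ready at ready_ns. */
static int64_t slot_deliver(struct slot_model *s, int64_t ready_ns,
			    int32_t len, double rnd)
{
	/* A packet cannot leave before the current slot opens. */
	int64_t when = ready_ns > s->slot_next_ns ? ready_ns : s->slot_next_ns;

	s->packets_left -= 1;
	s->bytes_left -= len;
	/* Budget exhausted: later packets wait for the next slot. */
	if (s->packets_left <= 0 || s->bytes_left <= 0)
		slot_open(s, when, rnd);
	return when;
}
```

With min_delay 800us, max_delay 10ms and a two-packet budget, the first
two packets share one slot and the third is held until the next slot
opens, which is the burst pattern the examples above are meant to
produce.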

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
 man/man8/tc-netem.8 | 32 ++++++++++++++++++++++-
 tc/q_netem.c        | 64 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 94 insertions(+), 2 deletions(-)

diff --git a/man/man8/tc-netem.8 b/man/man8/tc-netem.8
index f2cd86b6ed8a..8d485b026751 100644
--- a/man/man8/tc-netem.8
+++ b/man/man8/tc-netem.8
@@ -8,7 +8,8 @@ NetEm \- Network Emulator
 .I OPTIONS
 
 .IR OPTIONS " := [ " LIMIT " ] [ " DELAY " ] [ " LOSS \
-" ] [ " CORRUPT " ] [ " DUPLICATION " ] [ " REORDERING " ][ " RATE " ]"
+" ] [ " CORRUPT " ] [ " DUPLICATION " ] [ " REORDERING " ] [ " RATE \
+" ] [ " SLOT " ]"
 
 .IR LIMIT " := "
 .B limit
@@ -51,6 +52,14 @@ NetEm \- Network Emulator
 .B rate
 .IR RATE " [ " PACKETOVERHEAD " [ " CELLSIZE " [ " CELLOVERHEAD " ]]]]"
 
+.IR SLOT " := "
+.BR slot
+.IR MIN_DELAY " [ " MAX_DELAY " ] ["
+.BR packets
+.IR PACKETS " ] [ "
+.BR bytes
+.IR BYTES " ]"
+
 
 .SH DESCRIPTION
 NetEm is an enhancement of the Linux traffic control facilities
@@ -162,6 +171,27 @@ granularity avoid a perfect shaping at a specific level. This will show up in
 an artificial packet compression (bursts). Another influence factor are network
 adapter buffers which can also add artificial delay.
 
+.SS slot
+defer delivering accumulated packets to within a slot, with each available slot
+configured with a minimum delay to acquire, and an optional maximum delay.  Slot
+delays can be specified in nanoseconds, microseconds, milliseconds or seconds
+(e.g. 800us). Values for the optional parameters
+.I BYTES
+will limit the number of bytes delivered per slot, and/or
+.I PACKETS
+will limit the number of packets delivered per slot.
+
+These slot options can provide a crude approximation of bursty MACs such as
+DOCSIS, WiFi, and LTE.
+
+Note that slotting is limited by several factors: the kernel clock granularity,
+as with a rate, and attempts to deliver many packets within a slot will be
+smeared by the timer resolution, and by the underlying native bandwidth also.
+
+It is possible to combine slotting with a rate, in which case complex behaviors
+arise where either the rate, or the slot limits on bytes or packets per slot,
+govern the actual delivered rate.
+
 .SH LIMITATIONS
 The main known limitation of Netem are related to timer granularity, since
 Linux is not a real-time operating system.
diff --git a/tc/q_netem.c b/tc/q_netem.c
index 9f9a9b3df255..53a7a1056f5d 100644
--- a/tc/q_netem.c
+++ b/tc/q_netem.c
@@ -40,7 +40,10 @@ static void explain(void)
 "                 [ loss gemodel PERCENT [R [1-H [1-K]]]\n" \
 "                 [ ecn ]\n" \
 "                 [ reorder PRECENT [CORRELATION] [ gap DISTANCE ]]\n" \
-"                 [ rate RATE [PACKETOVERHEAD] [CELLSIZE] [CELLOVERHEAD]]\n");
+"                 [ rate RATE [PACKETOVERHEAD] [CELLSIZE] [CELLOVERHEAD]]\n" \
+"                 [ slot MIN_DELAY [MAX_DELAY] [packets MAX_PACKETS]" \
+" [bytes MAX_BYTES]]\n" \
+		);
 }
 
 static void explain1(const char *arg)
@@ -164,6 +167,7 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
 	struct tc_netem_gimodel gimodel;
 	struct tc_netem_gemodel gemodel;
 	struct tc_netem_rate rate = {};
+	struct tc_netem_slot slot = {};
 	__s16 *dist_data = NULL;
 	__u16 loss_type = NETEM_LOSS_UNSPEC;
 	int present[__TCA_NETEM_MAX] = {};
@@ -412,6 +416,45 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
 					return -1;
 				}
 			}
+		} else if (matches(*argv, "slot") == 0) {
+			NEXT_ARG();
+			present[TCA_NETEM_SLOT] = 1;
+			if (get_time64(&slot.min_delay, *argv)) {
+				explain1("slot min_delay");
+				return -1;
+			}
+			if (NEXT_IS_NUMBER()) {
+				NEXT_ARG();
+				if (get_time64(&slot.max_delay, *argv) ||
+				    slot.max_delay < slot.min_delay) {
+					explain1("slot max_delay");
+					return -1;
+				}
+			} else {
+				slot.max_delay = slot.min_delay;
+			}
+			if (NEXT_ARG_OK() &&
+			    matches(*(argv+1), "packets") == 0) {
+				NEXT_ARG();
+				if (!NEXT_ARG_OK() ||
+				    get_s32(&slot.max_packets, *(argv+1), 0)) {
+					explain1("slot packets");
+					return -1;
+				}
+				NEXT_ARG();
+			}
+			if (NEXT_ARG_OK() &&
+			    matches(*(argv+1), "bytes") == 0) {
+				unsigned int max_bytes;
+				NEXT_ARG();
+				if (!NEXT_ARG_OK() ||
+				    get_size(&max_bytes, *(argv+1))) {
+					explain1("slot bytes");
+					return -1;
+				}
+				slot.max_bytes = (int) max_bytes;
+				NEXT_ARG();
+			}
 		} else if (strcmp(*argv, "help") == 0) {
 			explain();
 			return -1;
@@ -472,6 +515,10 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
 	    addattr_l(n, 1024, TCA_NETEM_CORRUPT, &corrupt, sizeof(corrupt)) < 0)
 		return -1;
 
+	if (present[TCA_NETEM_SLOT] &&
+	    addattr_l(n, 1024, TCA_NETEM_SLOT, &slot, sizeof(slot)) < 0)
+		return -1;
+
 	if (loss_type != NETEM_LOSS_UNSPEC) {
 		struct rtattr *start;
 
@@ -526,6 +573,7 @@ static int netem_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	int *ecn = NULL;
 	struct tc_netem_qopt qopt;
 	const struct tc_netem_rate *rate = NULL;
+	const struct tc_netem_slot *slot = NULL;
 	int len;
 	__u64 rate64 = 0;
 
@@ -586,6 +634,11 @@ static int netem_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 				return -1;
 			rate64 = rta_getattr_u64(tb[TCA_NETEM_RATE64]);
 		}
+		if (tb[TCA_NETEM_SLOT]) {
+			if (RTA_PAYLOAD(tb[TCA_NETEM_SLOT]) < sizeof(*slot))
+				return -1;
+		        slot = RTA_DATA(tb[TCA_NETEM_SLOT]);
+		}
 	}
 
 	fprintf(f, "limit %d", qopt.limit);
@@ -659,6 +712,15 @@ static int netem_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 			fprintf(f, " celloverhead %d", rate->cell_overhead);
 	}
 
+	if (slot) {
+		fprintf(f, " slot %s", sprint_time64(slot->min_delay, b1));
+		fprintf(f, " %s", sprint_time64(slot->max_delay, b1));
+		if(slot->max_packets)
+			fprintf(f, " packets %d", slot->max_packets);
+		if(slot->max_bytes)
+			fprintf(f, " bytes %d", slot->max_bytes);
+	}
+
 	if (ecn)
 		fprintf(f, " ecn ");
 
-- 
2.19.0.rc0.228.g281dcd1b4d0-goog

* [PATCH v2 iproute2-next 1/3] tc: support conversions to or from 64 bit nanosecond-based time
From: Yousuk Seung @ 2018-08-27  2:42 UTC (permalink / raw)
  To: netdev
  Cc: Stephen Hemminger, David Ahern, Michael McLennan, Priyaranjan Jha,
	Dave Taht, Yousuk Seung, Neal Cardwell
In-Reply-To: <20180827024230.246445-1-ysseung@google.com>

From: Dave Taht <dave.taht@gmail.com>

Using a 32 bit field to represent time in nanoseconds results in a
maximum value of about 4.3 seconds, which is well below many observed
delays in WiFi and LTE, and barely in the ballpark for a trip past the
Earth's moon, Luna.

Using 64 bit time fields in nanoseconds allows us to simulate
network diameters of several hundred light-years. However, only
conversions to and from ns, us, ms, and seconds are provided.

The iproute2 64 bit api uses signed values for time. Being able to
represent positive or negative time allows us to calculate +/- deltas
between, for example, the CLOCK_TAI and CLOCK_REALTIME clocks.
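
The ranges quoted above can be checked with a little arithmetic. The
following is a minimal sketch; the helper names are made up for
illustration and are not part of iproute2, and a year is assumed to be
365 days.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Whole seconds that fit in an unsigned 32 bit nanosecond counter. */
static int64_t u32_ns_max_whole_seconds(void)
{
	return (int64_t)UINT32_MAX / NSEC_PER_SEC;
}

/* Whole 365-day years that fit in a signed 64 bit nanosecond counter. */
static int64_t s64_ns_max_whole_years(void)
{
	const int64_t ns_per_year = 365LL * 24 * 3600 * NSEC_PER_SEC;

	return INT64_MAX / ns_per_year;
}

/* Signed difference a - b between two nanosecond timestamps. */
static int64_t ns_delta(int64_t a, int64_t b)
{
	/* Guard against signed overflow in the subtraction. */
	assert(!(b < 0 && a > INT64_MAX + b));
	assert(!(b > 0 && a < INT64_MIN + b));
	return a - b;
}
```

u32_ns_max_whole_seconds() evaluates to 4 (the "about 4.3 seconds"
limit), s64_ns_max_whole_years() to 292, and ns_delta() returns a
negative value when the first timestamp is earlier than the second,
which an unsigned type could not represent.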

Time related utility functions in tc_util.c are moved to lib/utils.c.

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
 include/utils.h   |  12 ++++++
 lib/utils.c       | 104 ++++++++++++++++++++++++++++++++++++++++++++++
 tc/tc_cbq.c       |   1 +
 tc/tc_core.c      |   1 +
 tc/tc_core.h      |   2 -
 tc/tc_estimator.c |   1 +
 tc/tc_util.c      |  46 --------------------
 tc/tc_util.h      |   3 --
 8 files changed, 119 insertions(+), 51 deletions(-)

diff --git a/include/utils.h b/include/utils.h
index 8cb4349e8a89..eba67b6ecf44 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -46,6 +46,11 @@ void incomplete_command(void) __attribute__((noreturn));
 #define NEXT_ARG_FWD() do { argv++; argc--; } while(0)
 #define PREV_ARG() do { argv--; argc++; } while(0)
 
+#define TIME_UNITS_PER_SEC	1000000
+#define NSEC_PER_USEC 1000
+#define NSEC_PER_MSEC 1000000
+#define NSEC_PER_SEC 1000000000LL
+
 typedef struct
 {
 	__u16 flags;
@@ -310,4 +315,11 @@ size_t strlcat(char *dst, const char *src, size_t size);
 
 void drop_cap(void);
 
+int get_time(unsigned int *time, const char *str);
+int get_time64(__s64 *time, const char *str);
+void print_time(char *buf, int len, __u32 time);
+void print_time64(char *buf, int len, __s64 time);
+char *sprint_time(__u32 time, char *buf);
+char *sprint_time64(__s64 time, char *buf);
+
 #endif /* __UTILS_H__ */
diff --git a/lib/utils.c b/lib/utils.c
index 02ce67721915..34ec4ab12646 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -1633,3 +1633,107 @@ void drop_cap(void)
 	}
 #endif
 }
+
+int get_time(unsigned int *time, const char *str)
+{
+	double t;
+	char *p;
+
+	t = strtod(str, &p);
+	if (p == str)
+		return -1;
+
+	if (*p) {
+		if (strcasecmp(p, "s") == 0 || strcasecmp(p, "sec") == 0 ||
+		    strcasecmp(p, "secs") == 0)
+			t *= TIME_UNITS_PER_SEC;
+		else if (strcasecmp(p, "ms") == 0 || strcasecmp(p, "msec") == 0 ||
+			 strcasecmp(p, "msecs") == 0)
+			t *= TIME_UNITS_PER_SEC/1000;
+		else if (strcasecmp(p, "us") == 0 || strcasecmp(p, "usec") == 0 ||
+			 strcasecmp(p, "usecs") == 0)
+			t *= TIME_UNITS_PER_SEC/1000000;
+		else
+			return -1;
+	}
+
+	*time = t;
+	return 0;
+}
+
+
+void print_time(char *buf, int len, __u32 time)
+{
+	double tmp = time;
+
+	if (tmp >= TIME_UNITS_PER_SEC)
+		snprintf(buf, len, "%.1fs", tmp/TIME_UNITS_PER_SEC);
+	else if (tmp >= TIME_UNITS_PER_SEC/1000)
+		snprintf(buf, len, "%.1fms", tmp/(TIME_UNITS_PER_SEC/1000));
+	else
+		snprintf(buf, len, "%uus", time);
+}
+
+char *sprint_time(__u32 time, char *buf)
+{
+	print_time(buf, SPRINT_BSIZE-1, time);
+	return buf;
+}
+
+/* 64 bit times are represented internally in nanoseconds */
+int get_time64(__s64 *time, const char *str)
+{
+	double nsec;
+	char *p;
+
+	nsec = strtod(str, &p);
+	if (p == str)
+		return -1;
+
+	if (*p) {
+		if (strcasecmp(p, "s") == 0 ||
+		    strcasecmp(p, "sec") == 0 ||
+		    strcasecmp(p, "secs") == 0)
+			nsec *= NSEC_PER_SEC;
+		else if (strcasecmp(p, "ms") == 0 ||
+			 strcasecmp(p, "msec") == 0 ||
+			 strcasecmp(p, "msecs") == 0)
+			nsec *= NSEC_PER_MSEC;
+		else if (strcasecmp(p, "us") == 0 ||
+			 strcasecmp(p, "usec") == 0 ||
+			 strcasecmp(p, "usecs") == 0)
+			nsec *= NSEC_PER_USEC;
+		else if (strcasecmp(p, "ns") == 0 ||
+			 strcasecmp(p, "nsec") == 0 ||
+			 strcasecmp(p, "nsecs") == 0)
+			nsec *= 1;
+		else
+			return -1;
+	}
+
+	*time = nsec;
+	return 0;
+}
+
+void print_time64(char *buf, int len, __s64 time)
+{
+	double nsec = time;
+
+	if (time >= NSEC_PER_SEC)
+		snprintf(buf, len, "%.3fs", nsec/NSEC_PER_SEC);
+	else if (time >= NSEC_PER_MSEC)
+		snprintf(buf, len, "%.3fms", nsec/NSEC_PER_MSEC);
+	else if (time >= NSEC_PER_USEC)
+		snprintf(buf, len, "%.3fus", nsec/NSEC_PER_USEC);
+	else
+		snprintf(buf, len, "%lldns", time);
+}
+
+char *sprint_time64(__s64 time, char *buf)
+{
+	print_time64(buf, SPRINT_BSIZE-1, time);
+	return buf;
+}
+
+
+
diff --git a/tc/tc_cbq.c b/tc/tc_cbq.c
index 4cd584a91a26..c811456b1627 100644
--- a/tc/tc_cbq.c
+++ b/tc/tc_cbq.c
@@ -20,6 +20,7 @@
 #include <arpa/inet.h>
 #include <string.h>
 
+#include "utils.h"
 #include "tc_core.h"
 #include "tc_cbq.h"
 
diff --git a/tc/tc_core.c b/tc/tc_core.c
index 1bde4d51e5dc..8eb11223eb9d 100644
--- a/tc/tc_core.c
+++ b/tc/tc_core.c
@@ -21,6 +21,7 @@
 #include <arpa/inet.h>
 #include <string.h>
 
+#include "utils.h"
 #include "tc_core.h"
 #include <linux/atm.h>
 
diff --git a/tc/tc_core.h b/tc/tc_core.h
index 1dfa9a4f773b..bd4a99f0d8dd 100644
--- a/tc/tc_core.h
+++ b/tc/tc_core.h
@@ -5,8 +5,6 @@
 #include <asm/types.h>
 #include <linux/pkt_sched.h>
 
-#define TIME_UNITS_PER_SEC	1000000
-
 enum link_layer {
 	LINKLAYER_UNSPEC,
 	LINKLAYER_ETHERNET,
diff --git a/tc/tc_estimator.c b/tc/tc_estimator.c
index e4edfc7e98d9..f494b7caa44e 100644
--- a/tc/tc_estimator.c
+++ b/tc/tc_estimator.c
@@ -20,6 +20,7 @@
 #include <arpa/inet.h>
 #include <string.h>
 
+#include "utils.h"
 #include "tc_core.h"
 
 int tc_setup_estimator(unsigned int A, unsigned int time_const, struct tc_estimator *est)
diff --git a/tc/tc_util.c b/tc/tc_util.c
index d7578528a31b..cafbe49f3ec8 100644
--- a/tc/tc_util.c
+++ b/tc/tc_util.c
@@ -334,52 +334,6 @@ char *sprint_rate(__u64 rate, char *buf)
 	return buf;
 }
 
-int get_time(unsigned int *time, const char *str)
-{
-	double t;
-	char *p;
-
-	t = strtod(str, &p);
-	if (p == str)
-		return -1;
-
-	if (*p) {
-		if (strcasecmp(p, "s") == 0 || strcasecmp(p, "sec") == 0 ||
-		    strcasecmp(p, "secs") == 0)
-			t *= TIME_UNITS_PER_SEC;
-		else if (strcasecmp(p, "ms") == 0 || strcasecmp(p, "msec") == 0 ||
-			 strcasecmp(p, "msecs") == 0)
-			t *= TIME_UNITS_PER_SEC/1000;
-		else if (strcasecmp(p, "us") == 0 || strcasecmp(p, "usec") == 0 ||
-			 strcasecmp(p, "usecs") == 0)
-			t *= TIME_UNITS_PER_SEC/1000000;
-		else
-			return -1;
-	}
-
-	*time = t;
-	return 0;
-}
-
-
-void print_time(char *buf, int len, __u32 time)
-{
-	double tmp = time;
-
-	if (tmp >= TIME_UNITS_PER_SEC)
-		snprintf(buf, len, "%.1fs", tmp/TIME_UNITS_PER_SEC);
-	else if (tmp >= TIME_UNITS_PER_SEC/1000)
-		snprintf(buf, len, "%.1fms", tmp/(TIME_UNITS_PER_SEC/1000));
-	else
-		snprintf(buf, len, "%uus", time);
-}
-
-char *sprint_time(__u32 time, char *buf)
-{
-	print_time(buf, SPRINT_BSIZE-1, time);
-	return buf;
-}
-
 char *sprint_ticks(__u32 ticks, char *buf)
 {
 	return sprint_time(tc_core_tick2time(ticks), buf);
diff --git a/tc/tc_util.h b/tc/tc_util.h
index 6632c4f9c528..76fd986d6e4c 100644
--- a/tc/tc_util.h
+++ b/tc/tc_util.h
@@ -81,13 +81,11 @@ int get_rate64(__u64 *rate, const char *str);
 int get_percent_rate64(__u64 *rate, const char *str, const char *dev);
 int get_size(unsigned int *size, const char *str);
 int get_size_and_cell(unsigned int *size, int *cell_log, char *str);
-int get_time(unsigned int *time, const char *str);
 int get_linklayer(unsigned int *val, const char *arg);
 
 void print_rate(char *buf, int len, __u64 rate);
 void print_size(char *buf, int len, __u32 size);
 void print_qdisc_handle(char *buf, int len, __u32 h);
-void print_time(char *buf, int len, __u32 time);
 void print_linklayer(char *buf, int len, unsigned int linklayer);
 void print_devname(enum output_type type, int ifindex);
 
@@ -95,7 +93,6 @@ char *sprint_rate(__u64 rate, char *buf);
 char *sprint_size(__u32 size, char *buf);
 char *sprint_qdisc_handle(__u32 h, char *buf);
 char *sprint_tc_classid(__u32 h, char *buf);
-char *sprint_time(__u32 time, char *buf);
 char *sprint_ticks(__u32 ticks, char *buf);
 char *sprint_linklayer(unsigned int linklayer, char *buf);
 
-- 
2.19.0.rc0.228.g281dcd1b4d0-goog

* [PATCH v2 iproute2-next 0/3] support delivering packets in
From: Yousuk Seung @ 2018-08-27  2:42 UTC (permalink / raw)
  To: netdev
  Cc: Stephen Hemminger, David Ahern, Michael McLennan, Priyaranjan Jha,
	Yousuk Seung

This series adds support for the new "slot" netem parameter for
slotting. Slotting is an approximation of shared media that gather up
packets within a varying delay window before delivering them nearly all
at once.

Dave Taht (2):
  tc: support conversions to or from 64 bit nanosecond-based time
  q_netem: support delivering packets in delayed time slots

Yousuk Seung (1):
  q_netem: slotting with non-uniform distribution

 include/utils.h     |  12 +++++
 lib/utils.c         | 104 +++++++++++++++++++++++++++++++++++++++
 man/man8/tc-netem.8 |  40 ++++++++++++++-
 tc/q_netem.c        | 115 +++++++++++++++++++++++++++++++++++++++++++-
 tc/tc_cbq.c         |   1 +
 tc/tc_core.c        |   1 +
 tc/tc_core.h        |   2 -
 tc/tc_estimator.c   |   1 +
 tc/tc_util.c        |  46 ------------------
 tc/tc_util.h        |   3 --
 10 files changed, 272 insertions(+), 53 deletions(-)

-- 
2.19.0.rc0.228.g281dcd1b4d0-goog

* [PATCH RFC net-next] net/fib: Poptrie based FIB lookup
From: Md. Islam @ 2018-08-27  2:28 UTC (permalink / raw)
  To: Netdev, David Miller, David Ahern, Eric Dumazet, Alexey Kuznetsov,
	Stephen Hemminger, makita.toshiaki, panda, yasuhiro.ohara,
	john.fastabend, alexei.starovoitov

This patch implements Poptrie [1] based FIB lookup. It exhibits
impressive lookup performance compared to LC-trie. This poptrie
implementation, however, deviates somewhat from the original
implementation [2]. I tested this patch rigorously with several
FIB tables containing half a million routes and got the same results as
the LC-trie based fib_lookup().

Poptrie is intended to work in conjunction with LC-trie (not replace
it). It is primarily designed to overcome many issues of TCAM based
routers [1]. The paper [1] shows that Poptrie can achieve very impressive
lookup performance on a CPU. This patch will mainly be used by XDP
forwarding.

1. Asai, Hirochika, and Yasuhiro Ohara. "Poptrie: A compressed trie
with population count for fast and scalable software IP routing table
lookup." ACM SIGCOMM Computer Communication Review. 2015.

2. https://github.com/pixos/poptrie
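
For readers unfamiliar with the data structure, the core indexing trick
can be sketched in user space as follows. This is only an illustration
under stated assumptions, not the code in this patch: popcount64(),
chunk6(), child_slot() and leaf_slot() are hypothetical helper names,
and a portable bit-counting loop stands in for the kernel's hweight64().

```c
#include <assert.h>
#include <stdint.h>

/* Portable population count (stand-in for hweight64()). */
static unsigned int popcount64(uint64_t v)
{
	unsigned int n = 0;

	while (v) {
		v &= v - 1;   /* clear the lowest set bit */
		n++;
	}
	return n;
}

/* 6 bit chunk of key that starts offset bits below the top bit. */
static unsigned int chunk6(uint32_t key, unsigned int offset)
{
	assert(offset < 32);
	if (offset <= 26)
		return (key >> (26 - offset)) & 63;
	/* Fewer than 6 bits remain: left-align them within the chunk. */
	return (key << (offset - 26)) & 63;
}

/* Position of a child in the packed child array: set bits below index. */
static unsigned int child_slot(uint64_t nodevec, unsigned int index)
{
	assert(index < 64);
	return popcount64(nodevec & ((1ULL << index) - 1));
}

/*
 * Position of the leaf covering index. leafvec marks only the first
 * entry of each run of identical leaves, so the covering leaf is the
 * last set bit at or below index. The caller must ensure one exists.
 */
static unsigned int leaf_slot(uint64_t leafvec, unsigned int index)
{
	uint64_t upto;

	assert(index < 64);
	upto = (index == 63) ? ~0ULL : ((1ULL << (index + 1)) - 1);
	assert(popcount64(leafvec & upto) > 0);
	return popcount64(leafvec & upto) - 1;
}
```

Each level of the trie consumes one 6 bit chunk of the destination
address, and the bitmaps let a node store only the children and leaves
that actually exist, which is where the compression comes from.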

From c5e05ea66b06eb9313749bc8969b4c2798fcf96a Mon Sep 17 00:00:00 2001
From: tamimcse <tamim@csebuet.org>
Date: Sun, 26 Aug 2018 21:12:38 -0400
Subject: [PATCH] Implemented Poptrie

Signed-off-by: tamimcse <tamim@csebuet.org>
---
 include/net/ip_fib.h   |  40 +++++++
 net/ipv4/Makefile      |   2 +-
 net/ipv4/fib_poptrie.c | 295 +++++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/fib_trie.c    |   3 +
 4 files changed, 339 insertions(+), 1 deletion(-)
 create mode 100644 net/ipv4/fib_poptrie.c

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 81d0f21..c4374a1 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -197,6 +197,37 @@ struct fib_entry_notifier_info {
     u32 tb_id;
 };

+/*Maximum number of next-hop*/
+#define NEXT_HOP_MAX 255
+
+struct next_hops {
+    struct net_device    *netdev_arr[NEXT_HOP_MAX];
+    /*Total number of next-hops*/
+    u8 count;
+};
+
+struct poptrie_node {
+    u64 vector;
+    u64 leafvec;
+    u64 nodevec;
+    struct poptrie_node *chield_nodes;
+    u8 *leaves;
+    u8 *prefixes;
+};
+
+struct poptrie {
+    char    def_nh;
+    struct next_hops    nhs;
+    struct poptrie_node *root;
+    spinlock_t            lock;
+};
+
+void poptrie_insert(struct poptrie *pt, u32 key,
+        u8 prefix_len, struct net_device *dev);
+void poptrie_lookup(struct poptrie *pt, __be32 dest,
+        struct net_device **dev);
+
+
 struct fib_nh_notifier_info {
     struct fib_notifier_info info; /* must be first */
     struct fib_nh *fib_nh;
@@ -219,6 +250,7 @@ struct fib_table {
     int            tb_num_default;
     struct rcu_head        rcu;
     unsigned long         *tb_data;
+    struct poptrie    pt;
     unsigned long        __data[0];
 };

@@ -268,6 +300,14 @@ static inline int fib_lookup(struct net *net, const struct flowi4 *flp,
     rcu_read_lock();

     tb = fib_get_table(net, RT_TABLE_MAIN);
+
+    /*Testing poptrie_lookup*/
+    if (tb && tb->pt.root) {
+        struct net_device *dev;
+
+        poptrie_lookup(&tb->pt, flp->daddr, &dev);
+    }
+
     if (tb)
         err = fib_table_lookup(tb, flp, res, flags | FIB_LOOKUP_NOREF);

diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index b379520..b1246d2 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -14,7 +14,7 @@ obj-y     := route.o inetpeer.o protocol.o \
          udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
          fib_frontend.o fib_semantics.o fib_trie.o fib_notifier.o \
          inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
-         metrics.o
+         metrics.o fib_poptrie.o

 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
diff --git a/net/ipv4/fib_poptrie.c b/net/ipv4/fib_poptrie.c
new file mode 100644
index 0000000..b3a88ab
--- /dev/null
+++ b/net/ipv4/fib_poptrie.c
@@ -0,0 +1,295 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *This program is free software; you can redistribute it and/or
+ *   modify it under the terms of the GNU General Public License
+ *   as published by the Free Software Foundation; either version
+ *   2 of the License, or (at your option) any later version.
+ *
+ * Author: MD Iftakharul Islam (Tamim) <mislam4@kent.edu>.
+ *
+ * Asai, Hirochika, and Yasuhiro Ohara. "Poptrie: A compressed trie
+ * with population count for fast and scalable software IP routing
+ * table lookup." ACM SIGCOMM Computer Communication Review. 2015.
+ *
+ */
+
+#include <net/ip_fib.h>
+
+/*Get next-hop index from next-hop*/
+static u8 get_fib_index(struct next_hops *nhs, struct net_device *dev)
+{
+    u8 i;
+
+    for (i = 0; i < nhs->count; i++) {
+        if (nhs->netdev_arr[i] == dev)
+            return i;
+    }
+    nhs->netdev_arr[nhs->count++] = dev;
+    return nhs->count - 1;
+}
+
+/*Converts next-hop index into actual next-hop*/
+static struct net_device *get_fib(struct next_hops *nhs, u8 fib_index)
+{
+    return nhs->netdev_arr[fib_index];
+}
+
+/*Extracts 6 bits from key starting from offset*/
+static inline u32 extract(u32 key, int offset)
+{
+    if (likely(offset < 26))
+        return (key >> (26 - offset)) & 63;
+    else
+        return (key << 4) & 63;
+}
+
+/*Set FIB index and prefix length to a leaf*/
+static void set_fib_index(struct poptrie_node *node,
+        unsigned long leaf_index, char fib_index, char prefix_len)
+{
+    node->leaves[leaf_index] = fib_index;
+    node->prefixes[leaf_index] = prefix_len;
+}
+
+/*Insert a leaf at index*/
+static bool insert_leaf(struct poptrie_node *node,
+        char index, char fib_index, char prefix_len)
+{
+    int i, j;
+    char *leaves;
+    char *prefixes;
+    int size = (int)hweight64(node->leafvec);
+
+    if (index > size) {
+        pr_err("Index needs to be smaller or equal to size");
+        return false;
+    }
+
+    leaves = kcalloc(size + 1, sizeof(*leaves), GFP_ATOMIC);
+    prefixes = kcalloc(size + 1, sizeof(*prefixes), GFP_ATOMIC);
+
+    for (i = 0, j = 0; i < (size + 1); i++) {
+        if (i == index) {
+            leaves[i] = fib_index;
+            prefixes[i] = prefix_len;
+        } else {
+            leaves[i] = node->leaves[j];
+            prefixes[i] = node->prefixes[j];
+            j++;
+        }
+    }
+
+    kfree(node->leaves);
+    kfree(node->prefixes);
+    node->leaves = leaves;
+    node->prefixes = prefixes;
+    return true;
+}
+
+/*Insert a new node at index*/
+static void insert_chield_node(struct poptrie_node *node,
+        char index)
+{
+    int i, j;
+    struct poptrie_node *arr;
+    int arr_size  = (int)hweight64(node->nodevec);
+
+    arr = kcalloc(arr_size + 1, sizeof(*arr), GFP_ATOMIC);
+    for (i = 0, j = 0; i < (arr_size + 1); i++) {
+        if (i != index && j < arr_size)
+            arr[i] = node->chield_nodes[j++];
+    }
+
+    kfree(node->chield_nodes);
+    node->chield_nodes = arr;
+}
+
+void poptrie_insert(struct poptrie *pt, u32 key,
+        u8 prefix_len, struct net_device *dev)
+{
+    int offset, i;
+    u32 index;
+    u8 consecutive_leafs;
+    u64 bitmap;
+    u64 bitmap_hp;
+    int arr_size;
+    unsigned long chield_index;
+    unsigned long leaf_index, prev_leaf_index;
+    unsigned long index_hp;
+    struct poptrie_node *node;
+    u8 prev_fib_index, prev_prefix_len;
+    u8 fib_index = get_fib_index(&pt->nhs, dev);
+
+    spin_lock(&pt->lock);
+
+    if (!pt->root)
+        pt->root = kzalloc(sizeof(*pt->root), GFP_ATOMIC);
+
+    /* Default route */
+    if (prefix_len == 0) {
+        pt->def_nh = fib_index;
+        goto finish;
+    }
+
+    /*Iterate through the nodes*/
+    offset = 0;
+    node = pt->root;
+    while (prefix_len > (offset + 6)) {
+        index = extract(key, offset);
+        bitmap = 1ULL << index;
+        chield_index = hweight64(node->nodevec & (bitmap - 1));
+
+        /*No node for this index, so need to insert a node*/
+        if (!(node->nodevec & bitmap)) {
+            insert_chield_node(node, chield_index);
+            node->nodevec |= bitmap;
+        }
+        node = &node->chield_nodes[chield_index];
+        offset += 6;
+    }
+
+    /*Now need to insert a leaf*/
+
+    index = extract(key, offset);
+    bitmap = 1ULL << index;
+    consecutive_leafs = 1 << (offset + 6 - prefix_len);
+
+    if (node->vector & bitmap && node->leafvec & bitmap) {
+        /*A leaf already exists for this index, so update the existing leaf*/
+        leaf_index = hweight64(node->leafvec & (bitmap - 1));
+        arr_size = (int)hweight64(node->leafvec);
+        if (leaf_index >= arr_size)
+            goto error;
+        /*Ignore the prefix*/
+        if (node->prefixes[leaf_index] > prefix_len) {
+            goto finish;
+        } else if (node->prefixes[leaf_index] == prefix_len) {
+            set_fib_index(node, leaf_index, fib_index, prefix_len);
+        } else {
+            /*hole punching*/
+            bitmap_hp = bitmap << consecutive_leafs;
+            if (!(node->leafvec & bitmap_hp)) {
+                index_hp = hweight64(node->leafvec & (bitmap_hp - 1)) - 1;
+                if (node->prefixes[index_hp] <= prefix_len) {
+                    insert_leaf(node, index_hp, fib_index, prefix_len);
+                    node->leafvec |= bitmap_hp;
+                }
+
+                for (i = leaf_index; i < index_hp ; i++) {
+                    if (node->prefixes[i] <= prefix_len)
+                        set_fib_index(node, i, fib_index, prefix_len);
+                }
+            } else {
+                index_hp = hweight64(node->leafvec & (bitmap_hp - 1)) - 1;
+                for (i = leaf_index; i <= index_hp ; i++) {
+                    if (node->prefixes[i] <= prefix_len)
+                        set_fib_index(node, i, fib_index, prefix_len);
+                }
+            }
+        }
+    } else if (!(node->vector & bitmap)) {
+        /*No leaf for this index, so need to insert a leaf*/
+        leaf_index = hweight64(node->leafvec & (bitmap - 1));
+        insert_leaf(node, leaf_index, fib_index, prefix_len);
+        node->leafvec |= bitmap;
+    } else if (node->vector & bitmap && !(node->leafvec & bitmap)) {
+        /*There is a leaf for this index created by another
+         *  prefix with smaller length
+         */
+        prev_leaf_index = hweight64(node->leafvec & (bitmap - 1)) - 1;
+        arr_size = (int)hweight64(node->leafvec);
+        if (prev_leaf_index >= arr_size)
+            goto error;
+        if (node->prefixes[prev_leaf_index] <= prefix_len) {
+            insert_leaf(node, prev_leaf_index + 1, fib_index, prefix_len);
+            node->leafvec |= bitmap;
+        }
+
+        /*hole punching*/
+        prev_fib_index = node->leaves[prev_leaf_index];
+        prev_prefix_len = node->prefixes[prev_leaf_index];
+
+        bitmap_hp = bitmap << consecutive_leafs;
+        if (!(node->leafvec & bitmap_hp)) {
+            index_hp = hweight64(node->leafvec & (bitmap_hp - 1)) - 1;
+            if (node->prefixes[index_hp] <= prefix_len) {
+                if (prev_leaf_index < 0)
+                    goto error;
+                insert_leaf(node, index_hp + 1,
+                        prev_fib_index, prev_prefix_len);
+                node->leafvec |= bitmap_hp;
+            }
+        }
+
+        for (i = 2; i < consecutive_leafs; i++) {
+            bitmap_hp = bitmap << (i - 1);
+            if (node->leafvec & bitmap_hp) {
+                index_hp = hweight64(node->leafvec & (bitmap_hp - 1)) - 1;
+                insert_leaf(node, index_hp + 1,
+                        fib_index, prefix_len);
+                node->leafvec |= bitmap_hp;
+            }
+        }
+    }
+
+    if (consecutive_leafs > 1)
+        node->vector |= ((1ULL << consecutive_leafs) - 1) << index;
+    else
+        node->vector |= bitmap;
+
+    goto finish;
+
+error:
+    pr_err("Something is very wrong !!!!");
+finish:
+    spin_unlock(&pt->lock);
+}
+
+/*We assume that pt->root is not NULL*/
+void poptrie_lookup(struct poptrie *pt, __be32 dest, struct net_device **dev)
+{
+    register u32 index;
+    register u64 bitmap, bitmask;
+    register unsigned long leaf_index;
+    register unsigned long node_index;
+    register struct poptrie_node *node = pt->root;
+    register u8 fib_index = pt->def_nh;
+    register u8 carry = 0;
+    register u8 carry_bit = 2;
+
+    while (1) {
+        /*Extract 6 bits from dest */
+        if (likely(carry_bit != 8)) {
+            index = ((dest & 252) >> carry_bit) | carry;
+            carry = (dest & ((1 << carry_bit) - 1)) << (6 - carry_bit);
+            carry_bit = carry_bit + 2;
+            dest = dest >> 8;
+        } else {
+            index = carry;
+            carry = 0;
+            carry_bit = 2;
+        }
+
+        /*Create a bitmap based on the extracted value*/
+        bitmap = 1ULL << index;
+        bitmask = bitmap - 1;
+
+        /*Find corresponding leaf*/
+        if (likely(node->vector & bitmap)) {
+            leaf_index = hweight64(node->leafvec & bitmask);
+            if (!(node->leafvec & bitmap))
+                leaf_index--;
+            fib_index = node->leaves[leaf_index];
+        }
+
+        /*Find corresponding node*/
+        if (likely(node->nodevec & bitmap)) {
+            node_index = hweight64(node->nodevec & bitmask);
+            node = &node->chield_nodes[node_index];
+            continue;
+        }
+
+        *dev = get_fib(&pt->nhs, fib_index);
+        return;
+    }
+}
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 3dcffd3..0509a24 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1280,6 +1280,9 @@ int fib_table_insert(struct net *net, struct
fib_table *tb,
     if (err)
         goto out_fib_notif;

+    /*This should be done when Poptrie is enabled from CONFIG*/
+    poptrie_insert(&tb->pt, key, plen, fi->fib_dev);
+
     if (!plen)
         tb->tb_num_default++;

-- 
2.7.4

^ permalink raw reply related

* [PATCH RFC net-next] net: Poptrie based FIB lookup
From: Md. Islam @ 2018-08-27  2:13 UTC (permalink / raw)
  To: Netdev, David Miller, David Ahern, Eric Dumazet, Alexey Kuznetsov,
	Stephen Hemminger, makita.toshiaki, panda, yasuhiro.ohara,
	Jesper Dangaard Brouer

This patch implements Poptrie [1] based FIB lookup. It exhibits pretty
impressive lookup performance compared to LC-trie. This Poptrie
implementation, however, deviates somewhat from the original
implementation [2]. I tested this patch rigorously with several
FIB tables containing half a million routes and got the same results as
the LC-trie based fib_lookup().

Poptrie is intended to work in conjunction with LC-trie (not replace
it). It is primarily designed to overcome many issues of TCAM-based
routers [1]; [1] shows that Poptrie can achieve very impressive
lookup performance on a CPU. This patch will mainly be used by XDP
forwarding.

1. Asai, Hirochika, and Yasuhiro Ohara. "Poptrie: A compressed trie
with population count for fast and scalable software IP routing table
lookup." ACM SIGCOMM Computer Communication Review. 2015.

2. https://github.com/pixos/poptrie
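
The core idea from [1] can be sketched in plain userspace C: each node
consumes a 6-bit chunk of the key, and a population count over a 64-bit
vector turns that chunk into an index into a compressed leaf array. This
is only an illustration, not the code in the patch below; struct
demo_node and the demo_* helpers are hypothetical names, and
demo_popcount64() stands in for the kernel's hweight64().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the leaf part of a poptrie node. */
struct demo_node {
    uint64_t leafvec;       /* bit i set => a new leaf starts at slot i */
    const uint8_t *leaves;  /* compressed array, one entry per set bit */
};

/* Return the 6-bit chunk of a host-order key starting `offset` bits
 * from the most significant bit. Only full chunks are handled here. */
static unsigned int demo_chunk(uint32_t key, unsigned int offset)
{
    assert(offset <= 26);
    return (key >> (26 - offset)) & 63;
}

/* Portable population count; the kernel would use hweight64(). */
static unsigned int demo_popcount64(uint64_t v)
{
    unsigned int n = 0;

    while (v) {
        v &= v - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Map a chunk to the leaf that covers it: count the leaf starts at or
 * below the chunk and step back by one. Assumes such a leaf exists. */
static uint8_t demo_leaf(const struct demo_node *node, unsigned int chunk)
{
    uint64_t upto;
    unsigned int n;

    assert(chunk < 64);
    upto = (chunk == 63) ? ~0ULL : ((1ULL << (chunk + 1)) - 1);
    n = demo_popcount64(node->leafvec & upto);
    assert(n > 0);
    return node->leaves[n - 1];
}
```

With leafvec bits 0 and 4 set, chunks 0..3 share the first leaf and
chunks 4..63 share the second, so 64 slots cost only two array entries.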

From c5e05ea66b06eb9313749bc8969b4c2798fcf96a Mon Sep 17 00:00:00 2001
From: tamimcse <tamim@csebuet.org>
Date: Sun, 26 Aug 2018 21:12:38 -0400
Subject: [PATCH] Implemented Poptrie

Signed-off-by: tamimcse <tamim@csebuet.org>
---
 include/net/ip_fib.h   |  40 +++++++
 net/ipv4/Makefile      |   2 +-
 net/ipv4/fib_poptrie.c | 295 +++++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/fib_trie.c    |   3 +
 4 files changed, 339 insertions(+), 1 deletion(-)
 create mode 100644 net/ipv4/fib_poptrie.c

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 81d0f21..c4374a1 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -197,6 +197,37 @@ struct fib_entry_notifier_info {
     u32 tb_id;
 };

+/*Maximum number of next-hop*/
+#define NEXT_HOP_MAX 255
+
+struct next_hops {
+    struct net_device    *netdev_arr[NEXT_HOP_MAX];
+    /*Total number of next-hops*/
+    u8 count;
+};
+
+struct poptrie_node {
+    u64 vector;
+    u64 leafvec;
+    u64 nodevec;
+    struct poptrie_node *chield_nodes;
+    u8 *leaves;
+    u8 *prefixes;
+};
+
+struct poptrie {
+    char    def_nh;
+    struct next_hops    nhs;
+    struct poptrie_node *root;
+    spinlock_t            lock;
+};
+
+void poptrie_insert(struct poptrie *pt, u32 key,
+        u8 prefix_len, struct net_device *dev);
+void poptrie_lookup(struct poptrie *pt, __be32 dest,
+        struct net_device **dev);
+
+
 struct fib_nh_notifier_info {
     struct fib_notifier_info info; /* must be first */
     struct fib_nh *fib_nh;
@@ -219,6 +250,7 @@ struct fib_table {
     int            tb_num_default;
     struct rcu_head        rcu;
     unsigned long         *tb_data;
+    struct poptrie    pt;
     unsigned long        __data[0];
 };

@@ -268,6 +300,14 @@ static inline int fib_lookup(struct net *net,
const struct flowi4 *flp,
     rcu_read_lock();

     tb = fib_get_table(net, RT_TABLE_MAIN);
+
+    /*Testing poptrie_lookup*/
+    if (tb && tb->pt.root) {
+        struct net_device *dev;
+
+        poptrie_lookup(&tb->pt, flp->daddr, &dev);
+    }
+
     if (tb)
         err = fib_table_lookup(tb, flp, res, flags | FIB_LOOKUP_NOREF);

diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index b379520..b1246d2 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -14,7 +14,7 @@ obj-y     := route.o inetpeer.o protocol.o \
          udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
          fib_frontend.o fib_semantics.o fib_trie.o fib_notifier.o \
          inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
-         metrics.o
+         metrics.o fib_poptrie.o

 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
diff --git a/net/ipv4/fib_poptrie.c b/net/ipv4/fib_poptrie.c
new file mode 100644
index 0000000..b3a88ab
--- /dev/null
+++ b/net/ipv4/fib_poptrie.c
@@ -0,0 +1,295 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *This program is free software; you can redistribute it and/or
+ *   modify it under the terms of the GNU General Public License
+ *   as published by the Free Software Foundation; either version
+ *   2 of the License, or (at your option) any later version.
+ *
+ * Author: MD Iftakharul Islam (Tamim) <mislam4@kent.edu>.
+ *
+ * Asai, Hirochika, and Yasuhiro Ohara. "Poptrie: A compressed trie
+ * with population count for fast and scalable software IP routing
+ * table lookup." ACM SIGCOMM Computer Communication Review. 2015.
+ *
+ */
+
+#include <net/ip_fib.h>
+
+/*Get next-hop index from next-hop*/
+static u8 get_fib_index(struct next_hops *nhs, struct net_device *dev)
+{
+    u8 i;
+
+    for (i = 0; i < nhs->count; i++) {
+        if (nhs->netdev_arr[i] == dev)
+            return i;
+    }
+    nhs->netdev_arr[nhs->count++] = dev;
+    return nhs->count - 1;
+}
+
+/*Converts next-hop index into actual next-hop*/
+static struct net_device *get_fib(struct next_hops *nhs, u8 fib_index)
+{
+    return nhs->netdev_arr[fib_index];
+}
+
+/*Extracts 6 bits from key starting from offset*/
+static inline u32 extract(u32 key, int offset)
+{
+    if (likely(offset < 26))
+        return (key >> (26 - offset)) & 63;
+    else
+        return (key << 4) & 63;
+}
+
+/*Set FIB index and prefix length to a leaf*/
+static void set_fib_index(struct poptrie_node *node,
+        unsigned long leaf_index, char fib_index, char prefix_len)
+{
+    node->leaves[leaf_index] = fib_index;
+    node->prefixes[leaf_index] = prefix_len;
+}
+
+/*Insert a leaf at index*/
+static bool insert_leaf(struct poptrie_node *node,
+        char index, char fib_index, char prefix_len)
+{
+    int i, j;
+    char *leaves;
+    char *prefixes;
+    int size = (int)hweight64(node->leafvec);
+
+    if (index > size) {
+        pr_err("Index needs to be smaller or equal to size");
+        return false;
+    }
+
+    leaves = kcalloc(size + 1, sizeof(*leaves), GFP_ATOMIC);
+    prefixes = kcalloc(size + 1, sizeof(*prefixes), GFP_ATOMIC);
+
+    for (i = 0, j = 0; i < (size + 1); i++) {
+        if (i == index) {
+            leaves[i] = fib_index;
+            prefixes[i] = prefix_len;
+        } else {
+            leaves[i] = node->leaves[j];
+            prefixes[i] = node->prefixes[j];
+            j++;
+        }
+    }
+
+    kfree(node->leaves);
+    kfree(node->prefixes);
+    node->leaves = leaves;
+    node->prefixes = prefixes;
+    return true;
+}
+
+/*Insert a new node at index*/
+static void insert_chield_node(struct poptrie_node *node,
+        char index)
+{
+    int i, j;
+    struct poptrie_node *arr;
+    int arr_size  = (int)hweight64(node->nodevec);
+
+    arr = kcalloc(arr_size + 1, sizeof(*arr), GFP_ATOMIC);
+    for (i = 0, j = 0; i < (arr_size + 1); i++) {
+        if (i != index && j < arr_size)
+            arr[i] = node->chield_nodes[j++];
+    }
+
+    kfree(node->chield_nodes);
+    node->chield_nodes = arr;
+}
+
+void poptrie_insert(struct poptrie *pt, u32 key,
+        u8 prefix_len, struct net_device *dev)
+{
+    int offset, i;
+    u32 index;
+    u8 consecutive_leafs;
+    u64 bitmap;
+    u64 bitmap_hp;
+    int arr_size;
+    unsigned long chield_index;
+    unsigned long leaf_index, prev_leaf_index;
+    unsigned long index_hp;
+    struct poptrie_node *node;
+    u8 prev_fib_index, prev_prefix_len;
+    u8 fib_index = get_fib_index(&pt->nhs, dev);
+
+    spin_lock(&pt->lock);
+
+    if (!pt->root)
+        pt->root = kzalloc(sizeof(*pt->root), GFP_ATOMIC);
+
+    /* Default route */
+    if (prefix_len == 0) {
+        pt->def_nh = fib_index;
+        goto finish;
+    }
+
+    /*Iterate through the nodes*/
+    offset = 0;
+    node = pt->root;
+    while (prefix_len > (offset + 6)) {
+        index = extract(key, offset);
+        bitmap = 1ULL << index;
+        chield_index = hweight64(node->nodevec & (bitmap - 1));
+
+        /*No node for this index, so need to insert a node*/
+        if (!(node->nodevec & bitmap)) {
+            insert_chield_node(node, chield_index);
+            node->nodevec |= bitmap;
+        }
+        node = &node->chield_nodes[chield_index];
+        offset += 6;
+    }
+
+    /*Now need to insert a leaf*/
+
+    index = extract(key, offset);
+    bitmap = 1ULL << index;
+    consecutive_leafs = 1 << (offset + 6 - prefix_len);
+
+    if (node->vector & bitmap && node->leafvec & bitmap) {
+        /*A leaf already exists for this index, so update the existing leaf*/
+        leaf_index = hweight64(node->leafvec & (bitmap - 1));
+        arr_size = (int)hweight64(node->leafvec);
+        if (leaf_index >= arr_size)
+            goto error;
+        /*Ignore the prefix*/
+        if (node->prefixes[leaf_index] > prefix_len) {
+            goto finish;
+        } else if (node->prefixes[leaf_index] == prefix_len) {
+            set_fib_index(node, leaf_index, fib_index, prefix_len);
+        } else {
+            /*hole punching*/
+            bitmap_hp = bitmap << consecutive_leafs;
+            if (!(node->leafvec & bitmap_hp)) {
+                index_hp = hweight64(node->leafvec & (bitmap_hp - 1)) - 1;
+                if (node->prefixes[index_hp] <= prefix_len) {
+                    insert_leaf(node, index_hp, fib_index, prefix_len);
+                    node->leafvec |= bitmap_hp;
+                }
+
+                for (i = leaf_index; i < index_hp ; i++) {
+                    if (node->prefixes[i] <= prefix_len)
+                        set_fib_index(node, i, fib_index, prefix_len);
+                }
+            } else {
+                index_hp = hweight64(node->leafvec & (bitmap_hp - 1)) - 1;
+                for (i = leaf_index; i <= index_hp ; i++) {
+                    if (node->prefixes[i] <= prefix_len)
+                        set_fib_index(node, i, fib_index, prefix_len);
+                }
+            }
+        }
+    } else if (!(node->vector & bitmap)) {
+        /*No leaf for this index, so need to insert a leaf*/
+        leaf_index = hweight64(node->leafvec & (bitmap - 1));
+        insert_leaf(node, leaf_index, fib_index, prefix_len);
+        node->leafvec |= bitmap;
+    } else if (node->vector & bitmap && !(node->leafvec & bitmap)) {
+        /*There is a leaf for this index created by another
+         *  prefix with smaller length
+         */
+        prev_leaf_index = hweight64(node->leafvec & (bitmap - 1)) - 1;
+        arr_size = (int)hweight64(node->leafvec);
+        if (prev_leaf_index >= arr_size)
+            goto error;
+        if (node->prefixes[prev_leaf_index] <= prefix_len) {
+            insert_leaf(node, prev_leaf_index + 1, fib_index, prefix_len);
+            node->leafvec |= bitmap;
+        }
+
+        /*hole punching*/
+        prev_fib_index = node->leaves[prev_leaf_index];
+        prev_prefix_len = node->prefixes[prev_leaf_index];
+
+        bitmap_hp = bitmap << consecutive_leafs;
+        if (!(node->leafvec & bitmap_hp)) {
+            index_hp = hweight64(node->leafvec & (bitmap_hp - 1)) - 1;
+            if (node->prefixes[index_hp] <= prefix_len) {
+                if (prev_leaf_index < 0)
+                    goto error;
+                insert_leaf(node, index_hp + 1,
+                        prev_fib_index, prev_prefix_len);
+                node->leafvec |= bitmap_hp;
+            }
+        }
+
+        for (i = 2; i < consecutive_leafs; i++) {
+            bitmap_hp = bitmap << (i - 1);
+            if (node->leafvec & bitmap_hp) {
+                index_hp = hweight64(node->leafvec & (bitmap_hp - 1)) - 1;
+                insert_leaf(node, index_hp + 1,
+                        fib_index, prefix_len);
+                node->leafvec |= bitmap_hp;
+            }
+        }
+    }
+
+    if (consecutive_leafs > 1)
+        node->vector |= ((1ULL << consecutive_leafs) - 1) << index;
+    else
+        node->vector |= bitmap;
+
+    goto finish;
+
+error:
+    pr_err("Something is very wrong !!!!");
+finish:
+    spin_unlock(&pt->lock);
+}
+
+/*We assume that pt->root is not NULL*/
+void poptrie_lookup(struct poptrie *pt, __be32 dest, struct net_device **dev)
+{
+    register u32 index;
+    register u64 bitmap, bitmask;
+    register unsigned long leaf_index;
+    register unsigned long node_index;
+    register struct poptrie_node *node = pt->root;
+    register u8 fib_index = pt->def_nh;
+    register u8 carry = 0;
+    register u8 carry_bit = 2;
+
+    while (1) {
+        /*Extract 6 bits from dest */
+        if (likely(carry_bit != 8)) {
+            index = ((dest & 252) >> carry_bit) | carry;
+            carry = (dest & ((1 << carry_bit) - 1)) << (6 - carry_bit);
+            carry_bit = carry_bit + 2;
+            dest = dest >> 8;
+        } else {
+            index = carry;
+            carry = 0;
+            carry_bit = 2;
+        }
+
+        /*Create a bitmap based on the extracted value*/
+        bitmap = 1ULL << index;
+        bitmask = bitmap - 1;
+
+        /*Find corresponding leaf*/
+        if (likely(node->vector & bitmap)) {
+            leaf_index = hweight64(node->leafvec & bitmask);
+            if (!(node->leafvec & bitmap))
+                leaf_index--;
+            fib_index = node->leaves[leaf_index];
+        }
+
+        /*Find corresponding node*/
+        if (likely(node->nodevec & bitmap)) {
+            node_index = hweight64(node->nodevec & bitmask);
+            node = &node->chield_nodes[node_index];
+            continue;
+        }
+
+        *dev = get_fib(&pt->nhs, fib_index);
+        return;
+    }
+}
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 3dcffd3..0509a24 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1280,6 +1280,9 @@ int fib_table_insert(struct net *net, struct
fib_table *tb,
     if (err)
         goto out_fib_notif;

+    /*This should be done when Poptrie is enabled from CONFIG*/
+    poptrie_insert(&tb->pt, key, plen, fi->fib_dev);
+
     if (!plen)
         tb->tb_num_default++;

-- 
2.7.4

^ permalink raw reply related

* Re: KASAN: invalid-free in p9stat_free
From: Dominique Martinet @ 2018-08-27  5:24 UTC (permalink / raw)
  To: syzbot
  Cc: davem, ericvh, linux-kernel, lucho, netdev, syzkaller-bugs,
	v9fs-developer
In-Reply-To: <000000000000af648b057456e234@google.com>

syzbot wrote on Sun, Aug 26, 2018:
> HEAD commit:    e27bc174c9c6 Add linux-next specific files for 20180824
> git tree:       linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=15dc19a6400000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=28446088176757ea
> dashboard link: https://syzkaller.appspot.com/bug?extid=d4252148d198410b864f
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=15f8efba400000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1178256a400000
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+d4252148d198410b864f@syzkaller.appspotmail.com
> 
> random: sshd: uninitialized urandom read (32 bytes read)
> random: sshd: uninitialized urandom read (32 bytes read)
> random: sshd: uninitialized urandom read (32 bytes read)
> random: sshd: uninitialized urandom read (32 bytes read)
> ==================================================================
> BUG: KASAN: double-free or invalid-free in p9stat_free+0x35/0x100
> net/9p/protocol.c:48

That looks straightforward enough: p9pdu_vreadf does p9stat_free on
error, then v9fs_dir_readdir does the same. Nothing else could return
an error without going through the first free, so we could just remove
the later one...

There are a couple of other users of the 'S' pdu read (which reads the
stat struct and frees it on error), so it's probably best to keep the
current behaviour as far as that is concerned. What we could do, though,
is make the free function idempotent (write NULLs in the freed fields),
but I do not see this being done often. Do you know what the policy is
about this kind of pattern nowadays?

The struct is cleanly zeroed before being read, so there is no risk of
double-frees between iterations and zeroing the pointers is not strictly
required, but it does make things safer in general.
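
For reference, the idempotent-free pattern mentioned above could look
roughly like the following userspace sketch. struct demo_stat,
demo_dup() and demo_stat_free() are hypothetical stand-ins, not the
actual net/9p code; the only point is that each pointer is reset after
it is freed, so a second call becomes a no-op.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down stat structure that owns two strings. */
struct demo_stat {
    char *name;
    char *uid;
};

/* Small helper: heap copy of a string (NULL on allocation failure). */
static char *demo_dup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);

    if (copy)
        memcpy(copy, s, len);
    return copy;
}

/* Free the owned fields and clear them, so calling this twice on the
 * same structure is harmless: free(NULL) does nothing. */
static void demo_stat_free(struct demo_stat *st)
{
    assert(st != NULL);
    free(st->name);
    st->name = NULL;
    free(st->uid);
    st->uid = NULL;
}
```

With this shape, an error path that frees the structure and a caller
that frees it again on the same error no longer trip a double-free.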

-- 
Dominique Martinet

^ permalink raw reply

* Re: [PATCH] net: sched: Fix memory exposure from short TCA_U32_SEL
From: Al Viro @ 2018-08-27  4:04 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Joe Perches, Kees Cook, LKML, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, David S. Miller, Network Development
In-Reply-To: <alpine.DEB.2.21.1808262319000.2295@hadrien>

On Sun, Aug 26, 2018 at 11:35:17PM -0400, Julia Lawall wrote:

> * x = \(kmalloc\|kzalloc\|devm_kmalloc\|devm_kzalloc\)(...)

I can name several you've missed right off the top of my head -
vmalloc, kvmalloc, kmem_cache_alloc, kmem_cache_zalloc, variants
with _trace slapped on, and that is not to mention the things like
get_free_page or

void *my_k3wl_alloc(u64 n) // 'cause all artificial limits suck, that's why
{
	lots and lots of home-grown stats collection
	some tracepoints thrown in just for fun
	return kmalloc(n);
}

(and no, I'm not implying that net/sched folks had done anything of that
sort; I have seen that and worse in drivers, though)

> The * at the beginning of the line means to highlight what you are looking
> for, which is done by making a diff in which the highlighted line
> appears to be removed.

Umm...  Does that cover return, BTW?  Or something like
	T *barf;
	extern void foo(T *p);
	foo(kmalloc(sizeof(*barf)));


> The limitation is the ability to figure out the type of x.  If it is a
> local variable, Coccinelle should have no problem.  If it is a structure
> field, it may be necessary to provide command line arguments like
> 
> --all-includes --include-headers-for-types
> 
> --all-includes means to try to find all include files that are mentioned
> in the .c file.  The next stronger option is --recursive-includes, which
> means include what all of the mentioned files include as well,
> recursively.  This tends to cause a major performance hit, because a lot
> of code is being parsed.  --include-headers-for-types helps a bit with
> that, as it only considers the header files when computing type
> information, and not when applying the rules.
> 
> With respect to ifdefs around variable declarations and structure field
> declarations, in these cases Coccinelle considers that it cannot make the
> ifdef have an if-like control flow, and so it considers the #ifdef, #else
> and #endif to be comments.  Thus it takes into account only the last type
> provided for a given variable.

[snip]

What about several variants of structure definition?  Because ifdefs around
includes do occur in the wild...

^ permalink raw reply

* Re: [PATCH] net: sched: Fix memory exposure from short TCA_U32_SEL
From: Julia Lawall @ 2018-08-27  3:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Joe Perches, Kees Cook, LKML, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, David S. Miller, Network Development
In-Reply-To: <20180827023526.GA6515@ZenIV.linux.org.uk>

On Mon, 27 Aug 2018, Al Viro wrote:

> On Sun, Aug 26, 2018 at 10:00:46PM -0400, Julia Lawall wrote:
> >
> >
> > On Sun, 26 Aug 2018, Al Viro wrote:
> >
> > > On Sun, Aug 26, 2018 at 03:26:54PM -0700, Joe Perches wrote:
> > > > On Sun, 2018-08-26 at 22:24 +0100, Al Viro wrote:
> > > > > On Sun, Aug 26, 2018 at 11:57:57AM -0700, Joe Perches wrote:
> > > > >
> > > > > > > That, BTW, is why I hate the use of sizeof(*p) in kmalloc, etc.
> > > > > > > arguments.  typeof is even worse in that respect.
> > > > > >
> > > > > > True.  Semantic searches via tools like coccinelle could help here
> > > > > > but those searches are quite a bit slower than straightforward greps.
> > > > >
> > > > > Those searches are .config-sensitive as well, which can be much more
> > > > > unpleasant than being slow...
> > > >
> > > > Are they?  Julia?
> > >
> > > They work pretty much on preprocessor output level; if something is ifdef'ed
> > > out on given config, it won't be seen...
> >
> > Coccinelle doesn't care what is ifdef'd out.  It only misses the things it
> > can't parse.  Very strange ifdefs could indeed cause that, but it should
> > be a minor problem.
>
> OK, but... what does it do when it sees two definitions of a structure
> in different branches of #if/#else/#endif?  I think I'm confused about
> what it can and cannot do; to restate the original problem:
> 	* we need to find all places where instances of given type
> are created.  Assume it never is a member of struct/union/array and
> no static or auto duration instances exist - everything is dynamically
> allocated somewhere.
>
> Can coccinelle do that and if it can, what are the limitations?

Looking at the thread, it seems that what you are asking for is something
like:

@@
struct tcf_proto *x;
@@

* x = \(kmalloc\|kzalloc\|devm_kmalloc\|devm_kzalloc\)(...)

The * at the beginning of the line means to highlight what you are looking
for, which is done by making a diff in which the highlighted line
appears to be removed.

The limitation is the ability to figure out the type of x.  If it is a
local variable, Coccinelle should have no problem.  If it is a structure
field, it may be necessary to provide command line arguments like

--all-includes --include-headers-for-types

--all-includes means to try to find all include files that are mentioned
in the .c file.  The next stronger option is --recursive-includes, which
means include what all of the mentioned files include as well,
recursively.  This tends to cause a major performance hit, because a lot
of code is being parsed.  --include-headers-for-types helps a bit with
that, as it only considers the header files when computing type
information, and not when applying the rules.

With respect to ifdefs around variable declarations and structure field
declarations, in these cases Coccinelle considers that it cannot make the
ifdef have an if-like control flow, and so it considers the #ifdef, #else
and #endif to be comments.  Thus it takes into account only the last type
provided for a given variable.  For example, in the following:

int main() {
#ifdef X
  int x;
#else
  char xx;
#endif

  return x;
}

int main2() {
#ifdef X
  char x;
#else
  int x;
#endif

  return x;
}

x is considered to have type int in both returns.  But if xx is replaced
by x in the definition of main, then x at the point of the return will
have type char.

Around a statement, such as x = kmalloc(...), it should be able to
consider that both branches of an ifdef are possible.
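
As an illustration of the statement-level case, here is a small
userspace sketch in which the allocation itself sits under an #ifdef
while the declaration of x does not. struct demo_obj, demo_obj_new() and
DEMO_ZEROED are made-up names, and malloc/calloc stand in for the kernel
allocators; whether both branches are actually matched still depends on
the function parsing cleanly.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical type whose allocation sites we would want to find. */
struct demo_obj {
    int id;
};

/* The declaration of x is unconditional, so its type is unambiguous;
 * only the allocating statement differs between the two branches. */
static struct demo_obj *demo_obj_new(int id)
{
    struct demo_obj *x;

    assert(id >= 0);
#ifdef DEMO_ZEROED
    x = calloc(1, sizeof(*x));
#else
    x = malloc(sizeof(*x));
#endif
    if (x)
        x->id = id;
    return x;
}
```

Both assignments have the form "x = allocator(...)" with x of a known
pointer type, which is the shape the rule above is looking for.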

But there are no absolute guarantees.  If there is any problem in parsing
a line of some function, the whole function will be ignored.

julia

^ permalink raw reply

* Re: [PATCH] net: dsa: Drop GPIO includes
From: Andrew Lunn @ 2018-08-26 22:58 UTC (permalink / raw)
  To: Linus Walleij; +Cc: Vivien Didelot, Florian Fainelli, netdev
In-Reply-To: <20180826222011.19149-1-linus.walleij@linaro.org>

On Mon, Aug 27, 2018 at 12:20:11AM +0200, Linus Walleij wrote:
> Commit 52638f71fcff ("dsa: Move gpio reset into switch driver")
> moved the GPIO handling into the switch drivers but forgot
> to remove the GPIO header includes.
> 
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH] net: sched: Fix memory exposure from short TCA_U32_SEL
From: Al Viro @ 2018-08-27  2:35 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Joe Perches, Kees Cook, LKML, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, David S. Miller, Network Development
In-Reply-To: <alpine.DEB.2.21.1808262159430.2528@hadrien>

On Sun, Aug 26, 2018 at 10:00:46PM -0400, Julia Lawall wrote:
> 
> 
> On Sun, 26 Aug 2018, Al Viro wrote:
> 
> > On Sun, Aug 26, 2018 at 03:26:54PM -0700, Joe Perches wrote:
> > > On Sun, 2018-08-26 at 22:24 +0100, Al Viro wrote:
> > > > On Sun, Aug 26, 2018 at 11:57:57AM -0700, Joe Perches wrote:
> > > >
> > > > > > That, BTW, is why I hate the use of sizeof(*p) in kmalloc, etc.
> > > > > > arguments.  typeof is even worse in that respect.
> > > > >
> > > > > True.  Semantic searches via tools like coccinelle could help here
> > > > > but those searches are quite a bit slower than straightforward greps.
> > > >
> > > > Those searches are .config-sensitive as well, which can be much more
> > > > unpleasant than being slow...
> > >
> > > Are they?  Julia?
> >
> > They work pretty much on preprocessor output level; if something is ifdef'ed
> > out on given config, it won't be seen...
> 
> Coccinelle doesn't care what is ifdef'd out.  It only misses the things it
> can't parse.  Very strange ifdefs could indeed cause that, but it should
> be a minor problem.

OK, but... what does it do when it sees two definitions of a structure
in different branches of #if/#else/#endif?  I think I'm confused about
what it can and cannot do; to restate the original problem:
	* we need to find all places where instances of given type
are created.  Assume it never is a member of struct/union/array and
no static or auto duration instances exist - everything is dynamically
allocated somewhere.

Can coccinelle do that and if it can, what are the limitations?

^ permalink raw reply

* [PATCH] net: dsa: Drop GPIO includes
From: Linus Walleij @ 2018-08-26 22:20 UTC (permalink / raw)
  To: Andrew Lunn, Vivien Didelot, Florian Fainelli; +Cc: netdev, Linus Walleij

Commit 52638f71fcff ("dsa: Move gpio reset into switch driver")
moved the GPIO handling into the switch drivers but forgot
to remove the GPIO header includes.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
 net/dsa/dsa.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index e63c554e0623..9f3209ff7ffd 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -19,12 +19,10 @@
 #include <linux/of_mdio.h>
 #include <linux/of_platform.h>
 #include <linux/of_net.h>
-#include <linux/of_gpio.h>
 #include <linux/netdevice.h>
 #include <linux/sysfs.h>
 #include <linux/phy_fixed.h>
 #include <linux/ptp_classify.h>
-#include <linux/gpio/consumer.h>
 #include <linux/etherdevice.h>
 
 #include "dsa_priv.h"
-- 
2.17.1

^ permalink raw reply related

* RE: [PATCH net] tipc: fix the big/little endian issue in tipc_dest
From: Bai, Haiqing @ 2018-08-27  2:01 UTC (permalink / raw)
  To: David Miller
  Cc: netdev@vger.kernel.org, jon.maloy@ericsson.com, Xue, Ying,
	Gao, Zhenbo, linux-kernel@vger.kernel.org
In-Reply-To: <20180825.173641.1890220553842463695.davem@davemloft.net>

Thanks, V2 will be sent out.

-----Original Message-----
From: David Miller [mailto:davem@davemloft.net] 
Sent: August 26, 2018 8:37
To: Bai, Haiqing
Cc: netdev@vger.kernel.org; jon.maloy@ericsson.com; Xue, Ying; Gao, Zhenbo; linux-kernel@vger.kernel.org
Subject: Re: [PATCH net] tipc: fix the big/little endian issue in tipc_dest

From: Haiqing Bai <Haiqing.Bai@windriver.com>
Date: Thu, 23 Aug 2018 16:49:18 +0800

> Fixes: d06b2fa34f18 ("tipc: improve destination linked list")

Please fix this:

[davem@localhost net]$ git describe d06b2fa34f18
fatal: Not a valid object name d06b2fa34f18

^ permalink raw reply

* Re: [PATCH] net: sched: Fix memory exposure from short TCA_U32_SEL
From: Julia Lawall @ 2018-08-27  2:00 UTC (permalink / raw)
  To: Al Viro
  Cc: Joe Perches, Julia Lawall, Kees Cook, LKML, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, David S. Miller, Network Development
In-Reply-To: <20180826224322.GX6515@ZenIV.linux.org.uk>



On Sun, 26 Aug 2018, Al Viro wrote:

> On Sun, Aug 26, 2018 at 03:26:54PM -0700, Joe Perches wrote:
> > On Sun, 2018-08-26 at 22:24 +0100, Al Viro wrote:
> > > On Sun, Aug 26, 2018 at 11:57:57AM -0700, Joe Perches wrote:
> > >
> > > > > That, BTW, is why I hate the use of sizeof(*p) in kmalloc, etc.
> > > > > arguments.  typeof is even worse in that respect.
> > > >
> > > > True.  Semantic searches via tools like coccinelle could help here
> > > > but those searches are quite a bit slower than straightforward greps.
> > >
> > > Those searches are .config-sensitive as well, which can be much more
> > > unpleasant than being slow...
> >
> > Are they?  Julia?
>
> They work pretty much on preprocessor output level; if something is ifdef'ed
> out on given config, it won't be seen...

Coccinelle doesn't care what is ifdef'd out.  It only misses the things it
can't parse.  Very strange ifdefs could indeed cause that, but it should
be a minor problem.

julia

^ permalink raw reply

* [PATCH V2 net 2/2] net: hns: add netif_carrier_off before change speed and duplex
From: Peng Li @ 2018-08-27  1:59 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-kernel, linuxarm, yisen.zhuang, salil.mehta,
	lipeng321
In-Reply-To: <1535335170-111030-1-git-send-email-lipeng321@huawei.com>

If there are packets in the hardware when the speed or duplex
changes, the hardware may hang.

This patch calls netif_carrier_off before changing speed and
duplex in ethtool_ops.set_link_ksettings, and calls
netif_carrier_on after the change completes.

Signed-off-by: Peng Li <lipeng321@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns/hns_ethtool.c | 2 ++
 1 file changed, 2 insertions(+)
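
As a rough illustration of the ordering described in the commit message, the sketch below models it in plain user-space C. Everything named demo_* is hypothetical: it stands in for struct net_device, netif_carrier_off()/netif_carrier_on() and the driver's adjust_link callback, and it is not the hns driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device; only what the sketch needs. */
struct demo_netdev {
	bool carrier_ok;
	int speed;
	int duplex;
	bool changed_with_carrier_on;	/* set if the MAC was touched while up */
};

void demo_carrier_off(struct demo_netdev *dev) { dev->carrier_ok = false; }
void demo_carrier_on(struct demo_netdev *dev) { dev->carrier_ok = true; }

/* Models reprogramming the MAC; records whether the carrier was still on. */
void demo_adjust_link(struct demo_netdev *dev, int speed, int duplex)
{
	if (dev->carrier_ok)
		dev->changed_with_carrier_on = true;
	dev->speed = speed;
	dev->duplex = duplex;
}

/* Bracket the change: carrier off, adjust, carrier on. */
void demo_set_link(struct demo_netdev *dev, int speed, int duplex)
{
	assert(dev != NULL);
	demo_carrier_off(dev);
	demo_adjust_link(dev, speed, duplex);
	demo_carrier_on(dev);
}
```

After demo_set_link() returns, the carrier is on again, the new settings are in place, and changed_with_carrier_on stays false, which is the property the bracket is meant to provide.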

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
index 08f3c47..774beda 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
@@ -243,7 +243,9 @@ static int hns_nic_set_link_ksettings(struct net_device *net_dev,
 	}
 
 	if (h->dev->ops->adjust_link) {
+		netif_carrier_off(net_dev);
 		h->dev->ops->adjust_link(h, (int)speed, cmd->base.duplex);
+		netif_carrier_on(net_dev);
 		return 0;
 	}
 
-- 
2.9.3


* [PATCH V2 net 1/2] net: hns: add the code for cleaning pkt in chip
From: Peng Li @ 2018-08-27  1:59 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-kernel, linuxarm, yisen.zhuang, salil.mehta,
	lipeng321
In-Reply-To: <1535335170-111030-1-git-send-email-lipeng321@huawei.com>

If packets are still in the hardware when the speed or duplex is
changed, the hardware may hang.

This patch adds code that waits for the chip to drain all packets
(TX and RX) when the driver calls its "adjust link" function.

The packets are drained as follows:
1) disable rx on the chip and stop tx from the protocol stack.
2) wait for the rcb, ppe and mac to drain.
3) adjust the link.
4) re-enable rx on the chip and tx from the protocol stack.

Signed-off-by: Peng Li <lipeng321@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns/hnae.h          |  2 +
 drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c  | 67 +++++++++++++++++++++-
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_gmac.c | 36 ++++++++++++
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c  | 44 ++++++++++++++
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h  |  8 +++
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c | 29 ++++++++++
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.h |  3 +
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c  | 23 ++++++++
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.h  |  1 +
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c  | 23 ++++++++
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h  |  1 +
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h  |  1 +
 drivers/net/ethernet/hisilicon/hns/hns_enet.c      | 21 ++++++-
 13 files changed, 255 insertions(+), 4 deletions(-)
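
The four steps listed in the commit message can be sketched in user-space C as follows. This is a simplified model under stated assumptions, not the driver code: struct demo_chip, the three-stage pending[] array (standing in for rcb, ppe and mac) and the "one packet retired per poll" behaviour are all invented for illustration. The error path mirrors the AE_VERSION_2 branch of the diff, where rx is re-enabled and the link is left unchanged if the drain fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_STAGES 3	/* hypothetical: rcb, ppe, mac */

struct demo_chip {
	bool rx_enabled;
	int pending[DEMO_STAGES];	/* packets still held by each stage */
	int speed;
	int duplex;
};

/* Poll one stage at most max_polls times; each poll retires one packet. */
int demo_wait_stage_clean(int *pending, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (*pending == 0)
			return 0;
		(*pending)--;	/* pretend the hardware made progress */
	}
	return *pending == 0 ? 0 : -EBUSY;
}

int demo_adjust_link(struct demo_chip *chip, int speed, int duplex,
		     int max_polls)
{
	assert(chip != NULL);

	chip->rx_enabled = false;			/* step 1 */

	for (size_t i = 0; i < DEMO_STAGES; i++) {	/* step 2 */
		int ret = demo_wait_stage_clean(&chip->pending[i], max_polls);

		if (ret) {
			chip->rx_enabled = true;	/* give up, keep old link */
			return ret;
		}
	}

	chip->speed = speed;				/* step 3 */
	chip->duplex = duplex;
	chip->rx_enabled = true;			/* step 4 */
	return 0;
}
```

With a small backlog the call succeeds and applies the new settings; with a backlog larger than max_polls it returns -EBUSY, leaves speed and duplex untouched, and still re-enables rx.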

diff --git a/drivers/net/ethernet/hisilicon/hns/hnae.h b/drivers/net/ethernet/hisilicon/hns/hnae.h
index cad52bd..08a750f 100644
--- a/drivers/net/ethernet/hisilicon/hns/hnae.h
+++ b/drivers/net/ethernet/hisilicon/hns/hnae.h
@@ -486,6 +486,8 @@ struct hnae_ae_ops {
 			u8 *auto_neg, u16 *speed, u8 *duplex);
 	void (*toggle_ring_irq)(struct hnae_ring *ring, u32 val);
 	void (*adjust_link)(struct hnae_handle *handle, int speed, int duplex);
+	bool (*need_adjust_link)(struct hnae_handle *handle,
+				 int speed, int duplex);
 	int (*set_loopback)(struct hnae_handle *handle,
 			    enum hnae_loop loop_mode, int en);
 	void (*get_ring_bdnum_limit)(struct hnae_queue *queue,
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
index e6aad30..b52029e 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
@@ -155,6 +155,41 @@ static void hns_ae_put_handle(struct hnae_handle *handle)
 		hns_ae_get_ring_pair(handle->qs[i])->used_by_vf = 0;
 }
 
+static int hns_ae_wait_flow_down(struct hnae_handle *handle)
+{
+	struct dsaf_device *dsaf_dev;
+	struct hns_ppe_cb *ppe_cb;
+	struct hnae_vf_cb *vf_cb;
+	int ret;
+	int i;
+
+	for (i = 0; i < handle->q_num; i++) {
+		ret = hns_rcb_wait_tx_ring_clean(handle->qs[i]);
+		if (ret)
+			return ret;
+	}
+
+	ppe_cb = hns_get_ppe_cb(handle);
+	ret = hns_ppe_wait_tx_fifo_clean(ppe_cb);
+	if (ret)
+		return ret;
+
+	dsaf_dev = hns_ae_get_dsaf_dev(handle->dev);
+	if (!dsaf_dev)
+		return -EINVAL;
+	ret = hns_dsaf_wait_pkt_clean(dsaf_dev, handle->dport_id);
+	if (ret)
+		return ret;
+
+	vf_cb = hns_ae_get_vf_cb(handle);
+	ret = hns_mac_wait_fifo_clean(vf_cb->mac_cb);
+	if (ret)
+		return ret;
+
+	mdelay(10);
+	return 0;
+}
+
 static void hns_ae_ring_enable_all(struct hnae_handle *handle, int val)
 {
 	int q_num = handle->q_num;
@@ -399,12 +434,41 @@ static int hns_ae_get_mac_info(struct hnae_handle *handle,
 	return hns_mac_get_port_info(mac_cb, auto_neg, speed, duplex);
 }
 
+static bool hns_ae_need_adjust_link(struct hnae_handle *handle, int speed,
+				    int duplex)
+{
+	struct hns_mac_cb *mac_cb = hns_get_mac_cb(handle);
+
+	return hns_mac_need_adjust_link(mac_cb, speed, duplex);
+}
+
 static void hns_ae_adjust_link(struct hnae_handle *handle, int speed,
 			       int duplex)
 {
 	struct hns_mac_cb *mac_cb = hns_get_mac_cb(handle);
 
-	hns_mac_adjust_link(mac_cb, speed, duplex);
+	switch (mac_cb->dsaf_dev->dsaf_ver) {
+	case AE_VERSION_1:
+		hns_mac_adjust_link(mac_cb, speed, duplex);
+		break;
+
+	case AE_VERSION_2:
+		/* the chip must drain all in-flight packets first */
+		hns_mac_disable(mac_cb, MAC_COMM_MODE_RX);
+		if (hns_ae_wait_flow_down(handle)) {
+			hns_mac_enable(mac_cb, MAC_COMM_MODE_RX);
+			break;
+		}
+
+		hns_mac_adjust_link(mac_cb, speed, duplex);
+		hns_mac_enable(mac_cb, MAC_COMM_MODE_RX);
+		break;
+
+	default:
+		break;
+	}
+
+	return;
 }
 
 static void hns_ae_get_ring_bdnum_limit(struct hnae_queue *queue,
@@ -902,6 +966,7 @@ static struct hnae_ae_ops hns_dsaf_ops = {
 	.get_status = hns_ae_get_link_status,
 	.get_info = hns_ae_get_mac_info,
 	.adjust_link = hns_ae_adjust_link,
+	.need_adjust_link = hns_ae_need_adjust_link,
 	.set_loopback = hns_ae_config_loopback,
 	.get_ring_bdnum_limit = hns_ae_get_ring_bdnum_limit,
 	.get_pauseparam = hns_ae_get_pauseparam,
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_gmac.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_gmac.c
index 5488c6e..09e4061 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_gmac.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_gmac.c
@@ -257,6 +257,16 @@ static void hns_gmac_get_pausefrm_cfg(void *mac_drv, u32 *rx_pause_en,
 	*tx_pause_en = dsaf_get_bit(pause_en, GMAC_PAUSE_EN_TX_FDFC_B);
 }
 
+static bool hns_gmac_need_adjust_link(void *mac_drv, enum mac_speed speed,
+				      int duplex)
+{
+	struct mac_driver *drv = (struct mac_driver *)mac_drv;
+	struct hns_mac_cb *mac_cb = drv->mac_cb;
+
+	return (mac_cb->speed != speed) ||
+		(mac_cb->half_duplex == duplex);
+}
+
 static int hns_gmac_adjust_link(void *mac_drv, enum mac_speed speed,
 				u32 full_duplex)
 {
@@ -309,6 +319,30 @@ static void hns_gmac_set_promisc(void *mac_drv, u8 en)
 		hns_gmac_set_uc_match(mac_drv, en);
 }
 
+int hns_gmac_wait_fifo_clean(void *mac_drv)
+{
+	struct mac_driver *drv = (struct mac_driver *)mac_drv;
+	int wait_cnt;
+	u32 val;
+
+	wait_cnt = 0;
+	while (wait_cnt++ < HNS_MAX_WAIT_CNT) {
+		val = dsaf_read_dev(drv, GMAC_FIFO_STATE_REG);
+		/* bits 5..0 are non-zero while packets are not fully sent */
+		if ((val & 0x3f) == 0)
+			break;
+		usleep_range(100, 200);
+	}
+
+	if (wait_cnt > HNS_MAX_WAIT_CNT) {
+		dev_err(drv->dev,
+			"hns ge %d fifo was not idle.\n", drv->mac_id);
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
 static void hns_gmac_init(void *mac_drv)
 {
 	u32 port;
@@ -690,6 +724,7 @@ void *hns_gmac_config(struct hns_mac_cb *mac_cb, struct mac_params *mac_param)
 	mac_drv->mac_disable = hns_gmac_disable;
 	mac_drv->mac_free = hns_gmac_free;
 	mac_drv->adjust_link = hns_gmac_adjust_link;
+	mac_drv->need_adjust_link = hns_gmac_need_adjust_link;
 	mac_drv->set_tx_auto_pause_frames = hns_gmac_set_tx_auto_pause_frames;
 	mac_drv->config_max_frame_length = hns_gmac_config_max_frame_length;
 	mac_drv->mac_pausefrm_cfg = hns_gmac_pause_frm_cfg;
@@ -717,6 +752,7 @@ void *hns_gmac_config(struct hns_mac_cb *mac_cb, struct mac_params *mac_param)
 	mac_drv->get_strings = hns_gmac_get_strings;
 	mac_drv->update_stats = hns_gmac_update_stats;
 	mac_drv->set_promiscuous = hns_gmac_set_promisc;
+	mac_drv->wait_fifo_clean = hns_gmac_wait_fifo_clean;
 
 	return (void *)mac_drv;
 }
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c
index 1c2326b..6ed6f14 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c
@@ -114,6 +114,26 @@ int hns_mac_get_port_info(struct hns_mac_cb *mac_cb,
 	return 0;
 }
 
+/**
+ * hns_mac_need_adjust_link - check whether mac speed/duplex must be changed
+ * @mac_cb: mac device
+ * @speed: phy device speed
+ * @duplex: phy device duplex
+ *
+ */
+bool hns_mac_need_adjust_link(struct hns_mac_cb *mac_cb, int speed, int duplex)
+{
+	struct mac_driver *mac_ctrl_drv;
+
+	mac_ctrl_drv = (struct mac_driver *)(mac_cb->priv.mac);
+
+	if (mac_ctrl_drv->need_adjust_link)
+		return mac_ctrl_drv->need_adjust_link(mac_ctrl_drv,
+			(enum mac_speed)speed, duplex);
+	else
+		return true;
+}
+
 void hns_mac_adjust_link(struct hns_mac_cb *mac_cb, int speed, int duplex)
 {
 	int ret;
@@ -430,6 +450,16 @@ int hns_mac_vm_config_bc_en(struct hns_mac_cb *mac_cb, u32 vmid, bool enable)
 	return 0;
 }
 
+int hns_mac_wait_fifo_clean(struct hns_mac_cb *mac_cb)
+{
+	struct mac_driver *drv = hns_mac_get_drv(mac_cb);
+
+	if (drv->wait_fifo_clean)
+		return drv->wait_fifo_clean(drv);
+
+	return 0;
+}
+
 void hns_mac_reset(struct hns_mac_cb *mac_cb)
 {
 	struct mac_driver *drv = hns_mac_get_drv(mac_cb);
@@ -998,6 +1028,20 @@ static int hns_mac_get_max_port_num(struct dsaf_device *dsaf_dev)
 		return  DSAF_MAX_PORT_NUM;
 }
 
+void hns_mac_enable(struct hns_mac_cb *mac_cb, enum mac_commom_mode mode)
+{
+	struct mac_driver *mac_ctrl_drv = hns_mac_get_drv(mac_cb);
+
+	mac_ctrl_drv->mac_enable(mac_cb->priv.mac, mode);
+}
+
+void hns_mac_disable(struct hns_mac_cb *mac_cb, enum mac_commom_mode mode)
+{
+	struct mac_driver *mac_ctrl_drv = hns_mac_get_drv(mac_cb);
+
+	mac_ctrl_drv->mac_disable(mac_cb->priv.mac, mode);
+}
+
 /**
  * hns_mac_init - init mac
  * @dsaf_dev: dsa fabric device struct pointer
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h
index bbc0a98..fbc7534 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h
@@ -356,6 +356,9 @@ struct mac_driver {
 	/*adjust mac mode of port,include speed and duplex*/
 	int (*adjust_link)(void *mac_drv, enum mac_speed speed,
 			   u32 full_duplex);
+	/* check whether the link needs adjusting */
+	bool (*need_adjust_link)(void *mac_drv, enum mac_speed speed,
+				 int duplex);
 	/* config autoegotaite mode of port*/
 	void (*set_an_mode)(void *mac_drv, u8 enable);
 	/* config loopbank mode */
@@ -394,6 +397,7 @@ struct mac_driver {
 	void (*get_info)(void *mac_drv, struct mac_info *mac_info);
 
 	void (*update_stats)(void *mac_drv);
+	int (*wait_fifo_clean)(void *mac_drv);
 
 	enum mac_mode mac_mode;
 	u8 mac_id;
@@ -427,6 +431,7 @@ void *hns_xgmac_config(struct hns_mac_cb *mac_cb,
 
 int hns_mac_init(struct dsaf_device *dsaf_dev);
 void mac_adjust_link(struct net_device *net_dev);
+bool hns_mac_need_adjust_link(struct hns_mac_cb *mac_cb, int speed, int duplex);
 void hns_mac_get_link_status(struct hns_mac_cb *mac_cb,	u32 *link_status);
 int hns_mac_change_vf_addr(struct hns_mac_cb *mac_cb, u32 vmid, char *addr);
 int hns_mac_set_multi(struct hns_mac_cb *mac_cb,
@@ -463,5 +468,8 @@ int hns_mac_add_uc_addr(struct hns_mac_cb *mac_cb, u8 vf_id,
 int hns_mac_rm_uc_addr(struct hns_mac_cb *mac_cb, u8 vf_id,
 		       const unsigned char *addr);
 int hns_mac_clr_multicast(struct hns_mac_cb *mac_cb, int vfn);
+void hns_mac_enable(struct hns_mac_cb *mac_cb, enum mac_commom_mode mode);
+void hns_mac_disable(struct hns_mac_cb *mac_cb, enum mac_commom_mode mode);
+int hns_mac_wait_fifo_clean(struct hns_mac_cb *mac_cb);
 
 #endif /* _HNS_DSAF_MAC_H */
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
index ca50c25..e557a4e 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
@@ -2727,6 +2727,35 @@ void hns_dsaf_set_promisc_tcam(struct dsaf_device *dsaf_dev,
 	soft_mac_entry->index = enable ? entry_index : DSAF_INVALID_ENTRY_IDX;
 }
 
+int hns_dsaf_wait_pkt_clean(struct dsaf_device *dsaf_dev, int port)
+{
+	u32 val, val_tmp;
+	int wait_cnt;
+
+	if (port >= DSAF_SERVICE_NW_NUM)
+		return 0;
+
+	wait_cnt = 0;
+	while (wait_cnt++ < HNS_MAX_WAIT_CNT) {
+		val = dsaf_read_dev(dsaf_dev, DSAF_VOQ_IN_PKT_NUM_0_REG +
+			(port + DSAF_XGE_NUM) * 0x40);
+		val_tmp = dsaf_read_dev(dsaf_dev, DSAF_VOQ_OUT_PKT_NUM_0_REG +
+			(port + DSAF_XGE_NUM) * 0x40);
+		if (val == val_tmp)
+			break;
+
+		usleep_range(100, 200);
+	}
+
+	if (wait_cnt > HNS_MAX_WAIT_CNT) {
+		dev_err(dsaf_dev->dev, "hns dsaf clean wait timeout(%u - %u).\n",
+			val, val_tmp);
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
 /**
  * dsaf_probe - probo dsaf dev
  * @pdev: dasf platform device
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.h b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.h
index 4507e82..0e1cd99 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.h
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.h
@@ -44,6 +44,8 @@ struct hns_mac_cb;
 #define DSAF_ROCE_CREDIT_CHN	8
 #define DSAF_ROCE_CHAN_MODE	3
 
+#define HNS_MAX_WAIT_CNT 10000
+
 enum dsaf_roce_port_mode {
 	DSAF_ROCE_6PORT_MODE,
 	DSAF_ROCE_4PORT_MODE,
@@ -463,5 +465,6 @@ int hns_dsaf_rm_mac_addr(
 
 int hns_dsaf_clr_mac_mc_port(struct dsaf_device *dsaf_dev,
 			     u8 mac_id, u8 port_num);
+int hns_dsaf_wait_pkt_clean(struct dsaf_device *dsaf_dev, int port);
 
 #endif /* __HNS_DSAF_MAIN_H__ */
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
index d160d8c..0942e49 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
@@ -275,6 +275,29 @@ static void hns_ppe_exc_irq_en(struct hns_ppe_cb *ppe_cb, int en)
 	dsaf_write_dev(ppe_cb, PPE_INTEN_REG, msk_vlue & vld_msk);
 }
 
+int hns_ppe_wait_tx_fifo_clean(struct hns_ppe_cb *ppe_cb)
+{
+	int wait_cnt;
+	u32 val;
+
+	wait_cnt = 0;
+	while (wait_cnt++ < HNS_MAX_WAIT_CNT) {
+		val = dsaf_read_dev(ppe_cb, PPE_CURR_TX_FIFO0_REG) & 0x3ffU;
+		if (!val)
+			break;
+
+		usleep_range(100, 200);
+	}
+
+	if (wait_cnt > HNS_MAX_WAIT_CNT) {
+		dev_err(ppe_cb->dev, "hns ppe tx fifo clean wait timeout, still has %u pkt.\n",
+			val);
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
 /**
  * ppe_init_hw - init ppe
  * @ppe_cb: ppe device
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.h b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.h
index 9d8e643..f670e63 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.h
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.h
@@ -100,6 +100,7 @@ struct ppe_common_cb {
 
 };
 
+int hns_ppe_wait_tx_fifo_clean(struct hns_ppe_cb *ppe_cb);
 int hns_ppe_init(struct dsaf_device *dsaf_dev);
 
 void hns_ppe_uninit(struct dsaf_device *dsaf_dev);
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c
index 9d76e2e..5d64519 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c
@@ -66,6 +66,29 @@ void hns_rcb_wait_fbd_clean(struct hnae_queue **qs, int q_num, u32 flag)
 			"queue(%d) wait fbd(%d) clean fail!!\n", i, fbd_num);
 }
 
+int hns_rcb_wait_tx_ring_clean(struct hnae_queue *qs)
+{
+	u32 head, tail;
+	int wait_cnt;
+
+	tail = dsaf_read_dev(&qs->tx_ring, RCB_REG_TAIL);
+	wait_cnt = 0;
+	while (wait_cnt++ < HNS_MAX_WAIT_CNT) {
+		head = dsaf_read_dev(&qs->tx_ring, RCB_REG_HEAD);
+		if (tail == head)
+			break;
+
+		usleep_range(100, 200);
+	}
+
+	if (wait_cnt > HNS_MAX_WAIT_CNT) {
+		dev_err(qs->dev->dev, "rcb wait timeout, head not equal to tail.\n");
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
 /**
  *hns_rcb_reset_ring_hw - ring reset
  *@q: ring struct pointer
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h
index 6028164..2319b77 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h
@@ -136,6 +136,7 @@ void hns_rcbv2_int_clr_hw(struct hnae_queue *q, u32 flag);
 void hns_rcb_init_hw(struct ring_pair_cb *ring);
 void hns_rcb_reset_ring_hw(struct hnae_queue *q);
 void hns_rcb_wait_fbd_clean(struct hnae_queue **qs, int q_num, u32 flag);
+int hns_rcb_wait_tx_ring_clean(struct hnae_queue *qs);
 u32 hns_rcb_get_rx_coalesced_frames(
 	struct rcb_common_cb *rcb_common, u32 port_idx);
 u32 hns_rcb_get_tx_coalesced_frames(
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h
index 886cbbf..74d935d 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h
@@ -464,6 +464,7 @@
 #define RCB_RING_INTMSK_TX_OVERTIME_REG		0x000C4
 #define RCB_RING_INTSTS_TX_OVERTIME_REG		0x000C8
 
+#define GMAC_FIFO_STATE_REG			0x0000UL
 #define GMAC_DUPLEX_TYPE_REG			0x0008UL
 #define GMAC_FD_FC_TYPE_REG			0x000CUL
 #define GMAC_TX_WATER_LINE_REG			0x0010UL
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index 02a0ba2..f56855e 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -1112,11 +1112,26 @@ static void hns_nic_adjust_link(struct net_device *ndev)
 	struct hnae_handle *h = priv->ae_handle;
 	int state = 1;
 
+	/* If there is no phy, the link does not need adjusting */
 	if (ndev->phydev) {
-		h->dev->ops->adjust_link(h, ndev->phydev->speed,
-					 ndev->phydev->duplex);
-		state = ndev->phydev->link;
+		/* When the phy link is down, do nothing */
+		if (ndev->phydev->link == 0)
+			return;
+
+		if (h->dev->ops->need_adjust_link(h, ndev->phydev->speed,
+						  ndev->phydev->duplex)) {
+			/* The Hi161X chip does not support changing gmac
+			 * speed and duplex while traffic is flowing. Delay
+			 * 200ms to make sure no data is left in the chip FIFO.
+			 */
+			netif_carrier_off(ndev);
+			msleep(200);
+			h->dev->ops->adjust_link(h, ndev->phydev->speed,
+						 ndev->phydev->duplex);
+			netif_carrier_on(ndev);
+		}
 	}
+
 	state = state && h->dev->ops->get_status(h);
 
 	if (state != priv->link) {
-- 
2.9.3
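
The wait helpers in this patch all follow the same bounded-polling pattern. One way to structure such a loop so that success on the last permitted poll is never confused with a timeout is to return directly from inside the loop. The sketch below is a user-space model under assumptions: struct demo_fifo, demo_fifo_idle() and DEMO_MAX_WAIT_CNT are hypothetical, and the sleep between polls is only indicated by a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_WAIT_CNT 5	/* hypothetical poll budget */

/* Hypothetical device that reports idle on its Nth status read. */
struct demo_fifo {
	int polls_until_idle;
	int polls_done;
};

/* Models one register read; returns true once the FIFO has drained. */
bool demo_fifo_idle(struct demo_fifo *f)
{
	f->polls_done++;
	return f->polls_done >= f->polls_until_idle;
}

/* Returns 0 as soon as the FIFO is idle, -EBUSY once the budget is spent. */
int demo_wait_fifo_clean(struct demo_fifo *f)
{
	assert(f != NULL);

	for (int tries = 0; tries < DEMO_MAX_WAIT_CNT; tries++) {
		if (demo_fifo_idle(f))
			return 0;
		/* a driver would sleep here, e.g. usleep_range(100, 200) */
	}

	return -EBUSY;
}
```

Because the success path returns from within the loop, no counter needs to be inspected afterwards, and a device that becomes idle on exactly the last poll is reported as drained rather than timed out.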

