Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Al Viro @ 2014-12-08 19:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development
In-Reply-To: <CA+55aFyh0cV+ztK47mu1QN6aF4VvPE1V5zF7_X2TdYxZ38zZRA@mail.gmail.com>

On Mon, Dec 08, 2014 at 10:57:31AM -0800, Linus Torvalds wrote:
> So the whole "get_page()" thing is broken. Iterating over pages in a
> KVEC is simply wrong, wrong, wrong. It needs to fail.

Well, _that_ is easy to do, of course...  E.g. by replacing that thing in
iov_iter_get_pages()/iov_iter_get_pages_alloc() with ({ return -EFAULT; })
will do it.

> Iterating over a KVEC to *copy* data is ok. But no page lookup stuff
> or page reference things.
> 
> The old code that apparently used "get_user_pages_fast()" was ok
> almost by mistake, because it fails on all kernel pages.

On x86 it does, but I don't see anything obvious in generic version in
mm/gup.c, so the old code might still have a problem on some architectures.
What am I missing here?

^ permalink raw reply

* Re: wl1251: NVS firmware data
From: Dan Williams @ 2014-12-08 19:26 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Marcel Holtmann, Greg Kroah-Hartman, Ming Lei, Pavel Machek,
	John W. Linville, Grazvydas Ignotas,
	linux-wireless@vger.kernel.org, Network Development,
	Linux Kernel Mailing List, Ivaylo Dimitrov, Aaro Koskinen,
	Kalle Valo, Sebastian Reichel, David Gnedt
In-Reply-To: <201412082015.18501@pali>

On Mon, 2014-12-08 at 20:15 +0100, Pali Rohár wrote:
> On Monday 08 December 2014 19:50:17 Marcel Holtmann wrote:
> > Hi Pali,
> > 
> > >>>>>> On Saturday 06 December 2014 13:49:54 Pavel Machek
> > >>>>>> wrote: /**
> > >>>>>> 
> > >>>>>> + * request_firmware_prefer_user: - prefer usermode
> > >>>>>> helper for loading firmware + * @firmware_p: pointer to
> > >>>>>> firmware image
> > >>>>>> + * @name: name of firmware file
> > >>>>>> + * @device: device for which firmware is being loaded
> > >>>>>> + *
> > >>>>>> + * This function works pretty much like
> > >>>>>> request_firmware(), but it prefer + * usermode helper.
> > >>>>>> If usermode helper fails then it fallback to direct
> > >>>>>> access. + * Usefull for dynamic or model specific
> > >>>>>> firmware data. + **/
> > >>>>>> +int request_firmware_prefer_user(const struct firmware
> > >>>>>> **firmware_p, +                           const char
> > >>>>>> *name, struct device *device) +{
> > >>>>>> +       int ret;
> > >>>>>> +       __module_get(THIS_MODULE);
> > >>>>>> +       ret = _request_firmware(firmware_p, name,
> > >>>>>> device, +                               FW_OPT_UEVENT
> > >>>>>> | FW_OPT_PREFER_USER); +      
> > >>>>>> module_put(THIS_MODULE); +       return ret;
> > >>>>>> +}
> > >>>>>> +EXPORT_SYMBOL_GPL(request_firmware_prefer_user);
> > >>>>> 
> > >>>>> I'd like to introduce request_firmware_user() which only
> > >>>>> requests firmware from user space, and this way is
> > >>>>> simpler and more flexible since we have
> > >>>>> request_firmware_direct() already.
> > >>>> 
> > >>>> Why would a driver care about what program provides the
> > >>>> firmware?  It shouldn't at all, and we want to get rid of
> > >>>> the userspace firmware loader, not encourage drivers to
> > >>>> use it "exclusively" at all.
> > >>> 
> > >>> Do not remove it! Without userspace firmware loader it is
> > >>> impossible to load dynamic firmware files.
> > >> 
> > >> why is this dynamic in the first place. It does not sound
> > >> like dynamic data to me at all. This is like the WiFi MAC
> > >> address(es) or Bluetooth BD_ADDR. They are all static
> > >> information. The only difference is that they are on the
> > >> host accessibly filesystem or storage and not on the
> > >> device itself.
> > >> 
> > >> To be honest, for Bluetooth we solved this now. If the
> > >> device is missing key information like the calibration
> > >> data or BD_ADDR, then it comes up unconfigured. A
> > >> userspace process can then go and load the right data into
> > >> it and then the device becomes available as Bluetooth
> > >> device.
> > >> 
> > >> Trying to use request_firmware to load some random data and
> > >> insist on going through userspace helper for that sounds
> > >> crazy to me. Especially since we are trying hard to get
> > >> away from the userspace loader. Forcing to keep it for new
> > >> stuff sounds backwards to me.
> > >> 
> > >> With the special Nokia partition in mind, why hasn't this
> > >> been turned into a mountable filesystem or into a
> > >> driver/subsystem that can access the data direct from the
> > >> kernel. I advocated for this some time ago. Maybe there
> > >> should be a special subsystem for access to these factory
> > >> persistent information that drivers then just can access.
> > >> I seem to remember that some systems provide these via
> > >> ACPI. Why does the ARM platform has to be special here?
> > >> 
> > >> And the problem of getting Ethernet and WiFi MAC address
> > >> and Bluetooth BD_ADDR comes up many many times. Why not
> > >> have something generic here. And don't tell me
> > >> request_firmware is that generic solution ;)
> > >> 
> > >> Regards
> > >> 
> > >> Marcel
> > > 
> > > Hi Marcel. I think you did not understand this problem. This
> > > discussion is not about mac address. Please read email
> > > thread again and if there are some unclear pars, then ask.
> > > Thanks!
> > 
> > I think that I pretty clearly understand the problem.
> > Calibration data, MAC address, what is the difference? For me
> > this is all the same. It is data that is specific to a device
> > or type of devices and it is stored somewhere else. In most
> > cases in some immutable memory/flash area.
> > 
> 
> Those calibration data (in form of binary NVS firmware file) 
> needs to be sent to wl1251 chip. Mac address is not needed at 
> this step (and kernel generate some random if is not provided).
> 
> (Just to note wl1271 driver loads both MAC address and NVS data 
> via one firmware file which is prepared by userspace, but this 
> discussion is about wl1251...)
> 
> > What you want is access to this data since the kernel driver
> > needs it. Do I get this so far ;)
> > 
> 
> Yes, we need to provide NVS data to kernel when kernel ask for 
> them.
> 
> > So my take is that request_firmware is not the right way to
> > get this data. Or more precisely make sure that this data is
> > available to kernel drivers. And what I am seeing here is
> > that instead of actually solving the bigger problem, we just
> > hack around it with request_firmware. Now surprisingly the
> > request_firmware loads files directly from the kernel and all
> > the hacks do not work anymore.
> > 
> > Regards
> > 
> > Marcel
> 
> Just read emails again...
> 
> Our problem is:
> 
> linux-firmware.git tree provides two binary firmware files:
> 
> ti-connectivity/wl1251-fw.bin
> ti-connectivity/wl1251-nvs.bin
> 
> First is firmware file, second NVS file with generic calibration 
> data. Kernel driver wl1251 now loads both firmware files via 
> request_firmware. Generic calibration data are enough for wl1251 
> chip (it should work). But devices have own calibration data 
> stored somewhere else.
> 
> On Nokia N900 NVS data are generated on-the-fly from some bytes 
> from CAL (/dev/mtd1), from state of cellular network and from 
> some other regulation settings.
> 
> So I think that files stored in linux-firmware.git tree (which 
> are also installed into /lib/firmware/) should be loaded with 
> request_firmware function. Or not? Do you think something else? 
> What other developers think?
> 
> I'm against kernel driver for CAL (/dev/mtd1) for more reasons:
> 
> 1) we have userspace open source code, but licensed under GPLv3. 
> And until kernel change license, we cannot include it.
> 
> 2) NVS data are (probably) not in one place, plus they depends on 
> something other.
> 
> 3) If manufacture XYZ create new device with its own storage 
> format of calibration data this means that correct solution for 
> XYZ is also to implement new kernel fs driver for its own format. 
> Do you really want to have in kernel all those drivers for all 
> different (proprietary) storage formats?
> 
> 4) It does not help us with existence of generic file 
> /lib/firmware/ti-connectivity/wl1251-nvs.bin which comes from 
> linux-firmware.git tree.

a) change driver to prefer a new "wl1251-nvs-n900.bin" file, but fall
back to "wl1251-nvs.bin" if the first one isn't present
b) have a "wl1251-cal-nvs-update" service that, if wl1521-nvs-n900.bin
is *not* present mounts the CAL MTD, reads the data, writes it out into
wl1521-nvs-n900.bin, and the rmmod/modprobes the driver

and done?  Stuff that's not N900 just wouldn't ship the update service
and would proceed like none of this happened.

Dan

^ permalink raw reply

* RE: [PATCH RFC] pci: Control whether VFs are probed on pci_enable_sriov
From: Yuval Mintz @ 2014-12-08 19:25 UTC (permalink / raw)
  To: Eli Cohen
  Cc: bhelgaas@google.com, David Miller, linux-pci, netdev,
	ogerlitz@mellanox.com, yevgenyp@mellanox.com, Donald Dutile
In-Reply-To: <20141208191032.GC24790@mtldesk30>

[-- Attachment #1: Type: text/plain, Size: 1019 bytes --]

>>>> What's the end-game here? How eventually would this be controlled?
>>>You can probe any VF at the hypervisor through sysfs files

>>>(bind/unbind). You can also pass them through to a VM. Nothing
>>>changes.

>> If you're not planning on adding a logic to set this, why do we need to
>> add a parameter to pci_enable_sriov() - given that all callers use the
>> exact same logic?
>> [And I don't really think we'd want different devices to behave differently
>> by default; That would be confusing for users.]

>Currently the kerenl will call probe for any device for which there is
>a supporting driver and I did not want to change that.

But what's next? How will this feature be activated?

Adding a parameter to pci_enable_sriov() might give developers the false
impression they can change behavior by passing `0' to this function; 
But that shouldn't be the method to control this - we should have uniform 
control of the feature across different drivers, e.g., by an additional sysfs node.

[-- Attachment #2: winmail.dat --]
[-- Type: application/ms-tnef, Size: 4112 bytes --]

^ permalink raw reply

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Kirill A. Shutemov @ 2014-12-08 19:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel,
	Network Development
In-Reply-To: <CA+55aFwow9hVqNVGp7gjnnu4=A+J8vw1i22W4e7+1r1-215cHg@mail.gmail.com>

On Mon, Dec 08, 2014 at 11:01:41AM -0800, Linus Torvalds wrote:
> On Mon, Dec 8, 2014 at 10:56 AM, Kirill A. Shutemov
> <kirill@shutemov.name> wrote:
> >
> > trinity triggers it for me in few minutes. I will try find out more once
> > get some time.
> 
> You run trinity as *root*?
> 
> You're a brave man. Stupid, but brave ;)
> 
> I guess you're running it in a VM, but still.. Doing random system
> calls as root sounds like a bad bad idea.

I'm doing this for a long time and didn't see any problem bigger than qemu
crash[1]. ;)

[1] http://thread.gmane.org/gmane.comp.emulators.qemu/194845/

-- 
 Kirill A. Shutemov

^ permalink raw reply

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Dave Jones @ 2014-12-08 19:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Al Viro, Linux Kernel Mailing List,
	linux-fsdevel, Network Development
In-Reply-To: <CA+55aFwow9hVqNVGp7gjnnu4=A+J8vw1i22W4e7+1r1-215cHg@mail.gmail.com>

On Mon, Dec 08, 2014 at 11:01:41AM -0800, Linus Torvalds wrote:
 > On Mon, Dec 8, 2014 at 10:56 AM, Kirill A. Shutemov
 > <kirill@shutemov.name> wrote:
 > >
 > > trinity triggers it for me in few minutes. I will try find out more once
 > > get some time.
 > 
 > You run trinity as *root*?
 > 
 > You're a brave man. Stupid, but brave ;)
 > 
 > I guess you're running it in a VM, but still.. Doing random system
 > calls as root sounds like a bad bad idea.

I've flip-flopped on this a few times.  I used to be solidly in the same
position as your statement, but after seeing the things the secure-boot
crowd want to lock down, there are a ton of places in the kernel that
would need additional root-proofing to avoid scribbling over kernel
memory.

In short though, yeah, expect fireworks right now, especially on bare-metal.

At the same time, just to increase coverage testing of a lot of
root-required functionality (like various network sockets that can't be
opened as a regular user) I added a --drop-privs mode to trinity a while
ago, so after the socket creation, it can't do anything _too_ crazy.

	Dave

^ permalink raw reply

* Re: wl1251: NVS firmware data
From: Pali Rohár @ 2014-12-08 19:15 UTC (permalink / raw)
  To: Marcel Holtmann
  Cc: Greg Kroah-Hartman, Ming Lei, Pavel Machek, John W. Linville,
	Grazvydas Ignotas, linux-wireless@vger.kernel.org,
	Network Development, Linux Kernel Mailing List, Ivaylo Dimitrov,
	Aaro Koskinen, Kalle Valo, Sebastian Reichel, David Gnedt
In-Reply-To: <30072B9E-2495-4F90-AC91-9C0D7E08F44E@holtmann.org>

[-- Attachment #1: Type: Text/Plain, Size: 6711 bytes --]

On Monday 08 December 2014 19:50:17 Marcel Holtmann wrote:
> Hi Pali,
> 
> >>>>>> On Saturday 06 December 2014 13:49:54 Pavel Machek
> >>>>>> wrote: /**
> >>>>>> 
> >>>>>> + * request_firmware_prefer_user: - prefer usermode
> >>>>>> helper for loading firmware + * @firmware_p: pointer to
> >>>>>> firmware image
> >>>>>> + * @name: name of firmware file
> >>>>>> + * @device: device for which firmware is being loaded
> >>>>>> + *
> >>>>>> + * This function works pretty much like
> >>>>>> request_firmware(), but it prefer + * usermode helper.
> >>>>>> If usermode helper fails then it fallback to direct
> >>>>>> access. + * Usefull for dynamic or model specific
> >>>>>> firmware data. + **/
> >>>>>> +int request_firmware_prefer_user(const struct firmware
> >>>>>> **firmware_p, +                           const char
> >>>>>> *name, struct device *device) +{
> >>>>>> +       int ret;
> >>>>>> +       __module_get(THIS_MODULE);
> >>>>>> +       ret = _request_firmware(firmware_p, name,
> >>>>>> device, +                               FW_OPT_UEVENT
> >>>>>> | FW_OPT_PREFER_USER); +      
> >>>>>> module_put(THIS_MODULE); +       return ret;
> >>>>>> +}
> >>>>>> +EXPORT_SYMBOL_GPL(request_firmware_prefer_user);
> >>>>> 
> >>>>> I'd like to introduce request_firmware_user() which only
> >>>>> requests firmware from user space, and this way is
> >>>>> simpler and more flexible since we have
> >>>>> request_firmware_direct() already.
> >>>> 
> >>>> Why would a driver care about what program provides the
> >>>> firmware?  It shouldn't at all, and we want to get rid of
> >>>> the userspace firmware loader, not encourage drivers to
> >>>> use it "exclusively" at all.
> >>> 
> >>> Do not remove it! Without userspace firmware loader it is
> >>> impossible to load dynamic firmware files.
> >> 
> >> why is this dynamic in the first place. It does not sound
> >> like dynamic data to me at all. This is like the WiFi MAC
> >> address(es) or Bluetooth BD_ADDR. They are all static
> >> information. The only difference is that they are on the
> >> host accessibly filesystem or storage and not on the
> >> device itself.
> >> 
> >> To be honest, for Bluetooth we solved this now. If the
> >> device is missing key information like the calibration
> >> data or BD_ADDR, then it comes up unconfigured. A
> >> userspace process can then go and load the right data into
> >> it and then the device becomes available as Bluetooth
> >> device.
> >> 
> >> Trying to use request_firmware to load some random data and
> >> insist on going through userspace helper for that sounds
> >> crazy to me. Especially since we are trying hard to get
> >> away from the userspace loader. Forcing to keep it for new
> >> stuff sounds backwards to me.
> >> 
> >> With the special Nokia partition in mind, why hasn't this
> >> been turned into a mountable filesystem or into a
> >> driver/subsystem that can access the data direct from the
> >> kernel. I advocated for this some time ago. Maybe there
> >> should be a special subsystem for access to these factory
> >> persistent information that drivers then just can access.
> >> I seem to remember that some systems provide these via
> >> ACPI. Why does the ARM platform has to be special here?
> >> 
> >> And the problem of getting Ethernet and WiFi MAC address
> >> and Bluetooth BD_ADDR comes up many many times. Why not
> >> have something generic here. And don't tell me
> >> request_firmware is that generic solution ;)
> >> 
> >> Regards
> >> 
> >> Marcel
> > 
> > Hi Marcel. I think you did not understand this problem. This
> > discussion is not about mac address. Please read email
> > thread again and if there are some unclear pars, then ask.
> > Thanks!
> 
> I think that I pretty clearly understand the problem.
> Calibration data, MAC address, what is the difference? For me
> this is all the same. It is data that is specific to a device
> or type of devices and it is stored somewhere else. In most
> cases in some immutable memory/flash area.
> 

Those calibration data (in form of binary NVS firmware file) 
needs to be sent to wl1251 chip. Mac address is not needed at 
this step (and kernel generate some random if is not provided).

(Just to note wl1271 driver loads both MAC address and NVS data 
via one firmware file which is prepared by userspace, but this 
discussion is about wl1251...)

> What you want is access to this data since the kernel driver
> needs it. Do I get this so far ;)
> 

Yes, we need to provide NVS data to kernel when kernel ask for 
them.

> So my take is that request_firmware is not the right way to
> get this data. Or more precisely make sure that this data is
> available to kernel drivers. And what I am seeing here is
> that instead of actually solving the bigger problem, we just
> hack around it with request_firmware. Now surprisingly the
> request_firmware loads files directly from the kernel and all
> the hacks do not work anymore.
> 
> Regards
> 
> Marcel

Just read emails again...

Our problem is:

linux-firmware.git tree provides two binary firmware files:

ti-connectivity/wl1251-fw.bin
ti-connectivity/wl1251-nvs.bin

First is firmware file, second NVS file with generic calibration 
data. Kernel driver wl1251 now loads both firmware files via 
request_firmware. Generic calibration data are enough for wl1251 
chip (it should work). But devices have own calibration data 
stored somewhere else.

On Nokia N900 NVS data are generated on-the-fly from some bytes 
from CAL (/dev/mtd1), from state of cellular network and from 
some other regulation settings.

So I think that files stored in linux-firmware.git tree (which 
are also installed into /lib/firmware/) should be loaded with 
request_firmware function. Or not? Do you think something else? 
What other developers think?

I'm against kernel driver for CAL (/dev/mtd1) for more reasons:

1) we have userspace open source code, but licensed under GPLv3. 
And until kernel change license, we cannot include it.

2) NVS data are (probably) not in one place, plus they depends on 
something other.

3) If manufacture XYZ create new device with its own storage 
format of calibration data this means that correct solution for 
XYZ is also to implement new kernel fs driver for its own format. 
Do you really want to have in kernel all those drivers for all 
different (proprietary) storage formats?

4) It does not help us with existence of generic file 
/lib/firmware/ti-connectivity/wl1251-nvs.bin which comes from 
linux-firmware.git tree.

-- 
Pali Rohár
pali.rohar@gmail.com

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* CONFIRMING ORDER
From: Patients Care Unit @ 2014-12-08 12:23 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 571 bytes --]





Sorry for the delay in confirming order due to change in management, once
again kindly confirm shipping fees in below attached scan Invoice #3134729.
appreciate an urgent response. Customer is impatient and needs delivery on or
before the 15th  of December. please refer to scan attachment for details and
payment terms.




Best regards
Ahmed
BELHASA INT'L GROUP OF COMPANIES
Unit 1B,G/F.,94-96 How Ming St.,
Hung To Centre, Kwun Tong, Kln.,H.K.
Chung Pak Ka 852 9789 4635
Benny Chung   852 6188 0428
Tel      : (852) 2172 7771 / 2172 7773
Fax     : (852) 2172 7723

[-- Attachment #2: invoice#3134729.zip --]
[-- Type: application/zip, Size: 645167 bytes --]

[-- Attachment #3: invoice#3134729.zip --]
[-- Type: application/zip, Size: 645167 bytes --]

^ permalink raw reply

* Re: [PATCH RFC] pci: Control whether VFs are probed on pci_enable_sriov
From: Eli Cohen @ 2014-12-08 19:10 UTC (permalink / raw)
  To: Yuval Mintz
  Cc: Eli Cohen, bhelgaas@google.com, David Miller, linux-pci, netdev,
	ogerlitz@mellanox.com, yevgenyp@mellanox.com, Donald Dutile
In-Reply-To: <B5657A6538887040AD3A81F1008BEC63BA669B@avmb3.qlogic.org>

On Sun, Dec 07, 2014 at 08:08:47PM +0000, Yuval Mintz wrote:
> >>>Use a parameter to pci_enable_sriov to control that policy, and modify
> >>>all current callers such that they retain the same functionality.
> >> 
> >> What's the end-game here? How eventually would this be controlled?
> 
> >You can probe any VF at the hypervisor through sysfs files
> >(bind/unbind). You can also pass them through to a VM. Nothing
> >changes.
> 
> If you're not planning on adding a logic to set this, why do we need to
> add a parameter to pci_enable_sriov() - given that all callers use the
> exact same logic?
> [And I don't really think we'd want different devices to behave differently
> by default; That would be confusing for users.]
> 
Currently the kerenl will call probe for any device for which there is
a supporting driver and I did not want to change that.

> >>>Use a one shot flag on struct pci_device which is cleared after the
> >>>first probe is ignored so subsequent attempts go through.
> >>
> >> Does a one-shot flag suffice? E.g., consider assigning a VF to VM and
> >> than shutting down the VM. Assuming this feature is disabled,
> >> the VF didn't appear on the hypervisor prior to the assignment but
> >> will appear after its shutdown.
> 
> >Sorry, I don't follow you here. Please clarify.
> 
> >To be clear, the functionality proposed here is really one shot. It
> >just prevents calling probe once; besides that nothing changes.
> 
> What I meant is that device is unbinded after initial probe,
> But in the scenario I've stated above, the VF will become binded once
> it's returned to the hypervisor.
> Now, I understand that what you're trying to achieve - but my question
> is whether what you're REALLY trying to achieve is the ability to have VFs
> which would only be binded to VMs and never to hypervisor [by default]?

No this was not my intention. I guess any driver that wishes to
achieve the same functionality can simply ignore the N first probes after
the call to pci_enable_sriov() where N is the number of VFs it
created.

^ permalink raw reply

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Linus Torvalds @ 2014-12-08 19:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel,
	Network Development
In-Reply-To: <20141208185635.GA25867@node.dhcp.inet.fi>

On Mon, Dec 8, 2014 at 10:56 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> trinity triggers it for me in few minutes. I will try find out more once
> get some time.

You run trinity as *root*?

You're a brave man. Stupid, but brave ;)

I guess you're running it in a VM, but still.. Doing random system
calls as root sounds like a bad bad idea.

                      Linus

^ permalink raw reply

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Linus Torvalds @ 2014-12-08 18:57 UTC (permalink / raw)
  To: Al Viro
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development
In-Reply-To: <20141208184632.GG22149@ZenIV.linux.org.uk>

On Mon, Dec 8, 2014 at 10:46 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Dec 08, 2014 at 10:37:51AM -0800, Linus Torvalds wrote:
>
>> How about we make "kernel_read()" just clear O_DIRECT? Does that fix
>> it to just use copies?
>
> Umm...  clearing O_DIRECT on struct file that might have other references
> to it isn't nice, to put it mildly...

Yeah.

> Frankly, stopping iov_iter_get_pages() on the first page in vmalloc/module
> space looks like the sanest strategy anyway.  We'd get the same behaviour
> as we used to, and as for finit_module(2), well... put "fails if given
> an O_DIRECT descriptor" and be done with that.

If it used to fail, then by all means, just make this a failure in the
new model. I really don't want to make core infrastructure silently
just call vmalloc_to_page() to make things "work".

And if it used to do "get_user_pages_fast()" then the old code really
didn't work on vmalloc ranges anyway, since that one checks for not
just _PAGE_PRESENT but also _PAGE_USER, which won't be set on a
vmalloc() page. In fact, it should have failed on *all* kernel pages.

> Alternatively, we can really go for
>         page = is_vmalloc_or_module_addr(addr) ? vmalloc_to_page(addr)
>                                                 : virt_to_page(addr)
>         *pages++ = get_page(page);

actually, no we cannot. Thinking some more about it, that
"get_page(page)" is wrong in _all_ cases. It actually works better for
vmalloc pages than for normal 1:1 pages, since it's actually seriously
and *horrendously* wrong for the case of random kernel addresses which
may not even be refcounted to begin with.

So the whole "get_page()" thing is broken. Iterating over pages in a
KVEC is simply wrong, wrong, wrong. It needs to fail.

Iterating over a KVEC to *copy* data is ok. But no page lookup stuff
or page reference things.

The old code that apparently used "get_user_pages_fast()" was ok
almost by mistake, because it fails on all kernel pages.

                       Linus

^ permalink raw reply

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Kirill A. Shutemov @ 2014-12-08 18:56 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, netdev
In-Reply-To: <20141208175805.GB22149@ZenIV.linux.org.uk>

On Mon, Dec 08, 2014 at 05:58:05PM +0000, Al Viro wrote:
> On Mon, Dec 08, 2014 at 06:46:50PM +0200, Kirill A. Shutemov wrote:
> 
> > I guess this crash is related to the patchset.
> 
> Might be.  Do you have a reproducer for it?

trinity triggers it for me in few minutes. I will try find out more once
get some time.

-- 
 Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH RFC] pci: Control whether VFs are probed on pci_enable_sriov
From: Eli Cohen @ 2014-12-08 18:52 UTC (permalink / raw)
  To: Don Dutile
  Cc: Yuval Mintz, Eli Cohen, bhelgaas@google.com, David Miller,
	linux-pci, netdev, ogerlitz@mellanox.com, yevgenyp@mellanox.com
In-Reply-To: <5484C77D.7060501@redhat.com>

On Sun, Dec 07, 2014 at 04:32:45PM -0500, Don Dutile wrote:
> >
> >Does a one-shot flag suffice? E.g., consider assigning a VF to VM and
> >than shutting down the VM. Assuming this feature is disabled,
> >the VF didn't appear on the hypervisor prior to the assignment but
> >will appear after its shutdown.
> >
> +1 to this question.
> All I see is a one-shot savings in VF configuration time at pci_sriov_enable() time.
> Please explain why this is so important for mlx5 (sriov) operation?
> 

This is not a specific mlx5 requirement. The rational is that you
don't want to probe the virtual fucntions at the hypervisor just
because they appear there all of the sudden. You are likely to want to
assign some of them to VMs so why probe them and then remove them? It
makes sense to me that other drivers would also like this kind of
fucntionality.

> Can the vf driver probe be called & exit early the first time, and perform full
> (host) configuration thereafter?
>

Yes, this can be done but I was thinking that other drivers could also
benefit from such functionality.

^ permalink raw reply

* Re: wl1251: NVS firmware data
From: Marcel Holtmann @ 2014-12-08 18:50 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Greg Kroah-Hartman, Ming Lei, Pavel Machek, John W. Linville,
	Grazvydas Ignotas, linux-wireless@vger.kernel.org,
	Network Development, Linux Kernel Mailing List, Ivaylo Dimitrov,
	Aaro Koskinen, Kalle Valo, Sebastian Reichel, David Gnedt
In-Reply-To: <201412081811.46943@pali>

Hi Pali,

>>>>>> On Saturday 06 December 2014 13:49:54 Pavel Machek wrote:
>>>>>> /**
>>>>>> 
>>>>>> + * request_firmware_prefer_user: - prefer usermode
>>>>>> helper for loading firmware + * @firmware_p: pointer to
>>>>>> firmware image
>>>>>> + * @name: name of firmware file
>>>>>> + * @device: device for which firmware is being loaded
>>>>>> + *
>>>>>> + * This function works pretty much like
>>>>>> request_firmware(), but it prefer + * usermode helper. If
>>>>>> usermode helper fails then it fallback to direct access.
>>>>>> + * Usefull for dynamic or model specific firmware data.
>>>>>> + **/
>>>>>> +int request_firmware_prefer_user(const struct firmware
>>>>>> **firmware_p, +                           const char
>>>>>> *name, struct device *device) +{
>>>>>> +       int ret;
>>>>>> +       __module_get(THIS_MODULE);
>>>>>> +       ret = _request_firmware(firmware_p, name, device,
>>>>>> +                               FW_OPT_UEVENT |
>>>>>> FW_OPT_PREFER_USER); +       module_put(THIS_MODULE);
>>>>>> +       return ret;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(request_firmware_prefer_user);
>>>>> 
>>>>> I'd like to introduce request_firmware_user() which only
>>>>> requests firmware from user space, and this way is simpler
>>>>> and more flexible since we have request_firmware_direct()
>>>>> already.
>>>> 
>>>> Why would a driver care about what program provides the
>>>> firmware?  It shouldn't at all, and we want to get rid of
>>>> the userspace firmware loader, not encourage drivers to
>>>> use it "exclusively" at all.
>>> 
>>> Do not remove it! Without userspace firmware loader it is
>>> impossible to load dynamic firmware files.
>> 
>> why is this dynamic in the first place. It does not sound like
>> dynamic data to me at all. This is like the WiFi MAC
>> address(es) or Bluetooth BD_ADDR. They are all static
>> information. The only difference is that they are on the host
>> accessibly filesystem or storage and not on the device
>> itself.
>> 
>> To be honest, for Bluetooth we solved this now. If the device
>> is missing key information like the calibration data or
>> BD_ADDR, then it comes up unconfigured. A userspace process
>> can then go and load the right data into it and then the
>> device becomes available as Bluetooth device.
>> 
>> Trying to use request_firmware to load some random data and
>> insist on going through userspace helper for that sounds
>> crazy to me. Especially since we are trying hard to get away
>> from the userspace loader. Forcing to keep it for new stuff
>> sounds backwards to me.
>> 
>> With the special Nokia partition in mind, why hasn't this been
>> turned into a mountable filesystem or into a driver/subsystem
>> that can access the data direct from the kernel. I advocated
>> for this some time ago. Maybe there should be a special
>> subsystem for access to these factory persistent information
>> that drivers then just can access. I seem to remember that
>> some systems provide these via ACPI. Why does the ARM
>> platform has to be special here?
>> 
>> And the problem of getting Ethernet and WiFi MAC address and
>> Bluetooth BD_ADDR comes up many many times. Why not have
>> something generic here. And don't tell me request_firmware is
>> that generic solution ;)
>> 
>> Regards
>> 
>> Marcel
> 
> Hi Marcel. I think you did not understand this problem. This 
> discussion is not about mac address. Please read email thread 
> again and if there are some unclear pars, then ask. Thanks!

I think that I pretty clearly understand the problem. Calibration data, MAC address, what is the difference? For me this is all the same. It is data that is specific to a device or type of devices and it is stored somewhere else. In most cases in some immutable memory/flash area.

What you want is access to this data since the kernel driver needs it. Do I get this so far ;)

So my take is that request_firmware is not the right way to get this data. Or more precisely make sure that this data is available to kernel drivers. And what I am seeing here is that instead of actually solving the bigger problem, we just hack around it with request_firmware. Now surprisingly the request_firmware loads files directly from the kernel and all the hacks do not work anymore.

Regards

Marcel

^ permalink raw reply

* RE: [PATCH 2/3] bridge: offload bridge port attributes to switch asic if feature flag set
From: Arad, Ronen @ 2014-12-08 18:40 UTC (permalink / raw)
  To: Roopa Prabhu, Scott Feldman, Netdev
  Cc: Jirí Pírko, Jamal Hadi Salim, Benjamin LaHaise,
	Thomas Graf, john fastabend, stephen@networkplumber.org,
	John Linville, nhorman@tuxdriver.com, Nicolas Dichtel,
	vyasevic@redhat.com, Florian Fainelli, buytenh@wantstofly.org,
	Aviad Raveh, David S. Miller, shm@cumulusnetworks.com,
	Andy Gospodarek
In-Reply-To: <5484B773.7000809@cumulusnetworks.com>



> -----Original Message-----
> From: Roopa Prabhu [mailto:roopa@cumulusnetworks.com]
> Sent: Sunday, December 07, 2014 12:24 PM
> To: Scott Feldman
> Cc: Arad, Ronen; Netdev; Jirí Pírko; Jamal Hadi Salim; Benjamin LaHaise;
> Thomas Graf; john fastabend; stephen@networkplumber.org; John Linville;
> nhorman@tuxdriver.com; Nicolas Dichtel; vyasevic@redhat.com; Florian
> Fainelli; buytenh@wantstofly.org; Aviad Raveh; David S. Miller;
> shm@cumulusnetworks.com; Andy Gospodarek
> Subject: Re: [PATCH 2/3] bridge: offload bridge port attributes to switch asic
> if feature flag set
> 
> On 12/5/14, 10:54 PM, Scott Feldman wrote:
> > On Fri, Dec 5, 2014 at 3:21 PM, Arad, Ronen <ronen.arad@intel.com>
> wrote:
> >>
> >>> -----Original Message-----
> >>> From: netdev-owner@vger.kernel.org [mailto:netdev-
> >>> owner@vger.kernel.org] On Behalf Of Roopa Prabhu
> >>> Sent: Thursday, December 04, 2014 11:02 PM
> >>> To: Scott Feldman
> >>> Cc: Jiří Pírko; Jamal Hadi Salim; Benjamin LaHaise; Thomas Graf;
> >>> john fastabend; stephen@networkplumber.org; John Linville;
> >>> nhorman@tuxdriver.com; Nicolas Dichtel; vyasevic@redhat.com; Florian
> >>> Fainelli; buytenh@wantstofly.org; Aviad Raveh; Netdev; David S.
> >>> Miller; shm@cumulusnetworks.com; Andy Gospodarek
> >>> Subject: Re: [PATCH 2/3] bridge: offload bridge port attributes to
> >>> switch asic if feature flag set
> >>>
> >>> On 12/4/14, 10:41 PM, Scott Feldman wrote:
> >>>> On Thu, Dec 4, 2014 at 6:26 PM,  <roopa@cumulusnetworks.com>
> wrote:
> >>>>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> >>>>>
> >>>>> This allows offloading to switch asic without having the user to
> >>>>> set any flag. And this is done in the bridge driver to rollback
> >>>>> kernel settings on hw offload failure if required in the future.
> >>>>>
> >>>>> With this, it also makes sure a notification goes out only after
> >>>>> the attributes are set both in the kernel and hw.
> >>>> I like this approach as it streamlines the steps for the user in
> >>>> setting port flags.  There is one case for FLOODING where you'll
> >>>> have to turn off flooding for both, and then turn on flooding in
> >>>> hw.  You don't want flooding turned on on kernel and hw.
> >>> ok, maybe using the higher bits as in
> >>> https://patchwork.ozlabs.org/patch/413211/
> >>>
> >>> might help with that. Let me think some more.
> >>>>> ---
> >>>>>    net/bridge/br_netlink.c |   27 ++++++++++++++++++++++++++-
> >>>>>    1 file changed, 26 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> >>>>> index
> >>>>> 9f5eb55..ce173f0 100644
> >>>>> --- a/net/bridge/br_netlink.c
> >>>>> +++ b/net/bridge/br_netlink.c
> >>>>> @@ -407,9 +407,21 @@ int br_setlink(struct net_device *dev, struct
> >>> nlmsghdr *nlh)
> >>>>>                                   afspec, RTM_SETLINK);
> >>>>>           }
> >>>>>
> >>>>> +       if ((dev->features & NETIF_F_HW_SWITCH_OFFLOAD) &&
> >>>>> +                       dev->netdev_ops->ndo_bridge_setlink) {
> >>>>> +               int ret = dev->netdev_ops->ndo_bridge_setlink(dev,
> >>>>> + nlh);
> >>>> I think you want to up-level this to net/core/rtnetlink.c because
> >>>> you're only enabling the feature for one instance of a driver that
> >>>> implements ndo_bridge_setlink: the bridge driver.  If another
> >>>> driver was MASTER and implemented ndo_bridge_setlink, you'd want
> >>>> same check to push setting down to SELF port driver.
> >>> yeah, i thought about that. But i moved it here so that rollback
> >>> would be easier.
> >> There is a need for propagating setlink/dellink requests down multiple
> levels.
> >> The use-case I have in mind is a bridge at the top, team/bond in the
> middle, and port devices at the bottom.
> >> A setlink for VLAN filtering attributes would come with MASTER flag set,
> and either port or bond/team netdev.
> >> How would this be handled?
> >>
> >> The propagation rules between bridge and enslaved port device could be
> different from those between bond/team and enslaved devices.
> >> The current bridge driver does not propagate VLAN filtering from bridge to
> its ports as each port could have different configuration. In a case of a
> bond/team all members need to have the same configuration such that the a
> bond/team would be indistinguishable from a simple port.
> >>
> >> Therefore rtnetlink.c might not have the knowledge for propagation
> across multiple levels.
> >> It seems that each device which implements
> ndo_bridge_setlink/ndo_bridge_dellink  and could have master role, need to
> take care of propagation to its slaves.
> > Thanks Ronen for bringing up this use-case of stacked masters.  I
> > think for VLAN filtering, the stacked master case is handled, not by
> > ndo_setlink/dellink at each level, but with ndo_vlan_rx_kill_vid and
> > ndo_vlan_rx_add_vid.  So the switch port driver can know VLAN
> > membership for port even if port is under bond which is under bridge,
> > by using ndo_vlan_rx_xxx and setting NETIF_F_HW_VLAN_CTAG_FILTER.
> The
> > bonding driver's ndo_vlan_rx_xxx handlers seem to keep ports in bond
> > VLAN membership consistent across bond.
This could have worked if the entire VLAN filtering policy was propagated. Currently only VLAN membership is propagated using ndo_vlan_rx_add_vid.
Missing are untagged flag for egress and PVID flag.
> >
> > But in general, ndo_setlink/dellink don't work for the stack use-case
> > for other non-VLAN attributes.  Maybe the answer is to use the VLAN
> > propogation model for other attributes.  ndo_setlink/dellink/getlink
> > have enough weird-isms it might be time to define cleaner ndo ops to
> > propagate the other attrs down.
> And, only the switch asic driver is interested in these attrs. So, seems like for
> these cases, we need to send these attrs to the switchdriver directly instead
> of going through the stack of netdevs ?. see my response to ronen's other
> email.
> 
> Thanks,
> Roopa


^ permalink raw reply

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Al Viro @ 2014-12-08 18:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development
In-Reply-To: <CA+55aFxrrTV-xBbvEzVKsXRVKyvZhQZj-6BNDpTsG4J5k6CJEg@mail.gmail.com>

On Mon, Dec 08, 2014 at 10:37:51AM -0800, Linus Torvalds wrote:

> How about we make "kernel_read()" just clear O_DIRECT? Does that fix
> it to just use copies?

Umm...  clearing O_DIRECT on struct file that might have other references
to it isn't nice, to put it mildly...

Frankly, stopping iov_iter_get_pages() on the first page in vmalloc/module
space looks like the sanest strategy anyway.  We'd get the same behaviour
as we used to, and as for finit_module(2), well... put "fails if given
an O_DIRECT descriptor" and be done with that.

Alternatively, we can really go for
	page = is_vmalloc_or_module_addr(addr) ? vmalloc_to_page(addr)
						: virt_to_page(addr)
	*pages++ = get_page(page);
and have the fucker work.  Stability is up to the caller, of course -
reading into buffer that might get freed (and reused) under you has much
worse problems...

^ permalink raw reply

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Linus Torvalds @ 2014-12-08 18:37 UTC (permalink / raw)
  To: Al Viro
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development
In-Reply-To: <20141208182012.GE22149@ZenIV.linux.org.uk>

On Mon, Dec 8, 2014 at 10:20 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> I certainly had missed that insanity during the analysis - we don't do
> a lot of O_DIRECT IO to/from kernel addresses of any sort...  This
> codepath allows it ;-/  Ability to trigger it is equivalent to ability
> to run any code in kernel mode, so it's not an additional security hole,
> but...

Is there any chance we could just return EINVAL for this case?

Who does O_DIRECT on module load anyway? If this is only for
finit_module(), that uses "kernel_read()", and maybe we could just
make sure that the kernel_read() function never ever uses the
direct-IO paths?

[ Time passes, I look at the code ]

Oh crap. So the reason it triggers seems to be that we basically get a
random file descriptor that we didn't open, and then we have

  vfs_read() ->
  xfs_file_operations->read() ->
  ew_sync_read() ->
  xfs_file_operations->read_iter()
  xfs_file_read_iter()

and we are stuck with this iterator that really just wants to do copies.

How about we make "kernel_read()" just clear O_DIRECT? Does that fix
it to just use copies?

                      Linus

^ permalink raw reply

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Al Viro @ 2014-12-08 18:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development
In-Reply-To: <CA+55aFzKGsYKh1HALoeFGYQ46uMHww0WdLf5N-9duNTgMSSFxQ@mail.gmail.com>

On Mon, Dec 08, 2014 at 10:23:26AM -0800, Linus Torvalds wrote:

> Did this actually use to work? Or is it an issue of "the new iov_iter
> is so generic that something that used to just return an error now
> 'works' and triggers the problem"?

Looks like it failed with EINVAL.  Which might very well be the sane
reaction - if we run into a vmalloc/module address, act as if we failed
to get that page and exit the loop.

> > What's the sane way to grab struct page * for a vmalloc'ed address?
> 
> So "vmalloc_to_page()" should work.
> 
> However, it's actually fundamentally racy unless you can guarantee
> that the vmalloc()'ed area in question is stable (so you had better
> have done that allocation yourself, and be in control of the freeing,
> rather than "we look up random vmalloc'ed addresses).

If vfree(buffer) races with kernel_read() into buffer, we are so badly
fucked that stability of pointers to pages is the least of our concerns...

> In general, it's really a horrible thing to use, and tends to be a big
> red sign that "somebody misdesigned this badly"

More like "nobody has thought of that case", at a guess, but then I hadn't
been involved in finit_module() design - I don't even remember the discussions
around it.  That would be what, something circa 3.7?

^ permalink raw reply

* Re: [ovs-dev] OVS Kernel Datapath development
From: Pravin Shelar @ 2014-12-08 18:30 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, dev@openvswitch.org
In-Reply-To: <20141208171506.GC2835@pox.localdomain>

On Mon, Dec 8, 2014 at 9:15 AM, Thomas Graf <tgraf@noironetworks.com> wrote:
> On 12/07/14 at 08:47pm, Pravin Shelar wrote:
>> Since the beginning OVS kernel datapath development is primarily done
>> on external OVS repo. Now we have mostly synced upstream and external
>> OVS. So we have decided to change this process. New process is as
>> follows.
>>
>> 1. OVS feature development that involves kernel datapath should be
>> done on net-next tree datapath.
>> 2. Such feature patch series should be posted on netdev and ovs-dev
>> mailing list.
>> 3. Once review is done for entire series, kernel and OVS userspace
>> patches will be merged in respective repo.
>> 4. After the merge developer is suppose to send patches for external
>> kernel datapath along with old kernel compatibility code. So that we
>> can keep external datapath insync.
>
> +1
>
> Just to be clear, by respective repo do you mean net-next/net or will
> you maintain a net-next branch on git.kernel.org and continue doing
> pull requests?

OVS patches will directly go to net-next/net tree. I am not planning
on maintaining any tree on git.kernel.org.

^ permalink raw reply

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Linus Torvalds @ 2014-12-08 18:23 UTC (permalink / raw)
  To: Al Viro
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development
In-Reply-To: <20141208181401.GD22149@ZenIV.linux.org.uk>

On Mon, Dec 8, 2014 at 10:14 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> iov_iter_get_pages() in ITER_KVEC case, trying to avoid get_user_pages_fast()
> and getting it wrong.  FWIW, the reproducer is finit_module(fd, ....)
> where fd has been opened with O_DIRECT.  In that case we get kernel_read()
> on O_DIRECT and the buffer has just been vmalloc'ed.

Ugh. That's horrid. Do we need to even support O_DIRECT in that case?
In general, we should *not* do IO on vmalloc'ed areas, although at
least the non-O_DIRECT case where we just memcpy() it as if it came
from user space is much better.

Did this actually use to work? Or is it an issue of "the new iov_iter
is so generic that something that used to just return an error now
'works' and triggers the problem"?

> What's the sane way to grab struct page * for a vmalloc'ed address?

So "vmalloc_to_page()" should work.

However, it's actually fundamentally racy unless you can guarantee
that the vmalloc()'ed area in question is stable (so you had better
have done that allocation yourself, and be in control of the freeing,
rather than "we look up random vmalloc'ed addresses).

In general, it's really a horrible thing to use, and tends to be a big
red sign that "somebody misdesigned this badly"

                    Linus

^ permalink raw reply

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Al Viro @ 2014-12-08 18:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development
In-Reply-To: <CA+55aFxwrH4vsyw2ix=HqDK9Z=fw98xtOL8=57prm8qgVmAFfA@mail.gmail.com>

On Mon, Dec 08, 2014 at 10:14:13AM -0800, Linus Torvalds wrote:

> For a vmalloc() address, you'd have to actually walk the page tables.
> Which is a f*cking horrible idea. Don't do it. We do have a
> "vmalloc_to_page()" that does it, but the basic issue is that you damn
> well shouldn't do IO on vmalloc'ed addresses.  vmalloc'ed addresses
> only exist in the first place to give a linear *virtual* mapping, if
> you want physical pages you shouldn't have mixed it up with vmalloc in
> the first place!
> 
> Where the hell does this crop up, and who does this insane thing
> anyway? It's wrong. How did it ever work before?

finit_module() with O_DIRECT descriptor.  And I suspect that "not well"
is the answer - it used to call get_user_pages_fast() in that case.

I certainly had missed that insanity during the analysis - we don't do
a lot of O_DIRECT IO to/from kernel addresses of any sort...  This
codepath allows it ;-/  Ability to trigger it is equivalent to ability
to run any code in kernel mode, so it's not an additional security hole,
but...

^ permalink raw reply

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Linus Torvalds @ 2014-12-08 18:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development
In-Reply-To: <20141208180824.GC22149@ZenIV.linux.org.uk>

On Mon, Dec 8, 2014 at 10:08 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, virt_to_page() is probably not OK to call on an address in the
> middle of vmalloc'ed area, is it?

See my email that crossed yours. No it is not.

>  Would
>                 for (end = addr + len; addr < end; addr += PAGE_SIZE) {
>                         if (is_vmalloc_addr(addr))
>                                 ACCESS_ONCE(*(char *)addr);
>                         get_page(*pages++ = virt_to_page(addr));
>                 }
> be a safe replacement for the loop in the above?

No. That "ACCESS_ONCE()" does nothing. It reads a byte from 'addr' in
the vmalloc space, and might cause a page fault to make sure it's
mapped, but that is still a no-op.

You can't do "virt_to_page()" on anything but the normal 1:1 kernel
mappings (and only for non-highmem pages at that).

For a vmalloc() address, you'd have to actually walk the page tables.
Which is a f*cking horrible idea. Don't do it. We do have a
"vmalloc_to_page()" that does it, but the basic issue is that you damn
well shouldn't do IO on vmalloc'ed addresses.  vmalloc'ed addresses
only exist in the first place to give a linear *virtual* mapping, if
you want physical pages you shouldn't have mixed it up with vmalloc in
the first place!

Where the hell does this crop up, and who does this insane thing
anyway? It's wrong. How did it ever work before?

                     Linus

^ permalink raw reply

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Al Viro @ 2014-12-08 18:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development
In-Reply-To: <CA+55aFz6izpVGxkziynRXg7zbHzteQyPJH1Kg91ZO3fzqn3LaQ@mail.gmail.com>

On Mon, Dec 08, 2014 at 10:07:55AM -0800, Linus Torvalds wrote:

> Which is in the vmalloc address space. So somebody used a vmalloc'ed
> address and tried to convert it to a physical address in order to look
> up the page.
> 
> Which is not a valid operation, and the BUG_ON() is definitely proper.
> 
> Now *why* something tried to do a virt_to_page() on a vmalloc'ed
> address, that I leave to others.

iov_iter_get_pages() in ITER_KVEC case, trying to avoid get_user_pages_fast()
and getting it wrong.  FWIW, the reproducer is finit_module(fd, ....)
where fd has been opened with O_DIRECT.  In that case we get kernel_read()
on O_DIRECT and the buffer has just been vmalloc'ed.

What's the sane way to grab struct page * for a vmalloc'ed address?

^ permalink raw reply

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Al Viro @ 2014-12-08 18:08 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, netdev
In-Reply-To: <20141208175805.GB22149@ZenIV.linux.org.uk>

On Mon, Dec 08, 2014 at 05:58:05PM +0000, Al Viro wrote:
> It looks like the second VIRTUAL_BUG_ON() in __phys_addr(), most likely
> from __pa(), from virt_to_page(), from
>                 unsigned long addr = (unsigned long)v.iov_base, end;
>                 size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
> 
>                 if (len > maxpages * PAGE_SIZE)
>                         len = maxpages * PAGE_SIZE;
>                 addr &= ~(PAGE_SIZE - 1);
>                 for (end = addr + len; addr < end; addr += PAGE_SIZE)
>                         get_page(*pages++ = virt_to_page(addr));
>                 return len - *start;
> in iov_iter_get_pages().  And that's ITER_KVEC case there...  Further
> call chain looks like dio_refill_pages(), from dio_get_page(), from
> do_direct_io(), eventually from kernel_read() and finit_module(),
> Presumably called on O_DIRECT descriptor...

FWIW, virt_to_page() is probably not OK to call on an address in the
middle of vmalloc'ed area, is it?  Would
		for (end = addr + len; addr < end; addr += PAGE_SIZE) {
			if (is_vmalloc_addr(addr))
				ACCESS_ONCE(*(char *)addr);
			get_page(*pages++ = virt_to_page(addr));
		}
be a safe replacement for the loop in the above?

^ permalink raw reply

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Linus Torvalds @ 2014-12-08 18:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel,
	Network Development
In-Reply-To: <20141208164650.GB29028@node.dhcp.inet.fi>

On Mon, Dec 8, 2014 at 8:46 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> I guess this crash is related to the patchset.

Sounds likely.

> [  102.338270] kernel BUG at /home/kas/git/public/linux-next/arch/x86/mm/physaddr.c:26!

So that's

                VIRTUAL_BUG_ON((x > y) || !phys_addr_valid(x));

and the code disassembles to:

   0: 48 01 f8             add    %rdi,%rax
   3: 48 39 c2             cmp    %rax,%rdx
   6: 72 1b                 jb     0x23
   8: 0f b6 0d 9d 7a ec 00 movzbl 0xec7a9d(%rip),%ecx        # 0xec7aac
   f: 48 89 c2             mov    %rax,%rdx
  12: 48 d3 ea             shr    %cl,%rdx
  15: 48 85 d2             test   %rdx,%rdx
  18: 75 09                 jne    0x23
  1a: 5d                   pop    %rbp
  1b: c3                   retq
  1c: 0f 1f 80 00 00 00 00 nopl   0x0(%rax)
  23:* 0f 0b                 ud2     <-- trapping instruction

with thre relevant registers being

> [  102.340011] RAX: 00004100174b4000 RBX: ffff880049c73b08 RCX: 0000000000000028
> [  102.340011] RDX: 0000000000000041 RSI: ffff88015dc980a8 RDI: ffffc900174b4000

so we've taken the second case (the %rcx value is
"boot_cpu_data.x86_phys_bits", which is that "movzbl", and the %rdx
value is the shifted value of %rax).

So %rax seems to contain 'x' at that point, which means that 'y' should be

   x - (__START_KERNEL_map - PAGE_OFFSET)

which means that the _original_ address should be that plus
__START_KERNEL_map, ie just x + PAGE_OFFSET.

So it smells like the original virtual address was that
ffffc900174b4000 that we still find in %rdi.

Which is in the vmalloc address space. So somebody used a vmalloc'ed
address and tried to convert it to a physical address in order to look
up the page.

Which is not a valid operation, and the BUG_ON() is definitely proper.

Now *why* something tried to do a virt_to_page() on a vmalloc'ed
address, that I leave to others.

                        Linus

^ permalink raw reply

* Re: [RFC][PATCHES] iov_iter.c rewrite
From: Al Viro @ 2014-12-08 17:58 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, netdev
In-Reply-To: <20141208164650.GB29028@node.dhcp.inet.fi>

On Mon, Dec 08, 2014 at 06:46:50PM +0200, Kirill A. Shutemov wrote:

> I guess this crash is related to the patchset.

Might be.  Do you have a reproducer for it?

It looks like the second VIRTUAL_BUG_ON() in __phys_addr(), most likely
from __pa(), from virt_to_page(), from
                unsigned long addr = (unsigned long)v.iov_base, end;
                size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));

                if (len > maxpages * PAGE_SIZE)
                        len = maxpages * PAGE_SIZE;
                addr &= ~(PAGE_SIZE - 1);
                for (end = addr + len; addr < end; addr += PAGE_SIZE)
                        get_page(*pages++ = virt_to_page(addr));
                return len - *start;
in iov_iter_get_pages().  And that's ITER_KVEC case there...  Further
call chain looks like dio_refill_pages(), from dio_get_page(), from
do_direct_io(), eventually from kernel_read() and finit_module(),
Presumably called on O_DIRECT descriptor...

I'll try to reproduce it here, but if you have any reliable reproducer, it
would be very welcome (and would make a useful addition to LTP and/or
xfstests).

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox