Linux userland API discussions
 help / color / mirror / Atom feed
* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Vlastimil Babka @ 2015-02-04 13:46 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), Kirill A. Shutemov
  Cc: Dave Hansen, Mel Gorman, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Minchan Kim, Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, Hugh Dickins
In-Reply-To: <54D0F56A.9050003-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On 02/03/2015 05:20 PM, Michael Kerrisk (man-pages) wrote:
> Hello Vlastimil
>
> Thanks for CCing me into this thread.

NP

> On 02/03/2015 12:42 PM, Vlastimil Babka wrote:
>> On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote:
>>> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
>>>
>>> It doesn't skip. It fails with -EINVAL. Or I miss something.
>>
>> No, I missed that. Thanks for pointing out. The manpage also explains EINVAL in
>> this case:
>>
>> *  The application is attempting to release locked or shared pages (with
>> MADV_DONTNEED).
>
> Yes, there is that. But the page could be more explicit when discussing
> MADV_DONTNEED in the main text. I've done that.
>
>> - that covers mlocking ok, not sure if the rest fits the "shared pages" case
>> though. I dont see any check for other kinds of shared pages in the code.
>
> Agreed. "shared" here seems confused. I've removed it. And I've
> added mention of "Huge TLB pages" for this error.
>

Thanks.

>>>> - The word "will result" did sound as a guarantee at least to me. So here it
>>>> could be changed to "may result (unless the advice is ignored)"?
>>>
>>> It's too late to fix documentation. Applications already depends on the
>>> beheviour.
>>
>> Right, so as long as they check for EINVAL, it should be safe. It appears that
>> jemalloc does.
>
> So, first a brief question: in the cases where the call does not error out,
> are we agreed that in the current implementation, MADV_DONTNEED will
> always result in zero-filled pages when the region is faulted back in
> (when we consider pages that are not backed by a file)?

I'd agree at this point.
Also we should probably mention anonymously shared pages (shmem). I 
think they behave the same as file here.

>> I still wouldnt be sure just by reading the man page that the clearing is
>> guaranteed whenever I dont get an error return value, though,
>
> I'm not quite sure what you want here. I mean: if there's an error,

I was just reiterating that the guarantee is not clear from if you 
consider all the statements in the man page.

> then the DONTNEED action didn't occur, right? Therefore, there won't
> be zero-filled pages. But, for what it's worth, I added "If the
> operation succeeds" at the start of that sentence beginning "Subsequent
> accesses...".

Yes, that should clarify it. Thanks!

> Now, some history, explaining why the page is a bit of a mess,
> and for that matter why I could really use more help on it from MM
> folk (especially in the form of actual patches [1], rather than notes
> about deficiencies in the documentation), because:
>
>      ***I simply cannot keep up with all of the details***.

I see, and expected it would be like this. I would just send patch if 
the situation was clear, but here we should agree first, and I thought 
you should be involved from the beginning.

> Once upon a time (Linux 2.4), there was madvise() with just 5 flags:
>
>         MADV_NORMAL
>         MADV_RANDOM
>         MADV_SEQUENTIAL
>         MADV_WILLNEED
>         MADV_DONTNEED
>
> And already a dozen years ago, *I* added the text about MADV_DONTNEED.
> Back then, I believe it was true. I'm not sure if it's still true now,
> but I assume for the moment that it is, and await feedback. And the
> text saying that the call does not affect the semantics of memory
> access dates back even further (and was then true, MADV_DONTNEED aside).
>
> Those 5 flags have analogs in POSIX's posix_madvise() (albeit, there
> is a semantic mismatch between the destructive MADV_DONTNEED and
> POSIX's nondestructive POSIX_MADV_DONTNEED). They also appear
> on most other implementations.
>
> Since the original implementation, numerous pieces of cruft^W^W^W
> excellent new flags have been overloaded into this one system call.
> Some of those certainly violated the "does not change the semantics
> of the application" statement, but, sadly, the kernel developers who
> implemented MADV_REMOVE or MADV_DONTFORK did not think to send a
> patch to the man page for those new flags, one that might have noted
> that the semantics of the application are changed by such flags. Equally
> sadly, I did overlook to scan the bigger page when *I* added
> documentation of these flags to those pages, otherwise I might have
> caught that detail.
>
> So, just to repeat, I  could really use more help on it from MM
> folk in the form of actual patches to the man page.

Thanks for the background. I'll try to remember to check for man-pages 
part when I review some api changing patch.

> Thanks,
>
> Michael
>
> [1] https://www.kernel.org/doc/man-pages/patches.html
>

^ permalink raw reply

* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Michael Kerrisk (man-pages) @ 2015-02-04 14:00 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Dave Hansen, Mel Gorman,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Minchan Kim,
	Andrew Morton, lkml, Linux API, linux-man, Hugh Dickins
In-Reply-To: <54D22298.3040504-AlSwsSmVLrQ@public.gmane.org>

Hello Vlastimil,

On 4 February 2015 at 14:46, Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org> wrote:
> On 02/03/2015 05:20 PM, Michael Kerrisk (man-pages) wrote:
>>
>> On 02/03/2015 12:42 PM, Vlastimil Babka wrote:
>>>
>>> On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote:
>>>>
>>>> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
>>>>
>>>> It doesn't skip. It fails with -EINVAL. Or I miss something.
>>>
>>>
>>> No, I missed that. Thanks for pointing out. The manpage also explains
>>> EINVAL in
>>> this case:
>>>
>>> *  The application is attempting to release locked or shared pages (with
>>> MADV_DONTNEED).
>>
>> Yes, there is that. But the page could be more explicit when discussing
>> MADV_DONTNEED in the main text. I've done that.
>>
>>> - that covers mlocking ok, not sure if the rest fits the "shared pages"
>>> case
>>> though. I dont see any check for other kinds of shared pages in the code.
>>
>> Agreed. "shared" here seems confused. I've removed it. And I've
>> added mention of "Huge TLB pages" for this error.
>
> Thanks.

I also added those cases for MADV_REMOVE, BTW.

>>>>> - The word "will result" did sound as a guarantee at least to me. So
>>>>> here it
>>>>> could be changed to "may result (unless the advice is ignored)"?
>>>>
>>>> It's too late to fix documentation. Applications already depends on the
>>>> beheviour.
>>>
>>> Right, so as long as they check for EINVAL, it should be safe. It appears
>>> that
>>> jemalloc does.
>>
>>
>> So, first a brief question: in the cases where the call does not error
>> out,
>> are we agreed that in the current implementation, MADV_DONTNEED will
>> always result in zero-filled pages when the region is faulted back in
>> (when we consider pages that are not backed by a file)?
>
>
> I'd agree at this point.

Thanks for the confirmation.

> Also we should probably mention anonymously shared pages (shmem). I think
> they behave the same as file here.

You mean tmpfs here, right? (I don't keep all of the synonyms straight.)

>>> I still wouldnt be sure just by reading the man page that the clearing is
>>> guaranteed whenever I dont get an error return value, though,
>>
>> I'm not quite sure what you want here. I mean: if there's an error,
>
> I was just reiterating that the guarantee is not clear from if you consider
> all the statements in the man page.
>
>> then the DONTNEED action didn't occur, right? Therefore, there won't
>> be zero-filled pages. But, for what it's worth, I added "If the
>> operation succeeds" at the start of that sentence beginning "Subsequent
>> accesses...".
>
> Yes, that should clarify it. Thanks!

Okay.

>> Now, some history, explaining why the page is a bit of a mess,
>> and for that matter why I could really use more help on it from MM
>> folk (especially in the form of actual patches [1], rather than notes
>> about deficiencies in the documentation), because:
>>
>>      ***I simply cannot keep up with all of the details***.
>
> I see, and expected it would be like this. I would just send patch if the
> situation was clear, but here we should agree first, and I thought you
> should be involved from the beginning.

Sorry -- I should have made it clearer, this statement was not
targeted at you personally, or even necessarily at this particular
thread. It was a general comment, that came up sharply to me as I
looked at how much cruft there is in the madvise() page.

>> Once upon a time (Linux 2.4), there was madvise() with just 5 flags:
>>
>>         MADV_NORMAL
>>         MADV_RANDOM
>>         MADV_SEQUENTIAL
>>         MADV_WILLNEED
>>         MADV_DONTNEED
>>
>> And already a dozen years ago, *I* added the text about MADV_DONTNEED.
>> Back then, I believe it was true. I'm not sure if it's still true now,
>> but I assume for the moment that it is, and await feedback. And the
>> text saying that the call does not affect the semantics of memory
>> access dates back even further (and was then true, MADV_DONTNEED aside).
>>
>> Those 5 flags have analogs in POSIX's posix_madvise() (albeit, there
>> is a semantic mismatch between the destructive MADV_DONTNEED and
>> POSIX's nondestructive POSIX_MADV_DONTNEED). They also appear
>> on most other implementations.
>>
>> Since the original implementation, numerous pieces of cruft^W^W^W
>> excellent new flags have been overloaded into this one system call.
>> Some of those certainly violated the "does not change the semantics
>> of the application" statement, but, sadly, the kernel developers who
>> implemented MADV_REMOVE or MADV_DONTFORK did not think to send a
>> patch to the man page for those new flags, one that might have noted
>> that the semantics of the application are changed by such flags. Equally
>> sadly, I did overlook to scan the bigger page when *I* added
>> documentation of these flags to those pages, otherwise I might have
>> caught that detail.
>>
>> So, just to repeat, I  could really use more help on it from MM
>> folk in the form of actual patches to the man page.
>
> Thanks for the background. I'll try to remember to check for man-pages part
> when I review some api changing patch.

That would be great.

Thanks,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
From: Alexander Holler @ 2015-02-04 14:19 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Lukáš Czerner, Al Viro, Theodore Ts'o,
	Linux-Fsdevel, Linux Kernel, Linux API
In-Reply-To: <54D21EB8.6020208-SXC+2es9fhnfWeYVQQPykw@public.gmane.org>

Am 04.02.2015 um 14:29 schrieb Alexander Holler:
> Am 04.02.2015 um 14:21 schrieb Alexander Holler:
>
>>> I tell you that discussions of APIs should CC linux-api, which I am
>>> now CCing into this thread, again, because, again, you're not
>>> listening to feedback.
>>
>> Please don't confuse "not listening" with "unable to fulfill Linux
>> kernel maintainer requests".
>
> I really wonder what do you expect from people not getting paid to spend
> time for fulfilling maintainer request?
>
> I've written bugs and even offered some patches (regardless how usefull
> there are in your eyes, it's more than most other people can do).
>
> And all what it brought me is that I receive flames like your one.
>
> Do you really think that's the right way to stimulate people in helping
> to make Linux better?

I'm really sorry that I can't spend several unpaid months with reading 
and understanding ever changing linux kernel sources in order to become 
a Linux filesystem expert and send some fully working perfect patches 
which do fix the problem in question.

And I can't spend the necessary time to play remote keyboard for kernel 
maintainers which might be willing to explain me what has to be done 
according to their view. I've already offered what I was willing to do, 
for the price of having to defend myself over and over. And 
unfortunately that wasn't the first time I've ended up with having to 
defend myself.

My conclusion is that I'm a real fool having posted multiple times 
patches to this list. It just doesn't make any sense and most of the 
time the only reward are flames.

Alexander Holler

^ permalink raw reply

* [PATCH] tpm, tpm_tis: fix TPM 2.0 probing
From: Jarkko Sakkinen @ 2015-02-04 14:21 UTC (permalink / raw)
  To: Peter Huewe, Ashley Lai, Marcel Selhorst
  Cc: tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, josh-iaAMLnmF4UmaiuxdJuQwMA,
	christophe.ricard-Re5JQEeQqe8AvxtiuMwx3w,
	jason.gunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	stefanb-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	trousers-tech-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Jarkko Sakkinen

If during transmission system error was returned, the logic was to
incorrectly deduce that chip is a TPM 1.x chip. This patch fixes this
issue. Also, this patch changes probing so that message tag is used as the
measure for TPM 2.x, which should be much more stable. A separate function
called tpm2_probe() is encapsulated because it can be used with any
chipset.

Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
---
 drivers/char/tpm/tpm.h      |  3 ++-
 drivers/char/tpm/tpm2-cmd.c | 40 +++++++++++++++++++++++++++++++++-------
 drivers/char/tpm/tpm_tis.c  | 11 ++++-------
 3 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 7b0727c..a4b0f5e 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -435,4 +435,5 @@ extern int tpm2_startup(struct tpm_chip *chip, u16 startup_type);
 extern int tpm2_shutdown(struct tpm_chip *chip, u16 shutdown_type);
 extern unsigned long tpm2_calc_ordinal_duration(struct tpm_chip *, u32);
 extern int tpm2_do_selftest(struct tpm_chip *chip);
-extern int tpm2_gen_interrupt(struct tpm_chip *chip, bool quiet);
+extern int tpm2_gen_interrupt(struct tpm_chip *chip);
+extern int tpm2_probe(struct tpm_chip *chip);
diff --git a/drivers/char/tpm/tpm2-cmd.c b/drivers/char/tpm/tpm2-cmd.c
index 1abe650..49cd354 100644
--- a/drivers/char/tpm/tpm2-cmd.c
+++ b/drivers/char/tpm/tpm2-cmd.c
@@ -598,20 +598,46 @@ EXPORT_SYMBOL_GPL(tpm2_do_selftest);
 /**
  * tpm2_gen_interrupt() - generate an interrupt
  * @chip: TPM chip to use
- * @quiet: surpress the error message
  *
  * 0 is returned when the operation is successful. If a negative number is
  * returned it remarks a POSIX error code. If a positive number is returned
  * it remarks a TPM error.
  */
-int tpm2_gen_interrupt(struct tpm_chip *chip, bool quiet)
+int tpm2_gen_interrupt(struct tpm_chip *chip)
 {
-	const char *desc = NULL;
 	u32 dummy;
 
-	if (!quiet)
-		desc = "attempting to generate an interrupt";
-
-	return tpm2_get_tpm_pt(chip, TPM2_CAP_TPM_PROPERTIES, &dummy, desc);
+	return tpm2_get_tpm_pt(chip, 0x100, &dummy,
+			       "attempting to generate an interrupt");
 }
 EXPORT_SYMBOL_GPL(tpm2_gen_interrupt);
+
+/**
+ * tpm2_probe() - probe TPM 2.0
+ * @chip: TPM chip to use
+ *
+ * Send idempotent TPM 2.0 command and see whether TPM 2.0 chip replied based on
+ * the reply tag.
+ */
+int tpm2_probe(struct tpm_chip *chip)
+{
+	struct tpm2_cmd cmd;
+	int rc;
+
+	cmd.header.in = tpm2_get_tpm_pt_header;
+	cmd.params.get_tpm_pt_in.cap_id = cpu_to_be32(TPM2_CAP_TPM_PROPERTIES);
+	cmd.params.get_tpm_pt_in.property_id = cpu_to_be32(0x100);
+	cmd.params.get_tpm_pt_in.property_cnt = cpu_to_be32(1);
+
+	rc = tpm_transmit(chip, (const char *) &cmd, sizeof(cmd));
+	if (rc <  0)
+		return rc;
+	else if (rc < TPM_HEADER_SIZE)
+		return -EFAULT;
+
+	if (be16_to_cpu(cmd.header.out.tag) == TPM2_ST_NO_SESSIONS)
+		chip->flags |= TPM_CHIP_FLAG_TPM2;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(tpm2_probe);
diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c
index 6725bef..ee6e0bd 100644
--- a/drivers/char/tpm/tpm_tis.c
+++ b/drivers/char/tpm/tpm_tis.c
@@ -639,12 +639,9 @@ static int tpm_tis_init(struct device *dev, acpi_handle acpi_dev_handle,
 		goto out_err;
 	}
 
-	/* Every TPM 2.x command has a higher ordinal than TPM 1.x commands.
-	 * Therefore, we can use an idempotent TPM 2.x command to probe TPM 2.x.
-	 */
-	rc = tpm2_gen_interrupt(chip, true);
-	if (rc == 0 || rc == TPM2_RC_INITIALIZE)
-		chip->flags |= TPM_CHIP_FLAG_TPM2;
+	rc = tpm2_probe(chip);
+	if (rc)
+		goto out_err;
 
 	vendor = ioread32(chip->vendor.iobase + TPM_DID_VID(0));
 	chip->vendor.manufacturer_id = vendor;
@@ -747,7 +744,7 @@ static int tpm_tis_init(struct device *dev, acpi_handle acpi_dev_handle,
 
 			/* Generate Interrupts */
 			if (chip->flags & TPM_CHIP_FLAG_TPM2)
-				tpm2_gen_interrupt(chip, false);
+				tpm2_gen_interrupt(chip);
 			else
 				tpm_gen_interrupt(chip);
 
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
From: Lukáš Czerner @ 2015-02-04 14:52 UTC (permalink / raw)
  To: Alexander Holler
  Cc: Michael Kerrisk, Al Viro, Theodore Ts'o, Linux-Fsdevel,
	Linux Kernel, Linux API
In-Reply-To: <54D21CC8.4020705-SXC+2es9fhnfWeYVQQPykw@public.gmane.org>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3708 bytes --]

On Wed, 4 Feb 2015, Alexander Holler wrote:

> Date: Wed, 04 Feb 2015 14:21:12 +0100
> From: Alexander Holler <holler-SXC+2es9fhnfWeYVQQPykw@public.gmane.org>
> To: Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Cc: Lukáš Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>,
>     Theodore Ts'o <tytso-3s7WtUTddSA@public.gmane.org>,
>     Linux-Fsdevel <linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
>     Linux Kernel <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
>     Linux API <linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
> Subject: Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
> 
> Am 04.02.2015 um 14:06 schrieb Michael Kerrisk:
> > Alexander,
> > 
> > On Wed, Feb 4, 2015 at 1:22 PM, Alexander Holler <holler-SXC+2es9fhnfWeYVQQPykw@public.gmane.org>
> > wrote:
> > > Am 04.02.2015 um 13:07 schrieb Lukáš Czerner:
> > > 
> > > > The fact is that the current patches are useless for anything other
> > > > than proof-of-concept. Now you know more that needs to be done or
> > > 
> > > 
> > > That's wrong. The patches already work. If you delete a file which isn't
> > > in
> > > use by something else, the current contents will be wiped on traditional
> > > harddrives. I assume that already fulfills more than 50% of use cases of
> > > ordinary people.
> > 
> > You are getting various feedback from people, that you seem to be ignoring.
> 
> I'm happy for all the feedback. But it doesn't help me. I'm not going to spend
> the necessary time unpaid.

Right, you'd much rather have someone else to spend the time on your
request unpaid. That's understandable, but unreasonable. You want
it, implement it, or pay someone else to do it for you.

> .
> > Al Viro, in his curmedgeonly way, points out that the problems are
> > much deeper than you realize. He does not say so explicitly, but I
> > imagine his point is that he does not want to see the kernel cluttered
> > with "partial" solutions that will simply increase the maintenance
> > burden in the long term, and leave bugs to be fixed further down the
> > line. You seem not to be listening.
> 
> It doesn't help me nor anyone else. As Eric Sandeen made me aware through in
> bug, look at http://lwn.net/Articles/462437/ what already happened.

That's what people have been trying to tell you. It's not an easy
task and there are plenty of cases to think about. As you can see
IBM tasked their developer to do it, but they did not succeed. And
here you come with your simplistic patches crying about "high
towers. But you're the one apparently interested in this feature
and you've been warned that's it's not a simple task.

But if you really want it I really do encourage you to try. I'd be
happy to have a working and reliable secure delete feature but it's
not my priority at all.

-Lukas

> 
> > Lukáš points out to you that getting a feature like this into the
> > kernel is complex process. You seem unwilling to hear that, and still
> > just want your partial solution.
> 
> Wrong. I don't want my partial solution to be part of the official kernel. I
> don't care. I offered it for other users because I'm aware that has become
> almost impossible for normal people to get something into the kernel without
> spending an unbelievable amount of time most people can't afford to spend.
> 
> > I tell you that discussions of APIs should CC linux-api, which I am
> > now CCing into this thread, again, because, again, you're not
> > listening to feedback.
> 
> Please don't confuse "not listening" with "unable to fulfill Linux kernel
> maintainer requests".
> 
> Alexander Holler
> 
> 

^ permalink raw reply

* Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
From: Austin S Hemmelgarn @ 2015-02-04 15:00 UTC (permalink / raw)
  To: Alexander Holler
  Cc: Michael Kerrisk, Lukáš Czerner, Al Viro,
	Theodore Ts'o, Linux-Fsdevel, Linux Kernel, Linux API
In-Reply-To: <54D22A63.7090603-SXC+2es9fhnfWeYVQQPykw@public.gmane.org>

On 2015-02-04 09:19, Alexander Holler wrote:
> Am 04.02.2015 um 14:29 schrieb Alexander Holler:
> I'm really sorry that I can't spend several unpaid months with reading
> and understanding ever changing linux kernel sources in order to become
> a Linux filesystem expert and send some fully working perfect patches
> which do fix the problem in question.
You aren't expected to do so.  Code review is an integral part of the 
development process here, and only truly trivial patches (stuff like 
fixing typos in kernel messages and documentation) get merged without 
it.  If you pay attention to the list itself, even the veteran kernel 
developers almost never manage to produce a patch that is deemed 
absolutely perfect, and end up revising things multiple times before 
they get merged.
> And I can't spend the necessary time to play remote keyboard for kernel
> maintainers which might be willing to explain me what has to be done
> according to their view. I've already offered what I was willing to do,
> for the price of having to defend myself over and over. And
> unfortunately that wasn't the first time I've ended up with having to
> defend myself.
You seem to fail to understand that open source development runs 
primarily on volunteer work (yes there are people paid to work on open 
source software, but that is a generally exceptional case).  A large 
majority of the people who are kernel maintainers are donating their 
free time to the project.
> My conclusion is that I'm a real fool having posted multiple times
> patches to this list. It just doesn't make any sense and most of the
> time the only reward are flames.
If you aren't serious about trying to get something into the mainline 
kernel, you should be tagging _all_ of the e-mails in that patch-set 
with [RFC] in the subject line.

In none of the responses that I've seen has anyone been anything but 
polite (albeit in some cases moderately annoyed).  If you really 
consider such attempts at constructive criticism to be flames, then a 
development mailing list isn't the place you should be posting patches.

^ permalink raw reply

* Re: [v8 4/5] ext4: adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support
From: Konstantin Khlebnikov @ 2015-02-04 15:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andy Lutomirski, Li Xi, Linux FS Devel,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Theodore Ts'o, Andreas Dilger, Jan Kara, Al Viro,
	Christoph Hellwig, dmonakhov-GEFAQzZX7r8dnm+yROfE0A,
	Eric W. Biederman
In-Reply-To: <20150128003746.GR16552@dastard>

On 28.01.2015 03:37, Dave Chinner wrote:
> On Tue, Jan 27, 2015 at 01:45:17PM +0300, Konstantin Khlebnikov wrote:
>> On 27.01.2015 11:02, Dave Chinner wrote:
>>> On Fri, Jan 23, 2015 at 03:59:04PM -0800, Andy Lutomirski wrote:
>>>> On Fri, Jan 23, 2015 at 3:30 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
>>>>> On Fri, Jan 23, 2015 at 02:58:09PM +0300, Konstantin Khlebnikov wrote:
>>>>
>>>> I think I must be missing something simple here.  In a hypothetical
>>>> world where the code used nsown_capable, if an admin wants to stick a
>>>> container in /mnt/container1 with associated prid 1 and a userns,
>>>> shouldn't it just map only prid 1 into the user ns?  Then a user in
>>>> that userns can't try to change the prid of a file to 2 because the
>>>> number "2" is unmapped for that user and translation will fail.
>>>
>>> You've effectively said "yes, project quotas are enabled, but you
>>> only have a single ID, it's always turned on and you can't change it
>>> to anything else.
>>>
>>> So, why do they need to be mapped via user namespaces to enable
>>> this? Think about it a little harder:
>>>
>>> 	- Project IDs are not user IDs.
>>> 	- Project IDs are not a security/permission mechanism.
>>> 	- Project quotas only provide a mechanism for
>>> 	  resource usage control.
>>>
>>> Think about that last one some more. Perhaps, as a hint, I should
>>> relate it to control groups? :) i.e:
>>>
>>> 	- Project quotas can be used as an effective mount ns space
>>> 	  usage controller.
>>>
>>> But this can only be safely and reliably by keeping the project IDs
>>> inaccessible from the containers themselves. I don't see why a
>>> mechanism that controls the amount of filesystem space used by a
>>> container should be considered any differently to a memory control
>>> group that limits the amount of memory the container can use.
>>>
>>> However, nobody on the container side of things would answer any of
>>> my questions about how project quotas were going to be used,
>>> limited, managed, etc back when we had to make a decision to enable
>>> XFS user ns support, I did what was needed to support the obvious
>>> container use case and close any possible loop hole that containers
>>> might be able to use to subvert that use case.
>>
>> I have a solution: Hierarchical Project Quota! Each project might have
>> parent project and so on. Each level keeps usage, limits and also keeps
>> some preallocation from parent level to reduce count of quota updates.
>
> That's an utter nightmare to manage - just ask the gluster guys who
> thought this was a good idea when they first implemented quotas.
>
> Besides, following down the path of heirarchical control groups
> doesn't seem like a good idea to me because that path has already
> proven to be a bad idea for container resource controllers. There's
> good reason why control groups have gone back to a flattened ID
> space like we already have for project quotas, so I don't think we
> want to go that way.
>
>> This might be useful even without containers : normal user quota has
>> two levels and admins might classify users into groups and set group
>> quota for them. Project quota is flat and cannot provide any control
>> if we want classify projects.
>
> I don't follow. project ID is exactly what allows you to control
> project classification.

I mean hierarchy allows to group several projects into one super-project
which sums all disk usage and could have its own limit too.

>
>> For containers hierarchy provide full virtualization: user-namespace
>> maps maps second-level and projects into subset of real projects.
>
> It's not the mapping that matters - if project quotas are used
> outside containers as a resource controller, then they can't be
> used inside containers even with a unique mapping range because
> we can only store a single project ID per inode.
>
> Besides, I'm struggling to see the use case for project quotas
> inside small containers that run single applications and typically
> only have a single user. Project quotas have traditionally been used
> to manage space in large filesystems shared by many users along
> bounds that don't follow any specific heirarchy or permission set.
>
> IOWs, you haven't described your use case for needing project quotas
> inside containers, so I've got no idea what problem you are trying
> to solve or whether project quotas are even appropriate as a
> solution.

Some people run inside containers complete distributives with multiple
services or even nested virtualization.

I've poked this code and played with some use-cases.
Hierarchical project quotas are cool and it seems the only option for
virtualization and providing seamless nested project quotas inside
containers. But, right now I'm not so interested in this feature.
Let's leave this for the future.


For now I'm more interested in participation disk space among services
in one system. As I see security model of project quota in XFS almost
non-existent for this case: it forbids linking/renaming files between
different projects but any unprivileged user might change project id
for its own files. That's strange, this operation should be privileged.

Also if user have permission for changing project id he could be
permitted to link and rename file into directory with any project id, 
because he anyway could change project, move, and revert it back.


For me perfect interface looks like couple fcntls for getting/changing 
project id:

int fcntl(fd, F_GET_PROJECT, projid_t *);
int fcntl(fd, F_SET_PROJECT, projid_t);

F_GET_PROJECT is allowed for everybody
F_SET_PROJECT requires CAP_SYS_ADMIN (or maybe CAP_FOWNER?)
(for virtualization id also must be mapped in user-ns)

ioctl XFS_IOC_FSSETXATTR should stay xfs specific.
And XFS_DIFLAG_PROJINHERIT should stay XFS-only feature too.
I don't see any use cases for that flag. For files is has no effect
for directories it's mostly equal to setting directory project id
to zero. The only difference in accounting directory itself.

Cross-project renaming/linking must be allowed if user have
permissions for changing project id at file and directory.
This is useful for sharing files between containers.

-- 
Konstantin

^ permalink raw reply

* Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
From: Alexander Holler @ 2015-02-04 16:12 UTC (permalink / raw)
  To: Lukáš Czerner
  Cc: Michael Kerrisk, Al Viro, Theodore Ts'o, Linux-Fsdevel,
	Linux Kernel, Linux API
In-Reply-To: <alpine.LFD.2.00.1502041533130.26766-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>

Am 04.02.2015 um 15:52 schrieb Lukáš Czerner:
> On Wed, 4 Feb 2015, Alexander Holler wrote:

>> I'm happy for all the feedback. But it doesn't help me. I'm not going to spend
>> the necessary time unpaid.
> 
> Right, you'd much rather have someone else to spend the time on your
> request unpaid. That's understandable, but unreasonable. You want
> it, implement it, or pay someone else to do it for you.

Maybe you should attach a big fat red warning to the kernels bugzilla
that filing a bug means either to fix it yourself or pay somone to do that.

I've never demanded that someone else fixes it.

I've just explained a problem.

Unbelievable how someone could do such without paying someone else to
fix it or by fixing it themself ...

^ permalink raw reply

* Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
From: Lukáš Czerner @ 2015-02-04 16:25 UTC (permalink / raw)
  To: Alexander Holler
  Cc: Michael Kerrisk, Al Viro, Theodore Ts'o, Linux-Fsdevel,
	Linux Kernel, Linux API
In-Reply-To: <54D24504.6080201@ahsoftware.de>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1350 bytes --]

On Wed, 4 Feb 2015, Alexander Holler wrote:

> Date: Wed, 04 Feb 2015 17:12:52 +0100
> From: Alexander Holler <holler@ahsoftware.de>
> To: Lukáš Czerner <lczerner@redhat.com>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>,
>     Al Viro <viro@zeniv.linux.org.uk>, Theodore Ts'o <tytso@mit.edu>,
>     Linux-Fsdevel <linux-fsdevel@vger.kernel.org>,
>     Linux Kernel <linux-kernel@vger.kernel.org>,
>     Linux API <linux-api@vger.kernel.org>
> Subject: Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
> 
> Am 04.02.2015 um 15:52 schrieb Lukáš Czerner:
> > On Wed, 4 Feb 2015, Alexander Holler wrote:
> 
> >> I'm happy for all the feedback. But it doesn't help me. I'm not going to spend
> >> the necessary time unpaid.
> > 
> > Right, you'd much rather have someone else to spend the time on your
> > request unpaid. That's understandable, but unreasonable. You want
> > it, implement it, or pay someone else to do it for you.
> 
> Maybe you should attach a big fat red warning to the kernels bugzilla
> that filing a bug means either to fix it yourself or pay somone to do that.
> 
> I've never demanded that someone else fixes it.
> 
> I've just explained a problem.
> 
> Unbelievable how someone could do such without paying someone else to
> fix it or by fixing it themself ...

It's not a bug, you're requesting a feature.

^ permalink raw reply

* Re: [PATCH] IB/core: Temporarily disable ex_query_device uverb
From: Yann Droneaud @ 2015-02-04 16:31 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Shachar Raindel,
	Jason Gunthorpe, Eli Cohen
In-Reply-To: <1422882626.3030.56.camel-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

Hi,

Le lundi 02 février 2015 à 14:10 +0100, Yann Droneaud a écrit :
> Le dimanche 01 février 2015 à 15:35 +0200, Haggai Eran a écrit :
> > Commit 5a77abf9a97a ("IB/core: Add support for extended query device caps")
> > added a new extended verb to query the capabilities of RDMA devices, but the
> > semantics of this verb are still under debate [1].
> > 
> > Block access to this verb from user-space until the new semantics are in
> > place.
> > 
> > Cc: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
> > Cc: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> > Cc: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > 
> > [1] [PATCH v1 0/5] IB/core: extended query device caps cleanup for v3.19
> >     http://www.spinics.net/lists/linux-rdma/msg22904.html
> > 
> > Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> 
> Reviewed-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

> >  drivers/infiniband/core/uverbs_main.c | 1 -
> >  1 file changed, 1 deletion(-)
> > 
> > diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
> > index e6c23b9eab33..5db1a8cc388d 100644
> > --- a/drivers/infiniband/core/uverbs_main.c
> > +++ b/drivers/infiniband/core/uverbs_main.c
> > @@ -123,7 +123,6 @@ static int (*uverbs_ex_cmd_table[])(struct ib_uverbs_file *file,
> >  				    struct ib_udata *uhw) = {
> >  	[IB_USER_VERBS_EX_CMD_CREATE_FLOW]	= ib_uverbs_ex_create_flow,
> >  	[IB_USER_VERBS_EX_CMD_DESTROY_FLOW]	= ib_uverbs_ex_destroy_flow,
> > -	[IB_USER_VERBS_EX_CMD_QUERY_DEVICE]	= ib_uverbs_ex_query_device
> >  };
> >  
> >  static void ib_uverbs_add_one(struct ib_device *device);
> 
> That's the smallest (and smartest) patch to be applied instead of
> reverting.
> 

Unfortunately, I've missed the issue I was complaining about in the
first place [1]. And I feel a bit guilty having missed the issue.

The present patch is fine as it fully disable the new extended 
QUERY_DEVICE uverb, but it doesn't disable the broken logic added in
ib_copy_to_udata() by commit 5a77abf9a97a ('IB/core: Add support for
extended query device caps'):

  diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
  index 470a011d6fa4..97a999f9e4d8 100644
  --- a/include/rdma/ib_verbs.h
  +++ b/include/rdma/ib_verbs.h
  @@ -1662,7 +1662,10 @@ static inline int ib_copy_from_udata(void *dest, struct ib_udata *udata, size_t
   
   static inline int ib_copy_to_udata(struct ib_udata *udata, void *src, size_t len)
   {
  -	return copy_to_user(udata->outbuf, src, len) ? -EFAULT : 0;
  +	size_t copy_sz;
  +
  +	copy_sz = min_t(size_t, len, udata->outlen);
  +	return copy_to_user(udata->outbuf, src, copy_sz) ? -EFAULT : 0;
   }

That part of commit 5a77abf9a97a should be reverted as I'm not sure
it doesn't introduce regressions, especially difficult to notice ones.

Regards.

[1] Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps
    http://mid.gmane.org/1418733236.2779.26.camel-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org

Regards.

-- 
Yann Droneaud
OPTEYA

^ permalink raw reply

* Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
From: Alexander Holler @ 2015-02-04 16:45 UTC (permalink / raw)
  To: Lukáš Czerner
  Cc: Michael Kerrisk, Al Viro, Theodore Ts'o, Linux-Fsdevel,
	Linux Kernel, Linux API
In-Reply-To: <alpine.LFD.2.00.1502041724180.26766-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>

Am 04.02.2015 um 17:25 schrieb Lukáš Czerner:
> On Wed, 4 Feb 2015, Alexander Holler wrote:

>> Am 04.02.2015 um 15:52 schrieb Lukáš Czerner:
>>> On Wed, 4 Feb 2015, Alexander Holler wrote:
>>
>>>> I'm happy for all the feedback. But it doesn't help me. I'm not going to spend
>>>> the necessary time unpaid.
>>>
>>> Right, you'd much rather have someone else to spend the time on your
>>> request unpaid. That's understandable, but unreasonable. You want
>>> it, implement it, or pay someone else to do it for you.
>>
>> Maybe you should attach a big fat red warning to the kernels bugzilla
>> that filing a bug means either to fix it yourself or pay somone to do that.
>>
>> I've never demanded that someone else fixes it.
>>
>> I've just explained a problem.
>>
>> Unbelievable how someone could do such without paying someone else to
>> fix it or by fixing it themself ...
>
> It's not a bug, you're requesting a feature.
>

Ok, I'm guilty.

May I ask if there's somewhere a feature request tracker which doesn't 
cruzify someone because he suggest a (maybe wrong) solution and tries to 
show that this might work with some prelimary, broken, silly, quick and 
dirty patches?

^ permalink raw reply

* Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
From: Alexander Holler @ 2015-02-04 16:53 UTC (permalink / raw)
  To: Lukáš Czerner
  Cc: Michael Kerrisk, Al Viro, Theodore Ts'o, Linux-Fsdevel,
	Linux Kernel, Linux API
In-Reply-To: <54D24CA5.6080603-SXC+2es9fhnfWeYVQQPykw@public.gmane.org>

Am 04.02.2015 um 17:45 schrieb Alexander Holler:
> Am 04.02.2015 um 17:25 schrieb Lukáš Czerner:
>> On Wed, 4 Feb 2015, Alexander Holler wrote:
>
>>> Am 04.02.2015 um 15:52 schrieb Lukáš Czerner:
>>>> On Wed, 4 Feb 2015, Alexander Holler wrote:
>>>
>>>>> I'm happy for all the feedback. But it doesn't help me. I'm not
>>>>> going to spend
>>>>> the necessary time unpaid.
>>>>
>>>> Right, you'd much rather have someone else to spend the time on your
>>>> request unpaid. That's understandable, but unreasonable. You want
>>>> it, implement it, or pay someone else to do it for you.
>>>
>>> Maybe you should attach a big fat red warning to the kernels bugzilla
>>> that filing a bug means either to fix it yourself or pay somone to do
>>> that.
>>>
>>> I've never demanded that someone else fixes it.
>>>
>>> I've just explained a problem.
>>>
>>> Unbelievable how someone could do such without paying someone else to
>>> fix it or by fixing it themself ...
>>
>> It's not a bug, you're requesting a feature.
>>
>
> Ok, I'm guilty.
>
> May I ask if there's somewhere a feature request tracker which doesn't
> cruzify someone because he suggest a (maybe wrong) solution and tries to
> show that this might work with some prelimary, broken, silly, quick and
> dirty patches?

I guess the answer is FreeBSD or similiar. ;)

^ permalink raw reply

* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Vlastimil Babka @ 2015-02-04 17:02 UTC (permalink / raw)
  To: mtk.manpages
  Cc: Kirill A. Shutemov, Dave Hansen, Mel Gorman, linux-mm@kvack.org,
	Minchan Kim, Andrew Morton, lkml, Linux API, linux-man,
	Hugh Dickins
In-Reply-To: <CAKgNAkgOOCuzJz9whoVfFjqhxM0zYsz94B1+oH58SthC5Ut9sg@mail.gmail.com>

On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote:
> Hello Vlastimil,
>
> On 4 February 2015 at 14:46, Vlastimil Babka <vbabka@suse.cz> wrote:
>>>> - that covers mlocking ok, not sure if the rest fits the "shared pages"
>>>> case
>>>> though. I dont see any check for other kinds of shared pages in the code.
>>>
>>> Agreed. "shared" here seems confused. I've removed it. And I've
>>> added mention of "Huge TLB pages" for this error.
>>
>> Thanks.
>
> I also added those cases for MADV_REMOVE, BTW.

Right. There's also the following for MADV_REMOVE that needs updating:

"Currently, only shmfs/tmpfs supports this; other filesystems return 
with the error ENOSYS."

- it's not just shmem/tmpfs anymore. It should be best to refer to 
fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to 
date.

- AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error 
code is listed in the ERRORS section.

>>>>>> - The word "will result" did sound as a guarantee at least to me. So
>>>>>> here it
>>>>>> could be changed to "may result (unless the advice is ignored)"?
>>>>>
>>>>> It's too late to fix documentation. Applications already depends on the
>>>>> beheviour.
>>>>
>>>> Right, so as long as they check for EINVAL, it should be safe. It appears
>>>> that
>>>> jemalloc does.
>>>
>>>
>>> So, first a brief question: in the cases where the call does not error
>>> out,
>>> are we agreed that in the current implementation, MADV_DONTNEED will
>>> always result in zero-filled pages when the region is faulted back in
>>> (when we consider pages that are not backed by a file)?
>>
>>
>> I'd agree at this point.
>
> Thanks for the confirmation.
>
>> Also we should probably mention anonymously shared pages (shmem). I think
>> they behave the same as file here.
>
> You mean tmpfs here, right? (I don't keep all of the synonyms straight.)

shmem is tmpfs (that by itself would fit under "files" just fine), but 
also sys V segments created by shmget(2) and also mappings created by 
mmap with MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single 
manpage to refer to the full list.

Thanks,
Vlastimil

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [Patch v4] firmware: dmi-sysfs: add SMBIOS entry point area attribute
From: Ivan Khoronzhuk @ 2015-02-04 17:06 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A,
	grant.likely-QSEj5FYQhm4dnm+yROfE0A,
	matt.fleming-ral2JQCrhuEAvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA
  Cc: dmidecode-devel-qX2TKyscuCcdnm+yROfE0A,
	leif.lindholm-QSEj5FYQhm4dnm+yROfE0A,
	msalter-H+wXaHxf7aLQT0dZR+AlfA, Ivan Khoronzhuk

Some utils, like dmidecode and smbios, need to access SMBIOS entry
table area in order to get information like SMBIOS version, size, etc.
Currently it's done via /dev/mem. But for situation when /dev/mem
usage is disabled, the utils have to use dmi sysfs instead, which
doesn't represent SMBIOS entry. So this patch adds SMBIOS area to
dmi-sysfs in order to allow utils in question to work correctly with
dmi sysfs interface.

Reviewed-by: Ard Biesheuvel <ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
---

v1: https://lkml.org/lkml/2015/1/23/643
v2: https://lkml.org/lkml/2015/1/26/345
v3: https://lkml.org/lkml/2015/1/28/768

v4..v2:
  firmware: dmi_scan: add symbol to get SMBIOS entry area
  	- used u8 type for smbios_header var
  firmware: dmi-sysfs: add SMBIOS entry point area attribute
  	- replaced -ENODATA on -EINVAL

v3..v2:
  firmware: dmi_scan: add symbol to get SMBIOS entry area
  firmware: dmi-sysfs: add SMBIOS entry point area attribute
	- combined in one patch
	- added SMBIOS information to ABI sysfs-dmi documentaton

v2..v1:
  firmware: dmi_scan: add symbol to get SMBIOS entry area
	- used additional static var to hold SMBIOS raw table size
	- changed format of get_smbios_entry_area symbol
	  returned pointer on const smbios table

  firmware: dmi-sysfs: add SMBIOS entry point area attribute
	- adopted to updated get_smbios_entry_area symbol
  	- removed redundant array to save smbios table

 Documentation/ABI/testing/sysfs-firmware-dmi | 10 +++++++
 drivers/firmware/dmi-sysfs.c                 | 42 ++++++++++++++++++++++++++++
 drivers/firmware/dmi_scan.c                  | 26 +++++++++++++++++
 include/linux/dmi.h                          |  3 ++
 4 files changed, 81 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-firmware-dmi b/Documentation/ABI/testing/sysfs-firmware-dmi
index c78f9ab..3a9ffe8 100644
--- a/Documentation/ABI/testing/sysfs-firmware-dmi
+++ b/Documentation/ABI/testing/sysfs-firmware-dmi
@@ -12,6 +12,16 @@ Description:
 		cannot ensure that the data as exported to userland is
 		without error either.
 
+		The firmware provides DMI structures as a packed list of
+		data referenced by a SMBIOS table entry point. The SMBIOS
+		entry point contains general information, like SMBIOS
+		version, DMI table size, etc. The structure, content and
+		size of SMBIOS entry point is dependent on SMBIOS version.
+		That's why SMBIOS entry point is represented in dmi sysfs
+		like a raw attribute and is accessible via
+		/sys/firmware/dmi/smbios_raw_header. The format of SMBIOS
+		entry point header can be read in SMBIOS specification.
+
 		DMI is structured as a large table of entries, where
 		each entry has a common header indicating the type and
 		length of the entry, as well as a firmware-provided
diff --git a/drivers/firmware/dmi-sysfs.c b/drivers/firmware/dmi-sysfs.c
index e0f1cb3..9b396d7 100644
--- a/drivers/firmware/dmi-sysfs.c
+++ b/drivers/firmware/dmi-sysfs.c
@@ -29,6 +29,8 @@
 #define MAX_ENTRY_TYPE 255 /* Most of these aren't used, but we consider
 			      the top entry type is only 8 bits */
 
+static const u8 *smbios_raw_header;
+
 struct dmi_sysfs_entry {
 	struct dmi_header dh;
 	struct kobject kobj;
@@ -646,9 +648,37 @@ static void cleanup_entry_list(void)
 	}
 }
 
+static ssize_t smbios_entry_area_raw_read(struct file *filp,
+					  struct kobject *kobj,
+					  struct bin_attribute *bin_attr,
+					  char *buf, loff_t pos, size_t count)
+{
+	ssize_t size;
+
+	size = bin_attr->size;
+
+	if (size > pos)
+		size -= pos;
+	else
+		return 0;
+
+	if (count < size)
+		size = count;
+
+	memcpy(buf, &smbios_raw_header[pos], size);
+
+	return size;
+}
+
+static struct bin_attribute smbios_raw_area_attr = {
+	.read = smbios_entry_area_raw_read,
+	.attr = {.name = "smbios_raw_header", .mode = 0400},
+};
+
 static int __init dmi_sysfs_init(void)
 {
 	int error = -ENOMEM;
+	int size;
 	int val;
 
 	/* Set up our directory */
@@ -669,6 +699,18 @@ static int __init dmi_sysfs_init(void)
 		goto err;
 	}
 
+	smbios_raw_header = dmi_get_smbios_entry_area(&size);
+	if (!smbios_raw_header) {
+		pr_debug("dmi-sysfs: SMBIOS raw data is not available.\n");
+		error = -EINVAL;
+		goto err;
+	}
+
+	/* Create the raw binary file to access the entry area */
+	smbios_raw_area_attr.size = size;
+	if (sysfs_create_bin_file(dmi_kobj, &smbios_raw_area_attr))
+		goto err;
+
 	pr_debug("dmi-sysfs: loaded.\n");
 
 	return 0;
diff --git a/drivers/firmware/dmi_scan.c b/drivers/firmware/dmi_scan.c
index 420c8d8..99c5f6c 100644
--- a/drivers/firmware/dmi_scan.c
+++ b/drivers/firmware/dmi_scan.c
@@ -113,6 +113,8 @@ static void dmi_table(u8 *buf, int len, int num,
 	}
 }
 
+static u8 smbios_header[32];
+static int smbios_header_size;
 static phys_addr_t dmi_base;
 static u16 dmi_len;
 static u16 dmi_num;
@@ -474,6 +476,8 @@ static int __init dmi_present(const u8 *buf)
 	if (memcmp(buf, "_SM_", 4) == 0 &&
 	    buf[5] < 32 && dmi_checksum(buf, buf[5])) {
 		smbios_ver = get_unaligned_be16(buf + 6);
+		smbios_header_size = buf[5];
+		memcpy(smbios_header, buf, smbios_header_size);
 
 		/* Some BIOS report weird SMBIOS version, fix that up */
 		switch (smbios_ver) {
@@ -505,6 +509,8 @@ static int __init dmi_present(const u8 *buf)
 				pr_info("SMBIOS %d.%d present.\n",
 				       dmi_ver >> 8, dmi_ver & 0xFF);
 			} else {
+				smbios_header_size = 15;
+				memcpy(smbios_header, buf, smbios_header_size);
 				dmi_ver = (buf[14] & 0xF0) << 4 |
 					   (buf[14] & 0x0F);
 				pr_info("Legacy DMI %d.%d present.\n",
@@ -531,6 +537,8 @@ static int __init dmi_smbios3_present(const u8 *buf)
 		dmi_ver &= 0xFFFFFF;
 		dmi_len = get_unaligned_le32(buf + 12);
 		dmi_base = get_unaligned_le64(buf + 16);
+		smbios_header_size = buf[6];
+		memcpy(smbios_header, buf, smbios_header_size);
 
 		/*
 		 * The 64-bit SMBIOS 3.0 entry point no longer has a field
@@ -944,3 +952,21 @@ void dmi_memdev_name(u16 handle, const char **bank, const char **device)
 	}
 }
 EXPORT_SYMBOL_GPL(dmi_memdev_name);
+
+/**
+ * dmi_get_smbios_entry_area - copy SMBIOS entry point area to array.
+ * @size - pointer to assign actual size of SMBIOS entry point area.
+ *
+ * returns NULL if table is not available, otherwise returns pointer on
+ * SMBIOS entry point area array.
+ */
+const u8 *dmi_get_smbios_entry_area(int *size)
+{
+	if (!smbios_header_size || !dmi_available)
+		return NULL;
+
+	*size = smbios_header_size;
+
+	return smbios_header;
+}
+EXPORT_SYMBOL_GPL(dmi_get_smbios_entry_area);
diff --git a/include/linux/dmi.h b/include/linux/dmi.h
index f820f0a..8e1a28d 100644
--- a/include/linux/dmi.h
+++ b/include/linux/dmi.h
@@ -109,6 +109,7 @@ extern int dmi_walk(void (*decode)(const struct dmi_header *, void *),
 	void *private_data);
 extern bool dmi_match(enum dmi_field f, const char *str);
 extern void dmi_memdev_name(u16 handle, const char **bank, const char **device);
+const u8 *dmi_get_smbios_entry_area(int *size);
 
 #else
 
@@ -140,6 +141,8 @@ static inline void dmi_memdev_name(u16 handle, const char **bank,
 		const char **device) { }
 static inline const struct dmi_system_id *
 	dmi_first_match(const struct dmi_system_id *list) { return NULL; }
+static inline const u8 *dmi_get_smbios_entry_area(int *size)
+	{ return NULL; }
 
 #endif
 
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH v4] gpio: lib-sysfs: Add 'wakeup' attribute
From: Sören Brinkmann @ 2015-02-04 18:27 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Alexandre Courbot, Johan Hovold, linux-api,
	Linux Kernel Mailing List, linux-gpio@vger.kernel.org,
	linux-doc@vger.kernel.org
In-Reply-To: <CACRpkdYuHdbQ7aaeCQTwZO63LjWpKyZnPQRNCBmRybntXYJxFQ@mail.gmail.com>

On Wed, 2015-02-04 at 10:19AM +0100, Linus Walleij wrote:
> On Thu, Jan 29, 2015 at 6:23 PM, Sören Brinkmann
> <soren.brinkmann@xilinx.com> wrote:
> > On Mon, 2015-01-19 at 09:54AM +0100, Linus Walleij wrote:
> >> On Mon, Jan 19, 2015 at 5:20 AM, Alexandre Courbot <gnurou@gmail.com> wrote:
> >> > On Sat, Jan 17, 2015 at 1:49 AM, Sören Brinkmann <soren.brinkmann@xilinx.com> wrote:
> >> >> On Fri, 2015-01-16 at 12:11PM +0100, Johan Hovold wrote:
> >>
> >> >>> Implementing proper wakeup support for unclaimed GPIOs would take some
> >> >>> work (if at all desired), but that is not a reason to be adding custom
> >> >>> implementations that violates the kernel's power policies and new ABIs
> >> >>> that would need to be maintained forever.
> >> (...)
> >> >>> Meanwhile you can (should) use gpio-keys if you need to wake your system
> >> >>> on gpio events.
> >> >>
> >> >> We had that discussion and I don't think GPIO keys is the right solution
> >> >> for every use-case.
> >> >
> >> > Sorry, it has been a while - can you remind us of why?
> >>
> >> There are such cases. Of course keys should be handled by GPIO-keys
> >> and these will trigger the right wakeup events in such cases.
> >>
> >> This is for more esoteric cases: we cannot have a kernel module for
> >> everything people want to do with GPIOs, and the use case I accept
> >> is GPIOs used in automatic control etc, think factory lines or doors.
> >> We can't have a "door" driver or "punch arm" or "fire alarm" driver
> >> in the kernel. Those are userspace things.
> >>
> >> Still such embedded systems need to be able to go to idle and
> >> sleep to conerve power, and then they need to put wakeups on
> >> these GPIOs.
> >>
> >> So it is a feature userspace needs, though as with much of the
> >> sysfs ABI it is very often abused for things like keys and LEDs which
> >> is an abomination but we can't do much about it :(
> >
> > Thanks for clearing that up.
> > What does that mean for this patch? Are we going ahead, accepting the
> > extension of this API or do all these use-cases have to wait for the
> > rewrite of a proper GPIO userspace interface?
> 
> What needs to happen (IMHO) is to make gpio_chips properly obeying
> the device model, and then add the attributes for fiddling around with
> GPIOs to either the *real* device or create a new char device interface.
> Whatever works best. These mock devices are fragile and never
> worked properly especially in the removal path as Johans recent
> fixes has shown.

Sure, that would be a nice long-term solution. But until then this patch
would probably be welcomed by some people, without making the brokenness
of this interface much worse.

	Sören

^ permalink raw reply

* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Michael Kerrisk (man-pages) @ 2015-02-04 19:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Dave Hansen, Mel Gorman,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Minchan Kim,
	Andrew Morton, lkml, Linux API, linux-man, Hugh Dickins
In-Reply-To: <54D2508A.9030804-AlSwsSmVLrQ@public.gmane.org>

On 4 February 2015 at 18:02, Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org> wrote:
> On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote:
>>
>> Hello Vlastimil,
>>
>> On 4 February 2015 at 14:46, Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org> wrote:
>>>>>
>>>>> - that covers mlocking ok, not sure if the rest fits the "shared pages"
>>>>> case
>>>>> though. I dont see any check for other kinds of shared pages in the
>>>>> code.
>>>>
>>>>
>>>> Agreed. "shared" here seems confused. I've removed it. And I've
>>>> added mention of "Huge TLB pages" for this error.
>>>
>>>
>>> Thanks.
>>
>>
>> I also added those cases for MADV_REMOVE, BTW.
>
>
> Right. There's also the following for MADV_REMOVE that needs updating:
>
> "Currently, only shmfs/tmpfs supports this; other filesystems return with
> the error ENOSYS."
>
> - it's not just shmem/tmpfs anymore. It should be best to refer to
> fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to
> date.
>
> - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error code is
> listed in the ERRORS section.

Yup, I recently added that as well, based on a patch from Jan Chaloupka.

>>>>>>> - The word "will result" did sound as a guarantee at least to me. So
>>>>>>> here it
>>>>>>> could be changed to "may result (unless the advice is ignored)"?
>>>>>>
>>>>>> It's too late to fix documentation. Applications already depends on
>>>>>> the
>>>>>> beheviour.
>>>>>
>>>>> Right, so as long as they check for EINVAL, it should be safe. It
>>>>> appears
>>>>> that
>>>>> jemalloc does.
>>>>
>>>> So, first a brief question: in the cases where the call does not error
>>>> out,
>>>> are we agreed that in the current implementation, MADV_DONTNEED will
>>>> always result in zero-filled pages when the region is faulted back in
>>>> (when we consider pages that are not backed by a file)?
>>>
>>> I'd agree at this point.
>>
>> Thanks for the confirmation.
>>
>>> Also we should probably mention anonymously shared pages (shmem). I think
>>> they behave the same as file here.
>>
>> You mean tmpfs here, right? (I don't keep all of the synonyms straight.)
>
> shmem is tmpfs (that by itself would fit under "files" just fine), but also
> sys V segments created by shmget(2) and also mappings created by mmap with
> MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single manpage to
> refer to the full list.

So, how about this text:

              After a successful MADV_DONTNEED operation, the seman‐
              tics  of  memory  access  in  the specified region are
              changed: subsequent accesses of  pages  in  the  range
              will  succeed,  but will result in either reloading of
              the memory contents from the  underlying  mapped  file
              (for  shared file mappings, shared anonymous mappings,
              and shmem-based techniques such  as  System  V  shared
              memory  segments)  or  zero-fill-on-demand  pages  for
              anonymous private mappings.

Thanks,

Michael
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
From: Theodore Ts'o @ 2015-02-04 19:33 UTC (permalink / raw)
  To: Lukáš Czerner
  Cc: Alexander Holler, Michael Kerrisk, Al Viro, Linux-Fsdevel,
	Linux Kernel, Linux API
In-Reply-To: <alpine.LFD.2.00.1502041533130.26766@localhost.localdomain>

On Wed, Feb 04, 2015 at 03:52:02PM +0100, Lukáš Czerner wrote:
> > I'm happy for all the feedback. But it doesn't help me. I'm not going to spend
> > the necessary time unpaid.
> 
> Right, you'd much rather have someone else to spend the time on your
> request unpaid. That's understandable, but unreasonable. You want
> it, implement it, or pay someone else to do it for you.
> 
> > It doesn't help me nor anyone else. As Eric Sandeen made me aware through in
> > bug, look at http://lwn.net/Articles/462437/ what already happened.
> 
> That's what people have been trying to tell you. It's not an easy
> task and there are plenty of cases to think about. As you can see
> IBM tasked their developer to do it, but they did not succeed. And
> here you come with your simplistic patches crying about "high
> towers. But you're the one apparently interested in this feature
> and you've been warned that's it's not a simple task.

And indeed, people who do have salaries paid by companies who care
about this general problem in actual products have been working on
addressing it using encryption, such that when the user is removed
from the device, the key is blasted.  More importantly, when the user
is not logged in, the key isn't even *available* on the device.  So it
solves more problems than the one that you are concerned about, and in
general maintainers prefer solutions that solve multiple problems,
because that minimizes the number of one-time hacks and partial/toy
solutions which turn into long-term maintainance headaches.  (After
all, if you insist on having a partial/toy solution merged, that turns
into an unfunded mandate which the maintainers effectively have to
support for free, forever.)

You've rejected encryption as a proposed solution as not meeting your
requirements (which if I understand your objections, can be summarized
as "encryption is too hard").  This is fine, but if you want someone
*else* to implement your partial toy solution which is less secure,
then you will either need to pay someone to do it or do it yourself.

> > Wrong. I don't want my partial solution to be part of the official kernel. I
> > don't care. I offered it for other users because I'm aware that has become
> > almost impossible for normal people to get something into the kernel without
> > spending an unbelievable amount of time most people can't afford to spend.

So you expect other users to just apply your patches and use an
unofficial system call number that might get reassigned to some other
user later on?

If that's all you want, then ok, you're done.  The patches have been
posted to LKML, and you can give people URL's if they want to try
applying the patches on their own.

Cheers,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH net-next v2 00/17] RFC: new ETHTOOL_GSETTINGS/SSETTINGS API
From: David Decotigny @ 2015-02-04 19:53 UTC (permalink / raw)
  To: David S. Miller, Ben Hutchings, Amir Vadai, linux-kernel, netdev,
	linux-api, linux-mips, fcoe-devel
  Cc: Eric Dumazet, Eugenia Emantayev, Or Gerlitz, Ido Shamay,
	Joe Perches, Saeed Mahameed, Govindarajulu Varadarajan,
	Venkata Duvvuru, Jeff Kirsher, Eyal Perry, Pravin B Shelar,
	Ed Swierk, Upinder Malhi, Robert Love, James E.J. Bottomley,
	David Decotigny

From: David Decotigny <decot@googlers.com>


History:
 v2
 - keep return 0 in get_settings when successful, instead of
   propagating positive result from driver's get_settings callback.
 v1
 - original submission


The main goal of this series is to support ethtool link mode masks
larger than 32 bits. It implements a new ioctl pair
(ETHTOOL_GSETTINGS/SSETTINGS), its associated callbacks
(get/set_settings) and a new struct ethtool_settings, which should
eventually replace legacy ethtool_cmd. Internally, the kernel uses
fixed length link mode masks defined at compilation time in ethtool.h
(for now: 31 bits), that can be increased by changing
__ETHTOOL_LINK_MODE_LAST in ethtool.h (absolute max is 4064 bits,
checked at compile time), and the user/kernel interface allows this
length to be arbitrary within 1..4064. This should allow some
flexibility without using too much malloc/stack space, at the cost of
a small kernel/user handshake for the user to determine the sizes of
those bitmaps.

Along the way, I chose to drop in the new structure the 3 ethtool_cmd
fields marked "deprecated" (transceiver/maxrxpkt/maxtxpkt). They are
still available for old drivers via the old ETHTOOL_GSET/SSET API, but
are not available to drivers that switch to new API. Of those 3
fields, ethtool_cmd::transceiver seems to be still actively used by
several drivers, maybe we should not consider this field deprecated?
The 2 other fields are basically not used. This transition requires
some care in the way old and new ethtool talk to the kernel.

More technical details provided in the description for main patch. In
particular details about backward compatibility properties.

Some questions to more experts than me:
 - the kernel/interface multiplexes the "tell me the bitmap length"
   handshake and the "give me the settings" inside the new
   ETHTOOL_GSETTINGS cmd. I was thinking of making this into 2
   separate cmds: 1 cmd ETHTOOL_GKERNELPROPERTIES which would be
   kernel-wide rather than device-specific, would return properties
   like "length of the link mode bitmaps", and possibly others. And
   ETHTOOL_GSETTINGS would expect the proper bitmaps
 - the link mode bitmaps are piggybacked at tail of the new struct
   ethtool_settings. Since its user-visible definition does not assume
   specific bitmap width, I am using a 0-length array as the publicly
   visible placeholder. But then, the kernel needs to specialize it
   (struct ethtool_ksettings) to specify its current link mode
   masks. This means that kernel code is "littered" with
   "ksettings->parent.field" to access "field" inside
   ethtool_settings:
   + I don't like the field name "parent", any suggestion welcome
   + and/or: I could use ethtool_settings everywhere (instead of a new
     ethtool_ksettings) and an accessor to retrieve the link mode
     masks?
   + or: we could decide to make the link mode masks statically
     bounded again, ie. make their width public, but larger than
     current 32, and unchangeable forever. This would make everything
     straightforward, but we might hit limits later, or have an
     unneeded memory/stack usage for unused bits.
   any preference?
 - crossing user/kernel boundary requires conversion of the kernel
   bitmaps (unsigned long[]) to something more strict (in my case:
   u32) to accomodate for 32/64 compat. Maybe I should add a
   copy_bitmap_from_user/copy_bitmap_to_user API inside bitmap.h
   instead of defining my own in ethtool.c?
 - I am using a typedef struct (ethtool_link_mode_mask_t) to build and
   hold the new masks. Makes it handy to use in the drivers (see mlx4
   for an example). Not very nice.
 - I foresee bugs where people use the legacy/deprecated SUPPORTED_x
   macros instead of the new ETHTOOL_LINK_MODE_x_BIT enums in the new
   get/set__ksettings callbacks. Not sure how to prevent problems with
   this.

The only driver which was converted for now is mlx4. I am not
considering fcoe as fully converted, but I updated it a minima to be
able to remove __ethtool_get_settings, now known as
__ethtool_get_ksettings.

Tested with legacy and "future" ethtool on 64b x86 kernel and 32+64b
ethtool, and on a 32b x86 kernel + 32b ethtool.

############################################
# Patch Set Summary:

David Decotigny (17):
  net: usnic: remove unused call to ethtool_ops::get_settings
  net: usnic: use __ethtool_get_settings
  net: ethtool: add new ETHTOOL_GSETTINGS/SSETTINGS API
  tx4939: use __ethtool_get_ksettings
  net: usnic: use __ethtool_get_ksettings
  net: bonding: use __ethtool_get_ksettings
  net: ipvlan: use __ethtool_get_ksettings
  net: macvlan: use __ethtool_get_ksettings
  net: team: use __ethtool_get_ksettings
  net: fcoe: use __ethtool_get_ksettings
  net: rdma: use __ethtool_get_ksettings
  net: 8021q: use __ethtool_get_ksettings
  net: bridge: use __ethtool_get_ksettings
  net: core: use __ethtool_get_ksettings
  net: ethtool: remove unused __ethtool_get_settings
  net: mlx4: identify predicate for debug messages
  net: mlx4: use new ETHTOOL_G/SSETTINGS API

 arch/mips/txx9/generic/setup_tx4939.c           |   7 +-
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c    |  10 +-
 drivers/net/bonding/bond_main.c                 |  14 +-
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 328 ++++++++--------
 drivers/net/ethernet/mellanox/mlx4/en_main.c    |   1 +
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |   5 +-
 drivers/net/ipvlan/ipvlan_main.c                |   8 +-
 drivers/net/macvlan.c                           |   8 +-
 drivers/net/team/team.c                         |   8 +-
 drivers/scsi/fcoe/fcoe_transport.c              |  36 +-
 include/linux/ethtool.h                         |  96 ++++-
 include/rdma/ib_addr.h                          |  14 +-
 include/uapi/linux/ethtool.h                    | 319 ++++++++++++----
 net/8021q/vlan_dev.c                            |   8 +-
 net/bridge/br_if.c                              |   6 +-
 net/core/ethtool.c                              | 478 +++++++++++++++++++++++-
 net/core/net-sysfs.c                            |  15 +-
 net/packet/af_packet.c                          |  11 +-
 18 files changed, 1046 insertions(+), 326 deletions(-)

-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply

* [PATCH net-next v2 01/17] net: usnic: remove unused call to ethtool_ops::get_settings
From: David Decotigny @ 2015-02-04 19:53 UTC (permalink / raw)
  To: David S. Miller, Ben Hutchings, Amir Vadai,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-mips-6z/3iImG2C8G8FEW9MqTrA,
	fcoe-devel-s9riP+hp16TNLxjTenLetw
  Cc: Eric Dumazet, Eugenia Emantayev, Or Gerlitz, Ido Shamay,
	Joe Perches, Saeed Mahameed, Govindarajulu Varadarajan,
	Venkata Duvvuru, Jeff Kirsher, Eyal Perry, Pravin B Shelar,
	Ed Swierk, Upinder Malhi, Robert Love, James E.J. Bottomley,
	David Decotigny
In-Reply-To: <1423079621-1374-1-git-send-email-ddecotig-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

From: David Decotigny <decot-Ypc/8FJVVoBWk0Htik3J/w@public.gmane.org>

Signed-off-by: David Decotigny <decot-Ypc/8FJVVoBWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
index 53bd6a2..61337c7c 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
@@ -253,13 +253,11 @@ int usnic_ib_query_device(struct ib_device *ibdev,
 	struct usnic_ib_dev *us_ibdev = to_usdev(ibdev);
 	union ib_gid gid;
 	struct ethtool_drvinfo info;
-	struct ethtool_cmd cmd;
 	int qp_per_vf;
 
 	usnic_dbg("\n");
 	mutex_lock(&us_ibdev->usdev_lock);
 	us_ibdev->netdev->ethtool_ops->get_drvinfo(us_ibdev->netdev, &info);
-	us_ibdev->netdev->ethtool_ops->get_settings(us_ibdev->netdev, &cmd);
 	memset(props, 0, sizeof(*props));
 	usnic_mac_ip_to_gid(us_ibdev->ufdev->mac, us_ibdev->ufdev->inaddr,
 			&gid.raw[0]);
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next v2 02/17] net: usnic: use __ethtool_get_settings
From: David Decotigny @ 2015-02-04 19:53 UTC (permalink / raw)
  To: David S. Miller, Ben Hutchings, Amir Vadai,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-mips-6z/3iImG2C8G8FEW9MqTrA,
	fcoe-devel-s9riP+hp16TNLxjTenLetw
  Cc: Eric Dumazet, Eugenia Emantayev, Or Gerlitz, Ido Shamay,
	Joe Perches, Saeed Mahameed, Govindarajulu Varadarajan,
	Venkata Duvvuru, Jeff Kirsher, Eyal Perry, Pravin B Shelar,
	Ed Swierk, Upinder Malhi, Robert Love, James E.J. Bottomley,
	David Decotigny
In-Reply-To: <1423079621-1374-1-git-send-email-ddecotig-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

From: David Decotigny <decot-Ypc/8FJVVoBWk0Htik3J/w@public.gmane.org>

Signed-off-by: David Decotigny <decot-Ypc/8FJVVoBWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
index 61337c7c..d71ba62 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
@@ -310,7 +310,7 @@ int usnic_ib_query_port(struct ib_device *ibdev, u8 port,
 	usnic_dbg("\n");
 
 	mutex_lock(&us_ibdev->usdev_lock);
-	us_ibdev->netdev->ethtool_ops->get_settings(us_ibdev->netdev, &cmd);
+	__ethtool_get_settings(us_ibdev->netdev, &cmd);
 	memset(props, 0, sizeof(*props));
 
 	props->lid = 0;
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next v2 03/17] net: ethtool: add new ETHTOOL_GSETTINGS/SSETTINGS API
From: David Decotigny @ 2015-02-04 19:53 UTC (permalink / raw)
  To: David S. Miller, Ben Hutchings, Amir Vadai, linux-kernel, netdev,
	linux-api, linux-mips, fcoe-devel
  Cc: Eric Dumazet, Eugenia Emantayev, Or Gerlitz, Ido Shamay,
	Joe Perches, Saeed Mahameed, Govindarajulu Varadarajan,
	Venkata Duvvuru, Jeff Kirsher, Eyal Perry, Pravin B Shelar,
	Ed Swierk, Upinder Malhi, Robert Love, James E.J. Bottomley,
	David Decotigny
In-Reply-To: <1423079621-1374-1-git-send-email-ddecotig@gmail.com>

From: David Decotigny <decot@googlers.com>

This patch defines a new ETHTOOL_GSETTINGS/SSETTINGS API, handled by
the new get_ksettings/set_ksettings callbacks. This API provides
support for most legacy ethtool_cmd fields, adds support for larger
link mode masks (up to 4064 bits, variable length), and removes
ethtool_cmd deprecated fields (transceiver/maxrxpkt/maxtxpkt).

This API is deprecating the legacy ETHTOOL_GSET/SSET API and provides
the following backward compatibility properties:
 - legacy ethtool with legacy drivers: no change, still using the
   get_settings/set_settings callbacks.
 - legacy ethtool with new get/set_ksettings drivers: the new driver
   callbacks are used, data internally converted to legacy
   ethtool_cmd. ETHTOOL_GSET will return only the 1st 32b of each link
   mode mask. ETHTOOL_SSET will fail if user tries to set the
   ethtool_cmd deprecated fields to non-0
   (transceiver/maxrxpkt/maxtxpkt). A kernel warning is printed if
   driver exports higher bits or if user request changes in deprecated
   fields mentioned earlier.
 - future ethtool with legacy drivers: no change, still using the
   get_settings/set_settings callbacks, internally converted to new
   data structure. Note that that "future" ethtool tool will not allow
   changes to deprecated fields (transceiver/maxrxpkt/maxtxpkt), as
   they cannot be expressed for the kernel.
 - future ethtool with new drivers: direct call to the new callbacks.

By "future" ethtool, what is meant is:
 - query: first try ETHTOOL_GSETTINGS, and revert to ETHTOOL_GSET if fails
 - set: query first and remember which of ETHTOOL_GSETTINGS or
   ETHTOOL_GSET was successful
   - if ETHTOOL_GSETTINGS was successful, then change config with
     ETHTOOL_SSETTINGS. A failure there is final (do not try ETHTOOL_SSET).
   - otherwise ETHTOOL_GSET was successful, change config with ETHTOOL_SSET.
     A failure there is final (do not try ETHTOOL_SSETTINGS).

The interaction user/kernel via the new API requires a small
ETHTOOL_GSETTINGS handshake first to agree on the length of the link
mode bitmaps. If kernel doesn't agree with user, it returns the bitmap
length it is expecting from user as a negative length (and cmd field
is 0). When kernel and user agree, kernel returns valid info in all
fields (ie. link mode length > 0 and cmd is ETHTOOL_GSETTINGS).

Data structure crossing user/kernel boundary is 32/64-bit
agnostic. Converted internally to a legal kernel bitmap.

The internal __ethtool_get_settings kernel helper will gradually be
replaced by __ethtool_get_ksettings by the time the first ksettings
drivers start to appear. So this patch doesn't change it, it will be
removed before it needs to be changed.

Signed-off-by: David Decotigny <decot@googlers.com>
---
 include/linux/ethtool.h      | 100 ++++++++-
 include/uapi/linux/ethtool.h | 319 ++++++++++++++++++++++------
 net/core/ethtool.c           | 489 ++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 828 insertions(+), 80 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 653dc9c..49881b6 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -12,6 +12,7 @@
 #ifndef _LINUX_ETHTOOL_H
 #define _LINUX_ETHTOOL_H
 
+#include <linux/bitmap.h>
 #include <linux/compat.h>
 #include <uapi/linux/ethtool.h>
 
@@ -40,9 +41,6 @@ struct compat_ethtool_rxnfc {
 
 #include <linux/rculist.h>
 
-extern int __ethtool_get_settings(struct net_device *dev,
-				  struct ethtool_cmd *cmd);
-
 /**
  * enum ethtool_phys_id_state - indicator state for physical identification
  * @ETHTOOL_ID_INACTIVE: Physical ID indicator should be deactivated
@@ -97,13 +95,84 @@ static inline u32 ethtool_rxfh_indir_default(u32 index, u32 n_rx_rings)
 	return index % n_rx_rings;
 }
 
+/* number of link mode bits handled internally by kernel */
+#define __ETHTOOL_LINK_MODE_MASK_NBITS (__ETHTOOL_LINK_MODE_LAST+1)
+
+#define __ETHTOOL_LINK_MODE_IS_VALID_BIT(indice)	\
+	((indice) >= 0 && (indice) < __ETHTOOL_LINK_MODE_MASK_NBITS)
+
+typedef struct {
+	unsigned long mask[BITS_TO_LONGS(__ETHTOOL_LINK_MODE_MASK_NBITS)];
+} ethtool_link_mode_mask_t;
+
+/* drivers must ignore parent.cmd and parent.link_mode_masks_nwords
+ * fields, but they are allowed to overwrite them (will be ignored) */
+struct ethtool_ksettings {
+	struct ethtool_settings parent;
+	struct {
+		ethtool_link_mode_mask_t supported;
+		ethtool_link_mode_mask_t advertising;
+		ethtool_link_mode_mask_t lp_advertising;
+	} link_modes;
+};
+
+/* helper function for ethtool_build_link_mode and ethtool_add_link_modes */
+static inline int
+__ethtool_add_link_modes(ethtool_link_mode_mask_t *dst,
+			 unsigned nindices,
+			 const enum ethtool_link_mode_bit_indices *indices) {
+	unsigned i;
+	int rv = 0;
+
+	for (i = 0 ; i < nindices ; ++i) {
+		if (__ETHTOOL_LINK_MODE_IS_VALID_BIT(indices[i]))
+			set_bit(indices[i], dst->mask);
+		else
+			rv = -1;
+	}
+	return rv;
+}
+
+/* build link mode mask from variadic list of bit indices, return 0
+ * when all indices are valid, -1 otherwise
+ */
+#define ethtool_build_link_mode(dst, ...)				\
+	({								\
+		const enum ethtool_link_mode_bit_indices indices[] = {	\
+			__VA_ARGS__ };					\
+		bitmap_zero((dst)->mask,				\
+			    __ETHTOOL_LINK_MODE_MASK_NBITS);		\
+		__ethtool_add_link_modes((dst),				\
+					 ARRAY_SIZE(indices), indices); \
+	})
+
+/* update link mode mask by setting variadic list of bit indices,
+ * return 0 when all indices are valid, -1 otherwise
+ */
+#define ethtool_add_link_modes(dst, ...)				\
+	({								\
+		const enum ethtool_link_mode_bit_indices indices[] = {	\
+			__VA_ARGS__ };					\
+		__ethtool_add_link_modes((dst),				\
+					 ARRAY_SIZE(indices), indices); \
+	})
+
+extern int __ethtool_get_ksettings(struct net_device *dev,
+				   struct ethtool_ksettings *ksettings);
+
+/* DEPRECATED, use __ethtool_get_ksettings */
+extern int __ethtool_get_settings(struct net_device *dev,
+				  struct ethtool_cmd *cmd);
+
 /**
  * struct ethtool_ops - optional netdev operations
- * @get_settings: Get various device settings including Ethernet link
+ * @get_settings: DEPRECATED, use %get_ksettings/%set_ksettings
+ *	API. Get various device settings including Ethernet link
  *	settings. The @cmd parameter is expected to have been cleared
- *	before get_settings is called. Returns a negative error code or
- *	zero.
- * @set_settings: Set various device settings including Ethernet link
+ *	before get_settings is called. Returns a negative error code
+ *	or zero.
+ * @set_settings: DEPRECATED, use %get_ksettings/%set_ksettings
+ *	API. Set various device settings including Ethernet link
  *	settings.  Returns a negative error code or zero.
  * @get_drvinfo: Report driver/device information.  Should only set the
  *	@driver, @version, @fw_version and @bus_info fields.  If not
@@ -201,6 +270,18 @@ static inline u32 ethtool_rxfh_indir_default(u32 index, u32 n_rx_rings)
  * @get_module_eeprom: Get the eeprom information from the plug-in module
  * @get_eee: Get Energy-Efficient (EEE) supported and status.
  * @set_eee: Set EEE status (enable/disable) as well as LPI timers.
+ * @get_ksettings: When defined, takes precedence over the
+ *	%get_settings method. Get various device settings including Ethernet
+ *	link settings. The %cmd and %link_mode_masks_nwords fields should be
+ *	ignored (use %__ETHTOOL_LINK_MODE_MASK_NBITS instead of the latter),
+ *	any change to them will be overwritten by kernel. Returns a negative
+ *	error code or zero.
+ * @set_ksettings: When defined, takes precedence over the
+ *	%set_settings method. Set various device settings including
+ *	Ethernet link settings. The %cmd and %link_mode_masks_nwords fields
+ *	should be ignored (use %__ETHTOOL_LINK_MODE_MASK_NBITS instead of
+ *	the latter), any change to them will be overwritten by
+ *	kernel. Returns a negative error code or zero.
  *
  * All operations are optional (i.e. the function pointer may be set
  * to %NULL) and callers must take this into account.  Callers must
@@ -280,6 +361,9 @@ struct ethtool_ops {
 	int	(*set_tunable)(struct net_device *,
 			       const struct ethtool_tunable *, const void *);
 
-
+	int	(*get_ksettings)(struct net_device *,
+				 struct ethtool_ksettings *);
+	int	(*set_ksettings)(struct net_device *,
+				 const struct ethtool_ksettings *);
 };
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 5f66d9c..c85cc2f 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -21,7 +21,8 @@
  */
 
 /**
- * struct ethtool_cmd - link control and status
+ * struct ethtool_cmd - DEPRECATED, link control and status
+ * This structure is DEPRECATED, please use struct ethtool_settings.
  * @cmd: Command number = %ETHTOOL_GSET or %ETHTOOL_SSET
  * @supported: Bitmask of %SUPPORTED_* flags for the link modes,
  *	physical connectors and other link features for which the
@@ -1108,8 +1109,10 @@ enum ethtool_sfeatures_retval_bits {
 
 
 /* CMDs currently supported */
-#define ETHTOOL_GSET		0x00000001 /* Get settings. */
-#define ETHTOOL_SSET		0x00000002 /* Set settings. */
+#define ETHTOOL_GSET		0x00000001 /* DEPRECATED, Get settings.
+					      Please use ETHTOOL_GSETTINGS */
+#define ETHTOOL_SSET		0x00000002 /* DEPRECATED, Set settings.
+					      Please use ETHTOOL_SSETTINGS */
 #define ETHTOOL_GDRVINFO	0x00000003 /* Get driver info. */
 #define ETHTOOL_GREGS		0x00000004 /* Get NIC registers. */
 #define ETHTOOL_GWOL		0x00000005 /* Get wake-on-lan options. */
@@ -1188,73 +1191,137 @@ enum ethtool_sfeatures_retval_bits {
 #define ETHTOOL_GTUNABLE	0x00000048 /* Get tunable configuration */
 #define ETHTOOL_STUNABLE	0x00000049 /* Set tunable configuration */
 
+#define ETHTOOL_GSETTINGS	0x0000004a /* Get ethtool_settings */
+#define ETHTOOL_SSETTINGS	0x0000004b /* Set ethtool_settings */
+
+
 /* compatibility with older code */
 #define SPARC_ETH_GSET		ETHTOOL_GSET
 #define SPARC_ETH_SSET		ETHTOOL_SSET
 
-#define SUPPORTED_10baseT_Half		(1 << 0)
-#define SUPPORTED_10baseT_Full		(1 << 1)
-#define SUPPORTED_100baseT_Half		(1 << 2)
-#define SUPPORTED_100baseT_Full		(1 << 3)
-#define SUPPORTED_1000baseT_Half	(1 << 4)
-#define SUPPORTED_1000baseT_Full	(1 << 5)
-#define SUPPORTED_Autoneg		(1 << 6)
-#define SUPPORTED_TP			(1 << 7)
-#define SUPPORTED_AUI			(1 << 8)
-#define SUPPORTED_MII			(1 << 9)
-#define SUPPORTED_FIBRE			(1 << 10)
-#define SUPPORTED_BNC			(1 << 11)
-#define SUPPORTED_10000baseT_Full	(1 << 12)
-#define SUPPORTED_Pause			(1 << 13)
-#define SUPPORTED_Asym_Pause		(1 << 14)
-#define SUPPORTED_2500baseX_Full	(1 << 15)
-#define SUPPORTED_Backplane		(1 << 16)
-#define SUPPORTED_1000baseKX_Full	(1 << 17)
-#define SUPPORTED_10000baseKX4_Full	(1 << 18)
-#define SUPPORTED_10000baseKR_Full	(1 << 19)
-#define SUPPORTED_10000baseR_FEC	(1 << 20)
-#define SUPPORTED_20000baseMLD2_Full	(1 << 21)
-#define SUPPORTED_20000baseKR2_Full	(1 << 22)
-#define SUPPORTED_40000baseKR4_Full	(1 << 23)
-#define SUPPORTED_40000baseCR4_Full	(1 << 24)
-#define SUPPORTED_40000baseSR4_Full	(1 << 25)
-#define SUPPORTED_40000baseLR4_Full	(1 << 26)
-#define SUPPORTED_56000baseKR4_Full	(1 << 27)
-#define SUPPORTED_56000baseCR4_Full	(1 << 28)
-#define SUPPORTED_56000baseSR4_Full	(1 << 29)
-#define SUPPORTED_56000baseLR4_Full	(1 << 30)
-
-#define ADVERTISED_10baseT_Half		(1 << 0)
-#define ADVERTISED_10baseT_Full		(1 << 1)
-#define ADVERTISED_100baseT_Half	(1 << 2)
-#define ADVERTISED_100baseT_Full	(1 << 3)
-#define ADVERTISED_1000baseT_Half	(1 << 4)
-#define ADVERTISED_1000baseT_Full	(1 << 5)
-#define ADVERTISED_Autoneg		(1 << 6)
-#define ADVERTISED_TP			(1 << 7)
-#define ADVERTISED_AUI			(1 << 8)
-#define ADVERTISED_MII			(1 << 9)
-#define ADVERTISED_FIBRE		(1 << 10)
-#define ADVERTISED_BNC			(1 << 11)
-#define ADVERTISED_10000baseT_Full	(1 << 12)
-#define ADVERTISED_Pause		(1 << 13)
-#define ADVERTISED_Asym_Pause		(1 << 14)
-#define ADVERTISED_2500baseX_Full	(1 << 15)
-#define ADVERTISED_Backplane		(1 << 16)
-#define ADVERTISED_1000baseKX_Full	(1 << 17)
-#define ADVERTISED_10000baseKX4_Full	(1 << 18)
-#define ADVERTISED_10000baseKR_Full	(1 << 19)
-#define ADVERTISED_10000baseR_FEC	(1 << 20)
-#define ADVERTISED_20000baseMLD2_Full	(1 << 21)
-#define ADVERTISED_20000baseKR2_Full	(1 << 22)
-#define ADVERTISED_40000baseKR4_Full	(1 << 23)
-#define ADVERTISED_40000baseCR4_Full	(1 << 24)
-#define ADVERTISED_40000baseSR4_Full	(1 << 25)
-#define ADVERTISED_40000baseLR4_Full	(1 << 26)
-#define ADVERTISED_56000baseKR4_Full	(1 << 27)
-#define ADVERTISED_56000baseCR4_Full	(1 << 28)
-#define ADVERTISED_56000baseSR4_Full	(1 << 29)
-#define ADVERTISED_56000baseLR4_Full	(1 << 30)
+/* Link mode bit indices */
+enum ethtool_link_mode_bit_indices {
+	ETHTOOL_LINK_MODE_10baseT_Half_BIT	= 0,
+	ETHTOOL_LINK_MODE_10baseT_Full_BIT	= 1,
+	ETHTOOL_LINK_MODE_100baseT_Half_BIT	= 2,
+	ETHTOOL_LINK_MODE_100baseT_Full_BIT	= 3,
+	ETHTOOL_LINK_MODE_1000baseT_Half_BIT	= 4,
+	ETHTOOL_LINK_MODE_1000baseT_Full_BIT	= 5,
+	ETHTOOL_LINK_MODE_Autoneg_BIT		= 6,
+	ETHTOOL_LINK_MODE_TP_BIT		= 7,
+	ETHTOOL_LINK_MODE_AUI_BIT		= 8,
+	ETHTOOL_LINK_MODE_MII_BIT		= 9,
+	ETHTOOL_LINK_MODE_FIBRE_BIT		= 10,
+	ETHTOOL_LINK_MODE_BNC_BIT		= 11,
+	ETHTOOL_LINK_MODE_10000baseT_Full_BIT	= 12,
+	ETHTOOL_LINK_MODE_Pause_BIT		= 13,
+	ETHTOOL_LINK_MODE_Asym_Pause_BIT	= 14,
+	ETHTOOL_LINK_MODE_2500baseX_Full_BIT	= 15,
+	ETHTOOL_LINK_MODE_Backplane_BIT		= 16,
+	ETHTOOL_LINK_MODE_1000baseKX_Full_BIT	= 17,
+	ETHTOOL_LINK_MODE_10000baseKX4_Full_BIT	= 18,
+	ETHTOOL_LINK_MODE_10000baseKR_Full_BIT	= 19,
+	ETHTOOL_LINK_MODE_10000baseR_FEC_BIT	= 20,
+	ETHTOOL_LINK_MODE_20000baseMLD2_Full_BIT = 21,
+	ETHTOOL_LINK_MODE_20000baseKR2_Full_BIT	= 22,
+	ETHTOOL_LINK_MODE_40000baseKR4_Full_BIT	= 23,
+	ETHTOOL_LINK_MODE_40000baseCR4_Full_BIT	= 24,
+	ETHTOOL_LINK_MODE_40000baseSR4_Full_BIT	= 25,
+	ETHTOOL_LINK_MODE_40000baseLR4_Full_BIT	= 26,
+	ETHTOOL_LINK_MODE_56000baseKR4_Full_BIT	= 27,
+	ETHTOOL_LINK_MODE_56000baseCR4_Full_BIT	= 28,
+	ETHTOOL_LINK_MODE_56000baseSR4_Full_BIT	= 29,
+	ETHTOOL_LINK_MODE_56000baseLR4_Full_BIT	= 30,
+
+	/* Last allowed bit for __ETHTOOL_LINK_MODE_LEGACY_MASK is bit
+	   31. Please do NOT define any SUPPORTED_* or ADVERTISED_*
+	   macro for bits > 31. The only way to use indices > 31 is to
+	   use the new ETHTOOL_GSETTINGS/ETHTOOL_SSETTINGS API */
+
+	__ETHTOOL_LINK_MODE_LAST
+	  = ETHTOOL_LINK_MODE_56000baseLR4_Full_BIT,
+};
+
+#define __ETHTOOL_LINK_MODE_LEGACY_MASK(base_name)	\
+	(1UL << (ETHTOOL_LINK_MODE_ ## base_name ## _BIT))
+
+/* DEPRECATED macros. Please migrate to
+   ETHTOOL_GSETTINGS/ETHTOOL_SSETTINGS API. Please do NOT define any new
+   SUPPORTED_* macro for bits > 31.  */
+#define SUPPORTED_10baseT_Half		__ETHTOOL_LINK_MODE_LEGACY_MASK(10baseT_Half)
+#define SUPPORTED_10baseT_Full		__ETHTOOL_LINK_MODE_LEGACY_MASK(10baseT_Full)
+#define SUPPORTED_100baseT_Half		__ETHTOOL_LINK_MODE_LEGACY_MASK(100baseT_Half)
+#define SUPPORTED_100baseT_Full		__ETHTOOL_LINK_MODE_LEGACY_MASK(100baseT_Full)
+#define SUPPORTED_1000baseT_Half	__ETHTOOL_LINK_MODE_LEGACY_MASK(1000baseT_Half)
+#define SUPPORTED_1000baseT_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(1000baseT_Full)
+#define SUPPORTED_Autoneg		__ETHTOOL_LINK_MODE_LEGACY_MASK(Autoneg)
+#define SUPPORTED_TP			__ETHTOOL_LINK_MODE_LEGACY_MASK(TP)
+#define SUPPORTED_AUI			__ETHTOOL_LINK_MODE_LEGACY_MASK(AUI)
+#define SUPPORTED_MII			__ETHTOOL_LINK_MODE_LEGACY_MASK(MII)
+#define SUPPORTED_FIBRE			__ETHTOOL_LINK_MODE_LEGACY_MASK(FIBRE)
+#define SUPPORTED_BNC			__ETHTOOL_LINK_MODE_LEGACY_MASK(BNC)
+#define SUPPORTED_10000baseT_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(10000baseT_Full)
+#define SUPPORTED_Pause			__ETHTOOL_LINK_MODE_LEGACY_MASK(Pause)
+#define SUPPORTED_Asym_Pause		__ETHTOOL_LINK_MODE_LEGACY_MASK(Asym_Pause)
+#define SUPPORTED_2500baseX_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(2500baseX_Full)
+#define SUPPORTED_Backplane		__ETHTOOL_LINK_MODE_LEGACY_MASK(Backplane)
+#define SUPPORTED_1000baseKX_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(1000baseKX_Full)
+#define SUPPORTED_10000baseKX4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(10000baseKX4_Full)
+#define SUPPORTED_10000baseKR_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(10000baseKR_Full)
+#define SUPPORTED_10000baseR_FEC	__ETHTOOL_LINK_MODE_LEGACY_MASK(10000baseR_FEC)
+#define SUPPORTED_20000baseMLD2_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(20000baseMLD2_Full)
+#define SUPPORTED_20000baseKR2_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(20000baseKR2_Full)
+#define SUPPORTED_40000baseKR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(40000baseKR4_Full)
+#define SUPPORTED_40000baseCR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(40000baseCR4_Full)
+#define SUPPORTED_40000baseSR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(40000baseSR4_Full)
+#define SUPPORTED_40000baseLR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(40000baseLR4_Full)
+#define SUPPORTED_56000baseKR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(56000baseKR4_Full)
+#define SUPPORTED_56000baseCR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(56000baseCR4_Full)
+#define SUPPORTED_56000baseSR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(56000baseSR4_Full)
+#define SUPPORTED_56000baseLR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(56000baseLR4_Full)
+/* Please do not define any new SUPPORTED_* macro for bits > 31, see
+ * notice above.
+ */
+
+/*
+ * DEPRECATED macros. Please migrate to
+ * ETHTOOL_GSETTINGS/ETHTOOL_SSETTINGS API. Please do NOT define any new
+ * ADERTISE_* macro for bits > 31.
+ */
+#define ADVERTISED_10baseT_Half		__ETHTOOL_LINK_MODE_LEGACY_MASK(10baseT_Half)
+#define ADVERTISED_10baseT_Full		__ETHTOOL_LINK_MODE_LEGACY_MASK(10baseT_Full)
+#define ADVERTISED_100baseT_Half	__ETHTOOL_LINK_MODE_LEGACY_MASK(100baseT_Half)
+#define ADVERTISED_100baseT_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(100baseT_Full)
+#define ADVERTISED_1000baseT_Half	__ETHTOOL_LINK_MODE_LEGACY_MASK(1000baseT_Half)
+#define ADVERTISED_1000baseT_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(1000baseT_Full)
+#define ADVERTISED_Autoneg		__ETHTOOL_LINK_MODE_LEGACY_MASK(Autoneg)
+#define ADVERTISED_TP			__ETHTOOL_LINK_MODE_LEGACY_MASK(TP)
+#define ADVERTISED_AUI			__ETHTOOL_LINK_MODE_LEGACY_MASK(AUI)
+#define ADVERTISED_MII			__ETHTOOL_LINK_MODE_LEGACY_MASK(MII)
+#define ADVERTISED_FIBRE		__ETHTOOL_LINK_MODE_LEGACY_MASK(FIBRE)
+#define ADVERTISED_BNC			__ETHTOOL_LINK_MODE_LEGACY_MASK(BNC)
+#define ADVERTISED_10000baseT_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(10000baseT_Full)
+#define ADVERTISED_Pause		__ETHTOOL_LINK_MODE_LEGACY_MASK(Pause)
+#define ADVERTISED_Asym_Pause		__ETHTOOL_LINK_MODE_LEGACY_MASK(Asym_Pause)
+#define ADVERTISED_2500baseX_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(2500baseX_Full)
+#define ADVERTISED_Backplane		__ETHTOOL_LINK_MODE_LEGACY_MASK(Backplane)
+#define ADVERTISED_1000baseKX_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(1000baseKX_Full)
+#define ADVERTISED_10000baseKX4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(10000baseKX4_Full)
+#define ADVERTISED_10000baseKR_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(10000baseKR_Full)
+#define ADVERTISED_10000baseR_FEC	__ETHTOOL_LINK_MODE_LEGACY_MASK(10000baseR_FEC)
+#define ADVERTISED_20000baseMLD2_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(20000baseMLD2_Full)
+#define ADVERTISED_20000baseKR2_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(20000baseKR2_Full)
+#define ADVERTISED_40000baseKR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(40000baseKR4_Full)
+#define ADVERTISED_40000baseCR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(40000baseCR4_Full)
+#define ADVERTISED_40000baseSR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(40000baseSR4_Full)
+#define ADVERTISED_40000baseLR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(40000baseLR4_Full)
+#define ADVERTISED_56000baseKR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(56000baseKR4_Full)
+#define ADVERTISED_56000baseCR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(56000baseCR4_Full)
+#define ADVERTISED_56000baseSR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(56000baseSR4_Full)
+#define ADVERTISED_56000baseLR4_Full	__ETHTOOL_LINK_MODE_LEGACY_MASK(56000baseLR4_Full)
+/* Please do not define any new ADVERTISED_* macro for bits > 31, see
+ * notice above.
+ */
 
 /* The following are all involved in forcing a particular link
  * mode for the device for setting things.  When getting the
@@ -1396,4 +1463,124 @@ enum ethtool_reset_flags {
 };
 #define ETH_RESET_SHARED_SHIFT	16
 
+
+/**
+ * struct ethtool_settings - link control and status
+ * This structure deprecates struct %ethtool_cmd.
+ *
+ * IMPORTANT, Backward compatibility notice: When implementing new
+ *	user-space tools, please first try %ETHTOOL_GSETTINGS, and if
+ *	it succeeds use %ETHTOOL_SSETTINGS to change settings; do not
+ *	use %ETHTOOL_SSET if %ETHTOOL_GSETTINGS succeeded: stick to
+ *	%ETHTOOL_GSETTINGS/%SSETTINGS in that case.  Conversely, if
+ *	%ETHTOOL_GSETTINGS fails, use %ETHTOOL_GSET to query and
+ *	%ETHTOOL_SSET to change settings; do not use
+ *	%ETHTOOL_SSETTINGS if %ETHTOOL_GSETTINGS failed: stick to
+ *	%ETHTOOL_GSET/%ETHTOOL_SSET in that case.
+ *
+ * @cmd: Command number = %ETHTOOL_GSETTINGS or %ETHTOOL_SSETTINGS
+ * @speed: Link speed (Mbps)
+ * @duplex: Duplex mode; one of %DUPLEX_*
+ * @port: Physical connector type; one of %PORT_*
+ * @phy_address: MDIO address of PHY (transceiver); 0 or 255 if not
+ *	applicable.  For clause 45 PHYs this is the PRTAD.
+ * @autoneg: Enable/disable autonegotiation and auto-detection;
+ *	either %AUTONEG_DISABLE or %AUTONEG_ENABLE
+ * @mdio_support: Bitmask of %ETH_MDIO_SUPPORTS_* flags for the MDIO
+ *	protocols supported by the interface; 0 if unknown.
+ *	Read-only.
+ * @eth_tp_mdix: Ethernet twisted-pair MDI(-X) status; one of
+ *	%ETH_TP_MDI_*.  If the status is unknown or not applicable, the
+ *	value will be %ETH_TP_MDI_INVALID.  Read-only.
+ * @eth_tp_mdix_ctrl: Ethernet twisted pair MDI(-X) control; one of
+ *	%ETH_TP_MDI_*.  If MDI(-X) control is not implemented, reads
+ *	yield %ETH_TP_MDI_INVALID and writes may be ignored or rejected.
+ *	When written successfully, the link should be renegotiated if
+ *	necessary.
+ * @link_mode_masks_nwords: Number of 32-bit words for each of the
+ *	supported, advertising, lp_advertising link mode bitmaps. For
+ *	%ETHTOOL_GSETTINGS: on entry, number of words passed by user
+ *	(>= 0); on return, if handshake in progress, negative if
+ *	request size unsupported by kernel: absolute value indicates
+ *	kernel recommended size and cmd field is 0, as well as all the
+ *	other fields; otherwise (handshake completed), strictly
+ *	positive to indicate size used by kernel and cmd field is
+ *	%ETHTOOL_GSETTINGS, all other fields populated by driver. For
+ *	%ETHTOOL_SSETTINGS: must be valid on entry, ie. a positive
+ *	value returned previously by %ETHTOOL_GSETTINGS, otherwise
+ *	refused. For drivers: ignore this field (use kernel's
+ *	__ETHTOOL_LINK_MODE_MASK_NBITS instead), any change to it will
+ *	be overwritten by kernel.
+ * @supported: Bitmap with each bit meaning given by
+ *	%ethtool_link_mode_bit_indices for the link modes, physical
+ *	connectors and other link features for which the interface
+ *	supports autonegotiation or auto-detection.  Read-only.
+ * @advertising: Bitmap with each bit meaning given by
+ *	%ethtool_link_mode_bit_indices for the link modes, physical
+ *	connectors and other link features that are advertised through
+ *	autonegotiation or enabled for auto-detection.
+ * @lp_advertising: Bitmap with each bit meaning given by
+ *	%ethtool_link_mode_bit_indices for the link modes, and other
+ *	link features that the link partner advertised through
+ *	autonegotiation; 0 if unknown or not applicable.  Read-only.
+ *
+ * If autonegotiation is disabled, the speed and @duplex represent the
+ * fixed link mode and are writable if the driver supports multiple
+ * link modes.  If it is enabled then they are read-only; if the link
+ * is up they represent the negotiated link mode; if the link is down,
+ * the speed is 0, %SPEED_UNKNOWN or the highest enabled speed and
+ * @duplex is %DUPLEX_UNKNOWN or the best enabled duplex mode.
+ *
+ * Some hardware interfaces may have multiple PHYs and/or physical
+ * connectors fitted or do not allow the driver to detect which are
+ * fitted.  For these interfaces @port and/or @phy_address may be
+ * writable, possibly dependent on @autoneg being %AUTONEG_DISABLE.
+ * Otherwise, attempts to write different values may be ignored or
+ * rejected.
+ *
+ * Deprecated %ethtool_cmd fields transceiver, maxtxpkt and maxrxpkt
+ * are not available in %ethtool_settings. Until all drivers are
+ * converted to ignore them or to the new %ethtool_settings API, for
+ * both queries and changes, users should always try
+ * %ETHTOOL_GSETTINGS first, and if it fails with -ENOTSUPP stick only
+ * to %ETHTOOL_GSET and %ETHTOOL_SSET consistently. If it succeeds,
+ * then users should stick to %ETHTOOL_GSETTINGS and
+ * %ETHTOOL_SSETTINGS (which would support drivers implementing either
+ * %ethtool_cmd or %ethtool_settings).
+ *
+ * Users should assume that all fields not marked read-only are
+ * writable and subject to validation by the driver.  They should use
+ * %ETHTOOL_GSETTINGS to get the current values before making specific
+ * changes and then applying them with %ETHTOOL_SSETTINGS.
+ *
+ * Drivers that implement %get_ksettings and/or %set_ksettings should
+ * ignore the @cmd and @link_mode_masks_nwords fields (any change to
+ * them overwritten by kernel), and rely only on kernel's internal
+ * %__ETHTOOL_LINK_MODE_MASK_NBITS and
+ * %ethtool_link_mode_mask_t. Drivers that implement %set_ksettings()
+ * should validate all fields other than @cmd
+ * and @link_mode_masks_nwords that are not described as read-only or
+ * deprecated, and must ignore all fields described as read-only.
+ *
+ * Deprecated fields should be ignored by both users and drivers.
+ */
+struct ethtool_settings {
+	__u32	cmd;
+	__u32	speed;
+	__u8	duplex;
+	__u8	port;
+	__u8	phy_address;
+	__u8	autoneg;
+	__u8	mdio_support;
+	__u8	eth_tp_mdix;
+	__u8	eth_tp_mdix_ctrl;
+	__s8	link_mode_masks_nwords;
+	__u32	reserved[8];
+	__u32	link_mode_masks[0];
+	/* layout of link_mode_masks fields:
+	 * __u32 map_supported[link_mode_masks_nwords];
+	 * __u32 map_advertising[link_mode_masks_nwords];
+	 * __u32 map_lp_advertising[link_mode_masks_nwords];
+	 */
+};
 #endif /* _UAPI_LINUX_ETHTOOL_H */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 91f74f3..11cb1c9 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -340,6 +340,388 @@ static int __ethtool_set_flags(struct net_device *dev, u32 data)
 	return 0;
 }
 
+/* TODO: remove when %ETHTOOL_GSET/%ETHTOOL_SSET disappear */
+static void convert_legacy_u32_to_link_mode(ethtool_link_mode_mask_t *dst,
+					    u32 legacy_u32)
+{
+	bitmap_zero(dst->mask, __ETHTOOL_LINK_MODE_MASK_NBITS);
+	dst->mask[0] = legacy_u32;
+}
+
+/* return false if src had higher bits set. lower bits always updated.
+ * TODO: remove when %ETHTOOL_GSET/%ETHTOOL_SSET disappear
+ */
+static bool convert_link_mode_to_legacy_u32(u32 *legacy_u32,
+					    const ethtool_link_mode_mask_t *src)
+{
+	bool retval = true;
+
+	/* TODO: following test will soon always be true */
+	if (__ETHTOOL_LINK_MODE_MASK_NBITS > 32) {
+		ethtool_link_mode_mask_t ext;
+
+		bitmap_zero(ext.mask, __ETHTOOL_LINK_MODE_MASK_NBITS);
+		bitmap_fill(ext.mask, 32);
+		bitmap_complement(ext.mask, ext.mask,
+				  __ETHTOOL_LINK_MODE_MASK_NBITS);
+		if (bitmap_intersects(ext.mask, src->mask,
+				      __ETHTOOL_LINK_MODE_MASK_NBITS)) {
+			/* src mask goes beyond bit 31 */
+			retval = false;
+		}
+	}
+	*legacy_u32 = src->mask[0];
+	return retval;
+}
+
+/* return false if legacy contained non-0 deprecated fields
+ * transceiver/maxtxpkt/maxrxpkt. rest of ksettings always updated
+ * TODO: remove when %ETHTOOL_GSET/%ETHTOOL_SSET disappear
+ */
+static bool
+convert_legacy_settings_to_ksettings(struct ethtool_ksettings *ksettings,
+				     const struct ethtool_cmd *legacy_settings)
+{
+	bool retval = true;
+
+	memset(ksettings, 0, sizeof(*ksettings));
+
+	/* This is used to tell users that driver is still using these
+	 * deprecated legacy fields, and they should not use
+	 * %ETHTOOL_GSETTINGS/%ETHTOOL_SSETTINGS
+	 */
+	if (legacy_settings->transceiver ||
+	    legacy_settings->maxtxpkt ||
+	    legacy_settings->maxrxpkt)
+		retval = false;
+
+	convert_legacy_u32_to_link_mode(&ksettings->link_modes.supported,
+					legacy_settings->supported);
+	convert_legacy_u32_to_link_mode(&ksettings->link_modes.advertising,
+					legacy_settings->advertising);
+	convert_legacy_u32_to_link_mode(&ksettings->link_modes.lp_advertising,
+					legacy_settings->lp_advertising);
+	ksettings->parent.speed = ethtool_cmd_speed(legacy_settings);
+	ksettings->parent.duplex = legacy_settings->duplex;
+	ksettings->parent.port = legacy_settings->port;
+	ksettings->parent.phy_address = legacy_settings->phy_address;
+	ksettings->parent.autoneg = legacy_settings->autoneg;
+	ksettings->parent.mdio_support = legacy_settings->mdio_support;
+	ksettings->parent.eth_tp_mdix = legacy_settings->eth_tp_mdix;
+	ksettings->parent.eth_tp_mdix_ctrl = legacy_settings->eth_tp_mdix_ctrl;
+	return retval;
+}
+
+/* return false if ksettings link modes had higher bits
+ * set. legacy_settings always updated (best effort)
+ * TODO: remove when %ETHTOOL_GSET/%ETHTOOL_SSET disappear
+ */
+static bool
+convert_ksettings_to_legacy_settings(struct ethtool_cmd *legacy_settings,
+				     const struct ethtool_ksettings *ksettings)
+{
+	bool retval = true;
+
+	memset(legacy_settings, 0, sizeof(*legacy_settings));
+	/* this also clears the deprecated fields in legacy structure:
+	 * __u8		transceiver;
+	 * __u32	maxtxpkt;
+	 * __u32	maxrxpkt;
+	 */
+
+	retval &= convert_link_mode_to_legacy_u32(
+		&legacy_settings->supported,
+		&ksettings->link_modes.supported);
+	retval &= convert_link_mode_to_legacy_u32(
+		&legacy_settings->advertising,
+		&ksettings->link_modes.advertising);
+	retval &= convert_link_mode_to_legacy_u32(
+		&legacy_settings->lp_advertising,
+		&ksettings->link_modes.lp_advertising);
+	ethtool_cmd_speed_set(legacy_settings, ksettings->parent.speed);
+	legacy_settings->duplex = ksettings->parent.duplex;
+	legacy_settings->port = ksettings->parent.port;
+	legacy_settings->phy_address = ksettings->parent.phy_address;
+	legacy_settings->autoneg = ksettings->parent.autoneg;
+	legacy_settings->mdio_support = ksettings->parent.mdio_support;
+	legacy_settings->eth_tp_mdix = ksettings->parent.eth_tp_mdix;
+	legacy_settings->eth_tp_mdix_ctrl = ksettings->parent.eth_tp_mdix_ctrl;
+	return retval;
+}
+
+/* number of 32-bit words to store the user's link mode bitmaps */
+#define __ETHTOOL_LINK_MODE_MASK_NU32			\
+	((__ETHTOOL_LINK_MODE_MASK_NBITS + 31) / 32)
+
+/* layout of the struct passed from/to userland */
+struct ethtool_usettings {
+	struct ethtool_settings parent;
+	struct {
+		__u32 supported[__ETHTOOL_LINK_MODE_MASK_NU32] __aligned(4);
+		__u32 advertising[__ETHTOOL_LINK_MODE_MASK_NU32] __aligned(4);
+		__u32 lp_advertising[
+			__ETHTOOL_LINK_MODE_MASK_NU32]__aligned(4);
+	} link_modes;
+};
+
+/* Internal kernel helper to query a device ethtool_settings.
+ *
+ * Backward compatibility note: for compatibility with legacy drivers
+ * that implement only the ethtool_cmd API, this has to work with both
+ * drivers implementing get_ksettings API and drivers implementing
+ * get_settings API. When drivers implement get_settings and report
+ * ethtool_cmd deprecated fields (transceiver/maxrxpkt/maxtxpkt),
+ * these fields are silently ignored because the resulting struct
+ * ethtool_settings does not report them.
+ */
+int __ethtool_get_ksettings(struct net_device *dev,
+			    struct ethtool_ksettings *ksettings)
+{
+	int err;
+	struct ethtool_cmd cmd;
+
+	ASSERT_RTNL();
+
+	if (dev->ethtool_ops->get_ksettings) {
+		memset(ksettings, 0, sizeof(*ksettings));
+		return dev->ethtool_ops->get_ksettings(dev, ksettings);
+	}
+
+	/* TODO: remove what follows when ethtool_ops::get_settings
+	 * disappears internally
+	 */
+
+	/* driver doesn't support %ethtool_ksettings API. revert to
+	 * legacy %ethtool_cmd API, unless it's not supported either.
+	 * TODO: remove when ethtool_ops::get_settings disappears internally
+	 */
+	err = __ethtool_get_settings(dev, &cmd);
+	if (err < 0)
+		return err;
+
+	/* we ignore deprecated fields transceiver/maxrxpkt/maxtxpkt
+	 */
+	convert_legacy_settings_to_ksettings(ksettings, &cmd);
+	return err;
+}
+EXPORT_SYMBOL(__ethtool_get_ksettings);
+
+#if BITS_PER_LONG == 64
+static unsigned long _shl32(__u32 v)
+{
+	return ((unsigned long)v) << 32;
+}
+#endif
+
+/* convert a user's __u32[] bitmap in user space to a kernel internal
+ * bitmap. return 0 on success, errno on error. this assumes that
+ * link_mode_masks_nwords was already verified
+ */
+static int load_ksettings_from_user(struct ethtool_ksettings *to,
+				    const void __user *from)
+{
+	struct ethtool_usettings usettings;
+#if BITS_PER_LONG != 32
+	unsigned i;
+#endif
+
+	if (copy_from_user(&usettings, from, sizeof(usettings)))
+		return -EFAULT;
+
+	/* make sure we didn't receive garbage between last allowed bit
+	 * and end of last u32 word
+	 */
+	if (__ETHTOOL_LINK_MODE_MASK_NBITS % 32) {
+		const u32 allowed = (
+			1U << (__ETHTOOL_LINK_MODE_MASK_NBITS % 32)) - 1;
+		if (usettings.link_modes.supported[
+			    __ETHTOOL_LINK_MODE_MASK_NU32-1] & ~allowed)
+			return -EINVAL;
+		if (usettings.link_modes.advertising[
+			    __ETHTOOL_LINK_MODE_MASK_NU32-1] & ~allowed)
+			return -EINVAL;
+		if (usettings.link_modes.lp_advertising[
+			    __ETHTOOL_LINK_MODE_MASK_NU32-1] & ~allowed)
+			return -EINVAL;
+	}
+
+#if BITS_PER_LONG == 32
+	compiletime_assert(sizeof(*to) == sizeof(usettings),
+			   "sizeof(ulong) != 32");
+	memcpy(to, &usettings, sizeof(*to));
+#elif BITS_PER_LONG == 64
+	memset(to, 0, sizeof(*to));
+	memcpy(&to->parent, &usettings.parent, sizeof(to->parent));
+	for (i = 0 ; i < __ETHTOOL_LINK_MODE_MASK_NU32 ; ++i) {
+		if (0 == (i & 1)) {
+			to->link_modes.supported.mask[i>>1]
+				= usettings.link_modes.supported[i];
+			to->link_modes.advertising.mask[i>>1]
+				= usettings.link_modes.advertising[i];
+			to->link_modes.lp_advertising.mask[i>>1]
+				= usettings.link_modes.lp_advertising[i];
+		} else {
+			to->link_modes.supported.mask[i>>1] |= _shl32(
+				usettings.link_modes.supported[i]);
+			to->link_modes.advertising.mask[i>>1] |= _shl32(
+				usettings.link_modes.advertising[i]);
+			to->link_modes.lp_advertising.mask[i>>1] |= _shl32(
+				usettings.link_modes.lp_advertising[i]);
+		}
+	}
+#else
+# error "unsupported ulong width"
+#endif
+	return 0;
+}
+
+/* convert a kernel internal bitmap to a user space __u32[]
+ * bitmap. return 0 on success, errno on error.
+ */
+static int store_ksettings_for_user(void __user *to,
+				    const struct ethtool_ksettings *from)
+{
+	struct ethtool_usettings usettings;
+#if BITS_PER_LONG != 32
+	unsigned i;
+#endif
+
+	compiletime_assert(__ETHTOOL_LINK_MODE_MASK_NU32 <= S8_MAX,
+			   "need too many bits for link modes!");
+
+#if BITS_PER_LONG == 32
+	compiletime_assert(sizeof(*from) == sizeof(usettings),
+			   "sizeof(ulong) != 32");
+	memcpy(&usettings, from, sizeof(usettings));
+#elif BITS_PER_LONG == 64
+	memset(&usettings, 0, sizeof(usettings));
+	memcpy(&usettings.parent, &from->parent, sizeof(usettings));
+	for (i = 0 ; i < __ETHTOOL_LINK_MODE_MASK_NU32 ; ++i) {
+		if (0 == (i & 1)) {
+			usettings.link_modes.supported[i] = lower_32_bits(
+				from->link_modes.supported.mask[i>>1]);
+			usettings.link_modes.advertising[i] = lower_32_bits(
+				from->link_modes.advertising.mask[i>>1]);
+			usettings.link_modes.lp_advertising[i] = lower_32_bits(
+				from->link_modes.lp_advertising.mask[i>>1]);
+		} else {
+			usettings.link_modes.supported[i] = upper_32_bits(
+				from->link_modes.supported.mask[i>>1]);
+			usettings.link_modes.advertising[i] = upper_32_bits(
+				from->link_modes.advertising.mask[i>>1]);
+			usettings.link_modes.lp_advertising[i] = upper_32_bits(
+				from->link_modes.lp_advertising.mask[i>>1]);
+		}
+	}
+#else
+# error "unsupported ulong width"
+#endif
+	if (copy_to_user(to, &usettings, sizeof(usettings)))
+		return -EFAULT;
+
+	return 0;
+}
+
+/* Query device for its ethtool_settings.
+ *
+ * Backward compatibility note: this function must fail when driver
+ * does not implement ethtool::get_ksettings, even if legacy
+ * ethtool_ops::get_settings is implemented. This tells new versions
+ * of ethtool that they should use the legacy API %ETHTOOL_GSET for
+ * this driver, so that they can correctly access the ethtool_cmd
+ * deprecated fields (transceiver/maxrxpkt/maxtxpkt), until no driver
+ * implements ethtool_ops::get_settings anymore.
+ */
+static int ethtool_get_ksettings(struct net_device *dev, void __user *useraddr)
+{
+	int err;
+	struct ethtool_ksettings ksettings;
+
+	if (!dev->ethtool_ops->get_ksettings)
+		return -EOPNOTSUPP;
+
+	/* handle bitmap nbits handshake */
+	if (copy_from_user(&ksettings.parent, useraddr,
+			   sizeof(ksettings.parent)))
+		return -EFAULT;
+
+	if (__ETHTOOL_LINK_MODE_MASK_NU32
+	    != ksettings.parent.link_mode_masks_nwords) {
+		/* wrong link mode nbits requested */
+		memset(&ksettings, 0, sizeof(ksettings));
+		/* keep cmd field reset to 0 */
+		/* send back number of words required as negative val */
+		ksettings.parent.link_mode_masks_nwords
+			= -((s8)__ETHTOOL_LINK_MODE_MASK_NU32);
+		err = 0;
+	} else {
+		memset(&ksettings, 0, sizeof(ksettings));
+		err = dev->ethtool_ops->get_ksettings(dev, &ksettings);
+		if (err < 0)
+			return err;
+
+		/* make sure we tell the right values to user */
+		ksettings.parent.cmd = ETHTOOL_GSETTINGS;
+		ksettings.parent.link_mode_masks_nwords
+			= __ETHTOOL_LINK_MODE_MASK_NU32;
+	}
+
+	return store_ksettings_for_user(useraddr, &ksettings);
+}
+
+/* Update device ethtool_settings.
+ *
+ * Backward compatibility note: this function must fail when driver
+ * does not implement ethtool::set_ksettings, even if legacy
+ * ethtool_ops::set_settings is implemented. This tells new versions
+ * of ethtool that they should use the legacy API %ETHTOOL_SSET for
+ * this driver, so that they can correctly update the ethtool_cmd
+ * deprecated fields (transceiver/maxrxpkt/maxtxpkt), until no driver
+ * implements ethtool_ops::get_settings anymore.
+ */
+static int ethtool_set_ksettings(struct net_device *dev, void __user *useraddr)
+{
+	int err;
+	struct ethtool_ksettings ksettings;
+
+	if (!dev->ethtool_ops->set_ksettings)
+		return -EOPNOTSUPP;
+
+	/* make sure nbits field has expected value */
+	if (copy_from_user(&ksettings.parent, useraddr,
+			   sizeof(ksettings.parent)))
+		return -EFAULT;
+
+	if (__ETHTOOL_LINK_MODE_MASK_NU32
+	    != ksettings.parent.link_mode_masks_nwords)
+		return -EINVAL;
+
+	/* copy the whole structure, now that we know it has expected
+	 * format
+	 */
+	err = load_ksettings_from_user(&ksettings, useraddr);
+	if (err)
+		return err;
+
+	/* re-check nwords field, just in case */
+	if (__ETHTOOL_LINK_MODE_MASK_NU32
+	    != ksettings.parent.link_mode_masks_nwords)
+		return -EINVAL;
+
+	return dev->ethtool_ops->set_ksettings(dev, &ksettings);
+}
+
+/* Internal kernel helper to query a device ethtool_cmd settings.
+ *
+ * Note about transition to ethtool_settings API: We do not need (or
+ * want) this function to support "dev" instances that implement the
+ * ethtool_settings API as we will update the drivers calling this
+ * function to call __ethtool_get_ksettings instead, before the first
+ * drivers implement ethtool_ops::get_ksettings.
+ *
+ * TODO 1: at least make this function static when no driver is using it
+ * TODO 2: remove when ethtool_ops::get_settings disappears internally
+ */
 int __ethtool_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 {
 	ASSERT_RTNL();
@@ -353,30 +735,119 @@ int __ethtool_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 }
 EXPORT_SYMBOL(__ethtool_get_settings);
 
+/* Query device for its ethtool_cmd settings.
+ *
+ * Backward compatibility note: for compatibility with legacy ethtool,
+ * this has to work with both drivers implementing get_ksettings API
+ * and drivers implementing get_settings API. When drivers implement
+ * get_ksettings and report higher link mode bits, a kernel warning is
+ * logged once (with name of 1st driver/device) to recommend user to
+ * upgrade ethtool, but the command is successful (only the lower link
+ * mode bits reported back to user).
+ *
+ * TODO: remove when %ETHTOOL_GSET/%ETHTOOL_SSET disappear
+ */
 static int ethtool_get_settings(struct net_device *dev, void __user *useraddr)
 {
-	int err;
 	struct ethtool_cmd cmd;
 
-	err = __ethtool_get_settings(dev, &cmd);
-	if (err < 0)
-		return err;
+	ASSERT_RTNL();
+
+	if (dev->ethtool_ops->get_ksettings) {
+		/* First, use ksettings API if it is supported */
+		int err;
+		struct ethtool_ksettings ksettings;
+
+		memset(&ksettings, 0, sizeof(ksettings));
+		err = dev->ethtool_ops->get_ksettings(dev, &ksettings);
+		if (err < 0)
+			return err;
+		if (!convert_ksettings_to_legacy_settings(&cmd, &ksettings)) {
+			static int __warned;
+
+			/* not all bitmaps could be translated
+			 * acurately to legacy API: print warning with
+			 * netdev name, but does still succeed
+			 */
+			if (!__warned)
+				netdev_warn(dev, "please upgrade ethtool\n");
+			__warned = 1;
+		}
+		/* send a sensible cmd tag back to user */
+		cmd.cmd = ETHTOOL_GSET;
+	} else {
+		int err;
+		/* TODO: return -EOPNOTSUPP when
+		 * ethtool_ops::get_settings disappears internally
+		 */
+
+		/* driver doesn't support %ethtool_ksettings
+		 * API. revert to legacy %ethtool_cmd API, unless it's
+		 * not supported either.
+		 */
+		err = __ethtool_get_settings(dev, &cmd);
+		if (err < 0)
+			return err;
+	}
 
 	if (copy_to_user(useraddr, &cmd, sizeof(cmd)))
 		return -EFAULT;
+
 	return 0;
 }
 
+/* Update device settings with given ethtool_cmd.
+ *
+ * Backward compatibility note: for compatibility with legacy ethtool,
+ * this has to work with both drivers implementing set_ksettings API
+ * and drivers implementing set_settings API. When drivers implement
+ * set_ksettings and user's request updates deprecated ethtool_cmd
+ * fields (transceiver/maxrxpkt/maxtxpkt), a kernel warning is logged
+ * once (with name of 1st driver/device) to recommend user to upgrade
+ * ethtool, and the request is rejected.
+ *
+ * TODO: remove when %ETHTOOL_GSET/%ETHTOOL_SSET disappear
+ */
 static int ethtool_set_settings(struct net_device *dev, void __user *useraddr)
 {
 	struct ethtool_cmd cmd;
 
-	if (!dev->ethtool_ops->set_settings)
-		return -EOPNOTSUPP;
+	ASSERT_RTNL();
 
 	if (copy_from_user(&cmd, useraddr, sizeof(cmd)))
 		return -EFAULT;
 
+	/* first, try new %ethtool_ksettings API. */
+	if (dev->ethtool_ops->set_ksettings) {
+		struct ethtool_ksettings ksettings;
+
+		if (!convert_legacy_settings_to_ksettings(&ksettings, &cmd)) {
+			static int __warned;
+
+			/* rejecting setting deprecated fields
+			 * transceiver/maxtxpkt/maxrxpkt
+			 */
+			if (!__warned)
+				netdev_warn(dev, "please upgrade ethtool");
+			__warned = 1;
+			return -EINVAL;
+		}
+
+		ksettings.parent.cmd = ETHTOOL_SSETTINGS;
+		ksettings.parent.link_mode_masks_nwords
+			= __ETHTOOL_LINK_MODE_MASK_NU32;
+		return dev->ethtool_ops->set_ksettings(dev, &ksettings);
+	}
+
+	/* legacy %ethtool_cmd API */
+
+	/* TODO: return -EOPNOTSUPP when ethtool_ops::get_settings
+	 * disappears internally
+	 */
+
+	if (!dev->ethtool_ops->set_settings)
+		return -EOPNOTSUPP;
+
 	return dev->ethtool_ops->set_settings(dev, &cmd);
 }
 
@@ -1979,6 +2450,12 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
 	case ETHTOOL_STUNABLE:
 		rc = ethtool_set_tunable(dev, useraddr);
 		break;
+	case ETHTOOL_GSETTINGS:
+		rc = ethtool_get_ksettings(dev, useraddr);
+		break;
+	case ETHTOOL_SSETTINGS:
+		rc = ethtool_set_ksettings(dev, useraddr);
+		break;
 	default:
 		rc = -EOPNOTSUPP;
 	}
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next v2 04/17] tx4939: use __ethtool_get_ksettings
From: David Decotigny @ 2015-02-04 19:53 UTC (permalink / raw)
  To: David S. Miller, Ben Hutchings, Amir Vadai, linux-kernel, netdev,
	linux-api, linux-mips, fcoe-devel
  Cc: Eric Dumazet, Eugenia Emantayev, Or Gerlitz, Ido Shamay,
	Joe Perches, Saeed Mahameed, Govindarajulu Varadarajan,
	Venkata Duvvuru, Jeff Kirsher, Eyal Perry, Pravin B Shelar,
	Ed Swierk, Upinder Malhi, Robert Love, James E.J. Bottomley,
	David Decotigny
In-Reply-To: <1423079621-1374-1-git-send-email-ddecotig@gmail.com>

From: David Decotigny <decot@googlers.com>

Signed-off-by: David Decotigny <decot@googlers.com>
---
 arch/mips/txx9/generic/setup_tx4939.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/mips/txx9/generic/setup_tx4939.c b/arch/mips/txx9/generic/setup_tx4939.c
index e3733cd..4a3ebf6 100644
--- a/arch/mips/txx9/generic/setup_tx4939.c
+++ b/arch/mips/txx9/generic/setup_tx4939.c
@@ -320,11 +320,12 @@ void __init tx4939_sio_init(unsigned int sclk, unsigned int cts_mask)
 #if IS_ENABLED(CONFIG_TC35815)
 static u32 tx4939_get_eth_speed(struct net_device *dev)
 {
-	struct ethtool_cmd cmd;
-	if (__ethtool_get_settings(dev, &cmd))
+	struct ethtool_ksettings cmd;
+
+	if (__ethtool_get_ksettings(dev, &cmd))
 		return 100;	/* default 100Mbps */
 
-	return ethtool_cmd_speed(&cmd);
+	return cmd.parent.speed;
 }
 
 static int tx4939_netdev_event(struct notifier_block *this,
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next v2 05/17] net: usnic: use __ethtool_get_ksettings
From: David Decotigny @ 2015-02-04 19:53 UTC (permalink / raw)
  To: David S. Miller, Ben Hutchings, Amir Vadai, linux-kernel, netdev,
	linux-api, linux-mips, fcoe-devel
  Cc: Eric Dumazet, Eugenia Emantayev, Or Gerlitz, Ido Shamay,
	Joe Perches, Saeed Mahameed, Govindarajulu Varadarajan,
	Venkata Duvvuru, Jeff Kirsher, Eyal Perry, Pravin B Shelar,
	Ed Swierk, Upinder Malhi, Robert Love, James E.J. Bottomley,
	David Decotigny
In-Reply-To: <1423079621-1374-1-git-send-email-ddecotig@gmail.com>

From: David Decotigny <decot@googlers.com>

Signed-off-by: David Decotigny <decot@googlers.com>
---
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
index d71ba62..f6f67c9 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
@@ -305,12 +305,12 @@ int usnic_ib_query_port(struct ib_device *ibdev, u8 port,
 				struct ib_port_attr *props)
 {
 	struct usnic_ib_dev *us_ibdev = to_usdev(ibdev);
-	struct ethtool_cmd cmd;
+	struct ethtool_ksettings cmd;
 
 	usnic_dbg("\n");
 
 	mutex_lock(&us_ibdev->usdev_lock);
-	__ethtool_get_settings(us_ibdev->netdev, &cmd);
+	__ethtool_get_ksettings(us_ibdev->netdev, &cmd);
 	memset(props, 0, sizeof(*props));
 
 	props->lid = 0;
@@ -334,8 +334,8 @@ int usnic_ib_query_port(struct ib_device *ibdev, u8 port,
 	props->pkey_tbl_len = 1;
 	props->bad_pkey_cntr = 0;
 	props->qkey_viol_cntr = 0;
-	eth_speed_to_ib_speed(cmd.speed, &props->active_speed,
-				&props->active_width);
+	eth_speed_to_ib_speed(cmd.parent.speed, &props->active_speed,
+			      &props->active_width);
 	props->max_mtu = IB_MTU_4096;
 	props->active_mtu = iboe_get_mtu(us_ibdev->ufdev->mtu);
 	/* Userspace will adjust for hdrs */
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next v2 06/17] net: bonding: use __ethtool_get_ksettings
From: David Decotigny @ 2015-02-04 19:53 UTC (permalink / raw)
  To: David S. Miller, Ben Hutchings, Amir Vadai, linux-kernel, netdev,
	linux-api, linux-mips, fcoe-devel
  Cc: Eric Dumazet, Eugenia Emantayev, Or Gerlitz, Ido Shamay,
	Joe Perches, Saeed Mahameed, Govindarajulu Varadarajan,
	Venkata Duvvuru, Jeff Kirsher, Eyal Perry, Pravin B Shelar,
	Ed Swierk, Upinder Malhi, Robert Love, James E.J. Bottomley,
	David Decotigny
In-Reply-To: <1423079621-1374-1-git-send-email-ddecotig@gmail.com>

From: David Decotigny <decot@googlers.com>

Signed-off-by: David Decotigny <decot@googlers.com>
---
 drivers/net/bonding/bond_main.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c9e519c..9aba5a8 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -372,22 +372,20 @@ down:
 static void bond_update_speed_duplex(struct slave *slave)
 {
 	struct net_device *slave_dev = slave->dev;
-	struct ethtool_cmd ecmd;
-	u32 slave_speed;
+	struct ethtool_ksettings ecmd;
 	int res;
 
 	slave->speed = SPEED_UNKNOWN;
 	slave->duplex = DUPLEX_UNKNOWN;
 
-	res = __ethtool_get_settings(slave_dev, &ecmd);
+	res = __ethtool_get_ksettings(slave_dev, &ecmd);
 	if (res < 0)
 		return;
 
-	slave_speed = ethtool_cmd_speed(&ecmd);
-	if (slave_speed == 0 || slave_speed == ((__u32) -1))
+	if (ecmd.parent.speed == 0 || ecmd.parent.speed == ((__u32) -1))
 		return;
 
-	switch (ecmd.duplex) {
+	switch (ecmd.parent.duplex) {
 	case DUPLEX_FULL:
 	case DUPLEX_HALF:
 		break;
@@ -395,8 +393,8 @@ static void bond_update_speed_duplex(struct slave *slave)
 		return;
 	}
 
-	slave->speed = slave_speed;
-	slave->duplex = ecmd.duplex;
+	slave->speed = ecmd.parent.speed;
+	slave->duplex = ecmd.parent.duplex;
 
 	return;
 }
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next v2 07/17] net: ipvlan: use __ethtool_get_ksettings
From: David Decotigny @ 2015-02-04 19:53 UTC (permalink / raw)
  To: David S. Miller, Ben Hutchings, Amir Vadai, linux-kernel, netdev,
	linux-api, linux-mips, fcoe-devel
  Cc: Eric Dumazet, Eugenia Emantayev, Or Gerlitz, Ido Shamay,
	Joe Perches, Saeed Mahameed, Govindarajulu Varadarajan,
	Venkata Duvvuru, Jeff Kirsher, Eyal Perry, Pravin B Shelar,
	Ed Swierk, Upinder Malhi, Robert Love, James E.J. Bottomley,
	David Decotigny
In-Reply-To: <1423079621-1374-1-git-send-email-ddecotig@gmail.com>

From: David Decotigny <decot@googlers.com>

Signed-off-by: David Decotigny <decot@googlers.com>
---
 drivers/net/ipvlan/ipvlan_main.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 4f4099d..79e3516 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -342,12 +342,12 @@ static const struct header_ops ipvlan_header_ops = {
 	.cache_update	= eth_header_cache_update,
 };
 
-static int ipvlan_ethtool_get_settings(struct net_device *dev,
-				       struct ethtool_cmd *cmd)
+static int ipvlan_ethtool_get_ksettings(struct net_device *dev,
+					struct ethtool_ksettings *cmd)
 {
 	const struct ipvl_dev *ipvlan = netdev_priv(dev);
 
-	return __ethtool_get_settings(ipvlan->phy_dev, cmd);
+	return __ethtool_get_ksettings(ipvlan->phy_dev, cmd);
 }
 
 static void ipvlan_ethtool_get_drvinfo(struct net_device *dev,
@@ -373,7 +373,7 @@ static void ipvlan_ethtool_set_msglevel(struct net_device *dev, u32 value)
 
 static const struct ethtool_ops ipvlan_ethtool_ops = {
 	.get_link	= ethtool_op_get_link,
-	.get_settings	= ipvlan_ethtool_get_settings,
+	.get_ksettings	= ipvlan_ethtool_get_ksettings,
 	.get_drvinfo	= ipvlan_ethtool_get_drvinfo,
 	.get_msglevel	= ipvlan_ethtool_get_msglevel,
 	.set_msglevel	= ipvlan_ethtool_set_msglevel,
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox