Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH] fork: report pid reservation failure properly
From: Andrew Morton @ 2015-02-26 22:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-api, Oleg Nesterov, Eric W. Biederman, Michael Kerrisk,
	LKML
In-Reply-To: <20150223201701.GA28639@dhcp22.suse.cz>

On Mon, 23 Feb 2015 21:17:01 +0100 Michal Hocko <mhocko@suse.cz> wrote:

> ping on this one? Should I just resend (your way Andrew)? Or are there
> any objections to the patch as is.

Were Eric's concerns all addressed?

Oleg: wake up ;)

Overall it looks like a pretty minor issue?

^ permalink raw reply

* Re: [PATCH v2] coresight-stm: adding driver for CoreSight STM component
From: Russell King - ARM Linux @ 2015-02-26 22:58 UTC (permalink / raw)
  To: Rob Herring
  Cc: Mathieu Poirier, Will Deacon, linux-api@vger.kernel.org,
	Jonathan Corbet, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org
In-Reply-To: <CAL_Jsq+JYc5DBXqEcpJ0O=RdO-uuz_-GR99YsgVqT=uG6CLqJg@mail.gmail.com>

On Thu, Feb 26, 2015 at 04:24:53PM -0600, Rob Herring wrote:
> We really shouldn't do private implementation here. It there really
> any reason not to allow readq/writeq generically for 32-bit or just
> for arm32?

My argument has always been that drivers should do the emulation of
64-bit accesses when there is no native support.

IO registers tend to have side effects when read/written.  How do we
know whether the low-half or the high-half should be written first?
This isn't something that an architecture can really dictate.  What
may be right for one hardware device may not be correct for another.

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply

* Re: Documenting MS_LAZYTIME
From: Theodore Ts'o @ 2015-02-27  0:04 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Eric Sandeen, Ext4 Developers List, Linux btrfs Developers List,
	XFS Developers, linux-man, Linux-Fsdevel, Linux API
In-Reply-To: <54EF2161.90607@gmail.com>

On Thu, Feb 26, 2015 at 02:36:33PM +0100, Michael Kerrisk (man-pages) wrote:
> 
>     The disadvantage of MS_STRICTATIME | MS_LAZYTIME is that
>     in the case of a system crash, the atime and mtime fields
>     on disk might be out of date by at most 24 hours.

I'd change to "The disadvantage of MS_LAZYTIME is that..."  and
perhaps move that so it's clear it applies to any use of MS_LAZYTIME
has this as a downside.

Does that make sense?

						- Ted

^ permalink raw reply

* Re: Documenting MS_LAZYTIME
From: Michael Kerrisk (man-pages) @ 2015-02-27  8:01 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Eric Sandeen,
	Ext4 Developers List, Linux btrfs Developers List, XFS Developers,
	linux-man-u79uwXL29TY76Z2rM5mHXA, Linux-Fsdevel, Linux API
In-Reply-To: <20150227000409.GC17174-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>

On 02/27/2015 01:04 AM, Theodore Ts'o wrote:
> On Thu, Feb 26, 2015 at 02:36:33PM +0100, Michael Kerrisk (man-pages) wrote:
>>
>>     The disadvantage of MS_STRICTATIME | MS_LAZYTIME is that
>>     in the case of a system crash, the atime and mtime fields
>>     on disk might be out of date by at most 24 hours.
> 
> I'd change to "The disadvantage of MS_LAZYTIME is that..."  and
> perhaps move that so it's clear it applies to any use of MS_LAZYTIME
> has this as a downside.
> 
> Does that make sense?

Thanks, Ted. Got it. So, now we have:

       MS_LAZYTIME (since Linux 3.20)
              Reduce  on-disk  updates  of  inode  timestamps  (atime,
              mtime, ctime) by maintaining these changes only in  mem‐
              ory.  The on-disk timestamps are updated only when:

              (a)  the inode needs to be updated for some change unre‐
                   lated to file timestamps;

              (b)  the application  employs  fsync(2),  syncfs(2),  or
                   sync(2);

              (c)  an undeleted inode is evicted from memory; or

              (d)  more  than 24 hours have passed since the inode was
                   written to disk.

              This mount significantly reduces writes needed to update
              the  inode's  timestamps,  especially  mtime  and atime.
              However, in the event of a system crash, the  atime  and
              mtime  fields  on  disk might be out of date by up to 24
              hours.

              Examples of workloads where this option could be of sig‐
              nificant  benefit include frequent random writes to pre‐
              allocated files, as well as cases where the  MS_STRICTA‐
              TIME  mount  option  is also enabled.  (The advantage of
              (MS_STRICTATIME |  MS_LAZYTIME)  is  that  stat(2)  will
              return  the  correctly  updated  atime,  but  the  atime
              updates will be flushed to disk only when (1) the  inode
              needs  to  be  updated for filesystem / data consistency
              reasons or (2) the inode is pushed out of memory, or (3)
              the filesystem is unmounted.)

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Documenting MS_LAZYTIME
From: Omar Sandoval @ 2015-02-27  8:08 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Theodore Ts'o, Eric Sandeen, linux-man-u79uwXL29TY76Z2rM5mHXA,
	Linux API, XFS Developers, Linux-Fsdevel, Ext4 Developers List,
	Linux btrfs Developers List
In-Reply-To: <54F02446.2050008-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Fri, Feb 27, 2015 at 09:01:10AM +0100, Michael Kerrisk (man-pages) wrote:
> On 02/27/2015 01:04 AM, Theodore Ts'o wrote:
> > On Thu, Feb 26, 2015 at 02:36:33PM +0100, Michael Kerrisk (man-pages) wrote:
> >>
> >>     The disadvantage of MS_STRICTATIME | MS_LAZYTIME is that
> >>     in the case of a system crash, the atime and mtime fields
> >>     on disk might be out of date by at most 24 hours.
> > 
> > I'd change to "The disadvantage of MS_LAZYTIME is that..."  and
> > perhaps move that so it's clear it applies to any use of MS_LAZYTIME
> > has this as a downside.
> > 
> > Does that make sense?
> 
> Thanks, Ted. Got it. So, now we have:
> 
>        MS_LAZYTIME (since Linux 3.20)
>               Reduce  on-disk  updates  of  inode  timestamps  (atime,
>               mtime, ctime) by maintaining these changes only in  mem‐
>               ory.  The on-disk timestamps are updated only when:
> 
>               (a)  the inode needs to be updated for some change unre‐
>                    lated to file timestamps;
> 
>               (b)  the application  employs  fsync(2),  syncfs(2),  or
>                    sync(2);
> 
>               (c)  an undeleted inode is evicted from memory; or
> 
>               (d)  more  than 24 hours have passed since the inode was
>                    written to disk.
> 
>               This mount significantly reduces writes needed to update
"This mount option"?

>               the  inode's  timestamps,  especially  mtime  and atime.
>               However, in the event of a system crash, the  atime  and
>               mtime  fields  on  disk might be out of date by up to 24
>               hours.
> 
>               Examples of workloads where this option could be of sig‐
>               nificant  benefit include frequent random writes to pre‐
>               allocated files, as well as cases where the  MS_STRICTA‐
>               TIME  mount  option  is also enabled.  (The advantage of
>               (MS_STRICTATIME |  MS_LAZYTIME)  is  that  stat(2)  will
>               return  the  correctly  updated  atime,  but  the  atime
>               updates will be flushed to disk only when (1) the  inode
>               needs  to  be  updated for filesystem / data consistency
>               reasons or (2) the inode is pushed out of memory, or (3)
>               the filesystem is unmounted.)
Is it necessary to repeat the reasons for flushing, which are stated
above?

> 
> Cheers,
> 
> Michael
> 
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
> 
> _______________________________________________
> xfs mailing list
> xfs-VZNHf3L845pBDgjK7y7TUQ@public.gmane.org
> http://oss.sgi.com/mailman/listinfo/xfs

Thanks!
-- 
Omar

^ permalink raw reply

* Re: [PATCH] fork: report pid reservation failure properly
From: Michal Hocko @ 2015-02-27  8:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, Oleg Nesterov, Eric W. Biederman, Michael Kerrisk,
	LKML
In-Reply-To: <20150226144908.5e3d354d7b24174e90089721@linux-foundation.org>

On Thu 26-02-15 14:49:08, Andrew Morton wrote:
> On Mon, 23 Feb 2015 21:17:01 +0100 Michal Hocko <mhocko@suse.cz> wrote:
> 
> > ping on this one? Should I just resend (your way Andrew)? Or are there
> > any objections to the patch as is.
> 
> Were Eric's concerns all addressed?

I hope so. I didn't touch pid namespace parts and they return ENOMEM as
before.

> Oleg: wake up ;)
> 
> Overall it looks like a pretty minor issue?

I believe so. It should help deubugging when pid space is exhausted
because getting ENOMEM under that situation is really awkward and the
real reason for failure is hard to find out.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: Documenting MS_LAZYTIME
From: Michael Kerrisk (man-pages) @ 2015-02-27  8:36 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Theodore Ts'o,
	Eric Sandeen, linux-man-u79uwXL29TY76Z2rM5mHXA, Linux API,
	XFS Developers, Linux-Fsdevel, Ext4 Developers List,
	Linux btrfs Developers List
In-Reply-To: <20150227080800.GA20442@mew>

Hello Omar,

On 02/27/2015 09:08 AM, Omar Sandoval wrote:
> On Fri, Feb 27, 2015 at 09:01:10AM +0100, Michael Kerrisk (man-pages) wrote:
>> On 02/27/2015 01:04 AM, Theodore Ts'o wrote:
>>> On Thu, Feb 26, 2015 at 02:36:33PM +0100, Michael Kerrisk (man-pages) wrote:
>>>>
>>>>     The disadvantage of MS_STRICTATIME | MS_LAZYTIME is that
>>>>     in the case of a system crash, the atime and mtime fields
>>>>     on disk might be out of date by at most 24 hours.
>>>
>>> I'd change to "The disadvantage of MS_LAZYTIME is that..."  and
>>> perhaps move that so it's clear it applies to any use of MS_LAZYTIME
>>> has this as a downside.
>>>
>>> Does that make sense?
>>
>> Thanks, Ted. Got it. So, now we have:
>>
>>        MS_LAZYTIME (since Linux 3.20)
>>               Reduce  on-disk  updates  of  inode  timestamps  (atime,
>>               mtime, ctime) by maintaining these changes only in  mem‐
>>               ory.  The on-disk timestamps are updated only when:
>>
>>               (a)  the inode needs to be updated for some change unre‐
>>                    lated to file timestamps;
>>
>>               (b)  the application  employs  fsync(2),  syncfs(2),  or
>>                    sync(2);
>>
>>               (c)  an undeleted inode is evicted from memory; or
>>
>>               (d)  more  than 24 hours have passed since the inode was
>>                    written to disk.
>>
>>               This mount significantly reduces writes needed to update
> "This mount option"?

Thanks, fixed.

>>               the  inode's  timestamps,  especially  mtime  and atime.
>>               However, in the event of a system crash, the  atime  and
>>               mtime  fields  on  disk might be out of date by up to 24
>>               hours.
>>
>>               Examples of workloads where this option could be of sig‐
>>               nificant  benefit include frequent random writes to pre‐
>>               allocated files, as well as cases where the  MS_STRICTA‐
>>               TIME  mount  option  is also enabled.  (The advantage of
>>               (MS_STRICTATIME |  MS_LAZYTIME)  is  that  stat(2)  will
>>               return  the  correctly  updated  atime,  but  the  atime
>>               updates will be flushed to disk only when (1) the  inode
>>               needs  to  be  updated for filesystem / data consistency
>>               reasons or (2) the inode is pushed out of memory, or (3)
>>               the filesystem is unmounted.)
> Is it necessary to repeat the reasons for flushing, which are stated
> above?

Good point. I replaced this piece with just a few words referring 
to the list above.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH v12 1/2] crypto: AF_ALG: add AEAD support
From: Herbert Xu @ 2015-02-27  9:49 UTC (permalink / raw)
  To: Stephan Mueller
  Cc: 'Quentin Gouchet', Daniel Borkmann, 'LKML',
	linux-crypto, 'Linux API'
In-Reply-To: <7821728.KTX1G67VgF@tachyon.chronox.de>

On Thu, Feb 05, 2015 at 04:10:58PM +0100, Stephan Mueller wrote:
> Am Donnerstag, 29. Januar 2015, 21:24:45 schrieb Stephan Mueller:
> 
> Hi Herbert,
> 
> > This patch adds the AEAD support for AF_ALG.
> > 
> > The implementation is based on algif_skcipher, but contains heavy
> > modifications to streamline the interface for AEAD uses.
> > 
> > To use AEAD, the user space consumer has to use the salg_type named
> > "aead".
> 
> I just saw Al Viro's patch to use the iov_iter API in algif_skcipher.c. I 
> looked at it but lacked documentation for using it properly. Now I have a 
> template that I will incorporate into algif_aead.c

Please resubmit once you have done this.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH v12 1/2] crypto: AF_ALG: add AEAD support
From: Stephan Mueller @ 2015-02-27 10:26 UTC (permalink / raw)
  To: Herbert Xu
  Cc: 'Quentin Gouchet', Daniel Borkmann, 'LKML',
	linux-crypto-u79uwXL29TY76Z2rM5mHXA, 'Linux API'
In-Reply-To: <20150227094944.GA29071-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>

Am Freitag, 27. Februar 2015, 22:49:44 schrieb Herbert Xu:

Hi Herbert,

>On Thu, Feb 05, 2015 at 04:10:58PM +0100, Stephan Mueller wrote:
>> Am Donnerstag, 29. Januar 2015, 21:24:45 schrieb Stephan Mueller:
>> 
>> Hi Herbert,
>> 
>> > This patch adds the AEAD support for AF_ALG.
>> > 
>> > The implementation is based on algif_skcipher, but contains heavy
>> > modifications to streamline the interface for AEAD uses.
>> > 
>> > To use AEAD, the user space consumer has to use the salg_type named
>> > "aead".
>> 
>> I just saw Al Viro's patch to use the iov_iter API in
>> algif_skcipher.c. I looked at it but lacked documentation for using
>> it properly. Now I have a template that I will incorporate into
>> algif_aead.c
>
>Please resubmit once you have done this.

I have done that, but as indicated with an email to Al, I cannot get his 
patch for skcipher and hash to work. Similarly, my modification for the 
AEAD does not work either. So, I do not see that Al's patch can be 
merged as is.

Therefore, I have not submitted my patch as Al mentioned he wanted to 
look into his patchset.
>
>Thanks,


Ciao
Stephan

^ permalink raw reply

* Re: [PATCH 0/4] enhance shmem process and swap accounting
From: Michael Kerrisk @ 2015-02-27 10:36 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Jerome Marchand, Linux Kernel, Andrew Morton, linux-doc,
	Hugh Dickins, Michal Hocko, Kirill A. Shutemov, Cyrill Gorcunov,
	Randy Dunlap, linux-s390, Martin Schwidefsky, Heiko Carstens,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Oleg Nesterov, Linux API
In-Reply-To: <1424958666-18241-1-git-send-email-vbabka@suse.cz>

[CC += linux-api@]

Hello Vlastimil,

Since this is a kernel-user-space API change, please CC linux-api@.
The kernel source file Documentation/SubmitChecklist notes that all
Linux kernel patches that change userspace interfaces should be CCed
to linux-api@vger.kernel.org, so that the various parties who are
interested in API changes are informed. For further information, see
https://www.kernel.org/doc/man-pages/linux-api-ml.html

Cheers,

Michael


On Thu, Feb 26, 2015 at 2:51 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> This series is based on Jerome Marchand's [1] so let me quote the first
> paragraph from there:
>
> There are several shortcomings with the accounting of shared memory
> (sysV shm, shared anonymous mapping, mapping to a tmpfs file). The
> values in /proc/<pid>/status and statm don't allow to distinguish
> between shmem memory and a shared mapping to a regular file, even
> though theirs implication on memory usage are quite different: at
> reclaim, file mapping can be dropped or write back on disk while shmem
> needs a place in swap. As for shmem pages that are swapped-out or in
> swap cache, they aren't accounted at all.
>
> The original motivation for myself is that a customer found (IMHO rightfully)
> confusing that e.g. top output for process swap usage is unreliable with
> respect to swapped out shmem pages, which are not accounted for.
>
> The fundamental difference between private anonymous and shmem pages is that
> the latter has PTE's converted to pte_none, and not swapents. As such, they are
> not accounted to the number of swapents visible e.g. in /proc/pid/status VmSwap
> row. It might be theoretically possible to use swapents when swapping out shmem
> (without extra cost, as one has to change all mappers anyway), and on swap in
> only convert the swapent for the faulting process, leaving swapents in other
> processes until they also fault (so again no extra cost). But I don't know how
> many assumptions this would break, and it would be too disruptive change for a
> relatively small benefit.
>
> Instead, my approach is to document the limitation of VmSwap, and provide means
> to determine the swap usage for shmem areas for those who are interested and
> willing to pay the price, using /proc/pid/smaps. Because outside of ipcs, I
> don't think it's possible to currently to determine the usage at all.  The
> previous patchset [1] did introduce new shmem-specific fields into smaps
> output, and functions to determine the values. I take a simpler approach,
> noting that smaps output already has a "Swap: X kB" line, where currently X ==
> 0 always for shmem areas. I think we can just consider this a bug and provide
> the proper value by consulting the radix tree, as e.g. mincore_page() does. In the
> patch changelog I explain why this is also not perfect (and cannot be without
> swapents), but still arguably much better than showing a 0.
>
> The last two patches are adapted from Jerome's patchset and provide a VmRSS
> breakdown to VmAnon, VmFile and VmShm in /proc/pid/status. Hugh noted that
> this is a welcome addition, and I agree that it might help e.g. debugging
> process memory usage at albeit non-zero, but still rather low cost of extra
> per-mm counter and some page flag checks. I updated these patches to 4.0-rc1,
> made them respect !CONFIG_SHMEM so that tiny systems don't pay the cost, and
> optimized the page flag checking somewhat.
>
> [1] http://lwn.net/Articles/611966/
>
> Jerome Marchand (2):
>   mm, shmem: Add shmem resident memory accounting
>   mm, procfs: Display VmAnon, VmFile and VmShm in /proc/pid/status
>
> Vlastimil Babka (2):
>   mm, documentation: clarify /proc/pid/status VmSwap limitations
>   mm, proc: account for shmem swap in /proc/pid/smaps
>
>  Documentation/filesystems/proc.txt | 15 +++++++++++++--
>  arch/s390/mm/pgtable.c             |  5 +----
>  fs/proc/task_mmu.c                 | 35 +++++++++++++++++++++++++++++++++--
>  include/linux/mm.h                 | 28 ++++++++++++++++++++++++++++
>  include/linux/mm_types.h           |  9 ++++++---
>  kernel/events/uprobes.c            |  2 +-
>  mm/memory.c                        | 30 ++++++++++--------------------
>  mm/oom_kill.c                      |  5 +++--
>  mm/rmap.c                          | 15 ++++-----------
>  9 files changed, 99 insertions(+), 45 deletions(-)
>
> --
> 2.1.4
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 1/4] mm, documentation: clarify /proc/pid/status VmSwap limitations
From: Michael Kerrisk @ 2015-02-27 10:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Jerome Marchand, Linux Kernel, Andrew Morton, linux-doc,
	Hugh Dickins, Michal Hocko, Kirill A. Shutemov, Cyrill Gorcunov,
	Randy Dunlap, linux-s390, Martin Schwidefsky, Heiko Carstens,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Oleg Nesterov, Linux API
In-Reply-To: <1424958666-18241-2-git-send-email-vbabka@suse.cz>

[CC += linux-api@]

On Thu, Feb 26, 2015 at 2:51 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> The documentation for /proc/pid/status does not mention that the value of
> VmSwap counts only swapped out anonymous private pages and not shmem. This is
> not obvious, so document this limitation.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> I've noticed that proc(5) manpage is currently missing the VmSwap field
> altogether.
>
>  Documentation/filesystems/proc.txt | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index a07ba61..d4f56ec 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -231,6 +231,8 @@ Table 1-2: Contents of the status files (as of 2.6.30-rc7)
>   VmLib                       size of shared library code
>   VmPTE                       size of page table entries
>   VmSwap                      size of swap usage (the number of referred swapents)
> +                             by anonymous private data (shmem swap usage is not
> +                             included)
>   Threads                     number of threads
>   SigQ                        number of signals queued/max. number for queue
>   SigPnd                      bitmap of pending signals for the thread
> --
> 2.1.4
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 2/4] mm, procfs: account for shmem swap in /proc/pid/smaps
From: Michael Kerrisk @ 2015-02-27 10:38 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Jerome Marchand, Linux Kernel, Andrew Morton, linux-doc,
	Hugh Dickins, Michal Hocko, Kirill A. Shutemov, Cyrill Gorcunov,
	Randy Dunlap, linux-s390, Martin Schwidefsky, Heiko Carstens,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Oleg Nesterov, Linux API
In-Reply-To: <1424958666-18241-3-git-send-email-vbabka@suse.cz>

[CC += linux-api@]

On Thu, Feb 26, 2015 at 2:51 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> Currently, /proc/pid/smaps will always show "Swap: 0 kB" for shmem-backed
> mappings, even if the mapped portion does contain pages that were swapped out.
> This is because unlike private anonymous mappings, shmem does not change pte
> to swap entry, but pte_none when swapping the page out. In the smaps page
> walk, such page thus looks like it was never faulted in.
>
> This patch changes smaps_pte_entry() to determine the swap status for such
> pte_none entries for shmem mappings, similarly to how mincore_page() does it.
> Swapped out pages are thus accounted for.
>
> The accounting is arguably still not as precise as for private anonymous
> mappings, since now we will count also pages that the process in question never
> accessed, but only another process populated them and then let them become
> swapped out. I believe it is still less confusing and subtle than not showing
> any swap usage by shmem mappings at all. Also, swapped out pages only becomee a
> performance issue for future accesses, and we cannot predict those for neither
> kind of mapping.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  Documentation/filesystems/proc.txt |  3 ++-
>  fs/proc/task_mmu.c                 | 20 ++++++++++++++++++++
>  2 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index d4f56ec..8b30543 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -437,7 +437,8 @@ indicates the amount of memory currently marked as referenced or accessed.
>  a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
>  and a page is modified, the file page is replaced by a private anonymous copy.
>  "Swap" shows how much would-be-anonymous memory is also used, but out on
> -swap.
> +swap. For shmem mappings, "Swap" shows how much of the mapped portion of the
> +underlying shmem object is on swap.
>
>  "VmFlags" field deserves a separate description. This member represents the kernel
>  flags associated with the particular virtual memory area in two letter encoded
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 956b75d..0410309 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -13,6 +13,7 @@
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/shmem_fs.h>
>
>  #include <asm/elf.h>
>  #include <asm/uaccess.h>
> @@ -496,6 +497,25 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>                         mss->swap += PAGE_SIZE;
>                 else if (is_migration_entry(swpent))
>                         page = migration_entry_to_page(swpent);
> +       } else if (IS_ENABLED(CONFIG_SHMEM) && IS_ENABLED(CONFIG_SWAP) &&
> +                                       pte_none(*pte) && vma->vm_file) {
> +               struct address_space *mapping =
> +                       file_inode(vma->vm_file)->i_mapping;
> +
> +               /*
> +                * shmem does not use swap pte's so we have to consult
> +                * the radix tree to account for swap
> +                */
> +               if (shmem_mapping(mapping)) {
> +                       page = find_get_entry(mapping, pgoff);
> +                       if (page) {
> +                               if (radix_tree_exceptional_entry(page))
> +                                       mss->swap += PAGE_SIZE;
> +                               else
> +                                       page_cache_release(page);
> +                       }
> +                       page = NULL;
> +               }
>         }
>
>         if (!page)
> --
> 2.1.4
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 3/4] mm, shmem: Add shmem resident memory accounting
From: Michael Kerrisk @ 2015-02-27 10:38 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Jerome Marchand, Linux Kernel, Andrew Morton, linux-doc,
	Hugh Dickins, Michal Hocko, Kirill A. Shutemov, Cyrill Gorcunov,
	Randy Dunlap, linux-s390, Martin Schwidefsky, Heiko Carstens,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Oleg Nesterov, Linux API
In-Reply-To: <1424958666-18241-4-git-send-email-vbabka@suse.cz>

[CC += linux-api@]

On Thu, Feb 26, 2015 at 2:51 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> From: Jerome Marchand <jmarchan@redhat.com>
>
> Currently looking at /proc/<pid>/status or statm, there is no way to
> distinguish shmem pages from pages mapped to a regular file (shmem
> pages are mapped to /dev/zero), even though their implication in
> actual memory use is quite different.
> This patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
> shmem pages instead of MM_FILEPAGES.
>
> [vbabka@suse.cz: port to 4.0, add #ifdefs, mm_counter_file() variant]
> Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  arch/s390/mm/pgtable.c   |  5 +----
>  fs/proc/task_mmu.c       |  4 +++-
>  include/linux/mm.h       | 28 ++++++++++++++++++++++++++++
>  include/linux/mm_types.h |  9 ++++++---
>  kernel/events/uprobes.c  |  2 +-
>  mm/memory.c              | 30 ++++++++++--------------------
>  mm/oom_kill.c            |  5 +++--
>  mm/rmap.c                | 15 ++++-----------
>  8 files changed, 56 insertions(+), 42 deletions(-)
>
> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
> index b2c1542..5bffd5d 100644
> --- a/arch/s390/mm/pgtable.c
> +++ b/arch/s390/mm/pgtable.c
> @@ -617,10 +617,7 @@ static void gmap_zap_swap_entry(swp_entry_t entry, struct mm_struct *mm)
>         else if (is_migration_entry(entry)) {
>                 struct page *page = migration_entry_to_page(entry);
>
> -               if (PageAnon(page))
> -                       dec_mm_counter(mm, MM_ANONPAGES);
> -               else
> -                       dec_mm_counter(mm, MM_FILEPAGES);
> +               dec_mm_counter(mm, mm_counter(page));
>         }
>         free_swap_and_cache(entry);
>  }
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 0410309..d70334c 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -81,7 +81,8 @@ unsigned long task_statm(struct mm_struct *mm,
>                          unsigned long *shared, unsigned long *text,
>                          unsigned long *data, unsigned long *resident)
>  {
> -       *shared = get_mm_counter(mm, MM_FILEPAGES);
> +       *shared = get_mm_counter(mm, MM_FILEPAGES) +
> +               get_mm_counter(mm, MM_SHMEMPAGES);
>         *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>                                                                 >> PAGE_SHIFT;
>         *data = mm->total_vm - mm->shared_vm;
> @@ -501,6 +502,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>                                         pte_none(*pte) && vma->vm_file) {
>                 struct address_space *mapping =
>                         file_inode(vma->vm_file)->i_mapping;
> +               pgoff_t pgoff = linear_page_index(vma, addr);
>
>                 /*
>                  * shmem does not use swap pte's so we have to consult
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 47a9392..adfbb5b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1364,6 +1364,16 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>         return (unsigned long)val;
>  }
>
> +/* A wrapper for the CONFIG_SHMEM dependent counter */
> +static inline unsigned long get_mm_counter_shmem(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_SHMEM
> +       return get_mm_counter(mm, MM_SHMEMPAGES);
> +#else
> +       return 0;
> +#endif
> +}
> +
>  static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
>  {
>         atomic_long_add(value, &mm->rss_stat.count[member]);
> @@ -1379,9 +1389,27 @@ static inline void dec_mm_counter(struct mm_struct *mm, int member)
>         atomic_long_dec(&mm->rss_stat.count[member]);
>  }
>
> +/* Optimized variant when page is already known not to be PageAnon */
> +static inline int mm_counter_file(struct page *page)
> +{
> +#ifdef CONFIG_SHMEM
> +       if (PageSwapBacked(page))
> +               return MM_SHMEMPAGES;
> +#endif
> +       return MM_FILEPAGES;
> +}
> +
> +static inline int mm_counter(struct page *page)
> +{
> +       if (PageAnon(page))
> +               return MM_ANONPAGES;
> +       return mm_counter_file(page);
> +}
> +
>  static inline unsigned long get_mm_rss(struct mm_struct *mm)
>  {
>         return get_mm_counter(mm, MM_FILEPAGES) +
> +               get_mm_counter_shmem(mm) +
>                 get_mm_counter(mm, MM_ANONPAGES);
>  }
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 199a03a..d3c2372 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -327,9 +327,12 @@ struct core_state {
>  };
>
>  enum {
> -       MM_FILEPAGES,
> -       MM_ANONPAGES,
> -       MM_SWAPENTS,
> +       MM_FILEPAGES,   /* Resident file mapping pages */
> +       MM_ANONPAGES,   /* Resident anonymous pages */
> +       MM_SWAPENTS,    /* Anonymous swap entries */
> +#ifdef CONFIG_SHMEM
> +       MM_SHMEMPAGES,  /* Resident shared memory pages */
> +#endif
>         NR_MM_COUNTERS
>  };
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index cb346f2..0a08fdd 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -188,7 +188,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>         lru_cache_add_active_or_unevictable(kpage, vma);
>
>         if (!PageAnon(page)) {
> -               dec_mm_counter(mm, MM_FILEPAGES);
> +               dec_mm_counter(mm, mm_counter_file(page));
>                 inc_mm_counter(mm, MM_ANONPAGES);
>         }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 8068893..f145d9e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -832,10 +832,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>                 } else if (is_migration_entry(entry)) {
>                         page = migration_entry_to_page(entry);
>
> -                       if (PageAnon(page))
> -                               rss[MM_ANONPAGES]++;
> -                       else
> -                               rss[MM_FILEPAGES]++;
> +                       rss[mm_counter(page)]++;
>
>                         if (is_write_migration_entry(entry) &&
>                                         is_cow_mapping(vm_flags)) {
> @@ -874,10 +871,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>         if (page) {
>                 get_page(page);
>                 page_dup_rmap(page);
> -               if (PageAnon(page))
> -                       rss[MM_ANONPAGES]++;
> -               else
> -                       rss[MM_FILEPAGES]++;
> +               rss[mm_counter(page)]++;
>         }
>
>  out_set_pte:
> @@ -1113,9 +1107,8 @@ again:
>                         tlb_remove_tlb_entry(tlb, pte, addr);
>                         if (unlikely(!page))
>                                 continue;
> -                       if (PageAnon(page))
> -                               rss[MM_ANONPAGES]--;
> -                       else {
> +
> +                       if (!PageAnon(page)) {
>                                 if (pte_dirty(ptent)) {
>                                         force_flush = 1;
>                                         set_page_dirty(page);
> @@ -1123,8 +1116,8 @@ again:
>                                 if (pte_young(ptent) &&
>                                     likely(!(vma->vm_flags & VM_SEQ_READ)))
>                                         mark_page_accessed(page);
> -                               rss[MM_FILEPAGES]--;
>                         }
> +                       rss[mm_counter(page)]--;
>                         page_remove_rmap(page);
>                         if (unlikely(page_mapcount(page) < 0))
>                                 print_bad_pte(vma, addr, ptent, page);
> @@ -1146,11 +1139,7 @@ again:
>                         struct page *page;
>
>                         page = migration_entry_to_page(entry);
> -
> -                       if (PageAnon(page))
> -                               rss[MM_ANONPAGES]--;
> -                       else
> -                               rss[MM_FILEPAGES]--;
> +                       rss[mm_counter(page)]--;
>                 }
>                 if (unlikely(!free_swap_and_cache(entry)))
>                         print_bad_pte(vma, addr, ptent, NULL);
> @@ -1460,7 +1449,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
>
>         /* Ok, finally just insert the thing.. */
>         get_page(page);
> -       inc_mm_counter_fast(mm, MM_FILEPAGES);
> +       inc_mm_counter_fast(mm, mm_counter_file(page));
>         page_add_file_rmap(page);
>         set_pte_at(mm, addr, pte, mk_pte(page, prot));
>
> @@ -2174,7 +2163,8 @@ gotten:
>         if (likely(pte_same(*page_table, orig_pte))) {
>                 if (old_page) {
>                         if (!PageAnon(old_page)) {
> -                               dec_mm_counter_fast(mm, MM_FILEPAGES);
> +                               dec_mm_counter_fast(mm,
> +                                               mm_counter_file(old_page));
>                                 inc_mm_counter_fast(mm, MM_ANONPAGES);
>                         }
>                 } else
> @@ -2703,7 +2693,7 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
>                 inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>                 page_add_new_anon_rmap(page, vma, address);
>         } else {
> -               inc_mm_counter_fast(vma->vm_mm, MM_FILEPAGES);
> +               inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
>                 page_add_file_rmap(page);
>         }
>         set_pte_at(vma->vm_mm, address, pte, entry);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 642f38c..a5ee3a2 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -573,10 +573,11 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>         /* mm cannot safely be dereferenced after task_unlock(victim) */
>         mm = victim->mm;
>         mark_tsk_oom_victim(victim);
> -       pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
> +       pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>                 task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
>                 K(get_mm_counter(victim->mm, MM_ANONPAGES)),
> -               K(get_mm_counter(victim->mm, MM_FILEPAGES)));
> +               K(get_mm_counter(victim->mm, MM_FILEPAGES)),
> +               K(get_mm_counter_shmem(victim->mm)));
>         task_unlock(victim);
>
>         /*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5e3e090..e3c4392 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1216,12 +1216,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>         update_hiwater_rss(mm);
>
>         if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
> -               if (!PageHuge(page)) {
> -                       if (PageAnon(page))
> -                               dec_mm_counter(mm, MM_ANONPAGES);
> -                       else
> -                               dec_mm_counter(mm, MM_FILEPAGES);
> -               }
> +               if (!PageHuge(page))
> +                       dec_mm_counter(mm, mm_counter(page));
>                 set_pte_at(mm, address, pte,
>                            swp_entry_to_pte(make_hwpoison_entry(page)));
>         } else if (pte_unused(pteval)) {
> @@ -1230,10 +1226,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>                  * interest anymore. Simply discard the pte, vmscan
>                  * will take care of the rest.
>                  */
> -               if (PageAnon(page))
> -                       dec_mm_counter(mm, MM_ANONPAGES);
> -               else
> -                       dec_mm_counter(mm, MM_FILEPAGES);
> +               dec_mm_counter(mm, mm_counter(page));
>         } else if (PageAnon(page)) {
>                 swp_entry_t entry = { .val = page_private(page) };
>                 pte_t swp_pte;
> @@ -1276,7 +1269,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>                 entry = make_migration_entry(page, pte_write(pteval));
>                 set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
>         } else
> -               dec_mm_counter(mm, MM_FILEPAGES);
> +               dec_mm_counter(mm, mm_counter_file(page));
>
>         page_remove_rmap(page);
>         page_cache_release(page);
> --
> 2.1.4
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 4/4] mm, procfs: Display VmAnon, VmFile and VmShm in /proc/pid/status
From: Michael Kerrisk @ 2015-02-27 10:38 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Jerome Marchand, Linux Kernel, Andrew Morton, linux-doc,
	Hugh Dickins, Michal Hocko, Kirill A. Shutemov, Cyrill Gorcunov,
	Randy Dunlap, linux-s390, Martin Schwidefsky, Heiko Carstens,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Oleg Nesterov, Linux API
In-Reply-To: <1424958666-18241-5-git-send-email-vbabka@suse.cz>

[CC += linux-api@]

On Thu, Feb 26, 2015 at 2:51 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> From: Jerome Marchand <jmarchan@redhat.com>
>
> It's currently inconvenient to retrieve MM_ANONPAGES value from status
> and statm files and there is no way to separate MM_FILEPAGES and
> MM_SHMEMPAGES. Add VmAnon, VmFile and VmShm lines in /proc/<pid>/status
> to solve these issues.
>
> Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  Documentation/filesystems/proc.txt | 10 +++++++++-
>  fs/proc/task_mmu.c                 | 13 +++++++++++--
>  2 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 8b30543..c777adb 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -168,6 +168,9 @@ read the file /proc/PID/status:
>    VmLck:         0 kB
>    VmHWM:       476 kB
>    VmRSS:       476 kB
> +  VmAnon:      352 kB
> +  VmFile:      120 kB
> +  VmShm:         4 kB
>    VmData:      156 kB
>    VmStk:        88 kB
>    VmExe:        68 kB
> @@ -224,7 +227,12 @@ Table 1-2: Contents of the status files (as of 2.6.30-rc7)
>   VmSize                      total program size
>   VmLck                       locked memory size
>   VmHWM                       peak resident set size ("high water mark")
> - VmRSS                       size of memory portions
> + VmRSS                       size of memory portions. It contains the three
> +                             following parts (VmRSS = VmAnon + VmFile + VmShm)
> + VmAnon                      size of resident anonymous memory
> + VmFile                      size of resident file mappings
> + VmShm                       size of resident shmem memory (includes SysV shm,
> +                             mapping of tmpfs and shared anonymous mappings)
>   VmData                      size of data, stack, and text segments
>   VmStk                       size of data, stack, and text segments
>   VmExe                       size of text segment
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index d70334c..a77a3ac 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -22,7 +22,7 @@
>
>  void task_mem(struct seq_file *m, struct mm_struct *mm)
>  {
> -       unsigned long data, text, lib, swap, ptes, pmds;
> +       unsigned long data, text, lib, swap, ptes, pmds, anon, file, shmem;
>         unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>
>         /*
> @@ -39,6 +39,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>         if (hiwater_rss < mm->hiwater_rss)
>                 hiwater_rss = mm->hiwater_rss;
>
> +       anon = get_mm_counter(mm, MM_ANONPAGES);
> +       file = get_mm_counter(mm, MM_FILEPAGES);
> +       shmem = get_mm_counter_shmem(mm);
>         data = mm->total_vm - mm->shared_vm - mm->stack_vm;
>         text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
>         lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
> @@ -52,6 +55,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>                 "VmPin:\t%8lu kB\n"
>                 "VmHWM:\t%8lu kB\n"
>                 "VmRSS:\t%8lu kB\n"
> +               "VmAnon:\t%8lu kB\n"
> +               "VmFile:\t%8lu kB\n"
> +               "VmShm:\t%8lu kB\n"
>                 "VmData:\t%8lu kB\n"
>                 "VmStk:\t%8lu kB\n"
>                 "VmExe:\t%8lu kB\n"
> @@ -65,6 +71,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>                 mm->pinned_vm << (PAGE_SHIFT-10),
>                 hiwater_rss << (PAGE_SHIFT-10),
>                 total_rss << (PAGE_SHIFT-10),
> +               anon << (PAGE_SHIFT-10),
> +               file << (PAGE_SHIFT-10),
> +               shmem << (PAGE_SHIFT-10),
>                 data << (PAGE_SHIFT-10),
>                 mm->stack_vm << (PAGE_SHIFT-10), text, lib,
>                 ptes >> 10,
> @@ -82,7 +91,7 @@ unsigned long task_statm(struct mm_struct *mm,
>                          unsigned long *data, unsigned long *resident)
>  {
>         *shared = get_mm_counter(mm, MM_FILEPAGES) +
> -               get_mm_counter(mm, MM_SHMEMPAGES);
> +               get_mm_counter_shmem(mm);
>         *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>                                                                 >> PAGE_SHIFT;
>         *data = mm->total_vm - mm->shared_vm;
> --
> 2.1.4
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 0/4] enhance shmem process and swap accounting
From: Vlastimil Babka @ 2015-02-27 10:52 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: linux-mm, Jerome Marchand, Linux Kernel, Andrew Morton,
	linux-doc-u79uwXL29TY76Z2rM5mHXA, Hugh Dickins, Michal Hocko,
	Kirill A. Shutemov, Cyrill Gorcunov, Randy Dunlap,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, Martin Schwidefsky,
	Heiko Carstens, Peter Zijlstra, Paul Mackerras,
	Arnaldo Carvalho de Melo, Oleg Nesterov, Linux API
In-Reply-To: <CAHO5Pa0xmquUbzkZvow_PxRGZpA7MVEPFcRL2LPXv7hU41uxDw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 02/27/2015 11:36 AM, Michael Kerrisk wrote:
> [CC += linux-api@]
> 
> Hello Vlastimil,
> 
> Since this is a kernel-user-space API change, please CC linux-api@.
> The kernel source file Documentation/SubmitChecklist notes that all
> Linux kernel patches that change userspace interfaces should be CCed
> to linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, so that the various parties who are
> interested in API changes are informed. For further information, see
> https://www.kernel.org/doc/man-pages/linux-api-ml.html

Yes I meant to do that but forgot in the end, what a shame. Sorry for the trouble.

^ permalink raw reply

* Re: Documenting MS_LAZYTIME
From: Theodore Ts'o @ 2015-02-27 14:18 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Omar Sandoval, Eric Sandeen, linux-man-u79uwXL29TY76Z2rM5mHXA,
	Linux API, XFS Developers, Linux-Fsdevel, Ext4 Developers List,
	Linux btrfs Developers List
In-Reply-To: <54F02C73.5090601-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

With Omar's suggestions, this looks great.

Thanks!!

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH RFC 0/3] Drivers: hv: utils: re-implement the kernel/userspace communication layer
From: Vitaly Kuznetsov @ 2015-02-27 16:14 UTC (permalink / raw)
  To: K. Y. Srinivasan, devel-tBiZLqfeLfOHmIFyCCdPziST3g8Odh+X
  Cc: Haiyang Zhang, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dexuan Cui,
	Radim Krčmář, Greg Kroah-Hartman,
	linux-api-u79uwXL29TY76Z2rM5mHXA

This series converts kvp/vss daemons to use misc char devices instead of
netlink for userspace/kernel communication and then updates fcopy to be
consistent with kvp/vss.

Userspace/kernel communication via netlink has a number of issues:
- It is hard for userspace to figure out if the kernel part was loaded or not
  and this fact can change as there is a way to enable/disable the service from
  host side. Racy daemon startup is also a problem.
- When the userspace daemon restarts/dies kernel part doesn't receive a
  notification.
- Netlink communication is not stable under heavy load.
- ...

RFC: I'm a bit puzzled on how to split commits 1 and 2 avoiding breakages.
Commit 3 can definitely be split, however, it is consistent with commits 1 and
2 at this moment and I'm not sure such split will simplify the review.

Vitaly Kuznetsov (3):
  Drivers: hv: kvp: convert userspace/kernel communication to using char
    device
  Drivers: hv: vss: convert userspace/kernel communication to using char
    device
  Drivers: hv: fcopy: make it consistent with vss/kvp

 drivers/hv/hv_fcopy.c       | 395 +++++++++++++++++++++++++------------------
 drivers/hv/hv_kvp.c         | 396 +++++++++++++++++++++++++++-----------------
 drivers/hv/hv_snapshot.c    | 335 +++++++++++++++++++++++++++----------
 include/uapi/linux/hyperv.h |  10 ++
 tools/hv/hv_fcopy_daemon.c  |  48 ++++--
 tools/hv/hv_kvp_daemon.c    | 187 ++++-----------------
 tools/hv/hv_vss_daemon.c    | 141 +++-------------
 7 files changed, 824 insertions(+), 688 deletions(-)

-- 
1.9.3

^ permalink raw reply

* [PATCH RFC 1/3] Drivers: hv: kvp: convert userspace/kernel communication to using char device
From: Vitaly Kuznetsov @ 2015-02-27 16:14 UTC (permalink / raw)
  To: K. Y. Srinivasan, devel
  Cc: Haiyang Zhang, linux-kernel, Dexuan Cui,
	Radim Krčmář, Greg Kroah-Hartman, linux-api
In-Reply-To: <1425053665-635-1-git-send-email-vkuznets@redhat.com>

Userspace/kernel communication via netlink has a number of issues:
- It is hard for userspace to figure out if the kernel part was loaded or not
  and this fact can change as there is a way to enable/disable the service from
  host side. Racy daemon startup is also a problem.
- When the userspace daemon restarts/dies kernel part doesn't receive a
  notification.
- Netlink communication is not stable under heavy load.
- ...

Re-implement the communication using misc char device. Use ioctl to do
kernel/userspace version negotiation (doesn't make much sense at this moment
as we're breaking backwards compatibility but can be used in future).

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 drivers/hv/hv_kvp.c         | 396 +++++++++++++++++++++++++++-----------------
 include/uapi/linux/hyperv.h |   8 +
 tools/hv/hv_kvp_daemon.c    | 187 ++++-----------------
 3 files changed, 287 insertions(+), 304 deletions(-)

diff --git a/drivers/hv/hv_kvp.c b/drivers/hv/hv_kvp.c
index beb8105..8078b1a 100644
--- a/drivers/hv/hv_kvp.c
+++ b/drivers/hv/hv_kvp.c
@@ -22,12 +22,16 @@
  */
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
-#include <linux/net.h>
 #include <linux/nls.h>
-#include <linux/connector.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
 #include <linux/workqueue.h>
+#include <linux/mutex.h>
 #include <linux/hyperv.h>
+#include <linux/miscdevice.h>
+#include <linux/poll.h>
 
+#include <linux/uaccess.h>
 
 /*
  * Pre win8 version numbers used in ws2008 and ws 2008 r2 (win7)
@@ -45,46 +49,41 @@
 #define WIN8_SRV_VERSION     (WIN8_SRV_MAJOR << 16 | WIN8_SRV_MINOR)
 
 /*
- * Global state maintained for transaction that is being processed.
- * Note that only one transaction can be active at any point in time.
- *
- * This state is set when we receive a request from the host; we
- * cleanup this state when the transaction is completed - when we respond
- * to the host with the key value.
+ * Global state maintained for the device. Note that only one transaction can
+ * be active at any point in time.
  */
 
+enum kvp_device_state {
+	KVP_DEVICE_INITIALIZING = 0, /* driver was loaded */
+	KVP_DEVICE_OPENED,           /* device was opened */
+	KVP_READY,                   /* userspace was registered */
+	KVP_HOSTMSG_RECEIVED,        /* message from host was received */
+	KVP_USERMSG_READY,           /* message for userspace is ready */
+	KVP_USERSPACE_REQ,           /* request to userspace was sent */
+	KVP_USERSPACE_RECV,          /* reply from userspace was received */
+	KVP_DEVICE_DYING,            /* driver unload is in progress */
+};
+
 static struct {
-	bool active; /* transaction status - active or not */
+	int state; /* kvp_device_state */
 	int recv_len; /* number of bytes received. */
-	struct hv_kvp_msg  *kvp_msg; /* current message */
 	struct vmbus_channel *recv_channel; /* chn we got the request */
 	u64 recv_req_id; /* request ID. */
 	void *kvp_context; /* for the channel callback */
-} kvp_transaction;
-
-/*
- * Before we can accept KVP messages from the host, we need
- * to handshake with the user level daemon. This state tracks
- * if we are in the handshake phase.
- */
-static bool in_hand_shake = true;
-
-/*
- * This state maintains the version number registered by the daemon.
- */
-static int dm_reg_value;
+	int dm_reg_value; /* daemon version number */
+	struct mutex lock; /* syncronization */
+	struct hv_kvp_msg user_msg; /* message to/from userspace */
+	struct hv_kvp_msg host_msg; /* message to/from host */
+	wait_queue_head_t proc_list; /* waiting processes */
+} kvp_device;
 
 static void kvp_send_key(struct work_struct *dummy);
-
-
-static void kvp_respond_to_host(struct hv_kvp_msg *msg, int error);
+static void kvp_respond_to_host(int error);
 static void kvp_work_func(struct work_struct *dummy);
-static void kvp_register(int);
 
 static DECLARE_DELAYED_WORK(kvp_work, kvp_work_func);
 static DECLARE_WORK(kvp_sendkey_work, kvp_send_key);
 
-static struct cb_id kvp_id = { CN_KVP_IDX, CN_KVP_VAL };
 static const char kvp_name[] = "kvp_kernel_module";
 static u8 *recv_buffer;
 /*
@@ -92,31 +91,8 @@ static u8 *recv_buffer;
  * As part of this registration, pass the LIC version number.
  * This number has no meaning, it satisfies the registration protocol.
  */
-#define HV_DRV_VERSION           "3.1"
-
-static void
-kvp_register(int reg_value)
-{
-
-	struct cn_msg *msg;
-	struct hv_kvp_msg *kvp_msg;
-	char *version;
-
-	msg = kzalloc(sizeof(*msg) + sizeof(struct hv_kvp_msg), GFP_ATOMIC);
+#define HV_DRV_VERSION           31
 
-	if (msg) {
-		kvp_msg = (struct hv_kvp_msg *)msg->data;
-		version = kvp_msg->body.kvp_register.version;
-		msg->id.idx =  CN_KVP_IDX;
-		msg->id.val = CN_KVP_VAL;
-
-		kvp_msg->kvp_hdr.operation = reg_value;
-		strcpy(version, HV_DRV_VERSION);
-		msg->len = sizeof(struct hv_kvp_msg);
-		cn_netlink_send(msg, 0, 0, GFP_ATOMIC);
-		kfree(msg);
-	}
-}
 static void
 kvp_work_func(struct work_struct *dummy)
 {
@@ -124,7 +100,7 @@ kvp_work_func(struct work_struct *dummy)
 	 * If the timer fires, the user-mode component has not responded;
 	 * process the pending transaction.
 	 */
-	kvp_respond_to_host(NULL, HV_E_FAIL);
+	kvp_respond_to_host(HV_E_FAIL);
 }
 
 static void poll_channel(struct vmbus_channel *channel)
@@ -138,36 +114,26 @@ static void poll_channel(struct vmbus_channel *channel)
 }
 
 
-static int kvp_handle_handshake(struct hv_kvp_msg *msg)
+static int kvp_handle_handshake(u32 op)
 {
-	int ret = 1;
+	int ret = 0;
 
-	switch (msg->kvp_hdr.operation) {
+	switch (op) {
 	case KVP_OP_REGISTER:
-		dm_reg_value = KVP_OP_REGISTER;
+		kvp_device.dm_reg_value = KVP_OP_REGISTER;
 		pr_info("KVP: IP injection functionality not available\n");
 		pr_info("KVP: Upgrade the KVP daemon\n");
 		break;
 	case KVP_OP_REGISTER1:
-		dm_reg_value = KVP_OP_REGISTER1;
+		kvp_device.dm_reg_value = KVP_OP_REGISTER1;
 		break;
 	default:
 		pr_info("KVP: incompatible daemon\n");
 		pr_info("KVP: KVP version: %d, Daemon version: %d\n",
-			KVP_OP_REGISTER1, msg->kvp_hdr.operation);
-		ret = 0;
+			KVP_OP_REGISTER1, op);
+		ret = 1;
 	}
 
-	if (ret) {
-		/*
-		 * We have a compatible daemon; complete the handshake.
-		 */
-		pr_info("KVP: user-mode registering done.\n");
-		kvp_register(dm_reg_value);
-		kvp_transaction.active = false;
-		if (kvp_transaction.kvp_context)
-			poll_channel(kvp_transaction.kvp_context);
-	}
 	return ret;
 }
 
@@ -176,25 +142,11 @@ static int kvp_handle_handshake(struct hv_kvp_msg *msg)
  * Callback when data is received from user mode.
  */
 
-static void
-kvp_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
+static void kvp_userwrite_callback(void)
 {
-	struct hv_kvp_msg *message;
+	struct hv_kvp_msg *message = &kvp_device.user_msg;
 	struct hv_kvp_msg_enumerate *data;
-	int	error = 0;
-
-	message = (struct hv_kvp_msg *)msg->data;
-
-	/*
-	 * If we are negotiating the version information
-	 * with the daemon; handle that first.
-	 */
-
-	if (in_hand_shake) {
-		if (kvp_handle_handshake(message))
-			in_hand_shake = false;
-		return;
-	}
+	int error = 0;
 
 	/*
 	 * Based on the version of the daemon, we propagate errors from the
@@ -203,7 +155,7 @@ kvp_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
 
 	data = &message->body.kvp_enum_data;
 
-	switch (dm_reg_value) {
+	switch (kvp_device.dm_reg_value) {
 	case KVP_OP_REGISTER:
 		/*
 		 * Null string is used to pass back error condition.
@@ -226,10 +178,9 @@ kvp_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
 	 * to the host. But first, cancel the timeout.
 	 */
 	if (cancel_delayed_work_sync(&kvp_work))
-		kvp_respond_to_host(message, error);
+		kvp_respond_to_host(error);
 }
 
-
 static int process_ob_ipinfo(void *in_msg, void *out_msg, int op)
 {
 	struct hv_kvp_msg *in = in_msg;
@@ -337,32 +288,21 @@ static void process_ib_ipinfo(void *in_msg, void *out_msg, int op)
 	}
 }
 
-
-
-
 static void
 kvp_send_key(struct work_struct *dummy)
 {
-	struct cn_msg *msg;
-	struct hv_kvp_msg *message;
-	struct hv_kvp_msg *in_msg;
-	__u8 operation = kvp_transaction.kvp_msg->kvp_hdr.operation;
-	__u8 pool = kvp_transaction.kvp_msg->kvp_hdr.pool;
+	struct hv_kvp_msg *message = &kvp_device.user_msg;
+	struct hv_kvp_msg *in_msg = &kvp_device.host_msg;
+	__u8 operation = in_msg->kvp_hdr.operation;
+	__u8 pool = in_msg->kvp_hdr.pool;
 	__u32 val32;
 	__u64 val64;
-	int rc;
 
-	msg = kzalloc(sizeof(*msg) + sizeof(struct hv_kvp_msg) , GFP_ATOMIC);
-	if (!msg)
-		return;
-
-	msg->id.idx =  CN_KVP_IDX;
-	msg->id.val = CN_KVP_VAL;
+	mutex_lock(&kvp_device.lock);
 
-	message = (struct hv_kvp_msg *)msg->data;
+	memset(message, 0, sizeof(struct hv_kvp_msg));
 	message->kvp_hdr.operation = operation;
 	message->kvp_hdr.pool = pool;
-	in_msg = kvp_transaction.kvp_msg;
 
 	/*
 	 * The key/value strings sent from the host are encoded in
@@ -446,15 +386,10 @@ kvp_send_key(struct work_struct *dummy)
 			break;
 	}
 
-	msg->len = sizeof(struct hv_kvp_msg);
-	rc = cn_netlink_send(msg, 0, 0, GFP_ATOMIC);
-	if (rc) {
-		pr_debug("KVP: failed to communicate to the daemon: %d\n", rc);
-		if (cancel_delayed_work_sync(&kvp_work))
-			kvp_respond_to_host(message, HV_E_FAIL);
-	}
+	kvp_device.state = KVP_USERMSG_READY;
+	wake_up_interruptible(&kvp_device.proc_list);
 
-	kfree(msg);
+	mutex_unlock(&kvp_device.lock);
 
 	return;
 }
@@ -463,10 +398,10 @@ kvp_send_key(struct work_struct *dummy)
  * Send a response back to the host.
  */
 
-static void
-kvp_respond_to_host(struct hv_kvp_msg *msg_to_host, int error)
+static void kvp_respond_to_host(int error)
 {
 	struct hv_kvp_msg  *kvp_msg;
+	struct hv_kvp_msg  *msg_to_host = &kvp_device.user_msg;
 	struct hv_kvp_exchg_msg_value  *kvp_data;
 	char	*key_name;
 	char	*value;
@@ -479,26 +414,13 @@ kvp_respond_to_host(struct hv_kvp_msg *msg_to_host, int error)
 	int ret;
 
 	/*
-	 * If a transaction is not active; log and return.
-	 */
-
-	if (!kvp_transaction.active) {
-		/*
-		 * This is a spurious call!
-		 */
-		pr_warn("KVP: Transaction not active\n");
-		return;
-	}
-	/*
 	 * Copy the global state for completing the transaction. Note that
 	 * only one transaction can be active at a time.
 	 */
 
-	buf_len = kvp_transaction.recv_len;
-	channel = kvp_transaction.recv_channel;
-	req_id = kvp_transaction.recv_req_id;
-
-	kvp_transaction.active = false;
+	buf_len = kvp_device.recv_len;
+	channel = kvp_device.recv_channel;
+	req_id = kvp_device.recv_req_id;
 
 	icmsghdrp = (struct icmsg_hdr *)
 			&recv_buffer[sizeof(struct vmbuspipe_hdr)];
@@ -528,7 +450,8 @@ kvp_respond_to_host(struct hv_kvp_msg *msg_to_host, int error)
 			&recv_buffer[sizeof(struct vmbuspipe_hdr) +
 			sizeof(struct icmsg_hdr)];
 
-	switch (kvp_transaction.kvp_msg->kvp_hdr.operation) {
+
+	switch (kvp_device.host_msg.kvp_hdr.operation) {
 	case KVP_OP_GET_IP_INFO:
 		ret = process_ob_ipinfo(msg_to_host,
 				 (struct hv_kvp_ip_msg *)kvp_msg,
@@ -586,6 +509,17 @@ response_done:
 
 	vmbus_sendpacket(channel, recv_buffer, buf_len, req_id,
 				VM_PKT_DATA_INBAND, 0);
+
+	/* We're ready to process next request, reset the device state */
+	if (kvp_device.state == KVP_USERSPACE_RECV ||
+	    kvp_device.state == KVP_USERSPACE_REQ)
+		kvp_device.state = KVP_READY;
+	/*
+	 * Make sure device state was set before polling the channel as
+	 * processing can happen on a different CPU.
+	 */
+	smp_mb();
+
 	poll_channel(channel);
 }
 
@@ -612,14 +546,15 @@ void hv_kvp_onchannelcallback(void *context)
 	int util_fw_version;
 	int kvp_srv_version;
 
-	if (kvp_transaction.active) {
+	if (kvp_device.state > KVP_READY) {
 		/*
 		 * We will defer processing this callback once
 		 * the current transaction is complete.
 		 */
-		kvp_transaction.kvp_context = context;
+		kvp_device.kvp_context = channel;
 		return;
 	}
+	kvp_device.kvp_context = NULL;
 
 	vmbus_recvpacket(channel, recv_buffer, PAGE_SIZE * 4, &recvlen,
 			 &requestid);
@@ -661,11 +596,19 @@ void hv_kvp_onchannelcallback(void *context)
 			 * transaction; note transactions are serialized.
 			 */
 
-			kvp_transaction.recv_len = recvlen;
-			kvp_transaction.recv_channel = channel;
-			kvp_transaction.recv_req_id = requestid;
-			kvp_transaction.active = true;
-			kvp_transaction.kvp_msg = kvp_msg;
+			kvp_device.recv_len = recvlen;
+			kvp_device.recv_channel = channel;
+			kvp_device.recv_req_id = requestid;
+
+			if (kvp_device.state != KVP_READY) {
+				/* Userspace daemon is not connected, fail. */
+				kvp_respond_to_host(HV_E_FAIL);
+				return;
+			}
+
+			kvp_device.state = KVP_HOSTMSG_RECEIVED;
+			memcpy(&kvp_device.host_msg, kvp_msg,
+			       sizeof(struct hv_kvp_msg));
 
 			/*
 			 * Get the information from the
@@ -690,17 +633,166 @@ void hv_kvp_onchannelcallback(void *context)
 				       recvlen, requestid,
 				       VM_PKT_DATA_INBAND, 0);
 	}
+}
+
+static int kvp_op_open(struct inode *inode, struct file *f)
+{
+	if (kvp_device.state != KVP_DEVICE_INITIALIZING)
+		return -EBUSY;
+	kvp_device.state = KVP_DEVICE_OPENED;
+	return 0;
+}
+
+static int kvp_op_release(struct inode *inode, struct file *f)
+{
+	kvp_device.state = KVP_DEVICE_INITIALIZING;
+	return 0;
+}
+
+static ssize_t kvp_op_write(struct file *file, const char __user *buf,
+			    size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (kvp_device.state == KVP_DEVICE_DYING)
+		return -EFAULT;
+
+	if (count != sizeof(struct hv_kvp_msg)) {
+		pr_warn("kvp_op_write: invalid write len: %d (expected: %d)\n",
+			(int)count, (int)sizeof(struct hv_kvp_msg));
+		return -EINVAL;
+	}
 
+	mutex_lock(&kvp_device.lock);
+
+	if (kvp_device.state == KVP_USERSPACE_REQ) {
+		if (!copy_from_user(&kvp_device.user_msg, buf,
+				    sizeof(struct hv_kvp_msg))) {
+			kvp_device.state = KVP_USERSPACE_RECV;
+			kvp_userwrite_callback();
+			ret = sizeof(struct hv_kvp_msg);
+		} else
+			ret = -EFAULT;
+	} else {
+		pr_warn("kvp_op_write: invalid transaction state: %d\n",
+			kvp_device.state);
+		ret = -EINVAL;
+	}
+
+	mutex_unlock(&kvp_device.lock);
+	return ret;
 }
 
+static ssize_t kvp_op_read(struct file *file, char __user *buf,
+			   size_t count, loff_t *ppos)
+{
+	ssize_t ret = 0;
+
+	if (kvp_device.state == KVP_DEVICE_DYING)
+		return -EFAULT;
+
+	if (count != sizeof(struct hv_kvp_msg)) {
+		pr_warn("kvp_op_read: invalid read len: %d (expected: %d)\n",
+			(int)count, (int)sizeof(struct hv_kvp_msg));
+		return -EINVAL;
+	}
+
+	if (wait_event_interruptible(kvp_device.proc_list,
+				     kvp_device.state == KVP_USERMSG_READY ||
+				     kvp_device.state == KVP_DEVICE_DYING))
+		return -EFAULT;
+
+	if (kvp_device.state != KVP_USERMSG_READY)
+		return -EFAULT;
+
+	mutex_lock(&kvp_device.lock);
+
+	if (!copy_to_user(buf, &kvp_device.user_msg,
+			  sizeof(struct hv_kvp_msg))) {
+		kvp_device.state = KVP_USERSPACE_REQ;
+		ret = sizeof(struct hv_kvp_msg);
+	} else
+		ret = -EFAULT;
+
+	mutex_unlock(&kvp_device.lock);
+	return ret;
+}
+
+static unsigned int kvp_op_poll(struct file *file, poll_table *wait)
+{
+	if (kvp_device.state == KVP_DEVICE_DYING)
+		return -EFAULT;
+
+	poll_wait(file, &kvp_device.proc_list, wait);
+	if (kvp_device.state == KVP_USERMSG_READY)
+		return POLLIN | POLLRDNORM;
+	return 0;
+}
+
+static long kvp_op_ioctl(struct file *fp,
+			 unsigned int cmd, unsigned long arg)
+{
+	long ret = 0;
+	void __user *argp = (void __user *)arg;
+	u32 val32;
+
+	if (kvp_device.state == KVP_DEVICE_DYING)
+		return -EFAULT;
+
+	/* The only ioctl we have is registation */
+	if (kvp_device.state != KVP_DEVICE_OPENED)
+		return -EINVAL;
+
+	mutex_lock(&kvp_device.lock);
+
+	switch (cmd) {
+	case HYPERV_KVP_REGISTER:
+		if (copy_from_user(&val32, argp, sizeof(val32))) {
+			ret = -EFAULT;
+			break;
+		}
+		if (!kvp_handle_handshake(val32)) {
+			val32 = (u32)HV_DRV_VERSION;
+			if (copy_to_user(argp, &val32, sizeof(val32))) {
+				ret = -EFAULT;
+				break;
+			}
+			kvp_device.state = KVP_READY;
+			pr_info("KVP: user-mode registering done.\n");
+			if (kvp_device.kvp_context)
+				poll_channel(kvp_device.kvp_context);
+		} else
+			ret = -EINVAL;
+		break;
+
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	mutex_unlock(&kvp_device.lock);
+	return ret;
+}
+
+static const struct file_operations kvp_fops = {
+	.owner		= THIS_MODULE,
+	.read           = kvp_op_read,
+	.write          = kvp_op_write,
+	.release	= kvp_op_release,
+	.open		= kvp_op_open,
+	.poll		= kvp_op_poll,
+	.unlocked_ioctl = kvp_op_ioctl
+};
+
+static struct miscdevice kvp_misc = {
+	.minor          = MISC_DYNAMIC_MINOR,
+	.name           = "vmbus/hv_kvp",
+	.fops           = &kvp_fops,
+};
+
 int
 hv_kvp_init(struct hv_util_service *srv)
 {
-	int err;
-
-	err = cn_add_callback(&kvp_id, kvp_name, kvp_cn_callback);
-	if (err)
-		return err;
 	recv_buffer = srv->recv_buffer;
 
 	/*
@@ -709,14 +801,20 @@ hv_kvp_init(struct hv_util_service *srv)
 	 * Defer processing channel callbacks until the daemon
 	 * has registered.
 	 */
-	kvp_transaction.active = true;
+	kvp_device.state = KVP_DEVICE_INITIALIZING;
+	init_waitqueue_head(&kvp_device.proc_list);
+	mutex_init(&kvp_device.lock);
 
-	return 0;
+	return misc_register(&kvp_misc);
 }
 
 void hv_kvp_deinit(void)
 {
-	cn_del_callback(&kvp_id);
+	kvp_device.state = KVP_DEVICE_DYING;
+	/* Make sure nobody sees the old state */
+	smp_mb();
+	wake_up_interruptible(&kvp_device.proc_list);
 	cancel_delayed_work_sync(&kvp_work);
 	cancel_work_sync(&kvp_sendkey_work);
+	misc_deregister(&kvp_misc);
 }
diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
index bb1cb73..80713a3 100644
--- a/include/uapi/linux/hyperv.h
+++ b/include/uapi/linux/hyperv.h
@@ -26,6 +26,7 @@
 #define _UAPI_HYPERV_H
 
 #include <linux/uuid.h>
+#include <linux/types.h>
 
 /*
  * Framework version for util services.
@@ -389,4 +390,11 @@ struct hv_kvp_ip_msg {
 	struct hv_kvp_ipaddr_value      kvp_ip_val;
 } __attribute__((packed));
 
+/*
+ * Userspace registration ioctls. Userspace daemons are supposed to pass their
+ * version as a parameter and get driver version back. KVP daemon supplies
+ * either KVP_OP_REGISTER or KVP_OP_REGISTER1.
+ */
+#define HYPERV_KVP_REGISTER        _IOWR('v', 0, __u32)
+
 #endif /* _UAPI_HYPERV_H */
diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
index 408bb07..0c3cac7 100644
--- a/tools/hv/hv_kvp_daemon.c
+++ b/tools/hv/hv_kvp_daemon.c
@@ -33,7 +33,6 @@
 #include <ctype.h>
 #include <errno.h>
 #include <arpa/inet.h>
-#include <linux/connector.h>
 #include <linux/hyperv.h>
 #include <linux/netlink.h>
 #include <ifaddrs.h>
@@ -44,6 +43,7 @@
 #include <dirent.h>
 #include <net/if.h>
 #include <getopt.h>
+#include <sys/ioctl.h>
 
 /*
  * KVP protocol: The user mode component first registers with the
@@ -79,9 +79,6 @@ enum {
 	DNS
 };
 
-static struct sockaddr_nl addr;
-static int in_hand_shake = 1;
-
 static char *os_name = "";
 static char *os_major = "";
 static char *os_minor = "";
@@ -1387,34 +1384,6 @@ kvp_get_domain_name(char *buffer, int length)
 	freeaddrinfo(info);
 }
 
-static int
-netlink_send(int fd, struct cn_msg *msg)
-{
-	struct nlmsghdr nlh = { .nlmsg_type = NLMSG_DONE };
-	unsigned int size;
-	struct msghdr message;
-	struct iovec iov[2];
-
-	size = sizeof(struct cn_msg) + msg->len;
-
-	nlh.nlmsg_pid = getpid();
-	nlh.nlmsg_len = NLMSG_LENGTH(size);
-
-	iov[0].iov_base = &nlh;
-	iov[0].iov_len = sizeof(nlh);
-
-	iov[1].iov_base = msg;
-	iov[1].iov_len = size;
-
-	memset(&message, 0, sizeof(message));
-	message.msg_name = &addr;
-	message.msg_namelen = sizeof(addr);
-	message.msg_iov = iov;
-	message.msg_iovlen = 2;
-
-	return sendmsg(fd, &message, 0);
-}
-
 void print_usage(char *argv[])
 {
 	fprintf(stderr, "Usage: %s [options]\n"
@@ -1425,23 +1394,18 @@ void print_usage(char *argv[])
 
 int main(int argc, char *argv[])
 {
-	int fd, len, nl_group;
+	int kvp_fd, len;
 	int error;
-	struct cn_msg *message;
 	struct pollfd pfd;
-	struct nlmsghdr *incoming_msg;
-	struct cn_msg	*incoming_cn_msg;
-	struct hv_kvp_msg *hv_msg;
-	char	*p;
+	struct hv_kvp_msg hv_msg[1];
 	char	*key_value;
 	char	*key_name;
 	int	op;
 	int	pool;
 	char	*if_name;
 	struct hv_kvp_ipaddr_value *kvp_ip_val;
-	char *kvp_recv_buffer;
-	size_t kvp_recv_buffer_len;
 	int daemonize = 1, long_index = 0, opt;
+	__u32 daemon_ver = (__u32)KVP_OP_REGISTER1;
 
 	static struct option long_options[] = {
 		{"help",	no_argument,	   0,  'h' },
@@ -1468,12 +1432,14 @@ int main(int argc, char *argv[])
 	openlog("KVP", 0, LOG_USER);
 	syslog(LOG_INFO, "KVP starting; pid is:%d", getpid());
 
-	kvp_recv_buffer_len = NLMSG_LENGTH(0) + sizeof(struct cn_msg) + sizeof(struct hv_kvp_msg);
-	kvp_recv_buffer = calloc(1, kvp_recv_buffer_len);
-	if (!kvp_recv_buffer) {
-		syslog(LOG_ERR, "Failed to allocate netlink buffer");
+	kvp_fd = open("/dev/vmbus/hv_kvp", O_RDWR);
+
+	if (kvp_fd < 0) {
+		syslog(LOG_ERR, "open /dev/vmbus/hv_kvp failed; error: %d %s",
+			errno, strerror(errno));
 		exit(EXIT_FAILURE);
 	}
+
 	/*
 	 * Retrieve OS release information.
 	 */
@@ -1489,100 +1455,44 @@ int main(int argc, char *argv[])
 		exit(EXIT_FAILURE);
 	}
 
-	fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
-	if (fd < 0) {
-		syslog(LOG_ERR, "netlink socket creation failed; error: %d %s", errno,
-				strerror(errno));
-		exit(EXIT_FAILURE);
-	}
-	addr.nl_family = AF_NETLINK;
-	addr.nl_pad = 0;
-	addr.nl_pid = 0;
-	addr.nl_groups = 0;
-
-
-	error = bind(fd, (struct sockaddr *)&addr, sizeof(addr));
-	if (error < 0) {
-		syslog(LOG_ERR, "bind failed; error: %d %s", errno, strerror(errno));
-		close(fd);
-		exit(EXIT_FAILURE);
-	}
-	nl_group = CN_KVP_IDX;
-
-	if (setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, &nl_group, sizeof(nl_group)) < 0) {
-		syslog(LOG_ERR, "setsockopt failed; error: %d %s", errno, strerror(errno));
-		close(fd);
-		exit(EXIT_FAILURE);
-	}
-
 	/*
 	 * Register ourselves with the kernel.
 	 */
-	message = (struct cn_msg *)kvp_recv_buffer;
-	message->id.idx = CN_KVP_IDX;
-	message->id.val = CN_KVP_VAL;
-
-	hv_msg = (struct hv_kvp_msg *)message->data;
-	hv_msg->kvp_hdr.operation = KVP_OP_REGISTER1;
-	message->ack = 0;
-	message->len = sizeof(struct hv_kvp_msg);
-
-	len = netlink_send(fd, message);
-	if (len < 0) {
-		syslog(LOG_ERR, "netlink_send failed; error: %d %s", errno, strerror(errno));
-		close(fd);
+	if (ioctl(kvp_fd, HYPERV_KVP_REGISTER, &daemon_ver)) {
+		syslog(LOG_ERR, "registration to kernel failed; error: %d %s",
+		       errno, strerror(errno));
+		close(kvp_fd);
 		exit(EXIT_FAILURE);
 	}
 
-	pfd.fd = fd;
+	syslog(LOG_INFO, "KVP LIC Version: %d", daemon_ver);
+
+	pfd.fd = kvp_fd;
 
 	while (1) {
-		struct sockaddr *addr_p = (struct sockaddr *) &addr;
-		socklen_t addr_l = sizeof(addr);
 		pfd.events = POLLIN;
 		pfd.revents = 0;
 
 		if (poll(&pfd, 1, -1) < 0) {
 			syslog(LOG_ERR, "poll failed; error: %d %s", errno, strerror(errno));
 			if (errno == EINVAL) {
-				close(fd);
+				close(kvp_fd);
 				exit(EXIT_FAILURE);
 			}
 			else
 				continue;
 		}
 
-		len = recvfrom(fd, kvp_recv_buffer, kvp_recv_buffer_len, 0,
-				addr_p, &addr_l);
-
-		if (len < 0) {
-			int saved_errno = errno;
-			syslog(LOG_ERR, "recvfrom failed; pid:%u error:%d %s",
-					addr.nl_pid, errno, strerror(errno));
-
-			if (saved_errno == ENOBUFS) {
-				syslog(LOG_ERR, "receive error: ignored");
-				continue;
-			}
+		len = read(kvp_fd, hv_msg, sizeof(struct hv_kvp_msg));
 
-			close(fd);
-			return -1;
-		}
+		if (len != sizeof(struct hv_kvp_msg)) {
+			syslog(LOG_ERR, "read failed; error:%d %s",
+			       errno, strerror(errno));
 
-		if (addr.nl_pid) {
-			syslog(LOG_WARNING, "Received packet from untrusted pid:%u",
-					addr.nl_pid);
-			continue;
+			close(kvp_fd);
+			return EXIT_FAILURE;
 		}
 
-		incoming_msg = (struct nlmsghdr *)kvp_recv_buffer;
-
-		if (incoming_msg->nlmsg_type != NLMSG_DONE)
-			continue;
-
-		incoming_cn_msg = (struct cn_msg *)NLMSG_DATA(incoming_msg);
-		hv_msg = (struct hv_kvp_msg *)incoming_cn_msg->data;
-
 		/*
 		 * We will use the KVP header information to pass back
 		 * the error from this daemon. So, first copy the state
@@ -1592,24 +1502,6 @@ int main(int argc, char *argv[])
 		pool = hv_msg->kvp_hdr.pool;
 		hv_msg->error = HV_S_OK;
 
-		if ((in_hand_shake) && (op == KVP_OP_REGISTER1)) {
-			/*
-			 * Driver is registering with us; stash away the version
-			 * information.
-			 */
-			in_hand_shake = 0;
-			p = (char *)hv_msg->body.kvp_register.version;
-			lic_version = malloc(strlen(p) + 1);
-			if (lic_version) {
-				strcpy(lic_version, p);
-				syslog(LOG_INFO, "KVP LIC Version: %s",
-					lic_version);
-			} else {
-				syslog(LOG_ERR, "malloc failed");
-			}
-			continue;
-		}
-
 		switch (op) {
 		case KVP_OP_GET_IP_INFO:
 			kvp_ip_val = &hv_msg->body.kvp_ip_val;
@@ -1702,7 +1594,6 @@ int main(int argc, char *argv[])
 			goto kvp_done;
 		}
 
-		hv_msg = (struct hv_kvp_msg *)incoming_cn_msg->data;
 		key_name = (char *)hv_msg->body.kvp_enum_data.data.key;
 		key_value = (char *)hv_msg->body.kvp_enum_data.data.value;
 
@@ -1753,31 +1644,17 @@ int main(int argc, char *argv[])
 			hv_msg->error = HV_S_CONT;
 			break;
 		}
-		/*
-		 * Send the value back to the kernel. The response is
-		 * already in the receive buffer. Update the cn_msg header to
-		 * reflect the key value that has been added to the message
-		 */
-kvp_done:
-
-		incoming_cn_msg->id.idx = CN_KVP_IDX;
-		incoming_cn_msg->id.val = CN_KVP_VAL;
-		incoming_cn_msg->ack = 0;
-		incoming_cn_msg->len = sizeof(struct hv_kvp_msg);
-
-		len = netlink_send(fd, incoming_cn_msg);
-		if (len < 0) {
-			int saved_errno = errno;
-			syslog(LOG_ERR, "net_link send failed; error: %d %s", errno,
-					strerror(errno));
-
-			if (saved_errno == ENOMEM || saved_errno == ENOBUFS) {
-				syslog(LOG_ERR, "send error: ignored");
-				continue;
-			}
 
+		/* Send the value back to the kernel. */
+kvp_done:
+		len = write(kvp_fd, hv_msg, sizeof(struct hv_kvp_msg));
+		if (len != sizeof(struct hv_kvp_msg)) {
+			syslog(LOG_ERR, "write failed; error: %d %s", errno,
+			       strerror(errno));
 			exit(EXIT_FAILURE);
 		}
 	}
 
+	close(kvp_fd);
+	exit(0);
 }
-- 
1.9.3

^ permalink raw reply related

* [PATCH RFC 2/3] Drivers: hv: vss: convert userspace/kernel communication to using char device
From: Vitaly Kuznetsov @ 2015-02-27 16:14 UTC (permalink / raw)
  To: K. Y. Srinivasan, devel
  Cc: Radim Krčmář, Greg Kroah-Hartman, Haiyang Zhang,
	linux-kernel, linux-api
In-Reply-To: <1425053665-635-1-git-send-email-vkuznets@redhat.com>

Userspace/kernel communication via netlink has a number of issues:
- It is hard for userspace to figure out if the kernel part was loaded or not
  and this fact can change as there is a way to enable/disable the service from
  host side. Racy daemon startup is also a problem.
- When the userspace daemon restarts/dies kernel part doesn't receive a
  notification.
- Netlink communication is not stable under heavy load.
- ...

Re-implement the communication using misc char device. Use ioctl to do
kernel/userspace version negotiation (doesn't make much sense at this moment
as we're breaking backwards compatibility but can be used in future). Read from
the device returns struct hv_vss_msg and userspace is supposed to reply with
__u32.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 drivers/hv/hv_snapshot.c    | 335 ++++++++++++++++++++++++++++++++------------
 include/uapi/linux/hyperv.h |   1 +
 tools/hv/hv_vss_daemon.c    | 141 ++++---------------
 3 files changed, 273 insertions(+), 204 deletions(-)

diff --git a/drivers/hv/hv_snapshot.c b/drivers/hv/hv_snapshot.c
index 9d5e0d1..b399758 100644
--- a/drivers/hv/hv_snapshot.c
+++ b/drivers/hv/hv_snapshot.c
@@ -20,9 +20,12 @@
 
 #include <linux/net.h>
 #include <linux/nls.h>
-#include <linux/connector.h>
+#include <linux/sched.h>
 #include <linux/workqueue.h>
+#include <linux/mutex.h>
 #include <linux/hyperv.h>
+#include <linux/miscdevice.h>
+#include <linux/poll.h>
 
 #define VSS_MAJOR  5
 #define VSS_MINOR  0
@@ -31,26 +34,36 @@
 #define VSS_USERSPACE_TIMEOUT (msecs_to_jiffies(10 * 1000))
 
 /*
- * Global state maintained for transaction that is being processed.
- * Note that only one transaction can be active at any point in time.
- *
- * This state is set when we receive a request from the host; we
- * cleanup this state when the transaction is completed - when we respond
- * to the host with the key value.
+ * Global state maintained for the device. Note that only one transaction can
+ * be active at any point in time.
  */
 
+enum vss_device_state {
+	VSS_DEVICE_INITIALIZING = 0, /* driver was loaded */
+	VSS_DEVICE_OPENED,           /* device was opened */
+	VSS_READY,                   /* userspace registered */
+	VSS_HOSTMSG_RECEIVED,        /* message from host was received */
+	VSS_USERMSG_READY,           /* message for userspace is ready */
+	VSS_USERSPACE_REQ,           /* request to userspace was sent */
+	VSS_USERSPACE_RECV,          /* reply from userspace was received */
+	VSS_DEVICE_DYING,            /* driver unload is in progress */
+};
+
 static struct {
-	bool active; /* transaction status - active or not */
+	int state; /* vss_device_state */
 	int recv_len; /* number of bytes received. */
 	struct vmbus_channel *recv_channel; /* chn we got the request */
 	u64 recv_req_id; /* request ID. */
-	struct hv_vss_msg  *msg; /* current message */
-} vss_transaction;
-
+	void *vss_context; /* for the channel callback */
+	int dm_reg_value; /* daemon version number */
+	struct mutex lock; /* syncronization */
+	struct hv_vss_msg user_msg; /* message to/from userspace */
+	struct hv_vss_msg host_msg; /* message to/from host */
+	wait_queue_head_t proc_list; /* waiting processes */
+} vss_device;
 
-static void vss_respond_to_host(int error);
+static void vss_respond_to_host(u32 error);
 
-static struct cb_id vss_id = { CN_VSS_IDX, CN_VSS_VAL };
 static const char vss_name[] = "vss_kernel_module";
 static __u8 *recv_buffer;
 
@@ -60,6 +73,23 @@ static void vss_timeout_func(struct work_struct *dummy);
 static DECLARE_DELAYED_WORK(vss_timeout_work, vss_timeout_func);
 static DECLARE_WORK(vss_send_op_work, vss_send_op);
 
+static int vss_handle_handshake(u32 op)
+{
+	vss_device.dm_reg_value = op;
+
+	return 0;
+}
+
+static void poll_channel(struct vmbus_channel *channel)
+{
+	if (channel->target_cpu != smp_processor_id())
+		smp_call_function_single(channel->target_cpu,
+					 hv_vss_onchannelcallback,
+					 channel, true);
+	else
+		hv_vss_onchannelcallback(channel);
+}
+
 /*
  * Callback when data is received from user mode.
  */
@@ -73,62 +103,27 @@ static void vss_timeout_func(struct work_struct *dummy)
 	vss_respond_to_host(HV_E_FAIL);
 }
 
-static void
-vss_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
-{
-	struct hv_vss_msg *vss_msg;
-
-	vss_msg = (struct hv_vss_msg *)msg->data;
-
-	if (vss_msg->vss_hdr.operation == VSS_OP_REGISTER) {
-		pr_info("VSS daemon registered\n");
-		vss_transaction.active = false;
-		if (vss_transaction.recv_channel != NULL)
-			hv_vss_onchannelcallback(vss_transaction.recv_channel);
-		return;
-
-	}
-	if (cancel_delayed_work_sync(&vss_timeout_work))
-		vss_respond_to_host(vss_msg->error);
-}
-
-
 static void vss_send_op(struct work_struct *dummy)
 {
-	int op = vss_transaction.msg->vss_hdr.operation;
-	int rc;
-	struct cn_msg *msg;
-	struct hv_vss_msg *vss_msg;
+	mutex_lock(&vss_device.lock);
 
-	msg = kzalloc(sizeof(*msg) + sizeof(*vss_msg), GFP_ATOMIC);
-	if (!msg)
+	if (vss_device.state != VSS_HOSTMSG_RECEIVED)
 		return;
 
-	vss_msg = (struct hv_vss_msg *)msg->data;
-
-	msg->id.idx =  CN_VSS_IDX;
-	msg->id.val = CN_VSS_VAL;
-
-	vss_msg->vss_hdr.operation = op;
-	msg->len = sizeof(struct hv_vss_msg);
+	memcpy(&vss_device.user_msg, &vss_device.host_msg,
+	       sizeof(struct hv_vss_msg));
 
-	rc = cn_netlink_send(msg, 0, 0, GFP_ATOMIC);
-	if (rc) {
-		pr_warn("VSS: failed to communicate to the daemon: %d\n", rc);
-		if (cancel_delayed_work_sync(&vss_timeout_work))
-			vss_respond_to_host(HV_E_FAIL);
-	}
-	kfree(msg);
+	vss_device.state = VSS_USERMSG_READY;
+	wake_up_interruptible(&vss_device.proc_list);
 
+	mutex_unlock(&vss_device.lock);
 	return;
 }
 
 /*
  * Send a response back to the host.
  */
-
-static void
-vss_respond_to_host(int error)
+static void vss_respond_to_host(u32 error)
 {
 	struct icmsg_hdr *icmsghdrp;
 	u32	buf_len;
@@ -136,25 +131,13 @@ vss_respond_to_host(int error)
 	u64	req_id;
 
 	/*
-	 * If a transaction is not active; log and return.
-	 */
-
-	if (!vss_transaction.active) {
-		/*
-		 * This is a spurious call!
-		 */
-		pr_warn("VSS: Transaction not active\n");
-		return;
-	}
-	/*
 	 * Copy the global state for completing the transaction. Note that
 	 * only one transaction can be active at a time.
 	 */
 
-	buf_len = vss_transaction.recv_len;
-	channel = vss_transaction.recv_channel;
-	req_id = vss_transaction.recv_req_id;
-	vss_transaction.active = false;
+	buf_len = vss_device.recv_len;
+	channel = vss_device.recv_channel;
+	req_id = vss_device.recv_req_id;
 
 	icmsghdrp = (struct icmsg_hdr *)
 			&recv_buffer[sizeof(struct vmbuspipe_hdr)];
@@ -173,6 +156,17 @@ vss_respond_to_host(int error)
 	vmbus_sendpacket(channel, recv_buffer, buf_len, req_id,
 				VM_PKT_DATA_INBAND, 0);
 
+	/* We're ready to process next request, reset the device state */
+	if (vss_device.state == VSS_USERSPACE_RECV ||
+	    vss_device.state == VSS_USERSPACE_REQ)
+		vss_device.state = VSS_READY;
+	/*
+	 * Make sure device state was set before polling the channel as
+	 * processing can happen on a different CPU.
+	 */
+	smp_mb();
+
+	poll_channel(channel);
 }
 
 /*
@@ -186,19 +180,18 @@ void hv_vss_onchannelcallback(void *context)
 	u32 recvlen;
 	u64 requestid;
 	struct hv_vss_msg *vss_msg;
-
-
 	struct icmsg_hdr *icmsghdrp;
 	struct icmsg_negotiate *negop = NULL;
 
-	if (vss_transaction.active) {
+	if (vss_device.state > VSS_READY) {
 		/*
 		 * We will defer processing this callback once
 		 * the current transaction is complete.
 		 */
-		vss_transaction.recv_channel = channel;
+		vss_device.vss_context = channel;
 		return;
 	}
+	vss_device.vss_context = NULL;
 
 	vmbus_recvpacket(channel, recv_buffer, PAGE_SIZE * 2, &recvlen,
 			 &requestid);
@@ -221,11 +214,9 @@ void hv_vss_onchannelcallback(void *context)
 			 * transaction; note transactions are serialized.
 			 */
 
-			vss_transaction.recv_len = recvlen;
-			vss_transaction.recv_channel = channel;
-			vss_transaction.recv_req_id = requestid;
-			vss_transaction.active = true;
-			vss_transaction.msg = (struct hv_vss_msg *)vss_msg;
+			vss_device.recv_len = recvlen;
+			vss_device.recv_channel = channel;
+			vss_device.recv_req_id = requestid;
 
 			switch (vss_msg->vss_hdr.operation) {
 				/*
@@ -241,6 +232,15 @@ void hv_vss_onchannelcallback(void *context)
 				 */
 			case VSS_OP_FREEZE:
 			case VSS_OP_THAW:
+				if (vss_device.state != VSS_READY) {
+					/* Userspace daemon is not connected */
+					vss_respond_to_host(HV_E_FAIL);
+					return;
+				}
+
+				memcpy(&vss_device.host_msg, vss_msg,
+				       sizeof(struct hv_vss_msg));
+				vss_device.state = VSS_HOSTMSG_RECEIVED;
 				schedule_work(&vss_send_op_work);
 				schedule_delayed_work(&vss_timeout_work,
 						      VSS_USERSPACE_TIMEOUT);
@@ -275,14 +275,164 @@ void hv_vss_onchannelcallback(void *context)
 
 }
 
-int
-hv_vss_init(struct hv_util_service *srv)
+static int vss_op_open(struct inode *inode, struct file *f)
 {
-	int err;
+	if (vss_device.state != VSS_DEVICE_INITIALIZING)
+		return -EBUSY;
+	vss_device.state = VSS_DEVICE_OPENED;
+	return 0;
+}
 
-	err = cn_add_callback(&vss_id, vss_name, vss_cn_callback);
-	if (err)
-		return err;
+static int vss_op_release(struct inode *inode, struct file *f)
+{
+	vss_device.state = VSS_DEVICE_INITIALIZING;
+	return 0;
+}
+
+static ssize_t vss_op_write(struct file *file, const char __user *buf,
+			    size_t count, loff_t *ppos)
+{
+	int ret = 0;
+	u32 val32;
+
+	if (vss_device.state == VSS_DEVICE_DYING)
+		return -EFAULT;
+
+	if (count != sizeof(u32)) {
+		pr_warn("vss_op_write: invalid write len: %d (expected: %d)\n",
+			(int)count, (int)sizeof(u32));
+		return -EINVAL;
+	}
+
+	mutex_lock(&vss_device.lock);
+
+	if (vss_device.state == VSS_USERSPACE_REQ) {
+		if (!copy_from_user(&val32, buf, sizeof(u32))) {
+			vss_device.state = VSS_USERSPACE_RECV;
+			if (cancel_delayed_work_sync(&vss_timeout_work))
+				vss_respond_to_host(val32);
+			ret = sizeof(u32);
+		} else
+			ret = -EFAULT;
+	} else {
+		pr_warn("vss_op_write: invalid transaction state: %d\n",
+			vss_device.state);
+		ret = -EINVAL;
+	}
+
+	mutex_unlock(&vss_device.lock);
+	return ret;
+}
+
+static ssize_t vss_op_read(struct file *file, char __user *buf,
+			   size_t count, loff_t *ppos)
+{
+	ssize_t ret = 0;
+
+	if (vss_device.state == VSS_DEVICE_DYING)
+		return -EFAULT;
+
+	if (count != sizeof(struct hv_vss_msg)) {
+		pr_warn("vss_op_read: invalid read len: %d (expected: %d)\n",
+			(int)count, (int)sizeof(struct hv_vss_msg));
+		return -EINVAL;
+	}
+
+	if (wait_event_interruptible(vss_device.proc_list,
+				     vss_device.state == VSS_USERMSG_READY ||
+				     vss_device.state == VSS_DEVICE_DYING))
+		return -EFAULT;
+
+	if (vss_device.state != VSS_USERMSG_READY)
+		return -EFAULT;
+
+	mutex_lock(&vss_device.lock);
+
+	if (!copy_to_user(buf, &vss_device.user_msg,
+			  sizeof(struct hv_vss_msg))) {
+		vss_device.state = VSS_USERSPACE_REQ;
+		ret = sizeof(struct hv_vss_msg);
+	} else
+		ret = -EFAULT;
+
+	mutex_unlock(&vss_device.lock);
+	return ret;
+}
+
+static unsigned int vss_op_poll(struct file *file, poll_table *wait)
+{
+	if (vss_device.state == VSS_DEVICE_DYING)
+		return -EFAULT;
+
+	poll_wait(file, &vss_device.proc_list, wait);
+	if (vss_device.state == VSS_USERMSG_READY)
+		return POLLIN | POLLRDNORM;
+	return 0;
+}
+
+static long vss_op_ioctl(struct file *fp,
+			 unsigned int cmd, unsigned long arg)
+{
+	long ret = 0;
+	void __user *argp = (void __user *)arg;
+	u32 val32;
+
+	if (vss_device.state == VSS_DEVICE_DYING)
+		return -EFAULT;
+
+	/* The only ioctl we have is registation */
+	if (vss_device.state != VSS_DEVICE_OPENED)
+		return -EINVAL;
+
+	mutex_lock(&vss_device.lock);
+
+	switch (cmd) {
+	case HYPERV_VSS_REGISTER:
+		if (copy_from_user(&val32, argp, sizeof(val32))) {
+			ret = -EFAULT;
+			break;
+		}
+		if (!vss_handle_handshake(val32)) {
+			val32 = (u32)VSS_VERSION;
+			if (copy_to_user(argp, &val32, sizeof(val32))) {
+				ret = -EFAULT;
+				break;
+			}
+			vss_device.state = VSS_READY;
+			pr_info("VSS: user-mode registering done.\n");
+			if (vss_device.vss_context)
+				poll_channel(vss_device.vss_context);
+		} else
+			ret = -EINVAL;
+		break;
+
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	mutex_unlock(&vss_device.lock);
+	return ret;
+}
+
+static const struct file_operations vss_fops = {
+	.owner		= THIS_MODULE,
+	.read           = vss_op_read,
+	.write          = vss_op_write,
+	.release	= vss_op_release,
+	.open		= vss_op_open,
+	.poll		= vss_op_poll,
+	.unlocked_ioctl = vss_op_ioctl
+};
+
+static struct miscdevice vss_misc = {
+	.minor          = MISC_DYNAMIC_MINOR,
+	.name           = "vmbus/hv_vss",
+	.fops           = &vss_fops,
+};
+
+int hv_vss_init(struct hv_util_service *srv)
+{
 	recv_buffer = srv->recv_buffer;
 
 	/*
@@ -291,13 +441,20 @@ hv_vss_init(struct hv_util_service *srv)
 	 * Defer processing channel callbacks until the daemon
 	 * has registered.
 	 */
-	vss_transaction.active = true;
-	return 0;
+	vss_device.state = VSS_DEVICE_INITIALIZING;
+	init_waitqueue_head(&vss_device.proc_list);
+	mutex_init(&vss_device.lock);
+
+	return misc_register(&vss_misc);
 }
 
 void hv_vss_deinit(void)
 {
-	cn_del_callback(&vss_id);
+	vss_device.state = VSS_DEVICE_DYING;
+	/* Make sure nobody sees the old state */
+	smp_mb();
+	wake_up_interruptible(&vss_device.proc_list);
 	cancel_delayed_work_sync(&vss_timeout_work);
 	cancel_work_sync(&vss_send_op_work);
+	misc_deregister(&vss_misc);
 }
diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
index 80713a3..1939826 100644
--- a/include/uapi/linux/hyperv.h
+++ b/include/uapi/linux/hyperv.h
@@ -396,5 +396,6 @@ struct hv_kvp_ip_msg {
  * either KVP_OP_REGISTER or KVP_OP_REGISTER1.
  */
 #define HYPERV_KVP_REGISTER        _IOWR('v', 0, __u32)
+#define HYPERV_VSS_REGISTER        _IOWR('v', 0, __u32)
 
 #endif /* _UAPI_HYPERV_H */
diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index 5e63f70..d3f1fe9 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -19,7 +19,6 @@
 
 
 #include <sys/types.h>
-#include <sys/socket.h>
 #include <sys/poll.h>
 #include <sys/ioctl.h>
 #include <fcntl.h>
@@ -30,21 +29,11 @@
 #include <string.h>
 #include <ctype.h>
 #include <errno.h>
-#include <arpa/inet.h>
 #include <linux/fs.h>
-#include <linux/connector.h>
 #include <linux/hyperv.h>
-#include <linux/netlink.h>
 #include <syslog.h>
 #include <getopt.h>
 
-static struct sockaddr_nl addr;
-
-#ifndef SOL_NETLINK
-#define SOL_NETLINK 270
-#endif
-
-
 /* Don't use syslog() in the function since that can cause write to disk */
 static int vss_do_freeze(char *dir, unsigned int cmd)
 {
@@ -137,33 +126,6 @@ out:
 	return error;
 }
 
-static int netlink_send(int fd, struct cn_msg *msg)
-{
-	struct nlmsghdr nlh = { .nlmsg_type = NLMSG_DONE };
-	unsigned int size;
-	struct msghdr message;
-	struct iovec iov[2];
-
-	size = sizeof(struct cn_msg) + msg->len;
-
-	nlh.nlmsg_pid = getpid();
-	nlh.nlmsg_len = NLMSG_LENGTH(size);
-
-	iov[0].iov_base = &nlh;
-	iov[0].iov_len = sizeof(nlh);
-
-	iov[1].iov_base = msg;
-	iov[1].iov_len = size;
-
-	memset(&message, 0, sizeof(message));
-	message.msg_name = &addr;
-	message.msg_namelen = sizeof(addr);
-	message.msg_iov = iov;
-	message.msg_iovlen = 2;
-
-	return sendmsg(fd, &message, 0);
-}
-
 void print_usage(char *argv[])
 {
 	fprintf(stderr, "Usage: %s [options]\n"
@@ -174,17 +136,13 @@ void print_usage(char *argv[])
 
 int main(int argc, char *argv[])
 {
-	int fd, len, nl_group;
-	int error;
-	struct cn_msg *message;
+	int vss_fd, len;
+	__u32 error;
 	struct pollfd pfd;
-	struct nlmsghdr *incoming_msg;
-	struct cn_msg	*incoming_cn_msg;
 	int	op;
-	struct hv_vss_msg *vss_msg;
-	char *vss_recv_buffer;
-	size_t vss_recv_buffer_len;
+	struct hv_vss_msg vss_msg[1];
 	int daemonize = 1, long_index = 0, opt;
+	__u32 daemon_ver = 1; /* No special meaning */
 
 	static struct option long_options[] = {
 		{"help",	no_argument,	   0,  'h' },
@@ -211,98 +169,50 @@ int main(int argc, char *argv[])
 	openlog("Hyper-V VSS", 0, LOG_USER);
 	syslog(LOG_INFO, "VSS starting; pid is:%d", getpid());
 
-	vss_recv_buffer_len = NLMSG_LENGTH(0) + sizeof(struct cn_msg) + sizeof(struct hv_vss_msg);
-	vss_recv_buffer = calloc(1, vss_recv_buffer_len);
-	if (!vss_recv_buffer) {
-		syslog(LOG_ERR, "Failed to allocate netlink buffers");
-		exit(EXIT_FAILURE);
-	}
+	vss_fd = open("/dev/vmbus/hv_vss", O_RDWR);
 
-	fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
-	if (fd < 0) {
-		syslog(LOG_ERR, "netlink socket creation failed; error:%d %s",
-				errno, strerror(errno));
+	if (vss_fd < 0) {
+		syslog(LOG_ERR, "open /dev/vmbus/hv_vss failed; error: %d %s",
+			errno, strerror(errno));
 		exit(EXIT_FAILURE);
 	}
-	addr.nl_family = AF_NETLINK;
-	addr.nl_pad = 0;
-	addr.nl_pid = 0;
-	addr.nl_groups = 0;
-
 
-	error = bind(fd, (struct sockaddr *)&addr, sizeof(addr));
-	if (error < 0) {
-		syslog(LOG_ERR, "bind failed; error:%d %s", errno, strerror(errno));
-		close(fd);
-		exit(EXIT_FAILURE);
-	}
-	nl_group = CN_VSS_IDX;
-	if (setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, &nl_group, sizeof(nl_group)) < 0) {
-		syslog(LOG_ERR, "setsockopt failed; error:%d %s", errno, strerror(errno));
-		close(fd);
-		exit(EXIT_FAILURE);
-	}
 	/*
 	 * Register ourselves with the kernel.
 	 */
-	message = (struct cn_msg *)vss_recv_buffer;
-	message->id.idx = CN_VSS_IDX;
-	message->id.val = CN_VSS_VAL;
-	message->ack = 0;
-	vss_msg = (struct hv_vss_msg *)message->data;
-	vss_msg->vss_hdr.operation = VSS_OP_REGISTER;
-
-	message->len = sizeof(struct hv_vss_msg);
-
-	len = netlink_send(fd, message);
-	if (len < 0) {
-		syslog(LOG_ERR, "netlink_send failed; error:%d %s", errno, strerror(errno));
-		close(fd);
+	if (ioctl(vss_fd, HYPERV_VSS_REGISTER, &daemon_ver)) {
+		syslog(LOG_ERR, "registration to kernel failed; error: %d %s",
+		       errno, strerror(errno));
+		close(vss_fd);
 		exit(EXIT_FAILURE);
 	}
 
-	pfd.fd = fd;
+	pfd.fd = vss_fd;
 
 	while (1) {
-		struct sockaddr *addr_p = (struct sockaddr *) &addr;
-		socklen_t addr_l = sizeof(addr);
 		pfd.events = POLLIN;
 		pfd.revents = 0;
 
 		if (poll(&pfd, 1, -1) < 0) {
 			syslog(LOG_ERR, "poll failed; error:%d %s", errno, strerror(errno));
 			if (errno == EINVAL) {
-				close(fd);
+				close(vss_fd);
 				exit(EXIT_FAILURE);
 			}
 			else
 				continue;
 		}
 
-		len = recvfrom(fd, vss_recv_buffer, vss_recv_buffer_len, 0,
-				addr_p, &addr_l);
+		len = read(vss_fd, vss_msg, sizeof(struct hv_vss_msg));
 
-		if (len < 0) {
-			syslog(LOG_ERR, "recvfrom failed; pid:%u error:%d %s",
-					addr.nl_pid, errno, strerror(errno));
-			close(fd);
-			return -1;
-		}
+		if (len != sizeof(struct hv_vss_msg)) {
+			syslog(LOG_ERR, "read failed; error:%d %s",
+			       errno, strerror(errno));
 
-		if (addr.nl_pid) {
-			syslog(LOG_WARNING,
-				"Received packet from untrusted pid:%u",
-				addr.nl_pid);
-			continue;
+			close(vss_fd);
+			return EXIT_FAILURE;
 		}
 
-		incoming_msg = (struct nlmsghdr *)vss_recv_buffer;
-
-		if (incoming_msg->nlmsg_type != NLMSG_DONE)
-			continue;
-
-		incoming_cn_msg = (struct cn_msg *)NLMSG_DATA(incoming_msg);
-		vss_msg = (struct hv_vss_msg *)incoming_cn_msg->data;
 		op = vss_msg->vss_hdr.operation;
 		error =  HV_S_OK;
 
@@ -324,13 +234,14 @@ int main(int argc, char *argv[])
 		default:
 			syslog(LOG_ERR, "Illegal op:%d\n", op);
 		}
-		vss_msg->error = error;
-		len = netlink_send(fd, incoming_cn_msg);
-		if (len < 0) {
-			syslog(LOG_ERR, "net_link send failed; error:%d %s",
-					errno, strerror(errno));
+
+		if (write(vss_fd, &error, sizeof(__u32)) != sizeof(__u32)) {
+			syslog(LOG_ERR, "write failed; error: %d %s", errno,
+			       strerror(errno));
 			exit(EXIT_FAILURE);
 		}
 	}
 
+	close(vss_fd);
+	exit(0);
 }
-- 
1.9.3

^ permalink raw reply related

* [PATCH RFC 3/3] Drivers: hv: fcopy: make it consistent with vss/kvp
From: Vitaly Kuznetsov @ 2015-02-27 16:14 UTC (permalink / raw)
  To: K. Y. Srinivasan, devel
  Cc: Radim Krčmář, Greg Kroah-Hartman, Haiyang Zhang,
	linux-kernel, linux-api
In-Reply-To: <1425053665-635-1-git-send-email-vkuznets@redhat.com>

Re-implement fcopy in a consistent with "Drivers: hv: vss/kvp: convert
userspace/kernel communication to using char device" way.

In particular:
- Implement "state machine" for the driver instead of 3 separate syncronization
  variables ('fcopy_transaction.active', 'fcopy_transaction.read_sema', 'opened')
- Use ioctl for kernel/userspace version negotiation.
- Support poll() operation.
- Set .owner = THIS_MODULE to prevent kernel crash if the module if being
  unloaded while there is an active file descriptior in userspace.
- Use __u32 instead of int in userspace replies as it matches icmsg_hdr.status.
- Other minor changes to make drivers code look alike.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 drivers/hv/hv_fcopy.c       | 395 ++++++++++++++++++++++++++------------------
 include/uapi/linux/hyperv.h |   1 +
 tools/hv/hv_fcopy_daemon.c  |  48 ++++--
 3 files changed, 264 insertions(+), 180 deletions(-)

diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
index cd453e4..05c3580 100644
--- a/drivers/hv/hv_fcopy.c
+++ b/drivers/hv/hv_fcopy.c
@@ -28,6 +28,7 @@
 #include <linux/sched.h>
 #include <linux/uaccess.h>
 #include <linux/miscdevice.h>
+#include <linux/poll.h>
 
 #include "hyperv_vmbus.h"
 
@@ -35,6 +36,8 @@
 #define WIN8_SRV_MINOR		1
 #define WIN8_SRV_VERSION	(WIN8_SRV_MAJOR << 16 | WIN8_SRV_MINOR)
 
+#define MAX_FCOPY_CHSIZE (PAGE_SIZE * 2)
+
 /*
  * Global state maintained for transaction that is being processed.
  * For a class of integration services, including the "file copy service",
@@ -47,36 +50,37 @@
  * ensure this by serializing packet processing in this driver - we do not
  * read additional packets from the VMBUs until the current packet is fully
  * handled.
- *
- * The transaction "active" state is set when we receive a request from the
- * host and we cleanup this state when the transaction is completed - when we
- * respond to the host with our response. When the transaction active state is
- * set, we defer handling incoming packets.
  */
 
+enum fcopy_device_state {
+	FCOPY_DEVICE_INITIALIZING = 0, /* driver was loaded */
+	FCOPY_DEVICE_OPENED,           /* device was opened */
+	FCOPY_READY,                   /* userspace registered */
+	FCOPY_HOSTMSG_RECEIVED,        /* message from host was received */
+	FCOPY_USERMSG_READY,           /* message for userspace is ready */
+	FCOPY_USERSPACE_REQ,           /* request to userspace was sent */
+	FCOPY_USERSPACE_RECV,          /* reply from userspace was received */
+	FCOPY_DEVICE_DYING,            /* driver unload is in progress */
+};
+
 static struct {
-	bool active; /* transaction status - active or not */
+	int state; /* fcopy device state */
 	int recv_len; /* number of bytes received. */
-	struct hv_fcopy_hdr  *fcopy_msg; /* current message */
-	struct hv_start_fcopy  message; /*  sent to daemon */
 	struct vmbus_channel *recv_channel; /* chn we got the request */
 	u64 recv_req_id; /* request ID. */
 	void *fcopy_context; /* for the channel callback */
-	struct semaphore read_sema;
-} fcopy_transaction;
-
-static bool opened; /* currently device opened */
+	int dm_reg_value; /* daemon version number */
+	struct mutex lock; /* syncronization */
+	u8 user_msg[MAX_FCOPY_CHSIZE]; /* message to userspace */
+	u8 host_msg[MAX_FCOPY_CHSIZE]; /* message to/from host */
+	wait_queue_head_t proc_list; /* waiting processes */
+} fcopy_device;
 
-/*
- * Before we can accept copy messages from the host, we need
- * to handshake with the user level daemon. This state tracks
- * if we are in the handshake phase.
- */
-static bool in_hand_shake = true;
-static void fcopy_send_data(void);
 static void fcopy_respond_to_host(int error);
 static void fcopy_work_func(struct work_struct *dummy);
+static void fcopy_send_op(struct work_struct *dummy);
 static DECLARE_DELAYED_WORK(fcopy_work, fcopy_work_func);
+static DECLARE_WORK(fcopy_send_op_work, fcopy_send_op);
 static u8 *recv_buffer;
 
 static void fcopy_work_func(struct work_struct *dummy)
@@ -86,22 +90,22 @@ static void fcopy_work_func(struct work_struct *dummy)
 	 * process the pending transaction.
 	 */
 	fcopy_respond_to_host(HV_E_FAIL);
+}
 
-	/* In the case the user-space daemon crashes, hangs or is killed, we
-	 * need to down the semaphore, otherwise, after the daemon starts next
-	 * time, the obsolete data in fcopy_transaction.message or
-	 * fcopy_transaction.fcopy_msg will be used immediately.
-	 *
-	 * NOTE: fcopy_read() happens to get the semaphore (very rare)? We're
-	 * still OK, because we've reported the failure to the host.
-	 */
-	if (down_trylock(&fcopy_transaction.read_sema))
-		;
-
+static void poll_channel(struct vmbus_channel *channel)
+{
+	if (channel->target_cpu != smp_processor_id())
+		smp_call_function_single(channel->target_cpu,
+					 hv_fcopy_onchannelcallback,
+					 channel, true);
+	else
+		hv_fcopy_onchannelcallback(channel);
 }
 
 static int fcopy_handle_handshake(u32 version)
 {
+	int ret = 0;
+
 	switch (version) {
 	case FCOPY_CURRENT_VERSION:
 		break;
@@ -112,23 +116,19 @@ static int fcopy_handle_handshake(u32 version)
 		 * deal with, we will be backward compatible.
 		 * We will add this code when needed.
 		 */
-		return -EINVAL;
+		ret = -EINVAL;
 	}
-	pr_info("FCP: user-mode registering done. Daemon version: %d\n",
-		version);
-	fcopy_transaction.active = false;
-	if (fcopy_transaction.fcopy_context)
-		hv_fcopy_onchannelcallback(fcopy_transaction.fcopy_context);
-	in_hand_shake = false;
 	return 0;
 }
 
-static void fcopy_send_data(void)
+static void fcopy_send_op(struct work_struct *dummy)
 {
-	struct hv_start_fcopy *smsg_out = &fcopy_transaction.message;
-	int operation = fcopy_transaction.fcopy_msg->operation;
+	struct hv_start_fcopy *smsg_out;
+	struct hv_do_fcopy *dmsg_out;
 	struct hv_start_fcopy *smsg_in;
 
+	mutex_lock(&fcopy_device.lock);
+
 	/*
 	 * The  strings sent from the host are encoded in
 	 * in utf16; convert it to utf8 strings.
@@ -140,11 +140,14 @@ static void fcopy_send_data(void)
 	 * that the strings can be properly terminated!
 	 */
 
-	switch (operation) {
+	switch (((struct hv_fcopy_hdr *)fcopy_device.host_msg)->operation) {
 	case START_FILE_COPY:
-		memset(smsg_out, 0, sizeof(struct hv_start_fcopy));
-		smsg_out->hdr.operation = operation;
-		smsg_in = (struct hv_start_fcopy *)fcopy_transaction.fcopy_msg;
+		memset(&fcopy_device.user_msg, 0,
+		       sizeof(struct hv_start_fcopy));
+
+		smsg_out = (struct hv_start_fcopy *)fcopy_device.user_msg;
+		smsg_out->hdr.operation = START_FILE_COPY;
+		smsg_in = (struct hv_start_fcopy *)fcopy_device.host_msg;
 
 		utf16s_to_utf8s((wchar_t *)smsg_in->file_name, W_MAX_PATH,
 				UTF16_LITTLE_ENDIAN,
@@ -159,9 +162,16 @@ static void fcopy_send_data(void)
 		break;
 
 	default:
+		dmsg_out = (struct hv_do_fcopy *)fcopy_device.user_msg;
+		memcpy(fcopy_device.user_msg, fcopy_device.host_msg,
+		       sizeof(struct hv_do_fcopy));
 		break;
 	}
-	up(&fcopy_transaction.read_sema);
+
+	fcopy_device.state = FCOPY_USERMSG_READY;
+	wake_up_interruptible(&fcopy_device.proc_list);
+
+	mutex_unlock(&fcopy_device.lock);
 	return;
 }
 
@@ -169,8 +179,7 @@ static void fcopy_send_data(void)
  * Send a response back to the host.
  */
 
-static void
-fcopy_respond_to_host(int error)
+static void fcopy_respond_to_host(int error)
 {
 	struct icmsg_hdr *icmsghdr;
 	u32 buf_len;
@@ -185,11 +194,9 @@ fcopy_respond_to_host(int error)
 	 * only be one active transaction at a time.
 	 */
 
-	buf_len = fcopy_transaction.recv_len;
-	channel = fcopy_transaction.recv_channel;
-	req_id = fcopy_transaction.recv_req_id;
-
-	fcopy_transaction.active = false;
+	buf_len = fcopy_device.recv_len;
+	channel = fcopy_device.recv_channel;
+	req_id = fcopy_device.recv_req_id;
 
 	icmsghdr = (struct icmsg_hdr *)
 			&recv_buffer[sizeof(struct vmbuspipe_hdr)];
@@ -205,6 +212,20 @@ fcopy_respond_to_host(int error)
 	icmsghdr->icflags = ICMSGHDRFLAG_TRANSACTION | ICMSGHDRFLAG_RESPONSE;
 	vmbus_sendpacket(channel, recv_buffer, buf_len, req_id,
 				VM_PKT_DATA_INBAND, 0);
+
+
+	/* We're ready to process next request, reset the device state */
+	if (fcopy_device.state == FCOPY_USERSPACE_RECV ||
+	    fcopy_device.state == FCOPY_USERSPACE_REQ)
+		fcopy_device.state = FCOPY_READY;
+
+	/*
+	 * Make sure device state was set before polling the channel as
+	 * processing can happen on a different CPU.
+	 */
+	smp_mb();
+
+	poll_channel(channel);
 }
 
 void hv_fcopy_onchannelcallback(void *context)
@@ -218,16 +239,17 @@ void hv_fcopy_onchannelcallback(void *context)
 	int util_fw_version;
 	int fcopy_srv_version;
 
-	if (fcopy_transaction.active) {
+	if (fcopy_device.state > FCOPY_READY) {
 		/*
 		 * We will defer processing this callback once
 		 * the current transaction is complete.
 		 */
-		fcopy_transaction.fcopy_context = context;
+		fcopy_device.fcopy_context = channel;
 		return;
 	}
+	fcopy_device.fcopy_context = NULL;
 
-	vmbus_recvpacket(channel, recv_buffer, PAGE_SIZE * 2, &recvlen,
+	vmbus_recvpacket(channel, recv_buffer, MAX_FCOPY_CHSIZE, &recvlen,
 			 &requestid);
 	if (recvlen <= 0)
 		return;
@@ -235,6 +257,10 @@ void hv_fcopy_onchannelcallback(void *context)
 	icmsghdr = (struct icmsg_hdr *)&recv_buffer[
 			sizeof(struct vmbuspipe_hdr)];
 	if (icmsghdr->icmsgtype == ICMSGTYPE_NEGOTIATE) {
+		/*
+		 * Process negotiation even before usersace daemon is connected
+		 * as we can timeout othervise.
+		 */
 		util_fw_version = UTIL_FW_VERSION;
 		fcopy_srv_version = WIN8_SRV_VERSION;
 		vmbus_prep_negotiate_resp(icmsghdr, negop, recv_buffer,
@@ -249,17 +275,26 @@ void hv_fcopy_onchannelcallback(void *context)
 		 * transaction; note transactions are serialized.
 		 */
 
-		fcopy_transaction.active = true;
-		fcopy_transaction.recv_len = recvlen;
-		fcopy_transaction.recv_channel = channel;
-		fcopy_transaction.recv_req_id = requestid;
-		fcopy_transaction.fcopy_msg = fcopy_msg;
+		fcopy_device.recv_len = recvlen;
+		fcopy_device.recv_channel = channel;
+		fcopy_device.recv_req_id = requestid;
+
+		if (fcopy_device.state != FCOPY_READY) {
+			/* Userspace daemon is not connected, just fail. */
+			fcopy_respond_to_host(HV_E_FAIL);
+			return;
+		}
+
+		memcpy(fcopy_device.host_msg, fcopy_msg, recvlen -
+		       (sizeof(struct vmbuspipe_hdr) +
+			sizeof(struct icmsg_hdr)));
+		fcopy_device.state = FCOPY_HOSTMSG_RECEIVED;
 
 		/*
 		 * Send the information to the user-level daemon.
 		 */
+		schedule_work(&fcopy_send_op_work);
 		schedule_delayed_work(&fcopy_work, 5*HZ);
-		fcopy_send_data();
 		return;
 	}
 	icmsghdr->icflags = ICMSGHDRFLAG_TRANSACTION | ICMSGHDRFLAG_RESPONSE;
@@ -272,121 +307,170 @@ void hv_fcopy_onchannelcallback(void *context)
  * the payload.
  */
 
-static ssize_t fcopy_read(struct file *file, char __user *buf,
-		size_t count, loff_t *ppos)
+static ssize_t fcopy_op_read(struct file *file, char __user *buf,
+			     size_t count, loff_t *ppos)
 {
-	void *src;
-	size_t copy_size;
-	int operation;
+	ssize_t ret = 0;
+	int copy_size;
+	struct hv_fcopy_hdr *hdr;
+
+	if (wait_event_interruptible(fcopy_device.proc_list,
+				     fcopy_device.state == FCOPY_USERMSG_READY
+				     ||
+				     fcopy_device.state == FCOPY_DEVICE_DYING))
+		return -EFAULT;
 
-	/*
-	 * Wait until there is something to be read.
-	 */
-	if (down_interruptible(&fcopy_transaction.read_sema))
-		return -EINTR;
+	if (fcopy_device.state != FCOPY_USERMSG_READY)
+		return -EFAULT;
 
-	/*
-	 * The channel may be rescinded and in this case, we will wakeup the
-	 * the thread blocked on the semaphore and we will use the opened
-	 * state to correctly handle this case.
-	 */
-	if (!opened)
-		return -ENODEV;
+	mutex_lock(&fcopy_device.lock);
 
-	operation = fcopy_transaction.fcopy_msg->operation;
+	hdr = (struct hv_fcopy_hdr *)fcopy_device.user_msg;
 
-	if (operation == START_FILE_COPY) {
-		src = &fcopy_transaction.message;
+	if (hdr->operation == START_FILE_COPY)
 		copy_size = sizeof(struct hv_start_fcopy);
-		if (count < copy_size)
-			return 0;
-	} else {
-		src = fcopy_transaction.fcopy_msg;
+	else
 		copy_size = sizeof(struct hv_do_fcopy);
-		if (count < copy_size)
-			return 0;
+
+	if (count < copy_size) {
+		pr_warn("fcopy_op_read: invalid read len: %d (expected: %d)\n",
+			(int)count, copy_size);
+		mutex_unlock(&fcopy_device.lock);
+		return -EINVAL;
 	}
-	if (copy_to_user(buf, src, copy_size))
-		return -EFAULT;
 
-	return copy_size;
+	if (!copy_to_user(buf, fcopy_device.user_msg, copy_size)) {
+		fcopy_device.state = FCOPY_USERSPACE_REQ;
+		ret = copy_size;
+	} else
+		ret = -EFAULT;
+
+	mutex_unlock(&fcopy_device.lock);
+	return ret;
 }
 
-static ssize_t fcopy_write(struct file *file, const char __user *buf,
-			size_t count, loff_t *ppos)
+static ssize_t fcopy_op_write(struct file *file, const char __user *buf,
+			      size_t count, loff_t *ppos)
 {
-	int response = 0;
+	int ret = 0;
+	u32 val32;
 
-	if (count != sizeof(int))
-		return -EINVAL;
-
-	if (copy_from_user(&response, buf, sizeof(int)))
+	if (fcopy_device.state == FCOPY_DEVICE_DYING)
 		return -EFAULT;
 
-	if (in_hand_shake) {
-		if (fcopy_handle_handshake(response))
-			return -EINVAL;
-		return sizeof(int);
+	if (count != sizeof(u32)) {
+		pr_warn("fcopy_op_write: invalid write len: %d (expected: %d)\n",
+			(int)count, (int)sizeof(u32));
+		return -EINVAL;
 	}
 
-	/*
-	 * Complete the transaction by forwarding the result
-	 * to the host. But first, cancel the timeout.
-	 */
-	if (cancel_delayed_work_sync(&fcopy_work))
-		fcopy_respond_to_host(response);
+	mutex_lock(&fcopy_device.lock);
 
-	return sizeof(int);
+	if (fcopy_device.state == FCOPY_USERSPACE_REQ) {
+		if (!copy_from_user(&val32, buf, sizeof(u32))) {
+			fcopy_device.state = FCOPY_USERSPACE_RECV;
+			if (cancel_delayed_work_sync(&fcopy_work))
+				fcopy_respond_to_host(val32);
+			ret = sizeof(u32);
+		} else
+			ret = -EFAULT;
+	} else {
+		pr_warn("fcopy_op_write: invalid transaction state: %d\n",
+			fcopy_device.state);
+		ret = -EINVAL;
+	}
+
+	mutex_unlock(&fcopy_device.lock);
+	return ret;
 }
 
-static int fcopy_open(struct inode *inode, struct file *f)
+static int fcopy_op_open(struct inode *inode, struct file *f)
 {
 	/*
 	 * The user level daemon that will open this device is
 	 * really an extension of this driver. We can have only
 	 * active open at a time.
 	 */
-	if (opened)
+	if (fcopy_device.state != FCOPY_DEVICE_INITIALIZING)
 		return -EBUSY;
-
-	/*
-	 * The daemon is alive; setup the state.
-	 */
-	opened = true;
+	fcopy_device.state = FCOPY_DEVICE_OPENED;
 	return 0;
 }
 
-/* XXX: there are still some tricky corner cases, e.g.,
- * 1) In a SMP guest, when fcopy_release() runs between
- * schedule_delayed_work() and fcopy_send_data(), there is
- * still a chance an obsolete message will be queued.
- *
- * 2) When the fcopy daemon is running, if we unload the driver,
- * we'll notice a kernel oops when we kill the daemon later.
- */
-static int fcopy_release(struct inode *inode, struct file *f)
+static int fcopy_op_release(struct inode *inode, struct file *f)
 {
 	/*
 	 * The daemon has exited; reset the state.
 	 */
-	in_hand_shake = true;
-	opened = false;
-
-	if (cancel_delayed_work_sync(&fcopy_work)) {
-		/* We haven't up()-ed the semaphore(very rare)? */
-		if (down_trylock(&fcopy_transaction.read_sema))
-			;
-		fcopy_respond_to_host(HV_E_FAIL);
-	}
+	fcopy_device.state = FCOPY_DEVICE_INITIALIZING;
+	return 0;
+}
+
+static unsigned int fcopy_op_poll(struct file *file, poll_table *wait)
+{
+	if (fcopy_device.state == FCOPY_DEVICE_DYING)
+		return -EFAULT;
+
+	poll_wait(file, &fcopy_device.proc_list, wait);
+	if (fcopy_device.state == FCOPY_USERMSG_READY)
+		return POLLIN | POLLRDNORM;
 	return 0;
 }
 
+static long fcopy_op_ioctl(struct file *fp,
+			 unsigned int cmd, unsigned long arg)
+{
+	long ret = 0;
+	void __user *argp = (void __user *)arg;
+	u32 val32;
+
+	if (fcopy_device.state == FCOPY_DEVICE_DYING)
+		return -EFAULT;
+
+	/* The only ioctl we have is registation */
+	if (fcopy_device.state != FCOPY_DEVICE_OPENED)
+		return -EINVAL;
+
+	mutex_lock(&fcopy_device.lock);
+
+	switch (cmd) {
+	case HYPERV_FCOPY_REGISTER:
+		if (copy_from_user(&val32, argp, sizeof(val32))) {
+			ret = -EFAULT;
+			break;
+		}
+		if (!fcopy_handle_handshake(val32)) {
+			/* No special meaning for userspace part for now. */
+			val32 = (u32)WIN8_SRV_VERSION;
+			if (copy_to_user(argp, &val32, sizeof(val32))) {
+				ret = -EFAULT;
+				break;
+			}
+			fcopy_device.state = FCOPY_READY;
+			pr_info("FCOPY: user-mode registering done.\n");
+			if (fcopy_device.fcopy_context)
+				poll_channel(fcopy_device.fcopy_context);
+		} else
+			ret = -EINVAL;
+		break;
+
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	mutex_unlock(&fcopy_device.lock);
+	return ret;
+}
 
 static const struct file_operations fcopy_fops = {
-	.read           = fcopy_read,
-	.write          = fcopy_write,
-	.release	= fcopy_release,
-	.open		= fcopy_open,
+	.owner		= THIS_MODULE,
+	.read           = fcopy_op_read,
+	.write          = fcopy_op_write,
+	.release	= fcopy_op_release,
+	.open		= fcopy_op_open,
+	.poll		= fcopy_op_poll,
+	.unlocked_ioctl = fcopy_op_ioctl
 };
 
 static struct miscdevice fcopy_misc = {
@@ -395,29 +479,6 @@ static struct miscdevice fcopy_misc = {
 	.fops           = &fcopy_fops,
 };
 
-static int fcopy_dev_init(void)
-{
-	return misc_register(&fcopy_misc);
-}
-
-static void fcopy_dev_deinit(void)
-{
-
-	/*
-	 * The device is going away - perhaps because the
-	 * host has rescinded the channel. Setup state so that
-	 * user level daemon can gracefully exit if it is blocked
-	 * on the read semaphore.
-	 */
-	opened = false;
-	/*
-	 * Signal the semaphore as the device is
-	 * going away.
-	 */
-	up(&fcopy_transaction.read_sema);
-	misc_deregister(&fcopy_misc);
-}
-
 int hv_fcopy_init(struct hv_util_service *srv)
 {
 	recv_buffer = srv->recv_buffer;
@@ -428,14 +489,20 @@ int hv_fcopy_init(struct hv_util_service *srv)
 	 * Defer processing channel callbacks until the daemon
 	 * has registered.
 	 */
-	fcopy_transaction.active = true;
-	sema_init(&fcopy_transaction.read_sema, 0);
+	fcopy_device.state = FCOPY_DEVICE_INITIALIZING;
+	init_waitqueue_head(&fcopy_device.proc_list);
+	mutex_init(&fcopy_device.lock);
 
-	return fcopy_dev_init();
+	return misc_register(&fcopy_misc);
 }
 
 void hv_fcopy_deinit(void)
 {
+	fcopy_device.state = FCOPY_DEVICE_DYING;
+	/* Make sure nobody sees the old state */
+	smp_mb();
+	wake_up_interruptible(&fcopy_device.proc_list);
 	cancel_delayed_work_sync(&fcopy_work);
-	fcopy_dev_deinit();
+	cancel_work_sync(&fcopy_send_op_work);
+	misc_deregister(&fcopy_misc);
 }
diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
index 1939826..590a2f4 100644
--- a/include/uapi/linux/hyperv.h
+++ b/include/uapi/linux/hyperv.h
@@ -397,5 +397,6 @@ struct hv_kvp_ip_msg {
  */
 #define HYPERV_KVP_REGISTER        _IOWR('v', 0, __u32)
 #define HYPERV_VSS_REGISTER        _IOWR('v', 0, __u32)
+#define HYPERV_FCOPY_REGISTER      _IOWR('v', 0, __u32)
 
 #endif /* _UAPI_HYPERV_H */
diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
index 9445d8f..2ae8196 100644
--- a/tools/hv/hv_fcopy_daemon.c
+++ b/tools/hv/hv_fcopy_daemon.c
@@ -18,19 +18,16 @@
 
 
 #include <sys/types.h>
-#include <sys/socket.h>
 #include <sys/poll.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
 #include <linux/types.h>
-#include <linux/kdev_t.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
-#include <string.h>
-#include <ctype.h>
 #include <errno.h>
 #include <linux/hyperv.h>
 #include <syslog.h>
-#include <sys/stat.h>
 #include <fcntl.h>
 #include <dirent.h>
 #include <getopt.h>
@@ -132,7 +129,8 @@ void print_usage(char *argv[])
 int main(int argc, char *argv[])
 {
 	int fcopy_fd, len;
-	int error;
+	__u32 error;
+	struct pollfd pfd;
 	int daemonize = 1, long_index = 0, opt;
 	int version = FCOPY_CURRENT_VERSION;
 	char *buffer[4096 * 2];
@@ -176,19 +174,33 @@ int main(int argc, char *argv[])
 	/*
 	 * Register with the kernel.
 	 */
-	if ((write(fcopy_fd, &version, sizeof(int))) != sizeof(int)) {
-		syslog(LOG_ERR, "Registration failed: %s", strerror(errno));
+	if (ioctl(fcopy_fd, HYPERV_FCOPY_REGISTER, &version)) {
+		syslog(LOG_ERR, "registration to kernel failed; error: %d %s",
+		       errno, strerror(errno));
+		close(fcopy_fd);
 		exit(EXIT_FAILURE);
 	}
 
+	pfd.fd = fcopy_fd;
+
 	while (1) {
-		/*
-		 * In this loop we process fcopy messages after the
-		 * handshake is complete.
-		 */
-		len = pread(fcopy_fd, buffer, (4096 * 2), 0);
+		pfd.events = POLLIN;
+		pfd.revents = 0;
+
+		if (poll(&pfd, 1, -1) < 0) {
+			syslog(LOG_ERR, "poll failed; error:%d %s", errno,
+			       strerror(errno));
+			if (errno == EINVAL) {
+				close(fcopy_fd);
+				exit(EXIT_FAILURE);
+			} else
+				continue;
+		}
+
+		len = read(fcopy_fd, buffer, (4096 * 2));
 		if (len < 0) {
-			syslog(LOG_ERR, "pread failed: %s", strerror(errno));
+			syslog(LOG_ERR, "read failed: %d %s", errno,
+			       strerror(errno));
 			exit(EXIT_FAILURE);
 		}
 		in_msg = (struct hv_fcopy_hdr *)buffer;
@@ -213,9 +225,13 @@ int main(int argc, char *argv[])
 
 		}
 
-		if (pwrite(fcopy_fd, &error, sizeof(int), 0) != sizeof(int)) {
-			syslog(LOG_ERR, "pwrite failed: %s", strerror(errno));
+		if (write(fcopy_fd, &error, sizeof(__u32)) != sizeof(__u32)) {
+			syslog(LOG_ERR, "write failed: %d %s", errno,
+			       strerror(errno));
 			exit(EXIT_FAILURE);
 		}
 	}
+
+	close(fcopy_fd);
+	exit(0);
 }
-- 
1.9.3

^ permalink raw reply related

* Re: Documenting MS_LAZYTIME
From: Darrick J. Wong @ 2015-02-27 17:51 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-man, Theodore Ts'o, Eric Sandeen, Linux API,
	XFS Developers, Linux-Fsdevel, Ext4 Developers List,
	Linux btrfs Developers List
In-Reply-To: <54F02446.2050008@gmail.com>

On Fri, Feb 27, 2015 at 09:01:10AM +0100, Michael Kerrisk (man-pages) wrote:
> On 02/27/2015 01:04 AM, Theodore Ts'o wrote:
> > On Thu, Feb 26, 2015 at 02:36:33PM +0100, Michael Kerrisk (man-pages) wrote:
> >>
> >>     The disadvantage of MS_STRICTATIME | MS_LAZYTIME is that
> >>     in the case of a system crash, the atime and mtime fields
> >>     on disk might be out of date by at most 24 hours.
> > 
> > I'd change to "The disadvantage of MS_LAZYTIME is that..."  and
> > perhaps move that so it's clear it applies to any use of MS_LAZYTIME
> > has this as a downside.
> > 
> > Does that make sense?
> 
> Thanks, Ted. Got it. So, now we have:
> 
>        MS_LAZYTIME (since Linux 3.20)

"since Linux 4.0".

--D

>               Reduce  on-disk  updates  of  inode  timestamps  (atime,
>               mtime, ctime) by maintaining these changes only in  mem‐
>               ory.  The on-disk timestamps are updated only when:
> 
>               (a)  the inode needs to be updated for some change unre‐
>                    lated to file timestamps;
> 
>               (b)  the application  employs  fsync(2),  syncfs(2),  or
>                    sync(2);
> 
>               (c)  an undeleted inode is evicted from memory; or
> 
>               (d)  more  than 24 hours have passed since the inode was
>                    written to disk.
> 
>               This mount significantly reduces writes needed to update
>               the  inode's  timestamps,  especially  mtime  and atime.
>               However, in the event of a system crash, the  atime  and
>               mtime  fields  on  disk might be out of date by up to 24
>               hours.
> 
>               Examples of workloads where this option could be of sig‐
>               nificant  benefit include frequent random writes to pre‐
>               allocated files, as well as cases where the  MS_STRICTA‐
>               TIME  mount  option  is also enabled.  (The advantage of
>               (MS_STRICTATIME |  MS_LAZYTIME)  is  that  stat(2)  will
>               return  the  correctly  updated  atime,  but  the  atime
>               updates will be flushed to disk only when (1) the  inode
>               needs  to  be  updated for filesystem / data consistency
>               reasons or (2) the inode is pushed out of memory, or (3)
>               the filesystem is unmounted.)
> 
> Cheers,
> 
> Michael
> 
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply

* Re: [PATCH v12 1/2] crypto: AF_ALG: add AEAD support
From: Tadeusz Struk @ 2015-02-27 18:34 UTC (permalink / raw)
  To: Stephan Mueller, Herbert Xu
  Cc: 'Quentin Gouchet', Daniel Borkmann, 'LKML',
	linux-crypto, 'Linux API'
In-Reply-To: <2021530.t7zNty0mPn@tauon>

On 02/27/2015 02:26 AM, Stephan Mueller wrote:
>>>> This patch adds the AEAD support for AF_ALG.
>>>> >> > 
>>>> >> > The implementation is based on algif_skcipher, but contains heavy
>>>> >> > modifications to streamline the interface for AEAD uses.
>>>> >> > 
>>>> >> > To use AEAD, the user space consumer has to use the salg_type named
>>>> >> > "aead".
>>> >> 
>>> >> I just saw Al Viro's patch to use the iov_iter API in
>>> >> algif_skcipher.c. I looked at it but lacked documentation for using
>>> >> it properly. Now I have a template that I will incorporate into
>>> >> algif_aead.c
>> >
>> >Please resubmit once you have done this.
> I have done that, but as indicated with an email to Al, I cannot get his 
> patch for skcipher and hash to work. Similarly, my modification for the 
> AEAD does not work either. So, I do not see that Al's patch can be 
> merged as is.
> 
> Therefore, I have not submitted my patch as Al mentioned he wanted to 
> look into his patchset.

Hi Stephan,
There was a problem with the iov_iter changes, but it has been fixed on netdev during merging window.
The algif_skcipher works fine for me on the latest (4.0.0-rc1+) cryptodev.

Regards,
Tadeusz

^ permalink raw reply

* Re: [PATCH v12 1/2] crypto: AF_ALG: add AEAD support
From: Stephan Mueller @ 2015-02-27 18:54 UTC (permalink / raw)
  To: Tadeusz Struk
  Cc: Herbert Xu, 'Quentin Gouchet', Daniel Borkmann,
	'LKML', linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	'Linux API'
In-Reply-To: <54F0B8BD.30600-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Am Freitag, 27. Februar 2015, 10:34:37 schrieb Tadeusz Struk:

Hi Tadeusz,

> On 02/27/2015 02:26 AM, Stephan Mueller wrote:
> >>>> This patch adds the AEAD support for AF_ALG.
> >>>> 
> >>>> >> > The implementation is based on algif_skcipher, but contains heavy
> >>>> >> > modifications to streamline the interface for AEAD uses.
> >>>> >> > 
> >>>> >> > To use AEAD, the user space consumer has to use the salg_type
> >>>> >> > named
> >>>> >> > "aead".
> >>> >> 
> >>> >> I just saw Al Viro's patch to use the iov_iter API in
> >>> >> algif_skcipher.c. I looked at it but lacked documentation for using
> >>> >> it properly. Now I have a template that I will incorporate into
> >>> >> algif_aead.c
> >> >
> >> >Please resubmit once you have done this.
> > 
> > I have done that, but as indicated with an email to Al, I cannot get his
> > patch for skcipher and hash to work. Similarly, my modification for the
> > AEAD does not work either. So, I do not see that Al's patch can be
> > merged as is.
> > 
> > Therefore, I have not submitted my patch as Al mentioned he wanted to
> > look into his patchset.
> 
> Hi Stephan,
> There was a problem with the iov_iter changes, but it has been fixed on
> netdev during merging window. The algif_skcipher works fine for me on the
> latest (4.0.0-rc1+) cryptodev.

Great, thanks for the hint. I will check my implementation and release it 
shortly.
> 
> Regards,
> Tadeusz


-- 
Ciao
Stephan

^ permalink raw reply

* [PATCH 0/2] add cursor blink interval terminal escape sequence
From: Scot Doyle @ 2015-02-27 19:10 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby, Tomi Valkeinen,
	Jean-Christophe Plagniol-Villard, Michael Kerrisk
  Cc: Pavel Machek, Geert Uytterhoeven,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fbdev-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20150226220243.GC9935@amd>

Greg, the first patch of this series is for the tty tree.

Tomi, the second patch of this series is for your tree, but it depends on
the first patch. Also, will you remove these two previously queued patches?
"fbcon: store cursor blink interval in fbcon_ops"
"fbcon: expose cursor blink interval via sysfs"

Michael, I plan to send a documentation patch if these are accepted.

This patch series adds an escape sequence to specify the current console's
cursor blink interval. The default interval is set to fbcon's currently 
hardcoded 200 msecs.

Scot Doyle (2):
  vt: add cursor blink interval escape sequence
  fbcon: use the cursor blink interval provided by vt

 drivers/tty/vt/vt.c            |  9 +++++++++
 drivers/video/console/fbcon.c  | 10 +++++-----
 drivers/video/console/fbcon.h  |  1 +
 include/linux/console_struct.h |  1 +
 4 files changed, 16 insertions(+), 5 deletions(-)

-- 
2.3.0

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 1/2] vt: add cursor blink interval escape sequence
From: Scot Doyle @ 2015-02-27 19:13 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Jiri Slaby
  Cc: Tomi Valkeinen, Jean-Christophe Plagniol-Villard, Michael Kerrisk,
	Pavel Machek, Geert Uytterhoeven,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fbdev-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.LNX.2.11.1502271905580.4212@localhost>

Add an escape sequence to specify the current console's cursor blink
interval. The interval is specified as a number of milliseconds until
the next cursor display state toggle, from 50 to 65535. /proc/loadavg
did not show a difference with a one msec interval, but the lower
bound is set to 50 msecs since slower hardware wasn't tested.

Store the interval in the vc_data structure for later access by fbcon,
initializing the value to fbcon's current hardcoded value of 200 msecs.

Signed-off-by: Scot Doyle <lkml14-enLWO88E2pdl57MIdRCFDg@public.gmane.org>
Acked-by: Pavel Machek <pavel-+ZI9xUNit7I@public.gmane.org>
---
 drivers/tty/vt/vt.c            | 9 +++++++++
 include/linux/console_struct.h | 1 +
 2 files changed, 10 insertions(+)

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 6e00572..ab1f173 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -135,6 +135,7 @@ const struct consw *conswitchp;
  */
 #define DEFAULT_BELL_PITCH	750
 #define DEFAULT_BELL_DURATION	(HZ/8)
+#define DEFAULT_CURSOR_BLINK_MS	200
 
 struct vc vc_cons [MAX_NR_CONSOLES];
 
@@ -1590,6 +1591,13 @@ static void setterm_command(struct vc_data *vc)
 		case 15: /* activate the previous console */
 			set_console(last_console);
 			break;
+		case 16: /* set cursor blink duration in msec */
+			if (vc->vc_npar >= 1 && vc->vc_par[1] >= 50 &&
+					vc->vc_par[1] <= USHRT_MAX)
+				vc->vc_cur_blink_ms = vc->vc_par[1];
+			else
+				vc->vc_cur_blink_ms = DEFAULT_CURSOR_BLINK_MS;
+			break;
 	}
 }
 
@@ -1717,6 +1725,7 @@ static void reset_terminal(struct vc_data *vc, int do_clear)
 
 	vc->vc_bell_pitch = DEFAULT_BELL_PITCH;
 	vc->vc_bell_duration = DEFAULT_BELL_DURATION;
+	vc->vc_cur_blink_ms = DEFAULT_CURSOR_BLINK_MS;
 
 	gotoxy(vc, 0, 0);
 	save_cur(vc);
diff --git a/include/linux/console_struct.h b/include/linux/console_struct.h
index e859c98..e329ee2 100644
--- a/include/linux/console_struct.h
+++ b/include/linux/console_struct.h
@@ -104,6 +104,7 @@ struct vc_data {
 	unsigned int    vc_resize_user;         /* resize request from user */
 	unsigned int	vc_bell_pitch;		/* Console bell pitch */
 	unsigned int	vc_bell_duration;	/* Console bell duration */
+	unsigned short	vc_cur_blink_ms;	/* Cursor blink duration */
 	struct vc_data **vc_display_fg;		/* [!] Ptr to var holding fg console for this display */
 	struct uni_pagedir *vc_uni_pagedir;
 	struct uni_pagedir **vc_uni_pagedir_loc; /* [!] Location of uni_pagedir variable for this console */
-- 
2.3.0

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox