* Re: [PATCH v12 1/2] crypto: AF_ALG: add AEAD support
From: Tadeusz Struk @ 2015-02-27 18:34 UTC (permalink / raw)
To: Stephan Mueller, Herbert Xu
Cc: 'Quentin Gouchet', Daniel Borkmann, 'LKML',
linux-crypto, 'Linux API'
In-Reply-To: <2021530.t7zNty0mPn@tauon>
On 02/27/2015 02:26 AM, Stephan Mueller wrote:
>>>> This patch adds the AEAD support for AF_ALG.
>>>> >> >
>>>> >> > The implementation is based on algif_skcipher, but contains heavy
>>>> >> > modifications to streamline the interface for AEAD uses.
>>>> >> >
>>>> >> > To use AEAD, the user space consumer has to use the salg_type named
>>>> >> > "aead".
>>> >>
>>> >> I just saw Al Viro's patch to use the iov_iter API in
>>> >> algif_skcipher.c. I looked at it but lacked documentation for using
>>> >> it properly. Now I have a template that I will incorporate into
>>> >> algif_aead.c
>> >
>> >Please resubmit once you have done this.
> I have done that, but as indicated with an email to Al, I cannot get his
> patch for skcipher and hash to work. Similarly, my modification for the
> AEAD does not work either. So, I do not see that Al's patch can be
> merged as is.
>
> Therefore, I have not submitted my patch as Al mentioned he wanted to
> look into his patchset.
Hi Stephan,
There was a problem with the iov_iter changes, but it has been fixed on netdev during merging window.
The algif_skcipher works fine for me on the latest (4.0.0-rc1+) cryptodev.
Regards,
Tadeusz
^ permalink raw reply
* Re: Documenting MS_LAZYTIME
From: Darrick J. Wong @ 2015-02-27 17:51 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: linux-man, Theodore Ts'o, Eric Sandeen, Linux API,
XFS Developers, Linux-Fsdevel, Ext4 Developers List,
Linux btrfs Developers List
In-Reply-To: <54F02446.2050008@gmail.com>
On Fri, Feb 27, 2015 at 09:01:10AM +0100, Michael Kerrisk (man-pages) wrote:
> On 02/27/2015 01:04 AM, Theodore Ts'o wrote:
> > On Thu, Feb 26, 2015 at 02:36:33PM +0100, Michael Kerrisk (man-pages) wrote:
> >>
> >> The disadvantage of MS_STRICTATIME | MS_LAZYTIME is that
> >> in the case of a system crash, the atime and mtime fields
> >> on disk might be out of date by at most 24 hours.
> >
> > I'd change to "The disadvantage of MS_LAZYTIME is that..." and
> > perhaps move that so it's clear it applies to any use of MS_LAZYTIME
> > has this as a downside.
> >
> > Does that make sense?
>
> Thanks, Ted. Got it. So, now we have:
>
> MS_LAZYTIME (since Linux 3.20)
"since Linux 4.0".
--D
> Reduce on-disk updates of inode timestamps (atime,
> mtime, ctime) by maintaining these changes only in mem‐
> ory. The on-disk timestamps are updated only when:
>
> (a) the inode needs to be updated for some change unre‐
> lated to file timestamps;
>
> (b) the application employs fsync(2), syncfs(2), or
> sync(2);
>
> (c) an undeleted inode is evicted from memory; or
>
> (d) more than 24 hours have passed since the inode was
> written to disk.
>
> This mount significantly reduces writes needed to update
> the inode's timestamps, especially mtime and atime.
> However, in the event of a system crash, the atime and
> mtime fields on disk might be out of date by up to 24
> hours.
>
> Examples of workloads where this option could be of sig‐
> nificant benefit include frequent random writes to pre‐
> allocated files, as well as cases where the MS_STRICTA‐
> TIME mount option is also enabled. (The advantage of
> (MS_STRICTATIME | MS_LAZYTIME) is that stat(2) will
> return the correctly updated atime, but the atime
> updates will be flushed to disk only when (1) the inode
> needs to be updated for filesystem / data consistency
> reasons or (2) the inode is pushed out of memory, or (3)
> the filesystem is unmounted.)
>
> Cheers,
>
> Michael
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply
* [PATCH RFC 3/3] Drivers: hv: fcopy: make it consistent with vss/kvp
From: Vitaly Kuznetsov @ 2015-02-27 16:14 UTC (permalink / raw)
To: K. Y. Srinivasan, devel
Cc: Radim Krčmář, Greg Kroah-Hartman, Haiyang Zhang,
linux-kernel, linux-api
In-Reply-To: <1425053665-635-1-git-send-email-vkuznets@redhat.com>
Re-implement fcopy in a consistent with "Drivers: hv: vss/kvp: convert
userspace/kernel communication to using char device" way.
In particular:
- Implement "state machine" for the driver instead of 3 separate syncronization
variables ('fcopy_transaction.active', 'fcopy_transaction.read_sema', 'opened')
- Use ioctl for kernel/userspace version negotiation.
- Support poll() operation.
- Set .owner = THIS_MODULE to prevent kernel crash if the module if being
unloaded while there is an active file descriptior in userspace.
- Use __u32 instead of int in userspace replies as it matches icmsg_hdr.status.
- Other minor changes to make drivers code look alike.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
drivers/hv/hv_fcopy.c | 395 ++++++++++++++++++++++++++------------------
include/uapi/linux/hyperv.h | 1 +
tools/hv/hv_fcopy_daemon.c | 48 ++++--
3 files changed, 264 insertions(+), 180 deletions(-)
diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
index cd453e4..05c3580 100644
--- a/drivers/hv/hv_fcopy.c
+++ b/drivers/hv/hv_fcopy.c
@@ -28,6 +28,7 @@
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/miscdevice.h>
+#include <linux/poll.h>
#include "hyperv_vmbus.h"
@@ -35,6 +36,8 @@
#define WIN8_SRV_MINOR 1
#define WIN8_SRV_VERSION (WIN8_SRV_MAJOR << 16 | WIN8_SRV_MINOR)
+#define MAX_FCOPY_CHSIZE (PAGE_SIZE * 2)
+
/*
* Global state maintained for transaction that is being processed.
* For a class of integration services, including the "file copy service",
@@ -47,36 +50,37 @@
* ensure this by serializing packet processing in this driver - we do not
* read additional packets from the VMBUs until the current packet is fully
* handled.
- *
- * The transaction "active" state is set when we receive a request from the
- * host and we cleanup this state when the transaction is completed - when we
- * respond to the host with our response. When the transaction active state is
- * set, we defer handling incoming packets.
*/
+enum fcopy_device_state {
+ FCOPY_DEVICE_INITIALIZING = 0, /* driver was loaded */
+ FCOPY_DEVICE_OPENED, /* device was opened */
+ FCOPY_READY, /* userspace registered */
+ FCOPY_HOSTMSG_RECEIVED, /* message from host was received */
+ FCOPY_USERMSG_READY, /* message for userspace is ready */
+ FCOPY_USERSPACE_REQ, /* request to userspace was sent */
+ FCOPY_USERSPACE_RECV, /* reply from userspace was received */
+ FCOPY_DEVICE_DYING, /* driver unload is in progress */
+};
+
static struct {
- bool active; /* transaction status - active or not */
+ int state; /* fcopy device state */
int recv_len; /* number of bytes received. */
- struct hv_fcopy_hdr *fcopy_msg; /* current message */
- struct hv_start_fcopy message; /* sent to daemon */
struct vmbus_channel *recv_channel; /* chn we got the request */
u64 recv_req_id; /* request ID. */
void *fcopy_context; /* for the channel callback */
- struct semaphore read_sema;
-} fcopy_transaction;
-
-static bool opened; /* currently device opened */
+ int dm_reg_value; /* daemon version number */
+ struct mutex lock; /* syncronization */
+ u8 user_msg[MAX_FCOPY_CHSIZE]; /* message to userspace */
+ u8 host_msg[MAX_FCOPY_CHSIZE]; /* message to/from host */
+ wait_queue_head_t proc_list; /* waiting processes */
+} fcopy_device;
-/*
- * Before we can accept copy messages from the host, we need
- * to handshake with the user level daemon. This state tracks
- * if we are in the handshake phase.
- */
-static bool in_hand_shake = true;
-static void fcopy_send_data(void);
static void fcopy_respond_to_host(int error);
static void fcopy_work_func(struct work_struct *dummy);
+static void fcopy_send_op(struct work_struct *dummy);
static DECLARE_DELAYED_WORK(fcopy_work, fcopy_work_func);
+static DECLARE_WORK(fcopy_send_op_work, fcopy_send_op);
static u8 *recv_buffer;
static void fcopy_work_func(struct work_struct *dummy)
@@ -86,22 +90,22 @@ static void fcopy_work_func(struct work_struct *dummy)
* process the pending transaction.
*/
fcopy_respond_to_host(HV_E_FAIL);
+}
- /* In the case the user-space daemon crashes, hangs or is killed, we
- * need to down the semaphore, otherwise, after the daemon starts next
- * time, the obsolete data in fcopy_transaction.message or
- * fcopy_transaction.fcopy_msg will be used immediately.
- *
- * NOTE: fcopy_read() happens to get the semaphore (very rare)? We're
- * still OK, because we've reported the failure to the host.
- */
- if (down_trylock(&fcopy_transaction.read_sema))
- ;
-
+static void poll_channel(struct vmbus_channel *channel)
+{
+ if (channel->target_cpu != smp_processor_id())
+ smp_call_function_single(channel->target_cpu,
+ hv_fcopy_onchannelcallback,
+ channel, true);
+ else
+ hv_fcopy_onchannelcallback(channel);
}
static int fcopy_handle_handshake(u32 version)
{
+ int ret = 0;
+
switch (version) {
case FCOPY_CURRENT_VERSION:
break;
@@ -112,23 +116,19 @@ static int fcopy_handle_handshake(u32 version)
* deal with, we will be backward compatible.
* We will add this code when needed.
*/
- return -EINVAL;
+ ret = -EINVAL;
}
- pr_info("FCP: user-mode registering done. Daemon version: %d\n",
- version);
- fcopy_transaction.active = false;
- if (fcopy_transaction.fcopy_context)
- hv_fcopy_onchannelcallback(fcopy_transaction.fcopy_context);
- in_hand_shake = false;
return 0;
}
-static void fcopy_send_data(void)
+static void fcopy_send_op(struct work_struct *dummy)
{
- struct hv_start_fcopy *smsg_out = &fcopy_transaction.message;
- int operation = fcopy_transaction.fcopy_msg->operation;
+ struct hv_start_fcopy *smsg_out;
+ struct hv_do_fcopy *dmsg_out;
struct hv_start_fcopy *smsg_in;
+ mutex_lock(&fcopy_device.lock);
+
/*
* The strings sent from the host are encoded in
* in utf16; convert it to utf8 strings.
@@ -140,11 +140,14 @@ static void fcopy_send_data(void)
* that the strings can be properly terminated!
*/
- switch (operation) {
+ switch (((struct hv_fcopy_hdr *)fcopy_device.host_msg)->operation) {
case START_FILE_COPY:
- memset(smsg_out, 0, sizeof(struct hv_start_fcopy));
- smsg_out->hdr.operation = operation;
- smsg_in = (struct hv_start_fcopy *)fcopy_transaction.fcopy_msg;
+ memset(&fcopy_device.user_msg, 0,
+ sizeof(struct hv_start_fcopy));
+
+ smsg_out = (struct hv_start_fcopy *)fcopy_device.user_msg;
+ smsg_out->hdr.operation = START_FILE_COPY;
+ smsg_in = (struct hv_start_fcopy *)fcopy_device.host_msg;
utf16s_to_utf8s((wchar_t *)smsg_in->file_name, W_MAX_PATH,
UTF16_LITTLE_ENDIAN,
@@ -159,9 +162,16 @@ static void fcopy_send_data(void)
break;
default:
+ dmsg_out = (struct hv_do_fcopy *)fcopy_device.user_msg;
+ memcpy(fcopy_device.user_msg, fcopy_device.host_msg,
+ sizeof(struct hv_do_fcopy));
break;
}
- up(&fcopy_transaction.read_sema);
+
+ fcopy_device.state = FCOPY_USERMSG_READY;
+ wake_up_interruptible(&fcopy_device.proc_list);
+
+ mutex_unlock(&fcopy_device.lock);
return;
}
@@ -169,8 +179,7 @@ static void fcopy_send_data(void)
* Send a response back to the host.
*/
-static void
-fcopy_respond_to_host(int error)
+static void fcopy_respond_to_host(int error)
{
struct icmsg_hdr *icmsghdr;
u32 buf_len;
@@ -185,11 +194,9 @@ fcopy_respond_to_host(int error)
* only be one active transaction at a time.
*/
- buf_len = fcopy_transaction.recv_len;
- channel = fcopy_transaction.recv_channel;
- req_id = fcopy_transaction.recv_req_id;
-
- fcopy_transaction.active = false;
+ buf_len = fcopy_device.recv_len;
+ channel = fcopy_device.recv_channel;
+ req_id = fcopy_device.recv_req_id;
icmsghdr = (struct icmsg_hdr *)
&recv_buffer[sizeof(struct vmbuspipe_hdr)];
@@ -205,6 +212,20 @@ fcopy_respond_to_host(int error)
icmsghdr->icflags = ICMSGHDRFLAG_TRANSACTION | ICMSGHDRFLAG_RESPONSE;
vmbus_sendpacket(channel, recv_buffer, buf_len, req_id,
VM_PKT_DATA_INBAND, 0);
+
+
+ /* We're ready to process next request, reset the device state */
+ if (fcopy_device.state == FCOPY_USERSPACE_RECV ||
+ fcopy_device.state == FCOPY_USERSPACE_REQ)
+ fcopy_device.state = FCOPY_READY;
+
+ /*
+ * Make sure device state was set before polling the channel as
+ * processing can happen on a different CPU.
+ */
+ smp_mb();
+
+ poll_channel(channel);
}
void hv_fcopy_onchannelcallback(void *context)
@@ -218,16 +239,17 @@ void hv_fcopy_onchannelcallback(void *context)
int util_fw_version;
int fcopy_srv_version;
- if (fcopy_transaction.active) {
+ if (fcopy_device.state > FCOPY_READY) {
/*
* We will defer processing this callback once
* the current transaction is complete.
*/
- fcopy_transaction.fcopy_context = context;
+ fcopy_device.fcopy_context = channel;
return;
}
+ fcopy_device.fcopy_context = NULL;
- vmbus_recvpacket(channel, recv_buffer, PAGE_SIZE * 2, &recvlen,
+ vmbus_recvpacket(channel, recv_buffer, MAX_FCOPY_CHSIZE, &recvlen,
&requestid);
if (recvlen <= 0)
return;
@@ -235,6 +257,10 @@ void hv_fcopy_onchannelcallback(void *context)
icmsghdr = (struct icmsg_hdr *)&recv_buffer[
sizeof(struct vmbuspipe_hdr)];
if (icmsghdr->icmsgtype == ICMSGTYPE_NEGOTIATE) {
+ /*
+ * Process negotiation even before usersace daemon is connected
+ * as we can timeout othervise.
+ */
util_fw_version = UTIL_FW_VERSION;
fcopy_srv_version = WIN8_SRV_VERSION;
vmbus_prep_negotiate_resp(icmsghdr, negop, recv_buffer,
@@ -249,17 +275,26 @@ void hv_fcopy_onchannelcallback(void *context)
* transaction; note transactions are serialized.
*/
- fcopy_transaction.active = true;
- fcopy_transaction.recv_len = recvlen;
- fcopy_transaction.recv_channel = channel;
- fcopy_transaction.recv_req_id = requestid;
- fcopy_transaction.fcopy_msg = fcopy_msg;
+ fcopy_device.recv_len = recvlen;
+ fcopy_device.recv_channel = channel;
+ fcopy_device.recv_req_id = requestid;
+
+ if (fcopy_device.state != FCOPY_READY) {
+ /* Userspace daemon is not connected, just fail. */
+ fcopy_respond_to_host(HV_E_FAIL);
+ return;
+ }
+
+ memcpy(fcopy_device.host_msg, fcopy_msg, recvlen -
+ (sizeof(struct vmbuspipe_hdr) +
+ sizeof(struct icmsg_hdr)));
+ fcopy_device.state = FCOPY_HOSTMSG_RECEIVED;
/*
* Send the information to the user-level daemon.
*/
+ schedule_work(&fcopy_send_op_work);
schedule_delayed_work(&fcopy_work, 5*HZ);
- fcopy_send_data();
return;
}
icmsghdr->icflags = ICMSGHDRFLAG_TRANSACTION | ICMSGHDRFLAG_RESPONSE;
@@ -272,121 +307,170 @@ void hv_fcopy_onchannelcallback(void *context)
* the payload.
*/
-static ssize_t fcopy_read(struct file *file, char __user *buf,
- size_t count, loff_t *ppos)
+static ssize_t fcopy_op_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
{
- void *src;
- size_t copy_size;
- int operation;
+ ssize_t ret = 0;
+ int copy_size;
+ struct hv_fcopy_hdr *hdr;
+
+ if (wait_event_interruptible(fcopy_device.proc_list,
+ fcopy_device.state == FCOPY_USERMSG_READY
+ ||
+ fcopy_device.state == FCOPY_DEVICE_DYING))
+ return -EFAULT;
- /*
- * Wait until there is something to be read.
- */
- if (down_interruptible(&fcopy_transaction.read_sema))
- return -EINTR;
+ if (fcopy_device.state != FCOPY_USERMSG_READY)
+ return -EFAULT;
- /*
- * The channel may be rescinded and in this case, we will wakeup the
- * the thread blocked on the semaphore and we will use the opened
- * state to correctly handle this case.
- */
- if (!opened)
- return -ENODEV;
+ mutex_lock(&fcopy_device.lock);
- operation = fcopy_transaction.fcopy_msg->operation;
+ hdr = (struct hv_fcopy_hdr *)fcopy_device.user_msg;
- if (operation == START_FILE_COPY) {
- src = &fcopy_transaction.message;
+ if (hdr->operation == START_FILE_COPY)
copy_size = sizeof(struct hv_start_fcopy);
- if (count < copy_size)
- return 0;
- } else {
- src = fcopy_transaction.fcopy_msg;
+ else
copy_size = sizeof(struct hv_do_fcopy);
- if (count < copy_size)
- return 0;
+
+ if (count < copy_size) {
+ pr_warn("fcopy_op_read: invalid read len: %d (expected: %d)\n",
+ (int)count, copy_size);
+ mutex_unlock(&fcopy_device.lock);
+ return -EINVAL;
}
- if (copy_to_user(buf, src, copy_size))
- return -EFAULT;
- return copy_size;
+ if (!copy_to_user(buf, fcopy_device.user_msg, copy_size)) {
+ fcopy_device.state = FCOPY_USERSPACE_REQ;
+ ret = copy_size;
+ } else
+ ret = -EFAULT;
+
+ mutex_unlock(&fcopy_device.lock);
+ return ret;
}
-static ssize_t fcopy_write(struct file *file, const char __user *buf,
- size_t count, loff_t *ppos)
+static ssize_t fcopy_op_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
{
- int response = 0;
+ int ret = 0;
+ u32 val32;
- if (count != sizeof(int))
- return -EINVAL;
-
- if (copy_from_user(&response, buf, sizeof(int)))
+ if (fcopy_device.state == FCOPY_DEVICE_DYING)
return -EFAULT;
- if (in_hand_shake) {
- if (fcopy_handle_handshake(response))
- return -EINVAL;
- return sizeof(int);
+ if (count != sizeof(u32)) {
+ pr_warn("fcopy_op_write: invalid write len: %d (expected: %d)\n",
+ (int)count, (int)sizeof(u32));
+ return -EINVAL;
}
- /*
- * Complete the transaction by forwarding the result
- * to the host. But first, cancel the timeout.
- */
- if (cancel_delayed_work_sync(&fcopy_work))
- fcopy_respond_to_host(response);
+ mutex_lock(&fcopy_device.lock);
- return sizeof(int);
+ if (fcopy_device.state == FCOPY_USERSPACE_REQ) {
+ if (!copy_from_user(&val32, buf, sizeof(u32))) {
+ fcopy_device.state = FCOPY_USERSPACE_RECV;
+ if (cancel_delayed_work_sync(&fcopy_work))
+ fcopy_respond_to_host(val32);
+ ret = sizeof(u32);
+ } else
+ ret = -EFAULT;
+ } else {
+ pr_warn("fcopy_op_write: invalid transaction state: %d\n",
+ fcopy_device.state);
+ ret = -EINVAL;
+ }
+
+ mutex_unlock(&fcopy_device.lock);
+ return ret;
}
-static int fcopy_open(struct inode *inode, struct file *f)
+static int fcopy_op_open(struct inode *inode, struct file *f)
{
/*
* The user level daemon that will open this device is
* really an extension of this driver. We can have only
* active open at a time.
*/
- if (opened)
+ if (fcopy_device.state != FCOPY_DEVICE_INITIALIZING)
return -EBUSY;
-
- /*
- * The daemon is alive; setup the state.
- */
- opened = true;
+ fcopy_device.state = FCOPY_DEVICE_OPENED;
return 0;
}
-/* XXX: there are still some tricky corner cases, e.g.,
- * 1) In a SMP guest, when fcopy_release() runs between
- * schedule_delayed_work() and fcopy_send_data(), there is
- * still a chance an obsolete message will be queued.
- *
- * 2) When the fcopy daemon is running, if we unload the driver,
- * we'll notice a kernel oops when we kill the daemon later.
- */
-static int fcopy_release(struct inode *inode, struct file *f)
+static int fcopy_op_release(struct inode *inode, struct file *f)
{
/*
* The daemon has exited; reset the state.
*/
- in_hand_shake = true;
- opened = false;
-
- if (cancel_delayed_work_sync(&fcopy_work)) {
- /* We haven't up()-ed the semaphore(very rare)? */
- if (down_trylock(&fcopy_transaction.read_sema))
- ;
- fcopy_respond_to_host(HV_E_FAIL);
- }
+ fcopy_device.state = FCOPY_DEVICE_INITIALIZING;
+ return 0;
+}
+
+static unsigned int fcopy_op_poll(struct file *file, poll_table *wait)
+{
+ if (fcopy_device.state == FCOPY_DEVICE_DYING)
+ return -EFAULT;
+
+ poll_wait(file, &fcopy_device.proc_list, wait);
+ if (fcopy_device.state == FCOPY_USERMSG_READY)
+ return POLLIN | POLLRDNORM;
return 0;
}
+static long fcopy_op_ioctl(struct file *fp,
+ unsigned int cmd, unsigned long arg)
+{
+ long ret = 0;
+ void __user *argp = (void __user *)arg;
+ u32 val32;
+
+ if (fcopy_device.state == FCOPY_DEVICE_DYING)
+ return -EFAULT;
+
+ /* The only ioctl we have is registation */
+ if (fcopy_device.state != FCOPY_DEVICE_OPENED)
+ return -EINVAL;
+
+ mutex_lock(&fcopy_device.lock);
+
+ switch (cmd) {
+ case HYPERV_FCOPY_REGISTER:
+ if (copy_from_user(&val32, argp, sizeof(val32))) {
+ ret = -EFAULT;
+ break;
+ }
+ if (!fcopy_handle_handshake(val32)) {
+ /* No special meaning for userspace part for now. */
+ val32 = (u32)WIN8_SRV_VERSION;
+ if (copy_to_user(argp, &val32, sizeof(val32))) {
+ ret = -EFAULT;
+ break;
+ }
+ fcopy_device.state = FCOPY_READY;
+ pr_info("FCOPY: user-mode registering done.\n");
+ if (fcopy_device.fcopy_context)
+ poll_channel(fcopy_device.fcopy_context);
+ } else
+ ret = -EINVAL;
+ break;
+
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ mutex_unlock(&fcopy_device.lock);
+ return ret;
+}
static const struct file_operations fcopy_fops = {
- .read = fcopy_read,
- .write = fcopy_write,
- .release = fcopy_release,
- .open = fcopy_open,
+ .owner = THIS_MODULE,
+ .read = fcopy_op_read,
+ .write = fcopy_op_write,
+ .release = fcopy_op_release,
+ .open = fcopy_op_open,
+ .poll = fcopy_op_poll,
+ .unlocked_ioctl = fcopy_op_ioctl
};
static struct miscdevice fcopy_misc = {
@@ -395,29 +479,6 @@ static struct miscdevice fcopy_misc = {
.fops = &fcopy_fops,
};
-static int fcopy_dev_init(void)
-{
- return misc_register(&fcopy_misc);
-}
-
-static void fcopy_dev_deinit(void)
-{
-
- /*
- * The device is going away - perhaps because the
- * host has rescinded the channel. Setup state so that
- * user level daemon can gracefully exit if it is blocked
- * on the read semaphore.
- */
- opened = false;
- /*
- * Signal the semaphore as the device is
- * going away.
- */
- up(&fcopy_transaction.read_sema);
- misc_deregister(&fcopy_misc);
-}
-
int hv_fcopy_init(struct hv_util_service *srv)
{
recv_buffer = srv->recv_buffer;
@@ -428,14 +489,20 @@ int hv_fcopy_init(struct hv_util_service *srv)
* Defer processing channel callbacks until the daemon
* has registered.
*/
- fcopy_transaction.active = true;
- sema_init(&fcopy_transaction.read_sema, 0);
+ fcopy_device.state = FCOPY_DEVICE_INITIALIZING;
+ init_waitqueue_head(&fcopy_device.proc_list);
+ mutex_init(&fcopy_device.lock);
- return fcopy_dev_init();
+ return misc_register(&fcopy_misc);
}
void hv_fcopy_deinit(void)
{
+ fcopy_device.state = FCOPY_DEVICE_DYING;
+ /* Make sure nobody sees the old state */
+ smp_mb();
+ wake_up_interruptible(&fcopy_device.proc_list);
cancel_delayed_work_sync(&fcopy_work);
- fcopy_dev_deinit();
+ cancel_work_sync(&fcopy_send_op_work);
+ misc_deregister(&fcopy_misc);
}
diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
index 1939826..590a2f4 100644
--- a/include/uapi/linux/hyperv.h
+++ b/include/uapi/linux/hyperv.h
@@ -397,5 +397,6 @@ struct hv_kvp_ip_msg {
*/
#define HYPERV_KVP_REGISTER _IOWR('v', 0, __u32)
#define HYPERV_VSS_REGISTER _IOWR('v', 0, __u32)
+#define HYPERV_FCOPY_REGISTER _IOWR('v', 0, __u32)
#endif /* _UAPI_HYPERV_H */
diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
index 9445d8f..2ae8196 100644
--- a/tools/hv/hv_fcopy_daemon.c
+++ b/tools/hv/hv_fcopy_daemon.c
@@ -18,19 +18,16 @@
#include <sys/types.h>
-#include <sys/socket.h>
#include <sys/poll.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
#include <linux/types.h>
-#include <linux/kdev_t.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
-#include <string.h>
-#include <ctype.h>
#include <errno.h>
#include <linux/hyperv.h>
#include <syslog.h>
-#include <sys/stat.h>
#include <fcntl.h>
#include <dirent.h>
#include <getopt.h>
@@ -132,7 +129,8 @@ void print_usage(char *argv[])
int main(int argc, char *argv[])
{
int fcopy_fd, len;
- int error;
+ __u32 error;
+ struct pollfd pfd;
int daemonize = 1, long_index = 0, opt;
int version = FCOPY_CURRENT_VERSION;
char *buffer[4096 * 2];
@@ -176,19 +174,33 @@ int main(int argc, char *argv[])
/*
* Register with the kernel.
*/
- if ((write(fcopy_fd, &version, sizeof(int))) != sizeof(int)) {
- syslog(LOG_ERR, "Registration failed: %s", strerror(errno));
+ if (ioctl(fcopy_fd, HYPERV_FCOPY_REGISTER, &version)) {
+ syslog(LOG_ERR, "registration to kernel failed; error: %d %s",
+ errno, strerror(errno));
+ close(fcopy_fd);
exit(EXIT_FAILURE);
}
+ pfd.fd = fcopy_fd;
+
while (1) {
- /*
- * In this loop we process fcopy messages after the
- * handshake is complete.
- */
- len = pread(fcopy_fd, buffer, (4096 * 2), 0);
+ pfd.events = POLLIN;
+ pfd.revents = 0;
+
+ if (poll(&pfd, 1, -1) < 0) {
+ syslog(LOG_ERR, "poll failed; error:%d %s", errno,
+ strerror(errno));
+ if (errno == EINVAL) {
+ close(fcopy_fd);
+ exit(EXIT_FAILURE);
+ } else
+ continue;
+ }
+
+ len = read(fcopy_fd, buffer, (4096 * 2));
if (len < 0) {
- syslog(LOG_ERR, "pread failed: %s", strerror(errno));
+ syslog(LOG_ERR, "read failed: %d %s", errno,
+ strerror(errno));
exit(EXIT_FAILURE);
}
in_msg = (struct hv_fcopy_hdr *)buffer;
@@ -213,9 +225,13 @@ int main(int argc, char *argv[])
}
- if (pwrite(fcopy_fd, &error, sizeof(int), 0) != sizeof(int)) {
- syslog(LOG_ERR, "pwrite failed: %s", strerror(errno));
+ if (write(fcopy_fd, &error, sizeof(__u32)) != sizeof(__u32)) {
+ syslog(LOG_ERR, "write failed: %d %s", errno,
+ strerror(errno));
exit(EXIT_FAILURE);
}
}
+
+ close(fcopy_fd);
+ exit(0);
}
--
1.9.3
^ permalink raw reply related
* [PATCH RFC 2/3] Drivers: hv: vss: convert userspace/kernel communication to using char device
From: Vitaly Kuznetsov @ 2015-02-27 16:14 UTC (permalink / raw)
To: K. Y. Srinivasan, devel
Cc: Radim Krčmář, Greg Kroah-Hartman, Haiyang Zhang,
linux-kernel, linux-api
In-Reply-To: <1425053665-635-1-git-send-email-vkuznets@redhat.com>
Userspace/kernel communication via netlink has a number of issues:
- It is hard for userspace to figure out if the kernel part was loaded or not
and this fact can change as there is a way to enable/disable the service from
host side. Racy daemon startup is also a problem.
- When the userspace daemon restarts/dies kernel part doesn't receive a
notification.
- Netlink communication is not stable under heavy load.
- ...
Re-implement the communication using misc char device. Use ioctl to do
kernel/userspace version negotiation (doesn't make much sense at this moment
as we're breaking backwards compatibility but can be used in future). Read from
the device returns struct hv_vss_msg and userspace is supposed to reply with
__u32.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
drivers/hv/hv_snapshot.c | 335 ++++++++++++++++++++++++++++++++------------
include/uapi/linux/hyperv.h | 1 +
tools/hv/hv_vss_daemon.c | 141 ++++---------------
3 files changed, 273 insertions(+), 204 deletions(-)
diff --git a/drivers/hv/hv_snapshot.c b/drivers/hv/hv_snapshot.c
index 9d5e0d1..b399758 100644
--- a/drivers/hv/hv_snapshot.c
+++ b/drivers/hv/hv_snapshot.c
@@ -20,9 +20,12 @@
#include <linux/net.h>
#include <linux/nls.h>
-#include <linux/connector.h>
+#include <linux/sched.h>
#include <linux/workqueue.h>
+#include <linux/mutex.h>
#include <linux/hyperv.h>
+#include <linux/miscdevice.h>
+#include <linux/poll.h>
#define VSS_MAJOR 5
#define VSS_MINOR 0
@@ -31,26 +34,36 @@
#define VSS_USERSPACE_TIMEOUT (msecs_to_jiffies(10 * 1000))
/*
- * Global state maintained for transaction that is being processed.
- * Note that only one transaction can be active at any point in time.
- *
- * This state is set when we receive a request from the host; we
- * cleanup this state when the transaction is completed - when we respond
- * to the host with the key value.
+ * Global state maintained for the device. Note that only one transaction can
+ * be active at any point in time.
*/
+enum vss_device_state {
+ VSS_DEVICE_INITIALIZING = 0, /* driver was loaded */
+ VSS_DEVICE_OPENED, /* device was opened */
+ VSS_READY, /* userspace registered */
+ VSS_HOSTMSG_RECEIVED, /* message from host was received */
+ VSS_USERMSG_READY, /* message for userspace is ready */
+ VSS_USERSPACE_REQ, /* request to userspace was sent */
+ VSS_USERSPACE_RECV, /* reply from userspace was received */
+ VSS_DEVICE_DYING, /* driver unload is in progress */
+};
+
static struct {
- bool active; /* transaction status - active or not */
+ int state; /* vss_device_state */
int recv_len; /* number of bytes received. */
struct vmbus_channel *recv_channel; /* chn we got the request */
u64 recv_req_id; /* request ID. */
- struct hv_vss_msg *msg; /* current message */
-} vss_transaction;
-
+ void *vss_context; /* for the channel callback */
+ int dm_reg_value; /* daemon version number */
+ struct mutex lock; /* syncronization */
+ struct hv_vss_msg user_msg; /* message to/from userspace */
+ struct hv_vss_msg host_msg; /* message to/from host */
+ wait_queue_head_t proc_list; /* waiting processes */
+} vss_device;
-static void vss_respond_to_host(int error);
+static void vss_respond_to_host(u32 error);
-static struct cb_id vss_id = { CN_VSS_IDX, CN_VSS_VAL };
static const char vss_name[] = "vss_kernel_module";
static __u8 *recv_buffer;
@@ -60,6 +73,23 @@ static void vss_timeout_func(struct work_struct *dummy);
static DECLARE_DELAYED_WORK(vss_timeout_work, vss_timeout_func);
static DECLARE_WORK(vss_send_op_work, vss_send_op);
+static int vss_handle_handshake(u32 op)
+{
+ vss_device.dm_reg_value = op;
+
+ return 0;
+}
+
+static void poll_channel(struct vmbus_channel *channel)
+{
+ if (channel->target_cpu != smp_processor_id())
+ smp_call_function_single(channel->target_cpu,
+ hv_vss_onchannelcallback,
+ channel, true);
+ else
+ hv_vss_onchannelcallback(channel);
+}
+
/*
* Callback when data is received from user mode.
*/
@@ -73,62 +103,27 @@ static void vss_timeout_func(struct work_struct *dummy)
vss_respond_to_host(HV_E_FAIL);
}
-static void
-vss_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
-{
- struct hv_vss_msg *vss_msg;
-
- vss_msg = (struct hv_vss_msg *)msg->data;
-
- if (vss_msg->vss_hdr.operation == VSS_OP_REGISTER) {
- pr_info("VSS daemon registered\n");
- vss_transaction.active = false;
- if (vss_transaction.recv_channel != NULL)
- hv_vss_onchannelcallback(vss_transaction.recv_channel);
- return;
-
- }
- if (cancel_delayed_work_sync(&vss_timeout_work))
- vss_respond_to_host(vss_msg->error);
-}
-
-
static void vss_send_op(struct work_struct *dummy)
{
- int op = vss_transaction.msg->vss_hdr.operation;
- int rc;
- struct cn_msg *msg;
- struct hv_vss_msg *vss_msg;
+ mutex_lock(&vss_device.lock);
- msg = kzalloc(sizeof(*msg) + sizeof(*vss_msg), GFP_ATOMIC);
- if (!msg)
+ if (vss_device.state != VSS_HOSTMSG_RECEIVED)
return;
- vss_msg = (struct hv_vss_msg *)msg->data;
-
- msg->id.idx = CN_VSS_IDX;
- msg->id.val = CN_VSS_VAL;
-
- vss_msg->vss_hdr.operation = op;
- msg->len = sizeof(struct hv_vss_msg);
+ memcpy(&vss_device.user_msg, &vss_device.host_msg,
+ sizeof(struct hv_vss_msg));
- rc = cn_netlink_send(msg, 0, 0, GFP_ATOMIC);
- if (rc) {
- pr_warn("VSS: failed to communicate to the daemon: %d\n", rc);
- if (cancel_delayed_work_sync(&vss_timeout_work))
- vss_respond_to_host(HV_E_FAIL);
- }
- kfree(msg);
+ vss_device.state = VSS_USERMSG_READY;
+ wake_up_interruptible(&vss_device.proc_list);
+ mutex_unlock(&vss_device.lock);
return;
}
/*
* Send a response back to the host.
*/
-
-static void
-vss_respond_to_host(int error)
+static void vss_respond_to_host(u32 error)
{
struct icmsg_hdr *icmsghdrp;
u32 buf_len;
@@ -136,25 +131,13 @@ vss_respond_to_host(int error)
u64 req_id;
/*
- * If a transaction is not active; log and return.
- */
-
- if (!vss_transaction.active) {
- /*
- * This is a spurious call!
- */
- pr_warn("VSS: Transaction not active\n");
- return;
- }
- /*
* Copy the global state for completing the transaction. Note that
* only one transaction can be active at a time.
*/
- buf_len = vss_transaction.recv_len;
- channel = vss_transaction.recv_channel;
- req_id = vss_transaction.recv_req_id;
- vss_transaction.active = false;
+ buf_len = vss_device.recv_len;
+ channel = vss_device.recv_channel;
+ req_id = vss_device.recv_req_id;
icmsghdrp = (struct icmsg_hdr *)
&recv_buffer[sizeof(struct vmbuspipe_hdr)];
@@ -173,6 +156,17 @@ vss_respond_to_host(int error)
vmbus_sendpacket(channel, recv_buffer, buf_len, req_id,
VM_PKT_DATA_INBAND, 0);
+ /* We're ready to process next request, reset the device state */
+ if (vss_device.state == VSS_USERSPACE_RECV ||
+ vss_device.state == VSS_USERSPACE_REQ)
+ vss_device.state = VSS_READY;
+ /*
+ * Make sure device state was set before polling the channel as
+ * processing can happen on a different CPU.
+ */
+ smp_mb();
+
+ poll_channel(channel);
}
/*
@@ -186,19 +180,18 @@ void hv_vss_onchannelcallback(void *context)
u32 recvlen;
u64 requestid;
struct hv_vss_msg *vss_msg;
-
-
struct icmsg_hdr *icmsghdrp;
struct icmsg_negotiate *negop = NULL;
- if (vss_transaction.active) {
+ if (vss_device.state > VSS_READY) {
/*
* We will defer processing this callback once
* the current transaction is complete.
*/
- vss_transaction.recv_channel = channel;
+ vss_device.vss_context = channel;
return;
}
+ vss_device.vss_context = NULL;
vmbus_recvpacket(channel, recv_buffer, PAGE_SIZE * 2, &recvlen,
&requestid);
@@ -221,11 +214,9 @@ void hv_vss_onchannelcallback(void *context)
* transaction; note transactions are serialized.
*/
- vss_transaction.recv_len = recvlen;
- vss_transaction.recv_channel = channel;
- vss_transaction.recv_req_id = requestid;
- vss_transaction.active = true;
- vss_transaction.msg = (struct hv_vss_msg *)vss_msg;
+ vss_device.recv_len = recvlen;
+ vss_device.recv_channel = channel;
+ vss_device.recv_req_id = requestid;
switch (vss_msg->vss_hdr.operation) {
/*
@@ -241,6 +232,15 @@ void hv_vss_onchannelcallback(void *context)
*/
case VSS_OP_FREEZE:
case VSS_OP_THAW:
+ if (vss_device.state != VSS_READY) {
+ /* Userspace daemon is not connected */
+ vss_respond_to_host(HV_E_FAIL);
+ return;
+ }
+
+ memcpy(&vss_device.host_msg, vss_msg,
+ sizeof(struct hv_vss_msg));
+ vss_device.state = VSS_HOSTMSG_RECEIVED;
schedule_work(&vss_send_op_work);
schedule_delayed_work(&vss_timeout_work,
VSS_USERSPACE_TIMEOUT);
@@ -275,14 +275,164 @@ void hv_vss_onchannelcallback(void *context)
}
-int
-hv_vss_init(struct hv_util_service *srv)
+static int vss_op_open(struct inode *inode, struct file *f)
{
- int err;
+ if (vss_device.state != VSS_DEVICE_INITIALIZING)
+ return -EBUSY;
+ vss_device.state = VSS_DEVICE_OPENED;
+ return 0;
+}
- err = cn_add_callback(&vss_id, vss_name, vss_cn_callback);
- if (err)
- return err;
+static int vss_op_release(struct inode *inode, struct file *f)
+{
+ vss_device.state = VSS_DEVICE_INITIALIZING;
+ return 0;
+}
+
+static ssize_t vss_op_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ int ret = 0;
+ u32 val32;
+
+ if (vss_device.state == VSS_DEVICE_DYING)
+ return -EFAULT;
+
+ if (count != sizeof(u32)) {
+ pr_warn("vss_op_write: invalid write len: %d (expected: %d)\n",
+ (int)count, (int)sizeof(u32));
+ return -EINVAL;
+ }
+
+ mutex_lock(&vss_device.lock);
+
+ if (vss_device.state == VSS_USERSPACE_REQ) {
+ if (!copy_from_user(&val32, buf, sizeof(u32))) {
+ vss_device.state = VSS_USERSPACE_RECV;
+ if (cancel_delayed_work_sync(&vss_timeout_work))
+ vss_respond_to_host(val32);
+ ret = sizeof(u32);
+ } else
+ ret = -EFAULT;
+ } else {
+ pr_warn("vss_op_write: invalid transaction state: %d\n",
+ vss_device.state);
+ ret = -EINVAL;
+ }
+
+ mutex_unlock(&vss_device.lock);
+ return ret;
+}
+
+static ssize_t vss_op_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ ssize_t ret = 0;
+
+ if (vss_device.state == VSS_DEVICE_DYING)
+ return -EFAULT;
+
+ if (count != sizeof(struct hv_vss_msg)) {
+ pr_warn("vss_op_read: invalid read len: %d (expected: %d)\n",
+ (int)count, (int)sizeof(struct hv_vss_msg));
+ return -EINVAL;
+ }
+
+ if (wait_event_interruptible(vss_device.proc_list,
+ vss_device.state == VSS_USERMSG_READY ||
+ vss_device.state == VSS_DEVICE_DYING))
+ return -EFAULT;
+
+ if (vss_device.state != VSS_USERMSG_READY)
+ return -EFAULT;
+
+ mutex_lock(&vss_device.lock);
+
+ if (!copy_to_user(buf, &vss_device.user_msg,
+ sizeof(struct hv_vss_msg))) {
+ vss_device.state = VSS_USERSPACE_REQ;
+ ret = sizeof(struct hv_vss_msg);
+ } else
+ ret = -EFAULT;
+
+ mutex_unlock(&vss_device.lock);
+ return ret;
+}
+
+static unsigned int vss_op_poll(struct file *file, poll_table *wait)
+{
+ if (vss_device.state == VSS_DEVICE_DYING)
+ return -EFAULT;
+
+ poll_wait(file, &vss_device.proc_list, wait);
+ if (vss_device.state == VSS_USERMSG_READY)
+ return POLLIN | POLLRDNORM;
+ return 0;
+}
+
+static long vss_op_ioctl(struct file *fp,
+ unsigned int cmd, unsigned long arg)
+{
+ long ret = 0;
+ void __user *argp = (void __user *)arg;
+ u32 val32;
+
+ if (vss_device.state == VSS_DEVICE_DYING)
+ return -EFAULT;
+
+ /* The only ioctl we have is registation */
+ if (vss_device.state != VSS_DEVICE_OPENED)
+ return -EINVAL;
+
+ mutex_lock(&vss_device.lock);
+
+ switch (cmd) {
+ case HYPERV_VSS_REGISTER:
+ if (copy_from_user(&val32, argp, sizeof(val32))) {
+ ret = -EFAULT;
+ break;
+ }
+ if (!vss_handle_handshake(val32)) {
+ val32 = (u32)VSS_VERSION;
+ if (copy_to_user(argp, &val32, sizeof(val32))) {
+ ret = -EFAULT;
+ break;
+ }
+ vss_device.state = VSS_READY;
+ pr_info("VSS: user-mode registering done.\n");
+ if (vss_device.vss_context)
+ poll_channel(vss_device.vss_context);
+ } else
+ ret = -EINVAL;
+ break;
+
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ mutex_unlock(&vss_device.lock);
+ return ret;
+}
+
+static const struct file_operations vss_fops = {
+ .owner = THIS_MODULE,
+ .read = vss_op_read,
+ .write = vss_op_write,
+ .release = vss_op_release,
+ .open = vss_op_open,
+ .poll = vss_op_poll,
+ .unlocked_ioctl = vss_op_ioctl
+};
+
+static struct miscdevice vss_misc = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "vmbus/hv_vss",
+ .fops = &vss_fops,
+};
+
+int hv_vss_init(struct hv_util_service *srv)
+{
recv_buffer = srv->recv_buffer;
/*
@@ -291,13 +441,20 @@ hv_vss_init(struct hv_util_service *srv)
* Defer processing channel callbacks until the daemon
* has registered.
*/
- vss_transaction.active = true;
- return 0;
+ vss_device.state = VSS_DEVICE_INITIALIZING;
+ init_waitqueue_head(&vss_device.proc_list);
+ mutex_init(&vss_device.lock);
+
+ return misc_register(&vss_misc);
}
void hv_vss_deinit(void)
{
- cn_del_callback(&vss_id);
+ vss_device.state = VSS_DEVICE_DYING;
+ /* Make sure nobody sees the old state */
+ smp_mb();
+ wake_up_interruptible(&vss_device.proc_list);
cancel_delayed_work_sync(&vss_timeout_work);
cancel_work_sync(&vss_send_op_work);
+ misc_deregister(&vss_misc);
}
diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
index 80713a3..1939826 100644
--- a/include/uapi/linux/hyperv.h
+++ b/include/uapi/linux/hyperv.h
@@ -396,5 +396,6 @@ struct hv_kvp_ip_msg {
* either KVP_OP_REGISTER or KVP_OP_REGISTER1.
*/
#define HYPERV_KVP_REGISTER _IOWR('v', 0, __u32)
+#define HYPERV_VSS_REGISTER _IOWR('v', 0, __u32)
#endif /* _UAPI_HYPERV_H */
diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index 5e63f70..d3f1fe9 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -19,7 +19,6 @@
#include <sys/types.h>
-#include <sys/socket.h>
#include <sys/poll.h>
#include <sys/ioctl.h>
#include <fcntl.h>
@@ -30,21 +29,11 @@
#include <string.h>
#include <ctype.h>
#include <errno.h>
-#include <arpa/inet.h>
#include <linux/fs.h>
-#include <linux/connector.h>
#include <linux/hyperv.h>
-#include <linux/netlink.h>
#include <syslog.h>
#include <getopt.h>
-static struct sockaddr_nl addr;
-
-#ifndef SOL_NETLINK
-#define SOL_NETLINK 270
-#endif
-
-
/* Don't use syslog() in the function since that can cause write to disk */
static int vss_do_freeze(char *dir, unsigned int cmd)
{
@@ -137,33 +126,6 @@ out:
return error;
}
-static int netlink_send(int fd, struct cn_msg *msg)
-{
- struct nlmsghdr nlh = { .nlmsg_type = NLMSG_DONE };
- unsigned int size;
- struct msghdr message;
- struct iovec iov[2];
-
- size = sizeof(struct cn_msg) + msg->len;
-
- nlh.nlmsg_pid = getpid();
- nlh.nlmsg_len = NLMSG_LENGTH(size);
-
- iov[0].iov_base = &nlh;
- iov[0].iov_len = sizeof(nlh);
-
- iov[1].iov_base = msg;
- iov[1].iov_len = size;
-
- memset(&message, 0, sizeof(message));
- message.msg_name = &addr;
- message.msg_namelen = sizeof(addr);
- message.msg_iov = iov;
- message.msg_iovlen = 2;
-
- return sendmsg(fd, &message, 0);
-}
-
void print_usage(char *argv[])
{
fprintf(stderr, "Usage: %s [options]\n"
@@ -174,17 +136,13 @@ void print_usage(char *argv[])
int main(int argc, char *argv[])
{
- int fd, len, nl_group;
- int error;
- struct cn_msg *message;
+ int vss_fd, len;
+ __u32 error;
struct pollfd pfd;
- struct nlmsghdr *incoming_msg;
- struct cn_msg *incoming_cn_msg;
int op;
- struct hv_vss_msg *vss_msg;
- char *vss_recv_buffer;
- size_t vss_recv_buffer_len;
+ struct hv_vss_msg vss_msg[1];
int daemonize = 1, long_index = 0, opt;
+ __u32 daemon_ver = 1; /* No special meaning */
static struct option long_options[] = {
{"help", no_argument, 0, 'h' },
@@ -211,98 +169,50 @@ int main(int argc, char *argv[])
openlog("Hyper-V VSS", 0, LOG_USER);
syslog(LOG_INFO, "VSS starting; pid is:%d", getpid());
- vss_recv_buffer_len = NLMSG_LENGTH(0) + sizeof(struct cn_msg) + sizeof(struct hv_vss_msg);
- vss_recv_buffer = calloc(1, vss_recv_buffer_len);
- if (!vss_recv_buffer) {
- syslog(LOG_ERR, "Failed to allocate netlink buffers");
- exit(EXIT_FAILURE);
- }
+ vss_fd = open("/dev/vmbus/hv_vss", O_RDWR);
- fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
- if (fd < 0) {
- syslog(LOG_ERR, "netlink socket creation failed; error:%d %s",
- errno, strerror(errno));
+ if (vss_fd < 0) {
+ syslog(LOG_ERR, "open /dev/vmbus/hv_vss failed; error: %d %s",
+ errno, strerror(errno));
exit(EXIT_FAILURE);
}
- addr.nl_family = AF_NETLINK;
- addr.nl_pad = 0;
- addr.nl_pid = 0;
- addr.nl_groups = 0;
-
- error = bind(fd, (struct sockaddr *)&addr, sizeof(addr));
- if (error < 0) {
- syslog(LOG_ERR, "bind failed; error:%d %s", errno, strerror(errno));
- close(fd);
- exit(EXIT_FAILURE);
- }
- nl_group = CN_VSS_IDX;
- if (setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, &nl_group, sizeof(nl_group)) < 0) {
- syslog(LOG_ERR, "setsockopt failed; error:%d %s", errno, strerror(errno));
- close(fd);
- exit(EXIT_FAILURE);
- }
/*
* Register ourselves with the kernel.
*/
- message = (struct cn_msg *)vss_recv_buffer;
- message->id.idx = CN_VSS_IDX;
- message->id.val = CN_VSS_VAL;
- message->ack = 0;
- vss_msg = (struct hv_vss_msg *)message->data;
- vss_msg->vss_hdr.operation = VSS_OP_REGISTER;
-
- message->len = sizeof(struct hv_vss_msg);
-
- len = netlink_send(fd, message);
- if (len < 0) {
- syslog(LOG_ERR, "netlink_send failed; error:%d %s", errno, strerror(errno));
- close(fd);
+ if (ioctl(vss_fd, HYPERV_VSS_REGISTER, &daemon_ver)) {
+ syslog(LOG_ERR, "registration to kernel failed; error: %d %s",
+ errno, strerror(errno));
+ close(vss_fd);
exit(EXIT_FAILURE);
}
- pfd.fd = fd;
+ pfd.fd = vss_fd;
while (1) {
- struct sockaddr *addr_p = (struct sockaddr *) &addr;
- socklen_t addr_l = sizeof(addr);
pfd.events = POLLIN;
pfd.revents = 0;
if (poll(&pfd, 1, -1) < 0) {
syslog(LOG_ERR, "poll failed; error:%d %s", errno, strerror(errno));
if (errno == EINVAL) {
- close(fd);
+ close(vss_fd);
exit(EXIT_FAILURE);
}
else
continue;
}
- len = recvfrom(fd, vss_recv_buffer, vss_recv_buffer_len, 0,
- addr_p, &addr_l);
+ len = read(vss_fd, vss_msg, sizeof(struct hv_vss_msg));
- if (len < 0) {
- syslog(LOG_ERR, "recvfrom failed; pid:%u error:%d %s",
- addr.nl_pid, errno, strerror(errno));
- close(fd);
- return -1;
- }
+ if (len != sizeof(struct hv_vss_msg)) {
+ syslog(LOG_ERR, "read failed; error:%d %s",
+ errno, strerror(errno));
- if (addr.nl_pid) {
- syslog(LOG_WARNING,
- "Received packet from untrusted pid:%u",
- addr.nl_pid);
- continue;
+ close(vss_fd);
+ return EXIT_FAILURE;
}
- incoming_msg = (struct nlmsghdr *)vss_recv_buffer;
-
- if (incoming_msg->nlmsg_type != NLMSG_DONE)
- continue;
-
- incoming_cn_msg = (struct cn_msg *)NLMSG_DATA(incoming_msg);
- vss_msg = (struct hv_vss_msg *)incoming_cn_msg->data;
op = vss_msg->vss_hdr.operation;
error = HV_S_OK;
@@ -324,13 +234,14 @@ int main(int argc, char *argv[])
default:
syslog(LOG_ERR, "Illegal op:%d\n", op);
}
- vss_msg->error = error;
- len = netlink_send(fd, incoming_cn_msg);
- if (len < 0) {
- syslog(LOG_ERR, "net_link send failed; error:%d %s",
- errno, strerror(errno));
+
+ if (write(vss_fd, &error, sizeof(__u32)) != sizeof(__u32)) {
+ syslog(LOG_ERR, "write failed; error: %d %s", errno,
+ strerror(errno));
exit(EXIT_FAILURE);
}
}
+ close(vss_fd);
+ exit(0);
}
--
1.9.3
^ permalink raw reply related
* [PATCH RFC 1/3] Drivers: hv: kvp: convert userspace/kernel communication to using char device
From: Vitaly Kuznetsov @ 2015-02-27 16:14 UTC (permalink / raw)
To: K. Y. Srinivasan, devel
Cc: Haiyang Zhang, linux-kernel, Dexuan Cui,
Radim Krčmář, Greg Kroah-Hartman, linux-api
In-Reply-To: <1425053665-635-1-git-send-email-vkuznets@redhat.com>
Userspace/kernel communication via netlink has a number of issues:
- It is hard for userspace to figure out if the kernel part was loaded or not
and this fact can change as there is a way to enable/disable the service from
host side. Racy daemon startup is also a problem.
- When the userspace daemon restarts/dies kernel part doesn't receive a
notification.
- Netlink communication is not stable under heavy load.
- ...
Re-implement the communication using misc char device. Use ioctl to do
kernel/userspace version negotiation (doesn't make much sense at this moment
as we're breaking backwards compatibility but can be used in future).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
drivers/hv/hv_kvp.c | 396 +++++++++++++++++++++++++++-----------------
include/uapi/linux/hyperv.h | 8 +
tools/hv/hv_kvp_daemon.c | 187 ++++-----------------
3 files changed, 287 insertions(+), 304 deletions(-)
diff --git a/drivers/hv/hv_kvp.c b/drivers/hv/hv_kvp.c
index beb8105..8078b1a 100644
--- a/drivers/hv/hv_kvp.c
+++ b/drivers/hv/hv_kvp.c
@@ -22,12 +22,16 @@
*/
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-#include <linux/net.h>
#include <linux/nls.h>
-#include <linux/connector.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
#include <linux/workqueue.h>
+#include <linux/mutex.h>
#include <linux/hyperv.h>
+#include <linux/miscdevice.h>
+#include <linux/poll.h>
+#include <linux/uaccess.h>
/*
* Pre win8 version numbers used in ws2008 and ws 2008 r2 (win7)
@@ -45,46 +49,41 @@
#define WIN8_SRV_VERSION (WIN8_SRV_MAJOR << 16 | WIN8_SRV_MINOR)
/*
- * Global state maintained for transaction that is being processed.
- * Note that only one transaction can be active at any point in time.
- *
- * This state is set when we receive a request from the host; we
- * cleanup this state when the transaction is completed - when we respond
- * to the host with the key value.
+ * Global state maintained for the device. Note that only one transaction can
+ * be active at any point in time.
*/
+enum kvp_device_state {
+ KVP_DEVICE_INITIALIZING = 0, /* driver was loaded */
+ KVP_DEVICE_OPENED, /* device was opened */
+ KVP_READY, /* userspace was registered */
+ KVP_HOSTMSG_RECEIVED, /* message from host was received */
+ KVP_USERMSG_READY, /* message for userspace is ready */
+ KVP_USERSPACE_REQ, /* request to userspace was sent */
+ KVP_USERSPACE_RECV, /* reply from userspace was received */
+ KVP_DEVICE_DYING, /* driver unload is in progress */
+};
+
static struct {
- bool active; /* transaction status - active or not */
+ int state; /* kvp_device_state */
int recv_len; /* number of bytes received. */
- struct hv_kvp_msg *kvp_msg; /* current message */
struct vmbus_channel *recv_channel; /* chn we got the request */
u64 recv_req_id; /* request ID. */
void *kvp_context; /* for the channel callback */
-} kvp_transaction;
-
-/*
- * Before we can accept KVP messages from the host, we need
- * to handshake with the user level daemon. This state tracks
- * if we are in the handshake phase.
- */
-static bool in_hand_shake = true;
-
-/*
- * This state maintains the version number registered by the daemon.
- */
-static int dm_reg_value;
+ int dm_reg_value; /* daemon version number */
+ struct mutex lock; /* syncronization */
+ struct hv_kvp_msg user_msg; /* message to/from userspace */
+ struct hv_kvp_msg host_msg; /* message to/from host */
+ wait_queue_head_t proc_list; /* waiting processes */
+} kvp_device;
static void kvp_send_key(struct work_struct *dummy);
-
-
-static void kvp_respond_to_host(struct hv_kvp_msg *msg, int error);
+static void kvp_respond_to_host(int error);
static void kvp_work_func(struct work_struct *dummy);
-static void kvp_register(int);
static DECLARE_DELAYED_WORK(kvp_work, kvp_work_func);
static DECLARE_WORK(kvp_sendkey_work, kvp_send_key);
-static struct cb_id kvp_id = { CN_KVP_IDX, CN_KVP_VAL };
static const char kvp_name[] = "kvp_kernel_module";
static u8 *recv_buffer;
/*
@@ -92,31 +91,8 @@ static u8 *recv_buffer;
* As part of this registration, pass the LIC version number.
* This number has no meaning, it satisfies the registration protocol.
*/
-#define HV_DRV_VERSION "3.1"
-
-static void
-kvp_register(int reg_value)
-{
-
- struct cn_msg *msg;
- struct hv_kvp_msg *kvp_msg;
- char *version;
-
- msg = kzalloc(sizeof(*msg) + sizeof(struct hv_kvp_msg), GFP_ATOMIC);
+#define HV_DRV_VERSION 31
- if (msg) {
- kvp_msg = (struct hv_kvp_msg *)msg->data;
- version = kvp_msg->body.kvp_register.version;
- msg->id.idx = CN_KVP_IDX;
- msg->id.val = CN_KVP_VAL;
-
- kvp_msg->kvp_hdr.operation = reg_value;
- strcpy(version, HV_DRV_VERSION);
- msg->len = sizeof(struct hv_kvp_msg);
- cn_netlink_send(msg, 0, 0, GFP_ATOMIC);
- kfree(msg);
- }
-}
static void
kvp_work_func(struct work_struct *dummy)
{
@@ -124,7 +100,7 @@ kvp_work_func(struct work_struct *dummy)
* If the timer fires, the user-mode component has not responded;
* process the pending transaction.
*/
- kvp_respond_to_host(NULL, HV_E_FAIL);
+ kvp_respond_to_host(HV_E_FAIL);
}
static void poll_channel(struct vmbus_channel *channel)
@@ -138,36 +114,26 @@ static void poll_channel(struct vmbus_channel *channel)
}
-static int kvp_handle_handshake(struct hv_kvp_msg *msg)
+static int kvp_handle_handshake(u32 op)
{
- int ret = 1;
+ int ret = 0;
- switch (msg->kvp_hdr.operation) {
+ switch (op) {
case KVP_OP_REGISTER:
- dm_reg_value = KVP_OP_REGISTER;
+ kvp_device.dm_reg_value = KVP_OP_REGISTER;
pr_info("KVP: IP injection functionality not available\n");
pr_info("KVP: Upgrade the KVP daemon\n");
break;
case KVP_OP_REGISTER1:
- dm_reg_value = KVP_OP_REGISTER1;
+ kvp_device.dm_reg_value = KVP_OP_REGISTER1;
break;
default:
pr_info("KVP: incompatible daemon\n");
pr_info("KVP: KVP version: %d, Daemon version: %d\n",
- KVP_OP_REGISTER1, msg->kvp_hdr.operation);
- ret = 0;
+ KVP_OP_REGISTER1, op);
+ ret = 1;
}
- if (ret) {
- /*
- * We have a compatible daemon; complete the handshake.
- */
- pr_info("KVP: user-mode registering done.\n");
- kvp_register(dm_reg_value);
- kvp_transaction.active = false;
- if (kvp_transaction.kvp_context)
- poll_channel(kvp_transaction.kvp_context);
- }
return ret;
}
@@ -176,25 +142,11 @@ static int kvp_handle_handshake(struct hv_kvp_msg *msg)
* Callback when data is received from user mode.
*/
-static void
-kvp_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
+static void kvp_userwrite_callback(void)
{
- struct hv_kvp_msg *message;
+ struct hv_kvp_msg *message = &kvp_device.user_msg;
struct hv_kvp_msg_enumerate *data;
- int error = 0;
-
- message = (struct hv_kvp_msg *)msg->data;
-
- /*
- * If we are negotiating the version information
- * with the daemon; handle that first.
- */
-
- if (in_hand_shake) {
- if (kvp_handle_handshake(message))
- in_hand_shake = false;
- return;
- }
+ int error = 0;
/*
* Based on the version of the daemon, we propagate errors from the
@@ -203,7 +155,7 @@ kvp_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
data = &message->body.kvp_enum_data;
- switch (dm_reg_value) {
+ switch (kvp_device.dm_reg_value) {
case KVP_OP_REGISTER:
/*
* Null string is used to pass back error condition.
@@ -226,10 +178,9 @@ kvp_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
* to the host. But first, cancel the timeout.
*/
if (cancel_delayed_work_sync(&kvp_work))
- kvp_respond_to_host(message, error);
+ kvp_respond_to_host(error);
}
-
static int process_ob_ipinfo(void *in_msg, void *out_msg, int op)
{
struct hv_kvp_msg *in = in_msg;
@@ -337,32 +288,21 @@ static void process_ib_ipinfo(void *in_msg, void *out_msg, int op)
}
}
-
-
-
static void
kvp_send_key(struct work_struct *dummy)
{
- struct cn_msg *msg;
- struct hv_kvp_msg *message;
- struct hv_kvp_msg *in_msg;
- __u8 operation = kvp_transaction.kvp_msg->kvp_hdr.operation;
- __u8 pool = kvp_transaction.kvp_msg->kvp_hdr.pool;
+ struct hv_kvp_msg *message = &kvp_device.user_msg;
+ struct hv_kvp_msg *in_msg = &kvp_device.host_msg;
+ __u8 operation = in_msg->kvp_hdr.operation;
+ __u8 pool = in_msg->kvp_hdr.pool;
__u32 val32;
__u64 val64;
- int rc;
- msg = kzalloc(sizeof(*msg) + sizeof(struct hv_kvp_msg) , GFP_ATOMIC);
- if (!msg)
- return;
-
- msg->id.idx = CN_KVP_IDX;
- msg->id.val = CN_KVP_VAL;
+ mutex_lock(&kvp_device.lock);
- message = (struct hv_kvp_msg *)msg->data;
+ memset(message, 0, sizeof(struct hv_kvp_msg));
message->kvp_hdr.operation = operation;
message->kvp_hdr.pool = pool;
- in_msg = kvp_transaction.kvp_msg;
/*
* The key/value strings sent from the host are encoded in
@@ -446,15 +386,10 @@ kvp_send_key(struct work_struct *dummy)
break;
}
- msg->len = sizeof(struct hv_kvp_msg);
- rc = cn_netlink_send(msg, 0, 0, GFP_ATOMIC);
- if (rc) {
- pr_debug("KVP: failed to communicate to the daemon: %d\n", rc);
- if (cancel_delayed_work_sync(&kvp_work))
- kvp_respond_to_host(message, HV_E_FAIL);
- }
+ kvp_device.state = KVP_USERMSG_READY;
+ wake_up_interruptible(&kvp_device.proc_list);
- kfree(msg);
+ mutex_unlock(&kvp_device.lock);
return;
}
@@ -463,10 +398,10 @@ kvp_send_key(struct work_struct *dummy)
* Send a response back to the host.
*/
-static void
-kvp_respond_to_host(struct hv_kvp_msg *msg_to_host, int error)
+static void kvp_respond_to_host(int error)
{
struct hv_kvp_msg *kvp_msg;
+ struct hv_kvp_msg *msg_to_host = &kvp_device.user_msg;
struct hv_kvp_exchg_msg_value *kvp_data;
char *key_name;
char *value;
@@ -479,26 +414,13 @@ kvp_respond_to_host(struct hv_kvp_msg *msg_to_host, int error)
int ret;
/*
- * If a transaction is not active; log and return.
- */
-
- if (!kvp_transaction.active) {
- /*
- * This is a spurious call!
- */
- pr_warn("KVP: Transaction not active\n");
- return;
- }
- /*
* Copy the global state for completing the transaction. Note that
* only one transaction can be active at a time.
*/
- buf_len = kvp_transaction.recv_len;
- channel = kvp_transaction.recv_channel;
- req_id = kvp_transaction.recv_req_id;
-
- kvp_transaction.active = false;
+ buf_len = kvp_device.recv_len;
+ channel = kvp_device.recv_channel;
+ req_id = kvp_device.recv_req_id;
icmsghdrp = (struct icmsg_hdr *)
&recv_buffer[sizeof(struct vmbuspipe_hdr)];
@@ -528,7 +450,8 @@ kvp_respond_to_host(struct hv_kvp_msg *msg_to_host, int error)
&recv_buffer[sizeof(struct vmbuspipe_hdr) +
sizeof(struct icmsg_hdr)];
- switch (kvp_transaction.kvp_msg->kvp_hdr.operation) {
+
+ switch (kvp_device.host_msg.kvp_hdr.operation) {
case KVP_OP_GET_IP_INFO:
ret = process_ob_ipinfo(msg_to_host,
(struct hv_kvp_ip_msg *)kvp_msg,
@@ -586,6 +509,17 @@ response_done:
vmbus_sendpacket(channel, recv_buffer, buf_len, req_id,
VM_PKT_DATA_INBAND, 0);
+
+ /* We're ready to process next request, reset the device state */
+ if (kvp_device.state == KVP_USERSPACE_RECV ||
+ kvp_device.state == KVP_USERSPACE_REQ)
+ kvp_device.state = KVP_READY;
+ /*
+ * Make sure device state was set before polling the channel as
+ * processing can happen on a different CPU.
+ */
+ smp_mb();
+
poll_channel(channel);
}
@@ -612,14 +546,15 @@ void hv_kvp_onchannelcallback(void *context)
int util_fw_version;
int kvp_srv_version;
- if (kvp_transaction.active) {
+ if (kvp_device.state > KVP_READY) {
/*
* We will defer processing this callback once
* the current transaction is complete.
*/
- kvp_transaction.kvp_context = context;
+ kvp_device.kvp_context = channel;
return;
}
+ kvp_device.kvp_context = NULL;
vmbus_recvpacket(channel, recv_buffer, PAGE_SIZE * 4, &recvlen,
&requestid);
@@ -661,11 +596,19 @@ void hv_kvp_onchannelcallback(void *context)
* transaction; note transactions are serialized.
*/
- kvp_transaction.recv_len = recvlen;
- kvp_transaction.recv_channel = channel;
- kvp_transaction.recv_req_id = requestid;
- kvp_transaction.active = true;
- kvp_transaction.kvp_msg = kvp_msg;
+ kvp_device.recv_len = recvlen;
+ kvp_device.recv_channel = channel;
+ kvp_device.recv_req_id = requestid;
+
+ if (kvp_device.state != KVP_READY) {
+ /* Userspace daemon is not connected, fail. */
+ kvp_respond_to_host(HV_E_FAIL);
+ return;
+ }
+
+ kvp_device.state = KVP_HOSTMSG_RECEIVED;
+ memcpy(&kvp_device.host_msg, kvp_msg,
+ sizeof(struct hv_kvp_msg));
/*
* Get the information from the
@@ -690,17 +633,166 @@ void hv_kvp_onchannelcallback(void *context)
recvlen, requestid,
VM_PKT_DATA_INBAND, 0);
}
+}
+
+static int kvp_op_open(struct inode *inode, struct file *f)
+{
+ if (kvp_device.state != KVP_DEVICE_INITIALIZING)
+ return -EBUSY;
+ kvp_device.state = KVP_DEVICE_OPENED;
+ return 0;
+}
+
+static int kvp_op_release(struct inode *inode, struct file *f)
+{
+ kvp_device.state = KVP_DEVICE_INITIALIZING;
+ return 0;
+}
+
+static ssize_t kvp_op_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ int ret = 0;
+
+ if (kvp_device.state == KVP_DEVICE_DYING)
+ return -EFAULT;
+
+ if (count != sizeof(struct hv_kvp_msg)) {
+ pr_warn("kvp_op_write: invalid write len: %d (expected: %d)\n",
+ (int)count, (int)sizeof(struct hv_kvp_msg));
+ return -EINVAL;
+ }
+ mutex_lock(&kvp_device.lock);
+
+ if (kvp_device.state == KVP_USERSPACE_REQ) {
+ if (!copy_from_user(&kvp_device.user_msg, buf,
+ sizeof(struct hv_kvp_msg))) {
+ kvp_device.state = KVP_USERSPACE_RECV;
+ kvp_userwrite_callback();
+ ret = sizeof(struct hv_kvp_msg);
+ } else
+ ret = -EFAULT;
+ } else {
+ pr_warn("kvp_op_write: invalid transaction state: %d\n",
+ kvp_device.state);
+ ret = -EINVAL;
+ }
+
+ mutex_unlock(&kvp_device.lock);
+ return ret;
}
+static ssize_t kvp_op_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ ssize_t ret = 0;
+
+ if (kvp_device.state == KVP_DEVICE_DYING)
+ return -EFAULT;
+
+ if (count != sizeof(struct hv_kvp_msg)) {
+ pr_warn("kvp_op_read: invalid read len: %d (expected: %d)\n",
+ (int)count, (int)sizeof(struct hv_kvp_msg));
+ return -EINVAL;
+ }
+
+ if (wait_event_interruptible(kvp_device.proc_list,
+ kvp_device.state == KVP_USERMSG_READY ||
+ kvp_device.state == KVP_DEVICE_DYING))
+ return -EFAULT;
+
+ if (kvp_device.state != KVP_USERMSG_READY)
+ return -EFAULT;
+
+ mutex_lock(&kvp_device.lock);
+
+ if (!copy_to_user(buf, &kvp_device.user_msg,
+ sizeof(struct hv_kvp_msg))) {
+ kvp_device.state = KVP_USERSPACE_REQ;
+ ret = sizeof(struct hv_kvp_msg);
+ } else
+ ret = -EFAULT;
+
+ mutex_unlock(&kvp_device.lock);
+ return ret;
+}
+
+static unsigned int kvp_op_poll(struct file *file, poll_table *wait)
+{
+ if (kvp_device.state == KVP_DEVICE_DYING)
+ return -EFAULT;
+
+ poll_wait(file, &kvp_device.proc_list, wait);
+ if (kvp_device.state == KVP_USERMSG_READY)
+ return POLLIN | POLLRDNORM;
+ return 0;
+}
+
+static long kvp_op_ioctl(struct file *fp,
+ unsigned int cmd, unsigned long arg)
+{
+ long ret = 0;
+ void __user *argp = (void __user *)arg;
+ u32 val32;
+
+ if (kvp_device.state == KVP_DEVICE_DYING)
+ return -EFAULT;
+
+ /* The only ioctl we have is registation */
+ if (kvp_device.state != KVP_DEVICE_OPENED)
+ return -EINVAL;
+
+ mutex_lock(&kvp_device.lock);
+
+ switch (cmd) {
+ case HYPERV_KVP_REGISTER:
+ if (copy_from_user(&val32, argp, sizeof(val32))) {
+ ret = -EFAULT;
+ break;
+ }
+ if (!kvp_handle_handshake(val32)) {
+ val32 = (u32)HV_DRV_VERSION;
+ if (copy_to_user(argp, &val32, sizeof(val32))) {
+ ret = -EFAULT;
+ break;
+ }
+ kvp_device.state = KVP_READY;
+ pr_info("KVP: user-mode registering done.\n");
+ if (kvp_device.kvp_context)
+ poll_channel(kvp_device.kvp_context);
+ } else
+ ret = -EINVAL;
+ break;
+
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ mutex_unlock(&kvp_device.lock);
+ return ret;
+}
+
+static const struct file_operations kvp_fops = {
+ .owner = THIS_MODULE,
+ .read = kvp_op_read,
+ .write = kvp_op_write,
+ .release = kvp_op_release,
+ .open = kvp_op_open,
+ .poll = kvp_op_poll,
+ .unlocked_ioctl = kvp_op_ioctl
+};
+
+static struct miscdevice kvp_misc = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "vmbus/hv_kvp",
+ .fops = &kvp_fops,
+};
+
int
hv_kvp_init(struct hv_util_service *srv)
{
- int err;
-
- err = cn_add_callback(&kvp_id, kvp_name, kvp_cn_callback);
- if (err)
- return err;
recv_buffer = srv->recv_buffer;
/*
@@ -709,14 +801,20 @@ hv_kvp_init(struct hv_util_service *srv)
* Defer processing channel callbacks until the daemon
* has registered.
*/
- kvp_transaction.active = true;
+ kvp_device.state = KVP_DEVICE_INITIALIZING;
+ init_waitqueue_head(&kvp_device.proc_list);
+ mutex_init(&kvp_device.lock);
- return 0;
+ return misc_register(&kvp_misc);
}
void hv_kvp_deinit(void)
{
- cn_del_callback(&kvp_id);
+ kvp_device.state = KVP_DEVICE_DYING;
+ /* Make sure nobody sees the old state */
+ smp_mb();
+ wake_up_interruptible(&kvp_device.proc_list);
cancel_delayed_work_sync(&kvp_work);
cancel_work_sync(&kvp_sendkey_work);
+ misc_deregister(&kvp_misc);
}
diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
index bb1cb73..80713a3 100644
--- a/include/uapi/linux/hyperv.h
+++ b/include/uapi/linux/hyperv.h
@@ -26,6 +26,7 @@
#define _UAPI_HYPERV_H
#include <linux/uuid.h>
+#include <linux/types.h>
/*
* Framework version for util services.
@@ -389,4 +390,11 @@ struct hv_kvp_ip_msg {
struct hv_kvp_ipaddr_value kvp_ip_val;
} __attribute__((packed));
+/*
+ * Userspace registration ioctls. Userspace daemons are supposed to pass their
+ * version as a parameter and get driver version back. KVP daemon supplies
+ * either KVP_OP_REGISTER or KVP_OP_REGISTER1.
+ */
+#define HYPERV_KVP_REGISTER _IOWR('v', 0, __u32)
+
#endif /* _UAPI_HYPERV_H */
diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
index 408bb07..0c3cac7 100644
--- a/tools/hv/hv_kvp_daemon.c
+++ b/tools/hv/hv_kvp_daemon.c
@@ -33,7 +33,6 @@
#include <ctype.h>
#include <errno.h>
#include <arpa/inet.h>
-#include <linux/connector.h>
#include <linux/hyperv.h>
#include <linux/netlink.h>
#include <ifaddrs.h>
@@ -44,6 +43,7 @@
#include <dirent.h>
#include <net/if.h>
#include <getopt.h>
+#include <sys/ioctl.h>
/*
* KVP protocol: The user mode component first registers with the
@@ -79,9 +79,6 @@ enum {
DNS
};
-static struct sockaddr_nl addr;
-static int in_hand_shake = 1;
-
static char *os_name = "";
static char *os_major = "";
static char *os_minor = "";
@@ -1387,34 +1384,6 @@ kvp_get_domain_name(char *buffer, int length)
freeaddrinfo(info);
}
-static int
-netlink_send(int fd, struct cn_msg *msg)
-{
- struct nlmsghdr nlh = { .nlmsg_type = NLMSG_DONE };
- unsigned int size;
- struct msghdr message;
- struct iovec iov[2];
-
- size = sizeof(struct cn_msg) + msg->len;
-
- nlh.nlmsg_pid = getpid();
- nlh.nlmsg_len = NLMSG_LENGTH(size);
-
- iov[0].iov_base = &nlh;
- iov[0].iov_len = sizeof(nlh);
-
- iov[1].iov_base = msg;
- iov[1].iov_len = size;
-
- memset(&message, 0, sizeof(message));
- message.msg_name = &addr;
- message.msg_namelen = sizeof(addr);
- message.msg_iov = iov;
- message.msg_iovlen = 2;
-
- return sendmsg(fd, &message, 0);
-}
-
void print_usage(char *argv[])
{
fprintf(stderr, "Usage: %s [options]\n"
@@ -1425,23 +1394,18 @@ void print_usage(char *argv[])
int main(int argc, char *argv[])
{
- int fd, len, nl_group;
+ int kvp_fd, len;
int error;
- struct cn_msg *message;
struct pollfd pfd;
- struct nlmsghdr *incoming_msg;
- struct cn_msg *incoming_cn_msg;
- struct hv_kvp_msg *hv_msg;
- char *p;
+ struct hv_kvp_msg hv_msg[1];
char *key_value;
char *key_name;
int op;
int pool;
char *if_name;
struct hv_kvp_ipaddr_value *kvp_ip_val;
- char *kvp_recv_buffer;
- size_t kvp_recv_buffer_len;
int daemonize = 1, long_index = 0, opt;
+ __u32 daemon_ver = (__u32)KVP_OP_REGISTER1;
static struct option long_options[] = {
{"help", no_argument, 0, 'h' },
@@ -1468,12 +1432,14 @@ int main(int argc, char *argv[])
openlog("KVP", 0, LOG_USER);
syslog(LOG_INFO, "KVP starting; pid is:%d", getpid());
- kvp_recv_buffer_len = NLMSG_LENGTH(0) + sizeof(struct cn_msg) + sizeof(struct hv_kvp_msg);
- kvp_recv_buffer = calloc(1, kvp_recv_buffer_len);
- if (!kvp_recv_buffer) {
- syslog(LOG_ERR, "Failed to allocate netlink buffer");
+ kvp_fd = open("/dev/vmbus/hv_kvp", O_RDWR);
+
+ if (kvp_fd < 0) {
+ syslog(LOG_ERR, "open /dev/vmbus/hv_kvp failed; error: %d %s",
+ errno, strerror(errno));
exit(EXIT_FAILURE);
}
+
/*
* Retrieve OS release information.
*/
@@ -1489,100 +1455,44 @@ int main(int argc, char *argv[])
exit(EXIT_FAILURE);
}
- fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
- if (fd < 0) {
- syslog(LOG_ERR, "netlink socket creation failed; error: %d %s", errno,
- strerror(errno));
- exit(EXIT_FAILURE);
- }
- addr.nl_family = AF_NETLINK;
- addr.nl_pad = 0;
- addr.nl_pid = 0;
- addr.nl_groups = 0;
-
-
- error = bind(fd, (struct sockaddr *)&addr, sizeof(addr));
- if (error < 0) {
- syslog(LOG_ERR, "bind failed; error: %d %s", errno, strerror(errno));
- close(fd);
- exit(EXIT_FAILURE);
- }
- nl_group = CN_KVP_IDX;
-
- if (setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, &nl_group, sizeof(nl_group)) < 0) {
- syslog(LOG_ERR, "setsockopt failed; error: %d %s", errno, strerror(errno));
- close(fd);
- exit(EXIT_FAILURE);
- }
-
/*
* Register ourselves with the kernel.
*/
- message = (struct cn_msg *)kvp_recv_buffer;
- message->id.idx = CN_KVP_IDX;
- message->id.val = CN_KVP_VAL;
-
- hv_msg = (struct hv_kvp_msg *)message->data;
- hv_msg->kvp_hdr.operation = KVP_OP_REGISTER1;
- message->ack = 0;
- message->len = sizeof(struct hv_kvp_msg);
-
- len = netlink_send(fd, message);
- if (len < 0) {
- syslog(LOG_ERR, "netlink_send failed; error: %d %s", errno, strerror(errno));
- close(fd);
+ if (ioctl(kvp_fd, HYPERV_KVP_REGISTER, &daemon_ver)) {
+ syslog(LOG_ERR, "registration to kernel failed; error: %d %s",
+ errno, strerror(errno));
+ close(kvp_fd);
exit(EXIT_FAILURE);
}
- pfd.fd = fd;
+ syslog(LOG_INFO, "KVP LIC Version: %d", daemon_ver);
+
+ pfd.fd = kvp_fd;
while (1) {
- struct sockaddr *addr_p = (struct sockaddr *) &addr;
- socklen_t addr_l = sizeof(addr);
pfd.events = POLLIN;
pfd.revents = 0;
if (poll(&pfd, 1, -1) < 0) {
syslog(LOG_ERR, "poll failed; error: %d %s", errno, strerror(errno));
if (errno == EINVAL) {
- close(fd);
+ close(kvp_fd);
exit(EXIT_FAILURE);
}
else
continue;
}
- len = recvfrom(fd, kvp_recv_buffer, kvp_recv_buffer_len, 0,
- addr_p, &addr_l);
-
- if (len < 0) {
- int saved_errno = errno;
- syslog(LOG_ERR, "recvfrom failed; pid:%u error:%d %s",
- addr.nl_pid, errno, strerror(errno));
-
- if (saved_errno == ENOBUFS) {
- syslog(LOG_ERR, "receive error: ignored");
- continue;
- }
+ len = read(kvp_fd, hv_msg, sizeof(struct hv_kvp_msg));
- close(fd);
- return -1;
- }
+ if (len != sizeof(struct hv_kvp_msg)) {
+ syslog(LOG_ERR, "read failed; error:%d %s",
+ errno, strerror(errno));
- if (addr.nl_pid) {
- syslog(LOG_WARNING, "Received packet from untrusted pid:%u",
- addr.nl_pid);
- continue;
+ close(kvp_fd);
+ return EXIT_FAILURE;
}
- incoming_msg = (struct nlmsghdr *)kvp_recv_buffer;
-
- if (incoming_msg->nlmsg_type != NLMSG_DONE)
- continue;
-
- incoming_cn_msg = (struct cn_msg *)NLMSG_DATA(incoming_msg);
- hv_msg = (struct hv_kvp_msg *)incoming_cn_msg->data;
-
/*
* We will use the KVP header information to pass back
* the error from this daemon. So, first copy the state
@@ -1592,24 +1502,6 @@ int main(int argc, char *argv[])
pool = hv_msg->kvp_hdr.pool;
hv_msg->error = HV_S_OK;
- if ((in_hand_shake) && (op == KVP_OP_REGISTER1)) {
- /*
- * Driver is registering with us; stash away the version
- * information.
- */
- in_hand_shake = 0;
- p = (char *)hv_msg->body.kvp_register.version;
- lic_version = malloc(strlen(p) + 1);
- if (lic_version) {
- strcpy(lic_version, p);
- syslog(LOG_INFO, "KVP LIC Version: %s",
- lic_version);
- } else {
- syslog(LOG_ERR, "malloc failed");
- }
- continue;
- }
-
switch (op) {
case KVP_OP_GET_IP_INFO:
kvp_ip_val = &hv_msg->body.kvp_ip_val;
@@ -1702,7 +1594,6 @@ int main(int argc, char *argv[])
goto kvp_done;
}
- hv_msg = (struct hv_kvp_msg *)incoming_cn_msg->data;
key_name = (char *)hv_msg->body.kvp_enum_data.data.key;
key_value = (char *)hv_msg->body.kvp_enum_data.data.value;
@@ -1753,31 +1644,17 @@ int main(int argc, char *argv[])
hv_msg->error = HV_S_CONT;
break;
}
- /*
- * Send the value back to the kernel. The response is
- * already in the receive buffer. Update the cn_msg header to
- * reflect the key value that has been added to the message
- */
-kvp_done:
-
- incoming_cn_msg->id.idx = CN_KVP_IDX;
- incoming_cn_msg->id.val = CN_KVP_VAL;
- incoming_cn_msg->ack = 0;
- incoming_cn_msg->len = sizeof(struct hv_kvp_msg);
-
- len = netlink_send(fd, incoming_cn_msg);
- if (len < 0) {
- int saved_errno = errno;
- syslog(LOG_ERR, "net_link send failed; error: %d %s", errno,
- strerror(errno));
-
- if (saved_errno == ENOMEM || saved_errno == ENOBUFS) {
- syslog(LOG_ERR, "send error: ignored");
- continue;
- }
+ /* Send the value back to the kernel. */
+kvp_done:
+ len = write(kvp_fd, hv_msg, sizeof(struct hv_kvp_msg));
+ if (len != sizeof(struct hv_kvp_msg)) {
+ syslog(LOG_ERR, "write failed; error: %d %s", errno,
+ strerror(errno));
exit(EXIT_FAILURE);
}
}
+ close(kvp_fd);
+ exit(0);
}
--
1.9.3
^ permalink raw reply related
* [PATCH RFC 0/3] Drivers: hv: utils: re-implement the kernel/userspace communication layer
From: Vitaly Kuznetsov @ 2015-02-27 16:14 UTC (permalink / raw)
To: K. Y. Srinivasan, devel-tBiZLqfeLfOHmIFyCCdPziST3g8Odh+X
Cc: Haiyang Zhang, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dexuan Cui,
Radim Krčmář, Greg Kroah-Hartman,
linux-api-u79uwXL29TY76Z2rM5mHXA
This series converts kvp/vss daemons to use misc char devices instead of
netlink for userspace/kernel communication and then updates fcopy to be
consistent with kvp/vss.
Userspace/kernel communication via netlink has a number of issues:
- It is hard for userspace to figure out if the kernel part was loaded or not
and this fact can change as there is a way to enable/disable the service from
host side. Racy daemon startup is also a problem.
- When the userspace daemon restarts/dies kernel part doesn't receive a
notification.
- Netlink communication is not stable under heavy load.
- ...
RFC: I'm a bit puzzled on how to split commits 1 and 2 avoiding breakages.
Commit 3 can definitely be split, however, it is consistent with commits 1 and
2 at this moment and I'm not sure such split will simplify the review.
Vitaly Kuznetsov (3):
Drivers: hv: kvp: convert userspace/kernel communication to using char
device
Drivers: hv: vss: convert userspace/kernel communication to using char
device
Drivers: hv: fcopy: make it consistent with vss/kvp
drivers/hv/hv_fcopy.c | 395 +++++++++++++++++++++++++------------------
drivers/hv/hv_kvp.c | 396 +++++++++++++++++++++++++++-----------------
drivers/hv/hv_snapshot.c | 335 +++++++++++++++++++++++++++----------
include/uapi/linux/hyperv.h | 10 ++
tools/hv/hv_fcopy_daemon.c | 48 ++++--
tools/hv/hv_kvp_daemon.c | 187 ++++-----------------
tools/hv/hv_vss_daemon.c | 141 +++-------------
7 files changed, 824 insertions(+), 688 deletions(-)
--
1.9.3
^ permalink raw reply
* Re: Documenting MS_LAZYTIME
From: Theodore Ts'o @ 2015-02-27 14:18 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Omar Sandoval, Eric Sandeen, linux-man-u79uwXL29TY76Z2rM5mHXA,
Linux API, XFS Developers, Linux-Fsdevel, Ext4 Developers List,
Linux btrfs Developers List
In-Reply-To: <54F02C73.5090601-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
With Omar's suggestions, this looks great.
Thanks!!
- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 0/4] enhance shmem process and swap accounting
From: Vlastimil Babka @ 2015-02-27 10:52 UTC (permalink / raw)
To: Michael Kerrisk
Cc: linux-mm, Jerome Marchand, Linux Kernel, Andrew Morton,
linux-doc-u79uwXL29TY76Z2rM5mHXA, Hugh Dickins, Michal Hocko,
Kirill A. Shutemov, Cyrill Gorcunov, Randy Dunlap,
linux-s390-u79uwXL29TY76Z2rM5mHXA, Martin Schwidefsky,
Heiko Carstens, Peter Zijlstra, Paul Mackerras,
Arnaldo Carvalho de Melo, Oleg Nesterov, Linux API
In-Reply-To: <CAHO5Pa0xmquUbzkZvow_PxRGZpA7MVEPFcRL2LPXv7hU41uxDw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On 02/27/2015 11:36 AM, Michael Kerrisk wrote:
> [CC += linux-api@]
>
> Hello Vlastimil,
>
> Since this is a kernel-user-space API change, please CC linux-api@.
> The kernel source file Documentation/SubmitChecklist notes that all
> Linux kernel patches that change userspace interfaces should be CCed
> to linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, so that the various parties who are
> interested in API changes are informed. For further information, see
> https://www.kernel.org/doc/man-pages/linux-api-ml.html
Yes I meant to do that but forgot in the end, what a shame. Sorry for the trouble.
^ permalink raw reply
* Re: [PATCH 4/4] mm, procfs: Display VmAnon, VmFile and VmShm in /proc/pid/status
From: Michael Kerrisk @ 2015-02-27 10:38 UTC (permalink / raw)
To: Vlastimil Babka
Cc: linux-mm, Jerome Marchand, Linux Kernel, Andrew Morton, linux-doc,
Hugh Dickins, Michal Hocko, Kirill A. Shutemov, Cyrill Gorcunov,
Randy Dunlap, linux-s390, Martin Schwidefsky, Heiko Carstens,
Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
Oleg Nesterov, Linux API
In-Reply-To: <1424958666-18241-5-git-send-email-vbabka@suse.cz>
[CC += linux-api@]
On Thu, Feb 26, 2015 at 2:51 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> From: Jerome Marchand <jmarchan@redhat.com>
>
> It's currently inconvenient to retrieve MM_ANONPAGES value from status
> and statm files and there is no way to separate MM_FILEPAGES and
> MM_SHMEMPAGES. Add VmAnon, VmFile and VmShm lines in /proc/<pid>/status
> to solve these issues.
>
> Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> Documentation/filesystems/proc.txt | 10 +++++++++-
> fs/proc/task_mmu.c | 13 +++++++++++--
> 2 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 8b30543..c777adb 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -168,6 +168,9 @@ read the file /proc/PID/status:
> VmLck: 0 kB
> VmHWM: 476 kB
> VmRSS: 476 kB
> + VmAnon: 352 kB
> + VmFile: 120 kB
> + VmShm: 4 kB
> VmData: 156 kB
> VmStk: 88 kB
> VmExe: 68 kB
> @@ -224,7 +227,12 @@ Table 1-2: Contents of the status files (as of 2.6.30-rc7)
> VmSize total program size
> VmLck locked memory size
> VmHWM peak resident set size ("high water mark")
> - VmRSS size of memory portions
> + VmRSS size of memory portions. It contains the three
> + following parts (VmRSS = VmAnon + VmFile + VmShm)
> + VmAnon size of resident anonymous memory
> + VmFile size of resident file mappings
> + VmShm size of resident shmem memory (includes SysV shm,
> + mapping of tmpfs and shared anonymous mappings)
> VmData size of data, stack, and text segments
> VmStk size of data, stack, and text segments
> VmExe size of text segment
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index d70334c..a77a3ac 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -22,7 +22,7 @@
>
> void task_mem(struct seq_file *m, struct mm_struct *mm)
> {
> - unsigned long data, text, lib, swap, ptes, pmds;
> + unsigned long data, text, lib, swap, ptes, pmds, anon, file, shmem;
> unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>
> /*
> @@ -39,6 +39,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> if (hiwater_rss < mm->hiwater_rss)
> hiwater_rss = mm->hiwater_rss;
>
> + anon = get_mm_counter(mm, MM_ANONPAGES);
> + file = get_mm_counter(mm, MM_FILEPAGES);
> + shmem = get_mm_counter_shmem(mm);
> data = mm->total_vm - mm->shared_vm - mm->stack_vm;
> text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
> lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
> @@ -52,6 +55,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> "VmPin:\t%8lu kB\n"
> "VmHWM:\t%8lu kB\n"
> "VmRSS:\t%8lu kB\n"
> + "VmAnon:\t%8lu kB\n"
> + "VmFile:\t%8lu kB\n"
> + "VmShm:\t%8lu kB\n"
> "VmData:\t%8lu kB\n"
> "VmStk:\t%8lu kB\n"
> "VmExe:\t%8lu kB\n"
> @@ -65,6 +71,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> mm->pinned_vm << (PAGE_SHIFT-10),
> hiwater_rss << (PAGE_SHIFT-10),
> total_rss << (PAGE_SHIFT-10),
> + anon << (PAGE_SHIFT-10),
> + file << (PAGE_SHIFT-10),
> + shmem << (PAGE_SHIFT-10),
> data << (PAGE_SHIFT-10),
> mm->stack_vm << (PAGE_SHIFT-10), text, lib,
> ptes >> 10,
> @@ -82,7 +91,7 @@ unsigned long task_statm(struct mm_struct *mm,
> unsigned long *data, unsigned long *resident)
> {
> *shared = get_mm_counter(mm, MM_FILEPAGES) +
> - get_mm_counter(mm, MM_SHMEMPAGES);
> + get_mm_counter_shmem(mm);
> *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
> >> PAGE_SHIFT;
> *data = mm->total_vm - mm->shared_vm;
> --
> 2.1.4
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH 3/4] mm, shmem: Add shmem resident memory accounting
From: Michael Kerrisk @ 2015-02-27 10:38 UTC (permalink / raw)
To: Vlastimil Babka
Cc: linux-mm, Jerome Marchand, Linux Kernel, Andrew Morton, linux-doc,
Hugh Dickins, Michal Hocko, Kirill A. Shutemov, Cyrill Gorcunov,
Randy Dunlap, linux-s390, Martin Schwidefsky, Heiko Carstens,
Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
Oleg Nesterov, Linux API
In-Reply-To: <1424958666-18241-4-git-send-email-vbabka@suse.cz>
[CC += linux-api@]
On Thu, Feb 26, 2015 at 2:51 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> From: Jerome Marchand <jmarchan@redhat.com>
>
> Currently looking at /proc/<pid>/status or statm, there is no way to
> distinguish shmem pages from pages mapped to a regular file (shmem
> pages are mapped to /dev/zero), even though their implication in
> actual memory use is quite different.
> This patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
> shmem pages instead of MM_FILEPAGES.
>
> [vbabka@suse.cz: port to 4.0, add #ifdefs, mm_counter_file() variant]
> Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> arch/s390/mm/pgtable.c | 5 +----
> fs/proc/task_mmu.c | 4 +++-
> include/linux/mm.h | 28 ++++++++++++++++++++++++++++
> include/linux/mm_types.h | 9 ++++++---
> kernel/events/uprobes.c | 2 +-
> mm/memory.c | 30 ++++++++++--------------------
> mm/oom_kill.c | 5 +++--
> mm/rmap.c | 15 ++++-----------
> 8 files changed, 56 insertions(+), 42 deletions(-)
>
> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
> index b2c1542..5bffd5d 100644
> --- a/arch/s390/mm/pgtable.c
> +++ b/arch/s390/mm/pgtable.c
> @@ -617,10 +617,7 @@ static void gmap_zap_swap_entry(swp_entry_t entry, struct mm_struct *mm)
> else if (is_migration_entry(entry)) {
> struct page *page = migration_entry_to_page(entry);
>
> - if (PageAnon(page))
> - dec_mm_counter(mm, MM_ANONPAGES);
> - else
> - dec_mm_counter(mm, MM_FILEPAGES);
> + dec_mm_counter(mm, mm_counter(page));
> }
> free_swap_and_cache(entry);
> }
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 0410309..d70334c 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -81,7 +81,8 @@ unsigned long task_statm(struct mm_struct *mm,
> unsigned long *shared, unsigned long *text,
> unsigned long *data, unsigned long *resident)
> {
> - *shared = get_mm_counter(mm, MM_FILEPAGES);
> + *shared = get_mm_counter(mm, MM_FILEPAGES) +
> + get_mm_counter(mm, MM_SHMEMPAGES);
> *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
> >> PAGE_SHIFT;
> *data = mm->total_vm - mm->shared_vm;
> @@ -501,6 +502,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
> pte_none(*pte) && vma->vm_file) {
> struct address_space *mapping =
> file_inode(vma->vm_file)->i_mapping;
> + pgoff_t pgoff = linear_page_index(vma, addr);
>
> /*
> * shmem does not use swap pte's so we have to consult
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 47a9392..adfbb5b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1364,6 +1364,16 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
> return (unsigned long)val;
> }
>
> +/* A wrapper for the CONFIG_SHMEM dependent counter */
> +static inline unsigned long get_mm_counter_shmem(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_SHMEM
> + return get_mm_counter(mm, MM_SHMEMPAGES);
> +#else
> + return 0;
> +#endif
> +}
> +
> static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
> {
> atomic_long_add(value, &mm->rss_stat.count[member]);
> @@ -1379,9 +1389,27 @@ static inline void dec_mm_counter(struct mm_struct *mm, int member)
> atomic_long_dec(&mm->rss_stat.count[member]);
> }
>
> +/* Optimized variant when page is already known not to be PageAnon */
> +static inline int mm_counter_file(struct page *page)
> +{
> +#ifdef CONFIG_SHMEM
> + if (PageSwapBacked(page))
> + return MM_SHMEMPAGES;
> +#endif
> + return MM_FILEPAGES;
> +}
> +
> +static inline int mm_counter(struct page *page)
> +{
> + if (PageAnon(page))
> + return MM_ANONPAGES;
> + return mm_counter_file(page);
> +}
> +
> static inline unsigned long get_mm_rss(struct mm_struct *mm)
> {
> return get_mm_counter(mm, MM_FILEPAGES) +
> + get_mm_counter_shmem(mm) +
> get_mm_counter(mm, MM_ANONPAGES);
> }
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 199a03a..d3c2372 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -327,9 +327,12 @@ struct core_state {
> };
>
> enum {
> - MM_FILEPAGES,
> - MM_ANONPAGES,
> - MM_SWAPENTS,
> + MM_FILEPAGES, /* Resident file mapping pages */
> + MM_ANONPAGES, /* Resident anonymous pages */
> + MM_SWAPENTS, /* Anonymous swap entries */
> +#ifdef CONFIG_SHMEM
> + MM_SHMEMPAGES, /* Resident shared memory pages */
> +#endif
> NR_MM_COUNTERS
> };
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index cb346f2..0a08fdd 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -188,7 +188,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
> lru_cache_add_active_or_unevictable(kpage, vma);
>
> if (!PageAnon(page)) {
> - dec_mm_counter(mm, MM_FILEPAGES);
> + dec_mm_counter(mm, mm_counter_file(page));
> inc_mm_counter(mm, MM_ANONPAGES);
> }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 8068893..f145d9e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -832,10 +832,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> } else if (is_migration_entry(entry)) {
> page = migration_entry_to_page(entry);
>
> - if (PageAnon(page))
> - rss[MM_ANONPAGES]++;
> - else
> - rss[MM_FILEPAGES]++;
> + rss[mm_counter(page)]++;
>
> if (is_write_migration_entry(entry) &&
> is_cow_mapping(vm_flags)) {
> @@ -874,10 +871,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> if (page) {
> get_page(page);
> page_dup_rmap(page);
> - if (PageAnon(page))
> - rss[MM_ANONPAGES]++;
> - else
> - rss[MM_FILEPAGES]++;
> + rss[mm_counter(page)]++;
> }
>
> out_set_pte:
> @@ -1113,9 +1107,8 @@ again:
> tlb_remove_tlb_entry(tlb, pte, addr);
> if (unlikely(!page))
> continue;
> - if (PageAnon(page))
> - rss[MM_ANONPAGES]--;
> - else {
> +
> + if (!PageAnon(page)) {
> if (pte_dirty(ptent)) {
> force_flush = 1;
> set_page_dirty(page);
> @@ -1123,8 +1116,8 @@ again:
> if (pte_young(ptent) &&
> likely(!(vma->vm_flags & VM_SEQ_READ)))
> mark_page_accessed(page);
> - rss[MM_FILEPAGES]--;
> }
> + rss[mm_counter(page)]--;
> page_remove_rmap(page);
> if (unlikely(page_mapcount(page) < 0))
> print_bad_pte(vma, addr, ptent, page);
> @@ -1146,11 +1139,7 @@ again:
> struct page *page;
>
> page = migration_entry_to_page(entry);
> -
> - if (PageAnon(page))
> - rss[MM_ANONPAGES]--;
> - else
> - rss[MM_FILEPAGES]--;
> + rss[mm_counter(page)]--;
> }
> if (unlikely(!free_swap_and_cache(entry)))
> print_bad_pte(vma, addr, ptent, NULL);
> @@ -1460,7 +1449,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
>
> /* Ok, finally just insert the thing.. */
> get_page(page);
> - inc_mm_counter_fast(mm, MM_FILEPAGES);
> + inc_mm_counter_fast(mm, mm_counter_file(page));
> page_add_file_rmap(page);
> set_pte_at(mm, addr, pte, mk_pte(page, prot));
>
> @@ -2174,7 +2163,8 @@ gotten:
> if (likely(pte_same(*page_table, orig_pte))) {
> if (old_page) {
> if (!PageAnon(old_page)) {
> - dec_mm_counter_fast(mm, MM_FILEPAGES);
> + dec_mm_counter_fast(mm,
> + mm_counter_file(old_page));
> inc_mm_counter_fast(mm, MM_ANONPAGES);
> }
> } else
> @@ -2703,7 +2693,7 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
> inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> page_add_new_anon_rmap(page, vma, address);
> } else {
> - inc_mm_counter_fast(vma->vm_mm, MM_FILEPAGES);
> + inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
> page_add_file_rmap(page);
> }
> set_pte_at(vma->vm_mm, address, pte, entry);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 642f38c..a5ee3a2 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -573,10 +573,11 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> /* mm cannot safely be dereferenced after task_unlock(victim) */
> mm = victim->mm;
> mark_tsk_oom_victim(victim);
> - pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
> + pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
> K(get_mm_counter(victim->mm, MM_ANONPAGES)),
> - K(get_mm_counter(victim->mm, MM_FILEPAGES)));
> + K(get_mm_counter(victim->mm, MM_FILEPAGES)),
> + K(get_mm_counter_shmem(victim->mm)));
> task_unlock(victim);
>
> /*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5e3e090..e3c4392 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1216,12 +1216,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> update_hiwater_rss(mm);
>
> if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
> - if (!PageHuge(page)) {
> - if (PageAnon(page))
> - dec_mm_counter(mm, MM_ANONPAGES);
> - else
> - dec_mm_counter(mm, MM_FILEPAGES);
> - }
> + if (!PageHuge(page))
> + dec_mm_counter(mm, mm_counter(page));
> set_pte_at(mm, address, pte,
> swp_entry_to_pte(make_hwpoison_entry(page)));
> } else if (pte_unused(pteval)) {
> @@ -1230,10 +1226,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> * interest anymore. Simply discard the pte, vmscan
> * will take care of the rest.
> */
> - if (PageAnon(page))
> - dec_mm_counter(mm, MM_ANONPAGES);
> - else
> - dec_mm_counter(mm, MM_FILEPAGES);
> + dec_mm_counter(mm, mm_counter(page));
> } else if (PageAnon(page)) {
> swp_entry_t entry = { .val = page_private(page) };
> pte_t swp_pte;
> @@ -1276,7 +1269,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> entry = make_migration_entry(page, pte_write(pteval));
> set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
> } else
> - dec_mm_counter(mm, MM_FILEPAGES);
> + dec_mm_counter(mm, mm_counter_file(page));
>
> page_remove_rmap(page);
> page_cache_release(page);
> --
> 2.1.4
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH 2/4] mm, procfs: account for shmem swap in /proc/pid/smaps
From: Michael Kerrisk @ 2015-02-27 10:38 UTC (permalink / raw)
To: Vlastimil Babka
Cc: linux-mm, Jerome Marchand, Linux Kernel, Andrew Morton, linux-doc,
Hugh Dickins, Michal Hocko, Kirill A. Shutemov, Cyrill Gorcunov,
Randy Dunlap, linux-s390, Martin Schwidefsky, Heiko Carstens,
Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
Oleg Nesterov, Linux API
In-Reply-To: <1424958666-18241-3-git-send-email-vbabka@suse.cz>
[CC += linux-api@]
On Thu, Feb 26, 2015 at 2:51 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> Currently, /proc/pid/smaps will always show "Swap: 0 kB" for shmem-backed
> mappings, even if the mapped portion does contain pages that were swapped out.
> This is because unlike private anonymous mappings, shmem does not change pte
> to swap entry, but pte_none when swapping the page out. In the smaps page
> walk, such page thus looks like it was never faulted in.
>
> This patch changes smaps_pte_entry() to determine the swap status for such
> pte_none entries for shmem mappings, similarly to how mincore_page() does it.
> Swapped out pages are thus accounted for.
>
> The accounting is arguably still not as precise as for private anonymous
> mappings, since now we will count also pages that the process in question never
> accessed, but only another process populated them and then let them become
> swapped out. I believe it is still less confusing and subtle than not showing
> any swap usage by shmem mappings at all. Also, swapped out pages only becomee a
> performance issue for future accesses, and we cannot predict those for neither
> kind of mapping.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> Documentation/filesystems/proc.txt | 3 ++-
> fs/proc/task_mmu.c | 20 ++++++++++++++++++++
> 2 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index d4f56ec..8b30543 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -437,7 +437,8 @@ indicates the amount of memory currently marked as referenced or accessed.
> a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
> and a page is modified, the file page is replaced by a private anonymous copy.
> "Swap" shows how much would-be-anonymous memory is also used, but out on
> -swap.
> +swap. For shmem mappings, "Swap" shows how much of the mapped portion of the
> +underlying shmem object is on swap.
>
> "VmFlags" field deserves a separate description. This member represents the kernel
> flags associated with the particular virtual memory area in two letter encoded
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 956b75d..0410309 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -13,6 +13,7 @@
> #include <linux/swap.h>
> #include <linux/swapops.h>
> #include <linux/mmu_notifier.h>
> +#include <linux/shmem_fs.h>
>
> #include <asm/elf.h>
> #include <asm/uaccess.h>
> @@ -496,6 +497,25 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
> mss->swap += PAGE_SIZE;
> else if (is_migration_entry(swpent))
> page = migration_entry_to_page(swpent);
> + } else if (IS_ENABLED(CONFIG_SHMEM) && IS_ENABLED(CONFIG_SWAP) &&
> + pte_none(*pte) && vma->vm_file) {
> + struct address_space *mapping =
> + file_inode(vma->vm_file)->i_mapping;
> +
> + /*
> + * shmem does not use swap pte's so we have to consult
> + * the radix tree to account for swap
> + */
> + if (shmem_mapping(mapping)) {
> + page = find_get_entry(mapping, pgoff);
> + if (page) {
> + if (radix_tree_exceptional_entry(page))
> + mss->swap += PAGE_SIZE;
> + else
> + page_cache_release(page);
> + }
> + page = NULL;
> + }
> }
>
> if (!page)
> --
> 2.1.4
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH 1/4] mm, documentation: clarify /proc/pid/status VmSwap limitations
From: Michael Kerrisk @ 2015-02-27 10:37 UTC (permalink / raw)
To: Vlastimil Babka
Cc: linux-mm, Jerome Marchand, Linux Kernel, Andrew Morton, linux-doc,
Hugh Dickins, Michal Hocko, Kirill A. Shutemov, Cyrill Gorcunov,
Randy Dunlap, linux-s390, Martin Schwidefsky, Heiko Carstens,
Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
Oleg Nesterov, Linux API
In-Reply-To: <1424958666-18241-2-git-send-email-vbabka@suse.cz>
[CC += linux-api@]
On Thu, Feb 26, 2015 at 2:51 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> The documentation for /proc/pid/status does not mention that the value of
> VmSwap counts only swapped out anonymous private pages and not shmem. This is
> not obvious, so document this limitation.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> I've noticed that proc(5) manpage is currently missing the VmSwap field
> altogether.
>
> Documentation/filesystems/proc.txt | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index a07ba61..d4f56ec 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -231,6 +231,8 @@ Table 1-2: Contents of the status files (as of 2.6.30-rc7)
> VmLib size of shared library code
> VmPTE size of page table entries
> VmSwap size of swap usage (the number of referred swapents)
> + by anonymous private data (shmem swap usage is not
> + included)
> Threads number of threads
> SigQ number of signals queued/max. number for queue
> SigPnd bitmap of pending signals for the thread
> --
> 2.1.4
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH 0/4] enhance shmem process and swap accounting
From: Michael Kerrisk @ 2015-02-27 10:36 UTC (permalink / raw)
To: Vlastimil Babka
Cc: linux-mm, Jerome Marchand, Linux Kernel, Andrew Morton, linux-doc,
Hugh Dickins, Michal Hocko, Kirill A. Shutemov, Cyrill Gorcunov,
Randy Dunlap, linux-s390, Martin Schwidefsky, Heiko Carstens,
Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
Oleg Nesterov, Linux API
In-Reply-To: <1424958666-18241-1-git-send-email-vbabka@suse.cz>
[CC += linux-api@]
Hello Vlastimil,
Since this is a kernel-user-space API change, please CC linux-api@.
The kernel source file Documentation/SubmitChecklist notes that all
Linux kernel patches that change userspace interfaces should be CCed
to linux-api@vger.kernel.org, so that the various parties who are
interested in API changes are informed. For further information, see
https://www.kernel.org/doc/man-pages/linux-api-ml.html
Cheers,
Michael
On Thu, Feb 26, 2015 at 2:51 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> This series is based on Jerome Marchand's [1] so let me quote the first
> paragraph from there:
>
> There are several shortcomings with the accounting of shared memory
> (sysV shm, shared anonymous mapping, mapping to a tmpfs file). The
> values in /proc/<pid>/status and statm don't allow to distinguish
> between shmem memory and a shared mapping to a regular file, even
> though theirs implication on memory usage are quite different: at
> reclaim, file mapping can be dropped or write back on disk while shmem
> needs a place in swap. As for shmem pages that are swapped-out or in
> swap cache, they aren't accounted at all.
>
> The original motivation for myself is that a customer found (IMHO rightfully)
> confusing that e.g. top output for process swap usage is unreliable with
> respect to swapped out shmem pages, which are not accounted for.
>
> The fundamental difference between private anonymous and shmem pages is that
> the latter has PTE's converted to pte_none, and not swapents. As such, they are
> not accounted to the number of swapents visible e.g. in /proc/pid/status VmSwap
> row. It might be theoretically possible to use swapents when swapping out shmem
> (without extra cost, as one has to change all mappers anyway), and on swap in
> only convert the swapent for the faulting process, leaving swapents in other
> processes until they also fault (so again no extra cost). But I don't know how
> many assumptions this would break, and it would be too disruptive change for a
> relatively small benefit.
>
> Instead, my approach is to document the limitation of VmSwap, and provide means
> to determine the swap usage for shmem areas for those who are interested and
> willing to pay the price, using /proc/pid/smaps. Because outside of ipcs, I
> don't think it's possible to currently to determine the usage at all. The
> previous patchset [1] did introduce new shmem-specific fields into smaps
> output, and functions to determine the values. I take a simpler approach,
> noting that smaps output already has a "Swap: X kB" line, where currently X ==
> 0 always for shmem areas. I think we can just consider this a bug and provide
> the proper value by consulting the radix tree, as e.g. mincore_page() does. In the
> patch changelog I explain why this is also not perfect (and cannot be without
> swapents), but still arguably much better than showing a 0.
>
> The last two patches are adapted from Jerome's patchset and provide a VmRSS
> breakdown to VmAnon, VmFile and VmShm in /proc/pid/status. Hugh noted that
> this is a welcome addition, and I agree that it might help e.g. debugging
> process memory usage at albeit non-zero, but still rather low cost of extra
> per-mm counter and some page flag checks. I updated these patches to 4.0-rc1,
> made them respect !CONFIG_SHMEM so that tiny systems don't pay the cost, and
> optimized the page flag checking somewhat.
>
> [1] http://lwn.net/Articles/611966/
>
> Jerome Marchand (2):
> mm, shmem: Add shmem resident memory accounting
> mm, procfs: Display VmAnon, VmFile and VmShm in /proc/pid/status
>
> Vlastimil Babka (2):
> mm, documentation: clarify /proc/pid/status VmSwap limitations
> mm, proc: account for shmem swap in /proc/pid/smaps
>
> Documentation/filesystems/proc.txt | 15 +++++++++++++--
> arch/s390/mm/pgtable.c | 5 +----
> fs/proc/task_mmu.c | 35 +++++++++++++++++++++++++++++++++--
> include/linux/mm.h | 28 ++++++++++++++++++++++++++++
> include/linux/mm_types.h | 9 ++++++---
> kernel/events/uprobes.c | 2 +-
> mm/memory.c | 30 ++++++++++--------------------
> mm/oom_kill.c | 5 +++--
> mm/rmap.c | 15 ++++-----------
> 9 files changed, 99 insertions(+), 45 deletions(-)
>
> --
> 2.1.4
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH v12 1/2] crypto: AF_ALG: add AEAD support
From: Stephan Mueller @ 2015-02-27 10:26 UTC (permalink / raw)
To: Herbert Xu
Cc: 'Quentin Gouchet', Daniel Borkmann, 'LKML',
linux-crypto-u79uwXL29TY76Z2rM5mHXA, 'Linux API'
In-Reply-To: <20150227094944.GA29071-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
Am Freitag, 27. Februar 2015, 22:49:44 schrieb Herbert Xu:
Hi Herbert,
>On Thu, Feb 05, 2015 at 04:10:58PM +0100, Stephan Mueller wrote:
>> Am Donnerstag, 29. Januar 2015, 21:24:45 schrieb Stephan Mueller:
>>
>> Hi Herbert,
>>
>> > This patch adds the AEAD support for AF_ALG.
>> >
>> > The implementation is based on algif_skcipher, but contains heavy
>> > modifications to streamline the interface for AEAD uses.
>> >
>> > To use AEAD, the user space consumer has to use the salg_type named
>> > "aead".
>>
>> I just saw Al Viro's patch to use the iov_iter API in
>> algif_skcipher.c. I looked at it but lacked documentation for using
>> it properly. Now I have a template that I will incorporate into
>> algif_aead.c
>
>Please resubmit once you have done this.
I have done that, but as indicated with an email to Al, I cannot get his
patch for skcipher and hash to work. Similarly, my modification for the
AEAD does not work either. So, I do not see that Al's patch can be
merged as is.
Therefore, I have not submitted my patch as Al mentioned he wanted to
look into his patchset.
>
>Thanks,
Ciao
Stephan
^ permalink raw reply
* Re: [PATCH v12 1/2] crypto: AF_ALG: add AEAD support
From: Herbert Xu @ 2015-02-27 9:49 UTC (permalink / raw)
To: Stephan Mueller
Cc: 'Quentin Gouchet', Daniel Borkmann, 'LKML',
linux-crypto, 'Linux API'
In-Reply-To: <7821728.KTX1G67VgF@tachyon.chronox.de>
On Thu, Feb 05, 2015 at 04:10:58PM +0100, Stephan Mueller wrote:
> Am Donnerstag, 29. Januar 2015, 21:24:45 schrieb Stephan Mueller:
>
> Hi Herbert,
>
> > This patch adds the AEAD support for AF_ALG.
> >
> > The implementation is based on algif_skcipher, but contains heavy
> > modifications to streamline the interface for AEAD uses.
> >
> > To use AEAD, the user space consumer has to use the salg_type named
> > "aead".
>
> I just saw Al Viro's patch to use the iov_iter API in algif_skcipher.c. I
> looked at it but lacked documentation for using it properly. Now I have a
> template that I will incorporate into algif_aead.c
Please resubmit once you have done this.
Thanks,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* Re: Documenting MS_LAZYTIME
From: Michael Kerrisk (man-pages) @ 2015-02-27 8:36 UTC (permalink / raw)
To: Omar Sandoval
Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Theodore Ts'o,
Eric Sandeen, linux-man-u79uwXL29TY76Z2rM5mHXA, Linux API,
XFS Developers, Linux-Fsdevel, Ext4 Developers List,
Linux btrfs Developers List
In-Reply-To: <20150227080800.GA20442@mew>
Hello Omar,
On 02/27/2015 09:08 AM, Omar Sandoval wrote:
> On Fri, Feb 27, 2015 at 09:01:10AM +0100, Michael Kerrisk (man-pages) wrote:
>> On 02/27/2015 01:04 AM, Theodore Ts'o wrote:
>>> On Thu, Feb 26, 2015 at 02:36:33PM +0100, Michael Kerrisk (man-pages) wrote:
>>>>
>>>> The disadvantage of MS_STRICTATIME | MS_LAZYTIME is that
>>>> in the case of a system crash, the atime and mtime fields
>>>> on disk might be out of date by at most 24 hours.
>>>
>>> I'd change to "The disadvantage of MS_LAZYTIME is that..." and
>>> perhaps move that so it's clear it applies to any use of MS_LAZYTIME
>>> has this as a downside.
>>>
>>> Does that make sense?
>>
>> Thanks, Ted. Got it. So, now we have:
>>
>> MS_LAZYTIME (since Linux 3.20)
>> Reduce on-disk updates of inode timestamps (atime,
>> mtime, ctime) by maintaining these changes only in mem‐
>> ory. The on-disk timestamps are updated only when:
>>
>> (a) the inode needs to be updated for some change unre‐
>> lated to file timestamps;
>>
>> (b) the application employs fsync(2), syncfs(2), or
>> sync(2);
>>
>> (c) an undeleted inode is evicted from memory; or
>>
>> (d) more than 24 hours have passed since the inode was
>> written to disk.
>>
>> This mount significantly reduces writes needed to update
> "This mount option"?
Thanks, fixed.
>> the inode's timestamps, especially mtime and atime.
>> However, in the event of a system crash, the atime and
>> mtime fields on disk might be out of date by up to 24
>> hours.
>>
>> Examples of workloads where this option could be of sig‐
>> nificant benefit include frequent random writes to pre‐
>> allocated files, as well as cases where the MS_STRICTA‐
>> TIME mount option is also enabled. (The advantage of
>> (MS_STRICTATIME | MS_LAZYTIME) is that stat(2) will
>> return the correctly updated atime, but the atime
>> updates will be flushed to disk only when (1) the inode
>> needs to be updated for filesystem / data consistency
>> reasons or (2) the inode is pushed out of memory, or (3)
>> the filesystem is unmounted.)
> Is it necessary to repeat the reasons for flushing, which are stated
> above?
Good point. I replaced this piece with just a few words referring
to the list above.
Thanks,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: [PATCH] fork: report pid reservation failure properly
From: Michal Hocko @ 2015-02-27 8:22 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-api, Oleg Nesterov, Eric W. Biederman, Michael Kerrisk,
LKML
In-Reply-To: <20150226144908.5e3d354d7b24174e90089721@linux-foundation.org>
On Thu 26-02-15 14:49:08, Andrew Morton wrote:
> On Mon, 23 Feb 2015 21:17:01 +0100 Michal Hocko <mhocko@suse.cz> wrote:
>
> > ping on this one? Should I just resend (your way Andrew)? Or are there
> > any objections to the patch as is.
>
> Were Eric's concerns all addressed?
I hope so. I didn't touch pid namespace parts and they return ENOMEM as
before.
> Oleg: wake up ;)
>
> Overall it looks like a pretty minor issue?
I believe so. It should help deubugging when pid space is exhausted
because getting ENOMEM under that situation is really awkward and the
real reason for failure is hard to find out.
--
Michal Hocko
SUSE Labs
^ permalink raw reply
* Re: Documenting MS_LAZYTIME
From: Omar Sandoval @ 2015-02-27 8:08 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Theodore Ts'o, Eric Sandeen, linux-man-u79uwXL29TY76Z2rM5mHXA,
Linux API, XFS Developers, Linux-Fsdevel, Ext4 Developers List,
Linux btrfs Developers List
In-Reply-To: <54F02446.2050008-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
On Fri, Feb 27, 2015 at 09:01:10AM +0100, Michael Kerrisk (man-pages) wrote:
> On 02/27/2015 01:04 AM, Theodore Ts'o wrote:
> > On Thu, Feb 26, 2015 at 02:36:33PM +0100, Michael Kerrisk (man-pages) wrote:
> >>
> >> The disadvantage of MS_STRICTATIME | MS_LAZYTIME is that
> >> in the case of a system crash, the atime and mtime fields
> >> on disk might be out of date by at most 24 hours.
> >
> > I'd change to "The disadvantage of MS_LAZYTIME is that..." and
> > perhaps move that so it's clear it applies to any use of MS_LAZYTIME
> > has this as a downside.
> >
> > Does that make sense?
>
> Thanks, Ted. Got it. So, now we have:
>
> MS_LAZYTIME (since Linux 3.20)
> Reduce on-disk updates of inode timestamps (atime,
> mtime, ctime) by maintaining these changes only in mem‐
> ory. The on-disk timestamps are updated only when:
>
> (a) the inode needs to be updated for some change unre‐
> lated to file timestamps;
>
> (b) the application employs fsync(2), syncfs(2), or
> sync(2);
>
> (c) an undeleted inode is evicted from memory; or
>
> (d) more than 24 hours have passed since the inode was
> written to disk.
>
> This mount significantly reduces writes needed to update
"This mount option"?
> the inode's timestamps, especially mtime and atime.
> However, in the event of a system crash, the atime and
> mtime fields on disk might be out of date by up to 24
> hours.
>
> Examples of workloads where this option could be of sig‐
> nificant benefit include frequent random writes to pre‐
> allocated files, as well as cases where the MS_STRICTA‐
> TIME mount option is also enabled. (The advantage of
> (MS_STRICTATIME | MS_LAZYTIME) is that stat(2) will
> return the correctly updated atime, but the atime
> updates will be flushed to disk only when (1) the inode
> needs to be updated for filesystem / data consistency
> reasons or (2) the inode is pushed out of memory, or (3)
> the filesystem is unmounted.)
Is it necessary to repeat the reasons for flushing, which are stated
above?
>
> Cheers,
>
> Michael
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
>
> _______________________________________________
> xfs mailing list
> xfs-VZNHf3L845pBDgjK7y7TUQ@public.gmane.org
> http://oss.sgi.com/mailman/listinfo/xfs
Thanks!
--
Omar
^ permalink raw reply
* Re: Documenting MS_LAZYTIME
From: Michael Kerrisk (man-pages) @ 2015-02-27 8:01 UTC (permalink / raw)
To: Theodore Ts'o
Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Eric Sandeen,
Ext4 Developers List, Linux btrfs Developers List, XFS Developers,
linux-man-u79uwXL29TY76Z2rM5mHXA, Linux-Fsdevel, Linux API
In-Reply-To: <20150227000409.GC17174-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
On 02/27/2015 01:04 AM, Theodore Ts'o wrote:
> On Thu, Feb 26, 2015 at 02:36:33PM +0100, Michael Kerrisk (man-pages) wrote:
>>
>> The disadvantage of MS_STRICTATIME | MS_LAZYTIME is that
>> in the case of a system crash, the atime and mtime fields
>> on disk might be out of date by at most 24 hours.
>
> I'd change to "The disadvantage of MS_LAZYTIME is that..." and
> perhaps move that so it's clear it applies to any use of MS_LAZYTIME
> has this as a downside.
>
> Does that make sense?
Thanks, Ted. Got it. So, now we have:
MS_LAZYTIME (since Linux 3.20)
Reduce on-disk updates of inode timestamps (atime,
mtime, ctime) by maintaining these changes only in mem‐
ory. The on-disk timestamps are updated only when:
(a) the inode needs to be updated for some change unre‐
lated to file timestamps;
(b) the application employs fsync(2), syncfs(2), or
sync(2);
(c) an undeleted inode is evicted from memory; or
(d) more than 24 hours have passed since the inode was
written to disk.
This mount significantly reduces writes needed to update
the inode's timestamps, especially mtime and atime.
However, in the event of a system crash, the atime and
mtime fields on disk might be out of date by up to 24
hours.
Examples of workloads where this option could be of sig‐
nificant benefit include frequent random writes to pre‐
allocated files, as well as cases where the MS_STRICTA‐
TIME mount option is also enabled. (The advantage of
(MS_STRICTATIME | MS_LAZYTIME) is that stat(2) will
return the correctly updated atime, but the atime
updates will be flushed to disk only when (1) the inode
needs to be updated for filesystem / data consistency
reasons or (2) the inode is pushed out of memory, or (3)
the filesystem is unmounted.)
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: Documenting MS_LAZYTIME
From: Theodore Ts'o @ 2015-02-27 0:04 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Eric Sandeen, Ext4 Developers List, Linux btrfs Developers List,
XFS Developers, linux-man, Linux-Fsdevel, Linux API
In-Reply-To: <54EF2161.90607@gmail.com>
On Thu, Feb 26, 2015 at 02:36:33PM +0100, Michael Kerrisk (man-pages) wrote:
>
> The disadvantage of MS_STRICTATIME | MS_LAZYTIME is that
> in the case of a system crash, the atime and mtime fields
> on disk might be out of date by at most 24 hours.
I'd change to "The disadvantage of MS_LAZYTIME is that..." and
perhaps move that so it's clear it applies to any use of MS_LAZYTIME
has this as a downside.
Does that make sense?
- Ted
^ permalink raw reply
* Re: [PATCH v2] coresight-stm: adding driver for CoreSight STM component
From: Russell King - ARM Linux @ 2015-02-26 22:58 UTC (permalink / raw)
To: Rob Herring
Cc: Mathieu Poirier, Will Deacon, linux-api@vger.kernel.org,
Jonathan Corbet, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org
In-Reply-To: <CAL_Jsq+JYc5DBXqEcpJ0O=RdO-uuz_-GR99YsgVqT=uG6CLqJg@mail.gmail.com>
On Thu, Feb 26, 2015 at 04:24:53PM -0600, Rob Herring wrote:
> We really shouldn't do private implementation here. It there really
> any reason not to allow readq/writeq generically for 32-bit or just
> for arm32?
My argument has always been that drivers should do the emulation of
64-bit accesses when there is no native support.
IO registers tend to have side effects when read/written. How do we
know whether the low-half or the high-half should be written first?
This isn't something that an architecture can really dictate. What
may be right for one hardware device may not be correct for another.
--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.
^ permalink raw reply
* Re: [PATCH] fork: report pid reservation failure properly
From: Andrew Morton @ 2015-02-26 22:49 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-api, Oleg Nesterov, Eric W. Biederman, Michael Kerrisk,
LKML
In-Reply-To: <20150223201701.GA28639@dhcp22.suse.cz>
On Mon, 23 Feb 2015 21:17:01 +0100 Michal Hocko <mhocko@suse.cz> wrote:
> ping on this one? Should I just resend (your way Andrew)? Or are there
> any objections to the patch as is.
Were Eric's concerns all addressed?
Oleg: wake up ;)
Overall it looks like a pretty minor issue?
^ permalink raw reply
* Re: [PATCH v2] coresight-stm: adding driver for CoreSight STM component
From: Rob Herring @ 2015-02-26 22:24 UTC (permalink / raw)
To: Mathieu Poirier, Will Deacon
Cc: linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, Jonathan Corbet
In-Reply-To: <1424907152-18808-1-git-send-email-mathieu.poirier@linaro.org>
Adding Will D...
On Wed, Feb 25, 2015 at 5:32 PM, Mathieu Poirier
<mathieu.poirier@linaro.org> wrote:
> From: Pratik Patel <pratikp@codeaurora.org>
>
> This driver adds support for the STM CoreSight IP block,
> allowing any system compoment (HW or SW) to log and
> aggregate messages via a single entity.
>
> The STM exposes an application defined number of channels
> called stimulus port. Configuration is done using entries
> in sysfs and channels made available to userspace via devfs.
>
> Signed-off-by: Pratik Patel <pratikp@codeaurora.org>
> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> ---
> Changes for v2:
> - Fixed typo in struct stm_node documentation
> - Added CPU_32v3 to list of architecture STM can't work with
Is this because of no strd instr only?
> +#ifndef CONFIG_64BIT
> +static inline void __raw_writeq(u64 val, volatile void __iomem *addr)
> +{
> + asm volatile("strd %1, %0"
> + : "+Qo" (*(volatile u64 __force *)addr)
> + : "r" (val));
> +}
> +#undef writeq_relaxed
> +#define writeq_relaxed(v, c) __raw_writeq((__force u64) cpu_to_le64(v), c)
> +#endif
We really shouldn't do private implementation here. It there really
any reason not to allow readq/writeq generically for 32-bit or just
for arm32?
Rob
^ permalink raw reply
* [PATCH] capabilities: Ambient capability set V2
From: Christoph Lameter @ 2015-02-26 22:14 UTC (permalink / raw)
To: Serge Hallyn
Cc: Andy Lutomirski, Jonathan Corbet, Aaron Jones,
linux-security-module-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r, Andrew G. Morgan,
Mimi Zohar, Austin S Hemmelgarn, Markku Savela, Jarkko Sakkinen,
linux-api-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk
V1->V2:
- Fix up the processing of the caps bits after discussions
with Any and Serge. Make patch less intrusive.
Ambient caps are something like restricted root privileges.
A process has a set of additional capabilities and those
are inherited without have to set capabilites in other
binaries involved. This allow the partial use of root
like features in a controlled way. It is often useful
to do this for user space device drivers or software that
needs increased priviledges for networking or to control
its own scheduling. Ambient caps allow one to avoid
having to run these with full root priviledges.
Control over this feature is avaialable via a new
prctl option called PR_CAP_AMBIENT. The second argument to prctl
is a the capability number and the third the desired state.
0 for off. Otherwise on.
Ambient bits are enabled regardless of the inheritance
mask of the target binary. They are only restricted
by the bounding set.
History:
Linux capabilities have suffered from the problem that they are not
inheritable like unregular process characteristics under Unix. This is
behavior that is counter intuitive to the expected behavior of processes
in Unix.
In particular there has been recently software that controls NICs from user
space and provides IP stack like behavior also in user space (DPDK and RDMA
kernel API based implementations). Those typically need either capabilities
to allow raw network access or have to be run setsuid. There is scripting and
LD_PREFLOAD etc involved, arbitrary binaries may be run from those scripts
including those setting additional capabilites or requiring root access.
That does not go well with having file capabilities set that would enable
the capabilities. Maybe it would work if one would setup capabilities on
all executables but that would also defeat a secure design since these
binaries may only need those caps for certain situations. Ok setting the
inheritable flags on everything may also get one there (if there would not
be the issues with LD_PRELOAD, debugging etc etc).
The easy solution is to allow some capabilities be inherited like setsuid
is. We really prefer to use capabilities instead of setsuid (we want to
limit what damage someone can do after all!). Therefore we have been
running a patch like this in production for the last 6 years. At some
point it becomes tedious to run your own custom kernel so we would like
to have this functionality upstream.
See some of the earlier related discussions on the problems with capability
inheritance:
0. Recent surprise:
https://lkml.org/lkml/2014/1/21/175
1. Attempt to revise caps
http://www.madore.org/~david/linux/newcaps/
2. Problems of passing caps through exec
http://unix.stackexchange.com/questions/128394/passing-capabilities-through-exec
3. Problems of binding to privileged ports
http://stackoverflow.com/questions/413807/is-there-a-way-for-non-root-processes-to-bind-to-privileged-ports-1024-on-l
4. Reviving capabilities
http://lwn.net/Articles/199004/
There does not seem to be an alternative on the horizon. Some involved
in security development under Linux have even stated that they want to
rip out the whole thing and replace it. Its been a couple of years now
and we are still suffering from the capabilities mess. Let us just
fix it. Others have already done implementations like this like Nokia
for the N900.
This patch does not change the default behavior but it allows to set up
a list of capabilities via prctl that will enable regular
unix inheritance only for the selected group of capabilities.
With that it is then possible to do something trivial like setting
CAP_NET_RAW on an executable that can then allow that capability to
be inherited by others.
Lets have a look at a coding example of a wrapper that enables
a couple of capabilities:
------------------------------ ambient_test.c
/*
* Test program for the ambient capabilities
*
*
* Compile using:
* gcc -o ambient_test ambient_test.o
*
* This program must have the following capabilities to run properly:
* CAP_SETPCAP, CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
*
* A command to equip this with the right caps is:
*
* setcap cap_setpcap,cap_net_raw,cap_net_admin,cap_sys_nice+eip ambient_test
*
* To get a shell with additional caps that can be inherited do:
*
* ./ambient_test /bin/bash
*
*/
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <sys/prctl.h>
#include <linux/capability.h>
/* Defintion to be updated in the user space include files */
#define PR_CAP_AMBIENT 45
int main(int argc, char **argv)
{
int rc;
if (prctl(PR_CAP_AMBIENT, CAP_NET_RAW))
perror("Cannot set CAP_NET_RAW");
if (prctl(PR_CAP_AMBIENT, CAP_NET_ADMIN))
perror("Cannot set CAP_NET_ADMIN");
if (prctl(PR_CAP_AMBIENT, CAP_SYS_NICE))
perror("Cannot set CAP_SYS_NICE");
printf("Ambient_test forking shell\n");
if (execv(argv[1], argv + 1))
perror("Cannot exec");
return 0;
}
-------------------------------- ambient_test.c
Allows the inheritance of CAP_SYS_NICE, CAP_NET_RAW and CAP_NET_ADMIN.
With that device raw access is possible and also real time priorities
can be set from user space. This is a frequently needed set of
priviledged operations in HPC and HFT applications. User space
processes need to be able to directly access devices as well as
have full control over scheduling.
Signed-off-by: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
Index: linux/security/commoncap.c
===================================================================
--- linux.orig/security/commoncap.c 2015-02-25 13:43:06.929973954 -0600
+++ linux/security/commoncap.c 2015-02-26 16:10:02.347913397 -0600
@@ -347,15 +347,17 @@ static inline int bprm_caps_from_vfs_cap
*has_cap = true;
CAP_FOR_EACH_U32(i) {
+ __u32 ambient = current_cred()->cap_ambient.cap[i];
__u32 permitted = caps->permitted.cap[i];
__u32 inheritable = caps->inheritable.cap[i];
/*
- * pP' = (X & fP) | (pI & fI)
+ * pP' = (X & fP) | (pI & (fI | pA))
*/
new->cap_permitted.cap[i] =
(new->cap_bset.cap[i] & permitted) |
- (new->cap_inheritable.cap[i] & inheritable);
+ (new->cap_inheritable.cap[i] &
+ (inheritable | ambient));
if (permitted & ~new->cap_permitted.cap[i])
/* insufficient to execute correctly */
@@ -453,8 +455,18 @@ static int get_file_caps(struct linux_bi
if (rc == -EINVAL)
printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
__func__, rc, bprm->filename);
- else if (rc == -ENODATA)
+ else if (rc == -ENODATA) {
rc = 0;
+ if (!cap_isclear(current_cred()->cap_ambient)) {
+ /*
+ * The ambient caps are permitted for
+ * files that have no caps
+ */
+ bprm->cred->cap_permitted =
+ current_cred()->cap_ambient;
+ *effective = true;
+ }
+ }
goto out;
}
@@ -549,9 +561,20 @@ skip:
new->sgid = new->fsgid = new->egid;
if (effective)
+ /*
+ * pE' = pP' & (fE | pA)
+ *
+ * fE is implicity all set if effective == true.
+ * Therefore the above reduces to
+ *
+ * pE' = pP'
+ */
new->cap_effective = new->cap_permitted;
else
cap_clear(new->cap_effective);
+
+ /* pA' = pA */
+ new->cap_ambient = old->cap_ambient;
bprm->cap_effective = effective;
/*
@@ -566,7 +589,7 @@ skip:
* Number 1 above might fail if you don't have a full bset, but I think
* that is interesting information to audit.
*/
- if (!cap_isclear(new->cap_effective)) {
+ if (!cap_issubset(new->cap_effective, new->cap_ambient)) {
if (!cap_issubset(CAP_FULL_SET, new->cap_effective) ||
!uid_eq(new->euid, root_uid) || !uid_eq(new->uid, root_uid) ||
issecure(SECURE_NOROOT)) {
@@ -598,7 +621,7 @@ int cap_bprm_secureexec(struct linux_bin
if (!uid_eq(cred->uid, root_uid)) {
if (bprm->cap_effective)
return 1;
- if (!cap_isclear(cred->cap_permitted))
+ if (!cap_issubset(cred->cap_permitted, cred->cap_ambient))
return 1;
}
@@ -933,6 +956,23 @@ int cap_task_prctl(int option, unsigned
new->securebits &= ~issecure_mask(SECURE_KEEP_CAPS);
return commit_creds(new);
+ case PR_CAP_AMBIENT:
+ if (!ns_capable(current_user_ns(), CAP_SETPCAP))
+ return -EPERM;
+
+ if (!cap_valid(arg2))
+ return -EINVAL;
+
+ if (!ns_capable(current_user_ns(), arg2))
+ return -EPERM;
+
+ new = prepare_creds();
+ if (arg3 == 0)
+ cap_lower(new->cap_ambient, arg2);
+ else
+ cap_raise(new->cap_ambient, arg2);
+ return commit_creds(new);
+
default:
/* No functionality available - continue with default */
return -ENOSYS;
Index: linux/include/linux/cred.h
===================================================================
--- linux.orig/include/linux/cred.h 2015-02-25 13:43:06.929973954 -0600
+++ linux/include/linux/cred.h 2015-02-25 13:43:06.925972078 -0600
@@ -122,6 +122,7 @@ struct cred {
kernel_cap_t cap_permitted; /* caps we're permitted */
kernel_cap_t cap_effective; /* caps we can actually use */
kernel_cap_t cap_bset; /* capability bounding set */
+ kernel_cap_t cap_ambient; /* Ambient capability set */
#ifdef CONFIG_KEYS
unsigned char jit_keyring; /* default keyring to attach requested
* keys to */
Index: linux/include/uapi/linux/prctl.h
===================================================================
--- linux.orig/include/uapi/linux/prctl.h 2015-02-25 13:43:06.929973954 -0600
+++ linux/include/uapi/linux/prctl.h 2015-02-25 13:43:06.925972078 -0600
@@ -185,4 +185,7 @@ struct prctl_mm_map {
#define PR_MPX_ENABLE_MANAGEMENT 43
#define PR_MPX_DISABLE_MANAGEMENT 44
+/* Control the ambient capability set */
+#define PR_CAP_AMBIENT 45
+
#endif /* _LINUX_PRCTL_H */
Index: linux/fs/proc/array.c
===================================================================
--- linux.orig/fs/proc/array.c 2015-02-25 13:43:06.929973954 -0600
+++ linux/fs/proc/array.c 2015-02-25 13:43:06.925972078 -0600
@@ -302,7 +302,8 @@ static void render_cap_t(struct seq_file
static inline void task_cap(struct seq_file *m, struct task_struct *p)
{
const struct cred *cred;
- kernel_cap_t cap_inheritable, cap_permitted, cap_effective, cap_bset;
+ kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
+ cap_bset, cap_ambient;
rcu_read_lock();
cred = __task_cred(p);
@@ -310,12 +311,14 @@ static inline void task_cap(struct seq_f
cap_permitted = cred->cap_permitted;
cap_effective = cred->cap_effective;
cap_bset = cred->cap_bset;
+ cap_ambient = cred->cap_ambient;
rcu_read_unlock();
render_cap_t(m, "CapInh:\t", &cap_inheritable);
render_cap_t(m, "CapPrm:\t", &cap_permitted);
render_cap_t(m, "CapEff:\t", &cap_effective);
render_cap_t(m, "CapBnd:\t", &cap_bset);
+ render_cap_t(m, "CapAmb:\t", &cap_ambient);
}
static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
^ permalink raw reply
* Re: [PATCH 2/2] fbcon: expose cursor blink interval via sysfs
From: Pavel Machek @ 2015-02-26 22:02 UTC (permalink / raw)
To: Scot Doyle
Cc: Jean-Christophe Plagniol-Villard, Tomi Valkeinen,
Geert Uytterhoeven, linux-fbdev-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.LNX.2.11.1502252319330.276@localhost>
On Wed 2015-02-25 23:32:00, Scot Doyle wrote:
> On Wed, 25 Feb 2015, Pavel Machek wrote:
> > On Mon 2015-01-26 20:41:53, Scot Doyle wrote:
> > > The fbcon cursor, when set to blink, is hardcoded to toggle display state
> > > five times per second. Expose this setting via
> > > /sys/class/graphics/fbcon/cursor_blink_ms
> > >
> > > Values written to the interface set the approximate time interval in
> > > milliseconds between cursor toggles, from 1 to 32767. Since the interval
> > > is stored internally as a number of jiffies, the millisecond value read
> > > from the interface may not exactly match the entered value.
> > >
> > > An outstanding blink timer is reset after a new value is entered.
> > >
> > > If the cursor blink is disabled, either via the 'cursor_blink' boolean
> > > setting or some other mechanism, the 'cursor_blink_ms' setting may still
> > > be modified. The new value will be used if the blink is reactivated.
> > >
> > > Signed-off-by: Scot Doyle <lkml14-enLWO88E2pdl57MIdRCFDg@public.gmane.org>
> >
> > Normally, this would be set by ansi escape sequences, no? We can hide
> > cursor using them, set its appearance.. makes sense to change timing
> > value there, too....
> > Pavel
>
> Hi Pavel, what about something like this? For example,
> "echo -e '\033[16;500]' would set the blink interval to 500 milliseconds.
>
> The duration is stored twice to avoid locking the console in
> cursor_timer_handler().
Yes, I'd say this matches the existing code better.
Acked-by: Pavel Machek <pavel-+ZI9xUNit7I@public.gmane.org>
> + case 16: /* set cursor blink duration in msec */
> + if (vc->vc_npar >= 1 && vc->vc_par[1] > 0 &&
> + vc->vc_par[1] <= USHRT_MAX)
> + vc->vc_cur_blink_ms = vc->vc_par[1];
> + else
Actually, vc_cur_blink_ms less then about 50 probably does not make
sense (and may overload the system). Should that be checked?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox