From: Joao Martins <joao.m.martins@oracle.com>
To: Dan Williams <dan.j.williams@intel.com>,
Paolo Bonzini <pbonzini@redhat.com>,
"Liu, Jingqi" <jingqi.liu@intel.com>
Cc: "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
Richard Henderson <rth@twiddle.net>
Subject: Re: [PATCH] exec: fetch the alignment of Linux devdax pmem character device nodes
Date: Tue, 7 Apr 2020 12:42:03 +0100
Message-ID: <e21684a9-5832-7adb-923e-fdd9bff1f620@oracle.com>
In-Reply-To: <CAPcyv4iOi+5RJgkEWuJpn8JjOMrNCh4Uk1Ag=Fo=i+iFf1TkFA@mail.gmail.com>
On 4/7/20 9:16 AM, Dan Williams wrote:
> On Tue, Apr 7, 2020 at 1:08 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 07/04/20 09:29, Liu, Jingqi wrote:
>>> Ping.
>>>
>>> Any comments are appreciated.
>>>
>>> Hi Paolo, Richard,
>>>
>>> Any comments about this ?
>>
>> I was hoping to get a review from someone else because I have no way to
>> test it. But I've now queued the patch, thanks.
>
FWIW, I tested it and it didn't work. I later found something odd with respect
to the device path.
Paolo, if it helps your future testing, you can create a device-dax instance
with something like this:
efi_fake_mem=4G@16G:0x40000 # creates a dax0.0 device with sz 4G, 2M aligned
But that requires dax_hmem, which is v5.5+. Alternatively, use memmap=4G!16G
(together with ndctl create-namespace -r 0 -a <align>), which creates a legacy
pmem device.
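To confirm which alignment a device created either way ends up with, the value can be read back from its sysfs 'align' attribute. The following is a minimal sketch, not QEMU code: read_align_attr() is a hypothetical helper, and the attribute path is deliberately left as a parameter because the exact sysfs location is an assumption that should be checked on the system under test.

```c
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: read one integer (decimal, octal or 0x-prefixed
 * hex) from a sysfs-style attribute file.
 *
 * Returns the value read, -errno if the file cannot be opened, or
 * -EINVAL if the contents are not a positive integer.
 */
static int64_t read_align_attr(const char *path)
{
    FILE *f = fopen(path, "r");
    int64_t val = 0;
    int matched;

    if (!f) {
        return -errno;
    }

    /* SCNi64 auto-detects the base, much like strtoll(..., 0) */
    matched = fscanf(f, "%" SCNi64, &val);
    fclose(f);

    if (matched != 1 || val <= 0) {
        return -EINVAL;
    }
    return val;
}
```

A caller would pass the path of the device's 'align' attribute and compare the result with the value given to 'align=NUM'.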
> Does qemu run tests in a nested VM? The difficult aspect of testing
> devdax is that you need to boot your kernel with a special option or
> have existing memory ranges assigned to the device. Although, Joao had
> thoughts about allowing dynamic creation of device-dax instance by hot
> unplugging memory.
>
The idea was to get feature parity with hugetlbfs, where you can assign a number
of 2M/1G pages at runtime, thus giving a more flexible way of assigning
memory to hmem.
This means we would create dax regions -- which can be sub-divided into dax
devices -- dynamically, by hot-unplugging a memory%u device first and then
reassigning it to the dax_hmem driver (thus marking it as 'soft-reserved').
It could later be given back to system RAM via dax_kmem. Naturally this assumes
you can hot-unplug the memory block before assigning it to dax_hmem, which might
be rather unpredictable. The kernel cmdline is still, though, the most
deterministic way of assigning memory, say at bigger page granularities
(e.g. 1G).
But this hotunplug-and-assign-to-hmem scheme is still on paper; I haven't yet
prototyped it to see where it all falls apart.
>>> On 4/1/2020 11:13 AM, Liu, Jingqi wrote:
>>>> If the backend file is devdax pmem character device, the alignment
>>>> specified by the option 'align=NUM' in the '-object memory-backend-file'
>>>> needs to match the alignment requirement of the devdax pmem character
>>>> device.
>>>>
>>>> This patch fetches the devdax pmem file 'align', so that we can compare
>>>> it with the NUM of 'align=NUM'.
>>>> The NUM needs to be larger than or equal to the devdax pmem file 'align'.
>>>>
>>>> It also fixes the problem that mmap() returns failure in qemu_ram_mmap()
>>>> when the NUM of 'align=NUM' is less than the devdax pmem file 'align'.
>>>>
>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>> Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
>>>> ---
>>>> exec.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
>>>> 1 file changed, 45 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/exec.c b/exec.c
>>>> index de9d949902..8221abffec 100644
>>>> --- a/exec.c
>>>> +++ b/exec.c
>>>> @@ -1736,6 +1736,42 @@ static int64_t get_file_size(int fd)
>>>> return size;
>>>> }
>>>> +static int64_t get_file_align(int fd)
>>>> +{
>>>> + int64_t align = -1;
>>>> +#if defined(__linux__)
>>>> + struct stat st;
>>>> +
>>>> + if (fstat(fd, &st) < 0) {
>>>> + return -errno;
>>>> + }
>>>> +
>>>> + /* Special handling for devdax character devices */
>>>> + if (S_ISCHR(st.st_mode)) {
>>>> + g_autofree char *subsystem_path = NULL;
>>>> + g_autofree char *subsystem = NULL;
>>>> +
>>>> + subsystem_path =
>>>> g_strdup_printf("/sys/dev/char/%d:%d/subsystem",
>>>> + major(st.st_rdev),
>>>> minor(st.st_rdev));
>>>> + subsystem = g_file_read_link(subsystem_path, NULL);
>>>> +
>>>> + if (subsystem && g_str_has_suffix(subsystem, "/dax")) {
>>>> + g_autofree char *align_path = NULL;
>>>> + g_autofree char *align_str = NULL;
>>>> +
>>>> + align_path =
>>>> g_strdup_printf("/sys/dev/char/%d:%d/device/align",
>>>> + major(st.st_rdev),
>>>> minor(st.st_rdev));
>>>> +
>>>> + if (g_file_get_contents(align_path, &align_str, NULL,
>>>> NULL)) {
>>>> + return g_ascii_strtoll(align_str, NULL, 0);
>>>> + }
>>>> + }
>>>> + }
>>>> +#endif /* defined(__linux__) */
>>>> +
>>>> + return align;
>>>> +}
>>>> +
>>>> static int file_ram_open(const char *path,
>>>> const char *region_name,
>>>> bool *created,
>>>> @@ -2275,7 +2311,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t
>>>> size, MemoryRegion *mr,
>>>> {
>>>> RAMBlock *new_block;
>>>> Error *local_err = NULL;
>>>> - int64_t file_size;
>>>> + int64_t file_size, file_align;
>>>> /* Just support these ram flags by now. */
>>>> assert((ram_flags & ~(RAM_SHARED | RAM_PMEM)) == 0);
>>>> @@ -2311,6 +2347,14 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t
>>>> size, MemoryRegion *mr,
>>>> return NULL;
>>>> }
>>>> + file_align = get_file_align(fd);
>>>> + if (file_align > 0 && mr && file_align > mr->align) {
>>>> + error_setg(errp, "backing store align 0x%" PRIx64
>>>> + " is larger than 'align' option 0x" RAM_ADDR_FMT,
>>>> + file_align, mr->align);
>>>> + return NULL;
>
> Is there any downside to just making the alignment value be the max of
> the device-dax instance align and the command line option? Why force
> someone to debug the option unnecessarily?
>
+1
Perhaps we can auto-detect that @align was not set, and in that case set it to
the device's align value. But if the user has set a value on the command line,
we would validate it like Jingqi is doing above. Roughly something like this,
just as a suggestion:
@@ -2354,11 +2354,16 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
     }
 
     file_align = get_file_align(fd);
-    if (file_align > 0 && mr && file_align > mr->align) {
-        error_setg(errp, "backing store align 0x%" PRIx64
-                   " is larger than 'align' option 0x" RAM_ADDR_FMT,
-                   file_align, mr->align);
-        return NULL;
+    if (file_align > 0 && mr) {
+        /* auto detect alignment if none is specified */
+        if (!mr->align)
+            mr->align = file_align;
+        if (file_align > mr->align) {
+            error_setg(errp, "backing store align 0x%" PRIx64
+                       " is larger than 'align' option 0x" RAM_ADDR_FMT,
+                       file_align, mr->align);
+            return NULL;
+        }
     }
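
The behaviour this suggestion aims for can be sketched outside of QEMU as a small standalone function. This is only an illustration under stated assumptions: resolve_align() and its parameters are hypothetical names, not QEMU code, and an unset 'align' option is assumed to be represented as 0.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of QEMU): choose the alignment for a
 * file-backed RAM block.
 *
 * file_align: alignment reported by the backing device, <= 0 if unknown
 * user_align: value of the 'align' option, 0 if the user did not set it
 *
 * Returns 0 and stores the chosen alignment in *out, or -EINVAL if the
 * user asked for an alignment smaller than the device requires.
 */
static int resolve_align(int64_t file_align, uint64_t user_align,
                         uint64_t *out)
{
    if (file_align <= 0) {
        /* Not a devdax device, or alignment unknown: keep the option */
        *out = user_align;
        return 0;
    }
    if (user_align == 0) {
        /* Auto-detect: nothing was specified, so use the device value */
        *out = (uint64_t)file_align;
        return 0;
    }
    if ((uint64_t)file_align > user_align) {
        /* An explicit but too small value is still rejected */
        return -EINVAL;
    }
    *out = user_align;
    return 0;
}
```

With a 2M-aligned device, an unset option resolves to 2M, align=1G is accepted as given, and align=4K is rejected.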