From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 1A955C77B7C for ; Wed, 25 Jun 2025 20:35:18 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender:List-Subscribe:List-Help :List-Post:List-Archive:List-Unsubscribe:List-Id:Content-Transfer-Encoding: Content-Type:MIME-Version:References:In-Reply-To:Message-ID:Subject:Cc:To: From:Date:Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From: Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=zWTmAbIsgTV+iTASCTcvA1ygTcFr481gIbaHQJDXqsg=; b=159HvJdn4ILURHOzjfqnRC5UKm DwhYpknvHZx7KBk+HXG63yLEliMOD4BahEjxuv8np3f5LILxwwFydaI4AY1Ma9YgVuoyFN8kouoC4 3+LA4/yKI4zXhH06Xd0gL51csBZ/GhlTAvstMyRxxM5kRlV4SbLzcSSPC2DbxODWdJGrHaq6HPVvb 1bpoT3IOMV93+NFEHf0G0LtSgIg0waBMa19OHO00fTOO9gXqidC/yyOOjpFkNXjf13zGuj/qRZwcg sWLeB8UP+pCZXrcOSUY3HLidLhyHNJCdJQoTkBCs2nCYa+dGb3e4TdsfIQhJq3zsiuDcyc8Jf9E4Y bCLus9Qw==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.98.2 #2 (Red Hat Linux)) id 1uUWpw-00000009rU5-41rF; Wed, 25 Jun 2025 20:35:16 +0000 Received: from us-smtp-delivery-124.mimecast.com ([170.10.129.124]) by bombadil.infradead.org with esmtps (Exim 4.98.2 #2 (Red Hat Linux)) id 1uUUb2-00000009Xzl-1slP for kexec@lists.infradead.org; Wed, 25 Jun 2025 18:11:46 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1750875102; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=zWTmAbIsgTV+iTASCTcvA1ygTcFr481gIbaHQJDXqsg=; b=JkNDauQlu+u88zvmi8HgTMUZWgI7twpiN3RrYawh3aOSZeWvNEtxBulz+1NalZ4bLwd7eF UBSTiKi6iQ/4qTXLbhc69aGuF3xaOoM+WoV9Lw3a88sx2sdInIhzCa/sq+YFhSjWPeFG1o iRRF3drkx7URzE2dClivkc5BrQ5sJRc= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-113-rwWtSOfvMaif7z-djTIYaQ-1; Wed, 25 Jun 2025 14:10:06 -0400 X-MC-Unique: rwWtSOfvMaif7z-djTIYaQ-1 X-Mimecast-MFC-AGG-ID: rwWtSOfvMaif7z-djTIYaQ_1750875004 Received: from mx-prod-int-04.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-04.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.40]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id B83161808985; Wed, 25 Jun 2025 18:10:02 +0000 (UTC) Received: from rotkaeppchen (unknown [10.45.225.238]) by mx-prod-int-04.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 18D6319560A3; Wed, 25 Jun 2025 18:09:53 +0000 (UTC) Date: Wed, 25 Jun 2025 20:09:50 +0200 From: Philipp Rudo To: Pingfan Liu Cc: kexec@lists.infradead.org, Alexei Starovoitov , Daniel Borkmann , John Fastabend , Andrii Nakryiko , Martin KaFai Lau , Eduard Zingerman , Song Liu , Yonghong Song , Jeremy Linton , Catalin Marinas , Will Deacon , Ard Biesheuvel , Simon Horman , Gerd Hoffmann , Vitaly Kuznetsov , Viktor Malik , Jan Hendrik Farr , Baoquan He , Dave Young , Andrew Morton , bpf@vger.kernel.org Subject: Re: [PATCHv3 5/9] kexec: Introduce kexec_pe_image to parse and load PE file Message-ID: <20250625200950.16d7a09c@rotkaeppchen> In-Reply-To: <20250529041744.16458-6-piliu@redhat.com> References: <20250529041744.16458-1-piliu@redhat.com> <20250529041744.16458-6-piliu@redhat.com> Organization: Red Hat inc. MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 3.0 on 10.30.177.40 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20250625_111144_686618_134CE7E2 X-CRM114-Status: GOOD ( 47.26 ) X-BeenThere: kexec@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "kexec" Errors-To: kexec-bounces+kexec=archiver.kernel.org@lists.infradead.org Hi Pingfan, On Thu, 29 May 2025 12:17:40 +0800 Pingfan Liu wrote: > As UEFI becomes popular, a few architectures support to boot a PE format > kernel image directly. But the internal of PE format varies, which means > each parser for each format. > > This patch (with the rest in this series) introduces a common skeleton > to all parsers, and leave the format parsing in > bpf-prog, so the kernel code can keep relative stable. > > A new kexec_file_ops is implementation, named pe_image_ops. > > There are some place holder function in this patch. (They will take > effect after the introduction of kexec bpf light skeleton and bpf > helpers). Overall the parsing progress is a pipeline, the current > bpf-prog parser is attached to bpf_handle_pefile(), and detatched at the > end of the current stage 'disarm_bpf_prog()' the current parsed result > by the current bpf-prog will be buffered in kernel 'prepare_nested_pe()' > , and deliver to the next stage. For each stage, the bpf bytecode is > extracted from the '.bpf' section in the PE file. > > Signed-off-by: Pingfan Liu > Cc: Baoquan He > Cc: Dave Young > Cc: Andrew Morton > Cc: Philipp Rudo > To: kexec@lists.infradead.org > --- > include/linux/kexec.h | 1 + > kernel/Kconfig.kexec | 8 + > kernel/Makefile | 1 + > kernel/kexec_pe_image.c | 356 ++++++++++++++++++++++++++++++++++++++++ > 4 files changed, 366 insertions(+) > create mode 100644 kernel/kexec_pe_image.c > [...] > diff --git a/kernel/kexec_pe_image.c b/kernel/kexec_pe_image.c > new file mode 100644 > index 0000000000000..3097efccb8502 > --- /dev/null > +++ b/kernel/kexec_pe_image.c > @@ -0,0 +1,356 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Kexec PE image loader > + > + * Copyright (C) 2025 Red Hat, Inc > + */ > + > +#define pr_fmt(fmt) "kexec_file(Image): " fmt > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > + > +static LIST_HEAD(phase_head); > + > +struct parsed_phase { > + struct list_head head; > + struct list_head res_head; > +}; > + > +static struct parsed_phase *cur_phase; > + > +static char *kexec_res_names[3] = {"kernel", "initrd", "cmdline"}; Wouldn't it be better to use a enum rather than strings for the different resources? Especially as in prepare_nested_pe you are comparing two strings using == instead of strcmp(). So IIUC it should always return false. > +struct kexec_res { > + struct list_head node; > + char *name; > + /* The free of buffer is deferred to kimage_file_post_load_cleanup */ > + bool deferred_free; > + struct mem_range_result *r; > +}; > + > +static struct parsed_phase *alloc_new_phase(void) > +{ > + struct parsed_phase *phase = kzalloc(sizeof(struct parsed_phase), GFP_KERNEL); > + > + INIT_LIST_HEAD(&phase->head); > + INIT_LIST_HEAD(&phase->res_head); > + list_add_tail(&phase->head, &phase_head); > + > + return phase; > +} I must admit I don't fully understand how you are handling the different phases. In particular I don't understand why you are keeping all the resources a phase returned once it is finished. The way I see it those resources are only needed once as input for the next phase. So it should be sufficient to only keep a single kexec_context and update it when a phase returns a new resource. The way I see it this should simplify pe_image_load quite a bit. Or am I missing something? > +static bool is_valid_pe(const char *kernel_buf, unsigned long kernel_len) > +{ > + struct mz_hdr *mz; > + struct pe_hdr *pe; > + > + if (!kernel_buf) > + return false; > + mz = (struct mz_hdr *)kernel_buf; > + if (mz->magic != MZ_MAGIC) > + return false; > + pe = (struct pe_hdr *)(kernel_buf + mz->peaddr); > + if (pe->magic != PE_MAGIC) > + return false; > + if (pe->opt_hdr_size == 0) { > + pr_err("optional header is missing\n"); > + return false; > + } > + > + return true; > +} > + > +static bool is_valid_format(const char *kernel_buf, unsigned long kernel_len) > +{ > + return is_valid_pe(kernel_buf, kernel_len); > +} > + > +/* > + * The UEFI Terse Executable (TE) image has MZ header. > + */ > +static int pe_image_probe(const char *kernel_buf, unsigned long kernel_len) > +{ > + return is_valid_pe(kernel_buf, kernel_len) ? 0 : -1; Every image, at least on x86, is a valid pe file. So we should check for the .bpf section rather than the header. > +} > + > +static int get_pe_section(char *file_buf, const char *sect_name, s/get_pe_section/pe_get_section/ ? that would make it more consistent with the other functions. Thanks Philipp > + char **sect_start, unsigned long *sect_sz) > +{ > + struct pe_hdr *pe_hdr; > + struct pe32plus_opt_hdr *opt_hdr; > + struct section_header *sect_hdr; > + int section_nr, i; > + struct mz_hdr *mz = (struct mz_hdr *)file_buf; > + > + *sect_start = NULL; > + *sect_sz = 0; > + pe_hdr = (struct pe_hdr *)(file_buf + mz->peaddr); > + section_nr = pe_hdr->sections; > + opt_hdr = (struct pe32plus_opt_hdr *)(file_buf + mz->peaddr + sizeof(struct pe_hdr)); > + sect_hdr = (struct section_header *)((char *)opt_hdr + pe_hdr->opt_hdr_size); > + > + for (i = 0; i < section_nr; i++) { > + if (strcmp(sect_hdr->name, sect_name) == 0) { > + *sect_start = file_buf + sect_hdr->data_addr; > + *sect_sz = sect_hdr->raw_data_size; > + return 0; > + } > + sect_hdr++; > + } > + > + return -1; > +} > + > +static bool pe_has_bpf_section(char *file_buf, unsigned long pe_sz) > +{ > + char *sect_start = NULL; > + unsigned long sect_sz = 0; > + int ret; > + > + ret = get_pe_section(file_buf, ".bpf", §_start, §_sz); > + if (ret < 0) > + return false; > + return true; > +} > + > +/* Load a ELF */ > +static int arm_bpf_prog(char *bpf_elf, unsigned long sz) > +{ > + return 0; > +} > + > +static void disarm_bpf_prog(void) > +{ > +} > + > +struct kexec_context { > + bool kdump; > + char *image; > + int image_sz; > + char *initrd; > + int initrd_sz; > + char *cmdline; > + int cmdline_sz; > +}; > + > +void bpf_handle_pefile(struct kexec_context *context); > +void bpf_post_handle_pefile(struct kexec_context *context); > + > + > +/* > + * optimize("O0") prevents inline, compiler constant propagation > + */ > +__attribute__((used, optimize("O0"))) void bpf_handle_pefile(struct kexec_context *context) > +{ > +} > + > +__attribute__((used, optimize("O0"))) void bpf_post_handle_pefile(struct kexec_context *context) > +{ > +} > + > +/* > + * PE file may be nested and should be unfold one by one. > + * Query 'kernel', 'initrd', 'cmdline' in cur_phase, as they are inputs for the > + * next phase. > + */ > +static int prepare_nested_pe(char **kernel, unsigned long *kernel_len, char **initrd, > + unsigned long *initrd_len, char **cmdline) > +{ > + struct kexec_res *res; > + int ret = -1; > + > + *kernel = NULL; > + *kernel_len = 0; > + > + list_for_each_entry(res, &cur_phase->res_head, node) { > + if (res->name == kexec_res_names[0]) { > + *kernel = res->r->buf; > + *kernel_len = res->r->data_sz; > + ret = 0; > + } else if (res->name == kexec_res_names[1]) { > + *initrd = res->r->buf; > + *initrd_len = res->r->data_sz; > + } else if (res->name == kexec_res_names[2]) { > + *cmdline = res->r->buf; > + } > + } > + > + return ret; > +} > + > +static void *pe_image_load(struct kimage *image, > + char *kernel, unsigned long kernel_len, > + char *initrd, unsigned long initrd_len, > + char *cmdline, unsigned long cmdline_len) > +{ > + char *parsed_kernel = NULL; > + unsigned long parsed_len; > + char *linux_start, *initrd_start, *cmdline_start, *bpf_start; > + unsigned long linux_sz, initrd_sz, cmdline_sz, bpf_sz; > + struct parsed_phase *phase, *phase_tmp; > + struct kexec_res *res, *res_tmp; > + void *ldata; > + int ret; > + > + linux_start = kernel; > + linux_sz = kernel_len; > + initrd_start = initrd; > + initrd_sz = initrd_len; > + cmdline_start = cmdline; > + cmdline_sz = cmdline_len; > + > + while (is_valid_format(linux_start, linux_sz) && > + pe_has_bpf_section(linux_start, linux_sz)) { > + struct kexec_context context; > + > + get_pe_section(linux_start, ".bpf", &bpf_start, &bpf_sz); > + if (!!bpf_sz) { > + /* load and attach bpf-prog */ > + ret = arm_bpf_prog(bpf_start, bpf_sz); > + if (ret) { > + pr_err("Fail to load .bpf section\n"); > + ldata = ERR_PTR(ret); > + goto err; > + } > + } > + cur_phase = alloc_new_phase(); > + if (image->type != KEXEC_TYPE_CRASH) > + context.kdump = false; > + else > + context.kdump = true; > + context.image = linux_start; > + context.image_sz = linux_sz; > + context.initrd = initrd_start; > + context.initrd_sz = initrd_sz; > + context.cmdline = cmdline_start; > + context.cmdline_sz = strlen(cmdline_start); > + /* bpf-prog fentry, which handle above buffers. */ > + bpf_handle_pefile(&context); > + > + prepare_nested_pe(&linux_start, &linux_sz, &initrd_start, > + &initrd_sz, &cmdline_start); > + /* bpf-prog fentry */ > + bpf_post_handle_pefile(&context); > + /* > + * detach the current bpf-prog from their attachment points. > + * It also a point to free any registered interim resource. > + * Any resource except attached to phase is interim. > + */ > + disarm_bpf_prog(); > + } > + > + /* the rear of parsed phase contains the result */ > + list_for_each_entry_reverse(phase, &phase_head, head) { > + if (initrd != NULL && cmdline != NULL && parsed_kernel != NULL) > + break; > + list_for_each_entry(res, &phase->res_head, node) { > + if (!strcmp(res->name, "kernel") && !parsed_kernel) { > + parsed_kernel = res->r->buf; > + parsed_len = res->r->data_sz; > + res->deferred_free = true; > + } else if (!strcmp(res->name, "initrd") && !initrd) { > + initrd = res->r->buf; > + initrd_len = res->r->data_sz; > + res->deferred_free = true; > + } else if (!strcmp(res->name, "cmdline") && !cmdline) { > + cmdline = res->r->buf; > + cmdline_len = res->r->data_sz; > + res->deferred_free = true; > + } > + } > + > + } > + > + if (initrd == NULL || cmdline == NULL || parsed_kernel == NULL) { > + char *c, buf[64]; > + > + c = buf; > + if (parsed_kernel == NULL) { > + strcpy(c, "kernel "); > + c += strlen("kernel "); > + } > + if (initrd == NULL) { > + strcpy(c, "initrd "); > + c += strlen("initrd "); > + } > + if (cmdline == NULL) { > + strcpy(c, "cmdline "); > + c += strlen("cmdline "); > + } > + c = '\0'; > + pr_err("Can not extract data for %s", buf); > + ldata = ERR_PTR(-EINVAL); > + goto err; > + } > + /* > + * image's kernel_buf, initrd_buf, cmdline_buf are set. Now they should > + * be updated to the new content. > + */ > + if (image->kernel_buf != parsed_kernel) { > + vfree(image->kernel_buf); > + image->kernel_buf = parsed_kernel; > + image->kernel_buf_len = parsed_len; > + } > + if (image->initrd_buf != initrd) { > + vfree(image->initrd_buf); > + image->initrd_buf = initrd; > + image->initrd_buf_len = initrd_len; > + } > + if (image->cmdline_buf != cmdline) { > + kfree(image->cmdline_buf); > + image->cmdline_buf = cmdline; > + image->cmdline_buf_len = cmdline_len; > + } > + ret = arch_kexec_kernel_image_probe(image, image->kernel_buf, > + image->kernel_buf_len); > + if (ret) { > + pr_err("Fail to find suitable image loader\n"); > + ldata = ERR_PTR(ret); > + goto err; > + } > + ldata = kexec_image_load_default(image); > + if (IS_ERR(ldata)) { > + pr_err("architecture code fails to load image\n"); > + goto err; > + } > + image->image_loader_data = ldata; > + > +err: > + list_for_each_entry_safe(phase, phase_tmp, &phase_head, head) { > + list_for_each_entry_safe(res, res_tmp, &phase->res_head, node) { > + list_del(&res->node); > + /* defer to kimage_file_post_load_cleanup() */ > + if (res->deferred_free) { > + res->r->buf = NULL; > + res->r->buf_sz = 0; > + } > + mem_range_result_put(res->r); > + kfree(res); > + } > + list_del(&phase->head); > + kfree(phase); > + } > + > + return ldata; > +} > + > +const struct kexec_file_ops kexec_pe_image_ops = { > + .probe = pe_image_probe, > + .load = pe_image_load, > +#ifdef CONFIG_KEXEC_IMAGE_VERIFY_SIG > + .verify_sig = kexec_kernel_verify_pe_sig, > +#endif > +};