Linux Documentation
* Re: [PATCH v7 04/42] KVM: Stub in ability to disable per-VM memory attribute tracking
From: Sean Christopherson @ 2026-06-10 22:19 UTC
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-4-2f0fae496530@google.com>

On Fri, May 22, 2026, Ackerley Tng wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Introduce the basic infrastructure to allow per-VM memory attribute
> tracking to be disabled. This will be built-upon in a later patch, where a
> module param can disable per-VM memory attribute tracking.
> 
> Split the Kconfig option into a base KVM_MEMORY_ATTRIBUTES and the
> existing KVM_VM_MEMORY_ATTRIBUTES. The base option provides the core
> plumbing, while the latter enables the full per-VM tracking via an xarray
> and the associated ioctls.
> 
> kvm_get_memory_attributes() now performs a static call that either looks up
> kvm->mem_attr_array with CONFIG_KVM_VM_MEMORY_ATTRIBUTES is enabled, or
> just returns 0 otherwise. The static call can be patched depending on
> whether per-VM tracking is enabled by the CONFIG.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---

...

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index abb9cfa3eb04d..ee26f1d9b5fda 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -101,6 +101,17 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_shrink);
>  static bool __ro_after_init allow_unsafe_mappings;
>  module_param(allow_unsafe_mappings, bool, 0444);
>  
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +static bool vm_memory_attributes = true;
> +#else
> +#define vm_memory_attributes false
> +#endif
> +DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_TRAMP(__kvm_get_memory_attributes));
> +#endif

Fudge.  This morning's PUCK discussion about VBS made me realize that we really
don't want to kill off _all_ per-VM attributes like this, we really just want to
kill off PRIVATE.  And even if RWX protections never arrive, conceptually shoving
all attributes into guest_memfd doesn't make any sense, because it really is only
the private vs. shared state that is tied to the physical memory, things like RWX
protections aren't so tightly coupled to the data.

It'll require a bit of minor surgery to these patches, but the silver lining is
that I think the end code will be slightly easier to follow.

I'll sync with you off-list to splice in the changes to your current series (I
have them sketched out).


* Re: [PATCH net-next 2/3] docs: net: tls-offload: document tls_dev_del, tls_dev_resync, and rekey
From: Sabrina Dubroca @ 2026-06-10 21:06 UTC
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, corbet,
	linux-doc, bpf, john.fastabend, skhan
In-Reply-To: <20260609201224.1191391-3-kuba@kernel.org>

2026-06-09, 13:12:23 -0700, Jakub Kicinski wrote:
> Fill in some gaps in the TLS offload doc:
> 
> - describe the tls_dev_del and tls_dev_resync callbacks
> - add a mention of rekeying being out of scope for now
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: john.fastabend@gmail.com
> CC: sd@queasysnail.net
> CC: corbet@lwn.net
> CC: skhan@linuxfoundation.org
> CC: linux-doc@vger.kernel.org
> ---
>  Documentation/networking/tls-offload.rst | 29 ++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
> index c173f537bf4d..a41f46885e8c 100644
> --- a/Documentation/networking/tls-offload.rst
> +++ b/Documentation/networking/tls-offload.rst
> @@ -104,6 +104,29 @@ at the end of kernel structures (see :c:member:`driver_state` members
>  in ``include/net/tls.h``) to avoid additional allocations and pointer
>  dereferences.
>  
> +When the offloaded connection is destroyed the core calls
> +the :c:member:`tls_dev_del` callback so the driver can release per-direction
> +state:
> +
> +.. code-block:: c
> +
> +	void (*tls_dev_del)(struct net_device *netdev,
> +			    struct tls_context *ctx,
> +			    enum tls_offload_ctx_dir direction);
> +
> +``tls_dev_del`` is mandatory whenever ``tls_dev_add`` is provided.
> +
> +The third TLS device callback is :c:member:`tls_dev_resync`, called by the core
> +to synchronize the TCP stream with the record boundaries:
> +
> +.. code-block:: c
> +
> +	int (*tls_dev_resync)(struct net_device *netdev,
> +			      struct sock *sk, u32 seq, u8 *rcd_sn,
> +			      enum tls_offload_ctx_dir direction);
> +
> +See the `Resync handling`_ section for details.

Hmm, this callback is not mentioned at all in the "Resync handling"
section. I think it'd be good to add at least a quick note there about
how/when it's invoked, and what the arguments mean (at least the two
types of sequence numbers, since the rest is identical to the other
driver CBs).

-- 
Sabrina


* Re: [PATCH net-next v09 4/5] hinic3: Add ethtool rss ops
From: Dimitri Daskalakis @ 2026-06-10 20:41 UTC
  To: Fan Gong, Wu Di, Teng Peisen, netdev, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Andrew Lunn, Ioana Ciornei, Mohsin Bashir
  Cc: linux-kernel, linux-doc, luosifu, Xin Guo, Zhou Shuai, Wu Like,
	Shi Jing, Zheng Jiezhen, Maxime Chevallier
In-Reply-To: <7d1a4375fdf7c3e7a5a6162382cee4f48991d5da.1781062575.git.wudi234@huawei.com>



On 6/9/26 11:59 PM, Fan Gong wrote:
>   Implement following ethtool callback function:
> .get_rxnfc
> .set_rxnfc
> .get_channels
> .set_channels
> .get_rxfh_indir_size
> .get_rxfh_key_size
> .get_rxfh
> .set_rxfh
> 
>   These callbacks allow users to utilize ethtool for detailed
> RSS parameters configuration and monitoring.
> 
> Co-developed-by: Wu Di <wudi234@huawei.com>
> Signed-off-by: Wu Di <wudi234@huawei.com>
> Co-developed-by: Teng Peisen <tengpeisen@huawei.com>
> Signed-off-by: Teng Peisen <tengpeisen@huawei.com>
> Signed-off-by: Fan Gong <gongfan1@huawei.com>
> ---
>  .../ethernet/huawei/hinic3/hinic3_ethtool.c   |   9 +
>  .../huawei/hinic3/hinic3_mgmt_interface.h     |   2 +
>  .../net/ethernet/huawei/hinic3/hinic3_rss.c   | 539 +++++++++++++++++-
>  .../net/ethernet/huawei/hinic3/hinic3_rss.h   |  19 +
>  4 files changed, 567 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
> index 11c8eb0f5d2a..78818de9a946 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
> @@ -16,6 +16,7 @@
>  #include "hinic3_hw_comm.h"
>  #include "hinic3_nic_dev.h"
>  #include "hinic3_nic_cfg.h"
> +#include "hinic3_rss.h"
>  
>  #define HINIC3_MGMT_VERSION_MAX_LEN     32
>  /* Coalesce time properties in microseconds */
> @@ -1238,6 +1239,14 @@ static const struct ethtool_ops hinic3_ethtool_ops = {
>  	.get_pause_stats                = hinic3_get_pause_stats,
>  	.get_coalesce                   = hinic3_get_coalesce,
>  	.set_coalesce                   = hinic3_set_coalesce,
> +	.get_rxnfc                      = hinic3_get_rxnfc,
> +	.set_rxnfc                      = hinic3_set_rxnfc,
> +	.get_channels                   = hinic3_get_channels,
> +	.set_channels                   = hinic3_set_channels,
> +	.get_rxfh_indir_size            = hinic3_get_rxfh_indir_size,
> +	.get_rxfh_key_size              = hinic3_get_rxfh_key_size,
> +	.get_rxfh                       = hinic3_get_rxfh,
> +	.set_rxfh                       = hinic3_set_rxfh,
>  };
>  
>  void hinic3_set_ethtool_ops(struct net_device *netdev)
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h b/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h
> index 76c691f82703..3c1263ff99ff 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h
> @@ -282,6 +282,7 @@ enum l2nic_cmd {
>  	L2NIC_CMD_SET_VLAN_FILTER_EN  = 26,
>  	L2NIC_CMD_SET_RX_VLAN_OFFLOAD = 27,
>  	L2NIC_CMD_CFG_RSS             = 60,
> +	L2NIC_CMD_GET_RSS_CTX_TBL     = 62,
>  	L2NIC_CMD_CFG_RSS_HASH_KEY    = 63,
>  	L2NIC_CMD_CFG_RSS_HASH_ENGINE = 64,
>  	L2NIC_CMD_SET_RSS_CTX_TBL     = 65,
> @@ -301,6 +302,7 @@ enum l2nic_ucode_cmd {
>  	L2NIC_UCODE_CMD_MODIFY_QUEUE_CTX  = 0,
>  	L2NIC_UCODE_CMD_CLEAN_QUEUE_CTX   = 1,
>  	L2NIC_UCODE_CMD_SET_RSS_INDIR_TBL = 4,
> +	L2NIC_UCODE_CMD_GET_RSS_INDIR_TBL = 6,
>  };
>  
>  /* hilink mac group command */
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_rss.c b/drivers/net/ethernet/huawei/hinic3/hinic3_rss.c
> index 25db74d8c7dd..811a6b491e74 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_rss.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_rss.c
> @@ -155,7 +155,7 @@ static int hinic3_set_rss_type(struct hinic3_hwdev *hwdev,
>  				       L2NIC_CMD_SET_RSS_CTX_TBL, &msg_params);
>  
>  	if (ctx_tbl.msg_head.status == MGMT_STATUS_CMD_UNSUPPORTED) {
> -		return MGMT_STATUS_CMD_UNSUPPORTED;
> +		return -EOPNOTSUPP;
>  	} else if (err || ctx_tbl.msg_head.status) {
>  		dev_err(hwdev->dev, "mgmt Failed to set rss context offload, err: %d, status: 0x%x\n",
>  			err, ctx_tbl.msg_head.status);
> @@ -165,6 +165,41 @@ static int hinic3_set_rss_type(struct hinic3_hwdev *hwdev,
>  	return 0;
>  }
>  
> +static int hinic3_get_rss_type(struct hinic3_hwdev *hwdev,
> +			       struct hinic3_rss_type *rss_type)
> +{
> +	struct l2nic_cmd_rss_ctx_tbl ctx_tbl = {};
> +	struct mgmt_msg_params msg_params = {};
> +	int err;
> +
> +	ctx_tbl.func_id = hinic3_global_func_id(hwdev);
> +
> +	mgmt_msg_params_init_default(&msg_params, &ctx_tbl, sizeof(ctx_tbl));
> +
> +	err = hinic3_send_mbox_to_mgmt(hwdev, MGMT_MOD_L2NIC,
> +				       L2NIC_CMD_GET_RSS_CTX_TBL,
> +				       &msg_params);
> +	if (ctx_tbl.msg_head.status == MGMT_STATUS_CMD_UNSUPPORTED) {
> +		return -EOPNOTSUPP;
> +	} else if (err || ctx_tbl.msg_head.status) {
> +		dev_err(hwdev->dev, "Failed to get hash type, err: %d, status: 0x%x\n",
> +			err, ctx_tbl.msg_head.status);
> +		return -EINVAL;
> +	}
> +
> +	rss_type->ipv4         = L2NIC_RSS_TYPE_GET(ctx_tbl.context, IPV4);
> +	rss_type->ipv6         = L2NIC_RSS_TYPE_GET(ctx_tbl.context, IPV6);
> +	rss_type->ipv6_ext     = L2NIC_RSS_TYPE_GET(ctx_tbl.context, IPV6_EXT);
> +	rss_type->tcp_ipv4     = L2NIC_RSS_TYPE_GET(ctx_tbl.context, TCP_IPV4);
> +	rss_type->tcp_ipv6     = L2NIC_RSS_TYPE_GET(ctx_tbl.context, TCP_IPV6);
> +	rss_type->tcp_ipv6_ext = L2NIC_RSS_TYPE_GET(ctx_tbl.context,
> +						    TCP_IPV6_EXT);
> +	rss_type->udp_ipv4     = L2NIC_RSS_TYPE_GET(ctx_tbl.context, UDP_IPV4);
> +	rss_type->udp_ipv6     = L2NIC_RSS_TYPE_GET(ctx_tbl.context, UDP_IPV6);
> +
> +	return 0;
> +}
> +
>  static int hinic3_rss_cfg_hash_type(struct hinic3_hwdev *hwdev, u8 opcode,
>  				    enum hinic3_rss_hash_type *type)
>  {
> @@ -264,7 +299,8 @@ static int hinic3_set_hw_rss_parameters(struct net_device *netdev, u8 rss_en)
>  	if (err)
>  		return err;
>  
> -	hinic3_fillout_indir_tbl(netdev, nic_dev->rss_indir);
> +	if (!netif_is_rxfh_configured(netdev))
> +		hinic3_fillout_indir_tbl(netdev, nic_dev->rss_indir);
>  
>  	err = hinic3_config_rss_hw_resource(netdev, nic_dev->rss_indir);
>  	if (err)
> @@ -334,3 +370,502 @@ void hinic3_try_to_enable_rss(struct net_device *netdev)
>  	clear_bit(HINIC3_RSS_ENABLE, &nic_dev->flags);
>  	nic_dev->q_params.num_qps = nic_dev->max_qps;
>  }
> +
> +static int hinic3_set_l4_rss_hash_ops(const struct ethtool_rxnfc *cmd,
> +				      struct hinic3_rss_type *rss_type)
> +{
> +	u8 rss_l4_en;
> +
> +	switch (cmd->data & (RXH_L4_B_0_1 | RXH_L4_B_2_3)) {
> +	case 0:
> +		rss_l4_en = 0;
> +		break;
> +	case (RXH_L4_B_0_1 | RXH_L4_B_2_3):
> +		rss_l4_en = 1;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	switch (cmd->flow_type) {
> +	case TCP_V4_FLOW:
> +		rss_type->tcp_ipv4 = rss_l4_en;
> +		break;
> +	case TCP_V6_FLOW:
> +		rss_type->tcp_ipv6 = rss_l4_en;
> +		break;
> +	case UDP_V4_FLOW:
> +		rss_type->udp_ipv4 = rss_l4_en;
> +		break;
> +	case UDP_V6_FLOW:
> +		rss_type->udp_ipv6 = rss_l4_en;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int hinic3_update_rss_hash_opts(struct net_device *netdev,
> +				       struct ethtool_rxnfc *cmd,
> +				       struct hinic3_rss_type *rss_type)
> +{
> +	int err;
> +
> +	switch (cmd->flow_type) {
> +	case TCP_V4_FLOW:
> +	case TCP_V6_FLOW:
> +	case UDP_V4_FLOW:
> +	case UDP_V6_FLOW:
> +		err = hinic3_set_l4_rss_hash_ops(cmd, rss_type);
> +		if (err)
> +			return err;
> +
> +		break;
> +	case IPV4_FLOW:
> +		rss_type->ipv4 = 1;
> +		break;
> +	case IPV6_FLOW:
> +		rss_type->ipv6 = 1;
> +		break;
> +	default:
> +		netdev_err(netdev, "Unsupported flow type\n");
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int hinic3_set_rss_hash_opts(struct net_device *netdev,
> +				    struct ethtool_rxnfc *cmd)
> +{
> +	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> +	struct hinic3_rss_type rss_type;
> +	int err;
> +
> +	if (!test_bit(HINIC3_RSS_ENABLE, &nic_dev->flags)) {
> +		cmd->data = 0;
> +		netdev_err(netdev, "RSS is disable, not support to set flow-hash\n");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	/* RSS only supports hashing of IP addresses and L4 ports */
> +	if (cmd->data & ~(RXH_IP_SRC | RXH_IP_DST |
> +			  RXH_L4_B_0_1 | RXH_L4_B_2_3))
> +		return -EINVAL;
> +
> +	/* Both IP addresses must be part of the hash tuple */
> +	if (!(cmd->data & RXH_IP_SRC) || !(cmd->data & RXH_IP_DST))
> +		return -EINVAL;
> +
> +	/* L4 hash bits are not valid for pure L3 flow types */
> +	if ((cmd->flow_type == IPV4_FLOW || cmd->flow_type == IPV6_FLOW) &&
> +	    (cmd->data & (RXH_L4_B_0_1 | RXH_L4_B_2_3)))
> +		return -EINVAL;
> +
> +	err = hinic3_get_rss_type(nic_dev->hwdev, &rss_type);
> +	if (err) {
> +		netdev_err(netdev, "Failed to get rss type\n");
> +		return err;
> +	}
> +
> +	err = hinic3_update_rss_hash_opts(netdev, cmd, &rss_type);
> +	if (err)
> +		return err;
> +
> +	err = hinic3_set_rss_type(nic_dev->hwdev, rss_type);
> +	if (err) {
> +		netdev_err(netdev, "Failed to set rss type\n");
> +		return err;
> +	}
> +
> +	nic_dev->rss_type = rss_type;
> +
> +	return 0;
> +}
> +
> +static void convert_rss_l3_type(u8 rss_opt, struct ethtool_rxnfc *cmd)
> +{
> +	if (!rss_opt)
> +		cmd->data &= ~(RXH_IP_SRC | RXH_IP_DST);
> +}
> +
> +static void convert_rss_l4_type(u8 rss_opt, struct ethtool_rxnfc *cmd)
> +{
> +	if (rss_opt)
> +		cmd->data |= RXH_L4_B_0_1 | RXH_L4_B_2_3;
> +}
> +
> +static int hinic3_convert_rss_type(struct net_device *netdev,
> +				   struct hinic3_rss_type *rss_type,
> +				   struct ethtool_rxnfc *cmd)
> +{
> +	cmd->data = RXH_IP_SRC | RXH_IP_DST;
> +	switch (cmd->flow_type) {
> +	case TCP_V4_FLOW:
> +		convert_rss_l4_type(rss_type->tcp_ipv4, cmd);
> +		break;
> +	case TCP_V6_FLOW:
> +		convert_rss_l4_type(rss_type->tcp_ipv6, cmd);
> +		break;
> +	case UDP_V4_FLOW:
> +		convert_rss_l4_type(rss_type->udp_ipv4, cmd);
> +		break;
> +	case UDP_V6_FLOW:
> +		convert_rss_l4_type(rss_type->udp_ipv6, cmd);
> +		break;
> +	case IPV4_FLOW:
> +		convert_rss_l3_type(rss_type->ipv4, cmd);
> +		break;
> +	case IPV6_FLOW:
> +		convert_rss_l3_type(rss_type->ipv6, cmd);
> +		break;
> +	default:
> +		netdev_err(netdev, "Unsupported flow type\n");
> +		cmd->data = 0;
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int hinic3_get_rss_hash_opts(struct net_device *netdev,
> +				    struct ethtool_rxnfc *cmd)
> +{
> +	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> +	struct hinic3_rss_type rss_type;
> +	int err;
> +
> +	cmd->data = 0;
> +
> +	if (!test_bit(HINIC3_RSS_ENABLE, &nic_dev->flags))
> +		return 0;
> +
> +	err = hinic3_get_rss_type(nic_dev->hwdev, &rss_type);
> +	if (err) {
> +		netdev_err(netdev, "Failed to get rss type\n");
> +		return err;
> +	}
> +
> +	return hinic3_convert_rss_type(netdev, &rss_type, cmd);
> +}
> +
> +int hinic3_get_rxnfc(struct net_device *netdev,
> +		     struct ethtool_rxnfc *cmd, u32 *rule_locs)
> +{
> +	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> +	int err = 0;
> +
> +	switch (cmd->cmd) {
> +	case ETHTOOL_GRXRINGS:
> +		cmd->data = nic_dev->q_params.num_qps;
> +		break;

You should probably implement the get_rx_ring_count ethtool op instead.
See
https://lore.kernel.org/netdev/20260122-grxring_big_v4-v2-0-94dbe4dcaa10@debian.org/



* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: John Ericson @ 2026-06-10 20:38 UTC
  To: Li Chen
  Cc: Andy Lutomirski, Christian Brauner, Kees Cook, Al Viro,
	linux-fsdevel, linux-api, LKML, linux-mm, linux-arch, linux-doc,
	linux-kselftest, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <19eb181fdd4.6d028f442844776.3737831021032223216@linux.beauty>

On Wed, Jun 10, 2026, at 8:29 AM, Li Chen wrote:
> Hi John,
>
> [...]
>
> Thanks, this helped a lot. I looked at FreeBSD/OpenBSD/XNU after your
> note. FreeBSD has P_INEXEC, OpenBSD has PS_INEXEC, and XNU seems even
> closer with P_LINTRANSIT, described as "process in exec or in creation".
> Linux does not seem to have a single equivalent today: current->in_execve
> is only an LSM hint, while the real synchronization is spread across
> exec_update_lock, cred_guard_mutex, and the exec path.

Great! Glad to hear my suggestion (and, I hope, the patch I linked in the
other email) was useful.

> I am switching my local WIP from the two-fd builder model to one fd,
> closer to Christian's sketch:
>
> fd = pidfd_open(0, PIDFD_EMPTY);
> pidfd_config(fd, ...);
> pidfd_spawn_run(fd, ...);

Glad to hear it is also one-fd now.

> In my current local version, I still use copy_process(), so the fd points
> at a real task_struct/pid that is not woken until run.

So this is an interesting thing to think about. My hunch is that
`copy_process` is, at least in the longer term, still doing too much! In
particular, `struct kernel_clone_args` has many degrees of freedom, and
might also make assumptions about preserving more of the parent process
than is needed in this case.

This is a bit tangential, but one thing I have thought about is having
"null namespaces". I think the current (i.e. existing clone API) default
of "share with parent process" is a poor security practice (more
privileges, i.e. sharing, should always be opt-in). But the opposite
default of "unshare everything" is expensive since creating new
namespaces is non-free. The goal of the null namespaces would be a cheap
way of creating a more isolated and unprivileged process — and "cheap"
here is literal: a null pointer in `nsproxy`, no allocation, no
namespace object, no ID. This null state would be what
`pidfd_open(0, PIDFD_EMPTY)` (using your example above, or really
whatever the first step is) hands back.

Then, from that maximally cheap and unprivileged initial state, the
`pidfd_config(fd, ...);` calls (plural important, I think!) would opt
into either sharing or unsharing namespaces between the child and parent
as the parent sees fit.

The larger point here is that insofar as there are no good defaults for
things, there is pressure, whether in step 1 or step 2, to make a larger,
everything-at-once configuration. But when we think a bit outside the
box to create good defaults where they didn't previously exist, we
can end up in a situation where a minimal, initially blank, unstarted
process, and the builder pattern to initialize it, are more "natural".
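
To make the shape of that idea concrete, here is a toy userspace model,
not kernel code and not a proposed ABI. Everything in it is made up for
illustration: the `toy_*` names, the three namespace kinds, and the
share/unshare modes. The only things it is meant to show are that the
blank state costs nothing (all slots NULL, no allocation) and that each
configuration call opts exactly one namespace into sharing or unsharing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical namespace kinds and opt-in modes, for illustration only. */
enum toy_ns_kind { TOY_NS_MNT, TOY_NS_NET, TOY_NS_PID, TOY_NS_COUNT };
enum toy_ns_mode { TOY_NS_SHARE, TOY_NS_UNSHARE };

struct toy_ns {
	int id;
};

/* Stand-in for a process plus its nsproxy: NULL means "null namespace". */
struct toy_proc {
	struct toy_ns *ns[TOY_NS_COUNT];
	int owned[TOY_NS_COUNT];	/* 1 if the slot was allocated by unshare */
};

/* Step 1: the blank process. No allocation, every slot is NULL. */
static void toy_proc_init_empty(struct toy_proc *p)
{
	for (int k = 0; k < TOY_NS_COUNT; k++) {
		p->ns[k] = NULL;
		p->owned[k] = 0;
	}
}

/* Step 2, repeatable: opt one namespace kind into share or unshare. */
static int toy_proc_config_ns(struct toy_proc *child,
			      const struct toy_proc *parent,
			      enum toy_ns_kind kind, enum toy_ns_mode mode)
{
	static int next_id = 1000;
	struct toy_ns *fresh;

	if (child->ns[kind])
		return -1;	/* slot already configured */

	if (mode == TOY_NS_SHARE) {
		child->ns[kind] = parent->ns[kind];
		return 0;
	}

	/* Only unshare pays for a new namespace object. */
	fresh = malloc(sizeof(*fresh));
	if (!fresh)
		return -1;
	fresh->id = next_id++;
	child->ns[kind] = fresh;
	child->owned[kind] = 1;
	return 0;
}

/* Frees only what unshare allocated; shared slots belong to the parent. */
static void toy_proc_release(struct toy_proc *p)
{
	for (int k = 0; k < TOY_NS_COUNT; k++) {
		if (p->owned[k])
			free(p->ns[k]);
		p->ns[k] = NULL;
		p->owned[k] = 0;
	}
}

/* Walks through blank state, one share, one unshare, one untouched slot. */
static int toy_ns_demo(void)
{
	struct toy_ns parent_net = { .id = 1 }, parent_pid = { .id = 2 };
	struct toy_proc parent, child;

	toy_proc_init_empty(&parent);
	parent.ns[TOY_NS_NET] = &parent_net;
	parent.ns[TOY_NS_PID] = &parent_pid;

	toy_proc_init_empty(&child);
	assert(child.ns[TOY_NS_NET] == NULL);

	assert(toy_proc_config_ns(&child, &parent, TOY_NS_NET, TOY_NS_SHARE) == 0);
	assert(toy_proc_config_ns(&child, &parent, TOY_NS_PID, TOY_NS_UNSHARE) == 0);

	assert(child.ns[TOY_NS_NET] == &parent_net);	/* shared with parent */
	assert(child.ns[TOY_NS_PID] != &parent_pid);	/* fresh object */
	assert(child.ns[TOY_NS_MNT] == NULL);		/* never opted in */

	toy_proc_release(&child);
	return 0;
}
```

The mount slot staying NULL at the end is the point of the exercise:
anything the parent does not explicitly configure stays in the cheap,
unprivileged state.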

> Following
> Christian's point that existing APIs can handle this not-yet-running case
> with ESRCH, I currently make ordinary pidfd operations that need a real
> started process return -ESRCH before start.

Also glad to hear.

> I am not sure yet whether Linux should grow a general exec/creation
> transition state like that, or whether a narrower future-process
> lifecycle is enough for this API. I will think more about that when
> working on the pristine process version.

Sounds good. As I think you can guess, my preference is for "yes", but I
agree we can see what you end up with in the next patchset and make more
informed decisions based on that.

Cheers,

John


* Re: [PATCH 0/2] module: restrict module auto-loading to privileged users
From: Kees Cook @ 2026-06-10 20:23 UTC
  To: Sami Tolvanen
  Cc: Michal Gorlas, Jonathan Corbet, Shuah Khan, Luis Chamberlain,
	Petr Pavlu, Daniel Gomez, Aaron Tomlin, linux-doc, linux-kernel,
	linux-modules
In-Reply-To: <20260605183646.GC2939956@google.com>

On Fri, Jun 05, 2026 at 06:36:46PM +0000, Sami Tolvanen wrote:
> On Fri, May 15, 2026 at 07:20:18PM +0200, Michal Gorlas wrote:
> > Add option to restrict the module auto-loading to CAP_SYS_ADMIN.
> > This is heavily inspired by CONFIG_GRKERNSEC_MODHARDEN of the latest
> > available Grsecurity patches [1]. Instead of checking whether the
> > callers' UID is 0, check whether the calling process has CAP_SYS_ADMIN.
> > The reasoning here is that many modules are autoloaded by systemd
> > services which are running as privileged users, but do not have UID 0.
> > While systemd-udevd runs as root, systemd-network (which often
> > auto-loads a module) for example runs as system user (UID range 6 to
> > 999).
> > 
> > When enabled, reduces attack surface where unprivileged users can trigger
> > vulnerable module to be auto-loaded, to then exploit it. Recent LPEs
> > (CopyFail [3], DirtyFrag [4]) for example, would have been mitigated
> > with this option enabled as long as the vulnerable modules are not built-in
> > (or already loaded at the point of running the exploit). 
> 
> This sounds potentially useful as an optional feature. Kees, you've
> looked at grsec features in the past, do you have any thoughts about
> this?

This doesn't really look like GRKERNSEC_MODHARDEN to me? In that
feature, the credentials of the usermode helper are passed down so that
udev or whatever can examine them and make choices (instead of seeing
the uid-0 usermode helper credentials).

This looks like it is just doing a request-time policy check, but that's
already covered by the security_kernel_module_request() call immediately
before the proposed module_autoload_restrict check.

Also note that module loading is _already_ controlled by CAP_SYS_MODULE,
not uid 0 nor CAP_SYS_ADMIN.

Sashiko has similar feedback, and some other notes too:
https://sashiko.dev/#/patchset/20260515-autoload_restrict-v1-0-40b7c03ddd04%409elements.com

I'm not clear on what problem this patch is trying to solve.

-Kees

-- 
Kees Cook


* Re: [PATCH net-next v09 1/5] hinic3: Add ethtool queue ops
From: Dimitri Daskalakis @ 2026-06-10 20:21 UTC
  To: Fan Gong, Wu Di, Teng Peisen, netdev, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Andrew Lunn, Ioana Ciornei, Mohsin Bashir
  Cc: linux-kernel, linux-doc, luosifu, Xin Guo, Zhou Shuai, Wu Like,
	Shi Jing, Zheng Jiezhen, Maxime Chevallier
In-Reply-To: <02e87952a65aa268526ade2f03de6c76fbc1fe9d.1781062575.git.wudi234@huawei.com>



On 6/9/26 11:59 PM, Fan Gong wrote:
>   Implement following ethtool callback function:
> .get_ringparam
> .set_ringparam
> 
>   These callbacks allow users to utilize ethtool for detailed
> queue depth configuration and monitoring.
> 
> Co-developed-by: Wu Di <wudi234@huawei.com>
> Signed-off-by: Wu Di <wudi234@huawei.com>
> Co-developed-by: Teng Peisen <tengpeisen@huawei.com>
> Signed-off-by: Teng Peisen <tengpeisen@huawei.com>
> Signed-off-by: Fan Gong <gongfan1@huawei.com>
> ---
>  .../ethernet/huawei/hinic3/hinic3_ethtool.c   |  93 ++++++++++++++++
>  .../net/ethernet/huawei/hinic3/hinic3_irq.c   |   5 +-
>  .../net/ethernet/huawei/hinic3/hinic3_main.c  |   6 +
>  .../huawei/hinic3/hinic3_netdev_ops.c         | 104 ++++++++++++++++--
>  .../ethernet/huawei/hinic3/hinic3_nic_dev.h   |   9 ++
>  .../ethernet/huawei/hinic3/hinic3_nic_io.c    |   4 +-
>  .../ethernet/huawei/hinic3/hinic3_nic_io.h    |   8 +-
>  .../net/ethernet/huawei/hinic3/hinic3_rx.c    |   2 +-
>  8 files changed, 217 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
> index 90fc16288de9..be9992a235f7 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
> @@ -9,6 +9,7 @@
>  #include <linux/errno.h>
>  #include <linux/etherdevice.h>
>  #include <linux/netdevice.h>
> +#include <linux/netlink.h>
>  #include <linux/ethtool.h>
>  
>  #include "hinic3_lld.h"
> @@ -409,6 +410,96 @@ hinic3_get_link_ksettings(struct net_device *netdev,
>  	return 0;
>  }
>  
> +static void hinic3_get_ringparam(struct net_device *netdev,
> +				 struct ethtool_ringparam *ring,
> +				 struct kernel_ethtool_ringparam *kernel_ring,
> +				 struct netlink_ext_ack *extack)
> +{
> +	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> +
> +	ring->rx_max_pending = HINIC3_MAX_RX_QUEUE_DEPTH;
> +	ring->tx_max_pending = HINIC3_MAX_TX_QUEUE_DEPTH;
> +	ring->rx_pending = nic_dev->q_params.rq_depth;
> +	ring->rx_pending = nic_dev->q_params.sq_depth;
> +}
> +
> +static void hinic3_update_qp_depth(struct net_device *netdev,
> +				   u32 sq_depth, u32 rq_depth)
> +{
> +	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> +	u16 i;
> +
> +	nic_dev->q_params.sq_depth = sq_depth;
> +	nic_dev->q_params.rq_depth = rq_depth;
> +	for (i = 0; i < nic_dev->max_qps; i++) {
> +		nic_dev->txqs[i].q_depth = sq_depth;
> +		nic_dev->txqs[i].q_mask = sq_depth - 1;
> +		nic_dev->rxqs[i].q_depth = rq_depth;
> +		nic_dev->rxqs[i].q_mask = rq_depth - 1;
> +	}
> +}
> +
> +static int hinic3_check_ringparam_valid(struct net_device *netdev,
> +					const struct ethtool_ringparam *ring,
> +					struct netlink_ext_ack *extack)
> +{
> +	if (ring->tx_pending < HINIC3_MIN_QUEUE_DEPTH ||
> +	    ring->rx_pending < HINIC3_MIN_QUEUE_DEPTH) {
> +		NL_SET_ERR_MSG_FMT_MOD(extack,
> +				       "Queue depth out of range tx[%d-%d] rx[%d-%d]",
> +				       HINIC3_MIN_QUEUE_DEPTH,
> +				       HINIC3_MAX_TX_QUEUE_DEPTH,
> +				       HINIC3_MIN_QUEUE_DEPTH,
> +				       HINIC3_MAX_RX_QUEUE_DEPTH);
> +
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int hinic3_set_ringparam(struct net_device *netdev,
> +				struct ethtool_ringparam *ring,
> +				struct kernel_ethtool_ringparam *kernel_ring,
> +				struct netlink_ext_ack *extack)
> +{
> +	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> +	struct hinic3_dyna_txrxq_params q_params = {};
> +	u32 new_sq_depth, new_rq_depth;
> +	int err;
> +
> +	err = hinic3_check_ringparam_valid(netdev, ring, extack);
> +	if (err)
> +		return err;
> +
> +	new_sq_depth = 1U << ilog2(ring->tx_pending);
> +	new_rq_depth = 1U << ilog2(ring->rx_pending);
> +	if (new_sq_depth == nic_dev->q_params.sq_depth &&
> +	    new_rq_depth == nic_dev->q_params.rq_depth)
> +		return 0;
> +
> +	if (new_sq_depth != ring->tx_pending ||
> +	    new_rq_depth != ring->rx_pending)
> +		NL_SET_ERR_MSG_FMT_MOD(extack,
> +				       "Requested Tx/Rx ring depth %u/%u trimmed to %u/%u",
> +				       ring->tx_pending, ring->rx_pending,
> +				       new_sq_depth, new_rq_depth);
> +
> +	if (!netif_running(netdev)) {
> +		hinic3_update_qp_depth(netdev, new_sq_depth, new_rq_depth);
> +	} else {
> +		q_params = nic_dev->q_params;
> +		q_params.sq_depth = new_sq_depth;
> +		q_params.rq_depth = new_rq_depth;
> +
> +		err = hinic3_change_channel_settings(netdev, &q_params);
> +		if (err)
> +			return err;
> +	}
> +
> +	return 0;
> +}
> +
>  static const struct ethtool_ops hinic3_ethtool_ops = {
>  	.supported_coalesce_params      = ETHTOOL_COALESCE_USECS |
>  					  ETHTOOL_COALESCE_PKT_RATE_RX_USECS,
> @@ -417,6 +508,8 @@ static const struct ethtool_ops hinic3_ethtool_ops = {
>  	.get_msglevel                   = hinic3_get_msglevel,
>  	.set_msglevel                   = hinic3_set_msglevel,
>  	.get_link                       = ethtool_op_get_link,
> +	.get_ringparam                  = hinic3_get_ringparam,
> +	.set_ringparam                  = hinic3_set_ringparam,
>  };
>  
>  void hinic3_set_ethtool_ops(struct net_device *netdev)
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c b/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c
> index e7d6c2033b45..bc4d879f9be4 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c
> @@ -137,7 +137,8 @@ static int hinic3_set_interrupt_moder(struct net_device *netdev, u16 q_id,
>  	struct hinic3_interrupt_info info = {};
>  	int err;
>  
> -	if (q_id >= nic_dev->q_params.num_qps)
> +	if (q_id >= nic_dev->q_params.num_qps ||
> +	    !mutex_trylock(&nic_dev->change_res_mutex))
>  		return 0;
>  
>  	info.interrupt_coalesc_set = 1;
> @@ -156,6 +157,8 @@ static int hinic3_set_interrupt_moder(struct net_device *netdev, u16 q_id,
>  		nic_dev->rxqs[q_id].last_pending_limit = pending_limit;
>  	}
>  
> +	mutex_unlock(&nic_dev->change_res_mutex);
> +
>  	return err;
>  }
>  
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_main.c b/drivers/net/ethernet/huawei/hinic3/hinic3_main.c
> index 0a888fe4c975..c87624a5e5dc 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_main.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_main.c
> @@ -179,6 +179,7 @@ static int hinic3_sw_init(struct net_device *netdev)
>  	int err;
>  
>  	mutex_init(&nic_dev->port_state_mutex);
> +	mutex_init(&nic_dev->change_res_mutex);
>  
>  	nic_dev->q_params.sq_depth = HINIC3_SQ_DEPTH;
>  	nic_dev->q_params.rq_depth = HINIC3_RQ_DEPTH;
> @@ -315,6 +316,9 @@ static void hinic3_link_status_change(struct net_device *netdev,
>  {
>  	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
>  
> +	if (!mutex_trylock(&nic_dev->change_res_mutex))
> +		return;
> +
>  	if (link_status_up) {
>  		if (netif_carrier_ok(netdev))
>  			return;

There are a couple of returns in this function that will leave the lock
held forever, since they exit after the trylock succeeds but before the
unlock. This probably needs a goto unlock.
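
As a rough userspace illustration of the goto-unlock shape (this is not
the driver code): `toy_dev`, `toy_link_status_change`, `toy_lock_is_free`
and `toy_demo` are made-up names, a pthread mutex stands in for the
kernel mutex, and the early exit on the link-down side is an assumption
because that part of the function is not in the quoted hunk.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's per-device state. */
struct toy_dev {
	pthread_mutex_t change_res_mutex;
	bool carrier_ok;
};

/*
 * Once the trylock succeeds, every exit goes through the unlock label.
 * Note that pthread_mutex_trylock() returns 0 on success, whereas the
 * kernel's mutex_trylock() returns nonzero on success.
 */
static void toy_link_status_change(struct toy_dev *dev, bool link_up)
{
	if (pthread_mutex_trylock(&dev->change_res_mutex) != 0)
		return;		/* lock not taken, nothing to release */

	if (link_up) {
		if (dev->carrier_ok)
			goto unlock;	/* a bare return here would leak the lock */
		dev->carrier_ok = true;
	} else {
		if (!dev->carrier_ok)
			goto unlock;	/* assumed early exit, see note above */
		dev->carrier_ok = false;
	}

unlock:
	pthread_mutex_unlock(&dev->change_res_mutex);
}

/* Returns true if the mutex can be taken, i.e. nobody left it held. */
static bool toy_lock_is_free(struct toy_dev *dev)
{
	if (pthread_mutex_trylock(&dev->change_res_mutex) != 0)
		return false;
	pthread_mutex_unlock(&dev->change_res_mutex);
	return true;
}

/* Exercises both early-exit paths and checks the lock afterwards. */
static int toy_demo(void)
{
	struct toy_dev dev = { .carrier_ok = true };

	pthread_mutex_init(&dev.change_res_mutex, NULL);

	toy_link_status_change(&dev, true);	/* carrier already up */
	assert(toy_lock_is_free(&dev));

	toy_link_status_change(&dev, false);	/* normal transition */
	assert(!dev.carrier_ok);

	toy_link_status_change(&dev, false);	/* carrier already down */
	assert(toy_lock_is_free(&dev));

	pthread_mutex_destroy(&dev.change_res_mutex);
	return 0;
}
```

With bare returns in place of the two gotos, the first and third calls
in `toy_demo` would leave the mutex held and every later trylock on it
would fail.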

> @@ -330,6 +334,8 @@ static void hinic3_link_status_change(struct net_device *netdev,
>  		netif_carrier_off(netdev);
>  		netdev_dbg(netdev, "Link is down\n");
>  	}
> +
> +	mutex_unlock(&nic_dev->change_res_mutex);
>  }
>  
>  static void hinic3_port_module_event_handler(struct net_device *netdev,
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c b/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c
> index da73811641a9..047214cfc753 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c
> @@ -288,7 +288,8 @@ static void hinic3_free_channel_resources(struct net_device *netdev,
>  	hinic3_free_qps(nic_dev, qp_params);
>  }
>  
> -static int hinic3_open_channel(struct net_device *netdev)
> +static int hinic3_prepare_channel(struct net_device *netdev,
> +				  struct hinic3_dyna_txrxq_params *qp_params)
>  {
>  	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
>  	int err;
> @@ -299,16 +300,28 @@ static int hinic3_open_channel(struct net_device *netdev)
>  		return err;
>  	}
>  
> -	err = hinic3_configure_txrxqs(netdev, &nic_dev->q_params);
> +	err = hinic3_configure_txrxqs(netdev, qp_params);
>  	if (err) {
>  		netdev_err(netdev, "Failed to configure txrxqs\n");
>  		goto err_free_qp_ctxts;
>  	}
>  
> +	return 0;
> +
> +err_free_qp_ctxts:
> +	hinic3_free_qp_ctxts(nic_dev);
> +
> +	return err;
> +}
> +
> +static int hinic3_open_channel(struct net_device *netdev)
> +{
> +	int err;
> +
>  	err = hinic3_qps_irq_init(netdev);
>  	if (err) {
>  		netdev_err(netdev, "Failed to init txrxq irq\n");
> -		goto err_free_qp_ctxts;
> +		return err;
>  	}
>  
>  	err = hinic3_configure(netdev);
> @@ -321,8 +334,6 @@ static int hinic3_open_channel(struct net_device *netdev)
>  
>  err_uninit_qps_irq:
>  	hinic3_qps_irq_uninit(netdev);
> -err_free_qp_ctxts:
> -	hinic3_free_qp_ctxts(nic_dev);
>  
>  	return err;
>  }
> @@ -428,6 +439,74 @@ static void hinic3_vport_down(struct net_device *netdev)
>  	}
>  }
>  
> +int
> +hinic3_change_channel_settings(struct net_device *netdev,
> +			       struct hinic3_dyna_txrxq_params *trxq_params)
> +{
> +	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> +	struct hinic3_dyna_txrxq_params cur_trxq_params = {};
> +	struct hinic3_dyna_qp_params new_qp_params = {};
> +	struct hinic3_dyna_qp_params cur_qp_params = {};
> +	int err;
> +
> +	cur_trxq_params = nic_dev->q_params;
> +
> +	hinic3_config_num_qps(netdev, trxq_params);
> +
> +	err = hinic3_alloc_channel_resources(netdev, &new_qp_params,
> +					     trxq_params);
> +	if (err) {
> +		netdev_err(netdev, "Failed to alloc channel resources\n");
> +		return err;
> +	}
> +
> +	mutex_lock(&nic_dev->change_res_mutex);
> +	hinic3_vport_down(netdev);
> +	hinic3_close_channel(netdev);
> +	hinic3_get_cur_qps(nic_dev, &cur_qp_params);
> +
> +	hinic3_init_qps(nic_dev, &new_qp_params);
> +
> +	err = hinic3_prepare_channel(netdev, trxq_params);
> +	if (err)
> +		goto err_uninit_qps;
> +
> +	if (nic_dev->num_qp_irq > trxq_params->num_qps)
> +		hinic3_qp_irq_change(netdev, trxq_params->num_qps);
> +
> +	nic_dev->q_params = *trxq_params;
> +
> +	err = hinic3_open_channel(netdev);
> +	if (err)
> +		goto err_qp_irq_reset;
> +
> +	err = hinic3_vport_up(netdev);
> +	if (err)
> +		goto err_close_channel;
> +
> +	hinic3_free_channel_resources(netdev, &cur_qp_params, &cur_trxq_params);
> +
> +	mutex_unlock(&nic_dev->change_res_mutex);
> +
> +	return 0;
> +
> +err_close_channel:
> +	hinic3_close_channel(netdev);
> +err_qp_irq_reset:
> +	nic_dev->q_params = cur_trxq_params;
> +
> +	if (trxq_params->num_qps > cur_trxq_params.num_qps)
> +		hinic3_qp_irq_change(netdev, cur_trxq_params.num_qps);
> +	hinic3_free_qp_ctxts(nic_dev);
> +err_uninit_qps:
> +	hinic3_get_cur_qps(nic_dev, &new_qp_params);
> +	hinic3_free_channel_resources(netdev, &new_qp_params, trxq_params);
> +	hinic3_free_channel_resources(netdev, &cur_qp_params, &cur_trxq_params);
> +	mutex_unlock(&nic_dev->change_res_mutex);
> +
> +	return err;
> +}
> +
>  static int hinic3_open(struct net_device *netdev)
>  {
>  	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> @@ -458,6 +537,10 @@ static int hinic3_open(struct net_device *netdev)
>  
>  	hinic3_init_qps(nic_dev, &qp_params);
>  
> +	err = hinic3_prepare_channel(netdev, &nic_dev->q_params);
> +	if (err)
> +		goto err_uninit_qps;
> +
>  	err = hinic3_open_channel(netdev);
>  	if (err)
>  		goto err_uninit_qps;
> @@ -473,7 +556,7 @@ static int hinic3_open(struct net_device *netdev)
>  err_close_channel:
>  	hinic3_close_channel(netdev);
>  err_uninit_qps:
> -	hinic3_uninit_qps(nic_dev, &qp_params);
> +	hinic3_get_cur_qps(nic_dev, &qp_params);
>  	hinic3_free_channel_resources(netdev, &qp_params, &nic_dev->q_params);
>  err_destroy_num_qps:
>  	hinic3_destroy_num_qps(netdev);
> @@ -493,10 +576,15 @@ static int hinic3_close(struct net_device *netdev)
>  		return 0;
>  	}
>  
> +	mutex_lock(&nic_dev->change_res_mutex);
>  	hinic3_vport_down(netdev);
>  	hinic3_close_channel(netdev);
> -	hinic3_uninit_qps(nic_dev, &qp_params);
> -	hinic3_free_channel_resources(netdev, &qp_params, &nic_dev->q_params);
> +	hinic3_get_cur_qps(nic_dev, &qp_params);
> +	hinic3_free_channel_resources(netdev, &qp_params,
> +				      &nic_dev->q_params);
> +	hinic3_free_nicio_res(nic_dev);
> +	hinic3_destroy_num_qps(netdev);
> +	mutex_unlock(&nic_dev->change_res_mutex);
>  
>  	return 0;
>  }
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h
> index 9502293ff710..005b2c01a988 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h
> @@ -10,6 +10,9 @@
>  #include "hinic3_hw_cfg.h"
>  #include "hinic3_hwdev.h"
>  #include "hinic3_mgmt_interface.h"
> +#include "hinic3_nic_io.h"
> +#include "hinic3_tx.h"
> +#include "hinic3_rx.h"
>  
>  #define HINIC3_VLAN_BITMAP_BYTE_SIZE(nic_dev)  (sizeof(*(nic_dev)->vlan_bitmap))
>  #define HINIC3_VLAN_BITMAP_SIZE(nic_dev)  \
> @@ -129,6 +132,8 @@ struct hinic3_nic_dev {
>  	struct work_struct              rx_mode_work;
>  	/* lock for enable/disable port */
>  	struct mutex                    port_state_mutex;
> +	/* mutex to serialize channel/resource changes */
> +	struct mutex                    change_res_mutex;
>  
>  	struct list_head                uc_filter_list;
>  	struct list_head                mc_filter_list;
> @@ -143,6 +148,10 @@ struct hinic3_nic_dev {
>  
>  void hinic3_set_netdev_ops(struct net_device *netdev);
>  int hinic3_set_hw_features(struct net_device *netdev);
> +int
> +hinic3_change_channel_settings(struct net_device *netdev,
> +			       struct hinic3_dyna_txrxq_params *trxq_params);
> +
>  int hinic3_qps_irq_init(struct net_device *netdev);
>  void hinic3_qps_irq_uninit(struct net_device *netdev);
>  
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.c b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.c
> index 87e736adba02..0e7a0ccfba98 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.c
> @@ -484,8 +484,8 @@ void hinic3_init_qps(struct hinic3_nic_dev *nic_dev,
>  	}
>  }
>  
> -void hinic3_uninit_qps(struct hinic3_nic_dev *nic_dev,
> -		       struct hinic3_dyna_qp_params *qp_params)
> +void hinic3_get_cur_qps(struct hinic3_nic_dev *nic_dev,
> +			struct hinic3_dyna_qp_params *qp_params)
>  {
>  	struct hinic3_nic_io *nic_io = nic_dev->nic_io;
>  
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h
> index 12eefabcf1db..571b34d63950 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h
> @@ -14,6 +14,10 @@ struct hinic3_nic_dev;
>  #define HINIC3_RQ_WQEBB_SHIFT      3
>  #define HINIC3_SQ_WQEBB_SIZE       BIT(HINIC3_SQ_WQEBB_SHIFT)
>  
> +#define HINIC3_MAX_TX_QUEUE_DEPTH  65536
> +#define HINIC3_MAX_RX_QUEUE_DEPTH  16384
> +#define HINIC3_MIN_QUEUE_DEPTH     128
> +
>  /* ******************** RQ_CTRL ******************** */
>  enum hinic3_rq_wqe_type {
>  	HINIC3_NORMAL_RQ_WQE = 1,
> @@ -136,8 +140,8 @@ void hinic3_free_qps(struct hinic3_nic_dev *nic_dev,
>  		     struct hinic3_dyna_qp_params *qp_params);
>  void hinic3_init_qps(struct hinic3_nic_dev *nic_dev,
>  		     struct hinic3_dyna_qp_params *qp_params);
> -void hinic3_uninit_qps(struct hinic3_nic_dev *nic_dev,
> -		       struct hinic3_dyna_qp_params *qp_params);
> +void hinic3_get_cur_qps(struct hinic3_nic_dev *nic_dev,
> +			struct hinic3_dyna_qp_params *qp_params);
>  
>  int hinic3_init_qp_ctxts(struct hinic3_nic_dev *nic_dev);
>  void hinic3_free_qp_ctxts(struct hinic3_nic_dev *nic_dev);
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_rx.c b/drivers/net/ethernet/huawei/hinic3/hinic3_rx.c
> index 309ab5901379..b5b601469517 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_rx.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_rx.c
> @@ -541,7 +541,7 @@ int hinic3_configure_rxqs(struct net_device *netdev, u16 num_rq,
>  		rq_associate_cqes(rxq);
>  
>  		pkts = hinic3_rx_fill_buffers(rxq);
> -		if (!pkts) {
> +		if (pkts < rxq->q_depth - 1) {

nit: just use rxq->q_mask?

>  			netdev_err(netdev, "Failed to fill Rx buffer\n");
>  			return -ENOMEM;
>  		}
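
To spell out the nit, here is a small standalone sketch. It assumes that the
queue depth is a power of two and that q_mask is kept equal to q_depth - 1;
neither is shown in the quoted hunk. `struct fake_rxq` and both helpers are
hypothetical, not the driver's definitions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring bookkeeping; assumes q_mask == q_depth - 1. */
struct fake_rxq {
	uint32_t q_depth;	/* number of slots, power of two */
	uint32_t q_mask;	/* q_depth - 1, used to wrap indices */
};

/* Reject depths that are zero or not a power of two. */
static bool fake_rxq_init(struct fake_rxq *rxq, uint32_t depth)
{
	if (depth == 0 || (depth & (depth - 1)) != 0)
		return false;

	rxq->q_depth = depth;
	rxq->q_mask = depth - 1;
	return true;
}

/* True when fewer than q_depth - 1 buffers were posted. */
static bool fake_rx_fill_incomplete(const struct fake_rxq *rxq, uint32_t pkts)
{
	return pkts < rxq->q_mask;
}
```

Under that assumption, comparing against q_mask is the same check as
comparing against q_depth - 1, without repeating the subtraction.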


^ permalink raw reply

* Re: [PATCH v5 00/19] perf cs-etm: Queue context packets for frontend
From: Arnaldo Carvalho de Melo @ 2026-06-10 20:14 UTC (permalink / raw)
  To: James Clark
  Cc: Suzuki K Poulose, Mike Leach, Leo Yan, Namhyung Kim, Jiri Olsa,
	Ian Rogers, Amir Ayupov, Jonathan Corbet, Shuah Khan,
	Paschalis Mpeis, coresight, linux-perf-users, linux-kernel,
	Arnaldo Carvalho de Melo, linux-doc
In-Reply-To: <20260609-james-cs-context-tracking-fix-v5-0-d53a7d096a19@linaro.org>

On Tue, Jun 09, 2026 at 03:40:05PM +0100, James Clark wrote:
> Fix thread tracking when decoding Coresight trace and add a new test for
> it.

The issues found by sashiko seem mild and you can address them in follow
up patches, I think.

So that perf-tools-next is available for linux-next testing, and since the
window is closing soon, I've merged this, ok?

- Arnaldo
 
> The new test is added as a Perf test workload instead of a custom binary
> with its own build system, but this requires a new feature in Perf test
> to pass in control pipes which can enable and disable events. This
> scopes the recording to just the workload and helps to reduce the amount
> of data recorded in tracing tests.
> 
> With this new feature we can re-write all of the Coresight tests to make
> use of it and remove the remaining binaries which fixes the following
> issues:
> 
>  * They didn't work in out of source builds
>  * A lot of the tests unnecessarily required root and didn't skip
>    without it
>  * They were mainly qualitative tests which didn't look for specific
>    behavior
> 
> Most importantly, the long build and runtime has been reduced. On a
> Radxa Orion O6, unroll_loop_thread.c took 37s to compile which is longer
> than the entire Perf build. Now the build time is negligible and the
> before and after test runtimes for all the Coresight tests are:
> 
>           |   N1SDP   |   Orion O6
>   -----------------------------------
>   Before  |   4m  0s  |    14m 49s
>   After   |      26s  |        56s
>   -----------------------------------
> 
> Signed-off-by: James Clark <james.clark@linaro.org>
> ---
> Changes in v5:
> - Forgot to include this change:
>   - Test for actual length of expected raw dump (Leo)
> - Link to v4: https://lore.kernel.org/r/20260609-james-cs-context-tracking-fix-v4-0-44f9fb9e5c42@linaro.org
> 
> Changes in v4:
> - Rename workload-ctl to record-ctl and improve docs (Leo)
> - Use new packet argument everywhere in
>   cs_etm__synth_instruction_sample() (Sashiko)
> - Test for actual length of expected raw dump (Leo)
> - Use -fno-inline instead of keyword (Leo)
> - Don't test any brace or call lines in deterministic test
> - Make sure context switch loop test does cleanup on failure (Sashiko)
> - Remove undef int overflows in workloads (Sashiko)
> - Link to v3: https://lore.kernel.org/r/20260603-james-cs-context-tracking-fix-v3-0-c392945d9ed5@linaro.org
> 
> Changes in v3:
> - Minor sashiko comments
>   - Close some more pipes
>   - Fix warning messages
>   - Error handling improvements
> - Pass packet into cs_etm__synth_instruction_sample()
> - Fixup stale comment (Leo)
> - Link to v2: https://lore.kernel.org/r/20260602-james-cs-context-tracking-fix-v2-0-85b5ce6f55c6@linaro.org
> 
> Changes in v2:
> - Add --workload-ctl option to Perf test
> - Re-write all the Coresight tests and speed them up
> - Pass packet to memory access function so frontend can use either the
>   previous or current packet's EL
> - Link to v1: https://lore.kernel.org/r/20260526-james-cs-context-tracking-fix-v1-0-ebd602e18287@linaro.org
> 
> ---
> James Clark (19):
>       perf cs-etm: Queue context packets for frontend
>       perf test: Add workload-ctl option
>       perf test: Add a workload that forces context switches
>       perf test cs-etm: Test process attribution
>       perf test: Add deterministic workload
>       perf test cs-etm: Replace unroll loop thread with deterministic decode test
>       perf test cs-etm: Remove asm_pure_loop test
>       perf test cs-etm: Replace memcpy test with raw dump stress test
>       perf test: Add named_threads workload
>       perf test cs-etm: Test decoding for concurrent threads test
>       perf test cs-etm: Remove duplicate branch tests
>       perf test cs-etm: Skip if not root
>       perf test cs-etm: Reduce snapshot size
>       perf test cs-etm: Speed up basic test
>       perf test cs-etm: Remove unused Coresight workloads
>       perf test cs-etm: Make disassembly test use kcore
>       perf test cs-etm: Add all branch instructions to test
>       perf test cs-etm: Speed up disassembly test
>       perf test cs-etm: Move existing tests to coresight folder
> 
>  Documentation/trace/coresight/coresight-perf.rst   |  78 +------
>  MAINTAINERS                                        |   2 -
>  tools/perf/Documentation/perf-test.txt             |  24 ++-
>  tools/perf/Makefile.perf                           |  14 +-
>  tools/perf/scripts/python/arm-cs-trace-disasm.py   |  20 +-
>  tools/perf/tests/builtin-test.c                    | 187 +++++++++++++++-
>  tools/perf/tests/shell/coresight/Makefile          |  29 ---
>  .../perf/tests/shell/coresight/Makefile.miniconfig |  14 --
>  tools/perf/tests/shell/coresight/asm_pure_loop.sh  |  22 --
>  .../tests/shell/coresight/asm_pure_loop/.gitignore |   1 -
>  .../tests/shell/coresight/asm_pure_loop/Makefile   |  34 ---
>  .../shell/coresight/asm_pure_loop/asm_pure_loop.S  |  30 ---
>  .../tests/shell/coresight/concurrent_threads.sh    |  45 ++++
>  .../tests/shell/coresight/context_switch_thread.sh |  69 ++++++
>  tools/perf/tests/shell/coresight/deterministic.sh  |  72 +++++++
>  .../tests/shell/coresight/memcpy_thread/.gitignore |   1 -
>  .../tests/shell/coresight/memcpy_thread/Makefile   |  33 ---
>  .../shell/coresight/memcpy_thread/memcpy_thread.c  |  80 -------
>  .../tests/shell/coresight/memcpy_thread_16k_10.sh  |  22 --
>  .../perf/tests/shell/coresight/raw_dump_stress.sh  |  65 ++++++
>  .../shell/{ => coresight}/test_arm_coresight.sh    |  43 ++--
>  .../{ => coresight}/test_arm_coresight_disasm.sh   |  23 +-
>  .../tests/shell/coresight/thread_loop/.gitignore   |   1 -
>  .../tests/shell/coresight/thread_loop/Makefile     |  33 ---
>  .../shell/coresight/thread_loop/thread_loop.c      |  85 --------
>  .../shell/coresight/thread_loop_check_tid_10.sh    |  23 --
>  .../shell/coresight/thread_loop_check_tid_2.sh     |  23 --
>  .../shell/coresight/unroll_loop_thread/.gitignore  |   1 -
>  .../shell/coresight/unroll_loop_thread/Makefile    |  33 ---
>  .../unroll_loop_thread/unroll_loop_thread.c        |  75 -------
>  .../tests/shell/coresight/unroll_loop_thread_10.sh |  22 --
>  tools/perf/tests/shell/lib/coresight.sh            | 134 ------------
>  tools/perf/tests/tests.h                           |   3 +
>  tools/perf/tests/workloads/Build                   |   4 +
>  tools/perf/tests/workloads/context_switch_loop.c   | 110 ++++++++++
>  tools/perf/tests/workloads/deterministic.c         |  39 ++++
>  tools/perf/tests/workloads/named_threads.c         | 109 ++++++++++
>  tools/perf/util/cs-etm-decoder/cs-etm-decoder.c    |  21 +-
>  tools/perf/util/cs-etm.c                           | 236 ++++++++++++---------
>  tools/perf/util/cs-etm.h                           |   8 +-
>  40 files changed, 926 insertions(+), 942 deletions(-)
> ---
> base-commit: 351a37f2fda4db668cff8ba12f2992d73dccdaea
> change-id: 20260515-james-cs-context-tracking-fix-754998bae7ed
> 
> Best regards,
> -- 
> James Clark <james.clark@linaro.org>

^ permalink raw reply

* htmldocs: Warning: drivers/tty/serial/serial_cortina-access.c references a file that doesn't exist: Documentation/serial/driver
From: kernel test robot @ 2026-06-10 19:50 UTC (permalink / raw)
  To: Jason Li; +Cc: oe-kbuild-all, 0day robot, linux-doc

tree:   https://github.com/intel-lab-lkp/linux/commits/Jason-Li/dt-bindings-serial-Add-binding-for-Cortina-Access-UART/20260610-193842
head:   e97c7dd14b20885c9b9f27daf2c6e0cd9e99d82a
commit: 2b08fdba152665eca1c8194820608a3f284143b6 tty: serial: Add UART driver for Cortina-Access platform
date:   8 hours ago
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project f43d6834093b19baf79beda8c0337ab020ac5f17)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260610/202606102102.JsRIO7Np-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606102102.JsRIO7Np-lkp@intel.com/

All warnings (new ones prefixed by >>):

   Warning: Documentation/translations/zh_CN/scsi/scsi_mid_low_api.rst references a file that doesn't exist: Documentation/Configure.help
   Warning: MAINTAINERS references a file that doesn't exist: Documentation/ABI/testing/sysfs-platform-ayaneo
   Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/display/bridge/megachips-stdpxxxx-ge-b850v3-fw.txt
   Warning: arch/powerpc/sysdev/mpic.c references a file that doesn't exist: Documentation/devicetree/bindings/powerpc/fsl/mpic.txt
   Warning: drivers/net/ethernet/smsc/Kconfig references a file that doesn't exist: file:Documentation/networking/device_drivers/ethernet/smsc/smc9.rst
>> Warning: drivers/tty/serial/serial_cortina-access.c references a file that doesn't exist: Documentation/serial/driver
   Warning: rust/kernel/sync/atomic/ordering.rs references a file that doesn't exist: srctree/tools/memory-model/Documentation/explanation.txt
   Warning: tools/docs/documentation-file-ref-check references a file that doesn't exist: Documentation/virtual/lguest/lguest.c
   Warning: tools/docs/documentation-file-ref-check references a file that doesn't exist: m,\b(\S*)(Documentation/[A-Za-z0-9
   Warning: tools/docs/documentation-file-ref-check references a file that doesn't exist: Documentation/devicetree/dt-object-internal.txt
   Warning: tools/docs/documentation-file-ref-check references a file that doesn't exist: m,^Documentation/scheduler/sched-pelt

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH v11 2/2] hwmon: temperature: add support for EMC1812
From: Guenter Roeck @ 2026-06-10 19:50 UTC (permalink / raw)
  To: Marius Cristea
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	linux-hwmon, devicetree, linux-kernel, linux-doc
In-Reply-To: <20260610-hw_mon-emc1812-v11-2-cef809af5c19@microchip.com>

On Wed, Jun 10, 2026 at 06:19:47PM +0300, Marius Cristea wrote:
> This is the hwmon driver for Microchip EMC1812/13/14/15/33
> Multichannel Low-Voltage Remote Diode Sensor Family.
> 
> EMC1812 has one external remote temperature monitoring channel.
> EMC1813 has two external remote temperature monitoring channels.
> EMC1814 has three external remote temperature monitoring channels,
> channels 2 and 3 support anti parallel diode.
> EMC1815 has four external remote temperature monitoring channels and
> channels 1/2  and 3/4 support anti parallel diode.
> EMC1833 has two external remote temperature monitoring channels and
> channels 1 and 2 support anti parallel diode.
> Resistance Error Correction is supported on channels 1/2 and 3/4.
> 
> Signed-off-by: Marius Cristea <marius.cristea@microchip.com>

Applied.

Thanks,
Guenter

^ permalink raw reply

* Re: [PATCH v11 1/2] dt-bindings: hwmon: temperature: add support for EMC1812
From: Guenter Roeck @ 2026-06-10 19:46 UTC (permalink / raw)
  To: Marius Cristea
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	linux-hwmon, devicetree, linux-kernel, linux-doc
In-Reply-To: <20260610-hw_mon-emc1812-v11-1-cef809af5c19@microchip.com>

On Wed, Jun 10, 2026 at 06:19:46PM +0300, Marius Cristea wrote:
> This is the devicetree schema for Microchip EMC1812/13/14/15/33
> Multichannel Low-Voltage Remote Diode Sensor Family. It also
> updates the MAINTAINERS file to include the new driver.
> 
> EMC1812 has one external remote temperature monitoring channel.
> EMC1813 has two external remote temperature monitoring channels.
> EMC1814 has three external remote temperature monitoring channels and
> channels 2 and 3 support anti parallel diode.
> EMC1815 has four external remote temperature monitoring channels and
> channels 1/2  and 3/4 support anti parallel diode.
> EMC1833 has two external remote temperature monitoring channels and
> channels 1 and 2 support anti parallel diode.
> Resistance Error Correction is supported on channels 1/2 and 3/4.
> 
> Signed-off-by: Marius Cristea <marius.cristea@microchip.com>
> Reviewed-by: Rob Herring (Arm) <robh@kernel.org>

Applied.

Thanks,
Guenter

^ permalink raw reply

* Re: [PATCH v4 6/6] kselftest: alloc_tag: extend the allocinfo ioctl kselftest
From: kernel test robot @ 2026-06-10 19:42 UTC (permalink / raw)
  To: Abhishek Bapat, Suren Baghdasaryan, Andrew Morton,
	Kent Overstreet, Hao Ge
  Cc: oe-kbuild-all, Linux Memory Management List, Shuah Khan,
	Jonathan Corbet, linux-doc, linux-kernel, Sourav Panda,
	Abhishek Bapat
In-Reply-To: <d0a8308b4d0799876d24461a8ed9b5a71d3e1e89.1781042698.git.abhishekbapat@google.com>

Hi Suren,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20260609]
[cannot apply to akpm-mm/mm-nonmm-unstable shuah-kselftest/next shuah-kselftest/fixes linus/master v7.1-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Abhishek-Bapat/alloc_tag-add-ioctl-to-proc-allocinfo/20260610-081508
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/d0a8308b4d0799876d24461a8ed9b5a71d3e1e89.1781042698.git.abhishekbapat%40google.com
patch subject: [PATCH v4 6/6] kselftest: alloc_tag: extend the allocinfo ioctl kselftest
config: sparc64-randconfig-r061-20260610 (https://download.01.org/0day-ci/archive/20260611/202606110300.R4LPBVBO-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260611/202606110300.R4LPBVBO-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606110300.R4LPBVBO-lkp@intel.com/

All errors (new ones prefixed by >>):

   lib/alloc_tag.c: In function 'allocinfo_compat_ioctl':
>> lib/alloc_tag.c:346:58: error: implicit declaration of function 'compat_ptr' [-Wimplicit-function-declaration]
     346 |         return allocinfo_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
         |                                                          ^~~~~~~~~~


vim +/compat_ptr +346 lib/alloc_tag.c

   341	
   342	#ifdef CONFIG_COMPAT
   343	static long allocinfo_compat_ioctl(struct file *file, unsigned int cmd,
   344					   unsigned long arg)
   345	{
 > 346		return allocinfo_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
   347	}
   348	#endif
   349	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH net-next v3 0/3] Add standard stats for HSR/PRP
From: Simon Horman @ 2026-06-10 18:47 UTC (permalink / raw)
  To: MD Danish Anwar
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jonathan Corbet, Shuah Khan, Roger Quadros, Andrew Lunn,
	Jacob Keller, Meghana Malladi, David Carlier, Vadim Fedorenko,
	Kevin Hao, Himanshu Mittal, Hangbin Liu, Markus Elfring,
	Fernando Fernandez Mancera, Jan Vaclav, netdev, linux-doc,
	linux-kernel, linux-arm-kernel, Felix Maurer, Luka Gejak
In-Reply-To: <20260608100930.210149-1-danishanwar@ti.com>

On Mon, Jun 08, 2026 at 03:39:27PM +0530, MD Danish Anwar wrote:
> Add standard stats for HSR / PRP. This series was initially adding HSR/PRP
> related stats for ICSSG driver. Based on maintainers' comments on v2 I am
> now adding support to dump standard stats for HSR/PRP.
> 
> The drivers which support offload can populate these standard stats.
> 
> This series only implements offloaded stats. For software-only interfaces
> Felix Maurer had said he will do it later [1]
> 
> v2 https://lore.kernel.org/all/20260514075605.850674-1-danishanwar@ti.com/
> [1] https://lore.kernel.org/all/ag87pBZfOyccPZTc@thinkpad/
> 
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Felix Maurer <fmaurer@redhat.com>
> Cc: Luka Gejak <luka.gejak@linux.dev>

Hi MD,

There is AI-generated review of this patch-set available on both
https://sashiko.dev and https://netdev-ai.bots.linux.dev/sashiko/
I would appreciate it if you could look over that with a view
to addressing any issues that directly affect this patch-set.

^ permalink raw reply

* Re: [PATCH v5 10/21] nfsd: add notification handlers for dir events
From: Jeff Layton @ 2026-06-10 18:38 UTC (permalink / raw)
  To: Chuck Lever, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Trond Myklebust, Anna Schumaker, Jonathan Corbet,
	Shuah Khan
  Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
	Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
	linux-doc, linux-nfs
In-Reply-To: <efdade0b-38f2-4e5e-b6dc-567d9eea97a9@app.fastmail.com>

On Mon, 2026-06-08 at 16:52 -0400, Chuck Lever wrote:
> 
> On Fri, May 22, 2026, at 3:42 PM, Jeff Layton wrote:
> > Add the necessary parts to accept a fsnotify callback for directory
> > change event and create a CB_NOTIFY request for it. When a dir nfsd_file
> > is created set a handle_event callback to handle the notification.
> 
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index e17488a911f7..31df04675713 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -4172,6 +4172,127 @@ nfsd4_encode_fattr4(struct svc_rqst *rqstp, 
> > struct xdr_stream *xdr,
> >  	goto out;
> >  }
> > 
> > +static bool
> > +nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream 
> > *xdr,
> > +			  struct dentry *dentry, struct nfs4_delegation *dp,
> > +			  struct nfsd_file *nf, char *name, u32 namelen)
> > +{
> > +	uint32_t *attrmask;
> > +
> > +	/* Reserve space for attrmask */
> > +	attrmask = xdr_reserve_space(xdr, 3 * sizeof(uint32_t));
> > +	if (!attrmask)
> > +		return false;
> > +
> > +	ne->ne_file.data = name;
> > +	ne->ne_file.len = namelen;
> > +	ne->ne_attrs.attrmask.element = attrmask;
> > +
> > +	attrmask[0] = 0;
> > +	attrmask[1] = 0;
> > +	attrmask[2] = 0;
> > +	ne->ne_attrs.attr_vals.data = NULL;
> > +	ne->ne_attrs.attr_vals.len = 0;
> > +	ne->ne_attrs.attrmask.count = 1;
> > +	return true;
> > +}
> > +
> > +/**
> > + * nfsd4_encode_notify_event - encode a notify
> > + * @xdr: stream to which to encode the fattr4
> > + * @nne: nfsd_notify_event to encode
> > + * @dp: delegation where the event occurred
> > + * @nf: nfsd_file on which event occurred
> > + * @notify_mask: pointer to word where notification mask should be set
> > + *
> > + * Encode @nne into @xdr. Returns a pointer to the start of the event, 
> > or NULL if
> > + * the event couldn't be encoded. The appropriate bit in the 
> > notify_mask will also
> > + * be set on success.
> > + */
> > +u8 *nfsd4_encode_notify_event(struct xdr_stream *xdr, struct 
> > nfsd_notify_event *nne,
> > +			      struct nfs4_delegation *dp, struct nfsd_file *nf,
> > +			      u32 *notify_mask)
> > +{
> > +	u8 *p = NULL;
> > +
> > +	*notify_mask = 0;
> > +
> > +	if (nne->ne_mask & FS_DELETE) {
> > +		struct notify_remove4 nr = { };
> > +
> > +		if (!nfsd4_setup_notify_entry4(&nr.nrm_old_entry, xdr, 
> > nne->ne_dentry, dp,
> > +					       nf, nne->ne_name, nne->ne_namelen))
> > +			goto out_err;
> > +		p = (u8 *)xdr->p;
> > +		if (!xdrgen_encode_notify_remove4(xdr, &nr))
> > +			goto out_err;
> > +		*notify_mask |= BIT(NOTIFY4_REMOVE_ENTRY);
> > +	} else if (nne->ne_mask & FS_CREATE) {
> > +		struct notify_add4 na = { };
> > +		struct notify_remove4 old = { };
> > +
> > +		if (!nfsd4_setup_notify_entry4(&na.nad_new_entry, xdr, 
> > nne->ne_dentry, dp,
> > +					       nf, nne->ne_name, nne->ne_namelen))
> > +			goto out_err;
> > +
> > +		/* If a file was overwritten, report it in nad_old_entry */
> > +		if (nne->ne_target) {
> > +			if (!nfsd4_setup_notify_entry4(&old.nrm_old_entry, xdr,
> > +						       NULL, dp, nf,
> > +						       nne->ne_name, nne->ne_namelen))
> > +				goto out_err;
> > +			na.nad_old_entry.count = 1;
> > +			na.nad_old_entry.element = &old;
> > +		}
> > +
> > +		p = (u8 *)xdr->p;
> > +		if (!xdrgen_encode_notify_add4(xdr, &na))
> > +			goto out_err;
> > +
> > +		*notify_mask |= BIT(NOTIFY4_ADD_ENTRY);
> > +	} else if (nne->ne_mask & FS_RENAME) {
> > +		struct notify_rename4 nr = { };
> > +		struct notify_remove4 old = { };
> > +		struct name_snapshot n;
> > +		bool ret;
> > +
> > +		/* Don't send any attributes in the old_entry since they're the same 
> > in new */
> > +		if (!nfsd4_setup_notify_entry4(&nr.nrn_old_entry.nrm_old_entry, xdr,
> > +					       NULL, dp, nf, nne->ne_name,
> > +					       nne->ne_namelen))
> > +			goto out_err;
> > +
> > +		take_dentry_name_snapshot(&n, nne->ne_dentry);
> > +		ret = nfsd4_setup_notify_entry4(&nr.nrn_new_entry.nad_new_entry, xdr,
> > +					       nne->ne_dentry, dp, nf, (char *)n.name.name,
> > +					       n.name.len);
> 
> Now once I got all of the previous edits in place, all three LLM
> reviewers identified an issue here that might require a significant
> rewrite. This is why I stopped the minor editing here and decided
> it was time for you to consider restructuring (or not). I haven't
> looked at patches 11-21.
> 
>   I think the new name here has a time-of-use problem.
>   
>   nrn_old_entry uses nne->ne_name, which alloc_nfsd_notify_event() copied
>   when fsnotify delivered the rename.  nrn_new_entry instead reads the
>   live dentry via take_dentry_name_snapshot() at callback-prepare time,
>   which can run long after the event was queued.
> 
>   CB_NOTIFY is asynchronous: nfsd_handle_dir_event() queues the event on
>   ncn_evt[] and nothing holds ne_dentry stable until the work runs.
>   d_move() reuses the same dentry and rewrites d_name in place, so a
>   second rename of the entry before the queued callback encodes leaves
>   the dget'd ne_dentry carrying the later name.  An A->B event then
>   encodes as A->C, and a client holding the directory delegation applies
>   the wrong old->new mapping to its cache.  The old name is immune
>   because it was snapshotted up front; only the new name is read late.
> 
>   The new name is available at notification time -- fsnotify_move() passes
>   &moved->d_name as new_name, and ne_dentry is that moved dentry -- so
>   alloc_nfsd_notify_event() can snapshot it alongside the old name.
> 
> What I haven't assessed is whether the suggested restructuring is
> now vulnerable to misbehavior during memory exhaustion.
> 

That sounds legit. We probably need to snapshot the name sooner, when
we create the event. I'll spin something up. As far as memory
exhaustion goes: if that happens we'll just recall the delegation.
That's always the remedy when there are problems here.
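
As a rough illustration of "snapshot the name sooner", here is a
userspace sketch in which both the old and the new name are copied when
the event object is created, so a later rename cannot change what gets
encoded. The struct and helper names (ex_notify_event, ex_alloc_event,
ex_old_name, ex_new_name) are made up for this sketch and are not the
nfsd code; the real change would presumably go into
alloc_nfsd_notify_event().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical event: both names are copied when the event is created. */
struct ex_notify_event {
	size_t old_len;
	size_t new_len;
	char names[];	/* old name, NUL, new name, NUL */
};

static struct ex_notify_event *
ex_alloc_event(const char *old_name, size_t old_len,
	       const char *new_name, size_t new_len)
{
	struct ex_notify_event *ev;

	assert(old_name && new_name);

	ev = malloc(sizeof(*ev) + old_len + 1 + new_len + 1);
	if (!ev)
		return NULL;	/* caller would fall back to a recall */

	ev->old_len = old_len;
	ev->new_len = new_len;
	memcpy(ev->names, old_name, old_len);
	ev->names[old_len] = '\0';
	memcpy(ev->names + old_len + 1, new_name, new_len);
	ev->names[old_len + 1 + new_len] = '\0';
	return ev;
}

static const char *ex_old_name(const struct ex_notify_event *ev)
{
	return ev->names;
}

static const char *ex_new_name(const struct ex_notify_event *ev)
{
	return ev->names + ev->old_len + 1;
}
```

With this layout the encoder only ever reads the copies, so an A->B
event still encodes as A->B even if the entry has since become C.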

> 
> > +
> > +		/* If a file was overwritten, report it in nad_old_entry */
> > +		if (ret && nne->ne_target) {
> > +			ret = nfsd4_setup_notify_entry4(&old.nrm_old_entry, xdr,
> > +							NULL, dp, nf,
> > +							(char *)n.name.name, n.name.len);
> > +			if (ret) {
> > +				nr.nrn_new_entry.nad_old_entry.count = 1;
> > +				nr.nrn_new_entry.nad_old_entry.element = &old;
> > +			}
> > +		}
> > +
> > +		if (ret) {
> > +			p = (u8 *)xdr->p;
> > +			ret = xdrgen_encode_notify_rename4(xdr, &nr);
> > +		}
> > +		release_dentry_name_snapshot(&n);
> > +		if (!ret)
> > +			goto out_err;
> > +		*notify_mask |= BIT(NOTIFY4_RENAME_ENTRY);
> > +	}
> > +	return p;
> > +out_err:
> > +	pr_warn("nfsd: unable to marshal notify_rename4 to xdr stream\n");
> > +	return NULL;
> > +}
> > +
> >  static void svcxdr_init_encode_from_buffer(struct xdr_stream *xdr,
> >  				struct xdr_buf *buf, __be32 *p, int bytes)
> >  {
> 

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v5 10/21] nfsd: add notification handlers for dir events
From: Jeff Layton @ 2026-06-10 18:33 UTC (permalink / raw)
  To: Chuck Lever, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Trond Myklebust, Anna Schumaker, Jonathan Corbet,
	Shuah Khan
  Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
	Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
	linux-doc, linux-nfs
In-Reply-To: <344ed039-86ce-4125-8476-2e5d22e40fdc@app.fastmail.com>

On Mon, 2026-06-08 at 16:40 -0400, Chuck Lever wrote:
> 
> On Fri, May 22, 2026, at 3:42 PM, Jeff Layton wrote:
> > Add the necessary parts to accept a fsnotify callback for a directory
> > change event and create a CB_NOTIFY request for it. When a dir nfsd_file
> > is created, set a handle_event callback to handle the notification.
> > 
> > Use that to allocate a nfsd_notify_event object and then hand off a
> > reference to each delegation's CB_NOTIFY. If anything fails along the
> > way, recall any affected delegations.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> There are some significant-looking Sashiko review findings which I did
> not follow up on.
> 

I plan to go over Sashiko's findings after I go through your responses.

> 
> > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
> > index ea3e7deb06fa..1964a213f80e 100644
> > --- a/fs/nfsd/nfs4callback.c
> > +++ b/fs/nfsd/nfs4callback.c
> > @@ -870,21 +870,30 @@ static void nfs4_xdr_enc_cb_notify(struct 
> > rpc_rqst *req,
> >  				   const void *data)
> >  {
> >  	const struct nfsd4_callback *cb = data;
> > +	struct nfsd4_cb_notify *ncn = container_of(cb, struct 
> > nfsd4_cb_notify, ncn_cb);
> > +	struct nfs4_delegation *dp = container_of(ncn, struct 
> > nfs4_delegation, dl_cb_notify);
> >  	struct nfs4_cb_compound_hdr hdr = {
> >  		.ident = 0,
> >  		.minorversion = cb->cb_clp->cl_minorversion,
> >  	};
> > -	struct CB_NOTIFY4args args = { };
> > +	struct CB_NOTIFY4args args;
> > +	__be32 *p;
> > 
> >  	WARN_ON_ONCE(hdr.minorversion == 0);
> > 
> >  	encode_cb_compound4args(xdr, &hdr);
> >  	encode_cb_sequence4args(xdr, cb, &hdr);
> > 
> > -	/*
> > -	 * FIXME: get stateid and fh from delegation. Inline the cna_changes
> > -	 * buffer, and zero it.
> > -	 */
> > +	p = xdr_reserve_space(xdr, 4);
> > +	*p = cpu_to_be32(OP_CB_NOTIFY);
> > +
> > +	args.cna_stateid.seqid = dp->dl_stid.sc_stateid.si_generation;
> > +	memcpy(&args.cna_stateid.other, &dp->dl_stid.sc_stateid.si_opaque,
> > +	       ARRAY_SIZE(args.cna_stateid.other));
> > +	args.cna_fh.len = dp->dl_stid.sc_file->fi_fhandle.fh_size;
> > +	args.cna_fh.data = dp->dl_stid.sc_file->fi_fhandle.fh_raw;
> > +	args.cna_changes.count = ncn->ncn_nf_cnt;
> > +	args.cna_changes.element = ncn->ncn_nf;
> >  	WARN_ON_ONCE(!xdrgen_encode_CB_NOTIFY4args(xdr, &args));
> > 
> >  	hdr.nops++;
> 
> I want to avoid the need to use xdrgen to encode the CB_NOTIFY arguments.
> How about this:
> 
> +       struct nfsd4_cb_notify *ncn = container_of(cb, struct nfsd4_cb_notify, ncn_cb);
> +       struct nfs4_delegation *dp = container_of(ncn, struct nfs4_delegation, dl_cb_notify);
> 
>    ...
> 
> +       encode_stateid4(xdr, &dp->dl_stid.sc_stateid);
> +       encode_nfs_fh4(xdr, &dp->dl_stid.sc_file->fi_fhandle);
> +       xdr_stream_encode_u32(xdr, ncn->ncn_nf_cnt);
> +       for (u32 i = 0; i < ncn->ncn_nf_cnt; i++)
> +               (void)xdrgen_encode_notify4(xdr, &ncn->ncn_nf[i]);
> 
> And then add a "pragma public notify4;" in nfs4_1.x .
> 

For those following along, Chuck and I had a private discussion and I
think we're going to keep this calling xdrgen_encode_CB_NOTIFY4args()
for now. I am dropping the WARN_ON_ONCE though.

> 
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index b0652c755b3b..20477144475b 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> 
> > @@ -3461,19 +3462,131 @@ nfsd4_cb_getattr_release(struct nfsd4_callback *cb)
> >  	nfs4_put_stid(&dp->dl_stid);
> >  }
> > 
> > +static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
> > +{
> > +	bool queued;
> > +
> > +	if (test_and_set_bit(NFSD4_CALLBACK_RUNNING, &dp->dl_recall.cb_flags))
> > +		return;
> > +
> > +	/*
> > +	 * We're assuming the state code never drops its reference
> > +	 * without first removing the lease.  Since we're in this lease
> > +	 * callback (and since the lease code is serialized by the
> > +	 * flc_lock) we know the server hasn't removed the lease yet, and
> > +	 * we know it's safe to take a reference.
> > +	 */
> > +	refcount_inc(&dp->dl_stid.sc_count);
> > +	queued = nfsd4_run_cb(&dp->dl_recall);
> > +	WARN_ON_ONCE(!queued);
> > +	if (!queued)
> > +		refcount_dec(&dp->dl_stid.sc_count);
> > +}
> > +
> > +static bool
> > +nfsd4_cb_notify_prepare(struct nfsd4_callback *cb)
> > +{
> > +	struct nfsd4_cb_notify *ncn = container_of(cb, struct 
> > nfsd4_cb_notify, ncn_cb);
> > +	struct nfs4_delegation *dp = container_of(ncn, struct 
> > nfs4_delegation, dl_cb_notify);
> > +	struct nfsd_notify_event *events[NOTIFY4_EVENT_QUEUE_SIZE];
> > +	struct xdr_buf xdr = { .buflen = PAGE_SIZE * NOTIFY4_PAGE_ARRAY_SIZE,
> > +			       .pages  = ncn->ncn_pages };
> > +	struct xdr_stream stream;
> > +	struct nfsd_file *nf;
> > +	int count, i;
> > +	bool error = false;
> > +
> > +	xdr_init_encode_pages(&stream, &xdr);
> > +
> > +	spin_lock(&ncn->ncn_lock);
> > +	count = ncn->ncn_evt_cnt;
> > +
> > +	/* spurious queueing? */
> > +	if (count == 0) {
> > +		spin_unlock(&ncn->ncn_lock);
> > +		return false;
> > +	}
> > +
> > +	/* we can't keep up! */
> > +	if (count > NOTIFY4_EVENT_QUEUE_SIZE) {
> > +		spin_unlock(&ncn->ncn_lock);
> > +		goto out_recall;
> > +	}
> > +
> > +	memcpy(events, ncn->ncn_evt, sizeof(*events) * count);
> > +	ncn->ncn_evt_cnt = 0;
> > +	spin_unlock(&ncn->ncn_lock);
> > +
> > +	rcu_read_lock();
> > +	nf = 
> > nfsd_file_get(rcu_dereference(dp->dl_stid.sc_file->fi_deleg_file));
> > +	rcu_read_unlock();
> > +	if (!nf) {
> > +		for (i = 0; i < count; ++i)
> > +			nfsd_notify_event_put(events[i]);
> > +		goto out_recall;
> > +	}
> > +
> > +	for (i = 0; i < count; ++i) {
> > +		struct nfsd_notify_event *nne = events[i];
> > +
> > +		if (!error) {
> > +			u32 *maskp = (u32 *)xdr_reserve_space(&stream, sizeof(*maskp));
> > +			u8 *p;
> > +
> > +			if (!maskp) {
> > +				error = true;
> > +				goto put_event;
> > +			}
> > +
> > +			p = nfsd4_encode_notify_event(&stream, nne, dp, nf, maskp);
> > +			if (!p) {
> > +				pr_notice("Could not generate CB_NOTIFY from fsnotify mask 0x%x\n",
> > +					  nne->ne_mask);
> > +				error = true;
> > +				goto put_event;
> > +			}
> > +
> > +			ncn->ncn_nf[i].notify_mask.count = 1;
> > +			ncn->ncn_nf[i].notify_mask.element = maskp;
> > +			ncn->ncn_nf[i].notify_vals.data = p;
> > +			ncn->ncn_nf[i].notify_vals.len = (u8 *)stream.p - p;
> > +		}
> > +put_event:
> > +		nfsd_notify_event_put(nne);
> > +	}
> > +	if (!error) {
> > +		ncn->ncn_nf_cnt = count;
> > +		nfsd_file_put(nf);
> > +		return true;
> > +	}
> > +	nfsd_file_put(nf);
> > +out_recall:
> > +	nfsd_break_one_deleg(dp);
> > +	return false;
> > +}
> > +
> >  static int
> >  nfsd4_cb_notify_done(struct nfsd4_callback *cb,
> >  				struct rpc_task *task)
> >  {
> > +	struct nfsd4_cb_notify *ncn = container_of(cb, struct 
> > nfsd4_cb_notify, ncn_cb);
> > +	struct nfs4_delegation *dp = container_of(ncn, struct 
> > nfs4_delegation, dl_cb_notify);
> > +
> >  	switch (task->tk_status) {
> >  	case -NFS4ERR_DELAY:
> >  		rpc_delay(task, 2 * HZ);
> >  		return 0;
> >  	default:
> > +		/* For any other hard error, recall the deleg */
> > +		nfsd_break_one_deleg(dp);
> > +		fallthrough;
> > +	case 0:
> >  		return 1;
> >  	}
> >  }
> > 
> > +static void nfsd4_run_cb_notify(struct nfsd4_cb_notify *ncn);
> > +
> >  static void
> >  nfsd4_cb_notify_release(struct nfsd4_callback *cb)
> >  {
> > @@ -3482,6 +3595,9 @@ nfsd4_cb_notify_release(struct nfsd4_callback *cb)
> >  	struct nfs4_delegation *dp =
> >  			container_of(ncn, struct nfs4_delegation, dl_cb_notify);
> > 
> > +	/* Drain events that arrived while this callback was in flight */
> > +	if (ncn->ncn_evt_cnt > 0)
> > +		nfsd4_run_cb_notify(ncn);
> 
> The above check needs to be serialized with modification of
> ncn_evt_cnt:
>
> +       bool pending;
>  
> +       /* Drain events that arrived while this callback was in flight */
> +       spin_lock(&ncn->ncn_lock);
> +       pending = ncn->ncn_evt_cnt > 0;
> +       spin_unlock(&ncn->ncn_lock);
> +       if (pending)
> +               nfsd4_run_cb_notify(ncn);
> 

I need to ponder this. Does this matter?

NFSD4_CALLBACK_RUNNING is now clear, which should be observed by
another task queueing a new event. READ_ONCE() seems like it should be
sufficient here. I'll run it by Claude.
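
To make the ordering question concrete, here is a userspace analog of
the "clear RUNNING, then recheck the count" pattern using C11 atomics.
Everything in it (ex_notifier, ex_queue_event, ex_release and so on) is
hypothetical, and it assumes the RUNNING flag is cleared before the
release-time recheck, as described above. The sequentially consistent
atomics used here are stronger than READ_ONCE(), so this only shows the
shape of the argument and does not settle what the kernel code needs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct ex_notifier {
	atomic_flag running;	/* stand-in for NFSD4_CALLBACK_RUNNING */
	atomic_int pending;	/* stand-in for ncn_evt_cnt */
	int runs_started;	/* how many times the callback was kicked */
};

static void ex_init(struct ex_notifier *n)
{
	assert(n != NULL);
	atomic_flag_clear(&n->running);
	atomic_init(&n->pending, 0);
	n->runs_started = 0;
}

/* Try to start the callback; only one instance may be in flight. */
static bool ex_run_cb(struct ex_notifier *n)
{
	if (atomic_flag_test_and_set(&n->running))
		return false;	/* already in flight */
	n->runs_started++;
	return true;
}

/* Producer: publish the event first, then try to kick the callback. */
static void ex_queue_event(struct ex_notifier *n)
{
	atomic_fetch_add(&n->pending, 1);
	ex_run_cb(n);
}

/* Prepare step: take everything queued so far. */
static int ex_prepare(struct ex_notifier *n)
{
	return atomic_exchange(&n->pending, 0);
}

/* Release step: clear RUNNING first, then recheck for late arrivals. */
static void ex_release(struct ex_notifier *n)
{
	atomic_flag_clear(&n->running);
	if (atomic_load(&n->pending) > 0)
		ex_run_cb(n);
}
```

In this model an event queued while the callback is in flight fails to
set the flag, but the release-side recheck picks it up. An event queued
after the flag is cleared sets the flag itself. Either way at least one
side restarts the callback.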


> 
> >  	nfs4_put_stid(&dp->dl_stid);
> >  }
> > 
> 
> > @@ -9858,3 +9954,133 @@ void nfsd_update_cmtime_attr(struct file *f, 
> > unsigned int flags)
> >  				      MINOR(inode->i_sb->s_dev),
> >  				      inode->i_ino, ret);
> >  }
> > +
> > +static void
> > +nfsd4_run_cb_notify(struct nfsd4_cb_notify *ncn)
> > +{
> > +	struct nfs4_delegation *dp = container_of(ncn, struct 
> > nfs4_delegation, dl_cb_notify);
> > +
> > +	if (test_and_set_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags))
> > +		return;
> > +
> > +	if (!refcount_inc_not_zero(&dp->dl_stid.sc_count))
> > +		clear_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags);
> > +	else
> > +		nfsd4_run_cb(&ncn->ncn_cb);
> > +}
> > +
> > +static struct nfsd_notify_event *
> > +alloc_nfsd_notify_event(u32 mask, const struct qstr *q, struct dentry 
> > *dentry,
> > +			struct inode *target)
> > +{
> > +	struct nfsd_notify_event *ne;
> > +
> > +	ne = kmalloc(sizeof(*ne) + q->len + 1, GFP_NOFS);
> > +	if (!ne)
> > +		return NULL;
> > +
> > +	memcpy(&ne->ne_name, q->name, q->len);
> > +	refcount_set(&ne->ne_ref, 1);
> > +	ne->ne_mask = mask;
> > +	ne->ne_name[q->len] = '\0';
> > +	ne->ne_namelen = q->len;
> > +	ne->ne_dentry = dget(dentry);
> > +	ne->ne_target = target;
> > +	if (ne->ne_target)
> > +		ihold(ne->ne_target);
> > +	return ne;
> > +}
> > +
> > +static bool
> > +should_notify_deleg(u32 mask, struct file_lease *fl)
> > +{
> > +	/* Don't notify the client generating the event */
> > +	if (nfsd_breaker_owns_lease(fl))
> > +		return false;
> > +
> > +	/* Skip if this event wasn't ignored by the lease */
> > +	if ((mask & FS_DELETE) && !(fl->c.flc_flags & FL_IGN_DIR_DELETE))
> > +		return false;
> > +	if ((mask & FS_CREATE) && !(fl->c.flc_flags & FL_IGN_DIR_CREATE))
> > +		return false;
> > +	if ((mask & FS_RENAME) && !(fl->c.flc_flags & FL_IGN_DIR_RENAME))
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> > +static void
> > +nfsd_recall_all_dir_delegs(const struct inode *dir)
> > +{
> > +	struct file_lock_context *ctx = locks_inode_context(dir);
> > +	struct file_lock_core *flc;
> > +
> > +	spin_lock(&ctx->flc_lock);
> > +	list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> > +		struct file_lease *fl = container_of(flc, struct file_lease, c);
> > +
> > +		if (fl->fl_lmops == &nfsd_lease_mng_ops)
> > +			nfsd_break_deleg_cb(fl);
> > +	}
> > +	spin_unlock(&ctx->flc_lock);
> > +}
> > +
> > +int
> > +nfsd_handle_dir_event(u32 mask, const struct inode *dir, const void 
> > *data,
> > +		      int data_type, const struct qstr *name)
> > +{
> > +	struct dentry *dentry = fsnotify_data_dentry(data, data_type);
> > +	struct inode *target = fsnotify_data_rename_target(data, data_type);
> > +	struct file_lock_context *ctx;
> > +	struct file_lock_core *flc;
> > +	struct nfsd_notify_event *evt;
> > +
> > +	/* Normalize cross-dir rename events to create/delete */
> > +	if (mask & FS_MOVED_FROM) {
> > +		mask &= ~FS_MOVED_FROM;
> > +		mask |= FS_DELETE;
> > +	}
> > +	if (mask & FS_MOVED_TO) {
> > +		mask &= ~FS_MOVED_TO;
> > +		mask |= FS_CREATE;
> > +	}
> > +
> 
> I inserted an extra check here for rename notifications:
> 
> +       /*
> +        * FS_RENAME fires on the source directory even for a cross-dir
> +        * rename, where the moved entry now lives under a different
> +        * parent. NOTIFY4_RENAME_ENTRY describes an in-place rename, so
> +        * reporting it here would advertise a name absent from this
> +        * directory.
> +        */
> +       if ((mask & FS_RENAME) && dentry && d_inode(dentry->d_parent) != dir)
> +               mask &= ~FS_RENAME;
> 

Thanks. I'll add that in.
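
For anyone following along, the combined mask handling (the existing
normalization plus the new cross-dir check) can be sketched in
userspace like this. The EX_* bits are placeholders chosen for the
sketch, not the kernel's FS_* values, and the moved_out_of_dir flag is
a hypothetical stand-in for the
"dentry && d_inode(dentry->d_parent) != dir" test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bits for this sketch only; not the kernel's FS_* values. */
#define EX_MOVED_FROM	(1u << 0)
#define EX_MOVED_TO	(1u << 1)
#define EX_CREATE	(1u << 2)
#define EX_DELETE	(1u << 3)
#define EX_RENAME	(1u << 4)

static uint32_t ex_normalize_mask(uint32_t mask, bool moved_out_of_dir)
{
	/* Cross-dir moves are reported as a delete or a create */
	if (mask & EX_MOVED_FROM) {
		mask &= ~EX_MOVED_FROM;
		mask |= EX_DELETE;
	}
	if (mask & EX_MOVED_TO) {
		mask &= ~EX_MOVED_TO;
		mask |= EX_CREATE;
	}

	/* A rename only makes sense if the entry is still in this dir */
	if ((mask & EX_RENAME) && moved_out_of_dir)
		mask &= ~EX_RENAME;

	assert(!(mask & (EX_MOVED_FROM | EX_MOVED_TO)));
	return mask;
}
```

A result of 0 corresponds to the "not an expected event" early return
in the handler.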

> 
> > +	/* Don't do anything if this is not an expected event */
> > +	if (!(mask & (FS_CREATE|FS_DELETE|FS_RENAME)))
> > +		return 0;
> > +
> > +	ctx = locks_inode_context(dir);
> > +	if (!ctx || list_empty(&ctx->flc_lease))
> > +		return 0;
> > +
> > +	evt = alloc_nfsd_notify_event(mask, name, dentry, target);
> > +	if (!evt) {
> > +		nfsd_recall_all_dir_delegs(dir);
> > +		return 0;
> > +	}
> > +
> > +	spin_lock(&ctx->flc_lock);
> > +	list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> > +		struct file_lease *fl = container_of(flc, struct file_lease, c);
> > +		struct nfs4_delegation *dp = flc->flc_owner;
> > +		struct nfsd4_cb_notify *ncn = &dp->dl_cb_notify;
> > +
> 
> I added:
> 
> +               if (fl->fl_lmops != &nfsd_lease_mng_ops)
> +                       continue;
> 
> Otherwise the loop treats every lease on the inode as an nfsd delegation
> unconditionally.
> 

This is not necessary. should_notify_deleg() calls
nfsd_breaker_owns_lease(), which already checks this before doing
anything else.

> 
> > +		if (!should_notify_deleg(mask, fl))
> > +			continue;
> > +
> > +		spin_lock(&ncn->ncn_lock);
> > +		if (ncn->ncn_evt_cnt >= NOTIFY4_EVENT_QUEUE_SIZE) {
> > +			/* We're generating notifications too fast. Recall. */
> > +			spin_unlock(&ncn->ncn_lock);
> > +			nfsd_break_deleg_cb(fl);
> > +			continue;
> > +		}
> > +		ncn->ncn_evt[ncn->ncn_evt_cnt++] = nfsd_notify_event_get(evt);
> > +		spin_unlock(&ncn->ncn_lock);
> > +
> > +		nfsd4_run_cb_notify(ncn);
> > +	}
> > +	spin_unlock(&ctx->flc_lock);
> > +	nfsd_notify_event_put(evt);
> > +	return 0;
> > +}
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index e17488a911f7..31df04675713 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -4172,6 +4172,127 @@ nfsd4_encode_fattr4(struct svc_rqst *rqstp, 
> > struct xdr_stream *xdr,
> >  	goto out;
> >  }
> > 
> > +static bool
> > +nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream 
> > *xdr,
> > +			  struct dentry *dentry, struct nfs4_delegation *dp,
> > +			  struct nfsd_file *nf, char *name, u32 namelen)
> > +{
> > +	uint32_t *attrmask;
> > +
> > +	/* Reserve space for attrmask */
> > +	attrmask = xdr_reserve_space(xdr, 3 * sizeof(uint32_t));
> > +	if (!attrmask)
> > +		return false;
> > +
> > +	ne->ne_file.data = name;
> > +	ne->ne_file.len = namelen;
> > +	ne->ne_attrs.attrmask.element = attrmask;
> > +
> > +	attrmask[0] = 0;
> > +	attrmask[1] = 0;
> > +	attrmask[2] = 0;
> > +	ne->ne_attrs.attr_vals.data = NULL;
> > +	ne->ne_attrs.attr_vals.len = 0;
> > +	ne->ne_attrs.attrmask.count = 1;
> > +	return true;
> > +}
> > +
> > +/**
> > + * nfsd4_encode_notify_event - encode a notify
> > + * @xdr: stream to which to encode the fattr4
> > + * @nne: nfsd_notify_event to encode
> > + * @dp: delegation where the event occurred
> > + * @nf: nfsd_file on which event occurred
> > + * @notify_mask: pointer to word where notification mask should be set
> > + *
> > + * Encode @nne into @xdr. Returns a pointer to the start of the event, 
> > or NULL if
> > + * the event couldn't be encoded. The appropriate bit in the 
> > notify_mask will also
> > + * be set on success.
> > + */
> 
> Nit: Let's use the usual kdoc style to describe the return value.
> 

Ok, will fix.

> + * Encode @nne into @xdr. The matching bit in @notify_mask is set on
> + * success.
> + *
> + * Return: pointer to the start of the encoded event, or NULL if the
> + * event could not be encoded.
> + */
> 
> 
> > +u8 *nfsd4_encode_notify_event(struct xdr_stream *xdr, struct 
> > nfsd_notify_event *nne,
> > +			      struct nfs4_delegation *dp, struct nfsd_file *nf,
> > +			      u32 *notify_mask)
> > +{
> > +	u8 *p = NULL;
> > +
> > +	*notify_mask = 0;
> > +
> > +	if (nne->ne_mask & FS_DELETE) {
> > +		struct notify_remove4 nr = { };
> > +
> > +		if (!nfsd4_setup_notify_entry4(&nr.nrm_old_entry, xdr, 
> > nne->ne_dentry, dp,
> > +					       nf, nne->ne_name, nne->ne_namelen))
> > +			goto out_err;
> > +		p = (u8 *)xdr->p;
> > +		if (!xdrgen_encode_notify_remove4(xdr, &nr))
> > +			goto out_err;
> > +		*notify_mask |= BIT(NOTIFY4_REMOVE_ENTRY);
> > +	} else if (nne->ne_mask & FS_CREATE) {
> > +		struct notify_add4 na = { };
> > +		struct notify_remove4 old = { };
> > +
> > +		if (!nfsd4_setup_notify_entry4(&na.nad_new_entry, xdr, 
> > nne->ne_dentry, dp,
> > +					       nf, nne->ne_name, nne->ne_namelen))
> > +			goto out_err;
> > +
> > +		/* If a file was overwritten, report it in nad_old_entry */
> > +		if (nne->ne_target) {
> > +			if (!nfsd4_setup_notify_entry4(&old.nrm_old_entry, xdr,
> > +						       NULL, dp, nf,
> > +						       nne->ne_name, nne->ne_namelen))
> > +				goto out_err;
> > +			na.nad_old_entry.count = 1;
> > +			na.nad_old_entry.element = &old;
> > +		}
> > +
> > +		p = (u8 *)xdr->p;
> > +		if (!xdrgen_encode_notify_add4(xdr, &na))
> > +			goto out_err;
> > +
> > +		*notify_mask |= BIT(NOTIFY4_ADD_ENTRY);
> > +	} else if (nne->ne_mask & FS_RENAME) {
> > +		struct notify_rename4 nr = { };
> > +		struct notify_remove4 old = { };
> > +		struct name_snapshot n;
> > +		bool ret;
> > +
> > +		/* Don't send any attributes in the old_entry since they're the same 
> > in new */
> > +		if (!nfsd4_setup_notify_entry4(&nr.nrn_old_entry.nrm_old_entry, xdr,
> > +					       NULL, dp, nf, nne->ne_name,
> > +					       nne->ne_namelen))
> > +			goto out_err;
> > +
> > +		take_dentry_name_snapshot(&n, nne->ne_dentry);
> > +		ret = nfsd4_setup_notify_entry4(&nr.nrn_new_entry.nad_new_entry, xdr,
> > +					       nne->ne_dentry, dp, nf, (char *)n.name.name,
> > +					       n.name.len);
> > +
> > +		/* If a file was overwritten, report it in nad_old_entry */
> > +		if (ret && nne->ne_target) {
> > +			ret = nfsd4_setup_notify_entry4(&old.nrm_old_entry, xdr,
> > +							NULL, dp, nf,
> > +							(char *)n.name.name, n.name.len);
> > +			if (ret) {
> > +				nr.nrn_new_entry.nad_old_entry.count = 1;
> > +				nr.nrn_new_entry.nad_old_entry.element = &old;
> > +			}
> > +		}
> > +
> > +		if (ret) {
> > +			p = (u8 *)xdr->p;
> > +			ret = xdrgen_encode_notify_rename4(xdr, &nr);
> > +		}
> > +		release_dentry_name_snapshot(&n);
> > +		if (!ret)
> > +			goto out_err;
> > +		*notify_mask |= BIT(NOTIFY4_RENAME_ENTRY);
> > +	}
> > +	return p;
> > +out_err:
> > +	pr_warn("nfsd: unable to marshal notify_rename4 to xdr stream\n");
> 
> Nit: The warning needs to match the semantics of nfsd4_encode_notify_event().
> How about:
> 
> +       pr_warn("nfsd: unable to marshal notify event to xdr stream\n");
> 

Sounds good.

> 
> > +	return NULL;
> > +}
> > +
> 

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-06-10 17:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <aiMVLtblIKu1DQWJ@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Thu, Jun 04, 2026, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>> >> + KVM: selftests: Test conversion with elevated page refcount
>> >>     + Askar pointed out that soon vmsplice may not pin pages. Should I
>> >>       pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
>> >>       take a dependency on CONFIG_GUP_TEST.
>> >
>> > I'm not exactly excited about taking a dependency on CONFIG_GUP_TEST either, but
>> > it probably is the least awful choice.  E.g. KVM also pins pages in certain flows,
>> > but we're _also_ actively working to remove the need to pin.
>> >
>> > Hmm, maybe IORING_REGISTER_PBUF_RING?  AFAICT, it's almost literally a "pin user
>> > memory" syscall.
>> >
>>
>> Hmm that takes a dependency on io_uring, which isn't always compiled
>> in. Between CONFIG_IO_URING and CONFIG_GUP_TEST, I'd rather
>> CONFIG_GUP_TEST.
>
> Or try both?  If it's not a ridiculous amount of work.

CONFIG_GUP_TEST was tried in [1].

[1] https://lore.kernel.org/all/baa8838f623102931e755cf34c86314b305af49c.1747264138.git.ackerleytng@google.com/

It looks like this:

  static void pin_pages(void *vaddr, uint64_t size)
  {
  	const struct pin_longterm_test args = {
  		.addr = (uint64_t)vaddr,
  		.size = size,
  		.flags = PIN_LONGTERM_TEST_FLAG_USE_WRITE,
  	};

  	gup_test_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
  	TEST_REQUIRE(gup_test_fd > 0);

  	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_START, &args), 0);
  }

  static void unpin_pages(void)
  {
  	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_STOP), 0);
  }

So in the test I'll call pin_pages(), then try to convert, see that it
fails with EAGAIN and reports the expected error_offset, then I call
unpin_pages(), then I convert again and expect success.
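
The intended sequence can be modeled with a small mock, shown here only
to pin down the expected return values. Nothing in it talks to KVM or
gup_test: ex_range, ex_convert and ex_test_conversion_with_pin are
hypothetical names, and the mock conversion simply assumes that a
pinned page yields -EAGAIN plus the offending offset.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mock of a memory range whose pages may have an elevated refcount. */
struct ex_range {
	uint64_t pinned_offset;	/* offset of the first pinned page */
	bool pinned;
};

/*
 * Mock conversion: fails with -EAGAIN and reports the offending offset
 * while a page is pinned, succeeds otherwise.
 */
static int ex_convert(const struct ex_range *r, uint64_t *error_offset)
{
	assert(r != NULL && error_offset != NULL);

	if (r->pinned) {
		*error_offset = r->pinned_offset;
		return -EAGAIN;
	}
	return 0;
}

/* Returns 0 if pin -> fail -> unpin -> succeed behaves as expected. */
static int ex_test_conversion_with_pin(uint64_t pin_offset)
{
	struct ex_range r = { .pinned_offset = pin_offset, .pinned = true };
	uint64_t error_offset = UINT64_MAX;

	if (ex_convert(&r, &error_offset) != -EAGAIN)
		return -1;
	if (error_offset != pin_offset)
		return -1;

	r.pinned = false;	/* stands in for unpin_pages() */
	return ex_convert(&r, &error_offset);
}
```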

Are you uncomfortable with the CONFIG_GUP_TEST interface? What would you
like me to try with CONFIG_IO_URING? I think the main difference between
the two comes down to which non-default CONFIG option we want the
guest_memfd tests to depend on.


* [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
From: Shanker Donthineni @ 2026-06-10 16:48 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Vladimir Murzin
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira

On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
observed by a peripheral before an older, non-overlapping Device-nGnR*
store to the same peripheral. This breaks the program-order guarantee
that software expects for Device-nGnR* accesses and can leave a
peripheral in an incorrect state, as a load is observed before an
earlier store takes effect.

The erratum can occur only when all of the following apply:

  - A PE executes a Device-nGnR* store followed by a younger
    Device-nGnR* load.
  - The store is not a store-release.
  - The accesses target the same peripheral and do not overlap in bytes.
  - There is at most one intervening Device-nGnR* store in program
    order, and there are no intervening Device-nGnR* loads.
  - There is no DSB, and no DMB that orders loads, between the store and
    the load.
  - Specific micro-architectural and timing conditions occur.

Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
to stlr* (Store-Release), which removes the "store is not a
store-release" condition for every device write the kernel issues.
Because writel() and writel_relaxed() are both built on __raw_writel()
in asm-generic/io.h, patching the raw variants covers both the
non-relaxed and relaxed APIs without touching the higher layers. Note
that writel()'s own barrier sits before the store, so it does not order
the store against a subsequent readl(); the store-release promotion is
what provides that ordering.

Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
the plain str* sequence.

Note: stlr* only supports base-register addressing, so affected CPUs use
a base-register stlr* path. Unaffected CPUs keep the original
offset-addressed str* sequence introduced by commit d044d6ba6f02
("arm64: io: permit offset addressing").

The __const_memcpy_toio_aligned32() and __const_memcpy_toio_aligned64()
helpers are left unchanged. These helpers are intended for
write-combining mappings, which are Normal-NC on arm64. Replacing their
contiguous str* groups would defeat the write-combining behavior used to
improve store performance.

Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
---
Changes since v2:
  - Reworked the raw MMIO write helpers so unaffected CPUs keep the
    existing offset-addressed STR sequence, while affected CPUs use the
    base-register STLR path.
  - Updated the commit message to match the code changes.
  - Rebased on top of the arm64 for-next/errata branch:
    https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata

Changes since v1:
  - Updated the commit message based on feedback from Vladimir Murzin.

 Documentation/arch/arm64/silicon-errata.rst |  2 ++
 arch/arm64/Kconfig                          | 23 ++++++++++++++++
 arch/arm64/include/asm/io.h                 | 30 +++++++++++++++++++++
 arch/arm64/kernel/cpu_errata.c              |  8 ++++++
 arch/arm64/tools/cpucaps                    |  1 +
 5 files changed, 64 insertions(+)

diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index ad09bbb10da80..fc45125dc2f80 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -298,6 +298,8 @@ stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
 +----------------+-----------------+-----------------+-----------------------------+
+| NVIDIA         | Olympus core    | T410-OLY-1027   | NVIDIA_OLYMPUS_1027_ERRATUM |
++----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | Olympus core    | T410-OLY-1029   | ARM64_ERRATUM_4118414       |
 +----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c65cef81be86a..d633eb70de1ac 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075
 
 	  If unsure, say Y.
 
+config NVIDIA_OLYMPUS_1027_ERRATUM
+	bool "NVIDIA Olympus: device store/load ordering erratum"
+	default y
+	help
+	  This option adds an alternative code sequence to work around an
+	  NVIDIA Olympus core erratum where a Device-nGnR* store can be
+	  observed by a peripheral after a younger Device-nGnR* load to the
+	  same peripheral. This breaks the program order that drivers rely
+	  on for MMIO and can leave a device in an incorrect state.
+
+	  The workaround promotes the raw MMIO store helpers
+	  (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
+	  required ordering. Because writel() and writel_relaxed() are built
+	  on __raw_writel(), both are covered without changes to the higher
+	  layers.
+
+	  The fix is applied through the alternatives framework, so enabling
+	  this option does not by itself activate the workaround: it is
+	  patched in only when an affected CPU is detected, and is a no-op on
+	  unaffected CPUs.
+
+	  If unsure, say Y.
+
 config ARM64_ERRATUM_834220
 	bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
 	depends on KVM
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 8cbd1e96fd50b..801223e754c90 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -22,10 +22,22 @@
 /*
  * Generic IO read/write.  These perform native-endian accesses.
  */
+static __always_inline bool arm64_needs_device_store_release(void)
+{
+	return alternative_has_cap_unlikely(
+				ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
+}
+
 #define __raw_writeb __raw_writeb
 static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
 {
 	volatile u8 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
@@ -33,6 +45,12 @@ static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
 static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
 {
 	volatile u16 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlrh %w0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
@@ -40,6 +58,12 @@ static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
 static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
 {
 	volatile u32 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlr %w0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
@@ -47,6 +71,12 @@ static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
 static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
 {
 	volatile u64 __iomem *ptr = addr;
+
+	if (arm64_needs_device_store_release()) {
+		asm volatile("stlr %x0, [%1]" : : "rZ" (val), "r" (addr));
+		return;
+	}
+
 	asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
 }
 
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index d597896b0f7f3..b096d9acca578 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -838,6 +838,14 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
 		ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_CARMEL),
 	},
 #endif
+#ifdef CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM
+	{
+		/* NVIDIA Olympus core */
+		.desc = "NVIDIA Olympus device load/store ordering erratum",
+		.capability = ARM64_WORKAROUND_DEVICE_STORE_RELEASE,
+		ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_OLYMPUS),
+	},
+#endif
 #ifdef CONFIG_ARM64_WORKAROUND_TRBE_OVERWRITE_FILL_MODE
 	{
 		/*
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 811c2479e82d6..d367257bf7703 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -120,6 +120,7 @@ WORKAROUND_CAVIUM_TX2_219_PRFM
 WORKAROUND_CAVIUM_TX2_219_TVM
 WORKAROUND_CLEAN_CACHE
 WORKAROUND_DEVICE_LOAD_ACQUIRE
+WORKAROUND_DEVICE_STORE_RELEASE
 WORKAROUND_NVIDIA_CARMEL_CNP
 WORKAROUND_PMUV3_IMPDEF_TRAPS
 WORKAROUND_QCOM_FALKOR_E1003
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
From: Jason Gunthorpe @ 2026-06-10 16:11 UTC (permalink / raw)
  To: Shanker Donthineni
  Cc: Will Deacon, Catalin Marinas, linux-arm-kernel, Vladimir Murzin,
	Mark Rutland, linux-kernel, linux-doc, Vikram Sethi,
	Jason Sequeira
In-Reply-To: <223c49ee-528c-4750-9885-fd8e0247151e@nvidia.com>

On Wed, Jun 10, 2026 at 08:20:28AM -0500, Shanker Donthineni wrote:

> Based on the existing code comments and after reviewing this path again,
> __const_memcpy_toio_aligned32() and __const_memcpy_toio_aligned64()
> appear to be intended for WC regions. Since the erratum is scoped to
> Device-nGnR* accesses, and WC mappings are Normal-NC on arm64, I don’t
> think the STLR workaround should apply to these helpers by default.

Hmm, unfortunately I think the APIs mix together IO and WC both as
__iomem things. However, I recall that when I was looking at this
everyone was using it for WC.

Jason

^ permalink raw reply

* Re: [PATCH v3 4/5] KVM: PPC: Book3S HV: Add support for compat CPU capabilities for KVM on PowerNV
From: Amit Machhiwal @ 2026-06-10 15:53 UTC (permalink / raw)
  To: Vaibhav Jain
  Cc: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan,
	Anushree Mathur, Paolo Bonzini, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP), Jonathan Corbet, Shuah Khan, kvm,
	linux-kernel, linux-doc, lkp
In-Reply-To: <87jysgz292.fsf@vajain21.in.ibm.com>

On 2026/06/03 09:47 AM, Vaibhav Jain wrote:
> Hi Amit,
> 
> Thanks for the patch. My review comments inline:
> 
> Amit Machhiwal <amachhiw@linux.ibm.com> writes:
> 
> > Currently, when booting a compatibility-mode KVM guest (L1) on a PowerNV
> > hypervisor (L0), the guest runs with the expected processor
> > compatibility level. However, when booting a nested KVM guest (L2)
> > inside the L1, QEMU derives the CPU model from the raw host PVR and
> > attempts to run the nested guest at that level, instead of honoring the
> > compatibility mode of the L1.
> >
> > Extend host CPU compatibility capability reporting to support nested
> > virtualization on PowerNV systems (PAPR nested API v1).
> >
> > For nested API v2 (PowerVM), compatibility capabilities are obtained
> > from the hypervisor via the H_GUEST_GET_CAPABILITIES hcall. This
> > information is not available on PowerNV systems.
> >
> > For nested API v1, derive the compatibility capabilities from the L1
> > guest by reading the "cpu-version" property from the device tree, which
> > reflects the effective (logical) processor compatibility level. Map this
> > value to the corresponding compatibility capability bitmap.
> >
> > Introduce a helper to translate CPU version values into compatibility
> > capability bits and integrate it into kvmppc_get_compat_cpu_caps().
> >
> > This allows userspace to query host CPU compatibility modes on both
> > PowerVM and PowerNV platforms via the KVM_PPC_GET_COMPAT_CAPS ioctl.
> >
> > Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> > Tested-by: Anushree Mathur <anushree.mathur@linux.ibm.com>
> > Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> > ---
> >  arch/powerpc/kvm/book3s_hv.c | 37 +++++++++++++++++++++++++++++++++++-
> >  1 file changed, 36 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 38de7040e2b7..18774c49af85 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -6522,15 +6522,50 @@ static bool kvmppc_hash_v3_possible(void)
> >  	return true;
> >  }
> >  
> > +static int kvmppc_map_compat_capabilities(const __be32 cpu_version,
> > +				      unsigned long *capabilities)
> > +{
> > +	switch (cpu_version) {
> > +	case PVR_ARCH_31_P11:
> > +		*capabilities |= H_GUEST_CAP_POWER11;
> > +		break;
> > +	case PVR_ARCH_31:
> > +		*capabilities |= H_GUEST_CAP_POWER10;
> > +		break;
> > +	case PVR_ARCH_300:
> > +		*capabilities |= H_GUEST_CAP_POWER9;
> > +		break;
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +
> > +	return 0;
> > +}
> >  
> >  static int kvmppc_get_compat_cpu_caps(struct kvm_ppc_compat_caps *host_caps)
> >  {
> > +	struct device_node *np;
> >  	unsigned long capabilities = 0;
> > +	const __be32 *prop = NULL;
> >  	long rc = -EINVAL;
> > +	u32 cpu_version;
> >  
> >  	if (kvmhv_on_pseries()) {
> > -		if (kvmhv_is_nestedv2())
> > +		if (kvmhv_is_nestedv2()) {
> >  			rc = plpar_guest_get_capabilities(0,
> >  	&capabilities);
> Need to mask capabilities as mentioned in the review comments for
> previous patch. I would suggest creating a helper that performs the
> hcall and applies the mask which can then be used at
> plpar_guest_get_capabilities() call sites.

Sure, will do.

Thanks,
Amit

> 
> > +		} else {
> > +			for_each_node_by_type(np, "cpu") {
> > +				prop = of_get_property(np, "cpu-version", NULL);
> > +				if (prop) {
> > +					cpu_version = be32_to_cpup(prop);
> > +					break;
> > +				}
> > +			}
> > +			if (!prop)
> > +				return -EINVAL;
> > +			rc = kvmppc_map_compat_capabilities(cpu_version,
> > +								&capabilities);
> > +		}
> >  		host_caps->compat_capabilities = capabilities;
> >  	}
> >  
> > -- 
> > 2.50.1 (Apple Git-155)
> >
> 
> -- 
> Cheers
> ~ Vaibhav

^ permalink raw reply

* Re: [PATCH v3 3/5] KVM: PPC: Book3S HV: Implement compat CPU capability retrieval for KVM on PowerVM
From: Amit Machhiwal @ 2026-06-10 15:51 UTC (permalink / raw)
  To: Vaibhav Jain
  Cc: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan,
	Anushree Mathur, Paolo Bonzini, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP), Jonathan Corbet, Shuah Khan, kvm,
	linux-kernel, linux-doc, lkp
In-Reply-To: <87mrxcz300.fsf@vajain21.in.ibm.com>

Hi Vaibhav,

Thanks for taking a look at this patch. My response is inline.

On 2026/06/03 09:31 AM, Vaibhav Jain wrote:
> Hi Amit,
> 
> Thanks for the patch. My review comments inline below:
> 
> Amit Machhiwal <amachhiw@linux.ibm.com> writes:
> 
> > On POWER systems, the host CPU may run in a compatibility mode (e.g., a
> > Power11 processor operating in Power10 compatibility mode). In such
> > cases, the effective CPU level exposed to guests differs from the
> > physical processor generation.
> >
> > When running nested KVM guests, QEMU derives the host CPU type using
> > mfpvr(), which reflects the physical processor version. This can result
> > in a mismatch between the CPU model selected by QEMU and the
> > compatibility mode enforced by the host, leading to guest boot failures.
> >
> > For example, booting a nested guest on a Power11 LPAR configured in
> > Power10 compatibility mode fails with:
> >
> >   KVM-NESTEDv2: couldn't set guest wide elements
> >   [..KVM reg dump..]
> >
> > This occurs because QEMU selects a CPU model corresponding to the
> > physical processor (via mfpvr()), while the host operates in a lower
> > compatibility mode. As a result, KVM rejects the requested compatibility
> > level during guest initialization.
> >
> > Add support for retrieving host CPU compatibility capabilities for
> > nested guests on PowerVM (PAPR nested API v2). The hypervisor provides
> > the effective compatibility levels via the H_GUEST_GET_CAPABILITIES
> > hcall, which reflects the processor modes negotiated between the Power
> > hypervisor (L0) and the host partition (L1).
> >
> > On pseries systems, obtain the capability bitmap using
> > plpar_guest_get_capabilities() and return it via struct
> > kvm_ppc_compat_caps. This information is then exposed to userspace
> > through the KVM_PPC_GET_COMPAT_CAPS ioctl.
> >
> > Hook the implementation into the Book3S HV kvmppc_ops so that it can be
> > invoked by the generic KVM ioctl handling code.
> >
> > Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> > Tested-by: Anushree Mathur <anushree.mathur@linux.ibm.com>
> > Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> > ---
> >  arch/powerpc/kvm/book3s_hv.c | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 249d1f2e4e2c..38de7040e2b7 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -6522,6 +6522,21 @@ static bool kvmppc_hash_v3_possible(void)
> >  	return true;
> >  }
> >  
> > +
> > +static int kvmppc_get_compat_cpu_caps(struct kvm_ppc_compat_caps *host_caps)
> > +{
> > +	unsigned long capabilities = 0;
> > +	long rc = -EINVAL;
> > +
> > +	if (kvmhv_on_pseries()) {
> > +		if (kvmhv_is_nestedv2())
> > +			rc = plpar_guest_get_capabilities(0,
> > &capabilities);
> 
> Since this value will trickle back to userspace, please apply a mask on
> the hcall return value so that any reserved and non-PVR related bits
> don't leak back to userspace.

Though currently we only supply the bits corresponding to supported
processor versions, it makes sense to mask out unrelated bits so that
they aren't unnecessarily passed on to userspace. I'll make the
changes in v4.
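
For illustration only, the masking idea could be sketched roughly as below.
This is a userspace sketch, not the planned v4 code: the EX_CAP_* bit
positions and the ex_mask_compat_caps() name are placeholders invented for
the example and do not reflect the real H_GUEST_CAP_* definitions.

```c
#include <stdint.h>

/* Placeholder bit positions for illustration; NOT the real H_GUEST_CAP_* values. */
#define EX_CAP_POWER9   (1ULL << 0)
#define EX_CAP_POWER10  (1ULL << 1)
#define EX_CAP_POWER11  (1ULL << 2)

/* Only processor-compatibility bits are allowed to reach userspace. */
#define EX_CAP_VALID_MASK (EX_CAP_POWER9 | EX_CAP_POWER10 | EX_CAP_POWER11)

/* Drop reserved or unrelated bits from a raw capability bitmap. */
static inline uint64_t ex_mask_compat_caps(uint64_t raw)
{
	return raw & EX_CAP_VALID_MASK;
}
```

A helper wrapping the hcall would apply such a mask once, so that every
call site sees only the filtered bitmap.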

Thanks,
Amit

> 
> > +		host_caps->compat_capabilities = capabilities;
> > +	}
> > +
> > +	return rc;
> > +}
> > +
> >  static struct kvmppc_ops kvm_ops_hv = {
> >  	.get_sregs = kvm_arch_vcpu_ioctl_get_sregs_hv,
> >  	.set_sregs = kvm_arch_vcpu_ioctl_set_sregs_hv,
> > @@ -6564,6 +6579,7 @@ static struct kvmppc_ops kvm_ops_hv = {
> >  	.hash_v3_possible = kvmppc_hash_v3_possible,
> >  	.create_vcpu_debugfs = kvmppc_arch_create_vcpu_debugfs_hv,
> >  	.create_vm_debugfs = kvmppc_arch_create_vm_debugfs_hv,
> > +	.get_compat_cpu_ver = kvmppc_get_compat_cpu_caps,
> >  };
> >  
> >  static int kvm_init_subcore_bitmap(void)
> > -- 
> > 2.50.1 (Apple Git-155)
> >
> >
> 
> -- 
> Cheers
> ~ Vaibhav

^ permalink raw reply

* Re: [PATCH v5 09/21] nfsd: add data structures for handling CB_NOTIFY
From: Jeff Layton @ 2026-06-10 15:51 UTC (permalink / raw)
  To: Chuck Lever, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Trond Myklebust, Anna Schumaker, Jonathan Corbet,
	Shuah Khan
  Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
	Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
	linux-doc, linux-nfs
In-Reply-To: <566fd48c-bf10-4974-9ee4-1afc30b7a69c@app.fastmail.com>

On Mon, 2026-06-08 at 16:18 -0400, Chuck Lever wrote:
> 
> On Fri, May 22, 2026, at 3:42 PM, Jeff Layton wrote:
> > Add the data structures, allocation helpers, and callback operations
> > needed for directory delegation CB_NOTIFY support:
> 
> > diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> > index 9c6e2e7abc82..505fabf8f1bf 100644
> > --- a/fs/nfsd/state.h
> > +++ b/fs/nfsd/state.h
> > @@ -197,6 +197,44 @@ struct nfs4_cb_fattr {
> >  #define NOTIFY4_EVENT_QUEUE_SIZE	3
> >  #define NOTIFY4_PAGE_ARRAY_SIZE		1
> > 
> > +struct nfsd_notify_event {
> > +	refcount_t	ne_ref;		// refcount
> > +	u32		ne_mask;	// FS_* mask from fsnotify callback
> > +	struct dentry	*ne_dentry;	// dentry reference to target
> > +	u32		ne_namelen;	// length of ne_name
> > +	char		ne_name[];	// name of dentry being changed
> 
> Nit: checkpatch doesn't like the C++ comment style.
> 
> 
> > +};
> > +
> > +static inline struct nfsd_notify_event *nfsd_notify_event_get(struct 
> > nfsd_notify_event *ne)
> > +{
> > +	refcount_inc(&ne->ne_ref);
> > +	return ne;
> > +}
> > +
> > +static inline void nfsd_notify_event_put(struct nfsd_notify_event *ne)
> > +{
> > +	if (refcount_dec_and_test(&ne->ne_ref)) {
> > +		dput(ne->ne_dentry);
> > +		kfree(ne);
> > +	}
> > +}
> > +
> > +/*
> > + * Represents a directory delegation. The callback is for handling 
> > CB_NOTIFYs.
> > + * As notifications from fsnotify come in, allocate a new event, take 
> > the ncn_lock,
> > + * and add it to the ncn_evt queue. The CB_NOTIFY prepare handler will 
> > take the
> > + * lock, clean out the list and process it.
> > + */
> > +struct nfsd4_cb_notify {
> > +	spinlock_t			ncn_lock;	// protects the evt queue and count
> > +	int				ncn_evt_cnt;	// count of events in ncn_evt
> > +	int				ncn_nf_cnt;	// count of valid entries in ncn_nf
> > +	struct nfsd_notify_event	*ncn_evt[NOTIFY4_EVENT_QUEUE_SIZE]; // list 
> > of events
> > +	struct page			*ncn_pages[NOTIFY4_PAGE_ARRAY_SIZE]; // for encoding
> > +	struct notify4			*ncn_nf;	// array of notify4's to be sent
> > +	struct nfsd4_callback		ncn_cb;		// notify4 callback
> > +};
> 
> Ditto.
> 


I'll note that the code is littered with this comment style anyway,
including a bunch of the new // SPDX- header comments. Personally, I
find this more readable for documenting struct fields.

AFAICT, the checkpatch rule was manufactured out of thin air.
Documentation/dev-tools/checkpatch.rst says:

  **C99_COMMENTS**
    C99 style single line comments (//) should not be used.
    Prefer the block comment style instead.

    See:
https://www.kernel.org/doc/html/latest/process/coding-style.html#commenting

...but that coding-style document says nothing about C99 comments. I
move that we ignore checkpatch here.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* Re: [PATCH v3 2/5] KVM: PPC: Introduce KVM_CAP_PPC_COMPAT_CAPS and wire up ioctl
From: Amit Machhiwal @ 2026-06-10 15:47 UTC (permalink / raw)
  To: Vaibhav Jain
  Cc: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan,
	Anushree Mathur, Paolo Bonzini, Nicholas Piggin, Michael Ellerman,
	Christophe Leroy (CS GROUP), Jonathan Corbet, Shuah Khan, kvm,
	linux-kernel, linux-doc, lkp
In-Reply-To: <87pl28z3nx.fsf@vajain21.in.ibm.com>

Hi Vaibhav,

Thanks for reviewing the patches. Please find my response inline.

On 2026/06/03 09:16 AM, Vaibhav Jain wrote:
> Hi Amit,
> 
> Thanks for this patch. Few review comments below:
> 
> Amit Machhiwal <amachhiw@linux.ibm.com> writes:
> 
> > Introduce a new capability and ioctl to expose CPU compatibility modes
> > supported by the host processor for nested guests.
> >
> > On IBM POWER systems, newer processor generations (N) can operate in
> > compatibility modes corresponding to earlier generations, like (N-1) and
> > (N-2). This is particularly relevant for nested virtualization, where
> > nested KVM guests may need to run with a specific processor compatibility
> > level.
> >
> > Introduce KVM_CAP_PPC_COMPAT_CAPS capability and the corresponding
> > KVM_PPC_GET_COMPAT_CAPS vm ioctl. The ioctl returns a bitmap describing
> > the compatibility modes supported by the host in respective bit numbers,
> > allowing userspace (e.g., QEMU) to select an appropriate compatibility
> > level when configuring nested KVM guests.
> >
> > The ioctl handling is added in kvm_arch_vm_ioctl() and retrieves host
> > CPU compatibility capabilities via a PowerPC-specific backend
> > implementation when available. If the capability is not supported, the
> > ioctl returns success with no capabilities set, allowing userspace to
> > fall back gracefully.
> >
> > Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> > Tested-by: Anushree Mathur <anushree.mathur@linux.ibm.com>
> > Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> > ---
> >  arch/powerpc/include/asm/kvm_ppc.h  |  1 +
> >  arch/powerpc/include/uapi/asm/kvm.h |  6 ++++++
> >  arch/powerpc/kvm/powerpc.c          | 21 +++++++++++++++++++++
> >  include/uapi/linux/kvm.h            |  4 ++++
> >  4 files changed, 32 insertions(+)
> >
> > diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> > index 0953f2daa466..cadfb839e836 100644
> > --- a/arch/powerpc/include/asm/kvm_ppc.h
> > +++ b/arch/powerpc/include/asm/kvm_ppc.h
> > @@ -319,6 +319,7 @@ struct kvmppc_ops {
> >  	bool (*hash_v3_possible)(void);
> >  	int (*create_vm_debugfs)(struct kvm *kvm);
> >  	int (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry);
> > +	int (*get_compat_cpu_ver)(struct kvm_ppc_compat_caps *host_caps);
> >  };
> >  
> >  extern struct kvmppc_ops *kvmppc_hv_ops;
> > diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> > index 077c5437f521..081d6c7f7f70 100644
> > --- a/arch/powerpc/include/uapi/asm/kvm.h
> > +++ b/arch/powerpc/include/uapi/asm/kvm.h
> > @@ -437,6 +437,12 @@ struct kvm_ppc_cpu_char {
> >  	__u64	behaviour_mask;		/* valid bits in behaviour */
> >  };
> >  
> > +/* For KVM_PPC_GET_COMPAT_CAPS */
> > +struct kvm_ppc_compat_caps {
> > +	__u64	flags;			/* Reserved for future use */
> Please also introduce a size field for the UAPI so that this
> structure can evolve in the future without breaking the kernel ABI.

Sure, adding a size field for validations and future extensibility makes
sense. I'll add it in the next version.

> 
> > +	__u64	compat_capabilities;	/* Capabilities supported by the host */
> > +};
> > +
> >  /*
> >   * Values for character and character_mask.
> >   * These are identical to the values used by H_GET_CPU_CHARACTERISTICS.
> > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> > index 00302399fc37..02b834ebd8d3 100644
> > --- a/arch/powerpc/kvm/powerpc.c
> > +++ b/arch/powerpc/kvm/powerpc.c
> > @@ -697,6 +697,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >  			}
> >  		}
> >  		break;
> > +#if defined(CONFIG_KVM_BOOK3S_HV_POSSIBLE)
> > +	case KVM_CAP_PPC_COMPAT_CAPS:
> > +		r = 0;
> > +		if (kvmhv_on_pseries())
> > +			r = 1;
> > +		break;
> > +#endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
> >  	default:
> >  		r = 0;
> >  		break;
> > @@ -2463,6 +2470,20 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> >  		r = kvm->arch.kvm_ops->svm_off(kvm);
> >  		break;
> >  	}
> > +	case KVM_PPC_GET_COMPAT_CAPS: {
> > +		struct kvm_ppc_compat_caps host_caps;
> > +
> > +		r = -ENOTTY;
> > +		memset(&host_caps, 0, sizeof(host_caps));
> > +		if (!kvm->arch.kvm_ops->get_compat_cpu_ver)
> > +			goto out;
> > +
> > +		r = kvm->arch.kvm_ops->get_compat_cpu_ver(&host_caps);
> > +		if (!r && copy_to_user(argp, &host_caps,
> > +				     sizeof(host_caps)))
> As mentioned above, please introduce a size field in the structure that's
> being copied to userspace and use that size field to copy the
> appropriate structure to userspace. Otherwise a future kernel may
> unintentionally overwrite unintended userspace memory if it happens to
> be using a larger structure size than what the VMM knows about.

Sure, I'll add a validation around the structure size.
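
As a rough sketch of what size-bounded copying could look like (a userspace
illustration only, under my own assumptions: the struct ex_compat_caps
layout and the ex_copy_caps() helper are hypothetical, and memcpy() stands
in for copy_to_user()):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical extensible layout: a leading size field, then the payload. */
struct ex_compat_caps {
	uint32_t size;			/* bytes the caller's buffer can hold */
	uint32_t flags;
	uint64_t compat_capabilities;
};

/*
 * Copy at most as many bytes as the caller said it can hold.
 * Returns the number of bytes copied, or 0 if the caller's size is too
 * small to contain even the size field.
 */
static size_t ex_copy_caps(void *user_buf, uint32_t user_size,
			   const struct ex_compat_caps *kern)
{
	size_t n = user_size < sizeof(*kern) ? user_size : sizeof(*kern);

	if (n < sizeof(kern->size))
		return 0;
	memcpy(user_buf, kern, n);	/* stands in for copy_to_user() */
	return n;
}
```

With this shape, an older VMM that passes a smaller size never has bytes
written past the end of its buffer, even if the kernel structure grows.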

> 
> > +			r = -EFAULT;
> > +		break;
> > +	}
> >  	default: {
> >  		struct kvm *kvm = filp->private_data;
> >  		r = kvm->arch.kvm_ops->arch_vm_ioctl(filp, ioctl, arg);
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 6c8afa2047bf..1788a0068662 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -996,6 +996,7 @@ struct kvm_enable_cap {
> >  #define KVM_CAP_S390_USER_OPEREXEC 246
> >  #define KVM_CAP_S390_KEYOP 247
> >  #define KVM_CAP_S390_VSIE_ESAMODE 248
> > +#define KVM_CAP_PPC_COMPAT_CAPS 249
> >  
> >  struct kvm_irq_routing_irqchip {
> >  	__u32 irqchip;
> > @@ -1349,6 +1350,9 @@ struct kvm_s390_keyop {
> >  #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
> >  #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
> >  
> > +/* Available with KVM_CAP_PPC_COMPAT_CAPS */
> > +#define KVM_PPC_GET_COMPAT_CAPS	_IOR(KVMIO,  0xe4, struct
> > kvm_ppc_compat_caps)
> Minor: you may want to align the name of the newly introduced kvmppc_ops
> to KVM CAP you are introducing here.

Will do.

Thanks,
Amit

> 
> > +
> >  /*
> >   * ioctls for vcpu fds
> >   */
> > -- 
> > 2.50.1 (Apple Git-155)
> >
> >
> 
> -- 
> Cheers
> ~ Vaibhav

^ permalink raw reply

* Re: [PATCH v5 07/21] nfsd: add callback encoding and decoding linkages for CB_NOTIFY
From: Chuck Lever @ 2026-06-10 15:25 UTC (permalink / raw)
  To: Jeff Layton, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Trond Myklebust, Anna Schumaker, Jonathan Corbet,
	Shuah Khan
  Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
	Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
	linux-doc, linux-nfs
In-Reply-To: <fc7d4edb03fbb359746d75942cc4867c280f8c39.camel@kernel.org>



On Wed, Jun 10, 2026, at 11:19 AM, Jeff Layton wrote:
> On Mon, 2026-06-08 at 12:52 -0400, Chuck Lever wrote:

>> > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
>> > index 25bbf5b8814d..ea3e7deb06fa 100644
>> > --- a/fs/nfsd/nfs4callback.c
>> > +++ b/fs/nfsd/nfs4callback.c
>> > @@ -865,6 +865,51 @@ static void encode_stateowner(struct xdr_stream 
>> > *xdr, struct nfs4_stateowner *so
>> >  	xdr_encode_opaque(p, so->so_owner.data, so->so_owner.len);
>> >  }
>> > 
>> > +static void nfs4_xdr_enc_cb_notify(struct rpc_rqst *req,
>> > +				   struct xdr_stream *xdr,
>> > +				   const void *data)
>> > +{
>> > +	const struct nfsd4_callback *cb = data;
>> > +	struct nfs4_cb_compound_hdr hdr = {
>> > +		.ident = 0,
>> > +		.minorversion = cb->cb_clp->cl_minorversion,
>> > +	};
>> > +	struct CB_NOTIFY4args args = { };
>> > +
>> > +	WARN_ON_ONCE(hdr.minorversion == 0);
>> > +
>> > +	encode_cb_compound4args(xdr, &hdr);
>> > +	encode_cb_sequence4args(xdr, cb, &hdr);
>> > +
>> > +	/*
>> > +	 * FIXME: get stateid and fh from delegation. Inline the cna_changes
>> > +	 * buffer, and zero it.
>> > +	 */
>> > +	WARN_ON_ONCE(!xdrgen_encode_CB_NOTIFY4args(xdr, &args));
>> > +
>> > +	hdr.nops++;
>> > +	encode_cb_nops(&hdr);
>> > +}
>> 
>> There are a number of problems with this, but since there are no
>> callers yet, we can let some of those issues stand.
>> 
>> What is problematic in the longer-term is that this is a client-side
>> encoder (since this is the server's NFSv4 callback client).
>> 
>> xdrgen_encode_CB_NOTIFY4args() is an argument encoder, which is
>> client-side functionality, but it resides in fs/nfsd/nfs4xdr_gen.c,
>> which is server-side. Let's not mix these purposes.
>> 
>> I replaced the comment and WARN_ON with this:
>> 
>> +       xdr_stream_encode_u32(xdr, OP_CB_NOTIFY);
>> +
>> +       /* FIXME: encode stateid, fh, and cna_changes from delegation */
>> 
>> You can use xdrgen functions for individual data items, but for
>> full argument and response structures, only server-side is supported
>> at the moment. In the later patch that completes this code, I'll cover
>> the other fields, which can be a mix of open code and xdrgen.
>> 
>
> The full argument encoder and decoder works just fine. When you say
> "supported" what do you mean, specifically?

"encode argument" is a client-side mechanism

"decode argument" is a server-side mechanism

"encode result" is a server-side mechanism

"decode result" is a client-side mechanism

The reason that matters is that the client-side and server-side
XDR implementation for the top level arguments and results (not
the individual data items) in the kernel have different calling
conventions -- eg, server side wants to see a struct svc_rqst *


> I'd really rather not go back to open-coding the encoders and decoders,
> particularly since CB_NOTIFY has one of the most complex argument
> structures in the protocol.

Go look at how I implemented this in the subsequent patch. You can
use the xdrgen notify4 encoder, that's where all the complexity is.

-- 
Chuck Lever

^ permalink raw reply

* [PATCH v11 2/2] hwmon: temperature: add support for EMC1812
From: Marius Cristea @ 2026-06-10 15:19 UTC (permalink / raw)
  To: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Jonathan Corbet
  Cc: linux-hwmon, devicetree, linux-kernel, linux-doc, Marius Cristea
In-Reply-To: <20260610-hw_mon-emc1812-v11-0-cef809af5c19@microchip.com>

This is the hwmon driver for Microchip EMC1812/13/14/15/33
Multichannel Low-Voltage Remote Diode Sensor Family.

EMC1812 has one external remote temperature monitoring channel.
EMC1813 has two external remote temperature monitoring channels.
EMC1814 has three external remote temperature monitoring channels,
channels 2 and 3 support anti parallel diode.
EMC1815 has four external remote temperature monitoring channels and
channels 1/2 and 3/4 support anti parallel diode.
EMC1833 has two external remote temperature monitoring channels and
channels 1 and 2 support anti parallel diode.
Resistance Error Correction is supported on channels 1/2 and 3/4.

Signed-off-by: Marius Cristea <marius.cristea@microchip.com>
---
 Documentation/hwmon/emc1812.rst |  67 +++
 Documentation/hwmon/index.rst   |   1 +
 MAINTAINERS                     |   2 +
 drivers/hwmon/Kconfig           |  11 +
 drivers/hwmon/Makefile          |   1 +
 drivers/hwmon/emc1812.c         | 965 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 1047 insertions(+)

diff --git a/Documentation/hwmon/emc1812.rst b/Documentation/hwmon/emc1812.rst
new file mode 100644
index 000000000000..0b4fbcaaea71
--- /dev/null
+++ b/Documentation/hwmon/emc1812.rst
@@ -0,0 +1,67 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+Kernel driver emc1812
+=====================
+
+Supported chips:
+
+  * Microchip EMC1812, EMC1813, EMC1814, EMC1815, EMC1833
+
+    Prefix: 'emc1812'
+
+    Datasheets:
+
+	- https://ww1.microchip.com/downloads/aemDocuments/documents/MSLD/ProductDocuments/DataSheets/EMC1812-3-4-5-33-Data-Sheet-DS20005751.pdf
+
+Author:
+    Marius Cristea <marius.cristea@microchip.com>
+
+
+Description
+-----------
+
+The Microchip EMC181x/33 chips contain up to 4 remote temperature sensors
+and one internal.
+- The EMC1812 is a single channel remote temperature sensor.
+- The EMC1813 and EMC1833 are dual channel remote temperature sensors. The
+remote channels for this selection of devices can support substrate diodes,
+discrete diode-connected transistors or CPU/GPU thermal diodes.
+- The EMC1814 is a three channel remote temperature sensor that supports
+Anti-Parallel Diode (APD) only on one channel. For the channel that does not
+support APD functionality, substrate diodes, discrete diode-connected
+transistors or CPU/GPU thermal diodes are supported. For the channel that
+supports APD, only discrete diode-connected transistors may be implemented.
+However, if APD is disabled on the EMC1814, then the channel that supports
+APD will be functional with substrate diodes, discrete diode-connected
+transistors and CPU/GPU thermal diodes.
+- The EMC1815 is a four channel remote temperature sensor.
+
+The EMC1815 and EMC1833 support APD on all channels. When APD is enabled,
+the channels support only diode-connected transistors. If APD is disabled,
+then the channels will support substrate transistors, discrete diode-connected
+transistors and CPU/GPU thermal diodes.
+
+Note: Disabling APD functionality to implement substrate diodes on devices
+that support APD eliminates the benefit of APD (two diodes on one channel).
+
+The chips implement three limits for each sensor: low (tempX_min), high
+(tempX_max) and critical (tempX_crit). The chips also implement a
+hysteresis mechanism which applies to all limits. The relative difference
+is stored in a single register on the chip, which means that the relative
+difference between the limit and its hysteresis is always the same for
+all three limits.
+
+This implementation detail implies the following:
+
+* When setting a limit, its hysteresis will automatically follow, the
+  difference staying unchanged. For example, if the old critical limit was
+  80 degrees C, and the hysteresis was 75 degrees C, and you change the
+  critical limit to 90 degrees C, then the hysteresis will automatically
+  change to 85 degrees C.
+* The hysteresis values can't be set independently. We decided to make
+  only tempX_crit_hyst writable, while all other hysteresis attributes
+  are read-only. Setting tempX_crit_hyst writes the difference between
+  tempX_crit_hyst and tempX_crit into the chip, and the same relative
+  hysteresis applies automatically to all other limits.
+* The limits should be set before the hysteresis. At power up the device
+  starts with 10 degree hysteresis.
diff --git a/Documentation/hwmon/index.rst b/Documentation/hwmon/index.rst
index 51a5bdf75b08..a03e97f9a97f 100644
--- a/Documentation/hwmon/index.rst
+++ b/Documentation/hwmon/index.rst
@@ -69,6 +69,7 @@ Hardware Monitoring Kernel Drivers
    ds1621
    ds620
    emc1403
+   emc1812
    emc2103
    emc2305
    emc6w201
diff --git a/MAINTAINERS b/MAINTAINERS
index 85c236df781e..fcb712549ea6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16651,6 +16651,8 @@ M:	Marius Cristea <marius.cristea@microchip.com>
 L:	linux-hwmon@vger.kernel.org
 S:	Supported
 F:	Documentation/devicetree/bindings/hwmon/microchip,emc1812.yaml
+F:	Documentation/hwmon/emc1812.rst
+F:	drivers/hwmon/emc1812.c
 
 MICROCHIP I2C DRIVER
 M:	Codrin Ciubotariu <codrin.ciubotariu@microchip.com>
diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig
index 2760feb9f83b..3b53572fd8bf 100644
--- a/drivers/hwmon/Kconfig
+++ b/drivers/hwmon/Kconfig
@@ -2042,6 +2042,17 @@ config SENSORS_EMC1403
 	  Threshold values can be configured using sysfs.
 	  Data from the different diodes are accessible via sysfs.
 
+config SENSORS_EMC1812
+	tristate "Microchip Technology EMC1812 driver"
+	depends on I2C
+	select REGMAP_I2C
+	help
+	  If you say yes here you get support for Microchip Technology's
+	  EMC181X/33 Multichannel Low-Voltage Remote Diode Sensor Family.
+
+	  This driver can also be built as a module. If so, the module
+	  will be called emc1812.
+
 config SENSORS_EMC2103
 	tristate "SMSC EMC2103"
 	depends on I2C
diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile
index 73b2abdcc6dd..e93e4051e99d 100644
--- a/drivers/hwmon/Makefile
+++ b/drivers/hwmon/Makefile
@@ -73,6 +73,7 @@ obj-$(CONFIG_SENSORS_DRIVETEMP)	+= drivetemp.o
 obj-$(CONFIG_SENSORS_DS620)	+= ds620.o
 obj-$(CONFIG_SENSORS_DS1621)	+= ds1621.o
 obj-$(CONFIG_SENSORS_EMC1403)	+= emc1403.o
+obj-$(CONFIG_SENSORS_EMC1812)	+= emc1812.o
 obj-$(CONFIG_SENSORS_EMC2103)	+= emc2103.o
 obj-$(CONFIG_SENSORS_EMC2305)	+= emc2305.o
 obj-$(CONFIG_SENSORS_EMC6W201)	+= emc6w201.o
diff --git a/drivers/hwmon/emc1812.c b/drivers/hwmon/emc1812.c
new file mode 100644
index 000000000000..68575c27d090
--- /dev/null
+++ b/drivers/hwmon/emc1812.c
@@ -0,0 +1,965 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * HWMON driver for Microchip EMC1812/13/14/15/33 Multichannel high-accuracy
+ * 2-wire low-voltage remote diode temperature monitor family.
+ *
+ * Copyright (C) 2026 Microchip Technology Inc. and its subsidiaries
+ *
+ * Author: Marius Cristea <marius.cristea@microchip.com>
+ *
+ * Datasheet can be found here:
+ * https://ww1.microchip.com/downloads/aemDocuments/documents/MSLD/ProductDocuments/DataSheets/EMC1812-3-4-5-33-Data-Sheet-DS20005751.pdf
+ */
+
+#include <linux/bitfield.h>
+#include <linux/bitops.h>
+#include <linux/bits.h>
+#include <linux/delay.h>
+#include <linux/err.h>
+#include <linux/hwmon.h>
+#include <linux/i2c.h>
+#include <linux/kernel.h>
+#include <linux/math64.h>
+#include <linux/property.h>
+#include <linux/regmap.h>
+#include <linux/string.h>
+#include <linux/units.h>
+#include <linux/util_macros.h>
+
+/* EMC1812 Registers Addresses */
+#define EMC1812_STATUS_ADDR				0x02
+#define EMC1812_CONFIG_LO_ADDR				0x03
+
+#define EMC1812_CFG_ADDR				0x09
+#define EMC1812_CONV_ADDR				0x0A
+#define EMC1812_INT_DIODE_HIGH_LIMIT_ADDR		0x0B
+#define EMC1812_INT_DIODE_LOW_LIMIT_ADDR		0x0C
+#define EMC1812_EXT1_HIGH_LIMIT_HIGH_BYTE_ADDR		0x0D
+#define EMC1812_EXT1_LOW_LIMIT_HIGH_BYTE_ADDR		0x0E
+#define EMC1812_ONE_SHOT_ADDR				0x0F
+
+#define EMC1812_EXT1_HIGH_LIMIT_LOW_BYTE_ADDR		0x13
+#define EMC1812_EXT1_LOW_LIMIT_LOW_BYTE_ADDR		0x14
+#define EMC1812_EXT2_HIGH_LIMIT_HIGH_BYTE_ADDR		0x15
+#define EMC1812_EXT2_LOW_LIMIT_HIGH_BYTE_ADDR		0x16
+#define EMC1812_EXT2_HIGH_LIMIT_LOW_BYTE_ADDR		0x17
+#define EMC1812_EXT2_LOW_LIMIT_LOW_BYTE_ADDR		0x18
+#define EMC1812_EXT1_THERM_LIMIT_ADDR			0x19
+#define EMC1812_EXT2_THERM_LIMIT_ADDR			0x1A
+#define EMC1812_EXT_DIODE_FAULT_STATUS_ADDR		0x1B
+
+#define EMC1812_DIODE_FAULT_MASK_ADDR			0x1F
+#define EMC1812_INT_DIODE_THERM_LIMIT_ADDR		0x20
+#define EMC1812_THRM_HYS_ADDR				0x21
+#define EMC1812_CONSEC_ALERT_ADDR			0x22
+
+#define EMC1812_EXT1_BETA_CONFIG_ADDR			0x25
+#define EMC1812_EXT2_BETA_CONFIG_ADDR			0x26
+#define EMC1812_EXT1_IDEALITY_FACTOR_ADDR		0x27
+#define EMC1812_EXT2_IDEALITY_FACTOR_ADDR		0x28
+
+#define EMC1812_EXT3_HIGH_LIMIT_HIGH_BYTE_ADDR		0x2C
+#define EMC1812_EXT3_LOW_LIMIT_HIGH_BYTE_ADDR		0x2D
+#define EMC1812_EXT3_HIGH_LIMIT_LOW_BYTE_ADDR		0x2E
+#define EMC1812_EXT3_LOW_LIMIT_LOW_BYTE_ADDR		0x2F
+#define EMC1812_EXT3_THERM_LIMIT_ADDR			0x30
+#define EMC1812_EXT3_IDEALITY_FACTOR_ADDR		0x31
+
+#define EMC1812_EXT4_HIGH_LIMIT_HIGH_BYTE_ADDR		0x34
+#define EMC1812_EXT4_LOW_LIMIT_HIGH_BYTE_ADDR		0x35
+#define EMC1812_EXT4_HIGH_LIMIT_LOW_BYTE_ADDR		0x36
+#define EMC1812_EXT4_LOW_LIMIT_LOW_BYTE_ADDR		0x37
+#define EMC1812_EXT4_THERM_LIMIT_ADDR			0x38
+#define EMC1812_EXT4_IDEALITY_FACTOR_ADDR		0x39
+#define EMC1812_HIGH_LIMIT_STATUS_ADDR			0x3A
+#define EMC1812_LOW_LIMIT_STATUS_ADDR			0x3B
+#define EMC1812_THERM_LIMIT_STATUS_ADDR			0x3C
+#define EMC1812_ROC_GAIN_ADDR				0x3D
+#define EMC1812_ROC_CONFIG_ADDR				0x3E
+#define EMC1812_ROC_STATUS_ADDR				0x3F
+#define EMC1812_R1_RESH_ADDR				0x40
+#define EMC1812_R1_LIMH_ADDR				0x41
+#define EMC1812_R1_LIML_ADDR				0x42
+#define EMC1812_R1_SMPL_ADDR				0x43
+#define EMC1812_R2_RESH_ADDR				0x44
+#define EMC1812_R2_3_RESL_ADDR				0x45
+#define EMC1812_R2_LIMH_ADDR				0x46
+#define EMC1812_R2_LIML_ADDR				0x47
+#define EMC1812_R2_SMPL_ADDR				0x48
+#define EMC1812_PER_MAXTH_1_ADDR			0x49
+#define EMC1812_PER_MAXT1L_ADDR				0x4A
+#define EMC1812_PER_MAXTH_2_ADDR			0x4B
+#define EMC1812_PER_MAXT2_3L_ADDR			0x4C
+#define EMC1812_GBL_MAXT1H_ADDR				0x4D
+#define EMC1812_GBL_MAXT1L_ADDR				0x4E
+#define EMC1812_GBL_MAXT2H_ADDR				0x4F
+#define EMC1812_GBL_MAXT2L_ADDR				0x50
+#define EMC1812_FILTER_SEL_ADDR				0x51
+
+#define EMC1812_INT_HIGH_BYTE_ADDR		0x60
+#define EMC1812_INT_LOW_BYTE_ADDR		0x61
+#define EMC1812_EXT1_HIGH_BYTE_ADDR		0x62
+#define EMC1812_EXT1_LOW_BYTE_ADDR		0x63
+#define EMC1812_EXT2_HIGH_BYTE_ADDR		0x64
+#define EMC1812_EXT2_LOW_BYTE_ADDR		0x65
+#define EMC1812_EXT3_HIGH_BYTE_ADDR		0x66
+#define EMC1812_EXT3_LOW_BYTE_ADDR		0x67
+#define EMC1812_EXT4_HIGH_BYTE_ADDR		0x68
+#define EMC1812_EXT4_LOW_BYTE_ADDR		0x69
+#define EMC1812_HOTTEST_DIODE_HIGH_BYTE_ADDR	0x6A
+#define EMC1812_HOTTEST_DIODE_LOW_BYTE_ADDR	0x6B
+#define EMC1812_HOTTEST_STATUS_ADDR		0x6C
+#define EMC1812_HOTTEST_CFG_ADDR		0x6D
+
+#define EMC1812_PRODUCT_ID_ADDR		0xFD
+#define EMC1812_MANUFACTURER_ID_ADDR	0xFE
+#define EMC1812_REVISION_ADDR		0xFF
+
+/* EMC1812 Config Bits */
+#define EMC1812_CFG_MSKAL		BIT(7)
+#define EMC1812_CFG_RS			BIT(6)
+#define EMC1812_CFG_ATTHM		BIT(5)
+#define EMC1812_CFG_RECD12		BIT(4)
+#define EMC1812_CFG_RECD34		BIT(3)
+#define EMC1812_CFG_RANGE		BIT(2)
+#define EMC1812_CFG_DA_ENA		BIT(1)
+#define EMC1812_CFG_APDD		BIT(0)
+
+/* EMC1812 Status Bits */
+#define EMC1812_STATUS_ROCF		BIT(7)
+#define EMC1812_STATUS_HOTCHG		BIT(6)
+#define EMC1812_STATUS_BUSY		BIT(5)
+#define EMC1812_STATUS_HIGH		BIT(4)
+#define EMC1812_STATUS_LOW		BIT(3)
+#define EMC1812_STATUS_FAULT		BIT(2)
+#define EMC1812_STATUS_ETHRM		BIT(1)
+#define EMC1812_STATUS_ITHRM		BIT(0)
+
+#define EMC1812_BETA_LOCK_VAL		0x0F
+
+#define EMC1812_TEMP_CH_ADDR(index)	(EMC1812_INT_HIGH_BYTE_ADDR + 2 * (index))
+
+#define EMC1812_FILTER_MASK_LEN		2
+
+#define EMC1812_PID			0x81
+#define EMC1813_PID			0x87
+#define EMC1814_PID			0x84
+#define EMC1815_PID			0x85
+#define EMC1833_PID			0x83
+
+/* The maximum number of channels a member of the family can have */
+#define EMC1812_MAX_NUM_CHANNELS		5
+#define EMC1812_TEMP_OFFSET			64
+
+#define EMC1812_DEFAULT_IDEALITY_FACTOR		0x12
+
+/* Constants and default values */
+#define EMC1812_HIGH_LIMIT_DEFAULT		(85 + EMC1812_TEMP_OFFSET)
+
+#define EMC1812_TEMP_MASK (HWMON_T_INPUT | HWMON_T_MIN | HWMON_T_MAX | \
+			   HWMON_T_CRIT | HWMON_T_MAX_HYST | HWMON_T_CRIT_HYST | \
+			   HWMON_T_MIN_ALARM | HWMON_T_MAX_ALARM | \
+			   HWMON_T_CRIT_ALARM | HWMON_T_LABEL)
+
+static const struct hwmon_channel_info * const emc1812_info[] = {
+	HWMON_CHANNEL_INFO(chip, HWMON_C_UPDATE_INTERVAL),
+	HWMON_CHANNEL_INFO(temp,
+			   EMC1812_TEMP_MASK,
+			   EMC1812_TEMP_MASK | HWMON_T_FAULT,
+			   EMC1812_TEMP_MASK | HWMON_T_FAULT,
+			   EMC1812_TEMP_MASK | HWMON_T_FAULT,
+			   EMC1812_TEMP_MASK | HWMON_T_FAULT),
+	NULL
+};
+
+/**
+ * struct emc1812_features - features of an emc1812 instance
+ * @name:		chip's name
+ * @phys_channels:	number of physical channels supported by the chip
+ * @has_ext2_beta_reg:	the EXT2_BETA register is available on the chip
+ */
+struct emc1812_features {
+	const char	*name;
+	u8		phys_channels;
+	bool		has_ext2_beta_reg;
+};
+
+static const struct emc1812_features emc1833_chip_config = {
+	.name = "emc1833",
+	.phys_channels = 3,
+	.has_ext2_beta_reg = true,
+};
+
+static const struct emc1812_features emc1812_chip_config = {
+	.name = "emc1812",
+	.phys_channels = 2,
+	.has_ext2_beta_reg = false,
+};
+
+static const struct emc1812_features emc1813_chip_config = {
+	.name = "emc1813",
+	.phys_channels = 3,
+	.has_ext2_beta_reg = true,
+};
+
+static const struct emc1812_features emc1814_chip_config = {
+	.name = "emc1814",
+	.phys_channels = 4,
+	.has_ext2_beta_reg = false,
+};
+
+static const struct emc1812_features emc1815_chip_config = {
+	.name = "emc1815",
+	.phys_channels = 5,
+	.has_ext2_beta_reg = false,
+};
+
+enum emc1812_limit_type {temp_min, temp_max};
+
+static const u8 emc1812_temp_map[] = {
+	[hwmon_temp_min] = temp_min,
+	[hwmon_temp_max] = temp_max,
+};
+
+static const u8 emc1812_ideality_regs[] = {
+	[0] = 0xff,
+	[1] = EMC1812_EXT1_IDEALITY_FACTOR_ADDR,
+	[2] = EMC1812_EXT2_IDEALITY_FACTOR_ADDR,
+	[3] = EMC1812_EXT3_IDEALITY_FACTOR_ADDR,
+	[4] = EMC1812_EXT4_IDEALITY_FACTOR_ADDR,
+};
+
+static const u8 emc1812_temp_crit_regs[] = {
+	[0] = EMC1812_INT_DIODE_THERM_LIMIT_ADDR,
+	[1] = EMC1812_EXT1_THERM_LIMIT_ADDR,
+	[2] = EMC1812_EXT2_THERM_LIMIT_ADDR,
+	[3] = EMC1812_EXT3_THERM_LIMIT_ADDR,
+	[4] = EMC1812_EXT4_THERM_LIMIT_ADDR,
+};
+
+static const u8 emc1812_limit_regs[][2] = {
+	[0] = {
+		[temp_min] = EMC1812_INT_DIODE_LOW_LIMIT_ADDR,
+		[temp_max] = EMC1812_INT_DIODE_HIGH_LIMIT_ADDR,
+	},
+	[1] = {
+		[temp_min] = EMC1812_EXT1_LOW_LIMIT_HIGH_BYTE_ADDR,
+		[temp_max] = EMC1812_EXT1_HIGH_LIMIT_HIGH_BYTE_ADDR,
+	},
+	[2] = {
+		[temp_min] = EMC1812_EXT2_LOW_LIMIT_HIGH_BYTE_ADDR,
+		[temp_max] = EMC1812_EXT2_HIGH_LIMIT_HIGH_BYTE_ADDR,
+	},
+	[3] = {
+		[temp_min] = EMC1812_EXT3_LOW_LIMIT_HIGH_BYTE_ADDR,
+		[temp_max] = EMC1812_EXT3_HIGH_LIMIT_HIGH_BYTE_ADDR,
+	},
+	[4] = {
+		[temp_min] = EMC1812_EXT4_LOW_LIMIT_HIGH_BYTE_ADDR,
+		[temp_max] = EMC1812_EXT4_HIGH_LIMIT_HIGH_BYTE_ADDR,
+	},
+};
+
+static const u8 emc1812_limit_regs_low[][2] = {
+	[0] = {
+		[temp_min] = 0xff,
+		[temp_max] = 0xff,
+	},
+	[1] = {
+		[temp_min] = EMC1812_EXT1_LOW_LIMIT_LOW_BYTE_ADDR,
+		[temp_max] = EMC1812_EXT1_HIGH_LIMIT_LOW_BYTE_ADDR,
+	},
+	[2] = {
+		[temp_min] = EMC1812_EXT2_LOW_LIMIT_LOW_BYTE_ADDR,
+		[temp_max] = EMC1812_EXT2_HIGH_LIMIT_LOW_BYTE_ADDR,
+	},
+	[3] = {
+		[temp_min] = EMC1812_EXT3_LOW_LIMIT_LOW_BYTE_ADDR,
+		[temp_max] = EMC1812_EXT3_HIGH_LIMIT_LOW_BYTE_ADDR,
+	},
+	[4] = {
+		[temp_min] = EMC1812_EXT4_LOW_LIMIT_LOW_BYTE_ADDR,
+		[temp_max] = EMC1812_EXT4_HIGH_LIMIT_LOW_BYTE_ADDR,
+	},
+};
+
+/* Lookup table for temperature conversion times in msec */
+static const u16 emc1812_conv_time[] = {
+	16000, 8000, 4000, 2000, 1000, 500, 250, 125, 62, 31, 16
+};
+
+/**
+ * struct emc1812_data - information about chip parameters
+ * @labels:		labels of the channels
+ * @active_ch_mask:	active channels
+ * @chip:		pointer to structure holding chip features
+ * @regmap:		device register map
+ * @recd34_en:		state of Resistance Error Correction (REC) on channels 3 and 4
+ * @recd12_en:		state of Resistance Error Correction (REC) on channels 1 and 2
+ * @apdd_en:		state of anti-parallel diode mode
+ */
+struct emc1812_data {
+	const char *labels[EMC1812_MAX_NUM_CHANNELS];
+	unsigned long active_ch_mask;
+	const struct emc1812_features *chip;
+	struct regmap *regmap;
+	bool recd34_en;
+	bool recd12_en;
+	bool apdd_en;
+};
+
+/* emc1812 regmap configuration */
+static const struct regmap_range emc1812_regmap_writable_ranges[] = {
+	regmap_reg_range(EMC1812_CFG_ADDR, EMC1812_ONE_SHOT_ADDR),
+	regmap_reg_range(EMC1812_EXT1_HIGH_LIMIT_LOW_BYTE_ADDR, EMC1812_EXT2_THERM_LIMIT_ADDR),
+	regmap_reg_range(EMC1812_DIODE_FAULT_MASK_ADDR, EMC1812_CONSEC_ALERT_ADDR),
+	regmap_reg_range(EMC1812_EXT1_BETA_CONFIG_ADDR, EMC1812_EXT4_IDEALITY_FACTOR_ADDR),
+	regmap_reg_range(EMC1812_ROC_GAIN_ADDR, EMC1812_ROC_CONFIG_ADDR),
+	regmap_reg_range(EMC1812_R1_LIMH_ADDR, EMC1812_R1_SMPL_ADDR),
+	regmap_reg_range(EMC1812_R2_LIMH_ADDR, EMC1812_R2_SMPL_ADDR),
+	regmap_reg_range(EMC1812_FILTER_SEL_ADDR, EMC1812_FILTER_SEL_ADDR),
+	regmap_reg_range(EMC1812_HOTTEST_CFG_ADDR, EMC1812_HOTTEST_CFG_ADDR),
+};
+
+static const struct regmap_access_table emc1812_regmap_wr_table = {
+	.yes_ranges = emc1812_regmap_writable_ranges,
+	.n_yes_ranges = ARRAY_SIZE(emc1812_regmap_writable_ranges),
+};
+
+static const struct regmap_range emc1812_regmap_rd_ranges[] = {
+	regmap_reg_range(EMC1812_STATUS_ADDR, EMC1812_CONFIG_LO_ADDR),
+	regmap_reg_range(EMC1812_CFG_ADDR, EMC1812_ONE_SHOT_ADDR),
+	regmap_reg_range(EMC1812_EXT1_HIGH_LIMIT_LOW_BYTE_ADDR,
+			 EMC1812_EXT_DIODE_FAULT_STATUS_ADDR),
+	regmap_reg_range(EMC1812_DIODE_FAULT_MASK_ADDR, EMC1812_CONSEC_ALERT_ADDR),
+	regmap_reg_range(EMC1812_EXT1_BETA_CONFIG_ADDR, EMC1812_FILTER_SEL_ADDR),
+	regmap_reg_range(EMC1812_INT_HIGH_BYTE_ADDR, EMC1812_HOTTEST_CFG_ADDR),
+	regmap_reg_range(EMC1812_PRODUCT_ID_ADDR, EMC1812_REVISION_ADDR),
+};
+
+static const struct regmap_access_table emc1812_regmap_rd_table = {
+	.yes_ranges = emc1812_regmap_rd_ranges,
+	.n_yes_ranges = ARRAY_SIZE(emc1812_regmap_rd_ranges),
+};
+
+static bool emc1812_is_volatile_reg(struct device *dev, unsigned int reg)
+{
+	switch (reg) {
+	case EMC1812_STATUS_ADDR:
+	case EMC1812_EXT_DIODE_FAULT_STATUS_ADDR:
+	case EMC1812_DIODE_FAULT_MASK_ADDR:
+	case EMC1812_EXT1_BETA_CONFIG_ADDR:
+	case EMC1812_EXT2_BETA_CONFIG_ADDR:
+	case EMC1812_HIGH_LIMIT_STATUS_ADDR:
+	case EMC1812_LOW_LIMIT_STATUS_ADDR:
+	case EMC1812_THERM_LIMIT_STATUS_ADDR:
+	case EMC1812_ROC_STATUS_ADDR:
+	case EMC1812_PER_MAXTH_1_ADDR:
+	case EMC1812_PER_MAXT1L_ADDR:
+	case EMC1812_PER_MAXTH_2_ADDR:
+	case EMC1812_PER_MAXT2_3L_ADDR:
+	case EMC1812_GBL_MAXT1H_ADDR:
+	case EMC1812_GBL_MAXT1L_ADDR:
+	case EMC1812_GBL_MAXT2H_ADDR:
+	case EMC1812_GBL_MAXT2L_ADDR:
+	case EMC1812_INT_HIGH_BYTE_ADDR:
+	case EMC1812_INT_LOW_BYTE_ADDR:
+	case EMC1812_EXT1_HIGH_BYTE_ADDR:
+	case EMC1812_EXT1_LOW_BYTE_ADDR:
+	case EMC1812_EXT2_HIGH_BYTE_ADDR:
+	case EMC1812_EXT2_LOW_BYTE_ADDR:
+	case EMC1812_EXT3_HIGH_BYTE_ADDR:
+	case EMC1812_EXT3_LOW_BYTE_ADDR:
+	case EMC1812_EXT4_HIGH_BYTE_ADDR:
+	case EMC1812_EXT4_LOW_BYTE_ADDR:
+	case EMC1812_HOTTEST_DIODE_HIGH_BYTE_ADDR:
+	case EMC1812_HOTTEST_DIODE_LOW_BYTE_ADDR:
+	case EMC1812_HOTTEST_STATUS_ADDR:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static const struct regmap_config emc1812_regmap_config = {
+	.reg_bits = 8,
+	.val_bits = 8,
+	.rd_table = &emc1812_regmap_rd_table,
+	.wr_table = &emc1812_regmap_wr_table,
+	.volatile_reg = emc1812_is_volatile_reg,
+	.max_register = EMC1812_REVISION_ADDR,
+	.cache_type = REGCACHE_MAPLE,
+};
+
+static umode_t emc1812_is_visible(const void *_data, enum hwmon_sensor_types type,
+				  u32 attr, int channel)
+{
+	const struct emc1812_data *data = _data;
+
+	switch (type) {
+	case hwmon_temp:
+		/* Don't show channels which are not enabled */
+		if (!(data->active_ch_mask & BIT(channel)))
+			return 0;
+
+		switch (attr) {
+		case hwmon_temp_min:
+		case hwmon_temp_max:
+		case hwmon_temp_crit:
+		case hwmon_temp_crit_hyst:
+			return 0644;
+		case hwmon_temp_crit_alarm:
+		case hwmon_temp_input:
+		case hwmon_temp_fault:
+		case hwmon_temp_max_alarm:
+		case hwmon_temp_max_hyst:
+		case hwmon_temp_min_alarm:
+			return 0444;
+		case hwmon_temp_label:
+			if (data->labels[channel])
+				return 0444;
+			return 0;
+		default:
+			return 0;
+		}
+	case hwmon_chip:
+		switch (attr) {
+		case hwmon_chip_update_interval:
+			return 0644;
+		default:
+			return 0;
+		}
+	default:
+		return 0;
+	}
+}
+
+static int emc1812_get_temp(struct emc1812_data *data, int channel, long *val)
+{
+	__be16 tmp_be16;
+	int ret;
+
+	ret = regmap_bulk_read(data->regmap, EMC1812_TEMP_CH_ADDR(channel),
+			       &tmp_be16, sizeof(tmp_be16));
+	if (ret)
+		return ret;
+
+	/* Range is always -64 to 191.875°C */
+	*val = ((be16_to_cpu(tmp_be16) >> 5) - (EMC1812_TEMP_OFFSET << 3)) * 125;
+
+	return 0;
+}
+
+static int emc1812_get_crit_limit_temp(struct emc1812_data *data, int channel, long *val)
+{
+	unsigned int tmp;
+	int ret;
+
+	/* Critical register is 8 bits wide and keeps only the integer part of the temperature */
+	ret = regmap_read(data->regmap, emc1812_temp_crit_regs[channel], &tmp);
+	if (ret)
+		return ret;
+
+	*val = tmp;
+	/* Range is always -64 to 191°C */
+	*val = (*val - EMC1812_TEMP_OFFSET) * 1000;
+
+	return 0;
+}
+
+static int emc1812_get_limit_temp(struct emc1812_data *data, int ch,
+				  enum emc1812_limit_type type, long *val)
+{
+	unsigned int regvalh;
+	unsigned int regvall = 0;
+	int ret;
+
+	ret = regmap_read(data->regmap, emc1812_limit_regs[ch][type], &regvalh);
+	if (ret < 0)
+		return ret;
+
+	if (ch) {
+		ret = regmap_read(data->regmap, emc1812_limit_regs_low[ch][type], &regvall);
+		if (ret < 0)
+			return ret;
+	}
+
+	/* Range is always -64 to 191.875°C */
+	*val = ((regvalh << 3) | (regvall >> 5));
+	*val = (*val - (EMC1812_TEMP_OFFSET << 3)) * 125;
+
+	return 0;
+}
+
+static int emc1812_read_reg(struct device *dev, struct emc1812_data *data, u32 attr,
+			    int channel, long *val)
+{
+	unsigned int hyst;
+	int ret;
+
+	switch (attr) {
+	case hwmon_temp_min:
+	case hwmon_temp_max:
+		return emc1812_get_limit_temp(data, channel, emc1812_temp_map[attr], val);
+	case hwmon_temp_crit:
+		return emc1812_get_crit_limit_temp(data, channel, val);
+	case hwmon_temp_input:
+		return emc1812_get_temp(data, channel, val);
+	case hwmon_temp_max_hyst:
+		ret = emc1812_get_limit_temp(data, channel, temp_max, val);
+		if (ret < 0)
+			return ret;
+
+		ret = regmap_read(data->regmap, EMC1812_THRM_HYS_ADDR, &hyst);
+		if (ret < 0)
+			return ret;
+
+		*val -= (long)hyst * 1000;
+
+		return 0;
+	case hwmon_temp_crit_hyst:
+		ret = emc1812_get_crit_limit_temp(data, channel, val);
+		if (ret < 0)
+			return ret;
+
+		ret = regmap_read(data->regmap, EMC1812_THRM_HYS_ADDR, &hyst);
+		if (ret < 0)
+			return ret;
+
+		*val -= (long)hyst * 1000;
+
+		return 0;
+	case hwmon_temp_min_alarm:
+		*val = regmap_test_bits(data->regmap, EMC1812_LOW_LIMIT_STATUS_ADDR,
+					BIT(channel));
+		if (*val < 0)
+			return *val;
+
+		return 0;
+	case hwmon_temp_max_alarm:
+		*val = regmap_test_bits(data->regmap, EMC1812_HIGH_LIMIT_STATUS_ADDR,
+					BIT(channel));
+		if (*val < 0)
+			return *val;
+
+		return 0;
+	case hwmon_temp_crit_alarm:
+		*val = regmap_test_bits(data->regmap, EMC1812_THERM_LIMIT_STATUS_ADDR,
+					BIT(channel));
+		if (*val < 0)
+			return *val;
+
+		return 0;
+	case hwmon_temp_fault:
+		*val = regmap_test_bits(data->regmap, EMC1812_EXT_DIODE_FAULT_STATUS_ADDR,
+					BIT(channel));
+		if (*val < 0)
+			return *val;
+
+		return 0;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static int emc1812_read(struct device *dev, enum hwmon_sensor_types type, u32 attr,
+			int channel, long *val)
+{
+	struct emc1812_data *data = dev_get_drvdata(dev);
+	unsigned int convrate;
+	int ret;
+
+	switch (type) {
+	case hwmon_temp:
+		return emc1812_read_reg(dev, data, attr, channel, val);
+	case hwmon_chip:
+		switch (attr) {
+		case hwmon_chip_update_interval:
+			ret = regmap_read(data->regmap, EMC1812_CONV_ADDR, &convrate);
+			if (ret < 0)
+				return ret;
+
+			if (convrate > 10)
+				convrate = 4;
+
+			*val = DIV_ROUND_CLOSEST(16000, 1 << convrate);
+			return 0;
+		default:
+			return -EOPNOTSUPP;
+		}
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static int emc1812_read_string(struct device *dev, enum hwmon_sensor_types type,
+			       u32 attr, int channel, const char **str)
+{
+	struct emc1812_data *data = dev_get_drvdata(dev);
+
+	if (channel >= data->chip->phys_channels)
+		return -EOPNOTSUPP;
+
+	switch (type) {
+	case hwmon_temp:
+		switch (attr) {
+		case hwmon_temp_label:
+			*str = data->labels[channel];
+			return 0;
+		default:
+			return -EOPNOTSUPP;
+		}
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static int emc1812_set_hyst(struct emc1812_data *data, int channel, int val)
+{
+	unsigned int limit;
+	int hyst, ret;
+
+	/* Critical register is 8 bits wide and keeps only the integer part of the temperature */
+	ret = regmap_read(data->regmap, emc1812_temp_crit_regs[channel], &limit);
+	if (ret)
+		return ret;
+
+	hyst = clamp_val((int)limit - val, 0, 255);
+
+	ret = regmap_write(data->regmap, EMC1812_THRM_HYS_ADDR, hyst);
+
+	return ret;
+}
+
+static int emc1812_set_temp(struct emc1812_data *data, int channel,
+			    enum emc1812_limit_type map, int val)
+{
+	unsigned int valh, vall;
+	u8 regh, regl;
+	int ret;
+
+	regh = emc1812_limit_regs[channel][map];
+	regl = emc1812_limit_regs_low[channel][map];
+
+	if (channel) {
+		val = DIV_ROUND_CLOSEST(val, 125);
+		valh = (val >> 3) & 0xff;
+		vall = (val & 0x07) << 5;
+	} else {
+		/* The temperature limit for the internal channel is stored in 8 bits */
+		valh = DIV_ROUND_CLOSEST(val, 1000);
+		valh = clamp_val(valh, 0, 255);
+	}
+
+	ret = regmap_write(data->regmap, regh, valh);
+	if (ret < 0)
+		return ret;
+
+	if (channel)
+		ret = regmap_write(data->regmap, regl, vall);
+
+	return ret;
+}
+
+static int emc1812_write(struct device *dev, enum hwmon_sensor_types type, u32 attr,
+			 int channel, long val)
+{
+	struct emc1812_data *data = dev_get_drvdata(dev);
+	unsigned int interval, tmp;
+
+	switch (type) {
+	case hwmon_temp:
+		/* Clamp to -64000..191875 m°C, then add the register offset */
+		val = clamp_val(val, -64000, 191875);
+		val = val + (EMC1812_TEMP_OFFSET * 1000);
+
+		switch (attr) {
+		case hwmon_temp_min:
+		case hwmon_temp_max:
+			return emc1812_set_temp(data, channel, emc1812_temp_map[attr], val);
+		case hwmon_temp_crit:
+			/* Critical temperature limit is stored in 8 bits */
+			val = DIV_ROUND_CLOSEST(val, 1000);
+			tmp = clamp_val(val, 0, 255);
+			return regmap_write(data->regmap, emc1812_temp_crit_regs[channel], tmp);
+		case hwmon_temp_crit_hyst:
+			/* Critical temperature hysteresis is stored in 8 bits */
+			val = DIV_ROUND_CLOSEST(val, 1000);
+			tmp = clamp_val(val, 0, 255);
+			return emc1812_set_hyst(data, channel, tmp);
+		default:
+			return -EOPNOTSUPP;
+		}
+	case hwmon_chip:
+		switch (attr) {
+		case hwmon_chip_update_interval:
+			interval = clamp_val(val, 0, 16000);
+			tmp = find_closest_descending(interval, emc1812_conv_time,
+						      ARRAY_SIZE(emc1812_conv_time));
+			return regmap_write(data->regmap, EMC1812_CONV_ADDR, tmp);
+		default:
+			return -EOPNOTSUPP;
+		}
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static int emc1812_init(struct emc1812_data *priv)
+{
+	int i, ret;
+	u8 val;
+
+	ret = regmap_write(priv->regmap, EMC1812_THRM_HYS_ADDR, 0x0A);
+	if (ret)
+		return ret;
+
+	ret = regmap_write(priv->regmap, EMC1812_CONSEC_ALERT_ADDR, 0x70);
+	if (ret)
+		return ret;
+
+	ret = regmap_write(priv->regmap, EMC1812_FILTER_SEL_ADDR, 0);
+	if (ret)
+		return ret;
+
+	ret = regmap_write(priv->regmap, EMC1812_HOTTEST_CFG_ADDR, 0);
+	if (ret)
+		return ret;
+
+	/* Enables the beta compensation factor auto-detection function for beta1 and beta2 */
+	ret = regmap_write(priv->regmap, EMC1812_EXT1_BETA_CONFIG_ADDR,
+			   EMC1812_BETA_LOCK_VAL);
+	if (ret)
+		return ret;
+
+	if (priv->chip->has_ext2_beta_reg) {
+		ret = regmap_write(priv->regmap, EMC1812_EXT2_BETA_CONFIG_ADDR,
+				   EMC1812_BETA_LOCK_VAL);
+		if (ret)
+			return ret;
+	}
+
+	for (i = 0; i < priv->chip->phys_channels; i++) {
+		if (!test_bit(i, &priv->active_ch_mask))
+			continue;
+
+		/* Update the max temperature limit for extended temperature range. */
+		ret = emc1812_set_temp(priv, i, emc1812_temp_map[hwmon_temp_max],
+				       EMC1812_HIGH_LIMIT_DEFAULT * 1000);
+		if (ret)
+			return ret;
+
+		/* Update the critical temperature limit for extended temperature range. */
+		ret = regmap_write(priv->regmap, emc1812_temp_crit_regs[i],
+				   EMC1812_HIGH_LIMIT_DEFAULT);
+		if (ret)
+			return ret;
+
+		/* Set the ideality factor */
+		if (i > 0) {
+			ret = regmap_write(priv->regmap, emc1812_ideality_regs[i],
+					   EMC1812_DEFAULT_IDEALITY_FACTOR);
+			if (ret)
+				return ret;
+		}
+	}
+
+	/*
+	 * Set default values in registers. APDD, RECD12 and RECD34 are active low
+	 * (the feature is enabled when the bit is 0).
+	 * Put the device in the Run (Active) state, converting on all
+	 * channels.
+	 * Don't change the conversion rate; the reset default is 4 conversions/second.
+	 * The temperature measurement range is -64°C to +191.875°C.
+	 * Set the ALERT/THERM2 pin to comparator mode. While the pin is asserted in
+	 * comparator mode, the corresponding High Limit Status bits are set, and
+	 * reading them does not clear them.
+	 * Once the pin is deasserted, the status bits are cleared automatically.
+	 */
+	val = FIELD_PREP(EMC1812_CFG_MSKAL, 0) |
+	      FIELD_PREP(EMC1812_CFG_RS, 0) |
+	      FIELD_PREP(EMC1812_CFG_ATTHM, 1) |
+	      FIELD_PREP(EMC1812_CFG_RECD12, !priv->recd12_en) |
+	      FIELD_PREP(EMC1812_CFG_RECD34, !priv->recd34_en) |
+	      FIELD_PREP(EMC1812_CFG_RANGE, 1) |
+	      FIELD_PREP(EMC1812_CFG_DA_ENA, 0) |
+	      FIELD_PREP(EMC1812_CFG_APDD, !priv->apdd_en);
+
+	return regmap_write(priv->regmap, EMC1812_CFG_ADDR, val);
+}
+
+static int emc1812_parse_fw_config(struct emc1812_data *data, struct device *dev)
+{
+	unsigned int reg_nr = 0;
+	int ret;
+
+	/* Allow the driver to load when there is no firmware node (e.g. no device tree) */
+	if (!dev_fwnode(dev)) {
+		data->active_ch_mask = BIT(data->chip->phys_channels) - 1;
+		return 0;
+	}
+
+	data->apdd_en = device_property_read_bool(dev, "microchip,enable-anti-parallel");
+	data->recd12_en = device_property_read_bool(dev, "microchip,parasitic-res-on-channel1-2");
+	data->recd34_en = device_property_read_bool(dev, "microchip,parasitic-res-on-channel3-4");
+
+	/* Internal temperature channel is always active */
+	data->labels[reg_nr] = "internal_diode";
+	set_bit(reg_nr, &data->active_ch_mask);
+
+	device_for_each_child_node_scoped(dev, child) {
+		ret = fwnode_property_read_u32(child, "reg", &reg_nr);
+		if (ret || reg_nr >= data->chip->phys_channels)
+			return dev_err_probe(dev, -EINVAL,
+					     "The index is higher than the chip supports\n");
+		/* Mark channel as active */
+		set_bit(reg_nr, &data->active_ch_mask);
+
+		fwnode_property_read_string(child, "label", &data->labels[reg_nr]);
+	}
+
+	return 0;
+}
+
+static int emc1812_chip_identify(struct emc1812_data *data, struct i2c_client *client)
+{
+	const struct emc1812_features *chip;
+	struct device *dev = &client->dev;
+	unsigned int tmp;
+	int ret;
+
+	ret = regmap_read(data->regmap, EMC1812_PRODUCT_ID_ADDR, &tmp);
+	if (ret)
+		return ret;
+
+	switch (tmp) {
+	case EMC1812_PID:
+		data->chip = &emc1812_chip_config;
+		break;
+	case EMC1813_PID:
+		data->chip = &emc1813_chip_config;
+		break;
+	case EMC1814_PID:
+		data->chip = &emc1814_chip_config;
+		break;
+	case EMC1815_PID:
+		data->chip = &emc1815_chip_config;
+		break;
+	case EMC1833_PID:
+		data->chip = &emc1833_chip_config;
+		break;
+	default:
+		/*
+		 * If the hardware cannot be identified from its internal registers,
+		 * fall back to the compatible from the device tree to handle
+		 * newer part numbers.
+		 */
+		chip = i2c_get_match_data(client);
+		if (!chip)
+			return -ENODEV;
+
+		dev_warn(dev, "Unrecognized hardware ID 0x%x, using %s from devicetree data\n",
+			 tmp, chip->name);
+
+		data->chip = chip;
+
+		return 0;
+	}
+
+	return 0;
+}
+
+static const struct hwmon_ops emc1812_ops = {
+	.is_visible = emc1812_is_visible,
+	.read = emc1812_read,
+	.read_string = emc1812_read_string,
+	.write = emc1812_write,
+};
+
+static const struct hwmon_chip_info emc1812_chip_info = {
+	.ops = &emc1812_ops,
+	.info = emc1812_info,
+};
+
+static int emc1812_probe(struct i2c_client *client)
+{
+	struct device *dev = &client->dev;
+	struct emc1812_data *data;
+	struct device *hwmon_dev;
+	int ret;
+
+	data = devm_kzalloc(dev, sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	data->regmap = devm_regmap_init_i2c(client, &emc1812_regmap_config);
+	if (IS_ERR(data->regmap))
+		return dev_err_probe(dev, PTR_ERR(data->regmap),
+				     "Cannot initialize register map\n");
+
+	ret = emc1812_chip_identify(data, client);
+	if (ret)
+		return dev_err_probe(dev, ret, "Chip identification failed\n");
+
+	ret = emc1812_parse_fw_config(data, dev);
+	if (ret)
+		return ret;
+
+	ret = emc1812_init(data);
+	if (ret)
+		return dev_err_probe(dev, ret, "Cannot initialize device\n");
+
+	hwmon_dev = devm_hwmon_device_register_with_info(dev, client->name, data,
+							 &emc1812_chip_info, NULL);
+
+	return PTR_ERR_OR_ZERO(hwmon_dev);
+}
+
+static const struct i2c_device_id emc1812_id[] = {
+	{ .name = "emc1812", .driver_data = (kernel_ulong_t)&emc1812_chip_config },
+	{ .name = "emc1813", .driver_data = (kernel_ulong_t)&emc1813_chip_config },
+	{ .name = "emc1814", .driver_data = (kernel_ulong_t)&emc1814_chip_config },
+	{ .name = "emc1815", .driver_data = (kernel_ulong_t)&emc1815_chip_config },
+	{ .name = "emc1833", .driver_data = (kernel_ulong_t)&emc1833_chip_config },
+	{ }
+};
+MODULE_DEVICE_TABLE(i2c, emc1812_id);
+
+static const struct of_device_id emc1812_of_match[] = {
+	{
+		.compatible = "microchip,emc1812",
+		.data = &emc1812_chip_config
+	},
+	{
+		.compatible = "microchip,emc1813",
+		.data = &emc1813_chip_config
+	},
+	{
+		.compatible = "microchip,emc1814",
+		.data = &emc1814_chip_config
+	},
+	{
+		.compatible = "microchip,emc1815",
+		.data = &emc1815_chip_config
+	},
+	{
+		.compatible = "microchip,emc1833",
+		.data = &emc1833_chip_config
+	},
+	{ }
+};
+MODULE_DEVICE_TABLE(of, emc1812_of_match);
+
+static struct i2c_driver emc1812_driver = {
+	.driver	 = {
+		.name = "emc1812",
+		.of_match_table = emc1812_of_match,
+	},
+	.probe = emc1812_probe,
+	.id_table = emc1812_id,
+};
+module_i2c_driver(emc1812_driver);
+
+MODULE_AUTHOR("Marius Cristea <marius.cristea@microchip.com>");
+MODULE_DESCRIPTION("EMC1812/13/14/15/33 high-accuracy remote diode temperature monitor driver");
+MODULE_LICENSE("GPL");

-- 
2.53.0
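
As a reading aid for the conversions in emc1812_get_temp() and
emc1812_set_temp() above, here is a minimal user-space sketch of the same
arithmetic. It assumes what the driver code implies: an 11-bit value spread
over a high byte and the top three bits of a low byte, 0.125°C per step, and
a fixed 64°C offset. The names emc_raw_to_mdeg() and emc_mdeg_to_raw() are
hypothetical and are not part of the patch.

```c
#include <stdint.h>

#define EMC_TEMP_OFFSET 64 /* same value as EMC1812_TEMP_OFFSET in the patch */

/* Decode a high/low register pair into millidegrees Celsius. */
static long emc_raw_to_mdeg(uint8_t high, uint8_t low)
{
	/* 11 significant bits: 8 from the high byte, top 3 of the low byte. */
	long raw = ((long)high << 3) | (low >> 5);

	/* Remove the 64 degree offset (in 0.125 degree steps), scale to m°C. */
	return (raw - (EMC_TEMP_OFFSET << 3)) * 125;
}

/* Encode millidegrees Celsius into a high/low limit register pair. */
static void emc_mdeg_to_raw(long mdeg, uint8_t *high, uint8_t *low)
{
	long raw;

	/* Clamp to the range the driver accepts, then add the offset. */
	if (mdeg < -64000)
		mdeg = -64000;
	if (mdeg > 191875)
		mdeg = 191875;
	mdeg += EMC_TEMP_OFFSET * 1000L;

	/* Round to the nearest 0.125 degree step (mdeg is non-negative here). */
	raw = (mdeg + 62) / 125;
	*high = (raw >> 3) & 0xff;
	*low = (raw & 0x07) << 5;
}
```

With this encoding a raw pair of 0x40/0x00 corresponds to 0°C, and the
85°C default limit used in emc1812_init() ends up as 149 (85 + 64) in the
high byte.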



* [PATCH v11 1/2] dt-bindings: hwmon: temperature: add support for EMC1812
From: Marius Cristea @ 2026-06-10 15:19 UTC (permalink / raw)
  To: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Jonathan Corbet
  Cc: linux-hwmon, devicetree, linux-kernel, linux-doc, Marius Cristea
In-Reply-To: <20260610-hw_mon-emc1812-v11-0-cef809af5c19@microchip.com>

Add the devicetree schema for the Microchip EMC1812/13/14/15/33
Multichannel Low-Voltage Remote Diode Sensor Family. Also update the
MAINTAINERS file to include the new driver.

EMC1812 has one external remote temperature monitoring channel.
EMC1813 has two external remote temperature monitoring channels.
EMC1814 has three external remote temperature monitoring channels;
channels 2 and 3 support anti-parallel diodes.
EMC1815 has four external remote temperature monitoring channels;
channels 1/2 and 3/4 support anti-parallel diodes.
EMC1833 has two external remote temperature monitoring channels;
channels 1 and 2 support anti-parallel diodes.
Resistance Error Correction is supported on channels 1/2 and 3/4.

Signed-off-by: Marius Cristea <marius.cristea@microchip.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
---
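The per-chip differences listed in the commit message can be summarised as
a small lookup table. The C sketch below is illustrative only: struct
emc_variant, emc_variants[], emc_find_variant() and emc_channel_valid() are
hypothetical names, not taken from the driver, and the table only restates
the channel counts and anti-parallel diode support given above. Channel 0
is assumed to be the internal sensor, as in the binding.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct emc_variant {
	const char *name;
	unsigned int ext_channels;	/* external remote diode channels */
	bool apd;			/* anti-parallel diode support */
};

static const struct emc_variant emc_variants[] = {
	{ "emc1812", 1, false },
	{ "emc1813", 2, false },
	{ "emc1814", 3, true },
	{ "emc1815", 4, true },
	{ "emc1833", 2, true },
};

/* Look up a variant by name; returns NULL for an unknown part. */
static const struct emc_variant *emc_find_variant(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(emc_variants) / sizeof(emc_variants[0]); i++)
		if (strcmp(emc_variants[i].name, name) == 0)
			return &emc_variants[i];

	return NULL;
}

/* Channel 0 is the internal sensor; 1..ext_channels are external diodes. */
static bool emc_channel_valid(const struct emc_variant *v, unsigned int reg)
{
	return v && reg <= v->ext_channels;
}
```

For example, a channel@3 node would be accepted for "emc1814" but rejected
for "emc1813", which matches the per-compatible restrictions in the schema.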
 .../bindings/hwmon/microchip,emc1812.yaml          | 193 +++++++++++++++++++++
 MAINTAINERS                                        |   6 +
 2 files changed, 199 insertions(+)

diff --git a/Documentation/devicetree/bindings/hwmon/microchip,emc1812.yaml b/Documentation/devicetree/bindings/hwmon/microchip,emc1812.yaml
new file mode 100644
index 000000000000..1a273621db82
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/microchip,emc1812.yaml
@@ -0,0 +1,193 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/microchip,emc1812.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Microchip EMC1812/13/14/15/33 multichannel temperature sensor
+
+maintainers:
+  - Marius Cristea <marius.cristea@microchip.com>
+
+description: |
+  The Microchip EMC1812/13/14/15/33 is a high-accuracy 2-wire multichannel
+  low-voltage remote diode temperature monitor.
+
+  The datasheet can be found here:
+    https://ww1.microchip.com/downloads/aemDocuments/documents/MSLD/ProductDocuments/DataSheets/EMC1812-3-4-5-33-Data-Sheet-DS20005751.pdf
+
+  EMC1812 has one external remote temperature monitoring channel
+  EMC1813 has two external remote temperature monitoring channels
+  EMC1814 has three external remote temperature monitoring channels and
+    channels 2 and 3 support anti-parallel diodes
+  EMC1815 has four external remote temperature monitoring channels and
+    channels 1/2 and 3/4 support anti-parallel diodes
+  EMC1833 has two external remote temperature monitoring channels and
+    channels 1 and 2 support anti-parallel diodes
+
+properties:
+  compatible:
+    enum:
+      - microchip,emc1812
+      - microchip,emc1813
+      - microchip,emc1814
+      - microchip,emc1815
+      - microchip,emc1833
+
+  reg:
+    maxItems: 1
+
+  interrupts:
+    items:
+      - description: alert-therm2 asserts when the ALERT limit is exceeded.
+      - description: therm-addr asserts when the THERM limit is exceeded.
+    minItems: 1
+
+  interrupt-names:
+    items:
+      - const: alert-therm2
+      - const: therm-addr
+    minItems: 1
+
+  "#address-cells":
+    const: 1
+
+  "#size-cells":
+    const: 0
+
+  microchip,enable-anti-parallel:
+    description:
+      Enable anti-parallel diode mode operation. EMC1814, EMC1815 and EMC1833
+      support reading two external diodes in anti-parallel connection on the
+      same set of pins. Disabling APD functionality to implement substrate
+      diodes on devices that support APD eliminates the benefit of APD
+      (two diodes on one channel).
+    type: boolean
+
+  microchip,parasitic-res-on-channel1-2:
+    description:
+      Indicates that the chip and the diodes/transistors are sufficiently
+      far apart that a parasitic resistance is added to the wires, which can
+      affect the measurements. Due to the availability of only a single
+      configuration bit in hardware, channels 1 and 2 are affected together.
+      If channel 2 is not available in hardware, this setting affects only
+      channel 1.
+    type: boolean
+
+  microchip,parasitic-res-on-channel3-4:
+    description:
+      Indicates that the chip and the diodes/transistors are sufficiently
+      far apart that a parasitic resistance is added to the wires, which can
+      affect the measurements. Due to the availability of only a single
+      configuration bit in hardware, channels 3 and 4 are affected together.
+      If channel 4 is not available in hardware, this setting affects only
+      channel 3.
+    type: boolean
+
+  vdd-supply: true
+
+patternProperties:
+  "^channel@[0-4]$":
+    description: |
+      Represents the temperature channels.
+      0: Internal sensor
+      1-4: External remote diodes
+    type: object
+
+    properties:
+      reg:
+        maxItems: 1
+
+      label:
+        description: Unique name to identify which channel this is.
+
+    required:
+      - reg
+
+    additionalProperties: false
+
+required:
+  - compatible
+  - reg
+  - vdd-supply
+
+allOf:
+  # EMC1812: 1 Internal, 1 External Channels, No APD,
+  # parasitic-res-on-channel1-2: for channel 1
+  - if:
+      properties:
+        compatible:
+          const: microchip,emc1812
+    then:
+      properties:
+        microchip,enable-anti-parallel: false
+        microchip,parasitic-res-on-channel3-4: false
+      patternProperties:
+        "^channel@[2-4]$": false
+
+  # EMC1813: 1 Internal, 2 External Channels, No APD,
+  # parasitic-res-on-channel1-2: on both channel 1 & 2
+  - if:
+      properties:
+        compatible:
+          const: microchip,emc1813
+    then:
+      properties:
+        microchip,enable-anti-parallel: false
+        microchip,parasitic-res-on-channel3-4: false
+      patternProperties:
+        "^channel@[3-4]$": false
+
+  # EMC1833: 1 Internal, 2 External Channels, Supports APD,
+  # parasitic-res-on-channel1-2: on both channel 1 & 2
+  - if:
+      properties:
+        compatible:
+          const: microchip,emc1833
+    then:
+      properties:
+        microchip,parasitic-res-on-channel3-4: false
+      patternProperties:
+        "^channel@[3-4]$": false
+
+  # EMC1814: 1 Internal, 3 External Channels, Supports APD,
+  # parasitic-res-on-channel1-2: on both channel 1 & 2
+  # parasitic-res-on-channel3-4: for channel 3
+  - if:
+      properties:
+        compatible:
+          const: microchip,emc1814
+    then:
+      properties:
+        channel@4: false
+
+unevaluatedProperties: false
+
+examples:
+  - |
+    i2c {
+        #address-cells = <1>;
+        #size-cells = <0>;
+
+        temperature-sensor@4c {
+            compatible = "microchip,emc1813";
+            reg = <0x4c>;
+
+            #address-cells = <1>;
+            #size-cells = <0>;
+
+            microchip,parasitic-res-on-channel1-2;
+
+            vdd-supply = <&vdd>;
+
+            channel@1 {
+                reg = <1>;
+                label = "External CH1 Temperature";
+            };
+
+            channel@2 {
+                reg = <2>;
+                label = "External CH2 Temperature";
+            };
+        };
+    };
diff --git a/MAINTAINERS b/MAINTAINERS
index 6d7b697bfdba..85c236df781e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16646,6 +16646,12 @@ S:	Supported
 F:	Documentation/devicetree/bindings/interrupt-controller/microchip,sama7g5-eic.yaml
 F:	drivers/irqchip/irq-mchp-eic.c
 
+MICROCHIP EMC1812 DRIVER
+M:	Marius Cristea <marius.cristea@microchip.com>
+L:	linux-hwmon@vger.kernel.org
+S:	Supported
+F:	Documentation/devicetree/bindings/hwmon/microchip,emc1812.yaml
+
 MICROCHIP I2C DRIVER
 M:	Codrin Ciubotariu <codrin.ciubotariu@microchip.com>
 L:	linux-i2c@vger.kernel.org

-- 
2.53.0



* [PATCH v11 0/2] Add support for Microchip EMC1812
From: Marius Cristea @ 2026-06-10 15:19 UTC (permalink / raw)
  To: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Jonathan Corbet
  Cc: linux-hwmon, devicetree, linux-kernel, linux-doc, Marius Cristea

This is the hwmon driver for the EMC1812/13/14/15/33 multichannel
Low-Voltage Remote Diode Sensor Family. Each chip in the family has one
internal channel and a varying number of external channels, ranging from
1 (EMC1812) to 4 (EMC1815).
Reading diodes in anti-parallel connection is supported by EMC1814, EMC1815
and EMC1833.
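
As a rough illustration of the family layout described above, the per-chip
differences could be captured in a small lookup table. This is only a sketch:
the struct, table and helper names (emc18xx_features, emc18xx_chips,
emc18xx_lookup, emc18xx_channel_valid) are hypothetical and are not taken from
the driver; the channel counts and APD support follow the binding comments and
the text above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-chip description; all names here are illustrative only. */
struct emc18xx_features {
	const char *name;
	unsigned int ext_channels;	/* number of external diode channels */
	bool has_apd;			/* anti-parallel diode mode supported */
};

static const struct emc18xx_features emc18xx_chips[] = {
	{ "emc1812", 1, false },
	{ "emc1813", 2, false },
	{ "emc1814", 3, true },
	{ "emc1815", 4, true },
	{ "emc1833", 2, true },
};

/* Return the feature entry for a chip name, or NULL if it is unknown. */
static const struct emc18xx_features *emc18xx_lookup(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(emc18xx_chips) / sizeof(emc18xx_chips[0]); i++)
		if (strcmp(emc18xx_chips[i].name, name) == 0)
			return &emc18xx_chips[i];
	return NULL;
}

/* Channel 0 is the internal sensor; 1..ext_channels are external diodes. */
static bool emc18xx_channel_valid(const struct emc18xx_features *chip,
				  unsigned int channel)
{
	return channel <= chip->ext_channels;
}
```

With such a table, a request for channel 2 on an EMC1812 would be rejected,
while channel 0 (the internal sensor) is accepted on every part.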

Signed-off-by: Marius Cristea <marius.cristea@microchip.com>
---
Changes in v11:
- remove unnecessary check for channels which are not physically available
- fix pointer signedness mismatch warning
- fix off-by-one misalignment when setting IDEALITY_FACTOR
- update the max temperature and critical temperature limit to match the
  extended temperature range
- Link to v10: https://lore.kernel.org/r/20260429-hw_mon-emc1812-v10-0-a8ca1d779502@microchip.com

Changes in v10:
- make the comments in the devicetree binding clearer
- allow channel 0 (internal channel) in the devicetree binding
- allow the default name for Channel 0 to be overridden by the Device Tree property
- translate temperature limits to support the hardware's extended temperature range
- update channel count validation to properly account for the internal channel
- return -EOPNOTSUPP if channel is greater than or equal to phys_channels
- Link to v9: https://lore.kernel.org/r/20260403-hw_mon-emc1812-v9-0-1a798f31cf2e@microchip.com

Changes in v9:
- improve the wording in the Documentation/hwmon/emc1812.rst file
- add const to variables in the driver
- initialize the EXT2_BETA_CONFIG only for the parts that support it
- update the writable regmap table to exclude read-only registers
- Link to v8: https://lore.kernel.org/r/20260310-hw_mon-emc1812-v8-0-bc155727e0d2@microchip.com

Changes in v8:
- remove "address scan" from emc1812.rst documentation
- change the second dimension of emc1812_limit_regs_low[][] to 2
- clamp input value before doing math on it to avoid overflow
- use rounding instead of truncation for 8 bits limit registers
- fix misleading comment when HW ID is not recognized
- Link to v7: https://lore.kernel.org/r/20260223-hw_mon-emc1812-v7-0-51e2676f4e20@microchip.com
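
The v8 items "clamp input value before doing math on it" and "use rounding
instead of truncation" can be sketched roughly as below. This is not the
driver's code: the function name ex_temp_to_reg8, the 64 degree C offset and
the 1 degree C per LSB scaling are assumptions made for illustration only.

```c
#include <stdint.h>

/* Assumed encoding for illustration: reg = temp in degrees C + 64. */
#define EX_TEMP_OFFSET_MC	64000L
#define EX_TEMP_MIN_MC		(-64000L)	/* maps to register value 0 */
#define EX_TEMP_MAX_MC		191000L		/* maps to register value 255 */

/* Convert millidegrees Celsius to a hypothetical 8-bit limit register. */
static uint8_t ex_temp_to_reg8(long temp_mc)
{
	/* Clamp first, so the additions below cannot overflow a long. */
	if (temp_mc < EX_TEMP_MIN_MC)
		temp_mc = EX_TEMP_MIN_MC;
	if (temp_mc > EX_TEMP_MAX_MC)
		temp_mc = EX_TEMP_MAX_MC;

	/*
	 * After adding the offset the value is in 0..255000, so adding
	 * half an LSB (500) before dividing rounds to the nearest degree.
	 */
	return (uint8_t)((temp_mc + EX_TEMP_OFFSET_MC + 500) / 1000);
}
```

Clamping before the arithmetic keeps arbitrarily large user input from
wrapping, and rounding keeps a value such as 25500 from being stored one
degree low.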

Changes in v7:
- driver
  - fix an overflow in emc1812_set_hyst
  - remove unused parameter in emc1812_set_temp
- devicetree binding:
  - remove unneeded restrictions to avoid bloating the binding
- Link to v6: https://lore.kernel.org/r/20260212-hw_mon-emc1812-v6-0-e37e9b38d898@microchip.com

Changes in v6:
- driver
  - fix an overflow when writing more than 191875 to limits stored in
    8-bit registers
  - remove "i2c_set_clientdata" from probe
  - fix discrepancy where writing 16ms and reading it back returns 15ms
    at update interval
  - skip setting the ideality factor for channels that are not available
    on the device
- devicetree binding:
  - change the way interrupts are described/used
  - add "microchip,enable-anti-parallel"
  - rewrite "allOf" section to be more clear
- Link to v5: https://lore.kernel.org/r/20260205-hw_mon-emc1812-v5-0-232835aefe8f@microchip.com

Changes in v5:
- fix calculation in emc1812_get_limit_temp
- use i2c_get_match_data to cover the case when the driver is instantiated
  via the I2C ID table
- replace dev_info with dev_warn
- remove some unnecessary truncation on 8 bits
- remove clamping when reading the temperature with hyst
- do not change the conversion rate at probe time
- use a generic define to remove duplicate channel_info entries
- Link to v4: https://lore.kernel.org/r/20260127-hw_mon-emc1812-v4-0-6bf636b54847@microchip.com

Changes in v4:
- fix file permissions for read only properties
- fix calculation when the limits are written
- remove the temp_min_hyst because the part doesn't support it
- Link to v3: https://lore.kernel.org/r/20251218-hw_mon-emc1812-v3-0-a123ada7b859@microchip.com

Changes in v3:
- remove messages that are not helpful
- fix an issue related to NULL labels
- fix signed/unsigned calculation
- replace E2BIG with EINVAL
- use BIT() to create mask
- Link to v2: https://lore.kernel.org/r/20251121-hw_mon-emc1812-v2-0-5b2070f8b778@microchip.com

Changes in v2:
- update the interrupt section from yaml file
- update index.rst
- remove fault condition from internal sensor
- remove unused members from structures
- update the driver to work on systems without device tree or
  firmware nodes
- add missing include files
- make NULL labels not visible
- correct signed/unsigned calculations
- correct possible underflow for limits
- Link to v1: https://lore.kernel.org/r/20251029-hw_mon-emc1812-v1-0-be4fd8af016a@microchip.com

---
Marius Cristea (2):
      dt-bindings: hwmon: temperature: add support for EMC1812
      hwmon: temperature: add support for EMC1812

 .../bindings/hwmon/microchip,emc1812.yaml          | 193 +++++
 Documentation/hwmon/emc1812.rst                    |  67 ++
 Documentation/hwmon/index.rst                      |   1 +
 MAINTAINERS                                        |   8 +
 drivers/hwmon/Kconfig                              |  11 +
 drivers/hwmon/Makefile                             |   1 +
 drivers/hwmon/emc1812.c                            | 965 +++++++++++++++++++++
 7 files changed, 1246 insertions(+)
---
base-commit: d2b2fea3503e5e12b2e28784152937e48bcca6ff
change-id: 20251002-hw_mon-emc1812-f1b806487d10

Best regards,
-- 
Marius Cristea <marius.cristea@microchip.com>

