From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id CC1D9CE7B04 for ; Sat, 7 Sep 2024 01:32:39 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id BA65D6B007B; Fri, 6 Sep 2024 21:32:38 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id B56296B0083; Fri, 6 Sep 2024 21:32:38 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A1D846B0085; Fri, 6 Sep 2024 21:32:38 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 82BE36B007B for ; Fri, 6 Sep 2024 21:32:38 -0400 (EDT) Received: from smtpin23.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id DF9B3A149D for ; Sat, 7 Sep 2024 01:32:37 +0000 (UTC) X-FDA: 82536217554.23.00D7C32 Received: from out-183.mta1.migadu.com (out-183.mta1.migadu.com [95.215.58.183]) by imf15.hostedemail.com (Postfix) with ESMTP id EC555A0012 for ; Sat, 7 Sep 2024 01:32:35 +0000 (UTC) Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=Ahk9hvik; spf=pass (imf15.hostedemail.com: domain of martin.lau@linux.dev designates 95.215.58.183 as permitted sender) smtp.mailfrom=martin.lau@linux.dev; dmarc=pass (policy=none) header.from=linux.dev ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1725672657; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=UYzKAnd9omi958b/3GtTF043/RrnCzAxquLlADIrGW8=; b=8VCma5I7Frvg5kV8A/UKmI/l+CnnqZbCdwJ2WY5z/dyt9fAZJet7GuJrkPM5PshOfs9AQn /NHHRqbkKwpMJg16pVpgRb2DaGkRsElY8tkjVa2l31DXf0BWe907IKd0NvLbviXvVFqitM 2DXd5jbr+GhVQ+N1Yzmi4LeXHBWDDCI= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1725672657; a=rsa-sha256; cv=none; b=qGORqs8c9CtChnPnwY5OpnZOEn1DkSn+u703pq98KXvicfDgJXpu4KiSL4sddf/20bJeV2 CSj7NCvc59cMNZL99xK7mK6bSPGMsOL0aJJAzL9G61hYCBwe76OCwMdGW56r7s4b3O6J95 5f/v1jhO88HPfho823Knq4yDHRaqVxE= ARC-Authentication-Results: i=1; imf15.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=Ahk9hvik; spf=pass (imf15.hostedemail.com: domain of martin.lau@linux.dev designates 95.215.58.183 as permitted sender) smtp.mailfrom=martin.lau@linux.dev; dmarc=pass (policy=none) header.from=linux.dev Message-ID: <8b61f093-a6a6-4f99-91f8-20f2a7235d76@linux.dev> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1725672753; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=UYzKAnd9omi958b/3GtTF043/RrnCzAxquLlADIrGW8=; b=Ahk9hvikwIGZrJz/kyFqSEQE2HKP1rdET/PhJ4QUvLIHyjBitMQ1yyS8TutzJo/g/FOjo5 fzYku0l+ffe88lqsH0nrMQWEdE3sWr6G50w6+Sly5UArPqhmDCUlzUR8xOqvtXV1ouq8Oy 6piP3ZY2mOS3iOGipCTNiXXFOIDgvKs= Date: Fri, 6 Sep 2024 18:32:24 -0700 MIME-Version: 1.0 Subject: Re: [RFC bpf-next v4 4/6] bpf: pin, translate, and unpin __uptr from syscalls. To: Alexei Starovoitov Cc: Kui-Feng Lee , bpf , Alexei Starovoitov , Andrii Nakryiko , Kui-Feng Lee , linux-mm References: <20240816191213.35573-1-thinker.li@gmail.com> <20240816191213.35573-5-thinker.li@gmail.com> <70a1b24f-84cd-464c-8fb6-a2c52fd3d703@linux.dev> Content-Language: en-US X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Martin KaFai Lau In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT X-Rspam-User: X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: EC555A0012 X-Stat-Signature: dc55mafc69rdiadhf64pbyg41aj6bnbe X-HE-Tag: 1725672755-95019 X-HE-Meta: U2FsdGVkX19CaVLJHvhD0CG6hk0bL98ERB4ImkCe3Lk6xjYDNDlYZlYAe+VVkmHDvM6eEOhIc10p9qLwfZ2wHtC7oAH0XEJIcrL8ffI0eNMjy592iA7E62orn1QSi0SZjiHrn0G4jewU9t9ZfFQuW/GaWz9ZDrb+Bt/QyPAejzXATgFOZLZRP9wEtPnrSWVwvfyaGZrQFom6M0fN9r9VHN9cd/QXKSyp6Gwa9biEY7uYONQtn/8EyQ+JEYJISCmUsg73/qWayAhWI+D3NcVdGPMzT95Scb9Vm9ywo1c32ZlufWx7CRkUTuzWuZDyZQGufioPve1ns91c6hM4h/akH3Iry2DwJ54lwdwdmsoR1/WNXwYiY/RFxVzjXvUyoF8bl6+vRyu/GYo/jlKBGuSC7k0kfrasah6g5pqh4WiZzcJigV0Wy9HBd9NN4CuSfZA7dsxbQ9vv60qZrREQqFTbOdOpMEUOVfggbkEW96vOeDrXM/jjFlaIRKCnBmGKI7W1duUWyb49Cj2Vp3pleGbNVdyCDL9nd89Js0mCa2xw4D5d8MiPK0cJ6yIEKNNVN4xyN4GfPCjE8HpOuL3Rf7mNqjEguA6qKswWyyVAPfvEVEaS37B2+6eglOby9JD5+wkHrK2RLkytgYWhXqK0LXUBN0ES3CTMjYSfWd2tT0PKZWZ5xAuL8f3IU8XvK/o95U+qvPSqFUHZCX6PgYpd/8xqrTXlGiZ95iTwLP2X3YClLngZTEVOvG248FLKKOrpg2fpiow0QM+0ZZHWb4uQwYAxd/SbQhPI4BZSTBB5nJLZAUGf3UdafpB4E+YvJqF36ICt5lj9WjX8GEkaCRO/9LZ19+WZ3o6jVlblEC//TRKhgLsbeEVPZmOU8oStfypHVrkMJxMtP2JmIMipXcuzqeG8rwjEqo6eT4EPIpFlaiFofxKvjeV5L/AsMdjftwFKnPehN1LBi1UhFc62DZesZlC TeKLgZrU +6/I+44fCZ5U21LJ3X/NdhWq+FSCbNajM3mfR9dLb3HrGeCTzKcsW4x/boeH5r9oBMn9SUIopR5fB+V7geu+Icd7uhkwkRXu1b6RpcRS9tcmaQRJidm5eJ7oEBZhvY0jB41qNqHtZ7sDnQjgRXP7GSAFNggHqPvJ+vundZEQIAFvhSqL6MsF0gtzSGdrC6l5ehCeLe3Etc7PXiuf6VQTVbdon3iiKmfsKyPC5xJq8TBioBXavzowZuZevtzb5v5+3FSDaoufs/pn8AoHsQRPVj1ewndY25Pho6eFh X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 9/6/24 4:44 PM, Alexei Starovoitov wrote: > On Fri, Sep 6, 2024 at 1:11 PM Martin KaFai Lau wrote: >> >> On 9/4/24 3:21 PM, Martin KaFai Lau wrote: >>> On 8/28/24 4:24 PM, Alexei Starovoitov wrote: >>>>> @@ -714,6 +869,11 @@ void bpf_obj_free_fields(const struct btf_record *rec, >>>>> void *obj) >>>>> field->kptr.dtor(xchgd_field); >>>>> } >>>>> break; >>>>> + case BPF_UPTR: >>>>> + if (*(void **)field_ptr) >>>>> + bpf_obj_unpin_uptr(field, *(void **)field_ptr); >>>>> + *(void **)field_ptr = NULL; >>>> This one will be called from >>>> task_storage_delete->bpf_selem_free->bpf_obj_free_fields >>>> >>>> and even if upin was safe to do from that context >>>> we cannot just do: >>>> *(void **)field_ptr = NULL; >>>> >>>> since bpf prog might be running in parallel, >>>> it could have just read that addr and now is using it. >>>> >>>> The first thought of a way to fix this was to split >>>> bpf_obj_free_fields() into the current one plus >>>> bpf_obj_free_fields_after_gp() >>>> that will do the above unpin bit. >>>> and call the later one from bpf_selem_free_rcu() >>>> while bpf_obj_free_fields() from bpf_selem_free() >>>> will not touch uptr. >>>> >>>> But after digging further I realized that task_storage >>>> already switched to use bpf_ma, so the above won't work. >>>> >>>> So we need something similar to BPF_KPTR_REF logic: >>>> xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0); >>>> and then delay of uptr unpin for that address into call_rcu. >>>> >>>> Any better ideas? >>> >> >> I think the existing reuse_now arg in the bpf_selem_free can be used. reuse_now >> (renamed from the earlier use_trace_rcu) was added to avoid call_rcu_tasks_trace >> for the common case. >> >> selem (in type "struct bpf_local_storage_elem") is the one exposed to the bpf prog. >> >> bpf_selem_free knows whether a selem can be reused immediately based on the >> caller. It is currently flagged in the reuse_now arg: "bpf_selem_free(...., bool >> reuse_now)". >> >> If a selem cannot be reuse_now (i.e. == false), it is currently going through >> "call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_trace_rcu)". We can do >> unpin_user_page() in the rcu callback. >> >> A selem can be reuse_now (i.e. == true) if the selem is no longer needed because >> either its owner (i.e. the task_struct here) is going away in free_task() or the >> bpf map is being destructed in bpf_local_storage_map_free(). No bpf prog should >> have a hold on the selem at this point. I think for these two cases, the >> unpin_user_page() can be directly called in bpf_selem_free(). > > but there is also this path: > bpf_task_storage_delete -> task_storage_delete -> bpf_selem_free > -> bpf_obj_free_fields > > In this case bpf prog may still be looking at uptr address > and we cannot do unpin right away in bpf_obj_free_fields. cannot unpin immediately in the bpf_task_storage_delete() path is understood. task_storage can be used in sleepable. It needs to wait for the tasks_trace and the regular rcu gp before unpin. I forgot to mention earlier that bpf_task_storage_delete() will have the bpf_selem_free(..., reuse_now == false). It will then do the "call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_trace_rcu);". The unpin could happen in bpf_selem_free_trace_rcu() in this case. I am suggesting to unpin in bpf_selem_free_trace_rcu together with the selem free. I just noticed the map and its btf_record are gone in bpf_selem_free_trace_rcu()... so won't work. :( > All other special fields in map value are ok, > since they are either relying on bpf_mem_alloc and > have rcu/rcu_tasks_trace gp > or extra indirection like timer/wq. > >> One existing bug is, from looking at patch 6, I don't think the free_task() case >> can be "reuse_now == true" anymore because of the bpf_task_release kfunc did not >> mark the previously obtained task_storage to be invalid: >> >> data_task = bpf_task_from_pid(parent_pid); >> ptr = bpf_task_storage_get(&datamap, data_task, 0, ...); >> bpf_task_release(data_task); >> if (!ptr) >> return 0; >> /* The prog still holds a valid task storage ptr. */ >> udata = ptr->udata; >> >> It can be fixed by marking the ref_obj_id of the "ptr". Although it is more >> correct to make the task storage "ptr" invalid after task_release, it may break >> the existing progs. > > Are you suggesting that bpf_task_release should invalidate all pointers > fetched from map value? I was thinking at least the map value ptr itself needs to be invalidated. > That will work, but it's not an issue for other special fields in there > like kptr. > So this invalidation would be need only for uptr which feels > weird to special case it and probably will be confusing to users writing > such programs. hmm... I haven't thought about the other pointer fields that read before the task_release(). Agreed, it is hard to use if only marks uptr invalid. Thinking about it. Even marking the map value ptr invalid while other previously read fields keep working is also the same weirdness. > Above bpf prog example should be ok to use. > We only need to delay unpin after rcu/rcu_task_trace gp. > Hence my proposal in bpf_obj_free_fields() do: > case UPTR: > xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0); > call_rcu(...) to unpin. Agree that call_rcu() here is the only option. It probably needs to go through the tasks_trace gp also. Can the page->rcu_head be used here?