Date: Wed, 19 Nov 2025 23:16:34 +0000
References: <20251112-binder-bitmap-v5-0-8b9d7c7eca82@google.com>
 <20251112-binder-bitmap-v5-6-8b9d7c7eca82@google.com>
Subject: Re: [PATCH v5 6/6] rust_binder: use bitmap for allocation of handles
From: Alice Ryhl
To: Yury Norov
Cc: Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen,
 Joel Fernandes, Christian Brauner, Carlos Llamas, Suren Baghdasaryan,
 Burak Emir, Miguel Ojeda, Boqun Feng, Gary Guo, Björn Roy Baron,
 Benno Lossin, Andreas Hindborg, Trevor Gross, Danilo Krummrich,
 rust-for-linux@vger.kernel.org, linux-kernel@vger.kernel.org

On Thu, Nov 13, 2025 at 12:50:37PM -0500, Yury Norov wrote:
> On Wed, Nov 12, 2025 at 11:35:07PM +0000, Alice Ryhl wrote:
> > On Wed, Nov 12, 2025 at 02:09:19PM -0500, Yury Norov wrote:
> > > On Wed, Nov 12, 2025 at 12:47:24PM +0000, Alice Ryhl wrote:
> > > > To find an unused Binder handle, Rust Binder currently iterates the
> > > > red/black tree from the beginning until it finds a gap in the keys. This
> > > > is extremely slow.
> > >
> > > Can you share performance numbers?
> >
> > I have not benchmarked it in the Rust driver, but it replaces
> > potentially thousands of calls to rb_next() with a single call to
> > find_unused_id(), so I'm feeling good about the performance. And the
>
> Feelings are good, but numbers are better. In the original dbitmap
> patch, Carlos shared the experiment details and rough numbers, and
> it's enough for me. Can you just reproduce his steps?

Unfortunately Carlos doesn't have his benchmark files anymore, but I
made my own benchmark. It's a test where a client repeatedly sends a
transaction containing a node to a server, and the server stashes it
forever. This way, the number of nodes keeps growing, up to 100k nodes.

Transaction Range │ Baseline (Rust) │ Bitmap (Rust) │ Comparison (C)
      0 -  10,000 │          176.88 │         92.93 │          99.41
 10,000 -  20,000 │          437.37 │         87.74 │          98.55
 20,000 -  30,000 │          677.49 │         76.24 │          96.37
 30,000 -  40,000 │          901.76 │         83.39 │          96.73
 40,000 -  50,000 │         1126.62 │        100.44 │          94.57
 50,000 -  60,000 │         1288.98 │         94.38 │          96.64
 60,000 -  70,000 │         1588.74 │         88.27 │          96.36
 70,000 -  80,000 │         1812.97 │         93.97 │          91.24
 80,000 -  90,000 │         2062.95 │         92.22 │         102.01
 90,000 - 100,000 │         2330.03 │         97.18 │         100.31

The numbers are the roundtrip latency, in microseconds, of sending the
transaction containing the node and receiving the response. I would not
necessarily trust the precise numbers: Rust comes out slightly ahead
here, but I ran this on an emulator, so the results may be noisy.
Regardless, I think it's pretty clear that this is an improvement.
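To make the shape of the change concrete, this is conceptually what
happens to the lookup (simplified pseudocode using the names from the
patch, ignoring the start offset; not the literal driver code):

    // Before (sketch): walk the sorted handles from zero until the
    // first gap; with thousands of live handles this is as many tree
    // steps, all performed while holding the spinlock.
    let mut candidate = 0;
    for (handle, _) in refs.by_handle.iter() {
        if *handle != candidate {
            break; // found a gap in the key space
        }
        candidate += 1;
    }

    // After (sketch): a single query against the auxiliary bitmap.
    let candidate = refs.handle_is_present.find_unused_id(0);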
> > equivalent change in the C driver was done because this particular piece
> > of code was causing contention issues by holding the spinlock for a long
> > time.
> >
> > A characteristic of this collection is that there is one per process
> > using the driver. Most processes have only a few entries in this bitmap,
> > so the inline representation will apply in most cases. However, there
> > are a few processes that have a large number of entries, in the 4 to
> > maybe 5 figure range. Those processes are what caused the contention
> > issue mentioned above.
> >
> > > > To improve the performance, add a bitmap that keeps track of which
> > > > indices are actually in use. This allows us to quickly find an unused
> > > > key in the red/black tree.
> > > >
> > > > This logic matches the approach used by C Binder. It was chosen
> > > > partially because it's the most memory efficient solution.
> > >
> > > That's inaccurate. You are adding a new data structure (bitmap),
> > > advocating it with an improvement on the search side, and that makes
> > > sense.
> > >
> > > But now you're saying it's also a more memory efficient approach, which
> > > doesn't sound trivial, because the most memory efficient solution is to
> > > bring in no new data structures at all.
> > >
> > > I guess you meant that a bitmap is the most efficient data structure to
> > > index used/unused nodes. If so, can you please rephrase the sentence?
> >
> > Yes, I can rephrase.
> >
> > Adding more data is of course always less memory efficient. What I meant
> > is that this is more memory efficient than the competing solution of
> > using an augmented rbtree that Carlos mentioned here:
> >
> > https://lore.kernel.org/p/aC1PQ7tmcqMSmbHc@google.com
> >
> > > > +                if let Some(res) = refs.handle_is_present.find_unused_id(start) {
> > > > +                    match refs.by_handle.entry(res.as_u32()) {
> > > > +                        rbtree::Entry::Vacant(entry) => break (res, entry),
> > > > +                        rbtree::Entry::Occupied(_) => {
> > > > +                            pr_err!("Detected mismatch between handle_is_present and by_handle");
> > > > +                            res.acquire();
> > > > +                            continue;
> > >
> > > At this point you've detected a mismatch between two linked data
> > > structures. It means that one of them, or both, are corrupted. To
> > > me it looks fatal, and your pr_err() confirms it. How could you
> > > continue then?
> >
> > Although we should never hit this codepath in real code, I don't think
> > we need to kill the kernel. We can treat the r/b tree as the source of
> > truth and adjust the bitmap when mismatches are detected.
>
> There's no such thing as a 'source of truth', and the rb-tree is not
> even close to one.
>
> You add the bitmap for performance reasons, but with that you also bring
> in some redundancy. Now, you cross-check two data structures and see a
> discrepancy. At this point you can only say that either one of them
> or both are corrupted.
>
> Relying on the rb-tree over the bitmap is simply wrong. Statistically
> speaking, there is a greater chance of corrupting the rb-tree - just
> because it is scattered and takes more memory.
>
> > I could add a kernel warning, though. That shouldn't kill an Android
> > device.
>
> Assuming you're talking about WARN(), I agree. But it looks like my
> reasoning differs.
>
> You never hit this path during normal operation, as you said. So if
> you're there, it means that something is already going wrong. If you
> issue WARN(), you let those focused on system integrity leverage
> panic_on_warn and shut the system down to minimize any possible damage.
>
> With that precaution, you're free to do whatever you find practical to
> 'recover', or even do nothing. But please avoid referring to the rb-tree
> as a source of truth - that's a misleading and dangerous misconception.

Ok. I picked the rb-tree because using the bitmap isn't possible - it
doesn't store the auxiliary data that we need.
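Concretely, I'm thinking of something like this for the next version (a
sketch, using pr_warn! as a stand-in for whatever warning mechanism we
settle on):

    rbtree::Entry::Occupied(_) => {
        // The bitmap says this id is free, but the tree already has it,
        // so one of the two structures (or both) is corrupted. Warn
        // loudly so that systems running with panic_on_warn can stop
        // right here; otherwise mark the id as used and retry with the
        // next candidate.
        pr_warn!("Detected mismatch between handle_is_present and by_handle");
        res.acquire();
        continue;
    }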
> > > > +                        }
> > > > +                    }
> > > > +                }
> > > > +
> > > > +                let grow_request = refs.handle_is_present.grow_request().ok_or(ENOMEM)?;
> > > > +                drop(refs_lock);
> > > > +                let resizer = grow_request.realloc(GFP_KERNEL)?;
> > > > +                refs_lock = self.node_refs.lock();
> > > > +                refs = &mut *refs_lock;
> > > > +                refs.handle_is_present.grow(resizer);
> > >
> > > Is it possible to turn this block into a less wordy statement? Maybe a
> > > wrapper function for it? Ideally, the grow request should be handled
> > > transparently in .find_unused_id().
> >
> > I can extract this block into a separate function, but I think it would
> > be tricky to move the entire logic inside .find_unused_id() because of
> > the mutex lock/unlock situation.
> >
> > > > @@ -905,6 +924,16 @@ pub(crate) fn update_ref(
> > > >                  let id = info.node_ref().node.global_id();
> > > >                  refs.by_handle.remove(&handle);
> > > >                  refs.by_node.remove(&id);
> > > > +                refs.handle_is_present.release_id(handle as usize);
> > > > +
> > > > +                if let Some(shrink) = refs.handle_is_present.shrink_request() {
> > >
> > > This is questionable. Shrinking is usually the very slow path, and we
> > > don't shrink unless we're really close to, or even inside, the OOM
> > > condition.
> > >
> > > In this case, shrink_request() on average returns false, but it's
> > > O(N), which makes _every_ release_id() O(N), while it should be O(1).
> >
> > The current implementation of shrink_request() will refuse to shrink the
> > pool unless the largest bit is less than 1/4 of the capacity, so it
> > should not perform the expensive operation very often.
> >
> > That said, it does call find_last_bit() each time, which I guess is
> > O(N). But my assumption was that find_last_bit() is cheap enough that it
> > wouldn't be a problem.
>
> It's O(N), while the expectation for release_id() is to be O(1). But if
> you think it's OK for you - I'm OK as well. Can you explicitly mention
> it in the function description?

Sure, I will mention it.

> > > Consider a very realistic case: you're destroying every object, and thus
> > > removing every ID in the associated ID pool, doing it in LIFO order. That
> > > way you will need to call shrink_request() about O(log(N)) times, making
> > > the whole complexity ~O(N*log(N)); and you'll have to make log(N)
> > > realloc()s for nothing. If you release IDs in FIFO order, you don't
> > > call realloc(), but your shrink_request() total complexity will be O(N^2).
> >
> > Even if we end up making log(N) reallocs, the total complexity of the
> > reallocs is O(N) because the amount of data being realloc'd halves each
> > time. So the total number of bytes copied by reallocs ends up being:
> >
> >     1 + 2 + 4 + 8 + ... + 2^log(N) <= 2^(1+log(N)) = 2*N
> >
> > which is O(N).
>
> OK, I trust your math better than mine. :)
>
> > Of course, deleting the corresponding entry from the red/black tree is
> > O(log N), so it's still O(N*log(N)) for the N deletions from the rb
> > tree.
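For the record, the extraction I have in mind for the grow block above
is roughly this (a sketch; the helper name, the guard type, and the
ProcessNodeRefs type are placeholders, and pinning down the real guard
signature is exactly what makes folding this into find_unused_id()
tricky):

    // Hypothetical helper: drop the lock, reallocate the bitmap with
    // GFP_KERNEL, then re-take the lock and apply the growth. The
    // caller must re-validate its state afterwards, since another
    // thread may have touched the pool while the lock was dropped.
    fn grow_handle_bitmap<'a>(
        &'a self,
        refs_lock: Guard<'a, ProcessNodeRefs>,
    ) -> Result<Guard<'a, ProcessNodeRefs>> {
        let grow_request = refs_lock.handle_is_present.grow_request().ok_or(ENOMEM)?;
        drop(refs_lock);
        let resizer = grow_request.realloc(GFP_KERNEL)?;
        let mut refs_lock = self.node_refs.lock();
        refs_lock.handle_is_present.grow(resizer);
        Ok(refs_lock)
    }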
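And to make the shrink discussion concrete, the heuristic I described
behaves roughly like this (a sketch with assumed names; the points that
matter are the 1/4 threshold and the O(N) scan):

    // Runs on every release_id(). find_last_bit() is an O(N) scan, but
    // a realloc is only offered when the highest set bit is below a
    // quarter of the capacity, so actual shrinking is rare.
    fn shrink_request(&self) -> Option<ResizeRequest> {
        let last_bit = self.find_last_bit()?; // None if no bits are set
        if last_bit < self.capacity() / 4 {
            Some(ResizeRequest::with_capacity(self.capacity() / 2))
        } else {
            None
        }
    }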
> >
> > > Can you compare performance numbers with and without shrinking under a
> > > typical payload? Is there any mechanism to inspect the ID pools at
> > > runtime, like exposing them via procfs?
> >
> > We expose the contents of the red/black tree via the binder_logs
> > mechanism, but that doesn't show the *capacity* of the bitmap, only the
> > index of the largest set bit.
> >
> > > Can you make your shrinking logic conditional on some reasonable OOM
> > > heuristics, maybe OOM event driven?
> > >
> > > And even without OOM, you can safely skip shrinking if the number of IDs in
> > > use is greater than 1/4 of the capacity, or there's any used ID with
> > > an index greater than 1/2 of the capacity.
> >
> > I guess we could register a shrinker, but I don't think it's worth it.
>
> OK, if you're fine with O(N) for release_id() - I'm fine as well. Can
> you mention OOM-driven shrinking as a possible alternative for
> follow-up improvements?

Sure.

Alice