res-lock v4 #8679

Closed · kkdwivedi wants to merge 121 commits

Conversation

kkdwivedi (Contributor)

No description provided.
anakryiko and others added 30 commits February 26, 2025 10:45
…istat'

Mykyta Yatsenko says:

====================
selftests/bpf: implement setting global variables in veristat

From: Mykyta Yatsenko <[email protected]>

To better verify some complex BPF programs by veristat, it would be useful
to preset global variables. This patch set implements this functionality
and introduces tests for veristat.

v4->v5:
  * Rework parsing to use sscanf for integers
  * Addressing nits

v3->v4:
  * Fixing bug in set_global_var introduced by refactoring in previous patch set
  * Addressed nits from Eduard

v2->v3:
  * Reworked parsing of the presets, using sscanf to split into variable and
  value, but still use strtoll/strtoull to support range checks when parsing
  integers
  * Fix test failures for no_alu32 & cpuv4 by checking if veristat binary is in
  parent folder
  * Introduce __CHECK_STR macro for simplifying checks in test
  * Modify tests into sub-tests
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>
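
For illustration, a hedged sketch of how presets might be passed on the
command line, based on the -G flag that appears later in this series; the
exact preset syntax and the variable/object names are assumptions:

  # Override a global variable in the BPF object before verification:
  ./veristat -G "enable_stats = 1" prog.bpf.o

  # Presumably several presets can be given by repeating the flag:
  ./veristat -G "enable_stats = 1" -G "debug_level = 2" prog.bpf.o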
Refactor the bpf_dynptr_read and bpf_dynptr_write helpers: extract their
code into static functions, __bpf_dynptr_read and __bpf_dynptr_write,
which allows calling them internally without compiler warnings.

Signed-off-by: Mykyta Yatsenko <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Introduce the bpf_dynptr_copy kfunc, which allows copying data from one
dynptr to another. This functionality is useful in scenarios such as
capturing XDP data to a ring buffer.
The implementation consists of 4 branches:
  * A fast branch for contiguous buffer capacity in both the source and
destination dynptrs
  * 3 branches utilizing __bpf_dynptr_read and __bpf_dynptr_write to copy
data to/from non-contiguous buffers

Signed-off-by: Mykyta Yatsenko <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
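As an illustration, here is a hedged BPF-side sketch of the intended usage.
The bpf_dynptr_copy prototype shown is an assumption inferred from the
description above; the other kfunc declarations normally come from the
selftests' bpf_kfuncs.h, and error handling is elided for brevity:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  /* Assumed prototype, inferred from the commit message above. */
  extern int bpf_dynptr_copy(struct bpf_dynptr *dst, __u32 dst_off,
                             struct bpf_dynptr *src, __u32 src_off, __u32 size) __ksym;
  extern int bpf_dynptr_from_xdp(struct xdp_md *xdp, __u64 flags,
                                 struct bpf_dynptr *ptr) __ksym;
  extern int bpf_ringbuf_reserve_dynptr(void *ringbuf, __u32 size, __u64 flags,
                                        struct bpf_dynptr *ptr) __ksym;
  extern void bpf_ringbuf_submit_dynptr(struct bpf_dynptr *ptr, __u64 flags) __ksym;

  struct {
          __uint(type, BPF_MAP_TYPE_RINGBUF);
          __uint(max_entries, 4096);
  } rb SEC(".maps");

  SEC("xdp")
  int capture_pkt(struct xdp_md *ctx)
  {
          struct bpf_dynptr src, dst;

          bpf_dynptr_from_xdp(ctx, 0, &src);                 /* may be non-contiguous */
          bpf_ringbuf_reserve_dynptr(&rb, 64, 0, &dst);      /* contiguous sample */
          bpf_dynptr_copy(&dst, 0, &src, 0, 64);             /* copy first 64 bytes */
          bpf_ringbuf_submit_dynptr(&dst, 0);                /* must pair with reserve */
          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";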
Add an XDP setup type for the dynptr tests, enabling testing with
non-contiguous buffers.
Add 2 tests:
 - test_dynptr_copy - verifies correctness for the fast (contiguous
 buffer) code path.
 - test_dynptr_copy_xdp - verifies the code paths that handle
 non-contiguous buffers.

Signed-off-by: Mykyta Yatsenko <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Mykyta Yatsenko says:

====================
introduce bpf_dynptr_copy kfunc

From: Mykyta Yatsenko <[email protected]>

Introduce a new kfunc, bpf_dynptr_copy, which enables copying of
data from one dynptr to another. This functionality may be useful in
scenarios such as capturing XDP data to a ring buffer.
The patch set is split into 3 patches:
1. Refactor bpf_dynptr_read and bpf_dynptr_write by extracting code into
static functions, which allows calling them without compiler warnings
2. Introduce bpf_dynptr_copy
3. Add tests for bpf_dynptr_copy

v2->v3:
  * Implemented bpf_memcmp in dynptr_success.c test, as __builtin_memcmp
  was not inlined on GCC-BPF.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>
Allow automatic port binding for the cgroup connect test to avoid binding conflicts.

Result:
./test_progs -a cgroup_v1v2
59      cgroup_v1v2:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Jiayuan Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Allow automatic port binding for the bpf nf test to avoid binding conflicts.

./test_progs -a bpf_nf
24/1    bpf_nf/xdp-ct:OK
24/2    bpf_nf/tc-bpf-ct:OK
24/3    bpf_nf/alloc_release:OK
24/4    bpf_nf/insert_insert:OK
24/5    bpf_nf/lookup_insert:OK
24/6    bpf_nf/set_timeout_after_insert:OK
24/7    bpf_nf/set_status_after_insert:OK
24/8    bpf_nf/change_timeout_after_alloc:OK
24/9    bpf_nf/change_status_after_alloc:OK
24/10   bpf_nf/write_not_allowlisted_field:OK
24      bpf_nf:OK
Summary: 1/10 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Jiayuan Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
BPF CI has failed 3 times in the last 24 hours. Add a retry for ENOMEM,
similar to the approach taken in
commit 2f553b0 ("selftsets/bpf: Retry map update for non-preallocated per-cpu map")

Failed CI:
https://github.com/kernel-patches/bpf/actions/runs/13549227497/job/37868926343
https://github.com/kernel-patches/bpf/actions/runs/13548089029/job/37865812030
https://github.com/kernel-patches/bpf/actions/runs/13553536268/job/37883329296

selftests/bpf: Fixes for test_maps test
Fork 100 tasks to 'test_update_delete'
Fork 100 tasks to 'test_update_delete'
Fork 100 tasks to 'test_update_delete'
Fork 100 tasks to 'test_update_delete'
......
test_task_storage_map_stress_lookup:PASS
test_maps: OK, 0 SKIPPED

Signed-off-by: Jiayuan Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Jiayuan Chen says:

====================
Optimize bpf selftest to increase CI success rate

1. Optimized some selftests with statically bound ports to avoid port occupation
when running test_progs -j.
2. Optimized the retry logic for test_maps.

Some Failed CI:
https://github.com/kernel-patches/bpf/actions/runs/13275542359/job/37064974076
https://github.com/kernel-patches/bpf/actions/runs/13549227497/job/37868926343
https://github.com/kernel-patches/bpf/actions/runs/13548089029/job/37865812030
https://github.com/kernel-patches/bpf/actions/runs/13553536268/job/37883329296
(Perhaps it's due to the large number of pull requests requiring CI runs?)
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Currently, for bpf progs in a cgroup hierarchy, the effective prog array
is computed from the bottom cgroup to the upper cgroups (post-ordering).
For example, the following cgroup hierarchy
    root cgroup: p1, p2
        subcgroup: p3, p4
has BPF_F_ALLOW_MULTI for both cgroup levels.
The effective cgroup array ordering looks like
    p3 p4 p1 p2
and at run time, progs will execute based on that order.

But in some cases, it is desirable to have the root prog execute earlier
than the children's progs (pre-ordering). For example,
  - prog p1 intends to collect original pkt dest addresses.
  - prog p3 will modify original pkt dest addresses to a proxy address for
    security reasons.
The end result is that prog p1 gets the proxy address, which is not what it
wants. Putting p1 in every child cgroup is not desirable either, as it
will duplicate itself in many child cgroups. This is exactly a use case
we are encountering at Meta.

To fix this issue, introduce a flag BPF_F_PREORDER. If the flag
is specified at attachment time, the prog has higher priority and the
ordering with that flag will be from top to bottom (pre-ordering).
In the above example,
    root cgroup: p1, p2
        subcgroup: p3, p4
say p2 and p4 are marked with BPF_F_PREORDER. The final
effective array ordering will be
    p2 p4 p3 p1

Suggested-by: Andrii Nakryiko <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
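For context, a hedged user-space sketch of attaching a prog with the new flag
via libbpf's bpf_prog_attach_opts(); the attach type and the descriptor names
are placeholders, and the only series-specific piece is the BPF_F_PREORDER
flag described above:

  #include <bpf/bpf.h>
  #include <bpf/libbpf.h>

  /* Attach a root-cgroup prog with BPF_F_PREORDER so it runs before
   * child-cgroup progs; prog_fd and cgroup_fd are assumed to be
   * already-opened descriptors. */
  static int attach_preorder(int prog_fd, int cgroup_fd)
  {
          LIBBPF_OPTS(bpf_prog_attach_opts, opts,
                      .flags = BPF_F_ALLOW_MULTI | BPF_F_PREORDER);

          return bpf_prog_attach_opts(prog_fd, cgroup_fd,
                                      BPF_CGROUP_INET_EGRESS, &opts);
  }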
Add a few selftests with cgroup prog pre-ordering.

Signed-off-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
In !PREEMPT_RT, local_lock_irqsave() disables interrupts to protect the
critical section, but it doesn't prevent NMI, so fully reentrant
code cannot use local_lock_irqsave() for exclusive access.

Introduce localtry_lock_t and localtry_lock_irqsave(), which
disables interrupts and sets acquired=1, so that localtry_lock_irqsave()
from NMI attempting to acquire the same lock will return false.

In PREEMPT_RT, local_lock_irqsave() maps to a preemptible spin_lock().
Map localtry_lock_irqsave() to a preemptible spin_trylock().
When in hard IRQ or NMI, return false right away, since
spin_trylock() is not safe due to explicit locking in the underlying
rt_spin_trylock() implementation. Removing this explicit locking and
attempting only "trylock" is undesired due to PI implications.

Note there is no need to use local_inc for the acquired variable,
since it's a percpu variable with strict nesting scopes.

Acked-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
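A hedged kernel-side sketch of the intended usage pattern; the trylock/unlock
macro names and the per-CPU definition are assumptions based on the
descriptions in this series (a later memcg patch mentions localtry_trylock),
not the final API:

  /* Illustrative only; names and initialization are assumed. */
  static DEFINE_PER_CPU(localtry_lock_t, my_lock);   /* init macro omitted */

  static bool update_stats_any_context(void)
  {
          unsigned long flags;

          /* Fails instead of deadlocking, e.g. when an NMI interrupts a
           * holder of the same per-CPU lock. */
          if (!localtry_trylock_irqsave(&my_lock, flags))
                  return false;
          /* exclusive, NMI-safe critical section */
          localtry_unlock_irqrestore(&my_lock, flags);
          return true;
  }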
Tracing BPF programs execute from tracepoints and kprobes where the
running context is unknown, but they need to request additional
memory. The prior workarounds were using pre-allocated memory and
BPF-specific freelists to satisfy such allocation requests.
Instead, introduce a gfpflags_allow_spinning() condition that signals
to the allocator that the running context is unknown.
Then rely on the percpu free list of pages to allocate a page.
try_alloc_pages() -> get_page_from_freelist() -> rmqueue() ->
rmqueue_pcplist() will spin_trylock to grab the page from the percpu
free list. If that fails (due to re-entrancy or the list being empty),
then rmqueue_bulk()/rmqueue_buddy() will attempt to
spin_trylock zone->lock and grab the page from there.
spin_trylock() is not safe in PREEMPT_RT when in NMI or in hard IRQ.
Bail out early in such a case.

The support for gfpflags_allow_spinning() mode for free_page and memcg
comes in the next patches.

This is a first step towards supporting BPF requirements in SLUB
and getting rid of bpf_mem_alloc.
That goal was discussed at LSFMM: https://lwn.net/Articles/974138/

Acked-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Sebastian Andrzej Siewior <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Introduce free_pages_nolock(), which can free pages without taking locks.
It relies on trylock and can be called from any context.
Since spin_trylock() cannot be used in PREEMPT_RT from hard IRQ or NMI,
it uses a lockless linked list to stash the pages, which will be freed
by a subsequent free_pages() from a good context.

Do not use the llist unconditionally. BPF maps continuously
allocate/free, so we cannot unconditionally delay the freeing to the
llist. When the memory becomes free, make it available to the
kernel and BPF users right away if possible, and fall back to the
llist as the last resort.

Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Sebastian Andrzej Siewior <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
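Taken together with the previous patch, a hedged sketch of the intended
calling pattern; the exact prototypes (nid/order arguments) are assumptions
based on the commit text, not a definitive API:

  #include <linux/gfp.h>
  #include <linux/mm.h>

  /* Allocate a single page from a context that may not allow spinning. */
  static struct page *grab_page_any_context(void)
  {
          /* Trylock-only path; may return NULL instead of sleeping or spinning. */
          return try_alloc_pages(NUMA_NO_NODE, 0);
  }

  static void release_page_any_context(struct page *page)
  {
          /* Safe even from NMI/hard IRQ; stashes to an llist if trylock fails. */
          free_pages_nolock(page, 0);
  }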
Teach memcg to operate under trylock conditions when spinning locks
cannot be used.

localtry_trylock might fail, and this would lead to a charge cache bypass
if the calling context doesn't allow spinning (gfpflags_allow_spinning).
In those cases, charge the memcg counter directly and fail early if
that is not possible. This might cause a premature charge failure,
but it allows an opportunistic charging that is safe from the
try_alloc_pages path.

Acked-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Unconditionally use __GFP_ACCOUNT in try_alloc_pages().
The caller is responsible for setting up memcg correctly.
All BPF memory accounting is memcg based.

Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Use try_alloc_pages() and free_pages_nolock() for BPF needs
when the context doesn't allow using normal alloc_pages().
This is a prerequisite for further work.

Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Alexei Starovoitov says:

====================
The main motivation is to make alloc page and slab reentrant and
remove bpf_mem_alloc.

v8->v9:
- Squash Vlastimil's fix/feature for localtry_trylock, and
  update the commit log as suggested by Sebastian.
- Drop _noprof suffix in try_alloc_pages kdoc
- rebase

v8:
https://lore.kernel.org/bpf/[email protected]/

v7->v8:
- rebase: s/free_unref_page/free_frozen_page/

v6->v7:
- Took Sebastian's patch for localtry_lock_t as-is with minor
  addition of local_trylock_acquire() for proper LOCKDEP.
  Kept his authorship.
- Adjusted patch 4 to use it. The rest is unchanged.

v6:
https://lore.kernel.org/bpf/[email protected]/

v5->v6:
- Addressed comments from Sebastian, Vlastimil
- New approach for local_lock_t in patch 3. Instead of unconditionally
  increasing local_lock_t size to 4 bytes introduce local_trylock_t
  and use _Generic() tricks to manipulate active field.
- Address stackdepot reentrance issues. alloc part in patch 1 and
  free part in patch 2.
- Inlined mem_cgroup_cancel_charge() in patch 4 since this helper
  is being removed.
- Added Acks.
- Dropped failslab, kfence, kmemleak patch.
- Improved bpf_map_alloc_pages() in patch 6 a bit to demo intended usage.
  It will be refactored further.
- Considered using __GFP_COMP in try_alloc_pages to simplify
  free_pages_nolock a bit, but then decided to make it work
  for all types of pages, since free_pages_nolock() is used by
  stackdepot and currently it's using non-compound order 2.
  I felt it's best to leave it as-is and make free_pages_nolock()
  support all pages.

v5:
https://lore.kernel.org/all/[email protected]/

v4->v5:
- Fixed patch 1 and 4 commit logs and comments per Michal suggestions.
  Added Acks.
- Added patch 6 to make failslab, kfence, kmemleak compliant
  with trylock mode. It's a prerequisite for reentrant slab patches.

v4:
https://lore.kernel.org/bpf/[email protected]/

v3->v4:
Addressed feedback from Michal and Shakeel:
- GFP_TRYLOCK flag is gone. gfpflags_allow_spinning() is used instead.
- Improved comments and commit logs.

v3:
https://lore.kernel.org/bpf/[email protected]/

v2->v3:
To address the issues spotted by Sebastian, Vlastimil, Steven:
- Made GFP_TRYLOCK internal to mm/internal.h
  try_alloc_pages() and free_pages_nolock() are the only interfaces.
- Since spin_trylock() is not safe in RT from hard IRQ and NMI,
  disable such usage in lock_trylock and in try_alloc_pages().
  In such case free_pages_nolock() falls back to llist right away.
- Process trylock_free_pages llist when preemptible.
- Check for things like unaccepted memory and order <= 3 early.
- Don't call into __alloc_pages_slowpath() at all.
- Inspired by Vlastimil's struct local_tryirq_lock adopted it in
  local_lock_t. Extra 4 bytes in !RT in local_lock_t shouldn't
  affect any of the current local_lock_t users. This is patch 3.
- Tested with bpf selftests in RT and !RT and realized how much
  more work is necessary on bpf side to play nice with RT.
  The urgency of this work got higher. The alternative is to
  convert bpf bits left and right to bpf_mem_alloc.

v2:
https://lore.kernel.org/bpf/[email protected]/

v1->v2:
- fixed buggy try_alloc_pages_noprof() in PREEMPT_RT. Thanks Peter.
- optimize all paths by doing spin_trylock_irqsave() first
  and only then check for gfp_flags & __GFP_TRYLOCK.
  Then spin_lock_irqsave() if it's a regular mode.
  So new gfp flag will not add performance overhead.
- patches 2-5 are new. They introduce lockless and/or trylock free_pages_nolock()
  and memcg support. So it's in usable shape for bpf in patch 6.

v1:
https://lore.kernel.org/bpf/[email protected]/
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Merge try_alloc_pages feature branch into bpf-next/master.

Signed-off-by: Alexei Starovoitov <[email protected]>
test_select_reuseport_kern.c currently includes <stdlib.h>, but it
does not use any definition from there.

Remove the stdlib.h inclusion from test_select_reuseport_kern.c.

Signed-off-by: Alexis Lothoré (eBPF Foundation) <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Link: https://patch.msgid.link/[email protected]
The verifier currently does not permit global subprog calls when a lock
is held, preemption is disabled, or when IRQs are disabled. This is
because we don't know whether the global subprog calls sleepable
functions or not.

In case of locks, there's an additional reason: functions called by the
global subprog may hold additional locks etc. The verifier won't know
while verifying the global subprog whether it was called in context
where a spin lock is already held by the program.

Perform summarization of the sleepable nature of a global subprog, just
like changes_pkt_data, and then allow calls to non-sleepable global
subprogs from atomic context.

While making this change, I noticed that RCU read sections had no
protection against sleepable global subprog calls; include them in the
checks and fix this while we're at it.

Care needs to be taken to not allow global subprog calls when a regular
bpf_spin_lock is held. When a resilient spin lock is held, we want to
potentially have this check relaxed, but not for now.

Also make sure extensions freplacing global functions cannot do so
in case the target is non-sleepable, but the extension is. The other
combination is ok.

Tests are included in the next patch to handle all special conditions.

Fixes: 9bb00b2 ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
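To make the rule concrete, a hedged BPF-side sketch (the program name,
section, and subprog are illustrative; the kfunc declarations are normally
pulled from vmlinux.h or the selftests' bpf_kfuncs.h):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  void bpf_rcu_read_lock(void) __ksym;
  void bpf_rcu_read_unlock(void) __ksym;

  /* Global (non-static) subprog that never sleeps -> might_sleep = false. */
  __noinline int add_one(int x)
  {
          return x + 1;
  }

  SEC("tc")
  int caller(struct __sk_buff *ctx)
  {
          int v;

          bpf_rcu_read_lock();
          v = add_one(1);          /* accepted after this change; a sleepable
                                    * global subprog here would be rejected */
          bpf_rcu_read_unlock();
          return v;
  }

  char _license[] SEC("license") = "GPL";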
Add tests for rejecting sleepable and accepting non-sleepable global
function calls in atomic contexts. For spin locks, we still reject
all global function calls. Once resilient spin locks land, we will
carefully lift this restriction in cases where we deem it safe.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Add tests for freplace behavior with combinations of sleepable
and non-sleepable global subprogs. The changes_pkt_data selftest
did all the hard work, so simply rename it and add support for
more summarization tests covering the might_sleep bit.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Kumar Kartikeya Dwivedi says:

====================
Global subprogs in RCU/{preempt,irq}-disabled sections

Small change to allow non-sleepable global subprogs in
RCU, preempt-disabled, and irq-disabled sections. For
now, we don't lift the limitation for locks, as it requires
more analysis; we will do this once resilient spin locks
land.

This surfaced a bug where sleepable global subprogs were
allowed in RCU read sections, which has been fixed. Tests
have been added to cover various cases.

Changelog:
----------
v2 -> v3
v2: https://lore.kernel.org/bpf/[email protected]

  * Fix broken to_be_replaced argument in the selftest.
  * Adjust selftest program type.

v1 -> v2
v1: https://lore.kernel.org/bpf/[email protected]

  * Rename subprog_info[i].sleepable to might_sleep, which more
    accurately reflects the nature of the bit. 'sleepable' means whether
    a given context is allowed to sleep, while might_sleep captures
    whether it actually does.
  * Disallow extensions that might sleep to attach to targets that don't
    sleep, since they'd be permitted to be called in atomic contexts. (Eduard)
  * Add tests for mixing non-sleepable and sleepable global function
    calls, and extensions attaching to non-sleepable global functions. (Eduard)
  * Rename changes_pkt_data -> summarization
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
bpf_sk_storage_clone() is the only caller of bpf_map_inc_not_zero()
and is holding rcu_read_lock().

map_idr_lock does not add any protection here; just remove this cost
for passive TCP flows.

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Kui-Feng Lee <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Factor out atomic_ptr_type_ok() as a helper function to be used later.

Signed-off-by: Peilin Ye <[email protected]>
Link: https://lore.kernel.org/r/e5ef8b3116f3fffce78117a14060ddce05eba52a.1740978603.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <[email protected]>
Currently, check_atomic() only handles atomic read-modify-write (RMW)
instructions.  Since we are planning to introduce other types of atomic
instructions (i.e., atomic load/store), extract the existing RMW
handling logic into its own function named check_atomic_rmw().

Remove the @insn_idx parameter as it is not really necessary.  Use
'env->insn_idx' instead, as in other places in verifier.c.

Signed-off-by: Peilin Ye <[email protected]>
Link: https://lore.kernel.org/r/6323ac8e73a10a1c8ee547c77ed68cf8eb6b90e1.1740978603.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <[email protected]>
Extract BPF_LDX and most non-ATOMIC BPF_STX instruction handling logic
in do_check() into helper functions to be used later.  While we are
here, make that comment about "reserved fields" more specific.

Suggested-by: Eduard Zingerman <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Signed-off-by: Peilin Ye <[email protected]>
Link: https://lore.kernel.org/r/8b39c94eac2bb7389ff12392ca666f939124ec4f.1740978603.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <[email protected]>
Allow reading the list of object files from a file.
E.g. the following command:

  ./veristat @list.txt

is equivalent to the following invocation:

  ./veristat line-1 line-2 ... line-N

where line-i corresponds to the lines from list.txt.
Lines starting with '#' are ignored.

Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Mykyta Yatsenko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
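For instance, a hypothetical list.txt (the object file paths are placeholders):

  # list.txt - one BPF object per line, '#' lines are ignored
  progs/iters.bpf.o
  progs/dynptr_success.bpf.o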
Before:

  ./veristat -G @foobar iters.bpf.o
  Failed to open presets in 'foobar': Unknown error -2
  ...

After:

  ./veristat -G @foobar iters.bpf.o
  Failed to open presets in 'foobar': No such file or directory
  ...

Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Mykyta Yatsenko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Currently, for rqspinlock usage, the implementations of
smp_cond_load_acquire (and thus, atomic_cond_read_acquire) are
susceptible to stalls on arm64, because they do not guarantee that the
conditional expression will be repeatedly invoked if the address being
loaded from is not written to by other CPUs. When support for
event streams is absent (event streams unblock stuck WFE-based loops every
~100us), we may end up being stuck forever.

This causes a problem for us, as we need to repeatedly invoke
RES_CHECK_TIMEOUT in the spin loop to break out when the timeout
expires.

Let us import the smp_cond_load_acquire_timewait implementation Ankur is
proposing in [0], and then fall back to it once it is merged.

While we rely on the implementation to amortize the cost of sampling
check_timeout for us, that will not happen when event stream support is
unavailable. This is not the common case, and it would be difficult to
fit our logic into the time_expr_ns >= time_limit_ns comparison, hence
just let it be.

  [0]: https://lore.kernel.org/lkml/[email protected]

Cc: Ankur Arora <[email protected]>
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
The pending bit is used to avoid queueing in case the lock is
uncontended, and has demonstrated benefits for the 2 contender scenario,
esp. on x86. In case the pending bit is acquired and we wait for the
locked bit to disappear, we may get stuck due to the lock owner not
making progress. Hence, this waiting loop must be protected with a
timeout check.

To perform a graceful recovery once we decide to abort our lock
acquisition attempt in this case, we must unset the pending bit since we
own it. All waiters undoing their changes and exiting gracefully allows
the lock word to be restored to the unlocked state once all participants
(owner, waiters) have been recovered, and the lock remains usable.
Hence, set the pending bit back to zero before returning to the caller.

Introduce a lockevent (rqspinlock_lock_timeout) to capture timeout
event statistics.

Reviewed-by: Barret Rhoden <[email protected]>
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Implement the wait queue cleanup algorithm for rqspinlock. There are
three forms of waiters in the original queued spin lock algorithm. The
first is the waiter which acquires the pending bit and spins on the lock
word without forming a wait queue. The second is the head waiter that is
the first waiter heading the wait queue. The third form is of all the
non-head waiters queued behind the head, waiting to be signalled through
their MCS node to overtake the responsibility of the head.

In this commit, we are concerned with the second and third kind. First,
we augment the waiting loop of the head of the wait queue with a
timeout. When this timeout happens, all waiters part of the wait queue
will abort their lock acquisition attempts. This happens in three steps.
First, the head breaks out of its loop waiting for pending and locked
bits to turn to 0, and non-head waiters break out of their MCS node spin
(more on that later). Next, every waiter (head or non-head) attempts to
check whether they are also the tail waiter, in such a case they attempt
to zero out the tail word and allow a new queue to be built up for this
lock. If they succeed, they have no one to signal next in the queue to
stop spinning. Otherwise, they signal the MCS node of the next waiter to
break out of its spin and try resetting the tail word back to 0. This
goes on until the tail waiter is found. In case of races, the new tail
will be responsible for performing the same task, as the old tail will
then fail to reset the tail word and wait for its next pointer to be
updated before it signals the new tail to do the same.

We terminate the whole wait queue because of two main reasons. Firstly,
we eschew per-waiter timeouts with one applied at the head of the wait
queue.  This allows everyone to break out faster once we've seen the
owner / pending waiter not responding for the timeout duration from the
head.  Secondly, it avoids complicated synchronization, because when not
leaving in FIFO order, prev's next pointer needs to be fixed up etc.

Lastly, all of these waiters release the rqnode and return to the
caller. This patch underscores the point that rqspinlock's timeout does
not apply to each waiter individually, and cannot be relied upon as an
upper bound. It is possible for the rqspinlock waiters to return early
from a failed lock acquisition attempt as soon as stalls are detected.

The head waiter cannot directly WRITE_ONCE the tail to zero, as it may
race with a concurrent xchg and a non-head waiter linking its MCS node
to the head's MCS node through 'prev->next' assignment.

One notable thing is that we must use RES_DEF_TIMEOUT * 2 as our maximum
duration for the waiting loop (for the wait queue head), since we may
have both the owner and pending bit waiter ahead of us, and in the worst
case, need to span their maximum permitted critical section lengths.

Reviewed-by: Barret Rhoden <[email protected]>
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
When we run out of maximum rqnodes, the original queued spin lock slow
path falls back to a try lock. In such a case, we are again susceptible
to stalls in case the lock owner fails to make progress. We use the
timeout as a fallback to break out of this loop and return to the
caller. This is a fallback for an extreme edge case, when on the same
CPU we run out of all 4 qnodes. When could this happen? We are in the slow
path in task context, we get interrupted by an IRQ, which while in the
slow path gets interrupted by an NMI, which in the slow path gets
another nested NMI, which enters the slow path. All of the interruptions
happen after node->count++.

We use RES_DEF_TIMEOUT as our spinning duration, but in the case of this
fallback, no fairness is guaranteed, so the duration may be too small
for contended cases, as the waiting time is not bounded. Since this is
an extreme corner case, let's just prefer timing out instead of
attempting to spin for longer.

Reviewed-by: Barret Rhoden <[email protected]>
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
While the timeout logic provides guarantees for the waiter's forward
progress, the time until a stalling waiter unblocks can still be long.
The default timeout of 1/4 sec can be excessively long for some use
cases.  Additionally, custom timeouts may exacerbate recovery time.

Introduce logic to detect common cases of deadlocks and perform quicker
recovery. This is done by dividing the time from entry into the locking
slow path until the timeout into intervals of 1 ms. Then, after each
interval elapses, deadlock detection is performed, while also polling
the lock word to ensure we can quickly break out of the detection logic
and proceed with lock acquisition.

A 'held_locks' table is maintained per-CPU where the entry at the bottom
denotes a lock being waited for or already taken. Entries coming before
it denote locks that are already held. The current CPU's table can thus
be looked at to detect AA deadlocks. The tables from other CPUs can be
looked at to discover ABBA situations. Finally, when a matching entry
for the lock being taken on the current CPU is found on some other CPU,
a deadlock situation is detected. This function can take a long time,
therefore the lock word is constantly polled in each loop iteration to
ensure we can preempt detection and proceed with lock acquisition, using
the is_lock_released check.

We set 'spin' member of rqspinlock_timeout struct to 0 to trigger
deadlock checks immediately to perform faster recovery.

Note: Extending lock word size by 4 bytes to record owner CPU can allow
faster detection for ABBA. It is typically the owner which participates
in a ABBA situation. However, to keep compatibility with existing lock
words in the kernel (struct qspinlock), and given deadlocks are a rare
event triggered by bugs, we choose to favor compatibility over faster
detection.

The release_held_lock_entry function requires an smp_wmb, while the
release store on unlock will provide the necessary ordering for us. Add
comments to document the subtleties of why this is correct. It is
possible for stores to be reordered still, but in the context of the
deadlock detection algorithm, a release barrier is sufficient and
needn't be stronger for unlock's case.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
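As a rough illustration of the per-CPU AA check described above (all
identifiers here — the table layout, RES_NR_HELD, and the function name —
are illustrative, not the kernel's actual ones):

  #define RES_NR_HELD 32                     /* illustrative table size */

  struct held_locks_table {                  /* illustrative layout */
          int cnt;                           /* bottom entry = lock being waited for */
          void *locks[RES_NR_HELD];
  };
  static DEFINE_PER_CPU(struct held_locks_table, held_locks);

  static bool aa_deadlock(void *lock)
  {
          struct held_locks_table *t = this_cpu_ptr(&held_locks);
          int i;

          /* Entries before the bottom one are locks already held on this CPU. */
          for (i = 0; i < t->cnt - 1; i++)
                  if (t->locks[i] == lock)
                          return true;       /* AA: this CPU already holds it */
          return false;
  }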
Include a test-and-set fallback when queued spinlock support is not
available. Introduce a rqspinlock type to act as a fallback when
qspinlock support is absent.

Include ifdef guards to ensure the slow path in this file is only
compiled when CONFIG_QUEUED_SPINLOCKS=y. Subsequent patches will add
further logic to ensure fallback to the test-and-set implementation
when queued spinlock support is unavailable on an architecture.

Unlike other waiting loops in rqspinlock code, the one for test-and-set
has no theoretical upper bound under contention, therefore we need a
longer timeout than usual. Bump it up to a second in this case.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
We ripped out PV and virtualization related bits from rqspinlock in an
earlier commit, however, a fair lock performs poorly within a virtual
machine when the lock holder is preempted. As such, retain the
virt_spin_lock fallback to test and set lock, but with timeout and
deadlock detection. We can do this by simply depending on the
resilient_tas_spin_lock implementation from the previous patch.

We don't integrate support for CONFIG_PARAVIRT_SPINLOCKS yet, as that
requires more involved algorithmic changes and introduces more
complexity. It can be done when the need arises in the future.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Whenever a timeout or a deadlock occurs, we would want to print a
message to the dmesg console, including the CPU where the event
occurred, the list of locks in the held locks table, and the stack trace
of the caller, which allows determining where exactly in the slow path
the waiter timed out or detected a deadlock.

Splats are limited to at most one per CPU during machine uptime, and a
lock is acquired to ensure that no interleaving occurs when a concurrent
set of CPUs conflict and enter a deadlock situation and start printing
data.

Later patches will use this to inspect the return value of the rqspinlock
API and then report a violation if necessary.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Introduce helper macros that wrap around the rqspinlock slow path and
provide an interface analogous to the raw_spin_lock API. Note that
in case of error conditions, preemption and IRQ disabling are
automatically unrolled before returning the error to the caller.

Ensure that in absence of CONFIG_QUEUED_SPINLOCKS support, we fallback
to the test-and-set implementation.

Add some comments describing the subtle memory ordering logic during
unlock, and why it's safe.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
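A hedged usage sketch of the interface these macros provide; the exact macro
names, the rqspinlock_t type, the header location, and the error codes are
assumptions based on this series:

  #include <asm-generic/rqspinlock.h>        /* assumed header location */

  static rqspinlock_t demo_lock;             /* type introduced earlier in the series */

  static int demo_critical_section(void)
  {
          unsigned long flags;
          int ret;

          ret = raw_res_spin_lock_irqsave(&demo_lock, flags);
          if (ret)
                  return ret;                /* e.g. timeout or deadlock detected;
                                              * preemption/IRQ state already unrolled */
          /* critical section */
          raw_res_spin_unlock_irqrestore(&demo_lock, flags);
          return 0;
  }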
Ensure that the rqspinlock code is only built when the BPF subsystem is
compiled in. Depending on queued spinlock support, we may or may not end
up building the queued spinlock slowpath, and instead fallback to the
test-and-set implementation. Also add entries to MAINTAINERS file.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Introduce locktorture support for rqspinlock using the newly added
macros as the first in-kernel user and consumer. Guard the code with
CONFIG_BPF_SYSCALL ifdef since rqspinlock is not available otherwise.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Convert hashtab.c from raw_spinlock to rqspinlock, and drop the hashed
per-cpu counter crud from the code base which is no longer necessary.

Closes: https://lore.kernel.org/bpf/[email protected]
Closes: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Convert the percpu_freelist.c code to use rqspinlock, and remove the
extralist fallback and trylock-based acquisitions to avoid deadlocks.

Key thing to note is the retained while (true) loop to search through
other CPUs when failing to push a node due to locking errors. This
retains the behavior of the old code, where it would keep trying until
it would be able to successfully push the node back into the freelist of
a CPU.

Technically, we should start iteration for this loop from
raw_smp_processor_id() + 1, but to avoid hitting the edge of nr_cpus,
we skip execution in the loop body instead.

Closes: https://lore.kernel.org/bpf/CAPPBnEa1_pZ6W24+WwtcNFvTUHTHO7KUmzEbOcMqxp+m2o15qQ@mail.gmail.com
Closes: https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@mail.gmail.com
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Convert all LPM trie usage of raw_spinlock to rqspinlock.

Note that rcu_dereference_protected in trie_delete_elem is switched over
to plain rcu_dereference: the RCU read lock should be held from the BPF
program side or the eBPF syscall path, and the trie->lock is just acquired
before the dereference. It is not clear from the commit history why the
protected variant was used, but the above reasoning makes sense, so
switch over.

Closes: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Introduce four new kfuncs: bpf_res_spin_lock, bpf_res_spin_unlock,
and their irqsave/irqrestore variants, which wrap the rqspinlock APIs.
bpf_res_spin_lock returns a conditional result, depending on whether the
lock was acquired (NULL is returned when lock acquisition succeeds,
non-NULL upon failure). The memory pointed to by the returned pointer
upon failure can be dereferenced after the NULL check to obtain the
error code.

Instead of using the old bpf_spin_lock type, introduce a new type with
the same layout, and the same alignment, but a different name to avoid
type confusion.

Preemption is disabled upon successful lock acquisition, however IRQs
are not. Special kfuncs can be introduced later to allow disabling IRQs
when taking a spin lock. Resilient locks are safe against AA deadlocks,
hence not disabling IRQs currently does not allow violation of kernel
safety.

__irq_flag annotation is used to accept IRQ flags for the IRQ-variants,
with the same semantics as existing bpf_local_irq_{save, restore}.

These kfuncs will require additional verifier-side support in subsequent
commits, to allow programs to hold multiple locks at the same time.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
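A hedged BPF-side sketch of how a program might use these kfuncs once the
verifier support from the following patches is in place; the map layout and
the exact kfunc prototypes are assumptions, and the failure check simply
follows the non-zero/non-NULL convention described above:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  /* Assumed declarations; the real ones come from vmlinux.h / bpf_kfuncs.h. */
  extern int bpf_res_spin_lock(struct bpf_res_spin_lock *lock) __ksym;
  extern void bpf_res_spin_unlock(struct bpf_res_spin_lock *lock) __ksym;

  struct elem {
          struct bpf_res_spin_lock lock;
          long counter;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 1);
          __type(key, int);
          __type(value, struct elem);
  } arr SEC(".maps");

  SEC("tc")
  int bump_counter(struct __sk_buff *ctx)
  {
          int key = 0;
          struct elem *e = bpf_map_lookup_elem(&arr, &key);

          if (!e)
                  return 0;
          if (bpf_res_spin_lock(&e->lock))   /* failure: lock not taken, bail out */
                  return 0;
          e->counter++;
          bpf_res_spin_unlock(&e->lock);
          return 0;
  }

  char _license[] SEC("license") = "GPL";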
Introduce verifier-side support for rqspinlock kfuncs. The first step is
allowing bpf_res_spin_lock type to be defined in map values and
allocated objects, so BTF-side is updated with a new BPF_RES_SPIN_LOCK
field to recognize and validate.

An object cannot have both bpf_spin_lock and bpf_res_spin_lock; only
one of them (and at most one of them per object, like before) may be
present. The bpf_res_spin_lock can also be used to protect objects that
require lock protection for their kfuncs, like BPF rbtree and linked
list.

The verifier plumbing to simulate success and failure cases when calling
the kfuncs is done by pushing a new verifier state to the verifier state
stack which will verify the failure case upon calling the kfunc. The
path where success is indicated creates all lock reference state and IRQ
state (if necessary for irqsave variants). In the case of failure, the
state clears the registers r0-r5, sets the return value, and skips kfunc
processing, proceeding to the next instruction.

When marking the return value for success case, the value is marked as
0, and for the failure case as [-MAX_ERRNO, -1]. Then, in the program,
whenever user checks the return value as 'if (ret)' or 'if (ret < 0)'
the verifier never traverses such branches for success cases, and would
be aware that the lock is not held in such cases.

We push the kfunc state in check_kfunc_call whenever rqspinlock kfuncs
are invoked. We introduce a kfunc_class state to avoid mixing lock
irqrestore kfuncs with IRQ state created by bpf_local_irq_save.

With all this infrastructure, these kfuncs become usable in programs
while satisfying all safety properties required by the kernel.

Acked-by: Eduard Zingerman <[email protected]>
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Since out-of-order unlocks are unsupported for rqspinlock, and irqsave
variants enforce strict FIFO ordering anyway, make the same change for
normal non-irqsave variants, such that FIFO ordering is enforced.

Two new verifier state fields (active_lock_id, active_lock_ptr) are used
to denote the top of the stack, and prev_id and prev_ptr are ascertained
whenever popping the topmost entry through an unlock.

Take special care to make these fields part of the state comparison in
refsafe.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Introduce selftests that trigger AA, ABBA deadlocks, and test the edge
case where the held locks table runs out of entries, since we then
fallback to the timeout as the final line of defense. Also exercise
verifier's AA detection where applicable.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
@kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 10 times, most recently from 50dba77 to ce294a5 on March 19, 2025 at 23:09
@kkdwivedi kkdwivedi closed this Mar 20, 2025