twx-linux/kernel
Daniel Borkmann 511804ab70 bpf: Fix overrunning reservations in ringbuf
[ Upstream commit cfa1a2329a691ffd991fcf7248a57d752e712881 ]

The BPF ring buffer internally is implemented as a power-of-2 sized circular
buffer, with two logical and ever-increasing counters: consumer_pos is the
consumer counter to show which logical position the consumer consumed the
data, and producer_pos which is the producer counter denoting the amount of
data reserved by all producers.

Each time a record is reserved, the producer that "owns" the record will
successfully advance producer counter. In user space each time a record is
read, the consumer of the data advanced the consumer counter once it finished
processing. Both counters are stored in separate pages so that from user
space, the producer counter is read-only and the consumer counter is read-write.

One aspect that simplifies and thus speeds up the implementation of both
producers and consumers is how the data area is mapped twice contiguously
back-to-back in the virtual memory, allowing to not take any special measures
for samples that have to wrap around at the end of the circular buffer data
area, because the next page after the last data page would be first data page
again, and thus the sample will still appear completely contiguous in virtual
memory.

Each record has a struct bpf_ringbuf_hdr { u32 len; u32 pg_off; } header for
book-keeping the length and offset, and is inaccessible to the BPF program.
Helpers like bpf_ringbuf_reserve() return `(void *)hdr + BPF_RINGBUF_HDR_SZ`
for the BPF program to use. Bing-Jhong and Muhammad reported that it is however
possible to make a second allocated memory chunk overlapping with the first
chunk and as a result, the BPF program is now able to edit first chunk's
header.

For example, consider the creation of a BPF_MAP_TYPE_RINGBUF map with size
of 0x4000. Next, the consumer_pos is modified to 0x3000 /before/ a call to
bpf_ringbuf_reserve() is made. This will allocate a chunk A, which is in
[0x0,0x3008], and the BPF program is able to edit [0x8,0x3008]. Now, lets
allocate a chunk B with size 0x3000. This will succeed because consumer_pos
was edited ahead of time to pass the `new_prod_pos - cons_pos > rb->mask`
check. Chunk B will be in range [0x3008,0x6010], and the BPF program is able
to edit [0x3010,0x6010]. Due to the ring buffer memory layout mentioned
earlier, the ranges [0x0,0x4000] and [0x4000,0x8000] point to the same data
pages. This means that chunk B at [0x4000,0x4008] is chunk A's header.
bpf_ringbuf_submit() / bpf_ringbuf_discard() use the header's pg_off to then
locate the bpf_ringbuf itself via bpf_ringbuf_restore_from_rec(). Once chunk
B modified chunk A's header, then bpf_ringbuf_commit() refers to the wrong
page and could cause a crash.

Fix it by calculating the oldest pending_pos and check whether the range
from the oldest outstanding record to the newest would span beyond the ring
buffer size. If that is the case, then reject the request. We've tested with
the ring buffer benchmark in BPF selftests (./benchs/run_bench_ringbufs.sh)
before/after the fix and while it seems a bit slower on some benchmarks, it
is still not significantly enough to matter.

Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Reported-by: Muhammad Ramdhan <ramdhan@starlabs.sg>
Co-developed-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240621140828.18238-1-daniel@iogearbox.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-05 09:33:47 +02:00
..
bpf bpf: Fix overrunning reservations in ringbuf 2024-07-05 09:33:47 +02:00
cgroup sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level 2024-06-12 11:12:13 +02:00
configs Kbuild updates for v6.6 2023-09-05 11:01:47 -07:00
debug kdb: Use format-specifiers rather than memset() for padding in kdb_read() 2024-06-16 13:47:44 +02:00
dma swiotlb: extend buffer pre-padding to alloc_align_mask if necessary 2024-06-21 14:38:46 +02:00
entry entry: Respect changes to system call number by trace_sys_enter() 2024-04-03 15:28:50 +02:00
events perf/core: Fix missing wakeup when waiting for context reference 2024-06-21 14:38:39 +02:00
futex futex: Don't include process MM in futex key on no-MMU 2023-11-20 11:58:53 +01:00
gcov gcov: add support for GCC 14 2024-06-27 13:49:13 +02:00
irq genirq/irqdesc: Prevent use-after-free in irq_find_at_or_after() 2024-06-16 13:47:46 +02:00
kcsan kcsan: Don't expect 64 bits atomic builtins from 32 bits architectures 2023-06-09 23:29:50 +10:00
livepatch livepatch: Fix missing newline character in klp_resolve_symbols() 2023-11-20 11:59:25 +01:00
locking lockdep: Fix block chain corruption 2023-12-03 07:33:06 +01:00
module modules: wait do_free_init correctly 2024-03-26 18:19:55 -04:00
power PM: s2idle: Make sure CPUs will wakeup directly on resume 2024-04-17 11:19:26 +02:00
printk printk: For @suppress_panic_printk check for other CPU in panic 2024-04-13 13:07:29 +02:00
rcu rcutorture: Fix invalid context warning when enable srcu barrier testing 2024-06-27 13:49:01 +02:00
sched sched/core: Fix incorrect initialization of the 'burst' parameter in cpu_max_write() 2024-06-12 11:12:13 +02:00
time tick/nohz_full: Don't abuse smp_call_function_single() in tick_setup_device() 2024-06-21 14:38:46 +02:00
trace tracing: Add MODULE_DESCRIPTION() to preemptirq_delay_test 2024-06-27 13:49:14 +02:00
.gitignore
acct.c audit/stable-6.6 PR 20230829 2023-08-30 08:17:35 -07:00
async.c async: Introduce async_schedule_dev_nocall() 2024-01-31 16:18:49 -08:00
audit_fsnotify.c
audit_tree.c
audit_watch.c audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare() 2023-11-28 17:19:56 +00:00
audit.c audit: Send netlink ACK before setting connection in auditd_set 2024-02-05 20:14:14 +00:00
audit.h audit: correct audit_filter_inodes() definition 2023-07-21 12:17:25 -04:00
auditfilter.c audit: move trailing statements to next line 2023-08-15 18:16:14 -04:00
auditsc.c audit,io_uring: io_uring openat triggers audit reference count underflow 2023-10-13 18:34:46 +02:00
backtracetest.c
bounds.c bounds: Use the right number of bits for power-of-two CONFIG_NR_CPUS 2024-05-02 16:32:50 +02:00
capability.c lsm: constify the 'target' parameter in security_capget() 2023-08-08 16:48:47 -04:00
cfi.c
compat.c
configs.c
context_tracking.c locking/atomic: treewide: use raw_atomic*_<op>() 2023-06-05 09:57:20 +02:00
cpu_pm.c
cpu.c cpu: Ignore "mitigations" kernel parameter if CPU_MITIGATIONS=n 2024-06-12 11:11:24 +02:00
crash_core.c mm: turn folio_test_hugetlb into a PageType 2024-05-02 16:32:47 +02:00
crash_dump.c
cred.c cred: get rid of CONFIG_DEBUG_CREDENTIALS 2023-12-20 17:01:51 +01:00
delayacct.c delayacct: track delays from IRQ/SOFTIRQ 2023-04-18 16:39:34 -07:00
dma.c
exec_domain.c
exit.c cred: get rid of CONFIG_DEBUG_CREDENTIALS 2023-12-20 17:01:51 +01:00
extable.c
fail_function.c
fork.c Revert "fork: defer linking file vma until vma is fully initialized" 2024-06-21 14:38:47 +02:00
freezer.c
gen_kheaders.sh kheaders: explicitly define file modes for archived headers 2024-06-21 14:38:40 +02:00
groups.c
hung_task.c
iomem.c kernel/iomem.c: remove __weak ioremap_cache helper 2023-08-21 13:37:28 -07:00
irq_work.c
jump_label.c
kallsyms_internal.h
kallsyms_selftest.c Modules changes for v6.6-rc1 2023-08-29 17:32:32 -07:00
kallsyms_selftest.h
kallsyms.c kallsyms: Change func signature for cleanup_symbol_name() 2023-08-25 15:00:36 -07:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.kexec kexec: select CRYPTO from KEXEC_FILE instead of depending on it 2024-01-05 15:19:41 +01:00
Kconfig.locks
Kconfig.preempt
kcov.c kcov: don't lose track of remote references during softirqs 2024-06-27 13:49:13 +02:00
kexec_core.c kexec: do syscore_shutdown() in kernel_kexec 2024-01-31 16:18:56 -08:00
kexec_elf.c
kexec_file.c integrity-v6.6 2023-08-30 09:16:56 -07:00
kexec_internal.h
kexec.c kernel: kexec: copy user-array safely 2023-11-28 17:19:40 +00:00
kheaders.c
kprobes.c kprobe/ftrace: fix build error due to bad function definition 2024-06-27 13:49:15 +02:00
ksyms_common.c kallsyms: make kallsyms_show_value() as generic function 2023-06-08 12:27:20 -07:00
ksysfs.c crash: hotplug support for kexec_load() 2023-08-24 16:25:14 -07:00
kthread.c kthread: add kthread_stop_put 2024-06-12 11:12:52 +02:00
latencytop.c
Makefile kernel/numa.c: Move logging out of numa.h 2024-06-12 11:11:50 +02:00
module_signature.c
notifier.c
nsproxy.c nsproxy: Convert nsproxy.count to refcount_t 2023-08-21 11:29:12 -07:00
numa.c kernel/numa.c: Move logging out of numa.h 2024-06-12 11:11:50 +02:00
padata.c padata: Disable BH when taking works lock on MT path 2024-06-27 13:49:00 +02:00
panic.c panic: Flush kernel log buffer at the end 2024-04-13 13:07:29 +02:00
params.c kernel: params: Remove unnecessary ‘0’ values from err 2023-07-10 12:47:01 -07:00
pid_namespace.c zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING 2024-06-21 14:38:50 +02:00
pid_sysctl.h memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy 2023-08-21 13:37:59 -07:00
pid.c pidfd: prevent a kernel-doc warning 2023-09-19 13:21:33 -07:00
profile.c
ptrace.c ptrace: Provide set/get interface for syscall user dispatch 2023-04-16 14:23:07 +02:00
range.c
reboot.c kernel/reboot: emergency_restart: Set correct system_state 2023-11-28 17:20:04 +00:00
regset.c
relay.c kernel: relay: remove unnecessary NULL values from relay_open_buf 2023-08-18 10:18:55 -07:00
resource_kunit.c
resource.c kernel/resource: Increment by align value in get_free_mem_region() 2024-01-10 17:16:58 +01:00
rseq.c
scftorture.c scftorture: Pause testing after memory-allocation failure 2023-07-14 15:02:57 -07:00
scs.c
seccomp.c seccomp: Add missing kerndoc notations 2023-08-17 12:32:15 -07:00
signal.c signal: print comm and exe name on fatal signals 2023-08-18 10:18:50 -07:00
smp.c smp,csd: Throw an error if a CSD lock is stuck for too long 2023-11-28 17:19:36 +00:00
smpboot.c kthread: add kthread_stop_put 2024-06-12 11:12:52 +02:00
smpboot.h
softirq.c softirq: Fix suspicious RCU usage in __do_softirq() 2024-06-12 11:11:27 +02:00
stackleak.c stackleak: allow to specify arch specific stackleak poison function 2023-04-20 11:36:35 +02:00
stacktrace.c
static_call_inline.c
static_call.c
stop_machine.c
sys_ni.c posix-timers: Get rid of [COMPAT_]SYS_NI() uses 2024-01-20 11:51:46 +01:00
sys.c prctl: generalize PR_SET_MDWE support check to be per-arch 2024-04-03 15:28:54 +02:00
sysctl-test.c
sysctl.c v6.5-rc1-sysctl-next 2023-06-28 16:05:21 -07:00
task_work.c task_work: add kerneldoc annotation for 'data' argument 2023-09-19 13:21:32 -07:00
taskstats.c
torture.c rcutorture: Fix stuttering races and other issues 2023-11-28 17:20:08 +00:00
tracepoint.c
tsacct.c
ucount.c sysctl: Add size to register_sysctl 2023-08-15 15:26:17 -07:00
uid16.c
uid16.h
umh.c sysctl: fix unused proc_cap_handler() function warning 2023-06-29 15:19:43 -07:00
up.c
user_namespace.c
user-return-notifier.c
user.c
usermode_driver.c
utsname_sysctl.c
utsname.c
vhost_task.c vhost: Fix worker hangs due to missed wake up calls 2023-06-08 15:43:09 -04:00
watch_queue.c kernel: watch_queue: copy user-array safely 2023-11-28 17:19:40 +00:00
watchdog_buddy.c watchdog/hardlockup: move SMP barriers from common code to buddy code 2023-06-19 16:25:28 -07:00
watchdog_perf.c watchdog/perf: add a weak function for an arch to detect if perf can use NMIs 2023-06-09 17:44:21 -07:00
watchdog.c watchdog: move softlockup_panic back to early_param 2023-11-28 17:19:57 +00:00
workqueue_internal.h workqueue: Drop the special locking rule for worker->flags and worker_pool->flags 2023-08-07 15:57:22 -10:00
workqueue.c workqueue: Fix selection of wake_cpu in kick_pool() 2024-05-17 12:02:31 +02:00