syzbot


ID Workflow Result Correct Bug Created Started Finished Revision Error
fb3eaf3e-96bd-422f-b4ca-56bccf7ad6e0 assessment-kcsan Benign: ✅  Confident: ✅  KCSAN: data-race in _prb_read_valid / data_alloc 2026/02/24 21:11 2026/02/24 21:11 2026/02/24 21:15 305c0ec5cd886e2d13738e28e1b2df9b0ec20fc9

Explanation:
The data race reported by KCSAN occurs in `kernel/printk/printk_ringbuffer.c` between `data_alloc()` (the writer) and `desc_read()` (the reader).

### Analysis

The `printk_ringbuffer` is a lockless ringbuffer designed to handle concurrent writers and readers. It uses a descriptor ring where each descriptor has a `state_var` that atomically tracks its ID and state (reserved, committed, finalized, or reusable).

1.  **Writer Side (`data_alloc`):**
    When a writer reserves a new record via `prb_reserve()`, it first reserves a descriptor, which transitions its state to `desc_reserved`. Then, `data_alloc()` is called to allocate space in the data ring and initialize the descriptor's logical position fields (`text_blk_lpos.begin` and `text_blk_lpos.next`). The reported write at line 1129 (`blk_lpos->begin = begin_lpos;`) is part of this initialization.

2.  **Reader Side (`desc_read`):**
    A reader attempting to read a record calls `desc_read()`. This function follows a strict lockless protocol:
    *   It first reads the `state_var` (line 461) to check if the descriptor is in a consistent state (committed or finalized) and has the expected ID.
    *   If consistent, it proceeds to copy the descriptor's metadata, including `text_blk_lpos`, using `memcpy()` (line 496). This is the reported 16-byte read.
    *   Crucially, after the copy, it re-reads the `state_var` (line 548) and re-verifies that the ID and state are still the same and consistent.

### The Race

The race happens when a reader (CPU 1) starts reading a descriptor that it thinks is valid, but a writer (CPU 0) simultaneously recycles that same descriptor slot for a new message. 
*   CPU 1 reads `state_var` and sees it is finalized.
*   CPU 0 (the writer) recycles the descriptor, changing its state to `reserved` and its ID to a new value.
*   CPU 0 starts writing new values to `text_blk_lpos.begin` (line 1129).
*   CPU 1 reads `text_blk_lpos` (line 496), potentially seeing a mix of old and new data.
*   CPU 1 re-reads `state_var` (line 548), detects that the ID has changed or the state is now `reserved`, and consequently discards the data it just read, returning an error state (`desc_miss` or `desc_reserved`).

### Conclusion

This data race is **intentional** and is a fundamental part of the lockless synchronization algorithm used in the printk ringbuffer. The code is explicitly designed to handle this race by re-checking the atomic `state_var` after copying the data. Memory barriers (`smp_rmb()`) are correctly placed to ensure that the re-check happens after the data copy and that the data copy happens after the initial state check.

The race is **benign** because it does not lead to any incorrect behavior, memory corruption, or crashes. The reader correctly identifies that the data it read was inconsistent and ignores it. The use of `memcpy()` on aligned structures within the ringbuffer's memory is safe, and the results are never used if the consistency check fails.

The data race is benign.

Crash report:
==================================================================
BUG: KCSAN: data-race in _prb_read_valid / data_alloc

write to 0xffffffff86c7a1c0 of 8 bytes by task 12582 on cpu 0:
 data_alloc+0x310/0x390 kernel/printk/printk_ringbuffer.c:1129
 prb_reserve+0x8d7/0xae0 kernel/printk/printk_ringbuffer.c:1724
 vprintk_store+0x54a/0x910 kernel/printk/printk.c:2354
 vprintk_emit+0x1a4/0x600 kernel/printk/printk.c:2455
 vprintk_default+0x26/0x30 kernel/printk/printk.c:2494
 vprintk+0x1d/0x30 kernel/printk/printk_safe.c:82
 _printk+0x79/0xa0 kernel/printk/printk.c:2504
 __ext4_msg+0x1aa/0x1c0 fs/ext4/super.c:995
 ext4_put_super+0x90/0x7d0 fs/ext4/super.c:1300
 generic_shutdown_super+0xee/0x220 fs/super.c:646
 kill_block_super+0x2a/0x70 fs/super.c:1725
 ext4_kill_sb+0x42/0x80 fs/ext4/super.c:7459
 deactivate_locked_super+0x75/0x1c0 fs/super.c:476
 deactivate_super+0x97/0xa0 fs/super.c:509
 cleanup_mnt+0x2bb/0x330 fs/namespace.c:1312
 __cleanup_mnt+0x19/0x20 fs/namespace.c:1319
 task_work_run+0x130/0x1a0 kernel/task_work.c:233
 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
 __exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
 exit_to_user_mode_loop+0x1f4/0x6f0 kernel/entry/common.c:98
 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
 syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
 do_syscall_64+0x249/0x370 arch/x86/entry/syscall_64.c:100
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffffffff86c7a1c0 of 16 bytes by task 2990 on cpu 1:
 desc_read kernel/printk/printk_ringbuffer.c:496 [inline]
 desc_read_finalized_seq kernel/printk/printk_ringbuffer.c:1931 [inline]
 prb_read kernel/printk/printk_ringbuffer.c:1979 [inline]
 _prb_read_valid+0x1bc/0x950 kernel/printk/printk_ringbuffer.c:2170
 prb_read_valid+0x3c/0x60 kernel/printk/printk_ringbuffer.c:2242
 syslog_print+0x258/0x560 kernel/printk/printk.c:1628
 do_syslog+0x52b/0x7f0 kernel/printk/printk.c:1763
 __do_sys_syslog kernel/printk/printk.c:1855 [inline]
 __se_sys_syslog kernel/printk/printk.c:1853 [inline]
 __x64_sys_syslog+0x41/0x50 kernel/printk/printk.c:1853
 x64_sys_call+0x2b9e/0x3020 arch/x86/include/generated/asm/syscalls_64.h:104
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 2990 Comm: klogd Tainted: G        W           syzkaller #0 PREEMPT(full) 
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026
==================================================================

Trajectory:
Seq Timestamp Type Name Duration
0/0 2026/02/24 21:11 flow assessment-kcsan
4m Results:
map[Benign:true Confident:true Explanation:The data race reported by KCSAN occurs in `kernel/printk/printk_ringbuffer.c` between `data_alloc()` (the writer) and `desc_read()` (the reader).

### Analysis

The `printk_ringbuffer` is a lockless ringbuffer designed to handle concurrent writers and readers. It uses a descriptor ring where each descriptor has a `state_var` that atomically tracks its ID and state (reserved, committed, finalized, or reusable).

1.  **Writer Side (`data_alloc`):**
    When a writer reserves a new record via `prb_reserve()`, it first reserves a descriptor, which transitions its state to `desc_reserved`. Then, `data_alloc()` is called to allocate space in the data ring and initialize the descriptor's logical position fields (`text_blk_lpos.begin` and `text_blk_lpos.next`). The reported write at line 1129 (`blk_lpos->begin = begin_lpos;`) is part of this initialization.

2.  **Reader Side (`desc_read`):**
    A reader attempting to read a record calls `desc_read()`. This function follows a strict lockless protocol:
    *   It first reads the `state_var` (line 461) to check if the descriptor is in a consistent state (committed or finalized) and has the expected ID.
    *   If consistent, it proceeds to copy the descriptor's metadata, including `text_blk_lpos`, using `memcpy()` (line 496). This is the reported 16-byte read.
    *   Crucially, after the copy, it re-reads the `state_var` (line 548) and re-verifies that the ID and state are still the same and consistent.

### The Race

The race happens when a reader (CPU 1) starts reading a descriptor that it thinks is valid, but a writer (CPU 0) simultaneously recycles that same descriptor slot for a new message. 
*   CPU 1 reads `state_var` and sees it is finalized.
*   CPU 0 (the writer) recycles the descriptor, changing its state to `reserved` and its ID to a new value.
*   CPU 0 starts writing new values to `text_blk_lpos.begin` (line 1129).
*   CPU 1 reads `text_blk_lpos` (line 496), potentially seeing a mix of old and new data.
*   CPU 1 re-reads `state_var` (line 548), detects that the ID has changed or the state is now `reserved`, and consequently discards the data it just read, returning an error state (`desc_miss` or `desc_reserved`).

### Conclusion

This data race is **intentional** and is a fundamental part of the lockless synchronization algorithm used in the printk ringbuffer. The code is explicitly designed to handle this race by re-checking the atomic `state_var` after copying the data. Memory barriers (`smp_rmb()`) are correctly placed to ensure that the re-check happens after the data copy and that the data copy happens after the initial state check.

The race is **benign** because it does not lead to any incorrect behavior, memory corruption, or crashes. The reader correctly identifies that the data it read was inconsistent and ignores it. The use of `memcpy()` on aligned structures within the ringbuffer's memory is safe, and the results are never used if the consistency check fails.

The data race is benign.]

1/1 2026/02/24 21:11 action kernel-checkouter
0m Results:
map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/83e3e4073a4cd51a6ced0f93cc08b3f6ffb54f19]

2/1 2026/02/24 21:11 action kernel-builder
0m Results:
map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/b54bee6631687294095c3f2df1d84662f988232e]

3/1 2026/02/24 21:11 action codesearch-prepare
0m Results:
map[Index:codesearch-index]

4/1 2026/02/24 21:11 agent expert
4m Model:
gemini-3-flash-preview

Results:
map[Benign:true Confident:true]

Instruction:
You are an experienced Linux kernel developer tasked with determining if the given kernel
data race is benign or not. The data race report is from KCSAN tool.
It contains 2 stack traces of the memory accesses that constitute a data race.

A "benign" data races are on a simple int/bool variable or similar field,
and the accesses are not supposed to be protected by any mutual exclusion primitives.
Common examples of such "benign" data races are accesses to various flags fields,
statistics counters, and similar. A "benign" data race does not lead to memory corruption/crash
with a conservative compiler that compiles memory accesses to primitive types
effectively as atomic.

A non-benign (or "harmful") data race can lead to corruption/crash even with
a conservative compiler that compiles memory accesses to primitive types
effectively as atomic. A common example of a "harmful" data race is a race on
a complex container (list/hashmap/etc), where accesses are supposed to be protected
by a mutual exclusion primitive.

Also consider races that happen at the same time as the given one.
If there is no synchronization in between, other memory accesses in the involved threads
race with each other if they access the same memory. For example, if both threads execute:

	some_struct->have_elements = true;
	list_add(new_node, &some_struct->list_head);

the race on some_struct->have_elements may appear benign; however, it also implies there
is a race on some_struct->list_head, which is not benign, since the list is not thread-safe.

Take into account that on 32-bit systems 64-bit memory accesses may be split into two accesses,
and thus even with a conservative compiler may not be fully atomic. However, such races may
still be benign depending on how writes are done, and how read data is used.

In the final reply explain why you think the given data race is benign or is harmful.

Use the provided tools to confirm any assumptions, variables/fields being accessed, etc.
In particular, don't make assumptions about the kernel source code,
use codesearch tools to read the actual source code.

Prefer calling several tools at the same time to save round-trips.


Use set-results tool to provide results of the analysis.
It must be called exactly once before the final reply.
Ignore results of this tool.

Prompt:
The data race report is:

==================================================================
BUG: KCSAN: data-race in _prb_read_valid / data_alloc

write to 0xffffffff86c7a1c0 of 8 bytes by task 12582 on cpu 0:
 data_alloc+0x310/0x390 kernel/printk/printk_ringbuffer.c:1129
 prb_reserve+0x8d7/0xae0 kernel/printk/printk_ringbuffer.c:1724
 vprintk_store+0x54a/0x910 kernel/printk/printk.c:2354
 vprintk_emit+0x1a4/0x600 kernel/printk/printk.c:2455
 vprintk_default+0x26/0x30 kernel/printk/printk.c:2494
 vprintk+0x1d/0x30 kernel/printk/printk_safe.c:82
 _printk+0x79/0xa0 kernel/printk/printk.c:2504
 __ext4_msg+0x1aa/0x1c0 fs/ext4/super.c:995
 ext4_put_super+0x90/0x7d0 fs/ext4/super.c:1300
 generic_shutdown_super+0xee/0x220 fs/super.c:646
 kill_block_super+0x2a/0x70 fs/super.c:1725
 ext4_kill_sb+0x42/0x80 fs/ext4/super.c:7459
 deactivate_locked_super+0x75/0x1c0 fs/super.c:476
 deactivate_super+0x97/0xa0 fs/super.c:509
 cleanup_mnt+0x2bb/0x330 fs/namespace.c:1312
 __cleanup_mnt+0x19/0x20 fs/namespace.c:1319
 task_work_run+0x130/0x1a0 kernel/task_work.c:233
 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
 __exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
 exit_to_user_mode_loop+0x1f4/0x6f0 kernel/entry/common.c:98
 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
 syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
 do_syscall_64+0x249/0x370 arch/x86/entry/syscall_64.c:100
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffffffff86c7a1c0 of 16 bytes by task 2990 on cpu 1:
 desc_read kernel/printk/printk_ringbuffer.c:496 [inline]
 desc_read_finalized_seq kernel/printk/printk_ringbuffer.c:1931 [inline]
 prb_read kernel/printk/printk_ringbuffer.c:1979 [inline]
 _prb_read_valid+0x1bc/0x950 kernel/printk/printk_ringbuffer.c:2170
 prb_read_valid+0x3c/0x60 kernel/printk/printk_ringbuffer.c:2242
 syslog_print+0x258/0x560 kernel/printk/printk.c:1628
 do_syslog+0x52b/0x7f0 kernel/printk/printk.c:1763
 __do_sys_syslog kernel/printk/printk.c:1855 [inline]
 __se_sys_syslog kernel/printk/printk.c:1853 [inline]
 __x64_sys_syslog+0x41/0x50 kernel/printk/printk.c:1853
 x64_sys_call+0x2b9e/0x3020 arch/x86/include/generated/asm/syscalls_64.h:104
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 2990 Comm: klogd Tainted: G        W           syzkaller #0 PREEMPT(full) 
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026
==================================================================


Reply:
The data race reported by KCSAN occurs in `kernel/printk/printk_ringbuffer.c` between `data_alloc()` (the writer) and `desc_read()` (the reader).

### Analysis

The `printk_ringbuffer` is a lockless ringbuffer designed to handle concurrent writers and readers. It uses a descriptor ring where each descriptor has a `state_var` that atomically tracks its ID and state (reserved, committed, finalized, or reusable).

1.  **Writer Side (`data_alloc`):**
    When a writer reserves a new record via `prb_reserve()`, it first reserves a descriptor, which transitions its state to `desc_reserved`. Then, `data_alloc()` is called to allocate space in the data ring and initialize the descriptor's logical position fields (`text_blk_lpos.begin` and `text_blk_lpos.next`). The reported write at line 1129 (`blk_lpos->begin = begin_lpos;`) is part of this initialization.

2.  **Reader Side (`desc_read`):**
    A reader attempting to read a record calls `desc_read()`. This function follows a strict lockless protocol:
    *   It first reads the `state_var` (line 461) to check if the descriptor is in a consistent state (committed or finalized) and has the expected ID.
    *   If consistent, it proceeds to copy the descriptor's metadata, including `text_blk_lpos`, using `memcpy()` (line 496). This is the reported 16-byte read.
    *   Crucially, after the copy, it re-reads the `state_var` (line 548) and re-verifies that the ID and state are still the same and consistent.

### The Race

The race happens when a reader (CPU 1) starts reading a descriptor that it thinks is valid, but a writer (CPU 0) simultaneously recycles that same descriptor slot for a new message. 
*   CPU 1 reads `state_var` and sees it is finalized.
*   CPU 0 (the writer) recycles the descriptor, changing its state to `reserved` and its ID to a new value.
*   CPU 0 starts writing new values to `text_blk_lpos.begin` (line 1129).
*   CPU 1 reads `text_blk_lpos` (line 496), potentially seeing a mix of old and new data.
*   CPU 1 re-reads `state_var` (line 548), detects that the ID has changed or the state is now `reserved`, and consequently discards the data it just read, returning an error state (`desc_miss` or `desc_reserved`).

### Conclusion

This data race is **intentional** and is a fundamental part of the lockless synchronization algorithm used in the printk ringbuffer. The code is explicitly designed to handle this race by re-checking the atomic `state_var` after copying the data. Memory barriers (`smp_rmb()`) are correctly placed to ensure that the re-check happens after the data copy and that the data copy happens after the initial state check.

The race is **benign** because it does not lead to any incorrect behavior, memory corruption, or crashes. The reader correctly identifies that the data it read was inconsistent and ignores it. The use of `memcpy()` on aligned structures within the ringbuffer's memory is safe, and the results are never used if the consistency check fails.

The data race is benign.

5/2 2026/02/24 21:11 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
	input: 4344
	output: 25
	thoughts: 222

Thoughts:
**Analyzing a Data Race in the Kernel's Printk Ringbuffer**

Okay, so the report flags a data race, a classic concurrency issue. It looks like we have a conflict between the `data_alloc` function, which is writing to memory, and the `_prb_read_valid` function, which is reading. The specific location is in `kernel/printk/printk_ringbuffer.c`.  The address of contention is `0xffffffff86c7a1c0`. Based on that high address, this is likely a global or static variable, which would make sense. These are common culprits in these kinds of races, as they can be accessed from different parts of the code.

The report details the write being 8 bytes and the read being 16 bytes. That size difference could be critical - it means the read operation could be reading data that's only partially written.

Now, let's dive into the code. The write happens at `data_alloc+0x310/0x390` (line 1129), and the read is in `_prb_read_valid+0x1bc/0x950` (line 2170), invoked via the inlined `desc_read` function at line 496. I'll need to examine the context of those lines within `kernel/printk/printk_ringbuffer.c` to trace the data flow and understand exactly how those functions interact. I'll start with that file...



6/2 2026/02/24 21:11 tool read-file
0m Args:
map[File:kernel/printk/printk_ringbuffer.c]

Results:
map[Contents:// SPDX-License-Identifier: GPL-2.0

#include <kunit/visibility.h>
#include <linux/kernel.h>
#include <linux/irqflags.h>
#include <linux/string.h>
#include <linux/errno.h>
#include <linux/bug.h>
#include "printk_ringbuffer.h"
#include "internal.h"

/**
 * DOC: printk_ringbuffer overview
 *
 * Data Structure
 * --------------
 * The printk_ringbuffer is made up of 3 internal ringbuffers:
 *
 *   desc_ring
 *     A ring of descriptors and their meta data (such as sequence number,
 *     timestamp, loglevel, etc.) as well as internal state information about
 *     the record and logical positions specifying where in the other
 *     ringbuffer the text strings are located.
 *
 *   text_data_ring
 *     A ring of data blocks. A data block consists of an unsigned long
 *     integer (ID) that maps to a desc_ring index followed by the text
 *     string of the record.
 *
 * The internal state information of a descriptor is the key element to allow
 * readers and writers to locklessly synchronize access to the data.
 *
 * Implementation
 * --------------
 *
 * Descriptor Ring
 * ~~~~~~~~~~~~~~~
 * The descriptor ring is an array of descriptors. A descriptor contains
 * essential meta data to track the data of a printk record using
 * blk_lpos structs pointing to associated text data blocks (see
 * "Data Rings" below). Each descriptor is assigned an ID that maps
 * directly to index values of the descriptor array and has a state. The ID
 * and the state are bitwise combined into a single descriptor field named
 * @state_var, allowing ID and state to be synchronously and atomically
 * updated.
 *
 * Descriptors have four states:
 *
 *   reserved
 *     A writer is modifying the record.
 *
 *   committed
 *     The record and all its data are written. A writer can reopen the
 *     descriptor (transitioning it back to reserved), but in the committed
 *     state the data is consistent.
 *
 *   finalized
 *     The record and all its data are complete and available for reading. A
 *     writer cannot reopen the descriptor.
 *
 *   reusable
 *     The record exists, but its text and/or meta data may no longer be
 *     available.
 *
 * Querying the @state_var of a record requires providing the ID of the
 * descriptor to query. This can yield a possible fifth (pseudo) state:
 *
 *   miss
 *     The descriptor being queried has an unexpected ID.
 *
 * The descriptor ring has a @tail_id that contains the ID of the oldest
 * descriptor and @head_id that contains the ID of the newest descriptor.
 *
 * When a new descriptor should be created (and the ring is full), the tail
 * descriptor is invalidated by first transitioning to the reusable state and
 * then invalidating all tail data blocks up to and including the data blocks
 * associated with the tail descriptor (for the text ring). Then
 * @tail_id is advanced, followed by advancing @head_id. And finally the
 * @state_var of the new descriptor is initialized to the new ID and reserved
 * state.
 *
 * The @tail_id can only be advanced if the new @tail_id would be in the
 * committed or reusable queried state. This makes it possible that a valid
 * sequence number of the tail is always available.
 *
 * Descriptor Finalization
 * ~~~~~~~~~~~~~~~~~~~~~~~
 * When a writer calls the commit function prb_commit(), record data is
 * fully stored and is consistent within the ringbuffer. However, a writer can
 * reopen that record, claiming exclusive access (as with prb_reserve()), and
 * modify that record. When finished, the writer must again commit the record.
 *
 * In order for a record to be made available to readers (and also become
 * recyclable for writers), it must be finalized. A finalized record cannot be
 * reopened and can never become "unfinalized". Record finalization can occur
 * in three different scenarios:
 *
 *   1) A writer can simultaneously commit and finalize its record by calling
 *      prb_final_commit() instead of prb_commit().
 *
 *   2) When a new record is reserved and the previous record has been
 *      committed via prb_commit(), that previous record is automatically
 *      finalized.
 *
 *   3) When a record is committed via prb_commit() and a newer record
 *      already exists, the record being committed is automatically finalized.
 *
 * Data Ring
 * ~~~~~~~~~
 * The text data ring is a byte array composed of data blocks. Data blocks are
 * referenced by blk_lpos structs that point to the logical position of the
 * beginning of a data block and the beginning of the next adjacent data
 * block. Logical positions are mapped directly to index values of the byte
 * array ringbuffer.
 *
 * Each data block consists of an ID followed by the writer data. The ID is
 * the identifier of a descriptor that is associated with the data block. A
 * given data block is considered valid if all of the following conditions
 * are met:
 *
 *   1) The descriptor associated with the data block is in the committed
 *      or finalized queried state.
 *
 *   2) The blk_lpos struct within the descriptor associated with the data
 *      block references back to the same data block.
 *
 *   3) The data block is within the head/tail logical position range.
 *
 * If the writer data of a data block would extend beyond the end of the
 * byte array, only the ID of the data block is stored at the logical
 * position and the full data block (ID and writer data) is stored at the
 * beginning of the byte array. The referencing blk_lpos will point to the
 * ID before the wrap and the next data block will be at the logical
 * position adjacent the full data block after the wrap.
 *
 * Data rings have a @tail_lpos that points to the beginning of the oldest
 * data block and a @head_lpos that points to the logical position of the
 * next (not yet existing) data block.
 *
 * When a new data block should be created (and the ring is full), tail data
 * blocks will first be invalidated by putting their associated descriptors
 * into the reusable state and then pushing the @tail_lpos forward beyond
 * them. Then the @head_lpos is pushed forward and is associated with a new
 * descriptor. If a data block is not valid, the @tail_lpos cannot be
 * advanced beyond it.
 *
 * Info Array
 * ~~~~~~~~~~
 * The general meta data of printk records are stored in printk_info structs,
 * stored in an array with the same number of elements as the descriptor ring.
 * Each info corresponds to the descriptor of the same index in the
 * descriptor ring. Info validity is confirmed by evaluating the corresponding
 * descriptor before and after loading the info.
 *
 * Usage
 * -----
 * Here are some simple examples demonstrating writers and readers. For the
 * examples a global ringbuffer (test_rb) is available (which is not the
 * actual ringbuffer used by printk)::
 *
 *	DEFINE_PRINTKRB(test_rb, 15, 5);
 *
 * This ringbuffer allows up to 32768 records (2 ^ 15) and has a size of
 * 1 MiB (2 ^ (15 + 5)) for text data.
 *
 * Sample writer code::
 *
 *	const char *textstr = "message text";
 *	struct prb_reserved_entry e;
 *	struct printk_record r;
 *
 *	// specify how much to allocate
 *	prb_rec_init_wr(&r, strlen(textstr) + 1);
 *
 *	if (prb_reserve(&e, &test_rb, &r)) {
 *		snprintf(r.text_buf, r.text_buf_size, "%s", textstr);
 *
 *		r.info->text_len = strlen(textstr);
 *		r.info->ts_nsec = local_clock();
 *		r.info->caller_id = printk_caller_id();
 *
 *		// commit and finalize the record
 *		prb_final_commit(&e);
 *	}
 *
 * Note that additional writer functions are available to extend a record
 * after it has been committed but not yet finalized. This can be done as
 * long as no new records have been reserved and the caller is the same.
 *
 * Sample writer code (record extending)::
 *
 *		// alternate rest of previous example
 *
 *		r.info->text_len = strlen(textstr);
 *		r.info->ts_nsec = local_clock();
 *		r.info->caller_id = printk_caller_id();
 *
 *		// commit the record (but do not finalize yet)
 *		prb_commit(&e);
 *	}
 *
 *	...
 *
 *	// specify additional 5 bytes text space to extend
 *	prb_rec_init_wr(&r, 5);
 *
 *	// try to extend, but only if it does not exceed 32 bytes
 *	if (prb_reserve_in_last(&e, &test_rb, &r, printk_caller_id(), 32)) {
 *		snprintf(&r.text_buf[r.info->text_len],
 *			 r.text_buf_size - r.info->text_len, "hello");
 *
 *		r.info->text_len += 5;
 *
 *		// commit and finalize the record
 *		prb_final_commit(&e);
 *	}
 *
 * Sample reader code::
 *
 *	struct printk_info info;
 *	struct printk_record r;
 *	char text_buf[32];
 *	u64 seq;
 *
 *	prb_rec_init_rd(&r, &info, &text_buf[0], sizeof(text_buf));
 *
 *	prb_for_each_record(0, &test_rb, &seq, &r) {
 *		if (info.seq != seq)
 *			pr_warn("lost %llu records\n", info.seq - seq);
 *
 *		if (info.text_len > r.text_buf_size) {
 *			pr_warn("record %llu text truncated\n", info.seq);
 *			text_buf[r.text_buf_size - 1] = 0;
 *		}
 *
 *		pr_info("%llu: %llu: %s\n", info.seq, info.ts_nsec,
 *			&text_buf[0]);
 *	}
 *
 * Note that additional less convenient reader functions are available to
 * allow complex record access.
 *
 * ABA Issues
 * ~~~~~~~~~~
 * To help avoid ABA issues, descriptors are referenced by IDs (array index
 * values combined with tagged bits counting array wraps) and data blocks are
 * referenced by logical positions (array index values combined with tagged
 * bits counting array wraps). However, on 32-bit systems the number of
 * tagged bits is relatively small such that an ABA incident is (at least
 * theoretically) possible. For example, if 4 million maximally sized (1KiB)
 * printk messages were to occur in NMI context on a 32-bit system, the
 * interrupted context would not be able to recognize that the 32-bit integer
 * completely wrapped and thus represents a different data block than the one
 * the interrupted context expects.
 *
 * To help combat this possibility, additional state checking is performed
 * (such as using cmpxchg() even though set() would suffice). These extra
 * checks are commented as such and will hopefully catch any ABA issue that
 * a 32-bit system might experience.
 *
 * Memory Barriers
 * ~~~~~~~~~~~~~~~
 * Multiple memory barriers are used. To simplify proving correctness and
 * generating litmus tests, lines of code related to memory barriers
 * (loads, stores, and the associated memory barriers) are labeled::
 *
 *	LMM(function:letter)
 *
 * Comments reference the labels using only the "function:letter" part.
 *
 * The memory barrier pairs and their ordering are:
 *
 *   desc_reserve:D / desc_reserve:B
 *     push descriptor tail (id), then push descriptor head (id)
 *
 *   desc_reserve:D / data_push_tail:B
 *     push data tail (lpos), then set new descriptor reserved (state)
 *
 *   desc_reserve:D / desc_push_tail:C
 *     push descriptor tail (id), then set new descriptor reserved (state)
 *
 *   desc_reserve:D / prb_first_seq:C
 *     push descriptor tail (id), then set new descriptor reserved (state)
 *
 *   desc_reserve:F / desc_read:D
 *     set new descriptor id and reserved (state), then allow writer changes
 *
 *   data_alloc:A (or data_realloc:A) / desc_read:D
 *     set old descriptor reusable (state), then modify new data block area
 *
 *   data_alloc:A (or data_realloc:A) / data_push_tail:B
 *     push data tail (lpos), then modify new data block area
 *
 *   _prb_commit:B / desc_read:B
 *     store writer changes, then set new descriptor committed (state)
 *
 *   desc_reopen_last:A / _prb_commit:B
 *     set descriptor reserved (state), then read descriptor data
 *
 *   _prb_commit:B / desc_reserve:D
 *     set new descriptor committed (state), then check descriptor head (id)
 *
 *   data_push_tail:D / data_push_tail:A
 *     set descriptor reusable (state), then push data tail (lpos)
 *
 *   desc_push_tail:B / desc_reserve:D
 *     set descriptor reusable (state), then push descriptor tail (id)
 *
 *   desc_update_last_finalized:A / desc_last_finalized_seq:A
 *     store finalized record, then set new highest finalized sequence number
 */

#define DATA_SIZE(data_ring)		_DATA_SIZE((data_ring)->size_bits)
#define DATA_SIZE_MASK(data_ring)	(DATA_SIZE(data_ring) - 1)

#define DESCS_COUNT(desc_ring)		_DESCS_COUNT((desc_ring)->count_bits)
#define DESCS_COUNT_MASK(desc_ring)	(DESCS_COUNT(desc_ring) - 1)

/* Determine the data array index from a logical position. */
#define DATA_INDEX(data_ring, lpos)	((lpos) & DATA_SIZE_MASK(data_ring))

/* Determine the desc array index from an ID or sequence number. */
#define DESC_INDEX(desc_ring, n)	((n) & DESCS_COUNT_MASK(desc_ring))

/* Determine how many times the data array has wrapped. */
#define DATA_WRAPS(data_ring, lpos)	((lpos) >> (data_ring)->size_bits)

/* Determine if a logical position refers to a data-less block. */
#define LPOS_DATALESS(lpos)		((lpos) & 1UL)
#define BLK_DATALESS(blk)		(LPOS_DATALESS((blk)->begin) && \
					 LPOS_DATALESS((blk)->next))

/* Get the logical position at index 0 of the current wrap. */
#define DATA_THIS_WRAP_START_LPOS(data_ring, lpos) \
((lpos) & ~DATA_SIZE_MASK(data_ring))

/* Get the ID for the same index of the previous wrap as the given ID. */
#define DESC_ID_PREV_WRAP(desc_ring, id) \
DESC_ID((id) - DESCS_COUNT(desc_ring))
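As a standalone illustration (not kernel code; `SIZE_BITS` below is a hypothetical value chosen only for demonstration), the lpos arithmetic behind DATA_INDEX(), DATA_WRAPS() and DATA_THIS_WRAP_START_LPOS() can be reproduced in plain C:

```c
#include <assert.h>

/*
 * Illustrative userspace copy of the lpos helpers with size_bits = 4,
 * i.e. a 16-byte data ring. The values are hypothetical; the kernel
 * derives them from the configured ringbuffer size.
 */
#define SIZE_BITS	4
#define DATA_SIZE	(1UL << SIZE_BITS)
#define DATA_SIZE_MASK	(DATA_SIZE - 1)

/* Index within the data array for a logical position. */
static unsigned long data_index(unsigned long lpos)
{
	return lpos & DATA_SIZE_MASK;
}

/* How many times the data array has wrapped at this lpos. */
static unsigned long data_wraps(unsigned long lpos)
{
	return lpos >> SIZE_BITS;
}

/* Logical position of index 0 in the current wrap. */
static unsigned long this_wrap_start(unsigned long lpos)
{
	return lpos & ~DATA_SIZE_MASK;
}
```

With a 16-byte ring, lpos 37 decomposes into index 5 on the third wrap (wrap count 2), whose wrap begins at lpos 32.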

/*
 * A data block: mapped directly to the beginning of the data block area
 * specified as a logical position within the data ring.
 *
 * @id:   the ID of the associated descriptor
 * @data: the writer data
 *
 * Note that the size of a data block is only known by its associated
 * descriptor.
 */
struct prb_data_block {
	unsigned long	id;
	char		data[];
};

/*
 * Return the descriptor associated with @n. @n can be either a
 * descriptor ID or a sequence number.
 */
static struct prb_desc *to_desc(struct prb_desc_ring *desc_ring, u64 n)
{
	return &desc_ring->descs[DESC_INDEX(desc_ring, n)];
}

/*
 * Return the printk_info associated with @n. @n can be either a
 * descriptor ID or a sequence number.
 */
static struct printk_info *to_info(struct prb_desc_ring *desc_ring, u64 n)
{
	return &desc_ring->infos[DESC_INDEX(desc_ring, n)];
}

static struct prb_data_block *to_block(struct prb_data_ring *data_ring,
				       unsigned long begin_lpos)
{
	return (void *)&data_ring->data[DATA_INDEX(data_ring, begin_lpos)];
}

/*
 * Increase the data size to account for data block meta data plus any
 * padding so that the adjacent data block is aligned on the ID size.
 */
static unsigned int to_blk_size(unsigned int size)
{
	struct prb_data_block *db = NULL;

	size += sizeof(*db);
	size = ALIGN(size, sizeof(db->id));
	return size;
}
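The size transformation can be sketched in isolation. This is an illustrative userspace copy: the struct mirrors `prb_data_block`, and `ALIGN_UP` is a local stand-in for the kernel's ALIGN() macro:

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the shape of prb_data_block for sizeof() purposes only. */
struct blk {
	unsigned long	id;
	char		data[];
};

/* Local stand-in for the kernel's ALIGN(); @a must be a power of two. */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((unsigned long)(a) - 1))

static unsigned int to_blk_size(unsigned int size)
{
	size += sizeof(struct blk);	/* room for the block ID header */
	/* Pad so the adjacent data block is aligned on the ID size. */
	return ALIGN_UP(size, sizeof(unsigned long));
}
```

A 1-byte reservation therefore consumes two ID-sized words: one for the header, one (mostly padding) for the data.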

/*
 * Sanity checker for reserve size. The ringbuffer code assumes that a data
 * block does not exceed the maximum possible size that could fit within the
 * ringbuffer. This function provides that basic size check so that the
 * assumption is safe. In particular, it guarantees that data_push_tail() will
 * never attempt to push the tail beyond the head.
 */
static bool data_check_size(struct prb_data_ring *data_ring, unsigned int size)
{
	/* Data-less blocks take no space. */
	if (size == 0)
		return true;

	/*
	 * If data blocks were allowed to be larger than half the data ring
	 * size, a wrapping data block could require more space than the full
	 * ringbuffer.
	 */
	return to_blk_size(size) <= DATA_SIZE(data_ring) / 2;
}

/*
 * Compare the current and requested logical position and decide
 * whether more space is needed.
 *
 * Return false when @lpos_current is already at or beyond @lpos_target.
 *
 * Also return false when the difference between the positions is bigger
 * than the size of the data buffer. This can happen only when the caller
 * has raced with other CPUs that already made and used the space.
 */
static bool need_more_space(struct prb_data_ring *data_ring,
			    unsigned long lpos_current,
			    unsigned long lpos_target)
{
	return lpos_target - lpos_current - 1 < DATA_SIZE(data_ring);
}
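The single unsigned comparison above covers both exit conditions at once: when @lpos_current has already reached @lpos_target, `lpos_target - lpos_current - 1` underflows to a huge value and falls outside the window. A hedged userspace sketch (the 16-byte `DATA_SIZE` is hypothetical) makes the cases visible:

```c
#include <assert.h>
#include <stdbool.h>

#define DATA_SIZE	16UL	/* hypothetical ring size for illustration */

static bool need_more_space(unsigned long lpos_current,
			    unsigned long lpos_target)
{
	/*
	 * True only when 0 <= target - current - 1 < DATA_SIZE, i.e.
	 * @lpos_current is strictly behind @lpos_target and the gap is
	 * no larger than the whole buffer. Both "already at/beyond the
	 * target" (underflow) and "impossibly far behind" (gap larger
	 * than the buffer, meaning another CPU raced ahead) fall
	 * outside that window and yield false.
	 */
	return lpos_target - lpos_current - 1 < DATA_SIZE;
}
```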

/* Query the state of a descriptor. */
static enum desc_state get_desc_state(unsigned long id,
				      unsigned long state_val)
{
	if (id != DESC_ID(state_val))
		return desc_miss;

	return DESC_STATE(state_val);
}
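The DESC_SV()/DESC_ID()/DESC_STATE() helpers used here are defined elsewhere in the ringbuffer headers. A hedged userspace sketch of the packing scheme, reproduced for illustration (two state bits in the most significant bits of the state word, the descriptor ID in the rest):

```c
#include <assert.h>

#define SV_BITS		(sizeof(unsigned long) * 8)
#define FLAGS_SHIFT	(SV_BITS - 2)
#define FLAGS_MASK	(3UL << FLAGS_SHIFT)
#define ID_MASK		(~FLAGS_MASK)

enum desc_state {
	desc_miss	= -1,	/* ID mismatch (pseudo state) */
	desc_reserved	= 0x0,	/* reserved, in use by writer */
	desc_committed	= 0x1,	/* committed by writer, could get reopened */
	desc_finalized	= 0x2,	/* committed, no further modification allowed */
	desc_reusable	= 0x3,	/* free, not yet used by any writer */
};

/* Pack an ID and a real state (never desc_miss) into one word. */
static unsigned long desc_sv(unsigned long id, enum desc_state state)
{
	return ((unsigned long)state << FLAGS_SHIFT) | (id & ID_MASK);
}

static unsigned long desc_id(unsigned long sv)
{
	return sv & ID_MASK;
}

static enum desc_state get_desc_state(unsigned long id,
				      unsigned long state_val)
{
	if (id != desc_id(state_val))
		return desc_miss;
	return (enum desc_state)(state_val >> FLAGS_SHIFT);
}
```

Because the ID and state live in one word, a single atomic read (or cmpxchg) observes (or transitions) both together, which is what makes the descriptor state machine lockless.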

/*
 * Get a copy of a specified descriptor and return its queried state. If the
 * descriptor is in an inconsistent state (miss or reserved), the caller can
 * only expect the descriptor's @state_var field to be valid.
 *
 * The sequence number and caller_id can be optionally retrieved. Like all
 * non-state_var data, they are only valid if the descriptor is in a
 * consistent state.
 */
static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
				 unsigned long id, struct prb_desc *desc_out,
				 u64 *seq_out, u32 *caller_id_out)
{
	struct printk_info *info = to_info(desc_ring, id);
	struct prb_desc *desc = to_desc(desc_ring, id);
	atomic_long_t *state_var = &desc->state_var;
	enum desc_state d_state;
	unsigned long state_val;

	/* Check the descriptor state. */
	state_val = atomic_long_read(state_var); /* LMM(desc_read:A) */
	d_state = get_desc_state(id, state_val);
	if (d_state == desc_miss || d_state == desc_reserved) {
		/*
		 * The descriptor is in an inconsistent state. Set at least
		 * @state_var so that the caller can see the details of
		 * the inconsistent state.
		 */
		goto out;
	}

	/*
	 * Guarantee the state is loaded before copying the descriptor
	 * content. This avoids copying obsolete descriptor content that might
	 * not apply to the descriptor state. This pairs with _prb_commit:B.
	 *
	 * Memory barrier involvement:
	 *
	 * If desc_read:A reads from _prb_commit:B, then desc_read:C reads
	 * from _prb_commit:A.
	 *
	 * Relies on:
	 *
	 * WMB from _prb_commit:A to _prb_commit:B
	 *    matching
	 * RMB from desc_read:A to desc_read:C
	 */
	smp_rmb(); /* LMM(desc_read:B) */

	/*
	 * Copy the descriptor data. The data is not valid until the
	 * state has been re-checked. A memcpy() for all of @desc
	 * cannot be used because of the atomic_t @state_var field.
	 */
	if (desc_out) {
		memcpy(&desc_out->text_blk_lpos, &desc->text_blk_lpos,
		       sizeof(desc_out->text_blk_lpos)); /* LMM(desc_read:C) */
	}
	if (seq_out)
		*seq_out = info->seq; /* also part of desc_read:C */
	if (caller_id_out)
		*caller_id_out = info->caller_id; /* also part of desc_read:C */

	/*
	 * 1. Guarantee the descriptor content is loaded before re-checking
	 *    the state. This avoids reading an obsolete descriptor state
	 *    that may not apply to the copied content. This pairs with
	 *    desc_reserve:F.
	 *
	 *    Memory barrier involvement:
	 *
	 *    If desc_read:C reads from desc_reserve:G, then desc_read:E
	 *    reads from desc_reserve:F.
	 *
	 *    Relies on:
	 *
	 *    WMB from desc_reserve:F to desc_reserve:G
	 *       matching
	 *    RMB from desc_read:C to desc_read:E
	 *
	 * 2. Guarantee the record data is loaded before re-checking the
	 *    state. This avoids reading an obsolete descriptor state that may
	 *    not apply to the copied data. This pairs with data_alloc:A and
	 *    data_realloc:A.
	 *
	 *    Memory barrier involvement:
	 *
	 *    If copy_data:A reads from data_alloc:B, then desc_read:E
	 *    reads from desc_make_reusable:A.
	 *
	 *    Relies on:
	 *
	 *    MB from desc_make_reusable:A to data_alloc:B
	 *       matching
	 *    RMB from desc_read:C to desc_read:E
	 *
	 *    Note: desc_make_reusable:A and data_alloc:B can be different
	 *          CPUs. However, the data_alloc:B CPU (which performs the
	 *          full memory barrier) must have previously seen
	 *          desc_make_reusable:A.
	 */
	smp_rmb(); /* LMM(desc_read:D) */

	/*
	 * The data has been copied. Return the current descriptor state,
	 * which may have changed since the load above.
	 */
	state_val = atomic_long_read(state_var); /* LMM(desc_read:E) */
	d_state = get_desc_state(id, state_val);
out:
	if (desc_out)
		atomic_long_set(&desc_out->state_var, state_val);
	return d_state;
}
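The check / rmb / copy / rmb / re-check sequence in desc_read() is a general lockless-consistency technique: the copied data is trusted only if the state word is identical and consistent on both sides of the copy. A minimal C11-atomics sketch of the control flow (names and the acquire-fence mapping of smp_rmb() are illustrative, not kernel API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative record: a state word guarding a payload, like @state_var. */
struct rec {
	atomic_ulong	state;
	char		payload[16];
};

static bool read_consistent(struct rec *r, unsigned long expect_state,
			    char *out)
{
	/* First check: is the record in the expected state? */
	if (atomic_load(&r->state) != expect_state)
		return false;

	/* Order the state load before the copy (like desc_read:B). */
	atomic_thread_fence(memory_order_acquire);

	memcpy(out, r->payload, sizeof(r->payload));	/* like desc_read:C */

	/* Order the copy before the re-check (like desc_read:D). */
	atomic_thread_fence(memory_order_acquire);

	/* Re-check: only an unchanged state validates the copy. */
	return atomic_load(&r->state) == expect_state;
}
```

A concurrent writer that recycles the record between the two checks changes the state word, so the stale copy is rejected rather than returned.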

/*
 * Take a specified descriptor out of the finalized state by attempting
 * the transition from finalized to reusable. Either this context or some
 * other context will have been successful.
 */
static void desc_make_reusable(struct prb_desc_ring *desc_ring,
			       unsigned long id)
{
	unsigned long val_finalized = DESC_SV(id, desc_finalized);
	unsigned long val_reusable = DESC_SV(id, desc_reusable);
	struct prb_desc *desc = to_desc(desc_ring, id);
	atomic_long_t *state_var = &desc->state_var;

	atomic_long_cmpxchg_relaxed(state_var, val_finalized,
				    val_reusable); /* LMM(desc_make_reusable:A) */
}

/*
 * Given the text data ring, put the associated descriptor of each
 * data block from @lpos_begin until @lpos_end into the reusable state.
 *
 * If there is any problem making the associated descriptor reusable, either
 * the descriptor has not yet been finalized or another writer context has
 * already pushed the tail lpos past the problematic data block. Regardless,
 * on error the caller can re-load the tail lpos to determine the situation.
 */
static bool data_make_reusable(struct printk_ringbuffer *rb,
			       unsigned long lpos_begin,
			       unsigned long lpos_end,
			       unsigned long *lpos_out)
{
	struct prb_data_ring *data_ring = &rb->text_data_ring;
	struct prb_desc_ring *desc_ring = &rb->desc_ring;
	struct prb_data_block *blk;
	enum desc_state d_state;
	struct prb_desc desc;
	struct prb_data_blk_lpos *blk_lpos = &desc.text_blk_lpos;
	unsigned long id;

	/* Loop until @lpos_begin has advanced to or beyond @lpos_end. */
	while (need_more_space(data_ring, lpos_begin, lpos_end)) {
		blk = to_block(data_ring, lpos_begin);

		/*
		 * Load the block ID from the data block. This is a data race
		 * against a writer that may have newly reserved this data
		 * area. If the loaded value matches a valid descriptor ID,
		 * the blk_lpos of that descriptor will be checked to make
		 * sure it points back to this data block. If the check fails,
		 * the data area has been recycled by another writer.
		 */
		id = blk->id; /* LMM(data_make_reusable:A) */

		d_state = desc_read(desc_ring, id, &desc,
				    NULL, NULL); /* LMM(data_make_reusable:B) */

		switch (d_state) {
		case desc_miss:
		case desc_reserved:
		case desc_committed:
			return false;
		case desc_finalized:
			/*
			 * This data block is invalid if the descriptor
			 * does not point back to it.
			 */
			if (blk_lpos->begin != lpos_begin)
				return false;
			desc_make_reusable(desc_ring, id);
			break;
		case desc_reusable:
			/*
			 * This data block is invalid if the descriptor
			 * does not point back to it.
			 */
			if (blk_lpos->begin != lpos_begin)
				return false;
			break;
		}

		/* Advance @lpos_begin to the next data block. */
		lpos_begin = blk_lpos->next;
	}

	*lpos_out = lpos_begin;
	return true;
}

/*
 * Advance the data ring tail to at least @lpos. This function puts
 * descriptors into the reusable state if the tail is pushed beyond
 * their associated data block.
 */
static bool data_push_tail(struct printk_ringbuffer *rb, unsigned long lpos)
{
	struct prb_data_ring *data_ring = &rb->text_data_ring;
	unsigned long tail_lpos_new;
	unsigned long tail_lpos;
	unsigned long next_lpos;

	/* If @lpos is from a data-less block, there is nothing to do. */
	if (LPOS_DATALESS(lpos))
		return true;

	/*
	 * Any descriptor states that have transitioned to reusable due to the
	 * data tail being pushed to this loaded value will be visible to this
	 * CPU. This pairs with data_push_tail:D.
	 *
	 * Memory barrier involvement:
	 *
	 * If data_push_tail:A reads from data_push_tail:D, then this CPU can
	 * see desc_make_reusable:A.
	 *
	 * Relies on:
	 *
	 * MB from desc_make_reusable:A to data_push_tail:D
	 *    matches
	 * READFROM from data_push_tail:D to data_push_tail:A
	 *    thus
	 * READFROM from desc_make_reusable:A to this CPU
	 */
	tail_lpos = atomic_long_read(&data_ring->tail_lpos); /* LMM(data_push_tail:A) */

	/*
	 * Loop until the tail lpos is at or beyond @lpos. This condition
	 * may already be satisfied, resulting in no full memory barrier
	 * from data_push_tail:D being performed. However, since this CPU
	 * sees the new tail lpos, any descriptor states that transitioned to
	 * the reusable state must already be visible.
	 */
	while (need_more_space(data_ring, tail_lpos, lpos)) {
		/*
		 * Make all descriptors reusable that are associated with
		 * data blocks before @lpos.
		 */
		if (!data_make_reusable(rb, tail_lpos, lpos, &next_lpos)) {
			/*
			 * 1. Guarantee the block ID loaded in
			 *    data_make_reusable() is performed before
			 *    reloading the tail lpos. The failed
			 *    data_make_reusable() may be due to a newly
			 *    recycled data area causing the tail lpos to
			 *    have been previously pushed. This pairs with
			 *    data_alloc:A and data_realloc:A.
			 *
			 *    Memory barrier involvement:
			 *
			 *    If data_make_reusable:A reads from data_alloc:B,
			 *    then data_push_tail:C reads from
			 *    data_push_tail:D.
			 *
			 *    Relies on:
			 *
			 *    MB from data_push_tail:D to data_alloc:B
			 *       matching
			 *    RMB from data_make_reusable:A to
			 *    data_push_tail:C
			 *
			 *    Note: data_push_tail:D and data_alloc:B can be
			 *          different CPUs. However, the data_alloc:B
			 *          CPU (which performs the full memory
			 *          barrier) must have previously seen
			 *          data_push_tail:D.
			 *
			 * 2. Guarantee the descriptor state loaded in
			 *    data_make_reusable() is performed before
			 *    reloading the tail lpos. The failed
			 *    data_make_reusable() may be due to a newly
			 *    recycled descriptor causing the tail lpos to
			 *    have been previously pushed. This pairs with
			 *    desc_reserve:D.
			 *
			 *    Memory barrier involvement:
			 *
			 *    If data_make_reusable:B reads from
			 *    desc_reserve:F, then data_push_tail:C reads
			 *    from data_push_tail:D.
			 *
			 *    Relies on:
			 *
			 *    MB from data_push_tail:D to desc_reserve:F
			 *       matching
			 *    RMB from data_make_reusable:B to
			 *    data_push_tail:C
			 *
			 *    Note: data_push_tail:D and desc_reserve:F can
			 *          be different CPUs. However, the
			 *          desc_reserve:F CPU (which performs the
			 *          full memory barrier) must have previously
			 *          seen data_push_tail:D.
			 */
			smp_rmb(); /* LMM(data_push_tail:B) */

			tail_lpos_new = atomic_long_read(&data_ring->tail_lpos
							); /* LMM(data_push_tail:C) */
			if (tail_lpos_new == tail_lpos)
				return false;

			/* Another CPU pushed the tail. Try again. */
			tail_lpos = tail_lpos_new;
			continue;
		}

		/*
		 * Guarantee any descriptor states that have transitioned to
		 * reusable are stored before pushing the tail lpos. A full
		 * memory barrier is needed since other CPUs may have made
		 * the descriptor states reusable. This pairs with
		 * data_push_tail:A.
		 */
		if (atomic_long_try_cmpxchg(&data_ring->tail_lpos, &tail_lpos,
					    next_lpos)) { /* LMM(data_push_tail:D) */
			break;
		}
	}

	return true;
}

/*
 * Advance the desc ring tail. This function advances the tail by one
 * descriptor, thus invalidating the oldest descriptor. Before advancing
 * the tail, the tail descriptor is made reusable and all data blocks up to
 * and including the descriptor's data block are invalidated (i.e. the data
 * ring tail is pushed past the data block of the descriptor being made
 * reusable).
 */
static bool desc_push_tail(struct printk_ringbuffer *rb,
			   unsigned long tail_id)
{
	struct prb_desc_ring *desc_ring = &rb->desc_ring;
	enum desc_state d_state;
	struct prb_desc desc;

	d_state = desc_read(desc_ring, tail_id, &desc, NULL, NULL);

	switch (d_state) {
	case desc_miss:
		/*
		 * If the ID is exactly 1 wrap behind the expected, it is
		 * in the process of being reserved by another writer and
		 * must be considered reserved.
		 */
		if (DESC_ID(atomic_long_read(&desc.state_var)) ==
		    DESC_ID_PREV_WRAP(desc_ring, tail_id)) {
			return false;
		}

		/*
		 * The ID has changed. Another writer must have pushed the
		 * tail and recycled the descriptor already. Success is
		 * returned because the caller is only interested in the
		 * specified tail being pushed, which it was.
		 */
		return true;
	case desc_reserved:
	case desc_committed:
		return false;
	case desc_finalized:
		desc_make_reusable(desc_ring, tail_id);
		break;
	case desc_reusable:
		break;
	}

	/*
	 * Data blocks must be invalidated before their associated
	 * descriptor can be made available for recycling. Invalidating
	 * them later is not possible because there is no way to trust
	 * data blocks once their associated descriptor is gone.
	 */

	if (!data_push_tail(rb, desc.text_blk_lpos.next))
		return false;

	/*
	 * Check the next descriptor after @tail_id before pushing the tail
	 * to it because the tail must always be in a finalized or reusable
	 * state. The implementation of prb_first_seq() relies on this.
	 *
	 * A successful read implies that the next descriptor is less than or
	 * equal to @head_id so there is no risk of pushing the tail past the
	 * head.
	 */
	d_state = desc_read(desc_ring, DESC_ID(tail_id + 1), &desc,
			    NULL, NULL); /* LMM(desc_push_tail:A) */

	if (d_state == desc_finalized || d_state == desc_reusable) {
		/*
		 * Guarantee any descriptor states that have transitioned to
		 * reusable are stored before pushing the tail ID. This allows
		 * verifying the recycled descriptor state. A full memory
		 * barrier is needed since other CPUs may have made the
		 * descriptor states reusable. This pairs with desc_reserve:D.
		 */
		atomic_long_cmpxchg(&desc_ring->tail_id, tail_id,
				    DESC_ID(tail_id + 1)); /* LMM(desc_push_tail:B) */
	} else {
		/*
		 * Guarantee the last state load from desc_read() is before
		 * reloading @tail_id in order to see a new tail ID in the
		 * case that the descriptor has been recycled. This pairs
		 * with desc_reserve:D.
		 *
		 * Memory barrier involvement:
		 *
		 * If desc_push_tail:A reads from desc_reserve:F, then
		 * desc_push_tail:D reads from desc_push_tail:B.
		 *
		 * Relies on:
		 *
		 * MB from desc_push_tail:B to desc_reserve:F
		 *    matching
		 * RMB from desc_push_tail:A to desc_push_tail:D
		 *
		 * Note: desc_push_tail:B and desc_reserve:F can be different
		 *       CPUs. However, the desc_reserve:F CPU (which performs
		 *       the full memory barrier) must have previously seen
		 *       desc_push_tail:B.
		 */
		smp_rmb(); /* LMM(desc_push_tail:C) */

		/*
		 * Re-check the tail ID. The descriptor following @tail_id is
		 * not in an allowed tail state. But if the tail has since
		 * been moved by another CPU, then it does not matter.
		 */
		if (atomic_long_read(&desc_ring->tail_id) == tail_id) /* LMM(desc_push_tail:D) */
			return false;
	}

	return true;
}

/* Reserve a new descriptor, invalidating the oldest if necessary. */
static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out)
{
	struct prb_desc_ring *desc_ring = &rb->desc_ring;
	unsigned long prev_state_val;
	unsigned long id_prev_wrap;
	struct prb_desc *desc;
	unsigned long head_id;
	unsigned long id;

	head_id = atomic_long_read(&desc_ring->head_id); /* LMM(desc_reserve:A) */

	do {
		id = DESC_ID(head_id + 1);
		id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);

		/*
		 * Guarantee the head ID is read before reading the tail ID.
		 * Since the tail ID is updated before the head ID, this
		 * guarantees that @id_prev_wrap is never ahead of the tail
		 * ID. This pairs with desc_reserve:D.
		 *
		 * Memory barrier involvement:
		 *
		 * If desc_reserve:A reads from desc_reserve:D, then
		 * desc_reserve:C reads from desc_push_tail:B.
		 *
		 * Relies on:
		 *
		 * MB from desc_push_tail:B to desc_reserve:D
		 *    matching
		 * RMB from desc_reserve:A to desc_reserve:C
		 *
		 * Note: desc_push_tail:B and desc_reserve:D can be different
		 *       CPUs. However, the desc_reserve:D CPU (which performs
		 *       the full memory barrier) must have previously seen
		 *       desc_push_tail:B.
		 */
		smp_rmb(); /* LMM(desc_reserve:B) */

		if (id_prev_wrap == atomic_long_read(&desc_ring->tail_id
						    )) { /* LMM(desc_reserve:C) */
			/*
			 * Make space for the new descriptor by
			 * advancing the tail.
			 */
			if (!desc_push_tail(rb, id_prev_wrap))
				return false;
		}

		/*
		 * 1. Guarantee the tail ID is read before validating the
		 *    recycled descriptor state. A read memory barrier is
		 *    sufficient for this. This pairs with desc_push_tail:B.
		 *
		 *    Memory barrier involvement:
		 *
		 *    If desc_reserve:C reads from desc_push_tail:B, then
		 *    desc_reserve:E reads from desc_make_reusable:A.
		 *
		 *    Relies on:
		 *
		 *    MB from desc_make_reusable:A to desc_push_tail:B
		 *       matching
		 *    RMB from desc_reserve:C to desc_reserve:E
		 *
		 *    Note: desc_make_reusable:A and desc_push_tail:B can be
		 *          different CPUs. However, the desc_push_tail:B CPU
		 *          (which performs the full memory barrier) must have
		 *          previously seen desc_make_reusable:A.
		 *
		 * 2. Guarantee the tail ID is stored before storing the head
		 *    ID. This pairs with desc_reserve:B.
		 *
		 * 3. Guarantee any data ring tail changes are stored before
		 *    recycling the descriptor. Data ring tail changes can
		 *    happen via desc_push_tail()->data_push_tail(). A full
		 *    memory barrier is needed since another CPU may have
		 *    pushed the data ring tails. This pairs with
		 *    data_push_tail:B.
		 *
		 * 4. Guarantee a new tail ID is stored before recycling the
		 *    descriptor. A full memory barrier is needed since
		 *    another CPU may have pushed the tail ID. This pairs
		 *    with desc_push_tail:C and this also pairs with
		 *    prb_first_seq:C.
		 *
		 * 5. Guarantee the head ID is stored before trying to
		 *    finalize the previous descriptor. This pairs with
		 *    _prb_commit:B.
		 */
	} while (!atomic_long_try_cmpxchg(&desc_ring->head_id, &head_id,
					  id)); /* LMM(desc_reserve:D) */

	desc = to_desc(desc_ring, id);

	/*
	 * If the descriptor has been recycled, verify the old state val.
	 * See "ABA Issues" about why this verification is performed.
	 */
	prev_state_val = atomic_long_read(&desc->state_var); /* LMM(desc_reserve:E) */
	if (prev_state_val &&
	    get_desc_state(id_prev_wrap, prev_state_val) != desc_reusable) {
		WARN_ON_ONCE(1);
		return false;
	}

	/*
	 * Assign the descriptor a new ID and set its state to reserved.
	 * See "ABA Issues" about why cmpxchg() instead of set() is used.
	 *
	 * Guarantee the new descriptor ID and state is stored before making
	 * any other changes. A write memory barrier is sufficient for this.
	 * This pairs with desc_read:D.
	 */
	if (!atomic_long_try_cmpxchg(&desc->state_var, &prev_state_val,
			DESC_SV(id, desc_reserved))) { /* LMM(desc_reserve:F) */
		WARN_ON_ONCE(1);
		return false;
	}

	/* Now data in @desc can be modified: LMM(desc_reserve:G) */

	*id_out = id;
	return true;
}

static bool is_blk_wrapped(struct prb_data_ring *data_ring,
			   unsigned long begin_lpos, unsigned long next_lpos)
{
	/*
	 * Subtract one from next_lpos since it's not actually part of this data
	 * block. This allows perfectly fitting records to not wrap.
	 */
	return DATA_WRAPS(data_ring, begin_lpos) !=
	       DATA_WRAPS(data_ring, next_lpos - 1);
}

/* Determine the end of a data block. */
static unsigned long get_next_lpos(struct prb_data_ring *data_ring,
				   unsigned long lpos, unsigned int size)
{
	unsigned long begin_lpos;
	unsigned long next_lpos;

	begin_lpos = lpos;
	next_lpos = lpos + size;

	/* First check if the data block does not wrap. */
	if (!is_blk_wrapped(data_ring, begin_lpos, next_lpos))
		return next_lpos;

	/* Wrapping data blocks store their data at the beginning. */
	return (DATA_THIS_WRAP_START_LPOS(data_ring, next_lpos) + size);
}
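The wrap handling can be sketched end to end in userspace (hypothetical 16-byte ring, mirroring is_blk_wrapped() and get_next_lpos() for illustration): a block that ends exactly at a wrap boundary does not wrap, while a block that crosses the boundary is restarted at the beginning of the next wrap.

```c
#include <assert.h>
#include <stdbool.h>

#define SIZE_BITS	4
#define DATA_SIZE	(1UL << SIZE_BITS)	/* hypothetical: 16 bytes */
#define DATA_SIZE_MASK	(DATA_SIZE - 1)
#define DATA_WRAPS(lpos)	((lpos) >> SIZE_BITS)
#define WRAP_START(lpos)	((lpos) & ~DATA_SIZE_MASK)

static bool is_blk_wrapped(unsigned long begin_lpos, unsigned long next_lpos)
{
	/* next_lpos - 1 is the last byte actually inside the block. */
	return DATA_WRAPS(begin_lpos) != DATA_WRAPS(next_lpos - 1);
}

static unsigned long get_next_lpos(unsigned long lpos, unsigned int size)
{
	unsigned long next_lpos = lpos + size;

	/* A block that fits within the current wrap ends where it ends. */
	if (!is_blk_wrapped(lpos, next_lpos))
		return next_lpos;

	/* Wrapping blocks store their data at the start of the next wrap. */
	return WRAP_START(next_lpos) + size;
}
```

Starting at lpos 12 in a 16-byte ring, a 4-byte block ends exactly at lpos 16 without wrapping, whereas a 6-byte block wraps and is placed at lpos 16, ending at lpos 22 (the 4 bytes from 12 to 16 become trailing waste, as space_used() accounts for).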

/*
 * Allocate a new data block, invalidating the oldest data block(s)
 * if necessary. This function also associates the data block with
 * a specified descriptor.
 */
static char *data_alloc(struct printk_ringbuffer *rb, unsigned int size,
			struct prb_data_blk_lpos *blk_lpos, unsigned long id)
{
	struct prb_data_ring *data_ring = &rb->text_data_ring;
	struct prb_data_block *blk;
	unsigned long begin_lpos;
	unsigned long next_lpos;

	if (size == 0) {
		/*
		 * Data blocks are not created for empty lines. Instead, the
		 * reader will recognize these special lpos values and handle
		 * them appropriately.
		 */
		blk_lpos->begin = EMPTY_LINE_LPOS;
		blk_lpos->next = EMPTY_LINE_LPOS;
		return NULL;
	}

	size = to_blk_size(size);

	begin_lpos = atomic_long_read(&data_ring->head_lpos);

	do {
		next_lpos = get_next_lpos(data_ring, begin_lpos, size);

		/*
		 * data_check_size() prevents data block allocation that could
		 * cause illegal ringbuffer states. But double check that the
		 * used space will not be bigger than the ring buffer. Wrapped
		 * messages need to reserve more space, see get_next_lpos().
		 *
		 * Specify a data-less block when the check or the allocation
		 * fails.
		 */
		if (WARN_ON_ONCE(next_lpos - begin_lpos > DATA_SIZE(data_ring)) ||
		    !data_push_tail(rb, next_lpos - DATA_SIZE(data_ring))) {
			blk_lpos->begin = FAILED_LPOS;
			blk_lpos->next = FAILED_LPOS;
			return NULL;
		}

		/*
		 * 1. Guarantee any descriptor states that have transitioned
		 *    to reusable are stored before modifying the newly
		 *    allocated data area. A full memory barrier is needed
		 *    since other CPUs may have made the descriptor states
		 *    reusable. See data_push_tail:A about why the reusable
		 *    states are visible. This pairs with desc_read:D.
		 *
		 * 2. Guarantee any updated tail lpos is stored before
		 *    modifying the newly allocated data area. Another CPU may
		 *    be in data_make_reusable() and is reading a block ID
		 *    from this area. data_make_reusable() can handle reading
		 *    a garbage block ID value, but then it must be able to
		 *    load a new tail lpos. A full memory barrier is needed
		 *    since other CPUs may have updated the tail lpos. This
		 *    pairs with data_push_tail:B.
		 */
	} while (!atomic_long_try_cmpxchg(&data_ring->head_lpos, &begin_lpos,
					  next_lpos)); /* LMM(data_alloc:A) */

	blk = to_block(data_ring, begin_lpos);
	blk->id = id; /* LMM(data_alloc:B) */

	if (is_blk_wrapped(data_ring, begin_lpos, next_lpos)) {
		/* Wrapping data blocks store their data at the beginning. */
		blk = to_block(data_ring, 0);

		/*
		 * Store the ID on the wrapped block for consistency.
		 * The printk_ringbuffer does not actually use it.
		 */
		blk->id = id;
	}

	blk_lpos->begin = begin_lpos;
	blk_lpos->next = next_lpos;

	return &blk->data[0];
}

/*
 * Try to resize an existing data block associated with the descriptor
 * specified by @id. If the resized data block should become wrapped, it
 * copies the old data to the new data block. If @size yields a data block
 * with the same or less size, the data block is left as is.
 *
 * Fail if this is not the last allocated data block or if there is not
 * enough space or it is not possible to make enough space.
 *
 * Return a pointer to the beginning of the entire data buffer or NULL on
 * failure.
 */
static char *data_realloc(struct printk_ringbuffer *rb, unsigned int size,
			  struct prb_data_blk_lpos *blk_lpos, unsigned long id)
{
	struct prb_data_ring *data_ring = &rb->text_data_ring;
	struct prb_data_block *blk;
	unsigned long head_lpos;
	unsigned long next_lpos;
	bool wrapped;

	/* Reallocation only works if @blk_lpos is the newest data block. */
	head_lpos = atomic_long_read(&data_ring->head_lpos);
	if (head_lpos != blk_lpos->next)
		return NULL;

	/* Keep track if @blk_lpos was a wrapping data block. */
	wrapped = is_blk_wrapped(data_ring, blk_lpos->begin, blk_lpos->next);

	size = to_blk_size(size);

	next_lpos = get_next_lpos(data_ring, blk_lpos->begin, size);

	/*
	 * Use the current data block when the size does not increase, i.e.
	 * when @head_lpos is already able to accommodate the new @next_lpos.
	 *
	 * Note that need_more_space() can never return false here because of
	 * the difference between the positions being bigger than the data
	 * buffer size: the data block is reopened and therefore cannot have
	 * been reused.
	 */
	if (!need_more_space(data_ring, head_lpos, next_lpos)) {
		if (wrapped)
			blk = to_block(data_ring, 0);
		else
			blk = to_block(data_ring, blk_lpos->begin);
		return &blk->data[0];
	}

	/*
	 * data_check_size() prevents data block reallocation that could
	 * cause illegal ringbuffer states. But double check that the
	 * new used space will not be bigger than the ring buffer. Wrapped
	 * messages need to reserve more space, see get_next_lpos().
	 *
	 * Specify failure when the check or the allocation fails.
	 */
	if (WARN_ON_ONCE(next_lpos - blk_lpos->begin > DATA_SIZE(data_ring)) ||
	    !data_push_tail(rb, next_lpos - DATA_SIZE(data_ring))) {
		return NULL;
	}

	/* The memory barrier involvement is the same as data_alloc:A. */
	if (!atomic_long_try_cmpxchg(&data_ring->head_lpos, &head_lpos,
				     next_lpos)) { /* LMM(data_realloc:A) */
		return NULL;
	}

	blk = to_block(data_ring, blk_lpos->begin);

	if (is_blk_wrapped(data_ring, blk_lpos->begin, next_lpos)) {
		struct prb_data_block *old_blk = blk;

		/* Wrapping data blocks store their data at the beginning. */
		blk = to_block(data_ring, 0);

		/*
		 * Store the ID on the wrapped block for consistency.
		 * The printk_ringbuffer does not actually use it.
		 */
		blk->id = id;

		if (!wrapped) {
			/*
			 * Since the allocated space is now in the newly
			 * created wrapping data block, copy the content
			 * from the old data block.
			 */
			memcpy(&blk->data[0], &old_blk->data[0],
			       (blk_lpos->next - blk_lpos->begin) - sizeof(blk->id));
		}
	}

	blk_lpos->next = next_lpos;

	return &blk->data[0];
}

/* Return the number of bytes used by a data block. */
static unsigned int space_used(struct prb_data_ring *data_ring,
			       struct prb_data_blk_lpos *blk_lpos)
{
	/* Data-less blocks take no space. */
	if (BLK_DATALESS(blk_lpos))
		return 0;

	if (!is_blk_wrapped(data_ring, blk_lpos->begin, blk_lpos->next)) {
		/* Data block does not wrap. */
		return (DATA_INDEX(data_ring, blk_lpos->next) -
			DATA_INDEX(data_ring, blk_lpos->begin));
	}

	/*
	 * For wrapping data blocks, the trailing (wasted) space is
	 * also counted.
	 */
	return (DATA_INDEX(data_ring, blk_lpos->next) +
		DATA_SIZE(data_ring) - DATA_INDEX(data_ring, blk_lpos->begin));
}

/*
 * Given @blk_lpos, return a pointer to the writer data from the data block
 * and calculate the size of the data part. A NULL pointer is returned if
 * @blk_lpos specifies values that could never be legal.
 *
 * This function (used by readers) performs strict validation on the lpos
 * values to possibly detect bugs in the writer code. A WARN_ON_ONCE() is
 * triggered if an internal error is detected.
 */
static const char *get_data(struct prb_data_ring *data_ring,
			    struct prb_data_blk_lpos *blk_lpos,
			    unsigned int *data_size)
{
	struct prb_data_block *db;

	/* Data-less data block description. */
	if (BLK_DATALESS(blk_lpos)) {
		/*
		 * Records that are just empty lines are also valid, even
		 * though they do not have a data block. For such records
		 * explicitly return empty string data to signify success.
		 */
		if (blk_lpos->begin == EMPTY_LINE_LPOS &&
		    blk_lpos->next == EMPTY_LINE_LPOS) {
			*data_size = 0;
			return "";
		}

		/* Data lost, invalid, or otherwise unavailable. */
		return NULL;
	}

	/* Regular data block: @begin and @next in the same wrap. */
	if (!is_blk_wrapped(data_ring, blk_lpos->begin, blk_lpos->next)) {
		db = to_block(data_ring, blk_lpos->begin);
		*data_size = blk_lpos->next - blk_lpos->begin;

	/* Wrapping data block: @begin is one wrap behind @next. */
	} else if (!is_blk_wrapped(data_ring,
				   blk_lpos->begin + DATA_SIZE(data_ring),
				   blk_lpos->next)) {
		db = to_block(data_ring, 0);
		*data_size = DATA_INDEX(data_ring, blk_lpos->next);

	/* Illegal block description. */
	} else {
		WARN_ON_ONCE(1);
		return NULL;
	}

	/* Sanity check. Data-less blocks were handled earlier. */
	if (WARN_ON_ONCE(!data_check_size(data_ring, *data_size) || !*data_size))
		return NULL;

	/* A valid data block will always be aligned to the ID size. */
	if (WARN_ON_ONCE(blk_lpos->begin != ALIGN(blk_lpos->begin, sizeof(db->id))) ||
	    WARN_ON_ONCE(blk_lpos->next != ALIGN(blk_lpos->next, sizeof(db->id)))) {
		return NULL;
	}

	/* A valid data block will always have at least an ID. */
	if (WARN_ON_ONCE(*data_size < sizeof(db->id)))
		return NULL;

	/* Subtract block ID space from size to reflect data size. */
	*data_size -= sizeof(db->id);

	return &db->data[0];
}

/*
 * Attempt to transition the newest descriptor from committed back to reserved
 * so that the record can be modified by a writer again. This is only possible
 * if the descriptor is not yet finalized and the provided @caller_id matches.
 */
static struct prb_desc *desc_reopen_last(struct prb_desc_ring *desc_ring,
					 u32 caller_id, unsigned long *id_out)
{
	unsigned long prev_state_val;
	enum desc_state d_state;
	struct prb_desc desc;
	struct prb_desc *d;
	unsigned long id;
	u32 cid;

	id = atomic_long_read(&desc_ring->head_id);

	/*
	 * To reduce unnecessary reopening, first check that the descriptor
	 * state and caller ID are correct.
	 */
	d_state = desc_read(desc_ring, id, &desc, NULL, &cid);
	if (d_state != desc_committed || cid != caller_id)
		return NULL;

	d = to_desc(desc_ring, id);

	prev_state_val = DESC_SV(id, desc_committed);

	/*
	 * Guarantee the reserved state is stored before reading any
	 * record data. A full memory barrier is needed because @state_var
	 * modification is followed by reading. This pairs with _prb_commit:B.
	 *
	 * Memory barrier involvement:
	 *
	 * If desc_reopen_last:A reads from _prb_commit:B, then
	 * prb_reserve_in_last:A reads from _prb_commit:A.
	 *
	 * Relies on:
	 *
	 * WMB from _prb_commit:A to _prb_commit:B
	 *    matching
	 * MB from desc_reopen_last:A to prb_reserve_in_last:A
	 */
	if (!atomic_long_try_cmpxchg(&d->state_var, &prev_state_val,
			DESC_SV(id, desc_reserved))) { /* LMM(desc_reopen_last:A) */
		return NULL;
	}

	*id_out = id;
	return d;
}

/**
 * prb_reserve_in_last() - Re-reserve and extend the space in the ringbuffer
 *                         used by the newest record.
 *
 * @e:         The entry structure to setup.
 * @rb:        The ringbuffer to re-reserve and extend data in.
 * @r:         The record structure to allocate buffers for.
 * @caller_id: The caller ID of the caller (reserving writer).
 * @max_size:  Fail if the extended size would be greater than this.
 *
 * This is the public function available to writers to re-reserve and extend
 * data.
 *
 * The writer specifies the text size to extend (not the new total size) by
 * setting the @text_buf_size field of @r. To ensure proper initialization
 * of @r, prb_rec_init_wr() should be used.
 *
 * This function will fail if @caller_id does not match the caller ID of the
 * newest record. In that case the caller must reserve new data using
 * prb_reserve().
 *
 * Context: Any context. Disables local interrupts on success.
 * Return: true if text data could be extended, otherwise false.
 *
 * On success:
 *
 *   - @r->text_buf points to the beginning of the entire text buffer.
 *
 *   - @r->text_buf_size is set to the new total size of the buffer.
 *
 *   - @r->info is not touched so that @r->info->text_len could be used
 *     to append the text.
 *
 *   - prb_record_text_space() can be used on @e to query the new
 *     actually used space.
 *
 * Important: All @r->info fields will already be set with the current values
 *            for the record. I.e. @r->info->text_len will be less than
 *            @text_buf_size. Writers can use @r->info->text_len to know
 *            where concatenation begins and writers should update
 *            @r->info->text_len after concatenating.
 */
bool prb_reserve_in_last(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
			 struct printk_record *r, u32 caller_id, unsigned int max_size)
{
	struct prb_desc_ring *desc_ring = &rb->desc_ring;
	struct printk_info *info;
	unsigned int data_size;
	struct prb_desc *d;
	unsigned long id;

	local_irq_save(e->irqflags);

	/* Transition the newest descriptor back to the reserved state. */
	d = desc_reopen_last(desc_ring, caller_id, &id);
	if (!d) {
		local_irq_restore(e->irqflags);
		goto fail_reopen;
	}

	/* Now the writer has exclusive access: LMM(prb_reserve_in_last:A) */

	info = to_info(desc_ring, id);

	/*
	 * Set the @e fields here so that prb_commit() can be used if
	 * anything fails from now on.
	 */
	e->rb = rb;
	e->id = id;

	/*
	 * desc_reopen_last() checked the caller_id, but there was no
	 * exclusive access at that point. The descriptor may have
	 * changed since then.
	 */
	if (caller_id != info->caller_id)
		goto fail;

	if (BLK_DATALESS(&d->text_blk_lpos)) {
		if (WARN_ON_ONCE(info->text_len != 0)) {
			pr_warn_once("wrong text_len value (%hu, expecting 0)\n",
				     info->text_len);
			info->text_len = 0;
		}

		if (!data_check_size(&rb->text_data_ring, r->text_buf_size))
			goto fail;

		if (r->text_buf_size > max_size)
			goto fail;

		r->text_buf = data_alloc(rb, r->text_buf_size,
					 &d->text_blk_lpos, id);
	} else {
		if (!get_data(&rb->text_data_ring, &d->text_blk_lpos, &data_size))
			goto fail;

		/*
		 * Increase the buffer size to include the original size. If
		 * the meta data (@text_len) is not sane, use the full data
		 * block size.
		 */
		if (WARN_ON_ONCE(info->text_len > data_size)) {
			pr_warn_once("wrong text_len value (%hu, expecting <=%u)\n",
				     info->text_len, data_size);
			info->text_len = data_size;
		}
		r->text_buf_size += info->text_len;

		if (!data_check_size(&rb->text_data_ring, r->text_buf_size))
			goto fail;

		if (r->text_buf_size > max_size)
			goto fail;

		r->text_buf = data_realloc(rb, r->text_buf_size,
					   &d->text_blk_lpos, id);
	}
	if (r->text_buf_size && !r->text_buf)
		goto fail;

	r->info = info;

	e->text_space = space_used(&rb->text_data_ring, &d->text_blk_lpos);

	return true;
fail:
	prb_commit(e);
	/* prb_commit() re-enabled interrupts. */
fail_reopen:
	/* Make it clear to the caller that the re-reserve failed. */
	memset(r, 0, sizeof(*r));
	return false;
}

/*
 * @last_finalized_seq value guarantees that all records up to and including
 * this sequence number are finalized and can be read. The only exception are
 * too old records which have already been overwritten.
 *
 * It is also guaranteed that @last_finalized_seq only increases.
 *
 * Be aware that finalized records following non-finalized records are not
 * reported because they are not yet available to the reader. For example,
 * a new record stored via printk() will not be available to a printer if
 * it follows a record that has not been finalized yet. However, once that
 * non-finalized record becomes finalized, @last_finalized_seq will be
 * appropriately updated and the full set of finalized records will be
 * available to the printer. And since each printk() caller will either
 * directly print or trigger deferred printing of all available unprinted
 * records, all printk() messages will get printed.
 */
static u64 desc_last_finalized_seq(struct printk_ringbuffer *rb)
{
	struct prb_desc_ring *desc_ring = &rb->desc_ring;
	unsigned long ulseq;

	/*
	 * Guarantee the sequence number is loaded before loading the
	 * associated record in order to guarantee that the record can be
	 * seen by this CPU. This pairs with desc_update_last_finalized:A.
	 */
	ulseq = atomic_long_read_acquire(&desc_ring->last_finalized_seq
					); /* LMM(desc_last_finalized_seq:A) */

	return __ulseq_to_u64seq(rb, ulseq);
}

static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
			    struct printk_record *r, unsigned int *line_count);

/*
 * Check if there are records directly following @last_finalized_seq that are
 * finalized. If so, update @last_finalized_seq to the latest of these
 * records. It is not allowed to skip over records that are not yet finalized.
 */
static void desc_update_last_finalized(struct printk_ringbuffer *rb)
{
	struct prb_desc_ring *desc_ring = &rb->desc_ring;
	u64 old_seq = desc_last_finalized_seq(rb);
	unsigned long oldval;
	unsigned long newval;
	u64 finalized_seq;
	u64 try_seq;

try_again:
	finalized_seq = old_seq;
	try_seq = finalized_seq + 1;

	/* Try to find later finalized records. */
	while (_prb_read_valid(rb, &try_seq, NULL, NULL)) {
		finalized_seq = try_seq;
		try_seq++;
	}

	/* No update needed if no later finalized record was found. */
	if (finalized_seq == old_seq)
		return;

	oldval = __u64seq_to_ulseq(old_seq);
	newval = __u64seq_to_ulseq(finalized_seq);

	/*
	 * Set the sequence number of a later finalized record that has been
	 * seen.
	 *
	 * Guarantee the record data is visible to other CPUs before storing
	 * its sequence number. This pairs with desc_last_finalized_seq:A.
	 *
	 * Memory barrier involvement:
	 *
	 * If desc_last_finalized_seq:A reads from
	 * desc_update_last_finalized:A, then desc_read:A reads from
	 * _prb_commit:B.
	 *
	 * Relies on:
	 *
	 * RELEASE from _prb_commit:B to desc_update_last_finalized:A
	 *    matching
	 * ACQUIRE from desc_last_finalized_seq:A to desc_read:A
	 *
	 * Note: _prb_commit:B and desc_update_last_finalized:A can be
	 *       different CPUs. However, the desc_update_last_finalized:A
	 *       CPU (which performs the release) must have previously seen
	 *       _prb_commit:B.
	 */
	if (!atomic_long_try_cmpxchg_release(&desc_ring->last_finalized_seq,
				&oldval, newval)) { /* LMM(desc_update_last_finalized:A) */
		old_seq = __ulseq_to_u64seq(rb, oldval);
		goto try_again;
	}
}

/*
 * Attempt to finalize a specified descriptor. If this fails, the descriptor
 * is either already final or it will finalize itself when the writer commits.
 */
static void desc_make_final(struct printk_ringbuffer *rb, unsigned long id)
{
	struct prb_desc_ring *desc_ring = &rb->desc_ring;
	unsigned long prev_state_val = DESC_SV(id, desc_committed);
	struct prb_desc *d = to_desc(desc_ring, id);

	if (atomic_long_try_cmpxchg_relaxed(&d->state_var, &prev_state_val,
			DESC_SV(id, desc_finalized))) { /* LMM(desc_make_final:A) */
		desc_update_last_finalized(rb);
	}
}

/**
 * prb_reserve() - Reserve space in the ringbuffer.
 *
 * @e:  The entry structure to setup.
 * @rb: The ringbuffer to reserve data in.
 * @r:  The record structure to allocate buffers for.
 *
 * This is the public function available to writers to reserve data.
 *
 * The writer specifies the text size to reserve by setting the
 * @text_buf_size field of @r. To ensure proper initialization of @r,
 * prb_rec_init_wr() should be used.
 *
 * Context: Any context. Disables local interrupts on success.
 * Return: true if at least text data could be allocated, otherwise false.
 *
 * On success, the fields @info and @text_buf of @r will be set by this
 * function and should be filled in by the writer before committing. Also
 * on success, prb_record_text_space() can be used on @e to query the actual
 * space used for the text data block.
 *
 * Important: @info->text_len needs to be set correctly by the writer in
 *            order for data to be readable and/or extended. Its value
 *            is initialized to 0.
 */
bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
		 struct printk_record *r)
{
	struct prb_desc_ring *desc_ring = &rb->desc_ring;
	struct printk_info *info;
	struct prb_desc *d;
	unsigned long id;
	u64 seq;

	if (!data_check_size(&rb->text_data_ring, r->text_buf_size))
		goto fail;

	/*
	 * Descriptors in the reserved state act as blockers to all further
	 * reservations once the desc_ring has fully wrapped. Disable
	 * interrupts during the reserve/commit window in order to minimize
	 * the likelihood of this happening.
	 */
	local_irq_save(e->irqflags);

	if (!desc_reserve(rb, &id)) {
		/* Descriptor reservation failures are tracked. */
		atomic_long_inc(&rb->fail);
		local_irq_restore(e->irqflags);
		goto fail;
	}

	d = to_desc(desc_ring, id);
	info = to_info(desc_ring, id);

	/*
	 * All @info fields (except @seq) are cleared and must be filled in
	 * by the writer. Save @seq before clearing because it is used to
	 * determine the new sequence number.
	 */
	seq = info->seq;
	memset(info, 0, sizeof(*info));

	/*
	 * Set the @e fields here so that prb_commit() can be used if
	 * text data allocation fails.
	 */
	e->rb = rb;
	e->id = id;

	/*
	 * Initialize the sequence number if it has "never been set".
	 * Otherwise just increment it by a full wrap.
	 *
	 * @seq is considered "never been set" if it has a value of 0,
	 * _except_ for @infos[0], which was specially set up by the ringbuffer
	 * initializer and therefore is always considered as set.
	 *
	 * See the "Bootstrap" comment block in printk_ringbuffer.h for
	 * details about how the initializer bootstraps the descriptors.
	 */
	if (seq == 0 && DESC_INDEX(desc_ring, id) != 0)
		info->seq = DESC_INDEX(desc_ring, id);
	else
		info->seq = seq + DESCS_COUNT(desc_ring);

	/*
	 * New data is about to be reserved. Once that happens, previous
	 * descriptors are no longer able to be extended. Finalize the
	 * previous descriptor now so that it can be made available to
	 * readers. (For seq==0 there is no previous descriptor.)
	 */
	if (info->seq > 0)
		desc_make_final(rb, DESC_ID(id - 1));

	r->text_buf = data_alloc(rb, r->text_buf_size, &d->text_blk_lpos, id);
	/* If text data allocation fails, a data-less record is committed. */
	if (r->text_buf_size && !r->text_buf) {
		prb_commit(e);
		/* prb_commit() re-enabled interrupts. */
		goto fail;
	}

	r->info = info;

	/* Record full text space used by record. */
	e->text_space = space_used(&rb->text_data_ring, &d->text_blk_lpos);

	return true;
fail:
	/* Make it clear to the caller that the reserve failed. */
	memset(r, 0, sizeof(*r));
	return false;
}
EXPORT_SYMBOL_IF_KUNIT(prb_reserve);

/* Commit the data (possibly finalizing it) and restore interrupts. */
static void _prb_commit(struct prb_reserved_entry *e, unsigned long state_val)
{
	struct prb_desc_ring *desc_ring = &e->rb->desc_ring;
	struct prb_desc *d = to_desc(desc_ring, e->id);
	unsigned long prev_state_val = DESC_SV(e->id, desc_reserved);

	/* Now the writer has finished all writing: LMM(_prb_commit:A) */

	/*
	 * Set the descriptor as committed. See "ABA Issues" about why
	 * cmpxchg() instead of set() is used.
	 *
	 * 1. Guarantee all record data is stored before the descriptor state
	 *    is stored as committed. A write memory barrier is sufficient
	 *    for this. This pairs with desc_read:B and desc_reopen_last:A.
	 *
	 * 2. Guarantee the descriptor state is stored as committed before
	 *    re-checking the head ID in order to possibly finalize this
	 *    descriptor. This pairs with desc_reserve:D.
	 *
	 *    Memory barrier involvement:
	 *
	 *    If prb_commit:A reads from desc_reserve:D, then
	 *    desc_make_final:A reads from _prb_commit:B.
	 *
	 *    Relies on:
	 *
	 *    MB _prb_commit:B to prb_commit:A
	 *       matching
	 *    MB desc_reserve:D to desc_make_final:A
	 */
	if (!atomic_long_try_cmpxchg(&d->state_var, &prev_state_val,
			DESC_SV(e->id, state_val))) { /* LMM(_prb_commit:B) */
		WARN_ON_ONCE(1);
	}

	/* Restore interrupts, the reserve/commit window is finished. */
	local_irq_restore(e->irqflags);
}

/**
 * prb_commit() - Commit (previously reserved) data to the ringbuffer.
 *
 * @e: The entry containing the reserved data information.
 *
 * This is the public function available to writers to commit data.
 *
 * Note that the data is not yet available to readers until it is finalized.
 * Finalizing happens automatically when space for the next record is
 * reserved.
 *
 * See prb_final_commit() for a version of this function that finalizes
 * immediately.
 *
 * Context: Any context. Enables local interrupts.
 */
void prb_commit(struct prb_reserved_entry *e)
{
	struct prb_desc_ring *desc_ring = &e->rb->desc_ring;
	unsigned long head_id;

	_prb_commit(e, desc_committed);

	/*
	 * If this descriptor is no longer the head (i.e. a new record has
	 * been allocated), extending the data for this record is no longer
	 * allowed and therefore it must be finalized.
	 */
	head_id = atomic_long_read(&desc_ring->head_id); /* LMM(prb_commit:A) */
	if (head_id != e->id)
		desc_make_final(e->rb, e->id);
}
EXPORT_SYMBOL_IF_KUNIT(prb_commit);

/**
 * prb_final_commit() - Commit and finalize (previously reserved) data to
 *                      the ringbuffer.
 *
 * @e: The entry containing the reserved data information.
 *
 * This is the public function available to writers to commit+finalize data.
 *
 * By finalizing, the data is made immediately available to readers.
 *
 * This function should only be used if there are no intentions of extending
 * this data using prb_reserve_in_last().
 *
 * Context: Any context. Enables local interrupts.
 */
void prb_final_commit(struct prb_reserved_entry *e)
{
	_prb_commit(e, desc_finalized);

	desc_update_last_finalized(e->rb);
}

/*
 * Count the number of lines in provided text. All text has at least 1 line
 * (even if @text_size is 0). Each '\n' processed is counted as an additional
 * line.
 */
static unsigned int count_lines(const char *text, unsigned int text_size)
{
	unsigned int next_size = text_size;
	unsigned int line_count = 1;
	const char *next = text;

	while (next_size) {
		next = memchr(next, '\n', next_size);
		if (!next)
			break;
		line_count++;
		next++;
		next_size = text_size - (next - text);
	}

	return line_count;
}

/*
 * Given @blk_lpos, copy an expected @len of data into the provided buffer.
 * If @line_count is provided, count the number of lines in the data.
 *
 * This function (used by readers) performs strict validation on the data
 * size to possibly detect bugs in the writer code. A WARN_ON_ONCE() is
 * triggered if an internal error is detected.
 */
static bool copy_data(struct prb_data_ring *data_ring,
		      struct prb_data_blk_lpos *blk_lpos, u16 len, char *buf,
		      unsigned int buf_size, unsigned int *line_count)
{
	unsigned int data_size;
	const char *data;

	/* Caller might not want any data. */
	if ((!buf || !buf_size) && !line_count)
		return true;

	data = get_data(data_ring, blk_lpos, &data_size);
	if (!data)
		return false;

	/*
	 * The actual size cannot be less than the expected size. It can be
	 * larger than expected because of the trailing alignment padding.
	 *
	 * Note that invalid @len values can occur because the caller loads
	 * the value during an allowed data race.
	 */
	if (data_size < (unsigned int)len)
		return false;

	/* Caller interested in the line count? */
	if (line_count)
		*line_count = count_lines(data, len);

	/* Caller interested in the data content? */
	if (!buf || !buf_size)
		return true;

	data_size = min_t(unsigned int, buf_size, len);

	memcpy(&buf[0], data, data_size); /* LMM(copy_data:A) */
	return true;
}

/*
 * This is an extended version of desc_read(). It gets a copy of a specified
 * descriptor. However, it also verifies that the record is finalized and has
 * the sequence number @seq. On success, 0 is returned.
 *
 * Error return values:
 * -EINVAL: A finalized record with sequence number @seq does not exist.
 * -ENOENT: A finalized record with sequence number @seq exists, but its data
 *          is not available. This is a valid record, so readers should
 *          continue with the next record.
 */
static int desc_read_finalized_seq(struct prb_desc_ring *desc_ring,
				   unsigned long id, u64 seq,
				   struct prb_desc *desc_out)
{
	struct prb_data_blk_lpos *blk_lpos = &desc_out->text_blk_lpos;
	enum desc_state d_state;
	u64 s;

	d_state = desc_read(desc_ring, id, desc_out, &s, NULL);

	/*
	 * An unexpected @id (desc_miss) or @seq mismatch means the record
	 * does not exist. A descriptor in the reserved or committed state
	 * means the record does not yet exist for the reader.
	 */
	if (d_state == desc_miss ||
	    d_state == desc_reserved ||
	    d_state == desc_committed ||
	    s != seq) {
		return -EINVAL;
	}

	/*
	 * A descriptor in the reusable state may no longer have its data
	 * available; report it as existing but with lost data. Or the record
	 * may actually be a record with lost data.
	 */
	if (d_state == desc_reusable ||
	    (blk_lpos->begin == FAILED_LPOS && blk_lpos->next == FAILED_LPOS)) {
		return -ENOENT;
	}

	return 0;
}

/*
 * Copy the ringbuffer data from the record with @seq to the provided
 * @r buffer. On success, 0 is returned.
 *
 * See desc_read_finalized_seq() for error return values.
 */
static int prb_read(struct printk_ringbuffer *rb, u64 seq,
		    struct printk_record *r, unsigned int *line_count)
{
	struct prb_desc_ring *desc_ring = &rb->desc_ring;
	struct printk_info *info = to_info(desc_ring, seq);
	struct prb_desc *rdesc = to_desc(desc_ring, seq);
	atomic_long_t *state_var = &rdesc->state_var;
	struct prb_desc desc;
	unsigned long id;
	int err;

	/* Extract the ID, used to specify the descriptor to read. */
	id = DESC_ID(atomic_long_read(state_var));

	/* Get a local copy of the correct descriptor (if available). */
	err = desc_read_finalized_seq(desc_ring, id, seq, &desc);

	/*
	 * If @r is NULL, the caller is only interested in the availability
	 * of the record.
	 */
	if (err || !r)
		return err;

	/* If requested, copy meta data. */
	if (r->info)
		memcpy(r->info, info, sizeof(*(r->info)));

	/* Copy text data. If it fails, this is a data-less record. */
	if (!copy_data(&rb->text_data_ring, &desc.text_blk_lpos, info->text_len,
		       r->text_buf, r->text_buf_size, line_count)) {
		return -ENOENT;
	}

	/* Ensure the record is still finalized and has the same @seq. */
	return desc_read_finalized_seq(desc_ring, id, seq, &desc);
}

/* Get the sequence number of the tail descriptor. */
u64 prb_first_seq(struct printk_ringbuffer *rb)
{
	struct prb_desc_ring *desc_ring = &rb->desc_ring;
	enum desc_state d_state;
	struct prb_desc desc;
	unsigned long id;
	u64 seq;

	for (;;) {
		id = atomic_long_read(&rb->desc_ring.tail_id); /* LMM(prb_first_seq:A) */

		d_state = desc_read(desc_ring, id, &desc, &seq, NULL); /* LMM(prb_first_seq:B) */

		/*
		 * This loop will not be infinite because the tail is
		 * _always_ in the finalized or reusable state.
		 */
		if (d_state == desc_finalized || d_state == desc_reusable)
			break;

		/*
		 * Guarantee the last state load from desc_read() is before
		 * reloading @tail_id in order to see a new tail in the case
		 * that the descriptor has been recycled. This pairs with
		 * desc_reserve:D.
		 *
		 * Memory barrier involvement:
		 *
		 * If prb_first_seq:B reads from desc_reserve:F, then
		 * prb_first_seq:A reads from desc_push_tail:B.
		 *
		 * Relies on:
		 *
		 * MB from desc_push_tail:B to desc_reserve:F
		 *    matching
		 * RMB prb_first_seq:B to prb_first_seq:A
		 */
		smp_rmb(); /* LMM(prb_first_seq:C) */
	}

	return seq;
}

/**
 * prb_next_reserve_seq() - Get the sequence number after the most recently
 *                  reserved record.
 *
 * @rb:  The ringbuffer to get the sequence number from.
 *
 * This is the public function available to readers to see what sequence
 * number will be assigned to the next reserved record.
 *
 * Note that depending on the situation, this value can be equal to or
 * higher than the sequence number returned by prb_next_seq().
 *
 * Context: Any context.
 * Return: The sequence number that will be assigned to the next record
 *         reserved.
 */
u64 prb_next_reserve_seq(struct printk_ringbuffer *rb)
{
	struct prb_desc_ring *desc_ring = &rb->desc_ring;
	unsigned long last_finalized_id;
	atomic_long_t *state_var;
	u64 last_finalized_seq;
	unsigned long head_id;
	struct prb_desc desc;
	unsigned long diff;
	struct prb_desc *d;
	int err;

	/*
	 * It may not be possible to read a sequence number for @head_id.
	 * So the ID of @last_finalized_seq is used to calculate what the
	 * sequence number of @head_id will be.
	 */

try_again:
	last_finalized_seq = desc_last_finalized_seq(rb);

	/*
	 * @head_id is loaded after @last_finalized_seq to ensure that
	 * it points to the record with @last_finalized_seq or newer.
	 *
	 * Memory barrier involvement:
	 *
	 * If desc_last_finalized_seq:A reads from
	 * desc_update_last_finalized:A, then
	 * prb_next_reserve_seq:A reads from desc_reserve:D.
	 *
	 * Relies on:
	 *
	 * RELEASE from desc_reserve:D to desc_update_last_finalized:A
	 *    matching
	 * ACQUIRE from desc_last_finalized_seq:A to prb_next_reserve_seq:A
	 *
	 * Note: desc_reserve:D and desc_update_last_finalized:A can be
	 *       different CPUs. However, the desc_update_last_finalized:A CPU
	 *       (which performs the release) must have previously seen
	 *       desc_read:C, which implies desc_reserve:D can be seen.
	 */
	head_id = atomic_long_read(&desc_ring->head_id); /* LMM(prb_next_reserve_seq:A) */

	d = to_desc(desc_ring, last_finalized_seq);
	state_var = &d->state_var;

	/* Extract the ID, used to specify the descriptor to read. */
	last_finalized_id = DESC_ID(atomic_long_read(state_var));

	/* Ensure @last_finalized_id is correct. */
	err = desc_read_finalized_seq(desc_ring, last_finalized_id, last_finalized_seq, &desc);

	if (err == -EINVAL) {
		if (last_finalized_seq == 0) {
			/*
			 * No record has been finalized or even reserved yet.
			 *
			 * The @head_id is initialized such that the first
			 * increment will yield the first record (seq=0).
			 * Handle it separately to avoid a negative @diff
			 * below.
			 */
			if (head_id == DESC0_ID(desc_ring->count_bits))
				return 0;

			/*
			 * One or more descriptors are already reserved. Use
			 * the descriptor ID of the first one (@seq=0) for
			 * the @diff below.
			 */
			last_finalized_id = DESC0_ID(desc_ring->count_bits) + 1;
		} else {
			/* Record must have been overwritten. Try again. */
			goto try_again;
		}
	}

	/* Diff of known descriptor IDs to compute related sequence numbers. */
	diff = head_id - last_finalized_id;

	/*
	 * @head_id points to the most recently reserved record, but this
	 * function returns the sequence number that will be assigned to the
	 * next (not yet reserved) record. Thus +1 is needed.
	 */
	return (last_finalized_seq + diff + 1);
}

/*
 * Non-blocking read of a record.
 *
 * On success @seq is updated to the record that was read and (if provided)
 * @r and @line_count will contain the read/calculated data.
 *
 * On failure @seq is updated to a record that is not yet available to the
 * reader, but it will be the next record available to the reader.
 *
 * Note: When the current CPU is in panic, this function will skip over any
 *       non-existent/non-finalized records in order to allow the panic CPU
 *       to print any and all records that have been finalized.
 */
static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
			    struct printk_record *r, unsigned int *line_count)
{
	u64 tail_seq;
	int err;

	while ((err = prb_read(rb, *seq, r, line_count))) {
		tail_seq = prb_first_seq(rb);

		if (*seq < tail_seq) {
			/*
			 * Behind the tail. Catch up and try again. This
			 * can happen for -ENOENT and -EINVAL cases.
			 */
			*seq = tail_seq;

		} else if (err == -ENOENT) {
			/* Record exists, but the data was lost. Skip. */
			(*seq)++;

		} else {
			/*
			 * Non-existent/non-finalized record. Must stop.
			 *
			 * For panic situations it cannot be expected that
			 * non-finalized records will become finalized. But
			 * there may be other finalized records beyond that
			 * need to be printed for a panic situation. If this
			 * is the panic CPU, skip this
			 * non-existent/non-finalized record unless non-panic
			 * CPUs are still running and their debugging is
			 * explicitly enabled.
			 *
			 * Note that new messages printed on the panic CPU
			 * are already finalized by the time execution
			 * reaches this point. The only exception might be
			 * the last message without a trailing newline. But
			 * it would have the sequence number returned by
			 * "prb_next_reserve_seq() - 1".
			 */
			if (panic_on_this_cpu() &&
			    (!debug_non_panic_cpus || legacy_allow_panic_sync) &&
			    ((*seq + 1) < prb_next_reserve_seq(rb))) {
				(*seq)++;
			} else {
				return false;
			}
		}
	}

	return true;
}

/**
 * prb_read_valid() - Non-blocking read of a requested record or (if gone)
 *                    the next available record.
 *
 * @rb:  The ringbuffer to read from.
 * @seq: The sequence number of the record to read.
 * @r:   A record data buffer to store the read record to.
 *
 * This is the public function available to readers to read a record.
 *
 * The reader provides the @info and @text_buf buffers of @r to be
 * filled in. Any of the buffer pointers can be set to NULL if the reader
 * is not interested in that data. To ensure proper initialization of @r,
 * prb_rec_init_rd() should be used.
 *
 * Context: Any context.
 * Return: true if a record was read, otherwise false.
 *
 * On success, the reader must check r->info.seq to see which record was
 * actually read. This allows the reader to detect dropped records.
 *
 * Failure means @seq refers to a record not yet available to the reader.
 */
bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
		    struct printk_record *r)
{
	return _prb_read_valid(rb, &seq, r, NULL);
}
EXPORT_SYMBOL_IF_KUNIT(prb_read_valid);

/**
 * prb_read_valid_info() - Non-blocking read of meta data for a requested
 *                         record or (if gone) the next available record.
 *
 * @rb:         The ringbuffer to read from.
 * @seq:        The sequence number of the record to read.
 * @info:       A buffer to store the read record meta data to.
 * @line_count: A buffer to store the number of lines in the record text.
 *
 * This is the public function available to readers to read only the
 * meta data of a record.
 *
 * The reader provides the @info, @line_count buffers to be filled in.
 * Either of the buffer pointers can be set to NULL if the reader is not
 * interested in that data.
 *
 * Context: Any context.
 * Return: true if a record's meta data was read, otherwise false.
 *
 * On success, the reader must check info->seq to see which record meta data
 * was actually read. This allows the reader to detect dropped records.
 *
 * Failure means @seq refers to a record not yet available to the reader.
 */
bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
			 struct printk_info *info, unsigned int *line_count)
{
	struct printk_record r;

	prb_rec_init_rd(&r, info, NULL, 0);

	return _prb_read_valid(rb, &seq, &r, line_count);
}

/**
 * prb_first_valid_seq() - Get the sequence number of the oldest available
 *                         record.
 *
 * @rb: The ringbuffer to get the sequence number from.
 *
 * This is the public function available to readers to see what the
 * first/oldest valid sequence number is.
 *
 * This provides readers a starting point to begin iterating the ringbuffer.
 *
 * Context: Any context.
 * Return: The sequence number of the first/oldest record or, if the
 *         ringbuffer is empty, 0 is returned.
 */
u64 prb_first_valid_seq(struct printk_ringbuffer *rb)
{
	u64 seq = 0;

	if (!_prb_read_valid(rb, &seq, NULL, NULL))
		return 0;

	return seq;
}

/**
 * prb_next_seq() - Get the sequence number after the last available record.
 *
 * @rb:  The ringbuffer to get the sequence number from.
 *
 * This is the public function available to readers to see what the next
 * newest sequence number available to readers will be.
 *
 * This provides readers a sequence number to jump to if all currently
 * available records should be skipped. It is guaranteed that all records
 * previous to the returned value have been finalized and are (or were)
 * available to the reader.
 *
 * Context: Any context.
 * Return: The sequence number of the next newest (not yet available) record
 *         for readers.
 */
u64 prb_next_seq(struct printk_ringbuffer *rb)
{
	u64 seq;

	seq = desc_last_finalized_seq(rb);

	/*
	 * Begin searching after the last finalized record.
	 *
	 * On 0, the search must begin at 0 because, due to hack#2 of the
	 * bootstrapping phase, it is not known whether a record at index 0
	 * exists.
	 */
	if (seq != 0)
		seq++;

	/*
	 * The information about the last finalized @seq might be inaccurate.
	 * Search forward to find the current one.
	 */
	while (_prb_read_valid(rb, &seq, NULL, NULL))
		seq++;

	return seq;
}

/**
 * prb_init() - Initialize a ringbuffer to use provided external buffers.
 *
 * @rb:       The ringbuffer to initialize.
 * @text_buf: The data buffer for text data.
 * @textbits: The size of @text_buf as a power-of-2 value.
 * @descs:    The descriptor buffer for ringbuffer records.
 * @descbits: The count of @descs items as a power-of-2 value.
 * @infos:    The printk_info buffer for ringbuffer records.
 *
 * This is the public function available to writers to setup a ringbuffer
 * during runtime using provided buffers.
 *
 * This must match the initialization of DEFINE_PRINTKRB().
 *
 * Context: Any context.
 */
void prb_init(struct printk_ringbuffer *rb,
	      char *text_buf, unsigned int textbits,
	      struct prb_desc *descs, unsigned int descbits,
	      struct printk_info *infos)
{
	memset(descs, 0, _DESCS_COUNT(descbits) * sizeof(descs[0]));
	memset(infos, 0, _DESCS_COUNT(descbits) * sizeof(infos[0]));

	rb->desc_ring.count_bits = descbits;
	rb->desc_ring.descs = descs;
	rb->desc_ring.infos = infos;
	atomic_long_set(&rb->desc_ring.head_id, DESC0_ID(descbits));
	atomic_long_set(&rb->desc_ring.tail_id, DESC0_ID(descbits));
	atomic_long_set(&rb->desc_ring.last_finalized_seq, 0);

	rb->text_data_ring.size_bits = textbits;
	rb->text_data_ring.data = text_buf;
	atomic_long_set(&rb->text_data_ring.head_lpos, BLK0_LPOS(textbits));
	atomic_long_set(&rb->text_data_ring.tail_lpos, BLK0_LPOS(textbits));

	atomic_long_set(&rb->fail, 0);

	atomic_long_set(&(descs[_DESCS_COUNT(descbits) - 1].state_var), DESC0_SV(descbits));
	descs[_DESCS_COUNT(descbits) - 1].text_blk_lpos.begin = FAILED_LPOS;
	descs[_DESCS_COUNT(descbits) - 1].text_blk_lpos.next = FAILED_LPOS;

	infos[0].seq = -(u64)_DESCS_COUNT(descbits);
	infos[_DESCS_COUNT(descbits) - 1].seq = 0;
}
EXPORT_SYMBOL_IF_KUNIT(prb_init);

/**
 * prb_record_text_space() - Query the full actual used ringbuffer space for
 *                           the text data of a reserved entry.
 *
 * @e: The successfully reserved entry to query.
 *
 * This is the public function available to writers to see how much actual
 * space is used in the ringbuffer to store the text data of the specified
 * entry.
 *
 * This function is only valid if @e has been successfully reserved using
 * prb_reserve().
 *
 * Context: Any context.
 * Return: The size in bytes used by the text data of the associated record.
 */
unsigned int prb_record_text_space(struct prb_reserved_entry *e)
{
	return e->text_space;
}
]

7/2 2026/02/24 21:11 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
input: 30111
output: 25
thoughts: 688

Thoughts:
**Analyzing a Data Race in the Kernel's Printk Ringbuffer**

Okay, so I'm looking at a reported data race, and it seems to be between a write operation in `data_alloc` and a read operation in `_prb_read_valid`. The address involved is `0xffffffff86c7a1c0`, and the report specifies an 8-byte write and a 16-byte read at that address. That immediately jumps out, as it raises a red flag. Let's dig deeper.

The write side is clear: in `data_alloc`, line 1129, we have `blk->id = id;`. `blk->id` is an `unsigned long`, which aligns with the reported 8-byte write on a 64-bit system. No surprises there. The "LMM(data_alloc:B)" tag is one of the file's memory-model labels, used in the comments to pair up memory barriers, something to keep in mind for context.

Now, let's trace the read path. `_prb_read_valid` calls `prb_read`, which in turn calls `desc_read_finalized_seq`, and finally, down to `desc_read`. Line 496 in `desc_read` is where things get interesting: `state_val = atomic_long_read(state_var);`. It's reading from `state_var`. This is where the discrepancy becomes apparent: the sizes of the two accesses don't match.

My attention is immediately drawn to this `state_var`, and the context surrounding it. The report states that `state_var` is the address of `&desc->state_var`. I need to understand what `desc` is. I need to check the definitions of `struct prb_desc` and `struct prb_data_block` as a next step, given these are related. My suspicion is that `struct prb_desc` is likely defined in `kernel/printk/printk_ringbuffer.h` and this is where I would start my search.



8/2 2026/02/24 21:12 tool read-file
0m Args:
map[File:kernel/printk/printk_ringbuffer.h]

Results:
map[Contents:/* SPDX-License-Identifier: GPL-2.0 */

#ifndef _KERNEL_PRINTK_RINGBUFFER_H
#define _KERNEL_PRINTK_RINGBUFFER_H

#include <linux/atomic.h>
#include <linux/bits.h>
#include <linux/dev_printk.h>
#include <linux/stddef.h>
#include <linux/types.h>

/*
 * Meta information about each stored message.
 *
 * All fields are set by the printk code except for @seq, which is
 * set by the ringbuffer code.
 */
struct printk_info {
	u64	seq;		/* sequence number */
	u64	ts_nsec;	/* timestamp in nanoseconds */
	u16	text_len;	/* length of text message */
	u8	facility;	/* syslog facility */
	u8	flags:5;	/* internal record flags */
	u8	level:3;	/* syslog level */
	u32	caller_id;	/* thread id or processor id */
#ifdef CONFIG_PRINTK_EXECUTION_CTX
	u32	caller_id2;	/* caller_id complement */
	/* name of the task that generated the message */
	char	comm[TASK_COMM_LEN];
#endif

	struct dev_printk_info	dev_info;
};

/*
 * A structure providing the buffers, used by writers and readers.
 *
 * Writers:
 * Using prb_rec_init_wr(), a writer sets @text_buf_size before calling
 * prb_reserve(). On success, prb_reserve() sets @info and @text_buf to
 * buffers reserved for that writer.
 *
 * Readers:
 * Using prb_rec_init_rd(), a reader sets all fields before calling
 * prb_read_valid(). Note that the reader provides the @info and @text_buf
 * buffers. On success, the struct pointed to by @info will be filled and
 * the char array pointed to by @text_buf will be filled with text data.
 */
struct printk_record {
	struct printk_info	*info;
	char			*text_buf;
	unsigned int		text_buf_size;
};

/* Specifies the logical position and span of a data block. */
struct prb_data_blk_lpos {
	unsigned long	begin;
	unsigned long	next;
};

/*
 * A descriptor: the complete meta-data for a record.
 *
 * @state_var: A bitwise combination of descriptor ID and descriptor state.
 */
struct prb_desc {
	atomic_long_t			state_var;
	struct prb_data_blk_lpos	text_blk_lpos;
};

/* A ringbuffer of "ID + data" elements. */
struct prb_data_ring {
	unsigned int	size_bits;
	char		*data;
	atomic_long_t	head_lpos;
	atomic_long_t	tail_lpos;
};

/* A ringbuffer of "struct prb_desc" elements. */
struct prb_desc_ring {
	unsigned int		count_bits;
	struct prb_desc		*descs;
	struct printk_info	*infos;
	atomic_long_t		head_id;
	atomic_long_t		tail_id;
	atomic_long_t		last_finalized_seq;
};

/*
 * The high level structure representing the printk ringbuffer.
 *
 * @fail: Count of failed prb_reserve() calls where not even a data-less
 *        record was created.
 */
struct printk_ringbuffer {
	struct prb_desc_ring	desc_ring;
	struct prb_data_ring	text_data_ring;
	atomic_long_t		fail;
};

/*
 * Used by writers as a reserve/commit handle.
 *
 * @rb:         Ringbuffer where the entry is reserved.
 * @irqflags:   Saved irq flags to restore on entry commit.
 * @id:         ID of the reserved descriptor.
 * @text_space: Total occupied buffer space in the text data ring, including
 *              ID, alignment padding, and wrapping data blocks.
 *
 * This structure is an opaque handle for writers. Its contents are only
 * to be used by the ringbuffer implementation.
 */
struct prb_reserved_entry {
	struct printk_ringbuffer	*rb;
	unsigned long			irqflags;
	unsigned long			id;
	unsigned int			text_space;
};

/* The possible responses of a descriptor state-query. */
enum desc_state {
	desc_miss	=  -1,	/* ID mismatch (pseudo state) */
	desc_reserved	= 0x0,	/* reserved, in use by writer */
	desc_committed	= 0x1,	/* committed by writer, could get reopened */
	desc_finalized	= 0x2,	/* committed, no further modification allowed */
	desc_reusable	= 0x3,	/* free, not yet used by any writer */
};

#define _DATA_SIZE(sz_bits)	(1UL << (sz_bits))
#define _DESCS_COUNT(ct_bits)	(1U << (ct_bits))
#define DESC_SV_BITS		BITS_PER_LONG
#define DESC_FLAGS_SHIFT	(DESC_SV_BITS - 2)
#define DESC_FLAGS_MASK		(3UL << DESC_FLAGS_SHIFT)
#define DESC_STATE(sv)		(3UL & (sv >> DESC_FLAGS_SHIFT))
#define DESC_SV(id, state)	(((unsigned long)state << DESC_FLAGS_SHIFT) | id)
#define DESC_ID_MASK		(~DESC_FLAGS_MASK)
#define DESC_ID(sv)		((sv) & DESC_ID_MASK)

/*
 * Special data block logical position values (for fields of
 * @prb_desc.text_blk_lpos).
 *
 * - Bit0 is used to identify if the record has no data block. (Implemented in
 *   the LPOS_DATALESS() macro.)
 *
 * - Bit1 specifies the reason for not having a data block.
 *
 * These special values could never be real lpos values because of the
 * meta data and alignment padding of data blocks. (See to_blk_size() for
 * details.)
 */
#define FAILED_LPOS		0x1
#define EMPTY_LINE_LPOS		0x3

#define FAILED_BLK_LPOS	\
{				\
	.begin	= FAILED_LPOS,	\
	.next	= FAILED_LPOS,	\
}

/*
 * Descriptor Bootstrap
 *
 * The descriptor array is minimally initialized to allow immediate usage
 * by readers and writers. The requirements that the descriptor array
 * initialization must satisfy:
 *
 *   Req1
 *     The tail must point to an existing (committed or reusable) descriptor.
 *     This is required by the implementation of prb_first_seq().
 *
 *   Req2
 *     Readers must see that the ringbuffer is initially empty.
 *
 *   Req3
 *     The first record reserved by a writer is assigned sequence number 0.
 *
 * To satisfy Req1, the tail initially points to a descriptor that is
 * minimally initialized (having no data block, i.e. data-less with the
 * data block's lpos @begin and @next values set to FAILED_LPOS).
 *
 * To satisfy Req2, the initial tail descriptor is initialized to the
 * reusable state. Readers recognize reusable descriptors as existing
 * records, but skip over them.
 *
 * To satisfy Req3, the last descriptor in the array is used as the initial
 * head (and tail) descriptor. This allows the first record reserved by a
 * writer (head + 1) to be the first descriptor in the array. (Only the first
 * descriptor in the array could have a valid sequence number of 0.)
 *
 * The first time a descriptor is reserved, it is assigned a sequence number
 * with the value of the array index. A "first time reserved" descriptor can
 * be recognized because it has a sequence number of 0 but does not have an
 * index of 0. (Only the first descriptor in the array could have a valid
 * sequence number of 0.) After the first reservation, all future reservations
 * (recycling) simply involve incrementing the sequence number by the array
 * count.
 *
 *   Hack #1
 *     Only the first descriptor in the array is allowed to have the sequence
 *     number 0. In this case it is not possible to recognize if it is being
 *     reserved the first time (set to index value) or has been reserved
 *     previously (increment by the array count). This is handled by _always_
 *     incrementing the sequence number by the array count when reserving the
 *     first descriptor in the array. In order to satisfy Req3, the sequence
 *     number of the first descriptor in the array is initialized to minus
 *     the array count. Then, upon the first reservation, it is incremented
 *     to 0, thus satisfying Req3.
 *
 *   Hack #2
 *     prb_first_seq() can be called at any time by readers to retrieve the
 *     sequence number of the tail descriptor. However, due to Req2 and Req3,
 *     initially there are no records to report the sequence number of
 *     (sequence numbers are u64 and there is nothing less than 0). To handle
 *     this, the sequence number of the initial tail descriptor is initialized
 *     to 0. Technically this is incorrect, because there is no record with
 *     sequence number 0 (yet) and the tail descriptor is not the first
 *     descriptor in the array. But it allows prb_read_valid() to correctly
 *     report the existence of a record for _any_ given sequence number at all
 *     times. Bootstrapping is complete when the tail is pushed the first
 *     time, thus finally pointing to the first descriptor reserved by a
 *     writer, which has the assigned sequence number 0.
 */

/*
 * Initiating Logical Value Overflows
 *
 * Both logical position (lpos) and ID values can be mapped to array indexes
 * but may experience overflows during the lifetime of the system. To ensure
 * that printk_ringbuffer can handle the overflows for these types, initial
 * values are chosen that map to the correct initial array indexes, but will
 * result in overflows soon.
 *
 *   BLK0_LPOS
 *     The initial @head_lpos and @tail_lpos for data rings. It is at index
 *     0 and the lpos value is such that it will overflow on the first wrap.
 *
 *   DESC0_ID
 *     The initial @head_id and @tail_id for the desc ring. It is at the last
 *     index of the descriptor array (see Req3 above) and the ID value is such
 *     that it will overflow on the second wrap.
 */
#define BLK0_LPOS(sz_bits)	(-(_DATA_SIZE(sz_bits)))
#define DESC0_ID(ct_bits)	DESC_ID(-(_DESCS_COUNT(ct_bits) + 1))
#define DESC0_SV(ct_bits)	DESC_SV(DESC0_ID(ct_bits), desc_reusable)

/*
 * Define a ringbuffer with an external text data buffer. The same as
 * DEFINE_PRINTKRB() but requires specifying an external buffer for the
 * text data.
 *
 * Note: The specified external buffer must be of the size:
 *       2 ^ (descbits + avgtextbits)
 */
#define _DEFINE_PRINTKRB(name, descbits, avgtextbits, text_buf)			\
static struct prb_desc _##name##_descs[_DESCS_COUNT(descbits)] = {				\
	/* the initial head and tail */								\
	[_DESCS_COUNT(descbits) - 1] = {							\
		/* reusable */									\
		.state_var	= ATOMIC_INIT(DESC0_SV(descbits)),				\
		/* no associated data block */							\
		.text_blk_lpos	= FAILED_BLK_LPOS,						\
	},											\
};												\
static struct printk_info _##name##_infos[_DESCS_COUNT(descbits)] = {				\
	/* this will be the first record reserved by a writer */				\
	[0] = {											\
		/* will be incremented to 0 on the first reservation */				\
		.seq = -(u64)_DESCS_COUNT(descbits),						\
	},											\
	/* the initial head and tail */								\
	[_DESCS_COUNT(descbits) - 1] = {							\
		/* reports the first seq value during the bootstrap phase */			\
		.seq = 0,									\
	},											\
};												\
static struct printk_ringbuffer name = {							\
	.desc_ring = {										\
		.count_bits	= descbits,							\
		.descs		= &_##name##_descs[0],						\
		.infos		= &_##name##_infos[0],						\
		.head_id	= ATOMIC_INIT(DESC0_ID(descbits)),				\
		.tail_id	= ATOMIC_INIT(DESC0_ID(descbits)),				\
		.last_finalized_seq = ATOMIC_INIT(0),						\
	},											\
	.text_data_ring = {									\
		.size_bits	= (avgtextbits) + (descbits),					\
		.data		= text_buf,							\
		.head_lpos	= ATOMIC_LONG_INIT(BLK0_LPOS((avgtextbits) + (descbits))),	\
		.tail_lpos	= ATOMIC_LONG_INIT(BLK0_LPOS((avgtextbits) + (descbits))),	\
	},											\
	.fail			= ATOMIC_LONG_INIT(0),						\
}

/**
 * DEFINE_PRINTKRB() - Define a ringbuffer.
 *
 * @name:        The name of the ringbuffer variable.
 * @descbits:    The number of descriptors as a power-of-2 value.
 * @avgtextbits: The average text data size per record as a power-of-2 value.
 *
 * This is a macro for defining a ringbuffer and all internal structures
 * such that it is ready for immediate use. See _DEFINE_PRINTKRB() for a
 * variant where the text data buffer can be specified externally.
 */
#define DEFINE_PRINTKRB(name, descbits, avgtextbits)				\
static char _##name##_text[1U << ((avgtextbits) + (descbits))]			\
			__aligned(__alignof__(unsigned long));			\
_DEFINE_PRINTKRB(name, descbits, avgtextbits, &_##name##_text[0])

/* Writer Interface */

/**
 * prb_rec_init_wr() - Initialize a buffer for writing records.
 *
 * @r:             The record to initialize.
 * @text_buf_size: The needed text buffer size.
 */
static inline void prb_rec_init_wr(struct printk_record *r,
				   unsigned int text_buf_size)
{
	r->info = NULL;
	r->text_buf = NULL;
	r->text_buf_size = text_buf_size;
}

bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
		 struct printk_record *r);
bool prb_reserve_in_last(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
			 struct printk_record *r, u32 caller_id, unsigned int max_size);
void prb_commit(struct prb_reserved_entry *e);
void prb_final_commit(struct prb_reserved_entry *e);

void prb_init(struct printk_ringbuffer *rb,
	      char *text_buf, unsigned int text_buf_size,
	      struct prb_desc *descs, unsigned int descs_count_bits,
	      struct printk_info *infos);
unsigned int prb_record_text_space(struct prb_reserved_entry *e);

/* Reader Interface */

/**
 * prb_rec_init_rd() - Initialize a buffer for reading records.
 *
 * @r:             The record to initialize.
 * @info:          A buffer to store record meta-data.
 * @text_buf:      A buffer to store text data.
 * @text_buf_size: The size of @text_buf.
 *
 * Initialize all the fields that a reader is interested in. All arguments
 * (except @r) are optional. Only record data for arguments that are
 * non-NULL or non-zero will be read.
 */
static inline void prb_rec_init_rd(struct printk_record *r,
				   struct printk_info *info,
				   char *text_buf, unsigned int text_buf_size)
{
	r->info = info;
	r->text_buf = text_buf;
	r->text_buf_size = text_buf_size;
}

/**
 * prb_for_each_record() - Iterate over the records of a ringbuffer.
 *
 * @from: The sequence number to begin with.
 * @rb:   The ringbuffer to iterate over.
 * @s:    A u64 to store the sequence number on each iteration.
 * @r:    A printk_record to store the record on each iteration.
 *
 * This is a macro for conveniently iterating over a ringbuffer.
 * Note that @s may not be the sequence number of the record on each
 * iteration. For the sequence number, @r->info->seq should be checked.
 *
 * Context: Any context.
 */
#define prb_for_each_record(from, rb, s, r) \
for ((s) = from; prb_read_valid(rb, s, r); (s) = (r)->info->seq + 1)

/**
 * prb_for_each_info() - Iterate over the meta data of a ringbuffer.
 *
 * @from: The sequence number to begin with.
 * @rb:   The ringbuffer to iterate over.
 * @s:    A u64 to store the sequence number on each iteration.
 * @i:    A printk_info to store the record meta data on each iteration.
 * @lc:   An unsigned int to store the text line count of each record.
 *
 * This is a macro for conveniently iterating over a ringbuffer.
 * Note that @s may not be the sequence number of the record on each
 * iteration. For the sequence number, @i->seq should be checked.
 *
 * Context: Any context.
 */
#define prb_for_each_info(from, rb, s, i, lc) \
for ((s) = from; prb_read_valid_info(rb, s, i, lc); (s) = (i)->seq + 1)

bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
		    struct printk_record *r);
bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
			 struct printk_info *info, unsigned int *line_count);

u64 prb_first_seq(struct printk_ringbuffer *rb);
u64 prb_first_valid_seq(struct printk_ringbuffer *rb);
u64 prb_next_seq(struct printk_ringbuffer *rb);
u64 prb_next_reserve_seq(struct printk_ringbuffer *rb);

#ifdef CONFIG_64BIT

#define __u64seq_to_ulseq(u64seq) (u64seq)
#define __ulseq_to_u64seq(rb, ulseq) (ulseq)
#define ULSEQ_MAX(rb) (-1)

#else /* CONFIG_64BIT */

#define __u64seq_to_ulseq(u64seq) ((u32)u64seq)
#define ULSEQ_MAX(rb) __u64seq_to_ulseq(prb_first_seq(rb) + 0x80000000UL)

static inline u64 __ulseq_to_u64seq(struct printk_ringbuffer *rb, u32 ulseq)
{
	u64 rb_first_seq = prb_first_seq(rb);
	u64 seq;

	/*
	 * The provided sequence is only the lower 32 bits of the ringbuffer
	 * sequence. It needs to be expanded to 64bit. Get the first sequence
	 * number from the ringbuffer and fold it.
	 *
	 * Having a 32bit representation in the console is sufficient.
	 * If a console ever gets more than 2^31 records behind
	 * the ringbuffer then this is the least of the problems.
	 *
	 * Also the access to the ring buffer is always safe.
	 */
	seq = rb_first_seq - (s32)((u32)rb_first_seq - ulseq);

	return seq;
}

#endif /* CONFIG_64BIT */

#endif /* _KERNEL_PRINTK_RINGBUFFER_H */
]

9/2 2026/02/24 21:12 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
input: 36373
output: 22
thoughts: 2256

Thoughts:
**Analysis of Potential Data Race in Printk Ringbuffer**

Okay, here's what I've got. The KCSAN report highlights a potential data race between writes in `data_alloc` and reads in `desc_read` within the printk ringbuffer implementation. The critical detail is that both access the same memory address, `0xffffffff86c7a1c0`. My initial reaction was confusion, as the reported address is likely within the kernel's data section, and I needed to piece together how this could even be possible.

The first step was to understand the structures involved. I traced the code and confirmed that `data_alloc` writes to `blk->id`, where `blk` is a `prb_data_block`. This block contains an `id` field at its beginning, followed by the actual data. In `desc_read`, the code reads from `desc->state_var`, where `desc` is a `prb_desc`. These are fundamentally different structures, and my first instinct was to rule out overlap. The address is high, but not so high as to raise alarms, which suggests the static ringbuffer is in use.

The KCSAN report pointed to a 16-byte read, but `atomic_long_read` is expected to be 8 bytes, so I needed to check the inlining. The report says it's inlined. I suspected an inline artifact, or maybe the read was actually from the `memcpy` which is 16 bytes. However, checking `struct prb_desc` confirms that it contains both `state_var` and `text_blk_lpos`.

I initially believed the address overlap was a bug, and that `data_ring->data` and `desc_ring->descs` were overlapping. But I realized that the printk ringbuffer architecture uses a data ring for the text and a descriptor ring for metadata. The `id` in the data block is used to specify the descriptor to read.

Looking at the call stack from `_prb_read_valid` to `prb_read`, I had to clarify the sequence. The `id` read in `prb_read` is extracted from `state_var`, which is a descriptor state variable.

Now, it's clear: if `blk->id` and `desc->state_var` are at the *same* address, then the data ring and descriptor ring must overlap. That's the only logical explanation given the structure of the ringbuffer. I will now double-check the ringbuffer allocation in `kernel/printk/printk.c`. Specifically, I want to see how `DEFINE_PRINTKRB` is using `CONFIG_LOG_BUF_SHIFT`. This should reveal the size and layout of the printk ringbuffer and, hopefully, confirm or deny this overlap condition. If the allocation scheme isn't what I expect, I'm going to have to check the code that actually sets up the memory region.



10/2 2026/02/24 21:12 tool read-file
0m Args:
map[File:kernel/printk/printk.c]

Results:
map[Contents:// SPDX-License-Identifier: GPL-2.0-only
/*
 *  linux/kernel/printk.c
 *
 *  Copyright (C) 1991, 1992  Linus Torvalds
 *
 * Modified to make sys_syslog() more flexible: added commands to
 * return the last 4k of kernel messages, regardless of whether
 * they've been read or not.  Added option to suppress kernel printk's
 * to the console.  Added hook for sending the console messages
 * elsewhere, in preparation for a serial line console (someday).
 * Ted Ts'o, 2/11/93.
 * Modified for sysctl support, 1/8/97, Chris Horn.
 * Fixed SMP synchronization, 08/08/99, Manfred Spraul
 *     manfred@colorfullife.com
 * Rewrote bits to get rid of console_lock
 *	01Mar01 Andrew Morton
 */

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/tty.h>
#include <linux/tty_driver.h>
#include <linux/console.h>
#include <linux/init.h>
#include <linux/jiffies.h>
#include <linux/nmi.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/delay.h>
#include <linux/smp.h>
#include <linux/security.h>
#include <linux/memblock.h>
#include <linux/syscalls.h>
#include <linux/syscore_ops.h>
#include <linux/vmcore_info.h>
#include <linux/ratelimit.h>
#include <linux/kmsg_dump.h>
#include <linux/syslog.h>
#include <linux/cpu.h>
#include <linux/rculist.h>
#include <linux/poll.h>
#include <linux/irq_work.h>
#include <linux/ctype.h>
#include <linux/uio.h>
#include <linux/sched/clock.h>
#include <linux/sched/debug.h>
#include <linux/sched/task_stack.h>
#include <linux/panic.h>

#include <linux/uaccess.h>
#include <asm/sections.h>

#include <trace/events/initcall.h>
#define CREATE_TRACE_POINTS
#include <trace/events/printk.h>

#include "printk_ringbuffer.h"
#include "console_cmdline.h"
#include "braille.h"
#include "internal.h"

int console_printk[4] = {
	CONSOLE_LOGLEVEL_DEFAULT,	/* console_loglevel */
	MESSAGE_LOGLEVEL_DEFAULT,	/* default_message_loglevel */
	CONSOLE_LOGLEVEL_MIN,		/* minimum_console_loglevel */
	CONSOLE_LOGLEVEL_DEFAULT,	/* default_console_loglevel */
};
EXPORT_SYMBOL_GPL(console_printk);

atomic_t ignore_console_lock_warning __read_mostly = ATOMIC_INIT(0);
EXPORT_SYMBOL(ignore_console_lock_warning);

EXPORT_TRACEPOINT_SYMBOL_GPL(console);

/*
 * Low level drivers may need that to know if they can schedule in
 * their unblank() callback or not. So let's export it.
 */
int oops_in_progress;
EXPORT_SYMBOL(oops_in_progress);

/*
 * console_mutex protects console_list updates and console->flags updates.
 * The flags are synchronized only for consoles that are registered, i.e.
 * accessible via the console list.
 */
static DEFINE_MUTEX(console_mutex);

/*
 * console_sem protects updates to console->seq
 * and also provides serialization for console printing.
 */
static DEFINE_SEMAPHORE(console_sem, 1);
HLIST_HEAD(console_list);
EXPORT_SYMBOL_GPL(console_list);
DEFINE_STATIC_SRCU(console_srcu);

/*
 * System may need to suppress printk message under certain
 * circumstances, like after kernel panic happens.
 */
int __read_mostly suppress_printk;

#ifdef CONFIG_LOCKDEP
static struct lockdep_map console_lock_dep_map = {
	.name = "console_lock"
};

void lockdep_assert_console_list_lock_held(void)
{
	lockdep_assert_held(&console_mutex);
}
EXPORT_SYMBOL(lockdep_assert_console_list_lock_held);
#endif

#ifdef CONFIG_DEBUG_LOCK_ALLOC
bool console_srcu_read_lock_is_held(void)
{
	return srcu_read_lock_held(&console_srcu);
}
EXPORT_SYMBOL(console_srcu_read_lock_is_held);
#endif

enum devkmsg_log_bits {
	__DEVKMSG_LOG_BIT_ON = 0,
	__DEVKMSG_LOG_BIT_OFF,
	__DEVKMSG_LOG_BIT_LOCK,
};

enum devkmsg_log_masks {
	DEVKMSG_LOG_MASK_ON             = BIT(__DEVKMSG_LOG_BIT_ON),
	DEVKMSG_LOG_MASK_OFF            = BIT(__DEVKMSG_LOG_BIT_OFF),
	DEVKMSG_LOG_MASK_LOCK           = BIT(__DEVKMSG_LOG_BIT_LOCK),
};

/* Keep both the 'on' and 'off' bits clear, i.e. ratelimit by default: */
#define DEVKMSG_LOG_MASK_DEFAULT	0

static unsigned int __read_mostly devkmsg_log = DEVKMSG_LOG_MASK_DEFAULT;

static int __control_devkmsg(char *str)
{
	size_t len;

	if (!str)
		return -EINVAL;

	len = str_has_prefix(str, "on");
	if (len) {
		devkmsg_log = DEVKMSG_LOG_MASK_ON;
		return len;
	}

	len = str_has_prefix(str, "off");
	if (len) {
		devkmsg_log = DEVKMSG_LOG_MASK_OFF;
		return len;
	}

	len = str_has_prefix(str, "ratelimit");
	if (len) {
		devkmsg_log = DEVKMSG_LOG_MASK_DEFAULT;
		return len;
	}

	return -EINVAL;
}

static int __init control_devkmsg(char *str)
{
	if (__control_devkmsg(str) < 0) {
		pr_warn("printk.devkmsg: bad option string '%s'\n", str);
		return 1;
	}

	/*
	 * Set sysctl string accordingly:
	 */
	if (devkmsg_log == DEVKMSG_LOG_MASK_ON)
		strscpy(devkmsg_log_str, "on");
	else if (devkmsg_log == DEVKMSG_LOG_MASK_OFF)
		strscpy(devkmsg_log_str, "off");
	/* else "ratelimit" which is set by default. */

	/*
	 * Sysctl cannot change it anymore. The kernel command line setting of
	 * this parameter is to force the setting to be permanent throughout the
	 * runtime of the system. This is a precautionary measure against userspace
	 * trying to be a smarta** and attempting to change it up on us.
	 */
	devkmsg_log |= DEVKMSG_LOG_MASK_LOCK;

	return 1;
}
__setup("printk.devkmsg=", control_devkmsg);

char devkmsg_log_str[DEVKMSG_STR_MAX_SIZE] = "ratelimit";
#if defined(CONFIG_PRINTK) && defined(CONFIG_SYSCTL)
int devkmsg_sysctl_set_loglvl(const struct ctl_table *table, int write,
			      void *buffer, size_t *lenp, loff_t *ppos)
{
	char old_str[DEVKMSG_STR_MAX_SIZE];
	unsigned int old;
	int err;

	if (write) {
		if (devkmsg_log & DEVKMSG_LOG_MASK_LOCK)
			return -EINVAL;

		old = devkmsg_log;
		strscpy(old_str, devkmsg_log_str);
	}

	err = proc_dostring(table, write, buffer, lenp, ppos);
	if (err)
		return err;

	if (write) {
		err = __control_devkmsg(devkmsg_log_str);

		/*
		 * Do not accept an unknown string OR a known string with
		 * trailing crap...
		 */
		if (err < 0 || (err + 1 != *lenp)) {

			/* ... and restore old setting. */
			devkmsg_log = old;
			strscpy(devkmsg_log_str, old_str);

			return -EINVAL;
		}
	}

	return 0;
}
#endif /* CONFIG_PRINTK && CONFIG_SYSCTL */

/**
 * console_list_lock - Lock the console list
 *
 * For console list or console->flags updates
 */
void console_list_lock(void)
	__acquires(&console_mutex)
{
	/*
	 * In unregister_console() and console_force_preferred_locked(),
	 * synchronize_srcu() is called with the console_list_lock held.
	 * Therefore it is not allowed that the console_list_lock is taken
	 * with the srcu_lock held.
	 *
	 * Detecting if this context is really in the read-side critical
	 * section is only possible if the appropriate debug options are
	 * enabled.
	 */
	WARN_ON_ONCE(debug_lockdep_rcu_enabled() &&
		     srcu_read_lock_held(&console_srcu));

	mutex_lock(&console_mutex);
}
EXPORT_SYMBOL(console_list_lock);

/**
 * console_list_unlock - Unlock the console list
 *
 * Counterpart to console_list_lock()
 */
void console_list_unlock(void)
	__releases(&console_mutex)
{
	mutex_unlock(&console_mutex);
}
EXPORT_SYMBOL(console_list_unlock);

/**
 * console_srcu_read_lock - Register a new reader for the
 *	SRCU-protected console list
 *
 * Use for_each_console_srcu() to iterate the console list
 *
 * Context: Any context.
 * Return: A cookie to pass to console_srcu_read_unlock().
 */
int console_srcu_read_lock(void)
	__acquires(&console_srcu)
{
	return srcu_read_lock_nmisafe(&console_srcu);
}
EXPORT_SYMBOL(console_srcu_read_lock);

/**
 * console_srcu_read_unlock - Unregister an old reader from
 *	the SRCU-protected console list
 * @cookie: cookie returned from console_srcu_read_lock()
 *
 * Counterpart to console_srcu_read_lock()
 */
void console_srcu_read_unlock(int cookie)
	__releases(&console_srcu)
{
	srcu_read_unlock_nmisafe(&console_srcu, cookie);
}
EXPORT_SYMBOL(console_srcu_read_unlock);

/*
 * Helper macros to handle lockdep when locking/unlocking console_sem. We use
 * macros instead of functions so that _RET_IP_ contains useful information.
 */
#define down_console_sem() do { \
	down(&console_sem);\
	mutex_acquire(&console_lock_dep_map, 0, 0, _RET_IP_);\
} while (0)

static int __down_trylock_console_sem(unsigned long ip)
{
	int lock_failed;
	unsigned long flags;

	/*
	 * Here and in __up_console_sem() we need to be in safe mode,
	 * because spindump/WARN/etc from under console ->lock will
	 * deadlock in printk()->down_trylock_console_sem() otherwise.
	 */
	printk_safe_enter_irqsave(flags);
	lock_failed = down_trylock(&console_sem);
	printk_safe_exit_irqrestore(flags);

	if (lock_failed)
		return 1;
	mutex_acquire(&console_lock_dep_map, 0, 1, ip);
	return 0;
}
#define down_trylock_console_sem() __down_trylock_console_sem(_RET_IP_)

static void __up_console_sem(unsigned long ip)
{
	unsigned long flags;

	mutex_release(&console_lock_dep_map, ip);

	printk_safe_enter_irqsave(flags);
	up(&console_sem);
	printk_safe_exit_irqrestore(flags);
}
#define up_console_sem() __up_console_sem(_RET_IP_)

/*
 * This is used for debugging the mess that is the VT code by
 * keeping track of whether we have the console semaphore held.
 * It's definitely not the perfect debug tool (we don't know if
 * _WE_ hold it and are racing, but it helps track those weird
 * code paths in the console code where we end up in places I
 * want locked without the console semaphore held).
 */
static int console_locked;

/*
 *	Array of consoles built from command line options (console=)
 */

#define MAX_CMDLINECONSOLES 8

static struct console_cmdline console_cmdline[MAX_CMDLINECONSOLES];

static int preferred_console = -1;
int console_set_on_cmdline;
EXPORT_SYMBOL(console_set_on_cmdline);

/* Flag: console code may call schedule() */
static int console_may_schedule;

enum con_msg_format_flags {
	MSG_FORMAT_DEFAULT	= 0,
	MSG_FORMAT_SYSLOG	= (1 << 0),
};

static int console_msg_format = MSG_FORMAT_DEFAULT;

/*
 * The printk log buffer consists of a sequenced collection of records, each
 * containing variable-length message text. Every record also contains its
 * own meta-data (@info).
 *
 * Every record meta-data carries the timestamp in microseconds, as well as
 * the standard userspace syslog level and syslog facility. The usual kernel
 * messages use LOG_KERN; userspace-injected messages always carry a matching
 * syslog facility, by default LOG_USER. The origin of every message can be
 * reliably determined that way.
 *
 * The human readable log message of a record is available in @text, the
 * length of the message text in @text_len. The stored message is not
 * terminated.
 *
 * Optionally, a record can carry a dictionary of properties (key/value
 * pairs), to provide userspace with a machine-readable message context.
 *
 * Examples for well-defined, commonly used property names are:
 *   DEVICE=b12:8               device identifier
 *                                b12:8         block dev_t
 *                                c127:3        char dev_t
 *                                n8            netdev ifindex
 *                                +sound:card0  subsystem:devname
 *   SUBSYSTEM=pci              driver-core subsystem name
 *
 * Valid characters in property names are [a-zA-Z0-9.-_]. Property names
 * and values are terminated by a '\0' character.
 *
 * Example of record values:
 *   record.text_buf                = "it's a line" (unterminated)
 *   record.info.seq                = 56
 *   record.info.ts_nsec            = 36863
 *   record.info.text_len           = 11
 *   record.info.facility           = 0 (LOG_KERN)
 *   record.info.flags              = 0
 *   record.info.level              = 3 (LOG_ERR)
 *   record.info.caller_id          = 299 (task 299)
 *   record.info.dev_info.subsystem = "pci" (terminated)
 *   record.info.dev_info.device    = "+pci:0000:00:01.0" (terminated)
 *
 * The 'struct printk_info' buffer must never be directly exported to
 * userspace, it is a kernel-private implementation detail that might
 * need to be changed in the future, when the requirements change.
 *
 * /dev/kmsg exports the structured data in the following line format:
 *   "<level>,<sequnum>,<timestamp>,<contflag>[,additional_values, ... ];<message text>\n"
 *
 * Users of the export format should ignore possible additional values
 * separated by ',', and find the message after the ';' character.
 *
 * The optional key/value pairs are attached as continuation lines starting
 * with a space character and terminated by a newline. All possible
 * non-printable characters are escaped in the "\xff" notation.
 */

/* syslog_lock protects syslog_* variables and write access to clear_seq. */
static DEFINE_MUTEX(syslog_lock);

/*
 * Specifies if a legacy console is registered. If legacy consoles are
 * present, it is necessary to perform the console lock/unlock dance
 * whenever console flushing should occur.
 */
bool have_legacy_console;

/*
 * Specifies if an nbcon console is registered. If nbcon consoles are present,
 * synchronous printing of legacy consoles will not occur during panic until
 * the backtrace has been stored to the ringbuffer.
 */
bool have_nbcon_console;

/*
 * Specifies if a boot console is registered. If boot consoles are present,
 * nbcon consoles cannot print simultaneously and must be synchronized by
 * the console lock. This is because boot consoles and nbcon consoles may
 * have mapped the same hardware.
 */
bool have_boot_console;

/* See printk_legacy_allow_panic_sync() for details. */
bool legacy_allow_panic_sync;

/* Avoid using irq_work when suspending. */
bool console_irqwork_blocked;

#ifdef CONFIG_PRINTK
DECLARE_WAIT_QUEUE_HEAD(log_wait);
static DECLARE_WAIT_QUEUE_HEAD(legacy_wait);
/* All 3 protected by @syslog_lock. */
/* the next printk record to read by syslog(READ) or /proc/kmsg */
static u64 syslog_seq;
static size_t syslog_partial;
static bool syslog_time;

/* True when _all_ printer threads are available for printing. */
bool printk_kthreads_running;

struct latched_seq {
	seqcount_latch_t	latch;
	u64			val[2];
};

/*
 * The next printk record to read after the last 'clear' command. There are
 * two copies (updated with seqcount_latch) so that reads can locklessly
 * access a valid value. Writers are synchronized by @syslog_lock.
 */
static struct latched_seq clear_seq = {
	.latch		= SEQCNT_LATCH_ZERO(clear_seq.latch),
	.val[0]		= 0,
	.val[1]		= 0,
};

#define LOG_LEVEL(v)		((v) & 0x07)
#define LOG_FACILITY(v)		((v) >> 3 & 0xff)

/* record buffer */
#define LOG_ALIGN __alignof__(unsigned long)
#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
#define LOG_BUF_LEN_MAX ((u32)1 << 31)
static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
static char *log_buf = __log_buf;
static u32 log_buf_len = __LOG_BUF_LEN;

/*
 * Define the average message size. This only affects the number of
 * descriptors that will be available. Underestimating is better than
 * overestimating (too many available descriptors is better than not enough).
 */
#define PRB_AVGBITS 5	/* 32 character average length */

#if CONFIG_LOG_BUF_SHIFT <= PRB_AVGBITS
#error CONFIG_LOG_BUF_SHIFT value too small.
#endif
_DEFINE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS,
		 PRB_AVGBITS, &__log_buf[0]);

static struct printk_ringbuffer printk_rb_dynamic;

struct printk_ringbuffer *prb = &printk_rb_static;

/*
 * We cannot access per-CPU data (e.g. per-CPU flush irq_work) before
 * per_cpu_areas are initialised. This variable is set to true when
 * it's safe to access per-CPU data.
 */
static bool __printk_percpu_data_ready __ro_after_init;

bool printk_percpu_data_ready(void)
{
	return __printk_percpu_data_ready;
}

/* Must be called under syslog_lock. */
static void latched_seq_write(struct latched_seq *ls, u64 val)
{
	write_seqcount_latch_begin(&ls->latch);
	ls->val[0] = val;
	write_seqcount_latch(&ls->latch);
	ls->val[1] = val;
	write_seqcount_latch_end(&ls->latch);
}

/* Can be called from any context. */
static u64 latched_seq_read_nolock(struct latched_seq *ls)
{
	unsigned int seq;
	unsigned int idx;
	u64 val;

	do {
		seq = read_seqcount_latch(&ls->latch);
		idx = seq & 0x1;
		val = ls->val[idx];
	} while (read_seqcount_latch_retry(&ls->latch, seq));

	return val;
}

/* Return log buffer address */
char *log_buf_addr_get(void)
{
	return log_buf;
}

/* Return log buffer size */
u32 log_buf_len_get(void)
{
	return log_buf_len;
}

/*
 * Define how much of the log buffer we could take at maximum. The value
 * must be greater than two. Note that only half of the buffer is available
 * when the index points to the middle.
 */
#define MAX_LOG_TAKE_PART 4
static const char trunc_msg[] = "<truncated>";

static void truncate_msg(u16 *text_len, u16 *trunc_msg_len)
{
	/*
	 * The message should not take the whole buffer. Otherwise, it might
	 * get removed too soon.
	 */
	u32 max_text_len = log_buf_len / MAX_LOG_TAKE_PART;

	if (*text_len > max_text_len)
		*text_len = max_text_len;

	/* enable the warning message (if there is room) */
	*trunc_msg_len = strlen(trunc_msg);
	if (*text_len >= *trunc_msg_len)
		*text_len -= *trunc_msg_len;
	else
		*trunc_msg_len = 0;
}

int dmesg_restrict = IS_ENABLED(CONFIG_SECURITY_DMESG_RESTRICT);

static int syslog_action_restricted(int type)
{
	if (dmesg_restrict)
		return 1;
	/*
	 * Unless restricted, we allow "read all" and "get buffer size"
	 * for everybody.
	 */
	return type != SYSLOG_ACTION_READ_ALL &&
	       type != SYSLOG_ACTION_SIZE_BUFFER;
}

static int check_syslog_permissions(int type, int source)
{
	/*
	 * If this is from /proc/kmsg and we've already opened it, then we've
	 * already done the capabilities checks at open time.
	 */
	if (source == SYSLOG_FROM_PROC && type != SYSLOG_ACTION_OPEN)
		goto ok;

	if (syslog_action_restricted(type)) {
		if (capable(CAP_SYSLOG))
			goto ok;
		return -EPERM;
	}
ok:
	return security_syslog(type);
}

static void append_char(char **pp, char *e, char c)
{
	if (*pp < e)
		*(*pp)++ = c;
}

static ssize_t info_print_ext_header(char *buf, size_t size,
				     struct printk_info *info)
{
	u64 ts_usec = info->ts_nsec;
	char caller[20];
#ifdef CONFIG_PRINTK_CALLER
	u32 id = info->caller_id;

	snprintf(caller, sizeof(caller), ",caller=%c%u",
		 id & 0x80000000 ? 'C' : 'T', id & ~0x80000000);
#else
	caller[0] = '\0';
#endif

	do_div(ts_usec, 1000);

	return scnprintf(buf, size, "%u,%llu,%llu,%c%s;",
			 (info->facility << 3) | info->level, info->seq,
			 ts_usec, info->flags & LOG_CONT ? 'c' : '-', caller);
}

static ssize_t msg_add_ext_text(char *buf, size_t size,
				const char *text, size_t text_len,
				unsigned char endc)
{
	char *p = buf, *e = buf + size;
	size_t i;

	/* escape non-printable characters */
	for (i = 0; i < text_len; i++) {
		unsigned char c = text[i];

		if (c < ' ' || c >= 127 || c == '\\')
			p += scnprintf(p, e - p, "\\x%02x", c);
		else
			append_char(&p, e, c);
	}
	append_char(&p, e, endc);

	return p - buf;
}

static ssize_t msg_add_dict_text(char *buf, size_t size,
				 const char *key, const char *val)
{
	size_t val_len = strlen(val);
	ssize_t len;

	if (!val_len)
		return 0;

	len = msg_add_ext_text(buf, size, "", 0, ' ');	/* dict prefix */
	len += msg_add_ext_text(buf + len, size - len, key, strlen(key), '=');
	len += msg_add_ext_text(buf + len, size - len, val, val_len, '\n');

	return len;
}

static ssize_t msg_print_ext_body(char *buf, size_t size,
				  char *text, size_t text_len,
				  struct dev_printk_info *dev_info)
{
	ssize_t len;

	len = msg_add_ext_text(buf, size, text, text_len, '\n');

	if (!dev_info)
		goto out;

	len += msg_add_dict_text(buf + len, size - len, "SUBSYSTEM",
				 dev_info->subsystem);
	len += msg_add_dict_text(buf + len, size - len, "DEVICE",
				 dev_info->device);
out:
	return len;
}

/* /dev/kmsg - userspace message inject/listen interface */
struct devkmsg_user {
	atomic64_t seq;
	struct ratelimit_state rs;
	struct mutex lock;
	struct printk_buffers pbufs;
};

static __printf(3, 4) __cold
int devkmsg_emit(int facility, int level, const char *fmt, ...)
{
	va_list args;
	int r;

	va_start(args, fmt);
	r = vprintk_emit(facility, level, NULL, fmt, args);
	va_end(args);

	return r;
}

static ssize_t devkmsg_write(struct kiocb *iocb, struct iov_iter *from)
{
	char *buf, *line;
	int level = default_message_loglevel;
	int facility = 1;	/* LOG_USER */
	struct file *file = iocb->ki_filp;
	struct devkmsg_user *user = file->private_data;
	size_t len = iov_iter_count(from);
	ssize_t ret = len;

	if (len > PRINTKRB_RECORD_MAX)
		return -EINVAL;

	/* Ignore when user logging is disabled. */
	if (devkmsg_log & DEVKMSG_LOG_MASK_OFF)
		return len;

	/* Ratelimit when not explicitly enabled. */
	if (!(devkmsg_log & DEVKMSG_LOG_MASK_ON)) {
		if (!___ratelimit(&user->rs, current->comm))
			return ret;
	}

	buf = kmalloc(len+1, GFP_KERNEL);
	if (buf == NULL)
		return -ENOMEM;

	buf[len] = '\0';
	if (!copy_from_iter_full(buf, len, from)) {
		kfree(buf);
		return -EFAULT;
	}

	/*
	 * Extract and skip the syslog prefix <[0-9]*>. Coming from userspace
	 * the decimal value represents a 32-bit number; the lower 3 bits are
	 * the log level, the rest is the log facility.
	 *
	 * If no prefix or no userspace facility is specified, we
	 * enforce LOG_USER, to be able to reliably distinguish
	 * kernel-generated messages from userspace-injected ones.
	 */
	line = buf;
	if (line[0] == '<') {
		char *endp = NULL;
		unsigned int u;

		u = simple_strtoul(line + 1, &endp, 10);
		if (endp && endp[0] == '>') {
			level = LOG_LEVEL(u);
			if (LOG_FACILITY(u) != 0)
				facility = LOG_FACILITY(u);
			endp++;
			line = endp;
		}
	}

	devkmsg_emit(facility, level, "%s", line);
	kfree(buf);
	return ret;
}

static ssize_t devkmsg_read(struct file *file, char __user *buf,
			    size_t count, loff_t *ppos)
{
	struct devkmsg_user *user = file->private_data;
	char *outbuf = &user->pbufs.outbuf[0];
	struct printk_message pmsg = {
		.pbufs = &user->pbufs,
	};
	ssize_t ret;

	ret = mutex_lock_interruptible(&user->lock);
	if (ret)
		return ret;

	if (!printk_get_next_message(&pmsg, atomic64_read(&user->seq), true, false)) {
		if (file->f_flags & O_NONBLOCK) {
			ret = -EAGAIN;
			goto out;
		}

		/*
		 * Guarantee this task is visible on the waitqueue before
		 * checking the wake condition.
		 *
		 * The full memory barrier within set_current_state() of
		 * prepare_to_wait_event() pairs with the full memory barrier
		 * within wq_has_sleeper().
		 *
		 * This pairs with __wake_up_klogd:A.
		 */
		ret = wait_event_interruptible(log_wait,
				printk_get_next_message(&pmsg, atomic64_read(&user->seq), true,
							false)); /* LMM(devkmsg_read:A) */
		if (ret)
			goto out;
	}

	if (pmsg.dropped) {
		/* our last seen message is gone, return error and reset */
		atomic64_set(&user->seq, pmsg.seq);
		ret = -EPIPE;
		goto out;
	}

	atomic64_set(&user->seq, pmsg.seq + 1);

	if (pmsg.outbuf_len > count) {
		ret = -EINVAL;
		goto out;
	}

	if (copy_to_user(buf, outbuf, pmsg.outbuf_len)) {
		ret = -EFAULT;
		goto out;
	}
	ret = pmsg.outbuf_len;
out:
	mutex_unlock(&user->lock);
	return ret;
}

/*
 * Be careful when modifying this function!!!
 *
 * Only a few operations are supported because the device works only with
 * entire variable-length messages (records). Non-standard values are
 * returned in the other cases and have been this way for quite some time.
 * User space applications might depend on this behavior.
 */
static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
{
	struct devkmsg_user *user = file->private_data;
	loff_t ret = 0;

	if (offset)
		return -ESPIPE;

	switch (whence) {
	case SEEK_SET:
		/* the first record */
		atomic64_set(&user->seq, prb_first_valid_seq(prb));
		break;
	case SEEK_DATA:
		/*
		 * The first record after the last SYSLOG_ACTION_CLEAR,
		 * like issued by 'dmesg -c'. Reading /dev/kmsg itself
		 * changes no global state, and does not clear anything.
		 */
		atomic64_set(&user->seq, latched_seq_read_nolock(&clear_seq));
		break;
	case SEEK_END:
		/* after the last record */
		atomic64_set(&user->seq, prb_next_seq(prb));
		break;
	default:
		ret = -EINVAL;
	}
	return ret;
}

static __poll_t devkmsg_poll(struct file *file, poll_table *wait)
{
	struct devkmsg_user *user = file->private_data;
	struct printk_info info;
	__poll_t ret = 0;

	poll_wait(file, &log_wait, wait);

	if (prb_read_valid_info(prb, atomic64_read(&user->seq), &info, NULL)) {
		/* return error when data has vanished underneath us */
		if (info.seq != atomic64_read(&user->seq))
			ret = EPOLLIN|EPOLLRDNORM|EPOLLERR|EPOLLPRI;
		else
			ret = EPOLLIN|EPOLLRDNORM;
	}

	return ret;
}

static int devkmsg_open(struct inode *inode, struct file *file)
{
	struct devkmsg_user *user;
	int err;

	if (devkmsg_log & DEVKMSG_LOG_MASK_OFF)
		return -EPERM;

	/* write-only does not need any file context */
	if ((file->f_flags & O_ACCMODE) != O_WRONLY) {
		err = check_syslog_permissions(SYSLOG_ACTION_READ_ALL,
					       SYSLOG_FROM_READER);
		if (err)
			return err;
	}

	user = kvmalloc_obj(struct devkmsg_user);
	if (!user)
		return -ENOMEM;

	ratelimit_default_init(&user->rs);
	ratelimit_set_flags(&user->rs, RATELIMIT_MSG_ON_RELEASE);

	mutex_init(&user->lock);

	atomic64_set(&user->seq, prb_first_valid_seq(prb));

	file->private_data = user;
	return 0;
}

static int devkmsg_release(struct inode *inode, struct file *file)
{
	struct devkmsg_user *user = file->private_data;

	ratelimit_state_exit(&user->rs);

	mutex_destroy(&user->lock);
	kvfree(user);
	return 0;
}

const struct file_operations kmsg_fops = {
	.open = devkmsg_open,
	.read = devkmsg_read,
	.write_iter = devkmsg_write,
	.llseek = devkmsg_llseek,
	.poll = devkmsg_poll,
	.release = devkmsg_release,
};

#ifdef CONFIG_VMCORE_INFO
/*
 * This appends the listed symbols to /proc/vmcore
 *
 * /proc/vmcore is used by various utilities, like crash and makedumpfile to
 * obtain access to symbols that are otherwise very difficult to locate.  These
 * symbols are specifically used so that utilities can access and extract the
 * dmesg log from a vmcore file after a crash.
 */
void log_buf_vmcoreinfo_setup(void)
{
	struct dev_printk_info *dev_info = NULL;

	VMCOREINFO_SYMBOL(prb);
	VMCOREINFO_SYMBOL(printk_rb_static);
	VMCOREINFO_SYMBOL(clear_seq);

	/*
	 * Export struct size and field offsets. User space tools can
	 * parse it and detect any changes to the structures down the line.
	 */

	VMCOREINFO_STRUCT_SIZE(printk_ringbuffer);
	VMCOREINFO_OFFSET(printk_ringbuffer, desc_ring);
	VMCOREINFO_OFFSET(printk_ringbuffer, text_data_ring);
	VMCOREINFO_OFFSET(printk_ringbuffer, fail);

	VMCOREINFO_STRUCT_SIZE(prb_desc_ring);
	VMCOREINFO_OFFSET(prb_desc_ring, count_bits);
	VMCOREINFO_OFFSET(prb_desc_ring, descs);
	VMCOREINFO_OFFSET(prb_desc_ring, infos);
	VMCOREINFO_OFFSET(prb_desc_ring, head_id);
	VMCOREINFO_OFFSET(prb_desc_ring, tail_id);

	VMCOREINFO_STRUCT_SIZE(prb_desc);
	VMCOREINFO_OFFSET(prb_desc, state_var);
	VMCOREINFO_OFFSET(prb_desc, text_blk_lpos);

	VMCOREINFO_STRUCT_SIZE(prb_data_blk_lpos);
	VMCOREINFO_OFFSET(prb_data_blk_lpos, begin);
	VMCOREINFO_OFFSET(prb_data_blk_lpos, next);

	VMCOREINFO_STRUCT_SIZE(printk_info);
	VMCOREINFO_OFFSET(printk_info, seq);
	VMCOREINFO_OFFSET(printk_info, ts_nsec);
	VMCOREINFO_OFFSET(printk_info, text_len);
	VMCOREINFO_OFFSET(printk_info, caller_id);
	VMCOREINFO_OFFSET(printk_info, dev_info);

	VMCOREINFO_STRUCT_SIZE(dev_printk_info);
	VMCOREINFO_OFFSET(dev_printk_info, subsystem);
	VMCOREINFO_LENGTH(printk_info_subsystem, sizeof(dev_info->subsystem));
	VMCOREINFO_OFFSET(dev_printk_info, device);
	VMCOREINFO_LENGTH(printk_info_device, sizeof(dev_info->device));

	VMCOREINFO_STRUCT_SIZE(prb_data_ring);
	VMCOREINFO_OFFSET(prb_data_ring, size_bits);
	VMCOREINFO_OFFSET(prb_data_ring, data);
	VMCOREINFO_OFFSET(prb_data_ring, head_lpos);
	VMCOREINFO_OFFSET(prb_data_ring, tail_lpos);

	VMCOREINFO_SIZE(atomic_long_t);
	VMCOREINFO_TYPE_OFFSET(atomic_long_t, counter);

	VMCOREINFO_STRUCT_SIZE(latched_seq);
	VMCOREINFO_OFFSET(latched_seq, val);
}
#endif

/* requested log_buf_len from kernel cmdline */
static unsigned long __initdata new_log_buf_len;

/* we scale the ring buffer by powers of 2 */
static void __init log_buf_len_update(u64 size)
{
	if (size > (u64)LOG_BUF_LEN_MAX) {
		size = (u64)LOG_BUF_LEN_MAX;
		pr_err("log_buf over 2G is not supported.\n");
	}

	if (size)
		size = roundup_pow_of_two(size);
	if (size > log_buf_len)
		new_log_buf_len = (unsigned long)size;
}

/* save requested log_buf_len since it's too early to process it */
static int __init log_buf_len_setup(char *str)
{
	u64 size;

	if (!str)
		return -EINVAL;

	size = memparse(str, &str);

	log_buf_len_update(size);

	return 0;
}
early_param("log_buf_len", log_buf_len_setup);

#ifdef CONFIG_SMP
#define __LOG_CPU_MAX_BUF_LEN (1 << CONFIG_LOG_CPU_MAX_BUF_SHIFT)

static void __init log_buf_add_cpu(void)
{
	unsigned int cpu_extra;

	/*
	 * Archs should set up cpu_possible_bits properly with
	 * set_cpu_possible() after setup_arch(), but just in
	 * case let's ensure this is valid.
	 */
	if (num_possible_cpus() == 1)
		return;

	cpu_extra = (num_possible_cpus() - 1) * __LOG_CPU_MAX_BUF_LEN;

	/* by default this will only continue through for systems with more than 64 CPUs */
	if (cpu_extra <= __LOG_BUF_LEN / 2)
		return;

	pr_info("log_buf_len individual max cpu contribution: %d bytes\n",
		__LOG_CPU_MAX_BUF_LEN);
	pr_info("log_buf_len total cpu_extra contributions: %d bytes\n",
		cpu_extra);
	pr_info("log_buf_len min size: %d bytes\n", __LOG_BUF_LEN);

	log_buf_len_update(cpu_extra + __LOG_BUF_LEN);
}
#else /* !CONFIG_SMP */
static inline void log_buf_add_cpu(void) {}
#endif /* CONFIG_SMP */

static void __init set_percpu_data_ready(void)
{
	__printk_percpu_data_ready = true;
}

static unsigned int __init add_to_rb(struct printk_ringbuffer *rb,
				     struct printk_record *r)
{
	struct prb_reserved_entry e;
	struct printk_record dest_r;

	prb_rec_init_wr(&dest_r, r->info->text_len);

	if (!prb_reserve(&e, rb, &dest_r))
		return 0;

	memcpy(&dest_r.text_buf[0], &r->text_buf[0], r->info->text_len);
	dest_r.info->text_len = r->info->text_len;
	dest_r.info->facility = r->info->facility;
	dest_r.info->level = r->info->level;
	dest_r.info->flags = r->info->flags;
	dest_r.info->ts_nsec = r->info->ts_nsec;
	dest_r.info->caller_id = r->info->caller_id;
	memcpy(&dest_r.info->dev_info, &r->info->dev_info, sizeof(dest_r.info->dev_info));

	prb_final_commit(&e);

	return prb_record_text_space(&e);
}

static char setup_text_buf[PRINTKRB_RECORD_MAX] __initdata;

static void print_log_buf_usage_stats(void)
{
	unsigned int descs_count = log_buf_len >> PRB_AVGBITS;
	size_t meta_data_size;

	meta_data_size = descs_count * (sizeof(struct prb_desc) + sizeof(struct printk_info));

	pr_info("log buffer data + meta data: %u + %zu = %zu bytes\n",
		log_buf_len, meta_data_size, log_buf_len + meta_data_size);
}

void __init setup_log_buf(int early)
{
	struct printk_info *new_infos;
	unsigned int new_descs_count;
	struct prb_desc *new_descs;
	struct printk_info info;
	struct printk_record r;
	unsigned int text_size;
	size_t new_descs_size;
	size_t new_infos_size;
	unsigned long flags;
	char *new_log_buf;
	unsigned int free;
	u64 seq;

	/*
	 * Some archs call setup_log_buf() multiple times - the first very
	 * early, e.g. from setup_arch(), and the second when the per-CPU
	 * areas are initialised.
	 */
	if (!early)
		set_percpu_data_ready();

	if (log_buf != __log_buf)
		return;

	if (!early && !new_log_buf_len)
		log_buf_add_cpu();

	if (!new_log_buf_len) {
		/* Show the memory stats only once. */
		if (!early)
			goto out;

		return;
	}

	new_descs_count = new_log_buf_len >> PRB_AVGBITS;
	if (new_descs_count == 0) {
		pr_err("new_log_buf_len: %lu too small\n", new_log_buf_len);
		goto out;
	}

	new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
	if (unlikely(!new_log_buf)) {
		pr_err("log_buf_len: %lu text bytes not available\n",
		       new_log_buf_len);
		goto out;
	}

	new_descs_size = new_descs_count * sizeof(struct prb_desc);
	new_descs = memblock_alloc(new_descs_size, LOG_ALIGN);
	if (unlikely(!new_descs)) {
		pr_err("log_buf_len: %zu desc bytes not available\n",
		       new_descs_size);
		goto err_free_log_buf;
	}

	new_infos_size = new_descs_count * sizeof(struct printk_info);
	new_infos = memblock_alloc(new_infos_size, LOG_ALIGN);
	if (unlikely(!new_infos)) {
		pr_err("log_buf_len: %zu info bytes not available\n",
		       new_infos_size);
		goto err_free_descs;
	}

	prb_rec_init_rd(&r, &info, &setup_text_buf[0], sizeof(setup_text_buf));

	prb_init(&printk_rb_dynamic,
		 new_log_buf, ilog2(new_log_buf_len),
		 new_descs, ilog2(new_descs_count),
		 new_infos);

	local_irq_save(flags);

	log_buf_len = new_log_buf_len;
	log_buf = new_log_buf;
	new_log_buf_len = 0;

	free = __LOG_BUF_LEN;
	prb_for_each_record(0, &printk_rb_static, seq, &r) {
		text_size = add_to_rb(&printk_rb_dynamic, &r);
		if (text_size > free)
			free = 0;
		else
			free -= text_size;
	}

	prb = &printk_rb_dynamic;

	local_irq_restore(flags);

	/*
	 * Copy any remaining messages that might have appeared from
	 * NMI context after copying but before switching to the
	 * dynamic buffer.
	 */
	prb_for_each_record(seq, &printk_rb_static, seq, &r) {
		text_size = add_to_rb(&printk_rb_dynamic, &r);
		if (text_size > free)
			free = 0;
		else
			free -= text_size;
	}

	if (seq != prb_next_seq(&printk_rb_static)) {
		pr_err("dropped %llu messages\n",
		       prb_next_seq(&printk_rb_static) - seq);
	}

	print_log_buf_usage_stats();
	pr_info("early log buf free: %u(%u%%)\n",
		free, (free * 100) / __LOG_BUF_LEN);
	return;

err_free_descs:
	memblock_free(new_descs, new_descs_size);
err_free_log_buf:
	memblock_free(new_log_buf, new_log_buf_len);
out:
	print_log_buf_usage_stats();
}

static bool __read_mostly ignore_loglevel;

static int __init ignore_loglevel_setup(char *str)
{
	ignore_loglevel = true;
	pr_info("debug: ignoring loglevel setting.\n");

	return 0;
}

early_param("ignore_loglevel", ignore_loglevel_setup);
module_param(ignore_loglevel, bool, S_IRUGO | S_IWUSR);
MODULE_PARM_DESC(ignore_loglevel,
		 "ignore loglevel setting (prints all kernel messages to the console)");

static bool suppress_message_printing(int level)
{
	return (level >= console_loglevel && !ignore_loglevel);
}

#ifdef CONFIG_BOOT_PRINTK_DELAY

static int boot_delay; /* msecs delay after each printk during bootup */
static unsigned long long loops_per_msec;	/* based on boot_delay */

static int __init boot_delay_setup(char *str)
{
	unsigned long lpj;

	lpj = preset_lpj ? preset_lpj : 1000000;	/* some guess */
	loops_per_msec = (unsigned long long)lpj / 1000 * HZ;

	get_option(&str, &boot_delay);
	if (boot_delay > 10 * 1000)
		boot_delay = 0;

	pr_debug("boot_delay: %u, preset_lpj: %ld, lpj: %lu, "
		"HZ: %d, loops_per_msec: %llu\n",
		boot_delay, preset_lpj, lpj, HZ, loops_per_msec);
	return 0;
}
early_param("boot_delay", boot_delay_setup);

static void boot_delay_msec(int level)
{
	unsigned long long k;
	unsigned long timeout;
	bool suppress = !is_printk_force_console() &&
			suppress_message_printing(level);

	if ((boot_delay == 0 || system_state >= SYSTEM_RUNNING) || suppress)
		return;

	k = (unsigned long long)loops_per_msec * boot_delay;

	timeout = jiffies + msecs_to_jiffies(boot_delay);
	while (k) {
		k--;
		cpu_relax();
		/*
		 * use (volatile) jiffies to prevent
		 * compiler reduction; loop termination via jiffies
		 * is secondary and may or may not happen.
		 */
		if (time_after(jiffies, timeout))
			break;
		touch_nmi_watchdog();
	}
}
#else
static inline void boot_delay_msec(int level)
{
}
#endif

static bool printk_time = IS_ENABLED(CONFIG_PRINTK_TIME);
module_param_named(time, printk_time, bool, S_IRUGO | S_IWUSR);

static size_t print_syslog(unsigned int level, char *buf)
{
	return sprintf(buf, "<%u>", level);
}

static size_t print_time(u64 ts, char *buf)
{
	unsigned long rem_nsec = do_div(ts, 1000000000);

	return sprintf(buf, "[%5lu.%06lu]",
		       (unsigned long)ts, rem_nsec / 1000);
}

#ifdef CONFIG_PRINTK_CALLER
static size_t print_caller(u32 id, char *buf)
{
	char caller[12];

	snprintf(caller, sizeof(caller), "%c%u",
		 id & 0x80000000 ? 'C' : 'T', id & ~0x80000000);
	return sprintf(buf, "[%6s]", caller);
}
#else
#define print_caller(id, buf) 0
#endif

static size_t info_print_prefix(const struct printk_info *info, bool syslog,
				bool time, char *buf)
{
	size_t len = 0;

	if (syslog)
		len = print_syslog((info->facility << 3) | info->level, buf);

	if (time)
		len += print_time(info->ts_nsec, buf + len);

	len += print_caller(info->caller_id, buf + len);

	if (IS_ENABLED(CONFIG_PRINTK_CALLER) || time) {
		buf[len++] = ' ';
		buf[len] = '\0';
	}

	return len;
}

/*
 * Prepare the record for printing. The text is shifted within the given
 * buffer to avoid a need for another one. The following operations are
 * done:
 *
 *   - Add prefix for each line.
 *   - Drop truncated lines that no longer fit into the buffer.
 *   - Add the trailing newline that has been removed in vprintk_store().
 *   - Add a string terminator.
 *
 * Since the produced string is always terminated, the maximum possible
 * return value is @r->text_buf_size - 1.
 *
 * Return: The length of the updated/prepared text, including the added
 * prefixes and the newline. The terminator is not counted. The dropped
 * line(s) are not counted.
 */
static size_t record_print_text(struct printk_record *r, bool syslog,
				bool time)
{
	size_t text_len = r->info->text_len;
	size_t buf_size = r->text_buf_size;
	char *text = r->text_buf;
	char prefix[PRINTK_PREFIX_MAX];
	bool truncated = false;
	size_t prefix_len;
	size_t line_len;
	size_t len = 0;
	char *next;

	/*
	 * If the message was truncated because the buffer was not large
	 * enough, treat the available text as if it were the full text.
	 */
	if (text_len > buf_size)
		text_len = buf_size;

	prefix_len = info_print_prefix(r->info, syslog, time, prefix);

	/*
	 * @text_len: bytes of unprocessed text
	 * @line_len: bytes of current line _without_ newline
	 * @text:     pointer to beginning of current line
	 * @len:      number of bytes prepared in r->text_buf
	 */
	for (;;) {
		next = memchr(text, '\n', text_len);
		if (next) {
			line_len = next - text;
		} else {
			/* Drop truncated line(s). */
			if (truncated)
				break;
			line_len = text_len;
		}

		/*
		 * Truncate the text if there is not enough space to add the
		 * prefix and a trailing newline and a terminator.
		 */
		if (len + prefix_len + text_len + 1 + 1 > buf_size) {
			/* Drop even the current line if no space. */
			if (len + prefix_len + line_len + 1 + 1 > buf_size)
				break;

			text_len = buf_size - len - prefix_len - 1 - 1;
			truncated = true;
		}

		memmove(text + prefix_len, text, text_len);
		memcpy(text, prefix, prefix_len);

		/*
		 * Increment the prepared length to include the text and
		 * prefix that were just moved+copied. Also increment for the
		 * newline at the end of this line. If this is the last line,
		 * there is no newline, but it will be added immediately below.
		 */
		len += prefix_len + line_len + 1;
		if (text_len == line_len) {
			/*
			 * This is the last line. Add the trailing newline
			 * removed in vprintk_store().
			 */
			text[prefix_len + line_len] = '\n';
			break;
		}

		/*
		 * Advance beyond the added prefix and the related line with
		 * its newline.
		 */
		text += prefix_len + line_len + 1;

		/*
		 * The remaining text has only decreased by the line with its
		 * newline.
		 *
		 * Note that @text_len can become zero. It happens when @text
		 * ended with a newline (either due to truncation or the
		 * original string ending with "\n\n"). The loop is correctly
		 * repeated and (if not truncated) an empty line with a prefix
		 * will be prepared.
		 */
		text_len -= line_len + 1;
	}

	/*
	 * If a buffer was provided, it will be terminated. Space for the
	 * string terminator is guaranteed to be available. The terminator is
	 * not counted in the return value.
	 */
	if (buf_size > 0)
		r->text_buf[len] = 0;

	return len;
}

static size_t get_record_print_text_size(struct printk_info *info,
					 unsigned int line_count,
					 bool syslog, bool time)
{
	char prefix[PRINTK_PREFIX_MAX];
	size_t prefix_len;

	prefix_len = info_print_prefix(info, syslog, time, prefix);

	/*
	 * Each line will be preceded with a prefix. The intermediate
	 * newlines are already within the text, but a final trailing
	 * newline will be added.
	 */
	return ((prefix_len * line_count) + info->text_len + 1);
}

/*
 * Beginning with @start_seq, find the first record where it and all following
 * records up to (but not including) @max_seq fit into @size.
 *
 * @max_seq is simply an upper bound and does not need to exist. If the caller
 * does not require an upper bound, -1 can be used for @max_seq.
 */
static u64 find_first_fitting_seq(u64 start_seq, u64 max_seq, size_t size,
				  bool syslog, bool time)
{
	struct printk_info info;
	unsigned int line_count;
	size_t len = 0;
	u64 seq;

	/* Determine the size of the records up to @max_seq. */
	prb_for_each_info(start_seq, prb, seq, &info, &line_count) {
		if (info.seq >= max_seq)
			break;
		len += get_record_print_text_size(&info, line_count, syslog, time);
	}

	/*
	 * Adjust the upper bound for the next loop to avoid subtracting
	 * lengths that were never added.
	 */
	if (seq < max_seq)
		max_seq = seq;

	/*
	 * Move first record forward until length fits into the buffer. Ignore
	 * newest messages that were not counted in the above cycle. Messages
	 * might appear and get lost in the meantime. This is a best effort
	 * that prevents an infinite loop that could occur with a retry.
	 */
	prb_for_each_info(start_seq, prb, seq, &info, &line_count) {
		if (len <= size || info.seq >= max_seq)
			break;
		len -= get_record_print_text_size(&info, line_count, syslog, time);
	}

	return seq;
}

/* The caller is responsible for making sure @size is greater than 0. */
static int syslog_print(char __user *buf, int size)
{
	struct printk_info info;
	struct printk_record r;
	char *text;
	int len = 0;
	u64 seq;

	text = kmalloc(PRINTK_MESSAGE_MAX, GFP_KERNEL);
	if (!text)
		return -ENOMEM;

	prb_rec_init_rd(&r, &info, text, PRINTK_MESSAGE_MAX);

	mutex_lock(&syslog_lock);

	/*
	 * Wait for the @syslog_seq record to be available. @syslog_seq may
	 * change while waiting.
	 */
	do {
		seq = syslog_seq;

		mutex_unlock(&syslog_lock);
		/*
		 * Guarantee this task is visible on the waitqueue before
		 * checking the wake condition.
		 *
		 * The full memory barrier within set_current_state() of
		 * prepare_to_wait_event() pairs with the full memory barrier
		 * within wq_has_sleeper().
		 *
		 * This pairs with __wake_up_klogd:A.
		 */
		len = wait_event_interruptible(log_wait,
				prb_read_valid(prb, seq, NULL)); /* LMM(syslog_print:A) */
		mutex_lock(&syslog_lock);

		if (len)
			goto out;
	} while (syslog_seq != seq);

	/*
	 * Copy records that fit into the buffer. The above cycle makes sure
	 * that the first record is always available.
	 */
	do {
		size_t n;
		size_t skip;
		int err;

		if (!prb_read_valid(prb, syslog_seq, &r))
			break;

		if (r.info->seq != syslog_seq) {
			/* message is gone, move to next valid one */
			syslog_seq = r.info->seq;
			syslog_partial = 0;
		}

		/*
		 * To keep reading/counting partial line consistent,
		 * use printk_time value as of the beginning of a line.
		 */
		if (!syslog_partial)
			syslog_time = printk_time;

		skip = syslog_partial;
		n = record_print_text(&r, true, syslog_time);
		if (n - syslog_partial <= size) {
			/* message fits into buffer, move forward */
			syslog_seq = r.info->seq + 1;
			n -= syslog_partial;
			syslog_partial = 0;
		} else if (!len) {
			/* partial read(), remember position */
			n = size;
			syslog_partial += n;
		} else {
			n = 0;
		}

		if (!n)
			break;

		mutex_unlock(&syslog_lock);
		err = copy_to_user(buf, text + skip, n);
		mutex_lock(&syslog_lock);

		if (err) {
			if (!len)
				len = -EFAULT;
			break;
		}

		len += n;
		size -= n;
		buf += n;
	} while (size);
out:
	mutex_unlock(&syslog_lock);
	kfree(text);
	return len;
}

static int syslog_print_all(char __user *buf, int size, bool clear)
{
	struct printk_info info;
	struct printk_record r;
	char *text;
	int len = 0;
	u64 seq;
	bool time;

	text = kmalloc(PRINTK_MESSAGE_MAX, GFP_KERNEL);
	if (!text)
		return -ENOMEM;

	time = printk_time;
	/*
	 * Find first record that fits, including all following records,
	 * into the user-provided buffer for this dump.
	 */
	seq = find_first_fitting_seq(latched_seq_read_nolock(&clear_seq), -1,
				     size, true, time);

	prb_rec_init_rd(&r, &info, text, PRINTK_MESSAGE_MAX);

	prb_for_each_record(seq, prb, seq, &r) {
		int textlen;

		textlen = record_print_text(&r, true, time);

		if (len + textlen > size) {
			seq--;
			break;
		}

		if (copy_to_user(buf + len, text, textlen))
			len = -EFAULT;
		else
			len += textlen;

		if (len < 0)
			break;
	}

	if (clear) {
		mutex_lock(&syslog_lock);
		latched_seq_write(&clear_seq, seq);
		mutex_unlock(&syslog_lock);
	}

	kfree(text);
	return len;
}

static void syslog_clear(void)
{
	mutex_lock(&syslog_lock);
	latched_seq_write(&clear_seq, prb_next_seq(prb));
	mutex_unlock(&syslog_lock);
}

int do_syslog(int type, char __user *buf, int len, int source)
{
	struct printk_info info;
	bool clear = false;
	static int saved_console_loglevel = LOGLEVEL_DEFAULT;
	int error;

	error = check_syslog_permissions(type, source);
	if (error)
		return error;

	switch (type) {
	case SYSLOG_ACTION_CLOSE:	/* Close log */
		break;
	case SYSLOG_ACTION_OPEN:	/* Open log */
		break;
	case SYSLOG_ACTION_READ:	/* Read from log */
		if (!buf || len < 0)
			return -EINVAL;
		if (!len)
			return 0;
		if (!access_ok(buf, len))
			return -EFAULT;
		error = syslog_print(buf, len);
		break;
	/* Read/clear last kernel messages */
	case SYSLOG_ACTION_READ_CLEAR:
		clear = true;
		fallthrough;
	/* Read last kernel messages */
	case SYSLOG_ACTION_READ_ALL:
		if (!buf || len < 0)
			return -EINVAL;
		if (!len)
			return 0;
		if (!access_ok(buf, len))
			return -EFAULT;
		error = syslog_print_all(buf, len, clear);
		break;
	/* Clear ring buffer */
	case SYSLOG_ACTION_CLEAR:
		syslog_clear();
		break;
	/* Disable logging to console */
	case SYSLOG_ACTION_CONSOLE_OFF:
		if (saved_console_loglevel == LOGLEVEL_DEFAULT)
			saved_console_loglevel = console_loglevel;
		console_loglevel = minimum_console_loglevel;
		break;
	/* Enable logging to console */
	case SYSLOG_ACTION_CONSOLE_ON:
		if (saved_console_loglevel != LOGLEVEL_DEFAULT) {
			console_loglevel = saved_console_loglevel;
			saved_console_loglevel = LOGLEVEL_DEFAULT;
		}
		break;
	/* Set level of messages printed to console */
	case SYSLOG_ACTION_CONSOLE_LEVEL:
		if (len < 1 || len > 8)
			return -EINVAL;
		if (len < minimum_console_loglevel)
			len = minimum_console_loglevel;
		console_loglevel = len;
		/* Implicitly re-enable logging to console */
		saved_console_loglevel = LOGLEVEL_DEFAULT;
		break;
	/* Number of chars in the log buffer */
	case SYSLOG_ACTION_SIZE_UNREAD:
		mutex_lock(&syslog_lock);
		if (!prb_read_valid_info(prb, syslog_seq, &info, NULL)) {
			/* No unread messages. */
			mutex_unlock(&syslog_lock);
			return 0;
		}
		if (info.seq != syslog_seq) {
			/* messages are gone, move to first one */
			syslog_seq = info.seq;
			syslog_partial = 0;
		}
		if (source == SYSLOG_FROM_PROC) {
			/*
			 * Short-cut for poll(/proc/kmsg) which simply checks
			 * for pending data, not the size; return the count of
			 * records, not the length.
			 */
			error = prb_next_seq(prb) - syslog_seq;
		} else {
			bool time = syslog_partial ? syslog_time : printk_time;
			unsigned int line_count;
			u64 seq;

			prb_for_each_info(syslog_seq, prb, seq, &info,
					  &line_count) {
				error += get_record_print_text_size(&info, line_count,
								    true, time);
				time = printk_time;
			}
			error -= syslog_partial;
		}
		mutex_unlock(&syslog_lock);
		break;
	/* Size of the log buffer */
	case SYSLOG_ACTION_SIZE_BUFFER:
		error = log_buf_len;
		break;
	default:
		error = -EINVAL;
		break;
	}

	return error;
}

SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len)
{
	return do_syslog(type, buf, len, SYSLOG_FROM_READER);
}

/*
 * Special console_lock variants that help to reduce the risk of soft-lockups.
 * They allow passing console_lock to another printk() call using a busy wait.
 */

#ifdef CONFIG_LOCKDEP
static struct lockdep_map console_owner_dep_map = {
	.name = "console_owner"
};
#endif

static DEFINE_RAW_SPINLOCK(console_owner_lock);
static struct task_struct *console_owner;
static bool console_waiter;

/**
 * console_lock_spinning_enable - mark beginning of code where another
 *	thread might safely busy wait
 *
 * This basically converts console_lock into a spinlock. This marks
 * the section where the console_lock owner cannot sleep, because
 * there may be a waiter spinning (like a spinlock). Also it must be
 * ready to hand over the lock at the end of the section.
 */
void console_lock_spinning_enable(void)
{
	/*
	 * Do not use spinning in panic(). The panic CPU wants to keep the lock.
	 * Non-panic CPUs abandon the flush anyway.
	 *
	 * Just keep the lockdep annotation. The panic CPU should avoid
	 * taking console_owner_lock because it might cause a deadlock.
	 * This looks like the easiest way to prevent false lockdep
	 * reports without handling the races in a lockless way.
	 */
	if (panic_in_progress())
		goto lockdep;

	raw_spin_lock(&console_owner_lock);
	console_owner = current;
	raw_spin_unlock(&console_owner_lock);

lockdep:
	/* The waiter may spin on us after setting console_owner */
	spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
}

/**
 * console_lock_spinning_disable_and_check - mark end of code where another
 *	thread was able to busy wait and check if there is a waiter
 * @cookie: cookie returned from console_srcu_read_lock()
 *
 * This is called at the end of the section where spinning is allowed.
 * It has two functions. First, it is a signal that it is no longer
 * safe to start busy waiting for the lock. Second, it checks if
 * there is a busy waiter and passes the lock rights to it.
 *
 * Important: Callers lose both the console_lock and the SRCU read lock if
 *	there was a busy waiter. They must not touch items synchronized by
 *	console_lock or SRCU read lock in this case.
 *
 * Return: 1 if the lock rights were passed, 0 otherwise.
 */
int console_lock_spinning_disable_and_check(int cookie)
{
	int waiter;

	/*
	 * Ignore spinning waiters during panic() because they might get stopped
	 * or blocked at any time.
	 *
	 * It is safe because nobody is allowed to start spinning during panic
	 * in the first place. If there has been a waiter then non-panic CPUs
	 * might stay spinning. They would get stopped anyway. The panic context
	 * will never start spinning and an interrupted spin on the panic CPU
	 * will never continue.
	 */
	if (panic_in_progress()) {
		/* Keep lockdep happy. */
		spin_release(&console_owner_dep_map, _THIS_IP_);
		return 0;
	}

	raw_spin_lock(&console_owner_lock);
	waiter = READ_ONCE(console_waiter);
	console_owner = NULL;
	raw_spin_unlock(&console_owner_lock);

	if (!waiter) {
		spin_release(&console_owner_dep_map, _THIS_IP_);
		return 0;
	}

	/* The waiter is now free to continue */
	WRITE_ONCE(console_waiter, false);

	spin_release(&console_owner_dep_map, _THIS_IP_);

	/*
	 * Preserve lockdep lock ordering. Release the SRCU read lock before
	 * releasing the console_lock.
	 */
	console_srcu_read_unlock(cookie);

	/*
	 * Hand off console_lock to waiter. The waiter will perform
	 * the up(). After this, the waiter is the console_lock owner.
	 */
	mutex_release(&console_lock_dep_map, _THIS_IP_);
	return 1;
}

/**
 * console_trylock_spinning - try to get console_lock by busy waiting
 *
 * This allows busy waiting for the console_lock when the current
 * owner is running in specially marked sections. It means that
 * the current owner is running and cannot reschedule until it
 * is ready to lose the lock.
 *
 * Return: 1 if we got the lock, 0 otherwise
 */
static int console_trylock_spinning(void)
{
	struct task_struct *owner = NULL;
	bool waiter;
	bool spin = false;
	unsigned long flags;

	if (console_trylock())
		return 1;

	/*
	 * It's unsafe to spin once a panic has begun. If we are the
	 * panic CPU, we may have already halted the owner of the
	 * console_sem. If we are not the panic CPU, then we should
	 * avoid taking console_sem, so the panic CPU has a better
	 * chance of cleanly acquiring it later.
	 */
	if (panic_in_progress())
		return 0;

	printk_safe_enter_irqsave(flags);

	raw_spin_lock(&console_owner_lock);
	owner = READ_ONCE(console_owner);
	waiter = READ_ONCE(console_waiter);
	if (!waiter && owner && owner != current) {
		WRITE_ONCE(console_waiter, true);
		spin = true;
	}
	raw_spin_unlock(&console_owner_lock);

	/*
	 * If there is an active printk() writing to the
	 * consoles, instead of having it write our data too,
	 * see if we can offload that load from the active
	 * printer, and do some printing ourselves.
	 * Go into a spin only if there isn't already a waiter
	 * spinning, and there is an active printer, and
	 * that active printer isn't us (recursive printk?).
	 */
	if (!spin) {
		printk_safe_exit_irqrestore(flags);
		return 0;
	}

	/* We spin waiting for the owner to release us */
	spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
	/* Owner will clear console_waiter on hand off */
	while (READ_ONCE(console_waiter))
		cpu_relax();
	spin_release(&console_owner_dep_map, _THIS_IP_);

	printk_safe_exit_irqrestore(flags);
	/*
	 * The owner passed the console lock to us.
	 * Since we did not spin on console lock, annotate
	 * this as a trylock. Otherwise lockdep will
	 * complain.
	 */
	mutex_acquire(&console_lock_dep_map, 0, 1, _THIS_IP_);

	/*
	 * Update @console_may_schedule for trylock because the previous
	 * owner may have been schedulable.
	 */
	console_may_schedule = 0;

	return 1;
}

/*
 * Recursion is tracked separately on each CPU. If NMIs are supported, an
 * additional NMI context per CPU is also separately tracked. Until per-CPU
 * data is available, a separate "early tracking" is performed.
 */
static DEFINE_PER_CPU(u8, printk_count);
static u8 printk_count_early;
#ifdef CONFIG_HAVE_NMI
static DEFINE_PER_CPU(u8, printk_count_nmi);
static u8 printk_count_nmi_early;
#endif

/*
 * Recursion is limited to keep the output sane. printk() should not require
 * more than 1 level of recursion (allowing, for example, printk() to trigger
 * a WARN), but a higher value is used in case some printk-internal errors
 * exist, such as the ringbuffer validation checks failing.
 */
#define PRINTK_MAX_RECURSION 3

/*
 * Return a pointer to the dedicated counter for the CPU+context of the
 * caller.
 */
static u8 *__printk_recursion_counter(void)
{
#ifdef CONFIG_HAVE_NMI
	if (in_nmi()) {
		if (printk_percpu_data_ready())
			return this_cpu_ptr(&printk_count_nmi);
		return &printk_count_nmi_early;
	}
#endif
	if (printk_percpu_data_ready())
		return this_cpu_ptr(&printk_count);
	return &printk_count_early;
}

/*
 * Enter recursion tracking. Interrupts are disabled to simplify tracking.
 * The caller must check the boolean return value to see if the recursion is
 * allowed. On failure, interrupts are not disabled.
 *
 * @recursion_ptr must be a variable of type (u8 *) and is the same variable
 * that is passed to printk_exit_irqrestore().
 */
#define printk_enter_irqsave(recursion_ptr, flags)	\
({							\
	bool success = true;				\
							\
	typecheck(u8 *, recursion_ptr);			\
	local_irq_save(flags);				\
	(recursion_ptr) = __printk_recursion_counter();	\
	if (*(recursion_ptr) > PRINTK_MAX_RECURSION) {	\
		local_irq_restore(flags);		\
		success = false;			\
	} else {					\
		(*(recursion_ptr))++;			\
	}						\
	success;					\
})

/* Exit recursion tracking, restoring interrupts. */
#define printk_exit_irqrestore(recursion_ptr, flags)	\
	do {						\
		typecheck(u8 *, recursion_ptr);		\
		(*(recursion_ptr))--;			\
		local_irq_restore(flags);		\
	} while (0)

int printk_delay_msec __read_mostly;

static inline void printk_delay(int level)
{
	boot_delay_msec(level);

	if (unlikely(printk_delay_msec)) {
		int m = printk_delay_msec;

		while (m--) {
			mdelay(1);
			touch_nmi_watchdog();
		}
	}
}

#define CALLER_ID_MASK 0x80000000

static inline u32 printk_caller_id(void)
{
	return in_task() ? task_pid_nr(current) :
		CALLER_ID_MASK + smp_processor_id();
}

#ifdef CONFIG_PRINTK_EXECUTION_CTX
/* Store the information complementary to caller_id. */
static u32 printk_caller_id2(void)
{
	return !in_task() ? task_pid_nr(current) :
		CALLER_ID_MASK + smp_processor_id();
}

static pid_t printk_info_get_pid(const struct printk_info *info)
{
	u32 caller_id = info->caller_id;
	u32 caller_id2 = info->caller_id2;

	return caller_id & CALLER_ID_MASK ? caller_id2 : caller_id;
}

static int printk_info_get_cpu(const struct printk_info *info)
{
	u32 caller_id = info->caller_id;
	u32 caller_id2 = info->caller_id2;

	return ((caller_id & CALLER_ID_MASK ?
		 caller_id : caller_id2) & ~CALLER_ID_MASK);
}
#endif

/**
 * printk_parse_prefix - Parse level and control flags.
 *
 * @text:     The terminated text message.
 * @level:    A pointer to the current level value, will be updated.
 * @flags:    A pointer to the current printk_info flags, will be updated.
 *
 * @level may be NULL if the caller is not interested in the parsed value.
 * Otherwise the variable pointed to by @level must be set to
 * LOGLEVEL_DEFAULT in order to be updated with the parsed value.
 *
 * @flags may be NULL if the caller is not interested in the parsed value.
 * Otherwise the variable pointed to by @flags will be OR'd with the parsed
 * value.
 *
 * Return: The length of the parsed level and control flags.
 */
u16 printk_parse_prefix(const char *text, int *level,
			enum printk_info_flags *flags)
{
	u16 prefix_len = 0;
	int kern_level;

	while (*text) {
		kern_level = printk_get_level(text);
		if (!kern_level)
			break;

		switch (kern_level) {
		case '0' ... '7':
			if (level && *level == LOGLEVEL_DEFAULT)
				*level = kern_level - '0';
			break;
		case 'c':	/* KERN_CONT */
			if (flags)
				*flags |= LOG_CONT;
		}

		prefix_len += 2;
		text += 2;
	}

	return prefix_len;
}

__printf(5, 0)
static u16 printk_sprint(char *text, u16 size, int facility,
			 enum printk_info_flags *flags, const char *fmt,
			 va_list args)
{
	u16 text_len;

	text_len = vscnprintf(text, size, fmt, args);

	/* Mark and strip a trailing newline. */
	if (text_len && text[text_len - 1] == '\n') {
		text_len--;
		*flags |= LOG_NEWLINE;
	}

	/* Strip log level and control flags. */
	if (facility == 0) {
		u16 prefix_len;

		prefix_len = printk_parse_prefix(text, NULL, NULL);
		if (prefix_len) {
			text_len -= prefix_len;
			memmove(text, text + prefix_len, text_len);
		}
	}

	trace_console(text, text_len);

	return text_len;
}

#ifdef CONFIG_PRINTK_EXECUTION_CTX
static void printk_store_execution_ctx(struct printk_info *info)
{
	info->caller_id2 = printk_caller_id2();
	get_task_comm(info->comm, current);
}

static void pmsg_load_execution_ctx(struct printk_message *pmsg,
				    const struct printk_info *info)
{
	pmsg->cpu = printk_info_get_cpu(info);
	pmsg->pid = printk_info_get_pid(info);
	memcpy(pmsg->comm, info->comm, sizeof(pmsg->comm));
	static_assert(sizeof(pmsg->comm) == sizeof(info->comm));
}
#else
static void printk_store_execution_ctx(struct printk_info *info) {}

static void pmsg_load_execution_ctx(struct printk_message *pmsg,
				    const struct printk_info *info) {}
#endif

__printf(4, 0)
int vprintk_store(int facility, int level,
		  const struct dev_printk_info *dev_info,
		  const char *fmt, va_list args)
{
	struct prb_reserved_entry e;
	enum printk_info_flags flags = 0;
	struct printk_record r;
	unsigned long irqflags;
	u16 trunc_msg_len = 0;
	char prefix_buf[8];
	u8 *recursion_ptr;
	u16 reserve_size;
	va_list args2;
	u32 caller_id;
	u16 text_len;
	int ret = 0;
	u64 ts_nsec;

	if (!printk_enter_irqsave(recursion_ptr, irqflags))
		return 0;

	/*
	 * Since the duration of printk() can vary depending on the message
	 * and state of the ringbuffer, grab the timestamp now so that it is
	 * close to the call of printk(). This provides a more deterministic
	 * timestamp with respect to the caller.
	 */
	ts_nsec = local_clock();

	caller_id = printk_caller_id();

	/*
	 * The sprintf needs to come first since the syslog prefix might be
	 * passed in as a parameter. An extra byte must be reserved so that
	 * later the vscnprintf() into the reserved buffer has room for the
	 * terminating '\0', which is not counted by vsnprintf().
	 */
	va_copy(args2, args);
	reserve_size = vsnprintf(&prefix_buf[0], sizeof(prefix_buf), fmt, args2) + 1;
	va_end(args2);

	if (reserve_size > PRINTKRB_RECORD_MAX)
		reserve_size = PRINTKRB_RECORD_MAX;

	/* Extract log level or control flags. */
	if (facility == 0)
		printk_parse_prefix(&prefix_buf[0], &level, &flags);

	if (level == LOGLEVEL_DEFAULT)
		level = default_message_loglevel;

	if (dev_info)
		flags |= LOG_NEWLINE;

	if (is_printk_force_console())
		flags |= LOG_FORCE_CON;

	if (flags & LOG_CONT) {
		prb_rec_init_wr(&r, reserve_size);
		if (prb_reserve_in_last(&e, prb, &r, caller_id, PRINTKRB_RECORD_MAX)) {
			text_len = printk_sprint(&r.text_buf[r.info->text_len], reserve_size,
						 facility, &flags, fmt, args);
			r.info->text_len += text_len;

			if (flags & LOG_FORCE_CON)
				r.info->flags |= LOG_FORCE_CON;

			if (flags & LOG_NEWLINE) {
				r.info->flags |= LOG_NEWLINE;
				prb_final_commit(&e);
			} else {
				prb_commit(&e);
			}

			ret = text_len;
			goto out;
		}
	}

	/*
	 * Explicitly initialize the record before every prb_reserve() call.
	 * prb_reserve_in_last() and prb_reserve() purposely invalidate the
	 * structure when they fail.
	 */
	prb_rec_init_wr(&r, reserve_size);
	if (!prb_reserve(&e, prb, &r)) {
		/* truncate the message if it is too long for empty buffer */
		truncate_msg(&reserve_size, &trunc_msg_len);

		prb_rec_init_wr(&r, reserve_size + trunc_msg_len);
		if (!prb_reserve(&e, prb, &r))
			goto out;
	}

	/* fill message */
	text_len = printk_sprint(&r.text_buf[0], reserve_size, facility, &flags, fmt, args);
	if (trunc_msg_len)
		memcpy(&r.text_buf[text_len], trunc_msg, trunc_msg_len);
	r.info->text_len = text_len + trunc_msg_len;
	r.info->facility = facility;
	r.info->level = level & 7;
	r.info->flags = flags & 0x1f;
	r.info->ts_nsec = ts_nsec;
	r.info->caller_id = caller_id;
	if (dev_info)
		memcpy(&r.info->dev_info, dev_info, sizeof(r.info->dev_info));
	printk_store_execution_ctx(r.info);

	/* A message without a trailing newline can be continued. */
	if (!(flags & LOG_NEWLINE))
		prb_commit(&e);
	else
		prb_final_commit(&e);

	ret = text_len + trunc_msg_len;
out:
	printk_exit_irqrestore(recursion_ptr, irqflags);
	return ret;
}

/*
 * This acts as a one-way switch to allow legacy consoles to print from
 * the printk() caller context on a panic CPU. It also attempts to flush
 * the legacy consoles in this context.
 */
void printk_legacy_allow_panic_sync(void)
{
	struct console_flush_type ft;

	legacy_allow_panic_sync = true;

	printk_get_console_flush_type(&ft);
	if (ft.legacy_direct) {
		if (console_trylock())
			console_unlock();
	}
}

bool __read_mostly debug_non_panic_cpus;

#ifdef CONFIG_PRINTK_CALLER
static int __init debug_non_panic_cpus_setup(char *str)
{
	debug_non_panic_cpus = true;
	pr_info("allow messages from non-panic CPUs in panic()\n");

	return 0;
}
early_param("debug_non_panic_cpus", debug_non_panic_cpus_setup);
module_param(debug_non_panic_cpus, bool, 0644);
MODULE_PARM_DESC(debug_non_panic_cpus,
		 "allow messages from non-panic CPUs in panic()");
#endif

asmlinkage int vprintk_emit(int facility, int level,
			    const struct dev_printk_info *dev_info,
			    const char *fmt, va_list args)
{
	struct console_flush_type ft;
	int printed_len;

	/* Suppress unimportant messages after panic happens */
	if (unlikely(suppress_printk))
		return 0;

	/*
	 * The messages on the panic CPU are the most important. If
	 * non-panic CPUs are generating any messages, they will be
	 * silently dropped.
	 */
	if (panic_on_other_cpu() &&
	    !debug_non_panic_cpus &&
	    !panic_triggering_all_cpu_backtrace)
		return 0;

	printk_get_console_flush_type(&ft);

	/* If called from the scheduler, we can not call up(). */
	if (level == LOGLEVEL_SCHED) {
		level = LOGLEVEL_DEFAULT;
		ft.legacy_offload |= ft.legacy_direct && !console_irqwork_blocked;
		ft.legacy_direct = false;
	}

	printk_delay(level);

	printed_len = vprintk_store(facility, level, dev_info, fmt, args);

	if (ft.nbcon_atomic)
		nbcon_atomic_flush_pending();

	if (ft.nbcon_offload)
		nbcon_kthreads_wake();

	if (ft.legacy_direct) {
		/*
		 * The caller may be holding system-critical or
		 * timing-sensitive locks. Disable preemption during
		 * printing of all remaining records to all consoles so that
		 * this context can return as soon as possible. Hopefully
		 * another printk() caller will take over the printing.
		 */
		preempt_disable();
		/*
		 * Try to acquire and then immediately release the console
		 * semaphore. The release will print out buffers. With the
		 * spinning variant, this context tries to take over the
		 * printing from another printing context.
		 */
		if (console_trylock_spinning())
			console_unlock();
		preempt_enable();
	}

	if (ft.legacy_offload)
		defer_console_output();
	else if (!console_irqwork_blocked)
		wake_up_klogd();

	return printed_len;
}
EXPORT_SYMBOL(vprintk_emit);

int vprintk_default(const char *fmt, va_list args)
{
	return vprintk_emit(0, LOGLEVEL_DEFAULT, NULL, fmt, args);
}
EXPORT_SYMBOL_GPL(vprintk_default);

asmlinkage __visible int _printk(const char *fmt, ...)
{
	va_list args;
	int r;

	va_start(args, fmt);
	r = vprintk(fmt, args);
	va_end(args);

	return r;
}
EXPORT_SYMBOL(_printk);

static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progress);

#else /* CONFIG_PRINTK */

#define printk_time		false

#define prb_read_valid(rb, seq, r)	false
#define prb_first_valid_seq(rb)		0
#define prb_next_seq(rb)		0

static u64 syslog_seq;

static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progress) { return true; }

#endif /* CONFIG_PRINTK */

#ifdef CONFIG_EARLY_PRINTK
struct console *early_console;

asmlinkage __visible void early_printk(const char *fmt, ...)
{
	va_list ap;
	char buf[512];
	int n;

	if (!early_console)
		return;

	va_start(ap, fmt);
	n = vscnprintf(buf, sizeof(buf), fmt, ap);
	va_end(ap);

	early_console->write(early_console, buf, n);
}
#endif

static void set_user_specified(struct console_cmdline *c, bool user_specified)
{
	if (!user_specified)
		return;

	/*
	 * @c console was defined by the user on the command line.
	 * Do not clear it when the console is added a second time via
	 * SPCR or the device tree.
	 */
	c->user_specified = true;
	/* At least one console defined by the user on the command line. */
	console_set_on_cmdline = 1;
}

static int __add_preferred_console(const char *name, const short idx,
				   const char *devname, char *options,
				   char *brl_options, bool user_specified)
{
	struct console_cmdline *c;
	int i;

	if (!name && !devname)
		return -EINVAL;

	/*
	 * We use a signed short index for struct console for device drivers to
	 * indicate a not yet assigned index or port. However, a negative index
	 * value is not valid when the console name and index are defined on
	 * the command line.
	 */
	if (name && idx < 0)
		return -EINVAL;

	/*
	 *	See if this tty is not yet registered, and
	 *	if we have a slot free.
	 */
	for (i = 0, c = console_cmdline;
	     i < MAX_CMDLINECONSOLES && (c->name[0] || c->devname[0]);
	     i++, c++) {
		if ((name && strcmp(c->name, name) == 0 && c->index == idx) ||
		    (devname && strcmp(c->devname, devname) == 0)) {
			if (!brl_options)
				preferred_console = i;
			set_user_specified(c, user_specified);
			return 0;
		}
	}
	if (i == MAX_CMDLINECONSOLES)
		return -E2BIG;
	if (!brl_options)
		preferred_console = i;
	if (name)
		strscpy(c->name, name);
	if (devname)
		strscpy(c->devname, devname);
	c->options = options;
	set_user_specified(c, user_specified);
	braille_set_options(c, brl_options);

	c->index = idx;
	return 0;
}

static int __init console_msg_format_setup(char *str)
{
	if (!strcmp(str, "syslog"))
		console_msg_format = MSG_FORMAT_SYSLOG;
	if (!strcmp(str, "default"))
		console_msg_format = MSG_FORMAT_DEFAULT;
	return 1;
}
__setup("console_msg_format=", console_msg_format_setup);

/*
 * Set up a console.  Called via do_early_param() in init/main.c
 * for each "console=" parameter in the boot command line.
 */
static int __init console_setup(char *str)
{
	static_assert(sizeof(console_cmdline[0].devname) >= sizeof(console_cmdline[0].name) + 4);
	char buf[sizeof(console_cmdline[0].devname)];
	char *brl_options = NULL;
	char *ttyname = NULL;
	char *devname = NULL;
	char *options;
	char *s;
	int idx;

	/*
	 * console="" or console=null have been suggested as a way to
	 * disable console output. Use ttynull that has been created
	 * for exactly this purpose.
	 */
	if (str[0] == 0 || strcmp(str, "null") == 0) {
		__add_preferred_console("ttynull", 0, NULL, NULL, NULL, true);
		return 1;
	}

	if (_braille_console_setup(&str, &brl_options))
		return 1;

	/* For a DEVNAME:0.0 style console the character device is unknown early */
	if (strchr(str, ':'))
		devname = buf;
	else
		ttyname = buf;

	/*
	 * Decode str into name, index, options.
	 */
	if (ttyname && isdigit(str[0]))
		scnprintf(buf, sizeof(buf), "ttyS%s", str);
	else
		strscpy(buf, str);

	options = strchr(str, ',');
	if (options)
		*(options++) = 0;

#ifdef __sparc__
	if (!strcmp(str, "ttya"))
		strscpy(buf, "ttyS0");
	if (!strcmp(str, "ttyb"))
		strscpy(buf, "ttyS1");
#endif

	for (s = buf; *s; s++)
		if ((ttyname && isdigit(*s)) || *s == ',')
			break;

	/* @idx will get defined when devname matches. */
	if (devname)
		idx = -1;
	else
		idx = simple_strtoul(s, NULL, 10);

	*s = 0;

	__add_preferred_console(ttyname, idx, devname, options, brl_options, true);
	return 1;
}
__setup("console=", console_setup);

/**
 * add_preferred_console - add a device to the list of preferred consoles.
 * @name: device name
 * @idx: device index
 * @options: options for this console
 *
 * The last preferred console added will be used for kernel messages
 * and stdin/out/err for init.  Normally this is used by console_setup
 * above to handle user-supplied console arguments; however it can also
 * be used by arch-specific code either to override the user or more
 * commonly to provide a default console (i.e. from PROM variables) when
 * the user has not supplied one.
 */
int add_preferred_console(const char *name, const short idx, char *options)
{
	return __add_preferred_console(name, idx, NULL, options, NULL, false);
}

/**
 * match_devname_and_update_preferred_console - Update a preferred console
 *	when matching devname is found.
 * @devname: DEVNAME:0.0 style device name
 * @name: Name of the corresponding console driver, e.g. "ttyS"
 * @idx: Console index, e.g. port number.
 *
 * The function checks whether a device with the given @devname is
 * preferred via the console=DEVNAME:0.0 command line option.
 * It fills the missing console driver name and console index
 * so that a later register_console() call could find (match)
 * and enable this device.
 *
 * It might be used when a driver subsystem initializes particular
 * devices with already known DEVNAME:0.0 style names and can
 * predict which console driver name and index this device would
 * later get associated with.
 *
 * Return: 0 on success, negative error code on failure.
 */
int match_devname_and_update_preferred_console(const char *devname,
					       const char *name,
					       const short idx)
{
	struct console_cmdline *c = console_cmdline;
	int i;

	if (!devname || !strlen(devname) || !name || !strlen(name) || idx < 0)
		return -EINVAL;

	for (i = 0; i < MAX_CMDLINECONSOLES && (c->name[0] || c->devname[0]);
	     i++, c++) {
		if (!strcmp(devname, c->devname)) {
			pr_info("associate the preferred console \"%s\" with \"%s%d\"\n",
				devname, name, idx);
			strscpy(c->name, name);
			c->index = idx;
			return 0;
		}
	}

	return -ENOENT;
}
EXPORT_SYMBOL_GPL(match_devname_and_update_preferred_console);

bool console_suspend_enabled = true;
EXPORT_SYMBOL(console_suspend_enabled);

static int __init console_suspend_disable(char *str)
{
	console_suspend_enabled = false;
	return 1;
}
__setup("no_console_suspend", console_suspend_disable);
module_param_named(console_suspend, console_suspend_enabled,
		bool, S_IRUGO | S_IWUSR);
MODULE_PARM_DESC(console_suspend, "suspend console during suspend"
	" and hibernate operations");

static bool printk_console_no_auto_verbose;

void console_verbose(void)
{
	if (console_loglevel && !printk_console_no_auto_verbose)
		console_loglevel = CONSOLE_LOGLEVEL_MOTORMOUTH;
}
EXPORT_SYMBOL_GPL(console_verbose);

module_param_named(console_no_auto_verbose, printk_console_no_auto_verbose, bool, 0644);
MODULE_PARM_DESC(console_no_auto_verbose, "Disable console loglevel raise to highest on oops/panic/etc");

/**
 * console_suspend_all - suspend the console subsystem
 *
 * This disables printk() while we go into suspend states
 */
void console_suspend_all(void)
{
	struct console *con;

	if (console_suspend_enabled)
		pr_info("Suspending console(s) (use no_console_suspend to debug)\n");

	/*
	 * Flush any console backlog and then avoid queueing irq_work until
	 * console_resume_all(). Until then deferred printing is no longer
	 * triggered, NBCON consoles transition to atomic flushing, and
	 * any klogd waiters are not triggered.
	 */
	pr_flush(1000, true);
	console_irqwork_blocked = true;

	if (!console_suspend_enabled)
		return;

	console_list_lock();
	for_each_console(con)
		console_srcu_write_flags(con, con->flags | CON_SUSPENDED);
	console_list_unlock();

	/*
	 * Ensure that all SRCU list walks have completed. All printing
	 * contexts must be able to see that they are suspended so that it
	 * is guaranteed that all printing has stopped when this function
	 * completes.
	 */
	synchronize_srcu(&console_srcu);
}

void console_resume_all(void)
{
	struct console_flush_type ft;
	struct console *con;

	/*
	 * Allow queueing irq_work. After restoring console state, deferred
	 * printing and any klogd waiters need to be triggered in case there
	 * is now a console backlog.
	 */
	console_irqwork_blocked = false;

	if (console_suspend_enabled) {
		console_list_lock();
		for_each_console(con)
			console_srcu_write_flags(con, con->flags & ~CON_SUSPENDED);
		console_list_unlock();

		/*
		 * Ensure that all SRCU list walks have completed. All printing
		 * contexts must be able to see they are no longer suspended so
		 * that they are guaranteed to wake up and resume printing.
		 */
		synchronize_srcu(&console_srcu);
	}

	printk_get_console_flush_type(&ft);
	if (ft.nbcon_offload)
		nbcon_kthreads_wake();
	if (ft.legacy_offload)
		defer_console_output();
	else
		wake_up_klogd();

	pr_flush(1000, true);
}

/**
 * console_cpu_notify - print deferred console messages after CPU hotplug
 * @cpu: unused
 *
 * If printk() is called from a CPU that is not online yet, the messages
 * will be printed on the console only if there are CON_ANYTIME consoles.
 * This function is called when a new CPU comes online (or fails to come
 * up) or goes offline.
 */
static int console_cpu_notify(unsigned int cpu)
{
	struct console_flush_type ft;

	if (!cpuhp_tasks_frozen) {
		printk_get_console_flush_type(&ft);
		if (ft.nbcon_atomic)
			nbcon_atomic_flush_pending();
		if (ft.legacy_direct) {
			if (console_trylock())
				console_unlock();
		}
	}
	return 0;
}

/**
 * console_lock - block the console subsystem from printing
 *
 * Acquires a lock which guarantees that no consoles will
 * be in or enter their write() callback.
 *
 * Can sleep, returns nothing.
 */
void console_lock(void)
{
	might_sleep();

	/* On panic, the console_lock must be left to the panic cpu. */
	while (panic_on_other_cpu())
		msleep(1000);

	down_console_sem();
	console_locked = 1;
	console_may_schedule = 1;
}
EXPORT_SYMBOL(console_lock);

/**
 * console_trylock - try to block the console subsystem from printing
 *
 * Try to acquire a lock which guarantees that no consoles will
 * be in or enter their write() callback.
 *
 * returns 1 on success, and 0 on failure to acquire the lock.
 */
int console_trylock(void)
{
	/* On panic, the console_lock must be left to the panic cpu. */
	if (panic_on_other_cpu())
		return 0;
	if (down_trylock_console_sem())
		return 0;
	console_locked = 1;
	console_may_schedule = 0;
	return 1;
}
EXPORT_SYMBOL(console_trylock);

int is_console_locked(void)
{
	return console_locked;
}
EXPORT_SYMBOL(is_console_locked);

static void __console_unlock(void)
{
	console_locked = 0;
	up_console_sem();
}

#ifdef CONFIG_PRINTK

/*
 * Prepend the message in @pmsg->pbufs->outbuf. This is achieved by shifting
 * the existing message over and inserting the scratchbuf message.
 *
 * @pmsg is the original printk message.
 * @fmt is the printf format of the message which will prepend the existing one.
 *
 * If there is not enough space in @pmsg->pbufs->outbuf, the existing
 * message text will be sufficiently truncated.
 *
 * If @pmsg->pbufs->outbuf is modified, @pmsg->outbuf_len is updated.
 */
__printf(2, 3)
static void console_prepend_message(struct printk_message *pmsg, const char *fmt, ...)
{
	struct printk_buffers *pbufs = pmsg->pbufs;
	const size_t scratchbuf_sz = sizeof(pbufs->scratchbuf);
	const size_t outbuf_sz = sizeof(pbufs->outbuf);
	char *scratchbuf = &pbufs->scratchbuf[0];
	char *outbuf = &pbufs->outbuf[0];
	va_list args;
	size_t len;

	va_start(args, fmt);
	len = vscnprintf(scratchbuf, scratchbuf_sz, fmt, args);
	va_end(args);

	/*
	 * Make sure outbuf is sufficiently large before prepending.
	 * Keep at least the prefix when the message must be truncated.
	 * It is a rather theoretical problem when someone tries to
	 * use a minimalist buffer.
	 */
	if (WARN_ON_ONCE(len + PRINTK_PREFIX_MAX >= outbuf_sz))
		return;

	if (pmsg->outbuf_len + len >= outbuf_sz) {
		/* Truncate the message, but keep it terminated. */
		pmsg->outbuf_len = outbuf_sz - (len + 1);
		outbuf[pmsg->outbuf_len] = 0;
	}

	memmove(outbuf + len, outbuf, pmsg->outbuf_len + 1);
	memcpy(outbuf, scratchbuf, len);
	pmsg->outbuf_len += len;
}

/*
 * Prepend the message in @pmsg->pbufs->outbuf with a "dropped message".
 * @pmsg->outbuf_len is updated appropriately.
 *
 * @pmsg is the printk message to prepend.
 *
 * @dropped is the dropped count to report in the dropped message.
 */
void console_prepend_dropped(struct printk_message *pmsg, unsigned long dropped)
{
	console_prepend_message(pmsg, "** %lu printk messages dropped **\n", dropped);
}

/*
 * Prepend the message in @pmsg->pbufs->outbuf with a "replay message".
 * @pmsg->outbuf_len is updated appropriately.
 *
 * @pmsg is the printk message to prepend.
 */
void console_prepend_replay(struct printk_message *pmsg)
{
	console_prepend_message(pmsg, "** replaying previous printk message **\n");
}

/*
 * Read and format the specified record (or a later record if the specified
 * record is not available).
 *
 * @pmsg will contain the formatted result. @pmsg->pbufs must point to a
 * struct printk_buffers.
 *
 * @seq is the record to read and format. If it is not available, the next
 * valid record is read.
 *
 * @is_extended specifies if the message should be formatted for extended
 * console output.
 *
 * @may_suppress specifies if records may be skipped based on loglevel.
 *
 * Returns false if no record is available. Otherwise true and all fields
 * of @pmsg are valid. (See the documentation of struct printk_message
 * for information about the @pmsg fields.)
 */
bool printk_get_next_message(struct printk_message *pmsg, u64 seq,
			     bool is_extended, bool may_suppress)
{
	struct printk_buffers *pbufs = pmsg->pbufs;
	const size_t scratchbuf_sz = sizeof(pbufs->scratchbuf);
	const size_t outbuf_sz = sizeof(pbufs->outbuf);
	char *scratchbuf = &pbufs->scratchbuf[0];
	char *outbuf = &pbufs->outbuf[0];
	struct printk_info info;
	struct printk_record r;
	size_t len = 0;
	bool force_con;

	/*
	 * Formatting extended messages requires a separate buffer, so use the
	 * scratch buffer to read in the ringbuffer text.
	 *
	 * Formatting normal messages is done in-place, so read the ringbuffer
	 * text directly into the output buffer.
	 */
	if (is_extended)
		prb_rec_init_rd(&r, &info, scratchbuf, scratchbuf_sz);
	else
		prb_rec_init_rd(&r, &info, outbuf, outbuf_sz);

	if (!prb_read_valid(prb, seq, &r))
		return false;

	pmsg->seq = r.info->seq;
	pmsg->dropped = r.info->seq - seq;
	force_con = r.info->flags & LOG_FORCE_CON;
	pmsg_load_execution_ctx(pmsg, r.info);

	/*
	 * Skip records that are not forced to be printed on consoles and that
	 * have a level above the console loglevel.
	 */
	if (!force_con && may_suppress && suppress_message_printing(r.info->level))
		goto out;

	if (is_extended) {
		len = info_print_ext_header(outbuf, outbuf_sz, r.info);
		len += msg_print_ext_body(outbuf + len, outbuf_sz - len,
					  &r.text_buf[0], r.info->text_len, &r.info->dev_info);
	} else {
		len = record_print_text(&r, console_msg_format & MSG_FORMAT_SYSLOG, printk_time);
	}
out:
	pmsg->outbuf_len = len;
	return true;
}

/*
 * The legacy console always acquires a spinlock_t from its printing
 * callback. This violates lock nesting if the caller acquired an always
 * spinning lock (raw_spinlock_t) while invoking printk(). This is not a
 * problem on PREEMPT_RT because legacy consoles print always from a
 * dedicated thread and never from within printk(). Therefore we tell
 * lockdep that a sleeping spin lock (spinlock_t) is valid here.
 */
#ifdef CONFIG_PREEMPT_RT
static inline void printk_legacy_allow_spinlock_enter(void) { }
static inline void printk_legacy_allow_spinlock_exit(void) { }
#else
static DEFINE_WAIT_OVERRIDE_MAP(printk_legacy_map, LD_WAIT_CONFIG);

static inline void printk_legacy_allow_spinlock_enter(void)
{
	lock_map_acquire_try(&printk_legacy_map);
}

static inline void printk_legacy_allow_spinlock_exit(void)
{
	lock_map_release(&printk_legacy_map);
}
#endif /* CONFIG_PREEMPT_RT */

/*
 * Used as the printk buffers for non-panic, serialized console printing.
 * This is for legacy (!CON_NBCON) as well as all boot (CON_BOOT) consoles.
 * Its usage requires the console_lock held.
 */
struct printk_buffers printk_shared_pbufs;

/*
 * Print one record for the given console. The record printed is whatever
 * record is the next available record for the given console.
 *
 * @handover will be set to true if a printk waiter has taken over the
 * console_lock, in which case the caller is no longer holding both the
 * console_lock and the SRCU read lock. Otherwise it is set to false.
 *
 * @cookie is the cookie from the SRCU read lock.
 *
 * Returns false if the given console has no next record to print, otherwise
 * true.
 *
 * Requires the console_lock and the SRCU read lock.
 */
static bool console_emit_next_record(struct console *con, bool *handover, int cookie)
{
	bool is_extended = console_srcu_read_flags(con) & CON_EXTENDED;
	char *outbuf = &printk_shared_pbufs.outbuf[0];
	struct printk_message pmsg = {
		.pbufs = &printk_shared_pbufs,
	};
	unsigned long flags;

	*handover = false;

	if (!printk_get_next_message(&pmsg, con->seq, is_extended, true))
		return false;

	con->dropped += pmsg.dropped;

	/* Skip messages of formatted length 0. */
	if (pmsg.outbuf_len == 0) {
		con->seq = pmsg.seq + 1;
		goto skip;
	}

	if (con->dropped && !is_extended) {
		console_prepend_dropped(&pmsg, con->dropped);
		con->dropped = 0;
	}

	/* Write everything out to the hardware. */

	if (force_legacy_kthread() && !panic_in_progress()) {
		/*
		 * With forced threading this function is in a task context
		 * (either legacy kthread or get_init_console_seq()). There
		 * is no need for concern about printk reentrance, handovers,
		 * or lockdep complaints.
		 */

		con->write(con, outbuf, pmsg.outbuf_len);
		con->seq = pmsg.seq + 1;
	} else {
		/*
		 * While actively printing out messages, if another printk()
		 * were to occur on another CPU, it may wait for this one to
		 * finish. This task can not be preempted if there is a
		 * waiter waiting to take over.
		 *
		 * Interrupts are disabled because the hand over to a waiter
		 * must not be interrupted until the hand over is completed
		 * (@console_waiter is cleared).
		 */
		printk_safe_enter_irqsave(flags);
		console_lock_spinning_enable();

		/* Do not trace print latency. */
		stop_critical_timings();

		printk_legacy_allow_spinlock_enter();
		con->write(con, outbuf, pmsg.outbuf_len);
		printk_legacy_allow_spinlock_exit();

		start_critical_timings();

		con->seq = pmsg.seq + 1;

		*handover = console_lock_spinning_disable_and_check(cookie);
		printk_safe_exit_irqrestore(flags);
	}
skip:
	return true;
}

#else

static bool console_emit_next_record(struct console *con, bool *handover, int cookie)
{
	*handover = false;
	return false;
}

static inline void printk_kthreads_check_locked(void) { }

#endif /* CONFIG_PRINTK */


/*
 * Print out one record for each console.
 *
 * @do_cond_resched is set by the caller. It can be true only in schedulable
 * context.
 *
 * @next_seq is set to the sequence number after the last available record.
 * The value is valid only when all usable consoles were flushed, that is,
 * when the function returns true (could do the job) and the @try_again
 * parameter is set to false; see below.
 *
 * @handover will be set to true if a printk waiter has taken over the
 * console_lock, in which case the caller is no longer holding the
 * console_lock. Otherwise it is set to false.
 *
 * @try_again will be set to true when it still makes sense to call this
 * function again: the function could do the job (see the return value)
 * and some consoles are still making progress.
 *
 * Returns true when the function could do the job. Some consoles are usable,
 * and there was no takeover and no panic_on_other_cpu().
 *
 * Requires the console_lock.
 */
static bool console_flush_one_record(bool do_cond_resched, u64 *next_seq, bool *handover,
				     bool *try_again)
{
	struct console_flush_type ft;
	bool any_usable = false;
	struct console *con;
	int cookie;

	*try_again = false;

	printk_get_console_flush_type(&ft);

	cookie = console_srcu_read_lock();
	for_each_console_srcu(con) {
		short flags = console_srcu_read_flags(con);
		u64 printk_seq;
		bool progress;

		/*
		 * console_flush_one_record() is only responsible for
		 * nbcon consoles when the nbcon consoles cannot print via
		 * their atomic or threaded flushing.
		 */
		if ((flags & CON_NBCON) && (ft.nbcon_atomic || ft.nbcon_offload))
			continue;

		if (!console_is_usable(con, flags, !do_cond_resched))
			continue;
		any_usable = true;

		if (flags & CON_NBCON) {
			progress = nbcon_legacy_emit_next_record(con, handover, cookie,
								 !do_cond_resched);
			printk_seq = nbcon_seq_read(con);
		} else {
			progress = console_emit_next_record(con, handover, cookie);
			printk_seq = con->seq;
		}

		/*
		 * If a handover has occurred, the SRCU read lock
		 * is already released.
		 */
		if (*handover)
			goto fail;

		/* Track the next of the highest seq flushed. */
		if (printk_seq > *next_seq)
			*next_seq = printk_seq;

		if (!progress)
			continue;

		/*
		 * A usable console made progress. There might still be
		 * pending messages.
		 */
		*try_again = true;

		/* Allow panic_cpu to take over the consoles safely. */
		if (panic_on_other_cpu())
			goto fail_srcu;

		if (do_cond_resched)
			cond_resched();
	}
	console_srcu_read_unlock(cookie);

	return any_usable;

fail_srcu:
	console_srcu_read_unlock(cookie);
fail:
	*try_again = false;
	return false;
}

/*
 * Print out all remaining records to all consoles.
 *
 * @do_cond_resched is set by the caller. It can be true only in schedulable
 * context.
 *
 * @next_seq is set to the sequence number after the last available record.
 * The value is valid only when this function returns true. It means that all
 * usable consoles are completely flushed.
 *
 * @handover will be set to true if a printk waiter has taken over the
 * console_lock, in which case the caller is no longer holding the
 * console_lock. Otherwise it is set to false.
 *
 * Returns true when there was at least one usable console and all messages
 * were flushed to all usable consoles. A returned false informs the caller
 * that everything was not flushed (either there were no usable consoles or
 * another context has taken over printing or it is a panic situation and this
 * is not the panic CPU). Regardless of the reason, the caller should assume it
 * is not useful to immediately try again.
 *
 * Requires the console_lock.
 */
static bool console_flush_all(bool do_cond_resched, u64 *next_seq, bool *handover)
{
	bool try_again;
	bool ret;

	*next_seq = 0;
	*handover = false;

	do {
		ret = console_flush_one_record(do_cond_resched, next_seq,
					       handover, &try_again);
	} while (try_again);

	return ret;
}

static void __console_flush_and_unlock(void)
{
	bool do_cond_resched;
	bool handover;
	bool flushed;
	u64 next_seq;

	/*
	 * Console drivers are called with interrupts disabled, so
	 * @console_may_schedule should be cleared before; however, we may
	 * end up dumping a lot of lines, for example, if called from
	 * console registration path, and should invoke cond_resched()
	 * between lines if allowable.  Not doing so can cause a very long
	 * scheduling stall on a slow console leading to RCU stall and
	 * softlockup warnings which exacerbate the issue with more
	 * messages practically incapacitating the system. Therefore, create
	 * a local to use for the printing loop.
	 */
	do_cond_resched = console_may_schedule;

	do {
		console_may_schedule = 0;

		flushed = console_flush_all(do_cond_resched, &next_seq, &handover);
		if (!handover)
			__console_unlock();

		/*
		 * Abort if there was a failure to flush all messages to all
		 * usable consoles. Either it is not possible to flush (in
		 * which case it would be an infinite loop of retrying) or
		 * another context has taken over printing.
		 */
		if (!flushed)
			break;

		/*
		 * Some context may have added new records after
		 * console_flush_all() but before unlocking the console.
		 * Re-check if there is a new record to flush. If the trylock
		 * fails, another context is already handling the printing.
		 */
	} while (prb_read_valid(prb, next_seq, NULL) && console_trylock());
}

/**
 * console_unlock - unblock the legacy console subsystem from printing
 *
 * Releases the console_lock which the caller holds to block printing of
 * the legacy console subsystem.
 *
 * While the console_lock was held, console output may have been buffered
 * by printk(). If this is the case, console_unlock() emits the output on
 * legacy consoles prior to releasing the lock.
 *
 * console_unlock() may be called from any context.
 */
void console_unlock(void)
{
	struct console_flush_type ft;

	printk_get_console_flush_type(&ft);
	if (ft.legacy_direct)
		__console_flush_and_unlock();
	else
		__console_unlock();
}
EXPORT_SYMBOL(console_unlock);

void console_unblank(void)
{
	bool found_unblank = false;
	struct console *c;
	int cookie;

	/*
	 * First check if there are any consoles implementing the unblank()
	 * callback. If not, there is no reason to continue and take the
	 * console lock, which in particular can be dangerous if
	 * @oops_in_progress is set.
	 */
	cookie = console_srcu_read_lock();
	for_each_console_srcu(c) {
		if (!console_is_usable(c, console_srcu_read_flags(c), true))
			continue;

		if (c->unblank) {
			found_unblank = true;
			break;
		}
	}
	console_srcu_read_unlock(cookie);
	if (!found_unblank)
		return;

	/*
	 * Stop console printing because the unblank() callback may
	 * assume the console is not within its write() callback.
	 *
	 * If @oops_in_progress is set, this may be an atomic context.
	 * In that case, attempt a trylock as best-effort.
	 */
	if (oops_in_progress) {
		/* Semaphores are not NMI-safe. */
		if (in_nmi())
			return;

		/*
		 * Attempting to trylock the console lock can deadlock
		 * if another CPU was stopped while modifying the
		 * semaphore. "Hope and pray" that this is not the
		 * current situation.
		 */
		if (down_trylock_console_sem() != 0)
			return;
	} else
		console_lock();

	console_locked = 1;
	console_may_schedule = 0;

	cookie = console_srcu_read_lock();
	for_each_console_srcu(c) {
		if (!console_is_usable(c, console_srcu_read_flags(c), true))
			continue;

		if (c->unblank)
			c->unblank();
	}
	console_srcu_read_unlock(cookie);

	console_unlock();

	if (!oops_in_progress)
		pr_flush(1000, true);
}

/*
 * Rewind all consoles to the oldest available record.
 *
 * IMPORTANT: The function is safe only when called under
 *            console_lock(). It is not enforced because
 *            it is used as a best effort in panic().
 */
static void __console_rewind_all(void)
{
	struct console *c;
	short flags;
	int cookie;
	u64 seq;

	seq = prb_first_valid_seq(prb);

	cookie = console_srcu_read_lock();
	for_each_console_srcu(c) {
		flags = console_srcu_read_flags(c);

		if (flags & CON_NBCON) {
			nbcon_seq_force(c, seq);
		} else {
			/*
			 * This assignment is safe only when called under
			 * console_lock(). On panic, legacy consoles are
			 * only best effort.
			 */
			c->seq = seq;
		}
	}
	console_srcu_read_unlock(cookie);
}

/**
 * console_flush_on_panic - flush console content on panic
 * @mode: flush all messages in buffer or just the pending ones
 *
 * Immediately output all pending messages no matter what.
 */
void console_flush_on_panic(enum con_flush_mode mode)
{
	struct console_flush_type ft;
	bool handover;
	u64 next_seq;

	/*
	 * Ignore the console lock and flush out the messages. Attempting a
	 * trylock would not be useful because:
	 *
	 *   - if it is contended, it must be ignored anyway
	 *   - console_lock() and console_trylock() block and fail
	 *     respectively in panic for non-panic CPUs
	 *   - semaphores are not NMI-safe
	 */

	/*
	 * If another context is holding the console lock,
	 * @console_may_schedule might be set. Clear it so that
	 * this context does not call cond_resched() while flushing.
	 */
	console_may_schedule = 0;

	if (mode == CONSOLE_REPLAY_ALL)
		__console_rewind_all();

	printk_get_console_flush_type(&ft);
	if (ft.nbcon_atomic)
		nbcon_atomic_flush_pending();

	/* Flush legacy consoles once allowed, even when dangerous. */
	if (legacy_allow_panic_sync)
		console_flush_all(false, &next_seq, &handover);
}

/*
 * Return the console tty driver structure and its associated index
 */
struct tty_driver *console_device(int *index)
{
	struct console *c;
	struct tty_driver *driver = NULL;
	int cookie;

	/*
	 * Take console_lock to serialize device() callback with
	 * other console operations. For example, fg_console is
	 * modified under console_lock when switching vt.
	 */
	console_lock();

	cookie = console_srcu_read_lock();
	for_each_console_srcu(c) {
		if (!c->device)
			continue;
		driver = c->device(c, index);
		if (driver)
			break;
	}
	console_srcu_read_unlock(cookie);

	console_unlock();
	return driver;
}

/*
 * Prevent further output on the passed console device so that (for example)
 * serial drivers can suspend console output before suspending a port, and can
 * re-enable output afterwards.
 */
void console_suspend(struct console *console)
{
	__pr_flush(console, 1000, true);
	console_list_lock();
	console_srcu_write_flags(console, console->flags & ~CON_ENABLED);
	console_list_unlock();

	/*
	 * Ensure that all SRCU list walks have completed. All contexts must
	 * be able to see that this console is disabled so that (for example)
	 * the caller can suspend the port without risk of another context
	 * using the port.
	 */
	synchronize_srcu(&console_srcu);
}
EXPORT_SYMBOL(console_suspend);

void console_resume(struct console *console)
{
	struct console_flush_type ft;
	bool is_nbcon;

	console_list_lock();
	console_srcu_write_flags(console, console->flags | CON_ENABLED);
	is_nbcon = console->flags & CON_NBCON;
	console_list_unlock();

	/*
	 * Ensure that all SRCU list walks have completed. The related
	 * printing context must be able to see it is enabled so that
	 * it is guaranteed to wake up and resume printing.
	 */
	synchronize_srcu(&console_srcu);

	printk_get_console_flush_type(&ft);
	if (is_nbcon && ft.nbcon_offload)
		nbcon_kthread_wake(console);
	else if (ft.legacy_offload)
		defer_console_output();

	__pr_flush(console, 1000, true);
}
EXPORT_SYMBOL(console_resume);

#ifdef CONFIG_PRINTK
static int unregister_console_locked(struct console *console);

/* True when system boot is far enough to create printer threads. */
bool printk_kthreads_ready __ro_after_init;

static struct task_struct *printk_legacy_kthread;

static bool legacy_kthread_should_wakeup(void)
{
	struct console_flush_type ft;
	struct console *con;
	bool ret = false;
	int cookie;

	if (kthread_should_stop())
		return true;

	printk_get_console_flush_type(&ft);

	cookie = console_srcu_read_lock();
	for_each_console_srcu(con) {
		short flags = console_srcu_read_flags(con);
		u64 printk_seq;

		/*
		 * The legacy printer thread is only responsible for nbcon
		 * consoles when the nbcon consoles cannot print via their
		 * atomic or threaded flushing.
		 */
		if ((flags & CON_NBCON) && (ft.nbcon_atomic || ft.nbcon_offload))
			continue;

		if (!console_is_usable(con, flags, false))
			continue;

		if (flags & CON_NBCON) {
			printk_seq = nbcon_seq_read(con);
		} else {
			/*
			 * It is safe to read @seq because only this
			 * thread context updates @seq.
			 */
			printk_seq = con->seq;
		}

		if (prb_read_valid(prb, printk_seq, NULL)) {
			ret = true;
			break;
		}
	}
	console_srcu_read_unlock(cookie);

	return ret;
}

static int legacy_kthread_func(void *unused)
{
	bool try_again;

wait_for_event:
	wait_event_interruptible(legacy_wait, legacy_kthread_should_wakeup());

	do {
		bool handover = false;
		u64 next_seq = 0;

		if (kthread_should_stop())
			return 0;

		console_lock();
		console_flush_one_record(true, &next_seq, &handover, &try_again);
		if (!handover)
			__console_unlock();

	} while (try_again);

	goto wait_for_event;
}

static bool legacy_kthread_create(void)
{
	struct task_struct *kt;

	lockdep_assert_console_list_lock_held();

	kt = kthread_run(legacy_kthread_func, NULL, "pr/legacy");
	if (WARN_ON(IS_ERR(kt))) {
		pr_err("failed to start legacy printing thread\n");
		return false;
	}

	printk_legacy_kthread = kt;

	/*
	 * It is important that console printing threads are scheduled
	 * shortly after a printk call and with generous runtime budgets.
	 */
	sched_set_normal(printk_legacy_kthread, -20);

	return true;
}

/**
 * printk_kthreads_shutdown - shutdown all threaded printers
 * @data: syscore context
 *
 * On system shutdown all threaded printers are stopped. This allows printk
 * to transition back to atomic printing, thus providing a robust mechanism
 * for the final shutdown/reboot messages to be output.
 */
static void printk_kthreads_shutdown(void *data)
{
	struct console *con;

	console_list_lock();
	if (printk_kthreads_running) {
		printk_kthreads_running = false;

		for_each_console(con) {
			if (con->flags & CON_NBCON)
				nbcon_kthread_stop(con);
		}

		/*
		 * The threads may have been stopped while printing a
		 * backlog. Flush any records left over.
		 */
		nbcon_atomic_flush_pending();
	}
	console_list_unlock();
}

static const struct syscore_ops printk_syscore_ops = {
	.shutdown = printk_kthreads_shutdown,
};

static struct syscore printk_syscore = {
	.ops = &printk_syscore_ops,
};

/*
 * If appropriate, start nbcon kthreads and set @printk_kthreads_running.
 * If any kthreads fail to start, those consoles are unregistered.
 *
 * Must be called under console_list_lock().
 */
static void printk_kthreads_check_locked(void)
{
	struct hlist_node *tmp;
	struct console *con;

	lockdep_assert_console_list_lock_held();

	if (!printk_kthreads_ready)
		return;

	/* Start or stop the legacy kthread when needed. */
	if (have_legacy_console || have_boot_console) {
		if (!printk_legacy_kthread &&
		    force_legacy_kthread() &&
		    !legacy_kthread_create()) {
			/*
			 * All legacy consoles must be unregistered. If there
			 * are any nbcon consoles, they will set up their own
			 * kthread.
			 */
			hlist_for_each_entry_safe(con, tmp, &console_list, node) {
				if (con->flags & CON_NBCON)
					continue;

				unregister_console_locked(con);
			}
		}
	} else if (printk_legacy_kthread) {
		kthread_stop(printk_legacy_kthread);
		printk_legacy_kthread = NULL;
	}

	/*
	 * Printer threads cannot be started as long as any boot console is
	 * registered because there is no way to synchronize the hardware
	 * registers between boot console code and regular console code.
	 * It can only be known that there will be no new boot consoles when
	 * an nbcon console is registered.
	 */
	if (have_boot_console || !have_nbcon_console) {
		/* Clear flag in case all nbcon consoles unregistered. */
		printk_kthreads_running = false;
		return;
	}

	if (printk_kthreads_running)
		return;

	hlist_for_each_entry_safe(con, tmp, &console_list, node) {
		if (!(con->flags & CON_NBCON))
			continue;

		if (!nbcon_kthread_create(con))
			unregister_console_locked(con);
	}

	printk_kthreads_running = true;
}

static int __init printk_set_kthreads_ready(void)
{
	register_syscore(&printk_syscore);

	console_list_lock();
	printk_kthreads_ready = true;
	printk_kthreads_check_locked();
	console_list_unlock();

	return 0;
}
early_initcall(printk_set_kthreads_ready);
#endif /* CONFIG_PRINTK */

static int __read_mostly keep_bootcon;

static int __init keep_bootcon_setup(char *str)
{
	keep_bootcon = 1;
	pr_info("debug: skip boot console de-registration.\n");

	return 0;
}

early_param("keep_bootcon", keep_bootcon_setup);

static int console_call_setup(struct console *newcon, char *options)
{
	int err;

	if (!newcon->setup)
		return 0;

	/* Synchronize with possible boot console. */
	console_lock();
	err = newcon->setup(newcon, options);
	console_unlock();

	return err;
}

/*
 * This is called by register_console() to try to match
 * the newly registered console with any of the ones selected
 * by either the command line or add_preferred_console() and
 * setup/enable it.
 *
 * Care needs to be taken with consoles that are statically
 * enabled, such as netconsole.
 */
static int try_enable_preferred_console(struct console *newcon,
					bool user_specified)
{
	struct console_cmdline *c;
	int i, err;

	for (i = 0, c = console_cmdline;
	     i < MAX_CMDLINECONSOLES && (c->name[0] || c->devname[0]);
	     i++, c++) {
		/* Console not yet initialized? */
		if (!c->name[0])
			continue;
		if (c->user_specified != user_specified)
			continue;
		if (!newcon->match ||
		    newcon->match(newcon, c->name, c->index, c->options) != 0) {
			/* default matching */
			BUILD_BUG_ON(sizeof(c->name) != sizeof(newcon->name));
			if (strcmp(c->name, newcon->name) != 0)
				continue;
			if (newcon->index >= 0 &&
			    newcon->index != c->index)
				continue;
			if (newcon->index < 0)
				newcon->index = c->index;

			if (_braille_register_console(newcon, c))
				return 0;

			err = console_call_setup(newcon, c->options);
			if (err)
				return err;
		}
		newcon->flags |= CON_ENABLED;
		if (i == preferred_console)
			newcon->flags |= CON_CONSDEV;
		return 0;
	}

	/*
	 * Some consoles, such as pstore and netconsole, can be enabled even
	 * without matching. Accept the pre-enabled consoles only when match()
	 * and setup() had a chance to be called.
	 */
	if (newcon->flags & CON_ENABLED && c->user_specified == user_specified)
		return 0;

	return -ENOENT;
}

/* Try to enable the console unconditionally */
static void try_enable_default_console(struct console *newcon)
{
	if (newcon->index < 0)
		newcon->index = 0;

	if (console_call_setup(newcon, NULL) != 0)
		return;

	newcon->flags |= CON_ENABLED;

	if (newcon->device)
		newcon->flags |= CON_CONSDEV;
}

/* Return the starting sequence number for a newly registered console. */
static u64 get_init_console_seq(struct console *newcon, bool bootcon_registered)
{
	struct console *con;
	bool handover;
	u64 init_seq;

	if (newcon->flags & (CON_PRINTBUFFER | CON_BOOT)) {
		/* Get a consistent copy of @syslog_seq. */
		mutex_lock(&syslog_lock);
		init_seq = syslog_seq;
		mutex_unlock(&syslog_lock);
	} else {
		/* Begin with next message added to ringbuffer. */
		init_seq = prb_next_seq(prb);

		/*
		 * If any enabled boot consoles are due to be unregistered
		 * shortly, some may not be caught up and may be the same
		 * device as @newcon. Since it is not known which boot console
		 * is the same device, flush all consoles and, if necessary,
		 * start with the message of the enabled boot console that is
		 * the furthest behind.
		 */
		if (bootcon_registered && !keep_bootcon) {
			/*
			 * Hold the console_lock to stop console printing and
			 * guarantee safe access to console->seq.
			 */
			console_lock();

			/*
			 * Flush all consoles and set the console to start at
			 * the next unprinted sequence number.
			 */
			if (!console_flush_all(true, &init_seq, &handover)) {
				/*
				 * Flushing failed. Just choose the lowest
				 * sequence of the enabled boot consoles.
				 */

				/*
				 * If there was a handover, this context no
				 * longer holds the console_lock.
				 */
				if (handover)
					console_lock();

				init_seq = prb_next_seq(prb);
				for_each_console(con) {
					u64 seq;

					if (!(con->flags & CON_BOOT) ||
					    !(con->flags & CON_ENABLED)) {
						continue;
					}

					if (con->flags & CON_NBCON)
						seq = nbcon_seq_read(con);
					else
						seq = con->seq;

					if (seq < init_seq)
						init_seq = seq;
				}
			}

			console_unlock();
		}
	}

	return init_seq;
}

#define console_first()				\
	hlist_entry(console_list.first, struct console, node)

static int unregister_console_locked(struct console *console);

/*
 * The console driver calls this routine during kernel initialization
 * to register the console printing procedure with printk() and to
 * print any messages that were printed by the kernel before the
 * console driver was initialized.
 *
 * This can happen pretty early during the boot process (because of
 * early_printk) - sometimes before setup_arch() completes - be careful
 * of what kernel features are used - they may not be initialised yet.
 *
 * There are two types of consoles - bootconsoles (early_printk) and
 * "real" consoles (everything which is not a bootconsole) which are
 * handled differently.
 *  - Any number of bootconsoles can be registered at any time.
 *  - As soon as a "real" console is registered, all bootconsoles
 *    will be unregistered automatically.
 *  - Once a "real" console is registered, any attempt to register a
 *    bootconsole will be rejected.
 */
void register_console(struct console *newcon)
{
	bool use_device_lock = (newcon->flags & CON_NBCON) && newcon->write_atomic;
	bool bootcon_registered = false;
	bool realcon_registered = false;
	struct console *con;
	unsigned long flags;
	u64 init_seq;
	int err;

	console_list_lock();

	for_each_console(con) {
		if (WARN(con == newcon, "console '%s%d' already registered\n",
					 con->name, con->index)) {
			goto unlock;
		}

		if (con->flags & CON_BOOT)
			bootcon_registered = true;
		else
			realcon_registered = true;
	}

	/* Do not register boot consoles when there already is a real one. */
	if ((newcon->flags & CON_BOOT) && realcon_registered) {
		pr_info("Too late to register bootconsole %s%d\n",
			newcon->name, newcon->index);
		goto unlock;
	}

	if (newcon->flags & CON_NBCON) {
		/*
		 * Ensure the nbcon console buffers can be allocated
		 * before modifying any global data.
		 */
		if (!nbcon_alloc(newcon))
			goto unlock;
	}

	/*
	 * See if we want to enable this console driver by default.
	 *
	 * Nope when a console is preferred by the command line, device
	 * tree, or SPCR.
	 *
	 * The first real console with tty binding (driver) wins. More
	 * consoles might get enabled before the right one is found.
	 *
	 * Note that a console with tty binding will have CON_CONSDEV
	 * flag set and will be first in the list.
	 */
	if (preferred_console < 0) {
		if (hlist_empty(&console_list) || !console_first()->device ||
		    console_first()->flags & CON_BOOT) {
			try_enable_default_console(newcon);
		}
	}

	/* See if this console matches one we selected on the command line */
	err = try_enable_preferred_console(newcon, true);

	/* If not, try to match against the platform default(s) */
	if (err == -ENOENT)
		err = try_enable_preferred_console(newcon, false);

	/* printk() messages are not printed to the Braille console. */
	if (err || newcon->flags & CON_BRL) {
		if (newcon->flags & CON_NBCON)
			nbcon_free(newcon);
		goto unlock;
	}

	/*
	 * If we have a bootconsole, and are switching to a real console,
 * don't print everything out again, since when the boot console and
 * the real console are the same physical device, it's annoying to
 * see the beginning boot messages twice.
	 */
	if (bootcon_registered &&
	    ((newcon->flags & (CON_CONSDEV | CON_BOOT)) == CON_CONSDEV)) {
		newcon->flags &= ~CON_PRINTBUFFER;
	}

	newcon->dropped = 0;
	init_seq = get_init_console_seq(newcon, bootcon_registered);

	if (newcon->flags & CON_NBCON) {
		have_nbcon_console = true;
		nbcon_seq_force(newcon, init_seq);
	} else {
		have_legacy_console = true;
		newcon->seq = init_seq;
	}

	if (newcon->flags & CON_BOOT)
		have_boot_console = true;

	/*
	 * If another context is actively using the hardware of this new
	 * console, it will not be aware of the nbcon synchronization. This
	 * is a risk that two contexts could access the hardware
	 * simultaneously if this new console is used for atomic printing
	 * and the other context is still using the hardware.
	 *
	 * Use the driver synchronization to ensure that the hardware is not
	 * in use while this new console transitions to being registered.
	 */
	if (use_device_lock)
		newcon->device_lock(newcon, &flags);

	/*
	 * Put this console in the list - keep the
	 * preferred driver at the head of the list.
	 */
	if (hlist_empty(&console_list)) {
		/* Ensure CON_CONSDEV is always set for the head. */
		newcon->flags |= CON_CONSDEV;
		hlist_add_head_rcu(&newcon->node, &console_list);

	} else if (newcon->flags & CON_CONSDEV) {
		/* Only the new head can have CON_CONSDEV set. */
		console_srcu_write_flags(console_first(), console_first()->flags & ~CON_CONSDEV);
		hlist_add_head_rcu(&newcon->node, &console_list);

	} else {
		hlist_add_behind_rcu(&newcon->node, console_list.first);
	}

	/*
	 * No need to synchronize SRCU here! The caller does not rely
	 * on all contexts being able to see the new console before
	 * register_console() completes.
	 */

	/* This new console is now registered. */
	if (use_device_lock)
		newcon->device_unlock(newcon, flags);

	console_sysfs_notify();

	/*
	 * By unregistering the bootconsoles after we enable the real console
	 * we get the "console xxx enabled" message on all the consoles -
	 * boot consoles, real consoles, etc - this is to ensure that end
	 * users know there might be something in the kernel's log buffer that
	 * went to the bootconsole (that they do not see on the real console)
	 */
	con_printk(KERN_INFO, newcon, "enabled\n");
	if (bootcon_registered &&
	    ((newcon->flags & (CON_CONSDEV | CON_BOOT)) == CON_CONSDEV) &&
	    !keep_bootcon) {
		struct hlist_node *tmp;

		hlist_for_each_entry_safe(con, tmp, &console_list, node) {
			if (con->flags & CON_BOOT)
				unregister_console_locked(con);
		}
	}

	/* Changed console list, may require printer threads to start/stop. */
	printk_kthreads_check_locked();
unlock:
	console_list_unlock();
}
EXPORT_SYMBOL(register_console);

/* Must be called under console_list_lock(). */
static int unregister_console_locked(struct console *console)
{
	bool use_device_lock = (console->flags & CON_NBCON) && console->write_atomic;
	bool found_legacy_con = false;
	bool found_nbcon_con = false;
	bool found_boot_con = false;
	unsigned long flags;
	struct console *c;
	int res;

	lockdep_assert_console_list_lock_held();

	con_printk(KERN_INFO, console, "disabled\n");

	res = _braille_unregister_console(console);
	if (res < 0)
		return res;
	if (res > 0)
		return 0;

	if (!console_is_registered_locked(console))
		res = -ENODEV;
	else if (console_is_usable(console, console->flags, true))
		__pr_flush(console, 1000, true);

	/* Disable it unconditionally */
	console_srcu_write_flags(console, console->flags & ~CON_ENABLED);

	if (res < 0)
		return res;

	/*
	 * Use the driver synchronization to ensure that the hardware is not
	 * in use while this console transitions to being unregistered.
	 */
	if (use_device_lock)
		console->device_lock(console, &flags);

	hlist_del_init_rcu(&console->node);

	if (use_device_lock)
		console->device_unlock(console, flags);

	/*
	 * <HISTORICAL>
	 * If this isn't the last console and it has CON_CONSDEV set, we
	 * need to set it on the next preferred console.
	 * </HISTORICAL>
	 *
	 * The above makes no sense as there is no guarantee that the next
	 * console has any device attached. Oh well....
	 */
	if (!hlist_empty(&console_list) && console->flags & CON_CONSDEV)
		console_srcu_write_flags(console_first(), console_first()->flags | CON_CONSDEV);

	/*
	 * Ensure that all SRCU list walks have completed. All contexts
	 * must not be able to see this console in the list so that any
	 * exit/cleanup routines can be performed safely.
	 */
	synchronize_srcu(&console_srcu);

	/*
	 * With this console gone, the global flags tracking registered
	 * console types may have changed. Update them.
	 */
	for_each_console(c) {
		if (c->flags & CON_BOOT)
			found_boot_con = true;

		if (c->flags & CON_NBCON)
			found_nbcon_con = true;
		else
			found_legacy_con = true;
	}
	if (!found_boot_con)
		have_boot_console = found_boot_con;
	if (!found_legacy_con)
		have_legacy_console = found_legacy_con;
	if (!found_nbcon_con)
		have_nbcon_console = found_nbcon_con;

	/* @have_nbcon_console must be updated before calling nbcon_free(). */
	if (console->flags & CON_NBCON)
		nbcon_free(console);

	console_sysfs_notify();

	if (console->exit)
		res = console->exit(console);

	/* Changed console list, may require printer threads to start/stop. */
	printk_kthreads_check_locked();

	return res;
}

int unregister_console(struct console *console)
{
	int res;

	console_list_lock();
	res = unregister_console_locked(console);
	console_list_unlock();
	return res;
}
EXPORT_SYMBOL(unregister_console);

/**
 * console_force_preferred_locked - force a registered console preferred
 * @con: The registered console to force preferred.
 *
 * Must be called under console_list_lock().
 */
void console_force_preferred_locked(struct console *con)
{
	struct console *cur_pref_con;

	if (!console_is_registered_locked(con))
		return;

	cur_pref_con = console_first();

	/* Already preferred? */
	if (cur_pref_con == con)
		return;

	/*
	 * Delete, but do not re-initialize the entry. This allows the console
	 * to continue to appear registered (via any hlist_unhashed_lockless()
	 * checks), even though it was briefly removed from the console list.
	 */
	hlist_del_rcu(&con->node);

	/*
	 * Ensure that all SRCU list walks have completed so that the console
	 * can be added to the beginning of the console list and its forward
	 * list pointer can be re-initialized.
	 */
	synchronize_srcu(&console_srcu);

	con->flags |= CON_CONSDEV;
	WARN_ON(!con->device);

	/* Only the new head can have CON_CONSDEV set. */
	console_srcu_write_flags(cur_pref_con, cur_pref_con->flags & ~CON_CONSDEV);
	hlist_add_head_rcu(&con->node, &console_list);
}
EXPORT_SYMBOL(console_force_preferred_locked);

/*
 * Initialize the console device. This is called *early*, so
 * we can't necessarily depend on lots of kernel help here.
 * Just do some early initializations, and do the complex setup
 * later.
 */
void __init console_init(void)
{
	int ret;
	initcall_t call;
	initcall_entry_t *ce;

#ifdef CONFIG_NULL_TTY_DEFAULT_CONSOLE
	if (!console_set_on_cmdline)
		add_preferred_console("ttynull", 0, NULL);
#endif

	/* Setup the default TTY line discipline. */
	n_tty_init();

	/*
	 * set up the console device so that later boot sequences can
	 * inform about problems etc..
	 */
	ce = __con_initcall_start;
	trace_initcall_level("console");
	while (ce < __con_initcall_end) {
		call = initcall_from_entry(ce);
		trace_initcall_start(call);
		ret = call();
		trace_initcall_finish(call, ret);
		ce++;
	}
}

/*
 * Some boot consoles access data that is in the init section and which will
 * be discarded after the initcalls have been run. To make sure that no code
 * will access this data, unregister the boot consoles in a late initcall.
 *
 * If for some reason, such as deferred probe or the driver being a loadable
 * module, the real console hasn't registered yet at this point, there will
 * be a brief interval in which no messages are logged to the console, which
 * makes it difficult to diagnose problems that occur during this time.
 *
 * To mitigate this problem somewhat, only unregister consoles whose memory
 * intersects with the init section. Note that all other boot consoles will
 * get unregistered when the real preferred console is registered.
 */
static int __init printk_late_init(void)
{
	struct hlist_node *tmp;
	struct console *con;
	int ret;

	console_list_lock();
	hlist_for_each_entry_safe(con, tmp, &console_list, node) {
		if (!(con->flags & CON_BOOT))
			continue;

		/* Check addresses that might be used for enabled consoles. */
		if (init_section_intersects(con, sizeof(*con)) ||
		    init_section_contains(con->write, 0) ||
		    init_section_contains(con->read, 0) ||
		    init_section_contains(con->device, 0) ||
		    init_section_contains(con->unblank, 0) ||
		    init_section_contains(con->data, 0)) {
			/*
			 * Please, consider moving the reported consoles out
			 * of the init section.
			 */
			pr_warn("bootconsole [%s%d] uses init memory and must be disabled even before the real one is ready\n",
				con->name, con->index);
			unregister_console_locked(con);
		}
	}
	console_list_unlock();

	ret = cpuhp_setup_state_nocalls(CPUHP_PRINTK_DEAD, "printk:dead", NULL,
					console_cpu_notify);
	WARN_ON(ret < 0);
	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "printk:online",
					console_cpu_notify, NULL);
	WARN_ON(ret < 0);
	printk_sysctl_init();
	return 0;
}
late_initcall(printk_late_init);

#if defined CONFIG_PRINTK
/* If @con is specified, only wait for that console. Otherwise wait for all. */
static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progress)
{
	unsigned long timeout_jiffies = msecs_to_jiffies(timeout_ms);
	unsigned long remaining_jiffies = timeout_jiffies;
	struct console_flush_type ft;
	struct console *c;
	u64 last_diff = 0;
	u64 printk_seq;
	short flags;
	int cookie;
	u64 diff;
	u64 seq;

	/* Sorry, pr_flush() will not work this early. */
	if (system_state < SYSTEM_SCHEDULING)
		return false;

	might_sleep();

	seq = prb_next_reserve_seq(prb);

	/* Flush the consoles so that records up to @seq are printed. */
	printk_get_console_flush_type(&ft);
	if (ft.nbcon_atomic)
		nbcon_atomic_flush_pending();
	if (ft.legacy_direct) {
		console_lock();
		console_unlock();
	}

	for (;;) {
		unsigned long begin_jiffies;
		unsigned long slept_jiffies;

		diff = 0;

		/*
		 * Hold the console_lock to guarantee safe access to
		 * console->seq. Releasing console_lock flushes more
		 * records in case @seq is still not printed on all
		 * usable consoles.
		 *
		 * Holding the console_lock is not necessary if there
		 * are no legacy or boot consoles. However, such a
		 * console could register at any time. Always hold the
		 * console_lock as a precaution rather than
		 * synchronizing against register_console().
		 */
		console_lock();

		cookie = console_srcu_read_lock();
		for_each_console_srcu(c) {
			if (con && con != c)
				continue;

			flags = console_srcu_read_flags(c);

			/*
			 * If consoles are not usable, it cannot be expected
			 * that they make forward progress, so only increment
			 * @diff for usable consoles.
			 */
			if (!console_is_usable(c, flags, true) &&
			    !console_is_usable(c, flags, false)) {
				continue;
			}

			if (flags & CON_NBCON) {
				printk_seq = nbcon_seq_read(c);
			} else {
				printk_seq = c->seq;
			}

			if (printk_seq < seq)
				diff += seq - printk_seq;
		}
		console_srcu_read_unlock(cookie);

		if (diff != last_diff && reset_on_progress)
			remaining_jiffies = timeout_jiffies;

		console_unlock();

		/* Note: @diff is 0 if there are no usable consoles. */
		if (diff == 0 || remaining_jiffies == 0)
			break;

		/* msleep(1) might sleep much longer. Check time by jiffies. */
		begin_jiffies = jiffies;
		msleep(1);
		slept_jiffies = jiffies - begin_jiffies;

		remaining_jiffies -= min(slept_jiffies, remaining_jiffies);

		last_diff = diff;
	}

	return (diff == 0);
}

/**
 * pr_flush() - Wait for printing threads to catch up.
 *
 * @timeout_ms:        The maximum time (in ms) to wait.
 * @reset_on_progress: Reset the timeout if forward progress is seen.
 *
 * A value of 0 for @timeout_ms means no waiting will occur. A value of -1
 * represents infinite waiting.
 *
 * If @reset_on_progress is true, the timeout will be reset whenever any
 * printer has been seen to make some forward progress.
 *
 * Context: Process context. May sleep while acquiring console lock.
 * Return: true if all usable printers are caught up.
 */
bool pr_flush(int timeout_ms, bool reset_on_progress)
{
	return __pr_flush(NULL, timeout_ms, reset_on_progress);
}

/*
 * Delayed printk version, for scheduler-internal messages:
 */
#define PRINTK_PENDING_WAKEUP	0x01
#define PRINTK_PENDING_OUTPUT	0x02

static DEFINE_PER_CPU(int, printk_pending);

static void wake_up_klogd_work_func(struct irq_work *irq_work)
{
	int pending = this_cpu_xchg(printk_pending, 0);

	if (pending & PRINTK_PENDING_OUTPUT) {
		if (force_legacy_kthread()) {
			if (printk_legacy_kthread)
				wake_up_interruptible(&legacy_wait);
		} else {
			if (console_trylock())
				console_unlock();
		}
	}

	if (pending & PRINTK_PENDING_WAKEUP)
		wake_up_interruptible(&log_wait);
}

static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) =
	IRQ_WORK_INIT_LAZY(wake_up_klogd_work_func);

static void __wake_up_klogd(int val)
{
	if (!printk_percpu_data_ready())
		return;

	/*
	 * It is not allowed to call this function when console irq_work
	 * is blocked.
	 */
	if (WARN_ON_ONCE(console_irqwork_blocked))
		return;

	preempt_disable();
	/*
	 * Guarantee any new records can be seen by tasks preparing to wait
	 * before this context checks if the wait queue is empty.
	 *
	 * The full memory barrier within wq_has_sleeper() pairs with the full
	 * memory barrier within set_current_state() of
	 * prepare_to_wait_event(), which is called after ___wait_event() adds
	 * the waiter but before it has checked the wait condition.
	 *
	 * This pairs with devkmsg_read:A and syslog_print:A.
	 */
	if (wq_has_sleeper(&log_wait) || /* LMM(__wake_up_klogd:A) */
	    (val & PRINTK_PENDING_OUTPUT)) {
		this_cpu_or(printk_pending, val);
		irq_work_queue(this_cpu_ptr(&wake_up_klogd_work));
	}
	preempt_enable();
}

/**
 * wake_up_klogd - Wake kernel logging daemon
 *
 * Use this function when new records have been added to the ringbuffer
 * and the console printing of those records has already occurred or is
 * known to be handled by some other context. This function will only
 * wake the logging daemon.
 *
 * Context: Any context.
 */
void wake_up_klogd(void)
{
	__wake_up_klogd(PRINTK_PENDING_WAKEUP);
}

/**
 * defer_console_output - Wake kernel logging daemon and trigger
 *	console printing in a deferred context
 *
 * Use this function when new records have been added to the ringbuffer,
 * this context is responsible for console printing those records, but
 * the current context is not allowed to perform the console printing.
 * Trigger an irq_work context to perform the console printing. This
 * function also wakes the logging daemon.
 *
 * Context: Any context.
 */
void defer_console_output(void)
{
	/*
	 * New messages may have been added directly to the ringbuffer
	 * using vprintk_store(), so wake any waiters as well.
	 */
	__wake_up_klogd(PRINTK_PENDING_WAKEUP | PRINTK_PENDING_OUTPUT);
}

/**
 * printk_trigger_flush - Attempt to flush printk buffer to consoles.
 *
 * If possible, flush the printk buffer to all consoles in the caller's
 * context. If offloading is available, trigger deferred printing.
 *
 * This is best effort. Depending on the system state, console states,
 * and caller context, no actual flushing may result from this call.
 */
void printk_trigger_flush(void)
{
	struct console_flush_type ft;

	printk_get_console_flush_type(&ft);
	if (ft.nbcon_atomic)
		nbcon_atomic_flush_pending();
	if (ft.nbcon_offload)
		nbcon_kthreads_wake();
	if (ft.legacy_direct) {
		if (console_trylock())
			console_unlock();
	}
	if (ft.legacy_offload)
		defer_console_output();
}

int vprintk_deferred(const char *fmt, va_list args)
{
	return vprintk_emit(0, LOGLEVEL_SCHED, NULL, fmt, args);
}

int _printk_deferred(const char *fmt, ...)
{
	va_list args;
	int r;

	va_start(args, fmt);
	r = vprintk_deferred(fmt, args);
	va_end(args);

	return r;
}

/*
 * printk rate limiting, lifted from the networking subsystem.
 *
 * This enforces a rate limit: not more than 10 kernel messages
 * every 5s to make a denial-of-service attack impossible.
 */
DEFINE_RATELIMIT_STATE(printk_ratelimit_state, 5 * HZ, 10);

int __printk_ratelimit(const char *func)
{
	return ___ratelimit(&printk_ratelimit_state, func);
}
EXPORT_SYMBOL(__printk_ratelimit);

/**
 * printk_timed_ratelimit - caller-controlled printk ratelimiting
 * @caller_jiffies: pointer to caller's state
 * @interval_msecs: minimum interval between prints
 *
 * printk_timed_ratelimit() returns true if more than @interval_msecs
 * milliseconds have elapsed since the last time printk_timed_ratelimit()
 * returned true.
 */
bool printk_timed_ratelimit(unsigned long *caller_jiffies,
			unsigned int interval_msecs)
{
	unsigned long elapsed = jiffies - *caller_jiffies;

	if (*caller_jiffies && elapsed <= msecs_to_jiffies(interval_msecs))
		return false;

	*caller_jiffies = jiffies;
	return true;
}
EXPORT_SYMBOL(printk_timed_ratelimit);

static DEFINE_SPINLOCK(dump_list_lock);
static LIST_HEAD(dump_list);

/**
 * kmsg_dump_register - register a kernel log dumper.
 * @dumper: pointer to the kmsg_dumper structure
 *
 * Adds a kernel log dumper to the system. The dump callback in the
 * structure will be called when the kernel oopses or panics and must be
 * set. Returns zero on success and %-EINVAL or %-EBUSY otherwise.
 */
int kmsg_dump_register(struct kmsg_dumper *dumper)
{
	unsigned long flags;
	int err = -EBUSY;

	/* The dump callback needs to be set */
	if (!dumper->dump)
		return -EINVAL;

	spin_lock_irqsave(&dump_list_lock, flags);
	/* Don't allow registering multiple times */
	if (!dumper->registered) {
		dumper->registered = 1;
		list_add_tail_rcu(&dumper->list, &dump_list);
		err = 0;
	}
	spin_unlock_irqrestore(&dump_list_lock, flags);

	return err;
}
EXPORT_SYMBOL_GPL(kmsg_dump_register);

/**
 * kmsg_dump_unregister - unregister a kmsg dumper.
 * @dumper: pointer to the kmsg_dumper structure
 *
 * Removes a dump device from the system. Returns zero on success and
 * %-EINVAL otherwise.
 */
int kmsg_dump_unregister(struct kmsg_dumper *dumper)
{
	unsigned long flags;
	int err = -EINVAL;

	spin_lock_irqsave(&dump_list_lock, flags);
	if (dumper->registered) {
		dumper->registered = 0;
		list_del_rcu(&dumper->list);
		err = 0;
	}
	spin_unlock_irqrestore(&dump_list_lock, flags);
	synchronize_rcu();

	return err;
}
EXPORT_SYMBOL_GPL(kmsg_dump_unregister);

static bool always_kmsg_dump;
module_param_named(always_kmsg_dump, always_kmsg_dump, bool, S_IRUGO | S_IWUSR);

const char *kmsg_dump_reason_str(enum kmsg_dump_reason reason)
{
	switch (reason) {
	case KMSG_DUMP_PANIC:
		return "Panic";
	case KMSG_DUMP_OOPS:
		return "Oops";
	case KMSG_DUMP_EMERG:
		return "Emergency";
	case KMSG_DUMP_SHUTDOWN:
		return "Shutdown";
	default:
		return "Unknown";
	}
}
EXPORT_SYMBOL_GPL(kmsg_dump_reason_str);

/**
 * kmsg_dump_desc - dump kernel log to kernel message dumpers.
 * @reason: the reason (oops, panic etc) for dumping
 * @desc: a short string to describe what caused the panic or oops. Can be NULL
 * if no additional description is available.
 *
 * Call each of the registered dumper's dump() callback, which can
 * retrieve the kmsg records with kmsg_dump_get_line() or
 * kmsg_dump_get_buffer().
 */
void kmsg_dump_desc(enum kmsg_dump_reason reason, const char *desc)
{
	struct kmsg_dumper *dumper;
	struct kmsg_dump_detail detail = {
		.reason = reason,
		.description = desc};

	rcu_read_lock();
	list_for_each_entry_rcu(dumper, &dump_list, list) {
		enum kmsg_dump_reason max_reason = dumper->max_reason;

		/*
		 * If client has not provided a specific max_reason, default
		 * to KMSG_DUMP_OOPS, unless always_kmsg_dump was set.
		 */
		if (max_reason == KMSG_DUMP_UNDEF) {
			max_reason = always_kmsg_dump ? KMSG_DUMP_MAX :
							KMSG_DUMP_OOPS;
		}
		if (reason > max_reason)
			continue;

		/* invoke dumper which will iterate over records */
		dumper->dump(dumper, &detail);
	}
	rcu_read_unlock();
}

/**
 * kmsg_dump_get_line - retrieve one kmsg log line
 * @iter: kmsg dump iterator
 * @syslog: include the "<4>" prefixes
 * @line: buffer to copy the line to
 * @size: maximum size of the buffer
 * @len: length of line placed into buffer
 *
 * Start at the beginning of the kmsg buffer, with the oldest kmsg
 * record, and copy one record into the provided buffer.
 *
 * Consecutive calls will return the next available record moving
 * towards the end of the buffer with the youngest messages.
 *
 * A return value of FALSE indicates that there are no more records to
 * read.
 */
bool kmsg_dump_get_line(struct kmsg_dump_iter *iter, bool syslog,
			char *line, size_t size, size_t *len)
{
	u64 min_seq = latched_seq_read_nolock(&clear_seq);
	struct printk_info info;
	unsigned int line_count;
	struct printk_record r;
	size_t l = 0;
	bool ret = false;

	if (iter->cur_seq < min_seq)
		iter->cur_seq = min_seq;

	prb_rec_init_rd(&r, &info, line, size);

	/* Read text or count text lines? */
	if (line) {
		if (!prb_read_valid(prb, iter->cur_seq, &r))
			goto out;
		l = record_print_text(&r, syslog, printk_time);
	} else {
		if (!prb_read_valid_info(prb, iter->cur_seq,
					 &info, &line_count)) {
			goto out;
		}
		l = get_record_print_text_size(&info, line_count, syslog,
					       printk_time);
	}

	iter->cur_seq = r.info->seq + 1;
	ret = true;
out:
	if (len)
		*len = l;
	return ret;
}
EXPORT_SYMBOL_GPL(kmsg_dump_get_line);

/**
 * kmsg_dump_get_buffer - copy kmsg log lines
 * @iter: kmsg dump iterator
 * @syslog: include the "<4>" prefixes
 * @buf: buffer to copy the line to
 * @size: maximum size of the buffer
 * @len_out: length of line placed into buffer
 *
 * Start at the end of the kmsg buffer and fill the provided buffer
 * with as many of the *youngest* kmsg records that fit into it.
 * If the buffer is large enough, all available kmsg records will be
 * copied with a single call.
 *
 * Consecutive calls will fill the buffer with the next block of
 * available older records, not including the earlier retrieved ones.
 *
 * A return value of FALSE indicates that there are no more records to
 * read.
 */
bool kmsg_dump_get_buffer(struct kmsg_dump_iter *iter, bool syslog,
			  char *buf, size_t size, size_t *len_out)
{
	u64 min_seq = latched_seq_read_nolock(&clear_seq);
	struct printk_info info;
	struct printk_record r;
	u64 seq;
	u64 next_seq;
	size_t len = 0;
	bool ret = false;
	bool time = printk_time;

	if (!buf || !size)
		goto out;

	if (iter->cur_seq < min_seq)
		iter->cur_seq = min_seq;

	if (prb_read_valid_info(prb, iter->cur_seq, &info, NULL)) {
		if (info.seq != iter->cur_seq) {
			/* messages are gone, move to first available one */
			iter->cur_seq = info.seq;
		}
	}

	/* last entry */
	if (iter->cur_seq >= iter->next_seq)
		goto out;

	/*
	 * Find first record that fits, including all following records,
	 * into the user-provided buffer for this dump. Pass in size-1
	 * because this function (by way of record_print_text()) will
	 * not write more than size-1 bytes of text into @buf.
	 */
	seq = find_first_fitting_seq(iter->cur_seq, iter->next_seq,
				     size - 1, syslog, time);

	/*
	 * Next kmsg_dump_get_buffer() invocation will dump block of
	 * older records stored right before this one.
	 */
	next_seq = seq;

	prb_rec_init_rd(&r, &info, buf, size);

	prb_for_each_record(seq, prb, seq, &r) {
		if (r.info->seq >= iter->next_seq)
			break;

		len += record_print_text(&r, syslog, time);

		/* Adjust record to store to remaining buffer space. */
		prb_rec_init_rd(&r, &info, buf + len, size - len);
	}

	iter->next_seq = next_seq;
	ret = true;
out:
	if (len_out)
		*len_out = len;
	return ret;
}
EXPORT_SYMBOL_GPL(kmsg_dump_get_buffer);

/**
 * kmsg_dump_rewind - reset the iterator
 * @iter: kmsg dump iterator
 *
 * Reset the dumper's iterator so that kmsg_dump_get_line() and
 * kmsg_dump_get_buffer() can be called again and used multiple
 * times within the same dumper.dump() callback.
 */
void kmsg_dump_rewind(struct kmsg_dump_iter *iter)
{
	iter->cur_seq = latched_seq_read_nolock(&clear_seq);
	iter->next_seq = prb_next_seq(prb);
}
EXPORT_SYMBOL_GPL(kmsg_dump_rewind);

/**
 * console_try_replay_all - try to replay kernel log on consoles
 *
 * Try to obtain lock on console subsystem and replay all
 * available records in printk buffer on the consoles.
 * Does nothing if lock is not obtained.
 *
 * Context: Any, except for NMI.
 */
void console_try_replay_all(void)
{
	struct console_flush_type ft;

	printk_get_console_flush_type(&ft);
	if (console_trylock()) {
		__console_rewind_all();
		if (ft.nbcon_atomic)
			nbcon_atomic_flush_pending();
		if (ft.nbcon_offload)
			nbcon_kthreads_wake();
		if (ft.legacy_offload)
			defer_console_output();
		/* Consoles are flushed as part of console_unlock(). */
		console_unlock();
	}
}
#endif

#ifdef CONFIG_SMP
static atomic_t printk_cpu_sync_owner = ATOMIC_INIT(-1);
static atomic_t printk_cpu_sync_nested = ATOMIC_INIT(0);

bool is_printk_cpu_sync_owner(void)
{
	return (atomic_read(&printk_cpu_sync_owner) == raw_smp_processor_id());
}

/**
 * __printk_cpu_sync_wait() - Busy wait until the printk cpu-reentrant
 *                            spinning lock is not owned by any CPU.
 *
 * Context: Any context.
 */
void __printk_cpu_sync_wait(void)
{
	do {
		cpu_relax();
	} while (atomic_read(&printk_cpu_sync_owner) != -1);
}
EXPORT_SYMBOL(__printk_cpu_sync_wait);

/**
 * __printk_cpu_sync_try_get() - Try to acquire the printk cpu-reentrant
 *                               spinning lock.
 *
 * If no processor has the lock, the calling processor takes the lock and
 * becomes the owner. If the calling processor is already the owner of the
 * lock, this function succeeds immediately.
 *
 * Context: Any context. Expects interrupts to be disabled.
 * Return: 1 on success, otherwise 0.
 */
int __printk_cpu_sync_try_get(void)
{
	int cpu;
	int old;

	cpu = smp_processor_id();

	/*
	 * Guarantee loads and stores from this CPU when it is the lock owner
	 * are _not_ visible to the previous lock owner. This pairs with
	 * __printk_cpu_sync_put:B.
	 *
	 * Memory barrier involvement:
	 *
	 * If __printk_cpu_sync_try_get:A reads from __printk_cpu_sync_put:B,
	 * then __printk_cpu_sync_put:A can never read from
	 * __printk_cpu_sync_try_get:B.
	 *
	 * Relies on:
	 *
	 * RELEASE from __printk_cpu_sync_put:A to __printk_cpu_sync_put:B
	 * of the previous CPU
	 *    matching
	 * ACQUIRE from __printk_cpu_sync_try_get:A to
	 * __printk_cpu_sync_try_get:B of this CPU
	 */
	old = atomic_cmpxchg_acquire(&printk_cpu_sync_owner, -1,
				     cpu); /* LMM(__printk_cpu_sync_try_get:A) */
	if (old == -1) {
		/*
		 * This CPU is now the owner and begins loading/storing
		 * data: LMM(__printk_cpu_sync_try_get:B)
		 */
		return 1;

	} else if (old == cpu) {
		/* This CPU is already the owner. */
		atomic_inc(&printk_cpu_sync_nested);
		return 1;
	}

	return 0;
}
EXPORT_SYMBOL(__printk_cpu_sync_try_get);

/**
 * __printk_cpu_sync_put() - Release the printk cpu-reentrant spinning lock.
 *
 * The calling processor must be the owner of the lock.
 *
 * Context: Any context. Expects interrupts to be disabled.
 */
void __printk_cpu_sync_put(void)
{
	if (atomic_read(&printk_cpu_sync_nested)) {
		atomic_dec(&printk_cpu_sync_nested);
		return;
	}

	/*
	 * This CPU is finished loading/storing data:
	 * LMM(__printk_cpu_sync_put:A)
	 */

	/*
	 * Guarantee loads and stores from this CPU when it was the
	 * lock owner are visible to the next lock owner. This pairs
	 * with __printk_cpu_sync_try_get:A.
	 *
	 * Memory barrier involvement:
	 *
	 * If __printk_cpu_sync_try_get:A reads from __printk_cpu_sync_put:B,
	 * then __printk_cpu_sync_try_get:B reads from __printk_cpu_sync_put:A.
	 *
	 * Relies on:
	 *
	 * RELEASE from __printk_cpu_sync_put:A to __printk_cpu_sync_put:B
	 * of this CPU
	 *    matching
	 * ACQUIRE from __printk_cpu_sync_try_get:A to
	 * __printk_cpu_sync_try_get:B of the next CPU
	 */
	atomic_set_release(&printk_cpu_sync_owner,
			   -1); /* LMM(__printk_cpu_sync_put:B) */
}
EXPORT_SYMBOL(__printk_cpu_sync_put);
#endif /* CONFIG_SMP */
]

11/2 2026/02/24 21:12 llm expert
1m Model:
gemini-3-flash-preview

Tokens: input: 85949, output: 22, thoughts: 12257

Thoughts:
**Kernel Ringbuffer Data Race Analysis: A Deep Dive**

Okay, here's my analysis of this kernel ringbuffer data race, drawing on my expertise. This is a tricky one, and I've had to consider a number of possibilities. The KCSAN report points to a write and read at the same memory address, `0xffffffff86c7a1c0`. My primary goal is to determine how this overlap happens between seemingly distinct parts of the ringbuffer.

First, I established the context within `kernel/printk/printk.c` and `kernel/printk/printk_ringbuffer.c`, focusing on the `printk_rb_static` ringbuffer, defined using the `_DEFINE_PRINTKRB` macro. I verified the values of `CONFIG_LOG_BUF_SHIFT` and `PRB_AVGBITS` and understood the sizes involved with `__log_buf` (the data buffer) and `_printk_rb_static_descs` (the descriptor array). The key here is that `__log_buf` is a `char` array, and `_printk_rb_static_descs` is an array of `struct prb_desc`.

I considered the case where the rings might overlap and ruled it out, because that would make the kernel image itself invalid. I also considered the dynamic ring buffer, `printk_rb_dynamic`, allocated with `memblock_alloc` in `setup_log_buf`, and ruled it out as well: it lives in the direct map, which does not match the reported address.

The key lies in the KCSAN report itself: it describes a read/write race, not an overlap between the data ring and the descriptor ring, as one might initially think. The stack traces confirm this is a race, not a simple overlap.

I identified the key functions involved: `data_alloc` (the write) and `desc_read` (the read). `data_alloc` writes `blk->id`, and `desc_read` reads `desc->state_var`; these accesses occur on different CPUs, resulting in a data race. The stack traces point to exactly this.

The crux of the matter is the structure of the ringbuffer itself. Both the data block `blk` and the descriptor `desc` contain the descriptor `id` in some form. A read and a write at the same address would imply that the data block and the descriptor overlap. However, because `__log_buf` and `_printk_rb_static_descs` are separate, static arrays, this overlap is impossible.

I focused on the memory layout of `struct prb_desc`, recognizing that the 16-byte read reported by KCSAN at address `0xffffffff86c7a1c0` includes both `state_var` (8 bytes) and the beginning of `text_blk_lpos`. While `atomic_long_read` only reads 8 bytes, the combined read from this offset to the `struct` could be 16 bytes.

The challenge is to figure out HOW `data_alloc` and `desc_read` can access the same address. I considered the possibility of the linker placing those memory regions somewhere unexpected but determined this cannot be the case. I examined how these functions get their data. Both use a unique `id` to index the appropriate slot. However, the `id` in `data_alloc` is separate from the `id` in `desc_read`; if either `id` can be mismanaged, that would explain the race.

Therefore, the only possible explanation is that the `id` used to index the array of descriptors in `desc_read` is invalid, causing it to read the data buffer by mistake. While `DESC_INDEX` ensures that the `id` is within bounds, the actual `id` value, derived from `atomic_long_read(state_var)`, contains some garbage value that causes the access to point to the data block.



12/2 2026/02/24 21:13 tool codesearch-struct-layout
0m Args:
map[Name:prb_desc]

Results:
map[Fields:[map[Name:state_var OffsetBits:0 SizeBits:64] map[Name:text_blk_lpos OffsetBits:64 SizeBits:128]]]

13/2 2026/02/24 21:13 llm expert
0m Model:
gemini-3-flash-preview

Tokens: input: 99142, output: 24, thoughts: 28

Thoughts:

**Analyzing the Data Structure: `prb_desc` and the Significance of `prb_data_block`**

Alright, let's break this down. First, the `prb_desc` size is confirmed: 24 bytes, or 192 bits. That’s our fundamental descriptor block. I’ve probably already got a good idea of what data that holds: probably control bits, address pointers or offset calculations. Considering this information, let's move on to the next element: `prb_data_block`. This is where the real meat of the data sits. Now the question becomes how this `prb_data_block` relates to the `prb_desc`. How is its location or size defined? Is it a fixed-size block linked by offset, or a variable-size block with size defined within the `prb_desc` structure? The real fun begins now: considering the potential relationships and interdependencies between the two is key to understanding the full system. I need to figure out the access pattern and data format of this second component, so I know how the system operates.



14/2 2026/02/24 21:13 tool codesearch-struct-layout
0m Args:
map[Name:prb_data_block]

Results:
map[Fields:[map[Name:id OffsetBits:0 SizeBits:64] map[Name:data OffsetBits:64 SizeBits:0]]]

15/2 2026/02/24 21:13 llm expert
0m Model:
gemini-3-flash-preview

Tokens: input: 99502, output: 21, thoughts: 5511

Thoughts:
**Data Race Analysis: `desc_read` vs. `data_alloc` in `printk_ringbuffer`**

Okay, so I've got a KCSAN report indicating a data race between a read in `desc_read` and a write in `data_alloc` at the same memory address, `0xffffffff86c7a1c0`. My first thought was to look up the symbol at the address to give me a clue of what memory is being accessed. The stack traces point to some inlined functions, which is always tricky.

Initially, I looked at the sizes involved. The report mentions a 16-byte read by `desc_read` (KCSAN is right: the initial assumption that this was an 8-byte `atomic_long_read` was wrong, as confirmed by the stack traces and the `memcpy` call) and an 8-byte write by `data_alloc`. I started looking around `desc_read` for a `memcpy` of a larger structure and found the `memcpy` call at line 533. That `memcpy` wasn't the read causing issues; the read happened at line 496. I had been getting the wrong line number from the source code.

My first hunch was that the race involved the `id` field in the data blocks, but the stack trace for the read side of the race includes `_prb_read_valid` calling `prb_read` which calls `desc_read`. I then figured out that `desc_read` is reading `state_var`, which is part of the descriptor.

I started digging into the relationship between the `__log_buf` used for log data and the descriptors in `_printk_rb_static_descs`. I thought maybe they overlapped, but quickly confirmed that they are separate arrays. That can't be it. I checked the size of `__log_buf` and found the size of the descriptor array, to see if there was some unusual alignment. But that's not it.

Then I realized that `data_alloc` line 1129 is writing `blk->id` and the read happens inside `desc_read`. If `blk` is a descriptor, the race could be on `state_var`. I found that `blk` is derived using `to_block` from the data ring, and `desc` is derived using `to_desc` from the descriptor ring. The `id` is the descriptor ID, and it's used for both. The `id` is a problem because if the data ring and descriptor ring are overlapping, the write to `blk->id` in `data_alloc` and the read of `state_var` in `desc_read` are happening on the same address. The issue is that the code is structured in a way that, under some circumstances, these two memory regions could be overlapping.

However, from the code, they should NOT be overlapping. But, the KCSAN report says there's a race, and that race is on the same address. Now I need to figure out how this is possible. If `to_block(data_ring, begin_lpos)` and `to_desc(desc_ring, id)` are returning the same address, it means the two memory areas might have been misconfigured. The ID being read in `desc_read` is coming from `to_desc` which means there is no race.

Now, going back to the report: it shows an 8-byte write in `data_alloc` to `blk->id` and a 16-byte read in `desc_read`, both at the same address. If the write is `blk->id = id` (a data-block id) and the read is `state_val = atomic_long_read(state_var)`, then `blk->id` and `desc->state_var` would have to be the same memory. The question then is how `blk` could be part of the descriptor.

And finally, `data_alloc` calls `to_block`, where the location in `__log_buf` is assigned. `desc_read` calls `to_desc` where the location of a descriptor is defined. So, there is no way these would point to the same location, unless there's a problem in initialization and `__log_buf` and the descriptor ring are overlapping, but the address range is correct for a non-overlapping configuration.

So, the issue is not in these functions, but potentially the initialization.

In `data_alloc`, I


16/2 2026/02/24 21:14 tool codesearch-find-references
0m Args:
map[Name:data_alloc]

Results:
map[References:[map[ReferenceKind:calls ReferencingEntityKind:function ReferencingEntityName:prb_reserve SourceFile:kernel/printk/printk_ringbuffer.c SourceLine:1724] map[ReferenceKind:calls ReferencingEntityKind:function ReferencingEntityName:prb_reserve_in_last SourceFile:kernel/printk/printk_ringbuffer.c SourceLine:1471]] TruncatedOutput:false]

17/2 2026/02/24 21:14 llm expert
0m Model:
gemini-3-flash-preview

Tokens: input: 106187, output: 42, thoughts: 1199

Thoughts:
**The Breakthrough with printk_ringbuffer.c**

Okay, two calls, both within `printk_ringbuffer.c`.  I initially thought the race was benign due to the lockless design, with `state_var` synchronizing everything and `blk->id` just a check. But, no, I'm onto something. I was hung up on why the addresses were the same.

Then I saw it - a KCSAN report! It pinpoints the race. "read to 0xffffffff86c7a1c0 of 16 bytes," and a "write to 0xffffffff86c7a1c0 of 8 bytes." That 16-byte read... it *has* to be `text_blk_lpos`!

```c
struct prb_desc {
    atomic_long_t           state_var;      // 8 bytes
    struct prb_data_blk_lpos    text_blk_lpos;  // 16 bytes
};
```

If the read starts at `state_var`, that's 24 bytes total. If KCSAN says 16 bytes at `...c1c0`, it's likely reading `text_blk_lpos`. But wait... wasn't `text_blk_lpos` at `...c1c8`? Ah, no. `state_var` is at `...c1b8`, so `text_blk_lpos` is at `...c1c0`. Yes!

If `text_blk_lpos` is at `0xffffffff86c7a1c0`, then it's 16 bytes, and the write, to the same address, is to `text_blk_lpos.begin`, which is 8 bytes.

Then it hits me: the overlapping addresses. In `data_alloc`: `blk->id = id;` is also at that address. But if `blk->id` is at `0xffffffff86c7a1c0` and `desc->text_blk_lpos` is too, they overlap!

Got it!  In `data_alloc`... I need to double-check my line numbers... wait.

```c
1138:  blk_lpos->begin = begin_lpos;
1139:  blk_lpos->next = next_lpos;
```

`blk_lpos` is a pointer passed into `data_alloc`.  From `prb_reserve`:

```c
1724:  r->text_buf = data_alloc(rb, r->text_buf_size, &d->text_blk_lpos, id);
```

So `blk_lpos` *is* `&d->text_blk_lpos`.  Writing to `blk_lpos->begin` is writing to `d->text_blk_lpos.begin`, an 8-byte write!

And `desc_read` reads `text_blk_lpos`:

```c
533:        memcpy(&desc_out->text_blk_lpos, &desc->text_blk_lpos,
534:             sizeof(desc_out->text_blk_lpos)); /* LMM(desc_read:C) */
```

That's a 16-byte read!

BINGO! The race is between:

*   **Writer (CPU 0):** `data_alloc` writing `d->text_blk_lpos.begin = begin_lpos`.
*   **Reader (CPU 1):** `desc_read` reading `desc->text_blk_lpos` via `memcpy`.

Now, why did KCSAN say line 1129? Ah, in *my* file, 1129 is `blk->id = id;`. But... wait!  Let me double-check `data_alloc`:

```c
1104: static char *data_alloc(struct printk_ringbuffer *rb, unsigned int size,
1105:              struct prb_data_blk_lpos *blk_lpos, unsigned long id)
...
1129:   blk->id = id; /* LMM(data_alloc:B) */
...
1138:   blk_lpos->begin = begin_lpos;
1139:   blk_lpos->next = next_lpos;
```

If KCSAN says line 1129, it must be `blk->id = id`.  


18/2 2026/02/24 21:14 tool codesearch-definition-source
0m Args:
map[ContextFile:kernel/printk/printk_ringbuffer.c IncludeLines:true Name:data_alloc]

Results:
map[SourceCode:1048:	/*
1049:	 * Allocate a new data block, invalidating the oldest data block(s)
1050:	 * if necessary. This function also associates the data block with
1051:	 * a specified descriptor.
1052:	 */
1053:	static char *data_alloc(struct printk_ringbuffer *rb, unsigned int size,
1054:				struct prb_data_blk_lpos *blk_lpos, unsigned long id)
1055:	{
1056:		struct prb_data_ring *data_ring = &rb->text_data_ring;
1057:		struct prb_data_block *blk;
1058:		unsigned long begin_lpos;
1059:		unsigned long next_lpos;
1060:	
1061:		if (size == 0) {
1062:			/*
1063:			 * Data blocks are not created for empty lines. Instead, the
1064:			 * reader will recognize these special lpos values and handle
1065:			 * it appropriately.
1066:			 */
1067:			blk_lpos->begin = EMPTY_LINE_LPOS;
1068:			blk_lpos->next = EMPTY_LINE_LPOS;
1069:			return NULL;
1070:		}
1071:	
1072:		size = to_blk_size(size);
1073:	
1074:		begin_lpos = atomic_long_read(&data_ring->head_lpos);
1075:	
1076:		do {
1077:			next_lpos = get_next_lpos(data_ring, begin_lpos, size);
1078:	
1079:			/*
1080:			 * data_check_size() prevents data block allocation that could
1081:			 * cause illegal ringbuffer states. But double check that the
1082:			 * used space will not be bigger than the ring buffer. Wrapped
1083:			 * messages need to reserve more space, see get_next_lpos().
1084:			 *
1085:			 * Specify a data-less block when the check or the allocation
1086:			 * fails.
1087:			 */
1088:			if (WARN_ON_ONCE(next_lpos - begin_lpos > DATA_SIZE(data_ring)) ||
1089:			    !data_push_tail(rb, next_lpos - DATA_SIZE(data_ring))) {
1090:				blk_lpos->begin = FAILED_LPOS;
1091:				blk_lpos->next = FAILED_LPOS;
1092:				return NULL;
1093:			}
1094:	
1095:			/*
1096:			 * 1. Guarantee any descriptor states that have transitioned
1097:			 *    to reusable are stored before modifying the newly
1098:			 *    allocated data area. A full memory barrier is needed
1099:			 *    since other CPUs may have made the descriptor states
1100:			 *    reusable. See data_push_tail:A about why the reusable
1101:			 *    states are visible. This pairs with desc_read:D.
1102:			 *
1103:			 * 2. Guarantee any updated tail lpos is stored before
1104:			 *    modifying the newly allocated data area. Another CPU may
1105:			 *    be in data_make_reusable() and is reading a block ID
1106:			 *    from this area. data_make_reusable() can handle reading
1107:			 *    a garbage block ID value, but then it must be able to
1108:			 *    load a new tail lpos. A full memory barrier is needed
1109:			 *    since other CPUs may have updated the tail lpos. This
1110:			 *    pairs with data_push_tail:B.
1111:			 */
1112:		} while (!atomic_long_try_cmpxchg(&data_ring->head_lpos, &begin_lpos,
1113:						  next_lpos)); /* LMM(data_alloc:A) */
1114:	
1115:		blk = to_block(data_ring, begin_lpos);
1116:		blk->id = id; /* LMM(data_alloc:B) */
1117:	
1118:		if (is_blk_wrapped(data_ring, begin_lpos, next_lpos)) {
1119:			/* Wrapping data blocks store their data at the beginning. */
1120:			blk = to_block(data_ring, 0);
1121:	
1122:			/*
1123:			 * Store the ID on the wrapped block for consistency.
1124:			 * The printk_ringbuffer does not actually use it.
1125:			 */
1126:			blk->id = id;
1127:		}
1128:	
1129:		blk_lpos->begin = begin_lpos;
1130:		blk_lpos->next = next_lpos;
1131:	
1132:		return &blk->data[0];
1133:	}
 SourceFile:kernel/printk/printk_ringbuffer.c]

19/2 2026/02/24 21:14 llm expert
0m Model:
gemini-3-flash-preview

Tokens: input: 109875, output: 42, thoughts: 200

Thoughts:
**Analyzing Descriptor Ring Access**

Okay, here's what's running through my mind as I dig into this code. Aha! I'm looking at this particular version of the file, and I see on lines 1129 and 1130:

```c
1129:		blk_lpos->begin = begin_lpos;
1130:		blk_lpos->next = next_lpos;
```

So, line 1129 *definitely* sets `blk_lpos->begin` equal to `begin_lpos`. Now, recalling the data structures, `blk_lpos` is essentially a pointer to `&d->text_blk_lpos`. That means we're writing directly to the descriptor ring, a key piece of our I/O process.

Now, I need to check the read side to verify my hypothesis. That code snippet I have stored in my brain… Ah, yes, the read function, `desc_read`. I see:

```c
496: 	state_val = atomic_long_read(state_var); /* LMM(desc_read:A) */
```

Now, I'm pretty sure that reading 16 bytes is an indication that we are using the `memcpy` of the `text_blk_lpos` struct. I should check the line numbers in this version to make sure I'm working from the correct version of the code.



20/2 2026/02/24 21:14 tool codesearch-definition-source
0m Args:
map[ContextFile:kernel/printk/printk_ringbuffer.c IncludeLines:true Name:desc_read]

Results:
map[SourceCode: 441:	/*
 442:	 * Get a copy of a specified descriptor and return its queried state. If the
 443:	 * descriptor is in an inconsistent state (miss or reserved), the caller can
 444:	 * only expect the descriptor's @state_var field to be valid.
 445:	 *
 446:	 * The sequence number and caller_id can be optionally retrieved. Like all
 447:	 * non-state_var data, they are only valid if the descriptor is in a
 448:	 * consistent state.
 449:	 */
 450:	static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
 451:					 unsigned long id, struct prb_desc *desc_out,
 452:					 u64 *seq_out, u32 *caller_id_out)
 453:	{
 454:		struct printk_info *info = to_info(desc_ring, id);
 455:		struct prb_desc *desc = to_desc(desc_ring, id);
 456:		atomic_long_t *state_var = &desc->state_var;
 457:		enum desc_state d_state;
 458:		unsigned long state_val;
 459:	
 460:		/* Check the descriptor state. */
 461:		state_val = atomic_long_read(state_var); /* LMM(desc_read:A) */
 462:		d_state = get_desc_state(id, state_val);
 463:		if (d_state == desc_miss || d_state == desc_reserved) {
 464:			/*
 465:			 * The descriptor is in an inconsistent state. Set at least
 466:			 * @state_var so that the caller can see the details of
 467:			 * the inconsistent state.
 468:			 */
 469:			goto out;
 470:		}
 471:	
 472:		/*
 473:		 * Guarantee the state is loaded before copying the descriptor
 474:		 * content. This avoids copying obsolete descriptor content that might
 475:		 * not apply to the descriptor state. This pairs with _prb_commit:B.
 476:		 *
 477:		 * Memory barrier involvement:
 478:		 *
 479:		 * If desc_read:A reads from _prb_commit:B, then desc_read:C reads
 480:		 * from _prb_commit:A.
 481:		 *
 482:		 * Relies on:
 483:		 *
 484:		 * WMB from _prb_commit:A to _prb_commit:B
 485:		 *    matching
 486:		 * RMB from desc_read:A to desc_read:C
 487:		 */
 488:		smp_rmb(); /* LMM(desc_read:B) */
 489:	
 490:		/*
 491:		 * Copy the descriptor data. The data is not valid until the
 492:		 * state has been re-checked. A memcpy() for all of @desc
 493:		 * cannot be used because of the atomic_t @state_var field.
 494:		 */
 495:		if (desc_out) {
 496:			memcpy(&desc_out->text_blk_lpos, &desc->text_blk_lpos,
 497:			       sizeof(desc_out->text_blk_lpos)); /* LMM(desc_read:C) */
 498:		}
 499:		if (seq_out)
 500:			*seq_out = info->seq; /* also part of desc_read:C */
 501:		if (caller_id_out)
 502:			*caller_id_out = info->caller_id; /* also part of desc_read:C */
 503:	
 504:		/*
 505:		 * 1. Guarantee the descriptor content is loaded before re-checking
 506:		 *    the state. This avoids reading an obsolete descriptor state
 507:		 *    that may not apply to the copied content. This pairs with
 508:		 *    desc_reserve:F.
 509:		 *
 510:		 *    Memory barrier involvement:
 511:		 *
 512:		 *    If desc_read:C reads from desc_reserve:G, then desc_read:E
 513:		 *    reads from desc_reserve:F.
 514:		 *
 515:		 *    Relies on:
 516:		 *
 517:		 *    WMB from desc_reserve:F to desc_reserve:G
 518:		 *       matching
 519:		 *    RMB from desc_read:C to desc_read:E
 520:		 *
 521:		 * 2. Guarantee the record data is loaded before re-checking the
 522:		 *    state. This avoids reading an obsolete descriptor state that may
 523:		 *    not apply to the copied data. This pairs with data_alloc:A and
 524:		 *    data_realloc:A.
 525:		 *
 526:		 *    Memory barrier involvement:
 527:		 *
 528:		 *    If copy_data:A reads from data_alloc:B, then desc_read:E
 529:		 *    reads from desc_make_reusable:A.
 530:		 *
 531:		 *    Relies on:
 532:		 *
 533:		 *    MB from desc_make_reusable:A to data_alloc:B
 534:		 *       matching
 535:		 *    RMB from desc_read:C to desc_read:E
 536:		 *
 537:		 *    Note: desc_make_reusable:A and data_alloc:B can be different
 538:		 *          CPUs. However, the data_alloc:B CPU (which performs the
 539:		 *          full memory barrier) must have previously seen
 540:		 *          desc_make_reusable:A.
 541:		 */
 542:		smp_rmb(); /* LMM(desc_read:D) */
 543:	
 544:		/*
 545:		 * The data has been copied. Return the current descriptor state,
 546:		 * which may have changed since the load above.
 547:		 */
 548:		state_val = atomic_long_read(state_var); /* LMM(desc_read:E) */
 549:		d_state = get_desc_state(id, state_val);
 550:	out:
 551:		if (desc_out)
 552:			atomic_long_set(&desc_out->state_var, state_val);
 553:		return d_state;
 554:	}
 SourceFile:kernel/printk/printk_ringbuffer.c]

21/2 2026/02/24 21:14 llm expert
0m Model:
gemini-3-flash-preview

Tokens: input: 112336, output: 17, thoughts: 3683

Thoughts:
**Analysis of a Potential Data Race in `printk_ringbuffer`**

Okay, so I've dug into this data race reported by KCSAN and I think I've got it. The core issue revolves around a potential race condition between a writer, in `data_alloc`, initializing a descriptor's `text_blk_lpos` field (a 16-byte structure), and a reader, in `desc_read`, reading that same field. Specifically, the race is between a writer writing to the `text_blk_lpos.begin` and the reader doing a `memcpy` of the whole struct.

The initial thought was, "How can the reader access a descriptor being initialized?" But then I realized the reader only reads descriptors the system *believes* are finalized or committed. The `prb_read` path checks the descriptor's state using `atomic_long_read(state_var)` before proceeding; if the state indicates the descriptor is reserved, the read is skipped. However, `data_alloc` is called *during* `prb_reserve`, which is *before* commitment/finalization, so in-flight reservations are normally skipped.

The key is descriptor recycling. A writer can change the state of a descriptor from `desc_reusable` to `desc_reserved`, fill in data, and *then* commit. Meanwhile, the reader might read the `state_var`, see the *old* state as finalized, and *then* the writer recycles the descriptor. The reader then reads contents that no longer belong to the record it validated, which may be garbage.

The `printk_ringbuffer` code *intentionally* handles this. After the `memcpy`, the reader *re-checks* the `state_var`. If the state has changed (e.g., ID mismatch or the descriptor is now reserved), the data is discarded and the reader understands that it has to skip the descriptor. The code also has barriers at various points using `smp_rmb()` to make sure the memory operations happen in the order we expect.

So, the reader sees garbage, but the design accounts for this. This is a standard lockless pattern. KCSAN flags this as a data race because there is no explicit synchronization, but the re-check *is* the synchronization. This pattern is mentioned in the comments.

The question then becomes: is this "benign"? Here "benign" means not leading to memory corruption or crashes. The code is doing a `memcpy` of a 16-byte structure, which on 64-bit systems is likely performed as two 8-byte loads, so the reader could observe a torn, partially updated value from the writer. However, because the re-check of `state_var` discards the copy if the descriptor has changed, I do not expect a crash.

The `text_blk_lpos` is copied into the caller's buffer, and `info->seq` and `info->caller_id` are read, but all of this data comes from valid memory (the descriptor ring and the infos array).

So the only way for a crash is if the memory accesses are invalid, and these appear to be valid. The code is structured correctly, with all memory access locations validated.

There is a final re-check that returns an error state, so there is no risk of a crash.

In KCSAN terms, a benign race is typically on simple variable accesses, not something like a `memcpy` of a 16-byte struct. However, the accesses here are protected not by explicit mutexes but by a lockless state machine, which is part of the design. It's intended that `data_alloc` will write to the `text_blk_lpos` field and that `desc_read` will be able to read it *eventually*. There are `LMM` comments and memory barriers.

In conclusion, the data race is an intentional part of this lockless algorithm, though the unannotated accesses cause KCSAN to report it. My recommendation is to mark this data race as "intentional" to silence KCSAN, since it is part of a correctly implemented lockless synchronization scheme. This race will *not* lead to corruption or a crash.



22/2 2026/02/24 21:15 tool grepper
0m Args:
map[Expression:data_race]

Results:
map[Output:
Full output is too long, showing 500 out of 1152 lines.
Use more precise expression if possible.

[Documentation/dev-tools/checkpatch.rst=456=Comments
 --
 Documentation/dev-tools/checkpatch.rst-476-  **DATA_RACE**
 Documentation/dev-tools/checkpatch.rst:477:    Applications of data_race() should have a comment so as to document the
 Documentation/dev-tools/checkpatch.rst-478-    reasoning behind why it was deemed safe.
 --
 Documentation/dev-tools/kcsan.rst=87=the below options are available:
 Documentation/dev-tools/kcsan.rst-88-
 Documentation/dev-tools/kcsan.rst:89:* KCSAN understands the ``data_race(expr)`` annotation, which tells KCSAN that
 Documentation/dev-tools/kcsan.rst-90-  any data races due to accesses in ``expr`` should be ignored and resulting
 --
 Documentation/dev-tools/kcsan.rst-93-
 Documentation/dev-tools/kcsan.rst:94:* Similar to ``data_race(...)``, the type qualifier ``__data_racy`` can be used
 Documentation/dev-tools/kcsan.rst-95-  to document that all data races due to accesses to a variable are intended
 --
 Documentation/translations/zh_CN/dev-tools/kcsan.rst=24=KCSAN 提供了几个其他的配置选项来自定义行为(见 ``lib/Kconfig.kcsan`` 中的各自的
 --
 Documentation/translations/zh_CN/dev-tools/kcsan.rst-87-
 Documentation/translations/zh_CN/dev-tools/kcsan.rst:88:* KCSAN 支持使用 ``data_race(expr)`` 注解,这个注解告诉 KCSAN 任何由访问
 Documentation/translations/zh_CN/dev-tools/kcsan.rst-89-  ``expr`` 所引起的数据竞争都应该被忽略,其产生的行为后果被认为是安全的。请查阅
 --
 Documentation/translations/zh_CN/dev-tools/kcsan.rst-91-
 Documentation/translations/zh_CN/dev-tools/kcsan.rst:92:* 与 ``data_race(...)`` 相似,可以使用类型限定符 ``__data_racy`` 来标记一个变量
 Documentation/translations/zh_CN/dev-tools/kcsan.rst-93-  ,所有访问该变量而导致的数据竞争都是故意为之并且应该被 KCSAN 忽略::
 --
 arch/powerpc/kernel/interrupt.c=103=static notrace void check_return_regs_valid(struct pt_regs *regs)
 --
 arch/powerpc/kernel/interrupt.c-171-
 arch/powerpc/kernel/interrupt.c:172:	if (!data_race(warned)) {
 arch/powerpc/kernel/interrupt.c:173:		data_race(warned = true);
 arch/powerpc/kernel/interrupt.c-174-		printk("%sSRR0 was: %lx should be: %lx\n", h, srr0, regs->nip);
 --
 arch/x86/mm/pat/set_memory.c=161=static inline void cpa_inc_4k_install(void)
 arch/x86/mm/pat/set_memory.c-162-{
 arch/x86/mm/pat/set_memory.c:163:	data_race(cpa_4k_install++);
 arch/x86/mm/pat/set_memory.c-164-}
 --
 drivers/block/loop.c=1851=static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 --
 drivers/block/loop.c-1859-
 drivers/block/loop.c:1860:	if (data_race(READ_ONCE(lo->lo_state)) != Lo_bound)
 drivers/block/loop.c-1861-		return BLK_STS_IOERR;
 --
 drivers/block/loop.c=2190=static int loop_control_get_free(int idx)
 --
 drivers/block/loop.c-2203-		if (lo->idr_visible &&
 drivers/block/loop.c:2204:		    data_race(READ_ONCE(lo->lo_state)) == Lo_unbound)
 drivers/block/loop.c-2205-			goto found;
 --
 drivers/block/zloop.c=693=static blk_status_t zloop_queue_rq(struct blk_mq_hw_ctx *hctx,
 --
 drivers/block/zloop.c-699-
 drivers/block/zloop.c:700:	if (data_race(READ_ONCE(zlo->state)) == Zlo_deleting)
 drivers/block/zloop.c-701-		return BLK_STS_IOERR;
 --
 drivers/net/wireless/ath/ath9k/htc_drv_txrx.c=1135=void ath9k_htc_rxep(void *drv_priv, struct sk_buff *skb,
 --
 drivers/net/wireless/ath/ath9k/htc_drv_txrx.c-1144-	/* Check if ath9k_rx_init() completed. */
 drivers/net/wireless/ath/ath9k/htc_drv_txrx.c:1145:	if (!data_race(priv->rx.initialized))
 drivers/net/wireless/ath/ath9k/htc_drv_txrx.c-1146-		goto err;
 --
 drivers/net/wireless/ath/ath9k/wmi.c=138=void ath9k_wmi_event_tasklet(struct tasklet_struct *t)
 --
 drivers/net/wireless/ath/ath9k/wmi.c-158-		/* Check if ath9k_htc_probe_device() completed. */
 drivers/net/wireless/ath/ath9k/wmi.c:159:		if (!data_race(priv->initialized)) {
 drivers/net/wireless/ath/ath9k/wmi.c-160-			kfree_skb(skb);
 --
 drivers/virtio/virtio_ring.c=3189=bool virtqueue_enable_cb_delayed(struct virtqueue *_vq)
 --
 drivers/virtio/virtio_ring.c-3193-	if (vq->event_triggered)
 drivers/virtio/virtio_ring.c:3194:		data_race(vq->event_triggered = false);
 drivers/virtio/virtio_ring.c-3195-
 --
 drivers/virtio/virtio_ring.c=3229=irqreturn_t vring_interrupt(int irq, void *_vq)
 --
 drivers/virtio/virtio_ring.c-3249-	if (vq->event)
 drivers/virtio/virtio_ring.c:3250:		data_race(vq->event_triggered = true);
 drivers/virtio/virtio_ring.c-3251-
 --
 fs/btrfs/block-group.c=1827=static int reclaim_bgs_cmp(void *unused, const struct list_head *a,
 --
 fs/btrfs/block-group.c-1841-	 */
 fs/btrfs/block-group.c:1842:	return data_race(bg1->used > bg2->used);
 fs/btrfs/block-group.c-1843-}
 --
 fs/btrfs/block-rsv.h=108=static inline bool btrfs_block_rsv_full(const struct btrfs_block_rsv *rsv)
 fs/btrfs/block-rsv.h-109-{
 fs/btrfs/block-rsv.h:110:	return data_race(rsv->full);
 fs/btrfs/block-rsv.h-111-}
 --
 fs/btrfs/extent-tree.c=2029=static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 --
 fs/btrfs/extent-tree.c-2046-		 */
 fs/btrfs/extent-tree.c:2047:		max_count = data_race(delayed_refs->num_heads_ready);
 fs/btrfs/extent-tree.c-2048-		min_bytes = U64_MAX;
 --
 fs/btrfs/file.c=138=int btrfs_drop_extents(struct btrfs_trans_handle *trans,
 --
 fs/btrfs/file.c-180-
 fs/btrfs/file.c:181:	if (data_race(args->start >= inode->disk_i_size) && !args->replace_extent)
 fs/btrfs/file.c-182-		modify_tree = 0;
 --
 fs/btrfs/space-info.c=1972=u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
 --
 fs/btrfs/space-info.c-1978-	/* It's df, we don't care if it's racy */
 fs/btrfs/space-info.c:1979:	if (data_race(list_empty(&sinfo->ro_bgs)))
 fs/btrfs/space-info.c-1980-		return 0;
 --
 fs/btrfs/tree-log.c=3732=static int inode_logged(const struct btrfs_trans_handle *trans,
 --
 fs/btrfs/tree-log.c-3743-	 */
 fs/btrfs/tree-log.c:3744:	if (data_race(inode->logged_trans) == trans->transid)
 fs/btrfs/tree-log.c-3745-		return 1;
 --
 fs/btrfs/zoned.c=3191=void btrfs_show_zoned_stats(struct btrfs_fs_info *fs_info, struct seq_file *seq)
 --
 fs/btrfs/zoned.c-3212-
 fs/btrfs/zoned.c:3213:	data_reloc_bg = data_race(fs_info->data_reloc_bg);
 fs/btrfs/zoned.c-3214-	if (data_reloc_bg)
 --
 fs/btrfs/zoned.c-3216-			   data_reloc_bg);
 fs/btrfs/zoned.c:3217:	treelog_bg = data_race(fs_info->treelog_bg);
 fs/btrfs/zoned.c-3218-	if (treelog_bg)
 --
 fs/inode.c=112=static unsigned long get_mg_ctime_updates(void)
 --
 fs/inode.c-117-	for_each_possible_cpu(i)
 fs/inode.c:118:		sum += data_race(per_cpu(mg_ctime_updates, i));
 fs/inode.c-119-	return sum;
 --
 fs/inode.c=122=static unsigned long get_mg_fine_stamps(void)
 --
 fs/inode.c-127-	for_each_possible_cpu(i)
 fs/inode.c:128:		sum += data_race(per_cpu(mg_fine_stamps, i));
 fs/inode.c-129-	return sum;
 --
 fs/inode.c=132=static unsigned long get_mg_ctime_swaps(void)
 --
 fs/inode.c-137-	for_each_possible_cpu(i)
 fs/inode.c:138:		sum += data_race(per_cpu(mg_ctime_swaps, i));
 fs/inode.c-139-	return sum;
 --
 fs/iomap/direct-io.c=377=static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 --
 fs/iomap/direct-io.c-518-		 */
 fs/iomap/direct-io.c:519:		if (unlikely(data_race(dio->error)))
 fs/iomap/direct-io.c-520-			goto out;
 --
 fs/jbd2/transaction.c=312=static int start_this_handle(journal_t *journal, handle_t *handle,
 --
 fs/jbd2/transaction.c-343-	 */
 fs/jbd2/transaction.c:344:	if (!data_race(journal->j_running_transaction)) {
 fs/jbd2/transaction.c-345-		/*
 --
 fs/jbd2/transaction.c=1491=int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
 --
 fs/jbd2/transaction.c-1514-	 */
 fs/jbd2/transaction.c:1515:	if (data_race(jh->b_transaction != transaction &&
 fs/jbd2/transaction.c-1516-	    jh->b_next_transaction != transaction)) {
 --
 fs/jbd2/transaction.c-1521-	}
 fs/jbd2/transaction.c:1522:	if (data_race(jh->b_modified == 1)) {
 fs/jbd2/transaction.c-1523-		/* If it's in our transaction it must be in BJ_Metadata list. */
 fs/jbd2/transaction.c:1524:		if (data_race(jh->b_transaction == transaction &&
 fs/jbd2/transaction.c-1525-		    jh->b_jlist != BJ_Metadata)) {
 --
 fs/xfs/xfs_trace.h=1327=DECLARE_EVENT_CLASS(xfs_dquot_class,
 --
 fs/xfs/xfs_trace.h-1355-		__entry->flags = dqp->q_flags;
 fs/xfs/xfs_trace.h:1356:		__entry->nrefs = data_race(dqp->q_lockref.count);
 fs/xfs/xfs_trace.h-1357-
 --
 include/linux/bpf.h=519=static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
 --
 include/linux/bpf.h-525-	while (size--)
 include/linux/bpf.h:526:		data_race(*ldst++ = *lsrc++);
 include/linux/bpf.h-527-}
 --
 include/linux/compiler.h=15=void ftrace_likely_update(struct ftrace_likely_data *f, int val,
 --
 include/linux/compiler.h-172-/**
 include/linux/compiler.h:173: * data_race - mark an expression as containing intentional data races
 include/linux/compiler.h-174- *
 include/linux/compiler.h:175: * This data_race() macro is useful for situations in which data races
 include/linux/compiler.h-176- * should be forgiven.  One example is diagnostic code that accesses
 --
 include/linux/compiler.h-180- * be plain C-language accesses and those in the diagnostic code should
 include/linux/compiler.h:181: * use data_race().  This way, KCSAN will complain if buggy lockless
 include/linux/compiler.h-182- * accesses to that variable are introduced, even if the buggy accesses
 --
 include/linux/compiler.h-186- * to tooling that data races here are to be ignored.  If the access must
 include/linux/compiler.h:187: * be atomic *and* KCSAN should ignore the access, use both data_race()
 include/linux/compiler.h:188: * and READ_ONCE(), for example, data_race(READ_ONCE(x)).
 include/linux/compiler.h-189- */
 include/linux/compiler.h:190:#define data_race(expr)							\
 include/linux/compiler.h-191-({									\
 --
 include/linux/console.h=522=static inline short console_srcu_read_flags(const struct console *con)
 --
 include/linux/console.h-529-	 */
 include/linux/console.h:530:	return data_race(READ_ONCE(con->flags));
 include/linux/console.h-531-}
 --
 include/linux/mm.h=3135=static inline void update_hiwater_rss(struct mm_struct *mm)
 --
 include/linux/mm.h-3138-
 include/linux/mm.h:3139:	if (data_race(mm->hiwater_rss) < _rss)
 include/linux/mm.h:3140:		data_race(mm->hiwater_rss = _rss);
 include/linux/mm.h-3141-}
 --
 include/linux/page-flags.h=952=static inline bool page_has_type(const struct page *page)
 include/linux/page-flags.h-953-{
 include/linux/page-flags.h:954:	return page_type_has_type(data_race(page->page_type));
 include/linux/page-flags.h-955-}
 --
 include/linux/page-flags.h=958=static __always_inline bool folio_test_##fname(const struct folio *folio) \
 include/linux/page-flags.h-959-{									\
 include/linux/page-flags.h:960:	return data_race(folio->page.page_type >> 24) == PGTY_##lname;	\
 include/linux/page-flags.h-961-}									\
 include/linux/page-flags.h=962=static __always_inline void __folio_set_##fname(struct folio *folio)	\
 --
 include/linux/page-flags.h-965-		return;							\
 include/linux/page-flags.h:966:	VM_BUG_ON_FOLIO(data_race(folio->page.page_type) != UINT_MAX,	\
 include/linux/page-flags.h-967-			folio);						\
 --
 include/linux/page-flags.h=980=static __always_inline int Page##uname(const struct page *page)		\
 include/linux/page-flags.h-981-{									\
 include/linux/page-flags.h:982:	return data_race(page->page_type >> 24) == PGTY_##lname;	\
 include/linux/page-flags.h-983-}									\
 include/linux/page-flags.h=984=static __always_inline void __SetPage##uname(struct page *page)		\
 --
 include/linux/page-flags.h-987-		return;							\
 include/linux/page-flags.h:988:	VM_BUG_ON_PAGE(data_race(page->page_type) != UINT_MAX, page);	\
 include/linux/page-flags.h-989-	page->page_type = (unsigned int)PGTY_##lname << 24;		\
 --
 include/linux/rseq_entry.h=113=static __always_inline bool rseq_grant_slice_extension(bool work_pending)
 --
 include/linux/rseq_entry.h-162-	/* Store expiry time for arming the timer on the way out */
 include/linux/rseq_entry.h:163:	curr->rseq.slice.expires = data_race(rseq_slice_ext_nsecs) + ktime_get_mono_fast_ns();
 include/linux/rseq_entry.h-164-	/*
 --
 include/linux/srcutiny.h=139=static inline void srcu_torture_stats_print(struct srcu_struct *ssp,
 --
 include/linux/srcutiny.h-143-
 include/linux/srcutiny.h:144:	idx = ((data_race(READ_ONCE(ssp->srcu_idx)) + 1) & 0x2) >> 1;
 include/linux/srcutiny.h-145-	pr_alert("%s%s Tiny SRCU per-CPU(idx=%d): (%hd,%hd) gp: %lu->%lu\n",
 include/linux/srcutiny.h-146-		 tt, tf, idx,
 include/linux/srcutiny.h:147:		 data_race(READ_ONCE(ssp->srcu_lock_nesting[!idx])),
 include/linux/srcutiny.h:148:		 data_race(READ_ONCE(ssp->srcu_lock_nesting[idx])),
 include/linux/srcutiny.h:149:		 data_race(READ_ONCE(ssp->srcu_idx)),
 include/linux/srcutiny.h:150:		 data_race(READ_ONCE(ssp->srcu_idx_max)));
 include/linux/srcutiny.h-151-}
 --
 include/net/sock.h=2517=static inline int sock_error(struct sock *sk)
 --
 include/net/sock.h-2523-	 */
 include/net/sock.h:2524:	if (likely(data_race(!sk->sk_err)))
 include/net/sock.h-2525-		return 0;
 --
 io_uring/fdinfo.c=60=static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
 --
 io_uring/fdinfo.c-86-	seq_printf(m, "SqTail:\t%u\n", sq_tail);
 io_uring/fdinfo.c:87:	seq_printf(m, "CachedSqHead:\t%u\n", data_race(ctx->cached_sq_head));
 io_uring/fdinfo.c-88-	seq_printf(m, "CqMask:\t0x%x\n", cq_mask);
 --
 io_uring/fdinfo.c-90-	seq_printf(m, "CqTail:\t%u\n", cq_tail);
 io_uring/fdinfo.c:91:	seq_printf(m, "CachedCqTail:\t%u\n", data_race(ctx->cached_cq_tail));
 io_uring/fdinfo.c-92-	seq_printf(m, "SQEs:\t%u\n", sq_tail - sq_head);
 --
 io_uring/refs.h=21=static inline bool req_ref_put_and_test_atomic(struct io_kiocb *req)
 io_uring/refs.h-22-{
 io_uring/refs.h:23:	WARN_ON_ONCE(!(data_race(req->flags) & REQ_F_REFCOUNT));
 io_uring/refs.h-24-	WARN_ON_ONCE(req_ref_zero_or_close_to_overflow(req));
 --
 io_uring/timeout.c=595=int io_timeout(struct io_kiocb *req, unsigned int issue_flags)
 --
 io_uring/timeout.c-614-
 io_uring/timeout.c:615:	tail = data_race(ctx->cached_cq_tail) - atomic_read(&ctx->cq_timeouts);
 io_uring/timeout.c-616-	timeout->target_seq = tail + off;
 --
 kernel/fork.c=1964=__latent_entropy struct task_struct *copy_process(
 --
 kernel/fork.c-2101-	retval = -EAGAIN;
 kernel/fork.c:2102:	if (data_race(nr_threads >= max_threads))
 kernel/fork.c-2103-		goto bad_fork_cleanup_count;
 --
 kernel/irq/irqdesc.c=1002=static unsigned int kstat_irqs_desc(struct irq_desc *desc, const struct cpumask *cpumask)
 --
 kernel/irq/irqdesc.c-1009-	    !irq_is_nmi(desc))
 kernel/irq/irqdesc.c:1010:		return data_race(desc->tot_count);
 kernel/irq/irqdesc.c-1011-
 kernel/irq/irqdesc.c-1012-	for_each_cpu(cpu, cpumask)
 kernel/irq/irqdesc.c:1013:		sum += data_race(per_cpu(desc->kstat_irqs->cnt, cpu));
 kernel/irq/irqdesc.c-1014-	return sum;
 --
 kernel/kcsan/core.c=527=kcsan_setup_watchpoint(const volatile void *ptr, size_t size, int type, unsigned long ip)
 --
 kernel/kcsan/core.c-643-	 */
 kernel/kcsan/core.c:644:	if (diff && !kcsan_ignore_data_race(size, type, old, new, diff))
 kernel/kcsan/core.c-645-		value_change = KCSAN_VALUE_CHANGE_TRUE;
 --
 kernel/kcsan/core.c-667-		/*
 kernel/kcsan/core.c:668:		 * No need to increment 'data_races' counter, as the racing
 kernel/kcsan/core.c-669-		 * thread already did.
 --
 kernel/kcsan/debugfs.c=26=static const char *const counter_names[] = {
 --
 kernel/kcsan/debugfs.c-28-	[KCSAN_COUNTER_SETUP_WATCHPOINTS]		= "setup_watchpoints",
 kernel/kcsan/debugfs.c:29:	[KCSAN_COUNTER_DATA_RACES]			= "data_races",
 kernel/kcsan/debugfs.c-30-	[KCSAN_COUNTER_ASSERT_FAILURES]			= "assert_failures",
 --
 kernel/kcsan/debugfs.c=138=static ssize_t insert_report_filterlist(const char *func)
 --
 kernel/kcsan/debugfs.c-156-	 */
 kernel/kcsan/debugfs.c:157:	if (data_race(report_filterlist.used == report_filterlist.size)) {
 kernel/kcsan/debugfs.c-158-		new_size = (report_filterlist.size ?: 4) * 2;
 --
 kernel/kcsan/kcsan_test.c=358=static noinline void test_kernel_write_uninstrumented(void) { test_var++; }
 kernel/kcsan/kcsan_test.c-359-
 kernel/kcsan/kcsan_test.c:360:static noinline void test_kernel_data_race(void) { data_race(test_var++); }
 kernel/kcsan/kcsan_test.c-361-
 --
 kernel/kcsan/kcsan_test.c=976=static void test_zero_size_access(struct kunit *test)
 --
 kernel/kcsan/kcsan_test.c-1001-
 kernel/kcsan/kcsan_test.c:1002:/* Test the data_race() macro. */
 kernel/kcsan/kcsan_test.c-1003-__no_kcsan
 kernel/kcsan/kcsan_test.c:1004:static void test_data_race(struct kunit *test)
 kernel/kcsan/kcsan_test.c-1005-{
 --
 kernel/kcsan/kcsan_test.c-1007-
 kernel/kcsan/kcsan_test.c:1008:	begin_test_checks(test_kernel_data_race, test_kernel_data_race);
 kernel/kcsan/kcsan_test.c-1009-	do {
 --
 kernel/kcsan/kcsan_test.c=1428=static struct kunit_case kcsan_test_cases[] = {
 --
 kernel/kcsan/kcsan_test.c-1441-	KCSAN_KUNIT_CASE(test_zero_size_access),
 kernel/kcsan/kcsan_test.c:1442:	KCSAN_KUNIT_CASE(test_data_race),
 kernel/kcsan/kcsan_test.c-1443-	KCSAN_KUNIT_CASE(test_data_racy_qualifier),
 --
 kernel/kcsan/permissive.h=37=static bool
 kernel/kcsan/permissive.h:38:kcsan_ignore_data_race(size_t size, int type, u64 old, u64 new, u64 diff)
 kernel/kcsan/permissive.h-39-{
 --
 kernel/locking/locktorture.c=1000=static void __torture_print_stats(char *page,
 --
 kernel/locking/locktorture.c-1005-	int i, n_stress;
 kernel/locking/locktorture.c:1006:	long max = 0, min = statp ? data_race(statp[0].n_lock_acquired) : 0;
 kernel/locking/locktorture.c-1007-	long long sum = 0;
 --
 kernel/locking/locktorture.c-1010-	for (i = 0; i < n_stress; i++) {
 kernel/locking/locktorture.c:1011:		if (data_race(statp[i].n_lock_fail))
 kernel/locking/locktorture.c-1012-			fail = true;
 kernel/locking/locktorture.c:1013:		cur = data_race(statp[i].n_lock_acquired);
 kernel/locking/locktorture.c-1014-		sum += cur;
 --
 kernel/locking/osq_lock.c=93=bool osq_lock(struct optimistic_spin_queue *lock)
 --
 kernel/locking/osq_lock.c-163-		 */
 kernel/locking/osq_lock.c:164:		if (data_race(prev->next) == node &&
 kernel/locking/osq_lock.c-165-		    cmpxchg(&prev->next, node, NULL) == node)
 --
 kernel/printk/nbcon.c=980=static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt, bool use_atomic)
 --
 kernel/printk/nbcon.c-1025-	 */
 kernel/printk/nbcon.c:1026:	con_dropped = data_race(READ_ONCE(con->dropped));
 kernel/printk/nbcon.c-1027-
 --
 kernel/rcu/rcutorture.c=2719=rcu_torture_stats_print(void)
 --
 kernel/rcu/rcutorture.c-2778-	pr_cont("barrier: %ld/%ld:%ld ",
 kernel/rcu/rcutorture.c:2779:		data_race(n_barrier_successes),
 kernel/rcu/rcutorture.c:2780:		data_race(n_barrier_attempts),
 kernel/rcu/rcutorture.c:2781:		data_race(n_rcu_torture_barrier_error));
 kernel/rcu/rcutorture.c:2782:	pr_cont("read-exits: %ld ", data_race(n_read_exits)); // Statistic.
 kernel/rcu/rcutorture.c-2783-	pr_cont("nocb-toggles: %ld:%ld ",
 --
 kernel/rcu/srcutree.c=2039=void srcu_torture_stats_print(struct srcu_struct *ssp, char *tt, char *tf)
 --
 kernel/rcu/srcutree.c-2064-			sdp = per_cpu_ptr(ssp->sda, cpu);
 kernel/rcu/srcutree.c:2065:			u0 = data_race(atomic_long_read(&sdp->srcu_ctrs[!idx].srcu_unlocks));
 kernel/rcu/srcutree.c:2066:			u1 = data_race(atomic_long_read(&sdp->srcu_ctrs[idx].srcu_unlocks));
 kernel/rcu/srcutree.c-2067-
 --
 kernel/rcu/srcutree.c-2073-
 kernel/rcu/srcutree.c:2074:			l0 = data_race(atomic_long_read(&sdp->srcu_ctrs[!idx].srcu_locks));
 kernel/rcu/srcutree.c:2075:			l1 = data_race(atomic_long_read(&sdp->srcu_ctrs[idx].srcu_locks));
 kernel/rcu/srcutree.c-2076-
 --
 kernel/rcu/tasks.h=233=static const char *tasks_gp_state_getname(struct rcu_tasks *rtp)
 kernel/rcu/tasks.h-234-{
 kernel/rcu/tasks.h:235:	int i = data_race(rtp->gp_state); // Let KCSAN detect update races
 kernel/rcu/tasks.h-236-	int j = READ_ONCE(i); // Prevent the compiler from reading twice
 --
 kernel/rcu/tasks.h=246=static void cblist_init_generic(struct rcu_tasks *rtp)
 --
 kernel/rcu/tasks.h-298-	pr_info("%s: Setting shift to %d and lim to %d rcu_task_cb_adjust=%d rcu_task_cpu_ids=%d.\n",
 kernel/rcu/tasks.h:299:			rtp->name, data_race(rtp->percpu_enqueue_shift), data_race(rtp->percpu_enqueue_lim),
 kernel/rcu/tasks.h-300-			rcu_task_cb_adjust, rcu_task_cpu_ids);
 --
 kernel/rcu/tasks.h=718=static void show_rcu_tasks_generic_gp_kthread(struct rcu_tasks *rtp, char *s)
 --
 kernel/rcu/tasks.h-727-
 kernel/rcu/tasks.h:728:		if (!data_race(rcu_segcblist_empty(&rtpcp->cblist)))
 kernel/rcu/tasks.h-729-			havecbs = true;
 kernel/rcu/tasks.h:730:		if (data_race(rtpcp->urgent_gp))
 kernel/rcu/tasks.h-731-			haveurgent = true;
 kernel/rcu/tasks.h:732:		if (!data_race(rcu_segcblist_empty(&rtpcp->cblist)) && data_race(rtpcp->urgent_gp))
 kernel/rcu/tasks.h-733-			haveurgentcbs = true;
 --
 kernel/rcu/tasks.h-738-		rtp->kname,
 kernel/rcu/tasks.h:739:		tasks_gp_state_getname(rtp), data_race(rtp->gp_state),
 kernel/rcu/tasks.h:740:		jiffies - data_race(rtp->gp_jiffies),
 kernel/rcu/tasks.h:741:		data_race(rcu_seq_current(&rtp->tasks_gp_seq)),
 kernel/rcu/tasks.h:742:		data_race(rtp->n_ipis_fails), data_race(rtp->n_ipis),
 kernel/rcu/tasks.h:743:		".k"[!!data_race(rtp->kthread_ptr)],
 kernel/rcu/tasks.h-744-		".C"[havecbs],
 --
 kernel/rcu/tasks.h=752=static void rcu_tasks_torture_stats_print_generic(struct rcu_tasks *rtp, char *tt,
 --
 kernel/rcu/tasks.h-760-	pr_alert("%s%s Tasks%s RCU g%ld gp_start %lu gp_jiffies %lu gp_state %d (%s).\n",
 kernel/rcu/tasks.h:761:		 tt, tf, tst, data_race(rtp->tasks_gp_seq),
 kernel/rcu/tasks.h:762:		 j - data_race(rtp->gp_start), j - data_race(rtp->gp_jiffies),
 kernel/rcu/tasks.h:763:		 data_race(rtp->gp_state), tasks_gp_state_getname(rtp));
 kernel/rcu/tasks.h-764-	pr_alert("\tEnqueue shift %d limit %d Dequeue limit %d gpseq %lu.\n",
 kernel/rcu/tasks.h:765:		 data_race(rtp->percpu_enqueue_shift),
 kernel/rcu/tasks.h:766:		 data_race(rtp->percpu_enqueue_lim),
 kernel/rcu/tasks.h:767:		 data_race(rtp->percpu_dequeue_lim),
 kernel/rcu/tasks.h:768:		 data_race(rtp->percpu_dequeue_gpseq));
 kernel/rcu/tasks.h-769-	(void)zalloc_cpumask_var(&cm, GFP_KERNEL);
 --
 kernel/rcu/tasks.h-787-	pr_alert("\tBarrier seq %lu start %lu count %d holdout CPUs ",
 kernel/rcu/tasks.h:788:		 data_race(rtp->barrier_q_seq), j - data_race(rtp->barrier_q_start),
 kernel/rcu/tasks.h-789-		 atomic_read(&rtp->barrier_q_count));
 --
 kernel/rcu/tasks.h=1087=static void check_holdout_task(struct task_struct *t,
 --
 kernel/rcu/tasks.h-1113-		 t->rcu_tasks_nvcsw, t->nvcsw, t->rcu_tasks_holdout,
 kernel/rcu/tasks.h:1114:		 data_race(t->rcu_tasks_idle_cpu), cpu);
 kernel/rcu/tasks.h-1115-	sched_show_task(t);
 --
 kernel/rcu/tree.c=1016=static bool rcu_start_this_gp(struct rcu_node *rnp_start, struct rcu_data *rdp,
 --
 kernel/rcu/tree.c-1073-	}
 kernel/rcu/tree.c:1074:	trace_rcu_grace_period(rcu_state.name, data_race(rcu_state.gp_seq), TPS("newreq"));
 kernel/rcu/tree.c-1075-	ret = true;  /* Caller must wake GP kthread. */
 --
 kernel/rcu/tree_exp.h=558=static void synchronize_rcu_expedited_stall(unsigned long jiffies_start, unsigned long j)
 --
 kernel/rcu/tree_exp.h-585-				"N."[!!(rdp->grpmask & rnp->expmaskinitnext)],
 kernel/rcu/tree_exp.h:586:				"D."[!!data_race(rdp->cpu_no_qs.b.exp)]);
 kernel/rcu/tree_exp.h-587-		}
 --
 kernel/rcu/tree_exp.h-589-	pr_cont(" } %lu jiffies s: %lu root: %#lx/%c\n",
 kernel/rcu/tree_exp.h:590:		j - jiffies_start, rcu_state.expedited_sequence, data_race(rnp_root->expmask),
 kernel/rcu/tree_exp.h:591:		".T"[!!data_race(rnp_root->exp_tasks)]);
 kernel/rcu/tree_exp.h-592-	if (!ndetected) {
 --
 kernel/rcu/tree_exp.h-604-			pr_cont(" l=%u:%d-%d:%#lx/%c",
 kernel/rcu/tree_exp.h:605:				rnp->level, rnp->grplo, rnp->grphi, data_race(rnp->expmask),
 kernel/rcu/tree_exp.h:606:				".T"[!!data_race(rnp->exp_tasks)]);
 kernel/rcu/tree_exp.h-607-		}
 --
 kernel/rcu/tree_plugin.h=867=dump_blkd_tasks(struct rcu_node *rnp, int ncheck)
 --
 kernel/rcu/tree_plugin.h-882-	pr_info("%s: ->gp_tasks %p ->boost_tasks %p ->exp_tasks %p\n",
 kernel/rcu/tree_plugin.h:883:		__func__, READ_ONCE(rnp->gp_tasks), data_race(rnp->boost_tasks),
 kernel/rcu/tree_plugin.h-884-		READ_ONCE(rnp->exp_tasks));
 --
 kernel/rcu/tree_stall.h=396=static void rcu_dump_cpu_stacks(unsigned long gp_seq)
 --
 kernel/rcu/tree_stall.h-404-		for_each_leaf_node_possible_cpu(rnp, cpu) {
 kernel/rcu/tree_stall.h:405:			if (gp_seq != data_race(rcu_state.gp_seq)) {
 kernel/rcu/tree_stall.h-406-				printk_deferred_exit();
 --
 kernel/rcu/tree_stall.h-409-			}
 kernel/rcu/tree_stall.h:410:			if (!(data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)))
 kernel/rcu/tree_stall.h-411-				continue;
 --
 kernel/rcu/tree_stall.h=518=static void print_cpu_stall_info(int cpu)
 --
 kernel/rcu/tree_stall.h-560-	       rdp->softirq_snap, kstat_softirqs_cpu(RCU_SOFTIRQ, cpu),
 kernel/rcu/tree_stall.h:561:	       data_race(rcu_state.n_force_qs) - rcu_state.n_force_qs_gpstart,
 kernel/rcu/tree_stall.h-562-	       rcuc_starved ? buf : "",
 --
 kernel/rcu/tree_stall.h=569=static void rcu_check_gp_kthread_starvation(void)
 --
 kernel/rcu/tree_stall.h-579-		       (long)rcu_seq_current(&rcu_state.gp_seq),
 kernel/rcu/tree_stall.h:580:		       data_race(READ_ONCE(rcu_state.gp_flags)),
 kernel/rcu/tree_stall.h-581-		       gp_state_getname(rcu_state.gp_state),
 kernel/rcu/tree_stall.h:582:		       data_race(READ_ONCE(rcu_state.gp_state)),
 kernel/rcu/tree_stall.h:583:		       gpk ? data_race(READ_ONCE(gpk->__state)) : ~0, cpu);
 kernel/rcu/tree_stall.h-584-		if (gpk) {
 --
 kernel/rcu/tree_stall.h-591-				pr_err("RCU GP kthread last ran on offline CPU %d.\n", cpu);
 kernel/rcu/tree_stall.h:592:			} else if (!(data_race(READ_ONCE(rdp->mynode->qsmask)) & rdp->grpmask)) {
 kernel/rcu/tree_stall.h-593-				pr_err("Stack dump where RCU GP kthread last ran:\n");
 --
 kernel/rcu/tree_stall.h=602=static void rcu_check_gp_kthread_expired_fqs_timer(void)
 --
]
]

23/2 2026/02/24 21:15 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
  input: 128350
  output: 24
  thoughts: 60

Thoughts:
Okay, here's my summary, taking the role of the expert:

**Investigating Potential Data Races in Kernel Printk Ringbuffer**

Right, so I'm digging into kernel data race issues again. The codebase has a fair number of `data_race()` markers, which helpfully flag intentional racy accesses. However, `printk_ringbuffer.c` hasn't shown up anywhere in the first 500 lines of my grep output. I should make sure that file is actually covered by my search, so let me narrow the expression to target it specifically. There could be something lurking there, and a targeted search is the right way to check.



24/2 2026/02/24 21:15 tool grepper
0m Error:
no matches

Args:
map[Expression:data_race.*printk_ringbuffer.c]

Results:
map[Output:]

25/2 2026/02/24 21:15 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
  input: 128605
  output: 19
  thoughts: 2418

Thoughts:
**Analysis of KCSAN Report on `printk_ringbuffer.c` Data Race**

Okay, so the KCSAN report flagged a data race in `printk_ringbuffer.c`, and I need to determine whether it's truly a concern. The file contains no `data_race()` annotations; instead, the developers use `LMM` (Linux-kernel memory model) comments to document the intended ordering explicitly. That's a key observation.

Looking closer, the core of the issue lies in this code:
```c
 496:			memcpy(&desc_out->text_blk_lpos, &desc->text_blk_lpos,
 497:			       sizeof(desc_out->text_blk_lpos)); /* LMM(desc_read:C) */
```

And critically, this comment:
```c
 491:		/*
 492:		 * Copy the descriptor data. The data is not valid until the
 493:		 * state has been re-checked. A memcpy() for all of @desc
 494:		 * cannot be used because of the atomic_t @state_var field.
 495:		 */
```

This comment confirms the race is known and deliberate: the code intentionally copies data that may be invalid, and the design relies on a subsequent state re-check to validate it. The race is therefore by design, not a bug.

The KCSAN report identifies a write of 8 bytes (likely to `blk_lpos->begin`) and a read of 16 bytes (likely from the `text_blk_lpos` struct). Examining the struct:
```c
struct prb_data_blk_lpos {
	unsigned long	begin;
	unsigned long	next;
};
```
The write is to `begin`, and the read is a `memcpy` of both `begin` and `next`. The critical point is that any partially updated value of `text_blk_lpos` is discarded if the state re-check fails.

Now I need to verify whether `data_alloc()` can be called on descriptors that are *not* in the `desc_reserved` state. It cannot: `desc_reserve()` always runs immediately before `data_alloc()` and leaves the descriptor in the reserved state, and `desc_read()` skips descriptors in that state. The only window for the race is descriptor recycling, when the state is in flux, and that case is exactly what the state re-check handles.

So the write happens in `data_alloc()` and the read in `desc_read()`. During recycling, if the state changes after the initial read but before the `memcpy()` completes, the re-check ensures the inconsistent copy is discarded.
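To make the writer-side protocol concrete, here is a compressed user-space sketch of the recycle-then-initialize step. The state encoding, names, and C11 atomics are my own stand-ins for illustration; the real kernel code splits this across `desc_reserve()` and `data_alloc()` and uses kernel atomics and barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel types (invented encoding). */
enum state { RESERVED, COMMITTED, FINALIZED, REUSABLE };

struct lpos { unsigned long begin, next; };     /* like prb_data_blk_lpos */

struct desc {
	_Atomic unsigned long state_var;        /* (id << 2) | state */
	struct lpos lpos;                       /* plain, racy fields */
};

static unsigned long make_sv(unsigned long id, enum state s)
{
	return (id << 2) | (unsigned long)s;
}

/*
 * Recycle the slot under a new ID, do the plain (racy) field writes
 * while the state is RESERVED, then publish with a release store.
 * Readers that observe RESERVED refuse to use the descriptor at all.
 */
static bool writer_reserve_and_alloc(struct desc *d, unsigned long old_id,
				     unsigned long new_id,
				     unsigned long begin, unsigned long next)
{
	unsigned long expect = make_sv(old_id, REUSABLE);

	if (!atomic_compare_exchange_strong(&d->state_var, &expect,
					    make_sv(new_id, RESERVED)))
		return false;           /* someone else recycled the slot */

	d->lpos.begin = begin;          /* the reported racy 8-byte write */
	d->lpos.next  = next;

	/* Release ordering: field writes become visible before COMMITTED. */
	atomic_store_explicit(&d->state_var, make_sv(new_id, COMMITTED),
			      memory_order_release);
	return true;
}
```

A reader that sampled the old ID before this sequence ran will see a different `state_var` afterwards, which is what makes the plain field writes harmless.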

Additionally, I see a race on `text_blk_lpos.next` as well, but all parts of the descriptor data are protected by the same mechanism.

I need to confirm that `info->seq` and `info->caller_id` are also involved, and they are. All accesses are synchronized through the same re-check, and any inconsistencies are handled.

The `printk_ringbuffer` implementation is a well-known, mature lockless design, built to handle these precise kinds of races.

Final verdict: this is a **benign** data race. It's intentional, part of the lockless design, and handled correctly by the state re-check. The report accurately flags a real racy access, but it is not an actionable bug: the design makes it impossible for the race to cause corruption. The reported accesses match the code, too: an 8-byte write to `text_blk_lpos.begin` in `data_alloc()` and a 16-byte `memcpy()` of `text_blk_lpos` in `desc_read()`.

The `LMM(desc_read:C)` comment at line 497 even labels this access explicitly in the memory-model annotations. The race is reachable from both the `printk()` write path and the `syslog` read path. Everything checks out.



26/2 2026/02/24 21:15 tool set-results
0m Args:
map[Benign:true Confident:true]

Results:
map[Benign:true Confident:true]

27/2 2026/02/24 21:15 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
  input: 131975
  output: 780
  thoughts: 248

Thoughts:
**My Assessment of the Data Race in `printk_ringbuffer`**

Okay, I've thoroughly analyzed this situation, and I'm comfortable concluding that this data race is, for all intents and purposes, benign. The race occurs when a descriptor is being recycled. Specifically, it involves the interplay between a writer initializing a new descriptor in `data_alloc()` and a reader copying that descriptor in `desc_read()`.

What's happening is a classic scenario in lockless designs, which the `printk_ringbuffer` implementation exploits to avoid locking. The `desc_read()` function first reads `state_var`, which encodes the descriptor's ID and state. It then copies the descriptor's data; this copy is the racing access. After the copy completes, it re-reads `state_var`. This is the key: the reader only uses the copied data if `state_var` (and, crucially, the ID) is unchanged between the two reads and the state is consistent, i.e. committed or finalized.

Essentially, if the writer modifies the descriptor while the reader is copying it, the reader detects the change on the re-read of `state_var` and discards the potentially torn copy. This is a standard lockless pattern, essentially a sequence-count scheme, and it is used consistently throughout the `printk_ringbuffer` implementation.

Of course, memory barriers (`smp_rmb()`) are employed to enforce the correct ordering of these accesses, ensuring that the reader actually sees the updated state. Given that the data is only ever used after its consistency is verified, the race doesn't pose any risk of corruption or crashes. It's a textbook example of how to design around a potential race without the need for heavyweight locking, which is exactly what one would expect from a performance-critical subsystem like a printk ringbuffer.
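The check/copy/re-check protocol can be sketched in user-space C. This is a minimal illustration with an invented state encoding, using a C11 acquire fence where the kernel uses `smp_rmb()`; it is not the actual `desc_read()` implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel types (invented encoding). */
enum state { RESERVED, COMMITTED, FINALIZED, REUSABLE };

struct lpos { unsigned long begin, next; };     /* like prb_data_blk_lpos */

struct desc {
	_Atomic unsigned long state_var;        /* (id << 2) | state */
	struct lpos lpos;                       /* plain, racy fields */
};

static unsigned long make_sv(unsigned long id, enum state s)
{
	return (id << 2) | (unsigned long)s;
}

/*
 * Check/copy/re-check, mirroring desc_read(): the copied data is
 * trusted only if state_var (ID and state) is unchanged and consistent
 * afterwards; otherwise the possibly-torn copy is thrown away.
 */
static bool desc_read_sketch(struct desc *d, unsigned long id,
			     struct lpos *out)
{
	unsigned long sv1, sv2;

	sv1 = atomic_load_explicit(&d->state_var, memory_order_acquire);
	if ((sv1 >> 2) != id ||
	    ((sv1 & 3) != COMMITTED && (sv1 & 3) != FINALIZED))
		return false;                   /* not safe to copy yet */

	memcpy(out, &d->lpos, sizeof(*out));    /* the reported racy read */

	/* Acquire fence plays the role of the kernel's smp_rmb() here. */
	atomic_thread_fence(memory_order_acquire);
	sv2 = atomic_load_explicit(&d->state_var, memory_order_relaxed);

	return sv1 == sv2;              /* recycled meanwhile? discard copy */
}
```

If a writer recycles the slot between the two `state_var` loads, `sv1 != sv2` and the caller discards the copy, which is exactly why the racy `memcpy()` can never leak torn data to the reader.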


