syzbot
| ID | Workflow | Result | Correct | Bug | Created | Started | Finished | Revision | Error |
|---|---|---|---|---|---|---|---|---|---|
| 437da85d-cddc-42e6-b056-7d4a29de5f9b | assessment-kcsan | Benign: ✅ | ❓ | KCSAN: data-race in filemap_splice_read / filemap_splice_read | 2026/01/09 18:53 | 2026/01/09 18:53 | 2026/01/09 18:55 | 7519916073b761ced56a7b15fdeeb4674e8dc125 | |
The KCSAN report indicates a data race in `filemap_splice_read` at `mm/filemap.c`. The conflicting accesses are writes to `in->f_ra.prev_pos`, where `in` is a `struct file` pointer. The code at the reported location is:

```c
// mm/filemap.c:3124
in->f_ra.prev_pos = *ppos;
```

The `prev_pos` field is part of `struct file_ra_state`, which stores the readahead state for the file description. This state tracks the last read position so the kernel can detect sequential read patterns and trigger readahead (speculative reading of subsequent pages into the page cache) to improve performance.

The race occurs when two tasks call `splice` (via `sendfile` in this case) on the same file description concurrently. Since the `struct file` is shared between the tasks (e.g., duplicated file descriptors or concurrent use of the same file descriptor), both update the readahead state. The Linux kernel intentionally allows concurrent readers (such as `pread` or `splice`) to share the `struct file` without a mutex protecting the readahead state (`f_ra`), as enforcing serialization would severely degrade performance for parallel reads. The `prev_pos` field is used purely as a heuristic for the readahead algorithm (`mm/readahead.c`).

1. **Correctness**: The actual data read relies on `ppos` (which is derived from a thread-local variable `pos` in `do_sendfile` and passed down), not on `prev_pos`. Thus, the data returned to the user is correct and determined by the requested offset.
2. **Safety**: The `prev_pos` field is a simple scalar (`loff_t`). On 64-bit architectures (like the x86_64 machine this report is from), aligned 8-byte writes are atomic, preventing torn writes. Even if a torn write were to occur (on 32-bit), or if the value is "corrupted" by the race (e.g., bouncing between two concurrent read positions), the consequence is limited to the readahead logic.

The readahead algorithm might misinterpret the access pattern (e.g., failing to detect sequential reads or triggering unnecessary readahead), resulting in suboptimal I/O performance (too little or too much readahead). It does not lead to memory corruption, kernel crashes, or incorrect data being read. Therefore, this data race is **benign**. It fits the description of a race on "statistics counters, and similar" where the accesses are not supposed to be protected by mutual exclusion for performance reasons.
==================================================================
BUG: KCSAN: data-race in filemap_splice_read / filemap_splice_read

write to 0xffff888104428528 of 8 bytes by task 4245 on cpu 1:
 filemap_splice_read+0x4f4/0x740 mm/filemap.c:3125
 ext4_file_splice_read+0x8f/0xb0 fs/ext4/file.c:158
 do_splice_read fs/splice.c:982 [inline]
 splice_direct_to_actor+0x26f/0x680 fs/splice.c:1086
 do_splice_direct_actor fs/splice.c:1204 [inline]
 do_splice_direct+0xda/0x150 fs/splice.c:1230
 do_sendfile+0x380/0x650 fs/read_write.c:1370
 __do_sys_sendfile64 fs/read_write.c:1431 [inline]
 __se_sys_sendfile64 fs/read_write.c:1417 [inline]
 __x64_sys_sendfile64+0x105/0x150 fs/read_write.c:1417
 x64_sys_call+0x2db1/0x3000 arch/x86/include/generated/asm/syscalls_64.h:41
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xca/0x2b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

write to 0xffff888104428528 of 8 bytes by task 4260 on cpu 0:
 filemap_splice_read+0x4f4/0x740 mm/filemap.c:3125
 ext4_file_splice_read+0x8f/0xb0 fs/ext4/file.c:158
 do_splice_read fs/splice.c:982 [inline]
 splice_direct_to_actor+0x26f/0x680 fs/splice.c:1086
 do_splice_direct_actor fs/splice.c:1204 [inline]
 do_splice_direct+0xda/0x150 fs/splice.c:1230
 do_sendfile+0x380/0x650 fs/read_write.c:1370
 __do_sys_sendfile64 fs/read_write.c:1431 [inline]
 __se_sys_sendfile64 fs/read_write.c:1417 [inline]
 __x64_sys_sendfile64+0x105/0x150 fs/read_write.c:1417
 x64_sys_call+0x2db1/0x3000 arch/x86/include/generated/asm/syscalls_64.h:41
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xca/0x2b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x0000000000000241 -> 0x0000000000000242

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 4260 Comm: syz.1.274 Not tainted syzkaller #0 PREEMPT(voluntary)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
==================================================================
| Seq | Timestamp | Type | Name | Duration |
|---|---|---|---|---|
| 0/0 | 2026/01/09 18:53 | flow | assessment-kcsan | 2m |
Model:unknown Results: map[Benign:true Explanation:(same text as the assessment reproduced above)] |
| 1/1 | 2026/01/09 18:53 | action | kernel-checkouter | 0m |
Model:unknown Results: map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/761c89d80a8577d5dba2a721de12cc04cb43ff14] |
| 2/1 | 2026/01/09 18:53 | action | kernel-builder | 0m |
Model:unknown Results: map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/9dce9dd65e825296b412dc760805ab08dc7d0dbc] |
| 3/1 | 2026/01/09 18:53 | action | codesearch-prepare | 0m |
Model:unknown Results: map[Index:codesearch-index] |
| 4/1 | 2026/01/09 18:53 | agent | expert | 2m |
Model:unknown Results: map[Benign:true] Instruction: You are an experienced Linux kernel developer tasked with determining if the given kernel data race is benign or not. The data race report is from the KCSAN tool. It contains 2 stack traces of the memory accesses that constitute a data race. The report would be non-actionable if the underlying data race is "benign". That is, the race is on a simple int/bool or similar field, and the accesses are not supposed to be protected by any mutual exclusion primitives. Common examples of such "benign" data races are accesses to various flags fields, statistics counters, and similar. An actionable race is "harmful", that is, it can lead to corruption/crash even with a conservative compiler that compiles memory accesses to primitive types effectively as atomic. A common example of a "harmful" data race is a race on a complex container (list/hashmap/etc), where accesses are supposed to be protected by a mutual exclusion primitive. In the final reply explain why you think the report is consistent and the data race is harmful. Use the provided tools to confirm any assumptions, what variables/fields are being accessed, etc. In particular, don't make assumptions about the kernel source code; use codesearch tools to read the actual source code. Use the set-results tool to provide results of the analysis. It must be called exactly once before the final reply. Ignore results of this tool.
Prompt: The data race report is: (the full KCSAN report quoted above)
Reply: (same text as the assessment reproduced at the top of this page) |
| 5/2 | 2026/01/09 18:53 | llm | expert | 0m |
Model:unknown Thoughts: **Analyzing a KCSAN Data Race in `filemap_splice_read`** Okay, so I've got a KCSAN report here, flagging a data race. The key is `filemap_splice_read` at `mm/filemap.c:3125`. This immediately piques my interest; filemap operations are often hotspots for concurrency issues, and `splice_read` suggests this is related to data transfer. The report indicates *two* write accesses to the same memory address (`0xffff888104428528`) which, crucially, show a value change from `0x0000000000000241` to `0x0000000000000242`. The fact that the stack traces are identical for both tasks points to a potential problem with locking or synchronization within a specific function. My first step is clear: I need to pinpoint that line of code. Let's pull up the source for `filemap_splice_read` using `codesearch-definition-source`, and I'll make sure to specify line numbers to zero in on line 3125 in `mm/filemap.c`. That should give me the immediate context I need to start forming a hypothesis about what's going wrong. I'll be looking specifically for what's being written to that memory address and what the potential read/write ordering issues might be. |
| 6/2 | 2026/01/09 18:53 | tool | codesearch-definition-source | 0m |
Model:unknown Args: map[IncludeLines:true Name:filemap_splice_read SourceFile:mm/filemap.c] Results: map[Missing:false SourceCode:3028: /**
3029: * filemap_splice_read - Splice data from a file's pagecache into a pipe
3030: * @in: The file to read from
3031: * @ppos: Pointer to the file position to read from
3032: * @pipe: The pipe to splice into
3033: * @len: The amount to splice
3034: * @flags: The SPLICE_F_* flags
3035: *
3036: * This function gets folios from a file's pagecache and splices them into the
3037: * pipe. Readahead will be called as necessary to fill more folios. This may
3038: * be used for blockdevs also.
3039: *
3040: * Return: On success, the number of bytes read will be returned and *@ppos
3041: * will be updated if appropriate; 0 will be returned if there is no more data
3042: * to be read; -EAGAIN will be returned if the pipe had no space, and some
3043: * other negative error code will be returned on error. A short read may occur
3044: * if the pipe has insufficient space, we reach the end of the data or we hit a
3045: * hole.
3046: */
3047: ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
3048: struct pipe_inode_info *pipe,
3049: size_t len, unsigned int flags)
3050: {
3051: struct folio_batch fbatch;
3052: struct kiocb iocb;
3053: size_t total_spliced = 0, used, npages;
3054: loff_t isize, end_offset;
3055: bool writably_mapped;
3056: int i, error = 0;
3057:
3058: if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes))
3059: return 0;
3060:
3061: init_sync_kiocb(&iocb, in);
3062: iocb.ki_pos = *ppos;
3063:
3064: /* Work out how much data we can actually add into the pipe */
3065: used = pipe_buf_usage(pipe);
3066: npages = max_t(ssize_t, pipe->max_usage - used, 0);
3067: len = min_t(size_t, len, npages * PAGE_SIZE);
3068:
3069: folio_batch_init(&fbatch);
3070:
3071: do {
3072: cond_resched();
3073:
3074: if (*ppos >= i_size_read(in->f_mapping->host))
3075: break;
3076:
3077: iocb.ki_pos = *ppos;
3078: error = filemap_get_pages(&iocb, len, &fbatch, true);
3079: if (error < 0)
3080: break;
3081:
3082: /*
3083: * i_size must be checked after we know the pages are Uptodate.
3084: *
3085: * Checking i_size after the check allows us to calculate
3086: * the correct value for "nr", which means the zero-filled
3087: * part of the page is not copied back to userspace (unless
3088: * another truncate extends the file - this is desired though).
3089: */
3090: isize = i_size_read(in->f_mapping->host);
3091: if (unlikely(*ppos >= isize))
3092: break;
3093: end_offset = min_t(loff_t, isize, *ppos + len);
3094:
3095: /*
3096: * Once we start copying data, we don't want to be touching any
3097: * cachelines that might be contended:
3098: */
3099: writably_mapped = mapping_writably_mapped(in->f_mapping);
3100:
3101: for (i = 0; i < folio_batch_count(&fbatch); i++) {
3102: struct folio *folio = fbatch.folios[i];
3103: size_t n;
3104:
3105: if (folio_pos(folio) >= end_offset)
3106: goto out;
3107: folio_mark_accessed(folio);
3108:
3109: /*
3110: * If users can be writing to this folio using arbitrary
3111: * virtual addresses, take care of potential aliasing
3112: * before reading the folio on the kernel side.
3113: */
3114: if (writably_mapped)
3115: flush_dcache_folio(folio);
3116:
3117: n = min_t(loff_t, len, isize - *ppos);
3118: n = splice_folio_into_pipe(pipe, folio, *ppos, n);
3119: if (!n)
3120: goto out;
3121: len -= n;
3122: total_spliced += n;
3123: *ppos += n;
3124: in->f_ra.prev_pos = *ppos;
3125: if (pipe_is_full(pipe))
3126: goto out;
3127: }
3128:
3129: folio_batch_release(&fbatch);
3130: } while (len);
3131:
3132: out:
3133: folio_batch_release(&fbatch);
3134: file_accessed(in);
3135:
3136: return total_spliced ? total_spliced : error;
3137: }
SourceFile:mm/filemap.c] |
| 7/2 | 2026/01/09 18:53 | llm | expert | 0m |
Model:unknown |
| 8/2 | 2026/01/09 18:53 | tool | codesearch-definition-source | 0m |
Model:unknown Args: map[IncludeLines:true Name:do_sendfile SourceFile:fs/read_write.c] Results: map[Missing:false SourceCode:1301: static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
1302: size_t count, loff_t max)
1303: {
1304: struct inode *in_inode, *out_inode;
1305: struct pipe_inode_info *opipe;
1306: loff_t pos;
1307: loff_t out_pos;
1308: ssize_t retval;
1309: int fl;
1310:
1311: /*
1312: * Get input file, and verify that it is ok..
1313: */
1314: CLASS(fd, in)(in_fd);
1315: if (fd_empty(in))
1316: return -EBADF;
1317: if (!(fd_file(in)->f_mode & FMODE_READ))
1318: return -EBADF;
1319: if (!ppos) {
1320: pos = fd_file(in)->f_pos;
1321: } else {
1322: pos = *ppos;
1323: if (!(fd_file(in)->f_mode & FMODE_PREAD))
1324: return -ESPIPE;
1325: }
1326: retval = rw_verify_area(READ, fd_file(in), &pos, count);
1327: if (retval < 0)
1328: return retval;
1329: if (count > MAX_RW_COUNT)
1330: count = MAX_RW_COUNT;
1331:
1332: /*
1333: * Get output file, and verify that it is ok..
1334: */
1335: CLASS(fd, out)(out_fd);
1336: if (fd_empty(out))
1337: return -EBADF;
1338: if (!(fd_file(out)->f_mode & FMODE_WRITE))
1339: return -EBADF;
1340: in_inode = file_inode(fd_file(in));
1341: out_inode = file_inode(fd_file(out));
1342: out_pos = fd_file(out)->f_pos;
1343:
1344: if (!max)
1345: max = min(in_inode->i_sb->s_maxbytes, out_inode->i_sb->s_maxbytes);
1346:
1347: if (unlikely(pos + count > max)) {
1348: if (pos >= max)
1349: return -EOVERFLOW;
1350: count = max - pos;
1351: }
1352:
1353: fl = 0;
1354: #if 0
1355: /*
1356: * We need to debate whether we can enable this or not. The
1357: * man page documents EAGAIN return for the output at least,
1358: * and the application is arguably buggy if it doesn't expect
1359: * EAGAIN on a non-blocking file descriptor.
1360: */
1361: if (fd_file(in)->f_flags & O_NONBLOCK)
1362: fl = SPLICE_F_NONBLOCK;
1363: #endif
1364: opipe = get_pipe_info(fd_file(out), true);
1365: if (!opipe) {
1366: retval = rw_verify_area(WRITE, fd_file(out), &out_pos, count);
1367: if (retval < 0)
1368: return retval;
1369: retval = do_splice_direct(fd_file(in), &pos, fd_file(out), &out_pos,
1370: count, fl);
1371: } else {
1372: if (fd_file(out)->f_flags & O_NONBLOCK)
1373: fl |= SPLICE_F_NONBLOCK;
1374:
1375: retval = splice_file_to_pipe(fd_file(in), opipe, &pos, count, fl);
1376: }
1377:
1378: if (retval > 0) {
1379: add_rchar(current, retval);
1380: add_wchar(current, retval);
1381: fsnotify_access(fd_file(in));
1382: fsnotify_modify(fd_file(out));
1383: fd_file(out)->f_pos = out_pos;
1384: if (ppos)
1385: *ppos = pos;
1386: else
1387: fd_file(in)->f_pos = pos;
1388: }
1389:
1390: inc_syscr(current);
1391: inc_syscw(current);
1392: if (pos > max)
1393: retval = -EOVERFLOW;
1394: return retval;
1395: }
SourceFile:fs/read_write.c] |
| 9/2 | 2026/01/09 18:53 | llm | expert | 0m |
Model:unknown Thoughts: **Analyzing the `sendfile` Data Race** Okay, so I've been looking into this `sendfile` data race, and here's what I've figured out. In the `do_sendfile` function, there's a local variable `pos`. This is either initialized from the provided offset (`ppos`) or the file's current position (`fd_file(in)->f_pos`). Crucially, this `pos` is passed by reference to `do_splice_direct` and, subsequently, to `filemap_splice_read`. This means `filemap_splice_read` is directly modifying `do_sendfile`'s local `pos` variable. The reported race occurs at line 3125 in `mm/filemap.c`, where `in->f_ra.prev_pos = *ppos;` is executed. Here, `in` points to the `struct file`, `ppos` is the address of our local `pos` variable, and `in->f_ra` represents the `file_ra_state` embedded within the `struct file`. This `file_ra_state` keeps track of the read-ahead state to optimize file access. If two threads are using `sendfile` concurrently on the *same* file descriptor (or duped ones representing the same underlying `struct file`), they're both writing to this shared `in->f_ra.prev_pos`. The `struct file_ra_state` is designed for read-ahead logic. It tracks access patterns for determining how much data to prefetch. Now, the issue is that `do_sendfile` uses `fdget` (via `CLASS(fd, in)(in_fd)`) which *doesn't* protect `f_pos` with a lock like `fdget_pos` does. This means that at line 1320, `pos = fd_file(in)->f_pos;` and at line 1387, `fd_file(in)->f_pos = pos;`, there's a potential race condition when reading and writing `f_pos`. POSIX leaves concurrent use of a file descriptor unspecified, except for `pread`/`pwrite` with offsets. The `sendfile` syscall *can* take an offset, the `offset_ptr` argument. If `offset_ptr` is not `NULL` (like `pread`), `f_pos` is untouched. However, if `offset_ptr` is `NULL` (like `read`), we have a race. Standard `read` and `write` *do* use `f_pos_lock`, which is managed through `fdget_pos`.
Therefore, the `read` syscall is protected. `do_sendfile` bypasses this protection. In summary, if the offset pointer is `NULL`, then `do_sendfile` reads `f_pos` at the beginning and writes it back at the end, without a lock, making these operations racy. This is the root of the problem. However, the report is not about the race of `f_pos`, but rather of `in->f_ra.prev_pos`. The `file_ra_state` (and specifically `f_ra.prev_pos`) is shared, and updated by `filemap_splice_read`, which updates it to `*ppos`. This state is used for readahead. The `prev_pos` field within this state is used to track the last read position for sequential access detection, which can be useful for optimizations. Even if we *do* use offsets in `sendfile` calls (so, like `pread`), we still share `struct file` and the read-ahead state, and multiple threads will update `f_ra.prev_pos`. This means that if we are reading the same file from different offsets in parallel, the `prev_pos` value is going to bounce around between the different offsets. The read-ahead state is generally per-file-description, so multiple threads can already interfere with each other's state. On a 64-bit architecture, writes to an aligned 64-bit value like `prev_pos` *are* atomic, so we won't get torn writes. What we *will* get are lost or interleaved updates, where `prev_pos` gets updated with values from different threads, which causes erratic or disabled readahead. However, is this a critical issue? The other fields within `file_ra_state` like `start`, `size`, `async_size`, `ra_pages`, and `mmap_miss`, are mostly updated by `ondemand_readahead`, but `prev_pos` is updated by `filemap_splice_read`. |
| 10/2 | 2026/01/09 18:54 | tool | codesearch-definition-source | 0m |
Model:unknown Args: map[IncludeLines:true Name:ondemand_readahead SourceFile:mm/readahead.c] Results: map[Missing:true SourceCode: SourceFile:] |
| 11/2 | 2026/01/09 18:54 | llm | expert | 0m |
Model:unknown Thoughts: **Chasing Down a Read-Ahead Bug** Okay, I'm thinking this likely boils down to something related to page cache synchronization or read-ahead, perhaps `page_cache_sync_ra` or a similarly named function. The behavior I'm seeing definitely points in that direction. The next step is to nail down the specific code involved. I'll need to locate the definition or places where the `struct file_ra_state` is used. That struct is central to understanding how the read-ahead logic functions. After that, I will specifically search for `prev_pos` within the `mm/readahead.c` file. This should give me a good handle on where the read-ahead positioning logic could be problematic. |
| 12/2 | 2026/01/09 18:54 | tool | codesearch-file-index | 0m |
Model:unknown Args: map[SourceFile:mm/readahead.c] Results: map[Entities:[map[Kind:function Name:__do_sys_readahead] map[Kind:function Name:__do_sys_readahead] map[Kind:function Name:__ia32_sys_readahead] map[Kind:function Name:__se_sys_readahead] map[Kind:function Name:__x64_sys_readahead] map[Kind:function Name:do_page_cache_ra] map[Kind:function Name:file_ra_state_init] map[Kind:function Name:file_ra_state_init] map[Kind:function Name:force_page_cache_ra] map[Kind:function Name:get_init_ra_size] map[Kind:function Name:get_next_ra_size] map[Kind:function Name:ksys_readahead] map[Kind:function Name:page_cache_async_ra] map[Kind:function Name:page_cache_async_ra] map[Kind:function Name:page_cache_ra_order] map[Kind:function Name:page_cache_ra_unbounded] map[Kind:function Name:page_cache_ra_unbounded] map[Kind:function Name:page_cache_sync_ra] map[Kind:function Name:page_cache_sync_ra] map[Kind:function Name:ra_alloc_folio] map[Kind:function Name:ractl_alloc_folio] map[Kind:function Name:ractl_max_pages] map[Kind:function Name:read_pages] map[Kind:function Name:readahead_expand] map[Kind:function Name:readahead_expand]] Missing:false] |
| 13/2 | 2026/01/09 18:54 | llm | expert | 0m |
Model:unknown Thoughts: **Diving into the Read-Ahead Logic** Alright, based on the context, I think `page_cache_sync_ra` and `page_cache_async_ra` are the key areas to investigate. Those are the functions that usually get my attention when I'm tracking down read-ahead behavior. I also have a hunch that something related to `ondemand_readahead` could be static, or maybe it's been renamed – worth checking for sure. My next step is clear: I'm going straight to the source code for `page_cache_sync_ra`. Time to see what's really happening under the hood. |
| 14/2 | 2026/01/09 18:54 | tool | codesearch-definition-source | 0m |
Model:unknown Args: map[IncludeLines:true Name:page_cache_sync_ra SourceFile:mm/readahead.c] Results: map[Missing:false SourceCode: 553: void page_cache_sync_ra(struct readahead_control *ractl,
554: unsigned long req_count)
555: {
556: pgoff_t index = readahead_index(ractl);
557: bool do_forced_ra = ractl->file && (ractl->file->f_mode & FMODE_RANDOM);
558: struct file_ra_state *ra = ractl->ra;
559: unsigned long max_pages, contig_count;
560: pgoff_t prev_index, miss;
561:
562: trace_page_cache_sync_ra(ractl->mapping->host, index, ra, req_count);
563: /*
564: * Even if readahead is disabled, issue this request as readahead
565: * as we'll need it to satisfy the requested range. The forced
566: * readahead will do the right thing and limit the read to just the
567: * requested range, which we'll set to 1 page for this case.
568: */
569: if (!ra->ra_pages || blk_cgroup_congested()) {
570: if (!ractl->file)
571: return;
572: req_count = 1;
573: do_forced_ra = true;
574: }
575:
576: /* be dumb */
577: if (do_forced_ra) {
578: force_page_cache_ra(ractl, req_count);
579: return;
580: }
581:
582: max_pages = ractl_max_pages(ractl, req_count);
583: prev_index = (unsigned long long)ra->prev_pos >> PAGE_SHIFT;
584: /*
585: * A start of file, oversized read, or sequential cache miss:
586: * trivial case: (index - prev_index) == 1
587: * unaligned reads: (index - prev_index) == 0
588: */
589: if (!index || req_count > max_pages || index - prev_index <= 1UL) {
590: ra->start = index;
591: ra->size = get_init_ra_size(req_count, max_pages);
592: ra->async_size = ra->size > req_count ? ra->size - req_count :
593: ra->size >> 1;
594: goto readit;
595: }
596:
597: /*
598: * Query the page cache and look for the traces(cached history pages)
599: * that a sequential stream would leave behind.
600: */
601: rcu_read_lock();
602: miss = page_cache_prev_miss(ractl->mapping, index - 1, max_pages);
603: rcu_read_unlock();
604: contig_count = index - miss - 1;
605: /*
606: * Standalone, small random read. Read as is, and do not pollute the
607: * readahead state.
608: */
609: if (contig_count <= req_count) {
610: do_page_cache_ra(ractl, req_count, 0);
611: return;
612: }
613: /*
614: * File cached from the beginning:
615: * it is a strong indication of long-run stream (or whole-file-read)
616: */
617: if (miss == ULONG_MAX)
618: contig_count *= 2;
619: ra->start = index;
620: ra->size = min(contig_count + req_count, max_pages);
621: ra->async_size = 1;
622: readit:
623: ra->order = 0;
624: ractl->_index = ra->start;
625: page_cache_ra_order(ractl, ra);
626: }
SourceFile:mm/readahead.c] |
| 15/2 | 2026/01/09 18:54 | llm | expert | 0m |
Model:unknown Thoughts: **My Assessment of the KCSAN Report** Okay, so I'm looking at this KCSAN report concerning a data race on `file->f_ra.prev_pos` within the kernel's file system code, specifically related to readahead. The context is `filemap_splice_read`, which, as I recall, is used for efficiently transferring data from a file into a pipe, often in the `sendfile` path. The report indicates a concurrent write to this `prev_pos` field, changing its value, and the question is whether this is a problem. My initial reaction is that this is likely a *benign* race. The `file_ra_state` structure is intentionally designed for lock-free access for performance reasons. Readahead is a heuristic, designed to *guess* what the user is going to read, to improve overall read performance. The `prev_pos` field is precisely that: a cache of the last read position used to infer sequential access patterns. This is *not* a critical piece of data that determines *what* data gets returned to the user; that's handled by other mechanisms, like the file offset (`ppos` in this case) passed to `filemap_get_pages`. The code in question involves checking `index - prev_index <= 1UL`. If `prev_pos` (and thus `prev_index`) is corrupted by concurrent access, this check might fail or succeed unpredictably. If the check fails, the code might incorrectly fall through to treat the read as non-sequential (possibly missing readahead opportunities). However, this affects *performance* (the effectiveness of readahead) and not the correctness of the read data itself. The source also reveals that `prev_pos` is updated *after* the call to `filemap_get_pages`, which is responsible for the actual reading. This sequence of actions is key. `filemap_get_pages` grabs the data (or pages), and then a separate step updates the read position. This is why a concurrent race is harmless in terms of data corruption.
The thread using a stale `prev_pos` value may simply do more or less readahead than is optimal. The report highlights that the access is a simple load/store of 8 bytes on x86-64, making the access atomic, so we won't get torn values; a single 64-bit `prev_pos` value stays internally consistent.

The KCSAN report shows `0x...241` -> `0x...242`. The small increment looks suspect, but it is consistent with the splice length: concurrent threads are performing `splice` on the same file, possibly in small increments.

Essentially, the worst-case scenario is suboptimal readahead behavior, which might cause some performance degradation. The readahead logic uses `prev_pos` as a heuristic, not a guaranteed state. I'm confident that this race is a known, intentional trade-off for performance. |
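The heuristic discussed above can be sketched as a tiny self-contained model (a hypothetical simplification: `looks_sequential` and `PAGE_SHIFT_SIM` are invented names modeled on the `index - prev_index <= 1UL` check quoted in the analysis, not the kernel's actual readahead code):

```c
#include <stdint.h>

/* Hypothetical model of the sequential-read check: prev_pos feeds only
 * this boolean decision. A stale value written by a racing thread can
 * at worst flip the result, changing how much readahead is issued. */
#define PAGE_SHIFT_SIM 12 /* assumed 4 KiB pages */

static int looks_sequential(unsigned long index, int64_t prev_pos)
{
    unsigned long prev_index = (uint64_t)prev_pos >> PAGE_SHIFT_SIM;
    /* a read landing on prev_index or prev_index + 1 counts as sequential */
    return index - prev_index <= 1UL;
}
```

Because the data actually copied to the caller is selected by `*ppos`, a wrong answer from this check costs only readahead efficiency, never correctness.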
| 16/2 | 2026/01/09 18:54 | tool | set-results |
Model:unknown Args: map[Benign:true] Results: map[Benign:true] |
| 17/2 | 2026/01/09 18:54 | llm | expert |
Model:unknown Thoughts: **Analyzing the Readahead Race Condition**

Okay, so I'm being asked to explain why this race condition is benign. Let's break this down.

My initial assessment is that the race is occurring on `file->f_ra.prev_pos`, part of the readahead state. This readahead state, as I understand it, is essentially a heuristic to optimize I/O performance by pre-loading data into the page cache. The kernel *expects* concurrent updates to this state, especially with concurrent `pread` or `splice` calls, and it doesn't bother with locks because they'd create performance bottlenecks. The worst-case scenario here is suboptimal readahead behavior: either reading too much or too little ahead. Critically, this doesn't impact data integrity or kernel stability. The access itself is a straightforward integer assignment. And even if there were tearing on the writes, which isn't possible for aligned 8-byte stores on 64-bit architectures, the value is just a file-offset guess, so it's not a big deal.

Looking at the report, it confirms that two threads are indeed writing to `in->f_ra.prev_pos`, one writing 0x241 and another 0x242, which is consistent with updating the readahead cursor. So the race is real. I should emphasize that the fields within `struct file_ra_state` are generally treated as statistics or heuristics. The safety of the operation, meaning the actual reading of data, hinges on `ppos` (the file position pointer) and page cache locking (folios are locked during read/write operations), not on the consistency of `f_ra`. In fact, the `ppos` variable in `filemap_splice_read` is consistent for each thread because it's a local variable within `do_sendfile` (or passed from the caller), so that part is safe. The stack trace highlights this: `filemap_splice_read` -> `in->f_ra.prev_pos = *ppos;`, confirming the exact location. I have all the critical info. A minor point: `splice` updates `*ppos`, and that happens *before* the `prev_pos` update.

The `ppos` passed to `filemap_splice_read` is `&pos` from `do_sendfile`, and `pos` is a local variable within `do_sendfile`. So the `*ppos` we're talking about is thread-local and not shared. The shared part is `in` (the `struct file`), and the report pinpoints the race on `in->f_ra.prev_pos`. If `ppos` *were* shared, that would be a separate race on `*ppos`, which isn't the case here. Double-checking: `do_sendfile` takes `loff_t *ppos`; it either uses `fd_file(in)->f_pos` or `*ppos`, but then always copies the value into a local `pos` variable and passes `&pos` to `do_splice_direct`. Therefore, the `*ppos` within `filemap_splice_read` is definitely local to each task. The analysis is valid.

I'm checking against the definition of "harmful". An actionable race is harmful if it can lead to corruption or a crash, typically a race on a complex structure that *should* be protected. Benign races are typically on flag fields, statistics, and counters, which is exactly what we're seeing with `f_ra` (heuristic state). So it's benign.

One edge case to consider: what if `prev_pos` is used to calculate `ra->start` and `ra->size`, and *those* are used to read pages? If `ra->start` is wrong, we read the wrong pages *into the page cache*. However, `filemap_get_pages` uses `iocb->ki_pos` (which ultimately comes from `*ppos`, which is correct) to determine what data to *actually* read for the request; the readahead logic (using `ra`) only issues speculative reads. If the speculative reads are wrong, we just populate the page cache with unneeded data, wasting memory and I/O. Since the page cache is indexed by file offset, reading data at the wrong offset doesn't corrupt other pages. Still benign. I'm ready to explain. |
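The thread-locality argument above can also be sketched in miniature (a hypothetical model, not kernel code: `file_sim`, `do_sendfile_sim`, and `splice_read_sim` are invented names mirroring the `do_sendfile` → `do_splice_direct` → `filemap_splice_read` flow described in the analysis):

```c
#include <stdint.h>

/* Hypothetical sketch of why *ppos is race-free in this path: a
 * do_sendfile()-style caller snapshots the offset into a stack-local
 * variable and passes that down, so each task sees its own position.
 * Only the shared readahead field is written concurrently. */
struct file_sim {
    int64_t f_pos;         /* shared file offset (not used by this path) */
    int64_t f_ra_prev_pos; /* shared readahead heuristic: the racy field */
};

/* splice-like worker: selects data via the thread-local offset, then
 * updates the shared heuristic last (mirroring filemap_splice_read) */
static int64_t splice_read_sim(struct file_sim *in, int64_t *ppos, int64_t len)
{
    int64_t start = *ppos;     /* data selection: thread-local, correct */
    *ppos += len;
    in->f_ra_prev_pos = *ppos; /* heuristic update: racy but benign */
    return start;
}

static int64_t do_sendfile_sim(struct file_sim *in, int64_t offset, int64_t len)
{
    int64_t pos = offset; /* snapshot into a local, as do_sendfile does */
    return splice_read_sim(in, &pos, len);
}
```

In this model, two tasks calling `do_sendfile_sim` concurrently on the same `struct file_sim` can interleave freely: each still reads from its own requested offset, and only `f_ra_prev_pos` ends up with one winner's value.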