Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

kernfs: Send IN_DELETE_SELF and IN_IGNORED

Currently some kernfs files (e.g. cgroup.events, memory.events) support
inotify watches for IN_MODIFY, but unlike with regular filesystems, they
do not receive IN_DELETE_SELF or IN_IGNORED events when they are
removed. This means inotify watches persist after file deletion until
the process exits and the inotify file descriptor is cleaned up, or
until inotify_rm_watch is called manually.

This creates a problem for processes monitoring cgroups. For example, a
service monitoring memory.events for memory.high breaches needs to know
when a cgroup is removed to clean up its state. Where it's known that a
cgroup is removed when all processes die, without IN_DELETE_SELF the
service must resort to inefficient workarounds such as:
1) Periodically scanning procfs to detect process death (wastes CPU
and is susceptible to PID reuse).
2) Holding a pidfd for every monitored cgroup (can exhaust file
descriptors).

This patch enables IN_DELETE_SELF and IN_IGNORED events for kernfs files
and directories by clearing inode i_nlink values during removal. This
allows VFS to make the necessary fsnotify calls so that userspace
receives the inotify events.
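
As an illustration of the link-count state involved, the following is a
minimal userspace sketch, not part of this patch. It assumes a regular
local filesystem, and the helper name nlink_after_unlink is hypothetical.
It unlinks a file while still holding it open and returns the st_nlink
value that fstat reports, which is expected to be 0 on typical
filesystems. Clearing i_nlink on removal is meant to put kernfs inodes in
the same state.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Illustration only (hypothetical helper): create a file at @path, unlink
 * it while keeping it open, and return the link count reported for the
 * now-unlinked inode, or -1 on error.
 */
static long nlink_after_unlink(const char *path)
{
	struct stat st;
	long nlink = -1;
	int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);

	if (fd < 0)
		return -1;

	/* The name is gone after unlink(), but the open fd keeps the inode. */
	if (unlink(path) == 0 && fstat(fd, &st) == 0)
		nlink = (long)st.st_nlink;

	close(fd);
	return nlink;
}
```

On a regular filesystem, IN_DELETE_SELF for such a file is normally
delivered only once the last reference (here, the open descriptor) is
dropped.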

As a result, applications can rely on a single existing watch on a file
of interest (e.g. memory.events) to receive notifications for both
modifications and the eventual removal of the file, as well as automatic
watch descriptor cleanup, simplifying userspace logic and improving
efficiency.
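
The userspace side of this can be sketched as follows. This is a minimal
example under stated assumptions rather than a reference implementation:
the names watch_result, watch_file, wait_until_removed and
print_watch_result are hypothetical, and only standard inotify calls are
used. The path would be something like a cgroup's memory.events on a
kernel with this patch, or any regular file otherwise.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Hypothetical summary of what a single watch reported before it went away. */
struct watch_result {
	unsigned int modifications;	/* number of IN_MODIFY events seen */
	int deleted;			/* IN_DELETE_SELF was received */
	int ignored;			/* IN_IGNORED was received, watch is gone */
};

/* Watch @path for content changes and for its own removal; returns wd or -1. */
static int watch_file(int ifd, const char *path)
{
	return inotify_add_watch(ifd, path, IN_MODIFY | IN_DELETE_SELF);
}

/*
 * Block on @ifd until the kernel drops watch @wd (IN_IGNORED).
 * Returns 0 once the watch is gone, -1 on a read error.
 */
static int wait_until_removed(int ifd, int wd, struct watch_result *res)
{
	char buf[4096]
		__attribute__((aligned(__alignof__(struct inotify_event))));

	res->modifications = 0;
	res->deleted = 0;
	res->ignored = 0;

	while (!res->ignored) {
		ssize_t len = read(ifd, buf, sizeof(buf));

		if (len < 0 && errno == EINTR)
			continue;
		if (len <= 0)
			return -1;

		/* One read() may return several variable-length events. */
		for (char *p = buf; p < buf + len; ) {
			const struct inotify_event *ev =
				(const struct inotify_event *)p;

			if (ev->wd == wd) {
				if (ev->mask & IN_MODIFY)
					res->modifications++;
				if (ev->mask & IN_DELETE_SELF)
					res->deleted = 1;
				if (ev->mask & IN_IGNORED)
					res->ignored = 1;
			}
			p += sizeof(*ev) + ev->len;
		}
	}
	return 0;
}

/* Print a one-line summary of the events collected for @path. */
static void print_watch_result(const char *path, const struct watch_result *res)
{
	printf("%s: %u modification(s), deleted=%d, ignored=%d\n",
	       path, res->modifications, res->deleted, res->ignored);
}
```

A caller would create the inotify instance with inotify_init1(), call
watch_file() once per monitored file, and release its per-cgroup state
when wait_until_removed() returns. IN_IGNORED does not need to be
requested in the mask; the kernel reports it whenever a watch is removed.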

There is a gap in this implementation for certain file removals due to
their unique nature in kernfs. Directory removals that trigger file
removals occur through vfs_rmdir, which shrinks the dcache and emits
fsnotify events after the rmdir operation; there is no issue here.
However, writes to particular kernfs files (e.g. cgroup.subtree_control)
can also cause file removal, but vfs_write does not attempt to emit
fsnotify events after the write operation, even if i_nlink counts are 0.
As no use case for monitoring this category of file removals is known,
they are left without IN_DELETE or IN_DELETE_SELF events being generated.
Fanotify recursive monitoring also does not work for kernfs nodes that
do not have inodes attached, as they are created on-demand in kernfs.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://patch.msgid.link/20260225223404.783173-3-tjmercier@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

authored by T.J. Mercier and committed by Greg Kroah-Hartman
eea5d2bb 507d8ce1

+50 -4
fs/kernfs/dir.c
···
  * removers may invoke this function concurrently on @kn and all will
  * return after draining is complete.
  */
-static void kernfs_drain(struct kernfs_node *kn)
+static void kernfs_drain(struct kernfs_node *kn, bool drop_supers)
 	__releases(&kernfs_root(kn)->kernfs_rwsem)
 	__acquires(&kernfs_root(kn)->kernfs_rwsem)
 {
···
 		return;

 	up_write(&root->kernfs_rwsem);
+	if (drop_supers)
+		up_read(&root->kernfs_supers_rwsem);

 	if (kernfs_lockdep(kn)) {
 		rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
···
 	if (kernfs_should_drain_open_files(kn))
 		kernfs_drain_open_files(kn);

+	if (drop_supers)
+		down_read(&root->kernfs_supers_rwsem);
 	down_write(&root->kernfs_rwsem);
 }

···
 		kn->flags |= KERNFS_HIDDEN;
 		if (kernfs_active(kn))
 			atomic_add(KN_DEACTIVATED_BIAS, &kn->active);
-		kernfs_drain(kn);
+		kernfs_drain(kn, false);
 	}

 	up_write(&root->kernfs_rwsem);
+}
+
+/*
+ * This function enables VFS to send fsnotify events for deletions.
+ * There is gap in this implementation for certain file removals due their
+ * unique nature in kernfs. Directory removals that trigger file removals occur
+ * through vfs_rmdir, which shrinks the dcache and emits fsnotify events after
+ * the rmdir operation; there is no issue here. However kernfs writes to
+ * particular files (e.g. cgroup.subtree_control) can also cause file removal,
+ * but vfs_write does not attempt to emit fsnotify events after the write
+ * operation, even if i_nlink counts are 0. As a usecase for monitoring this
+ * category of file removals is not known, they are left without having
+ * IN_DELETE or IN_DELETE_SELF events generated.
+ * Fanotify recursive monitoring also does not work for kernfs nodes that do not
+ * have inodes attached, as they are created on-demand in kernfs.
+ */
+static void kernfs_clear_inode_nlink(struct kernfs_node *kn)
+{
+	struct kernfs_root *root = kernfs_root(kn);
+	struct kernfs_super_info *info;
+
+	lockdep_assert_held_read(&root->kernfs_supers_rwsem);
+
+	list_for_each_entry(info, &root->supers, node) {
+		struct inode *inode = ilookup(info->sb, kernfs_ino(kn));
+
+		if (inode) {
+			clear_nlink(inode);
+			iput(inode);
+		}
+	}
 }

 static void __kernfs_remove(struct kernfs_node *kn)
···
 	if (!kn)
 		return;

+	lockdep_assert_held_read(&kernfs_root(kn)->kernfs_supers_rwsem);
 	lockdep_assert_held_write(&kernfs_root(kn)->kernfs_rwsem);

 	/*
···
 		 */
 		kernfs_get(pos);

-		kernfs_drain(pos);
+		kernfs_drain(pos, true);
 		parent = kernfs_parent(pos);
 		/*
 		 * kernfs_unlink_sibling() succeeds once per node. Use it
···
 			struct kernfs_iattrs *ps_iattr =
 				parent ? parent->iattr : NULL;

-			/* update timestamps on the parent */
 			down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);

+			kernfs_clear_inode_nlink(pos);
+
+			/* update timestamps on the parent */
 			if (ps_iattr) {
 				ktime_get_real_ts64(&ps_iattr->ia_ctime);
 				ps_iattr->ia_mtime = ps_iattr->ia_ctime;
···

 	root = kernfs_root(kn);

+	down_read(&root->kernfs_supers_rwsem);
 	down_write(&root->kernfs_rwsem);
 	__kernfs_remove(kn);
 	up_write(&root->kernfs_rwsem);
+	up_read(&root->kernfs_supers_rwsem);
 }

 /**
···
 	bool ret;
 	struct kernfs_root *root = kernfs_root(kn);

+	down_read(&root->kernfs_supers_rwsem);
 	down_write(&root->kernfs_rwsem);
 	kernfs_break_active_protection(kn);

···
 				break;

 			up_write(&root->kernfs_rwsem);
+			up_read(&root->kernfs_supers_rwsem);
 			schedule();
+			down_read(&root->kernfs_supers_rwsem);
 			down_write(&root->kernfs_rwsem);
 		}
 		finish_wait(waitq, &wait);
···
 	kernfs_unbreak_active_protection(kn);

 	up_write(&root->kernfs_rwsem);
+	up_read(&root->kernfs_supers_rwsem);
 	return ret;
 }

···
 	}

 	root = kernfs_root(parent);
+	down_read(&root->kernfs_supers_rwsem);
 	down_write(&root->kernfs_rwsem);

 	kn = kernfs_find_ns(parent, name, ns);
···
 	}

 	up_write(&root->kernfs_rwsem);
+	up_read(&root->kernfs_supers_rwsem);

 	if (kn)
 		return 0;