Merge tag 'erofs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Pull erofs (and fscache) updates from Gao Xiang:
"After working on it on the mailing list for more than half a year, we
have finally gotten the 'erofs over fscache' feature into shape.
Hopefully it will open up more possibilities for the community.

The story mainly started from a new project that we call "RAFS v6" [1]
for the Nydus image service almost a year ago, which enhances EROFS to be
a new form of one bootstrap (which includes metadata representing the
whole fs tree) + several data-deduplicated content-addressable blobs
(actually treated as multiple devices). Each blob can represent one
container image layer, though not exactly: if all new data already
exists in previous blobs, there is no need to introduce another new
blob.

It is actually not a new idea (at least on my side it's much like a
simplified casync [2] for now) and has many benefits over per-file
blobs or other existing approaches, since typically each RAFS v6 image
only has dozens of device blobs instead of thousands of per-file blobs.
It's easy to sign with user keys as a golden image, transfer
untouched with minimal overhead over the network, keep in some type
of storage conveniently, and run with (optional) runtime verification
without involving too many irrelevant features crossing the system
beyond EROFS itself. At least that's our final goal and we keep
working on it. There was also a good summary of this approach from the
casync author [3].

Further optimizations aside, this work was almost done in the
previous Linux release cycles. In this round, we'd like to introduce
on-demand load for EROFS with the fscache/cachefiles infrastructure,
considering the following advantages:

- Introduce a new file-based backend to EROFS. Although each image only
contains dozens of blobs, on a densely-deployed runC host, for
example, there could still be massive numbers of blobs on a machine,
which is messy if each blob is treated as a device. In contrast,
fscache and cachefiles are really great interfaces for us to make
them work.

- Introduce on-demand load to fscache and EROFS. Previously, fscache
was mainly used for caching network-like filesystems; now it can
support on-demand downloading for local fses too, with the exact
local fs on-disk format. It has many advantages, which have been
described in the latest patchset cover letter [4]. In addition to
that, most importantly, the cached data is still stored in the
original local fs on-disk format, so it's still the one signed
with private keys, only possibly partially available. Users can
fully trust it at runtime. Later, users can also back up
cachefiles easily to another machine.

- More reliable on-demand approach in principle. After data is all
available locally, the user daemon no longer needs to be online in
some use cases, which helps daemon crash recovery (filesystems can
still be in service) and hot-upgrade (the user daemon can be upgraded
more frequently as new features or protocols are introduced).

- Other formats can also be converted to the EROFS filesystem format
over the internet on the fly with the new on-demand load feature and
mounted. That is entirely possible as long as such archive format
metadata can be fetched in advance, as with stargz.

In addition, although currently our target user is the Nydus image
service [5], later it can be used for other use cases like on-demand
system booting, etc. As for the fscache on-demand load feature itself,
strictly speaking it can be used for other local fses too. Later we
could promote most code to the iomap infrastructure and also enhance it
in a read-write way if other local fses are interested.

Thanks to David Howells for taking so much time and patience on this
over these months, many thanks with great respect here again! Thanks to
Jeffle for working on this feature and Xin Yin from Bytedance for the
asynchronous I/O implementation, as well as Zichen Tian, Jia Zhu, and
Yan Song for testing, much appreciated. We're also exploring more
possibilities for fscache cache management over FSDAX for secure
containers and working on more improvements and useful features for
fscache, cachefiles, and on-demand load.

In addition to "erofs over fscache", NFS export and idmapped mounts
were also completed in this cycle for container use cases.

Summary:

- Add erofs on-demand load support over fscache

- Support NFS export for erofs

- Support idmapped mounts for erofs

- Don't prompt for risk any more when using big pcluster

- Fix buffer copy overflow of ztailpacking feature

- Several minor cleanups"

[1] https://lore.kernel.org/r/20210730194625.93856-1-hsiangkao@linux.alibaba.com
[2] https://github.com/systemd/casync
[3] http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html
[4] https://lore.kernel.org/r/20220509074028.74954-1-jefflexu@linux.alibaba.com
[5] https://github.com/dragonflyoss/image-service

* tag 'erofs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (29 commits)
erofs: scan devices from device table
erofs: change to use asynchronous io for fscache readpage/readahead
erofs: add 'fsid' mount option
erofs: implement fscache-based data readahead
erofs: implement fscache-based data read for inline layout
erofs: implement fscache-based data read for non-inline layout
erofs: implement fscache-based metadata read
erofs: register fscache context for extra data blobs
erofs: register fscache context for primary data blob
erofs: add erofs_fscache_read_folios() helper
erofs: add anonymous inode caching metadata for data blobs
erofs: add fscache context helper functions
erofs: register fscache volume
erofs: add fscache mode check helper
erofs: make erofs_map_blocks() generally available
cachefiles: document on-demand read mode
cachefiles: add tracepoints for on-demand read mode
cachefiles: enable on-demand read mode
cachefiles: implement on-demand read
cachefiles: notify the user daemon when withdrawing cookie
...

Linus Torvalds 4 years ago 65965d95 850f6033

+1996 -163

24 changed files

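Documentation/filesystems/caching/cachefiles.rst
fs/cachefiles/Kconfig
fs/cachefiles/Makefile
fs/cachefiles/daemon.c
fs/cachefiles/interface.c
fs/cachefiles/internal.h
fs/cachefiles/io.c
fs/cachefiles/namei.c
fs/cachefiles/ondemand.c
fs/erofs/Kconfig
fs/erofs/Makefile
fs/erofs/data.c
fs/erofs/decompressor.c
fs/erofs/erofs_fs.h
fs/erofs/fscache.c
fs/erofs/inode.c
fs/erofs/internal.h
fs/erofs/namei.c
fs/erofs/super.c
fs/erofs/sysfs.c
include/linux/fscache.h
include/linux/netfs.h
include/trace/events/cachefiles.h
include/uapi/linux/cachefiles.h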

+178

Documentation/filesystems/caching/cachefiles.rst

···

  (*) Debugging.

+ (*) On-demand Read.


 Overview
···
 	echo $((1|4|8)) >/sys/module/cachefiles/parameters/debug

 will turn on all function entry debugging.
+
+
+On-demand Read
+==============
+
+When working in its original mode, CacheFiles serves as a local cache for a
+remote networking fs - while in on-demand read mode, CacheFiles can boost the
+scenario where on-demand read semantics are needed, e.g. container image
+distribution.
+
+The essential difference between these two modes is seen when a cache miss
+occurs: In the original mode, the netfs will fetch the data from the remote
+server and then write it to the cache file; in on-demand read mode, fetching
+the data and writing it into the cache is delegated to a user daemon.
+
+``CONFIG_CACHEFILES_ONDEMAND`` should be enabled to support on-demand read mode.
+
+
+Protocol Communication
+----------------------
+
+The on-demand read mode uses a simple protocol for communication between kernel
+and user daemon. The protocol can be modeled as::
+
+	kernel --[request]--> user daemon --[reply]--> kernel
+
+CacheFiles will send requests to the user daemon when needed. The user daemon
+should poll the devnode ('/dev/cachefiles') to check if there's a pending
+request to be processed. A POLLIN event will be returned when there's a pending
+request.
+
+The user daemon then reads the devnode to fetch a request to process. It should
+be noted that each read only gets one request. When it has finished processing
+the request, the user daemon should write the reply to the devnode.
+
+Each request starts with a message header of the form::
+
+	struct cachefiles_msg {
+		__u32 msg_id;
+		__u32 opcode;
+		__u32 len;
+		__u32 object_id;
+		__u8 data[];
+	};
+
+where:
+
+	* ``msg_id`` is a unique ID identifying this request among all pending
+	  requests.
+
+	* ``opcode`` indicates the type of this request.
+
+	* ``object_id`` is a unique ID identifying the cache file operated on.
+
+	* ``data`` indicates the payload of this request.
+
+	* ``len`` indicates the whole length of this request, including the
+	  header and following type-specific payload.
+
+
+Turning on On-demand Mode
+-------------------------
+
+An optional parameter becomes available to the "bind" command::
+
+	bind [ondemand]
+
+When the "bind" command is given no argument, it defaults to the original mode.
+When it is given the "ondemand" argument, i.e. "bind ondemand", on-demand read
+mode will be enabled.
+
+
+The OPEN Request
+----------------
+
+When the netfs opens a cache file for the first time, a request with the
+CACHEFILES_OP_OPEN opcode, a.k.a an OPEN request will be sent to the user
+daemon. The payload format is of the form::
+
+	struct cachefiles_open {
+		__u32 volume_key_size;
+		__u32 cookie_key_size;
+		__u32 fd;
+		__u32 flags;
+		__u8 data[];
+	};
+
+where:
+
+	* ``data`` contains the volume_key followed directly by the cookie_key.
+	  The volume key is a NUL-terminated string; the cookie key is binary
+	  data.
+
+	* ``volume_key_size`` indicates the size of the volume key in bytes.
+
+	* ``cookie_key_size`` indicates the size of the cookie key in bytes.
+
+	* ``fd`` indicates an anonymous fd referring to the cache file, through
+	  which the user daemon can perform write/llseek file operations on the
+	  cache file.
+
+
+The user daemon can use the given (volume_key, cookie_key) pair to distinguish
+the requested cache file. With the given anonymous fd, the user daemon can
+fetch the data and write it to the cache file in the background, even when
+kernel has not triggered a cache miss yet.
+
+Be noted that each cache file has a unique object_id, while it may have multiple
+anonymous fds. The user daemon may duplicate anonymous fds from the initial
+anonymous fd indicated by the @fd field through dup(). Thus each object_id can
+be mapped to multiple anonymous fds, while the usr daemon itself needs to
+maintain the mapping.
+
+When implementing a user daemon, please be careful of RLIMIT_NOFILE,
+``/proc/sys/fs/nr_open`` and ``/proc/sys/fs/file-max``. Typically these needn't
+be huge since they're related to the number of open device blobs rather than
+open files of each individual filesystem.
+
+The user daemon should reply the OPEN request by issuing a "copen" (complete
+open) command on the devnode::
+
+	copen <msg_id>,<cache_size>
+
+where:
+
+	* ``msg_id`` must match the msg_id field of the OPEN request.
+
+	* When >= 0, ``cache_size`` indicates the size of the cache file;
+	  when < 0, ``cache_size`` indicates any error code encountered by the
+	  user daemon.
+
+
+The CLOSE Request
+-----------------
+
+When a cookie withdrawn, a CLOSE request (opcode CACHEFILES_OP_CLOSE) will be
+sent to the user daemon. This tells the user daemon to close all anonymous fds
+associated with the given object_id. The CLOSE request has no extra payload,
+and shouldn't be replied.
+
+
+The READ Request
+----------------
+
+When a cache miss is encountered in on-demand read mode, CacheFiles will send a
+READ request (opcode CACHEFILES_OP_READ) to the user daemon. This tells the user
+daemon to fetch the contents of the requested file range. The payload is of the
+form::
+
+	struct cachefiles_read {
+		__u64 off;
+		__u64 len;
+	};
+
+where:
+
+	* ``off`` indicates the starting offset of the requested file range.
+
+	* ``len`` indicates the length of the requested file range.
+
+
+When it receives a READ request, the user daemon should fetch the requested data
+and write it to the cache file identified by object_id.
+
+When it has finished processing the READ request, the user daemon should reply
+by using the CACHEFILES_IOC_READ_COMPLETE ioctl on one of the anonymous fds
+associated with the object_id given in the READ request. The ioctl is of the
+form::
+
+	ioctl(fd, CACHEFILES_IOC_READ_COMPLETE, msg_id);
+
+where:
+
+	* ``fd`` is one of the anonymous fds associated with the object_id
+	  given.
+
+	* ``msg_id`` must match the msg_id field of the READ request.
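To make the request layout documented above more concrete, here is a minimal userspace sketch in C of how a daemon might validate and split an OPEN request after reading it from the devnode. The names od_msg_hdr, od_open_hdr, od_open_view and od_parse_open are hypothetical local mirrors of the documented layouts, not the uapi definitions; a real daemon would presumably use the structures from the uapi header instead. The sketch also assumes that volume_key_size counts the terminating NUL of the volume key, which the documentation does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the documented message header (without data[]). */
struct od_msg_hdr {
	uint32_t msg_id;
	uint32_t opcode;
	uint32_t len;		/* whole request: header + payload */
	uint32_t object_id;
};

/* Hypothetical mirror of the documented OPEN payload (without data[]). */
struct od_open_hdr {
	uint32_t volume_key_size;
	uint32_t cookie_key_size;
	uint32_t fd;
	uint32_t flags;
};

/* Parsed view; the key pointers refer into the caller's buffer. */
struct od_open_view {
	struct od_msg_hdr msg;
	struct od_open_hdr open;
	const char *volume_key;		/* NUL-terminated string */
	const uint8_t *cookie_key;	/* binary, cookie_key_size bytes */
};

/* Returns 0 on success, -1 if the buffer does not hold a sane OPEN request. */
static int od_parse_open(const void *buf, size_t n, struct od_open_view *out)
{
	const uint8_t *p = buf;
	const size_t fixed = sizeof(out->msg) + sizeof(out->open);
	size_t vks, cks, avail;

	assert(buf != NULL && out != NULL);
	if (n < fixed)
		return -1;

	memcpy(&out->msg, p, sizeof(out->msg));
	if (out->msg.len < fixed || out->msg.len > n)
		return -1;

	memcpy(&out->open, p + sizeof(out->msg), sizeof(out->open));
	vks = out->open.volume_key_size;
	cks = out->open.cookie_key_size;
	avail = out->msg.len - fixed;

	/* Both keys must fit inside the length announced by the header. */
	if (vks == 0 || vks > avail || cks > avail - vks)
		return -1;
	/* Assumption: the volume key size includes its terminating NUL. */
	if (p[fixed + vks - 1] != '\0')
		return -1;

	out->volume_key = (const char *)(p + fixed);
	out->cookie_key = p + fixed + vks;
	return 0;
}
```

A daemon would call something like this on the bytes returned by one read() of the devnode, after checking the opcode, and then use open.fd as the anonymous fd for later writes.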

+12

fs/cachefiles/Kconfig

···
 	help
 	  This permits error injection to be enabled in cachefiles whilst a
 	  cache is in service.
+
+config CACHEFILES_ONDEMAND
+	bool "Support for on-demand read"
+	depends on CACHEFILES
+	default n
+	help
+	  This permits userspace to enable the cachefiles on-demand read mode.
+	  In this mode, when a cache miss occurs, responsibility for fetching
+	  the data lies with the cachefiles backend instead of with the netfs
+	  and is delegated to userspace.
+
+	  If unsure, say N.

fs/cachefiles/Makefile

···
 	xattr.o

 cachefiles-$(CONFIG_CACHEFILES_ERROR_INJECTION) += error_inject.o
+cachefiles-$(CONFIG_CACHEFILES_ONDEMAND) += ondemand.o

 obj-$(CONFIG_CACHEFILES) := cachefiles.o

+96 -21

fs/cachefiles/daemon.c

···
 	{ "inuse",	cachefiles_daemon_inuse		},
 	{ "secctx",	cachefiles_daemon_secctx	},
 	{ "tag",	cachefiles_daemon_tag		},
+#ifdef CONFIG_CACHEFILES_ONDEMAND
+	{ "copen",	cachefiles_ondemand_copen	},
+#endif
 	{ "",		NULL				}
 };

···
 	INIT_LIST_HEAD(&cache->volumes);
 	INIT_LIST_HEAD(&cache->object_list);
 	spin_lock_init(&cache->object_list_lock);
+	refcount_set(&cache->unbind_pincount, 1);
+	xa_init_flags(&cache->reqs, XA_FLAGS_ALLOC);
+	xa_init_flags(&cache->ondemand_ids, XA_FLAGS_ALLOC1);

 	/* set default caching limits
 	 * - limit at 1% free space and/or free files
···
 	return 0;
 }

+static void cachefiles_flush_reqs(struct cachefiles_cache *cache)
+{
+	struct xarray *xa = &cache->reqs;
+	struct cachefiles_req *req;
+	unsigned long index;
+
+	/*
+	 * Make sure the following two operations won't be reordered.
+	 *   1) set CACHEFILES_DEAD bit
+	 *   2) flush requests in the xarray
+	 * Otherwise the request may be enqueued after xarray has been
+	 * flushed, leaving the orphan request never being completed.
+	 *
+	 * CPU 1			CPU 2
+	 * =====			=====
+	 * flush requests in the xarray
+	 *				test CACHEFILES_DEAD bit
+	 *				enqueue the request
+	 * set CACHEFILES_DEAD bit
+	 */
+	smp_mb();
+
+	xa_lock(xa);
+	xa_for_each(xa, index, req) {
+		req->error = -EIO;
+		complete(&req->done);
+	}
+	xa_unlock(xa);
+
+	xa_destroy(&cache->reqs);
+	xa_destroy(&cache->ondemand_ids);
+}
+
+void cachefiles_put_unbind_pincount(struct cachefiles_cache *cache)
+{
+	if (refcount_dec_and_test(&cache->unbind_pincount)) {
+		cachefiles_daemon_unbind(cache);
+		cachefiles_open = 0;
+		kfree(cache);
+	}
+}
+
+void cachefiles_get_unbind_pincount(struct cachefiles_cache *cache)
+{
+	refcount_inc(&cache->unbind_pincount);
+}
+
 /*
  * Release a cache.
  */
···

 	set_bit(CACHEFILES_DEAD, &cache->flags);

-	cachefiles_daemon_unbind(cache);
+	if (cachefiles_in_ondemand_mode(cache))
+		cachefiles_flush_reqs(cache);

 	/* clean up the control file interface */
 	cache->cachefilesd = NULL;
 	file->private_data = NULL;
-	cachefiles_open = 0;

-	kfree(cache);
+	cachefiles_put_unbind_pincount(cache);

 	_leave("");
 	return 0;
 }

-/*
- * Read the cache state.
- */
-static ssize_t cachefiles_daemon_read(struct file *file, char __user *_buffer,
-				      size_t buflen, loff_t *pos)
+static ssize_t cachefiles_do_daemon_read(struct cachefiles_cache *cache,
+					 char __user *_buffer, size_t buflen)
 {
-	struct cachefiles_cache *cache = file->private_data;
 	unsigned long long b_released;
 	unsigned f_released;
 	char buffer[256];
 	int n;
-
-	//_enter(",,%zu,", buflen);
-
-	if (!test_bit(CACHEFILES_READY, &cache->flags))
-		return 0;

 	/* check how much space the cache has */
 	cachefiles_has_space(cache, 0, 0, cachefiles_has_space_check);
···
 		return -EFAULT;

 	return n;
+}
+
+/*
+ * Read the cache state.
+ */
+static ssize_t cachefiles_daemon_read(struct file *file, char __user *_buffer,
+				      size_t buflen, loff_t *pos)
+{
+	struct cachefiles_cache *cache = file->private_data;
+
+	//_enter(",,%zu,", buflen);
+
+	if (!test_bit(CACHEFILES_READY, &cache->flags))
+		return 0;
+
+	if (cachefiles_in_ondemand_mode(cache))
+		return cachefiles_ondemand_daemon_read(cache, _buffer, buflen);
+	else
+		return cachefiles_do_daemon_read(cache, _buffer, buflen);
 }

 /*
···
 	poll_wait(file, &cache->daemon_pollwq, poll);
 	mask = 0;

-	if (test_bit(CACHEFILES_STATE_CHANGED, &cache->flags))
-		mask |= EPOLLIN;
+	if (cachefiles_in_ondemand_mode(cache)) {
+		if (!xa_empty(&cache->reqs))
+			mask |= EPOLLIN;
+	} else {
+		if (test_bit(CACHEFILES_STATE_CHANGED, &cache->flags))
+			mask |= EPOLLIN;
+	}

 	if (test_bit(CACHEFILES_CULLING, &cache->flags))
 		mask |= EPOLLOUT;
···
 	    cache->brun_percent >= 100)
 		return -ERANGE;

-	if (*args) {
-		pr_err("'bind' command doesn't take an argument\n");
-		return -EINVAL;
-	}
-
 	if (!cache->rootdirname) {
 		pr_err("No cache directory specified\n");
 		return -EINVAL;
···
 	if (test_bit(CACHEFILES_READY, &cache->flags)) {
 		pr_err("Cache already bound\n");
 		return -EBUSY;
+	}
+
+	if (IS_ENABLED(CONFIG_CACHEFILES_ONDEMAND)) {
+		if (!strcmp(args, "ondemand")) {
+			set_bit(CACHEFILES_ONDEMAND_MODE, &cache->flags);
+		} else if (*args) {
+			pr_err("Invalid argument to the 'bind' command\n");
+			return -EINVAL;
+		}
+	} else if (*args) {
+		pr_err("'bind' command doesn't take an argument\n");
+		return -EINVAL;
 	}

 	/* Make sure we have copies of the tag string */
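From the daemon's side, the "bind ondemand" and "copen" commands handled above are plain text written to the devnode. The following C sketch shows one way a user daemon might build and send them. The helper names od_format_copen and od_send_cmd are hypothetical and not part of any existing daemon; error handling is reduced to a pass/fail return.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Format "copen <msg_id>,<cache_size>" into buf.
 * A negative cache_size carries an error code back to the kernel.
 * Returns the string length, or -1 if buf is too small.
 */
static int od_format_copen(char *buf, size_t n, uint32_t msg_id,
			   long long cache_size)
{
	int len;

	assert(buf != NULL);
	len = snprintf(buf, n, "copen %u,%lld", (unsigned int)msg_id,
		       cache_size);
	return (len < 0 || (size_t)len >= n) ? -1 : len;
}

/*
 * Write one command string (e.g. "bind ondemand" or a formatted copen)
 * to an already opened devnode fd. Returns 0 on success, -1 otherwise.
 */
static int od_send_cmd(int devfd, const char *cmd)
{
	size_t len = strlen(cmd);
	ssize_t written = write(devfd, cmd, len);

	return (written == (ssize_t)len) ? 0 : -1;
}
```

For example, after opening the devnode and configuring the cache directory and tag, a daemon could call od_send_cmd(devfd, "bind ondemand"), and later answer an OPEN request with the string produced by od_format_copen.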

fs/cachefiles/interface.c

···
 		spin_unlock(&cache->object_list_lock);
 	}

+	cachefiles_ondemand_clean_object(object);
+
 	if (object->file) {
 		cachefiles_begin_secure(cache, &saved_cred);
 		cachefiles_clean_up_object(object, cache);

+78

fs/cachefiles/internal.h

···
 #include <linux/fscache-cache.h>
 #include <linux/cred.h>
 #include <linux/security.h>
+#include <linux/xarray.h>
+#include <linux/cachefiles.h>

 #define CACHEFILES_DIO_BLOCK_SIZE 4096

···
 	enum cachefiles_content		content_info:8;	/* Info about content presence */
 	unsigned long			flags;
 #define CACHEFILES_OBJECT_USING_TMPFILE	0		/* Have an unlinked tmpfile */
+#ifdef CONFIG_CACHEFILES_ONDEMAND
+	int				ondemand_id;
+#endif
 };
+
+#define CACHEFILES_ONDEMAND_ID_CLOSED	-1

 /*
  * Cache files cache definition
···
 #define CACHEFILES_DEAD			1	/* T if cache dead */
 #define CACHEFILES_CULLING		2	/* T if cull engaged */
 #define CACHEFILES_STATE_CHANGED	3	/* T if state changed (poll trigger) */
+#define CACHEFILES_ONDEMAND_MODE	4	/* T if in on-demand read mode */
 	char				*rootdirname;	/* name of cache root directory */
 	char				*secctx;	/* LSM security context */
 	char				*tag;		/* cache binding tag */
+	refcount_t			unbind_pincount;/* refcount to do daemon unbind */
+	struct xarray			reqs;		/* xarray of pending on-demand requests */
+	struct xarray			ondemand_ids;	/* xarray for ondemand_id allocation */
+	u32				ondemand_id_next;
 };
+
+static inline bool cachefiles_in_ondemand_mode(struct cachefiles_cache *cache)
+{
+	return IS_ENABLED(CONFIG_CACHEFILES_ONDEMAND) &&
+		test_bit(CACHEFILES_ONDEMAND_MODE, &cache->flags);
+}
+
+struct cachefiles_req {
+	struct cachefiles_object *object;
+	struct completion done;
+	int error;
+	struct cachefiles_msg msg;
+};
+
+#define CACHEFILES_REQ_NEW	XA_MARK_1

 #include <trace/events/cachefiles.h>

···
  * daemon.c
  */
 extern const struct file_operations cachefiles_daemon_fops;
+extern void cachefiles_get_unbind_pincount(struct cachefiles_cache *cache);
+extern void cachefiles_put_unbind_pincount(struct cachefiles_cache *cache);

 /*
  * error_inject.c
···
  */
 extern bool cachefiles_begin_operation(struct netfs_cache_resources *cres,
 				       enum fscache_want_state want_state);
+extern int __cachefiles_prepare_write(struct cachefiles_object *object,
+				      struct file *file,
+				      loff_t *_start, size_t *_len,
+				      bool no_space_allocated_yet);
+extern int __cachefiles_write(struct cachefiles_object *object,
+			      struct file *file,
+			      loff_t start_pos,
+			      struct iov_iter *iter,
+			      netfs_io_terminated_t term_func,
+			      void *term_func_priv);

 /*
  * key.c
···
 extern struct file *cachefiles_create_tmpfile(struct cachefiles_object *object);
 extern bool cachefiles_commit_tmpfile(struct cachefiles_cache *cache,
 				      struct cachefiles_object *object);
+
+/*
+ * ondemand.c
+ */
+#ifdef CONFIG_CACHEFILES_ONDEMAND
+extern ssize_t cachefiles_ondemand_daemon_read(struct cachefiles_cache *cache,
+					char __user *_buffer, size_t buflen);
+
+extern int cachefiles_ondemand_copen(struct cachefiles_cache *cache,
+				     char *args);
+
+extern int cachefiles_ondemand_init_object(struct cachefiles_object *object);
+extern void cachefiles_ondemand_clean_object(struct cachefiles_object *object);
+
+extern int cachefiles_ondemand_read(struct cachefiles_object *object,
+				    loff_t pos, size_t len);
+
+#else
+static inline ssize_t cachefiles_ondemand_daemon_read(struct cachefiles_cache *cache,
+					char __user *_buffer, size_t buflen)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int cachefiles_ondemand_init_object(struct cachefiles_object *object)
+{
+	return 0;
+}
+
+static inline void cachefiles_ondemand_clean_object(struct cachefiles_object *object)
+{
+}
+
+static inline int cachefiles_ondemand_read(struct cachefiles_object *object,
+					   loff_t pos, size_t len)
+{
+	return -EOPNOTSUPP;
+}
+#endif

 /*
  * security.c

+48 -28

fs/cachefiles/io.c

···
 /*
  * Initiate a write to the cache.
  */
-static int cachefiles_write(struct netfs_cache_resources *cres,
-			    loff_t start_pos,
-			    struct iov_iter *iter,
-			    netfs_io_terminated_t term_func,
-			    void *term_func_priv)
+int __cachefiles_write(struct cachefiles_object *object,
+		       struct file *file,
+		       loff_t start_pos,
+		       struct iov_iter *iter,
+		       netfs_io_terminated_t term_func,
+		       void *term_func_priv)
 {
-	struct cachefiles_object *object;
 	struct cachefiles_cache *cache;
 	struct cachefiles_kiocb *ki;
 	struct inode *inode;
-	struct file *file;
 	unsigned int old_nofs;
-	ssize_t ret = -ENOBUFS;
+	ssize_t ret;
 	size_t len = iov_iter_count(iter);

-	if (!fscache_wait_for_operation(cres, FSCACHE_WANT_WRITE))
-		goto presubmission_error;
 	fscache_count_write();
-	object = cachefiles_cres_object(cres);
 	cache = object->volume->cache;
-	file = cachefiles_cres_file(cres);

 	_enter("%pD,%li,%llx,%zx/%llx",
 	       file, file_inode(file)->i_ino, start_pos, len,
 	       i_size_read(file_inode(file)));

-	ret = -ENOMEM;
 	ki = kzalloc(sizeof(struct cachefiles_kiocb), GFP_KERNEL);
-	if (!ki)
-		goto presubmission_error;
+	if (!ki) {
+		if (term_func)
+			term_func(term_func_priv, -ENOMEM, false);
+		return -ENOMEM;
+	}

 	refcount_set(&ki->ki_refcnt, 2);
 	ki->iocb.ki_filp = file;
···
 	ki->iocb.ki_flags = IOCB_DIRECT | IOCB_WRITE;
 	ki->iocb.ki_ioprio = get_current_ioprio();
 	ki->object = object;
-	ki->inval_counter = cres->inval_counter;
 	ki->start = start_pos;
 	ki->len = len;
 	ki->term_func = term_func;
···
 	cachefiles_put_kiocb(ki);
 	_leave(" = %zd", ret);
 	return ret;
+}

-presubmission_error:
-	if (term_func)
-		term_func(term_func_priv, ret, false);
-	return ret;
+static int cachefiles_write(struct netfs_cache_resources *cres,
+			    loff_t start_pos,
+			    struct iov_iter *iter,
+			    netfs_io_terminated_t term_func,
+			    void *term_func_priv)
+{
+	if (!fscache_wait_for_operation(cres, FSCACHE_WANT_WRITE)) {
+		if (term_func)
+			term_func(term_func_priv, -ENOBUFS, false);
+		return -ENOBUFS;
+	}
+
+	return __cachefiles_write(cachefiles_cres_object(cres),
+				  cachefiles_cres_file(cres),
+				  start_pos, iter,
+				  term_func, term_func_priv);
 }

 /*
···
 	enum netfs_io_source ret = NETFS_DOWNLOAD_FROM_SERVER;
 	loff_t off, to;
 	ino_t ino = file ? file_inode(file)->i_ino : 0;
+	int rc;

 	_enter("%zx @%llx/%llx", subreq->len, subreq->start, i_size);

···
 	if (test_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags)) {
 		__set_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags);
 		why = cachefiles_trace_read_no_data;
-		goto out_no_object;
+		if (!test_bit(NETFS_SREQ_ONDEMAND, &subreq->flags))
+			goto out_no_object;
 	}

 	/* The object and the file may be being created in the background. */
···
 	object = cachefiles_cres_object(cres);
 	cache = object->volume->cache;
 	cachefiles_begin_secure(cache, &saved_cred);
-
+retry:
 	off = cachefiles_inject_read_error();
 	if (off == 0)
 		off = vfs_llseek(file, subreq->start, SEEK_DATA);
···

 download_and_store:
 	__set_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags);
+	if (test_bit(NETFS_SREQ_ONDEMAND, &subreq->flags)) {
+		rc = cachefiles_ondemand_read(object, subreq->start,
+					      subreq->len);
+		if (!rc) {
+			__clear_bit(NETFS_SREQ_ONDEMAND, &subreq->flags);
+			goto retry;
+		}
+		ret = NETFS_INVALID_READ;
+	}
 out:
 	cachefiles_end_secure(cache, saved_cred);
 out_no_object:
···
 /*
  * Prepare for a write to occur.
  */
-static int __cachefiles_prepare_write(struct netfs_cache_resources *cres,
-				      loff_t *_start, size_t *_len, loff_t i_size,
-				      bool no_space_allocated_yet)
+int __cachefiles_prepare_write(struct cachefiles_object *object,
+			       struct file *file,
+			       loff_t *_start, size_t *_len,
+			       bool no_space_allocated_yet)
 {
-	struct cachefiles_object *object = cachefiles_cres_object(cres);
 	struct cachefiles_cache *cache = object->volume->cache;
-	struct file *file = cachefiles_cres_file(cres);
 	loff_t start = *_start, pos;
 	size_t len = *_len, down;
 	int ret;
···
 	}

 	cachefiles_begin_secure(cache, &saved_cred);
-	ret = __cachefiles_prepare_write(cres, _start, _len, i_size,
+	ret = __cachefiles_prepare_write(object, cachefiles_cres_file(cres),
+					 _start, _len,
 					 no_space_allocated_yet);
 	cachefiles_end_secure(cache, saved_cred);
 	return ret;
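The read-preparation path above probes the backing file with SEEK_DATA, and in on-demand mode retries that probe once the daemon has filled the range. The same kind of probe can be tried from userspace. The C sketch below is only loosely modelled on that logic and is not the kernel's algorithm; od_range_present is a hypothetical helper that reports whether a byte range of a file is fully backed by data, using lseek() with SEEK_DATA and SEEK_HOLE. The fallback numeric values for the two constants are an assumption that holds on Linux.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* exposes SEEK_DATA/SEEK_HOLE with glibc */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA		/* assumption: Linux values */
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/*
 * Return true if [start, start + len) is entirely backed by data in fd.
 * Note that lseek() moves the file offset of fd as a side effect.
 */
static bool od_range_present(int fd, off_t start, off_t len)
{
	off_t data, hole;

	assert(len > 0);

	/* First data at or after start; fails with ENXIO past the last data. */
	data = lseek(fd, start, SEEK_DATA);
	if (data < 0 || data != start)
		return false;

	/* First hole at or after start; end of file counts as a hole. */
	hole = lseek(fd, start, SEEK_HOLE);
	if (hole < 0)
		return false;

	return hole >= start + len;
}
```

A daemon could use a check of this kind to skip ranges it has already written into the cache file, although whether holes are reported precisely depends on the backing filesystem.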

+14 -2

fs/cachefiles/namei.c

···
 	struct dentry *fan = volume->fanout[(u8)object->cookie->key_hash];
 	struct file *file;
 	struct path path;
-	uint64_t ni_size = object->cookie->object_size;
+	uint64_t ni_size;
 	long ret;

-	ni_size = round_up(ni_size, CACHEFILES_DIO_BLOCK_SIZE);

 	cachefiles_begin_secure(cache, &saved_cred);

···
 		file = ERR_PTR(-EBUSY);
 		goto out_dput;
 	}
+
+	ret = cachefiles_ondemand_init_object(object);
+	if (ret < 0) {
+		file = ERR_PTR(ret);
+		goto out_unuse;
+	}
+
+	ni_size = object->cookie->object_size;
+	ni_size = round_up(ni_size, CACHEFILES_DIO_BLOCK_SIZE);

 	if (ni_size > 0) {
 		trace_cachefiles_trunc(object, d_backing_inode(path.dentry), 0, ni_size,
···
 		goto error_fput;
 	}
 	_debug("file -> %pd positive", dentry);
+
+	ret = cachefiles_ondemand_init_object(object);
+	if (ret < 0)
+		goto error_fput;

 	ret = cachefiles_check_auxdata(object, file);
 	if (ret < 0)

+503

fs/cachefiles/ondemand.c

···
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/fdtable.h>
+#include <linux/anon_inodes.h>
+#include <linux/uio.h>
+#include "internal.h"
+
+static int cachefiles_ondemand_fd_release(struct inode *inode,
+					  struct file *file)
+{
+	struct cachefiles_object *object = file->private_data;
+	struct cachefiles_cache *cache = object->volume->cache;
+	int object_id = object->ondemand_id;
+	struct cachefiles_req *req;
+	XA_STATE(xas, &cache->reqs, 0);
+
+	xa_lock(&cache->reqs);
+	object->ondemand_id = CACHEFILES_ONDEMAND_ID_CLOSED;
+
+	/*
+	 * Flush all pending READ requests since their completion depends on
+	 * anon_fd.
+	 */
+	xas_for_each(&xas, req, ULONG_MAX) {
+		if (req->msg.opcode == CACHEFILES_OP_READ) {
+			req->error = -EIO;
+			complete(&req->done);
+			xas_store(&xas, NULL);
+		}
+	}
+	xa_unlock(&cache->reqs);
+
+	xa_erase(&cache->ondemand_ids, object_id);
+	trace_cachefiles_ondemand_fd_release(object, object_id);
+	cachefiles_put_object(object, cachefiles_obj_put_ondemand_fd);
+	cachefiles_put_unbind_pincount(cache);
+	return 0;
+}
+
+static ssize_t cachefiles_ondemand_fd_write_iter(struct kiocb *kiocb,
+						 struct iov_iter *iter)
+{
+	struct cachefiles_object *object = kiocb->ki_filp->private_data;
+	struct cachefiles_cache *cache = object->volume->cache;
+	struct file *file = object->file;
+	size_t len = iter->count;
+	loff_t pos = kiocb->ki_pos;
+	const struct cred *saved_cred;
+	int ret;
+
+	if (!file)
+		return -ENOBUFS;
+
+	cachefiles_begin_secure(cache, &saved_cred);
+	ret = __cachefiles_prepare_write(object, file, &pos, &len, true);
+	cachefiles_end_secure(cache, saved_cred);
+	if (ret < 0)
+		return ret;
+
+	trace_cachefiles_ondemand_fd_write(object, file_inode(file), pos, len);
+	ret = __cachefiles_write(object, file, pos, iter, NULL, NULL);
+	if (!ret)
+		ret = len;
+
+	return ret;
+}
+
+static loff_t cachefiles_ondemand_fd_llseek(struct file *filp, loff_t pos,
+					    int whence)
+{
+	struct cachefiles_object *object = filp->private_data;
+	struct file *file = object->file;
+
+	if (!file)
+		return -ENOBUFS;
+
+	return vfs_llseek(file, pos, whence);
+}
+
+static long cachefiles_ondemand_fd_ioctl(struct file *filp, unsigned int ioctl,
+					 unsigned long arg)
+{
+	struct cachefiles_object *object = filp->private_data;
+	struct cachefiles_cache *cache = object->volume->cache;
+	struct cachefiles_req *req;
+	unsigned long id;
+
+	if (ioctl != CACHEFILES_IOC_READ_COMPLETE)
+		return -EINVAL;
+
+	if (!test_bit(CACHEFILES_ONDEMAND_MODE, &cache->flags))
+		return -EOPNOTSUPP;
+
+	id = arg;
+	req = xa_erase(&cache->reqs, id);
+	if (!req)
+		return -EINVAL;
+
+	trace_cachefiles_ondemand_cread(object, id);
+	complete(&req->done);
+	return 0;
+}
+
+static const struct file_operations cachefiles_ondemand_fd_fops = {
+	.owner		= THIS_MODULE,
+	.release	= cachefiles_ondemand_fd_release,
+	.write_iter	= cachefiles_ondemand_fd_write_iter,
+	.llseek		= cachefiles_ondemand_fd_llseek,
+	.unlocked_ioctl	= cachefiles_ondemand_fd_ioctl,
+};
+
+/*
+ * OPEN request Completion (copen)
+ * - command: "copen <id>,<cache_size>"
+ *   <cache_size> indicates the object size if >=0, error code if negative
+ */
+int cachefiles_ondemand_copen(struct cachefiles_cache *cache, char *args)
+{
+	struct cachefiles_req *req;
+	struct fscache_cookie *cookie;
+	char *pid, *psize;
+	unsigned long id;
+	long size;
+	int ret;
+
+	if (!test_bit(CACHEFILES_ONDEMAND_MODE, &cache->flags))
+		return -EOPNOTSUPP;
+
+	if (!*args) {
+		pr_err("Empty id specified\n");
+		return -EINVAL;
+	}
+
+	pid = args;
+	psize = strchr(args, ',');
+	if (!psize) {
+		pr_err("Cache size is not specified\n");
+		return -EINVAL;
+	}
+
+	*psize = 0;
+	psize++;
+
+	ret = kstrtoul(pid, 0, &id);
+	if (ret)
+		return ret;
+
+	req = xa_erase(&cache->reqs, id);
+	if (!req)
+		return -EINVAL;
+
+	/* fail OPEN request if copen format is invalid */
+	ret = kstrtol(psize, 0, &size);
+	if (ret) {
+		req->error = ret;
+		goto out;
+	}
+
+	/* fail OPEN request if daemon reports an error */
+	if (size < 0) {
+		if (!IS_ERR_VALUE(size))
+			size = -EINVAL;
+		req->error = size;
+		goto out;
+	}
+
+	cookie = req->object->cookie;
+	cookie->object_size = size;
+	if (size)
+		clear_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags);
+	else
+		set_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags);
+	trace_cachefiles_ondemand_copen(req->object, id, size);
+
+out:
+	complete(&req->done);
+	return ret;
+}
+
+static int cachefiles_ondemand_get_fd(struct cachefiles_req *req)
+{
+	struct cachefiles_object *object;
+	struct cachefiles_cache *cache;
+	struct cachefiles_open *load;
+	struct file *file;
+	u32 object_id;
+	int ret, fd;
+
+	object = cachefiles_grab_object(req->object,
+			cachefiles_obj_get_ondemand_fd);
+	cache = object->volume->cache;
+
+	ret = xa_alloc_cyclic(&cache->ondemand_ids, &object_id, NULL,
+			      XA_LIMIT(1, INT_MAX),
+			      &cache->ondemand_id_next, GFP_KERNEL);
+	if (ret < 0)
+		goto err;
+
+	fd = get_unused_fd_flags(O_WRONLY);
+	if (fd < 0) {
+		ret = fd;
+		goto err_free_id;
+	}
+
+	file = anon_inode_getfile("[cachefiles]", &cachefiles_ondemand_fd_fops,
+				  object, O_WRONLY);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto err_put_fd;
+	}
+
+	file->f_mode |= FMODE_PWRITE | FMODE_LSEEK;
+	fd_install(fd, file);
+
load = (void *)req->msg.data; 215 + load->fd = fd; 216 + req->msg.object_id = object_id; 217 + object->ondemand_id = object_id; 218 + 219 + cachefiles_get_unbind_pincount(cache); 220 + trace_cachefiles_ondemand_open(object, &req->msg, load); 221 + return 0; 222 + 223 + err_put_fd: 224 + put_unused_fd(fd); 225 + err_free_id: 226 + xa_erase(&cache->ondemand_ids, object_id); 227 + err: 228 + cachefiles_put_object(object, cachefiles_obj_put_ondemand_fd); 229 + return ret; 230 + } 231 + 232 + ssize_t cachefiles_ondemand_daemon_read(struct cachefiles_cache *cache, 233 + char __user *_buffer, size_t buflen) 234 + { 235 + struct cachefiles_req *req; 236 + struct cachefiles_msg *msg; 237 + unsigned long id = 0; 238 + size_t n; 239 + int ret = 0; 240 + XA_STATE(xas, &cache->reqs, 0); 241 + 242 + /* 243 + * Search for a request that has not ever been processed, to prevent 244 + * requests from being processed repeatedly. 245 + */ 246 + xa_lock(&cache->reqs); 247 + req = xas_find_marked(&xas, UINT_MAX, CACHEFILES_REQ_NEW); 248 + if (!req) { 249 + xa_unlock(&cache->reqs); 250 + return 0; 251 + } 252 + 253 + msg = &req->msg; 254 + n = msg->len; 255 + 256 + if (n > buflen) { 257 + xa_unlock(&cache->reqs); 258 + return -EMSGSIZE; 259 + } 260 + 261 + xas_clear_mark(&xas, CACHEFILES_REQ_NEW); 262 + xa_unlock(&cache->reqs); 263 + 264 + id = xas.xa_index; 265 + msg->msg_id = id; 266 + 267 + if (msg->opcode == CACHEFILES_OP_OPEN) { 268 + ret = cachefiles_ondemand_get_fd(req); 269 + if (ret) 270 + goto error; 271 + } 272 + 273 + if (copy_to_user(_buffer, msg, n) != 0) { 274 + ret = -EFAULT; 275 + goto err_put_fd; 276 + } 277 + 278 + /* CLOSE request has no reply */ 279 + if (msg->opcode == CACHEFILES_OP_CLOSE) { 280 + xa_erase(&cache->reqs, id); 281 + complete(&req->done); 282 + } 283 + 284 + return n; 285 + 286 + err_put_fd: 287 + if (msg->opcode == CACHEFILES_OP_OPEN) 288 + close_fd(((struct cachefiles_open *)msg->data)->fd); 289 + error: 290 + xa_erase(&cache->reqs, id); 291 + 
req->error = ret; 292 + complete(&req->done); 293 + return ret; 294 + } 295 + 296 + typedef int (*init_req_fn)(struct cachefiles_req *req, void *private); 297 + 298 + static int cachefiles_ondemand_send_req(struct cachefiles_object *object, 299 + enum cachefiles_opcode opcode, 300 + size_t data_len, 301 + init_req_fn init_req, 302 + void *private) 303 + { 304 + struct cachefiles_cache *cache = object->volume->cache; 305 + struct cachefiles_req *req; 306 + XA_STATE(xas, &cache->reqs, 0); 307 + int ret; 308 + 309 + if (!test_bit(CACHEFILES_ONDEMAND_MODE, &cache->flags)) 310 + return 0; 311 + 312 + if (test_bit(CACHEFILES_DEAD, &cache->flags)) 313 + return -EIO; 314 + 315 + req = kzalloc(sizeof(*req) + data_len, GFP_KERNEL); 316 + if (!req) 317 + return -ENOMEM; 318 + 319 + req->object = object; 320 + init_completion(&req->done); 321 + req->msg.opcode = opcode; 322 + req->msg.len = sizeof(struct cachefiles_msg) + data_len; 323 + 324 + ret = init_req(req, private); 325 + if (ret) 326 + goto out; 327 + 328 + do { 329 + /* 330 + * Stop enqueuing the request when daemon is dying. The 331 + * following two operations need to be atomic as a whole. 332 + * 1) check cache state, and 333 + * 2) enqueue request if cache is alive. 334 + * Otherwise the request may be enqueued after xarray has been 335 + * flushed, leaving the orphan request never being completed. 
336 + * 337 + * CPU 1 CPU 2 338 + * ===== ===== 339 + * test CACHEFILES_DEAD bit 340 + * set CACHEFILES_DEAD bit 341 + * flush requests in the xarray 342 + * enqueue the request 343 + */ 344 + xas_lock(&xas); 345 + 346 + if (test_bit(CACHEFILES_DEAD, &cache->flags)) { 347 + xas_unlock(&xas); 348 + ret = -EIO; 349 + goto out; 350 + } 351 + 352 + /* coupled with the barrier in cachefiles_flush_reqs() */ 353 + smp_mb(); 354 + 355 + if (opcode != CACHEFILES_OP_OPEN && object->ondemand_id <= 0) { 356 + WARN_ON_ONCE(object->ondemand_id == 0); 357 + xas_unlock(&xas); 358 + ret = -EIO; 359 + goto out; 360 + } 361 + 362 + xas.xa_index = 0; 363 + xas_find_marked(&xas, UINT_MAX, XA_FREE_MARK); 364 + if (xas.xa_node == XAS_RESTART) 365 + xas_set_err(&xas, -EBUSY); 366 + xas_store(&xas, req); 367 + xas_clear_mark(&xas, XA_FREE_MARK); 368 + xas_set_mark(&xas, CACHEFILES_REQ_NEW); 369 + xas_unlock(&xas); 370 + } while (xas_nomem(&xas, GFP_KERNEL)); 371 + 372 + ret = xas_error(&xas); 373 + if (ret) 374 + goto out; 375 + 376 + wake_up_all(&cache->daemon_pollwq); 377 + wait_for_completion(&req->done); 378 + ret = req->error; 379 + out: 380 + kfree(req); 381 + return ret; 382 + } 383 + 384 + static int cachefiles_ondemand_init_open_req(struct cachefiles_req *req, 385 + void *private) 386 + { 387 + struct cachefiles_object *object = req->object; 388 + struct fscache_cookie *cookie = object->cookie; 389 + struct fscache_volume *volume = object->volume->vcookie; 390 + struct cachefiles_open *load = (void *)req->msg.data; 391 + size_t volume_key_size, cookie_key_size; 392 + void *volume_key, *cookie_key; 393 + 394 + /* 395 + * Volume key is a NUL-terminated string. key[0] stores strlen() of the 396 + * string, followed by the content of the string (excluding '\0'). 397 + */ 398 + volume_key_size = volume->key[0] + 1; 399 + volume_key = volume->key + 1; 400 + 401 + /* Cookie key is binary data, which is netfs specific. 
*/ 402 + cookie_key_size = cookie->key_len; 403 + cookie_key = fscache_get_key(cookie); 404 + 405 + if (!(object->cookie->advice & FSCACHE_ADV_WANT_CACHE_SIZE)) { 406 + pr_err("WANT_CACHE_SIZE is needed for on-demand mode\n"); 407 + return -EINVAL; 408 + } 409 + 410 + load->volume_key_size = volume_key_size; 411 + load->cookie_key_size = cookie_key_size; 412 + memcpy(load->data, volume_key, volume_key_size); 413 + memcpy(load->data + volume_key_size, cookie_key, cookie_key_size); 414 + 415 + return 0; 416 + } 417 + 418 + static int cachefiles_ondemand_init_close_req(struct cachefiles_req *req, 419 + void *private) 420 + { 421 + struct cachefiles_object *object = req->object; 422 + int object_id = object->ondemand_id; 423 + 424 + /* 425 + * It's possible that object id is still 0 if the cookie looking up 426 + * phase failed before OPEN request has ever been sent. Also avoid 427 + * sending CLOSE request for CACHEFILES_ONDEMAND_ID_CLOSED, which means 428 + * anon_fd has already been closed. 429 + */ 430 + if (object_id <= 0) 431 + return -ENOENT; 432 + 433 + req->msg.object_id = object_id; 434 + trace_cachefiles_ondemand_close(object, &req->msg); 435 + return 0; 436 + } 437 + 438 + struct cachefiles_read_ctx { 439 + loff_t off; 440 + size_t len; 441 + }; 442 + 443 + static int cachefiles_ondemand_init_read_req(struct cachefiles_req *req, 444 + void *private) 445 + { 446 + struct cachefiles_object *object = req->object; 447 + struct cachefiles_read *load = (void *)req->msg.data; 448 + struct cachefiles_read_ctx *read_ctx = private; 449 + int object_id = object->ondemand_id; 450 + 451 + /* Stop enqueuing requests when daemon has closed anon_fd. 
*/ 452 + if (object_id <= 0) { 453 + WARN_ON_ONCE(object_id == 0); 454 + pr_info_once("READ: anonymous fd closed prematurely.\n"); 455 + return -EIO; 456 + } 457 + 458 + req->msg.object_id = object_id; 459 + load->off = read_ctx->off; 460 + load->len = read_ctx->len; 461 + trace_cachefiles_ondemand_read(object, &req->msg, load); 462 + return 0; 463 + } 464 + 465 + int cachefiles_ondemand_init_object(struct cachefiles_object *object) 466 + { 467 + struct fscache_cookie *cookie = object->cookie; 468 + struct fscache_volume *volume = object->volume->vcookie; 469 + size_t volume_key_size, cookie_key_size, data_len; 470 + 471 + /* 472 + * CacheFiles will firstly check the cache file under the root cache 473 + * directory. If the coherency check failed, it will fallback to 474 + * creating a new tmpfile as the cache file. Reuse the previously 475 + * allocated object ID if any. 476 + */ 477 + if (object->ondemand_id > 0) 478 + return 0; 479 + 480 + volume_key_size = volume->key[0] + 1; 481 + cookie_key_size = cookie->key_len; 482 + data_len = sizeof(struct cachefiles_open) + 483 + volume_key_size + cookie_key_size; 484 + 485 + return cachefiles_ondemand_send_req(object, CACHEFILES_OP_OPEN, 486 + data_len, cachefiles_ondemand_init_open_req, NULL); 487 + } 488 + 489 + void cachefiles_ondemand_clean_object(struct cachefiles_object *object) 490 + { 491 + cachefiles_ondemand_send_req(object, CACHEFILES_OP_CLOSE, 0, 492 + cachefiles_ondemand_init_close_req, NULL); 493 + } 494 + 495 + int cachefiles_ondemand_read(struct cachefiles_object *object, 496 + loff_t pos, size_t len) 497 + { 498 + struct cachefiles_read_ctx read_ctx = {pos, len}; 499 + 500 + return cachefiles_ondemand_send_req(object, CACHEFILES_OP_READ, 501 + sizeof(struct cachefiles_read), 502 + cachefiles_ondemand_init_read_req, &read_ctx); 503 + }

+10

fs/erofs/Kconfig

···
 	  systems will be readable without selecting this option.

 	  If unsure, say N.
+
+config EROFS_FS_ONDEMAND
+	bool "EROFS fscache-based on-demand read support"
+	depends on CACHEFILES_ONDEMAND && (EROFS_FS=m && FSCACHE || EROFS_FS=y && FSCACHE=y)
+	default n
+	help
+	  This permits EROFS to use fscache-backed data blobs with on-demand
+	  read support.
+
+	  If unsure, say N.

fs/erofs/Makefile

···
 erofs-$(CONFIG_EROFS_FS_XATTR) += xattr.o
 erofs-$(CONFIG_EROFS_FS_ZIP) += decompressor.o zmap.o zdata.o
 erofs-$(CONFIG_EROFS_FS_ZIP_LZMA) += decompressor_lzma.o
+erofs-$(CONFIG_EROFS_FS_ONDEMAND) += fscache.o

+20 -6

fs/erofs/data.c

···
  */
 #include "internal.h"
 #include <linux/prefetch.h>
+#include <linux/sched/mm.h>
 #include <linux/dax.h>
 #include <trace/events/erofs.h>

···
 	erofs_off_t offset = blknr_to_addr(blkaddr);
 	pgoff_t index = offset >> PAGE_SHIFT;
 	struct page *page = buf->page;
+	struct folio *folio;
+	unsigned int nofs_flag;

 	if (!page || page->index != index) {
 		erofs_put_metabuf(buf);
-		page = read_cache_page_gfp(mapping, index,
-				mapping_gfp_constraint(mapping, ~__GFP_FS));
-		if (IS_ERR(page))
-			return page;
+
+		nofs_flag = memalloc_nofs_save();
+		folio = read_cache_folio(mapping, index, NULL, NULL);
+		memalloc_nofs_restore(nofs_flag);
+		if (IS_ERR(folio))
+			return folio;
+
 		/* should already be PageUptodate, no need to lock page */
+		page = folio_file_page(folio, index);
 		buf->page = page;
 	}
 	if (buf->kmap_type == EROFS_NO_KMAP) {
···
 void *erofs_read_metabuf(struct erofs_buf *buf, struct super_block *sb,
 			 erofs_blk_t blkaddr, enum erofs_kmap_type type)
 {
+	if (erofs_is_fscache_mode(sb))
+		return erofs_bread(buf, EROFS_SB(sb)->s_fscache->inode,
+				   blkaddr, type);
+
 	return erofs_bread(buf, sb->s_bdev->bd_inode, blkaddr, type);
 }

···
 	return 0;
 }

-static int erofs_map_blocks(struct inode *inode,
-			    struct erofs_map_blocks *map, int flags)
+int erofs_map_blocks(struct inode *inode,
+		     struct erofs_map_blocks *map, int flags)
 {
 	struct super_block *sb = inode->i_sb;
 	struct erofs_inode *vi = EROFS_I(inode);
···
 	map->m_bdev = sb->s_bdev;
 	map->m_daxdev = EROFS_SB(sb)->dax_dev;
 	map->m_dax_part_off = EROFS_SB(sb)->dax_part_off;
+	map->m_fscache = EROFS_SB(sb)->s_fscache;

 	if (map->m_deviceid) {
 		down_read(&devs->rwsem);
···
 		map->m_bdev = dif->bdev;
 		map->m_daxdev = dif->dax_dev;
 		map->m_dax_part_off = dif->dax_part_off;
+		map->m_fscache = dif->fscache;
 		up_read(&devs->rwsem);
 	} else if (devs->extra_devices) {
 		down_read(&devs->rwsem);
···
 				map->m_bdev = dif->bdev;
 				map->m_daxdev = dif->dax_dev;
 				map->m_dax_part_off = dif->dax_part_off;
+				map->m_fscache = dif->fscache;
 				break;
 			}
 		}

+3 -4

fs/erofs/decompressor.c

···
 			erofs_err(sb, "too large lz4 pclusterblks %u",
 				  sbi->lz4.max_pclusterblks);
 			return -EINVAL;
-		} else if (sbi->lz4.max_pclusterblks >= 2) {
-			erofs_info(sb, "EXPERIMENTAL big pcluster feature in use. Use at your own risk!");
 		}
 	} else {
 		distance = le16_to_cpu(dsb->u1.lz4_max_distance);
···
 		PAGE_ALIGN(rq->pageofs_out + rq->outputsize) >> PAGE_SHIFT;
 	const unsigned int righthalf = min_t(unsigned int, rq->outputsize,
 					     PAGE_SIZE - rq->pageofs_out);
+	const unsigned int lefthalf = rq->outputsize - righthalf;
 	unsigned char *src, *dst;

 	if (nrpages_out > 2) {
···
 	if (nrpages_out == 2) {
 		DBG_BUGON(!rq->out[1]);
 		if (rq->out[1] == *rq->in) {
-			memmove(src, src + righthalf, rq->pageofs_out);
+			memmove(src, src + righthalf, lefthalf);
 		} else {
 			dst = kmap_atomic(rq->out[1]);
-			memcpy(dst, src + righthalf, rq->pageofs_out);
+			memcpy(dst, src + righthalf, lefthalf);
 			kunmap_atomic(dst);
 		}
 	}

+24 -26

fs/erofs/erofs_fs.h

···
 #define EROFS_SB_EXTSLOT_SIZE	16

 struct erofs_deviceslot {
-	union {
-		u8 uuid[16];		/* used for device manager later */
-		u8 userdata[64];	/* digest(sha256), etc. */
-	} u;
-	__le32 blocks;		/* total fs blocks of this device */
-	__le32 mapped_blkaddr;	/* map starting at mapped_blkaddr */
+	u8 tag[64];		/* digest(sha256), etc. */
+	__le32 blocks;		/* total fs blocks of this device */
+	__le32 mapped_blkaddr;	/* map starting at mapped_blkaddr */
 	u8 reserved[56];
 };
 #define EROFS_DEVT_SLOT_SIZE	sizeof(struct erofs_deviceslot)
···
 	__le16 root_nid;	/* nid of root directory */
 	__le64 inos;		/* total valid ino # (== f_files - f_favail) */

-	__le64 build_time;	/* inode v1 time derivation */
-	__le32 build_time_nsec;	/* inode v1 time derivation in nano scale */
+	__le64 build_time;	/* compact inode time derivation */
+	__le32 build_time_nsec;	/* compact inode time derivation in ns scale */
 	__le32 blocks;		/* used for statfs */
 	__le32 meta_blkaddr;	/* start block address of metadata area */
 	__le32 xattr_blkaddr;	/* start block address of shared xattr area */
···

 /*
  * erofs inode datalayout (i_format in on-disk inode):
- * 0 - inode plain without inline data A:
+ * 0 - uncompressed flat inode without tail-packing inline data:
  *     inode, [xattrs], ... | ... | no-holed data
- * 1 - inode VLE compression B (legacy):
- *     inode, [xattrs], extents ... | ...
- * 2 - inode plain with inline data C:
- *     inode, [xattrs], last_inline_data, ... | ... | no-holed data
- * 3 - inode compression D:
+ * 1 - compressed inode with non-compact indexes:
+ *     inode, [xattrs], [map_header], extents ... | ...
+ * 2 - uncompressed flat inode with tail-packing inline data:
+ *     inode, [xattrs], tailpacking data, ... | ... | no-holed data
+ * 3 - compressed inode with compact indexes:
  *     inode, [xattrs], map_header, extents ... | ...
- * 4 - inode chunk-based E:
+ * 4 - chunk-based inode with (optional) multi-device support:
  *     inode, [xattrs], chunk indexes ... | ...
  * 5~7 - reserved
  */
···
 		datamode == EROFS_INODE_FLAT_COMPRESSION_LEGACY;
 }

-/* bit definitions of inode i_advise */
+/* bit definitions of inode i_format */
 #define EROFS_I_VERSION_BITS	1
 #define EROFS_I_DATALAYOUT_BITS	3

···
 	__le32 i_size;
 	__le32 i_reserved;
 	union {
-		/* file total compressed blocks for data mapping 1 */
+		/* total compressed blocks for compressed inodes */
 		__le32 compressed_blocks;
+		/* block address for uncompressed flat inodes */
 		__le32 raw_blkaddr;

 		/* for device files, used to indicate old/new device # */
···
 	__le32 i_reserved2;
 };

-/* 32 bytes on-disk inode */
+/* 32-byte on-disk inode */
 #define EROFS_INODE_LAYOUT_COMPACT	0
-/* 64 bytes on-disk inode */
+/* 64-byte on-disk inode */
 #define EROFS_INODE_LAYOUT_EXTENDED	1

 /* 64-byte complete form of an ondisk inode */
···
 	__le16 i_reserved;
 	__le64 i_size;
 	union {
-		/* file total compressed blocks for data mapping 1 */
+		/* total compressed blocks for compressed inodes */
 		__le32 compressed_blocks;
+		/* block address for uncompressed flat inodes */
 		__le32 raw_blkaddr;

 		/* for device files, used to indicate old/new device # */
···

 struct z_erofs_vle_decompressed_index {
 	__le16 di_advise;
-	/* where to decompress in the head cluster */
+	/* where to decompress in the head lcluster */
 	__le16 di_clusterofs;

 	union {
-		/* for the head cluster */
+		/* for the HEAD lclusters */
 		__le32 blkaddr;
 		/*
-		 * for the rest clusters
-		 * eg. for 4k page-sized cluster, maximum 4K*64k = 256M)
-		 * [0] - pointing to the head cluster
-		 * [1] - pointing to the tail cluster
+		 * for the NONHEAD lclusters
+		 * [0] - distance to its HEAD lcluster
+		 * [1] - distance to the next HEAD lcluster
 		 */
 		__le16 delta[2];
 	} di_u;

+521

fs/erofs/fscache.c

··· 1 + // SPDX-License-Identifier: GPL-2.0-or-later 2 + /* 3 + * Copyright (C) 2022, Alibaba Cloud 4 + */ 5 + #include <linux/fscache.h> 6 + #include "internal.h" 7 + 8 + static struct netfs_io_request *erofs_fscache_alloc_request(struct address_space *mapping, 9 + loff_t start, size_t len) 10 + { 11 + struct netfs_io_request *rreq; 12 + 13 + rreq = kzalloc(sizeof(struct netfs_io_request), GFP_KERNEL); 14 + if (!rreq) 15 + return ERR_PTR(-ENOMEM); 16 + 17 + rreq->start = start; 18 + rreq->len = len; 19 + rreq->mapping = mapping; 20 + INIT_LIST_HEAD(&rreq->subrequests); 21 + refcount_set(&rreq->ref, 1); 22 + return rreq; 23 + } 24 + 25 + static void erofs_fscache_put_request(struct netfs_io_request *rreq) 26 + { 27 + if (!refcount_dec_and_test(&rreq->ref)) 28 + return; 29 + if (rreq->cache_resources.ops) 30 + rreq->cache_resources.ops->end_operation(&rreq->cache_resources); 31 + kfree(rreq); 32 + } 33 + 34 + static void erofs_fscache_put_subrequest(struct netfs_io_subrequest *subreq) 35 + { 36 + if (!refcount_dec_and_test(&subreq->ref)) 37 + return; 38 + erofs_fscache_put_request(subreq->rreq); 39 + kfree(subreq); 40 + } 41 + 42 + static void erofs_fscache_clear_subrequests(struct netfs_io_request *rreq) 43 + { 44 + struct netfs_io_subrequest *subreq; 45 + 46 + while (!list_empty(&rreq->subrequests)) { 47 + subreq = list_first_entry(&rreq->subrequests, 48 + struct netfs_io_subrequest, rreq_link); 49 + list_del(&subreq->rreq_link); 50 + erofs_fscache_put_subrequest(subreq); 51 + } 52 + } 53 + 54 + static void erofs_fscache_rreq_unlock_folios(struct netfs_io_request *rreq) 55 + { 56 + struct netfs_io_subrequest *subreq; 57 + struct folio *folio; 58 + unsigned int iopos = 0; 59 + pgoff_t start_page = rreq->start / PAGE_SIZE; 60 + pgoff_t last_page = ((rreq->start + rreq->len) / PAGE_SIZE) - 1; 61 + bool subreq_failed = false; 62 + 63 + XA_STATE(xas, &rreq->mapping->i_pages, start_page); 64 + 65 + subreq = list_first_entry(&rreq->subrequests, 66 + struct 
netfs_io_subrequest, rreq_link); 67 + subreq_failed = (subreq->error < 0); 68 + 69 + rcu_read_lock(); 70 + xas_for_each(&xas, folio, last_page) { 71 + unsigned int pgpos = 72 + (folio_index(folio) - start_page) * PAGE_SIZE; 73 + unsigned int pgend = pgpos + folio_size(folio); 74 + bool pg_failed = false; 75 + 76 + for (;;) { 77 + if (!subreq) { 78 + pg_failed = true; 79 + break; 80 + } 81 + 82 + pg_failed |= subreq_failed; 83 + if (pgend < iopos + subreq->len) 84 + break; 85 + 86 + iopos += subreq->len; 87 + if (!list_is_last(&subreq->rreq_link, 88 + &rreq->subrequests)) { 89 + subreq = list_next_entry(subreq, rreq_link); 90 + subreq_failed = (subreq->error < 0); 91 + } else { 92 + subreq = NULL; 93 + subreq_failed = false; 94 + } 95 + if (pgend == iopos) 96 + break; 97 + } 98 + 99 + if (!pg_failed) 100 + folio_mark_uptodate(folio); 101 + 102 + folio_unlock(folio); 103 + } 104 + rcu_read_unlock(); 105 + } 106 + 107 + static void erofs_fscache_rreq_complete(struct netfs_io_request *rreq) 108 + { 109 + erofs_fscache_rreq_unlock_folios(rreq); 110 + erofs_fscache_clear_subrequests(rreq); 111 + erofs_fscache_put_request(rreq); 112 + } 113 + 114 + static void erofc_fscache_subreq_complete(void *priv, 115 + ssize_t transferred_or_error, bool was_async) 116 + { 117 + struct netfs_io_subrequest *subreq = priv; 118 + struct netfs_io_request *rreq = subreq->rreq; 119 + 120 + if (IS_ERR_VALUE(transferred_or_error)) 121 + subreq->error = transferred_or_error; 122 + 123 + if (atomic_dec_and_test(&rreq->nr_outstanding)) 124 + erofs_fscache_rreq_complete(rreq); 125 + 126 + erofs_fscache_put_subrequest(subreq); 127 + } 128 + 129 + /* 130 + * Read data from fscache and fill the read data into page cache described by 131 + * @rreq, which shall be both aligned with PAGE_SIZE. @pstart describes 132 + * the start physical address in the cache file. 
133 + */ 134 + static int erofs_fscache_read_folios_async(struct fscache_cookie *cookie, 135 + struct netfs_io_request *rreq, loff_t pstart) 136 + { 137 + enum netfs_io_source source; 138 + struct super_block *sb = rreq->mapping->host->i_sb; 139 + struct netfs_io_subrequest *subreq; 140 + struct netfs_cache_resources *cres = &rreq->cache_resources; 141 + struct iov_iter iter; 142 + loff_t start = rreq->start; 143 + size_t len = rreq->len; 144 + size_t done = 0; 145 + int ret; 146 + 147 + atomic_set(&rreq->nr_outstanding, 1); 148 + 149 + ret = fscache_begin_read_operation(cres, cookie); 150 + if (ret) 151 + goto out; 152 + 153 + while (done < len) { 154 + subreq = kzalloc(sizeof(struct netfs_io_subrequest), 155 + GFP_KERNEL); 156 + if (subreq) { 157 + INIT_LIST_HEAD(&subreq->rreq_link); 158 + refcount_set(&subreq->ref, 2); 159 + subreq->rreq = rreq; 160 + refcount_inc(&rreq->ref); 161 + } else { 162 + ret = -ENOMEM; 163 + goto out; 164 + } 165 + 166 + subreq->start = pstart + done; 167 + subreq->len = len - done; 168 + subreq->flags = 1 << NETFS_SREQ_ONDEMAND; 169 + 170 + list_add_tail(&subreq->rreq_link, &rreq->subrequests); 171 + 172 + source = cres->ops->prepare_read(subreq, LLONG_MAX); 173 + if (WARN_ON(subreq->len == 0)) 174 + source = NETFS_INVALID_READ; 175 + if (source != NETFS_READ_FROM_CACHE) { 176 + erofs_err(sb, "failed to fscache prepare_read (source %d)", 177 + source); 178 + ret = -EIO; 179 + subreq->error = ret; 180 + erofs_fscache_put_subrequest(subreq); 181 + goto out; 182 + } 183 + 184 + atomic_inc(&rreq->nr_outstanding); 185 + 186 + iov_iter_xarray(&iter, READ, &rreq->mapping->i_pages, 187 + start + done, subreq->len); 188 + 189 + ret = fscache_read(cres, subreq->start, &iter, 190 + NETFS_READ_HOLE_FAIL, 191 + erofc_fscache_subreq_complete, subreq); 192 + if (ret == -EIOCBQUEUED) 193 + ret = 0; 194 + if (ret) { 195 + erofs_err(sb, "failed to fscache_read (ret %d)", ret); 196 + goto out; 197 + } 198 + 199 + done += subreq->len; 200 + } 201 + out: 
202 + if (atomic_dec_and_test(&rreq->nr_outstanding)) 203 + erofs_fscache_rreq_complete(rreq); 204 + 205 + return ret; 206 + } 207 + 208 + static int erofs_fscache_meta_readpage(struct file *data, struct page *page) 209 + { 210 + int ret; 211 + struct folio *folio = page_folio(page); 212 + struct super_block *sb = folio_mapping(folio)->host->i_sb; 213 + struct netfs_io_request *rreq; 214 + struct erofs_map_dev mdev = { 215 + .m_deviceid = 0, 216 + .m_pa = folio_pos(folio), 217 + }; 218 + 219 + ret = erofs_map_dev(sb, &mdev); 220 + if (ret) 221 + goto out; 222 + 223 + rreq = erofs_fscache_alloc_request(folio_mapping(folio), 224 + folio_pos(folio), folio_size(folio)); 225 + if (IS_ERR(rreq)) 226 + goto out; 227 + 228 + return erofs_fscache_read_folios_async(mdev.m_fscache->cookie, 229 + rreq, mdev.m_pa); 230 + out: 231 + folio_unlock(folio); 232 + return ret; 233 + } 234 + 235 + static int erofs_fscache_readpage_inline(struct folio *folio, 236 + struct erofs_map_blocks *map) 237 + { 238 + struct super_block *sb = folio_mapping(folio)->host->i_sb; 239 + struct erofs_buf buf = __EROFS_BUF_INITIALIZER; 240 + erofs_blk_t blknr; 241 + size_t offset, len; 242 + void *src, *dst; 243 + 244 + /* For tail packing layout, the offset may be non-zero. 
*/
+	offset = erofs_blkoff(map->m_pa);
+	blknr = erofs_blknr(map->m_pa);
+	len = map->m_llen;
+
+	src = erofs_read_metabuf(&buf, sb, blknr, EROFS_KMAP);
+	if (IS_ERR(src))
+		return PTR_ERR(src);
+
+	dst = kmap_local_folio(folio, 0);
+	memcpy(dst, src + offset, len);
+	memset(dst + len, 0, PAGE_SIZE - len);
+	kunmap_local(dst);
+
+	erofs_put_metabuf(&buf);
+	return 0;
+}
+
+static int erofs_fscache_readpage(struct file *file, struct page *page)
+{
+	struct folio *folio = page_folio(page);
+	struct inode *inode = folio_mapping(folio)->host;
+	struct super_block *sb = inode->i_sb;
+	struct erofs_map_blocks map;
+	struct erofs_map_dev mdev;
+	struct netfs_io_request *rreq;
+	erofs_off_t pos;
+	loff_t pstart;
+	int ret;
+
+	DBG_BUGON(folio_size(folio) != EROFS_BLKSIZ);
+
+	pos = folio_pos(folio);
+	map.m_la = pos;
+
+	ret = erofs_map_blocks(inode, &map, EROFS_GET_BLOCKS_RAW);
+	if (ret)
+		goto out_unlock;
+
+	if (!(map.m_flags & EROFS_MAP_MAPPED)) {
+		folio_zero_range(folio, 0, folio_size(folio));
+		goto out_uptodate;
+	}
+
+	if (map.m_flags & EROFS_MAP_META) {
+		ret = erofs_fscache_readpage_inline(folio, &map);
+		goto out_uptodate;
+	}
+
+	mdev = (struct erofs_map_dev) {
+		.m_deviceid = map.m_deviceid,
+		.m_pa = map.m_pa,
+	};
+
+	ret = erofs_map_dev(sb, &mdev);
+	if (ret)
+		goto out_unlock;
+
+
+	rreq = erofs_fscache_alloc_request(folio_mapping(folio),
+				folio_pos(folio), folio_size(folio));
+	if (IS_ERR(rreq))
+		goto out_unlock;
+
+	pstart = mdev.m_pa + (pos - map.m_la);
+	return erofs_fscache_read_folios_async(mdev.m_fscache->cookie,
+				rreq, pstart);
+
+out_uptodate:
+	if (!ret)
+		folio_mark_uptodate(folio);
+out_unlock:
+	folio_unlock(folio);
+	return ret;
+}
+
+static void erofs_fscache_advance_folios(struct readahead_control *rac,
+					 size_t len, bool unlock)
+{
+	while (len) {
+		struct folio *folio = readahead_folio(rac);
+		len -= folio_size(folio);
+		if (unlock) {
+			folio_mark_uptodate(folio);
+			folio_unlock(folio);
+		}
+	}
+}
+
+static void erofs_fscache_readahead(struct readahead_control *rac)
+{
+	struct inode *inode = rac->mapping->host;
+	struct super_block *sb = inode->i_sb;
+	size_t len, count, done = 0;
+	erofs_off_t pos;
+	loff_t start, offset;
+	int ret;
+
+	if (!readahead_count(rac))
+		return;
+
+	start = readahead_pos(rac);
+	len = readahead_length(rac);
+
+	do {
+		struct erofs_map_blocks map;
+		struct erofs_map_dev mdev;
+		struct netfs_io_request *rreq;
+
+		pos = start + done;
+		map.m_la = pos;
+
+		ret = erofs_map_blocks(inode, &map, EROFS_GET_BLOCKS_RAW);
+		if (ret)
+			return;
+
+		offset = start + done;
+		count = min_t(size_t, map.m_llen - (pos - map.m_la),
+			      len - done);
+
+		if (!(map.m_flags & EROFS_MAP_MAPPED)) {
+			struct iov_iter iter;
+
+			iov_iter_xarray(&iter, READ, &rac->mapping->i_pages,
+					offset, count);
+			iov_iter_zero(count, &iter);
+
+			erofs_fscache_advance_folios(rac, count, true);
+			ret = count;
+			continue;
+		}
+
+		if (map.m_flags & EROFS_MAP_META) {
+			struct folio *folio = readahead_folio(rac);
+
+			ret = erofs_fscache_readpage_inline(folio, &map);
+			if (!ret) {
+				folio_mark_uptodate(folio);
+				ret = folio_size(folio);
+			}
+
+			folio_unlock(folio);
+			continue;
+		}
+
+		mdev = (struct erofs_map_dev) {
+			.m_deviceid = map.m_deviceid,
+			.m_pa = map.m_pa,
+		};
+		ret = erofs_map_dev(sb, &mdev);
+		if (ret)
+			return;
+
+		rreq = erofs_fscache_alloc_request(rac->mapping, offset, count);
+		if (IS_ERR(rreq))
+			return;
+		/*
+		 * Drop the ref of folios here. Unlock them in
+		 * rreq_unlock_folios() when rreq complete.
+		 */
+		erofs_fscache_advance_folios(rac, count, false);
+		ret = erofs_fscache_read_folios_async(mdev.m_fscache->cookie,
+					rreq, mdev.m_pa + (pos - map.m_la));
+		if (!ret)
+			ret = count;
+	} while (ret > 0 && ((done += ret) < len));
+}
+
+static const struct address_space_operations erofs_fscache_meta_aops = {
+	.readpage = erofs_fscache_meta_readpage,
+};
+
+const struct address_space_operations erofs_fscache_access_aops = {
+	.readpage = erofs_fscache_readpage,
+	.readahead = erofs_fscache_readahead,
+};
+
+int erofs_fscache_register_cookie(struct super_block *sb,
+				  struct erofs_fscache **fscache,
+				  char *name, bool need_inode)
+{
+	struct fscache_volume *volume = EROFS_SB(sb)->volume;
+	struct erofs_fscache *ctx;
+	struct fscache_cookie *cookie;
+	int ret;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	cookie = fscache_acquire_cookie(volume, FSCACHE_ADV_WANT_CACHE_SIZE,
+					name, strlen(name), NULL, 0, 0);
+	if (!cookie) {
+		erofs_err(sb, "failed to get cookie for %s", name);
+		ret = -EINVAL;
+		goto err;
+	}
+
+	fscache_use_cookie(cookie, false);
+	ctx->cookie = cookie;
+
+	if (need_inode) {
+		struct inode *const inode = new_inode(sb);
+
+		if (!inode) {
+			erofs_err(sb, "failed to get anon inode for %s", name);
+			ret = -ENOMEM;
+			goto err_cookie;
+		}
+
+		set_nlink(inode, 1);
+		inode->i_size = OFFSET_MAX;
+		inode->i_mapping->a_ops = &erofs_fscache_meta_aops;
+		mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
+
+		ctx->inode = inode;
+	}
+
+	*fscache = ctx;
+	return 0;
+
+err_cookie:
+	fscache_unuse_cookie(ctx->cookie, NULL, NULL);
+	fscache_relinquish_cookie(ctx->cookie, false);
+	ctx->cookie = NULL;
+err:
+	kfree(ctx);
+	return ret;
+}
+
+void erofs_fscache_unregister_cookie(struct erofs_fscache **fscache)
+{
+	struct erofs_fscache *ctx = *fscache;
+
+	if (!ctx)
+		return;
+
+	fscache_unuse_cookie(ctx->cookie, NULL, NULL);
+	fscache_relinquish_cookie(ctx->cookie, false);
+	ctx->cookie = NULL;
+
+	iput(ctx->inode);
+	ctx->inode = NULL;
+
+	kfree(ctx);
+	*fscache = NULL;
+}
+
+int erofs_fscache_register_fs(struct super_block *sb)
+{
+	struct erofs_sb_info *sbi = EROFS_SB(sb);
+	struct fscache_volume *volume;
+	char *name;
+	int ret = 0;
+
+	name = kasprintf(GFP_KERNEL, "erofs,%s", sbi->opt.fsid);
+	if (!name)
+		return -ENOMEM;
+
+	volume = fscache_acquire_volume(name, NULL, NULL, 0);
+	if (IS_ERR_OR_NULL(volume)) {
+		erofs_err(sb, "failed to register volume for %s", name);
+		ret = volume ? PTR_ERR(volume) : -EOPNOTSUPP;
+		volume = NULL;
+	}
+
+	sbi->volume = volume;
+	kfree(name);
+	return ret;
+}
+
+void erofs_fscache_unregister_fs(struct super_block *sb)
+{
+	struct erofs_sb_info *sbi = EROFS_SB(sb);
+
+	fscache_relinquish_volume(sbi->volume, NULL, false);
+	sbi->volume = NULL;
+}
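The readahead loop above walks the requested window one mapped extent at a time: each pass is limited both by what is left of the current extent and by what is left of the window, and the physical read offset is the extent's physical start plus the position inside the extent. The following user-space sketch models only that arithmetic. `struct extent`, `chunk_len` and `phys_start` are hypothetical names introduced for illustration (they stand in for the `m_la`/`m_llen` fields and the expressions in `erofs_fscache_readahead()`), not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two erofs_map_blocks fields used below. */
struct extent {
	uint64_t la;   /* logical start of the extent (models map.m_la) */
	uint64_t llen; /* logical length of the extent (models map.m_llen) */
};

/*
 * Models: count = min_t(size_t, map.m_llen - (pos - map.m_la), len - done)
 * i.e. never read past the end of the extent or the end of the window.
 */
static size_t chunk_len(const struct extent *map, uint64_t pos,
			size_t len, size_t done)
{
	uint64_t left_in_extent = map->llen - (pos - map->la);
	size_t left_in_window = len - done;

	return left_in_extent < left_in_window ?
		(size_t)left_in_extent : left_in_window;
}

/* Models: mdev.m_pa + (pos - map.m_la), the offset read from the blob. */
static uint64_t phys_start(uint64_t pa, const struct extent *map, uint64_t pos)
{
	return pa + (pos - map->la);
}
```

For instance, with an 8 KiB extent starting at logical offset 0 and a 16 KiB window starting at offset 4096, the first pass covers 4096 bytes (the rest of the extent), and the next pass starts on the following extent.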

+5 -6

fs/erofs/inode.c

···
 
 #include <trace/events/erofs.h>
 
-/*
- * if inode is successfully read, return its inode page (or sometimes
- * the inode payload page if it's an extended inode) in order to fill
- * inline data if possible.
- */
 static void *erofs_read_inode(struct erofs_buf *buf,
 			      struct inode *inode, unsigned int *ofs)
 {
···
 		goto out_unlock;
 	}
 	inode->i_mapping->a_ops = &erofs_raw_access_aops;
+#ifdef CONFIG_EROFS_FS_ONDEMAND
+	if (erofs_is_fscache_mode(inode->i_sb))
+		inode->i_mapping->a_ops = &erofs_fscache_access_aops;
+#endif
 
 out_unlock:
 	erofs_put_metabuf(&buf);
···
 	stat->attributes_mask |= (STATX_ATTR_COMPRESSED |
 				  STATX_ATTR_IMMUTABLE);
 
-	generic_fillattr(&init_user_ns, inode, stat);
+	generic_fillattr(mnt_userns, inode, stat);
 	return 0;
 }
 

+50 -26

fs/erofs/internal.h

···
 
 struct erofs_device_info {
 	char *path;
+	struct erofs_fscache *fscache;
 	struct block_device *bdev;
 	struct dax_device *dax_dev;
 	u64 dax_part_off;
···
 	unsigned int max_sync_decompress_pages;
 #endif
 	unsigned int mount_opt;
+	char *fsid;
 };
 
 struct erofs_dev_context {
···
 	u16 max_distance_pages;
 	/* maximum possible blocks for pclusters in the filesystem */
 	u16 max_pclusterblks;
+};
+
+struct erofs_fscache {
+	struct fscache_cookie *cookie;
+	struct inode *inode;
 };
 
 struct erofs_sb_info {
···
 	/* sysfs support */
 	struct kobject s_kobj;		/* /sys/fs/erofs/<devname> */
 	struct completion s_kobj_unregister;
+
+	/* fscache support */
+	struct fscache_volume *volume;
+	struct erofs_fscache *s_fscache;
 };
 
 #define EROFS_SB(sb) ((struct erofs_sb_info *)(sb)->s_fs_info)
···
 #define clear_opt(opt, option)	((opt)->mount_opt &= ~EROFS_MOUNT_##option)
 #define set_opt(opt, option)	((opt)->mount_opt |= EROFS_MOUNT_##option)
 #define test_opt(opt, option)	((opt)->mount_opt & EROFS_MOUNT_##option)
+
+static inline bool erofs_is_fscache_mode(struct super_block *sb)
+{
+	return IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) && !sb->s_bdev;
+}
 
 enum {
 	EROFS_ZIP_CACHE_DISABLED,
···
 extern const struct address_space_operations erofs_raw_access_aops;
 extern const struct address_space_operations z_erofs_aops;
 
-/*
- * Logical to physical block mapping
- *
- * Different with other file systems, it is used for 2 access modes:
- *
- * 1) RAW access mode:
- *
- * Users pass a valid (m_lblk, m_lofs -- usually 0) pair,
- * and get the valid m_pblk, m_pofs and the longest m_len(in bytes).
- *
- * Note that m_lblk in the RAW access mode refers to the number of
- * the compressed ondisk block rather than the uncompressed
- * in-memory block for the compressed file.
- *
- * m_pofs equals to m_lofs except for the inline data page.
- *
- * 2) Normal access mode:
- *
- * If the inode is not compressed, it has no difference with
- * the RAW access mode. However, if the inode is compressed,
- * users should pass a valid (m_lblk, m_lofs) pair, and get
- * the needed m_pblk, m_pofs, m_len to get the compressed data
- * and the updated m_lblk, m_lofs which indicates the start
- * of the corresponding uncompressed data in the file.
- */
 enum {
 	BH_Encoded = BH_PrivateStart,
 	BH_FullMapped,
···
 #endif	/* !CONFIG_EROFS_FS_ZIP */
 
 struct erofs_map_dev {
+	struct erofs_fscache *m_fscache;
 	struct block_device *m_bdev;
 	struct dax_device *m_daxdev;
 	u64 m_dax_part_off;
···
 int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *dev);
 int erofs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		 u64 start, u64 len);
+int erofs_map_blocks(struct inode *inode,
+		     struct erofs_map_blocks *map, int flags);
 
 /* inode.c */
 static inline unsigned long erofs_inode_hash(erofs_nid_t nid)
···
 /* namei.c */
 extern const struct inode_operations erofs_dir_iops;
 
-int erofs_namei(struct inode *dir, struct qstr *name,
+int erofs_namei(struct inode *dir, const struct qstr *name,
 		erofs_nid_t *nid, unsigned int *d_type);
 
 /* dir.c */
···
 	return 0;
 }
 #endif	/* !CONFIG_EROFS_FS_ZIP */
+
+/* fscache.c */
+#ifdef CONFIG_EROFS_FS_ONDEMAND
+int erofs_fscache_register_fs(struct super_block *sb);
+void erofs_fscache_unregister_fs(struct super_block *sb);
+
+int erofs_fscache_register_cookie(struct super_block *sb,
+				  struct erofs_fscache **fscache,
+				  char *name, bool need_inode);
+void erofs_fscache_unregister_cookie(struct erofs_fscache **fscache);
+
+extern const struct address_space_operations erofs_fscache_access_aops;
+#else
+static inline int erofs_fscache_register_fs(struct super_block *sb)
+{
+	return 0;
+}
+static inline void erofs_fscache_unregister_fs(struct super_block *sb) {}
+
+static inline int erofs_fscache_register_cookie(struct super_block *sb,
+						struct erofs_fscache **fscache,
+						char *name, bool need_inode)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void erofs_fscache_unregister_cookie(struct erofs_fscache **fscache)
+{
+}
+#endif
 
 #define EFSCORRUPTED    EUCLEAN         /* Filesystem is corrupted */
 

+2 -3

fs/erofs/namei.c

···
 	return candidate;
 }
 
-int erofs_namei(struct inode *dir,
-		struct qstr *name,
-		erofs_nid_t *nid, unsigned int *d_type)
+int erofs_namei(struct inode *dir, const struct qstr *name, erofs_nid_t *nid,
+		unsigned int *d_type)
 {
 	int ndirents;
 	struct erofs_buf buf = __EROFS_BUF_INITIALIZER;

+180 -39

fs/erofs/super.c

···
 #include <linux/fs_context.h>
 #include <linux/fs_parser.h>
 #include <linux/dax.h>
+#include <linux/exportfs.h>
 #include "xattr.h"
 
 #define CREATE_TRACE_POINTS
···
 }
 #endif
 
-static int erofs_init_devices(struct super_block *sb,
+static int erofs_init_device(struct erofs_buf *buf, struct super_block *sb,
+			     struct erofs_device_info *dif, erofs_off_t *pos)
+{
+	struct erofs_sb_info *sbi = EROFS_SB(sb);
+	struct erofs_deviceslot *dis;
+	struct block_device *bdev;
+	void *ptr;
+	int ret;
+
+	ptr = erofs_read_metabuf(buf, sb, erofs_blknr(*pos), EROFS_KMAP);
+	if (IS_ERR(ptr))
+		return PTR_ERR(ptr);
+	dis = ptr + erofs_blkoff(*pos);
+
+	if (!dif->path) {
+		if (!dis->tag[0]) {
+			erofs_err(sb, "empty device tag @ pos %llu", *pos);
+			return -EINVAL;
+		}
+		dif->path = kmemdup_nul(dis->tag, sizeof(dis->tag), GFP_KERNEL);
+		if (!dif->path)
+			return -ENOMEM;
+	}
+
+	if (erofs_is_fscache_mode(sb)) {
+		ret = erofs_fscache_register_cookie(sb, &dif->fscache,
+						    dif->path, false);
+		if (ret)
+			return ret;
+	} else {
+		bdev = blkdev_get_by_path(dif->path, FMODE_READ | FMODE_EXCL,
+					  sb->s_type);
+		if (IS_ERR(bdev))
+			return PTR_ERR(bdev);
+		dif->bdev = bdev;
+		dif->dax_dev = fs_dax_get_by_bdev(bdev, &dif->dax_part_off);
+	}
+
+	dif->blocks = le32_to_cpu(dis->blocks);
+	dif->mapped_blkaddr = le32_to_cpu(dis->mapped_blkaddr);
+	sbi->total_blocks += dif->blocks;
+	*pos += EROFS_DEVT_SLOT_SIZE;
+	return 0;
+}
+
+static int erofs_scan_devices(struct super_block *sb,
 			      struct erofs_super_block *dsb)
 {
 	struct erofs_sb_info *sbi = EROFS_SB(sb);
···
 	erofs_off_t pos;
 	struct erofs_buf buf = __EROFS_BUF_INITIALIZER;
 	struct erofs_device_info *dif;
-	struct erofs_deviceslot *dis;
-	void *ptr;
 	int id, err = 0;
 
 	sbi->total_blocks = sbi->primarydevice_blocks;
···
 	else
 		ondisk_extradevs = le16_to_cpu(dsb->extra_devices);
 
-	if (ondisk_extradevs != sbi->devs->extra_devices) {
+	if (sbi->devs->extra_devices &&
+	    ondisk_extradevs != sbi->devs->extra_devices) {
 		erofs_err(sb, "extra devices don't match (ondisk %u, given %u)",
 			  ondisk_extradevs, sbi->devs->extra_devices);
 		return -EINVAL;
···
 	sbi->device_id_mask = roundup_pow_of_two(ondisk_extradevs + 1) - 1;
 	pos = le16_to_cpu(dsb->devt_slotoff) * EROFS_DEVT_SLOT_SIZE;
 	down_read(&sbi->devs->rwsem);
-	idr_for_each_entry(&sbi->devs->tree, dif, id) {
-		struct block_device *bdev;
-
-		ptr = erofs_read_metabuf(&buf, sb, erofs_blknr(pos),
-					 EROFS_KMAP);
-		if (IS_ERR(ptr)) {
-			err = PTR_ERR(ptr);
-			break;
+	if (sbi->devs->extra_devices) {
+		idr_for_each_entry(&sbi->devs->tree, dif, id) {
+			err = erofs_init_device(&buf, sb, dif, &pos);
+			if (err)
+				break;
 		}
-		dis = ptr + erofs_blkoff(pos);
+	} else {
+		for (id = 0; id < ondisk_extradevs; id++) {
+			dif = kzalloc(sizeof(*dif), GFP_KERNEL);
+			if (!dif) {
+				err = -ENOMEM;
+				break;
+			}
 
-		bdev = blkdev_get_by_path(dif->path,
-					  FMODE_READ | FMODE_EXCL,
-					  sb->s_type);
-		if (IS_ERR(bdev)) {
-			err = PTR_ERR(bdev);
-			break;
+			err = idr_alloc(&sbi->devs->tree, dif, 0, 0, GFP_KERNEL);
+			if (err < 0) {
+				kfree(dif);
+				break;
+			}
+			++sbi->devs->extra_devices;
+
+			err = erofs_init_device(&buf, sb, dif, &pos);
+			if (err)
+				break;
 		}
-		dif->bdev = bdev;
-		dif->dax_dev = fs_dax_get_by_bdev(bdev, &dif->dax_part_off);
-		dif->blocks = le32_to_cpu(dis->blocks);
-		dif->mapped_blkaddr = le32_to_cpu(dis->mapped_blkaddr);
-		sbi->total_blocks += dif->blocks;
-		pos += EROFS_DEVT_SLOT_SIZE;
 	}
 	up_read(&sbi->devs->rwsem);
 	erofs_put_metabuf(&buf);
···
 		goto out;
 
 	/* handle multiple devices */
-	ret = erofs_init_devices(sb, dsb);
+	ret = erofs_scan_devices(sb, dsb);
 
 	if (erofs_sb_has_ztailpacking(sbi))
 		erofs_info(sb, "EXPERIMENTAL compressed inline data feature in use. Use at your own risk!");
+	if (erofs_is_fscache_mode(sb))
+		erofs_info(sb, "EXPERIMENTAL fscache-based on-demand read feature in use. Use at your own risk!");
 out:
 	erofs_put_metabuf(&buf);
 	return ret;
···
 	Opt_dax,
 	Opt_dax_enum,
 	Opt_device,
+	Opt_fsid,
 	Opt_err
 };
 
···
 	fsparam_flag("dax",		Opt_dax),
 	fsparam_enum("dax",		Opt_dax_enum, erofs_dax_param_enums),
 	fsparam_string("device",	Opt_device),
+	fsparam_string("fsid",		Opt_fsid),
 	{}
 };
 
···
 		}
 		++ctx->devs->extra_devices;
 		break;
+	case Opt_fsid:
+#ifdef CONFIG_EROFS_FS_ONDEMAND
+		kfree(ctx->opt.fsid);
+		ctx->opt.fsid = kstrdup(param->string, GFP_KERNEL);
+		if (!ctx->opt.fsid)
+			return -ENOMEM;
+#else
+		errorfc(fc, "fsid option not supported");
+#endif
+		break;
 	default:
 		return -ENOPARAM;
 	}
···
 static int erofs_init_managed_cache(struct super_block *sb) { return 0; }
 #endif
 
+static struct inode *erofs_nfs_get_inode(struct super_block *sb,
+					 u64 ino, u32 generation)
+{
+	return erofs_iget(sb, ino, false);
+}
+
+static struct dentry *erofs_fh_to_dentry(struct super_block *sb,
+		struct fid *fid, int fh_len, int fh_type)
+{
+	return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+				    erofs_nfs_get_inode);
+}
+
+static struct dentry *erofs_fh_to_parent(struct super_block *sb,
+		struct fid *fid, int fh_len, int fh_type)
+{
+	return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+				    erofs_nfs_get_inode);
+}
+
+static struct dentry *erofs_get_parent(struct dentry *child)
+{
+	erofs_nid_t nid;
+	unsigned int d_type;
+	int err;
+
+	err = erofs_namei(d_inode(child), &dotdot_name, &nid, &d_type);
+	if (err)
+		return ERR_PTR(err);
+	return d_obtain_alias(erofs_iget(child->d_sb, nid, d_type == FT_DIR));
+}
+
+static const struct export_operations erofs_export_ops = {
+	.fh_to_dentry = erofs_fh_to_dentry,
+	.fh_to_parent = erofs_fh_to_parent,
+	.get_parent = erofs_get_parent,
+};
+
 static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
 {
 	struct inode *inode;
···
 	int err;
 
 	sb->s_magic = EROFS_SUPER_MAGIC;
-
-	if (!sb_set_blocksize(sb, EROFS_BLKSIZ)) {
-		erofs_err(sb, "failed to set erofs blksize");
-		return -EINVAL;
-	}
+	sb->s_flags |= SB_RDONLY | SB_NOATIME;
+	sb->s_maxbytes = MAX_LFS_FILESIZE;
+	sb->s_op = &erofs_sops;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
···
 
 	sb->s_fs_info = sbi;
 	sbi->opt = ctx->opt;
-	sbi->dax_dev = fs_dax_get_by_bdev(sb->s_bdev, &sbi->dax_part_off);
+	ctx->opt.fsid = NULL;
 	sbi->devs = ctx->devs;
 	ctx->devs = NULL;
+
+	if (erofs_is_fscache_mode(sb)) {
+		sb->s_blocksize = EROFS_BLKSIZ;
+		sb->s_blocksize_bits = LOG_BLOCK_SIZE;
+
+		err = erofs_fscache_register_fs(sb);
+		if (err)
+			return err;
+
+		err = erofs_fscache_register_cookie(sb, &sbi->s_fscache,
+						    sbi->opt.fsid, true);
+		if (err)
+			return err;
+
+		err = super_setup_bdi(sb);
+		if (err)
+			return err;
+	} else {
+		if (!sb_set_blocksize(sb, EROFS_BLKSIZ)) {
+			erofs_err(sb, "failed to set erofs blksize");
+			return -EINVAL;
+		}
+
+		sbi->dax_dev = fs_dax_get_by_bdev(sb->s_bdev,
+						  &sbi->dax_part_off);
+	}
 
 	err = erofs_read_superblock(sb);
 	if (err)
···
 			clear_opt(&sbi->opt, DAX_ALWAYS);
 		}
 	}
-	sb->s_flags |= SB_RDONLY | SB_NOATIME;
-	sb->s_maxbytes = MAX_LFS_FILESIZE;
-	sb->s_time_gran = 1;
 
-	sb->s_op = &erofs_sops;
+	sb->s_time_gran = 1;
 	sb->s_xattr = erofs_xattr_handlers;
+	sb->s_export_op = &erofs_export_ops;
 
 	if (test_opt(&sbi->opt, POSIX_ACL))
 		sb->s_flags |= SB_POSIXACL;
···
 
 static int erofs_fc_get_tree(struct fs_context *fc)
 {
+	struct erofs_fs_context *ctx = fc->fs_private;
+
+	if (IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) && ctx->opt.fsid)
+		return get_tree_nodev(fc, erofs_fc_fill_super);
+
 	return get_tree_bdev(fc, erofs_fc_fill_super);
 }
 
···
 	fs_put_dax(dif->dax_dev);
 	if (dif->bdev)
 		blkdev_put(dif->bdev, FMODE_READ | FMODE_EXCL);
+	erofs_fscache_unregister_cookie(&dif->fscache);
 	kfree(dif->path);
 	kfree(dif);
 	return 0;
···
 	struct erofs_fs_context *ctx = fc->fs_private;
 
 	erofs_free_dev_context(ctx->devs);
+	kfree(ctx->opt.fsid);
 	kfree(ctx);
 }
 
···
 
 	WARN_ON(sb->s_magic != EROFS_SUPER_MAGIC);
 
-	kill_block_super(sb);
+	if (erofs_is_fscache_mode(sb))
+		generic_shutdown_super(sb);
+	else
+		kill_block_super(sb);
 
 	sbi = EROFS_SB(sb);
 	if (!sbi)
···
 
 	erofs_free_dev_context(sbi->devs);
 	fs_put_dax(sbi->dax_dev);
+	erofs_fscache_unregister_cookie(&sbi->s_fscache);
+	erofs_fscache_unregister_fs(sb);
+	kfree(sbi->opt.fsid);
 	kfree(sbi);
 	sb->s_fs_info = NULL;
 }
···
 	iput(sbi->managed_cache);
 	sbi->managed_cache = NULL;
 #endif
+	erofs_fscache_unregister_cookie(&sbi->s_fscache);
 }
 
 static struct file_system_type erofs_fs_type = {
···
 	.name           = "erofs",
 	.init_fs_context = erofs_init_fs_context,
 	.kill_sb        = erofs_kill_sb,
-	.fs_flags       = FS_REQUIRES_DEV,
+	.fs_flags       = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
 };
 MODULE_ALIAS_FS("erofs");
 
···
 {
 	struct super_block *sb = dentry->d_sb;
 	struct erofs_sb_info *sbi = EROFS_SB(sb);
-	u64 id = huge_encode_dev(sb->s_bdev->bd_dev);
+	u64 id = 0;
+
+	if (!erofs_is_fscache_mode(sb))
+		id = huge_encode_dev(sb->s_bdev->bd_dev);
 
 	buf->f_type = sb->s_magic;
 	buf->f_bsize = EROFS_BLKSIZ;
···
 		seq_puts(seq, ",dax=always");
 	if (test_opt(opt, DAX_NEVER))
 		seq_puts(seq, ",dax=never");
+#ifdef CONFIG_EROFS_FS_ONDEMAND
+	if (opt->fsid)
+		seq_printf(seq, ",fsid=%s", opt->fsid);
+#endif
 	return 0;
 }
 
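The super.c changes above hinge on three small decisions: which `get_tree_*()` helper is used at mount time, whether a mounted superblock counts as fscache mode, and how the device ID mask is derived from the number of extra devices. The sketch below restates them as plain user-space functions so the conditions can be checked in isolation. All names here (`pick_tree`, `fscache_mode`, `device_id_mask`, `enum tree_kind`) are hypothetical illustrations; the boolean `ondemand` parameter is an assumed stand-in for `IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND)`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum tree_kind { TREE_BDEV, TREE_NODEV };

/*
 * Models erofs_fc_get_tree(): a non-NULL fsid (ctx->opt.fsid) selects the
 * device-less path, but only when on-demand support is compiled in.
 */
static enum tree_kind pick_tree(bool ondemand, const char *fsid)
{
	return (ondemand && fsid) ? TREE_NODEV : TREE_BDEV;
}

/*
 * Models erofs_is_fscache_mode(): has_bdev stands for sb->s_bdev != NULL,
 * which is never set for a superblock created without a block device.
 */
static bool fscache_mode(bool ondemand, bool has_bdev)
{
	return ondemand && !has_bdev;
}

/* Models roundup_pow_of_two(ondisk_extradevs + 1) - 1 from the scan code. */
static unsigned int device_id_mask(unsigned int extra_devices)
{
	unsigned int n = 1;

	while (n < extra_devices + 1)
		n <<= 1;
	return n - 1;
}
```

Under these assumptions, two extra blobs give a mask of 3 and four give a mask of 7; an `fsid` passed to a kernel built without on-demand support still takes the block-device path.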

+2 -2

fs/erofs/sysfs.c

···
 
 	sbi->s_kobj.kset = &erofs_root;
 	init_completion(&sbi->s_kobj_unregister);
-	err = kobject_init_and_add(&sbi->s_kobj, &erofs_sb_ktype, NULL,
-				   "%s", sb->s_id);
+	err = kobject_init_and_add(&sbi->s_kobj, &erofs_sb_ktype, NULL, "%s",
+			erofs_is_fscache_mode(sb) ? sbi->opt.fsid : sb->s_id);
 	if (err)
 		goto put_sb_kobj;
 	return 0;

include/linux/fscache.h

···
 #define FSCACHE_ADV_SINGLE_CHUNK	0x01 /* The object is a single chunk of data */
 #define FSCACHE_ADV_WRITE_CACHE		0x00 /* Do cache if written to locally */
 #define FSCACHE_ADV_WRITE_NOCACHE	0x02 /* Don't cache if written to locally */
+#define FSCACHE_ADV_WANT_CACHE_SIZE	0x04 /* Retrieve cache size at runtime */
 
 #define FSCACHE_INVAL_DIO_WRITE		0x01 /* Invalidate due to DIO write */
 

include/linux/netfs.h

···
 #define NETFS_SREQ_SHORT_IO		2	/* Set if the I/O was short */
 #define NETFS_SREQ_SEEK_DATA_READ	3	/* Set if ->read() should SEEK_DATA first */
 #define NETFS_SREQ_NO_PROGRESS		4	/* Set if we didn't manage to read any data */
+#define NETFS_SREQ_ONDEMAND		5	/* Set if it's from on-demand read mode */
 };
 
 enum netfs_io_origin {

+176

include/trace/events/cachefiles.h

···
 	cachefiles_obj_see_lookup_failed,
 	cachefiles_obj_see_withdraw_cookie,
 	cachefiles_obj_see_withdrawal,
+	cachefiles_obj_get_ondemand_fd,
+	cachefiles_obj_put_ondemand_fd,
 };
 
 enum fscache_why_object_killed {
···
 		      __entry->backer,
 		      __print_symbolic(__entry->where, cachefiles_error_traces),
 		      __entry->error)
+	    );
+
+TRACE_EVENT(cachefiles_ondemand_open,
+	    TP_PROTO(struct cachefiles_object *obj, struct cachefiles_msg *msg,
+		     struct cachefiles_open *load),
+
+	    TP_ARGS(obj, msg, load),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,	obj		)
+		    __field(unsigned int,	msg_id		)
+		    __field(unsigned int,	object_id	)
+		    __field(unsigned int,	fd		)
+		    __field(unsigned int,	flags		)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->obj	= obj ? obj->debug_id : 0;
+		    __entry->msg_id	= msg->msg_id;
+		    __entry->object_id	= msg->object_id;
+		    __entry->fd		= load->fd;
+		    __entry->flags	= load->flags;
+			   ),
+
+	    TP_printk("o=%08x mid=%x oid=%x fd=%d f=%x",
+		      __entry->obj,
+		      __entry->msg_id,
+		      __entry->object_id,
+		      __entry->fd,
+		      __entry->flags)
+	    );
+
+TRACE_EVENT(cachefiles_ondemand_copen,
+	    TP_PROTO(struct cachefiles_object *obj, unsigned int msg_id,
+		     long len),
+
+	    TP_ARGS(obj, msg_id, len),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,	obj	)
+		    __field(unsigned int,	msg_id	)
+		    __field(long,		len	)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->obj	= obj ? obj->debug_id : 0;
+		    __entry->msg_id	= msg_id;
+		    __entry->len	= len;
+			   ),
+
+	    TP_printk("o=%08x mid=%x l=%lx",
+		      __entry->obj,
+		      __entry->msg_id,
+		      __entry->len)
+	    );
+
+TRACE_EVENT(cachefiles_ondemand_close,
+	    TP_PROTO(struct cachefiles_object *obj, struct cachefiles_msg *msg),
+
+	    TP_ARGS(obj, msg),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,	obj		)
+		    __field(unsigned int,	msg_id		)
+		    __field(unsigned int,	object_id	)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->obj	= obj ? obj->debug_id : 0;
+		    __entry->msg_id	= msg->msg_id;
+		    __entry->object_id	= msg->object_id;
+			   ),
+
+	    TP_printk("o=%08x mid=%x oid=%x",
+		      __entry->obj,
+		      __entry->msg_id,
+		      __entry->object_id)
+	    );
+
+TRACE_EVENT(cachefiles_ondemand_read,
+	    TP_PROTO(struct cachefiles_object *obj, struct cachefiles_msg *msg,
+		     struct cachefiles_read *load),
+
+	    TP_ARGS(obj, msg, load),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,	obj		)
+		    __field(unsigned int,	msg_id		)
+		    __field(unsigned int,	object_id	)
+		    __field(loff_t,		start		)
+		    __field(size_t,		len		)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->obj	= obj ? obj->debug_id : 0;
+		    __entry->msg_id	= msg->msg_id;
+		    __entry->object_id	= msg->object_id;
+		    __entry->start	= load->off;
+		    __entry->len	= load->len;
+			   ),
+
+	    TP_printk("o=%08x mid=%x oid=%x s=%llx l=%zx",
+		      __entry->obj,
+		      __entry->msg_id,
+		      __entry->object_id,
+		      __entry->start,
+		      __entry->len)
+	    );
+
+TRACE_EVENT(cachefiles_ondemand_cread,
+	    TP_PROTO(struct cachefiles_object *obj, unsigned int msg_id),
+
+	    TP_ARGS(obj, msg_id),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,	obj	)
+		    __field(unsigned int,	msg_id	)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->obj	= obj ? obj->debug_id : 0;
+		    __entry->msg_id	= msg_id;
+			   ),
+
+	    TP_printk("o=%08x mid=%x",
+		      __entry->obj,
+		      __entry->msg_id)
+	    );
+
+TRACE_EVENT(cachefiles_ondemand_fd_write,
+	    TP_PROTO(struct cachefiles_object *obj, struct inode *backer,
+		     loff_t start, size_t len),
+
+	    TP_ARGS(obj, backer, start, len),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,	obj	)
+		    __field(unsigned int,	backer	)
+		    __field(loff_t,		start	)
+		    __field(size_t,		len	)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->obj	= obj ? obj->debug_id : 0;
+		    __entry->backer	= backer->i_ino;
+		    __entry->start	= start;
+		    __entry->len	= len;
+			   ),
+
+	    TP_printk("o=%08x iB=%x s=%llx l=%zx",
+		      __entry->obj,
+		      __entry->backer,
+		      __entry->start,
+		      __entry->len)
+	    );
+
+TRACE_EVENT(cachefiles_ondemand_fd_release,
+	    TP_PROTO(struct cachefiles_object *obj, int object_id),
+
+	    TP_ARGS(obj, object_id),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,	obj		)
+		    __field(unsigned int,	object_id	)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->obj	= obj ? obj->debug_id : 0;
+		    __entry->object_id	= object_id;
+			   ),
+
+	    TP_printk("o=%08x oid=%x",
+		      __entry->obj,
+		      __entry->object_id)
 	    );
 
 #endif /* _TRACE_CACHEFILES_H */

+68

include/uapi/linux/cachefiles.h

···
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_CACHEFILES_H
+#define _LINUX_CACHEFILES_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/*
+ * Fscache ensures that the maximum length of cookie key is 255. The volume key
+ * is controlled by netfs, and generally no bigger than 255.
+ */
+#define CACHEFILES_MSG_MAX_SIZE	1024
+
+enum cachefiles_opcode {
+	CACHEFILES_OP_OPEN,
+	CACHEFILES_OP_CLOSE,
+	CACHEFILES_OP_READ,
+};
+
+/*
+ * Message Header
+ *
+ * @msg_id	a unique ID identifying this message
+ * @opcode	message type, CACHEFILE_OP_*
+ * @len		message length, including message header and following data
+ * @object_id	a unique ID identifying a cache file
+ * @data	message type specific payload
+ */
+struct cachefiles_msg {
+	__u32 msg_id;
+	__u32 opcode;
+	__u32 len;
+	__u32 object_id;
+	__u8  data[];
+};
+
+/*
+ * @data contains the volume_key followed directly by the cookie_key. volume_key
+ * is a NUL-terminated string; @volume_key_size indicates the size of the volume
+ * key in bytes. cookie_key is binary data, which is netfs specific;
+ * @cookie_key_size indicates the size of the cookie key in bytes.
+ *
+ * @fd identifies an anon_fd referring to the cache file.
+ */
+struct cachefiles_open {
+	__u32 volume_key_size;
+	__u32 cookie_key_size;
+	__u32 fd;
+	__u32 flags;
+	__u8  data[];
+};
+
+/*
+ * @off		indicates the starting offset of the requested file range
+ * @len		indicates the length of the requested file range
+ */
+struct cachefiles_read {
+	__u64 off;
+	__u64 len;
+};
+
+/*
+ * Reply for READ request
+ * @arg for this ioctl is the @id field of READ request.
+ */
+#define CACHEFILES_IOC_READ_COMPLETE	_IOW(0x98, 1, int)
+
+#endif
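The header comments above describe the layout of an OPEN request: a `cachefiles_msg` header, then a `cachefiles_open` header, then the volume key immediately followed by the cookie key. A user-space daemon has to validate those sizes before touching the payload. The following is a minimal sketch of such a parser, not the daemon's actual code. `struct msg_hdr`, `struct open_hdr`, `struct open_view` and `parse_open` are hypothetical local names that mirror the fixed-size parts of the UAPI structs so the sketch builds without `<linux/cachefiles.h>`, and it assumes that `volume_key_size` counts the terminating NUL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the fixed parts of cachefiles_msg / cachefiles_open. */
struct msg_hdr  { uint32_t msg_id, opcode, len, object_id; };
struct open_hdr { uint32_t volume_key_size, cookie_key_size, fd, flags; };

enum { OP_OPEN, OP_CLOSE, OP_READ }; /* same order as enum cachefiles_opcode */

/* Hypothetical parsed view; the key pointers reference the caller's buffer. */
struct open_view {
	uint32_t msg_id, object_id, fd, flags;
	const char *volume_key;    /* NUL-terminated string */
	const uint8_t *cookie_key; /* binary, cookie_key_size bytes */
	uint32_t cookie_key_size;
};

/* Returns 0 on success, -1 if buf is not a well-formed OPEN message. */
static int parse_open(const uint8_t *buf, size_t n, struct open_view *out)
{
	struct msg_hdr h;
	struct open_hdr o;
	const uint8_t *data;
	size_t payload;

	if (n < sizeof(h) + sizeof(o))
		return -1;
	memcpy(&h, buf, sizeof(h)); /* memcpy avoids unaligned access */
	if (h.opcode != OP_OPEN || h.len > n || h.len < sizeof(h) + sizeof(o))
		return -1;
	memcpy(&o, buf + sizeof(h), sizeof(o));

	/* Both keys must fit in what remains after the two headers. */
	payload = h.len - sizeof(h) - sizeof(o);
	if (o.volume_key_size == 0 || o.volume_key_size > payload ||
	    o.cookie_key_size > payload - o.volume_key_size)
		return -1;

	data = buf + sizeof(h) + sizeof(o);
	if (data[o.volume_key_size - 1] != '\0') /* assumed: size includes NUL */
		return -1;

	out->msg_id = h.msg_id;
	out->object_id = h.object_id;
	out->fd = o.fd;
	out->flags = o.flags;
	out->volume_key = (const char *)data;
	out->cookie_key = data + o.volume_key_size;
	out->cookie_key_size = o.cookie_key_size;
	return 0;
}
```

A daemon would typically call something like this on each message read from the cachefiles device, dispatch on the opcode, and reject anything that fails the length checks rather than trusting the embedded sizes.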
