Relative paths are faster than absolute paths on Linux (and other jokes I tell myself)
Has this ever happened to you?
$ pwd
/tmp/bar
$ git status
fatal: Unable to read current working directory: No such file or directory
This happens when you delete a temporary directory in one terminal while another terminal still has it as its current working directory.
What if we simply recreate /tmp/bar?
$ git init $(pwd)
$ git status
fatal: Unable to read current working directory: No such file or directory
Still doesn’t work. However, cd-ing into the directory that we are already in fixes it:
$ cd $(pwd)
$ git status
On branch master
No commits yet
nothing to commit (create/copy files and use "git add" to track)
This is not a git problem, by the way. Opening vim or other programs in a directory that somebody else deleted will throw similar errors.
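Here is a minimal Rust sketch of the same failure: run it from a shell whose cwd has been deleted and getcwd() reports the same “No such file or directory”.
use std::env;

fn main() {
    // getcwd() fails with ENOENT if the current working directory
    // has been deleted out from under us, just like git and vim.
    match env::current_dir() {
        Ok(dir) => println!("cwd: {}", dir.display()),
        Err(e) => eprintln!("cannot read cwd: {e}"),
    }
}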
But why does cd $(pwd) fix it?
How the current working directory is stored
The current working directory (cwd or pwd) is not a shell concept, but something Linux stores for every process. Your shell is a process with a cwd, and when you launch other processes from that shell, they inherit the cwd. You can use pwdx to check the cwd of a running process:
$ pwdx $(pidof vim)
230789: /tmp/bar
We’ve got the process ID and the path there. You might think that Linux just stores that path as a string for every process. However, that can’t be true. Otherwise we would not have to use cd $(pwd) to “fix up” our shell’s cwd after deleting and re-creating a folder! So how is the cwd really represented? Does it point to an inode, maybe?
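As an aside, the kernel exposes each process’s cwd as a symlink under /proc, which is essentially what pwdx reads, and child processes inherit it. A quick Linux-only Rust sketch of both (it assumes a pwd binary is on PATH):
use std::fs;
use std::process::Command;

fn main() -> std::io::Result<()> {
    // The kernel publishes our cwd as the /proc/self/cwd symlink.
    println!("{}", fs::read_link("/proc/self/cwd")?.display());

    // A child process inherits our cwd, so `pwd` reports the same path.
    let out = Command::new("pwd").output()?;
    print!("{}", String::from_utf8_lossy(&out.stdout));
    Ok(())
}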
Thanks to Linux being open source, we can find out for ourselves by reading Linux’s source code. Thanks to this StackOverflow answer, we don’t have to read as much.
The answer says a process’s cwd is stored in fs_struct:
struct fs_struct {
    int users;
    spinlock_t lock;
    seqcount_spinlock_t seq;
    int umask;
    int in_exec;
    struct path root, pwd;
} __randomize_layout;
So our cwd is stored as struct path. That struct is defined in path.h:
struct path {
    struct vfsmount *mnt;
    struct dentry *dentry;
} __randomize_layout;
Searching for dentry leads us to dcache.h:
struct dentry {
    /* RCU lookup touched fields */
    unsigned int d_flags;        /* protected by d_lock */
    seqcount_spinlock_t d_seq;   /* per dentry seqlock */
    struct hlist_bl_node d_hash; /* lookup hash list */
    struct dentry *d_parent;     /* parent directory */
    struct qstr d_name;
    struct inode *d_inode;       /* Where the name belongs to - NULL is
                                  * negative */
    unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */
    /* Ref lookup also touches following */
    struct lockref d_lockref;    /* per-dentry lock and refcount */
    const struct dentry_operations *d_op;
    struct super_block *d_sb;    /* The root of the dentry tree */
    unsigned long d_time;        /* used by d_revalidate */
    void *d_fsdata;              /* fs-specific data */
    union {
        struct list_head d_lru;      /* LRU list */
        wait_queue_head_t *d_wait;   /* in-lookup ones only */
    };
    struct list_head d_child;    /* child of parent list */
    struct list_head d_subdirs;  /* our children */
    /*
     * d_alias and d_rcu can share memory
     */
    union {
        struct hlist_node d_alias;             /* inode alias list */
        struct hlist_bl_node d_in_lookup_hash; /* only for in-lookup ones */
        struct rcu_head d_rcu;
    } d_u;
} __randomize_layout;
I know that Linux has an in-memory filesystem cache, but I don’t understand this struct. However, the mention of inodes in d_inode sort of explains why we need to run cd $(pwd): recreating the directory /tmp/bar gives it a new inode, and cd $(pwd) updates our path struct to point to the right dentry (or updates our dentry to point to the correct inode; I’m not sure which).
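We can watch this happen from userspace. Here is a small sketch; it assumes /tmp/bar is free to use and relies on stat(".") still resolving through our stale dentry after the directory has been deleted, which it does on Linux:
use std::env;
use std::fs;
use std::os::unix::fs::MetadataExt; // for .ino()

fn main() -> std::io::Result<()> {
    let dir = "/tmp/bar";
    fs::create_dir_all(dir)?;
    env::set_current_dir(dir)?;
    let original = fs::metadata(".")?.ino();

    // Simulate the "other terminal": delete and recreate the directory.
    fs::remove_dir(dir)?;
    fs::create_dir(dir)?;

    // "." still resolves through our stale dentry to the old, deleted inode,
    // while the absolute path resolves to the freshly created one.
    println!("original inode:    {original}");
    println!("via '.':           {}", fs::metadata(".")?.ino());
    println!("via /tmp/bar:      {}", fs::metadata(dir)?.ino());

    // The equivalent of `cd $(pwd)`: re-resolve the absolute path.
    env::set_current_dir(dir)?;
    println!("after 'cd $(pwd)': {}", fs::metadata(".")?.ino());
    Ok(())
}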
Detour: Deeply nested paths
Okay, mystery kind of solved. Now let’s explore a different question: what’s the maximum path depth I can have on Linux?
The answer is: there is no limit on nesting depth itself. There is only a limit on the length of the path strings you can pass to syscalls (PATH_MAX, 4096 bytes on Linux). What does that mean in practice?
If we create a deeply nested directory tree using this simple Rust program:
use std::fs;
use std::env;

fn main() {
    for _ in 0..100_000 {
        // Ignore the error if "a" already exists, then step down into it.
        let _ = fs::create_dir("a");
        env::set_current_dir("a").unwrap();
    }
}
…and then try to rm it, everything is fine:
$ target/release/verydeep
$ rm -r ./a
What if we recreate the directory and run ripgrep on it?
$ rg --files ./a
a/a/a/a/[...]: File name too long (os error 36)
What is ripgrep doing that causes problems here? Probably a syscall, and we can inspect syscalls using strace. The output is really messy. To narrow it down, I searched for ENAMETOOLONG, which, according to a Google search for “linux os error 36”, is the name behind os error 36.
$ strace rg -j1 --files ./a
[...]
openat(AT_FDCWD, "./a/a/[...]/a/a/a"..., O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENAMETOOLONG (File name too long)
[...]
That string passed to openat() is extremely long. So ripgrep constructs paths that are too long for openat() to handle.
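You can trigger the same error without ripgrep. In this sketch the deep tree doesn’t even have to exist: the kernel rejects the path string for being longer than PATH_MAX before it tries to resolve it.
use std::fs;

fn main() {
    // ~200,000 bytes of "a/a/a/...", far beyond PATH_MAX.
    let long_path = vec!["a"; 100_000].join("/");
    match fs::metadata(&long_path) {
        Ok(_) => println!("unexpectedly resolved"),
        // Prints "File name too long (os error 36)", i.e. ENAMETOOLONG.
        Err(e) => println!("{e}"),
    }
}
But what does rm do to avoid running into this problem?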
Here’s a theory: rm uses chdir("./a") repeatedly to descend to the innermost directory, then calls chdir(".."), removes the (now empty) directory it just stepped out of with rmdir("./a") or equivalent, and repeats until it is back where it started. This avoids ever building the kind of very long path string that can’t be passed to any system call.
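To make that concrete, here is a sketch of the chdir()-based strategy in Rust, specialized to the a/a/a/… chain our program created. It is not what rm ships, and error handling is minimal:
use std::env;
use std::fs;
use std::path::Path;

// Walk down with chdir() so no path longer than one segment is ever
// passed to a syscall, then walk back up, removing as we go.
fn remove_deep_chain(root: &Path) -> std::io::Result<()> {
    let mut depth = 1;
    env::set_current_dir(root)?;
    while Path::new("a").is_dir() {
        env::set_current_dir("a")?; // like chdir("./a")
        depth += 1;
    }
    while depth > 0 {
        env::set_current_dir("..")?; // like chdir("..")
        depth -= 1;
        // Remove the (now empty) directory we just stepped out of.
        let name = if depth == 0 { root } else { Path::new("a") };
        fs::remove_dir(name)?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    remove_deep_chain(Path::new("./a"))
}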
In fact this is exactly how it used to work until rm was rewritten to use different system calls that avoid changing the current working directory. But how does it work today?
How rm traverses directories using entirely too many file descriptors
Another round of strace. Here’s the output towards the end:
$ strace rm -r ./a
[...]
newfstatat(3, "a", {st_mode=S_IFDIR|0775, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
faccessat(3, "a", W_OK) = 0
unlinkat(3, "a", AT_REMOVEDIR) = 0
openat(3, "..", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_NOFOLLOW|O_CLOEXEC|O_DIRECTORY) = 4
fstat(4, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
close(3) = 0
newfstatat(4, "a", {st_mode=S_IFDIR|0775, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
faccessat(4, "a", W_OK) = 0
unlinkat(4, "a", AT_REMOVEDIR) = 0
close(4) = 0
newfstatat(AT_FDCWD, "./a", {st_mode=S_IFDIR|0775, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
faccessat(AT_FDCWD, "./a", W_OK) = 0
unlinkat(AT_FDCWD, "./a", AT_REMOVEDIR) = 0
[...]
I believe what happens is:
- rm opens a file descriptor for ./a (let’s pretend that this fd is 3)
- …then opens a file descriptor for ./a/a using the first file descriptor: openat(3, "a").
- …and repeats that process until it arrives at the parent directory that contains the innermost directory (let’s pretend that this fd is 99)
- …then deletes the innermost directory using unlinkat(99, "a", AT_REMOVEDIR)
- …then uses openat(99, "..") to traverse back out of the tree (giving us fd 98)
- …then deletes the directory it just stepped out of using unlinkat(98, "a", AT_REMOVEDIR).
- …and so on until it arrives back at the directory it was executed from.
The source code (linked above) that introduces this calls it a “virtual chdir”.
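Here is a sketch of that “virtual chdir” in Rust, again specialized to the a/a/a/… chain. It calls openat()/unlinkat() through the libc crate (an extra dependency), holds only one directory fd at a time, and skips the stat and permission checks the real rm performs:
use std::ffi::CString;

fn main() {
    let a = CString::new("a").unwrap();
    let dotdot = CString::new("..").unwrap();
    let root = CString::new("./a").unwrap();
    let flags = libc::O_RDONLY | libc::O_NOFOLLOW | libc::O_DIRECTORY;

    unsafe {
        // Walk down: each openat() resolves a single component relative to
        // the previous directory fd, so no long path string is ever built.
        let mut fd = libc::open(root.as_ptr(), flags);
        assert!(fd >= 0, "./a does not exist");
        let mut depth = 0;
        loop {
            let next = libc::openat(fd, a.as_ptr(), flags);
            if next < 0 {
                break; // no deeper "a": fd now refers to the innermost directory
            }
            libc::close(fd);
            fd = next;
            depth += 1;
        }

        // Walk back up: open "..", remove the child "a" we just came from,
        // and repeat. This is the "virtual chdir".
        while depth > 0 {
            let parent = libc::openat(fd, dotdot.as_ptr(), flags);
            libc::close(fd);
            libc::unlinkat(parent, a.as_ptr(), libc::AT_REMOVEDIR);
            fd = parent;
            depth -= 1;
        }
        libc::close(fd);
        libc::unlinkat(libc::AT_FDCWD, root.as_ptr(), libc::AT_REMOVEDIR);
    }
}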
Putting it all together
Originally I was wondering whether it’s possible to traverse large directory trees in Rust without:
- …allocating so many PathBuf objects. Those are basically strings containing absolute paths. walkdir and similar crates appear to construct those for every directory entry at every level.
- …making the filesystem look up every subpath starting from the root.
Most filesystems store one table per directory. When one calls open() with an absolute path /a/b/c/d/... containing n path segments, ext4 needs to “chase pointers”: it looks up the first path segment a in one table, which leads to another table where it looks up b, and so on. So resolving a path to an inode is O(n) in the number of path segments. And only ext4 made those tables hash tables, so in practice the running time may be even worse than that.
Turns out both are achievable: GNU rm does it all, in both versions.
- rm only allocates path segments, in both versions.
- rm either keeps a file descriptor open that points directly to an inode (after the openat() rewrite in 2005), or changes its own working directory (before the rewrite). At the beginning of this blog post we observed that the current working directory points directly to an inode.
The openat() solution is probably the way to go. After all, chdir()-based directory walking mutates process-global state and can’t be parallelized.
What speaks against either solution is:
- the potential number of extra syscalls and open file descriptors. If you only want to keep one file descriptor open, then walking back out of the tree takes as many syscalls as walking into it does.
- If you want to avoid that, then you need to keep more file descriptors to parent directories open. In addition, printing the current path while walking is now harder too.
- If you want to parallelize tree traversal, managing open file descriptors may become too annoying. I can imagine that reference counting (Arc<MyFd>) or using dup(2) would be “good enough” (see the sketch below), but either way there’s an increased risk of opening too many fds at once.
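To illustrate the reference-counting idea: here is a sketch of sharing one directory fd between threads, using Arc<OwnedFd> in place of the hypothetical Arc<MyFd> and made-up child0…child3 names, with openat() via the libc crate:
use std::ffi::CString;
use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};
use std::sync::Arc;
use std::thread;

fn main() {
    // Open one directory fd and share it between worker threads.
    let here = CString::new(".").unwrap();
    let raw = unsafe { libc::open(here.as_ptr(), libc::O_RDONLY | libc::O_DIRECTORY) };
    assert!(raw >= 0);
    let dir: Arc<OwnedFd> = Arc::new(unsafe { OwnedFd::from_raw_fd(raw) });

    let workers: Vec<_> = (0..4)
        .map(|i| {
            let dir = Arc::clone(&dir);
            thread::spawn(move || {
                // Each worker resolves its own (made-up) child relative to
                // the shared fd: no chdir(), no absolute paths.
                let child = CString::new(format!("child{i}")).unwrap();
                let fd = unsafe {
                    libc::openat(dir.as_raw_fd(), child.as_ptr(),
                                 libc::O_RDONLY | libc::O_DIRECTORY)
                };
                if fd >= 0 {
                    unsafe { libc::close(fd) };
                }
            })
        })
        .collect();

    for w in workers {
        w.join().unwrap();
    }
    // The directory fd is closed when the last Arc clone is dropped.
}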
ripgrep is not the only tool that will fail on very deep directory trees; most programs will. The author may well have considered implementing directory traversal this way already, and may have chosen not to because of the problems mentioned above.