Issue #17001 has been reported by byroot (Jean Boussier).

----------------------------------------
Feature #17001: [Feature] Dir.scan to yield dirent for efficient and composable recursive directory scanning
https://bugs.ruby-lang.org/issues/17001

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------
### Use case

When you need to recursively scan a directory, you either have to use `Dir[]` / `Dir.glob`, which is fine for small directories or simple patterns,
but can easily take several seconds to complete for large repositories or complex patterns, and returns a very large array which tends to thrash the GC.

Or you can use `Dir.each_child` / `Dir.foreach` recursively, but then you need to `stat` each entry to know whether it's a directory, or even a symlink if you want to follow them.
This means one syscall per directory, plus one per entry (file or subdirectory) inside it. This is particularly impactful on OSX where `stat()` is several times slower than on Linux because of various sandboxing features.
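
As a rough illustration of that second approach (a minimal sketch, not the Bootsnap code; the method name `scan_with_stat` and its parameters are made up for this example), note the `File.lstat` call that has to be issued for every single entry:

```ruby
# Recursively collect paths under `root` whose names end with `ext`.
# `scan_with_stat` is a hypothetical helper, used only for illustration.
def scan_with_stat(root, ext = ".rb", found = [])
  Dir.each_child(root) do |name|
    next if name.start_with?(".")      # skip hidden entries

    path = File.join(root, name)
    stat = File.lstat(path)            # extra syscall, once per entry

    if stat.directory?
      scan_with_stat(path, ext, found) # recurse into real directories only
    elsif stat.file? && name.end_with?(ext)
      found << path
    end
  end
  found
end
```

Because `lstat` is used, symlinks are reported as links and are neither followed nor collected here; following them would require yet another `stat` per symlink.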

There's a [typical example of this use case in Bootsnap](https://github.com/Shopify/bootsnap/blob/56c61373000573112ee027dae4be19aecd50e46e/lib/bootsnap/load_path_cache/path_scanner.rb).

### Proposal

[Python introduced `os.scandir` a few years ago](https://www.python.org/dev/peps/pep-0471/) for exactly this purpose. It is functionally similar to `Dir.foreach` / `Dir.each_child`, except it yields
`DirEntry` instances which are a wrapper around the `libc` `dirent` struct.

I reduced the Bootsnap code into a [simplified benchmark](https://gist.github.com/casperisfine/2124f349c6564560df4399f2eadaa8f2), and using `os.scandir()` Python scans our main repo in a bit over `1s`, which is 3 to 4 times faster
than Ruby can with `Dir.foreach` (`3-4s`). For comparison's sake, `Dir['**/*.rb']` also completes in about `1s`.

So I believe that exposing a similar `Dir.scan` method, returning `Dir::Entry` instances, with methods inspired by `File::Stat` such as `directory?`, would allow for more performant file system scanning
when the query is not easily expressed with a glob pattern.
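
To give an idea of what calling code could look like, here is a pure-Ruby stand-in. Everything in it is an assumption: `DirScanSketch`, `Entry` and the `ruby_files` helper are hypothetical names, and nothing like them exists in Ruby today. It also still calls `File.lstat` internally, so it only shows the shape of the API and has none of the performance benefit; a real implementation would presumably read the type from the `dirent` struct where the platform provides it.

```ruby
# Hypothetical emulation of the proposed API, for illustration only.
module DirScanSketch
  # Stand-in for the proposed Dir::Entry; `ftype` holds a File::Stat#ftype string.
  Entry = Struct.new(:name, :path, :ftype) do
    def directory?
      ftype == "directory"
    end

    def file?
      ftype == "file"
    end

    def symlink?
      ftype == "link"
    end
  end

  # Yields one Entry per child of `dir`, excluding "." and "..".
  def self.scan(dir)
    return enum_for(:scan, dir) unless block_given?

    Dir.each_child(dir) do |name|
      path = File.join(dir, name)
      # Emulation only: the proposal would avoid this lstat call.
      yield Entry.new(name, path, File.lstat(path).ftype)
    end
  end
end

# Caller code no longer needs to stat anything itself.
def ruby_files(root, found = [])
  DirScanSketch.scan(root) do |entry|
    if entry.directory?
      ruby_files(entry.path, found)
    elsif entry.file? && entry.name.end_with?(".rb")
      found << entry.path
    end
  end
  found
end
```

The point of the design is that the recursion and filtering logic stays in the caller, so it composes with arbitrary conditions that a glob pattern cannot express.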


