Issue #4017 has been updated by tomog105 (Tomohiro Ogoke).


In addition, I ported the patches from this issue and collected benchmark results.

## Result
- Ruby version: ruby 2.5.0p0 (2017-12-25 revision 61468) [x86_64-darwin16]
- Processor: Intel Core i7 6700K @ 4 GHz
- Memory: 16 GB
- Target revision: https://github.com/tomog105/csv/commit/c56ed71097fc7e61ce2a896f013095a5ef36b548
  - Because the code base has changed a lot since the original patches, I adjusted them to handle the features added in the meantime.
  - However, `CSV#line` does not work properly in this revision.

### Rows 1000

```
            unquoted     91.088  (± 2.2%) i/s -    459.000  in   5.041720s
              quoted     22.547  (± 0.0%) i/s -    114.000  in   5.056618s
     include col_sep     22.167  (± 0.0%) i/s -    112.000  in   5.053609s
     include row_sep     22.369  (± 0.0%) i/s -    112.000  in   5.007715s
        encode utf-8     43.500  (± 0.0%) i/s -    220.000  in   5.058088s
         encode sjis     34.559  (± 0.0%) i/s -    174.000  in   5.035380s
```

### Rows 10000

```
            unquoted      9.395  (± 0.0%) i/s -     47.000  in   5.014335s
              quoted      2.116  (± 0.0%) i/s -     11.000  in   5.208314s
     include col_sep      2.103  (± 0.0%) i/s -     11.000  in   5.241502s
     include row_sep      2.146  (± 0.0%) i/s -     11.000  in   5.128535s
        encode utf-8      4.217  (± 0.0%) i/s -     22.000  in   5.226532s
         encode sjis      3.370  (± 0.0%) i/s -     17.000  in   5.047115s
```
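For anyone who wants to reproduce numbers of this kind, here is a minimal sketch of a harness. It is not the script used for the results above: `build_csv` and `iterations_per_second` are hypothetical helpers, the row contents are guesses based on the test names, the timing loop is a plain stdlib stand-in for a real benchmarking tool, and the two encoding cases are left out.

```ruby
require "csv"

# Hypothetical input builder: the real benchmark data is not shown in this
# thread, so these row shapes are only guesses based on the test names.
def build_csv(rows, kind)
  line =
    case kind
    when :unquoted        then %(AAAAAAAAAA,BBBBBBBBBB,CCCCCCCCCC)
    when :quoted          then %("AAAAAAAAAA","BBBBBBBBBB","CCCCCCCCCC")
    when :include_col_sep then %("AAAAA,AAAAA","BBBBB,BBBBB","CCCCC,CCCCC")
    when :include_row_sep then %("AAAAA\nAAAAA","BBBBB\nBBBBB","CCCCC\nCCCCC")
    else raise ArgumentError, "unknown kind: #{kind.inspect}"
    end
  "#{line}\n" * rows
end

# Parses `data` repeatedly for about `seconds` and returns parses per second.
def iterations_per_second(data, seconds: 0.5)
  count = 0
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  loop do
    CSV.parse(data)
    count += 1
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    return count / elapsed if elapsed >= seconds
  end
end

%i[unquoted quoted include_col_sep include_row_sep].each do |kind|
  ips = iterations_per_second(build_csv(1000, kind))
  puts format("%20s %10.3f i/s", kind, ips)
end
```

Running the same script against trunk and against the patched revision gives the two columns needed for a comparison. Absolute numbers will differ by machine and Ruby version.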

## Compare

### Rows 1000

| test | trunk | patched | ratio |
| --- | --- | --- | --- |
| unquoted | 41.142 | 91.088 | 221% |
| quoted | 23.093 | 22.547 | 97.6% |
| include col_sep | 14.826 | 22.167 | 150% |
| include row_sep | 7.136 | 22.369 | 313% |
| encode utf-8 | 34.350 | 43.500 | 127% |
| encode sjis | n/a | 34.559 | n/a |

### Rows 10000

| test | trunk | patched | ratio |
| --- | --- | --- | --- |
| unquoted | 4.021 | 9.395 | 234% |
| quoted | 2.266 | 2.116 | 93.4% |
| include col_sep | 1.527 | 2.103 | 138% |
| include row_sep | 0.692 | 2.146 | 310% |
| encode utf-8 | 3.215 | 4.217 | 131% |
| encode sjis | 3.400 | 3.370 | 99.1% |
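
The ratio column is patched i/s divided by trunk i/s, so values above 100% mean the patched parser is faster. A small sketch that recomputes it for the Rows 10000 table (`ratio_percent` is just an illustrative helper name, and the rounding to three significant figures is my assumption about how the column was produced):

```ruby
# Throughput ratio as a percentage: patched i/s divided by trunk i/s.
def ratio_percent(trunk, patched)
  pct = patched / trunk * 100.0
  # Three significant figures: whole numbers from 100 up, one decimal below.
  pct >= 100 ? pct.round.to_s : pct.round(1).to_s
end

# [trunk, patched] in i/s, copied from the Rows 10000 table.
rows_10000 = {
  "unquoted"        => [4.021, 9.395],
  "quoted"          => [2.266, 2.116],
  "include col_sep" => [1.527, 2.103],
  "include row_sep" => [0.692, 2.146],
  "encode utf-8"    => [3.215, 4.217],
  "encode sjis"     => [3.400, 3.370],
}

rows_10000.each do |name, (trunk, patched)|
  puts format("%-16s %s%%", name, ratio_percent(trunk, patched))
end
```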

----------------------------------------
Feature #4017: [PATCH] CSV parsing speedup
https://bugs.ruby-lang.org/issues/4017#change-70978

* Author: ender672 (Timothy Elliott)
* Status: Feedback
* Priority: Normal
* Assignee: kou (Kouhei Sutou)
* Target version: 
----------------------------------------
=begin
 ruby_19_csv_parser_split_methods.patch
 This patch breaks the CSV parser into multiple methods that are easier to understand and it allows for the performance optimizations in the second patch. It removes all regular expressions from the parser, resulting in a ~25% speed improvement in the CSV test suite. It adds a new CSV parser option, :io_read_limit, which determines the max size for IO reads. This option defaults to 2048 which to was the fastest in my benchmarks.
 
 ruby_19_csv_parser_speedup.patch
 This patch adds two shortcuts to the patch above that significantly improve parsing of CSV files that have many quoted columns. It has to be applied on top of the first patch.
 
 On large CSV files I observed that these patches resulted in a 20% - 60% reduction in parse time. If this patchset looks good, I would like to experiment with further improvements that take advantage of io_read_limit to always read from IO in large chunks (right now it only does so with CSV files that have no quote characters).
 
 These patches maintain m17n support and multi-character separator support (and boy, it's tough to make those tests happy :)
=end
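
To illustrate the idea of reading from IO in chunks bounded by a read limit, here is a rough sketch. It is not the code from the patches: `each_unquoted_row` is a hypothetical helper, `read_limit` only mirrors the role described for :io_read_limit, and the sketch assumes input with no quote characters and ASCII separators. It also ignores encodings, which the patches themselves do handle.

```ruby
require "stringio"

# Hypothetical helper, not part of CSV or of the patches.
# Reads at most `read_limit` bytes per IO read and yields each complete row
# as an array of fields. Only valid for input without quote characters.
# Fields come back as binary strings because no transcoding is done.
def each_unquoted_row(io, col_sep: ",", row_sep: "\n", read_limit: 2048)
  unless block_given?
    return enum_for(__method__, io, col_sep: col_sep, row_sep: row_sep,
                                    read_limit: read_limit)
  end

  buffer = String.new # binary buffer holding data not yet split into rows
  while (chunk = io.read(read_limit))
    buffer << chunk
    # Emit every complete row currently in the buffer. A separator that is
    # split across two reads is found once its second half arrives.
    while (index = buffer.index(row_sep))
      row = buffer.slice!(0, index)
      buffer.slice!(0, row_sep.bytesize)
      yield row.split(col_sep, -1)
    end
  end
  # A final row that has no trailing row separator.
  yield buffer.split(col_sep, -1) unless buffer.empty?
end

each_unquoted_row(StringIO.new("a,b\n1,2\n"), read_limit: 3) { |row| p row }
```

Quoted fields are the hard part, since a row separator inside quotes must not end the row, which is presumably why the chunked path is limited to files without quote characters.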


---Files--------------------------------
ruby_19_csv_parser_split_methods.patch (11.9 KB)
ruby_19_csv_parser_speedup.patch (1.82 KB)

