Issue #15210 has been updated by foonlyboy (Eike Dierks).


I looked into it a bit more closely:

io.c does this in
~~~ c
static int
io_strip_bom(VALUE io)
~~~
which is called by:
~~~ c
static void
io_set_encoding_by_bom(VALUE io)
~~~

> It is documented at `IO.new`, and you can use it at `CSV.open` too.
Yes, I was aware of this.

I also agree that the conversion has to take place when opening the file.

But with Rails I get an `ActionDispatch::Http::UploadedFile`
(which returns an ASCII-8BIT byte stream).

And I could find no way to apply `io_strip_bom()` to it,
not even by going through `StringIO`.
(But then Ruby is not about applying tricks anyway.)
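
For data that is already in memory as ASCII-8BIT bytes, a string-level workaround might look like the sketch below. It does not go through `io_strip_bom()` at all. The helper name `utf8_without_bom` and the sample data are assumptions made up for illustration; they are not part of Ruby or Rails.

```ruby
# Hypothetical helper (not part of Ruby or Rails): reinterpret binary data
# as UTF-8 and drop a single leading BOM, if there is one.
UTF8_BOM = "\uFEFF" # U+FEFF, encoded in UTF-8 as the bytes EF BB BF

def utf8_without_bom(bytes)
  # dup so the caller's string keeps its original encoding
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, 'data is not valid UTF-8' unless text.valid_encoding?

  # delete_prefix returns a new string and leaves text alone if no BOM is present
  text.delete_prefix(UTF8_BOM)
end

# Sample data standing in for what an upload might return (ASCII-8BIT).
raw  = "\xEF\xBB\xBFid,name\n1,eike\n".b
text = utf8_without_bom(raw)
```

This only covers UTF-8; it makes no attempt to detect UTF-16 or UTF-32.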

It sounds to me like nobu also agrees that the BOM should always be removed.

> If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space"

I don't care so much about this for now.
(while I can imagine this to happen when concatenating files ...)

But let's fix the simpler problems first.

I think the BOM is used for two reasons in byte streams:
- a magic number for UTF-encoded data (which might even apply to UTF-8)
- a magic number to distinguish the byte order when using UTF-16 or UTF-32
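
To illustrate the "magic number" idea, here is a minimal sketch that maps the leading bytes of a stream to an encoding. The constant `BOMS` and the method `detect_bom` are hypothetical names for this example only; this is not how io.c does it.

```ruby
# Hypothetical lookup table: BOM byte sequence => encoding it signals.
# UTF-32LE comes before UTF-16LE because its BOM starts with the same two bytes.
BOMS = {
  "\xEF\xBB\xBF".b     => Encoding::UTF_8,
  "\xFF\xFE\x00\x00".b => Encoding::UTF_32LE,
  "\x00\x00\xFE\xFF".b => Encoding::UTF_32BE,
  "\xFF\xFE".b         => Encoding::UTF_16LE,
  "\xFE\xFF".b         => Encoding::UTF_16BE
}.freeze

# Returns the encoding signalled by a leading BOM, or nil if there is none.
def detect_bom(bytes)
  head = bytes.b # compare as raw bytes, whatever encoding the string claims
  BOMS.each { |bom, encoding| return encoding if head.start_with?(bom) }
  nil
end
```

Note that a UTF-16LE stream whose first character is U+0000 is indistinguishable from UTF-32LE this way; the sketch simply prefers the longer match.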

But in the Ruby world, we have **String**.
We should remove all artefacts of any external encoding.

Impact:

I believe this might need changes in more than just one place in the code,
but I believe it should be fully upward compatible with *most* users' code.

This should still agree with the Ruby spec,
because it was never declared anywhere that String keeps the BOM.

---

Please excuse my lengthy writings,
but I thought these encoding problems were a thing from the past.

We might also look at how other languages handle this.
Makes for a good Rosetta Code entry ...

~eike

----------------------------------------
Bug #15210: UTF-8 BOM should be removed from String in internal representation
https://bugs.ruby-lang.org/issues/15210#change-74431

* Author: foonlyboy (Eike Dierks)
* Status: Open
* Priority: Normal
* Assignee: docs
* Target version: 
* ruby -v: 
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
 Hi everyone working on the ruby trunk,

I encountered a problem with a BOM (Byte Order Mark) at the front of UTF-8 string data.

We import some CSV from PayPal.
They now include a BOM in front of their UTF-8 encoded CSV data.
This BOM is causing trouble.

I believe this to be a bug in how byte data is converted to the ruby internal String representation.

There is a workaround, but this needs to be documented:
```ruby
IO.read(path, mode: 'r:BOM|UTF-8') # path: name of the file to read
```
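
A small self-contained sketch of the difference the `BOM|UTF-8` mode makes. The temp file and its contents are made up for the demo:

```ruby
require 'tempfile'

plain = stripped = nil

Tempfile.create('bom-demo') do |f|
  # Write a UTF-8 BOM (EF BB BF) followed by one line of CSV-like data.
  f.binmode
  f.write("\xEF\xBB\xBFid,name\n")
  f.flush

  # Plain UTF-8 read: the BOM ends up as the first character of the String.
  plain = File.read(f.path, encoding: 'UTF-8')

  # BOM|UTF-8: a leading BOM is consumed while opening the file.
  stripped = File.read(f.path, mode: 'r:BOM|UTF-8')
end

puts plain.start_with?("\uFEFF")
puts stripped.inspect
```

The first read keeps U+FEFF in front of `id`; the second returns only the data.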


---

But I'm asking to improve the UTF BOM handling:
- The BOM is only used for transfer encoding at the byte stream level.
- The BOM MUST NOT be part of the String in internal representation.
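
A minimal illustration of why a BOM left inside a String hurts, using a made-up header line. The BOM is invisible when printed, but it is a real character that breaks comparisons:

```ruby
# Made-up header line as it would arrive with a leading UTF-8 BOM.
line    = "\uFEFFid,name"
headers = line.split(',')
first   = headers.first

# Looks like "id" when printed, but is not equal to "id".
puts first == 'id'   # false
puts first.length    # 3 characters: BOM, "i", "d"
puts first.bytesize  # 5 bytes: EF BB BF plus "id"

# A lookup keyed by the expected header name therefore misses.
row = headers.zip(%w[1 eike]).to_h
puts row['id'].inspect # nil
```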


---

BTW: stdlib::CSV chokes on the BOM

I'd like to add some code for a workaround:


```ruby
class String

    # delete UTF Byte Order Mark from string
    # returns self (even if no bom was found, contrary to delete_prefix!)
    # NOTE: use with care: better remove the bom when reading the file
    def delete_bom!
        raise 'encoding is not UTF-8' unless self.encoding == Encoding::UTF_8
        delete_prefix!("\xEF\xBB\xBF")
        return self
    end


    # returns a copy of string with UTF Byte Order Mark deleted from string
    def delete_bom
        dup.delete_bom!
    end

end
```

---
~eike







-- 
https://bugs.ruby-lang.org/
