7stud wrote in post #984785:
If you are not familiar with unicode, and you want to match UTF-8
characters, then you’d better start reading some unicode tutorials.
Here is a short one, ‘unicode in three rules’:
Unicode assigns an integer (called a ‘code point’) to every character
in every writing system in the world. Currently, there are something
like 100,000 assigned characters.
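For example, in ruby 1.9 or later you can ask a string for its unicode
integer, and go the other way too (a quick sketch; it assumes your
source file is UTF-8):

  "A".ord                    # => 65
  "é".ord                    # => 233
  "漢".ord                   # => 28450
  233.chr(Encoding::UTF_8)   # => "é"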
Now the question becomes: what is the best way to store those unicode
integers (which represent characters) on a computer? The way in which
you decide to store a unicode integer on a computer is called an
encoding.
For instance, you could use 4 bytes to store each unicode integer. In
that system, a series of unicode integers is very easy for ruby to
parse: every 4 bytes represents one unicode integer (which in turn
represents one character). If ruby blindly reads 4 byte chunks, then
each 4 byte chunk will be one unicode integer.
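Incidentally, that fixed 4-bytes-per-integer scheme is essentially what
the UTF-32 encoding does, and you can watch it happen in ruby (a
sketch, assuming ruby 1.9+):

  "<".ord                        # => 60
  "<".encode("UTF-32BE").bytes   # => [0, 0, 0, 60]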
But you don’t need 4 bytes to store, say, the unicode integer 60,
because three of those bytes would be empty. In fact, for all unicode
integers under 256 (which cover ASCII and the Western European
alphabets), three out of the four bytes would always be empty. Enter
the UTF-8 encoding: UTF-8 uses a variable number of bytes to store unicode
integers on your computer. For smaller unicode integers (those under
128), UTF-8 stores them in 1 byte, and for larger unicode integers,
UTF-8 stores them in 2, 3, or 4 bytes. But then how does ruby know how
many bytes it should
read for each unicode integer?
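You can see the variable widths directly by asking ruby for the bytes
(again assuming a UTF-8 source file):

  "a".bytes    # => [97]                     1 byte
  "é".bytes    # => [195, 169]               2 bytes
  "€".bytes    # => [226, 130, 172]          3 bytes
  "𝄞".bytes    # => [240, 157, 132, 158]     4 bytes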
Well, UTF-8 has a clever way of signaling where each unicode integer
ends: the first byte of every character announces how many bytes that
character occupies, and every remaining byte carries a distinctive
continuation marker. As long as you tell ruby that it is reading
unicode integers stored in the UTF-8 format, ruby will be able to sort
out where one unicode integer ends and the next one begins, even
though some of the unicode integers will be stored in 1 byte and
others will be stored in 2, 3, or 4 bytes.
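Concretely, the signal lives in the bit pattern of the first byte:
0xxxxxxx means a 1-byte character, 110xxxxx means 2 bytes, 1110xxxx
means 3, and 11110xxx means 4, while every follow-on byte starts with
10. Here is a sketch of both ideas in ruby (1.9+):

  "é".bytes.map { |b| b.to_s(2).rjust(8, "0") }
  # => ["11000011", "10101001"]
  #     110xxxxx = "2 bytes total", 10xxxxxx = "continuation byte"

  raw = [195, 169].pack("C*")          # raw bytes, tagged ASCII-8BIT
  text = raw.force_encoding("UTF-8")   # tell ruby the bytes are UTF-8
  text.chars                           # => ["é"]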
That is my current mental model of how unicode works. I hope it helps.