Hi everyone! This is Jimmy, and this is the third article in my series “Breaking Things with Go.” In this series, I document my journey through Jon Bodner’s Learning Go: An Idiomatic Approach to Real-World Go Programming (Second Edition) and explore how to use Go in the most practical way I can.

The resources for this series are the book itself, the Go documentation, and any AI model that helps clarify things.

Let's jump into it.


Strings

Strings in Go are immutable sequences of bytes that conventionally hold UTF-8 encoded text. They are not arrays of characters.

  • We can get the length of a string, in bytes, using len(x)
  • We can extract a single byte from a string by using an index expression
go
var s string = "hello there"
var b byte = s[6]           // indexing returns a single byte: 't'
// var x string = s[6]      // compile-time error: s[6] is a byte (uint8), not a string
var y string = string(s[6]) // "t"
var z string = s[6:]        // "there"
  • Strings are immutable, so they don't have the modification problems that slices of slices have
  • But there is one problem
    • Indexing returns a single byte, while a UTF-8 code point can be anywhere from one to four bytes long
    • So be careful if you deal with languages other than English or use emojis
      • The emoji 🌞, for example, needs 4 bytes to be stored, so if you index or slice only part of it, it won't decode correctly: you need all the bytes of the code point
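
To see the problem from the bullets above in action, here is a minimal sketch. The helper name firstRune and the sample string are my own choices for illustration; utf8.DecodeRuneInString and utf8.ValidString come from the standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune decodes the first complete code point of s and
// reports how many bytes it occupies.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	s := "🌞 day"

	// Slicing through the middle of the emoji leaves invalid UTF-8 behind.
	broken := s[:2]
	fmt.Println(utf8.ValidString(broken)) // false

	// Slicing on the code point boundary keeps the emoji intact.
	r, size := firstRune(s)
	whole := s[:size]
	fmt.Println(utf8.ValidString(whole), size) // true 4
	fmt.Printf("%c\n", r)

	// for range walks the string one code point at a time;
	// i is the byte offset where each code point starts.
	for i, r := range s {
		fmt.Printf("%d:%c ", i, r)
	}
	fmt.Println()
}
```

Notice that the byte offset in the loop jumps from 0 straight to 4, because the emoji takes up the first four bytes.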

A code point is a number assigned to represent a character in the Unicode standard.

For example, the letter A is code point U+0041 (65 in decimal), and 🌞 is code point U+1F31E.

Encoding works by taking the code point of a character and turning it into bytes. With UTF-8, that can be anywhere from 1 to 4 bytes. For decoding to work, we need all the encoded bytes of that code point to get the original character back.

c
Character  ⇄  Code Point (number)  ⇄  Encoding (bytes)

Unicode is the name of the standard that maps each character to a code point, and UTF-8 is one way of encoding those code points as bytes. That's why we didn't see a problem with the English letters: each of them needs only 1 byte in UTF-8.
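
The character, code point, and bytes chain can be printed directly. This is a small sketch; describe is a hypothetical helper of mine, and it relies on the standard fmt verbs %c (character), %U (code point), and % x (hex bytes separated by spaces).

```go
package main

import "fmt"

// describe shows a character, its Unicode code point,
// and the UTF-8 bytes Go uses to store it.
func describe(r rune) string {
	return fmt.Sprintf("%c %U % x", r, r, []byte(string(r)))
}

func main() {
	fmt.Println(describe('A'))  // A U+0041 41
	fmt.Println(describe('🌞')) // 🌞 U+1F31E f0 9f 8c 9e
}
```

The English letter needs one byte, while the emoji needs all four of its bytes to be decoded back.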

For a better understanding:

go
var s string = "Hello 🌞"
fmt.Println(len(s)) // prints 10 (bytes), not 7 (code points)

That's because len, like slicing and indexing, works on the 10 bytes of the string, not on its 7 code points.
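
If you want the number of code points instead of the number of bytes, one option is utf8.RuneCountInString from the standard library. A minimal sketch, where counts is just a name I picked for the example:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts reports the size of s in bytes and in code points.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	nBytes, nRunes := counts("Hello 🌞")
	fmt.Println(nBytes, nRunes) // 10 7
}
```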

Special Conversion in Go

You can do type conversions between runes, strings, and bytes in Go:

go
var a rune = 'x'
var s string = string(a)
var b byte = 'y'
var s2 string = string(b)

But converting an int to a string does not give you what you might expect:

go
var x int = 65
var y = string(x) // compiles, but go vet flags this conversion
fmt.Println(y)    // prints "A", not "65"

This happens because a conversion from int to string yields a string of one rune, not a string of digits.

  • The result isn't a rune itself: it is a string holding the rune (code point) with that value
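
If what you actually want is the digits, the standard strconv package has Itoa. A short sketch comparing both conversions; the helper names asCodePoint and asDigits are mine:

```go
package main

import (
	"fmt"
	"strconv"
)

// asCodePoint treats n as a Unicode code point and returns
// a string holding that single rune.
func asCodePoint(n int) string {
	return string(rune(n))
}

// asDigits returns the decimal digits of n as a string.
func asDigits(n int) string {
	return strconv.Itoa(n)
}

func main() {
	fmt.Println(asCodePoint(65)) // A
	fmt.Println(asDigits(65))    // 65
}
```

Writing string(rune(n)) instead of string(n) makes it explicit that you mean a code point.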

We can also convert a string to a slice of bytes or a slice of runes:

go
var s string = "Hello, 🌞"
var bs []byte = []byte(s)
var rs []rune = []rune(s)
fmt.Println(bs) // the string as UTF-8 bytes
fmt.Println(rs) // the string as runes (code points)
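
Here is a runnable version of the same idea with the output I expect, plus the trip back to a string. roundTrip is a hypothetical helper I added for the example.

```go
package main

import "fmt"

// roundTrip converts s to a byte slice and to a rune slice,
// then converts each slice back to a string.
func roundTrip(s string) (viaBytes, viaRunes string) {
	bs := []byte(s)
	rs := []rune(s)
	return string(bs), string(rs)
}

func main() {
	s := "Hello, 🌞"
	bs := []byte(s)
	rs := []rune(s)

	// The emoji is four separate bytes in bs but a single value in rs.
	fmt.Println(len(bs), bs) // 11 [72 101 108 108 111 44 32 240 159 140 158]
	fmt.Println(len(rs), rs) // 8 [72 101 108 108 111 44 32 127774]

	viaBytes, viaRunes := roundTrip(s)
	fmt.Println(viaBytes == s, viaRunes == s) // true true
}
```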

Why is UTF-8 smart?

UTF-32 uses 32 bits (4 bytes) to store every code point, even one that could be represented in a single byte, which wastes a lot of space. UTF-16 uses one or two 16-bit units per code point, and UTF-8 goes further by using one to four bytes.

The good thing about UTF-8 is that it uses a single byte for the Unicode characters whose values are below 128 (the English letters, digits, and common punctuation), but it can expand up to 4 bytes to represent code points with larger values.
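
You can check the size of any code point with utf8.RuneLen from the standard library. In this sketch the sample characters é and € are my own picks to cover the 2-byte and 3-byte cases, and encodedLen is just a thin wrapper for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen returns how many bytes UTF-8 needs for r.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	for _, r := range "Aé€🌞" {
		fmt.Printf("%c %U needs %d byte(s)\n", r, r, encodedLen(r))
	}
	// A needs 1, é needs 2, € needs 3, and 🌞 needs 4.
}
```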

How did UTF-8 solve the endianness problem?

You can find the complete discussion here, and it is a very good read, but the short answer is:

The reason is very simple. There are big and little endian versions of UTF-16 and UTF-32 because there are computers with big and little endian registers. If the endianness of a Unicode file matches the endianness of the processor the character value can be read directly from memory in a single operation. If they do not match, a second conversion step is required to flip the value around.

In contrast the endianness of the processor is irrelevant when reading UTF-8. The program must read the individual bytes and perform a series of tests and bit shifts to get the character value into a register. Having a version where the byte order was reversed would be pointless.
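
To make the quote a bit more concrete, here is a sketch that writes the same emoji as UTF-16 in both byte orders and as UTF-8. The helper utf16Bytes and its bigEndian flag are my own; utf16.Encode and binary.ByteOrder are from the standard library.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// utf16Bytes encodes s as UTF-16 and writes each 16-bit unit
// in the requested byte order.
func utf16Bytes(s string, bigEndian bool) []byte {
	var order binary.ByteOrder = binary.LittleEndian
	if bigEndian {
		order = binary.BigEndian
	}
	units := utf16.Encode([]rune(s))
	out := make([]byte, 2*len(units))
	for i, u := range units {
		order.PutUint16(out[2*i:], u)
	}
	return out
}

func main() {
	s := "🌞"
	fmt.Printf("UTF-16 BE: % x\n", utf16Bytes(s, true))  // d8 3c df 1e
	fmt.Printf("UTF-16 LE: % x\n", utf16Bytes(s, false)) // 3c d8 1e df
	fmt.Printf("UTF-8:     % x\n", []byte(s))            // f0 9f 8c 9e
}
```

UTF-16 gives two different byte sequences for the same text, while UTF-8 has only one, because it is defined byte by byte.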

Takeaways

  • Strings in Go are immutable UTF-8 byte sequences
  • Indexing and slicing operate on bytes, not code points
  • Conversions between strings, bytes, and runes are legal as special conversions in Go due to their special relationship
  • UTF-8 keeps the characters below 128 at one byte each, grows up to 4 bytes when needed, and has no byte-order problem

Coming Next

The next break will be about maps, and that’s a hell of a topic to break, so stick around for the next one, where we will break more stuff.

Feel free to reach out to me.