In this blog, we will learn about the Golang unicode/utf8 package and how character encoding works in programming languages.
The unicode/utf8 package provides several useful functions for querying and manipulating strings and byte slices ([]byte) that hold UTF8-encoded text.
First of all, let's understand the difference between ASCII and UTF8 encoding.
ASCII vs UTF8 Encoding
In the early days of computing, computer scientists needed only 128 characters, so they encoded them in 7 bits. Each character is stored in one byte whose leading bit is left unused (some systems used it as a parity bit).
2^7 = 128
This Encoding was called ASCII (American Standard Code for Information Interchange).
ASCII Example:
The capital letter A has an ASCII value of 65. Its 7-bit binary representation is 1000001, which is stored in a byte as:
01000001
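To check the ASCII example yourself, here is a minimal sketch in Go. The helper name asciiBits is only an illustration and is not part of any package.

```go
package main

import "fmt"

// asciiBits returns the value of a single byte as eight binary digits.
// (Hypothetical helper, used only for this illustration.)
func asciiBits(c byte) string {
	return fmt.Sprintf("%08b", c)
}

func main() {
	c := byte('A')
	fmt.Println(c)            // decimal code of 'A'
	fmt.Println(asciiBits(c)) // the same value written as eight bits
}
```

The leading 0 in the binary form is the unused eighth bit mentioned above.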
Nowadays we use far more characters, and people in many countries write text and code in their native language rather than English. So how is that possible?
Unicode is a standard that assigns a code point to almost every character used in the world's writing systems. UTF8 (8-bit Unicode Transformation Format), defined by the Unicode Standard, is a character encoding that can encode all 1,112,064 valid Unicode code points.
UTF8 was designed by Ken Thompson and Rob Pike (who are also among the creators of the Go programming language). This is also the reason why Go source code is UTF8 text.
UTF8 is a variable-width character encoding that uses one to four bytes to encode a character. It is backward compatible with ASCII: the 128 ASCII characters need only 7 bits, so each of them is encoded as a single byte with the same value it has in ASCII.
All other characters take two to four bytes to encode.
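As a quick illustration of the variable width, the sketch below asks utf8.RuneLen how many bytes a few sample characters need. The wrapper name encodedSize and the sample characters are my own choices for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedSize reports how many bytes UTF-8 needs for r.
// utf8.RuneLen returns -1 when r cannot be encoded at all.
// (Hypothetical wrapper, used only for this illustration.)
func encodedSize(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	samples := []rune{
		'A',          // ASCII letter
		'\u00e9',     // Latin small letter e with acute
		'क',          // Devanagari letter KA
		'\U0001F600', // an emoji outside the Basic Multilingual Plane
	}
	for _, r := range samples {
		fmt.Printf("%c U+%04X %d byte(s)\n", r, r, encodedSize(r))
	}
}
```

The four samples need 1, 2, 3 and 4 bytes respectively.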
Characters that take two or more bytes in UTF8 share a common pattern: the first byte starts with as many 1 bits as there are bytes in the encoding, followed by a 0. For example, the first byte of a two-byte character is:
Byte1 = 110xxxxx
After that, every following byte starts with the bits 10.
Byte2 = 10xxxxxx
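You can see these prefixes by printing the bytes of a string in binary. This is only a small sketch; bitPattern is a hypothetical helper written for this post.

```go
package main

import "fmt"

// bitPattern returns every byte of s as eight binary digits,
// separated by spaces. (Hypothetical helper for illustration.)
func bitPattern(s string) string {
	out := ""
	for i := 0; i < len(s); i++ { // index by byte, not by rune
		if i > 0 {
			out += " "
		}
		out += fmt.Sprintf("%08b", s[i])
	}
	return out
}

func main() {
	fmt.Println(bitPattern("A"))      // one byte, starts with 0
	fmt.Println(bitPattern("\u00e9")) // two bytes: 110xxxxx 10xxxxxx
	fmt.Println(bitPattern("क"))      // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
}
```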
Visit the Wikipedia page to learn more about UTF8 encoding.
Devanagari (Hindi) UTF8 Code
In this blog, I will be using examples in both Hindi and English. Keep a Devanagari Unicode code chart at hand for reference.
Golang UTF8 Package DecodeRune
func DecodeRune(p []byte) (r rune, size int)
DecodeRune unpacks the first UTF8 encoding in the byte slice p and returns the rune of the encoded character along with the number of bytes its encoding takes.
If p is empty, DecodeRune returns (RuneError, 0). If the encoding is invalid, it returns (RuneError, 1).
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	str := []byte("नमस्ते दुनिया") // Hello World in Hindi
	// Decode one rune at a time from the front until no bytes are left.
	for len(str) > 0 {
		r, size := utf8.DecodeRune(str)
		fmt.Printf("%c %v bytes\n", r, size)
		str = str[size:]
	}
}
Output:
न 3 bytes
म 3 bytes
स 3 bytes
् 3 bytes
त 3 bytes
े 3 bytes
1 bytes
द 3 bytes
ु 3 bytes
न 3 bytes
ि 3 bytes
य 3 bytes
ा 3 bytes
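The two error cases described above are easy to miss in a loop, so here is a small sketch that tells them apart. The function name classify and its result strings are my own; only utf8.DecodeRune and utf8.RuneError come from the package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// classify decodes the first rune of p and describes what DecodeRune found.
// (Hypothetical helper for illustration.)
func classify(p []byte) string {
	r, size := utf8.DecodeRune(p)
	switch {
	case r == utf8.RuneError && size == 0:
		return "empty input"
	case r == utf8.RuneError && size == 1:
		return "invalid encoding"
	default:
		return fmt.Sprintf("%c (%d bytes)", r, size)
	}
}

func main() {
	fmt.Println(classify(nil))          // nothing to decode
	fmt.Println(classify([]byte{0xff})) // 0xff never appears in valid UTF-8
	fmt.Println(classify([]byte("क")))  // a valid three-byte character
}
```

Checking the size as well as the rune matters because a correctly encoded U+FFFD also decodes to RuneError, but with a size of 3.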
DecodeRuneInString
func DecodeRuneInString(s string) (r rune, size int)
The DecodeRuneInString function is like DecodeRune, but its input is a string.
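A minimal sketch of the string variant, assuming you only want the size of each character. The helper name runeSizes and the sample text are made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeSizes walks s from the front and records how many bytes
// each rune occupies. (Hypothetical helper for illustration.)
func runeSizes(s string) []int {
	var sizes []int
	for len(s) > 0 {
		_, size := utf8.DecodeRuneInString(s)
		sizes = append(sizes, size)
		s = s[size:] // drop the rune that was just decoded
	}
	return sizes
}

func main() {
	fmt.Println(runeSizes("Go दुनिया"))
}
```

Because the input is already a string, no conversion to []byte is needed.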
Golang UTF8 Package DecodeLastRune
func DecodeLastRune(p []byte) (r rune, size int)
The DecodeLastRune function unpacks the last UTF8 encoding in the byte slice p and returns the rune of the encoded character along with the number of bytes its encoding takes.
If p is empty, DecodeLastRune returns (RuneError, 0).
Otherwise, if the encoding is invalid, the function returns (RuneError, 1).
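package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	str := []byte("नमस्ते दुनिया") // Hello World in Hindi
	// Decode one rune at a time from the back until no bytes are left.
	for len(str) > 0 {
		r, size := utf8.DecodeLastRune(str)
		// The first %v prints the rune's Unicode code point in decimal.
		fmt.Printf("UTF8 Code: %v %c %v bytes\n", r, r, size)
		str = str[:len(str)-size]
	}
}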
Output:
UTF8 Code: 2366 ा 3 bytes
UTF8 Code: 2351 य 3 bytes
UTF8 Code: 2367 ि 3 bytes
UTF8 Code: 2344 न 3 bytes
UTF8 Code: 2369 ु 3 bytes
UTF8 Code: 2342 द 3 bytes
UTF8 Code: 32 1 bytes
UTF8 Code: 2375 े 3 bytes
UTF8 Code: 2340 त 3 bytes
UTF8 Code: 2381 ् 3 bytes
UTF8 Code: 2360 स 3 bytes
UTF8 Code: 2350 म 3 bytes
UTF8 Code: 2344 न 3 bytes
DecodeLastRuneInString
func DecodeLastRuneInString(s string) (r rune, size int)
The DecodeLastRuneInString function is like DecodeLastRune, but its input is a string.
Example:
str := "नमस्ते दुनिया" // Hello World in Hindi
for len(str) > 0 {
	r, size := utf8.DecodeLastRuneInString(str)
	fmt.Printf("UTF8 Code: %v %c %v bytes\n", r, r, size)
	str = str[:len(str)-size]
}
Output:
UTF8 Code: 2366 ा 3 bytes
UTF8 Code: 2351 य 3 bytes
UTF8 Code: 2367 ि 3 bytes
UTF8 Code: 2344 न 3 bytes
UTF8 Code: 2369 ु 3 bytes
UTF8 Code: 2342 द 3 bytes
UTF8 Code: 32 1 bytes
UTF8 Code: 2375 े 3 bytes
UTF8 Code: 2340 त 3 bytes
UTF8 Code: 2381 ् 3 bytes
UTF8 Code: 2360 स 3 bytes
UTF8 Code: 2350 म 3 bytes
UTF8 Code: 2344 न 3 bytes
Golang UTF8 Package EncodeRune
func EncodeRune(b []byte, r rune) int
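The EncodeRune function writes the UTF8 encoding of the rune r into the byte slice b and returns the number of bytes written. The slice must be large enough to hold the encoding, otherwise the function panics.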
Example:
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	r := 'क'             // Devanagari letter KA
	b := make([]byte, 3) // 'क' needs 3 bytes in UTF8
	n := utf8.EncodeRune(b, r)
	fmt.Println("Byte Array :", b)
	fmt.Println("Number of Bytes Written:", n)
}
Output:
Byte Array : [224 164 149]
Number of Bytes Written: 3
Explanation:
The Hindi letter 'क' (DEVANAGARI LETTER KA) takes 3 bytes in UTF8, and the output shows those bytes. Let's look at their binary form to see how the letter is encoded.
Binary of 224:
11100000
Binary of 164:
10100100
Binary of 149:
10010101
According to the UTF8 encoding rule, the first byte of any non-ASCII character starts with as many 1 bits as there are bytes in the encoding, followed by a 0, and every remaining byte starts with 10. Here the first byte starts with 1110 (three bytes in total), and the second and third bytes start with 10. The remaining bits carry the code point of the letter.
Golang UTF8 RuneCount
func RuneCount(b []byte) int
The RuneCount function returns the number of runes in a byte slice. Erroneous and short encodings are treated as single runes of width 1 byte.
Example:
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	b := []byte("Hello, दुनिया") // World in Hindi
	fmt.Println(b)
	fmt.Println("bytes =", len(b))
	fmt.Println("runes =", utf8.RuneCount(b))
}
Output:
[72 101 108 108 111 44 32 224 164 166 224 165 129 224 164 168 224 164 191 224 164 175 224 164 190]
bytes = 25
runes = 13
In the output, the byte slice contains 25 elements, but the string has only 13 characters (runes).
The bytes from 72 to 32 hold the ASCII string "Hello, " (including the trailing space), one byte per character. The remaining 18 bytes hold the UTF8 encoding of the Hindi word.
The Hindi word expands into so many bytes because each of its six characters takes 3 bytes to encode.
Golang UTF8 RuneCountInString
func RuneCountInString(s string) (n int)
The RuneCountInString function is like RuneCount, but its input is a string.
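A common use of RuneCountInString is to get the character count of a string, since the built-in len returns the number of bytes. The following is a minimal sketch; the helper name counts is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the length of s in bytes and in runes.
// (Hypothetical helper for illustration.)
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	nBytes, nRunes := counts("Hello, दुनिया")
	fmt.Println("bytes =", nBytes)
	fmt.Println("runes =", nRunes)
}
```

For the same text as in the RuneCount example, this reports 25 bytes and 13 runes.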
Golang UTF8 Valid
func Valid(b []byte) bool
The Valid function reports whether the byte slice consists entirely of valid UTF8-encoded runes. It returns true if so, and false otherwise.
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	valid := []byte("Hello, दुनिया")      // World in Hindi
	invalid := []byte{0xff, 0xfe, 0xfd} // these bytes never occur in UTF8
	fmt.Println(utf8.Valid(valid))
	fmt.Println(utf8.Valid(invalid))
}
Output:
true
false
ValidString works the same as the Valid function, except that it takes a string as input. ValidRune is different: it takes a single rune and reports whether that rune can be legally encoded as UTF8. Code points that are out of range or in the surrogate range are not valid.
Hope you like it!
Learn more about the Golang unicode/utf8 package from the official documentation.
Hi,
Shouldn't the line "for len(b) > 0 {" in the code be replaced with "for len(str) > 0 {"?