Exploring Golang UTF8 Package for Efficient Text Encoding

If you work with the Go programming language, you have probably come across the unicode/utf8 package. It is part of the Go standard library and provides functions for inspecting, encoding, and decoding UTF-8 text. In this article, we look at the Golang UTF8 package in detail: its features and how to use it.

What is UTF-8 Encoding?

UTF-8 is a variable-length character encoding standard for electronic communication. It is capable of encoding all possible characters in Unicode, a universal character set that encompasses almost all written languages and scripts. UTF-8 uses one to four bytes to represent each character, depending on its code point. The encoding is designed to be backward-compatible with ASCII, which means that any ASCII text is also valid UTF-8-encoded text.
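The one-to-four-byte rule can be checked directly in Go. The following is a minimal sketch: the helper name byteLen is made up for this illustration, and it relies only on Go's built-in conversion from a rune to a string, which produces the rune's UTF-8 encoding.

```go
package main

import "fmt"

// byteLen is a hypothetical helper for this illustration. It reports how many
// bytes the UTF-8 encoding of r occupies by converting the rune to a string.
func byteLen(r rune) int {
	return len(string(r))
}

func main() {
	// One sample rune from each length class of UTF-8.
	for _, r := range []rune{'A', 'Я', '世', '😀'} {
		fmt.Printf("%c (U+%04X): %d byte(s)\n", r, r, byteLen(r))
	}
}
```

The four sample runes (Latin, Cyrillic, Chinese, emoji) need 1, 2, 3, and 4 bytes respectively.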

UTF-8 is the most widely used character encoding on the web because of its compatibility with ASCII and its ability to represent characters from multiple scripts, including Latin, Cyrillic, Arabic, Chinese, and many more. Many programming languages, including Go, support UTF-8 encoding natively.

What is Golang UTF8 Package?

The Golang UTF-8 package, imported as unicode/utf8, is part of the Go standard library. It provides functions to encode and decode text in UTF-8 and works in terms of runes, which is Go's name for Unicode code points: the unique numbers assigned to each character in the Unicode standard. Most functions take a byte slice and have a string counterpart (for example, RuneCount and RuneCountInString). The package provides several functions for working with UTF-8 encoded text, including:

  • func RuneLen(r rune) int: returns the number of bytes required to encode the rune r, or -1 if r is not a valid value to encode in UTF-8.
  • func RuneCount(p []byte) int: returns the number of runes in p.
  • func Valid(p []byte) bool: reports whether p is a valid UTF-8-encoded byte sequence.
  • func DecodeRune(p []byte) (r rune, size int): decodes the first UTF-8-encoded rune in p and returns the rune and its length in bytes.
  • func EncodeRune(p []byte, r rune) int: encodes the rune r into p and returns the number of bytes written.

How to Use Golang UTF-8 Package?

To use the Golang UTF-8 package, you need to import it into your Go program using the following import statement:

import "unicode/utf8"

Once you have imported the package, you can use its functions to work with UTF-8 encoded strings. Here is an example of how to use the Golang UTF-8 package to count the number of runes in a UTF-8 encoded string:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	str := "Hello, 世界"
	count := utf8.RuneCountInString(str)
	fmt.Printf("The string \"%s\" has %d runes.\n", str, count)
}

In this example, we are using the RuneCountInString function to count the number of runes in the string “Hello, 世界”, which contains both ASCII and non-ASCII characters. The output of this program will be:

The string "Hello, 世界" has 9 runes.

UTF8 Package in Golang

The “utf8” package in Go provides functions to encode, decode, and manipulate UTF-8 encoded text. Let’s take a look at some of the most commonly used functions in the package.

RuneCountInString

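The RuneCountInString function returns the number of runes (Unicode code points) in a given UTF-8 encoded string. Use it when you need the length of a string in characters rather than in bytes: the built-in len function returns the number of bytes, which is larger than the rune count whenever the string contains non-ASCII characters.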

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "Hello, 世界"
    count := utf8.RuneCountInString(str)
    fmt.Println(count) // Output: 9
}

In the above example, we have a string that contains both ASCII and non-ASCII characters. We pass the string to the RuneCountInString function, which returns the total number of runes in the string.
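The difference between the byte length and the rune count is easy to see side by side. This is a small sketch; the helper name lengths is chosen only for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lengths is a hypothetical helper that returns the byte length and the rune
// count of s.
func lengths(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	nBytes, nRunes := lengths("Hello, 世界")
	fmt.Println(nBytes, nRunes) // Output: 13 9
}
```

The seven ASCII characters take one byte each, while each of the two Chinese characters takes three bytes, which gives 13 bytes for 9 runes.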

RuneLen

The RuneLen function returns the number of bytes required to encode a given rune.

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    r := '世'
    n := utf8.RuneLen(r) // named n to avoid shadowing the built-in len
    fmt.Println(n) // Output: 3
}

In the above example, we have a Unicode character ‘世’ (which requires three bytes to encode in UTF-8). We pass the character to the RuneLen function, which returns the number of bytes required to encode it.
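RuneLen also signals values that cannot be encoded at all: it returns -1 for surrogate halves and for values beyond U+10FFFF. The sketch below wraps that convention in a hypothetical helper, encodedSize, which is not part of the package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedSize is a hypothetical wrapper that turns RuneLen's -1 result into
// an explicit ok flag.
func encodedSize(r rune) (n int, ok bool) {
	n = utf8.RuneLen(r)
	return n, n >= 0
}

func main() {
	fmt.Println(encodedSize('A'))              // 1 true
	fmt.Println(encodedSize('世'))              // 3 true
	fmt.Println(encodedSize(0xD800))           // -1 false (surrogate half)
	fmt.Println(encodedSize(utf8.MaxRune + 1)) // -1 false (beyond U+10FFFF)
}
```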

ValidString

The ValidString function returns true if a given string is valid UTF-8 encoded text.

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "Hello, 世界"
    isValid := utf8.ValidString(str)
    fmt.Println(isValid) // Output: true
}

In the above example, we have a valid UTF-8 encoded string. We pass the string to the ValidString function, which returns true if the string is valid UTF-8 encoded text.
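The more interesting case is input that is not valid. The sketch below, under the assumption that replacing bad bytes is acceptable for your use case, combines ValidString with strings.ToValidUTF8 from the standard strings package. The helper name sanitize is made up for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize is a hypothetical helper. It returns s unchanged when s is valid
// UTF-8; otherwise it replaces each run of invalid bytes with U+FFFD, the
// Unicode replacement character.
func sanitize(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	bad := "Hello\xff\xfe, World" // the bytes 0xFF and 0xFE never occur in UTF-8

	fmt.Println(utf8.ValidString(bad))           // false
	fmt.Println(utf8.ValidString(sanitize(bad))) // true
}
```

Whether to replace, drop, or reject invalid input depends on the application; this sketch only shows the replacement option.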

EncodeRune

The EncodeRune function encodes a given rune into a byte slice.

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    r := '世'
    buf := make([]byte, 3)
    utf8.EncodeRune(buf, r)
    fmt.Println(buf) // Output: [228 184 150]
}

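In the above example, we have a Unicode character ‘世’. We create a byte slice with the required length (in this case, three bytes), and pass it along with the rune to the EncodeRune function. The function writes the encoded bytes into the slice and returns the number of bytes written, which this example ignores. If the slice is too short for the encoding, EncodeRune panics, so the buffer must be large enough.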
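When the rune is not known in advance, a buffer of utf8.UTFMax bytes is large enough for any rune, and the return value tells you how much of it was used. The following sketch uses a hypothetical helper named encode to show this pattern.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode is a hypothetical helper that returns the UTF-8 bytes of r. The
// buffer holds utf8.UTFMax bytes, which is enough for any rune.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r) // n is the number of bytes written
	return buf[:n]
}

func main() {
	fmt.Printf("% x\n", encode('A'))  // 41
	fmt.Printf("% x\n", encode('世')) // e4 b8 96
	fmt.Printf("% x\n", encode(-1))   // ef bf bd (an invalid rune is written as U+FFFD)
}
```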

DecodeRune

The DecodeRune function decodes the first UTF-8 encoded rune in a given byte slice and returns the rune and the number of bytes it occupies.

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    buf := []byte{228, 184, 150}
    r, size := utf8.DecodeRune(buf)
    fmt.Printf("%c occupies %d bytes\n", r, size) // Output: 世 occupies 3 bytes
}

In the above example, we have a byte slice that contains a UTF-8 encoded character ‘世’. We pass the byte slice to the DecodeRune function, which decodes the first rune in the slice and returns it along with the number of bytes it occupies.
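DecodeRune is usually called in a loop to walk a whole byte slice. For invalid input it returns utf8.RuneError with a size of 1, so the loop always makes progress. The sketch below assumes you want to keep going past bad bytes and count them; the helper name decodeAll is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll is a hypothetical helper. It decodes every rune in p and counts
// the bytes that did not form a valid UTF-8 encoding.
func decodeAll(p []byte) (runes []rune, invalid int) {
	for len(p) > 0 {
		r, size := utf8.DecodeRune(p)
		if r == utf8.RuneError && size == 1 {
			invalid++ // a byte that does not start a valid encoding
		}
		runes = append(runes, r)
		p = p[size:]
	}
	return runes, invalid
}

func main() {
	runes, invalid := decodeAll([]byte("Go\xff世界"))
	// Prints the text with U+FFFD in place of the bad byte, then the count 1.
	fmt.Println(string(runes), invalid)
}
```

Checking size as well as the rune matters because a correctly encoded U+FFFD in the input also decodes to utf8.RuneError, but with a size of 3.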

Hindi Encoding in Golang UTF8

Hindi is written in the Devanagari script, which is also used for several other languages, including Marathi and Sanskrit. Devanagari code points each take three bytes in UTF-8.

Let’s look at an example of encoding and decoding Hindi text using the Golang UTF-8 package.

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // Hindi text to be encoded and decoded
    hindiText := "नमस्ते"

    // Encode the Hindi text
    encodedText := make([]byte, utf8.UTFMax*len(hindiText))
    encodedCount := utf8.EncodeRune(encodedText, []rune(hindiText)[0])
    for _, r := range []rune(hindiText)[1:] {
        encodedCount += utf8.EncodeRune(encodedText[encodedCount:], r)
    }

    // Print the encoded byte sequence
    fmt.Printf("Encoded Hindi text: %v\n", encodedText[:encodedCount])

    // Decode the encoded byte sequence, looking only at the bytes that were
    // actually written (the buffer itself is larger than the encoded text)
    remaining := encodedText[:encodedCount]
    decodedRunes := make([]rune, utf8.RuneCount(remaining))
    decodedCount := 0
    for len(remaining) > 0 {
        r, size := utf8.DecodeRune(remaining)
        decodedRunes[decodedCount] = r
        decodedCount++
        remaining = remaining[size:]
    }
    decodedText := string(decodedRunes)

    // Print the decoded Hindi text
    fmt.Printf("Decoded Hindi text: %v\n", decodedText)
}

In this example, we define a Hindi string नमस्ते which means “Hello” in English. We then use the EncodeRune function to encode the Hindi string and store the encoded byte sequence in the encodedText byte slice. We then print the encoded byte sequence to the console.

Next, we use the DecodeRune function to decode the byte sequence and store the decoded runes in the decodedRunes slice. We then convert the slice of runes to a string and print the decoded Hindi text to the console.
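For everyday code, the same round trip can be done with Go's built-in conversions, because a Go string already holds UTF-8 bytes. The sketch below is one way to check that; roundTrip is a hypothetical helper, and the counts in the closing note assume the usual spelling of the word as six code points (U+0928 U+092E U+0938 U+094D U+0924 U+0947).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// roundTrip is a hypothetical helper. It converts s to bytes and back and
// reports the byte count and rune count along the way.
func roundTrip(s string) (nBytes, nRunes int, back string) {
	b := []byte(s) // copies the UTF-8 bytes of the string
	return len(b), utf8.RuneCount(b), string(b)
}

func main() {
	text := "नमस्ते"
	nBytes, nRunes, back := roundTrip(text)
	fmt.Println(nBytes, nRunes, back == text)

	// Ranging over a string decodes one rune per iteration; i is the byte offset.
	for i, r := range text {
		fmt.Printf("offset %2d: U+%04X\n", i, r)
	}
}
```

With that spelling, the word occupies 18 bytes for 6 runes, and the byte offsets printed by the loop advance in steps of three.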

Converting Between UTF-8 and Other Encodings

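The UTF8 package itself deals only with UTF-8, so it has no ready-made functions for converting to other encodings. Conversions go through runes (code points), with help from other standard packages such as unicode/utf16. Here are some examples: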

Go UTF8 to ASCII

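The ASCII encoding is a subset of UTF-8, which means that any valid ASCII text is also valid UTF-8 text. However, not all UTF-8 text can be represented in ASCII. The UTF8 package has no function that performs this conversion, but it exports the constant utf8.RuneSelf (0x80): every rune below it is a single ASCII byte. That is enough to write a small conversion helper ourselves.

Here is an example:

package main

import (
    "fmt"
    "unicode/utf8"
)

// toASCII is a helper written for this example; it is not part of the
// unicode/utf8 package. It replaces every non-ASCII rune with '?'.
func toASCII(b []byte) []byte {
    out := make([]byte, 0, len(b))
    for _, r := range string(b) {
        if r < utf8.RuneSelf {
            out = append(out, byte(r))
        } else {
            out = append(out, '?')
        }
    }
    return out
}

func main() {
    // UTF-8 text
    b := []byte("Hello, 世界")

    // Convert to ASCII
    a := toASCII(b)

    fmt.Println(string(a)) // Output: Hello, ??
}

In this example, we have a UTF-8 byte slice that contains both ASCII and non-ASCII characters. We use our toASCII helper to convert the byte slice to ASCII. The output is “Hello, ??”, because the two non-ASCII characters cannot be represented in ASCII and each one is replaced with a question mark.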
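Before converting, it can be useful to know whether the text is pure ASCII already. A minimal sketch, using a hypothetical helper named isASCII, compares each byte against utf8.RuneSelf:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII is a hypothetical helper. It reports whether every byte in b is
// below utf8.RuneSelf (0x80), which means b is plain ASCII.
func isASCII(b []byte) bool {
	for _, c := range b {
		if c >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isASCII([]byte("Hello, World"))) // true
	fmt.Println(isASCII([]byte("Hello, 世界")))   // false
}
```

Working on bytes is sufficient here because every byte of a multi-byte UTF-8 sequence has its high bit set.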

Golang UTF8 to UTF16

The UTF-16 encoding is another variable-length character encoding that can represent any Unicode character using one or two 16-bit code units. The UTF8 package does not convert to UTF-16 by itself. The conversion goes through runes: the unicode/utf16 package handles the UTF-16 side, while the UTF-8 side is handled by Go's built-in conversions or by utf8 functions such as EncodeRune and DecodeRune. Here is an example:

package main

import (
    "fmt"
    "unicode/utf16"
    "unicode/utf8"
)

func main() {
    // UTF-8 text
    b := []byte("नमस्ते, World")

    // Decode UTF-8 to runes
    runes := []rune(string(b))

    // Encode runes to UTF-16
    u16 := utf16.Encode(runes)

    // Convert UTF-16 to byte slice
    b2 := make([]byte, len(u16)*2)
    for i, r := range u16 {
        b2[i*2] = byte(r)
        b2[i*2+1] = byte(r >> 8)
    }

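    // Rebuild the UTF-16 code units from the little-endian byte slice
    u16back := make([]uint16, len(b2)/2)
    for i := range u16back {
        u16back[i] = uint16(b2[i*2]) | uint16(b2[i*2+1])<<8
    }

    // Decode UTF-16 to runes and encode each rune as UTF-8
    out := make([]byte, 0, len(b))
    buf := make([]byte, utf8.UTFMax)
    for _, r := range utf16.Decode(u16back) {
        n := utf8.EncodeRune(buf, r)
        out = append(out, buf[:n]...)
    }

    fmt.Println(string(out)) // Output: नमस्ते, World
}

In this example, we have a UTF-8 byte slice that contains both ASCII and non-ASCII characters. First we convert the byte slice to a string and then to a slice of runes with a []rune conversion, which decodes the UTF-8. Then we use the utf16.Encode function to encode the runes as UTF-16. We create a byte slice with twice the length of the UTF-16 slice, and then use a loop to copy each 16-bit code unit into it in little-endian order. Finally, to get back to UTF-8, we rebuild the code units from the bytes, decode them to runes with utf16.Decode, and write each rune out with utf8.EncodeRune.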
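The "one or two code units" part of UTF-16 shows up only for characters outside the Basic Multilingual Plane, which are stored as a surrogate pair. The sketch below compares the sizes in both encodings; the helper name sizes is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// sizes is a hypothetical helper. It reports how many UTF-8 bytes and how
// many UTF-16 code units the rune r needs.
func sizes(r rune) (utf8Bytes, utf16Units int) {
	return utf8.RuneLen(r), len(utf16.Encode([]rune{r}))
}

func main() {
	fmt.Println(sizes('世'))  // 3 1
	fmt.Println(sizes('😀')) // 4 2

	// The emoji U+1F600 is stored as a surrogate pair in UTF-16.
	fmt.Printf("%X\n", utf16.Encode([]rune{'😀'})) // [D83D DE00]
}
```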

UTF8 to UTF32 in Golang

UTF-32 is a fixed-length character encoding that represents every Unicode code point with one 32-bit code unit. Go's rune type is already a 32-bit code point, so converting between UTF-8 and UTF-32 comes down to converting between bytes and runes; the UTF8 package has no dedicated UTF-32 functions. Here is an example:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // UTF-8 text
    b := []byte("नमस्ते, World")

    // Decode UTF-8 to runes
    runes := []rune(string(b))

    // Encode runes to UTF-32
    u32 := make([]rune, len(runes))
    for i, r := range runes {
        u32[i] = r
    }

    // Convert UTF-32 to byte slice
    b2 := make([]byte, len(u32)*4)
    for i, r := range u32 {
        b2[i*4] = byte(r)
        b2[i*4+1] = byte(r >> 8)
        b2[i*4+2] = byte(r >> 16)
        b2[i*4+3] = byte(r >> 24)
    }

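    // Rebuild the runes from the little-endian byte slice, then convert to UTF-8
    decoded := make([]rune, len(b2)/4)
    for i := range decoded {
        r := rune(b2[i*4]) | rune(b2[i*4+1])<<8 | rune(b2[i*4+2])<<16 | rune(b2[i*4+3])<<24
        if !utf8.ValidRune(r) {
            r = utf8.RuneError // replace anything that is not a valid code point
        }
        decoded[i] = r
    }
    s := string(decoded)

    fmt.Println(s) // Output: नमस्ते, World
}

In this example, we have a UTF-8 byte slice that contains both ASCII and non-ASCII characters. First we convert the byte slice to a string and then to a slice of runes with a []rune conversion, which decodes the UTF-8. We then create a new slice to hold the UTF-32 code units and copy the runes into it; since a rune is already a 32-bit code point, this copy changes nothing. We create a byte slice with four times the length of the UTF-32 slice, and then use a loop to write each code unit into it in little-endian order. Finally, we rebuild the runes from the bytes, check each one with utf8.ValidRune, and convert the rune slice to a string, which encodes it as UTF-8.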
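The manual byte shifting can also be delegated to the standard encoding/binary package. The following is a sketch under the assumption that little-endian byte order is wanted; the helper names toUTF32LE and fromUTF32LE are made up for this illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// toUTF32LE is a hypothetical helper that encodes s as little-endian UTF-32.
func toUTF32LE(s string) []byte {
	runes := []rune(s) // decodes the UTF-8 string into code points
	out := make([]byte, 4*len(runes))
	for i, r := range runes {
		binary.LittleEndian.PutUint32(out[4*i:], uint32(r))
	}
	return out
}

// fromUTF32LE is a hypothetical helper that decodes little-endian UTF-32
// bytes back into a UTF-8 string. Trailing bytes that do not form a full
// 4-byte unit are ignored.
func fromUTF32LE(b []byte) string {
	runes := make([]rune, 0, len(b)/4)
	for ; len(b) >= 4; b = b[4:] {
		runes = append(runes, rune(binary.LittleEndian.Uint32(b)))
	}
	return string(runes) // invalid code points become U+FFFD
}

func main() {
	enc := toUTF32LE("नमस्ते, World")
	fmt.Println(len(enc), fromUTF32LE(enc))
}
```

Switching to big-endian output would only require replacing binary.LittleEndian with binary.BigEndian in both helpers.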
