Storing UTF-8 Encoded Text with Strings

UTF-8 (Unicode Transformation Format - 8-bit) is a variable-width character encoding that can represent every character in the Unicode character set. It is the dominant encoding for the World Wide Web and is designed for backward compatibility with ASCII, meaning ASCII characters use one byte, while others can use up to four bytes.
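The variable width is easy to observe from Rust's standard library. The sketch below is only an illustration: the helper name `utf8_width` is made up for this example, while `char::len_utf8` and `char::encode_utf8` are standard methods.

```rust
/// Hypothetical helper: number of bytes UTF-8 needs to encode one `char`.
fn utf8_width(c: char) -> usize {
    c.len_utf8()
}

fn main() {
    // An ASCII letter, a Latin letter with an accent (U+00E9),
    // a CJK ideograph, and an emoji.
    for c in ['A', '\u{E9}', '你', '👋'] {
        let mut buf = [0u8; 4]; // four bytes is the maximum for any char
        let encoded = c.encode_utf8(&mut buf);
        println!(
            "{:?} (U+{:04X}) needs {} byte(s): {:02X?}",
            c,
            c as u32,
            utf8_width(c),
            encoded.as_bytes()
        );
    }
}
```

The four sample characters need 1, 2, 3, and 4 bytes respectively, which covers the full range of UTF-8 widths.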

In Rust, the standard string types are *guaranteed* to hold valid UTF-8. This is a fundamental design choice that brings strong safety guarantees and simplifies handling international text. Rust has two main string types:

1. `String`: This is a growable, heap-allocated, owned string type. It can be modified when bound with `mut`, and is typically used when you need to store text that might change or grow.
2. `&str`: This is an immutable string slice (a 'view' into a string). It's typically used for string literals, function parameters, or when you need to refer to a portion of a `String` or another `&str` without taking ownership.
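A minimal sketch of how the two types usually work together is shown below. The function name `shout` is invented for this illustration; the point is that a parameter of type `&str` accepts literals, borrowed `String`s, and sub-slices alike.

```rust
/// Hypothetical helper: takes a borrowed slice and returns a new owned String.
fn shout(text: &str) -> String {
    text.to_uppercase()
}

fn main() {
    let literal: &str = "hello"; // string literal, stored in the binary
    let mut owned: String = String::from("hello"); // heap-allocated, growable
    owned.push_str(", world"); // only possible because `owned` is `mut`

    // A view into part of `owned`; the range is in bytes (ASCII here, so safe).
    let view: &str = &owned[0..5];

    println!("{}", shout(literal)); // &str passed directly
    println!("{}", shout(&owned)); // &String coerces to &str
    println!("{}", shout(view)); // a sub-slice works the same way
}
```

Taking `&str` rather than `&String` in function signatures is the common choice because it places no requirement on how the caller stores the text.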

Both `String` and `&str` are internally represented as a sequence of UTF-8 bytes. This strong guarantee has several implications:

* No direct indexing by 'character': Because UTF-8 characters are variable-width (1 to 4 bytes), indexing with an integer, as in `my_string[0]`, does not compile. This prevents a common class of bugs in which an index lands in the middle of a multi-byte character. Slicing by byte range (for example `&my_string[0..3]`) is allowed, but it panics at runtime if either end is not on a character boundary. Rust instead makes you state explicitly whether you are working with bytes or `char`s.
* Byte Length vs. Character Count: The `.len()` method on a `String` or `&str` returns the number of bytes, not the number of human-perceived 'characters'. To count Unicode scalar values (Rust's `char` type, which always occupies 4 bytes in memory regardless of how many bytes its UTF-8 encoding uses), call `.chars().count()`, which walks the whole string and is therefore O(n).
* Iteration: You can iterate over strings in different ways:
    * `.bytes()`: Iterates over the raw UTF-8 bytes that make up the string.
    * `.chars()`: Iterates over Unicode scalar values. This is generally what most people mean by 'character' iteration in Rust.
    * For true 'grapheme cluster' iteration (what a user perceives as a single character, potentially made up of multiple Unicode scalar values, e.g., 'é' written as 'e' followed by a combining accent, or combined emoji sequences), you typically need an external crate such as `unicode-segmentation`.
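The points above can be explored with the standard library alone. The following is a sketch, not a recommended API: `first_chars` is a hypothetical helper written for this example, while `str::get`, `str::is_char_boundary`, and `str::char_indices` are standard methods.

```rust
/// Hypothetical helper: the first `n` chars of `s`, always cut on a char boundary.
fn first_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // `byte_idx` is the byte offset where the (n+1)-th char starts.
        Some((byte_idx, _)) => &s[..byte_idx],
        // Fewer than n+1 chars: the whole string qualifies.
        None => s,
    }
}

fn main() {
    let text = "你好, Rust!";
    println!("bytes: {}, chars: {}", text.len(), text.chars().count()); // bytes: 13, chars: 9

    // `get` is the non-panicking counterpart of range slicing.
    println!("{:?}", text.get(0..1)); // None: byte 1 is inside '你'
    println!("{:?}", text.get(0..3)); // Some("你")
    println!("{}", text.is_char_boundary(1)); // false

    println!("{}", first_chars(text, 2)); // 你好

    // The same perceived character can be one scalar value or two.
    let precomposed = "\u{E9}"; // é as a single code point
    let decomposed = "e\u{301}"; // 'e' plus a combining acute accent
    println!(
        "precomposed chars: {}, decomposed chars: {}",
        precomposed.chars().count(),
        decomposed.chars().count()
    );
}
```

Using `get` or `char_indices` keeps slicing explicit about byte offsets while avoiding the runtime panic that a mis-aligned range would cause.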

This design choice makes Rust's string handling robust for internationalized text: invalid byte sequences are rejected when a string is created, so code that receives a `String` or `&str` never has to re-check the encoding.
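The validity guarantee is enforced at the point where raw bytes become a string. The sketch below assumes a hypothetical helper named `decode`; `String::from_utf8` and `String::from_utf8_lossy` are the standard conversion functions.

```rust
/// Hypothetical helper: converts bytes to a String only if they are valid UTF-8.
fn decode(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

fn main() {
    let valid = vec![0xE4, 0xBD, 0xA0]; // the three-byte encoding of '你'
    let truncated = vec![0xE4, 0xBD]; // same sequence with the last byte missing

    println!("{:?}", decode(valid)); // Some("你")
    println!("{:?}", decode(truncated.clone())); // None: rejected, not silently accepted

    // The lossy variant never fails; invalid input is replaced with U+FFFD.
    let lossy = String::from_utf8_lossy(&truncated);
    println!("contains U+FFFD: {}", lossy.contains('\u{FFFD}'));
}
```

Which conversion to use depends on the application: the strict form suits data that must be correct, while the lossy form suits display of text from untrusted sources.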

Example Code

use unicode_segmentation::UnicodeSegmentation; // For grapheme cluster example

fn main() {
    // Creating String literals and owned Strings
    let hello_english = "Hello, world!"; // &str
    let hello_japanese = String::from("こんにちは"); // String
    let hello_mixed = "你好世界! 👋"; // &str with emoji

    println!("--- Basic String Information ---");
    println!("English: '{}'", hello_english);
    println!("  Byte length: {}", hello_english.len());
    println!("  Char count: {}", hello_english.chars().count());
    println!();

    println!("Japanese: '{}'", hello_japanese);
    println!("  Byte length: {}", hello_japanese.len()); // Each of these hiragana is 3 bytes in UTF-8
    println!("  Char count: {}", hello_japanese.chars().count());
    println!();

    println!("Mixed: '{}'", hello_mixed);
    println!("  Byte length: {}", hello_mixed.len()); // Emoji is 4 bytes
    println!("  Char count: {}", hello_mixed.chars().count());
    println!();

    // --- Iterating over bytes ---
    println!("--- Iterating over Bytes (Japanese) ---");
    print!("Bytes of '{}': ", hello_japanese);
    for b in hello_japanese.bytes() {
        print!("{:X} ", b);
    }
    println!("\n");

    // --- Iterating over chars (Unicode Scalar Values) ---
    println!("--- Iterating over Chars (Mixed) ---");
    print!("Chars of '{}': ", hello_mixed);
    for c in hello_mixed.chars() {
        print!("'{}(U+{:04X})' ", c, c as u32); // Code points are conventionally shown with at least 4 hex digits
    }
    println!("\n");

    // --- String Concatenation and Modification ---
    let mut message = String::from("Rust");
    message.push_str(" is awesome!"); // Appending a &str
    message.push('🚀'); // Appending a single char
    println!("Modified String: '{}'", message);
    println!("  Byte length: {}", message.len());
    println!("  Char count: {}", message.chars().count());
    println!();

    // --- Slicing (Note: Must slice at char boundaries) ---
    // This is safe because Rust will panic if you try to slice in the middle of a multi-byte char.
    let slice = &hello_mixed[0..6]; // "你好" (2 chars, 3 bytes each = 6 bytes)
    println!("Slice of '{}' [0..6]: '{}'", hello_mixed, slice);
    // let invalid_slice = &hello_mixed[0..1]; // This would panic at runtime because it's not a char boundary
    // println!("Invalid Slice: {}", invalid_slice);

    // --- Grapheme Clusters (Requires 'unicode-segmentation' crate) ---
    // Add `unicode-segmentation = "1.7.1"` to your Cargo.toml for this example.
    let complex_text = "नमस्ते🙏"; // 'नमस्ते' (Namaste) is 6 scalar values; its grapheme count depends on the Unicode version the crate implements. 🙏 is one grapheme.
    println!("--- Grapheme Cluster Example ---");
    println!("Complex Text: '{}'", complex_text);
    println!("  Byte length: {}", complex_text.len());
    println!("  Char count (scalar values): {}", complex_text.chars().count());
    println!("  Grapheme count: {}", complex_text.graphemes(true).count()); // true for extended grapheme clusters
    print!("Graphemes of '{}': ", complex_text);
    for g in complex_text.graphemes(true) {
        print!("'{}({})' ", g, g.len());
    }
    println!("\n");
}
