Regular Expressions (Regex)

Regular Expressions, commonly known as Regex, are sequences of characters that define a search pattern. They are used for pattern matching, searching, and manipulating text. Regex patterns form a mini-language in their own right, supported (with some differences in dialect) by most programming languages and many text editors.

Purpose and Applications:
* Validation: Ensuring user input (e.g., email addresses, phone numbers, passwords) conforms to specific formats.
* Searching: Finding specific patterns of text within larger bodies of text (e.g., all URLs, dates, or specific words).
* Extraction: Pulling out specific pieces of information from unstructured or semi-structured text.
* Replacement: Modifying text by replacing patterns with other text.
* Parsing: Breaking down text into meaningful components.

Key Regex Syntax Elements (Commonly Used):
* Literals: Most characters match themselves (e.g., `a` matches 'a', `hello` matches 'hello').
* Metacharacters: Special characters with specific meanings:
  * `.`: Matches any single character (except newline, by default).
  * `*`: Matches zero or more occurrences of the preceding character or group.
  * `+`: Matches one or more occurrences of the preceding character or group.
  * `?`: Matches zero or one occurrence of the preceding character or group (makes it optional). Placed after another quantifier (e.g., `*?`, `+?`), it makes that quantifier non-greedy, so it matches as little as possible.
  * `[]`: Character set. Matches any one character within the brackets (e.g., `[aeiou]` matches any vowel). Ranges can be specified (e.g., `[a-z]` for lowercase letters, `[0-9]` for digits).
  * `[^...]`: Negated character set. Matches any one character *not* within the brackets.
  * `()`: Grouping and capturing. Treats multiple characters as a single unit and captures the matched text for later use (e.g., backreferences or replacement).
  * `|`: Alternation (OR). Matches either the pattern before or the pattern after it (e.g., `cat|dog` matches 'cat' or 'dog').
  * `^`: Anchor. Matches the beginning of the string, or the beginning of each line in multi-line mode.
  * `$`: Anchor. Matches the end of the string, or the end of each line in multi-line mode.
  * `\`: Escape character. Used to treat a metacharacter as a literal character (e.g., `\.` matches a literal dot, `\$` matches a literal dollar sign). Also used to introduce special sequences.
  * `{n}`: Quantifier. Matches exactly `n` occurrences of the preceding element.
  * `{n,}`: Quantifier. Matches `n` or more occurrences.
  * `{n,m}`: Quantifier. Matches between `n` and `m` occurrences (inclusive).
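* Special Sequences (Character Classes and Assertions):
  * `\d`: Matches any digit. For ASCII text this is equivalent to `[0-9]`; in Unicode-aware engines (including Rust's `regex` crate by default) it also matches decimal digits from other scripts.
  * `\D`: Matches any non-digit character (the complement of `\d`).
  * `\w`: Matches any word character (alphanumeric + underscore). For ASCII text this is equivalent to `[a-zA-Z0-9_]`; Unicode-aware engines also match letters and digits from other scripts.
  * `\W`: Matches any non-word character.
  * `\s`: Matches any whitespace character (space, tab, newline, etc.).
  * `\S`: Matches any non-whitespace character.
  * `\b`: Word boundary. Matches the position between a word character and a non-word character, or between a word character and the start or end of the text. It consumes no characters.
  * `\B`: Non-word boundary. Matches any position where `\b` does not.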
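
To make the meaning of a few metacharacters concrete, the following is a toy backtracking matcher written with only the standard library. It is a sketch for illustration, not how the `regex` crate works internally (that crate is built on finite automata). It supports only literals, `.`, `*`, and the anchors `^` and `$`; the function names `toy_match`, `match_here`, and `match_star` are made up for this example.

```rust
/// Returns true if `pattern` matches anywhere in `text`.
/// Supported syntax: literal characters, `.`, `c*`, a leading `^`, a trailing `$`.
fn toy_match(pattern: &str, text: &str) -> bool {
    let pat: Vec<char> = pattern.chars().collect();
    let txt: Vec<char> = text.chars().collect();
    // A leading `^` anchors the match to the start of the text.
    if pat.first() == Some(&'^') {
        return match_here(&pat[1..], &txt);
    }
    // Otherwise try every start position, including the empty suffix at the end.
    (0..=txt.len()).any(|start| match_here(&pat, &txt[start..]))
}

/// Returns true if `pat` matches at the very beginning of `txt`.
fn match_here(pat: &[char], txt: &[char]) -> bool {
    match pat {
        // Nothing left to match: success.
        [] => true,
        // `c*`: zero or more occurrences of `c`, followed by the rest of the pattern.
        [c, '*', rest @ ..] => match_star(*c, rest, txt),
        // A trailing `$` matches only if the text is exhausted.
        ['$'] => txt.is_empty(),
        // `.` matches any one character; any other character matches itself.
        [c, rest @ ..] => match txt.split_first() {
            Some((t, tail)) if *c == '.' || c == t => match_here(rest, tail),
            _ => false,
        },
    }
}

/// Handles `c*`: tries the rest of the pattern after zero, one, two, ... copies of `c`.
/// For a yes/no answer the order in which repetition counts are tried does not matter.
fn match_star(c: char, rest: &[char], txt: &[char]) -> bool {
    let mut remaining = txt;
    loop {
        if match_here(rest, remaining) {
            return true;
        }
        match remaining.split_first() {
            Some((t, tail)) if c == '.' || c == *t => remaining = tail,
            _ => return false,
        }
    }
}

fn main() {
    println!("{}", toy_match("^ab*c$", "abbbc")); // true
    println!("{}", toy_match("^ab*c$", "abbbcd")); // false: `$` requires the text to end after `c`
    println!("{}", toy_match("h.llo", "say hello")); // true: unanchored, `.` matches 'e'
}
```

Real engines add character sets, groups, alternation, and counted quantifiers on top of the same ideas, and use more efficient matching strategies.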

In Rust:
Rust provides regex support through the widely used `regex` crate. The crate guarantees search time linear in the length of the input; in exchange, it does not support look-around or backreferences. It offers functions for compiling patterns, testing for a match, finding all occurrences, capturing groups, and replacing text.

Example Code

```rust
// To use the regex crate, add it to your Cargo.toml:
// [dependencies]
// regex = "1"

use regex::Regex;

fn main() {
    // 1. Creating a Regex object
    // The `Regex::new` function compiles a regex from a string.
    // It returns a Result, so we use `unwrap()` for simplicity in this example.
    // The `r"..."` syntax creates a raw string literal, which is useful for regex patterns
    // because it reduces the need to double-escape backslashes for Rust's string literal rules.
    let re = Regex::new(r"\b\w{4}\b").unwrap(); // Matches runs of exactly 4 word characters between word boundaries

    let text = "The quick brown fox jumps over the lazy dog.";

    println!("Original text: \"{}\"", text);

    // 2. Checking for a match
    // `is_match` returns true if the regex matches anywhere in the text.
    if re.is_match(text) {
        println!("\nPattern '\\b\\w{{4}}\\b' (4-letter words) found in the text.");
    }

    // 3. Finding all occurrences
    // `find_iter` returns an iterator over all non-overlapping matches.
    println!("\nAll 4-letter words found:");
    for mat in re.find_iter(text) {
        println!("- {}", mat.as_str());
    }

    // 4. Using capturing groups
    // This regex matches a date in YYYY-MM-DD format and captures year, month, day.
    // Parentheses `()` create capturing groups.
    let date_re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
    let date_text = "Today's date is 2023-10-27 and tomorrow will be 2023-10-28.";

    println!("\nText with dates: \"{}\"", date_text);

    // `captures` finds the first match and returns an Option<Captures>.
    if let Some(captures) = date_re.captures(date_text) {
        println!("\nFirst date found and captured groups:");
        // captures.get(0) is the entire matched string.
        // captures.get(1) is the first capturing group (year).
        // captures.get(2) is the second capturing group (month), etc.
        println!("  Full match: {}", captures.get(0).map_or("", |m| m.as_str()));
        println!("  Year:       {}", captures.get(1).map_or("", |m| m.as_str()));
        println!("  Month:      {}", captures.get(2).map_or("", |m| m.as_str()));
        println!("  Day:        {}", captures.get(3).map_or("", |m| m.as_str()));
    }

    // `captures_iter` returns an iterator over all matches, each containing its captures.
    println!("\nAll dates and their components:");
    for caps in date_re.captures_iter(date_text) {
        println!("  Match: {}", caps.get(0).map_or("", |m| m.as_str()));
        println!("    Year: {}", caps.get(1).map_or("", |m| m.as_str()));
        println!("    Month: {}", caps.get(2).map_or("", |m| m.as_str()));
        println!("    Day: {}", caps.get(3).map_or("", |m| m.as_str()));
    }

    // 5. Replacing text
    let replace_re = Regex::new(r"fox|dog").unwrap(); // Matches 'fox' or 'dog'
    let replaced_text = replace_re.replace_all(text, "CAT");
    println!("\nOriginal text for replacement: \"{}\"", text);
    println!("Text after replacing 'fox' or 'dog' with 'CAT': \"{}\"", replaced_text);

    // Replacing using captured groups
    let email_re = Regex::new(r"(\w+)@(\w+\.\w+)").unwrap(); // Matches simple addresses of the form user@domain.tld
    let email_text = "Contact support@example.com or info@domain.org.";
    // In the replacement string, '$1', '$2', etc., refer to captured groups.
    // Here, we mask the email: username@...domain
    let masked_email_text = email_re.replace_all(email_text, "$1@...$2");
    println!("\nOriginal email text: \"{}\"", email_text);
    println!("Masked email text ($1@...$2): \"{}\"", masked_email_text);
}
```
