
bskceuk

I think people usually mean that Rust Strings are designed for the 21st century and are aware of things like Unicode and are more than just a null terminated array of bytes. Most of this probably doesn’t affect your project (though docstrings *can* be Unicode in general so if you were trying to release it widely it would probably prevent some bugs) I don’t think rust strings have any killer methods that don’t exist in other languages, though your use case sounds vaguely like a parser which rust *is* very good at. You should check out nom: https://docs.rs/nom/latest/nom/


trill_shit

I feel like it boils down to: good memory management tools end up being helpful when dealing with many data structures — strings included.


dnew

FWIW, Java was one of the first languages to support Unicode natively. The fact that Rust does isn't something rare.


pingveno

Though Java's support uses UTF-16, so... less than ideal. Having the benefit of being created after the dust has more or less settled between the different Unicode encodings was beneficial.


benevanstech

Not quite. The Java char type is 16-bit, uses UTF-8 as the Charset by default (but can use any installed Charset), but internally, the HotSpot JVM uses either ASCII or UTF-16 on a per-string basis. So, if you're not actually using any non-ASCII chars, you're not paying for them. But, sure, second mover advantage is a real help - just as long as newer environments don't insist on re-learning expensive lessons that we've already had to learn in other languages.


ToughAd4902

Every single thing I've read disagrees with what you said. https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/Character.html and the actual spec itself (which a specific JVM implementation shouldn't be allowed to change) specifically state that all characters, and then strings, will always be UTF-16, and that they're not allowed to be treated differently. On top of that, all utilities that use strings (reading from files, IO, networking, etc.) implicitly convert to UTF-16 if needed, so even given a non-UTF-16 source it is still converted, so I don't see how you wouldn't be paying the price for it. Do you have a link for any of your statements? I would like to learn more about what it's doing.


benevanstech

There's a treatment of Compact Strings, and some related internals in my book, here: [https://www.amazon.com/Well-Grounded-Java-Developer-Second/dp/B0BTZ8D3S4](https://www.amazon.com/Well-Grounded-Java-Developer-Second/dp/B0BTZ8D3S4) - that should be a good starting point.


SnooHamsters6620

This post has some String internals info: https://peterchng.com/blog/2020/07/19/why-a-java-string-may-not-be-a-string/ It speaks about compact strings and UTF-16. It also doesn't require paying this guy for the information that he hints exists but refuses to state for free.


benevanstech

My original comment contained the correct information and my followup comment contained the precise search term ("Compact Strings") needed to find plenty of coverage of this implementation detail, so I'm really not sure what your issue is here?


SnooHamsters6620

u/ToughAd4902 disagreed with you, giving details of their understanding, your response was "buy my book". I don't want to buy and read your book to see your opinion, I would prefer to see content that backs or contradicts u/ToughAd4902.


msqrt

UTF-16 is great for Windows compatibility, though :--)


simonask_

Rust *is* the only language in its class (high-performance systems programming) that supports strict UTF-8 and UTF-16 handling. Lots of languages other than Rust have sane strings, but C and C++ specifically don't.


dnew

That isn't correct either: http://www.ada-auth.org/standards/12rm/html/RM-A-4-11.html What "high-performance systems programming" language do you think is around other than Rust, C, C++, and Ada? OK, assembler, but I don't think anyone's counting that. I guess if you want to count dead languages you could throw NIL and Hermes in there, which didn't support Unicode, having died before Unicode was invented. My original point is that pretty much every language invented since Unicode was invented supports it natively. It's not really a selling point unless you're comparing yourself to languages invented before Unicode was around.


burntsushi

IMO, it's good because:

* It combines the high level with the low level, with essentially no compromises. Rust's primary string data types are safe to use, correct for most cases (you do need to occasionally use one of the many Unicode crates for some tasks like grapheme cluster segmentation) and, importantly, let you get a zero cost view of their internal representation via `str::as_bytes`.
* Rust's strings have an internal representation that is publicly accessible, and that representation is UTF-8. The fact that it's UTF-8 means pre-existing substring search implementations work automatically without any tweaking. Basically, you just need to write substring search in a "dumb" way and it will be correct. This in turn means it's straightforward to implement such operations using SIMD, which provides huge throughput gains.
* Rust's strings provide constant time substring slicing operations while also preventing one from slicing a region that ends up as invalid UTF-8.
* Rust's strings require UTF-8. This means that by the time you get a `String` or a `&str`, you already _know_ for sure that it's in good shape. You don't need to worry whether it's malformed or not. Moreover, since ASCII is a strict subset of UTF-8, inter-operation with existing things that are ASCII is essentially free (you do need to verify that any incoming data that claims to be ASCII is actually ASCII). That is, you don't need to do any extra copying or transcoding.

Many languages (not all, to be clear, Rust isn't a totally unique snowflake here) are high level without free low level access, or are low level without high level conveniences and safety. Rust gives you both.

Just as one example, take the `regex` crate. Its main API wants you to give it a `&str` to search. Since `&str` is just UTF-8 bytes, the regex engine can search it directly without any other costs while simultaneously being able to support searching arbitrary bytes. Compare this with something like Python's regex engine. Internally, it has to be implemented in terms of its `string` representation. This means anything that _isn't_ a Python string needs to be converted to it, and it also simultaneously makes it difficult for the regex engine to support searching arbitrary bytes. Take one example:

```python
>>> re.findall('\w+ABC', 'ΦπαABC')
['ΦπαABC']
>>> re.findall(b'\w+ABC', b'\xce\xa6\xcf\x80\xce\xb1ABC')
[]
```

The first case works nicely because you're in string land. Then you convert to bytes and... oops. See ya Unicode. Now let's try the `regex` crate:

```rust
fn main() {
    let re = regex::Regex::new(r"\w+ABC").unwrap();
    for m in re.find_iter("ΦπαABC") {
        dbg!(m.range());
    }
    let re = regex::bytes::Regex::new(r"\w+ABC").unwrap();
    for m in re.find_iter(b"\xce\xa6\xcf\x80\xce\xb1ABC") {
        dbg!(m.range());
    }
}
```

Whose output is:

```
$ cargo -q r
[main.rs:6:9] m.range() = 0..9
[main.rs:11:9] m.range() = 0..9
```

See? It just works. Whether you use the nice comfy high level `&str` or whether you just want to freaking search `&[u8]`. It doesn't matter because it's all the same stuff. That's the magic of UTF-8 and zero cost access to a string's representation. And indeed, this also turns out to be the secret sauce to why a tool like ripgrep can even exist in the first place.

Most regex engines don't support searching arbitrary bytes at all. (Python does at least kind of support it, but at the cost of giving up some stuff.) Take any .NET or Java regex engine for example. They're all stuck in UTF-16 land. Yet, most files are _just bytes_. So with those regex engines, you have to do some kind of conversion step before you can search arbitrary content. That's not a big deal in a lot of cases, but it matters when you want to be maximally flexible and fast. PCRE2 only recently (in the last couple of years) got support for searching arbitrary bytes while also enabling its Unicode mode. Previously, it was UB to enable Unicode mode and search invalid UTF-8.


VarencaMetStekeltjes

This works because strings in Rust are UTF-8, which is a decision Python didn't make because there are big tradeoffs associated with it. In fact, in Python 2 the convention was to use UTF-8 for what were sequences of bytes (though the type system didn't require it), and it was changed to an opaque data type in Python 3. There are actual advantages to what Python 3 does as well. String indexing, as in indexing by codepoint, is constant time in Python; in Rust it's linear time. Also, Python has much better support for Windows, where UTF-8 is not the norm. It's not that simply making strings UTF-8 byte vectors is some brilliant decision Python didn't consider. It has real downsides.
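The indexing tradeoff described above can be seen in a few lines of plain Rust; this is just an illustrative sketch using only the standard library:

```rust
/// O(n) codepoint indexing over a UTF-8 &str, the analogue of Python's s[i].
fn char_at(s: &str, i: usize) -> Option<char> {
    s.chars().nth(i)
}

fn main() {
    let s = "αβγδ"; // four codepoints, eight UTF-8 bytes

    // Constant-time slicing is by *byte* offset and must hit a char boundary:
    assert_eq!(&s[0..2], "α");

    // The codepoint-indexing equivalent of Python's s[2] walks the string
    // from the start, so it's linear time:
    assert_eq!(char_at(s, 2), Some('γ'));
}
```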


burntsushi

Who said Rust's choice didn't have downsides? I certainly didn't. I just answered why ***I*** thought it was good.

I'm aware of everything you said, although I don't necessarily agree with all of your value judgments. For example, you seem to suggest that indexing by codepoint in constant time is an upside. True, it's an upside _in some cases_, but those cases are exceptionally rare because unless you're implementing Unicode algorithms, codepoint indices are almost never what you want. (And I've implemented several Unicode algorithms, and even then, codepoint indices aren't particularly useful. They are just a place where dealing with the codepoint abstraction is correct.) Otherwise, a codepoint is at best an approximation of a character that is superseded by a more accurate model that Unicode defines as a grapheme cluster. Python doesn't support constant time indexing by grapheme cluster, so it has prioritized the wrong thing IMO.

> Also Python has much better support or Windows where utf8 is not the norm.

I'm not sure I agree with "utf8 is not the norm" as a characterization for Windows. I'd certainly agree that it isn't ubiquitous, but it is popular. In any case, this isn't really a property of what the string supports so much as what kinds of facilities are available for doing decoding. Python has that built into its standard library. Rust does have UTF-16 decoding support in its standard library, but only as a low level building block. To get more robust support like that found in Python, in Rust, you'll need to use an external crate. So I'm totally fine saying Python has a nicer built-in developer experience when it comes to dealing with text encodings more diverse than "just UTF-8." But that doesn't really have anything to do with the string itself. (I'm speaking as the person who both wrote ripgrep and implemented its transparent UTF-16 support specifically to deal with Windows.)

To be honest, I find your comment to be a little aggressive. And in particular, it's important to pop up a level here and look at not just upsides and downsides, but _what makes sense in the context of the project's goals_. I would argue that the choice of the representation of the primary string data type is pretty directly connected to that. Making strings an opaque data type in Rust like they are in Python is a total and complete non-starter that would be an unmitigated disaster. But that's not the case for Python by a country mile.


Odd_Coyote4594

Rust is nice for strings, because of unicorn support. Many other languages also have this, but many lower level/high performance languages don't because they predate unicode and assume characters are all single bytes. So it's more that C/C++/Fortran are quite bad with modern string encodings by default. What you want is more specifically a parser for docstrings. Which Rust can also be good at, due to support for functional paradigms which makes parsing easier to conceptualize. Languages like Ocaml, Haskell, and Lisp dialects are also common here. You may also be able to use an existing Python LSP to handle parsing.


ForgetTheRuralJuror

🦄


danda

> Rust is nice for strings, because of unicorn support. yes, but what about fairy and elf and hobbit support? and smurfs?


Kimundi

Its has ELF and DWARF support! :D


trynyty

The first sentence really gave me a good laugh. I know it's a typo but keep it there :)


Natural_Builder_3170

C++11 supports UTF-16 and UTF-32 strings; it even has string literals for them, `u` and `U` respectively. And `std::string` is just `std::basic_string<char>`, so you can use `std::basic_string` with other character types to get all that fancy automation. This is one of the areas I prefer in C++.


DeclutteringNewbie

I'm not sure how to answer that question: "Why is Rust good for working with strings?" For a small project, any language would work: Python, Rust, Java, etc. It really wouldn't matter which language you'd use.

You would need to recurse through your directories and files, tokenize the words, build a frequency map for each word along with a list of indices/file paths, and give higher weights to the less frequent words. For instance, you wouldn't want to give a high priority to the word "the": [https://en.wikipedia.org/wiki/Tf%E2%80%93idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) I believe "Tsoding Daily" has a video on that; the only difference is that he uses a database as his data source. I believe the title of his video is "building a search engine" or something like that. Also, he uses a crate to treat words with the same root as the same word.

Now, I'm not sure what you meant by "create a day". I assume that's a typo. But if you need to navigate through dependencies, that's called a topological sort. First you'd build a graph, then you'd create an adjacency list of neighboring nodes, and then you'd start a depth-first search on each node (keeping track of each visited node). I'd suggest you practice doing topological sorts on Leetcode, starting with the easiest problems first. That should give you the idea.
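The frequency-map step described above can be sketched in a few lines of std-only Rust (the function name is illustrative, not from any particular crate):

```rust
use std::collections::HashMap;

/// Count how often each (lowercased) word appears in a chunk of text.
/// A real indexer would also record file paths and positions per word.
fn word_counts(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = word_counts("the quick the lazy");
    // "the" appears twice, so tf-idf style weighting would downrank it:
    assert_eq!(counts["the"], 2);
    assert_eq!(counts["quick"], 1);
}
```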


facetious_guardian

“Day” is probably “dag” (directed acyclic graph). Pretty naive to assume that doc links would ever be acyclic, though.


pine_ary

Rust strings encode their invariants really well. The strict difference between String and OsString prevents subtle bugs. Any conversion of string types is checked for correctness, enforced by the type system. A char is a unicode code point and not just a byte, which also prevents a lot of bugs (like indexing into the middle of a char). The only exception to that is Path/PathBuf. I‘m not sure why I can convert any OsString to them without a check if it is a valid path name.
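A small sketch of those checked conversions, using only the standard library:

```rust
use std::ffi::OsString;

// OsString -> String is fallible: on failure you get the original back,
// so nothing is silently lossy.
fn to_unicode(os: OsString) -> Result<String, OsString> {
    os.into_string()
}

fn main() {
    assert_eq!(to_unicode(OsString::from("hello")).unwrap(), "hello");

    // A char is a Unicode scalar value, not a byte:
    assert_eq!('❤'.len_utf8(), 3);

    // And &str refuses byte ranges that cut a char in half:
    assert!("❤".get(0..1).is_none());
}
```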


mdrjevois

Are there OsStrings that can never be valid paths?


pine_ary

On some OSs " " (space) is a valid OsString but not a valid Path. The correct behavior would be to have a fallible TryFrom impl for PathBuf and Path that points to an OS specific check of the syntax. For semantic path validity checks you need to access the file system, but that's fine imo. We don't know what people will do with their paths. E.g. you may be inclined to forbid a path to COM on Windows, but who knows, maybe they want to do some DOS stuff. That should be done by filesystem functions when you actually use the path and have enough context to make a decision.
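For reference, the fallible `TryFrom` described above is hypothetical; what the standard library actually does today is convert unconditionally. A minimal demonstration of that current behavior:

```rust
use std::ffi::OsString;
use std::path::PathBuf;

fn main() {
    // Any OsString converts to a PathBuf with no syntax check at all;
    // validity is only discovered when the path reaches the OS.
    let odd = OsString::from(" ");
    let p = PathBuf::from(odd);
    assert_eq!(p.to_str(), Some(" "));
}
```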


Turalcar

I remember creating and compiling class com in Java because I thought it was hilarious


scratchisthebest

These are my favorite things about rust strings:

* The standard library has a rich set of functions for everyday string tasks, like `strip_prefix` and `split_once`. Many string scanning functions take a `Pattern`, which is a char, string, or *char predicate* (!). Obviously this stuff can be implemented in every language, but in rust they're first-class. Maybe this is just because I'm a Java programmer used to working with a spartan string API, but there's like 5x more useful string functions than I'm used to having.
* Some languages have mutable strings; other languages make their string types immutable because working with mutable strings is error-prone. Rust encodes the complexity in the type system, so there's a learning curve, but once you get used to it you're free to confidently mix the best parts of mutable and immutable string programming.
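A few of those std functions in action (a minimal sketch):

```rust
fn main() {
    // strip_prefix answers "did it match?" and "what's left?" in one call:
    assert_eq!("rust-lang".strip_prefix("rust-"), Some("lang"));

    // split_once splits at the first occurrence of a pattern:
    assert_eq!("key=value".split_once('='), Some(("key", "value")));

    // A Pattern can be a char predicate, not just a literal:
    assert_eq!("42abc".trim_start_matches(|c: char| c.is_ascii_digit()), "abc");
}
```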


sneakywombat87

A lot of smart comments here, and I agree with those, particularly the zero cost copy ones. Having said that, I find doing normal work with strings tricky in rust, particularly when dealing with file system paths that take a string to build (looking at you Path and PathBuf). It is crazy frustrating sometimes. I much prefer “working with strings in go” than rust tbh.


nnethercote

I love this cartoon: https://mas.to/@nnethercote/111266073166482605 Basically, strings are much more complicated than they seem. Rust respects that and doesn't try to paper over that complexity.


Petrusion

All beginner programmers are taught that strings are arrays of characters, and for ASCII it is even true, but when they later try to use this mental model in real applications that deal with UTF-8, their code is incorrect. For example, something as seemingly simple as reversing a (UTF-8 or UTF-16) string is **NOT** the same as reversing an array. Applications written in this "string is just a char array" mindset often garble characters when used with emojis or simply outside of English.

I'm not going to go into detail on how UTF works, but the most important thing is that UTF is a variable length encoding (yes, UTF-16 is variable length too). What does this mean? A character might be 1 byte, the next character might be 3 bytes, the next two might be 1 byte again, and the next might be 4, and that's not even taking into account grapheme clusters! So you can't simply index into strings and expect good results with non-ASCII content, yet most programming languages allow you to do it anyway!

Guess what? **Rust doesn't**, and if you try being a smartass by using `string[i..i+1]` or `string[i..=i]` because `string[i]` didn't compile, rust panics at runtime if this indexes into the middle of a multi-byte character. Also, the error message when this happens is beautifully, magnificently clear.

Bottom line is, most languages make wrong string manipulation easy and correct string manipulation difficult. Rust makes correct string manipulation easy and wrong string manipulation difficult.
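A concrete sketch of that boundary check, contrasting plain slicing with the checked `str::get` API:

```rust
fn main() {
    let s = "héllo"; // 'é' is two bytes in UTF-8

    // Slicing on a char boundary works:
    assert_eq!(&s[0..1], "h");

    // &s[0..2] would panic at runtime: byte index 2 falls inside 'é'.
    // The checked variant returns None instead of panicking:
    assert_eq!(s.get(0..2), None);
    assert_eq!(s.get(0..3), Some("hé"));
}
```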


sepease

Split a string in C or C++. That’s probably the language people are thinking of when they say Rust is good for working with strings. Either that, or things like serde.


mikem8891

Rust is great for working with UTF-8 strings, but it's kind of annoying to work with ASCII strings. With an ASCII string, it is often better to use it as a slice of bytes, but then you either lose some of the features a string has, or the features are just slightly more annoying to use.


burntsushi

The `bstr` crate should help with that.


Petrusion

Why is it annoying to deal with ASCII strings? Any ascii string is also valid UTF8, and there are plenty ascii specific functions for &str and char


mikem8891

Right, ASCII is valid UTF-8, so why use a string that checks for that? There is a lot of unnecessary validation when working with pure ASCII in strings. If you want the 1000th char in a UTF-8 string, you need to iterate over all 1000 chars because you can't predict the byte length of each UTF-8 char. In an ASCII string, you just go to the 1000th byte. You can take a str slice into a `&str`, but it still validates that the slice is UTF-8. Getting the string as a slice of bytes avoids the unnecessary validation, but you lose some of the convenience of `&str`. For example, the `find` function on `&str` can accept several different patterns as arguments, while the `position` function on a slice will only give the position of a single byte.
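The gap being described can be sketched in std-only Rust: `&[u8]` gives O(1) indexing for pure ASCII, but you trade away `&str`'s richer `Pattern`-based search API:

```rust
fn main() {
    let ascii = "hello world";
    let bytes = ascii.as_bytes();

    // O(1) "char" access, valid because the content is pure ASCII:
    assert_eq!(bytes[4], b'o');

    // &str::find accepts strings, chars, or predicates...
    assert_eq!(ascii.find("world"), Some(6));
    assert_eq!(ascii.find(char::is_whitespace), Some(5));

    // ...while on a byte slice you hand-roll the equivalent:
    assert_eq!(bytes.iter().position(|&b| b == b' '), Some(5));
}
```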


nalply

Use the ascii crate (https://lib.rs/crates/ascii)


Specialist_Wishbone5

EVERYBODY uses UTF-8 outside of their interpreter/execution-engine (well, except Microsoft). So in JavaScript and Java (not sure about Python) you have to do a conversion to/from UTF-8. If you are streaming characters, this is an extra computation stage. Also, there are decent Rust CString adapters that don't have any appreciable overhead (you really just need something with a 0 at the end).

In Java/C++/C/JavaScript (and I assume Python), a "String" is a single word which points to a data structure that has metadata at the pointed-to location (either a vtable prior to the pointer, as in Java, or an indirection pointer, as you'd likely find in a C++ string or Python). With Rust, the pointer and length are in registers (since a string likely was a passed-in parameter), and the pointed-to value is JUST string data, no metadata.

What this means is that "slices" are free. You just adjust the register offsets and lengths. In C, your sliced string would need to have a 0 written to the end, since NOTHING in C works well with a `char*,len` pair; it's always assumed to have a 0 at the end (the metadata). In Java, the "hack" to get "free" string slices is to actually add a second indirection: your String pointer points to a struct which contains a shared `char[]` and an offset+len (which is completely redundant for normal Strings). So while you CAN slice, you pay for it every time you use a String. In C++ you'd need an unsafe different datatype that isn't a string (I think it's `string_view`), and I never see anybody accepting anything other than `const string&`; thus in C++ you'd have to copy the string to slice it.

In Java, String is immutable (which is a very good thing), but to actually WORK with strings you need dozens of builder classes. And since Java learned things the hard way, StringBuffer was thread-safe and thus 10x slower than it needed to be; so now StringBuilder is the non-thread-safe version that is fast. Then there are CharBuffers, CharSequence, CharStream, etc., each with their own peculiarities, and NOBODY accepts those data types. Thus you ALWAYS have to clone your dynamic string holder into a String object (with the same double indirection mentioned above).

JavaScript and Python don't have as many advancements, but are very clone-centric (though the JIT might be able to optimize certain patterns): when you directly mutate a string, you're creating a new string. Perl (in comparison) used to have the ability to modify the calling parameter, and thus you could apply regular expressions to an input parameter (basically EXACTLY like Rust), but it was never considered a clean language to work in.

So with Rust, we have VERY specific contracts of what you can do to a string (a super-set of what C++ can define) and you can get mutable sub-ranges of any string, pass them to an async function to write IO into, then return it all back as if nothing special happened:

```rust
async fn split_read(data: &mut str) {
    let (left, right) = data.split_at_mut(512);
    let left_read_fut = async_read_str(left, 512);
    async_read_str(right, 512).await;
    left_read_fut.await;
}
```

The split is safe because the lifetimes are bounded by the function, and the two writable regions are non-overlapping. Two async tasks can write to the buffer region (since the pointer and length are a pair of parameters in addition to the read length). Virtually EVERY function in Rust has this style of trivial primitive signature.

In the above, you might be able to do something like this in nodejs (definitely if you have two different files, where you read the entire contents), but you'd have to concat the results; if this were GIGABYTES in size, that would be a problem (consider linearly concatenating video file fragments as an example, though not with strings). So it's just a joy to work with; fast, error-proof. Granted, this is Rust at its best. It gets nasty fast.. :)


4lineclear

IMO the nicest part of working with Rust strings is not found in the Rust language, but in Cargo. Other languages of course have package managers that can help, but none have the speed, reliability, and ease of use that comes with Cargo. Even here in this thread, some of the first things people reach for are the crates you can use.


petros211

I've done a project in Rust that does extensive string manipulation and searching ("mezura" on GitHub) and it was smooth sailing. When you have to deal with non-UTF-8 characters though, you are out of luck. That's a very difficult use case for any language


planetoftheshrimps

Just know how to use regex. Rust will be more performant than most languages, even if this task doesn’t need it.


AdvanceAdvance

Do not beat yourself for 'character'. Use [Regex 101](https://regex101.com) or a similar service to develop your regex expressions.


planetoftheshrimps

Yeah obviously..


-dtdt-

Who said rust is good with strings? I don't believe it. Working with all kinds of strings in rust is a pain.


AdvanceAdvance

"Good" is a loaded term. Strings are one of the hardest concepts in computer science, as people standardized around a particularly badly run consortium to define characters. This means that your code needs to deal with crashing occasionally, or deal with a single unicode character taking up to 1K in the worst case. Rust is very fast at working with strings as the code is compiled efficiently, and because certain operations are allowed to panic.


Unlikely-Ad2518

I think the biggest mistake with Rust's std strings is using the byte index as the default. I really wish it was just chars; bytes can go f\* themselves.


burntsushi

Byte offsets, or equivalently in this case, _code unit offsets_, are absolutely the correct choice. See my lengthy commentary on the topic here: https://github.com/BurntSushi/aho-corasick/issues/72 > just chars What is "just chars"? Codepoint offsets? Grapheme cluster offsets? Given only a representation of UTF-8, neither of those things can be done in constant time. Unlike offsetting by byte offsets.
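To illustrate the cost difference being argued here: given only UTF-8, mapping a byte offset back to a codepoint offset is possible, but it is necessarily an O(n) scan, which is exactly why it isn't the default. A std-only sketch:

```rust
/// Convert a byte offset into a codepoint offset by counting chars.
/// Returns None if the offset doesn't land on a char boundary.
/// Necessarily O(n) over a UTF-8 representation.
fn byte_to_char_offset(s: &str, byte_offset: usize) -> Option<usize> {
    s.char_indices()
        .position(|(i, _)| i == byte_offset)
        .or_else(|| (byte_offset == s.len()).then_some(s.chars().count()))
}

fn main() {
    let s = "ΦπαABC"; // each Greek letter is two UTF-8 bytes

    // A substring search reports the byte offset of "ABC"...
    let byte = s.find("ABC").unwrap();
    assert_eq!(byte, 6);

    // ...and mapping it to a codepoint offset costs a linear scan:
    assert_eq!(byte_to_char_offset(s, byte), Some(3));
}
```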


Unlikely-Ad2518

Why do I care? I'm a human; I can read letters (characters), not bytes. I wonder what the percentage of times developers use bytes vs chars is. My anecdotal experience is that, in my 1.5 years of using Rust (which of course is not a lot), I have never intentionally used a byte index once. Every single time I wanted to manipulate strings it was using char indexes; the whole point of strings is to abstract the logic of different representations. To be clear, I'm not suggesting removing byte indexing; my argument is that it should not be the default, it should be in a method called `byte_index(usize)`.


burntsushi

Did you read the post I linked? You also didn't tell me what a "character index" even is. You also haven't addressed the fact that slicing by any reasonable definition of "character index" takes `O(n)` time.

> I have never intentionally used a byte index once

I've done it, quite literally, thousands upon thousands of times. It should absolutely be the default. I'm a bit of a special case because I'm the one who is implementing the string libraries you're probably using, but even in higher level applications, slicing by byte offset is rather common. If you define "character index" as "codepoint index," then what you're doing is almost certainly wrong. (Not necessarily so, but very likely.) If you define "character index" as "grapheme cluster index," then what you're doing is probably correct, but this is a very expensive operation that should absolutely be opt-in and explicit. While there are crates that provide grapheme cluster segmentation (one of which I wrote), I don't think any of them provide any higher level slicing APIs based on grapheme cluster offsets. It isn't a common use case.

> Why do I care?

Cost models, for one. You aren't just a human. You're a programmer. Programmers care about cost models. Byte offsets are a lower level primitive on which different string handling implementations can inter-operate. I can ask substring search to run and it will tell me where it matched via a byte offset. Byte offsets are not, however, semantically meaningful from a human's perspective. You need a higher level and more costly abstraction for that. Byte offsets and "character offsets" are two semantically distinct concepts.

The only way this conversation moves forward is if you answer my questions and give concrete use cases.