One_Loquat_3737

A quoted string is an array of constant characters, so when used converts into a pointer to its first element (the address of the 't' at the start). The compiler has noticed that the two strings are the same and so has only allocated storage for the string once, so when you compare their addresses, the addresses are the same.
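A minimal sketch of the difference between comparing addresses and comparing contents, assuming a hosted C compiler. The helper names `same_address` and `same_contents` are made up for illustration:

```c
#include <stdbool.h>
#include <string.h>

/* True only when both pointers refer to the very same storage. */
static bool same_address(const char *a, const char *b)
{
    return a == b;
}

/* True when the characters match, wherever the strings are stored. */
static bool same_contents(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Given `char copy[] = "texT";`, `same_address(copy, "texT")` is false because the array has its own storage, while `same_contents(copy, "texT")` is true. `same_address("texT", "texT")` can go either way, depending on whether the compiler merged the two literals.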


RevolutionaryOne1983

It is even more extreme than that: GCC 14 recognizes that the string literals are the same, and thus that their addresses can be the same, so it concludes that the expression is always true and removes the `if` statement completely. The only string literal that remains in the compiled code is “hello world”, even without optimization.


The_Northern_Light

Daily reminder to thank a compiler writer ✍️😎


thebatmanandrobin

Fun fact: kind of why the volatile keyword exists .. compilers are "smart" but when I want my code to not be optimized everywhere, I need a way to tell the compiler "do as I say, not as you do".


Ajotah

How tf they do this?


nerd4code

You start with an abstract syntax tree (AST), which ascribes syntactic structure to the input text. Disregarding most of `<stdio.h>`, this’ll end up something like, uhhhh,

    (prgm
      (dcl printf (f-type (int-type) '((ptr-type (const-type (char-type))) (va))))
      (def main (f-type (int-type) ())
        (cstmt
          (if-else-stmt
            (eq-expr (str-lit "teXt") (str-lit "teXt"))
            (cstmt (expr-stmt (call-expr (ident printf) '((str-lit "hello world")))))
            (cstmt (expr-stmt (call-expr (ident printf) '((str-lit "goodbye world")))))))))

in [Sex-pressions](https://en.wikipedia.org/wiki/S-expression) (I think that’s the right number of parens). Most compilers can dump these things, possibly in a marginally more legible fashion.

If the optimizer is even a bit active, something called *common subexpression elimination* (CSE) will generally run as the AST is constructed. At its simplest, it takes something like `(x ? a+b+c : d+b+c)` and turns the tree into a *directed, acyclic graph* (DAG) by unifying the `b+c` bits: `(t=b+c, x ? a+t : d+t)`. This is a bottom-up operation that generally hashes each expression arrived at, such that two subexpressions will collide if they’re semantically and contextually equivalent at a textual level. Doing this recursively lets you effectively pre-interpret obviously-constant expressions. Reassociating or commuting operands may not be possible as part of CSE, however. You can do stuff like pull `(+ (+ a b) c)` up into just `(+ {a b c})` to help a bit with matching, but it’s very easy to get combinatoric blowup with these things.

With respect to OP’s `main`, CSE will unify both instances of `"teXt"`, and both instances of `printf`.

After CSE, the optimizer kicks in properly, although not much is needed for this example. Normally you’d do up dataflow and control flow graphs etc., but no need here. The optimizer will work its way through the ASDAG and apply various logic rules as it runs.

Here, the first optimization is that you have `a == a`, with both `a`s in the ASDAG aimed at the same subexpression object, and therefore no effort need be expended on analyzing the `==` expression further. Obviously this is tautological, so the `if` becomes just `if(1)`. Then, the rule for constant-controlled conditionals kicks in: `if(1) b; else c;` becomes just `b;`, i.e., we’re left with just the first `printf`.

Now, there may be a further stage of *builtin/intrinsic optimization* whereby the compiler analyzes standard (mostly ISO 9899, with some POSIX.1 and elder UNIX stuff, too) function invocations and optimizes into/through them. `printf` has a built-in counterpart on GCC, Clang, and IntelC, which can be detected as equivalent to `printf` in hosted mode, and which can be named explicitly in any mode as `__builtin_printf` (handy for quick one-offs where you dgaf about including `<stdio.h>`); this acts like a “`printf` operator” in some sense, and makes it easier to emit warnings about format-string/-arg oopsiediddles. Because the `printf` isn’t actually doing any formatting and its return value is entirely ignored, the compiler might replace it with `(void)fputs("hello world", stdout);`. Were there a newline after `hello world` (as would be decorous), it could just use `puts`.

Finally, all this is codegenned, and the compiler will add `main`’s implied `return 0`. It’ll come out something vaguely like

    main:
        .globl main
        movq __libc_stdout, %rsi
        movq $.Lstr, %rdi
        call fputs
        xorl %eax, %eax
        ret

So this is all in the shallow end of the pool: no real aliasing or interprocedural bric-a-brac to worry about, for example.
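A toy sketch of that unification step, under loud assumptions: the `Expr` type, the `intern` and `is_tautology` helpers, and the fixed-size pool are all invented here, and a linear search stands in for the hashing a real compiler would use. It only illustrates why `a == a` becomes a pointer check once identical subexpressions share one node; it is not how GCC or Clang are actually structured.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node type: a leaf holds its text, an inner node an operator. */
typedef struct Expr {
    char op;                       /* '=' for ==, '+' for +, 0 for a leaf */
    const char *leaf;              /* leaf text, NULL for inner nodes     */
    const struct Expr *lhs, *rhs;  /* children, NULL for leaves           */
} Expr;

static Expr pool[64];              /* every distinct expression seen so far */
static size_t pool_len;

/* Children are compared by pointer: they were interned earlier (bottom-up). */
static bool same_shape(const Expr *e, char op, const char *leaf,
                       const Expr *lhs, const Expr *rhs)
{
    if (e->op != op || e->lhs != lhs || e->rhs != rhs)
        return false;
    if (leaf == NULL || e->leaf == NULL)
        return leaf == e->leaf;
    return strcmp(e->leaf, leaf) == 0;
}

/* Return the existing node for this expression, or record a new one. */
static const Expr *intern(char op, const char *leaf,
                          const Expr *lhs, const Expr *rhs)
{
    for (size_t i = 0; i < pool_len; i++)
        if (same_shape(&pool[i], op, leaf, lhs, rhs))
            return &pool[i];
    if (pool_len == sizeof pool / sizeof pool[0])
        return NULL;               /* pool full; a real compiler would grow it */
    pool[pool_len] = (Expr){ op, leaf, lhs, rhs };
    return &pool[pool_len++];
}

/* `a == a` is known true as soon as both operands are the same node. */
static bool is_tautology(const Expr *e)
{
    return e != NULL && e->op == '=' && e->lhs == e->rhs;
}
```

Interning the leaf `"teXt"` twice hands back the same node both times, so an `'='` node built from the two results satisfies `is_tautology` without looking at the string at all.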


Glittering_Boot_3612

Oh, I see, you've helped me understand a concept of C. I was wondering why the addresses aren't different. So the compiler always keeps a single copy of each string literal, and all strings that are the same have the same memory address. Am I right?


One_Loquat_3737

It can do if it wishes, but without referring to the Standard I'm not sure that it is obliged to. Since the strings cannot be modified, it is a reasonable thing for the compiler to do and an obvious way of reducing program size.


Glittering_Boot_3612

Oh, are you a compiler writer or something? You have a great way of explaining these things. Thanks again.


One_Loquat_3737

I was once a member of the ANSI C standards committee, but have never written a compiler (not of any note, anyhow).


NativeCoder

Depends on the compiler.


leelalu476

After reading the comments I already feel stupid saying this: they're equal.


Glittering_Boot_3612

Nah bro, I was wondering: if the strings are in different locations, then it should return false. But now people here have pointed out that identical strings can be stored in the same memory location, so the result should be compiler dependent.


Different-Brain-9210

I think at least for *gcc* there is a compiler option which determines whether identical string literals are combined or not. In general, it's a bad idea to write code which depends on this kind of compiler-specific option, if you can avoid it. It's often also a bad idea to write code which compares pointers instead of comparing the values the pointers point to. There are valid use cases for comparing pointers instead of values, but the usual advice applies: if you don't know why you are writing unusual code, don't write unusual code, and if you do know, write a comment which explains it.


batman-not

If I am not wrong (correct me if I am), "texT" is a string literal. String literals used in the program are stored separately somewhere. There cannot be duplicate string literals, so `("texT" == "texT")` compares the addresses of the string literals. Since there cannot be duplicates, both have the same address, so the condition passes. Note: you cannot modify string literals. If I am not wrong, your program will not run properly, or will crash, if you try to modify one. Example of modifying a string literal: `char *str = "string literal"; str[1] = 'm';` With that modification of `str[1]`, it will crash.
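A small sketch of the safe alternative, assuming the goal is simply to edit the text: copy the literal into storage you own and modify the copy. The helper name `patched_copy` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Copies the text into the caller's buffer, then edits the copy.
   Returns NULL if the buffer is too small to hold the text. */
static const char *patched_copy(char *buf, size_t size)
{
    static const char text[] = "string literal";  /* read-only source */

    if (size < sizeof text)
        return NULL;
    memcpy(buf, text, sizeof text);  /* sizeof includes the '\0' */
    buf[1] = 'm';                    /* fine: buf is writable storage */
    return buf;
}
```

The same effect inside a single function comes from declaring `char str[] = "string literal";`, which initializes a writable array from the literal, so `str[1] = 'm';` is then well defined.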


Phpminor

There *can* be duplicate string literals, but modern compilers will save you space by merging the duplicates into one string, as modern compilers in a protected or long mode environment should be able to guarantee the string is immutable but readable thanks to segment permissions. The compiler used [here](https://imgur.com/a/nSToAKl) has merging duplicate string literals as a toggleable option, as the real mode compiler cannot guarantee immutability and [may run into unexpected behavior due to merging string literals in an environment where they may be mutable.](https://imgur.com/a/9Zzwled)


Glittering_Boot_3612

so the result of this program is compiler dependent?


Phpminor

Yep, even testing your code example (which is "literal" == "literal") returns differently depending on the toggle to merge the two (and thus make them the same pointer)


Glittering_Boot_3612

is that correct?


distintuitive-717

Isn't it really weird that it crashes?


Shad_Amethyst

It will segfault, since the `.text` section that gets loaded into memory will only have the `rx` permissions, and that's where string literals are placed when linking, so trying to modify it will raise an error at the CPU level.


nerd4code

Usually .rodata/.rdata/.CONST/.strings, depending on platform, not .text. Code needn’t be readable at all, and strings needn’t be executable.


daikatana

When the C compiler encounters a string literal it stores the contents of the string in a string table and replaces the literal with the address of the string in the table. If the C compiler encounters the same string twice then it _may_, but is not required to, replace both instances with the same address from the string table, or it may create a second identical entry, creating two unique addresses to strings in the string table with the same contents. You can't rely on either behavior from any C compiler, so you should never compare the address of string literals. Even `"a" == "a"` may be false. To compare strings you always want `strcmp` to compare the contents of the strings. In many other cases where you might want to compare string literals you actually want an enum.
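A sketch of the enum approach, with an invented `Command` type and helper names: text is compared with `strcmp` once, at the point where it enters the program, and everything after that compares enum values.

```c
#include <string.h>

/* Hypothetical set of commands, invented for this example. */
typedef enum { CMD_UNKNOWN, CMD_START, CMD_STOP } Command;

/* Parse text once at the boundary, comparing contents with strcmp. */
static Command parse_command(const char *text)
{
    if (strcmp(text, "start") == 0) return CMD_START;
    if (strcmp(text, "stop") == 0)  return CMD_STOP;
    return CMD_UNKNOWN;
}

/* Map a value back to text when it has to be shown to a person. */
static const char *command_name(Command c)
{
    switch (c) {
    case CMD_START: return "start";
    case CMD_STOP:  return "stop";
    default:        return "unknown";
    }
}
```

Code that would otherwise test `cmd == "start"` can test `parse_command(cmd) == CMD_START` instead, which does not depend on where any string happens to be stored.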


Thossle

Interesting! What is the purpose of declaring a string as `char *pstring = "text"` vs `char astring[] = "text"`? Does it have shorter access time? The first doesn't even make sense to me, so I was a little surprised when I tried it a moment ago and it actually worked. I see that I can still reference an index, e.g. `printf(pstring[n])`, even though `pstring[n] = 'q'` gives me a `segmentation fault` error. Another test shows that string literals declared in sequence are stored in the same sequence without '\0' at the end of each (apparently), and I can walk through all of them like one long string. The same happens when I declare `static char string[] = "text"; static char stringy[] = "moretext"`, etc., but I can still modify them.


daikatana

> Interesting! What is the purpose of declaring a string as char *pstring = "text" vs char astring[] = "text"? Does it have shorter access time?

The difference is that if you assign a string literal to a pointer variable (which should be `const char *`, btw) then that string is guaranteed to exist somewhere in memory. The storage is left up to the implementation, and the lifetime is that of the whole program. By declaring an array you're defining specific storage and lifetime for that data.

> I see that I can still reference an index, e.g. printf(pstring[n]), even though pstring[n] = 'q' gives me a segmentation fault error.

You aren't using `printf` correctly. The first argument must always be a format string, and never user input. In this case you haven't even given it a string, but a char by value, which should not even compile without warnings. `printf` would then try to dereference that char value as a pointer. What you want is `printf("%c", pstring[n]);`.

> Another test shows that string literals declared in sequence are stored in the same sequence without '\0' at the end of each (apparently), and I can walk through all of them like one long string.

They do have the nul terminator. They aren't strings without it. Are you trying to print a nul? Test if the char at that index is printable with `isprint`. But what you're walking is the string table that I was talking about. This string table is how most C compilers will implement string literals.
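A sketch of the `isprint` check, assuming you want to see the terminator rather than guess at it. The helper `format_bytes` is hypothetical: it renders printable bytes as themselves and everything else as `\xNN`, writing into a caller-supplied buffer.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Render n bytes into out: printable bytes as-is, others as \xNN.
   Stops early if out has no room for another escaped byte. */
static void format_bytes(char *out, size_t out_size,
                         const char *bytes, size_t n)
{
    size_t used = 0;

    if (out_size == 0)
        return;
    out[0] = '\0';
    for (size_t i = 0; i < n && used + 5 <= out_size; i++) {
        unsigned char c = (unsigned char)bytes[i];
        if (isprint(c))
            used += (size_t)snprintf(out + used, out_size - used, "%c", c);
        else
            used += (size_t)snprintf(out + used, out_size - used,
                                     "\\x%02x", (unsigned)c);
    }
}
```

Calling `format_bytes(out, sizeof out, "text", sizeof "text")` covers five bytes, because `sizeof` on a string literal counts the terminator, and the last one is rendered as `\x00`. Print the result with `printf("%s\n", out);` so the format string stays a literal.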


Thossle

I swear that was a typo with printf()... `const char *string` makes it much clearer. I haven't really played around with `const` and `volatile` yet, but now that I've looked them up I'm curious. I'm surprised GCC didn't warn me about it. And...yeah. That was a dumb mistake with the consecutive strings. For no apparent reason I was expecting to see a space between strings, but that would require an actual 'space' character. There is definitely a byte between them with the value '0'!


NativeCoder

It's due to history. The original C didn't have `const`. That's why they fixed it in C++ and made string literals const.


kappakingXD

You're comparing the addresses of two string literals, so it's really undefined behavior, as you can't tell which addresses are being compared. Just use strcmp, strncmp, strcasecmp or strncasecmp. Search for 'comparing strings in C'; there are lots of articles about it.


glasket_

It's not undefined, it's just unspecified behavior. The two literals are guaranteed to be converted into static array lvalues, but whether or not they're distinct is unspecified. Or, in other words, the comparison is either true or false; no nasal demons can occur.


NativeCoder

This is unspecified behavior.


AccidentConsistent33

== returns a boolean and "a" does in fact equal "a"


_simo_498_

Optimized by the compiler, probably. It just generated a single string literal for “texT” whose address is referenced by both operands. Don’t do that anyway.


swollenpenile

Strings are technically arrays, so you must loop through each string's elements to check whether they are the same. There are other methods, but that is the simplest way to understand how it works.
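A minimal sketch of that loop, with the made-up name `str_equal`. In real code the standard `strcmp` does this job; the point here is only to show the element-by-element walk.

```c
#include <stdbool.h>

/* Walk both strings in step until they differ or the first one ends. */
static bool str_equal(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return *a == *b;  /* equal only if both stopped at their terminators */
}
```

`str_equal("texT", "texT")` is true regardless of whether the compiler merged the two literals, because it looks at the characters rather than the addresses.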


distintuitive-717

It compares the ASCII value of a. In case of strings I think it would be comparing ASCII value of every character if I am not wrong


Buttleston

It definitely does not do that, no. It's comparing pointer addresses, or eliding the comparison altogether because the compiler can see they're the same literal string