Tuesday, February 02, 2010

Solving the gcc 4.4 strict aliasing problems

A couple of days ago Jeff Stedfast ran into some problems with gcc 4.4, strict aliasing and optimizations. Being a geeky sort of person, I found the problem really interesting, not only because it shows just how hard it is to write a good, clear standard, even when you're dealing with highly technical (and supposedly unambiguous) language, but also because I never did "get" the aliasing rules, so it was a nice excuse to read up on the subject.

Basically, the standard says that you can't do this:

int a = 0x12345678;
short *b = (short *)&a;

I'm forcing a cast here, and since the types are not compatible, they can't be "alias" of each other, and therefore I'm breaking strict-aliasing rules. Note that if you compile this with -O2 -Wall, it will *not* warn you that you're breaking the rules, even though -O2 activates -fstrict-aliasing and -Wall is supposed to complain about everything (right??). Apparently, this is by design, though why would anyone not want warnings in -Wall for something that will obviously break code is beyond me. If you want to be told that you're not playing by the rules, make sure you build with -Wstrict-aliasing=2, which will say:

line 2 - warning: dereferencing type-punned pointer will break strict-aliasing rules

So now you know you're being naughty. Of course, if you did try to access the variable, even just with -Wall it will complain at you - this more complete snippet will give you several warnings with -Wall:

int a = 0x12345678;
short *b = (short *)&a;
b[1] = 0;
if (a == 0x12345678)
  printf ("error\n");
else
  printf ("good\n");

line 3 - warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules

The problem gets ugly when you're dealing with structs and pointers to them - then -Wall is completely silent about possible issues, and only -Wstrict-aliasing=2 will work, like in this little snippet:

typedef struct type {
  struct type *next;
  int val;
} Type;

...

Type *t1, *t2, *t3;
t1 = t2 = NULL;
t1 = (Type*) &t2;
int i;
for (i = 0; i < 2; i++) {
  t3 = malloc (sizeof (Type));
  t1->next = t3;
  t1 = t3;
}
if (!t2)
  printf ("error\n");
else
  printf ("good\n");

This doesn't emit any warnings on -Wall because the loop makes it slightly fuzzy for gcc to tell whether things are getting assigned or not. -O2 will optimize away the assignment to t1 on line 3, which will make things not work later on.

So how to fix this? The attribute may_alias allows a type to bypass the aliasing rules, just like character types do (character types are allowed to alias any other type, according to the c99 standard). Changing the definition of Type to the following will make the compiler happy:

typedef struct type {
  struct type *next;
  int val;
} __attribute__((__may_alias__)) Type;

One final note: if you mix up code with aliased types and non-aliased types, gcc will not enforce aliasing optimizations on your non-aliased-possibly-broken code... i.e., if you define this type two times, one with the attribute, one without, and then do the loop above with both types (separately mind you, with separate variables, the code just happens to be in the same method), the non-aliased type won't fail. Aren't optimizations fun?


Update: People have pointed out that the first statement short *b = (short *)&a; is totally legal and has nothing to do with aliasing.

Yes, that's true, I should have been more precise. The statement is perfectly legal. It's when you try to access the data via the pointer that was assigned on that line that breaks the standard. So when your code blows up, it blows up accessing the data, but that's not the cause, that's the consequence. The cause of said explosion is that optimizations + strict-aliasing look at that (totally legal) statement and say "oh, dude, come on, this is bogus" and throw it away while munching on scooby snacks. Well, not sure about that last part.

Anyways, where was I? Oh yes, so, two things: if you don't want to change your code, you can use may_alias , gcc will say "that's so awesome" and everyone will make merry. Or something. The second thing is, and let me add a little emphasis to this part, because I'm sometimes a bit too subtle, and apparently some things should be said *very clearly*: when a statement is perfectly legal, and yet it IS removed via a combination of default flags with NO warnings whatsoever, something is WRONG, and in my opinion, the problem here is lack of warnings.

And that, as someone said, is that. Or not, whatever tickles your fancy. Hmmmm, tickles...

12 comments:

Jonathan Pryor said...

Wouldn't a "better" solution be one that doesn't require breaking the language rules?

e.g. one of the comments on Jeff's blog mentioned just doing this:

Node list, *node, *tail;
tail = &list;
tail->next = node;

It's interesting to know about the __may_alias__ attribute, but just avoiding the aliasing seems like a more "future-proof" solution...

andreia|gaita said...

I agree that not breaking the rules in the first place would be the best thing to do. OTOH, not all code out there is as easy to fix as this example, and there is a lot of it out there.

When it's hard to go through production code and change it to comply with the rules, but you really don't want to enable -fno-strict-aliasing and lose out on all the checks and optimizations, that's where this blog post comes in.

What I would like to see is an improvement in the warning system, so you know what gcc is doing right away. As it stands right now, the aliasing warnings are really not to be trusted, gcc does not warn you even if it knows your code is 1) breaking the rules, and 2) the assignments that break the rules are going to be touched by optimizations.

Optimizing an assignment because that assignment breaks a rule is ALWAYS dangerous. Now I know that the reasoning to not show those warnings by default is that it can give you false positives, but a false positive in this case is when you *are* breaking the rules and an optimization *might* break your code. IMHO, just because your code might not break doesn't mean the warning is optional. You really need to know you're breaking the rules here, regardless of whether things blow up or not.

rasky said...

The article is incorrect. The first snippet of code is fully legal: the standard says that you must access an object of type T with a lvalue of type T (so either dereferencing a pointer-to-T or through a reference-to-T). It doesn't say anything about casting pointers.

This is why -Wall doesn't complain: it's not strictly a problem to cast the pointer; the problem will happen later when you will dereference it.

Aliasing is a property of objects, not pointers.

andreia|gaita said...

@rasky

Fine, let's be anal about this. The statement is fully legal, totally agree with you. In fact, it's so legal that it will be removed from your code via strict-aliasing related optimizations, and if you enable -Wstrict-aliasing=2, it will complain at you. Funny how perfectly legal code triggers warnings, is considered worthless and ends up getting removed unless you force -fno-strict-aliasing (which has absolutely nothing to do with that statement, you're very right).

As I said, just because something is sorta legal and just *might* break due to optimizations triggered by options that take issue with your (totally legal) code, is not an excuse to suppress warnings.

There's legal code, and then there's code that's been smoking some weird stuff. If you want to make it *really* legal and you don't want to change it, here's how. Anal enough for you?

Anonymous said...

You're a little bit wrong about what's optimized away.
After reading your article i got puzzled about why the hell these assigns can be removed, so i tried to compile, run and disassemble your examples myself.
Assignments are optimized away. But final comparisons are. In the first example this happens because it thinks that a will not be changed because we don't assign anything directly to it and according to strict aliasing rules b is different from a. The same happens in the second example - it thinks that t2 will stay untouched (and it's NULL).
Here is an illustration (second example compiled with -O3):

00401390 <_main>:
; MinGW Prolog
401390: 55 push %ebp
401391: 89 e5 mov %esp,%ebp
401393: 83 e4 f0 and $0xfffffff0,%esp
401396: 53 push %ebx
; Allocate variables (t2 is also here)
401397: 83 ec 2c sub $0x2c,%esp
40139a: e8 91 04 00 00 call 401830 <___main>
; for() loop unrolled
; t3 = malloc(8)
40139f: c7 04 24 08 00 00 00 movl $0x8,(%esp)
4013a6: e8 e5 04 00 00 call 401890 <_malloc>
; t1 = t3
4013ab: 89 c3 mov %eax,%ebx
; t2.Next = t3
4013ad: 89 44 24 1c mov %eax,0x1c(%esp)
; t3 = malloc(8)
4013b1: c7 04 24 08 00 00 00 movl $0x8,(%esp)
4013b8: e8 d3 04 00 00 call 401890 <_malloc>
; t1->Next = t3
4013bd: 89 03 mov %eax,(%ebx)
; Comparison is missing here!!!
; printf("error\n")
4013bf: c7 04 24 64 30 40 00 movl $0x403064,(%esp)
4013c6: e8 cd 04 00 00 call 401898 <_puts>
; Epilog
4013cb: 31 c0 xor %eax,%eax
4013cd: 83 c4 2c add $0x2c,%esp
4013d0: 5b pop %ebx
4013d1: c9 leave
4013d2: c3 ret

andreia|gaita said...

@Anonymous

According to my debugging sessions at the time, the assignments were completely missing when the optimizations were turned on. iirc, it was by tracking the assignments and finding out they were missing that the issue was first noticed - the code that hit this is filling out a list, so obviously if the first assignment doesn't happen, the list gets fubared. There is also differences between gcc's on several different distros, that might account for the discrepancy, not sure. In any event, the problem is still the same, no warnings for critical codepath optimizations, and the solution still works. I might just go back and try again, see if something has changed here, now I'm curious :P

Patrick said...

Wonderful article. I also get frustrated by the spottiness of warnings for this even in situations where the optimizer will cause code to be generated that doesn't do what you expect. I wrote a white paper about strict aliasing for the boost developer wiki that covers a lot of the topic pretty well. You can see it now at Strict C/C++ Aliasing. I'd be grateful if anyone could suggest improvements. If I use an improvement you suggest I will of course add you to the gratitude section;)

Patrick said...

It's a great summary. I covered this for the boost developer wiki and then moved it to my consulting web site. If you find anything I can improve on it I'll give you credit on the page;) Understanding C/C++ Strict Aliasing

Anonymous said...

I'm getting the 'type-punning' warning when there appears to be nothing wrong with the usage apart from free()'s void * parameter conversion:

typedef struct {...} some_def_t;
...
{
some_def_t * somedef;
...
free(somedef);
}

gives the warning on the free()

Anonymous said...

My take on the warning with the free(somedef) code, is that it's slightly analogous to the 'char *'/'const char *' warnings.

By calling free(), your 'some_def_t *' value is being degraded to a 'void *' which they consider to be a free-for-all !

Hence pointer aliasing cannot be detected, once converted to the more general pointer.

(Yes, I arrived here because *I* was frustrated at the message about a free() as well !)

zenspider said...

Just so you know... -Wall is _not_ all warnings... Leave it to GNU to name something -Wall, when at best it should be -Wmost.

Quentin said...

Gorgeous!