Wednesday, October 12, 2011

OSX, the Air and Recovery Mode, or how to make amazing software

This morning I decided I needed a case-sensitive partition on my MacBook Air. It comes with a nice juicy 250GB SSD and I still have about 140GB left, so, having woken up in an adventurous mood, I open up Disk Utility, peer at the partition, note it doesn't complain at me if I shrink it a bit, so I go ahead and resize it. I do this, of course, without killing any of the 30 tabs open on Chrome, or closing down the 3 server connections and about 30 channels on LimeChat, not to mention the 10 terminal sessions running various scripts and remote shells, or any of the ton of widgets and apps happily fidgeting in the background. Life is good.

The resize finishes with no issues, which of course only encourages me, so I go ahead and create a new partition occupying the space Disk Utility says is free (who am I to argue, I'm sure it can do the math better than I can).

"Error: no disk space left to perform the operation"

Or something to that effect, anyways. I wonder if it might be too early for Disk Utility. You know, math this early in the morning, tricky. I reboot, because that always fixes things, right? After a few seconds (yes, SSD is that awesome) of fretting about whether I still have a working Air or whether I'm now Airless, it boots. Disk Utility isn't fooled, though, it continues to complain that the space it has free isn't big enough to create a partition.

It's at this point that my brain kicks in and I run verify on the drive and on the startup partition. Just because the resize finished with no errors, that doesn't mean it didn't actually screw things up, leaving behind a trail of dead bytes all over my drive. It just means it was sneaky about it. A bit like coming home and finding the cat nicely tucked away on her beanbag like a good obedient little kitty, but having the sofa all covered in cat hair. And feeling warm to the touch. As if a certain fur ball had just leaped off of it and onto the beanbag and then pretended to have been there all along. Sneaky.

So, after glaring at the cat (pretending to be fast asleep, snoring loudly, pink tongue jutting out in blissful forgetfulness, the sneak), I repair the partition. Or try to, because this time I get complained at repeatedly with red menacing messages and a popup, indicating I need to run the Installation Disc to start Recovery Mode and run Disk Utility from there. After a careful examination of the Air to make sure it hasn't sprouted a DVD drive while I wasn't looking, I quickly google for the proper procedure to apply to the boot process in order to go into Recovery Mode, and reboot again.

Now it seems to me that this thing, not having an optical drive, would come with a recovery partition from which one would boot when needed. Maybe it's just the Air pouting, but when I hit Command-R, instead of offering to boot from the recovery partition, it went online. Online!

I have to confess, I was amazed and, quite frankly, boggled. This little gray metal thing that I'm obviously trying really hard to turn into a paperweight is going online to fetch an image of the recovery partition so it can load it and boot it on the fly. If I was impressed before that OSX allowed me to resize the system partition just like that, now this is some seriously impressive recovery process. I mean, really, I've blown up more partition tables than you could shake a stick at (the latest one was a combination of partitioning a portable drive on OSX and then formatting said drive on Linux and copying a whole bunch of stuff onto it so I could take it on vacation, and then when I'm on vacation 300km from home trying (haha) to use it on the Mac, and then having to realign partition tables by hand on the command line), and although I usually don't lose anything except time, the recovery process is always sooooo annoying. This whole OSX recovery process was obviously done for silly people like me.

While I'm boggling at it, it does its magic thingy and lo-and-behold! Recovery Mode! I run Disk Utility, hit Verify, hit Repair. Things work, apparently, so I try again to create the partition. It creates it. I reboot and I'm back to normal land. And stuff still works.

With the hardware limitations of the Air and the possibility of not having a recovery partition, this whole recovery process is an amazing piece of well-designed software. Instead of having to waste hours trying to recover things manually, everything Just Worked (tm) and I could instead waste my time writing this blog post! It has made my day.

Oh, and poking the cat. That has also made my day. The sneak...

Monday, June 27, 2011

So long, and thanks for all the fish

I am, as of now, officially no longer at Novell / Attachmate (I guess you can say I'm detached, I know I do).

It's been an amazing 4 and a half years, working on an awesome project inside a little bubble of craziness infiltrated in a corporate environment that never understood us (it's ok, we never understood them either).

As for what I'll be up to next, stay tuned!

Sunday, February 06, 2011

Ooops, Is It FOSDEM Time Already?

I guess it is! As always, FOSDEM is great fun, and once again we had a Mono room with lots of great talks! I especially enjoyed Mark Probst's and Jo Shields' talks: now I know what happens when the deb folks get a hold of our packages, and why we never get our finalizers called in order in Moonlight!

As for my talk, the important bits were that I didn't go over the time, nobody snored, and I made sure there were plenty of lolcats! Get the slides here.

Right now I'm watching a very cool talk about the Go language while my laptop is charging plugged in to a very interesting combination of a triple connected to an adapter (stupid third pin on Belgian plugs) connected to a power extension. Nothing has exploded yet.

Tuesday, February 16, 2010

A small Fosdem wrapup

The other weekend I was in Brussels for FOSDEM. As you know, this year we had a Mono room on Sunday, thanks to the amazing efforts of Ruben Vermeersch and Stéphane Delcroix. The conference was great, as it always is, although as usual I didn't get to see much of the talks on Saturday - busy preparing my own talk about Moonlight, and meeting people, which is one of the parts I enjoy most at FOSDEM. Sunday was awesome, full of Mono talks in a nicely packed room. People were very interested, we had great feedback, and everything went very well, including my demos - it was a very good day, and all in all, a great event. On Monday we had a special Mono hackday, where we got together and, well, hacked. I sat down with Lucas Meijer of Unity and we went through some of the issues they have embedding Mono, similar to what Moonlight has to do. Lucas decided to stay an extra day just for the Mono hackday, after a lot of chatting and quite a few beers the day before, and I'm so glad he did, it was a very productive, if somewhat short, day.

Over the three days of the event I had the pleasure of meeting, remeeting and chatting with a lot of wonderful people, whom I usually only get to talk to online - Jo Shields, Mirco Bauer, Alan McGovern, Jérémie Laval, Jim Purbrick, Michael Meeks, Mans Rullgard, David "Lefty" Schlesinger, Rob Taylor, Bertrand Lorentz, Massimiliano Mantione, just to name a few and not in any particular order (I just know I forgot a ton of people!). Also got to meet a bunch of Portuguese people, like Vânia Gonçalves, Miguel Azevedo, Paulo Trezentos and more - some of them I only get to see at FOSDEM these days, for some odd reason... weird country this is :)

All in all, it was great, I missed the interaction and the chats and the dinners and the talks and the general merriness and learning that is to be had when you're surrounded by a thousand geeks. I hope to see you all again soon!

PS: I somehow got Jérémie's name confused with a known beer brand... which might, or might not be, a good sign! Fixed... :)

Tuesday, February 02, 2010

Solving the gcc 4.4 strict aliasing problems

A couple of days ago Jeff Stedfast ran into some problems with gcc 4.4, strict aliasing and optimizations. Being a geeky sort of person, I found the problem really interesting, not only because it shows just how hard it is to write a good, clear standard, even when you're dealing with highly technical (and supposedly unambiguous) language, but also because I never did "get" the aliasing rules, so it was a nice excuse to read up on the subject.

Basically, the standard says that you can't do this:

int a = 0x12345678;
short *b = (short *)&a;

I'm forcing a cast here, and since the types are not compatible, they can't be "aliases" of each other, and therefore I'm breaking strict-aliasing rules. Note that if you compile this with -O2 -Wall, it will *not* warn you that you're breaking the rules, even though -O2 activates -fstrict-aliasing and -Wall is supposed to complain about everything (right??). Apparently, this is by design, though why anyone would not want warnings in -Wall for something that will obviously break code is beyond me. If you want to be told that you're not playing by the rules, make sure you build with -Wstrict-aliasing=2, which will say:

line 2 - warning: dereferencing type-punned pointer will break strict-aliasing rules

So now you know you're being naughty. Of course, if you did try to access the variable, even just with -Wall it will complain at you - this more complete snippet will give you several warnings with -Wall:

int a = 0x12345678;
short *b = (short *)&a;
b[1] = 0;
if (a == 0x12345678)
  printf ("error\n");
else
  printf ("good\n");

line 3 - warning: dereferencing pointer ‘({anonymous})’ does break strict-aliasing rules
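
For comparison, here is a minimal sketch of one way to get the same effect as b[1] = 0 without going through a short pointer at all: copy the bytes out with memcpy, change them, and copy them back. The function name zero_second_short is made up for this example, and the code assumes an int is exactly two shorts wide, which is common but not guaranteed. Which half of the int ends up zeroed depends on the machine's byte order, just as it does in the snippet above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: zero the second short-sized half of an int without
 * dereferencing a short pointer that points into the int.
 * Assumption: sizeof(int) == 2 * sizeof(short). */
int zero_second_short(int a)
{
    short halves[2];

    assert(sizeof a == sizeof halves);
    memcpy(halves, &a, sizeof a);   /* copy the bytes out of the int */
    halves[1] = 0;                  /* same intent as b[1] = 0 above */
    memcpy(&a, halves, sizeof a);   /* copy the modified bytes back */
    return a;
}
```

Called as zero_second_short(0x12345678), this returns a value different from 0x12345678, and since the int is only ever accessed as an int (or byte by byte via memcpy), the optimizer has no aliasing assumption to trip over.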

The problem gets ugly when you're dealing with structs and pointers to them - then -Wall is completely silent about possible issues, and only -Wstrict-aliasing=2 will work, like in this little snippet:

typedef struct type {
  struct type *next;
  int val;
} Type;

...

Type *t1, *t2, *t3;
t1 = t2 = NULL;
t1 = (Type*) &t2;
int i;
for (i = 0; i < 2; i++) {
  t3 = malloc (sizeof (Type));
  t1->next = t3;
  t1 = t3;
}
if (!t2)
  printf ("error\n");
else
  printf ("good\n");

This doesn't emit any warnings on -Wall because the loop makes it slightly fuzzy for gcc to tell whether things are getting assigned or not. -O2 will optimize away the assignment to t1 on line 3, which will make things not work later on.

So how to fix this? The attribute may_alias allows a type to bypass the aliasing rules, just like character types do (character types are allowed to alias any other type, according to the C99 standard). Changing the definition of Type to the following will make the compiler happy:

typedef struct type {
  struct type *next;
  int val;
} __attribute__((__may_alias__)) Type;

One final note: if you mix up code with aliased types and non-aliased types, gcc will not enforce aliasing optimizations on your non-aliased-possibly-broken code... i.e., if you define this type two times, one with the attribute, one without, and then do the loop above with both types (separately mind you, with separate variables, the code just happens to be in the same method), the non-aliased type won't fail. Aren't optimizations fun?
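
If changing the code is an option, a different route (not the may_alias fix described above, just a common alternative) is to drop the cast entirely and keep a pointer to the pointer that should receive the next node. The sketch below is a rough illustration; build_list, free_list and the tail variable are names invented for this example.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct type {
    struct type *next;
    int val;
} Type;

/* Hypothetical helper: build a list of up to `count` nodes and return its
 * head (NULL if nothing could be allocated or count is 0).
 * `tail` always points at the pointer that should receive the next node:
 * first the head variable itself, then the `next` field of the last node.
 * Every store goes through a plain Type * lvalue, so no cast is needed. */
Type *build_list(int count)
{
    Type *head = NULL;
    Type **tail = &head;

    assert(count >= 0);
    for (int i = 0; i < count; i++) {
        Type *node = malloc(sizeof *node);
        if (node == NULL)
            break;              /* return whatever was linked so far */
        node->next = NULL;
        node->val = i;
        *tail = node;           /* link the node in */
        tail = &node->next;     /* the next node goes after this one */
    }
    return head;
}

/* Release every node of a list built by build_list. */
void free_list(Type *head)
{
    while (head != NULL) {
        Type *next = head->next;
        free(head);
        head = next;
    }
}
```

With this shape, build_list(2) returns a non-NULL head whose next field points at the second node, which is what the loop above was trying to achieve through t2.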


Update: People have pointed out that the first statement short *b = (short *)&a; is totally legal and has nothing to do with aliasing.

Yes, that's true, I should have been more precise. The statement is perfectly legal. It's when you try to access the data via the pointer that was assigned on that line that breaks the standard. So when your code blows up, it blows up accessing the data, but that's not the cause, that's the consequence. The cause of said explosion is that optimizations + strict-aliasing look at that (totally legal) statement and say "oh, dude, come on, this is bogus" and throw it away while munching on scooby snacks. Well, not sure about that last part.

Anyways, where was I? Oh yes, so, two things: if you don't want to change your code, you can use may_alias, gcc will say "that's so awesome" and everyone will make merry. Or something. The second thing is, and let me add a little emphasis to this part, because I'm sometimes a bit too subtle, and apparently some things should be said *very clearly*: when a statement is perfectly legal, and yet it IS removed via a combination of default flags with NO warnings whatsoever, something is WRONG, and in my opinion, the problem here is lack of warnings.

And that, as someone said, is that. Or not, whatever tickles your fancy. Hmmmm, tickles...

Friday, January 22, 2010

Chrome and Moonlight, or how to deadlock a browser

It's no secret that Moonlight works best on Firefox at the moment - it's our baseline browser, after all - but we've had many requests to add Chrome support, and since it supports NPAPI just like all browsers out there, it should really work out of the box, requiring only some extra code to implement/hackify stuff that Chrome/WebKit doesn't expose and that we need - basically, DOM support and some downloader tweaks.

After some initial positive reports of Chrome loading the Silverlight Chess sample successfully, I decided to run some tests and start working on the WebKit bridge code... only to find out that I couldn't make Moonlight load properly on Chrome on my laptop at all. Even the simplest of test pages would hang forever on our initial splash animation, and killing Chrome would dump stacktraces all over the place. Clearly it wasn't happy about Moonlight.

My first instinct was "I must be doing something wrong", so I tried on another machine. Same thing. Built a Chromium debug build and tried it - even worse, I hit symbol conflicts all over the place. It seems the Native Client plugin is included inside Chromium by default, and it exports all the NPAPI symbols publicly. Any plugin (like Moonlight) which uses a loader and dynamically loads the real plugin from another location will get its calls intercepted by the Native Client plugin, and things will fail badly. After fixing this, it still kept hanging on the splash animation. Asked other people to test it - same thing. 99.8% of the time it deadlocks completely, and in only 0.2% of the time will it actually load properly. I guess the positive reports were just really, really lucky.

Next course of action - debug the thing. Following the instructions on how to debug Chrome on Linux, I learned about the Renderer and the Plugin processes that get spawned (and the Zygote, too :P), and how to debug them. Only it didn't work (of course not, I hear you say, that would have been way too easy), due to a missing condition on an if in the Chrome loader (I'm guessing nobody actually debugs it on Linux? :P) Patch the thing, and yay, we're debugging.

To keep plugins from blowing up and/or generally misbehaving and giving the browser a bad reputation, Chrome runs them in a separate process that communicates with the main rendering process via IPC. This, of course, is a terrain rife with potential race conditions and reentrancy issues, and that's exactly what's happening with Moonlight. Fortunately, unlike most race conditions, the problem was very reproducible under gdb as well, and I was able to get traces of both processes in the middle of the deadlock.

So what is deadlocking? Well, it's actually very simple: the renderer process calls NPP_SetWindow on the plugin, and also does a blocking call at the same time. In NPP_SetWindow, we do NPN_GetValue and NPN_GetProperty, which call back into the renderer process and block... oops.
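
To make the cycle explicit, here is a toy model of it. This is not Chrome's IPC code and it doesn't touch NPAPI; every name in it (IpcState, blocking_call, is_deadlocked) is invented for the illustration. It only records which process is blocked waiting on which, and checks whether following those "waits for" edges leads back to where it started.

```c
#include <assert.h>
#include <stdbool.h>

enum { RENDERER, PLUGIN, PROCESS_COUNT };
#define NOT_WAITING (-1)

/* waits_for[p] is the process whose reply p is blocked on,
 * or NOT_WAITING if p is free to service incoming calls. */
typedef struct {
    int waits_for[PROCESS_COUNT];
} IpcState;

void ipc_init(IpcState *s)
{
    for (int p = 0; p < PROCESS_COUNT; p++)
        s->waits_for[p] = NOT_WAITING;
}

/* Record that `caller` made a blocking call into `callee`. */
void blocking_call(IpcState *s, int caller, int callee)
{
    assert(caller >= 0 && caller < PROCESS_COUNT);
    assert(callee >= 0 && callee < PROCESS_COUNT);
    s->waits_for[caller] = callee;
}

/* True if following the wait-for edges from `start` leads back to `start`,
 * i.e. `start` is waiting on a chain of processes that ends up waiting on it. */
bool is_deadlocked(const IpcState *s, int start)
{
    int current = s->waits_for[start];

    for (int steps = 0; steps < PROCESS_COUNT && current != NOT_WAITING; steps++) {
        if (current == start)
            return true;
        current = s->waits_for[current];
    }
    return false;
}
```

In this model, the renderer blocking on the plugin (the NPP_SetWindow side) is harmless on its own. Once the plugin also blocks on the renderer (the NPN_GetValue side), is_deadlocked reports true for both processes, because neither is free to answer the other.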

I wasn't very confident that I could reproduce this without all the Moonlight code, but just in case, and because I wanted to have a nice clean skeleton NPAPI plugin around, I built one, which does nothing but stub out all the required methods to get an empty plugin going. When it gets to NPP_SetWindow, it calls NPN_GetValue and NPN_GetProperty - and it deadlocks pretty much 100% of the time.

I opened issue #32797 on crbug.com, with the small splash plugin test case, if you're curious. Hopefully this will get fixed fast. With all the calls to the browser that we do during execution, I really really hope we don't hit this again... but it's more likely than not that we will :/

While the idea of keeping the plugins under control by shuffling them to the side is a good one, browser devs should keep in mind that, with all the limitations that a plugin is subjected to, with NPAPI being very far from perfect, with browsers implementing it differently, and with OS differences that plugins have to deal with as well, it's already so difficult to have a performant plugin (and believe me, the last thing we want to do is stall the browser) that we shouldn't have to be worrying about potential reentrancy issues and race conditions when doing such simple things as querying the browser for a property value.

Pretty please?

Wednesday, December 02, 2009

Mono Developer Room for FOSDEM 2010!

Some excellent news out of Brussels today: there is going to be a Mono Developer Room at FOSDEM 2010! Call for participation is now open, so come and help us put together an awesome Mono day at FOSDEM!

Thank you so much to Ruben Vermeersch for spearheading this effort, together with Stéphane Delcroix. You guys rock!

Don't forget, send in your talk!