Chrome and Moonlight, or how to deadlock a browser

on Friday, January 22, 2010

It's no secret that Moonlight works best on Firefox at the moment - it's our baseline browser, after all - but we've had many requests to add Chrome support, and since it supports NPAPI just like all browsers out there, it should really work out of the box, requiring only some extra code to implement/hackify stuff that Chrome/WebKit doesn't expose and that we need - basically, DOM support and some downloader tweaks.

After some initial positive reports of Chrome loading the Silverlight Chess sample successfully, I decided to run some tests and start working on the WebKit bridge code... only to find out that I couldn't make Moonlight load properly on Chrome on my laptop at all. Even the simplest of test pages would hang forever on our initial splash animation, and killing Chrome would dump stacktraces all over the place. Clearly it wasn't happy about Moonlight.

My first instinct was "I must be doing something wrong", so I tried on another machine. Same thing. Built a Chromium debug build and tried it - even worse, I hit symbol conflicts all over the place. It seems the Native Client plugin is included inside Chromium by default, and it exports all the NPAPI symbols publicly. Any plugin (like Moonlight) which uses a loader and dynamically loads the real plugin from another location will get its calls intercepted by the Native Client plugin, and things will fail badly. After fixing this, it still kept hanging on the splash animation. Asked other people to test it - same thing. 99.8% of the time it deadlocks completely, and in only 0.2% of the time will it actually load properly. I guess the positive reports were just really, really lucky.

Next course of action - debug the thing. Following the instructions on how to debug Chrome on Linux, I learned about the Renderer and the Plugin processes that get spawned (and the Zygote, too :P), and how to debug them. Only it didn't work (of course not, I hear you say, that would have been way too easy), due to a missing condition on an if on the Chrome loader (I'm guessing nobody actually debugs it on Linux? :P) Patch the thing, and yay, we're debugging.

To keep plugins from blowing up and/or generally misbehaving and giving the browser a bad reputation, Chrome runs them on a separate process that communicates with the main rendering process via IPC. This, of course, is a terrain rife with potential race conditions and reentrancy issues, and that's exactly what's happening with Moonlight. Fortunately, unlike most race conditions, the problem was very reproducible under gdb as well, and I was able to get traces of both processes in the middle of the deadlock.

So what is deadlocking? Well, it's actually very simple: the renderer process calls NPP_SetWindow on the plugin, and also does a blocking call at the same time. In NPP_Setwindow, we do NPN_GetValue and NPN_GetProperty, which call back into the renderer process and block... oops.

I wasn't very confident that I could reproduce this without all the Moonlight code, but just in case, and because I wanted to have a nice clean skeleton NPAPI plugin around, I built one, which does nothing but stub out all the required methods to get an empty plugin going. When it gets to NPP_SetWindow, it calls NPN_GetValue and NPN_GetProperty - and it deadlocks pretty much 100% of the time.

I opened issue #32797 on, with the small splash plugin test case, if you're curious. Hopefully this will get fixed fast. With all the calls to the browser that we do during execution, I really really hope we don't hit this again... but it's more likely than not that we will :/

While the idea of keeping the plugins under control by shuffling them to the side is a good one, browser devs should keep in mind that, with all the limitations that a plugin is subjected to, with NPAPI being very far from perfect, with browsers implementing it differently, OS differences that plugins have to deal with as well, it's already so difficult to have a performant plugin (and believe me, the last thing we want to do is stall the brower), we shouldn't have to be worrying about potential reentrancy issues and race conditions when doing such simple things as querying the browser for a property value.

Pretty please?