The first and obvious way to speed this is to use the Android's Native Development Kit, which would allow us to rewrite this core computation in C or C++, build it as a lib and embed it into the Android application and call it from Dalvik.
The immediate pros is that the code is compiled directly to the CPU architecture of the target device (e.g. ARM for most current phones and tablets, ATOM/x86 soon for probably next year's hardware.)
The cons is that since the code is compiled against an architecture, you need one per target architecture, so next year you'll likely need 2 (ARM + x86). Debugging is tedious, and the JNI calls to communicate between the C and the dalvik parts can be a tad tricky and there isn't a lot of literature on the subject. The setup is also a tiny bit more involved that just adding a plugin to Eclipse, especially on Windows, but it's not that hard and is something you'd do only once.
Since Honeycomb however, we have another choice: RenderScript. This uses a "script" which looks like C (in fact it conforms to the C99 spec), but instead of being converted to machine code, it is compiled to LLVM bytecode. The bytecode is then converted once at runtime on the target device.
One of the immediate benefits of RenderScript is that the target can be either the CPU or the GPU. This will depend on the script and the capabilities of the target, based on some static analysis of the code involved. That means the programmer doesn't have anything to do, the engine takes care of selecting the most capable target.
Now RenderScript has 2 sides: it can be used to perform 3D rendering using some kind of scene graph and in this case it would typically replace your typical OpenGL calls with shader language. GL calls are pretty verbose and repetitive and it's something I always found a bit wasteful to do from the dalvik side. I haven't tried it, but it seems like RenderScript could help a lot here.
The other side is the so called "compute" script: pure number crunching, no GL or 3D involved.
So the question is: what would it take to use RenderScript to run my Mandelbrot core loop, and what speed benefit would I get?
The short answer is that it's trivial. I did the conversion in about 1 hour, including looking at samples and reading docs around, and the net result was a 2x speed improvement when running on a Xoom.
So here's the original from JavaMandel.java:
static void mandelbrot2_java( double x_start, double x_step, double y_start, double y_step, int sx, int sy, int max_iter, int size, int[] result) { if (max_iter <= 0) return; double x_begin = x_start; for(int j = 0, k = 0; j < sy; ++j, y_start += y_step) { x_start = x_begin; for(int i = 0; i < sx; ++i, ++k, x_start += x_step) { // the "naive" mandelbrot computation. nothing fancy. double x = x_start; double y = y_start; double x2 = x * x; double y2 = y * y; int iter = 0; while (x2 + y2 < 4 && iter < max_iter) { double xt = x2 - y2 + x_start; y = 2 * x * y + y_start; x = xt; x2 = xt * xt; y2 = y * y; ++iter; } result[k] = iter; } // i } // j }First we need to convert that to the C99 syntax for the RenderScript file. Luckily it's mostly copy and paste as we can see in src/com/alfray/mandelbrot2/mandel.rs:
#pragma version(1) #pragma rs java_package_name(com.alfray.mandelbrot2) int *result; void mandel2( double x_start, double x_step, double y_start, double y_step, int sx, int sy, int max_iter) { int *ptr = result; // the "naive" mandelbrot computation. nothing fancy. double x_begin = x_start; int i, j, k; for(j = 0, k = 0; j < sy; ++j, y_start += y_step) { x_start = x_begin; for(i = 0; i < sx; ++i, ++k, x_start += x_step) { double x = x_start; double y = y_start; double x2 = x * x; double y2 = y * y; int iter = 0; while (x2 + y2 < 4 && iter < max_iter) { double xtemp = x2 - y2 + x_start; y = 2 * x * y + y_start; x = xtemp; x2 = x * x; y2 = y * y; ++iter; } *(ptr++) = iter; } // i } // j }The only question is how to give the parameters to the method and get the result back.
First: as soon as you save a RenderScript file in Eclipse, ADT will run the compiler on it[*]. This produces 2 things:
- A ".bc" file in the res/raw folder that has the same name as the .rs being compiled.
- A Java wrapper in the gen folder that wraps the script into a convenient Java helper.
You can find the details of the wrapper in the SDK's RenderScript section and don't forget to look at the samples for API 11.
[*] Note: there's a bug in ADT 10.0 and it will fail to compile an empty .rs file and keep trying and trying ad nauseam. In this case make sure to write the 2 magic #pragma lines before saving the first time.
So how do we use that RenderScript now? I have a little Java wrapper that performs the following steps.
First you need to instantiate the script:
RenderScript mRs = RenderScript.create(activity_context); ScriptC_mandel mScript = new ScriptC_mandel(mRs, activity_context.getResources(), R.raw.mandel);
Next we need to allocate the destination buffer. All allocations are done on the dalvik side. In the script we just defined a "global" *result without defining how it was set. So we'll do that now in the Java wrapper, creating an array of "size" number of integers and binding it with the "result" variable:
mAlloc = Allocation.createSized(mRs, Element.I32(mRs), size); mScript.bind_result(mAlloc);and we can now invoke the computation, still from the Java wrapper:
int[] javaResult = new int[size]; mScript.invoke_mandel2(x_start, x_step, y_start, y_step, sx, sy, max_iter); mAlloc.copyTo(javaResult);And that's it, we just got a 2x speed improvement in this case for a trivial amount of work -- 2x is when comparing the rendering time on a Xoom between the dalvik JIT and the RenderScript version. It is important to realize that both the dalvik code and the RenderScript one are single threaded and are not taking advantage of the dual core in the Xoom.
It's important to realize that in the little Java wrapper I caught all exceptions when trying to create the RenderScript instance. Not all devices are going to support RenderScript, starting with anything below API 11, which is currently 99.8% of the market, including the emulator. However we should soon see a plethora of tablets supporting it and these devices will have larger screens, meaning more pixels and thus more computations to run.
Some lessons learned:
- The HelloCompute RenderScript uses a rsForEach() helper method, but it wasn't clear at all how to use it. I will elaborate on this in a future article, as it's important to extract the most from RenderScript.
- It's as hard to debug RenderScript as it is with the NDK, due to the lack of proper debugger. There's an rsDebug() method that can be used to basically do printf-debugging. In my case it wasn't an issue since I already had the java code, yet I still managed a typo in the call parameters.
- As of Honeycomb, RenderScript doesn't run on the emulator yet. You will need a real device that can support it, which at the time of this writing only the Xoom can provide.