Bad Blood

I recently read Bad Blood (brilliant title) by John Carreyrou on the flight from Boston to San Francisco; in fact, I couldn’t put it down, even when I should have been prepping for interviews. The book is a piece of investigative journalism that reads like a thriller, mostly because you desperately want to see the bad guys finally get taken down. Yet, like Walter White, they keep getting away with it. The book can’t report their downfall because, fascinatingly, the book itself catalyzed the fall of Elizabeth Holmes and ‘Sunny’ Balwani. In a gonzo twist the author becomes a player in the story he’s reporting, threatened by the same Theranos enforcers that silenced internal whistleblowers and repelled investigators. Determined investigative journalism was required to expose the big lie that federal regulators turned a blind eye to and were, at times, complicit in.

In this way and others Bad Blood reminds me of The Best and the Brightest. In both Holmes and Lyndon Johnson you had charismatic but vindictive leaders that purged dissenters and promoted sycophants. In both cases you had the big lie – that Theranos technology worked, that the Vietnam War was necessary and we were winning – that could not be questioned for risk of retaliation. All ‘data’ was then fabricated to align with this outcome. This information was not independently verifiable because it was a trade secret or a national security risk. Once this ‘data’ generated by motivated reasoning existed it became further unassailable proof of the rightness of the original claim. Dissenters were not just ostracized but prosecuted – for violating their NDAs at Theranos, for treason at the White House – and thus silenced.

Theranos also featured a powerful cast of government figures on its board of directors – James Mattis, Henry Kissinger, and most of all George Shultz – that provided cover in the federal bureaucracy and national media. Notably, they are all older men described as personally enthralled with the young, charismatic Holmes. From the Wall Street Journal:

“The brilliant young Stanford dropout behind the breakthrough invention was anointed ‘the next Steve Jobs or Bill Gates’ by no less than former secretary of state George Shultz”

Empire of Illusion comments that fame has become credential enough to be an authority on any issue. I can’t think of any other justification for having a Cold War-era Secretary of State weigh in on the merits of a medical device. We saw the same with Obama and Solyndra; because these figures are brilliant in one domain we assume they are expert in another. Medical VCs quickly homed in on Holmes’s inability to back up her claims (and her hostility when pressed to) and declined to invest. The legitimacy she couldn’t obtain from the experts she obtained from the famous and powerful, to whom we (the public) readily assigned the same expertise as the experts. There might be a short-selling strategy here: look for the company boosted by famous people who have no idea what they’re talking about.

Holmes found a ready audience for this publicity blitz in a media that desperately wanted a female STEM success story.

“As much as she courted the attention, Elizabeth’s sudden fame wasn’t entirely her doing. Her emergence tapped into the public’s hunger to see a female entrepreneur break through in a technology world dominated by men. Women like Yahoo’s Marissa Mayer and Facebook’s Sheryl Sandberg had achieved a measure of renown in Silicon Valley, but they hadn’t created their own companies from scratch. In Elizabeth Holmes, the Valley had its first female billionaire tech founder.”

I suppose this is the other side of the coin whereby VC firms expect the next Mark Zuckerberg to look like Mark Zuckerberg. Parmenides: “The good and the true are not necessarily the same.”

One last idea I wanted to mention is how many harmful effects Non-Disclosure Agreements (NDAs) seem to have in our society today. The #MeToo movement brought to light countless monsters who protected themselves for years with NDAs, and no doubt there are thousands more. NDAs allowed Theranos to intimidate its employees and cover its lies for years. Trump’s NDAs may have safeguarded his election. So far as I can tell these contracts serve only those with secrets to hide who possess the legal power and wealth to enforce them.

A President or a Press Secretary

I finished David Halberstam’s The Best and the Brightest about a month ago and I’ve been trying to find a way to approach writing about it ever since. The book is so packed with fascinating history and insights that it’s hard to choose what to bite off first; worse, each approach ties inexorably into a few others, making disentangling any one of them into a blog post nearly impossible.

I heard a quote in a podcast the other day (I think Vox’s The Weeds, but I can’t remember for certain) that succinctly captured one of the main messages and has been banging around in my head ever since; to paraphrase, “We think of our brain as a president, when really it is a press secretary.” That is: we imagine that we’re impartially weighing and evaluating each piece of evidence we read about the world and incorporating it into our updated worldview, when really our worldview is mostly fixed and the ‘evidence’ we hear is recast or highlighted so as to best fit into this worldview (or outright rejected if no fit can be made).

This isn’t a revolutionary idea and is in many ways just a restatement of Confirmation bias; however, the idea and the metaphor itself aptly describe one of the main problems that befell the Kennedy and Johnson administrations.

At a high level: Kennedy and his academic, operationally minded technocracy of men like McGeorge Bundy, Robert McNamara, and Walt Rostow believed in the overwhelming power of their own open-minded brilliance and rationality to resolve any political matter; however, this self-same belief made them blind to their own biases and preconceptions, and eventually outright hostile to any questioning of their essential rightness. The propaganda machinery supporting this self-deception eventually stretched from the reporting structure at the lowest levels of the military on the ground in Vietnam to the highest-ranking members of the presidential cabinet.

The mechanisms enforcing this conformity are myriad and take up many of the pages of this (massive) book. However, three (that magic number for enumeration) stick with me now: bureaucratic enforcement, patriotic pressure, and misinformation.

Bureaucratic pressure is one of the most obvious and pernicious mechanisms available to the administration. Obvious because this is a bureaucracy; pernicious because the pressure could be applied without administrators even realizing they had done so. At its meanest, men who disagreed with the presidential consensus on Vietnam and East Asia generally would be fired or removed to remote, insignificant posts in order to silence their criticism. More common, I think, was the subtle pressure on military underlings to provide rosy assessments of the war to their superiors all the way up the chain, or likewise the pressure on State Department officials to sit on information that upset the administration’s worldview. Promotions for those with ‘good’ information; stalled careers for those with true information. In this way the administrators, intentionally and not, outright corrupted the data that was so essential to the proper function of their hyper-rationalist evaluations.

Before reading this book I had underestimated how long the shadow of McCarthyism hung over Washington after his death. Not only communists but anyone to the left of the prevailing hard-line anti-Communist demonology of the day was purged from political life and sometimes faced legal problems. Through the Johnson administration, Democrats in particular remained fearful of seeming ‘soft’ on Communism, even when it came to supporting liberal domestic causes. The way this worked to silence critics of the hard line on Vietnam cannot be overstated. Journalists were afraid to question the statements and assumptions of the administration; potential dissenters in Congress feared not “supporting the boys”; outsiders within State and the military had long been purged. Disagreeing that the United States should support the cruel and corrupt Diem regime in Vietnam, pointing out the danger and futility of saturation bombing in North Vietnam, questioning whether the “Communist Monolith” was real or imagined – these were not opinions meriting discussion; they were treason. This led to a nationwide purging of dissenting voices and created the illusion of consensus in the administration.

Finally, the Johnson (and to a lesser extent Kennedy) administration used the information and dissemination powers of the Executive to mislead the public, the press, and Congress. The executive understated the troop requirements given by the generals, painted an inanely optimistic picture of the war (which would always be “over by Christmas”), lied about funding requirements, and leaked when it benefited them but otherwise completely hid the internal deliberations and decisions of the administration. These machinations were part and parcel of the leaders’ belief in their own intellectual superiority and rationality; they held the public and Congress in such disdain that misleading them seemed necessary and right in order to bring them along on the correct course. Those who questioned and disagreed with the leadership were belittled and treated with hostility. Halberstam summarizes this beautifully:

Nor had they, leaders of a democracy, bothered to involve the people of their country in the course they had chosen: they knew the right path and they knew how much could be revealed, step by step along the way. They had manipulated the public, the Congress and the press from the start, told half truths, about why we were going in, how deeply we were going in, how much we were spending, and how long we were in for. When their predictions turned out to be hopelessly inaccurate, and when the public and the Congress, annoyed at being manipulated, soured on the war, then the architects had been aggrieved. They had turned on those very symbols of the democratic society they had once manipulated, criticizing them for their lack of fiber, stamina and lack of belief.

What are those of us who believe in rationality to do, then? The “go with your gut” instinct of the second Bush administration brought us Iraq, our modern Vietnam; the passions of the mob brought us Trump. Halberstam argues that humanism and political courage were the necessary but missing partners of Kennedy’s rationalism. The team was too willing to see napalm bombing as numbers on a spreadsheet modeling the balance of the war rather than as the act of human cruelty that it was. And even when Kennedy saw an unfortunate political truth, he lacked the courage to bear the political cost of his conviction – always waiting until the second term.

I’d argue that intellectual humility was also a crucial but missing element of this administration. Too many men who’d made a career of being perpetually correct and meteorically successful had no capacity for considering their own flaws and blind spots, and no tolerance for the less-credentialed who challenged the rightness of their betters. To my thinking Obama, most often compared to Kennedy, fused these traits of humanism, political courage, and humility nicely. Perhaps relatedly, he stands almost alone among his modern peers in not embroiling the United States in some foreign military debacle (though, somehow, Trump has a chance to be the most peace-loving president yet).

More important than prescriptions for society, though, is advice for ourselves. Remain humble, question your assumptions, and attempt to meet new information with the open mind of a true Executive rather than the fixed, debate-poised mind of a Press Secretary.

Cuyahoga National Park

After attending the wedding of a friend from kindergarten (what a thing to say!) in Bedford, Pennsylvania this past weekend, my girlfriend and I took the opportunity to visit the youngest child of the National Parks system, Cuyahoga Valley in Ohio. The park is about two hours west of Pittsburgh Airport and we stopped at Fallingwater on the way.

We weren’t sure what to expect from the park; the imagery we found online lacked the single defining feature that characterizes most parks (such as Half Dome in Yosemite or the arches in Arches), instead featuring various meadows, rivers, waterfalls, and train tracks. The map of the park hints at its unique layout, situated directly between Akron to the south and Cleveland to the north, with highways bisecting the park horizontally. Indeed we never felt too far from the road at any point in the park; as in many of the parks of the East, you won’t find the extreme solitude here that you will in the larger, remoter parks of the West.

In fact I must admit that we’d added this visit to our itinerary mostly for the sake of crossing this park off our list while we were in the neighborhood and had rather low expectations for the park itself.

These expectations were exceeded on our first hike around Ritchie Ledges, where we saw some wonderful rock formations (pictured) and beautiful meadows and had a moderately strenuous four-mile walk. The rain alternated with sunniness throughout and the forest smelled rich; creeks and rivers sounded continually along with the birds. The overlook wasn’t much of one, even for a cloudy day; the feeling was one of closeness with nature rather than the remote hugeness of the views in some other parks. We had the trail almost to ourselves and the ranger even encouraged us to go off trail to explore the crannies of the ledges; an NP first for us both.


Afterwards we hiked from the Stanford House to Brandywine Falls, the latter being the signature attraction of the park if there is one. Again this hike exceeded all expectations; there was enough elevation gain to get a decent sweat up, you’re frequently passing cool creeks, and the waterfall could hold its own against many of the others we’ve seen (though not those of proud Yosemite). We were both reminded of some of our hikes in Olympic National Park where we had rainy weather, rich forests, and interesting water features (though Cuyahoga has not the beaches or mountains). Brandywine Falls is accessible by car and as such was one of the few places on our hikes that was almost busy.


Before leaving the following afternoon we did two more hikes; one around Kendall Lake and the Salt Run Trail, then another around Buttermilk Falls. I found myself thinking of home during these hikes; I found our first walk uninteresting besides the portions around the lake (pictured at the top), and Buttermilk was simply lesser than Brandywine and lacking the nice lead-up hike. At this point I became convinced that our 48 hours was sufficient for the park, at least for one visit.


The most interesting artifact of these hikes was a strangely placed tunnel along the lake loop that appeared to be built to no purpose, passing under an easily surmountable hill. A placard on the other side explained that the tunnel was built to allow guests to walk under a toboggan run that featured prominently at the lake in the area’s heyday of the 1930s-40s. Strange to think that an area now so desolate was so recently a hub of activity – how quickly a place of activity became a place of history!

Finally, I have to mention the delightful Airbnb where we stayed during our time in Cuyahoga. Besides being a beautiful, peaceful property with a cozy apartment, usable bedside fireplace, and lovely patio, the host, Wade, was wonderful. He was knowledgeable about restaurants and hikes, unfailingly sweet and thoughtful, and took us on a tour of his incredible antiques collection, including the Messerschmitt Kabinenroller pictured below. If you visit I can’t recommend this Airbnb enough; and please say “hi” to Wade for me.


One last note: the food at The Lounge at PIT is exceptional!



Spring Cleaning with Android StrictMode

While monitoring Fabric after our most recent release we noticed a new non-fatal crash:

Non-fatal Exception: android.database.CursorWindowAllocationException: 
Cursor window allocation of 2048 kb failed. 
# Open Cursors=937 (# cursors opened by pid 4459=936) 
(# cursors opened by this proc=1)

being thrown when CommCare attempted to save a form. Some Googling confirmed that this has an out-of-memory condition at its root, and indeed the exception reports 936(!) open Cursors. After manually looking through our code I couldn’t find anywhere that we were obviously failing to clean up our Cursors. Further Googling turned up this StackOverflow response with an extremely helpful code snippet:

if (BuildConfig.DEBUG) {
    StrictMode.setThreadPolicy(new StrictMode.ThreadPolicy.Builder()
            .detectAll()
            .penaltyLog()
            .build());
    StrictMode.setVmPolicy(new StrictMode.VmPolicy.Builder()
            .detectLeakedSqlLiteObjects()
            .detectLeakedClosableObjects()
            .penaltyLog()
            .penaltyDeath()
            .build());
}

This snippet (pasted into the onCreate of your main Activity or Application) enables thread- and VM-level analysis that will crash the application whenever the specified violations occur (for example, when we’ve lost the handle to a Cursor or InputStream without those resources being closed). You could call this the fail-very-fast principle, and this is certainly not something you’d want on in production.

Booting up CommCare I ran into four crashes before even reaching our launch activity. For the most part we were failing to close InputStream objects and using a try-with-resources statement fixed the problem.
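As an illustration of the pattern change (hypothetical names, not CommCare’s actual code), the fix looks like this:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamCleanup {
    // Leaky pattern: if read() throws, or close() is simply forgotten,
    // StrictMode's detectLeakedClosableObjects() will flag the stream
    // when the garbage collector finds it unclosed.
    static int firstByteLeaky(InputStream in) throws IOException {
        return in.read(); // the stream is never closed
    }

    // Fixed pattern: try-with-resources closes the stream on every path,
    // including exceptional ones.
    static int firstByteSafe(byte[] data) throws IOException {
        try (InputStream in = new ByteArrayInputStream(data)) {
            return in.read();
        }
    }
}
```

Cursor is also Closeable on modern Android APIs, so the same try-with-resources shape applies to database reads.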

Interestingly, the only dropped Cursor I turned up did not come from our application but from google_app_measurement_local.db – looks like a Google problem! Further hunting took me to this discussion implicating Firebase as the culprit, and updating our library resolved the issue.

After poking around our application for a while I was satisfied that I’d exposed and fixed every issue. However, after leaving the tablet dormant for a while a background process kicked in and triggered yet another StrictMode crash! We’re now debating whether to leave StrictMode on for all dev builds and looking into logging these events on production builds. I’d be interested to hear others’ best practices on this and I recommend giving this snippet a try for any dev build.

[Photo from the beautiful Great Sand Dunes National Park]

Automated Android testing with Calabash, Jenkins, Pipeline, and Amazon Device Farm: Part 1

Around two years ago we began playing with the Calabash library to see if we could automate some portion of our mobile release QA process. Previously we’d sent an APK file and a spreadsheet of steps to India for every release, then engaged in a weeks-long back and forth as we hashed out bugs with the application and test plan. This process had become costly and burdensome as the size of our development team grew (and our feature set along with it). Moving to a UI-level automated testing suite offered some obvious advantages:

  • We would save the money we’d been paying contractors to run through the tests
  • We would find out immediately when something was broken, rather than all at once during release.
    • This allows the person who caused the bug to handle it rather than whoever is assigned to triage release bugs
  • We could test across a large set of devices and Android OS versions
  • We would be managing test code instead of managing a huge test spreadsheet

Unfortunately the road to this gold standard was a long one, beset on all sides by the tyrannies of poor documentation and rapidly evolving technology. While writing and running Calabash tests was straightforward, this wasn’t even half the battle. In order for tests to be useful we needed them to run and pass consistently, deliver meaningful and accessible reports, and plug into our notification workflows like Slack and GitHub. Oh, and we needed to write the tests.

This November we ran our first release with a significant portion of the test plan (more than 40 man-hours) automated. We have about forty “scenarios” running nightly builds on Amazon Device Farm across three to five devices. Failure notifications end up in our Slack channel, and reports are emailed to us and stored on our Jenkins server. This is how we got there.

The Test Suite

Our test repository is a standard Calabash test suite. Most important are the files in the features/ folder. These contain Calabash files written in the business logic language Gherkin. For example:

Feature: Fixtures

  Scenario: Ensure that we error cleanly when missing a fixture
    Then I install the ccz app at "fixtures.ccz"
    Then I login with username "fixtures_fails" and password "123"
    Then I press start
    Then I select module "Fixtures"
    Then I select form "Fixtures Form"
    Then I wait for form to load
    Then Next
    # Should error
    Then I should see "Error Occurred"
    Then I should see "Make sure the"
    Then I should see "lookup table is available"

The biggest unit of a Calabash test is a “Feature” which can contain multiple scenarios. The Android sandbox is shared within a feature so if you add some persistable data in the first scenario it will still be available in the second. In this case we install a CommCare application, login with a dummy user, and then run through some menus and a form. Eventually we expect to see a clean failure. We add Tags (prefixed with ‘@’) to our tests as this is the primary unit of control for organizing test runs. We’ll be using those later.

We’ve also included in the repository some test resources that we need, such as CommCare applications, a spreadsheet, and even another Android application that we use for testing our mobile APIs.

The killer feature of Calabash is that this (weird sounding) plain English becomes a runnable test. “Will, this is amazing, how is this possible?” Well, Calabash gives us a number of so-called “canned steps” for free. These canned steps are enough to manipulate an Android device in just about any way possible. You can touch a TextView based on its String content or resource ID, use the soft keyboard, assert some text exists, etc. We use the “Then I should see [x]” step above. Calabash then compiles these down to Ruby that runs on the Calabash server alongside your app on the device.

For more control and cleaner code we specify our own steps as well. For example, each of our tests must begin with the installation of some CommCare app. So we defined a step to install an app from a local resource:

# perform offline install using ccz file pushed from repository
Then (/^I install the ccz app at "([^\"]*)"$/) do |path|
  tap_when_element_exists("* {text CONTAINS[c] 'Offline install'}")
  push("features/resource_files/ccz_apps/%s" % path, "/sdcard/%s" % path)
  step("I enter \"storage/emulated/0/%s\" into input field number 1" % path)

  # get around bug where the install button is disabled after entering text
  perform_action('set_activity_orientation', 'landscape')
  perform_action('set_activity_orientation', 'portrait')
  sleep 1

  tap_when_element_exists("* {text CONTAINS[c] 'Install App'}")
  wait_for_element_exists("* id:'edit_password'", timeout: 6000)
end
This ought to give a better idea of what Calabash actually runs on the application. Here we navigate to the install screen, push our test app from test resources into the application’s storage space, start the install, and then wait for the login screen to appear. At this level we’re writing Ruby rather than Gherkin so we have much less readable but more precise code. Ideally our developers would be able to define enough robust steps that a technical non-developer could write their own tests. And indeed this has happened with a member of our QA team outputting a large portion of our test plan.

The commented portion in the middle hints at a persistent weakness of Calabash that we’ve discovered; namely, that we frequently run into UI quirks such as views becoming “invisible” due to screen rotation or keyboard focus. Most frustratingly, these issues often are not consistent, so a test will pass a few times before failing when a device happens to run more slowly than before. We’ve always been able to work around these issues and can usually hide the fix in a step, so this isn’t show-stopping. However, these bugs do slow development time.

In Part 2 I’ll go into how these tests are packaged and run on Jenkins.

Java String hashCode() Collision

Another week on interrupt duty, another Javarosa engine deep dive.

This morning we received a ticket reporting that a user’s form was pre-populating some questions with seemingly random answers. Further, some answer sequences would cause the form to skip entire question groups – even those that didn’t have display logic. My first thought was that the form was constructed improperly, but inspecting the XML everything seemed hunky dory. Still, I was able to reproduce the bug immediately using our command line tool, indicating a bug in our core Javarosa library (rather than in our Android wrapper).

Some background: CommCare’s forms are written to the XForm spec. These forms work by defining and populating an XML data model, with logic and UI elements defined in the form’s data bindings. In CommCare, users construct these forms in our Form Builder web application and then fill them out on our mobile application, which submits the XML blocks back to the server. When users fill out forms, you can view this XML instance being populated. In the user’s form definition the (blank) data instance looks like this:

<H_food_type_repeat jr:template=""/>

But when I inspected the data instance from within the application the XML appeared differently:


Notice that the node appears twice; not only different from the form definition, but clearly illegal XML. Even more strangely, when I stepped into the code that constructs this instance, the XML was generated correctly – with the node intact!

At this point I began getting flashbacks to the infinity bug and accordingly suspected caching. Our engine is chock-full of caching and interning optimizations that greatly improve our performance but can lead to some gnarly bugs. “H2a” and “H3B” also seem like dangerously similar variable names – a hash collision waiting to happen.

Sure enough, the hashCode() function for these nodes returns identical values. This caused a collision at a critical junction where we were retrieving these objects, so that we’d retrieve the node “H2a” when attempting to retrieve “H3B”. This caused us to replace H3B with H2a in the instance structure, as well as to insert the answer to H2a into the node H3B.

At this point the fix was trivial (in this case we simply removed the optimization and performed a full equality check instead of just comparing hash values). However we were curious how this happened given that the probability of a hash collision with a good function across a 32-bit hash space should be exceedingly small.

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}
One thing immediately jumps out – this function looks way too simple! Each character is simply weighted by a power of 31: the last character is added directly, the second-to-last is multiplied by 31, the one before that by 31^2, and so on. This can result in strings having equal hash codes fairly easily, particularly for shorter strings.

For example, given two Strings X and Y of equal length where the ASCII values of the characters at position N differ by 1 (for example the characters ‘a’ and ‘b’ or, in this case, ‘2’ and ‘3’), the ASCII values of the characters at position N+1 need only differ by 31 in the opposite direction (in this case ‘a’ and ‘B’) in order for the hash codes of the two strings to be completely identical! This is more common than you might expect because the ASCII values of the upper- and lower-case versions of a letter differ by 32; so int(‘A’) = 65 and int(‘a’) = 97. Given this it’s surprising a multiplier of 31 was chosen; however, some research reveals some interesting reasoning behind it:

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

For large strings such a collision won’t be likely – most of our users’ questions have IDs such as “mother_name” or “hiv_status” that are unlikely to collide. However, longer clinical forms very often have question series named A1, A2… G12, etc. Given this, I’m actually surprised we hadn’t run into this issue before.
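The specific collision from this bug is easy to check directly:

```java
public class CollisionDemo {
    public static void main(String[] args) {
        // '2' (50) and '3' (51) differ by 1; 'a' (97) and 'B' (66) differ
        // by 31, so the weighted contributions cancel exactly:
        // 'H'*31^2 + '2'*31 + 'a' == 'H'*31^2 + '3'*31 + 'B'
        System.out.println("H2a".hashCode()); // 70839
        System.out.println("H3B".hashCode()); // 70839

        // The classic minimal example of the same cancellation:
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
    }
}
```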

Again harking back to the Infinity bug, the core issue here was our misuse of canned Java functionality. The Java String hashCode() function was designed with HashMap keys in mind. In that application the hashing function’s speed is the priority and collisions aren’t a huge concern. However, in our setup hash collisions caused catastrophic failure. We should have used a hashing function with a more reliable distribution if we’d been intent on this setup.
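A minimal sketch of the idea behind our fix (hypothetical names, not the actual CommCare code): treat a hash match only as a hint and confirm every cache hit with a full equals() check, the way HashMap does internally.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an intern-style cache keyed on full string equality.
// HashMap uses hashCode() only to locate the bucket and then confirms the
// match with equals(), so colliding keys like "H2a" and "H3B" remain
// distinct entries instead of overwriting one another.
class NodeNameCache {
    private final Map<String, String> cache = new HashMap<>();

    String intern(String name) {
        String existing = cache.get(name);
        if (existing != null) {
            return existing; // genuine match, safe to reuse
        }
        cache.put(name, name);
        return name;
    }
}
```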

Infinity and Beyond

One of the most interesting (read: frustrating) parts of managing an Android deployment is the myriad bugs you encounter that are particular to certain devices or OS versions. Most often these bugs are hardware-related; for example, external SD card reads won’t work on a certain device. Also common are software issues, where a device won’t respect an OS contract properly. Least common, in my experience, are true Android OS bugs that manifest as reliable hard crashes. These bugs are likely to be fixed and the devices quickly updated to newer, bug-fixed versions. However, in the international development world we often have to make do with older devices that cannot update their OS due either to hardware or network constraints. In these deployments obscure, old bugs linger for years – this is the story of one of those bugs.

On Friday September 16 (never accept bugs on Fridays) we received an internal bug report from a developer on one of our biggest deployments reporting that one device was consistently failing to look up a case in its case database. All other devices could retrieve the case using the same model and query. The suspect device could look up other cases successfully, but not this one. The device wouldn’t crash but would just behave as though the case did not exist. Perhaps significantly (or perhaps not) this device was the only one running Android 4.X.

At this point I began a diagnostic process:

  1. Examined the user’s restore payload which contains the user’s assigned cases. The “missing” case was present in the restore payload and formatted correctly to boot.
  2. Stepped through the code that parses the restore payload into database entries. By all appearances the device reads the XML and writes a database entry for the case.
  3. Examined the application’s SQLite database directly. The case appeared there and the index looked correct!

At this point I was baffled. The database looked entirely correct, I could perform the query myself and retrieve the case, and the form had no problems – everything worked for almost all devices! Dauntingly, all of this indicated that the bug was somewhere deep in the abyss of our XForm parsing code.

Some historical context. CommCare forms follow the XForm spec and our transactions like restoring users and form submissions are done with XML payloads in the same spirit. XML is decidedly un-hip now but these specs provide us with a robust framework for defining applications and user data.

CommCare was also originally written in J2ME for Java phones and so anticipated running on devices with memory counted in low megabytes. This means our core code is chock-full of techniques for reducing our memory overhead such as caching and interning. This is the world I’d now have to step into!
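To give a flavor of the interning technique mentioned above, here is a hypothetical Python sketch (not CommCare’s actual Java implementation): repeated strings are pooled so that equal values share a single object instead of occupying memory once per occurrence.

```python
# Hypothetical sketch of a string-interning pool, a J2ME-era memory trick.
# Keys and values are the strings themselves; the first copy seen "wins".
_intern_pool = {}

def intern_string(s):
    # Return the pooled copy if we've seen this value before;
    # otherwise pool this one and return it.
    return _intern_pool.setdefault(s, s)

first = intern_string("case_type")
second = intern_string("".join(["case", "_type"]))  # equal value, new object
assert first is second  # both names now reference the one pooled object
```

The flip side of tricks like this, as the rest of this story shows, is that a poisoned cache entry silently affects every later lookup.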

I could approach the problem from two places: the original write of the case, or the lookup before form entry. I decided to start with the read, since by all appearances the entry in the case database was completely correct. Fortunately, at this time I discovered conditional breakpoints, an amazing tool that let me step into the code at exactly the point where we began searching for the case, by setting the breakpoint condition to test whether the case’s index was in use.

Once I started looking around this code I noticed that at some point variables which should have contained the case id were instead set to Double.POSITIVE_INFINITY. Triangulating this eventually led me to the function here where we attempt to cast the String lookup key (in this case the case’s id) to a Double when possible, since this gives us more performant equality and comparison operations. If we determine that the String is a double we also add it to a parsing cache that we use for subsequent lookups. To determine whether the String is a double we use the built-in Java function:

Double ret = Double.parseDouble(attrValue);

Which ended up being the source of the bug. This function was accepting a hexadecimal string and declaring it a valid double – Infinity, to be precise! The trouble, as always, lies in scientific notation. Quoting from my own pull request:

Unfortunately, until version 4.4 Android had a bug with the parseDouble function that you can read about here. Because Java doubles use the exponential format MeP, the “e” character could appear in a valid double in certain circumstances. In the case above, the fixture ID was 1d9098a0c23a0c83740547dd946e378

So, the bug would parse the trailing 946e3783 as a valid exponential with mantissa even though the rest of the string was complete garbage. This would evaluate to infinity (or Double.POSITIVE_INFINITY to be precise) and then be stored in the cache table as a valid double mapping.

So, we stored this key’s value as Infinity, and thus later lookups for this string literal would fail. Worse still, CommCare had no idea anything had failed – from the perspective of the application all had gone well and we’d successfully cached a double, saving precious memory. Our solution was to simply check for scientific notation before parsing the double – functionality we already had, since XPath doesn’t recognize scientific notation.
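The shape of the fix can be sketched in Python (the real check lives in CommCare’s Java code; the regex here is illustrative, not our exact rule). Conveniently, Python’s float() is just as permissive as parseDouble about exponents, so it reproduces the overflow-to-infinity behavior:

```python
import math
import re

# Only plain decimal numbers pass; anything with an exponent is rejected.
PLAIN_DECIMAL = re.compile(r'^-?\d+(\.\d+)?$')

def parse_strict_double(s):
    """Return a float for plain decimal strings, None for everything else."""
    if not PLAIN_DECIMAL.match(s):
        return None
    return float(s)

# The permissive parser happily reads the tail of a hex-ish case ID as an
# exponential and overflows it to infinity...
assert math.isinf(float("946e3783"))
# ...while the strict check refuses it, so no bogus Infinity gets cached.
assert parse_strict_double("946e3783") is None
assert parse_strict_double("3.14") == 3.14
```

The key design point is the asymmetry of costs: a false negative just means a slightly slower string comparison, while a false positive silently poisons the cache.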

A few lessons learned. Clearly, using the built-in parseDouble function was a mistake, since Java’s and XPath’s ideas of “what is a Double” diverge in this respect. We were using a function that would aggressively attempt to parse a String to a double, when really we wanted a function that would be much stricter, since a false positive would be much more costly than a false negative. We also learned that bugs can exist and lie dormant for years before rearing their heads.

However, my primary takeaway from this was laughter, since the bug was both hilariously obscure and adorably typical. Android developers are used to encountering strange, device-specific bugs, since every device and OS version is its own special snowflake. This one was so emblematic that I could only laugh.

Ensuring API Parity with Laboratory

The Setup

At Dimagi we’ve been suffering under a Jython server for years. When we originally launched an experimental web client for our Java Phone and Android applications, Jython seemed like the logical bridge between our Java core libraries and Python-based Django web service. Jython allowed us to hook into our Java library directly and leverage our server team’s Python knowledge. We could avoid implementing an un-cool Java server and instead use a bleeding-edge technology – what could go wrong?

As our web client (codenamed CloudCare) grew from an experimental tool to one of our core offerings, Jython’s shortcomings were exposed. Interaction with other Java libraries was difficult – getting Jython to play nicely with Postgres through JDBC was an exercise in trial and error, and patience. Debugging Java issues was a nightmare, as we couldn’t step through any of the Java code from the Python environment. Dev tools like code completion and Find Usages were non-existent. Stack traces were mostly an illegible mess of interwoven Java and Python methods. And perhaps most difficult was the lack of thorough documentation for understanding the often very complicated interaction of Python and Java code.

Thus in the Fall of 2015 we made the decision to re-implement the entire server in Spring. This blog will cover more about that migration (which consumed the majority of my 2016) in later posts; for now, we’ll step over it in a sentence.

With the server up and running with partial functionality we successfully ran QA under a feature flag and migrated some internal domains onto the new backend. However, we still could not make ourselves fully confident that the new backend had full parity, and for good reason. The problem space for the server is simply massive.

Our product, CommCare, allows users to build large forms that can track and share cases and manage supply inventories. These forms can have about thirty types of questions, repeat groups, complex branching logic, multiple languages, multimedia, and umpteen other customizations. The server would be responsible for managing every aspect of using these forms – displaying the forms, performing the form and case transactions, managing the user’s database, etc.

All of this means that the complexity and variability of the inputs and outputs is massive and fully comprehensive testing is impossible. Ideally, we would test the feature with real data at product scale with no impact on the user experience. Enter Laboratory.

The Idea

When migrating from an old code path to a new one we add a toggle that allows us to route function calls to the new path for QA and test projects without impacting live projects at all. Once we’re reasonably assured of the stability and correctness of the new path we can begin gradually migrating domains onto it – keeping an eye out for errors – until the old path is completely deprecated.

With Laboratory, instead of choosing between the old and new code at the join point, we run both inside an ‘experiment’ abstraction and then compare the results for parity and timing. We then return the result of the old path (including any exceptions thrown) so that the user experience is unchanged.


Our code was well suited for a Laboratory experiment. All of our calls to the old backend were routed through a single function. So, inserting an experiment here was quite straightforward:

def perform_experiment(d, auth, content_type):
    experiment = FormplayerExperiment(name=d["action"], context={'request': d})
    with experiment.control() as c:
        c.record(post_data_helper(d, auth, content_type, settings.XFORMS_PLAYER_URL))
    with experiment.candidate() as c:
        # If we should already have a session, look up its ID in the experiment mapping. it better be there.
        if "session_id" in d:
            d["session_id"] = FormplayerExperiment.session_id_mapping.get(d["session_id"])
        c.record(post_data_helper(d, auth, content_type, settings.FORMPLAYER_URL + "/" + d["action"]))
    objects = experiment.conduct()
    return objects

First we create an instance of our laboratory.Experiment subclass (more on that later). Next, we make the post_data call to our old backend as before:

c.record(post_data_helper(d, auth, content_type, settings.XFORMS_PLAYER_URL))

as the experiment control. Next, we run the new call as the experiment candidate:

c.record(post_data_helper(d, auth, content_type, settings.FORMPLAYER_URL + "/" + d["action"]))

(The slightly different URL syntax is due to a difference in our old and new servers)

Finally, we “run” the experiment which performs the comparison and calls back to a publish() function that we define in our Experiment subclass.

The lines that I skipped above hint at one of the two major problems we had to solve in using Laboratory. First, during form entry we use session tokens to track the user’s state on the backend, to avoid sending the entire form state with each request. This obviously presents a major difficulty in comparing the calls: each browser request contains a session key corresponding to a session object that exists on the old server, but not on the new one, where the record would have a totally different key. We resolved this issue by storing a mapping from the ‘control’ session ID to the corresponding ‘candidate’ session ID each time we create a new form:

# if we're starting a new form, we need to store the mapping between session_ids so we can use later
if self.name == "new-form":
    control_session_id = json.loads(result.control.value)["session_id"]
    candidate_session_id = json.loads(result.observations[0].value)["session_id"]
    FormplayerExperiment.session_id_mapping[control_session_id] = candidate_session_id

Then, every time we’re going to make a request to the candidate server we replace the old session_id key with its corresponding new one.

if "session_id" in d:
    d["session_id"] = FormplayerExperiment.session_id_mapping.get(d["session_id"])

This allows us to manipulate the two sessions in parallel and continually compare the result. The final component of our experiment is the actual Experiment subclass:

class FormplayerExperiment(laboratory.Experiment):
    session_id_mapping = {}

    def publish(self, result):
        # if we're starting a new form, we need to store the mapping between session_ids so we can use later
        if self.name == "new-form":
            control_session_id = json.loads(result.control.value)["session_id"]
            candidate_session_id = json.loads(result.observations[0].value)["session_id"]
            FormplayerExperiment.session_id_mapping[control_session_id] = candidate_session_id

Here we’re overriding the publish() function, which is called once we have the result of our experiment, to determine what information to store – and, in our case, to add new session ID mappings if necessary. In order, we are:

  • Adding a new session ID mapping when we’ve performed a new-form action
  • Logging the timing of the control and candidate
  • Comparing the values of the control and candidate and logging the difference if one exists

This last point ended up being the second tricky part of this process. We ended up with a handful of small edge cases that broke the equality checking. In some cases this was a difference in platforms – booleans ended up as ‘1’ and ‘0’ in JSON output from Python, while they were ‘True’ and ‘False’ in Java. Further, we needed to ignore the session ID and some other outputs, like random IDs from forms, that would be necessarily different. Finally, all these values were inside a large JSON tree whose structure we wanted to ensure was the same in both cases.
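A rough sketch of that comparison logic (the key names and boolean normalizations here are illustrative, not our exact code): walk both JSON trees in parallel, normalize the platform-specific boolean encodings, skip keys that are necessarily different, and record any remaining mismatches along with their paths.

```python
import json

IGNORED_KEYS = {"session_id"}  # illustrative: keys expected to differ
BOOL_ALIASES = {"0": False, "1": True, "False": False, "True": True,
                "false": False, "true": True}

def normalize(value):
    # Map the platforms' divergent boolean encodings onto real booleans.
    if isinstance(value, str) and value in BOOL_ALIASES:
        return BOOL_ALIASES[value]
    return value

def collect_mismatches(control, candidate, mismatches, path=""):
    if isinstance(control, dict) and isinstance(candidate, dict):
        for key in set(control) | set(candidate):
            if key not in IGNORED_KEYS:
                collect_mismatches(control.get(key), candidate.get(key),
                                   mismatches, path + "/" + key)
    elif isinstance(control, list) and isinstance(candidate, list):
        if len(control) != len(candidate):
            mismatches.append((path, len(control), len(candidate)))
        for i, (a, b) in enumerate(zip(control, candidate)):
            collect_mismatches(a, b, mismatches, "%s[%d]" % (path, i))
    elif normalize(control) != normalize(candidate):
        mismatches.append((path, control, candidate))

mismatches = []
collect_mismatches(json.loads('{"relevant": "1", "session_id": "abc"}'),
                   json.loads('{"relevant": true, "session_id": "xyz"}'),
                   mismatches)
assert mismatches == []  # boolean normalized, session_id ignored
```

Recursing with the path threaded through is what makes the mismatch logs actionable: you learn not just that the trees differ, but exactly where.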



When we turned this on we immediately got actionable data. For example, one request generated the diff:

2016-06-08 22:17:53,675, new-form, {u'lang': u'en', u'session-id': None, u'domain': u'test2', u'session-data': {u'username': u'test', u'additional_filters': {u'footprint': True}, u'domain': u'test2', u'user_id': u'aa8c9268917a7c48fee7d30f3b12df92', u'app_id': u'8686d27973591f96f0ea0da8cd891323', u'user_data': …}

2016-06-08 22:17:53,674 INFO Mismatch with key ix between control 0J and candidate 0_-10

2016-06-08 22:17:53,674 INFO Mismatch with key relevant between control 1 and candidate 0

2016-06-08 22:17:53,674 INFO Mismatch with key header between control question1 and candidate None

2016-06-08 22:17:53,675 INFO Mismatch with key add-choice between control None - Add question1 and candidate None

The first line tells us this was a new-form request; the last three lines tell us that we weren’t generating a repeated question properly in this form. The second line tells us why – we weren’t properly encoding the index of this question in the form. This issue only arose under a specific ordering of question groups, so we missed it during QA and likely would have introduced this bug “in the wild” if not for Laboratory.


For each call we also generated a comma separated list of values: the time of the request, the request type, and the control and experiment execution times:

2016-05-24 10:08:41,182, answer, 0:00:00.052370, 0:00:00.024141
2016-05-24 12:32:55,501, get-instance, 0:00:00.031898, 0:00:00.019106
2016-05-24 12:33:14,427, answer, 0:00:00.067201, 0:00:00.028408
2016-05-24 12:33:15,003, get-instance, 0:00:00.026185, 0:00:00.019381
2016-05-24 12:39:09,586, new-form, 0:00:03.156963, 0:00:02.310486
2016-05-24 12:39:36,425, get-instance, 0:00:00.022434, 0:00:00.016840
2016-05-24 12:47:26,084, evaluate-xpath, 0:00:05.037198, 0:00:00.016030
2016-05-24 12:51:47,146, evaluate-xpath, 0:00:00.045509, 0:00:00.018033
2016-05-24 12:52:02,801, evaluate-xpath, 0:00:00.033657, 0:00:00.023034
2016-05-24 15:55:59,949, new-form, 0:00:05.362653, 0:00:02.199996
2016-05-24 16:11:53,493, answer, 0:00:00.052868, 0:00:05.020280
2016-05-24 16:11:54,067, get-instance, 0:00:00.023180, 0:00:00.017108
2016-05-24 16:12:01,222, get-instance, 0:00:00.023897, 0:00:00.011998

Pulling this data into a (very rudimentary) graph we were able to validate the performance gains we’d expected from moving to Spring:



We viewed our experiment with Laboratory as a huge success. We not only caught bugs missed by QA that we likely would have released otherwise but also validated the performance gains we’d noticed anecdotally previously. Even if we’d found the equality comparison too unwieldy to implement, logging only the timing with a stripped down experiment would have had huge value as well.

One major advantage we had was that all calls were routed through the extant post_data method so that we could test all functions with minimal impact on the code. The largest obstacle was the parallel sessions, though the solution ended up being quite elegant.

Laboratory itself was very usable – the concept and implementation are straightforward, and plugging in code where we needed to override the default behavior was easy. Our biggest feature request would be the ability to run the candidate request and publishing asynchronously from the user-visible control call and return; fears that blocking on the candidate function would cause performance issues led us to limit the rollout at first. We’ve now implemented our second experiment and will continue to look for opportunities to experiment further.