System dependencies

I’m so happy to be developing in Java. It takes away the drudgery of software development, like memory management. And it frees you from worrying about how certain features are implemented on different platforms: Write Once, Run Anywhere!

You do feel the but coming, don’t you? 😉

Well, sometimes I do run into system dependencies. And since I’m no longer used to them, I don’t expect them anymore. Ah well, it happens only once in a very long while. Let me tell you about one such event.

I described in my previous post how I used reflection to extract common code into a base class. I used Class.getMethods(), for which the JavaDoc reads:

public Method[] getMethods() throws SecurityException

Returns an array containing Method objects reflecting all the public member methods of the class or interface represented by this Class object, including those declared by the class or interface and those inherited from superclasses and superinterfaces. Array classes return all the (public) member methods inherited from the Object class. The elements in the array returned are not sorted and are not in any particular order. This method returns an array of length 0 if this Class object represents a class or interface that has no public member methods, or if this Class object represents a primitive type or void.

The interesting part is the sentence in the middle: “The elements in the array returned are not sorted and are not in any particular order.” This is one of those sentences that you can easily overlook. I know I did.

What does it mean? Nothing more than that the order is undefined in the spec, and so depends on the system (in this case the particular JVM implementation) that you use. We use both Windows and GNU/Linux to test our stuff, but on both we have a Sun JVM. I guess most people will use this one too, since it’s from the makers of Java and it’s free (as in beer and now also as in speech).

But not IBM. For their AIX platform, they have built a custom JVM. And you guessed right: that JVM uses a different order for the methods in the array. Whereas the Sun implementation always gives you the methods from the class first, then those from its base class, and so on, the IBM implementation uses the exact reverse order. My code subtly depended on that order, and so it failed on AIX.
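If your code must not depend on this order, the fix is simple: impose an order yourself. A minimal sketch (sorting on the methods’ string representations, which are unique):

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.Comparator;

class MethodOrder {

    // Sort the methods ourselves, so the result no longer depends on
    // the order in which the JVM happens to return them.
    static Method[] sortedMethods(Class<?> type) {
        Method[] methods = type.getMethods();
        Arrays.sort(methods, new Comparator<Method>() {
            public int compare(Method a, Method b) {
                return a.toString().compareTo(b.toString());
            }
        });
        return methods;
    }
}
```

With this in place, iterating over `sortedMethods(SomeClass.class)` gives the same sequence on any JVM.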

BTW, if you want to find out more about which JVM implementation you are using, just issue java -version. This is from my machine at home:

java version "1.6.0_06"
Java(TM) SE Runtime Environment (build 1.6.0_06-b02)
Java HotSpot(TM) Server VM (build 10.0-b22, mixed mode)

The HotSpot part is what gives away that it’s from Sun. This is from our AIX box:

java version "1.5.0"
Java(TM) 2 Runtime Environment, Standard Edition (build
    pap64dev-20080315 (SR7))
IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 AIX
    ppc64-64 j9vmap6423-20080315 (JIT enabled)
J9VM - 20080314_17962_BHdSMr
JIT  - 20080130_0718ifx2_r8
GC   - 200802_08
JCL  - 20080314

Big Refactoring: Separate Domain from Presentation

In his landmark book Refactoring: Improving the Design of Existing Code, Martin Fowler not only presents a catalog of “regular” refactorings, he also mentions some “big” refactorings. These big refactorings are not described as a series of atomic steps to follow, but more as a recipe for applying a longer series of regular refactorings. And since they are bigger than regular refactorings, they also take a lot longer to complete, sometimes even months.

I’m in the middle of one of these: Separate Domain from Presentation. Now, we all know that we shouldn’t put business logic in interface code, so why do I find myself in this situation?

Well, technically, I don’t 😉 We use Struts, which has a nice MVC architecture. However, it’s the Controller part that has me worried. In Struts, one writes Action classes to control application flow:

“The goal of an Action class is to process a request, via its execute() method, and return an ActionForward object that identifies where control should be forwarded (e.g. a JSP, Tile definition, Velocity template, or another Action) to provide the appropriate response.”

It is, however, all too convenient to implement the business logic in Action classes as well. The Struts documentation even warns about this danger:

“Perform the processing required to deal with this request (such as saving a row into a database). This can be done by logic code embedded within the Action class itself, but should generally be performed by calling an appropriate method of a business logic bean.”

And that’s exactly what’s happened in our code. So I guess I’m actually in the middle of Separate Domain from Controller 😉

Fixing this is not a trivial task. The Action classes use ActionForm classes that hold data entered in the UI to perform their work. This ties them to Struts, which I don’t like at all. For instance, it makes it very hard for us to switch to a different web framework, should we so choose. It also means that simple solutions like Extract Method won’t work, since the extracted method would get the ActionForm as a parameter.

My solution has been to introduce what I call Service classes. A Service class has one method that implements the service that the Action provides. The method has one parameter, which is a Parameter Object, that contains the same information as the Action’s ActionForm does. I call them Service classes, since these classes could very well be used to implement web services as well.
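To make the shape concrete, here is a minimal sketch; all class and field names are made up for illustration, not taken from our actual code:

```java
// Hypothetical Parameter Object: a plain bean that mirrors the fields
// of the corresponding ActionForm, but has no Struts imports.
class SaveDocumentParameters {
    private String title;
    String getTitle() { return title; }
    void setTitle(String title) { this.title = title; }
}

// Hypothetical result object returned by the service method.
class SaveDocumentResult {
    private final boolean success;
    SaveDocumentResult(boolean success) { this.success = success; }
    boolean isSuccess() { return success; }
}

// The Service class holds the business logic, free of any web framework
// dependency, so it could just as well back a web service.
class SaveDocumentService {
    SaveDocumentResult execute(SaveDocumentParameters parameters) {
        boolean valid = parameters.getTitle() != null
                && parameters.getTitle().length() > 0;
        // ... the real business logic would go here ...
        return new SaveDocumentResult(valid);
    }
}
```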

Anyway, all the Action class has to do now, is:

  1. instantiate the appropriate Parameter Object
  2. populate it from the ActionForm
  3. instantiate the appropriate Service class
  4. call the appropriate method on the Service class (passing the Parameter Object)
  5. update the ActionForm from the method’s result object
  6. construct an ActionForward (possibly using information from the result object)

Luckily, I could automate all of that in a base class using reflection, so that each Action class now only needs two methods: one for instantiating the Parameter Object, and one for instantiating the Service class.
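The reflective wiring could look something like the following sketch, with the Struts-specific parts (ActionForm population, ActionForward construction) left out; the names are hypothetical:

```java
import java.lang.reflect.Method;

// Sketch of the reflective base class. Subclasses only say which
// Parameter Object and which Service to use; the base class finds the
// single service method that accepts the Parameter Object.
abstract class ServiceInvokingAction {

    protected abstract Object createParameterObject();

    protected abstract Object createService();

    Object invokeService() {
        Object parameters = createParameterObject();
        Object service = createService();
        for (Method method : service.getClass().getMethods()) {
            // Skip methods inherited from Object, such as equals().
            if (method.getDeclaringClass() == Object.class) {
                continue;
            }
            Class<?>[] types = method.getParameterTypes();
            if (types.length == 1 && types[0].isInstance(parameters)) {
                try {
                    return method.invoke(service, parameters);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        }
        throw new IllegalStateException("No suitable service method found");
    }
}

// Tiny made-up example to exercise the base class.
class EchoParameters {
    String text = "hi";
}

class EchoService {
    public String echo(EchoParameters parameters) {
        return parameters.text;
    }
}

class EchoAction extends ServiceInvokingAction {
    protected Object createParameterObject() { return new EchoParameters(); }
    protected Object createService() { return new EchoService(); }
}
```

Note that this sketch deliberately ignores method order, for the reasons described in the first post above.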

Still, that leaves a lot of Actions to convert. And to make matters worse, they are organized into class hierarchies, which makes it hard to convert them one by one. So I guess I won’t be sitting idle any time soon…

Log Files to the Rescue

Yesterday I got an email from a client describing a really, really weird situation that had occurred with our product. Of course, they couldn’t provide a way to reproduce the problem. Fortunately, there were only two users on the system at the time (it was in their integration testing environment), so they could tell what each of them was doing.

One person’s actions I could dismiss pretty quickly as the cause of the problem, so it must have been what the other did. However, her actions also seemed unlikely to have caused the problem. I started exercising the system in ways related to her actions, in the hope of reproducing the problem. No luck whatsoever.

So I stepped back a little and started reasoning from the code. What could possibly have caused this? I came up with a scenario, tried it, and sure enough, there it was. But the problem was that my actions in no way resembled the description of the client’s actions. And on top of that, my actions seemed rather bizarre. Why would anyone want to do this?

I know debugging isn’t always an exact science, but my hypothesis was in real need of some testing.

Enter log files. Our product is a web application running in Apache Tomcat, for which it’s pretty easy to enable logging. Tomcat’s access log follows the Common Logfile Format, which looks like this (all on one line):

127.0.0.1 8080 - - [27/Jun/2008:08:41:49 +0200] 
"GET /docato-composer/getLoginDialog.do HTTP/1.1" 200 3132

Each HTTP request is logged on a single line, with the IP address of the client first, then some identity information (missing in the example), the time, the kind of request (GET), the URL, the protocol (HTTP/1.1), the result status code, and the result size. (Tools like Webalizer can parse such log files easily to provide statistics for web sites.)
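For instance, a rough Java regular expression that matches log lines with this layout (good enough for searching, not a full-blown parser):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts the request URL from access log lines shaped like the
// example above (four leading fields, timestamp, request, status, size).
class AccessLogParser {

    private static final Pattern LINE = Pattern.compile(
            "(\\S+) (\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] "
            + "\"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\S+)");

    // Returns the URL (group 7), or null if the line doesn't match.
    static String urlOf(String logLine) {
        Matcher matcher = LINE.matcher(logLine);
        return matcher.matches() ? matcher.group(7) : null;
    }
}
```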

I got the access log from our client, and put on my CSI hat. For each of the steps in my scenario, I looked up the associated URL and searched for it in the log. And yes, bizarre as it may have appeared to me, they were all there: conveniently one after the other, from the same IP address and just before the time the client noticed the problem. Case closed.

The moral of this story is that log files are a Good Idea™. Without them I might have dismissed my scenario as too unlikely, and have spent valuable time chasing alternative hypotheses. Also, while browsing the log files, I stumbled upon two other problems that the client didn’t even report. I fixed these as a bonus 😀

Automated distribution creation (2)

In my previous post I talked about how I managed to automatically download the release notes from our issue tracker web site. These notes still needed to be added to our NEWS file, which describes the changes between releases.

There are really two scenarios to deal with here: the release notes for the current release either are already in the NEWS file, or they are not. They are already there when you rebuild the distribution for a release, for example when you’ve found something wrong with it and fixed that. For a human, this is pretty simple to detect, but how does an Ant script know?

Enter the Ant filter chain. This construct resembles a Unix pipe in that you can use it to feed the output of one element as input to the next. Here’s how I retrieve the version that is currently in the NEWS file:

<loadfile property="current.version"
    srcFile="${news.file}">
  <filterchain>
    <headfilter lines="1"/>
    <striplinebreaks/>
    <tokenfilter>
      <replaceregex pattern="[a-zA-Z\s]*([1-9]+\.[0-9]+).*"
          replace="\1"/>
      <replacestring from="." to="\."/>
    </tokenfilter>
  </filterchain>
</loadfile>

The loadfile task loads the srcFile into the current.version property. But not just as is: a filterchain is applied first. The first item in the chain is headfilter, which works just like the Unix head command: in this case it gives the first line of the NEWS file. I don’t want a line, but a string, so next I remove the line ending with the striplinebreaks filter.

Then it’s time for a good old regular expression to extract the version number from the string. The first line of the NEWS file looks like this: Changes in 1.4.0. So I match the text with [a-zA-Z\s]* and then the actual version number with ([1-9]+\.[0-9]+).*.

Note that I use a group to capture only the major and minor version (1.4 in the previous example). The reason for that is that whenever we deliver patch releases, we don’t add a whole new section to the NEWS file, but just expand the current section with the few cases that were fixed by the patch. Since we sort the cases in descending order of reporting, the patch cases will always be at the top.
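You can try the expression out quickly from Java; note how the group keeps only the major and minor parts:

```java
// Quick check of the version-extracting expression against the first
// line of a NEWS file.
class VersionExtractor {
    static String majorMinor(String firstLine) {
        return firstLine.replaceAll("[a-zA-Z\\s]*([1-9]+\\.[0-9]+).*", "$1");
    }
}
```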

Following the regular expression there is a replacestring filter that inserts backslashes before the dots. The reason for that becomes clear when we look at how the Ant script actually uses the current.version property:

<condition property="same.release">
  <matches string="${full.version}"
      pattern="${current.version}"/>
</condition>
<antcall target="--remove-current-release-from-news"/>
<antcall target="--add-current-release-to-news"/>

The --remove-current-release-from-news target is only executed when the same.release property is true:

<target name="--remove-current-release-from-news"
    if="same.release">
  <property name="previous.version.file"
      value="${news.dir}/previous.version.txt"/>
  <echo message="${previous.version}"
      file="${previous.version.file}"/>
  <loadfile srcFile="${previous.version.file}"
      property="escaped.previous.version">
    <filterchain>
      <tokenfilter>
        <replacestring from="." to="\."/>
      </tokenfilter>
    </filterchain>
  </loadfile>
  <delete file="${previous.version.file}"/>
  <replaceregexp file="${news.file}"
      match=".*(Changes in ${escaped.previous.version}.*)"
      replace="\1" flags="s"/>          
</target>

The bulk of the work is done in the final replaceregexp task, where everything before the text Changes in <x>.<y>.<z> is deleted. The code before that is just a convoluted way to escape the dots in the previous version number. Unfortunately, I’m not aware of any Ant task that can execute a regular expression against a property, so I first put the property into a temporary file and then operate on that file.

Finally, all that is left, is to add the release notes for the current version to the NEWS file:

<target name="--add-current-release-to-news">
  <property name="new.news.file"
      value="${news.dir}/new.news.txt"/>
  <concat destfile="${new.news.file}">
    <path>
      <pathelement location="${release.news.file}"/>
      <pathelement location="${news.file}"/>
    </path>
  </concat>
  <move file="${new.news.file}"
      tofile="${news.file}"/>
  <delete file="${release.news.file}"/>
</target>

The only tricky part here is that the concat task doesn’t allow one of its input files to also be the output file. So I have to introduce a temporary file. Then when all is done, the file containing the NEWS section for this release, release.news.file, is no longer needed.

Automated distribution creation

So we have this automated build with CruiseControl. It generates code, compiles, deploys, and tests. It’s saved my skin a gazillion times. It’s really great.

But it could be even better. It could also build a complete distribution, making the whole software release process a non-event. That’s one of my goals for the coming weeks. So stay tuned. 😉

Currently, the process to build a distribution of our product requires a couple of manual steps. One of these steps is to update the NEWS file, which describes the changes between releases. Of course, everything that changes between releases is documented in the issue tracking system, in our case FogBugz. (FogBugz is OK to work with most of the time, although I think there are better alternatives, like Jira.)

FogBugz lets you add release notes to each issue (which it calls case), and it provides a standard report to show the release notes for all cases scheduled for a specific release. You can even download this report in XML.

The only problem is that this functionality doesn’t work most of the time. The only time it is guaranteed to work is when you try it on the server that hosts FogBugz. Since this machine is in the server room, this is inconvenient to say the least. But even if this functionality worked flawlessly every time, everywhere, it would still be a manual step to collect the XML file.

So I turned to HtmlUnit, a “browser for Java programs. It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc… just like you do in your normal browser.” We use this great tool a lot to write our acceptance tests.

This time, I used HtmlUnit’s WebClient from within an Ant task to log in to FogBugz, generate the release notes report, extract cases with release notes (some cases have none, since they are too trivial to bother the end user with), and write them to an XML file. This allows me to transform the XML file to plain text using XSLT, giving a NEWS file section for the current release. The next step is to automagically add this to the existing NEWS file. This should be easy enough using Ant’s concat task. I will let you know how this works out.

Breaking Encapsulation

Last week, we tested the upgrade procedure for the new version of our product. We got a backup from one of our clients that was over 60Gb, so we could put it to good use by testing the performance of the upgrade against it. This sort of testing is always crucial for making sure the upgrade won’t disrupt production too much.

One of the steps in the upgrade was the deletion of stale data. It dealt with two entities in a one-to-many relationship. For this discussion, let’s call these entities A and B. For each A, there can be multiple Bs, whereas each B is associated with exactly one A. The upgrade used our product’s API to select the A objects matching the required criteria, and then delete them. The API implementation makes sure that when an A object is deleted, its B objects are also deleted.

This is standard encapsulation practice, nothing fancy. But there was one problem with it: the deletion process was way too slow. We broke it off after over three and a half hours, which is clearly unacceptable.

So we turned to the code, and found two loops: one iterating over the A objects, and within the delete() of A, one iterating over the B objects. Since there can be many, many B objects to search through, this inner loop really hurts when executed repeatedly. We say that this algorithm is O(n×m), where n is the number of A objects and m the number of B objects. By first deleting all B objects related to A objects that match the criteria, and only then deleting the A objects, we could potentially change the algorithm to O(n+m), which of course is much faster.

That didn’t work out, though, since the delete() method in class A still contained the loop over B objects, even though we now knew for sure that none of the B objects would match (since we deleted them previously). So we broke encapsulation by extracting a doDelete() method that just deletes the A object, nothing more.

We had a similar problem with B’s delete() method. This code sends a notification to its A object and performs other housekeeping. In our situation, this is clearly unnecessary, since that A object is about to be deleted as well. So we again broke encapsulation and extracted a doDelete() method for class B as well.
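Schematically, the two-pass deletion looks like this (class names follow the A/B naming above; the actual persistence calls are stubbed out):

```java
import java.util.List;
import java.util.Set;

class A {
    void doDelete() { /* delete just this A, no scan over its Bs */ }
}

class B {
    final A owner;
    B(A owner) { this.owner = owner; }
    void doDelete() { /* delete just this B, no notification to its A */ }
}

class StaleDataCleanup {

    // First pass: delete all Bs belonging to a matching A. Each B is
    // visited exactly once, and the Set lookup is O(1), so the total
    // cost is O(n + m) instead of the O(n x m) of the nested loops.
    static int cleanUp(Set<A> matchingAs, List<B> allBs) {
        int deleted = 0;
        for (B b : allBs) {
            if (matchingAs.contains(b.owner)) {
                b.doDelete();
                deleted++;
            }
        }
        // Second pass: delete the As themselves; their Bs are already
        // gone, so doDelete() need not search for them.
        for (A a : matchingAs) {
            a.doDelete();
            deleted++;
        }
        return deleted;
    }
}
```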

Now we had the performance we required: the deletion process was down to two minutes. But we lost encapsulation. Being well-experienced in object oriented techniques, we knew that would open the door to all sorts of trouble. But we also knew that this change was absolutely necessary to get the required performance.

So we went into damage control mode. We made the doDelete() methods protected, and moved the upgrade code to the same package as the API implementation code, to still be able to call the doDelete()s. Still not optimal, but sometimes a man’s got to do what a man’s got to do…

The Law of Demeter

In my previous post I used the Law of Demeter as a motivation for the Hide Delegate refactoring:

The Law of Demeter for functions requires that a method M of an object O may only invoke the methods of the following kinds of objects:

  1. O itself
  2. M’s parameters
  3. any objects created/instantiated within M
  4. O’s direct component objects

Code that violates the Law of Demeter is a candidate for Hide Delegate, e.g. manager = john.getDepartment().getManager() can be refactored to manager = john.getManager(), where the Employee class gets a new getManager() method.
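A minimal sketch of the refactored classes (simplified to just the methods involved):

```java
class Department {
    private final Employee manager;
    Department(Employee manager) { this.manager = manager; }
    Employee getManager() { return manager; }
}

class Employee {
    private Department department;
    void setDepartment(Department department) { this.department = department; }

    // The new delegating method added by Hide Delegate: clients no
    // longer need to know about Department at all.
    Employee getManager() { return department.getManager(); }
}
```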

However, not all such refactorings make as much sense. Consider, for example, someone who’s trying to kiss up to his boss: sendFlowers(john.getManager().getSpouse()). Applying Hide Delegate here would yield a getManagersSpouse() method in Employee. Yuck.

I have a couple of problems with this use of Hide Delegate. First of all, it creates methods that by definition reek of feature envy. Second, methods like getManagersSpouse() clearly violate the single responsibility principle. Finally, the Law of Demeter clashes with the concept of fluent interfaces.

Luckily, you can always back out from an adverse Hide Delegate by applying the opposite refactoring: Remove Middle Man.

Automating refactorings

I’m a big fan of both refactoring and automation. It’s no wonder, then, that the support for automated refactoring in Eclipse makes me very happy. I find that it makes me a lot more productive, and I produce better code. That’s because performing a refactoring is easy and fast enough to actually do it.

I also find that I refactor routinely. Where Martin Fowler, in his classic book Refactoring, gives the advice not to mix refactoring and adding new functionality, I do it almost mindlessly anyway. No need to run unit tests before and after the refactorings, since I know they Just Work™.

Not so with any refactorings that are not supported by the tool, though. For instance, when trying to adhere to the Law of Demeter, one would want to perform the refactoring Hide Delegate. Unfortunately, Eclipse has no support for this refactoring 😦 You can, however, simulate this refactoring using a combination of other refactorings. Let me explain that using a simple example.

We start with the following abstract code that shows the situation before we want to apply Hide Delegate:

public class Client {

  public void example() {
    final Server server = new Server();
    server.getDelegate().method();  
  }

}

public class Server {

  private final Delegate delegate = new Delegate();

  public Delegate getDelegate() {
    return delegate;
  }
  
}

public class Delegate {

  public void method() {
   // Do it...
  }
  
}

First, we perform Extract Method on server.getDelegate().method() (make the method public):

public class Client {

  public void example() {
    final Server server = new Server();
    method(server);  
  }

  public void method(final Server server) {
    server.getDelegate().method();
  }

}

Next, perform Move Method to move method() to Server:

public class Client {

  public void example() {
    final Server server = new Server();
    server.method();  
  }

}

public class Server {

  private final Delegate delegate = new Delegate();

  public Delegate getDelegate() {
    return delegate;
  }

  public void method() {
    getDelegate().method();
  }
  
}

And, voila, we have performed Hide Delegate!

Importing large data sets

For performance testing, it is often necessary to import a large data set to test against. However, importing large data sets presents its own challenges. Below I want to give some tips on how to deal with those.

  1. Begin with making backups. Not just of your current data, but also of the large data set you want to import. You may need to transform the data before importing it, and then it is useful to be able to go back to the original.
  2. Start with a representative subset of the large data set. This will allow you to test the import process without having to wait hours for feedback. Only when you’re convinced that everything works as expected, do you import the whole large data set.
  3. Test the limited data set end-to-end. For instance, the product I’m currently working on consists of a Content Management System (CMS, where people author content) and a Delivery System (DS, where people use the content). Data is imported into the CMS, edited, and finally published to the DS. In this situation, it is not enough to have a successful import into CMS. The publication to DS must also succeed.
  4. Automate the import. When things go wrong, you need to perform the import multiple times. It saves time to be able to run the import with a single command. Even if the import succeeds on the first try (one can dream), you might want to redo the import later, e.g. for performance testing against a new release, or when a new, even larger, data set becomes available.
  5. If you need to transform the data to make the import work, make sure to put the transformation scripts under version control, like your regular code (you do use a version control system, don’t you?). The build scripts that automate the import should also be put under version control.
  6. If you cannot get your hands on real-world data, you may still be able to do performance testing using generated data. The downside of this approach is that the generated data will probably not contain the exotic border cases that are usually present in real-life data.

Strange things happen…

On Fridays, I work from home, to prevent wasting time commuting. So today, I started out fresh, ready to rock and roll. But alas, I was off to a slow start.

To access my company’s resources, I use a Virtual Private Network (VPN). In particular, I use the VPN client software from Cisco on Ubuntu GNU/Linux. This piece of software has the tendency to break on every kernel update, however 😦 Yes, you guessed right, this week I received a kernel update to 2.6.24-17.

When I previously upgraded Ubuntu to 8.04, I received kernel 2.6.24-16 and then the VPN client broke as well. I had to apply a patch, which didn’t work: it couldn’t apply all changes. I then manually fixed the code to make sure all the changes in the patch were applied. And then the VPN client finally worked.

So I expected another one of those sessions. But this time, googling turned up nothing. Since I’m close to a deadline, I decided to simply restart my computer and choose the 2.6.24-16 kernel from the GRUB boot menu. Since it used to work with this kernel, I expected it to work now. But no such luck. I still got an error about the Connection Manager being unable to read the connection entry.

Getting a bit desperate, I redid the VPN client installation. Now it worked 😀 Feeling lucky, I rebooted into kernel 2.6.24-17, and it still worked. Sometimes I just don’t understand computers…

Update 2008-08-15: Check out this page with Unofficial Cisco VPN Client Updates for Linux.