Serialization of Java sometimes complex and difficult to understand for me. I’ve read Effective Java and javadoc of JDK SE api docs. So I knew I understood the basic concept of serialization of Java object. But I have faced to a problem when I wrote Hive UDAF. This might be a problem every people encountered when they try to write Hive UDAF. So I try to list up the problem and the fact I found this time.

Hive SerDe requires hadoop.io.Text

You cannot use primitive Java object like String as output of UDAF. The output of UDAF is passed by terminate method of GenericUDAFEvaluator. But you cannot use int, String and other primitive Java objects here because Hive SerDe does not recognize it. You should use IntWritable, DoubleWritable and Text object provided by Hadoop MapReduce framework.

Unserializable hadoop.io.Text

In Hive UDAF, we should pass aggregated data from mapper to reducer. Otherwise we cannot obtain correct result of aggregated data. It can be done with terminatePartial and merge method of GenericUDAFEvaluator.

@Override
public Object terminatePartial(AggregationBuffer aggregationBuffer)
    throws HiveException {
  MyBuffer buf = (MyBuffer)aggregationBuffer;
  return buf.serialize();
}

You can make any class which inherits AggregationBuffer of Hive. Since terminatePartial returns any Object, it is better to serialize explicitly. You should implement serialize method to do so. But I found one thing here. MyBuffer includes hadoop.io.Text class because output should be Text object. But this code throws exception because Text is not serializable.

I found we must convert String (or other serializable object) to Text object in terminate object because AggregationBuffer cannot contain Text object unless buf.serialize() returns actually serializable object. In our case terminate method looks linkedin

List<String> stringList = // String list returned by AggregationBuffer
Object[] row = new Object[n]; // row is returned object which represents one row.
for (int i = 0; i < n; i++) {
  row[i] = new Text(stringList.get(i));
}
return row;

In summary,

We cannot use hadoop.io.Text in AggregationBuffer in UDAF because it is not serializable.
We should return hadoop.io.Text as returned object from UDAF otherwise Hive SerDe cannot recognize it.