Today I have big news: Hivemall has joined the Apache Incubator! Hivemall is a scalable machine learning library that runs on Hadoop. It was originally developed by Yui Makoto, a research engineer at Treasure Data Inc.
From now on, we call it Apache Hivemall. The top page is now open. Apache Hivemall is developed as a collection of Hive UDFs, so it can easily be integrated with Spark or other projects that are compatible with Hive UDFs. There are two ways to get started.
- Add the jar and define the functions in your Hive cluster
- Use Treasure Data
Install by yourself
If you already have a Hive cluster, you can easily install Apache Hivemall as an ordinary Hive UDF.
$ hive
> add jar /path/to/hivemall-core-xxx-with-dependencies.jar;
> source /path/to/define-all.hive;
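To confirm that the functions were registered in the current session, a quick check like the one below can help. This is a minimal sketch, assuming that the define-all.hive script you sourced registers a function named hivemall_version; if your version of the script does not define it, the second statement will fail.

```sql
-- Print the metadata of the function; this fails if it was not registered.
DESCRIBE FUNCTION hivemall_version;

-- Print the version of the Hivemall jar loaded in this session.
SELECT hivemall_version();
```

Note that add jar and source only affect the current Hive session, so both have to be run again in each new session.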
These resources can be downloaded from here. hivemall-core is the core module, which includes various types of UDFs. hivemall-nlp includes natural language processing utilities with its own dictionary. hivemall-spark is for Spark integration, as the name suggests.
Use Treasure Data
Apache Hivemall is hosted on the Treasure Data service. If you don’t want to run and maintain your own Hive cluster, that’s a good option. Apache Hivemall is just a collection of Hive UDFs, so data analysts and engineers who use SQL in their daily analysis can easily start using it.
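To give a feel for what "just SQL" means here, the query below sketches how a logistic regression model might be trained. It is an illustration under assumptions, not an official recipe: the table training and its columns features (an array of "name:value" strings) and label are hypothetical, and the function names train_logregr and add_bias should be checked against the define-all.hive script of the version you installed.

```sql
-- Hypothetical input table: training(features array<string>, label float)
-- Each mapper emits (feature, weight) pairs; averaging them per feature
-- combines the partial models into a single one.
SELECT
  feature,
  avg(weight) AS weight
FROM (
  SELECT train_logregr(add_bias(features), label) AS (feature, weight)
  FROM training
) t
GROUP BY feature;
```

The result is an ordinary table of feature weights, so prediction can then be expressed as a join between this table and the rows to be scored.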
Although Apache Hivemall has only just entered incubation, there are many ideas waiting to be developed. Please check the Hivemall JIRA and the issues on GitHub. In addition, we are now preparing guidelines and a roadmap for the incubation releases. Please keep an eye on them.
Last but not least, growing a broad community is very important for an incubating project. Apache Hivemall welcomes patches at any time. Please join our community as a developer, a user, or any other type of contributor.
Thanks!