Apache Phoenix

User-defined Functions

Create, deploy, and manage custom Phoenix UDFs with configuration, implementation steps, and known limitations.

As of Phoenix 4.4.0, users can create and deploy their own custom or domain-specific UDFs to the cluster.

Overview

Users can create temporary or permanent user-defined (domain-specific) scalar functions. UDFs can be used like built-in functions in SELECT, UPSERT, and DELETE queries, and when creating functional indexes. Temporary functions are session-scoped and are not accessible from other sessions. Permanent function metadata is stored in the SYSTEM.FUNCTION table. Phoenix also supports tenant-specific functions: functions created in one tenant-specific connection are not visible to other tenant-specific connections, and only functions created through a global (no-tenant) connection are visible to all connections.
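For example, a permanent function and a session-scoped temporary function could be created as follows (the function names and implementation class here are hypothetical, not part of Phoenix):

```sql
-- Permanent: metadata is stored in SYSTEM.FUNCTION and survives the session.
CREATE FUNCTION my_reverse(varchar) RETURNS varchar
    AS 'com.mycompany.udf.MyReverseFunction';

-- Temporary: visible only within the current session.
CREATE TEMPORARY FUNCTION my_reverse_tmp(varchar) RETURNS varchar
    AS 'com.mycompany.udf.MyReverseFunction';
```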

Phoenix leverages the HBase dynamic class loader to load UDF JARs from HDFS at the Phoenix client and region server without restarting services.

Configuration

Add the following parameters to hbase-site.xml on the Phoenix client:

<property>
  <name>phoenix.functions.allowUserDefinedFunctions</name>
  <value>true</value>
</property>
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>${hbase.tmp.dir}/hbase</value>
  <description>The directory shared by region servers and into
    which HBase persists.  The URL should be 'fully-qualified'
    to include the filesystem scheme.  For example, to specify the
    HDFS directory '/hbase' where the HDFS instance's namenode is
    running at namenode.example.org on port 9000, set this value to:
    hdfs://namenode.example.org:9000/hbase.  By default, we write
    to whatever ${hbase.tmp.dir} is set too -- usually /tmp --
    so change this configuration or else all data will be lost on
    machine restart.</description>
</property>
<property>
  <name>hbase.dynamic.jars.dir</name>
  <value>${hbase.rootdir}/lib</value>
  <description>
    The directory from which the custom udf jars can be loaded
    dynamically by the phoenix client/region server without the need to restart. However,
    an already loaded udf class would not be un-loaded. See
    HBASE-1936 for more details.
  </description>
</property>

The last two configuration values should match the HBase server-side configuration.

As with other configuration properties, phoenix.functions.allowUserDefinedFunctions may be specified at JDBC connection time as a connection property.

Example:

Properties props = new Properties();
props.setProperty("phoenix.functions.allowUserDefinedFunctions", "true");
Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost", props);

The following optional parameter is used by the dynamic class loader to copy JARs from HDFS into the local filesystem:

<property>
  <name>hbase.local.dir</name>
  <value>${hbase.tmp.dir}/local/</value>
  <description>Directory on the local filesystem to be used
    as a local storage.</description>
</property>

Creating Custom UDFs

Implement your custom UDF by following the steps in the How to write custom UDF section below.

Compile your code into a JAR, then deploy the JAR to HDFS. It is recommended to add the JAR to the HDFS directory configured by hbase.dynamic.jars.dir.

Run the CREATE FUNCTION query.
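Putting the two steps together, a deployment might look like the following sketch; the paths, JAR name, class name, and function name are all hypothetical:

```sql
-- Deploy the JAR first, e.g.:
--   hdfs dfs -put my-udfs.jar hdfs://namenode.example.org:9000/hbase/lib/
-- then register the function:
CREATE FUNCTION my_reverse(varchar) RETURNS varchar
    AS 'com.mycompany.udf.MyReverseFunction'
    USING JAR 'hdfs://namenode.example.org:9000/hbase/lib/my-udfs.jar';
```

If the JAR has been placed in the directory configured by hbase.dynamic.jars.dir, the USING JAR clause can be omitted and Phoenix will locate the JAR there.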

Dropping UDFs

You can drop functions using the DROP FUNCTION query. Dropping a function deletes the metadata for that function from Phoenix.
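For example (function name hypothetical):

```sql
DROP FUNCTION IF EXISTS my_reverse;
```

Note that dropping a function only removes its metadata; the JAR in HDFS is not deleted (see Limitations below).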

How to write custom UDF

You can follow these steps to write your UDF (for more detail, see this blog post):

  • Create a new class derived from org.apache.phoenix.expression.function.ScalarFunction.
  • Implement getDataType() to determine the function return type.
  • Implement evaluate() to calculate the result for each row. The method receives org.apache.phoenix.schema.tuple.Tuple with the current row state and an org.apache.hadoop.hbase.io.ImmutableBytesWritable to populate with the function result. The method returns false if there is not enough information to calculate the result (usually because one argument is unknown), and true otherwise.
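The steps above can be sketched as the following minimal UDF, a hypothetical function that reverses a VARCHAR argument. This is an illustrative outline, not Phoenix source: it assumes phoenix-core is on the classpath, and the class, function name, and exact helper signatures should be checked against the Phoenix version in use.

```java
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.phoenix.expression.Expression;
import org.apache.phoenix.expression.function.ScalarFunction;
import org.apache.phoenix.schema.tuple.Tuple;
import org.apache.phoenix.schema.types.PDataType;
import org.apache.phoenix.schema.types.PVarchar;

// Hypothetical UDF: reverses its single VARCHAR argument.
public class MyReverseFunction extends ScalarFunction {
    public static final String NAME = "MY_REVERSE";

    @Override
    public String getName() {
        return NAME;
    }

    @Override
    public PDataType getDataType() {
        return PVarchar.INSTANCE; // the function's return type
    }

    @Override
    public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) {
        // Evaluate the argument expression against the current row state.
        Expression arg = getChildren().get(0);
        if (!arg.evaluate(tuple, ptr)) {
            return false; // argument value not available yet
        }
        String value = (String) PVarchar.INSTANCE.toObject(ptr, arg.getSortOrder());
        if (value == null) {
            return true; // ptr already points at the empty (null) value
        }
        // Write the result bytes into ptr for the caller to consume.
        ptr.set(PVarchar.INSTANCE.toBytes(new StringBuilder(value).reverse().toString()));
        return true;
    }
}
```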

Below are additional optimization-related steps.

  • To contribute to scan start/stop key formation, custom functions need to override the following two methods from ScalarFunction:

    /**
     * Determines whether or not a function may be used to form
     * the start/stop key of a scan
     * @return the zero-based position of the argument to traverse
     *  into to look for a primary key column reference, or
     *  {@value #NO_TRAVERSAL} if the function cannot be used to
     *  form the scan key.
     */
    public int getKeyFormationTraversalIndex() {
        return NO_TRAVERSAL;
    }

    /**
     * Manufactures a KeyPart used to construct the KeyRange given
     * a constant and a comparison operator.
     * @param childPart the KeyPart formulated for the child expression
     *  at the {@link #getKeyFormationTraversalIndex()} position.
     * @return the KeyPart for constructing the KeyRange for this
     *  function.
     */
    public KeyPart newKeyPart(KeyPart childPart) {
        return null;
    }
  • Additionally, to enable ORDER BY optimization or in-place GROUP BY, override:

    /**
     * Determines whether or not the result of the function invocation
     * will be ordered in the same way as the input to the function.
     * Returning YES enables an optimization to occur when a
     * GROUP BY contains function invocations using the leading PK
     * column(s).
     * @return YES if the function invocation will always preserve order
     * for the inputs versus the outputs, YES_IF_LAST if the function
     * preserves order, but any further column reference would not
     * continue to preserve order, and NO if the function does not
     * preserve order.
     */
    public OrderPreserving preservesOrder() {
        return OrderPreserving.NO;
    }

Limitations

  • The JAR containing UDFs must be manually added to and deleted from HDFS. There is ongoing work to add SQL statements for JAR add/remove (PHOENIX-1890).
  • The dynamic class loader copies UDF JARs to {hbase.local.dir}/jars at the Phoenix client and region server when a UDF is used in queries. These JARs must be deleted manually when a function is deleted.
  • Functional indexes must be rebuilt manually if the function implementation changes (PHOENIX-1907).
  • Once loaded, a JAR is not unloaded. Use a different JAR for modified implementations to avoid restarting the cluster (PHOENIX-1907).
  • To list functions, query the SYSTEM."FUNCTION" table (PHOENIX-1921).
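For example, all functions visible to the current connection can be listed with the query below; the table name is double-quoted so that FUNCTION is not parsed as a keyword:

```sql
SELECT * FROM SYSTEM."FUNCTION";
```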