Working with Exprs¶
Expr is short for “expression”. It is a core abstraction in DataFusion for representing a computation, and follows the standard “expression tree” abstraction found in most compilers and databases.
For example, the SQL expression a + b would be represented as an Expr with a BinaryExpr variant. A BinaryExpr has a left and right Expr and an operator.
As another example, the SQL expression a + b * c would be represented as an Expr with a BinaryExpr variant. The left Expr would be a and the right Expr would be another BinaryExpr with a left Expr of b and a right Expr of c. As a classic expression tree, this would look like:
┌────────────────────┐
│ BinaryExpr │
│ op: + │
└────────────────────┘
▲ ▲
┌───────┘ └────────────────┐
│ │
┌────────────────────┐ ┌────────────────────┐
│ Expr::Col │ │ BinaryExpr │
│ col: a │ │ op: * │
└────────────────────┘ └────────────────────┘
▲ ▲
┌────────┘ └─────────┐
│ │
┌────────────────────┐ ┌────────────────────┐
│ Expr::Col │ │ Expr::Col │
│ col: b │ │ col: c │
└────────────────────┘ └────────────────────┘
As the writer of a library, you can use Exprs to represent computations that you want to perform. This guide will walk you through how to make your own scalar UDF as an Expr and how to rewrite Exprs to inline the simple UDF.
Creating and Evaluating Exprs¶
Please see expr_api.rs for well commented code for creating, evaluating, simplifying, and analyzing Exprs.
A Scalar UDF Example¶
We’ll use a ScalarUDF expression as our example. This necessitates implementing an actual UDF, and for ease we’ll use the same example from the adding UDFs guide.
So assuming you’ve written that function, you can use it to create an Expr:
let add_one_udf = create_udf(
"add_one",
vec![DataType::Int64],
Arc::new(DataType::Int64),
Volatility::Immutable,
make_scalar_function(add_one), // <-- the function we wrote
);
// make the expr `add_one(5)`
let expr = add_one_udf.call(vec![lit(5)]);
// make the expr `add_one(my_column)`
let expr = add_one_udf.call(vec![col("my_column")]);
If you’d like to learn more about Exprs, before we get into the details of creating and rewriting them, you can read the expression user-guide.
Rewriting Exprs¶
rewrite_expr.rs contains example code for rewriting Exprs.
Rewriting Expressions is the process of taking an Expr and transforming it into another Expr. This is useful for a number of reasons, including:
Simplifying
Exprs to make them easier to evaluateOptimizing
Exprs to make them faster to evaluateConverting
Exprs to other forms, e.g. converting aBinaryExprto aCastExpr
In our example, we’ll use rewriting to update our add_one UDF, to be rewritten as a BinaryExpr with a Literal of 1. We’re effectively inlining the UDF.
Rewriting with transform¶
To implement the inlining, we’ll need to write a function that takes an Expr and returns a Result<Expr>. If the expression is not to be rewritten Transformed::no is used to wrap the original Expr. If the expression is to be rewritten, Transformed::yes is used to wrap the new Expr.
fn rewrite_add_one(expr: Expr) -> Result<Expr> {
expr.transform(&|expr| {
Ok(match expr {
Expr::ScalarUDF(scalar_fun) if scalar_fun.fun.name == "add_one" => {
let input_arg = scalar_fun.args[0].clone();
let new_expression = input_arg + lit(1i64);
Transformed::yes(new_expression)
}
_ => Transformed::no(expr),
})
})
}
Creating an OptimizerRule¶
In DataFusion, an OptimizerRule is a trait that supports rewritingExprs that appear in various parts of the LogicalPlan. It follows DataFusion’s general mantra of trait implementations to drive behavior.
We’ll call our rule AddOneInliner and implement the OptimizerRule trait. The OptimizerRule trait has two methods:
name- returns the name of the ruletry_optimize- takes aLogicalPlanand returns anOption<LogicalPlan>. If the rule is able to optimize the plan, it returnsSome(LogicalPlan)with the optimized plan. If the rule is not able to optimize the plan, it returnsNone.
struct AddOneInliner {}
impl OptimizerRule for AddOneInliner {
fn name(&self) -> &str {
"add_one_inliner"
}
fn try_optimize(
&self,
plan: &LogicalPlan,
config: &dyn OptimizerConfig,
) -> Result<Option<LogicalPlan>> {
// Map over the expressions and rewrite them
let new_expressions = plan
.expressions()
.into_iter()
.map(|expr| rewrite_add_one(expr))
.collect::<Result<Vec<_>>>()?;
let inputs = plan.inputs().into_iter().cloned().collect::<Vec<_>>();
let plan = plan.with_new_exprs(&new_expressions, &inputs);
plan.map(Some)
}
}
Note the use of rewrite_add_one which is mapped over plan.expressions() to rewrite the expressions, then plan.with_new_exprs is used to create a new LogicalPlan with the rewritten expressions.
We’re almost there. Let’s just test our rule works properly.
Testing the Rule¶
Testing the rule is fairly simple, we can create a SessionState with our rule and then create a DataFrame and run a query. The logical plan will be optimized by our rule.
use datafusion::prelude::*;
let rules = Arc::new(AddOneInliner {});
let state = ctx.state().with_optimizer_rules(vec![rules]);
let ctx = SessionContext::with_state(state);
ctx.register_udf(add_one);
let sql = "SELECT add_one(1) AS added_one";
let plan = ctx.sql(sql).await?.logical_plan();
println!("{:?}", plan);
This results in the following output:
Projection: Int64(1) + Int64(1) AS added_one
EmptyRelation
I.e. the add_one UDF has been inlined into the projection.
Getting the data type of the expression¶
The arrow::datatypes::DataType of the expression can be obtained by calling the get_type given something that implements Expr::Schemable, for example a DFschema object:
use arrow_schema::DataType;
use datafusion::common::{DFField, DFSchema};
use datafusion::logical_expr::{col, ExprSchemable};
use std::collections::HashMap;
let expr = col("c1") + col("c2");
let schema = DFSchema::new_with_metadata(
vec![
DFField::new_unqualified("c1", DataType::Int32, true),
DFField::new_unqualified("c2", DataType::Float32, true),
],
HashMap::new(),
)
.unwrap();
print!("type = {}", expr.get_type(&schema).unwrap());
This results in the following output:
type = Float32
Conclusion¶
In this guide, we’ve seen how to create Exprs programmatically and how to rewrite them. This is useful for simplifying and optimizing Exprs. We’ve also seen how to test our rule to ensure it works properly.