**Building Portable In-Database Feature Engineering Pipelines with Ibis: A Lazy Python Approach**
Hey folks, today we’re going to explore how to build a portable, in-database feature engineering pipeline that feels as seamless as pandas but executes entirely within the database. We’ll show you how to connect to DuckDB, register your data safely, and define complex transformations using window functions and aggregations, all without pulling raw data into local memory.
**Getting Started**
Before we dive in, let’s get our ducks in a row. We’ll install the required libraries and set up our Ibis environment. Make sure to install `ibis-framework[duckdb,examples]`, `duckdb`, `pyarrow`, and `pandas`. Once you’ve got that taken care of, run the following code snippet:
```python
!pip -q install "ibis-framework[duckdb,examples]" duckdb pyarrow pandas

import ibis
from ibis import _

print("Ibis version:", ibis.__version__)

# Connect to an in-memory DuckDB database and enable interactive mode
# so expressions print their results when evaluated at the REPL.
con = ibis.duckdb.connect()
ibis.options.interactive = True
```
**Loading the Penguins Dataset**
Next, let’s load the Penguins dataset and register it in the DuckDB catalog. This ensures that the data is safely stored in the database and ready for SQL execution. You can find the full code here: [link to the full code].
```python
# Fetch the example dataset and register it as a table in DuckDB's catalog.
base_expr = ibis.examples.penguins.fetch(backend=con)
if "penguins" not in con.list_tables():
    con.create_table("penguins", base_expr, overwrite=True)

t = con.table("penguins")
print(t.schema())
```
**Defining the Feature Engineering Pipeline**
Now, let’s define our reusable feature engineering pipeline using pure Ibis expressions. We’ll compute derived features, apply data cleaning, and use window functions and grouped aggregations to build advanced, database-native features while keeping the entire pipeline lazy. You can find the full code here: [link to the full code].
```python
def penguin_feature_pipeline(penguins):
    # ... (rest of the code)
```
**Invoking the Feature Pipeline**
Finally, let’s invoke our feature pipeline and compile it into DuckDB SQL to validate that all transformations are pushed down to the database. We’ll then run the pipeline and return only the final aggregated results for inspection. You can find the full code here: [link to the full code].
```python
features = penguin_feature_pipeline(t)

# Compile to DuckDB SQL to confirm all transformations are pushed down.
print(con.compile(features))

# Execute and pull back only the final aggregated result.
try:
    df = features.to_pandas()
except Exception:
    df = features.execute()

print(df.head())
```
**Conclusion**
That’s it! We’ve built, compiled, and executed a sophisticated feature engineering workflow entirely within DuckDB using Ibis. We demonstrated how to inspect the generated SQL, materialize results directly within the database, and export them for downstream use while preserving portability across analytical backends.
This approach reinforces the core concept behind Ibis: we keep computation near the data, reduce unnecessary data movement, and maintain a single, reusable Python codebase that scales from local experimentation to production databases.
**Try the Full Code Here**
You can find the full code here: [link to the full code].
