    An Implementation Guide to Building a DuckDB-Python Analytics Pipeline with SQL, DataFrames, Parquet, UDFs, and Performance Profiling

    By Naveed Ahmad · 13/04/2026 · 16 Mins Read


    In this tutorial, we build a comprehensive, hands-on understanding of DuckDB-Python by working through its features directly in code on Colab. We start with the fundamentals of connection management and data generation, then move into real analytical workflows, including querying Pandas, Polars, and Arrow objects without manual loading, transforming results across multiple formats, and writing expressive SQL for window functions, pivots, macros, recursive CTEs, and joins. As we progress, we also explore performance-oriented capabilities such as bulk insertion, profiling, partitioned storage, multi-threaded access, remote file querying, and efficient export patterns, so we not only learn what DuckDB can do, but also how to use it as a serious analytical engine within Python.

    import subprocess, sys
    
    
    for pkg in ["duckdb", "pandas", "pyarrow", "polars"]:
        try:
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", "-q", pkg],
                stderr=subprocess.DEVNULL,
            )
        except subprocess.CalledProcessError:
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", "-q", "--break-system-packages", pkg],
                stderr=subprocess.DEVNULL,
            )
    
    
    import duckdb
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    import polars as pl
    import numpy as np
    import json, os, time, threading, tempfile
    from datetime import date, datetime, timedelta
    
    
    print(f"DuckDB version : {duckdb.__version__}")
    print(f"Pandas version : {pd.__version__}")
    print(f"PyArrow version: {pa.__version__}")
    print(f"Polars version : {pl.__version__}")
    print("=" * 72)
    
    
    WORKDIR = tempfile.mkdtemp(prefix="duckdb_tutorial_")
    os.chdir(WORKDIR)
    print(f"Working directory: {WORKDIR}\n")
    
    
    print("=" * 72)
    print("SECTION 1: Connection Management")
    print("=" * 72)
    
    
    con = duckdb.connect()
    print(con.sql("SELECT 'Hello from in-memory DuckDB!' AS greeting").fetchone()[0])
    
    
    DB_PATH = os.path.join(WORKDIR, "tutorial.duckdb")
    pcon = duckdb.connect(DB_PATH)
    pcon.sql("CREATE OR REPLACE TABLE persisted(id INT, val TEXT)")
    pcon.sql("INSERT INTO persisted VALUES (1,'alpha'), (2,'beta')")
    print("Persisted rows:", pcon.sql("SELECT count(*) FROM persisted").fetchone()[0])
    pcon.close()
    
    
    pcon2 = duckdb.connect(DB_PATH)
    print("After re-open :", pcon2.sql("SELECT * FROM persisted ORDER BY id").fetchall())
    pcon2.close()
    
    
    con_cfg = duckdb.connect(config={
        "threads": 2,
        "memory_limit": "512MB",
        "default_order": "DESC",
    })
    print("Configured threads:", con_cfg.sql("SELECT current_setting('threads')").fetchone()[0])
    con_cfg.close()
    
    
    with duckdb.connect() as ctx:
        ctx.sql("SELECT 42 AS answer").show()
    
    
    print()
    
    
    print("=" * 72)
    print("SECTION 2: Synthetic Data Generation")
    print("=" * 72)
    
    
    con = duckdb.connect()
    
    
    con.sql("""
        CREATE OR REPLACE TABLE sales AS
        SELECT
            i                                                     AS order_id,
            '2023-01-01'::DATE + (i % 365)::INT                   AS order_date,
            CASE (i % 5)
                WHEN 0 THEN 'Electronics'
                WHEN 1 THEN 'Clothing'
                WHEN 2 THEN 'Groceries'
                WHEN 3 THEN 'Furniture'
                ELSE        'Books'
            END                                                   AS category,
            ROUND(10 + random() * 990, 2)                         AS amount,
            CASE (i % 3)
                WHEN 0 THEN 'US'
                WHEN 1 THEN 'EU'
                ELSE        'APAC'
            END                                                   AS region,
            CASE WHEN random() < 0.1 THEN TRUE ELSE FALSE END     AS returned
        FROM generate_series(1, 100000) t(i)
    """)
    
    
    con.sql("SUMMARIZE sales").show()
    print()
    
    
    print("=" * 72)
    print("SECTION 3: Zero-Copy DataFrame Integration")
    print("=" * 72)
    
    
    pdf = pd.DataFrame({
        "product": ["Widget", "Gadget", "Doohickey", "Thingamajig"],
        "price":   [9.99, 24.50, 4.75, 15.00],
        "stock":   [120, 45, 300, 78],
    })
    print("Query Pandas DF directly:")
    con.sql("SELECT product, price * stock AS inventory_value FROM pdf ORDER BY inventory_value DESC").show()
    
    
    plf = pl.DataFrame({
        "city":   ["Montreal", "Toronto", "Vancouver", "Calgary"],
        "temp_c": [-12.5, -5.0, 3.0, -18.0],
    })
    print("Query Polars DF directly:")
    con.sql("SELECT city, temp_c, temp_c * 9/5 + 32 AS temp_f FROM plf WHERE temp_c < 0").show()
    
    
    arrow_tbl = pa.table({
        "sensor_id": [1, 2, 3, 4, 5],
        "reading":   [23.1, 47.8, 12.3, 99.0, 55.5],
    })
    print("Query Arrow table directly:")
    con.sql("SELECT sensor_id, reading FROM arrow_tbl WHERE reading > 30").show()
    
    
    print()
    
    
    print("=" * 72)
    print("SECTION 4: Result Conversion")
    print("=" * 72)
    
    
    q = "SELECT category, SUM(amount) AS total FROM sales GROUP BY category ORDER BY total DESC"
    
    
    print("→ Python list  :", con.sql(q).fetchall()[:2], "...")
    print("→ Pandas DF    :\n", con.sql(q).df().head(3))
    print("→ Polars DF    :\n", con.sql(q).pl().head(3))
    print("→ Arrow Table  :", con.sql(q).arrow().schema)
    print("→ NumPy arrays :", {k: v[:2] for k, v in con.sql(q).fetchnumpy().items()})
    print()

    We set up the full DuckDB-Python environment by installing the required libraries and importing all the necessary modules for the tutorial. We create our working directory, initialize DuckDB connections, and explore both in-memory and persistent database usage along with basic configuration options. We also generate a large synthetic sales dataset and begin working with DuckDB’s direct integration with Pandas, Polars, and PyArrow, which shows us how naturally DuckDB fits into Python-based data workflows.

    print("=" * 72)
    print("SECTION 5: Relational API")
    print("=" * 72)
    
    
    rel = (
        con.table("sales")
           .filter("NOT returned")
           .aggregate("category, region, SUM(amount) AS revenue, COUNT(*) AS orders")
           .filter("revenue > 1000000")
           .order("revenue DESC")
           .limit(10)
    )
    print("Relational API result:")
    rel.show()
    
    
    top_cats = con.sql("SELECT DISTINCT category FROM sales ORDER BY category LIMIT 3")
    print("Top categories relation fed into next query:")
    con.sql("SELECT s.* FROM sales s SEMI JOIN top_cats ON s.category = top_cats.category LIMIT 5").show()
    
    
    print()
    
    
    print("=" * 72)
    print("SECTION 6: Window Functions & Advanced SQL")
    print("=" * 72)
    
    
    con.sql("""
        WITH daily AS (
            SELECT
                order_date,
                region,
                SUM(amount) AS daily_rev
            FROM sales
            WHERE NOT returned
            GROUP BY order_date, region
        )
        SELECT
            order_date,
            region,
            daily_rev,
            SUM(daily_rev) OVER (
                PARTITION BY region ORDER BY order_date
            ) AS cum_revenue,
            AVG(daily_rev) OVER (
                PARTITION BY region ORDER BY order_date
                ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
            ) AS rolling_7d_avg
        FROM daily
        QUALIFY row_number() OVER (PARTITION BY region ORDER BY order_date DESC) <= 3
        ORDER BY region, order_date DESC
    """).show()
    
    
    print("PIVOT table:")
    con.sql("""
        PIVOT sales
        ON region
        USING SUM(amount)
        GROUP BY category
        ORDER BY category
    """).show()
    
    
    print()
    
    
    print("=" * 72)
    print("SECTION 7: Complex / Nested Types")
    print("=" * 72)
    
    
    con.sql("""
        CREATE OR REPLACE TABLE users AS
        SELECT
            i AS user_id,
            {'first': 'User' || i::TEXT, 'last': 'Demo'}                          AS name,  -- struct literal (reconstructed; original was garbled)
            [i * 10, i * 20, i * 30]                                              AS scores,
            MAP {'tier': CASE WHEN i % 2 = 0 THEN 'gold' ELSE 'silver' END,
                 'region': CASE WHEN i % 3 = 0 THEN 'US' ELSE 'EU' END}           AS metadata
        FROM generate_series(1, 5) t(i)
    """)
    
    
    print("Struct field access, list indexing, map extraction:")
    con.sql("""
        SELECT
            user_id,
            name.first                   AS first_name,
            scores[1]                    AS first_score,
            list_aggregate(scores,'sum') AS total_score,
            metadata['tier']             AS tier
        FROM users
    """).show()
    
    
    print("Unnesting a list column:")
    con.sql("""
        SELECT user_id, unnest(scores) AS individual_score
        FROM users
        WHERE user_id <= 3
    """).show()
    
    
    print()
    
    
    print("=" * 72)
    print("SECTION 8: Python UDFs")
    print("=" * 72)
    
    
    def celsius_to_fahrenheit(c):
        return c * 9 / 5 + 32
    
    
    con.create_function("c2f", celsius_to_fahrenheit, ["DOUBLE"], "DOUBLE")
    con.sql("SELECT city, temp_c, c2f(temp_c) AS temp_f FROM plf").show()
    
    
    import pyarrow.compute as pc
    
    
    def vectorized_discount(prices):
        """Apply a 15% discount to all prices."""
        return pc.multiply(prices, 0.85)
    
    
    con.create_function(
        "discount",
        vectorized_discount,
        ["DOUBLE"],
        "DOUBLE",
        type="arrow",
    )
    print("Vectorized UDF (discount):")
    con.sql("SELECT product, price, discount(price) AS sale_price FROM pdf").show()
    
    
    print()
    
    
    print("=" * 72)
    print("SECTION 9: File I/O")
    print("=" * 72)
    
    
    con.sql("COPY (SELECT * FROM sales LIMIT 1000) TO 'sales_sample.csv'     (HEADER, DELIMITER ',')")
    con.sql("COPY (SELECT * FROM sales LIMIT 1000) TO 'sales_sample.parquet' (FORMAT PARQUET)")
    con.sql("COPY (SELECT * FROM sales LIMIT 100)  TO 'sales_sample.json'    (FORMAT JSON, ARRAY true)")
    
    
    print("Files written:", [f for f in os.listdir('.') if 'sales_sample' in f])
    
    
    print("\nCSV row count     :", con.sql("SELECT count(*) FROM 'sales_sample.csv'").fetchone()[0])
    print("Parquet row count :", con.sql("SELECT count(*) FROM 'sales_sample.parquet'").fetchone()[0])
    print("JSON row count    :", con.sql("SELECT count(*) FROM 'sales_sample.json'").fetchone()[0])
    
    
    print("\nParquet file metadata:")
    con.sql("SELECT * FROM parquet_metadata('sales_sample.parquet')").show()
    
    
    print()

    We move from basic querying into more expressive analytical patterns using DuckDB’s relational API and advanced SQL features. We work with filtered and aggregated sales relations, apply window functions for cumulative and rolling calculations, and reshape data using a pivot table for cross-category analysis. We also explore complex nested types, list unnesting, Python UDFs, and file I/O, which help us see how DuckDB handles both structured analytics and practical data engineering tasks in one place.

    print("=" * 72)
    print("SECTION 10: Hive-Partitioned Parquet")
    print("=" * 72)
    
    
    con.sql("""
        COPY sales TO 'partitioned_data' (
            FORMAT PARQUET,
            PARTITION_BY (region, category),
            OVERWRITE_OR_IGNORE
        )
    """)
    print("Partitioned directory tree (first 15 items):")
    for i, (root, dirs, files) in enumerate(os.walk("partitioned_data")):
        for f in files:
            fp = os.path.join(root, f)
            print(f"  {fp}")
        if i > 15:
            print("  ...")
            break
    
    
    print("\nReading partitioned data back (US + Electronics only):")
    con.sql("""
        SELECT count(*) AS cnt, MIN(amount), MAX(amount)
        FROM read_parquet('partitioned_data/**/*.parquet', hive_partitioning = true)
        WHERE region = 'US' AND category = 'Electronics'
    """).show()
    
    
    print()
    
    
    print("=" * 72)
    print("SECTION 11: Prepared Statements")
    print("=" * 72)
    
    
    result = con.execute(
        "SELECT * FROM sales WHERE category = $1 AND amount > $2 LIMIT 5",
        ["Electronics", 500.0]
    ).fetchdf()
    print("Parameterized query result:\n", result)
    
    
    con.sql("SET VARIABLE target_region = 'EU'")
    con.sql("""
        SELECT category, AVG(amount) AS avg_amt
        FROM sales
        WHERE region = getvariable('target_region')
        GROUP BY category
        ORDER BY avg_amt DESC
    """).show()
    
    
    print()
    
    
    print("=" * 72)
    print("SECTION 12: Transactions")
    print("=" * 72)
    
    
    con.sql("CREATE OR REPLACE TABLE accounts(id INT, balance DOUBLE)")
    con.sql("INSERT INTO accounts VALUES (1, 1000), (2, 500)")
    
    
    con.begin()
    try:
        con.sql("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
        con.sql("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
        con.commit()
        print("Transaction committed successfully")
    except Exception as e:
        con.rollback()
        print(f"Transaction rolled back: {e}")
    
    
    con.sql("SELECT * FROM accounts").show()
    print()
    
    
    print("=" * 72)
    print("SECTION 13: Appender (Bulk Insert)")
    print("=" * 72)
    
    
    con.sql("CREATE OR REPLACE TABLE fast_load(id INT, name TEXT, value DOUBLE)")
    
    
    bulk_df = pd.DataFrame({
        "id":    range(50_000),
        "name":  [f"item_{i}" for i in range(50_000)],
        "value": [float(i) * 1.1 for i in range(50_000)],
    })
    
    
    start = time.perf_counter()
    con.append("fast_load", bulk_df)
    elapsed = time.perf_counter() - start
    
    
    print(f"Bulk-inserted {con.sql('SELECT count(*) FROM fast_load').fetchone()[0]:,} rows in {elapsed:.4f}s")
    con.sql("SELECT * FROM fast_load LIMIT 5").show()
    print()
    
    
    print("=" * 72)
    print("SECTION 14: Replacement Scans (query any Python var)")
    print("=" * 72)
    
    
    my_dict_data = pd.DataFrame({
        "name": ["Alice", "Bob", "Charlie"],
        "age":  [30, 25, 35],
    })
    con.sql("SELECT * FROM my_dict_data WHERE age > 28").show()
    print()

    We focus on storage and execution patterns that are especially useful in real workflows, starting with Hive-style partitioned Parquet output and selective reads from partitioned data. We then use parameterized queries, runtime variables, and transaction control to make our queries safer, more dynamic, and more reliable. Finally, we test high-speed bulk insertion via the appender interface and use replacement scans to query Python objects directly by name, further reinforcing DuckDB’s tight integration with the Python runtime.

    print("=" * 72)
    print("SECTION 15: SQL Macros")
    print("=" * 72)
    
    
    con.sql("""
        CREATE OR REPLACE MACRO revenue_tier(amt) AS
            CASE
                WHEN amt > 800 THEN 'High'
                WHEN amt > 400 THEN 'Medium'
                ELSE                'Low'
            END
    """)
    print("Scalar macro:")
    con.sql("SELECT category, amount, revenue_tier(amount) AS tier FROM sales LIMIT 8").show()
    
    
    con.sql("""
        CREATE OR REPLACE MACRO top_by_category(cat, n) AS TABLE
            SELECT * FROM sales WHERE category = cat ORDER BY amount DESC LIMIT n
    """)
    print("Table macro — top 3 Electronics:")
    con.sql("SELECT * FROM top_by_category('Electronics', 3)").show()
    print()
    
    
    print("=" * 72)
    print("SECTION 16: Recursive CTE")
    print("=" * 72)
    
    
    con.sql("""
        CREATE OR REPLACE TABLE employees(id INT, name TEXT, manager_id INT);
        INSERT INTO employees VALUES
            (1, 'CEO',       NULL),
            (2, 'VP Eng',    1),
            (3, 'VP Sales',  1),
            (4, 'Sr Eng',    2),
            (5, 'Jr Eng',    4),
            (6, 'Sales Rep', 3);
    """)
    
    
    print("Org chart via recursive CTE:")
    con.sql("""
        WITH RECURSIVE org AS (
            SELECT id, name, manager_id, 0 AS depth, name AS path
            FROM employees WHERE manager_id IS NULL
            UNION ALL
            SELECT e.id, e.name, e.manager_id, o.depth + 1,
                   o.path || ' → ' || e.name
            FROM employees e JOIN org o ON e.manager_id = o.id
        )
        SELECT repeat('  ', depth) || name AS hierarchy, path
        FROM org
        ORDER BY path
    """).show()
    print()
    
    
    print("=" * 72)
    print("SECTION 17: Full-Text Search (FTS)")
    print("=" * 72)
    
    
    try:
        con.install_extension("fts")
        con.load_extension("fts")
    
    
        con.sql("""
            CREATE OR REPLACE TABLE documents(id INT, body TEXT);
            INSERT INTO documents VALUES
                (1, 'DuckDB is a fast in-process analytical database'),
                (2, 'Python integration allows querying Pandas DataFrames'),
                (3, 'Parquet files can be read directly without loading'),
                (4, 'Window functions and CTEs make complex analytics easy'),
                (5, 'The columnar engine processes data blazingly fast');
        """)
    
    
        con.sql("PRAGMA create_fts_index('documents', 'id', 'body', stemmer='english')")
    
    
        print("FTS search for 'fast analytical':")
        con.sql("""
            SELECT id, body, fts_main_documents.match_bm25(id, 'fast analytical') AS score
            FROM documents
            WHERE score IS NOT NULL
            ORDER BY score DESC
        """).show()
    except Exception as e:
        print(f"(Skipped — FTS extension not available: {e})")
    print()
    
    
    print("=" * 72)
    print("SECTION 18: AsOf Joins")
    print("=" * 72)
    
    
    con.sql("""
        CREATE OR REPLACE TABLE stock_prices AS
        SELECT * FROM (VALUES
            ('2024-01-01 09:30'::TIMESTAMP, 'AAPL', 150.0),
            ('2024-01-01 10:00'::TIMESTAMP, 'AAPL', 152.5),
            ('2024-01-01 10:30'::TIMESTAMP, 'AAPL', 151.0),
            ('2024-01-01 11:00'::TIMESTAMP, 'AAPL', 153.0)
        ) AS t(ts, ticker, price);
    
    
        CREATE OR REPLACE TABLE trades AS
        SELECT * FROM (VALUES
            ('2024-01-01 09:45'::TIMESTAMP, 'AAPL', 100),
            ('2024-01-01 10:15'::TIMESTAMP, 'AAPL', 200),
            ('2024-01-01 10:50'::TIMESTAMP, 'AAPL', 150)
        ) AS t(trade_ts, ticker, shares);
    """)
    
    
    print("AsOf Join — match each trade to the most recent price:")
    con.sql("""
        SELECT
            t.trade_ts,
            t.shares,
            p.ts   AS price_ts,
            p.price,
            t.shares * p.price AS trade_value
        FROM trades t
        ASOF JOIN stock_prices p
            ON t.ticker = p.ticker AND t.trade_ts >= p.ts
    """).show()
    print()

    We build reusable query logic using scalar and table macros, reducing repetition and making our SQL more modular. We then create a recursive CTE to traverse an employee hierarchy, showing how DuckDB can handle structured recursive logic cleanly within SQL. After that, we explore full-text search and AsOf joins, demonstrating that DuckDB supports not only standard analytics but also more advanced search and time-aware matching.

    print("=" * 72)
    print("SECTION 19: Query Profiling")
    print("=" * 72)
    
    
    print("EXPLAIN output:")
    con.sql("EXPLAIN SELECT category, SUM(amount) FROM sales GROUP BY category").show()
    
    
    con.sql("PRAGMA enable_profiling = 'json'")
    con.sql("PRAGMA profiling_output = 'profile.json'")
    con.sql("SELECT category, AVG(amount), STDDEV(amount) FROM sales GROUP BY category")
    con.sql("PRAGMA disable_profiling")
    
    
    if os.path.exists("profile.json"):
        with open("profile.json") as f:
            profile = json.load(f)
        print(f"Query timing: {profile.get('timing', 'N/A')}s")
    
    
    print()
    
    
    print("=" * 72)
    print("SECTION 20: Multi-Threaded Queries")
    print("=" * 72)
    
    
    results = {}
    
    
    def worker(thread_id, db_path=None):
        """Each thread creates its own connection."""
        local_con = duckdb.connect()
        local_con.sql(f"""
            CREATE TABLE t AS SELECT i, i*{thread_id} AS val
            FROM generate_series(1, 10000) t(i)
        """)
        total = local_con.sql("SELECT SUM(val) FROM t").fetchone()[0]
        results[thread_id] = total
        local_con.close()
    
    
    threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    
    
    print("Thread results:", results)
    print()
    
    
    print("=" * 72)
    print("SECTION 21: Quick Benchmark — DuckDB vs Pandas")
    print("=" * 72)
    
    
    N = 1_000_000
    big_df = pd.DataFrame({
        "group": np.random.choice(["A","B","C","D","E"], N),
        "value": np.random.randn(N) * 100,
        "ts":    pd.date_range("2020-01-01", periods=N, freq="s"),
    })
    
    
    start = time.perf_counter()
    _ = big_df.groupby("group")["value"].agg(["sum","mean","std","min","max"])
    pd_time = time.perf_counter() - start
    
    
    start = time.perf_counter()
    _ = con.sql("""
        SELECT "group",
               SUM(value), AVG(value), STDDEV(value), MIN(value), MAX(value)
        FROM big_df
        GROUP BY "group"
    """).fetchall()
    duck_time = time.perf_counter() - start
    
    
    print(f"Pandas  : {pd_time:.4f}s")
    print(f"DuckDB  : {duck_time:.4f}s")
    print(f"Speedup : {pd_time/duck_time:.1f}x")
    print()
    
    
    print("=" * 72)
    print("SECTION 22: Reading Remote Parquet over HTTPS")
    print("=" * 72)
    
    
    try:
        con.install_extension("httpfs")
        con.load_extension("httpfs")
    
    
        print("Querying a remote Parquet file (DuckDB taxi sample):")
        con.sql("""
            SELECT count(*) AS total_rows
            FROM read_parquet(
                'https://blobs.duckdb.org/data/yellow_tripdata_2010-01.parquet'
            )
        """).show()
    except Exception as e:
        print(f"(Skipped — httpfs not available or network restricted: {e})")
    
    
    print()
    
    
    print("=" * 72)
    print("SECTION 23: Custom Enum Types")
    print("=" * 72)
    
    
    con.sql("CREATE TYPE mood AS ENUM ('happy', 'neutral', 'sad')")
    con.sql("""
        CREATE OR REPLACE TABLE survey(
            respondent_id INT,
            feeling       mood,
            comment       TEXT
        )
    """)
    con.sql("""
        INSERT INTO survey VALUES
            (1, 'happy',   'Great product!'),
            (2, 'neutral', 'It''s okay'),
            (3, 'sad',     'Didn''t work'),
            (4, 'happy',   'Love it')
    """)
    con.sql("SELECT feeling, count(*) AS cnt FROM survey GROUP BY feeling ORDER BY feeling").show()
    print()

    We investigate how DuckDB executes queries by using EXPLAIN plans and JSON profiling output to better understand performance behavior. We also demonstrate thread-based parallel execution by creating separate DuckDB connections in multiple worker threads and safely collecting their results. To round out the performance angle, we benchmark DuckDB against Pandas on a large grouped aggregation, then extend the workflow with remote Parquet access over HTTPS and custom ENUM type creation.

    print("=" * 72)
    print("SECTION 24: Lambda Functions in SQL")
    print("=" * 72)
    
    
    con.sql("""
        SELECT
            [1, 2, 3, 4, 5]                                   AS original,
            list_transform([1, 2, 3, 4, 5], x -> x * x)       AS squared,
            list_filter([1, 2, 3, 4, 5], x -> x % 2 = 0)      AS evens_only,
            list_reduce([1, 2, 3, 4, 5], (a, b) -> a + b)     AS total
    """).show()
    print()
    
    
    print("=" * 72)
    print("SECTION 25: Multi-Format Export")
    print("=" * 72)
    
    
    summary_q = """
        SELECT
            category,
            region,
            COUNT(*)       AS orders,
            SUM(amount)    AS revenue,
            AVG(amount)    AS avg_order
        FROM sales
        WHERE NOT returned
        GROUP BY category, region
        ORDER BY revenue DESC
    """
    
    
    con.sql(f"COPY ({summary_q}) TO 'summary.csv'     (HEADER)")
    con.sql(f"COPY ({summary_q}) TO 'summary.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)")
    con.sql(f"COPY ({summary_q}) TO 'summary.json'    (FORMAT JSON, ARRAY TRUE)")
    
    
    sizes = {f: os.path.getsize(f) for f in ['summary.csv','summary.parquet','summary.json']}
    print("File sizes:")
    for name, sz in sizes.items():
        print(f"  {name:25s}  {sz:>8,} bytes")
    
    
    print()
    
    
    con.close()
    
    
    print("=" * 72)
    print("TUTORIAL COMPLETE")
    print("=" * 72)
    print("""
    Sections covered:
     1.  Connection management (in-memory, persistent, config)
     2.  Synthetic data generation with generate_series()
     3.  Zero-copy querying of Pandas, Polars, PyArrow
     4.  Result conversion (.df(), .pl(), .arrow(), .fetchnumpy())
     5.  Relational API (filter / aggregate / order / limit)
     6.  Window functions, QUALIFY, PIVOT
     7.  Complex types (struct, list, map, unnest)
     8.  Python UDFs (scalar & vectorized)
     9.  File I/O (CSV, Parquet, JSON)
     10. Hive-partitioned Parquet writes & reads
     11. Prepared statements & parameterized queries
     12. Transactions (BEGIN / COMMIT / ROLLBACK)
     13. Appender for high-speed bulk insert
     14. Replacement scans (query Python objects by name)
     15. SQL macros (scalar + table macros)
     16. Recursive CTEs
     17. Full-text search (FTS extension)
     18. AsOf joins
     19. Query profiling & EXPLAIN
     20. Multi-threaded access
     21. Performance benchmark (DuckDB vs Pandas)
     22. Remote Parquet over HTTPS (httpfs)
     23. Custom ENUM types
     24. Lambda functions in SQL
     25. Multi-format export with compression
    """)

    We explore DuckDB’s support for lambda-powered list transformations directly in SQL, providing a concise way to manipulate list data within queries. We then export a summarized analytical result in multiple file formats, including CSV, Parquet, and JSON, highlighting DuckDB’s flexibility for downstream data delivery. We close the tutorial by disconnecting and printing a structured recap of all covered sections, providing a clear end-to-end picture of the full DuckDB-Python workflow.

    In conclusion, we provided a practical, code-level view of how DuckDB-Python fits into modern data workflows. We saw that it is not just a SQL engine embedded in Python, but a highly versatile analytical system that works smoothly with DataFrames, Arrow tables, local files, remote datasets, custom functions, and advanced SQL constructs in a single environment. We moved from simple in-memory queries to more production-relevant patterns such as persistent databases, parameterized execution, transactions, partitioned Parquet, profiling, and performance benchmarking, which helps us understand both the breadth and depth of DuckDB’s capabilities. Also, we came away with a reusable Colab-ready workflow that shows us how to combine Python’s ecosystem with DuckDB’s speed, SQL expressiveness, and interoperability to build fast, elegant, and scalable data analysis pipelines.


    Check out the Full Implementation Codes here.
