I accidentally re-invented a Data Lakehouse.
Over the last couple of months, I chipped away at my car price estimation tool whenever I could muster up some free time.
Until now, I hadn’t put much thought into the overall data architecture.
This time I wanted a stack that is simple, fast, scalable, and in general a pleasure to work with.
If I didn’t like a library, it got zapped instantly. Let’s call it an accidental architecture.
High-level Overview
- Scrape the data using Playwright
- Convert it to Parquet using Pandas
- Import all data into DuckDB for analysis
- Sprinkle some ML on top using Quantile Forests, and we have price estimations with lower and upper bounds.
We also get average prices by brand, listing durations, and many other interesting insights.
Now, where’s the Data Lakehouse, you might ask?
Well, the DuckDB database part is completely throwaway: it can be rebuilt anytime from the underlying Parquet files. I have skipped Iceberg and Delta Lake for the time being, as I wanted to stay as lean as possible.
Rebuilding the database is a matter of issuing a handful of SQL statements to re-import the data from Parquet. Or skip the import entirely and query the Parquet files directly from disk.
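Here’s a minimal sketch of what that rebuild looks like; the database, table, and file names are made up for illustration:

```python
import duckdb

con = duckdb.connect("cars.duckdb")  # throwaway database file

# Re-import everything from the underlying Parquet files.
con.execute("""
    CREATE OR REPLACE TABLE listings AS
    SELECT * FROM read_parquet('data/*.parquet')
""")

# Or skip the import entirely and query the files straight from disk.
total = con.execute(
    "SELECT count(*) FROM read_parquet('data/*.parquet')"
).fetchone()[0]
```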
This approach has proven extremely easy to debug, and it is fast and scalable to work with, since all operations are vectorized.
Is it enterprise-ready? Nope. Does it get the job done? Absolutely. There are a couple of interesting developments in this market, and I’m curious to see where things are heading.
For the sake of completeness, here’s the full stack:
🔹Playwright is an excellent tool for fetching data from websites. It offers a simple, reliable API for automating browser tasks, which makes the scraping both efficient and robust (a minimal sketch follows after this list).
🔹Pandas works perfectly for reading and writing JSON and Parquet files, and its data manipulation capabilities make it easy to clean and transform the scraped data (sketch below).
🔹QuantileForest produces the prediction intervals. Each price estimate comes with a lower and upper bound, which adds a measure of uncertainty and makes the predictions more interpretable (sketch below).
🔹DuckDB handles storage, reporting, and analytics. It’s an in-process SQL OLAP database management system, so analytical workloads run efficiently without a separate server (an example query follows below).
🔹Flask and Bootstrap are a powerful combination. Flask is a lightweight web framework that gives me the flexibility to grow the project into a full SaaS solution, and Bootstrap takes care of responsive, mobile-first design. I chose Flask over Streamlit for the greater control it gives over the application’s structure and behavior (sketch below).
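To make the list concrete, here are a few minimal sketches. First, the scraping step; the URL and CSS selector are placeholders, not the real listing site:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/car-listings")  # placeholder URL
    page.wait_for_selector(".listing")             # let the listings render
    cards = page.locator(".listing").all_inner_texts()
    browser.close()

print(f"Scraped {len(cards)} listings")
```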
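Next, turning the scraped JSON into Parquet with Pandas; the file names and columns are illustrative, the real scrape has more fields:

```python
import pandas as pd

df = pd.read_json("scraped/2024-06-01.json")            # illustrative path
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df = df.dropna(subset=["price"])                        # drop unparsable prices
df.to_parquet("data/2024-06-01.parquet", index=False)
```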
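For the price intervals, a sketch with synthetic stand-in data; in the real tool the features come from the listings table:

```python
import numpy as np
from quantile_forest import RandomForestQuantileRegressor

# Synthetic stand-in data, not the real listings.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))  # e.g. age, mileage, power, brand code
y = 30000 - 4000 * X[:, 0] + rng.normal(scale=1500, size=1000)

qrf = RandomForestQuantileRegressor(n_estimators=200, random_state=42)
qrf.fit(X[:800], y[:800])

# Lower bound, point estimate, and upper bound for each unseen car.
lower, median, upper = qrf.predict(X[800:], quantiles=[0.05, 0.5, 0.95]).T
```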
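A typical reporting query against the rebuilt database; the column names here are assumptions, the real schema may differ:

```python
import duckdb

con = duckdb.connect("cars.duckdb")

# Average price and listing duration per brand (column names assumed).
report = con.execute("""
    SELECT brand,
           round(avg(price)) AS avg_price,
           round(avg(date_diff('day', first_seen, last_seen))) AS avg_days_listed
    FROM listings
    GROUP BY brand
    ORDER BY avg_price DESC
""").df()
```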
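And finally, the shape of a Flask endpoint that would serve an estimate; the route and the numbers are made up:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Illustrative route and payload, not the real app's API.
@app.route("/estimate", methods=["POST"])
def estimate():
    features = request.get_json()  # would feed the trained quantile forest
    return jsonify({"lower": 11500, "estimate": 13200, "upper": 15100})

if __name__ == "__main__":
    app.run(debug=True)
```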
Together, this stack covers the entire pipeline, from data collection and storage to analysis and web deployment.
What’s your favorite stack?
#DataAnalysis #Python #SQL #DataScience #SaaS