Ask HN: Is PySPark a Dead-End? I am contracted by a major financial services firm to refactor an analytical model used for revenue forecasting to PySpark executing on a AWS EMR cluster. The project's current status is documented[0]. The client's team responsible for operationalization was successful in refactoring another analytical model into Python/pandas. The current model execution time for a 5 year scenario is ~17 hours. Most of that time is spent executing poorly crafted Oracle SQL queries drawing millions of rows into the analytical run-time for, sorting, aggregation, discarding, merging, and spliting tasks. In order to constrain this exeuction time, final input is a sample of ~1.8M rows from a loan portfolio of ~81M records. The client is concerned about performance and believes PySpark is the preferred target language. I have been on this project for just one month, but I contracted previously at the same firm on a six months to refactor another model into Python/pandas. That project was successful, mainly due to the team leader's rigor for meeting milestones and ability to remove blockers for the team. I recently discussed these projects with @Travis Oliphant who had some interesting ideas on Python-based frameworks to overcome issues for processing out-of-core dataframes. We discussed the frameworks Dask[1], Coiled.io, commercial Dask support[2], Ray[3], Modin, commercial support for Ray[4]. Others discussed were, Databricks[5], bodo.ai[6], Voltron Data[7], and AtScale[8]. On Reddit, the commentary for Snowflake was very positive[9]. Easing maintenence burdens to keep the model in production and devising new scenarios (e.g. Covid-19 effects on forebarance requests) are requirements. Its shelf-life is years, making maintainability a major consideration. What have others experienced in scaling out for teams familiar with Python/pandas for feature engineering tasks? Is PySpark a dead-end libray in the Python ecosystem? [0] https://www.pythonforsasusers.com/project_summary/current_project_status.html [1] https://dask.org/ [2] https://coiled.io/ [3] https://docs.ray.io/en/ray-0.4.0/pandas_on_ray.html [4] https://modin.readthedocs.io/en/stable/ [5] https://docs.databricks.com/languages/pandas-spark.html (which points to Apache's Pandas API on Spark) [6] https://bodo.ai/ [7] https://wesmckinney.com/blog/from-ursa-to-voltrondata/ [8] https://www.atscale.com/autonomous-data-engineering/ [9] https://www.reddit.com/r/dataengineering/comments/r893rw/why_is_snowflake_so_popular/ |