Ask HN: Is PySpark a Dead-End?

9 points by passer_byer 4 years ago | 9 comments

I am contracted by a major financial services firm to refactor an analytical model used for revenue forecasting into PySpark, executing on an AWS EMR cluster.

The project's current status is documented[0].

The client's team responsible for operationalization successfully refactored another analytical model into Python/pandas. The current model's execution time for a 5-year scenario is ~17 hours. Most of that time is spent executing poorly crafted Oracle SQL queries that draw millions of rows into the analytical run-time for sorting, aggregation, discarding, merging, and splitting tasks.
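To illustrate the row-pulling pattern described above, here is a minimal sketch that contrasts aggregating in the run-time with aggregating in the database. It is not the client's code: sqlite3 stands in for Oracle, and the `loans` table, its columns, and the data are hypothetical.

```python
import sqlite3

# sqlite3 is a stand-in for Oracle; table, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (loan_id INTEGER, region TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO loans VALUES (?, ?, ?)",
    [(1, "east", 100.0), (2, "east", 250.0), (3, "west", 75.0), (4, "west", 25.0)],
)

# Pattern A: pull every row into the run-time, then aggregate in Python.
totals_client = {}
for _loan_id, region, balance in conn.execute(
    "SELECT loan_id, region, balance FROM loans"
):
    totals_client[region] = totals_client.get(region, 0.0) + balance

# Pattern B: let the database aggregate and return one row per group.
totals_db = dict(
    conn.execute("SELECT region, SUM(balance) FROM loans GROUP BY region")
)

conn.close()
```

Both patterns produce the same totals, but Pattern B moves one row per group across the wire instead of one row per loan. Whether that helps in the real model depends on the actual queries, which I have not shown here.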

To constrain this execution time, the final input is a sample of ~1.8M rows from a loan portfolio of ~81M records.

The client is concerned about performance and believes PySpark is the preferred target language.

I have been on this project for just one month, but I previously worked at the same firm on a six-month contract to refactor another model into Python/pandas. That project was successful, mainly due to the team leader's rigor in meeting milestones and ability to remove blockers for the team.

I recently discussed these projects with @Travis Oliphant, who had some interesting ideas on Python-based frameworks that address the problems of processing out-of-core dataframes. We discussed Dask[1]; Coiled.io, commercial Dask support[2]; Ray[3]; and Modin, commercial support for Ray[4].
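For anyone unfamiliar with the term, "out-of-core" means working on data larger than memory by processing it in partitions and combining partial results. The sketch below shows that idea in plain Python only; it is not the API of Dask, Ray, or Modin, and the helper names (`chunked`, `sum_by_key`) and the sample data are hypothetical.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def chunked(rows, size):
    """Yield lists of up to `size` rows so only one chunk is held at a time."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def sum_by_key(rows, key, value, chunk_size=2):
    """Sum `value` per `key` by combining per-chunk partial aggregates."""
    totals = defaultdict(float)
    for chunk in chunked(rows, chunk_size):
        partial = defaultdict(float)  # aggregate for this chunk only
        for row in chunk:
            partial[row[key]] += float(row[value])
        for k, v in partial.items():  # merge the partial into the running totals
            totals[k] += v
    return dict(totals)


# Hypothetical data; a real input would be a file too large to load at once.
data = io.StringIO("region,balance\neast,100\nwest,75\neast,250\nwest,25\neast,50\n")
result = sum_by_key(csv.DictReader(data), "region", "balance")
```

As I understand it, the frameworks above apply the same partition-then-combine approach, adding scheduling, parallelism, and a dataframe API on top.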

Others discussed were Databricks[5], bodo.ai[6], Voltron Data[7], and AtScale[8]. On Reddit, the commentary on Snowflake was very positive[9].

Easing the maintenance burden of keeping the model in production and devising new scenarios (e.g. Covid-19 effects on forbearance requests) are requirements. Its shelf-life is years, making maintainability a major consideration.

What have others experienced in scaling out for teams familiar with Python/pandas for feature engineering tasks?

Is PySpark a dead-end library in the Python ecosystem?

[0] https://www.pythonforsasusers.com/project_summary/current_project_status.html

[1] https://dask.org/

[2] https://coiled.io/

[3] https://docs.ray.io/en/ray-0.4.0/pandas_on_ray.html

[4] https://modin.readthedocs.io/en/stable/

[5] https://docs.databricks.com/languages/pandas-spark.html (which points to Apache's Pandas API on Spark)

[6] https://bodo.ai/

[7] https://wesmckinney.com/blog/from-ursa-to-voltrondata/

[8] https://www.atscale.com/autonomous-data-engineering/

[9] https://www.reddit.com/r/dataengineering/comments/r893rw/why_is_snowflake_so_popular/