Translating code to make scripts more efficient in a production environment

In this article I am going to explain the scenario I have been working on for the past year: translating code from Pandas to PySpark to improve processing times and make scripts more efficient in a production environment. Let’s start by understanding these technologies.

Pandas is a library for working with datasets. It has functions for analysing, cleaning, exploring and manipulating data, and it is well suited to small and medium-sized datasets on a single machine. When we need to work at a bigger scale we start talking about PySpark, the Python API for Apache Spark, which is designed for distributed processing of large datasets across multiple machines and is built on a unified analytics engine for large-scale data processing.

Converting a pandas DataFrame to a PySpark DataFrame becomes necessary when you need to scale up your data processing to handle larger datasets: unlike Pandas, Spark can run operations across multiple machines, which is essential at that scale.
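As a minimal sketch of that conversion (the sample data and names are invented for illustration), the step itself is a single call to spark.createDataFrame:

```python
import pandas as pd
from pyspark.sql import SparkSession

# Local Spark session; in production this would point at a cluster
spark = SparkSession.builder.appName("pandas_to_spark").getOrCreate()

# A small pandas DataFrame standing in for an existing workload
pdf = pd.DataFrame({"id": [1, 2, 3], "amount": [10.5, 20.0, 7.25]})

# Convert it into a distributed Spark DataFrame
sdf = spark.createDataFrame(pdf)
sdf.show()
```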

One of the main drawbacks of Pandas is its scalability and performance limitations. Pandas is not designed for distributed computing, and it may struggle with large or complex datasets that exceed the memory capacity of a single machine. Pandas also relies on the Python interpreter, which is not very efficient for parallel or concurrent processing. As a result, Pandas can be slow and consume a lot of memory for certain operations, such as sorting, joining or concatenating large DataFrames.

One of the main benefits of PySpark is its scalability and performance capabilities. PySpark can handle massive, complex data sets spanning multiple nodes, and leverage Spark’s distributed and in-memory computing capabilities to accelerate data processing and machine learning tasks. It also supports lazy evaluation, meaning it only executes operations when an action is triggered and avoids unnecessary calculations. PySpark is suitable for working with big data, streaming data, or advanced analytics and machine learning applications.

Before we start translating Pandas code to PySpark, we need to create a SparkSession, the entry point to programming Spark with the Dataset and DataFrame API. PySpark provides data transformation tools in the pyspark.sql.functions module, which contains many functions for transforming data, such as converting strings to dates, concatenating columns, window functions, pivot tables and so on.
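A short sketch of that setup (the application name and sample columns are made up for illustration) could look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Entry point to the DataFrame API; reuses an existing session if one is already running
spark = (
    SparkSession.builder
    .appName("pandas_to_pyspark_migration")  # hypothetical app name
    .getOrCreate()
)

# pyspark.sql.functions is usually imported under the alias F
df = spark.createDataFrame([("2023-01-15", "A", 10)], ["date", "category", "value"])
df = df.withColumn("date", F.to_date("date"))                        # string -> date
df = df.withColumn("label", F.concat_ws("-", "category", "value"))   # concatenate columns
df.show()
```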

One of the datasets I have been working on had 127 million records, and the whole process was based on Pandas: the notebook was using 96 vCPUs and 384 GiB of memory, and the full script took more than 30 minutes to complete. The dataset had around 20 columns of string, datetime, integer and float types. We translated all the code to PySpark running on 2 vCPUs and 8 GiB of memory, and the total time to run the whole script dropped to around 5 minutes.

Here we can see an example of a pivot transformation to be translated. The most notable change was in the processing time: around 5 minutes in Pandas versus 43 seconds in PySpark.
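The original snippet is not reproduced here, but a hedged sketch of this kind of pivot translation (column names and data are invented for illustration) looks like this:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# --- Pandas version: pivot a long table of sales into one column per month ---
pdf = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "month": ["jan", "feb", "jan", "feb"],
    "sales": [100, 110, 90, 120],
})
pandas_pivot = pdf.pivot_table(index="store", columns="month", values="sales", aggfunc="sum")

# --- PySpark version: the same reshape with groupBy().pivot().agg() ---
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf)
spark_pivot = sdf.groupBy("store").pivot("month").agg(F.sum("sales"))
spark_pivot.show()
```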

With exponentially growing data, complex operations like merging or grouping require parallelization and distributed computing. As we saw in the example above, these operations are slow and expensive, and they become difficult to handle with a Pandas DataFrame, which does not support parallelization.
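For instance, a pandas merge followed by a groupby translates almost one-to-one into Spark's join and groupBy, but runs distributed across the cluster (the tables and column names below are purely illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, "A", 10.0), (2, "B", 20.0), (3, "A", 5.0)], ["order_id", "store", "amount"]
)
stores = spark.createDataFrame([("A", "Madrid"), ("B", "Lisbon")], ["store", "city"])

# pandas equivalent: orders.merge(stores, on="store").groupby("city")["amount"].sum()
result = (
    orders.join(stores, on="store", how="inner")
          .groupBy("city")
          .agg(F.sum("amount").alias("total_amount"))
)
result.show()
```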

To sum up, I would like to highlight the main advantages of PySpark in terms of performance, speed and memory consumption:

Performance

Pandas is typically used to manipulate and analyse datasets smaller than 10 GB. Above that scale, PySpark, running on a distributed computing system, gives better performance than Pandas and relies on resilient distributed datasets (RDDs) to work on the data in parallel.

Speed

For large datasets, PySpark is consistently faster than Pandas, since computation can be performed on the data in parallel. Spark also offers in-memory caching, which Pandas does not have.

Memory Consumption

PySpark uses lazy evaluation and does not keep all the data in memory: data is only read from disk when it is actually required. Pandas, on the other hand, keeps the whole dataset in memory, so its memory consumption is always higher.
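A quick illustration of that laziness (the dataset and filter below are made up): the filter and the new column only build an execution plan, and nothing is read or computed until an action such as count() or show() is called.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(10_000_000)                                  # transformation: nothing materialised yet
filtered = df.filter(F.col("id") % 2 == 0)                    # still lazy: only the plan grows
doubled = filtered.withColumn("double_id", F.col("id") * 2)   # still lazy

# Only this action triggers execution of the whole plan
print(doubled.count())

# Optionally cache a DataFrame that will be reused, keeping it in memory across actions
doubled.cache()
```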

In conclusion, for small or medium datasets that fit in the memory of a single machine, Pandas is the better choice; but for large amounts of data PySpark shows better performance at every step, because it is designed to leverage distributed computing resources across a cluster of machines and is better suited to complex data processing that involves multiple stages of transformation and analysis.

 


