Python Optimization II - Practical tips from real life cases
Non-contrived examples with solutions you can actually use (no asyncio or threading)
This is part 2 of the Python Optimization Basics series
I wanted to talk about some Python optimizations you can do. I will not be trying to create contrived examples or talk about general tips and approaches, all of which can be found very easily with a Google search. I wanted instead to write about the optimization tips I have found in my work life, within the context of the programs and data I work with.
Overall, the general principle I have found is “use as little Python code as possible”.
I should also mention that these optimizations are all meant to be synchronous, meaning I won’t be throwing a threading or multiprocessing call at the problem and calling it a day. I like this because, at a pure CPU level, optimizations like these should come first: not only do they require less effort, they also work in async settings.
Context
It is important to specify where I am coming from, what sort of problems I worked on, and how the optimizations applied to those issues.
So, we need to specify a few things like:
What do these examples optimize?
What is the format of the data?
What type of program is it?
Why do these techniques apply well to such issues?
We are trying to optimize for speed, and we are working with large reporting datasets, think around 500,000+ rows on average. These are not just read/write operations; we also need to further parse those datasets.
The programs are mainly cron jobs that run roughly every 3 hours, and the bottlenecks come from inefficient operations on those datasets, because those operations are usually done in Python.
1) Moving away from Python
So, this is hardly an optimization tip, but since it is a fundamental starting point, I will mention it regardless. If you are working with datasets with a large number of rows and columns, using something like a pandas DataFrame to handle them is almost a necessity. The advantage is that the library is built around an operating model similar to that of traditional databases. It supports operations on rows like filtering, grouping, merging, setting indexes, etc. All of these are incredibly useful, and in general, “high performance” libraries like pandas and NumPy have most of their operations written in C. That is what gets us away from the “using Python” part.
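As a tiny, made-up illustration (the column names and data here are invented for the example), compare a row-by-row Python loop with the equivalent vectorized pandas filter:
import pandas as pd

df = pd.DataFrame({'amount': range(500_000), 'region': ['EU', 'US'] * 250_000})

# Row by row: every iteration runs interpreted Python bytecode
slow = [row for _, row in df.iterrows() if row['amount'] > 100 and row['region'] == 'EU']

# Vectorized: the comparison and the filtering happen in pandas/NumPy C code
fast = df[(df['amount'] > 100) & (df['region'] == 'EU')]
On datasets of this size, the second version is typically orders of magnitude faster, simply because almost none of the work happens in the Python interpreter.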
2) A story of operating on dataset rows without Python (a.k.a. moving away from Python)
In one of the main cron jobs in the system I was working on, there was around an 80s delay when doing operations on a pandas dataframe. The solution I found reduced that to 0.5s. Here is what was happening.
We had the following inputs to the function:
dataframe_a:
| col1 | ... | combined_key |
| 123 | ... | "row_id\row_code" |
A unique ID: inp_id
Here, row_id was of the same type and nature as the input parameter inp_id, and combined_key was a composite string of sorts. The goal of the function was to check which rows had the combined key row_id\row_code equal to the value inp_id\row_code.
The original approach was to loop over each row in Python and check if
row_id\row_code matched inp_id\row_code
and if it did, keep that row; if not, remove it.
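To give a feel for it, here is a rough, made-up sketch of that original approach (the function name and details are illustrative, not the actual production code):
import pandas as pd

def filter_rows_original(dataframe_a: pd.DataFrame, inp_id: str) -> pd.DataFrame:
    # Walk the dataframe row by row in the Python interpreter
    kept_rows = []
    for _, row in dataframe_a.iterrows():
        row_id, row_code = row['combined_key'].split('\\')
        # Keep the row only if its combined key matches inp_id\row_code
        if row['combined_key'] == f'{inp_id}\\{row_code}':
            kept_rows.append(row)
    return pd.DataFrame(kept_rows)
On 500,000+ rows, iterrows() plus per-row string work like this adds up very quickly.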
Now, here’s how I went about solving this.
Step 1. Profiling
I could tell there was a visible delay between program runs when I would debug it, but I needed to narrow the point in the code down, so I used a profiler class, similar to the one we made in the previous post.
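That profiler class isn’t reproduced here, but even a minimal timing context manager along these lines (a stand-in, not the exact class from the previous post) is enough to narrow a delay down to a specific block:
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Print how long the wrapped block of code took to run
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f'{label}: {time.perf_counter() - start:.2f}s')

# Usage: wrap the suspicious section and compare the numbers
# with timed('row filtering'):
#     result = filter_rows(dataframe_a, inp_id)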
Step 2. Trying to adhere to rule number 1
I needed to move away from a loop operation, and ideally away from Python. So I thought about it and wondered how I could set the data up in such a way that we operated on it only once (at least in the conceptual model I was using).
Now, I relied very heavily on the conceptual model pandas uses, which is similar to that of an SQL table: operations resembling SQL ones are supported and are fast. The first part was to move away from comparing a combined_key against each row, and instead split it into 2 columns. Luckily, pandas supports that directly via dataframe_a['combined_key'].str.split('\\', expand=True).
Next, I identified that we only wanted rows whose key matched inp_id and row_code, meaning we only cared whether the prefix to row_code was inp_id. Since I already had inp_id as a parameter to the function, that prefix would be the same for every row. So instead of creating a composite id and checking it row by row, I could create a dataframe with the columns
| inp_id | row_code |
This would end up serving the same purpose at the cost of memory.
Now that the data was prepared, the only thing left to finish this off was to replace the row-wise IF check with an inner join on the row_id and row_code columns. A join is conceptually similar to a row-wise if check, isn’t it? We are basically only keeping rows where the values in the relevant columns match between both tables (in our case, dataframes).
This would automatically remove all the rows where row_id did not match inp_id.
The code for this was actually rather simple; here’s a scrappy, made-up version of it.
# Split the composite key into two proper columns
dataframe_a[['row_id', 'row_code']] = dataframe_a['combined_key'].str.split('\\', expand=True)
# Build the lookup dataframe: every row pairs inp_id with an existing row_code
valid_combined_df = dataframe_a[['row_code']].copy()
valid_combined_df['row_id'] = inp_id
# The inner join keeps only rows whose (row_id, row_code) pair exists in both
dataframe_filtered = dataframe_a.merge(
    valid_combined_df, how='inner', on=['row_id', 'row_code']
)
and that was it. The performance gain from 80s to 0.5s was well worth the few hours it took me on this. The tradeoff here was memory, as we were preparing and setting up these extra dataframes. That is where the context of your situation comes into play. What do you value? What does your program need or lack? In my case, memory was not the issue; time was. This job ran frequently, CPU resources were limited, and even a minor delay could push back other critical jobs, not to mention the cost savings from our cloud provider by reducing the job’s processing time.
The lesson here, though, is that this performance gain was not about a standard “best practice” that I had memorized and then spotted as an obvious “pattern” in the code. It was about understanding the nature of my problem, thinking in terms of data, discovering why the bottlenecks existed, and re-arranging the program to achieve the desired result.
This example and optimization tip was a bit of a curveball, and I worry it might confuse readers more than anything. Nonetheless, it was important to include: not only was it a fun problem to solve, it also highlights the importance of being able to apply both theory and creativity to a real-world problem.
3) Efficient database insertions
This example is thankfully a lot simpler, both to explain and to understand. It was also an insane time save for the program I worked on.
In a cron job of a similar nature to the previous example (500,000+ rows, running every 3 hours), the program had to store a large list of data into an SQL table. The main cause of slowdown here was the insert query. This is where a general understanding of SQL helps: the goal was to switch to a COPY FROM statement.
The code path using a traditional INSERT ... VALUES took 9 minutes; with the COPY FROM approach, I reduced it to 50s.
Setting up the code to make this work took a bit of effort, but in retrospect it was well worth it. Since we were using the SQLAlchemy ORM on top of PostgreSQL, we could reach through it to the underlying database cursor and use its copy_expert function.
This worked like a COPY FROM, except it operated on CSV data. That was a minor inconvenience, and it once more required modifying the data for the approach, but honestly it is more boilerplate than anything.
It’s fairly simple, and examples can be found in plenty of places online. I have nonetheless written a simplified version (missing some CSV parsing steps) down below to better illustrate how to go about making the change.
import csv
from io import StringIO

from sqlalchemy import create_engine, insert
from sqlalchemy.orm import sessionmaker

# table_object, data_to_store, cols_list, table_name and col_str are assumed
# to be defined elsewhere in the real job
engine = create_engine(
    'postgresql://my_user:my_password@my.host.com/my_db'
)
session = sessionmaker(bind=engine)()

# ===== BEFORE =====
insert_query = insert(table_object).values(data_to_store)
session.execute(insert_query)
session.commit()

# ===== AFTER =====
# Grab the raw database (psycopg2) cursor from underneath the ORM session
cursor = session.connection().connection.cursor()

# Dump the rows (a list of dicts) into an in-memory CSV buffer
s_buf = StringIO()
writer = csv.DictWriter(s_buf, fieldnames=cols_list)
writer.writeheader()
writer.writerows(data_to_store)
s_buf.seek(0)  # rewind so COPY reads from the start of the buffer

sql = f'COPY {table_name} ({col_str}) FROM STDIN WITH (FORMAT csv, HEADER true)'
cursor.copy_expert(sql=sql, file=s_buf)
session.commit()
The main takeaway from this example is to be aware of the general concepts of the systems you work with. In this case, knowing the best practice for large row inserts (which is really easy to search for) and then figuring out how to apply it from Python was the key factor. Additionally, being willing and able to spend some time reading the documentation of the libraries you use helps immensely.
Conclusion
I hope you learned something new from this post. The point, once more, is to present cases from actual experience I have had and highlight how I went about solving them. Every engineer’s experience with optimization might differ, unless you are just throwing asyncio and threading at your problem. In that case, I would highly encourage you to understand the bottleneck in the context of your setting, your data and your expected results, and to try to identify the simplest solution. It is sort of like approaching the problem from first principles, where just understanding Python, pandas and its operating model, and SQL (if you work with it) is mostly all you need.
If I find any other helpful or notable things regarding Python, speedups or memory, I will share them. As of right now, this really is the entirety of the Python optimization series. All you really need is to debug and profile your code, understand the bottlenecks and the libraries, and go about solving them creatively.

