Detrimental effects of unoptimized BLAS versions on NumPy-dependent Big Data workloads

Ashish Dubey
3 min read · Nov 7, 2020

Having worked with customers for the last several years, I was fortunate to get opportunities to work on many large-scale problems involving Hadoop, Hive, Presto, Spark, and other components of the ecosystem. That ecosystem is vast, and the number of possible combinations is enormous: the primary compute engines are few enough to count (e.g., Spark, Presto, and others), but beneath them sit input formats, compression libraries, native architectural dependencies on various operating systems, and language/module versions, and one level above you will find myriad integrations such as Spark with Redshift, Snowflake, and so on. The possibilities are countless.

As an example in this context, I once encountered a customer issue where a Spark job was running for 8 hours on a pretty beefy cluster (60 r3.4xlarge nodes), even though the data size was not that large (~100 GB). The numbers did not add up at first glance, so I dug deeper. After a few hours of analysis, I reached the following conclusions:

  1. The tasks were taking 30+ minutes each. Connecting the dots, each task was working on a couple of hundred MB of data, and for typical computation logic a task of that size should take seconds, or at most a few minutes.
  2. The business logic included a custom method leveraging the NumPy library, which suggested that something was being computed at the array/vector level. That in turn gave clues about the data, the type of calculations, and the hardware and library dependencies involved (a hypothetical sketch of that kind of per-record work follows this list).
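
To make the second point concrete, here is a hypothetical sketch of the kind of per-record NumPy work such a job might perform. The column name, vector size, and scoring formula are my own illustrative assumptions, not the customer's actual business logic. The point is that the heavy lifting happens inside NumPy, which dispatches dense linear algebra to whatever BLAS it was compiled against.

```python
# Hypothetical sketch: a PySpark UDF doing dense linear algebra per record.
# Column name, vector size, and the scoring formula are illustrative only.
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("numpy-heavy-job").getOrCreate()

@udf(returnType=DoubleType())
def vector_score(features):
    v = np.array(features, dtype=np.float64)
    m = np.outer(v, v)                    # build a dense matrix from the vector
    return float(np.linalg.norm(m @ v))   # matmul + norm run inside BLAS/LAPACK

df = spark.createDataFrame([([float(i) for i in range(256)],)], ["features"])
df.select(vector_score("features").alias("score")).show()
```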

With those conclusions in hand, the challenge was how to investigate further and determine whether anything could be improved, because one possibility was that this was simply the expected outcome. And since there was no way I could rerun the customer's job with experimental changes or different library versions, I had to reproduce the problem on a small sample.

After several hours of effort, I had a small Python replica of the program, along with a small input file, to test the performance (the small sample let me measure performance quickly on different types of machines without burning hours per run). While developing the sample case, I researched BLAS implementations and how they affect the performance of NumPy-based operations. That was the breakthrough: it let me run my sample program on different machines with different BLAS libraries (OpenBLAS, ATLAS, etc.). The results of the first iteration were striking, with significant differences in the sample program's performance across BLAS installations. And in a big data job, where thousands of tasks run in successive scheduling batches, even a modest per-task difference is magnified dramatically.
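
For reference, a minimal version of that kind of probe looks like the following. It first prints which BLAS/LAPACK libraries NumPy was built against, then times a representative dense operation; the matrix size is illustrative, not taken from the customer workload. Running the same script on machines with different BLAS installations makes the gap immediately visible.

```python
# Minimal BLAS probe: report NumPy's BLAS linkage, then time a dense matmul.
import time
import numpy as np

np.show_config()  # prints the BLAS/LAPACK libraries NumPy was linked against

n = 2000  # illustrative size; large enough that BLAS dominates the runtime
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b  # dispatched to the underlying BLAS GEMM routine
elapsed = time.perf_counter() - start
print(f"{n}x{n} matmul: {elapsed:.2f}s")
```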

With these clues, my colleague and I decided to patch our cluster nodes with ATLAS (as opposed to OpenBLAS). That was itself a massive task, because you have to build and install ATLAS (which involves compilation steps that run for hours) and then install the NumPy module with different BLAS settings (you can find some web articles on this; a rough outline follows below). After patching the nodes with the newly compiled NumPy, we reran the same job, and to our surprise it finished in just 38 minutes (yes, from 8 hours down to 38 minutes). And performance is cost savings: the job had been occupying 60 r3.4xlarge nodes for 8 hours, so you can do the math comparing 38 minutes of runtime against 480.
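
For those curious, the rough shape of the rebuild is outlined below. This is a hedged sketch rather than a recipe: the ATLAS install paths are assumptions, and the exact site.cfg keys depend on your NumPy version (see NumPy's site.cfg.example), so adapt it to your environment.

```bash
# 1. Build and install ATLAS (the multi-hour compilation step).
# 2. Tell the NumPy build where ATLAS lives, e.g. via a site.cfg
#    next to NumPy's setup.py (paths below are assumptions):
#
#      [atlas]
#      library_dirs = /usr/local/atlas/lib
#      include_dirs = /usr/local/atlas/include
#
# 3. Rebuild NumPy from source so it links against ATLAS:
pip install --no-binary :all: numpy
# 4. Verify the linkage on every node; the output should mention atlas:
python -c "import numpy; numpy.show_config()"
```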

Note: For a basic understanding of BLAS, check out https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms

Summary:

By no means should one conclude that ATLAS is better than OpenBLAS in general; there are many other factors, including the specific versions involved and other dependencies. The intention of sharing this example is to highlight how severe the impact of such tuning problems can be. It also shows that a little benchmarking and deep-diving is worth the effort in a large-scale distributed computing environment, where it can significantly reduce cost and improve performance.
