I suppose since flooring and parquet files are built out of blocks, there is some sort of joke here. Not sure what it is though…
I was looking at my previous post on Athena, and wondering about how the cost of doing that sort of analytics is driven very much by the cost of scanning all the data. Someone suggested I take a look at ORC files.
Now whilst the possibility of getting it wrong and having the hordes of Mordor come erupting out of the screen was low, it was not an impossibility. The true limits of what cloudy stuff and Terraform can do are as yet untested… But I digress.
If you think about it, a CSV file is a bloody awful way to store data for processing. Sure, it makes sense for humans because that's how we would write data down when we collect it, but it makes little sense for processing. Most of what we want to process is in columns – we tend to want to add up lists of expenses, or find the smallest value in a list of numbers, or work out a deviation or range, or find all the people in the phonebook called Jones…
A CSV file is not the way to do this. Even in a really simple file like a list of expenses there is a lot of data that is not needed:
EmployeeID | Name  | PurchaseID | Amount  |
1          | Smith | 34324      | 182.20  |
2          | Jones | 35122      | 23.21   |
3          | Patel | 39123      | 1923.00 |
1          | Smith | 39441      | 84.67   |
1          | Smith | 39441      | 34.22   |
3          | Patel | 39124      | 518.34  |
4          | Singh | 43321      | 2885.23 |
To total the Purchase amounts up, you would have to read the entire CSV file, as access is sequential – this means you read and discard perhaps four or five times as much data as you need. A CSV cannot be read down as a list of columns, because the data is split by end-of-line markers, meaning the file has to be accessed row by row.
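As a quick sketch of what that means in practice: suppose the table above lived in a pipe-delimited file called expenses.csv (a made-up name for illustration). Totalling just the Amount column still means parsing every field of every row.

import csv

# Even to total one column, every byte of every row has to be parsed
total = 0.0
with open('expenses.csv', newline='') as f:
    reader = csv.reader(f, delimiter='|')
    next(reader)                     # skip the header row
    for row in reader:
        total += float(row[3])       # columns 0-2 are read and thrown away
print(total)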
If we turn that on its head and store the data in the file by columns rather than rows, we can do exactly that. This is the sort of format that ORC files use, as does another file format called Parquet.
Parquet files are binary – they are not viewable directly. When you create them, the data is stored in a columnar format, and it is also (usually) compressed using one of several different algorithms, which saves space and transfer costs. Getting data into Parquet format, though, is trivially easy if you have Python installed.
>>> import pandas as pd
>>> import pyarrow
>>> df = pd.read_csv('d:/python/wsprspots-2024-01.csv', header=None)
Here I have taken one of the large CSV files from my previous post and loaded it into a Python session. It does have to fit in memory; pandas, if you recall, holds everything in RAM, so there is a limit. However, reading it in and then writing it back out to a Parquet file is simplicity itself.
>>> df.to_parquet('d:/python/wsprspots-2024-01.parquet', engine='pyarrow', compression='snappy')
The engine and compression arguments are optional, but I would recommend including them. Snappy is, as the name suggests, quick, with modest compression. If you use gzip instead, expect about 30% more reduction in size for not a lot longer decompression time.
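For comparison, the gzip version is just the same call with a different compression argument (a sketch – the output filename here is only an illustration):

# Same dataframe, gzip-compressed Parquet: a smaller file at the cost of a slower write
df.to_parquet('d:/python/wsprspots-2024-01-gzip.parquet', engine='pyarrow', compression='gzip')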
For very large files that will not fit in memory, you can use the polars library (why polars? Because a polar bear is bigger than a panda bear – library programmer humour, I guess…). It has a disk-based solution that will stream data off one disk and onto another, converting on the fly using very little RAM. It works best if you have one SSD for reading and another for writing, and it takes much longer than an in-memory conversion, but if the data won't fit in RAM there is little choice.
import polars as pl

# scan_csv builds a lazy query; sink_parquet streams the result straight to disk
# without ever holding the whole dataset in RAM
pl.scan_csv(source=csv_file, schema=DataTypes.DTypes).sink_parquet(path=parquet_file)
Looking at the file sizes that were created shows a marked difference…
01/02/2024 01:32 13,520,064,926 wsprspots-2024-01.csv
04/01/2025 00:26 2,131,739,482 wsprspots-2024-01.parquet
The drastic reduction in size means that uploading to an S3 bucket is considerably faster than with the CSV file, and the cost of storage in the bucket is commensurately less as well. The file above was compressed using Snappy; gzip would net even more savings.
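As a sanity check, pyarrow can read the Parquet footer metadata without touching the data itself; a small sketch against the file written above:

import pyarrow.parquet as pq

# Read only the footer metadata: row count, column count and the compression codec in use
meta = pq.ParquetFile('d:/python/wsprspots-2024-01.parquet').metadata
print(meta.num_rows, meta.num_columns)
print(meta.row_group(0).column(0).compression)   # SNAPPY for the file above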
What does Spark think?
Running the data in an Athena Spark notebook shows some interesting results. The code is broadly similar, with some differences around column names and syntax. The following was used in place of the CSV-based processing in the previous post.
from pyspark.sql.functions import col

# Load the Parquet data from S3; the columns keep the numeric names from the pandas conversion ("0", "1", "2", ...)
df = spark.read.parquet("s3://131273-rawdata/parquet/")

# Column "2" is the rx station; pull out the distinct values
dfRx = df.select("2").drop_duplicates()
uniqueRx = list(dfRx.select("2").toPandas()["2"])

# Filter the full dataset to rows whose rx station is in that list
dfOutput = df.filter(col("2").isin(uniqueRx))
Data load time is more or less the same, even allowing for the required decompression, because all the data still needs loading. Each stage ran in 8 seconds, which, to read 13GB of data in from an S3 bucket and decompress it, is quite impressive. The same operation on my laptop, running some fairly quick SSDs, still took nearly 45 seconds.
The second step was to select the unique list of rx stations by means of a drop_duplicates. This was the quickest of all, under a second in both cases.
The third step is where the real savings showed up. For the CSV-based solution, the select below took 127 seconds.
uniqueRx = list( dfRx.select("_c2").toPandas()['_c2'] )
Conversely, the Parquet solution meant the select only had to traverse the column of interest and simply didn't need to read the rest of the data. It ran the following in 19 seconds.
uniqueRx = list( dfRx.select("2").toPandas()['2'] )
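If you want to see that column pruning for yourself, Spark's explain() shows it in the physical plan; a quick sketch against the dataframe loaded earlier:

# The FileScan node's ReadSchema should list only column "2",
# confirming the other columns never need to be read from S3
df.select("2").explain()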
This is a clear winner: nearly seven times faster, with a corresponding reduction in processing costs on the Spark data processing units. More important still for the bill was the data scanned. In the first case there was an entire CSV file to read, and Athena charges $5 per TB scanned, which is $0.005 per GB – hence the 13.5GB CSV file cost around 6.5c to read.
Compare that to the Parquet approach, which reported 830MB scanned, for a total of around 0.4c.
Now, 6 cents of difference is hardly going to break the bank. But remember – that's for Every. Single. Query that you run. It all adds up.
Conclusion
Parquet files are a significant wealth enhancer – you can do a lot more with your budget or, conversely, you can shrink your running costs significantly just by moving to them from CSV files. Something to seriously consider.