Quoting from my popular post on performance gotchas:
You probably want to try to find a split that gives roughly the same number of unique values in each column, so don’t constrain yourself to just “dollars and cents” or even first 5 digits, last 5 digits. This one… you have to test!
The goal here is better compression by having fewer distinct values.
What I was talking about is splitting a single column of dollars and cents into separate columns that would each have fewer distinct values. This runs counter to the other guideline… have fewer columns. We can safely assume neither guideline is a hard and fast rule.
Our goal today is to show some actual memory savings by exercising this technique. We will use the script in this post to find the memory usage.
The Sample Data
I remember creating this data for…uh… something. I think it was when I was testing the memory usage script? Anyway, the point is… it is totally fake data. However, I think it is a reasonable representation.
It is 1 million rows with 43,304 distinct SalesAmount values from $59 to $560. $436.26 shows up 5 times, but $127.19 shows up 121 times. There are 83 distinct “cent” values, with .19 being about 10x more common than .47. So… it’s both random’ish and not normally distributed, which… feels pretty good to me for our purposes. Yes?
We will include some other fields to make it more “real” in a bit… but we are going to start with comparing just this SalesAmount column for a simple case… because as you will see later, things are seldom truly simple…
3 Flavors of SalesAmount
Version 1: Well… it’s just the SalesAmount in a single column. To get the total, it’s a simple =SUM(Sales[SalesAmount]). This version has 43,304 distinct SalesAmount values.
Version 2: Here we separate the dollars and cents into two separate columns. Note that this must be done in the original data source, which in my case was SQL. It doesn’t help to bring in the full column and then generate additional dollars/cents columns from it. When we do this, we cut the distinct dollars down to just 502 values and the distinct cents to just 82.
I did play a bit with storing the cents as a whole number of cents (eg: 43) compared to a decimal number (0.43) and found basically no difference in storage. Probably better to go with the float to simplify the math… but, I didn’t. Such a rebel. Anyway, of course the math is more complicated than Version 1. I did mine as: [Total Sales] := SUMX ( Sales, Sales[Dollars] + Sales[Cents] / 100 )
Version 3: As in Version 2, we separate the SalesAmount into 2 columns… but this time, we make an attempt to do the division such that each column has roughly the same number of distinct values. I do some funky math here… to end up with 229 distinct “high” values and 220 distinct “low” values. A sketch of what the matching measure might look like is below.
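If you are wondering what the Version 3 measure could look like, here is a minimal sketch. It assumes (and this is an assumption, not the actual funky math I used) that the split was done as quotient/remainder on the amount expressed in whole cents, with a made-up divisor of 230; the High and Low column names are also just illustrative.

```
-- Minimal sketch, NOT the exact split I used.
-- Assumes: cents = SalesAmount * 100 (a whole number),
--          High  = integer quotient of cents / 230,
--          Low   = remainder of cents / 230   (230 is a made-up divisor)
[Total Sales EvenSplit] :=
SUMX (
    Sales,
    ( Sales[High] * 230 + Sales[Low] ) / 100
)
```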
Memory in Simple Case
Based on the large discrepancy in size between the cents and dollars, you can intuit that the cents were sorted and compressed before the dollars. Same for the Low/High columns in the EvenSplit table, which is probably more interesting. Even though these columns have roughly the same number of distinct values, one of them takes up almost 10x the memory! It’s a great example of “adding columns really takes up a lot of memory”.
Memory in Complex Case
I’m not sure that “complex” is the right term, but it is certainly more realistic after we add some supporting columns as seen here:
Let’s take a look at the memory usage now… with a bit of color coding to avoid total and complete confusion. Which… is still going to present a challenge.
The SalesAmount that is stored as a single column (at left) still uses an identical amount of memory, around 4.3MB, compared to our simple table. However, Dollars&Cents went from 1,043KB to 1,129KB and High&Low in the EvenSplit table went from just 912KB to 1,959KB! Hmmm.
If you compare the grand total of the EvenSplit table at just under 9MB to the normal Sales table at ~10.8MB, there is still a nice ~1.8MB of memory savings. However, if you scroll back to the simple tables… the savings were over 3MB.
What is going on?
Well, I don’t actually know… but, if you look at the columns, including the ones marked in green (Quantity and SaleDate), you get the sense that… “things changed”… most likely the sort order. And this makes some sense, since the algorithm that chooses the order is certainly taking cardinality (the number of unique values in a column) into account.
Take a look at the Quantity column in the EvenSplit table (above on the right). It suddenly starts hogging 557KB where it was below 3KB! I am guessing it got shuffled “down” in the sort order to make this big change… maybe after the “High” column, whose cardinality we dropped. Well, it looks like SaleDate dropped in usage, so maybe that guy shuffled up.
Now is a good time to mention: if somebody knows a DMV that tells me the sort order… I would like to know that.
An Interlude into Full Geek Mode
I’ve probably said this before, but Power Pivot performance problems are basically always about memory. It’s an in-memory database after all. For us, using less RAM means we compressed better, and compressing better not only means you do fewer memory reads… but likely you are also iterating over fewer rows. Let that last part sink in a bit; it’s hot.
Anyway, I want to briefly address concerns around… using less RAM, but potentially slowing things down by doing more complicated math (like we need to do here to recombine our dollars and cents into one total).
It’s hard for me to imagine when that would matter, and here is why: This is what reading from RAM looks like:
1. Get the pointer to the data being loaded.
2. Send that pointer off to the MMU.
3. The MMU translates the virtual address in the pointer to a physical address.
4. Send the physical address to the memory controller.
5. The memory controller figures out what bank of RAM the data is in and asks the RAM.
6. The RAM figures out which particular chunk the data is in, and asks that chunk.
7. Step 6 may repeat a couple more times before narrowing it down to a single array of cells.
8. Load the data from the array.
9. Send it back to the memory controller.
10. Send it back to the CPU.
11. Use it!
Which is to say… reading from memory is very slow. Potentially hundreds of times slower than just doing some math on your processor. Your performance problems are memory problems.
I searched around my site a bit and couldn’t find “Tiny Lizard Rule #1” which is “Don’t Solve Problems You Don’t Have”. If you are not having performance issues, I wouldn’t run off to implement this kind of thing. I have only done this in pretty extreme cases, and it has certainly not come up unless I had millions of rows.
I think the other important warning here is you really have to test this stuff. Yes, it is very likely to give you a perf win, but as we saw… it’s pretty easy for things to get more complicated than expected… and numbers don’t lie. Make sure you have a good way to test performance before and after the changes.
But as we have seen… we got some pretty darn nice savings by splitting this column. Other than some added complexity, I think the only drawback is that getting a DISTINCTCOUNT on the column becomes problematic (a rough workaround is sketched below). Keep that in mind if you are applying this to a column where that matters. If you can think of other drawbacks, please let me know in the comments!
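To make that DISTINCTCOUNT point concrete, here is one possible workaround, sketched against the Version 2 Dollars/Cents split. I haven’t benchmarked this against these exact tables, so treat it as a starting point rather than the answer:

```
-- With SalesAmount split into Dollars and Cents, there is no single column
-- to run DISTINCTCOUNT over, so count the distinct Dollars/Cents
-- combinations instead.
[Distinct Sale Amounts] :=
COUNTROWS (
    SUMMARIZE ( Sales, Sales[Dollars], Sales[Cents] )
)
```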