I’m currently working with a bunch of real estate data. The properties are identified by an ugly string. Of DOOOM! There are quite a few relationships built on this ugly sucker, which led me to wonder… what if I mapped them all to simple integers?
I whipped up some totally bogus sample data to test this, using property IDs that look like “WA534879” and “CA35849”. Properties have attributes like SqrFt, # of Beds, and which region they are in. And I hooked up a few million rows of “facts” about occupancy: actuals, budgeted, and forecasted, all on a per-month basis. Model here:
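If you want to cook up similar bogus data yourself, here is a rough Python sketch. The state list, the ID shape, and the function name are all my own invention, not anything pulled from the actual workbook:

```python
import random

# Hypothetical subset of regions; the real data presumably has more.
STATES = ["WA", "CA", "OR", "TX"]

def make_property_ids(n):
    """Generate n distinct ugly string IDs like 'WA534879' or 'CA35849'."""
    ids = set()
    while len(ids) < n:
        ids.add(random.choice(STATES) + str(random.randint(10000, 999999)))
    return sorted(ids)

sample = make_property_ids(5)
```

Scale `n` up to a half million and join a few million occupancy fact rows against those strings, and you have roughly the shape of the test model.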
With about a half million rows of properties (a bit absurd, but it’s hard to make Power Pivot slow) and 3 million rows in each of the fact tables, I looked at the memory usage, and here is how she looks:
Not surprisingly, the property IDs are taking up nearly all the memory, and certainly requiring a hash table in each table to hold the strings, as discussed in this post.
Our plan is pretty simple: assign integers starting with zero to each of the property IDs, then use those integers throughout the model. We can include an extra lookup table to get back from the integers to the original property IDs if users still need to see them (on rows, slicers, whatever).
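The mapping is literally just “number the distinct IDs from zero”. A minimal Python sketch of the idea (in practice this lives in the source database, and every name here is made up for illustration):

```python
def build_surrogate_keys(property_ids):
    """Assign a zero-based integer to each ugly string ID,
    plus a reverse lookup table to get the string back."""
    # string PropertyId -> small integer Id, assigned from zero
    key_for = {pid: i for i, pid in enumerate(property_ids)}
    # reverse lookup table: integer Id -> original PropertyId
    lookup = {i: pid for pid, i in key_for.items()}
    return key_for, lookup

key_for, lookup = build_surrogate_keys(["WA534879", "CA35849", "WA11111"])

# Fact rows then carry the integer instead of the string itself:
facts = [{"Id": key_for[pid], "Occupancy": occ}
         for pid, occ in [("WA534879", 0.93), ("CA35849", 0.88)]]
```

All relationships in the model then hang off the small integer, and the lookup table exists only so the original string can still show up on rows and slicers.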
For clarity, all of this mapping happens in the original data sources (and I really hope that is SQL, or you are probably having an exceptionally bad day).
The model is not interestingly different – we added a table to map from the integers back to the original string-style property IDs. The real difference is that instead of those string IDs, we are using the simple integer for all the relationships (now called just “Id” because I am a naming WIZARD).
And here are the results from a memory utilization standpoint:
Not bad! Each of our tables dropped from ~22 MB to about 9 MB. And if you are to believe the Tiny Lizard that “in Power Pivot, all performance problems are memory problems”, we should expect this to lead to better performance.
To test this, I … well, I had a really un-fun evening trying to find a measure that wasn’t “too fast to measure” but also wasn’t “so slow that I would eat a bag of chips between measurements”. I have a snacking problem.
In the end I found a measure to get the count of properties where the average occupancy (actual) was greater than the budgeted and forecasted occupancies, split out by state… then I would click a slicer for the Month. Total. Science.
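The actual measure is DAX in the workbook; purely to pin down the logic being tested, here is a sketch of the same computation in Python. The column names (`State`, `Id`, `Actual`, `Budget`, `Forecast`) are my guesses, not the real schema:

```python
from collections import defaultdict

def properties_beating_plan(rows):
    """Count properties, per state, whose average actual occupancy
    exceeds both their average budgeted and forecasted occupancy."""
    # Accumulate per-property sums: [actual, budget, forecast, row count]
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for r in rows:
        s = sums[(r["State"], r["Id"])]
        s[0] += r["Actual"]
        s[1] += r["Budget"]
        s[2] += r["Forecast"]
        s[3] += 1
    # Count properties beating both plans, grouped by state
    counts = defaultdict(int)
    for (state, _id), (a, b, f, n) in sums.items():
        if a / n > b / n and a / n > f / n:
            counts[state] += 1
    return dict(counts)
```

In the pivot table, the state grouping lands on rows and the Month slicer filters `rows` down to one month before any of this runs, which is what made the click a repeatable timing test.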
The results were that the “better” model (the one using integers for relationships) was indeed faster. It wasn’t crazy faster, but the time to slice my pivot table dropped from 25 seconds to 15 seconds. Which is at least… interestingly faster?
I have to say, this kind of feels like a problem that Power Pivot should have handled for me natively. Like, if it had a concept of a “global hash table” that was shared across tables… I wouldn’t have to jump through these hoops? It’s a theory.