Tuesday, November 27, 2007

Just How Scalable Is QlikTech?

A few days ago, I replied to a question regarding QlikTech scalability. (See What Makes QlikTech So Good?, August 3, 2007) I asked QlikTech itself for more information on the topic but haven’t learned anything new. So let me simply discuss this based on my own experience (and, once again, remind readers that while my firm is a QlikTech reseller, comments in this blog are strictly my own.)

The first thing I want to make clear is that QlikView is a wonderful product, so it would be a great pity if this discussion were to be taken as a criticism. Like any product, QlikView works within limits that must be understood to use it appropriately. No one benefits from unrealistic expectations, even if fans like me sometimes create them.

That said, let’s talk about what QlikTech is good at. I find two fundamental benefits from the product. The first is flexibility: it lets you analyze data in pretty much any way you want, without first building a data structure to accommodate your queries. By contrast, most business intelligence tools must pre-aggregate large data sets to deliver fast response. Often, users can’t even formulate a particular query if the dimensions or calculated measures were not specified in advance. Much of the development time and cost of conventional solutions, whether based in standard relational databases or specialized analytical structures, is spent on this sort of work. Avoiding it is the main reason QlikTech is able to deliver applications so quickly.

The other big benefit of QlikTech is scalability. I can work with millions of records on my desktop with the 32-bit version of the system (maximum memory 4 GB if your hardware allows it) and still get subsecond response. This is much more power than I’ve ever had before. A 64-bit server can work with tens or hundreds of millions of rows: the current limit for a single data set is apparently 2 billion rows, although I don’t know how close anyone has come to that in the field. I have personally worked with tables larger than 60 million rows, and QlikTech literature mentions an installation of 300 million rows. I strongly suspect that larger ones exist.

So far so good. But here’s the rub: there is a trade-off in QlikView between really big files and really great flexibility. The specific reason is that the more interesting types of flexibility often involve on-the-fly calculations, and those calculations require resources that slow down response. This is more a law of nature (there’s no free lunch) than a weakness in the product, but it does exist.

Let me give an example. One of the most powerful features of QlikView is a “calculated dimension”. This lets reports construct aggregates by grouping records according to ad hoc formulas. You might want to define ranges for a value such as age, income or unit price, or create categories using if/then/else statements. These formulas can get very complex, which is generally a good thing. But each formula must be calculated for each record every time it is used in a report. On a few thousand rows, this can happen in an instant, but on tens of millions of rows, it can take several minutes (or much longer if the formula is very demanding, such as on-the-fly ranking). At some point, the wait becomes unacceptable, particularly for users who have become accustomed to QlikView’s typically-immediate response.

As problems go, this isn’t a bad one because it often has a simple solution: instead of on-the-fly calculations, precalculate the required values in QlikView scripts and store the results on each record. There’s little or no performance cost to this strategy since expanding the record size doesn’t seem to slow things down. The calculations do add time to the data load, but that happens only once, typically in an unattended batch process. (Another option is to increase the number and/or speed of processors on the server. QlikTech makes excellent use of multiple processors.)

The really good news is you can still get the best of both worlds: work out design details with ad hoc reports on small data sets; then, once the design is stabilized, add precalculations to handle large data volumes. This is vastly quicker than prebuilding everything before you can see even a sample. It’s also something that’s done by business analysts with a bit of QlikView training, not database administrators or architects.

Other aspects of formulas and database design also more important in QlikView as data volumes grow larger. The general solution is the same: make the application more efficient through tighter database and report design. So even though it’s true that you can often just load data into QlikView and work with it immediately, it’s equally true that very large or sophisticated applications may take some tuning to work effectively. In other words, QlikView is not pure magic (any result you want for absolutely no work), but it does deliver much more value for a given amount of work than conventional business intelligence systems. That’s more than enough to justify the system.

Interestingly, I haven’t found that the complexity or over-all size of a particular data set impacts QlikView performance. That is, removing tables which are not used in a particular query doesn’t seem to speed up that query, nor does removing fields from tables within the query. This probably has to do with QlikTech’s “associative” database design, which treats each field independently and connects related fields directly to each other. But whatever the reason, most of the performance slow-downs I’ve encountered seem related to processing requirements.

And, yes, there are some upper limits to the absolute size of a QlikView implementation. Two billions rows is one, although my impression (I could be wrong) is that could be expanded if necessary. The need to load data into memory is another limit: even though the 64-bit address space is effectively infinite, there are physical limits to the amount of memory that can be attached to Windows servers. (A quick scan of the Dell site finds a maximum of 128 GB.) This could translate into more input data, since QlikView does some compression. At very large scales, processing speed will also impose a limit . Whatever the exact upper boundary, it’s clear that no one will be loading dozens of terabytes into QlikView any time soon. It can certainly be attached a multi-terabyte warehouse, but would have to work with multi-gigabyte extracts. For most purposes, that’s plenty.

While I’m on the topic of scalability, let me repeat a couple of points I made in the comments on the August post. One addresses the notion that QlikTech can replace a data warehouse. This is true in the sense that QlikView can indeed load and join data directly from operational systems. But a data warehouse is usually more than a federated view of current operational tables. Most warehouses include data integration to link otherwise-disconnected operational data. For example, customer records from different systems often can only be linked through complex matching techniques because there is no shared key such as a universal customer ID. QlikView doesn’t offer that kind of matching. You might be able to build some of it using QlikView scripts, but you’d get better results at a lower cost from software designed for the purpose.

In addition, most warehouses store historical information that is not retained in operational systems. A typical example is end-of-month account balance. Some of these values can be recreated from transaction details but it’s usually much easier just to take and store a snapshot. Other data may simply be removed from operational systems after a relatively brief period. QlikView can act as a repository for such data: in fact, it’s quite well suited for this. Yet in such cases, it’s probably more accurate to say that QlikView is acting as the data warehouse than to say a warehouse is not required.

I hope this clarifies matters without discouraging anyone from considering QlikTech. Yes QlikView is a fabulous product. No it won’t replace your multi-terabyte data warehouse. Yes it will complement that warehouse, or possibly substitute for a much smaller one, by providing a tremendously flexible and efficient business intelligence system. No it won’t run itself: you’ll still need some technical skills to do complicated things on large data volumes. But for a combination of speed, power, flexibility and cost, QlikTech can’t be beat.

2 comments:

Unknown said...

David,
Good article. How would you update it or maje it current? I am curious to hear your current impressions.

Alan Frutchey VP Technology
afrutchey@jacksonhealthcare.com

David Raab said...

2/14/09 - In general, I think the post is still pretty much accurate. QlikView has keeps working on its underlying engine, so memory use and calculation speed have improved a bit. There have also been some enhancements to make mass deployments easier, which is another aspect of scalability. Outside of QlikView itself, the cost/performance of 64-bit systems with very large amounts of memory has continued to improve.

Hope this helps.