Resource Requirements of Impala Lead to Frustration
Although I said in an earlier post that Impala is an exciting technology, using Impala with modest resources is very problematic. Queries over small amounts of data come back faster, but with large amounts of data the queries fail to execute and, frankly, only run in Hive. If you can afford new top-end servers with 256GB of memory (the recommended setting), Impala will work for you. In reality, for those with modest budgets, it can be a real problem. A query that takes longer to run is manageable, but a query that does not run at all is a real problem.
The benchmarks that claim Impala is faster than other SQL-on-Hadoop engines tend to use high-end servers with large amounts of memory (256GB) and disk, and some of the really large queries in those benchmarks fail to run. I was hoping that Impala would be useful for ad hoc querying on limited amounts of data. Sometimes you just need Hive, but the performance of Hive queries, particularly on complicated processing flows, is just too slow.
I was at a couple of meetups presented by Cloudera people, including the project lead, and I am impressed with the work that is going on and the direction the project is taking. However, some of the new direction seems to be diverging from Hive compatibility.
One frustration with Cloudera CDH 5.2 is that it does not come with the Tez engine for Hive; support says to install it at your own risk. Another is that the installation of Spark is less than perfect. I am heading toward exploring and testing the Hortonworks implementation. Hopefully Hive on Tez is less frustrating.