[ www.netezza.com ]

Thinking Inside the Box

7 Posts tagged with the performance tag

      

In a recent blog, Greg Rahn of Oracle responded to Phil’s “Oracle Exadata and Netezza TwinFin Compared” eBook; before commenting on an Oracle engineer’s views, I’ll restate the eBook’s larger themes.

 

Exadata connects Oracle’s RAC database, whose architecture was designed for online transaction processing (OLTP), via a fast network to a massively parallel processing storage tier. Because Exadata is an OLTP database paired with a specialized storage subsystem, tuning it to function as a data warehouse is complicated and demands skilled, highly trained, experienced technical staff. Mitigating the shortcomings of an OLTP database pressed into service as an analytic database with expensive network and storage makes Exadata costly: to acquire; to design, tune and maintain as an optimally configured data warehouse; and to run in the data center.

 

Netezza TwinFin, designed as an analytic database, brings the power of massively parallel processing to manage and exploit data at terabyte-to-petabyte scale. TwinFin is an appliance: easy to install, easy to operate and easy to manage. TwinFin offers value: fast performance for advanced analytics at an affordable price.

 

Now I’ll discuss the detail of Greg’s blog and respond from a Netezza perspective.

 

Claim: Exadata Smart Scan does not work with index-organized tables or clustered tables.

 

Greg responds that “IOTs and clustered tables are both structures optimized for fast primary key access, like the type of access in OLTP workloads, not data warehousing” and suggests our intent was to mislead by quoting from an old Oracle datasheet. It wasn’t. Oracle 11g Release 2 documentation reads “Index-organized tables are suitable for modeling application-specific index structures. For example, content-based information retrieval applications containing text, image and audio data require inverted indexes that can be effectively modeled using index-organized tables.” Elsewhere the documentation states “Index-organized tables are useful when related pieces of data must be stored together or data must be physically stored in a specific order. This type of table is often used for information retrieval, spatial and OLAP applications.” In the eBook Phil discusses first and second generation data warehouses; many of the applications described by Oracle as candidates for IOTs are typical of those our customers run on TwinFin – these are second generation data warehouse applications. Greg believes that Smart Scan not working with index-organized tables has zero impact on Exadata customers. Is it reasonable to conclude that Exadata is not being used for second generation data warehousing?

 

Claim: Exadata Smart Scan does not work with the TIMESTAMP datatype.

 

Since we published the first edition of the eBook, Christian Antognini, the original source of this information, has gone to the heart of the matter in his blog: “The essential thing to understand is that this limitation is due to bug 9682721. The fix is expected to be part of 11.2.0.2. According to my test cases (that Greg Rahn was so kind to execute against an early release of 11.2.0.2), offloading works correctly for all datetime functions but for the following three predicates.

 

  • months_between(d,sysdate) = 0
  • months_between(d,current_date) = 0
  • months_between(d,to_date('01-01-2010','DD-MM-YYYY')) = 0”


Note that the MONTHS_BETWEEN function can basically be offloaded. The problem in these cases is that the offloading does not work when, for example, SYSDATE is used as a parameter.

While I am happy to let this one pass, I have a question. Do organizations accrue value or cost from a technology that requires its administrators to understand all combinations of functions, their predicates and their parameters before they are capable of designing queries to be processed in parallel?

 

Claim: When transactions (insert, update, delete) are operating against the data warehouse concurrent with query activity, smart scans are disabled. Dirty buffers turn off smart scan.

 

In my opening comments I compared TwinFin’s simplicity to the complexity of Exadata. All queries submitted to TwinFin are processed in its massively parallel grid; no tuning, no special database design. This is appliance simplicity. In Exadata whether a query benefits from smart scans (massively parallel processing) can depend on the state of the data being read. Exadata requires developers to understand at great depth the physical path a query takes to access data. This is complexity.

 

While Greg concedes Exadata’s MPP processing is disabled for those blocks containing an active transaction, he is confident that “Not having Smart Scan for small number of blocks will have a negligible impact on performance”. My experience with Netezza’s customers and their applications prompts me to take a more circumspect view. I’ll explain why in the next section.

 

Claim: Using [a shared-disk] architecture for a data warehouse platform raises concern that contention for the shared resource imposes limits on the amount of data the database can process and the number of queries it can run concurrently.

 

Greg argues contention for shared disk is not a problem for Exadata and cites Daniel Abadi’s blog in his defense. Let’s take a look at what Daniel says on this subject: “If you are going to make an argument that shared-disk causes scalability problems, you have to make the argument that contention for the one shared resource in a shared-disk system is high enough to cause a performance bottleneck in the system - namely, you have to argue that the network connection between the servers and the shared-disk is a bottleneck.” This is the argument Phil makes in our eBook. Consider a query analyzing correlations between equity trades in a sector of a stock market. The algorithm calculates Spearman’s rank correlation coefficient (Spearman’s rho), which measures statistical dependence between two variables by assessing how well the relationship between them can be described by a monotonic function. This analysis creates valuable insight into whether specific equities influence the behavior of other equities in the same market sector within a window of one to ten minutes.

 

The customer loads a massive volume of trading data into TwinFin and constantly trickle feeds data from live markets into the warehouse. The query is run and re-run constantly to assess the behavior of different equities in dynamic markets. Each time, TwinFin completes a Cartesian join between all the equities in the sector while at the same time calculating a Volume-Weighted Average Price and a Return From Previous Close value for the equity under investigation. The results pass to the Spearman’s rank correlation coefficient function, which calculates the population covariance and the standard deviation of every equity combination for the time period. Netezza executes every step of the query in parallel, utilizing all of TwinFin’s hardware and software resources. Netezza’s intelligent storage selects only the rows needed for that market sector and projects only the columns needed for assessment. The join result is streamed directly to the code implementing the statistical analysis, which TwinFin downloads to every processor in its MPP grid, running the complex calculations in parallel. Results from each node in the MPP grid are returned via the network to the host for final assembly and rendering back to the requesting application. TwinFin completes the analysis in a few minutes, and then runs it again, and again, for as long as the market is open.
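
To make the calculation concrete, here is a minimal single-machine sketch of the two measures named above: a volume-weighted average price, and Spearman’s rho computed as the population covariance of the ranks divided by the product of their standard deviations. The function names and sample inputs are illustrative assumptions; this is not the customer’s code and not how TwinFin distributes the work across its grid.

```python
from statistics import mean, pstdev


def vwap(prices, volumes):
    """Volume-weighted average price: sum(price * volume) / sum(volume)."""
    total_volume = sum(volumes)
    return sum(p * v for p, v in zip(prices, volumes)) / total_volume


def ranks(values):
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman_rho(xs, ys):
    """Population covariance of the ranks over the product of their std devs."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry)) / len(rx)
    return cov / (pstdev(rx) * pstdev(ry))
```

In the workload described, a function like `spearman_rho` would be evaluated for every pair of equities in the sector, which is why the pairing step (the Cartesian join) dominates the data volume.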

 

After several hours Oracle 10g was still attempting to complete its first round of analysis. What difference will a new version of the Oracle database paired with an MPP storage system and a fast network make? Exadata’s MPP storage grid is unable to process Cartesian joins, the first step in this analytic process, meaning it brings no performance gain but must put all records on the network and send them across to Oracle RAC. Even if it were able to process the join, Exadata cannot push down the user-defined functions used to implement the calculations to MPP - in Oracle, functions always execute on the RAC servers. In processing the algorithms Oracle must create and manage temporary data sets and write these out of memory for storage. Exadata’s flash cache may play some role here, but the size of the data sets and the complexity of the algorithms will force database processes to write to disk. This flow from Oracle RAC goes back across a network still clogged with data coming from the MPP storage tier, queued and unprocessed, waiting for attention from a fully consumed Oracle RAC. I contend that Exadata’s network connection between the servers and the shared disk is a bottleneck. It is not Exadata’s only bottleneck. TwinFin demonstrates how a true MPP architecture excels in calculating Spearman’s rank correlation coefficient - a real workload on a real dataset. Oracle’s OLTP database, simply not designed to process large-scale analytics, is overwhelmed. Exadata suffers contention on its network and in its database system’s shared-disk architecture.

 

Back to the previous point about Exadata’s MPP processing being disabled for blocks containing an active transaction – the customer is constantly loading new market data and analyzing it in comparison with a massive volume of historic data. While entirely appropriate for transaction processing, Exadata’s architecture of disabling an entire block from parallel processing when a single record in the block is being updated can only hinder and never help in the data warehouse. The very point of a data warehouse is that all data should be available to the business as quickly as extract-transform-load processing allows. By pressing an OLTP database into service as an analytical database, Oracle unnecessarily burdens customers with creating database designs to work around this complexity and developing a thorough understanding of how each query accesses the data model. While not having Smart Scan for a small number of blocks may or may not impact performance, as an unnecessary complexity demanding the attention of database specialists, it costs customers real money.

 

Claim: Analytical queries, such as “find all shopping baskets sold last month in Washington State, Oregon and California containing product X with product Y and with a total value more than $35” must retrieve much larger data sets, all of which must be moved from storage to database.

 

Greg shows some nice SQL to demonstrate how Exadata processes the beer and pizza query. Give the business an answer and they always come back with a new question: “Greg, what was the total value of ‘Brand #42 beer’ sold in each basket?” Greg can now update his SQL with the clause:

 

sum(case when p.product_description in ('Brand #42 beer') then td.sales_dollar_amt else 0 end) sum_productX,

 

and re-run the query. Business users love IT when we give them a fast-performing system but are less forgiving when a query that ran blazingly fast yesterday slows to a snail’s pace today. Exadata cannot push down the newly introduced sum for parallel processing by its storage nodes, as the join must be processed first and the storage nodes cannot process joins. Any function or calculation that uses columns from two or more tables must be evaluated on the RAC database servers. The query performance is going to degrade significantly, sending the database expert back to the Oracle documentation in an attempt to find a new way to resolve the amended query so it completes in a time acceptable to the business.
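
The dependency on the join can be seen in a small, self-contained sketch. It uses Python’s built-in SQLite purely as a convenient local SQL engine and says nothing about how Exadata or TwinFin plan the query. Only `p.product_description`, `td.sales_dollar_amt` and the `sum_productX` alias come from the clause above; the table names, the other columns and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: only product_description and sales_dollar_amt
    -- appear in the original clause.
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_description TEXT);
    CREATE TABLE transaction_detail (basket_id INTEGER, product_id INTEGER, sales_dollar_amt REAL);
    INSERT INTO product VALUES (1, 'Brand #42 beer'), (2, 'Pizza');
    INSERT INTO transaction_detail VALUES
        (100, 1, 12.0), (100, 2, 30.0),
        (101, 2, 40.0),
        (102, 1, 6.0);
""")

# The CASE expression tests a column of product (p) and sums a column of
# transaction_detail (td), so it can only be evaluated on joined rows.
query = """
    SELECT td.basket_id,
           SUM(CASE WHEN p.product_description IN ('Brand #42 beer')
                    THEN td.sales_dollar_amt ELSE 0 END) AS sum_productX,
           SUM(td.sales_dollar_amt) AS basket_total
    FROM transaction_detail td
    JOIN product p ON p.product_id = td.product_id
    GROUP BY td.basket_id
    HAVING SUM(td.sales_dollar_amt) > 35
    ORDER BY td.basket_id
"""
rows = conn.execute(query).fetchall()
for basket_id, sum_product_x, basket_total in rows:
    print(basket_id, sum_product_x, basket_total)
```

Because the new aggregate mixes columns from both tables, no single-table scan can compute it; whichever tier performs the join must also evaluate the sum.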

 

Claim: To evenly distribute data across Exadata’s grid of storage servers requires administrators trained and experienced in designing, managing and maintaining complex partitions, files, tablespaces, indices, tables and block/extent sizes.

 

While conceding that Oracle Automatic Storage Management automates the task of striping partitions across all available disks, the ASM administration team must still create partitions, configure and manage disk groups for shared storage across instances, choose and implement either 2-way mirroring or 3-way mirroring, and configure Allocation Unit sizes. Additionally, Exadata configuration requires administrators to create and manage tablespaces, index spaces, temp spaces, logs and extents.

 

In conclusion, Netezza entered the data warehouse market convinced the products offered by the dominant vendors, in particular Oracle, were ill-suited to meet the challenges of Big Data and of such complexity as to make them exorbitantly expensive to acquire and use. Exadata only increases the complexity and expense of an Oracle warehouse. Greg draws his readers’ attention to the excellent blog at http://dbmsmusings.blogspot.com/ where Daniel Abadi muses “Both Oracle and Teradata are too expensive for large parts of the analytical database market.”

 

Greg’s blog reveals one path available to organizations wishing to generate greater value from their data. CIOs willing to build, train, and permanently assign a team of technical experts can keep that team continuously employed choosing just the right combination from a myriad of settings to coerce a database designed for OLTP into functioning as a data warehouse. I’ll close this blog with a manager’s perspective, from someone who focuses an organization’s limited resources on its highest priorities. Peter Drucker, who introduced us to the concept of the knowledge worker, gave us a pragmatic measure to evaluate our own and our team members’ activity - am I merely efficient (doing things right) or truly effective (doing the right thing)? All the workarounds and clever tuning demanded by Exadata simply don’t exist in TwinFin; Netezza has proven them unnecessary.


Netezza Director of Product Marketing Razi Raziuddin is blogging today.


     

I’ve been at the 2010 TDWI World Conference in San Diego this week, where the theme is “agile BI that delivers data (I would use the term ‘insights’) at the speed of thought.” Timing is everything when it comes to making decisions – and influencing others to make decisions we’d like to see.

 

We’ve all experienced Red Car Syndrome at some point or another. You test drive a red car. You like it. Suddenly, you start noticing red cars everywhere – not because the number of red cars has increased, but because the experience of driving a red car is now personalized. Online advertisers use Red Car Syndrome to connect consumers with the products they genuinely want, as I was reminded first-hand recently. While searching for kitchen fixtures online, I noticed that many of the ads featured a pair of pricey fixtures that initially caught our eye, but that we had rejected as exceeding our budget. But the ads seemed to know our tastes better than we did, and ultimately we succumbed and made the purchase.

 


 

The experience brought home the power of right-time analytics. Speed is critical in making analytics actionable and delivering real value to the business. The trifecta of huge data volumes, complex analytics and query performance is an increasingly common thread in the BI and data warehousing world. It is true not just for online marketers, but cuts across industry lines. Whether it is an insurance provider trying to prevent fraud, a telco determining the cheapest and best path to route a call or a government agency unearthing criminal activity, time to insight from big data makes the difference in every case.

 

Doug Henschen recently wrote a good article on this topic for InformationWeek in which he calls out success in the Big Data era as the ability to get faster insights from huge data sets. The article highlights Catalina Marketing’s petascale data warehouse environment and the fast insights they derive from a huge database of 195 million consumers.

 

Although not every enterprise has a data warehouse environment quite that large, the need to perform complex analytics and derive insight in the shortest time possible is common in every environment, big or small. While scalable MPP architectures address the big data problem quite well, the big math problem associated with complex and advanced analytics is what many customers still wrestle with. There’s general agreement that in-database processing, especially in scalable MPP systems, is the right solution to the big math problem. Doug’s article again highlights Catalina’s use of in-database analytics to radically streamline their analytic modeling environment and gain efficiencies of 10X as a result.

 

However, not every data warehouse platform is geared up for the challenges of performing in-database analytics at scale. The first and obvious challenge is the additional processing overhead required to run advanced analytic algorithms alongside the traditional data warehouse workload. You need a system architecture that is not overwhelmed by the data volumes typical of data warehouses in the Big Data era. Then there is the question of what analytics you want to perform. The majority of commonly available analytic libraries are written for in-memory processing in SMP systems and need to be parallelized in order to take advantage of MPP architectures. The analytic system should not only offer parallelized versions of the analytics you desire, but also provide primitives to easily parallelize advanced analytic algorithms while hiding the complexity of parallel programming from developers.

 

Finally, the dearth of universally accepted standards in the advanced analytics world poses yet another challenge. A typical analytic environment may consist of a mish-mash of commercially available tools such as SAS and SPSS, open source ones such as R and Hadoop (which are gaining popularity), and tons of application code written in various languages such as Java and Python. The underlying system must offer tremendous flexibility in integrating with a wide array of analytic tools and support for a variety of frameworks and languages.

 

In subsequent posts, I’ll talk about Netezza’s advanced analytic capabilities to enable big math on big data. In the meantime, as you plan your analytic infrastructures for the Big Data era, tell us what challenges you are coming up against.


I mentioned in my previous post that Netezza is excited about our partnership with Cloudera and Hadoop because we’ve already seen some of our customers benefit from the synergy of Hadoop and Netezza TwinFin™ technologies working together.

 

As I noted, these types of strategies play to the strengths of both technologies and roughly break down into two categories: 1) the use of a Hadoop Cluster for data ingestion, and 2) using a Hadoop Cluster for long-term data retention, which I’m addressing today.

 

Netezza TwinFin with a Hadoop Cluster Used for Queryable Archive Analytics

The second pattern we have seen customers deploy is one in which the Hadoop Cluster is used for long-term data retention, or as a “queryable archive”. Here one could think of Hadoop as a complementary analytic extension of the Netezza TwinFin when there is far less premium placed on low-latency or high-performance. In addition to the weblog and unstructured data analysis discussed in Pattern 1, the queryable archive could also retain long-term copies of structured data that had previously been loaded into the high-performance TwinFin appliance.

 

[Figure: Hadoop Cluster Used for Queryable Archive]

 

 

With a mix of structured, semi-structured and unstructured data loaded across the two complementary systems, customers can alter the level of granularity and data retention periods across each and typically use TwinFin for processing “hot” data and the Hadoop Cluster for processing “cool” or “cold” data, perhaps with specialized analytics. A deployment of this pattern could look like the following diagram:

 

[Diagram: Hadoop-NZ Arch 3.jpg]
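
As a rough illustration of the hot/cold split, the routing decision can be reduced to a retention rule. The sketch below is an assumption-laden example, not a Netezza or Hadoop API: the function name, the store labels and the 90-day window are all hypothetical, and a real deployment would set the window per data set.

```python
from datetime import date, timedelta


def choose_store(record_date, today, hot_days=90):
    """Route recent ("hot") data to the appliance and older data to the archive.

    hot_days is a hypothetical retention window, not a product setting.
    """
    if today - record_date <= timedelta(days=hot_days):
        return "twinfin"
    return "hadoop_archive"


today = date(2010, 7, 10)
print(choose_store(date(2010, 7, 1), today))   # recent data
print(choose_store(date(2009, 1, 1), today))   # older data
```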

 

 

Readers should view this pair of posts as a “point-in-time” look at the market. Our customers continue to innovate and make use of the complementary strengths of TwinFin and Hadoop. And Netezza will continue to innovate inside the appliance (adding performance, scale, workload management capabilities and especially the advanced analytics of i-Class), through partnerships like the one announced with Cloudera a week ago, and through expansion of our platform, software and virtualization capabilities beyond the TwinFin and Skimmer™ appliances. Those innovations should help alter and/or enhance some of the deployment directions discussed here.

 

Now, as I said at the outset of these two posts, I’d like to hear from you on your Netezza & Hadoop co-existence deployment and/or compatibility wish-list ideas. What would you like to see?


Two things before I begin:

  • I’ll begin this posting with a call for inputs. Below I will list a few of the most common Hadoop/Netezza co-existence deployment patterns we have seen to date. But I would like to hear from others. As you see the continuing deployment of Hadoop in the enterprise and as the Second Wave of TwinFin™ comes on with the advanced analytics capabilities of i-Class, how do you see the evolving deployment patterns happening in your environment?

  • A special hat-tip to Krishnan Parasuraman, Netezza’s Chief Architect for our Digital Media group, for his excellent help in aiding and abetting this post! I have used his guidance gratefully and (with his permission) stolen freely from some of his inputs.

 

You may have noticed a partnership announcement made by Cloudera and Netezza late last week. Together with Cloudera, Netezza will open up data movement and transformation between Cloudera’s Distribution for Hadoop and the Netezza family of appliances, enabling applications and data flows that integrate the two systems. We expect that our partnership with Cloudera, together with the Hadoop support in Netezza’s i-Class™ set of advanced analytics capabilities that are included as part of the upcoming 6.0 software release, will lead to some very innovative and expansive applications for our customers and for both companies.

 

Even today, Netezza customers are doing some very interesting things with deployment of Hadoop and our TwinFin data warehouse appliance. Far from being the “Hadoop v. SQL” battle that some people might like to make the current market out to be, we have instead noticed a growing number of “co-existence” deployment strategies and design patterns already at work with our customers – particularly among customers in the “Digital Media” vertical market.

 

These types of strategies can play to the strengths of both technologies and roughly break down into two categories: 1) the use of a Hadoop Cluster for data ingestion, which I’ll write about in further detail today; and 2) using a Hadoop Cluster for long-term data retention, or as a “queryable archive,” for which I’ll go into further detail in a post later this week.

 

Using a Hadoop Cluster for Raw Data Ingestion

The use of a Hadoop Cluster as the engine for data ingestion is the most common “co-existence” pattern we see in our customers’ mutual deployments of Hadoop and Netezza. The deployment pattern typically arises when the customer has hit specific performance and processing throughput scalability limitations with their existing Data Integration or ETL implementation.

 

Raw weblog data is the primary data source for most Digital Media analytics and reporting requirements. Weblogs are data rich (e.g., page views, impressions, click-throughs and demographics collected from application servers). They are typically semi-structured and collected and stored in flat files.

 

There are some critical facts about weblogs that present real performance challenges in processing them:

  • sheer volume: millions of rows of weblog data collected throughout the day and loaded daily into the data warehouse;
  • complex query processing: parsing and decoding encoded character strings requires text processing, pattern matching and tokenizing capabilities within the ETL process;
  • non-conformed dimensions: page-view or impression data is defined and represented differently by various systems, so fitting it into conformed dimensions is another very common data ingestion and processing challenge.
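
To give a feel for the second and third challenges, here is a minimal sketch of parsing one weblog line and conforming an event name. It assumes the log is in Common Log Format and that event names arrive under a handful of aliases; the pattern, the alias table and the function names are illustrative assumptions rather than any customer’s actual ETL logic.

```python
import re
from urllib.parse import parse_qs, unquote, urlsplit

# Assumed input: Common Log Format, e.g.
# 203.0.113.7 - - [10/Jul/2010:13:55:36 -0700] "GET /a?b=c HTTP/1.1" 200 5120
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical aliases used by different source systems for the same event.
EVENT_ALIASES = {
    "pv": "page_view",
    "pageview": "page_view",
    "page_view": "page_view",
    "imp": "impression",
    "impression": "impression",
}


def parse_weblog_line(line):
    """Tokenize one log line and decode its URL; return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    parts = urlsplit(record.pop("url"))
    record["path"] = unquote(parts.path)
    # parse_qs decodes percent-encoding; keep the first value of each parameter.
    record["params"] = {k: v[0] for k, v in parse_qs(parts.query).items()}
    record["status"] = int(record["status"])
    return record


def conform_event(name):
    """Map a source-specific event name onto one conformed dimension value."""
    return EVENT_ALIASES.get(name.strip().lower(), "unknown")
```

Run once per line across millions of rows a day, this kind of per-record text processing is the work the posts describe handing to a scalable ingestion tier.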

 

There are two common variants of this pattern, one dealing with semi-structured data (e.g., weblogs) and one with unstructured data (e.g., text), and customers often have versions of both variants in operation simultaneously.

 

[Figure: Semi-structured data ingest via Hadoop]

 

Semi-structured data is parsed (and possibly aggregated as well) in the Hadoop Cluster and then loaded into a TwinFin where the performance and workload scaling of the appliance is important for deeper analysis, higher throughput and faster reporting.
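
The parse-aggregate-load flow can be sketched in the map/reduce style, here in plain single-process Python rather than on an actual Hadoop Cluster. The record fields, the pipe delimiter and the function names are assumptions for illustration; they are not a Hadoop or Netezza interface.

```python
from collections import defaultdict


def map_page_view(record):
    """Map step: emit one ((date, path), 1) pair per parsed page view."""
    yield (record["date"], record["path"]), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each (date, path) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def to_load_rows(totals, delimiter="|"):
    """Format aggregated totals as delimited text rows ready for a bulk load."""
    return [
        delimiter.join([day, path, str(n)])
        for (day, path), n in sorted(totals.items())
    ]


records = [
    {"date": "2010-07-10", "path": "/home"},
    {"date": "2010-07-10", "path": "/home"},
    {"date": "2010-07-10", "path": "/search"},
]
pairs = [kv for r in records for kv in map_page_view(r)]
load_rows = to_load_rows(reduce_counts(pairs))
for row in load_rows:
    print(row)
```

Whether the warehouse receives the atomic events or only summaries like these is the sizing choice the post goes on to describe.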

 

 

[Figure: Unstructured data ingest via Hadoop]

 

Unstructured data in this pattern is contextualized (classified, mined, keyworded and indexed) in Hadoop and then moved into a Netezza TwinFin appliance for the low-latency, high-performance analytics used to drive business decisions.

 

 

A Hadoop Cluster provides a scalable ingestion mechanism that is well suited for addressing the challenges described above. The Cluster can be incrementally scaled to handle ingesting the massive volumes of weblog data and it can support text processing and complex data processing through programming languages such as Java or Python. [Note that with the coming i-Class set of analytics functionality, the programmability and some of the complex data processing may also be possible on the TwinFin, depending on a customer’s application needs or preference.]

 

Following the data ingest steps, processed weblog information is brought into TwinFin as atomic event information or as summarized tables, depending on the size of the appliance and analytic maturity & scale of the organization where it is deployed. A typical deployment might look like the following diagram:

[Diagram: Hadoop-NZ Arch 1.jpg]

 

 

An alternate, far less common, deployment design of the above co-existence pattern is used by some of our customers. That is the use of an external elastic MapReduce cloud (such as the Amazon Cloud) for the data ingestion purposes.

 

In cases where the customer has its application servers in Amazon’s EC2 cluster, it may also choose to use Amazon’s S3 web services for retaining weblog data. In that case, Amazon would provide the elastic MapReduce infrastructure for the data ingest process into the TwinFin appliance. This alternative deployment scenario would look something like the following:

[Diagram: Hadoop-NZ Arch 2.jpg]

 

 

The bottom line is that the different strengths of TwinFin and Hadoop lend themselves to complementary deployments – and some of our customers have already discovered innovative ways to leverage them together to maximize the value of both their investments.

 

In my next post, I’ll discuss the second pattern we’re noticing: one in which Netezza customers are using the Hadoop Cluster for long-term data retention.


News broke on Tuesday that EMC plans to acquire Greenplum to focus on data warehousing and analytics on “big data”. The idea is that by doing so, EMC is officially throwing its hat into the competitive ring for the ‘Data Warehouse Appliance’ (DWA) market – something of a defensive mechanism now that virtually all of the major data warehouse vendors are selling their own versions of a DWA – and consequently greatly reducing sales pull-through of EMC storage for data warehouse deployments.

Some referred to the merger as “a good fit for a storage vendor with appliance-y ideas” and others hailed it as follows, “the market has shifted as of late moving toward integrated appliances and this move gives EMC a very important arrow in its quiver” and labeled Greenplum as a purveyor of “very high performance database systems”.

One can also reasonably assume that this acquisition not only is intended to shore up a product offering weakness, but that it is also destined for affiliation with EMC’s other major initiative announced earlier this year – the Acadia Virtual Computing Environment (VCE) Joint Venture with Cisco Systems and headed up by Michael Capellas. The Acadia JV includes EMC’s storage and its VMware virtualization software as well as Cisco Systems’ compute nodes and networking. VCE is built on the concept of modular building blocks, called vblocks, that marry computing horsepower to storage capacity. All that’s missing from that story is a data warehouse DBMS to make it a full-on data warehouse appliance, right?

There are two big problems with these assumptions…


Performance: For all the discussion about “scale” and  “big data” in the EMC announcement, there is no mention of how either party can address the real issues that mainstream enterprises face every single day with their data warehouse systems – how to get maximum performance out of a complex, highly concurrent operational environment where hundreds if not thousands of users are banging away on the system, night and day.

  • The fact is that the actual Greenplum target market has clearly NOT been one that focused on high-performance analytics over the past several years. Instead, the few wins publicly announced by the company have been for very high capacity, limited compute platforms – applications more commonly referred to as “queryable archive”.
  • Curt Monash today again mentioned Greenplum’s lack of support for the “high-concurrency” requirements of a mainstream data warehouse.
  • This looks much more like adding a very basic set of storage-centric data warehousing capabilities in a move to find a broader channel for EMC’s traditional storage products rather than any strategic move into the world of high performance data analytics. Further to this point, neither company has done much of anything to address a very strong trend in the mainstream data warehouse market – the marriage of advanced, predictive analytics into the busy data warehouse systems.
  • David Vellante confirmed that to be successful the EMC/Greenplum marriage will need to yield, “optimized sytems[sic]; smokin’ fast performance; reference architectures; scale;” and “federation capabilities; not just big honking systems.” We couldn’t agree more but one can’t help but notice that neither Greenplum nor EMC have brought any of those characteristics to market for data warehousing to date.


Appliances: Since the acquisition is fairly transparent in its defense against moves by the likes of Oracle, Teradata and IBM (as well as Netezza seven years ago) to the appliance model, it’s hard to see how either EMC or Greenplum are effectively equipped now to do battle against those established players.

  • EMC have never really “sold” data warehousing to anyone previously and Greenplum have nearly prided themselves in going after “Greenfield” high capacity applications rather than head-to-head competition vs. established players. And one need look no further than the limited market penetration of HP’s Neoview to understand that it takes more than simply deep pockets to succeed in the data warehousing market.
  • Greenplum is not a purveyor of “integrated appliances” and at best, they can hope to infuse in EMC the ability to make their joint product offering a little more of an “appliance-y idea” (hat tip to Dr. Monash for coining the term) to the market. Instead, Greenplum have fashioned themselves over the past several years as a software only solution.
  • Assume that the Acadia VCE and “vblock” application is a big piece of this strategy. Neither Cisco nor EMC would claim that their servers, networking or storage arrays offer the lowest price-per-bit or price-per-performance alternative in the market. So one needs to think about what that means in terms of the price-performance competitiveness of this new “appliance-y” joint product.


In short, Greenplum joins the pantheon of “interesting” acquisitions for EMC as it will certainly stir some news cycles and drive some analysts and bloggers to create “fresh, new” content; but it’s not really something that I think will register on the Richter scale of customer market share.


A loyal customer alerted us to an Oracle blog by Jean-Pierre Dijcks earlier today that showed the Oracle FUD machine is fully revved-up and ready to go. I'd like to offer a rebuttal; however, in the interest of not intruding on Jean-Pierre's entry with an overly-long comment, I've just put a short response on his blog post with a pointer to this one.


Misconceptions and Misunderstandings, or Errors and Plain-old FUD?

I’m writing to correct *just a few* of the misconceptions about what is really important in high-performance, scalable data warehouse systems, along with the errors and plain-old “competitive FUD” points, in Jean-Pierre's posting earlier today. We certainly have posted some information recently about the TwinFin product, and Curt Monash’s postings late Thursday provided more info. If his readers are interested in learning more, or even signing up for a “Test Drive”, they should visit www.netezza.com.

First off, I think this is a “banner day” for Netezza. We believe that TwinFin (and the other products in the new product family) extend both our performance and price-performance advantage over our competitors. We stand by our marketing statements that we regularly demonstrate 10-100X performance advantages over our competitors, particularly competitive offerings of the major incumbent DW system vendors (“Just who are those incumbents?” Jean-Pierre's readers may ask. Well, let’s just say that we see Oracle as the incumbent system and/or a challenger system in over 50% of our deal flow.).

Regarding his claims about DBM being “faster than Netezza” (and I can only assume he meant at “real” data warehouse tasks) - we’re ready whenever Oracle feels up to actually taking one of their Database Machines onsite to a customer for a fair, open customer benchmark. So far, Oracle have been, shall we say, “a little reticent” to do on-site benchmark testing against Netezza.

Next, given the large number of incorrect points in the original posting, I think perhaps that just a few of them will be useful enough for readers to get the gist of just how far afield some of the ‘facts’ are:

  • “It all comes down to data scan rates per rack”: Would that it were true that all of data warehousing boiled down to full-stream data scans (as if the entire world of analytics relied on “select count(*) from lineitem” types of queries); then we could all measure “goodness” by how many GB/sec of data could be burst-scanned in our systems. But that’s not the case. So we build Netezza’s data and analytic appliances to deliver the best possible overall performance at the best price and power requirements. As a consequence, and following from those same numbers as-posted, a single rack of TwinFin can process (not just scan) about 400 million rows of data per second. That’s process, as in: “scan, decompress, project, restrict, AND join, etc.” Need more processing firepower? Netezza’s system performance scales linearly with the addition of more S-Blades: at the low-end, the TwinFin 3 can deliver as much as 100M rows/second of processing horsepower, while the TwinFin 120 can provide you with 4 billion rows/second. Does a system that still relies on using SMP-based servers running “plain old” Oracle 11g RAC scale similarly for data warehousing?


  • “Non-open Linux running on FPGAs”: I’m really not sure what (if anything) was meant by this, but saying that Netezza’s FPGAs “are apparently running non-open Linux” is oxymoronic on at least two different levels (FPGAs don’t typically “run” an OS and, “non-open Linux” - really?).


  • “User data & compression”: I also enjoyed the accounting of all that “user data” available to DBM users in the Oracle table and the various comments about compression. When Netezza quotes user data capacities in our systems, the numbers reflect real raw user data space, not space that will be further reduced because of required indexes in an attempt to boost performance. Furthermore, Netezza’s compression & decompression techniques allow us to extract “pure performance” from their use. Rather than relying on CPU cycles to decompress the data before it can be processed any further, we have the FPGA engines decompress the data on-the-fly, as fast as it streams off the disk drives. Can Oracle make either of those claims?


  • “Tolerating node failures without downtime”: In perhaps the most bald-faced inaccuracy, the Oracle blog claimed that Netezza “continues to lack the ability to tolerate node failures without downtime”. This I can only chalk up to pure competitive “FUD-ism”, as our capabilities in this area have been quite strong throughout the four generations of Netezza appliances and are further strengthened in TwinFin. Netezza is a fully-redundant system with no single point of failure, even in our smallest systems. Failover in the presence of failures of the disk drives, S-Blades, internal networking or host processors (in short, everything) is automatic and done in-service, with hot-swappable replacement throughout.


  • “Appliance simplicity”: One thing Jean-Pierre didn’t address, though his take on it might have been humorous to see, is the notion of “appliance simplicity” - basically the ability to build, support and maintain large to very large-sized data warehouses, with heavy workloads, with no or minimal tuning, partitioning, indexing or other “performance duct tape” required. Routinely, this capability in the Netezza systems is what delights our customers most, and we have customers managing systems with several hundreds of terabytes of user data (not indexes + data, mind you - real data) with fractions of an FTE (full-time employee) devoted to them.
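
As a rough sanity check on the linear-scaling claim in the first bullet above, here is a minimal Python sketch that divides each quoted throughput figure by a blade count. The throughput numbers come from this post; the blade counts are my assumption for illustration (that the model number equals the number of S-Blades and that a single rack holds 12), so treat the result as back-of-the-envelope arithmetic, not a benchmark.

```python
# Throughput figures quoted in the post (rows processed per second).
QUOTED_ROWS_PER_SEC = {
    "TwinFin 3": 100_000_000,
    "single rack": 400_000_000,
    "TwinFin 120": 4_000_000_000,
}

# ASSUMPTION (not stated in the post): the model number equals the S-Blade
# count, and a single rack holds 12 S-Blades.
ASSUMED_S_BLADES = {"TwinFin 3": 3, "single rack": 12, "TwinFin 120": 120}


def rows_per_blade(model: str) -> float:
    """Quoted throughput divided by the assumed number of S-Blades."""
    return QUOTED_ROWS_PER_SEC[model] / ASSUMED_S_BLADES[model]


def scales_linearly(tolerance: float = 0.01) -> bool:
    """True if per-blade throughput agrees across models within the tolerance."""
    rates = [rows_per_blade(model) for model in QUOTED_ROWS_PER_SEC]
    baseline = rates[0]
    return all(abs(rate - baseline) / baseline <= tolerance for rate in rates)


if __name__ == "__main__":
    for model in QUOTED_ROWS_PER_SEC:
        per_blade_millions = rows_per_blade(model) / 1e6
        print(f"{model}: {per_blade_millions:.1f}M rows/sec per S-Blade")
    print("consistent with linear scaling:", scales_linearly())
```

Under those assumed blade counts, every configuration works out to the same per-blade rate, which is what "scales linearly" means in practice.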


I hope that clears up some of the misconceptions. If any of Jean-Pierre's readers or Oracle customers would like to see or hear more about TwinFin for themselves, we definitely would invite them to come stop by our booth (#207) at TDWI or come to one of our regional Enzee Universe events coming to a location near you.


Change, but no Change

Posted by Phil Francisco Jul 31, 2009

Just trying to clarify. Curt Monash's informative blog on the coming Netezza system and family of products includes the following:

 

<snip>

 

Beyond the switcheroo in components, Netezza is making substantial changes to its hardware architecture. In current Netezza products, the FPGA plays the role of a disk controller on steroids — it receives data, does some SQL or other analytic operations on it, and then throws it over the wall to the CPU for the rest of the processing. The new Netezza product family, however, adds an actual disk controller. More important, it adds fast interconnects between the FPGAs, the disk controller, and RAM — specifically, as Phil Francisco put it in an email,

using multiple parallel channels of PCIe with much faster interconnection rates and lower contention between the blade server and the “DB accelerator card” with the FPGAs.

DMA (Direct Memory Access) technology also fits into the picture somehow.

 

<snip>

 

...which seems to beg further clarification.

 

While Curt suggests big changes are afoot in Netezza's “architecture” - I think a more appropriate viewpoint would be that it's “the same architecture with a new physical implementation”. That is, the concept of data streaming from disk through the system is just as important now as it ever was.

 

S-Blade Diagram.jpg

 

True, we did move the "disk controller" function to a pair of HBA (Host Bus Adapter) cards that interface with the disk enclosures using multiple, redundant SAS (Serial-Attached SCSI) connections, providing more than ample bandwidth to stream all the drives per rack continuously to the blades. For those who click through on Curt's blog, this function is embedded in the device labeled “SAS Expander Module” (one each on the blade server and the "DB accelerator") in the 3rd chart of the PDF file (and also shown above) and allows data to stream from disk through to memory and then on to the FPGA without delay.

 

SP Data Flow.jpg

 

To move data between the blade server and the DB accelerator cards, we use IBM's expansion card (formerly known as "sidecar") technology to provide multiple parallel high-speed PCIe (Peripheral Component Interconnect Express) channels. These channels deliver the data streams from the disk drives to the memory on each blade server and provide a very high-speed interconnect between the FPGA devices and that same memory, using DMA (direct memory access) so that the memory can be reached at high speed without encumbering the CPU.

 

FPGA Engines.jpg

 

With all this high-speed interconnectivity, Netezza has been able to alter the data flow so that data streams to the memory first and then to the various FAST engines (see above diagram and/or refer to Issue 16: The Latest Addition to Netezza's FAST Engines Framework) in the FPGA. Those engines act as a "turbocharger" for query processing, implementing data decompression, restricting, projecting and applying the appropriate visibility rules in a pipelined process, typically filtering out well over 95% of the data scanned. From the FPGA, the resulting reduced data set is passed on to the CPU memory for the additional processing needed to complete the query.
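
To make the pipelined flow described above a little more concrete, here is a minimal software analogy in Python. It is only a sketch under stated assumptions: the real FAST engines are FPGA logic, not Python generators, and the CSV-style row format, the `deleted` flag standing in for visibility rules, the use of zlib as the compression scheme, the stage ordering and all function names are invented here for illustration.

```python
import zlib
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def make_block(lines: List[str]) -> bytes:
    """Pack CSV-style rows into one compressed block (stand-in for on-disk data)."""
    return zlib.compress("\n".join(lines).encode("utf-8"))


def decompress(blocks: Iterable[bytes]) -> Iterator[Row]:
    """Expand each compressed block into rows as it streams past."""
    for block in blocks:
        for line in zlib.decompress(block).decode("utf-8").splitlines():
            order_id, region, amount, deleted = line.split(",")
            yield {
                "order_id": int(order_id),
                "region": region,
                "amount": float(amount),
                "deleted": deleted == "1",
            }


def apply_visibility(rows: Iterable[Row]) -> Iterator[Row]:
    """Drop rows the query should not see (here: rows flagged as deleted)."""
    return (row for row in rows if not row["deleted"])


def restrict(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Keep only the rows that satisfy the WHERE-style predicate."""
    return (row for row in rows if predicate(row))


def project(rows: Iterable[Row], columns: List[str]) -> Iterator[Row]:
    """Keep only the requested columns of each surviving row."""
    return ({col: row[col] for col in columns} for row in rows)


def scan(blocks: Iterable[bytes], predicate: Callable[[Row], bool],
         columns: List[str]) -> Iterator[Row]:
    """Chain the stages; generators let each row flow through one at a time."""
    return project(restrict(apply_visibility(decompress(blocks)), predicate), columns)


if __name__ == "__main__":
    blocks = [
        make_block(["1,EU,10.0,0", "2,US,250.0,0"]),
        make_block(["3,US,900.0,1", "4,US,75.5,0"]),
    ]
    for row in scan(blocks, lambda r: r["region"] == "US", ["order_id", "amount"]):
        print(row)
```

The point of the sketch is the shape of the flow: nothing is materialized between stages, and only the small, already-filtered result ever reaches the code that would stand in for the CPU.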

 

So, the logical streaming model of data from disk to FPGA to CPU is retained, with significantly higher throughput as a result. But there's an added benefit: the originally-scanned data can remain in memory, still in compressed & unfiltered form, to be used as a cache, avoiding disk scan activity where possible and helping boost system performance even more. In short, "Change, but no Change."
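
The caching benefit described above can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of how the appliance manages memory: the class name `CompressedBlockCache`, the use of zlib, and the unbounded dictionary with no eviction policy are all hypothetical choices made to keep the example short.

```python
import zlib
from typing import Callable, Dict


class CompressedBlockCache:
    """Keeps scanned blocks in memory in compressed, unfiltered form."""

    def __init__(self, read_from_disk: Callable[[int], bytes]) -> None:
        self._read_from_disk = read_from_disk  # hypothetical disk-read callback
        self._blocks: Dict[int, bytes] = {}    # block id -> compressed bytes
        self.disk_reads = 0

    def get_compressed(self, block_id: int) -> bytes:
        """Return the compressed block, touching disk only on a cache miss."""
        if block_id not in self._blocks:
            self._blocks[block_id] = self._read_from_disk(block_id)
            self.disk_reads += 1
        return self._blocks[block_id]

    def scan(self, block_id: int) -> bytes:
        """Decompress on every scan; only the compressed form is retained."""
        return zlib.decompress(self.get_compressed(block_id))


if __name__ == "__main__":
    fake_disk = {0: zlib.compress(b"row-a\nrow-b")}
    cache = CompressedBlockCache(fake_disk.__getitem__)
    cache.scan(0)  # first scan reads from the (fake) disk
    cache.scan(0)  # second scan is served from memory
    print("disk reads:", cache.disk_reads)
```

Because the cached copy stays compressed and unfiltered, it takes less memory than expanded rows would and can serve any later query over the same blocks, whatever its predicate.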

 

I hope that helps - with Curt's architecture viewpoint as well as with questions about our use of PCIe interconnects to raise performance.
