[ www.netezza.com ]

Thinking Inside the Box


Two things before I begin:

  • I’ll begin this posting with a call for inputs. Below I will list a few of the most common Hadoop/Netezza co-existence deployment patterns we have seen to date. But I would like to hear from others. As you see the continuing deployment of Hadoop in the enterprise and as the Second Wave of TwinFin™ comes on with the advanced analytics capabilities of i-Class, how do you see the evolving deployment patterns happening in your environment?

  • A special hat-tip to Krishnan Parasuraman, Netezza’s Chief Architect for our Digital Media group, for his excellent help in aiding and abetting this post! I have used his guidance gratefully and (with his permission) stolen freely from some of his inputs.

 

You may have noticed a partnership announcement made by Cloudera and Netezza late last week. Together with Cloudera, Netezza will open up data movement and transformation between Cloudera’s Distribution for Hadoop and the Netezza family of appliances, enabling applications and data flows that integrate the two systems. We expect that our partnership with Cloudera, together with the Hadoop support in Netezza’s i-Class™ set of advanced analytics capabilities included in the upcoming 6.0 software release, will lead to some very innovative and expansive applications for our customers and for both companies.

 

Even today, Netezza customers are doing some very interesting things with deployment of Hadoop and our TwinFin data warehouse appliance. Far from being the “Hadoop v. SQL” battle that some people might like to make the current market out to be, we have instead noticed a growing number of “co-existence” deployment strategies and design patterns already at work with our customers – particularly among customers in the “Digital Media” vertical market.

 

These types of strategies can play to the strengths of both technologies and roughly break down into two categories: 1) the use of a Hadoop Cluster for data ingestion, which I’ll write about in further detail today; and 2) using a Hadoop Cluster for long-term data retention, or as a “queryable archive,” for which I’ll go into further detail in a post later this week.

 

Using a Hadoop Cluster for Raw Data Ingestion

The use of a Hadoop Cluster as the engine for data ingestion is the most common “co-existence” pattern we see in our customers’ mutual deployments of Hadoop and Netezza. The deployment pattern typically arises when the customer has hit specific performance and processing throughput scalability limitations with their existing Data Integration or ETL implementation.

 

Raw weblog data is the primary data source for most Digital Media analytics and reporting requirements. Weblogs are data rich (e.g., page views, impressions, click-throughs and demographics collected from application servers). They are typically semi-structured, and they are collected and stored in flat files.

 

There are some critical facts about weblogs that present real performance challenges in processing them:

  • sheer volume: millions of rows of weblog data are collected throughout the day and loaded daily into the data warehouse;
  • complex query processing: parsing and decoding encoded character strings requires text-processing, pattern-matching and tokenizing capabilities within the ETL process;
  • non-conformed dimensions: page views or impression data are defined and represented differently by various systems, so fitting them into conformed dimensions is another very common data ingestion and processing challenge.
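
To make the parsing and decoding challenge concrete, here is a minimal Python sketch of the pattern matching and tokenizing involved. It assumes the weblogs use the Apache "combined" log layout, which this post does not actually specify; the sample line, field names and `parse_log_line` helper are all hypothetical.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout: Apache "combined" log format. The post does not say
# which format customers' weblogs actually use.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    parts = urlsplit(record["url"])
    record["path"] = parts.path
    # parse_qs tokenizes the query string and decodes percent-encoded values.
    record["params"] = {k: v[0] for k, v in parse_qs(parts.query).items()}
    record["status"] = int(record["status"])
    return record

# Hypothetical sample line, for illustration only.
sample = ('203.0.113.7 - - [12/Oct/2010:06:25:24 +0000] '
          '"GET /article/42?campaign=fall%20sale&ref=home HTTP/1.1" 200 5120')
record = parse_log_line(sample)
print(record["path"], record["params"])
```

Lines that do not match return `None` rather than raising, so a real ingest job could count or quarantine malformed rows instead of failing the whole batch.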

 

There are two common variants of this pattern: one deals with semi-structured data (e.g., weblogs) and the other with unstructured data (e.g., text). Customers will often have both variants in operation simultaneously.

 

Hadoop-NZ 2.png

Semi-structured data ingest via Hadoop

 

Semi-structured data is parsed (and possibly aggregated as well) in the Hadoop Cluster and then loaded into a TwinFin, where the appliance’s performance and workload scaling matter for deeper analysis, higher throughput and faster reporting.
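
The parse-then-aggregate step can be sketched as a map and a reduce function. The example below simulates the map, shuffle and reduce phases in a single Python process rather than on a real cluster, and writes the summary as pipe-delimited text of the kind a bulk loader could consume. The record fields, function names and delimiter are assumptions for illustration, not a description of any customer's job.

```python
from collections import defaultdict

def map_page_view(record):
    """Map step: emit ((date, path), 1) for each parsed page-view record."""
    yield (record["date"], record["path"]), 1

def reduce_counts(key, values):
    """Reduce step: sum the counts that share a key."""
    return key, sum(values)

def run_job(records):
    """Simulate map -> shuffle -> reduce in one process."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_page_view(record):
            grouped[key].append(value)  # the "shuffle": group values by key
    return [reduce_counts(key, values) for key, values in sorted(grouped.items())]

def to_delimited(rows, delimiter="|"):
    """Format summarized rows as delimited text, one line per row."""
    return [delimiter.join([date, path, str(count)])
            for (date, path), count in rows]

# Hypothetical parsed records, for illustration only.
records = [
    {"date": "2010-10-12", "path": "/home"},
    {"date": "2010-10-12", "path": "/article/42"},
    {"date": "2010-10-12", "path": "/home"},
    {"date": "2010-10-13", "path": "/home"},
]
summary = run_job(records)
lines = to_delimited(summary)
for line in lines:
    print(line)
```

Skipping the reduce step and writing one delimited row per record would give the atomic-event form instead of the summarized one.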

 

 

Hadoop-NZ 1.jpg

Unstructured data ingest via Hadoop

 

Unstructured data in this pattern is contextualized (classified, mined, keyworded and indexed) in Hadoop and then moved into a Netezza TwinFin appliance for the low-latency, high-performance analytics used to drive business decisions.
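
As a rough illustration of the "keyworded and indexed" part of that contextualization, the sketch below extracts keywords from free text and builds a small inverted index from keyword to document IDs. The stop-word list, tokenizing rule and document IDs are hypothetical, and a production job would run this logic across the cluster rather than in one process.

```python
import re
from collections import defaultdict

# A tiny, hypothetical stop-word list; real jobs would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}

def extract_keywords(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {token for token in tokens if token not in STOP_WORDS}

def build_index(documents):
    """Build an inverted index: keyword -> set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for keyword in extract_keywords(text):
            index[keyword].add(doc_id)
    return index

# Hypothetical documents, for illustration only.
documents = {
    "doc1": "The new phone is great for video",
    "doc2": "Video streaming and phone plans",
}
index = build_index(documents)
print(sorted(index["video"]))
```

The resulting keyword-to-document pairs are structured rows, which is the form that can then be moved into the appliance for analysis.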

 

 

A Hadoop Cluster provides a scalable ingestion mechanism that is well suited to addressing the challenges described above. The Cluster can be incrementally scaled to handle ingesting the massive volumes of weblog data, and it can support text processing and complex data processing through programming languages such as Java or Python. [Note that with the coming i-Class set of analytics functionality, the programmability and some of the complex data processing may also be possible on the TwinFin, depending on a customer’s application needs or preference.]
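
One of the challenges listed earlier, non-conformed dimensions, lends itself to a short Python sketch of the kind of logic such an ingest job might carry. Each source system's event vocabulary is mapped to a single shared name before loading. The source names, event codes and `conform_event` helper are hypothetical.

```python
# Hypothetical per-source vocabularies for the same business events.
SOURCE_EVENT_MAP = {
    "ad_server": {"imp": "impression", "clk": "click_through"},
    "web_app": {"pageview": "page_view", "view": "page_view",
                "click": "click_through"},
}

def conform_event(source, raw_event):
    """Map a source-specific event name to the shared (conformed) name."""
    mapping = SOURCE_EVENT_MAP.get(source, {})
    # Normalize case and whitespace before the lookup; unmapped values are
    # flagged as "unknown" rather than silently dropped.
    return mapping.get(raw_event.strip().lower(), "unknown")

events = [("ad_server", "IMP"), ("web_app", "view"), ("web_app", "scroll")]
conformed = [conform_event(source, raw) for source, raw in events]
print(conformed)
```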

 

Following the data ingest steps, processed weblog information is brought into TwinFin as atomic event information or as summarized tables, depending on the size of the appliance and analytic maturity & scale of the organization where it is deployed. A typical deployment might look like the following diagram:

Hadoop-NZ Arch 1.jpg

 

 

Some of our customers use an alternate, far less common, deployment design for this co-existence pattern: an external elastic MapReduce cloud (such as the Amazon Cloud) handles the data ingestion.

 

In cases where the customer has its application servers in Amazon’s EC2 cluster, it may also choose to use Amazon’s S3 web services for retaining weblog data. In that case, Amazon would provide the elastic MapReduce infrastructure for the data ingest process into the TwinFin appliance. This alternative deployment scenario would look something like the following:

Hadoop-NZ Arch 2.jpg

 

 

The bottom line is that the different strengths of TwinFin and Hadoop lend themselves to complementary deployments – and some of our customers have already discovered innovative ways to leverage them together to maximize the value of both their investments.

 

In my next post, I’ll discuss the second pattern we’re noticing: one in which Netezza customers are using the Hadoop Cluster for long-term data retention.


 

We had quite a surprise the other day when it came to our attention that Netezza and the NPS data warehouse appliance are now the subjects of a new book: Netezza Underground: The unauthorized tales of derring-do and adventures in resilient data warehousing solutions, by David Birmingham (ISBN: 1-4392-0743-7 and now available in paperback version for $31.54 at Amazon.com).

 

 

This is not the first instance of the NPS system being the subject of a book sold by Amazon (e.g., SAS/ACCESS(R) 9.1.3 Supplement for Netezza), but this particular publication certainly brought with it a feeling of fun and of reaching the mainstream, from its very clever cover art (above) to David's clever turns of phrase and real-life examples.

 

 

As the title suggests, it was written without Netezza's authorization or coordination. So of course we bought a copy and read (or skimmed) through it as quickly as we could. I will say this: David's self-publication skills are great. He keeps what could easily have been a boring, heavy technical tome both engaging and fun to read while still imparting lots of great information about the NPS system, its performance and its ease of operation. And the book is incredibly current, with references to the Netezza Developer Network and "BI Appliance" announcements made as recently as the Enzee Universe user conference in September.

 

 

While I certainly could quibble with a point made here or there about the system, in general I thought it was an excellent book and even put up the following recommendation for it on the Amazon site:

 

I commend David Birmingham on a book that is at once as lightly entertaining and interesting to read as it is chock full of details about just the kind of performance and operational simplicity that is possible with the Netezza Performance Server (NPS) system. Straightaway from the opening pages, Birmingham's effusive, engaging style and excitement about Netezza's system is apparent, "It inhales, crunches and publishes Libraries-of-Congress-at-a-time - and fast."

He also captures the essence of the NPS appliance in an ultra-succinct two-sentence paragraph explaining just why his "Administration Stuff" chapter is so short, "It's an appliance. Put it in the corner and let it work." I couldn't have said it better myself!

This book is comprehensive and current - even reflecting some of the more recent announcements from Netezza regarding OnStream programmability, the Netezza Developer Network and analytic appliances.

As the guy who is responsible for projecting the Netezza products and our technology direction forward, I want to recommend David Birmingham's book to current and prospective customers and partners alike, or as David himself says on the book's Dedication page, "to Enzees everywhere".

--Phil Francisco, VP Product Management & Marketing, Netezza Corporation

So "to Enzees everywhere", have a read of David's book and welcome to the "Netezza Underground".
