Mar 19, 2010 2:54 AM
Looking for advice on finding a Netezza job
I am a senior level Oracle professional. I am now considering investing in Netezza training and certification and wondering whether this would be a good investment.
I have a strong background as Oracle developer and data architect. I have done DBA work as well but that is not my main expertise. Several years ago I worked on a project where I got some exposure to Netezza and was very impressed with the platform.
In today's job market I had to drop my rate quite a bit, and it is still not easy to find good Oracle projects. But just because I briefly mention Netezza on my resume I get a lot of calls from recruiters who are desperately searching for Netezza professionals.
This is why I am considering taking Netezza classes and getting certified.
How do you see the prospect of finding a senior-level Netezza-focused position for someone with strong Oracle data warehousing background, Netezza training and certification but little hands-on experience?
Practically everybody starts with Netezza from the same experience base that you have. They get a project or get exposure, are impressed with the technology and want to do more with it. Do not be daunted. Learn the technology and jump in with both feet first.
There are many positions. I would say many of the people who are experienced in Netezza here at my place started the same way (from an Oracle professional background, along with me).
Try these: http://netezzaforum.com/netezza-jobs-f11/
I have no doubt that I will be able to find some Netezza work. I am concerned whether I would be able to jump into a senior level position. This is a concern because I cannot afford to take much of a pay cut.
I know that if I were going the other way - switching from Netezza to Oracle - my chances would be close to zero. I would have to start mid-level or below.
Define "senior level position". The Netezza platform is a data warehouse appliance. As such, those with a significant background in data warehousing are more successful with it, especially as a consultant. Keep in mind that the average Netezza database sports tables in the many billions of records, most of them in hundreds of billions of records. If you are not accustomed to this scale, perhaps you are not at the senior level, but there's no time like the present to start learning it.
Short answer, when we hire senior level people we look at their background in data warehousing. If they don't understand data warehousing, they (typically) won't understand the Netezza platform.
Case in point - Netezza has no indexes. It is not a transactional platform. Solving problems with it requires set-based thinking, not transactional, cursor-based thinking. One of the reasons Oracle and RDBMS folks have a hard time transitioning to Netezza is that the platform encourages processing of data inside the database with typical insert-select statements that would be insane on an Oracle platform. A typical Oracle equivalent would be to pull the data into an ETL environment, perform set-based processing on it, and put it back. In Netezza space, this would be like pulling the data out of a 200-processor machine, processing it on a 16-processor machine and then putting it back into the 200-processor machine. The power is in the 200-processor machine, not the ETL environment.
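To make the insert-select idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere; it is not Netezza, and the table and column names (stg_sales, fact_sales and so on) are hypothetical. The point is only the shape of the work: one set-based statement transforms and loads every row inside the database, with no cursor loop and no data pulled out to a client.

```python
import sqlite3

# SQLite is only a stand-in for the warehouse; all names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_sales  (sale_id INTEGER, amount_cents INTEGER, region TEXT);
    CREATE TABLE fact_sales (sale_id INTEGER, amount REAL, region TEXT);
""")
conn.executemany(
    "INSERT INTO stg_sales VALUES (?, ?, ?)",
    [(1, 1250, " east "), (2, 990, "WEST"), (3, 4000, "east")],
)

# One set-based statement: the transformation runs where the data lives.
# No row is fetched into the client and nothing is processed row by row.
conn.execute("""
    INSERT INTO fact_sales (sale_id, amount, region)
    SELECT sale_id, amount_cents / 100.0, UPPER(TRIM(region))
    FROM stg_sales
""")

loaded = conn.execute(
    "SELECT sale_id, amount, region FROM fact_sales ORDER BY sale_id"
).fetchall()
print(loaded)
```

The cursor-based alternative would fetch each staging row into the client, clean it there, and insert it back one row at a time, which is exactly the round trip described above.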
Thank you David.
What I think you are saying (please correct me if I am wrong) is that for someone who is really competent at data warehousing, who thinks about data in SQL, and who has a solid understanding of the platform architecture and its purpose - limited hands-on experience should not be a show-stopper, even for a senior role. I only hope that the fact that the largest tables I ever worked with contained just a little over 10 billion rows does not disqualify me.
When I took the shortened version of the Netezza course about 5 years ago I remember the Netezza consultants talking about the high level of support that Netezza provides. Is that still the case?
The power is in the 200-processor machine, not the ETL environment.
I would imagine that even with a Netezza target the ETL environment would still be useful for data validation, cleansing, and exception handling.
Am I wrong?
That would be incorrect. Any large-scale operation on the data would take place inside the machine: validation, exception handling, etc. This holds whether you do it all in SQL or with a combination of an ETL tool and SQL.
For example, one group uses DataStage to manage the SQL statements it sends to the machine. It then handles exception and control logic in DataStage but manipulates all data inside the machine - the data itself never leaves the box. This tends to relegate a high-powered ETL environment to the role of "firing over SQL statements" but so be it - that's where the power is.
Netezza support is the best in the biz. However, keep in mind that several roles exist in a data warehouse project, and in a Netezza environment, the DBA is a part-time job, if that. The machine is so self-contained that having a formal DBA can actually slow down progress. After all, the create-table-ddl in Oracle, after all of its objects and support are instantiated, can take thousands of statements, where the equivalent in Netezza is tens of statements. Seriously, DDL in Netezza to create a table is basically naming the table, the distribution key and the columns. There's no tablespace, indexing or other management.
So it gives the applications folks a lot of freedom to create and destroy intermediate workspace tables on-the-fly, effectively performing data processing inside the machine with "ELT", emphasis on the "T" being inside the machine rather than ETL where the T is outside the machine.
So it gives the applications folks a lot of freedom to create and destroy intermediate workspace tables on-the-fly, effectively performing data processing inside the machine with "ELT", emphasis on the "T" being inside the machine rather than ETL where the T is outside the machine.
I can see how the combination of horsepower and on-the-fly temp tables can solve many of the problems I mentioned. However, I am still curious...
Suppose I am inserting a million rows into a multi-billion-row table, and need to enforce uniqueness and log rejects. How does Netezza handle this problem?
The same way you would solve it in "standard" data warehousing. You would not involve the Oracle database in this problem, because this would be a violation of rule #10 - never use an RDBMS for bulk processing. So if the RDBMS is not supposed to be involved (e.g. unique key checking at the row level), how would we solve the problem otherwise? Solving this problem in bulk requires a set-based bulk operation. Just so there's no confusion, Netezza does not enforce uniqueness in tables even if a PK constraint is applied. This is in keeping with data warehousing best practices. Can you imagine why Netezza would never want to invoke this kind of functionality, or tempt a developer into using it?
See the book "Netezza Underground" available from Amazon.com for more insights.
I will check out the book. Thanks.
In traditional RDBMS this sort of problem is handled by the combination of a unique constraint to enforce uniqueness and a procedural routine to log exceptions.
I am guessing that in Netezza prior to inserting you will need to join source to target and group. With enough horsepower this is doable, just clumsy.
It seems to me that without database constraints and a procedural language, handling validation and exceptions must be hard. This is why I am wondering whether Netezza is expected to routinely handle such problems, or whether incoming data is expected to be clean.
Netezza can handle data in any state of cleanliness or disarray. I've assembled data into Netezza using raw feeds into staging tables that were taken through ELT steps toward an awaiting target. Data arrived with extraordinary dirt and data risk and arrived on the target in pristine, consumption-ready form.
In a standard RDBMS data warehouse, "load ready" means that the data must arrive on the database's front door "consumption ready", not just load ready. It has to be exceptionally clean when it is loaded to the RDBMS because we cannot afford to involve the RDBMS in row-by-row processing (Rule #10).
Clumsy is a word I would reserve for constraint usage. Mainly because when exceptions occur, the constraints cannot fix the problem, they can only make it worse.
Think about the operations in place here. If the constraints are on during the load, the RDBMS is involved in row-by-row key checking, a violation of Rule #10. So we would always turn them off. People who don't do this have performance issues.
If an exception happens during the load when the constraints are on, the entire load process is then halted to handle the exception. This is a very bad state of affairs when loading billions of rows.
If we load bad data into the table with constraints turned off, then attempt to turn them on, the database will complain and we then have a messy rollback issue. Can the constraint fix this problem? No, it can only report the problem. Where is the problem actually fixed? In the process creating the data, of course. The only deterministic and resilient means to guarantee that the data is ready to load into the target table is to make it load-ready by checking the data, the keys and anything else that would cause it to fail.
This is where the term "load ready" is often confused. "Load ready" is more than just ready because the data is scrubbed. It is ready because all anticipated errors have been removed.
Unique and pk/fk constraints are easily anticipated.
In a standard RDBMS data warehouse, this would be accomplished by:
No anticipated errors arise from the process, and all exceptions are captured in step 2 - proactively, rather than in step 4, reactively
If we don't take the prior steps, we will have to endure the following clumsy operation:
Once again, this is a standard data warehouse 101 approach in any data warehouse. Constraints are always turned off during a load because they radically hinder performance. When we're talking about billions of records, we cannot afford the performance hit, especially for something that we have to proactively fix anyhow.
The simplest way to achieve this in Netezza is with a massively parallel table-to-table join. All exceptions simply fall out, and are processed in bulk scale. This is the only high-performance means to capture the exceptions. It is also easy enough (in both script and ETL/ELT) to formulate reusable resources to accomplish these goals rather automatically without having to code them separately for each table.
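A rough sketch of that join-based approach, for the earlier question about enforcing uniqueness and logging rejects. Again this uses sqlite3 only as a runnable stand-in, so nothing here is parallel, and the table names (target, incoming, rejects) are made up for illustration. The reject policy is also an assumption: when a key repeats inside the incoming batch, this sketch rejects every copy rather than picking a winner.

```python
import sqlite3

# SQLite stand-in; table names and the reject policy are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target   (id INTEGER, val TEXT);
    CREATE TABLE incoming (id INTEGER, val TEXT);
    CREATE TABLE rejects  (id INTEGER, val TEXT, reason TEXT);
""")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.executemany(
    "INSERT INTO incoming VALUES (?, ?)",
    [(2, "dup-of-target"), (3, "c"), (4, "d1"), (4, "d2")],
)

# Step 1: keys already present in the target fall out of a table-to-table join.
conn.execute("""
    INSERT INTO rejects
    SELECT i.id, i.val, 'key already in target'
    FROM incoming i JOIN target t ON t.id = i.id
""")

# Step 2: keys repeated within the batch itself fall out of a GROUP BY.
conn.execute("""
    INSERT INTO rejects
    SELECT i.id, i.val, 'duplicate key within batch'
    FROM incoming i
    WHERE i.id IN (SELECT id FROM incoming GROUP BY id HAVING COUNT(*) > 1)
      AND i.id NOT IN (SELECT id FROM target)
""")

# Step 3: whatever was not rejected is load-ready and goes in as one set.
conn.execute("""
    INSERT INTO target
    SELECT i.id, i.val FROM incoming i
    WHERE i.id NOT IN (SELECT id FROM rejects)
""")

target_ids = [r[0] for r in conn.execute("SELECT id FROM target ORDER BY id")]
reject_ids = [r[0] for r in conn.execute("SELECT id FROM rejects ORDER BY id")]
print(target_ids, reject_ids)
```

The rejects are captured before the load rather than discovered during it, and each step is a single statement over the whole batch, so the same three statements could be templated per table.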
I have to admit I was thinking about a typical data warehouse which tends to be much smaller than what Netezza is designed for. At Netezza scale many built-in traditional RDBMS features do become unworkable. However, based on my experience, unique keys in Oracle can scale to billions of rows. Foreign keys cannot.
We then need to define "scale". If by scale we mean that we intend to require the user to always use the primary key index to get the on-demand performance, then constraining the user this way can yield upward scalability. If we allow the user to query on any arbitrary column, then the primary key does not scale in any sense of the word. A table-scan on billions of rows in a traditional RDBMS will take longer than the user is willing to wait. Certainly longer than the environment is willing to support. Each key constraint adds bulk to the total data storage requirement. Regenerating keys, and key maintenance, then becomes a regular part of data administration.
On the other hand, sometime back I had a table with over 200 columns, sporting 14 terabytes and 40 billion rows of information. A "big dumb" scan took no longer than 7 minutes. Applying some simple Netezza optimizations brought any arbitrary query to under 30 seconds, most under 10 seconds. This kind of scale is impossible in Oracle, in every sense of the word.
And this with no primary keys, indexes or other common props required in a traditional RDBMS. This means that all columns are "fair game" all the time. Netezza even allows the natural processes of data warehousing to provide inherent boosts with zone maps. By following simple rules, performance arrives for free. This is how an appliance should work.
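For readers who have not met zone maps, here is a toy model of the general idea, not Netezza's actual implementation: keep the minimum and maximum value of a column for each storage block, and skip any block whose range cannot contain the value being searched for. The function names and the block layout below are invented for the illustration.

```python
def build_zone_map(extents):
    """Record the (min, max) of the column values held in each block."""
    return [(min(rows), max(rows)) for rows in extents]


def scan_equal(extents, zone_map, value):
    """Return matching values and how many blocks actually had to be read."""
    hits, extents_read = [], 0
    for rows, (lo, hi) in zip(extents, zone_map):
        if value < lo or value > hi:
            continue  # the range rules this block out, so it is never read
        extents_read += 1
        hits.extend(r for r in rows if r == value)
    return hits, extents_read


# Dates loaded in arrival order end up naturally clustered by block.
extents = [
    [20100101, 20100102],
    [20100103, 20100104],
    [20100105, 20100106],
]
zone_map = build_zone_map(extents)
hits, extents_read = scan_equal(extents, zone_map, 20100104)
print(hits, extents_read)
```

Because warehouse data usually arrives in date order, the ranges barely overlap and most blocks are skipped without any index being defined, which is one way to read "performance arrives for free".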
Perhaps this is why, in 2007, Business Objects Strategic group declared the traditional RDBMS, (with Oracle at the top of the list) to be a "secondhand technology" for data warehousing, now eclipsed by parallel appliance technology, and the impetus for Oracle to create their own appliance offering. The traditional RDBMS simply requires too much structural engineering (star schemas, index manipulation, etc) to be agile enough for a highly demanding data warehouse environment. It is this high-engineering of the database structures that eventually paints the RDBMS into a functional box.
In addition, to get performance from the traditional RDBMS, query engineering is the key to performance gains. In Netezza, the query is largely along for the ride. We could write a really bad query, but once a query is "good" it's not likely to get any better without changing the data configuration itself. So data engineering is a more important key to performance success. If the data is laid out correctly, the environment will scale without query engineering. In fact, de-engineering the queries often yields better performance than trying to carefully craft a query. This effectively simplifies the BI/presentation environment and puts the onus on the data engineers for performance yield, a much better state of affairs than punting it to the reporting and query engineers.

