Ahh, the theme of so many horror stories, where the heroes plod along through life until met with something they don't understand. Like a member of the Borg Collective, Agent Smith from the bowels of the Matrix, or Tom Cruise, cruisin' along in a dead-end life before the Tripods from Mars rip a hole in his, er, reality.
Got a call last week from a buddy of mine who's in an all-RDBMS shop. And powerful, too. Some of their SMP machines have forty and fifty-plus processors on them. They do heavy-duty processing, don'tcha know, and have no need for any technologies newer than what they already have. The reason this call was so strange was that we'd just caught up not six months prior, at which time I raved on about the Netezza technology. He didn't have anything to say about it, no opinion as to its benefit or purpose. This call was different.
Like many conversations about this, the following is a composite of several, but I'll use my buddy as a springboard, because he's a good sport.
"David, need your help," he said, a tinge of urgency in his voice.
"Shoot."
"Some people here are talking about bringing in Netezza for a test drive," he tells me.
"Good for them. You - "
"Stop there," he pushed back, "Let me tell you what they want to do. They want to replace our primary data marts with this stuff."
"It's good for that," I said, "It's purpose-built for high-speed reporting."
"But our reports run fine," he asserted, but didn't really sound convinced.
"All that mess you went through last year," I reminded, "When you needed to add more dimensionality and had to take your entire schema back-to-formula?"
"So?" he said, "No technology is immune from that."
"Restructuring your entire indexing strategy to get better performance, and all that denormalization and renormalization to balance the workload?"
"Necessary evil," he asserted.
"Evil, true, but necessary is a function of the chosen technology."
"What, you mean Netezza can just keep assimilating information without ever having to refactor the indexing? Who are you kidding?"
"Not kidding at all," I said calmly, "Netezza has no index structures, so there's nothing to manage."
Silence.
"You still there?"
"I'll have to call you back."
"Okey-dokey - " but that's all I could summon before the line went dead.
Fifteen minutes later he called back, out of breath, "Okay, tell me more about this no-indexing thing."
"Just that, no indexes. Netezza doesn't need 'em or use 'em."
"Then it's slow as molasses, and not a threat."
"Don't kid yourself."
"I know data, dude. Don't try to - "
"It's not about the data. It's about the hardware. Netezza embraced the truth that power is found in the hardware, and bulk data processing needs access to a lot of it."
"We have a lot of hardware too -"
"But it's configured for general purpose processing, and I'll bet you don't do any of it inside the database."
"Well, no, that would be insane. We do the bulk processing with an ETL tool. It's just faster."
"Have you ever considered why it's so much faster? Or why the rise of the ETL tools? People generally agree we can't do bulk processing inside the SMP machine, because it's not built for it. It will pull data in quantity off the disk drives, process it in quantity and push it back, and the data is meeting itself coming and going on the SMP's backplane."
"So? How does that change anything?"
"Netezza processes data down in the parallel SPUs - data doesn't leave the disk at all, and if the database needs to process data, all of the CPUs handle their own little section of it. That's why you don't need indexes, because an SMP/RDBMS sees the data as a single logical table with monolithic physical data, where Netezza sees a single logical table on hundreds of physical drives."
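(An aside for the code-minded: here is a toy sketch of that idea in Python. It is not Netezza's implementation, just an illustration under loud assumptions. The names `NUM_SLICES`, `distribute`, `scan_slice` and `parallel_sum` are all made up for this example, and threads stand in for the SPUs. Each worker brute-force scans only its own small slice of the table, and the partial answers are combined at the end, with no index anywhere.)

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SLICES = 8  # hypothetical; stands in for the many SPU/disk pairs


def distribute(rows, num_slices=NUM_SLICES):
    """Spread rows across slices using a distribution key (here, the id)."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        slices[row["id"] % num_slices].append(row)
    return slices


def scan_slice(rows, predicate):
    """One worker fully scans its own slice; no index is consulted."""
    return sum(row["amount"] for row in rows if predicate(row))


def parallel_sum(slices, predicate):
    """Run the same scan on every slice at once, then add up the partials."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(lambda s: scan_slice(s, predicate), slices))
    return sum(partials)


# One logical table of 1000 rows, physically spread over the slices.
rows = [
    {"id": i, "region": "east" if i % 2 else "west", "amount": i}
    for i in range(1000)
]
slices = distribute(rows)
total_east = parallel_sum(slices, lambda r: r["region"] == "east")
```

(The caller still asks one question of one logical table. The work of answering it is what gets split up.)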
"I'm not following."
"Okay, when a general marshals troops, does he give specific commands to each troop member, or does he formulate a plan and delegate it to the masses?"
"That's obvious."
"Because it works. It's the only way to manage physical scale. With multiple actors who are incapable of completing the mission alone, we need synergy."
"Or Jack Bauer."
"Not even Jack Bauer could - "
"Hey, you're bad-mouthin' Jack now!" he said playfully.
"Jack's good," I said, "But not that good. Imagine what he could do with a thousand Chloes back at the ranch?"
"I see your point, but if this is just for reporting stuff, I don't really have much to worry about."
"For now."
"What do you mean, for now - that sounds sinister, like the Pod-People leader from Invasion of the - "
"Data Mart Snatchers?" I laughed, "Didn't mean to sound mysterious, but there's more to this."
"Oh?"
"Well, once the mart is in place and operating, someone will notice a pattern of activity. It goes like this: We spend hours processing the data to get it ready for Netezza, and then load it in seconds. The box sits there idle for the next few hours until the users start pounding it, then it goes idle for some protracted cycle until the few seconds it requires to load the next day's data. Something is amiss, because now the slow point is the ETL environment."
"Our ETL environment is state-of-the-art," he said, "We push data like crazy through it."
"And I'll bet if you examine the larger part of the load, you will find that it spends most of its cycles in joining or summarizing, even sorting. Those operations take a lot of hardcore CPU power."
"They always have."
"But what if you loaded the raw data from your ETL tool into Netezza? You said yourself it only takes seconds, right?"
"But then where will we integrate it?"
"Massively parallel joins and rollups inside a Netezza machine are orders of magnitude faster than your ETL."
Silence.
"You there?"
"No, I see what you mean," he said, "Our sixteen-way ETL machine cannot even theoretically compete with a 108-processor Netezza machine."
"And the 108-processor version is the development box."
"Thanks for that."
"Seriously, people will take a look at the 'T' in the ETL, and experiment by dividing it into row-level activities and bulk inter-row activities. They will keep the row-level stuff in the ETL and move the larger-scale transforms into the Netezza machine."
"Okay,"
"Assimilating the data mart and the larger scale bulk processing into a single platform."
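(Another aside: a minimal sketch of what moving the bulk transforms into the database can look like. I'm using Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; it is not Netezza and has none of the parallelism. The table names `raw_sales`, `raw_stores` and `mart_sales_by_region`, and the sample rows, are invented for illustration. The point is the shape of the work: land the raw data first, then do the join and rollup as one SQL statement inside the box.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 1: land the raw data untouched (the "load it in seconds" part).
conn.executescript("""
    CREATE TABLE raw_sales  (store_id INTEGER, sale_date TEXT, amount REAL);
    CREATE TABLE raw_stores (store_id INTEGER, region TEXT);
""")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", 10.0),
        (1, "2024-01-01", 5.0),
        (2, "2024-01-01", 7.5),
        (3, "2024-01-02", 20.0),
    ],
)
conn.executemany(
    "INSERT INTO raw_stores VALUES (?, ?)",
    [(1, "east"), (2, "east"), (3, "west")],
)

# Step 2: the bulk inter-row work (join + rollup) happens in the database,
# not in the ETL tool.
conn.execute("""
    CREATE TABLE mart_sales_by_region AS
    SELECT st.region, s.sale_date, SUM(s.amount) AS total_amount
    FROM raw_sales s
    JOIN raw_stores st ON st.store_id = s.store_id
    GROUP BY st.region, s.sale_date
""")

rollup = conn.execute(
    "SELECT region, sale_date, total_amount "
    "FROM mart_sales_by_region ORDER BY region, sale_date"
).fetchall()
```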
"Hmm, that would be troubling."
"Why is that?" I smiled.
"Because then our high-powered and expensive ETL tool is relegated to nothing more than row-level scrubbing and data transport."
"Want another kick in the pants?"
"May as well."
"You can do a lot of that scrubbing in regular SQL once the data is inside the box. In massively parallel form."
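(One more aside: a rough sketch of row-level scrubbing done in plain SQL after the load. Again `sqlite3` is only a runnable stand-in, and the `raw_customers` and `clean_customers` tables, the sample rows, and the 'UNKNOWN' and 'XX' defaults are assumptions made up for the example.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_customers (customer_id INTEGER, name TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO raw_customers VALUES (?, ?, ?)",
    [
        (1, "  Alice ", "tx"),  # stray whitespace, lowercase state
        (2, "BOB", ""),         # empty state
        (3, None, "Ca"),        # missing name, mixed-case state
    ],
)

# Scrub every row in one set-based statement: trim whitespace, normalize
# case, and replace empty or missing values with a default.
conn.execute("""
    CREATE TABLE clean_customers AS
    SELECT customer_id,
           COALESCE(NULLIF(TRIM(name), ''), 'UNKNOWN')         AS name,
           COALESCE(NULLIF(UPPER(TRIM(state)), ''), 'XX')      AS state
    FROM raw_customers
""")

clean = conn.execute(
    "SELECT customer_id, name, state FROM clean_customers ORDER BY customer_id"
).fetchall()
```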
(Sigh)
"Meaning that your ETL tool is now nothing more than a raw data transport mechanism."
"We have one of those already," he sighed again, then laughed, "It's called 'scp.'"
"At this point you've moved a lot of processing 'under air,' as they say."
"As who says?"
"Netezza of course."
"You mean, they assimilate all this stuff by design?"
"Resistance is useless. Close your eyes and join the collective."
"Aaaggghh!"