Skip navigation

Gather 'round the Grill

1 Post tagged with the flat tag
0

In an ongoing campaign to take his environment to the next level, John (name changed to protect the innocent) started holding orientation sessions for his DBAs, programmers, architects and other technical resources as a means to get-the-word-out or at least get some education into his people - so that they would be more self-contained in the field, as it were, building out their environments.

 

While they weren't using Netezza (yet), they were still using set-based operations in a high-powered ETL environment, so the basic principles were the same. That is, perform operations on whole groups of records at a time, not several operations on one record at a time.

 

Well, as it turns out the primary mental resistance had little to do with understanding the technology, but more in wanting to be "the guy" to break the mold. What mold is that? The mold that makes things feel so "mainframey", with flat files and these "archaic" approaches to data processing that technology "left behind" so many moons ago. Why are we going back to the old days? they lament. Surely someone, somewhere missed the memo that we just don't use flat files! C'mon people! Get with the program!

 

And the more John used his Jedi mind powers to change their minds, the more they dug into this whole notion that they were being dragged back to the dark ages of programming. It might seem a bit dramatic, but many of us have actually seen the techies roll their eyes at the mere mention of flat files, as though we have made a statement more expected from a shaman medicine man or medieval witch. What's next? Put leeches on the machine to cure its bugs? C'mon people! Get with the program!

 

Some fifteen years ago I worked with a group that had some very complex integration issues between their custom application and the Oracle Financials application. They wanted to transport invoicing and other billing issues from their system into OF and have a transparent interaction. OF wasn't as mature back then, so this approach had some challenges, and since the primary interface between the two applications was Oracle's Pro-C, the technical team naturally chose C-language (not C++) to make the interfaces work. A debate ensued on this project as to whether we should "code everything in Pro-C" or use C language at all. The confusion? Pro-C isn't a control language. It's a database I/O specification for C-language. We could always interact with the database through Pro-C, but even the simplest decision-tree operation would still require a formal procedural language, and Pro-C didn't qualify. The Pro-C proponents felt like they had lost the debate, but something more was lost - in all this the technical team was required to use Pro-C for all interactions with the database, including the batch-uploads of the invoices and other instruments. The time for this operation was so egregiously slow that we made a mid-stream decision to deal with the problem more transactionally. That is, mini-batch operations more often than once an evening (some recognize this as "continuous" operation).

 

This only forestalled the inevitable. The volume of data quickly grew so large that the back-end was running continuously and unable to keep up with the front-end. Even worse, the OF functionality was not being used in real-time, even though we'd set it up to behave that way. It all came crashing down when we had to install the system where the volume would crush it. The answer? Flat files.

 

Now you can imagine the outcry from the Pro-C people. Not only had we pushed back on them to use C-language for control, but we were now moving even further away from Pro-C into - gulp - flat files. The gauntlet was thrown down that the entire back-end architecture was flawed and needed complete rework. By this time I had moved on to warmer climes and could watch this battle from a distance, but one of the new engineers called me up one evening to get the down-low on the principals in the environment. He claimed that the whole place was crazy and this myopic fixation on Pro-C would be their undoing. Why? Pro-C is a transactional protocol, not a batch protocol. It invokes the database at the API level rather than performing the leaner bulk-insert operations. Over a year later, the problem remained unsolved so they abandoned the OF interface altogether and started using a third-party provider for their financial reporting.

 

The irony - the third-party provider required them to ship over their transactions nightly. In flat files! (the horror!)

 

Something I continuously but gently point out, is that in the data warehousing realm the flat-file is a mainstay. It's not going away and should not have to. The denial that the flat file is a permanent player ----- is the path to mayhem.

 

Not too very long ago, I had the opportunity to assist an online brokerage in how they assimilated transactions from their various member firms. The member firms could ship their data via a web service, over the internet, manually enter it on the brokerage web site (ideal for small shops) or ship it via flat files. Enormous resources had been leveraged to program, maintain and enhance the automated interactions for all these pathways - except for flat files. They had been treated like a necessary evil. Something to be tolerated until they could be replaced with one of the "more mainstream" methods. Even today, they have not moved away from flat files. And they represent over 30 percent of the brokerage's total transactional volume. Not something to be relegated or trivialized.

 

It all came to a head one day when a clerk called up a leader at one of the upstream member firms claiming that they had not transmitted their transaction file on time. (Note - in brokerage terms this leads to an SEC action, so it's serious business). This claim quickly escalated to the member firm's top echelon and within the hour the CEO of the firm was on a conference call with the CEO of the brokerage firm, evidence-in-hand that they had in fact transmitted the file and would not stand for this. The CEO of the brokerage took a deep dive into the responsibility chain only to discover that the member firm really had transmitted things via flat file, and the brokerage was so lacking in attention to this medium that they didn't even have auditing capabilities or anything to tell them definitively whether the file had arrived or not. The file had arrived but somewhere in the night, it's loading process had aborted for reasons other than data. The poor clerk could only review a morning report, with no visibility to the actual problem. This state of affairs was intolerable.

 

The brokerage CEO mandated audit-style processing and flat-file receipt for everything, not because it's a good idea, but it's the law (SEC-wise), after all. And the outcry from this could not have been more vehement. The techies were being told in no uncertain terms - formalize and institutionalize flat file handling. This was taken no differently than telling the network group that they would have to support every prior version of Windows, including 3.0, just-in-case. Woe is us.

 

But the techies missed the point, in that the flat file is a modern-day marvel in its resilience and capability. As many other types of storage mechanisms have come and gone, the flat file continues, impervious to the changes in technology all around. Flat files underpin every major database storage mechanism, are ultimately the storage form for more recent formats like XML and its derivatives, are scalable on any platform, and have generally stood the test of time. Flat files are like a gallon of gasoline. They are always predictable, reliable, never break, easily scale and can be used practically anywhere - and usually are.

 

So why the consistent resistance to flat files? It's because they don't seem exotic or challenging. After all, it's a flat file. A caveman could do it. Somehow, suggesting a flat file solution makes a person feel like they are not a contributor. Yeah, anybody can whip up a flat file - where's the technical prowess in that? In my humble opinion, the diminishing prowess in effectively using and embracing flat files - especially where they are supposed to be used - is rapidly becoming a lost art form and part of a lost world. Just as we lost the architectures and methods of the wonders of the ancient world, the knowledge of effective flat-file usage is invaluable in enterprise computing. Formalizing and institutionalizing it has extraordinary value, especially when it comes to tracking critical enterprise assets.

 

But in the Netezza space, where do they play?

 

How about data intake? In the Netezza platform we can load data at extraordinary speeds. About the only technology that can possibly feed Netezza at its maximum rate of intake, is the raw physics of a file system. Considering that should we need to extract something from a database, it will ultimately go through the database engine's software layer down into its bowels to arrive at - flat files - and then pull the data, rise back into the engine's CPUs and be delivered, largely via software processes, to the extraction point of the information. Reading from this extraction point will always be slower than reading from a flat file. This is why products like WisdomForce can extract Oracle data so much faster than an interface-level extract. Their Fastreader goes for the data on the file-system level, and has embraced the audacious notion that performance is found close-to-the-physics. Where have we heard that before?

 

If we were to perform a simple test - let's say we pull data from SQLPLUS into a pipe (to eliminate the write-drag of the file system) and then perform a simultaneous nzload from that same pipe, the flow will move only as fast as the extraction. In one particular case, 3 million records (even with a parallel extract) took over fifteen minutes to pull from the database. Netezza's nzload waited patiently, perhaps scraping its virtual nails on its internal chalkboards waiting for the maddeningly slow load to finally finish. In the second version of the test, we extracted the data to a flat file and once completed, performed the nzload. The extraction still took fifteen minutes. The nzload took a few seconds.

 

But think about this - in the first test, the fifteen minutes that the Netezza machine was tied up with an nzload, it could have been doing other things. After all, there are a finite number of load threads we can invoke for work, and this long, slow stream tied up one of them for much longer than it should have. Pushing to a flat file and then performing an nzload, the load thread is only occupied for a short window and is then free for more work. This is a more efficient use of Netezza's interface. Of course, if the box has nothing else whatsoever to do in this fifteen minutes, then go for it. A while back, we installed and burned in a sqlplus-to-pipe-to-netezza intake framework that worked just fine, and its window of operation was during a quiet time. On the flip side, other interfaces had detailed data feeds, most of them integrated to arrive at the same time, and they were all being pushed as flat files. Netezza simply inhaled them - in seconds - and moved on.

 

In yet another venue, the upstream systems were pulling and pushing small data files from their various internet-based sources. All of the data files carried a small part of the same information stream, so we could essentially "cat" these files together into one. The problem was that they were all being written to a common file server, and then other processes kicked off to load these snippets of data individually. Since each one was a tiny file, this created an enormous burden on Netezza. Every load has a finite cost. (Recall, we can load 1 record in 1 second, or 1 million records in 1 second - either way it costs us 1 second). So by formalizing a "collector" mechanism for these files, we could effectively have one nzload load as many of them as were available on the file system as one large stream of data. In this regard, we could "cat" hundreds of files into a pipe and nzload from the pipe. This is a good use of "cat", since it is writing to memory and not back out to the file system, essentially reading and stuffing data into a pipe for our consumption. This alone stabilized the intake protocol and - where the existing implementation had saturated the Netezza machine's interface, the collector freed it up and allowed them to take on even more capacity without additional overhead. Think about the mechanics for a moment. If we have 1500 files, collectively containing 150 million records, we can choose to load 1 file at a time, requiring 1500 seconds (25 minutes), or we can cat the files into a single nzload for a load that took about two minutes. If the issue is load-time and performance, then we need to formalize and harness the way we intend to load all those flat files. Make a design decision, embrace it and institutionalize it for maximum firepower.

 

The Netezza Underground offers up a number of rules (the first ten of which are early in the book) that are specific and non-optional for systems of scale. One of the rules is to use the most scalable mechanisms and assets available, and embrace them as a regular part of everyday data warehousing life. But mention "flat file" to someone steeped in transactional or visual technologies flowing from Redmond or Silicon Valley, and they first roll their eyes. When they see we are serious, they pushback as though fighting-on-principle. When the verdict is in, flat files are here to stay, and they see they cannot win, they update their resume and leave. Their reasoning: they don't want to go back to the "dark ages". You think I'm kidding.

 

We again come full circle to the whole idea that bulk processing is not transactional processing. And with Netezza, the bulk processing is on a sometimes mind-numbing scale. Things that we used to do as neat-clean operations in transactional space, suddenly have no viability whatsoever with data sizes of this magnitude. Which is, of course, the primary reason that people will reduce to flat file operation when the performance starts to lag. Flat files are scalable. Software, not so much. Transactional, for bulk, never scales. Stop now.