Raptor Digital Services Integrity, Honesty, and Hard Work

13 Oct 09

Over-Engineering Increases TCO

Proprietary data integration software has been on the market well over a decade, adding new features year after year that serve to increase software revenue and little else.

Over time, accumulated features obscured the original functionality of the software, confused users, confounded administrators, and led to persistent software bugs. Weird, often unexpected behaviors became known to experienced users. Most of these abnormal software behaviors became "undocumented features" to be worked around.

Companies should expect all mission-critical software to have few issues, and the data integration mission is no different. Proprietary software companies receive payment for product features, and most people would agree that it is reasonable to expect software to behave, day after day, in a manner consistent with the features sold. This premise has been neglected and discarded over time. Some may argue that integration and loading requirements have become bigger and more complex over the years and require the majority of these new features. Add to the malaise the data quality issues that are just beginning to be addressed, and it seems that data integration will continue to be a mission-critical task. These two statements are true and not contested. Unfortunately, proprietary software code is SECRET, and only the software company that produces it has full visibility of all product issues. Companies that purchase and use proprietary data integration software are locked out of the technical knowledge that would allow them to win a legal case if “undocumented features” are not fixed to their satisfaction. More importantly, suing the vendor of a mission-critical software suite would most likely make matters worse for their data integration mission unless they dropped the software suite altogether, which would be impractical and financially prohibitive.

The main problem is that these proprietary data integration suites run using non-standard languages. In other words, if you write your data integration job using data integration software “A”, you cannot take your “code” and run it using data integration software “B” as if you changed compilers. Companies sometimes even purchase more than one proprietary data integration software suite and these suites are anything but inexpensive. Wouldn't it be nice to have the option of using a standard and widely used language instead of being absolutely locked into a single development environment that is not easily translatable to other data integration suites and absolutely not compilable using traditional language compilers?

Before data integration suites became the industry standard, data integration activities were programmed line by line in any of the traditional languages. It was the standard practice and worked well, except for massive documentation issues and the equally large issues surrounding retaining qualified and available data integration programmers. Nobody likes to document, and programmers do not enjoy the chore either. Companies, for the most part, decided that retaining pure language programmers to do relatively simple data integration work was less cost effective, and generally prohibitive, compared to using a proprietary data integration suite with coding done by data integration experts drawn from a common skill resource pool.

At this point, ten years later, companies have used two general methods of coding their data integration activities and although data integration suites are doing the job, there is a general feeling that improvement is in order. Proprietary data integration software has done the job but has become bloated, expensive, and somewhat troublesome. What could improve the situation?

The time has come to reevaluate the data integration and data loading mission to reduce costs. Let us reevaluate the basic premises and architecture that were put in place over a decade ago in the hope of learning something valuable and possibly experiencing an epiphany, resulting in a clear decision and a next-step for data integration activities.

Let's start by examining the original reasons for creating dedicated data integration software in the first place, circa 1995.

  1. Companies had business experts who had some technical abilities and who could contribute to data integration and loading duties if an appropriate tool was provided to them.
  2. Data storage requirements were growing more rapidly than before and database technology was able to store the data but conducting a single SQL query across all the related databases and tables was not practical and sometimes not possible at all.
  3. Managers and power users were losing control of the knowledge of data dependencies and the flow of data. Early data integration was conducted by line-by-line coding for the most part and the idea of meta-data was foreign, unknown, and undefined in the business community. This situation effectively locked managers and power users out of the knowledge they needed to manage data integration activities. This environment often produced critical periods of time when data integration programmers moved to another job and a replacement had to be trained.

Now let's discuss the above original reasons as we view them today and decide which ones still hold true so we can continue.

  1. Business experts are in the same situation, except that they are usually working hard in their primary job and literally have no time to code data integration and loading tasks. Unfortunately, a new problem has arisen because of the workload. Business experts must train data integration coders about the business to the point where they can work, and here things are deteriorating. Generally, business experts have time neither to code nor to train coders. The common result is that coders have to fend for themselves and figure everything out from scratch. Does this remind you of the original programmers' reluctance to document data integration? It should.

    Even worse, figuring out everything from scratch takes time and a lot of money. Documentation still takes a back seat and is even dismissed altogether, forfeited to gain more time figuring things out to accomplish coding tasks.

    Even worse, time spent figuring things out is time spent not coding.

    Even worse, figuring out things from scratch often results in incorrect guesses, incorrect decisions, and incorrect coding.

    So today, there is much more rework and confusion in the data integration universe because coding is less thoughtful, scientific, correct, or even documented properly. Wouldn't it be nice to use data integration software that is self-documenting?

  2. Data storage requirements are still growing more rapidly than before and database technology is still able to store the data. Conducting a single SQL query across all the related databases is still not practical and sometimes not possible at all.

    However, there is hope with this issue in that database vendors have clearly recognized that their architecture is flawed and are addressing their performance issues and scalability issues. Perhaps six database vendors are now advertising much higher performance and much higher scalability as a result of some level of re-architecture work. Some forward progress is being made in terms of hardware and generally speaking the hardware is becoming less proprietary, giving customer companies more cost-effective choices with off-the-shelf hardware solutions. Some progress has been made restructuring the storage methods for data where the focus of storage is on columns of data instead of rows of data which can often result in faster result-sets from SQL queries, especially when not all record columns are in the SQL query. Some progress is being made by splitting the SQL query “work set” into an arbitrary number of “subset work sets”, working each subset work set in parallel, then reassembling all the subset work set query results for the user.

    So, this is a peek into the future. Databases are becoming more capable in every quantifiable way and are eclipsing the power of any data integration software or programs in their ability to read, write, and transform data. Conducting a single SQL query across all the related databases is still not practical and sometimes not possible at all for most database vendors. However, databases now exist that can handle almost all of the largest queries and data-sets out there. The bad news is that they are still relatively expensive. The good news is that the pricing of these new architectures will surely go down over time. What we need to do in the data integration universe is plan ahead to when these database technologies are more commonplace and affordable, and be ready to leverage their capabilities.

    In the near future, data integration software's mission will refocus itself on reading a diverse set of sources and loading the data with minimal transformation. The database itself will add to its duties much of the data integration mission such as joining, data transformations, and data cleansing.

  3. Managers and power users were losing control of the knowledge of data dependencies and the flow of data. This issue is still true and is probably as big a problem as it has ever been. The underlying mistake that makes this situation as persistent as a big weed is that most everyone is in too much of a hurry and starts coding before they really know what to do. In the heat of a meeting, it sounds like a siren's melodic song to talk about “doing what we can” while “we analyze the situation” and “move forward from there”. With careful, experienced management these issues are minimized, and projects benefit greatly from better preparation (in the form of raw business logic knowledge of important business entities, dependencies, possible points of failure, and performance minimums) and better documentation of data integration in the form of detailed dependency and flow diagrams, detailed architectural diagrams, detailed performance diagrams, and detailed storage diagrams. The lesson should have been learned long ago that if you “get the job done” but have inadequate documentation, there will surely be a disaster of one nature or another to face down the road. One last point: the days of using spreadsheets for documenting data integration activities should be gone.

    Actually, it sounds worse than it is. Peeking into the near future, the next-generation databases using SQL can be depended on to solve these issues by allowing all company data to be queried in a single SQL query, and to be flexible and powerful enough to keep up with future challenges. SQL is the tool of choice going forward for data manipulation, data transformation, data quality work, and joining of data. In the near future, data integration software's mission will refocus itself on reading a diverse set of sources, loading data with minimal transformation, and unloading data with minimal transformation. This is a mission for which data integration software is perfectly positioned and irreplaceable.
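
The query-splitting idea from point 2 (divide the “work set” into subsets, work each subset in parallel, then reassemble the results) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular database vendor implements it: the `sales` table, its columns, and every function name below are hypothetical, and SQLite with a thread pool stands in for a real parallel database engine.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def create_sample_db(path, row_count=1000):
    """Create a hypothetical 'sales' table with ids 1..row_count."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount INTEGER)")
        conn.executemany(
            "INSERT INTO sales (id, amount) VALUES (?, ?)",
            ((i, i % 10) for i in range(1, row_count + 1)),
        )
        conn.commit()
    finally:
        conn.close()


def split_ranges(low, high, parts):
    """Split the inclusive id range [low, high] into contiguous sub-ranges."""
    size, extra = divmod(high - low + 1, parts)
    ranges, start = [], low
    for i in range(parts):
        end = start + size + (1 if i < extra else 0) - 1
        if end >= start:  # skip empty sub-ranges when parts > number of ids
            ranges.append((start, end))
        start = end + 1
    return ranges


def query_subset(path, id_range):
    """Run the same aggregate over one sub-range, on its own connection."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute(
            "SELECT COALESCE(SUM(amount), 0), COUNT(*) "
            "FROM sales WHERE id BETWEEN ? AND ?",
            id_range,
        ).fetchone()
    finally:
        conn.close()


def parallel_total(path, low, high, parts=4):
    """Work each subset in parallel, then reassemble the partial results."""
    ranges = split_ranges(low, high, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(lambda r: query_subset(path, r), ranges))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total, count


if __name__ == "__main__":
    db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
    create_sample_db(db_path)
    print(parallel_total(db_path, 1, 1000))
```

The reassembly step is trivial here because sums and counts can simply be added together; queries whose partial results do not combine so easily (a median, for example) need more care, which is part of why this work belongs inside the database rather than in each integration job.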

 

Now we should evaluate the next steps. Our glimpse into the future featured a world with more powerful databases which have taken on a few more tasks than before: data transformations, data joining, and conducting SQL queries that effectively and accurately reflect the entire enterprise. Data integration software has been refocused on reading disparate sources, loading data with few transformations, and unloading data with few transformations. So, for planning purposes, we should support the database vendors' efforts to become truly scalable and leverage the growing database capabilities. Equally important, data integration software will no longer be needed to join disparate databases that SQL alone could not query together, because enterprise-wide queries of this type are now possible.
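
To make this division of labor concrete, here is a minimal sketch in Python, using the standard `sqlite3` module as a stand-in for a scalable database. The load step copies the source rows in as-is; trimming, type conversion, de-duplication, joining, and aggregation are all left to SQL inside the database. The table names, column names, and sample rows are invented for illustration and do not come from any real system.

```python
import sqlite3

# Hypothetical source rows, loaded exactly as they arrive (all text, untidy).
RAW_CUSTOMERS = [
    ("1", "  alice ", "us"),
    ("2", "Bob", "US "),
    ("2", "Bob", "US "),  # duplicate row from the source
]
RAW_ORDERS = [
    ("1", "19.50"),
    ("2", "5.25"),
    ("2", "10.00"),
]

# The database does the cleansing, joining, and aggregation.
TRANSFORM_SQL = """
CREATE TABLE customer_totals AS
SELECT c.id, c.name, c.country, SUM(o.amount) AS total_amount
FROM (
    SELECT DISTINCT CAST(id AS INTEGER) AS id,
           TRIM(name) AS name,
           UPPER(TRIM(country)) AS country
    FROM raw_customers
) AS c
JOIN (
    SELECT CAST(customer_id AS INTEGER) AS customer_id,
           CAST(amount AS REAL) AS amount
    FROM raw_orders
) AS o ON o.customer_id = c.id
GROUP BY c.id, c.name, c.country
"""


def load_raw(conn):
    """Integration step: load source data with minimal transformation."""
    conn.execute("CREATE TABLE raw_customers (id TEXT, name TEXT, country TEXT)")
    conn.execute("CREATE TABLE raw_orders (customer_id TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_customers VALUES (?, ?, ?)", RAW_CUSTOMERS)
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", RAW_ORDERS)


def transform_in_database(conn):
    """Database step: run the SQL transformation and return the cleaned rows."""
    conn.execute(TRANSFORM_SQL)
    return conn.execute(
        "SELECT id, name, country, total_amount FROM customer_totals ORDER BY id"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_raw(connection)
    for row in transform_in_database(connection):
        print(row)
```

The design point is that the loading code contains no business logic at all. Everything that could change with the business lives in one SQL statement, which is also the part most likely to be portable between databases.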

This brings us to the question of data integration costs. Historically, data integration suites have held multiple missions, and most large businesses would not be able to function effectively without them. Of course this is still true, and they will continue to be a mission-critical piece of every solution, with the exception of no longer being required to join company databases together to answer corporate-wide questions. It is just a matter of time before most if not all companies will be able to use SQL alone to answer corporate-wide questions. This reduction in the tasks required of data integration software might allow data integration costs to diminish. But by how much? And what choice must be made to realize these savings?

One choice which is often underestimated is the newer software classification of “open source”. Here, open source means the coding is conducted with a traditional language, which is a positive point given our observation that repositories and proprietary language code cannot easily be moved to another, lower-cost vendor. Open source may be offered by a vendor as an entry or base offering alongside value-added proprietary features, much like any other proprietary vendor, with the exception that it is still based on a traditional language, and anything coded using the open source offering could be readily integrated into another vendor's offering if the base languages are the same. Another positive point of this dual-licensing model is that, because the base product is open source, bugs can be found by any user who can effectively use the debugger, and this participation by the user community at large helps eliminate bugs and make the software more reliable than traditional proprietary software.

One of the most reliable and powerful dual-license data integration offerings today is from Talend with their Talend Open Studio which is open source, and with their Talend Integration Suite which is also open source with additional features including change-data-capture, best-in-class data-quality capabilities, command-line capabilities, automatic documentation, and team development capabilities. Additional products extend the capabilities of Talend Integration Suite to include massively parallel execution and real-time capabilities, and pricing for Talend software is significantly lower than current market leading data integration vendor software offerings. We have all heard the phrase “You get what you pay for” but this has progressively become less and less true over time to the point where it is clear that most software in the data integration classification is clearly overpriced. Do not make the mistake of thinking that less expensive means less capable because Talend software is truly a great value.

Talend is impressive in entering the market with world-class capabilities at a price well under the current market leaders. Add to that a user interface that actually shows you the generated code in Java or Perl (depending on which product variant you are using). Most development, however, is done in a graphical designer, which is easy to learn and actually more effective than the current market leaders' graphical designers.

Talend offers superior debugging capabilities because it is based on Eclipse, the open source Java development environment (originally from IBM). The Talend debugging environment is bulletproof and feature-rich, with all the standard debugging features you have come to expect from standard language debuggers. In at least one other data integration software suite, debugging was a late afterthought; it remains brittle in reliability terms, often locks up the graphical interface, and crashes the program entirely. That debugger has very few of the features of the Talend debugger: just basic data tracing, the ability to change variable values, the ability to set breakpoints, and not much more.

To address our most important observation that database technology is becoming truly scalable, Talend offers dedicated best-in-class objects to address and leverage that upcoming capability. At least one other leading data integration software vendor has instead decided to complicate their existing graphical interface and diminish the usability for the developer to realize these capabilities. It seems evident that Talend will win this design contest because they are focused on the issue and the best solution.  The competition has not shown leadership in software design and seems generally determined to retain graphical interface designs less useful than Talend offers.

How many people in your company currently use Java or Perl? How many successful integration activities could be solved by embedding your current Java or Perl programs into a graphical data integration suite such as Talend Open Studio or Talend Integration Suite? There is no question Java and Perl programs could be integrated using Talend but can you say the same for your current data integration software vendor's offering?

So, the data integration community is at a crossroads with a big decision to make. The choice is either to use the leading data integration software with proprietary coding, and be party to a product that is expensive, exotically complex, and prone to weird behavior with bugs that sometimes never get fixed, or to move one step back and regain the advantages of a standard language. Today, there is a better choice than using a dedicated programmer for relatively simple data integration activities: use Talend Open Studio instead. Talend has the same advantages that the current leading data integration software vendors advocate, plus two advantages that they can never have: open source, and standardization on a widely used, powerful, and common language.

Please take a minute to clearly understand how usage of Talend Open Studio protects your interests, your freedom, and your pocketbook. Talend has taken a business risk to distribute high quality, mission-critical software and source code under the “GNU General Public License Version 2, June 1991” and deserves your support because it serves your best interests more closely than the competition does. The GNU license is focused on protecting your freedom by allowing each user to distribute or change the software or to use portions of it in new free programs.

The license agreement is found at: http://www.gnu.org/licenses/old-licenses/gpl-2.0.html.

Talend is the tool of choice for data integration.  Talend is easy to learn, reliable, fast, and based on standard languages Java and Perl. Please consider it for your next big project or diminish your current data integration costs by using it for selected smaller projects.
