It all starts with a typical consultant answer, “It depends.”
Let me quickly elaborate:
- The concept of a data warehouse is definitely not obsolete!
- The technology in use might be.
- The tools and processes surrounding the data warehouse most likely are!
Why is it so?
The reason for asking this question is easy to understand. Data warehouses were usually established to deal with large volumes of data. With the buzzword “Big Data”, everybody is suddenly looking for new ways to deal with large volumes of data, and a growing number of start-ups are providing new tools with impressive features. Nearly all of these tools claim to replace existing data warehouse technologies, so it is natural to ask the question.
The concept of a data warehouse
A data warehouse supports business decisions through analysis of operational data, and that need will definitely not go away. The human element of those decisions might be automated away by robotics, but that’s another story.
In a data warehouse, you integrate data from multiple source systems and keep a long history that is not necessarily available in the source systems themselves. For example, if a customer changes an order from 10 to 20 pieces, the order system would not keep the original number, but the data warehouse will keep both numbers so that you can report on customer behaviour, explain changes in forecasts and serve many other purposes.
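To make the order example concrete, here is a minimal sketch of a history-keeping table. The names `OrderVersion` and `OrderHistory`, the order number and the dates are hypothetical and only serve as illustration; a real data warehouse would do this in database tables rather than in Python objects.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class OrderVersion:
    order_id: int
    quantity: int
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None marks the current version


class OrderHistory:
    """Hypothetical warehouse table that never overwrites an order."""

    def __init__(self) -> None:
        self.rows: List[OrderVersion] = []

    def record(self, order_id: int, quantity: int, changed_at: datetime) -> None:
        # Close the currently open version of this order, if there is one.
        for row in self.rows:
            if row.order_id == order_id and row.valid_to is None:
                row.valid_to = changed_at
        # Append the new version instead of updating in place.
        self.rows.append(OrderVersion(order_id, quantity, changed_at))

    def quantities(self, order_id: int) -> List[int]:
        # All quantities the order has had, oldest first.
        return [r.quantity for r in self.rows if r.order_id == order_id]


history = OrderHistory()
history.record(1001, 10, datetime(2016, 3, 1))  # original order
history.record(1001, 20, datetime(2016, 3, 5))  # customer changes the order
print(history.quantities(1001))  # [10, 20]
```

The order system would only know the value 20; the sketch keeps both versions together with the period in which each one was valid.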
In a data warehouse you define the data structures (the schema) before you start writing data, hence the term Schema-on-Write. While this is ideal for traditional structured data from relational database systems, it is less suited to ad-hoc analysis or discovery on multi-structured, unstructured or streaming data. The business data lake concept is much better suited for those tasks. Here you postpone the structure definitions until you read the data, and you can change the structure on every read if necessary. Not surprisingly, this is covered by the term Schema-on-Read.
The good news is that the business data lake and the data warehouse concepts can easily exist side by side. They can be viewed as ideal companions that solve quite a number of issues together.
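The difference between the two approaches can be sketched in a few lines. This is an illustration under simplifying assumptions: the "warehouse" and the "lake" are plain Python lists, and `ORDER_SCHEMA`, `write_to_warehouse`, `write_to_lake` and `read_from_lake` are hypothetical names, not part of any product.

```python
import json

# Schema-on-Write: the structure is fixed before any data is stored.
ORDER_SCHEMA = {"order_id": int, "quantity": int}


def write_to_warehouse(table, record):
    """Reject records that do not match the predefined schema."""
    if set(record) != set(ORDER_SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in ORDER_SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    table.append(record)


# Schema-on-Read: raw lines are stored untouched ...
def write_to_lake(lake, raw_line):
    lake.append(raw_line)


# ... and a structure is chosen only when the data is read.
def read_from_lake(lake, fields):
    for raw_line in lake:
        doc = json.loads(raw_line)
        yield {name: doc.get(name) for name in fields}


warehouse = []
write_to_warehouse(warehouse, {"order_id": 1, "quantity": 10})

lake = []
write_to_lake(lake, '{"order_id": 1, "quantity": 10, "channel": "web"}')
write_to_lake(lake, '{"order_id": 2, "comment": "call me back"}')

# The same raw data read with two different structures.
by_channel = list(read_from_lake(lake, ["order_id", "channel"]))
by_comment = list(read_from_lake(lake, ["order_id", "comment"]))
```

The warehouse refuses anything that does not fit the agreed structure, while the lake accepts both records and lets each reader decide which fields matter.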
The technology behind the current data warehouse
Traditionally we built data warehouses on a common relational database management system with some transformation mechanism (commonly known as an ETL-tool) on the side.
As time has passed and data volumes have increased, this setup has revealed a number of bottlenecks, and we have put a lot of effort into removing them one by one, only to reveal new bottlenecks elsewhere. The situation has been amplified as update cycles have become ever shorter, with the goal of coming close to a real-time data warehouse.
Admittedly, technologies without apparent bottlenecks exist, but they seem to be quite expensive and are not in widespread use.
So when the time comes for a technology refresh (such as a software upgrade or hardware replacement), you might as well take a look at the conceptual setup of your analytic environment. New concepts, approaches and technology might have huge benefits just waiting for you to harvest.
The tools and processes surrounding the data warehouse
Processes should be easy to change in theory, but that is not how it is in the real world. Human beings tend to keep on doing things the way they are used to, and as soon as a process is supported by a tool, or perhaps even partly implemented within the tool, it gets sticky and hard to change.
When going from nightly batch loads towards the real-time data warehouse, most processes surrounding the data warehouse will have to change, as will most likely the tools supporting those processes. And the changes are not limited to the data warehouse processes: systems acquisition (development and procurement) will also have to adapt to the paradigm of the real-time data warehouse.
The future of data warehousing
The data warehouse will continue to be the primary source for reporting and decision making related to business operations.
The feeding of the data warehouse will shift from ETL loads that go directly from the transactional source systems to the data warehouse towards an “ELTL” process via the business data lake. In very many cases this will result in increased performance and reduced load on the transactional systems.
Data archival from the data warehouse will no longer be relevant: just delete obsolete data from the data warehouse, since you can always reload it from the business data lake should the need arise.
Through data discovery processes, data scientists will find new value in existing or new data sources. Some of the discovered data elements will find their way to the data warehouse to support operational decision making. Some of the validated analysis processes will result in scoring or categorization of data in the data warehouse. You should feed the results of those processes into the data warehouse even though the data behind the results will not go there.
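As a rough sketch of such a flow, assume that “ELTL” reads as extract, load into the business data lake, transform, and load into the data warehouse (the abbreviation is not spelled out here, so this reading is an assumption). The lake is a plain list, the warehouse is an in-memory SQLite database, and all function, table and customer names are hypothetical.

```python
import json
import sqlite3


def extract(source_rows):
    """E: take a raw copy of the source rows, without reshaping them."""
    return [json.dumps(row) for row in source_rows]


def load_to_lake(lake, raw_records):
    """L: land the raw records in the lake as they are."""
    lake.extend(raw_records)


def transform(lake):
    """T: aggregate inside the lake, away from the source system."""
    totals = {}
    for raw in lake:
        order = json.loads(raw)
        customer = order["customer"]
        totals[customer] = totals.get(customer, 0) + order["quantity"]
    return sorted(totals.items())


def load_to_warehouse(conn, rows):
    """L: load only the curated result into the warehouse."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_totals "
        "(customer TEXT PRIMARY KEY, quantity INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO customer_totals VALUES (?, ?)", rows)
    conn.commit()


source = [
    {"customer": "acme", "quantity": 10},
    {"customer": "acme", "quantity": 20},
    {"customer": "globex", "quantity": 5},
]
lake = []
warehouse = sqlite3.connect(":memory:")

load_to_lake(lake, extract(source))            # the source system is read once
load_to_warehouse(warehouse, transform(lake))  # heavy lifting happens off-source

result = warehouse.execute(
    "SELECT customer, quantity FROM customer_totals ORDER BY customer"
).fetchall()
print(result)  # [('acme', 30), ('globex', 5)]
```

The transactional system is only touched by the cheap extract step; the transformation works on the copy in the lake, which is where the reduced load on the source comes from.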
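The delete-and-reload idea can be sketched as follows, again with a list standing in for the business data lake and an in-memory SQLite database standing in for the data warehouse. The `orders` table, the helper functions and the sample years are hypothetical.

```python
import json
import sqlite3

# The lake keeps every raw record, including the old ones.
lake = [
    json.dumps({"order_id": order_id, "year": year})
    for order_id, year in [(1, 2010), (2, 2015), (3, 2016)]
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, year INTEGER)")


def load_from_lake(conn, lake, min_year):
    """(Re)load orders from the lake; rows already present are left alone."""
    docs = [json.loads(raw) for raw in lake]
    rows = [(d["order_id"], d["year"]) for d in docs if d["year"] >= min_year]
    conn.executemany("INSERT OR IGNORE INTO orders VALUES (?, ?)", rows)
    conn.commit()


def purge(conn, before_year):
    """Delete obsolete rows from the warehouse; no archive copy is taken."""
    conn.execute("DELETE FROM orders WHERE year < ?", (before_year,))
    conn.commit()


def count(conn):
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


load_from_lake(conn, lake, min_year=2000)
after_load = count(conn)      # 3 rows

purge(conn, before_year=2015)
after_purge = count(conn)     # 2 rows, the 2010 order is gone

load_from_lake(conn, lake, min_year=2000)
after_reload = count(conn)    # 3 rows again, restored from the lake
```

Because the lake still holds the raw record, the purged row comes back without any separate archive.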
Though the data warehouse is moving towards a real-time data warehouse, it will not be the primary source of real-time monitoring of streaming data from Internet-of-Things or Social Media. Those data feeds will stream through the business data lake to real-time monitoring tools. The monitoring tools might store registered events together with the data that triggered the event in the data warehouse for future reporting and analysis. The raw data will be available in the business data lake for data discovery processes.
In this way the huge growth in data volume is isolated to the business data lake whereas the operational systems and the data warehouse will grow at very limited rates.
About Erik Haahr
Erik Haahr has been a Managing Consultant at Capgemini Sogeti Denmark since 2015. In this role, he is improving local service offering descriptions, participating in pre-sales activities, mentoring graduates, and consulting with customers.