Many companies are still taking the first steps on the big data journey. First ideas are being implemented while many are still wondering: what even is big data?
What Is Big Data?
The biggest obstacle is the term big data itself. It’s reminiscent of mass data. However, all data in an ERP system and other databases are mass data as well. Big data means quantities too big for traditional databases – too big in absolute terms, or too big to handle cost-effectively.
Data structuring presents another obstacle. In an ERP system, 99 percent of the data is structured. The remaining one percent consists of texts like orders and invoices. With big data, it’s the other way around: all the important information is unstructured. Of course, it’s interesting to know when and where a picture was taken, but it’s more interesting to know what it shows.
In my opinion, the most important definition of big data is ‘all data which cannot yet be used to generate value’.
Here’s an example of what I mean. Purchases are always documented. What isn’t documented, however, is everything else. How did the customer notice the product? Did they see an ad for a specific product? Do customers only skim the product details and buy right away? Or do they meticulously read through the technical details and still not buy the product?
Now that we’ve discussed what big data is, the next question is that of the right big data architecture.
Especially in big data, innovations come and go. A few years ago, MapReduce on Hadoop was a must-have; now Apache Spark offers better performance. Some time ago, Apache Hive was the way to go; now it’s Parquet files. This dynamic environment makes cost-efficiency and flexibility imperative.
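To make the MapReduce programming model concrete, here is a minimal, illustrative sketch in plain Python of the canonical MapReduce job, word count. On a Hadoop cluster the map and reduce phases would run distributed across many machines; this local sketch only shows the shape of the computation.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(values) for word, values in groups.items()}

docs = ["big data is big", "data creates value"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'creates': 1, 'value': 1}
```

The rigid map–shuffle–reduce structure is exactly what made MapReduce verbose for complex pipelines, and what Spark’s higher-level API improves on.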
Apache Spark delivers great performance while retaining the desired flexibility, which is why the majority of projects worldwide leverage it. Installation is easy, complex transformations take only a few lines of code, and the software is free of charge.