Big Data = New Thinking Box

Roberto Barbosa
3 min read · Aug 20, 2017

Lately I have been questioned and challenged about implementing Big Data architectures with the goal of achieving real-time information, with all its challenges: capturing data changes (CDC) at the source, the need to materialize/store historical or pseudo-real-time data (a database), and immediately reflecting and indexing the data so it is searchable and always true.

Well, but I get some red flags (and awkward faces) when I hear certain words in Big Data architecture meetings, words like: scheduling, data schema, and real-time queries on databases. But the cherry on top is when DevOps with CI/CD is also desirable, if not mandatory. OMG, are they for real?

Ok, let’s not kill the dream. Let me try to help.

Well, the biggest challenge here is not the technology, but “Thinking in a New Box”, as Luc de Brabandere puts it in his Coursera course.

To further explain: Big Data is essentially a mindset upgrade, not a technology upgrade. Big Data is a game-changer because it is a set of new approaches to solving problems with large volumes of data. Trying to use Big Data with the old mindset will not let you rethink the problem and create better solutions. If you attempt it with the old mindset anyway, you will not end up with the same problems, but with Bigger Problems.

So, before you upgrade your tech stack, upgrade how you think about the problem, so that you can reach better, even brilliant, solutions with Big Data.

What is this Big Data “New Thinking Box”? I can give you some pointers on that:

  • Distributed Computing and how to achieve it in a way that you can scale massively. I suggest you read Distributed systems for fun and profit, to understand how, 30 years ago, the industry followed the path of database-centric architecture that was only suitable for Enterprise scale, not Internet scale. Distributed systems that are massively scalable require processes to run in parallel and autonomously, without the need for a central state, a database, or any other central source of data. Focus on the word “distributed”, for real.
  • Immutability of Data CHANGES EVERYTHING. Mutation of data is a killer for true massive scale: it simply impedes the point above and the ability to keep the data always up to date. Why does Erlang, with its immutable variables and message-passing approach, still live after 30 years? See the paper “Immutability Changes Everything”.
  • The Single Source of Truth cannot be a Database from the old mindset; it needs to be the stream of all new incoming events carrying the latest truth, the new Truth, praise the Lord. The current truth must be derived from this Event Source (see the first sketch after this list). See Messaging as the Single Source of Truth.
  • Schema-on-read vs schema-on-write. Schema-on-write is what made enterprise systems so stagnant, because it was impossible to change those thousands of table schemas without breaking some critical component that depended on them. Just think about it: wouldn’t you like to have several schemas on top of the same data? Don’t forget that with Big Data you’re supposed to be able to combine structured, semi-structured, and unstructured data (see the second sketch after this list). Check AWS Athena and creating a Big Data Lake and Analytics on top of AWS S3, or Scaling Like a Boss with Presto.
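
To make the immutability and event-sourcing points concrete, here is a minimal sketch in TypeScript. The event names, fields, and account example are invented for illustration, not taken from any specific system: the event log is append-only, and the current state is always derived by folding over it, never by mutating a record in place.

```typescript
// Minimal event-sourcing sketch: the append-only event log is the single
// source of truth; current state is derived from it, never mutated in place.
// Event and state shapes below are hypothetical, purely for illustration.

type AccountEvent =
  | { kind: "Opened"; accountId: string; at: number }
  | { kind: "Deposited"; accountId: string; amount: number; at: number }
  | { kind: "Withdrawn"; accountId: string; amount: number; at: number };

interface AccountState {
  readonly accountId: string;
  readonly balance: number;
}

// Pure function: old state + event -> new state (no in-place mutation).
function apply(state: AccountState, event: AccountEvent): AccountState {
  switch (event.kind) {
    case "Opened":
      return { accountId: event.accountId, balance: 0 };
    case "Deposited":
      return { ...state, balance: state.balance + event.amount };
    case "Withdrawn":
      return { ...state, balance: state.balance - event.amount };
  }
}

// The "new Truth" is just a fold (reduce) over the immutable event stream.
function replay(events: readonly AccountEvent[]): AccountState {
  return events.reduce(apply, { accountId: "", balance: 0 });
}

const log: AccountEvent[] = [
  { kind: "Opened", accountId: "acc-1", at: 1 },
  { kind: "Deposited", accountId: "acc-1", amount: 100, at: 2 },
  { kind: "Withdrawn", accountId: "acc-1", amount: 30, at: 3 },
];

console.log(replay(log)); // { accountId: "acc-1", balance: 70 }
```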

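And a rough sketch of schema-on-read, again in TypeScript with invented field names: the raw records are stored untouched, and each consumer projects its own schema at read time, instead of one schema being forced on everyone at write time.

```typescript
// Schema-on-read sketch: raw data is stored as-is (here, JSON lines) and
// each consumer projects its own schema when reading. Field names and
// record shapes are made up for illustration.

const rawLines = [
  '{"ts": "2017-08-20T10:00:00Z", "user": "ana", "page": "/home", "ms": 120}',
  '{"ts": "2017-08-20T10:00:05Z", "user": "rui", "page": "/search", "ms": 340}',
];

// Consumer A only cares about traffic per page.
interface PageHit { page: string; }

// Consumer B only cares about latency per user.
interface UserLatency { user: string; ms: number; }

const asPageHits: PageHit[] = rawLines
  .map((line) => JSON.parse(line))
  .map((r) => ({ page: String(r.page) }));

const asLatencies: UserLatency[] = rawLines
  .map((line) => JSON.parse(line))
  .map((r) => ({ user: String(r.user), ms: Number(r.ms) }));

console.log(asPageHits, asLatencies);
```
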
This trend is all over the stack, from Big Data to web and mobile. All the fuss around React+Redux addresses the same question: what is the best way to manage the state of the application? See Immutable JS — Redux Tutorial
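
For the React+Redux side, here is a minimal reducer sketch in TypeScript (the action names are made up): state is never mutated, and the next state is a pure function of the previous state plus an action, which is the same fold-over-events idea applied in the browser.

```typescript
// Redux-style state management sketch: a pure reducer derives the next
// state from the previous state plus an action, never mutating in place.
// Counter example and action names are hypothetical.

interface CounterState { readonly value: number; }

type Action =
  | { type: "increment" }
  | { type: "decrement" }
  | { type: "reset" };

const initialState: CounterState = { value: 0 };

function reducer(state: CounterState = initialState, action: Action): CounterState {
  switch (action.type) {
    case "increment":
      return { ...state, value: state.value + 1 };
    case "decrement":
      return { ...state, value: state.value - 1 };
    case "reset":
      return initialState;
    default:
      return state;
  }
}

// Replaying the action history rebuilds the state, just like event sourcing.
const history: Action[] = [{ type: "increment" }, { type: "increment" }, { type: "decrement" }];
console.log(history.reduce(reducer, initialState)); // { value: 1 }
```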

Let me know if you do not agree or if you want to add some comments.


Roberto Barbosa

Physics Engineer by Education, Architecture and Data Engineer by trade, entrepreneur at heart.