Armchair Architects: Is Big Data Turning into Dark Data?

Welcome back to another episode of Armchair Architects, part of the Azure Enablement Show. Today we're going to talk about data, but not just any data. We're going to talk about this phenomenon called dark data and what to do about it. I hadn't heard about it until I got a chance to talk to our architects, and it was super cool to learn about. They'll help us understand what it is and what areas I should pay attention to. Let's go find out about dark data and talk to our architects.

 

I think it's time we talk about data, and in particular I want to talk about this thing I think people use to scare children at night when they want them to go to bed. They talk about dark data, but I don't actually know what dark data means. I know what big data means, but apparently there's this concern that big data is turning into dark data, so can we just get the dark data question out on the table first? What do we mean when we say dark data? Then we can dive into this more fully.

 

What is dark data?

 

I mentioned in another session where a customer of mine leaned back from the table and said, “I'm not getting a lot out of this big data platform. I know we spent a lot of money on it. I know we're ingesting hundreds of thousands of transactions and messages per second, it's going into this weird thing called a data lake, and now there are data marts, and whatever happened to warehouses? Is anybody actually benefiting from this thing, or is the data locked up in specific data files or formats or schemas or repositories and silos, which can never conceivably be joined together in a useful way within the right time frame?” The question that many customers end up asking is, “Well, I know the data is there because we have system X, system Y, and system Z generating these transactions, so it's got to be going somewhere. It's all going to the data lake, so where is it and how come I can't touch it?” So the dark data element is: we know that it's there, but we just can't access it, we can't see it, and it's going to take a significant effort to actually unearth it and then get business value out of it. Meanwhile, companies are spending a lot of money to keep the lights on, from the hypervisors to the storage, network, and compute required to maintain this data lake, or, more importantly, what we've come to know as a data swamp. Which is: I can't ever get through this thing in order to get anything valuable out of it.

 

It's effectively where you collect data and a lot of times you have no idea what it's for. Why are you doing it? There is no value in the data. The swamp is a place where people wade into it and are never seen again, because there's a kind of nebulous fog hanging over it. Ultimately, you're collecting data, spending money on it, and you get no value out of it, or very little, and that's really what happens to a lot of companies when they do big data projects. The way I think about this is that oftentimes the technology gets put ahead of the usage.

What are you trying to achieve with your data?

 

People are investing in data lake technology and big data processing. Whatever technology you use, it doesn't really matter. It's a similar beef I have with people doing DevOps: they think about the tool chain, and they don't think about the discipline and the culture that you have to change. The same is true with data.

 

Let me give you an example from Microsoft's past with Bing and Google.

 

Google was winning very clearly, and it's still the big dog in the search business, but Bing is a sizable business by now; it's, I think, 34 or 35% of the US market or something like that, which is pretty big. In 2007, the teams were effectively trying to out-Google Google by using the same tactics as Google, such as brute forcing and stuff like that. We got a new lead, and he came in and said, “We can't do this. This just makes no sense. What we need to figure out is how we become more relevant.” The key thing in search is the relevance score, so whatever you do, the first three links had better match what people are looking for, otherwise it's not interesting. So, he invested in the first data lake, the big data system at Microsoft. We didn't have one before. We had data warehouses and other things, but not a big data system, and that system is called Cosmos.

 

The idea was that we collect signals from the Bing system, the user behaviors, and we have data scientists, which was a new discipline at Microsoft at that time, go and roam this data cosmos and figure out what the patterns are and how we improve the relevance score in search. That was the only thing these guys were supposed to do, nothing else, and everything was collected with this in mind. I think once you have a business objective and you know what you want to achieve, then big data can work.

 

There is another great example with ThyssenKrupp Elevator, which we made public years ago. These guys came in with a lot of data from 20 years of elevator maintenance and asked Microsoft, “Can we use this data to answer some basic questions?” The person that maintains the elevator often arrives with the wrong part, because they don't necessarily know what's going on with the elevator. So, they come and say, “Oh damn, I have the wrong part,” and they have to go back and get the right part, and that's expensive. The other question they asked was, “Because we don't really know what these elevators need, we overstock the forward-deployed parts depots, and that costs a lot of money, so can you help us with that?” The last one was a little bit around expert help, but for those two pieces, what parts generally break and what parts do we need to stock, we could actually help them with the data they had collected. Because they knew what they wanted to achieve, the data was available.

 

The queries or the machine learning were used to answer these questions. If you have no questions you want to answer, no data will ever be sufficient to help you answer those non-questions. If you collect data but don't know what outcome you're after, it's not going to be successful.

 

I often say, “Leading with technology is a little bit like having a bad cab driver.” You may get to where you're going, but you're going to spend a lot more money than you expect, you're not going to enjoy the experience, and you're probably not going to get there very directly. So, I wonder, though: in addition to not being super clear about the goal, or even assuming you're pretty clear about the goal, I've seen big data projects go awry not just because people didn't know what they were doing or picked the wrong technology. I get the sense that there are patterns that can also lead you into the dark data space. What are some of those patterns that get you to this place we're calling dark data, this place you don't want to be in?

Avoiding the patterns that put you on the road to dark data

 

So, let's assume that you have an ROI statement: if I get this data and I can make decisions with it, then I will receive this value, whether that's in dollars of cost savings or other metrics.

 

The patterns associated with that mean you actually have to have a big data platform or system that makes it easy to retrieve data, to ask questions, and to have those questions answered in a timely fashion. Typically, organizations with big data platforms have become kind of jaded. Those platforms are rife with ETL jobs, schema-on-write jobs in which you pick the data up, you move it around, you plop it down, and you change it a little bit. You take some of that same data from a different source, you pick it up and move it over here, you plop it down and change it again a little bit, and now you have two truths. Some of them might be true in certain circumstances and some of them might be true in other circumstances.

My whole goal is to try to free our customers from having to manage these brittle and fragile ETL pipelines, and to apply a schema-on-read methodology with contextual bits of information that allow you to fuse data together from multiple sources and basically build a bigger picture as you run the query. Now the question becomes, “Well, all that sounds fantastic, how do you actually do that in practice?” For me, it's separating the model, the entities, attributes, and relationships, from the data storage paradigm, while keeping a strict mapping so that we understand where the data exists. Certainly, that means you have to have a big data platform that's sharded, indexed, and partitioned correctly. But then you have table definitions, entities, and relationships that sit on top of that and point to where the data is, and you make that model human understandable. That is what allows you to have this schema-on-read, assemble-and-fuse kind of methodology on top of the datasets.
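To make that schema-on-read idea a little more concrete, here is a minimal sketch in Python. Everything in it, the file paths, the column names, and the pandas-based approach, is a hypothetical illustration rather than anything prescribed in the conversation: the raw files stay exactly as they landed in the lake, and the logical model (entities, attributes, and the mapping to storage) is applied only when a question is asked.

```python
import pandas as pd

# Logical model: entity -> where the raw data lives and how its raw fields
# map onto the model's attributes (all paths and names here are hypothetical).
MODEL = {
    "customer": {"path": "lake/raw/crm_customers.json",
                 "columns": {"cust_id": "customer_id", "cust_name": "name"}},
    "order":    {"path": "lake/raw/erp_orders.json",
                 "columns": {"order_no": "order_id", "cust": "customer_id", "amt": "amount"}},
}

def read_entity(name: str) -> pd.DataFrame:
    """Schema on read: load the raw file as-is, then rename fields into the
    logical model's attribute names at query time. The raw data is never rewritten."""
    spec = MODEL[name]
    raw = pd.read_json(spec["path"], lines=True)
    return raw.rename(columns=spec["columns"])[list(spec["columns"].values())]

# Fuse two sources into a bigger picture only when the question is asked.
customers = read_entity("customer")
orders = read_entity("order")
spend_per_customer = (orders.merge(customers, on="customer_id")
                            .groupby("name")["amount"].sum())
print(spend_per_customer)
```

The same raw files could be mapped into a different set of entities tomorrow without another ETL pass, which is the flexibility being described here.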

 

Again, ETL, ELT, I think we have had those patterns for quite some time, and there's nothing wrong with what you said, but it's a bit old-fashioned, because a lot of people are now saying, “Do I even need a schema if I'm not actually having humans read it?” If you eventually get to the human side, sure, I need a schema, but if you're using AI on top of your big data, the AI system doesn't necessarily need a schema.

 

It can run through this pattern detection and so forth. So, I would say the big question is still, “What's the business outcome you're trying to achieve?” Then make sure the data you collect is managed properly. What does that mean? Is it secured, is it accessible or not? Then, from a cost perspective, have a hierarchical storage architecture, because data storage is cheap, but if you have lots of data, it's not that cheap anymore. Then be able to say, “Yes, I have the right pattern.” ETL doesn't work in certain scenarios; ELT is more flexible because you can use the extracted data for many purposes, not just one. Once you do ETL, you ultimately have the shape of the data already predefined, which also means you generally have the questions that the shape can answer predefined, whereas with ELT you effectively move into a more flexible model. Then think about AI as part of your equation, to say, “Hey, I know some queries,” which is where Eric's model works, but for some pieces of data I don't even know what the value is, and the AI system can go through it and detect patterns that we haven't seen before.
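As a rough sketch of that last idea, detecting patterns in data that nobody has written questions for yet, here is what an exploratory pass might look like in Python. The file path, the choice of features, and the use of scikit-learn's IsolationForest are all illustrative assumptions; the point is simply that an unsupervised model can be pointed at loosely structured records without first designing a schema around known questions.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical raw event dump from the lake; no curated schema, just
# whatever numeric signals happen to be present in the records.
events = pd.read_json("lake/raw/telemetry_events.json", lines=True)
numeric = events.select_dtypes(include="number").fillna(0)

# Unsupervised model: no predefined questions, just "what looks unusual?"
model = IsolationForest(contamination=0.01, random_state=42)
events["anomaly"] = model.fit_predict(numeric)   # -1 marks outliers

# Surface the oddballs for a human (or a downstream model) to investigate.
print(events[events["anomaly"] == -1].head())
```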

What is ETL and ELT?

 

Normally you have a source system from which you Extract data for analytics, and that's the E. Then there's a Transform that says the source data is in this shape, but my target needs a different one. The Power BI application or the Tableau system needs a certain shape of data so that I can visualize it, for example, so I need to transform that data. And then there is a Load, the L portion, where you load it into the target system, be that a data lake, a visualization tool, or whatever it might be.

 

Now you can say: I Extract, and as part of that pass I Transform and then Load. That means you have a fairly fixed shape. Or you say: hey, I'm going to Extract and just dump it, Load it into the data lake, and then Transform potentially many times from the source data, because for one scenario I might need this shape and for another I might need a different shape. For me, ELT generally wins these days, because storage is ultimately cheap and people want to use data for multiple purposes in the ideal case. The other piece Eric mentioned is the source of truth. In my conversations with clients and customers, I use the example that if you manage a nuclear arsenal, you want to have one source of truth and it needs to be precise, because you don't want to lose those nukes.
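Here is a toy sketch of that difference in Python, with made-up file paths and column names: in the ETL version the shape is decided before anything lands in the target, while in the ELT version the raw extract is loaded once as-is and reshaped as many times as there are consumers.

```python
import pandas as pd

RAW_SOURCE = "source/orders_export.csv"   # hypothetical source extract

# --- ETL: the shape is fixed before anything lands in the target ---------
def etl_for_dashboard(path: str) -> pd.DataFrame:
    raw = pd.read_csv(path)                                  # Extract
    shaped = (raw.groupby("region")["amount"].sum()          # Transform (fixed shape)
                 .reset_index(name="total_sales"))
    shaped.to_parquet("warehouse/sales_by_region.parquet")   # Load
    return shaped

# --- ELT: load the raw extract once, transform per consumer later --------
def elt_load(path: str) -> None:
    pd.read_csv(path).to_parquet("lake/raw/orders.parquet")  # Extract + Load as-is

def shape_for_dashboard() -> pd.DataFrame:
    raw = pd.read_parquet("lake/raw/orders.parquet")         # Transform #1
    return raw.groupby("region")["amount"].sum().reset_index(name="total_sales")

def shape_for_ml_features() -> pd.DataFrame:
    raw = pd.read_parquet("lake/raw/orders.parquet")         # Transform #2, same raw data
    return raw.assign(order_month=pd.to_datetime(raw["order_date"]).dt.month)
```

With the ELT variant, adding a new consumer later is just another transform function over the same raw extract; with the ETL variant it usually means building and maintaining another pipeline.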

 

That is a case where a single source of truth is absolutely necessary, but there's a lot of fuzziness in most businesses anyway, and it's OK not to have the single-source-of-truth model, which is sometimes really hard to reach because there are multiple datasets, and they all have value. It's not about truth, it's about value. Thinking through how you distinguish between value and “truth” is an interesting exercise for the people that model data and think about data.

 

I want to ask you one final question: “Is it possible to go from a place where, congratulations, I find myself in the dark with dark data, back to a place where it's useful, where we're back in happy big data land and we're actually happy, without pulling a rip cord and building a second system? Is it possible to go from good to bad to good again?”

 

It's possible through tooling, practices, discipline and governance. Adding to that, think about the business outcome upfront.

To hear the whole conversation, you can watch the video at the link below.

 

 

 
