#13 Illogical Obsession with Logic
on Wednesday, August 12, 2020
Thought leader Andrew Padilla of Datacequia envisions new developments in data management and collaboration that would allow data to advance as software has in visibility, accessibility, usability, and portability. He outlines how a composable infrastructure would address the concerns of both software engineers and data scientists.
Over the last ten years, software professionals have made great strides in making software visible, accessible, usable, and portable, but the data side has not advanced as much. This is something that both software and data professionals should think about. What are the concerns of each side? What can we learn from each other? Currently, the two sides are diametrically opposed in many respects.
On the software engineering side, business logic is the main concern. To keep interfaces consistent, engineers hide the details, and data is seen as an output. Conversely, data scientists are more concerned with the context of these outputs and data constructs: the metadata. For example, the lineage of the data matters to a data scientist who wants to see how things change over time, whereas a software engineer would try to hide these details to avoid problems such as variability and bugs.
Treating the development of data and metadata as its own discipline, and not in the context of how we currently do software, could help the industry grow. In other words, we need to look at building a composable infrastructure that takes into account the concerns of both sides.
The way enterprises currently deal with metadata illustrates this idea. Today, we have centralized metadata management systems. We want to know about all our data: who, where, why, and how. Capturing what people are doing and pushing it into a centralized system is very much a software way to do things. If we let data grow in its own right, we could adopt what software developers do, but in a data context. We could build a larger ecosystem if, instead of so much private endeavor on both sides, we put it all into a synchronized and centralized repository and allowed data developers to develop as software engineers do. We could build and curate as private endeavors, but then share those curations with others. Similar to the shift in configuration management in the 1990s and 2000s from one centralized place to a more distributed system, sharing could happen more easily and fluidly.
The digital knowledge that we have is dependent on both the logic and the data. The software and data share the same primitives, and as we move up the stack to knowledge, there are strong relationships between the two. Where they diverge is that we have the infrastructure and the tooling to build out the software side, deploy it, and make it visible, accessible, and usable. Since we don’t think of the data side in the same way, we are limited in these areas. For example, in the old way of working, data scientists do great analytics, AI, and machine learning on their data and extract valuable information, but without a repeatable mechanism, the use of those results is limited.
Once this perspective is recognized by both the data and software communities, we can take a different approach, applying the successes of software development to data. Instead of carrying over the personal experiences and prejudices of software wholesale, we can look at how data operates: how it is similar, but has its own concerns. An analogy would be taking your family to Disneyland. The experiences and interactions there represent software’s dynamic. You could draw the experiences toward the data side of things. But imagine in the real world if you had to check that experience at the door, and when you went home, you knew nothing about it. That is the problem. Every time we go to a different system or different ecosystem, we reinvent a new world and are unaware of the other worlds we had to leave at the door. If we were able to share the experience or take it with us, we would have a much more vibrant knowledge infrastructure. The next time you visit Disneyland, based on your experience, you know what time of day is best, how to check wait times, and so on. Without the experience of working with the data in conjunction with the application, it’s almost like starting from scratch every time.
A real example is the amnesia going on in systems such as those in health care, where a data professional creates an integration from scratch, then a few years later someone else has to do the same thing. If we can form better relationships with the data through mapping, reusability and efficiency increase. For instance, why do we have so many notions of a person in terms of modeling? Of course context matters, but why can’t we see the different variants of a person and then be able to map them? In the VA healthcare system, they have several systems where a patient means something different in each one. Mapping would provide a common ground, but allow for change depending on context, as long as the mapping operation was visible. Then we could move forward with different types of use cases and reusability.
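The idea of a visible mapping between variants of a person can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source systems ("scheduling" and "billing"), their field names, and the `CommonPerson` shape are all hypothetical and do not describe any real healthcare system.

```python
from dataclasses import dataclass
from typing import Callable, Dict


# Hypothetical common representation of a person (illustrative fields only).
@dataclass(frozen=True)
class CommonPerson:
    full_name: str
    birth_date: str  # ISO 8601 date string


# A mapping carries a human-readable description next to the function,
# so the mapping operation itself stays visible and reviewable.
@dataclass(frozen=True)
class Mapping:
    source: str
    description: str
    apply: Callable[[dict], CommonPerson]


# Hypothetical source systems, each with its own notion of a person.
MAPPINGS: Dict[str, Mapping] = {
    "scheduling": Mapping(
        source="scheduling",
        description="full_name <- first + ' ' + last; birth_date <- dob",
        apply=lambda r: CommonPerson(f"{r['first']} {r['last']}", r["dob"]),
    ),
    "billing": Mapping(
        source="billing",
        description="full_name <- patient_name; birth_date <- birth_date",
        apply=lambda r: CommonPerson(r["patient_name"], r["birth_date"]),
    ),
}


def to_common(source: str, record: dict) -> CommonPerson:
    """Translate a source-specific record into the common representation."""
    return MAPPINGS[source].apply(record)


a = to_common("scheduling", {"first": "Ana", "last": "Ruiz", "dob": "1980-02-03"})
b = to_common("billing", {"patient_name": "Ana Ruiz", "birth_date": "1980-02-03"})
print(a == b)  # both variants resolve to the same common person
for m in MAPPINGS.values():
    print(m.source, ":", m.description)
```

Each system keeps its own model for its own context, while the registry of mappings is what a later integrator could inspect and reuse instead of starting from scratch.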
One big gap is that we have achieved effective operationalization of logic with Kubernetes (K8s), but an equivalent service does not exist for data. Although S3 serves as a stopgap, it is not the answer. There is a great need across all industries for a service similar to K8s that addresses data. Collaboration is important here. Of course data professionals want to add value to their organization, but part of that value depends on a commonality.
The company name Datacequia is based on the acequias in New Mexico, which serve as an example of the type of collaboration that is needed for data. New Mexico is an arid region, so years ago, to make the land fertile, the people built a series of irrigation ditches called acequias. Nobody owned them, but the people built, managed, and maintained them out of necessity and for the common good. In the data world, the inhospitable environment is typically budgets and data ownership, but a more community-based approach to data curation would be beneficial for everyone, just as the acequias benefitted all.
Imagine if data professionals could fork off a data set in any central repository. They could manage it and evolve it for their own needs. If there is a change in the central repository that is managed by a standards body, they could incorporate those changes immediately, or choose not to. In any case, they have the lineage back to the original source. Today, when we use an asset that falls outside of the enterprise, we make a copy of it that stands still in time. That requires manual tracking and management of the updates. With a central repository, all could co-create, collaborate, and create communities with common foundations and visible lineage.
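A minimal sketch of what forking a data set with lineage might look like, written in Python. Everything here is a hypothetical illustration: the `Dataset` class, its `fork`, `pull`, and `lineage` methods, and the sample code table are assumptions made for this example, not an existing tool or Datacequia's design. The merge rule is deliberately simple (keep local edits, take upstream values everywhere else) and ignores deletions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Dataset:
    name: str
    rows: Dict[str, str]
    parent: Optional["Dataset"] = None
    # Snapshot of the parent's rows at the last fork or pull.
    base: Dict[str, str] = field(default_factory=dict)

    def fork(self, name: str) -> "Dataset":
        """Create an independent copy that remembers where it came from."""
        return Dataset(name, dict(self.rows), parent=self, base=dict(self.rows))

    def pull(self) -> None:
        """Optionally incorporate upstream changes, preserving local edits."""
        if self.parent is None:
            return
        upstream = self.parent.rows
        for key, value in upstream.items():
            # Take the upstream value only if this key was not changed locally.
            if self.rows.get(key) == self.base.get(key):
                self.rows[key] = value
        self.base = dict(upstream)

    def lineage(self) -> List[str]:
        """Names from this data set back to the original source."""
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain


# Hypothetical code table maintained by a standards body.
standard = Dataset("standard-codes", {"A01": "cholera", "B02": "zoster"})
mine = standard.fork("my-codes")

mine.rows["B02"] = "shingles"        # local curation
standard.rows["A01"] = "Cholera"     # upstream correction
standard.rows["C03"] = "new entry"   # upstream addition

mine.pull()                          # the fork owner chooses when to do this
print(mine.rows)
print(mine.lineage())
```

Calling `pull` is a choice made by the fork's owner, which mirrors the option to incorporate changes immediately or not at all, and `lineage` always leads back to the original source rather than to a copy frozen in time.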
This is just the tip of the iceberg on what is a fundamental change in industry to make data more valuable for your organization. For more information about Andrew Padilla and Datacequia, visit datacequia.com