How AI is Impacting Data Lakes and Data Governance

As firms accumulate vast amounts of data, the challenge of managing, securing, and optimizing this information becomes more complex. This session will explore how AI is transforming traditional data lakes from passive repositories into intelligent, self-managing systems while enhancing data governance practices. Our expert presenters will discuss how AI-driven solutions improve data security, privacy, and quality while ensuring regulatory compliance through automated monitoring and enforcement. Participants will learn how AI tools facilitate better data management, traceability, and scalability, making it easier to handle both structured and unstructured data.

Educational Objectives

  • Understand how AI enhances data governance through improved security and compliance enforcement.
  • Explore AI's role in ensuring data quality by detecting and addressing inconsistencies in data lakes.
  • Learn how AI automates data categorization and tagging for improved management of unstructured data.
  • Discover how AI-driven solutions automate compliance adherence in real time.
  • Examine how AI improves data lineage tracking, offering clear traceability from data origin to consumption.
  • Discuss AI's ability to scale and manage large datasets, enabling more flexible data governance frameworks.


Transcription:

Oleg Tishkevich (00:10):

I want to introduce everybody to my colleagues and friends here. Haik Sahakyan is the CEO of ARQA, which is an AI company, and Nick Graham is the CTO of Cambridge Investment Research. And we're going to be talking today about how, in various forms and shapes, data can be leveraged to power AI. I guess we'll just be talking through it, which is fine while we're figuring this out. So I want to start with maybe talking to Nick about some of the work that we've done over the last, what, four years with Cambridge Investment Research. In the early days, Cambridge's strategy was thinking about data: how do you solve complex data challenges at a large broker-dealer like Cambridge, with the diverse business models and diverse types of data challenges they have within the organization? You think about the broker-dealer with its various OSJ and super-OSJ models, a large, complex hierarchy, and all the types of data they have to use. And then lay on top of that a multi-custodian experience, doing business with multiple different partners, as well as mixing the RIA side of the house with the broker-dealer side of the house. So there's a lot of complexity that comes with the different data sources and data entitlements being used, and Nick was tasked to create that solution. We were very excited to partner with Cambridge early on, and I'd love to give the mic to Nick to talk a little bit about that experience.

Nick Graham (02:08):

Sure, yeah. As you think about all of our industry experience with third-party products, what the custodians have done with their data, and the various challenges you might have with different types of financial providers, you end up with an ecosystem of different ways to refer to a client: lots of different ways to manage the rep-to-client relationship inside your data, custodial differences through legacy codes that have been used, or embedded data that's there. Trying to massage that into a common form, for something as simple as analytics or something as complex as trying to do something consistent in AI, is and was the challenge we faced, having built a lot of that ourselves. We have a lot of internal data at Cambridge, we run multi-custodial feeds, and we have a lot of history with what our advisors have been doing with this over time.

(03:02):

That alone was one challenge, but there's also the fact that we feed into the data models of third-party products. The vendors often come with their own way or method or approach for what they want to create for you, and coming up with a way to map all of that was like a 1980s revisit of the old integration problem: a lot of spaghetti code, a lot of very fragile relationships between methods of API access. We were looking for a better way to solve that, and Invent gave us a nice way to do it. They gave us a way to pull our data out in a consistent way into a data lake that we could manage, and methods of business logic that we overlaid to do the right mappings for what the third-party product did.

(03:48):

And then, from the ability to draw on that activity, we had a consistent model for looking across that ecosystem of data and doing more with it, either through in-house customizations, which we're doing today, or as a way to feed other partners and products like ARQA that we might be looking to work with as we go into the future. This worked out really well. We solved performance issues. I was able to offload a lot of my ongoing cost of maintaining these interactions with these partners and these data sources. And we were able to reach a level of security over the data governance of this information and the various aspects of what it might represent, whether that's PII data or audit-trail activity for certain business activities. Their platform provided us a method for good control, solid oversight and governance that we could build upon, and a nice on-ramp for how we might look at other third-party opportunities for further integration.

(04:48):

So that's kind of been the journey that we've had. As we look at AI, or look at governance in light of AI considerations, this is another player in our deck in terms of the things we choose to use and how we want to leverage a method of interaction, a method of data control, and a method of data governance. That's the underlying challenge I think you're going to hear as a theme from all of the partners next door to us and some of the speakers today: how do you continue to have flow of data, because no one system can contain it all? How do you have purview over that data? How do you show good controls, good governance, good alignment, and methods of managing who gets to see what information? Having partners like Invent, and having a story around your own governance journey, is really key to doing things with AI, because you'll probably go through evolutions of change and you need nimble ways to handle them. I find that our journey right now has set us up to be very successful. So I'll turn it back.

Oleg Tishkevich (05:52):

Thanks. Thanks. And we got the PowerPoint working, so this is exciting. I'm going to show you some stuff. Very cool. It's great when tech works.

Nick Graham (05:59):

Need a pad for you now.

Oleg Tishkevich (06:00):

Okay, we're software, we're not hardware, so this is challenging. But thank you, Nick; it's definitely been a great partnership, and we really appreciate it over the years. Really, if you think about data as an additional asset of your business: they say data is the new gold, and it provides a significant advantage for advancing your business and thinking strategically about where you want to take it from a growth perspective. And if you think about the growth of your business, what every business owner says is: you've got to focus on what's most important. I'm sure there are people in this room who work for companies that say, well, you know what, we're going to build our own data lake, we're going to do this, we're going to do that, because that's going to create value.

(06:50):

I think the point that it's going to create value is very valid, but building everything yourself may not necessarily be the right path. Back in the day we used to build servers, if you remember. Then this thing called Amazon came out, and you could just put everything in AWS, and it was 10 times cheaper and 10 times more effective. All of these technologies keep evolving: first we talked about APIs, now there are data lakes, there are all these technologies that evolve all the time. So if you're going to try to catch up as an RIA or a broker-dealer and build a bicycle to get from point A to point B, by the time you're done there's already new tech out there. So with our wealth management data lake, which is available for anybody to use, we wanted to create a community.

(07:35):

We wanted to leverage your partners to provide capabilities and unique experiences for your business so that you don't have to build everything custom yourself. That's the essence of the Invent platform. We rely on our partners, like ARQA, like Rema, like Republic, like Redtail, Orion, Black Diamond, and all those other companies and partners that you're using in your practices today. How do you bring all this data together? How do you bring all the experiences together to become more efficient? That's the problem we're solving for our clients. If you think about the standard data extraction, the ETL process, you essentially have maybe hundreds of different connectors that you may have to build. That's a lot of wasted effort. Imagine instead that each connector is built by the provider of that connector and published like a data app on the platform, something you can leverage with a clickable button: a lot more efficiency, a lot less money spent. Then you've solved the very basic problems, and you can focus on the real challenges for your business and your business growth.
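
Here's a minimal sketch of that "data app" connector idea, assuming a plugin-style interface; every name here (Record, Connector, extract) is illustrative, not Invent's actual API:

```python
# Each provider ships one Connector instead of every firm writing its own ETL.
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass
class Record:
    source: str   # e.g. "orion", "redtail"
    entity: str   # e.g. "account", "position", "client"
    payload: dict # raw provider fields, mapped later by business rules


class Connector(Protocol):
    def extract(self, since: str) -> Iterator[Record]:
        """Yield records changed since an ISO-8601 timestamp."""
        ...


def load(connectors: list[Connector], since: str) -> list[Record]:
    # The platform runs every published connector the firm has enabled.
    return [rec for c in connectors for rec in c.extract(since)]
```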

(08:44):

Data governance, we talked about that a little bit. It's also important to think about master data management: how do you create a master record, the golden record so to speak, that brings together data from multiple sources? Think about it: I have data in my CRM, data in my performance reporting software and my trading systems and at my custodian, and that data may not all line up. So what we do with Invent is actually use AI to prepare the data for AI companies like ARQA. We built AI systems ourselves to process the data, with capabilities to determine whether there's sensitive data in the system, tag those PII types of data automatically, and, if need be, strip them out before the data is sent to a vendor or partner, or if you're going through an M&A and you want to intake a partner's data without seeing their client names and all that kind of stuff.
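
A minimal sketch of that tagging-and-masking step, with simple regex heuristics standing in for the AI classifier described above; field names and patterns are illustrative:

```python
import re

PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def tag_pii(record: dict) -> dict[str, list[str]]:
    """Return {field: [pii_types]} for every field that looks sensitive.
    A real classifier would also catch names, addresses, etc."""
    tags: dict[str, list[str]] = {}
    for field, value in record.items():
        hits = [name for name, pat in PII_PATTERNS.items()
                if isinstance(value, str) and pat.search(value)]
        if hits:
            tags[field] = hits
    return tags


def mask(record: dict, tags: dict[str, list[str]]) -> dict:
    """Strip tagged fields before a record leaves the lake, e.g. to an M&A partner."""
    return {f: ("***REDACTED***" if f in tags else v) for f, v in record.items()}


row = {"name": "Jane Doe", "ssn": "123-45-6789", "balance": 250_000}
masked = mask(row, tag_pii(row))  # ssn is redacted; balance passes through
```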

(09:48):

That's also a big use case. And then it's very important to be able to audit-trail all this data, because with compliance and the more stringent rules coming out from the SEC and FINRA, how are you able not only to see your data just in time, but to manage it across history? If you attempt to build something like this, and I know this is a really complex picture, but I just want to show you the level of complexity of what you actually have to do, it's way better to let professionals do it than to try to create it yourself. I've heard of firms building a data lake that's really just a fancy database. That's great, but it doesn't let you scale. It may work for a small firm, but as you think about growth, as you think about scale, you need enterprise-grade systems protecting your data and managing your data consistently.
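
A minimal sketch of one way to keep that history: an append-only, hash-chained audit trail that can be replayed as of any date. The schema and the hash-chaining choice are assumptions for illustration, not a description of Invent's implementation:

```python
import datetime as dt
import hashlib
import json


def append_event(log: list[dict], source: str, action: str, payload: dict) -> dict:
    """Record who/what/when, chained by hash so tampering is detectable."""
    prev = log[-1]["hash"] if log else ""
    event = {
        "ts": dt.datetime.now(dt.timezone.utc).isoformat(),
        "source": source,   # e.g. "assetmark_feed"
        "action": action,   # e.g. "ingest", "mask", "export"
        "payload": payload,
        "prev": prev,
    }
    event["hash"] = hashlib.sha256(
        json.dumps(event, sort_keys=True).encode()
    ).hexdigest()
    log.append(event)
    return event


def as_of(log: list[dict], cutoff: str) -> list[dict]:
    """Replay the trail to answer a regulator's 'what did you know, and when?'"""
    return [e for e in log if e["ts"] <= cutoff]
```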

(10:48):

And then if there are issues, and there are always issues, as anybody here who has dealt with data knows, you want the ability to have AI automatically detect those problems and say: I was getting this information for the last two days, but now it's missing or different. No validation rule will catch that, but AI will watch the flow of data and how it changes over time, and flag those inconsistencies from a risk and compliance perspective: something's happening with the data coming in. I'm not going to go through the entire flow here, but you can see there are a lot of different providers on the left. You're going to have to bring all this data in, process it so it can be used by other systems in the vault, and then create custom business rules to get the data across to the different data marts and applications you want to use for clients, advisors, and the home office, as well as for integration.
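
A minimal sketch of that feed-watching idea: rather than per-row validation rules, compare today's feed statistics against a trailing baseline and flag drift. The statistic (row count) and threshold are illustrative:

```python
import statistics


def feed_anomalies(history: list[int], today: int, z_cutoff: float = 3.0) -> list[str]:
    """history: recent daily row counts for a feed; today: count just received."""
    flags = []
    if today == 0:
        flags.append("feed missing entirely")
    if len(history) >= 7:
        mean = statistics.mean(history)
        stdev = statistics.stdev(history) or 1.0  # guard against zero variance
        z = abs(today - mean) / stdev
        if z > z_cutoff:
            flags.append(f"row count {today} is {z:.1f} sigma from trailing mean {mean:.0f}")
    return flags


# A feed that normally delivers ~10k rows suddenly delivers 1.2k:
print(feed_anomalies([10_050, 9_980, 10_120, 10_000, 9_900, 10_200, 10_060], 1_200))
```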

(11:49):

So some of these challenges, as I mentioned, involve things like fuzzy matching. Say you have data coming from Orion that also includes data from AssetMark, and you have another source coming in directly, so now you have direct data from AssetMark too. How do you determine which files and which accounts overlap, and which don't? As you start getting into more complex data challenges, data merging based on specific rules is not enough; you really need the power of AI to figure that out. That's something we also provide on the Invent platform. And like I said, classification of data, masking of data, and anonymization are other big capabilities of your data lake that you may want to use for various purposes: M&A, exposing data to partners, securing data within your data lake, et cetera.
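
A minimal sketch of such a fuzzy-matching pass, using only the standard library's difflib; the fields, weights, and threshold are illustrative, and a production matcher would blend many more signals:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_accounts(orion_accts: list[dict], direct_accts: list[dict],
                   threshold: float = 0.85) -> list[tuple[dict, dict, float]]:
    """Pair up likely-duplicate accounts across the two sources."""
    matches = []
    for o in orion_accts:
        for d in direct_accts:
            # Blend owner-name similarity with an exact check on the account
            # number's last four digits.
            score = similarity(o["owner"], d["owner"])
            if o["acct_last4"] == d["acct_last4"]:
                score = min(1.0, score + 0.15)
            if score >= threshold:
                matches.append((o, d, score))
    return matches


# A typo'd owner name still matches because the account digits agree:
pairs = match_accounts(
    [{"owner": "Jonathan Smith", "acct_last4": "4821"}],
    [{"owner": "Jonathon Smith", "acct_last4": "4821"}],
)
```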

(12:57):

And just to think outside the box a little here: it's not just about data, it's really data plus the experiences around your data. Your workflows, like account opening, account servicing, working with clients through CRM and note-taking, et cetera: all of those processes need to flow into your comprehensive system, and data is the base of it. I've seen firms try to build this themselves using Snowflake, MuleSoft, Salesforce Community Cloud, and something like Tableau. That stack becomes very expensive very quickly, and it's very challenging to manage because those are all very different systems that don't really talk to each other. So on Invent we built out a stack specific to wealth management that doesn't use any of those pieces but is fully integrated with them. The big advantage here: with all of these big data lakes, you're going to find out in two years, once you start running queries, about something called compute. That's a very interesting charge that you'll get.

(14:12):

The more data you have, the bigger the compute charges are going to be for any queries against your data. That's something we've also seen clients assess and say, that's a bit too excessive, whereas within Invent it's a flat structure, which makes it much simpler. I want to jump through these; here are some of the stages of how you curate the data, if you want to take a snapshot of this. Essentially you start with descriptive analytics, then diagnostic, to see what's happening with the data and with different workflows and systems. Then you go into predictive analytics, and then prescriptive analytics. Those are the different stages of what you can accomplish with a great data platform and AI on top of it. And as an example, we actually want to show a demo. I want to give the mic to Haik to talk a little bit about how they've used data, how they've leveraged that capability with some really powerful AI. So, if you want to.

Haik Sahakyan (15:24):

Awesome. I'll talk to all of that a little bit. What we realized early on, because I've been building technology in this space for a while, is that although AI has a lot of critics saying it's not ready for something like this, artificial narrow intelligence can do a lot more. And one of the best things it does is data analysis. So what you're seeing on screen is us sitting on top of the Invent data lake, where you can easily query, just like using GPT, on top of your own portfolio data. The way things come through for us is that if a new data set comes in, our AI can pick it up: the headers, and anything that's missing as far as data. We'll pick that up and say, hey, you have unclassified things here. And we're actually in the process, which you'll see in the demo tomorrow with Oleg, of building a notifications engine as well, with market data on top of it.
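
A minimal sketch of that new-data-set check: read an incoming file's headers, map the ones a known-header dictionary recognizes, and flag the rest as unclassified. The header map and the notify hook are hypothetical:

```python
import csv

KNOWN_HEADERS = {
    "acct_num": "account_number",
    "cusip": "security_id",
    "mv": "market_value",
    "qty": "quantity",
}


def classify_headers(path: str) -> tuple[dict[str, str], list[str]]:
    """Return (mapped headers, unclassified headers) for a delimited file."""
    with open(path, newline="") as f:
        headers = next(csv.reader(f))
    mapped = {h: KNOWN_HEADERS[h.lower()] for h in headers if h.lower() in KNOWN_HEADERS}
    unclassified = [h for h in headers if h.lower() not in KNOWN_HEADERS]
    return mapped, unclassified


# A feed adds a surprise column:
# mapped, unknown = classify_headers("assetmark_positions.csv")
# if unknown: notify(f"Unclassified headers in feed: {unknown}")
```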

(16:11):

So really, when you have all this data coming in, what's the real next step? It's getting the insights, the analytics, the monitoring of that data, to make it valuable to you. And over time, as more data starts coming in, you're going to start wondering: what if I missed something because I didn't go in at that time and pull that report? Or what if I didn't generate that view at that certain time? So realistically, when we stepped back and looked at this scenario, we decided we want to be the Nest of portfolio management with ARQA, where it sits on top of everything, monitors it, and then comes back to you and says: these are the things you should worry about. What you're seeing on screen now is our course script product, which parses data out of any type of document you'd like, as long as it's within the financial world. You could put in a doctor's note and it'll read it, but obviously that's not valuable to anyone. For us, we look at valuations, capital calls, distributions. The templates change, and there are providers today that use templates to read them. We instead see everything on that document as a data point, and we're able to extract what you care about. The most recent one we've done is full brokerage statements, where we've gone in and pulled out equities, alternatives, cash positions, derivatives, everything. We get our hands on these documents, pull the data out, and can place it into the data lake.
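
A minimal sketch of template-free extraction in that spirit: hand the raw statement text to a model and validate the structured output before it lands in the lake. llm_complete is a hypothetical stand-in for whatever model client is used; this is not ARQA's actual pipeline:

```python
import json

FIELDS = ["security", "asset_class", "quantity", "market_value"]

PROMPT = (
    "From the brokerage statement text below, extract every position as a JSON "
    f"list of objects with keys {FIELDS}. Text:\n\n"
)


def parse_statement(text: str, llm_complete) -> list[dict]:
    """llm_complete: any callable that takes a prompt string and returns text."""
    raw = llm_complete(PROMPT + text)
    positions = json.loads(raw)
    # Validate the shape before anything lands in the data lake.
    return [p for p in positions if set(FIELDS) <= p.keys()]
```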

Oleg Tishkevich (17:31):

That's awesome. That's awesome. So I also wanted to ask Nick a question. As we think about all these different technologies, I hear a couple of different approaches firms take. Some go best-of-breed and say, hey, we're only going to support three different tools, that's all we're going to do, primarily because of the support, right? It's really difficult to do. But from an advisor perspective, everybody wants to be able to see things and have a choice. I come from an old country where we had one type of choice. When I came to America, I discovered there's more than one brand of yogurt; that's the picture I was taking to send to my relatives back home, like, can you believe that? And I like it. So what approach do you take at Cambridge in terms of integrations with different partners? Are you best-of-breed, or do you more provide a choice to your advisors?

Nick Graham (18:32):

If you're not familiar with Cambridge's story, or the values we try to project in our approach to our clients: we try to support flexibility. A lot of our advisors, whether they're a breakaway, a new ensemble that has assembled itself, or a niche play serving a particular type of community, have a different technology need, a different service model, a different demand. As we've seen concentrations of that, we've tried to pay attention. We take more of a dedicated approach to the related products that seem to gravitate to a given community: we build more direct integrations, and we do more training of our internal staff. We call that our "best" approach, where we're both broad and deep in our knowledge of how the integrations and the services are provided, and we target and track those partners throughout the life cycle of use and systems.

(19:29):

We have others out there that have their own niche but maybe aren't as strong, or they're a new emerging player we need to pay special attention to. And that's where the challenge for me and the teams I lead comes in: you take a systems-engineering approach where you can plug and play, replace providers, and run more than one technology stack, or a combination of them, as you go forward. But the underlying challenge is usually the data. Just to stay on topic for our presentation today: I have to be able to feed that system in some fashion, and I have to be able to pull that data back in, for what we do from our regulatory responsibility, or just for the analytics we perform in terms of how we manage the business under our shingle as the introducing BD.

(20:17):

That is probably where we throw around the terms good, better, best. That's a category of conversation we have with an advisor transitioning to us. Our "best" story is where we have it all fully integrated, we have familiarity, and we have training around these partners and the integrations involved. We can have a "good" story where we can still support you and still deliver that data and a level of support, but it may not have the same fidelity, the same level of direct integration, or bi-directional data feeds. That's where having good governance around your API and integration story, along with the data that goes with those business flows, is very, very key to any design you might have as you look at new investments you might be making. But that's how Cambridge does it.

Oleg Tishkevich (21:06):

That's great. Well, we're just 10 seconds over time; I think this is great. I don't know if anybody has any questions, or if we have a minute or two for that. No? No questions. Excellent. Thank you so much. I appreciate it, and hopefully we'll see you. Bye.